11-24-2010, 02:45 AM | #1 |
Zealot
Posts: 122
Karma: 164
Join Date: Aug 2010
Location: Old Ynysybwl
Device: Sony PRS-300
|
macro - Search and Replace
I have a workflow which is serving me well. I use Calibre to organise my books and I have developed a series of processes by which I automate my conversions - mainly from PDF -> RTF -> epub. I do scan some materials and ABBYY takes care of the OCR.
I have a number of search and replace functions for Word 2010 which I have recorded into macros, and they serve me well. But there is one aspect I have yet to crack, and hopefully someone here can offer a suggestion. I cannot create a search for a line that does not end with a full stop followed by ^p^p. I can detect when lines commence with a lower case letter and rejoin them to the previous paragraphs. But some line breaks are occurring within capitalised phrases. The defining feature is the lack of a full stop before the two paragraph marks. I want to test for that condition and replace the two paragraph marks with a single space. Any ideas how this can be achieved in Word? |
11-24-2010, 09:55 AM | #2 |
Enthusiast
Posts: 29
Karma: 22
Join Date: Oct 2010
Location: London
Device: Kindle, iPad, iPhone 4, HTC Desire
|
I understand that you are looking for a solution within word but this would be very easy if you can open the book in Dreamweaver and use regex there
--- stunjelly.com ebook formatting and repairs |
11-24-2010, 11:04 AM | #3 | |
Zealot
Posts: 122
Karma: 164
Join Date: Aug 2010
Location: Old Ynysybwl
Device: Sony PRS-300
|
Quote:
Last edited by oldbwl; 11-24-2010 at 11:06 AM. |
|
11-25-2010, 12:54 AM | #4 |
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
|
Any decent text editor? Vim? Emacs? Textmate? Notepad++? They're all free.
You could use any scripting language with regex capabilities: perl, python, sed, awk, lua, etc., probably with a one-liner. For a text file (the -0777 switch makes perl read the whole file at once, so the pattern can span line breaks):
Code:
perl -0777 -pe 's/([^.])\n\n/$1 /g' filename > new-filename
I can't stand Windows. I can't stand Word. So I'm afraid I can't help you do it there. |
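For anyone who would rather use Python from the list of scripting languages above, the same idea can be sketched like this. It is only a rough sketch, not the poster's code: the function name `join_broken_paragraphs` is made up for illustration, and it assumes the text uses plain \n line endings with an empty line (\n\n) wherever a paragraph break was exported.

```python
import re

def join_broken_paragraphs(text: str) -> str:
    """Turn a blank-line break into a space unless a full stop comes before it."""
    # ([^.]) captures the character before the break so it can be put back;
    # without the capture group that character would be deleted.
    return re.sub(r"([^.])\n\n", r"\1 ", text)

sample = "The Old\n\nCurate went home.\n\nNext day he left."
print(join_broken_paragraphs(sample))
```

The break after "Old" is joined with a space, while the break after "home." is kept because a full stop precedes it. Paragraphs that end in a quote mark or question mark would still be joined, so the character class may need extending for novels.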
11-25-2010, 08:24 AM | #5 | |
Zealot
Posts: 122
Karma: 164
Join Date: Aug 2010
Location: Old Ynysybwl
Device: Sony PRS-300
|
Quote:
|
|
01-22-2011, 11:08 AM | #6 |
Member
Posts: 17
Karma: 7700
Join Date: Jan 2011
Device: kindle
|
I also work in Word. I don't do enough conversion to justify learning another program and regex, although I sympathize with the aversion to MS Word.
When I do my conversions pdf > htm > mobi, I record my macros and use them only in that one document... cumbersome! So, if you can, please send me the Word macros you've developed. |
01-30-2011, 04:34 AM | #7 |
Connoisseur
Posts: 75
Karma: 204999
Join Date: Aug 2006
Location: London
|
In Word, if regular expressions are on, .^13^13 will find full stop followed by two paragraph marks and [!.]^13^13 will find (anything but a full stop) followed by 2 paragraph marks.
bob |
02-04-2011, 11:04 AM | #8 |
Member
Posts: 17
Karma: 7700
Join Date: Jan 2011
Device: kindle
|
When you say "regular expressions on", you mean to enable the Use of Wildcards? Right?
What about when you have a hard coded page number between 2 paragraph marks without full stop? How can you join the paragraphs and delete the page number? |
02-09-2011, 08:20 AM | #9 |
Readaholic
Posts: 255
Karma: 1058454
Join Date: Jul 2009
Location: Swindon, UK
Device: Sony PRS-T2 (previously 505 and 650)
|
This is a bit of a kludgy, 3 step solution, but it should work and it uses Word's search and replace rather than Regex or a scripting language.
1) Replace all double paragraphs with some nonsense text, e.g. Replace ^p^p with "zxzxqwqw".
2) Replace all full stops followed by your nonsense text with a full stop followed by a double paragraph, e.g. Replace ".zxzxqwqw" with .^p^p
3) Replace all remaining nonsense text with a space, e.g. Replace "zxzxqwqw" with " "
I've tried it on a couple of short samples and it seems to work OK, but be sure to take a backup before trying it on a large scale!
Ian |
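The three steps above need no regex at all, which a plain-text sketch in Python makes easy to see. This is an illustration under assumptions rather than anything posted in the thread: the function name `rejoin_with_placeholder` is invented here, \n\n stands in for Word's ^p^p, and the placeholder is assumed not to occur anywhere in the book (the sketch checks for that first).

```python
PLACEHOLDER = "zxzxqwqw"  # nonsense marker from the post; must not occur in the book

def rejoin_with_placeholder(text: str) -> str:
    """Three plain replacements: mark every break, restore real ones, join the rest."""
    if PLACEHOLDER in text:
        raise ValueError("placeholder already occurs in the text; pick another one")
    text = text.replace("\n\n", PLACEHOLDER)         # step 1: mark every double break
    text = text.replace("." + PLACEHOLDER, ".\n\n")  # step 2: restore breaks after a full stop
    return text.replace(PLACEHOLDER, " ")            # step 3: join whatever is left

print(rejoin_with_placeholder("A Tale of\n\nTwo Cities.\n\nChapter one"))
```

The order matters: step 2 has to run before step 3, otherwise the genuine paragraph breaks are turned into spaces along with the false ones.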
02-10-2011, 03:56 AM | #10 | |
Connoisseur
Posts: 75
Karma: 204999
Join Date: Aug 2006
Location: London
|
Quote:
For the case above you could search for ^13[0-9]*^13 and put either ^p, ^p^p or a space character in the replace line, depending on how you want the spacing.
bob
Last edited by comtrjl; 02-10-2011 at 09:35 AM. |
|
02-24-2011, 02:42 PM | #11 |
Connoisseur
Posts: 61
Karma: 12096
Join Date: Sep 2010
Location: Tasmania
Device: Sony PRS 650
|
Couple of points:
It's not clear what sort of books you're dealing with, but if they are novels you may have punctuation other than the full stop occurring at the ends of paragraphs, for example double or single quotes, or maybe a question mark. Also, you may have paragraphs that end with no punctuation but that you want to retain as paragraphs, such as titles, chapter numbers, sub-headings, a quote, poem verses or the name of a quote's author.
The code given by comtrjl is for Regex, which you are probably not using in Word (it is possible, but you do it inside a VBA script). comtrjl's Find code is ^13[0-9]*^13, which in Word using wildcards would find a single digit followed by anything until it finds a paragraph marker. You need ^13[0-9]@^13, which finds one or more digits bounded by paragraph markers.
If the page numbers include a space, like this '2 3' for 23, then your Find code should be ^13[0-9,_]@^13 where the underline indicates where you put a space. If the word 'Page' or 'page' is included, the Find code is ^13[Pp]age [0-9]@^13, or if it's of the form 'Page 123 of 360' you use [Pp]age [0-9]@ of [0-9, ]@^13.
Make sure you click on 'More' in the Find/Replace box and that you enable Wildcards. |
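For comparison, here is roughly how the page-number patterns above might look as ordinary regular expressions in Python. Treat it as a sketch under assumptions: `PAGE_MARKERS` and `strip_page_numbers` are names invented for this example, \n stands in for Word's ^13, Word's @ is written as +, and each page marker is assumed to sit alone on its own line between two line breaks.

```python
import re

# Rough regex counterparts of the Word wildcard patterns (most specific first).
PAGE_MARKERS = [
    r"\n[Pp]age [0-9]+ of [0-9 ]+\n",  # "Page 123 of 360"
    r"\n[Pp]age [0-9]+\n",             # "Page 12"
    r"\n[0-9][0-9 ]*\n",               # "23" or "2 3"; must start with a digit
]

def strip_page_numbers(text: str, replacement: str = " ") -> str:
    """Remove stand-alone page markers and put `replacement` in their place."""
    for pattern in PAGE_MARKERS:
        text = re.sub(pattern, replacement, text)
    return text

print(strip_page_numbers("walked to\nPage 12 of 360\nthe door"))
```

Passing `replacement="\n\n"` instead of a space keeps a paragraph break where the page number was, which mirrors the choice between a space and ^p in Word's Replace box.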
02-25-2011, 12:54 PM | #12 |
Member
Posts: 17
Karma: 7700
Join Date: Jan 2011
Device: kindle
|
All of these pointers and suggestions help.
Thanks a million... |
02-25-2011, 07:53 PM | #13 |
Connoisseur
Posts: 61
Karma: 12096
Join Date: Sep 2010
Location: Tasmania
Device: Sony PRS 650
|
Just one more suggestion for now - but it is a really beaut fixer.
After you've copied the PDF and pasted into Word, notice that some of the font information is carried through. So before you attempt any changes, select a bit of the text you want to remove and see what it says at the top of Word as to the font name and its size. Often you'll find that the font used for Author - Page, Title - Page, Page Number is different to that used in the body of the text. Consider such things as font name, font size, bold, italics or even a unique Style. Compare with the text you do not want to lose. If there is a difference you're laughing!
If this is the case go to Find Replace. Click in the Find box but don't enter anything. Click 'More'. Click 'Format'. Select 'Font' and choose the attributes of what you want to get rid of. Leave 'Replace with' empty unless you need to replace with a space or paragraph marker (it varies from one document to another). Try a Find Next then a Replace to check it's working before clicking Replace All.
One thing not pointed out above in the thread was why ^13 is used rather than ^p in the Find box. The reason is that you cannot use ^p when you select 'Use wildcards'; however, you can and should use ^p anytime in the Replace box. Also NEVER use ^13 in the Replace box. There it means the ASCII character 13 and doesn't include all the hidden info about the 'paragraph'. In fact the preceding text is then not separated as a paragraph from the succeeding text.
One final tip. Notice the down arrows at the right side of both the Find and Replace boxes. That's your source of Find and Replace expressions you've already used (same in Sigil), so you don't need to type them again.
Last edited by Faster; 03-04-2011 at 02:25 PM. |
02-26-2011, 08:28 AM | #14 |
eBook pro
Posts: 71
Karma: 5634
Join Date: Jan 2011
Location: Hertford, UK
Device: PC, iPad, Kindle, Kindle Fire, Galaxy Ace
|
I have had the same problem and solved it by recording a macro and using search & replace stepping through the alphabet one letter at a time i.e. a^p^p, b^p^p, c^p^p.....
Not the most elegant of solutions and a bit time consuming to get right but once set up it does work. |
02-26-2011, 09:59 AM | #15 |
Connoisseur
Posts: 61
Karma: 12096
Join Date: Sep 2010
Location: Tasmania
Device: Sony PRS 650
|
Dillinquent: why not use [a-z]^13^13 for your Find criteria with Wildcards enabled? You only search through once this way, rather than 26 times.
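One detail worth spelling out with a character range: the matched letter has to be captured and put back, or the replacement swallows it. A minimal Python sketch of that, assuming \n\n stands in for the two paragraph marks (the function name `join_after_lowercase` is made up for this example):

```python
import re

def join_after_lowercase(text: str) -> str:
    """Join a double break to the next line when a lower-case letter precedes it."""
    # ([a-z]) remembers the letter; \1 writes it back in front of the space.
    return re.sub(r"([a-z])\n\n", r"\1 ", text)

print(join_after_lowercase("the quick\n\nbrown fox.\n\nThe End"))
```

Only the break after "quick" is joined; the one after "fox." is left alone because a full stop, not a letter, comes before it.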
|
|