Old 11-24-2010, 02:45 AM   #1
oldbwl
Zealot
Posts: 122
Karma: 164
Join Date: Aug 2010
Location: Old Ynysybwl
Device: Sony PRS-300
macro - Search and Replace

I have a workflow which is serving me well. I use Calibre to organise my books and I have developed a series of processes by which I automate my conversions - mainly from PDF -> RTF -> EPUB. I do scan some materials and ABBYY takes care of the OCR.

I have a number of search and replace functions for Word 2010 which I have recorded into macros and they serve me well. But there is one aspect I have yet to crack and hopefully someone here can offer a suggestion.

I cannot create a search for when a line does not end with a full stop followed by ^p^p.

I can detect when lines commence with a lower case letter and rejoin them to the previous paragraphs. But some line breaks are occurring within capitalised phrases. The defining feature is the lack of a full stop before the two paragraph marks. I want to test for that condition and replace the two paragraph marks with a single space. Any ideas how this can be achieved in Word?
Old 11-24-2010, 09:55 AM   #2
KLUTCH
Enthusiast
Posts: 29
Karma: 22
Join Date: Oct 2010
Location: London
Device: Kindle, iPad, iPhone 4, HTC Desire
I understand that you are looking for a solution within Word, but this would be very easy if you can open the book in Dreamweaver and use regex there.

---
stunjelly.com
ebook formatting and repairs
Old 11-24-2010, 11:04 AM   #3
oldbwl
Quote:
Originally Posted by KLUTCH View Post
I understand that you are looking for a solution within Word, but this would be very easy if you can open the book in Dreamweaver and use regex there.

LOL, at the cost of Dreamweaver these days, that is no more than a forlorn hope: £345 standalone, unless you can get a student or EDU version, which I can't.

Last edited by oldbwl; 11-24-2010 at 11:06 AM.
Old 11-25-2010, 12:54 AM   #4
frabjous
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
Any decent text editor? Vim? Emacs? Textmate? Notepad++? They're all free.

You could use any scripting language with regex capabilities: perl, python, sed, awk, lua, etc., probably with a one-liner.

For a text file:

Code:
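perl -0777 -pe 's/([^.])\n\n/$1 /g' filename > new-filename
would replace every double line-break following anything but a full stop with a single space, if I'm not mistaken. The -0777 switch makes perl read the whole file in one go; without it, -p works through the file one line at a time and a pattern containing \n\n never gets a chance to match.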
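
For anyone who would rather use Python than Perl, the same idea can be sketched roughly as below. This is only an illustration and not something posted in the thread: the function name join_broken_paragraphs is made up, and it assumes a plain text file with Unix line endings in which a blank line (two consecutive newlines) separates paragraphs.

```python
import re

def join_broken_paragraphs(text: str) -> str:
    """Replace a double line break with one space unless a full stop precedes it."""
    # ([^.\n]) captures the last character before the break, provided it is
    # neither a full stop nor a newline; \1 puts that character back so only
    # the two line breaks are swapped for a space.
    return re.sub(r"([^.\n])\n\n", r"\1 ", text)

sample = "The Old Man and\n\nthe Sea was short.\n\nNext paragraph."
print(join_broken_paragraphs(sample))
```

The capture group matters: the character before the break is part of the match, so it has to be written back or it would be lost.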

I can't stand Windows. I can't stand Word. So I'm afraid I can't help you do it there.
Old 11-25-2010, 08:24 AM   #5
oldbwl
Quote:
Originally Posted by frabjous View Post
Any decent text editor? Vim? Emacs? Textmate? Notepad++? They're all free.

You could use any scripting language with regex capabilities: perl, python, sed, awk, lua, etc., probably with a one-liner.

For a text file:

Code:
perl -0777 -pe 's/([^.])\n\n/$1 /g' filename > new-filename
would replace every double line-break following anything but a full-stop with a single space, if I'm not mistaken.

I can't stand Windows. I can't stand Word. So I'm afraid I can't help you do it there.
I have to use Windows and Word and am not allowed to install any programs, so I am stuck. Such a pity I don't know how to convert your code snippet to a Word equivalent. Looks good. I need to find a detailed help file on what expressions I can use in Word. I will start with the built-in one but don't hold out too much hope.
Old 01-22-2011, 11:08 AM   #6
flopis
Member
Posts: 17
Karma: 7700
Join Date: Jan 2011
Device: kindle
I also work in Word. I don't do enough conversion to justify learning another program and regex, although I sympathize with the aversion to MS Word.
When I do my conversions (PDF > HTM > MOBI), I record my macros and use them only in that one document... cumbersome!
So, if you can, please send me the Word macros you've developed.
Old 01-30-2011, 04:34 AM   #7
comtrjl
Connoisseur
Posts: 75
Karma: 204999
Join Date: Aug 2006
Location: London
In Word, if regular expressions are on, .^13^13 will find a full stop followed by two paragraph marks and [!.]^13^13 will find any character other than a full stop followed by two paragraph marks.

bob
Old 02-04-2011, 11:04 AM   #8
flopis
When you say "regular expressions on", do you mean enabling 'Use wildcards'?
What about when you have a hard-coded page number between two paragraph marks, with no full stop? How can you join the paragraphs and delete the page number?
Old 02-09-2011, 08:20 AM   #9
mediax
Readaholic
Posts: 255
Karma: 1058454
Join Date: Jul 2009
Location: Swindon, UK
Device: Sony PRS-T2 (previously 505 and 650)
This is a bit of a kludgy three-step solution, but it should work, and it uses Word's search and replace rather than regex or a scripting language.

1) Replace all double paragraphs with some nonsense text, e.g.
Replace ^p^p with "zxzxqwqw".

2) Replace all full stops followed by your nonsense text with a full stop followed by a double paragraph, e.g.
Replace ".zxzxqwqw" with .^p^p

3) Replace all remaining nonsense text with a space, e.g.
Replace "zxzxqwqw" with " "

I've tried it on a couple of short samples and it seems to work OK, but be sure to take a backup before trying it on a large scale!
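
For comparison, the three steps above can be sketched in Python using plain string replacement and no regex at all. This is a rough illustration under assumptions of mine: the function name join_with_placeholder is hypothetical, the text is plain text in which two newlines stand in for Word's ^p^p, and the nonsense marker must not already occur in the book (the sketch checks for that).

```python
def join_with_placeholder(text: str, marker: str = "zxzxqwqw") -> str:
    """Join paragraphs that do not end in a full stop, using a placeholder."""
    if marker in text:
        # The trick only works if the nonsense text is unique in the document.
        raise ValueError("marker already occurs in the text; pick another one")
    # Step 1: replace every double paragraph break with the nonsense marker.
    text = text.replace("\n\n", marker)
    # Step 2: where a full stop precedes the marker, restore the real break.
    text = text.replace("." + marker, ".\n\n")
    # Step 3: every marker that is left was a false break, so use a space.
    return text.replace(marker, " ")

sample = "The Old Man and\n\nthe Sea was short.\n\nNext paragraph."
print(join_with_placeholder(sample))
```

The uniqueness check is the one thing the manual version leaves to the user, which is a good reason to take the backup mentioned above.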

Ian
Old 02-10-2011, 03:56 AM   #10
comtrjl
Quote:
Originally Posted by flopis View Post
When you say "regular expressions on", do you mean enabling 'Use wildcards'?
What about when you have a hard-coded page number between two paragraph marks, with no full stop? How can you join the paragraphs and delete the page number?
Sorry, yes I meant 'Use wildcards'.
For the case above you could search for ^13[0-9]*^13 and in the replace line either ^p or ^p^p or a space character depending on how you want the spacing.

bob

Last edited by comtrjl; 02-10-2011 at 09:35 AM.
Old 02-24-2011, 02:42 PM   #11
Faster
Connoisseur
Posts: 61
Karma: 12096
Join Date: Sep 2010
Location: Tasmania
Device: Sony PRS 650
Couple of points:
It's not clear what sort of books you're dealing with but if they are novels you may have additional punctuation other than the full stop occurring at the ends of paragraphs, for example double or single quotes, or maybe a question mark. Also you may have paragraphs that end with no punctuation but that you want to retain as a paragraph such as Titles, Chapter numbers, sub-headings, a quote, poem verses or the author's name of quotes.
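
To illustrate the first point in script form, here is a rough Python sketch that treats several characters, not just the full stop, as legitimate paragraph endings. The list in TERMINATORS and the function name are my own assumptions, and the text is assumed to be plain text with two newlines between paragraphs. It does nothing about titles, chapter numbers or verse, which end with no punctuation and would still be joined, so those need handling separately.

```python
import re

# Characters accepted as a genuine paragraph ending (an assumption; extend as
# needed): full stop, question mark, exclamation mark, straight quotes and
# the curly closing quotes.
TERMINATORS = ".?!\"'\u201d\u2019"

def join_unless_terminated(text: str) -> str:
    """Join paragraphs whose last character is not in TERMINATORS."""
    # Build a negated character class from the terminators plus the newline.
    pattern = re.compile(r"([^\n" + re.escape(TERMINATORS) + r"])\n\n")
    # \1 writes the captured last character back before the space.
    return pattern.sub(r"\1 ", text)

sample = 'He asked, "Why?"\n\nShe shrugged and\n\nleft.'
print(join_unless_terminated(sample))
```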

The code given by comtrjl is for Regex which you are probably not using in Word (it is possible but you do it inside a VBA script).
comtrjl's Find code is ^13[0-9]*^13 which in Word using wildcards would find a single digit followed by anything until it finds a paragraph marker.

You need ^13[0-9]@^13 which finds one or more digits bounded by paragraph markers.
If the page numbers include a space, like this '2 3' for 23, then your Find code should be ^13[0-9_]@^13 where the underscore indicates where you put a space.

If the word 'Page' or 'page' is included, the Find code is ^13[Pp]age [0-9]@^13, or if it's of the form 'Page 123 of 360' you use [Pp]age [0-9]@ of [0-9 ]@^13.
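
For readers working on a plain text export rather than inside Word, the page-number patterns above translate fairly directly to a regex, where + plays the role of Word's @ (one or more). The sketch below is an illustration only: the names PAGE_LINE and strip_page_numbers are hypothetical, a single newline stands in for each ^13, and whether to replace with a space or a line break depends on the document.

```python
import re

# A line holding nothing but a page number: digits that may be spaced out
# ("2 3"), optionally preceded by "Page"/"page" and followed by "of N".
PAGE_LINE = re.compile(r"\n(?:[Pp]age )?[0-9][0-9 ]*(?: of [0-9][0-9 ]*)?\n")

def strip_page_numbers(text: str, replacement: str = " ") -> str:
    """Remove page-number lines and join the text on either side."""
    return PAGE_LINE.sub(replacement, text)

print(strip_page_numbers("the quick brown\n23\nfox jumped"))
print(strip_page_numbers("end of\nPage 12 of 360\nthe line"))
```

As with the Word version, it is worth trying this on a few pages first, because a short line made up only of digits (a year on its own line, say) would be removed as well.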

Make sure you click on 'More' in the Find/Replace box and that you enable Wildcards.
Old 02-25-2011, 12:54 PM   #12
flopis
All of these pointers and suggestions help.
Thanks a million...
Old 02-25-2011, 07:53 PM   #13
Faster
Just one more suggestion for now - but it is a really beaut fixer.

After you've copied the PDF and pasted into Word notice that some of the font information is carried through. So before you attempt any changes select a bit of the text you want to remove and see what it says at the top of Word as to the font name and its size. Often you'll find that the font used for Author - Page, Title - Page, Page Number is different to that used in the body of the text. Consider such things as font name, font size, bold, italics or even a unique Style. Compare with the text you do not want to lose. If there is a difference you're laughing!
If this is the case go to Find Replace. Click in the Find box but don't enter anything. Click 'More'. Click 'Format'. Select 'Font' and choose the attributes of what you want to get rid of.
Leave 'Replace with' empty unless you need to replace with a space or paragraph marker (it varies from one document to another).
Try a Find Next then a Replace to check it's working before clicking Replace All.

One thing not pointed out above in the thread was why ^13 is used rather than ^p in the Find box. The reason is that you cannot use ^p when you select 'Use wildcards'; however, you can and should use ^p anytime in the Replace box. Also NEVER use ^13 in the Replace box. There it means the ASCII character 13 and doesn't include all the hidden info about the 'paragraph'. In fact the preceding text is then not separated as a paragraph from the succeeding text.

One final tip. Notice the down arrows at the right side of both the Find and Replace boxes. That's your source of Find and Replace expressions you've already used (Same in Sigil) so you don't need to type them again.

Last edited by Faster; 03-04-2011 at 02:25 PM.
Old 02-26-2011, 08:28 AM   #14
Dillinquent
eBook pro
Posts: 71
Karma: 5634
Join Date: Jan 2011
Location: Hertford, UK
Device: PC, iPad, Kindle, Kindle Fire, Galaxy Ace
I have had the same problem and solved it by recording a macro and using search & replace stepping through the alphabet one letter at a time i.e. a^p^p, b^p^p, c^p^p.....
Not the most elegant of solutions and a bit time consuming to get right but once set up it does work.
Old 02-26-2011, 09:59 AM   #15
Faster
Dillinquent: why not use [a-z]^13^13 for your Find criteria with Wildcards enabled? You only search through once this way, rather than 26 times.
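
One thing to watch with any character-class search like this: the letter is part of what gets matched, so replacing the whole match with a space would delete it. In a regex the usual answer is to capture the letter and write it back, as in the Python sketch below (the function name is made up, and two newlines stand in for ^13^13). As far as I know, Word's wildcard mode offers the same kind of grouping, with round brackets in the Find box and \1 in the Replace box, but test it on a copy first.

```python
import re

def join_after_lowercase(text: str) -> str:
    """Join paragraphs that end in a lower-case letter, keeping the letter."""
    # ([a-z]) captures the final lower-case letter so that \1 can put it back;
    # only the two line breaks are actually swapped for a space.
    return re.sub(r"([a-z])\n\n", r"\1 ", text)

print(join_after_lowercase("The Hound of the\n\nBaskervilles.\n\nChapter 1"))
```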
All times are GMT -4.