Old 09-09-2008, 02:49 PM   #1
texasnightowl
Question about doing some searching and replacing in Word

I've been editing some e-books over the weekend. Several of them are in PDF format, and I am either using MobiPocket Creator to get an HTML file out to edit or using Abbyy Transformer to get an RTF or DOC file out.

Once I have either the HTML or RTF/DOC file, I do the editing in Word. In most cases I'm doing fine at removing line breaks, manual page breaks or whatever else. But in some cases there are paragraph breaks in the middle of sentences, and I'm having a tougher time picking those up without running a grammar check and looking for paragraphs that have an incomplete sentence.

In these cases, the paragraph after the break begins with a lower case letter instead of a capital.

Is there an easy way to do a search for paragraphs that start with a word with no capitalization? Am I overlooking something?
Old 09-09-2008, 03:58 PM   #2
JSWolf
What you are experiencing is a common issue when converting PDF to some other format. The best way to make sure it's all fixed is to compare the output with the PDF until you have gone over every paragraph.
Old 09-10-2008, 08:48 AM   #3
DDHarriman
A manual way is:

1 - do a “find and replace” as “every paragraph mark with 2 paragraph marks”;

2 - read through your text (increasing the font size makes the oddities stand out better) and the broken paragraphs will be quite easy to find. Correct them as you go and keep reading;

3 - in the end, do a “find and replace” to reverse the original one, “every 2 paragraph marks with just 1”.
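
For anyone who would rather have a script do the hunting in step 2, here is a rough Python sketch, not something from this thread's tools. It assumes the document has been saved from Word as plain text with one paragraph per line, and the helper name `find_suspect_paragraphs` is made up for illustration. It only lists the paragraphs that start with a lower case letter, so the correcting is still done by hand.

```python
import re

# Hypothetical helper, not part of Word or any converter mentioned here.
# Assumption: the document was saved as plain text, one paragraph per line.
LOWERCASE_START = re.compile(r"^\s*[a-z]")

def find_suspect_paragraphs(text):
    """Return (paragraph_number, paragraph) pairs worth checking by hand."""
    paragraphs = [p for p in text.splitlines() if p.strip()]
    suspects = []
    for number, paragraph in enumerate(paragraphs, start=1):
        # A paragraph that opens with a lower case letter is probably
        # the tail end of a sentence that was broken in two.
        if LOWERCASE_START.match(paragraph):
            suspects.append((number, paragraph))
    return suspects

sample = "The storm rolled in over the\nhills before dawn.\nNobody slept that night."
for number, paragraph in find_suspect_paragraphs(sample):
    print(number, paragraph)  # prints: 2 hills before dawn.
```

Paragraphs that legitimately start in lower case (a list item, for example) will be flagged too, which is why this only reports and does not change anything.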

Proofreading is a long, costly and tedious process… see Project Gutenberg's effort with collective proofreading!

In digitization projects it is by far the most costly part of the project, and it is one of the main reasons the “image with OCRed text under it” PDF format is so popular in these projects and also in the enterprise world.

Best regards,

Last edited by DDHarriman; 09-10-2008 at 02:07 PM.
Old 09-10-2008, 11:53 AM   #4
=X=
Quote:
Originally Posted by texasnightowl
Is there an easy way to do a search for paragraphs that start with a word with no capitalization? Am I overlooking something?
With regular expressions, what you are asking to do is extremely easy. However, Word has somehow chosen to make this very difficult, because its search codes are inconsistent. You'll see what I mean in a moment.

1) Press <CTRL>+H to open Find and Replace
2) Click the <More> button (if you only see <Less>, the options are already showing)
3) Check the "Use Wildcards" check box
4) In the Find text box enter, without the quotes, "^13([a-z]?)"
// This finds a lower case letter at the start of a new line
// ** Here is the inconsistency. New paragraphs are ^p, which is not the same as ^13, which is a new line. When using the "Use Wildcards" option, ^p is not supported, so there will be cases where the expression in step 4 does not find the text.
5) In the Replace text box enter, without the quotes, " \1"
// ** Note the space before \1. This puts a space between the two words; otherwise you will get a lot of spelling errors from words being run together.
// The \1 inserts the text found in () from step 4. If you had two groups, ()(), you would use \1\2
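
To show what the same idea looks like with ordinary regular expressions, here is a minimal Python sketch. It is an illustration outside Word, and it assumes the text has been exported as plain text so that paragraph breaks arrive as "\n" or "\r\n"; the function name `join_broken_paragraphs` is hypothetical. One difference worth knowing: in Word's wildcard mode `?` means "any single character", so the Word pattern captures the lower case letter plus the character after it, while the pattern below captures only the letter. The result after replacing is the same.

```python
import re

# Assumption: plain-text export, paragraph breaks are "\n" or "\r\n".
# Any spaces left dangling before the break are swallowed as well, so the
# joined sentence does not end up with a double space.
BREAK_THEN_LOWERCASE = re.compile(r"[ \t]*\r?\n([a-z])")

def join_broken_paragraphs(text):
    """Replace a break followed by a lower case letter with a space and that letter."""
    return BREAK_THEN_LOWERCASE.sub(r" \1", text)

sample = "The storm rolled in over the\nhills before dawn.\nNobody slept."
print(join_broken_paragraphs(sample))
```

Only the first break is removed in the sample, because the second one is followed by a capital letter.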

=X=

Last edited by =X=; 09-10-2008 at 11:55 AM. Reason: edited grammer
Old 09-10-2008, 03:05 PM   #5
texasnightowl
Thanks so much! I'll play with these strings tonight or tomorrow.
Old 09-11-2008, 12:18 AM   #6
FizzyWater
I played a little with Word macros and came up with the following to do what the other poster suggested doing in the "Find" and "Replace" boxes. I only tested this on one document, so you might want to test this on some of the documents you typically convert.

Code:
Sub ParagraphBreaksInMiddleOfSentences()

    Selection.WholeStory
    
    'to delete all section breaks first (replace the ^b if you
    'want to delete all section *and* page breaks or ^m if
    'you want to delete only page breaks)
    With Selection.Find
        .Text = "^b"
        .Replacement.Text = ""
        .Forward = True
        .Wrap = wdFindContinue
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
    'resets FindAndReplace parameters to defaults before running next one
    Call ClearFindAndReplaceParameters
    
    'to replace any combination of a paragraph return,
    'then a lower case letter, with a space and the
    'same lower case letter
    With Selection.Find
        .Text = "^13([a-z]?)"
        .Replacement.Text = " \1"
        .MatchWildcards = True
        .Forward = True
        .Wrap = wdFindContinue
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
    Call ClearFindAndReplaceParameters
    
    'to replace any combination of a line feed, then a
    'lower case letter, with a space and the same lower
    'case letter
    With Selection.Find
        .Text = "^l([a-z]?)"  'this is a caret (^) and a lower case L (l)
        .Replacement.Text = " \1"
        .MatchWildcards = True
        .Forward = True
        .Wrap = wdFindContinue
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
    Call ClearFindAndReplaceParameters

End Sub

Sub ClearFindAndReplaceParameters()
' copied from word.mvps.org

  With Selection.Find
    .ClearFormatting
    .Replacement.ClearFormatting
    .Text = ""
    .Replacement.Text = ""
    .Forward = True
    .Wrap = wdFindStop
    .Format = False
    .MatchCase = False
    .MatchWholeWord = False
    .MatchWildcards = False
    .MatchSoundsLike = False
    .MatchAllWordForms = False
  End With

End Sub
Quote:
=X= said:
// ** Here is the inconsistency. New paragraphs are ^p, which is not the same as ^13, which is a new line. When using the "Use Wildcards" option, ^p is not supported, so there will be cases where the expression in step 4 does not find the text.
The Microsoft Office website specifically listed ^13 as the replacement for the paragraph mark when wildcards are on. When I tested paragraph returns against line returns that I had typed in myself, the ^13 replaced all the paragraph returns and the ^l replaced all the line returns.

I use Windows, so it may be that this works differently in other operating systems. But as I said above, I really only tested this on one "real" ebook and just one single-sentence test file...so I could be wrong.
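
As a rough illustration of the paragraph return versus line return distinction, here is a small Python sketch that handles both in one pass. It rests on an assumption that is not confirmed anywhere in this thread: that the raw text comes out of Word with paragraph marks as character 13 ("\r") and manual line breaks as character 11 ("\x0b"). The names are made up for the example.

```python
import re

PARAGRAPH_MARK = "\r"  # character 13, what ^13 matches
LINE_BREAK = "\x0b"    # character 11, assumed here to be what ^l matches

# Either kind of break, followed by a lower case letter.
EITHER_BREAK_THEN_LOWERCASE = re.compile(r"[\r\x0b]([a-z])")

def join_after_either_break(text):
    """Replace a paragraph mark or line break before a lower case letter with a space."""
    return EITHER_BREAK_THEN_LOWERCASE.sub(r" \1", text)

sample = ("over the" + PARAGRAPH_MARK + "hills and" + LINE_BREAK
          + "far away." + PARAGRAPH_MARK + "The end.")
print(repr(join_after_either_break(sample)))
```

The macro above does the same job in two separate passes, one for ^13 and one for ^l; a single character class is simply the shorter way to write it when both breaks should be treated alike.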

Last edited by FizzyWater; 09-11-2008 at 12:29 AM. Reason: to get code to display indents properly
Old 09-11-2008, 09:20 AM   #7
texasnightowl
Thanks FizzyWater! I will try it tonight or sometime over the weekend. I have a couple of ebooks that were only in pdf format so far (which I am hating), so I will try this soon. Thanks!