Old 05-01-2008, 02:04 PM   #1
46137
Reformatting untidy text files macro

I use the Sony Reader 505 and really don't like its PDF function. I use the reader a lot for educational PDFs (I'm a teacher), and I like Arial 16, fully justified, so as not to get a headache. Because of this I really have to get the PDFs into RTF.

Converting PDFs: I use Adobe, ABC PDF converter, or cut and paste. It all depends on which gives the best result for each document.

Text cut and pasted from HTML has the same problems.

As you know, a PDF or HTML document converted to text is often very messy, with double returns at each line, page numbers etc. That makes editing a pain. I've been struggling with this for ages and have come up with a few techniques to make it easier. If I'm telling anybody anything they already know… sorry!!

You often need to use one, two or more of these, but I have found you can generally make a readable rtf/doc/plain text file with minimum fuss.

First… look at the document. This is really important. Look for patterns. Click the reveal formatting button (the backwards P on the toolbar) and see what you have. Look for double, triple or multiple carriage returns and repeated formatting. Sometimes AutoFormat will sort it straight away, but if it doesn't…

I nearly always do this:

Replace <soft space> with <space>. This turns soft (non-breaking) spaces into ordinary spaces.

Replace <space><space> with <space>. This gets rid of double spaces. Keep hitting Replace All until no more changes are made. When I first started I had lots of trouble with extra spaces, broken words, separated lines etc. Getting each word separated by a single space really helped.

The easiest documents to edit quickly are those where each paragraph ends with a double or triple return. If this is the case then it takes seconds to get a readable document. Use find and replace to replace each double carriage return (a double carriage return is ^p^p) with any long random string. I use xxxxxxxxxx.

If you have a mix of double and triple returns, it is easy to replace each single ^p with a double ^p^p, then replace each run of 3 or 4 returns with xxxxxxxxxx.

When this is done, replace each remaining return with nothing. Finally replace xxxxxxxxxx with ^p^p.

This leaves each paragraph ending with a double return, which is how I like it!!
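
As a rough sketch, the marker trick above could be run as one small Word macro instead of three trips to the Replace box. It assumes paragraphs are separated by exactly two returns and that xxxxxxxxxx never appears in the real text; the name ReflowWithMarker is made up for the example.

```vba
Sub ReflowWithMarker()
    ' Sketch only: assumes paragraphs end in exactly two returns and that
    ' the marker string does not occur anywhere in the real text.
    Dim steps As Variant
    Dim i As Long

    ' Pairs of (find, replace), run in this order as Replace All.
    steps = Array( _
        "^p^p", "xxxxxxxxxx", _
        "^p", "", _
        "xxxxxxxxxx", "^p^p")
    ' Pair 1 marks the paragraph breaks, pair 2 deletes every remaining
    ' return (lines are joined with no space added), pair 3 restores the breaks.

    For i = LBound(steps) To UBound(steps) Step 2
        With ActiveDocument.Content.Find
            .ClearFormatting
            .Replacement.ClearFormatting
            .Text = steps(i)
            .Replacement.Text = steps(i + 1)
            .Forward = True
            .Wrap = wdFindStop
            .Format = False
            .MatchWildcards = False
            .Execute Replace:=wdReplaceAll
        End With
    Next i
End Sub
```

Try it on a copy of the document first, since a Replace All that deletes returns is awkward to undo by hand.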

If you use PDF converter from ABC then the page number is in blue. This is very easy to deal with using the formatting option of find and replace. In the Find and Replace box hit More, then Format, Font, Font colour, Blue. Leave the text boxes empty and just replace anything blue with nothing. All page numbers gone.

This also works with subscript and superscript, underline, bold etc.
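
The same colour search might look like this as a macro. It is only a sketch: it assumes the page numbers are the only text coloured exactly blue (wdColorBlue), and the name DeleteBlueText is invented for the example.

```vba
Sub DeleteBlueText()
    ' Sketch only: deletes every run of text whose font colour is exactly
    ' wdColorBlue. Other shades of blue will not match, and any real text
    ' that happens to be blue will be deleted too.
    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Font.Color = wdColorBlue   ' search by formatting, not by text
        .Text = ""
        .Replacement.Text = ""
        .Forward = True
        .Wrap = wdFindStop
        .Format = True              ' tells Find to use the font setting above
        .MatchWildcards = False
        .Execute Replace:=wdReplaceAll
    End With
End Sub
```

For bold, underline and so on the idea would be the same, with a different .Font property set before the search.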

Often if you autoformat, words can run together, which makes spellchecking a pain. This occurs because at the end of each line the carriage return is next to the final word without a <space>.

Example

wrote the timetable^p
in two days

becomes:

wrote the timetablein two days

Replace ^p with <space>^p for the whole document.

Then replace <space><space> with <space>.

Finally replace ^p<space>^p with ^p^p. This will get rid of extra spaces between carriage returns.
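
Here is one way those three replacements could be strung together, repeating the last two until nothing more is found. Treat it as a sketch: the names SwapText and KeepWordsApart are made up, and the limit of 20 passes is an arbitrary safety cap.

```vba
Function SwapText(whatToFind As String, putInstead As String) As Boolean
    ' One Replace All over the whole document; True if anything was found.
    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = whatToFind
        .Replacement.Text = putInstead
        .Forward = True
        .Wrap = wdFindStop
        .Format = False
        .MatchWildcards = False
        SwapText = .Execute(Replace:=wdReplaceAll)
    End With
End Function

Sub KeepWordsApart()
    Dim n As Long

    ' Put a space in front of every return so joined lines keep a gap.
    SwapText "^p", " ^p"

    ' Collapse double spaces; repeat until none are left (capped at 20 passes).
    For n = 1 To 20
        If Not SwapText("  ", " ") Then Exit For
    Next n

    ' Remove the space left between two returns; repeated because runs of
    ' three or more returns need more than one pass.
    For n = 1 To 20
        If Not SwapText("^p ^p", "^p^p") Then Exit For
    Next n
End Sub
```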

Use find and replace for headings you wish to remove, such as titles appearing on each page.

An example may be Book Name<space>1, Book Name<space>2, … Book Name<space>27 etc.

Replace Book Name<space>1 with Book Name<space>. Hit Replace All until no more are found. Change 1 for 2 and repeat. Do this until you get to 0, then start again from 1 and keep going until no more changes are made. Finally replace Book Name<space> with nothing. This works for any repeated text and number.
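
The digit-by-digit routine above is tedious by hand, so here is a sketch of it as a macro. "Book Name " is just the example header from above (note the trailing space), and SwapAll and StripRunningHeader are invented names. It also removes the header wherever it appears in the body text, so check for that first.

```vba
Function SwapAll(whatToFind As String, putInstead As String) As Boolean
    ' One Replace All over the whole document; True if anything was found.
    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = whatToFind
        .Replacement.Text = putInstead
        .Forward = True
        .Wrap = wdFindStop
        .Format = False
        .MatchCase = True
        .MatchWildcards = False
        SwapAll = .Execute(Replace:=wdReplaceAll)
    End With
End Function

Sub StripRunningHeader()
    ' Hypothetical header text; change it to match your own document.
    Const HEADER As String = "Book Name "
    Dim digit As Long, sweep As Long, changed As Boolean

    ' Peel one digit at a time off the number after the header.
    ' Each sweep tries 0 to 9; stop when a whole sweep changes nothing.
    For sweep = 1 To 10
        changed = False
        For digit = 0 To 9
            If SwapAll(HEADER & CStr(digit), HEADER) Then changed = True
        Next digit
        If Not changed Then Exit For
    Next sweep

    ' Finally remove the bare header itself.
    SwapAll HEADER, ""
End Sub
```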

Remember, any repeated string is your friend. It even works for page numbers on their own. Just remember to reset the formatting options each time. Look for repeated patterns, so:

^p
^p
123
^p

and

^p
^p
124
^p

etc. are easy.

For all the page numbers just do multiple runs until no more changes are made. If the book has over a hundred pages you have to find and replace each number at least 3 times.

Replace ^p1 with ^p: gives 23 and 24
Replace ^p2 with ^p: gives 3 and 4
Replace ^p3 with ^p: gives <nothing> and 4
Replace ^p4 with ^p: gives <nothing> and <nothing>
etc.

Keep going until you end up with the easy to change ^p^p^p^p, then just change that to ^p^p^p. This will give you your return between paragraphs.
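
A sketch of the page-number runs as a macro follows. The names TrimAfterReturn and StripLoosePageNumbers are made up. Be warned that it strips digits from the start of every line, so any real line that begins with a number (a date, a numbered list) will lose it.

```vba
Function TrimAfterReturn(whatToFind As String, putInstead As String) As Boolean
    ' One Replace All over the whole document; True if anything was found.
    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = whatToFind
        .Replacement.Text = putInstead
        .Forward = True
        .Wrap = wdFindStop
        .Format = False
        .MatchWildcards = False
        TrimAfterReturn = .Execute(Replace:=wdReplaceAll)
    End With
End Function

Sub StripLoosePageNumbers()
    Dim digit As Long, sweep As Long, changed As Boolean

    ' Remove one leading digit after a return at a time, e.g.
    ' ^p123 becomes ^p23, then ^p3, then ^p.
    For sweep = 1 To 10
        changed = False
        For digit = 0 To 9
            If TrimAfterReturn("^p" & CStr(digit), "^p") Then changed = True
        Next digit
        If Not changed Then Exit For
    Next sweep

    ' The emptied page-number lines leave four returns in a row.
    TrimAfterReturn "^p^p^p^p", "^p^p^p"
End Sub
```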

For really stubborn documents I use this macro. First I find all headings and make sure that I have a triple carriage return after them. Example:

Heading^p
^p
^p

I make sure that any words and carriage returns are separated by a space.

Macro

Sub newcarr()
'
' newcarr Macro
' Macro recorded 26/04/2008 by SEC
'
' Every recorded step was the same Replace All block with different text,
' so the repeated settings now live in DoReplace below.
' Runs of spaces were collapsed when the code was posted, so the space-only
' steps marked (assumed) are reconstructed from the description in the post.
' Check them against your own recording before relying on them.

    ' Basic clean-up of spaces and speech marks
    DoReplace "^s", " "              ' soft space to ordinary space (assumed)
    ' The second recorded step appeared as: find " ", replace with "".
    ' Run literally it would delete every space, so it is left out here.
    DoReplace """", "'"              ' double speech marks to single
    DoReplace "  ", " "              ' double space to single space (assumed)

    ' Protect the triple returns after headings with a marker
    DoReplace "^p^p^p", "xxxxxxxxxx"

    ' Remove the space between closing punctuation and a return
    DoReplace ". ^p", ".^p"
    DoReplace "? ^p", "?^p"
    DoReplace "' ^p", "'^p"
    DoReplace ") ^p", ")^p"
    DoReplace ": ^p", ":^p"
    DoReplace "; ^p", ";^p"

    ' A line ending in closing punctuation is treated as a paragraph end:
    ' swap punctuation plus return for a numbered marker
    DoReplace ".^p", "1xyz"
    DoReplace "?^p", "2xyz"
    DoReplace "'^p", "3xyz"
    DoReplace ")^p", "4xyz"
    DoReplace ":^p", "5xyz"
    DoReplace ";^p", "6xyz"

    ' Join all the remaining lines
    DoReplace "^p", ""

    ' Put the heading returns back, then turn each marker back into
    ' its punctuation followed by a double return
    DoReplace "xxxxxxxxxx", "^p^p^p"
    DoReplace "1xyz", ".^p^p"
    DoReplace "2xyz", "?^p^p"
    DoReplace "3xyz", "'^p^p"
    DoReplace "4xyz", ")^p^p"
    DoReplace "5xyz", ":^p^p"
    DoReplace "6xyz", ";^p^p"
End Sub

Sub DoReplace(findWhat As String, putInstead As String)
' One Replace All over the whole document, using the recorder's settings.
    Selection.Find.ClearFormatting
    Selection.Find.Replacement.ClearFormatting
    With Selection.Find
        .Text = findWhat
        .Replacement.Text = putInstead
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub

Sub arial()
'
' arial Macro
' Macro recorded 26/04/2008 by SEC
'
' Selects the whole document, resets the font colour and full-justifies it.
    Selection.WholeStory
    Selection.Font.Color = wdColorAutomatic
    Selection.ParagraphFormat.Alignment = wdAlignParagraphJustify
End Sub



This basically removes double spaces, removes extra spaces between punctuation and returns, and clears up speech marks.

I then replace any closing punctuation (. ? ' ) : ;) that has a ^p after it with strings. I basically follow the rule that if a sentence ends at the end of a line then this is where a paragraph break should be. Not always true, but it gives a readable document.

Finally I delete all carriage returns, then replace the strings with double returns.

I must stress that none of these work on their own and none of them always work, but by combining these techniques I am able to convert a text document made from a PDF much more quickly. The labour-intensive job of deleting each extra return used to take hours. In 95% of cases I can get a readable document in about 10 minutes for an average-size book, where doing it manually might take 2 or 3 hours (yes, we've all done it!!!). Clever use of find and replace really helps.

Oh... also always save as a plain text document before doing the cosmetic formatting; it makes for a smaller file!!!

Hope it helps, and please don't flame me if you already know all these!!!!!!
