05-01-2008, 02:04 PM | #1 |
Member
Posts: 11
Karma: 40
Join Date: May 2008
Location: Lima, Peru
Device: Sony PRS 505
|
Reformatting untidy text files macro
I use the Sony Reader 505 and really don't like the pdf function. I actually use it a lot to read educational pdfs (I'm a teacher). I also like to use arial 16 full justified so as to not get a headache. Because of this I really have to get the pdfs into rtf.
Converting PDFs: I use Adobe, ABC PDF converter, and cut and paste. It all depends on which gives the best result for each document. Cut and paste from HTML gives the same problems.

As you know, a PDF or HTML document converted to text is often very messy, with double returns at each line, page numbers, etc. This makes editing a pain. I've been struggling with this for ages and have come up with a few techniques to make it easier. If I'm telling anybody anything they already know… sorry!! You often need to use one, two or more of these, but I have found you can generally make a readable rtf/doc/plain text file with minimum fuss.

First… look at the document. This is really important. Look for patterns. Click the reveal formatting button (the backwards P on the toolbar) and see what you have. Look for double, triple or multiple carriage returns and repeated formatting. Sometimes AutoFormat will sort it out straight away, but if it doesn't……

I nearly always do this:
Replace soft <space> with <space>. This replaces soft spaces with ordinary spaces.
Replace <space><space> with <space>. This gets rid of double spaces. Keep hitting Replace All until no changes are made.

When I first started I had lots of trouble with extra spaces, broken words, separated lines, etc. Getting each word to be separated by a single space really helped.

The easiest documents to edit quickly are those where each paragraph ends with a double or triple return. If this is the case then it takes seconds to get a readable document:
Use find and replace to replace each double carriage return (a double carriage return is ^p^p) with any long random string. I use xxxxxxxxxx.
If you have a mix of double and triple returns, it is easy to replace each single ^p with a double ^p^p and then replace 3 or 4 returns with xxxxxxxxxx.
When this is done, replace each return with nothing.
Finally, replace xxxxxxxxxx with ^p^p.
This gives each paragraph a double return, which is how I like it!!

If you use the PDF converter from ABC then the page number is in blue. This is very easy to deal with.
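The placeholder trick described above (protect the paragraph breaks with a marker string, remove the remaining returns, then restore the breaks) can also be tried outside Word. This is only a rough Python sketch, assuming plain-text input where paragraphs are separated by two or more returns. The function name is made up for illustration, and unlike the Word steps it swaps each leftover return for a space rather than nothing, so words at line ends do not run together.

```python
import re

# Same marker string as in the post; any string that never occurs in the text will do.
PLACEHOLDER = "xxxxxxxxxx"


def unwrap_paragraphs(text: str) -> str:
    """Keep paragraph breaks (2+ returns), remove returns inside paragraphs."""
    # Assumption: the input may use Windows or old Mac line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # 1. Protect every run of two or more returns with the marker.
    text = re.sub(r"\n{2,}", PLACEHOLDER, text)
    # 2. Any return still left is a line wrap; turn it into a space.
    text = text.replace("\n", " ")
    # 3. Collapse the double spaces this may have created.
    text = re.sub(r" {2,}", " ", text)
    # 4. Put the paragraph breaks back as double returns.
    return text.replace(PLACEHOLDER, "\n\n")


if __name__ == "__main__":
    sample = "This is a line\nthat was wrapped.\n\n\nNew paragraph\nhere."
    print(unwrap_paragraphs(sample))
```

If the marker string might really occur in the document, pick a different one before running it.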
One technique is the find and replace formatting function. In Find and Replace hit More, then Format, Font, font colour blue. Just replace anything blue with nothing. All page numbers gone. This also works with subscript and superscript, underline, bold, etc.

Often if you autoformat, words can run together, which makes spellchecking a pain. This occurs because at the end of each line the carriage return is next to the final word without a <space>. Example:
wrote the timetable^p
in two days
becomes:
wrote the timetablein two days
Replace ^p with <space>^p for the whole document. Then replace <space><space> with <space>. Finally replace ^p<space>^p with ^p^p. This will get rid of extra spaces between carriage returns.

Use find and replace for headings you wish to remove, such as titles appearing on each page. Examples may be Book Name <space> 1, Book Name <space> 2, …… Book Name <space> 27, etc.
Replace Book Name <space> 1 with Book Name <space>. Hit Replace All until no more are found. Change 1 for 2 and repeat. Do this until you get to 0, then start again from 1 and keep going until no more changes are made. Finally replace Book Name <space> with nothing.

This works for any repeated text and number. Remember, any repeated string is your friend. It even works for page numbers on their own. Just remember to replace the formatting each time. Look for repeated patterns, so ^p ^p 123 ^p and ^p ^p 124 ^p etc. are easy.

For all the page numbers just do multiple runs until no more changes are made. If the book has over one hundred pages you have to find and replace each number at least 3 times. For example, with page numbers 123 and 124:
Replace ^p1 with ^p: gives 23 and 24
Replace ^p2 with ^p: gives 3 and 4
Replace ^p3 with ^p: gives <nothing> and 4
Replace ^p4 with ^p: gives <nothing> and <nothing>
Etc. etc., until you end up with the easy to change ^p^p^p^p; just change it to ^p^p^p. This will give you your return between paragraphs.

For really stubborn documents I use this macro. First I find all headings and make sure that I have a triple carriage return after them.
Example:
Heading^p
^p
^p
I make sure that any words and carriage returns are separated by a space.

Macro:

Private Sub DoReplace(findText As String, replaceText As String)
    ' One Replace All pass over the whole document, using the recorded settings
    Selection.Find.ClearFormatting
    Selection.Find.Replacement.ClearFormatting
    With Selection.Find
        .Text = findText
        .Replacement.Text = replaceText
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub

Sub newcarr()
'
' newcarr Macro
' Macro recorded 26/04/2008 by SEC
'
    ' The first two searches contained special characters that did not survive
    ' being pasted into the forum (they show up as a plain space). Going by the
    ' description, the first replaces a soft space with an ordinary space.
    ' Re-record both steps yourself before relying on them.
    DoReplace " ", " "
    ' DoReplace " ", ""     ' left commented out: as shown it would delete every space
    DoReplace """", "'"     ' double speech marks become single
    DoReplace "  ", " "     ' double space becomes single space

    ' Protect the triple returns after headings
    DoReplace "^p^p^p", "xxxxxxxxxx"

    ' Remove the space between punctuation and a return
    DoReplace ". ^p", ".^p"
    DoReplace "? ^p", "?^p"
    DoReplace "' ^p", "'^p"
    DoReplace ") ^p", ")^p"
    DoReplace ": ^p", ":^p"
    DoReplace "; ^p", ";^p"

    ' Mark punctuation that is followed by a return with a string
    DoReplace ".^p", "1xyz"
    DoReplace "?^p", "2xyz"
    DoReplace "'^p", "3xyz"
    DoReplace ")^p", "4xyz"
    DoReplace ":^p", "5xyz"
    DoReplace ";^p", "6xyz"

    ' Delete all remaining returns
    DoReplace "^p", ""

    ' Put back the heading returns and turn the strings into double returns
    DoReplace "xxxxxxxxxx", "^p^p^p"
    DoReplace "1xyz", ".^p^p"
    DoReplace "2xyz", "?^p^p"
    DoReplace "3xyz", "'^p^p"
    DoReplace "4xyz", ")^p^p"
    DoReplace "5xyz", ":^p^p"
    DoReplace "6xyz", ";^p^p"
End Sub

Sub arial()
'
' arial Macro
' Macro recorded 26/04/2008 by SEC
'
    Selection.WholeStory
    Selection.Font.Color = wdColorAutomatic
    Selection.ParagraphFormat.Alignment = wdAlignParagraphJustify
End Sub

This basically removes double spaces, removes extra spaces between punctuation and returns, and clears up speech marks. I then replace any terminal punctuation (. ' ! ) ?) that has a ^p after it with strings. I basically follow the rule that if the sentence ends at the end of the line then this is where a paragraph should be. Not always true, but it gives a readable document. Finally I delete all carriage returns, then replace the strings with double returns.

I must stress none of these work on their own, and none of them always work, but by combining these techniques I am able to convert a text document made from a PDF much more quickly. The labour-intensive deleting of each extra return used to take hours. In 95% of cases I can get a readable document in about 10 minutes for an average-size book, where doing it manually might take 2 or 3 hours (yes, we've all done it!!!). Clever use of find and replace really helps.

Oh... also always save as a plain text document before doing the cosmetic formatting; it makes for a smaller file!!!

Hope it helps, and please don't flame me if you already know all these!!!!!! |
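The rule from the post above (a return straight after terminal punctuation is taken as a paragraph end, every other return as a line wrap) can be sketched in a few lines of Python for anyone working outside Word. This is a rough illustration, not the macro itself. The function name and the NUL marker character are my own assumptions, it expects plain text, and it leaves out the heading protection step.

```python
import re

# Punctuation the macro treats as a possible paragraph end: . ? ' ) : ;
TERMINAL_BEFORE_RETURN = re.compile(r"([.?'):;])\n")

# Temporary marker; assumption: a NUL character never occurs in the text.
MARK = "\x00"


def guess_paragraphs(text: str) -> str:
    """Rebuild paragraphs using the 'punctuation at end of line' rule."""
    text = text.replace("\r\n", "\n")
    # Drop spaces sitting between the last character of a line and its return.
    text = re.sub(r" +\n", "\n", text)
    # A return right after terminal punctuation is assumed to end a paragraph.
    text = TERMINAL_BEFORE_RETURN.sub(lambda m: m.group(1) + MARK, text)
    # Every other return is a line wrap; join the lines with a single space.
    text = text.replace("\n", " ")
    text = re.sub(r" {2,}", " ", text)
    # Turn the markers into double returns.
    return text.replace(MARK, "\n\n")


if __name__ == "__main__":
    sample = "He wrote the timetable \nin two days. \nThen he left\nfor Lima.\nThe end?"
    print(guess_paragraphs(sample))
```

As with the macro, the guess is not always right: a sentence that happens to end at the end of a line becomes a paragraph break.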
05-01-2008, 07:11 PM | #2 |
Wizard
Posts: 2,624
Karma: 1008294
Join Date: Dec 2007
Location: Iowa, USA
Device: Nook Simple Touch
|
I used ABC PDF converter once and the output had a ? in place of every ". I never did figure out why.
|
05-01-2008, 09:16 PM | #3 |
Resident Curmudgeon
Posts: 75,899
Karma: 134368292
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Is there actually any PDF converter that can convert a text-based PDF to some other format without error? Even Adobe Acrobat Pro makes plenty of mistakes.
|
05-02-2008, 12:36 AM | #4 | |
Grand Sorcerer
Posts: 19,832
Karma: 11844413
Join Date: Jan 2007
Location: Tampa, FL USA
Device: Kindle Touch
|
Quote:
BOb |
|
05-02-2008, 02:52 AM | #5 |
Junior Member
Posts: 3
Karma: 10
Join Date: Feb 2008
Location: UK
Device: None yet
|
I've also made a few quick macros to sort out text.
The first is intended to format text as italic when it has been marked up using underscores (i.e. this is _italic_ text); lots of text files seem to use this. Note that if there is an uneven number of underscores you'll get interesting results...

The second macro is to fix text that has extraneous carriage returns in it, as often happens, like this:
"this is one line of text but somehow
we have a new carriage return in it..."
Note that you must first edit the macro to indicate how many carriage returns are in the text (sometimes it will be two, but usually it's one). Both macros are quick hacks, and can benefit from some tweaking, but work fine for my purposes. Here they are:

Private Sub JoinAcrossBreak(sPattern As String)
    ' Finds sPattern (a wildcard search) and replaces the paragraph marks
    ' in the middle of each match with a single space
    Selection.Find.ClearFormatting
    Selection.Find.Replacement.ClearFormatting
    With Selection.Find
        ' need to use wildcards here
        .Text = sPattern
        .Forward = True
        .Wrap = wdFindContinue 'wdFindStop
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = True 'False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
        .Replacement.Text = ""
    End With
    While Selection.Find.Execute
        ' here i need to replace just the middle chars, ie the paragraph marks
        'MsgBox "Value found: " & Selection.Characters.First & Selection.Characters.Last, vbCritical
        Selection.TypeText (Selection.Characters.First & " " & Selection.Characters.Last)
    Wend
End Sub

Sub FixBadText()
' THIS MACRO WILL REPLACE EXTRA LINE BREAKS IN TEXT WITH A SPACE
    Dim sReplaceParas As String
    ' NOTE: change the value in the double quotes below to ^13^13 if there are two
    ' carriage returns in the document
    sReplaceParas = "^13" ' ^13 is paragraph char

    JoinAcrossBreak "[A-z]" & sReplaceParas & "[A-z]"
    'Now do the same but for commas!
    JoinAcrossBreak "," & sReplaceParas & "[A-z]"
    'Now do the same but for hyphens!
    JoinAcrossBreak "-" & sReplaceParas & "[A-z]"
End Sub

---------------------------------------------------

Sub ChangeToItalics()
' THIS WILL REPLACE _some text_ into italics
' Note: if there is an uneven number of underscores, you will encounter problems!
    Dim iBookMark As Integer
    Dim lStart As Long
    Dim lEnd As Long

    Selection.Find.ClearFormatting
    Selection.Find.Replacement.ClearFormatting
    With Selection.Find
        ' Plain search for the underscore (no wildcards needed)
        .Text = "_"
        .Forward = True
        .Wrap = wdFindContinue 'wdFindStop
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
        .Replacement.Text = ""
    End With

    iBookMark = 0
    While Selection.Find.Execute
        ' Here we replace the underscores with blank string, and ensure any text between
        ' them is formatted as italic
        ActiveDocument.Bookmarks.Add "temp" & iBookMark
        'Selection.MoveRight
        Selection.MoveUntil "_"
        ActiveDocument.Bookmarks.Add "temp" & (iBookMark + 1)
        ' Note that first char in story is 0, not 1...
        lStart = ActiveDocument.Bookmarks("temp0").Start
        lEnd = ActiveDocument.Bookmarks("temp1").Start
        ' Now make the first bookmark select the whole text between the two underscores
        ActiveDocument.Bookmarks("temp0").Start = lStart + 1
        ActiveDocument.Bookmarks("temp0").End = lEnd
        ' Now select the bookmark text
        ActiveDocument.Bookmarks("temp0").Select
        ' And make it italic
        Selection.ItalicRun
        ' Now delete the underscores
        ActiveDocument.Bookmarks("temp0").Select
        Selection.MoveLeft wdCharacter, 2
        Selection.Delete
        ActiveDocument.Bookmarks("temp1").Select
        Selection.Delete
    Wend

    ' Delete the bookmarks we created
    Dim iCount As Integer
    iCount = 0
    Do While iCount <= 2
        If ActiveDocument.Bookmarks.Exists("temp" & iCount) = True Then
            ActiveDocument.Bookmarks("temp" & iCount).Delete
        End If
        iCount = iCount + 1
    Loop
End Sub |
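For plain-text files handled outside Word, the underscore idea in the post above could be sketched in Python roughly as follows. This is only an illustration under my own assumptions: the function name is made up, the italic text is wrapped in HTML <i> tags (Word is not involved at all), and an uneven number of underscores is reported as an error instead of giving interesting results.

```python
import re


def underscores_to_italics(text: str) -> str:
    """Turn _some text_ into <i>some text</i> (HTML output is an assumption)."""
    # An odd count means at least one underscore has no partner.
    if text.count("_") % 2:
        raise ValueError("uneven number of underscores; fix the text by hand first")
    # Each pair of underscores wraps one italic run.
    return re.sub(r"_([^_]+)_", r"<i>\1</i>", text)


if __name__ == "__main__":
    print(underscores_to_italics("this is _italic_ text"))
```

Underscores that are part of the real text (file names, for example) would be caught too, so it is worth checking the result by eye.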
05-02-2008, 09:29 AM | #6 |
Technogeezer
Posts: 7,233
Karma: 1601464
Join Date: Nov 2006
Location: Virginia, USA
Device: Sony PRS-500
|
I have had great results with ABC Amber PDF Converter and ABBYY PDF Transformer. I have also made extensive use of Stingo's Word Macro (found in the MobileRead Wiki.) Your macro deserves to be there too.
I have found that the quality of the conversion from PDF depends on the content of the PDF. If it is text based then the conversion is very good and the problems are limited to headers, footers, and other physical page artifacts. The only other real issue is the encoding scheme -- Unicode vs. 8 bit vs. ASCII etc. If it is an image inside the PDF then the output quality is directly related to the OCR properties. |
05-02-2008, 10:20 AM | #7 |
Member
Posts: 11
Karma: 40
Join Date: May 2008
Location: Lima, Peru
Device: Sony PRS 505
|
I prefer ABC to Adobe itself. Adobe leaves lots of artefacts, while ABC's blue page numbers really help a lot.
Often if a letter uses a non-standard code then you will get problems such as losing ?. The other thing you find is losing fl and fi, so flag becomes .ag and fixing becomes .xing. I always save the output as an rtf or Word doc, then save as a txt to edit. However, always look at the raw rtf first. If there is a repeatable pattern of problems based on formatting then this is the time to solve it. This has the advantage of keeping the non-standard stuff. When you save as txt it then asks you to decide on substitute characters, page returns, etc. You can experiment with substitutions and encoding, and often after a few tries you get better results. For headers and footers don't forget you can find/replace by font size, font, position, etc. The ideal output will have headers/footers in a different size, page numbers as Page Number, and any footnotes in subscript. This is then a doddle to clear up. |
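On the lost fl and fi: one likely cause, and this is an assumption on my part, is that the PDF stores them as single ligature characters that get dropped or substituted when saving as plain text. If the characters are still present in the converted text, a small Python sketch like this could expand them before the final save. The function name is hypothetical and the table covers only the common Latin ligatures.

```python
# Unicode ligature characters and their plain-letter equivalents.
LIGATURES = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
})


def expand_ligatures(text: str) -> str:
    """Replace single-character ligatures with ordinary letters."""
    return text.translate(LIGATURES)


if __name__ == "__main__":
    print(expand_ligatures("\ufb02ag and \ufb01xing"))
```

Once a converter has already written a . in place of the ligature the information is gone, so this only helps on text where the original characters survived.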
05-02-2008, 10:22 AM | #8 |
Resident Curmudgeon
Posts: 75,899
Karma: 134368292
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
46137, would you mind actually attaching the macro instead of posting it as text inside a message, please?
|
05-02-2008, 09:27 PM | #9 |
Member
Posts: 11
Karma: 40
Join Date: May 2008
Location: Lima, Peru
Device: Sony PRS 505
|
Yer tiz
|
|