09-26-2009, 07:07 PM | #16 |
Groupie
Posts: 162
Karma: 24658
Join Date: Sep 2009
Device: PRS-505
|
Thanks a lot ElfWreck! Actually, I spent some more time trying to learn about Regular Expressions (used by most text editors for Search and Replace) and I ended up doing this:
Converted PDF -> HTML.
Now all the unwanted mid-sentence paragraph breaks are basically those that look like *</p>, where * is some character other than a period (since a period indicates the end of a sentence and probably the end of a para). I used Komodo Edit, which is a free and powerful text/HTML editor, to open the HTML file. Then I used the Edit->Replace feature (Ctrl-H) and entered the following. (Make sure the following boxes are checked: Regex, Multiline and Replace.)
Enter the following in the section - Find what:
Code:
([^\.'"!?:\)])</span></p> <p><span class=font3>
Enter the following in the section - Replace with (that is, \1 followed by a single space):
Code:
\1 
In my particular HTML file, paragraphs end with </span></p>, and then <p><span class=font3> starts the next para. What the regex above does is find only those paragraph breaks that do not have one of the characters . ' " ! ? : ) just preceding the paragraph break (since those would indicate a complete sentence and probably the end of a legit paragraph). The \1 in the replacement puts back the character captured by the parentheses, so only the tags are removed.
Now you can keep using the Find and Replace feature to rapidly cycle through all instances of these faulty paragraph breaks. If a break is indeed faulty, you just hit Replace and the whole code section is replaced by a space, thus bridging the broken sentence together. This ended up working really well and I managed to fix a 300 page (300 PDF pages, that is) book in 10 minutes! You can go even faster if you just use the Replace All feature, although you might end up taking out a couple of legit paragraph breaks (for example, some paragraph-ending sentences might end with a comma or some other character not being checked for).
Note - You can easily change the literal (non-expression) part of the code above to match how paragraphs end and start in your particular HTML file.
I hope this makes sense. It worked really well for me. I also find that HTML makes doing all the little tweaks really quick and painless. Also, I'd like to mention that for some reason this regex refused to work in Notepad++ for me (which is why I moved to Komodo). If anyone can get it to work in Notepad++, do let me know.
Cheers
Last edited by orion2001; 09-26-2009 at 07:09 PM. |
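For anyone who would rather script the Replace All step than click through an editor, here is a minimal Python sketch of the same find/replace. It is only an illustration, not what was actually run in Komodo: the function name and sample text are made up, and the whitespace between the two tags is matched with \s* on the assumption that the real break may be a newline rather than a single space.

```python
import re

# Same idea as the Find what / Replace with pair above.
# Assumption: \s* stands in for whatever whitespace (space or newline)
# separates the closing and opening tags in your file.
BROKEN_BREAK = re.compile(
    r"""([^.'"!?:)])</span></p>\s*<p><span class=font3>"""
)


def join_broken_paragraphs(html: str) -> str:
    """Replace mid-sentence paragraph breaks with a single space.

    The captured character (anything except . ' " ! ? : or a closing
    parenthesis) is put back via \\1, and the tags become one space.
    """
    return BROKEN_BREAK.sub(r"\1 ", html)


# Hypothetical sample: the first break is mid-sentence, the second is real.
sample = (
    "<p><span class=font3>The quick brown fox jumped over</span></p>\n"
    "<p><span class=font3>the lazy dog.</span></p>\n"
    "<p><span class=font3>A new paragraph starts here.</span></p>\n"
)

fixed = join_broken_paragraphs(sample)
print(fixed)
```

Like Replace All, this joins every match without asking, so a legit paragraph that ends in a comma or another unchecked character would get merged too.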
09-26-2009, 09:50 PM | #17 |
Groupie
Posts: 162
Karma: 24658
Join Date: Sep 2009
Device: PRS-505
|
Just a further update regarding Notepad++
Turns out that it isn't capable of using regular expressions with multi-line searches (as in this case). You can only do multi-line searches in "Extended mode", but you can't use regular expressions in that mode. I think this, coupled with the lack of integrated secure FTP in Notepad++, is going to make me move entirely to Komodo as my text editor of choice. |
09-26-2009, 09:58 PM | #18 |
Groupie
Posts: 162
Karma: 24658
Join Date: Sep 2009
Device: PRS-505
|
|
09-27-2009, 12:33 AM | #19 | ||
Grand Sorcerer
Posts: 5,185
Karma: 25133758
Join Date: Nov 2008
Location: SF Bay Area, California, USA
Device: Pocketbook Touch HD3 (Past: Kobo Mini, PEZ, PRS-505, Clié)
|
Quote:
I use "qqq" as a substitute sequence for multi-stage find-and-replace functions, because Word's abilities are limited. It can find "[any letter][paragraph break]" but doesn't allow "replace the paragraph part of that with a space." It can format or replace the entire search string, or add something to the beginning or end of it. So I add qqq to the end of it, and then search for "[paragraph break]qqq" and replace *that* with a space. I use it because qqq is exceedingly unlikely to be repeated anywhere in the body of the book, and I won't accidentally replace real text that way.
I am almost entirely clueless about HTML. I gather the principles are about the same as what I usually do in Word, but I'd have to learn a whole new set of keywords and search options. (Which I should do.) I have Kompozer, and occasionally have tried to work with it. It's confusing, and Word is not, because I have lots of practice with Word and none with HTML editors. (I suspect that Semagic doesn't count as an HTML editor. Most of what I know about HTML, I learned by posting at LiveJournal.)
Quote:
Same basic principle I use, except Word doesn't have a way to "find all X that don't match trait Y," nor a way to "find all X with trait A, or B, or C." Much less "find all X that don't match trait A, B, or C." However, it does have "find any letter" separate from "any character" or "any digit." (Does not have "any punctuation.")
The biggest problem working with Word is that the HTML output is atrocious; it has to be ported into something else & converted to be useful to anything other than FrontPage websites. Word 97 had okay HTML output. But you lose a lot of features using the old versions of Word. |
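The two-stage "qqq" trick described in this post can also be sketched in Python, for readers who want to see the steps spelled out. This is an illustration only: Word's own Find and Replace is not being scripted here, the function name and sample text are hypothetical, and a plain newline is assumed to stand for Word's paragraph break.

```python
import re

MARKER = "qqq"  # placeholder sequence unlikely to occur in real text


def join_with_marker(text: str) -> str:
    """Join paragraphs broken after a letter, using a two-stage replace."""
    if MARKER in text:
        # The trick only works if the marker never appears in the book itself.
        raise ValueError("marker already occurs in the text")

    # Stage 1: find "[any letter][paragraph break]" and add the marker
    # to the end of the match, leaving the match itself untouched.
    staged = re.sub(r"([A-Za-z]\n)", r"\g<1>" + MARKER, text)

    # Stage 2: find "[paragraph break]qqq" and replace that with a space.
    return staged.replace("\n" + MARKER, " ")


# Hypothetical sample: the first break is mid-sentence, the second is real.
sample = "The fox jumped over\nthe lazy dog.\nNext paragraph.\n"
joined = join_with_marker(sample)
print(joined)
```

With a regex-capable tool the marker is not strictly needed, since a capture group does the same job in one pass; the marker is what makes the approach work where captures are unavailable.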
||
09-27-2009, 12:41 AM | #20 |
Groupie
Posts: 162
Karma: 24658
Join Date: Sep 2009
Device: PRS-505
|
Thanks! That is very useful. I use Word, but I hate it when it comes time for rigorous formatting. I am currently in the middle of writing my doctoral thesis in Word and I am not having any fun. It works OK most of the time, but every now and then it does something silly and it is a huge pain hacking at it till I can fix it. I wish I could use LaTeX, but my advisor is an MS junkie. Anyways, I think both our approaches ended up being the same, albeit via different tools. I am now trying to learn BD as it seems like a very useful tool for creating the final ebook. Thanks again!
Cheers |
09-27-2009, 01:16 AM | #21 | |
Grand Sorcerer
Posts: 5,185
Karma: 25133758
Join Date: Nov 2008
Location: SF Bay Area, California, USA
Device: Pocketbook Touch HD3 (Past: Kobo Mini, PEZ, PRS-505, Clié)
|
Quote:
If you need it more correctly formatted than that, you could use Open Office, which is similar to Word in structure & workflow but less full of MS's peculiar approach to some formatting concepts. (And free. And if teacher complains, tell him not everyone can afford Microsoft Office.) |
|
09-27-2009, 01:34 AM | #22 |
Groupie
Posts: 162
Karma: 24658
Join Date: Sep 2009
Device: PRS-505
|
Heh, thanks, but I don't think that will work. I'm going to have to go back and forth with my files, and we use features like track comments/changes, etc. to work on manuscripts. It would end up being too much of a hassle. In addition, I use EndNote for my bibliography (and so does he), which would also cause problems. Lastly, he pays for the Word and EndNote licenses, so I can't quite argue on the monetary front.
|
|