Old 09-26-2009, 07:07 PM   #16
orion2001
Groupie
Posts: 162
Karma: 24658
Join Date: Sep 2009
Device: PRS-505
Thanks a lot, Elfwreck! Actually, I spent some more time learning about regular expressions (supported by most text editors for search and replace) and I ended up doing this:

Converted PDF -> HTML

Now all the unwanted mid-sentence paragraph breaks are basically those that look like *</p>, where * is some character other than a period (since a period indicates the end of a sentence and probably the end of a paragraph). I opened the HTML file in Komodo Edit, a free and powerful text/HTML editor, then used the Edit->Replace feature (Ctrl-H) and entered the following:

(Make sure the following boxes are checked: Regex, Multiline and Replace)

Enter the following in the section - Find what:
Code:
([^\.'"!?:\)])</span></p>
<p><span class=font3>

Enter the following in the section - Replace with:
Code:
\1
(Note: \1 above actually has a space after the 1)

In my particular HTML file, paragraphs end with </span></p> and then <p><span class=font3> starts the next paragraph.

What the regex above does is find only those paragraph breaks that do not have a ., !, ), ?, or : character immediately preceding them (since those would indicate a complete sentence and probably the end of a legit paragraph).

Now you can keep using the Find and Replace feature to rapidly cycle through all instances and find the faulty paragraph breaks. If a break is indeed faulty, you just hit Replace and the whole code section is replaced by a space, bridging the broken sentence together. This ended up working really well and I managed to fix a 300-page book (300 PDF pages, that is) in 10 minutes! You can go even faster if you just use the Replace All feature, although you might end up taking out a couple of legit paragraph breaks (for example, some paragraph-ending sentences might end with a comma or some other character that isn't being checked for).

Note: You can easily change the literal (non-regex) part of the code above, i.e. the tags, to match however paragraphs end and start in your particular HTML file.
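For anyone who would rather script this than click through an editor, here is a minimal Python sketch of the same find-and-replace. It is only an illustration, not something I ran on the actual book: the tag strings are the ones from my file, the names PARA_END, PARA_START and join_broken_paragraphs are made up for the example, and it assumes the two tags are separated by a single line break. Note that the character class in the pattern also excludes ' and ", just like the Komodo version.

```python
import re

# Tag strings from my HTML file; adjust them to match yours (assumption).
PARA_END = "</span></p>"
PARA_START = "<p><span class=font3>"

# One character that is NOT likely end-of-paragraph punctuation, followed by
# the end-of-paragraph tags, a line break, and the start-of-paragraph tags.
pattern = re.compile(
    r"""([^.'"!?:)])""" + re.escape(PARA_END) + r"\r?\n" + re.escape(PARA_START)
)

def join_broken_paragraphs(html: str) -> str:
    """Replace each suspicious paragraph break with a single space."""
    return pattern.sub(r"\1 ", html)

sample = (
    "<p><span class=font3>The quick brown fox</span></p>\n"
    "<p><span class=font3>jumped over the dog.</span></p>\n"
    "<p><span class=font3>A new paragraph.</span></p>\n"
)
print(join_broken_paragraphs(sample))
```

This is the equivalent of Replace All, so it has the same risk of removing a few legit paragraph breaks. Checking the result against the original is still a good idea.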

I hope this makes sense. It worked really well for me. I also find that HTML makes doing all the little tweaks really quick and painless.

Also, I'd like to mention that for some reason this regex refused to work in Notepad++ for me (which is why I moved to Komodo). If anyone can get it to work in Notepad++, do let me know.

Cheers

Last edited by orion2001; 09-26-2009 at 07:09 PM.
Old 09-26-2009, 09:50 PM   #17
orion2001
Just a further update regarding Notepad++

Turns out that it isn't capable of using regular expressions in multi-line searches (as in this case). You can only do multi-line searches in "Extended" mode, but you can't use regular expressions in that mode. I think this, coupled with the lack of integrated secure FTP in Notepad++, is going to make me move entirely to Komodo as my text editor of choice.
Old 09-26-2009, 09:58 PM   #18
orion2001
Quote:
Originally Posted by Elfwreck View Post

Then: Search for [any letter]^p (or [any letter][space]^p), replace with [find what text]qqq, then replace ^pqqq with [space].
If you don't mind, could you explain this to me? I'm not sure what the ^p and the ^pqqq refer to. I'm a bit of a formatting noob.
Old 09-27-2009, 12:33 AM   #19
Elfwreck
Grand Sorcerer
Posts: 5,187
Karma: 25133758
Join Date: Nov 2008
Location: SF Bay Area, California, USA
Device: Pocketbook Touch HD3 (Past: Kobo Mini, PEZ, PRS-505, Clié)
Quote:
Originally Posted by orion2001 View Post
If you don't mind, could you explain this to me? I'm not sure what the ^p and the ^pqqq refer to. I'm a bit of a formatting noob.
Not knowing those doesn't mean you're a formatting noob; it means you don't use Microsoft Word for formatting. Word's find-and-replace functions use ^ to indicate a non-keyboard character. So ^p is "paragraph break;" ^t is "tab;" ^$ is "any letter;" ^? is "any character;" ^b is "section break;" ^m is "manual page break." (There are more, but there's no need for anyone to learn them; they're part of Word's dropdown menus in the find-and-replace dialog box.)

I use "qqq" as a substitute sequence for multi-stage find-and-replace functions, because Word's abilities are limited. It can find "[any letter][paragraph break]" but doesn't allow "replace the paragraph part of that with a space."

It can format or replace the entire search string, or add something to the beginning or end of it. So I add qqq to the end of it, and then search for "[paragraph break]qqq" and replace *that* with a space.

I use it because qqq is exceedingly unlikely to be repeated anywhere in the body of the book, and I won't accidentally replace real text that way.
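For readers who don't use Word, the two-stage trick can be sketched in a few lines of Python. This is only an illustration of the idea, not what Word does internally: it assumes plain text where "\n" stands in for Word's paragraph break, [A-Za-z] stands in for Word's "any letter" (so accented letters are not covered), and the names SENTINEL and join_with_sentinel are made up for the example.

```python
import re

SENTINEL = "qqq"  # marker chosen because it should not occur in real text

def join_with_sentinel(text: str) -> str:
    # The marker must not already be present, or real text could be replaced.
    if SENTINEL in text:
        raise ValueError("sentinel already occurs in the text")
    # Stage 1: find [any letter][paragraph break] and append the marker to it.
    marked = re.sub(r"([A-Za-z]\n)", r"\1" + SENTINEL, text)
    # Stage 2: replace [paragraph break] + marker with a single space.
    return marked.replace("\n" + SENTINEL, " ")

sample = "The cat sat on\nthe mat.\nA new paragraph begins\nhere.\n"
print(join_with_sentinel(sample))
```

The explicit check for the marker is the scripted version of picking a sequence that is exceedingly unlikely to appear in the book.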

I am almost entirely clueless about HTML. I gather the principles are about the same as what I usually do in Word, but I'd have to learn a whole new set of keywords and search options. (Which I should do.) I have Kompozer, and occasionally have tried to work with it. It's confusing, and Word is not, because I have lots of practice with Word and none with HTML editors. (I suspect that Semagic doesn't count as an HTML editor. Most of what I know about HTML, I learned by posting at LiveJournal.)

Quote:
What the regex above does is find only those paragraph breaks that do not have a ., !, ), ?, or : character immediately preceding them (since those would indicate a complete sentence and probably the end of a legit paragraph).
I'd add mdashes to that list. And quotation marks.

Same basic principle I use, except Word doesn't have a way to "find all X that don't match Trait Y," nor a way to "find all X with trait A, or B, or C." Much less "find all X that don't match trait A, B, or C." However, it does have "find any letter" separate from "any character" or "any digit." (Does not have "any punctuation.")
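In a regex-capable tool, adding to the list should just be a matter of extending the character class. A rough Python sketch of that, with several assumptions: the tags are the ones from the earlier post, the dash is a literal em dash character rather than an HTML entity such as &mdash;, the curly closing quotes are my guess at what "quotation marks" should cover, and the names ENDERS, BREAK and join_breaks are invented for the example.

```python
import re

# Characters treated as a plausible end of paragraph. The em dash (\u2014)
# and the curly closing quotes (\u201d, \u2019) are the additions; the rest
# comes from the pattern posted earlier in the thread.
ENDERS = ".'\"!?:)" + "\u2014" + "\u201d\u2019"

# Tags from the earlier post; change them to suit your own file (assumption).
BREAK = re.escape("</span></p>") + r"\r?\n" + re.escape("<p><span class=font3>")

pattern = re.compile("([^" + re.escape(ENDERS) + "])" + BREAK)

def join_breaks(html: str) -> str:
    """Join paragraphs whose break is not preceded by an 'ender' character."""
    return pattern.sub(r"\1 ", html)

sample = (
    "<p><span class=font3>He paused\u2014</span></p>\n"
    "<p><span class=font3>Then she said</span></p>\n"
    "<p><span class=font3>nothing at all.</span></p>\n"
)
print(join_breaks(sample))
```

The break after the em dash is left alone, while the one after "said" is joined with a space.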

The biggest problem working with Word is that the HTML output is atrocious; it has to be ported into something else & converted to be useful to anything other than Frontpage websites. Word 97 had okay HTML output. But you lose a lot of features using the old versions of Word.
Old 09-27-2009, 12:41 AM   #20
orion2001
Thanks! That is very useful. I use Word, but I hate it when it comes time for rigorous formatting. I am currently in the middle of writing my doctoral thesis in Word and I am not having any fun. It works OK most of the time, but every now and then it does something silly and it is a huge pain hacking at it till I can fix it. I wish I could use LaTeX, but my advisor is an MS junkie. Anyway, I think both our approaches ended up being the same, albeit via different tools. I am now trying to learn BD as it seems like a very useful tool for creating the final ebook. Thanks again!

Cheers
Old 09-27-2009, 01:16 AM   #21
Elfwreck
Quote:
Originally Posted by orion2001 View Post
Thanks! That is very useful. I use Word, but I hate it when it comes time for rigorous formatting. I am currently in the middle of writing my doctoral thesis in Word and I am not having any fun. It works OK most of the time, but every now and then it does something silly and it is a huge pain hacking at it till I can fix it. I wish I could use LaTeX, but my advisor is an MS junkie.
You could try making it in LaTeX, outputting to PDF, and converting that to Word. The tables would probably have to be reformatted, and the actual formatting would be atrocious from a desktop publishing perspective (footnotes would be loose text at the bottom of the page, not linked to their numbers), but it'd probably *look* right.

If you need it more correctly formatted than that, you could use Open Office, which is similar to Word in structure & workflow but less full of MS's peculiar approach to some formatting concepts. (And free. And if teacher complains, tell him not everyone can afford Microsoft Office.)
Old 09-27-2009, 01:34 AM   #22
orion2001
Heh, thanks, but I don't think that will work. I'm going to have to go back and forth with my files, and we use features like tracked changes and comments to work on manuscripts. It would end up being too much of a hassle. In addition, I use EndNote for my bibliography (and so does he), which would also cause problems. Lastly, he pays for the Word and EndNote licenses, so I can't quite argue on the monetary front.