PDF to Kindle: The unobtainable Holy Grail of ebooks - Page 4

alansplace · 10-19-2011, 04:32 PM

Quote:

Originally Posted by Blossom

They are really more Word 2003 wildcards, but basically these are my reference notes. I hope you can make heads or tails of them.

Code:

Do a S&R for Manual line breaks and replace with paragraph marks.

MS Word it uses ^13 for a return, with wildcard box checked in the Search Box

^13([a-z]) = This checks for broken sentences

([a-zA-Z])^13 = This checks for broken sentences

([a-z])^13([A-Z]) = This checks for broken sentences

Replace Box
\1 and \2 if there is more than one bracket; add appropriate spaces as needed.

[0-9]{1,}^13 = This checks for page numbers 
[0-9]{1,} = Second check for page numbers and OCR error where numbers replace letters. 

[A-Z]{3,} = Match Case checked. Change the 3 if needed to match more words.
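For anyone working outside Word, here is a rough Python sketch of two of the checks above on plain text. It is only an approximation: it assumes the text uses \n line endings, and the function names join_broken_lines and find_page_number_lines are made up for this example.

```python
import re
from typing import List


def join_broken_lines(text: str) -> str:
    """Join a line break that is followed by a lowercase letter."""
    # Rough analogue of Word's ^13([a-z]) with " \1" in the Replace box.
    return re.sub(r"\n([a-z])", r" \1", text)


def find_page_number_lines(text: str) -> List[str]:
    """Return lines made up only of digits (likely page numbers)."""
    # Rough analogue of Word's [0-9]{1,}^13, restricted to whole lines.
    return re.findall(r"^[0-9]+$", text, flags=re.MULTILINE)


sample = "She opened the\ndoor slowly.\n42\nThe hall was empty."
print(join_broken_lines(sample))
print(find_page_number_lines(sample))
```

As with the Word version, it is probably safer to review the hits than to trust the replacement blindly.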

For chapter headers I use S&R. If they are already in bold it's easier: I search for bold text using the formatting button. Word has a powerful search! You can search by formatting, wildcards, special Word characters, or just the regular way. I can then do a replace on the formatting only.

I also use the Styles panel to make batch changes. A lot of the back titles I buy have inconsistent formatting, and this feature comes in handy for fixing that quickly. Highlighting a chapter heading, clicking Clear Formatting, and then clicking the appropriate style will give it the formatting you want.

I also use macros to make it a lot faster!

Quote:

Originally Posted by DiapDealer

For broken sentences in HTML, I use the following search regex:

Code:

([^.”":?’'!>—…)])</p>\s+<p[^>]*>

And the replace would be:

Code:

\1

(NOTE: there needs to be a "space" character following the \1 for it to work properly)

I don't trust it enough to blindly do a "Replace All" on a whole book, but I rarely have to intervene when stepping through a document one match at a time.
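A minimal Python sketch of the same search and replace, for those without a regex-capable editor. The pattern and the "\1 " replacement are the ones quoted above; the function names and the sample markup are assumptions made up for illustration. In the spirit of not trusting a blind Replace All, list_candidates only reports the matches so each one can be reviewed first.

```python
import re

# Pattern from the post: a paragraph that does not end in sentence-ending
# punctuation, followed by the start of the next paragraph.
BROKEN_PARA = re.compile(r"""([^.”":?’'!>—…)])</p>\s+<p[^>]*>""")


def list_candidates(html):
    """Return (offset, matched text) pairs for manual review."""
    return [(m.start(), m.group(0)) for m in BROKEN_PARA.finditer(html)]


def join_broken_paragraphs(html):
    """Apply the replacement: the captured character plus one space."""
    return BROKEN_PARA.sub(r"\1 ", html)


sample = '<p>He walked to the</p>\n<p class="calibre1">store and left.</p>\n<p>Next one.</p>'
for offset, text in list_candidates(sample):
    print(offset, repr(text))
print(join_broken_paragraphs(sample))
```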

Quote:

Originally Posted by Blossom

I will have to try this when working with code.

What program does this work with? I've tried Notepad++ and Notepad2 and it can't find anything.

cool, thanks

Serpentine · 10-19-2011, 06:09 PM

I have a bunch of regex that I have been meaning to put out, for fixing pdf -> epubs.
Most of them are written in python compatible syntax. Some of it may be in JGsoft syntax (I use RegexBuddy for all of my regex work - well worth it if you're fixing a lot of books!). No idea if these are of any value.

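The first (long) one spots empty or whitespace-filled elements. The element list is fairly complete, so remove from it anything you want to preserve.

Spoiler:

Code:

<(h\d|[uod]l|a|hr|abbr|acronym|address|applet|area|b|base|basefont|bdo|big|blockquote|body|button|caption|center|cite|code|col|colgroup|dd|del|dfn|dir|div|dt|em|fieldset|font|form|frame|frameset|head|hr|html|i|iframe|ins|kbd|label|legend|li|link|map|menu|meta|noframes|noscript|object|optgroup|option|p|param|pre|q|s|samp|script|select|small|span|strike|strong|style|sub|sup|table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|u|var)[\s\w=\-"/\.;_]*>([\s\r\n]*| )</\1>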

Trim off white space from the start of a few elements - this will replace every paragraph/div/etc - so be very careful; or use lookaheads...
find : <((div|p|h\d|span)[\s\w=\-"/\.;]*)>\s*(.*?)\s*</\2>
replace : <\1>\3</\2>
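A small Python sketch of the trim, assuming Python's built-in re module. The function name is made up for the example. Note that the closing tag is rebuilt from group 2 (the bare element name) rather than group 1, since group 1 also carries any attributes and would otherwise copy them into the closing tag.

```python
import re

# Group 1: element name plus attributes; group 2: element name only;
# group 3: the content without surrounding whitespace.
TRIM = re.compile(r'<((div|p|h\d|span)[\s\w=\-"/\.;]*)>\s*(.*?)\s*</\2>')


def trim_element_whitespace(html):
    """Strip leading/trailing whitespace inside div, p, hN and span."""
    return TRIM.sub(r"<\1>\3</\2>", html)


print(trim_element_whitespace('<p class="x">  hello  </p>'))
print(trim_element_whitespace("<h2> Title</h2>"))
```

Because "." does not match a newline by default, elements whose content spans several lines are left alone by this sketch.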

Find and fix ellipses with what I like (I like unspaced points), i.e. " . . . " -> " ... " and ". .. " -> "... ". Preserves the leading/trailing space.
find : (?P<lead>\s*)(\s?\.){2,3}(?P<trail>\s?)(?![\/\\])
replace : \g<lead>...\g<trail>
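Since this one is in Python-compatible syntax, it can be tried as-is with the re module. A minimal sketch (the function name and sample strings are only for illustration):

```python
import re

# Two or three dots, each optionally preceded by a space, not followed
# by a slash or backslash (to leave things like "../" alone).
ELLIPSIS = re.compile(r"(?P<lead>\s*)(\s?\.){2,3}(?P<trail>\s?)(?![\/\\])")


def fix_ellipses(text):
    """Collapse spaced dots into '...' and keep the surrounding space."""
    return ELLIPSIS.sub(r"\g<lead>...\g<trail>", text)


print(fix_ellipses("Wait . . . what"))
print(fix_ellipses("End. .. Next"))
```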

Replace ' - ' and ' -- ' as well as a few other cases with emdash - trying to preserve quote spacing (i.e don't space it if before/after a quotation mark)
This is written in JGsoft syntax.
find : (?<=<(?P<wat>p|div|h\d|span)[^<>]*>.*)(?P<lead>(\s*|[“”,]))(\s?\-){2,4}(?P<trail>(\s*|[“”]))(?=.*</(\k<wat>)>)
replace : \g<lead>—\g<trail>

Fix broken paragraphs that were not joined due to page breaks. Be careful; you might want to add a \d to the lowercase capture ([^A-Z]).
find : <p>([^<]*?)(?!\.)</p>\s+<p>([^A-Z][^<]*?)</p>
replace : <p>\1 \2</p>

Find nested formatting tags, i.e. "<i><i>hmm</i></i>" -> "<i>hmm</i>"
find : <([buis])>\s*<\1>(.*?)</\1>\s*</\1>
replace: <\1>\2</\1>

Find and replace redundant formatting : "<i>de</i><i>rp</i>" -> "<i>derp</i>"
find : <([sibu])>(.*)</\1>(\s*)<\1>(.*)</\1>
replace : <\1>\2\3\4</\1>
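Both of these tag clean-ups work unchanged in Python's re module. A short sketch, with function names invented for the example:

```python
import re

# The same tag opened twice in a row, e.g. <i><i>...</i></i>.
NESTED = re.compile(r"<([buis])>\s*<\1>(.*?)</\1>\s*</\1>")
# Two adjacent runs of the same tag, e.g. <i>..</i><i>..</i>.
REDUNDANT = re.compile(r"<([sibu])>(.*)</\1>(\s*)<\1>(.*)</\1>")


def collapse_nested(html):
    """Reduce a doubled formatting tag to a single one."""
    return NESTED.sub(r"<\1>\2</\1>", html)


def merge_adjacent(html):
    """Merge two neighbouring runs of the same tag, keeping the gap."""
    return REDUNDANT.sub(r"<\1>\2\3\4</\1>", html)


print(collapse_nested("<i><i>hmm</i></i>"))
print(merge_adjacent("<i>de</i><i>rp</i>"))
```

The second pattern uses greedy ".*", so on long lines with many tags it is worth previewing each replacement.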

Find spaced, grouped formatting tags (sbiu) - makes it easy to replace with css later if needed.
find : <([busi])>\s*<([busi])>\s*<(\1|\2)>(.*)</(\1|\2)>\s*</\2>\s*</\1>
replace : <\1><\2>\4</\2></\1>

Stripping spans from paragraphs - be careful if your style is using spans for formatting - replace them with the html tag version.
<p[\s\w=\-";]*>\s*<span[\s\w=\-";]*>(.*)</span>\s*</p>

Find missing quotation marks :
(?<=<(p|div|h\d|span)[^<>]*>)[^"]*"([^"\n\r]*?)(?=</\1>)

Replace straight quotation marks with fancy ones - NB: use the previous regex to find missing ones beforehand! :
find : (?<=<(p|div|h\d|span)[^<>]*>.*)"(?<quote>[^"\u201C\u201D\u201E\u201F\u2033\u2036\r\n]+?)\s*?"(?=.*</\1>)
replace : “\g<quote>”
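The find above relies on a lookbehind containing ".*", which Python's built-in re module does not accept. As a hedged alternative, the sketch below keeps only the quote-pairing part and drops the check that the quote sits inside a p/div/h/span element; the function name is made up, and this is a simplification rather than the same regex.

```python
import re

# A straight quote, the quoted text (no other quote marks or line
# breaks), optional trailing whitespace, and a closing straight quote.
STRAIGHT_PAIR = re.compile(
    r'"([^"\u201C\u201D\u201E\u201F\u2033\u2036\r\n]+?)\s*?"'
)


def curl_quotes(text):
    """Replace a matched pair of straight quotes with curly ones."""
    return STRAIGHT_PAIR.sub("\u201C\\1\u201D", text)


print(curl_quotes('He said "hello" loudly.'))
```

An unpaired quote is left untouched, which is why finding the missing ones first still matters.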

Find incorrect direction double quotation marks - often Calibre makes mistakes following dashes and such - best to fix manually unless you're sure of the case:
find : (“\s*)(?P<quote>.+?)\s*?(“|”)

Have a bunch of other ones, but they are a bit messy.

Blossom · 10-19-2011, 06:24 PM

Quote:

Originally Posted by Serpentine

I have a bunch of regex that I have been meaning to put out, for fixing pdf -> epubs.
Most of them are written in python compatible syntax. Some of it may be in JGsoft syntax (I use RegexBuddy for all of my regex work - well worth it if you're fixing a lot of books!). No idea if these are of any value.

You might want to use the Code tag to keep it formatted properly. Your first regex has an emoticon in it.

I can see how these will come in handy. I just need to figure out a program they will work in. What HTML editor do you use? I also have RegexBuddy but it's Greek to me.

I'd love to figure out how to find missing punctuation at the end of a sentence. Those Harlequin Treasury titles I buy are full of OCR errors and stuff like that.

Serpentine · 10-19-2011, 07:19 PM

Quote:

Originally Posted by Blossom

I can see how these will come in handy. I just need to figure out a program they will work in. What HTML editor do you use? I also have RegexBuddy but it's Greek to me.

I just use Sigil and RegexBuddy - from time to time Notepad++, but it's not all that useful for regex as it's got a very limited and strange engine for that.

I do as much as I can to make the file super simple, stripping most stuff so that only the essentials for the desired CSS are left. From there it's just a matter of making expressions that do what you want. RegexBuddy is great for this part.

If you go to the 'Test' tab, put your file in the bottom pane, then enter the regex at the top, it will highlight the matches. There's a little dropdown box at the top that lets you select your regex flavour; JGsoft is very easy and allows some nice stuff if you use lookaheads and such. Python and Perl are pretty much the same if you need to share things with friends.

Working out what things do is the tricky part, but regex is rather easy to understand - a lot of people focus on trivial stuff that wildcards can handle just as well. I find the handiest things, the ones that make a person suddenly get regex, are back references and lookarounds (sometimes only referred to as lookahead/lookbehind).

If you right click in the top pane above the test text, it has a nice context menu which lets you add things that you might not know the name of. If you paste in regex which you don't understand, you can also swap to the 'Create' tab - this explains the regex, but don't expect the explanations to be straightforward.

I'd suggest just using the test area for testing, use the 'grep' tab to apply regex to your files - handy for epubs with multiple xhtml files, also makes it easy to preview replacements - always preview.

Back to two things I think most people miss:
Back references are really easy - say we want to find simple formatting tags:
<([sbui])>([^<]+)</\1>
Orange finds a tag with a single letter from the set {s,b,u,i} in it eg <s>. The (brackets) around the character catch mean that the result is stored.
Red finds characters as they appear that are not a "<" - avoiding us going into the next tag by mistake.
Blue finds us a tag with the same letter as the group we got in the first (\1) group. i.e if we get a 'b' match for , we can reference back and then use it to find its' matching .
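To see the back reference at work, here is a tiny Python sketch using the same pattern. The sample string is invented; the last tag pair in it is deliberately mismatched to show that \1 insists on the same letter.

```python
import re

# \1 must repeat whatever letter the first group captured.
TAG = re.compile(r"<([sbui])>([^<]+)</\1>")

sample = "<b>bold</b> and <i>italic</i> but not <b>mismatched</i>"

# findall returns one (tag letter, inner text) tuple per match.
matches = TAG.findall(sample)
print(matches)
```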

Lookarounds are something that very few short intros/tuts ever explain well. They're actually damn easy. They're used to 'look ahead', i.e. you use them to decide if something is what you want or not. If it's something that signifies you want to look closer, it's positive; if it's something that says you don't want to look at it, it's negative. If it comes before the potentially interesting stuff (i.e. the deciding factor is on the left hand side of what you want to potentially get), it's called a lookbehind.

For example, if we only want to match stuff in an italic tag if it is found after a comma: (?<=,)<i>.+?</i>
Now you're saying "But why not just use : ,<i>.+?</i>"
Well, if you do that you're including the comma in your match, so if you were removing the text you found, you'd need to replace it with a comma to make sure the comma isn't removed. Lookarounds also allow you to match quite specific things. For example, if you only wanted to match it if there was an exclamation mark somewhere previously in the line : (?<=!.*)<i>.+?</i> // this requires the JGsoft syntax, which allows repetition in lookarounds.
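A quick Python sketch of the difference, assuming the italic-tag form of the pattern, (?<=,)<i>.+?</i>. The sample text and the helper name supports_pattern are made up for the example. It also shows that Python's built-in re module rejects the variant with repetition inside the lookbehind, which is why that one needs a flavour like JGsoft.

```python
import re

sample = "Yes,<i>really</i> and <i>no</i>"

# Lookbehind: the comma must be there, but it is not part of the match.
kept_comma = re.sub(r"(?<=,)<i>.+?</i>", "", sample)

# Plain comma in the pattern: the comma is matched and removed as well.
lost_comma = re.sub(r",<i>.+?</i>", "", sample)

print(kept_comma)
print(lost_comma)


def supports_pattern(pattern):
    """Report whether Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Repetition inside a lookbehind is not fixed-width, so re refuses it.
print(supports_pattern(r"(?<=!.*)<i>.+?</i>"))
```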

Those are examples of positive use, where you are looking for a specific something nearby. To avoid something instead, you'd use a negative one.

I'd just throw a pile of random html into the test window and see what you can do.

DiapDealer · 10-19-2011, 07:52 PM

Quote:

Originally Posted by Blossom

I just need to figure out a program they will work in. What HTML editor do you use? I also have RegexBuddy but it's Greek to me.

EditPad Lite v7 (free for personal use) now uses the exact same Regex Search and Replace engine (JGSoft) that its big brother, EditPad Pro, does.

Blossom · 10-19-2011, 08:39 PM

Quote:

Originally Posted by Serpentine

I just use Sigil and RegexBuddy - from time to time Notepad++, but it's not all that useful for regex as it's got a very limited and strange engine for that.

I do as much as I can to make the file super simple, stripping most stuff so that only the essentials for the desired CSS are left. From there it's just a matter of making expressions that do what you want. RegexBuddy is great for this part.

I may give it another shot when I have some free time.

Quote:

Originally Posted by DiapDealer

EditPad Lite v7 (free for personal use) now uses the exact same Regex Search and Replace engine (JGSoft) that its big brother, EditPad Pro, does.

I will give it a try. So far I have Notepad2, Notepad++, PSPad and UltraEdit; what's one more?

tentimes · 10-21-2011, 06:28 AM

I'm delighted to see the useful information this thread is pulling out - thanks, everyone.

jj2me · 10-21-2011, 04:28 PM

As a totally different approach, you could always crop the PDF to fit your reader's screen, using PDF Scissors.

Some don't like/trust the web Java (the author wrote it for himself and gives it away free, so never bothered to purchase a digital signature), but I've had no problems, and know of no complaints about it.

Greg_E · 10-23-2011, 10:40 PM

Ok, so after reading this I get the impression that my results so far are normal and a lot of work needs to be put in to fixing a PDF conversion. Now why would I want to do this? I have several PDF books that came with some printed books, I want to keep the printed books at home and use the electronic version when I am not home. I have a tablet computer but the resolution is such that the PDF is a little fuzzy when in full screen mode, and too small when not in full screen. I have used a Kindle DX and of course the resolution and character clarity are perfect so I might look into getting one. While I am doing that thinking, reading an ebook version with flow-able text would certainly make the reading easier. I tried conversions with Calibre and MobiPocket, both have the typical extra junk text and lack of TOC links that have been discussed in this (and probably too many other) threads. I've noted that several of you go through and massage the text with a text editor, but that seems like more work than I am able to do.

What I was wondering is has anyone seen a WYSIWYG ebook layout editor which would certainly make the clean up a little easier for the common human?

I assume the answer is that such a tool does not exist, but thought it was worth asking.

Blossom · 10-23-2011, 11:33 PM

Quote:

Originally Posted by Greg_E

Ok, so after reading this I get the impression that my results so far are normal and a lot of work needs to be put in to fixing a PDF conversion. Now why would I want to do this? I have several PDF books that came with some printed books, I want to keep the printed books at home and use the electronic version when I am not home. I have a tablet computer but the resolution is such that the PDF is a little fuzzy when in full screen mode, and too small when not in full screen. I have used a Kindle DX and of course the resolution and character clarity are perfect so I might look into getting one. While I am doing that thinking, reading an ebook version with flow-able text would certainly make the reading easier. I tried conversions with Calibre and MobiPocket, both have the typical extra junk text and lack of TOC links that have been discussed in this (and probably too many other) threads. I've noted that several of you go through and massage the text with a text editor, but that seems like more work than I am able to do.

What I was wondering is has anyone seen a WYSIWYG ebook layout editor which would certainly make the clean up a little easier for the common human?

I assume the answer is that such a tool does not exist, but thought it was worth asking.

I use Word 2003. It's a WYSIWYG editor. There is also Sigil which allows you to see code and how it will look.

Serpentine · 10-24-2011, 12:12 AM

I'd recommend Sigil; so long as you keep things simple, the epubs will convert very well to mobi.

The process really isn't all that time consuming after you've done it once - and for the most part isn't necessary at all. You just need to get the PDF converted to HTML, then add it to a blank book and you'll be on your way. If the converted HTML has preserved the italics, bolding and important indents (e.g. written letters), you should be able to ignore pretty much everything else and just mark up the chapters (i.e., make them headings) and generate the ToC. Calibre can throw in the metadata, cover etc. if you do an epub->epub conversion afterwards, if you're lazy.

Greg_E · 10-24-2011, 12:30 AM

I tried Calibre on one PDF and it got lots of the parts out of order, so I need to do a lot of fooling with it to put it back into the correct order. Then I tried MobiPocket again and the text is all there, but most of the images are missing. These are technical books so the images are often important to the content... Seems like there will be a lot of manual clean up with these documents. I'll have to write the publisher and see if they have an epub or mobi version that I can download. They are pretty good at offering updates and errata since there are certifications on the line.

Greg_E · 10-24-2011, 11:21 AM

Looking at the version created from Mobipocket it mostly works and I think for now I'll deal with the missing images since it is far easier to read than the PDF on my tablet display. The PDF is about 12 cpi (full screen) and the Mobipocket reader is able to push that to 24 where it looks good with clean edges, wish I could afford a higher resolution device, but for now this will need to work. Keeping an eye on ebay for a used (and not hacked) DX where I should be able to shrink the font back down to 12 or 14 cpi to get more reading between page turns.

I do have a different epub by the same company (Cisco Press) and converted it to a Mobi friendly format and it is really nice to read on this tablet... Kind of wish they had done epub instead of PDF on the other books I bought.

EbokJunkie · 10-25-2011, 08:51 PM

There is an alternative approach to PDF viewing on e-readers: convert the PDF to a set of half-page, landscape-oriented images and join the images into a single PDF (or just view the sequence of PNG images on your reader).
This approach has its own pros and cons, but at least it gives you reasonably comfortable reading.

Tool:
Google for pdflrfwin-0.99. This software can convert a PDF to a set of images pooled into a single zip file (just manually change the lrf extension of the output file to zip). Many free programs can join PNG images into a PDF.

