#46 | |||
Grand Sorcerer
Posts: 5,895
Karma: 464403178
Join Date: Feb 2010
Location: 33.9388° N, 117.2716° W
Device: Kindles K-2, K-KB, PW 1 & 2, Voyage, Fire 2, 5 & HD 8, Surface 3, iPad
|
Quote:
Quote:
Quote:
#47 |
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
|
I have a bunch of regexes that I have been meaning to put out, for fixing pdf -> epub conversions.
Most of them are written in Python-compatible syntax. Some may be in JGsoft syntax (I use RegexBuddy for all of my regex work - well worth it if you're fixing a lot of books!). No idea if these are of any value.

The first (long) one spots empty/whitespace-filled elements. It has a fairly full element list, so you can remove from it anything you want to preserve.
Spoiler:

Trim white space from the start and end of a few elements. This will replace every paragraph/div/etc., so be very careful, or use lookaheads.
find : <((div|p|h\d|span)[\s\w=\-"/\.;]*)>\s*(.*?)\s*</\2>
replace : <\1>\3</\2>

Find ellipses and fix them to what I like (I like unspaced points), i.e. " . . . " -> " ... " and ". .. " -> "... ". Preserves the leading/trailing space.
find : (?P<lead>\s*)(\s?\.){2,3}(?P<trail>\s?)(?![\/\\])
replace : \g<lead>...\g<trail>

Replace ' - ' and ' -- ' as well as a few other cases with an emdash, trying to preserve quote spacing (i.e. don't space it if it is before/after a quotation mark). This is written in JGsoft syntax.
find : (?<=<(?P<wat>p|div|h\d|span)[^<>]*>.*)(?P<lead>(\s*|[“”,]))(\s?\-){2,4}(?P<trail>(\s*|[“”]))(?=.*</(\k<wat>)>)
replace : \g<lead>—\g<trail>

Fix broken paragraphs that were not joined due to page breaks. Be careful; you might want to add a \d to the lowercase capture ([^A-Z]).
find : <p>([^<]*?)(?<!\.)</p>\s+<p>([^A-Z][^<]*?)</p>
replace : <p>\1 \2</p>

Find nested formatting tags, i.e. "<i><i>hmm</i></i>" -> "<i>hmm</i>"
find : <([buis])>\s*<\1>(.*?)</\1>\s*</\1>
replace : <\1>\2</\1>

Find and replace redundant formatting: "<i>de</i><i>rp</i>" -> "<i>derp</i>"
find : <([sibu])>(.*)</\1>(\s*)<\1>(.*)</\1>
replace : <\1>\2\3\4</\1>

Find spaced, grouped formatting tags (sbiu) - makes it easy to replace with css later if needed.
find : <([busi])>\s*<([busi])>\s*<(\1|\2)>(.*)</(\1|\2)>\s*</\2>\s*</\1>
replace : <\1><\2>\4</\2></\1>

Strip spans from paragraphs. Be careful if your style is using spans for formatting - replace them with the html tag version.
find : <p[\s\w=\-";]*>\s*<span[\s\w=\-";]*>(.*)</span>\s*</p>

Find missing quotation marks:
find : (?<=<(p|div|h\d|span)[^<>]*>)[^"]*"([^"\n\r]*?)(?=</\1>)

Replace straight quotation marks with fancy ones. NB: use the previous regex to find missing ones beforehand!
find : (?<=<(p|div|h\d|span)[^<>]*>.*)"(?<quote>[^"\u201C\u201D\u201E\u201F\u2033\u2036\r\n]+?)\s*?"(?=.*</\1>)
replace : “\g<quote>”

Find double quotation marks facing the wrong direction. Calibre often makes mistakes following dashes and such; best to fix these manually unless you're sure of the case.
find : (“\s*)(?P<quote>.+?)\s*?(“|”)

I have a bunch of other ones, but they are a bit messy.
Last edited by Serpentine; 10-20-2011 at 12:36 PM. Reason: now with less emotes! and a few fixes |
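For anyone without RegexBuddy, here is a minimal sketch of how three of the Python-compatible patterns above could be applied with Python's built-in `re` module. The function names are made up for illustration, and the paragraph-join pattern is written with a negative lookbehind, `(?<!\.)`, so that the check applies to the character just before `</p>`. Treat it as a starting point under those assumptions and always preview the result on a copy of your file.

```python
import re

# Ellipsis fix: two or three optionally spaced dots become "...".
ELLIPSIS = re.compile(r'(?P<lead>\s*)(\s?\.){2,3}(?P<trail>\s?)(?![\/\\])')

# Directly nested duplicate formatting tags, e.g. <i><i>hmm</i></i>.
NESTED = re.compile(r'<([buis])>\s*<\1>(.*?)</\1>\s*</\1>')

# Paragraph that does not end in a full stop, followed by a paragraph
# that does not start with a capital letter (likely a page-break split).
BROKEN_PARA = re.compile(r'<p>([^<]*?)(?<!\.)</p>\s+<p>([^A-Z][^<]*?)</p>')


def fix_ellipses(text):
    """Replace spaced dots with "...", keeping leading/trailing space."""
    return ELLIPSIS.sub(r'\g<lead>...\g<trail>', text)


def collapse_nested(text):
    """Collapse <x><x>...</x></x> into <x>...</x> for b, u, i, s."""
    return NESTED.sub(r'<\1>\2</\1>', text)


def join_broken_paragraphs(text):
    """Join two paragraphs that look like one sentence split in half."""
    return BROKEN_PARA.sub(r'<p>\1 \2</p>', text)


if __name__ == '__main__':
    sample = '<p>Well . . . the <i><i>quick</i></i></p>\n<p>brown fox.</p>'
    for step in (fix_ellipses, collapse_nested):
        sample = step(sample)
    print(sample)
```

The patterns that use variable-length lookbehinds (the emdash and quotation mark ones) will not compile in Python's `re`, so they are left out here.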
#48 | |
Treasure Seeker
Posts: 18,708
Karma: 26026435
Join Date: Mar 2010
Device: Kobo HD Glo, Kindles, Kindle Fires, Android Devices
|
Quote:
I can see how these will come in handy. I just need to figure out a program they will work on. What html editor do you use? I also have RegexBuddy but it's Greek to me. I'd love to figure out how to find missing punctuation at the end of a sentence. Those Harlequin Treasury titles I buy are full of OCR errors and stuff like that. |
|
#49 | |
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
|
Quote:
I do as much as I can to make the file super simple, stripping most stuff so that only the essentials for the desired CSS are left. From there it's just a matter of making expressions that do what you want.

RegexBuddy is great for this part. If you go to the 'test' tab, put your file in the bottom pane, then enter the regex at the top, it will highlight the matches. There's a little dropdown box at the top that lets you select your regex flavour. JGsoft is very easy and allows some nice stuff if you use lookaheads and such. Python and Perl are pretty much the same if you need to share things with friends.

Working out what things do is the tricky part, but regex is rather easy to understand - a lot of people focus on trivial stuff that wildcards can work just as well with. I find the most handy things that make a person suddenly get regex are back references and lookarounds (sometimes only referred to as lookahead/lookbehind).

If you right click in the top pane above the test text, it has a nice context menu which lets you add things that you might not know the name of. If you paste in a regex which you don't understand, you can also swap to the 'Create' tab - this explains the regex, but don't expect the explanations to be straightforward.

I'd suggest using the test area just for testing, and the 'grep' tab to apply regex to your files - handy for epubs with multiple xhtml files. It also makes it easy to preview replacements - always preview.

Back to the two things I think most people miss.

Back references are really easy. Say we want to find simple formatting tags:
<([sbui])>([^<]+)</\1>
The first part, <([sbui])>, finds a tag with a single letter from the set {s,b,u,i} in it, e.g. <s>. The (brackets) around the character class mean that the result is stored. The middle part, ([^<]+), finds characters that are not a "<", which stops us going into the next tag by mistake. The last part, </\1>, finds a closing tag with the same letter that we captured in the first group (\1), i.e. if we get a 'b' match for <b>, we can reference back and use it to find its matching </b>.

Lookarounds are something that very few short intros/tuts ever explain well. They're actually damn easy. You use them to look around the thing you want to match and decide whether it is what you want or not. If the lookaround describes something that must be there, it's positive; if it describes something that must not be there, it's negative. If the deciding factor comes before the potentially interesting stuff (i.e. it is on the left-hand side of what you want to get), it's called a lookbehind; if it comes after, it's a lookahead.

For example, if we only want to match stuff in an italic tag when it is found after a comma:
(?<=,)<i>.+?</i>
Now you're saying "But why not just use:
,<i>.+?</i>
Well, if you do that you're including the comma in your match, so if you were removing the text you found, you'd need to replace it with a comma to make sure the comma is not removed too. Lookarounds also allow you to match quite specific things. For example, if you only wanted to match the tag when there was an exclamation mark somewhere earlier in the line:
(?<=!.*)<i>.+?</i>
(this requires the JGsoft syntax, which allows repetition in lookarounds).

Those are examples of positive use, where you are looking for a specific something nearby. You could use the negative form to avoid something instead, e.g. (?<!,)<i>.+?</i> matches an italic tag only when it is not preceded by a comma.

I'd just throw a pile of random html into the test window and see what you can do.
Last edited by Serpentine; 10-19-2011 at 06:23 PM. |
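If you would rather poke at these two ideas in Python than in RegexBuddy, the sketch below runs the back reference and lookbehind patterns from this post through the built-in `re` module. The sample string and variable names are made up for illustration. Note that Python's `re` only accepts fixed-width lookbehinds, so the `(?<=!.*)` variant is rejected there.

```python
import re

# Back reference: \1 must repeat whatever letter group 1 captured,
# so <b>...</b> matches but a mismatched <u>...</i> does not.
SIMPLE_TAG = re.compile(r'<([sbui])>([^<]+)</\1>')

# Positive lookbehind: a comma must come first, but is not part of the match.
AFTER_COMMA = re.compile(r'(?<=,)<i>.+?</i>')
# Same idea without a lookbehind: the comma is consumed by the match.
WITH_COMMA = re.compile(r',<i>.+?</i>')
# Negative lookbehind: match only when no comma comes directly before.
NOT_AFTER_COMMA = re.compile(r'(?<!,)<i>.+?</i>')

html = 'one,<i>two</i> <b>three</b> <u>four</i>'

tags = SIMPLE_TAG.findall(html)                 # (letter, text) pairs
removed_lookbehind = AFTER_COMMA.sub('', html)  # comma survives
removed_plain = WITH_COMMA.sub('', html)        # comma is removed too
not_after = NOT_AFTER_COMMA.findall('a,<i>x</i> b<i>y</i>')

# Variable-length lookbehind works in JGsoft, but not in Python's re.
try:
    re.compile(r'(?<=!.*)<i>.+?</i>')
    variable_lookbehind_ok = True
except re.error:
    variable_lookbehind_ok = False

print(tags)
print(removed_lookbehind)
print(removed_plain)
print(not_after)
print(variable_lookbehind_ok)
```

Comparing `removed_lookbehind` with `removed_plain` shows the point about the comma: only the lookbehind version leaves it in place.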
|
#50 | |
Grand Sorcerer
Posts: 28,592
Karma: 204624552
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
|
|
#51 | ||
Treasure Seeker
Posts: 18,708
Karma: 26026435
Join Date: Mar 2010
Device: Kobo HD Glo, Kindles, Kindle Fires, Android Devices
|
Quote:
Quote:
|
||
#52 |
Junior Member
Posts: 6
Karma: 10
Join Date: Oct 2011
Device: Kindle 4
|
I'm delighted to see the useful information this thread is pulling out - thanks everyone
#53 |
Guru
Posts: 820
Karma: 8820388
Join Date: Dec 2008
Device: Sony PRS-505, -350; Kindle 3 3G, DX, PW 2; various tablets
|
As a totally different approach, you could always crop the PDF to fit your reader's screen, using PDF Scissors.
Some don't like/trust the web Java (the author wrote it for himself and gives it away free, so never bothered to purchase a digital signature), but I've had no problems, and know of no complaints about it. |
#54 |
Zealot
Posts: 103
Karma: 1180
Join Date: Oct 2011
Device: Acer Iconia a500, XP tablet PC
|
OK, so after reading this I get the impression that my results so far are normal and that a lot of work needs to go into fixing a PDF conversion. Now why would I want to do this? I have several PDF books that came with some printed books. I want to keep the printed books at home and use the electronic versions when I am not home. I have a tablet computer, but the resolution is such that the PDF is a little fuzzy in full-screen mode and too small when not in full screen. I have used a Kindle DX, and of course the resolution and character clarity are perfect, so I might look into getting one. While I am thinking about that, reading an ebook version with reflowable text would certainly make the reading easier. I tried conversions with Calibre and MobiPocket; both produce the typical extra junk text and lack the TOC links, as discussed in this (and probably too many other) threads. I've noted that several of you go through and massage the text with a text editor, but that seems like more work than I am able to do.
What I was wondering is whether anyone has seen a WYSIWYG ebook layout editor, which would certainly make the clean-up a little easier for the common human. I assume the answer is that such a tool does not exist, but I thought it was worth asking. |
#55 | |
Treasure Seeker
Posts: 18,708
Karma: 26026435
Join Date: Mar 2010
Device: Kobo HD Glo, Kindles, Kindle Fires, Android Devices
|
Quote:
|
|
#56 |
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
|
I'd recommend Sigil. So long as you keep things simple, the epubs will convert very well to mobi.
The process really isn't all that time consuming after you've done it once - and for the most part it isn't necessary at all. You just need to get the PDF converted to html, then add it to a blank book, and you'll be on your way. If the converted html has preserved the italics, bolding and important indents (e.g. written letters), you should be able to ignore pretty much everything else and just mark up the chapters (i.e. make them headings) and generate the ToC. Calibre can throw in the metadata, cover etc. if you do an epub->epub conversion afterwards, if you're lazy. |
#57 |
Zealot
Posts: 103
Karma: 1180
Join Date: Oct 2011
Device: Acer Iconia a500, XP tablet PC
|
I tried Calibre on one PDF and it got lots of the parts out of order, so I would need to do a lot of fooling with it to put it back into the correct order. I then tried MobiPocket again; the text is all there, but most of the images are missing. These are technical books, so the images are often important to the content... Seems like there will be a lot of manual clean-up with these documents. I'll have to write to the publisher and see if they have an epub or mobi version that I can download. They are pretty good at offering updates and errata since there are certifications on the line.
|
#58 |
Zealot
Posts: 103
Karma: 1180
Join Date: Oct 2011
Device: Acer Iconia a500, XP tablet PC
|
Looking at the version created by Mobipocket, it mostly works, and I think for now I'll live with the missing images since it is far easier to read than the PDF on my tablet display. The PDF is about 12 cpi (full screen) and the Mobipocket reader is able to push that to 24, where it looks good with clean edges. I wish I could afford a higher resolution device, but for now this will need to work. I'm keeping an eye on eBay for a used (and not hacked) DX, where I should be able to shrink the font back down to 12 or 14 cpi to get more reading between page turns.
I do have a different epub by the same company (Cisco Press). I converted it to a Mobi-friendly format and it is really nice to read on this tablet... Kind of wish they had done epub instead of PDF for the other books I bought. |
#59 |
Addict
Posts: 230
Karma: 13495
Join Date: Feb 2009
Location: SoCal
Device: Kindle 3, Kindle PW, Pocketbook 301+, Pocketbook Touch, Sony 950, 350
|
There is an alternative approach to pdf viewing on e-readers: convert the pdf to a set of half-page, landscape-oriented images and join the images into a single pdf (or just view the sequence of png images on your reader).
This approach has its own pros and cons, but at least it gives you reasonably comfortable reading. Tool: Google for pdflrfwin-0.99. This software can convert a pdf to a set of images pooled into a single zip file (just manually change the output file's lrf extension to zip). Many free programs can join png images into a pdf (like this one). Last edited by EbokJunkie; 10-25-2011 at 07:55 PM. |
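To make the half-page idea concrete, here is a small sketch of the arithmetic only. It assumes each pdf page has already been rendered to an image of known pixel size, and it computes the two crop rectangles (top half and bottom half, with an optional overlap so no line of text is cut at the seam). The function name, the overlap parameter and the 1200x1600 page size are illustrative assumptions; this is not a description of how pdflrfwin works internally.

```python
def half_page_boxes(width, height, overlap_px=0):
    """Return crop boxes (left, top, right, bottom) for the top and
    bottom halves of a page image, each extended by overlap_px."""
    middle = height // 2
    # Top half runs from the top edge to slightly past the middle.
    top_half = (0, 0, width, min(height, middle + overlap_px))
    # Bottom half starts slightly before the middle and runs to the end.
    bottom_half = (0, max(0, middle - overlap_px), width, height)
    return [top_half, bottom_half]


# Hypothetical page rendered at 1200x1600 pixels, with a 40 pixel overlap.
boxes = half_page_boxes(1200, 1600, overlap_px=40)
for box in boxes:
    print(box)
```

Each box is wider than it is tall, so it fills the screen when the reader is held in landscape orientation. The boxes are in the (left, top, right, bottom) order that most image libraries expect for cropping.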