12-04-2017, 01:28 PM | #1 |
Enthusiast
Posts: 27
Karma: 7526
Join Date: Nov 2012
Location: Bristol, UK
Device: Nexus 10 (2 of them), Nexus 5, iPhone, iPad, Kindle, iMac
|
Converting PDF/HTML to ereader formats
I have downloaded a free PDF book and want to convert it so I can read it on my Kindle or my EPUB readers. I used Calibre to convert it to AZW3 and HTMLZ. Both display similarly: the PDF's hard-coded line lengths, page headers, page numbers and so on are carried over, which gives clumsy line breaks on small 'pages', paragraph spacing rather than line spacing between lines, and chapter titles etc. that don't suit the page size.
I tried one or two online services but, while they got the ToC lines down to one line each with fewer dots between the chapter title and the page number, they choked when faced with the many diacritics in the text! Calibre handled the diacritics without fault. I would rather not spend days removing '</p>'s and repeated page header and footer texts! Is there a program or a script that could speed this process up or even do it all for me? |
12-04-2017, 01:44 PM | #2 |
A Hairy Wizard
Posts: 3,093
Karma: 18727053
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
|
That is the million-dollar question... PDF is probably the worst possible source document to convert from. In my opinion, it is a far better use of your time to search for a different source format than to try to convert a PDF. Sorry.
|
12-04-2017, 04:42 PM | #3 |
null operator (he/him)
Posts: 20,558
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
If it's a multi-column PDF you might be able to make use of ==>> k2pdfopt: optimizes PDFs for viewing on e-readers
Or, if you can wrangle the PDF into Word (maybe via this ==>> Edit PDF content in Word), then you could use Toxaris' add-in ==>> e-Book Tools - a Word add-in to create an ePUB that you can finalise in an ebook editor such as Sigil or Calibre. Otherwise - what Turtle91 said. BR Last edited by BetterRed; 12-04-2017 at 04:44 PM. |
12-04-2017, 06:16 PM | #4 |
Resident Curmudgeon
Posts: 73,897
Karma: 128597114
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Forget PDF exists. Problem with converting PDF solved.
|
12-05-2017, 10:18 AM | #5 | |
Bookmaker & Cat Slave
Posts: 11,460
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
Sorry Chris: my business does this every single day, and there's no magic, fast way. The only way to get there from here is to a) use Abbyy FineReader, which at least will remove the running heads/footers, and b) do the rest by hand. I wish there were a faster/better way, but there isn't. Having said that, yes, you can write regex to clean up some portion of the line-ending </p>'s, and all that, but... it's still all human labor, eyes and hands. Hitch |
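For what it's worth, the regex part can be sketched roughly like this in Python. It is only a minimal sketch, assuming the conversion left one <p> per PDF line; the function name merge_broken_paragraphs and the punctuation rule are illustrative assumptions, not anything Abbyy or Calibre provides. It joins two paragraphs when the first one does not end in sentence punctuation.

```python
import re

# A paragraph break is treated as spurious when the character right before
# </p> is not sentence-ending punctuation or a closing double quote.
# (Assumption: real paragraphs end with one of . ! ? : ; " or a curly quote.)
SPURIOUS_BREAK = re.compile(r'([^\s.!?:;"”])\s*</p>\s*<p\b[^>]*>\s*')


def merge_broken_paragraphs(html: str) -> str:
    """Join <p> elements that were split at a PDF line ending."""
    # Keep the last character of the first paragraph and replace the
    # closing/opening tag pair with a single space.
    return SPURIOUS_BREAK.sub(r"\1 ", html)


sample = (
    "<p>It was a dark and</p>\n"
    '<p class="calibre1">stormy night.</p>\n'
    "<p>The end.</p>"
)
print(merge_broken_paragraphs(sample))
```

It will get some cases wrong (headings, verse, lines that end in a hyphen), which is exactly why the result still needs eyes and hands afterwards.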
|
12-07-2017, 04:34 AM | #6 |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
Yes, as Hitch says, a decent OCR program such as Abbyy, followed by hand clean-up of the resulting code, really is as good as it gets. PDF is a "terminal format": that is, you can convert to it easily enough, but you can't convert from it.
|
12-07-2017, 10:18 AM | #7 |
Whatever...
Posts: 197
Karma: 1114225
Join Date: Feb 2015
Location: Austria
Device: PocketBook InkPad 840, Touch HD 2
|
Instead of Abbyy, which costs money, you can try pdftotext, which is open source - it's a little Windows command line utility, which you can download from www.xpdfreader.com (it's included in the Xpdf tools). With the -layout option, it gives you a plain text file with line breaks that correspond to the lines in the PDF (doesn't work with all PDF files, though).
You'll still have a lot of editing to do, but with a decent editor and a little knowledge of regular expressions and a tool to convert plain text to epub you should be able to get a reasonably readable epub version in an hour or so. (I have a tool that helps with the process, but like pdftotext it's Windows command line stuff, so I don't know if you're interested.) |
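To give an idea of the kind of clean-up meant here, below is a rough Python sketch (this is not the tool mentioned above). It assumes the text came from something like `pdftotext -layout book.pdf book.txt` (book.pdf is just a placeholder name), that pages are separated by form feed characters, which is what pdftotext normally writes, and that paragraphs are separated by blank lines, which is not true of every PDF. The function names strip_running_lines and unwrap and the 50% threshold are made up for the example.

```python
import re
from collections import Counter


def _key(line: str) -> str:
    # Replace digits so "My Book 12" and "My Book 13" count as the same line.
    return re.sub(r"\d+", "#", line.strip())


def strip_running_lines(text: str, min_share: float = 0.5) -> str:
    """Drop first/last page lines that repeat on at least min_share of pages."""
    pages = [p.splitlines() for p in text.split("\f") if p.strip()]
    counts = Counter()
    for lines in pages:
        nonblank = [ln for ln in lines if ln.strip()]
        if nonblank:
            counts.update({_key(nonblank[0]), _key(nonblank[-1])})
    threshold = max(2, min_share * len(pages))
    repeated = {k for k, n in counts.items() if n >= threshold}
    cleaned = []
    for lines in pages:
        idx = [i for i, ln in enumerate(lines) if ln.strip()]
        drop = {i for i in idx[:1] + idx[-1:] if _key(lines[i]) in repeated}
        cleaned.append("\n".join(ln for i, ln in enumerate(lines) if i not in drop))
    return "\n".join(cleaned)


def unwrap(text: str) -> str:
    """Join hard-wrapped lines into one line per paragraph."""
    out = []
    for para in re.split(r"\n\s*\n", text):
        lines = [ln.strip() for ln in para.splitlines() if ln.strip()]
        if not lines:
            continue
        joined = lines[0]
        for ln in lines[1:]:
            # Heuristic: a trailing hyphen before a lowercase letter is
            # treated as a word split across two lines.
            if joined.endswith("-") and ln[:1].islower():
                joined = joined[:-1] + ln
            else:
                joined += " " + ln
        out.append(joined)
    return "\n\n".join(out)


sample = (
    "My Book 1\n\nThe quick brown\nfox jumps over the la-\nzy dog.\n\nNext para.\n"
    "\f My Book 2\n\nAnother line\nhere.\n"
)
print(unwrap(strip_running_lines(sample)))
```

Paragraphs that run across a page break still end up split in two, and the hyphen rule will wrongly join some genuinely hyphenated words, so a pass with an editor is still needed afterwards.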
12-07-2017, 01:04 PM | #8 | |
Bookmaker & Cat Slave
Posts: 11,460
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
Doesn't that give you, as you said, a plain text file? So that all formatting is lost? For some folks, that would be a crapload more work. Abbyy, if I am not mistaken, has a website where you can convert a single document for free. There are limits on it, and all that, but you might try searching for that. Hitch |
|
12-07-2017, 02:16 PM | #9 |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
You can also buy older releases of Abbyy at much lower prices than the current version, and they are more than adequate for the job - you just lose the spelling dictionary for Outer Mongolia (or whatever).
|
12-09-2017, 04:56 AM | #10 | |
Whatever...
Posts: 197
Karma: 1114225
Join Date: Feb 2015
Location: Austria
Device: PocketBook InkPad 840, Touch HD 2
|
Quote:
Unless I've missed something, though, even with good OCR software it's not trivial to retain italics while getting rid of all the excess formatting that the OCR'd output file is burdened with (even Abbyy is wont to hallucinate dozens of different page and paragraph formats, font sizes, etc.). I do have a fairly recent version of Abbyy but, for instance, when I had to convert a series of 26 PDF books for a friend (being blind, she couldn't read the PDFs), I got it done much more quickly with the pdftotext tool. There were no italics in those PDFs, which we had bought but which seem to have been produced from carelessly OCR'd print editions. I have no idea whether the paper originals had them, but I've often come across public-domain books on the Internet without italics when the originals have them. (Which is not meant to say it's OK to lose them.) |
|
12-09-2017, 12:53 PM | #11 | |
Bookmaker & Cat Slave
Posts: 11,460
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
I'm sure that everyone has their own preferred way of working. I have a ton of expertise in Word, so frankly, it's super-easy for me to do the cleanup on Abbyy output into a Word file, or, of course, regex it to the nth, in HTML. If there ARE italics and bold, in large numbers, I'll use Toxaris' superb "ePUB Tools" Word plug-in, first--as that makes marking/retaining both of those character markups simplicity itself--and I'll "clean" the styles from the rest, and then restyle them. I find that the fastest route. Offered solely FWIW. Hitch |
|
12-11-2017, 06:46 PM | #12 | |||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Quote:
I have 12 regex to help clean up after (5 to handle Italics, Bold, BoldItalics, Smallcaps, BoldSmallcaps + 7 to just clean up some <td> and some other anomalies).
Example:
Search: <span style="font-style:italic;">
Replace: <span class="italics">
Quote:
I've written about this in-depth before: https://www.mobileread.com/forums/sh...72#post2883972
Most of your time is going to be spent editing and correcting the text/formatting, so the better you can get the input, the easier/faster your life will be on those later steps.
Last edited by Tex2002ans; 12-11-2017 at 06:50 PM. |
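The search/replace example in this post can also be run as a small Python script over the HTML. This is only a sketch under stated assumptions: the italics pair is the one given in the post, while the "bold" entry, the STYLE_TO_CLASS table and the function name styles_to_classes are hypothetical additions for illustration, not the actual 12 regexes.

```python
import re

# Inline style -> class name. Only the italics pair comes from the post;
# the bold entry is a hypothetical example of how the table would grow.
STYLE_TO_CLASS = {
    "font-style:italic;": "italics",
    "font-weight:bold;": "bold",
}

SPAN_WITH_STYLE = re.compile(r'<span style="([^"]*)">')


def styles_to_classes(html: str, mapping: dict = STYLE_TO_CLASS) -> str:
    """Swap known inline span styles for class names; leave the rest alone."""
    def repl(match: re.Match) -> str:
        # Ignore spaces so "font-style: italic;" matches as well.
        cls = mapping.get(match.group(1).replace(" ", ""))
        return f'<span class="{cls}">' if cls else match.group(0)

    return SPAN_WITH_STYLE.sub(repl, html)


print(styles_to_classes('<span style="font-style:italic;">Moby-Dick</span>'))
```

The class names only take effect if the book's stylesheet defines them, for example `.italics { font-style: italic; }`.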
|||
12-15-2017, 12:00 AM | #14 |
Wizard
Posts: 3,108
Karma: 60231510
Join Date: Nov 2011
Location: Australia
Device: Kobo Aura H2O, Kindle Oasis, Huwei Ascend Mate 7
|
|
12-15-2017, 12:40 AM | #15 | |
null operator (he/him)
Posts: 20,558
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Quote:
I am surprised (shocked even) at how well it does with PDFs from institutions like Rand, McKinsey, Stratfor etc - previously I often didn't even bother trying to convert their documents. Tables would end up as a meaningless shambles, sidebars as a nonsensical farrago etc. And it creates significantly fewer styles for a document than other PDF converters do. BR |
|