Converting PDF/HTML to ereader formats

HappyChris · 12-04-2017, 01:28 PM

I have downloaded a free PDF book and wish to read it on my Kindle or my EPUB readers. I used Calibre to convert it to AZW3 and HTMLZ, both of which display similarly: hard-coded line lengths, page headers, page numbers etc. all survive, giving clumsy line formatting on small 'pages', with paragraph spacing rather than line spacing between lines, and chapter titles etc. inappropriate to the page size.

I tried one or two online services but, while they got the ToC lines down to one line each with fewer dots between the chapter title and the page number, they choked when faced with the many diacritics in the text! Calibre handled the diacritics without fault.

I would rather not spend days removing '</p>'s and repeating page header and footer texts! Is there a program or a script that could speed this process up or even do it all for me?

Turtle91 · 12-04-2017, 01:44 PM

That is the million dollar question... PDF is probably the worst possible source document to do a conversion from. In my opinion, it is a far better use of your time to search for a different source format than try and convert a pdf. Sorry

BetterRed · 12-04-2017, 04:42 PM

If it's a multi-column PDF you might be able to make use of ==>> k2pdfopt: optimizes PDFs for viewing on e-readers

Or, if you can wrangle the PDF into Word (maybe via this ==>> Edit PDF content in Word), then you could use Toxaris' add-in ==>> e-Book Tools - a Word add-in to create an ePUB that you can finalise in an ebook editor such as Sigil or Calibre.

Otherwise - what Turtle91 said.

BR

JSWolf · 12-04-2017, 06:16 PM

Forget PDF exists. Problem with converting PDF solved.

Hitch · 12-05-2017, 10:18 AM

Quote:

Originally Posted by HappyChris

I have downloaded a free PDF book and wish to read it on my Kindle or my EPUB readers. I used Calibre to convert it to AZW3 and HTMLZ, both of which display similarly: hard-coded line lengths, page headers, page numbers etc. all survive, giving clumsy line formatting on small 'pages', with paragraph spacing rather than line spacing between lines, and chapter titles etc. inappropriate to the page size.

I tried one or two online services but, while they got the ToC lines down to one line each with fewer dots between the chapter title and the page number, they choked when faced with the many diacritics in the text! Calibre handled the diacritics without fault.

I would rather not spend days removing '</p>'s and repeating page header and footer texts! Is there a program or a script that could speed this process up or even do it all for me?

Sorry Chris:

My business does this every single day, and there's no magic, fast way. The only way to get there from here is to a) use Abbyy Fine Reader, which at least will remove the running heads/footers, and b) do the rest by hand. I wish there were a faster/better way, but there isn't.

Having said that, yes, you can write regex to clean up some portion of the line-ending '</p>'s, and all that, but...it's still all human labor, eyes and hands.

Hitch
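For anyone wondering what "regex to clean up the line-ending </p>'s" might look like in practice, here is a minimal Python sketch. It assumes the converter wrapped every PDF line in its own <p>...</p>, and it treats a break as spurious only when the next line starts with a lowercase letter and the previous one does not end a sentence. The function name and the punctuation list are my own choices, not anything from a particular tool.

```python
import re

# One non-space character before </p>, the paragraph break itself,
# and one non-space character after the next <p ...>.
BREAK = re.compile(r'(\S)\s*</p>\s*<p[^>]*>\s*(\S)')

# Assumption: these characters mark a real end of paragraph.
SENTENCE_END = '.!?:"”’)'


def _join(match):
    before, after = match.group(1), match.group(2)
    if not after.islower():        # next line starts a new sentence or heading
        return match.group(0)
    if before == '-':              # word hyphenated across the line end
        return after
    if before in SENTENCE_END:     # previous line really ended
        return match.group(0)
    return f'{before} {after}'     # spurious break: join with a space


def join_broken_paragraphs(html):
    """Merge <p> elements that were split mid-sentence."""
    return BREAK.sub(_join, html)


sample = ('<p>The quick brown</p>\n<p>fox jumped over the exam-</p>\n'
          '<p>ple fence.</p>\n<p>Next paragraph.</p>')
print(join_broken_paragraphs(sample))
```

It will get some cases wrong (a genuine hyphenated compound such as "well-known" split at the hyphen loses its hyphen, for instance), which is exactly why the result still needs eyes and hands afterwards.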

HarryT · 12-07-2017, 04:34 AM

Yes, as Hitch says, a decent OCR program such as Abbyy, followed by hand clean-up of the resulting code, really is as good as it gets. PDF is a "terminal format": that is, you can convert to it easily enough, but you can't convert from it.

RobertDDL · 12-07-2017, 10:18 AM

Instead of Abbyy, which costs money, you can try pdftotext, which is open source - it's a little Windows command line utility, which you can download from www.xpdfreader.com (it's included in the Xpdf tools). With the -layout option, it gives you a plain text file with line breaks that correspond to the lines in the PDF (doesn't work with all PDF files, though).
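As a rough illustration, the call can be wrapped in a few lines of Python. This is only a sketch: it assumes pdftotext is on your PATH, the function names are made up for the example, and the -enc UTF-8 option is my own addition (not mentioned above) to keep diacritics intact.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf, txt=None):
    """Build the pdftotext command line; output defaults to <name>.txt."""
    pdf = Path(pdf)
    txt = Path(txt) if txt else pdf.with_suffix('.txt')
    # -layout keeps the physical line layout of the PDF;
    # -enc UTF-8 is an assumption of this sketch, chosen for the diacritics.
    return ['pdftotext', '-layout', '-enc', 'UTF-8', str(pdf), str(txt)]


def extract_text(pdf, txt=None):
    """Run pdftotext and return the path of the text file it wrote."""
    if shutil.which('pdftotext') is None:
        raise FileNotFoundError('pdftotext not found on PATH')
    cmd = pdftotext_command(pdf, txt)
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])


print(pdftotext_command('book.pdf'))
```

Calling extract_text('book.pdf') would then write book.txt next to the PDF.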

You'll still have a lot of editing to do, but with a decent editor and a little knowledge of regular expressions and a tool to convert plain text to epub you should be able to get a reasonably readable epub version in an hour or so. (I have a tool that helps with the process, but like pdftotext it's Windows command line stuff, so I don't know if you're interested.)
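To give an idea of the kind of editing involved, here is a minimal Python sketch of two typical steps: dropping running heads and page numbers, then unwrapping the hard line breaks. It is not the tool mentioned above. It assumes pages in the text file are separated by form feeds (which is what pdftotext writes by default) and that paragraphs are separated by blank lines; the function names and the min_share threshold are invented for the example.

```python
import re
from collections import Counter

PAGE_NUMBER = re.compile(r'^\s*\d+\s*$')


def _normalise(line):
    # "Chapter One   17" and "Chapter One   18" should count as the same head
    return re.sub(r'\d+', '#', line.strip())


def strip_running_heads(text, min_share=0.3):
    """Drop first/last lines that recur on at least min_share of the pages."""
    pages = [p.splitlines() for p in text.split('\f') if p.strip()]
    edge = Counter()
    for lines in pages:
        nonblank = [l for l in lines if l.strip()]
        for key in {_normalise(nonblank[0]), _normalise(nonblank[-1])}:
            edge[key] += 1
    repeated = {k for k, n in edge.items()
                if len(pages) > 1 and n / len(pages) >= min_share}
    cleaned = []
    for lines in pages:
        idx = [i for i, l in enumerate(lines) if l.strip()]
        drop = {i for i in (idx[0], idx[-1])
                if _normalise(lines[i]) in repeated or PAGE_NUMBER.match(lines[i])}
        cleaned.extend(l for i, l in enumerate(lines) if i not in drop)
    return '\n'.join(cleaned)


def unwrap(text):
    """Join the lines of each blank-line-separated paragraph into one line."""
    out = []
    for para in re.split(r'\n\s*\n', text):
        lines = [l.strip() for l in para.splitlines() if l.strip()]
        if not lines:
            continue
        joined = lines[0]
        for line in lines[1:]:
            if joined.endswith('-') and line[:1].islower():
                joined = joined[:-1] + line      # rejoin a hyphenated word
            else:
                joined += ' ' + line
        out.append(joined)
    return '\n\n'.join(out)


page1 = 'My Book\n\nFirst line of the para-\ngraph continues here.\n\n1'
page2 = 'My Book\n\nSecond paragraph starts\nand ends.\n\n2'
print(unwrap(strip_running_heads(page1 + '\f' + page2)))
```

A paragraph that runs across a page break will still come out as two paragraphs, so that part remains manual.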

Hitch · 12-07-2017, 01:04 PM

Quote:

Originally Posted by RobertDDL

Instead of Abbyy, which costs money, you can try pdftotext, which is open source - it's a little Windows command line utility, which you can download from www.xpdfreader.com (it's included in the Xpdf tools). With the -layout option, it gives you a plain text file with line breaks that correspond to the lines in the PDF (doesn't work with all PDF files, though).

You'll still have a lot of editing to do, but with a decent editor and a little knowledge of regular expressions and a tool to convert plain text to epub you should be able to get a reasonably readable epub version in an hour or so. (I have a tool that helps with the process, but like pdftotext it's Windows command line stuff, so I don't know if you're interested.)

Robert:

Doesn't that give you, as you said, a plain text file? So that all formatting is lost? For some folks, that would be a crapload more work.

Abbyy, if I am not mistaken, has a website where you can convert a single document for free. There are limits on it, and all that, but you might try searching for that.

Hitch

HarryT · 12-07-2017, 02:16 PM

You can also buy older releases of Abbyy at much lower prices than the current version, and they are more than adequate for the job - you just lose the spelling dictionary for Outer Mongolia (or whatever).

RobertDDL · 12-09-2017, 04:56 AM

Quote:

Originally Posted by Hitch

Robert:

Doesn't that give you, as you said, a plain text file? So that all formatting is lost? For some folks, that would be a crapload more work.

Abbyy, if I am not mistaken, has a website where you can convert a single document for free. There are limits on it, and all that, but you might try searching for that.

Hitch

Yes, you are right about all formatting being lost. For novels, in most cases that would be headings and italics. It doesn't take much time to manually restore headings; in a plain text file (word-wrap disabled) you can usually spot them quite easily, even if they don't start with numbers or "chapter" etc. The loss of italics can be quite unfortunate if the writer relies on them - not all of them do, though.
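Spotting headings can also be semi-automated. The following is only a heuristic sketch in Python, under my own assumptions: a heading candidate is a line with a blank line on both sides that either starts with "chapter"/"part" or a bare number, or is short and does not end in sentence punctuation. The max_len value and the function name are arbitrary.

```python
import re

# Lines such as "Chapter 3", "PART TWO" or a bare "12."
NUMBERED = re.compile(r'^(chapter|part)\b|^\d+\.?$', re.IGNORECASE)


def find_headings(text, max_len=60):
    """Return (line_number, line) pairs that look like headings."""
    lines = [l.strip() for l in text.splitlines()]
    found = []
    for i, line in enumerate(lines):
        if not line:
            continue
        # A heading normally stands alone between blank lines.
        isolated = ((i == 0 or not lines[i - 1]) and
                    (i == len(lines) - 1 or not lines[i + 1]))
        if not isolated:
            continue
        looks_like_title = (len(line) <= max_len and
                            line[-1] not in '.,;:!?"”')
        if NUMBERED.match(line) or looks_like_title:
            found.append((i + 1, line))
    return found


sample = ('Chapter 1\n\nIt was a dark night. The rain\nfell hard.\n\n'
          'The Storm Breaks\n\nMorning came.\n')
for number, heading in find_headings(sample):
    print(number, heading)
```

It only produces a list of candidates to check by eye; short isolated lines of body text will show up as false positives.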

Unless I've missed something, even with good OCR software it's not trivial to retain italics while getting rid of all the excess formatting that the OCR'd output file is burdened with (even Abbyy is wont to hallucinate dozens of different page and paragraph formats, font sizes, etc.). I do have a fairly recent version of Abbyy, but, for instance, when I had to convert a series of 26 PDF books for a friend (being blind, she couldn't read the PDFs), I got it done much more quickly with the pdftotext tool. (There were no italics in those PDFs, which we had bought, but which seem to have been produced from carelessly OCR'd print editions. No idea if the paper originals had them, but I've often come across public-domain books on the Internet without italics, when the originals have them.) (Which is not meant to say it's ok to lose them.)

Hitch · 12-09-2017, 12:53 PM

Quote:

Originally Posted by RobertDDL

Yes, you are right about all formatting being lost. For novels, in most cases that would be headings and italics. It doesn't take much time to manually restore headings; in a plain text file (word-wrap disabled) you can usually spot them quite easily, even if they don't start with numbers or "chapter" etc. The loss of italics can be quite unfortunate if the writer relies on them - not all of them do, though.

Unless I've missed something, even with good OCR software it's not trivial to retain italics while getting rid of all the excess formatting that the OCR'd output file is burdened with (even Abbyy is wont to hallucinate dozens of different page and paragraph formats, font sizes, etc.). I do have a fairly recent version of Abbyy, but, for instance, when I had to convert a series of 26 PDF books for a friend (being blind, she couldn't read the PDFs), I got it done much more quickly with the pdftotext tool. (There were no italics in those PDFs, which we had bought, but which seem to have been produced from carelessly OCR'd print editions. No idea if the paper originals had them, but I've often come across public-domain books on the Internet without italics, when the originals have them.) (Which is not meant to say it's ok to lose them.)

Honestly, I've never noticed any sturm und drang retaining italics with Abbyy. Using Abbyy is, as you rightly say, a PITA, overall, but in my opinion, compared to all the other methods to get a PDF to an editable form, it's the best way to go. Maybe I'm just lucky that way, that we get a lot of books that are bold/italics laden, but we do.

I'm sure that everyone has their own preferred way of working. I have a ton of expertise in Word, so frankly, it's super-easy for me to do the cleanup on Abbyy output into a Word file, or, of course, regex it to the nth, in HTML. If there ARE italics and bold, in large numbers, I'll use Toxaris' superb "ePUB Tools" Word plug-in, first--as that makes marking/retaining both of those character markups simplicity itself--and I'll "clean" the styles from the rest, and then restyle them. I find that the fastest route.

Offered solely FWIW.

Hitch

Tex2002ans · 12-11-2017, 06:46 PM

Quote:

Originally Posted by Hitch

My business does this every single day, and there's no magic, fast way. The only way to get there from here is to a) use Abbyy Fine Reader, which at least will remove the running heads/footers, and b) do the rest by hand. I wish there were a faster/better way, but there isn't.

Yep, I agree completely. PDF is an awful input source, and there is no real good way besides a lot of human elbow grease.

Quote:

Originally Posted by Hitch

Honestly, I've never noticed any sturm und drang retaining italics with Abbyy. Using Abbyy is, as you rightly say, a PITA, overall, but in my opinion, compared to all the other methods to get a PDF to an editable form, it's the best way to go. Maybe I'm just lucky that way, that we get a lot of books that are bold/italics laden, but we do.

Or, as long as you have FineReader 10 or higher, it has EPUB output. The EPUB output has relatively clean code with only a handful of inline styles.

I have 12 regexes to help clean up afterwards (5 to handle Italics, Bold, BoldItalics, Smallcaps, BoldSmallcaps + 7 to just clean up some <td> and some other anomalies).

Example:

Search: 
Replace:
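The search/replace pair itself is not shown here, so what follows is not one of those 12 regexes. It is just a minimal Python sketch of the same idea, assuming the inline styles take the form <span style="font-style: italic;">...</span> and <span style="font-weight: bold;">...</span>; the real FineReader markup may well differ, and the function names are invented.

```python
import re


def _span(style_pattern):
    # Matches a <span> whose only style declaration is the one given.
    return re.compile(
        r'<span\s+style="\s*' + style_pattern + r'\s*;?\s*">(.*?)</span>',
        re.IGNORECASE | re.DOTALL)


# Assumed inline styles; adjust to whatever the OCR output actually contains.
ITALIC = _span(r'font-style:\s*italic')
BOLD = _span(r'font-weight:\s*bold')


def tidy_inline_styles(html):
    """Turn styled spans into <i>/<b> and merge adjacent runs."""
    html = ITALIC.sub(r'<i>\1</i>', html)
    html = BOLD.sub(r'<b>\1</b>', html)
    # "<i>one</i> <i>two</i>" -> "<i>one two</i>"
    return re.sub(r'</(i|b)>(\s*)<\1>', r'\2', html)


sample = ('<p>He read <span style="font-style: italic;">War</span> '
          '<span style="font-style:italic">and Peace</span> in '
          '<span style="font-weight: bold;">one</span> week.</p>')
print(tidy_inline_styles(sample))
```

Nested spans would not pair up correctly with a non-greedy match like this, so it is only suitable for flat markup.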

Quote:

Originally Posted by RobertDDL

Yes, you are right about all formatting being lost. For novels, in most cases that would be headings and italics. [...] The loss of italics can be quite unfortunate, if the writer relies on them - not all of them do, though.

Getting the OCR correct is just a portion of the work. Formatting is just as important (and is where a lot of other tools fail miserably).

I've written about this in-depth before:

https://www.mobileread.com/forums/sh...72#post2883972

Most of your time is going to be spent editing and correcting the text/formatting, so the better you can get the input, the easier/faster your life will be on those later steps.

willus · 12-14-2017, 11:52 PM

My two cents on PDF conversions. This topic has been visited in a number of past MR threads.

darryl · 12-15-2017, 12:00 AM

Quote:

Originally Posted by JSWolf

Forget PDF exists. Problem with converting PDF solved.

I totally agree. I dabbled with converting PDFs years ago and decided it is simply not worthwhile. Of course Hitch would seem to have little choice given her business.

BetterRed · 12-15-2017, 12:40 AM

Quote:

Originally Posted by willus

My two cents on PDF conversions. This topic has been visited in a number of past MR threads.

I've only recently 'stumbled on' Word's PDF convert facility.

I am surprised (shocked even) at how well it does with PDFs from institutions like Rand, McKinsey, Stratfor etc. Previously I often didn't even bother trying to convert their documents: tables would end up as a meaningless shambles, sidebars as a nonsensical farrago etc. And the number of styles it creates for a document is significantly fewer than with other PDF converters.

BR



