12-04-2017, 01:28 PM   #1
HappyChris
Enthusiast
Posts: 27
Karma: 7526
Join Date: Nov 2012
Location: Bristol, UK
Device: Nexus 10 (2 of them), Nexus 5, iPhone, iPad, Kindle, iMac
Converting PDF/HTML to ereader formats

I have downloaded a free PDF book and wish to convert it so I can read it on my Kindle or my EPUB readers. I used Calibre to convert it to AZW3 and HTMLZ, both of which display similarly: the line lengths are hard-coded and the page headers, page numbers, etc. are retained. On small 'pages' this gives clumsy line formatting, with paragraph spacing rather than line spacing between lines, and chapter titles etc. that are inappropriate to the page size.

I tried one or two online services but, while they got the ToC lines down to one line each with fewer dots between the chapter title and the page number, they choked when faced with the many diacritics in the text! Calibre handled the diacritics without fault.

I would rather not spend days removing '</p>'s and repeating page header and footer texts! Is there a program or a script that could speed this process up or even do it all for me?
12-04-2017, 01:44 PM   #2
Turtle91
A Hairy Wizard
Posts: 3,093
Karma: 18727053
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
That is the million dollar question... PDF is probably the worst possible source document to do a conversion from. In my opinion, it is a far better use of your time to search for a different source format than to try to convert a PDF. Sorry.
12-04-2017, 04:42 PM   #3
BetterRed
null operator (he/him)
Posts: 20,532
Karma: 26944418
Join Date: Mar 2012
Location: Sydney Australia
Device: none
If it's a multi-column PDF you might be able to make use of ==>> k2pdfopt: optimizes PDFs for viewing on e-readers

Or, if you can wrangle the PDF into Word (maybe via this ==>> Edit PDF content in Word), then you could use Toxaris' add-in ==>> e-Book Tools - a Word add-in to create an ePUB that you can finalise in an ebook editor such as Sigil or Calibre.

Otherwise - what Turtle91 said.

BR

Last edited by BetterRed; 12-04-2017 at 04:44 PM.
12-04-2017, 06:16 PM   #4
JSWolf
Resident Curmudgeon
Posts: 73,845
Karma: 128597114
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Forget PDF exists. Problem with converting PDF solved.
12-05-2017, 10:18 AM   #5
Hitch
Bookmaker & Cat Slave
Posts: 11,460
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
Quote:
Originally Posted by HappyChris
I have downloaded a free PDF book and wish to convert it so I can read it on my Kindle or my EPUB readers. I used Calibre to convert it to AZW3 and HTMLZ, both of which display similarly: the line lengths are hard-coded and the page headers, page numbers, etc. are retained. On small 'pages' this gives clumsy line formatting, with paragraph spacing rather than line spacing between lines, and chapter titles etc. that are inappropriate to the page size.

I tried one or two online services but, while they got the ToC lines down to one line each with fewer dots between the chapter title and the page number, they choked when faced with the many diacritics in the text! Calibre handled the diacritics without fault.

I would rather not spend days removing '</p>'s and repeating page header and footer texts! Is there a program or a script that could speed this process up or even do it all for me?

Sorry Chris:

My business does this every single day, and there's no magic, fast way. The only way to get there from here is to a) use Abbyy Fine Reader, which at least will remove the running heads/footers, and b) do the rest by hand. I wish there were a faster/better way, but there isn't.

Having said that, yes, you can write regex to clean up some portion of the line-ending </p>'s, and all that, but...it's still all human labor, eyes and hands.
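
As a rough illustration of the kind of regex cleanup meant here, the sketch below re-joins paragraphs that were split at the PDF's line ends. It is only a heuristic under stated assumptions: it assumes the converter wrapped every PDF line in its own <p>...</p>, and it treats a break as spurious when the text before </p> does not end a sentence and the next paragraph starts with a lowercase letter. The names SENTENCE_END, BREAK and join_broken_paragraphs are made up for this example, and the result still needs checking by eye.

```python
import re

# Characters that normally end a sentence; a break after one of these is kept.
SENTENCE_END = '.!?:"\u201d'

# A paragraph end followed directly by a new paragraph. Group 1 is the last
# character before </p>; group 2 peeks at the first character after <p ...>.
BREAK = re.compile(r'(\S)</p>\s*<p(?:\s[^>]*)?>(?=(\w))')


def _join(match):
    before, after = match.group(1), match.group(2)
    if not after.islower():
        # Next paragraph starts with a capital or a digit: keep the break.
        return match.group(0)
    if before == '-':
        # Word hyphenated across the line end: drop the hyphen and the break.
        # (This also swallows genuine hyphens, so review the result.)
        return ''
    if before in SENTENCE_END:
        return match.group(0)
    # Mid-sentence break: replace the tags with a single space.
    return before + ' '


def join_broken_paragraphs(html):
    """Merge paragraphs that were split mid-sentence (heuristic sketch)."""
    return BREAK.sub(_join, html)
```

Because islower() is used instead of [a-z], words starting with accented letters are joined too, which matters for a text full of diacritics.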


Hitch
12-07-2017, 04:34 AM   #6
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Yes, as Hitch says, a decent OCR program such as Abbyy, followed by hand clean-up of the resulting code, really is as good as it gets. PDF is a "terminal format": that is, you can convert to it easily enough, but you can't convert from it.
12-07-2017, 10:18 AM   #7
RobertDDL
Whatever...
Posts: 197
Karma: 1114225
Join Date: Feb 2015
Location: Austria
Device: PocketBook InkPad 840, Touch HD 2
Instead of Abbyy, which costs money, you can try pdftotext, which is open source - it's a little Windows command line utility, which you can download from www.xpdfreader.com (it's included in the Xpdf tools). With the -layout option, it gives you a plain text file with line breaks that correspond to the lines in the PDF (doesn't work with all PDF files, though).

You'll still have a lot of editing to do, but with a decent editor and a little knowledge of regular expressions and a tool to convert plain text to epub you should be able to get a reasonably readable epub version in an hour or so. (I have a tool that helps with the process, but like pdftotext it's Windows command line stuff, so I don't know if you're interested.)
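
To give an idea of what the editing step can look like, here is a minimal sketch that post-processes text produced with something like `pdftotext -layout -enc UTF-8 book.pdf book.txt` (the file names are placeholders). It is not the tool mentioned above, just an illustration under a few assumptions: pages are separated by form feed characters, running headers and page numbers are the first or last non-blank line of a page, and they repeat on at least min_repeats pages once digits are ignored. All function names are invented for the example.

```python
import re
from collections import Counter


def _key(line):
    """Normalise a line so that 'Page 12' and 'Page 13' compare equal."""
    return re.sub(r'\d+', '#', line.strip())


def strip_running_lines(text, min_repeats=3):
    """Drop first/last page lines that repeat across pages (headers, page numbers)."""
    pages = [page.splitlines() for page in text.split('\f')]
    pages = [page for page in pages if any(line.strip() for line in page)]

    # Count how often each normalised edge line occurs across all pages.
    counts = Counter()
    for page in pages:
        nonblank = [line for line in page if line.strip()]
        counts.update({_key(nonblank[0]), _key(nonblank[-1])})

    kept = []
    for page in pages:
        idx = [i for i, line in enumerate(page) if line.strip()]
        drop = {i for i in (idx[0], idx[-1])
                if counts[_key(page[i])] >= min_repeats}
        kept.extend(line for i, line in enumerate(page) if i not in drop)
    return kept


def reflow(lines):
    """Join hard-wrapped lines into paragraphs separated by blank lines."""
    paragraphs, current = [], ''
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                paragraphs.append(current)
                current = ''
        elif current.endswith('-'):
            # Word hyphenated at the line end: glue the halves together.
            current = current[:-1] + line
        elif current:
            current += ' ' + line
        else:
            current = line
    if current:
        paragraphs.append(current)
    return '\n\n'.join(paragraphs)
```

Calling reflow(strip_running_lines(text)) gives one line per paragraph. A paragraph that runs across a page break will usually still come out as two pieces, so a manual pass remains necessary.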
12-07-2017, 01:04 PM   #8
Hitch
Bookmaker & Cat Slave
Posts: 11,460
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
Quote:
Originally Posted by RobertDDL
Instead of Abbyy, which costs money, you can try pdftotext, which is open source - it's a little Windows command line utility, which you can download from www.xpdfreader.com (it's included in the Xpdf tools). With the -layout option, it gives you a plain text file with line breaks that correspond to the lines in the PDF (doesn't work with all PDF files, though).

You'll still have a lot of editing to do, but with a decent editor and a little knowledge of regular expressions and a tool to convert plain text to epub you should be able to get a reasonably readable epub version in an hour or so. (I have a tool that helps with the process, but like pdftotext it's Windows command line stuff, so I don't know if you're interested.)
Robert:

Doesn't that give you, as you said, a plain text file? So that all formatting is lost? For some folks, that would be a crapload more work.

Abbyy, if I am not mistaken, has a website where you can convert a single document for free. There are limits on it, and all that, but you might try searching for that.

Hitch
12-07-2017, 02:16 PM   #9
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
You can also buy older releases of Abbyy at much lower prices than the current version, and they are more than adequate for the job - you just lose the spelling dictionary for Outer Mongolia (or whatever).
12-09-2017, 04:56 AM   #10
RobertDDL
Whatever...
Posts: 197
Karma: 1114225
Join Date: Feb 2015
Location: Austria
Device: PocketBook InkPad 840, Touch HD 2
Quote:
Originally Posted by Hitch
Robert:

Doesn't that give you, as you said, a plain text file? So that all formatting is lost? For some folks, that would be a crapload more work.

Abbyy, if I am not mistaken, has a website where you can convert a single document for free. There are limits on it, and all that, but you might try searching for that.

Hitch
Yes, you are right about all formatting being lost. For novels, in most cases that would be headings and italics. It doesn't take much time to restore headings manually: in a plain text file (with word wrap disabled) you can usually spot them quite easily, even if they don't start with numbers or "chapter" etc. The loss of italics can be quite unfortunate if the writer relies on them - not all of them do, though.
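
Spotting headings can also be semi-automated. The sketch below is one possible heuristic, not a description of any existing tool: it flags short lines that stand alone between blank lines and either look like "Chapter 1" / "IV" / "12" or do not end in sentence punctuation. The pattern, the 40-character limit and the function name are assumptions chosen for illustration; the output is a list of candidates to review, not a final answer.

```python
import re

# Lines that look like explicit chapter markers: "Chapter 3", "Part II", "12", "IV."
CHAPTER_RE = re.compile(r'^(chapter|part|book)\b|^\d+\.?$|^[IVXLC]+\.?$',
                        re.IGNORECASE)

# A heading rarely ends with one of these characters.
SENTENCE_PUNCT = '.,;:!?"\u201d'


def find_heading_candidates(lines, max_len=40):
    """Return (line_number, text) pairs for lines that might be headings."""
    hits = []
    for i, raw in enumerate(lines):
        line = raw.strip()
        if not line or len(line) > max_len:
            continue
        # A heading candidate must stand alone between blank lines.
        prev_blank = i == 0 or not lines[i - 1].strip()
        next_blank = i == len(lines) - 1 or not lines[i + 1].strip()
        if not (prev_blank and next_blank):
            continue
        if CHAPTER_RE.search(line) or line[-1] not in SENTENCE_PUNCT:
            hits.append((i + 1, line))
    return hits
```

Short one-line paragraphs without closing punctuation will show up as false positives, which is why the function only reports line numbers instead of changing the text.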

Unless I've missed something, though, even with good OCR software it's not trivial to retain italics while getting rid of all the excess formatting that the OCR'd output file is burdened with (even Abbyy is wont to hallucinate dozens of different page and paragraph formats, font sizes, etc.). I do have a fairly recent version of Abbyy, but, for instance, when I had to convert a series of 26 PDF books for a friend (being blind, she couldn't read the PDFs), I got it done much more quickly with the pdftotext tool. (There were no italics in those PDFs, which we had bought, but which seem to have been produced from carelessly OCR'd print editions. I have no idea whether the paper originals had them, but I've often come across public-domain books on the Internet without italics when the originals have them. That is not meant to say it's OK to lose them.)
12-09-2017, 12:53 PM   #11
Hitch
Bookmaker & Cat Slave
Posts: 11,460
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
Quote:
Originally Posted by RobertDDL
Yes, you are right about all formatting being lost. For novels, in most cases that would be headings and italics. It doesn't take much time to restore headings manually: in a plain text file (with word wrap disabled) you can usually spot them quite easily, even if they don't start with numbers or "chapter" etc. The loss of italics can be quite unfortunate if the writer relies on them - not all of them do, though.

Unless I've missed something, though, even with good OCR software it's not trivial to retain italics while getting rid of all the excess formatting that the OCR'd output file is burdened with (even Abbyy is wont to hallucinate dozens of different page and paragraph formats, font sizes, etc.). I do have a fairly recent version of Abbyy, but, for instance, when I had to convert a series of 26 PDF books for a friend (being blind, she couldn't read the PDFs), I got it done much more quickly with the pdftotext tool. (There were no italics in those PDFs, which we had bought, but which seem to have been produced from carelessly OCR'd print editions. I have no idea whether the paper originals had them, but I've often come across public-domain books on the Internet without italics when the originals have them. That is not meant to say it's OK to lose them.)
Honestly, I've never noticed any sturm und drang retaining italics with Abbyy. Using Abbyy is, as you rightly say, a PITA, overall, but in my opinion, compared to all the other methods to get a PDF to an editable form, it's the best way to go. Maybe I'm just lucky that way, that we get a lot of books that are bold/italics laden, but we do.

I'm sure that everyone has their own preferred way of working. I have a ton of expertise in Word, so frankly, it's super-easy for me to do the cleanup on Abbyy output into a Word file, or, of course, regex it to the nth, in HTML. If there ARE italics and bold, in large numbers, I'll use Toxaris' superb "ePUB Tools" Word plug-in, first--as that makes marking/retaining both of those character markups simplicity itself--and I'll "clean" the styles from the rest, and then restyle them. I find that the fastest route.

Offered solely FWIW.

Hitch
12-11-2017, 06:46 PM   #12
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by Hitch
My business does this every single day, and there's no magic, fast way. The only way to get there from here is to a) use Abbyy Fine Reader, which at least will remove the running heads/footers, and b) do the rest by hand. I wish there were a faster/better way, but there isn't.
Yep, I agree completely. PDF is an awful input source, and there is no real good way besides a lot of human elbow grease.

Quote:
Originally Posted by Hitch
Honestly, I've never noticed any sturm und drang retaining italics with Abbyy. Using Abbyy is, as you rightly say, a PITA, overall, but in my opinion, compared to all the other methods to get a PDF to an editable form, it's the best way to go. Maybe I'm just lucky that way, that we get a lot of books that are bold/italics laden, but we do.
Or as long as you have Finereader 10 or higher, it has EPUB output. The EPUB output has relatively clean code with only a handful of inline styles.

I have 12 regexes to help clean up afterwards (5 to handle Italics, Bold, BoldItalics, Smallcaps, BoldSmallcaps + 7 to just clean up some <td> and some other anomalies).

Example:

Search: <span style="font-style:italic;">
Replace: <span class="italics">
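
For anyone who prefers to script this, the same search/replace can be sketched in Python. Only the italic mapping comes from the example above; the bold and small-caps style strings and the class names "bold" and "smallcaps" are assumptions added for illustration, so check them against what your FineReader output actually contains.

```python
import re

# Inline style -> class name. Only the first entry is taken from the post;
# the other two are assumed examples of what the remaining regexes might cover.
STYLE_TO_CLASS = {
    'font-style:italic;': 'italics',
    'font-weight:bold;': 'bold',              # assumption
    'font-variant:small-caps;': 'smallcaps',  # assumption
}

SPAN_STYLE = re.compile(r'<span style="([^"]*)">')


def styles_to_classes(html):
    """Replace known inline styles on <span> tags with class attributes."""
    def repl(match):
        # Ignore spaces so "font-style: italic;" is treated like "font-style:italic;".
        cls = STYLE_TO_CLASS.get(match.group(1).replace(' ', ''))
        # Unknown styles are left untouched rather than guessed at.
        return f'<span class="{cls}">' if cls else match.group(0)
    return SPAN_STYLE.sub(repl, html)
```

The matching CSS rules (for example a rule that sets font-style: italic for the "italics" class) still have to be added to the stylesheet by hand.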

Quote:
Originally Posted by RobertDDL
Yes, you are right about all formatting being lost. For novels, in most cases that would be headings and italics. [...] The loss of italics can be quite unfortunate, if the writer relies on them - not all of them do, though.
Getting the OCR correct is just a portion of the work. Formatting is just as important (and is where a lot of other tools fail miserably).

I've written about this in-depth before:

https://www.mobileread.com/forums/sh...72#post2883972

Most of your time is going to be spent editing and correcting the text/formatting, so the better you can get the input, the easier/faster your life will be on those later steps.

Last edited by Tex2002ans; 12-11-2017 at 06:50 PM.
12-14-2017, 11:52 PM   #13
willus
Fuzzball, the purple cat
Posts: 1,272
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
My two cents on PDF conversions. This topic has been visited in a number of past MR threads.
12-15-2017, 12:00 AM   #14
darryl
Wizard
Posts: 3,108
Karma: 60231510
Join Date: Nov 2011
Location: Australia
Device: Kobo Aura H2O, Kindle Oasis, Huawei Ascend Mate 7
Quote:
Originally Posted by JSWolf
Forget PDF exists. Problem with converting PDF solved.
I totally agree. I dabbled with converting PDFs years ago and decided it is simply not worthwhile. Of course Hitch would seem to have little choice given her business.
12-15-2017, 12:40 AM   #15
BetterRed
null operator (he/him)
Posts: 20,532
Karma: 26944418
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Quote:
Originally Posted by willus
My two cents on PDF conversions. This topic has been visited in a number of past MR threads.
- I've only recently 'stumbled on' Word's PDF convert facility.

I am surprised (shocked even) at how well it does with PDFs from institutions like Rand, McKinsey, Stratfor etc - previously I often didn't even bother trying to convert their documents. Tables would end up as a meaningless shambles, sidebars as a nonsensical farrago etc. And the number of styles it creates for a document is significantly fewer than other PDF converters.

BR