Old 12-04-2017, 02:28 PM   #1
HappyChris
Member
Posts: 16
Karma: 7526
Join Date: Nov 2012
Location: Bristol, UK
Device: Nexus 10 (2 of them), Nexus 5, Nook STG
Converting PDF/HTML to ereader formats

I have downloaded a free PDF book and wish to convert it so that I can read it on my Kindle or my EPUB readers. I used Calibre to convert it to AZW3 and HTMLZ, both of which display similarly: the PDF's hard-coded line lengths, page headers, page numbers etc. are carried over, so the small 'pages' show clumsy line formatting, with paragraph spacing rather than line spacing between lines, and chapter titles etc. inappropriate to the page size.

I tried one or two online services but, while they got the ToC lines down to one line each with fewer dots between the chapter title and the page number, they choked when faced with the many diacritics in the text! Calibre handled the diacritics without fault.

I would rather not spend days removing '</p>'s and repeating page header and footer texts! Is there a program or a script that could speed this process up or even do it all for me?
Old 12-04-2017, 02:44 PM   #2
Turtle91
A Hairy Wizard
Posts: 1,529
Karma: 11635512
Join Date: Dec 2012
Location: Altus, Oklahoma today
Device: iPhone 6/5/iPad 1,2 & Air/Surface Pro/Kindle PW
That is the million dollar question... PDF is probably the worst possible source document to do a conversion from. In my opinion, it is a far better use of your time to search for a different source format than try and convert a pdf. Sorry
Old 12-04-2017, 05:42 PM   #3
BetterRed
null operator
Posts: 9,306
Karma: 7810051
Join Date: Mar 2012
Location: Sydney Australia
Device: none
If it's a multi-column PDF you might be able to make use of ==>> k2pdfopt: optimizes PDFs for viewing on e-readers

Or, if you can wrangle the PDF into Word (maybe via this ==>> Edit PDF content in Word), then you could use Toxaris' add-in ==>> e-Book Tools - a Word add-in to create an ePUB that you can finalise in an ebook editor such as Sigil or Calibre.

Otherwise - what Turtle91 said.

BR

Last edited by BetterRed; 12-04-2017 at 05:44 PM.
Old 12-04-2017, 07:16 PM   #4
JSWolf
Resident Curmudgeon
Posts: 50,418
Karma: 43720217
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Aura H2O, Sony PRS-650, Sony PRS-T1, nook STR, iPad 4, iPhone 5
Forget PDF exists. Problem with converting PDF solved.
Old 12-05-2017, 11:18 AM   #5
Hitch
Bookmaker & Cat Slave
Posts: 5,914
Karma: 55146348
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, and NookColor. 2 Droid, 1 Win8 ePUB rdrs
Quote:
Originally Posted by HappyChris View Post
I have downloaded a free PDF book and wish to convert it so that I can read it on my Kindle or my EPUB readers. I used Calibre to convert it to AZW3 and HTMLZ, both of which display similarly: the PDF's hard-coded line lengths, page headers, page numbers etc. are carried over, so the small 'pages' show clumsy line formatting, with paragraph spacing rather than line spacing between lines, and chapter titles etc. inappropriate to the page size.

I tried one or two online services but, while they got the ToC lines down to one line each with fewer dots between the chapter title and the page number, they choked when faced with the many diacritics in the text! Calibre handled the diacritics without fault.

I would rather not spend days removing '</p>'s and repeating page header and footer texts! Is there a program or a script that could speed this process up or even do it all for me?

Sorry Chris:

My business does this every single day, and there's no magic, fast way. The only way to get there from here is to a) use Abbyy Fine Reader, which at least will remove the running heads/footers, and b) do the rest by hand. I wish there were a faster/better way, but there isn't.

Having said that, yes, you can write regex to clean up some portion of the line-ending </p>'s, and all that, but...it's still all human labor, eyes and hands.
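
For illustration, here is a minimal Python sketch of the kind of regex clean-up mentioned above. It assumes the converter wrapped every PDF line in its own <p>...</p>, and that a paragraph which does not end in sentence-final punctuation continues on the next line. Both are assumptions about the converted HTML, not something any converter guarantees, and merge_broken_paragraphs is a hypothetical helper name. The result still needs checking by eye.

```python
import re

# A </p> that is NOT preceded by sentence-final punctuation, followed by the
# next opening <p ...>. The punctuation list is an assumption; curly quotes
# or other closing characters would have to be added for a real book.
BROKEN_BREAK = re.compile(r'(?<![.!?:"])</p>\s*<p[^>]*>')


def merge_broken_paragraphs(html: str) -> str:
    """Join <p> elements that were split in mid-sentence by the PDF line ends."""
    return BROKEN_BREAK.sub(" ", html)


sample = "<p>The quick brown</p>\n<p>fox jumped.</p>\n<p>Next para.</p>"
print(merge_broken_paragraphs(sample))
```

Headings and short lines that legitimately end without punctuation will be merged into the following paragraph by this rule, which is one reason the manual pass cannot be skipped.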


Hitch
Old 12-07-2017, 05:34 AM   #6
HarryT
eBook Enthusiast
Posts: 82,213
Karma: 76823449
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Yes, as Hitch says, a decent OCR program such as Abbyy, followed by hand clean-up of the resulting code, really is as good as it gets. PDF is a "terminal format": that is, you can convert to it easily enough, but you can't convert from it.
Old 12-07-2017, 11:18 AM   #7
RobertDDL
Whatever...
Posts: 153
Karma: 718577
Join Date: Feb 2015
Location: Austria
Device: Pocketbook InkPad 840
Instead of Abbyy, which costs money, you can try pdftotext, which is open source - it's a little Windows command line utility, which you can download from www.xpdfreader.com (it's included in the Xpdf tools). With the -layout option, it gives you a plain text file with line breaks that correspond to the lines in the PDF (doesn't work with all PDF files, though).
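
As a rough sketch, the pdftotext step could be driven from Python like this. It assumes pdftotext is installed and on the PATH. The -enc UTF-8 option is an addition not mentioned in the post (it is there to keep the diacritics intact), and the function names and file names are hypothetical.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def build_pdftotext_cmd(pdf_path: str, txt_path: str) -> list[str]:
    """Build the pdftotext command line: keep the page layout, write UTF-8."""
    return ["pdftotext", "-layout", "-enc", "UTF-8", pdf_path, txt_path]


def pdf_to_text(pdf_path: str) -> Path:
    """Run pdftotext and return the path of the text file it wrote."""
    txt_path = Path(pdf_path).with_suffix(".txt")
    # check=True raises CalledProcessError if pdftotext reports a failure.
    subprocess.run(build_pdftotext_cmd(pdf_path, str(txt_path)), check=True)
    return txt_path
```

Calling pdf_to_text("book.pdf") would write book.txt next to the PDF, assuming the PDF actually contains extractable text rather than scanned images.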

You'll still have a lot of editing to do, but with a decent editor and a little knowledge of regular expressions and a tool to convert plain text to epub you should be able to get a reasonably readable epub version in an hour or so. (I have a tool that helps with the process, but like pdftotext it's Windows command line stuff, so I don't know if you're interested.)
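
The editing described above might start with something like the following Python sketch. It is not the tool mentioned in the post. It assumes that pages in the text file are separated by form feed characters (which is what pdftotext writes), that a line repeated on at least half of the pages is a running header or footer, that a line holding only digits is a page number, and that blank lines separate paragraphs. The name clean_pages and the min_share threshold are made up for the example.

```python
from __future__ import annotations

import re
from collections import Counter

PAGE_NUMBER = re.compile(r"^\s*\d+\s*$")


def clean_pages(text: str, min_share: float = 0.5) -> str:
    """Drop page numbers and running heads, then unwrap lines into paragraphs."""
    # One list of lines per page; pages are separated by a form feed.
    pages = [p.splitlines() for p in text.split("\f") if p.strip()]

    # Count on how many pages each distinct non-blank line occurs.
    counts = Counter()
    for page in pages:
        counts.update({ln.strip() for ln in page if ln.strip()})
    threshold = max(2, int(len(pages) * min_share))
    running = {ln for ln, n in counts.items() if n >= threshold}

    kept = []
    for page in pages:
        for ln in page:
            s = ln.strip()
            if s in running or PAGE_NUMBER.match(ln):
                continue
            kept.append(s)

    # Consecutive non-blank lines form one paragraph.
    paragraphs, current = [], []
    for s in kept:
        if s:
            current.append(s)
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

Paragraphs that run across a page break, words hyphenated at line ends, and books whose paragraphs are marked by indentation rather than blank lines are not handled here and would still need regex work or hand editing.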
Old 12-07-2017, 02:04 PM   #8
Hitch
Bookmaker & Cat Slave
Posts: 5,914
Karma: 55146348
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, and NookColor. 2 Droid, 1 Win8 ePUB rdrs
Quote:
Originally Posted by RobertDDL View Post
Instead of Abbyy, which costs money, you can try pdftotext, which is open source - it's a little Windows command line utility, which you can download from www.xpdfreader.com (it's included in the Xpdf tools). With the -layout option, it gives you a plain text file with line breaks that correspond to the lines in the PDF (doesn't work with all PDF files, though).

You'll still have a lot of editing to do, but with a decent editor and a little knowledge of regular expressions and a tool to convert plain text to epub you should be able to get a reasonably readable epub version in an hour or so. (I have a tool that helps with the process, but like pdftotext it's Windows command line stuff, so I don't know if you're interested.)
Robert:

Doesn't that give you, as you said, a plain text file? So that all formatting is lost? For some folks, that would be a crapload more work.

Abbyy, if I am not mistaken, has a website where you can convert a single document for free. There are limits on it, and all that, but you might try searching for that.

Hitch
Old 12-07-2017, 03:16 PM   #9
HarryT
eBook Enthusiast
Posts: 82,213
Karma: 76823449
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
You can also buy older releases of Abbyy at much lower prices than the current version, and they are more than adequate for the job - you just lose the spelling dictionary for Outer Mongolia (or whatever).
Old 12-09-2017, 05:56 AM   #10
RobertDDL
Whatever...
Posts: 153
Karma: 718577
Join Date: Feb 2015
Location: Austria
Device: Pocketbook InkPad 840
Quote:
Originally Posted by Hitch View Post
Robert:

Doesn't that give you, as you said, a plain text file? So that all formatting is lost? For some folks, that would be a crapload more work.

Abbyy, if I am not mistaken, has a website where you can convert a single document for free. There are limits on it, and all that, but you might try searching for that.

Hitch
Yes, you are right about all formatting being lost. For novels, in most cases that would be headings and italics. It doesn't take much time to manually restore headings, in a plain text file (word-wrap disabled) you can usually spot them quite easily, even if they don't start with numbers or "chapter" etc. The loss of italics can be quite unfortunate, if the writer relies on them - not all of them do, though.
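
To give an idea of how heading-spotting could be partly automated, here is a hedged Python sketch. The heuristics are assumptions made for the example, not a rule from the post: a line counts as a heading candidate if it is short and either starts with a word like "chapter", or stands alone between blank lines and does not end in punctuation. The function name, the keyword list and the 50-character limit are all hypothetical, and the output is a list of candidates to review, not a final answer.

```python
from __future__ import annotations

import re

KEYWORD = re.compile(r"^(chapter|part|book)\b", re.IGNORECASE)
END_PUNCTUATION = (".", ",", ";", ":", "!", "?", '"')


def find_heading_candidates(lines: list[str], max_len: int = 50) -> list[int]:
    """Return the indices of lines that look like chapter headings."""
    hits = []
    for i, raw in enumerate(lines):
        s = raw.strip()
        if not s or len(s) > max_len:
            continue
        prev_blank = i == 0 or not lines[i - 1].strip()
        next_blank = i == len(lines) - 1 or not lines[i + 1].strip()
        isolated = prev_blank and next_blank
        if KEYWORD.match(s) or (isolated and not s.endswith(END_PUNCTUATION)):
            hits.append(i)
    return hits


sample = [
    "CHAPTER ONE",
    "",
    "It was a dark and stormy night, and the rain",
    "fell in torrents.",
    "",
    "The Journey Home",
    "",
    "She left at dawn.",
]
print(find_heading_candidates(sample))
```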

Unless I've missed something, even with a good OCR software, though, it's not trivial to retain italics, while getting rid of all the excess formatting that the OCR'd output file is burdened with (even Abbyy is wont to hallucinate dozens of different page and paragraph formats, font sizes, etc.). I do have a fairly recent version of Abbyy, but, for instance, when I had to convert a series of 26 PDF books for a friend (being blind, she couldn't read the PDFs), I got it done much more quickly with the pdftotext tool. (There were no italics in those PDFs, which we had bought, but which seem to have been produced from carelessly OCR'd print editions. No idea if the paper originals had them, but I've often come across public-domain books on the Internet without italics, when the originals have them.) (Which is not meant to say it's ok to lose them.)
Old 12-09-2017, 01:53 PM   #11
Hitch
Bookmaker & Cat Slave
Posts: 5,914
Karma: 55146348
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, and NookColor. 2 Droid, 1 Win8 ePUB rdrs
Quote:
Originally Posted by RobertDDL View Post
Yes, you are right about all formatting being lost. For novels, in most cases that would be headings and italics. It doesn't take much time to manually restore headings, in a plain text file (word-wrap disabled) you can usually spot them quite easily, even if they don't start with numbers or "chapter" etc. The loss of italics can be quite unfortunate, if the writer relies on them - not all of them do, though.

Unless I've missed something, even with a good OCR software, though, it's not trivial to retain italics, while getting rid of all the excess formatting that the OCR'd output file is burdened with (even Abbyy is wont to hallucinate dozens of different page and paragraph formats, font sizes, etc.). I do have a fairly recent version of Abbyy, but, for instance, when I had to convert a series of 26 PDF books for a friend (being blind, she couldn't read the PDFs), I got it done much more quickly with the pdftotext tool. (There were no italics in those PDFs, which we had bought, but which seem to have been produced from carelessly OCR'd print editions. No idea if the paper originals had them, but I've often come across public-domain books on the Internet without italics, when the originals have them.) (Which is not meant to say it's ok to lose them.)
Honestly, I've never noticed any sturm und drang retaining italics with Abbyy. Using Abbyy is, as you rightly say, a PITA, overall, but in my opinion, compared to all the other methods to get a PDF to an editable form, it's the best way to go. Maybe I'm just lucky that way, that we get a lot of books that are bold/italics laden, but we do.

I'm sure that everyone has their own preferred way of working. I have a ton of expertise in Word, so frankly, it's super-easy for me to do the cleanup on Abbyy output into a Word file, or, of course, regex it to the nth, in HTML. If there ARE italics and bold, in large numbers, I'll use Toxaris' superb "ePUB Tools" Word plug-in, first--as that makes marking/retaining both of those character markups simplicity itself--and I'll "clean" the styles from the rest, and then restyle them. I find that the fastest route.

Offered solely FWIW.

Hitch
Old 12-11-2017, 07:46 PM   #12
Tex2002ans
Guru
Posts: 897
Karma: 5454725
Join Date: Jul 2012
Device: Nook
Quote:
Originally Posted by Hitch View Post
My business does this every single day, and there's no magic, fast way. The only way to get there from here is to a) use Abbyy Fine Reader, which at least will remove the running heads/footers, and b) do the rest by hand. I wish there were a faster/better way, but there isn't.
Yep, I agree completely. PDF is an awful input source, and there is no real good way besides a lot of human elbow grease.

Quote:
Originally Posted by Hitch View Post
Honestly, I've never noticed any sturm und drang retaining italics with Abbyy. Using Abbyy is, as you rightly say, a PITA, overall, but in my opinion, compared to all the other methods to get a PDF to an editable form, it's the best way to go. Maybe I'm just lucky that way, that we get a lot of books that are bold/italics laden, but we do.
Or as long as you have FineReader 10 or higher, it has EPUB output. The EPUB output has relatively clean code with only a handful of inline styles.

I have 12 regex to help clean up after (5 to handle Italics, Bold, BoldItalics, Smallcaps, BoldSmallcaps + 7 to just clean up some <td> and some other anomalies).

Example:

Search: <span style="font-style:italic;">
Replace: <span class="italics">
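
The search/replace above could be scripted along these lines. This is only a sketch: the italic rule mirrors the example in the post, while the bold rule, the "bold" class name and the tolerance for extra whitespace are assumptions added for illustration. The exact inline styles that FineReader emits should be checked in the actual EPUB before relying on any of these patterns.

```python
import re

# (pattern, replacement) pairs. Only the first mirrors the post's example;
# the second is a guess at what a matching bold rule might look like.
RULES = [
    (re.compile(r'<span\s+style="\s*font-style:\s*italic;?\s*">'),
     '<span class="italics">'),
    (re.compile(r'<span\s+style="\s*font-weight:\s*bold;?\s*">'),
     '<span class="bold">'),
]


def styles_to_classes(html: str) -> str:
    """Replace known inline style spans with class-based spans."""
    for pattern, replacement in RULES:
        html = pattern.sub(replacement, html)
    return html


sample = '<p>An <span style="font-style:italic;">emphatic</span> word.</p>'
print(styles_to_classes(sample))
```

The classes then need matching rules in the book's stylesheet, for example a rule that sets font-style: italic for the italics class.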

Quote:
Originally Posted by RobertDDL View Post
Yes, you are right about all formatting being lost. For novels, in most cases that would be headings and italics. [...] The loss of italics can be quite unfortunate, if the writer relies on them - not all of them do, though.
Getting the OCR correct is just a portion of the work. Formatting is just as important (and is where a lot of other tools fail miserably).

I've written about this in-depth before:

https://www.mobileread.com/forums/sh...72#post2883972

Most of your time is going to be spent editing and correcting the text/formatting, so the better you can get the input, the easier/faster your life will be on those later steps.

Last edited by Tex2002ans; 12-11-2017 at 07:50 PM.
Old 12-15-2017, 12:52 AM   #13
willus
Fuzzball, the purple cat
Posts: 954
Karma: 6379999
Join Date: Jun 2011
Location: California
Device: Kindle 2, iPad
My two cents on PDF conversions. This topic has been visited in a number of past MR threads.
Old 12-15-2017, 01:00 AM   #14
darryl
Wizard
Posts: 2,073
Karma: 25791510
Join Date: Nov 2011
Location: Australia
Device: Kobo Aura H2O, Kindle Oasis, Huwei Ascend Mate 7
Quote:
Originally Posted by JSWolf View Post
Forget PDF exists. Problem with converting PDF solved.
I totally agree. I dabbled with converting PDFs years ago and decided it is simply not worthwhile. Of course Hitch would seem to have little choice given her business.
Old 12-15-2017, 01:40 AM   #15
BetterRed
null operator
Posts: 9,306
Karma: 7810051
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Quote:
Originally Posted by willus View Post
My two cents on PDF conversions. This topic has been visited in a number of past MR threads.
I've only recently 'stumbled on' Word's PDF convert facility.

I am surprised (shocked even) at how well it does with PDFs from institutions like Rand, McKinsey, Stratfor etc - previously I often didn't even bother trying to convert their documents. Tables would end up as a meaningless shambles, sidebars as a nonsensical farrago etc. And the number of styles it creates for a document is significantly fewer than other PDF converters.

BR
Tags
calibre, convert, diacritics, kindle
