A real PDF to epub/djvu/rtf/html software?


DsOft
08-12-2008, 08:25 PM
Hiya

I've tried to convert my ebook collection (PDFs, impossible to read on my iLiad). Does anyone know of any software to do it?

I've tried some software (pdf 2 djvu, pdf reaper and Solid PDF Converter), but I can't get any quality results.

Please help.

eimert
08-13-2008, 04:35 PM
Hi,

I tried many programs to convert pdf to doc/rtf. None of the freeware programs I tried were satisfactory. ABBYY FineReader does the job (older versions sometimes turn up on magazine cover disks) but needs a lot of proofing (for German text at least). The best results I've got were with Scansoft PDF Converter Pro. It is now at version 5; I tried the version 4 demo for 14 days. It works beautifully most of the time.
Just my experience.
Hope that helps.

Cheers,
Klaus

DsOft
08-17-2008, 10:46 PM
Thanks, I'll try it.

I'm still trying other ones...

DjVu support on the iLiad isn't official, and the viewers available don't do the job very well and are still too buggy.
.pdf : (Adobe Acrobat Pro 9 -> doc) -> (OFFICE 2007 + DJVU plug-in -> .djvu)
This does a pretty good job if the .pdf doesn't have many frames and embedded photos, but it's too slow if you have 35k+ books, like me (no queue support).
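For the missing queue, one workaround is to drive a converter from a small script. Below is a minimal Python sketch, not tied to any of the programs named in this thread: the names `pending_jobs` and `run_queue` are made up for illustration, and `convert` stands for whatever function you write around your own converter. It only assumes that the converter can be called once per file.

```python
from pathlib import Path
from typing import Callable, Iterator, List, Tuple


def pending_jobs(src_dir: Path, dst_dir: Path,
                 out_ext: str = ".djvu") -> Iterator[Tuple[Path, Path]]:
    """Yield (source PDF, target file) pairs that have not been converted yet."""
    for pdf in sorted(src_dir.rglob("*.pdf")):
        # Mirror the source folder layout under dst_dir, swapping the extension.
        target = dst_dir / pdf.relative_to(src_dir).with_suffix(out_ext)
        if not target.exists():
            yield pdf, target


def run_queue(src_dir: Path, dst_dir: Path,
              convert: Callable[[Path, Path], None],
              out_ext: str = ".djvu") -> List[Path]:
    """Convert every pending PDF in turn and return the ones that failed."""
    failed: List[Path] = []
    for pdf, target in pending_jobs(src_dir, dst_dir, out_ext):
        target.parent.mkdir(parents=True, exist_ok=True)
        try:
            convert(pdf, target)  # hypothetical: your own wrapper goes here
        except Exception:
            failed.append(pdf)    # keep going; retry these on a later run
    return failed
```

Because finished targets are skipped, the script can be stopped and restarted without redoing work, which matters with tens of thousands of files.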

I've tried some pdf (or doc, rtf) 2 epub tools, but they were just a waste of time.

I'm converting my .lit files to .html through LITConverter, which does a pretty good job (if there are no images).

Still working on it.

FizzyWater
08-17-2008, 11:45 PM
The best results I've got with Scansoft PDF Converter Pro. Cheers, Klaus

I've been generally happy with my Scansoft, although its effectiveness is ... limited? ... by how the original PDF was made. Sometimes I still get paragraph returns for every line, forced page breaks - even between words - where the PDF page ended, etc.

Lately, I've been using Mobipocket Creator. You can import PDF files and build a Mobipocket file. For each new Mobipocket file created, it also creates an HTML file and an "image" folder with the pics from the PDFs. So I take that HTML file and do whatever clean-up I want and use it to create books in whatever format I really want!

peterbbb
08-18-2008, 05:20 AM
Which version of Mobipocket Creator imports PDF?

mbovenka
08-18-2008, 08:14 AM
At least the current one does, if you get the Publisher Edition (which is also free, just like the Home).

Does a pretty good job on converting them, most of the time.

eimert
08-18-2008, 09:19 AM
I've been generally happy with my Scansoft, although its effectiveness is ... limited? ... by how the original PDF was made. Sometimes I still get paragraph returns for every line, forced page breaks - even between words - where the PDF page ended, etc.

Yes, that is true for me, too. While it works really nicely most of the time, I do get those line breaks sometimes. However, that is very easy to cure with Word or OpenOffice.
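For anyone who would rather script that cleanup than do it in Word or OpenOffice, here is a rough Python sketch. It rests on a simple heuristic that is an assumption, not something any converter guarantees: a line that does not end in sentence punctuation was probably broken mid-paragraph. The function name `join_wrapped_lines` is made up for this example.

```python
from typing import List


def join_wrapped_lines(text: str) -> str:
    """Merge hard-wrapped lines back into paragraphs.

    Heuristic: a paragraph ends at a blank line or at a line whose last
    character (ignoring closing quotes and brackets) is . ! ? or :
    """
    paragraphs: List[str] = []
    current: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line always closes the paragraph in progress.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(line)
        if line.rstrip("\"')]").endswith((".", "!", "?", ":")):
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

The heuristic will split a paragraph wherever a sentence happens to end exactly at a line break, so the result still wants a quick read-through.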

Lately, I've been using Mobipocket Creator. You can import PDF files and build a Mobipocket file. For each new Mobipocket file created, it also creates an HTML file and an "image" folder with the pics from the PDFs. So I take that HTML file and do whatever clean-up I want and use it to create books in whatever format I really want!

Hm, can't comment on the Creator as I never got it running on my machine. The import function in the Mobipocket Reader works fine on some files and not at all on others. Might depend on the coding of the pdf?
... OK, forget what I just wrote. I just decided to give the Mobipocket Creator another try (I've got a new system). It installed without problem and started without any error messages(!) - that is a first for me :-)
It converts OK, but in my hands, and for the document I tried, there are quite a few more formatting glitches (manual line breaks etc.) than I got for the same document with Scansoft. But again, that probably depends on your PDF file. Anyway, if you can live with the manual cleanup, Mobipocket Creator seems like a viable alternative to Scansoft, I think, especially since it is free!

Cheers,
Klaus

peterbbb
08-18-2008, 12:25 PM
Thanks. Installed and used and it works to convert PDFs to .prc files

Faenad
08-23-2008, 01:00 AM
I convert a lot of PDFs to HTML to read with µBook on my Windows Mobile device, and I have experimented with a lot of software.

So far my best "recipe" is :

- I purchased Foxit PDF Editor. Before converting a PDF I do a "cleanup" of all the parts that are usually messed up by the converter: the publisher's logo, big images (the first page, which usually displays the cover), the index, etc. I also remove blank pages and pages of non-essential stuff, like the ISBN, that are easily messed up.

- Then I convert with the Free Mobipocket Creator. Best converter I have found so far, and free.

- Finally, if the images extracted by Mobipocket Creator are too big (big images mean a long wait on my 200 MHz phone), I convert them all to a smaller resolution using the batch editor of Photoscape. It's also freeware.

It takes me about 2-5 min for each PDF depending on the size and complexity.
The results are really good.

If the PDF has a lot of images (the most difficult thing to convert correctly), I convert it to Rgo instead. I haven't yet found a converter to a more popular format that works well with image-heavy PDFs.

JSWolf
08-23-2008, 01:06 AM
There is currently no known way to convert PDF perfectly to something else. Even Adobe Acrobat Pro messes up on purely text-based PDFs.

Timoleon
08-24-2008, 09:02 AM
Say, JSWolf ---

When are you going to finish that Donaldson novel? Would you like me to give you the Cliff Notes version so you can have done with it?:rofl:

Tim

JSWolf
09-01-2008, 09:22 AM
I'm almost done with it. Then I have the second book as well. He still has to write three and four.

DsOft
09-25-2008, 03:54 AM
OK, I'll explain the method I'm using at the moment.


Simple text PDFs:

Just Adobe Acrobat 9 Pro => .rtf or .html

Image-heavy PDFs:
Adobe Acrobat: PDF => JPG Grayscale full quality.

Auto Photo Editor (Zeallsoft):
JPG => Max CROP (manually) & Turn Right (to a temp1 folder)
JPG => CROP (temp1 JPGs) 50% RIGHT (to temp2 folder) & Rename to *_1.jpg
JPG => CROP (temp1 JPGs) 50% LEFT (to final folder) & Rename to *_2.jpg & Move temp2 JPGs to final folder

So you'll have:
book_Page1_1.jpg (first 50% of Page 1)
book_Page1_2.jpg (last 50% of Page 1)
book_Page2_1.jpg
book_Page2_2.jpg
etc...
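The split and naming scheme above can be written down as a short Python sketch. It only computes crop rectangles and file names; the actual cropping is still done by whatever image tool you use. It assumes that "Turn Right" means a clockwise quarter turn, so the top of the original page ends up on the right and the right half is read first. The function names are made up for this illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def half_page_boxes(width: int, height: int) -> List[Tuple[str, Box]]:
    """Return (suffix, crop box) pairs for one page already turned right.

    Assumption: after a clockwise quarter turn the top of the original page
    sits on the right, so the right half gets _1 and the left half gets _2.
    """
    middle = width // 2
    return [
        ("_1", (middle, 0, width, height)),  # right half = first 50% of the page
        ("_2", (0, 0, middle, height)),      # left half = last 50% of the page
    ]


def output_names(book: str, page_count: int) -> List[str]:
    """List the JPG names in reading order, following the naming shown above."""
    names: List[str] = []
    for page in range(1, page_count + 1):
        for suffix in ("_1", "_2"):
            names.append(f"{book}_Page{page}{suffix}.jpg")
    return names
```

Feeding the names to the ebook builder in this order keeps the two halves of each page together.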

Mobipocket Creator: JPGs (from final folder, so left and right sides of original PDF) => .opf + .prc




CONS:

The iLiad uncompresses .prc files to a folder of the same name.
Really slow process. Only worth it for ebooks you really need.




So, what do you think? And does anyone know a way to make a .mobi file from those .opf + .prc files (one that the iLiad doesn't touch, or only uncompresses temporarily)?

P.S. I've tried every PDF-to-something converter for the formats the iLiad reads, and they only work on plain-text files; anything harder becomes an unusable ebook.

nrapallo
09-30-2008, 12:14 PM
I've had GREAT success using Mobipocket Creator to convert a .pdf directly into .prc using its 'Import From Existing File - Adobe PDF'.

I was amazed at how nice the .prc ended up; a lot of the formatting was retained and all words were correctly "decoded"! The only problem was bold and italic words needed a space before and after the bold/italics tags. Nothing a quick search & replace on <b>, </b> or <i>,</i> couldn't fix. Much of the paragraph layout was retained as were headings with <h1> tags, etc.
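The search & replace for the missing spaces around bold and italic tags could also be scripted. This is only a sketch of one way to do it in Python with the standard `re` module; the function name `pad_inline_tags` is made up, and it assumes the tags appear as plain `<b>`, `</b>`, `<i>` and `</i>` without attributes.

```python
import re


def pad_inline_tags(html: str) -> str:
    """Add the missing space around <b>/<i> tags that are glued to a word."""
    # Space before an opening tag that directly follows a word or punctuation
    # (but not whitespace or another tag).
    html = re.sub(r"(?<=[^\s>])(<(?:b|i)>)", r" \1", html, flags=re.IGNORECASE)
    # Space after a closing tag that is directly followed by a letter or digit.
    html = re.sub(r"(</(?:b|i)>)(?=\w)", r"\1 ", html, flags=re.IGNORECASE)
    return html
```

Text that is already spaced correctly, or a closing tag followed by punctuation, is left alone.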

Very powerful indeed! Try it out...

p.s. I had tried v7 Adobe Acrobat Pro's own 'Save as...' feature and was totally disappointed. Your mileage may vary...

wallcraft
09-30-2008, 12:28 PM
I've had GREAT success using Mobipocket Creator to convert a .pdf directly into .prc using its 'Import From Existing File - Adobe PDF'.

Under the hood, this is a two-step process, and sometimes you might want to edit the intermediate HTML. MobiPocket Creator should leave behind its intermediate OEB ebook files, but you can also just use the command-line version, pdf2xml; see Mobipocket convert in mass? (http://www.mobileread.com/forums/showpost.php?p=260477&postcount=14).

nrapallo
09-30-2008, 01:58 PM
Nice to know, but I *do* like the complete run through to get the .prc actually produced. I'm a lazy bugger... :whistle:

Dumb question time: What good is the .xml if I already have the .prc and .opf with .html from the Mobipocket Creator Import? :dunno: :)

I've never used .xml files (my Word is 'stuck' at Word 2003). Which program converts these and/or reads them for further processing?

Would you say they are better for storage, portability or something else? I'm not in the know here. Any info would be appreciated!

Thanks!

DaleDe
09-30-2008, 02:43 PM
You can add the import and export of XML files to your Word 2003 (Actually all the way back to Word 2000) via a free download from Microsoft. They are pushing their new docx format. See the wiki.

Dale

sasilk
10-03-2008, 10:20 AM
For the iLiad users...

I found a nice easy way of outputting PDF files so that they're readable on the iLiad. You can create your own PDF format styles for the PDF printer. So I created one with a paper size and margins that fit my iLiad, and a font that I like to read. Then all you have to do is print whatever format you have to the PDF printer using that style, and it will create a file that works on your iLiad.

chrisophus
05-04-2009, 01:59 PM
I wrote a Python script which converts the output of pdf2xml to HTML and attempts to maintain the formatting of complex PDFs. I then use calibre to generate the ebook format (mobi in my case). It seems to work pretty well. You can read more about it on my blog at http://talkings.org/2009/05/03/complex-pdf-html/.

kovidgoyal
05-04-2009, 07:38 PM
Cool, it was always in the back of my mind to write a script to implement column detection and a few other goodies from the output of pdf2xml, but I never found the time/motivation.

I'll be willing to integrate this into calibre (after the 0.6 release), so open a ticket and attach your script. Integration will depend on how easy it is to compile pdf2xml on various platforms.

chrisophus
05-04-2009, 11:35 PM
That sounds good. What time frame are you looking at? I still need to do some work on it to automate detection of more aspects of the content.

kovidgoyal
05-04-2009, 11:54 PM
0.6 will take another couple of months, so there's no rush.

chrisophus
05-06-2009, 02:21 PM
I am pretty happy with the progress I've made in the last couple of days. It seems to be working with almost anything I throw at it. I am adding a lot of options to customize how it handles the formatting. I'll post again when I have a new version up. I wish I had a better name than cxpdfhtml.py...

kovidgoyal
05-06-2009, 02:28 PM
The name I used for my abortive attempt was pdfreflow.py.

tlc
05-08-2009, 04:39 AM
My interest is just getting better reflowable paragraphs on fiction. I tried cxpdfhtml.py on a novel and was surprised at how well the "break on short lines" approach worked, although I haven't read in depth to find the not-short-enough lines.

I was wondering if you are considering (or anyone else has implemented) detection of paragraphs based on indentation?

chrisophus
05-10-2009, 01:01 AM
Actually it does use indentation to detect paragraphs. Basically if a line is indented and the next line is not, it is considered the beginning of a paragraph block. A short line break is detected if no other type of block/code is detected and the line is indented and doesn't quite go to the end of the line (10 pixels).
Although that could easily be made a configuration option as well.

Thanks for the feedback. I hope it proves useful.
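To make the two rules concrete, here is a small Python sketch of indentation and short-line paragraph detection. It is not the cxpdfhtml code and does not read real pdf2xml output: the `Line` class, the `reflow` function, and the `indent_tol` threshold are assumptions invented for this illustration, and only the 10-pixel short-line allowance comes from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    left: float   # x position where the line's text starts, in pixels
    right: float  # x position where the line's text ends, in pixels
    text: str


def reflow(lines: List[Line], indent_tol: float = 2.0,
           short_by: float = 10.0) -> List[str]:
    """Group positioned text lines into paragraphs."""
    if not lines:
        return []
    margin_left = min(line.left for line in lines)
    margin_right = max(line.right for line in lines)
    paragraphs: List[str] = []
    current: List[str] = []
    for i, line in enumerate(lines):
        indented = line.left > margin_left + indent_tol
        nxt = lines[i + 1] if i + 1 < len(lines) else None
        next_flush = nxt is not None and nxt.left <= margin_left + indent_tol
        # Rule 1: an indented line followed by a flush line starts a paragraph.
        if indented and next_flush and current:
            paragraphs.append(" ".join(current))
            current = []
        current.append(line.text.strip())
        # Rule 2: a line that stops short of the right margin ends a paragraph.
        if line.right < margin_right - short_by:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Both thresholds are parameters here, in the spirit of making them configuration options. Hyphenated words, columns and headings are not handled.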

JSWolf
05-10-2009, 09:43 PM
And it is. Where do we get whatever it is? Thanks!

chrisophus
05-10-2009, 10:06 PM
It is cxpdfhtml. See my earlier post and my blog for details and download links: http://talkings.org/2009/05/07/cxpdfhtml/

JSWolf
05-11-2009, 11:48 AM
Thank you. I'll give it a go later when I have a chance and find a PDF I want to convert.

stisev
09-10-2009, 07:03 AM
Hi all,
Like you guys, I have a lot of purchased PDF files, none of which are DRMed (I refuse to purchase from any store that DRMs anything).

All I can say is that it is virtually impossible to convert everything successfully.

Echoing two previous recommendations, I like Nuance PDF Converter Pro. Nuance PDF Converter 6.0 Pro just came out and is the most accurate converter so far, but it still chokes on some books.

charleski
04-20-2010, 03:12 AM
PDF to ePub Converter http://www.pdf-epub-converter.com. All-in-one converter, but not free. :cool:

Fails to recognise PDF tags and makes no attempt to detect or reconstruct paragraphs. Renders each page as a single div in a separate flow with br tags separating the lines and indented paragraphs marked by non-breaking spaces. Produced a 93kB (!!) css file which was almost totally composed of font-size and spurious (and misspelt) vertical-align attributes.

vastav
12-01-2010, 03:57 AM
You may want to try the Acrobat plugin based ePub conversion solution available at http://www.pdf2epub.com which recognizes PDF tags, reconstructs paragraphs and does a decent job in retaining most of the font and layout attributes.

JGB
12-04-2010, 09:08 PM
Does this work better than calibre?
Or is calibre's built-in PDF converter better? I kind of liked the idea of printing the PDF file to the correct size and using that, if it is more likely to remain clean and usable, but would this improve the conversion any?

MRRV
12-21-2010, 06:41 AM
Hello everyone,

We have made an online pdf2epub service available here (http://beta.open.xerox.com/Services/Rossinante).
You can visualize the output of each step of the conversion using a GUI.

Any feedback is welcome.
MRRV

frabjous
12-21-2010, 11:02 AM
If any feedback is welcome then:


I would (and will) never use this service because it requires me to get an account and login even to test it.

I suspect that eventually I'd have to pay for these conversions, which, again I would never do so long as free solutions exist such as calibre or online converters like this (http://www.2epub.com/) (even if yours is better, which I'll never know about).

Kosst Amojan
01-02-2011, 04:57 PM
It's expensive, but I absolutely love ABBYY FineReader. It has excellent OCR capabilities and can convert to several different formats with different options (removing headers and all) with CSS.

I usually convert to an HTML file, then use Sigil to convert to an epub, and always get excellent results.