Converting PDF to HTML

Nirf · 06-23-2010, 11:25 AM

Hello all,

While trying to convert a PDF to LRF, calibre apparently goes through HTML as an intermediate format, and it hangs in this phase. I'm running Ubuntu 10.04. I tried installing from the repo and then tried the latest version downloaded from the website. I used the console command ebook-convert, and after a while I hit Ctrl+C to end it. Here's what it looks like.

ebook-convert hello.pdf hello.lrf
1% Converting input to HTML...
InputFormatPlugin: PDF Input running
on /home/nir/Documents/Calibre Library/Malcolm Gladwell/Outliers_ The Story of Success (Little, Brown & Co; 2008) (3)/hello.pdf
^CTraceback (most recent call last):
File "/tmp/init.py", line 48, in <module>
File "/home/kovid/build/calibre/src/calibre/ebooks/conversion/cli.py", line 254, in main
File "/home/kovid/build/calibre/src/calibre/ebooks/conversion/plumber.py", line 815, in run
File "/home/kovid/build/calibre/src/calibre/customize/conversion.py", line 211, in __call__
File "/home/kovid/build/calibre/src/calibre/ebooks/pdf/input.py", line 50, in convert
File "/home/kovid/build/calibre/src/calibre/ebooks/pdf/pdftohtml.py", line 61, in pdftohtml
File "/usr/lib64/python2.6/subprocess.py", line 1157, in wait
pid, sts = os.waitpid(self.pid, os.WNOHANG)
KeyboardInterrupt

Any suggestions? As I said, I tried both the repo install and then the direct download install (after removing the repo version) and had the same problem both times. I don't know where to go from here, because it seems to be getting stuck in a generic Python file, and it's the correct version of Python... Help appreciated!

Nirf · 06-23-2010, 11:28 AM

Oh, quick PS, when I run calibre I get two "Link hasn't been detected!" messages although calibre still runs. This makes me think I may be missing some of the required packages, but this doesn't make much sense as I've a) checked most of the major ones and b) I had the same problem when I installed from repo, and the repo install should install all the required packages automatically. Also, when I hit convert, I get a pile more "link hasn't been detected" messages. Is there any kind of debug mode for calibre where I can check what libraries seem to be missing?

kovidgoyal · 06-23-2010, 01:17 PM

pdftohtml (the program calibre uses to convert PDF to HTML) is hanging. You can try running it independently on the PDF file to see if it works, and then convert the resulting HTML.
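A minimal sketch of that two-step approach, written in Python because the traceback above shows calibre itself launching pdftohtml through subprocess. It assumes pdftohtml and ebook-convert are both on the PATH. The helper names and the 600-second limit are illustrative and not part of calibre. The timeout is there so a hang ends on its own instead of needing Ctrl+C.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (returncode, timed_out).

    subprocess.run kills the child process and raises TimeoutExpired
    if it has not finished within timeout_s seconds.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        print("timed out: " + " ".join(cmd), file=sys.stderr)
        return None, True
    return completed.returncode, False


def pdf_to_lrf_in_two_steps(pdf_path, html_path, lrf_path, timeout_s=600):
    """Hypothetical helper: run pdftohtml on its own, then convert the HTML."""
    # Step 1: pdftohtml by itself, so a hang shows up here rather than
    # inside calibre. -noframes asks for one HTML document instead of a
    # frameset.
    code, timed_out = run_with_timeout(
        ["pdftohtml", "-noframes", pdf_path, html_path], timeout_s
    )
    if timed_out or code != 0:
        return False
    # Step 2: hand the resulting HTML to calibre's command-line converter.
    code, timed_out = run_with_timeout(
        ["ebook-convert", html_path, lrf_path], timeout_s
    )
    return not timed_out and code == 0
```

Calling pdf_to_lrf_in_two_steps("hello.pdf", "hello.html", "hello.lrf") would then show which of the two steps is the slow one.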

Starson17 · 06-23-2010, 01:51 PM

Quote:

Originally Posted by Nirf

Oh, quick PS, when I run calibre I get two "Link hasn't been detected!" messages although calibre still runs.

These are normal and can be ignored.

Nirf · 06-24-2010, 12:50 AM

Ok, so I followed the suggestions. Running pdftohtml on hello.pdf worked and produced a bunch of files, hello.html, hellos.html, hello_ind.html, and a zillion .png files for all the pages. However, I couldn't find any way to add the html file meaningfully as a book into calibre. I would choose hello.html, and next thing I know when the book is in the library, it shows up as a zip file, and there's no way to preview it. Very odd behavior.

Also, I let ebook-convert run for a long time this time, and here's what I eventually got (after a memory leak so bad that everything was slowing down, so I ended it the hard way).

1% Converting input to HTML...
InputFormatPlugin: PDF Input running
on /home/nir/Documents/Calibre Library/Malcolm Gladwell/Outliers_ The Story of Success (Little, Brown & Co; 2008) (3)/hello.pdf
pdftohtml log:

Parsing all content...
Initial parse failed:
Parsing file 'index.html' as HTML
Forcing index.html into XHTML namespace
Generating default TOC from spine...
34% Running transforms on ebook...
Merging user specified metadata...
Detecting structure...
Auto generated TOC with 0 entries.
Flattening CSS and remapping font sizes...
Source base font size is 12.00000pt
Cleaning up manifest...
Trimming unused files from manifest...
Creating LRF Output...
67% Creating LRF Output
Processing u'index.html'
Parsing HTML...
Converting to BBeB...
Terminated

These conversions are also taking huge amounts of time... the pdftohtml conversion took a very long time (a few minutes) and for the ebook-convert command to get to this point takes even longer. I didn't remember it taking even close to this long before.... what's going on?

Nirf · 06-24-2010, 02:21 AM

A lot of evidence is suddenly pointing to the fact that this is very troublesome because the PDF in question is just a series of scanned images and doesn't contain text at all per se...
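One way to check that suspicion is to see whether any text can be extracted from the file at all. The sketch below assumes the pdftotext tool from the same poppler/xpdf family as pdftohtml is installed. The function names and the 20-character threshold are illustrative choices, not anything calibre provides.

```python
import subprocess


def looks_scanned(extracted_text, min_chars=20):
    """Return True if the text pulled from a PDF is essentially empty."""
    # str.split() with no arguments drops all whitespace, including the
    # form feeds that pdftotext emits between pages.
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars


def extract_text(pdf_path):
    """Run pdftotext and return its output; '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout
```

If looks_scanned(extract_text("hello.pdf")) comes back True, the PDF very likely holds only page images, and any conversion will produce one large image per page instead of reflowable text.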

Starson17 · 06-24-2010, 08:47 AM

Quote:

Originally Posted by Nirf

I couldn't find any way to add the html file meaningfully as a book into calibre. I would choose hello.html, and next thing I know when the book is in the library, it shows up as a zip file,

HTML files are always added as zip files. This is normal behavior.
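To confirm that such an archive really holds the page and its images, it can be opened like any other zip. This is a small sketch assuming the file calibre stores is a standard zip archive. The function name is illustrative.

```python
import zipfile


def html_members(zip_path):
    """Return the names of the HTML files stored inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return [
            name
            for name in archive.namelist()
            if name.lower().endswith((".html", ".htm", ".xhtml"))
        ]
```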

Quote:

and there's no way to preview it. Very odd behavior.

I'm not sure what you mean by "preview," but I can read my HTML files just fine. Are you trying to read it?

Starson17 · 06-24-2010, 08:51 AM

Quote:

Originally Posted by Nirf

A lot of evidence is suddenly pointing to the fact that this is very troublesome because the PDF in question is just a series of scanned images and doesn't contain text at all per se...

I have lots of PDFs like that: scanned images of the pages. If you think of them as what they are (images), they pretty much behave as expected for me.



