pdf2lrf


kovidgoyal
07-30-2007, 05:56 PM
Part of libprs500 (http://libprs500.kovidgoyal.net) v0.3.81. It extracts the text from PDF files and converts them to LRF. Preserves bold and italics. See attached demo.

It doesn't support embedded images and results are not going to be satisfactory for complex PDF files. But for converting simple novels, it works great.

Linux users: If you want support for PDF links then you need to install poppler from CVS.

To use:

pdf2lrf "mybook.pdf"


Enjoy.

JSWolf
07-30-2007, 06:54 PM
Would it be possible to have PDF2HTML so we can then edit the text/book how we want and then use html2lrf to create a properly formatted book?

And this is a great step forward for PDF conversion without the need for Acrobat.

kovidgoyal
07-30-2007, 07:05 PM
pdftohtml is already on your PATH on Windows.

pdftohtml mybook.pdf
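
For anyone who wants to script the round trip asked about above (PDF to HTML, hand-edit, then HTML to LRF), here is a rough Python sketch that only builds the two command lines. It rests on assumptions not stated in this thread: that your pdftohtml build accepts the -noframes option and, given it, writes a single mybook.html next to mybook.pdf, and that html2lrf takes the HTML file as its only required argument. The function name round_trip_commands is made up for illustration.

```python
from pathlib import Path


def round_trip_commands(pdf_name):
    """Build the two command lines for a PDF -> HTML -> LRF round trip.

    Assumptions (not confirmed in this thread): pdftohtml supports
    -noframes and then writes <stem>.html in the current directory;
    html2lrf accepts the HTML file as a positional argument.
    """
    html_name = Path(pdf_name).stem + ".html"
    return [
        ["pdftohtml", "-noframes", pdf_name],  # step 1: extract editable HTML
        ["html2lrf", html_name],               # step 2: run after editing the HTML
    ]


# Print the commands so they can be run by hand, with the edit in between.
for cmd in round_trip_commands("mybook.pdf"):
    print(" ".join(cmd))
```

Nothing is executed here on purpose: the whole point of the round trip is to edit the HTML between the two steps.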

JSWolf
07-30-2007, 07:07 PM
Very nice! Thank you!

kovidgoyal
08-03-2007, 04:02 PM
Incidentally, is there some reason this thread isn't being made a sticky?

astra
08-03-2007, 06:05 PM
Are there any instructions on how to use this feature? Some sort of help or FAQ?

kovidgoyal
08-03-2007, 06:08 PM
Start a terminal (Start->Run, then type cmd.exe).
Change to the directory that contains your PDF file, then run pdf2lrf:

cd "c:\my directory"
pdf2lrf mybook.pdf
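
If you have a whole folder of PDFs, the same two steps can be looped from a script. The following is a minimal Python sketch, not part of libprs500: it assumes pdf2lrf is on your PATH and is called with just the PDF file name, exactly as above. The helper names build_commands and convert_all are hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(directory):
    """Return one ["pdf2lrf", <file name>] command per PDF in `directory`."""
    pdfs = sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    # Use bare file names because each command is run from inside `directory`,
    # mirroring "cd <directory>" followed by "pdf2lrf mybook.pdf".
    return [["pdf2lrf", p.name] for p in pdfs]


def convert_all(directory):
    """Run pdf2lrf on every PDF in `directory`; return the names that failed."""
    failed = []
    for cmd in build_commands(directory):
        result = subprocess.run(cmd, cwd=directory)
        if result.returncode != 0:
            failed.append(cmd[1])
    return failed
```

Calling convert_all(r"c:\my directory") would try each PDF in turn. If pdf2lrf cannot be found on the PATH, subprocess.run raises FileNotFoundError instead of returning an exit code.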

JSWolf
08-11-2007, 03:22 PM
Incidentally, is there some reason this thread isn't being made a sticky?
Just got lost in the shuffle. It's stuck now.

kovidgoyal
08-11-2007, 03:27 PM
thanks.

cirocco
08-18-2007, 12:19 PM
Could you explain how to get correct non-English characters from a PDF? I get strange results with Polish.
The word "CZĘŚĆ" is converted into:
<b>CZ </b><br>
<b>E S </b><br>
<b>C I</b><br>

kovidgoyal
08-18-2007, 12:24 PM
Try the -enc switch of pdftohtml?

cirocco
08-18-2007, 03:17 PM
Thanks, I tried it, but I can only get the error message:
Error: Couldn't find unicodeMap file for the 'iso-8859-2' encoding
Is there a list of encoding names?

kovidgoyal
08-18-2007, 03:27 PM
I don't know; you'll have to contact the author of pdftohtml.

cirocco
08-18-2007, 04:43 PM
Thanks for your input. I discovered that my PDF has an embedded font without a Unicode map, which may be the cause of all the problems, and there is no easy way of fixing it :-(
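
A quick way to tell whether a given PDF (or a given -enc setting) is affected is to search the pdftohtml output for a few words you know are in the book. This is only a rough Python sketch; the function name missing_words is made up, and it assumes you have already read the HTML into a string using the right encoding.

```python
import unicodedata


def missing_words(html_text, known_words):
    """Return the words from `known_words` that do not occur in `html_text`.

    Both sides are normalized to NFC so that a letter stored as
    "base letter + combining mark" still matches its precomposed form.
    """
    haystack = unicodedata.normalize("NFC", html_text)
    return [
        word for word in known_words
        if unicodedata.normalize("NFC", word) not in haystack
    ]


# The garbled output quoted earlier in this thread: the word is not found.
garbled = "<b>CZ </b><br>\n<b>E S </b><br>\n<b>C I</b><br>\n"
print(missing_words(garbled, ["CZĘŚĆ"]))
```

An empty list means every test word survived the conversion. A non-empty list suggests the text extraction itself is losing the characters, as with a font that has no Unicode map.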

BlackVoid
04-02-2008, 04:38 PM
This is a MESS.
Line breaks ignored.
Page breaks after 1-2 lines on a page, IN THE MIDDLE of the sentence.

:angry::angry::angry:

JSWolf
04-04-2008, 12:33 PM
pdf2lrf does not do that sort of thing to a PDF. It happens because the original PDF has these problems.