MobileRead Forums - howto: importing PDFs to a word processor

Antartica · 09-05-2006, 03:27 AM

I've been looking for an easy way to convert PDFs. Until now I was using a pdf2html program and post-processing its output, with mixed results. For the curious, this is the tool chain I used to convert some PDFs so they are pleasant to read on the Iliad (11cm x 15cm, etc.):
pdftohtml ( http://pdftohtml.sourceforge.net ), some ad-hoc scripts, tidy ( http://tidy.sourceforge.net/ ), gnuhtml2latex ( http://packages.debian.org/unstable/text/gnuhtml2latex ) and lyx ( http://www.lyx.org ). The results are acceptable, but it's a lengthy process (about an hour per book, mostly spent adapting the ad-hoc scripts so they join lines correctly and detect chapter headings).
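To give an idea of what the "join lines and detect chapter headings" step involves, here is a minimal sketch in Python. It is not the actual ad-hoc scripts (those had to be adapted for each book); the heading pattern, the function names and the hyphen handling are all assumptions made for illustration, and it works on plain text lines rather than on the HTML that pdftohtml produces.

```python
import re

# Assumed heading pattern: lines such as "Chapter 3" or "PART IV".
# A real book usually needs its own pattern.
HEADING_RE = re.compile(r"^(chapter|part)\s+(\d+|[ivxlc]+)\b", re.IGNORECASE)


def is_heading(line):
    """Return True if the line looks like a chapter heading."""
    return bool(HEADING_RE.match(line.strip()))


def join_lines(lines):
    """Merge hard-wrapped lines into paragraphs, keeping headings apart.

    Returns a list of (kind, text) tuples, kind being "heading" or "para".
    A blank line ends the current paragraph.
    """
    blocks = []
    current = []

    def flush():
        # Emit the paragraph collected so far, if any.
        if current:
            blocks.append(("para", " ".join(current)))
            current.clear()

    for raw in lines:
        line = raw.strip()
        if not line:
            flush()
        elif is_heading(line):
            flush()
            blocks.append(("heading", line))
        elif current and current[-1].endswith("-"):
            # Crude de-hyphenation: glue a word split across two lines.
            # This also (wrongly) merges genuine hyphens at line ends.
            current[-1] = current[-1][:-1] + line
        else:
            current.append(line)
    flush()
    return blocks


if __name__ == "__main__":
    sample = [
        "Chapter 1",
        "It was a dark and",
        "stormy night; the rain fell in tor-",
        "rents.",
        "",
        "Next paragraph.",
    ]
    for kind, text in join_lines(sample):
        print(kind, "|", text)
```

Most of the hour per book goes into tuning things like HEADING_RE and the paragraph-break rule, since every PDF lays out its text differently.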

I've found an alternative: a plug-in for AbiWord (a lean and portable word processor) that imports PDFs using some heuristics (and the heuristics seem to be well chosen, so they should be generally applicable). It supports styles, multiple columns, etc.

It's incredible. As an example, the author has posted images of a document before importing (PDF) and after (AbiWord); see the attached images.

For a description of what it does:
http://www.abisource.com/twiki/bin/v...luginWithStyle

To download the sources of the pdf import plug-in and try it:
http://jauco.nl/blog/

Caution: I've just found it, so I have not tested it yet. As I have some spare time I'll try it ;-).

Tell me what you think about it ;-).

Last edited by Antartica; 09-05-2006 at 03:29 AM.