howto: importing PDFs to a word processor
Antartica 09-05-2006, 04:27 AM I've been looking for an easy way to convert PDFs. Until now I was using a pdf2html program and post-processing its output, with mixed results. For the curious, this is what I used to convert some PDFs so that they are nice to read on the Iliad (11cmx15cm, etc):
pdftohtml (http://pdftohtml.sourceforge.net), some ad-hoc scripts, tidy (http://tidy.sourceforge.net/), gnuhtml2latex (http://packages.debian.org/unstable/text/gnuhtml2latex) and lyx (http://www.lyx.org). The results are acceptable, but it's a lengthy process (about an hour per book, mostly spent adapting the ad-hoc scripts so that they join lines correctly and detect chapter headings).
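To give an idea of what such an ad-hoc script can look like, here is a minimal sketch. It is not one of the scripts actually used (those were never posted). It assumes plain-text input, that chapter headings look like "Chapter 1", and that a blank line ends a paragraph; the function name join_lines and the "== ... ==" heading marker are made up for the example.

```shell
# Join the hard-wrapped lines of each paragraph and mark chapter headings.
# Assumptions: headings start with "Chapter <number>", paragraphs are
# separated by blank lines, and a trailing "-" means a word was split.
join_lines() {
  awk '
    # Print the paragraph collected so far, if any.
    function flush() { if (buf != "") { print buf; buf = "" } }

    # Heading-like line: close the current paragraph and mark the heading.
    /^(Chapter|CHAPTER) [0-9IVXLC]+/ { flush(); print "== " $0 " =="; next }

    # Blank line: the current paragraph is finished.
    /^[[:space:]]*$/ { flush(); next }

    # Ordinary line: append it to the paragraph, undoing hyphenation.
    {
      if (buf == "")        buf = $0
      else if (buf ~ /-$/)  buf = substr(buf, 1, length(buf) - 1) $0
      else                  buf = buf " " $0
    }

    END { flush() }
  '
}

# Example run on a tiny made-up input.
printf 'Chapter 1\nThis is a para-\ngraph split over\nthree lines.\n\nNext paragraph.\n' | join_lines
```

Real books need more rules than this (page numbers, running headers, words that legitimately end in a hyphen), which is where the hour per book goes.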
I've found an alternative: a plug-in for AbiWord (a lean and portable word processor) that imports PDFs using some heuristics (and the heuristics seem to be well chosen, general enough to apply to most documents). It supports styles, multiple columns, etc.
It's incredible. As an example, the author has posted some images from before (PDF) and after (AbiWord) importing; see the attached images.
For a description of what it does:
http://www.abisource.com/twiki/bin/view/Abiword/PDFImportPluginWithStyle
To download the sources of the pdf import plug-in and try it:
http://jauco.nl/blog/
Caution: I've just found it, so I have not tested it yet. As I have some spare time I'll try it ;-).
Tell me what you think about it ;-).
If the images depict the general conversion quality of this plugin, then I am really impressed. It's better than most commercial solutions I've seen.
I am curious to hear how it works for you.
vranghel 09-05-2006, 11:25 PM
Seems that my programming illiteracy is quite advanced: how the hell am I supposed to install the patch?
http://www.jauco.nl/SoC/abiword-pdf-style-0.3.patch
http://www.jauco.nl/SoC/poppler-pdf-style-0.3.patch
Those two are supposed to be the plugins, but when I click on them it opens a text file. There's no .dll, no .exe, no nothin' :huh:
I'd really appreciate some help from someone more knowledgeable. :happy2:
Antartica 09-06-2006, 01:10 PM
Some background :scholar: : patch(1) is a UNIX utility usually used to merge a set of modifications into the source code of a released version of a program. The files it reads ("patches") are generated with the diff(1) utility, so they are called either patches or diffs.
Patches are usually geared to programmers or advanced users who are not afraid of downloading source code and compiling it themselves. It's really not very difficult if you have the right tools.
So this is a patch in the old UNIX sense. On Windows it is more common for "patch" to mean a package of replacement files needed to upgrade a program.
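A small round trip may make the diff/patch idea more concrete. This is only a sketch: the directory and file names are made up, and it assumes the usual diff(1), patch(1) and mktemp are installed.

```shell
# Throwaway directory; every file name here is made up for the demo.
workdir=$(mktemp -d)
cd "$workdir"

# A "released" source tree and a modified copy of it.
mkdir -p demo-1.0/src demo-1.0.new/src
printf 'hello\nworld\n' > demo-1.0/src/greeting.txt
printf 'hello\nbrave new world\n' > demo-1.0.new/src/greeting.txt

# Record the differences between the two trees as a unified diff.
# diff exits with status 1 when the trees differ, which is expected here.
diff -ru demo-1.0 demo-1.0.new > demo.patch || true

# Apply the patch from inside the original tree. -p1 strips the leading
# directory name (demo-1.0/ or demo-1.0.new/) from the paths in the patch.
cd demo-1.0
patch -p1 < ../demo.patch

# The original file now has the modified contents.
cat src/greeting.txt
```

The .patch file itself is plain text listing removed and added lines, which is why clicking on one in a browser just shows text.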
And more to the point: see below for detailed instructions on how to apply the patch and compile the program (on Linux, which is what I have installed; on Windows+Cygwin it should be slightly different)... but the instructions are incomplete right now, as I've found that the patched poppler library fails to compile using gcc 3.3.5 :-( .
Anyway, in the next message I describe how to get to that error :sleepy:
Antartica 09-06-2006, 01:16 PM (Partial and) detailed Debian GNU/Linux 3.1 "Sarge" instructions (what I've done):
For patching, compiling and installing the required poppler library:
$ su
# apt-get install cdbs gnome-pkg-tools libgtk2.0-dev libqt3-mt-dev automake1.9 dh-make build-essential dpkg-dev libjpeg62-dev libz-dev fakeroot libxml2-dev
# exit
$ mkdir src.poppler
$ cd src.poppler
$ wget http://poppler.freedesktop.org/poppler-0.5.3.tar.gz
$ wget http://www.jauco.nl/SoC/poppler-pdf-style-0.3.patch
$ tar -xvzf poppler-0.5.3.tar.gz
$ cd poppler-0.5.3
$ patch -p1 < ../poppler-pdf-style-0.3.patch
$ ln -s /usr/include/libxml2/libxml poppler/
$ echo "s" | dh_make
$ sed -i "s/configure /configure --enable-zlib --enable-xpdf-headers /g" debian/rules
$ chmod a+x debian/rules
$ fakeroot debian/rules binary
This should have generated a .deb file that you can install, but it failed to compile, with the following error:
g++ -DHAVE_CONFIG_H -I. -I. -I.. -I. -I.. -I../goo -I/usr/include/freetype2 -Wall -Wno-unused -g -O2 -MT ABWOutputDev.lo -MD -MP -MF .deps/ABWOutputDev.Tpo -c ABWOutputDev.cc -fPIC -DPIC -o .libs/ABWOutputDev.o
ABWOutputDev.cc: In member function `void ABWOutputDev::ATP_recursive(xmlNode*)':
ABWOutputDev.cc:804: error: declaration of `void ABWOutputDev::cleanUpNode(xmlNode*, bool)' outside of class is not definition
It seems to be some construct that gcc 3.3.5 does not accept... I hope to have some time tomorrow to try to debug the offending file, but don't count on it :-(
Once the poppler library compiles, the same has to be done with the AbiWord sources... so there is quite a bit of work left to do.
BTW: Maybe this post should be in hacks/devel :-?
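A note on the sed step in the commands above: it appends extra options to the configure call in debian/rules. The sketch below runs the same kind of substitution on a single sample line instead of the real file; the sample line is hypothetical, and the debian/rules that dh_make generates may look different.

```shell
# Hypothetical configure line of the kind dh_make writes into debian/rules.
sample='./configure --prefix=/usr'

# The pattern "configure " includes the trailing space, so the replacement
# has to end with a space too; otherwise the next option gets glued onto
# --enable-xpdf-headers.
patched=$(printf '%s\n' "$sample" |
  sed "s/configure /configure --enable-zlib --enable-xpdf-headers /g")

printf '%s\n' "$patched"
```

Running the substitution on one line like this, before using sed -i on the real file, is a cheap way to check that the result is what you expect.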
vranghel 09-06-2006, 03:20 PM
Thanks, Antartica, for taking the time to explain.
Unfortunately you lost me at the 2nd post... all that code is Chinese to me :rolleyes5
So there seems to be some kind of error in the patch, as it will not compile.
Hopefully it is not a big issue, because the idea of the plugin is wonderful and I'd really like to see it in action.
That sounds great.
Well, if someone manages to compile it with the patch for Windows, *please* upload the executable. I didn't manage.
vranghel 09-06-2006, 08:17 PM
I second that! :crowngrin
Jauco 11-06-2006, 04:42 AM Hey, I didn't see this thread earlier but I like the positive tone :)
I'm the guy trying to write the pdf plugin. ATM, if you can't install the patch, you probably don't want to, because the program is buggy as some infernal place.
The past 2 months were incredibly busy for me, so I didn't do much work on it, but once I get most of the bugs out of the code, I will try to get it released with the Windows version of AbiWord.
Greets,
Jauco
Antartica 11-06-2006, 05:22 AM Hi Jauco!
Thanks for taking the time to register here and reply :-)
Jauco wrote: "Hey, I didn't see this thread earlier but I like the positive tone :)
I'm the guy trying to write the pdf plugin. ATM, if you can't install the patch, you probably don't want to, because the program is buggy as some infernal place."
Mmm... anyway I would greatly appreciate the needed info for compiling it and experimenting with the buggy version O:-).
I only need a bit of information:
1. The Linux distribution and version that you're using to compile
2. The compiler version
3. The libpoppler and abiword version
I hope that with that information I will be able to replicate your compilation success ;-)
Jauco wrote: "The past 2 months were incredibly busy for me, so I didn't do much work on it, but once I get most of the bugs out of the code, I will try to get it released with the Windows version of AbiWord."
Great! Thanks for taking the time to do such a needed plugin :-)
Antartica
Jauco 11-06-2006, 12:03 PM I'm using a vanilla Ubuntu Linux "Dapper Drake"
compiler: whichever came with Dapper Drake (4.0.3, I think)
poppler source: CVS from back then. I'd suggest using the latest release
abiword source: Doesn't matter. The latest release will be fine.
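Given that the build failed with gcc 3.3.5 and works with the 4.0.x compiler above, a quick check of the installed compiler before starting the build might look like this. It is only a sketch: the helper name gcc_major is made up, and the threshold of 4 is an assumption based solely on the two compiler versions mentioned in this thread.

```shell
# Print the major version number from a version string such as "4.0.3".
gcc_major() {
  printf '%s\n' "${1%%.*}"
}

# Hypothetical pre-build check; the threshold of 4 is an assumption based
# only on the compiler versions mentioned in this thread.
if command -v gcc >/dev/null 2>&1; then
  if [ "$(gcc_major "$(gcc -dumpversion)")" -ge 4 ]; then
    echo "gcc looks recent enough to try the build"
  else
    echo "gcc is older than 4.x; the patched poppler may fail to compile"
  fi
else
  echo "gcc not found"
fi
```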