Conversion from two-column PDF

Borodin · 07-02-2012, 02:36 PM

I have read the sticky regarding this and suspect the answer to my question is that there is no current solution. I've just come here to make sure.

I have a PDF document that has two columns of text per page with a central gutter. I pushed it through the document converter using all default settings, and the output contains text reading from left to right across the page, ignoring the gutter.

Is there any solution to this, or has anyone begun to work on a fix? I am happy to roll up my sleeves and write a little Python if that's what it takes.

theducks · 07-02-2012, 05:11 PM

Moderator Notice
Did you bother to read this sticky? https://www.mobileread.com/forums/sho...d.php?t=118605

PeterT · 07-02-2012, 05:30 PM

One option might be to use a tool like Briss to crop the pages.

See http://sourceforge.net/projects/briss/

ldolse · 07-02-2012, 08:53 PM

Last paragraph in the sticky if you want to contribute or experiment - the new engine already handles two column, but overall it does a worse job at laying out the html to match the pdf than the existing engine.

Borodin · 07-04-2012, 06:57 AM

Quote:

Originally Posted by theducks

Did you bother to read this sticky? https://www.mobileread.com/forums/sho...d.php?t=118605

Yes, I 'bothered' to read it, as I said in my post. The sticky doesn't make it clear that this specific problem is unsolved, and I thought it prudent to ask before I launched into writing code to resolve something that was already fixed.

Is this the quality of moderation to expect on this forum?

Borodin · 07-04-2012, 07:01 AM

Quote:

Originally Posted by ldolse

Last paragraph in the sticky if you want to contribute or experiment - the new engine already handles two column, but overall it does a worse job at laying out the html to match the pdf than the existing engine.

Thanks ldolse. Does 'the new engine' mean the one that comes with the latest release, which I already have? I am really not too interested in layout, as my input is simply two-column streamed text.

Borodin · 07-04-2012, 07:02 AM

Quote:

Originally Posted by PeterT

One option might be to use a tool like Briss to crop the pages.

See http://sourceforge.net/projects/briss/

Nice. I will give this a try.

ldolse · 07-04-2012, 07:14 AM

The new engine is in every release, but it's disabled by default - instructions on how to use it are in the sticky - basically you need to do the conversion from the command line with two specific arguments:
--new-pdf-engine
--debug-pipeline "/some/directory"

For all the cli arguments and examples refer to the manual:
http://manual.calibre-ebook.com/cli/...#ebook-convert
http://manual.calibre-ebook.com/conv...ormatting-demo

The above won't generate an actual ebook for you, it will just dump an html version of the pdf into the debug folder. You can convert the html to an ebook by re-importing it to Calibre and converting from zip to whatever.

You'll still wish the layout was better after you see the output, but for two column it will probably give you manually salvageable results, which the production version won't do. By layout I mean many paragraphs will probably need to be manually fixed - the sentences will be correct, but it's not good at detecting appropriate paragraph breaks/page breaks as I recall.
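For anyone scripting this, here is a rough sketch that only assembles the argument list for the invocation described above. The helper name `build_convert_command` and the file names are made up for the example, and the two flags are taken exactly as quoted in this post, so it is worth checking them against `ebook-convert --help` for your release before relying on them.

```python
def build_convert_command(pdf_path, output_path, debug_dir):
    """Assemble an ebook-convert argument list using the two flags from this post.

    Hypothetical helper: it builds the command but does not run it.
    """
    return [
        "ebook-convert",
        pdf_path,
        output_path,
        "--new-pdf-engine",             # flag as quoted in the post, not verified here
        "--debug-pipeline", debug_dir,  # intermediate HTML is dumped into this folder
    ]


cmd = build_convert_command("input.pdf", "output.epub", "/some/directory")
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would run the conversion, assuming calibre's command line tools are on the PATH.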

kiwidude · 07-04-2012, 07:18 AM

@Borodin - obviously theducks just missed that part of your post. You have to appreciate that there are an enormous number of repeated posts every week from people who *don't* bother to read the pdf sticky or think that somehow their case is different. You happened to get hit by theducks on autopilot. Birdstrike

If you are wanting to continue the technical discussion along the lines of you adding the feature yourself this thread would be better placed in the Development forum.

Whether that new PDF engine is going to give you much joy in this regard though is another matter. Kovid recently upgraded a number of the pdf components and the latest version that ships with calibre has some terminal issues I found with my Extract ISBN plugin (causing unmanaged code crashes). I changed my plugin to use some different pdf components as did Kovid for calibre as he has other priorities to work on. I am sure he can give you a more definitive answer.

Borodin · 07-04-2012, 07:32 AM

Quote:

Originally Posted by kiwidude

If you are wanting to continue the technical discussion along the lines of you adding the feature yourself this thread would be better placed in the Development forum.

Thanks Kiwi, I will do that.

I haven't looked at how the conversion process pipes through HTML but I am fluent with Python, and as long as the HTML is XHTML I am sure XSLT will be useful.

It sounds like the PDF engine is a work in progress and I am anxious to contribute where I can without creating an unmergeable branch.

kiwidude · 07-04-2012, 07:39 AM

I've moved the thread.

As I mentioned above the initial problem you will have is that the engine is written in C++ to produce the (x?)html output. And last time I tried it (via my plugins) it was regularly falling over with nasty crashes. So unless you are prepared to roll up your sleeves on the C++ side and figure out why that is you may not get past the starting gate.

ldolse · 07-04-2012, 08:03 AM

I didn't realize the C++ portion had been updated and was buggy now - assuming you get past that part you will get perfect xhtml. Basically what poppler does (when it works) is create a nice xhtml file which will render exactly like the original pdf. But aside from rendering perfectly and providing lots of info about the layout it's still no good because it still has all the limitations of pdf. The reflow code then tries to turn all the sentence fragments into paragraphs. So the debug output will actually have the original xhtml from Poppler along with the reflowed version.
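To make the reflow step a bit more concrete, here is a minimal sketch of one way to merge line fragments into paragraphs. This is not calibre's reflow code. The heuristic (start a new paragraph when the distance to the previous line exceeds 1.5 times its height) and the names `Line` and `join_paragraphs` are assumptions for illustration only.

```python
from typing import List, NamedTuple


class Line(NamedTuple):
    top: float     # distance of the line from the top of the page
    height: float  # height of the line
    text: str


def join_paragraphs(lines: List[Line], gap_factor: float = 1.5) -> List[str]:
    """Merge consecutive lines; a large vertical gap starts a new paragraph."""
    paragraphs: List[str] = []
    current: List[str] = []
    prev = None
    for line in lines:
        # Assumed heuristic: a jump of more than gap_factor line heights
        # between successive line tops marks a paragraph break.
        if prev is not None and line.top - prev.top > gap_factor * prev.height:
            paragraphs.append(" ".join(current))
            current = []
        current.append(line.text.strip())
        prev = line
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

A fixed threshold like this will misjudge documents with tight paragraph spacing or indented first lines, which is roughly the kind of manual cleanup mentioned earlier in the thread.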

kovidgoyal · 07-04-2012, 08:49 AM

The C++ part was removed, it now uses

pdftohtml -xml

to generate the layout XML. However, I haven't gotten around to migrating the python code that reads the XML and converts it to HTML. That should be fairly simple to do. The python code still expects the old version of the XML, so you will need to change it slightly.
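As a starting point for the two-column case, here is a hedged sketch of reading that layout XML and emitting the left column before the right one. It assumes each `page` element carries a `width` attribute and each `text` element carries `top` and `left` attributes, and that the gutter sits at the horizontal centre of the page. The function name `two_column_text` is hypothetical and this is not the Python code that ships with calibre.

```python
import xml.etree.ElementTree as ET


def two_column_text(xml_string):
    """Return one string per page, with the left column read before the right."""
    root = ET.fromstring(xml_string)
    pages = []
    for page in root.iter("page"):
        gutter = float(page.get("width")) / 2  # assumption: gutter is centred
        left, right = [], []
        for elem in page.iter("text"):
            # itertext() also collects text inside inline children such as <b> or <i>
            content = "".join(elem.itertext()).strip()
            if not content:
                continue
            entry = (float(elem.get("top")), float(elem.get("left")), content)
            (left if entry[1] < gutter else right).append(entry)
        # Within each column, read top to bottom, then left to right
        ordered = sorted(left) + sorted(right)
        pages.append("\n".join(text for _, _, text in ordered))
    return pages
```

Full-width elements such as titles start left of the gutter, so this sketch files them under the left column; a real implementation would need to detect them separately.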

avantman42 · 07-04-2012, 08:55 AM

If you have access to pdftohtml & sed, I use the following:

Code:

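# -c complex layout, -s single output document, -i ignore images, -xml XML output
pdftohtml -c -s -i -xml INPUT_FILE.pdf
# Strip every tag, leaving only the text
sed -e 's/<[^>]*>//g' INPUT_FILE.xml > OUTPUT_FILE.txt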

That usually gives a reasonable text file, which can then be worked on if needed and converted to whatever format you wish using ebook-convert.
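One thing the sed expression leaves behind is XML entities such as `&amp;`, which may show up in the text file. Below is a small Python sketch of the same tag stripping with entity decoding added; `strip_tags` is a hypothetical helper name, not part of any tool mentioned here.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]*>")  # same pattern as the sed expression above


def strip_tags(xml_text):
    """Remove every tag, then turn entities such as &amp; back into characters."""
    return html.unescape(TAG_RE.sub("", xml_text))


print(strip_tags('<text top="1"><b>Fish &amp; chips</b></text>'))
```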

theducks · 07-04-2012, 10:02 AM

Quote:

Originally Posted by Borodin

Yes, I 'bothered' to read it, as I said in my post. The sticky doesn't make it clear that this specific problem is unsolved, and I thought it prudent to ask before I launched into writing code to resolve something that was already fixed.

Is this the quality of moderation to expect on this forum?

Sorry, missed the first paragraph.

