07-02-2012, 02:36 PM | #1 |
Member
Posts: 12
Karma: 10
Join Date: Jul 2012
Device: Android mobile phone
|
Conversion from two-column PDF
I have read the sticky regarding this and suspect the answer to my question is that there is no current solution. I've just come here to make sure.
I have a PDF document that has two columns of text per page with a central gutter. I pushed it through the document converter using all default settings, and the output contains text reading from left to right across the page, ignoring the gutter. Is there any solution to this, or has anyone begun to work on a fix? I am happy to roll up my sleeves and write a little Python if that's what it takes. |
07-02-2012, 05:11 PM | #2 |
Well trained by Cats
Posts: 29,812
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Moderator Notice
Did you bother to read this sticky? https://www.mobileread.com/forums/sho...d.php?t=118605 |
07-02-2012, 05:30 PM | #3 |
Grand Sorcerer
Posts: 12,171
Karma: 73448616
Join Date: Nov 2007
Location: Toronto
Device: Nexus 7, Clara, Touch, Tolino EPOS
|
One option might be to use a tool like Briss to crop the pages.
See http://sourceforge.net/projects/briss/ |
07-02-2012, 08:53 PM | #4 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Last paragraph in the sticky if you want to contribute or experiment - the new engine already handles two column, but overall it does a worse job at laying out the html to match the pdf than the existing engine.
|
07-04-2012, 06:57 AM | #5 | |
Member
Posts: 12
Karma: 10
Join Date: Jul 2012
Device: Android mobile phone
|
Quote:
Is this the quality of moderation to expect on this forum? |
|
07-04-2012, 07:01 AM | #6 |
Member
Posts: 12
Karma: 10
Join Date: Jul 2012
Device: Android mobile phone
|
Thanks ldolse. Does 'the new engine' mean the one that comes with the latest release, which I already have? I am really not too interested in layout, as my input is simply two-column streamed text.
|
07-04-2012, 07:02 AM | #7 | |
Member
Posts: 12
Karma: 10
Join Date: Jul 2012
Device: Android mobile phone
|
Quote:
|
|
07-04-2012, 07:14 AM | #8 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
The new engine is in every release, but it's disabled by default - instructions on how to use it are in the sticky - basically you need to do the conversion from the command line with two specific arguments:
--new-pdf-engine --debug-pipeline "/some/directory"
For all the cli arguments and examples refer to the manual:
http://manual.calibre-ebook.com/cli/...#ebook-convert
http://manual.calibre-ebook.com/conv...ormatting-demo
The above won't generate an actual ebook for you, it will just dump an html version of the pdf into the debug folder. You can convert the html to an ebook by re-importing it to Calibre and converting from zip to whatever.
You'll still wish the layout was better after you see the output, but for two column it will probably give you manually salvageable results, which the production version won't do. By layout I mean many paragraphs will probably need to be manually fixed - the sentences will be correct, but it's not good at detecting appropriate paragraph breaks/page breaks as I recall.
Last edited by ldolse; 07-04-2012 at 07:16 AM. |
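For anyone scripting this, here is a minimal Python sketch of how the command described above might be assembled. The two options are the ones quoted in the post. The function name, the file names and the placeholder output file are assumptions made for illustration (ebook-convert takes an input and an output path before its options), and whether a given calibre release still accepts --new-pdf-engine is not something this sketch checks.

```python
import shlex


def build_debug_convert_command(pdf_path, output_path, debug_dir):
    """Return an argv list for ebook-convert using the two options from the post.

    build_debug_convert_command is a hypothetical helper, not part of calibre.
    """
    return [
        "ebook-convert",
        pdf_path,
        output_path,            # placeholder output; the debug folder is what matters here
        "--new-pdf-engine",     # option quoted in the post above
        "--debug-pipeline", debug_dir,
    ]


if __name__ == "__main__":
    cmd = build_debug_convert_command("book.pdf", "book.epub", "/some/directory")
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(cmd))
    # To actually run it (assumes ebook-convert is on PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when the debug directory contains spaces.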
07-04-2012, 07:18 AM | #9 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
@Borodin - obviously theducks just missed that part of your post. You have to appreciate that there are an enormous number of repeated posts every week from people who *don't* bother to read the pdf sticky or think that somehow their case is different. You happened to get hit by theducks on autopilot. Birdstrike
If you want to continue the technical discussion along the lines of adding the feature yourself, this thread would be better placed in the Development forum. Whether that new PDF engine is going to give you much joy in this regard is another matter, though. Kovid recently upgraded a number of the pdf components, and the latest version that ships with calibre has some terminal issues I found with my Extract ISBN plugin (causing unmanaged code crashes). I changed my plugin to use some different pdf components, as did Kovid for calibre, as he has other priorities to work on. I am sure he can give you a more definitive answer.
Last edited by kiwidude; 07-04-2012 at 07:26 AM. Reason: Fix the plugin name |
07-04-2012, 07:32 AM | #10 | |
Member
Posts: 12
Karma: 10
Join Date: Jul 2012
Device: Android mobile phone
|
Quote:
It sounds like the PDF engine is a work in progress and I am anxious to contribute where I can without creating an unmergeable branch. |
|
07-04-2012, 07:39 AM | #11 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
I've moved the thread.
As I mentioned above the initial problem you will have is that the engine is written in C++ to produce the (x?)html output. And last time I tried it (via my plugins) it was regularly falling over with nasty crashes. So unless you are prepared to roll up your sleeves on the C++ side and figure out why that is you may not get past the starting gate. |
07-04-2012, 08:03 AM | #12 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
I didn't realize the C++ portion had been updated and was buggy now - assuming you get past that part you will get perfect xhtml. Basically what poppler does (when it works) is create a nice xhtml file which will render exactly like the original pdf. But aside from rendering perfectly and providing lots of info about the layout it's still no good because it still has all the limitations of pdf. The reflow code then tries to turn all the sentence fragments into paragraphs. So the debug output will actually have the original xhtml from Poppler along with the reflowed version.
Last edited by ldolse; 07-04-2012 at 08:15 AM. |
07-04-2012, 08:49 AM | #13 |
creator of calibre
Posts: 43,864
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
The C++ part was removed, it now uses pdftohtml -xml to generate the layout XML. However, I haven't gotten around to migrating the python code that reads the XML and converts it to HTML. That should be fairly simple to do. The python code still expects the old version of the XML, so you will need to change it slightly. |
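To make the idea of reading the layout XML concrete, here is a rough Python sketch of one way to put two-column text into reading order. It is not calibre's reflow code. It assumes the XML has `page` elements with a `width` attribute and `text` children carrying `top` and `left` attributes, as pdftohtml -xml output usually does, and it assumes a simple layout where the gutter sits at the horizontal middle of the page. The sample XML and function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape pdftohtml -xml is assumed to produce.
SAMPLE = """<pdf2xml>
<page number="1" top="0" left="0" height="1000" width="800">
<text top="100" left="50" width="300" height="12" font="0">Left one</text>
<text top="100" left="450" width="300" height="12" font="0">Right one</text>
<text top="120" left="50" width="300" height="12" font="0">Left two</text>
<text top="120" left="450" width="300" height="12" font="0">Right two</text>
</page>
</pdf2xml>"""


def page_lines_in_reading_order(page):
    """Return the page's text lines: whole left column first, then the right."""
    midpoint = int(page.get("width")) / 2  # assumed gutter position
    left, right = [], []
    for node in page.iter("text"):
        content = "".join(node.itertext()).strip()
        if not content:
            continue
        entry = (int(node.get("top")), int(node.get("left")), content)
        (left if entry[1] < midpoint else right).append(entry)
    # Within each column, sort top to bottom (then left to right).
    return [content for _, _, content in sorted(left) + sorted(right)]


def document_lines(xml_source):
    """Parse layout XML and return all lines, page by page, in column order."""
    root = ET.fromstring(xml_source)
    lines = []
    for page in root.iter("page"):
        lines.extend(page_lines_in_reading_order(page))
    return lines


if __name__ == "__main__":
    for line in document_lines(SAMPLE):
        print(line)
```

A fixed midpoint split will misplace headings or figures that span both columns, so a real implementation would need to detect the gutter per page rather than assume it.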
07-04-2012, 08:55 AM | #14 |
Wizard
Posts: 1,090
Karma: 6058305
Join Date: Sep 2010
Location: UK
Device: Kindle Paperwhite
|
If you have access to pdftohtml & sed, I use the following:
Code:
pdftohtml -c -s -i -xml INPUT_FILE.pdf
sed -e 's/<[^>]*>//g' INPUT_FILE.xml > OUTPUT_FILE.txt
|
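For those without sed, the tag-stripping step could be approximated in Python along these lines. This is only a sketch: it uses the same regular expression as the sed command, and additionally decodes entities such as `&amp;`, which the sed version leaves as-is. The function name and file names are placeholders.

```python
import html
import re

# Same pattern as the sed expression: anything between < and >.
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_tags(markup):
    """Remove markup tags and decode character entities in the remaining text."""
    return html.unescape(TAG_PATTERN.sub("", markup))


if __name__ == "__main__":
    # INPUT_FILE.xml and OUTPUT_FILE.txt are placeholders, as in the post above.
    with open("INPUT_FILE.xml", encoding="utf-8") as source:
        text = strip_tags(source.read())
    with open("OUTPUT_FILE.txt", "w", encoding="utf-8") as target:
        target.write(text)
```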
07-04-2012, 10:02 AM | #15 | |
Well trained by Cats
Posts: 29,812
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
|
|
|