Old 07-02-2012, 02:36 PM   #1
Borodin
Conversion from two-column PDF

I have read the sticky regarding this and suspect the answer to my question is that there is no current solution. I've just come here to make sure.

I have a PDF document that has two columns of text per page with a central gutter. I pushed it through the document converter using all default settings, and the output contains text reading from left to right across the page, ignoring the gutter.

Is there any solution to this, or has anyone begun to work on a fix? I am happy to roll up my sleeves and write a little Python if that's what it takes.
Old 07-02-2012, 05:11 PM   #2
theducks
Moderator Notice
Did you bother to read this sticky? https://www.mobileread.com/forums/sho...d.php?t=118605
Old 07-02-2012, 05:30 PM   #3
PeterT
One option might be to use a tool like Briss to crop the pages.

See http://sourceforge.net/projects/briss/
Old 07-02-2012, 08:53 PM   #4
ldolse
See the last paragraph in the sticky if you want to contribute or experiment. The new engine already handles two-column layouts, but overall it does a worse job of laying out the HTML to match the PDF than the existing engine.
Old 07-04-2012, 06:57 AM   #5
Borodin
Quote:
Originally Posted by theducks
Did you bother to read this sticky? https://www.mobileread.com/forums/sho...d.php?t=118605
Yes, I 'bothered' to read it, as I said in my post. The sticky doesn't make it clear that this specific problem is unsolved, and I thought it prudent to ask before I launched into writing code to resolve something that was already fixed.

Is this the quality of moderation to expect on this forum?
Old 07-04-2012, 07:01 AM   #6
Borodin
Quote:
Originally Posted by ldolse
See the last paragraph in the sticky if you want to contribute or experiment. The new engine already handles two-column layouts, but overall it does a worse job of laying out the HTML to match the PDF than the existing engine.
Thanks ldolse. Does 'the new engine' mean the one that comes with the latest release, which I already have? I am not too interested in layout, as my input is simply two-column streamed text.
Old 07-04-2012, 07:02 AM   #7
Borodin
Quote:
Originally Posted by PeterT
One option might be to use a tool like Briss to crop the pages.

See http://sourceforge.net/projects/briss/
Nice. I will give this a try.
Old 07-04-2012, 07:14 AM   #8
ldolse
The new engine is in every release, but it's disabled by default. Instructions on how to use it are in the sticky. Basically, you need to do the conversion from the command line with two specific arguments:
--new-pdf-engine
--debug-pipeline "/some/directory"

For all the cli arguments and examples refer to the manual:
http://manual.calibre-ebook.com/cli/...#ebook-convert
http://manual.calibre-ebook.com/conv...ormatting-demo

The above won't generate an actual ebook for you, it will just dump an html version of the pdf into the debug folder. You can convert the html to an ebook by re-importing it to Calibre and converting from zip to whatever.

You'll still wish the layout was better after you see the output, but for two column it will probably give you manually salvageable results, which the production version won't do. By layout I mean many paragraphs will probably need to be manually fixed - the sentences will be correct, but it's not good at detecting appropriate paragraph breaks/page breaks as I recall.
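
For anyone who would rather script this than type it, here is a minimal sketch of the invocation described above, driven from Python. The file names and the debug directory are placeholders, and the output file argument is an assumption on my part (ebook-convert expects an output name); check the manual for the options available in your release.

```python
import shlex
import subprocess


def build_convert_command(pdf_path, out_path, debug_dir):
    """Build an ebook-convert command line using the two arguments named above."""
    return [
        "ebook-convert",
        pdf_path,
        out_path,
        "--new-pdf-engine",
        "--debug-pipeline", debug_dir,
    ]


def run_conversion(pdf_path, out_path, debug_dir):
    """Run the conversion; raises CalledProcessError if ebook-convert fails."""
    subprocess.run(build_convert_command(pdf_path, out_path, debug_dir), check=True)


# Show the command that would be run (placeholder paths, nothing is executed here).
print(shlex.join(build_convert_command("input.pdf", "output.epub", "/some/directory")))
```

Call `run_conversion(...)` with real paths once the printed command looks right; the HTML dump should then appear in the debug directory.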

Last edited by ldolse; 07-04-2012 at 07:16 AM.
Old 07-04-2012, 07:18 AM   #9
kiwidude
Calibre Plugins Developer
@Borodin - obviously theducks just missed that part of your post. You have to appreciate that there are an enormous number of repeated posts every week from people who *don't* bother to read the pdf sticky or think that somehow their case is different. You happened to get hit by theducks on autopilot. Birdstrike

If you are wanting to continue the technical discussion along the lines of you adding the feature yourself this thread would be better placed in the Development forum.

Whether that new PDF engine is going to give you much joy in this regard, though, is another matter. Kovid recently upgraded a number of the PDF components, and the latest version that ships with calibre has some terminal issues that I found with my Extract ISBN plugin (causing unmanaged code crashes). I changed my plugin to use some different PDF components, as did Kovid for calibre, as he has other priorities to work on. I am sure he can give you a more definitive answer.

Last edited by kiwidude; 07-04-2012 at 07:26 AM. Reason: Fix the plugin name
Old 07-04-2012, 07:32 AM   #10
Borodin
Quote:
Originally Posted by kiwidude
If you are wanting to continue the technical discussion along the lines of you adding the feature yourself this thread would be better placed in the Development forum.
Thanks Kiwi, I will do that. I haven't looked at how the conversion process pipes through HTML, but I am fluent with Python, and as long as the HTML is XHTML I am sure XSLT will be useful.

It sounds like the PDF engine is a work in progress, and I am anxious to contribute where I can without creating an unmergeable branch.
Old 07-04-2012, 07:39 AM   #11
kiwidude
I've moved the thread.

As I mentioned above, the initial problem you will have is that the engine is written in C++ to produce the (x?)html output. And last time I tried it (via my plugins) it was regularly falling over with nasty crashes. So unless you are prepared to roll up your sleeves on the C++ side and figure out why that is, you may not get past the starting gate.
Old 07-04-2012, 08:03 AM   #12
ldolse
I didn't realize the C++ portion had been updated and was buggy now. Assuming you get past that part, you will get perfect XHTML. Basically what Poppler does (when it works) is create a nice XHTML file which will render exactly like the original PDF. But aside from rendering perfectly and providing lots of info about the layout, it's still no good because it still has all the limitations of PDF. The reflow code then tries to turn all the sentence fragments into paragraphs. So the debug output will actually have the original XHTML from Poppler along with the reflowed version.
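
To give a feel for what "turning sentence fragments into paragraphs" involves, here is a toy sketch in Python. It is not calibre's reflow code. It assumes each fragment is a hypothetical `(top, height, text)` tuple from a single column, and it starts a new paragraph whenever the vertical gap to the previous line is noticeably larger than that line's height. The `gap_factor` threshold is an arbitrary illustrative value.

```python
def join_fragments(lines, gap_factor=1.5):
    """Group (top, height, text) line fragments into paragraph strings.

    A new paragraph starts when the distance between the tops of two
    consecutive lines exceeds gap_factor times the previous line's height.
    """
    paragraphs = []
    current = []
    prev_top = prev_height = None
    # Process the fragments from the top of the page downwards.
    for top, height, text in sorted(lines, key=lambda frag: frag[0]):
        if current and (top - prev_top) > gap_factor * prev_height:
            paragraphs.append(" ".join(current))
            current = []
        current.append(text.strip())
        prev_top, prev_height = top, height
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


sample = [
    (100, 12, "First line of"),
    (114, 12, "a paragraph."),
    (150, 12, "Second paragraph."),
]
print(join_fragments(sample))
```

A real document needs more than this (indents, hyphenation, page breaks), which is presumably where the manual clean-up mentioned earlier in the thread comes from.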

Last edited by ldolse; 07-04-2012 at 08:15 AM.
Old 07-04-2012, 08:49 AM   #13
kovidgoyal
creator of calibre
The C++ part was removed; it now uses

pdftohtml -xml

to generate the layout XML. However, I haven't gotten around to migrating the Python code that reads the XML and converts it to HTML. That should be fairly simple to do. The Python code still expects the old version of the XML, so you will need to change it slightly.
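
As a starting point for experimenting with that XML from Python, here is a rough sketch that reads it and emits the text of a two-column page in reading order. This is not calibre's code. The element and attribute names (`page`, `text`, `top`, `left`, `width`) are what I understand `pdftohtml -xml` to produce, so check them against your own output, and the "gutter is at the page midpoint" rule is a simplifying assumption. The inline `SAMPLE` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for pdftohtml -xml output (illustrative only).
SAMPLE = """<pdf2xml>
<page number="1" top="0" left="0" height="1188" width="918">
<text top="100" left="60" width="380" height="14" font="0">Left one</text>
<text top="100" left="480" width="380" height="14" font="0">Right one</text>
<text top="120" left="60" width="380" height="14" font="0">Left <i>two</i></text>
<text top="120" left="480" width="380" height="14" font="0">Right two</text>
</page>
</pdf2xml>"""


def page_lines_in_reading_order(page):
    """Return one page's text lines: whole left column first, then the right."""
    midpoint = int(page.get("width")) / 2
    left, right = [], []
    for el in page.iter("text"):
        line = (int(el.get("top")), "".join(el.itertext()).strip())
        # Assumption: the gutter sits at the horizontal centre of the page.
        (left if int(el.get("left")) < midpoint else right).append(line)
    # Within each column, order lines from top to bottom.
    return [text for _, text in sorted(left) + sorted(right)]


def reading_order(xml_string):
    """Return all text lines of the document, page by page, column by column."""
    root = ET.fromstring(xml_string)
    lines = []
    for page in root.iter("page"):
        lines.extend(page_lines_in_reading_order(page))
    return lines


print(reading_order(SAMPLE))
```

Full-width headings, footnotes, and pages with an off-centre gutter would all need extra handling beyond this midpoint test.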
Old 07-04-2012, 08:55 AM   #14
avantman42
If you have access to pdftohtml & sed, I use the following:

Code:
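# Dump the PDF's text layout as XML (writes INPUT_FILE.xml); -i ignores images
pdftohtml -c -s -i -xml INPUT_FILE.pdf
# Strip every tag, leaving only the text
sed -e 's/<[^>]*>//g' INPUT_FILE.xml > OUTPUT_FILE.txt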
That usually gives a reasonable text file, which can then be worked on if needed and converted to whatever format you wish using ebook-convert.
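
If you would rather do the tag-stripping step in Python, a minimal equivalent might look like the sketch below. It uses the same "delete anything between angle brackets" pattern as the sed command and additionally decodes entities such as `&amp;`, which a plain tag strip leaves in the text. The function name is my own.

```python
import html
import re

# Same idea as the sed expression: match anything that looks like a tag.
TAG = re.compile(r"<[^>]*>")


def strip_tags(xml_text):
    """Remove tag-like spans, then decode entities such as &amp; and &lt;."""
    return html.unescape(TAG.sub("", xml_text))


print(strip_tags('<text top="1" left="2"><b>Fish &amp; chips</b></text>'))
```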
Old 07-04-2012, 10:02 AM   #15
theducks
Quote:
Originally Posted by Borodin
Yes, I 'bothered' to read it, as I said in my post. The sticky doesn't make it clear that this specific problem is unsolved, and I thought it prudent to ask before I launched into writing code to resolve something that was already fixed.

Is this the quality of moderation to expect on this forum?
Sorry, missed the first paragraph.