Old 09-04-2019, 08:50 PM   #1
therealjoeblow
Zealot
Posts: 106
Karma: 52102
Join Date: Jun 2010
Device: Samsung Android Tablet w/Moon+ Pro Reader
PDF (magazine style columns) to ePub Conversion

First, I know that PDF is a bad source to try to convert from. It's the only source I have though.

The PDF is laid out "magazine style" with the text in columns, meant to be read down column 1, then down column 2:

Code:
This is some text               The text in the second
that is in the first            column is formatted like
column.                         this.  I guess you get 
                                the idea.
It has paragraphs
laid out like this.
...And when I convert to ePub, the lines from columns 1 and 2 come out interleaved, scrambling the paragraphs and making the text impossible to read coherently:

This is some text
The text in the second
that is in the first
column is formatted like
column.
this. I guess you get
the idea.
It has paragraphs
laid out like this.

...Basically every 2nd line belongs with a different paragraph from another column.

Is there any way to convert this successfully?

Cheers
TRJB

Last edited by therealjoeblow; 09-04-2019 at 08:52 PM.
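
The interleaving described in the post above can be reproduced with a small sketch. It assumes each extracted line carries a position, here a hypothetical (x, y, text) tuple with made-up coordinates; this is an illustration of the ordering problem, not how calibre or any particular extractor represents text. Reading strictly row by row mixes the columns, while grouping lines by column first restores the intended order.

```python
from collections import defaultdict

# Hypothetical positioned lines: (x, y, text). x is the left edge of the line,
# y grows downward. The coordinates are invented to mirror the example post.
LINES = [
    (0, 0, "This is some text"),
    (32, 0, "The text in the second"),
    (0, 1, "that is in the first"),
    (32, 1, "column is formatted like"),
    (0, 2, "column."),
    (32, 2, "this. I guess you get"),
    (32, 3, "the idea."),
    (0, 4, "It has paragraphs"),
    (0, 5, "laid out like this."),
]


def row_major(lines):
    """Read top to bottom, then left to right: this interleaves the columns."""
    return [text for x, y, text in sorted(lines, key=lambda item: (item[1], item[0]))]


def column_major(lines, gutter_x):
    """Assign each line to a column by its left edge, then read column by column.

    gutter_x is an assumed x position of the gap between the two columns.
    """
    columns = defaultdict(list)
    for x, y, text in lines:
        columns[0 if x < gutter_x else 1].append((y, text))
    ordered = []
    for col in sorted(columns):
        ordered.extend(text for y, text in sorted(columns[col]))
    return ordered
```

Calling row_major(LINES) gives the scrambled order shown in the post, while column_major(LINES, gutter_x=16) returns all of column 1 followed by all of column 2. Real pages need the gutter position to be detected rather than hard-coded, which is where extractors can go wrong.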
Old 09-04-2019, 08:54 PM   #2
theducks
Well trained by Cats
Posts: 31,175
Karma: 60406498
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Some better OCR programs can be configured to deal (mostly) with 2 columns.
Those still need hand holding to determine which SHOULD join the article and which belong elsewhere (eg continued from page n)
Old 09-04-2019, 09:24 PM   #3
kovidgoyal
creator of calibre
Posts: 45,527
Karma: 28548962
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
If you have Microsoft Word, try importing it into that. It has a more sophisticated analysis engine that sometimes handles multi-column PDFs.
Old 09-04-2019, 11:20 PM   #4
therealjoeblow
I just selected all of the PDF text (<ctrl-a>), copied it (<ctrl-c>), and pasted it into a .txt file.

All of the text copied over in the correct order. I then used regex in Notepad++ to clean up the line breaks, as I would have had to anyway, made an epub with calibre, and manually inserted the graphics where they belonged.

A bit tedious, but got it done.

Although that makes me wonder... if a simple <ctrl-a> is able to copy all of the text in the correct order, why is Calibre not able to process it in that way?

That part doesn't make sense since simple 'copy all' is probably one of the most rudimentary text tools and it seems to be able to handle the columns.

Cheers
TRJB
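
The line-break cleanup mentioned in the post above can be sketched in Python. The poster's actual Notepad++ expressions are not given, so the patterns below are assumptions: blank lines are treated as paragraph boundaries, single line breaks inside a paragraph are joined with a space, and a word split by a hyphen at the end of a line is rejoined. The last rule will also merge genuine hyphenated compounds that happen to break at a line end, so the result still needs proofreading.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines into paragraphs (a rough sketch, not a full cleaner)."""
    # Normalise Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words hyphenated across a line break: "para-\ngraph" -> "paragraph".
    # Assumption: every end-of-line hyphen is a soft hyphen added by the layout.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # One or more blank lines mark a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, replace each remaining line break with a single space.
    joined = (
        " ".join(line.strip() for line in para.splitlines() if line.strip())
        for para in paragraphs
    )
    return "\n\n".join(para for para in joined if para)
```

For example, unwrap_paragraphs applied to the pasted text of one column yields one line per paragraph, separated by blank lines, which calibre's TXT input can then convert.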
Old 09-04-2019, 11:53 PM   #5
kovidgoyal
calibre does not extract text from PDFs itself; that is done by pdftohtml from the poppler project. Feel free to ask them why it does not do a better job.
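
For anyone who wants to compare extraction orders outside calibre, here is a minimal sketch using pdftotext, a sibling of pdftohtml in the poppler utilities. It is not what calibre runs, and it assumes poppler's command-line tools are installed and on the PATH. The function names and the mode labels are made up for illustration; the only pdftotext options used are -layout (keep the physical layout) and -raw (content stream order), with the default being poppler's attempt at reading order. Whether any of these handles a given two-column PDF depends on the file.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, out_path, mode="reading"):
    """Build a pdftotext command line for one of three extraction modes.

    mode is a label invented for this sketch:
      "reading" -> no extra flag (pdftotext's default reading-order output)
      "layout"  -> -layout (preserve the physical page layout)
      "raw"     -> -raw (keep text in content stream order)
    """
    cmd = ["pdftotext"]
    if mode == "layout":
        cmd.append("-layout")
    elif mode == "raw":
        cmd.append("-raw")
    elif mode != "reading":
        raise ValueError(f"unknown mode: {mode!r}")
    cmd += [str(pdf_path), str(out_path)]
    return cmd


def extract_text(pdf_path, out_path, mode="reading"):
    """Run pdftotext; raises CalledProcessError if the tool reports failure."""
    subprocess.run(build_pdftotext_cmd(pdf_path, out_path, mode), check=True)


# Usage (requires poppler-utils; file names are placeholders):
#   for mode in ("reading", "layout", "raw"):
#       extract_text("magazine.pdf", f"magazine-{mode}.txt", mode)
```

Comparing the three output files side by side shows quickly whether any mode keeps the columns apart for a particular PDF.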
Old 09-05-2019, 12:28 AM   #6
therealjoeblow
Ok, thanks.

It wasn't meant as a criticism of your awesome work BTW!

Cheers
TRJB
Old 09-05-2019, 12:57 AM   #7
kovidgoyal
Yeah, it's been on my TODO list forever to get rid of pdftohtml and write my own text extraction engine, but...
Old 09-05-2019, 11:24 AM   #8
retiredbiker
Evangelist
Posts: 451
Karma: 3886916
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Kobo Forma
Quote:
Some better OCR programs can be configured to deal (mostly) with 2 columns.
Those still need hand holding to determine which SHOULD join the article and which belong elsewhere (eg continued from page n)
I use the free Tesseract OCR that comes with Ubuntu, and the OCRFeeder GUI front end. Most times, it will handle double columns correctly, but not always. And if there are embedded pictures, breaks with images or huge caps, and so on, I often have to do smaller areas one at a time. The same goes for the "continued on..." case. It can be tedious, but it works pretty well, and is amazingly accurate in detecting paragraph breaks and dealing with end-of-line hyphens. I do it page-by-page, proofing as I go, so I'm not faced with a massive proofread task at the end.

As for the copy/paste technique, I find that all depends on what is in the individual PDF. Some work fine, some are simply impossible. If the text layer is just junk, OCR is the easier answer.
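
The page-by-page approach can also be scripted against the Tesseract command line instead of a GUI. This is only a sketch under several assumptions: the pages have already been exported as image files, Tesseract 4 or later is installed (older versions spell the option -psm), and the helper names are invented here. The --psm option selects the page segmentation mode; 3 is the fully automatic default, and 4 assumes a single column of text, which may suit a region that has been cropped to one column.

```python
import subprocess
from pathlib import Path


def build_tesseract_cmd(image_path, output_base, psm=3, lang="eng"):
    """Build a tesseract command; output goes to <output_base>.txt."""
    return [
        "tesseract",
        str(image_path),
        str(output_base),
        "--psm", str(psm),  # page segmentation mode
        "-l", lang,         # recognition language
    ]


def ocr_pages(image_paths, out_dir, psm=3, lang="eng"):
    """OCR each page image in turn and return the paths of the text files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for image in map(Path, image_paths):
        base = out_dir / image.stem
        subprocess.run(build_tesseract_cmd(image, base, psm, lang), check=True)
        results.append(base.with_name(base.name + ".txt"))
    return results


# Usage (requires tesseract; file names are placeholders):
#   ocr_pages(["page-001.png", "page-002.png"], "ocr-out", psm=3)
```

Running one page at a time keeps the proofreading incremental, as described above; pages where the automatic segmentation mixes the columns can be cropped and rerun with a different mode.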