#1
Zealot
Posts: 106
Karma: 52102
Join Date: Jun 2010
Device: Samsung Android Tablet w/Moon+ Pro Reader
PDF (magazine style columns) to ePub Conversion
First, I know that PDF is a bad source to try to convert from. It's the only source I have though.
The PDF is laid out "magazine style" with the text in columns, meant to be read down column 1, then down column 2:
Code:
    This is some text            The text in the second
    that is in the first         column is formatted like
    column.                      this.

    I guess you get the idea.    It has paragraphs laid out like this.

...Basically every 2nd line belongs with a different paragraph from another column.
Is there any way to convert this successfully?
Cheers
TRJB
Last edited by therealjoeblow; 09-04-2019 at 08:52 PM.
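For anyone hitting the same layout, here is a minimal Python sketch of the reading order described above (down column 1, then down column 2). It assumes the page text is already available as fixed-width lines and that the gutter position is known. The function names and the `split_at=28` value are made up for illustration, and real PDFs rarely split this cleanly.

```python
def split_columns(page_text, split_at):
    """Cut every line at a fixed character position (assumed gutter)."""
    left, right = [], []
    for line in page_text.splitlines():
        left.append(line[:split_at].rstrip())
        right.append(line[split_at:].strip())
    return left, right


def to_paragraphs(lines):
    """Join consecutive non-empty lines; an empty line ends a paragraph."""
    paragraphs, current = [], []
    for line in lines:
        if line:
            current.append(line)
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def reading_order(page_text, split_at):
    """Return all paragraphs of column 1, then all paragraphs of column 2."""
    left, right = split_columns(page_text, split_at)
    return to_paragraphs(left) + to_paragraphs(right)


# Sample page built from the example in the post; 28 is an arbitrary gutter.
ROWS = [
    ("This is some text", "The text in the second"),
    ("that is in the first", "column is formatted like"),
    ("column.", "this."),
    ("", ""),
    ("I guess you get the idea.", "It has paragraphs laid out like this."),
]
PAGE = "\n".join(f"{left:<28}{right}" for left, right in ROWS)

if __name__ == "__main__":
    for paragraph in reading_order(PAGE, 28):
        print(paragraph)
```

The fixed split position is the weak point: it only works when the extracted text preserves the physical layout and every page uses the same gutter.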
#2
Well trained by Cats
Posts: 31,175
Karma: 60406498
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Some better OCR programs can be configured to deal (mostly) with 2 columns.
Those still need hand-holding to determine which blocks SHOULD join the article and which belong elsewhere (e.g. continued from page n).
#3
creator of calibre
Posts: 45,527
Karma: 28548962
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
If you have Microsoft Word, try importing it into that. It has a more sophisticated analysis engine that sometimes handles multi-column PDFs.
#4
Zealot
I just selected all of the PDF with <ctrl-a>, copied it with <ctrl-c>, and pasted it into a .txt file.
All of the text copied over in the correct order. Then I just used regex in Notepad++ to clean up line breaks like I would have to anyway, made an epub with calibre, and manually inserted the graphics where they belonged. A bit tedious, but got it done.
Although that makes me wonder... if a simple <ctrl-a> is able to copy all of the text in the correct order, why is calibre not able to process it in that way? That part doesn't make sense, since a simple 'copy all' is probably one of the most rudimentary text tools and it seems to be able to handle the columns.
Cheers
TRJB
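The post does not say which regex was used in Notepad++, so the following is only a rough Python equivalent of that kind of line-break cleanup, using the standard `re` module. It assumes paragraphs in the pasted text are separated by blank lines; the function name `unwrap_paragraphs` is made up for illustration.

```python
import re


def unwrap_paragraphs(text):
    """Join hard-wrapped lines into paragraphs separated by blank lines."""
    # Normalise Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop trailing spaces/tabs so whitespace-only lines count as blank.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Rejoin words split by a hyphen at a line end ("conver-\nsion").
    # Caveat: this also removes the hyphen from real compounds
    # such as "well-\nknown".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # A single newline (not next to another newline) is just a wrap.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of blank lines to one paragraph break.
    text = re.sub(r"\n{2,}", "\n\n", text)
    # Squeeze repeated spaces left over from the joins.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


if __name__ == "__main__":
    pasted = (
        "This is some text\n"
        "that is in the first\n"
        "column.\n"
        "\n"
        "I guess you get the\n"
        "idea.\n"
    )
    print(unwrap_paragraphs(pasted))
```

The same patterns should carry over to Notepad++ search and replace with little change, but check the result by eye: headings, lists and poetry all look like "wrapped lines" to these rules.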
#5
creator of calibre
calibre does not extract text from PDFs itself; that is done by pdftohtml from the poppler project. Feel free to ask them why it does not do a better job.
#6
Zealot
Ok, thanks.
It wasn't meant as a criticism of your awesome work BTW!
Cheers
TRJB
#7
creator of calibre
Yeah, it's been on my TODO list forever to get rid of pdftohtml and write my own text extraction engine, but...
#8
Evangelist
Posts: 451
Karma: 3886916
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Kobo Forma
Quote:
As for the copy/paste technique, I find that it all depends on what is in the individual PDF. Some work fine, some are simply impossible. If the text layer is just junk, OCR is the easier answer.
Thread | Thread Starter | Forum | Replies | Last Post |
AZW3 to EPUB conversion 2 columns | ctemple | Conversion | 1 | 01-27-2019 08:56 PM |
Calibre lose code block style on EPUB to HTML conversion phase | mw-b | Conversion | 1 | 11-02-2018 02:56 AM |
converting PDF magazine to ePub format | PublicarGuate | General Discussions | 2 | 01-21-2014 05:44 PM |
books with a two columns style | yuxi_kelly | ePub | 3 | 01-13-2011 03:27 PM |
convert pdf (2 columns) to epub | ikke | Calibre | 3 | 07-19-2010 05:59 PM |