#1 |
Connoisseur
Posts: 55
Karma: 392326
Join Date: Feb 2023
Device: Kobo Libra 2
|
Converting multicolumn PDF?
I've read the forums and saw that this question has come up in the past, and the answer has been that "it's in development". However, those posts are several years old, and the latest version still makes a mess of multi-column PDFs.
There are a few Linux-based tools that can convert multi-column files relatively easily, such as pdftotext, but that removes italics and bolding. pdftohtml can also do it, but its output requires quite a lot of manual work to convert into a single-column format suitable for EPUB conversion. I've also tried using k2pdfopt to convert a PDF into single-column format to pass on to calibre, but that makes calibre choke, probably because the resulting file is not internally a true single-column PDF despite looking like one in a viewer.
Has the development of this functionality for Calibre been abandoned? Or are there any other tools, preferably available on Linux, besides Acrobat itself that could be used to convert a multi-column PDF into EPUB, or at least into some intermediate format with italics and bolding intact? |
#2 |
Still reading
Posts: 14,033
Karma: 105092227
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper
|
See https://www.mobileread.com/forums/sh...d.php?t=144711
PDFs are for print, proofing print and print replica. They are not meant to be changed, converted or edited in any way. They are not ebooks. There are multiple kinds: at one extreme are image-only files, and at the other are lines of text, each at a specific place on a specified page size. Some have images of the text (from a scan) plus a searchable text layer that is full of errors because it was created automatically by OCR software and never proofread. There is no one solution. IMO it's outside the scope of Calibre. I might use k2pdfopt, The GIMP or LO Draw to do something to a PDF on Linux. Sometimes all I can do is crop and adjust contrast. Or hope someone produces a real ebook. |
#3 | |
null operator (he/him)
Posts: 21,725
Karma: 29711016
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Quote:
Did you try passing the single-column output of k2pdfopt through MS Word or LO Writer to produce a DOCX, and then converting that to EPUB in calibre? BR |
|
#4 |
Evangelist
Posts: 450
Karma: 3886916
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Kobo Forma
|
I do old multi-column magazine stories, with the PDF coming from, say, Internet Archive. Any text that is already in these is worthless; it would take forever to correct it by hand.
So my method is to use various Linux tools. Get the images out with pdftopng or pdfimages. If they are really terrible, run them through ScanTailor Advanced. Minor corrections can be done with ImageMagick. Do my own OCR using OCRFeeder, a front-end for tesseract. The multi-column problem here is handled by OCRFeeder being able to do one column at a time, and also avoid advertisements, handle the "continued on page 99" situation, and so on. Copy the OCR text into LibreOffice, proof it there, bring it into Calibre, and convert to EPUB. Tweak the code in the Calibre Editor as needed. Any images, tables and the like can be dealt with as necessary, case by case. I use GIMP to handle any image editing needed. Yes, this is labour intensive. But it works and ends up with a really good EPUB. You will never find a script-kiddie solution to getting good results out of a multi-column PDF, especially when there are all sorts of interruptions to nice clean columns. |
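For anyone who wants to script the dull parts of the workflow above, here is a rough Python sketch. It is not what OCRFeeder does internally; it only builds the command lines for pdfimages and tesseract and computes crop boxes for equal-width columns. The helper names (`column_boxes`, `extract_command`, `ocr_command`) and the equal-width assumption are illustrative only. Real magazine pages with adverts and uneven columns will still need manual column selection.

```python
from pathlib import Path


def column_boxes(width, height, columns=2, gutter=0):
    """Return (left, top, right, bottom) pixel boxes for equal-width columns.

    Assumption: the page splits into `columns` equal strips, with `gutter`
    pixels of white space centred on each strip boundary.
    """
    col_width = width / columns
    boxes = []
    for i in range(columns):
        left = round(i * col_width + gutter / 2)
        right = round((i + 1) * col_width - gutter / 2)
        boxes.append((left, 0, right, height))
    return boxes


def extract_command(pdf, prefix):
    # pdfimages (Poppler) writes embedded images as prefix-000.png, prefix-001.png, ...
    return ["pdfimages", "-png", str(pdf), str(prefix)]


def ocr_command(image, lang="eng"):
    # --psm 6 asks tesseract to treat the image as a single uniform block of
    # text, which suits an image already cropped to one column.
    image = Path(image)
    outbase = image.with_suffix("")  # tesseract appends .txt itself
    return ["tesseract", str(image), str(outbase), "-l", lang, "--psm", "6"]
```

Each command list can be handed to `subprocess.run(cmd, check=True)`. The boxes use the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so each column can be cropped to its own file before OCR.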
#5 | |
Connoisseur
Posts: 55
Karma: 392326
Join Date: Feb 2023
Device: Kobo Libra 2
|
Quote:
Looks like so far the best bet is to use pdftohtml to convert the file to XML, then use a text editor and various regular expressions to strip the XML tags or replace them with HTML tags, before converting the result with ebook-convert. It takes quite a bit of manual work, but it's doable, at least for the more interesting use cases - it's probably 1-2 hours of work per book, provided the layout and formatting are consistent enough for regexps to work effectively. |
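Part of that regexp work could probably be scripted. The sketch below is only a starting point and makes several assumptions: that the XML comes from something like `pdftohtml -xml book.pdf book`, that each `<page>` carries a `width` attribute and each `<text>` element carries `left` and `top` attributes with inline `<b>`/`<i>` tags (as I understand Poppler's output), and that every page has exactly two columns split down the middle. The function names are made up for illustration. Each text line becomes its own `<p>`, so joining lines into real paragraphs is still manual.

```python
import xml.etree.ElementTree as ET
from html import escape


def inline_html(elem):
    """Serialize an element's content, keeping only <b> and <i> markup."""
    parts = [escape(elem.text or "", quote=False)]
    for child in elem:
        inner = inline_html(child)
        if child.tag in ("b", "i"):
            parts.append(f"<{child.tag}>{inner}</{child.tag}>")
        else:
            parts.append(inner)  # drop other tags (e.g. links), keep their text
        parts.append(escape(child.tail or "", quote=False))
    return "".join(parts)


def two_column_xml_to_html(xml_text):
    """Reorder each page's text lines: left column top-down, then right column."""
    root = ET.fromstring(xml_text)
    out = ["<html><body>"]
    for page in root.iter("page"):
        middle = float(page.get("width")) / 2  # assumed column boundary
        left, right = [], []
        for text in page.iter("text"):
            x, y = float(text.get("left")), float(text.get("top"))
            (left if x < middle else right).append((y, inline_html(text)))
        for column in (left, right):
            for _, line in sorted(column, key=lambda item: item[0]):
                if line.strip():
                    out.append(f"<p>{line}</p>")
    out.append("</body></html>")
    return "\n".join(out)
```

The resulting HTML file can then be passed to `ebook-convert book.html book.epub`. Headings, full-width titles and pull quotes that span both columns will land in the wrong place with this simple midpoint split and need fixing by hand.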
|
#6 | |
Connoisseur
Posts: 55
Karma: 392326
Join Date: Feb 2023
Device: Kobo Libra 2
|
Quote:
I have used gImageReader to OCR a couple of sources that were only available as scans (as you noted, Archive's TXT or EPUB versions are often worthless), but only for short texts. I'll have to check out OCRFeeder. |
|