Converting multicolumn PDF?

Lukusaukko · 02-23-2023, 03:07 PM

I've read the forums and saw that this question has come up in the past, and the answer has been that "it's in development" - however, those posts are several years old, and the latest version still makes a mess of multi-column PDFs.
There are a few Linux-based tools that can convert multicolumn files relatively easily, such as pdftotext, but that removes italics and bolding. pdftohtml can also do it, but its output requires quite a lot of manual work to convert into a single-column format suitable for epub conversion. I've also tried using k2pdfopt to convert a PDF into single-column format to pass on to calibre, but that makes calibre choke - probably because the resulting file is not internally a true single-column PDF despite looking like one in a viewer.
Has the development of this functionality for Calibre been abandoned? Or are there any other tools - preferably available on Linux - besides Acrobat itself that could be used to convert a multi-column PDF into epub, or at least into some intermediate format with italics and bolding intact?

Quoth · 02-23-2023, 03:44 PM

See https://www.mobileread.com/forums/sh...d.php?t=144711

PDFs are for print, proofing print and print replica. They are not meant to be changed, converted or edited in any way. They are not ebooks.

There are multiple kinds: at one extreme are images only, and at the other are lines of text, each at a specific place on a specified page size.
Some have images of the text (from a scan) and a searchable text layer full of errors, as it's been automatically created by OCR software and not proofread.
There is no one solution. IMO it's outside the scope of Calibre.

I might use k2pdfopt, The GIMP or LO Draw to do something to a PDF on Linux. Sometimes all I can do is crop and adjust contrast.
Or hope someone produces a real ebook.

BetterRed · 02-23-2023, 05:35 PM

Quote:

Originally Posted by Lukusaukko

I've read the forums and saw that this question has come up in the past, and the answer has been that "it's in development" - however, those posts are several years old, and the latest version still makes a mess of multi-column PDFs.

The people who posted "it's in development" in 2010 haven't been involved in calibre development for a decade or more. The problem is, many 2 column PDFs are also littered with complex tables and infocrapics, and they will never be amenable to conversion.

Did you try passing the single-column output of k2pdfopt through MS Word or LO Writer to produce a DOCX, and then converting that to EPUB in calibre?

BR

retiredbiker · 02-24-2023, 10:17 AM

I do multi-column old magazine stories, the PDF coming from, say, Internet Archive. Any text that is already in these is worthless; it would take forever to correct it by hand.

So my method is to use various Linux tools. Get the images out with pdftopng or pdfimages. If they are really terrible, run them through ScanTailor Advanced. Minor corrections can be done with ImageMagick. Do my own OCR using OCRFeeder, a front-end for tesseract. The multi-column problem here is handled by OCRFeeder being able to do one column at a time, and also avoid advertisements, handle the "continued on page 99" situation, and so on. Copy the OCR text into LibreOffice, proof it there, bring it into Calibre, and convert to epub. Tweak the code in the Calibre Editor as needed. Any images, tables and the like can be dealt with as necessary, case-by-case. I use GIMP to handle any image editing needed.

Yes, this is labour intensive. But it works and ends up with a really good epub. You will never find a script-kiddie solution to getting good results out of a multi-column PDF, especially when there are all sorts of interruptions to nice clean columns.
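
For anyone who wants to try the "one column at a time" idea without a GUI, here is a rough sketch in Python. It is not what OCRFeeder does internally; it simply assumes equal-width columns separated by a fixed gutter, crops each column with ImageMagick, and hands each crop to tesseract. The function names, the `-colN` file naming and the gutter value are all made up for illustration, and it does nothing about advertisements or "continued on page 99" jumps.

```python
def column_boxes(page_width, page_height, columns=2, gutter=0):
    """Return (x, y, width, height) crop boxes, left to right.

    Assumes equal-width columns separated by a fixed gutter (in pixels).
    Real magazine pages often break this assumption.
    """
    usable = page_width - gutter * (columns - 1)
    col_width = usable // columns
    return [(i * (col_width + gutter), 0, col_width, page_height)
            for i in range(columns)]


def crop_command(src, dst, box):
    """Build an ImageMagick command that crops one column out of a page image."""
    x, y, w, h = box
    # On ImageMagick 7 the executable may be "magick" rather than "convert".
    return ["convert", src, "-crop", f"{w}x{h}+{x}+{y}", "+repage", dst]


def ocr_command(image, out_base, lang="eng"):
    """Build a tesseract command; --psm 6 treats the image as one block of text."""
    return ["tesseract", image, out_base, "-l", lang, "--psm", "6"]


def plan_page(image, page_width, page_height, columns=2, gutter=0):
    """List the crop and OCR commands for one page, in reading order."""
    stem = image.rsplit(".", 1)[0]
    commands = []
    boxes = column_boxes(page_width, page_height, columns, gutter)
    for n, box in enumerate(boxes, start=1):
        col_image = f"{stem}-col{n}.png"  # hypothetical naming scheme
        commands.append(crop_command(image, col_image, box))
        commands.append(ocr_command(col_image, f"{stem}-col{n}"))
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`, provided ImageMagick and tesseract are installed. The resulting per-column text files still need proofreading, exactly as described above.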

Lukusaukko · 02-24-2023, 10:31 AM

Quote:

Originally Posted by BetterRed

Did you try passing the single-column output of k2pdfopt through MS Word or LO Writer to produce a DOCX, and then converting that to EPUB in calibre?

Writer can't open PDFs in an editable format - at least in the version I have installed - it always opens them in Draw, if at all. Word would work better - I know it can open multi-column PDFs directly, with varying degrees of success, but as a Linux user, it's not really an option.

Looks like so far the best bet is using pdftohtml to convert the file to XML and then using a text editor and various regular expressions to strip the XML tags or replace them with HTML tags before using ebook-convert to convert it. Takes quite a bit of manual work, but it's doable, at least for the more interesting use cases - it's probably 1-2 hours of work to do a book, if the layout and formatting are consistent so I can effectively use regexps.
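
Part of that regex work could probably be scripted. The following is a minimal sketch, assuming the XML looks the way `pdftohtml -xml` usually writes it: `<page>` elements with a `width` attribute, containing `<text>` elements with `top` and `left` attributes and nested `<b>`/`<i>` tags. It also assumes equal-width columns and well-formed XML, neither of which is guaranteed. The function names are invented for illustration.

```python
import html
import xml.etree.ElementTree as ET


def inner_markup(elem):
    """Return an element's text, keeping nested <b>/<i> tags and escaping the rest."""
    parts = [html.escape(elem.text or "")]
    for child in elem:
        inner = inner_markup(child)
        if child.tag in ("b", "i"):
            parts.append(f"<{child.tag}>{inner}</{child.tag}>")
        else:
            parts.append(inner)  # drop unknown wrappers such as <a>
        parts.append(html.escape(child.tail or ""))
    return "".join(parts)


def page_to_lines(page, columns=2):
    """Return a page's text lines in reading order: column by column, top to bottom."""
    col_width = float(page.get("width")) / columns
    buckets = [[] for _ in range(columns)]
    for text in page.iter("text"):
        left = float(text.get("left"))
        top = float(text.get("top"))
        index = min(int(left // col_width), columns - 1)
        buckets[index].append((top, left, inner_markup(text)))
    lines = []
    for bucket in buckets:
        bucket.sort(key=lambda item: (item[0], item[1]))
        lines.extend(markup for _, _, markup in bucket)
    return lines


def xml_to_html(xml_text, columns=2):
    """Convert pdftohtml-style XML into single-column HTML, one <p> per PDF line."""
    root = ET.fromstring(xml_text)
    body = []
    for page in root.iter("page"):
        for line in page_to_lines(page, columns):
            if line.strip():
                body.append(f"<p>{line}</p>")
    return "<html><body>\n" + "\n".join(body) + "\n</body></html>"
```

The XML would come from something like `pdftohtml -xml -i input.pdf out`, and the HTML this produces can be saved and passed to `ebook-convert out.html book.epub`. Every PDF line becomes its own paragraph here, so joining lines into real paragraphs, removing headers and footers, and fixing hyphenation are still manual or regex work.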

Lukusaukko · 02-24-2023, 10:43 AM

Quote:

Originally Posted by retiredbiker

I do multi-column old magazine stories, the pdf coming from, say, Internet Archive. Any text that is already in these is worthless, it would take forever to correct it by hand.

It's almost like my use cases, though my sources for pulps and weird fiction are often more or less proofread - it's just that they have also attempted to replicate the original's layout, which makes them a pain to read on an e-reader... and often it's the only source available.

I have used gImageReader to OCR a couple of sources where the source was only available as a scan (as you noted, Archive's TXT or EPUB versions are often worthless), but only for short texts. I'll have to check out OCRFeeder.




