02-23-2023, 03:07 PM   #1
Lukusaukko
Connoisseur
Posts: 55
Karma: 392326
Join Date: Feb 2023
Device: Kobo Libra 2
Converting multicolumn PDF?

I've read the forums and saw that this question has come up in the past, and the answer has been that "it's in development". However, those posts are several years old, and the latest version still makes a mess of multi-column PDFs.
There are a few Linux-based tools that can convert multi-column files relatively easily, such as pdftotext, but that removes italics and bolding. pdftohtml can also do it, but its output requires quite a lot of manual work to convert into a single-column format suitable for EPUB conversion. I've also tried using k2pdfopt to convert a PDF into single-column format to pass on to calibre, but that makes calibre choke, probably because the resulting file is not internally a true single-column PDF despite looking like one in a viewer.
Has the development of this functionality for calibre been abandoned? Or are there any other tools, preferably available on Linux, besides Acrobat itself that could be used to convert a multi-column PDF into EPUB, or at least into some intermediate format with italics and bolding intact?
02-23-2023, 03:44 PM   #2
Quoth
Still reading
Posts: 14,033
Karma: 105092227
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper
See https://www.mobileread.com/forums/sh...d.php?t=144711

PDFs are for print, proofing print and print replica. They are not meant to be changed, converted or edited in any way. They are not ebooks.

There are multiple kinds: at one extreme are images only, and at the other are lines of text, each at a specific place on a specified page size.
Some have images of the text (from a scan) and a searchable text layer full of errors, because it was created automatically by OCR software and never proofread.
There is no one solution. IMO it's outside the scope of Calibre.

I might use k2pdfopt, The GIMP or LO Draw to do something to a PDF on Linux. Sometimes all I can do is crop and adjust contrast.
Or hope someone produces a real ebook.
02-23-2023, 05:35 PM   #3
BetterRed
null operator (he/him)
Posts: 21,725
Karma: 29711016
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Quote:
Originally Posted by Lukusaukko
I've read the forums and saw that this question has come up in the past, and the answer has been that "it's in development". However, those posts are several years old, and the latest version still makes a mess of multi-column PDFs.
The people who posted "it's in development" in 2010 haven't been involved in calibre development for a decade or more. The problem is, many 2 column PDFs are also littered with complex tables and infocrapics, and they will never be amenable to conversion.

Did you try passing the single-column output of k2pdfopt through MS Word or LO Writer to produce a DOCX, and then converting that to EPUB in calibre?

BR
02-24-2023, 10:17 AM   #4
retiredbiker
Evangelist
Posts: 450
Karma: 3886916
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Kobo Forma
I do multi-column old magazine stories, the PDF coming from, say, Internet Archive. Any text that is already in these is worthless; it would take forever to correct it by hand.

So my method is to use various Linux tools. Get the images out with pdftopng or pdfimages. If they are really terrible, run them through ScanTailor Advanced. Minor corrections can be done with ImageMagick. Do my own OCR using OCRFeeder, a front-end for Tesseract. OCRFeeder handles the multi-column problem because it can do one column at a time, skip advertisements, handle the "continued on page 99" situation, and so on. Copy the OCR text into LibreOffice, proof it there, bring it into calibre, and convert to EPUB. Tweak the code in the calibre editor as needed. Any images, tables and the like can be dealt with as necessary, case by case. I use GIMP for any image editing needed.

Yes, this is labour intensive. But it works and ends up with a really good EPUB. You will never find a script-kiddie solution to getting good results out of a multi-column PDF, especially when there are all sorts of interruptions to nice clean columns.
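
For readers who want to try the image-extraction and per-column OCR steps from the command line, here is a rough sketch. It is not the OCRFeeder workflow described above: it swaps in direct calls to ImageMagick's `convert` and to `tesseract`, and it assumes clean, equal-width columns with no advertisements or interruptions, which is exactly the case that rarely holds for old magazines. The function names, the `gutter` parameter and the choice of `--psm 6` are illustrative assumptions, and on ImageMagick 7 the command may be `magick` rather than `convert`.

```python
import subprocess
from pathlib import Path


def column_boxes(width, height, columns=2, gutter=0):
    """Split a page of the given pixel size into equal-width column boxes.

    Returns (x, y, w, h) tuples, left to right. Assumes evenly spaced
    columns; real scans usually need manual adjustment.
    """
    col_width = (width - gutter * (columns - 1)) // columns
    return [(i * (col_width + gutter), 0, col_width, height) for i in range(columns)]


def crop_command(image, box, out):
    """Build an ImageMagick command that crops one column out of a page image."""
    x, y, w, h = box
    return ["convert", str(image), "-crop", f"{w}x{h}+{x}+{y}", "+repage", str(out)]


def ocr_command(image, out_base, lang="eng"):
    """Build a tesseract command; --psm 6 treats the image as one uniform text block."""
    return ["tesseract", str(image), str(out_base), "-l", lang, "--psm", "6"]


def ocr_page(image, width, height, workdir, columns=2):
    """Crop each column, OCR it, and return the text in left-to-right column order."""
    texts = []
    for n, box in enumerate(column_boxes(width, height, columns)):
        col_img = Path(workdir) / f"{Path(image).stem}-col{n}.png"
        subprocess.run(crop_command(image, box, col_img), check=True)
        # tesseract writes its result to <out_base>.txt
        subprocess.run(ocr_command(col_img, col_img.with_suffix("")), check=True)
        texts.append(col_img.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

The page images themselves would come from something like `pdfimages -png scan.pdf page`. The resulting text still needs proofreading by hand, as described above.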
02-24-2023, 10:31 AM   #5
Lukusaukko
Connoisseur
Posts: 55
Karma: 392326
Join Date: Feb 2023
Device: Kobo Libra 2
Quote:
Originally Posted by BetterRed
Did you try passing the single-column output of k2pdfopt through MS Word or LO Writer to produce a DOCX, and then converting that to EPUB in calibre?
Writer can't open PDFs in an editable format, at least in the version I have installed - it always opens them in Draw, if at all. Word would work better - I know it can open multi-column PDFs directly, with varying degrees of success - but as a Linux user, it's not really an option for me.

Looks like so far the best bet is using pdftohtml to convert the file to XML, then using a text editor and various regular expressions to strip the XML tags or replace them with HTML tags, before converting the result with ebook-convert. It takes quite a bit of manual work, but it's doable, at least for the more interesting use cases - it's probably 1-2 hours of work per book, provided the layout and formatting are consistent enough that I can use regexps effectively.
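
As a rough illustration of that XML-to-HTML step, here is a minimal Python sketch that uses an XML parser instead of regular expressions. It assumes the `pdftohtml -xml` output has `<page>` elements with a `width` attribute and `<text>` elements with `top` and `left` attributes, with `<b>` and `<i>` as the only inline markup worth keeping. It also assumes exactly two columns split at the middle of the page. All function names are made up for the example, and real files (headers, footnotes, full-width titles, malformed XML) will need extra handling.

```python
import xml.etree.ElementTree as ET
from html import escape


def inline_html(elem):
    """Serialize an element's content, keeping only <b>/<i> markup."""
    parts = [escape(elem.text or "")]
    for child in elem:
        inner = inline_html(child)
        if child.tag in ("b", "i"):
            parts.append(f"<{child.tag}>{inner}</{child.tag}>")
        else:
            parts.append(inner)  # drop other tags (e.g. links) but keep their text
        parts.append(escape(child.tail or ""))
    return "".join(parts)


def page_lines_in_reading_order(page):
    """Return the page's lines: the whole left column, then the whole right column."""
    middle = float(page.get("width")) / 2
    left, right = [], []
    for text in page.iter("text"):
        column = left if float(text.get("left")) < middle else right
        column.append((float(text.get("top")), inline_html(text)))
    by_top = lambda item: item[0]
    ordered = sorted(left, key=by_top) + sorted(right, key=by_top)
    return [line for _, line in ordered]


def xml_to_html(xml_source):
    """Convert pdftohtml-style XML into a bare single-column HTML document."""
    root = ET.fromstring(xml_source)
    body = []
    for page in root.iter("page"):
        for line in page_lines_in_reading_order(page):
            body.append(f"<p>{line}</p>")  # one <p> per source line, not per paragraph
    return "<html><body>\n" + "\n".join(body) + "\n</body></html>"
```

With the XML produced by something like `pdftohtml -xml book.pdf book`, the string returned by `xml_to_html` could be saved as `book.html` and passed to `ebook-convert book.html book.epub`. Each source line becomes its own paragraph here, so joining lines into real paragraphs is still left to manual cleanup.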
02-24-2023, 10:43 AM   #6
Lukusaukko
Connoisseur
Posts: 55
Karma: 392326
Join Date: Feb 2023
Device: Kobo Libra 2
Quote:
Originally Posted by retiredbiker
I do multi-column old magazine stories, the PDF coming from, say, Internet Archive. Any text that is already in these is worthless; it would take forever to correct it by hand.
That's close to my use case, though my sources for pulps and weird fiction are often more or less proofread - it's just that they have also attempted to replicate the original's layout, which makes them a pain to read on an e-reader... and often it's the only source available.

I have used gImageReader to OCR a couple of sources that were only available as scans (as you noted, Archive's TXT or EPUB versions are often worthless), but only for short texts. I'll have to check out OCRFeeder.
All times are GMT -4.