Read this before Posting PDF Questions

ldolse · 01-26-2011, 09:12 PM

PDF format is the worst choice of source formats to use with Calibre, and conversions can range from decent to quite awful. Read here for the official statement on Calibre's PDF support.

This post will go into a little more detail on the various issues surrounding pdf conversions and what can be done about them. Note that if your PDF looks like complete garbage after conversion (i.e. nothing at all like the original text), there is nothing to be done about it using Calibre.

There are page numbers, headers, or footers in my output

You need to use Calibre's Search and Replace feature when converting from pdf in order to remove any text you don't want. This requires the use of a search syntax called regular expressions. If you are intimidated by regular expressions, many Windows users have reported that Mobipocket Creator is a good alternative for the initial pdf conversion. Use Mobipocket Creator to convert the pdf to the .mobi format, and then use Calibre to convert from mobi to your final desired format.
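As a rough illustration of the kind of regular expression involved, here is a small Python sketch that strips lines containing only a page number or a repeated running header. As far as I know, Calibre's search and replace accepts Python-style expressions, so similar patterns should carry over. The sample text, the header string 'My Book Title', and the function name are all made up for the example; a real pdf will need patterns adapted to its own headers and footers.

```python
import re

# Hypothetical sample of text as it might come out of a pdf conversion.
SAMPLE = """It was a dark and stormy
night when the rain began.
12
My Book Title
The wind howled through
the empty streets.
"""

# A line consisting only of digits (a bare page number).
PAGE_NUMBER = re.compile(r"^[ \t]*\d+[ \t]*$\n?", re.MULTILINE)
# A line consisting only of the running header (assumed title).
HEADER = re.compile(r"^[ \t]*My Book Title[ \t]*$\n?", re.MULTILINE)


def strip_page_furniture(text):
    """Remove bare page-number lines and running-header lines."""
    text = PAGE_NUMBER.sub("", text)
    return HEADER.sub("", text)


if __name__ == "__main__":
    print(strip_page_furniture(SAMPLE))
```

Anchoring the patterns to whole lines (with ^ and $) keeps them from touching numbers or titles that appear in the middle of ordinary sentences.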

I cropped the headers/footers from my pdf with another tool, but Calibre still converts them

Most pdf cropping utilities only change the visible page boundaries of the pdf; they don't actually eliminate the text data.

You need to find a utility which both crops AND deletes hidden text. Very few tools do this - Adobe Acrobat has an option to 'remove hidden text' while optimizing pdfs which can facilitate this. The alternative is to use Calibre's book editor to delete the headers, or use Sigil after conversion to epub.

Some of my paragraphs are split into multiple paragraphs

They weren't actually split into new paragraphs - this is how pdf works. There is no concept of a 'paragraph' in pdf - every line is basically its own paragraph. Calibre attempts to rebuild the actual paragraphs using punctuation and line length clues. This is prone to errors, and some documents will require manual cleanup using Calibre's book editor or a program like Sigil. Using Sigil requires converting to epub first, editing the epub in Sigil, and then converting to the final intended format.

Before you attempt to clean up the file manually, you can try changing the 'Line unwrap-factor' under pdf input in the conversion options. The default setting is 0.45. You can set this lower to make line unwrapping more 'aggressive', but be aware that doing this may unwrap lines which shouldn't be unwrapped.
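To give a feel for why a lower factor is more 'aggressive', here is a deliberately simplified Python sketch of a length-and-punctuation heuristic. It is not Calibre's actual algorithm; the function name and the exact rule (a line ends a paragraph when it is no longer than factor times the median line length, or ends in sentence punctuation) are assumptions made for illustration only.

```python
from statistics import median


def unwrap_lines(lines, factor=0.45):
    """Join lines that look like wrapped parts of one paragraph.

    Simplified illustration only, not Calibre's real code: a line is
    treated as the end of a paragraph when it is no longer than
    factor * median line length, or when it ends in sentence punctuation.
    """
    lines = [line.strip() for line in lines if line.strip()]
    if not lines:
        return []
    threshold = factor * median(len(line) for line in lines)
    paragraphs = []
    current = ""
    for line in lines:
        # Append this line to the paragraph being built.
        current = f"{current} {line}" if current else line
        ends_sentence = line.endswith((".", "!", "?"))
        if len(line) <= threshold or ends_sentence:
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return paragraphs
```

In this sketch a lower factor means a lower length threshold, so fewer short lines count as paragraph ends. That is also why short unpunctuated lines such as headings can get absorbed into the following paragraph when the factor is set too low.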

Various character pairs like 'ff', 'll', etc are missing from my conversion

This is probably caused by the PDF containing what are called ligatures. These occur when the publisher replaces certain pairs of characters with a single character to make the text 'look better'. Common ligatures are 'll', 'fl', 'fi', 'ff', 'ffl', and 'ffi'. Unfortunately, due to a bug in the third party library Calibre uses, in many cases ligatures simply aren't supported. Several users have reported having good luck with Mobipocket Creator or Acrobat Professional for these types of files. Other users have suggested using the 'PDF Print' features provided by various plugins and operating systems: print the PDF to a new PDF, and then convert the new PDF in Calibre.
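If the ligatures survive conversion as single Unicode ligature characters (rather than being dropped entirely, which is the case described above and which this does not fix), they can be expanded back into plain letters afterwards. The following Python sketch assumes you have the converted text available as a string; the function names are hypothetical and only the standard Latin ligature code points are covered.

```python
# Unicode "Alphabetic Presentation Forms" ligatures and their plain spellings.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def find_ligatures(text):
    """Return the distinct ligature characters present in the text."""
    return sorted({ch for ch in text if ch in LIGATURES})


def expand_ligatures(text):
    """Replace each ligature character with its plain-letter spelling."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text
```

Running find_ligatures first is a quick way to tell whether a document has this problem at all before attempting any cleanup.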

The lines in my article are all running together, or mixed up

This is most likely because your pdf uses multi-column formatting, which Calibre currently does not support. The only solution to this issue using Calibre is to first use a third party pdf cutting/splitting utility to split and re-order your pages.

My pdf converted, but it doesn't contain any text, or the text is all garbled

Many pdfs are actually made up of images of scanned books, one image for each page. Many of these types of pdfs use hidden OCR (optical character recognition - i.e. machine reading) text underneath the images, but not all of them do. When there is no OCR text at all, you will often get a conversion that has no text, or is made up only of images. If the pdf uses hidden OCR text, in most cases no editing was done to the OCR, and depending on the text quality and OCR engine the resulting text can be quite awful. There isn't anything you can do with a pdf like this in Calibre. Your best bet is to use real OCR software like ABBYY FineReader or Acrobat Professional to convert the document. There are also open source OCR projects such as Tesseract and OCRopus.

My images/tables/text formatting etc are messed up

Many types of pdf images aren't supported or may convert incorrectly. PDF tables and other specialized text formatting are not supported. In fact, the only 'special' text formatting supported is italics and bold - all other text formatting and positioning is eliminated during conversion.

My PDF has a table of contents or links/bookmarks, but they weren't used during conversion

Calibre doesn't support the existing Table of Contents/links/bookmarks in a pdf, for either building a TOC or detecting chapters. Instead it uses a heuristic approach to guess where the appropriate chapter breaks are. This works well in some documents, but can produce poor results or false positives in other documents.

Some of my images are negative

This is a bug, and it's not going to get fixed anytime soon. You can convert to epub, then use the 'Tweak Epub' feature to access the images. Open the images in an image editor, fix and save them, then use the Tweak Epub window to rebuild the epub. If your final format is something other than epub, use Calibre to convert your fixed epub to the desired format.

Something was working last week, and it's not working this week

Make sure it really broke - pdfs can be constructed in many different ways, and what converts well in one pdf may not convert well in another. Check one of the pdfs that 'worked' last week - if it now converts incorrectly, then this is a bug. Open a bug at the bugtracker and attach the file with a description of the problem. However, if the old pdf still converts correctly, it means the latest pdf you're converting just isn't compatible with Calibre.

Calibre just created a ridiculously huge PDF!

Make sure you're running Calibre version 0.8.21 or later.

How can I help make pdf conversion better?

Improving pdf conversion is on the to-do list of the Calibre developers, but any help would be greatly appreciated. A new pdf engine is in progress that fixes many of the issues described above, such as multi-column pdfs, ligatures, and line wrapping. Development is presently stalled, and there is no ETA for its release.

The new engine converts the pdf to xml which retains all positioning information, and the xml then needs to be converted to html/xhtml. It's the conversion to html which still needs work. If you want to get involved you can read about syncing the source code here. The new pdf engine reflow code, which converts the xml to html, is in /src/calibre/ebooks/pdf/reflow.py. If you make changes to this code, you can test them from the command line: use the ebook-convert command with the argument --new-pdf-engine. You also need to specify a debug output directory to see the output.
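For anyone scripting repeated test runs, a minimal Python sketch of assembling that command might look like the following. The --new-pdf-engine flag comes from the paragraph above; --debug-pipeline is, as far as I recall, the ebook-convert option for setting the debug output directory, so check ebook-convert --help to confirm it for your version. The file names and the helper function are placeholders.

```python
import subprocess


def build_convert_command(pdf_path, output_path, debug_dir):
    """Build an ebook-convert command line for testing the reflow code."""
    return [
        "ebook-convert",
        pdf_path,
        output_path,
        "--new-pdf-engine",   # flag named in the post above
        "--debug-pipeline",   # assumed name of the debug-output option
        debug_dir,
    ]


if __name__ == "__main__":
    cmd = build_convert_command("sample.pdf", "sample.epub", "debug_out")
    print(" ".join(cmd))
    # Uncomment to actually run the conversion (requires Calibre on PATH):
    # subprocess.run(cmd, check=True)
```

Building the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.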

Last edited by BetterRed; 07-18-2017 at 10:43 PM. Reason: links into calibre manual corrected
