Scannable pdf file loses data converting to mobi

Rich Gibson · 01-18-2013, 10:51 AM

Hi. I'm really enjoying Calibre. I've also posted this at the Facebook site but I see that this is a better place to ask my question.

I've run into a snag, though. Up till now I've scanned paper documents to create a PDF, then used ABBYY to create a searchable PDF. Then I use a PDF editor to remove the headers and footers. Finally I convert the document to MOBI in Calibre. I do this so that I can play back the book using Ivona speech without the voice reading the headers and footers. With my most recent book, Calibre is somehow taking the original non-searchable PDF and restoring the headers and footers. The MOBI file is the original non-searchable PDF, but the accompanying .pdf file in the book folder is the PDF with the headers and footers edited out. I even changed the names of the other versions, and Calibre still produces a non-data PDF file with a .mobi extension. Any suggestions as to how Calibre takes a PDF which has no headers and has embedded text, and outputs the original scanned PDF as the MOBI? This process has run without a flaw on many other books till now.

When I try to read the file (using Moon+ Reader Pro), instead of the normal page-oriented presentation I see something that resembles a standard PDF. When I try to play the voice it speeds through the entire document, indicating there is no scannable text. There are no error messages. I have the log and will post it below. Thanks for listening. I really like Calibre and have made a donation; it's that useful.

'prefer_metadata_cover': False,
'pretty_print': False,
'pubdate': None,
'publisher': None,
'rating': None,
'read_metadata_from_opf': u'/var/folders/b1/htl47r2s79551pj5hxgbhmp00000gn/T/calibre_0.9.15_tmp_uytj6P/JkJWbC.opf',
'remove_fake_margins': True,
'remove_first_image': False,
'remove_paragraph_spacing': False,
'remove_paragraph_spacing_indent_size': 1.5,
'renumber_headings': True,
'replace_scene_breaks': u'',
'search_replace': '[]',
'series': None,
'series_index': None,
'share_not_sync': False,
'smarten_punctuation': True,
'sr1_replace': None,
'sr1_search': None,
'sr2_replace': None,
'sr2_search': None,
'sr3_replace': None,
'sr3_search': None,
'start_reading_at': None,
'subset_embedded_fonts': False,
'tags': None,
'timestamp': None,
'title': None,
'title_sort': None,
'toc_filter': None,
'toc_threshold': 6,
'toc_title': None,
'unsmarten_punctuation': False,
'unwrap_factor': 0.45,
'unwrap_lines': True,
'use_auto_toc': False,
'verbose': 2}
InputFormatPlugin: PDF Input running
on /var/folders/b1/htl47r2s79551pj5hxgbhmp00000gn/T/calibre_0.9.15_tmp_uytj6P/wkBpQ3.pdf
Converting file to html...
Retrieving document metadata...
Generating manifest...
Rendering manifest...
Parsing all content...
Parsing index.html ...
********* Heuristic processing HTML *********
flow is too short, not running heuristics
Initial parse failed, using more forgiving parsers
Parsing index.html as HTML
Generating default TOC from spine...
Merging user specified metadata...
Detecting structure...
Auto generated TOC with 0 entries.
Flattening CSS and remapping font sizes...
Source base font size is 12.00000pt
Removing fake margins...
Found 275 items of level: p_1
p_1 left margin stats: Counter({u'0': 275})
p_1 right margin stats: Counter({u'0': 275})
Cleaning up manifest...
Trimming unused files from manifest...
Creating MOBI Output...
Serializing resources...
Creating MOBI 6 output
Applying case-transforming CSS...
Parsing manglecase.css ...
Rasterizing SVG images...
Converting XHTML to Mobipocket markup...
Serializing markup content...
Compressing markup content...
No TOC, MOBI index not generated
MOBI output written to /var/folders/b1/htl47r2s79551pj5hxgbhmp00000gn/T/calibre_0.9.15_tmp_uytj6P/B0Bgha.mobi

kovidgoyal · 01-18-2013, 11:05 AM

This will be because your PDF is image based, i.e. it contains only scans of page images. I have no clue why your OCR process failed for this PDF.
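
One way to check whether a PDF really has a text layer before converting it is to run a text extractor over it and count what comes back per page. The sketch below is only an illustration: it assumes poppler's `pdftotext` command-line tool is installed (it is not mentioned in this thread), and the helper names and the 20-character threshold are made up for the example.

```python
import shutil
import subprocess
from typing import List


def pages_with_text(extracted: str, min_chars: int = 20) -> List[bool]:
    """Flag each page that has at least min_chars letters or digits.

    pdftotext marks the end of every page with a form feed, so splitting
    on it gives one chunk per page plus an empty trailing chunk.
    """
    pages = extracted.split("\f")
    if pages and pages[-1].strip() == "":
        pages = pages[:-1]  # drop the empty chunk after the last form feed
    return [sum(ch.isalnum() for ch in page) >= min_chars for page in pages]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return the text it finds ('-' sends it to stdout)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler-utils) is not installed")
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def report(pdf_path: str) -> str:
    """Summarise how many pages of the PDF carry extractable text."""
    flags = pages_with_text(extract_text(pdf_path))
    return f"{sum(flags)} of {len(flags)} pages contain extractable text"
```

Calling `report("book.pdf")` (hypothetical file name) on a scan with no usable OCR layer should report zero or very few pages with text, which would match the image-only behaviour described above.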

Rich Gibson · 01-18-2013, 02:05 PM

I believe I may know the reason. When I tried to get the reader app to play the document with voice, it started spelling words out instead of putting them together. A careful review using Apple's Preview shows the scanned print is significantly fainter than in every other scan. The book does appear light to the eye as well.

I tried the PDF file on the tablet and it reads better; clearly the first scan wasn't good enough. I rescanned a few pages with a darker setting and copied the result to the tablet, and the voice speaks the PDF perfectly. Somehow, though, Calibre produces a MOBI document without any text, even though the text IS in the PDF before I convert it. Go figure!
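
The "spelling words" symptom usually means the OCR text layer holds isolated letters rather than whole words. A rough way to spot that in extracted text is to measure what share of the alphabetic tokens are a single letter long. This is only a heuristic sketch; the function name is invented for the example, and what counts as "too high" is a judgment call (ordinary English prose has a few one-letter words such as "a" and "I").

```python
import re


def single_letter_ratio(text: str) -> float:
    """Return the fraction of alphabetic tokens that are one letter long."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    return sum(len(word) == 1 for word in words) / len(words)
```

A value close to 1.0 on a page suggests the OCR broke the words into separate letters, which would make a speech engine spell them out.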

kovidgoyal · 01-18-2013, 10:01 PM

If you have Adobe Acrobat, you can try extracting the text with that and then converting the resulting text file with calibre.
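
For anyone without Acrobat, the same two-step idea (extract the text first, then convert the text file) can be sketched with command-line tools. Note the substitution: this uses poppler's `pdftotext` in place of Acrobat, which is an assumption of this example and not something suggested in the thread. `ebook-convert` is calibre's command-line converter. The function names and file names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(pdf_path):
    """Return the two commands: PDF -> plain text, then plain text -> MOBI."""
    pdf = Path(pdf_path)
    txt = pdf.with_suffix(".txt")
    mobi = pdf.with_suffix(".mobi")
    extract = ["pdftotext", str(pdf), str(txt)]      # poppler text extraction
    convert = ["ebook-convert", str(txt), str(mobi)]  # calibre CLI conversion
    return extract, convert


def pdf_to_mobi_via_text(pdf_path):
    """Run both steps, stopping with an error if either tool is missing or fails."""
    for tool in ("pdftotext", "ebook-convert"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} was not found on PATH")
    for command in build_commands(pdf_path):
        subprocess.run(command, check=True)
```

Running `pdf_to_mobi_via_text("book.pdf")` would leave `book.txt` and `book.mobi` next to the PDF. Looking at the intermediate text file first is a quick way to confirm that the text layer is actually usable before converting.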


