MobileRead Forums - View Single Post - Scannable pdf file loses data converting to mobi

Rich Gibson · 01-18-2013, 10:51 AM

Hi. I'm really enjoying Calibre. I've also posted this on the Facebook page, but I see that this is a better place to ask my question.

I've run into a snag, though. Up till now I've scanned paper documents to create a PDF, then used ABBYY to create a searchable PDF. Then I use a PDF editor to remove the headers and footers. Finally I convert the document to MOBI in Calibre. I do this so that I can play back the book using Ivona speech without the voice reading the headers and footers.

With my most recent book, Calibre is somehow taking the original non-searchable PDF and restoring the headers and footers. The MOBI file appears to be the original non-searchable PDF, but the accompanying .pdf file in the book folder is the PDF with the headers and footers removed. I even renamed the other versions, and Calibre still produces a PDF with no text data and a .mobi extension. Any suggestions as to how Calibre takes a PDF that has no headers and has embedded text, and outputs the original scanned PDF as the MOBI? This process has run without a flaw on many other books until now.
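One way to test the impression that the output is "a PDF with a .mobi extension" is to look at the first bytes of the file. The following is a minimal sketch, not something Calibre provides: the function names are made up for illustration, and it relies only on two well-known file signatures (PDF files begin with `%PDF-`, and MOBI files carry `BOOKMOBI` in the Palm database header at byte offset 60).

```python
def sniff_ebook_format(header: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    # PDF files begin with the ASCII marker "%PDF-".
    if header.startswith(b"%PDF-"):
        return "pdf"
    # MOBI files are Palm databases: the type and creator fields at
    # byte offsets 60-67 together read "BOOKMOBI".
    if header[60:68] == b"BOOKMOBI":
        return "mobi"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first 68 bytes of `path` and classify them."""
    with open(path, "rb") as handle:
        return sniff_ebook_format(handle.read(68))
```

Calling `sniff_file()` on the converted file in the book folder would return `"mobi"` if Calibre really wrote a MOBI container, even if the content inside it is only page images. A result of `"pdf"` would mean the file is literally a renamed PDF.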

When I try to read the file (using Moon+ Reader Pro), instead of the normal page-oriented presentation I see something that resembles a standard PDF layout. When I try to play the voice, it speeds through the entire document, indicating there is no searchable text. There are no error messages. I have the log and will post it below. Thanks for listening. I really like Calibre and have made a donation; it's that useful.

'prefer_metadata_cover': False,
'pretty_print': False,
'pubdate': None,
'publisher': None,
'rating': None,
'read_metadata_from_opf': u'/var/folders/b1/htl47r2s79551pj5hxgbhmp00000gn/T/calibre_0.9.15_tmp_uytj6P/JkJWbC.opf',
'remove_fake_margins': True,
'remove_first_image': False,
'remove_paragraph_spacing': False,
'remove_paragraph_spacing_indent_size': 1.5,
'renumber_headings': True,
'replace_scene_breaks': u'',
'search_replace': '[]',
'series': None,
'series_index': None,
'share_not_sync': False,
'smarten_punctuation': True,
'sr1_replace': None,
'sr1_search': None,
'sr2_replace': None,
'sr2_search': None,
'sr3_replace': None,
'sr3_search': None,
'start_reading_at': None,
'subset_embedded_fonts': False,
'tags': None,
'timestamp': None,
'title': None,
'title_sort': None,
'toc_filter': None,
'toc_threshold': 6,
'toc_title': None,
'unsmarten_punctuation': False,
'unwrap_factor': 0.45,
'unwrap_lines': True,
'use_auto_toc': False,
'verbose': 2}
InputFormatPlugin: PDF Input running
on /var/folders/b1/htl47r2s79551pj5hxgbhmp00000gn/T/calibre_0.9.15_tmp_uytj6P/wkBpQ3.pdf
Converting file to html...
Retrieving document metadata...
Generating manifest...
Rendering manifest...
Parsing all content...
Parsing index.html ...
********* Heuristic processing HTML *********
flow is too short, not running heuristics
Initial parse failed, using more forgiving parsers
Parsing index.html as HTML
Generating default TOC from spine...
Merging user specified metadata...
Detecting structure...
Auto generated TOC with 0 entries.
Flattening CSS and remapping font sizes...
Source base font size is 12.00000pt
Removing fake margins...
Found 275 items of level: p_1
p_1 left margin stats: Counter({u'0': 275})
p_1 right margin stats: Counter({u'0': 275})
Cleaning up manifest...
Trimming unused files from manifest...
Creating MOBI Output...
Serializing resources...
Creating MOBI 6 output
Applying case-transforming CSS...
Parsing manglecase.css ...
Rasterizing SVG images...
Converting XHTML to Mobipocket markup...
Serializing markup content...
Compressing markup content...
No TOC, MOBI index not generated
MOBI output written to /var/folders/b1/htl47r2s79551pj5hxgbhmp00000gn/T/calibre_0.9.15_tmp_uytj6P/B0Bgha.mobi
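A few messages in the log above may be worth a second look, for example "flow is too short, not running heuristics" and "Auto generated TOC with 0 entries." The sketch below simply searches a saved log for those strings. Treating them as warning signs is an assumption made for illustration, not a rule taken from Calibre, and the names `SUSPICIOUS_MESSAGES` and `find_suspicious_lines` are hypothetical.

```python
from typing import List, Tuple

# Messages copied from the conversion log above. Flagging them as
# suspicious is an assumption for illustration, not Calibre's own rule.
SUSPICIOUS_MESSAGES = (
    "flow is too short, not running heuristics",
    "Initial parse failed, using more forgiving parsers",
    "Auto generated TOC with 0 entries.",
)


def find_suspicious_lines(log_text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for lines containing a listed message."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        stripped = line.strip()
        if any(message in stripped for message in SUSPICIOUS_MESSAGES):
            hits.append((number, stripped))
    return hits
```

Running this over the logs of a book that converted correctly and of the problem book, then comparing the two results, could show whether these messages are specific to the failing conversion.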

Rich Gibson · Junior Member · Posts: 4 · Karma: 10 · Join Date: Jan 2013 · Device: Samsung Galaxy Tablet