01-18-2013, 10:51 AM | #1 |
Junior Member
Posts: 4
Karma: 10
Join Date: Jan 2013
Device: Samsung Galaxy Tablet
|
Scannable pdf file loses data converting to mobi
Hi. I'm really enjoying Calibre. I've also posted this at the Facebook site but I see that this is a better place to ask my question.
I've run into a snag, though. Up till now I've scanned paper documents to create a PDF, then used ABBYY to create a searchable PDF. Then I use a PDF editor to remove the headers and footers. Finally I convert the document to mobi in Calibre. I do this so that I can play back the book using Ivova speech without the voice reading the headers and footers.
With my most recent book, Calibre is somehow taking the original non-searchable PDF and restoring the headers and footers. The mobi file is the original non-searchable PDF, but the accompanying .pdf file in the book folder is the edited PDF with the headers and footers removed. I even changed the names of the other versions, and Calibre still produces a PDF with no text data under a .mobi extension. Any suggestions as to how Calibre takes a PDF which has no headers and has embedded text, and outputs the original scanned PDF as the mobi? This process has run without a flaw on many other books till now.
When I try to read the file (using Moon + Reader Pro), instead of the normal page-oriented presentation I see something that resembles a standard PDF. When I try to play the voice it speeds through the entire document, indicating there is no scannable text. There are no error messages. I have the log and will post it below.
Thanks for listening. I really like Calibre and have made a donation; it's that useful.
'prefer_metadata_cover': False, 'pretty_print': False, 'pubdate': None, 'publisher': None, 'rating': None, 'read_metadata_from_opf': u'/var/folders/b1/htl47r2s79551pj5hxgbhmp00000gn/T/calibre_0.9.15_tmp_uytj6P/JkJWbC.opf', 'remove_fake_margins': True, 'remove_first_image': False, 'remove_paragraph_spacing': False, 'remove_paragraph_spacing_indent_size': 1.5, 'renumber_headings': True, 'replace_scene_breaks': u'', 'search_replace': '[]', 'series': None, 'series_index': None, 'share_not_sync': False, 'smarten_punctuation': True, 'sr1_replace': None, 'sr1_search': None, 'sr2_replace': None, 'sr2_search': None, 'sr3_replace': None, 'sr3_search': None, 'start_reading_at': None, 'subset_embedded_fonts': False, 'tags': None, 'timestamp': None, 'title': None, 'title_sort': None, 'toc_filter': None, 'toc_threshold': 6, 'toc_title': None, 'unsmarten_punctuation': False, 'unwrap_factor': 0.45, 'unwrap_lines': True, 'use_auto_toc': False, 'verbose': 2}
InputFormatPlugin: PDF Input running on /var/folders/b1/htl47r2s79551pj5hxgbhmp00000gn/T/calibre_0.9.15_tmp_uytj6P/wkBpQ3.pdf
Converting file to html...
Retrieving document metadata...
Generating manifest...
Rendering manifest...
Parsing all content...
Parsing index.html ...
********* Heuristic processing HTML *********
flow is too short, not running heuristics
Initial parse failed, using more forgiving parsers
Parsing index.html as HTML
Generating default TOC from spine...
Merging user specified metadata...
Detecting structure...
Auto generated TOC with 0 entries.
Flattening CSS and remapping font sizes...
Source base font size is 12.00000pt
Removing fake margins...
Found 275 items of level: p_1
p_1 left margin stats: Counter({u'0': 275})
p_1 right margin stats: Counter({u'0': 275})
Cleaning up manifest...
Trimming unused files from manifest...
Creating MOBI Output...
Serializing resources...
Creating MOBI 6 output
Applying case-transforming CSS...
Parsing manglecase.css ...
Rasterizing SVG images...
Converting XHTML to Mobipocket markup...
Serializing markup content...
Compressing markup content...
No TOC, MOBI index not generated
MOBI output written to /var/folders/b1/htl47r2s79551pj5hxgbhmp00000gn/T/calibre_0.9.15_tmp_uytj6P/B0Bgha.mobi
|
01-18-2013, 11:05 AM | #2 |
creator of calibre
Posts: 43,749
Karma: 22446736
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
This will be because your PDF is image based, i.e. it contains only scans of page images. I have no clue why your OCR process failed for this PDF.
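One way to check whether a PDF is image based before converting it is to try extracting its text layer and see whether anything comes out. The following is only a sketch, not how calibre itself checks: it assumes the Poppler `pdftotext` command-line tool is installed and on the PATH, and the function names and the 50-character threshold are made up for illustration.

```python
import shutil
import subprocess


def looks_image_only(extracted_text: str, min_chars: int = 50) -> bool:
    """Return True when the extracted text is too short to be a real text layer.

    min_chars is an arbitrary, assumed threshold; tune it for your documents.
    """
    # Drop all whitespace, including the form feeds pdftotext emits between pages.
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars


def pdf_text(path: str) -> str:
    """Extract the text layer of a PDF with pdftotext, sending output to stdout."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        ["pdftotext", path, "-"],  # "-" means write the text to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `looks_image_only(pdf_text("scan.pdf"))` (with `scan.pdf` standing in for your own file) on both the original scan and the OCR'd, edited copy should show which of the two actually carries text.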
|
01-18-2013, 02:05 PM | #3 |
Junior Member
Posts: 4
Karma: 10
Join Date: Jan 2013
Device: Samsung Galaxy Tablet
|
I believe I may know the reason. When I tried to get the reader app to play the document with voice, it started spelling words out instead of putting them together. A careful review using Apple's Preview shows the scanned print is significantly fainter than in every other scan. The book does appear light to the eye as well.
I tried the PDF file on the tablet and it reads better... clearly the first scan wasn't good enough. I rescanned with a darker setting, tried a few pages, and copied the result to the tablet. The voice speaks the PDF file perfectly, but somehow Calibre produces a mobi document without any text, even though the text IS in the PDF file before I convert it. Go figure!
Last edited by Rich Gibson; 01-18-2013 at 03:01 PM. |
01-18-2013, 10:01 PM | #4 |
creator of calibre
Posts: 43,749
Karma: 22446736
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
If you have Adobe Acrobat, you can try extracting the text with that and then convert the resulting text file with calibre.
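Once the text has been extracted to a file, the conversion step can also be run from the command line with calibre's `ebook-convert` tool, which takes an input file and an output file. A minimal sketch, assuming `ebook-convert` is on the PATH; the helper names and the file name `book.txt` are hypothetical:

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(text_file: str, output_format: str = "mobi") -> List[str]:
    """Build the ebook-convert command for a text file.

    The output is written next to the input, with the extension swapped.
    """
    source = Path(text_file)
    target = source.with_suffix("." + output_format)
    return ["ebook-convert", str(source), str(target)]


def convert(text_file: str, output_format: str = "mobi") -> None:
    """Run the conversion, failing loudly if calibre's CLI is missing."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH")
    subprocess.run(build_convert_command(text_file, output_format), check=True)
```

For example, `convert("book.txt")` would run `ebook-convert book.txt book.mobi`. Starting from a plain text file sidesteps the PDF input step entirely, so the result depends only on the text that was actually extracted.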
|