PDF to EPUB not finding text

retiredbiker · 05-20-2020, 12:08 PM

Calibre 4.16, Ubuntu 18.04

I have a pdf which, when I try to convert it, results only in an epub full of pictures of the pages, although it has a perfectly good ToC. The debug files contain only lists of png files and the ToC, not a hint of html. Conversion only takes 2 minutes. Turning heuristic processing on or off makes no difference. When I try converting with the "no pictures" option checked, all I get is the (again perfectly good) document outline, and no pages at all. This conversion only takes 1 second.

The thing is, the pdf has an excellent text layer. I can copy/paste it, and when I run it through pdftotext, I get an excellent text file. It's as though Calibre is not even attempting to extract the text.

Could I have somehow changed a setting that turns off the text extraction?

Log from debug session:

Spoiler:
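The text-layer check described in this post (running the file through pdftotext and looking at the result) could be scripted roughly as follows. This is only a sketch: it assumes `pdftotext` from poppler-utils is on the PATH, and the helper names and the 200-character threshold are made up for illustration, not anything calibre uses.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def pdftotext_cmd(pdf_path, txt_path):
    """Build the pdftotext command line: input PDF, then output text file."""
    return ["pdftotext", str(pdf_path), str(txt_path)]


def looks_like_real_text(text, min_chars=200):
    """Heuristic (an assumption, not a standard): enough non-whitespace characters."""
    visible = sum(1 for ch in text if not ch.isspace())
    return visible >= min_chars


def pdf_has_text(pdf_path, min_chars=200):
    """Run pdftotext on pdf_path and report whether it yields a usable amount of text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler-utils) is not on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "out.txt"
        subprocess.run(pdftotext_cmd(pdf_path, out), check=True)
        text = out.read_text(encoding="utf-8", errors="replace")
    return looks_like_real_text(text, min_chars)
```

If `pdf_has_text("book.pdf")` returns True while the converted epub still contains only page images, the text is in the file and the problem lies in the conversion step rather than in the PDF lacking text.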

BetterRed · 05-20-2020, 07:19 PM

This happened to me a lot when I was converting lots of PDFs. I gave up trying to fix with calibre's myriad of conversion settings. I just found other ways - most recently Word or Writer. Using Word also allows me to use Toxaris' ePub Tools and Transtools add-ins.

BR

kovidgoyal · 05-20-2020, 10:13 PM

calibre uses pdftohtml from poppler to get the html from the PDF. Presumably it does not work with text layers.

retiredbiker · 05-21-2020, 12:03 AM

Quote:

Originally Posted by kovidgoyal

calibre uses pdftohtml from poppler to get the html from the PDF. Presumably it does not work with text layers.

Sure enough, I tried manually running this through pdftohtml, and got exactly the same result as the Calibre conversion. So whatever is in this pdf, pdftohtml doesn't like it (I called it a "text layer"...I actually have no clue about what all goes into a pdf!).

Nothing I fat-fingered, anyway, that was my concern.

Thank you!
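The manual pdftohtml run described above can be repeated outside calibre, and the output checked for "images only, no text", with something like the sketch below. The options used (`-enc`, `-noframes`, `-i`) are real pdftohtml options, but the thread does not say which options calibre itself passes, so treat the command line as an assumption. The helper names are hypothetical.

```python
from html.parser import HTMLParser


def pdftohtml_cmd(pdf_path, html_path, no_images=False):
    """Build a pdftohtml command line; options go before the file arguments."""
    cmd = ["pdftohtml", "-enc", "UTF-8", "-noframes"]
    if no_images:
        cmd.append("-i")  # ignore images, similar in spirit to a "no pictures" run
    return cmd + [str(pdf_path), str(html_path)]


class _Counter(HTMLParser):
    """Count <img> tags and non-whitespace body text characters."""

    _SKIPPED = ("script", "style", "title")  # their text is not page content

    def __init__(self):
        super().__init__()
        self.images = 0
        self.text_chars = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images += 1
        elif tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_chars += sum(1 for ch in data if not ch.isspace())


def summarize_html(html):
    """Return how many images and how many visible text characters the HTML holds."""
    parser = _Counter()
    parser.feed(html)
    parser.close()
    return {"images": parser.images, "text_chars": parser.text_chars}
```

Run the command from `pdftohtml_cmd(...)` with `subprocess.run`, then pass the contents of the HTML file it produces to `summarize_html`. A result with many images and `text_chars` at or near zero matches the symptom in this thread: one picture per page and no extracted text.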

retiredbiker · 05-21-2020, 12:09 AM

Quote:

Originally Posted by BetterRed

This happened to me a lot when I was converting lots of PDFs. I gave up trying to fix with calibre's myriad of conversion settings. I just found other ways - most recently Word or Writer. Using Word also allows me to use Toxaris' ePub Tools and Transtools add-ins.

BR

I hear you. I can't stand them; this was the first one I attempted this way in about a year. Actually, pdftotext and Writer often don't do badly... mostly they clobber italics.

Log from debug session (post #1):

calibre Debug log
calibre 4.16 embedded-python: True is64bit: True
Linux-5.3.0-7648-generic-x86_64-with-debian-buster-sid Linux ('64bit', 'ELF')
('Linux', '5.3.0-7648-generic', '#41~1586790036~18.04~600aeb5~dev-Ubuntu SMP Mon Apr 13 17:49:24 ')
Python 2.7.16
Linux: ('debian', 'buster/sid', '')
Interface language: None
Successfully initialized third party plugins: Gather KFX-ZIP (from KFX Input) (1, 31, 0) && DeDRM (6, 6, 1) && Package KFX (from KFX Input) (1, 31, 0) && Quality Check (1, 9, 11) && KindleUnpack - The Plugin (0, 82, 1) && Reading List (1, 6, 7) && Goodreads (1, 4, 0) && EpubSplit (2, 9, 0) && Libgen Fiction (0, 1, 0) && Count Pages (1, 9, 0) && Diaps Editing Toolbag (0, 3, 6) && Kobo Utilities (2, 11, 0) && KoboTouchExtended (3, 2, 7) && KFX metadata reader (from KFX Input) (1, 31, 0) && KFX Input (1, 31, 0) && EpubCheck (0, 2, 2) && Find Duplicates (1, 6, 3) && Job Spy (1, 0, 181) && Modify ePub (1, 4, 1) && NormComment (0, 0, 2)
calibre 4.16 embedded-python: True is64bit: True
Linux-5.3.0-7648-generic-x86_64-with-debian-buster-sid Linux ('64bit', 'ELF')
('Linux', '5.3.0-7648-generic', '#41~1586790036~18.04~600aeb5~dev-Ubuntu SMP Mon Apr 13 17:49:24 ')
Python 2.7.16
Linux: ('debian', 'buster/sid', '')
Interface language: None
Successfully initialized third party plugins: Gather KFX-ZIP (from KFX Input) (1, 31, 0) && DeDRM (6, 6, 1) && Package KFX (from KFX Input) (1, 31, 0) && Quality Check (1, 9, 11) && KindleUnpack - The Plugin (0, 82, 1) && Reading List (1, 6, 7) && Goodreads (1, 4, 0) && EpubSplit (2, 9, 0) && Libgen Fiction (0, 1, 0) && Count Pages (1, 9, 0) && Diaps Editing Toolbag (0, 3, 6) && Kobo Utilities (2, 11, 0) && KoboTouchExtended (3, 2, 7) && KFX metadata reader (from KFX Input) (1, 31, 0) && KFX Input (1, 31, 0) && EpubCheck (0, 2, 2) && Find Duplicates (1, 6, 3) && Job Spy (1, 0, 181) && Modify ePub (1, 4, 1) && NormComment (0, 0, 2)
Turning off automatic hidpi scaling
devicePixelRatio: 1.0
logicalDpi: 120.0 x 120.0
physicalDpi: 69.8681948424 x 69.9795918367
Using calibre Qt style: True
[0.00] Starting up...
[0.00] Showing splash screen...
[0.06] splash screen shown
[0.06] Initializing db...
[0.07] db initialized
[0.07] Constructing main UI...
DEBUG: 0.0 KoboUtilites::action.py - loading translations
DEBUG: 0.0 KoboUtilites::dialogs.py - loading translations
DEBUG: 0.0 KoboUtilites::action.py - loading translations
DEBUG: 0.0 NormComment::action.py - loading translations
Looking for desktop notifier support from: org.freedesktop.Notifications
org.freedesktop.Notifications found in 0.0 seconds
DEBUG: 0.6 No Kobo Touch, Glo or Mini appears to be connected
DEBUG: 0.6 rebuild_menus - self.supports_ratings=None, self.supports_tiles=None
DEBUG: 0.6 KoboUtilities:set_toolbar_button_tooltip - start: text='None'
DEBUG: 0.6 KoboUtilities:set_toolbar_button_tooltip - setting to text='Utilities to use with Kobo ereaders Driver: KoboTouchExtended'
Job Spy has begun initialization...
Calibre, and hence Job Spy, was gracefully shut down last time? True
Last time daemon started: never
Last time daemon failed: never
Total daemon starts inception_to_date: 0
Total daemon failures inception-to-date: 0
libpng warning: iCCP: known incorrect sRGB profile
Job Spy has finished initialization...
[1.18] main UI initialized...
[1.18] Hiding splash screen
[61.75] splash screen hidden
[61.75] Started up in 61.75 seconds with 7 books
Worker Launch took: 0.0458168983459
Job: 0 Convert book 1 of 1 (Will-O-The-Wisp) finished
Convert book 1 of 1 (Will-O-The-Wisp)
Conversion options changed from defaults:
  dont_split_on_page_breaks: True
  transform_css_rules: '[{"match_type": "", "query": "", "action": "change", "property": "line-height", "action_data": "1.2"}, {"match_type": "<", "query": "2em", "action": "change", "property": "text-indent", "action_data": "2em"}]'
  change_justification: u'left'
  markup_chapter_headings: False
  preserve_cover_aspect_ratio: True
  debug_pipeline: u'/home/chris/Booksin/debug'
  smarten_punctuation: True
  filter_css: u',color,font-family,background-color,page-break,background'
  cover: u'/tmp/calibre_4.16.0_tmp_Mwr9Ih/IYznXD.jpeg'
  renumber_headings: False
  read_metadata_from_opf: u'/tmp/calibre_4.16.0_tmp_Mwr9Ih/mGNSHI.opf'
  verbose: 2
  extra_css: u'body {\n margin: 0 .1em 0 .1em;\n line-height: 1.2;\n font-size: 1em;\n}'
  output_profile: u'tablet'
  disable_font_rescaling: True
  enable_heuristics: True
  unwrap_factor: 0.3
Resolved conversion options
calibre version: 4.16.0
{'asciiize': False, 'author_sort': None, 'authors': None, 'base_font_size': 0.0, 'book_producer': None, 'change_justification': u'left', 'chapter': u"//[((name()='h1' or name()='h2') and re:test(., '\\s((chapter|book|section|part)\\s+)|((prolog|prologue|epilogue)(\\s+|$))', 'i')) or @class = 'chapter']", 'chapter_mark': u'pagebreak', 'comments': None, 'cover': u'/tmp/calibre_4.16.0_tmp_Mwr9Ih/IYznXD.jpeg', 'debug_pipeline': u'/home/chris/Booksin/debug', 'dehyphenate': True,
'delete_blank_paragraphs': True, 'disable_font_rescaling': True, 'dont_split_on_page_breaks': True, 'duplicate_links_in_toc': False, 'embed_all_fonts': False, 'embed_font_family': None, 'enable_heuristics': True, 'epub_flatten': False, 'epub_inline_toc': False, 'epub_toc_at_end': False, 'epub_version': u'2', 'expand_css': False, 'extra_css': u'body {\n margin: 0 .1em 0 .1em;\n line-height: 1.2;\n font-size: 1em;\n}', 'extract_to': None, 'filter_css': u',color,font-family,background-color,page-break,background', 'fix_indents': True, 'flow_size': 260, 'font_size_mapping': None, 'format_scene_breaks': True, 'html_unwrap_factor': 0.4, 'input_encoding': None, 'input_profile': <calibre.customize.profiles.InputProfile object at 0x7f844cdfa550>, 'insert_blank_line': False, 'insert_blank_line_size': 0.5, 'insert_metadata': False, 'isbn': None, 'italicize_common_cases': True, 'keep_ligatures': False, 'language': None, 'level1_toc': None, 'level2_toc': None, 'level3_toc': None, 'line_height': 0.0, 'linearize_tables': False, 'margin_bottom': 5.0, 'margin_left': 5.0, 'margin_right': 5.0, 'margin_top': 5.0, 'markup_chapter_headings': False, 'max_toc_links': 50, 'minimum_line_height': 120.0, 'new_pdf_engine': False, 'no_chapters_in_toc': False, 'no_default_epub_cover': False, 'no_images': False, 'no_inline_navbars': False, 'no_svg_cover': False, 'output_profile': <calibre.customize.profiles.TabletOutput object at 0x7f844ce1f190>, 'page_breaks_before': u"//[name()='h1' or name()='h2']", 'prefer_metadata_cover': False, 'preserve_cover_aspect_ratio': True, 'pretty_print': True, 'pubdate': None, 'publisher': None, 'rating': None, 'read_metadata_from_opf': u'/tmp/calibre_4.16.0_tmp_Mwr9Ih/mGNSHI.opf', 'remove_fake_margins': True, 'remove_first_image': False, 'remove_paragraph_spacing': False, 'remove_paragraph_spacing_indent_size': 1.5, 'renumber_headings': False, 'replace_scene_breaks': u'', 'search_replace': '[]', 'series': None, 'series_index': None, 'smarten_punctuation': True, 
'sr1_replace': None, 'sr1_search': None, 'sr2_replace': None, 'sr2_search': None, 'sr3_replace': None, 'sr3_search': None, 'start_reading_at': None, 'subset_embedded_fonts': False, 'tags': None, 'timestamp': None, 'title': None, 'title_sort': None, 'toc_filter': None, 'toc_threshold': 6, 'toc_title': None, 'transform_css_rules': '[{"match_type": "", "query": "", "action": "change", "property": "line-height", "action_data": "1.2"}, {"match_type": "<", "query": "2em", "action": "change", "property": "text-indent", "action_data": "2em"}]', 'unsmarten_punctuation': False, 'unwrap_factor': 0.3, 'unwrap_lines': True, 'use_auto_toc': False, 'verbose': 2}
InputFormatPlugin: PDF Input running on /tmp/calibre_4.16.0_tmp_Mwr9Ih/aQVDaL.pdf
Converting file to html...
pdftohtml log:
Page-1 Page-2 Page-3 Page-4 Page-5 Page-6 Page-7 Page-8 Page-9 Page-10 Page-11 Page-12 Page-13 Page-14 Page-15 Page-16 Page-17 Page-18 Page-19 Page-20 Page-21 Page-22 Page-23 Page-24 Page-25 Page-26 Page-27 Page-28 Page-29 Page-30 Page-31 Page-32 Page-33 Page-34 Page-35 Page-36 Page-37 Page-38 Page-39 Page-40 Page-41 Page-42 Page-43 Page-44 Page-45 Page-46 Page-47 Page-48 Page-49 Page-50 Page-51 Page-52 Page-53 Page-54 Page-55 Page-56 Page-57 Page-58 Page-59 Page-60 Page-61 Page-62 Page-63 Page-64 Page-65 Page-66 Page-67 Page-68 Page-69 Page-70 Page-71 Page-72 Page-73 Page-74 Page-75 Page-76 Page-77 Page-78 Page-79 Page-80 Page-81 Page-82 Page-83 Page-84 Page-85 Page-86 Page-87 Page-88 Page-89 Page-90 Page-91 Page-92 Page-93 Page-94 Page-95 Page-96 Page-97 Page-98 Page-99 Page-100 Page-101 Page-102 Page-103 Page-104 Page-105 Page-106 Page-107 Page-108 Page-109 Page-110 Page-111 Page-112 Page-113 Page-114 Page-115 Page-116 Page-117 Page-118 Page-119 Page-120 Page-121 Page-122 Page-123 Page-124 Page-125 Page-126 Page-127 Page-128 Page-129 Page-130 Page-131 Page-132 Page-133 Page-134 Page-135 Page-136 Page-137 Page-138 Page-139 Page-140 Page-141 Page-142 Page-143 Page-144 Page-145 Page-146 Page-147
Page-148 Page-149 Page-150 Page-151 Page-152 Page-153 Page-154 Page-155 Page-156 Page-157 Page-158 Page-159 Page-160 Page-161 Page-162
Retrieving document metadata...
Generating manifest...
Rendering manifest...
Input debug saved to: /home/chris/Booksin/debug/input
Parsing all content...
Parsing index.html ...
****** Heuristic processing HTML *****
There are 0 blank lines. 0.0 percent blank
Hard line breaks check returned True
Median line length is 42, calculated with html format
Unwrapping required, unwrapping Lines
Fixing hyphenated content
lookup word is: WillO, orig is: Will-O
too short, returned hyphenated word: Will-O
lookup word is: TheWisp, orig is: The-Wisp
returned hyphenated word: The-Wisp
Formatting scene breaks
Reading TOC from NCX...
Parsed HTML written to: /home/chris/Booksin/debug/parsed
Merging user specified metadata...
Detecting structure...
Structured HTML written to: /home/chris/Booksin/debug/structure
Flattening CSS and remapping font sizes...
Filtering CSS properties: , background, background-color, page-break, font-family, color
Source base font size is 12.00000pt
Removing fake margins...
Found 163 items of level: p_1
p_1 left margin stats: Counter({u'0': 163})
p_1 right margin stats: Counter({u'0': 163})
Cleaning up manifest...
Trimming unused files from manifest...
Processed HTML written to: /home/chris/Booksin/debug/processed
Creating EPUB Output...
Splitting markup on page breaks and flow limits, if any...
Looking for large trees in index.html...
No large trees found
Removing anchor from TOC href: index.html#p1
EPUB output written to /tmp/calibre_4.16.0_tmp_Mwr9Ih/8CEvWk.epub
(gedit:9211): WARNING *: 13:37:31.169: Error querying file info: Error when getting information for file “/tmp/HVSCHA”: No such file or directory

Last edited by retiredbiker; 05-20-2020 at 01:52 PM.

