PDF to EPUB not finding text
Calibre 4.16, Ubuntu 18.04
I have a PDF which, when I try to convert it, produces only an EPUB full of page images, although it has a perfectly good ToC. The debug files contain only lists of PNG files and the ToC, with not a hint of HTML. The conversion takes only 2 minutes. Turning heuristic processing on or off makes no difference. When I convert with the "no pictures" option checked, all I get is the (again perfectly good) document outline and no pages at all. That conversion takes only 1 second.
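For anyone wanting to confirm the same symptom, here is a minimal sketch that counts images and visible text characters in an HTML file from the debug folder. It assumes the input stage wrote an index.html (the log below mentions "Parsing index.html"); the names `TextImageCounter` and `summarize` are made up for illustration and are not part of calibre.

```python
from html.parser import HTMLParser


class TextImageCounter(HTMLParser):
    """Count <img> tags and visible text characters in an HTML document."""

    # Tags whose text content is not visible page text.
    SKIPPED = ("script", "style", "title")

    def __init__(self):
        super().__init__()
        self.images = 0
        self.text_chars = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images += 1
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only runs between tags.
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def summarize(html):
    """Return (image_count, visible_text_character_count) for an HTML string."""
    counter = TextImageCounter()
    counter.feed(html)
    counter.close()
    return counter.images, counter.text_chars
```

Feeding it the contents of the index.html under the debug "input" folder should give a large image count and a text count near zero if the conversion really produced nothing but page pictures.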
The thing is, the PDF has an excellent text layer. I can copy and paste from it, and when I run it through pdftotext I get an excellent text file. It's as though calibre is not even attempting to extract the text.
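The pdftotext check can be scripted roughly as follows. This is only a sketch: it assumes poppler's `pdftotext` is on the PATH, samples the first three pages (an arbitrary choice), and uses helper names (`pdftotext_command`, `extract_sample`, `has_text_layer`) invented for this example.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, first_page=1, last_page=3):
    """Build a pdftotext command that writes a page range to stdout ("-")."""
    return [
        "pdftotext",
        "-f", str(first_page),
        "-l", str(last_page),
        "-layout",
        pdf_path,
        "-",
    ]


def extract_sample(pdf_path, first_page=1, last_page=3):
    """Run pdftotext on a few pages and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, first_page, last_page),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


def has_text_layer(sample_text, min_chars=50):
    """Heuristic: treat the PDF as having text if the sample has enough non-space characters."""
    return sum(1 for ch in sample_text if not ch.isspace()) >= min_chars
```

If `has_text_layer(extract_sample("book.pdf"))` is true while the converted EPUB has no text, the text layer itself is probably not the problem. The 50-character threshold is a guess and may need adjusting for books with sparse opening pages.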
Could I have somehow changed a setting that turns off text extraction?
Log from debug session:
calibre Debug log
calibre 4.16 embedded-python: True is64bit: True
Linux-5.3.0-7648-generic-x86_64-with-debian-buster-sid Linux ('64bit', 'ELF')
('Linux', '5.3.0-7648-generic', '#41~1586790036~18.04~600aeb5~dev-Ubuntu SMP Mon Apr 13 17:49:24 ')
Python 2.7.16
Linux: ('debian', 'buster/sid', '')
Interface language: None
Successfully initialized third party plugins: Gather KFX-ZIP (from KFX Input) (1, 31, 0) && DeDRM (6, 6, 1) && Package KFX (from KFX Input) (1, 31, 0) && Quality Check (1, 9, 11) && KindleUnpack - The Plugin (0, 82, 1) && Reading List (1, 6, 7) && Goodreads (1, 4, 0) && EpubSplit (2, 9, 0) && Libgen Fiction (0, 1, 0) && Count Pages (1, 9, 0) && Diaps Editing Toolbag (0, 3, 6) && Kobo Utilities (2, 11, 0) && KoboTouchExtended (3, 2, 7) && KFX metadata reader (from KFX Input) (1, 31, 0) && KFX Input (1, 31, 0) && EpubCheck (0, 2, 2) && Find Duplicates (1, 6, 3) && Job Spy (1, 0, 181) && Modify ePub (1, 4, 1) && NormComment (0, 0, 2)
Turning off automatic hidpi scaling
devicePixelRatio: 1.0
logicalDpi: 120.0 x 120.0
physicalDpi: 69.8681948424 x 69.9795918367
Using calibre Qt style: True
[0.00] Starting up...
[0.00] Showing splash screen...
[0.06] splash screen shown
[0.06] Initializing db...
[0.07] db initialized
[0.07] Constructing main UI...
DEBUG: 0.0 KoboUtilites::action.py - loading translations
DEBUG: 0.0 KoboUtilites::dialogs.py - loading translations
DEBUG: 0.0 KoboUtilites::action.py - loading translations
DEBUG: 0.0 NormComment::action.py - loading translations
Looking for desktop notifier support from: org.freedesktop.Notifications
org.freedesktop.Notifications found in 0.0 seconds
DEBUG: 0.6 No Kobo Touch, Glo or Mini appears to be connected
DEBUG: 0.6 rebuild_menus - self.supports_ratings=None, self.supports_tiles=None
DEBUG: 0.6 KoboUtilities:set_toolbar_button_tooltip - start: text='None'
DEBUG: 0.6 KoboUtilities:set_toolbar_button_tooltip - setting to text='Utilities to use with Kobo ereaders
Driver: KoboTouchExtended'
Job Spy has begun initialization...
Calibre, and hence Job Spy, was gracefully shut down last time? True
Last time daemon started: never
Last time daemon failed: never
Total daemon starts inception_to_date: 0
Total daemon failures inception-to-date: 0
libpng warning: iCCP: known incorrect sRGB profile
Job Spy has finished initialization...
[1.18] main UI initialized...
[1.18] Hiding splash screen
[61.75] splash screen hidden
[61.75] Started up in 61.75 seconds with 7 books
Worker Launch took: 0.0458168983459
Job: 0 Convert book 1 of 1 (Will-O-The-Wisp) finished
Convert book 1 of 1 (Will-O-The-Wisp)
Conversion options changed from defaults:
dont_split_on_page_breaks: True
transform_css_rules: '[{"match_type": "*", "query": "", "action": "change", "property": "line-height", "action_data": "1.2"}, {"match_type": "<", "query": "2em", "action": "change", "property": "text-indent", "action_data": "2em"}]'
change_justification: u'left'
markup_chapter_headings: False
preserve_cover_aspect_ratio: True
debug_pipeline: u'/home/chris/Booksin/debug'
smarten_punctuation: True
filter_css: u',color,font-family,background-color,page-break,background'
cover: u'/tmp/calibre_4.16.0_tmp_Mwr9Ih/IYznXD.jpeg'
renumber_headings: False
read_metadata_from_opf: u'/tmp/calibre_4.16.0_tmp_Mwr9Ih/mGNSHI.opf'
verbose: 2
extra_css: u'body {\n margin: 0 .1em 0 .1em;\n line-height: 1.2;\n font-size: 1em;\n}'
output_profile: u'tablet'
disable_font_rescaling: True
enable_heuristics: True
unwrap_factor: 0.3
Resolved conversion options
calibre version: 4.16.0
{'asciiize': False,
'author_sort': None,
'authors': None,
'base_font_size': 0.0,
'book_producer': None,
'change_justification': u'left',
'chapter': u"//*[((name()='h1' or name()='h2') and re:test(., '\\s*((chapter|book|section|part)\\s+)|((prolog|prologue|epilogue)(\\s+|$))', 'i')) or @class = 'chapter']",
'chapter_mark': u'pagebreak',
'comments': None,
'cover': u'/tmp/calibre_4.16.0_tmp_Mwr9Ih/IYznXD.jpeg',
'debug_pipeline': u'/home/chris/Booksin/debug',
'dehyphenate': True,
'delete_blank_paragraphs': True,
'disable_font_rescaling': True,
'dont_split_on_page_breaks': True,
'duplicate_links_in_toc': False,
'embed_all_fonts': False,
'embed_font_family': None,
'enable_heuristics': True,
'epub_flatten': False,
'epub_inline_toc': False,
'epub_toc_at_end': False,
'epub_version': u'2',
'expand_css': False,
'extra_css': u'body {\n margin: 0 .1em 0 .1em;\n line-height: 1.2;\n font-size: 1em;\n}',
'extract_to': None,
'filter_css': u',color,font-family,background-color,page-break,background',
'fix_indents': True,
'flow_size': 260,
'font_size_mapping': None,
'format_scene_breaks': True,
'html_unwrap_factor': 0.4,
'input_encoding': None,
'input_profile': <calibre.customize.profiles.InputProfile object at 0x7f844cdfa550>,
'insert_blank_line': False,
'insert_blank_line_size': 0.5,
'insert_metadata': False,
'isbn': None,
'italicize_common_cases': True,
'keep_ligatures': False,
'language': None,
'level1_toc': None,
'level2_toc': None,
'level3_toc': None,
'line_height': 0.0,
'linearize_tables': False,
'margin_bottom': 5.0,
'margin_left': 5.0,
'margin_right': 5.0,
'margin_top': 5.0,
'markup_chapter_headings': False,
'max_toc_links': 50,
'minimum_line_height': 120.0,
'new_pdf_engine': False,
'no_chapters_in_toc': False,
'no_default_epub_cover': False,
'no_images': False,
'no_inline_navbars': False,
'no_svg_cover': False,
'output_profile': <calibre.customize.profiles.TabletOutput object at 0x7f844ce1f190>,
'page_breaks_before': u"//*[name()='h1' or name()='h2']",
'prefer_metadata_cover': False,
'preserve_cover_aspect_ratio': True,
'pretty_print': True,
'pubdate': None,
'publisher': None,
'rating': None,
'read_metadata_from_opf': u'/tmp/calibre_4.16.0_tmp_Mwr9Ih/mGNSHI.opf',
'remove_fake_margins': True,
'remove_first_image': False,
'remove_paragraph_spacing': False,
'remove_paragraph_spacing_indent_size': 1.5,
'renumber_headings': False,
'replace_scene_breaks': u'',
'search_replace': '[]',
'series': None,
'series_index': None,
'smarten_punctuation': True,
'sr1_replace': None,
'sr1_search': None,
'sr2_replace': None,
'sr2_search': None,
'sr3_replace': None,
'sr3_search': None,
'start_reading_at': None,
'subset_embedded_fonts': False,
'tags': None,
'timestamp': None,
'title': None,
'title_sort': None,
'toc_filter': None,
'toc_threshold': 6,
'toc_title': None,
'transform_css_rules': '[{"match_type": "*", "query": "", "action": "change", "property": "line-height", "action_data": "1.2"}, {"match_type": "<", "query": "2em", "action": "change", "property": "text-indent", "action_data": "2em"}]',
'unsmarten_punctuation': False,
'unwrap_factor': 0.3,
'unwrap_lines': True,
'use_auto_toc': False,
'verbose': 2}
InputFormatPlugin: PDF Input running
on /tmp/calibre_4.16.0_tmp_Mwr9Ih/aQVDaL.pdf
Converting file to html...
pdftohtml log:
Page-1
Page-2
Page-3
Page-4
Page-5
Page-6
Page-7
Page-8
Page-9
Page-10
Page-11
Page-12
Page-13
Page-14
Page-15
Page-16
Page-17
Page-18
Page-19
Page-20
Page-21
Page-22
Page-23
Page-24
Page-25
Page-26
Page-27
Page-28
Page-29
Page-30
Page-31
Page-32
Page-33
Page-34
Page-35
Page-36
Page-37
Page-38
Page-39
Page-40
Page-41
Page-42
Page-43
Page-44
Page-45
Page-46
Page-47
Page-48
Page-49
Page-50
Page-51
Page-52
Page-53
Page-54
Page-55
Page-56
Page-57
Page-58
Page-59
Page-60
Page-61
Page-62
Page-63
Page-64
Page-65
Page-66
Page-67
Page-68
Page-69
Page-70
Page-71
Page-72
Page-73
Page-74
Page-75
Page-76
Page-77
Page-78
Page-79
Page-80
Page-81
Page-82
Page-83
Page-84
Page-85
Page-86
Page-87
Page-88
Page-89
Page-90
Page-91
Page-92
Page-93
Page-94
Page-95
Page-96
Page-97
Page-98
Page-99
Page-100
Page-101
Page-102
Page-103
Page-104
Page-105
Page-106
Page-107
Page-108
Page-109
Page-110
Page-111
Page-112
Page-113
Page-114
Page-115
Page-116
Page-117
Page-118
Page-119
Page-120
Page-121
Page-122
Page-123
Page-124
Page-125
Page-126
Page-127
Page-128
Page-129
Page-130
Page-131
Page-132
Page-133
Page-134
Page-135
Page-136
Page-137
Page-138
Page-139
Page-140
Page-141
Page-142
Page-143
Page-144
Page-145
Page-146
Page-147
Page-148
Page-149
Page-150
Page-151
Page-152
Page-153
Page-154
Page-155
Page-156
Page-157
Page-158
Page-159
Page-160
Page-161
Page-162
Retrieving document metadata...
Generating manifest...
Rendering manifest...
Input debug saved to: /home/chris/Booksin/debug/input
Parsing all content...
Parsing index.html ...
********* Heuristic processing HTML *********
There are 0 blank lines. 0.0 percent blank
Hard line breaks check returned True
Median line length is 42, calculated with html format
Unwrapping required, unwrapping Lines
Fixing hyphenated content
lookup word is: WillO, orig is: Will-O
too short, returned hyphenated word: Will-O
lookup word is: TheWisp, orig is: The-Wisp
returned hyphenated word: The-Wisp
Formatting scene breaks
Reading TOC from NCX...
Parsed HTML written to: /home/chris/Booksin/debug/parsed
Merging user specified metadata...
Detecting structure...
Structured HTML written to: /home/chris/Booksin/debug/structure
Flattening CSS and remapping font sizes...
Filtering CSS properties: , background, background-color, page-break, font-family, color
Source base font size is 12.00000pt
Removing fake margins...
Found 163 items of level: p_1
p_1 left margin stats: Counter({u'0': 163})
p_1 right margin stats: Counter({u'0': 163})
Cleaning up manifest...
Trimming unused files from manifest...
Processed HTML written to: /home/chris/Booksin/debug/processed
Creating EPUB Output...
Splitting markup on page breaks and flow limits, if any...
Looking for large trees in index.html...
No large trees found
Removing anchor from TOC href: index.html#p1
EPUB output written to /tmp/calibre_4.16.0_tmp_Mwr9Ih/8CEvWk.epub
** (gedit:9211): WARNING **: 13:37:31.169: Error querying file info: Error when getting information for file “/tmp/HVSCHA”: No such file or directory
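Since the question is whether a changed setting is responsible, the "Conversion options changed from defaults" part of the log is the natural place to look. Here is a small sketch that pulls that section out of a saved log as a dictionary. It assumes the section runs from that header up to the "Resolved conversion options" line, with one `name: value` pair per line, as in the log above; `changed_options` is a hypothetical helper, not a calibre function.

```python
def changed_options(log_text):
    """Return the non-default conversion options listed in a calibre debug log."""
    options = {}
    in_section = False
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Conversion options changed from defaults"):
            in_section = True
            continue
        if not in_section:
            continue
        if stripped.startswith("Resolved conversion options"):
            break
        # Split on the first ": " only, so values that contain colons stay intact.
        name, sep, value = stripped.partition(": ")
        if sep and name.isidentifier():
            options[name] = value
    return options
```

Values are kept as raw strings (for example `u'left'`), so nothing is evaluated. Comparing the resulting keys against the PDF input options would show whether any of the changed settings even touches the PDF input stage.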
Last edited by retiredbiker; 05-20-2020 at 01:52 PM.