05-20-2020, 12:08 PM   #1
retiredbiker
Evangelist
Posts: 450
Karma: 3886916
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Kobo Forma
PDF to EPUB not finding text

Calibre 4.16, Ubuntu 18.04

I have a PDF which, when I try to convert it, results only in an EPUB full of pictures of the pages, although it has a perfectly good ToC. The debug files contain only lists of PNG files and the ToC, with not a hint of HTML. The conversion takes only 2 minutes. Turning heuristic processing on or off makes no difference. When I convert with the "no pictures" option checked, all I get is the (again perfectly good) document outline, and no pages at all. That conversion takes only 1 second.
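For anyone wanting to confirm the same symptom, here is a rough sketch of how the debug folder could be tallied by file type. It assumes the debug_pipeline folder from the log below (/home/chris/Booksin/debug) and its input subfolder; the helper name count_extensions is made up for illustration and is not part of calibre.

```python
from collections import Counter
from pathlib import Path


def count_extensions(root):
    """Count files under root by lowercase extension (e.g. '.png', '.html')."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower()] += 1
    return counts


if __name__ == "__main__":
    # Assumption: this is the "input" stage folder written by debug_pipeline.
    debug_input = Path("/home/chris/Booksin/debug/input")
    if debug_input.is_dir():
        for ext, n in sorted(count_extensions(debug_input).items()):
            print(f"{ext or '(no extension)'}: {n}")
```

A listing dominated by .png, with hardly any text inside whatever HTML is present, would match what I described above.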

The thing is, the pdf has an excellent text layer. I can copy/paste it, and when I run it through pdftotext, I get an excellent text file. It's as though Calibre is not even attempting to extract the text.
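A minimal sketch of the pdftotext check, in case it helps anyone reproduce it. It assumes pdftotext (from poppler-utils) is on the PATH, that its output is UTF-8, and that pages are separated by form feeds, which is its default behaviour. The function names and the five-page range are only illustrative. Note that the log below shows the conversion itself going through pdftohtml, so a good pdftotext result does not necessarily mean the conversion sees the same text.

```python
import subprocess
import sys


def extract_text(pdf_path, first_page=1, last_page=5):
    """Run pdftotext on a page range and return what it writes to stdout."""
    cmd = ["pdftotext", "-f", str(first_page), "-l", str(last_page),
           "-layout", pdf_path, "-"]
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def chars_per_page(text):
    """Split on form feeds and count the non-whitespace characters per page."""
    pages = text.split("\f")
    # pdftotext ends each page with a form feed, so drop the empty tail.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return [sum(1 for ch in page if not ch.isspace()) for page in pages]


if __name__ == "__main__" and len(sys.argv) > 1:
    for number, count in enumerate(chars_per_page(extract_text(sys.argv[1])), 1):
        print(f"page {number}: {count} characters")
```

Pages that report zero characters would suggest a scan without a text layer; in my case every page has plenty.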

Could I somehow have changed a setting that turns off the text extraction?


Log from debug session:

Spoiler:
calibre Debug log
calibre 4.16 embedded-python: True is64bit: True
Linux-5.3.0-7648-generic-x86_64-with-debian-buster-sid Linux ('64bit', 'ELF')
('Linux', '5.3.0-7648-generic', '#41~1586790036~18.04~600aeb5~dev-Ubuntu SMP Mon Apr 13 17:49:24 ')
Python 2.7.16
Linux: ('debian', 'buster/sid', '')
Interface language: None
Successfully initialized third party plugins: Gather KFX-ZIP (from KFX Input) (1, 31, 0) && DeDRM (6, 6, 1) && Package KFX (from KFX Input) (1, 31, 0) && Quality Check (1, 9, 11) && KindleUnpack - The Plugin (0, 82, 1) && Reading List (1, 6, 7) && Goodreads (1, 4, 0) && EpubSplit (2, 9, 0) && Libgen Fiction (0, 1, 0) && Count Pages (1, 9, 0) && Diaps Editing Toolbag (0, 3, 6) && Kobo Utilities (2, 11, 0) && KoboTouchExtended (3, 2, 7) && KFX metadata reader (from KFX Input) (1, 31, 0) && KFX Input (1, 31, 0) && EpubCheck (0, 2, 2) && Find Duplicates (1, 6, 3) && Job Spy (1, 0, 181) && Modify ePub (1, 4, 1) && NormComment (0, 0, 2)
Turning off automatic hidpi scaling
devicePixelRatio: 1.0
logicalDpi: 120.0 x 120.0
physicalDpi: 69.8681948424 x 69.9795918367
Using calibre Qt style: True
[0.00] Starting up...
[0.00] Showing splash screen...
[0.06] splash screen shown
[0.06] Initializing db...
[0.07] db initialized
[0.07] Constructing main UI...
DEBUG: 0.0 KoboUtilites::action.py - loading translations
DEBUG: 0.0 KoboUtilites::dialogs.py - loading translations
DEBUG: 0.0 KoboUtilites::action.py - loading translations
DEBUG: 0.0 NormComment::action.py - loading translations
Looking for desktop notifier support from: org.freedesktop.Notifications
org.freedesktop.Notifications found in 0.0 seconds
DEBUG: 0.6 No Kobo Touch, Glo or Mini appears to be connected
DEBUG: 0.6 rebuild_menus - self.supports_ratings=None, self.supports_tiles=None
DEBUG: 0.6 KoboUtilities:set_toolbar_button_tooltip - start: text='None'
DEBUG: 0.6 KoboUtilities:set_toolbar_button_tooltip - setting to text='Utilities to use with Kobo ereaders

Driver: KoboTouchExtended'
Job Spy has begun initialization...
Calibre, and hence Job Spy, was gracefully shut down last time? True
Last time daemon started: never
Last time daemon failed: never
Total daemon starts inception_to_date: 0
Total daemon failures inception-to-date: 0
libpng warning: iCCP: known incorrect sRGB profile
Job Spy has finished initialization...
[1.18] main UI initialized...
[1.18] Hiding splash screen
[61.75] splash screen hidden
[61.75] Started up in 61.75 seconds with 7 books
Worker Launch took: 0.0458168983459
Job: 0 Convert book 1 of 1 (Will-O-The-Wisp) finished
Convert book 1 of 1 (Will-O-The-Wisp)
Conversion options changed from defaults:
dont_split_on_page_breaks: True
transform_css_rules: '[{"match_type": "*", "query": "", "action": "change", "property": "line-height", "action_data": "1.2"}, {"match_type": "<", "query": "2em", "action": "change", "property": "text-indent", "action_data": "2em"}]'
change_justification: u'left'
markup_chapter_headings: False
preserve_cover_aspect_ratio: True
debug_pipeline: u'/home/chris/Booksin/debug'
smarten_punctuation: True
filter_css: u',color,font-family,background-color,page-break,background'
cover: u'/tmp/calibre_4.16.0_tmp_Mwr9Ih/IYznXD.jpeg'
renumber_headings: False
read_metadata_from_opf: u'/tmp/calibre_4.16.0_tmp_Mwr9Ih/mGNSHI.opf'
verbose: 2
extra_css: u'body {\n margin: 0 .1em 0 .1em;\n line-height: 1.2;\n font-size: 1em;\n}'
output_profile: u'tablet'
disable_font_rescaling: True
enable_heuristics: True
unwrap_factor: 0.3
Resolved conversion options
calibre version: 4.16.0
{'asciiize': False,
'author_sort': None,
'authors': None,
'base_font_size': 0.0,
'book_producer': None,
'change_justification': u'left',
'chapter': u"//*[((name()='h1' or name()='h2') and re:test(., '\\s*((chapter|book|section|part)\\s+)|((prolog|prologue|epilogue)(\\s+|$))', 'i')) or @class = 'chapter']",
'chapter_mark': u'pagebreak',
'comments': None,
'cover': u'/tmp/calibre_4.16.0_tmp_Mwr9Ih/IYznXD.jpeg',
'debug_pipeline': u'/home/chris/Booksin/debug',
'dehyphenate': True,
'delete_blank_paragraphs': True,
'disable_font_rescaling': True,
'dont_split_on_page_breaks': True,
'duplicate_links_in_toc': False,
'embed_all_fonts': False,
'embed_font_family': None,
'enable_heuristics': True,
'epub_flatten': False,
'epub_inline_toc': False,
'epub_toc_at_end': False,
'epub_version': u'2',
'expand_css': False,
'extra_css': u'body {\n margin: 0 .1em 0 .1em;\n line-height: 1.2;\n font-size: 1em;\n}',
'extract_to': None,
'filter_css': u',color,font-family,background-color,page-break,background',
'fix_indents': True,
'flow_size': 260,
'font_size_mapping': None,
'format_scene_breaks': True,
'html_unwrap_factor': 0.4,
'input_encoding': None,
'input_profile': <calibre.customize.profiles.InputProfile object at 0x7f844cdfa550>,
'insert_blank_line': False,
'insert_blank_line_size': 0.5,
'insert_metadata': False,
'isbn': None,
'italicize_common_cases': True,
'keep_ligatures': False,
'language': None,
'level1_toc': None,
'level2_toc': None,
'level3_toc': None,
'line_height': 0.0,
'linearize_tables': False,
'margin_bottom': 5.0,
'margin_left': 5.0,
'margin_right': 5.0,
'margin_top': 5.0,
'markup_chapter_headings': False,
'max_toc_links': 50,
'minimum_line_height': 120.0,
'new_pdf_engine': False,
'no_chapters_in_toc': False,
'no_default_epub_cover': False,
'no_images': False,
'no_inline_navbars': False,
'no_svg_cover': False,
'output_profile': <calibre.customize.profiles.TabletOutput object at 0x7f844ce1f190>,
'page_breaks_before': u"//*[name()='h1' or name()='h2']",
'prefer_metadata_cover': False,
'preserve_cover_aspect_ratio': True,
'pretty_print': True,
'pubdate': None,
'publisher': None,
'rating': None,
'read_metadata_from_opf': u'/tmp/calibre_4.16.0_tmp_Mwr9Ih/mGNSHI.opf',
'remove_fake_margins': True,
'remove_first_image': False,
'remove_paragraph_spacing': False,
'remove_paragraph_spacing_indent_size': 1.5,
'renumber_headings': False,
'replace_scene_breaks': u'',
'search_replace': '[]',
'series': None,
'series_index': None,
'smarten_punctuation': True,
'sr1_replace': None,
'sr1_search': None,
'sr2_replace': None,
'sr2_search': None,
'sr3_replace': None,
'sr3_search': None,
'start_reading_at': None,
'subset_embedded_fonts': False,
'tags': None,
'timestamp': None,
'title': None,
'title_sort': None,
'toc_filter': None,
'toc_threshold': 6,
'toc_title': None,
'transform_css_rules': '[{"match_type": "*", "query": "", "action": "change", "property": "line-height", "action_data": "1.2"}, {"match_type": "<", "query": "2em", "action": "change", "property": "text-indent", "action_data": "2em"}]',
'unsmarten_punctuation': False,
'unwrap_factor': 0.3,
'unwrap_lines': True,
'use_auto_toc': False,
'verbose': 2}
InputFormatPlugin: PDF Input running
on /tmp/calibre_4.16.0_tmp_Mwr9Ih/aQVDaL.pdf
Converting file to html...
pdftohtml log:
Page-1
Page-2
Page-3
Page-4
Page-5
Page-6
Page-7
Page-8
Page-9
Page-10
Page-11
Page-12
Page-13
Page-14
Page-15
Page-16
Page-17
Page-18
Page-19
Page-20
Page-21
Page-22
Page-23
Page-24
Page-25
Page-26
Page-27
Page-28
Page-29
Page-30
Page-31
Page-32
Page-33
Page-34
Page-35
Page-36
Page-37
Page-38
Page-39
Page-40
Page-41
Page-42
Page-43
Page-44
Page-45
Page-46
Page-47
Page-48
Page-49
Page-50
Page-51
Page-52
Page-53
Page-54
Page-55
Page-56
Page-57
Page-58
Page-59
Page-60
Page-61
Page-62
Page-63
Page-64
Page-65
Page-66
Page-67
Page-68
Page-69
Page-70
Page-71
Page-72
Page-73
Page-74
Page-75
Page-76
Page-77
Page-78
Page-79
Page-80
Page-81
Page-82
Page-83
Page-84
Page-85
Page-86
Page-87
Page-88
Page-89
Page-90
Page-91
Page-92
Page-93
Page-94
Page-95
Page-96
Page-97
Page-98
Page-99
Page-100
Page-101
Page-102
Page-103
Page-104
Page-105
Page-106
Page-107
Page-108
Page-109
Page-110
Page-111
Page-112
Page-113
Page-114
Page-115
Page-116
Page-117
Page-118
Page-119
Page-120
Page-121
Page-122
Page-123
Page-124
Page-125
Page-126
Page-127
Page-128
Page-129
Page-130
Page-131
Page-132
Page-133
Page-134
Page-135
Page-136
Page-137
Page-138
Page-139
Page-140
Page-141
Page-142
Page-143
Page-144
Page-145
Page-146
Page-147
Page-148
Page-149
Page-150
Page-151
Page-152
Page-153
Page-154
Page-155
Page-156
Page-157
Page-158
Page-159
Page-160
Page-161
Page-162
Retrieving document metadata...
Generating manifest...
Rendering manifest...
Input debug saved to: /home/chris/Booksin/debug/input
Parsing all content...
Parsing index.html ...
********* Heuristic processing HTML *********
There are 0 blank lines. 0.0 percent blank
Hard line breaks check returned True
Median line length is 42, calculated with html format
Unwrapping required, unwrapping Lines
Fixing hyphenated content
lookup word is: WillO, orig is: Will-O
too short, returned hyphenated word: Will-O
lookup word is: TheWisp, orig is: The-Wisp
returned hyphenated word: The-Wisp
Formatting scene breaks
Reading TOC from NCX...
Parsed HTML written to: /home/chris/Booksin/debug/parsed
Merging user specified metadata...
Detecting structure...
Structured HTML written to: /home/chris/Booksin/debug/structure
Flattening CSS and remapping font sizes...
Filtering CSS properties: , background, background-color, page-break, font-family, color
Source base font size is 12.00000pt
Removing fake margins...
Found 163 items of level: p_1
p_1 left margin stats: Counter({u'0': 163})
p_1 right margin stats: Counter({u'0': 163})
Cleaning up manifest...
Trimming unused files from manifest...
Processed HTML written to: /home/chris/Booksin/debug/processed
Creating EPUB Output...
Splitting markup on page breaks and flow limits, if any...
Looking for large trees in index.html...
No large trees found
Removing anchor from TOC href: index.html#p1
EPUB output written to /tmp/calibre_4.16.0_tmp_Mwr9Ih/8CEvWk.epub


** (gedit:9211): WARNING **: 13:37:31.169: Error querying file info: Error when getting information for file “/tmp/HVSCHA”: No such file or directory

Last edited by retiredbiker; 05-20-2020 at 01:52 PM.