Old 05-20-2020, 12:08 PM   #1
retiredbiker
Addict
Posts: 387
Karma: 1638210
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Jutoh, Kobo Forma
PDF to EPUB not finding text

Calibre 4.16, Ubuntu 18.04

I have a pdf which, when I try to convert it, results only in an epub full of pictures of the pages, although it has a perfectly good ToC. The debug files contain only lists of png files and the ToC, with not a hint of html. Conversion takes only 2 minutes. Turning heuristic processing on or off makes no difference. When I convert with the "no pictures" option checked, all I get is the (again perfectly good) document outline and no pages at all. That conversion takes only 1 second.

The thing is, the pdf has an excellent text layer. I can copy/paste it, and when I run it through pdftotext, I get an excellent text file. It's as though Calibre is not even attempting to extract the text.

Could I have somehow changed a setting that turns off the text extraction?
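For anyone wanting to check the same symptom in their own debug folder, here is a minimal Python 3 sketch that counts the visible text characters and the image tags in an HTML file. The class and function names are made up for illustration, and it assumes the debug "input" stage contains an index.html, which the log suggests but which I have not verified for every calibre version.

```python
from html.parser import HTMLParser


class TextAndImageCounter(HTMLParser):
    """Count visible text characters and <img> tags in an HTML document."""

    SKIPPED = ("script", "style", "title")  # content that is not page text

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.images = 0
        self._skip_depth = 0  # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img":
            self.images += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            # Ignore whitespace so indentation does not count as text.
            self.text_chars += len("".join(data.split()))


def summarize_html(html):
    """Return (visible_text_chars, image_count) for an HTML string."""
    counter = TextAndImageCounter()
    counter.feed(html)
    counter.close()
    return counter.text_chars, counter.images
```

Reading the file with something like `summarize_html(open(path, encoding="utf-8").read())` and getting zero text characters next to one image per page would match what is described above: pages came through as pictures only.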


Log from debug session:

Spoiler:
calibre Debug log
calibre 4.16 embedded-python: True is64bit: True
Linux-5.3.0-7648-generic-x86_64-with-debian-buster-sid Linux ('64bit', 'ELF')
('Linux', '5.3.0-7648-generic', '#41~1586790036~18.04~600aeb5~dev-Ubuntu SMP Mon Apr 13 17:49:24 ')
Python 2.7.16
Linux: ('debian', 'buster/sid', '')
Interface language: None
Successfully initialized third party plugins: Gather KFX-ZIP (from KFX Input) (1, 31, 0) && DeDRM (6, 6, 1) && Package KFX (from KFX Input) (1, 31, 0) && Quality Check (1, 9, 11) && KindleUnpack - The Plugin (0, 82, 1) && Reading List (1, 6, 7) && Goodreads (1, 4, 0) && EpubSplit (2, 9, 0) && Libgen Fiction (0, 1, 0) && Count Pages (1, 9, 0) && Diaps Editing Toolbag (0, 3, 6) && Kobo Utilities (2, 11, 0) && KoboTouchExtended (3, 2, 7) && KFX metadata reader (from KFX Input) (1, 31, 0) && KFX Input (1, 31, 0) && EpubCheck (0, 2, 2) && Find Duplicates (1, 6, 3) && Job Spy (1, 0, 181) && Modify ePub (1, 4, 1) && NormComment (0, 0, 2)
Turning off automatic hidpi scaling
devicePixelRatio: 1.0
logicalDpi: 120.0 x 120.0
physicalDpi: 69.8681948424 x 69.9795918367
Using calibre Qt style: True
[0.00] Starting up...
[0.00] Showing splash screen...
[0.06] splash screen shown
[0.06] Initializing db...
[0.07] db initialized
[0.07] Constructing main UI...
DEBUG: 0.0 KoboUtilites::action.py - loading translations
DEBUG: 0.0 KoboUtilites::dialogs.py - loading translations
DEBUG: 0.0 KoboUtilites::action.py - loading translations
DEBUG: 0.0 NormComment::action.py - loading translations
Looking for desktop notifier support from: org.freedesktop.Notifications
org.freedesktop.Notifications found in 0.0 seconds
DEBUG: 0.6 No Kobo Touch, Glo or Mini appears to be connected
DEBUG: 0.6 rebuild_menus - self.supports_ratings=None, self.supports_tiles=None
DEBUG: 0.6 KoboUtilities:set_toolbar_button_tooltip - start: text='None'
DEBUG: 0.6 KoboUtilities:set_toolbar_button_tooltip - setting to text='Utilities to use with Kobo ereaders

Driver: KoboTouchExtended'
Job Spy has begun initialization...
Calibre, and hence Job Spy, was gracefully shut down last time? True
Last time daemon started: never
Last time daemon failed: never
Total daemon starts inception_to_date: 0
Total daemon failures inception-to-date: 0
libpng warning: iCCP: known incorrect sRGB profile
Job Spy has finished initialization...
[1.18] main UI initialized...
[1.18] Hiding splash screen
[61.75] splash screen hidden
[61.75] Started up in 61.75 seconds with 7 books
Worker Launch took: 0.0458168983459
Job: 0 Convert book 1 of 1 (Will-O-The-Wisp) finished
Convert book 1 of 1 (Will-O-The-Wisp)
Conversion options changed from defaults:
dont_split_on_page_breaks: True
transform_css_rules: '[{"match_type": "*", "query": "", "action": "change", "property": "line-height", "action_data": "1.2"}, {"match_type": "<", "query": "2em", "action": "change", "property": "text-indent", "action_data": "2em"}]'
change_justification: u'left'
markup_chapter_headings: False
preserve_cover_aspect_ratio: True
debug_pipeline: u'/home/chris/Booksin/debug'
smarten_punctuation: True
filter_css: u',color,font-family,background-color,page-break,background'
cover: u'/tmp/calibre_4.16.0_tmp_Mwr9Ih/IYznXD.jpeg'
renumber_headings: False
read_metadata_from_opf: u'/tmp/calibre_4.16.0_tmp_Mwr9Ih/mGNSHI.opf'
verbose: 2
extra_css: u'body {\n margin: 0 .1em 0 .1em;\n line-height: 1.2;\n font-size: 1em;\n}'
output_profile: u'tablet'
disable_font_rescaling: True
enable_heuristics: True
unwrap_factor: 0.3
Resolved conversion options
calibre version: 4.16.0
{'asciiize': False,
'author_sort': None,
'authors': None,
'base_font_size': 0.0,
'book_producer': None,
'change_justification': u'left',
'chapter': u"//*[((name()='h1' or name()='h2') and re:test(., '\\s*((chapter|book|section|part)\\s+)|((prolog|prologue|epilogue)(\\s+|$))', 'i')) or @class = 'chapter']",
'chapter_mark': u'pagebreak',
'comments': None,
'cover': u'/tmp/calibre_4.16.0_tmp_Mwr9Ih/IYznXD.jpeg',
'debug_pipeline': u'/home/chris/Booksin/debug',
'dehyphenate': True,
'delete_blank_paragraphs': True,
'disable_font_rescaling': True,
'dont_split_on_page_breaks': True,
'duplicate_links_in_toc': False,
'embed_all_fonts': False,
'embed_font_family': None,
'enable_heuristics': True,
'epub_flatten': False,
'epub_inline_toc': False,
'epub_toc_at_end': False,
'epub_version': u'2',
'expand_css': False,
'extra_css': u'body {\n margin: 0 .1em 0 .1em;\n line-height: 1.2;\n font-size: 1em;\n}',
'extract_to': None,
'filter_css': u',color,font-family,background-color,page-break,background',
'fix_indents': True,
'flow_size': 260,
'font_size_mapping': None,
'format_scene_breaks': True,
'html_unwrap_factor': 0.4,
'input_encoding': None,
'input_profile': <calibre.customize.profiles.InputProfile object at 0x7f844cdfa550>,
'insert_blank_line': False,
'insert_blank_line_size': 0.5,
'insert_metadata': False,
'isbn': None,
'italicize_common_cases': True,
'keep_ligatures': False,
'language': None,
'level1_toc': None,
'level2_toc': None,
'level3_toc': None,
'line_height': 0.0,
'linearize_tables': False,
'margin_bottom': 5.0,
'margin_left': 5.0,
'margin_right': 5.0,
'margin_top': 5.0,
'markup_chapter_headings': False,
'max_toc_links': 50,
'minimum_line_height': 120.0,
'new_pdf_engine': False,
'no_chapters_in_toc': False,
'no_default_epub_cover': False,
'no_images': False,
'no_inline_navbars': False,
'no_svg_cover': False,
'output_profile': <calibre.customize.profiles.TabletOutput object at 0x7f844ce1f190>,
'page_breaks_before': u"//*[name()='h1' or name()='h2']",
'prefer_metadata_cover': False,
'preserve_cover_aspect_ratio': True,
'pretty_print': True,
'pubdate': None,
'publisher': None,
'rating': None,
'read_metadata_from_opf': u'/tmp/calibre_4.16.0_tmp_Mwr9Ih/mGNSHI.opf',
'remove_fake_margins': True,
'remove_first_image': False,
'remove_paragraph_spacing': False,
'remove_paragraph_spacing_indent_size': 1.5,
'renumber_headings': False,
'replace_scene_breaks': u'',
'search_replace': '[]',
'series': None,
'series_index': None,
'smarten_punctuation': True,
'sr1_replace': None,
'sr1_search': None,
'sr2_replace': None,
'sr2_search': None,
'sr3_replace': None,
'sr3_search': None,
'start_reading_at': None,
'subset_embedded_fonts': False,
'tags': None,
'timestamp': None,
'title': None,
'title_sort': None,
'toc_filter': None,
'toc_threshold': 6,
'toc_title': None,
'transform_css_rules': '[{"match_type": "*", "query": "", "action": "change", "property": "line-height", "action_data": "1.2"}, {"match_type": "<", "query": "2em", "action": "change", "property": "text-indent", "action_data": "2em"}]',
'unsmarten_punctuation': False,
'unwrap_factor': 0.3,
'unwrap_lines': True,
'use_auto_toc': False,
'verbose': 2}
InputFormatPlugin: PDF Input running
on /tmp/calibre_4.16.0_tmp_Mwr9Ih/aQVDaL.pdf
Converting file to html...
pdftohtml log:
Page-1
Page-2
Page-3
Page-4
Page-5
Page-6
Page-7
Page-8
Page-9
Page-10
Page-11
Page-12
Page-13
Page-14
Page-15
Page-16
Page-17
Page-18
Page-19
Page-20
Page-21
Page-22
Page-23
Page-24
Page-25
Page-26
Page-27
Page-28
Page-29
Page-30
Page-31
Page-32
Page-33
Page-34
Page-35
Page-36
Page-37
Page-38
Page-39
Page-40
Page-41
Page-42
Page-43
Page-44
Page-45
Page-46
Page-47
Page-48
Page-49
Page-50
Page-51
Page-52
Page-53
Page-54
Page-55
Page-56
Page-57
Page-58
Page-59
Page-60
Page-61
Page-62
Page-63
Page-64
Page-65
Page-66
Page-67
Page-68
Page-69
Page-70
Page-71
Page-72
Page-73
Page-74
Page-75
Page-76
Page-77
Page-78
Page-79
Page-80
Page-81
Page-82
Page-83
Page-84
Page-85
Page-86
Page-87
Page-88
Page-89
Page-90
Page-91
Page-92
Page-93
Page-94
Page-95
Page-96
Page-97
Page-98
Page-99
Page-100
Page-101
Page-102
Page-103
Page-104
Page-105
Page-106
Page-107
Page-108
Page-109
Page-110
Page-111
Page-112
Page-113
Page-114
Page-115
Page-116
Page-117
Page-118
Page-119
Page-120
Page-121
Page-122
Page-123
Page-124
Page-125
Page-126
Page-127
Page-128
Page-129
Page-130
Page-131
Page-132
Page-133
Page-134
Page-135
Page-136
Page-137
Page-138
Page-139
Page-140
Page-141
Page-142
Page-143
Page-144
Page-145
Page-146
Page-147
Page-148
Page-149
Page-150
Page-151
Page-152
Page-153
Page-154
Page-155
Page-156
Page-157
Page-158
Page-159
Page-160
Page-161
Page-162
Retrieving document metadata...
Generating manifest...
Rendering manifest...
Input debug saved to: /home/chris/Booksin/debug/input
Parsing all content...
Parsing index.html ...
********* Heuristic processing HTML *********
There are 0 blank lines. 0.0 percent blank
Hard line breaks check returned True
Median line length is 42, calculated with html format
Unwrapping required, unwrapping Lines
Fixing hyphenated content
lookup word is: WillO, orig is: Will-O
too short, returned hyphenated word: Will-O
lookup word is: TheWisp, orig is: The-Wisp
returned hyphenated word: The-Wisp
Formatting scene breaks
Reading TOC from NCX...
Parsed HTML written to: /home/chris/Booksin/debug/parsed
Merging user specified metadata...
Detecting structure...
Structured HTML written to: /home/chris/Booksin/debug/structure
Flattening CSS and remapping font sizes...
Filtering CSS properties: , background, background-color, page-break, font-family, color
Source base font size is 12.00000pt
Removing fake margins...
Found 163 items of level: p_1
p_1 left margin stats: Counter({u'0': 163})
p_1 right margin stats: Counter({u'0': 163})
Cleaning up manifest...
Trimming unused files from manifest...
Processed HTML written to: /home/chris/Booksin/debug/processed
Creating EPUB Output...
Splitting markup on page breaks and flow limits, if any...
Looking for large trees in index.html...
No large trees found
Removing anchor from TOC href: index.html#p1
EPUB output written to /tmp/calibre_4.16.0_tmp_Mwr9Ih/8CEvWk.epub


** (gedit:9211): WARNING **: 13:37:31.169: Error querying file info: Error when getting information for file “/tmp/HVSCHA”: No such file or directory

Last edited by retiredbiker; 05-20-2020 at 01:52 PM.
Old 05-20-2020, 07:19 PM   #2
BetterRed
null operator (he/him)
Posts: 20,567
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
This happened to me a lot when I was converting lots of PDFs. I gave up trying to fix it with calibre's myriad conversion settings. I just found other ways - most recently Word or Writer. Using Word also allows me to use Toxaris' ePub Tools and Transtools add-ins.

BR
Old 05-20-2020, 10:13 PM   #3
kovidgoyal
creator of calibre
Posts: 43,856
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
calibre uses pdftohtml from poppler to get the html from the PDF. Presumably it does not work with text layers.
Old 05-21-2020, 12:03 AM   #4
retiredbiker
Addict
Quote:
Originally Posted by kovidgoyal View Post
calibre uses pdftohtml from poppler to get the html from the PDF. Presumably it does not work with text layers.
Sure enough, I tried manually running this through pdftohtml, and got exactly the same result as the Calibre conversion. So whatever is in this pdf, pdftohtml doesn't like it (I called it a "text layer"...I actually have no clue about what all goes into a pdf!).
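A rough way to repeat that comparison is sketched below in Python 3. It assumes poppler's pdftotext and pdftohtml are on the PATH; the flags used (`-i`, `-noframes`, `-stdout`, and `-` as the pdftotext output) are my understanding of the poppler command lines, not necessarily what calibre passes internally, and the function names are invented for this example. The tag stripping is deliberately crude, so only a large gap between the two counts means anything.

```python
import re
import shutil
import subprocess


def extraction_commands(pdf_path):
    """Return the two poppler command lines to compare, as argument lists."""
    return {
        # "-" asks pdftotext to write the text to standard output.
        "pdftotext": ["pdftotext", pdf_path, "-"],
        # -i ignores images, -noframes produces a single HTML document,
        # -stdout writes it to standard output instead of a file.
        "pdftohtml": ["pdftohtml", "-i", "-noframes", "-stdout", pdf_path],
    }


def visible_chars(output):
    """Crudely strip anything that looks like a tag, then count non-blank characters."""
    without_tags = re.sub(r"<[^>]*>", "", output)
    return len("".join(without_tags.split()))


def compare(pdf_path):
    """Run both tools and return {tool: character count, or None if not installed}."""
    counts = {}
    for tool, cmd in extraction_commands(pdf_path).items():
        if shutil.which(tool) is None:
            counts[tool] = None
            continue
        result = subprocess.run(cmd, capture_output=True, check=False)
        counts[tool] = visible_chars(result.stdout.decode("utf-8", "replace"))
    return counts
```

Calling `compare("book.pdf")` (file name hypothetical) on a PDF like this one would be expected to show a large count for pdftotext and a very small one for pdftohtml.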

Nothing I fat-fingered, anyway, that was my concern.

Thank you!
Old 05-21-2020, 12:09 AM   #5
retiredbiker
Addict
Quote:
Originally Posted by BetterRed View Post
This happened to me a lot when I was converting lots of PDFs. I gave up trying to fix it with calibre's myriad conversion settings. I just found other ways - most recently Word or Writer. Using Word also allows me to use Toxaris' ePub Tools and Transtools add-ins.

BR
I hear you. I can't stand them; this was the first one I attempted this way in about a year. Actually, pdftotext and Writer often don't do badly...mostly they clobber italics.
All times are GMT -4.

