#1233 |
|
Junior Member
Posts: 1
Karma: 10
Join Date: Jul 2026
Device: KindleOasis
|
Performance fix + configurable PDF page DPI for print replica conversion
Hi jhowell,
First of all, thank you for maintaining this excellent plugin. While converting Japanese print replica books (PDF-backed fixed layout KFX), I ran into two issues and would like to propose fixes for both. A patch against v2.25.0 is included below.

1. O(n^2) slowdown in check_consistency() with PDF-backed books

get_pdf_page_size() creates a new pypdf.PdfReader (and re-flattens the page tree) for every page reference. For a 336-page print replica book, decode_book() takes over 30 seconds, nearly all of it in this loop. The patch caches the PdfReader per PDF content (keyed by a content fingerprint, since each resource fragment may hold a distinct bytes object with identical content).

Results on the same book:
- get_pdf_page_size across 336 pages: over 30 sec -> 0.18 sec
- decode_book() incl. consistency check: over 30 sec -> about 11 sec

2. Configurable DPI for PDF page rendering (currently fixed at 150)

convert_pdf_to_jpeg() is limited to 150 dpi because calibre's page_images() does not accept a resolution argument. 150 dpi is quite low for technical books - small text and ruby annotations become hard to read. The patch invokes pdftoppm directly (using calibre's bundled poppler binary when available, falling back to page_images() at 150 dpi on failure) and adds a "pdf_page_dpi" conversion option (default 300, clamped to 72-600), e.g.:

ebook-convert book.kfx book.epub --pdf-page-dpi 300

I set the default to 300 in the patch, but keeping 150 as the default would of course preserve existing behavior.

Verified on Windows (calibre 9.8) and Linux with two Japanese print replica books. Happy to adjust anything if you'd prefer a different approach.

Thanks again!

Code:
--- a/kfxlib/resources.py
+++ b/kfxlib/resources.py
@@ -314,17 +314,62 @@
     return outfile.getvalue()
-def convert_pdf_to_jpeg(pdf_data, page_num, dpi=150, reported_errors=None):
-    pdf_file = temp_filename("pdf", pdf_data)
-    jpeg_dir = create_temp_dir()
+PDF_PAGE_DPI = 150  # default; overridden by the KFX Input "pdf_page_dpi" conversion option
+
+
+def find_pdftoppm():
+    """Locate the pdftoppm executable (calibre's bundled poppler, or the system PATH)."""
+    candidates = []
     if calibre_numeric_version is not None:
+        try:
+            from calibre.ebooks.pdf.pdftohtml import PDFTOHTML
+            base = os.path.dirname(PDFTOHTML)
+            candidates.append(os.path.join(base, "pdftoppm.exe"))
+            candidates.append(os.path.join(base, "pdftoppm"))
+        except Exception:
+            pass
+
+    for candidate in candidates:
+        if os.path.exists(candidate):
+            return candidate
+
+    return "pdftoppm"  # rely on the system PATH
+
+
+def convert_pdf_to_jpeg(pdf_data, page_num, dpi=None, reported_errors=None):
+    if dpi is None:
+        dpi = PDF_PAGE_DPI
+
+    pdf_file = temp_filename("pdf", pdf_data)
+    jpeg_dir = create_temp_dir()
-        if dpi != 150:
-            raise Exception("calibre PDF page_images supports only default 150dpi")
+    rendered = False
+    try:
+        import subprocess
+        args = [
+            find_pdftoppm(), "-jpeg", "-r", str(dpi), "-cropbox",
+            "-f", str(page_num), "-l", str(page_num),
+            pdf_file, os.path.join(jpeg_dir, "page")]
+
+        kwargs = {}
+        if os.name == "nt":
+            kwargs["creationflags"] = 0x08000000  # CREATE_NO_WINDOW
+
+        subprocess.run(args, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, timeout=120, **kwargs)
+        rendered = True
+    except Exception as e:
+        if reported_errors is not None and "pdftoppm_direct" not in reported_errors:
+            reported_errors.add("pdftoppm_direct")
+            log.warning("Direct pdftoppm rendering failed (%s), falling back to calibre page_images at 150dpi" % repr(e))
+    if not rendered and calibre_numeric_version is not None:
         from calibre.ebooks.metadata.pdf import page_images
         page_images(pdf_file, jpeg_dir, first=page_num, last=page_num)
+        rendered = True
+
+    if not rendered:
+        raise Exception("No PDF rendering method available (pdftoppm not found)")
     for dirpath, dirnames, filenames in os.walk(jpeg_dir):
         if len(filenames) != 1:
@@ -495,9 +540,25 @@
     return best_raw_media, best_quality
+_pdf_reader_cache = {}  # content fingerprint -> PdfReader
+
+
+def get_cached_pdf_reader(pdf_data):
+    """Cache PdfReader per PDF content. Avoids O(n^2) re-parsing and page-tree
+    re-flattening when a book references hundreds of pages of the same embedded
+    PDF (print replica). Keyed by a content fingerprint since each resource may
+    hold a distinct bytes object with identical content."""
+    key = (len(pdf_data), bytes(pdf_data[:256]), bytes(pdf_data[-256:]))
+    reader = _pdf_reader_cache.get(key)
+    if reader is None:
+        reader = pypdf.PdfReader(io.BytesIO(pdf_data))
+        len(reader.pages)  # force one-time page tree flatten while caching
+        _pdf_reader_cache[key] = reader
+    return reader
+
+
 def get_pdf_page_size(pdf_data, resource_name, page_num):
-    raw_media_file = io.BytesIO(pdf_data)
-    pdf = pypdf.PdfReader(raw_media_file)
+    pdf = get_cached_pdf_reader(pdf_data)
     page = pdf.pages[page_num - 1]
     if page.user_unit != 1:
--- a/__init__.py
+++ b/__init__.py
@@ -25,7 +25,7 @@
     name = "KFX Input"
     author = "jhowell"
     file_types = {"azw8", "kfx", "kfx-zip", "kpf"}
-    version = (2, 25, 0)
+    version = (2, 25, 1)  # custom build: configurable PDF page DPI + PdfReader cache
     minimum_calibre_version = (5, 0, 0)  # Python 3.8.5
     supported_platforms = ["windows", "osx", "linux"]
     description = "Convert from Amazon KFX format"
@@ -36,6 +36,11 @@
             help="Allow conversion to proceed even if the KFX book contains unexpected or incorrect data "
             "that may not convert properly. If this option is selected it is recommend that the log of each "
             "conversion be checked for error messages."),
+        OptionRecommendation(
+            name="pdf_page_dpi", recommended_value=300,
+            help="Resolution (DPI) used to render embedded PDF pages of print replica books as images "
+            "during conversion. Higher values produce sharper text at the cost of larger output files. "
+            "Default is 300. (The original plugin used a fixed 150 dpi.)"),
     }
     recommendations = EPUBInput.recommendations
@@ -93,13 +98,23 @@
         job_log = set_logger(JobLog(log))
         job_log.info("Converting %s" % name_of_file(stream))
+        from calibre_plugins.kfx_input.kfxlib import resources as kfx_resources
+        try:
+            pdf_page_dpi = int(getattr(options, "pdf_page_dpi", 300) or 300)
+        except (TypeError, ValueError):
+            pdf_page_dpi = 300
+        kfx_resources.PDF_PAGE_DPI = max(72, min(600, pdf_page_dpi))
+        job_log.info("PDF page rendering resolution: %d dpi" % kfx_resources.PDF_PAGE_DPI)
+
         book = YJ_Book(stream, symbol_catalog_filename=get_symbol_catalog_filename())
         book.decode_book(retain_yj_locals=True)
         if book.has_pdf_resource:
             job_log.warning(
-                "This book contains PDF content. It can be extracted using either the From KFX user interface "
-                "plugin or the KFX Input plugin CLI. See the KFX Input plugin documentation for more information.")
+                "This book contains PDF content. Its pages will be rendered as %d dpi images for this "
+                "conversion. To obtain the original PDF without quality loss use either the From KFX user "
+                "interface plugin or the KFX Input plugin CLI instead. See the KFX Input plugin "
+                "documentation for more information." % kfx_resources.PDF_PAGE_DPI)
         if book.is_fixed_layout or book.is_magazine:
             job_log.warning(
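
For anyone who wants to try the caching idea from the patch in isolation, here is a minimal sketch that runs without calibre or pypdf. It uses the same fingerprint as the patch (length plus the first and last 256 bytes). The `ExpensiveReader` class and the `get_cached_reader` and `fingerprint` names are hypothetical stand-ins for illustration only and are not part of the plugin.

```python
import io


class ExpensiveReader:
    """Hypothetical stand-in for a costly parser such as pypdf.PdfReader.

    It counts how often it is constructed so the cache effect is visible.
    """
    constructed = 0

    def __init__(self, stream):
        ExpensiveReader.constructed += 1
        self.data = stream.read()


_reader_cache = {}  # content fingerprint -> ExpensiveReader


def fingerprint(data):
    # Same key shape as in the patch: cheap to compute, independent of
    # object identity. It is not a full hash, so a collision between two
    # different files is possible in principle.
    return (len(data), bytes(data[:256]), bytes(data[-256:]))


def get_cached_reader(data):
    """Return one shared reader per distinct content fingerprint."""
    key = fingerprint(data)
    reader = _reader_cache.get(key)
    if reader is None:
        reader = ExpensiveReader(io.BytesIO(data))
        _reader_cache[key] = reader
    return reader


# Two distinct bytes objects with identical content share one reader.
first = bytes(bytearray(b"%PDF-1.7 example content"))
second = bytes(bytearray(b"%PDF-1.7 example content"))
r1 = get_cached_reader(first)
r2 = get_cached_reader(second)
```

A full-content hash would rule out collisions more firmly, but it has to read the whole PDF on every lookup, which may matter when the same large file is looked up once per page. The cache in this sketch also never evicts entries, so a long-running process might want to clear it between books.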
|
|
|
|
|
|
#1234 |
|
Resident Curmudgeon
Posts: 84,011
Karma: 153695583
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
You should be doing your testing with the latest version of calibre.
|
|
|
|
|
|
#1235 | |
|
Grand Sorcerer
Posts: 7,383
Karma: 95902893
Join Date: Nov 2011
Location: Charlottesville, VA
Device: Kindles
|
Update: There have been a lot of changes to the plugin since version 2.25.0, which your patch is based on. In cases where PDF pages are composed of just an image, that image will be extracted with no loss of resolution. If not, the DPI for PDF pages that need to be rendered as images has already been changed from 150 to 300. I believe that those existing changes should be sufficient. If you need a higher DPI you can alter the value of PDF_TO_IMAGE_DPI in resources.py to whatever you want.

I still need to look into the caching issue that you raised.

Last edited by jhowell; Yesterday at 09:52 AM. Reason: Update