[Conversion Plugin] KFX Input - Page 83

DNSB · 06-23-2026, 09:12 PM

Quote:

Originally Posted by yoshi

Thank you for your explanation.

Adding "rendition:page-spread-center" without using "Plugin Tweaks" would be beneficial for EPUB readers.

You could try wrapping it in [noparse] ... [/noparse]. This keeps the :p from being seen as a winky.

theducks · 06-23-2026, 09:58 PM

Quote:

Originally Posted by yoshi

I tried to correct the above corrupted character but failed to do it.

You need to disable smilies under the advanced edit options. (I did it for you.)

utakata · 07-02-2026, 06:03 PM

Hi jhowell,

First of all, thank you for maintaining this excellent plugin.

While converting Japanese print replica books (PDF-backed fixed layout KFX), I ran into two issues and would like to propose fixes for both. A patch against v2.25.0 is included below.

1. O(n^2) slowdown in check_consistency() with PDF-backed books

get_pdf_page_size() creates a new pypdf.PdfReader (and re-flattens the page tree) for every page reference. For a 336-page print replica book, decode_book() takes over 30 seconds, nearly all of it in this loop.

The patch caches the PdfReader per PDF content (keyed by a content fingerprint, since each resource fragment may hold a distinct bytes object with identical content). Results on the same book:
- get_pdf_page_size across 336 pages: over 30 sec -> 0.18 sec
- decode_book() incl. consistency check: over 30 sec -> about 11 sec
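To illustrate the caching idea outside the plugin, here is a minimal sketch. It is not the patch itself: `FakeReader` is a hypothetical stand-in for an expensive parser such as pypdf.PdfReader, and the names `content_key` and `get_cached_reader` are made up for this example. Only the key layout (length plus the first and last 256 bytes) mirrors the patch.

```python
# Hypothetical stand-in for an expensive parser (e.g. pypdf.PdfReader).
# It only counts how many times it gets constructed.
class FakeReader:
    instances = 0

    def __init__(self, data):
        FakeReader.instances += 1
        self.data = data


_reader_cache = {}  # content fingerprint -> reader


def content_key(data):
    """Cheap fingerprint: equal content in distinct bytes objects yields the same key."""
    return (len(data), bytes(data[:256]), bytes(data[-256:]))


def get_cached_reader(data, factory=FakeReader):
    """Return the reader for this content, building it only on first use."""
    key = content_key(data)
    reader = _reader_cache.get(key)
    if reader is None:
        reader = factory(data)
        _reader_cache[key] = reader
    return reader


# Two distinct bytes objects with identical content, as separate
# resource fragments may hold.
pdf_a = bytes(bytearray(b"%PDF-1.7 example content"))
pdf_b = bytes(bytearray(b"%PDF-1.7 example content"))

# Simulate one lookup per page reference of a 336-page book.
for _ in range(336):
    get_cached_reader(pdf_a)
    get_cached_reader(pdf_b)
```

The fingerprint is a trade-off. A full-content hash such as hashlib.sha256 would rule out a collision between two different PDFs of equal length that share their first and last 256 bytes, but it would read the whole file on every lookup.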

2. Configurable DPI for PDF page rendering (currently fixed at 150)

convert_pdf_to_jpeg() is limited to 150 dpi because calibre's page_images() does not accept a resolution argument. 150 dpi is quite low for technical books - small text and ruby annotations become hard to read.

The patch invokes pdftoppm directly (using calibre's bundled poppler binary when available, falling back to page_images() at 150 dpi on failure) and adds a "pdf_page_dpi" conversion option (default 300, clamped to 72-600), e.g.:

ebook-convert book.kfx book.epub --pdf-page-dpi 300

I set the default to 300 in the patch, but keeping 150 as the default would of course preserve existing behavior.
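The option handling can be sketched in isolation as follows. This is a simplified illustration rather than the plugin code: the helper names `clamp_dpi` and `pdftoppm_args` are hypothetical, while the limits (72 to 600, default 300) and the pdftoppm flags are the ones used in the patch.

```python
import os

DEFAULT_DPI = 300
MIN_DPI, MAX_DPI = 72, 600


def clamp_dpi(value, default=DEFAULT_DPI):
    """Coerce an option value to an integer DPI within [MIN_DPI, MAX_DPI].

    Missing, zero, or unparsable values fall back to the default.
    """
    try:
        dpi = int(value or default)
    except (TypeError, ValueError):
        dpi = default
    return max(MIN_DPI, min(MAX_DPI, dpi))


def pdftoppm_args(exe, pdf_file, out_dir, page_num, dpi):
    """Build the pdftoppm argument list that renders one page as a JPEG."""
    return [
        exe, "-jpeg", "-r", str(dpi), "-cropbox",
        "-f", str(page_num), "-l", str(page_num),
        pdf_file, os.path.join(out_dir, "page"),
    ]


# Example: an out-of-range request is clamped before the command is built.
args = pdftoppm_args("pdftoppm", "book.pdf", "out", 12, clamp_dpi(1200))
```

Clamping before building the command keeps an accidental value such as 6000 from producing extremely large page images.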

Verified on Windows (calibre 9.8) and Linux with two Japanese print replica books.

Happy to adjust anything if you'd prefer a different approach. Thanks again!

Code:

--- a/kfxlib/resources.py
+++ b/kfxlib/resources.py
@@ -314,17 +314,62 @@
     return outfile.getvalue()
 
 
-def convert_pdf_to_jpeg(pdf_data, page_num, dpi=150, reported_errors=None):
-    pdf_file = temp_filename("pdf", pdf_data)
-    jpeg_dir = create_temp_dir()
+PDF_PAGE_DPI = 150      # default; overridden by the KFX Input "pdf_page_dpi" conversion option
+
+
+def find_pdftoppm():
+    """Locate the pdftoppm executable (calibre's bundled poppler, or the system PATH)."""
+    candidates = []
 
     if calibre_numeric_version is not None:
+        try:
+            from calibre.ebooks.pdf.pdftohtml import PDFTOHTML
+            base = os.path.dirname(PDFTOHTML)
+            candidates.append(os.path.join(base, "pdftoppm.exe"))
+            candidates.append(os.path.join(base, "pdftoppm"))
+        except Exception:
+            pass
+
+    for candidate in candidates:
+        if os.path.exists(candidate):
+            return candidate
+
+    return "pdftoppm"      # rely on the system PATH
+
+
+def convert_pdf_to_jpeg(pdf_data, page_num, dpi=None, reported_errors=None):
+    if dpi is None:
+        dpi = PDF_PAGE_DPI
+
+    pdf_file = temp_filename("pdf", pdf_data)
+    jpeg_dir = create_temp_dir()
 
-        if dpi != 150:
-            raise Exception("calibre PDF page_images supports only default 150dpi")
+    rendered = False
+    try:
+        import subprocess
+        args = [
+            find_pdftoppm(), "-jpeg", "-r", str(dpi), "-cropbox",
+            "-f", str(page_num), "-l", str(page_num),
+            pdf_file, os.path.join(jpeg_dir, "page")]
+
+        kwargs = {}
+        if os.name == "nt":
+            kwargs["creationflags"] = 0x08000000    # CREATE_NO_WINDOW
+
+        subprocess.run(args, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, timeout=120, **kwargs)
+        rendered = True
+    except Exception as e:
+        if reported_errors is not None and "pdftoppm_direct" not in reported_errors:
+            reported_errors.add("pdftoppm_direct")
+            log.warning("Direct pdftoppm rendering failed (%s), falling back to calibre page_images at 150dpi" % repr(e))
 
+    if not rendered and calibre_numeric_version is not None:
         from calibre.ebooks.metadata.pdf import page_images
         page_images(pdf_file, jpeg_dir, first=page_num, last=page_num)
+        rendered = True
+
+    if not rendered:
+        raise Exception("No PDF rendering method available (pdftoppm not found)")
 
     for dirpath, dirnames, filenames in os.walk(jpeg_dir):
         if len(filenames) != 1:
@@ -495,9 +540,25 @@
     return best_raw_media, best_quality
 
 
+_pdf_reader_cache = {}      # content fingerprint -> PdfReader
+
+
+def get_cached_pdf_reader(pdf_data):
+    """Cache PdfReader per PDF content. Avoids O(n^2) re-parsing and page-tree
+    re-flattening when a book references hundreds of pages of the same embedded
+    PDF (print replica). Keyed by a content fingerprint since each resource may
+    hold a distinct bytes object with identical content."""
+    key = (len(pdf_data), bytes(pdf_data[:256]), bytes(pdf_data[-256:]))
+    reader = _pdf_reader_cache.get(key)
+    if reader is None:
+        reader = pypdf.PdfReader(io.BytesIO(pdf_data))
+        len(reader.pages)       # force one-time page tree flatten while caching
+        _pdf_reader_cache[key] = reader
+    return reader
+
+
 def get_pdf_page_size(pdf_data, resource_name, page_num):
-    raw_media_file = io.BytesIO(pdf_data)
-    pdf = pypdf.PdfReader(raw_media_file)
+    pdf = get_cached_pdf_reader(pdf_data)
     page = pdf.pages[page_num - 1]
 
     if page.user_unit != 1:
--- a/__init__.py
+++ b/__init__.py
@@ -25,7 +25,7 @@
     name = "KFX Input"
     author = "jhowell"
     file_types = {"azw8", "kfx", "kfx-zip", "kpf"}
-    version = (2, 25, 0)
+    version = (2, 25, 1)    # custom build: configurable PDF page DPI + PdfReader cache
     minimum_calibre_version = (5, 0, 0)     # Python 3.8.5
     supported_platforms = ["windows", "osx", "linux"]
     description = "Convert from Amazon KFX format"
@@ -36,6 +36,11 @@
             help="Allow conversion to proceed even if the KFX book contains unexpected or incorrect data "
             "that may not convert properly. If this option is selected it is recommend that the log of each "
             "conversion be checked for error messages."),
+        OptionRecommendation(
+            name="pdf_page_dpi", recommended_value=300,
+            help="Resolution (DPI) used to render embedded PDF pages of print replica books as images "
+            "during conversion. Higher values produce sharper text at the cost of larger output files. "
+            "Default is 300. (The original plugin used a fixed 150 dpi.)"),
     }
 
     recommendations = EPUBInput.recommendations
@@ -93,13 +98,23 @@
             job_log = set_logger(JobLog(log))
             job_log.info("Converting %s" % name_of_file(stream))
 
+            from calibre_plugins.kfx_input.kfxlib import resources as kfx_resources
+            try:
+                pdf_page_dpi = int(getattr(options, "pdf_page_dpi", 300) or 300)
+            except (TypeError, ValueError):
+                pdf_page_dpi = 300
+            kfx_resources.PDF_PAGE_DPI = max(72, min(600, pdf_page_dpi))
+            job_log.info("PDF page rendering resolution: %d dpi" % kfx_resources.PDF_PAGE_DPI)
+
             book = YJ_Book(stream, symbol_catalog_filename=get_symbol_catalog_filename())
             book.decode_book(retain_yj_locals=True)
 
             if book.has_pdf_resource:
                 job_log.warning(
-                    "This book contains PDF content. It can be extracted using either the From KFX user interface "
-                    "plugin or the KFX Input plugin CLI. See the KFX Input plugin documentation for more information.")
+                    "This book contains PDF content. Its pages will be rendered as %d dpi images for this "
+                    "conversion. To obtain the original PDF without quality loss use either the From KFX user "
+                    "interface plugin or the KFX Input plugin CLI instead. See the KFX Input plugin "
+                    "documentation for more information." % kfx_resources.PDF_PAGE_DPI)
 
             if book.is_fixed_layout or book.is_magazine:
                 job_log.warning(

JSWolf · 07-02-2026, 07:06 PM

You should be doing your testing with the latest version of calibre.

jhowell · Yesterday, 08:36 AM

Quote:

Originally Posted by utakata

While converting Japanese print replica books (PDF-backed fixed layout KFX), I ran into two issues and would like to propose fixes for both. A patch against v2.25.0 is included below.

I will take a look at your patches and incorporate something similar in the next release of the plugin. (I am working on other projects, so it may not happen immediately.)

Update:

There have been a lot of changes to the plugin since version 2.25.0, which your patch is based on.

In cases where a PDF page is composed of just an image, that image will be extracted with no loss of resolution. Otherwise, the DPI for PDF pages that need to be rendered as images has already been changed from 150 to 300. I believe that those existing changes should be sufficient. If you need a higher DPI, you can change the value of PDF_TO_IMAGE_DPI in resources.py to whatever you want.

I still need to look into the caching issue that you raised.
