A description of a PDF's structure

chikamichi · 12-13-2015, 10:51 AM

Hi!

I'd like to know about a generated PDF's HTML structure. Wild guessing allowed me to find that the inner content of a page is wrapped in a tag with class .page, but that's pretty much about it.

Is there a resource showing the big picture of a generated PDF's structure?

As a matter of fact, what I'm trying to do is, through CSS, assign a background to the full extent of my PDF's pages (much like what happens for a cover page). I tried to assign a background-color to body, but it doesn't work. I was thinking maybe Calibre wraps the entire page inside a tag with a specific classname I could leverage.

Thank you!

kovidgoyal · 12-13-2015, 10:53 AM

calibre uses pdftohtml from poppler to generate HTML from PDF files. As far as I know, the only structure present in such files is an empty anchor with the page number at the start of every page.

chikamichi · 12-13-2015, 11:09 AM

Thank you, that's a good lead, but actually, my process is the other way around.

I have an HTML input (which I can't share for now for it's generated on the fly as a temporary file right before passing it down to ebook-convert; I need to find a way to snapshot it before it gets removed from the hard-drive) and that HTML is converted into a PDF thanks to ebook-convert.

I have access to the raw HTML templates eventually combined into one fully-fledged document fed to ebook-convert, though, so that might help. Within those templates, I have no ".page" tag, yet that CSS selector is being recognized by the convert tool eventually, for I can apply styles to that classname and they get properly rendered in the generated PDF document. I can't get styles assigned to the CSS selector "body" to be rendered, though. That's why I'm a bit confused about the final structure used by ebook-convert: it doesn't 100% match the HTML templates. Any insights?

kovidgoyal · 12-13-2015, 11:16 AM

http://manual.calibre-ebook.com/conv...l#introduction

chikamichi · 12-13-2015, 11:46 AM

Thank you. Actually, it's because I read on the very page you linked to that "It is important to remember that all the transforms act on the XHTML output by the Input Plugin, not on the input file itself" that I decided to ask my question here.

I would like to know what this XHTML (the one the Input Plugin generates) looks like. The big picture, really: I guess there is an overall, default structure that's going to be used as a layout for a transform. I have not been able to find that piece of information so far. With the process I'm currently bound to, ebook-convert is used internally, so I am unable to use Calibre's debug mode to inspect the output from the Input Plugin itself while the transform is happening.

eschwartz · 12-13-2015, 12:03 PM

Did you try using the debug option to inspect the state of the XHTML at each stage of the conversion?

chikamichi · 12-13-2015, 12:08 PM

Quote:

Originally Posted by eschwartz

Did you try using the debug option to inspect the state of the XHTML at each stage of the conversion?

Hi. As I said, "With the process I'm currently bound to, ebook-convert is used internally, so I am unable to use Calibre's debug mode to inspect the output from the Input Plugin itself while the transform is happening." What I meant by "internally" is that I have no control whatsoever over ebook-convert: I can't tweak the options or anything like that to activate a debug/verbose mode, for instance.

I was hoping there would be some kind of "default template" from a "default recipe" used by the Input Plugin to do the conversion, that I could peek at to get a sense of what the HTML is likely to look like at this stage of the process.

eschwartz · 12-13-2015, 12:15 PM

Huh.

Why can't you control the options to ebook-convert?

Are you using some sort of precompiled binary with an arbitrarily-canned cmdline which you don't have the source to?

Well, go ahead and write a batch/bash script that comes earlier in the PATH than ebook-convert, and your current process will find it instead. Make it add the debug flag to the cmdline and pass it on to the real ebook-convert.
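For illustration only, the wrapper idea above might be sketched in Python instead of batch/bash. Everything named here is an assumption: the real binary's location, the `REAL_EBOOK_CONVERT` and `EBOOK_DEBUG_DIR` environment variables, and the debug directory are hypothetical placeholders, and the flag used is the `--debug-pipeline` option named later in this thread.

```python
import os
import sys

# Assumption: where the real ebook-convert lives; adjust for your system.
REAL_EBOOK_CONVERT = os.environ.get("REAL_EBOOK_CONVERT", "/usr/bin/ebook-convert")
# Assumption: a writable directory that should receive the debug output.
DEBUG_DIR = os.environ.get("EBOOK_DEBUG_DIR", "/tmp/ebook-convert-debug")


def build_command(argv, real_binary=REAL_EBOOK_CONVERT, debug_dir=DEBUG_DIR):
    """Return the real command line, adding the debug flag if it is missing."""
    args = list(argv)
    already_set = any(
        a == "--debug-pipeline" or a.startswith("--debug-pipeline=") for a in args
    )
    if not already_set:
        args += ["--debug-pipeline", debug_dir]
    return [real_binary] + args


def main():
    """Hand over to the real ebook-convert with the extended command line."""
    cmd = build_command(sys.argv[1:])
    # os.execv replaces this process, so the caller sees the real exit status.
    os.execv(cmd[0], cmd)
```

To use it as a wrapper, the file would be saved as an executable named `ebook-convert` in a directory that comes earlier in the PATH than the real one, with a final line calling `main()`. This only works if the calling tool looks the binary up through the PATH.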

chikamichi · 12-13-2015, 12:18 PM

Quote:

Why can't you control the options to ebook-convert?

I have no ownership of the tool leveraging ebook-convert, I simply know it uses it to transform an HTML document into a PDF one.

I like your idea about trumping ebook-convert, thx.

eschwartz · 12-13-2015, 12:23 PM

Ah, so precompiled binary with an arbitrarily canned cmdline then.

You're welcome.

Let's just hope they don't hardcode the binary location!

chikamichi · 12-13-2015, 12:38 PM

Alright, after further inspection, I'm pretty sure the Input Plugin being leveraged is https://github.com/kovidgoyal/calibr.../html_input.py

I also know the HTML document fed to that plugin is valid.

And I was also able to activate some debugging from ebook-convert!

Quote:

debug: InputFormatPlugin: HTML Input running
on /tmp/tmp-82696wkvncw/SUMMARY.html
debug: Creator not specified
debug: Building file list...
debug: Normalizing filename cases
debug: Rewriting HTML links
debug: 34% Running transforms on the e-book…
debug: Merging user specified metadata...
debug: Detecting structure...
debug: Detected chapter: Introduction
debug: Detected chapter: Quick-start
debug: Auto generated TOC with 2 entries.
debug: Flattening CSS and remapping font sizes...
debug: Source base font size is 25.92000pt
debug: Removing fake margins...
debug: Cleaning up manifest...
Trimming unused files from manifest...
debug: Creating PDF Output...
debug: 67% Running the PDF Output plugin
debug: libpng warning: iCCP: Not recognizing known sRGB profile that has been edited
debug: Bottom margin is too small for footer, increasing it to 18.0pts
debug: 78% Rendered SUMMARY.html
debug: 89% Rendered index.html
debug: 100% Rendered quick-start.html
debug: Rendered PDF in 0.571188 seconds:
debug: PDF output written to /tmp/tmp-82696wkvncw/index.pdf
debug: Output saved to /tmp/tmp-82696wkvncw/index.pdf

Too bad I never get a chance to have a look at /tmp/tmp-82696wkvncw/SUMMARY.html before the conversion process ends, for it gets destroyed: I would then be able to actually see what is being fed to the Input Plugin.
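One possible way around this, sketched under assumptions: if a wrapper script sits in front of ebook-convert, it could copy the input file somewhere safe before the conversion starts. The helper below is hypothetical (the function name and the snapshot directory are mine, not part of calibre), and it assumes the input file is the first positional argument, as in `ebook-convert input output`.

```python
import shutil
import time
from pathlib import Path


def snapshot_input(argv, snapshot_dir):
    """Copy the input file (first argument) into snapshot_dir.

    Returns the path of the copy, or None if there was nothing to copy.
    """
    if not argv:
        return None
    source = Path(argv[0])
    if not source.is_file():
        return None
    target_dir = Path(snapshot_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Prefix with a timestamp so repeated runs do not overwrite each other.
    target = target_dir / f"{int(time.time())}-{source.name}"
    shutil.copy2(source, target)
    return target
```

Note that this copies only the single HTML file, not the other files it links to (stylesheets, images, sibling pages), so the snapshot shows the markup but may not render on its own.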

What I was unable to discover as well, from inspecting Calibre's codebase, is how this HTML document is going to be transformed into Calibre's internal XHTML for further processing/conversion.

eschwartz · 12-13-2015, 12:41 PM

Looking at the actual code which does the work is indeed another way to figure out what it is doing.

http://manual.calibre-ebook.com/deve...ml#code-layout

And take a look at:
src/calibre/ebooks/conversion/plumber.py
src/calibre/ebooks/oeb/transforms/*

chikamichi · 12-13-2015, 12:44 PM

Thank you, that is indeed what I'm doing right now.

I might find a hint, it's worth trying.

chikamichi · 12-13-2015, 12:48 PM

By the way, I am now using the "--log=debug --debug" flags with ebook-convert. I thought the "--debug" flag would be responsible for enabling the debug output, creating input/, parsed/, structure/ and processed/ directories, but I have not been able to find them. I may simply not be looking at the correct location. Is there a way to enforce the output location from the command line as well?

eschwartz · 12-13-2015, 12:58 PM

--debug-pipeline /path/to/output-directory/

I assume --log would save the log made up of the status messages produced during the conversion.
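Once a conversion has run with `--debug-pipeline`, a small script can show which stages actually left output behind. This is only a sketch: the stage names are the ones mentioned earlier in this thread (input, parsed, structure, processed), and the function name is hypothetical.

```python
from pathlib import Path

# Stage directory names as listed earlier in this thread.
STAGES = ("input", "parsed", "structure", "processed")


def list_stage_files(debug_dir):
    """Map each stage directory that exists to the sorted files found under it."""
    root = Path(debug_dir)
    found = {}
    for stage in STAGES:
        stage_dir = root / stage
        if stage_dir.is_dir():
            found[stage] = sorted(
                p.relative_to(stage_dir).as_posix()
                for p in stage_dir.rglob("*")
                if p.is_file()
            )
    return found
```

Calling `list_stage_files("/path/to/output-directory/")` returns a dictionary keyed by stage name; a stage missing from the result means its directory was not created at that location.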

