12-13-2015, 10:51 AM | #1 |
Member
Posts: 17
Karma: 10
Join Date: Dec 2015
Device: none
|
A description of a PDF's structure
Hi!
I'd like to know about a generated PDF's HTML structure. Wild guessing allowed me to find that the inner content of a page is wrapped in a tag with class .page, but that's about it. Is there a resource showing the big picture of a generated PDF's structure? What I'm actually trying to do is, through CSS, assign a background to the full extent of my PDF's pages (much like what happens for a cover page). I tried to assign a background-color to body, but it doesn't work. I was thinking maybe Calibre wraps the entire page inside a tag with a specific classname I could leverage. Thank you! |
12-13-2015, 10:53 AM | #2 |
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
calibre uses pdftohtml from poppler to generate HTML from PDF files. As far as I know, the only structure present in such files is an empty anchor with the page number at the start of every page.
|
12-13-2015, 11:09 AM | #3 |
Member
Posts: 17
Karma: 10
Join Date: Dec 2015
Device: none
|
Thank you, that's a good lead, but actually my process is the other way around: I have an HTML input (which I can't share for now, because it's generated on the fly as a temporary file right before being passed to ebook-convert; I need to find a way to snapshot it before it gets removed from the hard drive), and that HTML is converted into a PDF by ebook-convert.
I have access to the raw HTML templates eventually combined into one fully-fledged document fed to ebook-convert, though, so that might help. Within those templates, I have no ".page" tag, yet that CSS selector is being recognized by the convert tool eventually, for I can apply styles to that classname and they get properly rendered in the generated PDF document. I can't get styles assigned to the CSS selector "body" to be rendered, though. That's why I'm a bit confused about the final structure used by ebook-convert: it doesn't 100% match the HTML templates. Any insights? Last edited by chikamichi; 12-13-2015 at 11:15 AM. |
12-13-2015, 11:16 AM | #4 |
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
|
12-13-2015, 11:46 AM | #5 |
Member
Posts: 17
Karma: 10
Join Date: Dec 2015
Device: none
|
Thank you. Actually, it's because I read on the very page you linked to that "It is important to remember that all the transforms act on the XHTML output by the Input Plugin, not on the input file itself" that I decided to ask my question here.
I would like to know what this XHTML (generated by the Input Plugin) looks like. The big picture, really: I guess there is an overall default structure that's used as a layout for a transform. I have not been able to find that piece of information so far. With the process I'm currently bound to, ebook-convert is used internally, so I am unable to use Calibre's debug mode to inspect the output of the Input Plugin while the transform is happening. Last edited by chikamichi; 12-13-2015 at 12:01 PM. |
12-13-2015, 12:03 PM | #6 |
Ex-Helpdesk Junkie
Posts: 19,422
Karma: 85397180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
|
Did you try using the debug option to inspect the state of the XHTML at each stage of the conversion?
|
12-13-2015, 12:08 PM | #7 | |
Member
Posts: 17
Karma: 10
Join Date: Dec 2015
Device: none
|
Quote:
I was hoping there would be some kind of "default template" from a "default recipe" used by the Input Plugin to do the conversion, one that I could peek at to get a sense of what the HTML is likely to, well, look like at this stage of the process. Last edited by chikamichi; 12-13-2015 at 12:12 PM. |
|
12-13-2015, 12:15 PM | #8 |
Ex-Helpdesk Junkie
Posts: 19,422
Karma: 85397180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
|
Huh.
Why can't you control the options to ebook-convert? Are you using some sort of precompiled binary with an arbitrarily-canned cmdline which you don't have the source to? Well, go ahead and write a batch/bash script that comes earlier in the PATH than ebook-convert, and your current process will find it instead. Make it add the debug flag to the cmdline and pass it on to the real ebook-convert. Last edited by eschwartz; 12-13-2015 at 12:18 PM. |
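The wrapper idea above can be sketched in Python rather than batch/bash. This is only a rough sketch under stated assumptions: `REAL_EBOOK_CONVERT` and `DEBUG_DIR` are hypothetical paths you would need to adjust, and `build_command` and `main` are illustrative helper names. The only calibre-specific piece is the `--debug-pipeline` option of ebook-convert. The real binary is addressed by absolute path so the wrapper does not end up calling itself through PATH.

```python
#!/usr/bin/env python3
"""Hypothetical wrapper meant to sit earlier in PATH than the real ebook-convert."""
import os
import sys

# Assumption: absolute location of the real binary on your system.
REAL_EBOOK_CONVERT = "/usr/bin/ebook-convert"
# Assumption: directory where the debug output should be written.
DEBUG_DIR = "/tmp/calibre-debug"


def build_command(args, real_binary=REAL_EBOOK_CONVERT, debug_dir=DEBUG_DIR):
    """Return the argv for the real binary, adding --debug-pipeline if absent."""
    command = [real_binary, *args]
    already_set = any(
        a == "--debug-pipeline" or a.startswith("--debug-pipeline=")
        for a in args
    )
    if not already_set:
        # Options go after the input and output file arguments.
        command += ["--debug-pipeline", debug_dir]
    return command


def main(argv):
    """Replace the current process with the real ebook-convert."""
    command = build_command(argv[1:])
    os.execv(command[0], command)


# In the actual wrapper script, finish with: main(sys.argv)
```

Saved as an executable file named ebook-convert in a directory that precedes the real one in PATH, this would forward every argument unchanged and only add the debug option when the caller did not already pass it. As noted further down the thread, it will not help if the calling tool hardcodes the binary location.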
12-13-2015, 12:18 PM | #9 | |
Member
Posts: 17
Karma: 10
Join Date: Dec 2015
Device: none
|
Quote:
I like your idea about trumping ebook-convert, thx. |
|
12-13-2015, 12:23 PM | #10 |
Ex-Helpdesk Junkie
Posts: 19,422
Karma: 85397180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
|
Ah, so precompiled binary with an arbitrarily canned cmdline then. You're welcome.
Let's just hope they don't hardcode the binary location! |
12-13-2015, 12:38 PM | #11 | |
Member
Posts: 17
Karma: 10
Join Date: Dec 2015
Device: none
|
Alright, with further inspection, I'm pretty sure the Input Plugin being leveraged is https://github.com/kovidgoyal/calibr.../html_input.py
I also know the HTML document fed to that plugin is valid. And I was also able to activate some debugging from ebook-convert! Quote:
What I have also been unable to discover from inspecting Calibre's codebase is how this HTML document gets transformed into Calibre's internal XHTML for further processing/conversion. |
|
12-13-2015, 12:41 PM | #12 |
Ex-Helpdesk Junkie
Posts: 19,422
Karma: 85397180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
|
Looking at the actual code which does the work is indeed another way to figure out what it is doing.
http://manual.calibre-ebook.com/deve...ml#code-layout
And take a look at:
src/calibre/ebooks/conversion/plumber.py
src/calibre/ebooks/oeb/transforms/*
Last edited by eschwartz; 12-13-2015 at 12:45 PM. |
12-13-2015, 12:44 PM | #13 |
Member
Posts: 17
Karma: 10
Join Date: Dec 2015
Device: none
|
Thank you, that is indeed what I'm doing right now. I might find a hint; it's worth trying.
|
12-13-2015, 12:48 PM | #14 |
Member
Posts: 17
Karma: 10
Join Date: Dec 2015
Device: none
|
By the way, I am now using the "--log=debug --debug" flags with ebook-convert. I thought the "--debug" flag would be responsible for enabling the debug output, creating input/, parsed/, structure/ and processed/ directories, but I have not been able to find them. I may simply not be looking at the correct location. Is there a way to enforce the output location from the command line as well?
|
12-13-2015, 12:58 PM | #15 |
Ex-Helpdesk Junkie
Posts: 19,422
Karma: 85397180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
|
--debug-pipeline /path/to/output-directory/
The --log would, I assume, save the log as created by status messages during the conversion. |
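Once a debug directory has been produced, a small script can show which stages were written and which HTML files they contain. This is a minimal sketch, assuming the stage directory names mentioned earlier in the thread (input/, parsed/, structure/, processed/); `list_stage_files` is a hypothetical helper, not part of calibre, and the actual layout inside each stage directory may differ between calibre versions.

```python
from pathlib import Path

# Stage directory names as mentioned earlier in the thread.
STAGES = ("input", "parsed", "structure", "processed")
# File extensions treated as HTML for the purpose of this listing.
HTML_SUFFIXES = {".html", ".htm", ".xhtml"}


def list_stage_files(debug_dir):
    """Map each stage directory that exists to its sorted HTML/XHTML files."""
    root = Path(debug_dir)
    found = {}
    for stage in STAGES:
        stage_dir = root / stage
        if not stage_dir.is_dir():
            continue  # this stage was not written, skip it
        found[stage] = sorted(
            p.relative_to(root).as_posix()
            for p in stage_dir.rglob("*")
            if p.is_file() and p.suffix.lower() in HTML_SUFFIXES
        )
    return found
```

Calling `list_stage_files("/path/to/output-directory/")` returns a dictionary keyed by stage name, which makes it easy to open the same file at two stages and compare what the conversion changed, for instance whether a wrapper with class .page has appeared.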
Tags |
css, html, pdf, structure |
|