Old 05-12-2012, 07:01 AM   #12
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
The existing heuristics live primarily in calibre/ebooks/conversion/utils.py, though Kovid is correct in the sense that they're mostly called from preprocess.py (and you'll need to touch a handful of other files to add the option to the conversion pipeline). I would say there are two ways to solve your problem:

Contribute to the next gen pdf engine
This is the preferred solution, in the sense that the new engine should convert many more types of PDF formatting accurately, and better screenplay formatting would get a free ride.

Add heuristics to try to format for screenplays
The existing heuristics are primarily regex based, and you could certainly add regexes/patterns for screenplays to a new heuristics option which tries to match the various patterns of a screenplay and insert the appropriate CSS. The way heuristics works today, you'd need to insert all your styles inline - later in the conversion pipeline Calibre would convert those inline styles to CSS. The "replace nbsp indents" and "format scene break" options both insert formatting along the lines of what I'm talking about.

The reason this option is less desirable, though, is that generalized rules like these are hard to ever get perfect. Note that perfection wasn't the original goal of heuristics - it was designed to basically take in garbage from a variety of formats and make it somewhat less trashy and potentially worth salvaging by hand.
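As a rough sketch of what such a regex-based screenplay heuristic might look like: the patterns, the function name and the styles below are all my own assumptions for illustration, not anything taken from utils.py, and it assumes the input has already been reduced to plain <p> paragraphs.

```python
import re

# Hypothetical patterns, not part of calibre: a scene heading starts with
# INT. or EXT., and a character cue is a short all-caps paragraph.
SCENE_HEADING = re.compile(r"<p>(\s*(?:INT|EXT)\.[^<]*)</p>")
CHARACTER_CUE = re.compile(r"<p>(\s*[A-Z][A-Z .'-]{1,30})</p>")


def markup_screenplay(html):
    """Tag scene headings and character cues with inline styles."""
    # Scene headings go first. Once a paragraph carries a style attribute
    # it no longer matches the bare <p> in the character cue pattern.
    html = SCENE_HEADING.sub(
        r'<p style="font-weight: bold; margin-top: 1.5em">\1</p>', html)
    html = CHARACTER_CUE.sub(
        r'<p style="text-align: center; margin-bottom: 0">\1</p>', html)
    return html
```

Everything is inserted as inline styles, on the expectation described above that a later stage of the pipeline turns those into CSS. Real screenplays will break simple patterns like these (cues such as "JOHN (V.O.)", headings without INT./EXT.), which is exactly the generalization problem mentioned above.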

Edit - reading through your text I see one big problem for your heuristic approach - you're assuming PDFs have blank lines - they don't. They have 'start text at xyz coordinate'. Blank lines aren't a part of that deal.

In terms of indentation level, that data is also gone by the time it gets to heuristics. However, I have seen many PDFs where the pdftohtml function Calibre uses preserves indentation information as runs of multiple non-breaking spaces. These are currently removed early in the conversion pipeline (in preprocess.py for PDF), as they're troublesome to work with in the rest of the pipeline and not needed for a typical book. You could preserve them when a user has enabled the screenplay heuristic - you'd want to convert them to inline styles with a left margin based on the number of spaces.
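To illustrate the nbsp-to-margin idea, here is a minimal sketch. It assumes the indent shows up as a run of &nbsp; entities (or literal U+00A0 characters) right after a bare <p> tag; the function name and the half-em-per-space ratio are made up for the example, not something Calibre defines.

```python
import re

NBSP = '\u00a0'
# Assumption: each non-breaking space is worth about half an em of indent.
EM_PER_SPACE = 0.5

# A run of &nbsp; entities or literal non-breaking spaces right after <p>.
LEADING_NBSP = re.compile('<p>((?:&nbsp;|\u00a0)+)')


def nbsp_to_margin(html):
    """Replace leading nbsp runs with an inline left margin."""
    def repl(match):
        # Normalize entities to single characters so len() counts spaces.
        run = match.group(1).replace('&nbsp;', NBSP)
        margin = len(run) * EM_PER_SPACE
        return '<p style="margin-left: %.1fem">' % margin
    return LEADING_NBSP.sub(repl, html)
```

For example, a paragraph that starts with four non-breaking spaces comes out with a 2.0em left margin and the spaces removed. The ratio would need tuning against real pdftohtml output, since the number of spaces per indent level varies from one PDF to another.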

Last edited by ldolse; 05-12-2012 at 07:23 AM.