Old 10-09-2011, 07:15 AM   #1
joesh
Junior Member
joesh began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Oct 2011
Location: Seattle WA
Device: Nook
Want to add Screenplay format knowledge to Calibre

Hi Folks
I'm new to Calibre and so far am impressed with all that it can do. I'd like to read screenplays on my eReader (a Nook) and have found several threads talking about this but no good solutions that preserve the simple but important formatting.

I'm interested in adding some knowledge of screenplay formatting rules to Calibre so it can carry forward the important formatting bits into converted docs.

I've read http://manual.calibre-ebook.com/develop.html and it suggests that I post here for both help in getting up to speed on the codebase and for advice on how to approach a problem and where in the code it should go.

A bit more about what I'm trying to do: first, I hear you when you say PDF is a poor source format. Yet PDF is the most likely format in which a screenplay exists. A far second is formatted text.

The important bits of screenplay format are pretty much just:


Code:
ONE UPCASED LINE ABOUT THE SCENE SETTING

An arbitrary amount of text describing setting or action. Often mentions a character like PRODUCER whose dialog will appear below:

                                PRODUCER
                           (wringing his hands)
              Why is it so hard to get my screenplays to look good
              on my new eReader?!
That's the meat of it. Can you help? Thanks!
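
The layout rules above can be sketched as a small line classifier. This is only an illustration, not anything that exists in Calibre: the function name `classify_line` and the rule "any indented line is part of a dialog block" are assumptions, and real screenplay text would need looser rules.

```python
def classify_line(line: str) -> str:
    """Guess which screenplay element one line of layout-preserved text is."""
    text = line.strip()
    if not text:
        return "blank"
    indent = len(line) - len(line.lstrip(" "))
    if indent == 0:
        # Flush-left text: an all-caps line is a scene heading, the rest is action.
        return "scene_heading" if text == text.upper() else "action"
    # Indented text belongs to a dialog block (assumption for this sketch).
    if text.startswith("(") and text.endswith(")"):
        return "parenthetical"
    if text == text.upper():
        return "character"
    return "dialogue"


sample = [
    "ONE UPCASED LINE ABOUT THE SCENE SETTING",
    " " * 32 + "PRODUCER",
    " " * 27 + "(wringing his hands)",
    " " * 14 + "on my new eReader?!",
]
for row in sample:
    print(classify_line(row), "|", row.strip())
```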
Old 10-09-2011, 08:17 AM   #2
DoctorOhh
US Navy, Retired
DoctorOhh ought to be getting tired of karma fortunes by now.
Posts: 8,889
Karma: 12755553
Join Date: Feb 2009
Location: North Carolina
Device: Nexus 7
Quote:
Originally Posted by joesh View Post
That's the meat of it. Can you help? Thanks!
Just a thought, if you have specific questions, you might want to actually ask them.
Old 10-09-2011, 04:11 PM   #3
joesh
Junior Member
joesh began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Oct 2011
Location: Seattle WA
Device: Nook
OBJECTIVE: To add screenplay formatting recognition to Calibre

QUESTION: Is there a generalized way to add formats other than a chapter book? Right now it appears that there's one case that's in the UI via the "Structure Detection" and "Heuristic Processing" panels.

QUESTION: Where in the code should I try to add this?

QUESTION: What's the best way for a Calibre newbie to get up-to-speed on the code so I don't do things "the wrong way"?
Old 10-09-2011, 07:18 PM   #4
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 26,344
Karma: 5382313
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
There are two places in the code you should look at: the PDF input plugin (ebooks/pdf/input.py) and the heuristic processing code, which IIRC is in ebooks/oeb/preprocess.py
Old 10-09-2011, 10:51 PM   #5
joesh
Junior Member
joesh began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Oct 2011
Location: Seattle WA
Device: Nook
Thanks! That should help me a bunch.

Still curious - is it a design decision to not have user-selectable heuristics/modules for recognizing different formatting conventions other than chapter books? Or is it that the need hasn't yet been strong enough?
Joe
Old 10-09-2011, 10:55 PM   #6
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 26,344
Karma: 5382313
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
There has been no need/desire to maintain special purpose code amongst calibre developers.
Old 05-08-2012, 02:43 PM   #7
mcforman
Member
mcforman began at the beginning.
 
Posts: 10
Karma: 10
Join Date: Sep 2010
Device: Kindle DX
I have the same need. Did you ever get a satisfactory answer? Could really use it myself!
Old 05-08-2012, 02:50 PM   #8
mcforman
Member
mcforman began at the beginning.
 
Posts: 10
Karma: 10
Join Date: Sep 2010
Device: Kindle DX
Kovid -

I'm a TV producer and can really tell you this is a desire/need. TV/Film screenplays follow a basic format. So, even though they are pdf (which I know you stated is evil), are there any "presets" you can suggest in the calibre conversion process that will work best? Calibre does a great job on its own, but the line spacing is problematic in that the ideal output should look like this when finished:


FADE IN:

A RIVER.

We're underwater, watching a fat catfish swim along.

This is The Beast.

EDWARD (V.O.)
There are some fish that cannot be caught. It's not that they're faster or stronger than other fish. They're just touched by something extra. Call it luck. Call it grace. One such fish was The Beast.

The Beast's journey takes it past a dangling fish hook, baited with worms. Past a tempting lure, sparkling in the sun. Past a swiping bear claw. The Beast isn't worried.

EDWARD (V.O.)(CONT'D)
By the time I was born, he was already a legend. He'd taken more hundred-dollar lures than any fish in Alabama. Some said that fish was the ghost of Henry Walls, a thief who'd drowned in that river 60 years before. Others claimed he was a lesser dinosaur, left over from the Cretaceous period.

INT. WILL'S BEDROOM - NIGHT (1973)

WILL BLOOM, AGE 3, listens wide-eyed as his father EDWARD BLOOM, 40's and handsome, tells the story. In every gesture, Edward is bigger than life, describing each detail with absolute conviction.

EDWARD
I didn't put any stock into such speculation or superstition. All I knew was I'd been trying to catch that fish since I was a boy no bigger than you.
(closer)
And on the day you were born, that was the day I finally caught him.

EXT. CAMPFIRE - NIGHT (1977)

A few years later, and Will sits with the other INDIAN GUIDES as Edward continues telling the story to the tribe.

EDWARD
Now, I'd tried everything on it: worms, lures, peanut butter, peanut butter-and-cheese. But on that day I had a revelation: if that fish was the ghost of a thief, the usual bait wasn't going to work. I would have to use something he truly desired.

Edward points to his wedding band, glinting in the firelight.

LITTLE BRAVE
(confused)
Your finger?

ANY HELP would be GREATLY appreciated!!!!!
Old 05-08-2012, 04:03 PM   #9
dwig
Guru
dwig ought to be getting tired of karma fortunes by now.
Posts: 984
Karma: 1382338
Join Date: Dec 2004
Location: Paradise (Key West, FL)
Device: Current:Dell Venue 8 Pro, Kindle 3/WiFi - Retired:Clie UX50, T415, ...
Quote:
Originally Posted by joesh View Post
...I hear you when you say PDF is a poor source format. Yet PDF is the most likely format in which a screenplay exists. ...
The fact that it is a "likely format" to be found as a source for conversion in absolutely no way means that it is feasible, or even possible, to create an automated conversion routine that always works with all examples. Two PDFs that look identical can be vastly different in their internal construction.
Old 05-11-2012, 08:02 AM   #10
joesh
Junior Member
joesh began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Oct 2011
Location: Seattle WA
Device: Nook
Seems to me there are two distinct features being requested here.

1. Can you make Calibre's PDF translation better?
2. Assuming an "acceptably-translated" PDF, can you add a "screenplay" heuristic set that'll be savvy about screenplay format?

I see from responses above and throughout the forums that (1) is a sore subject around here. No problem. PDF is fine input for minds but poor for computers. So let's go to (2).

I've played with feeding the current PDF parser a bunch of screenplays and I think that what it generates fits my criteria of an "acceptably-translated" PDF for the heuristics I have in mind.

These heuristics would mainly use indentation to detect structure. A block of text at a given level of indentation would be the unit of reflow. Blank lines would also delimit a block - as well as passing through unaltered.

That's most of it right there. I suspect there would be a few tweaks to this - like parentheticals allowing either same-level or +1 indentation to match - so that
Code:
    (this would
     be one block)
but I think this would do a pretty nice job.
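
A rough sketch of that grouping, assuming the input is layout-preserved plain text (one string per line). The names `indent_of` and `group_blocks` are made up for illustration, and the "+1 indentation" allowance is applied only to blocks that open with a parenthesis:

```python
def indent_of(line: str) -> int:
    """Number of leading spaces on a line."""
    return len(line) - len(line.lstrip(" "))


def group_blocks(lines):
    """Group lines into (indent, text) blocks; blank lines pass through as (None, "")."""
    blocks = []
    indent, parts = None, []
    for line in lines:
        text = line.strip()
        if not text:
            # A blank line closes the current block and is kept as-is.
            if parts:
                blocks.append((indent, " ".join(parts)))
                indent, parts = None, []
            blocks.append((None, ""))
            continue
        cur = indent_of(line)
        same_block = parts and (
            cur == indent
            # Parentheticals may continue at same-level or +1 indentation.
            or (parts[0].startswith("(") and cur == indent + 1)
        )
        if same_block:
            parts.append(text)
        else:
            if parts:
                blocks.append((indent, " ".join(parts)))
            indent, parts = cur, [text]
    if parts:
        blocks.append((indent, " ".join(parts)))
    return blocks
```

Each returned block is then the unit of reflow: the text can be rewrapped freely as long as the block's indent is carried along with it.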

Am I missing something really big?

Last edited by joesh; 05-11-2012 at 08:03 AM. Reason: fix the blockquote
Old 05-11-2012, 08:52 AM   #11
kiwidude
calibre/Sigil Developer
kiwidude ought to be getting tired of karma fortunes by now.
 
Posts: 4,230
Karma: 1345754
Join Date: Oct 2010
Location: London, UK
Device: Kindle Paperwhite 3G, iPad 3, iPad Air
The problem is that you can't just sidestep the PDF issue. It doesn't matter how many times what you guys are asking for gets rephrased...

PDFs have no "structure" such as indentation - many don't even have text, being just images. As I understand it, the various PDF converters attempt to resurrect such indentation and line breaks, and apply heuristics to guess where paragraphs might end and where indentation exists. But as has been repeated over and over, there are certain issues (some particularly in calibre's current PDF converter) that result in text that is corrupted, such as the oft-quoted double-L issue (ligatures) etc.

Adobe themselves, who invented this awful format, can't come up with a tool that can convert it to something more useful. Now if the originator of the format can't do it, what does that tell you? That it completely sucks for anything other than being rendered as a PDF.

So as I posted on the other thread your options are:

(1) Buy a decent sized tablet and open them in a PDF reader so you don't bother converting. That is what I and many others do, particularly for technical books which rely on layout. If you want an e-ink screen, go hunting for a Kindle DX or whatever other models might be out there...

(2) Do the conversion but live with the formatting being trashed. How trashed depends on a variety of factors such as which tool, what settings and how that PDF was authored. There are no magic settings; you might stumble on something that looks "mostly alright" for one PDF and find it doesn't work well with the next one.

(3) Do the conversion but spend many hours making it readable using an html editor.

In my opinion it is a non-starter, but then I've only dabbled around the edges with PDF conversions. Calibre's perpetually on hold "new" PDF engine contains some improvements that might be able to be built on, but until/if it ever gets released you really are pushing the proverbial uphill.

Last edited by kiwidude; 05-11-2012 at 09:06 AM.
Old 05-12-2012, 07:01 AM   #12
ldolse
Wizard
ldolse is an accomplished Snipe hunter.
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
The existing heuristics are primarily living in calibre/ebooks/conversion/utils.py, though Kovid is correct in the sense that they're primarily called from preprocess.py (and you'll need to touch a handful of other files to add the option to the conversion pipeline). I would say there are two ways to solve your problem:

Contribute to the next gen pdf engine
Preferred solution in the sense that the new engine should convert many more types of pdf formatting accurately, and better screenplay formatting would get a free ride.

Add heuristics to try to format for screenplays
The existing heuristics are primarily regex based, and you could certainly add regexes/patterns for screenplays to a new heuristics option which tries to match the various patterns of a screenplay and insert the appropriate CSS. As heuristics stand today, you'd need to insert all your styles inline - later in the conversion pipeline Calibre would convert those inline styles to CSS. The "replace nbsp indents" and "format scene break" options both insert formatting along the lines of what I'm talking about.

The reason this option is less desirable, though, is that generalized rules like these are hard to ever get perfect. Note perfection wasn't the original goal of heuristics - it was designed to basically take in garbage from a variety of formats and make it somewhat less trashy and potentially worth salvaging by hand.
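
As a minimal sketch of what a regex-plus-inline-style heuristic could look like: this is not Calibre's heuristics API. The function name `style_screenplay`, both patterns, and the chosen styles are assumptions, and it presumes each screenplay line already sits in its own bare `<p>`.

```python
import re

# Hypothetical patterns: scene headings start with INT. or EXT.; character cues
# are all-caps names, optionally followed by tags such as (V.O.) or (CONT'D).
SCENE_HEADING = re.compile(r"<p>((?:INT|EXT)\.[^<]*)</p>")
CHARACTER_CUE = re.compile(r"<p>([A-Z][A-Z .'\-]*(?:\([^)<]*\))*)</p>")


def style_screenplay(html: str) -> str:
    """Insert inline styles on paragraphs that look like screenplay elements."""
    # Scene headings go first so the cue pattern cannot claim them afterwards.
    html = SCENE_HEADING.sub(
        r'<p style="font-weight: bold; margin-top: 1em">\1</p>', html
    )
    html = CHARACTER_CUE.sub(
        r'<p style="text-align: center; margin-bottom: 0">\1</p>', html
    )
    return html
```

It also shows the weakness described above: a short all-caps action line such as "A RIVER." would be mistaken for a character cue, so rules like these only get the output part of the way.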

Edit - reading through your text I see one big problem for your heuristic approach - you're assuming pdfs have blank lines - they don't. They have 'start text at xyz coordinate'. Blank lines aren't a part of that deal.

In terms of indentation level, that data is also gone by the time it gets to heuristics. However, I have seen many pdfs with indentation information preserved by the pdftohtml function Calibre uses, through the use of multiple non-breaking spaces. These are currently removed early in the conversion pipeline (in preprocess.py for pdf), as they're troublesome to work with in the rest of the conversion pipeline and not needed for a typical book, but you could preserve them in cases where a user has enabled the screenplay heuristic. You'd want to convert them to inline styles with a left margin based on the number of spaces.
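
A small sketch of that last step, assuming the HTML reaching the heuristic has paragraphs that open with a run of non-breaking spaces. The function name `nbsp_to_margin` and the 0.5em-per-space ratio are made up for illustration, not taken from Calibre:

```python
import re

NBSP = "\u00a0"
# A paragraph that opens with one or more non-breaking spaces, written either
# as the literal character or as the &nbsp; entity.
LEADING_NBSP = re.compile("<p>((?:\u00a0|&nbsp;)+)")


def nbsp_to_margin(html: str, em_per_space: float = 0.5) -> str:
    """Replace leading nbsp runs with an inline left margin of matching size."""
    def repl(match):
        run = match.group(1).replace("&nbsp;", NBSP)
        margin = len(run) * em_per_space
        return f'<p style="margin-left: {margin:g}em">'

    return LEADING_NBSP.sub(repl, html)
```

The ratio would need tuning per source, since the number of spaces pdftohtml emits for a given indent depends on the PDF's fonts and layout.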

Last edited by ldolse; 05-12-2012 at 07:23 AM.
Old 05-17-2012, 05:13 AM   #13
joesh
Junior Member
joesh began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Oct 2011
Location: Seattle WA
Device: Nook
ldolse - thanks for the considered response and education on how Calibre removes in preprocessing much of the formatting I was hoping to use.

As far as blank lines are concerned, certainly PDF doesn't have them but translators like pdftotext do create them in the text output - as does pdftohtml I believe.

kiwidude - I really do understand that PDF is, in general, a programming language and a PostScript interpreter is a fairly large beast. That said, most screenplay PDFs are created by a small handful of programs and generally create PDFs that are easy enough for tools like pdftotext to render with pretty high fidelity.

[edit: I stand corrected - I've just found a script output from one of the big screenwriting programs that's not well rendered by pdftotext et al]

I'm sure not looking for perfection here. What pdftotext generates is very satisfactory. Which brings me to a different thought - most eReaders understand straight text, right? Perhaps an easier way to go would be to make a separate tool that'd rewrap paragraphs to a width appropriate for a given reader and then just send the resulting text file to the eReader. Comments?
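
A rough sketch of such a rewrap tool, assuming layout-preserved text as input (for example what pdftotext produces). The function name `rewrap`, the 80-column source width and the 40-column target are assumptions; indentation is scaled down in proportion so the blocks keep their relative positions.

```python
import textwrap


def rewrap(text: str, width: int = 40, source_width: int = 80) -> str:
    """Rewrap runs of equally indented lines to a narrower width."""
    out, indent, words = [], None, []

    def flush():
        # Emit the pending block, with its indent scaled to the new width.
        if words:
            pad = " " * (indent * width // source_width)
            out.append(textwrap.fill(" ".join(words), width=width,
                                     initial_indent=pad, subsequent_indent=pad))
            words.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            flush()
            out.append("")  # blank lines pass through unaltered
            continue
        cur = len(line) - len(line.lstrip(" "))
        if cur != indent:
            flush()  # a change of indentation starts a new block
        indent = cur
        words.append(stripped)
    flush()
    return "\n".join(out)
```

The result is plain text that could be sent to the reader as a .txt file, provided the device shows it in a fixed-width font; otherwise the space-based indents will not line up.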

Last edited by joesh; 05-17-2012 at 07:52 AM. Reason: found that - as kiwidude said - even pdf for screenplays can be knotty to render
Old 05-21-2012, 04:21 AM   #14
Dopedangel
Wizard
Dopedangel ought to be getting tired of karma fortunes by now.
Posts: 1,124
Karma: 8671315
Join Date: Dec 2006
Location: Singapore
Device: Coolreader(Nexus 5)\Coolreader(Nook Touch)
Quote:
Originally Posted by joesh View Post
ldolse - thanks for the considered response and education on how Calibre removes in preprocessing much of the formatting I was hoping to use.

As far as blank lines are concerned, certainly PDF doesn't have them but translators like pdftotext do create them in the text output - as does pdftohtml I believe.

kiwidude - I really do understand that PDF is, in general, a programming language and a PostScript interpreter is a fairly large beast. That said, most screenplay PDFs are created by a small handful of programs and generally create PDFs that are easy enough for tools like pdftotext to render with pretty high fidelity.

[edit: I stand corrected - I've just found a script output from one of the big screenwriting programs that's not well rendered by pdftotext et al]

I'm sure not looking for perfection here. What pdftotext generates is very satisfactory. Which brings me to a different thought - most eReaders understand straight text, right? Perhaps an easier way to go would be to make a separate tool that'd rewrap paragraphs to a width appropriate for a given reader and then just send the resulting text file to the eReader. Comments?
Have you tried using OCR software like ABBYY FineReader? I think it would preserve the formatting when it is used to convert. Or try http://pdftransformer.abbyy.com/

Last edited by Dopedangel; 05-21-2012 at 04:23 AM.
All times are GMT -4.