Seems to me there are two distinct features being requested here.
1. Can you make Calibre's PDF translation better?
2. Assuming an "acceptably-translated" PDF, can you add a "screenplay" heuristic set that'll be savvy about screenplay format?
I see from the responses above and throughout the forums that (1) is a sore subject around here. No problem. PDF is fine input for minds but poor input for computers. So let's go to (2).
I've played with feeding the current PDF parser a bunch of screenplays, and I think that what it generates meets my criteria for an "acceptably-translated" PDF for the heuristics I have in mind.
These heuristics would mainly use indentation to detect structure. A run of consecutive lines at the same level of indentation would be the unit of reflow. Blank lines would also delimit a block, and would themselves pass through unaltered.
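To make that concrete, here's a rough sketch of the basic idea in Python. This is not Calibre code; the function name reflow_by_indent is made up for illustration, and it assumes the parser's output is plain text where indentation shows up as leading whitespace (tabs are expanded to spaces first). It handles only the two rules above: same indentation means same block, and blank lines end a block and are kept.

```python
def reflow_by_indent(text):
    """Join consecutive lines that share an indentation level into one line.

    Hypothetical helper, not part of Calibre. Assumes indentation is
    leading whitespace in plain-text parser output.
    """
    out = []
    block = []           # stripped lines of the block being collected
    block_indent = None  # indentation (in columns) of the current block

    def flush():
        # Emit the collected block as a single reflowed line.
        if block:
            out.append(" " * block_indent + " ".join(block))
            block.clear()

    for raw in text.splitlines():
        line = raw.expandtabs()
        stripped = line.strip()
        if not stripped:
            # A blank line ends the current block and passes through.
            flush()
            out.append("")
            block_indent = None
            continue
        indent = len(line) - len(line.lstrip(" "))
        if block and indent != block_indent:
            # A change of indentation starts a new block.
            flush()
        block_indent = indent
        block.append(stripped)
    flush()
    return "\n".join(out)
```

Each block comes out as one long line at its original indentation, which leaves the actual wrapping to whatever renders the book. Whether one line per block is the right output for the conversion pipeline is a guess on my part.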
That's most of it right there. I suspect there would be a few tweaks to this - like parentheticals allowing either same-level or +1 indentation to match - so that
be one block)
but I think this would do a pretty nice job.
Am I missing something really big?