MobileRead Forums - View Single Post - Learning More About Cleaning Up Documents

ldolse · 01-24-2011, 09:32 PM

Quote:

Originally Posted by Agama

@ Manichean, user_none : Thanks for these links, both resources look really good. I have downloaded the book and am immediately impressed by the scope of Python.

@Idolse : The scripts are geared towards tidying up markdown-to-ePub conversions and work on the exploded ePub from calibre's Tweak ePub:
1) Renames the html split files which result from markdown -> ePub and updates the opf/ncx files to match.
2) Strips all class="calibre[0-9]*" attributes from an ePub and links in a custom stylesheet.
3) Tidies the OPF file by stripping out blank lines and splitting lines with multiple XML tags. These are not produced by calibre but some publishers seem fond of unreadable OPF files (e.g. Feedbooks use lots of blank lines).
4) ToC editor which presents the ePub ncx file as a simple text file (1 line per ToC entry) for insertions/deletions/amendments/hierarchies, then rebuilds it including playOrder.
5) (In progress) Applies a predefined set of regexes to a plain text file prior to conversion.
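
To make item 2 concrete, here is a minimal sketch of how the class stripping might look. It is not the actual script: the function names and the default "custom.css" file name are made up for illustration, and it only handles class attributes whose sole value is a calibre-generated name.

```python
import re

# Matches a whole class attribute whose only value is a calibre-generated
# name, e.g. class="calibre" or class="calibre12" (leading whitespace included).
CALIBRE_CLASS = re.compile(r'\s+class="calibre[0-9]*"')


def strip_calibre_classes(html: str) -> str:
    """Remove every class="calibreNN" attribute from the markup."""
    return CALIBRE_CLASS.sub("", html)


def link_stylesheet(html: str, href: str = "custom.css") -> str:
    """Insert a stylesheet link just before the first closing head tag.

    The href default is a hypothetical file name. If there is no
    closing head tag the text is returned unchanged.
    """
    link = f'<link rel="stylesheet" type="text/css" href="{href}"/>'
    return html.replace("</head>", link + "\n</head>", 1)
```

Other classes, such as class="note", are left alone, so hand-written styling survives the cleanup.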

Those mostly sound like things that would probably be best tied directly into the Tweak ePub feature rather than placed elsewhere in the pipeline. The regexes you want to apply to text conversion could be handled directly in the text input plugin, either as a new option or by modifying an existing one, depending on what they do.
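
For reference, a predefined regex pass over plain text can be quite small. The sketch below is only an illustration, not calibre code: the three rules and the names RULES and apply_rules are assumptions, and a real rule set would depend on what the source text looks like.

```python
import re

# Hypothetical cleanup rules: (compiled pattern, replacement) pairs,
# applied in order. Swap in whatever rules suit the source text.
RULES = [
    (re.compile(r"[ \t]+$", re.MULTILINE), ""),  # strip trailing whitespace
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),       # rejoin words hyphenated across lines
    (re.compile(r"\n{3,}"), "\n\n"),             # collapse runs of blank lines
]


def apply_rules(text, rules=RULES):
    """Run each rule over the whole text, one after another."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text
```

Order matters here: trailing whitespace is removed first so that a hyphen followed by stray spaces still sits right before the line break when the rejoining rule runs.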