Old 04-28-2009, 02:38 AM   #1
ldolse
Unwrapping hard line breaks across all input formats

Hi, in the few weeks I've been on the forums I've come across a lot of threads where users are dealing with hard line breaks in various types of content, and I've seen the same problem in a lot of the content I've downloaded, whether that's text, PDF, or a variety of ebook formats.

Calibre already handles this in its PDF processing, and of course this is a basic requirement for PDFs, since pdftohtml doesn't unwrap lines at all.

For the other formats I think this type of processing should be optional, since the majority(?) of content is well formed with regard to wrapping. It seems like it ought to be simple enough to make it an optional step during the conversion pipeline.

I think all the logic is already in preprocess.py (in pluginize), but it's currently tied to the PDF format. Of course every format would need slightly different regexes, but the basic logic we've worked out for PDF would apply.
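
To make the basic logic concrete, here is a rough sketch of the kind of unwrapping I have in mind for plain text. This is not the code in preprocess.py; the function name `unwrap_text` and the `min_len` threshold are made up for illustration. The assumption is that a line break is a hard wrap when the line before it is reasonably long and doesn't end in sentence-ending punctuation.

```python
import re

def unwrap_text(text, min_len=40):
    """Join hard-wrapped lines (hypothetical helper, not calibre's code).

    A newline is replaced by a space when:
      - the line before it has at least min_len characters,
      - that line does not end in . ! ? : or a double quote,
      - the next line is not blank.
    Short lines and paragraph breaks are left alone.
    """
    pattern = re.compile(r'(?<=.{%d})(?<![.!?:"])\n(?=\S)' % min_len)
    return pattern.sub(" ", text)

sample = (
    "This is a line that was wrapped by some tool at a fixed\n"
    "width, so the sentence continues here.\n"
    "\n"
    "Short line\n"
    "next line\n"
)
result = unwrap_text(sample)
print(result)
```

The length check is there so that short lines such as headings or verse aren't merged by accident. The threshold would probably need tuning per document.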

My Python skills aren't great, but if someone created hooks in the other formats so that these kinds of regexes could optionally be applied during conversion, I'd be happy to own the regexes and work the kinks out for each format. Put the hooks in one format and show me how it was done, and I may even be able to apply it to the others.
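
Roughly what I imagine the hook could look like, as a sketch only: a table of regexes per input format and an option that is off by default. Every name here (`UNWRAP_RULES`, `apply_unwrap`, the `enabled` flag) is hypothetical and not part of calibre, and the two patterns are placeholders. The "rtf" rule assumes the content has already been turned into HTML paragraphs by that point.

```python
import re

# Hypothetical per-format rule table: (compiled pattern, replacement).
UNWRAP_RULES = {
    # Plain text: join a line to the next unless it ends a sentence
    # or is followed by a blank line.
    "txt": [(re.compile(r'(?<=[^\n.!?:"])\n(?=[^\n])'), " ")],
    # HTML derived from RTF (assumed): merge a paragraph that ends
    # mid-sentence with a following one that starts in lowercase.
    "rtf": [(re.compile(r'(?<=[a-z,;])\s*</p>\s*<p>\s*(?=[a-z])'), " ")],
}

def apply_unwrap(content, fmt, enabled=False):
    """Apply the unwrap rules for fmt only if the user enabled them."""
    if not enabled:
        return content
    for pattern, replacement in UNWRAP_RULES.get(fmt, []):
        content = pattern.sub(replacement, content)
    return content

txt_out = apply_unwrap("a line\nwrapped.\n\nNew para", "txt", enabled=True)
rtf_out = apply_unwrap("<p>the quick brown</p>\n<p>fox jumps.</p>",
                       "rtf", enabled=True)
print(txt_out)
print(rtf_out)
```

With the option left off, or for a format with no rules, the content passes through untouched, which is what I'd want given that most content is already wrapped properly.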

The worst offenders seem to be text, then RTF, followed by LIT, as these are the formats a lot of OCR work tends to end up in.

Some of the threads where people have expressed interest/frustration:
How to deal with irregular hard-wrapping on a large scale?
line formatting formatting question
text reformat
Tool for removing line breaks in text documents