Unwrapping hard line breaks across all input formats
Hi, for the few weeks I've been on the forums I've come across a lot of threads where users are dealing with hard line breaks in various types of content, and I've seen this in a lot of the content I've downloaded, whether that's text, pdf, or even a variety of ebook formats.
Calibre already handles this in its PDF processing, and of course this is a basic requirement for PDFs, as pdftohtml doesn't handle this function at all. For the other formats I think this type of processing should be optional, since the majority(?) of content is well formed with regard to wrapping. It seems like it ought to be simple enough to make it an optional step during the conversion pipeline. I think all the logic is already in preprocess.py (in pluginize), it's just tied to the PDF format. Of course every format would need slightly different regexes, but the basic logic we've worked out for PDF would apply.
My Python skills aren't great, but if someone created the hooks into the other formats so that these types of regexes could optionally be applied during the conversion process, I'd be happy to own the regexps and work the kinks out of each format. Put the hooks in one format and show me how it was done and I may even be able to apply it to others. Seems like the worst offenders are text, then RTF, followed by LIT, as these seem to be the formats that a lot of OCR work tends to wind up going to.
Some of the threads where people have expressed interest/frustration:
How to deal with irregular hard-wrapping on a large scale?
line formatting
formatting question
text reformat
Tool for removing line breaks in text documents |
The new conversion framework makes this very easy to do, but I'm not so sure it is a good idea. The reason it works for PDF is that calibre is post-processing the output from pdftohtml, which is pretty consistent. TXT/LIT/RTF files can have a very wide range of input that would probably require different algorithms to process, which would mean the user would have to select the algorithm at conversion time. I suppose that is doable...
|
From a user perspective I was thinking of presenting just one option - 'fix line-breaks' or 'enable text processing' or something like that. Enable/Disable similar to 'Detect Chapters' today.
That would then invoke a function similar to the current pdftohtml functions in preprocess.py. Then just write one set of regexes per format. Text/RTF would get one regex (I think it would be the same regex in those cases), LIT would get another, as the LIT->HTML output looks different. I'm not sure if it makes as much sense for other, more modern formats, as it's the older ones that seem to have the problem, though it does apply if someone has a book that was originally converted from a bad file. Anyway, I wasn't thinking the user would need to worry about writing/specifying replacement patterns. I suppose that would also be OK for the power user, but it would be a lot more GUI work to maintain default regexes in the GUI for each format. I think a checkbox with best-effort regexes hard-coded would be a big step over what we've got now. Default would be disabled, of course. |
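To make the "one checkbox, one set of regexes per format" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the names `UNWRAP_RULES` and `fix_line_breaks`, the format keys, and the patterns themselves are hypothetical and are not taken from calibre's preprocess.py.

```python
import re

# Hypothetical best-effort rules, one list per input format.
# Each entry is (compiled pattern, replacement).
_TEXT_RULE = (re.compile(r'(?<=[^\n.?!"])\n(?=\S)'), ' ')
UNWRAP_RULES = {
    # Plain text and RTF share one rule: a newline that does not follow
    # sentence-ending punctuation is treated as a hard wrap.
    'txt': [_TEXT_RULE],
    'rtf': [_TEXT_RULE],
    # HTML produced from LIT (assumed shape): merge a paragraph that does
    # not end in punctuation with a following one starting in lower case.
    'lit': [(re.compile(r'(?<=[^.?!">])\s*</p>\s*<p[^>]*>\s*(?=[a-z])'), ' ')],
}


def fix_line_breaks(content, input_format, enabled=False):
    """Apply the hard-coded rules for input_format; off by default."""
    if not enabled:
        return content
    for pattern, replacement in UNWRAP_RULES.get(input_format, []):
        content = pattern.sub(replacement, content)
    return content
```

With the option left at its default, the content passes through untouched, and a format with no rules is also left alone, so enabling the checkbox can only affect the formats someone has written regexes for.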
When I am trying to reformat a plain text book with lots of weird hard breaks, I can usually get 99% of the work done with a quick and dirty trick (usually implemented in the Vim editor using regular expressions).
find <dot><end of paragraph>
find <exclamation point><end of paragraph>
find <question mark><end of paragraph>
find <dot><quote><end of paragraph>
find <exclamation point><quote><end of paragraph>
find <question mark><quote><end of paragraph>
Replace all those things with <what you found>HereIsEndOfParagraph<end of paragraph>
Now find every line that does not end with HereIsEndOfParagraph<end of paragraph> and join it with the next.
Code:
" Vim script. Can be easily adapted for sed
" This script can be written in a much more condensed and clever
" way, but this way it is much more understandable
:%substitute/[.]$/\0HereIsEndOfParagraph/
:%substitute/[?]$/\0HereIsEndOfParagraph/
:%substitute/[!]$/\0HereIsEndOfParagraph/
:%substitute/[.]["]$/\0HereIsEndOfParagraph/
:%substitute/[?]["]$/\0HereIsEndOfParagraph/
:%substitute/[!]["]$/\0HereIsEndOfParagraph/
:global!/HereIsEndOfParagraph$/join
:%substitute/HereIsEndOfParagraph$//
"end of quick-and-dirty Vim script |
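For readers who don't use Vim, the same idea can be sketched in Python. This is only an illustration of the trick described above, not a port of the script: the function name `join_wrapped_lines` is made up, blank lines are simply skipped (an assumption), and the marker text is unnecessary because the lines are processed one at a time.

```python
import re

# A paragraph is assumed to end with '.', '?' or '!',
# optionally followed by a double quote.
PARAGRAPH_END = re.compile(r'[.?!]"?$')


def join_wrapped_lines(text):
    """Join each line that does not end a paragraph with the lines after it."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # assumption: blank lines carry no information here
        current.append(stripped)
        if PARAGRAPH_END.search(stripped):
            paragraphs.append(' '.join(current))
            current = []
    if current:
        # Trailing text that never reached end-of-paragraph punctuation.
        paragraphs.append(' '.join(current))
    return '\n'.join(paragraphs)
```

Like the Vim version, this will wrongly merge short unpunctuated lines such as headings into the following paragraph, so it needs checking on each text.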
That logic is basically what I'm referring to, it's just with a single regex replacement in Python. That is then combined with a median line length calculation so that only lines approaching the document median have the regex applied (an extra safety to prevent short lines without punctuation from being wrapped). This is what we're doing already for PDF post processing.
There will always be docs that this won't work for, but I think that we can handle the majority of the cases where a user needs to hand edit to fix this sort of thing. |
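A rough sketch of how a single regex replacement can be combined with a median line length check, for plain text. The function name `unwrap_hard_breaks`, the `factor` parameter, and its default of 0.8 are assumptions for illustration; the regexes calibre applies to pdftohtml output differ.

```python
import re
from statistics import median


def unwrap_hard_breaks(text, factor=0.8):
    """Replace a newline with a space when the line before it is close to
    the median line length and does not end in sentence punctuation."""
    lengths = [len(line) for line in text.splitlines() if line.strip()]
    if not lengths:
        return text
    # Lines shorter than factor * median are left alone (headings, etc.).
    # One character is reserved for the final, non-punctuation character.
    min_len = max(int(median(lengths) * factor) - 1, 0)
    pattern = re.compile(r'(?<=[^\n]{%d}[^\n.?!"])\n(?=\S)' % min_len)
    return pattern.sub(' ', text)
```

The lookbehind only counts characters on the same line, so a short line without punctuation (a chapter title, say) keeps its line break, which is the extra safety described above.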
Hmm well I have no objection to adding it as a default off option with possibly different regexes per input format. Do me a favor and open a ticket for it, I'll add the framework for it to the new calibre conversion pipeline and you can then fine tune the regexes to your heart's content.
|
Quote:
States are just to demonstrate the concept.
Code:
" The first one-line example
:global!/\([.]"\?\|[?]"\?\|[!]"\?\)$/.;/\([.]"\?\|[?]"\?\|[!]"\?\)$/join
" the same one, yet shorter
:v/\([.]"\?\|[?]"\?\|[!]"\?\)$/.;/\([.]"\?\|[?]"\?\|[!]"\?\)$/j
" and even more condensed one!
:v/\."\?$\|?"\?$\|!"\?$/.;/\."\?$\|?"\?$\|!"\?$/j
" previous versions work, but you get an error [that you can safely ignore]
" at the last line of text, so here we go with even more interesting
" looking one
:1;$-1v/\."\?$\|?"\?$\|!"\?$/.;/\."\?$\|?"\?$\|!"\?$/j
I have tested those examples on a sample text. The final version would require testing and tweaking on a much wider range of texts. Now I'll go and try to do this in "pure" RegExp using that lovely "negative lookbehind of non-arbitrary length" that even Perl doesn't have [EVIL LAUGHTER] :D |
Quote:
Also, such a tool always has to be tweaked, and my first example is much more pleasant to work with. I tend to refine my scripts with every use, so in the end I just need to run the script and briefly check the results. Sometimes you open your own script and you just wonder "What the $#$%! is this supposed to do?" Besides, with the replacement RegEx \= you can do lots of very interesting stuff, because \= is followed by an expression in the quite powerful Vim scripting language. So in Vim you can work with variables, global variables, functions and other goodies even from inside the RegExp. |
Ticket is opened:
http://calibre.kovidgoyal.net/ticket/2359 Thanks Kacir for putting some starting points together, once the framework is in place I'll try these out. |
I see the ticket has been fixed, and I saw the changes you made in the core files. That said, I'm not quite sure how to leverage this in an input plugin.
First question is which file is actually considered the input plugin? Is it /calibre/ebooks/<format>/input.py? I see most folders there have an input.py, but not all. And then finally what exactly do I need to define in an input plugin, do I need to define a function called HTMLPreProcessor in each plugin that acts similarly to pdftohtml? |
An input plugin is defined as a subclass of InputFormatPlugin. By convention these subclasses are usually in files called input.py. To add support for preprocessing for a particular input format, just reimplement the following method in the input plugin for that format:
Code:
def preprocess_html(self, html): |
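As a hedged illustration of what reimplementing that method might look like: the sketch below uses a stand-in `InputFormatPlugin` class so that it runs on its own, a hypothetical `TXTInput` subclass, and a made-up regex. It also assumes the method returns the processed HTML string, which the post above does not state.

```python
import re


class InputFormatPlugin:
    """Stand-in for calibre's base class, defined here only so the
    sketch is self-contained. The real class has far more to it."""

    def preprocess_html(self, html):
        # Assumed default: no preprocessing.
        return html


class TXTInput(InputFormatPlugin):
    """Hypothetical input plugin that unwraps hard line breaks."""

    # Illustrative rule: merge a paragraph that does not end in sentence
    # punctuation with a following paragraph that starts in lower case.
    UNWRAP = re.compile(r'(?<=[^.?!">])\s*</p>\s*<p[^>]*>\s*(?=[a-z])')

    def preprocess_html(self, html):
        # Assumption: the hook receives HTML text and returns HTML text.
        return self.UNWRAP.sub(' ', html)
```

Each format's plugin could then carry its own pattern, which matches the plan of tuning one set of regexes per input format.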
MobileRead.com is a privately owned, operated and funded community.