#1 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Unwrapping hard line breaks across all input formats
Hi, in the few weeks I've been on the forums I've come across a lot of threads where users are dealing with hard line breaks in various types of content, and I've seen the same problem in a lot of the content I've downloaded, whether that's text, PDF, or a variety of ebook formats.
Calibre already handles this in its PDF processing, which is a basic requirement for PDFs since pdftohtml doesn't unwrap lines at all. For the other formats I think this type of processing should be optional, since the majority(?) of content is well formed with regard to wrapping. It seems like it ought to be simple enough to make it an optional step in the conversion pipeline. I think all the logic is already in preprocess.py (in pluginize); it's just tied to the PDF format. Of course every format would need slightly different regexes, but the basic logic we've worked out for PDF would apply.
My Python skills aren't great, but if someone created the hooks for the other formats so that these regexes can optionally be applied during conversion, I'd be happy to own the regexes and work the kinks out of each format. Put the hooks in one format and show me how it was done and I may even be able to apply it to others. The worst offenders seem to be text, then RTF, followed by LIT, as these are the formats that a lot of OCR work tends to end up in.
Some of the threads where people have expressed interest/frustration: How to deal with irregular hard-wrapping on a large scale? line formatting formatting question text reformat Tool for removing line breaks in text documents
#2 |
creator of calibre
Posts: 45,164
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
The new conversion framework makes this very easy to do, but I'm not so sure it is a good idea. The reason it works for PDF is that calibre is post-processing the output from pdftohtml, which is pretty consistent. TXT/LIT/RTF files can have a very wide range of input that would probably require different algorithms to process, which would mean the user would have to select the algorithm at conversion time. I suppose that is doable...
#3 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
From a user perspective I was thinking of presenting just one option: 'fix line-breaks' or 'enable text processing' or something like that, enabled or disabled the same way 'Detect Chapters' is today.
That would then invoke a function similar to the current pdftohtml functions in preprocess.py. Then just write one set of regexes per format. Text/RTF would get one regex (I think it would be the same regex in those cases); LIT would get another, as the lit->html output looks different. I'm not sure it makes as much sense for other, more modern formats, as it's the older ones that seem to have the problem, though it does apply if someone has a book that was originally converted from a bad file.
Anyway, I wasn't thinking the user would need to worry about writing/specifying replacement patterns. I suppose that would also be OK for the power user, but it would be a lot more GUI work to maintain default regexes in the GUI for each format. I think a checkbox with best-effort regexes hard-coded would be a big step up from what we've got now. Default would be disabled, of course.
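To make the single-checkbox idea concrete, here is a minimal Python sketch. Everything in it is hypothetical (the names `UNWRAP_REGEXES` and `fix_line_breaks`, the patterns, and the assumed `<br/>` shape of lit->html output are illustrations, not calibre code): one hard-coded regex per input format, applied only when the option is switched on.

```python
import re

# Hypothetical best-effort patterns, one per input format. Each matches a
# break that does NOT follow sentence-ending punctuation (optionally followed
# by a closing double quote), i.e. a break that is probably a hard wrap.
# The text pattern also leaves blank lines (paragraph gaps) alone.
_TEXT_WRAP = re.compile(r'(?<![.?!\n])(?<![.?!]")\n(?!\n)')
# Assumption for illustration only: wrapped lines in lit->html output are
# separated by <br/> tags.
_LIT_WRAP = re.compile(r'(?<![.?!])(?<![.?!]")<br\s*/?>\s*')

UNWRAP_REGEXES = {
    'txt': _TEXT_WRAP,
    'rtf': _TEXT_WRAP,  # same regex as plain text, as suggested in the post
    'lit': _LIT_WRAP,
}

def fix_line_breaks(content, fmt, enabled=False):
    """Join hard-wrapped lines for format `fmt`; off unless `enabled` is set."""
    regex = UNWRAP_REGEXES.get(fmt.lower())
    if not enabled or regex is None:
        return content  # option disabled, or no regex known for this format
    return regex.sub(' ', content)
```

With the default `enabled=False` the content passes through untouched, which matches the "default would be disabled" requirement; formats without an entry in the table are also left alone.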
#4 |
Wizard
Posts: 3,462
Karma: 10484861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
|
When I am trying to reformat a plain-text book with lots of weird hard breaks, I can usually do 99% of the work with a quick-and-dirty trick (usually implemented in the Vim editor using regular expressions):
find <dot><end of paragraph>
find <exclamation point><end of paragraph>
find <question mark><end of paragraph>
find <dot><quote><end of paragraph>
find <exclamation point><quote><end of paragraph>
find <question mark><quote><end of paragraph>
Replace all of those with <what you found>HereIsEndOfParagraph<end of paragraph>.
Now find every line that does not end with HereIsEndOfParagraph<end of paragraph> and join it with the next one.
Code:
" Vim script. Can be easily adapted for sed
" This script can be written in a much more condensed and clever
" way, but this way it is much more understandable
:%substitute/[.]$/\0HereIsEndOfParagraph/
:%substitute/[?]$/\0HereIsEndOfParagraph/
:%substitute/[!]$/\0HereIsEndOfParagraph/
:%substitute/[.]["]$/\0HereIsEndOfParagraph/
:%substitute/[?]["]$/\0HereIsEndOfParagraph/
:%substitute/[!]["]$/\0HereIsEndOfParagraph/
:global!/HereIsEndOfParagraph$/join
:%substitute/HereIsEndOfParagraph$//
"end of quick-and-dirty Vim script
#5 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
That logic is basically what I'm referring to, just done with a single regex replacement in Python. It is then combined with a median line length calculation so that the regex is only applied to lines approaching the document median (an extra safety measure to prevent short lines without punctuation from being joined to the next line). This is what we're already doing for PDF post-processing.
There will always be docs that this won't work for, but I think we can handle the majority of the cases where a user currently needs to hand-edit to fix this sort of thing.
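A rough Python sketch of the median-length safety check described above. This is not calibre's actual preprocess.py code; the function name `unwrap_lines` and the `factor` threshold are assumptions made up for the example. A line is joined to the next one only if it lacks sentence-ending punctuation and is close to the median line length.

```python
import re
from statistics import median

# A line "ends a paragraph" if it finishes with . ? or ! plus an optional
# closing double quote, mirroring the punctuation rule discussed above.
SENTENCE_END = re.compile(r'[.?!]"?\s*$')

def unwrap_lines(text, factor=0.8):
    """Join hard-wrapped lines.

    `factor` is an illustrative threshold (fraction of the median line
    length), not a value taken from calibre.
    """
    lines = text.split('\n')
    lengths = [len(line) for line in lines if line.strip()]
    if not lengths:
        return text
    threshold = median(lengths) * factor

    out, buf = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:  # blank line: always a paragraph boundary
            if buf:
                out.append(' '.join(buf))
                buf = []
            out.append('')
            continue
        buf.append(stripped)
        # Only long lines without closing punctuation count as wrapped.
        wrapped = len(line) >= threshold and not SENTENCE_END.search(line)
        if not wrapped:
            out.append(' '.join(buf))
            buf = []
    if buf:
        out.append(' '.join(buf))
    return '\n'.join(out)
```

The length test is what keeps a short heading such as "Chapter One" from being glued onto the paragraph that follows it, even though it has no closing punctuation.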
#6 |
Wizard
Posts: 3,462
Karma: 10484861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
|
|
#7 |
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköpng, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
|
|
#8 |
creator of calibre
Posts: 45,164
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Hmm well I have no objection to adding it as a default off option with possibly different regexes per input format. Do me a favor and open a ticket for it, I'll add the framework for it to the new calibre conversion pipeline and you can then fine tune the regexes to your heart's content.
#9 | |
Wizard
Posts: 3,462
Karma: 10484861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
|
Quote:
States are just to demonstrate the concept.
Code:
" The first one-line example
:global!/\([.]"\?\|[?]"\?\|[!]"\?\)$/.;/\([.]"\?\|[?]"\?\|[!]"\?\)$/join
" the same one, yet shorter
:v/\([.]"\?\|[?]"\?\|[!]"\?\)$/.;/\([.]"\?\|[?]"\?\|[!]"\?\)$/j
" and even more condensed one!
:v/\."\?$\|?"\?$\|!"\?$/.;/\."\?$\|?"\?$\|!"\?$/j
" previous versions work, but you get an error [that you can safely ignore]
" at the last line of text, so here we go with even more interesting
" looking one
:1;$-1v/\."\?$\|?"\?$\|!"\?$/.;/\."\?$\|?"\?$\|!"\?$/j
I have tested those examples on a sample text. The final version would require testing and tweaking on a much wider range of texts.
Now I'm off to try doing this in "pure" RegExp, using that lovely "negative lookbehind of non-arbitrary length" that even Perl doesn't have [EVIL LAUGHTER]
Last edited by kacir; 04-28-2009 at 02:55 PM.
#10 |
Wizard
Posts: 3,462
Karma: 10484861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
|
:%s/\(\."\?\|?"\?\|!"\?\)\@<!\n/ /
Last edited by kacir; 04-28-2009 at 08:07 AM.
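For comparison, a hedged sketch of the same substitution in Python, the language calibre's preprocessing is written in. Python's `re` module only accepts fixed-width lookbehinds, so the optional quote is covered by two separate lookbehind assertions instead of one variable-length one. The names `HARD_BREAK` and `join_wrapped` are made up for this example.

```python
import re

# Matches a newline that is NOT preceded by . ? or ! (first lookbehind)
# and NOT preceded by one of those followed by a double quote (second).
HARD_BREAK = re.compile(r'(?<![.?!])(?<![.?!]")\n')

def join_wrapped(text):
    """Replace line breaks that do not follow sentence-ending punctuation with a space."""
    return HARD_BREAK.sub(' ', text)
```

Like the Vim one-liner, this has no notion of blank lines or line length, so it is only a starting point for the per-format regexes.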
#11 |
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköpng, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
|
|
#12 | |
Wizard
Posts: 3,462
Karma: 10484861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
|
Quote:
Also, such a tool always has to be tweaked, and my first example is much more pleasant to work with. I tend to refine my scripts with every use, so in the end I just need to run the script and briefly check the results. Sometimes you open your own script and you just wonder "What the $#$%! is this supposed to do?"
Besides, with the replacement expression \= you can do lots of very interesting stuff, because \= is followed by an expression in the quite powerful Vim scripting language. So in Vim you can work with variables, global variables, functions and other goodies even from inside the regexp.
Last edited by kacir; 04-28-2009 at 09:06 AM.
#13 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Ticket is opened:
http://calibre.kovidgoyal.net/ticket/2359
Thanks Kacir for putting some starting points together; once the framework is in place I'll try these out.
#14 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
I see the ticket has been fixed, and I saw the changes you made in the core files. That said, I'm not quite sure how to leverage this in an input plugin.
My first question is which file is actually considered the input plugin. Is it /calibre/ebooks/<format>/input.py? I see most folders there have an input.py, but not all. And second, what exactly do I need to define in an input plugin? Do I need to define a function called HTMLPreProcessor in each plugin that acts similarly to pdftohtml?
#15 |
creator of calibre
Posts: 45,164
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
An input plugin is defined as a subclass of InputFormatPlugin; by convention these subclasses are usually in files called input.py. To add support for preprocessing for a particular input format, just reimplement the following method in the input plugin for that format:
Code:
def preprocess_html(self, html):
    '''
    This method is called by the conversion pipeline on all HTML before it is parsed.
    It is meant to be used to do any required preprocessing on the HTML, like
    removing hard line breaks, etc.

    :param html: A unicode string
    :return: A unicode string
    '''
    return html
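As a hedged illustration of how that hook might be used, here is a self-contained sketch. The `InputFormatPlugin` class below is only a local stand-in so the example runs on its own (in calibre you would subclass the real one), and `TXTInputSketch` and its regex are hypothetical, not part of calibre.

```python
import re

class InputFormatPlugin:
    """Local stand-in for calibre's base class; only the hook is modelled."""
    def preprocess_html(self, html):
        return html

class TXTInputSketch(InputFormatPlugin):
    # Hypothetical pattern: a line break not preceded by . ? ! (optionally
    # followed by a double quote) and not adjacent to a blank line.
    _hard_break = re.compile(r'(?<![.?!\n])(?<![.?!]")\n(?!\n)')

    def preprocess_html(self, html):
        # Receives and returns a unicode string, per the docstring above.
        return self._hard_break.sub(' ', html)
```

Usage would then be `TXTInputSketch().preprocess_html(html)`; formats that do not override the method keep the pass-through behaviour of the base class.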