Unwrapping hard line breaks across all input formats

ldolse · 04-28-2009, 03:38 AM

Hi, for the few weeks I've been on the forums I've come across a lot of threads where users are dealing with hard line breaks in various types of content, and I've seen this in a lot of the content I've downloaded, whether that's text, pdf, or even a variety of ebook formats.

Calibre already handles this in its PDF processing, and of course this is a basic requirement for PDFs as pdftohtml doesn't handle this function at all.

For the other formats I think this type of processing should be optional, since the majority(?) of content is well formed with regard to wrapping. It seems like it ought to be simple enough to make it an optional step during the conversion pipeline.

I think all the logic is already in preprocess.py (in pluginize), it's just tied to the PDF format. Of course every format would need slightly different regexes, but the basic logic we've worked out for pdf would apply.

My Python skills aren't great, but if someone created the hooks into the other formats so that these types of regexes could optionally be applied during the conversion process, I'd be happy to own the regexes and work the kinks out of each format. Put the hooks in one format and show me how it was done and I may even be able to apply it to others.

Seems like the worst offenders are text, then rtf, followed by LIT, as these seem to be the formats that a lot of OCR work tends to wind up going to.

Some of the threads where people have expressed interest/frustration:
How to deal with irregular hard-wrapping on a large scale?
line formatting formatting question
text reformat
Tool for removing line breaks in text documents

kovidgoyal · 04-28-2009, 04:04 AM

The new conversion framework makes this very easy to do, but I'm not so sure it is a good idea. The reason it works for PDF is that calibre is post processing the output from pdftohtml, which is pretty consistent. TXT/LIT/RTF files can have a very wide range of input that would probably require different algorithms to process. Which would mean the user would have to select the algorithm at conversion time. I suppose that is doable...

ldolse · 04-28-2009, 05:06 AM

From a user perspective I was thinking of presenting just one option - 'fix line-breaks' or 'enable text processing' or something like that. Enable/Disable similar to 'Detect Chapters' today.

That would then invoke a function similar to the current pdftohtml functions in preprocess.py. Then just write one set of regexes per format. Text/RTF would get one regex (I think that would be the same regex in those cases), LIT would get another as the lit->html output looks different. I'm not sure if it makes as much sense for other more modern formats, as it's the older ones that seem to have the problem, though it does apply if someone has a book that was originally converted from a bad file.

Anyway I wasn't thinking the user would need to worry about writing/specifying replacement patterns. I suppose that would also be ok for the power user, but that would be a lot more GUI work to maintain default regexes in the GUI for each format. I think a checkbox with best effort regexes hard-coded would be a big step over what we've got now.

Default would be disabled of course.

kacir · 04-28-2009, 06:24 AM

When I am trying to reformat a plain text book with lots of weird hard breaks I can usually do 99% of the work using a quick and dirty trick (usually implemented in the Vim editor using regular expressions).

find <dot><end of paragraph>
find <exclamation point><end of paragraph>
find <question mark><end of paragraph>
find <dot><quote><end of paragraph>
find <exclamation point><quote><end of paragraph>
find <question mark><quote><end of paragraph>
Replace all those things with <what you found>HereIsEndOfParagraph<end of paragraph>

Now find every line that does not end with HereIsEndOfParagraph<end of paragraph> and join it with the next.
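
The steps above can also be sketched in Python as a small stateful loop instead of marker substitutions. This is only an illustration of the same rule, not anything taken from calibre: the function name `unwrap_lines` is made up, blank lines are simply dropped, and a paragraph is assumed to end on `.`, `?` or `!` with an optional closing double quote.

```python
import re

# Assumption: a paragraph ends on a line whose last character is . ? or !,
# optionally followed by a double quote (the same rule as the steps above).
PARAGRAPH_END = re.compile(r'[.?!]"?$')


def unwrap_lines(text):
    """Join hard-wrapped lines until one ends with terminal punctuation."""
    paragraphs = []
    pending = []  # lines collected for the paragraph currently being built
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines are dropped in this sketch
        pending.append(line)
        if PARAGRAPH_END.search(line):
            paragraphs.append(" ".join(pending))
            pending = []
    if pending:  # trailing text that never reached closing punctuation
        paragraphs.append(" ".join(pending))
    return "\n".join(paragraphs)
```

Like the Vim version, this will wrongly merge headings and other short lines that lack punctuation into the following paragraph, so the result still needs a quick look.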

" Vim script. Can be easily adapted for sed
" This script can be written in a much more condensed and clever
" way, but this way it is much more understandable
:%substitute/[.]$/\0HereIsEndOfParagraph/
:%substitute/[?]$/\0HereIsEndOfParagraph/
:%substitute/[!]$/\0HereIsEndOfParagraph/
:%substitute/[.]["]$/\0HereIsEndOfParagraph/
:%substitute/[?]["]$/\0HereIsEndOfParagraph/
:%substitute/[!]["]$/\0HereIsEndOfParagraph/

:global!/HereIsEndOfParagraph$/join

:%substitute/HereIsEndOfParagraph$//

"end of quick-and-dirty Vim script

ldolse · 04-28-2009, 06:43 AM

That logic is basically what I'm referring to, it's just with a single regex replacement in Python. That is then combined with a median line length calculation so that only lines approaching the document median have the regex applied (an extra safety to prevent short lines without punctuation from being wrapped). This is what we're doing already for PDF post processing.

There will always be docs that this won't work for, but I think that we can handle the majority of the cases where a user needs to hand edit to fix this sort of thing.
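
For readers who want to see the shape of the median-length safety check described above, here is a rough sketch. It is not the code from preprocess.py: the function name `unwrap_near_median`, the `factor` of 0.8 and the punctuation rule are all assumptions chosen for illustration.

```python
import re
from statistics import median

# Assumption: a line that ends with . ? or ! (optionally followed by a
# double quote) is a real paragraph end and is never joined.
SENTENCE_END = re.compile(r'[.?!]"?\s*$')


def unwrap_near_median(text, factor=0.8):
    """Join a line to the next one only if it is close to the median length.

    `factor` is a hypothetical tuning knob: lines shorter than
    median * factor are left alone, which protects headings and other
    short lines that carry no punctuation.
    """
    lines = text.split("\n")
    lengths = [len(line) for line in lines if line.strip()]
    if not lengths:
        return text
    threshold = median(lengths) * factor

    pieces = []
    for index, line in enumerate(lines):
        pieces.append(line)
        if index == len(lines) - 1:
            break  # nothing follows the last line
        long_enough = len(line) >= threshold
        ends_sentence = SENTENCE_END.search(line) is not None
        # Replace the line break with a space only for long, unfinished lines.
        pieces.append(" " if long_enough and not ends_sentence else "\n")
    return "".join(pieces)
```

The threshold is computed per document, so the same function can cope with sources wrapped at different widths. The right value for `factor` would have to be found by testing on real files.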

kacir · 04-28-2009, 06:48 AM

Quote:

Originally Posted by ldolse

it's just with a single regex replacement in Python.

This can also be implemented as a one line command in Vim, but then I would have to write 700 words explaining what is going on ;-)

tompe · 04-28-2009, 06:50 AM

Quote:

Originally Posted by kacir

This can also be implemented as a one line command in Vim, but then I would have to write 700 words explaining what is going on ;-)

But one line is not one regexp. And I do not think you can implement it with just one regexp since you have states in the process.

kovidgoyal · 04-28-2009, 07:40 AM

Hmm well I have no objection to adding it as a default off option with possibly different regexes per input format. Do me a favor and open a ticket for it, I'll add the framework for it to the new calibre conversion pipeline and you can then fine tune the regexes to your heart's content.

kacir · 04-28-2009, 08:15 AM

Quote:

Originally Posted by tompe

But one line is not one regexp. And I do not think you can implement it with just one regexp since you have states in the process.

HA!

States are just to demonstrate the concept.

" The first one-line example
:global!/\([.]"\?\|[?]"\?\|[!]"\?\)$/.;/\([.]"\?\|[?]"\?\|[!]"\?\)$/join

" the same one, yet shorter
:v/\([.]"\?\|[?]"\?\|[!]"\?\)$/.;/\([.]"\?\|[?]"\?\|[!]"\?\)$/j

" and even more condensed one!
:v/\."\?$\|?"\?$\|!"\?$/.;/\."\?$\|?"\?$\|!"\?$/j

" previous versions work, but you get an error [that you can safely ignore]
" at the last line of text, so here we go with even more interesting
" looking one
:1;$-1v/\."\?$\|?"\?$\|!"\?$/.;/\."\?$\|?"\?$\|!"\?$/j

I have tested those examples on a sample text.
The final version would require testing and tweaking on much wider range of texts.

Now I go to try to do this in "pure" RegExp using that lovely
"negative lookbehind of non-arbitrary length" that even Perl doesn't have
[EVIL LAUGHTER]

kacir · 04-28-2009, 08:39 AM

Quote:

Originally Posted by tompe

But one line is not one regexp. And I do not think you can implement it with just one regexp since you have states in the process.

:%s/\(\."\?\|?"\?\|!"\?\)\@<!\n/ /
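
A possible Python counterpart to this single substitution, for comparison. Python's `re` module only accepts fixed-width lookbehind, so the sketch chains two negative lookbehinds (one for the bare punctuation, one for punctuation plus a double quote) instead of a single variable-width one. The names `UNWRAP` and `unwrap` are made up for the example.

```python
import re

# A newline is treated as a hard wrap unless it is preceded by . ? or !,
# or by one of those followed by a double quote. Two fixed-width
# lookbehinds stand in for Vim's single variable-width \@<! here.
UNWRAP = re.compile(r'(?<![.?!])(?<![.?!]")\n')


def unwrap(text):
    """Replace every hard-wrap newline with a single space."""
    return UNWRAP.sub(" ", text)
```

For example, `unwrap('one\ntwo.\nthree?"\nfour\nfive!')` gives `'one two.\nthree?"\nfour five!'`.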

tompe · 04-28-2009, 08:51 AM

Quote:

Originally Posted by kacir

:%s/\(\."\?\|?"\?\|!"\?\)\@<!\n/ /

Ah, I did not read it carefully enough. I thought you had done something clever for strange cases by doing the substitution in more than one step.

kacir · 04-28-2009, 09:25 AM

Quote:

Originally Posted by tompe

Ah, I did not read it carefully enough. I thought you had done something clever for strange cases by doing the substitution in more than one step.

I just left room for handling strange cases. Because there WILL be exceptions ;-)
Also such tool always has to be tweaked and my first example is much more pleasant to work with. I tend to refine my scripts with every use, so at the end I just need to run the script and briefly check for results. Sometimes you open your own script and you just wonder "What the $#$%! is this supposed to do?"

Besides, with the replacement RegEx \= you can do lots of very interesting stuff, because \= is followed by an expression in the quite powerful Vim scripting language. So in Vim you can work with variables, global variables, functions and other goodies even from inside the RegExp.

ldolse · 04-28-2009, 11:49 AM

Ticket is opened:
http://calibre.kovidgoyal.net/ticket/2359

Thanks Kacir for putting some starting points together, once the framework is in place I'll try these out.

ldolse · 05-09-2009, 10:10 AM

I see the ticket has been fixed, and I saw the changes you made in the core files. That said, I'm not quite sure how to leverage this in an input plugin.

First question is which file is actually considered the input plugin? Is it /calibre/ebooks/<format>/input.py? I see most folders there have an input.py, but not all.

And then finally what exactly do I need to define in an input plugin, do I need to define a function called HTMLPreProcessor in each plugin that acts similarly to pdftohtml?

kovidgoyal · 05-09-2009, 01:07 PM

An input plugin is defined as a subclass of InputFormatPlugin; by convention these subclasses are usually in files called input.py. To add support for preprocessing for a particular input format, just reimplement the following method in the input plugin for that format:

Code:

    def preprocess_html(self, html):
        '''
        This method is called by the conversion pipeline on all HTML before it
        is parsed. It is meant to be used to do any required preprocessing on
        the HTML, like removing hard line breaks, etc.

        :param html: A unicode string
        :return: A unicode string
        '''
        return html
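
As an illustration of how such an override might look, here is a minimal sketch. So that it runs on its own, `InputFormatPlugin` below is only a stand-in with the one method shown above, not calibre's real base class. `TXTInputSketch` and its `HARD_BREAK` rule (a `<br>` that does not follow sentence-ending punctuation is a hard wrap) are hypothetical.

```python
import re


class InputFormatPlugin:
    """Stand-in for calibre's base class so this sketch is self-contained."""

    def preprocess_html(self, html):
        return html


class TXTInputSketch(InputFormatPlugin):
    # Hypothetical rule: a <br> that does not directly follow . ? or !
    # (optionally plus a double quote) is treated as a hard line break.
    HARD_BREAK = re.compile(r'(?<![.?!])(?<![.?!]")<br\s*/?>\s*', re.IGNORECASE)

    def preprocess_html(self, html):
        # Replace hard breaks with a space and return the result as a string.
        return self.HARD_BREAK.sub(" ", html)
```

With this rule, `TXTInputSketch().preprocess_html('It was a dark<br>and stormy night.<br>The end')` returns `'It was a dark and stormy night.<br>The end'`. A real plugin would need a pattern tuned to the HTML that the format's converter actually produces.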



