Old 04-28-2009, 02:38 AM   #1
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Unwrapping hard line breaks across all input formats

Hi, for the few weeks I've been on the forums I've come across a lot of threads where users are dealing with hard line breaks in various types of content, and I've seen this in a lot of the content I've downloaded, whether that's text, pdf, or even a variety of ebook formats.

Calibre already handles this in its PDF processing, and of course this is a basic requirement for PDFs, as pdftohtml doesn't handle this function at all.

For the other formats I think this type of processing should be optional, since the majority(?) of content is well formed with regard to wrapping. It seems like it ought to be simple enough to make it an optional step during the conversion pipeline.

I think all the logic is already in preprocess.py (in pluginize), it's just tied to the PDF format. Of course every format would need slightly different regexes, but the basic logic we've worked out for pdf would apply.

My Python skills aren't great, but if someone created the hooks into the other formats so that these types of regexes can optionally be applied during the conversion process, I'd be happy to own the regexps and work the kinks out of each format. Put the hooks in one format and show me how it was done and I may even be able to apply it to others.

Seems like the worst offenders are text, then rtf, followed by LIT, as these seem to be the formats that a lot of OCR work tends to wind up going to.

Some of the threads where people have expressed interest/frustration:
How to deal with irregular hard-wrapping on a large scale?
line formatting formatting question
text reformat
Tool for removing line breaks in text documents
Old 04-28-2009, 03:04 AM   #2
kovidgoyal
creator of calibre
Posts: 43,744
Karma: 22446736
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The new conversion framework makes this very easy to do, but I'm not so sure it is a good idea. The reason it works for PDF is that calibre is post-processing the output from pdftohtml, which is pretty consistent. TXT/LIT/RTF files can have a very wide range of input that would probably require different algorithms to process, which would mean the user would have to select the algorithm at conversion time. I suppose that is doable...
Old 04-28-2009, 04:06 AM   #3
ldolse
From a user perspective I was thinking of presenting just one option - 'fix line-breaks' or 'enable text processing' or something like that. Enable/Disable similar to 'Detect Chapters' today.

That would then invoke a function similar to the current pdftohtml functions in preprocess.py. Then just write one set of regexes per format. Text/RTF would get one regex (I think that would be the same regex in those cases), LIT would get another, as the lit->html output looks different. I'm not sure if it makes as much sense for other more modern formats, as it's the older ones that seem to have the problem, though it does apply if someone has a book that was originally converted from a bad file.

Anyway I wasn't thinking the user would need to worry about writing/specifying replacement patterns. I suppose that would also be ok for the power user, but that would be a lot more GUI work to maintain default regexes in the GUI for each format. I think a checkbox with best effort regexes hard-coded would be a big step over what we've got now.

Default would be disabled of course.
Old 04-28-2009, 05:24 AM   #4
kacir
Wizard
Posts: 3,447
Karma: 10484861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
When I am trying to reformat a plain text book with lots of weird hard breaks I can usually do 99% of the work with a quick and dirty trick (usually implemented in the Vim editor using regular expressions).

find <dot><end of paragraph>
find <exclamation point><end of paragraph>
find <question mark><end of paragraph>
find <dot><quote><end of paragraph>
find <exclamation point><quote><end of paragraph>
find <question mark><quote><end of paragraph>
Replace all those things with <what you found>HereIsEndOfParagraph<end of paragraph>

Now find every line that does not end with HereIsEndOfParagraph<end of paragraph> and join it with the next.
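
The steps above can also be sketched in Python for anyone who does not use Vim. This is only an illustration under the same assumption as the list (a paragraph ends at a dot, question mark or exclamation point, optionally followed by a double quote). The name join_wrapped_lines is made up for this example, and dropping blank lines is a simplification of mine.

```python
import re

# Assumed rule: a line closes a paragraph if it ends in . ? or !,
# optionally followed by a double quote.
PARAGRAPH_END = re.compile(r'[.?!]"?$')


def join_wrapped_lines(text):
    """Join each line with the following ones until a paragraph-ending line."""
    paragraphs = []
    pending = []  # lines collected for the paragraph being built
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # simplification: blank lines are dropped
        pending.append(line)
        if PARAGRAPH_END.search(line):
            paragraphs.append(' '.join(pending))
            pending = []
    if pending:  # trailing text without closing punctuation
        paragraphs.append(' '.join(pending))
    return '\n'.join(paragraphs)


sample = 'He walked\ndown the road.\n"Stop!"\nshe said\nquietly.'
print(join_wrapped_lines(sample))
```

No marker string is needed here because the loop keeps its state in the pending list instead of in the text itself.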

" Vim script. Can be easily adapted for sed
" This script can be written in a much more condensed and clever
" way, but this way it is much more understandable
:%substitute/[.]$/\0HereIsEndOfParagraph/
:%substitute/[?]$/\0HereIsEndOfParagraph/
:%substitute/[!]$/\0HereIsEndOfParagraph/
:%substitute/[.]["]$/\0HereIsEndOfParagraph/
:%substitute/[?]["]$/\0HereIsEndOfParagraph/
:%substitute/[!]["]$/\0HereIsEndOfParagraph/

" join each unmarked line with everything up to the next marked line
:global!/HereIsEndOfParagraph$/.,/HereIsEndOfParagraph$/join

:%substitute/HereIsEndOfParagraph$//

"end of quick-and-dirty Vim script
Old 04-28-2009, 05:43 AM   #5
ldolse
That logic is basically what I'm referring to, it's just with a single regex replacement in Python. That is then combined with a median line length calculation so that only lines approaching the document median have the regex applied (an extra safety to prevent short lines without punctuation from being wrapped). This is what we're doing already for PDF post processing.

There will always be docs that this won't work for, but I think that we can handle the majority of the cases where a user needs to hand edit to fix this sort of thing.
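
To make the median idea concrete, here is a rough Python sketch. It is not the code from preprocess.py. The function name unwrap_near_median, the punctuation rule and the 0.8 factor are all assumptions picked for illustration.

```python
import re
from statistics import median

# Assumed rule: . ? or ! (optionally followed by a double quote) ends a paragraph.
ENDS_PARAGRAPH = re.compile(r'[.?!]"?$')


def unwrap_near_median(text, factor=0.8):
    """Unwrap lines that lack closing punctuation and approach the median length."""
    lines = [line.rstrip() for line in text.splitlines()]
    lengths = [len(line) for line in lines if line]
    if not lengths:
        return text
    # Hypothetical cut-off: lines shorter than this are never unwrapped.
    threshold = median(lengths) * factor

    out = []
    join_next = False  # True when the previous line should absorb this one
    for line in lines:
        if join_next and line:
            out[-1] = out[-1] + ' ' + line.lstrip()
        else:
            out.append(line)
        # Only long, unpunctuated lines are treated as hard-wrapped.
        join_next = (bool(line) and len(line) >= threshold
                     and not ENDS_PARAGRAPH.search(line))
    return '\n'.join(out)


sample = (
    'CHAPTER ONE\n'
    'It was a bright cold day in April and the\n'
    'clocks were striking thirteen as Winston\n'
    'slipped quickly through the glass doors.\n'
    'The end'
)
print(unwrap_near_median(sample))
```

The short heading has no punctuation but is well below the threshold, so it is left on its own line. That is the extra safety the median check is meant to give.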
Old 04-28-2009, 05:48 AM   #6
kacir
Quote:
Originally Posted by ldolse View Post
it's just with a single regex replacement in Python.
This can also be implemented as a one line command in Vim, but then I would have to write 700 words explaining what is going on ;-)
Old 04-28-2009, 05:50 AM   #7
tompe
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköping, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
Quote:
Originally Posted by kacir View Post
This can also be implemented as a one line command in Vim, but then I would have to write 700 words explaining what is going on ;-)
But one line is not one regexp. And I do not think you can implement it with just one regexp since you have states in the process.
Old 04-28-2009, 06:40 AM   #8
kovidgoyal
Hmm, well, I have no objection to adding it as a default-off option, with possibly different regexes per input format. Do me a favor and open a ticket for it. I'll add the framework for it to the new calibre conversion pipeline and you can then fine-tune the regexes to your heart's content.
Old 04-28-2009, 07:15 AM   #9
kacir
Quote:
Originally Posted by tompe View Post
But one line is not one regexp. And I do not think you can implement it with just one regexp since you have states in the process.
HA!

States are just to demonstrate the concept.

" The first one-line example
:global!/\([.]"\?\|[?]"\?\|[!]"\?\)$/.;/\([.]"\?\|[?]"\?\|[!]"\?\)$/join

" the same one, yet shorter
:v/\([.]"\?\|[?]"\?\|[!]"\?\)$/.;/\([.]"\?\|[?]"\?\|[!]"\?\)$/j

" and even more condensed one!
:v/\."\?$\|?"\?$\|!"\?$/.;/\."\?$\|?"\?$\|!"\?$/j

" previous versions work, but you get an error [that you can safely ignore]
" at the last line of text, so here we go with even more interesting
" looking one
:1;$-1v/\."\?$\|?"\?$\|!"\?$/.;/\."\?$\|?"\?$\|!"\?$/j

I have tested those examples on a sample text.
The final version would require testing and tweaking on much wider range of texts.


Now I'll go and try to do this in "pure" RegExp using that lovely
"negative lookbehind of arbitrary length" that even Perl doesn't have
[EVIL LAUGHTER]

Last edited by kacir; 04-28-2009 at 02:55 PM.
Old 04-28-2009, 07:39 AM   #10
kacir
Quote:
Originally Posted by tompe View Post
But one line is not one regexp. And I do not think you can implement it with just one regexp since you have states in the process.
:%s/\(\."\?\|?"\?\|!"\?\)\@<!\n/ /
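
For comparison, a Python version of the same substitution might look like the sketch below. Python's re module only accepts fixed-width lookbehinds, so the optional quote is written as two separate assertions. The names HARD_BREAK and unwrap are made up for the example.

```python
import re

# A newline is a hard break unless it directly follows . ? or !,
# or one of those followed by a double quote. Two fixed-width negative
# lookbehinds stand in for Vim's variable-length \@<! group.
HARD_BREAK = re.compile(r'(?<![.?!])(?<![.?!]")\n')


def unwrap(text):
    """Replace every hard break inside a paragraph with a single space."""
    return HARD_BREAK.sub(' ', text)


print(unwrap('one two\nthree.\n"four\nfive!"\nsix'))
```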

Last edited by kacir; 04-28-2009 at 08:07 AM.
Old 04-28-2009, 07:51 AM   #11
tompe
Quote:
Originally Posted by kacir View Post
:%s/\(\."\?\|?"\?\|!"\?\)\@<!\n//
Ah, I did not read it carefully enough. I thought you had done something clever for strange cases by doing the substitution in more than one step.
Old 04-28-2009, 08:25 AM   #12
kacir
Quote:
Originally Posted by tompe View Post
Ah, I did not read it carefully enough. I thought you had done something clever for strange cases by doing the substitution in more than one step.
I just left room for handling strange cases. Because there WILL be exceptions ;-)
Also, such a tool always has to be tweaked, and my first example is much more pleasant to work with. I tend to refine my scripts with every use, so in the end I just need to run the script and briefly check the results. Sometimes you open your own script and you just wonder "What the $#$%! is this supposed to do?"

Besides, with \= in the replacement part you can do lots of very interesting stuff, because \= is followed by an expression in the quite powerful Vim scripting language. So in Vim you can work with variables, global variables, functions and other goodies even from inside the substitution.

Last edited by kacir; 04-28-2009 at 09:06 AM.
Old 04-28-2009, 10:49 AM   #13
ldolse
Ticket is opened:
http://calibre.kovidgoyal.net/ticket/2359

Thanks kacir for putting some starting points together. Once the framework is in place I'll try these out.
Old 05-09-2009, 09:10 AM   #14
ldolse
I see the ticket has been fixed, and I saw the changes you made in the core files. That said, I'm not quite sure how to leverage this in an input plugin.

First question is which file is actually considered the input plugin? Is it /calibre/ebooks/<format>/input.py? I see most folders there have an input.py, but not all.

And finally, what exactly do I need to define in an input plugin? Do I need to define a function called HTMLPreProcessor in each plugin that acts similarly to pdftohtml?
Old 05-09-2009, 12:07 PM   #15
kovidgoyal
An input plugin is defined as a subclass of InputFormatPlugin. By convention these subclasses are usually in files called input.py. To add support for preprocessing for a particular input format, just reimplement the following method in the input plugin for that format:

Code:
    def preprocess_html(self, html):
        '''
        This method is called by the conversion pipeline on all HTML before it
        is parsed. It is meant to be used to do any required preprocessing on
        the HTML, like removing hard line breaks, etc.

        :param html: A unicode string
        :return: A unicode string
        '''
        return html
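
As a rough idea of what a reimplementation could look like, here is a self-contained sketch. The InputFormatPlugin class below is only a stand-in so the example runs without calibre installed. ExampleTXTInput and its HARD_BREAK pattern are hypothetical and are not taken from any real calibre plugin.

```python
import re


class InputFormatPlugin:
    """Stand-in for calibre's base class (assumption, so the sketch runs alone)."""

    def preprocess_html(self, html):
        return html


class ExampleTXTInput(InputFormatPlugin):
    # Hypothetical rule: a newline is a hard break unless it follows
    # . ? ! (optionally quoted) or a closing tag, or is followed by a tag.
    HARD_BREAK = re.compile(r'(?<![.?!>])(?<![.?!]")\n(?!<)')

    def preprocess_html(self, html):
        # Called on the HTML before it is parsed; returns the cleaned string.
        return self.HARD_BREAK.sub(' ', html)


plugin = ExampleTXTInput()
print(plugin.preprocess_html('<p>It was a bright\ncold day.</p>\n<p>"Stop\nit!"</p>'))
```

A real plugin would inherit from calibre's own InputFormatPlugin and would most likely need a different pattern for each format's HTML output.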