10-03-2011, 07:24 AM   #9
ldolse
Wizard
Quote:
Originally Posted by roffLOL View Post
The advertised goals were met yesterday. I have created a 98% replica of the PDF document that serves as my test case. It removes '-' at the ends of lines and joins the lines. It appends paragraphs that span pages without ending in a full stop, it keeps an indentation ratio for paragraph beginnings equal to that of the original document (if it has one), it retains font information (even for a single word in the middle of a sentence), it retains line and paragraph spacing, it translates all special characters to their HTML equivalents, and for some weird reason the line numbering seems to automagically vanish, even though I can't remember implementing anything to leave it out.
Sounds pretty promising, looking forward to seeing it. One note: you can't remove '-' from line endings wholesale, because some of those hyphens belong to genuinely hyphenated words. This is one area I spent a fair amount of time on, creating a 'dehyphenate' routine which uses the raw document as a dictionary. Feel free to borrow that code/concept from Calibre. The only weakness in the function is that I perform naive 'stemming', which focuses mainly on the English language, to increase the likelihood of a dictionary match.
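
To make the concept concrete, here is a rough sketch of the document-as-dictionary idea. It is not the Calibre code: the function names, the suffix list, and the regular expressions are all made up for illustration, and the real routine handles many more cases. A line-end hyphen is dropped only when the joined word (or its naive stem) appears unbroken somewhere else in the same document.

```python
import re

# Hypothetical suffix list for the naive, English-only "stemming" step.
SUFFIXES = ("ing", "ed", "es", "ly", "s")

# A word fragment, a hyphen at the end of a line, then the next fragment.
LINE_END_HYPHEN = re.compile(r"([A-Za-z]+)-\n\s*([A-Za-z]+)")


def naive_stem(word):
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_vocabulary(text):
    """Collect every unbroken word in the document, plus its naive stem."""
    # Drop the broken words first so the fragments cannot vouch for themselves.
    cleaned = LINE_END_HYPHEN.sub(" ", text)
    words = {w.lower() for w in re.findall(r"[A-Za-z]+", cleaned)}
    return words | {naive_stem(w) for w in words}


def dehyphenate(text):
    """Join words split across lines, keeping hyphens that look genuine."""
    vocab = build_vocabulary(text)

    def join_or_keep(match):
        first, second = match.group(1), match.group(2)
        joined = (first + second).lower()
        if joined in vocab or naive_stem(joined) in vocab:
            return first + second        # line-wrap hyphen: remove it
        return first + "-" + second      # probably a real hyphen: keep it

    return LINE_END_HYPHEN.sub(join_or_keep, text)


sample = (
    "The converter handles documents well.\n"
    "Every docu-\n"
    "ment is checked, including well-\n"
    "known ones.\n"
)
print(dehyphenate(sample))
```

In the sample, 'docu-/ment' is joined because 'documents' occurs elsewhere and stems to 'document', while 'well-/known' keeps its hyphen because 'wellknown' never occurs. The stemming step is exactly where the English-only weakness shows up.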

Now that Calibre lets the user specify the language of a book, I've been thinking about an industrial-strength multi-language stemmer. The most appropriate one from my searches would be Snowball, which has a Python wrapper.

Last edited by ldolse; 10-03-2011 at 07:29 AM.