Quote:
Originally Posted by roffLOL
The advertised goals were met yesterday. I have created a 98% replica of the PDF document that serves as my test case. It removes '-' at the end of lines and joins the lines. It appends non-fullstop paragraphs that span pages, it keeps an indentation ratio for paragraph beginnings equal to that of the original document (if it has one), it retains font information (even for a single word in the middle of a sentence), it retains line and paragraph spacing, it translates all special characters to their HTML equivalents, and for some weird reason, the line numbering seems to automagically vanish, even though I can't remember implementing anything to leave it out.
Sounds pretty promising, looking forward to seeing it. One note - you can't remove '-' from line endings wholesale, because some of those hyphens belong to genuinely hyphenated words. This is one area where I spent a fair amount of time, creating a 'dehyphenate' routine which uses the raw document as a dictionary. Feel free to borrow that code/concept from Calibre. The only weakness in the function is that I perform naive 'stemming', which focuses mainly on the English language, to increase the likelihood of a dictionary match.
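To make the document-as-dictionary idea concrete, here is a minimal sketch. It is not Calibre's actual code: the names `naive_stem` and `dehyphenate`, the suffix list, and the regular expressions are all assumptions made for illustration. A line-end hyphen is dropped only if the joined word (or its crude stem) also appears somewhere else in the same text.

```python
import re

# A "word" here is a run of letters (no digits or underscores).
WORD = r"[^\W\d_]+"
# A word fragment, a hyphen at the end of a line, then the next fragment.
BREAK = re.compile(rf"({WORD})-\s*\n\s*({WORD})")


def naive_stem(word):
    """Crude English-only suffix stripping (illustrative, hypothetical rule set)."""
    for suffix in ("ing", "ed", "es", "ly", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def dehyphenate(text, stem=naive_stem):
    """Join words split by a line-end hyphen only when the document supports it."""
    # Build the dictionary from the text itself, leaving out the broken
    # fragments so they cannot vouch for themselves.
    remainder = BREAK.sub(" ", text)
    vocab = {w.lower() for w in re.findall(WORD, remainder)}
    stems = {stem(w) for w in vocab}

    def join(match):
        head, tail = match.group(1), match.group(2)
        joined = (head + tail).lower()
        if joined in vocab or stem(joined) in stems:
            return head + tail        # seen elsewhere: drop the hyphen
        return head + "-" + tail      # unknown: keep the hyphen, drop the break

    return BREAK.sub(join, text)


sample = "The document was con-\nverted twice. A well-\nknown tool converted it."
print(dehyphenate(sample))
```

Because `stem` is passed in as a parameter, a language-aware stemmer could be swapped in for `naive_stem` without touching the rest of the routine.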
Now that Calibre has the ability to allow the user to specify the language for a book, I've been thinking about an industrial-strength multi-language stemmer. The most appropriate one from my searches would be Snowball, which has a Python wrapper.