Old 09-18-2010, 03:23 AM   #12
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
The preprocess function does essentially the regex you're suggesting, but it first analyzes the document to get the median line length and ties the unwrapping to that value. This prevents things like lists, poetry, and titles from being unwrapped.

The problem is that the original implementation assumed that if a document had hard line breaks, they would be consistent across the whole file. In reality many files have a lot of variability in line length, which pushes the median line length longer than the typical broken line.
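To make the skew concrete, here is a small illustration. The line lengths are invented for the example, not measured from a real book, and this is not the preprocess code itself:

```python
from statistics import median

# Hypothetical line lengths in characters (made up for illustration).
wrapped = [70, 71, 72, 70, 71]               # lines from hard-wrapped paragraphs
unwrapped = [180, 240, 310, 205, 260, 195]   # paragraphs stored as one long line

# A file that is hard-wrapped throughout: the median sits at the wrap width.
uniform_median = median(wrapped)

# A mixed file: the long lines drag the median far past the wrap width,
# so the ~70 character broken lines look "short" and are left alone.
mixed_median = median(wrapped + unwrapped)

print(uniform_median, mixed_median)  # 71 180
```

With the mixed file, a cut-off based on the median alone would treat the genuinely broken lines the same way as list items or titles.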

Preprocessing now lets you specify how aggressive the line unwrapping is, i.e. make the line-length cut-off shorter. The setting is the line unwrapping factor under structure detection. If you get a chance, give it a shot and see if it solves your original unwrapping problem. Note that you may need to set it as low as 0.1 or 0.2 if there is a huge amount of variability in line length.
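A rough sketch of the idea, not the actual preprocess code: it assumes the cut-off is simply the median line length multiplied by the factor, and that a line is joined to the next one only if it is at least that long and does not end in sentence-ending punctuation. The function name `unwrap`, its default factor, and the punctuation list are all hypothetical.

```python
from statistics import median

# Assumed set of characters that mark a real end of line (hypothetical).
TERMINATORS = (".", "!", "?", ":", '"', "'")

def unwrap(text, factor=0.5):
    """Join hard-wrapped lines; `factor` scales the median-based cut-off."""
    lines = [line.strip() for line in text.split("\n")]
    lengths = [len(line) for line in lines if line]
    if not lengths:
        return text
    cutoff = median(lengths) * factor  # a lower factor gives a shorter cut-off
    out = []
    join_next = False
    for line in lines:
        if join_next and line:
            out[-1] = out[-1] + " " + line  # continue the previous line
        else:
            out.append(line)
        # Only lines at or above the cut-off that stop mid-sentence are
        # treated as broken; short lines (lists, titles) are left alone.
        join_next = bool(line) and len(line) >= cutoff and not line.endswith(TERMINATORS)
    return "\n".join(out)

long_para = "word " * 40 + "end."  # stands in for a paragraph that was never wrapped
sample = "\n".join([
    "The rain had stopped by the time we reached the",
    "old harbour wall.",
    "Eggs",
    "Milk",
    long_para, long_para, long_para, long_para,
])

print(unwrap(sample, factor=1.0) == sample)      # True: median too long, nothing is joined
print(unwrap(sample, factor=0.2).split("\n")[0])
```

With the factor at 1.0 the long paragraphs push the cut-off past the broken line, so it stays broken. At 0.2 the broken sentence is joined while the short "Eggs" and "Milk" lines are still left as they are.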

Edit: single-digit chapters are covered now as well.

Last edited by ldolse; 09-18-2010 at 03:25 AM.