MobileRead Forums - View Single Post - Chapter detection when only digits

ldolse · 09-18-2010, 03:23 AM

The preprocess function basically does the regex you're suggesting, but it first analyzes the document to get the median line length and ties the unwrapping function to that value. This prevents things like lists, poetry, titles, etc. from being unwrapped.

The problem is that the original implementation assumed that if a document had hard line breaks, they would be consistent across the whole file. In reality many files have a lot of variability, which pushes the median line length above the length of the typical broken line, so those broken lines don't get unwrapped.
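To make the median issue concrete, here is a rough illustration with made-up line lengths (the numbers and the `median_line_length` name are hypothetical, not taken from calibre). One list stands for a file that is hard-wrapped everywhere, the other for a file where only some paragraphs are wrapped:

```python
from statistics import median

# Hypothetical line lengths, in characters.
uniform = [68, 70, 71, 69, 70, 72, 67]         # every paragraph hard-wrapped near 70
mixed = [68, 70, 71, 280, 310, 350, 390, 420]  # a few wrapped lines, the rest unwrapped


def median_line_length(lengths):
    """Return the median of a list of line lengths."""
    return median(lengths)


print(median_line_length(uniform))  # 70
print(median_line_length(mixed))    # 295.0
```

In the mixed case the median lands far above the roughly 70 character lines that actually need joining, so a cut-off based on the plain median would skip them.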

Preprocessing now lets you specify the aggressiveness of the line unwrapping, i.e. make the line length cut-off shorter. It's the line unwrapping factor setting under structure detection. If you get a chance, give it a shot and see if it solves your original problem with unwrapping. Note that you may need to set it as low as 0.1 or 0.2 if there is a huge amount of variability in line length.
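For anyone curious how a factor like this can shorten the cut-off, below is a minimal sketch. It is not calibre's code: the `unwrap` helper is hypothetical, and scaling the median by the factor is an assumption made for illustration (the real calculation may differ). A line at least as long as the cut-off is treated as hard-wrapped and the next line is joined onto it; shorter lines such as titles or list items keep their break.

```python
from statistics import median


def unwrap(text, factor=1.0):
    """Join hard-wrapped lines. Hypothetical sketch, not calibre's implementation.

    A line counts as wrapped when its length is >= factor * median line length.
    """
    lines = text.split("\n")
    lengths = [len(line) for line in lines if line.strip()]
    if not lengths:
        return text
    cutoff = median(lengths) * factor  # assumption: cut-off is the scaled median

    out = []
    join_next = False
    for line in lines:
        stripped = line.strip()
        if join_next and stripped:
            # The previous line was long enough to look wrapped: merge into it.
            out[-1] = out[-1] + " " + stripped
        else:
            out.append(line)
        # Blank lines and short lines end the paragraph being assembled.
        join_next = bool(stripped) and len(line) >= cutoff
    return "\n".join(out)


# Made-up sample: one paragraph wrapped at 40 characters, four unwrapped ones.
sample = "\n".join([
    "Title", "",
    "x" * 40, "y" * 40, "z" * 10, "",
    "w" * 300, "", "v" * 300, "", "u" * 300, "", "t" * 300,
])

untouched = unwrap(sample, factor=1.0)  # median is 170, so 40-char lines are skipped
joined = unwrap(sample, factor=0.2)     # cut-off drops to 34, so they get merged
```

With this sample the default factor leaves the 40 character lines alone because the long paragraphs drag the median up, while 0.2 joins them and still leaves the short title on its own line.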

edit - single-digit chapters are covered now as well.

Last edited by ldolse; 09-18-2010 at 03:25 AM.