From a user perspective I was thinking of presenting just one option - 'fix line-breaks' or 'enable text processing' or something like that. Enable/Disable similar to 'Detect Chapters' today.
That would then invoke a function similar to the current pdftohtml functions in preprocess.py, with one set of regexes written per format. Text and RTF would share one regex (I think the same one would work for both), and LIT would get another, as the lit->html output looks different. I'm not sure it makes as much sense for the more modern formats, since it's the older ones that seem to have the problem, though it would still apply if someone has a book that was originally converted from a bad file.
Anyway, I wasn't thinking the user would need to worry about writing or specifying replacement patterns. I suppose that would also be OK for the power user, but it would be a lot more GUI work to maintain default regexes in the GUI for each format. I think a checkbox with hard-coded, best-effort regexes would be a big step up from what we've got now.
Default would be disabled of course.
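To make the idea concrete, here is a rough sketch of what I have in mind. The function name, the `enabled` flag, and the patterns themselves are all made up for illustration; they are not the existing preprocess.py code, and the real regexes would need tuning against actual conversion output.

```python
import re

# Hypothetical best-effort pattern for plain text: a line break that follows
# a lowercase letter or comma and precedes a lowercase letter is treated as
# a hard wrap in the middle of a sentence.
_TEXT_UNWRAP = re.compile(r"(?<=[a-z,])[ \t]*\n[ \t]*(?=[a-z])")

# Hypothetical pattern for lit->html output: a paragraph boundary that falls
# mid-sentence (same lowercase heuristic as above).
_LIT_UNWRAP = re.compile(r"(?<=[a-z,])\s*</p>\s*<p[^>]*>\s*(?=[a-z])")

# One regex per input format; text and RTF share the same one.
UNWRAP_PATTERNS = {
    "txt": _TEXT_UNWRAP,
    "rtf": _TEXT_UNWRAP,
    "lit": _LIT_UNWRAP,
}


def fix_line_breaks(text, fmt, enabled=False):
    """Join lines that were broken mid-sentence, if the option is enabled.

    Disabled by default, and a no-op for formats without a pattern.
    """
    if not enabled:
        return text
    pattern = UNWRAP_PATTERNS.get(fmt)
    if pattern is None:
        return text
    return pattern.sub(" ", text)
```

With the option ticked, `fix_line_breaks("the quick\nbrown fox.\nNext line", "txt", enabled=True)` joins the first two lines but leaves the break after the full stop alone. With it unticked, the text passes through untouched.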