MobileRead Forums - View Single Post - Regexes to improve pdf to epub conversion

user_none · 04-10-2009, 06:57 AM

In pluginize, the processing rules for pdftohtml have moved to /src/calibre/ebooks/conversion/preprocess.py. Right now you can find the changes in my driver-dev branch because Kovid hasn't merged them into the main branch yet. If you have any questions about the branches or what's going on with development, just ask, on the board or off.

One thing to realize about the regex rules: they are not applied to the raw output of pdftohtml. They build on one another, so don't forget to take into account the rules that run earlier and how they change the markup.
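A rough sketch of what "building on one another" means in practice. The two rules below are made up for illustration and are not the actual rules in preprocess.py; the point is only that rule 2 is written against the output of rule 1, not against what pdftohtml emitted.

```python
import re

# Hypothetical rules for illustration only; NOT calibre's real rule list.
RULES = [
    # 1. Turn <br> line breaks into plain newlines.
    (re.compile(r'<br\s*/?>', re.IGNORECASE), '\n'),
    # 2. Join a line that was broken mid-sentence. This rule looks for "\n",
    #    so it only works because rule 1 has already replaced every <br>.
    (re.compile(r'(?<=[a-z,])\n(?=[a-z])'), ' '),
]

def preprocess(html):
    """Apply each rule, in order, to the result of the previous rule."""
    for pattern, replacement in RULES:
        html = pattern.sub(replacement, html)
    return html

raw = 'The quick brown<br>fox jumps.<br>Next sentence.'
result = preprocess(raw)

# Applied directly to the raw text, rule 2 finds nothing to do.
rule2_alone = RULES[1][0].sub(' ', raw)
```

If you reorder the list or test a rule in isolation against raw pdftohtml output, it can silently stop matching, which is why the order matters when you write a new rule.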

Quote:

Originally Posted by ldolse

If pluginize refers to some sort of plugin architecture that users can enable disable functions...

That is the general idea. Not everything is being moved into a plugin, though. The processing rules for pdftohtml won't be a plugin, but PDF input itself is one.

Quote:

Originally Posted by ldolse

I've found the second regex will wrap things like page headers and footers (since they lack punctuation)

Indeed this happens. I'm happy to merge any fixes for this that you come up with.
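For anyone following along, here is a small reproduction of the kind of problem being described. The unwrap pattern is a simplified stand-in I wrote for this example, not the actual second regex from the thread, and the sample page text is invented.

```python
import re

# Hypothetical unwrap rule (an assumption, not the rule from the thread):
# join two lines whenever the first one does not end in sentence punctuation.
unwrap = re.compile(r'(?<=[^.!?"\n])\n(?=\S)')

# Invented sample: body text, then a running header and a page number,
# then body text from the next page.
page = 'it was a dark and\nstormy night.\nChapter One\n12\nThe rain fell'

unwrapped = unwrap.sub(' ', page)
# The header "Chapter One" and the page number "12" have no closing
# punctuation, so they get merged into the following body text.
```

Any fix has to give the rule some other way to recognize headers and footers, since punctuation alone cannot tell them apart from a line broken mid-sentence.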

Quote:

Originally Posted by ldolse

I don't want to duplicate any effort when it comes to submitting changes.

Don't worry about it. The better solution wins, especially in this case, where the regexes can always be improved to take more cases into account. All I did was spend a few minutes fiddling with your rules to get them working with the other processing rules. I also simplified them to use a lookbehind and a lookahead instead of match groups, because I find those easier to work with. At the very least, my changes will help you understand how the regexes work in the preprocessor.
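To show the difference between the two styles, here is a minimal comparison. The patterns are simplified examples I made up for this post, not the rules from the branch.

```python
import re

text = 'a line broken\nin the middle. A new\nparagraph here.'

# Match-group style: the characters on either side of the newline are part
# of the match, so they must be captured and written back by hand.
with_groups = re.compile(r'([a-z])\n([a-z])')
joined_groups = with_groups.sub(r'\1 \2', text)

# Lookaround style: the neighbours are only checked, never consumed, so the
# replacement is just the text that takes the place of the newline.
with_lookaround = re.compile(r'(?<=[a-z])\n(?=[a-z])')
joined_lookaround = with_lookaround.sub(' ', text)

# The two styles can differ when matches sit next to each other: the group
# version consumes "b" in the first match, so the second newline is skipped.
short = 'a\nb\nc'
short_groups = with_groups.sub(r'\1 \2', short)
short_lookaround = with_lookaround.sub(' ', short)
```

On the first sample both versions give the same result. On the short one the group version leaves the second newline in place while the lookaround version joins all three lines, which is one more reason the lookaround form tends to be easier to reason about when rules are chained.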