hmm... never coded in Python before.
I found the correct function I believe:
Code:
# Fix pdftohtml markup
PDFTOHTML = [
# Remove <hr> tags
(re.compile(r'<hr.*?>', re.IGNORECASE), lambda match: '<br />'),
# Remove page numbers
(re.compile(r'\d+<br>', re.IGNORECASE), lambda match: ''),
# Remove <br> and replace <br><br> with <p>
(re.compile(r'<br.*?>\s*<br.*?>', re.IGNORECASE), lambda match: '<p>'),
(re.compile(r'(.*)<br.*?>', re.IGNORECASE),
lambda match: match.group() if re.match('<', match.group(1).lstrip()) or len(match.group(1)) < 40
else match.group(1)),
# Remove hyphenation
(re.compile(r'-\n\r?'), lambda match: ''),
# Remove gray background
(re.compile(r'<BODY[^<>]+>'), lambda match : '<BODY>'),
# Remove non breaking spaces
(re.compile(ur'\u00a0'), lambda match : ' '),
]
Looks straightforward for replacing basic expressions, however I can't figure out how to use backreferences with lambda match. It looks to me like this snippet from above is close to what I want:
Code:
(re.compile(r'(.*)<br.*?>', re.IGNORECASE),
lambda match: match.group() if re.match('<', match.group(1).lstrip()) or len(match.group(1)) < 40
else match.group(1)),
Looks to me like group(1) is the back reference, and the function is checking to see if the length of the match is less than 40 characters before replacing the text. But in my case I've got 2-4 backreferences I need to concatenate, probably using the join() function somehow. I searched around a bit on the net, but I didn't see any good examples, at least not ones that are remotely similar to how you've structured the functions above.
Any advice to point me in the right direction?