Quote:
Originally Posted by Manichean
Doesn't the DOTALL only work if there's a dot as a quantifier? The only dot in that regex is a literal one.
(Edit: For some funny reason, in the quoted regex above, my browser displays vertical bars instead of backslashes in some cases. I assure you that there are no vertical bars there.)
DOTALL definitely affects '.' when it is used as a wildcard; an escaped '\.' is a literal and is not affected. What I am not sure of is whether it also affects character classes. The documentation implies that it does not, but I haven't tried it.
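For what it's worth, here is a small check of that question. It assumes the engine behaves like Python's re module, which may or may not be what your tool uses; the sample strings are made up for illustration.

```python
import re

# Made-up sample: two letters separated by a newline.
text = "a\nb"

# Without DOTALL, the wildcard '.' does not match a newline.
no_flag = re.search(r"a.b", text)

# With DOTALL, '.' matches any character, including a newline.
with_flag = re.search(r"a.b", text, re.DOTALL)

# A negated character class matches a newline with or without the flag,
# so DOTALL makes no difference to it.
neg_class_plain = re.search(r"a[^x]b", text)
neg_class_dotall = re.search(r"a[^x]b", text, re.DOTALL)

# An escaped dot is a literal '.', so DOTALL does not turn it into a wildcard.
literal_dot = re.search(r"2\.", "2\n", re.DOTALL)

print(no_flag, with_flag, neg_class_plain, neg_class_dotall, literal_dot)
```

In that engine, at least, the flag changes only the unescaped '.'; character classes and literal dots behave the same either way.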
@janvanmaar: I didn't read your first post carefully. My understanding is that xpath produces strings for a given html tag. To know what the regexp will do, you must know the tag structure around the text. My guess is that you will find something like
Code:
<p>2.</p><p>Name of the chapter</p>
This input would account for '^[0-9]+\.$' working, because '2.' is the entire content of a single <p> tag. The rest will require multi-tag matching, which means considerably more complex regexps. For example, you probably won't be able to use anchors, since the text you want no longer starts and ends the string being matched.
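A rough sketch of the difference, again assuming Python's re module and assuming the markup really looks like my guess above. The multi-tag pattern is only an illustration, not something I have tested against your book.

```python
import re

# Guessed markup from above; the real file may differ.
html = "<p>2.</p><p>Name of the chapter</p>"

# The anchored pattern works when it sees only the text of the first <p>.
anchored = r"^[0-9]+\.$"
on_tag_text = re.search(anchored, "2.")

# Against the whole markup it fails, because the string no longer
# starts with a digit or ends with the dot.
on_markup = re.search(anchored, html)

# A multi-tag pattern has to spell out the tags and drop the anchors.
multi_tag = re.compile(r"<p>([0-9]+)\.</p>\s*<p>([^<]+)</p>")
m = multi_tag.search(html)

print(on_tag_text is not None, on_markup is None)
print(m.group(1), m.group(2))
```

The two groups give the chapter number and the chapter name separately, which is roughly what a TOC entry needs.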
When faced with this problem, I have done one of three things:
1. convert the PDF to epub, use an editor to enclose the chapter indicators in <h1>...</h1> tags, and convert again.
2. similar, but go through .txt so I can clean up other stuff such as paragraph endings.
3. live without a toc.
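I do option 1 by hand in an editor, but if the markup is regular enough the tagging step could be scripted. A minimal sketch, assuming Python's re module and the guessed <p>number.</p><p>title</p> layout; the function name and the sample text are hypothetical.

```python
import re

# Hypothetical pattern: a numbered paragraph followed by a title paragraph.
CHAPTER = re.compile(r"<p>([0-9]+)\.</p>\s*<p>([^<]+)</p>")

def mark_chapters(html):
    """Merge each 'number' + 'title' paragraph pair into one <h1> heading."""
    return CHAPTER.sub(r"<h1>\1. \2</h1>", html)

# Made-up sample input for illustration only.
sample = "<p>2.</p><p>Name of the chapter</p><p>Body text.</p>"
marked = mark_chapters(sample)
print(marked)
```

Ordinary paragraphs are left alone because they do not match the number-then-title pair. After that, the converter can be told to build the TOC from the <h1> tags.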
EDIT: Everything in this post has been covered above. I should just take a nap.