MobileRead Forums - View Single Post

chaley · 11-02-2010, 11:05 AM

Quote:

Originally Posted by Manichean

Doesn't the DOTALL only work if there's a dot as a quantifier? The only dot in that regex is a literal one.
(Edit: For some funny reason, in the quoted regex above, my browser displays vertical bars instead of backslashes in some cases. I assure you that there are no vertical bars there.)

DOTALL definitely affects '.'. What I am not sure of is whether or not it affects the character classes. The documentation implies that it does not, but I haven't tried it.
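A quick way to settle the character-class question is to try it. The sketch below assumes the regexps in question are handled by Python's re module (an assumption, not something stated in this thread); the sample text is made up. It compares '.' and a negated character class with and without the flag.

```python
import re

# Made-up sample: a chapter number and a title separated by a newline.
text = "2.\nName of the chapter"

# Without DOTALL, '.' stops at the newline.
dot_plain = re.match(r"2\..*", text).group(0)
# With DOTALL, '.' also matches the newline, so the match runs to the end.
dot_all = re.match(r"2\..*", text, re.DOTALL).group(0)

# A negated character class matches a newline with or without the flag.
cls_plain = re.match(r"[^x]*", text).group(0)
cls_all = re.match(r"[^x]*", text, re.DOTALL).group(0)

print(repr(dot_plain))           # '2.'
print(dot_all == text)           # True
print(cls_plain == cls_all)      # True
```

In Python at least, the flag changes only the behavior of '.'; the character class gives the same result either way.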

@janvanmaar: I didn't read your first post carefully. My understanding is that xpath produces strings for a given html tag. To know what the regexp will do, you must know the tag structure around the text. My guess is that you will find something like

Code:

<p>2.</p><p>Name of the chapter</p>

This input would account for '^[0-9]+\.$' working, because '2.' is the entire content of an inner <p> tag. The rest will require multi-tag matching, which means considerably more complex regexps. For example, you probably won't be able to use anchors.
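To illustrate why the anchored pattern works on one tag but not across two, here is a minimal sketch in Python. It assumes the guessed tag structure above is right; the wrapping <div> is hypothetical and only there so the fragment parses, and the per-tag loop is a stand-in for whatever the conversion tool actually does.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment based on the guess above; <div> is added for parsing.
fragment = "<div><p>2.</p><p>Name of the chapter</p></div>"
root = ET.fromstring(fragment)

pattern = re.compile(r"^[0-9]+\.$")

# Test each <p> on its own text, the way a per-tag match would see it.
per_tag = [(p.text, bool(pattern.match(p.text))) for p in root.iter("p")]

# The same anchored pattern fails once the text of both tags is in play,
# because '$' can no longer sit right after the dot.
joined = "".join(p.text for p in root.iter("p"))
joined_match = bool(pattern.match(joined))

print(per_tag)       # [('2.', True), ('Name of the chapter', False)]
print(joined_match)  # False
```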

When faced with this problem, I have done one of three things:
1. convert the PDF to epub, use an editor to enclose the chapter indicators in <h1>...</h1> tags, and convert again.
2. similar, but go through .txt so I can clean up other stuff such as paragraph endings.
3. live without a toc.
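If hand-editing gets tedious, the <h1> step in option 1 could be scripted. This is only a rough sketch in Python, assuming the intermediate HTML really has the number and the title in adjacent <p> tags as guessed above; the sample markup and the pattern are made up for illustration, and real converter output will likely need a looser pattern.

```python
import re

# Made-up sample of intermediate HTML with two chapter indicators.
html = (
    "<p>1.</p><p>First chapter</p><p>Some text.</p>"
    "<p>2.</p><p>Name of the chapter</p>"
)

# A number-only paragraph followed directly by a title paragraph.
chapter = re.compile(r"<p>([0-9]+\.)</p>\s*<p>([^<]+)</p>")

# Merge each pair into a single heading so it can be detected as a chapter.
marked = chapter.sub(r"<h1>\1 \2</h1>", html)

print(marked)
# <h1>1. First chapter</h1><p>Some text.</p><h1>2. Name of the chapter</h1>
```

A numbered list item in the body text would also match this pattern, so the result still needs a look in an editor before converting again.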

EDIT: Everything in this post has been covered above. I should just take a nap.