11-02-2010, 11:05 AM   #11
chaley
Quote:
Originally Posted by Manichean
Doesn't the DOTALL only work if there's a dot as a quantifier? The only dot in that regex is a literal one.
(Edit: For some funny reason, in the quoted regex above, my browser displays vertical bars instead of backslashes in some cases. I assure you that there are no vertical bars there.)
DOTALL definitely affects '.'. What I am not sure of is whether or not it affects the character classes. The documentation implies that it does not, but I haven't tried it.
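A quick way to check this is a small sketch in plain Python `re`. I am assuming the conversion code follows the same semantics, and the sample text is made up. In Python itself, DOTALL changes only what '.' matches; a negated character class matches a newline with or without the flag.
Code:
```python
import re

# Made-up sample: a chapter number and a title separated by a newline.
text = "2.\nName of the chapter"

# Without DOTALL, '.' stops at the newline.
dot_default = re.match(r".*", text).group(0)

# With DOTALL, '.' also matches the newline.
dot_dotall = re.match(r".*", text, re.DOTALL).group(0)

# A negated character class matches the newline either way.
cls_default = re.match(r"[^<]*", text).group(0)
cls_dotall = re.match(r"[^<]*", text, re.DOTALL).group(0)

print(repr(dot_default))
print(repr(dot_dotall))
print(repr(cls_default), cls_default == cls_dotall)
```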

@janvanmaar: I didn't read your first post carefully. My understanding is that the xpath produces a string for each matching html tag, so to know what the regexp will do, you must know the tag structure around the text. My guess is that you will find something like
Code:
<p>2.</p><p>Name of the chapter</p>
This input would account for '^[0-9]+\.$' working, because it is the content of an inner <p> tag. The rest will require multi-tag matching, which means regexps of a rather higher order. For example, you probably won't be able to use anchors.
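To illustrate the difference, here is a rough sketch in plain Python. The `re.findall` call is only a crude stand-in for whatever the xpath actually hands to the regexp (an assumption on my part), and the multi-tag pattern is one guess at what would be needed, not a tested recipe for the converter.
Code:
```python
import re

html = "<p>2.</p><p>Name of the chapter</p>"

# Stand-in for the xpath step: the text content of each <p> tag,
# one string per tag.
tag_texts = re.findall(r"<p>(.*?)</p>", html)

# The anchored pattern works when it sees one tag's content at a time.
chapter_number = re.compile(r"^[0-9]+\.$")
per_tag = [bool(chapter_number.search(t)) for t in tag_texts]

# Against the whole markup, the anchors no longer line up with the number.
whole = chapter_number.search(html)

# A multi-tag pattern without anchors can pick up number and title together.
multi = re.search(r"<p>([0-9]+)\.</p>\s*<p>([^<]+)</p>", html)

print(tag_texts, per_tag, whole, multi.groups())
```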

When faced with this problem, I have done one of three things:
1. convert the PDF to epub, use an editor to enclose the chapter indicators in <h1>...</h1> tags, and convert again.
2. similar, but go through .txt so I can clean up other stuff such as paragraph endings.
3. live without a toc.
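For option 1, the hand edit can sometimes be scripted. This is a minimal sketch that assumes the intermediate file really does contain markup like my guess above. `mark_chapters` is a made-up helper, not part of calibre, and real files will likely need a looser pattern.
Code:
```python
import re

def mark_chapters(html):
    """Merge '<p>N.</p><p>Title</p>' pairs into a single <h1> heading."""
    pattern = re.compile(r"<p>([0-9]+)\.</p>\s*<p>([^<]+)</p>")
    return pattern.sub(r"<h1>\1. \2</h1>", html)

# Made-up sample input; ordinary paragraphs are left alone.
sample = "<p>2.</p><p>Name of the chapter</p><p>Body text.</p>"
result = mark_chapters(sample)
print(result)
```
Converting again after this step should then let the <h1> tags drive chapter detection.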

EDIT: Everything in this post has been covered above. I should just take a nap.