MobileRead Forums - View Single Post - Detect chapters without using tag or class.

tonyx3 · 09-14-2010, 01:41 AM

Disclaimer: I am in no way a regex or XPATH expert, so I could be completely wrong. Please don't completely flame me out.

Ok, so this got kinda long.
SUMMARY:

Is there a way to match chapters using only a regex?
If not, can we please have one?
What about testing chapter detection before conversion?

There are lots of threads on the forum with people asking for help getting their chapters to detect properly.

For me, at least, a lot of the trouble I have comes from trying to use calibre to convert poorly formatted books from one format into properly formatted books in another format.

For example, in an ideal situation, the chapter titles in the source format will all be on their own line, with proper opening and closing tags (preferably h1), and will include the word 'chapter.' If that happens, it's easy to get calibre to detect them and generate a proper Table of Contents for my target format.

In reality, though, I often find myself trying to convert books in which the chapter titles don't have 'chapter' in them, and often don't have any special or unique format tags. Basically, there are a lot of poorly formatted books out there. I won't even mention trying to convert from PDF.

Here's an actual recent example from a book I converted:

Source format: LRF
The chapter titles simply consist of the chapter number, spelled out in all caps, e.g. ONE, TWO, THREE, etc. (this isn't a huge problem, but it's annoying).

The tag isn't a header tag, it's <span>, and it isn't closed out immediately after the chapter title, like this:

Code:

preceding text from Introduction<br/><span class="ts1">ONE<span class="ts2"><br/>Beginning text of chapter 1.

Now, that would be annoying, but workable if I could just use a regex to match the <span class="ts1"> just before the chapter title and the <span class="ts2"> just after it. The chapter titles are the only thing formatted that way.
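To show what I mean, here's a rough Python sketch of the kind of regex match I'd like to be able to use. The pattern and the names CHAPTER_RE and find_chapter_titles are just made up by me for illustration; this is not how calibre does anything internally, and the pattern assumes the titles are all caps like in my book.

```python
import re

# Hypothetical pattern: an all-caps title sitting between the ts1 span that
# opens just before it and the ts2 span that opens just after it.
CHAPTER_RE = re.compile(r'<span class="ts1">([A-Z][A-Z -]*)<span class="ts2">')

def find_chapter_titles(html):
    """Return the chapter titles found in the markup, in document order."""
    return CHAPTER_RE.findall(html)

sample = ('preceding text from Introduction<br/>'
          '<span class="ts1">ONE<span class="ts2"><br/>'
          'Beginning text of chapter 1.')
print(find_chapter_titles(sample))
```

On the snippet from my book this picks out just the title text, without caring whether the spans ever get closed.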

But, unless I'm mistaken, I can't do that, because I need to use XPATH to match tags and classes for the chapters.
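For comparison, here's my understanding of what matching by tag and class looks like, sketched with Python's built-in ElementTree and its limited XPath support. Big assumption: I've tidied the markup by hand so that every span is closed and everything sits inside a <body> wrapper, which my real book does not do. The function name chapter_titles is made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: hand-tidied version of the snippet, with both spans closed
# and a wrapping <body> so it parses as well-formed XML.
doc = ET.fromstring(
    '<body>preceding text from Introduction<br/>'
    '<span class="ts1">ONE<span class="ts2"><br/>'
    'Beginning text of chapter 1.</span></span></body>'
)

def chapter_titles(root):
    """Find every span with class ts1 and return the text right after its opening tag."""
    spans = root.findall(".//span[@class='ts1']")
    # span.text holds only the text before the first child element,
    # so the nested ts2 span and the chapter body are left out.
    return [(span.text or '').strip() for span in spans]

print(chapter_titles(doc))
```

So the tag and class can be matched, but only once the markup is clean enough to parse, which is exactly what I can't count on.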

I don't know the best solution for that.

I guess I'm advocating for the addition of an option to match chapters using only a regex (in the same way that the header removal works). It's easy to write a regex to match most chapters. Much easier, for me at least, than trying to use the current chapter detection options.

Also an option to test the chapter detection before conversion would be great.

Often I try to use the test feature of the header removal to write my expression for chapters, but then it doesn't work when I use it, and I don't know that until the conversion is done.
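The kind of pre-conversion check I'm picturing could be as simple as the Python sketch below. It is only an illustration: preview_chapters, the sample text and the pattern are all made up, and it works on plain text rather than on whatever calibre actually feeds to its chapter detection.

```python
import re

def preview_chapters(text, pattern):
    """Return (line_number, matched_text) for every match of pattern in text."""
    regex = re.compile(pattern)
    hits = []
    for match in regex.finditer(text):
        # Count the newlines before the match to get a 1-based line number.
        line_no = text.count('\n', 0, match.start()) + 1
        hits.append((line_no, match.group(0)))
    return hits

# Made-up sample standing in for the book text.
book = "Intro text\nONE\nFirst chapter text\nTWO\nSecond chapter text\n"

for line_no, title in preview_chapters(book, r'(?m)^(?:ONE|TWO|THREE)$'):
    print(line_no, title)
```

If the list comes back empty or has the wrong lines in it, I'd know the expression is bad before sitting through a whole conversion.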

I'm sure there is some way to write an amazing XPath expression for the example I gave, but it's not as simple as matching with a regex (using the header removal test feature, this is easy).
