xpath for chapter detection

romnempire · 07-24-2010, 04:49 PM

I have a TXT to EPUB conversion that I'm trying to work through.

The TXT uses a linebreak and tab to signify a new paragraph, and two linebreaks before and after each chapter name.

To detect chapters when there is no obvious "chapter" in the chapter titles, I copied the list of chapter titles from the TOC and used find/replace and the XPath expression wizard to create this XPath expression:

//*[re:test(., "^SOMEONE LIKE YOU$|^Taste$|^Lamb to the Slaughter$|^Man from the South$|^The Soldier$|^My Lady Love, My Dove$|^Dip in the Pool$|^Galloping Foxley$|^Skin$|^Poison$|^The Wish$|^Neck$|^The Sound Machine$|^Nunc Dimittis$|^The Great Automatic Grammatizator$|^Claud's Dog$|^The Ratcatcher$|^Rummins$|^Mr Hoddy$|^Mr Feasey$|^EIGHT FURTHER TALES OF THE UNEXPECTED$|^The Umbrella Man$|^Mr Botibol$|^Vengeance is Mine Inc.$|^The Butler$|^Ah, Sweet Mystery of Life$|^The Bookseller$|^The Hitchhiker$|^The Surgeon$", "i")]

To make this work, I need the expression to match a paragraph only if the entire 'chapter title' paragraph matches one of the literal strings. However, the expression seems to be matching any paragraph that contains the string.

Can you help me make this work?
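A stricter expression can be generated from the title list rather than typed by hand. The following is a minimal Python sketch, not Calibre's own method. It assumes that re:test accepts Python-style regular expressions and that each title ends up in its own paragraph element after conversion; both assumptions should be checked. The helper names build_title_pattern and build_chapter_xpath and the h:p target are hypothetical. Each title is escaped (the "." in "Vengeance is Mine Inc." would otherwise match any character), and all alternatives sit in one group so that a single ^...$ pair anchors them. normalize-space(.) trims stray whitespace around the paragraph text.

```python
import re


def build_title_pattern(titles):
    """Return one anchored regex that matches any listed title in full."""
    # re.escape neutralises metacharacters such as the "." in "Inc."
    alternatives = "|".join(re.escape(t) for t in titles)
    # One ^...$ pair around a non-capturing group anchors every alternative.
    return "^(?:" + alternatives + ")$"


def build_chapter_xpath(titles, element="h:p"):
    """Wrap the pattern in an XPath test (the target element is an assumption)."""
    pattern = build_title_pattern(titles)
    if '"' in pattern:
        # XPath 1.0 has no escape for the delimiter inside a string literal.
        raise ValueError("titles must not contain a double quote")
    return '//%s[re:test(normalize-space(.), "%s", "i")]' % (element, pattern)


titles = ["SOMEONE LIKE YOU", "Taste", "Vengeance is Mine Inc.", "Skin"]
check = re.compile(build_title_pattern(titles), re.IGNORECASE)

print(bool(check.search("Taste")))             # whole-paragraph match
print(bool(check.search("A Taste of honey")))  # title only contained: rejected
print(build_chapter_xpath(titles))
```

The printed expression is meant to be pasted into the "Detect chapters at" field; testing the pattern in Python first shows whether a partial match such as "A Taste of honey" is rejected.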

jackie_w · 07-24-2010, 11:35 PM

Rather than typing all this into Calibre you may have more joy with the following approach.

Edit your source .TXT file's headings using Markdown markup language, like this:-

Code:

# My Book Title

# My Book's Author

///Table of Contents///

## SOMEONE LIKE YOU

... blah blah blah ...

## Taste

... blah blah blah ...

## Lamb to the Slaughter

... blah blah blah ... etc etc

The single # lines will be treated as <h1> and the ## lines as <h2>.

The ///Table of Contents/// line should be placed where you want the internal TOC to be.

When you've finished editing, convert the TXT to EPUB, making sure you check the [Convert] - [TXT Input] - "Process using markdown" box.

If you would also like a TOC which appears in the TOC left sidebar in the ebook viewer, you should also do this during the conversion:
in [Convert] - [Structure Detection] set "Detect chapters at" to //h:h2
in [Convert] - [Table of Contents] set "Level 1 TOC" to //h:h2

You can read more about markdown here

romnempire · 07-26-2010, 04:29 PM

Useful, but I can't think of a way to add the markup to the text document without doing it manually, meaning I would have to edit every book I ever build a TOC for.

Well, I guess I could write a Python script, but learning XPath, if possible, seems easier.
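The Python script mentioned here can be quite short. A minimal sketch, assuming the chapter titles are already available as a list (for example copied from the TOC) and that each title sits alone on its own line; the function name mark_headings and the sample text are made up for illustration. It prefixes every line that is exactly a known title, ignoring case and surrounding whitespace, with the "## " Markdown heading markup.

```python
def mark_headings(text, titles, prefix="## "):
    """Prefix each line that is exactly a known title with heading markup."""
    # Compare case-insensitively and ignore surrounding whitespace.
    wanted = {t.strip().casefold() for t in titles if t.strip()}
    out = []
    for line in text.split("\n"):
        if line.strip().casefold() in wanted:
            out.append(prefix + line.strip())
        else:
            out.append(line)  # body text is passed through unchanged
    return "\n".join(out)


# Placeholder text in the layout described above: tab-indented paragraphs,
# titles on their own line with blank lines around them.
sample = (
    "Taste\n\n"
    "\tFirst paragraph, which mentions Taste in passing.\n"
    "\tSecond paragraph.\n\n"
    "Skin\n\n"
    "\tAnother paragraph.\n"
)
print(mark_headings(sample, ["Taste", "Skin"]))
```

Because the comparison is against the whole line, a paragraph that merely contains a title is left alone. A paragraph consisting of nothing but a title word would still be marked, so the output is worth a quick review before converting.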

susan_cassidy · 07-26-2010, 05:17 PM

Part of the problem might be that in your expression "//*[re:test(., "^SOMEONE LIKE YOU"..." you have the asterisk, meaning all tags, but you have no tags at all in a .txt file. Also, usually, in 'or' expressions like "(a|b|c)", the caret goes outside the leading parenthesis, and the $ goes outside the closing parenthesis, as in "^(a|b|c)$". Can you use the two linebreaks before and after the chapter name to detect the chapters? Something like "\n\n.*\n\n"? Of course, I imagine you'd have to do that in a program, since you still don't have any tags to match.
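Detecting chapter names purely from the surrounding blank lines could look like the following in a program. This is a sketch under assumptions: "two linebreaks before and after" is read as one empty line on each side, body paragraphs start with a tab (as the first post describes), and the start and end of the file count as blank. The function name find_isolated_lines is hypothetical.

```python
def find_isolated_lines(text):
    """Return (line_number, text) for unindented lines with a blank line on both sides."""
    lines = text.split("\n")
    found = []
    for i, line in enumerate(lines):
        if not line.strip() or line.startswith("\t"):
            continue  # skip blank lines and tab-indented body paragraphs
        # Treat the start and end of the file as if they were blank lines.
        before_blank = i == 0 or not lines[i - 1].strip()
        after_blank = i == len(lines) - 1 or not lines[i + 1].strip()
        if before_blank and after_blank:
            found.append((i + 1, line.strip()))
    return found


# Placeholder text in the described layout.
sample = "Taste\n\n\tPara one.\n\tPara two.\n\nSkin\n\n\tPara three.\n"
for number, title in find_isolated_lines(sample):
    print(number, title)
```

This only lists candidates with their line numbers, so the result can be compared against the TOC before anything is changed.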

romnempire · 07-26-2010, 05:42 PM

I was under the impression that Calibre ran TOC creation after converting to XHTML, and turned everything separated by a carriage return into a new paragraph.

romnempire · 07-26-2010, 05:44 PM

Oh, sorry I wasn't explicit that I was using Calibre.

romnempire · 07-26-2010, 05:58 PM

If the special characters $ or ^ are put outside the quotation marks, e.g.

//*[re:test(., ^"SOMEONE LIKE YOU"$,

Calibre reports that the XPath expression is invalid.

Agama · 07-26-2010, 06:34 PM

Quote:

Originally Posted by jackie_w

Rather than typing all this into Calibre you may have more joy with the following approach.

Edit your source .TXT file's headings using Markdown ...

This is definitely the easiest way to do it. If you have a text editor which can manage multi-line regular expressions then you can do it with one search/replace. Alternatively you can semi-automate it using a free editor such as Notepad++ by using an Extended search to find \n\nChapter_Name and then invoking a Macro to insert the ## markdown characters. It only takes a few seconds per chapter and it's worth the effort.
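The single multi-line search/replace can also be written as one regular-expression substitution. A minimal Python sketch, assuming Unix "\n" line endings, tab-indented body paragraphs, and titles with an empty line on each side; the names HEADING and add_heading_markup are hypothetical. A title on the very first line of the file has no preceding blank line, so this pattern will not catch it.

```python
import re

# A line that follows an empty line, does not start with a tab,
# and is itself followed by an empty line.
HEADING = re.compile(r"(?<=\n\n)(?!\t)([^\n]+)(?=\n\n)")


def add_heading_markup(text):
    """Insert '## ' in front of every isolated, unindented line in one pass."""
    return HEADING.sub(r"## \1", text)


# Placeholder text in the described layout.
sample = "\tIntro.\n\nTaste\n\n\tPara.\n\nSkin\n\n\tEnd.\n"
print(add_heading_markup(sample))
```

The lookbehind and lookahead check the blank lines without consuming them, so two titles separated by a single empty line are both matched.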
