Old 09-14-2010, 01:41 AM   #1
tonyx3
Connoisseur
 
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
Detect chapters without using tag or class.

Disclaimer: I am in no way a regex or XPATH expert, so I could be completely wrong. Please don't completely flame me out.



Ok, so this got kinda long.
SUMMARY:
  1. Is there a way to match chapters using only a regex?
  2. If not, can we please have one?
  3. What about testing chapter detection before conversion?


There are lots of threads on the forum from people asking for help getting their chapters detected properly.


For me, at least, a lot of the trouble I have comes from trying to use calibre to convert poorly formatted books from one format into properly formatted books in another format.


For example, in an ideal situation, the chapter titles in the source format will each be on their own line, with proper opening and closing tags (preferably h1), and will include the word 'chapter.' If that happens, it's easy to get calibre to detect them and generate a proper Table of Contents for my target format.


In reality, though, I often find myself trying to convert books in which the chapter titles don't have 'chapter' in them, and often don't have any special or unique format tags. Basically, there are a lot of poorly formatted books out there. I won't even mention trying to convert from PDF.

Here's an actual recent example from a book I converted:
Source format: LRF
The chapter titles simply consist of the chapter number, spelled out in all caps, e.g. ONE, TWO, THREE, etc. (This isn't a huge problem, but it's annoying.)

The tag isn't a header tag, it's a <span>, and it isn't closed out immediately after the chapter title, like this:
Code:
preceding text from Introduction<br/><span class="ts1">ONE<span class="ts2"><br/>Beginning text of chapter 1.
Now, that would be annoying, but workable if I could just use a regex to match the <span class="ts1"> just before the chapter title and the <span class="ts2"> just after it. The chapter titles are the only thing formatted that way.

But, unless I'm mistaken, I can't do that, because I need to use XPATH to match tags and classes for the chapters.
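
To make the idea concrete, here is a minimal sketch of that regex run outside calibre with Python's standard `re` module. It only illustrates the pattern; it is not a calibre option. The second chapter in the sample text is an assumption added so there is more than one match, and the function name is made up for the example.

```python
import re

# Sample markup based on the snippet above; the "TWO" chapter is an
# assumed extension so the example has more than one heading.
SAMPLE = (
    'preceding text from Introduction<br/>'
    '<span class="ts1">ONE<span class="ts2"><br/>'
    'Beginning text of chapter 1.<br/>'
    '<span class="ts1">TWO<span class="ts2"><br/>'
    'Beginning text of chapter 2.'
)

# A title is a run of capital letters sitting between the two span openers.
CHAPTER_RE = re.compile(r'<span class="ts1">([A-Z]+)<span class="ts2">')


def find_chapter_titles(html):
    """Return the captured chapter titles in document order."""
    return CHAPTER_RE.findall(html)


print(find_chapter_titles(SAMPLE))  # ['ONE', 'TWO']
```

Because the pattern treats the tags as plain text, it depends on the exact attribute order and quoting in the source, so it would need adjusting for each badly formatted book.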


I don't know the best solution for that.

I guess I'm advocating for the addition of an option to match chapters with only a regex (in the same way that the header removal works). It's easy to write a regex to match most chapters. Much easier, for me at least, than trying to use the current chapter detection options.

Also an option to test the chapter detection before conversion would be great.

Often I use the test feature of the header removal to write my expression for chapters, but then it doesn't work in the actual conversion, and I don't find that out until the conversion is done.


I'm sure there is some way to write an amazing expression for the example I gave, but it's not as simple as matching with a regex (using the header removal test feature, this is easy).
Old 09-14-2010, 01:47 AM   #2
ldolse
Wizard
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Try enabling the preprocess option. It does quite a bit in the current release, and there is additional logic coming in the next release.
Old 09-14-2010, 01:51 AM   #3
DoctorOhh
US Navy, Retired
Posts: 8,800
Karma: 12528001
Join Date: Feb 2009
Location: North Carolina
Device: Nexus 7
Quote:
Originally Posted by ldolse View Post
Try enabling the preprocess option. It does quite a bit in the current release, and there is additional logic coming in the next release.
I'm curious. What, if anything, should you enter in the preprocess area?

Last edited by DoctorOhh; 09-14-2010 at 01:54 AM.
Old 09-14-2010, 02:12 AM   #4
capidamonte
Not who you think I am...
Posts: 343
Karma: 5337
Join Date: Jan 2010
Location: Honolulu
Device: Sony PRS-350
Where is the preprocess option? I've searched several times now, and I'm having one of those blind spot moments.
Old 09-14-2010, 02:14 AM   #5
DoctorOhh
US Navy, Retired
Posts: 8,800
Karma: 12528001
Join Date: Feb 2009
Location: North Carolina
Device: Nexus 7
Quote:
Originally Posted by capidamonte View Post
Where is the preprocess option? I've searched several times now, and I'm having one of those blind spot moments.
Where it is is easy; what it does I'll let others address.

Preferences - Conversion - Common Options - Structure detection

Or Convert books icon - Structure Detection (I have to scroll down)
Old 09-14-2010, 02:56 AM   #6
tonyx3
Connoisseur
 
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
Quote:
Originally Posted by ldolse View Post
Try enabling the preprocess option. It does quite a bit in the current release, and there is additional logic coming in the next release.
I tried enabling it. It didn't make any difference to the output.

Is there something I'm supposed to do here other than tick the little box?



I'm all for learning to work with the existing setup, so if there's some way to do it, let me know. I just haven't had much luck with the current options unless the source file is already semi-well formatted.


I inevitably give up and use calibre to convert to rtf, then format in Word, and then use calibre to convert that to epub (or whatever format).

But in a lot of cases the overall formatting is fine, it's just the chapters that are the problem.

I shouldn't have to go out to rtf and Word and back just to get chapters correct, when the rest of the formatting is fine.
Old 09-14-2010, 04:12 AM   #7
Manichean
Wizard
Posts: 3,130
Karma: 80446
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
You could use Sigil to edit the ePub directly; that, at least, would spare you having to go to Word. Not much of an improvement, I'll admit.

As for your initial problem... you gave the example
Code:
preceding text from Introduction<br/><span class="ts1">ONE<span class="ts2"><br/>Beginning text of chapter 1.
for what you want to match. Generally, it should be possible to match tags in a regexp, though, of course, the regexp won't "understand" those tags other than as a string. In your example, the expression
Code:
<span class="ts1">[A-Z]+<span class="ts2"><br/>
should match the chapter headings.

EDIT: Oh, I'm sorry, I misread your post. I'll have to put the thinking cap back on.

Last edited by Manichean; 09-14-2010 at 04:15 AM.
Old 09-14-2010, 04:25 AM   #8
Manichean
Wizard
Posts: 3,130
Karma: 80446
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
I'm not quite sure how to go about using the whole book as source in the re.test() function, but shouldn't
Code:
re.test(<whole book as source>, <span class="ts1">[A-Z]+<span class="ts2"><br/>, )
just use the regexp for chapter matching?
Another thing to try would be
Code:
/h:span[@class="ts1"]re.test(.,[A-Z]+,)/h:span[@class="ts2"]
Disclaimer: I didn't try the above myself. It might be horribly wrong...
Old 09-14-2010, 05:07 AM   #9
tonyx3
Connoisseur
 
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
Quote:
Originally Posted by Manichean View Post
I'm not quite sure how to go about using the whole book as source in the re.test() function, but shouldn't
Code:
re.test(<whole book as source>, <span class="ts1">[A-Z]+<span class="ts2"><br/>, )
just use the regexp for chapter matching?
Another thing to try would be
Code:
/h:span[@class="ts1"]re.test(.,[A-Z]+,)/h:span[@class="ts2"]
Disclaimer: I didn't try the above myself. It might be horribly wrong...


Hrm.. I just tried that second one, but calibre said it was an invalid XPATH expression. I'm not sure but I think maybe you can't use /h:span twice.. ?
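
Repeating /h:span is probably not what makes the expression invalid. In XPath a function call cannot sit between two path steps: it has to go inside a [...] predicate, the regex has to be a quoted string, and calibre's manual spells the function re:test (with a colon), as in //h:h2[re:test(., "chapter", "i")]. An untested guess for this book would be //h:span[@class="ts1" and re:test(text(), "^[A-Z]+$")]. The sketch below only emulates that predicate with Python's standard library, so it shows the logic rather than calibre's actual XPath engine. The closing </span> tags, the XHTML namespace, and the extra non-heading span in the sample are assumptions added so that it parses and has something to reject.

```python
import re
import xml.etree.ElementTree as ET

# Prefix-to-namespace map; calibre's "h:" prefix refers to XHTML.
XHTML_NS = {"h": "http://www.w3.org/1999/xhtml"}

# Well-formed stand-in for the snippet in the thread (closing tags assumed).
SAMPLE = (
    '<body xmlns="http://www.w3.org/1999/xhtml">'
    'preceding text from Introduction<br/>'
    '<span class="ts1">ONE<span class="ts2"><br/>'
    'Beginning text of chapter 1.</span></span>'
    '<span class="ts1">not a heading</span>'
    '</body>'
)


def chapter_spans(xml_text):
    """Roughly emulate [@class="ts1" and re:test(text(), "^[A-Z]+$")]."""
    root = ET.fromstring(xml_text)
    hits = []
    for span in root.findall(".//h:span[@class='ts1']", XHTML_NS):
        # span.text is the text before the first child element, e.g. "ONE".
        # strip() is a small leniency the XPath guess above does not have.
        own_text = (span.text or "").strip()
        if re.fullmatch(r"[A-Z]+", own_text):
            hits.append(own_text)
    return hits


print(chapter_spans(SAMPLE))  # ['ONE']
```

Testing only the span's own leading text matters here: because the span is not closed right after the title, its full string value would include the chapter body as well.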
Old 09-14-2010, 06:09 AM   #10
capidamonte
Not who you think I am...
Posts: 343
Karma: 5337
Join Date: Jan 2010
Location: Honolulu
Device: Sony PRS-350
Quote:
Originally Posted by dwanthny View Post
Where it is is easy, what it does I'll let others address.

Preferences - Conversion - Common Options - Structure detection

Or Convert books icon - Structure Detection (I have to scroll down)
Thanks! That is a non-obvious place.
Old 09-14-2010, 07:36 AM   #11
Manichean
Wizard
Posts: 3,130
Karma: 80446
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Quote:
Originally Posted by tonyx3 View Post
Hrm.. I just tried that second one, but calibre said it was an invalid XPATH expression. I'm not sure but I think maybe you can't use /h:span twice.. ?
I don't know, I've never used XPath. I just skimmed over the tutorial in the manual and tried to guess what could work, after I realized that a regexp alone wouldn't work...
Old 09-14-2010, 09:22 AM   #12
ldolse
Wizard
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
I just noticed your source format is LRF. That's not hooked into the preprocess option at the moment, partly because I don't really have any LRF files in my library, and partly because I wasn't aware that it was a format that had problems like this.

Right now preprocessing works for html, Lit, txt, and rtf inputs. One option would be to specify a directory for debug output and grab the first pass 'parsed' html output and re-convert that.
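
A variation on that idea, sketched under assumptions: once the parsed HTML is out of the debug directory, a small script could promote the odd headings to real <h2> tags before re-converting, so that a simple tag-based expression such as //h:h2 should then be able to find them. The tag layout is the one from the first post; the function name and the choice of <h2> are illustrative, and real debug output may well look different.

```python
import re

# Same heading shape as in the first post: an all-caps word between two
# span openers.
HEADING_RE = re.compile(r'<span class="ts1">([A-Z]+)<span class="ts2">')


def promote_headings(html):
    """Move each all-caps title into an <h2> placed before the spans.

    Returns (new_html, count). The span tags are kept in place so the
    open/close balance of the surrounding markup is unchanged.
    """
    return HEADING_RE.subn(
        r'<h2>\1</h2><span class="ts1"><span class="ts2">', html
    )


sample = 'Intro text<br/><span class="ts1">ONE<span class="ts2"><br/>Chapter text.'
fixed, count = promote_headings(sample)
print(count)
print(fixed)
```

The returned count doubles as a cheap check before conversion: if it does not equal the number of chapters in the book, the pattern needs adjusting.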

I wouldn't recommend enabling it globally in the preferences section; do it on a book-by-book basis. Overall it's pretty conservative and won't do much to a well marked-up file. The only really destructive thing it will do across all files is remove all non-breaking spaces.



As for what preprocessing does, I don't quite remember what's in .7.18; there are a bunch of changes going into the next release. I think .7.18 has basic chapter detection and line unwrapping. It worked pretty well on txt, rtf, and some types of lit files, but I've tested with a larger range of crappy files now, so the new code is doing better.

Right now it attempts to:
  • Convert non-breaking space indents to CSS indents
  • Remove any remaining non-breaking spaces (the most destructive thing it does right now)
  • Check whether blank lines are inserted between every paragraph, and delete them if so (the second most destructive thing; this needs improving to preserve soft breaks if they exist)
  • Add markup to lit files that are really just glorified text in <pre> tags with a lit wrapper
  • Try up to four different regexes for chapter/chapter title detection, starting with the ones that give the fewest false positives, and mark the matches with h2/h3 tags
  • Unwrap hard line breaks based on the median line length and punctuation
  • Remove/unwrap soft hyphens and unwrap other hyphens
  • Search for places where h1 or h2 headers immediately follow each other from one line to the next, which would cause calibre to split at those points, and change the second header to h3. This prevents chapter headings and titles/images from being separated. (This particular step is also applied to mobi files.)
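
The "try several regexes, strictest first" step in the list above could look roughly like this. It is a sketch only: the three patterns and the minimum-hit threshold are invented for illustration and are not the expressions calibre uses.

```python
import re

# Ordered from fewest expected false positives to most (illustrative only).
PATTERNS = [
    # "Chapter 1", "CHAPTER ONE", ...
    re.compile(r"^\s*chapter\s+\w+\s*$", re.IGNORECASE | re.MULTILINE),
    # A spelled-out number in capitals alone on a line.
    re.compile(
        r"^\s*(?:ONE|TWO|THREE|FOUR|FIVE|SIX|SEVEN|EIGHT|NINE|TEN)\s*$",
        re.MULTILINE,
    ),
    # A bare number alone on a line.
    re.compile(r"^\s*\d{1,3}\s*$", re.MULTILINE),
]


def detect_headings(text, min_hits=2):
    """Return the headings found by the first pattern with enough matches."""
    for pattern in PATTERNS:
        hits = [m.group(0).strip() for m in pattern.finditer(text)]
        if len(hits) >= min_hits:
            return hits
    return []


print(detect_headings("ONE\nSome text.\nTWO\nMore text.\n"))
```

Stopping at the first pattern that produces enough hits keeps the looser patterns from adding false positives to a book the strict pattern already handles.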

I've tested this across a couple dozen garbage lit files and a bunch of html, txt, and rtf files. I'm getting fairly good results at this point, but the line unwrapping could use some more work. It works best when all the hard line breaks are pretty much in the same place; if the line lengths are variable, unwrapping might not work. I need to add a user-configurable unwrap_factor, like the one for pdf, to resolve that problem. It also shares a problem with pdf: lines aren't always unwrapped, in order to avoid false positives. I'll be looking into cases where there is spacing between paragraphs or indents to make this a bit smarter.
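
As a rough illustration of median-based unwrapping with a configurable factor, here is a minimal sketch. It is a guess at the general technique, not calibre's algorithm: the function name, the default factor of 0.8, and the punctuation set are all assumptions.

```python
import statistics


def unwrap_lines(lines, unwrap_factor=0.8):
    """Join hard-wrapped lines into paragraphs.

    A line counts as wrapped when it is at least unwrap_factor times the
    median line length and does not end in sentence-final punctuation.
    """
    lengths = [len(line.strip()) for line in lines if line.strip()]
    if not lengths:
        return []
    threshold = statistics.median(lengths) * unwrap_factor

    paragraphs, current = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            # A blank line always ends the current paragraph.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(stripped)
        wrapped = len(stripped) >= threshold and not stripped.endswith(
            (".", "!", "?", '"')
        )
        if not wrapped:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


demo = [
    "It was a dark and stormy night and the",
    "rain fell in torrents except at",
    "occasional intervals.",
    "ONE",
    "The next chapter starts here and keeps",
    "going for a while.",
]
for paragraph in unwrap_lines(demo):
    print(paragraph)
```

A lower factor joins more lines and a higher one joins fewer, which is the trade-off a user-facing setting would expose. Short lines such as a bare "ONE" fall below the threshold and stay on their own, which helps keep headings separate from body text.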

Anyway the idea isn't to be perfect, it's just to make it so that as few hand edits as possible are required after conversion.

Last edited by ldolse; 09-14-2010 at 09:24 AM.
Old 09-14-2010, 09:31 AM   #13
ldolse
Wizard
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Quote:
Originally Posted by dwanthny View Post
I'm curious. What, if anything, should you enter in the preprocess area?
Nothing to enter in the area, just enable the checkbox. Right now nothing is user configurable.

BTW, just found a bunch of lrf files on my system I didn't know about, looking into them now. At first glance it looks like some modifications to the regexes for chapter detection will be needed. I don't think I accounted for the headings to be nested in so many div/span tags, haven't seen that with the other formats.

I'm curious though: for the LRF samples I have, the sources all look good, and the files are already nicely split per chapter. Is this not the case with yours? The only problem I've seen is that in one case a TOC wasn't automatically generated during conversion, but the look and feel was ok even in this case.

Last edited by ldolse; 09-14-2010 at 10:24 AM.
Old 09-14-2010, 10:53 AM   #14
DoctorOhh
US Navy, Retired
Posts: 8,800
Karma: 12528001
Join Date: Feb 2009
Location: North Carolina
Device: Nexus 7
Quote:
Originally Posted by ldolse View Post
Nothing to enter in the area, just enable the checkbox. Right now nothing is user configurable.
Maybe I'm looking in the wrong spot, but the one under structure detection has an editable area.

Quote:
Originally Posted by ldolse View Post
I'm curious though: for the LRF samples I have, the sources all look good, and the files are already nicely split per chapter. Is this not the case with yours? The only problem I've seen is that in one case a TOC wasn't automatically generated during conversion, but the look and feel was ok even in this case.
Just like with every other format, if garbage was the source, garbage is what they ended up with.

Last edited by DoctorOhh; 09-14-2010 at 10:56 AM.
Old 09-14-2010, 10:57 AM   #15
tonyx3
Connoisseur
 
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
Quote:
Originally Posted by dwanthny View Post
Just like with every other format, if garbage was the source, garbage is what they ended up with.
Exactly.

For this particular issue, if the body text is ok, then having more options for chapter detection would solve a ton of conversion problems.

Having it be dependent on the source having proper tags adds an extra layer of trouble with improperly formatted sources.

Tags
chapter, regex
