Old 09-12-2010, 03:03 PM   #1
Perkin
Guru
Posts: 655
Karma: 64171
Join Date: Sep 2010
Location: Kent, England, Sol 3, ZZ9 plural Z Alpha
Device: Sony PRS-300, Kobo Aura HD, iPad (Marvin)
Chapter detection when only digits - regex needed

Hi, I need help.

I'm trying to convert a load of RTF files to EPUB, and most of them have chapters marked only by digits on their own line, followed by a title on the next.

Rather than adding 'Chapter ' to every heading in the RTFs by hand,
can anyone show me a regex that will recognise a chapter heading that is only numbers?

None of the RTFs have line or page numbers (they have been removed), so only chapter numbers are on a line of their own.

I've added this to the detect chapters regex:
re:test(., '\d|\d\d',i)
That does split the chapters correctly, but it also splits on any paragraph containing a one- or two-digit number.

It would also be nice if it could automatically be assigned an <h#> tag.

Any help is appreciated.
Old 09-12-2010, 03:51 PM   #2
wallcraft
reader
Posts: 6,979
Karma: 5183568
Join Date: Mar 2006
Location: Mississippi, USA
Device: Kindle 3 and Fire
Quote:
Originally Posted by Perkin View Post
I've added this to the detect chapters regex:
re:test(., '\d|\d\d',i)
That does split the chapters correctly, but it also splits on any paragraph containing a one- or two-digit number.
Try adding "^" to anchor the match at the start of the line and "$" to anchor it at the end. I would use this ("+" matches one or more instances):
Code:
re:test(.,'^\d+$',i)
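
To see why the anchors matter, here is a minimal sketch in Python. It uses Python's `re.search` as a stand-in for `re:test`, on the assumption that both report a match anywhere in the text; the sample paragraphs are made up for illustration.

```python
import re

# Text of each paragraph, as a chapter-detection expression might see it.
# These sample strings are hypothetical.
paragraphs = ["12", "The Long Road", "He was 12 years old.", "7"]

unanchored = re.compile(r"\d|\d\d")  # matches a digit anywhere in the text
anchored = re.compile(r"^\d+$")      # matches only when the whole text is digits

loose_hits = [p for p in paragraphs if unanchored.search(p)]
strict_hits = [p for p in paragraphs if anchored.search(p)]

print(loose_hits)   # ['12', 'He was 12 years old.', '7']
print(strict_hits)  # ['12', '7']
```

The unanchored pattern flags the ordinary sentence because it contains a digit; the anchored one only accepts paragraphs made up entirely of digits.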

Last edited by wallcraft; 09-12-2010 at 03:54 PM.
Old 09-12-2010, 04:02 PM   #3
Perkin
Guru
Thanks, that works great.

Now, is there any way to make that an <h#> entry?
Old 09-12-2010, 06:32 PM   #4
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Try the preprocess option. I can't remember whether that case is covered for RTF, but I'm pretty sure it is.
Old 09-12-2010, 06:53 PM   #5
Perkin
Guru
I'm preprocessing anyway. Not to worry, I think I've got it covered: I'm converting the RTF to EPUB with Calibre, then using Sigil in code view with several regexes to handle things like headers, broken lines, etc.
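
For anyone wanting to do the header step outside Sigil, here is a rough sketch of that kind of regex in Python. It is not the exact expression used in this thread; it assumes the digit-only chapters end up as plain `<p>` elements with no attributes, and the `h2` level is an arbitrary choice.

```python
import re

# Hypothetical converted markup: two digit-only chapter markers plus body text.
html = "<p>1</p>\n<p>The Long Road</p>\n<p>He was 12 years old.</p>\n<p>2</p>\n"

# Promote paragraphs whose entire content is digits to a heading.
digits_only = re.compile(r"<p>\s*(\d+)\s*</p>")
with_headings = digits_only.sub(r"<h2>\1</h2>", html)

print(with_headings)
```

Paragraphs that merely contain a number are left alone, because the pattern requires the digits to fill the whole paragraph.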

Now that each of the chapters is split properly, the Sigil part takes a few minutes, much quicker than what I was doing before. The files then just need a quick proofread to fix any other mistakes.

I'm happy.
Old 09-13-2010, 11:58 AM   #6
Perkin
Guru
I've just run a test to try to make chapters with only digits come out as heading tags, but have been unsuccessful. Is there any way to do that?

'Chapter ##' comes out as a heading fine, and so do 'Prologue' and 'Epilogue'.

Is there any way to customise the RTF input plugin, as you can with some of the other formats?
(I think that may be where the heading tags are applied.)

I'm able to split at the correct places with the regex given in an earlier post, but would like them to be headings as well.
Old 09-13-2010, 01:35 PM   #7
ldolse
Wizard
I just looked at the preprocessing code. Single-digit chapter headings weren't included in the last check-in. The logic you want is basically done now, but I need to check in the changes. If Kovid accepts them, it should be in the next build.

I noticed you said that even with preprocessing enabled you still needed to manually remove hard line breaks. If this is the case, please open a bug with the file; I'm trying to catch as many cases as possible without creating false positives.
Old 09-13-2010, 01:43 PM   #8
Perkin
Guru
Thanks ldolse, that's nice to know. I hope it makes it into the next build.

I'll try to reduce an RTF to a few short paragraphs that still have the unrecognised line breaks.
Old 09-13-2010, 02:27 PM   #9
Perkin
Guru
ldolse, I transplanted several incorrectly wrapped paragraphs into a new RTF and converted it to EPUB, and they were all wrapped correctly. But reconverting the whole original RTF still gave the mis-wrapping at those same places.

Weird.
Old 09-13-2010, 02:48 PM   #10
ldolse
Wizard
Maybe not so weird - the unwrapping function looks at the median line length across all the lines. This works great if all the hard line breaks are in roughly the same spot, but if the lines are extremely variable (and sometimes/often long) this doesn't work out that well, since the median becomes longer than the typical broken line.
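
A small illustration of that effect, with made-up line lengths rather than anything taken from the preprocessing code: in a uniformly wrapped file the median sits right at the wrap point, while a file that mixes wrapped lines with long unbroken ones gets a median well beyond where the broken lines end.

```python
from statistics import median

# Hypothetical line lengths (in characters) for two documents.
uniform_lengths = [72, 70, 71, 73, 69, 72, 40]
variable_lengths = [72, 70, 180, 240, 71, 310, 150, 69, 205]

uniform_median = median(uniform_lengths)
variable_median = median(variable_lengths)

print(uniform_median)   # 71
print(variable_median)  # 150
```

In the second document the hard-broken lines are still around 70 characters, less than half the median of 150, so a cut-off tied closely to the median would not treat them as wrapped.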

I'm going to add an option to tweak that logic, but I'm not sure it will make it into the next release, as it's the first time I've attempted any GUI work.
Old 09-13-2010, 02:57 PM   #11
Perkin
Guru
If it helps, I fix the remaining ones in Sigil with a search and replace:
search '([a-z])</p>## <p>', replace '\1 ' (## = CRLF x2, plus 2 spaces; the replacement ends with a space as well).
I use a similar one for lines ending in a comma.
(I also search for lines that end in a semicolon or colon, just as a check.)
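
The same search and replace can be sketched in Python. This reads '##' as two CRLFs followed by two spaces, which is an interpretation of the note above, and the sample markup is invented.

```python
import re

# Hypothetical markup: the first paragraph was broken mid-sentence.
html = (
    "<p>The rain kept falling on the</p>\r\n\r\n  <p>old tin roof.</p>"
    "\r\n\r\n  <p>Nobody spoke.</p>"
)

# A lowercase letter, the paragraph end, two CRLFs, two spaces, a new paragraph.
broken_wrap = re.compile(r"([a-z])</p>\r\n\r\n  <p>")
joined = broken_wrap.sub(r"\1 ", html)

print(joined)
```

Only the break after "the" is joined; the paragraph ending in a full stop is left alone because the pattern requires a lowercase letter immediately before `</p>`.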

I don't know if the preprocessing code can include a regex, but that sort of thing may make it more complete.
Old 09-18-2010, 04:23 AM   #12
ldolse
Wizard
The preprocess function basically does the regex you're suggesting, but it first analyzes the document to get the median line length and ties this to the unwrapping function. This is to prevent things like lists, poetry, titles, etc. from being unwrapped.

The problem is that the original implementation assumed that if there were hard breaks in the document, they would be uniform across the file. In reality, many files have a lot of variability, which pushes the median line length longer than the typical broken line.

Preprocessing now lets you specify the aggressiveness of the line unwrapping - i.e. make the line length cut-off shorter. It's the line unwrapping factor under structure detection. If you get a chance give it a shot and see if it solves your original problem with unwrapping. Note you may need to set it down to 0.1 or 0.2 if there is a huge amount of variability in line length.
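
As a rough sketch of how a factor like that could work (this is not calibre's actual code; the function name, the default of 0.4 and the joining rule are all assumptions), a line is treated as hard-wrapped when its length is at least factor times the median, so a lower factor unwraps more aggressively:

```python
from statistics import median

def unwrap(lines, factor=0.4):
    """Join each line to the next when it is at least factor * median long.

    Hypothetical helper for illustration only.
    """
    if not lines:
        return []
    cutoff = factor * median(len(line) for line in lines)
    paragraphs = []
    current = ""
    for line in lines:
        current = f"{current} {line}" if current else line
        if len(line) < cutoff:
            # A short line is taken as the real end of a paragraph.
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return paragraphs

# Made-up lines with lengths 70, 68, 30, 72, 50 and 20 (median 59).
lines = ["a" * 70, "b" * 68, "c" * 30, "d" * 72, "e" * 50, "f" * 20]

print(len(unwrap(lines, factor=0.8)))  # 2
print(len(unwrap(lines, factor=0.4)))  # 1
```

With the factor at 0.8 the 30-character line ends a paragraph; at 0.4 the cut-off drops below it, so that line is joined to the next one as well.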

edit - single digit chapters are covered now as well.

Last edited by ldolse; 09-18-2010 at 04:25 AM.
Old 09-20-2010, 03:51 PM   #13
Perkin
Guru
ldolse, I've been able to test, and the new wrapping/detection is better.
Thanks

There's one small thing I've noticed it isn't catching: when one line ends in a quote and the next line begins with something like 'he said.'
e.g.
Quote:
'That was nearly perfect ldolse.'
he said.
Quickly fixed in Sigil with S&R '</p>(.)(.) <p>([a-z])' -> ' \3'
Or something similar, depending on the paragraph, spacing, etc.

Last edited by Perkin; 09-20-2010 at 03:54 PM. Reason: additional info
Old 09-20-2010, 06:44 PM   #14
ldolse
Wizard
Glad to hear it's working better for you.

That scenario is left uncovered on purpose. If the line happens to break on the quote, you can't create a simple regex that can differentiate between these two scenarios:

Quote:
'That was nearly perfect ldolse.'
he said.
and:
Quote:
He said 'That was nearly perfect ldolse.'
This is the start of a new paragraph.
I've been thinking of adding an enhancement to preprocessing where the user can specify that there is a blank line between every paragraph, or that most/every paragraph is indented, so that at least this scenario can be covered when the document provides that much differentiation.

Last edited by ldolse; 09-20-2010 at 06:49 PM.
Old 09-20-2010, 07:20 PM   #15
Perkin
Guru
Thanks to you it's miles easier now.
To rectify those missed wraps, I use the search '</p>(.)(.) <p>([a-z])' and replace ' \3' in Sigil, with match case and minimal matching. That wraps them (and any other line that begins with a lower-case letter).
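
Here is a sketch of that lower-case heuristic in Python, run against the two scenarios from earlier in the thread. It is a variation on the Sigil search rather than a copy of it: it uses `\s*` for the whitespace between paragraphs and a lookahead in place of the `\3` back-reference, and the helper name is made up.

```python
import re

# Join a paragraph break when the next paragraph starts with a lowercase letter.
joiner = re.compile(r"</p>\s*<p>(?=[a-z])")

def rejoin(html):
    """Hypothetical helper: merge paragraphs split before a lowercase start."""
    return joiner.sub(" ", html)

broken = "<p>'That was nearly perfect.'</p>\r\n\r\n  <p>he said.</p>"
intact = (
    "<p>He said 'That was nearly perfect.'</p>\r\n\r\n"
    "  <p>This is the start of a new paragraph.</p>"
)
capitalised = "<p>'That was nearly perfect,'</p>\r\n\r\n  <p>Perkin said.</p>"

print(rejoin(broken))                      # <p>'That was nearly perfect.' he said.</p>
print(rejoin(intact) == intact)            # True
print(rejoin(capitalised) == capitalised)  # True
```

The last case shows the limit of the heuristic: a continuation that happens to start with a capital letter, such as a name, looks the same as a genuine new paragraph and is left split.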

Could something similar be used in the preprocessing code?