02-14-2011, 10:28 AM | #1 |
Member
Posts: 24
Karma: 322
Join Date: Jan 2011
Device: Kindle
|
[Old Thread] html/zip to mobi not detecting chapter breaks
I have documents that I created in Word, saved as a web page, and am converting to mobi with Calibre. For the life of me, I can't get the chapters to be detected for a page break!
I posted a while back with a similar problem for azw to mobi, and someone was kind enough to help me edit the XPath for detecting chapters. That code is working beautifully for my azw files, but I've had no luck with these html files. I've tried selecting the Heuristic option "Detect & markup unformatted chapter headings..." but it hasn't made a difference. I have run a debug and I still can't figure it out. If anyone has any suggestions, I would be very appreciative! My current XPath is:
Code:
//*[((name()='span' or name()='h2') and re:test(., 'chapter|ch|book|section|part|pt|prologue|epilogue\s+', 'i') and (@class = 'bold')) ]
Code:
"Alright, what I'd like to do now is shoot some backlit winter shots, something that might be good for a January or March scene. I'd like to have you on the skis in a full tuck position, as if you were rounding a corner on a downhill slope. If you're comfortable, I'd like to have you strip completely, or if you prefer, I have a thong you can wear."</span></p> <p class="MsoNormal" align="center" style="mso-margin-top-alt:auto;mso-margin-bottom-alt: auto;text-align:center;line-height:normal"><span style="font-size:14.0pt; font-family:"Times New Roman";mso-fareast-font-family:"Times New Roman"; color:black"> </span></p> <p class="MsoNormal" align="center" style="mso-margin-top-alt:auto;mso-margin-bottom-alt: auto;text-align:center;line-height:normal"><span style="font-size:14.0pt; font-family:"Times New Roman";mso-fareast-font-family:"Times New Roman"; color:black">Chapter 2</span></p> <p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto; line-height:normal"><span style="font-size:14.0pt;font-family:"Times New Roman"; mso-fareast-font-family:"Times New Roman";color:black">The incredulous look must have been plain on my face. As she realized how her offer sounded, her face turned red and she quickly clarified, "Not one of <i>my </i>thongs. Not that I'm trying to say that I even <i>have</i> thongs," her cheeks were starting to remind me of the <span class="SpellE"><span class="GramE">claymation</span></span> Rudolph when his nose cover popped off. "I just mean that I have a brand new men's thong you can wear and I can Photoshop out the lines on your hips." ~Rach |
02-14-2011, 11:03 AM | #2 |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
Where exactly should the chapter start in your example?
|
02-14-2011, 11:09 AM | #3 |
Member
Posts: 24
Karma: 322
Join Date: Jan 2011
Device: Kindle
|
Where it says "Chapter 2", in the middle of the HTML garbage.
|
02-14-2011, 11:39 AM | #4 |
Resident Curmudgeon
Posts: 73,998
Karma: 128903378
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
I do have a suggestion.
Take the HTML you got from Word, load it into a text editor such as Notepad++, and clean up the mess Word left in so that you have nice clean HTML code. Then you can mark up your chapter headings like <h2>Chapter 2</h2> and you'll get a good ToC. The problem is that when you save as a web page from Word, you get one hell of a mess. Take a look and you will see what I mean: it's not good code at all. I've cleaned up my share of Word's mess, and it can take a good while to do so. |
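For a large batch of files, the heading cleanup described above could be scripted instead of done by hand. The following is only a rough sketch, not a tested tool: it assumes each chapter heading sits alone in its own <p> (as in the sample posted earlier), and the function name `promote_headings` and the list of heading words are made up for illustration.

```python
import re
from html import unescape

# Heading words are an assumption; extend the alternation to suit your files.
HEADING = re.compile(r"^(chapter\s+\w+|prologue|epilogue)$", re.IGNORECASE)
PARAGRAPH = re.compile(r"<p\b[^>]*>(.*?)</p>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def promote_headings(html):
    """Replace paragraphs whose only text is a chapter heading with <h2>."""

    def swap(match):
        # Drop inline tags such as Word's styled <span>, then tidy whitespace.
        text = " ".join(unescape(TAG.sub("", match.group(1))).split())
        if HEADING.match(text):
            return "<h2>" + text + "</h2>"
        return match.group(0)  # leave ordinary paragraphs untouched

    return PARAGRAPH.sub(swap, html)


if __name__ == "__main__":
    sample = (
        '<p class="MsoNormal" align="center">'
        '<span style="font-size:14.0pt">Chapter 2</span></p>'
        '<p class="MsoNormal"><span>The incredulous look...</span></p>'
    )
    print(promote_headings(sample))
```

Regular expressions are a blunt instrument for HTML, so check the output on one file before running it over a whole batch.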
02-14-2011, 11:43 AM | #6 |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
Try using filtered HTML when saving from Word. Also try enabling the heuristic options.
|
02-14-2011, 11:57 AM | #7 | |
Well trained by Cats
Posts: 29,809
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
It has just been trashed up by Word (with unnecessary tags). If you set a nice <body class=...>, the mso-normal could disappear. Experiment! (Sigil is a good tool for this.) Rename a class in the CSS (mso-normal->mXo-normal), leaving the usage in place, and see what happens to your masterpiece. |
|
02-14-2011, 12:06 PM | #8 |
Wizard
Posts: 2,251
Karma: 3720310
Join Date: Jan 2009
Location: USA
Device: Kindle, iPad (not used much for reading)
|
You probably also want to change the chapter headings from paragraph tags to an h2 or similar, so that they are easy to recognize when generating a TOC, etc. You may also want to manually add the special Mobipocket-specific page break tag: <mbp:pagebreak/>
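If the headings are already <h2> tags, adding the page break by hand in every file can be avoided with a few lines of script. This is a minimal sketch under assumptions: the tag is written here as `<mbp:pagebreak/>`, which is the spelling Mobipocket documents, and `add_page_breaks` is a hypothetical helper name, not part of Calibre.

```python
import re

H2_OPEN = re.compile(r"<h2\b", re.IGNORECASE)
PAGE_BREAK = "<mbp:pagebreak/>"  # assumed spelling of the Mobipocket tag


def add_page_breaks(html):
    """Insert a Mobipocket page break immediately before every <h2> heading."""
    return H2_OPEN.sub(lambda m: PAGE_BREAK + m.group(0), html)


if __name__ == "__main__":
    print(add_page_breaks("<p>end of chapter</p><h2>Chapter 2</h2><p>text</p>"))
```

Calibre's own chapter detection can also insert page breaks at detected headings, so this is only worth doing if you are editing the HTML anyway.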
|
02-14-2011, 01:45 PM | #9 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Heuristics should work with that chapter - just enable Heuristics under the conversion options.
|
02-15-2011, 04:09 PM | #10 |
Member
Posts: 24
Karma: 322
Join Date: Jan 2011
Device: Kindle
|
Thanks everyone! Happy to report that chapter detection is working. Manichean, thanks for the suggestion to save as a filtered web page. That seems to have done the trick!
Since I have a LARGE quantity of files with Word origins to convert, it would be too time-consuming to hand-edit the tags for each chapter of every file. But you're right, Word makes a mess of it. Does anyone have a different recommendation for converting text to mobi? My project is organizing stories for my creative writing group: copying and pasting from our internet pages and then creating mobis. Currently I paste to Word, save as a web page, and run the zip through Calibre.
TheDucks - sorry, but you lost me. I'm not that savvy with the lingo.
Thanks again everyone!
~Rach
Last edited by RachDvn; 02-15-2011 at 04:29 PM. |
02-15-2011, 05:02 PM | #11 | |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
Quote:
|
|
02-15-2011, 05:05 PM | #12 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Depending on what the web pages look like, you could just save the HTML directly from the website and load it into Calibre. If you need only a portion of the web pages, you could look at Calibre's recipe framework, as it can grab web pages, extract the relevant portion, and convert that to an ebook (albeit one that uses 'news' features on some readers).
The problem you'll find with text is that you'll lose italics and other formatting with a straight copy/paste to a text editor. If the originals don't have any formatting, that might be an OK option though.
If the recipe framework is too complicated for you, another thing you could look at is the Firebug plugin for Firefox. It's still a 'bit' complicated, but it provides a GUI where you can get to just the relevant HTML that contains the story and copy just that into a text editor. If that's of interest, I can explain in a bit more detail. |
02-15-2011, 05:06 PM | #13 | |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Quote:
|
|
02-15-2011, 05:49 PM | #14 | |
Wizard
Posts: 1,613
Karma: 6718479
Join Date: Dec 2004
Location: Paradise (Key West, FL)
Device: Current:Surface Go & Kindle 3 - Retired: DellV8p, Clie UX50, ...
|
Quote:
My current pet method for moving web pages to MOBI format is to save the source while in my browser and edit that before converting with Calibre. My workflow involves using the Opera browser with Notepad++ set as my app for viewing the source. I simply right-click on a page and select Source from the menu. The source HTML appears in Notepad++ and I then:
However you approach saving the original HTML source, doing so will preserve the formatting (bold, italic, ...). |
|
02-16-2011, 08:43 AM | #15 |
Member
Posts: 24
Karma: 322
Join Date: Jan 2011
Device: Kindle
|
Thanks so much. Haven't used Sigil before, so I'll play with it as well as trying dwig's method.
~Rach |