|
|
#1 |
|
Member
Posts: 23
Karma: 322
Join Date: Jan 2011
Device: Kindle
|
html/zip to mobi not detecting chapter breaks
I posted a while back with a similar problem for azw to mobi, and someone was kind enough to help me edit the XPath for detecting chapters. That code is working beautifully for my azw files, but I've had no luck with these html files. I've tried selecting the Heuristic option "Detect & markup unformatted chapter headings..." but it hasn't made a difference. I have run a debug and I can't figure it out. If anyone has any suggestions, I would be very appreciative!
My current XPath is:
Code:
//*[((name()='span' or name()='h2') and re:test(., 'chapter|ch|book|section|part|pt|prologue|epilogue\s+', 'i') and (@class = 'bold')) ]
Code:
"Alright, what I'd like to do now is shoot some backlit winter shots, something that might be good for a January or March scene. I'd like to have you on the skis in a full tuck position, as if you were rounding a corner on a downhill slope. If you're comfortable, I'd like to have you strip completely, or if you prefer, I have a thong you can wear."</span></p>
<p class="MsoNormal" align="center" style="mso-margin-top-alt:auto;mso-margin-bottom-alt: auto;text-align:center;line-height:normal"><span style="font-size:14.0pt; font-family:"Times New Roman";mso-fareast-font-family:"Times New Roman"; color:black"> </span></p>
<p class="MsoNormal" align="center" style="mso-margin-top-alt:auto;mso-margin-bottom-alt: auto;text-align:center;line-height:normal"><span style="font-size:14.0pt; font-family:"Times New Roman";mso-fareast-font-family:"Times New Roman"; color:black">Chapter 2</span></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto; line-height:normal"><span style="font-size:14.0pt;font-family:"Times New Roman"; mso-fareast-font-family:"Times New Roman";color:black">The incredulous look must have been plain on my face. As she realized how her offer sounded, her face turned red and she quickly clarified, "Not one of <i>my </i>thongs. Not that I'm trying to say that I even <i>have</i> thongs," her cheeks were starting to remind me of the <span class="SpellE"><span class="GramE">claymation</span></span> Rudolph when his nose cover popped off. "I just mean that I have a brand new men's thong you can wear and I can Photoshop out the lines on your hips."
~Rach
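Judging only from the sample in the post above, one possible reason the expression misses this heading is the `@class = 'bold'` condition: the span that holds "Chapter 2" carries a `style` attribute but no `class` at all. The sketch below is a rough Python imitation of the three conditions in the XPath, not calibre's own matching code. The shortened `SAMPLE` string and the helper names `SpanCollector` and `matches_like_xpath` are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Same alternatives as in the posted XPath; re.I plays the role of the 'i' flag.
HEADING_RE = re.compile(
    r"chapter|ch|book|section|part|pt|prologue|epilogue\s+", re.I
)

# Shortened stand-in for the Word markup around the heading (an assumption).
SAMPLE = (
    '<p class="MsoNormal" align="center">'
    '<span style="font-size:14.0pt;color:black">Chapter 2</span></p>'
)


class SpanCollector(HTMLParser):
    """Collect (tag, class attribute, text) for every span and h2 element."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open span/h2 elements
        self.found = []  # finished elements as (tag, class, text)

    def handle_starttag(self, tag, attrs):
        if tag in ("span", "h2"):
            self.stack.append([tag, dict(attrs).get("class"), []])

    def handle_data(self, data):
        # Text belongs to every open element, like '.' in the XPath.
        for entry in self.stack:
            entry[2].append(data)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            name, cls, parts = self.stack.pop()
            self.found.append((name, cls, "".join(parts)))


def matches_like_xpath(tag, cls, text):
    """Imitate the three conditions: tag name, regex test, class equals 'bold'."""
    return (
        tag in ("span", "h2")
        and HEADING_RE.search(text) is not None
        and cls == "bold"
    )


collector = SpanCollector()
collector.feed(SAMPLE)
for tag, cls, text in collector.found:
    print(tag, cls, repr(text), matches_like_xpath(tag, cls, text))
```

In this imitation the text test passes but the class test fails, so the heading is skipped. If that is what happens in the real conversion, dropping the class condition or matching on the centered paragraph instead would be things to try.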
|
|
|
|
|
#2 |
|
Wizard
Posts: 3,127
Karma: 77366
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
Where exactly should the chapter start in your example?
__________________
I reject your reality and substitute my own. |
|
|
|
|
|
|
|
|
#3 |
|
Member
Posts: 23
Karma: 322
Join Date: Jan 2011
Device: Kindle
|
Where it says "Chapter 2" in the middle of the html garb
|
|
|
|
|
|
#4 |
|
Mobile Reader Geek
Posts: 34,209
Karma: 13801264
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Sony Reader PRS-650, iPad
|
I do have a suggestion.
Take the HTML you got from Word, load it into a text editor such as Notepad++, and clean up the mess Word left in to get nice clean HTML. Then you can make your chapter headings look like <h2>Chapter 2</h2> and you'll get a good ToC. The problem is that when you save as a web page from Word, you get one hell of a mess. Take a look and you will see what I mean. It's not good code at all. I've cleaned up my share of Word's mess and it can take a good while to do so.
__________________
|
|
|
|
|
|
#5 |
|
Mobile Reader Geek
Posts: 34,209
Karma: 13801264
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Sony Reader PRS-650, iPad
|
But how much of that garb is HTML and how much of that garb is Word?
__________________
|
|
|
|
|
|
#6 |
|
Wizard
Posts: 3,127
Karma: 77366
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
Try using filtered HTML when saving from Word. Also try enabling the heuristic options.
__________________
I reject your reality and substitute my own. |
|
|
|
|
|
#7 | |
|
Staff to 4 Cats
Posts: 10,707
Karma: 2485850
Join Date: Aug 2009
Location: The (original) Silicon Valley, USA
Device: Galaxy Tab 2,Black Astak PEz, K4NT(now Wifes)
|
Quote:
It has just been trashed up (with unnecessary tags) by Word. If you set a nice <body class=...>, the mso-normal could disappear. Experiment! (Sigil is a good tool for this.) Rename a class in the CSS (mso-normal -> mXo-normal), leaving the usage in place, and see what happens to your masterpiece.
__________________
Using: Ubuntu(32 bit):Oneric,Precise and XPpro SP3, W7HP(64)- - Libre Office w/Writer2EPUB
|
|
|
|
|
|
|
#8 |
|
Wizard
Posts: 1,417
Karma: 1040308
Join Date: Jan 2009
Device: Kindle, iPad (not used much for reading)
|
You also probably want to change the tags for chapter headings from paragraph tags to an h2 or something, so that they are easy to recognize when generating a TOC, etc. You may want to manually add the special Mobipocket-specific page break tag: <mbp:pagebreak/>
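For anyone with too many files to retag by hand, the heading change suggested above could be scripted. This is a minimal sketch, assuming each chapter heading sits alone in its own <p> element and reads like "Chapter 2", "Prologue" or "Epilogue". The function name `mark_chapters` and the heading pattern are assumptions to adapt, and regex-based HTML editing like this is only reasonable for simple, regular markup.

```python
import re

# A whole <p>...</p> element, including Word's long attribute lists.
PARAGRAPH_RE = re.compile(r"<p\b[^>]*>(.*?)</p>", re.I | re.S)
# Any tag, used to strip inner markup such as <span> from the heading text.
TAG_RE = re.compile(r"<[^>]+>")
# What counts as a chapter heading here (an assumption, adjust as needed).
HEADING_RE = re.compile(r"^(chapter\s+\w+|prologue|epilogue)$", re.I)


def mark_chapters(html):
    """Replace heading-like paragraphs with a page break plus an <h2>."""

    def replace(match):
        text = TAG_RE.sub("", match.group(1))
        # Collapse whitespace and non-breaking-space entities.
        text = " ".join(text.replace("&nbsp;", " ").split())
        if HEADING_RE.match(text):
            return "<mbp:pagebreak/><h2>" + text + "</h2>"
        return match.group(0)  # leave ordinary paragraphs untouched

    return PARAGRAPH_RE.sub(replace, html)


sample = (
    '<p class="MsoNormal" align="center">'
    '<span style="color:black">Chapter 2</span></p>'
    '<p class="MsoNormal"><span>The incredulous look.</span></p>'
)
print(mark_chapters(sample))
```

Only paragraphs whose entire visible text matches the pattern are rewritten, so a sentence that merely mentions "chapter" stays a normal paragraph.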
|
|
|
|
|
|
#9 |
|
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Heuristics should work with that chapter - just enable Heuristics under the conversion options.
|
|
|
|
|
|
#10 |
|
Member
Posts: 23
Karma: 322
Join Date: Jan 2011
Device: Kindle
|
Thanks everyone! Happy to report that chapter detection is working. Manichean, thanks for the suggestions to save as filtered web page. That seems to have done the trick!
Since I have a LARGE quantity of files to convert with Word origins, it would be too time consuming to hand edit the tags for each chapter of every file. But you're right, Word makes a mess of it. Does anyone have a different recommendation for converting text to mobi? My project is that I'm organizing stories for my creative writing group, copy/pasting from our internet pages and then creating mobis. Currently I paste to Word, save as web page, and run the zip through Calibre.
TheDucks - sorry, but you lost me. I'm not that savvy with the lingo.
Thanks again everyone!
~Rach
Last edited by RachDvn; 02-15-2011 at 04:29 PM.
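With a large pile of files, the conversion step itself could be batched with calibre's command-line tool instead of the GUI. The sketch below assumes `ebook-convert` is on the PATH and that `--enable-heuristics` and `--chapter` behave as the matching GUI options do, so check `ebook-convert --help` for your version. The folder names and the helpers `build_command` and `convert_all` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(source, output_dir, chapter_xpath=None):
    """Build one ebook-convert call that writes <name>.mobi into output_dir."""
    source = Path(source)
    target = Path(output_dir) / (source.stem + ".mobi")
    command = ["ebook-convert", str(source), str(target), "--enable-heuristics"]
    if chapter_xpath:
        # Optional custom chapter detection expression.
        command += ["--chapter", chapter_xpath]
    return command


def convert_all(input_dir, output_dir, chapter_xpath=None, dry_run=True):
    """Convert every .htm/.html file in input_dir; only print when dry_run."""
    commands = [
        build_command(path, output_dir, chapter_xpath)
        for path in sorted(Path(input_dir).glob("*.htm*"))
    ]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
    return commands


# Dry run: lists the commands for a hypothetical "stories" folder.
convert_all("stories", "mobi_out", dry_run=True)
```

Starting with `dry_run=True` shows what would be run before anything is converted. Switching it to `False` runs the conversions one after another and stops at the first failure.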
|
|
|
|
|
#11 | |
|
Wizard
Posts: 3,127
Karma: 77366
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
Quote:
__________________
I reject your reality and substitute my own. |
|
|
|
|
|
|
#12 |
|
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Depending on what the web pages look like, you could just save the html directly from the website and load it into Calibre. If you need only a portion of the web pages, you could look at Calibre's recipe framework, as it can grab web pages, extract the relevant portion, and convert that to an ebook (albeit one that uses 'news' features on some readers).
The problem you'll find with text is that you'll lose italics and other formatting with a straight copy/paste to a text editor. If the originals don't have any formatting, that might be an ok option though. If the recipe framework is too complicated for you, another thing you could look at is the Firebug plugin for Firefox. It's still a 'bit' complicated, but it provides you a GUI where you can get to just the relevant html that contains the story and copy just that into a text editor. If that's of interest I can explain in a bit more detail.
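The idea of keeping just the relevant html can also be sketched in a few lines of Python, as a middle ground between a full recipe and copying by hand. The sketch assumes the story sits inside one element with a known `id`, and that the markup inside it is properly nested. The id `story`, the sample page and the class name `ContainerExtractor` are invented for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class ContainerExtractor(HTMLParser):
    """Collect the inner HTML of the first element with the given id."""

    def __init__(self, container_id):
        super().__init__(convert_charrefs=False)
        self.container_id = container_id
        self.depth = 0   # 0 means we are outside the container
        self.parts = []  # pieces of inner HTML, in document order

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.parts.append(self.get_starttag_text())
            if tag not in VOID_TAGS:
                self.depth += 1
        elif dict(attrs).get("id") == self.container_id:
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth:  # still inside: keep the closing tag
            self.parts.append("</" + tag + ">")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append("&" + name + ";")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append("&#" + name + ";")


page = (
    '<html><body><div id="nav">Menu</div>'
    '<div id="story"><p>She <i>smiled</i>.</p><br></div>'
    '<p>Footer</p></body></html>'
)
extractor = ContainerExtractor("story")
extractor.feed(page)
story_html = "".join(extractor.parts)
print(story_html)
```

Because the tags inside the container are copied through as they are, italics and other inline formatting survive, which is the part a plain-text copy/paste would lose.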
|
|
|
|
|
#13 | |
|
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Quote:
|
|
|
|
|
|
|
#14 | |
|
Guru
Posts: 890
Karma: 1089705
Join Date: Dec 2004
Location: Paradise (Key West, FL)
Device: Current:Kindle 3/WiFi Retired:Clie - UX50, T415, ...
|
Quote:
My current pet method for moving Web pages to MOBI format is to save the source while in my browser and edit that before converting with Calibre. My workflow involves using the Opera browser with Notepad++ set as my app for viewing the Source. I simply right-click on a page and select Source from the menu. The source HTML appears in Notepad++ and I then:
However you approach saving the original HTML source, doing so will preserve the formatting (bold, italic, ...).
__________________
----- dwig |
|
|
|
|
|
|
#15 |
|
Member
Posts: 23
Karma: 322
Join Date: Jan 2011
Device: Kindle
|
Thanks so much. Haven't used Sigil before, so I'll play with it as well as trying dwig's method.
~Rach |
|
|
|