[Old Thread] html/zip to mobi not detecting chapter breaks

RachDvn · 02-14-2011, 10:28 AM

I have documents that I created in Word, saved as a web page, and am converting to mobi with Calibre. For the life of me, I can't get the chapters to be detected for a page break!

I posted a while back with a similar problem for azw to mobi, and someone was kind enough to help me edit the XPath for detecting chapters. That code is working beautifully for my azw files, but I'm having no luck with these html files. I've tried selecting the Heuristic option "Detect & markup unformatted chapter headings..." but it hasn't made a difference. I've run a debug and I still can't figure it out.

If anyone has any suggestions, I would be very appreciative!

My current Xpath is:

Code:

//*[((name()='span' or name()='h2') and re:test(., 'chapter|ch|book|section|part|pt|prologue|epilogue\s+', 'i') and (@class = 'bold')) ]

An example section surrounding an undefined Chapter:

Code:

"Alright, what I'd
like to do now is shoot some backlit winter shots, something that might be good
for a January or March scene. I'd like to have you on the skis in a full tuck
position, as if you were rounding a corner on a downhill slope. If you're
comfortable, I'd like to have you strip completely, or if you prefer, I have a
thong you can wear."</span></p>
<p class="MsoNormal" align="center" style="mso-margin-top-alt:auto;mso-margin-bottom-alt: auto;text-align:center;line-height:normal"><span style="font-size:14.0pt; font-family:&quot;Times New Roman&quot;;mso-fareast-font-family:&quot;Times New Roman&quot;; color:black">
</span></p>
<p class="MsoNormal" align="center" style="mso-margin-top-alt:auto;mso-margin-bottom-alt: auto;text-align:center;line-height:normal"><span style="font-size:14.0pt; font-family:&quot;Times New Roman&quot;;mso-fareast-font-family:&quot;Times New Roman&quot;; color:black">Chapter 2</span></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto; line-height:normal"><span style="font-size:14.0pt;font-family:&quot;Times New Roman&quot;; mso-fareast-font-family:&quot;Times New Roman&quot;;color:black">The incredulous look
must have been plain on my face. As she realized how her offer sounded, her
face turned red and she quickly clarified, "Not one of <i>my </i>thongs.
Not that I'm trying to say that I even <i>have</i> thongs," her cheeks
were starting to remind me of the <span class="SpellE"><span class="GramE">claymation</span></span>
Rudolph when his nose cover popped off. "I just mean that I have a brand
new men's thong you can wear and I can Photoshop out the lines on your
hips."

Thanks so much!
~Rach
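
A note on why the XPath above may not fire on this sample: it requires @class = 'bold', but the span holding "Chapter 2" carries only a style attribute and no class. The sketch below is a rough stand-in, not Calibre's own matcher. It uses Python's standard library to apply the same three conditions (tag name, regex, class) to a trimmed copy of the sample line. The names CandidateFinder and explain are made up for illustration. Note also that the bare "ch" alternative matches any text containing those letters, and that \s+ binds only to "epilogue".

```python
import re
from html.parser import HTMLParser

# Same word list as the XPath in the post, applied case-insensitively.
CHAPTER_RE = re.compile(
    r"chapter|ch|book|section|part|pt|prologue|epilogue\s+", re.IGNORECASE
)


class CandidateFinder(HTMLParser):
    """Collect every <span>/<h2> with its class attribute and full text."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of [tag, class, text_parts]
        self.found = []          # (tag, class, text) for closed elements

    def handle_starttag(self, tag, attrs):
        if tag in ("span", "h2"):
            self.open_elements.append([tag, dict(attrs).get("class"), []])

    def handle_data(self, data):
        # Text counts for every enclosing candidate, like XPath's string value.
        for element in self.open_elements:
            element[2].append(data)

    def handle_endtag(self, tag):
        if tag in ("span", "h2") and self.open_elements:
            name, css_class, parts = self.open_elements.pop()
            self.found.append((name, css_class, "".join(parts).strip()))


def explain(html):
    """Report, per candidate element, which XPath conditions it satisfies."""
    finder = CandidateFinder()
    finder.feed(html)
    return [
        {
            "text": text,
            "regex_ok": bool(CHAPTER_RE.search(text)),
            "class_ok": css_class == "bold",
        }
        for _tag, css_class, text in finder.found
    ]


# Trimmed copy of the "Chapter 2" line from the post (attributes shortened).
sample = (
    '<p class="MsoNormal" align="center">'
    '<span style="font-size:14.0pt">Chapter 2</span></p>'
)
print(explain(sample))
```

The regex condition passes and the class condition fails, so dropping the @class test (or matching on something the Word output actually contains) would be the first thing to try.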

Manichean · 02-14-2011, 11:03 AM

Where exactly should the chapter start in your example?

RachDvn · 02-14-2011, 11:09 AM

Where it says "Chapter 2" in the middle of the html garb

JSWolf · 02-14-2011, 11:39 AM

I do have a suggestion.

Take the HTML you got from Word and load it into a text editor such as Notepad++ and clean up the mess Word left in and make it nice clean HTML code and then you can take your chapter headings and make them look like <h2>Chapter 2</h2> and you'll get a good ToC.

The problem is that when you save as a webpage from Word, you get one hell of a mess. Take a look and you will see what I mean. It's not good code at all. I've cleaned up my share of Word's mess and it can take a good while to do so.
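
For anyone who would rather script that cleanup than do it by hand, here is a minimal sketch using only Python's standard library. It assumes the attributes are double-quoted, as in the sample posted above, and it throws away all style, class, align and lang attributes, so centering and font sizes are lost. The function name strip_word_markup is made up for illustration. Inline tags such as <i> are left alone.

```python
import re

# Double-quoted presentational attributes that Word scatters over every tag.
WORD_ATTRS = re.compile(r'\s+(?:style|class|align|lang)="[^"]*"', re.IGNORECASE)

# Once their attributes are gone, the spans carry no information at all.
BARE_SPAN = re.compile(r"</?span>", re.IGNORECASE)


def strip_word_markup(html):
    """Remove Word's presentational attributes and the now-empty spans."""
    without_attrs = WORD_ATTRS.sub("", html)
    return BARE_SPAN.sub("", without_attrs)


heading = (
    '<p class="MsoNormal" align="center" style="line-height:normal">'
    '<span style="font-size:14.0pt; font-family:&quot;Times New Roman&quot;">'
    "Chapter 2</span></p>"
)
print(strip_word_markup(heading))
```

Regexes are a blunt tool for HTML; this is only meant for the very regular markup Word produces, and it is worth checking the result in a browser before converting.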

JSWolf · 02-14-2011, 11:39 AM

Quote:

Originally Posted by RachDvn

Where it says "Chapter 2" in the middle of the html garb

But how much of that garb is HTML and how much of that garb is Word?

Manichean · 02-14-2011, 11:43 AM

Try using filtered HTML when saving from Word. Also try enabling the heuristic options.

theducks · 02-14-2011, 11:57 AM

Quote:

Originally Posted by JSWolf

But how much of that garb is HTML and how much of that garb is Word?

It is all HTML

It has just been trashed up (with unnecessary tags) by Word

If you set a nice <body class=...> the mso-normal could disappear.

Experiment! (Sigil is a good tool for this)
Rename a class in the CSS (mso-normal -> mXo-normal), leaving the usage in place, and see what happens to your masterpiece.

susan_cassidy · 02-14-2011, 12:06 PM

You also probably want to change the tags for Chapter headings from paragraph tags to an h2 or something, so that it is easy to recognize to generate a TOC, etc. You may want to manually add the special Mobipocket-specific page break tag: <mbp:pagebreak/>
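
With many files, both changes can be scripted. The sketch below is a rough illustration, not a tested Calibre recipe. It looks for a paragraph whose only text is "Chapter N", "Prologue" or "Epilogue", however many Word spans wrap it, and replaces it with <mbp:pagebreak/> followed by an <h2>. The name mark_chapters and the heading word list are assumptions, and whether Calibre's HTML input honours the Mobipocket tag is not confirmed anywhere in this thread.

```python
import re

# A whole <p>...</p> whose only visible text is a chapter-style heading.
HEADING_P = re.compile(
    r"<p\b[^>]*>\s*(?:<span\b[^>]*>\s*)*"      # opening <p> and any spans
    r"((?:chapter\s+\w+)|prologue|epilogue)"   # the heading text itself
    r"\s*(?:</span>\s*)*</p>",                 # closing spans and </p>
    re.IGNORECASE,
)


def mark_chapters(html):
    """Turn Word-style chapter paragraphs into a page break plus <h2>."""
    return HEADING_P.sub(
        lambda m: "<mbp:pagebreak/><h2>" + m.group(1) + "</h2>", html
    )


page = (
    '<p class="MsoNormal" align="center"><span style="color:black">'
    "Chapter 2</span></p>"
    '<p class="MsoNormal"><span style="color:black">'
    "The incredulous look must have been plain on my face.</span></p>"
)
print(mark_chapters(page))
```

Ordinary paragraphs that merely begin with the word "Chapter" are left alone, because the pattern demands that the closing tags follow the heading text directly.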

ldolse · 02-14-2011, 01:45 PM

Heuristics should work with that chapter - just enable Heuristics under the conversion options.

RachDvn · 02-15-2011, 04:09 PM

Thanks everyone! Happy to report that chapter detection is working. Manichean, thanks for the suggestions to save as filtered web page. That seems to have done the trick!

Since I have a LARGE quantity of files to convert with Word origins, it would be too time consuming to hand edit the tags for each chapter of every file. But you're right, Word makes a mess of it.

Does anyone have a different recommendation for converting text to mobi? My project is that I'm organizing stories for my creative writing group, copy/pasting from our internet pages and then creating mobi's. Currently I paste to Word, save as web page, run the zip through Calibre.

TheDucks - sorry, but you lost me.

I'm not that savvy with the lingo.

Thanks again everyone!
~Rach
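
For a large batch, Calibre's command-line converter, ebook-convert, can be driven from a short script instead of the GUI. This is a sketch under assumptions: ebook-convert is on the PATH, the Word files were saved as filtered HTML into one folder, and chapter headings end up as <h2> elements. The folder layout and the function names are hypothetical.

```python
import subprocess
from pathlib import Path

# XPath in the same name()= style used earlier in the thread: every <h2>.
CHAPTER_XPATH = "//*[name()='h2']"


def build_command(source, out_dir):
    """Return the ebook-convert argument list for one HTML file."""
    source = Path(source)
    target = Path(out_dir) / (source.stem + ".mobi")
    return [
        "ebook-convert", str(source), str(target),
        "--enable-heuristics",          # turn on heuristic processing
        "--chapter", CHAPTER_XPATH,     # where chapters start
        "--chapter-mark", "pagebreak",  # start each chapter on a new page
    ]


def convert_folder(in_dir, out_dir):
    """Convert every .htm/.html file found directly inside in_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for source in sorted(Path(in_dir).glob("*.htm*")):
        subprocess.run(build_command(source, out_dir), check=True)


# Example call (folder names are placeholders):
# convert_folder("filtered_html", "mobi_out")
```

Building the argument list in its own function makes it easy to print and inspect the command before running it on the whole folder.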

Manichean · 02-15-2011, 05:02 PM

Quote:

Originally Posted by RachDvn

Does anyone have a different recommendation for converting text to mobi? My project is that I'm organizing stories for my creative writing group, copy/pasting from our internet pages and then creating mobi's. Currently I paste to Word, save as web page, run the zip through Calibre.

Try using Sigil instead of Word. It creates ePubs, which should convert pretty easily to Mobi.

ldolse · 02-15-2011, 05:05 PM

Depending on what the web pages look like, you could just save the html directly from the website and load it into Calibre. If you need only a portion of the web pages, you could look at Calibre's recipe framework, as it can grab web pages, extract the relevant portion, and convert that to an ebook (albeit one that uses 'news' features on some readers).

The problem you'll find with text is that you'll lose italics and other formatting with a straight copy/paste to a text editor. If the originals don't have any formatting that might be an ok option though.

If the recipe framework is too complicated for you, another thing you could look at is the Firebug plugin for Firefox. It's still a 'bit' complicated, but it gives you a GUI where you can get to just the relevant html that contains the story and copy just that into a text editor. If that's of interest I can explain in a bit more detail.

ldolse · 02-15-2011, 05:06 PM

Quote:

Originally Posted by Manichean

Try using Sigil instead of Word. It creates ePubs, which should convert pretty easily to Mobi.

Forgot about that - I believe this should preserve italics/etc, so this might be easiest.

dwig · 02-15-2011, 05:49 PM

Quote:

Originally Posted by ldolse

Forgot about that - I believe this should preserve italics/etc, so this might be easiest.

I think you'll find that you'll still lose the formatting, though I do agree that Sigil will prove preferable to Word.

My current pet method for moving Web pages to MOBI format is to save the source while in my browser and edit that before converting with Calibre.

My workflow involves using Opera browser with Notepad++ set as my app for viewing the Source. I simply rightclick on a page and select Source from the menu. The source HTML appears in Notepad++ and I then:

save it as HTML; Notepad++ will then color code the tags.
make the basic edits and resave
then import into Calibre.
(sometimes) convert to ePub and do further edits in Sigil
convert to MOBI

However you approach saving the original HTML source, doing so will preserve the formatting (bold, italic, ...).

RachDvn · 02-16-2011, 08:43 AM

Thanks so much. Haven't used Sigil before, so I'll play with it as well as trying dwig's method.

~Rach
