Best way to get clean HTML


JSWolf
03-29-2009, 05:18 PM
I'm wondering if there is a way to get clean HTML out of a Mobipocket eBook, without all the junk you get from a Mobipocket HTML file.

Every paragraph is loaded with junk. What I'd like is if this junk could somehow be converted into CSS so it would be easy to edit the CSS instead of having to fool around with all the junk.

Here is an example of what I mean...

<div style="margin-top: 6"/><div style="text-indent: 1em"><font size="3">“What I was going to ask your boss, Charley, is if there is some good reason you can’t go to Buenos Aires right now.”</font></div><div style="margin-top: 6"/>

Hadrien
03-29-2009, 06:13 PM
TidyHTML maybe ?

tompe
03-29-2009, 06:37 PM
I'm wondering if there is a way to get clean HTML out of a Mobipocket eBook, without all the junk you get from a Mobipocket HTML file.


If I remember correctly this depends on the book. You do not get it for all books so it is nothing inherent in the format.

Hadrien
03-29-2009, 06:44 PM
If I remember correctly this depends on the book. You do not get it for all books so it is nothing inherent in the format.

I disagree, for some things you do not have much of a choice and you need to use junk with Mobipocket.

tompe
03-31-2009, 07:55 AM
I disagree, for some things you do not have much of a choice and you need to use junk with Mobipocket.

Yes, if you want to force a specific formatting. But for a straightforward book formatted as a standard paperback you should be able to use clean html.

JSWolf
03-31-2009, 08:08 AM
I think it's time Mobipocket & AZW all went away. The mess they make of well-formatted HTML is not nice.

All4Fun
03-31-2009, 02:21 PM
TidyHTML maybe ?

So will TidyHTML do the trick?

=X=
03-31-2009, 04:53 PM
... What I'd like is if this junk could somehow be converted into CSS so it would be easy to edit the CSS ...
Yes, there is a tool called cssutils (http://code.google.com/p/cssutils/) that parses out the styles in HTML and creates a CSS file.
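
A minimal sketch of the general idea, in Python with only the standard library (this is not cssutils and does not use its API). It assumes every inline style is written as style="..." with double quotes, and that the elements carry no class attribute already. The function name and the generated class names (s1, s2, ...) are made up for illustration.

```python
import re

# Matches double-quoted inline style attributes only (an assumption of this sketch).
STYLE_ATTR = re.compile(r'style="([^"]*)"')

def extract_inline_styles(html):
    """Swap each distinct style="..." attribute for a generated class.

    Returns (new_html, css_text). Identical declarations share one class.
    """
    classes = {}  # normalized declaration text -> generated class name

    def to_class(match):
        # Normalize spacing and trailing semicolons so equal styles compare equal.
        parts = [p.strip() for p in match.group(1).split(";") if p.strip()]
        decl = "; ".join(parts)
        name = classes.setdefault(decl, "s%d" % (len(classes) + 1))
        return 'class="%s"' % name

    new_html = STYLE_ATTR.sub(to_class, html)
    css = "\n".join(".%s { %s; }" % (name, decl) for decl, name in classes.items())
    return new_html, css

if __name__ == "__main__":
    sample = ('<div style="margin-top: 6"/>'
              '<div style="text-indent: 1em"><font size="3">Hi</font></div>'
              '<div style="margin-top: 6"/>')
    cleaned, stylesheet = extract_inline_styles(sample)
    print(cleaned)
    print(stylesheet)
```

The declarations are copied as-is, so a unitless value like "margin-top: 6" lands in the stylesheet unchanged and still has to be fixed by hand. The <font> tags are not touched either.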

=X=

tompe
03-31-2009, 04:56 PM
I think it's time Mobipocket & AZW all went away. The mess they make of well-formatted HTML is not nice.

Yes, a good idea. DRM stopping working on all MobiPocket files will be the final death of DRM.

ericshliao
04-01-2009, 04:45 AM
I'm wondering if there is a way to get clean HTML out of a Mobipocket eBook, without all the junk you get from a Mobipocket HTML file.

It's a problem that has been bothering me for some time, too. I just want a clean HTML file with simple HTML tags, such as <H1>, <H2>, <P>, from an MS Word file.

kacir
04-01-2009, 06:00 AM
html Tidy
and
demoroniser

Jellby
04-01-2009, 07:32 AM
I disagree, for some things you do not have much of a choice and you need to use junk with Mobipocket.

Even so, the amount of junk you have to use (and which is recognized by mobipocket readers) is very limited. The use of the normal <P>, <DIV>, <I>, etc. tags plus properties like WIDTH, HEIGHT and ALIGN is often enough. Add <FONT> with SIZE and COLOR and I think that's about the only needed junk.

Nate the great
04-01-2009, 07:57 AM
If the input and output are consistent, I could write a specific cleanup program for it. Anyone interested?

Sweetpea
04-01-2009, 09:34 AM
I generally use regular expression search and replace...

To take your example (replaced your weird characters with the quotes for readability):

<div style="margin-top: 6"/>
<div style="text-indent: 1em"><font size="3">"What I was going to ask your boss, Charley, is if there is some good reason you can't go to Buenos Aires right now."</font></div>
<div style="margin-top: 6"/>

in my style:

.emptyLine { margin-top: 6em; }
p { text-indent: 1em; font-size: medium; }


<div style="margin-top: 6" /> would be replaced with <div class="emptyLine" />
<div style="text-indent: 1em"><font size="3"> would be replaced with <p>
</font></div> would be replaced by </p>
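
The three replacements above can be sketched as a small Python script using the standard re module. The rule list and the function name clean are only for illustration; the patterns are written for this one example and would need adjusting for another book.

```python
import re

# (pattern, replacement) pairs mirroring the three replacements listed above.
# \s* lets the first rule match both "6"/>" and "6" />".
RULES = [
    (re.compile(r'<div style="margin-top: 6"\s*/>'), '<div class="emptyLine" />'),
    (re.compile(r'<div style="text-indent: 1em"><font size="3">'), '<p>'),
    (re.compile(r'</font></div>'), '</p>'),
]

def clean(html):
    """Apply every rule in order and return the rewritten HTML."""
    for pattern, replacement in RULES:
        html = pattern.sub(replacement, html)
    return html

if __name__ == "__main__":
    junk = ('<div style="margin-top: 6"/>'
            '<div style="text-indent: 1em"><font size="3">"What I was going to ask."</font></div>'
            '<div style="margin-top: 6"/>')
    print(clean(junk))
```

Note that the last rule turns every </font></div> into </p>, even one whose opening tags did not match the second rule, so any other <div><font> combinations have to be handled before these rules run.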



I generally start with headers and other exceptions (there are fewer headers than paragraphs, generally :p). Then I create an epub out of it, check it, fix any errors and repeat the checking process until it's clean.

=X=
04-01-2009, 12:31 PM
If the input and output are consistent, I could write a specific cleanup program for it. Anyone interested?

Yes I would be very interested.

=X=

Nate the great
04-01-2009, 02:13 PM
Yes I would be very interested.

=X=

If you will post a before and after, I'll take a look.

brewt
04-02-2009, 12:44 AM
Did someone say Word?

(Bahoo-hoohoo-haha-haha).

"Clean" html from Word isn't all that possible. Now, this isn't to say one can't use word to produce "viable" files that can (and do) convert well into ebook formats. But "Clean"? Noo, not in my observations.

Personally, I got over being clean. Most of the time I am happy to let Word mangle the styles it wants to embed as "css" into the html file all it wants. There's just waaay too much other usefulness in Word to overcome my fear of evil.

Saving the file as [Web Page, Filtered] goes a long way toward stripping out the extra junk Word generically implants. That junk is all well and fine if you need to reconstruct an actual Word document, with all of Word's formatting tricks intact, from the html file. That isn't the usual goal here: MobiCreator, if you import a real Word document, converts it to a filtered html file before it converts it to a mobi file.

If you just have to use CSS in Word, remember, it's CSS 1 only, and there are weirdnesses in CSS you can't construct properly using Word (see my thread in the epub forum about using CSS in Word to make a drop cap work in an epub: it doesn't look like a drop cap in Word, but it works out ok in the epub). And Word is all too anxious to over-impose changes into the embedded overlay of the html file. Just TRY to redefine "Normal" in CSS without forcing it into normal.dot and see what happens in your html file.

Unless you intend to hand-re-edit the htm file after you've made it in Word, what do you care if it's "clean"? Is it oh-so-much smaller? Is it really worth your time? Wouldn't better metatags be more useful in the long run when the formats change (again)? Or more care toward managing your stylesets and where to use them?

pennypenny.

-bjc

p.s. be sure to check in on the word document properties - you might be surprised that Word could be embedding your work computer name, company name, logon name, things you really maybe don't want in the html meta-info. If, you know, you use work machines to do any of this.
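
A rough way to check a saved file for this kind of information is to search it for the document-property elements. The Python sketch below assumes the properties show up as <o:Author>, <o:LastAuthor> and <o:Company> elements; that list is an assumption and may not match what your version of Word writes, so treat an empty result as "nothing found by this check", not as proof the file is clean. The function name is made up.

```python
import re

# Assumed element names; verify against the HTML your Word version produces.
SUSPECT_TAGS = ("o:Author", "o:LastAuthor", "o:Company")

def find_identifying_metadata(html):
    """Return (tag, value) pairs for each suspect element found in the text."""
    found = []
    for tag in SUSPECT_TAGS:
        name = re.escape(tag)
        pattern = re.compile(r"<%s>(.*?)</%s>" % (name, name), re.S | re.I)
        for value in pattern.findall(html):
            found.append((tag, value.strip()))
    return found

if __name__ == "__main__":
    page = ("<xml><o:DocumentProperties>"
            "<o:Author>jdoe</o:Author><o:Company>Example Corp</o:Company>"
            "</o:DocumentProperties></xml>")
    for tag, value in find_identifying_metadata(page):
        print(tag, "=", value)
```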

Sweetpea
04-02-2009, 03:22 AM
Unless you intend to hand-re-edit the htm file after you've made it in Word, what do you care if it's "clean"? Is it oh-so-much smaller? Is it really worth your time? Wouldn't better metatags be more useful in the long run when the formats change (again)? Or more care toward managing your stylesets and where to use them?

Actually, I sometimes halve the size of HTML files generated by Word. So, personally, I wouldn't touch Word with a stick if I have to finish with HTML files.

brewt
04-02-2009, 11:00 AM
Not to pick on Sweetpea, but let's try something.

In the attached test.zip are html files of Sweetpea's post conjured by copying and pasting into Word, and the resultant mobi files.

I saved them as [Full Web Page], and [Web Page, Filtered] out of Word. Sure enough, the html file for [Full Web Page] is twice as big as the [Web Page, Filtered] file.

Funny thing: When I try to open the files in a browser, in the [Full Web Page] file I can't see the picture. Same thing in the mobi files - that's why the mobi file for [Full Web Page] is smaller.

But, when I look at the html code in the filtered file, it's not too bad. The style names are longer than [h1] etc., and since the styles straight off the web site are being expressed as modifications of existent styles on the fly, sure, we could trim out some file size by defining the styles better. How much time do I have to do that? (zilch)

But give up Word because we hate evil so bad? Not a chance.....in Notepad (or vi, or textpad, ted, whatever) I get to miss out on Selecting by Style, search and replace on invisible characters like hard vs soft carriage returns (do I remember the code? is this western or unicode?), grammar check, spell check, multi-columns, tables, picture embedding by drag & drop, MACROS, automated TOCs, just to scratch the surface.

To make a toc in Notepad, I get to hand code it.
To make a table in Notepad, I get to hand code it.
To embed a picture in Notepad, I get to hand code it.
To change all instances of a style in Notepad, I get to search and replace.
I get to be the spellchecker/grammar check (semi colon rules, anyone?)

I can go on all day. I'd rather have the machine assist me through my own ineptitudes than say "oh, i don't need any help here" and do it the hard way.....being all lazy and all as I am......:)

-bjc