#1 |
Resident Curmudgeon
Posts: 79,008
Karma: 144284074
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Best way to get clean HTML
I'm wondering whether there is a way to get clean HTML out of a Mobipocket eBook, without all the junk you get in a Mobipocket HTML file.
Every paragraph is loaded with junk. What I'd like is for this junk to be converted into CSS somehow, so I could edit the CSS instead of fooling around with all the inline junk. Here is an example of what I mean... Code:
<div style="margin-top: 6"/><div style="text-indent: 1em"><font size="3">“What I was going to ask your boss, Charley, is if there is some good reason you can’t go to Buenos Aires right now.”</font></div><div style="margin-top: 6"/>
|
#2 |
Feedbooks.com Co-Founder
Posts: 2,263
Karma: 145123
Join Date: Nov 2006
Location: Paris, France
Device: Sony PRS-t-1/350/300/500/505/600/700, Nexus S, iPad
|
HTML Tidy, maybe?
|
#3 |
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköping, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
|
|
#4 |
Feedbooks.com Co-Founder
Posts: 2,263
Karma: 145123
Join Date: Nov 2006
Location: Paris, France
Device: Sony PRS-t-1/350/300/500/505/600/700, Nexus S, iPad
|
|
#5 |
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköping, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
|
|
#6 |
Resident Curmudgeon
Posts: 79,008
Karma: 144284074
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
I think it's time Mobipocket & AZW both went away. The mess they make of well-formatted HTML is not nice.
|
#7 |
Zealot
Posts: 149
Karma: 937
Join Date: Mar 2009
Device: Kindle Paperwhite (10th Gen)
|
|
#8 | |
Wizard
Posts: 3,671
Karma: 12205348
Join Date: Mar 2008
Device: Galaxy S, Nook w/CM7
|
Quote:
=X= |
|
#9 |
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköping, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
|
|
#10 |
Guru
Posts: 976
Karma: 687
Join Date: Nov 2007
Device: Dell X51v; iLiad v2
|
This problem has been bothering me for some time, too. I just want a clean HTML file with simple HTML tags, such as <H1>, <H2>, and <P>, from an MS Word file.
|
#11 |
Wizard
Posts: 3,462
Karma: 10484861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
|
HTML Tidy and demoroniser
|
#12 |
frumious Bandersnatch
Posts: 7,543
Karma: 19001583
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Even so, the amount of junk you have to use (and which Mobipocket readers recognize) is very limited. The normal <P>, <DIV>, <I>, etc. tags plus attributes like WIDTH, HEIGHT and ALIGN are often enough. Add <FONT> with SIZE and COLOR and I think that's about the only junk needed.
|
#13 |
Sir Penguin of Edinburgh
Posts: 12,375
Karma: 23555235
Join Date: Apr 2007
Location: DC Metro area
Device: Shake a stick plus 1
|
If the input and output are consistent, I could write a specific cleanup program for it. Anyone interested?
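One possible shape for such a cleanup program, as a rough sketch only: it assumes the junk arrives as double-quoted inline `style="..."` attributes on elements that have no `class` attribute yet, and it gives every distinct style a generated class name. The function name `hoist_styles` and the `s1`, `s2`, ... class names are made up for illustration.

```python
import re

# Matches a double-quoted inline style attribute, including its leading whitespace.
STYLE_ATTR = re.compile(r'\sstyle="([^"]*)"')


def hoist_styles(html: str):
    """Swap each distinct inline style for a generated class.

    Returns (new_html, css). Assumes no element already carries a class
    attribute; a real converter would need an HTML parser for that case.
    """
    classes = {}  # style text -> generated class name

    def swap(match):
        style = match.group(1).strip().rstrip(";")
        # Reuse the class if this exact style text was seen before.
        name = classes.setdefault(style, f"s{len(classes) + 1}")
        return f' class="{name}"'

    new_html = STYLE_ATTR.sub(swap, html)
    css = "\n".join(f".{name} {{ {style}; }}" for style, name in classes.items())
    return new_html, css


sample = (
    '<div style="margin-top: 6"/>'
    '<div style="text-indent: 1em">Hello.</div>'
    '<div style="margin-top: 6"/>'
)
cleaned, stylesheet = hoist_styles(sample)
print(cleaned)
print(stylesheet)
```

The generated class names are meaningless, so the next step would be renaming them by hand in the stylesheet. The `<font>` tags are left alone here and would need their own rule.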
|
#14 |
Grand Sorcerer
Posts: 9,707
Karma: 32763414
Join Date: Dec 2008
Location: Krewerd
Device: Pocketbook Inkpad 4 Color; Samsung Galaxy Tab S6
|
I generally use regular expression search and replace...
To take your example (I replaced your curly quote characters with plain quotes for readability): Code:
<div style="margin-top: 6"/>
<div style="text-indent: 1em"><font size="3">"What I was going to ask your boss, Charley, is if there is some good reason you can't go to Buenos Aires right now."</font></div>
<div style="margin-top: 6"/>
Code:
.emptyLine { margin-top: 6em; }
p { text-indent: 1em; font-size: medium; }
<div style="margin-top: 6" /> would be replaced with <div class="emptyLine" />
<div style="text-indent: 1em"><font size="3"> would be replaced with <p>
</font></div> would be replaced with </p>
I generally start with headers and other exceptions (there are generally fewer headers than paragraphs).
|
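For anyone who prefers a script to an editor's search-and-replace dialog, the three replacements above could be sketched in Python roughly as follows. This assumes the Mobipocket output is as regular as the sample; the `RULES` table and the `clean` function are names made up for this sketch, and the CSS rules still have to be put in a stylesheet separately.

```python
import re

# Hypothetical rule table mirroring the three replacements described above.
RULES = [
    # Empty spacer div -> classed div (tolerates an optional space before "/>").
    (re.compile(r'<div style="margin-top: 6"\s*/>'), '<div class="emptyLine" />'),
    # Indented paragraph opener -> <p>
    (re.compile(r'<div style="text-indent: 1em"><font size="3">'), '<p>'),
    # Matching closer -> </p>
    (re.compile(r'</font></div>'), '</p>'),
]


def clean(html: str) -> str:
    """Apply every rule in order and return the rewritten markup."""
    for pattern, replacement in RULES:
        html = pattern.sub(replacement, html)
    return html


sample = (
    '<div style="margin-top: 6"/>'
    '<div style="text-indent: 1em"><font size="3">Hello.</font></div>'
    '<div style="margin-top: 6"/>'
)
print(clean(sample))
```

Headers and other exceptions would get their own, more specific rules placed ahead of the paragraph rules, so the generic `</font></div>` pattern does not swallow them first.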
#15 |
Wizard
Posts: 3,671
Karma: 12205348
Join Date: Mar 2008
Device: Galaxy S, Nook w/CM7
|
|