| 
			
			 | 
		#1 | 
| 
			
			
			
			 Resident Curmudgeon 
			
			Posts: 80,782 
				Karma: 150249619 
				Join Date: Nov 2006 
				Location: Roslindale, Massachusetts 
				
				
				Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3 
				
				
				 | 
	
	
	
		
		
			
			 
				
				Best way to get clean HTML
			 
			
			
			I'm wondering whether there is a way to get clean HTML out of a Mobipocket eBook, without all the junk you get in a Mobipocket HTML file. 

	Every paragraph is loaded with junk. What I'd like is for this junk to be converted into CSS somehow, so that I could edit the CSS instead of having to fool around with the inline markup. Here is an example of what I mean... Code: 
	<div style="margin-top: 6"/><div style="text-indent: 1em"><font size="3">“What I was going to ask your boss, Charley, is if there is some good reason you can’t go to Buenos Aires right now.”</font></div><div style="margin-top: 6"/>  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#2 | 
| 
			
			
			
			 Feedbooks.com Co-Founder 
			
			Posts: 2,263 
				Karma: 145123 
				Join Date: Nov 2006 
				Location: Paris, France 
				
				
				Device: Sony PRS-t-1/350/300/500/505/600/700, Nexus S, iPad 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			HTML Tidy, maybe?
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#3 | 
| 
			
			
			
			 Grand Sorcerer 
			
			Posts: 7,452 
				Karma: 7185064 
				Join Date: Oct 2007 
				Location: Linköping, Sweden 
				
				
				Device: Kindle Voyage, Nexus 5, Kindle PW 
				
				
				 | 
	
	|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#4 | 
| 
			
			
			
			 Feedbooks.com Co-Founder 
			
			Posts: 2,263 
				Karma: 145123 
				Join Date: Nov 2006 
				Location: Paris, France 
				
				
				Device: Sony PRS-t-1/350/300/500/505/600/700, Nexus S, iPad 
				
				
				 | 
	
	|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#5 | 
| 
			
			
			
			 Grand Sorcerer 
			
			Posts: 7,452 
				Karma: 7185064 
				Join Date: Oct 2007 
				Location: Linköping, Sweden 
				
				
				Device: Kindle Voyage, Nexus 5, Kindle PW 
				
				
				 | 
	
	|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#6 | 
| 
			
			
			
			 Resident Curmudgeon 
			
			Posts: 80,782 
				Karma: 150249619 
				Join Date: Nov 2006 
				Location: Roslindale, Massachusetts 
				
				
				Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			I think it's time Mobipocket & AZW both went away. The mess they make of well-formatted HTML is not nice.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#7 | 
| 
			
			
			
			 Zealot 
			
			Posts: 149 
				Karma: 937 
				Join Date: Mar 2009 
				
				
				
				Device: Kindle Paperwhite (10th Gen) 
				
				
				 | 
	
	|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#8 | |
| 
			
			
			
			 Wizard 
			
			Posts: 3,671 
				Karma: 12205348 
				Join Date: Mar 2008 
				
				
				
				Device: Galaxy S, Nook w/CM7 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
 =X=  | 
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#9 | 
| 
			
			
			
			 Grand Sorcerer 
			
			Posts: 7,452 
				Karma: 7185064 
				Join Date: Oct 2007 
				Location: Linköping, Sweden 
				
				
				Device: Kindle Voyage, Nexus 5, Kindle PW 
				
				
				 | 
	
	|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#10 | 
| 
			
			
			
			 Guru 
			
			Posts: 976 
				Karma: 687 
				Join Date: Nov 2007 
				
				
				
				Device: Dell X51v; iLiad v2 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			This problem has been bothering me for some time, too. I just want a clean HTML file with simple HTML tags, such as <H1>, <H2>, <P>, from an MS Word file.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#11 | 
| 
			
			
			
			 Wizard 
			
			Posts: 3,465 
				Karma: 10684861 
				Join Date: May 2006 
				
				
				
				Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			HTML Tidy and demoroniser  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#12 | 
| 
			
			
			
			 frumious Bandersnatch 
			
			Posts: 7,570 
				Karma: 20150435 
				Join Date: Jan 2008 
				Location: Spaniard in Sweden 
				
				
				Device: Cybook Orizon, Kobo Aura 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Even so, the amount of junk you have to use (and that Mobipocket readers recognize) is very limited. The normal <P>, <DIV>, <I>, etc. tags plus attributes like WIDTH, HEIGHT and ALIGN are often enough. Add <FONT> with SIZE and COLOR and I think that's about the only junk needed.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#13 | 
| 
			
			
			
			 Sir Penguin of Edinburgh 
			
			Posts: 12,375 
				Karma: 23555235 
				Join Date: Apr 2007 
				Location: DC Metro area 
				
				
				Device: Shake a stick plus 1 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			If the input and output are consistent, I could write a specific cleanup program for it. Anyone interested?
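
			As a rough idea of what such a cleanup program could look like, here is a minimal Python sketch. It assumes the inline styles are consistent across the file, that every `style` attribute uses double quotes, and that no element already carries a `class` attribute. The function name `styles_to_classes` and the generated class names (`s1`, `s2`, ...) are made up for illustration; this is not an existing tool.

```python
import re

# Matches a double-quoted inline style attribute, e.g. style="text-indent: 1em"
STYLE_ATTR = re.compile(r'style="([^"]*)"')


def styles_to_classes(html: str, prefix: str = "s"):
    """Replace each distinct inline style with a generated class name.

    Returns the rewritten HTML and a stylesheet string with one rule
    per distinct style, in the order the styles were first seen.
    """
    classes = {}  # style text -> generated class name

    def swap(match):
        style = match.group(1).strip()
        # Reuse the class if this exact style text was seen before.
        name = classes.setdefault(style, f"{prefix}{len(classes) + 1}")
        return f'class="{name}"'

    cleaned = STYLE_ATTR.sub(swap, html)
    css = "\n".join(f".{name} {{ {style} }}" for style, name in classes.items())
    return cleaned, css


if __name__ == "__main__":
    sample = ('<div style="margin-top: 6"/>'
              '<div style="text-indent: 1em">Some text.</div>'
              '<div style="margin-top: 6"/>')
    html, css = styles_to_classes(sample)
    print(html)
    print(css)
```

			The generated rules copy the style text verbatim, so a unitless value such as `margin-top: 6` would still need a unit added by hand, and the `<font>` tags are left alone.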
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#14 | 
| 
			
			
			
			 Grand Sorcerer 
			
			Posts: 9,707 
				Karma: 32763414 
				Join Date: Dec 2008 
				Location: Krewerd 
				
				
				Device: Pocketbook Inkpad 4 Color; Samsung Galaxy Tab S6 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			I generally use regular expression search and replace... 

	To take your example (I replaced your weird characters with plain quotes for readability): Code: 
	<div style="margin-top: 6"/>
	<div style="text-indent: 1em"><font size="3">"What I was going to ask your boss, Charley, is if there is some good reason you can't go to Buenos Aires right now."</font></div>
	<div style="margin-top: 6"/>

	.emptyLine { margin-top: 6em; }
	p { text-indent: 1em; font-size: medium; }

	<div style="margin-top: 6" /> would be replaced with <div class="emptyLine" />
	<div style="text-indent: 1em"><font size="3"> would be replaced with <p>
	</font></div> would be replaced with </p>

	I generally start with headers and other exceptions (there are generally fewer headers than paragraphs). Then I create an epub out of it, check it, fix any errors and repeat the checking process until it's clean.
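
	The three replacements described above could be scripted roughly as follows. This is only a sketch in Python using the standard `re` module. The rule list and the `clean` function are hypothetical names, and the patterns assume the Mobipocket output looks exactly like the example (same attribute order, double quotes, `<font size="3">` directly inside the indented `<div>`).

```python
import re

# Ordered (pattern, replacement) pairs, one per substitution in the post.
RULES = [
    # Spacer div, with or without a space before "/>"
    (re.compile(r'<div style="margin-top: 6"\s*/>'), '<div class="emptyLine" />'),
    # Opening wrapper of an indented paragraph
    (re.compile(r'<div style="text-indent: 1em">\s*<font size="3">'), '<p>'),
    # Closing wrapper of an indented paragraph
    (re.compile(r'</font>\s*</div>'), '</p>'),
]


def clean(html: str) -> str:
    """Apply every rule in order and return the cleaned markup."""
    for pattern, replacement in RULES:
        html = pattern.sub(replacement, html)
    return html


if __name__ == "__main__":
    sample = ('<div style="margin-top: 6"/>'
              '<div style="text-indent: 1em"><font size="3">"Some text."</font></div>'
              '<div style="margin-top: 6"/>')
    print(clean(sample))
```

	The closing rule rewrites every `</font></div>` pair, including ones that belong to headers or other exceptions, which is one reason to handle those exceptions first, as described above.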
		 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#15 | 
| 
			
			
			
			 Wizard 
			
			Posts: 3,671 
				Karma: 12205348 
				Join Date: Mar 2008 
				
				
				
				Device: Galaxy S, Nook w/CM7 
				
				
				 | 
	
	|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 