10-02-2010, 10:57 AM | #1 |
Zennist
Posts: 1,022
Karma: 47809468
Join Date: Jul 2010
Device: iPod Touch, Sony PRS-350, Nook HD+ & HD
|
MS Word "crap" at beginning of html files
Is there an easy way to clean up the styling "crap" that Word puts at the top of all html files? Like maybe through a regex search and replace?
Why does Word put a gazillion font items at the start of html files even when those fonts are not being used in the file? It makes it much harder to edit the files in Sigil or any other editor. I manually selected the Word stuff in one file and deleted it and there didn't seem to be any negative impact, so I assume it's safe to delete it. But how does one do it for every file? I try to stay away from Word for epubs, but I find if a file needs heavy editing and also needs a new TOC, it's the easiest thing to use. I save my files as filtered html. Maybe if I run the file through another utility first it will clean up that "crap"? Anyone have any suggestions? |
10-02-2010, 11:04 AM | #2 |
Dylanologist
Posts: 200
Karma: 146754
Join Date: Apr 2010
Location: Hanover, New Hampshire, USA
Device: none/all/any
|
PatNY - I use Word but never its HTML. After I do the editing and add the tags I want in Word, I clear all the formatting, select all, copy, and paste the text into a simple text editor, where I save the file as Unicode (UTF-8) with an HTML extension. I then open this file in Sigil and finish my ePub work there.
Does this idea help? - Fabe |
10-02-2010, 11:30 AM | #3 |
Zennist
Posts: 1,022
Karma: 47809468
Join Date: Jul 2010
Device: iPod Touch, Sony PRS-350, Nook HD+ & HD
|
Fabe, I am not sure what you are suggesting. Do you mean you add html code to the text while in Word? If so, that would not be easy for me as I am not very experienced with html coding.
The reason I use Word is because of its style gallery and easy way of applying styles (headers) which is a snap. The document map pane also makes it easy to navigate, and I like its user-friendly search and replace. But if I use the gallery to apply header styles, then clearing all the formatting as you suggest will undo that, correct? Also, some html books I have are linked to image files. If I do what you are suggesting, won't I lose all those links? BTW, I love your part of the country. I was in Hanover just this summer and had a wonderful time. It's beautiful up there. |
10-02-2010, 11:57 AM | #4 | |
Dylanologist
Posts: 200
Karma: 146754
Join Date: Apr 2010
Location: Hanover, New Hampshire, USA
Device: none/all/any
|
Quote:
1. Yes, I add some of the HTML code in Word first. A prime example is paragraph marks.
Example: Here is a body of text pasted into Word and made generic by clearing all formatting. Here is a body of text pasted into Word and made generic by clearing all formatting. Here is a body of text pasted into Word and made generic by clearing all formatting.
Here is the same text with paragraph markings added with Find and Replace:
<p>Here is a body of text pasted into Word and made generic by clearing all formatting. Here is a body of text pasted into Word and made generic by clearing all formatting. Here is a body of text pasted into Word and made generic by clearing all formatting.</p>
Find = ^p
Replace = </p>^p^p<p>
2. Yes, I cannot argue with the ease of use, but Word was designed for the printed page, with HTML web pages added as an afterthought, hence the "crap." Word does not generate "good" html for Sigil to turn into ePubs.
3. Yes, I primarily use Word for its Find & Replace sophistication. Several people have urged me to use TextWrangler instead. It is a fine program, but I feel like I need an engineering degree to use it, so I stick with Word.
4. Yes, "clear formatting" will undo styles.
5. I handle graphics in Sigil. But I must say, I stay away from embedded graphics as much as possible. - Fabe |
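For anyone who would rather script the Find and Replace step above than do it in Word, here is a rough Python sketch of the same idea. It assumes the text has already been saved as plain text with one paragraph per line (which is roughly what Word's ^p marks turn into); the function name wrap_paragraphs is made up for illustration.

```python
def wrap_paragraphs(text):
    """Wrap every non-empty line in <p>...</p>, with a blank line between paragraphs.

    Assumption: each line of the input is one paragraph, as in a plain-text
    export where every Word paragraph mark (^p) became a line break.
    Tags already typed into the text are left untouched (nothing is escaped).
    """
    paragraphs = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n\n".join("<p>" + p + "</p>" for p in paragraphs)
```

Unlike the Word replace, this also puts the first <p> and the last </p> in place, so there is nothing to fix by hand at the top and bottom of the text.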
|
10-02-2010, 05:12 PM | #5 |
Wizard
Posts: 1,018
Karma: 67827
Join Date: Jan 2005
Device: PocketBook Era
|
|
10-04-2010, 09:13 AM | #6 | |
Intentionally Left Blank
Posts: 172
Karma: 300106
Join Date: Feb 2006
Location: Royal Oak, MI, USA
Device: Nook STR
|
Quote:
Now save the document as a text file, add the proper <html>, <body>, etc. tags at the top and bottom, and you'll have something fit for a clean import into Sigil. Hope this helps. CAUTION: This is quite useful, but not perfect and so is provided as-is, no warranty, use at your own risk, and the other usual disclaimers. It assumes you have a double-paragraph mark between paragraphs, as is common for Gutenberg and other text files. If you actually have just a single paragraph marker at the end of each paragraph, it'll turn the whole document into one huge paragraph. It also clears "unnecessary" white space, so if you have a table or tabs/spaces at the start of a paragraph, or other such formatting, you'll lose it. This is basically intended for documents that are paragraphs of text with chapter headings. |
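The same cleanup can be roughed out in Python. To be clear, this is not the Word macro itself, only a sketch of the behaviour described in the caution above: paragraphs separated by a blank line (the double paragraph mark), white space collapsed, and the <html> and <body> tags added at the top and bottom. The function name text_to_html and the title parameter are hypothetical.

```python
import re

def text_to_html(text, title="Untitled"):
    """Turn blank-line-separated plain text into a bare-bones HTML page."""
    # Normalise Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A blank line (possibly holding spaces or tabs) separates paragraphs.
    blocks = re.split(r"\n\s*\n", text)
    paragraphs = []
    for block in blocks:
        # Collapse tabs, line breaks and runs of spaces into single spaces.
        collapsed = " ".join(block.split())
        if collapsed:
            paragraphs.append("<p>" + collapsed + "</p>")
    body = "\n".join(paragraphs)
    return (
        "<html>\n<head><title>" + title + "</title></head>\n<body>\n"
        + body
        + "\n</body>\n</html>\n"
    )
```

It has the same limits as noted above: a file with only single line breaks between paragraphs comes out as one huge paragraph, and tables or leading tabs and spaces are lost.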
|
10-04-2010, 11:22 AM | #7 |
Dylanologist
Posts: 200
Karma: 146754
Join Date: Apr 2010
Location: Hanover, New Hampshire, USA
Device: none/all/any
|
Thanks DTM. I have never been good at creating Word macros, so I DO waste time with repetitive tasks. - Fabe
|
10-06-2010, 08:05 PM | #8 |
Zennist
Posts: 1,022
Karma: 47809468
Join Date: Jul 2010
Device: iPod Touch, Sony PRS-350, Nook HD+ & HD
|
rlauzon, thanks for the tip about Open Office. I tried it this evening and it cleaned up the html very well. Not only that, but it kept all the links to the images, and everything imported intact into Calibre very nicely. And all you have to do is import the file into Open Office and then save it. Nifty trick!
dtm, thanks for the macros, I will give them a try, though I'm more inclined to use the Open Office fix as it's simpler. |
10-06-2010, 09:40 PM | #9 |
Intentionally Left Blank
Posts: 172
Karma: 300106
Join Date: Feb 2006
Location: Royal Oak, MI, USA
Device: Nook STR
|
I don't have Open Office. Tried it years ago and was not impressed. Maybe it's time to try it again.
|
10-07-2010, 02:00 PM | #10 |
Banned
Posts: 640
Karma: 4911
Join Date: Jul 2007
Location: Grapevine, TX
Device: iPad4
|
I use Word for most of my ebook creations. Word files saved as Filtered Html and imported into Calibre. My epubs usually come out looking perfect so I gotta ask; what is the problem with having all the extra crap that Word adds?
|
10-07-2010, 02:11 PM | #11 |
Intentionally Left Blank
Posts: 172
Karma: 300106
Join Date: Feb 2006
Location: Royal Oak, MI, USA
Device: Nook STR
|
Hey, if you're happy, go with it.
But there are a couple of problems that can result. First, it can hard-code font sizes, which you probably don't want to have happen. There's another thread all about that sort of problem. And second, having the style information at the top of each file makes it a major headache to change anything. You have to do it in every chapter. It's much better to put it all in an external CSS file. |
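As a rough illustration of moving the style information out of each file, here is a Python sketch that pulls the <style> block out of one HTML document and puts a link to an external stylesheet in its place. It assumes the Word styling sits in ordinary <style>...</style> blocks and that the file has a </head> tag; the function name externalize_styles and the default file name styles.css are made up for the example.

```python
import re

# Matches a whole <style>...</style> block, including trailing white space.
STYLE_BLOCK = re.compile(r"<style\b[^>]*>(.*?)</style>\s*", re.IGNORECASE | re.DOTALL)

def externalize_styles(html_text, css_href="styles.css"):
    """Return (cleaned_html, css) for one HTML document.

    The <style> blocks are removed and a <link> to css_href is inserted
    just before </head>. If there is no </head>, no link is added.
    """
    css = "\n".join(STYLE_BLOCK.findall(html_text))
    # Style blocks are often wrapped in HTML comment markers; drop them.
    css = css.replace("<!--", "").replace("-->", "").strip()
    cleaned = STYLE_BLOCK.sub("", html_text)
    link = '<link rel="stylesheet" type="text/css" href="' + css_href + '" />\n'
    cleaned = re.sub(r"</head>", lambda m: link + m.group(0), cleaned,
                     count=1, flags=re.IGNORECASE)
    return cleaned, css
```

Run over every chapter file, this leaves one CSS file to edit instead of a style block per chapter. It does nothing about inline style attributes on individual tags, and the extracted CSS still needs a look before it is trusted.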
10-07-2010, 02:31 PM | #12 |
Well trained by Cats
Posts: 29,799
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Another point is that the files become massively bloated (as much as 3x), which slows down performance on portable readers.
|
10-12-2010, 03:04 AM | #13 |
Guru
Posts: 774
Karma: 1211741
Join Date: May 2008
Location: Oregon
Device: EB1150, iPhone, Cool-er Purple, Pocketbook 360, Kindle Fire
|
save as rtf
I just save the document as an RTF instead of a Word doc or an html file, then import it into Calibre and convert to epub. This probably takes a bit longer since I can't import directly into Sigil, but it seems to work with minimal problems.
You are right that Word does tend to add a ton of junk to its html. Amy |
10-12-2010, 06:51 AM | #14 | |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
Is this for Word/Office 2007? My older Word 2003 for XP doesn't recognize the template format. I thought I'd try it on an itchy file I have, but it looks like I'll have to run it through the usual hoops. Thanks anyway, Hitch |
|
10-12-2010, 08:18 AM | #15 |
Intentionally Left Blank
Posts: 172
Karma: 300106
Join Date: Feb 2006
Location: Royal Oak, MI, USA
Device: Nook STR
|
Yes, that was created under Word 2007. I opened it, saved it "down" to 97-2003 format, and have attached the resulting file. I have not tried it.
|
|