MS Word "crap" at beginning of html files
Is there an easy way to clean up the styling "crap" that Word puts at the top of all html files? Like maybe through a regex search and replace?
Why does Word put a gazillion font items at the start of html files even when those fonts are not being used in the file? It makes it much harder to edit the files in Sigil or any other editor. I manually selected the Word stuff in one file and deleted it and there didn't seem to be any negative impact, so I assume it's safe to delete it. But how does one do it for every file? I try to stay away from Word for epubs, but I find that if a file needs heavy editing and also needs a new TOC, it's the easiest thing to use. I save my files as filtered html. Maybe if I run the file through another utility first it will clean up that "crap"? Anyone have any suggestions? |
PatNY - I use Word but never its HTML. After I do the editing and add the tags I want in Word, I clear all the formatting, select all, copy, and paste it into a simple text editor, where I save the file as Unicode (UTF-8) with an HTML extension. I then open this file in Sigil and finish my ePub work there.
Does this idea help? - Fabe |
Fabe, I am not sure what you are suggesting. Do you mean you add html code to the text while in Word? If so, that would not be easy for me as I am not very experienced with html coding.
The reason I use Word is because of its style gallery and easy way of applying styles (headers) which is a snap. The document map pane also makes it easy to navigate, and I like its user-friendly search and replace. But if I use the gallery to apply header styles, then clearing all the formatting as you suggest will undo that, correct? Also, some html books I have are linked to image files. If I do what you are suggesting, won't I lose all those links? BTW, I love your part of the country. I was in Hanover just this summer and had a wonderful time. It's beautiful up there. |
1. Yes, I add some of the HTML code in Word first. A prime example is paragraph marks.
Example: Here is a body of text pasted into Word and made generic by clearing all formatting. Here is a body of text pasted into Word and made generic by clearing all formatting. Here is a body of text pasted into Word and made generic by clearing all formatting.
Here is the same text with paragraph markings added with Find and Replace:
<p>Here is a body of text pasted into Word and made generic by clearing all formatting. Here is a body of text pasted into Word and made generic by clearing all formatting. Here is a body of text pasted into Word and made generic by clearing all formatting.</p>
Find = ^p
Replace = </p>^p^p<p>
2. Yes, I cannot argue with the ease of use, but Word was designed for the printed page with HTML web pages added as an afterthought, hence the "crap." Word does not generate "good" html for Sigil to turn into ePubs.
3. Yes, I primarily use Word for its Find & Replace sophistication. Several people have urged me to use TextWrangler instead. It is a fine program, but I feel like I need an engineering degree to use it, so I stick with Word.
4. Yes, "clear formatting" will undo styles.
5. I handle graphics in Sigil. But I must say, I stay away from embedded graphics as much as possible. - Fabe |
1 Attachment(s)
Now save the document as a text file, add the proper <html>, <body>, etc. tags at the top and bottom, and you'll have something fit for a clean import into Sigil. Hope this helps. CAUTION: This is quite useful, but not perfect and so is provided as-is, no warranty, use at your own risk, and the other usual disclaimers. It assumes you have a double-paragraph mark between paragraphs, as is common for Gutenberg and other text files. If you actually have just a single paragraph marker at the end of each paragraph, it'll turn the whole document into one huge paragraph. It also clears "unnecessary" white space, so if you have a table or tabs/spaces at the start of a paragraph, or other such formatting, you'll lose it. This is basically intended for documents that are paragraphs of text with chapter headings. |
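For anyone who would rather do this step outside Word, here is a rough Python sketch of the same idea. It is not the attached macro, just an illustration under the same assumptions described above: paragraphs are separated by a blank line (a double paragraph mark), and extra white space inside a paragraph is collapsed, so tables and indents are lost. The function name `text_to_html` and the `title` parameter are made up for this example.

```python
import html
import re


def text_to_html(text: str, title: str = "Untitled") -> str:
    """Wrap blank-line-separated paragraphs in <p> tags inside a bare HTML shell."""
    # Normalize Windows and old Mac line endings to plain newlines.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A paragraph break is one or more blank (or whitespace-only) lines.
    blocks = re.split(r"\n\s*\n", text)
    paragraphs = []
    for block in blocks:
        # Collapse every run of spaces, tabs, and single newlines to one space.
        collapsed = " ".join(block.split())
        if collapsed:
            # Escape &, < and > so stray characters don't break the markup.
            paragraphs.append("<p>" + html.escape(collapsed, quote=False) + "</p>")
    body = "\n".join(paragraphs)
    return (
        "<html>\n<head>\n<title>" + html.escape(title) + "</title>\n</head>\n"
        "<body>\n" + body + "\n</body>\n</html>\n"
    )


if __name__ == "__main__":
    sample = "Chapter One\n\nIt was a dark\nand stormy night.\n\n\n  The end.  \n"
    print(text_to_html(sample, title="Sample"))
```

As with the macro, a file that has only a single line break between paragraphs will come out as one huge paragraph, so check the input first.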
Thanks DTM. I have never been good at creating Word macros, so I DO waste time with repetitive tasks. - Fabe
|
rlauzon, thanks for the tip about open office. I tried it this evening and it cleaned up the html very well. Not only that, but it kept all the links to the images and everything imported intact into Calibre very nicely. And all you have to do is import the file into Open Office and then save it. Nifty trick!
dtm, thanks for the macros, I will give them a try, though I'm more inclined to use the Open Office fix as it's simpler. |
I don't have Open Office. Tried it years ago and was not impressed. Maybe it's time to try it again.
|
I use Word for most of my ebook creations. Word files saved as Filtered Html and imported into Calibre. My epubs usually come out looking perfect so I gotta ask; what is the problem with having all the extra crap that Word adds?
|
Hey, if you're happy, go with it.
But there are a couple of problems that can result. First, it can hard-code font sizes, which you probably don't want to have happen. There's another thread all about that sort of problem. And second, having the style information at the top of each file makes it a major headache to change anything. You have to do it in every chapter. It's much better to put it all in an external CSS file. |
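If you want to move the styling out of each chapter, something like the Python sketch below could do it: it drops the embedded `<style>` block and points the file at an external stylesheet instead. This is only a sketch under assumptions. The function name `externalize_styles` and the file name `styles.css` are made up for the example, it uses a regex rather than a real HTML parser, and removing the block also removes Word's class definitions (MsoNormal and friends), so the external CSS has to supply whatever styling you still want.

```python
import re

# Matches a whole <style ...>...</style> block, across lines, in any letter case.
STYLE_BLOCK = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)
# Matches the closing </head> tag so the stylesheet link can go just before it.
HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)


def externalize_styles(html_text: str, css_href: str = "styles.css") -> str:
    """Remove embedded <style> blocks and link an external stylesheet instead."""
    stripped = STYLE_BLOCK.sub("", html_text)
    link = '<link rel="stylesheet" type="text/css" href="' + css_href + '" />\n'
    # Insert the link once, before the first </head>; if the file has no
    # </head>, only the style blocks are removed.
    return HEAD_CLOSE.sub(lambda m: link + m.group(0), stripped, count=1)


if __name__ == "__main__":
    sample = (
        "<html><head><title>Chapter 1</title>\n"
        "<style>\n<!--\n p.MsoNormal {font-size:12.0pt;}\n-->\n</style>\n"
        "</head><body><p class=MsoNormal>Hello</p></body></html>"
    )
    print(externalize_styles(sample))
```

To cover every chapter, you could read each .html file in a folder, pass its text through the function, and write the result back. Work on copies, and check which encoding Word saved the files in before reading them.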
save as rtf
I just save the document as an rtf instead of a Word doc or an html file, then import it into Calibre and convert to epub. This probably takes a bit longer since I can't import directly into Sigil, but it seems to work with minimal problems.
You are right that Word does tend to add a ton of junk to its html. Amy |
Is this for Word/Office 2007? My older Word 2003 for XP doesn't recognize the template format. Thought I'd try it on an itchy file I have, but it looks like I'll have to run it through the usual hoops. Thanks anyway. Hitch |