MobileRead Forums


PatNY 10-02-2010 11:57 AM

MS Word "crap" at beginning of html files
 
Is there an easy way to clean up the styling "crap" that Word puts at the top of all html files? Like maybe through a regex search and replace?

Why does Word put a gazillion font items at the start of html files even when those fonts are not being used in the file? It makes it much harder to edit the files in Sigil or any other editor. I manually selected the Word stuff in one file and deleted it and there didn't seem to be any negative impact, so I assume it's safe to delete it. But how does one do it for every file?

I try to stay away from Word for epubs, but I find if a file needs heavy editing and also needs a new TOC, it's the easiest thing to use. I save my files as filtered html.

Maybe if I run the file through another utility first it will clean up that "crap"?

Anyone have any suggestions?
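
On the regex question, here is a minimal Python sketch. It assumes the Word styling sits in one or more `<style>...</style>` blocks, which is how filtered HTML exports typically look. The names `strip_word_styles` and `clean_folder` are made up for illustration. Work on copies: deleting the block also deletes the definitions of any classes (such as `MsoNormal`) that the body text still refers to.

```python
import re
from pathlib import Path

# Matches each <style>...</style> block, including the HTML comment wrapper
# Word puts around its CSS. DOTALL lets "." span line breaks.
STYLE_BLOCK = re.compile(r"<style\b[^>]*>.*?</style>\s*", re.IGNORECASE | re.DOTALL)


def strip_word_styles(html_text: str) -> str:
    """Return the HTML with every embedded <style> block removed."""
    return STYLE_BLOCK.sub("", html_text)


def clean_folder(folder: str) -> int:
    """Strip <style> blocks from every .htm/.html file in a folder, in place.

    Files are decoded as latin-1 only so that every byte round-trips
    unchanged, whatever encoding Word actually used. Returns the number
    of files that were modified.
    """
    changed = 0
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in {".htm", ".html"}:
            continue
        original = path.read_bytes().decode("latin-1")
        cleaned = strip_word_styles(original)
        if cleaned != original:
            path.write_bytes(cleaned.encode("latin-1"))
            changed += 1
    return changed
```

Calling `clean_folder("my_book_copy")` (a hypothetical folder name) would process every HTML file in that folder in one pass.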

Fabe 10-02-2010 12:04 PM

PatNY - I use Word but never its HTML. After I do the editing and add the tags I want in Word, I clear all the formatting, select all, copy, and paste the text into a simple text editor, where I save the file as Unicode (UTF-8) with an HTML extension. I then open this file in Sigil and finish my ePub work there.

Does this idea help? - Fabe

PatNY 10-02-2010 12:30 PM

Fabe, I am not sure what you are suggesting. Do you mean you add html code to the text while in Word? If so, that would not be easy for me as I am not very experienced with html coding.

The reason I use Word is because of its style gallery and easy way of applying styles (headers) which is a snap. The document map pane also makes it easy to navigate, and I like its user-friendly search and replace.

But if I use the gallery to apply header styles, then clearing all the formatting as you suggest will undo that, correct?

Also, some html books I have are linked to image files. If I do what you are suggesting, won't I lose all those links?

BTW, I love your part of the country. I was in Hanover just this summer and had a wonderful time. It's beautiful up there.

Fabe 10-02-2010 12:57 PM

Quote:

Originally Posted by PatNY (Post 1141909)
(1) I am not sure what you are suggesting. Do you mean you add html code to the text while in Word?

(2) I use Word because of its easy way of applying styles (headers). The document map pane also makes it easy to navigate.

(3) I like its user-friendly search and replace.

(4) If I use the gallery to apply header styles, then clearing all the formatting as you suggest will undo that, correct?

(5) Also, some html books I have are linked to image files. If I do what you are suggesting, won't I lose all those links?

PatNY -
1. Yes, I add some of the HTML code in Word first. A prime example is paragraph marks. Example:
Here is a body of text pasted into Word and made generic by clearing all formatting. Here is a body of text pasted into Word and made generic by clearing all formatting. Here is a body of text pasted into Word and made generic by clearing all formatting.
Here is a body of text pasted into Word and made generic by clearing all formatting. Here is a body of text pasted into Word and made generic by clearing all formatting. Here is a body of text pasted into Word and made generic by clearing all formatting.
Here is the same text with paragraph markings added with Find and Replace:
<p>Here is a body of text pasted into Word and made generic by clearing all formatting. Here is a body of text pasted into Word and made generic by clearing all formatting. Here is a body of text pasted into Word and made generic by clearing all formatting.</p>

<p>Here is a body of text pasted into Word and made generic by clearing all formatting. Here is a body of text pasted into Word and made generic by clearing all formatting. Here is a body of text pasted into Word and made generic by clearing all formatting.</p>
Find = ^p
Replace = </p>^p^p<p>

2. Yes, I cannot argue with the ease of use, but Word was designed for the printed page, with HTML web pages added as an afterthought, hence the "crap." Word does not generate "good" html for Sigil to turn into ePubs.

3. Yes, I primarily use Word for its Find & Replace sophistication. Several people have urged me to use TextWrangler instead. It is a fine program, but I feel like I need an engineering degree to use it, so I stick with Word.

4. Yes, "clear formatting" will undo styles.

5. I handle graphics in Sigil. But I must say, I stay away from embedded graphics as much as possible.
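
The Find and Replace in item 1 can also be sketched outside Word. This is a minimal Python version, assuming the cleared text has exactly one paragraph per line; the function name `wrap_paragraphs` is made up for illustration. It is not a copy of Word's behavior: a plain replace of every paragraph mark leaves the first paragraph without an opening tag and adds a stray `<p>` after the last one, while this version wraps each line whole.

```python
def wrap_paragraphs(text: str) -> str:
    """Wrap each non-empty line in <p>...</p>, with a blank line between paragraphs."""
    paragraphs = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n\n".join(f"<p>{p}</p>" for p in paragraphs)


if __name__ == "__main__":
    sample = "First paragraph of text.\nSecond paragraph of text.\n"
    print(wrap_paragraphs(sample))
```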

- Fabe

rlauzon 10-02-2010 06:12 PM

Quote:

Originally Posted by PatNY (Post 1141881)
Is there an easy way to clean up the styling "crap" that Word puts at the top of all html files?

Anyone have any suggestions?

I usually run it through OpenOffice.org. It cleans it up nicely.

DTM 10-04-2010 10:13 AM

1 Attachment(s)
Quote:

Originally Posted by Fabe (Post 1141890)
PatNY - I use Word but never its HTML. After I do the editing and add the tags I want in Word, I clear all the formatting, select all, copy, and paste the text into a simple text editor, where I save the file as Unicode (UTF-8) with an HTML extension. I then open this file in Sigil and finish my ePub work there.

Does this idea help? - Fabe

I do exactly the same thing. The attached file will unzip into a Word template file that contains several macros. Use this template for the document you want to convert to HTML, then run the macro called Word2HTML. It will clean up the double-paragraph markers and end-of-line paragraph markers you commonly get in text documents, mark Word's Heading 1 through Heading 5 styles with <h1> through <h5>, replace special characters with escape codes, replace double hyphens with em dashes, and more. (If you don't want to do all of the operations--see CAUTION, below--you can run the individual macros one at a time instead.)

Now save the document as a text file, add the proper <html>, <body>, etc. tags at the top and bottom, and you'll have something fit for a clean import into Sigil. Hope this helps.

CAUTION: This is quite useful, but not perfect and so is provided as-is, no warranty, use at your own risk, and the other usual disclaimers. It assumes you have a double-paragraph mark between paragraphs, as is common for Gutenberg and other text files. If you actually have just a single paragraph marker at the end of each paragraph, it'll turn the whole document into one huge paragraph. It also clears "unnecessary" white space, so if you have a table or tabs/spaces at the start of a paragraph, or other such formatting, you'll lose it. This is basically intended for documents that are paragraphs of text with chapter headings.
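
For readers without Word, the same kind of cleanup can be roughed out in Python. This is not the attached macro and makes no claim to match it; it is a sketch that assumes a plain-text file with a blank line between paragraphs, and the names `text_to_paragraphs` and `build_page` are made up for illustration. Heading markup is left out because plain text carries no Word styles. The same caution applies: a file with single line breaks between paragraphs collapses into one huge paragraph.

```python
import html
import re


def text_to_paragraphs(text: str) -> list:
    """Turn blank-line-separated text into a list of <p>...</p> strings."""
    text = text.replace("\r\n", "\n")
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        # Join hard-wrapped lines and collapse runs of white space.
        joined = " ".join(block.split())
        if not joined:
            continue
        # Escape &, < and > so they survive as text in the HTML.
        escaped = html.escape(joined, quote=False)
        # Double hyphens become an em dash character reference.
        escaped = escaped.replace("--", "&#8212;")
        paragraphs.append(f"<p>{escaped}</p>")
    return paragraphs


def build_page(title: str, paragraphs: list) -> str:
    """Add the <html>, <head> and <body> wrapper around the paragraphs."""
    body = "\n".join(paragraphs)
    return (
        "<html>\n<head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n<body>\n"
        f"{body}\n"
        "</body>\n</html>\n"
    )
```

Saving the result of `build_page(...)` as UTF-8 with an .html extension gives a file with no embedded styling at all.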

Fabe 10-04-2010 12:22 PM

Thanks, DTM. I have never been good at creating Word macros, so I DO waste time with repetitive tasks. - Fabe

PatNY 10-06-2010 09:05 PM

rlauzon, thanks for the tip about Open Office. I tried it this evening and it cleaned up the html very well. Not only that, but it kept all the links to the images, and everything imported into Calibre intact. And all you have to do is open the file in Open Office and then save it. Nifty trick!

DTM, thanks for the macros. I will give them a try, though I'm more inclined to use the Open Office fix as it's simpler.

DTM 10-06-2010 10:40 PM

I don't have Open Office. Tried it years ago and was not impressed. Maybe it's time to try it again.

edbro 10-07-2010 03:00 PM

I use Word for most of my ebook creations. I save the Word files as Filtered HTML and import them into Calibre. My epubs usually come out looking perfect, so I gotta ask: what is the problem with having all the extra crap that Word adds?

DTM 10-07-2010 03:11 PM

Hey, if you're happy, go with it.

But there are a couple of problems that can result. First, it can hard-code font sizes, which you probably don't want to have happen. There's another thread all about that sort of problem.

And second, having the style information at the top of each file makes it a major headache to change anything. You have to do it in every chapter. It's much better to put it all in an external CSS file.
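
As a rough illustration of the external CSS idea, here is a Python sketch that pulls the embedded `<style>` block out of one page and leaves a stylesheet link in its place. The function name `externalize_styles` and the default file name `styles.css` are assumptions made for the example. It does nothing about hard-coded font sizes inside the extracted CSS; those would still need editing by hand.

```python
import re

# Captures the contents of each <style>...</style> block.
STYLE_BLOCK = re.compile(r"<style\b[^>]*>(.*?)</style>", re.IGNORECASE | re.DOTALL)


def externalize_styles(html_text: str, css_href: str = "styles.css"):
    """Return (html_with_link, css_text) for one page."""
    css_parts = []
    for match in STYLE_BLOCK.finditer(html_text):
        css = match.group(1).strip()
        # Word wraps its CSS in an HTML comment; drop the wrapper if present.
        if css.startswith("<!--"):
            css = css[4:]
        if css.endswith("-->"):
            css = css[:-3]
        css_parts.append(css.strip())
    if not css_parts:
        return html_text, ""
    link = f'<link rel="stylesheet" type="text/css" href="{css_href}"/>'
    # Put the link where the first style block was, then drop any others.
    with_link = STYLE_BLOCK.sub(lambda m: link, html_text, count=1)
    with_link = STYLE_BLOCK.sub("", with_link)
    return with_link, "\n\n".join(css_parts)
```

The returned CSS text can be written once to `styles.css` and shared by every chapter, so a style change is made in one place instead of in each file.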

theducks 10-07-2010 03:31 PM

Quote:

Originally Posted by edbro (Post 1150920)
I use Word for most of my ebook creations. I save the Word files as Filtered HTML and import them into Calibre. My epubs usually come out looking perfect, so I gotta ask: what is the problem with having all the extra crap that Word adds?

Another point is that the files become massively bloated (as much as 3x) and slow down portable readers. :(

sassanik 10-12-2010 04:04 AM

save as rtf
 
I just save the document as an RTF instead of a Word doc or an html file, then import it into Calibre and convert it to epub. This probably takes a bit longer since I can't import directly into Sigil, but it seems to work with minimal problems.

You are right that Word does tend to add a ton of junk to its html.


Amy

Hitch 10-12-2010 07:51 AM

Quote:

Originally Posted by DTM (Post 1145209)
I do exactly the same thing. The attached file will unzip into a Word template file that contains several macros. Use this template for the document you want to convert to HTML, then run the macro called Word2HTML. It will clean up the double-paragraph markers and end-of-line paragraph markers you commonly get in text documents, mark Word's Heading 1 through Heading 5 styles with <h1> through <h5>, replace special characters with escape codes, replace double hyphens with em dashes, and more. (If you don't want to do all of the operations--see CAUTION, below--you can run the individual macros one at a time instead.)

Now save the document as a text file, add the proper <html>, <body>, etc. tags at the top and bottom, and you'll have something fit for a clean import into Sigil. Hope this helps.

CAUTION: This is quite useful, but not perfect and so is provided as-is, no warranty, use at your own risk, and the other usual disclaimers. It assumes you have a double-paragraph mark between paragraphs, as is common for Gutenberg and other text files. If you actually have just a single paragraph marker at the end of each paragraph, it'll turn the whole document into one huge paragraph. It also clears "unnecessary" white space, so if you have a table or tabs/spaces at the start of a paragraph, or other such formatting, you'll lose it. This is basically intended for documents that are paragraphs of text with chapter headings.

DTM:

Is this for Word/Office 2007? My older Word 2003 for XP doesn't recognize the template format. I thought I'd try it on an itchy file I have, but it looks like I'll have to run it through the usual hoops--thanks anyway.

Thanks,

Hitch

DTM 10-12-2010 09:18 AM

1 Attachment(s)
Quote:

Originally Posted by Hitch (Post 1158578)
Is this for Word/Office 2007? My older Word 2003 for XP doesn't recognize the template format.

Yes, that was created under Word 2007. I opened it, saved it "down" to 97-2003 format, and have attached the resulting file. I have not tried it.

