 05-08-2021, 07:05 PM #1 conan50 Fanatic   Posts: 565 Karma: 11771100 Join Date: Jun 2014 Device: kindle Clean up old Word file I have a fantasy novel that I've been working on for ages, around 500 pages long. I think I was using Word 97 when I first started writing it. Eventually I'm going to want to convert it to an ebook, but before I get there I'm trying to figure out the best way to clean up the entire file, if possible without removing all formatting. Editing is slow: the file is unwieldy compared to files of similar size, which makes me think there is excess garbage in the code from numerous conversions over the years from doc to odt to docx. My question: what is the best way to clean up a large docx file that is likely loaded with conversion artifacts? Thanks.
 05-09-2021, 04:55 AM #2 BetterRed null operator   Posts: 17,272 Karma: 20292561 Join Date: Mar 2012 Location: Sydney Australia Device: none @conan50 - if you have a lot of in-line formatting, i.e. you've eschewed using Styles, then conversion to EPUB will result in 'messy' HTML code. What will you lose if you save it as plain text? BTW: This is not really a calibre question, the Workshop forum would be better. I can move it if you want. BR
 05-09-2021, 06:32 AM #3 JSWolf Resident Curmudgeon     Posts: 63,397 Karma: 103125649 Join Date: Nov 2006 Location: Roslindale, Massachusetts Device: Kobo Aura H2O, PRS-650, PRS-T1, nook STR, iPad 4, iPhone SE 2020, PW3 If you've done in-line formatting, then I do think that you might be better off stripping all the formatting and redoing the formatting with styles.
conan50
 Originally Posted by BetterRed @conan50 - if you have a lot of in-line formatting, i.e. you've eschewed using Styles, then conversion to EPUB will result in 'messy' HTML code. What will you lose if you save it as plain text? BTW: This is not really a calibre question, the Workshop forum would be better. I can move it if you want. BR
Yes, please move it. I wasn't sure where to post.
I might try plain text. If I paste it into another document with formatting removed, would that completely clean up the text?

conan50
 Originally Posted by JSWolf If you've done in-line formatting, then I do think that you might be better off stripping all the formatting and redoing the formatting with styles.
Yeah, I'm not really sure what got carried over when I saved the whole thing from doc to odt and then back to docx. The book is finished except for editing, but it is slow to load, and changes are very slow as I type in or remove text, which makes me think the document needs to be cleaned up or the formatting completely removed and started over.

 05-09-2021, 11:27 AM #7 retiredbiker Addict     Posts: 223 Karma: 232318 Join Date: May 2013 Location: Ontario, Canada Device: Kindle KB, Oasis, Ubuntu, Jutoh,Kobo Forma From what you describe, it is probably a real mess under the skin; I've seen files like that.

The traditional "dynamite" approach to fix this is to copy all the text into a plain text editor to remove ALL the formatting. (Just using "remove formatting" in either Word or Writer is not guaranteed to get it all.) Then open the text file back in your word processor and re-format using styles.

Unfortunately, this will blow away italics, and that can be a real pain to put back in if they are used a lot. I found a work-around for this that I used on a book or two with good results. It's a nasty thing, but for what it's worth, I'll add it. I used Writer, but something similar could be done with Word:

1. Save the file as html.
2. Open the html file in a text editor (like gedit--any good text-only editor).
3. Identify all the code indicating italics. There may be several types of coding.
4. Use Search & Replace to replace all that code with some unused text marker. I think I used #i for italics start and #/i for italics end. Save the html file.
5. Now open that html file again in the word processor. The text markers you put in will show up as text and no italics will appear.
6. NOW remove all formatting (select all and Ctrl-M for Writer). (Some odd things might be left behind that you will later have to fix by hand, but probably not much.)
7. Save the result again as an html file, and open it in the text editor.
8. Use search and replace to turn #i into <i> and #/i into </i>.
9. Now open the result back in the word processor. Italics should re-appear.
10. Now you can save it as docx or odt, and reformat paragraphs, headings, and so on, using styles.

As I said, it's ugly, but worth it if you have hundreds and hundreds of instances of italics.

Last edited by retiredbiker; 05-09-2021 at 11:29 AM.
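The search-and-replace passes in steps 4 and 8 can also be scripted instead of done by hand. Below is a minimal Python sketch, not a tested tool. It assumes the exported html marks italics with plain <i> or <em> tags; real Writer or Word exports may use other coding (styled spans, for example), which this does not handle. The function names are made up for illustration, and the #i and #/i markers are the ones from step 4.

```python
import re

# Assumption: italics are coded as <i ...>...</i> or <em ...>...</em>.
# Other coding (e.g. <span style="font-style: italic">) is NOT covered here.
ITALIC_OPEN = re.compile(r"<(?:i|em)(?:\s[^>]*)?>", re.IGNORECASE)
ITALIC_CLOSE = re.compile(r"</(?:i|em)\s*>", re.IGNORECASE)


def tags_to_markers(html: str) -> str:
    """Step 4: replace italic tags with plain-text markers."""
    html = ITALIC_CLOSE.sub("#/i", html)
    return ITALIC_OPEN.sub("#i", html)


def markers_to_tags(html: str) -> str:
    """Step 8: turn the plain-text markers back into italic tags."""
    html = html.replace("#/i", "</i>")
    return html.replace("#i", "<i>")
```

If the manuscript itself contains the literal text "#i" anywhere, pick a different marker before running something like this, since the second function cannot tell a marker from ordinary text.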
 05-09-2021, 01:10 PM #8 JSWolf Resident Curmudgeon     Posts: 63,397 Karma: 103125649 Join Date: Nov 2006 Location: Roslindale, Massachusetts Device: Kobo Aura H2O, PRS-650, PRS-T1, nook STR, iPad 4, iPhone SE 2020, PW3 It would be a lot easier to save as text, load the resulting text file and apply all the formatting as styles from there. Then save as DOCX. Convert to ePub, load in the Calibre Editor or Sigil and see if there is anything you want to fix.
Turtle91
 Originally Posted by JSWolf It would be a lot easier to save as text, load the resulting text file and apply all the formatting as styles from there. Then save as DOCX. Convert to ePub, load in the Calibre Editor or Sigil and see if there is anything you want to fix.
Your method wouldn't save the italics formatting.

 05-09-2021, 03:09 PM #10 conan50 Fanatic   Posts: 565 Karma: 11771100 Join Date: Jun 2014 Device: kindle Thanks folks! Sounds like it is going to be a big job to clean it up. Pretty much what I figured.
JSWolf
 Originally Posted by Turtle91 Your method wouldn't save the italics formatting.
It would in the end as the italics would be put back.

 05-09-2021, 05:21 PM #12 DaleDe Grand Sorcerer     Posts: 11,457 Karma: 13095790 Join Date: Aug 2007 Location: Grass Valley, CA Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7 One thing you might do is simply look at how many paragraph styles you have in your old files and what their names are. You may be able to make the conversions easier if you clean up the documents themselves, depending on how you made them in the first place.
Tex2002ans
 Originally Posted by conan50 If possible without removing all formatting.
Could be possible, could be possible.

(Upload to Google Drive or other filesharing site and I could take a look.)

Quote:
 Originally Posted by conan50 I have a fantasy novel that I've been working on for ages, around 500 pages long. I think I was using Word 97 when I first started writing it. Eventually I'm going to want to convert it to an ebook, but before I get there I'm trying to figure out the best way to clean up the entire file.
Styles.

For more info on the why/how, see three of my posts in "eBook Formatting in Sigil":

Post #48
Post #50
Post #52

especially the 2 videos I linked in "MS Word vs Open Office Word" (Post #3).

Quote:
 Originally Posted by conan50 Editing is slow, as in the file is unwieldy compared to files of similar size, which makes me think there is excess garbage in the code from numerous conversions over the years from doc to odt to docx. My question, what is the best way to clean up a large docx file that likely is loaded with conversion artifacts?
Yep, definitely sounds like a bunch of cruft built up and hidden in the background.

... Especially if you've been using the same file for over 20 years + saved in various different programs/formats. Who knows what crap crept in.
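A rough way to see how much cruft there is before deciding: a docx file is a ZIP archive, and the main text lives in a part named word/document.xml. The Python sketch below only counts things and changes nothing. It assumes the usual "w:" prefix on the XML tags, and docx_report is a made-up name. The counts are crude (run-property blocks also appear on paragraph marks), so treat them as a hint, not a measurement.

```python
import re
import zipfile


def docx_report(path: str) -> dict:
    """Return part sizes and rough formatting counts for a .docx file."""
    with zipfile.ZipFile(path) as zf:
        # Uncompressed size of every part inside the archive.
        sizes = {info.filename: info.file_size for info in zf.infolist()}
        body = zf.read("word/document.xml").decode("utf-8")
    return {
        "part_sizes": sizes,
        # Text runs: <w:r> or <w:r ...>, but not <w:rPr>.
        "runs": len(re.findall(r"<w:r[ >]", body)),
        # Run-property blocks, a rough proxy for in-line formatting.
        "run_property_blocks": len(re.findall(r"<w:rPr[ >/]", body)),
        # Paragraphs that point at a named style.
        "paragraphs_with_named_style": len(re.findall(r"<w:pStyle\b", body)),
    }
```

A very high number of runs and run-property blocks relative to the amount of text, together with few named styles, would be consistent with the kind of build-up described here.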

In your case, it may be best to export to a super clean/minimalist document, then reimport back so that you're starting from a proper foundation.

Once you have that fantastic base, everything else becomes better.

Quote:
 Originally Posted by conan50 Thanks folks! Sounds like it is going to be a big job to clean it up. Pretty much what I figured.
Generating clean documents has never been easier.

There are tools (like Toxaris's EPUBTools) to generate super clean ebooks from your Word files.

You still might have to put in a little elbow grease to get some of the more complicated formatting back (like blockquotes/poetry/footnotes), but the vast bulk of the document can be converted super cleanly.

Quote:
 Originally Posted by retiredbiker The traditional "dynamite" approach to fix this is to copy all the text into a plain text editor to remove ALL the formatting. [...] Unfortunately, this will blow away italics, and that can be a real pain to put back in if they are used a lot. I found a work-around for this that I used on a book or two with good results. It's a nasty thing, but for what it's worth, I'll add it. I used Writer, but something similar could be done with Word: [...]
I referenced this in passing just 2 days ago.

There's a much easier way of doing this using Word's/LibreOffice's Advanced Find and Replace.

No need to introduce their disgusting HTML exports/imports.

Instead, you use the word processor's italics formatting within Advanced Search, then replace with your own "markdown".

Last year, I wrote step-by-step instructions for:

The 1st one went from:

Code:
This is italics and more italics too.
to:

Code:
This is \emph{italics} and \emph{more italics too}.
and the 2nd one went from:

Code:
This is <i>italics</i> and <i>more italics too</i>.
to:

Code:
This is italics and more italics too.
Microsoft Word and LibreOffice have slightly different buttons, but the concept is all the same.
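For text that already carries tags outside the word processor, the tagged-to-\emph mapping shown in the samples above can be written as one regular-expression substitution. This is only a Python sketch of the text transformation, not the Advanced Find and Replace procedure itself. It assumes plain, non-nested <i> tags, and italics_to_emph is a made-up name.

```python
import re

# Assumption: italics are plain <i>...</i> pairs with no nesting.
ITALIC_SPAN = re.compile(r"<i>(.*?)</i>", re.DOTALL)


def italics_to_emph(text: str) -> str:
    """Rewrite each <i>...</i> span as \\emph{...}."""
    # A function is used as the replacement so the backslash is taken literally.
    return ITALIC_SPAN.sub(lambda m: "\\emph{" + m.group(1) + "}", text)
```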

 05-10-2021, 06:07 PM #14 conan50 Fanatic   Posts: 565 Karma: 11771100 Join Date: Jun 2014 Device: kindle Thanks again for all the suggestions! I used Softmaker Office to save the document as a text file, removing all the formatting. Then opened it back up in Softmaker Office and selected all and used default formatting, put my headings back and a bit of other needed formatting and saved as a docx file. Definitely faster to load and edit. Will see what it looks like in Sigil after I finish editing it.
Hitch
 Originally Posted by conan50 Thanks again for all the suggestions! I used Softmaker Office to save the document as a text file, removing all the formatting. Then opened it back up in Softmaker Office and selected all and used default formatting, put my headings back and a bit of other needed formatting and saved as a docx file. Definitely faster to load and edit. Will see what it looks like in Sigil after I finish editing it.
uh...

Well, I was going to suggest using Toxaris's amazing ePUBTools plugin for Word. You can then tag all the italics and bold, throughout (and underscores, if used), first.

Then Ctrl-A, remove all the inline styling, and then restore the italics, bold and underscores throughout.

I suggest creating a main body style and applying that, BEFORE you restore the italic/bold, etc. and then you're 95% of the way home.

I've used it for just this purpose--cleaning up horrid old files--for years and it's the Best. Thing. EVAH.

Seriously.

Hitch