#1
Fanatic
Posts: 597
Karma: 14054112
Join Date: Jun 2014
Device: kindle
Clean up old Word file
I have a fantasy novel that I've been working on for ages, around 500 pages long. I think I was using Word 97 when I first started writing it. Eventually I'm going to want to convert it to an ebook, but before I get there I'm trying to figure out the best way to clean up the entire file, if possible without removing all formatting. Editing is slow: the file is unwieldy compared to files of similar size, which makes me think there is excess garbage in the code from numerous conversions over the years from doc to odt to docx.
My question: what is the best way to clean up a large docx file that is likely loaded with conversion artifacts? Thanks.
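A docx file is a zip archive, so one quick way to check the "excess garbage" suspicion is to list the parts inside it by size. The following is only a rough sketch using the Python standard library; the function name `docx_part_sizes` is made up for illustration, and a large `word/document.xml` or `word/styles.xml` is a hint of bloat, not proof.

```python
import zipfile

def docx_part_sizes(docx):
    """Return (part name, uncompressed size in bytes) pairs, largest first.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        sizes = [(info.filename, info.file_size) for info in archive.infolist()]
    # Sort by size so the heaviest parts of the document come first.
    return sorted(sizes, key=lambda item: item[1], reverse=True)
```

Calling `docx_part_sizes("novel.docx")` (a hypothetical file name) and printing the first few entries shows where most of the bytes are.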
#2
null operator (he/him)
Posts: 21,725
Karma: 29711016
Join Date: Mar 2012
Location: Sydney Australia
Device: none
@conan50 - if you have a lot of in-line formatting, i.e. you've eschewed using Styles, then conversion to EPUB will result in 'messy' HTML code.
What will you lose if you save it as plain text? BTW: This is not really a calibre question; the Workshop forum would be better. I can move it if you want. BR
#3
Resident Curmudgeon
Posts: 79,758
Karma: 145864619
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
If you've done in-line formatting, then I do think that you might be better off stripping all the formatting and redoing the formatting with styles.
#4
Fanatic
Posts: 597
Karma: 14054112
Join Date: Jun 2014
Device: kindle
Quote:
I might try plain text. If I paste it into another document with formatting removed, would that completely clean up the text?
#5
Fanatic
Posts: 597
Karma: 14054112
Join Date: Jun 2014
Device: kindle
Yeah, I'm not really sure what got carried over when I merely saved the whole thing from doc to odt and then back to docx. The book is finished except for editing, but it is slow to load, and changes are very slow to register as I type in or remove text. That makes me think the document needs to be cleaned up, or the formatting completely removed and started over.
#6
Well trained by Cats
Posts: 31,062
Karma: 60358908
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Moderator Notice
Moved out of Calibre
#7
Evangelist
Posts: 450
Karma: 3886916
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Kobo Forma
From what you describe, it is probably a real mess under the skin; I've seen files like that.
The traditional "dynamite" approach to fix this is to copy all the text into a plain text editor to remove ALL the formatting. (Just using "remove formatting" in either Word or Writer is not guaranteed to get it all.) Then open the text file back in your word processor and re-format using styles. Unfortunately, this will blow away italics, and that can be a real pain to put back in if they are used a lot.
I found a work-around for this that I used on a book or two with good results. It's a nasty thing, but for what it's worth, I'll add it. I used Writer, but something similar could be done with Word:
1. Save the file as html.
2. Open the html file in a text editor (like gedit--any good text-only editor).
3. Identify all the code indicating italics. There may be several types of coding.
4. Use Search & Replace to replace all that code with some unused text marker. I think I used #i for italics start and #/i for italics end. Save the html file.
5. Now open that html file again in the word processor. The text markers you put in will show up as text and no italics will appear.
6. NOW remove all formatting (select all and Ctrl-M for Writer). (Some odd things might be left behind that you will later have to fix by hand, but probably not much.)
7. Save the result again as an html file, and open it in the text editor.
8. Use search and replace to turn #i into <i> and #/i into </i>.
9. Now open the result back in the word processor. Italics should re-appear.
10. Now you can save it as docx or odt, and reformat paragraphs, headings, and so on, using styles.
As I said, it's ugly, but worth it if you have hundreds and hundreds of instances of italics.
Last edited by retiredbiker; 05-09-2021 at 11:29 AM.
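The two search-and-replace passes (steps 4 and 8) could also be scripted instead of done by hand in a text editor. This is only a sketch, not what the poster actually ran: it assumes the exported HTML marks italics with plain `<i>` or `<em>` tags, and the function names are invented for illustration. Exports that use something like `<span style="font-style: italic">` would need extra patterns.

```python
import re

# Markers from the post; any text that never occurs in the book would do.
START, END = "#i", "#/i"

# Assumption: italics are <i> or <em> tags, optionally with attributes.
OPEN_TAG = re.compile(r"<(?:i|em)(?:\s[^>]*)?>", re.IGNORECASE)
CLOSE_TAG = re.compile(r"</(?:i|em)\s*>", re.IGNORECASE)

def protect_italics(html):
    """Step 4: swap italic tags for plain-text markers."""
    html = OPEN_TAG.sub(START, html)
    return CLOSE_TAG.sub(END, html)

def restore_italics(html):
    """Step 8: turn the markers back into <i>...</i> tags."""
    return html.replace(END, "</i>").replace(START, "<i>")
```

The clearing of formatting in between (steps 5 to 7) still happens in the word processor; the script only handles the marker swaps on either side of it.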
#8
Resident Curmudgeon
Posts: 79,758
Karma: 145864619
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
It would be a lot easier to save as text, load the resulting text file and apply all the formatting as styles from there. Then save as DOCX. Convert to ePub, load in the Calibre Editor or Sigil and see if there is anything you want to fix.
#9
A Hairy Wizard
Posts: 3,353
Karma: 20171571
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
Your method wouldn't save the italics formatting.
#10
Fanatic
Posts: 597
Karma: 14054112
Join Date: Jun 2014
Device: kindle
Thanks folks! Sounds like it is going to be a big job to clean it up. Pretty much what I figured.
#11
Resident Curmudgeon
Posts: 79,758
Karma: 145864619
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
#12
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
One thing you might do is simply look at how many paragraph styles you have in your old files and what their names are. You may be able to make the conversions easier if you clean up the documents themselves, depending on how you made them in the first place.
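One way to take that inventory without clicking through the word processor is to read the docx directly. The sketch below uses only the Python standard library; the function name is made up, and it reports style IDs as stored in `word/document.xml` (the human-readable names live in `word/styles.xml`, which this does not read).

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML namespace used by docx body markup.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def paragraph_style_counts(docx):
    """Count how many paragraphs reference each paragraph style ID."""
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    counts = Counter()
    for para in root.iter(W + "p"):
        style = para.find(f"{W}pPr/{W}pStyle")
        # Paragraphs without an explicit style use the document default.
        key = style.get(W + "val") if style is not None else "(default)"
        counts[key] += 1
    return counts
```

A long tail of style IDs that are each used once or twice is usually a sign of leftovers from earlier conversions.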
#13
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Could be possible, could be possible.
Please contact me via PM and send me your file. (Upload to Google Drive or other filesharing site and I could take a look.)
Quote:
For more info on the why/how, see three of my posts in "eBook Formatting in Sigil": Post #48, Post #50, Post #52, especially the 2 videos I linked in "MS Word vs Open Office Word" (Post #3).
Quote:
... Especially if you've been using the same file for over 20 years + saved in various different programs/formats. Who knows what crap crept in.
In your case, it may be best to export to a super clean/minimalist document, then reimport back so that you're starting from a proper foundation. Once you have that fantastic base, everything else becomes better.
Quote:
There are tools (like Toxaris's EPUBTools) to generate super clean ebooks from your Word files. You still might have to put in a little elbow grease to get some of the more complicated formatting back (like blockquotes/poetry/footnotes), but the vast bulk of the conversion can be converted super cleanly.
Quote:
There's a much easier way of doing this using Word's/LibreOffice's Advanced Find and Replace. No need to introduce their disgusting HTML exports/imports. Instead, you use the word processor's italics formatting within Advanced Search, then replace with your own "markdown".
Last year, I wrote step-by-step instructions for:
The 1st one went from:
Code:
This is italics and more italics too.
Code:
This is \emph{italics} and \emph{more italics too}.
Code:
This is <i>italics</i> and <i>more italics too</i>.
Code:
This is italics and more italics too.
Last edited by Tex2002ans; 05-10-2021 at 02:11 PM.
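Once italics are stored as plain-text markers, switching between the marker notations shown in the Code lines above is a simple regex job. The following is a small sketch, not the poster's linked instructions; the function names are invented, and it assumes the `\emph{...}` contents contain no nested braces.

```python
import re

# Assumption: no nested braces inside \emph{...}.
EMPH = re.compile(r"\\emph\{([^{}]*)\}")
ITAL = re.compile(r"<i>(.*?)</i>", re.DOTALL)

def emph_to_html(text):
    """Rewrite \\emph{...} markers as <i>...</i>."""
    return EMPH.sub(r"<i>\1</i>", text)

def html_to_emph(text):
    """Rewrite <i>...</i> markers as \\emph{...}."""
    # A function replacement avoids backslash escaping in the template.
    return ITAL.sub(lambda m: "\\emph{" + m.group(1) + "}", text)
```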
#14
Fanatic
Posts: 597
Karma: 14054112
Join Date: Jun 2014
Device: kindle
Thanks again for all the suggestions!
I used Softmaker Office to save the document as a text file, removing all the formatting. Then I opened it back up in Softmaker Office, selected all, applied the default formatting, put my headings and a bit of other needed formatting back, and saved it as a docx file. It is definitely faster to load and edit. I will see what it looks like in Sigil after I finish editing it.
#15
Bookmaker & Cat Slave
Posts: 11,503
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
Quote:
Well, I was going to suggest using Toxaris's amazing ePUBTools plugin for Word. You can then tag all the italics and bold throughout (and underscores, if used) first. Then Ctrl-A and remove all the inline styling, and then restore the italics, bold and underscores throughout. I suggest creating a main body style and applying that BEFORE you restore the italic/bold, etc., and then you're 95% of the way home. I've used it for just this purpose--cleaning up horrid old files--for years and it's the Best. Thing. EVAH. Seriously.
Hitch
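For anyone without Word or the plugin, the same tag-then-strip idea can be roughed out in Python. This is not how ePUBTools works internally; it is a stdlib-only sketch with made-up function names that reads `word/document.xml` and emits plain text with `<i>`/`<b>` tags around runs that carry direct italic or bold formatting. Italics applied through character styles are not detected.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def _is_on(run, tag):
    """True if the run's direct formatting switches `tag` ('i' or 'b') on."""
    prop = run.find(f"{W}rPr/{W}{tag}")
    if prop is None:
        return False
    # A missing w:val attribute means the toggle is on.
    return prop.get(W + "val", "true") not in ("0", "false", "off")

def tagged_paragraphs(docx):
    """Yield each paragraph as plain text with <i>/<b> tags kept."""
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    for para in root.iter(W + "p"):
        pieces = []
        for run in para.iter(W + "r"):
            text = "".join(t.text or "" for t in run.iter(W + "t"))
            if not text:
                continue
            if _is_on(run, "b"):
                text = f"<b>{text}</b>"
            if _is_on(run, "i"):
                text = f"<i>{text}</i>"
            pieces.append(text)
        # Merge adjacent runs that share the same formatting.
        yield "".join(pieces).replace("</i><i>", "").replace("</b><b>", "")
```

The tagged text can then be pasted into a clean document with a single body style, and the tags turned back into real italics and bold with the word processor's find and replace.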