12-11-2012, 10:51 PM | #1 |
Connoisseur
Posts: 60
Karma: 10
Join Date: Jun 2012
Device: Kindle Touch
|
How to Clean/Strip HTML from epub file?
I want to clean up a book. It is an absolute mess. I know how to convert it to an epub file, open it in Sigil, and extract the text files into a directory. Now I want to strip essentially all the HTML other than the paragraph tags and then fix up the text without changing the paragraph structure. (Afterward, I will add chapter, header, and other markup as needed, then run it through Calibre to recreate the ebook.) What is the easiest way to get rid of everything except the paragraph structure?
Sigil is useless beyond extracting the text files from the epub file. I had a problem with Word making every line a separate paragraph. I use a Mac. |
12-11-2012, 11:41 PM | #2 |
Wizard
Posts: 3,413
Karma: 13369310
Join Date: May 2008
Location: Launceston, Tasmania
Device: Sony PRS T3, Kobo Glo, Kindle Touch, iPad, Samsung SB 2 tablet
|
I'm not sure I understand you, but my first reaction is that your question is at about the same level as asking 'How do I remove all the bricks from my house?'. ePub ebooks consist of HTML files, plus a couple of other files. No HTML, no ePub.
In any event I suggest that you'll get better advice on the ePub forum than on the General Discussion forum. Last edited by AlexBell; 12-11-2012 at 11:42 PM. Reason: fixed typo |
12-11-2012, 11:47 PM | #3 |
Wizard
Posts: 1,601
Karma: 9211856
Join Date: Jan 2010
Device: kindle Oasis 2018, kindle 4 NT, kindle PW2, iPhone, iPad mini
|
You'll need to leave in the bare-bones HTML structure, but you can search and replace (with nothing) the HTML elements you don't need. Removing the attributes might be trickier. If you use a text editor that supports regular expressions, you can remove things quickly and powerfully. If the file is XHTML (which it might be), you could also try an XSLT transform.
It will take time no matter how you do it. |
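For anyone who would rather script the search-and-replace than do it by hand in an editor, here is a minimal Python sketch of the regex approach described above. The function name, the list of tags to keep, and the choice to drop `<style>`/`<script>` blocks together with their contents are all my own assumptions, not anything Sigil or Calibre does. Regular expressions are not a real HTML parser, so treat this as a rough first pass and work on a copy of the files.

```python
import re

# Assumption: these structural tags are the "bare bones" worth keeping.
KEEP = "html|head|title|body|p"

# <style>/<script> blocks are dropped together with their contents.
BLOCKS = re.compile(r"<(style|script)\b.*?</\1\s*>", re.S | re.I)
# Any opening or closing tag whose name is not in KEEP.
OTHER_TAGS = re.compile(r"</?(?!(?:%s)\b)[A-Za-z][^>]*>" % KEEP)
# An opening <p ...> tag, with or without attributes.
P_OPEN = re.compile(r"<p\b[^>]*>")


def strip_to_paragraphs(markup):
    """Remove every tag outside KEEP and strip attributes from <p> tags."""
    markup = BLOCKS.sub("", markup)
    markup = OTHER_TAGS.sub("", markup)  # the text inside removed tags stays
    return P_OPEN.sub("<p>", markup)


if __name__ == "__main__":
    sample = '<body><div class="c1"><p class="calibre1"><span>Hello</span> <i>world</i></p></div></body>'
    print(strip_to_paragraphs(sample))
```

Attributes on `<html>` are left alone on purpose, since an XHTML file usually needs its namespace declaration there. The XML declaration and DOCTYPE are also untouched, because they do not start with a letter after the `<`.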
12-12-2012, 01:18 AM | #4 |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
Where did you get the book from? My immediate reaction is that you should approach the publisher and get them to fix it.
|
12-12-2012, 03:27 AM | #5 | |
Connoisseur
Posts: 60
Karma: 10
Join Date: Jun 2012
Device: Kindle Touch
|
Quote:
My guess is that someone scanned the paper book, did an OCR, did little to clean it up, and then created an epub file. Somehow a lot of html dross was inserted along the way. I almost have to go back to the OCR output to clean it up. To make matters really complicated, the text has a lot of intentional misspellings to imitate an Eastern European mangling of the English language. I can sympathize with the original processor. Alas, there is no retail ebook available. I'm ordering the paper book through my local library in order to proofread the text. Maybe I will start over with my own scan and OCR. I haven't decided yet. Last edited by Jimbo724; 12-12-2012 at 03:41 AM. |
|
12-12-2012, 03:36 AM | #6 |
Member
Posts: 23
Karma: 172
Join Date: Jul 2008
Device: Prs505
|
Convert the book from EPUB to TXT, then back to EPUB.
That way all the HTML markup is lost. |
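If Calibre is installed, the round trip can be scripted with its `ebook-convert` command-line tool, which takes an input file and an output file. The sketch below only assumes that `ebook-convert` is on your PATH; the helper names and the `-clean` output suffix are made up for illustration. How well the paragraph breaks survive depends on how Calibre writes and re-reads the text file, so check the result before relying on it.

```python
import shutil
import subprocess
from pathlib import Path


def round_trip_commands(epub_path):
    """Build the two ebook-convert calls: EPUB -> TXT, then TXT -> EPUB."""
    src = Path(epub_path)
    txt = src.with_suffix(".txt")
    clean = src.with_name(src.stem + "-clean.epub")  # hypothetical output name
    return [
        ["ebook-convert", str(src), str(txt)],
        ["ebook-convert", str(txt), str(clean)],
    ]


def round_trip(epub_path):
    """Run both conversions; raises if Calibre's CLI cannot be found."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH; is Calibre installed?")
    for cmd in round_trip_commands(epub_path):
        subprocess.run(cmd, check=True)
```

Calling `round_trip("book.epub")` would leave `book.txt` and `book-clean.epub` next to the original, which stays untouched.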
12-12-2012, 06:02 AM | #7 |
Feral Underclass
Posts: 3,622
Karma: 26821535
Join Date: Jan 2010
Location: Yorkshire, tha noz
Device: 2nd hand paperback
|
An EPUB is just a zip file with HTML (and other) files inside it. Rename it to .zip and extract the files with whatever tool you usually use.
|
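The same extraction can be done without renaming anything, since Python's standard `zipfile` module does not care about the file extension. A small sketch, assuming the book has no DRM; the function name and the set of extensions treated as text files are my own choices:

```python
import zipfile
from pathlib import Path

# Assumption: the book's text lives in files with these extensions.
TEXT_SUFFIXES = {".html", ".xhtml", ".htm"}


def extract_epub(epub_path, out_dir):
    """Unzip an EPUB into out_dir and return its (X)HTML files, sorted by path."""
    out = Path(out_dir)
    with zipfile.ZipFile(epub_path) as zf:
        zf.extractall(out)  # creates out_dir if it does not exist
    return sorted(p for p in out.rglob("*") if p.suffix.lower() in TEXT_SUFFIXES)
```

The returned list gives you the files to clean up; the stylesheet, images, and the OPF/NCX files stay in the output directory alongside them.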
12-12-2012, 08:39 AM | #8 |
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
|
|
12-12-2012, 08:48 AM | #9 | |
The Grand Mouse 高貴的老鼠
Posts: 71,506
Karma: 306214458
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
|
Quote:
Go to http://www.barebones.com/products/textwrangler/ and get TextWrangler, a very good, free, text editor. I highly recommend its 'big brother' BBEdit if you're involved in a lot of text editing. It's only $49.99. It doesn't suck.® Last edited by pdurrant; 12-12-2012 at 08:50 AM. |
|
12-12-2012, 11:22 AM | #10 |
Connoisseur
Posts: 64
Karma: 69964
Join Date: Dec 2007
Device: Kindle
|
Have you tried using Calibre to convert it to a text file?
|
|