Old 12-11-2012, 10:51 PM   #1
Jimbo724
Connoisseur
Posts: 60
Karma: 10
Join Date: Jun 2012
Device: Kindle Touch
How to Clean/Strip HTML from epub file?

I want to clean up a book. It is an absolute mess. I know how to convert it to an epub file, open it in Sigil, and extract the text files into a directory. Now, I want to strip essentially all the HTML other than the paragraph code and then fix up the text without changing the paragraph structure. (Afterward, I will add chapter, header, and other codes as needed, then run it through Calibre to recreate the ebook.) What is the easiest way to get rid of everything except the paragraph structure?

Sigil is useless beyond extracting the text files from the epub file.

I had a problem with Word making every line a separate paragraph.

I use a Mac.
Old 12-11-2012, 11:41 PM   #2
AlexBell
Wizard
Posts: 3,413
Karma: 13369310
Join Date: May 2008
Location: Launceston, Tasmania
Device: Sony PRS T3, Kobo Glo, Kindle Touch, iPad, Samsung SB 2 tablet
I'm not at all sure I understand you, but my first reaction is that your question is at about the same level as asking 'How do I remove all the bricks from my house?' ePub ebooks consist of HTML files, plus a couple of other files. No HTML, no ePub.

In any event I suggest that you'll get better advice on the ePub forum than on the General Discussion forum.

Last edited by AlexBell; 12-11-2012 at 11:42 PM. Reason: fixed typo
Old 12-11-2012, 11:47 PM   #3
Joykins
Wizard
Posts: 1,601
Karma: 9211856
Join Date: Jan 2010
Device: kindle Oasis 2018, kindle 4 NT, kindle PW2, iPhone, iPad mini
You'll need to leave in the bare-bones HTML structure, but you can search and replace (with nothing) the HTML elements you don't need. Removing the attributes might be trickier. If you use a text editor that supports regular expressions, you can remove things pretty quickly and powerfully. If the file is XHTML (which it might be), you could also try an XSLT transform.

It will take time no matter how you do it.
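
For anyone who would rather script the search-and-replace than do it by hand in an editor, here is a rough sketch of the idea in Python. It is only an illustration: the function name `strip_to_paragraphs` and the `KEEP` tuple are made up for this example, and it assumes you run it on the body text of each file, that no attribute value contains a literal `>`, and that you are happy to lose italics, headings, and everything else that isn't a paragraph.

```python
import re

# Tags to keep; everything else is dropped (hypothetical setting for this sketch).
KEEP = ("p",)

# Matches an opening, closing, or self-closing tag and captures its name.
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)\b[^>]*>")


def strip_to_paragraphs(markup):
    """Remove every tag except <p>/</p>, and drop attributes from <p>."""

    def replace(match):
        name = match.group(1).lower()
        if name not in KEEP:
            return ""  # search and replace with nothing
        # Re-emit the kept tag without any attributes.
        return "</p>" if match.group(0).startswith("</") else "<p>"

    return TAG_RE.sub(replace, markup)


sample = '<p class="calibre1"><span style="x">Hello</span> <i>world</i></p>'
print(strip_to_paragraphs(sample))
```

Work on a copy of the files. A regex like this knows nothing about document structure, so check the result in an editor before rebuilding the book.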
Old 12-12-2012, 01:18 AM   #4
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Where did you get the book from? My immediate reaction is that you should approach the publisher and get them to fix it.
Old 12-12-2012, 03:27 AM   #5
Jimbo724
Connoisseur
Posts: 60
Karma: 10
Join Date: Jun 2012
Device: Kindle Touch
Quote:
Originally Posted by Joykins
search and replace (with nothing) the html elements you don't need.
That appears to be the best approach.

My guess is that someone scanned the paper book, did an OCR, did little to clean it up, and then created an epub file. Somehow a lot of html dross was inserted along the way. I almost have to go back to the OCR output to clean it up. To make matters really complicated, the text has a lot of intentional misspellings to imitate an Eastern European mangling of the English language. I can sympathize with the original processor. Alas, there is no retail ebook available.

I'm ordering the paper book through my local library in order to proofread the text. Maybe I will start over with my own scan and OCR. I haven't decided yet.

Last edited by Jimbo724; 12-12-2012 at 03:41 AM.
Old 12-12-2012, 03:36 AM   #6
Bossman
Member
Posts: 23
Karma: 172
Join Date: Jul 2008
Device: Prs505
Convert the book from EPUB to TXT, then back to EPUB.

That way all the HTML stuff is lost.
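
If you want to script that round trip, one way is Calibre's `ebook-convert` command-line tool, which picks the formats from the file extensions. The sketch below assumes `ebook-convert` is installed and on your PATH; the helper names and the `-clean` suffix are made up for this example.

```python
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(src: Path, dst: Path) -> List[str]:
    """Build an ebook-convert call; formats come from the file extensions."""
    return ["ebook-convert", str(src), str(dst)]


def round_trip(epub: Path) -> Path:
    """Convert EPUB -> TXT -> EPUB, which discards the original markup."""
    txt = epub.with_suffix(".txt")
    clean = epub.with_name(epub.stem + "-clean.epub")
    subprocess.run(build_convert_command(epub, txt), check=True)
    subprocess.run(build_convert_command(txt, clean), check=True)
    return clean


# Example (needs Calibre installed): round_trip(Path("book.epub"))
```

Bear in mind that plain text carries no markup at all, so italics and headings go along with the dross, and the paragraph breaks in the new EPUB depend on how the text is re-imported.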
Old 12-12-2012, 06:02 AM   #7
mr ploppy
Feral Underclass
Posts: 3,622
Karma: 26821535
Join Date: Jan 2010
Location: Yorkshire, tha noz
Device: 2nd hand paperback
An EPUB is just a zip file with HTML (and other) files inside it. Just rename it to .zip and extract them with whatever you usually use.
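
The same thing can be done without renaming anything, for example with Python's standard `zipfile` module. This is a minimal sketch; the function name `extract_epub` and the list of extensions it reports are choices made for this example.

```python
import zipfile
from pathlib import Path


def extract_epub(epub_path, out_dir):
    """Unpack an EPUB (a zip archive) and list the HTML/XHTML files in it."""
    out = Path(out_dir)
    with zipfile.ZipFile(epub_path) as archive:
        archive.extractall(out)  # creates out_dir if it does not exist
        names = archive.namelist()
    # Report only the text content files, not images, CSS, or metadata.
    return sorted(n for n in names if n.lower().endswith((".html", ".xhtml", ".htm")))


# Example: print(extract_epub("book.epub", "book_unpacked"))
```

If you zip the files back up yourself afterwards, the `mimetype` file has to be the first entry in the archive and stored uncompressed.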
Old 12-12-2012, 08:39 AM   #8
Serpentine
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
Quote:
Originally Posted by Jimbo724
Sigil is useless beyond extracting the text files from the epub file.

I use a Mac.
It's telling.

Strip all of the style and class attributes from the tags, and remove the stylesheets and embedded styles.

Regex is your friend; Sigil makes sure you don't mess it up too much.
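
As a rough illustration of those regexes, here they are written in Python rather than in Sigil's find-and-replace. The pattern names and `strip_styling` are made up for this sketch, and it assumes attribute values are quoted, which should hold for valid XHTML.

```python
import re

# Embedded <style>...</style> blocks, including their contents.
STYLE_BLOCK_RE = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)

# <link ... rel="stylesheet" ...> references to external stylesheets.
STYLESHEET_LINK_RE = re.compile(
    r"""<link\b[^>]*\brel\s*=\s*["']stylesheet["'][^>]*>""", re.IGNORECASE
)

# class="..." and style="..." attributes, with their leading whitespace.
ATTR_RE = re.compile(r"""\s+(?:class|style)\s*=\s*(?:"[^"]*"|'[^']*')""", re.IGNORECASE)


def strip_styling(markup):
    """Remove embedded styles, stylesheet links, and class/style attributes."""
    markup = STYLE_BLOCK_RE.sub("", markup)
    markup = STYLESHEET_LINK_RE.sub("", markup)
    return ATTR_RE.sub("", markup)


sample = '<p class="c1" style="margin:0">Hi <span class="c2">there</span></p>'
print(strip_styling(sample))
```

This leaves the tags themselves in place, so italics and headings survive; empty `<span>` wrappers left behind can be removed in a second pass. The attribute pattern does not check that it is inside a tag, so text that literally contains something like ` class="..."` would be altered too.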
Old 12-12-2012, 08:48 AM   #9
pdurrant
The Grand Mouse 高貴的老鼠
Posts: 71,496
Karma: 306214458
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
Quote:
Originally Posted by Jimbo724
I had a problem with Word making every line a separate paragraph.

I use a Mac.
I just noticed that last sentence.

Go to http://www.barebones.com/products/textwrangler/ and get TextWrangler, a very good, free, text editor.

I highly recommend its 'big brother' BBEdit if you're involved in a lot of text editing. It's only $49.99. It doesn't suck.®

Last edited by pdurrant; 12-12-2012 at 08:50 AM.
Old 12-12-2012, 11:22 AM   #10
garrybuck
Connoisseur
Posts: 64
Karma: 69964
Join Date: Dec 2007
Device: Kindle
Have you tried using Calibre to convert it to a text file?
All times are GMT -4. The time now is 10:11 PM.


MobileRead.com is a privately owned, operated and funded community.