Old 12-11-2012, 11:51 PM   #1
Jimbo724
Connoisseur
Posts: 60
Karma: 10
Join Date: Jun 2012
Device: Kindle Touch
How to Clean/Strip HTML from epub file?

I want to clean up a book. It is an absolute mess. I know how to convert it to an epub file, open it in Sigil, and extract the text files into a directory. Now, I want to strip essentially all the HTML other than the paragraph code and then fix up the text without changing the paragraph structure. (Afterward, I will add chapter, header, and other codes as needed, then run it through Calibre to recreate the ebook.) What is the easiest way to get rid of everything except the paragraph structure?

Sigil is useless beyond extracting the text files from the epub file.

I had a problem with Word making every line a separate paragraph.

I use a Mac.
Old 12-12-2012, 12:41 AM   #2
AlexBell
Wizard
Posts: 2,319
Karma: 3943902
Join Date: May 2008
Location: Launceston, Tasmania
Device: Kindle3, Kobo Touch, Sony PRS T3, Nexus 7, iPad mini
I'm not sure I understand you, but my first reaction is that your question is at about the same level as asking 'How do I remove all the bricks from my house?'. ePub ebooks consist of HTML files, plus a couple of other files. No HTML, no ePub.

In any event I suggest that you'll get better advice on the ePub forum than on the General Discussion forum.

Last edited by AlexBell; 12-12-2012 at 12:42 AM. Reason: fixed typo
Old 12-12-2012, 12:47 AM   #3
Joykins
Wizard
Posts: 1,514
Karma: 6570458
Join Date: Jan 2010
Device: nook, kindle 4 NT, kindle PW2, iPhone
You'll need to leave in the bare-bones HTML structure, but you can search and replace (with nothing) the HTML elements you don't need. Removing the attributes might be trickier. If you use a text editor that supports regular expressions, you can remove things pretty quickly and powerfully. If the file is XHTML (which it might be), you could also try an XSLT transform.

It will take time no matter how you do it.
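
A rough sketch of that search-and-replace idea in Python, for anyone who would rather script it than do it by hand in an editor. The function name `strip_to_paragraphs` and the choice to keep only `<p>` are assumptions made for illustration. It is meant to be run on the contents of `<body>`, not on a whole XHTML file, since it would also drop the `<html>`, `<head>`, and `<body>` tags. A regex is not a real HTML parser, so treat it as a first pass and check the result.

```python
import re

# Tag names to keep (with their attributes removed). Everything else is dropped,
# but the text between the tags is left alone.
KEEP = {"p"}

# Matches an opening, closing, or self-closing tag and captures its name.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\b[^>]*>")


def strip_to_paragraphs(markup):
    """Remove every tag except <p>/</p>, and strip attributes from <p>."""
    def replace_tag(match):
        closing, name = match.group(1), match.group(2).lower()
        if name in KEEP:
            return "<" + closing + name + ">"  # bare tag, no attributes
        return ""  # drop the tag, keep surrounding text
    return TAG_RE.sub(replace_tag, markup)


sample = '<div><p class="calibre1"><span class="c2">Hello</span>, <i>world</i>.</p></div>'
print(strip_to_paragraphs(sample))
```

Adding names such as "i" or "h1" to `KEEP` would preserve italics or headings as bare tags.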
Old 12-12-2012, 02:18 AM   #4
HarryT
eBook Enthusiast
Posts: 64,930
Karma: 42992227
Join Date: Nov 2006
Location: UK
Device: Kindle Voyage, iPad Mini, iPhone 4, MS Surface Pro, N7
Where did you get the book from? My immediate reaction is that you should approach the publisher and get them to fix it.
Old 12-12-2012, 04:27 AM   #5
Jimbo724
Connoisseur
Posts: 60
Karma: 10
Join Date: Jun 2012
Device: Kindle Touch
Quote:
Originally Posted by Joykins
search and replace (with nothing) the html elements you don't need.
That appears to be the best approach.

My guess is that someone scanned the paper book, did an OCR, did little to clean it up, and then created an epub file. Somehow a lot of html dross was inserted along the way. I almost have to go back to the OCR output to clean it up. To make matters really complicated, the text has a lot of intentional misspellings to imitate an Eastern European mangling of the English language. I can sympathize with the original processor. Alas, there is no retail ebook available.

I'm ordering the paper book through my local library in order to proofread the text. Maybe I will start over with my own scan and OCR. I haven't decided yet.

Last edited by Jimbo724; 12-12-2012 at 04:41 AM.
Old 12-12-2012, 04:36 AM   #6
Bossman
Member
Posts: 23
Karma: 172
Join Date: Jul 2008
Device: Prs505
Convert the book from EPUB to TXT, then back to EPUB.

That way all the HTML stuff is lost.
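
One way to script that round trip, assuming Calibre is installed and its `ebook-convert` command-line tool is on the PATH. The helper names and the `-clean` output suffix are made up for this sketch. Note that a trip through plain text also throws away italics, headings, and the like, and paragraph breaks have to be re-detected on the way back, so the result needs checking.

```python
import subprocess
from pathlib import Path


def round_trip_commands(epub_path):
    """Build the two ebook-convert invocations: EPUB -> TXT, then TXT -> EPUB."""
    src = Path(epub_path)
    txt = src.with_suffix(".txt")
    clean = src.with_name(src.stem + "-clean.epub")  # hypothetical output name
    return [
        ["ebook-convert", str(src), str(txt)],
        ["ebook-convert", str(txt), str(clean)],
    ]


def round_trip(epub_path):
    """Run both conversions; raises CalledProcessError if either one fails."""
    for command in round_trip_commands(epub_path):
        subprocess.run(command, check=True)


# Usage (requires Calibre):
# round_trip("book.epub")
```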
Old 12-12-2012, 07:02 AM   #7
mr ploppy
Feral Underclass
Posts: 3,532
Karma: 26546435
Join Date: Jan 2010
Location: Yorkshire, tha noz
Device: 2nd hand paperback
Epub is just a zip file with html (and other) files inside it. Just rename it and extract them with whatever you usually use.
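
For the scripting-inclined, the same thing can be done with Python's standard `zipfile` module, which reads the archive whatever its extension, so no rename is needed. A minimal sketch; the function name `extract_epub` and the list of HTML extensions are assumptions for illustration.

```python
import zipfile
from typing import List


def extract_epub(epub_path: str, out_dir: str) -> List[str]:
    """Unpack an EPUB into out_dir and return the names of its HTML/XHTML members."""
    with zipfile.ZipFile(epub_path) as archive:
        archive.extractall(out_dir)  # creates out_dir if it does not exist
        return sorted(
            name for name in archive.namelist()
            if name.lower().endswith((".html", ".xhtml", ".htm"))
        )


# Usage:
# for name in extract_epub("book.epub", "book_unpacked"):
#     print(name)
```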
Old 12-12-2012, 09:39 AM   #8
Serpentine
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
Quote:
Originally Posted by Jimbo724
Sigil is useless beyond extracting the text files from the epub file.

I use a Mac.
It's telling.

Strip all of the style and class attributes from the tags, and remove the stylesheets and embedded styles.

Regex is your friend; Sigil makes sure you don't mess it up too much.
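
The stripping described above might look something like this as Python regexes. This is only a sketch: the pattern names and `strip_styling` are invented here, the attribute pattern only handles quoted values, and it would also match a literal ` class="..."` appearing in the book's text, so look over the output. The same patterns can be adapted for a regex-capable editor's find-and-replace, replacing with nothing.

```python
import re

# <style ...> ... </style> blocks, including their contents.
STYLE_BLOCK_RE = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)

# <link rel="stylesheet" ...> references to external stylesheets.
STYLESHEET_LINK_RE = re.compile(
    r"<link\b[^>]*\brel\s*=\s*[\"']stylesheet[\"'][^>]*>", re.IGNORECASE
)

# class="..." and style="..." attributes (quoted values only).
ATTR_RE = re.compile(r"\s+(?:class|style)\s*=\s*(?:\"[^\"]*\"|'[^']*')", re.IGNORECASE)


def strip_styling(markup):
    """Remove embedded styles, stylesheet links, and class/style attributes."""
    markup = STYLE_BLOCK_RE.sub("", markup)
    markup = STYLESHEET_LINK_RE.sub("", markup)
    return ATTR_RE.sub("", markup)


sample = '<p class="c1" style="text-indent:1em">Hi</p>'
print(strip_styling(sample))
```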
Old 12-12-2012, 09:48 AM   #9
pdurrant
The Grand Mouse
Posts: 32,810
Karma: 89897838
Join Date: Jul 2007
Location: Norfolk, England
Device: NOOK ST GlowLight
Quote:
Originally Posted by Jimbo724
I had a problem with Word making every line a separate paragraph.

I use a Mac.
I just noticed that last sentence.

Go to http://www.barebones.com/products/textwrangler/ and get TextWrangler, a very good, free, text editor.

I highly recommend its 'big brother' BBEdit if you're involved in a lot of text editing. It's only $49.99. It doesn't suck.®

Last edited by pdurrant; 12-12-2012 at 09:50 AM.
Old 12-12-2012, 12:22 PM   #10
garrybuck
Connoisseur
Posts: 64
Karma: 69964
Join Date: Dec 2007
Device: Kindle
Have you tried using Calibre to convert it to a text file?
All times are GMT -4.