Old 03-16-2010, 02:51 AM   #1
ficbot
Wizard
Posts: 2,409
Karma: 4132096
Join Date: Sep 2008
Device: Kindle Paperwhite/iOS Kindle App
Need help converting file which is too long to be HTML

I have a purchased ebook I am trying to convert from very messy HTML. The 'liberation script' I ran on it produced very messy output, and while it looked fine on my old Sony, it looked terrible when I tried Mobipocket for my Kindle. My usual trick with this type of book is to open it in a web browser and copy/paste the text from there into Kompozer (my web program) to get clean HTML. However, the ebook (The Mists of Avalon) is VERY long and I seem to be running into some sort of file size limit. When I tried copying and pasting, it would not grab the whole thing at once, which was fine. But when I went to the spot where it left off and copied from there, it would not let me paste it into the HTML. I am thinking there must be some sort of file size limit for HTML that I am unaware of?

I am wondering what my other options are for this book. I bought it a while ago when there was no epub. I do not plan to deal with the format this book started in for the future, but this is the last book in my collection that still has to be converted and I would like to not have to buy it again. In the past I experimented with RTF but it appeared tiny when converted to LRF. Will I have this same issue with mobi? Or is there some other trick I can use to remove the extraneous garbage this HTML file seems to have?

I am on a Mac. Software I am comfortable with and/or own includes OpenOffice, Kompozer, Calibre, Firefox/Safari and Pages.
Old 03-16-2010, 06:58 AM   #2
nrapallo
GuteBook/Mobi2IMP Creator
Posts: 2,958
Karma: 2530691
Join Date: Dec 2007
Location: Toronto, Canada
Device: REB1200 EBW1150 Device: T1 NSTG iLiad_v2 NC Device: Asus_TF Next1 WPDN
Hi ficbot!

I usually convert very large .html code into ebooks and have never run into a file that was too large to convert.

The fact that your copy/paste was cut off prematurely may indicate that there is a null/EOF character in the text stream (note: EOF = end of file).

Just so I understand, you converted a .lrf into .html and that .html can be viewed in your web browser, but cannot be copied properly. Is that right?

Here's then what I would recommend:

1. open the .html in a text editor capable of dealing with hexadecimal characters and search for '00' bytes. Replace them with a space or with nothing, whichever works better.

2. if you have access to a Windows computer, copy and paste that web text into a newly created (blank) message in the Outlook Express email program. This allows the .html to be retained and perhaps cleansed of any troublesome characters. Then, by clicking the Source tab at the bottom of the message, you can copy the HTML code of that text, paste it into an empty file and save it as a new .html file. Then repeat your web text extraction procedure.

3. if you can't make any of that work, just contact me via PM or email me at "nrapallo (at) yahoo.ca"

I can see no reason a text file would be too long to copy so there should be a simple solution to your problem, once the source of the problem has been identified.
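If a hex editor isn't handy, step 1 can also be done with a few lines of Python. This is only a rough sketch: the function names and file names are made up for illustration, and it assumes the file is plain single-byte or UTF-8 text. If the file turns out to be UTF-16, many of its bytes are legitimately 00 and stripping them would corrupt it.

```python
def strip_nul_bytes(data: bytes, replacement: bytes = b"") -> bytes:
    """Return data with every 0x00 byte replaced (by nothing, by default)."""
    return data.replace(b"\x00", replacement)


def clean_file(src: str, dst: str) -> int:
    """Copy src to dst without NUL bytes and return how many were found."""
    with open(src, "rb") as f:
        raw = f.read()
    count = raw.count(b"\x00")
    with open(dst, "wb") as f:
        f.write(strip_nul_bytes(raw))
    return count


# Hypothetical usage (file names are placeholders):
# removed = clean_file("book.html", "book_clean.html")
# print(removed, "NUL bytes removed")
```

Writing to a new file rather than overwriting the original keeps the untouched copy around in case the result is worse. Pass `b" "` as the replacement if dropping the bytes runs words together.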

Regards,
Old 03-16-2010, 07:57 PM   #3
frabjous
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
What happens when you try to convert the HTML directly through calibre? (Mess and all?)

Can you open the HTML file in OpenOffice?

Have you tried uploading it to Google Docs?

Be warned that I found that Kompozer does not generate clean HTML -- in fact, I gave up completely on WYSIWYG HTML editors after looking at some of the nonsense code Kompozer was putting in my document. Now I edit HTML in a text editor.
Old 03-16-2010, 08:22 PM   #4
ficbot
Wizard
Posts: 2,409
Karma: 4132096
Join Date: Sep 2008
Device: Kindle Paperwhite/iOS Kindle App
When I try to convert it, I get a readable Mobipocket file but with like 10 line breaks between each paragraph. I opened it in TextWrangler and EVERY single line of the HTML file has its own font and formatting info. It will take ages to go through line by line and remove them. I tried using find and replace and it wound up removing the line breaks too. OpenOffice opened it but the HTML it generated was pretty awful too. Would saving it as an RTF help things? And then converting that? I had trouble on the Sony with RTF files displaying tiny and Kovid told me HTML was better for converting in Calibre. It is just such a loooooong book. I would hate to have to go through it line by line.
Old 03-16-2010, 10:27 PM   #5
charleski
Wizard
Posts: 1,196
Karma: 1281258
Join Date: Sep 2009
Device: PRS-505
When you convert in calibre there's an option to specify a directory for debug info. If that's filled in, calibre will write out the html corresponding to each stage of its parsing and structure detection process, and sometimes you can root around in there to find a version that's easier to edit.
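The same debug output is also available from calibre's command-line converter, `ebook-convert`, through its `--debug-pipeline` option. A small sketch of driving that from Python, assuming `ebook-convert` is on your PATH; the helper names and the file and folder names are placeholders, not anything from this thread:

```python
import subprocess


def build_debug_command(src, dst, debug_dir):
    """Build an ebook-convert command that also dumps each pipeline stage."""
    return ["ebook-convert", src, dst, "--debug-pipeline", debug_dir]


def run_debug_conversion(src, dst, debug_dir):
    """Run the conversion; raises CalledProcessError if calibre reports failure."""
    subprocess.run(build_debug_command(src, dst, debug_dir), check=True)


# Hypothetical usage (requires calibre to be installed):
# run_debug_conversion("book.html", "book.mobi", "debug_out")
```

After it finishes, the debug folder should hold the intermediate HTML from each stage, which is where you would go looking for a version that is easier to edit.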

It's usually best to try to get a cleaner html file to start with, but it might be that the original was just very poorly coded. I've come across commercial ebooks that have clearly been run through calibre a couple of times by a lazy technician and end up a mess.

If there's no way to find a cleaner version of html to start from, then there's nothing for it but to strip the tags with a sequence of regular expressions. These can be tricky, and you need to be sure to save each intermediate step in case something goes wrong. Post a couple of lines of the html so we can see what the problem is like.
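Without seeing the actual file, here is only a guess at what such a sequence might look like, assuming the clutter is `<font>` tags, inline `style` attributes and runs of `<br>` tags. The step names and patterns are illustrative and would need adjusting to the real markup. Regular expressions do not truly parse HTML, so check each saved step before trusting the next one.

```python
import os
import re

# Each step is (label, compiled pattern, replacement). Order matters.
STEPS = [
    # Remove opening and closing <font ...> tags, keeping the text inside.
    ("1_drop_font_tags", re.compile(r"</?font\b[^>]*>", re.I), ""),
    # Remove quoted style="..." attributes from any tag.
    ("2_drop_style_attrs",
     re.compile(r"\s+style\s*=\s*(\"[^\"]*\"|'[^']*')", re.I), ""),
    # Collapse two or more consecutive <br> tags into a single one.
    ("3_collapse_breaks", re.compile(r"(?:<br\s*/?>\s*){2,}", re.I), "<br/>\n"),
]


def clean(html, on_step=None):
    """Apply every step in order; call on_step(label, html) after each one."""
    for label, pattern, replacement in STEPS:
        html = pattern.sub(replacement, html)
        if on_step:
            on_step(label, html)
    return html


def save_steps_to(directory):
    """Return a callback that writes each intermediate result to its own file."""
    os.makedirs(directory, exist_ok=True)

    def _save(label, html):
        path = os.path.join(directory, label + ".html")
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)

    return _save


# Hypothetical usage:
# cleaned = clean(open("book.html", encoding="utf-8").read(),
#                 on_step=save_steps_to("steps"))
```

Saving one file per step means that if a pattern eats something it shouldn't (the line breaks, say), you can go back to the last good file and fix just that pattern.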
Old 03-16-2010, 10:54 PM   #6
frabjous
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
You could try running HTML Tidy on it to get a cleaner HTML file. (HTML Tidy is available in many forms, including an online version.)

Too bad Ahi never finished his pacify script -- I think this was the main point of it.
Old 04-06-2010, 01:07 PM   #7
eping
ePub Maker
Posts: 120
Karma: 16
Join Date: Dec 2009
Location: Mordor
Device: iPad,Kindle 3, Nook 2
Has this problem been solved?

Quote:
Originally Posted by ficbot
When I try to convert it, I get a readable Mobipocket file but with like 10 line breaks between each paragraph. I opened it in TextWrangler and EVERY single line of the HTML file has its own font and formatting info.

I think this is typical machine-generated, redundant HTML, usually produced by MS Word/Excel or Adobe Acrobat.
You can email the file to me if possible; I would like to optimize it for you.
I have a tool that can clean and optimize this style of HTML by merging all the repeated inline styles into single rules in a stylesheet.
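To give an idea of what merging repeated inline styles means, here is a toy sketch. It is not the tool mentioned above, just an illustration. It assumes every style attribute is double-quoted and that the tags carrying one have no class attribute already; the function name and the generated class names (s1, s2, ...) are made up.

```python
import re

# Matches a double-quoted style="..." attribute and captures its declarations.
STYLE_ATTR = re.compile(r'\sstyle\s*=\s*"([^"]*)"', re.I)


def hoist_inline_styles(html):
    """Swap each distinct inline style for a class; return (html, css)."""
    classes = {}  # normalized declaration text -> generated class name

    def swap(match):
        # Normalize whitespace and the trailing semicolon so that
        # "font-size: 12pt;" and "font-size:  12pt" share one class.
        decl = " ".join(match.group(1).split()).rstrip(";")
        name = classes.setdefault(decl, "s%d" % (len(classes) + 1))
        return ' class="%s"' % name

    body = STYLE_ATTR.sub(swap, html)
    css = "\n".join(
        ".%s { %s; }" % (name, decl) for decl, name in classes.items()
    )
    return body, css
```

The returned CSS goes into a `<style>` block or a separate stylesheet. A book where every paragraph repeats the same font declaration ends up with a handful of rules instead of thousands of attributes.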
Old 04-06-2010, 02:09 PM   #8
ficbot
Wizard
Posts: 2,409
Karma: 4132096
Join Date: Sep 2008
Device: Kindle Paperwhite/iOS Kindle App
Thanks for the offer. I did manage to solve the problem by doing find and replaces, but every line was slightly different so it took a while. I think it's readable now.
Old 04-06-2010, 11:45 PM   #9
eping
ePub Maker
Posts: 120
Karma: 16
Join Date: Dec 2009
Location: Mordor
Device: iPad,Kindle 3, Nook 2
Sorry, I forgot to say it's free. You needn't pay.

If anyone has an occasional need for HTML cleaning and optimizing, please email me.
I'd be glad to help.
My address: service(at)htmlcleaner(dot)com

I would like to set up a free online service sometime later, if I can.