03-16-2010, 03:51 AM | #1 |
Wizard
Posts: 2,409
Karma: 4132096
Join Date: Sep 2008
Device: Kindle Paperwhite/iOS Kindle App
|
Need help converting file which is too long to be HTML
I have a purchased ebook I am trying to convert from very messy HTML. The 'liberation script' I ran on it produced very messy output, and while it looked fine on my old Sony, it looked terrible when I tried Mobipocket for my Kindle. My usual trick with this type of book is to open it in a web browser and copy/paste the text from there into Kompozer (my web program) to get clean HTML. However, the ebook (The Mists of Avalon) is VERY long and I seem to be running into some sort of file size limit. When I tried copying and pasting, it would not grab the whole thing at once, which was fine. But when I went to the spot where it left off and copied from there, it would not let me paste it into the HTML. I am thinking there must be some sort of file size limit for HTML I am unaware of?
I am wondering what my other options are for this book. I bought it a while ago when there was no epub. I do not plan to deal with the format this book started in for the future, but this is the lone, last book of my collection of them that has to be converted and I would like to not have to buy it again. In the past I experimented with RTF but it appeared tiny when converted to LRF. Will I have this same issue with mobi? Or is there some other trick I can use to remove the extraneous garbage this HTML file seems to have? I am on a Mac. Software I am comfortable with and/or own includes Open Office, Kompozer, Calibre, Firefox/Safari and Pages. |
03-16-2010, 07:58 AM | #2 |
GuteBook/Mobi2IMP Creator
Posts: 2,958
Karma: 2530691
Join Date: Dec 2007
Location: Toronto, Canada
Device: REB1200 EBW1150 Device: T1 NSTG iLiad_v2 NC Device: Asus_TF Next1 WPDN
|
Hi ficbot!
I routinely convert very large .html files into ebooks and have never run into a file that was too large to convert. The fact that your copy/paste was cut off prematurely may indicate that there is a null/EOF character in the text stream (note: EOF = end of file).
Just so I understand, you converted a .lrf into .html and that .html can be viewed in your web browser, but cannot be copied properly. Is that right? If so, here's what I would recommend:
1. Open the .html in a text editor capable of dealing with hexadecimal characters and search for a '00' byte. Replace each one with a space or with nothing, depending on what works better.
2. If you have access to a Windows computer, copy and paste that web text into a newly created (blank) message in the Outlook Express email program. This allows the .html to be retained and perhaps cleansed of any troublesome characters. Then, by clicking the Source tab at the bottom of the message, you can copy the HTML code of that text, paste it into an empty file and save it as a new .html file. Then repeat your web text extraction procedure.
3. If you can't make any of that work, just contact me via PM or email me at "nrapallo (at) yahoo.ca".
I can see no reason a text file would be too long to copy, so there should be a simple solution to your problem once its source has been identified. Regards, |
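For anyone without a hex-capable editor, step 1 can also be done with a short script. This is only a sketch: the function names are made up for illustration, and it assumes the file is in a single-byte encoding or UTF-8. A UTF-16 file contains legitimate 00 bytes, so it should not be run on one.

```python
from typing import Tuple


def strip_nul_bytes(data: bytes, replacement: bytes = b"") -> Tuple[bytes, int]:
    """Replace every 0x00 byte and report how many were found."""
    count = data.count(b"\x00")
    return data.replace(b"\x00", replacement), count


def clean_file(source_path: str, target_path: str) -> int:
    """Write a NUL-free copy of source_path to target_path (hypothetical helper)."""
    # Binary mode so nothing is decoded or altered except the 00 bytes.
    with open(source_path, "rb") as handle:
        raw = handle.read()
    cleaned, found = strip_nul_bytes(raw)
    # Write to a new file so the original stays untouched.
    with open(target_path, "wb") as handle:
        handle.write(cleaned)
    return found
```

Passing `b" "` as the replacement gives the "replace with a space" variant; the default removes the bytes entirely.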
03-16-2010, 08:57 PM | #3 |
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
|
What happens when you try to convert the HTML directly through calibre? (Mess and all?)
Can you open the HTML file in OpenOffice? Have you tried uploading it to Google Docs? Be warned that I found that Kompozer does not generate clean HTML -- in fact, I gave up completely on WYSIWYG HTML editors after looking at some of the nonsense code Kompozer was putting in my document. Now I edit HTML in a text editor. |
03-16-2010, 09:22 PM | #4 |
Wizard
Posts: 2,409
Karma: 4132096
Join Date: Sep 2008
Device: Kindle Paperwhite/iOS Kindle App
|
When I try to convert it, I get a readable mobipocket file but with something like 10 line breaks between each paragraph. I opened it in TextWrangler and EVERY single line of the HTML file has its own font and formatting info. It will take ages to go through line by line and remove them. I tried using find and replace and it wound up removing the line breaks too. Open Office opened it but the HTML it generated was pretty awful too. Would saving it as an RTF help things? And then converting that? I had trouble on the Sony with RTF files displaying tiny and Kovid told me HTML was better for converting in Calibre. It is just such a loooooong book. I would hate to have to go through it line by line.
|
03-16-2010, 11:27 PM | #5 |
Wizard
Posts: 1,196
Karma: 1281258
Join Date: Sep 2009
Device: PRS-505
|
When you convert in calibre there's an option to specify a directory for debug info. If that's filled in, calibre will write out the html corresponding to each stage of its parsing and structure detection process, and sometimes you can root around in there to find a version that's easier to edit.
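The same debug output can be requested from calibre's command-line converter. A rough sketch, assuming the calibre command-line tools are installed and on the PATH and that the installed version supports the `--debug-pipeline` option; the helper names, file names and folder name are placeholders, not anything from the original post:

```python
import subprocess


def build_debug_convert_command(source, output, debug_dir):
    """Build an ebook-convert command that also dumps the intermediate stages."""
    return ["ebook-convert", source, output, "--debug-pipeline", debug_dir]


def run_debug_convert(source, output, debug_dir):
    """Run the conversion; raises CalledProcessError if ebook-convert fails."""
    subprocess.run(build_debug_convert_command(source, output, debug_dir), check=True)


# Example call (placeholder names):
# run_debug_convert("book.html", "book.mobi", "debug_out")
```

After a run, the debug folder should hold the HTML as it looked at each stage, which is where to go looking for an easier version to edit.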
It's usually best to try to get a cleaner html file to start with, but it might be that the original was just very poorly coded; I've come across commercial ebooks that have clearly been run through calibre a couple of times by a lazy technician and ended up a mess. If there's no way to find a cleaner version of the html to start from, then there's nothing for it but to strip the tags with a sequence of regular expressions. These can be tricky, and you need to be sure to save each intermediate step in case something goes wrong. Post a couple of lines of the html so we can see what the problem is like. |
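As an illustration of the "sequence of regular expressions" approach, here is a sketch that applies a few substitutions in order and keeps every intermediate result so a bad step can be backed out. The patterns are guesses at what the messy markup might contain (font tags, span tags, double-quoted inline style attributes, runs of line breaks), not a description of the actual file, and regular expressions are not a full HTML parser.

```python
import re

# (label, pattern, replacement) applied in this order. All patterns are
# assumptions about the markup and will need adjusting for a real file.
STEPS = [
    ("drop_font_tags", re.compile(r"</?font\b[^>]*>", re.IGNORECASE), ""),
    ("drop_span_tags", re.compile(r"</?span\b[^>]*>", re.IGNORECASE), ""),
    ("drop_inline_styles", re.compile(r'\s+style\s*=\s*"[^"]*"', re.IGNORECASE), ""),
    ("collapse_breaks", re.compile(r"(?:<br\s*/?>\s*){2,}", re.IGNORECASE), "<br />\n"),
]


def clean_html(html):
    """Apply each step in order; return the final text and all intermediate stages."""
    stages = []
    for label, pattern, replacement in STEPS:
        html = pattern.sub(replacement, html)
        stages.append((label, html))
    return html, stages


def save_stages(stages, prefix):
    """Write one numbered file per step so any stage can be inspected or restored."""
    for number, (label, text) in enumerate(stages, start=1):
        with open(f"{prefix}.{number}_{label}.html", "w", encoding="utf-8") as handle:
            handle.write(text)
```

Only the tags are removed, so the text and the paragraph markup stay in place; only runs of two or more break tags are collapsed into one.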
03-16-2010, 11:54 PM | #6 |
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
|
You could try running HTML tidy on it to get a cleaner HTML file. (HTML tidy is available in many ways; here's an online version.)
Too bad Ahi never finished his pacify script -- I think this was the main point of it. |
04-06-2010, 02:07 PM | #7 | |
ePub Maker
Posts: 120
Karma: 16
Join Date: Dec 2009
Location: Mordor
Device: iPad,Kindle 3, Nook 2
|
Has this problem been solved?
Quote:
MS Word/Excel or Adobe Acrobat. You can email the file to me if possible; I would like to optimize it for you. I have a tool that can clean and optimize this style of HTML by merging all repeated inline styles into one line in the stylesheet. |
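The idea of merging repeated inline styles into stylesheet rules can be sketched as below. This is not the tool mentioned in the post, just a minimal illustration: it assumes double-quoted `style` attributes on elements that have no `class` attribute already, and the generated class names (`s1`, `s2`, ...) are arbitrary.

```python
import re

# Matches a double-quoted inline style attribute and captures its value.
STYLE_ATTR = re.compile(r'\sstyle="([^"]*)"')


def merge_inline_styles(html):
    """Swap each distinct inline style for a class; return (html, css rules)."""
    classes = {}  # style text -> generated class name

    def swap(match):
        style = match.group(1).strip()
        # Reuse the class if this exact style text was seen before.
        name = classes.setdefault(style, f"s{len(classes) + 1}")
        return f' class="{name}"'

    body = STYLE_ATTR.sub(swap, html)
    css = "\n".join(f".{name} {{ {style} }}" for style, name in classes.items())
    return body, css
```

Identical style strings collapse into one rule, so a book where every paragraph repeats the same font information ends up with a handful of classes instead of thousands of inline attributes.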
|
04-06-2010, 03:09 PM | #8 |
Wizard
Posts: 2,409
Karma: 4132096
Join Date: Sep 2008
Device: Kindle Paperwhite/iOS Kindle App
|
Thanks for the offer. I did manage to solve the problem by doing find and replaces, but every line was slightly different so it took a while. I think it's readable now.
|
04-07-2010, 12:45 AM | #9 |
ePub Maker
Posts: 120
Karma: 16
Join Date: Dec 2009
Location: Mordor
Device: iPad,Kindle 3, Nook 2
|
Sorry, I forgot to say it's free. You needn't pay.
If anyone has an occasional need for HTML cleaning and optimizing, please email me. I'd be very pleased to help. My address: service(at)htmlcleaner(dot)com I would like to set up a free online service sometime later, if I can. |