Unable to convert Mobi to Epub - Page 2

DoctorOhh · 07-16-2011, 11:27 AM

Quote:

Originally Posted by Japes

I have the De-DRM plugin installed, so, if there IS DRM, it's automatically removed, so, I don't know for sure if it has DRM or not. Is there a way for me to tell?

If you try to open the original file in calibre's viewer before adding it to calibre, the viewer will tell you.

Quote:

Originally Posted by Japes

And, yes, I can view the Mobi file in Calibre by double clicking it (it takes forever to open but it does open and, from a quick scan, it appears to look fine).

Just so I'm clear: are you able to view the book in calibre's viewer by opening the file from within calibre? Or are you double-clicking the original downloaded file before you add it to calibre?

If you send me a link to your file in a PM I'll be glad to look at it.

Japes · 07-16-2011, 02:22 PM

Kovid, can you please chime in here? I'm at a loss.

siebert · 07-17-2011, 01:48 PM

Hi,

I'm not sure if calibre can convert the result to epub as the unpacked html might contain mobi specific tags, but you can try to unpack the mobi file with my version of mobiunpack found here: https://www.mobileread.com/forums/sho...9&postcount=72

As it can handle dictionaries whose source is hundreds of megabytes big, I'm pretty optimistic that it also works for this mobipocket file.

Ciao,
Steffen

Japes · 07-17-2011, 06:30 PM

Thanks for this. Downloaded it. It extracts to a py file. How on earth do I use it?

siebert · 07-17-2011, 06:34 PM

You need to install python 2.x (not 3.x!) from python.org (I'm using version 2.5.4, but a newer 2.x version might work, too).

Then copy the mobi file into the same directory as the mobiunpack.py script and run this from a shell/cmd window:

python mobiunpack.py file.mobi

Ciao,
Steffen

travger · 07-18-2011, 02:57 AM

Somewhere there is a nice thingy named mobiunpack.pyw; it gives you a friendlier GUI to use. (I only shudder at cmd line).

siebert · 07-18-2011, 04:57 AM

Quote:

Originally Posted by travger

Somewhere there is a nice thingy named mobiunpack.pyw; it gives you a friendlier GUI to use.

I never tried that, but as it is just a GUI calling the actual mobiunpack.py for the unpacking, it should work if you make sure that it uses my mobiunpack.py instead of the bundled one. Otherwise you won't get dictionary support or the speed optimization for huge files.

Quote:

(I only shudder at cmd line).

No pain, no gain

Ciao,
Steffen

KevinH · 07-18-2011, 01:04 PM

Hi Steffen,

Quote:

Originally Posted by siebert

I never tried that, but as it is just a GUI calling the actual mobiunpack.py for the unpacking, it should work if you make sure that it uses my mobiunpack.py instead of the bundled one. Otherwise you won't get dictionary support or the speed optimization for huge files.
Steffen

I just wanted to say very nice job with your new mobiunpack.py version!

I diffed your speedup changes against the original and all looks great except for one thing: why did you remove the imghdr code that detects the proper image type so that each file is created with the proper extension? I, for one, want all of my file extensions to match the actual contents of the file, because not every program ignores the extension when working with files. Are you using fake "image" files to store extra sections (non-html, non-image) from the original mobi file? Perhaps index information from the dictionaries?
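To illustrate what extension detection of this kind involves, here is a minimal sketch that sniffs the image type from the first bytes of a section. It is not the code from either version of mobiunpack.py: the function name sniff_image_extension is hypothetical, and it checks the magic bytes by hand as a stand-in for imghdr (which the thread used under Python 2.x and which was removed from the standard library in Python 3.13).

```python
def sniff_image_extension(data):
    """Guess a file extension from the leading bytes of an image section.

    Hypothetical helper for illustration; returns None for unknown data.
    """
    if data.startswith(b"\xff\xd8\xff"):
        return "jpg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"BM"):
        return "bmp"
    return None  # not a recognised image, e.g. an index section


if __name__ == "__main__":
    print(sniff_image_extension(b"GIF89a" + b"\x00" * 10))
```

Returning None for unrecognised data would also give a natural way to skip non-image sections rather than writing them out as image files.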

Also, it would be nice to gather all of the string concats and file writes into one function that takes the "big-file" flag and the new data and handles both cases, just to make the code look cleaner.

That said, I find it hard to believe that even a 26 meg mobi file fills up memory on today's multi-GB machines. It might simply be that the string concatenation needs to be replaced with adding string pieces to a list and then doing a "".join(list) at the end. This should prevent the creation of multiple copies of the 26 meg long string, which is what must be filling up memory, perhaps because the garbage collection is not aggressive enough to reclaim and reuse it in a timely fashion. At least older versions of Python used to recommend that for heavy string concats.
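The two approaches being compared can be sketched as follows. This is an illustration only, written for Python 3 rather than the 2.x used in the thread; the function names are hypothetical and nothing here is taken from mobiunpack.py.

```python
def build_by_concat(pieces):
    """Repeated concatenation: each += may copy everything built so far."""
    result = ""
    for piece in pieces:
        result += piece
    return result


def build_by_join(pieces):
    """Collect the pieces in a list and copy them once at the end."""
    parts = []
    for piece in pieces:
        parts.append(piece)
    return "".join(parts)


if __name__ == "__main__":
    sample = ["<p>", "entry", "</p>"]
    print(build_by_join(sample) == build_by_concat(sample))
```

Both functions produce the same string; the difference is only in how many intermediate copies may be created along the way.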

Thanks again for all of your work on it.

Take care,

KevinH

siebert · 07-18-2011, 03:15 PM

Quote:

Originally Posted by KevinH

Hi Steffen,
I just wanted to say very nice job with your new mobiunpack.py version!

Thanks

Quote:

I diffed your speedup changes against the original and all looks great except for one thing: why did you remove the imghdr code that detects the proper image type so that each file is created with the proper extension?

It's one of the speed optimizations. My image handling is generic: if the html contains a reference to the image stored in section x, I don't have to look up the file, because the name is just 0000y.jpg, where y is calculated from x.

Quote:

Are you using fake "image" files to store extra sections (non-html, non-image) from the original mobi file? Perhaps index information from the dictionaries?

No, only images are needed. In the mobiunpack version I've published, some non-image sections will be written as image files, but they won't be referenced by the html source, so it doesn't matter.

I have an improved, not yet published version that detects and ignores these non-images, but that is only for cosmetic reasons.

Quote:

Also, it would be nice to gather all of the string concats and file writes into one function that takes the "big-file" flag and the new data and handles both cases, just to make the code look cleaner.

Feel free to provide an improved version

Quote:

That said, I find it hard to believe that even a 26 meg mobi file fills up memory on today's multi-GB machines.

First of all the 20MB dictionary mobi file uncompresses into 100MB html text.

And into this 100MB thousands of strings have to be inserted all over the html text, which means for each insert all the 100MB of data must be copied at least once.

Even the decompression of the compressed texts is much faster if I append each block to a temporary disk file instead of handling everything in memory.
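A rough sketch of the "append each block to a temporary disk file" idea, in Python 3. The function name and its decompress parameter are hypothetical, and zlib is used in the demo purely as a stand-in: it is not claimed to be the compression scheme of mobi files, nor is this the code from mobiunpack.py.

```python
import tempfile
import zlib


def decompress_blocks_to_tempfile(blocks, decompress):
    """Decompress each block and append it to a temporary file on disk.

    Only one decompressed block is held in memory at a time. The caller
    gets back a file object positioned at the start of the data.
    """
    tmp = tempfile.TemporaryFile()
    for block in blocks:
        tmp.write(decompress(block))
    tmp.seek(0)
    return tmp


if __name__ == "__main__":
    # zlib here is only a placeholder for the real per-block decompressor.
    blocks = [zlib.compress(b"first block "), zlib.compress(b"second block")]
    with decompress_blocks_to_tempfile(blocks, zlib.decompress) as f:
        print(f.read())
```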

Quote:

It might simply be that the string concatenation needs to be replaced with adding string pieces to a list and then doing a "".join(list) at the end.

Maybe, I didn't test that, but I doubt it would be as fast. But feel free to do a test...

Ciao,
Steffen

KevinH · 07-18-2011, 03:33 PM

Hi Steffen,

> Feel free to provide an improved version

Thanks, I already helped write the original version you adapted and my interest is in the additional code that converts the old mobi raw html into normal html for archival purposes. So having the extensions on the images is useful. I will add that back in.

> First of all the 20MB dictionary mobi file uncompresses into 100MB html text.

> And into this 100MB thousands of strings have to be inserted all over the html text, which means for each insert all the 100MB of data must be copied at least once.

Or as I said, we could try using lists of string segments and inserting segments into position via list insertion and then doing a join to put it all together.
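One way the segment idea could look, as a sketch only: instead of inserting into the big string thousands of times, sort the insert positions, slice the original text once, and join all the pieces at the end. The function name insert_all and the (position, string) input format are assumptions for illustration, not something taken from the thread's code.

```python
def insert_all(text, inserts):
    """Insert many strings into text with a single final join.

    inserts is an iterable of (position, string) pairs, where each
    position is an offset into the ORIGINAL text.
    """
    parts = []
    last = 0
    # sorted() is stable, so inserts at the same position keep their order.
    for pos, piece in sorted(inserts, key=lambda item: item[0]):
        parts.append(text[last:pos])
        parts.append(piece)
        last = pos
    parts.append(text[last:])
    return "".join(parts)


if __name__ == "__main__":
    print(insert_all("abcdef", [(4, "Y"), (2, "X")]))
```

With this layout the large text is copied once in slices and once in the join, rather than once per insert.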

If that works, then I will rewrite it that way. If not, I will pull all of the pieces that choose between writing to a file and concatenating strings into a separate function, to clean the code up and make it more readable.

> Even the decompression of the compressed texts is much faster if I append each block to a temporary disk file instead of handling everything in memory.

Good point.

> But feel free to do a test...

Will do.

Thanks,

Kevin

siebert · 07-18-2011, 06:18 PM

Quote:

Originally Posted by KevinH

Thanks, I already helped write the original version you adapted and my interest is in the additional code that converts the old mobi raw html into normal html for archival purposes. So having the extensions on the images is useful. I will add that back in.

Ok. I've just published my latest changes so you can start from there. To ease development I've also pushed my git repository to github. If you're used to git, just fork my repository and start coding.

See https://www.mobileread.com/forums/sho...7&postcount=75 for details.

Ciao,
Steffen

KevinH · 07-18-2011, 06:51 PM

Hi Steffen,

Thanks! I will grab it and start from there. Not much of a git user (the Linux kernel started using git long after I stopped contributing to the Linux PPC port). Used just about everything else at one point or another: rcs, cvs, svn, hg, etc. Guess I should at least play around with git.

Paul and I had already set up a google code page for the earlier versions

http://code.google.com/p/ebook-conversion-tools/

but only Paul was adding much to it lately.

Take care,

Kevin


KevinH · 07-19-2011, 01:14 AM

Hi Steffen,

I made all of the code changes and created a FastConcat class that hides all of the hugeFile temp file creation and string-list appending. It is simple to use and it uses the Python tempfile module.

fc = FastConcat(hugeFile)
...
fc.concat(data)
...
fc.getresult()
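The class itself was not posted in the thread. As a guess at its general shape only, here is a minimal Python 3 sketch that matches the three calls shown above. Everything inside it (the attribute names, working on bytes, using tempfile.TemporaryFile) is an assumption, not KevinH's actual implementation.

```python
import tempfile


class FastConcat:
    """Hypothetical reconstruction: accumulate data in a list, or in a
    temporary file on disk when the huge-file flag is set."""

    def __init__(self, huge_file=False):
        self.huge_file = huge_file
        if huge_file:
            self._tmp = tempfile.TemporaryFile()
        else:
            self._parts = []

    def concat(self, data):
        """Append one chunk of bytes."""
        if self.huge_file:
            self._tmp.write(data)
        else:
            self._parts.append(data)

    def getresult(self):
        """Return everything appended so far as one bytes object."""
        if self.huge_file:
            self._tmp.seek(0)
            result = self._tmp.read()
            self._tmp.seek(0, 2)  # back to the end so concat can continue
            return result
        return b"".join(self._parts)


if __name__ == "__main__":
    fc = FastConcat(huge_file=True)
    fc.concat(b"<html>")
    fc.concat(b"</html>")
    print(fc.getresult())
```

Note that getresult() in this sketch still reads the whole result into memory, so the temp file only helps while the pieces are being accumulated.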

That all seemed to work fine. Then I reverted your image file name extension changes and now I can see why you decided to ignore the file extensions on images! ;-)

Your approach allows you to update all image links with one regular-expression substitution, which is much faster than doing one for each image.
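A sketch of what a single-pass substitution could look like in Python 3. The attribute pattern recindex="NNNNN", the images/ directory, and using the same number for the file name are all assumptions made for illustration; the thread only says that the name is calculated from the section number, and the real regular expression in mobiunpack.py may differ.

```python
import re

# Assumed form of an image reference in the raw mobi html.
IMG_REF = re.compile(r'recindex="(\d+)"')


def rewrite_image_links(html, ext="jpg"):
    """Rewrite every image reference in one pass over the text.

    Because the target name is computed from the number alone, no
    per-image lookup or per-image substitution is needed.
    """
    def replacement(match):
        number = int(match.group(1))
        return 'src="images/%05d.%s"' % (number, ext)

    return IMG_REF.sub(replacement, html)


if __name__ == "__main__":
    raw = '<p><img recindex="00003"/> and <img recindex="00012"/></p>'
    print(rewrite_image_links(raw))
```

Doing one substitution per image instead would rescan the whole text once for each of the thousands of images, which is where a fixed extension saves time.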

I had one old dictionary to play/test with and it unfortunately uses the older unsupported inflection rules but it did let me play with things and it used over 9000 gifs and jpegs.

It would indeed take a very long run time to process all of those image links one by one.

So I will have to try something else to speed it up. When I get a workable solution, I will post it.

