Need for a dual mobi metadata tweaker - Page 4

AcidWeb · 05-25-2014, 08:04 AM

Well, @KevinH, there is a small problem.

Your code works correctly, but only on 64-bit Python. On a 32-bit release, when the input file is larger than ~300 MB, I get a MemoryError exception here.

Not even sure why :-S It should not hit the 32-bit memory limits.

KevinH · 05-25-2014, 08:25 PM

Hi,
I am not sure either. Please verify that this problem exists with the original v003 script as well. It builds datalst using append rather than allocating all 3 pieces at once.

If it exists with my original script, please post a link where I can download such a huge comic book and try myself to see if we can delete objects more aggressively to free up memory.

Kevin

AcidWeb · 05-26-2014, 12:45 AM

Well, I ran some additional tests and the results are even more puzzling.

If I extract my Python 3 version of your code and run it standalone, it runs correctly.
If I extract my Python 3 version of your code and run it standalone as a QRunnable thread (like my program does), it runs correctly.
If I run it as a QRunnable thread from my program: MemoryError.
If I run it from my program's main worker QThread: MemoryError.

So apparently this is not directly connected to your code. Either way, debugging it will be a pain :-)

Thank you.

KevinH · 05-26-2014, 11:24 AM

Hi AcidWeb,

Threads are typically allocated with their own max stack allocation. I am not sure whether python objects are allocated on the stack or the heap at run time and if that changes when objects are "returned". Also for "returned" objects does it matter if they are named objects or auto temp allocated/deallocated objects?

So as a workaround, why not spawn a full subprocess (fork) from a thread and just wait for it to finish to collect the output? That should allocate an entire new process (not just a thread) but still allow you to have full concurrency.
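A minimal sketch of that workaround, assuming Python 3.7+ (for the capture_output argument of subprocess.run). The helper name run_in_subprocess and the inline "-c" job are placeholders for illustration only; a real caller would pass the command line of the actual fix script instead. On Windows this starts a fresh interpreter rather than forking, but either way the job gets its own address space.

```python
import subprocess
import sys
import threading


def run_in_subprocess(code, results, key):
    """Run a piece of Python source in a separate interpreter and store its result."""
    # sys.executable is the interpreter running this script; "-c" executes the source text.
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
    )
    results[key] = (proc.returncode, proc.stdout.strip())


# Hypothetical job: a stand-in for "process one big MOBI file".
results = {}
worker = threading.Thread(
    target=run_in_subprocess,
    args=("print(sum(range(10)))", results, "job"),
)
worker.start()
worker.join()  # the calling thread just waits; other threads keep running
print(results["job"])
```

The memory used by the child is returned to the OS when it exits, so the parent's own imports no longer compete with the big buffers.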

My 2 cents ...

Anyway, Glad it is not me tracking it down! ;-)

Take care,

KevinH

AcidWeb · 05-26-2014, 11:38 AM

It is not the fault of Qt. Code run through a generic threading.Thread crashes too.

AcidWeb · 05-26-2014, 12:16 PM

Heh. Threads are not the cause of the problem. It is even stranger.

Look at this snippet. X.mobi is 400 MB.

Quote:

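import os
import sys
import argparse
import configparser
# Uncommenting the next import triggers the MemoryError on 32-bit Python.
#from tkinter import Tk, ttk, filedialog
from threading import Thread
from KindleButler import DualMetaFix

# Raw string so the backslash in the Windows path is not read as an escape.
ready_file = DualMetaFix.DualMobiMetaFix(r"D:\X.mobi", bytes('12345', 'UTF-8'))
sys.exit(0)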

It works. But if I uncomment the tkinter import, it starts to crash with MemoryError.
Importing any larger third-party library makes the program crash (still only on 32-bit Python).
I would say something is wrong with my Python environment, but I reproduced this on two machines.

KevinH · 05-26-2014, 02:25 PM

Hi,

Be careful you are not mixing threading systems. Tk uses TCL which on many platforms has its own threading library. I know on Mac OS X, there was a horrible conflict between tcl threads and true Mac OS X (Mach-kernel) threads, and then normal posix threads. It can also cause problems to spawn main loops in Tk from threads that are not main themselves.

Also, make sure you are using a version of Tk/TCL that is specifically compiled for your version of Python (you seem to be using version 3.X and not 2.7). A good place to get the latest TCL is from ActiveState (free community edition).

One thing to note, be careful tracking down "errors" or "bugs" when running out of memory or memory corruption occurs as it will often give you false positives.

Have fun!

KevinH

AcidWeb · 05-26-2014, 03:03 PM

I highly doubt that it is Tk's fault. If I import Paramiko or Pillow, it also starts crashing with MemoryError.
Also, I no longer use threads in that code at all.

As long as I don't import any larger library, DualMetaFix works correctly.

KevinH · 05-26-2014, 03:14 PM

Hi,

Can you run it in its own process and watch how memory is allocated, to see just how large it gets? Perhaps I have done something stupid and unused memory is not being collected/freed properly.

Wow, this is a tough one.

KevinH

AcidWeb · 05-27-2014, 04:20 AM

Input file = 409 MB

Memory usage (MB) before this line:

Quote:

datalst = [datain[0:secstart], secdata, datain[secend:]]

420.35546875

After that line:
819.8828125

And it crashes one line later, on:

Quote:

datalst = b''.join(datalst)

On 64-bit Python, memory usage after that line is around 1225.

Using append doesn't change memory usage.
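Why the numbers roughly triple: slicing a bytes object copies the slice, and b''.join() then allocates one more full-size buffer while the pieces are still alive. That is consistent with the 420 -> 820 -> ~1225 figures above. The sketch below is only an illustration of the pattern, not code from the script; it uses the standard tracemalloc module on a small buffer, and the function name measure_split_and_join and the sizes are made up for the demo.

```python
import tracemalloc


def measure_split_and_join(size, secstart, secend):
    """Return (bytes traced after slicing, peak bytes traced after join)."""
    datain = bytes(size)                  # stands in for the file contents
    secdata = datain[secstart:secend]     # stands in for the rebuilt header section

    # Start tracing only now, so the original buffer itself is not counted.
    tracemalloc.start()
    datalst = [datain[0:secstart], secdata, datain[secend:]]  # two large copies
    after_slices, _ = tracemalloc.get_traced_memory()
    joined = b''.join(datalst)            # one more full-size allocation
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after_slices, peak


size = 8_000_000
after_slices, peak = measure_split_and_join(size, 1000, 2000)
print("extra after slicing: %.2fx file size" % (after_slices / size))
print("extra at peak:       %.2fx file size" % (peak / size))
```

On top of the original buffer, that is about one extra copy after slicing and about two at the peak, i.e. roughly 3x the file size in total. A 32-bit process also needs each of those buffers as one contiguous block of address space, which is likely why a few extra imports were enough to tip it over.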

EDIT:
Well, after spending another 6 h on tests, I'm now quite sure that it is not a bug. It just uses too much memory.
All these anomalies were caused by the fact that the standalone program was running very close to the memory limit, so success depended on the number of imports (lol!).

Are both headers at the beginning of the file? Why are we loading the entire file?

KevinH · 05-27-2014, 01:54 PM

Hi,

The two headers are not both at the beginning of the file. Typically the mobi6 header comes right after the palm section table. Then there are lots of additional sections that hold all of the text of the file and all of the resources (fonts, images, resc section), then an ncx index, flis, fcis, srcs sections, datp, etc., and then a boundary section. Only then comes the mobi8 header, followed by its own text sections and its indexes, and then a new boundary section containing a CONT section, which is an HD container with lots of HD images.

So to edit both headers you need to split the file at the headers and then recreate the entire file twice.

There really is no other way to deal with this unless you want to use file io to build the new version from smaller chunks and pieces which will be much slower than doing it in memory.
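For reference, a sketch of how the palm section table mentioned above can be walked to find where each section starts and ends. It assumes the usual Palm database layout: a 78-byte header with the big-endian section count at offset 76, followed by 8-byte entries whose first 4 bytes are the section's file offset. The name palm_section_ranges is a hypothetical helper, and the sample bytes are synthetic, not a real MOBI file.

```python
import struct


def palm_section_ranges(data):
    """Return a list of (start, end) byte ranges, one per Palm database section."""
    (num_sections,) = struct.unpack_from(">H", data, 76)
    offsets = []
    for i in range(num_sections):
        # Each table entry is 8 bytes: 4-byte offset, then attributes and unique id.
        (offset,) = struct.unpack_from(">L", data, 78 + 8 * i)
        offsets.append(offset)
    # A section ends where the next one starts; the last one runs to end of file.
    ends = offsets[1:] + [len(data)]
    return list(zip(offsets, ends))


# Synthetic two-section database, just to exercise the parser.
header = bytearray(78)
struct.pack_into(">H", header, 76, 2)
table = struct.pack(">LL", 94, 0) + struct.pack(">LL", 100, 2)
sample = bytes(header) + table + b"AAAAAA" + b"BBBB"
print(palm_section_ranges(sample))
```

Only the ranges of the two header sections are needed to know where the file has to be split.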

I will take a look at it when I get a free moment.

Kevin

KevinH · 05-27-2014, 03:07 PM

Hi,

Looking more closely, we could mmap the file and in that way create the equivalent of a mutable string so we would not have the issues with having multiple copies of the data at the same time.

Using mmap should keep memory usage quite close to the original 400 meg.
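A small sketch of the mmap idea, under the simplifying assumption that the edit does not change the file length. The helper name patch_in_place and the demo file are made up. Writing through the mapping changes the bytes in place, so no second full copy of the data is built in Python. An edit that grows or shrinks a section would additionally have to shift the tail, for example with the mapping's move() and resize() methods, and resize() is not available on every platform.

```python
import mmap
import os
import shutil
import tempfile


def patch_in_place(src_path, dst_path, offset, new_bytes):
    """Copy src to dst, then overwrite len(new_bytes) bytes at offset through mmap."""
    shutil.copyfile(src_path, dst_path)
    with open(dst_path, "r+b") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
        end = offset + len(new_bytes)
        if offset < 0 or end > len(mm):
            raise ValueError("patch does not fit inside the file")
        mm[offset:end] = new_bytes   # same-length slice assignment, no reallocation
        mm.flush()


# Demo on a tiny throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.bin")
    dst = os.path.join(tmp, "out.bin")
    with open(src, "wb") as f:
        f.write(b"0123456789")
    patch_in_place(src, dst, 3, b"ABC")
    with open(dst, "rb") as f:
        patched = f.read()
print(patched)
```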

Alternatively, we can use direct access file io operations seek and read and write to build the output file on the fly reading it in in small chunks and writing it out as we go.

Either approach would eliminate the need to deal with the memory allocation and deallocation of python's immutable strings.

Do a google search on python and mmap
or on python and random/direct access files using seek

I personally think that using mmap would be fastest and easiest with 1X type memory usage (ie. kept around 400 meg for this file) but that using fileio approaches would have the smallest memory footprint but would be slower.
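For comparison, a sketch of the file io variant: copy everything before the section in small chunks, write the new section, then copy everything after it. The names copy_range and rewrite_section and the 1 MB chunk size are assumptions for illustration. A real tool would also have to fix up the section offsets in the palm section table whenever the new section has a different length, which is left out here.

```python
import os
import tempfile

CHUNK = 1024 * 1024  # assumed chunk size; memory use stays around this value


def copy_range(src, dst, start, end, chunk_size=CHUNK):
    """Copy bytes [start, end) from src to dst without loading them all at once."""
    src.seek(start)
    remaining = end - start
    while remaining > 0:
        block = src.read(min(chunk_size, remaining))
        if not block:
            raise EOFError("source ended before the requested range")
        dst.write(block)
        remaining -= len(block)


def rewrite_section(src_path, dst_path, secstart, secend, secdata, chunk_size=CHUNK):
    """Write a copy of src_path with bytes [secstart, secend) replaced by secdata."""
    size = os.path.getsize(src_path)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        copy_range(src, dst, 0, secstart, chunk_size)
        dst.write(secdata)
        copy_range(src, dst, secend, size, chunk_size)


# Demo on a tiny throwaway file, with a 2-byte chunk to exercise the loop.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "in.bin")
    dst_path = os.path.join(tmp, "out.bin")
    with open(src_path, "wb") as f:
        f.write(b"HEAD" + b"old" + b"TAIL-TAIL")
    rewrite_section(src_path, dst_path, 4, 7, b"newer", chunk_size=2)
    with open(dst_path, "rb") as f:
        rebuilt = f.read()
print(rebuilt)
```

Unlike the mmap route, this handles a section that changes length without any special casing, at the cost of extra reads and writes.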

Let me know what you think.

KevinH

KevinH · 05-27-2014, 04:08 PM

Hi AcidWeb,

Attached is a quick and dirty revision of my original dualmetafix.py to use mmap. I called it dualmetafix_mmap.py. It passes my check with a small sample file.

But please check its memory usage against your 400 meg file and see how bad it gets. It should stay just a few meg over the file size. If not, we should probably move to complete file io operations with seek and read/write in chunks.

Please let me know what you see.

KevinH

AcidWeb · 05-27-2014, 04:28 PM

You work fast :-)

I will check it out tomorrow.

EDIT:
It is working great. 32-bit Python can now process even 650 MB (the KindleGen limit) MOBI files.
Memory usage matches your estimate. Thank you very much. Really impressive work.

Doitsu · 06-10-2014, 02:45 PM

@KevinH: Thanks for creating this very useful tool!

I might have found either a Mobi Meta Editor (MME) bug or a bug with your script. The script works great with files converted straight from the source with KindleGen, but somehow it doesn't seem to like Mobi files whose metadata section was regenerated by Mobi Meta Editor and/or already contains an ASIN or EBOK value. In these cases it fails with the following error message:

Code:

Error: add_exth: trimmed non-null bytes at end of section

Steps to reproduce this error:

1. Generate a .mobi file (KindleGen -dont_append_source Parrot.epub)
2. Open the generated .mobi file with MME, add an EXTH 113 ASIN value and save the new file.
3. Process the new file with your script.

Please find attached the original file (Parrot_orig.mobi) and the file processed by MME (Parrot_MME.mobi).
