Hi, Frabjous!
I would suggest holding off until I put up the next version. The haphazard unicode errors are basically gone as of the development version I am currently working on.
The filesize thing is weird... the 600 KB file certainly did not cause a memory issue, but whatever the issue was got misreported as such.
I have successfully processed 700+ MB (nearly 1 GB) files with pacify.py before... and once I implement spooling (which I actually think I will do sooner rather than later after all), file size will be a non-issue so long as you have both sufficient memory and disk space.
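For anyone wondering what spooling means here, below is a rough sketch of the general idea using Python's standard `tempfile.SpooledTemporaryFile`. It is only an illustration under my own assumptions, not the actual or planned pacify.py code; the names `spool_chunks` and `max_in_memory` are hypothetical.

```python
import tempfile


def spool_chunks(chunks, max_in_memory=1024 * 1024):
    """Write byte chunks into a spooled buffer and return it, rewound.

    The buffer lives in memory until it grows past max_in_memory bytes,
    at which point the standard library moves it to a temporary file on disk.
    Both names here are hypothetical, chosen for illustration only.
    """
    buf = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    for chunk in chunks:
        buf.write(chunk)
    buf.seek(0)  # rewind so the caller can read from the start
    return buf


if __name__ == "__main__":
    # Small threshold so the data spills to disk almost immediately.
    with spool_chunks([b"hello, ", b"world"], max_in_memory=4) as spooled:
        print(spooled.read())
```

The point is that the caller reads and writes the same way whether the data ended up in RAM or on disk, so large inputs stop depending on memory alone.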
And yes, the HTML parsing needs to take tables and such properly into account... along with a few other things.
I'll keep everyone updated via this thread...
- Ahi