I can confirm the LOC data corresponds to 150-byte chunks, not 128 bytes as I previously thought. I've also managed to decrypt the book and convert it to raw HTML. But that leaves me with the pesky problem of cleaning the text up.
There's a lot of damaged markup in each of these chunks. Any suggestions on how to deal with this? Or perhaps there's a tool that would automatically scrape the appropriate text, given byte offsets?
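To make the byte-offset part concrete, here's roughly what I'm doing on my end. This is just a minimal sketch: loc_to_bytes, CHUNK_SIZE, and book.html are my own names and placeholders, and I'm assuming each LOC value maps straight onto the decrypted HTML bytes.

```python
# Sketch of the LOC -> byte-offset mapping (my own names; nothing official)
CHUNK_SIZE = 150  # each LOC seems to cover 150 bytes, not 128

def loc_to_bytes(loc):
    """Return the (start, end) byte offsets covered by a given LOC value."""
    start = loc * CHUNK_SIZE
    return start, start + CHUNK_SIZE

with open("book.html", "rb") as f:   # the decrypted raw HTML (placeholder filename)
    raw = f.read()

start, end = loc_to_bytes(42)        # e.g. LOC 42
chunk = raw[start:end]               # 150 bytes, with tags likely truncated at both ends
```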
Edit: BeautifulSoup saves the day!! Imprecision aside, I've got everything working and I think I might post this on the internet to help other people out.
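For anyone who lands here later, the core of the cleanup is just letting BeautifulSoup parse each raw chunk and pulling the text back out. A minimal sketch, assuming the chunk slicing above; html.parser is forgiving about the truncated tags at the chunk edges:

```python
from bs4 import BeautifulSoup

def chunk_to_text(chunk_bytes):
    """Strip the (possibly damaged) markup from a raw HTML chunk and return plain text."""
    # html.parser copes with unclosed or truncated tags at the chunk boundaries
    soup = BeautifulSoup(chunk_bytes, "html.parser")
    return soup.get_text(separator=" ", strip=True)

# A chunk boundary can fall mid-tag or mid-word; the parser just drops the broken bits
print(chunk_to_text(b"<p>Call me Ish</p><p>mael. Some yea"))
```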