After finding out what a State Machine is, I found the idea kind of interesting. However, I don't see how it is practical. I'm not a programmer, so perhaps I'm imagining more complexity than is necessary. It seems to me, though, that this is a fairly extreme amount of work.
There are dozens of issues, including the language used, the dictionary used (and its definition of each word), listings of first and last names, dealing with non-standard names (sci-fi and fantasy, foreign names), the type of book (fiction, textbook, recipes, phone book, etc.), images and their captions... the list is potentially endless.
I'm not trying to completely destroy your efforts. I just think that, for as much thought as you've given this, it's all theoretical and makes a few too many assumptions about the input and output. This is not necessarily a bad thing. You've got to start somewhere.
By all means, work to prove me wrong!