If you put the file from archive.com through ABBYY FineReader 15, you can choose to save it as HTML-formatted text, without headers/footers, without pictures, and without line breaks or hyphens.
The result for the first 100 pages looks like this.
Simply put, to convert a file to another format, the source file should itself have been generated by an algorithm. Otherwise there are too many irregularities. With algorithmic output, everything sits within brackets and is easy to recognise.
Now every entry is encapsulated in the following p-tags.
A new entry starts with a bold word, followed either by [a word between brackets], or by 'ou', another word, and then [a word between brackets].
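As a rough sketch of that rule, something like the following could test whether a paragraph opens a new entry. It assumes bold text is exported as `<b>...</b>`, which may not match what FineReader actually writes (it could just as well use styled spans), and the function name and the sample words are made up for illustration.

```python
import re

# Assumption: bold is marked up as <b>...</b>; adjust to the real export.
ENTRY_START = re.compile(
    r"^\s*<b>(?P<head>[^<]+)</b>\s*"                       # bold headword
    r"(?:ou\s+(?:<b>)?(?P<alt>[^<\[\]]+?)(?:</b>)?\s*)?"   # optional 'ou' + second word
    r"\[(?P<bracket>[^\]]+)\]"                             # word between brackets
)

def parse_entry_start(paragraph_html):
    """Return (headword, variant, bracketed) if the paragraph opens a new entry, else None.

    paragraph_html is the inner content of one <p> element (hypothetical input shape).
    """
    m = ENTRY_START.match(paragraph_html)
    if m is None:
        return None
    alt = m.group("alt")
    return (
        m.group("head").strip(),
        alt.strip() if alt else None,
        m.group("bracket").strip(),
    )

# Invented sample paragraphs, not taken from the actual file.
print(parse_entry_start("<b>abricot</b> [abriko] n. m."))
print(parse_entry_start("<b>clé</b> ou <b>clef</b> [kle] n. f."))
print(parse_entry_start("continuation of the previous entry [x]"))
```

A paragraph that does not start with bold text returns None, so it can be appended to the previous entry.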
You can filter out the very large capitals that mark the start of a new letter, because they use a much larger font than the rest of the text: in this case 61pt.
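A minimal sketch of that filter, assuming the font size appears as an inline `font-size:61pt` style on a `span` or `p` element (the real export may put it in a stylesheet class instead, in which case this would need adapting). The function name is hypothetical, and the regex does not handle nested elements of the same tag.

```python
import re

def strip_large_capitals(html, size_pt=61):
    """Remove span/p elements whose inline style sets font-size to size_pt points."""
    pattern = re.compile(
        r"<(?P<tag>span|p)\b[^>]*font-size:\s*%dpt[^>]*>.*?</(?P=tag)>" % size_pt,
        re.IGNORECASE | re.DOTALL,
    )
    return pattern.sub("", html)

# Invented sample, not taken from the actual file.
sample = '<p><span style="font-size:61pt;">B</span></p><p><b>bac</b> [bak]</p>'
print(strip_large_capitals(sample))
```

This leaves an empty `<p></p>` behind where the capital was, which is easy to skip in a later pass.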
I'll have a go at it in the coming week, to see how far these simple rules get us.