Old 10-02-2022, 04:08 PM   #136
Markismus
If you put the file from archive.com through ABBYY FineReader 15, you can choose to save it as HTML-formatted text, without headers/footers, without pictures, and without line breaks or hyphens.
The result of the first 100 pages looks like this.

Simply put, to convert a file to another format, the source file should itself have been generated by an algorithm. Otherwise there are too many irregularities. With algorithmic output, everything is enclosed in brackets and easy to recognise.

Now every entry is encapsulated in p-tags.
A new entry starts with a bold word, followed either by [a word between brackets] or by 'ou', another word, and [a word between brackets].
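
As a rough illustration, the rule above could be sketched in Python like this. It assumes the bold headword is marked with a plain `<b>` tag and that the function is handed the inner content of one p-tag; the real FineReader output may mark bold differently (for example with styled spans), so the pattern would need adjusting. The name `parse_entry` and the sample strings are made up for the example.

```python
import re

# Assumed markup: bold headword in <b>...</b>, optionally followed by
# 'ou' and a variant word (bold or not), then a word in square brackets.
ENTRY_RE = re.compile(
    r"^\s*<b>(?P<head>[^<]+)</b>\s*"
    r"(?:ou\s+(?:<b>)?(?P<alt>[^\[\]<]+?)(?:</b>)?\s*)?"
    r"\[(?P<bracket>[^\]]+)\]"
)

def parse_entry(paragraph):
    """Return the parts of a new entry, or None if the paragraph
    does not start like a new entry."""
    m = ENTRY_RE.match(paragraph)
    if m is None:
        return None
    alt = m.group("alt")
    return {
        "headword": m.group("head").strip(),
        "variant": alt.strip() if alt else None,
        "bracketed": m.group("bracket").strip(),
    }

if __name__ == "__main__":
    # Hypothetical sample paragraphs, not taken from the actual file.
    print(parse_entry("<b>abaisser</b> [abese] v. tr."))
    print(parse_entry("<b>clé</b> ou <b>clef</b> [kle] n. f."))
    print(parse_entry("Continuation text of a previous entry."))
```

Paragraphs for which this returns None would be treated as continuations of the previous entry.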

You can filter out the very large capitals that mark the start of a new letter, because they are set in a distinctively large font size, in this case 61pt.
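
A minimal sketch of that filter, assuming the font size appears as an inline `font-size:61pt` style somewhere inside the p-tag of the capital. If the export puts sizes in a stylesheet class instead, the lookup would have to go through the class name. The function name `drop_large_capitals` is only illustrative.

```python
import re

# Matches one non-nested <p>...</p> block at a time.
P_RE = re.compile(r"<p\b[^>]*>.*?</p>", re.DOTALL | re.IGNORECASE)

def drop_large_capitals(html, size_pt=61):
    """Remove every p-block that contains the given font size.

    Assumption: the size is written inline as 'font-size:61pt'.
    """
    size_re = re.compile(
        rf"font-size\s*:\s*{size_pt}(?:\.0+)?pt", re.IGNORECASE
    )

    def keep_or_drop(match):
        block = match.group(0)
        # Drop the whole paragraph when it carries the large font size.
        return "" if size_re.search(block) else block

    return P_RE.sub(keep_or_drop, html)

if __name__ == "__main__":
    # Hypothetical sample input, not taken from the actual file.
    sample = (
        '<p><span style="font-size:61pt;">A</span></p>'
        "<p><b>abaisser</b> [abese] v. tr.</p>"
    )
    print(drop_large_capitals(sample))
```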

I'll have a go at it in the coming week, to see how far these simple rules get me.

Last edited by Markismus; 10-02-2022 at 04:32 PM.