Old 05-29-2009, 11:31 AM   #3
ahi
Quote:
Originally Posted by Sabardeyn View Post
After finding out what a State Machine is, I found the idea kind of interesting. However I don't see how it is practical. I'm not a programmer so perhaps I'm imagining more complexity than is necessary. It seems to me though that this is a fairly extreme amount of work.
Actually yesterday, in a stolen half hour while my two year old daughter watched Elmo's World and Play With Me Sesame episodes, I threw together an admittedly messy Python script that already gets the word array and the sentence array mostly right.

Basically after going through a piece of text, I would be left with a sequential array of all the words (sans punctuation and spacing) and a sequential array of all sentences.

The part that I seem to have neglected to think of is this: if the production of the output is to take place after the generation of the various arrays, all the pieces of data (arrays and all) need to be cross-referenced.

In other words, for any given character in the entire text, I should be able to readily examine what word, sentence, or paragraph that character is a part of.
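As a rough sketch of that cross-referencing idea, the arrays can be built alongside a per-character map, so any character index can be resolved to its word, sentence, and paragraph. (The tokenizing rules below are deliberately naive stand-ins, not the actual script described above.)

```python
import re

def index_text(text):
    """Build a word array and a sentence array, plus a per-character map
    telling which word, sentence, and paragraph each character belongs to.
    Boundary rules here are crude placeholders."""
    words, sentences = [], []
    # char_map[i] = [word_idx, sentence_idx, paragraph_idx] (None if N/A)
    char_map = [[None, None, None] for _ in text]

    # Words, sans punctuation and spacing.
    for m in re.finditer(r"\w+", text):
        for i in range(m.start(), m.end()):
            char_map[i][0] = len(words)
        words.append(m.group())

    # Very crude sentence split on ., !, ?
    for m in re.finditer(r"[^.!?]+[.!?]*", text):
        for i in range(m.start(), m.end()):
            char_map[i][1] = len(sentences)
        sentences.append(m.group().strip())

    # Paragraphs delimited by blank lines.
    para = 0
    for i in range(len(text)):
        char_map[i][2] = para
        if text[i:i + 2] == "\n\n":
            para += 1
    return words, sentences, char_map

words, sentences, char_map = index_text("One two. Three four!\n\nFive.")
```

With this in hand, `char_map[i]` answers the high-level question directly: character 22 (`F` of "Five") belongs to word 4, sentence 2, paragraph 1.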

You are right in saying that this is complex. Certainly it is more complex than most approaches I have seen to text reformatting--but its benefit is that it gives the programmer ways of asking reasonably high-level questions instead of being stuck with an ant's-eye view of the text.

Before my daddy-time ran out, I tried just a very simple experiment. After processing my sample file, I listed all sentences (by the broad definition, not the stricter colloquial one) that began with an alphanumeric character and ended either with an alphanumeric character or with a colon ( : ).

Doing this gave me a list composed mostly of the chapter titles, plus a few false positives. The false positives, though, were few and easily distinguished: most were far longer than the chapter titles, and the chapter titles shared commonalities that the false positives lacked.

With a bit more logic, it might be possible to figure out chapter titles even without resorting to dumb keyword matching.
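That experiment can be sketched in a few lines. The length cutoff below is my own assumption standing in for the "far longer than chapter titles" observation above; the exact threshold is not from the post.

```python
def candidate_headings(sentences, max_len=60):
    """Flag 'sentences' (broad definition) that begin with an alphanumeric
    character and end with an alphanumeric or a colon -- the heading
    experiment described above. Overly long candidates are dropped as
    likely false positives; max_len is an assumed threshold."""
    out = []
    for s in sentences:
        s = s.strip()
        if s and s[0].isalnum() and (s[-1].isalnum() or s[-1] == ":") and len(s) <= max_len:
            out.append(s)
    return out

print(candidate_headings(["CHAPTER ONE", "He said hello.", "Part II:"]))
# -> ['CHAPTER ONE', 'Part II:']
```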


Quote:
Originally Posted by Sabardeyn View Post
There are dozens of issues, including the language used, the dictionary (and definition of each word) used, first and last name listings, dealing with non-standard names (sci-fi and fantasy, foreign names), type of book (fiction, textbook, recipes, phone book, etc.), images and their captions... the list is potentially endless.
The language used is an issue for anything involving a wordlist. My ideas on that front run more along the lines of: words that appear only once in a given text (and not at all in the global wordlist), and words containing non-standard characters, are suspected of being errors--and should be reported in some way to the human user, perhaps for review and possible fixing in a second pass on the file.
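A minimal version of that suspect-word report might look like this. The `global_wordlist` parameter and the "non-standard character" test (anything outside lowercase letters, apostrophes, and hyphens) are stand-ins I chose for illustration, not details from the post.

```python
import re
from collections import Counter

def suspect_words(text, global_wordlist=frozenset()):
    """Flag words that appear only once in the text (and not at all in a
    known wordlist), or that contain non-standard characters, as possible
    errors to report to a human reviewer for a second pass."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    counts = Counter(t for t in tokens if t)
    return {
        w for w, n in counts.items()
        if (n == 1 and w not in global_wordlist)  # hapax not in the wordlist
        or re.search(r"[^a-z'-]", w)              # non-standard characters
    }
```

For example, `suspect_words("the cat sat on the mat xqzzy c@t", {"the", "cat", "sat", "on", "mat"})` reports `xqzzy` (a one-off unknown word) and `c@t` (a non-standard character), while the dictionary words pass.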

Knowing the meanings of words is not necessary. Any tricks that require word matching would obviously not carry across languages, but neither would they be difficult to adapt for another language. The program I am working on will use profiles to treat different languages differently. Having said that, I think we both might be surprised by how much can actually be achieved before ever resorting to keyword matching.

But, of course, you are right that this is never going to be universal. If I can turn this into something that does an admirable job on the majority of simple-to-complex novels (dealing with epigrams, footnotes, and more than one level of structural division), I will be happy.

Quote:
Originally Posted by Sabardeyn View Post
I'm not trying to completely destroy your efforts. I just think as much thought as you've given this, it's all theoretical and contains a few too many assumptions on the input and output. This is not necessarily a bad thing. You've got to start somewhere.

By all means, work to prove me wrong!
My primary assumption about the input is that human-formatted text is interpretable by a program in which the programmer can pose high-level, human questions about the text--questions that ought to lead to judgments similar to those our subconscious evaluation of the text produces.

If I can recognize a series of indented lines--mostly ending in non-sentence-final punctuation, and closed by a dash-fronted (likely also non-sentence) attribution line--as a poem, then a program ought to be able to do so too, provided that upon encountering an indented or unorthodoxly short or oddly terminated line, it can look ahead to see what follows.
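That look-ahead check can be sketched as a predicate over a run of lines. The 70% thresholds below are assumptions of mine, not figures from the post; a real implementation would tune them per language profile.

```python
def looks_like_poem(lines):
    """Check whether a run of lines matches the pattern described above:
    mostly indented lines that do not end in sentence-final punctuation,
    optionally closed by a dash-fronted attribution line.
    The 0.7 thresholds are assumed, illustrative values."""
    if len(lines) < 2:
        return False
    # Treat a trailing dash-fronted line as an attribution, not body text.
    body = lines[:-1] if lines[-1].lstrip().startswith("-") else lines
    indented = sum(1 for ln in body if ln.startswith((" ", "\t")))
    unterminated = sum(
        1 for ln in body if ln.rstrip() and ln.rstrip()[-1] not in ".!?"
    )
    return indented >= len(body) * 0.7 and unterminated >= len(body) * 0.7

verse = ["  Roses are red,", "  violets are blue,", "  sugar is sweet", "- Anonymous"]
print(looks_like_poem(verse))  # -> True
```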

The eventual totality of rules is certain to be more complicated than one can readily wrap one's mind around--but with some flow-charting and sensibly layered code, I think it is an achievable goal.

I appreciate your comments, Sabardeyn! And will let you know how/whether I manage to make progress.

- Ahi


P.S.: In brief, my approach is still primarily about matching the form of the text (as opposed to trying to discern or divine its semantic significance), but in more sophisticated ways than usual. The biggest thing, in my mind, is the ability to look up previous, next, and nearby words, sentences, and paragraphs, evaluate them, and then, based on the findings, potentially reevaluate the current chunk of text being examined.

Last edited by ahi; 05-29-2009 at 11:47 AM. Reason: added PS