Text Processing: Some Ideas

ahi · 05-28-2009, 11:47 AM

The notion of text processing comes up in this subforum fairly often, and while regexes are a godsend, they can trip over the idiosyncrasies of poorly formatted files.

I am making this thread to share some ideas, as it seems to me there are at least a handful of scripters/coders on these forums who have their own text processors, but are not 100% satisfied with how well they handle messy real-world files.

Let it be noted that I have not yet implemented these ideas, so I do not mean to suggest they are proven to work--but they are the ideas I will soon be using to put together my own script.

Thoughts, suggestions, and counter-arguments as to why this or that idea is problematic are most welcome! And hopefully something below will give at least one person a bright idea or two as to how to improve their reformatting scripts.

__________________________________________________

Processing via a State Machine

Have the processing happen one character at a time, with the "receipt" of each character, building...

- a growing string of all parsed characters that were not eliminated (eliminated characters being, e.g., duplicate spaces)
- a growing array of all words (i.e.: one or more alphanumeric characters demarcated by whitespace or punctuation [though some words, broadly speaking, contain punctuation: "3.14", "it's", "O'Brien", "window-cleaner"])
- a growing array of all sentences (i.e.: one or more words, demarcated by whitespace or punctuation)
- a growing array of all lines (i.e.: string of characters demarcated by linebreaks [i.e.: actual lines of input file])
- a growing array of all paragraphs (i.e.: one or more sentences demarcated by a longer run of linebreaks than the most common one)

This may seem excessive, but this method of processing would allow the program to make decisions based not just on what the current character, word, or sentence is, but also on more complicated bases like "What was the previous word?" or "What word did the current paragraph start with?" or "What character did the last sentence of the previous paragraph end with?"
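As a rough illustration of the idea above, here is a minimal Python sketch of a character-at-a-time scanner that grows all five structures. The function name `scan`, the constants, and the simplifications are assumptions made for the example, not a tested design: a sentence is taken to end at ".", "!" or "?" followed by whitespace, and a paragraph at a blank line (rather than at "more linebreaks than is most common").

```python
SENTENCE_END = (".", "!", "?")
INTRA_WORD = "'-."   # punctuation tolerated inside a word: "it's", "3.14"


def scan(text):
    """Feed text one character at a time and grow the arrays as we go."""
    kept = []                                   # non-eliminated characters
    words, sentences, lines, paragraphs = [], [], [], []
    word = sentence = line = paragraph = ""
    newlines = 0                                # linebreaks in the current whitespace run

    def close_word():
        nonlocal word
        trimmed = word.rstrip(INTRA_WORD)       # "today." -> "today"
        if trimmed:
            words.append(trimmed)
        word = ""

    def close_sentence():
        nonlocal sentence
        if sentence.strip():
            sentences.append(sentence.strip())
        sentence = ""

    def close_paragraph():
        nonlocal paragraph
        if paragraph.strip():
            paragraphs.append(paragraph.strip())
        paragraph = ""

    for ch in text:
        if ch.isspace():
            close_word()
            if sentence.endswith(SENTENCE_END):
                close_sentence()
            if ch == "\n":
                lines.append(line)
                line = ""
                newlines += 1
                if newlines == 2:               # blank line: paragraph boundary
                    close_sentence()
                    close_paragraph()
            else:
                line += ch
            # Merge any run of whitespace into a single space.
            if kept and kept[-1] != " ":
                kept.append(" ")
            if sentence and not sentence.endswith(" "):
                sentence += " "
            if paragraph and not paragraph.endswith(" "):
                paragraph += " "
            continue

        newlines = 0
        kept.append(ch)
        line += ch
        sentence += ch
        paragraph += ch
        if ch.isalnum() or (ch in INTRA_WORD and word):
            word += ch
        else:
            close_word()

    # Flush whatever is still open at end of input.
    if line:
        lines.append(line)
    close_word()
    close_sentence()
    close_paragraph()
    return {"kept": "".join(kept), "words": words, "sentences": sentences,
            "lines": lines, "paragraphs": paragraphs}
```

With the arrays in hand, a question like "What was the previous word?" becomes a plain index lookup such as `result["words"][-2]`. The word rule is deliberately crude (for instance, it would glue "work--but" into one word), so treat it as a starting point only.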

And, it seems to me, it's precisely the answers to these sorts of questions that would allow a program/script to make more intelligent decisions about what to do with a given portion of text. A simple--but not too simple--example that comes to my mind:

If the current paragraph is in all capitals, and the previous paragraph's last sentence ended with a colon ( : ) character (particularly if that same sentence also contains one or more words like "says", "reads", "written") then this current paragraph should be marked/typeset as a "notice".
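The "notice" rule could be sketched in Python roughly as follows. The function names, the 0/1/2 scoring, and the crude sentence splitter are assumptions for illustration; only the cue words come from the example above.

```python
import re

# Cue words taken from the example; a real script would likely keep
# these in a per-language profile.
CUE_WORDS = {"says", "reads", "written"}


def last_sentence(paragraph):
    """Crude splitter: break after . ! ? or : when followed by whitespace."""
    parts = re.split(r"(?<=[.!?:])\s+", paragraph.strip())
    return parts[-1]


def is_all_caps(paragraph):
    """True if the paragraph has letters and every letter is upper case."""
    letters = [c for c in paragraph if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)


def notice_score(previous, current):
    """0 = not a notice, 1 = probably a notice, 2 = probably, with a cue word."""
    if not is_all_caps(current):
        return 0
    lead_in = last_sentence(previous)
    if not lead_in.endswith(":"):
        return 0
    lead_words = set(re.findall(r"[a-z']+", lead_in.lower()))
    return 2 if lead_words & CUE_WORDS else 1
```

Returning a score rather than a yes/no leaves room for other rules (poem, letter, epigram) to compete for the same paragraph later on.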

In fact, going a step further, if the arrays are built prior to any decision making beginning, the program should be able to look both forward and backwards, allowing it to do things like recognize epigrams at the beginnings of chapters as such and differentiate notices from poems and letters. (i.e.: to simplify things, letters generally end with a sign-off which could be detected, and poems usually have detectably different from normal punctuation/line-break patterns)

Detecting the occasional needlessly line-broken paragraphs would also be possible... if the majority of the document contains paragraphs all on a single line (i.e.: the overwhelming majority of lines always begin with a capital letter and end with punctuation) and you come upon two paragraphs where the first one fails to end with sentence-closing punctuation and the second one starts with a non-capital letter, you have an erroneously line-broken paragraph.
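A minimal Python sketch of that check, assuming the text has already been split into a list of paragraph strings. The 0.75 threshold standing in for "the overwhelming majority" is an arbitrary placeholder, as are the function names and the set of closing quote characters.

```python
CLOSING_QUOTES = "\"')\u201d\u2019"


def ends_closed(paragraph):
    """True if the paragraph ends with sentence-closing punctuation."""
    core = paragraph.rstrip().rstrip(CLOSING_QUOTES)
    return core.endswith((".", "!", "?"))


def starts_lower(paragraph):
    stripped = paragraph.lstrip()
    return bool(stripped) and stripped[0].islower()


def mostly_well_formed(paragraphs, threshold=0.75):
    """Do most paragraphs start with a capital and end with closing punctuation?"""
    if not paragraphs:
        return False
    good = sum(1 for p in paragraphs
               if p.strip()[:1].isupper() and ends_closed(p))
    return good / len(paragraphs) >= threshold


def merge_broken(paragraphs, threshold=0.75):
    """Rejoin paragraph pairs that look like one erroneously broken paragraph."""
    if not mostly_well_formed(paragraphs, threshold):
        return list(paragraphs)          # document style unclear: leave untouched
    merged = []
    for p in paragraphs:
        if merged and not ends_closed(merged[-1]) and starts_lower(p):
            merged[-1] = merged[-1].rstrip() + " " + p.lstrip()
        else:
            merged.append(p)
    return merged
```

The document-wide check comes first on purpose: in a file where every paragraph is hard-wrapped, the same local pattern is normal and should not trigger a merge.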

Whitespace treatment

Multiple whitespaces should be merged and weighed (for the purpose of processing--not in the output). A single space is whitespace with weight 1. A single linebreak is whitespace with weight 1000. Two spaces and one linebreak weigh 1002--7 spaces and 3 linebreaks weigh 3007.
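The weighting scheme is simple enough to write down directly. In this Python sketch the function names are made up for the example, and two details are assumptions: every non-linebreak whitespace character (tabs included) counts as a space, and "\r\n" is treated as a single linebreak.

```python
import re

SPACE_WEIGHT = 1
LINEBREAK_WEIGHT = 1000


def whitespace_weight(run):
    """Weight of one merged run of whitespace characters."""
    run = run.replace("\r\n", "\n")
    breaks = run.count("\n")
    spaces = len(run) - breaks           # everything that is not a linebreak
    return breaks * LINEBREAK_WEIGHT + spaces * SPACE_WEIGHT


def weighted_runs(text):
    """Weights of all whitespace runs in the text, in document order."""
    return [whitespace_weight(m.group()) for m in re.finditer(r"\s+", text)]
```

One consequence of the 1000:1 ratio is that a run needs a thousand spaces before it can be mistaken for a linebreak, so the two components can always be read back off the weight in practice.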

Averaging whitespace weights over the document might then give a reasonably reliable idea as to whether paragraphs each reside on their own line, or if they are manually line-broken into multiple lines. i.e.: A document whose average whitespace weight tends toward 1000 has manually line-broken paragraphs. A document whose average whitespace weight tends toward 2000+ has paragraphs on their own lines, unless of course lines fail to begin with a capital letter and end with sentence-closing punctuation (in which case the greater whitespace weight is indicative of double-spacing and manual line-breaks).

EDIT: Actually, maybe taking the median as opposed to the mean would be how to do it.
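To see why the median might behave better, here is a small Python sketch comparing the two. It assumes, as a choice made for this example only, that just the whitespace runs containing at least one linebreak are counted, since the single spaces between words would otherwise swamp either statistic. The function names are hypothetical.

```python
import re
from statistics import mean, median


def linebreak_weights(text):
    """Weights (1000 per linebreak, 1 per other whitespace char) of runs with a linebreak."""
    weights = []
    for m in re.finditer(r"\s+", text):
        run = m.group()
        breaks = run.count("\n")
        if breaks:
            weights.append(breaks * 1000 + (len(run) - breaks))
    return weights


def typical_gap(text):
    """Mean and median linebreak-run weight, or None if there are no linebreaks."""
    weights = linebreak_weights(text)
    if not weights:
        return None
    return {"mean": mean(weights), "median": median(weights)}
```

For a text with four single linebreaks and one ten-linebreak gap (say, before a chapter heading), the mean comes out at 2800 while the median stays at 1000, so a handful of large gaps cannot drag the median toward the wrong conclusion.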

__________________________________________________

I have some other ideas, relating to quotation marks, the use of wordlists (a combination of global and real-time built local ones) to detect likely misspellings, HTML tags, et al. If the above concepts generate any interest, I might take the time to write some ruminations about those as well. If not, mea culpa!

- Ahi

Sabardeyn · 05-29-2009, 05:03 AM

After finding out what a State Machine is, I found the idea kind of interesting. However I don't see how it is practical. I'm not a programmer so perhaps I'm imagining more complexity than is necessary. It seems to me though that this is a fairly extreme amount of work.

There are dozens of issues including language used, dictionary (and definition of each word) used, including first and last name listings, dealing with non-standard names (sci-fi and fantasy, foreign names), type of book (fiction, textbook, recipes, phone book, etc), images & their captions... the list is potentially endless.

I'm not trying to completely destroy your efforts. I just think as much thought as you've given this, it's all theoretical and contains a few too many assumptions on the input and output. This is not necessarily a bad thing. You've got to start somewhere.

By all means, work to prove me wrong!

ahi · 05-29-2009, 11:31 AM

Quote:

Originally Posted by Sabardeyn

After finding out what a State Machine is, I found the idea kind of interesting. However I don't see how it is practical. I'm not a programmer so perhaps I'm imagining more complexity than is necessary. It seems to me though that this is a fairly extreme amount of work.

Actually yesterday, in a stolen half hour while my two year old daughter watched Elmo's World and Play With Me Sesame episodes, I threw together an admittedly messy Python script that already gets the word array and the sentence array mostly right.

Basically after going through a piece of text, I would be left with a sequential array of all the words (sans punctuation and spacing) and a sequential array of all sentences.

The part that I seem to have neglected to think of is this: if the production of the output is to take place after the generation of the various arrays, all the pieces of data (arrays and all) need to be cross-referenced.

In other words, for any given character in the entire text, I should be able to readily examine what word, sentence, or paragraph that character is a part of.
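One way to get that kind of cross-referencing is to store each unit as a (start, end) character span and binary-search the spans. The Python sketch below is an assumption-laden illustration: the class name `TextIndex` and the three regular expressions are crude stand-ins for whatever the real scanner produces (the sentence pattern, for one, would wrongly split at "3.14" and at hard linebreaks).

```python
import re
from bisect import bisect_right

# Placeholder patterns, for illustration only.
PATTERNS = {
    "word": r"\w+(?:['.-]\w+)*",
    "sentence": r"[^.!?\s][^.!?\n]*[.!?]*",
    "paragraph": r"[^\n]+(?:\n[^\n]+)*",
}


class TextIndex:
    """Maps any character offset to the word/sentence/paragraph containing it."""

    def __init__(self, text):
        self.text = text
        self.spans = {
            layer: [(m.start(), m.end()) for m in re.finditer(pattern, text)]
            for layer, pattern in PATTERNS.items()
        }
        self.starts = {
            layer: [start for start, _ in spans]
            for layer, spans in self.spans.items()
        }

    def lookup(self, layer, offset):
        """Return (index, text) of the unit covering offset, or None."""
        i = bisect_right(self.starts[layer], offset) - 1
        if i < 0:
            return None
        start, end = self.spans[layer][i]
        if offset >= end:                # offset falls in a gap between units
            return None
        return i, self.text[start:end]
```

Because `lookup` returns the unit's index as well as its text, "the previous sentence" or "the first word of this paragraph" is only an index step or a second lookup away.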

You are right in saying that this is complex. Certainly it is more complex than most approaches I have seen to text reformatting--however its benefit is that it provides the programmer with ways of asking reasonably high level questions instead of being stuck with an ant's-eye view of the text.

Before running out of my daddy's-time, I tried just a very simple experiment. After processing my sample file, I listed all sentences (by the broad definition, not the stricter colloquial one) that began with an alphanumeric character and ended either with the same or with a colon ( : ).

Doing this gave me a list composed mostly of all the chapter titles, and a few false positives. The false positives though were few and very different. Most were far longer than the chapter titles, and the chapter titles had commonalities not shared by these false positives.

With a bit more logic, it might be possible to figure out chapter titles even without resorting to dumb keyword matching.
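The experiment described above amounts to a one-line filter over the sentence array. In this Python sketch the function names are invented for the example, and the optional `max_words` cut-off is a hypothetical refinement suggested only by the observation that the false positives were much longer than real titles.

```python
def is_title_candidate(sentence):
    """Starts with an alphanumeric character; ends with one, or with a colon."""
    s = sentence.strip()
    return bool(s) and s[0].isalnum() and (s[-1].isalnum() or s[-1] == ":")


def title_candidates(sentences, max_words=None):
    """Filter the sentence array down to likely chapter titles."""
    found = [s.strip() for s in sentences if is_title_candidate(s)]
    if max_words is not None:
        # Hypothetical extra rule: drop candidates that are too long.
        found = [s for s in found if len(s.split()) <= max_words]
    return found
```

Other commonalities among the surviving candidates (similar length, similar capitalisation, regular spacing between them) could be layered on in the same way.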

Quote:

Originally Posted by Sabardeyn

There are dozens of issues including language used, dictionary (and definition of each word) used, including first and last name listings, dealing with non-standard names (sci-fi and fantasy, foreign names), type of book (fiction, textbook, recipes, phone book, etc), images & their captions... the list is potentially endless.

The language used is an issue for anything involving a wordlist. My ideas with regards to that were more along the lines of: words that only appear once in a given text (and not at all in the global wordlist), and words using non-standard characters are suspected of being errors. (And should be reported in some way to the human user, perhaps for review and possible fixing in a second pass on the file?)
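A minimal Python sketch of that wordlist check follows. The function name, the reason labels, and above all the definition of "non-standard characters" (anything other than ASCII letters, digits, apostrophe, hyphen or full stop) are assumptions suited to an English profile, not part of the idea as stated.

```python
from collections import Counter

WORD_PUNCT = "'-."


def suspect_words(words, global_wordlist):
    """Map each suspicious word to the reasons it was flagged, for human review."""
    counts = Counter(w.lower() for w in words)     # the local, per-text wordlist
    known = {w.lower() for w in global_wordlist}
    suspects = {}
    for word, seen in counts.items():
        reasons = []
        if seen == 1 and word not in known:
            reasons.append("seen-once-and-unknown")
        if any(not (c.isascii() and (c.isalnum() or c in WORD_PUNCT)) for c in word):
            reasons.append("odd-characters")
        if reasons:
            suspects[word] = reasons
    return suspects
```

Note how a made-up name that recurs in the text is not flagged even though the global wordlist has never heard of it; that is the local wordlist doing its job.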

The meanings of words are not necessary. Any tricks that require word matching would obviously not be compatible across languages, but would also not be difficult to alter for use with another language. The program I am working on will use profiles to treat different languages differently. Having said that though, I think we both might be surprised to find out how much can actually be achieved before ever resorting to keyword matching.

But, of course, you are right that this is never going to be universal. If I can turn this into something that can do an admirable job with the majority of simple to complex novels (deal with epigrams, footnotes, more than one level of structural division) I will be happy.

Quote:

Originally Posted by Sabardeyn

I'm not trying to completely destroy your efforts. I just think as much thought as you've given this, it's all theoretical and contains a few too many assumptions on the input and output. This is not necessarily a bad thing. You've got to start somewhere.

By all means, work to prove me wrong!

My primary assumption about the input is that human formatted text is interpretable by a program wherein the programmer can pose high-level human questions about the text. Questions that ought to lead to similar judgments as our subconscious evaluation of the text does.

If I can recognize a series of indented lines, mostly with non-sentence-finishing punctuation at the end, followed by a final dash-fronted line (likely also not a proper sentence), as a poem... then so ought a program be able to, if upon encountering an indented or an unorthodoxly short/terminated line, it can check ahead to see what follows.
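That look-ahead could be sketched in Python along these lines. Everything specific here is an assumption for illustration: the function name, the minimum of three lines, reading "mostly" as "more than half", and insisting on the dash-fronted closing line.

```python
DASHES = ("-", "\u2013", "\u2014")      # hyphen, en dash, em dash
SENTENCE_END = (".", "!", "?")


def poem_block_end(lines, start, min_lines=3):
    """Look ahead from lines[start]; return the index just past a poem-like
    block of indented lines, or None if what follows does not look like one."""
    end = start
    while (end < len(lines)
           and lines[end][:1] in (" ", "\t")
           and lines[end].strip()):
        end += 1
    block = [ln.strip() for ln in lines[start:end]]
    if len(block) < min_lines or not block[-1].startswith(DASHES):
        return None
    verses = block[:-1]
    open_ended = sum(1 for v in verses if not v.endswith(SENTENCE_END))
    return end if open_ended * 2 > len(verses) else None
```

The caller would invoke this on hitting an indented line and, on a non-None result, typeset `lines[start:end]` as a poem and resume scanning at `end`.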

Definitely the eventual totality of rules is certain to be more complicated than one can readily wrap one's mind around--but with some flow-charting and sensibly layered code, I think it an achievable goal.

I appreciate your comments, Sabardeyn! And will let you know how/whether I manage to make progress.

- Ahi

P.s.: In brief, my approach is still primarily about matching the form of the text (as opposed to trying to discern or divine its semantic significance), but in more sophisticated ways than usual--the biggest thing, in my mind, being the ability to look up previous/next/nearby words, sentences, paragraphs and evaluate them... and then based on the findings potentially reevaluate the current chunk of text being examined.

Sabardeyn · 05-29-2009, 04:13 PM

I think I understand your approach a little better now. I was under the assumption that you were building something that would attempt to understand grammar, syntax, paragraphs and document structures. At least to the extent of being capable of pattern matching and comparison to determine appropriate usage.

Basically I was expecting your output to result in an English PhD with a minor in Linguistics! Of course, J. R. R. Tolkien v2.0 would be completely acceptable.

kacir · 05-29-2009, 04:35 PM

Have a look at the http://www.nicemice.net/par/ program

par is a paragraph reformatter that does a remarkably clever job reformatting lines. It can recognize and properly wrap mails with multi level quotes, C comments, numbered lists, lists, bordered paragraphs, weirdly formatted text.

The [incomplete] documentation is at http://www.nicemice.net/par/par-doc.var

Par is a Unix program written in pure C, and I was able to compile it on Windows (using Borland's free C compiler).
I am pretty sure you can borrow an idea or two by studying the code ;-)





