The notion of text processing comes up in this subforum fairly often, and while regexes are a God-send, they can be tripped up by the idiosyncrasies of poorly formatted files.
I am making this thread to share some ideas, as it seems to me there are at least a handful of scripters/coders on these forums who have their own text processors, but are not 100% satisfied with how well they handle messy real-world files.
Let it be noted that I have not yet implemented these ideas, so I do not mean to suggest they are proven to work--but they are the ideas I will soon be using to put together my own script.
Thoughts, suggestions, and counter-arguments as to why this or that idea is problematic are most welcome! And hopefully something below will give at least one person a bright idea or two as to how to improve their reformatting scripts.
__________________________________________________
Processing via a State Machine
Have the processing happen one character at a time, with the "receipt" of each character building...
- a growing string of all parsed characters, minus those eliminated (e.g.: duplicate spaces, et al)
- a growing array of all words (i.e.: one or more alphanumeric characters demarcated by whitespace or punctuation [though some words, broadly speaking, contain punctuation: "3.14", "it's", "O'brien", "window-cleaner"])
- a growing array of all sentences (i.e.: one or more words, demarcated by sentence-closing punctuation)
- a growing array of all lines (i.e.: strings of characters demarcated by linebreaks [i.e.: the actual lines of the input file])
- a growing array of all paragraphs (i.e.: one or more sentences demarcated by whitespace heavier than the most common linebreak [blank lines, in most files])
This may seem excessive, but this method of processing would allow the program to make decisions based not just on what the current character, word, or sentence is, but also on more complicated bases like "What was the previous word?" or "What word did the current paragraph start with?" or "What character did the last sentence of the previous paragraph end with?"
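To make the bookkeeping concrete, here is a rough Python sketch of what I have in mind (all names are placeholders of my own, and the word/sentence/paragraph detection is deliberately naive):

```python
class TextState:
    """Grows the parallel views described above (characters, words,
    sentences, lines, paragraphs) one character at a time.  The word and
    sentence detection is deliberately naive: cases like "3.14" need the
    kind of lookahead/context discussed in this thread."""

    def __init__(self):
        self.text = []        # all non-eliminated parsed characters
        self.words = []
        self.sentences = []
        self.lines = []
        self.paragraphs = []
        self._word = []       # word in progress
        self._sentence = []   # words of the sentence in progress
        self._line = []       # characters of the line in progress
        self._para = []       # sentences of the paragraph in progress

    def feed(self, ch):
        # Eliminate duplicate spaces before they enter the parsed text.
        if ch == ' ' and self.text and self.text[-1] == ' ':
            return
        self.text.append(ch)

        if ch.isalnum() or ch in "'-":        # crude "word character" test
            self._word.append(ch)
        elif self._word:                      # anything else ends a word
            self.words.append(''.join(self._word))
            self._sentence.append(self.words[-1])
            self._word = []

        if ch in '.!?' and self._sentence:    # sentence-closing punctuation
            self.sentences.append(' '.join(self._sentence) + ch)
            self._para.append(self.sentences[-1])
            self._sentence = []

        if ch == '\n':
            self.lines.append(''.join(self._line))
            if not self._line and self._para: # a blank line closes a paragraph
                self.paragraphs.append(' '.join(self._para))
                self._para = []
            self._line = []
        else:
            self._line.append(ch)

# state = TextState()
# for ch in open('book.txt', encoding='utf-8').read():
#     state.feed(ch)
```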
And, it seems to me, it's precisely the answers to these sorts of questions that would allow a program/script to make more intelligent decisions about what to do with a given portion of text. A simple--but not too simple--example comes to my mind:
If the current paragraph is in all capitals, and the previous paragraph's last sentence ended with a colon character (particularly if that same sentence also contains one or more words like "says", "reads", "written"), then this current paragraph should be marked/typeset as a "notice".
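As a purely hypothetical Python sketch (it assumes the sentence/paragraph arrays from the state machine above, and the 0/1/2 scoring is just my guess at how to fold in the "particularly" part):

```python
REPORTING_WORDS = {'says', 'reads', 'written'}

def notice_score(paragraph, prev_last_sentence):
    """Return 0 (not a notice), 1 (likely a notice), or 2 (very likely):
    an all-capitals paragraph introduced by a colon, with a reporting
    word in the introducing sentence as the stronger signal."""
    if not paragraph.isupper():
        return 0
    if not prev_last_sentence.rstrip().endswith(':'):
        return 0
    words = {w.strip('.,:;"').lower() for w in prev_last_sentence.split()}
    return 2 if words & REPORTING_WORDS else 1
```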
In fact, going a step further, if the arrays are built prior to any decision-making, the program should be able to look both forwards and backwards, allowing it to do things like recognize epigrams at the beginnings of chapters as such, and differentiate notices from poems and letters. (i.e.: to simplify things, letters generally end with a sign-off, which could be detected, and poems usually have punctuation/line-break patterns detectably different from normal prose)
Detecting the occasional needlessly line-broken paragraph would also be possible... if the majority of the document has paragraphs all on a single line (i.e.: the overwhelming majority of lines begin with a capital letter and end with punctuation) and you come upon two consecutive lines where the first fails to end with sentence-closing punctuation and the second starts with a non-capital letter, you have an erroneously line-broken paragraph.
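Operating on the lines array, the rejoining might look something like this Python sketch (my choice of sentence-closing characters is a guess):

```python
SENTENCE_END = tuple('.!?"\'')   # my guess at "sentence-closing" characters

def rejoin_broken_lines(lines):
    """Merge a line into the previous one when that previous line lacks
    sentence-closing punctuation and this line starts lowercase.  Only
    worth running once you have established that the document normally
    keeps each paragraph on a single line."""
    merged = []
    for line in lines:
        if (merged and merged[-1]
                and not merged[-1].rstrip().endswith(SENTENCE_END)
                and line[:1].islower()):
            merged[-1] = merged[-1].rstrip() + ' ' + line
        else:
            merged.append(line)
    return merged
```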
Whitespace treatment
Consecutive whitespace characters should be merged and weighed (for the purpose of processing--not in the output). A single space is whitespace with weight 1. A single linebreak is whitespace with weight 1000. Two spaces and one linebreak weigh 1002; 7 spaces and 3 linebreaks weigh 3007.
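In code, the weighing could be as simple as the following sketch (note that treating tabs and other non-linebreak whitespace as weight 1 is an assumption on my part):

```python
import re

def whitespace_weights(text):
    """Weigh every maximal run of whitespace in the document: each
    linebreak counts 1000, everything else (space, tab, etc.) counts 1."""
    return [sum(1000 if ch == '\n' else 1 for ch in run)
            for run in re.findall(r'\s+', text)]

# whitespace_weights('A  \nB')                        -> [1002]
# whitespace_weights('A' + ' ' * 7 + '\n' * 3 + 'B')  -> [3007]
```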
Averaging whitespace weights over the document might then give a reasonably reliable idea as to whether paragraphs each reside on their own line, or are manually line-broken into multiple lines. i.e.: A document whose whitespace weights average toward 1000 has manually line-broken paragraphs. A document whose whitespace weights average toward 2000+ has paragraphs on their own lines, unless of course lines fail to begin with a capital letter and end with sentence-closing punctuation (in which case the greater whitespace weight is indicative of double-spacing plus manual line-breaks).
EDIT: Actually, taking the median as opposed to the mean might be the better way to do it.
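Putting the median idea and the line-shape caveat together, a hypothetical classifier might look like this (the 1500 cutoff and the 80% threshold are untested guesses):

```python
from statistics import median

SENTENCE_END = tuple('.!?"\'')

def linebreak_style(weights, lines):
    """Guess the paragraph layout from the median weight of the
    line-level whitespace runs (weight >= 1000).  The 1500 cutoff and
    the 80% line-shape threshold are untested guesses of mine."""
    line_gaps = [w for w in weights if w >= 1000]
    if not line_gaps:
        return 'no linebreaks at all'
    if median(line_gaps) < 1500:
        return 'paragraphs manually broken across lines'
    # Weights toward 2000+: check line shape to tell one-paragraph-per-line
    # apart from double-spaced, manually line-broken text.
    text_lines = [l for l in lines if l.strip()]
    shapely = sum(1 for l in text_lines
                  if l.lstrip()[:1].isupper()
                  and l.rstrip().endswith(SENTENCE_END))
    if text_lines and shapely >= 0.8 * len(text_lines):
        return 'one paragraph per line'
    return 'double-spaced and manually line-broken'
```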
__________________________________________________
I have some other ideas, relating to quotation marks, the use of wordlists (a combination of global ones and local ones built in real time) to detect likely misspellings, HTML tags, et al. If the above concepts generate any interest, I might take the time to write some ruminations about those as well. If not, mea culpa!
- Ahi