MobileRead Forums > E-Book Formats > Workshop

Old 09-14-2009, 11:21 AM   #1
ahi
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
Text Analysis & Paragraph Detection

I would like to post some thoughts, musings, et cetera on text analysis and paragraph detection. While I am giving my own thoughts mostly in relation to the work I am actually doing on pacify, this discussion need not in any way focus on that specific program/use/approach.

---

Paragraph Detection

Detecting line-broken paragraphs actually seems straightforward--assuming the file is at least semi-consistently prepared.

The most straightforward way to detect whether or not paragraphs are line-broken is to simply count what percentage of non-empty lines begin with a character that is not an opening quote, a dash/en-dash/em-dash, an opening parenthesis, or a capital letter.

If the percentage is more than 50 (for a book of any length, much more than 50), chances are very good that it contains line-broken paragraphs.
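For anyone who wants to try the check, here is a minimal Python sketch of it. The function names, the `threshold` parameter, and the decision to ignore leading whitespace are my assumptions for illustration; this is not code taken from pacify.

```python
# Characters that may legitimately open a paragraph besides a capital letter:
# straight and curly opening quotes, an opening parenthesis, and the hyphen,
# en-dash (\u2013) and em-dash (\u2014).
OPENERS = set("\"'\u201c\u2018(-\u2013\u2014")

def looks_like_start(line):
    """True if the line begins the way a paragraph normally would."""
    first = line.lstrip()[:1]
    return first.isupper() or first in OPENERS

def broken_line_percentage(text):
    """Percentage of non-empty lines that do NOT look like a paragraph start."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    odd = sum(1 for ln in lines if not looks_like_start(ln))
    return 100.0 * odd / len(lines)

def is_line_broken(text, threshold=50.0):
    """Apply the 'more than 50 percent' rule of thumb."""
    return broken_line_percentage(text) > threshold
```

On a real book the percentage should land far from the threshold in one direction or the other, so the exact cut-off matters little.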

If this is the case, the best way to go about fixing up the paragraphs, so each paragraph has its own line, is by:

1) run through the entire document, counting how many times certain sequences of whitespace characters occur:

- The most frequent whitespace sequence should be ' ' (i.e.: a single space). Word breaks, if you will.

- The second most frequent whitespace sequence should be whatever whitespace sequence is used to separate intra-paragraph lines. (like a single newline character)

- The third most frequent whitespace sequence should be the paragraph break indicating whitespace sequence. (like two newline characters)

2) Replace all instances of the second most frequent whitespace sequence with a single space.

This will result, in most cases, in a file that has each paragraph on its own line. It may however also incorrectly single-line non-paragraph text. This is usually of minimal consequence, more likely to impact title page text than anything else.
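The two steps above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the names `split_runs` and `unwrap_paragraphs` are made up, and the sketch trusts the frequency ranking blindly, so it does nothing sensible on a file that is not line-broken to begin with.

```python
from collections import Counter
from itertools import groupby

def split_runs(text):
    """Split text into alternating runs of whitespace and non-whitespace."""
    return ["".join(group) for _, group in groupby(text, key=str.isspace)]

def unwrap_paragraphs(text):
    """Replace the second most frequent whitespace run with a single space."""
    runs = split_runs(text)
    counts = Counter(run for run in runs if run.isspace())
    ranking = [seq for seq, _ in counts.most_common()]
    if len(ranking) < 2:
        return text  # nothing that could be an intra-paragraph line break
    line_break = ranking[1]
    return "".join(" " if run == line_break else run for run in runs)
```

Because only exact matches of the second-ranked sequence are replaced, indented verse lines (newline plus spaces or a tab) pass through untouched, as described above.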

If poems and quotes are indented with leading spaces or a leading tab, they will not be erroneously processed along with paragraphs, as their whitespace sequences will be different from that of intra-paragraph linebreaks.

Also, some files are not 100% consistent in what whitespace sequence separates intra-paragraph lines. Usually the problem is an additional space character either at the beginning or at the end of the line... sometimes.

This can be easily addressed by using whitespace weights instead of the sequences themselves. Instead of counting whitespace sequences in the above described process, I only count weights... spaces are worth 0.24, tabs 2.00, linebreaks 8.00. With such a system \s\r (8.24) and \r\s (8.24) and even \s\r\s (8.48) are all worth 8 once rounded.
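A small Python sketch of the weighting, using the three weights quoted above. The names are hypothetical, and treating a Windows-style "\r\n" pair as one linebreak is my assumption rather than something stated in the post.

```python
from collections import Counter
from itertools import groupby

# Weights from the post: space 0.24, tab 2.00, linebreak 8.00.
WEIGHTS = {" ": 0.24, "\t": 2.00, "\n": 8.00}

def whitespace_weight(seq):
    """Rounded weight of one whitespace run."""
    # Assumption: "\r\n" and a bare "\r" each count as a single linebreak.
    seq = seq.replace("\r\n", "\n").replace("\r", "\n")
    return round(sum(WEIGHTS.get(ch, 0.0) for ch in seq))

def weight_counts(text):
    """Count how often each rounded weight occurs among the whitespace runs."""
    runs = ("".join(g) for is_ws, g in groupby(text, key=str.isspace) if is_ws)
    return Counter(whitespace_weight(run) for run in runs)
```

With these weights a stray space before or after a linebreak no longer produces a separate category, while a doubled linebreak (weight 16) still stands apart as a paragraph break.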

The final improvement to this intra-paragraph linebreak fixing method would be to ensure that the whitespace sequence substitution only takes place within paragraphs.

This could be achieved by checking to make sure either

A) The current line is directly preceded and/or followed by one or more lines that are, except for the final line before an empty line, of roughly average length; the first of those lines begins with a valid sentence-starting character (capital letter, opening quote/parenthesis, dash/en-dash/em-dash, et cetera); and the last of them ends with a valid sentence/paragraph-ending character (period, colon, exclamation mark, question mark, closing quote, closing parenthesis, et cetera).

Not all of the above should be absolutely required for a given line to be considered part of a paragraph: if most are met, a single exception is not necessarily a deal-breaker. E.g.: if all is well, except that the paragraph ends with a comma... it should probably still be treated like a paragraph.

B) If the current line has neither directly preceding nor directly following lines, check 5-10 lines forward and back to ascertain that those lines contain paragraphs. If they do, the current line is fairly certain to be a paragraph shorter than the line-break line length.
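Criterion A, including the "one exception is tolerable" rule, might look something like the Python sketch below. The 25% length tolerance, the exact character sets, and the function names are assumptions of mine; criterion B (looking 5-10 lines around an isolated line) is not covered here.

```python
# Straight/curly opening quotes, "(", hyphen, en-dash (\u2013), em-dash (\u2014).
STARTERS = set("\"'\u201c\u2018(-\u2013\u2014")
# Period, colon, "!", "?", straight/curly closing quotes, ")".
ENDERS = set(".:!?\"'\u201d\u2019)")

def paragraph_score(block, avg_len, tolerance=0.25):
    """Count how many of the three criteria a block of consecutive lines meets."""
    body = block[:-1]  # the final line is allowed to be short
    lengths_ok = all(abs(len(ln) - avg_len) <= tolerance * avg_len for ln in body)
    first = block[0].lstrip()[:1]
    starts_ok = first.isupper() or first in STARTERS
    ends_ok = block[-1].rstrip()[-1:] in ENDERS
    return sum([lengths_ok, starts_ok, ends_ok])

def is_paragraph(block, avg_len):
    """Accept the block if at most one criterion fails."""
    return bool(block) and paragraph_score(block, avg_len) >= 2
```

Here `block` is a list of consecutive non-empty lines and `avg_len` is the typical wrapped line length for the whole file, which would have to be measured beforehand.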

---

Though the above may be long and somewhat meandering... using the above ideas, it should be fairly straightforward to implement a paragraph detecting/fixing algorithm.

I encourage and welcome similarly hairy (but well thought out) descriptions for other text analysis/fixing tasks.

- Ahi
Old 09-14-2009, 11:34 AM   #2
ahi
Wizard
Quotation Mark Fixing

The way I've been fixing quotation marks is by parsing through the document, character by character, and keeping track of whether the current state of the document is quotation-opened or quotation-closed.

Doing so, however, led to fairly frequent errors due to (legitimately) unclosed quotation marks. As a result, I started overriding the decision of whether to put an opening quotation mark or a closing one based on which side of the quotation mark had alphanumeric characters (as opposed to whitespace or punctuation). This fixed most false positives.
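To make the idea concrete, here is a rough Python sketch of the character-by-character pass with the neighbour-based override. It only handles straight double quotes, and the function name is hypothetical; it is an illustration of the approach, not the code in pacify.

```python
def curl_double_quotes(text):
    """Turn straight double quotes into curly ones, tracking open/closed state."""
    out = []
    is_open = False
    for i, ch in enumerate(text):
        if ch != '"':
            out.append(ch)
            continue
        prev_alnum = i > 0 and text[i - 1].isalnum()
        next_alnum = i + 1 < len(text) and text[i + 1].isalnum()
        if next_alnum and not prev_alnum:
            opening = True          # text only on the right: opening quote
        elif prev_alnum and not next_alnum:
            opening = False         # text only on the left: closing quote
        else:
            opening = not is_open   # ambiguous: fall back on the tracked state
        out.append("\u201c" if opening else "\u201d")
        is_open = opening
    return "".join(out)
```

The override is what lets a second paragraph of a multi-paragraph quote open correctly even though the previous paragraph never closed its quote.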

In English, however, there is also the use of apostrophes in words. Therefore single quotation marks that have alphanumeric characters on both sides (e.g.: Steve's, it's, ain't) are considered apostrophes and not quotation marks. Also, any single quotation mark that follows an 's' is considered suspect of being an apostrophe (e.g.: Jesus' name, Boris' house)... suspicion being turned to certainty if the paragraph is yet to have an opening single quote and/or has no subsequent closing single quote or following-line opening single quote (as said line's first character).
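In Python, the apostrophe rules might be sketched like this. The function names and the three labels are mine, and the sketch implements only the first half of the tie-break (no opening single quote earlier in the paragraph); the check for a subsequent closing quote or a following-line opening quote is left out.

```python
def classify_single_quote(text, i):
    """Return 'apostrophe', 'suspect', or 'quote' for the single quote at text[i]."""
    prev = text[i - 1] if i > 0 else ""
    nxt = text[i + 1] if i + 1 < len(text) else ""
    if prev.isalnum() and nxt.isalnum():
        return "apostrophe"   # Steve's, it's, ain't
    if prev in ("s", "S"):
        return "suspect"      # Jesus' name, Boris' house
    return "quote"

def resolve_single_quote(paragraph, i):
    """Settle 'suspect' marks: an apostrophe unless a quote was opened earlier."""
    kind = classify_single_quote(paragraph, i)
    if kind != "suspect":
        return kind
    opened_earlier = any(
        paragraph[j] == "'" and classify_single_quote(paragraph, j) == "quote"
        for j in range(i)
    )
    return "quote" if opened_earlier else "apostrophe"
```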

The last bit of complication would be words like >> 'Tis <<. This is probably best handled by an exception list... which, while not exhaustive, should work reasonably well for the vast majority of documents. Or, alternatively, the user could be alerted about lone-ranger single quotation marks (as they do, in some PG documents, occur by error... or, rather, sometimes a second single quotation mark fails to occur by error but is discernible by context).
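An exception list of that sort could be as simple as the Python sketch below. Only 'Tis comes from this post; the other entries and the function name are examples I made up, and a real list would need to be much longer.

```python
# Hypothetical starter list; entries are lower-cased, leading mark included.
LEADING_ELISIONS = {"'tis", "'twas", "'twere", "'em"}

def starts_known_elision(text, i):
    """True if the single quote at text[i] begins a word on the exception list."""
    j = i + 1
    while j < len(text) and text[j].isalpha():
        j += 1
    return text[i:j].lower() in LEADING_ELISIONS
```

A single quote for which this returns True would be left as an apostrophe instead of being treated as an opening quotation mark.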

- Ahi
Old 09-14-2009, 11:42 AM   #3
nekokami
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
I've been using end-of-line punctuation, rather than beginning-of-line characters, to distinguish paragraph marks. This can lead to false positives if punctuation just happens to fall at the end of a line, but my results have been fairly good so far.
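A minimal Python sketch of the end-of-line approach, for comparison. The set of terminating characters, the function name, and the choice to treat blank lines as paragraph breaks as well are assumptions for illustration, not details of the algorithm that went into Calibre.

```python
# Period, "!", "?", colon, straight/curly closing quotes, ")".
TERMINATORS = set(".!?:\"'\u201d\u2019)")

def paragraphs_by_line_endings(text):
    """Join lines until one ends in terminal punctuation (or a blank line)."""
    paragraphs, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        # A blank line or terminal punctuation closes the paragraph in progress.
        if current and (not stripped or stripped[-1] in TERMINATORS):
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

The false positive mentioned above is easy to reproduce: a wrapped line that happens to end in "Mr." gets split off as its own paragraph.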

One problem with relying on quotation marks specifically is that in English language texts, often a quote that runs for more than one paragraph does not have closing quotation marks for the earlier paragraph(s), but only for the final paragraph of the quote. The use of single quotes as both apostrophes and as dialogue markers (more common in British than American English) can be a problem, especially in the case of plural possessives, as you describe above.
Old 09-14-2009, 11:45 AM   #4
Sparrow
Wizard
Posts: 4,395
Karma: 1358132
Join Date: Nov 2007
Location: UK
Device: Palm TX, CyBook Gen3
Quote:
Originally Posted by ahi View Post
The most straightforward way to detect whether or not paragraphs are line-broken is to simply count what percentage of non-empty lines begin with a character that is not an opening quote, a dash/en-dash/em-dash, an opening parenthesis, or a capital letter.
Like Neko, I look for the character that precedes the line break. In MSWord the line break is generally the ^p character.
I replace all space+^p occurrences with ^p; repeating until all such spaces are removed.

Then for any letter/number, or non-full stop punctuation (except quotes) I replace the following ^p with a space.

Hyphens get replaced individually, since some may need to be retained.

This is quick and dirty - it will retain full-stop+^p when they should be full-stop+space - but the process is normally just prep for proof-reading. (Or I can just opt to live with those inaccuracies.)
Also, verses need to be edited manually.
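Outside of Word, the same steps could be approximated in Python roughly as follows, with "\n" standing in for ^p. This is a sketch under assumptions: "full stop" is read literally as ".", hyphens are simply left alone for manual review, and a break that is followed by a blank line is kept, which is my addition.

```python
import re

# A break is replaced only when the character before it is NOT a full stop,
# a quote (straight or curly), a hyphen, or another line break.
JOINABLE_BREAK = re.compile(r"""(?<=[^.\-"'“”‘’\n])\n(?!\n)""")

def unwrap_word_style(text):
    # Step 1: remove spaces/tabs sitting directly before a line break.
    # The "+" does in one pass what repeated replacing does in Word.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Step 2: turn the remaining joinable breaks into spaces.
    return JOINABLE_BREAK.sub(" ", text)
```

As with the Word version, a line that merely happens to end in a full stop keeps its break, so the result still wants proof-reading.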
Old 09-14-2009, 12:02 PM   #5
ahi
Wizard
Quote:
Originally Posted by nekokami View Post
I've been using end-of-line punctuation, rather than beginning-of-line characters, to distinguish paragraph marks. This can lead to false positives if punctuation just happens to fall at the end of a line, but my results have been fairly good so far.

One problem with relying on quotation marks specifically is that in English language texts, often a quote that runs for more than one paragraph does not have closing quotation marks for the earlier paragraph(s), but only for the final paragraph of the quote. The use of single quotes as both apostrophes and as dialogue markers (more common in British than American English) can be a problem, especially in the case of plural possessives, as you describe above.
The algorithm vaguely described above does fairly well with paragraph-spanning quotation marks, single quotation marks, and apostrophes within a single document.

Of course, there's no real way to put that into a single regex... it probably requires a script of at least a dozen lines.

Quote:
Originally Posted by Sparrow View Post
Also, verses need to be edited manually.
I think verses should be detectable too... even if not helpfully preceded (on each line) with additional whitespace. Basically you are looking for irregular lines... less than average length, perhaps all ending on punctuation (but not always on sentence-ending punctuation)... possibly several starting with capitals despite there being no sentence-ending punctuation on the preceding line.

I've not actually attacked this problem yet... but when I do, I'll post my ideas in detail.
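In the meantime, the hints listed above could be turned into a rough Python heuristic along these lines. Everything specific here is an assumption for the sake of illustration: the 75% and 80% thresholds, the function name, and the requirement of at least one capitalised line following an unfinished sentence.

```python
SENTENCE_END = set(".!?")

def looks_like_verse(block, avg_len):
    """Heuristic: mostly short lines, with capitals following unfinished sentences."""
    if len(block) < 2:
        return False
    # Lines clearly shorter than the file's average wrapped line length.
    short = sum(1 for ln in block if len(ln.strip()) < 0.75 * avg_len)
    # Lines starting with a capital although the previous line did not end a sentence.
    run_on_capitals = sum(
        1
        for prev, cur in zip(block, block[1:])
        if cur.strip()[:1].isupper() and prev.strip()[-1:] not in SENTENCE_END
    )
    mostly_short = short >= 0.8 * len(block)
    return mostly_short and run_on_capitals >= 1
```

Here `block` is a list of consecutive non-empty lines and `avg_len` is the average line length measured over the whole file.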

I think it should be possible for the majority of straightforward books to autodetect chapter titles and verses/quoted portions... with considerable accuracy.

- Ahi
Old 09-14-2009, 12:24 PM   #6
Jellby
frumious Bandersnatch
Posts: 7,546
Karma: 19001583
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Quote:
Originally Posted by ahi View Post
The last bit of complication would be words like >> 'Tis <<.
And things like Cockney accent, you better be careful, or the dog might bite before you see 'im comin'
Old 09-14-2009, 12:27 PM   #7
ahi
Wizard
Quote:
Originally Posted by Jellby View Post
And things like Cockney accent, you better be careful, or the dog might bite before you see 'im comin'
He he... how likely is a text full of such delightful colloquialism to opt to use single quotation marks?

- Ahi

P.S.: Though if it did, the exclusion list approach might be an alright way of handling it. If we know the text is English, >> 'im << is never (unless "Im" is a proper noun... but being uncapitalized, it isn't) the beginning of a quote... and >> comin' << is likewise never (at least correctly) the end of one.

Last edited by ahi; 09-14-2009 at 12:30 PM.
Old 09-14-2009, 12:33 PM   #8
nekokami
fruminous edugeek
So do you have a regexp to share for the first algorithm? Some Word macros that people could download to implement these would be nice.

I believe the next version of Calibre is supposed to include my algorithm, but yours might have been a better choice....
Old 09-14-2009, 12:43 PM   #9
ahi
Wizard
Quote:
Originally Posted by nekokami View Post
So do you have a regexp to share for the first algorithm? Some Word macros that people could download to implement these would be nice.

I believe the next version of Calibre is supposed to include my algorithm, but yours might have been a better choice....
No. I don't use regexes. I find they are too fragile and their full impact is difficult to foresee accurately. (I.e.: they fail too easily and sometimes change things the regex "writer" never intended them to.)

Pacify more or less uses the algorithm described for fixing quotation marks... you are welcome to download it and play with it and/or check out the source. But it is not ready for primetime as yet.

A Word macro might be possible... assuming they use VBScript or JScript... but I haven't done that sort of thing in a while. Perhaps you'd like to take a crack at it, based on the description and the code in pacify.py?

- Ahi
Old 09-14-2009, 12:49 PM   #10
nekokami
fruminous edugeek
I'd rather see pacify incorporated into Calibre and/or Sigil.
Old 09-14-2009, 12:52 PM   #11
ahi
Wizard
Quote:
Originally Posted by nekokami View Post
I'd rather see pacify incorporated into Calibre and/or Sigil.
Hehe.

Perhaps once it's reasonably stable, I'll offer it to all and sundry for seamless incorporation into their backends.

I'd be curious to know whether you find it works as well or better than your own approach. I think it should (even the version featured on the first post of the thread I linked to)... but I'd be grateful to know with certainty, if you are up to doing a few checks.

- Ahi
Old 09-14-2009, 12:54 PM   #12
nekokami
fruminous edugeek
I'll see what I can do over the next couple of days.
Old 09-14-2009, 01:34 PM   #13
Jellby
frumious Bandersnatch
Quote:
Originally Posted by ahi View Post
He he... how likely is a text full of such delightful colloquialism to opt to use single quotation marks?
Meet Mr. Galsworthy's The Forsyte Chronicles, or some of the Wodehouse works
Old 09-14-2009, 01:41 PM   #14
ahi
Wizard
Quote:
Originally Posted by Jellby View Post
Meet Mr. Galsworthy's The Forsyte Chronicles, or some of the Wodehouse works
Well... in that case, it shouldn't be too hard to build up a reasonably strong exclusion list.

Although looking at Project Gutenberg's list of Wodehouse's stuff... it doesn't seem as frightening as your post first made me think.

- Ahi
Old 09-14-2009, 01:52 PM   #15
Jellby
frumious Bandersnatch
Quote:
Originally Posted by ahi View Post
Although looking at Project Gutenberg's list of Wodehouse's stuff... it doesn't seem as frightening as your post first made me think.
No, it's not so bad. It's just stuff that happens.
All times are GMT -4.

