#1 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Text Analysis & Paragraph Detection
I would like to post some thoughts, musings, et cetera on text analysis and paragraph detection. While I am giving my own thoughts mostly in relation to the work I am actually doing on pacify, this discussion need not in any way focus on that specific program/use/approach.
---

Paragraph Detection

Detecting line-broken paragraphs actually seems straightforward--assuming the file is at least semi-consistently prepared.

The most straightforward way to detect whether or not paragraphs are line-broken is to simply count what percentage of non-empty lines begin with a character that is not an opening quote, a dash/en-dash/em-dash, an opening parenthesis, or a capital letter. If the percentage is more than 50 (for a book of any length, much more than 50), chances are very good that it contains line-broken paragraphs.

If this is the case, the best way to go about fixing up the paragraphs, so each paragraph has its own line, is to:

1) Run through the entire document, counting how many times certain sequences of whitespace characters occur:
- The most frequent whitespace sequence should be ' ' (i.e.: a single space). Word breaks, if you will.
- The second most frequent whitespace sequence should be whatever whitespace sequence is used to separate intra-paragraph lines. (like a single newline character)
- The third most frequent whitespace sequence should be the paragraph break indicating whitespace sequence. (like two newline characters)

2) Replace all instances of the second most frequent whitespace sequence with a single space.

This will result, in most cases, in a file that has each paragraph on its own line. It may however also incorrectly single-line non-paragraph text. This is usually of minimal consequence, more likely to impact title page text than anything else. If poems and quotes are indented with leading spaces or a leading tab, they will not be erroneously processed along with paragraphs, as their whitespace sequences will be different from that of intra-paragraph linebreaks.

Also, some files are not 100% consistent in what whitespace sequence separates intra-paragraph lines. Usually the problem is an additional space character either at the beginning or at the end of the line... sometimes.
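For anyone who wants to experiment, here is a minimal Python sketch of the detection test and the two-step fix described above. It is not taken from pacify; the function names, the set of "opener" characters, and the normalisation of Windows line endings are my own assumptions.

```python
import re
from collections import Counter

# Assumed set of characters (besides capital letters) that may start a
# sentence: straight/curly opening quotes, dashes, opening parenthesis.
SENTENCE_OPENERS = set("\"'\u201c\u2018-\u2013\u2014(")


def looks_line_broken(text, threshold=0.5):
    """True if more than `threshold` of non-empty lines start mid-sentence."""
    lines = [ln.lstrip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    mid_sentence = sum(
        1 for ln in lines
        if not (ln[0].isupper() or ln[0] in SENTENCE_OPENERS)
    )
    return mid_sentence / len(lines) > threshold


def unwrap_by_frequency(text):
    """Replace the second most frequent whitespace run with a single space."""
    text = text.replace("\r\n", "\n")  # assumption: normalise line endings
    counts = Counter(re.findall(r"\s+", text))
    ranked = [seq for seq, _ in counts.most_common(2)]
    if len(ranked) < 2:
        return text
    line_break = ranked[1]
    # Compare whole whitespace runs, so a paragraph break such as "\n\n"
    # is left alone even though it contains the line-break sequence.
    return re.sub(
        r"\s+",
        lambda m: " " if m.group() == line_break else m.group(),
        text,
    )
```

Matching whole whitespace runs (rather than calling `str.replace` on the newline) is what keeps the paragraph breaks intact.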
A stray space before or after a line break can be easily addressed by using whitespace weights instead of the sequences themselves. Instead of counting whitespace sequences in the process described above, I only count weights... spaces are worth 0.24, tabs 2.00, linebreaks 8.00. With such a system \s\r and \r\s and even \s\r\s are worth 8 (if rounded).

The final improvement to this intra-paragraph linebreak fixing method would be to ensure that the whitespace sequence substitution only takes place within paragraphs. This could be achieved by checking either of the following:

A) The current line is directly preceded and/or followed by one or more lines that are, excepting the final line before an empty line, of average length. The first of these lines begins with a valid sentence-starting character (capital letter, opening quote/parenthesis, dash/en-dash/em-dash, et cetera)... and the last of them ends with a valid sentence/paragraph-ending character (period, colon, exclamation mark, question mark, closing quote, closing parenthesis, et cetera).

Not all of the above should be absolutely required for a given line to be considered part of a paragraph; if most are met, a single exception is not necessarily a deal-breaker. e.g.: If all is well, except that the paragraph ends with a comma... it should probably still be treated like a paragraph.

B) If the current line has neither directly preceding nor directly following lines, check 5-10 lines forward and back to ascertain that those lines contain paragraphs. If they do, the current line is fairly certain to be a paragraph shorter than the line-break line length.

---

Though the above may be long and somewhat meandering... using the above ideas, it should be fairly straightforward to implement a paragraph detecting/fixing algorithm. I encourage and welcome similarly hairy (but well thought out) descriptions for other text analysis/fixing tasks.

- Ahi
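A rough Python sketch of the whitespace-weight variant, using the weights given above (0.24, 2.00, 8.00). The helper names are hypothetical, and treating both \r and \n as a linebreak (after collapsing \r\n into one) is an assumption on my part. The paragraph-context checks A) and B) are not part of this sketch.

```python
import re
from collections import Counter

# Weights from the post; "\r" and "\n" both count as one linebreak (assumption).
WEIGHTS = {" ": 0.24, "\t": 2.00, "\n": 8.00, "\r": 8.00}


def weight(run):
    """Rounded weight of one whitespace run; unknown whitespace counts as 0."""
    return round(sum(WEIGHTS.get(ch, 0.0) for ch in run))


def unwrap_by_weight(text):
    """Replace runs carrying the second most frequent weight with a space."""
    text = text.replace("\r\n", "\n")  # assumption: one linebreak per line end
    runs = re.findall(r"\s+", text)
    ranked = [w for w, _ in Counter(weight(r) for r in runs).most_common(2)]
    if len(ranked) < 2:
        return text
    line_break_weight = ranked[1]
    return re.sub(
        r"\s+",
        lambda m: " " if weight(m.group()) == line_break_weight else m.group(),
        text,
    )
```

With these weights a lone space rounds to 0, a linebreak with a stray space on either side (or both) rounds to 8, and a blank-line paragraph break comes out at 16, so the three kinds of whitespace stay distinguishable.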
#2 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quotation Mark Fixing
The way I've been fixing quotation marks is by parsing through the document, character by character, and keeping track of whether the current state of the document is quotation-opened or quotation-closed.

Doing so, however, led to fairly frequent errors due to (legitimately) unclosed quotation marks. As a result, I started overriding the decision of whether to put an opening quotation mark or a closing one based on which side of the quotation mark had alphanumeric characters (as opposed to whitespace or punctuation). This fixed most false positives.

In English, however, there is also the use of apostrophes in words. Therefore single quotation marks that have alphanumeric characters on both sides (e.g.: Steve's, it's, ain't) are considered apostrophes and not quotation marks.

Also, any single quotation mark that follows an 's' is suspected of being an apostrophe (e.g.: Jesus' name, Boris' house)... suspicion being turned to certainty if the paragraph is yet to have an opening single quote and/or has no subsequent closing single quote or following-line opening single quote (as said line's first character).

The last bit of complication would be words like >> 'Tis <<. This is probably best handled by an exception list... which, while not exhaustive, should work reasonably well for the vast majority of documents. Or, alternatively, the user could be alerted about lone-ranger single quotation marks (as they do, in some PG documents, occur by error... or, rather, sometimes a second single quotation mark fails to occur by error but is discernible by context).

- Ahi
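The rules above can be sketched in Python roughly as follows. This is not the code in pacify.py; the names and the three-word exception list are hypothetical, and the "follows an 's'" rule is simplified to "no single quote is currently open" (the look-ahead for a later closing quote is left out).

```python
OPEN_DOUBLE, CLOSE_DOUBLE = "\u201c", "\u201d"
OPEN_SINGLE, CLOSE_SINGLE = "\u2018", "\u2019"
APOSTROPHE = "\u2019"

# Hypothetical, far-from-exhaustive exception list (lower case, mark omitted).
LEADING_APOSTROPHE_WORDS = ("tis", "twas", "em")


def starts_exception_word(rest):
    """True if `rest` begins with a whole word from the exception list."""
    rest = rest.lower()
    return any(
        rest.startswith(word) and not rest[len(word):len(word) + 1].isalnum()
        for word in LEADING_APOSTROPHE_WORDS
    )


def curl_quotes(text):
    out = []
    is_open = {'"': False, "'": False}  # running open/closed state per mark
    for i, ch in enumerate(text):
        if ch not in "\"'":
            out.append(ch)
            continue
        prev_alnum = i > 0 and text[i - 1].isalnum()
        next_alnum = i + 1 < len(text) and text[i + 1].isalnum()
        if ch == "'":
            inside_word = prev_alnum and next_alnum            # Steve's, ain't
            after_s = prev_alnum and text[i - 1] in "sS" and not is_open["'"]
            leading = not prev_alnum and starts_exception_word(text[i + 1:])
            if inside_word or after_s or leading:
                out.append(APOSTROPHE)  # apostrophes never change the state
                continue
        # Neighbouring characters override the running state when they are
        # unambiguous; otherwise fall back to toggling the state.
        if next_alnum and not prev_alnum:
            opening = True
        elif prev_alnum and not next_alnum:
            opening = False
        else:
            opening = not is_open[ch]
        is_open[ch] = opening
        if ch == '"':
            out.append(OPEN_DOUBLE if opening else CLOSE_DOUBLE)
        else:
            out.append(OPEN_SINGLE if opening else CLOSE_SINGLE)
    return "".join(out)
```

For example, `curl_quotes("He said 'no' twice")` gives an opening and a closing single quote around "no", while the mark in "Jesus' name" comes out as an apostrophe and leaves the quote state untouched.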
#3 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
I've been using end-of-line punctuation, rather than beginning-of-line characters, to distinguish paragraph marks. This can lead to false positives if punctuation just happens to fall at the end of a line, but my results have been fairly good so far.
One problem with relying on quotation marks specifically is that in English language texts, often a quote that runs for more than one paragraph does not have closing quotation marks for the earlier paragraph(s), but only for the final paragraph of the quote. The use of single quotes as both apostrophes and as dialogue markers (more common in British than American English) can be a problem, especially in the case of plural possessives, as you describe above.
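To make the end-of-line punctuation idea from the top of this post concrete, here is a small Python sketch. The exact set of paragraph-ending characters and the function name are assumptions for illustration, not the precise rules I use.

```python
import re

# Assumed rule: a line ends a paragraph if it finishes with . ! ? or :
# optionally followed by closing quotes or brackets.
PARAGRAPH_END = re.compile(r"[.!?:][\"'\u201d\u2019)\]]*$")


def unwrap_by_line_ending(text):
    """Join lines until one ends with paragraph-ending punctuation."""
    paragraphs, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # a blank line always closes the current paragraph
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(line)
        if PARAGRAPH_END.search(line):
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n".join(paragraphs)
```

The false positives mentioned above show up whenever a sentence happens to end exactly at a line break in the middle of a paragraph: the paragraph gets split at that point.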
#4 | |
Wizard
Posts: 4,395
Karma: 1358132
Join Date: Nov 2007
Location: UK
Device: Palm TX, CyBook Gen3
|
Quote:
I replace all space+^p occurrences with ^p, repeating until all such spaces are removed. Then, wherever a ^p follows a letter, a number, or punctuation other than a full stop (quotes excepted), I replace that ^p with a space. Hyphens get replaced individually, since some may need to be retained.

This is quick and dirty - it will retain full-stop+^p where it should be full-stop+space - but the process is normally just prep for proof-reading. (Or I can just opt to live with those inaccuracies.) Also, verses need to be edited manually.
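For those working outside Word, the same two passes can be approximated with regular expressions. A minimal Python sketch, assuming ^p corresponds to "\n" in a plain-text file; the function name and the exact list of excluded quote characters are my own choices.

```python
import re


def join_wrapped_lines(text):
    """Approximate the two find/replace passes on plain text."""
    # Pass 1: strip spaces sitting directly before a line break.
    text = re.sub(r" +\n", "\n", text)
    # Pass 2: a break after anything except whitespace, a full stop, a quote
    # mark or a hyphen becomes a space. Hyphens are left for manual review.
    return re.sub(r"([^\s.\"'\u201c\u201d\u2018\u2019\-])\n", r"\1 ", text)
```

As with the manual procedure, a line break after a full stop is always kept, so a sentence that ends exactly at a line break inside a paragraph still needs fixing by hand.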
|
#5 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
Of course, there's no real way to put that into a single regex... it probably requires a script of at least a dozen lines.

I think verses should be detectable too... even if not helpfully preceded (on each line) with additional whitespace. Basically you are looking for irregular lines... less than average length, perhaps all ending on punctuation (but not always on sentence-ending punctuation)... possibly several starting with capitals despite there being no sentence-ending punctuation on the preceding line.

I've not actually attacked this problem yet... but when I do, I'll post my ideas in detail. I think it should be possible for the majority of straightforward books to autodetect chapter titles and verses/quoted portions... with considerable accuracy.

- Ahi
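Purely as a starting point for discussion, the "irregular lines" idea could be tried out with something like the Python sketch below. It is untested against real books; the function name, the 0.75 ratio, the three-line minimum, and the use of the median as the "typical" line length are all guesses on my part.

```python
from statistics import median


def find_verse_blocks(text, short_ratio=0.75, min_lines=3):
    """Return (first, last) line-index pairs of runs that look like verse.

    A run is a series of consecutive lines that are each clearly shorter
    than the typical line and each start with a capital letter.
    """
    lines = text.splitlines()
    lengths = [len(ln.strip()) for ln in lines if ln.strip()]
    if not lengths:
        return []
    typical = median(lengths)
    blocks, start = [], None
    for i, ln in enumerate(lines + [""]):  # the sentinel closes a final run
        s = ln.strip()
        verse_like = bool(s) and len(s) < short_ratio * typical and s[0].isupper()
        if verse_like:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_lines:
                blocks.append((start, i - 1))
            start = None
    return blocks
```

Requiring several such lines in a row is what keeps the short last line of an ordinary paragraph from being flagged.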
|
#6 |
frumious Bandersnatch
Posts: 7,546
Karma: 19001583
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
|
#7 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
- Ahi

Ps.: Though if it did, the exclusion list approach might be an alright way of handling it. If we know the text is English, >> 'im << is never (unless "Im" is a proper noun... but being uncapitalized, it isn't) the beginning of a quote... and >> comin' << is likewise never (at least correctly) the end of one.

Last edited by ahi; 09-14-2009 at 12:30 PM.
|
#8 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
So do you have a regexp to share for the first algorithm? Some Word macros that people could download to implement these would be nice.
I believe the next version of Calibre is supposed to include my algorithm, but yours might have been a better choice.... |
#9 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
Pacify more or less uses the algorithm described for fixing quotation marks... you are welcome to download it and play with it and/or check out the source. But it is not ready for primetime yet.

A Word macro might be possible... assuming they use VBScript or JScript... but I haven't done that sort of thing in a while. Perhaps you'd like to take a crack at it, based on the description and the code in pacify.py?

- Ahi
|
#10 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
I'd rather see pacify incorporated into Calibre and/or Sigil.
|
#11 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Hehe.
Perhaps once it's reasonably stable, I'll offer it to all and sundry for seamless incorporation into their backends.

I'd be curious to know whether you find it works as well as or better than your own approach. I think it should (even the version featured on the first post of the thread I linked to)... but I'd be grateful to know with certainty, if you are up to doing a few checks.

- Ahi
#12 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
I'll see what I can do over the next couple of days.
#13 |
frumious Bandersnatch
Posts: 7,546
Karma: 19001583
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
|
#14 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
Although, looking at Project Gutenberg's list of Wodehouse's stuff... it doesn't seem as frightening as your post first made me think.

- Ahi
|
#15 |
frumious Bandersnatch
Posts: 7,546
Karma: 19001583
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
|