09-03-2009, 12:58 PM | #46 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Hmmm... it seems like your suggestions would take me back down the road of (effectively) a pre-calculated database being parsed from the input text. Perhaps that means there is great enough merit to the idea that it is worth the pain of implementation?
Certainly, there are processing tasks that are best handled by looping through character by character. But then, I suppose, there are tasks where going word by word or sentence by sentence really would be helpful:

1) Loop through all words to identify instances of words that contain an apostrophe (whether at the beginning, end, or penultimate position) but have not yet been identified as such. Once this is done, and oddities like << 'Tis >> are identified, the quotation-mark smartening function could be made simpler by having it ignore any single quote/apostrophe that is considered part of a word.

2) Loop through all words to find words that have been accidentally run together.

3) Loop through all sentences (in the classical sense) to identify any that end abruptly and may indicate an erroneous paragraph break.

---

You are probably right about keeping line breaks intact, instead of liberally stripping them out.

Regarding footnotes... the only misunderstanding is that I conceive of a footnote as always being tied to a single character; the "*", "**", "1", or whatever footnote mark is an output concern, and the tied-to character is whatever precedes the footnote mark. E.g.: in "exceptional* service", I see the "*" character as something to remove at input time and reinsert at output time, and the footnote would actually be tied to the "l" (last letter of "exceptional") and thus appear immediately after it. This has the benefit of removing all sign of the footnote from the text stream, so it doesn't interfere with processing.

Oh, and the difficult-to-parse sentence about "link start" and "link end" was just my attempt to say that if there are redundant <a href=""> tags in an HTML document, treating them the same way as I treat the formatting would automatically simplify them.

---

Regarding the internationalization...
I think perhaps all I need to do is build the skeleton in such a way that language-specific processing functions in .py files can override the generic processing functions in the main file, and, obviously, ensure that when the language of a given text is known, only the correct language's processing functions are allowed to override. Shouldn't be too difficult, thanks to eval. Sort of a minimalist plug-in-like architecture would result, I suppose.

---

Ultimately, I'm now thinking the way to do things is to have pTome basically contain a linked list of pBlocks, the classification of which could range from "line-break" to "chapter title" to "paragraph", et cetera. The pBlocks in turn would have their text subdivided into one or more pParts, some of which may be classified as "sentence" or "poem line" et cetera, and each of which in turn would contain one or more pItems (which would be, as per your own thinking, my current sort of pStrings or something like it--everything else being mostly for containment and classification/categorization), classified as "space" or "punctuation" or "word" et cetera. And when a change is made, only the specific pBlock, pPart, or pItem would have to be changed, with all levels underneath regenerated via reparsing. Makes sense?

This way, I theoretically ought to be able to loop through all words within all sentences, while still being able to query "what character is before this word?" or "does the sentence previous to this one end with punctuation, the way a proper sentence should?".

- Ahi |
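Not pacify's actual code, but a minimal sketch of that pTome / pBlock / pPart / pItem containment idea (class and attribute names are placeholders for illustration), just to check the shape of it:

```python
class PItem:
    """Smallest unit: a 'word', 'space', or 'punctuation' token."""
    def __init__(self, text, kind):
        self.text = text
        self.kind = kind

class PPart:
    """A 'sentence', 'poem line', etc., containing PItems."""
    def __init__(self, kind, items):
        self.kind = kind
        self.items = items

class PBlock:
    """A 'paragraph', 'chapter title', 'line-break', etc., containing PParts."""
    def __init__(self, kind, parts):
        self.kind = kind
        self.parts = parts

class PTome:
    """The whole document: an ordered list of PBlocks."""
    def __init__(self, blocks):
        self.blocks = blocks

    def words(self):
        # Loop over all words in all sentences, while keeping enough
        # context to ask "what block/sentence does this word sit in?".
        for block in self.blocks:
            for part in block.parts:
                for item in part.items:
                    if item.kind == "word":
                        yield block, part, item
```

With that containment in place, a query like "does the previous sentence end the way a proper sentence should?" becomes a matter of inspecting the last pItem of the preceding pPart.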
09-03-2009, 01:55 PM | #47 | |||
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Quote:
Quote:
Quote:
|
|||
|
09-03-2009, 02:04 PM | #48 | ||
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
Quote:
It's definitely going to be a monster of sorts... but hopefully with my ideas starting to become increasingly clear and granular, it will end up a tamable monster. Thanks for the sanity checks! I'll give you a shout when there is code! - Ahi |
||
09-12-2009, 03:44 PM | #49 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
I am beginning to make some progress...
Operator overloading is turning out to be tolerably good in Python. Any hints as to how to make a Python class immutable? My googling thus far suggests that there is no sane, simple way that is universal... and in most instances deriving your class from an already immutable one is suggested. Since I do not know everything I'd need to override, it seems saner to do things from scratch... but immutability is necessary, I think. - Ahi |
09-12-2009, 03:58 PM | #50 |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Sorry, I'm no help there. I've VERY little (almost the same as 'no') experience with Python. Hopefully someone else can help with this one. (Perhaps you should send a message to Kovid... he seems to do a LOT of work in Python, and may be able to give you a quick, easy answer.)
|
|
09-12-2009, 04:33 PM | #51 |
creator of calibre
Posts: 44,017
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
http://en.wikipedia.org/wiki/Immutable_object#Python
Also, Python has a lot of immutable builtin objects, like tuple and frozenset, that can be inherited from. |
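Picking up both suggestions, a minimal sketch of the two usual approaches (the class names here are placeholders for illustration, not pacify's real classes):

```python
# Approach 1: subclass an immutable builtin (tuple), as suggested above.
# The payload is fixed at construction time via __new__, not __init__.
class PString(tuple):
    """A hypothetical immutable token: (text, kind)."""
    def __new__(cls, text, kind):
        return super(PString, cls).__new__(cls, (text, kind))

    @property
    def text(self):
        return self[0]

    @property
    def kind(self):
        return self[1]


# Approach 2: a from-scratch class that blocks attribute rebinding by
# overriding __setattr__ (the technique behind the Wikipedia example).
class FrozenToken:
    def __init__(self, text, kind):
        # Bypass our own __setattr__ exactly once, during construction.
        object.__setattr__(self, "text", text)
        object.__setattr__(self, "kind", kind)

    def __setattr__(self, name, value):
        raise AttributeError("FrozenToken instances are immutable")
```

The tuple route gets hashing and equality for free; the `__setattr__` route keeps named attributes but requires remembering that `object.__setattr__` can still mutate instances deliberately.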
09-13-2009, 03:49 PM | #52 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Thanks, Kovid. I believe I managed to solve the problem.
Do you have any resources you'd recommend relating to UTF-8 handling/character encoding conversion with Python? I keep bumping into UnicodeDecodeError-style messages... and I have yet to really grasp the elegant way to avoid them. - Ahi |
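Not from the thread, but the pattern that avoids most UnicodeDecodeErrors is to decode bytes to unicode at the input boundary, process text as unicode internally, and encode only on output. A sketch using the stdlib codecs module (function names are hypothetical):

```python
import codecs

def read_text(path, encoding="utf-8"):
    # Decode at the boundary. errors="replace" substitutes U+FFFD for
    # undecodable bytes instead of raising UnicodeDecodeError mid-read.
    with codecs.open(path, "r", encoding=encoding, errors="replace") as f:
        return f.read()

def write_text(path, text, encoding="utf-8"):
    # Encode only when writing back out; internally everything stays unicode.
    with codecs.open(path, "w", encoding=encoding) as f:
        f.write(text)
```

The errors most often come from mixing decoded and undecoded strings in one expression, so keeping the decode/encode steps at the file boundary tends to make them disappear.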
09-14-2009, 12:01 AM | #53 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Not an update/new version yet... but I'm hoping some people might be kind enough to run it on a few .txt and .rtf files and report back whether or not pacify correctly determined the presence or absence of intra-paragraph line breaks in the given file (along with the relevant portion of the log file [the file size and the whitespace-analysis numerical values below "Analyzing text"]).
Run with:

pacify.py -i input.txt -o txt

or

pacify.py -i input.rtf -o latex

- Ahi

P.S.: This is a rewrite, stopped right in the middle of work... barely usable as is, and it does not have the full functionality present in the previous version. It takes only .txt and .rtf for input, and produces only .txt or .tex for output. And there is no support for RTF footnotes. |
09-14-2009, 10:18 AM | #54 | ||||
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
First, a Project Gutenberg TXT file with line breaks:
Quote:
Quote:
Quote:
Quote:
|
||||
09-14-2009, 10:51 AM | #55 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
|
09-14-2009, 11:33 AM | #56 | ||
Addict
Posts: 304
Karma: 2454436
Join Date: Sep 2008
Device: PRS-505, PRS-650, iPad, Samsung Galaxy SII (JB), Google Nexus 7 (2013)
|
Quote:
Quote:
|
||
09-14-2009, 11:49 AM | #57 |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
|
09-14-2009, 12:13 PM | #58 | ||
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
This stuff: Quote:
The first one, of course, is "single space" (word breaks). The second one is either the paragraph-break or the line-break whitespace sequence count (the latter if there are intra-paragraph line breaks), and the third one (if there are intra-paragraph line breaks) is the paragraph-break whitespace sequence count.

Usually, if the second value is above 50 and the third value above 3, it indicates the presence of intra-paragraph line breaks... but hard values are not the right way to go. I need to figure out the calculation (perhaps ratios of the whitespace sequence counts?) that yields reliable results in "all" cases, as I suspect there might easily be files out there with intra-paragraph line breaks where the third value would only come to 2.9.

What makes this hard, though, is that the second and third values vary pretty wildly... so I'm not too sure comparisons/ratios between those two are the way. - Ahi |
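Not pacify's actual code, but a sketch of how such whitespace-sequence statistics might be gathered, with a ratio-based classifier bolted on (the 3:1 cutoff is an illustrative guess, not a tuned threshold):

```python
import re
from collections import Counter

def whitespace_stats(text):
    # Count each distinct run of whitespace (" ", "\n", "\n\n", ...),
    # normalizing Windows line endings first so they don't split counts.
    text = text.replace("\r\n", "\n")
    counts = Counter(re.findall(r"\s+", text))
    single_space = counts.get(" ", 0)     # word breaks
    single_newline = counts.get("\n", 0)  # line breaks
    blank_line = counts.get("\n\n", 0)    # paragraph breaks
    return single_space, single_newline, blank_line

def looks_line_broken(text):
    # Heuristic: many single newlines relative to blank lines suggests
    # hard-wrapped (intra-paragraph) line breaks.
    _, nl, blank = whitespace_stats(text)
    return blank > 0 and nl / float(blank) >= 3.0
```

Working with ratios rather than absolute counts at least makes the verdict independent of file size, which may be why the raw second and third values vary so wildly between files.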
||
09-14-2009, 02:35 PM | #59 | |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Quote:
Also, a 'newline' followed by a QUOTE is almost certainly the start of a paragraph. Count how many times a 'newline'-quote pair is preceded by another 'newline' (i.e., paragraphs separated by blank lines). The ratio of the number of 'newline'-'newline'-quote instances to the number of 'newline'-quote instances would give a pretty good indication of whether it's a line-break or paragraph-break file. If you use at least 2 or 3 different "statistical rulers" and they all agree (or 2 out of 3), then that's about the best indicator you're going to get. |
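A sketch of that quote-based ruler (the set of opening quote characters recognized here is an illustrative assumption):

```python
import re

# Straight double, straight single, and curly left double quote.
QUOTES = "[\"'\u201c]"

def quote_ruler(text):
    # Count newline+quote pairs, and how many of those are preceded by
    # another newline (i.e., the quote opens after a blank line).
    text = text.replace("\r\n", "\n")
    nl_quote = len(re.findall("\n" + QUOTES, text))
    nl_nl_quote = len(re.findall("\n\n" + QUOTES, text))
    if nl_quote == 0:
        return None  # ruler abstains: no quote ever starts a line
    # Near 1.0: quoted lines sit after blank lines -> paragraph-break file.
    # Near 0.0: quoted lines follow bare newlines -> line-break file.
    return nl_nl_quote / float(nl_quote)
```

Since each ruler can abstain (return None) when its pattern never occurs, a 2-out-of-3 vote over the non-abstaining rulers is straightforward to layer on top.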
|
09-14-2009, 02:38 PM | #60 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
- Ahi |
|
|