Old 09-03-2009, 12:58 PM   #46
ahi
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
Hmmm... it seems like your suggestions would take me back down the road of (effectively) a pre-calculated database being parsed from the input text. Perhaps that means there is great enough merit to the idea that it is worth the pain of implementation?

Certainly, there are processing tasks that are best handled looping through character by character. But then, I suppose, there are tasks where going word by word or sentence by sentence really would be helpful...

1) Loop through all words to try to identify instances of words that contain an apostrophe (whether at the beginning, end, or penultimate position) but have not yet been identified as such. Once this is done, and oddities like << 'Tis >> are identified, the quotation mark smartening function could be made simpler by having it ignore any single quote/apostrophe that is considered to be part of a word.

2) Loop through all words to try to find words that have been accidentally run together.

3) Loop through all sentences (in the classical sense) to identify any that end abruptly and may indicate an erroneous paragraph break.
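
To make item 1 concrete, here is a minimal Python sketch of the apostrophe pass. It is not pacify's code: the function name, the regular expression, and the small list of known leading-apostrophe words are all assumptions made for illustration. It returns the positions of single quotes that appear to belong to a word, so a quote-smartening pass could skip them. Trailing apostrophes (as in << dogs' >>) are deliberately left undecided because they are ambiguous with closing quotes.

```python
import re

# Hypothetical list of words that legitimately begin with an apostrophe.
LEADING_APOSTROPHE_WORDS = {"'tis", "'twas", "'em", "'til"}

# A "word" is a run of letters with optional apostrophes at the start, inside, or at the end.
WORD_RE = re.compile(r"'?[A-Za-z]+(?:'[A-Za-z]+)*'?")


def apostrophe_positions(text):
    """Return the indexes of single quotes that look like part of a word."""
    protected = set()
    for match in WORD_RE.finditer(text):
        word = match.group()
        start = match.start()
        # Apostrophes strictly inside the word (don't, o'clock), penultimate position included.
        for offset in range(1, len(word) - 1):
            if word[offset] == "'":
                protected.add(start + offset)
        # A leading apostrophe counts only when the word is on the known list.
        if word[0] == "'" and word.rstrip("'").lower() in LEADING_APOSTROPHE_WORDS:
            protected.add(start)
    return protected
```

With this, the opening quote in << 'that I don't know' >> stays available for smartening, while the apostrophes in << 'Tis >> and << don't >> are marked as word-internal.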

---

You are probably right about keeping line-breaks intact, instead of liberally stripping them out.

Regarding footnotes... the only misunderstanding is that I conceive of a footnote as being always tied to a single character, the "*", "**" or "1" or whatever footnote mark being an output concern and the tied-to character being whatever precedes the footnote mark.

e.g.: in "exceptional* service", I see the "*" character as something to remove at input time and reinsert at output time, and the footnote would be actually tied to the "l" (last letter of 'exceptional') and thus appear immediately after it. This has the benefit of removing all sign of the footnote from the text stream, so it doesn't interfere with processing.
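
A small sketch of that idea in Python, under the assumption (mine, for illustration) that footnote marks are runs of asterisks directly after a word character. The function names are hypothetical. The marks are removed from the text stream and remembered as (index, mark) pairs, where the index points at the tied-to character.

```python
import re

# Assumed mark syntax for this sketch: one or more asterisks right after a word character.
FOOTNOTE_MARK_RE = re.compile(r"(?<=\w)\*+")


def strip_footnote_marks(text):
    """Remove footnote marks; return (clean_text, anchors).

    anchors is a list of (index, mark) pairs, where index is the position
    in clean_text of the character the footnote is tied to.
    """
    clean_parts = []
    anchors = []
    last_end = 0
    clean_len = 0
    for match in FOOTNOTE_MARK_RE.finditer(text):
        chunk = text[last_end:match.start()]
        clean_parts.append(chunk)
        clean_len += len(chunk)
        anchors.append((clean_len - 1, match.group()))
        last_end = match.end()
    clean_parts.append(text[last_end:])
    return "".join(clean_parts), anchors


def restore_footnote_marks(clean_text, anchors):
    """Reinsert each mark immediately after its tied-to character."""
    marks = dict(anchors)
    out = []
    for index, char in enumerate(clean_text):
        out.append(char)
        if index in marks:
            out.append(marks[index])
    return "".join(out)
```

The round trip only holds while the clean text is unchanged; once processing edits the text, the anchors would need to move with their characters, which is the part a real implementation has to solve.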

Oh, and the difficult to parse sentence about "link start" and "link end" was just my attempt to say that if there are redundant <a href=""> tags in an HTML document, treating them the same way as I treat the formatting would automatically simplify them.

---

Regarding the internationalization... I think perhaps all I need to do is to build the skeleton in such a way that language-specific processing functions in .py files can override generic processing functions in the main file. And, obviously, ensure that when the language of a given text is known, to only allow the correct language's processing functions to override.

Shouldn't be too difficult, thanks to Eval. Sort of a minimalist plug-in-like architecture would result, I suppose.
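
For illustration, the same override behaviour can be sketched without eval, using a plain registry of functions. This swaps in a different technique from the one mentioned above, and every name here (GENERIC, OVERRIDES, get_processor, the "open_quote" task) is hypothetical.

```python
# Generic processors, keyed by task name.
GENERIC = {}
# Language-specific overrides, keyed by language code, then by task name.
OVERRIDES = {}


def generic(task):
    """Decorator that registers the generic implementation of a task."""
    def register(func):
        GENERIC[task] = func
        return func
    return register


def override(language, task):
    """Decorator that registers a language-specific implementation of a task."""
    def register(func):
        OVERRIDES.setdefault(language, {})[task] = func
        return func
    return register


def get_processor(task, language=None):
    """Return the override for this language if one exists, else the generic one."""
    return OVERRIDES.get(language, {}).get(task, GENERIC[task])


@generic("open_quote")
def open_quote_generic():
    return "\u201c"  # left double quotation mark


@override("de", "open_quote")
def open_quote_german():
    return "\u201e"  # low double quotation mark, which opens quotes in German
```

Because the lookup is keyed by the text's language, only that language's functions can ever override the generic ones.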

---

Ultimately, I'm now thinking the way to do things is to have pTome basically contain a linked list of pBlocks, the classification of which could range from 'line-break' to 'chapter title' to 'paragraph', et cetera.

Then the pBlocks in turn would have text subdivided into one or more pParts, some of which may be classified as "sentence" or "poem line" et cetera, and each of which in turn would contain one or more pItems (which would be, as per your own thinking, my current sort of pStrings or something like it--everything else being mostly for containment and classification/categorization) that would be classified as "space" or "punctuation" or "word" et cetera.

And when a change is made, only the specific pBlock, pPart, or pItem would have to be changed and all levels underneath regenerated via reparsing.

Makes sense? This way, I theoretically ought to be able to loop through all words within all sentences, while still querying "what character is before this word" or "does the sentence previous to this one end with punctuation/the way a proper sentence should".
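
A rough Python sketch of that containment hierarchy, to show the kind of query it allows. Block, Part, and Item stand in for pBlock, pPart, and pItem; the class layout, the regular expressions, and the naive sentence splitter are assumptions for illustration, and plain Python lists are used in place of a linked list.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Item:  # stands in for pItem
    kind: str  # "word", "space", or "punctuation"
    text: str


@dataclass
class Part:  # stands in for pPart
    kind: str
    items: list = field(default_factory=list)


@dataclass
class Block:  # stands in for pBlock
    kind: str
    parts: list = field(default_factory=list)


ITEM_RE = re.compile(r"(?P<word>\w+(?:'\w+)*)|(?P<space>\s+)|(?P<punctuation>[^\w\s])")


def parse_part(text, kind="sentence"):
    """Split one sentence into classified items."""
    return Part(kind, [Item(m.lastgroup, m.group()) for m in ITEM_RE.finditer(text)])


def parse_block(text, kind="paragraph"):
    """Split a paragraph into sentences (naively) and parse each one."""
    sentences = re.findall(r"[^.!?]+[.!?]*\s*", text)
    return Block(kind, [parse_part(sentence) for sentence in sentences])


def words_with_context(block):
    """Yield (word, previous_item_or_None) for every word in the block."""
    previous = None
    for part in block.parts:
        for item in part.items:
            if item.kind == "word":
                yield item.text, previous
            previous = item


def ends_properly(part):
    """True when the last non-space item of the part is ., ! or ?"""
    texts = [item.text for item in part.items if item.kind != "space"]
    return bool(texts) and texts[-1] in {".", "!", "?"}
```

Reparsing after a change then means calling parse_part or parse_block again on just the affected text and swapping the result into its parent.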

- Ahi
Old 09-03-2009, 01:55 PM   #47
ekaser
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
Quote:
Originally Posted by ahi View Post
Hmmm... it seems like your suggestions would take me back down the road of (effectively) a pre-calculated database being parsed from the input text. Perhaps that means there is great enough merit to the idea that it is worth the pain of implementation?
That's up to you. I'm just trying to provide a second point-of-view, one that may or may not mesh and/or differ from your own. Ultimately, you're writing the code, so you have to (get to) make the decisions about what will work best for what YOU are trying to accomplish. It just seems to me like there are two primary things: structure and content, with content being further divided into images, text and text formatting. Structure (other than sentence and paragraph massaging) would remain pretty much unchanged, I'd think, and most of the changes your program would make would be to the content. Therefore, separating the two as much as possible, while maintaining the necessary links between structure and content, would be best.

Quote:
Regarding footnotes... the only misunderstanding is that I conceive of a footnote as being always tied to a single character, the "*", "**" or "1" or whatever footnote mark being an output concern and the tied-to character being whatever precedes the footnote mark.
I guess I was seeing footnotes and hyperlinks as being pretty much the same thing. You're going to have to handle hyperlinks, so I figured why not treat the footnotes the same way (in terms of the link itself)? You have to write the code once (for hyperlinks), so if you fold the footnote link into the same mold, you're all done. But YOU'RE the one that has to hold all of this in your head until it's excreted as code, so whatever works best in YOUR head is key...

Quote:
Makes sense? This way, I theoretically ought to be able to loop through all words within all sentences, while still querying "what character is before this word" or "does the sentence previous to this one end with punctuation/the way a proper sentence should".
It's sounding pretty good to me!
Old 09-03-2009, 02:04 PM   #48
ahi
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
Quote:
Originally Posted by ekaser View Post
That's up to you. I'm just trying to provide a second point-of-view, one that may or may not mesh and/or differ from your own. Ultimately, you're writing the code, so you have to (get to) make the decisions about what will work best for what YOU are trying to accomplish. It just seems to me like there are two primary things: structure and content, with content being further divided into images, text and text formatting. Structure (other than sentence and paragraph massaging) would remain pretty much unchanged, I'd think, and most of the changes your program would make would be to the content. Therefore, separating the two as much as possible, while maintaining the necessary links between structure and content, would be best.
Well... yes. I think the reason I feel compelled to try to yield to your implied return to my earlier conception is because the more high-level and real-world I can make the structure (sentences referring to real/detected sentences, et cetera) the simpler I can make my text/content processing code... and, presumably, the less error prone it will be. (Though even in the best scenario, most of it would end up being pretty complex still... which is reason enough to simplify where/as possible.)

Quote:
Originally Posted by ekaser View Post
I guess I was seeing footnotes and hyperlinks as being pretty much the same thing. You're going to have to handle hyperlinks, so I figured why not treat the footnotes the same way (in terms of the link itself)? You have to write the code once (for hyperlinks), so if you fold the footnote link into the same mold, you're all done. But YOU'RE the one that has to hold all of this in your head until it's excreted as code, so whatever works best in YOUR head is key...
That is a good point. I will have to give it more thought. Getting rid of the footnote-marks is definitely a useful thing (because it reduces the number of places where processing code needs to account for them or parse around them)... so unless there are any big benefits from simplifying links/footnotes to be handled via the same underlying mechanism, I will likely keep to the approach I described.

It's definitely going to be a monster of sorts... but hopefully with my ideas starting to become increasingly clear and granular, it will end up a tamable monster.

Quote:
Originally Posted by ekaser View Post
It's sounding pretty good to me!
Thanks for the sanity checks!

I'll give you a shout when there is code!

- Ahi
Old 09-12-2009, 03:44 PM   #49
ahi
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
I am beginning to make some progress...

Operator overloading is turning out to be tolerably good in Python.

Any hints as to how to make a Python class immutable?

My googling thus far suggests that there is no sane, simple way that is universal... and in most instances deriving your class from an already immutable one is suggested.

Since I do not know what all I'd need to override, it seems saner to do things from scratch... but immutability is necessary, I think.

- Ahi
Old 09-12-2009, 03:58 PM   #50
ekaser
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
Quote:
Originally Posted by ahi View Post
I am beginning to make some progress...

Any hints as to how to make a Python class immutable?
Sorry, I'm no help there. I've VERY little (almost the same as 'no') experience with Python. Hopefully someone else can help with this one. (Perhaps you should send a message to Kovid... he seems to do a LOT of work in Python, and may be able to give you a quick, easy answer.)
Old 09-12-2009, 04:33 PM   #51
kovidgoyal
creator of calibre
Posts: 44,017
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
http://en.wikipedia.org/wiki/Immutable_object#Python

Also, Python has a lot of immutable built-in objects, like tuple and frozenset, that can be inherited from.
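
As an illustration of the tuple route, here is a minimal sketch. The class name PString and its two fields are hypothetical, chosen only to echo the thread; collections.namedtuple would give much the same result with less code.

```python
class PString(tuple):
    """Hypothetical immutable text fragment stored as (text, kind)."""

    __slots__ = ()  # no per-instance __dict__, so no new attributes can be attached

    def __new__(cls, text, kind="word"):
        # Tuples are filled in __new__, not __init__, because they cannot change afterwards.
        return super().__new__(cls, (text, kind))

    @property
    def text(self):
        return self[0]

    @property
    def kind(self):
        return self[1]

    def __add__(self, other):
        # Operator overloading returns a new object instead of mutating either operand.
        return PString(self.text + other.text, self.kind)
```

Assigning to `text` or `kind`, or attaching a new attribute, raises AttributeError, and instances stay hashable, so they can be used as dictionary keys or set members.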
Old 09-13-2009, 03:49 PM   #52
ahi
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
Thanks, Kovid. I believe I managed to solve the problem.

Do you have any resources you'd recommend relating to UTF-8 handling/character encoding conversion with Python? I keep bumping into UnicodeDecodeError style messages... and I am yet to really grasp the elegant way to avoid them.
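
One common pattern, sketched here in Python 3 syntax (the thread predates it, so treat this as an illustration rather than what pacify does): decode bytes to text once at the input boundary, work only with text internally, and encode once at output. The function name and the fallback order are assumptions.

```python
def decode_bytes(raw, encodings=("utf-8", "cp1252")):
    """Decode raw bytes, trying each encoding in turn.

    Returns (text, encoding_used). Falls back to latin-1, which maps every
    byte to a character and therefore never raises.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1"), "latin-1"
```

Writing the result back out is then a single `text.encode("utf-8")` at the point where the file is saved. A UnicodeDecodeError usually means bytes were decoded implicitly somewhere with the wrong codec, so keeping the decode step explicit and in one place tends to make such errors easier to find.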

- Ahi
Old 09-14-2009, 12:01 AM   #53
ahi
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
Not an update/new version yet... but I'm hoping some people might be kind enough to run it on a few .txt and .rtf files... and report back (along with the relevant portion of the log file [the filesize and whitespace analysis numerical values below "Analyzing text"]) whether or not pacify correctly determined whether there are intra-paragraph line-breaks or not in the given file.

Run with:

pacify.py -i input.txt -o txt

or

pacify.py -i input.rtf -o latex

- Ahi

P.S.: This is a rewrite, stopped right in the middle of work... barely usable as is, and it does not have the full functionality present in the previous version. It takes only .txt and .rtf for input and produces only .txt or .tex for output. There is no support for RTF footnotes.
Attached Files
File Type: zip pacify.zip (5.4 KB, 185 views)
Old 09-14-2009, 10:18 AM   #54
ekaser
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
First, a Project Gutenberg TXT file with line breaks:
Quote:
Analyzing text...

[1]: Starting on Mon, 14 Sep 2009 07:07:51

[1]: Simplifying linebreaks...
[1]: Analyzing whitespace patterns...
[1]: 100.0% processed (0.6 MB of 0.6 MB)

[1]: Finished on Mon, 14 Sep 2009 07:08:03

[1]: Filesize:
[1]: 658765

[1]: Whitespace analysis:
[1]: [(104865, 0.0), (8362, 8.0), (2689, 16.0), (41, 24.0), (31, 32.0), (28, 1.0), (20, 9.0), (13, 10.0), (3, 2.0), (3, 17.0), (2, 40.0), (1, 88.0)]

[1]: 1591.84231099
[1]: 126.934491055
[1]: 40.8188048849
[1]: 0.622376720075
[1]: 0.470577520056

[1]: ... appears to be a file with line-breaks.
Then, another TXT file without line breaks, entire paragraph on each line:
Quote:
[1]: Analyzing text...

[1]: Starting on Mon, 14 Sep 2009 07:11:57

[1]: Simplifying linebreaks...
[1]: Analyzing whitespace patterns...
[1]: 100.0% processed (0.3 MB of 0.3 MB)

[1]: Finished on Mon, 14 Sep 2009 07:11:57

[1]: Filesize:
[1]: 317441

[1]: Whitespace analysis:
[1]: [(55965, 0.0), (2342, 9.0), (186, 8.0), (19, 17.0), (5, 11.0), (3, 16.0), (1, 12.0), (1, 55.0)]

[1]: 1763.00477884
[1]: 73.7774893602
[1]: 5.85935654185
[1]: 0.598536420941
[1]: 0.157509584458

[1]: ... appears to be a file with paragraph breaks.
Then one RTF file:
Quote:
[1]: Analyzing text...

[1]: Starting on Mon, 14 Sep 2009 07:13:23

[1]: Simplifying linebreaks...
[1]: Analyzing whitespace patterns...
[1]: 100.0% processed (1.1 MB of 1.1 MB)

[1]: Finished on Mon, 14 Sep 2009 07:13:25

[1]: Filesize:
[1]: 1174398

[1]: Whitespace analysis:
[1]: [(201159, 0.0), (6815, 16.0), (58, 32.0), (15, 1.0), (5, 8.0), (3, 80.0), (2, 48.0), (2, 40.0), (1, 64.0), (1, 24.0)]

[1]: 1712.86906143
[1]: 58.0297309771
[1]: 0.493870050869
[1]: 0.127725013156
[1]: 0.0425750043852

[1]: ... appears to be a file with paragraph breaks.
And then a second RTF file:
Quote:
[1]: Analyzing text...

[1]: Starting on Mon, 14 Sep 2009 07:14:20

[1]: Simplifying linebreaks...
[1]: Analyzing whitespace patterns...
[1]: 100.0% processed (1.2 MB of 1.2 MB)

[1]: Finished on Mon, 14 Sep 2009 07:14:22

[1]: Filesize:
[1]: 1209776

[1]: Whitespace analysis:
[1]: [(187376, 0.0), (16677, 16.0), (128, 65.0), (85, 32.0), (39, 49.0), (31, 81.0), (12, 48.0), (3, 114.0), (3, 97.0)]

[1]: 1548.84871249
[1]: 137.851965984
[1]: 1.05804710955
[1]: 0.702609408684
[1]: 0.32237372869

[1]: ... appears to be a file with paragraph breaks.
Old 09-14-2009, 10:51 AM   #55
ahi
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
Quote:
Originally Posted by ekaser View Post
....
Thanks, ekaser!

Am I correct in assuming you encountered no false positives at all?

I have, by the way, successfully processed a 700+ MB .rtf file with this script that yielded 33 MB of LaTeX formatted text as output (the rest having been pictures).

- Ahi
Old 09-14-2009, 11:33 AM   #56
Kirtai
Addict
Posts: 304
Karma: 2454436
Join Date: Sep 2008
Device: PRS-505, PRS-650, iPad, Samsung Galaxy SII (JB), Google Nexus 7 (2013)
Quote:
Originally Posted by ekaser View Post
In general, they seem to delete leading space when turning italics ON and sometimes inserting an extraneous space when turning them OFF (usually when followed by punctuation).
Unfortunately, you sometimes want no spaces between formatted and unformatted characters. The first character of the first sentence in a chapter can get all sorts of odd formatting.

Quote:
Originally Posted by ahi View Post
Yes, yes... "Duh!", I know. But is there a way to increase Python's memory limit? My system can obviously take it... as less than a tenth of the swap is used before Python dies... so I'd really like to force Python to process these huge behemoths... even if it slows my system to a crawl for minutes or even hours.
Check your shell's ulimit command; it will be able to change the process limits.
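
For a quick check from inside Python itself, the standard-library resource module (Unix only) exposes the same per-process limits. This is only a sketch; the helper names are made up, and RLIMIT_AS is assumed to be the relevant limit (the address-space limit, which corresponds to what `ulimit -v` controls on Linux).

```python
import resource


def address_space_limit():
    """Return the (soft, hard) address-space limit for this process, in bytes."""
    return resource.getrlimit(resource.RLIMIT_AS)


def describe(limit):
    """Render a limit value as readable text."""
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit} bytes"
```

If both values come back as unlimited, the out-of-memory failure is probably coming from somewhere other than the shell's limits.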
Old 09-14-2009, 11:49 AM   #57
ekaser
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
Quote:
Originally Posted by ahi View Post
Am I correct in assuming you encountered no false positives at all?
None that I saw, but then I only tried the four files, so that's a fairly limited set of data points.
Old 09-14-2009, 12:13 PM   #58
ahi
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
Quote:
Originally Posted by ekaser View Post
None that I saw, but then I only tried the four files, so that's a fairly limited set of data points.
True.

This stuff:

Quote:
[1]: 1548.84871249
[1]: 137.851965984
[1]: 1.05804710955
[1]: 0.702609408684
[1]: 0.32237372869
is based on a percentage calculation (abused beyond recognition) done on the text length (file size, sans formatting) and the whitespace pattern frequency... the first one is the most frequent whitespace sequence, the second one the second most frequent, the third one the third most frequent...

The first one, of course, is "single space" (word breaks). If there are intra-paragraph linebreaks, the second one is the linebreak whitespace sequence count and the third one is the paragraph-break whitespace sequence count; otherwise, the second one is the paragraph-break count.

Usually, if the second value is above 50, and the third value above 3, it indicates the presence of intra-paragraph breaks... but hard values are not the right way to go.

I need to figure out the calculation (perhaps ratios of the whitespace sequence counts?) that yields reliable results in "all" cases. As I suspect there might easily be files out there where there are intra-paragraph linebreaks but perhaps the third value would only come to 2.9. What makes this hard though is that the second and third values do vary pretty wildly... so I'm not too sure comparisons/ratios between those two are the way.
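
For reference, the numbers in the logs above are consistent with each whitespace pattern count divided by the file size and multiplied by 10,000 (for the first TXT log, 104865 / 658765 × 10000 ≈ 1591.84). The sketch below reproduces that scaling and applies the 50 / 3 rule of thumb. The function names are hypothetical, and this is not pacify's actual decision logic: the second TXT log (73.8 and 5.86) would pass these thresholds even though pacify reported paragraph breaks for it, so the real test evidently involves more than this.

```python
def scaled_frequencies(pattern_counts, filesize):
    """Occurrences per 10,000 characters, most frequent pattern first."""
    counts = sorted(pattern_counts, reverse=True)
    return [count / filesize * 10000 for count in counts]


def looks_line_broken(pattern_counts, filesize, second_min=50.0, third_min=3.0):
    """Rule of thumb from the post: second value above 50 and third above 3."""
    values = scaled_frequencies(pattern_counts, filesize)
    if len(values) < 3:
        return False
    return values[1] > second_min and values[2] > third_min
```

Fed the counts from the first TXT log and the first RTF log, this gives True and False respectively, matching what pacify printed for those two files.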

- Ahi
Old 09-14-2009, 02:35 PM   #59
ekaser
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
Quote:
Originally Posted by ahi View Post
This stuff is based on a percentage calculation (abused beyond recognition) done on the text length (file size, sans formatting) and the whitespace pattern frequency... the first one is the most frequent whitespace sequence, the second one the second most frequent, the third one the third most frequent...
You're really looking at a statistical analysis (effectively) to figure out what the paragraph formatting of a text file is. Another thing you could include in this process is the maximum line length. You could just figure an "average" line length, and if it's over a certain amount, you probably have a paragraph-break file. Another, perhaps better, way is to keep a running count of the number of lines of length N, from 0 to ... say... oh, 255. Any line of length >255 gets lumped in with lines of length 255. For line-break files, the bulk of the file will be lines less than some number (80 to 128 max, I'd guess) with very few over that. A paragraph-break file will have far more lines with lengths greater than that limit, with probably quite a few in the 255 bucket.

Also, a 'newline' followed by a QUOTE is almost certainly the start of a paragraph. Count how many times a 'newline'-quote pair is preceded by another 'newline' (ie, paragraphs separated by blank lines). The ratio of the number of 'newline'-'newline'-quote instances to the number of 'newline'-quote instances, would give a pretty good indication of whether it's a line-break or paragraph-break file.

If you use at least 2 or 3 different "statistical rulers" and they all agree (or 2 out of 3), then that's about the best indicator you're going to get.
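
A sketch of these "rulers" in Python, assuming newlines have already been normalized to "\n". The function names, the 128-character limit, and the set of quote characters are assumptions; the thresholds that turn each measurement into a yes/no vote are left to the caller.

```python
def line_length_histogram(text, max_len=255):
    """Count lines by length; lines longer than max_len share the last bucket."""
    buckets = [0] * (max_len + 1)
    for line in text.split("\n"):
        buckets[min(len(line), max_len)] += 1
    return buckets


def long_line_share(text, limit=128):
    """Fraction of non-empty lines that are longer than limit characters."""
    buckets = line_length_histogram(text)
    non_empty = sum(buckets[1:])
    return sum(buckets[limit + 1:]) / non_empty if non_empty else 0.0


def blank_before_quote_ratio(text, quotes='"\u201c'):
    """Of the lines that start with a quote, the fraction preceded by a blank line."""
    lines = text.split("\n")
    quoted = preceded = 0
    for index, line in enumerate(lines):
        if line and line[0] in quotes:
            quoted += 1
            if index > 0 and lines[index - 1] == "":
                preceded += 1
    return preceded / quoted if quoted else 0.0


def majority(votes):
    """True when more than half of the boolean votes are True."""
    return sum(votes) * 2 > len(votes)
```

Each measurement would be compared against its own cutoff to produce a boolean, and `majority` then implements the "2 out of 3" agreement.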
Old 09-14-2009, 02:38 PM   #60
ahi
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
Quote:
Originally Posted by ekaser View Post
You're really looking at a statistical analysis (effectively) to figure out what the paragraph formatting of a text file is. Another thing you could include in this process is the maximum line length. You could just figure an "average" line length, and if it's over a certain amount, you probably have a paragraph-break file. Another, perhaps better, way is to keep a running count of the number of lines of length N, from 0 to ... say... oh, 255. Any line of length >255 gets lumped in with lines of length 255. For line-break files, the bulk of the file will be lines less than some number (80 to 128 max, I'd guess) with very few over that. A paragraph-break file will have far more lines with lengths greater than that limit, with probably quite a few in the 255 bucket.

Also, a 'newline' followed by a QUOTE is almost certainly the start of a paragraph. Count how many times a 'newline'-quote pair is preceded by another 'newline' (ie, paragraphs separated by blank lines). The ratio of the number of 'newline'-'newline'-quote instances to the number of 'newline'-quote instances, would give a pretty good indication of whether it's a line-break or paragraph-break file.
Good points. Thanks for that. Based on my own tests, it is already fairly accurate... I think I might need to stop relying solely on the whitespace sequences for determining whether the file has intra-paragraph linebreaks... and use the whitespace sequences for processing the paragraph fixing once I've conclusively determined that it does.

- Ahi