09-01-2009, 01:58 PM | #31 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
Ultimately, I think what would be achievable (and what is my short term goal): 1. Input plaintext from TXT or extract text with limited formatting information from RTF or HTML. 2. Fix-up character mish-mash... ("..." replace with "…", "--" with "–" or "—", et cetera) 3. Try to detect and, with user approval, correct erroneous paragraph breaks. 4. Smarten quotation marks. 5. Try to detect poems, letters, quotations and mark them somehow. (A third layer, with a single setting per line, as opposed to per unicode character?) 6. Try to detect part, chapter, section headers... possibly interactively with help from user to make more accurate. 7. Output with formatting intact into the chosen format. In the case of simple novels, with no multi-level headers (i.e.: chapters, sections, et cetera) I think such a process should be able to create a nearly perfect file even without user interaction. In the case of more complex novels, a fair bit of work would remain, but the existing "tagging" by the script ought to result in mostly false positives that would be fairly quick to correct. - Ahi |
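Steps 2 and 4 of the list above (character fix-ups and quote smartening) could be sketched in Python roughly like this. The function names and the exact replacement rules are illustrative guesses, not settled pacify behavior:

```python
import re

def fix_punctuation(text):
    """Minimal character fix-up pass: ellipses and dashes."""
    text = text.replace("...", "\u2026")                # ... -> …
    text = re.sub(r"(?<=\d)--(?=\d)", "\u2013", text)   # digit ranges -> en dash
    text = text.replace("--", "\u2014")                 # remaining -- -> em dash
    return text

def smarten_quotes(text):
    """Naive quote smartening: straight quotes to curly ones."""
    # An opening quote follows start-of-text, whitespace, or an open bracket.
    text = re.sub(r'(^|[\s(\[])"', "\\1\u201c", text)
    text = text.replace('"', "\u201d")   # the rest are closers
    text = re.sub(r"(^|[\s(\[])'", "\\1\u2018", text)
    text = text.replace("'", "\u2019")   # closers and apostrophes
    return text
```

Real input will hit the usual ambiguous cases (nested quotes, quotes opening after a dash), which is where the interactive user-approval step would come in.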
|
09-01-2009, 02:03 PM | #32 | ||
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Quote:
Quote:
Another thought regarding justification formatting: it's not character by character formatting, but rather paragraph by paragraph, as it were. Justification (left, center, right, full) could be handled by two (or four, your choice) bits in the format 'byte' for the end-of-paragraph/block character (CR, LF, NUL, whatever you choose to use). That character doesn't need to worry about the normal character formatting issues, so it can be used to hold 'block' formatting information. For easiest sequential processing, you might want said block formatting info to apply to the FOLLOWING paragraph rather than the preceding one, in which case you'd want to start the document with a special end-of-block character that never gets emitted to the output, which is there simply to hold the block formatting info for the first block. Just stream of consciousness stuff, take it for what it's worth... |
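A minimal sketch of that "two bits in the format byte of the break character" idea; the constant names and bit layout are made up purely for illustration:

```python
# Hypothetical 2-bit justification field in the low bits of the
# end-of-paragraph character's format byte.
JUSTIFY_LEFT, JUSTIFY_CENTER, JUSTIFY_RIGHT, JUSTIFY_FULL = 0, 1, 2, 3
JUSTIFY_MASK = 0b0000_0011

def set_justification(fmt_byte, justify):
    """Store the block's justification without disturbing the other bits."""
    return (fmt_byte & ~JUSTIFY_MASK) | justify

def get_justification(fmt_byte):
    return fmt_byte & JUSTIFY_MASK
```

Whether the field applies to the preceding or the following block is a separate decision, as noted above.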
||
09-01-2009, 02:13 PM | #33 | |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Quote:
Everett |
|
09-01-2009, 02:26 PM | #34 | |||
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
My original long-ago grand plan for pacify was to parse the input text in such a way that I can correlate any given character of the text to the word that character is part of, the sentence it is a part of, and the paragraph it is a part of. Doing so, I still think, would make for some very nice high-level processing possibilities. Quote:
Quote:
The reason I've abandoned this idea is because:

1) The interconnections between words, paragraphs/lines, and characters seem to me to be a bit more complex than I can readily visualize... and hence I'm not too sure how to go about it. Not to mention that some characters are not really words (commas, periods, apostrophes, dashes) but need to appear at the word level, and some characters are not part of a paragraph but should appear at the paragraph level (a single newline character between paragraphs, or three newlines at the end of a chapter, before the next chapter's title)... not sure how to deal with these.

2) Even if I could create a class that goes from plaintext into this almost "database" sort of format (which does not even yet account for formatting or anything else), I haven't yet wrapped my mind around how I would update all levels while doing text processing... or, rather, how to do so in a simple enough way as not to fully counteract the simplicity of querying with the complexity of altering.

Still, I'd be curious to hear what you think. - Ahi |
|||
09-01-2009, 02:37 PM | #35 |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Buffered input
More thoughts on "buffered input" (useful if the file is too large for Python to handle in memory all at once, and probably a good idea to do anyway, as you never know what kind of system someone might want to run it on, or what the limits of their resources will be)...
IF you're planning on generating a table of contents, or any other set of section listings at the front of the file, then what might be best is something like this:

1) Maintain, entirely in memory, the information about the overall structure of the file (sections, headings, etc) along with "section numbers" or spool file offsets for each section. Normally, I wouldn't expect there to be more than a few dozen entries in this structure, maybe up to a few hundred, but that might change depending upon what all info you choose to put there.

2) Create the input buffer and initialize the overall structure (which would also involve starting the first section, its first block, etc).

3) Create a temporary spool file to hold 'blocks' of processed data. Each time you finish a block in a fashion that you know you won't need to refer to it again until the "dump phase", write it out to the temp spool. As you detect new sections/chapters/etc, create a new entry on the "overall structure" heap (see #1 above) with the needed information, including the current point in the spool file (length).

4) Process, process, process...

5) When finished with the input, you now start the actual output file using the information in the overall structure, and then dump the rest of the file by re-reading the temp spool file and formatting it to the output in conjunction with the info in the overall structure heap.

Something along that line would allow you to process pretty much ANY size file on almost any machine, and would allow you to do Tables of Contents, etc. pretty easily. Another thing you could do is include, at the front of the document, notes to the user (perhaps as a command line option, or perhaps written to a separate output file, or just to standard out) about potential problem areas, so the user would know where to look. ...But I think you were already doing that, so never mind. |
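A rough Python sketch of that spool-file scheme, assuming a text-mode temp file and an in-memory section list; the class and method names are invented for illustration, and the toy dump() just emits a crude contents listing before replaying the spool:

```python
import tempfile

class SpoolWriter:
    """Sketch of the 'overall structure + temp spool file' idea: finished
    blocks go to a temp spool; section entries remember their spool offset."""
    def __init__(self):
        self.spool = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
        self.sections = []   # (title, spool offset) pairs, kept in memory

    def start_section(self, title):
        self.sections.append((title, self.spool.tell()))

    def write_block(self, text):
        self.spool.write(text + "\n")

    def dump(self, out):
        # Emit a crude table of contents, then replay the whole spool.
        for title, _offset in self.sections:
            out.write("CONTENTS: %s\n" % title)
        self.spool.seek(0)
        out.write(self.spool.read())
```

A real dump phase would use the stored offsets to resolve links and page the sections out selectively; here they are only recorded.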
09-01-2009, 03:19 PM | #36 | |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Quote:
It seems to me that you really do need to read the whole file in (spooling it out to a temp file, re-reading the spool file to do more processing, re-spooling to another temp file, lather, rinse, repeat...), first breaking it up into 'apparent' sections, chapters, blocks, paragraphs, whatever. THEN, you can start processing what you've detected from the input, to see what you want to change. Maybe, roughly, something like this:

1) Read input file.
a) start block
b) add items (image, paragraph, links, etc) to spooled block
i) images are automatically a block unto themselves
ii) paragraphs of text are anything terminated by CR or LF or CR/LF (you have to keep track of what end-of-line sequence the source file uses and use the same thing again on the output).
iii) a link is a block, just like an image, but with associated text, and is terminated by the end of the link, not by EOL characters.
c) repeat a) and b) until the entire file has been parsed in.

Step 1) is all about reading in and parsing, NO modification (other than possible removal of duplicated tags, like bolds and italics, that will happen naturally as part of the conversion to the internal format).

2) Start processing the file.
a) Look for structural changes first.
i) Are there chapters/sections? If so, are any of them missing? Are any of them duplicated? Try to straighten these out first.
ii) Once the chapters/sections are located/fixed, try to find indents. These are indented quotes, poetry, etc. Just identify them.
iii) Try to identify intentional extra line breaks (between what looks like the end of one paragraph and the start of another, "scene breaks")
iv) Remove all other extraneous line breaks that aren't scene breaks and which don't go before or after indented quotes/poetry, i.e. extraneous duplicate sequential line breaks.
b) Look for grammatical problems (this is the toughest part!), the kinds of things you've already called out.
Most of these will remain WITHIN a paragraph-block, but it will also sometimes involve joining two paragraphs together that have been erroneously split. Step 2) will involve writing several copies of the temp spool file, as each successive output file becomes the input file to the next phase of things to look for. Once you have run out of things to look for and process, then:

3) Read the final temp spool file and output the final output file.

See, simple as that. Nothing to it. Truly, that's the way I'd approach it. It's a long hard slog, because you're not getting to the really cool, slick stuff for quite a while, but I'd first get all of the input/parsing/spooling/output stuff working nice and clean. Once that is done, then you can almost forget about it and focus on the 'good' stuff, and can take each part of it a step at a time, not worrying about how many times you read and write temporary spool files.

Another thing that occurs to me: there's probably going to be a bunch of changes that you make that you won't want to output to the user right away, as you'll want to refer them to specific line #'s of the final output file. So you may want to insert "change blocks" into the stream as you're processing. You can think of those "change blocks" as "debug information"; they will be completely omitted from the final output file and are simply skipped as you're working on the file, but their purpose is to generate the console/debug output information to the user AS you're writing that final output file... "Removed 3 blank lines at line 23", "Joined broken paragraphs at line 79", etc.

All intended merely as "food for thought"...

EDIT: Well, hell, I screwed the pooch on THAT formatting! |
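Step 2)a)iv and the repeated spool-pass idea could look something like this minimal sketch, where in-memory line lists stand in for the temp spool files and each pass consumes the previous pass's output:

```python
def collapse_blank_runs(lines):
    """One pass: collapse any run of blank lines down to a single blank
    (the surviving blank acting as the scene/paragraph separator)."""
    out, blanks = [], 0
    for line in lines:
        if line.strip():
            if blanks:
                out.append("")   # keep exactly one blank as the break
            out.append(line)
            blanks = 0
        else:
            blanks += 1
    return out

def run_passes(lines, passes):
    """Chain passes: each one reads the previous pass's output, just as
    each spool file becomes the input to the next phase."""
    for p in passes:
        lines = p(lines)
    return lines
```

A real implementation would keep the indented-quote/poetry regions identified in step 2)a)ii exempt from the collapsing.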
|
09-01-2009, 06:59 PM | #37 | |||
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Another formatting error to watch for
I'm currently trying to "pretty up" a Bujold book (one I bought from Baen, that unfortunately CAME this way). Wherever they have italicized text, they've screwed up the spacing. Several examples:
Quote:
Notice there is no space between the first italicized sentence and the second sentence. Quote:
No space between 'sort' and italicized 'our'. Quote:
In general, they seem to delete the leading space when turning italics ON and sometimes insert an extraneous space when turning them OFF (usually when followed by punctuation). Anyway, I thought you might like to add these examples to your file of possible things to look for. |
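Assuming the italics survive as simple <i>...</i> tags by the time pacify sees them, a regex sketch of these three repairs might look like this; the patterns are guesses at the common cases, not a complete fix:

```python
import re

def fix_italic_spacing(html):
    """Repair converter-damaged spacing around <i>...</i> spans."""
    # 'sort<i>our</i>' -> 'sort <i>our</i>': leading space lost at italics-ON
    html = re.sub(r"(\w)<i>(?=\w)", r"\1 <i>", html)
    # '</i> ,' -> '</i>,': extraneous space inserted at italics-OFF
    html = re.sub(r"</i>\s+(?=[,.;:!?])", "</i>", html)
    # 'sentence.</i>Next' -> 'sentence.</i> Next': space lost after italics-OFF
    html = re.sub(r"</i>(?=\w)", "</i> ", html)
    return html
```

The ordering matters: the punctuation rule runs before the missing-space rule so the space it removes is never one the last rule just added.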
|||
09-01-2009, 08:47 PM | #38 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
I'll share with you tomorrow a potential solution to my "grand plan". - Ahi |
|
09-02-2009, 03:29 PM | #39 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Grand Plan!
Ok, ekaser... so here's my grand plan's reformulation:
The text is basically parsed into a pTome wrapping class... Each pTome contains an arbitrary number of pPar objects, which are assumed to be either paragraphs or lines with necessary line-breaks (like poems or quotations). Each pPar object has: 1) a classification [e.g.: paragraph, quotation, {chapter/section} title, et cetera], 2) a pString object. Each pString object has: 1) a text string, 2) a formatting string.

The pTome class would have accessor methods to facilitate high-level "posing" of the sort of questions I identified in my earlier post, but instead of words being preparsed, they would be parsed only on the fly whenever an accessor method needed them. I do not foresee a need to perform word-level operations, only to make word or higher level queries.

Some outstanding decisions on my mind...

1) Color... I should probably include it in the formatting string... so I think I'll probably make the formatting "string" not work on the basis of bitfields but something a bit more complex, so that if in the future I discover a reason to make the conversion from RTF or HTML more fine-grained, I can do so without much internal rewriting.

2) Links, footnotes, annotations... I am thinking these might have to be their own parallel "strings" (not containing unicode bytes, but rather arbitrarily long sub-pStrings, or destinations in the case of links). After all, a given character could be both part of a link and be (right in front of) a footnote (mark). I'm not sure how annotations work in RTFs, but those might also coexist with the previous two in certain complex cases. Can you think of a better way that doesn't introduce too much complexity?

With regard to the links, I think the link parallel string would only be a destination for location information "deposited" from a higher level... almost certainly by the owning pTome.

Basically... the following: Code:
<h1>The Beginning</h1>
<p>
It was rather a new sort of experience<footnote>though admittedly she's been on the run from the law before, but that was a <i>long</i> time ago</footnote>, and she did not deal well with it. Or, rather, <b>it</b> did not deal kindly with her.
<p>

Code:
pTome
|
|
|--- pPar[0]
| |
| |___ pClassification = "title"
| |___ pString
| |
| |__ "The Beginning" # text
| |__ "0000000000000" # formatting
| |__ "0000000000000" # links
| |__ "0000000000000" # annotations
| |__ "0000000000000" # footnotes
|
|--- pPar[1]
|
|___ pClassification = "paragraph"
|___ pString
|
|__ "It was rather a new sort of experience, and she did not deal well with it. Or, rather, it did not deal kindly with her."
|__ "0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000bb000000000000000000000000000000" # formatting
|__ "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000" # links
|__ "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000" # annotations
|__ "0000000000000000000000000000000000000*0000000000000000000000000000000000000000000000000000000000000000000000000000000000" # footnotes
|
|_ pString
|
|__ "though admittedly she's been on the run from the law before, but that was a long time ago"
|__ "0000000000000000000000000000000000000000000000000000000000000000000000000000iiii000000000"
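The diagram above might translate into Python classes roughly like this; sparse dicts stand in for the link/footnote parallel strings, and everything beyond the pTome/pPar/pString names themselves is an illustrative guess, not settled design:

```python
class PString:
    """Text plus a parallel formatting string: one format char per text char."""
    def __init__(self, text):
        self.text = text
        self.formatting = "0" * len(text)  # e.g. 'b' = bold, 'i' = italic
        self.footnotes = {}                # char position -> footnote PTome
        self.links = {}                    # char position -> destination string

class PPar:
    """One paragraph (or line, for poems/quotations) with a classification."""
    def __init__(self, classification, text):
        self.classification = classification  # 'paragraph', 'title', 'quotation'...
        self.string = PString(text)

class PTome:
    """A contiguous grouping of text: a whole book, a chapter, or a footnote."""
    def __init__(self):
        self.pars = []

    def add_par(self, classification, text):
        par = PPar(classification, text)
        self.pars.append(par)
        return par
```

The word/sentence accessors discussed earlier would then live on PTome and parse on the fly from the pars list.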
- Ahi |
09-02-2009, 09:56 PM | #40 | |||||
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Quote:
Quote:
Quote:
End of rambling thoughts... [EDIT: after typing all that, I've now read further and seen there's more on footnotes in your note... ] Quote:
Quote:
It seems to me a bit of a waste of storage, as most of those fields (HOWEVER they work) are going to be empty most of the time. What jumps to my mind is something like this: stick with a single "attribute" string that is completely parallel with the text string (whether that's an array of BYTE, WORD, DWORD, or class/structure, doesn't matter). One of the flags in the attribute for a letter is "footnote here", another is "link start", another "link end" (maybe, I wave magic wand), another is "annotation here". Then, the pPar structure (class, if you will) contains a pointer to a linked list of objects which contain the data (footnote, link target, etc) for each of those. The linked list is in the same sequential order as the items appear on the line of the pPar, so as you flow along the pPar line, you have another pointer that "flows along" the linked list of sub-pPars that are footnotes, link targets, annotations, etc.

Or maybe not. I'm just not seeing quite what you're visualizing for those extra 'fields'. Other than that, I think you're definitely on the right track! |
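One way to sketch that "flags plus ordered payload list" idea in Python, with a plain list playing the role of the linked list; the flag constants and names are invented for illustration:

```python
# Hypothetical per-character attribute flags.
FOOTNOTE_HERE = 0x01
LINK_START    = 0x02
LINK_END      = 0x04

def attach_payloads(attrs, payloads):
    """Walk the per-character attribute array and pair each flagged position
    with the next payload object, consuming the payload list in order."""
    it = iter(payloads)
    for pos, flags in enumerate(attrs):
        if flags & (FOOTNOTE_HERE | LINK_START):
            yield pos, next(it)
```

Because the payloads are stored in the same sequential order as the flags appear, a single pointer (here, the iterator) "flows along" with the text scan.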
|||||
09-02-2009, 10:09 PM | #41 |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Footnotes/Annotations
While on the subject, remember that 'annotations' can take the form of footnotes at the bottom of the page, endnotes at the end of the chapter, and annotations at the end of the book. Three unique places you have to keep track of and place them.
|
09-03-2009, 12:02 AM | #42 | |||||
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
Quote:
A function trying to detect an instance of a single quote being embedded within two words, with no space on either side (like "with an 'increased'chance of precipitation") could, while making queries related to trying to decide what to make of the apostrophe/single-quote character, come to the conclusion that these are likely two words, and as a result write "' " ( '&space; ) into the output, instead of just the apostrophe, before continuing to the next character in the text. Admittedly, I think it's not impossible that I might come to find myself wrong on this point of "not needing word-level accessors that change the data"... but it should be easy enough to amend the list of methods with a few more accessors. Quote:
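A detection-only sketch of that single-quote heuristic, which flags candidates for user approval rather than rewriting them; the unmatched-opener rule is a guess at one workable criterion, not a proven one:

```python
import re

def find_suspect_quotes(text):
    """Flag single quotes glued between two letters where an unmatched
    opening quote precedes them, as in "an 'increased'chance". Plain
    contractions like "don't" have no pending opener, so they pass."""
    suspects = []
    open_pos = None
    for m in re.finditer("'", text):
        i = m.start()
        prev = text[i - 1] if i > 0 else " "
        nxt = text[i + 1] if i + 1 < len(text) else " "
        if open_pos is None and not prev.isalpha() and nxt.isalpha():
            open_pos = i          # looks like an opening quote: 'word
        elif open_pos is not None and prev.isalpha() and nxt.isalpha():
            suspects.append(i)    # closing quote glued to the next word?
            open_pos = None
    return suspects
```

A contraction inside an open quotation would still fool this, which is exactly why the post proposes user approval rather than automatic rewriting.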
Quote:
pPar, for the sake of simplicity, needs to "end with a line break"... so to speak. In reality, I think I will end up stripping out line breaks, so the assumed line breaks at the end of pPar's will be crucial to the production of correct formatting in the output. In other words, if I keep strictly with this idea, you are right in noting that a pPar might not be sufficient for every footnote... despite the vast majority of them almost certainly being just a few words or a line/paragraph. Probably each footnote having its own pTome is the right idea... though the idea is unsettling at first thought, as I initially imagined pTome as containing the whole of a contiguous piece of text. I suspect though this is wholly subjective, with little foundation... as I cannot think of any obvious reason why this approach could/would cause problems. Perhaps I need to think of pTome as a contiguous "grouping" of text. Whether that grouping is 1 book, 1 part, 1 chapter, or a mere 1 footnote/annotation. Quote:
For links, footnotes, and annotations, my original thought was that (in python terms) I would use a dictionary object for each of those. In essence what my diagram shows as "0" for those strings, in reality it would be numerical keys for which the dictionary was never assigned any data. Code:
linkLayer = {}
linkLayer[28] = 'chapter 1, paragraph 5'
linkLayer[29] = linkLayer[28]
linkLayer[30] = linkLayer[28]
linkLayer[31] = linkLayer[28]
linkLayer[32] = linkLayer[28]

While the above still may waste space needlessly, it is nowhere near as bad as my original diagram suggested... though I'd definitely look into "assigning by reference" with Python when I get to this part of the code. At least one target of my conversion activity is basically The Great Hungarian Defining Dictionary (not an accurate translation of its name, but it describes fairly well what it is). Basically 100 MB of RTF, with only a few small pictures here and there... and potentially my way of wanting to convert this could easily have over 100,000 intra-document links. Also, keep in mind, not treating links with "link open here" and "link end here" sort of solutions works to include links under the automatic formatting refactoring umbrella. Something though that has no relevance to annotations and footnotes, on account of those being "one dimensionally referenced" (insert here, as opposed to span this length). I think the main reason for my wanting to have these three in additional parallel streams, in a way, is to be able to arbitrarily alter the main text without having to make my code do contortions to ignore the fact that (for a lot of practical considerations) a footnote does not really come between two characters of the main body text. Thanks. It's good to have constructive feedback! I'm just sorry I won't really have time to get any meaningful work done on this until next week. But as I manage, I'll definitely update you of whatever developments/discoveries/realizations/thoughts along the way. - Ahi |
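If the per-character dict still feels wasteful, one hypothetical alternative is to store each link once as a span over character positions; nothing here is settled pacify API, just a sketch of the trade-off:

```python
# Store each link once as (start, end, destination), with 'end' exclusive,
# instead of one dict entry per covered character.
linkSpans = [(28, 33, 'chapter 1, paragraph 5')]

def link_at(spans, pos):
    """Return the destination of the link covering character position pos,
    or None if no link covers it."""
    for start, end, dest in spans:
        if start <= pos < end:
            return dest
    return None
```

Spans trade cheap storage for slightly more bookkeeping when the main text is edited, since every span boundary after the edit point must be shifted.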
|||||
09-03-2009, 09:58 AM | #43 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Next consideration...
... how to architect this for internationalization. My own personal need already involves two languages, English and Hungarian, and I would prefer adding additional ones to be a reasonably straightforward process. Part of me wonders whether the easiest thing might be to separate out the processing functions into language-specific Python modules to be included by the main pacify.py script as per need. But I'm all but certain that is at least an inelegant and probably a very substandard approach. What if, after all, a given document has text in two languages? (And both RTF and HTML have language tagging capacity... so applying the appropriate rules is at least conceivable.) I'm also aware that the best case scenario would be to somehow store the rules in an externally stored and loadable data/config file of some sort. But it might ultimately end up being overly complicated... How do you store as a data file a rule like: Quote:
Maybe have the main program contain a "skeleton" of all processing functions, which would then (based on command line options and/or imported metadata) in turn call language-specific versions of the processing questions on the fly at runtime? FixParagraphs(text) would load FixParagraphs.hu.py or FixParagraphs.en.py depending on language and do an eval("FixParagraphs_"+curlang+"(text)") I'll continue thinking on it further... - Ahi |
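A sketch of that skeleton-plus-language-modules dispatch using importlib instead of eval(), which avoids building code strings at runtime; the rules_xx module naming scheme is made up for illustration:

```python
import importlib

def get_rule(language, name):
    """Look up a language-specific processing function at runtime, e.g.
    FixParagraphs from a hypothetical rules_hu.py or rules_en.py module."""
    module = importlib.import_module("rules_%s" % language)
    return getattr(module, name)
```

Usage would be something like `fix_paragraphs = get_rule("hu", "FixParagraphs")` followed by `fix_paragraphs(text)`; a missing language module raises ImportError up front rather than failing inside an eval'd string.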
|
09-03-2009, 12:11 PM | #44 | |||||||
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Quote:
Quote:
Quote:
An image
A sentence
A fragment of a sentence (down to possibly just a single word)
A break

A 'break' can be any one of:
Line break (such as at the end of each line of poetry)
Paragraph break (may trigger auto-first-line indent, etc)
Scene break (those extra blank lines in novels that separate scene changes)
Page break
Sub-section break
Section break
Chapter break
Part break
Book break

A pBlock then, is a COLLECTION (or LIST) of pBlocks AND pItems, representing anything including:
The entire 'book'
"Meta information" (not sure of right word, but title, author, copyright, etc)
A table of contents
A section or 'Part'
A chapter
A sub-chapter (potentially multi-level, as in text books, etc)
A header or footer (are you thinking of supporting those?)
A link
An anchor (target of a link)
A footnote, endnote, annotation
An index (a whole different can of worms)
A pItem

In other words, the Book is a pBlock containing a list of pBlocks comprising the Meta Info, TOC, a list of Parts OR a list of Chapters, maybe Annotations, maybe Index, and a Book-break. A PART pBlock would consist mostly of Chapter pBlocks (with maybe leading title (1,2,3 or I,II,II, etc) and/or quote/title). A CHAPTER pBlock would consist mostly of a collection of sentences and paragraph-breaks. And so on... Only at the sentence and sentence-fragment level would you have the formatting strings.

However, I'm not sure how all of that would fit into your ideas for "ease of testing and reformatting" the text. That's just how my brain tends to approach things: break them down, structure them into as few types of nesting and repeating items as possible. Whatever you do, you have to encode in your 'database' the STRUCTURE of the book, the FORMATTING of the text, and the TEXT itself. I'm just trying to throw grist into your mill... More random grist: bottom-of-page footnotes are problems.
Usually, they will get processed as soon as the "footnote link" is encountered, and their height gets removed from the height of that current page (or whatever), but sometimes the footnote reference will be too close to the bottom of the page for the entire footnote to fit in the remaining space on that page, so it has to get split and run over onto the bottom of the next page. Just grist. Another thought: it MAY be possible to use your 'standard' link/anchor mechanism for footnotes (after all, that's really what they are, the original hyperlinks). The only difference is where/how the target text is automatically placed. Quote:
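The pBlock/pItem nesting described above could be sketched like this; the class names follow the post, but the methods and kind strings are illustrative guesses:

```python
class PItem:
    """A leaf: an image, a sentence (or fragment of one), or a break."""
    def __init__(self, kind, payload=None):
        self.kind = kind        # 'image', 'sentence', 'break:paragraph', ...
        self.payload = payload  # text, image path, break subtype data, ...

class PBlock:
    """A node: an ordered list of PBlocks and PItems, representing anything
    from the whole book down to a chapter, TOC, or footnote."""
    def __init__(self, kind):
        self.kind = kind
        self.children = []

    def sentences(self):
        """Recursively yield every sentence-level leaf, depth-first."""
        for child in self.children:
            if isinstance(child, PBlock):
                yield from child.sentences()
            elif child.kind == 'sentence':
                yield child
```

Only the sentence/fragment leaves would carry the formatting strings, exactly as the post suggests; everything above them encodes structure.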
Quote:
Quote:
Quote:
|
|||||||
09-03-2009, 12:18 PM | #45 | |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Quote:
Formatting of the text is DEFINITELY something that can vary greatly from language to language. For example, in English, a question always has a question mark at the END of the sentence. In Spanish, there's a mark at the START and end of the sentence. I'm afraid I'm not multi-lingual aware/talented enough to help much with this one. MY thought would be that you'd almost need completely separate "reformatting modules" for each language, as trying to come up with any kind of scripting or rule-based structure seems incredibly complex and difficult. You almost have to custom-create it for each language. But then, there are other folks MUCH smarter and more educated in that subject than me! |
|
|