#61 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Preliminary version with support for HTML input... a bit rough, but ok for a preview.
Also does the quotation mark fixing, like older versions do (edit: although it appears to be a bit wonky--will fix). Update: for better HTML support. Currently <Hx> tags are simply formatted as \textsc{} but I will soon add support to directly translate them to \chapter{} \section{} et cetera formatting in LaTeX. Also, see the (unmodified) .txt and .tex files presently generated from this HTML file by this version of pacify.py. - Ahi Last edited by ahi; 09-16-2009 at 10:39 PM. |
#62 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
If you are around, ekaser, I am mostly done with the rearchitecting.
I have a(n admittedly very simple) plugin architecture in place, where basically all functionality (excepting the core classes used by the processing) comes from plugins that are classified as either (1) an input plugin, (2) a language plugin, (3) a processing plugin, or (4) an output plugin. It makes for very pleasantly clean development... albeit somewhat torturous command line handling, as the plugins (and the [command line based] choice of which specific plugins to use at runtime) are relevant in figuring out which command line arguments are correct and which are erroneous.
I have a plaintext, an RTF, and an HTML input plugin already working fairly well, and a plaintext output and an HTML output plugin likewise functional, if a bit immature as yet.
The language plugins are handled in such a way that (1) they can somewhat customize the pacify class's running instance to potentially alter other plugins' behavior, but mostly (2) preprocess the text right after it's read in from the input file in whatever language-specific way, and (3) post-process the text after all other plugins are done but before it is written to the output file.
Switching my development to Python 3 also got rid of the difficult to understand and (for me) seemingly impossible to definitively correct unicode related errors my pacify script previously suffered from.
The only point of (as yet) shame is that I have not had the fortitude to fully implement my crazy text-as-database concept yet. My formatted text string class objects are being manipulated fairly directly. I probably should bite the bullet and take my time to figure out both the spooling (which I am, to be honest, yet to fully wrap my mind around; any good "idiot's guide" level resources you can point me toward?) and the text-as-database stuff... but I am just too impatient for practical results to do so. On the upside, if and when I do get around to doing that stuff... I should be able to insert the necessary code fairly readily without having to make radical changes in too many places.
I'm not going to upload another version until it's able to produce reasonably useful output... but it's getting closer. I've decided to build categorization into the formatting stream as well... probably not incredibly efficient... but unless it starts to cause problems with even files just dozens of MB large, I'll probably stick with it for now (and once spooling is implemented, that should take care of the problem altogether).
I am also thinking of implementing footnotes/endnotes (and perhaps annotations?) in the formatting stream too... but I'm now thinking I will not bother with links at all. I cannot think of any input documents (other than of the "choose your own adventure" variety, which is fairly rare) where existing links ought necessarily be respected, instead of new links being generated as warranted by the document's structure. (Albeit perhaps in HTML, there should be some ability to interpret links as footnotes when appropriate.)
Just wanted to share where I am and what I've done.
- Ahi |
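For readers following along, here is a minimal sketch of the four plugin kinds described in this post and the order they run in. It is not pacify.py's actual code: every name in it (`LanguagePlugin`, `EnglishPlugin`, `run_pipeline`, the example plugins) is hypothetical, and the "input" and "output" plugins work on strings rather than files to keep the example self-contained.

```python
import re


class LanguagePlugin:
    """Hypothetical base class: hooks that run before and after processing."""

    def preprocess(self, text):
        return text

    def postprocess(self, text):
        return text


class EnglishPlugin(LanguagePlugin):
    """Illustrative language plugin; the real 'en' profile surely does more."""

    def preprocess(self, text):
        # Runs right after the text is read in.
        return text.replace("\r\n", "\n")

    def postprocess(self, text):
        # Runs after all processing plugins, just before output.
        return text.rstrip() + "\n"


def read_plaintext(raw):
    """Stand-in input plugin: the raw string already is the text."""
    return raw


def collapse_spaces(text):
    """Stand-in processing plugin: squeeze runs of spaces and tabs."""
    return re.sub(r"[ \t]+", " ", text)


def write_plaintext(text):
    """Stand-in output plugin: return the text unchanged."""
    return text


def run_pipeline(raw, read, language, processors, write):
    """input -> language preprocess -> processing -> language postprocess -> output"""
    text = read(raw)
    text = language.preprocess(text)
    for process in processors:
        text = process(text)
    text = language.postprocess(text)
    return write(text)


result = run_pipeline(
    "Hello   world\r\n\r\n",
    read_plaintext,
    EnglishPlugin(),
    [collapse_spaces],
    write_plaintext,
)
print(repr(result))
```

Passing the plugins in as plain arguments keeps the ordering explicit; choosing them from the command line would then only need a lookup from plugin name to object.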
#63 |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Sounds like you've made some good progress! I've been gone most of this week camping, and just got back this afternoon. Amazing how far behind you can fall in just a few days' time!
As for the spooling stuff, can't think of any references off-hand. It's pretty much just a matter of 'virtualizing' (serializing) your memory out to disk into a series of temporary "working files", so that you're not trying to keep hundreds of megabytes in memory at once, just the data structures that point to them (as needed). But my programming library is pretty limited (I'm one of the original inventors of the Not-Invented-Here Syndrome...).
I hear what you're saying about "too impatient to see something working"! That tends to be my problem as well. |
#64 |
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
|
Some preliminary results.
First tried an HTML > TXT pacification (is that a word?): Code:
pacify.py -i imp.html -o txt
pacify v0.4.0 (2009-09-13) - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)
Profile: en
Input file: imp.html
Output file: imp-pacified.txt
Log file: imp-pacified.log
Reading imp.html...
Starting on Mon, 21 Sep 2009 11:24:33
File too large to read fully into memory.
Traceback (most recent call last):
  File "/home/kck/bin/pacify.py", line 88, in <module>
    main()
  File "/home/kck/bin/pacify.py", line 79, in main
    pacify = Pacify(args)
  File "/home/kck/bin/pmodules.py", line 42, in __init__
    self.inbuffer = self.ReadHTML()
  File "/home/kck/bin/pmodules.py", line 608, in ReadHTML
    if tmpbuffer[-1] != u' ' and tmpbuffer[-1] != u'\n' and tmpbuffer[-1] != u'␢':
IndexError: list index out of range
Split the above into chapters, and tried again on individual chapters. Output for this chapter seemed to be pretty good. I would suggest putting at least a space between table cells when converting from HTML, and perhaps a space between cells and a line break between rows.
Code:
pacify v0.4.0 (2009-09-13) - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)
Profile: en
Input file: imp-c_split_14.html
Output file: imp-c_split_14-pacified.txt
Log file: imp-c_split_14-pacified.log
Reading imp-c_split_14.html...
Starting on Mon, 21 Sep 2009 11:29:54
Cleaning formatting...
Starting on Mon, 21 Sep 2009 11:29:54
Finished on Mon, 21 Sep 2009 11:29:54
Analyzing text...
Starting on Mon, 21 Sep 2009 11:29:54
Simplifying linebreaks...
Analyzing whitespace patterns...
100.0% processed (0.0 MB of 0.0 MB)
Finished on Mon, 21 Sep 2009 11:29:54
Filesize: 19645
Whitespace analysis: [(3492, 0.0), (37, 16.0), (8, 8.0)]
1777.55153983 18.8343089845 4.07228302367
... assumed to be a file with paragraph breaks.
Correcting quotation marks...
Enlightening text...
Done!
Writing imp-c_split_14-pacified.txt...
Done!
Code:
pacify.py -i imp.txt -o latex
pacify v0.4.0 (2009-09-13) - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)
Profile: en
Input file: imp.txt
Output file: imp-pacified.tex
Log file: imp-pacified.log
Reading imp.txt...
Starting on Mon, 21 Sep 2009 11:35:30
Finished on Mon, 21 Sep 2009 11:35:30
Cleaning formatting...
Starting on Mon, 21 Sep 2009 11:35:30
Finished on Mon, 21 Sep 2009 11:35:30
Analyzing text...
Starting on Mon, 21 Sep 2009 11:35:30
Simplifying linebreaks...
Analyzing whitespace patterns...
100.0% processed (0.4 MB of 0.4 MB)
Finished on Mon, 21 Sep 2009 11:35:31
Filesize: 409959
Whitespace analysis: [(72019, 0.0), (1063, 16.0), (1, 8.0)]
1756.73664927 25.9294222105 0.0243926831708
... assumed to be a file with paragraph breaks.
Correcting quotation marks...
Enlightening text...
Done!
Writing imp-pacified.tex...
Converting to LaTeX...
Replacing \'s
Replacing {'s
Replacing }'s
Replacing >'s
Replacing <'s
Replacing ~'s
Replacing ^'s
Replacing &'s
Replacing #'s
Replacing _'s
Replacing $'s
Replacing %'s
Formatting...
Traceback (most recent call last):
  File "/home/kck/bin/pacify.py", line 88, in <module>
    main()
  File "/home/kck/bin/pacify.py", line 79, in main
    pacify = Pacify(args)
  File "/home/kck/bin/pmodules.py", line 65, in __init__
    outfile.write(self.GetAsLaTeX().encode('utf-8'))
  File "/home/kck/bin/pmodules.py", line 225, in GetAsLaTeX
    if curFormat != self.inbuffer.format[idx]:
  File "/home/kck/bin/pmodules.py", line 989, in __ne__
    if self.isBold != other.isBold:
AttributeError: pString instance has no attribute 'isBold'
Tried an HTML > LaTeX conversion. That worked. It would have been nice to get a HTML table > LaTeX tabular conversion, but perhaps that goes against what you're trying to do. If so, then at least a space, and maybe a line break between rows, seems in order.
Code:
pacify.py -i imp-c_split_11.html -o latex
pacify v0.4.0 (2009-09-13) - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)
Profile: en
Input file: imp-c_split_11.html
Output file: imp-c_split_11-pacified.tex
Log file: imp-c_split_11-pacified.log
Reading imp-c_split_11.html...
Starting on Mon, 21 Sep 2009 11:44:05
Cleaning formatting...
Starting on Mon, 21 Sep 2009 11:44:05
Finished on Mon, 21 Sep 2009 11:44:05
Analyzing text...
Starting on Mon, 21 Sep 2009 11:44:05
Simplifying linebreaks...
Analyzing whitespace patterns...
100.0% processed (0.0 MB of 0.0 MB)
Finished on Mon, 21 Sep 2009 11:44:05
Filesize: 23759
Whitespace analysis: [(4198, 0.0), (46, 16.0), (4, 8.0)]
1766.90938171 19.3610842207 1.68357254093
... assumed to be a file with paragraph breaks.
Correcting quotation marks...
Enlightening text...
Done!
Writing imp-c_split_11-pacified.tex...
Converting to LaTeX...
Replacing \'s
Replacing {'s
Replacing }'s
Replacing >'s
Replacing <'s
Replacing ~'s
Replacing ^'s
Replacing &'s
Replacing #'s
Replacing _'s
Replacing $'s
Replacing %'s
Formatting...
Done! |
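The table suggestion in this post (a space between cells, a line break between rows) might be sketched roughly as follows with the standard library's `html.parser`. This is only an illustration of the idea, not how pacify.py reads HTML; the class name `TableAwareText` and the function `html_to_text` are made up for the example.

```python
from html.parser import HTMLParser


class TableAwareText(HTMLParser):
    """Collects text, adding a space after each cell and a newline after each row."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        # HTMLParser hands tag names over in lowercase.
        if tag in ("td", "th"):
            self.parts.append(" ")
        elif tag == "tr":
            self.parts.append("\n")


def html_to_text(html):
    parser = TableAwareText()
    parser.feed(html)
    parser.close()
    joined = "".join(parser.parts)
    # Drop the space left dangling after the last cell of each row.
    return "\n".join(line.rstrip() for line in joined.splitlines())


sample = (
    "<table>"
    "<tr><td>a</td><td>b</td></tr>"
    "<tr><td>c</td><td>d</td></tr>"
    "</table>"
)
print(html_to_text(sample))
```

A tab instead of a space between cells would work the same way if the downstream whitespace analysis can cope with it.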
#65 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Hi, Frabjous!
I would suggest holding off until I put up the next version. The haphazard unicode errors are basically gone as of the development version I am currently working on.
The filesize thing is weird... the 600 KB file certainly did not cause a memory issue, but whatever the issue was got misreported as such. I have successfully processed 700+ MB (nearly 1 GB) files with pacify.py before... and once I implement spooling (which I actually think I will do sooner rather than later after all), file size will be a non-issue so long as you have both sufficient memory and disk space.
And yes, the HTML parsing needs to take tables and such properly into account... along with a few other things. I'll keep everyone updated via this thread...
- Ahi |
#66 |
creator of calibre
Posts: 45,349
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
@ahi: Look at SpooledTemporaryFile in the python tempfile module
|
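A small sketch of what using `tempfile.SpooledTemporaryFile` could look like: the data stays in memory until it grows past `max_size`, after which the standard library moves it to a temporary file on disk on its own. The helper name `spool_text` and the 1 MB threshold are arbitrary choices for the example, not anything taken from pacify.py or calibre.

```python
import tempfile


def spool_text(chunks, max_in_memory=1024 * 1024):
    """Write text chunks into a spooled temp file and rewind it for reading.

    Up to max_in_memory the content is held in memory; beyond that the
    tempfile module rolls it over to a real temporary file on disk.
    """
    spool = tempfile.SpooledTemporaryFile(
        max_size=max_in_memory, mode="w+", encoding="utf-8"
    )
    for chunk in chunks:
        spool.write(chunk)
    spool.seek(0)
    return spool


# A tiny threshold forces the rollover to disk, which is invisible to the caller.
with spool_text(["abc", "def"], max_in_memory=4) as spooled:
    content = spooled.read()
print(content)
```

The file is deleted when it is closed, so anything that must survive has to be copied out first.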
#67 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
import antigravity indeed! On the topic of RTF parsing, Kovid. If I recall correctly you need HTML returned by the RTF parser... is that right? The current one you use for calibre... how... precise is it? The parser I am working on purposely limits the complexity of formatting it cares about. Would you still have any use for such a partial RTF parsing engine, as a (re)starting point? - Ahi |
|
#68 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
|
#69 |
creator of calibre
Posts: 45,349
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
The python logging module at least in 2.x has various unicode related bugs, so I had to implement a custom logging module for calibre as well.
I recently refactored the calibre RTF parser to fix its speed issues, so I'm ok for now. Incidentally, the RTF parser actually outputs XML which I then convert to HTML using an XSLT stylesheet. It handles a pretty large subset of the RTF 1.5 spec including embedded images and so on. My philosophy in calibre is to accept as large an input set as possible and do *something* with it, even if that something may not be optimal, so until your RTF parser handles a larger set of input, I think I'll pass. |
#70 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
In your custom logging module, do you have something that reliably down-converts unicode to plain ascii that can be printed even to dumb DOS terminals? - Ahi |
|
#71 |
creator of calibre
Posts: 45,349
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
calibre has a general unicode->ascii converter that reliably down-converts unicode to ascii that looks like the original unicode. This is thanks to user_none by the way.
|
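For comparison, a much simpler standard-library approach to getting printable ASCII is to decompose characters and drop the combining marks. To be clear, this is not calibre's converter: it only handles accented Latin letters gracefully and replaces everything else with a placeholder. The function name `to_ascii` and the `placeholder` parameter are invented for this sketch.

```python
import unicodedata


def to_ascii(text, placeholder="?"):
    """Best-effort ASCII: strip accents, replace anything else non-ASCII."""
    out = []
    # NFKD splits e.g. 'é' into 'e' plus a combining acute accent.
    for ch in unicodedata.normalize("NFKD", text):
        if ord(ch) < 128:
            out.append(ch)
        elif not unicodedata.combining(ch):
            # No ASCII look-alike is known here, so mark the spot.
            out.append(placeholder)
        # Combining marks are dropped silently.
    return "".join(out)


print(to_ascii("Café déjà vu"))
```

Scripts without an NFKD decomposition (CJK, Cyrillic, and so on) all collapse to the placeholder, which is where a real transliteration table earns its keep.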
#72 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
Is it GPL (or similarly) licensed? - Ahi |
|
#73 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
|
#74 |
creator of calibre
Posts: 45,349
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
All of calibre is GPL
|
#75 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Having some difficulty getting my LaTeX output algorithm right...
The internal representation is, of course, plaintext but with (for simplicity's sake, let us say) the presence or absence of bold/italic/smallcap/underline formatting indicated for every character of text. My first, somewhat naive approach was something like this (I simplify considerably below, of course): Code:
output = ""
for idx in range(len(manuscript.text)):
    cur = manuscript.format[idx]
    # At idx 0 there is no previous character, so nothing is open yet.
    was_bold = manuscript.format[idx - 1].bold if idx > 0 else False
    was_italic = manuscript.format[idx - 1].italic if idx > 0 else False
    # Compare the previous character's flag with the current one.
    if was_bold and not cur.bold:
        output += "}"
    elif not was_bold and cur.bold:
        output += "\\textbf{"
    if was_italic and not cur.italic:
        output += "}"
    elif not was_italic and cur.italic:
        output += "\\textit{"
    output += manuscript.text[idx]
e.g.: Code:
T  h  i  s     i  s     i  n  d  e  e  d     a     s  t  r  a  n  g  e     i  d  e  a  !
-- -- -- -- -- -I -I -I BI BI B- B- B- B- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Code:
This \textit{is \textbf{in}deed} a strange idea!
resulting in: This is indeed a strange idea! (all of "is indeed" in italics, with only "in" also bold) instead of the desired: This is indeed a strange idea! (only "is in" in italics, and all of "indeed" bold)
---
I have since tried some more complicated variations, but they've either not yielded the correct output, or did so using very ugly (unnecessarily complicated) LaTeX code. Basically the correct output for the above example would be: Code:
This \textit{is} \textbf{\textit{in}deed} a strange idea!
Code:
This \textit{is \textbf{indeed}} a strange idea!
Note: Well, not literally spaghetti code, of course... but almost as confusing looking.
- Ahi
Last edited by ahi; 09-24-2009 at 09:30 AM. |
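One possible way to get properly nested output from per-character flags is to keep a stack of the commands that are currently open: whenever a format ends, close braces from the innermost outward until it is gone, then reopen whatever is still active. This is a sketch under assumptions, not pacify.py's code. The names `to_latex`, `parse_flags`, and `COMMANDS`, and the choice to store each character's formatting as a set of names, are all invented for the example.

```python
# Hypothetical mapping from format name to LaTeX command.
COMMANDS = {"bold": "\\textbf", "italic": "\\textit"}


def to_latex(text, formats):
    """text: the plain string; formats: one set of format names per character."""
    out = []
    stack = []  # formats currently open, outermost first
    for ch, active in zip(text, formats):
        # Close innermost-first until no open format has ended.
        while any(fmt not in active for fmt in stack):
            stack.pop()
            out.append("}")
        # (Re)open whatever is active but not open; sorted for stable output.
        for fmt in sorted(active - set(stack)):
            out.append(COMMANDS[fmt] + "{")
            stack.append(fmt)
        out.append(ch)
    out.append("}" * len(stack))  # close anything still open at the end
    return "".join(out)


def parse_flags(row):
    """Turn a row like '-- -I BI B-' into per-character sets of format names."""
    letters = {"B": "bold", "I": "italic"}
    return [
        {name for letter, name in letters.items() if letter in code}
        for code in row.split()
    ]


text = "This is indeed a strange idea!"
row = "-- " * 5 + "-I " * 3 + "BI " * 2 + "B- " * 4 + "-- " * 16
print(to_latex(text, parse_flags(row)))
```

For the example from this post it prints `This \textit{is \textbf{in}}\textbf{deed} a strange idea!`. That closes and reopens bold instead of matching the hand-written form, but the braces always nest correctly, and the same loop extends to smallcaps or underline by adding entries to `COMMANDS`.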