Some preliminary results.
First tried an HTML > TXT pacification (is that a word?):
Code:
pacify.py -i imp.html -o txt
pacify v0.4.0 (2009-09-13) - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)
Profile: en
Input file: imp.html
Output file: imp-pacified.txt
Log file: imp-pacified.log
Reading imp.html...
Starting on Mon, 21 Sep 2009 11:24:33
File too large to read fully into memory.
Traceback (most recent call last):
  File "/home/kck/bin/pacify.py", line 88, in <module>
    main()
  File "/home/kck/bin/pacify.py", line 79, in main
    pacify = Pacify(args)
  File "/home/kck/bin/pmodules.py", line 42, in __init__
    self.inbuffer = self.ReadHTML()
  File "/home/kck/bin/pmodules.py", line 608, in ReadHTML
    if tmpbuffer[-1] != u' ' and tmpbuffer[-1] != u'\n' and tmpbuffer[-1] != u'␢':
IndexError: list index out of range
The file in question is ~627 KB. pacify is going to be of limited use if it can't handle a file that size.
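Going only by the traceback, the IndexError comes from evaluating tmpbuffer[-1] while tmpbuffer is empty. A minimal sketch of a guard, assuming tmpbuffer is a list of characters; needs_separator is a hypothetical helper name, not something in pacify:

```python
# Hypothetical helper, not part of pacify: decides whether the last
# character in a buffer is something other than a blank.
BLANKS = (" ", "\n", "␢")  # the three characters tested in the traceback line


def needs_separator(tmpbuffer):
    """Return True if the buffer is non-empty and does not end in a blank."""
    if not tmpbuffer:
        # Empty buffer: tmpbuffer[-1] would raise IndexError here.
        return False
    return tmpbuffer[-1] not in BLANKS
```

Whether an empty buffer should count as "no separator needed" is a guess on my part; the point is only that the emptiness check has to come before the index.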
Split the above into chapters and tried again on individual chapters. Output for this chapter seemed pretty good. When converting from HTML, I would suggest putting at least a space between table cells, and perhaps a line break between rows as well.
Code:
pacify v0.4.0 (2009-09-13) - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)
Profile: en
Input file: imp-c_split_14.html
Output file: imp-c_split_14-pacified.txt
Log file: imp-c_split_14-pacified.log
Reading imp-c_split_14.html...
Starting on Mon, 21 Sep 2009 11:29:54
Cleaning formatting...
Starting on Mon, 21 Sep 2009 11:29:54
Finished on Mon, 21 Sep 2009 11:29:54
Analyzing text...
Starting on Mon, 21 Sep 2009 11:29:54
Simplifying linebreaks...
Analyzing whitespace patterns...
100.0% processed (0.0 MB of 0.0 MB)
Finished on Mon, 21 Sep 2009 11:29:54
Filesize:
19645
Whitespace analysis:
[(3492, 0.0), (37, 16.0), (8, 8.0)]
1777.55153983
18.8343089845
4.07228302367
... assumed to be a file with paragraph breaks.
Correcting quotation marks...
Enlightening text...
Done!
Writing imp-c_split_14-pacified.txt...
Done!
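To make the table suggestion above concrete, here is a rough sketch using Python's standard html.parser. It is not how pacify reads HTML (I haven't looked); TableTextExtractor and table_to_text are made-up names for illustration.

```python
from html.parser import HTMLParser


class TableTextExtractor(HTMLParser):
    """Hypothetical sketch: flatten HTML to text, separating cells and rows."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.parts.append(" ")   # space between cells
        elif tag == "tr":
            self.parts.append("\n")  # line break between rows


def table_to_text(html):
    parser = TableTextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Drop the trailing cell separator at the end of each row.
    return "\n".join(line.rstrip() for line in text.splitlines())
```

With this, a two-by-two table comes out as two lines of space-separated cells instead of one run-together string.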
Then I tried a plain text > LaTeX conversion. Didn't work:
Code:
pacify.py -i imp.txt -o latex
pacify v0.4.0 (2009-09-13) - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)
Profile: en
Input file: imp.txt
Output file: imp-pacified.tex
Log file: imp-pacified.log
Reading imp.txt...
Starting on Mon, 21 Sep 2009 11:35:30
Finished on Mon, 21 Sep 2009 11:35:30
Cleaning formatting...
Starting on Mon, 21 Sep 2009 11:35:30
Finished on Mon, 21 Sep 2009 11:35:30
Analyzing text...
Starting on Mon, 21 Sep 2009 11:35:30
Simplifying linebreaks...
Analyzing whitespace patterns...
100.0% processed (0.4 MB of 0.4 MB)
Finished on Mon, 21 Sep 2009 11:35:31
Filesize:
409959
Whitespace analysis:
[(72019, 0.0), (1063, 16.0), (1, 8.0)]
1756.73664927
25.9294222105
0.0243926831708
... assumed to be a file with paragraph breaks.
Correcting quotation marks...
Enlightening text...
Done!
Writing imp-pacified.tex...
Converting to LaTeX...
Replacing \'s
Replacing {'s
Replacing }'s
Replacing >'s
Replacing <'s
Replacing ~'s
Replacing ^'s
Replacing &'s
Replacing #'s
Replacing _'s
Replacing $'s
Replacing %'s
Formatting...
Traceback (most recent call last):
  File "/home/kck/bin/pacify.py", line 88, in <module>
    main()
  File "/home/kck/bin/pacify.py", line 79, in main
    pacify = Pacify(args)
  File "/home/kck/bin/pmodules.py", line 65, in __init__
    outfile.write(self.GetAsLaTeX().encode('utf-8'))
  File "/home/kck/bin/pmodules.py", line 225, in GetAsLaTeX
    if curFormat != self.inbuffer.format[idx]:
  File "/home/kck/bin/pmodules.py", line 989, in __ne__
    if self.isBold != other.isBold:
AttributeError: pString instance has no attribute 'isBold'
No clue what happened there. It did create a file, but the file was blank. There was no imp-pacified.log.
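One guess from the traceback: the comparison in __ne__ is handed an object (a pString) that never had isBold set, which might be what happens when the input is plain text with no formatting. A sketch of a more forgiving comparison follows. TextFormat is a hypothetical stand-in, not pacify's class, and isItalic is an attribute name I invented for the example:

```python
class TextFormat:
    """Hypothetical stand-in for a per-character format record."""

    # Only isBold appears in the traceback; isItalic is assumed for illustration.
    FLAGS = ("isBold", "isItalic")

    def __init__(self, isBold=False, isItalic=False):
        self.isBold = isBold
        self.isItalic = isItalic

    def __eq__(self, other):
        # Treat a flag missing on either object as False instead of raising.
        return all(
            getattr(self, flag, False) == getattr(other, flag, False)
            for flag in self.FLAGS
        )

    def __ne__(self, other):
        return not self.__eq__(other)
```

This only papers over the symptom. If the format list is not supposed to contain pString objects at all, the real fix is wherever that list gets built.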
Tried an HTML > LaTeX conversion. That worked. It would have been nice to get an HTML table > LaTeX tabular conversion, but perhaps that goes against what you're trying to do. If so, then at least a space between cells, and maybe a line break between rows, seems in order.
Code:
pacify.py -i imp-c_split_11.html -o latex
pacify v0.4.0 (2009-09-13) - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)
Profile: en
Input file: imp-c_split_11.html
Output file: imp-c_split_11-pacified.tex
Log file: imp-c_split_11-pacified.log
Reading imp-c_split_11.html...
Starting on Mon, 21 Sep 2009 11:44:05
Cleaning formatting...
Starting on Mon, 21 Sep 2009 11:44:05
Finished on Mon, 21 Sep 2009 11:44:05
Analyzing text...
Starting on Mon, 21 Sep 2009 11:44:05
Simplifying linebreaks...
Analyzing whitespace patterns...
100.0% processed (0.0 MB of 0.0 MB)
Finished on Mon, 21 Sep 2009 11:44:05
Filesize:
23759
Whitespace analysis:
[(4198, 0.0), (46, 16.0), (4, 8.0)]
1766.90938171
19.3610842207
1.68357254093
... assumed to be a file with paragraph breaks.
Correcting quotation marks...
Enlightening text...
Done!
Writing imp-c_split_11-pacified.tex...
Converting to LaTeX...
Replacing \'s
Replacing {'s
Replacing }'s
Replacing >'s
Replacing <'s
Replacing ~'s
Replacing ^'s
Replacing &'s
Replacing #'s
Replacing _'s
Replacing $'s
Replacing %'s
Formatting...
Done!
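For the table > tabular idea mentioned before this run, something along these lines is what I had in mind. It is only a sketch: rows_to_tabular is a made-up function, it assumes the cells have already been extracted into lists of strings and already had their special characters replaced (as in the "Replacing ..." steps in the log), and it left-aligns every column.

```python
def rows_to_tabular(rows):
    """Render rows (lists of already-escaped cell strings) as a LaTeX tabular."""
    if not rows:
        return ""
    ncols = max(len(row) for row in rows)
    lines = ["\\begin{tabular}{" + "l" * ncols + "}"]
    for row in rows:
        # Pad short rows so every line has the same number of columns.
        padded = list(row) + [""] * (ncols - len(row))
        lines.append(" & ".join(padded) + " \\\\")
    lines.append("\\end{tabular}")
    return "\n".join(lines)
```

Column alignment, rules, and spanning cells are all ignored here; those would need information from the HTML that a plain list of rows doesn't carry.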
Will do some more playing around when I have the leisure.