09-21-2009, 11:55 AM   #64
frabjous
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
Some preliminary results.

First tried an HTML > TXT pacification (is that a word?):

Code:
pacify.py -i imp.html -o txt

pacify v0.4.0 (2009-09-13) - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)

	Profile:	en
	Input file:	imp.html
	Output file:	imp-pacified.txt
	Log file:	imp-pacified.log

Reading imp.html...

	Starting on Mon, 21 Sep 2009 11:24:33

File too large to read fully into memory.

Traceback (most recent call last):
  File "/home/kck/bin/pacify.py", line 88, in <module>
    main()
  File "/home/kck/bin/pacify.py", line 79, in main
    pacify = Pacify(args)
  File "/home/kck/bin/pmodules.py", line 42, in __init__
    self.inbuffer = self.ReadHTML()
  File "/home/kck/bin/pmodules.py", line 608, in ReadHTML
    if tmpbuffer[-1] != u' ' and tmpbuffer[-1] != u'\n' and tmpbuffer[-1] != u'␢':
IndexError: list index out of range
The file in question is ~627 KB. pacify is going to be of quite limited use if it can't handle a file that size.
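
Reading the traceback, the IndexError comes from tmpbuffer[-1] being evaluated on an empty list, presumably because nothing was read after the "too large" message. A minimal sketch of a guard for that check follows. The helper name needs_separator is hypothetical and this is not pacify's actual code; it is written for Python 3 and only mirrors the three characters tested in the traceback.

```python
def needs_separator(tmpbuffer):
    """Return True if the buffer does not already end in a separator.

    Hypothetical helper, not taken from pacify: it repeats the comparison
    shown in the traceback but treats an empty buffer as a special case.
    """
    if not tmpbuffer:
        # Empty list: tmpbuffer[-1] would raise IndexError, so bail out early.
        return False
    # u'\u2422' is the blank symbol that appears in the traceback line.
    return tmpbuffer[-1] not in (u' ', u'\n', u'\u2422')


print(needs_separator([]))              # False
print(needs_separator([u'a', u' ']))    # False
print(needs_separator([u'a', u'b']))    # True
```

This only avoids the crash. Whether an empty buffer should be an error in its own right is a separate question for the reading step.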

Split the above into chapters, and tried again on individual chapters. Output for this chapter seemed to be pretty good. I would suggest putting at least a space between table cells when converting from HTML, and preferably a line break between rows as well.
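
To make the table suggestion concrete, here is a rough sketch of flattening an HTML table to text with a space between cells and a line break between rows. It uses Python 3's standard html.parser module; the class TableText and the function table_to_text are made-up names for illustration and say nothing about how pacify handles HTML internally.

```python
from html.parser import HTMLParser


class TableText(HTMLParser):
    """Collect table cells so rows can be joined with spaces and newlines."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self.row = None    # cells of the row being read, or None outside a row
        self.cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self.cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            # Collapse runs of whitespace inside the cell to single spaces.
            self.row.append(" ".join("".join(self.cell).split()))
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)


def table_to_text(html):
    """Return the table as text: cells split by a space, rows by a newline."""
    parser = TableText()
    parser.feed(html)
    parser.close()
    return "\n".join(" ".join(row) for row in parser.rows)


sample = ("<table><tr><th>Name</th><th>Value</th></tr>"
          "<tr><td>alpha</td><td>1</td></tr></table>")
print(table_to_text(sample))
```

For the sample above this prints "Name Value" on one line and "alpha 1" on the next. Nested tables and cells missing their closing tags are not handled.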

Code:
pacify v0.4.0 (2009-09-13) - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)

	Profile:	en
	Input file:	imp-c_split_14.html
	Output file:	imp-c_split_14-pacified.txt
	Log file:	imp-c_split_14-pacified.log

Reading imp-c_split_14.html...

	Starting on Mon, 21 Sep 2009 11:29:54

Cleaning formatting...

	Starting on Mon, 21 Sep 2009 11:29:54


	Finished on Mon, 21 Sep 2009 11:29:54

Analyzing text...

	Starting on Mon, 21 Sep 2009 11:29:54

	    Simplifying linebreaks...
	    Analyzing whitespace patterns...
	    100.0% processed (0.0 MB of 0.0 MB)

	Finished on Mon, 21 Sep 2009 11:29:54

	Filesize:
	19645

	Whitespace analysis:
	[(3492, 0.0), (37, 16.0), (8, 8.0)]

	1777.55153983
	18.8343089845
	4.07228302367

	... assumed to be a file with paragraph breaks.

Correcting quotation marks...

	Enlightening text...

	Done!

Writing imp-c_split_14-pacified.txt...

	Done!
Then I tried a plain text > LaTeX conversion. Didn't work:

Code:
pacify.py -i imp.txt -o latex

pacify v0.4.0 (2009-09-13) - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)

	Profile:	en
	Input file:	imp.txt
	Output file:	imp-pacified.tex
	Log file:	imp-pacified.log

Reading imp.txt...

	Starting on Mon, 21 Sep 2009 11:35:30

	Finished on Mon, 21 Sep 2009 11:35:30

Cleaning formatting...

	Starting on Mon, 21 Sep 2009 11:35:30


	Finished on Mon, 21 Sep 2009 11:35:30

Analyzing text...

	Starting on Mon, 21 Sep 2009 11:35:30

	    Simplifying linebreaks...
	    Analyzing whitespace patterns...
	    100.0% processed (0.4 MB of 0.4 MB)

	Finished on Mon, 21 Sep 2009 11:35:31

	Filesize:
	409959

	Whitespace analysis:
	[(72019, 0.0), (1063, 16.0), (1, 8.0)]

	1756.73664927
	25.9294222105
	0.0243926831708

	... assumed to be a file with paragraph breaks.

Correcting quotation marks...

	Enlightening text...

	Done!

Writing imp-pacified.tex...

	Converting to LaTeX...

		Replacing \'s
		Replacing {'s
		Replacing }'s
		Replacing >'s
		Replacing <'s
		Replacing ~'s
		Replacing ^'s
		Replacing &'s
		Replacing #'s
		Replacing _'s
		Replacing $'s
		Replacing %'s

		Formatting...

Traceback (most recent call last):
  File "/home/kck/bin/pacify.py", line 88, in <module>
    main()
  File "/home/kck/bin/pacify.py", line 79, in main
    pacify = Pacify(args)
  File "/home/kck/bin/pmodules.py", line 65, in __init__
    outfile.write(self.GetAsLaTeX().encode('utf-8'))
  File "/home/kck/bin/pmodules.py", line 225, in GetAsLaTeX
    if curFormat != self.inbuffer.format[idx]:
  File "/home/kck/bin/pmodules.py", line 989, in __ne__
    if self.isBold != other.isBold:
AttributeError: pString instance has no attribute 'isBold'
No clue what happened there. It did create the output file, but the file was blank, and there was no imp-pacified.log.
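
All the traceback really shows is that a pString reached the comparison in __ne__ without an isBold attribute. Below is one possible way to make such a comparison tolerant of that, as a sketch only. FormatFlags and Plain are hypothetical stand-ins, isItalic is an invented second flag, and none of this is known to match how pacify's classes are built. Python 3.

```python
class FormatFlags(object):
    """Hypothetical stand-in for an object that carries formatting flags."""

    def __init__(self, isBold=False, isItalic=False):
        # Set every flag up front so an instance never lacks an attribute.
        self.isBold = isBold
        self.isItalic = isItalic   # invented flag, for illustration only

    def __eq__(self, other):
        # getattr with a default treats a missing flag as "not set"
        # instead of raising AttributeError.
        return (self.isBold == getattr(other, "isBold", False)
                and self.isItalic == getattr(other, "isItalic", False))

    def __ne__(self, other):
        return not self.__eq__(other)


class Plain(object):
    """An object with no formatting attributes at all."""


print(FormatFlags(isBold=True) != Plain())   # True
print(FormatFlags() != Plain())              # False
```

Defaulting a missing flag to False is a design choice. It hides the underlying problem (an object that was never given its format), so the real fix may belong wherever the plain-text reader builds its buffer.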

Tried an HTML > LaTeX conversion. That worked. It would have been nice to get an HTML table > LaTeX tabular conversion, but perhaps that goes against what you're trying to do. If so, then at least a space between cells, and maybe a line break between rows, seems in order.
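
As an illustration of the tabular idea, the sketch below turns rows of cell text (already extracted from the HTML) into a LaTeX tabular environment. The names escape and rows_to_tabular are hypothetical, every column is simply left-aligned, and the escaping table is a guess at a reasonable minimum, not pacify's own replacement list. Python 3.

```python
# Characters that need escaping in ordinary LaTeX text.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "{": r"\{", "}": r"\}",
    "&": r"\&", "#": r"\#", "_": r"\_", "$": r"\$", "%": r"\%",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def escape(text):
    """Escape LaTeX specials one character at a time.

    A single pass means backslashes added by one replacement are
    never escaped again by another.
    """
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def rows_to_tabular(rows):
    """Build a tabular environment from a list of rows of cell strings."""
    ncols = max(len(row) for row in rows)
    lines = [r"\begin{tabular}{" + "l" * ncols + "}"]
    for row in rows:
        # Pad short rows with empty cells so every row has ncols columns.
        cells = [escape(c) for c in row] + [""] * (ncols - len(row))
        lines.append(" & ".join(cells) + r" \\")
    lines.append(r"\end{tabular}")
    return "\n".join(lines)


print(rows_to_tabular([["Item", "Cost"], ["Paper & ink", "5%"]]))
```

Column spans, nested tables and alignment hints are ignored here, and rows must be non-empty.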

Code:
pacify.py -i imp-c_split_11.html -o latex

pacify v0.4.0 (2009-09-13) - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)

	Profile:	en
	Input file:	imp-c_split_11.html
	Output file:	imp-c_split_11-pacified.tex
	Log file:	imp-c_split_11-pacified.log

Reading imp-c_split_11.html...

	Starting on Mon, 21 Sep 2009 11:44:05

Cleaning formatting...

	Starting on Mon, 21 Sep 2009 11:44:05


	Finished on Mon, 21 Sep 2009 11:44:05

Analyzing text...

	Starting on Mon, 21 Sep 2009 11:44:05

	    Simplifying linebreaks...
	    Analyzing whitespace patterns...
	    100.0% processed (0.0 MB of 0.0 MB)

	Finished on Mon, 21 Sep 2009 11:44:05

	Filesize:
	23759

	Whitespace analysis:
	[(4198, 0.0), (46, 16.0), (4, 8.0)]

	1766.90938171
	19.3610842207
	1.68357254093

	... assumed to be a file with paragraph breaks.

Correcting quotation marks...

	Enlightening text...

	Done!

Writing imp-c_split_11-pacified.tex...

	Converting to LaTeX...

		Replacing \'s
		Replacing {'s
		Replacing }'s
		Replacing >'s
		Replacing <'s
		Replacing ~'s
		Replacing ^'s
		Replacing &'s
		Replacing #'s
		Replacing _'s
		Replacing $'s
		Replacing %'s

		Formatting...

	Done!
Will do some more playing around when I have the leisure.