pacify.py (Text reformatter / RTF extractor) - Page 5

ahi · 09-14-2009, 11:10 PM

Preliminary version with support for HTML input... a bit rough, but ok for a preview.

Also does the quotation mark fixing, like older versions do (edit: although it appears to be a bit wonky--will fix).

Update: for better HTML support. Currently <Hx> tags are simply formatted as \textsc{} but I will soon add support to directly translate them to \chapter{} \section{} et cetera formatting in LaTeX.

Also, see the (unmodified) .txt and .tex files presently generated from this HTML file by this version of pacify.py.

- Ahi

ahi · 09-20-2009, 11:41 PM

If you are around, ekaser, I am mostly done with the rearchitecting.

I have a(n admittedly very simple) plugin architecture in place, where basically all functionality (excepting the core classes used by the processing) comes from plugins that are classified as either (1) an input plugin, (2) a language plugin, (3) a processing plugin, or (4) an output plugin.

It makes for very pleasantly clean development... albeit somewhat torturous command line handling, as the plugins (and the [command line based] choice of which specific plugins to use at runtime) are relevant in figuring out what are correct and what are erroneous command line arguments.

I have a plaintext, an RTF, and an HTML input plugin working already fairly well, and a plaintext output and HTML output plugin likewise functional, if a bit immature as yet. The language plugins are handled in such a way that (1) they can somewhat customize the pacify class('s running instance) to potentially alter other plugins' behavior, but mostly (2) preprocess the text right after it's read in from the input file in whatever language-specific way, and (3) post-process the text after all other plugins are done but before it is written to the output file.
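
A minimal sketch of how the four plugin kinds described above might fit together, assuming one method per role. Every class and method name here (Pipeline, read, configure, preprocess, process, postprocess, write) is a hypothetical illustration, not pacify.py's actual interface:

```python
import re


class PlainTextInput:
    """Input plugin: turns the raw file contents into text."""
    def read(self, raw):
        return raw


class EnglishLanguage:
    """Language plugin: may tweak the running instance, then pre/post-processes."""
    def configure(self, pipeline):
        pipeline.settings["profile"] = "en"

    def preprocess(self, text):
        # Runs right after the input plugin: normalise line endings.
        return text.replace("\r\n", "\n")

    def postprocess(self, text):
        # Runs after all processing plugins, just before output.
        return text.rstrip() + "\n"


class CollapseSpaces:
    """Processing plugin: squeezes runs of spaces and tabs into one space."""
    def process(self, text):
        return re.sub(r"[ \t]+", " ", text)


class PlainTextOutput:
    """Output plugin: renders the final text."""
    def write(self, text):
        return text


class Pipeline:
    def __init__(self, input_plugin, language, processors, output_plugin):
        self.settings = {}
        self.input_plugin = input_plugin
        self.language = language
        self.processors = processors
        self.output_plugin = output_plugin
        language.configure(self)  # language plugin customizes the instance

    def run(self, raw):
        text = self.input_plugin.read(raw)
        text = self.language.preprocess(text)
        for plugin in self.processors:
            text = plugin.process(text)
        text = self.language.postprocess(text)
        return self.output_plugin.write(text)


pipeline = Pipeline(PlainTextInput(), EnglishLanguage(),
                    [CollapseSpaces()], PlainTextOutput())
result = pipeline.run("Hello   world\r\nBye\r\n\r\n")
```

The point of the ordering in run() is that the language plugin brackets everything else: it sees the text first and last, while the processing plugins stay language-agnostic.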

Switching my development to Python 3 also got rid of the difficult-to-understand and (for me) seemingly impossible-to-definitively-correct Unicode-related errors my pacify script previously suffered from.

The only point of (as yet) shame is that I have not had the fortitude to fully implement my crazy text-as-database concept yet. My formatted text string class objects are being manipulated fairly directly.

I probably should bite the bullet and take my time to figure out both the spooling (which I am, to be honest, yet to fully wrap my mind around--any good "idiot's guide" level resources you can point me toward) and the text-as-database stuff... but I am just too impatient for practical results to do so.

On the upside, if and when I do get around to doing that stuff... I should be able to insert the necessary code fairly readily without having to make radical changes in too many places.

I'm not going to upload another version until it's able to produce reasonably useful output... but it's getting closer. I've decided to build categorization into the formatting stream as well... probably not incredibly efficient... but unless it starts to cause problems with even files just dozens of MB large, I'll probably stick with it for now (and once spooling is implemented, that should take care of the problem altogether). I am also thinking of implementing footnotes/endnotes (and perhaps annotations?) in the formatting stream too... but I'm now thinking I will not bother with links at all. I cannot think of any input documents (other than of the "choose your own adventure" variety, which is fairly rare) where existing links ought necessarily be respected, instead of new links being generated as warranted by the document's structure. (Albeit perhaps in HTML, there should be some ability to interpret links as footnotes when appropriate.)

Just wanted to share where I am and what I've done.

- Ahi

ekaser · 09-21-2009, 12:15 AM

Sounds like you've made some good progress! I've been gone most of this week camping, and just got back this afternoon. Amazing how far behind you can fall in just a few days' time!

As for the spooling stuff, can't think of any references off-hand. It's pretty much just a matter of 'virtualizing' (serializing) your memory out to disk into a series of temporary "working files", so that you're not trying to keep hundreds of megabytes in memory at once, just the data structures that point to them (as needed). But my programming library is pretty limited (I'm one of the original inventors of the Not-Invented-Here Syndrome...).

I hear what you're saying about "too impatient to see something working"! That tends to be my problem as well.

frabjous · 09-21-2009, 12:55 PM

Some preliminary results.

First tried an HTML > TXT pacification (is that a word?):

Code:

pacify.py -i imp.html -o txt

pacify v0.4.0 (2009-09-13) - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)

	Profile:	en
	Input file:	imp.html
	Output file:	imp-pacified.txt
	Log file:	imp-pacified.log

Reading imp.html...

	Starting on Mon, 21 Sep 2009 11:24:33

File too large to read fully into memory.

Traceback (most recent call last):
  File "/home/kck/bin/pacify.py", line 88, in <module>
    main()
  File "/home/kck/bin/pacify.py", line 79, in main
    pacify = Pacify(args)
  File "/home/kck/bin/pmodules.py", line 42, in __init__
    self.inbuffer = self.ReadHTML()
  File "/home/kck/bin/pmodules.py", line 608, in ReadHTML
    if tmpbuffer[-1] != u' ' and tmpbuffer[-1] != u'\n' and tmpbuffer[-1] != u'␢':
IndexError: list index out of range

The file in question is ~627 KB. pacify is going to be quite limited in its usefulness if I can't do a file that big.

Split the above into chapters, and tried again on individual chapters. Output for this chapter seemed to be pretty good. I would suggest putting at least a space between table cells when converting from HTML, and perhaps a space between cells and a line break between rows.

Code:

pacify v0.4.0 (2009-09-13) - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)

	Profile:	en
	Input file:	imp-c_split_14.html
	Output file:	imp-c_split_14-pacified.txt
	Log file:	imp-c_split_14-pacified.log

Reading imp-c_split_14.html...

	Starting on Mon, 21 Sep 2009 11:29:54

Cleaning formatting...

	Starting on Mon, 21 Sep 2009 11:29:54


	Finished on Mon, 21 Sep 2009 11:29:54

Analyzing text...

	Starting on Mon, 21 Sep 2009 11:29:54

	    Simplifying linebreaks...
	    Analyzing whitespace patterns...
	    100.0% processed (0.0 MB of 0.0 MB)

	Finished on Mon, 21 Sep 2009 11:29:54

	Filesize:
	19645

	Whitespace analysis:
	[(3492, 0.0), (37, 16.0), (8, 8.0)]

	1777.55153983
	18.8343089845
	4.07228302367

	... assumed to be a file with paragraph breaks.

Correcting quotation marks...

	Enlightening text...

	Done!

Writing imp-c_split_14-pacified.txt...

	Done!
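
The table suggestion above (a space between cells, a line break between rows) could be sketched roughly as follows with the standard library's html.parser. This is only an illustration of the idea, not how pacify's ReadHTML works; the class and function names are made up for the example:

```python
from html.parser import HTMLParser


class TableAwareText(HTMLParser):
    """Collects text, adding a space after each cell and a newline after each row."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.parts.append(" ")   # separate neighbouring cells
        elif tag == "tr":
            self.parts.append("\n")  # one table row per line

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    parser = TableAwareText()
    parser.feed(html)
    parser.close()
    lines = "".join(parser.parts).splitlines()
    # Drop the separator left dangling after the last cell of each row.
    return "\n".join(line.rstrip() for line in lines)


table = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td><td>d</td></tr></table>"
text = html_to_text(table)
```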

Then I tried a plain text > latex conversion. Didn't work:

Code:

pacify.py -i imp.txt -o latex

pacify v0.4.0 (2009-09-13) - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)

	Profile:	en
	Input file:	imp.txt
	Output file:	imp-pacified.tex
	Log file:	imp-pacified.log

Reading imp.txt...

	Starting on Mon, 21 Sep 2009 11:35:30

	Finished on Mon, 21 Sep 2009 11:35:30

Cleaning formatting...

	Starting on Mon, 21 Sep 2009 11:35:30


	Finished on Mon, 21 Sep 2009 11:35:30

Analyzing text...

	Starting on Mon, 21 Sep 2009 11:35:30

	    Simplifying linebreaks...
	    Analyzing whitespace patterns...
	    100.0% processed (0.4 MB of 0.4 MB)

	Finished on Mon, 21 Sep 2009 11:35:31

	Filesize:
	409959

	Whitespace analysis:
	[(72019, 0.0), (1063, 16.0), (1, 8.0)]

	1756.73664927
	25.9294222105
	0.0243926831708

	... assumed to be a file with paragraph breaks.

Correcting quotation marks...

	Enlightening text...

	Done!

Writing imp-pacified.tex...

	Converting to LaTeX...

		Replacing \'s
		Replacing {'s
		Replacing }'s
		Replacing >'s
		Replacing <'s
		Replacing ~'s
		Replacing ^'s
		Replacing &'s
		Replacing #'s
		Replacing _'s
		Replacing $'s
		Replacing %'s

		Formatting...

Traceback (most recent call last):
  File "/home/kck/bin/pacify.py", line 88, in <module>
    main()
  File "/home/kck/bin/pacify.py", line 79, in main
    pacify = Pacify(args)
  File "/home/kck/bin/pmodules.py", line 65, in __init__
    outfile.write(self.GetAsLaTeX().encode('utf-8'))
  File "/home/kck/bin/pmodules.py", line 225, in GetAsLaTeX
    if curFormat != self.inbuffer.format[idx]:
  File "/home/kck/bin/pmodules.py", line 989, in __ne__
    if self.isBold != other.isBold:
AttributeError: pString instance has no attribute 'isBold'

No clue what happened there. It did create a file, but the file was blank. There was no imp-pacified.log.

Tried an HTML > LaTeX conversion. That worked. It would have been nice to get an HTML table > LaTeX tabular conversion, but perhaps that goes against what you're trying to do. If so, then at least a space between cells, and maybe a line break between rows, seems in order.

Code:

pacify.py -i imp-c_split_11.html -o latex

pacify v0.4.0 (2009-09-13) - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)

	Profile:	en
	Input file:	imp-c_split_11.html
	Output file:	imp-c_split_11-pacified.tex
	Log file:	imp-c_split_11-pacified.log

Reading imp-c_split_11.html...

	Starting on Mon, 21 Sep 2009 11:44:05

Cleaning formatting...

	Starting on Mon, 21 Sep 2009 11:44:05


	Finished on Mon, 21 Sep 2009 11:44:05

Analyzing text...

	Starting on Mon, 21 Sep 2009 11:44:05

	    Simplifying linebreaks...
	    Analyzing whitespace patterns...
	    100.0% processed (0.0 MB of 0.0 MB)

	Finished on Mon, 21 Sep 2009 11:44:05

	Filesize:
	23759

	Whitespace analysis:
	[(4198, 0.0), (46, 16.0), (4, 8.0)]

	1766.90938171
	19.3610842207
	1.68357254093

	... assumed to be a file with paragraph breaks.

Correcting quotation marks...

	Enlightening text...

	Done!

Writing imp-c_split_11-pacified.tex...

	Converting to LaTeX...

		Replacing \'s
		Replacing {'s
		Replacing }'s
		Replacing >'s
		Replacing <'s
		Replacing ~'s
		Replacing ^'s
		Replacing &'s
		Replacing #'s
		Replacing _'s
		Replacing $'s
		Replacing %'s

		Formatting...

	Done!
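
The "Replacing ...'s" steps in the log above suggest that each LaTeX special character gets escaped in turn. A minimal sketch of doing that in a single pass, so that the braces introduced by one replacement are not escaped again by a later one. The replacement commands chosen here are an assumption, not necessarily what pacify emits:

```python
# Map each special character from the log to a LaTeX-safe replacement.
LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "{": r"\{",
    "}": r"\}",
    ">": r"\textgreater{}",
    "<": r"\textless{}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
    "&": r"\&",
    "#": r"\#",
    "_": r"\_",
    "$": r"\$",
    "%": r"\%",
}
_ESCAPE_TABLE = str.maketrans(LATEX_ESCAPES)


def escape_latex(text):
    """Escape every special character in one pass over the text."""
    return text.translate(_ESCAPE_TABLE)


escaped = escape_latex("50% of $5 & {x}_1 \\ y")
```

Because str.translate looks at each input character exactly once, the order of the table entries does not matter, unlike a chain of str.replace calls.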

Will do some more playing around when I have the leisure.

ahi · 09-21-2009, 01:13 PM

Hi, Frabjous!

I would suggest holding off until I put up the next version. The haphazard unicode errors are basically gone as of the development version I am currently working on.

The filesize thing is weird... the 600 KB file certainly did not cause a memory issue, but whatever the issue was got misreported as such.

I have successfully processed 700+ MB (nearly 1 GB) files with pacify.py before... and once I implement spooling (which I actually think I will do sooner rather than later after all), file size will be a non-issue so long as you have both sufficient memory and disk space.

And yes, the HTML parsing needs to take tables and such properly into account... along with a few other things.

I'll keep everyone updated via this thread...

- Ahi

kovidgoyal · 09-21-2009, 01:15 PM

@ahi: Look at SpooledTemporaryFile in the python tempfile module
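
For reference, a minimal sketch of what that might look like, assuming the text arrives in chunks. tempfile.SpooledTemporaryFile keeps the data in memory until it grows past max_size and then moves it to a real temporary file on disk; the helper name spool_text and the tiny 16-byte threshold are just for illustration:

```python
import tempfile


def spool_text(chunks, max_in_memory=1024 * 1024):
    """Write chunks to a spooled file that spills to disk past max_in_memory."""
    spool = tempfile.SpooledTemporaryFile(
        max_size=max_in_memory, mode="w+", encoding="utf-8"
    )
    for chunk in chunks:
        spool.write(chunk)
    spool.seek(0)  # rewind so the caller can read from the start
    return spool


# A deliberately small threshold, so this example rolls over to disk.
with spool_text(["first paragraph\n", "second paragraph\n"], max_in_memory=16) as spool:
    lines = [line.rstrip("\n") for line in spool]
```

The calling code reads the spool like any other file object, so it does not need to know whether the data ended up in memory or on disk.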

ahi · 09-21-2009, 01:23 PM

Quote:

Originally Posted by kovidgoyal

@ahi: Look at SpooledTemporaryFile in the python tempfile module

Thanks!

import antigravity indeed!

On the topic of RTF parsing, Kovid. If I recall correctly you need HTML returned by the RTF parser... is that right?

The current one you use for calibre... how... precise is it? The parser I am working on purposely limits the complexity of formatting it cares about. Would you still have any use for such a partial RTF parsing engine, as a (re)starting point?

- Ahi

ahi · 09-21-2009, 01:41 PM

Quote:

Originally Posted by kovidgoyal

@ahi: Look at SpooledTemporaryFile in the python tempfile module

I've stupidly recreated some of the functionality of Python's logging module too.

- Ahi

kovidgoyal · 09-21-2009, 02:01 PM

The python logging module at least in 2.x has various unicode related bugs, so I had to implement a custom logging module for calibre as well.

I recently refactored the calibre RTF parser to fix its speed issues, so I'm ok for now. Incidentally, the RTF parser actually outputs XML which I then convert to HTML using an XSLT stylesheet. It handles a pretty large subset of the RTF 1.5 spec including embedded images and so on. My philosophy in calibre is to accept as large an input set as possible and do *something* with it, even if that something may not be optimal, so until your RTF parser handles a larger set of input, I think I'll pass.

ahi · 09-21-2009, 02:04 PM

Quote:

Originally Posted by kovidgoyal

The python logging module at least in 2.x has various unicode related bugs, so I had to implement a custom logging module for calibre as well.

I recently refactored the calibre RTF parser to fix its speed issues, so I'm ok for now. Incidentally, the RTF parser actually outputs XML which I then convert to HTML using an XSLT stylesheet. It handles a pretty large subset of the RTF 1.5 spec including embedded images and so on. My philosophy in calibre is to accept as large an input set as possible and do *something* with it, even if that something may not be optimal, so until your RTF parser handles a larger set of input, I think I'll pass.

Makes sense.

In your custom logging module, do you have something that reliably down-converts unicode to plain ascii that can be printed even to dumb DOS terminals?

- Ahi

kovidgoyal · 09-21-2009, 02:07 PM

calibre has a general unicode->ascii converter that reliably downconverts unicode to ascii that (looks like) the unicode. This is thanks to user_none by the way.

ahi · 09-21-2009, 03:34 PM

Quote:

Originally Posted by kovidgoyal

calibre has a general unicode->ascii converter that reliably downconverts unicode to ascii that (looks like) the unicode. This is thanks to user_none by the way.

Must have involved a good bit of (overly but not sufficiently) mindless cutting and pasting... or is there a better way?

Is it GPL (or similarly) licensed?

- Ahi

ahi · 09-21-2009, 03:50 PM

... actually, there is unicodedata.normalize. Oh, Python...
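
A minimal sketch of that approach: decompose accented characters with unicodedata.normalize and then drop whatever is not ASCII. The helper name to_ascii is made up for the example:

```python
import unicodedata


def to_ascii(text):
    """Best-effort down-conversion of Unicode text to plain ASCII."""
    # NFKD splits e.g. an accented letter into base letter + combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Anything still outside ASCII (the combining accents, among others) is dropped.
    return decomposed.encode("ascii", "ignore").decode("ascii")


plain = to_ascii("caf\u00e9 na\u00efve")
```

Note that this only helps for characters that decompose into an ASCII base letter; characters with no such decomposition are silently removed, which is presumably why a hand-built transliteration table like calibre's can still do better.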

- Ahi

kovidgoyal · 09-21-2009, 04:26 PM

All of calibre is GPL

ahi · 09-24-2009, 10:13 AM

Having some difficulty getting my LaTeX output algorithm right...

The internal representation is, of course, plaintext but with the presence or absence of (for simplicity's sake, let us say) bold/italic/smallcap/underline formatting indicated for every character of text.

My first, somewhat naive approach was something like this (I simplify considerably below, of course):

Code:

output = ""
for idx in range(len(manuscript.text)):
    cur = manuscript.format[idx]
    # Nothing is bold or italic before the first character.
    prev_bold = idx > 0 and manuscript.format[idx-1].bold
    prev_italic = idx > 0 and manuscript.format[idx-1].italic

    if prev_bold and not cur.bold:
        output += "}"
    elif not prev_bold and cur.bold:
        output += "\\textbf{"

    if prev_italic and not cur.italic:
        output += "}"
    elif not prev_italic and cur.italic:
        output += "\\textit{"

    output += manuscript.text[idx]

The problem is that the closing braces are not identified with a specific type of opening brace... and the moment formatting is not cleanly nested, it results in incorrect code.

e.g.:

Code:


T  h  i  s     i  s     i  n  d  e  e  d     a     s  t  r  a  n  g  e     i  d  e  a  !  
-- -- -- -- -- -I -I -I BI BI B- B- B- B- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

would generate the following LaTeX output from the above algorithm:

Code:

This \textit{is \textbf{in}deed} a strange idea!

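resulting in:

This is indeed a strange idea! (rendered with all of "is indeed" in italics and only "in" in bold)

instead of the desired:

This is indeed a strange idea! (with "is in" in italics and the whole word "indeed" in bold)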

Basically the \textit's closing brace actually closes the \textbf and vice versa.

---

I have since tried some more complicated variations, but they've either not yielded the correct output, or did so using very ugly (unnecessarily complicated) LaTeX code.

Basically the correct output for the above example would be:

Code:

This \textit{is} \textbf{\textit{in}deed} a strange idea!

... but (lest somebody suggest just having white-space force brace closures and reissuing the commands when non-white-space characters continue) if the whole word "indeed" were in italic (as well as in bold), the correct output would be:

Code:

This \textit{is \textbf{indeed}} a strange idea!

If anybody can nudge me toward an elegant solution that involves less spaghetti code than I've been throwing at it so far, I'd be most grateful!

Note: Well, not literally spaghetti code, of course... but almost as confusing looking.

- Ahi
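
One possible way to approach the question above, offered as a sketch rather than a definitive answer: keep a stack of the currently open commands, and at every character work out the nesting you would like, with the longest-lasting format outermost. Whatever the stack and the desired nesting share at the outer end stays open; everything above that is closed, and the missing commands are opened. The function names, the COMMANDS table and the "one set of flags per character" representation are assumptions made for the example, not pacify's internals:

```python
COMMANDS = {"B": r"\textbf{", "I": r"\textit{"}


def parse_flags(row):
    """Turn cells like '--', '-I', 'BI' from the diagram above into sets of flags."""
    return [set(cell) - {"-"} for cell in row.split()]


def to_latex(text, formats):
    """text: str; formats: one set of flags (keys of COMMANDS) per character."""
    n = len(text)
    # remaining[i][f]: how many consecutive characters from i onward carry f.
    remaining = [dict() for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for f in formats[i]:
            remaining[i][f] = remaining[i + 1].get(f, 0) + 1

    stack = []  # currently open commands, outermost first
    out = []
    for i, ch in enumerate(text):
        def key(f):
            # Longest-lasting format outermost; ties keep the existing stack
            # order so nothing is closed and reopened needlessly.
            pos = stack.index(f) if f in stack else len(stack)
            return (-remaining[i][f], pos, f)

        desired = sorted(formats[i], key=key)
        keep = 0  # length of the outer part shared by stack and desired
        while (keep < len(stack) and keep < len(desired)
               and stack[keep] == desired[keep]):
            keep += 1
        out.append("}" * (len(stack) - keep))             # close what differs
        out.extend(COMMANDS[f] for f in desired[keep:])   # open what is missing
        stack = desired
        out.append(ch)
    out.append("}" * len(stack))  # close anything still open at the end
    return "".join(out)


text = "This is indeed a strange idea!"
overlapping = parse_flags("-- " * 5 + "-I " * 3 + "BI " * 2 + "B- " * 4 + "-- " * 16)
nested = parse_flags("-- " * 5 + "-I " * 3 + "BI " * 6 + "-- " * 16)

overlapping_tex = to_latex(text, overlapping)
nested_tex = to_latex(text, nested)
```

For the overlapping example this gives `This \textit{is }\textbf{\textit{in}deed} a strange idea!` (the space after "is" stays inside the italics because the diagram marks it italic), and for the cleanly nested variant it gives `This \textit{is \textbf{indeed}} a strange idea!`. Since the format that ends soonest is always innermost, a format ending never forces anything else to be closed and reopened; that only happens when a new, longer-lasting format starts.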
