Old 09-14-2009, 10:10 PM   #61
ahi
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
Preliminary version with support for HTML input... a bit rough, but ok for a preview.

Also does the quotation mark fixing, like older versions do (edit: although it appears to be a bit wonky--will fix).

Update: for better HTML support. Currently <Hx> tags are simply formatted as \textsc{} but I will soon add support to directly translate them to \chapter{} \section{} et cetera formatting in LaTeX.

Also, see the (unmodified) .txt and .tex files presently generated from this HTML file by this version of pacify.py.

- Ahi
Attached Files
File Type: txt eves_diary-pacified.txt (52.4 KB, 231 views)
File Type: txt eves_diary-pacified_tex.txt (52.6 KB, 217 views)
File Type: zip pacify.zip (6.9 KB, 290 views)

Last edited by ahi; 09-16-2009 at 10:39 PM.
Old 09-20-2009, 10:41 PM   #62
ahi
Wizard
If you are around, ekaser, I am mostly done with the rearchitecting.

I have a(n admittedly very simple) plugin architecture in place, where basically all functionality (excepting the core classes used by the processing) comes from plugins that are classified as either (1) an input plugin, (2) a language plugin, (3) a processing plugin, or (4) an output plugin.

It makes for very pleasantly clean development... albeit somewhat torturous command line handling, as the plugins (and the [command line based] choice of which specific plugins to use at runtime) are relevant in figuring out what are correct and what are erroneous command line arguments.

I have a plaintext, an RTF, and an HTML input plugin working fairly well already, and a plaintext output and an HTML output plugin likewise functional, if a bit immature as yet. The language plugins are handled in such a way that (1) they can somewhat customize the pacify class('s running instance) to potentially alter other plugins' behavior, but mostly (2) preprocess the text right after it's read in from the input file in whatever language-specific way, and (3) post-process the text after all other plugins are done but before it is written to the output file.
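
For readers following along, the flow described above might look roughly like the sketch below. It is only an illustration under assumptions: `LanguagePlugin`, `EnglishPlugin`, `Pipeline`, `squeeze_spaces` and the `configure`/`preprocess`/`postprocess` hooks are hypothetical names, not taken from pacify.py, and the input, processing, and output plugins are reduced to plain callables.

```python
class LanguagePlugin:
    """Hypothetical language plugin with the three hooks described above."""

    def configure(self, pipeline):
        """(1) Optionally adjust the running pipeline before processing."""

    def preprocess(self, text):
        """(2) Runs right after the input plugin has read the text."""
        return text

    def postprocess(self, text):
        """(3) Runs after all processing plugins, before output."""
        return text


class EnglishPlugin(LanguagePlugin):
    def preprocess(self, text):
        return text.replace("\r\n", "\n")  # normalize line endings

    def postprocess(self, text):
        return text.rstrip() + "\n"  # exactly one trailing newline


class Pipeline:
    def __init__(self, read, language, processors, write):
        self.read, self.write = read, write
        self.language = language
        self.processors = list(processors)
        language.configure(self)  # the language plugin may tweak the pipeline

    def run(self, source):
        text = self.language.preprocess(self.read(source))
        for process in self.processors:
            text = process(text)
        return self.write(self.language.postprocess(text))


def squeeze_spaces(text):
    """Stand-in processing plugin: collapse runs of spaces."""
    while "  " in text:
        text = text.replace("  ", " ")
    return text


# Identity callables stand in for real input and output plugins.
pipeline = Pipeline(read=lambda s: s, language=EnglishPlugin(),
                    processors=[squeeze_spaces], write=lambda t: t)
result = pipeline.run("Hello,  world\r\n\r\n")
```

Keeping the language hooks on either side of the processing loop is what lets the other plugins stay language-agnostic in this sketch.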

Switching my development to Python 3 also got rid of difficult to understand and (for me) seemingly impossible to definitively correct unicode related errors my pacify script previously suffered from.

The only point of (as yet) shame is that I have not had the fortitude to fully implement my crazy text-as-database concept yet. My formatted text string class objects are being manipulated fairly directly.

I probably should bite the bullet and take my time to figure out both the spooling (which I am, to be honest, yet to fully wrap my mind around--any good "idiot's guide" level resources you can point me toward?) and the text-as-database stuff... but I am just too impatient for practical results to do so.

On the upside, if and when I do get around to doing that stuff... I should be able to insert the necessary code fairly readily without having to make radical changes in too many places.

I'm not going to upload another version until it's able to produce reasonably useful output... but it's getting closer. I've decided to build categorization into the formatting stream as well... probably not incredibly efficient... but unless it starts to cause problems with even files just dozens of MB large, I'll probably stick with it for now (and once spooling is implemented, that should take care of the problem altogether). I am also thinking of implementing footnotes/endnotes (and perhaps annotations?) in the formatting stream too... but I'm now thinking I will not bother with links at all. I cannot think of any input documents (other than of the "choose your own adventure" variety, which is fairly rare) where existing links ought necessarily be respected, instead of new links being generated as warranted by the document's structure. (Albeit perhaps in HTML, there should be some ability to interpret links as footnotes when appropriate.)

Just wanted to share where I am and what I've done.

- Ahi
Old 09-20-2009, 11:15 PM   #63
ekaser
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
Sounds like you've made some good progress! I've been gone most of this week camping, and just got back this afternoon. Amazing how far behind you can fall in just a few days' time!

As for the spooling stuff, can't think of any references off-hand. It's pretty much just a matter of 'virtualizing' (serializing) your memory out to disk into a series of temporary "working files", so that you're not trying to keep hundreds of megabytes in memory at once, just the data structures that point to them (as needed). But my programming library is pretty limited (I'm one of the original inventors of the Not-Invented-Here Syndrome... ).

I hear what you're saying about "too impatient to see something working"! That tends to be my problem as well.
Old 09-21-2009, 11:55 AM   #64
frabjous
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
Some preliminary results.

First tried an HTML > TXT pacification (is that a word?):

Code:
pacify.py -i imp.html -o txt

pacify v0.4.0 (2009-09-13) - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)

	Profile:	en
	Input file:	imp.html
	Output file:	imp-pacified.txt
	Log file:	imp-pacified.log

Reading imp.html...

	Starting on Mon, 21 Sep 2009 11:24:33

File too large to read fully into memory.

Traceback (most recent call last):
  File "/home/kck/bin/pacify.py", line 88, in <module>
    main()
  File "/home/kck/bin/pacify.py", line 79, in main
    pacify = Pacify(args)
  File "/home/kck/bin/pmodules.py", line 42, in __init__
    self.inbuffer = self.ReadHTML()
  File "/home/kck/bin/pmodules.py", line 608, in ReadHTML
    if tmpbuffer[-1] != u' ' and tmpbuffer[-1] != u'\n' and tmpbuffer[-1] != u'␢':
IndexError: list index out of range
The file in question is ~627 KB. pacify is going to be quite limited in its usefulness if I can't do a file that big.
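
The last frame of the traceback is `tmpbuffer[-1]` raising IndexError, which is what Python does when the list being indexed is empty. A minimal sketch of a guard, assuming `tmpbuffer` is a list of characters; the helper name `ends_with_separator` is hypothetical and not part of pacify.py:

```python
def ends_with_separator(tmpbuffer, separators=(" ", "\n")):
    """Return True only if the buffer is non-empty and ends in a separator.

    Checking for emptiness first avoids the IndexError that tmpbuffer[-1]
    raises on an empty list.
    """
    return bool(tmpbuffer) and tmpbuffer[-1] in separators
```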

I split the above into chapters and tried again on individual chapters. Output for this chapter seemed pretty good. When converting from HTML, I would suggest putting at least a space between table cells, and perhaps a line break between rows as well.

Code:
pacify v0.4.0 (2009-09-13) - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)

	Profile:	en
	Input file:	imp-c_split_14.html
	Output file:	imp-c_split_14-pacified.txt
	Log file:	imp-c_split_14-pacified.log

Reading imp-c_split_14.html...

	Starting on Mon, 21 Sep 2009 11:29:54

Cleaning formatting...

	Starting on Mon, 21 Sep 2009 11:29:54


	Finished on Mon, 21 Sep 2009 11:29:54

Analyzing text...

	Starting on Mon, 21 Sep 2009 11:29:54

	    Simplifying linebreaks...
	    Analyzing whitespace patterns...
	    100.0% processed (0.0 MB of 0.0 MB)

	Finished on Mon, 21 Sep 2009 11:29:54

	Filesize:
	19645

	Whitespace analysis:
	[(3492, 0.0), (37, 16.0), (8, 8.0)]

	1777.55153983
	18.8343089845
	4.07228302367

	... assumed to be a file with paragraph breaks.

Correcting quotation marks...

	Enlightening text...

	Done!

Writing imp-c_split_14-pacified.txt...

	Done!
Then I tried a plain text > latex conversion. Didn't work:

Code:
pacify.py -i imp.txt -o latex

pacify v0.4.0 (2009-09-13) - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)

	Profile:	en
	Input file:	imp.txt
	Output file:	imp-pacified.tex
	Log file:	imp-pacified.log

Reading imp.txt...

	Starting on Mon, 21 Sep 2009 11:35:30

	Finished on Mon, 21 Sep 2009 11:35:30

Cleaning formatting...

	Starting on Mon, 21 Sep 2009 11:35:30


	Finished on Mon, 21 Sep 2009 11:35:30

Analyzing text...

	Starting on Mon, 21 Sep 2009 11:35:30

	    Simplifying linebreaks...
	    Analyzing whitespace patterns...
	    100.0% processed (0.4 MB of 0.4 MB)

	Finished on Mon, 21 Sep 2009 11:35:31

	Filesize:
	409959

	Whitespace analysis:
	[(72019, 0.0), (1063, 16.0), (1, 8.0)]

	1756.73664927
	25.9294222105
	0.0243926831708

	... assumed to be a file with paragraph breaks.

Correcting quotation marks...

	Enlightening text...

	Done!

Writing imp-pacified.tex...

	Converting to LaTeX...

		Replacing \'s
		Replacing {'s
		Replacing }'s
		Replacing >'s
		Replacing <'s
		Replacing ~'s
		Replacing ^'s
		Replacing &'s
		Replacing #'s
		Replacing _'s
		Replacing $'s
		Replacing %'s

		Formatting...

Traceback (most recent call last):
  File "/home/kck/bin/pacify.py", line 88, in <module>
    main()
  File "/home/kck/bin/pacify.py", line 79, in main
    pacify = Pacify(args)
  File "/home/kck/bin/pmodules.py", line 65, in __init__
    outfile.write(self.GetAsLaTeX().encode('utf-8'))
  File "/home/kck/bin/pmodules.py", line 225, in GetAsLaTeX
    if curFormat != self.inbuffer.format[idx]:
  File "/home/kck/bin/pmodules.py", line 989, in __ne__
    if self.isBold != other.isBold:
AttributeError: pString instance has no attribute 'isBold'
No clue what happened there. It did create a file, but the file was blank. There was no imp-pacified.log.
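
The traceback shows `__ne__` reading `other.isBold` on an object that lacks that attribute. What that other object actually was cannot be told from the log, so the following is only a sketch of one defensive pattern: a comparison that declines to compare against foreign types instead of raising. `CharFormat` and its fields are hypothetical stand-ins, not pacify.py's classes.

```python
class CharFormat:
    """Hypothetical per-character format record with safe comparisons."""

    def __init__(self, bold=False, italic=False):
        self.bold = bold
        self.italic = italic

    def __eq__(self, other):
        if not isinstance(other, CharFormat):
            return NotImplemented  # let Python fall back instead of raising
        return (self.bold, self.italic) == (other.bold, other.italic)

    def __ne__(self, other):
        # Spelled out explicitly because Python 2 does not derive it from __eq__.
        result = self.__eq__(other)
        return result if result is NotImplemented else not result
```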

Tried an HTML > LaTeX conversion. That worked. It would have been nice to get an HTML table > LaTeX tabular conversion, but perhaps that goes against what you're trying to do. If so, then at least a space between cells, and maybe a line break between rows, seems in order.

Code:
pacify.py -i imp-c_split_11.html -o latex

pacify v0.4.0 (2009-09-13) - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)

	Profile:	en
	Input file:	imp-c_split_11.html
	Output file:	imp-c_split_11-pacified.tex
	Log file:	imp-c_split_11-pacified.log

Reading imp-c_split_11.html...

	Starting on Mon, 21 Sep 2009 11:44:05

Cleaning formatting...

	Starting on Mon, 21 Sep 2009 11:44:05


	Finished on Mon, 21 Sep 2009 11:44:05

Analyzing text...

	Starting on Mon, 21 Sep 2009 11:44:05

	    Simplifying linebreaks...
	    Analyzing whitespace patterns...
	    100.0% processed (0.0 MB of 0.0 MB)

	Finished on Mon, 21 Sep 2009 11:44:05

	Filesize:
	23759

	Whitespace analysis:
	[(4198, 0.0), (46, 16.0), (4, 8.0)]

	1766.90938171
	19.3610842207
	1.68357254093

	... assumed to be a file with paragraph breaks.

Correcting quotation marks...

	Enlightening text...

	Done!

Writing imp-c_split_11-pacified.tex...

	Converting to LaTeX...

		Replacing \'s
		Replacing {'s
		Replacing }'s
		Replacing >'s
		Replacing <'s
		Replacing ~'s
		Replacing ^'s
		Replacing &'s
		Replacing #'s
		Replacing _'s
		Replacing $'s
		Replacing %'s

		Formatting...

	Done!
Will do some more playing around when I have the leisure.
Old 09-21-2009, 12:13 PM   #65
ahi
Wizard
Hi, Frabjous!

I would suggest holding off until I put up the next version. The haphazard unicode errors are basically gone as of the development version I am currently working on.

The filesize thing is weird... the 600 KB file certainly did not cause a memory issue, but whatever the issue was got misreported as such.

I have successfully processed 700+ MB (nearly 1 GB) files with pacify.py before... and once I implement spooling (which I actually think I will do sooner rather than later after all), file size will be a non-issue so long as you have both sufficient memory and disk space.

And yes, the HTML parsing needs to take tables and such properly into account... along with a few other things.
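
As a rough illustration of the table handling suggested earlier in the thread (a space between cells, a line break between rows), here is a sketch built on the standard library's `html.parser`. The class and function names are hypothetical, it assumes well-formed markup with closing tags, and it is not how pacify.py reads HTML.

```python
from html.parser import HTMLParser


class TableTextExtractor(HTMLParser):
    """Collect table cell text so rows can be joined with separators."""

    def __init__(self):
        super().__init__()
        self.rows = []    # finished rows, each a list of cell strings
        self.cells = []   # cells of the row currently being read
        self.cell = None  # text pieces of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            self.cells.append("".join(self.cell).strip())
            self.cell = None
        elif tag == "tr":
            self.rows.append(self.cells)
            self.cells = []


def table_to_text(html):
    """Return cell text with spaces between cells and newlines between rows."""
    parser = TableTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(" ".join(row) for row in parser.rows)
```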

I'll keep everyone updated via this thread...

- Ahi
Old 09-21-2009, 12:15 PM   #66
kovidgoyal
creator of calibre
Posts: 45,349
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
@ahi: Look at SpooledTemporaryFile in the python tempfile module
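
A minimal sketch of what that suggestion looks like in practice, assuming the text arrives in chunks; the function name `spool_chunks` and the size limits are made up for illustration. `tempfile.SpooledTemporaryFile` keeps data in memory until more than `max_size` bytes have been written, then moves it to a temporary file on disk.

```python
import tempfile


def spool_chunks(chunks, max_in_memory=1024 * 1024):
    """Collect text chunks in a file-like object that spills to disk.

    Data stays in memory until more than max_in_memory bytes have been
    written; after that the standard library moves it to a temporary file.
    """
    spool = tempfile.SpooledTemporaryFile(max_size=max_in_memory, mode="w+b")
    for chunk in chunks:
        spool.write(chunk.encode("utf-8"))
    spool.seek(0)  # rewind so the caller can read from the start
    return spool


# A tiny limit forces the rollover to disk even for this small example.
with spool_chunks(["first line\n", "second line\n"], max_in_memory=8) as spool:
    text = spool.read().decode("utf-8")
```

The calling code reads and writes the same way whether or not the rollover has happened, which is the point of the class.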
Old 09-21-2009, 12:23 PM   #67
ahi
Wizard
Quote:
Originally Posted by kovidgoyal View Post
@ahi: Look at SpooledTemporaryFile in the python tempfile module
Thanks!

import antigravity indeed!

On the topic of RTF parsing, Kovid. If I recall correctly you need HTML returned by the RTF parser... is that right?

The current one you use for calibre... how... precise is it? The parser I am working on purposely limits the complexity of formatting it cares about. Would you still have any use for such a partial RTF parsing engine, as a (re)starting point?

- Ahi
Old 09-21-2009, 12:41 PM   #68
ahi
Wizard
Quote:
Originally Posted by kovidgoyal View Post
@ahi: Look at SpooledTemporaryFile in the python tempfile module
I've stupidly recreated some of the functionality of Python's logging module too.

- Ahi
Old 09-21-2009, 01:01 PM   #69
kovidgoyal
creator of calibre
The python logging module at least in 2.x has various unicode related bugs, so I had to implement a custom logging module for calibre as well.

I recently refactored the calibre RTF parser to fix its speed issues, so I'm ok for now. Incidentally, the RTF parser actually outputs XML, which I then convert to HTML using an XSLT stylesheet. It handles a pretty large subset of the RTF 1.5 spec, including embedded images and so on. My philosophy in calibre is to accept as large an input set as possible and do *something* with it, even if that something may not be optimal, so until your RTF parser handles a larger set of input, I think I'll pass.
Old 09-21-2009, 01:04 PM   #70
ahi
Wizard
Quote:
Originally Posted by kovidgoyal View Post
The python logging module at least in 2.x has various unicode related bugs, so I had to implement a custom logging module for calibre as well.

I recently refactored the calibre RTF parser to fix its speed issues, so I'm ok for now. Incidentally, the RTF parser actually outputs XML, which I then convert to HTML using an XSLT stylesheet. It handles a pretty large subset of the RTF 1.5 spec, including embedded images and so on. My philosophy in calibre is to accept as large an input set as possible and do *something* with it, even if that something may not be optimal, so until your RTF parser handles a larger set of input, I think I'll pass.
Makes sense.

In your custom logging module, do you have something that reliably down-converts unicode to plain ascii that can be printed even to dumb DOS terminals?

- Ahi
Old 09-21-2009, 01:07 PM   #71
kovidgoyal
creator of calibre
calibre has a general unicode->ascii converter that reliably downconverts unicode to ascii that (looks like) the unicode. This is thanks to user_none, by the way.
Old 09-21-2009, 02:34 PM   #72
ahi
Wizard
Quote:
Originally Posted by kovidgoyal View Post
calibre has a general unicode->ascii converter that reliably downconverts unicode to ascii that (looks like) the unicode. This is thanks to user_none, by the way.
Must have involved a good bit of (overly but not sufficiently) mindless cutting and pasting... or is there a better way?

Is it GPL (or similarly) licensed?

- Ahi
Old 09-21-2009, 02:50 PM   #73
ahi
Wizard
... actually, there is unicodedata.normalize. Oh, Python...
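
A minimal sketch of that approach, with `to_ascii` as a made-up helper name: decompose each character with NFKD, then drop whatever cannot be encoded as ASCII. Characters that have no decomposition (ø, for instance) simply disappear, so this is cruder than a converter that maps every character to a look-alike.

```python
import unicodedata


def to_ascii(text):
    """Strip accents by decomposing characters and dropping non-ASCII parts."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Combining marks and undecomposable characters are silently discarded.
    return decomposed.encode("ascii", "ignore").decode("ascii")
```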

- Ahi
Old 09-21-2009, 03:26 PM   #74
kovidgoyal
creator of calibre
All of calibre is GPL
Old 09-24-2009, 09:13 AM   #75
ahi
Wizard
Having some difficulty getting my LaTeX output algorithm right...

The internal representation is, of course, plaintext, but with (for simplicity's sake, let us say) the presence or absence of bold/italic/smallcap/underline formatting indicated for every character of text.

My first, somewhat naive, approach was something like this (I simplify considerably below, of course):

Code:
output = ""
for idx in range(len(manuscript.text)):
    # Formatting of the previous character; nothing is active before the first one.
    prev_bold = manuscript.format[idx-1].bold if idx > 0 else False
    prev_italic = manuscript.format[idx-1].italic if idx > 0 else False

    # Bold switched off or on at this character?
    if prev_bold and not manuscript.format[idx].bold:
        output += "}"
    elif not prev_bold and manuscript.format[idx].bold:
        output += "\\textbf{"

    # Italic switched off or on at this character?
    if prev_italic and not manuscript.format[idx].italic:
        output += "}"
    elif not prev_italic and manuscript.format[idx].italic:
        output += "\\textit{"

    output += manuscript.text[idx]
The problem is that the closing braces are not identified with a specific type of opening brace... and the moment formatting is not cleanly nested, it results in incorrect code.

e.g.:

Code:

T  h  i  s     i  s     i  n  d  e  e  d     a     s  t  r  a  n  g  e     i  d  e  a  !  
-- -- -- -- -- -I -I -I BI BI B- B- B- B- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
would generate the following LaTeX output from the above algorithm:

Code:
This \textit{is \textbf{in}deed} a strange idea!

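resulting in (with "is " in italic, "in" in bold italic, and "deed" in italic):

This is indeed a strange idea!

instead of the desired (with "is " in italic, "in" in bold italic, and "deed" in bold):

This is indeed a strange idea!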
Basically the \textit's closing brace actually closes the \textbf and vice versa.

---

I have since tried some more complicated variations, but they've either not yielded the correct output, or did so using very ugly (unnecessarily complicated) LaTeX code.

Basically the correct output for the above example would be:

Code:
This \textit{is} \textbf{\textit{in}deed} a strange idea!
... but (lest somebody suggest just having whitespace force brace closures, with the commands reissued when non-whitespace characters continue) if the whole word "indeed" were in italic (as well as in bold), the correct output would be:

Code:
This \textit{is \textbf{indeed}} a strange idea!
If anybody can nudge me toward an elegant solution that involves less spaghetti code than I've been throwing at it so far, I'd be most grateful!

Note: Well, not literally spaghetti code, of course... but almost as confusing looking.
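
One way to avoid the mis-nesting described above is to keep a stack of the commands currently open: when an attribute ends, close it together with everything nested inside it, then reopen whatever should still be active. The sketch below is a hedged illustration rather than pacify.py's code: `Fmt`, `COMMANDS`, and `to_latex` are hypothetical names, only bold and italic are handled, and escaping of LaTeX special characters is left out.

```python
from collections import namedtuple

# Hypothetical stand-in for the per-character format record described above.
Fmt = namedtuple("Fmt", ["bold", "italic"])

# Attribute name -> LaTeX command; the order here is the order used when
# several attributes start on the same character.
COMMANDS = {"italic": "\\textit{", "bold": "\\textbf{"}


def to_latex(text, formats):
    """Emit well-nested LaTeX by tracking open commands on a stack."""
    out = []
    stack = []  # attributes currently open, outermost first
    for char, fmt in zip(text, formats):
        wanted = [name for name in COMMANDS if getattr(fmt, name)]
        # Find the outermost open attribute that must end here, then close
        # it together with everything nested inside it.
        keep = len(stack)
        for depth, name in enumerate(stack):
            if name not in wanted:
                keep = depth
                break
        out.append("}" * (len(stack) - keep))
        del stack[keep:]
        # (Re)open whatever is wanted but not currently open.
        for name in wanted:
            if name not in stack:
                out.append(COMMANDS[name])
                stack.append(name)
        out.append(char)
    out.append("}" * len(stack))  # close anything still open at the end
    return "".join(out)


text = "This is indeed a strange idea!"
# Italic covers "is in" (indexes 5-9), bold covers "indeed" (indexes 8-13).
formats = [Fmt(bold=8 <= i < 14, italic=5 <= i < 10) for i in range(len(text))]
print(to_latex(text, formats))
```

For the first example above this prints `This \textit{is \textbf{in}}\textbf{deed} a strange idea!`, which renders as intended although it is not the shortest possible form. When the whole word "indeed" is both bold and italic, the same function produces `This \textit{is \textbf{indeed}} a strange idea!`. Looking ahead at run lengths to decide which command to open first would cut down on the reopening.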

- Ahi

Last edited by ahi; 09-24-2009 at 09:30 AM.