Converting eReader books - Page 5

pdurrant · 12-04-2009, 12:54 PM

I'm not Jim, but yes please do post your xpml2xhtml code. Thank you.

Quote:

Originally Posted by KevinH

If you want it I will happily post it for you (since it has no DRM removal code in it at all).

KevinH · 12-04-2009, 01:01 PM

Hi,

Sure thing. When I get off work tonight I will post it on pastebin (I don't have access to a webserver of my own to post it directly) and then post the link to it here.

I should probably figure out how to post things to the webspace my ISP provides but I have never bothered.

Take care,

Kevin

KevinH · 12-04-2009, 03:21 PM

Hi,

I finished my grading early (final exams are as big a pain for profs as they are for students) so I went ahead and posted the new version of xpml2xhtml.py to pastebin.de. This code contains no anti-DRM code at all and so is okay to post, e-mail to people, and share. It requires the HTML Tidy command-line executable to be installed on the machine. Tidy is already installed under Mac OS X (at least on my machine), builds out of the box on Mac OS X and Linux, and pre-built binaries for Windows are available from: http://int64.org/projects/tidy-binaries

Just make sure tidy is in the path someplace (I have never tried tidy on windows so feedback welcome).
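A quick way to confirm that tidy really is on the path before running the converter is sketched below. This is only an illustration in present-day Python 3, not part of xpml2xhtml.py; the helper name find_tidy is hypothetical.

```python
import shutil


def find_tidy(executable="tidy"):
    """Return the full path of the tidy executable, or None if it is not on PATH.

    find_tidy is a hypothetical helper for illustration; it is not taken
    from xpml2xhtml.py.
    """
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_tidy()
    if path is None:
        print("tidy was not found on PATH")
    else:
        print("tidy found at", path)
```

On Windows, shutil.which also tries the extensions listed in PATHEXT, so a tidy.exe placed in a PATH directory should be found under the plain name "tidy".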

The link to xpml2xhtml.py is:

http://pastebin.de/3445

It includes an optional command-line switch, --sigil-breaks, that automatically inserts Sigil chapter breaks, which makes it much quicker to go from the output to a finished epub in Sigil (if Sigil would only read in the meta info in the header I would be so happy!).

I use it as follows (on my Mac or under Linux)

python xpml2xhtml.py --sigil-breaks input.pml output.html

And just to make things clear, the format for footnotes in the input pml file is the xml one, not the one used by the original ereader2html. The snippets of code which create this format in the pml file are at:

http://pastebin.de/3444

for those who are interested.

Hope this helps,

Please let me know if you run into problems or troublesome files that won't convert. I am always looking for test documents that hit corner cases.

Take care,

KevinH

macr0t0r · 12-05-2009, 01:51 AM

I have a final to prep for, so it'll be a week before I can play with it. I originally tried ereader2html, but I left screaming in horror at the html it produced. I'll give yours a try. There may be some valuable tidbits that could be pushed into Calibre. Thanks!
- Jim

user_none · 12-05-2009, 07:09 AM

Quote:

Originally Posted by macr0t0r

There may be some valuable tidbits that could be pushed into Calibre.

Kevin and I spoke to each other about his parser and the new calibre one while both were being developed last week. Other than general considerations like how to handle certain cases (some from calibre's went into xpml2xhtml and some from xpml2xhtml went into calibre's), the two designs are very different, making it difficult to import actual code.

KevinH · 12-05-2009, 12:21 PM

Hi,

Yes, xpml2xhtml.py is in no way only my work. I have literally exchanged ideas and code with "user_none" and borrowed ideas from "WayneD's" perl pml2html.pl conversion program, took ideas and code posted on the Dark Blog by others, and of course started with the original code posted on the Dark blog.

I just now borrowed the idea of cleaning up chars. I hated to touch the pml file produced since that is the original. But I have now added the following to my latest version of xpml2xhtml.py, which cleans up the last issue that was forcing me to use tidy (handling those special win1252 chars).

Based on Jim and user_none comments above, I have added:

    import re

    def cleanupHighChars(src):
        # convert special win1252 chars 0x80 - 0xa0 to \a tags so they are properly handled later
        src = re.sub('[\x80-\xa0]', lambda x: '\\a%03d' % ord(x.group()), src)
        # convert anything above 0xff to \U tags
        src = re.sub('[^\x00-\xff]', lambda x: '\\U%04x' % ord(x.group()), src)
        return src

When it finds these special win1252 chars, it recodes them into proper pml using the \a and \U tags. I have then expanded the pml_chars array as follows, based on this win1252 page:

http://www.microsoft.com/globaldev/r.../sbcs/1252.htm

which gives me

    pml_chars = {
        # 130 is the single low-9 quotation mark in win1252
        128: '€', 129: '', 130: '‚', 131: 'ƒ', 132: '„',
        133: '…', 134: '†', 135: '‡', 136: 'ˆ', 137: '‰',
        138: 'Š', 139: '‹', 140: 'Œ', 141: '', 142: 'Ž',
        143: '', 144: '', 145: '‘', 146: '’', 147: '“',
        148: '”', 149: '•', 150: '–', 151: '—', 152: '',
        153: '™', 154: 'š', 155: '›', 156: 'œ', 157: '',
        158: 'ž', 159: 'Ÿ', 160: ' ',
    }

Then I handle all of the \a tags values by translating them

    elif cmd == 'a':
        final += self.pml_chars.get(attr, '&#%d;' % attr)

So I can now properly handle all of those special win1252 chars that are not allowed to be encoded in unicode just by value and that need to be remapped to special html codes.
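The two steps described above (recode the raw characters into \a and \U tags, then expand the \a tags through a lookup table) can be sketched end to end as follows. This is a minimal illustration for Python 3, not the code from xpml2xhtml.py: the names cleanup_high_chars, expand_a_tags and PML_CHARS are hypothetical, the table is trimmed to five entries, the numeric entities in it are an assumption about what the real table holds, and the input is assumed to have been decoded as Latin-1 so that bytes 0x80-0xa0 appear as single characters.

```python
import re

# Trimmed, hypothetical table: win1252 code -> numeric HTML entity.
# The entity values are an assumption made for this illustration.
PML_CHARS = {
    128: '&#8364;',  # euro sign
    133: '&#8230;',  # horizontal ellipsis
    147: '&#8220;',  # left double quotation mark
    148: '&#8221;',  # right double quotation mark
    151: '&#8212;',  # em dash
}


def cleanup_high_chars(src):
    """Recode 0x80-0xa0 as \\aNNN tags and anything above 0xff as \\UXXXX tags."""
    src = re.sub('[\x80-\xa0]', lambda m: '\\a%03d' % ord(m.group()), src)
    src = re.sub('[^\x00-\xff]', lambda m: '\\U%04x' % ord(m.group()), src)
    return src


def expand_a_tags(pml):
    """Replace each \\aNNN tag with its table entry, falling back to &#NNN;."""
    def repl(m):
        code = int(m.group(1))
        return PML_CHARS.get(code, '&#%d;' % code)
    return re.sub(r'\\a(\d{3})', repl, pml)


raw = 'He said \x93wait\x85\x94'
recoded = cleanup_high_chars(raw)
print(recoded)                 # He said \a147wait\a133\a148
print(expand_a_tags(recoded))  # He said &#8220;wait&#8230;&#8221;
```

Codes missing from the table fall back to a plain numeric entity, which mirrors the pml_chars.get(attr, '&#%d;' % attr) call quoted above.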

So now I can modify the program to take an optional --use-tidy flag that defaults to off, so that the code is usable even by people without tidy.

That said, I like to see the structure when I look at an html file, and tidy's nice indentation and wrapping make for easily understood code (i.e. they make it easy to see html breakpoints).

I will test my new code further and post a final version over the weekend.

Thanks for all of the code tips and ideas.

KevinH

KevinH · 12-05-2009, 08:00 PM

Hi,

I added the cleanup code, made the use of tidy optional with a command-line switch (--use-tidy), fixed some corner cases, and made a few other improvements.
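For anyone wondering how two optional switches like these combine on the command line, here is a minimal sketch using Python's argparse. It is an assumption for illustration only: the thread does not say how xpml2xhtml.py actually parses its arguments, and build_parser, infile and outfile are hypothetical names. Only the two flag names come from the posts above.

```python
import argparse


def build_parser():
    """Hypothetical parser; only the two flag names come from the thread."""
    parser = argparse.ArgumentParser(prog="xpml2xhtml.py")
    # Both switches are off unless given explicitly.
    parser.add_argument("--sigil-breaks", action="store_true",
                        help="insert Sigil chapter breaks into the output")
    parser.add_argument("--use-tidy", action="store_true",
                        help="run the output through HTML Tidy")
    parser.add_argument("infile", help="input .pml file")
    parser.add_argument("outfile", help="output .html file")
    return parser


args = build_parser().parse_args(["--sigil-breaks", "input.pml", "output.html"])
print(args.sigil_breaks, args.use_tidy)  # True False
```

With action="store_true" a switch that is left out is simply False, which gives the "defaults to off" behaviour without any extra code.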

So if you are going to try xpml2xhtml.py, please try this version:

http://pastebin.de/3639

Hope this helps,

KevinH

macr0t0r · 12-08-2009, 03:41 PM

Finished my final! Alright, let me take a look at all of this. I think the best way I can deal with this within Calibre is if I create a PDB "on import" plugin that automatically converted when I added the PDB. Then I could just add the resulting HTML in the "Edit MetaData" window in the GUI.

Truth be told, I'd rather just use Calibre's built-in features. I'll see what features xpml2xhtml.py has that really matter and look at how feasible it is to add them into user_none's code. With the latest changes, his stuff already does most of what I want. I'd just like to have the footnotes better handled with pagebreaks and return links.

- Jim

user_none · 12-08-2009, 05:35 PM

Quote:

Originally Posted by macr0t0r

Finished my final! Alright, let me take a look at all of this. I think the best way I can deal with this within Calibre is if I create a PDB "on import" plugin that automatically converted when I added the PDB. Then I could just add the resulting HTML in the "Edit MetaData" window in the GUI.

Why do you need a PDB on import plugin? eReader PDB's are fully supported.

Quote:

Originally Posted by macr0t0r

Truth be told, I'd rather just use Calibre's built-in features. I'll see what features xpml2xhtml.py has that really matter and look at how feasible it is to add them into user_none's code. With the latest changes, his stuff already does most of what I want. I'd just like to have the footnotes better handled with pagebreaks and return links.

Once again I'm one step ahead of you, I added this a few days ago.

macr0t0r · 12-08-2009, 06:14 PM

Quote:

Originally Posted by user_none

Why do you need a PDB on import plugin? eReader PDB's are fully supported.

Once again I'm one step ahead of you, I added this a few days ago.

I know that eReader PDBs are supported. This is a little trick I do if I want to use an external converter for PDB within the Calibre Python environment (very useful on Windows machines). By importing the PDB, it calls the conversion routine on the file and generates a zipped HTML file in the same directory. I can then add that within the MetaData GUI. Now I have both the original eReader file and the converted HTML zip file to work with. I can't do this as a conversion plugin since that plugin expects OEB output. Perhaps with some work, I could figure out how to call the HTML to OEB functions within the plugin after converting to HTML. This whole plugin thing is still a bit of effort to work with.

However....this may be unnecessary. I'll do another bzr update and see how your work looks. If it's good enough, then....it's good enough!

- Jim

macr0t0r · 12-09-2009, 02:09 AM

Hmmm....I'm on revision 3999 on my Bazaar project, but I don't see your changes that add link-back to footnotes. I'm looking at pmlconvertor.py:

Code:

    (re.compile(r'\\Fn="(?P<target>.+?)"(?P<text>.*?)\\Fn'), lambda match: '<a href="#fns-%s">%s</a>' % (match.group('target'), match.group('text')) if match.group('text') else ''),
    (re.compile(r'\\Sd="(?P<target>.+?)"(?P<text>.*?)\\Sd'), lambda match: '<a href="#fns-%s">%s</a>' % (match.group('target'), match.group('text')) if match.group('text') else ''),
<snip>
    # Sidebar and Footnotes
    (re.compile(r'&lt;sidebar\s+id="(?P<target>.+?)"&gt;\s*(?P<text>.*?)\s*&lt;/sidebar&gt;', re.DOTALL), lambda match: '<div id="fns-%s">%s</div>' % (match.group('target'), match.group('text')) if match.group('text') else ''),
    (re.compile(r'&lt;footnote\s+id="(?P<target>.+?)"&gt;\s*(?P<text>.*?)\s*&lt;/footnote&gt;', re.DOTALL), lambda match: '<div id="fns-%s">%s</div>' % (match.group('target'), match.group('text')) if match.group('text') else ''),

I was expecting something like this (footnotes only):

Code:

    (re.compile(r'\\Fn="(?P<target>.+?)"(?P<text>.*?)\\Fn'), lambda match: '<a id="Xfns-%s" href="#fns-%s">%s</a>' % (match.group('target'), match.group('target'), match.group('text')) if match.group('text') else ''),
<snip>
    # Sidebar and Footnotes
    (re.compile(r'&lt;sidebar\s+id="(?P<target>.+?)"&gt;\s*(?P<text>.*?)\s*&lt;/sidebar&gt;', re.DOTALL), lambda match: '<div title="Footnote" id="fns-%s" style="page-break-before : always;">%s<br /><a href="#Xfns-%s">-Back-</a></div>' % (match.group('target'), match.group('text'), match.group('target')) if match.group('text') else ''),

Is your code similar to this?

- Jim
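The link-back idea in the post above can be tried out in isolation with the sketch below. It is a standalone illustration for Python 3 that borrows the two regular expressions quoted in the post; it is not calibre's code, and link_footnotes and the sample string are hypothetical. As in the quoted rules, the footnote body is assumed to arrive with its angle brackets already escaped as &lt; and &gt;.

```python
import re

# Reference in the running text: \Fn="id"text\Fn
FN_REF = re.compile(r'\\Fn="(?P<target>.+?)"(?P<text>.*?)\\Fn')
# Footnote body, already HTML-escaped: &lt;footnote id="id"&gt;...&lt;/footnote&gt;
FN_BODY = re.compile(
    r'&lt;footnote\s+id="(?P<target>.+?)"&gt;\s*(?P<text>.*?)\s*&lt;/footnote&gt;',
    re.DOTALL)


def link_footnotes(pml):
    """Give each reference an anchor and each footnote a link back to it."""
    def ref(m):
        if not m.group('text'):
            return ''
        return '<a id="Xfns-%s" href="#fns-%s">%s</a>' % (
            m.group('target'), m.group('target'), m.group('text'))

    def body(m):
        if not m.group('text'):
            return ''
        return ('<div id="fns-%s" style="page-break-before: always;">'
                '%s<br /><a href="#Xfns-%s">-Back-</a></div>') % (
            m.group('target'), m.group('text'), m.group('target'))

    return FN_BODY.sub(body, FN_REF.sub(ref, pml))


sample = 'See note\\Fn="n1"[1]\\Fn. &lt;footnote id="n1"&gt;Details.&lt;/footnote&gt;'
print(link_footnotes(sample))
```

The reference gets the id Xfns-n1 and the footnote div links back to it, so a reader can jump to the note and return to where they were.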

user_none · 12-09-2009, 06:11 AM

Quote:

Originally Posted by macr0t0r

Is your code similar to this?

Not in the slightest. I wrote a new parser and replaced the regex one with it. It doesn't look like your branch has properly merged with trunk. Are you doing bzr merge lp:calibre? You can see the new parser here in the meantime.

macr0t0r · 12-09-2009, 01:43 PM

I had done a bzr revert (to blow out my unnecessary changes) followed by a bzr merge. Hmmmm...maybe there were errors that I didn't notice.

I looked at your code, and that definitely does what I need to do. Man, the thing is practically a re-write! How long did it take you to re-organize all of that? I like how you first convert the pseudo-XML footnote references into your own PML codes. It's odd that the eReader format doesn't do it that way as it does for standard links.

I'll beat bzr into submission and try it out. Thanks!

- Jim

Apprentice Alf · 12-18-2012, 07:41 AM

I doubt that anyone is still interested in this code, but I am preparing a new tools release and I am removing a lot of obsolete files.

xpml2xhtml.py is one of the ones to go. As this file contains no de-drm code at all, I am attaching the latest version that I have to this post.

— Alf.
