12-04-2009, 12:54 PM | #61 |
The Grand Mouse 高貴的老鼠
Posts: 71,480
Karma: 305784726
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
|
|
12-04-2009, 01:01 PM | #62 |
Sigil Developer
Posts: 7,602
Karma: 5433388
Join Date: Nov 2009
Device: many
|
posting xpml2xhtml.py
Hi,
Sure thing. When I get off work tonight I will post it on pastebin (I don't have access to a webserver of my own to post it directly) and then post the link to it here. I should probably figure out how to post things to the webspace my ISP provides but I have never bothered. Take care, Kevin |
12-04-2009, 03:21 PM | #63 |
Sigil Developer
Posts: 7,602
Karma: 5433388
Join Date: Nov 2009
Device: many
|
new version of xpml2xhtml.py
Hi,
I finished my grading early (final exams are as big a pain for profs as they are for students), so I went ahead and posted the new version of xpml2xhtml.py to pastebin.de. It contains no DRM-removal code at all, so it is okay to post, e-mail, and share.
It requires the HTML Tidy command line executable to be installed on the machine. Tidy is already installed under Mac OSX (at least on my machine) and will build out of the box on Mac OSX and Linux. Pre-built binaries for Windows are available from: http://int64.org/projects/tidy-binaries
Just make sure tidy is in the path someplace (I have never tried tidy on Windows, so feedback is welcome).
The link to xpml2xhtml.py is: http://pastebin.de/3445
It includes an optional command line switch, --sigil-breaks, that automatically inserts Sigil chapter breaks, which makes it much quicker to go from the output to a finished epub with Sigil (if Sigil would only read in the meta info in the header I would be so happy!). I use it as follows (on my Mac or under Linux):
Code:
python xpml2xhtml.py --sigil-breaks input.pml output.html
And just to make things clear, the format for footnotes in the input pml file is the xml one, not the one used by the original ereader2html. The snippets of code which create this format in the pml file are at http://pastebin.de/3444 for those who are interested.
Hope this helps. Please let me know if you run into problems or troublesome files that won't convert. I am always looking for test documents that hit corner cases. Take care, KevinH |
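For anyone who wants to script the conversion, here is a rough sketch of a wrapper that checks tidy is on the path before calling the converter. The helper names (tidy_available, build_command, convert) are made up for illustration and are not part of xpml2xhtml.py; it also assumes the script sits in the current directory and that "python" resolves to an interpreter able to run it.

```python
import shutil
import subprocess


def tidy_available():
    """Return True if an executable named 'tidy' can be found on the PATH."""
    return shutil.which('tidy') is not None


def build_command(pml_in, html_out, sigil_breaks=True,
                  interpreter='python', script='xpml2xhtml.py'):
    """Build the same command line shown in the post, as an argument list."""
    cmd = [interpreter, script]
    if sigil_breaks:
        cmd.append('--sigil-breaks')
    cmd += [pml_in, html_out]
    return cmd


def convert(pml_in, html_out, sigil_breaks=True):
    """Run the converter, refusing to start if tidy is missing (hypothetical helper)."""
    if not tidy_available():
        raise RuntimeError('HTML Tidy was not found on the PATH')
    subprocess.run(build_command(pml_in, html_out, sigil_breaks), check=True)
```

Calling convert('input.pml', 'output.html') would then be the scripted equivalent of the command line above.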
12-05-2009, 01:51 AM | #64 |
Connoisseur
Posts: 91
Karma: 108
Join Date: Jan 2008
Device: Palm Treo 680, Sony Reader
|
I have a final to prep for, so it'll be a week before I can play with it. I originally tried ereader2html, but I left screaming in horror at the html it produced. I'll give yours a try. There may be some valuable tidbits that could be pushed into Calibre. Thanks!
- Jim |
12-05-2009, 07:09 AM | #65 |
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
Kevin and I talked about his parser and the new calibre one while both were being developed last week. Apart from general considerations such as how to handle certain cases (some from calibre's went into xpml2xhtml and some from xpml2xhtml went into calibre's), the two designs are very different, which makes it difficult to import actual code.
|
12-05-2009, 12:21 PM | #66 |
Sigil Developer
Posts: 7,602
Karma: 5433388
Join Date: Nov 2009
Device: many
|
more on xpml2xhtml.py
Hi,
Yes, xpml2xhtml.py is in no way only my work. I have literally exchanged ideas and code with "user_none", borrowed ideas from "WayneD's" perl pml2html.pl conversion program, took ideas and code posted on the Dark Blog by others, and of course started with the original code posted on the Dark Blog.
I just now borrowed the idea of cleaning up chars. I hated to touch the pml file produced, since that is the original. But I have now added the following to my latest version of xpml2xhtml.py, and it cleans up the last issue that forced me to use tidy (handling those special win1252 chars). Based on Jim's and user_none's comments above, I have added:
Code:
def cleanupHighChars(src):
    # convert special win1252 chars 0x80 - 0xa0 to be properly handled later
    src = re.sub('[\x80-\xa0]', lambda x: '\\a%03d' % ord(x.group()), src)
    src = re.sub('[^\x00-\xff]', lambda x: '\\U%04x' % ord(x.group()), src)
    return src
When it finds these special win1252 chars, it recodes them to more proper pml with the \a and \U tags. I have then expanded the pml_chars array based on the following win1252 page: http://www.microsoft.com/globaldev/r.../sbcs/1252.htm which gives me:
Code:
pml_chars = {
    128:'€', 129:'', 130:'‚', 131:'ƒ', 132:'„', 133:'…', 134:'†', 135:'‡',
    136:'ˆ', 137:'‰', 138:'Š', 139:'‹', 140:'Œ', 141:'', 142:'Ž', 143:'',
    144:'', 145:'‘', 146:'’', 147:'“', 148:'”', 149:'•', 150:'–', 151:'—',
    152:'', 153:'™', 154:'š', 155:'›', 156:'œ', 157:'', 158:'ž', 159:'Ÿ',
    160:' ',
}
Then I handle all of the \a tag values by translating them:
Code:
elif cmd == 'a':
    final += self.pml_chars.get(attr, '&#%d;' % attr)
So I can now properly handle all of those special win1252 chars that cannot be encoded in unicode just by their value and that need to be remapped to special html codes. This means I can modify the program to take an optional --use-tidy flag that defaults to off, so that the code is usable even by people without tidy.
That said, I like to see the structure when I look at an html file and tidy's nice indentation and wrapping makes for easily understood code (i.e. makes it easy to see html breakpoints). I will test my new code further and post a final version over the weekend. Thanks for all of the code tips and ideas. KevinH |
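For readers who want to try the recoding idea from the post above in isolation, here is a small self-contained sketch in Python 3. It is not the code from xpml2xhtml.py: the names cleanup_high_chars, build_pml_chars and expand_a_tags are invented for this example, the table is built from Python's own cp1252 codec rather than typed by hand, and the input is assumed to have been read as latin-1 so that bytes 0x80-0xa0 keep their numeric values.

```python
import re


def cleanup_high_chars(src):
    """Recode chars 0x80-0xa0 as \\aNNN (decimal) and chars above 0xff as \\UXXXX (hex)."""
    src = re.sub('[\x80-\xa0]', lambda m: '\\a%03d' % ord(m.group()), src)
    src = re.sub('[^\x00-\xff]', lambda m: '\\U%04x' % ord(m.group()), src)
    return src


def build_pml_chars():
    """Map 128..160 to their win1252 characters; undefined slots become ''."""
    table = {}
    for code in range(0x80, 0xa1):
        try:
            table[code] = bytes([code]).decode('cp1252')
        except UnicodeDecodeError:
            table[code] = ''  # this byte has no character assigned in Python's cp1252 codec
    return table


PML_CHARS = build_pml_chars()


def expand_a_tags(src):
    """Replace each \\aNNN tag with its table entry, or a numeric entity if unknown."""
    def repl(m):
        value = int(m.group(1))
        return PML_CHARS.get(value, '&#%d;' % value)
    return re.sub(r'\\a(\d{3})', repl, src)


# Sample bytes as they might appear in a win1252-encoded pml file.
raw = b'\x93Hi\x94 costs 5\x80'.decode('latin-1')
recoded = cleanup_high_chars(raw)
expanded = expand_a_tags(recoded)
```

Here recoded holds the text with \a147, \a148 and \a128 tags in place of the raw bytes, and expanded holds the same text with curly quotes and a euro sign.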
12-05-2009, 08:00 PM | #67 |
Sigil Developer
Posts: 7,602
Karma: 5433388
Join Date: Nov 2009
Device: many
|
final version of xpml2xhtml.py
Hi,
I added the cleanup code, made the use of tidy optional with a command line switch (--use-tidy), fixed some corner cases, and made a few other improvements. So if you are going to try xpml2xhtml.py, please try this version: http://pastebin.de/3639 Hope this helps, KevinH |
12-08-2009, 03:41 PM | #68 |
Connoisseur
Posts: 91
Karma: 108
Join Date: Jan 2008
Device: Palm Treo 680, Sony Reader
|
Finished my final! Alright, let me take a look at all of this. I think the best way I can deal with this within Calibre is to create a PDB "on import" plugin that automatically converts the file when I add the PDB. Then I could just add the resulting HTML in the "Edit MetaData" window in the GUI.
Truth be told, I'd rather just use Calibre's built-in features. I'll see which features of xpml2xhtml.py really matter and look at how feasible it is to add them to user_none's code. With the latest changes, his stuff already does most of what I want. I'd just like to have the footnotes handled better, with pagebreaks and return links. - Jim |
12-08-2009, 05:35 PM | #69 | ||
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
Quote:
Quote:
|
||
12-08-2009, 06:14 PM | #70 | |
Connoisseur
Posts: 91
Karma: 108
Join Date: Jan 2008
Device: Palm Treo 680, Sony Reader
|
Quote:
However....this may be unnecessary. I'll do another bzr update and see how your work looks. If it's good enough, then....it's good enough! - Jim |
|
12-09-2009, 02:09 AM | #71 |
Connoisseur
Posts: 91
Karma: 108
Join Date: Jan 2008
Device: Palm Treo 680, Sony Reader
|
Hmmm....I'm on revision 3999 on my Bazaar project, but I don't see your changes that add link-back to footnotes. I'm looking at pmlconvertor.py:
Code:
(re.compile(r'\\Fn="(?P<target>.+?)"(?P<text>.*?)\\Fn'), lambda match: '<a href="#fns-%s">%s</a>' % (match.group('target'), match.group('text')) if match.group('text') else ''),
(re.compile(r'\\Sd="(?P<target>.+?)"(?P<text>.*?)\\Sd'), lambda match: '<a href="#fns-%s">%s</a>' % (match.group('target'), match.group('text')) if match.group('text') else ''),
<snip>
# Sidebar and Footnotes
(re.compile(r'<sidebar\s+id="(?P<target>.+?)">\s*(?P<text>.*?)\s*</sidebar>', re.DOTALL), lambda match: '<div id="fns-%s">%s</div>' % (match.group('target'), match.group('text')) if match.group('text') else ''),
(re.compile(r'<footnote\s+id="(?P<target>.+?)">\s*(?P<text>.*?)\s*</footnote>', re.DOTALL), lambda match: '<div id="fns-%s">%s</div>' % (match.group('target'), match.group('text')) if match.group('text') else ''),
Code:
(re.compile(r'\\Fn="(?P<target>.+?)"(?P<text>.*?)\\Fn'), lambda match: '<a id="Xfns-%s" href="#fns-%s">%s</a>' % (match.group('target'), match.group('target'), match.group('text')) if match.group('text') else ''),
<snip>
# Sidebar and Footnotes
(re.compile(r'<sidebar\s+id="(?P<target>.+?)">\s*(?P<text>.*?)\s*</sidebar>', re.DOTALL), lambda match: '<div title="Footnote" id="fns-%s" style="page-break-before : always;">%s<br /><a href=#Xfns-%s>-Back-</a></div>' % (match.group('target'), match.group('text'), match.group('target')) if match.group('text') else ''),
- Jim |
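To make the link-back idea in the second block above easier to try out, here is a minimal standalone sketch. It is not calibre's code: the RULES list and the convert function are stand-ins invented for this example, the back-link rule is applied to the footnote pattern rather than the sidebar one, and the href attribute is quoted. The regular expressions themselves are copied from the post.

```python
import re

# Each rule is a (compiled pattern, replacement function) pair.
RULES = [
    # Footnote reference: gets its own id so the footnote can link back to it.
    (re.compile(r'\\Fn="(?P<target>.+?)"(?P<text>.*?)\\Fn'),
     lambda m: '<a id="Xfns-%s" href="#fns-%s">%s</a>'
               % (m.group('target'), m.group('target'), m.group('text'))
               if m.group('text') else ''),
    # Footnote body: starts on a new page and ends with a -Back- link.
    (re.compile(r'<footnote\s+id="(?P<target>.+?)">\s*(?P<text>.*?)\s*</footnote>',
                re.DOTALL),
     lambda m: '<div title="Footnote" id="fns-%s" style="page-break-before: always;">'
               '%s<br /><a href="#Xfns-%s">-Back-</a></div>'
               % (m.group('target'), m.group('text'), m.group('target'))
               if m.group('text') else ''),
]


def convert(pml):
    """Apply every rule in order to the PML text (hypothetical helper)."""
    for pattern, repl in RULES:
        pml = pattern.sub(repl, pml)
    return pml


sample = 'See the note\\Fn="n1"[1]\\Fn.\n<footnote id="n1">The note text.</footnote>'
html = convert(sample)
```

The reference anchor carries id Xfns-n1 and points at fns-n1, while the footnote div carries id fns-n1 and its -Back- link points at Xfns-n1, so a reader can jump in both directions.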
12-09-2009, 06:11 AM | #72 |
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
Not in the slightest. I wrote a new parser and replaced the regex one with it. It doesn't look like your branch has properly merged with trunk. Are you doing bzr merge lp:calibre? You can see the new parser here in the meantime.
|
12-09-2009, 01:43 PM | #73 |
Connoisseur
Posts: 91
Karma: 108
Join Date: Jan 2008
Device: Palm Treo 680, Sony Reader
|
I had done a bzr revert (to blow out my unnecessary changes) followed by a bzr merge. Hmmmm...maybe there were errors that I didn't notice.
I looked at your code, and that definitely does what I need to do. Man, the thing is practically a re-write! How long did it take you to re-organize all of that? I like how you first convert the pseudo-XML footnote references into your own PML codes. It's odd that the eReader format doesn't do it that way as it does for standard links. I'll beat bzr into submission and try it out. Thanks! - Jim |
12-18-2012, 07:41 AM | #74 |
Member
Posts: 14
Karma: 1236266
Join Date: Dec 2010
Device: None
|
I doubt that anyone is still interested in this code, but I am preparing a new tools release and I am removing a lot of obsolete files.
xpml2xhtml.py is one of the ones to go. As this file contains no de-drm code at all, I am attaching the latest version that I have to this post. - Alf. Last edited by Apprentice Alf; 12-18-2012 at 08:05 AM. |