10-29-2009, 11:18 AM | #46 |
Member
Posts: 15
Karma: 50
Join Date: Jul 2008
Device: Sony Reader
|
I'm trying to figure this out too. I know a particular ereader file I want to convert to LRF uses the cp1252 charset, but where exactly does --input-encoding="cp1252" go when using the command-line interface?
Given the command is written ereader2html infile.pdb outdir "your name" credit_card_number (which works fine, bar the dodgy A's with hats, random line-breaks, etc.), where should one insert --input-encoding="cp1252"? Or have I typically got it all horribly wrong? And is it likely to make a difference, given that the html file produced by MobiDeDrm is already encoded in cp1252, according to Firefox?
I should also mention the following: I went to calibre and set the specified file-type plug-in to 'cp1252'. I then imported an html file which had been successfully exploded from ereader with MobiDeDrm. I had already checked that html file and saw that all characters etc. showed up just fine under Firefox. After importing the html file (which apparently causes calibre to automatically zip the html, together with any associated image files, into a single zip), I navigated to the folder (on a Mac) where the new zip file was located, exploded the zip, and found that the html file within had become scrambled, i.e. missing apostrophes and quote marks, random page breaks, that kind of thing. So clearly the problem happens when the html is first imported into Calibre.
Any thoughts? Or should I try and 'open' a ticket as suggested (I figured I'd try here first, being as clueless about a lot of this stuff as some others on this thread)? |
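On the charset question, here is a minimal Python 3 sketch (not part of ereader2html or calibre) that re-saves an html file as UTF-8, assuming it really is cp1252 as Firefox reports. The function names and file paths are made up for illustration. The second function shows one common way "A's with hats" can arise: UTF-8 bytes being read as if they were cp1252.

```python
from pathlib import Path


def transcode_to_utf8(src, dst, source_encoding="cp1252"):
    """Read src using source_encoding and write the same text to dst as UTF-8."""
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return text


def misread_as_cp1252(text):
    """Show what UTF-8 text looks like when a program wrongly assumes cp1252."""
    # errors="replace" because a few byte values are undefined in cp1252
    return text.encode("utf-8").decode("cp1252", errors="replace")
```

If the html declares its charset in a meta tag, that declaration would also need updating to match the new encoding; the sketch does not touch it.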
10-29-2009, 01:01 PM | #47 |
curmudgeon
Posts: 1,481
Karma: 5748190
Join Date: Jun 2006
Location: Redwood City, CA USA
Device: Kobo Aura HD, (ex)nook, (ex)PRS-700, (ex)PRS-500
|
The best approach I am aware of is to take your copy of ereader2html.py and modify it to (a) write out the pml file instead of converting to html, and (b) re-write high-ascii characters using the \aNNN syntax, where NNN is the value of the byte in octal? decimal? (I forget; check the mobileread wiki page for the format). That works around the whole problem by making sure that the contents of the pml file are fully escaped.
THEN, use Drop Book to create a new (DRM-free) ereader file, import that into Calibre, and off you go. One reason this is a good choice is that Calibre's conversion to html seems to work much better than ereader2html's version. As always, your mileage may vary, etc. Xenophon |
10-29-2009, 01:15 PM | #48 |
Member
Posts: 15
Karma: 50
Join Date: Jul 2008
Device: Sony Reader
|
I now realise that much of what I posted back there is complete bollocks, since I had in fact got my ereader2html mixed up with calibre's own command line interface stuff, which I hadn't used before.
I just ran the book in question, post ereader2html conversion, through calibre's command line interface, as mentioned earlier in the thread, using what appears to be the correct charset. As a result, all the quote marks, dashes etc. are finally exactly where they should be ... except the text is all either centred or right-side ragged. Wah!
Xenophon - thanks for the tips, but I'm afraid I hardly understood any of it! I can find my way around a computer on a basic level, and I only found out about charsets today. It all sounds quite advanced to me, unless you can point me to online resources that explain it reasonably simply.
Edit: I got it a little further along - I converted the post-ereader2html output to epub, then converted that to LRF using the command-line interface with the cp1252 modifier. I got intact quotes, correct text alignment ... but lots of random misplaced paragraph returns breaking up the text. Anyone got any ideas? Last edited by PressEnter; 10-29-2009 at 02:12 PM. |
10-29-2009, 02:33 PM | #49 |
curmudgeon
Posts: 1,481
Karma: 5748190
Join Date: Jun 2006
Location: Redwood City, CA USA
Device: Kobo Aura HD, (ex)nook, (ex)PRS-700, (ex)PRS-500
|
What I recommended is straight-forward for an experienced programmer. Problem is that distributing the resulting program is probably a felony in the US. Passing on more detailed directions is also likely to be a felony. So I pass on those slightly-cryptic tips for those who can do it themselves.
Note that someone posted directions over in the workshop forum for how to modify ereader2html to output pml instead of html. That's the most important thing, and it's only a 2-line edit. And "Drop Book" is a free download from fictionwise and/or ereader.com. Xenophon |
10-29-2009, 04:29 PM | #50 |
Zealot
Posts: 115
Karma: 260
Join Date: Sep 2008
Location: Suffolk, England
Device: sony prs505, kindle, ithing
|
I have just checked and somebody has already raised a ticket about this. Meanwhile, I think I will convert to ePub which seems OK. I'm afraid the changes are beyond me - I need to follow instructions written in words of one syllable! I felt triumphant when I converted my first book using ereader2html!
Thanks for all the help guys. |
10-30-2009, 09:51 AM | #51 |
Member
Posts: 15
Karma: 50
Join Date: Jul 2008
Device: Sony Reader
|
I've heavily rewritten this comment to reflect every step I took from a DRM'ed ereader .pdb book to a non-DRM'd, error-free open format transferable to my Sony Reader. All this was done on a Mac iBook running OS X 10.4, with the ereader2html MobiDeDrm script and Python 2.6 installed.
I am not a programmer, but I managed to figure out how to do this in a day or so. First of all, I went here:. I followed the instructions precisely, creating a new python script named 'ereader2pml.py'. It's very easy. I ran this script on a protected ereader file. It puts any graphics inside the book into a separate folder called 'outdir', and also placed in there an .html file as well as a .pml file. Pml means 'palm markup language', the layout language used for ereader books. The pml is just a textfile, with odd characters replacing apostrophes, quotes and dashes and so forth - the most frequently recurring characters were "ì", "î", "í", "Ö", and "ö". At first I manually searched and replaced on these using TextEdit, then discovered this page, which contains an applescript (scroll down, it has a pink background) for searching and replacing on specified terms in a text document automatically. I copied it, opened 'script editor' in the applications folder, and pasted the copied text into a new document (you can save the script as an 'app'). I then scrolled down the applescript to the terms the script was designed to search on - originally "one", "two" and so forth (the script is designed to pop up screens asking you to confirm what you want each word or character or number to be replaced with). It's ridiculously easy just to swap the numbers in the applescript for the 'odd' characters I listed above ("Ö" etc). Then you save the script as an app - dead easy, like I say, even for me - and start it. Once you're in, navigate to the .pml file you just created with ereader2pml, select it, and it'll ask you what to replace each of the odd characters with. After that, it takes a minute or two for it to do the entire search and replace, automatically, saving you a bundle of time and trouble (there may be commercial software out there that does this too, but I'm a skinflint). 
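A possible explanation for several of those odd characters, offered as a guess rather than a diagnosis: if the .pml is cp1252 text and the editor displays it as Mac OS Roman, the curly quotes, apostrophe and ellipsis bytes show up as "ì", "î", "í" and "Ö". Under that assumption, the Python 3 sketch below reverses the misreading in one go instead of replacing characters one at a time. The function name is made up, and not every odd character need follow this pattern.

```python
def undo_mac_roman_misread(garbled):
    """Recover cp1252 text that was displayed as if it were Mac OS Roman."""
    # Turn the displayed characters back into the original bytes, then
    # decode those bytes with the encoding they were actually written in.
    return garbled.encode("mac_roman").decode("cp1252", errors="replace")


for odd in ["ì", "î", "í", "Ö"]:
    print(repr(odd), "->", repr(undo_mac_roman_misread(odd)))
```

Simpler still, under the same assumption, is to open the file as cp1252 (Windows Latin 1) in the first place, so that no replacing is needed at all.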
It's very important that you also create a folder called nameofbook_img, assuming your book is called nameofbook.pml (i.e. exactly the same as the book's title, but with _img added). This folder should go in the same location as your .pml book (you can just trash the html file created at the same time). So if you have a folder called 'ereader hacks' which contains nameofbook.pml, it also has a subfolder called 'nameofbook_img' (I got this from ereader's own online guide to palm markup syntax, found after a quick google). This folder is for any images contained within the ereader file. Even if you're not bothered about having them, you're going to need the folder, or the next and final step might not work. Place all the images extracted from the original drm'ed nameofbook.pdb file into the _img folder.
Next: download dropbook, as suggested in the previous link to how to create ereader2pml.py. Finally, drag the .pml file onto the free dropbook software - it's used for creating ereader files. If there are any errors, it'll show them in a separate window. Usually they're weird characters like the ones I listed above, where rarer characters like ° (as in 23°) have failed to come across. You'll be able to see what and where they are from the list of errors in that window, and you can then go into the .pml file with TextEdit to change them manually (or alternatively add the new 'mistranslated' characters to the list in your new search and replace script, then repeat the whole process).
If the images aren't in exactly the folder I described above, dropbook won't be able to find them and will return error messages. Not only that, it will probably fail to create the new DRM-free, error-free nameofbook.pdb file you will otherwise be delighted to find waiting for you. So make sure you have that image folder. It took me a while to figure all this out, but once I had, it was ridiculously easy to run further ereader books.
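The folder layout described above can be set up with a few lines of Python 3. This is only a sketch: the function name is made up, and the list of image extensions is an assumption about what the extraction step leaves behind.

```python
import shutil
from pathlib import Path

# Assumed set of image types; adjust to whatever the extracted folder holds.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif"}


def stage_images(pml_path, extracted_dir):
    """Copy images from extracted_dir into <bookname>_img beside the .pml file."""
    pml_path = Path(pml_path)
    # nameofbook.pml -> nameofbook_img, in the same folder as the .pml
    img_dir = pml_path.with_name(pml_path.stem + "_img")
    img_dir.mkdir(exist_ok=True)
    copied = []
    for item in sorted(Path(extracted_dir).iterdir()):
        if item.is_file() and item.suffix.lower() in IMAGE_SUFFIXES:
            shutil.copy2(item, img_dir / item.name)
            copied.append(item.name)
    return img_dir, copied
```

Calling stage_images("ereader hacks/nameofbook.pml", "outdir") would create "ereader hacks/nameofbook_img" and copy the images into it, leaving the html and pml files where they are.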
I don't think I've missed anything out, and I can't find any difference between the broken .pdb book and the original drm'ed one.
Ok, all of the above is now wildly irrelevant because it turned out I was using an older version of ereader2html (v3.0). I've upgraded to v6, and that does everything I need it to. So I'm 'greying' the above in case anyone makes the mistake of trying to get anything useful out of it. What I will say, however, is that the generated html file seems to work a lot better than the pml file produced by the cracking process.
One other thing - sometimes there are unfeasibly long breaks between sections of a book. Personally, I've found the easiest way to fix this is, once you've ported the cracked html into calibre, to go to the 'zip' file calibre creates after importing, unzip it, and edit the html file inside, i.e. delete multiple paragraph returns or multiple iterations of <p></p>. Then rezip it (remember to get rid of the original zip file), and calibre will generate a new, hopefully improved ebook file with your changes incorporated. Last edited by PressEnter; 11-07-2009 at 11:57 PM. Reason: New information |
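The hand-editing of repeated <p></p> can also be done with a short Python 3 sketch like the one below. It is an illustration only: the function name is made up, and it treats a paragraph as "empty" if it holds nothing but whitespace or &nbsp;, which is an assumption about what the converted html contains.

```python
import re

# A run of two or more empty paragraphs, allowing attributes on the <p> tag.
EMPTY_P_RUN = re.compile(
    r"(?:<p(?:\s[^>]*)?>(?:\s|&nbsp;)*</p>\s*){2,}", re.IGNORECASE
)


def collapse_empty_paragraphs(html):
    """Replace each run of two or more empty <p></p> elements with a single one."""
    return EMPTY_P_RUN.sub("<p></p>\n", html)
```

It would be run on the html file after unzipping and before rezipping. A single empty paragraph is left alone, so deliberate section breaks survive.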
11-01-2009, 03:16 AM | #52 |
Member
Posts: 15
Karma: 50
Join Date: Jul 2008
Device: Sony Reader
|
I only just realised editing that post above doesn't move the thread up to the top of the stack, and there might be other people who might find what I did useful (apologies to anyone who knows this stuff better than I do, it's probably a total mishmash but it worked for me), so I'm just adding in this comment to rectify that.
|
12-02-2009, 01:57 PM | #53 |
Connoisseur
Posts: 91
Karma: 108
Join Date: Jan 2008
Device: Palm Treo 680, Sony Reader
|
Okay, this manual editing thing is for the birds, so I'll offer some code to handle the conversion of special characters automatically. Obviously, I'm not going to post any illegal material, so you'll have to do the modifications yourself.
First off, add this definition right above the " def getText(self):" definition: Code:
# needs "import binascii" at the top of the script
def cleanPML(self, pml):
    # Update old \b font tag with correct \B bold font tag
    pml2 = pml.replace('\\b', '\\B')
    # Convert special characters to proper PML code.
    # High ASCII starts at (\x82, \a130) and goes up to (\xff, \a255)
    for k in xrange(130, 256):
        # a2b_hex takes a hexadecimal string and converts it to the
        # binary character that we search for and replace
        badChar = binascii.a2b_hex('%02x' % k)
        pml2 = pml2.replace(badChar, '\\a%03d' % k)
    # end for k
    return pml2
Then change the call that reads: Code:
zlib.decompress(....
so that its result is passed through the new function: Code:
self.cleanPML(zlib.decompress(...
- Jim Last edited by macr0t0r; 12-02-2009 at 02:02 PM. |
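For anyone experimenting outside the script, here is a standalone Python 3 sketch of just the character-escaping idea, working on bytes. It is not the code from the post: the function name is made up, it uses a single regular-expression pass instead of a loop, and it deliberately leaves \b tags alone.

```python
import re

# Bytes 0x82-0xFF, the same range the post treats as "high ASCII".
HIGH_BYTE = re.compile(rb"[\x82-\xff]")


def escape_high_bytes(pml_bytes):
    """Rewrite each byte in 0x82-0xFF as the PML escape \\aNNN (decimal)."""
    # m.group()[0] is the matched byte as an integer, e.g. 0x92 -> 146
    return HIGH_BYTE.sub(lambda m: b"\\a%03d" % m.group()[0], pml_bytes)
```

A cp1252 apostrophe (byte 0x92) in b"It\x92s" becomes \a146, giving b"It\\a146s" in Python notation.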
12-02-2009, 03:13 PM | #54 |
The Grand Mouse 高貴的老鼠
Posts: 71,504
Karma: 306214458
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
|
I'm not certain that this is a good change to make. \b is 'deprecated', but not unsupported. And it doesn't do exactly the same thing as the \B tag, so changing from \b to \B could cause the text in a document to display differently.
|
12-02-2009, 04:39 PM | #55 |
Connoisseur
Posts: 91
Karma: 108
Join Date: Jan 2008
Device: Palm Treo 680, Sony Reader
|
True, the \b tag refers to the designated "boldFont" while \B makes the current text bold. However, I've had the "\b" tag give me headaches with badly formatted ebooks where I think the publisher never checked their work. In almost every case, it seems the intended effect was to make the current text bold. Instead, I'd have the text suddenly drop in size when made bold. I finally added that line since I've yet to come across a book that converted badly with the "upgrade."
- Jim |
12-02-2009, 06:14 PM | #56 | |
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
Quote:
Code:
text = re.sub('[^\x00-\x7f]', lambda x: '\\U%04x' % ord(x.group()), text) |
|
12-02-2009, 06:37 PM | #57 | |
Connoisseur
Posts: 91
Karma: 108
Join Date: Jan 2008
Device: Palm Treo 680, Sony Reader
|
Quote:
Second, I don't believe there are extended codes for \x80 and \x81. However, this is a fascinating little trick. Perhaps this could work? Code:
text = re.sub('[\x82-\xff]', lambda x: '\\a%03d' % ord(x.group()), text)
Code:
text = re.sub('[^\x00-\xff]', lambda x: '\\U%04x' % ord(x.group()), text)
Last edited by macr0t0r; 12-02-2009 at 06:40 PM. |
|
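The two substitutions discussed here can be combined when the text has already been decoded to Unicode, as in this Python 3 sketch. It is an assumption-laden illustration, not calibre's code: the function name is made up, and it decides between \aNNN and \Uxxxx by checking whether the character has a cp1252 byte of 0x82 or above.

```python
def escape_for_pml(text):
    """Escape non-ASCII characters in decoded text as PML \\aNNN or \\Uxxxx codes."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)  # plain ASCII passes through untouched
            continue
        try:
            byte = ch.encode("cp1252")[0]
        except UnicodeEncodeError:
            byte = None  # character has no cp1252 equivalent
        if byte is not None and byte >= 0x82:
            out.append("\\a%03d" % byte)  # decimal cp1252 code
        else:
            out.append("\\U%04x" % ord(ch))  # hex Unicode code point
    return "".join(out)
```

Characters beyond U+FFFF would produce more than four hex digits, which this sketch makes no attempt to handle.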
12-02-2009, 07:07 PM | #58 | |||
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
Quote:
Quote:
Quote:
|
|||
12-03-2009, 04:32 AM | #59 |
Connoisseur
Posts: 91
Karma: 108
Join Date: Jan 2008
Device: Palm Treo 680, Sony Reader
|
Okay, now I'm getting somewhere! I'm able to bring in a very clean PMLZ archive into Calibre. Now it's time for the next two headaches:
1. Cover Page
2. Footnotes conversion
First off, the cover page is always the "cover.png" file, but Calibre doesn't recognize that from the PML. I've put in a suggested fix for that in this post: https://www.mobileread.com/forums/sho...40&postcount=7
Second, I want to better format the footnotes. They're all lumped together, so if the footnote does not have a unique number, I can't tell which footnote is being referenced. Any chance of adding a page break after each footnote or at least greatly increasing the space between each one?
And while this is a bit off-topic: why in the heck does ePub NOT have FOOTNOTES? Seriously, the eReader/PeanutPress format has had this support for years! Almost every reference book (and many a story) has footnotes. Did the creators of the ePub format just forget?
Okay, I'm off my rant. Thanks for the help guys! With a little more effort, I'll be able to make the new-fangled Sony Reader be nearly as good as the aging Palm TX (Crimony, the e-ink is the only thing going for this guy).
- Jim |
12-04-2009, 12:40 PM | #60 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Hi Jim,
I have the same issue and have modified xpml2xhtml.py to handle footnotes differently when converting to html. It now includes a page break before each footnote and adds a "return" hyperlink to take you back to the place in the text where you first clicked on the footnote. It also lets Tidy handle the funny chars, so no cleanup is necessary. This works well since only one footnote appears on a page when you click on it (similar to how my ereader program handles them). The return link is nice if you have an e-reading device that does not offer any way to access page history (from what I've read, the Sony eBook reader is one). There are also a large number of xhtml fixes and other things. You might like to try it and then import the xhtml it generates into Calibre directly (since xhtml seems to be Calibre's internal format). Anyway, that is what I do. If you want it I will happily post it for you (since it has no DRM removal code in it at all). Just let me know, KevinH |
|
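To make the footnote idea concrete, here is a rough Python 3 sketch of the kind of XHTML such an approach might emit. It is not KevinH's xpml2xhtml.py: the function name, the "fn-" and "ref-" id scheme, and the inline page-break style are all assumptions chosen for illustration.

```python
from html import escape


def render_footnotes(footnotes):
    """Build XHTML with one footnote per page and a link back to its reference.

    footnotes is a list of (note_id, text) pairs; the matching reference in the
    body text is assumed to carry the anchor id "ref-<note_id>".
    """
    parts = []
    for note_id, text in footnotes:
        parts.append(
            '<div id="fn-%s" style="page-break-before: always">\n'
            '<p>%s</p>\n'
            '<p><a href="#ref-%s">return</a></p>\n'
            '</div>' % (escape(note_id), escape(text), escape(note_id))
        )
    return "\n".join(parts)
```

Whether a given reader honours the page-break style is device-dependent, so the output would need checking on the actual hardware.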