Old 10-29-2009, 11:18 AM   #46
PressEnter
Member
Posts: 15
Karma: 50
Join Date: Jul 2008
Device: Sony Reader
I'm trying to figure this out too. I know a particular ereader file I want to convert to LRF is cp1252 charset, but where exactly does --input-encoding="cp1252" go, when using the command-line interface?

Given the command is written

ereader2html infile.pdb outdir "your name" credit_card_number

(which works fine, bar the dodgy A's with hats and random line-breaks, etc), where should one insert

--input-encoding="cp1252"

Or have I typically got it all horribly wrong?

And is it likely to make a difference, given that the html file produced by MobiDeDrm is already encoded in cp1252, according to Firefox?

I should also mention the following: I went to calibre and set the specified file-type plug-in to 'cp1252'. I then imported an html file which had been successfully exploded from ereader with MobiDeDrm. I had already checked that html file and saw that all characters etc. showed up just fine under Firefox.

After importing the html file - which apparently causes calibre to automatically zip the html, together with any associated image files, into a single zip - I navigated to the folder (on a mac) where the new zip file was located, exploded the zip, and found that the html file within had become scrambled - ie missing apostrophes and quote marks, random page breaks, that kind of thing. So clearly the problem happens when the html is first imported into Calibre.

Any thoughts? Or should I try and 'open' a ticket as suggested (I figured I'd try here first, being as clueless about a lot of this stuff as some others on this thread)?
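
For anyone following along, here is a minimal Python sketch of how the "A's with hats" can arise. It assumes the text was written in one charset and read back in another; the helper name misread is made up for illustration and is not part of calibre or ereader2html.

```python
def misread(text, written_as, read_as):
    """Encode text in one charset, then decode the same bytes as another."""
    return text.encode(written_as).decode(read_as, errors="replace")

# A curly apostrophe written as UTF-8 but read as cp1252 turns into three characters.
garbled = misread("\u2019", "utf-8", "cp1252")
print(repr(garbled))

# Reading the bytes back with the charset they were written in leaves the text intact.
intact = misread("\u2019", "cp1252", "cp1252")
print(repr(intact))
```

If the garbled form matches what shows up on screen, the charset given to the converter probably does not match the one the file was actually saved in.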
Old 10-29-2009, 01:01 PM   #47
Xenophon
curmudgeon
Posts: 1,481
Karma: 5748190
Join Date: Jun 2006
Location: Redwood City, CA USA
Device: Kobo Aura HD, (ex)nook, (ex)PRS-700, (ex)PRS-500
The best approach I am aware of is to take your copy of ereader2html.py and modify it to (a) write out the pml file instead of converting to html, and (b) re-write high-ascii characters using the \aNNN syntax, where NNN is the value of the byte in octal? decimal? (I forget; check the mobileread wiki page for the format). That works around the whole problem by making sure that the contents of the pml file are fully escaped.

THEN, use Drop Book to create a new (DRM-free) ereader file, import that into Calibre, and off you go.

One reason this is a good choice is that Calibre's conversion to html seems to work much better than ereader2html's version.

As always, your mileage may vary, etc.

Xenophon
Old 10-29-2009, 01:15 PM   #48
PressEnter
I now realise that much of what I posted back there is complete bollocks, since I had in fact got my ereader2html mixed up with calibre's own command line interface stuff, which I hadn't used before.

I just ran the book in question, post ereader2html conversion, through calibre's command line interface, as mentioned earlier in the thread, using what appears to be the correct charset. As a result, all the quote marks, dashes etc. are finally exactly where they should be ... except the text is all either centred or right-side ragged. Wah!

Xenophon - thanks for the tips, but I'm afraid I hardly understood any of it! I can find my way around a computer on a basic level, and I only found out about charsets today. It all sounds quite advanced to me, unless you can point me to online resources that explain it reasonably simply.

Edit: I got it a little further along - I converted the post-ereader2html output to epub, then converted that to LRF using the command-line interface with the cp1252 modifier. I got intact quotes, correct text alignment ... but lots of random misplaced paragraph returns breaking up the text. Anyone got any ideas?

Last edited by PressEnter; 10-29-2009 at 02:12 PM.
Old 10-29-2009, 02:33 PM   #49
Xenophon
What I recommended is straight-forward for an experienced programmer. Problem is that distributing the resulting program is probably a felony in the US. Passing on more detailed directions is also likely to be a felony. So I pass on those slightly-cryptic tips for those who can do it themselves.

Note that someone posted directions over in the workshop forum for how to modify ereader2html to output pml instead of html. That's the most important thing, and it's only a 2-line edit. And "Drop Book" is a free download from fictionwise and/or ereader.com.

Xenophon
Old 10-29-2009, 04:29 PM   #50
Rachel
Zealot
Posts: 115
Karma: 260
Join Date: Sep 2008
Location: Suffolk, England
Device: sony prs505, kindle, ithing
I have just checked and somebody has already raised a ticket about this. Meanwhile, I think I will convert to ePub which seems OK. I'm afraid the changes are beyond me - I need to follow instructions written in words of one syllable! I felt triumphant when I converted my first book using ereader2html!

Thanks for all the help guys.
Old 10-30-2009, 09:51 AM   #51
PressEnter
I've heavily rewritten this comment to reflect every step I took from a DRM'ed ereader .pdb book to a non-DRM'd, error-free open format transferable to my Sony Reader. All this was done on a Mac iBook running OS X 10.4, with the ereader2html MobiDeDrm script and Python 2.6 installed.

I am not a programmer, but I managed to figure out how to do this in a day or so.

First of all, I went here:. I followed the instructions precisely, creating a new python script named 'ereader2pml.py'. It's very easy.

I ran this script on a protected ereader file. It puts any graphics inside the book into a separate folder called 'outdir', and also placed in there an .html file as well as a .pml file. Pml means 'palm markup language', the layout language used for ereader books.

The pml is just a textfile, with odd characters replacing apostrophes, quotes and dashes and so forth - the most frequently recurring characters were "ì", "î", "í", "Ö", and "ö".

At first I manually searched and replaced on these using TextEdit, then discovered this page, which contains an applescript (scroll down, it has a pink background) for searching and replacing on specified terms in a text document automatically. I copied it, opened 'script editor' in the applications folder, and pasted the copied text into a new document (you can save the script as an 'app').

I then scrolled down the applescript to the terms the script was designed to search on - originally "one", "two" and so forth (the script is designed to pop up screens asking you to confirm what you want each word or character or number to be replaced with).

It's ridiculously easy just to swap the numbers in the applescript for the 'odd' characters I listed above ("Ö" etc). Then you save the script as an app - dead easy, like I say, even for me - and start it.

Once you're in, navigate to the .pml file you just created with ereader2pml, select it, and it'll ask you what to replace each of the odd characters with. After that, it takes a minute or two for it to do the entire search and replace, automatically, saving you a bundle of time and trouble (there may be commercial software out there that does this too, but I'm a skinflint).
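
The same search and replace can be sketched in a few lines of Python. This is only a sketch, and it rests on an assumption: that the odd characters appear because cp1252 bytes were displayed as MacRoman (which fits "ì", "î" and "í" standing in for curly quotes and apostrophes on a Mac). The function name fix_mac_misread is made up for illustration.

```python
def fix_mac_misread(text):
    """Undo a cp1252-read-as-MacRoman mix-up, leaving unmappable characters alone."""
    fixed = []
    for ch in text:
        try:
            # Recover the original byte, then read it with the charset it was written in.
            fixed.append(ch.encode("mac_roman").decode("cp1252"))
        except (UnicodeEncodeError, UnicodeDecodeError):
            # No sensible mapping for this character, so keep it unchanged.
            fixed.append(ch)
    return "".join(fixed)

print(fix_mac_misread("\u00ecDon\u00edt panic,\u00ee she said."))
```

Plain ASCII passes through untouched, but any genuinely accented letters would be remapped too, so this only makes sense when the whole file was misread in the same way.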

It's very important that you also create a folder called nameofbook_img, assuming your book is called nameofbook.pml (ie, exactly the same as the book's title, but with _img added). This folder should go in the same location as your .pml book (you can just trash the html file created at the same time). So if you have a folder called 'ereader hacks' which contains nameofbook.pml, it has another subfolder called 'nameofbook_img' (I got this from ereader's own online guide to palm markup syntax, found after a quick google). This folder is for any images contained within the ereader file. Even if you're not bothered about having them, you're going to need the folder, or the next and final step might not work. Place all the images extracted from the original drm'ed nameofbook.pdb file into the _img folder.

Next: download dropbook, as suggested in the previous link to how to create ereader2pml.py.

Finally, drag the .pml file onto the free dropbook software - it's used for creating ereader files. If there's any errors, it'll show them up in a separate window. Usually they're weird characters like the ones I listed above, where rarer characters like ° (as in 23°) have failed to come across. You'll be able to see what and where they are from the window list of errors and you can then go into the .pml file with TextEdit to change them manually (or alternatively add the new 'mistranslated' characters to the list in your new search and replace script, then repeat the whole process).

If the images aren't in the exact correct folder as I stated above, dropbook won't be able to find them and will return error messages. Not only that, it will probably fail to create the new DRM-free, error-free nameofbook.pdb file you will otherwise be delighted to find waiting for you. So make sure you have that image file.

It took me a while to figure all this out, but once I had, it was ridiculously easy to run further ereader books. I don't think I've missed anything out, and I can't find any difference between the broken .pdb book and the original drm'ed one.


Ok, all of the above is now wildly irrelevant because it turned out I was using an older version of ereader2html (v3.0). I've upgraded to v6, and that does everything I need it to. So I'm 'greying' the above in case anyone makes the mistake of trying to get anything useful out of it. What I will say, however, is that the generated html file seems to work a lot better than the pml file produced by the cracking process.

One other thing - sometimes there are unfeasibly long breaks between sections of a book. I've found the easiest way, personally, to edit this stuff is once you've ported the cracked html into calibre, go to the 'zip' file calibre creates after importing, unzip it, and edit the html file inside, ie deleting multiple paragraph returns or multiple iterations of <p></p>. Then rezip it (remember to get rid of the original zip file), and calibre will generate a new, hopefully improved ebook file with your changes incorporated.
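
For the repeated <p></p> case, a small Python sketch along these lines could save the hand editing. It assumes the empty paragraphs are bare <p> tags without attributes; the names EMPTY_PARAS and collapse_empty_paragraphs are made up for illustration and have nothing to do with calibre itself.

```python
import re

# Matches two or more empty <p></p> elements in a row (whitespace or &nbsp; inside is allowed).
EMPTY_PARAS = re.compile(r"(?:<p>(?:\s|&nbsp;)*</p>\s*){2,}", re.IGNORECASE)

def collapse_empty_paragraphs(html):
    """Replace each run of empty paragraphs with a single empty paragraph."""
    return EMPTY_PARAS.sub("<p></p>\n", html)

sample = "<p>One</p>\n<p></p>\n<p></p>\n<p> </p>\n<p>Two</p>"
print(collapse_empty_paragraphs(sample))
```

Run it over the html file from the unzipped archive, write the result back, and rezip as described above.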

Last edited by PressEnter; 11-07-2009 at 11:57 PM. Reason: New information
Old 11-01-2009, 03:16 AM   #52
PressEnter
I only just realised editing that post above doesn't move the thread up to the top of the stack, and there might be other people who might find what I did useful (apologies to anyone who knows this stuff better than I do, it's probably a total mishmash but it worked for me), so I'm just adding in this comment to rectify that.
Old 12-02-2009, 01:57 PM   #53
macr0t0r
Connoisseur
Posts: 91
Karma: 108
Join Date: Jan 2008
Device: Palm Treo 680, Sony Reader
Okay, this manual editing thing is for the birds, so I'll offer some code to handle the conversion of special characters automatically. Obviously, I'm not going to post any illegal material, so you'll have to do the modifications yourself.

First off, add this definition right above the " def getText(self):" definition:
Code:
    # Requires "import binascii" at the top of the script.
    def cleanPML(self, pml):
        # Update old \b font tag with correct \B bold font tag
        pml2 = pml.replace('\\b', '\\B')
        # Convert special characters to proper PML code.  High ASCII starts at (\x82, \a130) and goes up to (\xff, \a255)
        for k in xrange(130, 256):
            # a2b_hex takes a hexadecimal string and converts it to the single raw byte that we search for and replace
            badChar = binascii.a2b_hex('%02x' % k)
            pml2 = pml2.replace(badChar, '\\a%03d' % k)
        #end for k
        return pml2
Make sure you use spaces instead of tabs for indents. Now, change any line with:
Code:
zlib.decompress(....
to:
Code:
self.cleanPML(zlib.decompress(...
That will automatically fix all of the special characters. Mind you, it will run a little slower, but it's still faster than manual editing! I actually use this bit of code when cleaning up my OpenOffice-generated files before converting them to use on my Palm. (Yes, I still use a Palm. What?)

- Jim

Last edited by macr0t0r; 12-02-2009 at 02:02 PM.
Old 12-02-2009, 03:13 PM   #54
pdurrant
The Grand Mouse 高貴的老鼠
Posts: 71,504
Karma: 306214458
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
Quote:
Originally Posted by macr0t0r View Post
# Update old \b font tag with correct \B bold font tag
pml2 = pml.replace('\\b', '\\B')
I'm not certain that this is a good change to make. \b is 'deprecated', but not unsupported. And it doesn't do exactly the same thing as the \B tag, so changing from \b to \B could cause the text in a document to display differently.
Old 12-02-2009, 04:39 PM   #55
macr0t0r
True, the \b tag refers to the designated "boldFont" while \B makes the current text bold. However, I've had the "\b" tag give me headaches with badly formatted ebooks where I think the publisher never checked their work. In almost every case, it seems the intended effect was to make the current text bold. Instead, I'd have the text suddenly drop in size when made bold. I finally added that line since I've yet to come across a book that converted badly with the "upgrade."

- Jim
Old 12-02-2009, 06:14 PM   #56
user_none
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
Quote:
Originally Posted by macr0t0r View Post
...
Code:
        # Convert special characters to proper PML code.  High ASCII start at (\x82, \a130) and go up to (\xff, \a255)
        for k in xrange(130,256):
            # a2b_hex takes in a hexidecimal as a string and converts it to a binary ascii code that we search and replace for
            badChar=binascii.a2b_hex('%02x' % k)
            pml2 = pml2.replace(badChar, '\\a%03d' % k)
        #end for k
A simpler way to do this is:

Code:
text = re.sub('[^\x00-\x7f]', lambda x: '\\U%04x' % ord(x.group()), text)
Old 12-02-2009, 06:37 PM   #57
macr0t0r
Quote:
Originally Posted by user_none View Post
A simpler way to do this is:

Code:
text = re.sub('[^\x00-\x7f]', lambda x: '\\U%04x' % ord(x.group()), text)
Hmmm....I like that it's small and fast, but it has a couple issues. First off, the \U tag has proven to be unreliable with some fonts, and it's a train-wreck on Symbian devices. Also, I prefer how the "\a000" translates directly to "& #000;" in HTML (space to prevent htmlizing).

Second, I don't believe there are extended codes for \x80 and \x81. However, this is a fascinating little trick. Perhaps this could work?
Code:
text = re.sub('[\x82-\xff]', lambda x: '\\a%03d' % ord(x.group()), text)
Then, perhaps I could fall back to unicode for whatever is left:
Code:
text = re.sub('[^\x00-\xff]', lambda x: '\\U%04x' % ord(x.group()), text)
- Jim

Last edited by macr0t0r; 12-02-2009 at 06:40 PM.
Old 12-02-2009, 07:07 PM   #58
user_none
Quote:
Originally Posted by macr0t0r View Post
... First off, the \U tag has proven to be unreliable with some fonts, and it's a train-wreck on Symbian devices.
Good to know. That code is from calibre's PML output and I only test against the desktop software with the standard font. Looking at the docs it seems that \\U only supports certain fonts and versions.

Quote:
Originally Posted by macr0t0r View Post
Second, I don't believe there are extended codes for \x80 and \x81.
Looks like there isn't.

Quote:
Originally Posted by macr0t0r View Post
However, this is a fascinating little trick. Perhaps this could work?
Code:
text = re.sub('[\x82-\xff]', lambda x: '\\a%03d' % ord(x.group()), text)
Then, perhaps I could fall back to unicode for whatever is left:
Code:
text = re.sub('[^\x00-\xff]', lambda x: '\\U%04x' % ord(x.group()), text)
This will work very well inside of the eReader script because you should never encounter characters that are not defined by either the \\a or \\U tags.
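
Putting the two substitutions together, a self-contained sketch might look like this. It is written for Python 3 rather than the Python 2 used by the script, and it assumes the raw cp1252 bytes are decoded as latin-1 first so that each byte keeps its numeric value; the function name escape_pml is made up for illustration.

```python
import re

def escape_pml(text):
    """Escape non-ASCII characters using the PML \\a and \\U tags discussed above."""
    # Code points 0x82-0xFF become \aNNN, where NNN is the three-digit decimal value.
    text = re.sub('[\x82-\xff]', lambda m: '\\a%03d' % ord(m.group()), text)
    # Anything above 0xFF falls back to \UXXXX, four hexadecimal digits.
    text = re.sub('[^\x00-\xff]', lambda m: '\\U%04x' % ord(m.group()), text)
    return text

raw = b'\x93Hi\x94'  # cp1252 curly quotes around "Hi"
escaped = escape_pml(raw.decode('latin-1'))
print(escaped)
```

The order matters only for readability here: the output of the first substitution is pure ASCII, so the second one never touches it.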
Old 12-03-2009, 04:32 AM   #59
macr0t0r
Okay, now I'm getting somewhere! I'm able to bring in a very clean PMLZ archive into Calibre. Now it's time for the next two headaches:
1. Cover Page
2. Footnotes conversion

First off, the cover page is always the "cover.png" file, but Calibre doesn't recognize that from the PML. I've put in a suggested fix for that in this post: https://www.mobileread.com/forums/sho...40&postcount=7

Second, I want to better format the footnotes. They're all lumped together, so if the footnote does not have a unique number, I can't tell which footnote is being referenced. Any chance of adding a page break after each footnote or at least greatly increasing the space between each one?

And while this is a bit off-topic: why in the heck does ePub NOT have FOOTNOTES? Seriously, the eReader/PeanutPress format has had this support for years! Almost every reference book (and many stories) have footnotes. Did the creators of the ePub format just forget?

Okay, I'm off my rant. Thanks for the help guys! With a little more effort, I'll be able to make the new-fangled Sony Reader be nearly as good as the aging Palm TX (Crimony, the e-ink is the only thing going for this guy).

- Jim
Old 12-04-2009, 12:40 PM   #60
KevinH
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
Hi Jim,

I have the same issue and have modified xpml2xhtml.py to handle footnotes differently when converting to html. It now inserts a page break before each footnote and adds a "return" hyperlink that takes you back to the place in the text where you clicked on the footnote. It also lets Tidy handle the funny chars, so no cleanup is necessary.

This works well since only 1 footnote appears on a page when you click on it (similar to how my ereader program handles them). The return link is nice if you have an e-reading device that does not offer any way to access page history (i.e. the Sony eBook reader).
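
To give a rough idea of the output, here is a hypothetical Python sketch of one footnote rendered that way. It is not the code from the modified xpml2xhtml.py; the function name render_footnote, the ids and the inline style are all assumptions made for illustration.

```python
from html import escape

def render_footnote(note_id, body_html, return_anchor):
    """Wrap one footnote so it starts on a new page and links back to the text."""
    return (
        '<div id="%s" style="page-break-before: always">\n'
        '%s\n'
        '<p><a href="#%s">return</a></p>\n'
        '</div>'
    ) % (escape(note_id, quote=True), body_html, escape(return_anchor, quote=True))

footnote = render_footnote("fn1", "<p>See chapter 2.</p>", "fnref1")
print(footnote)
```

The link in the body text would then need a matching id (fnref1 in this example) for the return link to land on.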

There are also a large number of xhtml fixes and things. You might like to try it and then import the xhtml it generates into Calibre directly (since xhtml seems to be Calibre's internal format).

Anyway that is what I do.

If you want it I will happily post it for you (since it has no DRM removal code in it at all).

Just let me know,

KevinH