html2mobi (a mobigen replacement written in Perl)


tompe
11-25-2007, 07:08 PM
When I realized that there was support for reading and writing mobi files in Perl I got inspired to start to write a mobigen replacement today since my favourite language is Perl.

Now if a set of html files is given to the script, a table of contents is generated automatically. The script also takes an opf file as input, and it now manages to generate a working mobi file for the Alice in Wonderland test example with working images (at least they work in FBReader). The table of contents is not working properly but I will look at that. Does anybody know the data structure for this? I can always just add it at the beginning but if it is possible to do it correctly I will do it.

Now I just save the images in new records. Will this work? I seem to remember some limitations mentioned about the size of a record.

In a couple of days I can make the first alpha version available. But first I want to test the script on some more examples. So does anybody have any recommendations for files to test with? Or know about any well-known issues I should check for?

JSWolf
11-25-2007, 07:26 PM
How will your script handle images? Will they be the same size in the script-generated mobi book?

kovidgoyal
11-25-2007, 07:28 PM
Are you actually parsing the HTML and recreating it or just packaging it into a mobi?

tompe
11-25-2007, 07:49 PM
Are you actually parsing the HTML and recreating it or just packaging it into a mobi?

I am parsing the HTML and recreating it after some patching. With a lot of complete HTML files as input you have to do this to get just one HTML file. You also need to change the img tags. And I suppose I need to patch bad HTML code. For example, some old lit files seem to give bad HTML and wrong entities after using clit.

But I have not actually found a specification of the allowed HTML code. I was going to take the approach that what works on my Gen3 is allowed...

tompe
11-25-2007, 07:53 PM
How will your script handle images? Will they be the same size in the script-generated mobi book?

I have to check this. I suppose I will use the method with alternative files or just scale a file if it is too big. And of course I can scale up an image if only low-resolution images are available. And I will convert to jpg if the record size becomes too big using another format.

Do you always want to maximize the image size according to the reading device? Or should you add some size specification in the img tag?

wallcraft
11-25-2007, 08:40 PM
MobiPocket does have PRCGEN Documentation (http://www.mobipocket.com/dev/article.asp?BaseFolder=prcgen), which provides some information about the supported HTML.

You have probably already seen MobiPocket TOC using mobigen (http://www.mobileread.com/forums/showthread.php?t=14911) and Images in MobiPocket (http://www.mobileread.com/forums/showthread.php?t=14641). In particular, a toc.html appears to be required for mobigen to create a TOC and it is inserted at the end of the .mobi file. An automatic TOC would be a useful addition, and yet another reason to prefer html2mobi over mobigen.

I have never seen the hisrc attribute (Image support and display (http://www.mobipocket.com/dev/article.asp?BaseFolder=prcgen&File=images.htm)) used for an image in an actual MOBI file, but it might be one way to add a larger image to a MOBI file while maintaining backward compatibility. It might be enough, though, to have a default image size, or have html2mobi honor width & height larger than the image by rescaling the image (note that the reader ignores width & height larger than the image).

wallcraft
11-25-2007, 08:51 PM
A mobi2html that explodes MOBI to HTML would also be useful. It would obviously only work on DRM-free PRC and MOBI files. The easiest option would just be to extract the single HTML file and the images, with the images correctly referenced in the HTML. Better would be to extract the .opf file from the HTML preamble. Note that mobi2epub would then be a simple addition, or just use the existing oeb2epub.py (http://www.mobileread.com/forums/showpost.php?p=116279&postcount=61) in combination with mobi2html.

tompe
11-25-2007, 09:04 PM
A mobi2html that explodes MOBI to HTML would also be useful. It would obviously only work on DRM-free PRC and MOBI files. The easiest option would just be to extract the single HTML file and the images, with the images correctly referenced in the HTML. Better would be to extract the .opf file from the HTML preamble. Note that mobi2epub would then be a simple addition, or just use the existing oeb2epub.py (http://www.mobileread.com/forums/showpost.php?p=116279&postcount=61) in combination with mobi2html.

I wrote that first. There are some issues with the images since I do not know how to find out which record they start in. But I do not think that the opf is saved in the preamble. At least it did not seem to be there for the Alice files.

tompe
11-25-2007, 09:15 PM
MobiPocket does have PRCGEN Documentation (http://www.mobipocket.com/dev/article.asp?BaseFolder=prcgen), which provides some information about the supported HTML.

You have probably already seen MobiPocket TOC using mobigen (http://www.mobileread.com/forums/showthread.php?t=14911) and Images in MobiPocket (http://www.mobileread.com/forums/showthread.php?t=14641). In particular, a toc.html appears to be required for mobigen to create a TOC and it is inserted at the end of the .mobi file. An automatic TOC would be a useful addition, and yet another reason to prefer html2mobi over mobigen.

I have never seen the hisrc attribute (Image support and display (http://www.mobipocket.com/dev/article.asp?BaseFolder=prcgen&File=images.htm)) used for an image in an actual MOBI file, but it might be one way to add a larger image to a MOBI file while maintaining backward compatibility. It might be enough, though, to have a default image size, or have html2mobi honor width & height larger than the image by rescaling the image (note that the reader ignores width & height larger than the image).

I will look at the PRCGEN Documentation. I must have done something wrong since on my Gen3 the Alice book became 600 pages long...

For the Alice in Wonderland opf the toc is inserted at the end because it is in the spine specification. And if it were not there it would still have to be inserted because it is in the manifest specification. What I do not get is how to code things so you get a button in FBReader for the toc. I assume the guide tag has something to do with this.

The gif cover that was 600x800 caused my Gen3 to hang so I had to reboot it. I rescaled it a bit and saved as jpg instead and that worked better.

wallcraft
11-25-2007, 09:37 PM
An OPF preamble does seem to be optional. This is from the MobiPocket version of Ring of Fire (http://www.webscription.net/p-352-ring-of-fire.aspx) from the Baen Free Library.

<HTML><HEAD><metadata>
<dc-metadata xmlns:dc="http://purl.org/metadata/dublin_core" xmlns:oebpackage="http://openebook.org/namespaces/oeb-package/1.0/">
<dc:Title>Ring of Fire</dc:Title>
<dc:Type>Novel</dc:Type>
<dc:Identifier id="ISBN-074347175X" scheme="ISBN-Hardcover">0-7434-7175-X</dc:Identifier>
<dc:Identifier id="ISBN13-9780743471756" scheme="ISBN13-Hardcover">978-0-7434-7175-6</dc:Identifier>
<dc:Identifier id="ISBN-1416509089" scheme="ISBN-Paperback">1-4165-0908-9</dc:Identifier>
<dc:Identifier id="ISBN13-9781416509080" scheme="ISBN13-Paperback">978-1-4165-0908-0</dc:Identifier>
<dc:Identifier id="DOI-074347175X" scheme="DOI">10.1125/Baen.074347175X</dc:Identifier>
<dc:Publisher>Baen Books</dc:Publisher>
<dc:Creator role="aut" file-as="Flint, Eric">Eric Flint</dc:Creator>
<dc:Contributor role="art" file-as="Blair, Dru">Dru Blair</dc:Contributor>
<dc:Subject>Science Fiction</dc:Subject>
<dc:Rights>2004 by Eric Flint</dc:Rights>
<dc:Date>2004-01-01</dc:Date>
<dc:Language>US English (en-us)</dc:Language>
</dc-metadata>
</metadata>
<GUIDE>
<REFERENCE TYPE="toc" TITLE="Table of Contents" HREF="074347175X_top.htm" filepos="0001692887">
<REFERENCE TYPE="cover" TITLE="Cover" HREF="074347175X__i_.htm" filepos="0000001553">
<REFERENCE TYPE="copyright-page" TITLE="Copyright" HREF="074347175X__p_.htm" filepos="0000001785">
<REFERENCE TYPE="firstpage" TITLE="First Page" HREF="074347175X__p_.htm#Chap_0" filepos="0000004946">
</GUIDE>
<METADATA HREF="xyz_metadata.htm" filepos="0001694500"><hr></HEAD><BODY>
<h1 align="center"><img src="BMP" recindex="00001"><br />
Ring of Fire<br />
by<br />Eric Flint</H1>
<p align="center"><A HREF="074347175X_top.htm" filepos="0001692887">Table of Contents</A></P>

If the e-book has anything more than the first page under the Reader's contents icon, then I think it has to have a non-empty <GUIDE> section.

Another way to generate "typical" MOBI books would be to run mobigen.exe on an exploded LIT file, and compare the result to using html2mobi. In the case of Baen books, you can use the LIT version and compare the result to their MOBI version.

igorsk
11-25-2007, 10:11 PM
I wrote that first. There are some issues with the images since I do not know how to find out which record they start in. But I do not think that the opf is saved in the preamble. At least it did not seem to be there for the Alice files.

Number of the first record with images is the dword at offset 0x5C in the 'MOBI' header (which starts at offset 0x10 in record 0).
I'm going to do a post on internals of mobi format "soon"...
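
As a rough illustration of the offsets described above, here is a minimal Python sketch (Python is used only for illustration; the scripts in this thread are Perl). The function name is hypothetical, and big-endian byte order is an assumption based on the Palm database format; only the two offsets come from the post.

```python
import struct

MOBI_HEADER_START = 0x10   # MOBI header offset inside record 0 (from the post)
FIRST_IMAGE_FIELD = 0x5C   # dword offset inside the MOBI header (from the post)


def first_image_record(record0: bytes) -> int:
    """Return the number of the first image record (hypothetical helper)."""
    start = MOBI_HEADER_START
    if record0[start:start + 4] != b"MOBI":
        raise ValueError("no 'MOBI' signature at offset 0x10 of record 0")
    # Assumption: values are stored big-endian, as elsewhere in Palm databases.
    (index,) = struct.unpack_from(">I", record0, start + FIRST_IMAGE_FIELD)
    return index
```

With this, the image referenced as recindex="00001" in the HTML would presumably live in record `first_image_record(record0)`, the next image in the following record, and so on.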

tompe
11-25-2007, 10:20 PM
If the e-book has anything more than the first page under the Reader's contents icon, then I think it has to have a non-empty <GUIDE> section.

Another way to generate "typical" MOBI books would be to run mobigen.exe on an exploded LIT file, and compare the result to using html2mobi. In the case of Baen books, you can use the LIT version and compare the result to their MOBI version.

I tried to use the guide tag according to the documentation but I did not get it to work. What I did not get was how the other end of the href (the link target) should be specified. I tried a name attribute on an a tag but it did not seem to work.

I actually got mobi2html to work. Use it as:

perl mobi2html Alice_In_Wonderland.mobi > Alice.html

The images should work. But there are some problems with the rendering of the "wave text". I attach the script if anybody is interested in playing around with it. How do I attach a file called mobi2html?

tompe
11-25-2007, 10:31 PM
Number of the first record with images is the dword at offset 0x5C in the 'MOBI' header (which starts at offset 0x10 in record 0).
I'm going to do a post on internals of mobi format "soon"...

Do you know how the "library" image is specified? I noticed that when I had 7 images in the document then the library image was the record directly after the 7:th image record.

andym
11-26-2007, 04:24 AM
:oops2:

igorsk
11-26-2007, 06:06 AM
Do you know how the "library" image is specified? I noticed that when I had 7 images in the document then the library image was the record directly after the 7:th image record.
Nope, didn't see how this one is stored.
I did notice that one of my books had a cover image that does not actually appear in the .mobi file... so it seems it's downloaded from the server and is stored separately in the Covers folder.

jbenny
11-26-2007, 06:49 AM
Something that may be of help to you is the source to a program called "pdbshred". It can extract the HTML and image files from a Mobipocket and Peanut ebook. You can find the program and source (in C) to the program by Googling for "pdbshred source". I would post a direct link, but because of some additional functionality in the program, some here may not like a direct link.

A similar program is called "makedoc", but it doesn't extract the images.

cstross
11-26-2007, 08:45 AM
Looks extremely cute ...

Are you planning on packaging this and sticking it on CPAN when it's stable?

schmidt349
11-26-2007, 09:48 AM
Well done Tompe! Looks like you beat me to the punch :knife:

Would you mind if I bolted an XSL backend onto your code, effectively making it "xml2html2mobi?" It's a mouthful, but would be quite useful.

God I love Perl.

tompe
11-26-2007, 10:25 AM
Looks extremely cute ...

Are you planning on packaging this and sticking it on CPAN when it's stable?

Might be a good idea. I have never done it before but why not. If I can find out how to put a script on CPAN.

I will release the html2mobi script here in a day or two so I can get some feedback. I have to fix one serious bug and write some documentation.

tompe
11-26-2007, 10:32 AM
Would you mind if I bolted an XSL backend onto your code, effectively making it "xml2html2mobi?" It's a mouthful, but would be quite useful.

God I love Perl.

Yes, it is a pleasure to program Perl :-)

I should probably write some packages to make it easier to do an xml2html2mobi. I wanted to have just one file to make it easier to use but maybe I should just split it up and submit it to CPAN. That can be the next step after it works and is tested more.

I used XML::Parser::Lite::Tree to parse the opf file but I am not sure this was a good idea. Do you know of any better library for opf files or for XML? I really liked HTML::Element and HTML::TreeBuilder so something similar for XML would be nice. Or a specific opf file library.

tompe
11-26-2007, 02:42 PM
I have a problem. My converter generates a mobi file that is not entirely correct. It works perfectly in FBReader. On my Gen3 it works but the number of pages is 650 when it should be around 25. There are a lot of empty pages at the end. My Palm T5 refuses to load the file and says corrupt database 0x0209 (2).

What I wonder is whether this is a problem with the Palmdoc things or a problem with the html that I packed in the Palmdoc format.

I may have forgotten to set some parameter in the Palm::PDB package, but I tested loading a working mobi file and then replacing the text, and it did not work.

Ideas?

tompe
11-26-2007, 05:01 PM
I realised that I had not written any Mobipocket header in record 0 at all and I was fooled by it working so well with FBReader. Is there any specification of the data that should be in record 0 anywhere? I have googled for it but cannot find it.

igorsk
11-26-2007, 06:07 PM
No spec. A few fields are documented in pdbshred but they're probably not what you need. I'm working on a more or less complete doc but here's what you should be able to get away with:

0   DWord dwSignature      //'MOBI'
4   DWord dwSize           //including first two fields (put 0x18 here)
8   DWord dwType           //pub type: 2=book,3=palmdoc,4=audio,news=257,feed=258,magazine=259 etc
C   DWord dwCodepage       //1252=western, 65001 = UTF8. Better not use anything else
10  DWord dwUniqueId       //? filled from rand() calls
14  DWord dwFileFormatVer  //seems to correspond to Mobipocket reader ver. put 3 here


This is in addition to the palmdoc header, naturally.
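
For anyone who wants to inspect these six fields in an existing file, a minimal Python sketch follows (for illustration only; the thread's scripts are Perl). The class and function names are hypothetical, and big-endian byte order is an assumption; the field order and the 0x10 start offset are taken from the posts above.

```python
import struct
from typing import NamedTuple


class MobiHeaderStart(NamedTuple):
    """First six fields of the MOBI header, in the order listed above."""
    signature: bytes       # dwSignature, expected b"MOBI"
    size: int              # dwSize
    pub_type: int          # dwType
    codepage: int          # dwCodepage
    unique_id: int         # dwUniqueId
    format_version: int    # dwFileFormatVer


def parse_mobi_header_start(record0: bytes, start: int = 0x10) -> MobiHeaderStart:
    # ">4s5I": a 4-byte signature followed by five big-endian dwords (assumed).
    return MobiHeaderStart(*struct.unpack_from(">4s5I", record0, start))
```

Printing `parse_mobi_header_start(record0)` for a file produced by mobigen is a quick way to see which size and version values real files carry.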

tompe
11-26-2007, 08:09 PM
No spec. A few fields are documented in pdbshred but they're probably not what you need. I'm working on a more or less complete doc but here's what you should be able to get away with:



This is in addition to the palmdoc header, naturally.

Thanks. I have now managed to write record 0 so now I can add the MOBI header also.

When I unpacked a mobi file I saw three records after the last image, with sizes 36, 52 and 4. What are these? One contained the string FLIS and one the string FCIS. Maybe the end of the document is not detected because I have not written these records.

tompe
11-26-2007, 08:14 PM
How long must the MOBI header be?

At position 0xF4 I see the string EXTH, and after that follow some strings that indicate that the author and title are stored there. Does this belong to the header?

igorsk
11-26-2007, 09:24 PM
I believe FCIS and FLIS have something to do with dictionary indices. Do you set the unpacked size and number of records in Palmdoc header correctly?

tompe
11-26-2007, 09:36 PM
I believe FCIS and FLIS have something to do with dictionary indices. Do you set the unpacked size and number of records in Palmdoc header correctly?

The last record that was 4 byte contains E9 8E 0D 0A. I wonder if this is important...

The number of records is correct, because I tried to include the image records in that number but then FBReader started to display garbage after the end of the text. I will double check the unpacked size. I have not set the pointer to the first image either.

Now I have the strange phenomenon that the images in FBReader are correct but on my Gen3 they seem to be shifted. The "library" image seems to work. I just put it in the last record and it was displayed correctly on the Gen3. The change I made was to set the record "id" to an increasing number for the text content instead of using 0.

Well, it moves forward. Hopefully I will fix the problem with the size and the image order soon so I have a first alpha version of the scripts.

igorsk
11-26-2007, 09:53 PM
The "number of records" in palmdoc header (Word at 0x8) needs to be set to the number of records containing only text (no pictures). E.g. if you have compressed text in records 1,2 and 3, then set it to 3. The uncompressed size (dword at 4) has to be the full uncompressed size of all text.
By the way, I was wrong. Mobi format 3 needs MOBI header to be 0x74 bytes long, not 0x18. The fields are mostly irrelevant except for the number of the first record with images I mentioned above (at 0x5C).
There are also DATP records that contain a mapping from uncompressed offsets to record numbers, but I haven't figured out their format yet and am not sure if they're mandatory...
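
A minimal Python sketch of a PalmDOC header with those two fields filled in (for illustration only; the thread's scripts are Perl). Only the uncompressed size at offset 4 and the text record count at offset 8 come from the post. The compression word at offset 0, the record size at offset 0xA, the zeroed remainder, and the 16-byte total are assumptions based on the usual PalmDOC layout and on the MOBI header starting at 0x10.

```python
import struct


def palmdoc_header(text_length: int, text_records: int,
                   compression: int = 2, record_size: int = 4096) -> bytes:
    """Build a 16-byte PalmDOC header (hypothetical helper).

    text_length  -- full uncompressed size of all text (dword at offset 4)
    text_records -- number of records holding text only (word at offset 8)
    """
    return struct.pack(
        ">HHIHHI",
        compression,    # offset 0x0: assumed, 1 = none, 2 = PalmDOC compression
        0,              # offset 0x2: assumed unused
        text_length,    # offset 0x4: from the post
        text_records,   # offset 0x8: from the post
        record_size,    # offset 0xA: assumed maximum size of a text record
        0,              # offset 0xC: assumed unused, pads the header to 0x10
    )
```

For the example in the post, compressed text in records 1, 2 and 3 would give `palmdoc_header(total_text_bytes, 3)`, with image records left out of the count.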

tompe
11-26-2007, 10:23 PM
Got it nearly to work on my Gen3 when I extended the MOBI header. The only problem now is that the title says "libc-2.3.6" and the header information is wrong...

Strangely enough the library image works without me including it. Maybe it takes the first record with an image and uses this.

# 4 DWord dwSize //including first two fields (put 0x18 here)

If I put 0x18 here it does not work. If I put 0xE4 here as in my example document then it works but the title did not work. So what does this number mean?

tompe
11-26-2007, 10:57 PM
# 4 DWord dwSize //including first two fields (put 0x18 here)

If I put 0x18 here it does not work. If I put 0xE4 here as in my example document then it works but the title did not work. So what does this number mean?

I just realized what this field is. It is a pointer to the block that starts with EXTH. The first number after that seems to be the size of this block. But I have not managed to see how it is coded.

Maybe I should try to set this pointer to 0 and see if that means that this block does not exist.

igorsk
11-26-2007, 11:35 PM
dwSize is not a pointer. It's the size of MOBI header (EXTH immediately follows it so it looks like a pointer). For version 3 set it to 0x74. EXTH contains only metadata (and DRM ids) so it shouldn't matter if it's missing.
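
To make the "size, not pointer" point concrete, here is a minimal Python sketch (for illustration only; the thread's scripts are Perl). It builds a version 3 MOBI header of 0x74 bytes and computes where an EXTH block would start. The function names are hypothetical, big-endian byte order is assumed, and leaving every other field zeroed is an untested assumption.

```python
import struct

MOBI_START = 0x10        # MOBI header offset inside record 0
V3_HEADER_SIZE = 0x74    # header size for format version 3 (from the thread)


def mobi_header_v3(first_image_record: int, unique_id: int,
                   pub_type: int = 2, codepage: int = 1252) -> bytes:
    """Build a zero-filled v3 MOBI header with the known fields set."""
    header = bytearray(V3_HEADER_SIZE)
    struct.pack_into(">4s5I", header, 0,
                     b"MOBI", V3_HEADER_SIZE, pub_type, codepage, unique_id, 3)
    # Number of the first image record, dword at MOBI+0x5C.
    struct.pack_into(">I", header, 0x5C, first_image_record)
    return bytes(header)


def exth_offset(record0: bytes) -> int:
    """Offset in record 0 where EXTH would start: 0x10 + dwSize."""
    (size,) = struct.unpack_from(">I", record0, MOBI_START + 4)
    return MOBI_START + size
```

With dwSize set to 0x74 the EXTH block, if any, starts at 0x84; the 0xE4 seen in the example document puts it at 0xF4, which matches where the EXTH string was spotted earlier in the thread.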

tompe
11-27-2007, 07:17 AM
dwSize is not a pointer. It's the size of MOBI header (EXTH immediately follows it so it looks like a pointer). For version 3 set it to 0x74. EXTH contains only metadata (and DRM ids) so it shouldn't matter if it's missing.

But it seems to matter or something else is wrong. On Gen3 I get a corrupted title and this corruption changes depending on how long the MOBI header is. So I assume it tries to read uninitialized memory. Or how do I specify that the metadata does not exist?

What is the difference between version 3 and 4 and why is it preferred to use version 3?

igorsk
11-27-2007, 07:57 AM
Version 4 adds support for DRMv2 (PID-based) and extends the MOBI header to 0xD0 bytes. It's not that v3 is "preferred" but it should be enough for your purposes. Maybe Cybook's parser assumes header v4 and doesn't check the actual size. Since you say you get a corrupted title, my guess is that it expects the extended title record to be present (in addition to the title at the beginning of the prc file). This extra info is specified by these fields:

44 Dword dwTitleOffset //from the start of record 0, codepage is dwCodepage
48 Dword dwTitleLength

Here's the format of the EXTH header, which immediately follows the MOBI header (i.e. at 0x10+MOBI.dwSize in rec0). I believe you also need to set flag 0x40 in the dword at MOBI+0x70 if EXTH is present.

0 dd dwSignature //'EXTH'
4 dd dwSize //including everything
8 dd dwCount //count of extra data items
extra item:
0 dd id
4 dd size
8 <size-8> data

ids:
1: drm_server_id
2: drm_commerce_id
3: drm_ebookbase_book_id
100: Author
101: Publisher
102: Imprint
104: ISBN
105: Subject
106: PublishingDate
107: Review
108: Contributor
109: Rights
110: SubjectCode
111: Type
112: Source
113: ASIN
114: VersionNumber
115: Sample
116: StartReading
203: hasFakeCover
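
A minimal Python sketch of writing and reading an EXTH block in the layout described above (for illustration only; the thread's scripts are Perl). The function names are hypothetical and big-endian byte order is assumed. Padding the block to a multiple of four bytes, and counting that padding in dwSize, is a guess: the Alice dump later in the thread reports a length of 152 while the header plus the seven items add up to 149.

```python
import struct


def build_exth(items):
    """items: list of (id, data bytes) pairs, e.g. (100, b"Carroll, Lewis")."""
    body = b"".join(
        # Each item: id, size (including these 8 bytes), then the data.
        struct.pack(">II", item_id, len(data) + 8) + data
        for item_id, data in items
    )
    size = 12 + len(body)
    padding = (-size) % 4          # assumed: pad to a 4-byte boundary
    header = struct.pack(">4sII", b"EXTH", size + padding, len(items))
    return header + body + b"\0" * padding


def parse_exth(block: bytes):
    """Return the (id, data) pairs stored in an EXTH block."""
    signature, _size, count = struct.unpack_from(">4sII", block, 0)
    if signature != b"EXTH":
        raise ValueError("not an EXTH block")
    items, pos = [], 12
    for _ in range(count):
        item_id, item_size = struct.unpack_from(">II", block, pos)
        items.append((item_id, block[pos + 8:pos + item_size]))
        pos += item_size
    return items
```

An author record is then `build_exth([(100, b"Carroll, Lewis")])`, whose single item has size 22, the same value that shows up for id 100 in the Alice file.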

mrkai
11-27-2007, 08:17 AM
I believe you also need to set flag 0x40 in dword at MOBI+0x70 if EXTH is present.


...and in the Version 4 mobi files generated by the latest Creator, this seems to be set to 0x50.

-K

igorsk
11-27-2007, 08:34 AM
Well, 0x50 does include 0x40 :)
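
In other words the field is a bit mask, so a reader should test the 0x40 bit rather than compare the whole value. A tiny Python sketch for illustration (the constant and function names are hypothetical):

```python
EXTH_PRESENT = 0x40   # flag bit in the dword at MOBI+0x70 (from the thread)


def has_exth(flags: int) -> bool:
    """True if the EXTH bit is set, whatever other bits are also set."""
    return bool(flags & EXTH_PRESENT)
```

Both 0x40 and 0x50 pass this check; 0x50 simply has bit 0x10 set as well, whose meaning is not established in this thread.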

tompe
11-27-2007, 08:49 AM
Thanks a lot. I will add the extended header tonight and see if it works.

In the Alice mobi file dwTitleOffset is 0xFFFFFFFF so I assume that means that this title is not there. Where is this title usually placed? Is it in the MOBI header or after it?

But using EXTH information seems more flexible so I might just go with that.

Actually I had set the MOBI+0x70 flag to 0x50 when I tested, so that might explain why the Gen3 reader misbehaved when no extended header was available.

igorsk
11-27-2007, 09:38 AM
Extended title is somewhere after the EXTH header (it's not present in EXTH records btw). If not present, the one from the start of PRC is used (I think it's assumed to be in 1252 codepage).
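
A minimal Python sketch of that lookup (for illustration only; the thread's scripts are Perl). The function name is hypothetical and big-endian byte order is assumed. Treating 0xFFFFFFFF as "no extended title" follows tompe's guess from the Alice file, and the fallback name is assumed to be the NUL-padded database name at the start of the PRC file.

```python
import struct

MOBI_START = 0x10          # MOBI header offset inside record 0
NO_TITLE = 0xFFFFFFFF      # assumed marker for "no extended title"


def book_title(record0: bytes, pdb_name: bytes) -> str:
    """Return the extended title if present, else the PRC database name."""
    (codepage,) = struct.unpack_from(">I", record0, MOBI_START + 0x0C)
    # dwTitleOffset (MOBI+0x44) counts from the start of record 0.
    offset, length = struct.unpack_from(">II", record0, MOBI_START + 0x44)
    if offset == NO_TITLE or length == 0:
        # Fallback: name from the start of the PRC file, assumed cp1252.
        return pdb_name.split(b"\0", 1)[0].decode("cp1252", errors="replace")
    encoding = "utf-8" if codepage == 65001 else "cp1252"
    return record0[offset:offset + length].decode(encoding, errors="replace")
```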

tompe
11-27-2007, 11:47 AM
I am trying to parse the EXTH using the mobi file in the Alice in Wonderland example from Mobipocket's web pages. It does not seem to follow your description exactly. The title string is in the block but what I get is:


EXTH doctype: EXTH
EXTH length: 152
EXTH n_items: 7
ITEM: 64 16 - 100 22 - Carroll, Lewis
ITEM: 6e 11 - 110 17 - FIC004000
ITEM: 69 10 - 105 16 - Classics
ITEM: 12c 2e - 300 46 -
ITEM: c9 c - 201 12 -
ITEM: cb c - 203 12 -
ITEM: ca c - 202 12 -


Ideas?

igorsk
11-27-2007, 12:04 PM
IDs 3xx and 2xx are binary fields. I don't know their purpose (yet?). As I said, you shouldn't worry about EXTH now, it shouldn't be necessary for your purpose.

tompe
11-27-2007, 12:55 PM
IDs 3xx and 2xx are binary fields. I don't know their purpose (yet?). As I said, you shouldn't worry about EXTH now, it shouldn't be necessary for your purpose.

But I cannot get it to work on the Gen3 at all without using this. I tested now to set the EXTH flag to 0 and to not have any data at all after the MOBI header. My Gen3 said "Multipart books not supported" or something similar. The only way I have got it to load is by having an EXTH.

Maybe I should test just adding an author in EXTH. That might work...

mrkai
11-27-2007, 06:20 PM
Igorsk, do you have a mobi v3 sample file? It looks like something is a bit different in v4 with regards to the EXTH...

-K

tompe
11-27-2007, 06:22 PM
I misread a hex table. The title I saw was the extended title, and when I added an extended title I got a correct display of the title in the library. But when I open the file the Gen3 hangs or says that multipart books are not supported. So there probably is some remaining problem with the data in the header.

Well, some progress at least. Back to more testing...

tompe
11-27-2007, 08:20 PM
It works! It seems that the Gen3 did not like images of size 600x800.

So here is the first release. The state of the code is very alpha. Please let me know if it does not work or if you find bugs or if you have suggestions for enhancements. I have tested with the Alice in Wonderland example from Mobipocket's web site, loading it onto my Gen3 and my Palm T5 and reading it there. Here are the scripts:

http://www.ida.liu.se/~tompe/mobiperl/mobiperl-0.01.tar


perldoc -t html2mobi

NAME
    html2mobi - A script to convert html files or an opf file to mobi

SYNOPSIS
    html2mobi file.html

    html2mobi file1.html file2.html ...

    html2mobi file.opf

DESCRIPTION
    A script to convert html files or an opf file to a mobi format file.

OPTIONS
    --title TITLE
        Specify the title for the book. This overrides the value given in
        the opf file.

    --mobifile MOBIFILE
        Name of the output file. This overrides the default value.

    --htmlfile HTMLFILE
        Saves the html that is packed into mobi format. This html code
        contains Mobipocket specific things that are added automatically.
        This is mostly useful for debugging.

    --coverimage IMAGE
        The image to be used as cover in a library listing like the one in
        Cybook Gen3. The image will be rescaled to a suitable format
        (180x240). If no image is specified the first image in the source
        files is used.

    --gentoc
        For a collection of html files generate the table of contents
        automatically.

    --pda
        Scale images to work for pda's (must be used for Alice to work on my
        Palm T5).

    --scale f
        Scale all images that are smaller then a certain size with scale
        factor f.

EXAMPLES
    html2mobi Alice_In_Wonderland.opf

    html2mobi Alice_In_Wonderland.html

TODO
    - Specify margins with flags

    - Follow local links when given a root html file

BUGS
    - Image sizes is not handled correctly for Palm T5. Large images
      works on Gen3 but are problematic on T5.

    - Guide specified toc not generated

    - Not correct author in EXTH

    - Plus a lot more.... this is an alpha version

AUTHOR
    Tommy Persson (tpe@ida.liu.se)

JSWolf
11-28-2007, 06:08 PM
What image sizes are you allowing up to? Are you resizing if you have a LARGE image?

tompe
11-28-2007, 06:12 PM
What image sizes are you allowing up to? Are you resizing if you have a LARGE image?

Well, I will allow whatever the user wants to specify and try to find out what the limits are on different devices, if the limits differ. Since the Mobigen format is not documented anywhere it is hard to know, so the best you can do is test on specific devices.

The cover image should be 600x800. I resized it to 80% of that for now. I also have a scale argument so you can scale all your images if you want. I will add whatever feature is requested in connection with images.

JSWolf
11-28-2007, 06:19 PM
How about just keeping the images the same size they are?

wallcraft
11-28-2007, 06:22 PM
GIF images in MobiPocket very definitely have a 64 KB limit (actually slightly smaller than 64 KB) and I assumed that this was a PRC record size limit. If a GIF image is larger than ~64 KB, mobigen.exe reduces its size (in pixels) until it fits in 64 KB.

I don't know if larger JPEGs are allowed, or if MobiPocket just increases the lossyness(?) until it also fits in 64 KB.

See Images in MobiPocket (http://www.mobileread.com/forums/showthread.php?t=14641), which includes references to MobiPocket documentation and to other threads on this issue.

tompe
11-28-2007, 06:36 PM
How about just keeping the images the same size they are?

That is what I am doing now. But there is probably some issue with the format that causes 600x800 images to not work. So I had to shrink them a bit. This problem is on the todo list.

tompe
11-28-2007, 06:39 PM
There is a version 0.02 available at http://www.ida.liu.se/~tompe/mobiperl/

Table of contents and links work. The href had to be changed to a filepos to get them to work on the Gen3.
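
One way the href-to-filepos rewrite could be done is sketched below in Python (for illustration only; this is not the code in mobiperl, which is Perl). It assumes that filepos is the byte offset of the link target within the final HTML, written as ten zero-padded digits as in the Ring of Fire sample earlier in the thread, and that targets are written literally as `<a name="...">`. The function name is hypothetical. Using a fixed-width placeholder first keeps the offsets stable when the real numbers are filled in.

```python
import re

PLACEHOLDER = b"filepos=0000000000"   # same width as the final attribute


def hrefs_to_filepos(html: bytes) -> bytes:
    """Replace href="#name" with filepos=NNNNNNNNNN (byte offset of the target)."""
    names = []

    def to_placeholder(match):
        names.append(match.group(1))
        return PLACEHOLDER

    # Pass 1: swap every local href for a fixed-width placeholder.
    data = re.sub(rb'href="#([^"]+)"', to_placeholder, html)

    # Offsets are measured after pass 1, so they match the final document.
    offsets = {m.group(1): m.start()
               for m in re.finditer(rb'<a name="([^"]+)"', data)}

    # Pass 2: fill in the numbers; the length of the data does not change.
    pending = iter(names)

    def fill(_match):
        return b"filepos=%010d" % offsets[next(pending)]

    return re.sub(re.escape(PLACEHOLDER), fill, data)
```

A link to a missing anchor raises KeyError here; a real converter would probably want to warn and keep the original href instead.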

wallcraft
11-28-2007, 10:12 PM
I took a look at my (DRM-free) LIT files converted to MOBI via ConvertLIT and mobigen.exe -jpeg. Any JPEG that was originally larger than 64 KB was reduced to less than 64 KB in the MOBI file, but the image dimensions in pixels were maintained. So I think it is the case that MOBI only allows images < 64 KB.

DaleDe
11-29-2007, 02:19 AM
I took a look at my (DRM-free) LIT files converted to MOBI via ConvertLIT and mobigen.exe -jpeg. Any JPEG that was originally larger than 64 KB was reduced to less than 64 KB in the MOBI file, but the image dimensions in pixels was maintained. So I think it is the case that MOBI only allows images < 64 KB.

That is a limitation of the Palm database size, which is the fundamental Mobi format.

Dale

JSWolf
11-29-2007, 02:57 AM
I took a look at my (DRM-free) LIT files converted to MOBI via ConvertLIT and mobigen.exe -jpeg. Any JPEG that was originally larger than 64 KB was reduced to less than 64 KB in the MOBI file, but the image dimensions in pixels were maintained. So I think it is the case that MOBI only allows images < 64 KB.

So Mobigen recompresses the jpeg image until it hits a physical size of 64k or less? So even if that makes the image fuzzy, it won't matter as long as the 64k limit is maintained right? That is crazy.

wallcraft
11-29-2007, 10:44 AM
So Mobigen recompresses the jpeg image until it hits a physical size of 64k or less? So even if that makes the image fuzzy, it won't matter as long as the 64k limit is maintained right? That is crazy.

Yes, although many JPEGs are already smaller than 64 KB, and so far the examples I have seen don't look bad at 64 KB. For GIF images, MobiPocket instead makes the image itself smaller until it fits in 64 KB (which is a different loss in quality, arguably much worse). MobiPocket's primary legacy issue is the Palm format, but Plucker uses PDB and it allows larger images by splitting them across multiple records.

tompe
11-29-2007, 10:52 AM
I took a look at my (DRM-free) LIT files converted to MOBI via ConvertLIT and mobigen.exe -jpeg. Any JPEG that was originally larger than 64 KB was reduced to less than 64 KB in the MOBI file, but the image dimensions in pixels was maintained. So I think it is the case that MOBI only allows images < 64 KB.

I will look at this because it might be that 64 KB is the max if you want it to work on all devices. It might be that larger images work on the Gen3.

Do you have any good example lit file with a large image?

tompe
11-29-2007, 10:56 AM
I also have a lit file which had a 600x800 cover image, but this image was not in the manifest, so mobigen will fail to use the bigger image and you will get the small image. My script searches the unpacked files for a big cover image and uses that even if it is not in the manifest. You can also specify a file to use as cover image on the command line.

tompe
11-29-2007, 11:20 AM
I will look at this because it might be that 64 KB is max if you want it to work on all devices. It might be that larger images works on the Gen3.

I just checked. A jpg file with size 138462 works on the Gen3. But it does not work on my Palm T5. So if you generate your mobi files for a specific device you can have larger images. That might actually explain why the Kindle uses another file extension.

I will probably let my script use large files and print warnings that the file will not work on some devices and have a flag to generate mobi files with records with a max size of 64 KB.

HarryT
11-29-2007, 11:22 AM
So Mobigen recompresses the jpeg image until it hits a physical size of 64k or less? So even if that makes the image fuzzy, it won't matter as long as the 64k limit is maintained right? That is crazy.

No, it's a consequence of the fact that the image has to fit into 64k. It's not really a terribly onerous limit for the type of images you tend to get in eBooks - they are generally under 64k anyway.

wallcraft
11-29-2007, 11:34 AM
I just checked. A jpg file with size 138462 works on the Gen3. But it does not work on my Palm T5.

Was this the only image in the file? I don't know much about the PRC format, but I would expect record length violations to only work for the last record in the file.

tompe
11-29-2007, 12:47 PM
Was this the only image in the file? I don't know much about the PRC format, but I would expect record length violations to only work for the last record in the file.

No, it was the first record after the compressed text that was too large. Then there were 6 records with smaller images after that. When I try to load this on my Palm it says "corrupt database".

tompe
11-29-2007, 07:57 PM
Version 0.03 is now available.

I have fixed it so that all records are smaller than 64K. I shrink the image data by choosing a suitable jpeg quality. If I find a real example where the image quality becomes too bad I will try to find a solution. So if anybody has a real example of a lit file where mobigen -jpeg fails to produce something that you can read I would really like to have a pointer to it.
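
One way to choose "a suitable jpeg quality" is to step the quality down until the encoded image fits. The sketch below is in Python and is not the Mobiperl implementation. `pick_quality` and its `encode` parameter are hypothetical names, and the 64 * 1024 limit and the step of 5 are assumptions.

```python
def pick_quality(encode, limit=64 * 1024, qualities=range(95, 4, -5)):
    """Return (quality, data) for the first quality whose output fits in limit.

    encode is a hypothetical callable: quality (int) -> encoded image bytes.
    Qualities are tried from high to low, so the first fit is the best one.
    Returns None if even the lowest quality tried is still too large.
    """
    for quality in qualities:
        data = encode(quality)
        if len(data) <= limit:
            return quality, data
    return None
```

The encoder is passed in so the search itself needs no image library. With Pillow, for example, `encode` could save the image to an in-memory buffer with `Image.save(buffer, "JPEG", quality=q)` and return the buffer contents. If the function returns None, scaling the image down before encoding is the obvious fallback.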

The guide table of contents now works, so in FBReader I get the table of contents button.

tompe
11-29-2007, 08:11 PM
I still have a bug that causes the Gen3 to crash if I use images larger than a certain size (e.g. 600x800). If I reduce the size so that the width is 480 it works. So does anybody have any idea about something in the MOBI header that could cause this?

igorsk
11-29-2007, 08:16 PM
Do they work in mobigen generated files? It could be just a limitation of Cybook's viewer.

tompe
11-29-2007, 09:03 PM
Do they work in mobigen generated files? It could be just a limitation of Cybook's viewer.

Yes. The Alice mobi-file contains a gif (the cover) that is 600x800 and it works on the Cybook. Hmm, I use version 3 in my generated file and the working file uses version 4. Maybe I should test version 4 also.

tompe
11-29-2007, 09:26 PM
With version 4 I got Multiformat book not supported on the Cybook. But version 4 worked on my T5. And on my T5 my generated file with the 600x800 image works.

A bit hard to debug when the error only is on the Cybook.

I attach the file that crashes my Cybook but works on my T5. It would be nice to know if the file works or not on other readers (you cannot upload .mobi files...)

JSWolf
11-29-2007, 11:29 PM
With version 4 I got Multiformat book not supported on the Cybook. But version 4 worked on my T5. And on my T5 my generated file with the 600x800 image works.

A bit hard to debug when the error only is on the Cybook.

I attach the file that crashes my Cybook but works on my T5. It would be nice to know if the file works or not on other readers (you cannot upload .mobi files...)

The latest Mobipocket Desktop for Windows (some version 6.1) works fine with this. The iLiad will read it as well. I don't know about FBReader, but I suspect so.

wallcraft
11-29-2007, 11:40 PM
I attach the file that crashes my Cybook but works on my T5. It would be nice to know if the file works or not on other readers

This works using Windows MobiPocket Reader and the Palm MobiPocket Reader on my Nokia 770. It also works on both devices using FBReader, although FBReader does not currently have an image scaling or navigating option so the cover image is too big for the 770's screen (the internal images are small enough to fit on the screen).

wallcraft
11-29-2007, 11:45 PM
It also works on the Kindle. I have a serious USB cable management issue with all these Readers.

DaleDe
11-30-2007, 02:01 AM
With version 4 I got Multiformat book not supported on the Cybook. But version 4 worked on my T5. And on my T5 my generated file with the 600x800 image works.

A bit hard to debug when the error only is on the Cybook.

I attach the file that crashes my Cybook but works on my T5. It would be nice to know if the file works or not on other readers (you cannot upload .mobi files...)

I tried it on a PC so far (I will also check a Pocket PC). The images are ok although the inside ones are a bit small. The TOC should come just after the cover page. The chapters should start at the top of the page.

Dale

tompe
11-30-2007, 07:12 AM
It also works on the Kindle. I have a serious USB cable management issue with all these Readers.

I also have a serious USB cable management issue. Bluetooth would be nice...

Does the Kindle have a "library image"? It is a bit irritating that it works better on the Kindle than on the Cybook :-)

In the working file from Mobipocket there is a small image in the last image record that probably is this library image. But on the Cybook the library image works even if I remove this last image record. So I suspect that it uses the first image as the library image and that the bug might be that the library image cannot be 600x800. So the question is how you point out which record to use as a library image. I looked for a pointer in the MOBI header but did not find anything obvious.

tompe
11-30-2007, 07:41 AM
I tried it on a PC so far (I will also check a Pocket PC). The images are ok although the inside ones are a bit small. The TOC should come just after the cover page. The chapters should start at the top of the page.

That the chapters do not start at the top of the page is a real bug (I am losing the Mobipocket-specific tags). There is a flag to put the TOC after the cover page also. According to the opf file you should only put it last. The inside images looked better on my T5 with regard to the size.

tompe
12-01-2007, 10:41 PM
There is a version 0.04 available at http://www.ida.liu.se/~tompe/mobiperl/

I fixed the bug with the missing pagebreak. I will continue to work on these tools when I have time and especially when I notice some problems when I use the conversion tools or when somebody else reports a problem.

Or when new information about the MOBI header becomes available...

ppxnouse
12-30-2007, 08:29 PM
Hello tompe,

I am currently trying to convert some HTML files that contain lots of sample code.
The pages come from various chm files and most of them use the <PRE> tag for the sample code sections.
According to this page: http://www.mobipocket.com/dev/article.asp?BaseFolder=prcgen&File=TagRef_OEB.htm
the PRE tag is not supported by Mobipocket Reader, and so the result is pretty unreadable (all the formatting of the source code sections is removed and you end up with a single line of code).

Since you already do some HTML parsing/replacing, I would love to see html2mobi reformat the <pre> sections (Adding html line breaks, tabs and setting a different font).

BTW: What about adding a CHM decompiler like chmdump, so I can be more lazy ?! ;-)

wallcraft
12-30-2007, 08:45 PM
See PRE tag problem, need support (http://www.mobipocket.com/forum/viewtopic.php?t=1916) on the MobiPocket Forum, which recommends replacing each space by &nbsp; among other things. I don't think MobiPocket can specify a generic fixed font though. You would also use <p align="left">

tompe
12-30-2007, 09:29 PM
Since you already do some HTML parsing/replacing, I would love to see html2mobi reformat the <pre> sections (Adding html line breaks, tabs and setting a different font).

BTW: What about adding a CHM decompiler like chmdump, so I can be more lazy ?! ;-)

The <pre> support should be possible to do easily. Do you know where I can find some suitable example chm files?

ppxnouse
12-30-2007, 10:36 PM
MSDN Magazine can be downloaded for free as compiled HTML (pretty hardcore .chm files though):

http://msdn.microsoft.com/msdnmag/chm/

If you are just looking for examples of code in <pre> tags, the O'Reilly open books contain them. For example: http://www.masonbook.com/book/

Several Wrox and Addison Wesley titles came with a .chm version on the book CD. (Maybe you have one.)

I would like to send you a chm book that even has ASCII tables in <pre> tags.

tompe
12-30-2007, 10:36 PM
I fixed the <pre> by replacing space with &nbsp; and adding <br /> at the end of each line. It kind of works but if a line is longer than the maximum length the code is not so readable. And for the Cybook it seems to be 60-70 characters.
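
The replacement described above can be sketched as follows. This is an illustrative Python version, not the Mobiperl code. The name `replace_pre_blocks` is hypothetical, expanding tabs to 4 columns is an assumption, and the sketch assumes the <pre> body contains plain text only (markup with attributes inside <pre> would have its spaces replaced too).

```python
import re

# Matches one <pre>...</pre> block, case-insensitively, across line breaks.
PRE_BLOCK = re.compile(r"<pre\b[^>]*>(.*?)</pre>", re.IGNORECASE | re.DOTALL)


def _convert_pre_body(match):
    """Turn the text of one <pre> block into &nbsp;/<br /> markup."""
    # Assumption: tabs are expanded to 4-column stops before conversion.
    body = match.group(1).expandtabs(4).strip("\r\n")
    return "".join(
        line.replace(" ", "&nbsp;") + "<br />\n" for line in body.splitlines()
    )


def replace_pre_blocks(html):
    """Replace every <pre> block so that spacing and line breaks survive."""
    return PRE_BLOCK.sub(_convert_pre_body, html)
```

The <pre> tags themselves are dropped, since the reader ignores them anyway. Lines longer than the screen width will still wrap, which is the readability limit mentioned above.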

ppxnouse
12-30-2007, 10:43 PM
Great, thank you.

but if a line is longer than the maximum length the code is not so readable

Sure, but that is a pretty minor issue compared to the old behaviour.

tompe
12-30-2007, 10:47 PM
If you are using Mobiperl with your own Perl installation there is a file mobiperl-0.0.20.tar with this change (works for one HTML file now). I will not do a new Windows binary distribution until I fix some more things.

jeczmien
02-04-2008, 03:59 PM
Hi.
I've tried mobi2html (latest 0.0.26, on Debian sid) with free books from the Polish site www.ebook.pl. I always get an empty HTML file (with only the main markup, html and body) and one gif file (I've checked, it is the cover).

Every book can be opened in Mobipocket Reader without any trouble.

As an example you can try:

http://ebook.pl/ebooki/Classic_000248.prc

Can someone tell me what I'm doing wrong?

tompe
02-04-2008, 05:47 PM
Hi.
I've tried mobi2html (latest 0.0.26, on Debian sid) with free books from the Polish site www.ebook.pl. I always get an empty HTML file (with only the main markup, html and body) and one gif file (I've checked, it is the cover).

Every book can be opened in Mobipocket Reader without any trouble.

As an example you can try:

http://ebook.pl/ebooki/Classic_000248.prc

Can someone tell me what I'm doing wrong?

Is this a DRM:ed book? FBReader could not read it properly either. So either it is DRM:ed or it is compressed with the highest compression which is not documented anywhere.
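
The two cases can usually be told apart from the first record of the file. The sketch below is in Python and relies on the community-documented PalmDB/MOBI layout, which is an assumption here and not something stated in this thread: the record count at byte 76, the offset of record 0 at byte 78, and within record 0 a 16-bit compression type at byte 0 and a 16-bit encryption type at byte 12. The function name and the value tables are hypothetical and may be incomplete.

```python
import struct

# Values as given in community documentation of the format (assumed, not official).
COMPRESSION_NAMES = {1: "none", 2: "PalmDOC", 17480: "HUFF/CDIC"}
ENCRYPTION_NAMES = {0: "none", 1: "old Mobipocket", 2: "Mobipocket"}


def inspect_record0(data):
    """Report compression and encryption type of a PalmDB-based book file."""
    (num_records,) = struct.unpack_from(">H", data, 76)
    if num_records == 0:
        raise ValueError("file contains no records")
    # The first entry of the record list starts at byte 78; its first
    # four bytes are the offset of record 0 from the start of the file.
    (record0_offset,) = struct.unpack_from(">L", data, 78)
    (compression,) = struct.unpack_from(">H", data, record0_offset)
    (encryption,) = struct.unpack_from(">H", data, record0_offset + 12)
    return {
        "compression": COMPRESSION_NAMES.get(compression, f"unknown ({compression})"),
        "encryption": ENCRYPTION_NAMES.get(encryption, f"unknown ({encryption})"),
    }
```

Called on the bytes of a .prc or .mobi file, a result of HUFF/CDIC compression with no encryption would point to the undocumented compression, not DRM.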

JSWolf
02-04-2008, 05:52 PM
I was able to convert it to HTML. But, I don't know if I am allowed to post it here. Is it public domain (life + 50) at least?

tompe
02-04-2008, 05:58 PM
I was able to convert it to HTML. But, I don't know if I am allowed to post it here. Is it public domain (life + 50) at least?

So what was the problem? Was it drm:ed?

JSWolf
02-04-2008, 06:22 PM
The file is not DRMed. But, I wasn't able to use mobi2html. I had to use something else to convert it to html and it did work.

tompe
02-04-2008, 07:08 PM
The file is not DRMed. But, I wasn't able to use mobi2html. I had to use something else to convert it to html and it did work.

So what was that something else then? Your comments are not so helpful if the goal is to understand what the problem is.

I assume the book was compressed with the secret compression format then.

wallcraft
02-04-2008, 07:48 PM
I assume the book was compressed with the secret compression format then.

I agree. I tried a couple of files from this site, and they both came up in FBReader like other MobiPocket-specific compressed files (i.e. as garbage).

jeczmien
02-05-2008, 09:25 AM
I was able to convert it to HTML. But, I don't know if I am allowed to post it here. Is it public domain (life + 50) at least?

IT IS public domain - you can post it without any problem.

jeczmien
02-05-2008, 09:30 AM
The file is not DRMed. But, I wasn't able to use mobi2html. I had to use something else to convert it to html and it did work.

So, does that "else" have a name?
Please let us know :)

mateo
02-07-2008, 11:52 PM
I attempted to use this. The --title option doesn't seem to do anything. Not a big problem since FBReader can change the title and author. I used 2 simple html files and tried the TOC. Doesn't work. It tries to create a table of contents using a couple of bullet points but nothing is written beside them.

tompe
02-08-2008, 05:22 AM
I attempted to use this. The --title option doesn't seem to do anything. Not a big problem since FBReader can change the title and author. I used 2 simple html files and tried the TOC. Doesn't work. It tries to create a table of contents using a couple of bullet points but nothing is written beside them.

I will check this at the weekend. Which version did you use? Did you use the latest mobiperl version, which is 0.0.26?

darkninja
02-12-2008, 01:33 PM
I wrote a decompressor for the new huffdic compressed files. Maybe this code can be incorporated into mobiperl?

Note. This program does not break any DRM encryption, so it's not illegal. It just decompresses files compressed with the new compression into a raw html file.

Thanks to Igor Skochinsky for the valuable assistance.

http://pastebin.com/m656dfbda Version 0.02