Is it feasible to build dictionary with opf2mobi/html2mobi?

EbokJunkie · 01-11-2014, 01:36 PM

I'm trying to build a mobi dictionary using mobiperl (with a view to building big dictionaries that crash mobigen).
Unfortunately, Kindle doesn't recognize my dictionaries as dictionaries; no index is available. Mobigen builds my test dictionary (just two entries) without problems, so I used MobiMetaEditor to copy all relevant EXTH records to the mobiperl-created mobi, but to no avail.
Tried to explode two dictionaries with KindleUnpack and to compare resulting html files but didn't notice considerable differences.
Please take a look at attached files and advise if building dictionaries with mobiperl makes sense.

Doitsu · 01-11-2014, 02:51 PM

Quote:

Originally Posted by EbokJunkie

I'm trying to build a mobi dictionary using mobiperl (with a view to building big dictionaries that crash mobigen).

AFAIK, mobiperl is a set of tools not specifically designed to create a dictionary.

Quote:

Originally Posted by EbokJunkie

Unfortunately, Kindle doesn't recognize my dictionaries as dictionaries, no index is available.

In that case the source files are most likely incorrectly formatted. You may want to compile the dictionary with the latest version of KindleGen at a command prompt and compare the output that KindleGen generates for the two-entry dictionary that works with the other one that doesn't work.

Quote:

Originally Posted by EbokJunkie

Mobigen builds my test dictionary (just two entries) without problems, so I used MobiMetaEditor to copy all relevant EXTH records to the mobiperl-created mobi, but to no avail.

Just changing the metadata is not enough to convert a regular book into a dictionary. You'll need to modify the .opf file and the .html files and recompile the book.
Did you compare the .opf file of the dictionary that doesn't work with the one that does to ensure that it has all the entries required for dictionaries?

Quote:

Originally Posted by EbokJunkie

Tried to explode two dictionaries with KindleUnpack and to compare resulting html files but didn't notice considerable differences.

KindleUnpack only has rudimentary support for dictionaries; for example, it cannot reverse-engineer inflections.

EbokJunkie · 01-11-2014, 03:03 PM

>Did you compare the .opf file of the dictionary that doesn't work with the one that does
>to ensure that it has all the entries required for dictionaries?
I'm using the same opf for both mobigen and opf2mobi processing.

Doitsu · 01-11-2014, 03:10 PM

Quote:

Originally Posted by EbokJunkie

>Did you compare the .opf file of the dictionary that doesn't work with the one that does
>to ensure that it has all the entries required for dictionaries?
I'm using the same opf for both mobigen and opf2mobi processing.

Then there's obviously something wrong with the entry definitions. Why don't you post two entries from the file that doesn't work?

EbokJunkie · 01-11-2014, 03:20 PM

Sorry, I don't understand. I posted source html file with all entries.

pdurrant · 01-11-2014, 03:51 PM

Quote:

Originally Posted by EbokJunkie

Sorry, I don't understand. I posted source html file with all entries.

Mobipocket/Kindle dictionaries, to work properly as machine-searchable dictionaries, have to be created from specially formatted source.

See here, here, and here.

EbokJunkie · 01-11-2014, 04:17 PM

pdurrant
Thank you but I know this.
I create dictionaries by hand in an open text format called Lingvo DSL and convert them to html/opf using the open source Ruby script dsl2mobi. This script always creates valid html and opf files suitable for building machine-searchable dictionaries.
My upload is an html file created by dsl2mobi from a short two-entry dsl file.
It works perfectly well with mobigen and kindlegen 2.9.

pdurrant · 01-11-2014, 05:00 PM

Quote:

Originally Posted by EbokJunkie

pdurrant
Thank you but I know this.

Oh, I see. Sorry — I read your original post too quickly.

I suspect that the only way to get to the bottom of this will be to examine the output very carefully indeed. KindleUnpack might not output sufficient info - you may need to delve into it with a Hex Editor.

I'll take a quick look at what you uploaded.

pdurrant · 01-11-2014, 05:06 PM

OK, clearly mobiperl isn't going to do what you want. And you won't spot differences in the HTML as that's not really the important bit. If you use MobiUnpack with the appropriate flags to dump everything, you'll see that in your example from mobiperl you just have the Mobipocket header, html, ncx and opf sections in your generated file.

In the one from Kindlegen you also have four INDX sections, a FCIS section, a FLIS section, and three tiny unknown sections.
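For a rough version of the same comparison without an unpacker, the sections can be listed directly from the container. This is a minimal sketch that assumes the standard Palm database layout that .mobi/.prc files use (a 78-byte header with the record count at offset 76, followed by 8-byte record-info entries); the function name `list_pdb_sections` is made up for illustration. It only prints the first four bytes of each record, which is enough to spot records that start with `INDX`, `FLIS` or `FCIS`.

```python
import struct

def list_pdb_sections(data):
    """Return (index, size, first 4 bytes) for each record in a Palm database image."""
    # The record count is a big-endian 16-bit integer at offset 76 of the header.
    (count,) = struct.unpack_from(">H", data, 76)
    # Each 8-byte record-info entry starts with a big-endian 32-bit record offset.
    offsets = [struct.unpack_from(">L", data, 78 + 8 * i)[0] for i in range(count)]
    offsets.append(len(data))  # sentinel: the last record runs to the end of the file
    return [(i, offsets[i + 1] - offsets[i], data[offsets[i]:offsets[i] + 4])
            for i in range(count)]
```

Running it over both files (read with `open(path, "rb").read()`) and comparing the lists should show the extra index records in the Kindlegen build. Text records are usually compressed, so their first four bytes will not be meaningful.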

If you want to build a Kindle dictionary without Kindlegen, you're going to need to reverse engineer the compiled dictionary format, and then create something that can build it. No easy task!

EbokJunkie · 01-11-2014, 05:36 PM

Thanks, I suspected there is something similar and intimidating.
Unfortunately, kindlegen crashes on 100MB+ html files, Mobipocket eBook Creator dies even on smaller files, and mobigen chokes after 300MB+.
Evidently, Amazon broke something dictionary related in kindlegen (at least related to the source size), and mobigen is the best (although inadequate) tool for this task.

Doitsu · 01-11-2014, 07:00 PM

Since Mobipocket Creator expects unicode files with byte-order-marks (BOM) and your .html source file doesn't have one, it couldn't hurt to add a BOM to your .html source file, in particular, if it contains non-Latin characters. (KindleGen doesn't require a BOM, but handles utf8 files with a BOM fine.)

Resave your .html source file as a utf8 file with a BOM, empty your Temp folder and execute KindleGen using the following command line:

Code:

KindleGen your.opf > error.log

If it crashes again, have a look at error.log, which should help you identify the line that causes problems. If it doesn't, post the log file here.
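If your editor struggles with a file this large, the BOM can also be added with a few lines of script. A minimal sketch, assuming the file is already valid UTF-8; the function name `add_utf8_bom` is made up for illustration, and it reads the whole file into memory, so it needs enough RAM for the source file.

```python
import codecs
from pathlib import Path

def add_utf8_bom(path):
    """Prepend a UTF-8 BOM to the file at `path` unless it already has one.

    Returns True if the file was changed, False if a BOM was already present.
    """
    p = Path(path)
    data = p.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        return False
    p.write_bytes(codecs.BOM_UTF8 + data)
    return True
```

It overwrites the file in place, so keep a copy of the original.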

If none of the above helps, split your source file into several smaller files. The largest file that I ever compiled was an 80MB utf-16LE source file, which compiled fine with both Mobipocket Creator and KindleGen. Try splitting your source file into several 75 MB files and update the <manifest> and <spine> sections of your .opf accordingly.
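The grouping step of such a split could look roughly like this. It is only a sketch: it assumes you already have the entries as separate strings (for example, one complete `<idx:entry>` element each), and the name `chunk_entries` and the byte budget are made up for illustration. Writing each group out with the same html header and footer, and listing the new files in the .opf, is left out.

```python
def chunk_entries(entries, max_bytes):
    """Group entry strings into lists whose UTF-8 size stays within max_bytes.

    Entries are never cut in half; an entry larger than max_bytes
    still gets a chunk of its own.
    """
    chunks, current, size = [], [], 0
    for entry in entries:
        n = len(entry.encode("utf-8"))
        # Start a new chunk when adding this entry would exceed the budget.
        if current and size + n > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(entry)
        size += n
    if current:
        chunks.append(current)
    return chunks
```

For 75 MB parts you would call it with `max_bytes=75 * 1024 * 1024`, minus a little room for the header and footer of each file.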

Also have a look at the output that your Ruby script creates and check for isolated ampersands (&) or angle brackets (<>) that haven't been escaped as entities (&amp; &lt; &gt; etc.) as these are known to cause problems with many HTML parsers.
For example, having a line such as the following will cause problems:

Code:

<idx:orth>Rock Music > Rock & Roll</idx:orth>
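A quick way to hunt for such characters in a large file is a line-by-line scan. The following is a rough sketch, not a validator: the ampersand pattern flags any `&` that does not start a named or numeric character reference, and the bracket pattern is only a heuristic that flags `<` or `>` with a space on both sides. The function name `find_unescaped` is made up for illustration.

```python
import re

# "&" that does not start a named, decimal, or hex character reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")
# Heuristic only: "<" or ">" with a space on both sides is unlikely to belong to a tag.
LONE_BRACKET = re.compile(r"(?<= )[<>](?= )")

def find_unescaped(lines):
    """Yield (line_number, column, character) for each suspicious character."""
    for lineno, line in enumerate(lines, start=1):
        for pattern in (BARE_AMP, LONE_BRACKET):
            for m in pattern.finditer(line):
                yield lineno, m.start() + 1, m.group()
```

Because it takes any iterable of lines, an open file object can be passed in directly, so even a 300 MB source does not have to fit in memory at once. Brackets without surrounding spaces will slip through, so treat an empty result as a hint rather than proof.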

EbokJunkie · 01-11-2014, 07:51 PM

Thanks, I'll proceed with caution and check headers for isolated ampersands.
As to splitting source, I usually split source at DSL level and process each part with Ruby script separately. Noticed that mobigen is able to convert at least 300MB html with option C2; this builds 40-50 MB mobi dictionary. However, I have to leave i7 W7 desktop crunching on background for 6-8 hours running a few parts concurrently. TBH, bigger dictionaries may hang Kindle, at least that sometimes happens with PW1.
Added:
Wow!!! Thank you for heads up about Mobipocket Creator and BOM!
A 302 MB html with no BOM used to crash MPC at the start; now it compiled to an uncompressed prc in five minutes! Cannot say the same about kindlegen, it still crashed within a minute of starting.
MPC+BOM looks like a solution. Thanks again.


