06-25-2015, 06:39 AM | #1156 |
Grand Sorcerer
Posts: 27,549
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
If you're after the rendered text, you'll probably need some sort of parser that can handle malformed HTML. You'll also need to determine the character encoding (usually UTF-8, but quite often cp1252 as well). Your best bet is to convert any HTML entities to their character equivalents and then parse to get the text.
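A rough sketch of that approach, using only Python's standard library. The UTF-8-then-cp1252 fallback order and the helper names (decode_bytes, TextCollector, rendered_text) are assumptions for illustration. Entity conversion is left to the parser's convert_charrefs option here rather than done as a separate pass, so an escaped "&lt;" is not mistaken for a tag.

```python
from html.parser import HTMLParser


def decode_bytes(raw: bytes) -> str:
    """Try UTF-8 first, then fall back to cp1252 (assumed order)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # A few byte values are undefined in cp1252, so replace rather than fail.
        return raw.decode("cp1252", errors="replace")


class TextCollector(HTMLParser):
    """Collects text nodes; tolerant of unclosed or half-complete tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def rendered_text(raw: bytes) -> str:
    parser = TextCollector()
    parser.feed(decode_bytes(raw))
    parser.close()  # flush any text still buffered at the end
    # Collapse runs of whitespace so the result reads like rendered text.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(rendered_text(b"<p>Caf\xe9 &amp; <b>tea"))
```

A more forgiving third-party parser can be swapped in for TextCollector if the markup is badly broken; the decoding step stays the same.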
|
06-26-2015, 12:33 PM | #1157 |
Enthusiast
Posts: 33
Karma: 12694
Join Date: Aug 2014
Device: kindle paperwhite
|
Yup, it turns out BeautifulSoup is very robust for this purpose. I had to increase the imprecision of the chunks and make sure I parsed out semi-complete elements, but otherwise I got exactly what I needed.
Super happy about this! |
|
06-29-2015, 07:03 AM | #1158 |
Resident Curmudgeon
Posts: 73,983
Karma: 128903378
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Where is the standalone version of MobiUnpack? All I see are two ZIP files with the app version. No standalone exists.
|
06-29-2015, 07:31 AM | #1159 | |
Grand Sorcerer
Posts: 27,549
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
The ONE zip with "app" in the title is the Applescript version. |
|
07-02-2015, 09:10 PM | #1160 |
Author/Illustrator
Posts: 14
Karma: 2952
Join Date: Mar 2012
Location: Boise, ID
Device: iPad 2 & 3, Kindle Paperwhite, Kindle Fire 1 & 2, HD7 & HD8.9, RazrMax
|
Redundant Mobi7 in split KF8 FXL
Is there a way to remove the duplicate mobi7 images folder from the split KF8?
I understand that KU is not intended as a converter per se, but for FXL the mobi7 files are redundant in the split out KF8, and only add to the file size. It would be highly useful to be able to output a KF8 FXL with the smallest file size possible for use in distribution via non-Amazon outlets. There is no way to do this currently so far as I know. Or are the files in this folder somehow necessary for older Kindle models? Aside from the addition of a thumbnail cover in the Mobi7 folder (and their location obviously) they are identical, and therefore redundant. And there is nothing else in there. Fantastic work on this over the years btw. It has been of inestimable value. |
|
07-02-2015, 10:22 PM | #1161 | |
Grand Sorcerer
Posts: 27,549
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
I don't think I'm exactly sure what you mean. The mobi7 files/folder aren't a part of the split-out KF8 file--any more than the KF8 folder is. In addition to unpacking the content into the various folders (for inspection/editing/conversion, etc...), splitting a kindlegen-created book gives you an .azw3 (representing the KF8 portion) and a .mobi (representing the mobi7 portion).
Keep the AZW3 and delete the rest if that's all you're interested in after splitting. None of the files/folders are necessary for Kindles (new or old) except for two: the .mobi and the .azw3. Quote:
|
|
07-02-2015, 10:56 PM | #1162 | |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
The only "non-Amazon outlets" that I know of don't ACCEPT FXL files for sale. (I guess that would be "fourth."). Where are you planning to sell this? Maybe someone here can help with that, if they have experience there. FWIW. Hitch |
|
07-02-2015, 11:14 PM | #1163 | ||
Author/Illustrator
Posts: 14
Karma: 2952
Join Date: Mar 2012
Location: Boise, ID
Device: iPad 2 & 3, Kindle Paperwhite, Kindle Fire 1 & 2, HD7 & HD8.9, RazrMax
|
Quote:
Is KindleUnpack creating this mobi7 folder on extracting the .azw3, or is this part of its internal files? The reason I was checking is due to the file size of the resulting .azw3, which is the same size as the unsplit mobi file. Is this due to the presence of the source epub in the KF8 version, or both HD and compressed images? My assumption, based on the extracted .azw3 is that the mobi7 folder is still in there. But perhaps I'm misinterpreting the data here. Quote:
|
||
07-03-2015, 09:15 AM | #1164 | ||||
Grand Sorcerer
Posts: 27,549
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
I think you may be confusing what is getting unpacked with what is being included in the .mobi and .azw3 files after being split. Even a standalone azw3 file (no mobi7 component present) will produce a mobi7 folder (with images) when unpacked with KindleUnpack. That's just the way it works. Quote:
Quote:
A dual-format kindlebook (created with Kindlegen) that is unpacked with KindleUnpack (with the 'split' box checked) should result in:
* a mobi7 folder (with contents other than just images--opf, html, ncx, etc...)
* a mobi8 folder (containing an OEBPS file structure and an epub file)
* an HDImages folder (that may or may not contain images)
In addition to those three folders and their contents, there should be four other files (if the 'Split' option was selected) at the same directory level as the three above-mentioned folders:
* a mobi8-<something,something>.azw3 file
* a mobi7-<something,something>.mobi file
* a kindlegensrc.zip file
* a kindlegenbuild.log file
Without the 'Split' option selected, the results would be the same--with the exception of the last four files not being present. If those three folders, and those four files aren't present after unpacking (with the split option checked), then you didn't have a dual-format, kindlegen-created file to begin with (either that or Amazon have changed the kindlegen output in a way that KindleUnpack doesn't yet account for). Quote:
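The layout described in this post can be checked with a few lines of Python. This is only a sketch: the folder and file names are taken from the description above, their exact capitalisation on disk is an assumption, and missing_items is a hypothetical helper, not part of KindleUnpack.

```python
from pathlib import Path

# Names as described in the post; capitalisation on disk is assumed.
EXPECTED_DIRS = ("mobi7", "mobi8", "HDImages")
EXPECTED_FILE_PATTERNS = (
    "mobi8-*.azw3",
    "mobi7-*.mobi",
    "kindlegensrc.zip",
    "kindlegenbuild.log",
)


def missing_items(outdir, split=True):
    """Return the expected folders/files that are absent from outdir."""
    root = Path(outdir)
    missing = [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    if split:
        # The four extra files are only expected when 'Split' was selected.
        missing += [
            pattern
            for pattern in EXPECTED_FILE_PATTERNS
            if not any(root.glob(pattern))
        ]
    return missing
```

An empty list means the output matches what a dual-format, kindlegen-created book should produce; anything else names what is missing.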
Last edited by DiapDealer; 07-03-2015 at 09:17 AM. |
||||
07-03-2015, 10:38 AM | #1165 | |||||
Author/Illustrator
Posts: 14
Karma: 2952
Join Date: Mar 2012
Location: Boise, ID
Device: iPad 2 & 3, Kindle Paperwhite, Kindle Fire 1 & 2, HD7 & HD8.9, RazrMax
|
Quote:
Quote:
split out azw3: 107,281 KB
split out mobi7: 39,155 KB
So not exact, but close enough. Both effectively 104 MB. Minus only 18 kb of data. Moreover, the two split files combine to greater than the source content, so something is being duplicated that should not, if there is no mobi7 component in the azw3 as you say. Quote:
Also, just for reference, when creating the source file with Kindlegen I used -dont_append_source, so there should be no source files present internally to begin with. Quote:
Quote:
My understanding, then, just to clarify, is that the split azw3 contains both sets of images for dual-format (low-res/hi-res for sending to HD/non-HD devices), but no actual mobi7 component, and that this is just created upon unpack/split by KU, drawing from (i.e. duplicating on extract) the low-res/compressed image files, thereby producing more actual content than was really in there in the first place (obviously, since the combined files are greater than the source). I just was unclear as to what exactly was happening, but you have answered that. So thanks. |
|||||
07-03-2015, 11:49 AM | #1166 | |||
Grand Sorcerer
Posts: 27,549
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
Quote:
Quite simply put: if all you're interested in is splitting to get a stand-alone AZW3, the rest of the extracted data is irrelevant. Quote:
|
|||
07-03-2015, 01:52 PM | #1167 | |
Author/Illustrator
Posts: 14
Karma: 2952
Join Date: Mar 2012
Location: Boise, ID
Device: iPad 2 & 3, Kindle Paperwhite, Kindle Fire 1 & 2, HD7 & HD8.9, RazrMax
|
Quote:
Therefore, the unpack process in this case is actually creating a mobi7 component from the split KF8 file, if, as you say, it is not present in the split azw3 file. However, I believe it actually is present in the azw3, at least as a housing for the compressed images that are sent to non-HD devices. The reasoning for this is due to some confusion on my part over the function of the "Use HD Images" option, and its resulting output. Here is what I'm seeing when unpacking the split azw3 with/without this option ticked:
Use HD Images:
* BOTH mobi7 and mobi8 folders contain HD images
Do NOT use HD Images:
* NEITHER mobi7 nor mobi8 folder contains HD images (i.e. both have compressed jpegs)
* HDImages folder created containing HD images
In neither of these cases do the resulting files equal the input file size (even discounting the produced epub3 source file, the size of which is relative to the images being used). However, upon further inspection/calculation, the combination of one file from each iteration (i.e. one HD image folder plus one non-HD image folder) does equal almost exactly the input file size. This leads me to believe that both are present in the source azw3, but are not being extracted accurately (that is, with one folder containing each version of the images). Only when unticking "Use HD Images" do you get both, but in this case you actually get two sets of compressed jpegs as well as the original size images in their separate folder. This is where my confusion lay. As my understanding has been that the purpose of KU is to ascertain what exactly is occurring in the source conversion, my natural presumption was that the unpacked content reflected what was actually in that file (with the caveats for the reproduced epub structure files). Perhaps I was expecting more fidelity than is intended. It is ultimately not important at this point, other than as an academic exercise, which is always useful to further understanding in my experience.
Otherwise, I'm walking around looking like this: which is not uncommon. Thanks for bearing with me as I muddle through this. Mostly I could have worked it out on my own, but sometimes it's easier just to ask. |
|
07-05-2015, 06:49 PM | #1168 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Hi,
You really don't seem to understand what a mobi ebook actually is, so please let me try to explain a bit. It is a compiled ebook format that uses a palm database structure - a set of starting offsets to binary data referred to either as sections or records depending on who you ask. What KindleUnpack does is examine these binary sections, identify any sections that are headers, and then use them to identify the starting section numbers where images, text, index information, etc. are stored, and then extract them to files. The data from these files is used to create HTML 3.2 code that can be fed back into kindlegen for the older mobi 7 pieces, and to create an epub-like structure for the kf8 pieces. If you actually want to see the rawml you can dump that as well. The header sections also have EXTH records that contain the metadata information.
If you want to understand the exact layout of the mobi file, simply run DumpMobiHeader_v018.py or later and look at the description of what is stored in each section of the palm database file. For joint mobis, the images are not duplicated; they are stored after the mobi7 header and before the kf8 header. Later mobis can also have a completely separate container of HDImages and placeholders.
When KindleUnpack unpacks image sections (and fonts and RESC sections) it stores them all in a mobi 7 folder and copies the correct pieces to the mobi 8 folder as needed. When KindleUnpack unpacks from the HD Container, it will store these images in their own HDImage folder as they cannot be shared with a mobi 7. There is a switch to have the HDImages overwrite their low resolution cousins.
So please run DumpMobiHeader and examine the section map to see what is actually being stored inside the palm database structure. If you have further questions, post the output of DumpMobiHeader from running on your mobi so that I understand exactly what it is you are asking.
It will even work on DRMd ebooks since the headers themselves and most images are not typically encrypted. Hope this helps, KevinH |
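To make the "set of starting offsets" idea concrete, here is a minimal sketch of reading the section table of a Palm database. It assumes the usual Palm database layout (a 78-byte header with the type/creator identifier at offset 60 and the section count at offset 76, followed by 8-byte entries that each begin with a 4-byte big-endian offset). The function names are hypothetical, the toy database is built in memory purely for demonstration, and this is not what DumpMobiHeader itself does, which goes much further.

```python
import struct

PALM_HEADER_LEN = 78  # fixed-size Palm database header (assumed layout)


def section_spans(data: bytes):
    """Return (ident, [(start, end), ...]) for each section in the database."""
    ident = data[60:68]  # type + creator, b"BOOKMOBI" for mobi files
    (count,) = struct.unpack_from(">H", data, 76)
    # Each table entry is 8 bytes: 4-byte offset, then attributes/unique id.
    starts = [
        struct.unpack_from(">L", data, PALM_HEADER_LEN + 8 * i)[0]
        for i in range(count)
    ]
    # A section runs until the next section starts (or the file ends).
    ends = starts[1:] + [len(data)]
    return ident, list(zip(starts, ends))


def build_toy_database(sections):
    """Build a tiny in-memory database, for demonstration only."""
    header = bytearray(PALM_HEADER_LEN)
    header[60:68] = b"BOOKMOBI"
    struct.pack_into(">H", header, 76, len(sections))
    offset = PALM_HEADER_LEN + 8 * len(sections)
    table = b""
    for index, body in enumerate(sections):
        table += struct.pack(">LL", offset, index)
        offset += len(body)
    return bytes(header) + table + b"".join(sections)


if __name__ == "__main__":
    db = build_toy_database([b"header", b"text!", b"img"])
    ident, spans = section_spans(db)
    for number, (start, end) in enumerate(spans):
        print(number, start, end, db[start:end])
```

Note that the sections simply follow one another in the file, which is why image sections can sit once between the two headers of a joint mobi rather than being stored twice.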
07-05-2015, 07:37 PM | #1169 |
Author/Illustrator
Posts: 14
Karma: 2952
Join Date: Mar 2012
Location: Boise, ID
Device: iPad 2 & 3, Kindle Paperwhite, Kindle Fire 1 & 2, HD7 & HD8.9, RazrMax
|
Yeah, thanks Kevin, that clarifies a lot. This is not exactly stuff your average ebook creator knows, or even needs to know, or is very easy to find out for that matter. This is why we ask questions. If I already knew the answers I wouldn't bother. It's not as if it's innate information we're all born with. Most content creators don't even know how to use Kindlegen, let alone KindleUnpack. I was just trying to understand exactly how it works, based on the evidence I had before me, so thank you for your explanation.
|
07-06-2015, 03:43 AM | #1170 | |
The Grand Mouse 高貴的老鼠
Posts: 71,506
Karma: 306214458
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
|
Quote:
|
|
|