Issues Converting Translated ZIP Content Back to EPUB in Calibre

Gaunc · 03-09-2024, 10:16 AM

Hello everyone,

This is my first thread here, and I'm reaching out for some assistance with Calibre. I've recently begun managing a small catalogue of EPUB files, which has led me to learn and use Calibre for the first time.

My workflow involves several steps designed around translating EPUB content. Here's a brief overview:

1. Convert EPUB to ZIP: Using Calibre, I first convert my EPUB files into ZIP format.
2. Unzip and Translate: After unzipping the EPUB, I run a script that translates the content within all the HTML files to another language, utilizing an AI LLM API.
3. Re-zip and Convert Back to EPUB: Once translation is complete, I rezip the files and attempt to convert this ZIP back into an EPUB format using Calibre.

The issue arises in the final step. Despite the translated HTML content displaying perfectly in a web browser, the re-converted EPUB file from the ZIP is a total mess. Interestingly, this issue persists even when I try converting the original EPUB to ZIP and then back again, without any modifications.

As someone new to Calibre and this process, I'm unsure where the problem lies or how to fix it. Has anyone here dealt with similar conversion challenges or have experience with translating content in EPUB files? Any insights or advice would be greatly appreciated.

Thank you in advance for your help!

DNSB · 03-09-2024, 12:17 PM

For what it may be worth, an ePub is a zip container so no need to convert to zip. You do have to maintain the structure of the container so links point to the correct locations. I would suggest unzipping the epub into a directory, trying your translation on the html/xhtml files only and then rezipping the directory contents. One special note is that the mimetype file must be in the root of the .zip container and must be stored with no compression.
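The repacking step could be sketched in Python roughly as follows. This is only a minimal sketch using the standard zipfile module; the function name build_epub and the directory names are made up for illustration, and it assumes the unpacked directory already has the mimetype file in its root.

```python
import os
import zipfile


def build_epub(src_dir: str, epub_path: str) -> None:
    """Zip an unpacked EPUB directory with 'mimetype' first and uncompressed."""
    mimetype_path = os.path.join(src_dir, "mimetype")
    if not os.path.isfile(mimetype_path):
        raise FileNotFoundError("no 'mimetype' file in the root of " + src_dir)

    with zipfile.ZipFile(epub_path, "w") as zf:
        # Write the mimetype entry first, stored with no compression.
        zf.write(mimetype_path, "mimetype", compress_type=zipfile.ZIP_STORED)

        # Add everything else with normal deflate compression,
        # keeping paths relative to the root of the container.
        for root, dirs, files in os.walk(src_dir):
            dirs.sort()
            for name in sorted(files):
                full = os.path.join(root, name)
                arc = os.path.relpath(full, src_dir).replace(os.sep, "/")
                if arc == "mimetype":
                    continue  # already written above
                zf.write(full, arc, compress_type=zipfile.ZIP_DEFLATED)
```

Something like build_epub("book_unpacked", "book_translated.epub") would then produce the file. Write the output outside the source directory, otherwise the half-written archive gets picked up by the directory walk.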

You will need to correct language references in the ePub. If for instance you are translating from English to German, any references to lang="en" or xml:lang="en" would need to be changed to lang="de" or xml:lang="de".
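As a rough illustration of that language fix, the sketch below rewrites lang and xml:lang attributes with a regular expression. The function names are hypothetical. It also assumes you want to drop any region subtag (so "en-US" becomes plain "de"), and the second helper touches the dc:language element in the OPF file, which goes slightly beyond the attributes mentioned above.

```python
import re


def retag_language(markup: str, old: str, new: str) -> str:
    """Replace lang="old" / xml:lang="old-XX" attribute values with new."""
    pattern = re.compile(
        r"""(\b(?:xml:)?lang\s*=\s*)(["'])"""
        + re.escape(old)
        + r"""(?:-[A-Za-z0-9-]+)?\2""",
        re.IGNORECASE,
    )
    # Keep the attribute name and the original quote character.
    return pattern.sub(lambda m: m.group(1) + m.group(2) + new + m.group(2), markup)


def retag_opf_language(opf: str, old: str, new: str) -> str:
    """Replace the value of <dc:language> in the OPF metadata (assumed dc: prefix)."""
    pattern = re.compile(
        r"(<dc:language\b[^>]*>)\s*"
        + re.escape(old)
        + r"(?:-[A-Za-z0-9-]+)?\s*(</dc:language>)",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: m.group(1) + new + m.group(2), opf)
```

The word boundary in the pattern keeps attributes such as hreflang out of the match. A regex is a blunt tool for markup, so it is worth diffing the files afterwards.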

The last time I saw an attempt to machine translate an ePub, it also ran into the issue that the translation attempted to translate everything. I.e. <body> was translated to <körper> which is not valid and class= was translated as klasse= which again is not valid. Hopefully, the tools have improved over the last few years so that will not be an issue.
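One way to catch that kind of damage is to compare the tags and attributes of each file before and after translation. The sketch below does this with Python's standard html.parser; the names markup_skeleton and same_markup are invented for this example, and it assumes attribute values are meant to stay untouched, so a legitimately translated title or alt attribute, or a changed lang value, would also be flagged.

```python
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Record every tag event and its attributes, ignoring text content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, tuple(attrs)))

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag, tuple(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag, ()))


def markup_skeleton(html_text: str) -> list:
    """Return the sequence of tag events in a document."""
    collector = _TagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.events


def same_markup(original: str, translated: str) -> bool:
    """True if both documents have identical tags and attributes in the same order."""
    return markup_skeleton(original) == markup_skeleton(translated)
```

Running same_markup on each original/translated pair before rezipping gives a quick pass/fail check; it says nothing about the quality of the translation itself.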

You might be better off converting the ePub to a .docx Word document, translating that document and then converting back to ePub.

Quoth · 03-09-2024, 12:35 PM

Epub is a zip. Just rename it.

However export/conversion to docx, rtf or whatever for translation would be better. Your method risks messing up the epub manifest and css etc, as you've discovered.

Also the elephant in the room is the so-called AI, the LLM. Either they are rubbish or plagiarising.

An epub is simply a zip, but the contents for an epub2 are:
HTML files, in order. Each new file causes a page break. The HTML headers ideally import css.
The CSS file(s), if any. Bad design if there are not.
The font files, if any. Order is irrelevant.
The image files, if any. Order is irrelevant.
A content.opf which is mandatory. It lists the files and what they do (a manifest).
A toc.ncx which is optional. It's the "system" Table of Contents for an app or ereader.
Epub3 has other possible files.
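To see what a particular epub actually contains, the manifest can be read straight out of the zip without renaming or unpacking anything. The following is a minimal sketch with the standard library; list_manifest is a made-up name. It locates the OPF through META-INF/container.xml rather than assuming it is called content.opf, since the container file is what points at it, and it assumes hrefs in the manifest are not percent-encoded.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def list_manifest(epub_file):
    """Return (opf_path, [(path_in_zip, media_type), ...]) for an EPUB."""
    with zipfile.ZipFile(epub_file) as zf:
        # container.xml names the OPF package file.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.get("full-path")
        opf_dir = posixpath.dirname(opf_path)

        # Manifest hrefs are relative to the OPF file's own directory.
        package = ET.fromstring(zf.read(opf_path))
        items = []
        for item in package.findall("opf:manifest/opf:item", OPF_NS):
            href = posixpath.join(opf_dir, item.get("href"))
            items.append((href, item.get("media-type")))
        return opf_path, items
```

Comparing this list before and after translation is a cheap way to confirm that no file was renamed, dropped or added along the way.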

Calibre has an editor which manages the relation between the files. Editing the HTML outside of Calibre is risky.

Simply passing the HTML via an API is a disaster as IDs (anchors), imports, classes etc won't be preserved, apart from risk of mangling the HTML tags.

It's best to export docx, translate each section/chapter separately (copy / paste only same style blocks with no images), check all links, anchors, headings, etc, save as docx, import to Calibre.

What you are doing only works (badly) for web pages. The HTML files in an ebook are not the same as a standalone web page, even though they use HTML5. Doing this with an epub3 is even more fraught with disaster.

BetterRed · 03-09-2024, 03:11 PM

Quote:

Originally Posted by DNSB

. . .

You might be better off converting the ePub to a .docx Word document, translating that document and then converting back to ePub.

There are a couple of Word addins you might want to consider:

TransTools – Translation productivity tools.

e-Book Tools.

There is overlap between them, but they have their own strengths and weaknesses. For example, TransTools Unbreaker and e-BookTools Dialogue checker are unique to each and invaluable.

I have them installed in the desktop version of Word from latest Office 365 with no issues… I keep everything local.

BR

Gaunc · 03-09-2024, 08:25 PM

Thank you all for your insights! I admit my knowledge of ePub is not very deep. I've been troubleshooting based on my workflow, and I'm currently trying to figure out the last step: converting a .zip file back to an .epub. Initially, I assumed that the process I used in Calibre to convert an ePub to a zip file could be simply reversed, but it seems that's not the case.

I attempted to rename my ePub file to .zip, but that approach didn't work. Regarding the LLM translation, that part went smoothly and without errors. I've completed a script segment that employs BeautifulSoup to parse individual HTML files, extracting content from specific tags. The content of the book was within three <div> classes, so the script needed to fetch the content from those specified classes, pass it through the LLM, and use the output to replace the original HTML content. I've been using Google Gemini for this, and it's quite remarkable—it didn't alter any HTML tags, and the formatting remained unchanged when viewing the HTML files.
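For readers who want to try something similar, here is a rough sketch of that kind of BeautifulSoup pass. It is not the actual script. The class names are placeholders, and translate_text is a hypothetical stand-in for the LLM call (it just uppercases the text so the example runs offline). This version only touches text nodes, so tags, ids and classes are never sent to the model.

```python
from bs4 import BeautifulSoup, Comment


def translate_text(text: str) -> str:
    """Hypothetical stand-in for the LLM request; swap in a real client here."""
    return text.upper()


def translate_html(html: str, target_classes) -> str:
    """Translate only the text inside <div> elements with the given classes."""
    soup = BeautifulSoup(html, "html.parser")

    # Collect the text nodes first so that nested target divs
    # do not cause the same text to be translated twice.
    nodes, seen = [], set()
    for div in soup.find_all("div", class_=list(target_classes)):
        for node in div.find_all(string=True):
            if id(node) in seen or isinstance(node, Comment):
                continue
            if not node.strip() or node.parent.name in ("script", "style"):
                continue
            seen.add(id(node))
            nodes.append(node)

    for node in nodes:
        node.replace_with(translate_text(str(node)))
    return str(soup)
```

Translating one text node at a time keeps the markup safe but gives the model very little context, so sentences split by inline tags such as <em> may come out worse than when a whole paragraph is sent at once. Which trade-off is better probably depends on the book.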

I've uploaded the translated HTML "website" to GitHub as a demonstration of this part of the process working. You can view it here: https://gaunc1.github.io/brobromybookishere/.

BetterRed · 03-09-2024, 10:03 PM

Perhaps you could use the Calibre Unpack tool: it provides 'Explode' and 'Rebuild' features.

BR
