02-12-2017, 12:20 AM   #1
Dyspeptica
Device: iPad
ebook-convert creates extra split .htm files with random text, inserted at hard page breaks

Hi,
Using calibre 2.74.0 on Fedora 25, recently upgraded from Fedora 23 (previous calibre version unknown), so this may (or may not) be version-related. It did not happen previously.

I am creating EPUBs from HTML using ebook-convert in a bash script.
I create a cover JPEG externally and pass --cover xx.jpg on the command line. The HTML starts with a frontispiece and then the actual text, which normally begins with a Table of Contents original to the book (but cleaned up by me). The frontispiece itself ends with a <div style="page-break-before:always;"></div> line (a 'hard' page break), and then come some title headers which form the top of the main text.
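
To make this concrete, here is a rough sketch of how the input gets assembled. The file names (frontis.htm, body.htm) and headings are simplified placeholders; only the page-break div is literal:

# sketch only: frontis.htm and body.htm stand in for the real pieces
cat frontis.htm body.htm > book.htm
# frontis.htm ends like this:
#   ...frontispiece text...
#   <div style="page-break-before:always;"></div>
# body.htm then starts with the title headers and the original Table of Contents:
#   <h1>Title</h1>
#   <h2>Contents</h2>
#   ...chapters...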

I have been getting random extra pages of text: between the cover and the frontispiece, and again between the frontispiece and the main body. This happens at conversion time.

I unzipped the epub and found a ...split_000.htm file containing some random extra text, a split_002.htm file with the *same* text, and the same text again at the bottom of the ...split_001.htm file, inserted *after* the <div style="page-break-before:always;"></div> line. In one epub I unzipped the extra text came from deep in the book; in another it came from near the top.

The ...split_000.htm file sits right after the cover, then there is a hard break, then ...split_001.htm (with its copy of the text at the bottom, below the proper page-break point), then split_002.htm with a hard break, and finally split_003.htm with the correct lines/text that originally sat at the bottom of the frontispiece, below the hard-break point. The HTML input file is a concatenation of the frontispiece and body files and looks fine in a browser; the problem is with the conversion.

Deleting the 000 and 002 files and re-zipping produces an epub which *looks* correct, but internal links to the Table of Contents fail: they point at 000.htm, which no longer exists, even though the ToC is actually in 003.htm (and 000.htm never contained a ToC). So the conversion is rewriting the links to point at the wrong place?
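
For reference, this is roughly how I did the delete-and-rezip check. The names (book.epub, book-test.epub) are illustrative, and the exact split file names/paths inside the epub will vary:

mkdir tmp && cd tmp
unzip ../book.epub
ls ./*split_*.htm*                      # the split_000 ... split_003 files
rm ./*split_000.htm* ./*split_002.htm*  # drop the two bogus files
zip -X0 ../book-test.epub mimetype      # mimetype must go first, stored uncompressed
zip -Xr9 ../book-test.epub * -x mimetype
cd ..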

If instead I remove the text content of the 'bad' files, the resulting epub has no cover and has blank pages where that text was, and internal links in the body point to the blank page that sits ahead of the frontispiece, after the (now missing) cover.

I am lost indeed and need help. Direct email and lots of examples available on request, of course!

This is the convert command line, slightly cleaned of real data:

/usr/bin/ebook-convert $oldfile ../epub/$newfile \
    --base-font-size 7 --linearize-tables \
    --input-profile default --output-profile ipad --input-encoding utf-8 \
    --cover ../../cover/${g}.jpg --no-svg-cover \
    --remove-paragraph-spacing --remove-paragraph-spacing-indent-size -1 \
    --unsmarten-punctuation --disable-delete-blank-paragraphs \
    --max-toc-links 0 --no-chapters-in-toc --toc-threshold 0 \
    --author-sort "" --authors "" --publisher "" --title "${title}" > /dev/null 2>&1

As I said, this is in a for loop in a script, and runs against 4000-odd files on some runs. It used to work fine.
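
Stripped down, the loop looks roughly like this (how $g and $title get set is simplified here, and most of the switches from the command above are elided):

# simplified sketch of the driver loop; lookup_title is a stand-in for the real title lookup
for oldfile in ./*.htm; do
    g=$(basename "$oldfile" .htm)
    newfile="${g}.epub"
    title=$(lookup_title "$g")
    /usr/bin/ebook-convert "$oldfile" "../epub/$newfile" \
        --cover "../../cover/${g}.jpg" \
        --output-profile ipad \
        --title "${title}" > /dev/null 2>&1   # plus the other switches shown above
done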