AZW3 to EPUB conversion is split with one page per paragraph

steinarb · 01-09-2015, 03:37 AM

Calibre 2.15 on Windows 8.1.

No custom conversion settings.

The target reader is a Sony PRS-T1.

When converting an AZW3 file to EPUB the size increased from 0.6MB to 3MB.

The EPUB didn't contain any obviously large files: the cover image was 103kB and the OPF file was 600kB (but compressed well). However, the EPUB contained a lot of HTML files.

When reading the EPUB I discovered that the text had been split with one page per paragraph, and the page count was 4911 (the AZW3 original has 593 pages).
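A quick way to confirm this kind of over-splitting without opening the book on the reader is to count the files inside the EPUB, which is an ordinary ZIP archive. The following is a minimal sketch using only the Python standard library; the function name `count_by_extension` is made up for illustration and is not part of Calibre.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_by_extension(epub_path):
    """Count the entries in an EPUB (a ZIP archive) per lowercase file extension.

    epub_path may be a file name or a binary file-like object.
    """
    with zipfile.ZipFile(epub_path) as zf:
        return Counter(
            PurePosixPath(name).suffix.lower()
            for name in zf.namelist()
            if not name.endswith("/")  # skip directory entries
        )
```

Calling `count_by_extension("book.epub")` (a hypothetical file name) on an over-split book should show a count of `.html` or `.xhtml` entries far larger than the number of real chapters.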

Is there a particular setting I should look at to make it not split on the paragraph level?

I added metadata from the internet before converting, keeping the original cover, but that hasn't contributed much to the OPF. Most of the OPF size is related to entries for each individual HTML file.

The conversion log is attached (with a lot of "Detected chapter" entries).

The file I'm trying to convert is copyrighted, but I can create a bug report with the input and output files attached if this is necessary (as described here https://www.mobileread.com/forums/sho...d.php?t=186697 ).

steinarb · 01-09-2015, 03:59 AM

I think I have found the culprit.

I opened the AZW3 original in Edit Book and looked at a random HTML file.

Many, perhaps all, of the <p> elements look like this:

Code:

<p class="chapter">“Yes,” he said. “It is.”</p>

And that matches this clause of the XPath expression used for structure detection:

Code:

@class = 'chapter'
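To illustrate why that clause matches so much, here is a small sketch that runs an equivalent attribute test over a made-up fragment. It uses the limited XPath subset in Python's standard `xml.etree.ElementTree`, not Calibre's own XPath engine, and the sample markup is an assumption built around the one paragraph quoted above.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one real heading, body paragraphs all tagged "chapter".
SAMPLE = """<body>
  <h1>Chapter 1</h1>
  <p class="chapter">“Yes,” he said. “It is.”</p>
  <p class="chapter">She nodded.</p>
  <p class="chapter">Nothing more was said.</p>
</body>"""


def count_class_matches(xhtml, class_name="chapter"):
    """Count elements at any depth whose class attribute equals class_name."""
    root = ET.fromstring(xhtml)
    return len(root.findall(f".//*[@class='{class_name}']"))
```

With this sample, every body paragraph matches while the single real heading does not, so a rule that treats each match as a chapter start would put each paragraph on its own page.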

steinarb · 01-09-2015, 04:07 AM

What I did to fix this was:

Start a conversion
Select "None" in "Chapter Mark" (the original setting was "Pagebreak")
Do the conversion
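The same steps can probably be scripted through calibre's `ebook-convert` command-line tool, where, as far as I know, the "Chapter Mark" setting corresponds to the `--chapter-mark` option. This is a sketch under that assumption; the helper names and the file names in the usage note are hypothetical, and `ebook-convert` must be on the PATH for `convert` to work.

```python
import subprocess


def build_convert_command(src, dst, chapter_mark="none"):
    """Build an ebook-convert argument list that overrides the chapter mark."""
    return ["ebook-convert", src, dst, "--chapter-mark", chapter_mark]


def convert(src, dst):
    """Run the conversion; raises CalledProcessError if ebook-convert fails."""
    subprocess.run(build_convert_command(src, dst), check=True)
```

For example, `convert("book.azw3", "book.epub")` would run the conversion with chapter marking turned off instead of the "pagebreak" setting.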

The resulting EPUB looks OK in Calibre's EPUB reader; the TOC points to the correct places and contains meaningful entries for the real chapters. The resulting EPUB also has a size of 0.6MB, instead of 3MB.

theducks · 01-09-2015, 09:13 AM

That is not a bug.
By default, the 'Structure Detection' section of Conversion makes Calibre look for certain words and split on them or turn them into headings.

Were you expecting Calibre to count the results and go 'That is an absurd number of Chapters' ?

IMHO p class="chapter" was a weak choice for a paragraph selector name.

It is one I have not seen used by common ebook publishers.

steinarb · 01-09-2015, 09:52 AM

Quote:

Originally Posted by theducks

That is not a bug

I know that. I never said it was.

Quote:

Originally Posted by theducks

Were you expecting Calibre to count the results and go 'That is an absurd number of Chapters' ?

I had no expectations.

I had a problem.

I found a fix for the problem.

I posted my findings as responses to my original problem.

This makes it useful for someone who has the same problem and might google for a solution.

Quote:

Originally Posted by theducks

IMHO p class="chapter" was a weak choice for a paragraph selector name.

It is one I have not seen used by common ebook publishers.

This was a very common ebook publisher/vendor.

The previous books of the same series had no such problem.

I would say "silly choice for a paragraph selector name" rather than "weak choice for a paragraph selector name", but at least it made it pretty obvious what was happening when I found it.

I worried for a bit that I would have to modify the individual HTML files of the AZW3 file with XSLT or Perl, or mess with the chapter detection XPath expression, but was relieved when removing the chapter mark gave a satisfactory result.

rashkae · 03-04-2015, 10:57 AM

I find the default Calibre page splitting options to be a hindrance. They are no doubt great (probably essential) when converting non-e-book documents to e-book formats. However, e-book formats (epub, mobi, azw3, etc.) are probably already split (or at least have page breaks), and adding heuristics to create new splits can often muddle things, so I change these defaults in my own config:

Under Structure Detection, the Chapter Mark setting is changed from "Page Break" to "None".

The "Insert Page Breaks before" XPath is disabled by setting it to /

Some books need to have page breaks removed from their CSS when they are used inappropriately. (I've seen many books with a CSS page-break-before: always on their chapter headings that also have some kind of graphic at the top of the page. When run through an e-book conversion, this causes the graphic to be put on a page by itself.)
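For books with many stylesheets, that CSS cleanup could be scripted rather than done by hand. The following is a rough sketch that strips `page-break-before: always` declarations from a CSS string with a regular expression; it is not a real CSS parser (it would also touch matching text inside comments), and the function name is made up for illustration.

```python
import re

# Matches "page-break-before: always", an optional "!important",
# the trailing semicolon if present, and any whitespace after it.
PAGE_BREAK_RE = re.compile(
    r"page-break-before\s*:\s*always\s*(?:!important\s*)?;?\s*",
    re.IGNORECASE,
)


def strip_forced_page_breaks(css):
    """Return css with every 'page-break-before: always' declaration removed."""
    return PAGE_BREAK_RE.sub("", css)
```

Other values such as `page-break-before: avoid` are left alone, since only the forced break causes the lone graphic page described above.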

loadoutgreen75 · 05-14-2018, 09:20 AM

Thanks for posting how you fixed it. I had been stuck on this for a while with epub->epub; I didn't think to look at the raw text.

