MobileRead Forums - View Single Post

ldolse · 04-15-2009, 12:30 AM · #460 · .Lit conversion to epub

Hi,

I just started converting some .lit files to epub, and I'm noticing some nasty behavior around CSS. When converted to HTML, lit files make heavy use of <span></span> tags embedded within <p></p> tags. Every single line/paragraph in the entire book is embedded between these tags. A single line or paragraph starts like this:

<p class="MsoPlainText" id="calibre_css_id_533"><span class="calibre_class_2" id="cfs_1100">

The problem comes with the way Calibre handles the CSS generation. Every single tag in the book gets its own unique id on top of the class. 99% of these ids use the same settings. The ids in the <p> tags are all setting the text-indent and font size, which never changes in my test (so it could be made part of the class). The ids in the <span> tags are setting a handful of font sizes.

Needless to say this makes it impossible to reformat the book using the flexibility of css. I'm also seeing performance issues in some readers as they try to deal with rendering html with thousands of unique ids.

Here are all the tags/classes that were duplicated in a book I just converted:

calibre_css_id_xxx - 2789 occurrences with same indent, 2810 occurrences with identical font size
calibre_class_x (span class): 5 occurrences with identical settings
cfs_xxx - 2872 occurrences setting the same font size
div.Sectionxx - 41 occurrences setting the same page break

Note the same handling of <p> and <span> tags happens if I use ConvertLit to convert to HTML and then convert the resulting HTML to epub.

Can Calibre just check to see if a given setting has been used in an existing CSS or class before creating a new ID? Seems like this would fix the problem.
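
To make the idea concrete, here is a rough sketch of the kind of check I mean. This is not Calibre's code; the function name merge_duplicate_rules, the shared_N class names and the sample property values are all made up for illustration. It assumes the per-id settings are available as a mapping from id to declarations, and reuses one class for every id whose declarations are identical.

```python
def merge_duplicate_rules(id_rules):
    """Group ids whose declarations are identical under one shared class.

    id_rules maps an element id to a dict of CSS property -> value
    (a hypothetical input shape, not Calibre's internal format).
    Returns (class_rules, id_to_class).
    """
    class_rules = {}   # shared class name -> declarations
    seen = {}          # frozen set of declarations -> shared class name
    id_to_class = {}   # original id -> shared class name
    for elem_id, decls in id_rules.items():
        key = frozenset(decls.items())
        if key not in seen:
            # First time this exact set of settings appears: make a class.
            name = "shared_%d" % len(seen)
            seen[key] = name
            class_rules[name] = dict(decls)
        id_to_class[elem_id] = seen[key]
    return class_rules, id_to_class


# Hypothetical sample values, only to show the grouping.
rules = {
    "calibre_css_id_533": {"text-indent": "0", "font-size": "1em"},
    "calibre_css_id_534": {"text-indent": "0", "font-size": "1em"},
    "calibre_css_id_535": {"text-indent": "2em", "font-size": "1em"},
}
class_rules, id_to_class = merge_duplicate_rules(rules)
```

With something like this, the thousands of ids in my book would presumably collapse into a handful of classes, which could then be edited in one place.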

If you point me to the section of code where these decisions are made I can see if I can work out any options myself, but that might be beyond my skills at the moment.

Other issues:
Chapter Detection
When I was creating epubs from PDFs I got a decent handle on how chapter detection functioned in that workflow. If the xpath matches, the chapters get tagged and I can see the chapters in Adobe DE. I've converted a couple of Lit files, and one of them split the book into separate HTML files for each chapter, but no table of contents was created that Adobe DE could see. Is there something that needs to be done to make that happen?

There are no linefeeds in the resulting HTML file. It would be nice to stick in some line feeds after </p> just to make it easier to work with. Some text editors also struggle with opening a file with a single line that long.
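
As a stopgap on my side, something along these lines would probably do as a post-processing step on the converted file. It is only a sketch: add_linefeeds is a name I made up, and it uses nothing but Python's standard re module, not anything from Calibre.

```python
import re


def add_linefeeds(html):
    """Insert a newline after each closing </p> not already followed by one."""
    return re.sub(r"(</p>)(?!\r?\n)", r"\1\n", html, flags=re.IGNORECASE)


sample = "<p>one</p><p>two</p>\n<p>three</p>"
result = add_linefeeds(sample)
```

The lookahead keeps it from doubling up line feeds if the file is run through it more than once.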
