Cleaning ePubs: automatically, fast and with as many generic rules as possible - Page 8

capidamonte · 12-26-2013, 06:19 PM

stevelitt,

Strict (X)HTML source documents with custom parsers for target devices is the elegant answer, I think. Write your converters individually (Kindle 1, iPad3, Kobo, Nook, etc.) Some may require document rewriting (old MOBI image sizes, for instance), specialized splitting into various files, and of course, individualized css docs for various targets.

For your root doc structure, I suggest that you simply up your H#s by one.

H1: book element (title, cover, appendix, body, toc, list of maps, etc.)

H2: Parts within the body (eg: Book 1, Part 1, Volume 1), or glossary (eg: A, B, ... Z) or appendices (eg: Art Deco, Post-Modern) et al.

H3: Chapters within the body (mostly) or appendices (more rare) and other such elements that need them. There may be other Chapter analogues in books that I'm not thinking of. An extensive glossary may need something like Sa-Sn, Sm-Sz for instance.

H4, H5, H6: Sub-sections of a Chapter, as necessary. See above re: glossary for further refinement in non-body book elements.

Proper classes in different elements will help with divergent styling needs (<h2 class="body">, <h2 class="appendix">) either for aesthetics or for eReader compatibility.

Relatively simple and, if standardized, easy to convert to various forms of ePub or MOBI.
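
For illustration only, here is a minimal Python sketch of "upping the H#s by one" on an existing source document, so that H1 is freed for book elements. It uses the standard library's xml.etree.ElementTree, assumes un-namespaced tags, and the function name demote_headings is hypothetical. Note that h5 and h6 both end up as h6, so a very deep document would need manual review.

```python
import xml.etree.ElementTree as ET

def demote_headings(root, max_level=6):
    """Shift h1..h5 down one level (h1 -> h2, ...); h6 cannot go deeper."""
    for el in root.iter():
        tag = el.tag
        # Only touch plain heading tags such as "h1" .. "h6".
        if isinstance(tag, str) and len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            level = int(tag[1])
            if 1 <= level < max_level:
                el.tag = "h%d" % (level + 1)
    return root

doc = ET.fromstring(
    "<body><h1>Part 1</h1><h2>Chapter 1</h2><p>Text</p><h6>Deep</h6></body>"
)
demote_headings(doc)
shifted = [el.tag for el in doc]
print(shifted)
```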

This was my plan, originally, back when I worked for Hitch -- but I got sidetracked by life.

I will now return to the underside of my rock.

PS: hello, Hitch.

Hitch · 12-27-2013, 03:00 AM

Quote:

Originally Posted by capidamonte

stevelitt,

Strict (X)HTML source documents with custom parsers for target devices is the elegant answer, I think. Write your converters individually (Kindle 1, iPad3, Kobo, Nook, etc.) Some may require document rewriting (old MOBI image sizes, for instance), specialized splitting into various files, and of course, individualized css docs for various targets.

For your root doc structure, I suggest that you simply up your H#s by one.

H1: book element (title, cover, appendix, body, toc, list of maps, etc.)

H2: Parts within the body (eg: Book 1, Part 1, Volume 1), or glossary (eg: A, B, ... Z) or appendices (eg: Art Deco, Post-Modern) et al.

H3: Chapters within the body (mostly) or appendices (more rare) and other such elements that need them. There may be other Chapter analogues in books that I'm not thinking of. An extensive glossary may need something like Sa-Sn, Sm-Sz for instance.

H4, H5, H6: Sub-sections of a Chapter, as necessary. See above re: glossary for further refinement in non-body book elements.

Proper classes in different elements will help with divergent styling needs (<h2 class="body">, <h2 class="appendix">) either for aesthetics or for eReader compatibility.

Relatively simple and, if standardized, easy to convert to various forms of ePub or MOBI.

This was my plan, originally, back when I worked for Hitch -- but I got sidetracked by life.

I will now return to the underside of my rock.

PS: hello, Hitch.

Hey, kiddo!

Give me a ring next week. I'll be around, let's chew the fat.

@Stevelitt: I can attest that although he's been out of the game for some time now, Cap knows his epub-fu. He can be quite useful to you if you're developing something. He ran away to join the circus, or something akin to that, but I think he's goofing off a lot these days. He probably needs a good project to keep him out of trouble. ;-)

Hitch

stevelitt · 01-03-2014, 08:19 AM

Quote:

Originally Posted by Hitch

Hey, kiddo!

Give me a ring next week. I'll be around, let's chew the fat.

@Stevelitt: I can attest that although he's been out of the game for some time now, Cap knows his epub-fu. He can be quite useful to you if you're developing something. He ran away to join the circus, or something akin to that, but I think he's goofing off a lot these days. He probably needs a good project to keep him out of trouble. ;-)

Hitch

Thanks Hitch and Cap!

There's a lot to report. I'm almost done. All that remains is the parser object, and I have the pseudocode design for the parser object complete, so it's just a matter of translating the design to Python.

I spent the last several days learning Python's lxml.etree XML/Xhtml API, so now I'm barely competent with it. Still, it's much, much, MUCH better and easier than the brute force home brew text parsing I was going to do on the Xhtml. I had to be dragged, kicking and screaming, by the guys on the #python IRC channel, into the world of XML parsers. XML parsers have improved A LOT in the past several years.

I'll speak to some of the details in a reply to Cap.

Guys, thanks for all your help and support.

SteveT

stevelitt · 01-03-2014, 08:55 AM

Quote:

Originally Posted by capidamonte

stevelitt,

Strict (X)HTML source documents with custom parsers for target devices is the elegant answer, I think.

Hi capidamonte,

Bless your heart for saying that! All my publisher friends are telling me what a turkey I am for sourcing Xhtml instead of (ugh) MS Word or (gulp) a PDF file or (what could *possibly* go wrong) sending it to a service, where every minor update involves a week's delay and a fifty dollar charge.

Quote:

Originally Posted by capidamonte

Write your converters individually (Kindle 1, iPad3, Kobo, Nook, etc.) Some may require document rewriting (old MOBI image sizes, for instance), specialized splitting into various files, and of course, individualized css docs for various targets.

:-)
Well, of course we're working toward that ideal, but for the time being, I'll settle for outputting a standard ePub and converting that with Calibre, or Calibre plus Kindlegen. I never said it on this forum, but my reason for creating this software is so that in 2014, I can write a (short) eBook every two weeks. That means very soon now I need to quit programming and start slamming out content.

But yeah, some day I'll have different converters for different devices.

Quote:

Originally Posted by capidamonte

For your root doc structure, I suggest that you simply up your H#s by one.

H1: book element (title, cover, appendix, body, toc, list of maps, etc.)

H2: Parts within the body (eg: Book 1, Part 1, Volume 1), or glossary (eg: A, B, ... Z) or appendices (eg: Art Deco, Post-Modern) et al.

H3: Chapters within the body (mostly) or appendices (more rare) and other such elements that need them. There may be other Chapter analogues in books that I'm not thinking of. An extensive glossary may need something like Sa-Sn, Sm-Sz for instance.

H4, H5, H6: Sub-sections of a Chapter, as necessary. See above re: glossary for further refinement in non-body book elements.

capidamonte, I think you're going to like this. My program operates off two things: The Xhtml source doc, and a Yaml config file. The Yaml config file defines the relationship between <h?/> tags and things like Part, chapter, section, subsection, etc. Because of course not all books have parts, and it's conceivable a small, 99 cent eBook might not even have chapters. That same Yaml config file also defines which H numbers get their own pages, and which ones go in the table of contents.
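
To make the idea concrete, here is a rough sketch of how such a mapping could drive a table of contents. It is not the actual program: the config is a plain Python dict standing in for whatever a YAML loader would return, every key name (role, own_page, in_toc) is made up for illustration, and it uses the standard library's xml.etree.ElementTree on un-namespaced tags (lxml.etree exposes a largely compatible API).

```python
import xml.etree.ElementTree as ET

# Hypothetical config, shaped like what a YAML loader might return.
# None of these key names come from the real program.
CONFIG = {
    "h1": {"role": "part", "own_page": True, "in_toc": True},
    "h2": {"role": "chapter", "own_page": True, "in_toc": True},
    "h3": {"role": "section", "own_page": False, "in_toc": False},
}

def build_toc(root, config):
    """Return (role, title) pairs for headings the config marks for the TOC."""
    toc = []
    for el in root.iter():
        rule = config.get(el.tag)
        if rule and rule["in_toc"]:
            # itertext() also collects text inside inline children like <em>.
            title = "".join(el.itertext()).strip()
            toc.append((rule["role"], title))
    return toc

SOURCE = """<body>
  <h1>Part One</h1>
  <h2>Getting Started</h2>
  <h3>A minor section</h3>
  <h2>Going <em>Further</em></h2>
</body>"""

toc = build_toc(ET.fromstring(SOURCE), CONFIG)
for role, title in toc:
    print(role, "-", title)
```

A book without parts would only need a different config (h1 mapped to chapter, say); the source document and the code stay the same.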

Because my philosophy is that the Frontmatter and Backmatter are completely different animals than the mainmatter, I *don't* use <h?> tags for anything in the frontmatter or backmatter. Instead, I let the book author use something like <p class=foreward/>, and if the book author wants the section to get its own page, which is how I make a heading appear at the top of the page, the book author puts <div class="_pagebreak_"/> right above the <p class=foreward/>. But you just reminded me that I need to let the author determine, on a case by case basis, which frontmatter or backmatter elements get put in the tables of contents. So, in the <div class="_pagebreak_"/>, I'll probably have an attribute called "contents" that can have a value of Y or N, if the Xhtml will still parse and show up in a browser (for viewing while authoring the book). If the author doesn't put in that attribute, the default is not to put it in the contents.

Quote:

Originally Posted by capidamonte

Proper classes in different elements will help with divergent styling needs (<h2 class="body">, <h2 class="appendix">) either for aesthetics or for eReader compatibility.

Relatively simple and, if standardized, easy to convert to various forms of ePub or MOBI.

This was my plan, originally, back when I worked for Hitch -- but I got sidetracked by life.

Thanks for this advice. It at once gives me ideas, and reinforces my belief that I'm on the right track.

I should have this program written, tested, documented, Expat (free software) licensed, and available for download on Troubleshooters.Com within a week. I'll let you guys know when it's done.

And THANK YOU, all of you, for all the help.

SteveT

capidamonte · 01-11-2014, 10:28 PM

Steve,

I'm not familiar with YAML, but if I'm following you correctly then you'll be writing a config file for every individual book -- redefining H1 as chapter if there are no Parts, for instance.

If that is what you wish to do, go forth and be productive.

I'm more of a mind that one should try to produce an ideal form of XHTML structure that covers all the elements that you can find in a book, and code appropriately for the structure of the book. (It will mean iteration as you find new and surprising elements.)

For instance:

Quote:

<div class="_pagebreak_"/> right above the <p class=foreward/>

Why not simply wrap all elements in a book's Foreword in a set of <div class="foreword">...</div> tags? Simple, and you can define all the paragraphs either with literal <p class="foreword">...</p> or with inheritance, or with child selectors, or whatever, in the CSS.

Putting things into the structural markup for visual reasons (pagebreak, etc.) is probably too much work.

Sigil lets you use a marker for import from HTML:

Code:

<hr class="sigilChapterBreak" />

It breaks up the html, then removes the marker. It's excellent if you're working towards using Sigil. (Much like its metadata import -- excellent for creating basic HTML documents as an archive format.) But working toward Sigil means abandoning your archival format at some point. I don't mean to imply that you are doing this -- I merely mean that form follows goal -- and that it's always easy to fall into the fingerpainting of display in the least likely places.

If you want to do something like that, sure, what you suggest will work. I think, though, that you'd ultimately be on a better course to develop a pure and strict format for the markup, and put most of your efforts into the converters. When your converter sees a <div class="foreword">...</div> then it generates a pagebreak for the target format without needing anyone to explicitly code it.
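
As a minimal sketch of that converter step, assuming an ePub-style target where a CSS page break is acceptable: the set of class names and the function add_page_breaks are hypothetical, and the code uses the standard library's xml.etree.ElementTree on un-namespaced markup. Other targets might need a different strategy, such as splitting the document into separate files.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of structural classes that should start on a new page.
PAGE_STARTING_CLASSES = {"foreword", "preface", "appendix"}

def add_page_breaks(root):
    """Give structural divs a page break; return how many were marked."""
    marked = 0
    for div in root.iter("div"):
        classes = div.get("class", "").split()
        if PAGE_STARTING_CLASSES.intersection(classes):
            # The break is generated here, in the converter, so the
            # source markup stays purely structural.
            div.set("style", "page-break-before: always;")
            marked += 1
    return marked

source = ET.fromstring(
    '<body><div class="foreword"><p>Hello.</p></div>'
    '<div class="note"><p>Aside.</p></div></body>'
)
count = add_page_breaks(source)
output = ET.tostring(source, encoding="unicode")
print(output)
```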

Perhaps that's what you mean to do with YAML?

Have you looked at FB2? It'd be a good start from which to develop a book schema for XHTML. If I get some time, I intend to do so myself.

Related to this, and to other discussion I've recently read on MR -- there is no reason to be chintzy with the names of classes, ids, styles, etc. Call things "ChapterName", "ChapterNumber", "letter-signature" or the like. CSS should be as readable and informative as possible -- names like "ChNm" or "ls" are pretty difficult for anyone who comes after.

I think that one of the goals of proper book coding would be to make it easy for someone to crack it open and understand what's going on. One can do everything with cascading styles, such that you'd have to use tools like Firebug to trace inheritance. But why obfuscate it? Make it as easy and explicit as possible for the next gal.

I may have drifted from topic, or conflated several ideas here.

The community should develop a "MobileRead" house style in XML or XHTML.

With that final off-topic blurt,

Aloha.

Hitch · 01-12-2014, 03:55 PM

Quote:

Originally Posted by capidamonte

Steve,

<snip all>

With that final off-topic blurt,

Aloha.

Steve:

If you plan to export to MOBI, them there "pagebreaks before" don't work, FWIW. HTH.

Cap:

Vis a vis that last Zip you sent: need WE, in 1400, also. At your earliest, if you could?

Thx.

Hitch

sgtrock · 03-30-2014, 05:44 PM

Quote:

Originally Posted by stevelitt

Hi capidamonte,

I should have this program written, tested, documented, Expat (free software) licensed, and available for download on Troubleshooters.Com within a week. I'll let you guys know when it's done.

And THANK YOU, all of you, for all the help.

SteveT

Steve;

Now that the Sigil development team has faded away completely, it looks like 0.7.4 will be the last version (at least for quite some time). I'm looking for a toolchain to replace it. This thread has me hoping that maybe your program will at least partially meet my needs. Did you ever get around to posting it for download?

TIA,

sgtrock

Doonge · 03-31-2014, 11:47 AM

About the class names and HTML structure, here are some good starting guidelines: http://www.idpf.org/epub/vocab/structure/
https://en.wikipedia.org/wiki/Book_design
I concur with the explicit class names, and one could borrow from those guidelines as well: "heading-label" and "heading-number" are good picks too.

Emphasis structure is not mentioned there, but see for instance http://grammar.ccc.commnet.edu/grammar/italics.htm
http://www.dailywritingtips.com/how-...-your-writing/
...

About the header tags, I'm not quite sure it's a solid idea to assign the "numbers" (h1, h2, ...) to explicit things. After all, why defend semantic and readable markup just to throw the idea out the window when it comes to headings?
I like the idea of using the heading level as a parameter to compile the whole, but I don't think you have to stick to the # of h?. You might want to work with the nesting level instead, especially if you adopt HTML5 markup for the source. I think the headings reflect the structure of the text, and this should be represented by the structure of the HTML (the nesting). I might be wrong, but nesting should do the trick; you don't need differentiation between h1-h6.
http://www.w3schools.com/html/html5_...c_elements.asp
http://coding.smashingmagazine.com/2...e-of-sections/
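
A small sketch of the nesting idea, under stated assumptions: the source uses HTML5-style <section> elements, each section's title is a direct child <h1>, and tags are un-namespaced. The function name outline is hypothetical and the parser is the standard library's xml.etree.ElementTree. The depth comes only from how deeply the sections are nested, never from the heading number.

```python
import xml.etree.ElementTree as ET

def outline(element, depth=1):
    """Yield (depth, title) pairs, taking depth from <section> nesting."""
    for child in element:
        if child.tag == "section":
            heading = child.find("h1")  # direct child only
            if heading is not None:
                title = "".join(heading.itertext()).strip()
            else:
                title = "(untitled)"
            yield depth, title
            yield from outline(child, depth + 1)
        else:
            # Non-section wrappers do not add a level.
            yield from outline(child, depth)

SOURCE = """<body>
<section><h1>Part One</h1>
  <section><h1>Chapter 1</h1><p>Text</p>
    <section><h1>A subsection</h1></section>
  </section>
  <section><h1>Chapter 2</h1></section>
</section>
</body>"""

levels = list(outline(ET.fromstring(SOURCE)))
for depth, title in levels:
    print("  " * (depth - 1) + title)
```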

radius · 03-31-2014, 05:20 PM

Quote:

Originally Posted by sgtrock

Now that the Sigil development team has faded away completely, it looks like 0.7.4 will be the last version (at least for quite some time). I'm looking for a toolchain to replace it.

Hi sarge, what do you need that Sigil isn't giving you right now? Just curious since it does everything I personally need.

theducks · 03-31-2014, 06:26 PM

Quote:

Originally Posted by radius

Hi sarge, what do you need that Sigil isn't giving you right now? Just curious since it does everything I personally need.

Sigil will continue to function on all EXISTING OS versions.

Given the support lifespan of XP

Hmmm! Vista, W7, now W8, and XP is still king at many BIG businesses.

That is quite a Legacy for XP. Unless I need to change hardware (because magic smoke escaped), I expect my systems to probably outlast myself.

WillAdams · 03-31-2014, 08:55 PM

Here:

http://oreilly.com/openbook/utp/

exaltedwombat · 04-01-2014, 07:39 AM

Quote:

Originally Posted by sgtrock

Steve;

Now that the Sigil development team has faded away completely, it looks like 0.7.4 will be the last version (at least for quite some time). I'm looking for a toolchain to replace it. This thread has me hoping that maybe your program will at least partially meet my needs.

We're all constantly looking for better tools of course. But surely, all you're interested in NOW is a program that NOW does a better job than the latest version of Sigil. It would be mad to jump ship merely because another program shows potential. Sigil isn't suddenly going to stop working.

sgtrock · 04-04-2014, 10:27 AM

Quote:

Originally Posted by radius

Hi sarge, what do you need that Sigil isn't giving you right now? Just curious since it does everything I personally need.

The big thing for me is the lack of EPUB 3 support. While that's not a huge deal at the moment because so few readers support it right now, it will become a bigger deal as time goes on. With development in Sigil stalled and probably stopped completely, I think it's time to begin looking at alternatives under active development in the hope that we can find one or more that will be supporting version 3 in the future.

sgtrock · 04-04-2014, 10:31 AM

Quote:

Originally Posted by WillAdams

Here:

http://oreilly.com/openbook/utp/

You do realize that most universities abandoned troff and groff for TeX/LaTeX a couple of decades ago, right?
