#1 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Nationally Adopted XML eBook Format
Some DTDs and an example document of the XML eBook/source format used by the Hungarian Electronic Library (a very large and comprehensive eBook initiative managed by Hungary's National Széchényi Library). 7,000+ free books (a quarter of Project Gutenberg's 28,000) drawn from a pool of primarily Hungarian-language books (which, obviously, is smaller than the pool of English-language books), both classics and contemporary works.
The format is both precise and concise, making it easier to work with (for purposes of conversion and generation) than either plaintext or (colloquial) HTML sources. Comments in the DTD files are in Hungarian, but most of it is pretty self-explanatory, and the example file and the terminology defined by the DTDs are in English. I would, of course, be happy to offer additional explanation or translate anything anyone is curious about.
XML File: http://mek.oszk.hu/03400/03407/03407xml.zip
DTDs:
http://mek.oszk.hu/mekdtd/academic/TEI-MEK-academic.dtd
http://mek.oszk.hu/mekdtd/article/TEI-MEK-article.dtd
http://mek.oszk.hu/mekdtd/drama/TEI-MEK-drama.dtd
http://mek.oszk.hu/mekdtd/mixed/TEI-MEK-mixed.dtd
http://mek.oszk.hu/mekdtd/prose/TEI-MEK-prose.dtd
http://mek.oszk.hu/mekdtd/verse/TEI-MEK-verse.dtd
- Ahi
Last edited by ahi; 06-05-2009 at 11:02 AM. Reason: specified library's size
#2 | |
Wizard
Posts: 2,999
Karma: 300001
Join Date: Jan 2007
Location: Citrus Heights, California
Device: TWO Kindle 2s, one each Bookeen Cybook Gen3, Sony PRS-500, Axim X51V
|
Quote:
Derek |
|
#3 |
Guru
Posts: 644
Karma: 1242364
Join Date: May 2009
Location: The Right Coast
Device: PC (Calibre), Nexus 7 2013 (Moon+ Pro), HTC HD2/Leo (Freda)
|
What makes the efforts of the Hungarians in creating and using this format / schema so great? How does it improve upon any of the existing file standards?
|
#4 | ||
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
Quote:
Not that I know of any tools that use it as a source (though the National Széchényi Library doubtless uses such tools internally, and I do plan on writing one myself as well), but any conversion tool could more easily (and with greater confidence) generate valid output documents from this sort of XML format than it could from plaintext, HTML, or anything in between.
It's a bit like DocBook, I suppose, except that it was designed by a national library and has already been (presumably sufficiently successfully) applied to a very diverse set of 600+ documents, by a single authority, without unnecessary variation in the format or its implementation. (There are obviously lots more DocBook books out there... but my past experience gave me the impression that in many cases it is a format that either limits the fidelity of the content or requires ad-hoc extension.)
I think that's the best I can answer the question... will it do?
- Ahi
||
#5 |
Guru
Posts: 644
Karma: 1242364
Join Date: May 2009
Location: The Right Coast
Device: PC (Calibre), Nexus 7 2013 (Moon+ Pro), HTC HD2/Leo (Freda)
|
In retrospect, I think that first question didn't quite come out the way I meant it. My apologies.
Your comments on the second question answered my poorly worded request though. Since I'm looking to future-proof my library now (before it gets any larger or harder to manage), I'm trying to find good tools and practices to adopt. Having pretty much decided on EPUB, I was confused by the sudden appearance of a new contender.
Do you have a "smallish" ebook that uses this TEI-MEK structure? (Not sure of this format's name.) I'm only looking for a couple of sample pages tied together into a book with various example implementations of the XML scheme. I don't read Hungarian and the DTDs were pretty much over my head. But I might be able to figure things out with an example file.
I probably should be learning XML, but I would rather wait for improved tools that can provide assistance and automation of tasks.
Edit: Or I could just re-read your original post and download the sample book!
Last edited by Sabardeyn; 06-08-2009 at 08:03 PM.
#6 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
If you go here: http://mek.oszk.hu/katalog/index.phtml and select "Formátum" (i.e., "format") in one of the dropdowns, type in "XML", and click the "keres" (i.e., "search") button in the lower right, you will have all 600+ books listed. For some reason a couple of them don't link to their XML versions... most do though, and a few are English books.
- Ahi
|
#7 |
Resident Curmudgeon
Posts: 79,275
Karma: 145488788
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
But what really makes these any better than ePub?
|
#8 |
Sir Penguin of Edinburgh
Posts: 12,375
Karma: 23555235
Join Date: Apr 2007
Location: DC Metro area
Device: Shake a stick plus 1
|
Epub isn't XML; it has XML as one of its parts. But it also has HTML tags. A strict XML file shouldn't have anything other than XML tags (I think)*.
* Could someone correct me if I'm wrong?
Last edited by Nate the great; 06-08-2009 at 09:54 PM.
#9 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
You said it, I didn't.
However, since you did: the fact that it's semantic mark-up, instead of hodge-podge HTML aimed at getting finicky display software to cooperate, makes it better... at preserving semantic and formatting information in a given text. In fact, in another thread, I have explicitly stated that ePub is probably the most ideal format for archiving by the non-professional-librarian, given all considerations.
Quote:
As a result, despite the existence of the heavenly ideal of sensibly formatted and standard-compliant HTML markup, the average document is going to be somewhat to moderately messy, and difficult or (across multiple documents created by different authors) impossible to automatically parse/convert in a definitively correct way (unlike a well-defined semantic XML format like this).
- Ahi
P.S.: Any suggestions, JSWolf, on how to improve my ugly duckling of an ePub file of The Art of War (posted in another thread)? I find the lack of interest therein, particularly in light of the rather greater positive attention the PDFs received, to be odd given the apparent enthusiasm about ePub on this forum. Particularly since I spent more time making the single (and, on my Sony PRS-505, not particularly great looking) ePub than I did producing the 8 considerably more professional-looking custom PDFs.
|
#10 | |
speaking for myself
Posts: 139
Karma: 2166
Join Date: Feb 2008
Location: San Francisco Bay Area
Device: PRS-505
|
Quote:
XHTML attempts to be a semantic language. XHTML has a well-defined DTD and can be validated. Most HTML presentational attributes and tags are either outlawed (e.g. FONT) or deprecated (ALIGN). Instead, CSS stylesheets do the job. In addition, classes can be assigned to add your own semantic structure on top of what XHTML provides.
In reality, a lot of XHTML content is just somewhat tidied-up presentational HTML, and that looks really ugly at the source level. It is still XML, but there is not much semantics in it; it is just styled to look right. I consider this to be bad authoring practice, but that's somewhat a matter of opinion.
There is no such thing as "XML tags". Or rather, no tags are defined by the XML specification itself. XML defines how to build various XML dialects, and the dialect is where tags are defined. XHTML is merely one such dialect, one that tries to make the transition from HTML simple. (SVG and DTBook, also mandated by the EPUB spec, are also XML dialects.)
|
#11 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
You are right, of course, Peter. I generally don't bother to make the finer distinction, but arguably XML is a meta-language for creating XHTML-like languages.
My point above was that while HTML/XHTML must give up some semantic coherence and/or consistency in favour of actually being displayed right across the board, an XML language that is not concerned with display logistics is freer to mark information up concisely and in an eminently parseable way (with the program less likely to get confused about what a given combination of tags actually means).
Am I making sense? (Despite, I suspect, having become somewhat less understandable for the less tech-savvy.)
- Ahi
#12 | |
frumious Bandersnatch
Posts: 7,544
Karma: 19001583
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Quote:
I just realized that XHTML forbids block content inside a <p> element. So you cannot have a <div> inside a <p>. Is this sensible? Maybe, but I wanted to have <div>'s inside <p>'s for cases when a piece of poetry is inserted in a character's speech, and the text resumes after it with no indentation or whatever. I can work around this by having a <p>, then a <div> and then a <p> with no indentation, but the logical structure would be a <div> inside a <p>. By not being limited by XHTML, this issue can be avoided. Is this the kind of thing you mean? |
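A minimal sketch of the restriction described above, assuming namespace-free XHTML fragments and a deliberately partial, illustrative list of block-level tags (`BLOCK_TAGS` and `blocks_inside_paragraphs` are hypothetical names, not part of any library). It also shows that a `<div>` inside a `<p>` is still well-formed XML, so a plain XML parser accepts it; it is only the XHTML DTD that rejects it.

```python
import xml.etree.ElementTree as ET

# Partial, illustrative list of block-level XHTML elements (an assumption,
# not the full set defined by the XHTML DTDs).
BLOCK_TAGS = {"div", "p", "blockquote", "ul", "ol", "table", "pre"}


def blocks_inside_paragraphs(root):
    """Return the tag names of block-level elements nested inside any <p>."""
    found = []
    for p in root.iter("p"):
        for descendant in p.iter():
            if descendant is not p and descendant.tag in BLOCK_TAGS:
                found.append(descendant.tag)
    return found


# The "logical" structure: poetry nested inside the paragraph of speech.
nested = (
    '<body><p>He recited:'
    '<div class="poem">...</div>'
    'and went on.</p></body>'
)

# The workaround: close the paragraph, insert the poem, reopen without indent.
workaround = (
    '<body><p>He recited:</p>'
    '<div class="poem">...</div>'
    '<p class="noindent">and went on.</p></body>'
)

nested_problems = blocks_inside_paragraphs(ET.fromstring(nested))
workaround_problems = blocks_inside_paragraphs(ET.fromstring(workaround))
print(nested_problems, workaround_problems)
```

Both fragments parse without error; only the first one reports a block element inside a paragraph, which is what an XHTML validator would object to.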
|
#13 |
creator of calibre
Posts: 45,224
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
@Jellby: The correct way to do poetry in XHTML would be
Code:
    <div class="poem">
      <div class="stanza">
        <div class="line"> ... </div>
        . . .
      </div>
      . . .
    </div>
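As a rough illustration of the poem / stanza / line nesting suggested above, here is a small sketch that generates that markup from nested lists. The function name `poem_to_xhtml` and the sample lines are made up for the example; only the three class names come from the post.

```python
import xml.etree.ElementTree as ET


def poem_to_xhtml(stanzas):
    """Build <div class="poem"> containing one stanza div per inner list."""
    poem = ET.Element("div", {"class": "poem"})
    for stanza in stanzas:
        stanza_div = ET.SubElement(poem, "div", {"class": "stanza"})
        for line in stanza:
            line_div = ET.SubElement(stanza_div, "div", {"class": "line"})
            line_div.text = line
    return poem


# Placeholder text: two stanzas, three lines in total.
poem = poem_to_xhtml([["first line", "second line"], ["third line"]])
markup = ET.tostring(poem, encoding="unicode")
print(markup)
```

The class names carry the structure here; how the stanzas and lines are spaced or indented would be left to a CSS stylesheet.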
#14 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
Code:
    <lg type="verse">
      <l>Mivel malommester alsóknak hadnagyok,</l>
      <l>Szárnyas madaraknak az sólyom az urok,</l>
      <l>A fenevadaknak oroszlány királyok,</l>
      <l>Az fáknak fenyőfa légyen az császárok.<ref target="note5221-1" type="footnote">[<hi vertical-align="top">96</hi>]</ref><note id="note5221-1" type="footnote">96. RMKT IV (Budapest, 1967), p. 439.</note></l>
    </lg>
A program written to parse this XML format should, on encountering the above code, be unambiguously clear as to how to interpret it. On the other hand, your intuitive way of encoding poetry in XHTML is certain not to precisely match the way 100 other random people do so. Assuredly there would be overlap... but surely there would easily be a dozen or two different ways, all nominally correct, but none definitively correct and certain to be used by all "correct" documents.
Does that clarify my meaning?
- Ahi
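To make the "unambiguous to parse" point concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It assumes only the element names visible in the fragment above (`lg`, `l`, `ref`, `note`, `hi`), no XML namespaces, and nothing else about the TEI-MEK DTDs; the helper names `verse_lines` and `footnotes` are hypothetical.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """\
<lg type="verse">
  <l>Mivel malommester alsóknak hadnagyok,</l>
  <l>Szárnyas madaraknak az sólyom az urok,</l>
  <l>A fenevadaknak oroszlány királyok,</l>
  <l>Az fáknak fenyőfa légyen az császárok.<ref target="note5221-1" type="footnote">[<hi vertical-align="top">96</hi>]</ref><note id="note5221-1" type="footnote">96. RMKT IV (Budapest, 1967), p. 439.</note></l>
</lg>
"""


def verse_lines(lg):
    """Return the text of each <l>, leaving out footnote markers and note bodies."""
    lines = []
    for line in lg.findall("l"):
        parts = [line.text or ""]
        for child in line:
            # Keep inline content, but skip <ref> markers and <note> bodies.
            if child.tag not in ("ref", "note"):
                parts.append("".join(child.itertext()))
            parts.append(child.tail or "")
        lines.append("".join(parts).strip())
    return lines


def footnotes(lg):
    """Map each footnote id to its text."""
    return {note.get("id"): "".join(note.itertext()).strip() for note in lg.iter("note")}


lg = ET.fromstring(FRAGMENT)
lines = verse_lines(lg)
notes = footnotes(lg)
print(lg.get("type"), len(lines), notes)
```

Because every verse line is an `<l>` inside an `<lg>`, and every footnote is a `<note>` with an `id`, the code never has to guess from class names or layout which `<div>` is a line and which is a stanza.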
|
#15 | |
Sir Penguin of Edinburgh
Posts: 12,375
Karma: 23555235
Join Date: Apr 2007
Location: DC Metro area
Device: Shake a stick plus 1
|
Quote:
|
|
Thread | Thread Starter | Forum | Replies | Last Post |
calibredb list --output-format=xml no longer supported | Mekk | Library Management | 6 | 06-11-2010 08:13 AM |
Epub revision - alignment with broadly-adopted Web standards | Nate the great | ePub | 4 | 04-08-2010 10:38 PM |
Master Format for multi-format eBook Generation? | cerement | Workshop | 43 | 04-01-2009 12:00 PM |
Ebook Library refusing to load on screen when new media.xml file required | seajewel | Calibre | 0 | 06-29-2008 07:46 PM |
IDPF - New digital book standard released: OEBPS (XML format) & OCF (container) | CommanderROR | News | 13 | 11-04-2006 08:49 AM |