LRF output - Page 49

kovidgoyal · 08-15-2008, 12:25 PM

html2lrf does read metadata from HTML files. IIRC the current code is optimized to recognize the metadata generated by the ereader2html script.

pepak · 08-15-2008, 02:57 PM

Quote:

Originally Posted by llasram

The OPF spec itself includes a fair number of examples.

Thanks. For some reason my searches only returned other OPFs, not this one.

Still, the specification seems to have a whole lot of content which is not relevant to html2lrf, and others that seem to be missing (e.g. author-sort). Can anyone (kovidgoyal?) shed light on what is used and what is not?

Quote:

Originally Posted by kovidgoyal

html2lrf does read metadata from HTML files.

Heh, it was that dismissed ticket of mine about reading metadata from HTML files which caused my mistake.

Quote:

Originally Posted by kovidgoyal

IIRC the current code is optimized to recognize the metadata generated by the ereader2html script.

Is there a newer version of ereader2html than 0.03? That one doesn't seem to save any metadata.

Anyway, do you think that maybe specific html2lrf metadata might be useful? Metadata in the form <meta name="lrf-prefix:commandline-parameter-name" value="commandline-parameter-value"> should be easy enough to implement - it would simply reuse the code from commandline parser.
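
For illustration, the proposal above could look roughly like the sketch below. Everything in it is an assumption: the `lrf:` prefix, the class and function names, and the fallback to the standard `content` attribute are hypothetical, and none of it is html2lrf code.

```python
from html.parser import HTMLParser

PREFIX = "lrf:"  # hypothetical prefix; the post does not fix an actual name


class MetaOptionParser(HTMLParser):
    """Collect <meta name="lrf:OPTION" value="VALUE"> pairs."""

    def __init__(self):
        super().__init__()
        self.options = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name") or ""
        if name.startswith(PREFIX):
            # Accept the proposed "value" attribute, falling back to "content"
            value = attrs.get("value", attrs.get("content"))
            self.options[name[len(PREFIX):]] = value


def options_from_meta(html):
    parser = MetaOptionParser()
    parser.feed(html)
    return parser.options


def as_argv(options):
    """Turn the collected options into command-line style arguments."""
    return ["--%s=%s" % (key, value) for key, value in options.items()]


sample = '''<html><head>
<meta name="lrf:author" value="Steve Perry">
<meta name="lrf:author-sort" value="PERRY STEVE">
<meta name="generator" content="something else">
</head><body></body></html>'''

opts = options_from_meta(sample)
print(as_argv(opts))
```

The resulting list could then be handed to whatever already parses the command line, which is the reuse the post has in mind.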

kovidgoyal · 08-15-2008, 03:19 PM

Here's the current code to extract metadata from HTML files (it looks for metadata in comment sections):

Code:

import re

# MetaInformation is calibre's metadata container (import path assumed here)
from calibre.ebooks.metadata import MetaInformation


def get_metadata(stream):
    src = stream.read()

    # Title: prefer a TITLE="..." comment, fall back to the <title> tag.
    # (.+?) is non-greedy so the value stops at the first closing quote.
    title = None
    pat = re.compile(r'<!--.*?TITLE=(?P<q>[\'"])(.+?)(?P=q).*?-->', re.DOTALL)
    match = pat.search(src)
    if match:
        title = match.group(2)
    else:
        pat = re.compile('<title>([^<>]+?)</title>', re.IGNORECASE)
        match = pat.search(src)
        if match:
            title = match.group(1)

    # Author: commas in the comment value become ';' separators
    author = None
    pat = re.compile(r'<!--.*?AUTHOR=(?P<q>[\'"])(.+?)(?P=q).*?-->', re.DOTALL)
    match = pat.search(src)
    if match:
        author = match.group(2).replace(',', ';')

    mi = MetaInformation(title, [author] if author else None)

    # Publisher
    pat = re.compile(r'<!--.*?PUBLISHER=(?P<q>[\'"])(.+?)(?P=q).*?-->', re.DOTALL)
    match = pat.search(src)
    if match:
        mi.publisher = match.group(2)

    # ISBN: keep only digits and the X check character
    pat = re.compile(r'<!--.*?ISBN=[\'"]([^"\']+)[\'"].*?-->', re.DOTALL)
    match = pat.search(src)
    if match:
        isbn = match.group(1)
        mi.isbn = re.sub(r'[^0-9xX]', '', isbn)

    return mi

I don't think adding support for LRF-specific metadata is worthwhile, but adding support for reading more generic kinds of metadata (basically extending the above code) is easy enough to do.
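
As a rough idea of what "extending the above code" might mean, here is a minimal sketch that factors the comment lookup into one helper and loops over a list of keys. The helper names and the `SERIES` key are made up for the example, and the pattern is a slightly stricter variant of the one in the post (it does not scan past the end of a comment), so treat it as an illustration and not as calibre's implementation.

```python
import re


def comment_field(src, key):
    """Return the quoted value of KEY="..." found inside one HTML comment, or None."""
    pat = re.compile(
        r'<!--(?:(?!-->).)*?\b%s=(?P<q>[\'"])(?P<value>.+?)(?P=q)(?:(?!-->).)*?-->'
        % re.escape(key),
        re.DOTALL)
    match = pat.search(src)
    return match.group('value') if match else None


def generic_metadata(src, keys=('TITLE', 'AUTHOR', 'PUBLISHER', 'ISBN', 'SERIES')):
    """Collect every key that is present; SERIES is a hypothetical extra key."""
    found = {}
    for key in keys:
        value = comment_field(src, key)
        if value is not None:
            found[key.lower()] = value
    return found


sample = '''<!-- TITLE="The Attack" AUTHOR="Perry, Steve" -->
<!-- SERIES='Demo Series' -->
<html><body><p>text</p></body></html>'''

print(generic_metadata(sample))
```

Adding a new field then only means adding a key to the tuple.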

You can get a good idea of what kinds of metadata from OPF calibre supports by using the GUI to save an ebook. The GUI will create an OPF file with entries for all the metadata it knows about.

llasram · 08-15-2008, 03:52 PM

Quote:

Originally Posted by pepak

Still, the specification seems to have a whole lot of content which is not relevant to html2lrf, and others that seem to be missing (e.g. author-sort). Can anyone (kovidgoyal?) shed light on what is used and what is not?

The "author-sort" is taken from the OPF "file-as" attribute on the Dublin Core <creator/> of OPF "role" "aut". Obvious, innit?
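
To make that concrete, here is a small sketch of an OPF 2.0 metadata block carrying the sort name, plus a Python function that reads it back. The sample book and the function name are invented for the example; this only illustrates the attribute layout, not how calibre reads it.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OPF = "http://www.idpf.org/2007/opf"

# Invented sample: only the creator element matters here
SAMPLE_OPF = '''<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:title>Example Book</dc:title>
    <dc:creator opf:role="aut" opf:file-as="Perry, Steve">Steve Perry</dc:creator>
  </metadata>
</package>'''


def author_sort(opf_xml):
    """Return the file-as value of the first creator whose role is 'aut'."""
    root = ET.fromstring(opf_xml)
    for creator in root.iter("{%s}creator" % DC):
        if creator.get("{%s}role" % OPF) == "aut":
            return creator.get("{%s}file-as" % OPF)
    return None


print(author_sort(SAMPLE_OPF))
```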

pepak · 08-16-2008, 04:27 AM

Thanks to both of you.

Another question: I bought a book scanner and have been using it to convert my paper books into ebooks in HTML format (because I consider it the best in regards of current and future functionality). I noticed several strange things when converting them into LRF using HTML2LRF on Windows and using that LRF on my Sony Reader PRS-505. Please note: I do not use GUI - I convert from command line and then copy the LRF to the Reader using file management utility.

1) "author-sort" doesn't seem to have any effect. I use command line such as

Code:

--author="Steve Perry" --author-sort="PERRY STEVE"

but in the books-by-author the book gets sorted among "S", not among "P".

2) I just can't understand chapter detection and TOC generation: I use <h2> tag for marking chapters, as in

Code:

<h2 id="contents">Table of Contents</h2>
<h2 id="chapter-10">The Attack</h2>

(Note: The id="contents" in the example refers to a hand-crafted TOC for the HTML file, which I will call html-toc further on. My problem relates to the TOC as displayed by the reader, which I will call lrf-toc.)

The command line is:

Code:

--chapter-regex=^

(this is real ^; I had to prepend it by another ^ for use in batch files)

I took it to mean that ANY h[1-6] tag would be considered a new chapter. Curiously enough, in my example above <h2 id="chapter-10"> gets detected as a chapter but <h2 id="contents"> does not. I thought maybe the regexp didn't get used so as an experiment, I renamed that chapter-10 to xxxpter-10, expecting it not to appear in lrf-toc. Strangely enough, it DID get detected. Only that <h2 id="contents"> seems to be ignored.

3) Another problem with chapter detection: I have a book which has 10 chapters and a whole lot of footnotes. I used a <ol> list at the end of the document to store all notes:

Code:

<ol id="notes">
  <li>
    <p id="note-1">Footnote 1</p>
  </li>
  <li>
    <p id="note-2">Footnote 2</p>
  </li>
</ol>

and the command line:

Code:

--force-page-break-before-tag="h2|p id="

(because if I don't use page breaks, the links just won't work correctly in LRF; in case you wonder why I used <p id="..." instead of the semantically better <li id="...">, it's because in the latter case the links won't work correctly even with page breaks).

Two strange things happen:
(i) All footnotes get recognized as chapters (!), so I get some 90 chapters instead of 10 in the lrf-toc.
(ii) Despite the force-page-break, there are as many footnotes per page as can fit (!) and still the links work correctly in the LRF (!!!). I don't complain about it, this result is actually very useful, but I find it strange that with <h2> chapters I need to keep each at the start of its own page to make it work but with <li><p> I can have many on the same page and still they work.

Are these expected behaviors due to some property of LRF which I am not familiar with or are these bugs and I should create a new ticket for them? (In that case, is it possible to send the demo file privately? I do not want to infringe on someone's copyright by posting a book into a public section)

pepak · 08-16-2008, 04:50 AM

Forgot:

4) Paragraphs in <blockquote> have a much larger padding between them than normal paragraphs.

5) Paragraphs in <blockquote> can't be centered using class styles.

llasram · 08-16-2008, 09:15 AM

Quote:

Originally Posted by pepak

Another question: I bought a book scanner and have been using it to convert my paper books into ebooks in HTML format (because I consider it the best in regards of current and future functionality).

HTML – excellent choice. I would actually recommend going the extra mile and saving your books as full EPUB books. Even if you don’t like the way ADE on the Reader renders EPUB, the additional metadata, external TOC, etc. in EPUB is an arguably better-in-the-first-place work-around for some of the issues below.

Quote:

Originally Posted by pepak

1) "author-sort" doesn't seem to have any effect. I use command line such as

Code:

--author="Steve Perry" --author-sort="PERRY STEVE"

but in the books-by-author the book gets sorted among "S", not among "P".

Hmm. That’s weird. I just tested with my firmware-updated 505 and it totally ignores the ‘Author.reading’ metadata. I vaguely remember it working before I updated the firmware, but I use the “Sort by Author” view so infrequently that I can’t be sure. This one looks like an upstream problem with Sony, although if purchased DRMed BBeB books sort correctly it may mean a community miscomprehension of the file format.

Quote:

Originally Posted by pepak

2) I just can't understand chapter detection and TOC generation: I use <h2> tag for marking chapters, as in

Code:

<h2 id="contents">Table of Contents</h2>
<h2 id="chapter-10">The Attack</h2>

(Note: The id="contents" in the example refers to a hand-crafted TOC for the HTML file, which I will call html-toc further on. My problem relates to the TOC as displayed by the reader, which I will call lrf-toc.)

The command line is:

Code:

--chapter-regex=^

(this is real ^; I had to prepend it by another ^ for use in batch files)

I took it to mean that ANY h[1-6] tag would be considered a new chapter. Curiously enough, in my example above <h2 id="chapter-10"> gets detected as a chapter but <h2 id="contents"> does not. I thought maybe the regexp didn't get used so as an experiment, I renamed that chapter-10 to xxxpter-10, expecting it not to appear in lrf-toc. Strangely enough, it DID get detected. Only that <h2 id="contents"> seems to be ignored.

I’m not able to reproduce this one with a minimal example. Could you open a ticket with a file reproducing the error?

Quote:

Originally Posted by pepak

3) Another problem with chapter detection: I have a book which has 10 chapters and a whole lot of footnotes. I used a <ol> list at the end of the document to store all notes:

Code:

<ol id="notes">
  <li>
    <p id="note-1">Footnote 1</p>
  </li>
  <li>
    <p id="note-2">Footnote 2</p>
  </li>
</ol>

and the command line:

Code:

--force-page-break-before-tag="h2|p id="

That regexp is only applied to the tag name, so the ‘p id=’ portion will never match.
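
A quick way to see why, as a sketch: the snippet below tests the pattern from the command line against bare tag names. How calibre actually applies the regexp (`match` vs. `search`, case handling) is an assumption here; the point is only that a bare name like `p` can never contain `p id=`.

```python
import re

# The value passed to --force-page-break-before-tag in the post
pattern = re.compile("h2|p id=", re.IGNORECASE)


def breaks_before(tag_name):
    # Assumption: the pattern is tested against the tag name alone,
    # so attributes such as id="note-1" are never part of the input.
    return pattern.match(tag_name) is not None


for name in ("h2", "p", "li"):
    print(name, breaks_before(name))
```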

Quote:

Originally Posted by pepak

(because if I don't use page breaks, the links just won't work correctly in LRF; in case you wonder why I used <p id="..." instead of the semantically better <li id="...">, it's because in the latter case the links won't work correctly even with page breaks).

That sounds like a bug. If you can create a fairly minimal file reproducing the error, could you submit a ticket for that one too?

Quote:

Originally Posted by pepak

Two strange things happen:
(i) All footnotes get recognized as chapters (!), so I get some 90 chapters instead of 10 in the lrf-toc.

Default behavior is to add all link-targets to the lrf-toc – see the option ‘--no-links-in-toc’.

Quote:

Originally Posted by pepak

(ii) Despite the force-page-break, there are as many footnotes per page as can fit (!) and still the links work correctly in the LRF (!!!). I don't complain about it, this result is actually very useful, but I find it strange that with <h2> chapters I need to keep each at the start of its own page to make it work but with <li><p> I can have many on the same page and still they work.

If I understand this correctly, there are two issues going on here. First, that calibre’s chapter-detection co-joins “add this to the lrf-toc as a chapter” and “put a page-break at this point.” As an alternative to this, you can create an OPF file specifying an external NCX TOC (or HTML TOC). Calibre will generate an lrf-toc from that without inserting page-breaks. The second issue is the inconsistent way calibre finds link-targets, only paying attention to the ‘id’ attribute on a handful of tags – much obliged if you could open a ticket there too.

Quote:

Originally Posted by pepak

Are these expected behaviors due to some property of LRF which I am not familiar with or are these bugs and I should create a new ticket for them? (In that case, is it possible to send the demo file privately? I do not want to infringe on someone's copyright by posting a book into a public section)

Well ideally for each ticket you would create a minimal HTML input file which re-creates the described error. Failing that, could you (perhaps with a script) replace all the text in your HTML file with “lorem ipsum” text? If not, then... Actually, if you e-mail me the file at llasram@gmail.com I’ll do the “lorem ipsum” replacement and send you back the resulting file for you to directly attach to the ticket(s)
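
In case it helps, the "lorem ipsum" replacement can be sketched in a few lines of Python using only the standard library. This is an untested-on-real-books sketch with invented names. It keeps tags, attributes, comments and whitespace as they are and swaps every word of text, so metadata stored in comments would survive and should be checked by hand.

```python
import re
from html.parser import HTMLParser
from itertools import cycle

LOREM = ("lorem ipsum dolor sit amet consectetur adipiscing elit sed do "
         "eiusmod tempor incididunt ut labore et dolore magna aliqua").split()


class LoremReplacer(HTMLParser):
    """Re-emit the markup unchanged while swapping every word of text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.words = cycle(LOREM)
        self.keep_text = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # raw tag, attributes intact
        self.keep_text = tag in ("script", "style")

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)
        self.keep_text = False

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_data(self, data):
        if self.keep_text:
            self.out.append(data)
        else:
            # Replace each run of non-whitespace, keeping the whitespace layout
            self.out.append(re.sub(r"\S+", lambda m: next(self.words), data))


def anonymize(html):
    parser = LoremReplacer()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)


sample = '<h2 id="chapter-10">The Attack</h2>\n<p class="x">Something or whatever.</p>'
result = anonymize(sample)
print(result)
```

Since ids and tag structure are untouched, links and chapter detection should behave the same as in the original file, which is what a ticket needs.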

Quote:

Originally Posted by pepak

4) Paragraphs in <blockquote> have a much larger padding between them than normal paragraphs.

5) Paragraphs in <blockquote> can't be centered using class styles.

Those are known-but-annoying issues with calibre’s ad-hoc CSS parsing and rendering. With the Reader getting EPUB support LRF formatting issues are downgraded a bit, but that one bugs me too and if you open a ticket I’ll see if I can’t at least improve the situation.

pepak · 08-16-2008, 10:06 AM

Quote:

Originally Posted by llasram

HTML – excellent choice. I would actually recommend going the extra mile and saving your books as full EPUB books.

I chose HTML because it is device-independent and easy to modify (I spend a lot of time formatting and spell-checking my e-books). I am not so sure whether I want to move to a format which is primarily intended for e-books. I'll do some research about EPUB format and see.

Quote:

Hmm. That’s weird. I just tested with my firmware-updated 505 and it totally ignores the ‘Author.reading’ metadata. I vaguely remember it working before I updated the firmware,

Me too. So it is apparently a bug in the firmware. No problem, I was just wondering if it is a bug in Calibre.

Quote:

I’m not able to reproduce this one with a minimal example. Could you open a ticket with a file reproducing the error?

Will open a ticket for all issues. I just wanted to make sure opening ticket is the right thing to do with these questions - my previous tickets were mostly discarded for various reasons.

Quote:

That regexp is only applied to the tag name, so the ‘p id=’ portion will never match.

I see. I think this would be a nice feature. Will create a ticket for it and see what happens.

Quote:

Default behavior is to add all link-targets to the lrf-toc – see the option ‘--no-links-in-toc’.

Thanks.

So maybe my TOC-creating issues are actually not a result of one specific header not getting detected but a result of NO header getting detected and instead creating TOC from the links - those links that I put in my HTML-TOC:

Code:

<h2 id="contents">
<ol>
  <li><a href="#chapter-1">Chapter 1</a></li>
  ...
</ol>

I will check this. It seems plausible to me.

Quote:

If I understand this correctly, there are two issues going on here. [...]

Actually, it was meant as an observation of a strange inconsistency:

A) I have a chapter in my e-book:

Code:

<h2 id="chapter-10">Chapter 10</h2>
<p>Something or whatever.</p>

This chapter appears in the LRF-TOC (either due to chapter detection or due to links being added, see above). But the behavior in the Reader differs depending on
--force-page-break-before-tag=h2: If the option is used, LRF-TOC item works as expected. If the option is not used, LRF-TOC item actually links to one page before the chapter (if the chapter starts at page 123, link from LRF-TOC takes me to page 122). It is seemingly impossible to get two chapters on one page.

B) I have a footnote in my e-book in the semantically correct form:

Code:

<ol><li id="note-1"><p>Text for footnote 1</p></li></ol>

I couldn't get the links in the text to work correctly at all, no matter what options I tried. Maybe it got fixed in the newer versions of Calibre - when I found the workaround, I never bothered to try again.

C) I have a footnote in the form:

Code:

<ol><li><p id="note-1">Text for footnote 1</p></li></ol>

Multiple footnotes can be on one page and all links to them function correctly (take me to the page with the footnote) - compare it to A) where I need to put each chapter on a separate page.

I understand now why I never got a page break before the footnote, so that's not an issue anymore.

Quote:

Well ideally for each ticket you would create a minimal HTML input file which re-creates the described error.

The problem with this approach is that with earlier releases of Calibre I found that some of the errors only appear with the full book, not with a minimal example. I will try to do it, but I am afraid I might need to upload the whole book - maybe even with the original texts, to be certain.

The good news is that I can demonstrate all of these issues with one book :-)

I'll see about opening those tickets. Thanks for the answer.

kovidgoyal · 08-16-2008, 01:06 PM

Note that the LRF format doesn't support inline links, so it's typically a good idea to force either paragraph or page breaks before an inline link.

The handling of blockquote is deliberate. It gives the best results for "typical" usage of <blockquote>

pepak · 08-17-2008, 08:48 AM

Quote:

Originally Posted by pepak

So maybe my TOC-creating issues are actually not a result of one specific header not getting detected but a result of NO header getting detected and instead creating TOC from the links.

Indeed. When I disabled table generation from links (--no-links-in-toc), no items appeared in LRF-TOC. The problem is that I just can't seem to generate ANY LRF-TOC when --no-links-in-toc is used.

Code:

<h2 id="chapter-2">Something or another</h2>

Code:

any2lrf.exe --force-page-break-before-tag=h2 demo.htm

I guess this is not worthy of a ticket - it's probably not a bug, just my misunderstanding of how chapter detection works. I would appreciate some pointers.

I have generated a demo HTML file for the rest of the issues and will create a new ticket shortly.

kovidgoyal · 08-17-2008, 11:37 AM

You need --add-chapters-to-toc

llasram · 08-18-2008, 03:35 PM

Quote:

Originally Posted by pepak

I chose HTML because it is device-independent and easy to modify (I spend a lot of time formatting and spell-checking my e-books). I am not so sure whether I want to move to a format which is primarily intended for e-books. I'll do some research about EPUB format and see.

EPUB is basically just XHTML with separate XML metadata (OPF for metadata & multi-file content ordering and NCX for table of contents) all bundled up in a ZIP file. It lets you use HTML for the content without sacrificing consistent metadata while still bundling everything nicely into one file.
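
To show how little wrapping is involved, here is a sketch of the container layout using Python's `zipfile`. The function name and the file names `content.opf`, `toc.ncx` and `book.xhtml` are placeholders, and their contents are stubs, so the result demonstrates the packaging only and is not a valid book.

```python
import io
import zipfile

CONTAINER_XML = '''<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>'''


def build_epub(files):
    """files maps archive names (OPF, NCX, XHTML, ...) to their text content."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry comes first and is stored uncompressed
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        # container.xml tells the reader where the OPF file lives
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        for name, text in files.items():
            zf.writestr(name, text)
    return buf.getvalue()


data = build_epub({
    "content.opf": "<!-- stub: OPF metadata, manifest and spine go here -->",
    "toc.ncx": "<!-- stub: NCX table of contents goes here -->",
    "book.xhtml": "<!-- stub: the XHTML content goes here -->",
})

with zipfile.ZipFile(io.BytesIO(data)) as zf:
    names = zf.namelist()
print(names)
```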

Best of luck with your scanning!

pepak · 08-23-2008, 04:39 AM

Quote:

Originally Posted by kovidgoyal

You need --add-chapters-to-toc

I am afraid I still can't get it to work:

Code:

...
<h2 id="chapter-1">Beginning</h2>
<p>Something or anything</p>
...

Code:

any2lrf --no-links-in-toc --force-page-break-before-tag="h2" --add-chapters-to-toc book.htm

No chapters appear in Table of Contents.

kovidgoyal · 08-23-2008, 09:54 AM

Add

Code:

--chapter-attr h2,id,chapter

pepak · 08-24-2008, 10:57 AM

Quote:

Originally Posted by kovidgoyal

Add

Code:

--chapter-attr h2,id,chapter

That lists each chapter twice for some reason.
