#721 |
creator of calibre
Posts: 45,359
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
html2lrf does read metadata from HTML files. IIRC the current code is optimized to recognize the metadata generated by the ereader2html script.
|
#722 | |||
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
Quote:
Still, the specification seems to have a whole lot of content which is not relevant to html2lrf, and others that seem to be missing (e.g. author-sort). Can anyone (kovidgoyal?) shed light on what is used and what is not? Quote:
Quote:
Anyway, do you think that maybe specific html2lrf metadata might be useful? Metadata in the form <meta name="lrf-prefix:commandline-parameter-name" value="commandline-parameter-value"> should be easy enough to implement - it would simply reuse the code from commandline parser. |
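To make the proposal above concrete, here is a minimal sketch of how such `<meta>` tags could be turned into command-line style arguments that an existing option parser might then consume. It is not html2lrf code: the `lrf:` prefix, the class name and the `meta_to_argv` helper are all hypothetical, and the `value` attribute simply follows the form suggested in the post.

```python
from html.parser import HTMLParser

PREFIX = "lrf:"  # hypothetical prefix, stands in for "lrf-prefix:" from the post


class MetaOptionCollector(HTMLParser):
    """Collect <meta name="lrf:OPTION" value="VALUE"> tags as --OPTION=VALUE strings."""

    def __init__(self):
        super().__init__()
        self.args = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name") or ""
        if name.startswith(PREFIX):
            option = name[len(PREFIX):]
            self.args.append("--%s=%s" % (option, attrs.get("value") or ""))


def meta_to_argv(html):
    """Return the list of command-line style arguments found in the HTML."""
    collector = MetaOptionCollector()
    collector.feed(html)
    collector.close()
    return collector.args


sample = '<head><meta name="lrf:author-sort" value="PERRY STEVE"></head>'
print(meta_to_argv(sample))
```

The resulting list has the same shape as `sys.argv[1:]`, which is what would let the command-line parsing code be reused unchanged.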
|||
#723 |
creator of calibre
Posts: 45,359
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Here's the current code to extract metadata from HTML files (it looks for metadata in comment sections):
Code:

    def get_metadata(stream):
        src = stream.read()

        # Title
        title = None
        pat = re.compile(r'<!--.*?TITLE=(?P<q>[\'"])(.+)(?P=q).*?-->', re.DOTALL)
        match = pat.search(src)
        if match:
            title = match.group(2)
        else:
            pat = re.compile('<title>([^<>]+?)</title>', re.IGNORECASE)
            match = pat.search(src)
            if match:
                title = match.group(1)

        # Author
        author = None
        pat = re.compile(r'<!--.*?AUTHOR=(?P<q>[\'"])(.+)(?P=q).*?-->', re.DOTALL)
        match = pat.search(src)
        if match:
            author = match.group(2).replace(',', ';')

        mi = MetaInformation(title, [author] if author else None)

        # Publisher
        pat = re.compile(r'<!--.*?PUBLISHER=(?P<q>[\'"])(.+)(?P=q).*?-->', re.DOTALL)
        match = pat.search(src)
        if match:
            mi.publisher = match.group(2)

        # ISBN
        pat = re.compile(r'<!--.*?ISBN=[\'"]([^"\']+)[\'"].*?-->', re.DOTALL)
        match = pat.search(src)
        if match:
            isbn = match.group(1)
            mi.isbn = re.sub(r'[^0-9xX]', '', isbn)

        return mi

You can get a good idea of which OPF metadata calibre supports by using the GUI to save an ebook. The GUI will create an OPF file with entries for all the metadata it knows about.
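For anyone who wants to see what kind of HTML comments that lookup is aimed at, here is a small self-contained sketch. It is not the calibre code: `extract_field` is a hypothetical helper, the sample document and its ISBN are dummy values, and the pattern uses a non-greedy `(.+?)` where the posted code has `(.+)`, so that several fields in one file are each captured on their own.

```python
import re


def extract_field(src, name):
    """Return the quoted value of NAME="..." found inside an HTML comment, or None."""
    pat = re.compile(
        r'<!--.*?%s=(?P<q>[\'"])(.+?)(?P=q).*?-->' % re.escape(name),
        re.DOTALL)
    match = pat.search(src)
    return match.group(2) if match else None


# Dummy document: each field sits in its own HTML comment.
SAMPLE = """<html><head>
<!-- TITLE="The Attack" -->
<!-- AUTHOR="Perry, Steve" -->
<!-- ISBN="0-00-000000-X" -->
</head><body><p>text</p></body></html>"""

title = extract_field(SAMPLE, "TITLE")
author = extract_field(SAMPLE, "AUTHOR")
isbn = extract_field(SAMPLE, "ISBN")
if isbn is not None:
    # Same clean-up step as the posted code: keep only digits and x/X.
    isbn = re.sub(r'[^0-9xX]', '', isbn)

print(title, author, isbn)
```

A field that is absent from the file simply comes back as `None`, which mirrors the `if match:` guards in the posted function.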
#724 | |
Reticulator of Tharn
Posts: 618
Karma: 400000
Join Date: Jan 2007
Location: EST
Device: Sony PRS-505
|
Quote:
#725 |
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
Thanks to both of you.
Another question: I bought a book scanner and have been using it to convert my paper books into ebooks in HTML format (because I consider it the best in terms of current and future functionality). I noticed several strange things when converting them into LRF using HTML2LRF on Windows and using that LRF on my Sony Reader PRS-505. Please note: I do not use the GUI. I convert from the command line and then copy the LRF to the Reader using a file management utility.

1) "author-sort" doesn't seem to have any effect. I use a command line such as
Code:

    --author="Steve Perry" --author-sort="PERRY STEVE"

2) I just can't understand chapter detection and TOC generation: I use the <h2> tag for marking chapters, as in
Code:

    <h2 id="contents">Table of Contents</h2>
    <h2 id="chapter-10">The Attack</h2>

The command line is:
Code:

    --chapter-regex=^

I took it to mean that ANY h[1-6] tag would be considered a new chapter. Curiously enough, in my example above <h2 id="chapter-10"> gets detected as a chapter but <h2 id="contents"> does not. I thought maybe the regexp didn't get used, so as an experiment I renamed that chapter-10 to xxxpter-10, expecting it not to appear in lrf-toc. Strangely enough, it DID get detected. Only that <h2 id="contents"> seems to be ignored.

3) Another problem with chapter detection: I have a book which has 10 chapters and a whole lot of footnotes. I used an <ol> list at the end of the document to store all notes:
Code:

    <ol id="notes">
      <li> <p id="note-1">Footnote 1</p> </li>
      <li> <p id="note-2">Footnote 2</p> </li>
    </ol>

Code:

    --force-page-break-before-tag="h2|p id="

Two strange things happen:
(i) All footnotes get recognized as chapters (!), so I get some 90 chapters instead of 10 in the lrf-toc.
(ii) Despite the force-page-break, there are as many footnotes per page as can fit (!) and still the links work correctly in the LRF (!!!). I'm not complaining about it, this result is actually very useful, but I find it strange that with <h2> chapters I need to keep each at the start of its own page to make it work, while with <li><p> I can have many on the same page and they still work.

Are these expected behaviors due to some property of LRF which I am not familiar with, or are these bugs and I should create a new ticket for them? (In that case, is it possible to send the demo file privately? I do not want to infringe on someone's copyright by posting a book into a public section.)
#726 |
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
Forgot:
4) Paragraphs in <blockquote> have a much larger padding between them than normal paragraphs.
5) Paragraphs in <blockquote> can't be centered using class styles.
#727 | ||||||||
Reticulator of Tharn
Posts: 618
Karma: 400000
Join Date: Jan 2007
Location: EST
Device: Sony PRS-505
|
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Those are known-but-annoying issues with calibre’s ad-hoc CSS parsing and rendering. With the Reader getting EPUB support LRF formatting issues are downgraded a bit, but that one bugs me too and if you open a ticket I’ll see if I can’t at least improve the situation. |
||||||||
#728 | |||||||
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
Quote:
Quote:
Quote:
Quote:
Quote:
So maybe my TOC-creating issues are actually not a result of one specific header not getting detected but a result of NO header getting detected, with the TOC instead being created from the links, i.e. those links that I put in my HTML-TOC:
Code:

    <h2 id="contents">
    <ol>
      <li><a href="#chapter-1">Chapter 1</a></li>
      ...
    </ol>

Quote:

A) I have a chapter in my e-book:
Code:

    <h2 id="chapter-10">Chapter 10</h2>
    <p>Something or whatever.</p>

--force-page-break-before-tag=h2: If the option is used, the LRF-TOC item works as expected. If the option is not used, the LRF-TOC item actually links to one page before the chapter (if the chapter starts at page 123, the link from LRF-TOC takes me to page 122). It is seemingly impossible to get two chapters on one page.

B) I have a footnote in my e-book in the semantically correct form:
Code:

    <ol><li id="note-1"><p>Text for footnote 1</p></li></ol>

C) I have a footnote in the form:
Code:

    <ol><li><p id="note-1">Text for footnote 1</p></li></ol>

I understand now why I never got a page break before the footnote, so that's not an issue anymore.

Quote:

The good news is that I can demonstrate all of these issues with one book :-) I'll see about opening those tickets. Thanks for the answer.
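One way to test the "TOC comes from the links" theory before opening a ticket is to list the in-document links an HTML file actually contains and compare them with the entries that show up in the LRF TOC. The sketch below only does that listing; it makes no claim about how html2lrf builds its TOC, and the class and function names are made up for the example.

```python
from html.parser import HTMLParser


class InternalLinkCollector(HTMLParser):
    """Collect (target id, link text) pairs for every <a href="#..."> in a document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._target = None   # id the currently open link points at, if any
        self._text = []       # text fragments seen inside the open link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith("#"):
                self._target = href[1:]
                self._text = []

    def handle_data(self, data):
        if self._target is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._target is not None:
            self.links.append((self._target, "".join(self._text).strip()))
            self._target = None


def internal_links(html):
    collector = InternalLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


sample = '<ol><li><a href="#chapter-1">Chapter 1</a></li></ol>'
print(internal_links(sample))
```

If the TOC entries on the Reader match this list rather than the set of `<h2>` headings, that would support the idea that no heading was detected as a chapter.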
|||||||
#729 |
creator of calibre
Posts: 45,359
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Note that the LRF format doesn't support inline links, so it's typically a good idea to force either paragraph or page breaks before an inline link.
The handling of blockquote is deliberate. It gives the best results for "typical" usage of <blockquote> |
#730 | |
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
Quote:
Code:

    <h2 id="chapter-2">Something or another</h2>

Code:

    any2lrf.exe --force-page-break-before-tag=h2 demo.htm

I have generated a demo HTML file for the rest of the issues and will create a new ticket shortly.
|
#731 |
creator of calibre
Posts: 45,359
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You need --add-chapters-to-toc
|
#732 | |
Reticulator of Tharn
Posts: 618
Karma: 400000
Join Date: Jan 2007
Location: EST
Device: Sony PRS-505
|
Quote:
Best of luck with your scanning! |
|
#733 |
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
I am afraid I still can't get it to work:
Code:

    ...
    <h2 id="chapter-1">Beginning</h2>
    <p>Something or anything</p>
    ...

Code:

    any2lrf --no-links-in-toc --force-page-break-before-tag="h2" --add-chapters-to-toc book.htm
#734 |
creator of calibre
Posts: 45,359
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Add
Code:

    --chapter-attr h2,id,chapter
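As a rough illustration of what a spec like `h2,id,chapter` could mean, the sketch below assumes the three comma-separated parts are a tag-name pattern, an attribute name and a pattern for the attribute value. That reading, the use of `match` for the tag and `search` for the value, and the helper names are all assumptions for the example, not html2lrf's actual implementation.

```python
import re


def parse_chapter_attr(spec):
    """Split 'tag-pattern,attribute,value-pattern' into its three parts (assumed format)."""
    tag_pat, attr, val_pat = spec.split(",", 2)
    return (re.compile(tag_pat, re.IGNORECASE),
            attr,
            re.compile(val_pat, re.IGNORECASE))


def is_chapter(tag, attrs, spec):
    """Return True if an element with this tag name and attribute dict matches the spec."""
    tag_re, attr, val_re = parse_chapter_attr(spec)
    value = attrs.get(attr)
    if value is None:
        return False
    return bool(tag_re.match(tag)) and bool(val_re.search(value))


SPEC = "h2,id,chapter"
print(is_chapter("h2", {"id": "chapter-1"}, SPEC))
print(is_chapter("h2", {"id": "contents"}, SPEC))
```

Under that reading, `<h2 id="chapter-1">` would count as a chapter start while `<h2 id="contents">` would not, which is the distinction the option is being used for here.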
#735 |
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
|
Tags |
html2lrf, libprs500 |
|