View Full Version : Formatting issue converting eReader PDB e-books
diamante 01-24-2011, 04:20 AM Calibre is awesome. I'm blown away by how it can convert my old e-books and send them to either my Kindle or my Sony Reader, seamlessly handling the necessary conversions.
I've noticed one small issue with eReader PDBs that I've converted. It pertains to extra space that's used between paragraphs to indicate the passage of time or a change in point of view. Apparently how this is displayed depends on the e-reader device and software. I've noticed that even in the old eReader world, these spaces are indicated with three centered asterisks between paragraphs in older versions of eReader for Windows and Palm OS, while the most recent version of eReader for Windows simply puts more space between the paragraphs without displaying asterisks or any other "horizontal rule," if that's the correct term.
In any case, Calibre seems to ignore whatever element is used in the original PDBs when converting them to EPUB and MOBI. Is there any way to make it recognize and preserve the extra spacing, or insert some kind of horizontal rule? I know I'm being demanding here, but since Calibre is already so elegant there must be a way to do this...
user_none 01-24-2011, 06:59 AM Soft scene breaks are something that has come to my attention recently and there is currently no support for them in eReader, zTXT, PalmDoc, or TXT inputs. They were never accounted for because up until last week I have never seen an ebook using them.
diamante 01-25-2011, 04:16 AM Hello! Thank you for the reply. I came across your blog in my search for an answer to this question, and I thought I would try this forum before e-mailing you directly. :-)
So, they are called soft scene breaks. I'd like to put in a respectful request for support for them in eReader input. I hope this only requires a simple tweak, but I'm prepared for less favorable news.
I've noticed these scene breaks for years in eReader e-books but just thought of them as three centered asterisks until I checked yesterday and found that they are handled differently in different eReader versions. (I also found it interesting that older versions of the eReader software show additional TOC elements or levels that aren't supported in the final version of eReader for Windows; these elements were also ignored in my Calibre conversions to EPUB and MOBI.)
In any case, I am very grateful for Calibre's conversion capabilities. I gave it a try on a whim and the results far exceeded my expectations, to put it mildly. Many thanks to you!
user_none 01-25-2011, 06:48 AM I'd like to put in a respectful request for support for them in eReader input. I hope this only requires a simple tweak, but I'm prepared for less favorable news.
No, it's not a simple tweak. It's something that is high on my todo list though. ldolse and I spoke about it not too long ago and it is something he would like to see implemented too. Once I finish with the few things I'm currently working on I plan to bring it up with him again.
...
So, they are called soft scene breaks. I'd like to put in a respectful request for support for them in eReader input. ...
I've noticed these scene breaks for years in eReader e-books but just thought of them as three centered asterisks until I checked yesterday and found that they are handled differently in different eReader versions. ...
I've found both what I think of as "soft" scene breaks (extra blank space) and "hard" scene breaks (extra space with an ornament, three asterisks, or horizontal rule) in books. Usually a book uses one style or the other, but I've seen the occasional book, both printed and ebook, that use both, one indicating a small jump and the other a larger one.
When massaging ebook files I habitually replace the old text convention (***) with a horizontal rule (25% width is my preference) and make sure that there is a non-breaking space in the "soft" scene breaks if they are done with simple paragraph tags instead of a CSS style (I replace any <p></p> pairs with <p>&nbsp;</p>). I haven't encountered it myself, but I've read that some ereader software ignores empty paragraphs and therefore doesn't display the extra blank space when a simple <p></p> pair is used.
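The cleanup described in this post can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `normalize_scene_breaks` and both regular expressions are hypothetical, it is not what calibre does internally, and it only handles the simple case where a break sits alone in its own paragraph.

```python
import re

# Hypothetical patterns for illustration only.
# A paragraph containing nothing but three asterisks (optionally spaced).
HARD_BREAK = re.compile(r"<p(?:\s[^>]*)?>\s*\*\s*\*\s*\*\s*</p>", re.IGNORECASE)
# A completely empty paragraph, keeping any attributes in group 1.
EMPTY_PARA = re.compile(r"<p((?:\s[^>]*)?)>\s*</p>", re.IGNORECASE)


def normalize_scene_breaks(html: str) -> str:
    """Replace '***' paragraphs with a 25% rule and pad empty paragraphs."""
    # "Hard" break: the old text convention becomes a short horizontal rule.
    html = HARD_BREAK.sub('<hr style="width: 25%" />', html)
    # "Soft" break: add a non-breaking space so readers do not drop the paragraph.
    html = EMPTY_PARA.sub(r"<p\1>&nbsp;</p>", html)
    return html


sample = "<p>One.</p><p>***</p><p>Two.</p><p></p><p>Three.</p>"
print(normalize_scene_breaks(sample))
```

Soft breaks done with a CSS class rather than an empty paragraph are left untouched by this sketch.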
ldolse 01-25-2011, 09:10 AM I've found both what I think of as "soft" scene breaks (extra blank space) and "hard" scene breaks (extra space with an ornament, three asterisks, or horizontal rule) in books. Usually a book uses one style or the other, but I've seen the occasional book, both printed and ebook, that use both, one indicating a small jump and the other a larger one.
When massaging ebook files I habitually replace the old text convention (***) with a horizontal rule (25% width is my preference) and make sure that there is a non-breaking space in the "soft" scene breaks if they are done with simple paragraph tags instead of a CSS style (I replace any <p></p> pairs with <p>&nbsp;</p>). I haven't encountered it myself, but I've read that some ereader software ignores empty paragraphs and therefore doesn't display the extra blank space when a simple <p></p> pair is used.
The format scene breaks option under heuristics does things along these lines, although scene break detection currently only works in a couple of special cases and really needs some more work put into it. One of the things that I'm starting to dislike about 'soft' breaks is that they really don't work with ebooks. With a printed book the publisher will always make sure a soft break winds up in the middle of the page so it's obvious to the reader. With reflowable books, more often than not a soft break will wind up on a page break, and then the user won't even realize it was supposed to be a soft break. So the format scene breaks option simply replaces all scene breaks with horizontal rules.
diamante 01-26-2011, 03:25 AM user_none, I'm glad to hear that this is high on your to-do list!
Out of curiosity, I just used eReader eBook Studio to take a look at the PML generated during the conversion process for an e-book that I'm currently reading. The soft scene breaks are of course invisible on my Sony Reader and Kindle, but in eBook Studio they are plainly visible as two blank lines, i.e. three hard returns between paragraphs (there is only one hard return separating most other paragraphs). Also, the paragraph immediately preceding each soft scene break is centered for some reason, with this attribute extending through the first of the two blank lines of each soft scene break. You mentioned that soft scene breaks are currently not supported for eReader input; pardon my ignorance, but does this mean that the PML file is generated at a stage prior to eReader input? Also, if I add an eReader PDB to Calibre and convert it first to EPUB and then to MOBI and possibly other formats, does Calibre always go back to the PMLZ to do each conversion? By the way, can you recommend a better application than eBook Studio for editing PML files?
ldolse, the earlier versions of eReader for Windows, as well as the Palm OS versions, displayed the soft scene breaks I describe above as three asterisks, probably because the extra blank space would often go unnoticed on small screens. The eReader developers apparently agreed with you on soft scene breaks for e-books and decided to "harden" them for the devices in use back then. On the Kindle and other newer devices, though, I think soft scene breaks can work as well as they do in printed books. In many printed books as well as e-books there are added cues to signal a soft scene break. Sometimes a paragraph immediately following a soft scene break will be unindented while most or all other paragraphs are indented. I just checked one Kindle e-book against a printed version and found that both use this method (extra blank space followed by a non-indented paragraph). I have also noticed in printed books that soft scene breaks that occur between pages are sometimes very easy to miss unless you're aware of the change in indentation, so it seems to me that the issue really isn't that different for e-books, except of course that e-books are reflowable and therefore a specific break may be more or less conspicuous depending on the device, font size settings, etc.
user_none 01-26-2011, 08:11 AM Out of curiosity, I just used eReader eBook Studio to take a look at the PML generated during the conversion process for an e-book that I'm currently reading. ... in eBook Studio they are plainly visible as two blank lines, ... You mentioned that soft scene breaks are currently not supported for eReader input; pardon my ignorance, but does this mean that the PML file is generated at a stage prior to eReader input?
Conversion is a three stage process: Input -> OEB -> Output. The input format is read and converted to OEB (what is inside an EPUB file and consists of XHTML, CSS and some control files). Depending on a few factors (such as having heuristic processing enabled) the OEB is then further manipulated. The OEB is then given to the output generator and it transforms the OEB into the output format.
When I said that soft scene breaks are not supported in PDB (eReader) input I mean that literally. The PML is extracted from the PDB file and then the attributes are read and transformed into XHTML equivalents. This is harder than it sounds especially because PML is a pseudo fixed layout format. Soft scene breaks at this point are just ignored. They are not transformed into any XHTML or retained in any way.
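The three stages described above can be pictured with a toy pipeline. This is a sketch only: `convert`, the `Oeb` alias, and the callables passed to it are hypothetical names for illustration and do not correspond to calibre's real plugin API. Here the "OEB" is reduced to a list of XHTML strings, whereas real OEB also carries CSS and control files.

```python
from typing import Callable, List

# Toy stand-in for the intermediate OEB representation (assumption).
Oeb = List[str]


def convert(source: str,
            read_input: Callable[[str], Oeb],
            transforms: List[Callable[[Oeb], Oeb]],
            write_output: Callable[[Oeb], str]) -> str:
    """Run the Input -> OEB -> Output stages in order."""
    oeb = read_input(source)        # stage 1: input format is read into OEB
    for transform in transforms:    # stage 2: optional processing, e.g. heuristics
        oeb = transform(oeb)
    return write_output(oeb)        # stage 3: OEB is turned into the output format


# Toy usage: blank-line separated text in, concatenated XHTML out.
result = convert(
    "first\n\nsecond",
    lambda text: [f"<p>{part}</p>" for part in text.split("\n\n")],
    [],
    "".join,
)
print(result)
```

The point of the shape is that anything the input stage drops, such as the soft scene breaks discussed here, never reaches the later stages, so it has to be fixed in the input reader.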
Also, if I add an eReader PDB to Calibre and convert it first to EPUB and then to MOBI and possibly other formats, does Calibre always go back to the PMLZ to do each conversion?
In the upper left of the conversion screen you can select which format you want to use for the source format.
By the way, can you recommend a better application than eBook Studio for editing PML files?
Nope. Other than using a plain text editor, eBook Studio is the only dedicated PML editor I know of. PDB (eReader) and PML are dying formats and are quickly being supplanted by EPUB.
... the earlier versions of eReader for Windows, as well as the Palm OS versions, displayed the soft scene breaks I describe above as three asterisks...
What ldolse and I are planning is to retain soft scene breaks as soft scene breaks but have a heuristic option (I think he's already added it) that will transform them into hard scene breaks.
KevinH 01-26-2011, 01:36 PM Hi,
Might I ask, exactly how is a "soft scene break" done in the pml? Is there a specific tag for it? Is it simply double-linebreaks? I have never seen a pdb book with such a beastie?
user_none 01-26-2011, 04:44 PM Might I ask, exactly how is a "soft scene break" done in the pml? Is there a specific tag for it? Is it simply double-linebreaks? I have never seen a pdb book with such a beastie?
Multiple line breaks between paragraphs. I had not seen a book (eReader or otherwise) that uses them until very recently. It seems to be a growing typesetting trend with some publishers.
I purchased Deadhouse Gates (http://search.barnesandnoble.com/books/product.aspx?ean=9781429926492) by Steven Erikson today from B&N. Turns out they're selling it as an eReader file and it's littered with soft scene breaks.
I've gone ahead and made changes to calibre's PML input to account for them. Basically, if there are 3+ empty lines it puts an empty paragraph in the resultant HTML. I plan to do more work on it with ldolse in the near future to make it more robust. In the meantime the next calibre release will at least keep these books readable.
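The rule described here (a run of 3+ empty lines becomes an empty paragraph) might look roughly like the following. This is a simplified sketch, not calibre's actual PML input code: the function name and the `threshold` parameter are hypothetical, PML tags are not interpreted at all, and the placeholder paragraph uses a non-breaking space as an assumption so that readers which ignore empty paragraphs still show the gap.

```python
from typing import List


def pml_lines_to_paragraphs(text: str, threshold: int = 3) -> List[str]:
    """Wrap non-empty lines in <p> and emit one spacer paragraph for each
    run of `threshold` or more empty lines (treated as a soft scene break)."""
    out: List[str] = []
    empty_run = 0
    for line in text.split("\n"):
        if line.strip() == "":
            empty_run += 1          # count consecutive empty lines
            continue
        if empty_run >= threshold and out:
            out.append("<p>&nbsp;</p>")  # soft scene break placeholder
        empty_run = 0
        out.append(f"<p>{line.strip()}</p>")
    return out


sample = "First scene.\n\nStill the first scene.\n\n\n\nSecond scene."
for paragraph in pml_lines_to_paragraphs(sample):
    print(paragraph)
```

Shorter runs of empty lines are treated as ordinary paragraph separators, and leading empty lines at the very start of the text never produce a break.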
diamante 01-26-2011, 05:09 PM user_none, thanks very much for answering my questions so thoroughly.
KevinH, at least in the PMLs generated by Calibre from the PDBs that I have, as viewed in eBook Studio, it is indeed just the two linebreaks between paragraphs. I'm not sure if the centering of the paragraph preceding the two linebreaks has any relevance. Again, when I open the original PDBs in an older version of eReader, these breaks appear as three asterisks. (Haven't you noticed these scene breaks in PDBs? I'm pretty sure they're in the vast majority of the many PDBs I have.) In the final version of eReader, they appear simply as extra space between paragraphs.
I just saw user_none's latest post.
user_none, I read somewhere that if you download a Barnes & Noble e-book using a computer, it's in eReader format. If you use a Nook, it's an EPUB. I'm not sure if this is true, but I thought it was interesting. And wow, you're already making the necessary modifications to support these SSBs!!
diamante 02-01-2011, 04:43 AM In the mean time the next calibre release will at least keep these books readable.
Readable? As far as I'm concerned they are perfect now. THANK YOU! Not only the soft scene breaks but also the TOC issue, fixed!!
I re-converted the e-book I had just finished reading, as a test. The soft scene breaks showed up as three centered asterisks, and the TOC had all the layers that were in the original PDB. I'm new to the Sony Reader, so I had never even seen a layered TOC in anything but the old eReader before. Impressive.
Before realizing that I would need to go back to the original PDBs and not the PMLs, I converted another e-book and noticed that only one item from the whole eReader TOC showed up in the TOC on the Sony Reader. When I reconverted the PDB, though, everything was there, and so were the soft scene breaks. Very, very nice.
ldolse 02-01-2011, 05:40 AM The next release of Calibre should be a bit more robust in this department - I believe user_none is extending the list of input formats which preserve soft scene breaks on input, and heuristics has had some improvements added to attempt to detect the difference between soft scene breaks and vertical whitespace. Things which are detected as actual scene breaks will get styled so that the ereader won't break the page on a soft scene break - one of my pet peeves that I see even with professionally published ebooks. Instead (assuming the reader supports css correctly), you'll always have a couple lines of text above the scene break.
Lastly there will be an option to convert soft scene breaks to 'hard breaks' (as named in this thread) with some ornamentation. Basically either convert 'soft' breaks to hard breaks, or convert vanilla '***' style breaks to something a bit fancier of your choosing.
diamante 02-05-2011, 03:44 AM ldolse, all this sounds great! Thanks for the update. I see the new release came out today; I can't wait to try it.
ChristopherTD 02-07-2011, 03:35 AM Generally my experience has been that soft scene breaks are lost and just the normal paragraph spacing appears. But it has been a long time since I converted my PDB books, so I might wait on 0.7.45 and try some of the new goodness!
Thanks for keeping working on this!
DoctorOhh 02-07-2011, 04:27 AM Generally my experience has been that soft scene breaks are lost and just the normal paragraph spacing appears. But it has been a long time since I converted my PDB books, so I might wait on 0.7.45 and try some of the new goodness!
If I read it right the preliminary new goodness is in 0.7.44
New Features (http://calibre-ebook.com/whats-new)
Heuristics: Improved Scene break detection and add option to control what scene breaks are replaced by.
user_none 02-07-2011, 09:30 AM 0.7.44 should include the PML input changes to retain soft scene breaks. 0.7.45 has some tweaks to make detection more reliable. As always, if there are issues this is a good place to discuss them, and if there is a bug, the bug tracker is the place to report it so it doesn't get lost.
Also, with 0.7.45 the soft scene breaks will be preserved in PML (eReader PDB) output too.
I did notice that the eReader software (I didn't check with the B&N branded one) will condense 2(?) blank lines into one. So a PML document will need at least 3 blank lines for soft scene breaks. As noted earlier, \c \c often seems to be used with soft scene breaks.
diamante 02-09-2011, 03:46 AM Now I've converted a series of eReader PDB e-books in which there are no soft scene breaks at all. In the original e-books, the first paragraph of each chapter is not indented while other paragraphs have first-line indentation. When I convert the PDBs to EPUB format, the first paragraph of each chapter looks fine, but instead of first-line indentation all other paragraphs are fully indented from the left margin.
Is this a bug, or do I just need to change some settings to get the EPUBs to come out like the originals? In addition to only indenting the first line of each paragraph except the first paragraph of each chapter, can I set Calibre not to insert extra space between paragraphs? This is not as important as the indentation issue, of course, but I'd like to see if I can get the converted e-books to look as much like the originals as possible.
user_none 02-09-2011, 06:55 AM When I convert the PDBs to EPUB format, the first paragraph of each chapter looks fine, but instead of first-line indentation all other paragraphs are fully indented from the left margin.
Sounds like a bug but I can't tell without the file. Please open a ticket (http://bugs.calibre-ebook.com/) and attach the file.
My guess is your book is doing something like:
\tO\tnce upon a time...
I've seen books that do this, and while it sometimes works depending on the reader, it's invalid. According to the PML spec, "\t = Indent block. Start at beginning of a line, close with \t at end of a line." With invalid formatting, how it should look is pretty much a guess.
diamante 02-10-2011, 06:15 AM Sounds like a bug but I can't tell without the file. Please open a ticket (http://bugs.calibre-ebook.com/) and attach the file.
The original file is DRM-protected, but here is a sample from the PML, cut and pasted from a text editor (with my comments as the "text"); this e-book is using tab stops instead of first-line indentation:
This is the last sentence of a chapter.
\p
\X1\B\c2
\c\B\X1Here begins the first paragraph of the next chapter. This paragraph is not indented.
\T="10%"Subsequent paragraphs look like this. They are supposed to have first-line indentation (and they appear to in eReader; in eBook Studio they have a 10% tab stop at the beginning of each paragraph, but no indentation) but when converted they are block indented.
\T="10%"New paragraph here, same story for all paragraphs except the first paragraph of each chapter.
user_none 02-10-2011, 09:09 PM I've pushed up changes to the handling of \t and \T tags. There is no way to get them 100% mapped to XHTML unfortunately. \t and \T can do some very strange things and are highly dependent on fixed positions within the viewer screen. So, only certain, common cases are handled.
\t starting and ending a line (or another line) will create a hard margin indent.
\t starting a line and ending anywhere before the end will create a text indent.
\T starting a line will create a text indent.
\t sets and \T inside of a line will be ignored.
\tText ... end of Text\t
will produce
    Text
    ...
    end of text
---
\tText ...
end of Text\t
will produce
    Text
    ...
    end of text
---
\tT\text ... end of Text
will produce
    Text
...
end of text
---
\T="5%"Text ... end of Text
will produce
    Text
...
end of text
---
Text ... \tend\t of\T="5%" Text
will produce
Text
...
end of Text
* I'm using the ... in place of a long string of text to denote how it will appear when wrapped.
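The cases listed above can be expressed as a small classifier. This is a sketch of the rules as stated in this post, not calibre's real implementation: `classify_indent` and its return labels are hypothetical, it looks only at the first closing \t, and it treats an unclosed \t as "none" because that case is not covered above.

```python
import re

# \T="..." tag at the very start of a block (hypothetical pattern).
T_TAG = re.compile(r'\\T="[^"]*"')


def classify_indent(block: str) -> str:
    """Return 'margin', 'text', or 'none' for one PML block of text."""
    block = block.rstrip()
    if T_TAG.match(block):
        return "text"              # \T starting a line: first-line indent
    if block.startswith("\\t"):
        close = block.find("\\t", 2)
        if close == -1:
            return "none"          # unclosed \t: not covered by the rules above
        if close == len(block) - 2:
            return "margin"        # \t closes at the end: hard margin indent
        return "text"              # \t closes earlier: first-line indent
    return "none"                  # \t pairs and \T inside a line are ignored


examples = [
    r"\tText ... end of Text\t",
    r"\tT\text ... end of Text",
    r'\T="5%"Text ... end of Text',
    r'Text ... \tend\t of\T="5%" Text',
]
for example in examples:
    print(classify_indent(example), "<-", example)
```

A converter would then map "margin" to a left margin on the whole block and "text" to a first-line indent in the generated CSS, which is the distinction the original bug report was about.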
diamante 02-11-2011, 01:58 AM That looks great! Thanks very much for addressing the issue, and for the detailed explanation.
bfollowell 02-11-2011, 09:58 AM I've found both what I think of as "soft" scene breaks (extra blank space) and "hard" scene breaks (extra space with an ornament, three asterisks, or horizontal rule) in books. Usually a book uses one style or the other, but I've seen the occasional book, both printed and ebook, that use both, one indicating a small jump and the other a larger one.
With print books at least, it has been my experience that there is rarely any difference between what you're calling a soft break and a hard break other than where they happen to fall on a page. Where a break would fall somewhere within a page, these breaks typically just have a larger space, though some books couple this with no indentation on the first new paragraph following the break. The breaks with the asterisks or some other symbol are almost always used only where the break falls at the very top or bottom of a page. Other than that, there's typically not really any difference in the severity or importance of the break. Far more often than not, it's just a positional/typesetting thing and most books that I read use both.
Now, that's not to say some certain book or publisher may not use this differently and I'm certain I've read the occasional book that does, but I think these are in the minority.
- Byron
user_none 02-11-2011, 11:33 AM Both are scene breaks and serve the same purpose. However, they are indicated using different typographical techniques. We are making the distinction because we have to use different techniques to determine whether we have encountered one or the other when parsing the text.
diamante 02-12-2011, 06:24 PM Once again, problem fixed. I'm loving Calibre and user_none! Thank you thank you thank you.