|  01-18-2014, 11:44 AM | #76 | |
| Wizard            Posts: 1,065 Karma: 858115 Join Date: Jan 2011 Device: Kobo Clara, Kindle Paperwhite 10 | Quote: 
  Yeah, I did know about most html tags needing a closing tag. To my mind self-closing tags mixed with non-self-closing tags in search/replace scenarios are quite confusing. Last edited by unboggling; 01-18-2014 at 12:06 PM. | |
|   |   | 
|  01-18-2014, 12:15 PM | #77 | |
| Well trained by Cats            Posts: 31,241 Karma: 61360164 Join Date: Aug 2009 Location: The Central Coast of California Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A | Quote: 
  Why does it matter to a S&R if they are mixed in? BTW, I have seen a non-pretty version, so here is a S&R regex for a BR. With no class: <br\s*/> Calibre assigns a class (and includes the space): <br class="calibre\d+" /> | |
|   |   | 
|  01-18-2014, 12:18 PM | #78 | |
| Wizard            Posts: 1,065 Karma: 858115 Join Date: Jan 2011 Device: Kobo Clara, Kindle Paperwhite 10 | Quote: 
 Did you mean either of those are the Search regex, depending on whether book was previously converted by calibre or not? What is the Replace regex to end up with </p> closing prior paragraph and <p> opening next paragraph? Or whatever else is the best way to do it? (I'm unskilled with regex too  ) Last edited by unboggling; 01-20-2014 at 05:03 AM. Reason: undeleted a previously deleted previous strike-out edit. (Because it was quoted below.) | |
|   |   | 
|  01-18-2014, 12:51 PM | #79 | |
| Well trained by Cats            Posts: 31,241 Karma: 61360164 Join Date: Aug 2009 Location: The Central Coast of California Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A | Quote: 
 Code: <br(?:\sclass="calibre\d+")*\s*/> | |
|   |   | 
|  01-18-2014, 12:55 PM | #80 | 
| Well trained by Cats            Posts: 31,241 Karma: 61360164 Join Date: Aug 2009 Location: The Central Coast of California Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A | 
			
			UB Code: </p> <p class="current"> | 
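For anyone who wants to try the search and replace from posts #79 and #80 outside Edit Book, here is a minimal Python sketch. The search pattern and the replacement string are copied from the thread; the function name `breaks_to_paragraphs` and the sample text are made up for illustration, and "current" stands for whatever class the surrounding paragraphs actually use.

```python
import re

# Search pattern from post #79: a <br/> with or without a calibre-assigned class.
BR_PATTERN = re.compile(r'<br(?:\sclass="calibre\d+")*\s*/>')

# Replacement from post #80: close the prior paragraph, open the next one.
# "current" is a placeholder for the class the book's paragraphs really use.
REPLACEMENT = '</p> <p class="current">'


def breaks_to_paragraphs(xhtml: str) -> str:
    """Replace every matching break tag with a paragraph boundary."""
    return BR_PATTERN.sub(REPLACEMENT, xhtml)


# Hypothetical sample mixing both break styles mentioned in post #77.
sample = (
    '<p class="current">First line.<br class="calibre3" />'
    'Second line.<br/>Third line.</p>'
)
print(breaks_to_paragraphs(sample))
```

This treats every break as a paragraph end, which is only right for books where the breaks really were used that way. Try it on a copy first.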
|   |   | 
|  01-18-2014, 01:18 PM | #81 | |
| Wizard            Posts: 1,065 Karma: 858115 Join Date: Jan 2011 Device: Kobo Clara, Kindle Paperwhite 10 | Quote: 
  I finally understand. My confusion about that was one of the stumbling blocks to fixing books with tools like Sigil, Edit Book, or a simple text editor. And it was so easy to fix that in an RTF opened in MS Word: in "Advanced Find & Replace", usually just replace ^l with ^p^t (replace linefeed with paragraph plus tab), then save as DOCX. Last edited by unboggling; 02-12-2014 at 11:09 AM. Reason: clarify | |
|   |   | 
|  01-20-2014, 03:57 AM | #82 | |
| Wizard            Posts: 1,065 Karma: 858115 Join Date: Jan 2011 Device: Kobo Clara, Kindle Paperwhite 10 | Quote: 
 I've been looking at some raw book formats: unfixed original downloaded files that I kept separately outside of calibre. Specifically, these were original format files that I copied into calibre way back when, and then fixed the calibre copy with the method RTF -> Word (advanced find & replace) -> DOCX. I added the raw unfixed originals into calibre again as separate duplicate records, converted them to EPUB, and looked at them in Edit Book. So I saw what LadyKate was talking about. For the most part these formats were extravagantly riddled with excess span and font tags. (That was boggling. I didn't even try to fix them in Edit Book, hadn't a clue where to start. There seemed to be more html tags than content text.) So, like LadyKate said, that usage of spans is another common thing, in addition to the break-tag-instead-of-paragraph-tags thing. Because I habitually fixed things in RTF in Word in the past, the specific nature of the html problems had been invisible to me. In the conversion of original to EPUB, calibre had added its own classes to that span mishmash as best it could, which seemed to make the span multitude harder to deal with. But I'm just starting to learn about this stuff on the html side, and don't really know what I'm doing there yet. Meanwhile, I was really looking for an old raw file with a lot of break tags so I could play with theducks' search/replace regex in an html editor or Edit Book. Didn't find any of those, got distracted by the formats with the span problem. Last edited by unboggling; 01-20-2014 at 11:01 AM. Reason: minor clarification. | |
|   |   | 
|  01-25-2014, 10:52 AM | #83 | 
| Fanatic            Posts: 515 Karma: 1470724 Join Date: Jul 2013 Location: Quebec CA Device: android 4 (samsung tablet and asus tablet) | 
			
			Quote (unboggling): "I'm confused. I'm not much good with html and usually don't fix books by messing with html tags, but I would've thought I'd want most of those <br> tags replaced with </p><p>. I thought <br> didn't have a closing tag? Or is <br/> an alternate form of <br>?" First, in XHTML all tags need to be closed, so <br> becomes <br />. Now, the search string I use finds a break <br /> followed by a lowercase letter. That indicates a break that is not a paragraph marker but just someone putting in a hard return because they want the line to "look pretty" lol. I often use regex to search for patterns that indicate the line end is not a paragraph end before I put in the paragraphs. | 
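The "break followed by a lowercase letter" search described in this post could look roughly like the Python sketch below. The post only describes the search; the exact pattern, the choice to replace each hit with a single space, and the name `join_soft_breaks` are assumptions for illustration, not LadyKate's actual expression.

```python
import re

# A break tag (optionally with surrounding whitespace) that is immediately
# followed by a lowercase letter. The lookahead keeps the letter in the text.
# Assumed pattern: the post does not give the exact expression.
SOFT_BREAK = re.compile(r'\s*<br\s*/>\s*(?=[a-z])')


def join_soft_breaks(xhtml: str) -> str:
    """Join lines that were split by a cosmetic break, using one space."""
    return SOFT_BREAK.sub(' ', xhtml)


# Hypothetical sample: the first break is mid-sentence, the second is not.
sample = 'the quick brown fox<br />jumped over.<br />Then it slept.'
print(join_soft_breaks(sample))
```

Running this first leaves only the breaks that are followed by a capital letter or punctuation, and those are the likelier candidates for real paragraph ends. A lowercase first letter is only a heuristic, so a spot check of the results is still worthwhile.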
|   |   | 
|  01-25-2014, 11:05 AM | #84 | |
| Fanatic            Posts: 515 Karma: 1470724 Join Date: Jul 2013 Location: Quebec CA Device: android 4 (samsung tablet and asus tablet) | Quote: 
 I have not seen a word processor since the days of the old DOS versions of WordPerfect that shows the codes inserted to change the look and feel of the document created. Every time you make a change, even if you don't complete it, a code is inserted. You change to italic, change your mind, remove the two characters typed, change the color, etc., and it leaves more font changes, spans, and color changes than text. While you only see the result of all these changes in a WYSIWYG word processor or web page creator, they are only as good as the underlying code. | |
|   |   | 
|  01-25-2014, 11:06 AM | #85 | |
| Well trained by Cats            Posts: 31,241 Karma: 61360164 Join Date: Aug 2009 Location: The Central Coast of California Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A | 
			
			Quote (LadyKate): 
  <br> (like <hr>) assumed closure (there was no </br>), which probably made parsing a headache as HTML evolved. <br /> is proper. | |
|   |   | 
|  01-25-2014, 01:55 PM | #86 | 
| Wizard            Posts: 1,065 Karma: 858115 Join Date: Jan 2011 Device: Kobo Clara, Kindle Paperwhite 10 | 
			
			This recent discussion highlights a fundamental difference in approaches to fixing formatting problems in ebooks. These approaches seem both skill-driven and assumption-driven. One assumption is that fixing formatting problems with regular expressions at code level is the best approach. I've noticed this is an assumption common to programmers, web designers, ebook designers, advanced calibre users. Particularly when technical knowledge and skills (regular expressions, HTML/XHTML, CSS) are at the higher end of the learning curve, it is easy to make the assumption. I guess that demographically this is a small (but vocal) minority of the total calibre user population. A different assumption is that fixing formats above code level in a word processor works well enough. Particularly when technical knowledge and skills (regular expressions, HTML/XHTML, CSS) are at the lower end of the learning curve, it is easy to make this assumption. I made this assumption. I extrapolate that some other calibre users share this assumption, and guess that demographically this is a larger (but quieter) minority of the total calibre user population. Consider those span-riddled original formats I looked at the other day. I had eliminated annoying formatting problems from copies of them a couple years ago with the method: EPUB -> RTF -> fix in Word or Open Office Writer -> DOCX or ODT -> EPUB. About 3 minutes time each. Two years later, having learned a lot since then, looking at the morass of HTML and XHTML and CSS tags in those span-plagued original formats, it seems that in Edit Book now it would take much longer to fix each format at code level, even if I knew how. Same with fixing them at code level outside calibre in a programmer-oriented editor. So the first conversion to RTF blew away the ToC links — so what? — that's quickly fixable in calibre ToC Editor after conversion to EPUB, if not fixed already by the conversion-applied XPath expression. 
So the "fixed" EPUB contains unnecessary tags I didn't see while editing with word processor, and is larger in filesize than if it had been fixed cleanly at code level — so what? — I don't see those unnecessary tags when reading the book, and sufficient cheap storage is available to accommodate larger files. Assumptions aside, approach and method to fix formatting problems depend on need, constrained by knowledge/skill level. From the point of view of an ebook consumer reading for enjoyment, I would ignore the technical aspect of ebooks, except for the need to fix formatting problems that annoy me, the quickest way possible at my current knowledge/skill level. From the point of view of an ebook designer, maybe I would want the underlying code to be clean. But I'm not an ebook designer. I'm an ebook consumer, who likes reading books more than fixing books. Last edited by unboggling; 02-01-2014 at 01:29 AM. Reason: clarify, change to more precise or correct technical terms, fix typos. | 
|   |   | 
|  01-25-2014, 08:37 PM | #87 | 
| Wizard            Posts: 1,065 Karma: 858115 Join Date: Jan 2011 Device: Kobo Clara, Kindle Paperwhite 10 | 
			
			@LadyKate and theducks, thank you for the clarifications.  I'd completely forgotten about HTML vs XHTML.    Last edited by unboggling; 01-25-2014 at 08:42 PM. | 
|   |   | 
|  01-31-2014, 10:08 PM | #88 | |
| Fanatic            Posts: 515 Karma: 1470724 Join Date: Jul 2013 Location: Quebec CA Device: android 4 (samsung tablet and asus tablet) | Quote: 
 I usually do a total cleanup for favorite authors lol but seem to have quite a problem doing a conversion from pdf or whatever without touching on correcting them at least a little. One disappointing thing is that HTML BOOK FIXER while it removes all the spans etc also removes the formatting of <span class="italic"> or bold or whatever lol. Sometimes I wonder how important those italics are versus an unreadable book. | |
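One way around the "removes the italics too" problem is to unwrap only the spans that carry no wanted formatting. The Python sketch below uses the standard library `html.parser` to show the idea. It is not how HTML BOOK FIXER works (its internals are not described in this thread). The class names in `KEEP_CLASSES` are assumptions based on the `<span class="italic">` example above, and the names `SpanCleaner` and `clean_spans` are made up for illustration.

```python
from html.parser import HTMLParser

# Assumed class names worth keeping; adjust to what the book's CSS uses.
KEEP_CLASSES = {"italic", "bold"}


class SpanCleaner(HTMLParser):
    """Drop <span> wrappers unless they carry a class listed in KEEP_CLASSES.

    Sketch only: comments, doctype and processing instructions are not
    copied to the output, so feed it body fragments rather than whole files.
    """

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.span_stack = []  # True for kept spans, False for dropped ones

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = set((dict(attrs).get("class") or "").split())
            keep = bool(classes & KEEP_CLASSES)
            self.span_stack.append(keep)
            if not keep:
                return  # drop the opening tag, keep its contents
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> pass through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Drop </span> only when its matching opening tag was dropped.
        if tag == "span" and self.span_stack and not self.span_stack.pop():
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def clean_spans(xhtml: str) -> str:
    parser = SpanCleaner()
    parser.feed(xhtml)
    parser.close()
    return "".join(parser.out)


# Hypothetical sample: a calibre-style wrapper around an italic span.
sample = (
    '<p><span class="calibre5"><span class="italic">Hello</span>'
    ' world</span><br />next</p>'
)
print(clean_spans(sample))
```

A parser is used here instead of a regex because spans are often nested, and a simple search/replace cannot tell which closing `</span>` belongs to which opening tag.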
|   |   | 
|  02-01-2014, 06:44 AM | #89 | |
| Wizard            Posts: 1,065 Karma: 858115 Join Date: Jan 2011 Device: Kobo Clara, Kindle Paperwhite 10 | Quote: 
 I fix hard page breaks if they affect readability, such as a page break after each ToC item, or a page break between "Chapter n" and an associated chapter title. I get rid of the offending page breaks in RTF or DOCX in Word by replacing "^m" with nothing, and let calibre insert page breaks automatically during conversion later. Is fixing page breaks more complicated in HTML/XHTML? They should be handled in CSS, yes? I have PDFs only if I could find no better format, all nonfiction used infrequently for reference rather than reading start-to-finish. I prefer to be annoyed at their headers/footers than spend time eliminating them and other problems after conversion from PDF, so I don't bother to convert nonfiction PDFs. They are the only exception to my "no more than mildly annoying" rule. I have no fiction PDFs anymore. I gradually replaced them with better formats instead of converting and fixing them, except for a few I fixed that were unavailable in different format. At present PDFs are an infrequent annoyance. I just checked statistics in my library. Less than 0.5% of the book formats are rated "mildly annoying", and 75% of those are advance reader copies with no specific annoying formatting, rated "mildly annoying" on general principle. "Mildly annoying" is the worst rating currently in the library, excluding placeholders with no format. Every other book format is rated "no annoyance". Less than 0.1% of the book formats are PDF; I rate them on relative annoyance of specific formatting problems, with a little slack due to unavailability of better formats, and ignore my strong annoyance at the mere existence of PDFs in my library. All other formats are EPUB. What is mildly annoying to me may not annoy someone else. Or what doesn't annoy me may annoy someone else. 
Or any of various possible formatting quirks or problems may spark (in any of different people) anger, rage, or despair that drives a fix-formatting frenzy or, almost inconceivably, a retreat to paper books. But paper books may have formatting problems too, such as folded mis-cut corners or pages with blurred ink. An alternative is audio books, but what if the narrator conveys inappropriate emotion at inopportune moments, or inadvertently skips or misreads a few important words or sentences, or the volume level fluctuates between barely audible and loud? Apparently there is no reprieve from either (1) suffering negative emotion in response to perceived/judged problems in formatting, or (2) fixing formatting to reduce the frequency and intensity of formatting-instigated negative emotion. Last edited by unboggling; 02-02-2014 at 04:27 PM. Reason: clarify; statistics; page breaks; rambling. | |
|   |   | 
|  02-07-2014, 09:48 PM | #90 | |
| Fanatic            Posts: 515 Karma: 1470724 Join Date: Jul 2013 Location: Quebec CA Device: android 4 (samsung tablet and asus tablet) | Quote: 
 Sorry all, just feeling my age here   | |
|   |   | 
|  | 
| Tags | 
| calibre workflow, ebook management strategy, ebook management workflow | 