How to mark chapter headings - Page 7

Jabby · 06-10-2012, 08:22 AM

Quote:

Originally Posted by Jellby

I understand your point, Hitch, and I agree that we should have more realistic expectations. But maybe some publishers were also expecting to make easy profit from direct-to-epub conversion.

And maybe they should try a different workflow that allows them to fix blatant mistakes (things like "Erdlcigh" for "Erdleigh", or "‘em" for "’em", or "Tike" for "Like") without having to pay someone else to do it.

Anyway, my experience is quite limited, as I've only bought about a dozen ebooks.

I have purchased considerably more than a dozen books (close to 300) and concur with your assessment.

I suppose Penguin would be considered a BPH. If so, I consider them one of the most egregious in this regard. Whole paragraphs out of order. Hit the convert button and push it out the door. Why worry, there is no refund on an ebook.

My 2¢

Regards - John

Hitch · 06-10-2012, 06:15 PM

Quote:

Originally Posted by Jabby

I have purchased considerably more than a dozen books (close to 300) and concur with your assessment.

I suppose Penguin would be considered a BPH. If so, I consider them one of the most egregious in this regard. Whole paragraphs out of order. Hit the convert button and push it out the door. Why worry, there is no refund on an ebook.

My 2¢

Regards - John

John:

Wherever you're buying, you should stop--if you buy at Amazon or Nook (and, I think, iBooks and Kobo), there damn sure IS a refund. I certainly wouldn't continue to buy Penguin titles if they are that bad--that's wholly unacceptable. This is absolutely not the type of "error" I was talking about--I was discussing the usual typos, etc., not wholesale neglect.

@Jellby--that reminds me, pls. look for a PM from me on an unrelated topic, speaking of your mad skills--but I also agree with you that any publisher that does not require a character-by-character comparison (on from-PDF or from-scan titles), like we do, is not living up to its responsibilities to its readers.

Hitch

JSWolf · 06-10-2012, 10:48 PM

There is no way to do a novel length conversion from PDF without errors. OCR can be better if you correct any issues the OCR flags as it does its thing. But I do agree that you need a full A/B comparison to make sure it's correct.

Hitch · 06-11-2012, 06:27 AM

Quote:

Originally Posted by JSWolf

There is no way to do a novel length conversion from PDF without errors. OCR can be better if you correct any issues the OCR flags as it does its thing. But I do agree that you need a full A/B comparison to make sure it's correct.

Yes. Absolutely right. We hit 99.7%, but that's about as good as it gets. WE correct the OCR issues during the OCR pass, and we do a full A/B comparison afterwards, AND we give the file to the client for proofing. Until Adobe decides to play ball on the HTML-export functions for PDFs (or, hell, even Word, or XML), that's about as good as it will get, IMHO. Someone in another thread, somewhere, claimed that you could get good results using Acrobat Pro X to crop the headers/footers, export, and then do (something--don't recall) with Calibre, but....I'd have to see the materials to be convinced. That's nothing against Calibre; my question is the initial export from Acrobat Pro X. I've NEVER seen clean HTML from Pro X--at least, not HTML that wouldn't take longer to clean up than it takes to go the long route--scan & OCR.

Just my $.02.
Hitch
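
For readers wondering what a character-level A/B comparison can look like in practice, here is a minimal sketch using Python's standard `difflib`. It is not the tooling described in the posts above; the function name `ab_compare` and the sample strings (borrowed from Jellby's "Tike"/"Erdlcigh" examples) are illustrative assumptions only.

```python
import difflib

def ab_compare(source: str, converted: str):
    """Return (similarity ratio, list of differences) between two texts.

    Each difference is a tuple (operation, text_in_source, text_in_converted).
    `ab_compare` is a hypothetical helper name, not part of any tool named in the thread.
    """
    # autojunk=False stops difflib from ignoring very frequent characters
    # in long inputs, which matters when every character counts.
    matcher = difflib.SequenceMatcher(None, source, converted, autojunk=False)
    diffs = [
        (tag, source[i1:i2], converted[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return matcher.ratio(), diffs

ratio, diffs = ab_compare("Like Erdleigh", "Tike Erdlcigh")
for tag, before, after in diffs:
    print(f"{tag}: {before!r} -> {after!r}")
# replace: 'L' -> 'T'
# replace: 'e' -> 'c'
```

`SequenceMatcher` gets slow on very long inputs, so on a book-length file it is probably more practical to run this paragraph by paragraph than on the whole text at once.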

JSWolf · 06-11-2012, 10:09 AM

I have converted some PDFs using Acrobat Pro 8 that turned out not too badly. But of course, there were errors, as nothing can convert without any errors.

Serpentine · 06-11-2012, 12:07 PM

I usually just crop out header/footer/page numbering, throw it over to Sumatra and save it out as plain text.

If it comes out in a reasonable form (i.e., not missing characters), I will progress to merging the paragraph lines, marking up chapters and the rest of it.

If not, trying to clean and fix it won't save you any time: scan.
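
The "merging the paragraph lines" step might be sketched like this in Python. This is only an illustration under a loud assumption: that the plain-text export separates paragraphs with a blank line and hard-wraps the lines inside each paragraph. Exports that lack blank lines between paragraphs need a different heuristic. The name `merge_paragraph_lines` is hypothetical.

```python
import re

def merge_paragraph_lines(text: str) -> str:
    """Join hard-wrapped lines into one line per paragraph.

    Assumes (this is an assumption, not a given) that paragraphs are
    separated by at least one blank line in the exported text.
    """
    # Split on runs of blank (or whitespace-only) lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    merged = [
        " ".join(line.strip() for line in para.splitlines() if line.strip())
        for para in paragraphs
    ]
    return "\n\n".join(merged)

sample = "It was a dark and\nstormy night.\n\nThe rain fell\nin torrents."
print(merge_paragraph_lines(sample))
```

Words hyphenated across a line break are not rejoined here; that would need a separate pass and a manual check.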

Thalia Helikon · 06-11-2012, 02:26 PM

Quote:

Originally Posted by DiapDealer

Of course there's not one single way that will work for all ebooks. No one tried to claim there was. The point is... those with sufficient regex skills and pattern-matching abilities are almost always going to be able to tailor an expression that will be able to isolate (and mark/change/fix) the chapter headers for the current ebook being worked on. Without ever needing Book View or a two-step process.

For a book whose chapter headers are simply Arabic numbers, without the word "chapter"...

What would be the regex for any number with a paragraph break before and after?

Code:

<p> ### </p>

Since I am still learning to crawl, before learning to walk, I search for "1" and then "2" and so on.

Serpentine · 06-11-2012, 04:12 PM

<p[^<>]*>\s*\d+\s*</p>

<p[^<>]*> - p tags, that may have class or other attributes (can confuse with other tags starting with p, but simple enough for books)
\s* - to collect whitespace if there is any present padding the digits
\d+ - collect one or more digit characters
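
As a rough illustration of how that expression could be used outside Sigil's search box, here is a small Python sketch. The capture group around `\d+`, the function name `mark_numeric_chapters` and the choice of `<h2>` as the replacement tag are all assumptions added for the example, not part of the expression as posted.

```python
import re

# The expression from the post above, with a capture group added
# around the digits (the group is an addition for this sketch).
CHAPTER_NUMBER = re.compile(r"<p[^<>]*>\s*(\d+)\s*</p>")

def mark_numeric_chapters(xhtml: str, tag: str = "h2") -> str:
    """Turn paragraphs containing only a number into heading tags."""
    return CHAPTER_NUMBER.sub(
        lambda m: f"<{tag}>{m.group(1)}</{tag}>", xhtml
    )

sample = (
    '<p class="calibre1"> 12 </p>\n'
    "<p>It was 1984 when it began.</p>\n"
    "<p>3</p>"
)
print(mark_numeric_chapters(sample))
# <h2>12</h2>
# <p>It was 1984 when it began.</p>
# <h2>3</h2>
```

Only paragraphs whose entire content is a number are touched; a number inside running text is left alone. Any attributes on the original `<p>` (class, id) are dropped in this sketch, so keep a backup before running something like it on a real book.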

Tkepner · 06-13-2012, 11:37 PM

Hi guys (and gals),

I'm trying to generate eBooks from several sources (WordPerfect, PageMaker, InDesign, etc.). My first stop on the trail is Kompozer, which lets me see the text and the crap code presented by those other programs (and place the illustrations where they belong, fix minor transition errors, set up where I want page breaks, etc.). Then I move them to Calibre where I put in the Meta Data, Cover, and generate the ePub files. My final step is going to be Sigil, and that's where I hit a wall.

I don't always want to use <h1> codes for Table of Contents entries. For example, the copyright page is NOT going to have a big <h1> headline at the top that says "Copyright Page", nor are the Dedication or Acknowledgment pages! Okay, Sigil lets me "Add Semantics," so that should be okay.

HOWEVER,

that's not what I get when I hit the "Generate TOC" button. All the TOC entries created in Calibre disappear (except for the ones that actually use the <h1> and <h2> tags), and the entries I added via the Sigil "Add Semantics" are a no show.

Here is a sample chapter head:
<body class="calibre">
<p class="ChapterNmbr2" id="calibre_pb_5"><span class="calibre14 calibre15 calibre19">-2-</span></p>

<div class="calibre4">
<p class="calibre7"><span class="calibre1 chapter calibre15" id="calibre_toc_5"><a class="calibre20" id="TOC1_2"></a>Endangering the King</span></p>
</div>

class="ChapterNmbr2" ==> is my pagebreak separator in Calibre. The 14, 15, and 19 are center, bold, and font.

it displays like this:

-2-
Endangering the King

Where the page break is above the -2- and the "Endangering the King" is what goes into the TOC as the Chapter Title.

So, how do I preserve the TOC coming from Calibre and get the Add Semantics to actually appear?

Terry Kepner

Doitsu · 06-14-2012, 02:20 AM

Quote:

Originally Posted by Tkepner

[...]entries I added via the Sigil "Add Semantics" are a no show.

Information added via Add Semantics is only used for generating entries in the <guide> section.

Quote:

Originally Posted by Tkepner

I don't always want to use <h1> codes for Table of Contents Entries.

Sigil can only generate TOCs from headers.

If you want to be able to automatically generate TOC entries, you’ll need to use header tags and styles. You could easily simplify your example as follows:

Code:

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title></title>
  <style type="text/css">
/*<![CDATA[*/

  h3 {
        text-align: center; 
        font-weight: bold;
        font-size: 130%;
        page-break-before: always;
  }
  /*]]>*/
  </style>
</head>

<body>
  <h3 id="TOC1_2" title="Endangering the King">-2-<br />
  Endangering the King</h3>
<!-- text of chapter 2 -->
</body>
</html>
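
To make the idea behind the example above concrete, here is a minimal Python sketch of a heading-based TOC pass that uses a heading's `title` attribute as the TOC label when one is present, and the visible heading text otherwise. This is an illustration only, not Sigil's actual implementation; the class `HeadingCollector` and the function `collect_toc_entries` are hypothetical names.

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect (level, id, label) for every h1-h6 element."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.entries = []
        self._current = None  # heading currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            attr = dict(attrs)
            self._current = {
                "level": int(tag[1]),
                "id": attr.get("id"),
                "title": attr.get("title"),
                "text": [],
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"].append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._current is not None:
            cur = self._current
            visible = " ".join("".join(cur["text"]).split())
            # Prefer the title attribute; fall back to the visible text.
            self.entries.append((cur["level"], cur["id"], cur["title"] or visible))
            self._current = None

def collect_toc_entries(xhtml: str):
    parser = HeadingCollector()
    parser.feed(xhtml)
    parser.close()
    return parser.entries

chapter = '<h3 id="TOC1_2" title="Endangering the King">-2-<br />\n  Endangering the King</h3>'
print(collect_toc_entries(chapter))
# [(3, 'TOC1_2', 'Endangering the King')]
```

With the `title` attribute in place, the "-2-" stays on the page but out of the label.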

JSWolf · 06-14-2012, 11:58 AM

WOW! That is some nasty looking code. What is the source that caused that?

DaleDe · 06-14-2012, 12:10 PM

Quote:

Originally Posted by JSWolf

WOW! That is some nasty looking code. What is the source that caused that?

Were you referring to Doitsu's code example? It looks perfectly fine to me, not one bit nasty.

Dale

DiapDealer · 06-14-2012, 12:18 PM

Quote:

Originally Posted by DaleDe

Were you referring to Doitsu's code example? It looks perfectly fine to me, not one bit nasty.

I agree. It certainly looks free of nasty to me.

theducks · 06-14-2012, 01:49 PM

Quote:

Originally Posted by DiapDealer

I agree. It certainly looks free of nasty to me.

+1

A perfectly good example of a 2 line chapter heading where the designer did NOT want the Number to be in the TOC

Tkepner · 06-15-2012, 03:10 AM

The code came from WordPerfect originally and looked somewhat like this:
---------------------------
<p align="center"><span style="font-size: 10pt;"></span><span
style="font-size: 11pt;"></span><span
style="font-size: 10pt;"></span><span
style="font-size: 15pt;"></span><span
style="font-family: Goudy Old Style;"><strong></strong></span><span
style="font-family: Goudy Old Style;">-2-</span></p>
<p><span style="font-family: Goudy Old Style;"></span><span
style="font-size: 10pt;"></span><span
style="font-size: 10pt;"><strong>Endangering the King</strong></span></p>
<br wp="BR1">
<br wp="BR2">
--------------------------------------

Then it went into Tidy, which removed the awful redundancies. Then I moved it into Calibre which gave the code I posted.

Anyway, thanks for the response. Unfortunately it doesn't help. I cannot use <h1> headers on things like a title page, copyright page, acknowledgements page, and so forth--it would make the book look like an amateur did it instead of coming from a professional publications house. (Yeah, having the words "Title Page" above the title of the book would really look stupid; at least with Dedications and Acknowledgements I might be able to get away with using one of the other header tags.)

And while your code looks nice, considering where I am coming from I don't want to become a professional coder for epub files any more than you would want to become a professional graphics person just to combine a simple picture with text. The procedures I am using deliver clean enough code to do the job; I was just hoping there was something I was missing that would let me put those Add Semantics things into the TOC.

It is something they should add as an option in the drop-down box in the Generate window: Include Semantics in TOC.

Until then this is just one more limitation preventing eBooks from replacing real books.

Again, thanks for your help.



