Quote:
Originally Posted by patrik
Just for fun I printed the ebook (from Calibre viewer) to pdf, which I would run through Finereader.
The result contained a bunch of blue squares, which Acrobat says are links. This messes up the ocr.
Any ideas to not have that included in the "printout"?
|
Calibre's EPUB reader/previewer is not built for FXL EPUBs.
It supports normal, reflowable EPUBs only.
Quote:
Originally Posted by Hitch
Unhunh. Okay....honestly, try copy-paste. I know, that sounds whacky, but if you copy-paste it to a text editor, you lose the spans.
[...]
My best guesstimate is that Tex may have some thoughts. INDD-generated FXL is a tough cookie.
|
The few rare times I've actually run across Fixed Layout, I've had the actual source files... so I just go back into the INDD/IDML and export as a normal EPUB.
Never had to really work on a book that "required" such fixed layouts (like double-page spread cookbooks, magazines, etc.).
* * *
IF I only had Fixed-Layout EPUB to work from:
I think it might be better to use Calibre to convert to RTF or TXT (Markdown)... some simpler format that doesn't have <span>s and crap, but still supports basic formatting (Bold/Italics/Headings).
To test that, go into Calibre:
1. Right-Click your book > Convert Books > Convert individually.
2. In the upper right dropdown, select Output Format: RTF (or TXT).
2.5. If you chose TXT, on the left-hand side, select TXT output. In General > Formatting, change the dropdown from "plain" to "markdown".
3. Convert.
That should generate a file that's minus a lot of the HTML cruft, but should still carry over italics/bold, etc.
I think that'd be infinitely easier to clean than a FXL-EPUB->PDF/Screenshots->Finereader->EPUB roundtrip.
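If you have more than one book to do, the same conversion can be scripted. Here's a rough sketch using Calibre's ebook-convert command-line tool from Python. The function names and the "book.epub" filename are just mine for illustration, and it assumes ebook-convert is on your PATH:

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(src, fmt="rtf", markdown=True):
    """Build the ebook-convert argument list for an RTF or TXT conversion.

    The output file is written next to the source, with the new extension.
    """
    src = Path(src)
    dst = src.with_suffix("." + fmt)
    cmd = ["ebook-convert", str(src), str(dst)]
    if fmt == "txt" and markdown:
        # CLI counterpart of switching the TXT output formatting
        # from "plain" to "markdown" in the GUI.
        cmd.append("--txt-output-formatting=markdown")
    return cmd


def convert(src, fmt="rtf", markdown=True):
    """Run the conversion; assumes Calibre is installed and on PATH."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH")
    subprocess.run(build_convert_command(src, fmt, markdown), check=True)
```

Calling convert("book.epub", "txt") should then do the same thing as steps 1 to 3 above, in one go.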
Note: This should work well for something like patrik's example, a normal Fiction book.
For very complicated books like comics/children's—with highlighting/read-along/curvy text within images—or heavy Maths/Physics, that method wouldn't work.
Alternate #2: Maybe even a Calibre EPUB->EPUB conversion might be able to condense a lot of that crap down.
But first, I'd run a few regexes to remove the inline ids and positioning styles, like these:
Code:
id="_idTextSpan41408"
top:2589.14px;
left:1107.67px;
letter-spacing:0.73px;
4 Regexes:
- id="_idTextSpan\d+"
- top:-*[\d\.]+px;
- left:-*[\d\.]+px;
- letter-spacing:-*[\d\.]+px;
Then running a Calibre EPUB->EPUB conversion should morph all those thousands of <span>s into a smaller number of <span class="calibre##"> classes.
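If you'd rather not paste the 4 regexes into a search/replace box one at a time, here's a minimal Python sketch that applies them in a row. The patterns are the ones listed above, unchanged. The sample span is hypothetical: I assembled it from the example values above, so the attribute order in a real INDD export may differ.

```python
import re

# The four patterns from the post, verbatim.
PATTERNS = [
    r'id="_idTextSpan\d+"',
    r'top:-*[\d\.]+px;',
    r'left:-*[\d\.]+px;',
    r'letter-spacing:-*[\d\.]+px;',
]


def strip_fxl_positioning(xhtml):
    """Delete every match of each pattern, one pattern after another."""
    for pattern in PATTERNS:
        xhtml = re.sub(pattern, "", xhtml)
    return xhtml


# Hypothetical span built from the sample values quoted above.
sample = ('<span id="_idTextSpan41408" class="CharOverride-3" '
          'style="position:absolute;top:2589.14px;left:1107.67px;'
          'letter-spacing:0.73px;">want </span>')
cleaned = strip_fxl_positioning(sample)
```

Two things to watch for: removing the id leaves a doubled space inside the tag (harmless, and the Calibre conversion will likely tidy it), and the top:/left: patterns would also eat the tail end of something like margin-top: or padding-left: if the book uses those inline, so check a few pages before running it on everything.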
So Step 1 (Original):
Step 2 (The 4 Regexes):
Code:
<p class="Drop-Cap ParaOverride-1"><span class="CharOverride-14" style="position:absolute;">“</span><span class="CharOverride-3" style="position:absolute;">I </span><span class="CharOverride-3" style="position:absolute;">want </span><span class="CharOverride-3" style="position:absolute;">you </span>
Step 3 (Calibre EPUB->EPUB):
Code:
<p class="Drop-Cap ParaOverride-1"><span class="calibre1">“</span><span class="calibre2">I </span><span class="calibre2">want </span><span class="calibre2">you </span>
From there, it's at least a bit more readable, and you may be able to clean it up like JSWolf said, using Diap's Editing Toolbag (or similar tools).
If you're lucky, the book will condense down to only a few dozen calibre## classes...
If you're unlucky, the book will condense down to hundreds/thousands of calibre## classes.
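A quick way to find out which case you're in is to count the calibre## classes in the converted files. This is only a sketch: the function name is made up, it uses a plain regex on class="..." attributes rather than a real parser, and you'd feed it the text of each XHTML file yourself.

```python
import re
from collections import Counter

CLASS_ATTR = re.compile(r'class="([^"]*)"')


def count_calibre_classes(xhtml):
    """Count how often each calibre## class name appears in class attributes."""
    counts = Counter()
    for value in CLASS_ATTR.findall(xhtml):
        for name in value.split():
            if re.fullmatch(r"calibre\d*", name):
                counts[name] += 1
    return counts


# The Step 3 line from above as sample input.
step3 = ('<p class="Drop-Cap ParaOverride-1"><span class="calibre1">“</span>'
         '<span class="calibre2">I </span><span class="calibre2">want </span>'
         '<span class="calibre2">you </span>')
counts = count_calibre_classes(step3)
```

len(counts) tells you how many distinct classes you'd have to look at, and counts.most_common() shows which ones cover most of the text.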
Quote:
Originally Posted by salamanderjuice
If you really wanna learn something you could try using an HTML parser in a language like Python or Perl to process the HTML to remove all the span tags (or whatever else makes it fixed layout) while keeping the rest.
|
Agreed. "20 minutes manual cleanup per book" vs. "a few seconds to run a parser".
Although I haven't seen enough (disgusting) FXL EPUBs to know what potential ugly code you'd run across.
All I know is that every single word—and sometimes character, as Hitch said—is wrapped in a <span> with enough manual styling to fill up your entire screen, flying off the monitor.
To throw out all <span>s might not be right... so I'd go in and surgically remove certain inline styles (like top/left/letter-spacing).
But sometimes it's easier to throw out nearly everything, then add the rare formatting exceptions back in later. (Like blockquotes, poetry, etc.)
All depends on the book...
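For the "throw out nearly everything" route, here's a rough sketch of the parser idea using only Python's standard library html.parser. It unwraps every <span> (drops the tag, keeps the text) unless the span has a class you ask it to keep. The class and function names are my own, and the sample is the Step 2 line from above with a closing </p> added so it's a complete fragment. I've only thought about it on fragments like this one, not full files.

```python
from html.parser import HTMLParser


class SpanUnwrapper(HTMLParser):
    """Re-emit markup unchanged, except <span>s not in keep_classes are unwrapped."""

    def __init__(self, keep_classes=()):
        super().__init__(convert_charrefs=False)
        self.keep_classes = set(keep_classes)
        self.out = []
        self.span_stack = []  # one bool per open <span>: True if it was kept

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = set((dict(attrs).get("class") or "").split())
            keep = bool(classes & self.keep_classes)
            self.span_stack.append(keep)
            if not keep:
                return  # drop the opening tag, keep whatever is inside
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/> passes through

    def handle_endtag(self, tag):
        if tag == "span" and self.span_stack and not self.span_stack.pop():
            return  # closing tag of a span whose opening tag was dropped
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def unwrap_spans(xhtml, keep_classes=()):
    parser = SpanUnwrapper(keep_classes)
    parser.feed(xhtml)
    parser.close()
    return "".join(parser.out)


step2 = ('<p class="Drop-Cap ParaOverride-1">'
         '<span class="CharOverride-14" style="position:absolute;">“</span>'
         '<span class="CharOverride-3" style="position:absolute;">I </span>'
         '<span class="CharOverride-3" style="position:absolute;">want </span>'
         '<span class="CharOverride-3" style="position:absolute;">you </span></p>')
plain = unwrap_spans(step2)
kept = unwrap_spans(step2, keep_classes={"CharOverride-14"})
```

With no keep_classes you get the bare paragraph text. Passing the class names that actually mean something in your book (say, whichever CharOverride turns out to be italics) is the "add the exceptions back in" part. This sketch skips comments, doctypes and processing instructions, so you'd want to add handlers for those before running it on whole XHTML files.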
Side Note: InDesign's mentality is to always go back to InDesign as your "source document", do your fixes/adjustments there, then re-export. It was never about creating human-readable/maintainable code in the EPUB/output itself.