Quote:
Originally Posted by Tex2002ans
Why, exactly, are you trying to use hex codes instead of just using the actual character?
In EPUB, the only special entity you have to worry about is the Non-Breaking Space (&nbsp; or &#160;).
Everything else can use the actual Unicode characters:
— = Em Dash
There's no need to clutter your code with an entity like &#x2014;.
Great. What tools are you using to scan + OCR?
Only because the OCR isn't recognizing these characters?
OCR outputs:
but your actual article says:
Usually, if you enable the proper OCR languages, these accented characters will be recognized.
Side Note: I wrote a bit about OCR + German/Spanish/French accents in "Abbyy Finereader 15 gothic/Fraktur Altdeutsch/Oldgerman" (Post #5).
For example, English only recognizes A-Z while Spanish will recognize A-Z + a few more:
So if you have a 98% English book with 2% Spanish names/words, you'd tell OCR this is an English AND Spanish book. This would catch all the little accents on the ñ and á and é.
It's a case of "you can't get there from here" for me.
Let me fill you in on the background. I used to have an OptiBook (scan?) dedicated scanner + ABBYY FineReader 9. It did reasonably well until the scanner died. I refuse to spend more money on another dedicated scanner.
I currently have a Brother multi-function scanner/printer/copier hooked up. It has its own OCR software. I did some test shots, and it OCR'ed better than FineReader 9.
However -
I want fully reflowable print streams. I want the e-book reader to be able to change font size with no problems. That's where it gets real sticky.
To do fully reflowable text, there can be no hard line breaks (line feeds) other than paragraph breaks. Otherwise you get the "long line plus short line" pattern whenever the enlarged font overflows the line length.
aaaaaaaaaaaa
bbbbbbbbbbbb
Becomes
aaaaaaaaaa
aa
bbbbbbbbbb
bb
Which is not acceptable. (to me, anyways - and I'M doing the work!)
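The unwrapping step can also be scripted rather than done by hand. Here is a minimal Python sketch, assuming the OCR text marks each paragraph break with a blank line (an assumption; if the OCR output does not do that, the script has no way to know where a paragraph ends). The name `unwrap_paragraphs` is made up for illustration.

```python
import re

def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines; keep blank lines as paragraph breaks."""
    # Normalize Windows/old-Mac line endings to a plain line feed first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: one or more blank lines separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Inside a paragraph, collapse every run of whitespace
    # (including line feeds) into a single space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)

sample = "aaaaaaaaaaaa\nbbbbbbbbbbbb\n\ncccc\ndddd\n"
print(unwrap_paragraphs(sample))
```

With the line feeds inside each paragraph gone, the reader software is free to wrap the text at whatever width the font size calls for. Note that this sketch does not rejoin words that the OCR hyphenated across a line end; those would still need a separate pass.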
My OCR output choices are
TXT
RTF
HTML
XML
Which one would you rather use - to glue together 40 separate pages of text, and make them reflowable?
If you use TXT, goodbye to all your bold, italics, etc. BUT removing the line feed character is a piece of cake for a hex editor (which I have in a VirtualBox XP machine, which I run under Linux). No control characters at all, so LibreOffice has no problems with expanding and contracting font size (or type).
RTF is FULL of control characters, of ALL sorts. Yes, you can blow up the font size, but the text steps on itself, because there are other control characters that don't change, and those control the line spacing. Play with it yourself to see what I mean. Ripping out the RTF control data is a Royal pain in the patootie - I used to do it when I used FineReader 9. And yes, those line size definitions are carried over into ODT and DOC files, so converting does not solve the problem.
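For the TXT route, the gluing itself can be scripted too. A rough Python sketch, assuming each page was saved as its own UTF-8 text file with names that sort in page order, such as `page_001.txt` (the file naming, the encoding, and the function name `glue_pages` are all assumptions for illustration):

```python
from pathlib import Path

def glue_pages(folder: str, pattern: str = "page_*.txt") -> str:
    """Concatenate per-page text files in filename order."""
    pieces = []
    for path in sorted(Path(folder).glob(pattern)):
        # Assumption: the OCR output is UTF-8; change the encoding if not.
        pieces.append(path.read_text(encoding="utf-8").strip())
    # Put a single line feed between pages and skip empty pages.
    return "\n".join(p for p in pieces if p)
```

Calling `glue_pages("scans")` on a folder of page files would return one long string. Pages are joined with a single line feed, so a paragraph that runs across a page boundary stays together, but a paragraph that happens to end exactly at the bottom of a page will need its break put back by hand.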
HTML/XML has its own set of glue-together problems. I am not fluent in HTML. I know just enough to limp along with it.
So, given my choices, I picked TXT, and I add the italics, etc. back in afterwards. Since I am already in the hex editor anyway, adding Unicode characters wouldn't be that big of a bother. OTOH, I can add those other-language characters once the text is glued together. Yes, I'm using LibreOffice Writer. These characters are only used for place names, and for scientific journal citations in journals that are not in English. They are usually in Portuguese or French. I needed the umlauts because some of the authors are German, with umlauts in their last names.
I'm still missing 2 characters: a "c" with a cap on it (think the letter V upside down, with the point at the top, sitting on the "c"), and an "a" with a tilde; you know, the squiggle line over the n in Spanish. (It's a Portuguese feature.)
This is a long haul. If you have better ways to do it, please let me know. (I've done 8 out of 180, so far.)
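One way to track down characters like these is to look them up by their official Unicode names. A small Python sketch using the standard `unicodedata` module; the helper name `describe` is made up, and the third entry (c with cedilla) is only a guess that it might be the letter actually wanted, since that is the accented "c" Portuguese normally uses:

```python
import unicodedata

def describe(name: str) -> tuple[str, str]:
    """Return the character and its code point for a Unicode name."""
    ch = unicodedata.lookup(name)
    return ch, f"U+{ord(ch):04X}"

wanted = [
    "LATIN SMALL LETTER A WITH TILDE",
    "LATIN SMALL LETTER C WITH CIRCUMFLEX",
    # Guess: listed in case the cedilla is the mark actually needed.
    "LATIN SMALL LETTER C WITH CEDILLA",
]
for name in wanted:
    ch, code = describe(name)
    print(ch, code, name)
```

The printed code point is the hex value to type into a hex editor or into a character-entry dialog, and the character itself can simply be copied and pasted.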