I have been a fan of this idea ever since you posted the first topic.. just have been a little busy with a few large projects.. so I haven't gotten the time to sit and write my usual in-depth tomes.
I stumbled upon this many months ago when looking up some EPUB3 information. A company called Infogrid Pacific works on one of the few EPUB3 reading programs, AZARDI. They also have a program out there called "Infogrid: Digital Publisher", which claims to do exactly this:
http://www.infogridpacific.com/DigitalPublisher.html
You use their program to create their intermediary, and then it allows you to output that one file into a wide range of output formats.
Their blog had some very useful information on EPUB3, and there are some good nuggets there on OCR/different formats (I haven't taken a look in about a year though... I recall most of the posts being self-promotional):
http://www.infogridpacific.com/blog/
Maybe you might be able to gather some good ideas from their documentation/manuals/blogs/posts.
Quote:
Originally Posted by skreutzer
So in general it is a good idea to maintain a tool as command line tool in order to keep automatibility, and develop a GUI on top of it to make it user friendly, just as LyX does for LaTeX.
I have completed ~200 PDF (scans)->OCR->EPUB non-fiction economics books. All are coded with consistent CSS (only a few very minor variations per EPUB).
I actually spent a nice chunk of time in December looking for a way to go backwards from my consistent XHTML -> LaTeX -> PDF.
Jellby pointed me towards:
(This is an ongoing project; at the pace I am going, getting distracted with more and more book conversion, this EPUB -> PDF research will probably take me years!)
Now, I see a few large problems:
- Initial
  - Getting consistent input.
    - As was mentioned... most of the documents out there are HORRORS.
    - Something like Toxaris's Macro for Microsoft Word is extremely helpful.
    - Something like Writer2EPUB is helpful.
  - I agree with your idea of having a LyX-type WYSIWYM editor. That would be ideal.
    - BUT, the problem is getting people to use it. As Hitch mentioned... there is just no damn way that authors are going to use it.
      - They are stuck using Microsoft Word (horribly).
      - An extremely small minority might read and learn how to use Styles properly.
      - An even smaller minority might jump ship to an open source alternative like Libre/Open Office.
    - I can see this maybe being aimed at intermediaries, who can use the tool themselves to quickly clean/mark up input... which will make their work easier/faster.
- Output
  - What do you include? (Those marked with (?)... how are you going to mark these up?)
    - Headers
    - Paragraphs
    - Blockquotes
    - Left/Center/Right/Justified
    - Tables
    - Footnotes (?)
    - Poetry (?)
    - Pullquotes (?)
    - Indexes (?)
    - Figures (?)
    - Images (?)
    - Floating images (?)
    - SVG (?)
    - Captions (?)
    - Boxed text (?)
    - Math (?)
    - Formulas (?)
    - Fractions (?)
  - In non-fiction books, page numbers are a HUGE problem. "See Footnote 3 on page 5".
    - I see how LaTeX/LyX handles it, by placing tags/ids/references, but from what I gathered (I haven't tackled a proper EPUB->LaTeX->PDF conversion yet)... this will take a while to mark up properly, and to make sure everything is correct.
- Intermediate.
  - How in-depth do you want this intermediary to go?
    - Do you mark titles of books as a different class?
      - <i class="book">Title of Book</i>
    - Do you mark foreign languages (which might need a different font/treatment... depending on the output format)?
      - <span class="greek">Greek words</span>
    - Do you mark down people's names?
      - <span class="name">First Last</span>
  - This much in-depth markup will be extremely useful in an output format. (Let us say I wanted to use LaTeX to auto-generate an index for me: having names/titles of books marked as different classes might be extremely helpful.)
  - Going so in-depth, while it might be FANTASTIC in the long run, will be a complete pain when you initially mark everything up. (Which is why I avoid it.)
    - The cost will go up prohibitively. (As Hitch has mentioned, these conversions are expected to be done for pennies.)
  - If everything is marked up properly the first time, it will be a "one button press" conversion... Although we understand this... most of these authors just run it through Calibre, run it through some horrible automated system like Smashwords, or pay for a cheap crappy conversion.
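To make the page-reference and class-markup points in the list above a bit more concrete, here is a rough Python sketch of what a "one button press" step from marked-up XHTML to LaTeX might look like. It is only an illustration under assumptions: the `book` and `name` classes come from the examples in the list, while the empty `<a id="...">` anchors, the `pageref` class, and the function names are hypothetical conventions made up for this sketch. It also does no LaTeX escaping, so titles containing characters like `&` or `%` would need extra handling.

```python
import re

def name_key(full_name):
    """Turn 'First Last' into 'Last, First' so an index sorts by surname."""
    parts = full_name.split()
    if len(parts) < 2:
        return full_name
    return parts[-1] + ", " + " ".join(parts[:-1])

def book_repl(match):
    # <i class="book">Title</i> -> italic title plus an index entry
    title = match.group(1)
    return "\\textit{" + title + "}\\index{" + title + "}"

def name_repl(match):
    # <span class="name">First Last</span> -> plain name plus an index entry
    name = match.group(1)
    return name + "\\index{" + name_key(name) + "}"

def to_latex(xhtml):
    """Convert a few inline markup conventions to LaTeX (sketch only)."""
    text = re.sub(r'<i class="book">(.*?)</i>', book_repl, xhtml)
    text = re.sub(r'<span class="name">(.*?)</span>', name_repl, text)
    # Hypothetical convention: an empty <a id="x"></a> marks a reference target.
    text = re.sub(r'<a id="([\w-]+)"></a>',
                  lambda m: "\\label{" + m.group(1) + "}", text)
    # Hypothetical convention: <a class="pageref" href="#x">5</a> holds a
    # hard-coded page number, which LaTeX recomputes via \pageref.
    text = re.sub(r'<a class="pageref" href="#([\w-]+)">.*?</a>',
                  lambda m: "\\pageref{" + m.group(1) + "}", text)
    return text

if __name__ == "__main__":
    sample = ('<span class="name">First Last</span> wrote '
              '<i class="book">Title of Book</i>; see page '
              '<a class="pageref" href="#fn3">5</a>.')
    print(to_latex(sample))
```

The point of the sketch is that the hard-coded "page 5" disappears: once the target carries a label, LaTeX works out the page number itself, and the names/titles fall out into the index for free. None of that works unless the classes and ids were marked up in the first place.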
Let me just reiterate, I am in an extremely small minority of the users. (I am one of the few here who is paid to convert; most here do it for personal usage or as a hobby.)
Non-fiction is much harder/more complex than just handling your simple fictional work (which is probably the vast majority of writers getting books converted).
I try to push consistency across all of my books, which makes it way easier to swap things around if needed. For example, we had a ton of discussion in this topic about footnotes:
https://www.mobileread.com/forums/sho...d.php?t=225045
I treat them the same across all my books, so I can easily just regex them if needed (early on I used to have superscript footnotes, now I have them in the [##] format).
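As a rough illustration of that kind of regex swap, here is a minimal Python sketch. It assumes the old footnote markers look like `<sup>12</sup>` (digits only), which may not match the actual markup in these books; the function name is made up for the example.

```python
import re

# Assumed old style: a purely numeric superscript, e.g. <sup>12</sup>.
SUP_NOTE = re.compile(r"<sup>\s*(\d+)\s*</sup>")

def bracket_footnotes(xhtml):
    """Rewrite numeric superscript footnote markers as [##]."""
    return SUP_NOTE.sub(r"[\1]", xhtml)

if __name__ == "__main__":
    old = 'Some claim.<a href="#fn12"><sup>12</sup></a> The 4<sup>th</sup> edition.'
    print(bracket_footnotes(old))
```

Because the pattern only matches digits, ordinals like `4<sup>th</sup>` are left alone. This only works across ~200 books in one pass if every book uses the same markup, which is the whole argument for consistency.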