02-03-2014, 06:03 PM   #17
Tex2002ans
I have been a fan of this idea ever since you posted the first topic. I have just been a little busy with a few large projects, so I haven't had the time to sit and write my usual in-depth tomes.

I stumbled upon this many months ago while looking up some EPUB3 information. A company called Infogrid Pacific works on one of the few EPUB3 reading programs, AZARDI. They also have a program called "Infogrid: Digital Publisher", which claims to do exactly this:

http://www.infogridpacific.com/DigitalPublisher.html

You use their program to create their intermediary, and then it lets you convert that one file into a wide range of output formats:

[Attached image: fod2012_online.png]

Their blog had some very useful information on EPUB3, and there are some good nuggets there on OCR/different formats (I haven't taken a look in about a year, though... I recall most of the posts being self-promotional):

http://www.infogridpacific.com/blog/

You might be able to gather some good ideas from their documentation/manuals/blogs/posts.

Quote:
Originally Posted by skreutzer
So in general it is a good idea to maintain a tool as command line tool in order to keep automatibility, and develop a GUI on top of it to make it user friendly, just as LyX does for LaTeX.
I have completed ~200 PDF (scans)->OCR->EPUB non-fiction economics books. All are coded with consistent CSS (only a few very minor variations per EPUB).

I actually spent a nice chunk of time in December looking for a way to go backwards, from my consistent XHTML -> LaTeX -> PDF.

Jellby pointed me towards:

(This is an ongoing project. At the pace I am going, getting distracted by more and more book conversions, this EPUB -> PDF research will probably take me years!)

Now, I see a few large problems:
  • Initial
    • Getting consistent input.
      • As was mentioned... most of the documents out there are HORRORS.
        • Something like Toxaris's Macro for Microsoft Word is extremely helpful.
        • Something like Writer2EPUB is helpful.
      • I agree with your idea of having a LyX-type WYSIWYM editor. That would be ideal.
        • BUT the hard part is getting people to use it. As Hitch mentioned... there is just no damn way that authors are going to use it.
          • They are stuck using Microsoft Word (horribly).
          • An extremely small minority might read up and learn how to use Styles properly.
          • An even smaller minority might jump ship to an open source alternative like LibreOffice/OpenOffice.
        • I can see this maybe being aimed at intermediaries, who can use the tool themselves to quickly clean/mark up input... which will make their lives easier/faster.
    • Output
      • What do you include? (Those marked with (?)... how are you going to mark these up?)
        • Headers
        • Paragraphs
        • Blockquotes
        • Left/Center/Right/Justified
        • Tables
        • Footnotes (?)
        • Poetry (?)
        • Pullquotes (?)
        • Indexes (?)
        • Figures (?)
        • Images (?)
          • Floating images (?)
          • SVG (?)
        • Captions (?)
        • Boxed text (?)
        • Math (?)
          • Formulas (?)
          • Fractions (?)
      • In non-fiction books, page numbers are a HUGE problem. "See Footnote 3 on page 5".
        • I see how LaTeX/LyX handles it, by placing tags/ids/references, but from what I gathered (I haven't tackled a proper EPUB->LaTeX->PDF conversion yet), it will take a while to mark everything up properly and to check that every reference is correct.
  • Intermediate
    • How in-depth do you want this intermediary to go?
      • Do you mark titles of books as a different class?
        • <i class="book">Title of Book</i>
      • Do you mark foreign languages (which might need a different font/treatment, depending on the output format)?
        • <span class="greek">Greek words</span>
      • Do you mark up people's names?
        • <span class="name">First Last</span>
      • Markup this in-depth will be extremely useful for some output formats. Say I wanted to use LaTeX to auto-generate an Index for me: having names/titles of books marked as different classes would be extremely helpful.
      • Going this in-depth, while it might be FANTASTIC in the long run, makes the initial markup a complete pain. (Which is why I avoid it.)
        • The cost will go up prohibitively (as Hitch has mentioned, these conversions are expected to be done for pennies).
        • If everything is marked up properly the first time, it will be a "one button press" conversion. Although we understand this, most of these authors just run it through Calibre, run it through some horrible automated system like Smashwords, or pay for a cheap crappy conversion.
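
To make the index idea above a bit more concrete, here is a rough Python sketch of how class-based markup could be turned into LaTeX. It is only an illustration under assumptions: the class names are the ones from the examples above, the input is assumed to be well-formed XHTML with a single class per element, the choice of \emph, \index and \textgreek is mine (the latter needs a package that provides it), and no escaping of LaTeX special characters is done.

```python
from html.parser import HTMLParser


def index_key(name):
    """Turn 'First Last' into 'Last, First' so the index sorts by surname."""
    parts = name.split()
    if len(parts) < 2:
        return name
    return f"{parts[-1]}, {' '.join(parts[:-1])}"


# Hypothetical (tag, class) -> LaTeX renderer table; the macros are assumptions.
RENDERERS = {
    ("i", "book"): lambda t: "\\emph{" + t + "}\\index{" + t + "}",
    ("span", "name"): lambda t: t + "\\index{" + index_key(t) + "}",
    ("span", "greek"): lambda t: "\\textgreek{" + t + "}",
}


class ClassToLatex(HTMLParser):
    """Rewrites known (tag, class) pairs; unknown tags are dropped, text kept."""

    def __init__(self):
        super().__init__()
        self.out = []    # finished top-level fragments
        self.stack = []  # one [renderer_or_None, text_parts] per open tag

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        self.stack.append([RENDERERS.get((tag, cls)), []])

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][1].append(data)
        else:
            self.out.append(data)

    def handle_endtag(self, tag):
        if not self.stack:
            return
        renderer, parts = self.stack.pop()
        text = "".join(parts)
        rendered = renderer(text) if renderer else text
        # Nested markup: hand the result up to the enclosing element.
        if self.stack:
            self.stack[-1][1].append(rendered)
        else:
            self.out.append(rendered)


def to_latex(fragment):
    parser = ClassToLatex()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)
```

Calling to_latex('<i class="book">Title of Book</i> by <span class="name">First Last</span>') gives \emph{Title of Book}\index{Title of Book} by First Last\index{Last, First}. The point is that once the classes are in the XHTML, the index entries come for free; the expensive part is still the initial markup.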

Let me just reiterate: I am in an extremely small minority of users. I am one of the few here who is paid to convert (most here do it for personal use or as a hobby).

Non-fiction is much harder/more complex than handling a simple work of fiction (which is probably what the vast majority of writers are getting converted).

I try to push consistency across all of my books, which makes it way easier to swap things around if needed. For example, we had a ton of discussion in this topic about footnotes: https://www.mobileread.com/forums/sho...d.php?t=225045

I treat them the same across all my books, so I can easily just regex them if needed (early on I used to have superscript footnotes, now I have them in the [##] format).
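
As a sketch of the kind of regex swap I mean, here is a small Python example that turns superscript footnote references into the bracketed format. The markup pattern (a numeric link wrapped in <sup>) is an assumption for illustration, not necessarily what my files look like, and a regex like this is only safe because the markup is consistent across books.

```python
import re

# Assumed old-style reference: <sup><a href="#fn3" id="fnref3">3</a></sup>
SUP_NOTE = re.compile(r'<sup>\s*(<a\b[^>]*>)\s*(\d+)\s*</a>\s*</sup>')


def bracket_footnotes(xhtml):
    """Drop the <sup> wrapper and put the note number in square brackets."""
    return SUP_NOTE.sub(r'\1[\2]</a>', xhtml)
```

For example, bracket_footnotes('text.<sup><a href="#fn3" id="fnref3">3</a></sup>') returns text.<a href="#fn3" id="fnref3">[3]</a>, and anything that does not match the pattern is left untouched.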