MobileRead Forums - View Single Post - Automated Processing Workflows as and with Free Software

Tex2002ans · 02-03-2014, 06:03 PM

I have been a fan of this idea ever since you posted the first topic. I have just been a little busy with a few large projects, so I haven't had the time to sit down and write my usual in-depth tomes.

I stumbled upon this many months ago when looking up some EPUB3 information. A company called Infogrid Pacific works on one of the few EPUB3 reading programs, AZARDI. They also have a program out there called "Infogrid: Digital Publisher", which claims to do exactly this:

http://www.infogridpacific.com/DigitalPublisher.html

You use their program to create their intermediary, and then it allows you to export that one file to a wide range of output formats:

[Attached image: fod2012_online.png]

Their blog had some very useful information on EPUB3, and there are some good nuggets of information there on OCR/different formats (I haven't taken a look in about a year though... I recall most of the posts being self-promotional):

http://www.infogridpacific.com/blog/

You might be able to gather some good ideas from their documentation/manuals/blogs/posts.

Quote:

Originally Posted by skreutzer

So in general it is a good idea to maintain a tool as command line tool in order to keep automatibility, and develop a GUI on top of it to make it user friendly, just as LyX does for LaTeX.

I have completed ~200 PDF (scans)->OCR->EPUB non-fiction economics books. All are coded with consistent CSS (only a few very minor variations per EPUB).

I actually spent a nice chunk of time in December looking for a way to go backwards, from my consistent XHTML -> LaTeX -> PDF.

Jellby pointed me towards:

PrinceXML: http://www.princexml.com/
- XHTML + CSS = PDF
Not so Short Introduction to LaTeX: http://tug.ctan.org/tex-archive/info/lshort/english/
I stumbled upon LyX: http://www.lyx.org/
- GUI for creating LaTeX documents. Their documentation was also FANTASTIC... while testing it out, I was reminded a lot of Sigil.
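
For the PrinceXML route, the "XHTML + CSS = PDF" step can be scripted so it slots into an automated workflow. A minimal sketch, assuming the `prince` executable is installed and on the PATH; the file names and the helper names `prince_command` and `convert` are made up for illustration:

```python
import subprocess

def prince_command(xhtml, css, pdf):
    """Build the argument list for one Prince run (-s = stylesheet, -o = output)."""
    return ["prince", "-s", str(css), str(xhtml), "-o", str(pdf)]

def convert(xhtml, css, pdf):
    # Needs the `prince` executable on PATH.
    # check=True raises CalledProcessError if Prince exits with an error.
    subprocess.run(prince_command(xhtml, css, pdf), check=True)

if __name__ == "__main__":
    # Hypothetical file names, only to show the resulting command line.
    print(" ".join(prince_command("book.xhtml", "book.css", "book.pdf")))
```

Keeping the command building separate from the actual call makes it easy to loop over a whole folder of books later.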

(This is an ongoing project. At the pace I am going, getting distracted by more and more book conversions, this EPUB -> PDF research will probably take me years!)

Now, I see a few large problems:

Initial
- Getting consistent input.
  - As was mentioned... most of the documents out there are HORRORS.
    - Something like Toxaris's Macro for Microsoft Word is extremely helpful.
    - Something like Writer2EPUB is helpful.
  - I agree with your idea of having a LyX-type WYSIWYM editor. That would be ideal.
    - BUT the hard part is getting people to use it. As Hitch mentioned... there is just no damn way that authors are going to use it.
      - They are stuck using Microsoft Word (horribly).
      - An extremely small minority might read up and learn how to use Styles properly.
      - An even smaller minority might jump ship to an open source alternative like LibreOffice/OpenOffice.
    - I can see this maybe being aimed at intermediaries, who can use the tool themselves to quickly clean/mark up input... which will make their work easier/faster.
Output
- What do you include? (For those marked with (?): how are you going to mark these up?)
  - Headers
  - Paragraphs
  - Blockquotes
  - Left/Center/Right/Justified
  - Tables
  - Footnotes (?)
  - Poetry (?)
  - Pullquotes (?)
  - Indexes (?)
  - Figures (?)
  - Images (?)
    - Floating images (?)
    - SVG (?)
  - Captions (?)
  - Boxed text (?)
  - Math (?)
    - Formulas (?)
    - Fractions (?)
- In non-fiction books, page numbers are a HUGE problem: "See Footnote 3 on page 5".
  - I see how LaTeX/LyX handles it, by placing tags/ids/references, but from what I gathered (I haven't tackled a proper EPUB -> LaTeX -> PDF conversion yet), this will take a while to mark up properly and to make sure everything is correct.
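
To make the page-number problem concrete: in LaTeX the target gets a `\label{...}` and the reference uses `\pageref{...}`, so the page number is filled in at typesetting time. A rough sketch of mapping XHTML ids/links onto that, assuming a very strict and purely hypothetical input convention (targets written exactly as `<tag id="...">`, references exactly as `<a href="#...">text</a>`). It only rewrites the references and leaves the rest of the XHTML alone, so it is nowhere near a full converter:

```python
import re

# Assumed (hypothetical) input conventions, see the note above.
XREF = re.compile(r'<a href="#([\w-]+)">(.*?)</a>')
TARGET = re.compile(r'<(\w+) id="([\w-]+)">')

def to_latex_refs(xhtml):
    # Reference: keep the visible text, let LaTeX supply the page number.
    out = XREF.sub(r'\2 (page~\\pageref{\1})', xhtml)
    # Target: drop the id attribute and place a \label right after the tag.
    out = TARGET.sub(r'<\1>\\label{\2}', out)
    return out

if __name__ == "__main__":
    sample = '<p id="note3">Text</p><p>See <a href="#note3">Footnote 3</a>.</p>'
    print(to_latex_refs(sample))
```

The slow part is not the substitution, it is making sure every "see page 5" in the source actually has an id to point at.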
Intermediate
- How in-depth do you want this intermediary to go?
  - Do you mark titles of books as a different class?
    - Title of Book
  - Do you mark foreign languages (which might need a different font/treatment, depending on the output format)?
    - Greek words
  - Do you mark down people's names?
    - First Last
  - This much in-depth markup would be extremely useful for some output formats. (Let us say I wanted to use LaTeX to auto-generate an Index for me: having names/titles of books marked as different classes might be extremely helpful.)
  - Going so in-depth, while it might be FANTASTIC in the long run, will be a complete pain to mark up initially. (Which is why I avoid it.)
    - The cost will go up prohibitively. (As Hitch has mentioned, these conversions are expected to be done for pennies.)
  - If everything is marked up properly the first time, it will be a "one button press" conversion. Although we understand this, most of these authors just run it through Calibre, run it through some horrible automated system like Smashwords, or pay for a cheap crappy conversion.
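
To illustrate why class-level markup pays off for something like an index, here is a small sketch that pulls every marked-up name/title out of a chunk of XHTML and turns it into LaTeX `\index{...}` entries. The class names `person` and `booktitle` are hypothetical, and the input is assumed to be well-formed XHTML without undeclared entities:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical class names; any consistent scheme would do.
INDEXED_CLASSES = {"person", "booktitle"}

def collect_index_terms(xhtml):
    """Return {class name: sorted unique terms} for the marked-up spans."""
    root = ET.fromstring(xhtml)
    found = defaultdict(set)
    for elem in root.iter():
        # Strip a namespace prefix such as {http://www.w3.org/1999/xhtml}span
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag != "span":
            continue
        for cls in elem.get("class", "").split():
            if cls in INDEXED_CLASSES:
                found[cls].add("".join(elem.itertext()).strip())
    return {cls: sorted(terms) for cls, terms in found.items()}

def latex_index_entries(terms):
    # No LaTeX escaping is attempted here; real text may need it.
    return [r"\index{" + term + "}" for term in terms]

if __name__ == "__main__":
    sample = (
        '<div><p><span class="person">First Last</span> wrote '
        '<span class="booktitle">Title of Book</span>.</p></div>'
    )
    for cls, terms in collect_index_terms(sample).items():
        print(cls, latex_index_entries(terms))
```

Without the classes, the same list would have to be built by hand, which is exactly the cost problem above.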

Let me just reiterate: I am in an extremely small minority of the users. I am one of the few here who is paid to convert (most here do it for personal use or as a hobby).

Non-fiction is much harder/more complex than your simple work of fiction (which is probably what the vast majority of writers are getting converted).

I try to push consistency across all of my books, so that it will make it way easier to swap things around if needed. For example, we had a ton of discussion in this topic about footnotes: https://www.mobileread.com/forums/sho...d.php?t=225045

I treat them the same across all my books, so I can easily just regex them if needed (early on I used to have superscript footnotes, now I have them in the [##] format).
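
As a rough sketch of the kind of regex swap this consistency allows (the markup shown, with `href="#fn3"` style links wrapping a `<sup>`, is an assumption for illustration and not necessarily what any particular book uses):

```python
import re

# Hypothetical markup: old-style footnote references look like
#   <a href="#fn3" id="ref3"><sup>3</sup></a>
# and the goal is the bracketed form
#   <a href="#fn3" id="ref3">[3]</a>
SUP_NOTE = re.compile(r'(<a\b[^>]*href="#fn\d+"[^>]*>)<sup>(\d+)</sup>(</a>)')

def bracket_footnotes(xhtml):
    """Replace superscript footnote markers with [n] markers."""
    return SUP_NOTE.sub(r'\1[\2]\3', xhtml)

if __name__ == "__main__":
    sample = '<p>Prices rose.<a href="#fn3" id="ref3"><sup>3</sup></a></p>'
    print(bracket_footnotes(sample))
```

Because the pattern is anchored on the footnote link, an ordinary superscript elsewhere in the text (an exponent, say) is left alone. That only works if every book codes its footnotes the same way.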