Old 03-23-2017, 04:39 PM   #31
notimp
Addict
Posts: 248
Karma: 892441
Join Date: Jul 2010
Device: K2i
Quote:
Originally Posted by KevinH View Post
You can always compile and include your own tidy html5 library inside your plugin (one plugin author already does this) or you can use Sigil's internal gumbo library. Sigil's gumbo parser is a fork of Google's Gumbo parser that autocorrects using the exact same rules as browsers. Sigil gumbo library is available through the plugin api as well.

FWIW, You also might be better off parsing the file in gumbo first and then doing replacements of the text parts so that you do not break the xhtml syntax by funny search and replaces.

I think there is a simple example of using Sigil's gumbo in the testme plugin that is documented in the latest Plugin framework Developers guide.
Thank you for the clarification. I believe I also caught the switch to gumbo in the changelogs several months ago, but I never saw it do its magic when, say, switching from source view to WYSIWYG (or at least not in the way I needed; it's been a while since I tested the new parser version). But if it's a library I can call, it might work.

I'll also look into plugins other people have written already, as suggested.

I'm afraid the main enemy here is complexity. I'm confident I could hack together a little find/replace regex plugin in Python, but calling other modules means climbing more of a learning curve first...

In any case, I'm doing some pretty freaky replacements (like search for this, followed by whatever, followed by that tag 1 or 3 times, which can also include that other tag 0 or 1 times...), so I'm relying on actually working on the HTML source and not just the "visible text". It turns out that OCR software output has "predictable formatting" you can take advantage of if you want to modify text that's structured in a certain way but ignore text that isn't, and matching formatting followed by more formatting (or not) isn't something you can do on a text level.
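As an illustration only, a replacement of that shape might look like the following in Python. The tags, the "italic run at the start of a paragraph" rule, and the `promote_headings` helper are all hypothetical, chosen to show the structure of such a pattern; they are not the actual rules described in this post, and "1 or 3 times" is read here as "one to three times".

```python
import re

# Hypothetical rule: a paragraph that opens with an italic run and ends in
# one to three <br/> tags (optionally followed by an empty <span/>) is
# treated as a heading plus body text.
HEADING_LIKE = re.compile(
    r'<p><i>(?P<title>[^<]+)</i>'      # "this": italic run opening a paragraph
    r'(?P<rest>(?:(?!</p>).)*?)'       # "whatever", but never past the closing </p>
    r'(?:<br\s*/>){1,3}'               # "that tag", one to three times
    r'(?:<span\s*/>)?'                 # "that other tag", zero or one time
    r'</p>',
    re.DOTALL,
)

def promote_headings(source: str) -> str:
    """Rewrite matching paragraphs as <h2> plus a plain paragraph."""
    return HEADING_LIKE.sub(r'<h2>\g<title></h2>\n<p>\g<rest></p>', source)
```

The `(?!</p>)` lookahead keeps the "whatever" part from running into the next paragraph, which is the usual way such a search and replace ends up breaking the xhtml.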

The thing is, it turns out that you can pretty much automate ePub production from Finereader>Sigil or Finereader>Atlantis>Sigil (Atlantis = one of the best word processor>epub converters out there) based on "predictable error profiles" that can be automated away by looking at sentence structure (a missing end mark) and formatting structure at the same time.
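A minimal sketch of one such "error profile", assuming the typical OCR mistake of splitting a sentence across two plain paragraphs. The rule and the `join_split_paragraphs` name are made up for illustration and only handle unstyled `<p>` tags and ASCII lowercase letters:

```python
import re

# Sentence structure: the first paragraph does not end in closing punctuation.
# Formatting structure: the next paragraph is a plain <p> (no attributes)
# and starts with a lowercase letter.
SPLIT_SENTENCE = re.compile(
    r'(?P<head>[^.!?:;"<>\s])\s*</p>\s*<p>(?P<tail>[a-z])'
)

def join_split_paragraphs(source: str) -> str:
    """Merge paragraph pairs that look like one sentence cut in two."""
    return SPLIT_SENTENCE.sub(r'\g<head> \g<tail>', source)
```

Paragraphs that carry a class or other attributes are left alone, which is the point of checking the formatting together with the sentence ending.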

Short way of saying: in 2012 I created a better method to create ePubs from Finereader results than what is "available" on the international scene (shared/created it with a German "scene" board, don't ask). It broke once tidy was removed. At that point it was more convenient to just tell everyone to keep using the older versions of Sigil than to try to get the functionality back myself. The availability of a GUI frontend library for plugins has now piqued my interest again.

I'll look into it (this isn't a promise of delivery, for people recognizing my nickname...).

Last edited by notimp; 03-23-2017 at 05:41 PM.