01-08-2014, 02:41 AM | #16 | |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
oh, no, not THAT type of thing: not broken dialogues, per se. I meant, that for some bizarro-world reason, the typist had in two instances (two diff. books) created one paragraph style for dialogue paragraphs, and one for narrative. Obviously, that's easy-peasy to solve. For broken dialogues, I have your wondrous tool. But then there are simply broken paragraphs, outside of dialogue, and I'm working on more all-inclusive regex/searches to fix those, as much as possible...and lastly, this one real doozy (the one of which I spoke), in which there were sentences somewhat like this:

I tol him to leave the room "I'll be back, "David will make sure of that.' he smiled.

Now, obviously, no formatter can "save" that. It's just bugger-all bad. But I've given some contemplation to the idea of playing with your broken dialogues, Tox, to create a pass that marks up--identifies--all the broken dialogues, and then hand it BACK to the author to review, for preliminary clean up. Just toying with it. Not solidified in ye olden brain yet. Just pushing around in "the little grey cells." ;-)

Hitch |
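For what it's worth, a marking pass like the one described above could be sketched roughly as follows. This is only a heuristic under stated assumptions, not the existing broken-dialogue tool: it looks at straight double quotes only, guesses "open" or "close" from the neighbouring characters, and flags any paragraph where the quotes do not alternate. The function names and the `[[CHECK DIALOGUE]]` marker are made up for illustration.

```python
import re

CLOSING_PUNCTUATION = ".,;:!?"


def quote_roles(paragraph):
    """Guess whether each straight double quote opens or closes speech."""
    roles = []
    for match in re.finditer('"', paragraph):
        i = match.start()
        before = paragraph[i - 1] if i > 0 else " "
        after = paragraph[i + 1] if i + 1 < len(paragraph) else " "
        if before.isspace() and not after.isspace():
            roles.append("open")
        elif not before.isspace() and (after.isspace() or after in CLOSING_PUNCTUATION):
            roles.append("close")
        else:
            roles.append("unknown")
    return roles


def has_suspect_dialogue(paragraph):
    """Return True when the quotes do not strictly alternate open/close."""
    expected = "open"
    for role in quote_roles(paragraph):
        if role != expected:
            return True
        expected = "close" if expected == "open" else "open"
    # If we are still waiting for a closing quote, speech was left open.
    return expected == "close"


def mark_suspects(text, marker="[[CHECK DIALOGUE]] "):
    """Prefix each suspect paragraph with a marker for the author to review."""
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(
        marker + p if has_suspect_dialogue(p) else p for p in paragraphs
    )
```

It will produce false positives (a quote directly after an opening parenthesis counts as "unknown", for instance), which seems acceptable if the goal is only to hand a list of suspects back to the author.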
|
01-08-2014, 04:53 AM | #17 | |
Addict
Posts: 265
Karma: 724240
Join Date: Aug 2013
Device: KyBook
|
Quote:
|
|
01-08-2014, 10:17 AM | #18 | |
Well trained by Cats
Posts: 29,800
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
But what I see being discussed is that it will have more 'automatic (only?)' features. I keep thinking about 'Tidy' (I do use 'Pretty'). As an example, early Sigil would just stomp on your NCX file. The current Sigil waits for you to do the deed (or lets you do it). |
|
01-08-2014, 01:30 PM | #19 | |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
|
Quote:
Dale |
|
01-09-2014, 07:06 AM | #20 |
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
One advantage this fella has is he says he can program in C. That was the sticking point over Sigil moving on: few C programmers who hadn't had their fill of C at work.
|
01-09-2014, 07:44 AM | #21 |
Wizard
Posts: 1,264
Karma: 10203040
Join Date: Dec 2011
Device: a variety (mostly kindles and kobos)
|
[pedant mode]Sigil is written in C++ not C. I'm a C programmer for my day job. I keep meaning to learn C++ but I never seem to find the time. They are related and there's some overlap but they are different.
Anyway I believe the OP said he was a C++ coder so I really am being pedantic [/pedant mode] |
01-09-2014, 12:17 PM | #22 |
Software Developer
Posts: 189
Karma: 89000
Join Date: Jan 2014
Location: Germany
Device: PocketBook Touch Lux 3
|
But as with your idea to identify broken dialogues, that's exactly what I'm proposing: Keep the stuff which is already in good quality, and throw away what is not (I myself am mostly concerned about this for XML; text tools for such a purpose would be a different topic, but why not work in this field also, since a solution is needed for self-publishers as well?). In this specific case, either you or the author has to re-apply apostrophes and quotation marks. If the author has to do it, make it as easy as possible for him. If you have to do it, make it as easy as possible for you. As an advanced solution, don't allow quotation marks in your (online) editing software at all, but let the author mark quotations and direct speech semantically (my initial question of this thread was whether Sigil could be that software) by something like
Code:
I tol him to leave the room <dialogue>I'll be back, David will make sure of that.</dialogue> he smiled.
to be output as
Code:
I tol him to leave the room “I'll be back, David will make sure of that.” he smiled.
In this way:

The crucial part about the "feature" of direct formatting shown in the screenshot is that the text "use 2 egg whites instead for healthier version" gets marked as
Code:
style="color: rgb(255, 0, 0)"
instead of
Code:
class="alternative" |
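To illustrate the `<dialogue>` idea from this post: a minimal sketch of the conversion step, assuming the tag is never nested and assuming the quotation marks are chosen per output. The `<dialogue>` element is the hypothetical markup from the example above, not part of XHTML or EPUB, and `render_dialogue` is a name invented here.

```python
import re

# Non-greedy match so that several <dialogue> spans in one paragraph are
# handled separately; nesting is not supported in this sketch.
DIALOGUE = re.compile(r"<dialogue>(.*?)</dialogue>", re.DOTALL)


def render_dialogue(text, opening="\u201c", closing="\u201d"):
    """Replace semantic <dialogue> markup with typographic quotation marks."""
    return DIALOGUE.sub(lambda m: opening + m.group(1) + closing, text)


source = (
    "I tol him to leave the room <dialogue>I'll be back, "
    "David will make sure of that.</dialogue> he smiled."
)
print(render_dialogue(source))
# German-style marks from the same source, just by changing the parameters:
print(render_dialogue(source, opening="\u201e", closing="\u201c"))
```

The point of the design is that the author never types a quotation mark at all, so the choice of marks becomes a property of the output rather than of the manuscript.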
|
01-09-2014, 01:53 PM | #23 | |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Quote:
Word XML is just that. It is perfectly valid XML with a schema specifically for Word documents, just as was intended. In principle it is possible to load the XML in Word and get your original document back. The same applies to their HTML output. It is valid, even if it is not what we would like.

All XML 'formats' are custom, but some schemas are public and agreed upon by various parties. That is also one of the issues. A schema needs to be agreed upon to correctly identify the semantic value of the tags. You cannot expect all (or any) word processors to honor the schema you would like. So you would need to map the XML schema from the word processor to your schema. That will not always be possible.

You also greatly overestimate the willingness of writers to change their ways, and misjudge their reaction to being forced to work in a certain way. They would rather use another program, or even WordPad, than change their way of working. Only a small number of writers are willing to do that.

You might take a look at my Word add-in. I create clean HTML output (or XHTML directly in an ePUB) out of Word, but at a price: styling like margins and fonts will be removed. It would be relatively easy to create an export for another format (e.g. Markdown) in the same way.

I like the idea, but I think you are too optimistic. However, if I can help to improve things, I probably will. |
|
01-09-2014, 04:42 PM | #24 |
Software Developer
Posts: 189
Karma: 89000
Join Date: Jan 2014
Location: Germany
Device: PocketBook Touch Lux 3
|
Well, do you have any needs for your own projects? I'm mostly driven by my own personal need, currently just small "book" projects. But over time, I hope to provide more and more general-purpose processing tools, which could be used by self-publishers or to set up an (online?) service. On the one hand, it's a lot of work and won't be sufficient for all kinds of uses at first; on the other hand, once a solution is implemented, a lot of texts can be processed with it.

The problem of getting good semantic XML will still need to be addressed, but that's exactly why I was wondering whether Sigil could be used for it (to let the author do the semantic markup of his text with Sigil if he failed to do it right in the first place, and then take the prepared EPUB (XHTML) file from Sigil as input for an automated processing system). But there are also alternative ways to get a semantic XML/XHTML file from the author; one could be to write a JavaScript-based online/offline text editor for semantic editing.

Currently, I write semantic XHTML myself as input for conversion to EPUB, but as OpenOffice (and therefore LibreOffice too, I assume) is already capable of valid, semantic XHTML output, I should probably work on a way to educate the author (video tutorial), a website to provide this education, a list of style names to use in OpenOffice, an upload form for the author to submit OpenOffice XHTML output on the mentioned website, and a schema to check whether the uploaded file matches the expected style names, so that the file could then be automatically processed to EPUB, and later to PDF. I know how this description reads, but existing free software would provide shortcuts, the development could be done collectively as free software, and over time the system would expand, so it could become a real option for self-publishers that would reduce manual labor for authors, formatters and developers.
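As a rough illustration of the "check whether the uploaded file matches the expected style names" step, here is a minimal sketch. It is not a real schema (no RELAX NG or XSD involved); it only collects the `class` values in a well-formed XHTML file and reports the ones not on a whitelist. The style names in `ALLOWED_CLASSES` are invented for the example, and the sketch assumes the upload parses as XML without undeclared named entities.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of style names the author was told to use.
ALLOWED_CLASSES = {"chapter-title", "paragraph", "emphasis", "dialogue"}


def unexpected_classes(xhtml, allowed=ALLOWED_CLASSES):
    """Return every class name used in the document that is not allowed."""
    root = ET.fromstring(xhtml)
    found = set()
    for element in root.iter():
        # A class attribute may hold several space-separated names.
        found.update(element.get("class", "").split())
    return found - set(allowed)


upload = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p class="paragraph">Text with <span class="emphasis">stress</span>.</p>'
    '<p class="MsoNormal">Pasted from elsewhere.</p>'
    "</body></html>"
)
print(unexpected_classes(upload))  # the upload form could reject on a non-empty result
```

An upload form could refuse the file, or show the author the offending style names, whenever the returned set is not empty.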
Maybe it would not be in the scope of the website, but depending on the interfaces, theoretically, somebody could distribute the prepared files from there directly to online e-book shops and print-on-demand services. Being built as and with free software, that system would not be an online service run by some provider, but could be set up by anybody, online or offline. The free software license would make sure that every improvement is available to everybody else, so essentially a community would work together instead of competing against each other. I myself don't necessarily need such a large system; I'm glad to develop my own little system to use for my book projects and maybe for people I work with, and if it grows beyond that because my results are freely licensed, fine. In any case, I'm interested in whether somebody else is doing something similar with free software, and whether there could be a joint effort to provide a common solution for a larger audience. Last edited by skreutzer; 01-09-2014 at 05:31 PM. |
|
01-10-2014, 02:52 AM | #25 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Quote:
However, you are making a mistake here. Word does NOT output XHTML, nor does it claim to. It can output HTML (in two flavours), XML (again in two flavours) and DOCX. Of course there are more formats, but let's ignore them for now.

The HTML output is valid HTML 4.01 by default. The problem most people have with it is that it is full of code to make sure the output in a browser resembles the original document AND that it can be understood by Word upon importing to make it a Word document again. It does that well enough; that it is not practical for subsequent processing is another story. That is also not its purpose.

The XML output is valid XML. The structure used is described in detail on the various Microsoft websites. It has the same premise as the HTML output: it must be understood by Word upon importing. That makes it less valuable for semantics.

To give a short example of where issues will arise: let's say I make a word italic. In the code, <w:i /> (amongst other things) will be used to identify that it is italic. Now, when I create a style that applies italic, that code will not be there, only the code to apply the style. From Word's perspective that makes sense, since italic is embedded in the style. From a semantic point of view it makes things a whole lot more difficult (the same applies to the HTML output, btw). That also makes it a whole lot harder to map it to other XML schemas.

I mention the DOCX format because it is essentially the same as the XML, only divided into multiple structured files in a container.
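The italic example can be made concrete with a small sketch. It assumes the DOCX flavour of the markup (the WordprocessingML namespace below) and only handles character styles referenced through `w:rStyle`; paragraph styles, `w:basedOn` inheritance and `<w:i w:val="0"/>` (italic switched off) are deliberately left out, and those are exactly the cases that make a real mapping harder. The function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def italic_styles(styles_root):
    """Collect the IDs of styles whose own run properties contain <w:i/>."""
    ids = set()
    for style in styles_root.findall("w:style", NS):
        if style.find("w:rPr/w:i", NS) is not None:
            ids.add(style.get(f"{{{W}}}styleId"))
    return ids


def run_is_italic(run, italic_style_ids):
    """True if a <w:r> run is italic directly or through its character style."""
    properties = run.find("w:rPr", NS)
    if properties is None:
        return False
    if properties.find("w:i", NS) is not None:
        return True  # direct formatting: the <w:i/> sits in the run itself
    style_ref = properties.find("w:rStyle", NS)
    # Styled run: no <w:i/> here, the italic lives in the style definition.
    return style_ref is not None and style_ref.get(f"{{{W}}}val") in italic_style_ids
```

So a converter cannot decide from the run alone; it has to read the style definitions first and carry that lookup along for every run it processes.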
I tried to work with OpenOffice, but it is just not for me. I miss several features (not for ePUB creation) and don't like the interface. I also do not like the output to be honest and I am not the only one. There is a reason why there is also a program to take the output from OpenOffice to prepare it for ePUB. I believe it is called ePUBWriter. |
|
01-10-2014, 03:50 AM | #26 |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
^^ What Toxaris said, firstly.
Quote:
You are more than welcome to give this idea a go, but trust me when I tell you: given that there are dozens of word processors out there that can already do this, for all intents and purposes, why would the people who ALREADY won't do this, do it with yours?
Hitch |
|
01-10-2014, 05:32 AM | #27 | |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Quote:
That said, I have done my share of XML conversions with XSLT. In fact, that was also my first idea for creating clean (X)HTML for ePUB. That went out the window very fast, since it would cripple the result to an undesirable degree. Too much could not be converted with the XSLT. I don't really care too much about lists and tables with col/rowspans, but whole pieces of formatting (like bold/italic) could get lost if they are part of a style (as I mentioned before). So that is why I decided to do it differently.

I actually revisited the idea with OpenXML conversion and ran up against the same limitations: no way to solve the inheritance of certain formatting in styles. Is it the fault of Microsoft? No, not really, since the information is there and from their point of view it is perfectly logical. They cannot solve it within the current specification of their OpenXML definition. It would save me a lot of work, but they do not have any need for it. They would rather build in ePUB exporting capabilities first. I know that there have been many requests for clean exports from Word, but there are also difficulties there.

I am already thinking, for future development of my add-in, of trying to create a basic stylesheet based upon the layout in Word. Simple stuff like indents, centering and the like. I don't know if that will happen, but I am thinking about it. It has some quite serious impacts and I do not know if there is a need for it. |
|
01-11-2014, 05:15 AM | #28 |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
Quote:
Hitch |
|
01-11-2014, 04:31 PM | #29 |
Software Developer
Posts: 189
Karma: 89000
Join Date: Jan 2014
Location: Germany
Device: PocketBook Touch Lux 3
|
Regarding your description of style, that is the ideal solution for the problem. If the style definition is hard to read and to apply (if it isn't XML; CSS, for example, isn't XML and would therefore require the reading software to parse CSS), either a user would have to define the style again with the same or a similar visual appearance for the processing software, or a converter/parser would transform the Word/CSS style definition into something that the processing software could read and apply. However, my processing workflow would most likely expect some style like "emphasis", and would either always apply the same visual appearance to it for PDF output, no matter what the visual appearance was in the word processor, or let the user decide which visual appearance for "emphasis" is preferred. Something like <p class="MsoNormal"> is perfectly fine: the user would, corresponding to the description of <w:i/> above, just define what default text should look like in EPUB output, in PDF output, in SQL output, or in whatever output, since there could indeed be different requirements.

Just to complete this overview: the worst case would be direct formatting like <p style="font-size: 11pt; font-family: Arial; color: rgb(255, 0, 0)">. There is no information encoded here which would tell a program whether the <p> is supposed to be handled the same way as or differently from other <p>s; instead, they just look the same, on purpose or accidentally. Some <p>s should probably be handled the same way even if their visual appearance is different, and other <p>s should probably be handled differently even if their visual appearance is the same. There are also unnecessary dependencies introduced, such as the ability to parse CSS, to know what an alternative for the "Arial" font could be (if it is not available for the target format), and to interpret RGB color codes (if the target format or software only supports strings like "black", "blue", "red").
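A minimal sketch of the distinction drawn above, under the assumption that the input is well-formed XHTML: class names are looked up in a per-output table, while anything carrying a `style` attribute is reported as direct formatting that a human has to resolve. The `RENDERING` table and both function names are hypothetical, and the values in the table are placeholders rather than a real EPUB or PDF configuration.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-output appearance for each semantic class.
RENDERING = {
    "emphasis": {"epub": "font-style: italic;", "pdf": "italic font variant"},
    "MsoNormal": {"epub": "text-indent: 1em;", "pdf": "body text, indented"},
}


def appearance(class_name, target):
    """Look up how a semantic class should look in one output format."""
    try:
        return RENDERING[class_name][target]
    except KeyError:
        raise LookupError(f"no {target} rule for class {class_name!r}") from None


def directly_formatted(xhtml):
    """List the style attributes of all elements that use direct formatting."""
    root = ET.fromstring(xhtml)
    return [el.get("style") for el in root.iter() if el.get("style") is not None]


sample = (
    '<body><p class="MsoNormal">Default text.</p>'
    '<p style="font-size: 11pt; font-family: Arial; color: rgb(255, 0, 0)">'
    "use 2 egg whites instead for healthier version</p></body>"
)
print(appearance("MsoNormal", "epub"))
print(directly_formatted(sample))
```

The class lookup needs no CSS parser at all, whereas the second paragraph can only be listed, not interpreted, which is the dependency problem described above.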
For the topic of my initial question, I would really like to hear how you clean the input from authors in terms of structure, and if/how the added structural information is used for later automated processing.
|
|
01-11-2014, 04:47 PM | #30 |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Gang:
I suspect that my view of the possibility of this functioning, widely, is utterly jaded by my own experiences, and I am thus just mostly talking to myself, to hear myself talk. I don't like this at the best of times, so I'm just going to bug out of this discussion. I don't genuinely think I have anything to truly add, other than cynicism, so ...that's not helpful. Hitch |
Tags |
sigil, wysiwym, xml |
|