A simple tagged text format with an HTML converter - Page 2

skreutzer · 05-03-2015, 02:47 PM

As far as I understand it, the whole point is to support machine-readability in the future and also the ease of writing in that format. In my opinion, to achieve the former, it would be advantageous to stick to an already established, well supported format, especially as it provides more or less the very same features, only in a slightly different notation.

Hitch · 05-03-2015, 06:57 PM

Quote:

Originally Posted by jandrew

Not sure what you mean here, markdown always supported atx style headings like:

# level one heading

## level two heading (and so on)

or am I missing your point in this grepping matter?

FWIW, the Markdown used by Teamworks is the # method; the markdown used by Desk (owned by Salesforce) uses textually-indicated headings, e.g., .h3, .h4, etc. Just for clarity about what I was bitching about.

Quote:

If your system existed in isolation that would be fine, but shouldn’t readers of this thread who might be unfamiliar with text based writing systems be able to read comments about how your system compares to existing solutions?

There is nothing particularly wrong with your solution per se, but the fact is that your dthtm.exe/dunyazad system is basically a very limited, but syntactically differing, version of a pandoc/markdown system. You have not offered any advantages that I can see, over just using a minimal subset of markdown.

On the other hand, what pandoc/markdown offers (beyond additional markup extensions) is:

it has been widely used and tested for years
it is actively maintained/developed
there are many text editors providing markdown support (like syntax highlighting and preview modes) for win/mac/linux/ios/android platforms
multiple format conversion including: html, pdf, epub2/3 and more

So, even if people just want to stick with headings, paragraphs, and italics, if they go with pandoc/markdown they have the above advantages. If their writing later requires footnotes, tables, citations, lists, code blocks, images, or such, they can add to their knowledge of pandoc/markdown as needed.

Certainly I get that you aren’t trying to be everything for everybody, and your system works for you and that is fine, and you are, of course, already invested in your system. I am not trying to attack your system for not being something else. But others might appreciate some context before investing.

cheers, andrew

This is sort of my gripe; I mean, firstly, arguing about which "plain text system" is best actually begs the question. The assumption (the "begging the question") is that a plain text system is preferable, and thus, we should debate whether RobertDDL's system is better than Pandoc or better than Markdown, yadda.

I'm simply saying that while it's a nice intellectual exercise, it's not one in which I, personally, can invest time, because I do think it's wheel-spinning. RobertDDL has his system in place. It's what he's going to use, and he's happy with it. Quite honestly, despite his multiple explanations, I'm damned if I understand WHY, because the entire choice seems to be about the idea that the NCX is superfluous, and the OPF unnecessary, and that it's "needlessly complicated," which is great if you're dealing with a one-page long book, but once you're past 270KB, you have other choices to make, that absolutely require nav aids like the NCX (or landmarks), a spine, etc. Nonetheless...

I think that anyone who's making books for themselves should be able to do whatever they want. It's moot. Once you migrate into making books for others, you have to fall in line with what the reading systems support. {shrug}. That's the boat I'm in, and honestly, I find ePUB to be a fairly smooth, sleek, reading system. If you use Sigil, the heavy lifting of the OPF and NCX is done for you, assuming that you have a brain and use headings for structure, which is what they are for. Moreover, using those same heading styles, a human-viewable nav system can be built as well, in addition to the (also human-viewable) NCX. I certainly don't see how that's slower than grepping markdown headings--and certainly not slower than having to TYPE the markdown in the first place, plus the regex/grep, plus...

Like I said: it's a nice intellectual exercise, but I think I'll bow out, and leave it to you geeksters for a nice heated debate about which will work better in the long run.

As far as machine-readability, @skreutzer, were that true, we'd be using XML and XSLT. We're not. That should tell us all something, I'd think. However, if that were the goal, I'd argue for XML, simply because it's already well-established and in large document-archival systems, already in place (like medical records, etc.).

My $.02, and worth less than that.

Hitch

jandrew · 05-04-2015, 01:59 AM

Quote:

Originally Posted by Hitch

I think that anyone who's making books for themselves should be able to do whatever they want. It's moot. Once you migrate into making books for others, you have to fall in line with what the reading systems support. {shrug}. That's the boat I'm in, and honestly, I find ePUB to be a fairly smooth, sleek, reading system.

I'm not sure we are on the same page here. I am not talking about any kind of different reading format/system. I like epub just fine. Markdown text is what I write (minimally marked plain text files), regardless of target output. When my target is epub, I can generate it with:

Code:

pandoc -o title.epub chap01.md chap02.md ...

Of course, I actually organize my chapters in separate directories, and use a makefile (rakefile) to generate a target, and you'll want to supply your own css file and metadata, but that is beside the point. Markdown is my writing format, pandoc is for generating a target output (be it epub, html, pdf, docx, LaTeX, odt, docbook, etc). I hope that clarifies things a bit Hitch.
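
For readers who have not scripted pandoc before, here is a minimal Python sketch of the kind of build step described above. It only assembles the command line; the directory layout, the file names and the function name pandoc_epub_command are made up for illustration, and the optional --css flag is how recent pandoc releases take a stylesheet (older releases may differ).

```python
from pathlib import PurePath

def pandoc_epub_command(chapters, output="title.epub", css=None):
    """Build the pandoc argument list for an EPUB made from Markdown chapters."""
    cmd = ["pandoc", "-o", output]
    if css:
        # Assumption: a pandoc release that accepts --css for EPUB output.
        cmd.append(f"--css={css}")
    # Order chapters by file name (chap01.md, chap02.md, ...), regardless of
    # which directory each one lives in.
    cmd += sorted((str(c) for c in chapters), key=lambda c: PurePath(c).name)
    return cmd

# Hypothetical layout: one directory per chapter.
command = pandoc_epub_command(["two/chap02.md", "one/chap01.md"])
# subprocess.run(command, check=True) would run it, if pandoc is installed.
```

A makefile or rakefile does the same job with dependency tracking on top; the sketch just shows that the writing format and the build step stay separate.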

RobertDDL · 05-04-2015, 04:41 AM

Quote:

Originally Posted by jandrew

Not sure what you mean here, markdown always supported atx style headings like:

# level one heading

## level two heading (and so on)

or am I missing your point in this grepping matter?

Sorry, my mistake then -- I had been referring to the Markdown flavor that is described in the Wikipedia article, http://en.wikipedia.org/wiki/Markdown, which states that level 1 headings are followed by a line of === while level 2 headings are followed by a line of ---, and only from heading level 3 on is the # prefix used.

http://daringfireball.net/projects/m.../syntax#header states that headings can either have that style, or # and ## for levels 1 and 2 -- in that second case, this is exactly the same as what I do. Some other differences remain, but they are not grave.
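
To make the two heading styles concrete, here is a rough Python sketch that "greps" headings in either notation. It is not a Markdown parser: it ignores code blocks and other edge cases, and the function name find_headings is just an illustrative choice.

```python
import re

ATX = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")   # "# Title" or "## Title ##"
UNDERLINE_H1 = re.compile(r"^=+\s*$")             # line of === under a level 1 title
UNDERLINE_H2 = re.compile(r"^-+\s*$")             # line of --- under a level 2 title

def find_headings(text):
    """Return (level, title) pairs for #-prefixed and underlined headings."""
    lines = text.splitlines()
    headings = []
    for i, line in enumerate(lines):
        match = ATX.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
            continue
        # An underlined heading needs a non-blank title line and a following line.
        if not line.strip() or i + 1 >= len(lines):
            continue
        if UNDERLINE_H1.match(lines[i + 1]):
            headings.append((1, line.strip()))
        elif UNDERLINE_H2.match(lines[i + 1]):
            headings.append((2, line.strip()))
    return headings

sample = "Book Title\n==========\n\nPart One\n--------\n\n# Chapter 1\n\nSome text.\n\n## Scene\n"
headings = find_headings(sample)
```

With the # style a single pattern per line is enough, which is why that notation is the easier one to search for.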

Quote:

Originally Posted by Hitch

Quite honestly, despite his multiple explanations, I'm damned if I understand WHY, because the entire choice seems to be about the idea that the NCX is superfluous, and the OPF unnecessary, and that it's "needlessly complicated," which is great if you're dealing with a one-page long book, but once you're past 270KB, you have other choices to make...

This is a misunderstanding, sorry if I have contributed to it by not having made my point clear enough. Whatever I've said about ePub in another thread, I am really not proposing plain text as a better alternative to ePub. But, plain text exists, and is here to stay, and will always be the most easily accessed and most compatible format available, so, when you use it, it makes sense to give a few thoughts to how you use it.

Plain text to be read as is is one thing, converting it to something else (like ePub, giving you a TOC) is another thing -- I've tried to cover both, but my emphasis is on the first. (That other thread, BTW, was about software that converts plain text to ePub, where the author MobiEpubMaker said "Italic/bold style can't be supported by text file so it can't be supported by this software", so I just wanted to show that, along with headings, it can be.)

And, BTW, my InkPad reads plain text quite nicely, and when reading a novel, whether p- or e-book, I usually don't look at the TOC anyway. To the InkPad a heading in a plain text file is a line that is preceded by 3 blank lines -- easily done. And, it ignores blank lines, so I replace them with something unobtrusive. Or, I convert the text to HTML to ePub and read the ePub. All this I can easily do with a plain text file, without needing a markup language, if it follows some simple style conventions.
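
The blank-line convention mentioned above is simple enough to sketch in a few lines of Python. This is only an illustration of the rule as described here, not how the InkPad itself works; whether the device wants exactly three blank lines or at least three is an assumption (the sketch uses "at least"), and the function name is hypothetical.

```python
def headings_by_blank_lines(text, blanks_required=3):
    """Return the lines that are preceded by at least `blanks_required` blank lines."""
    headings = []
    blank_run = 0  # number of consecutive blank lines seen so far
    for line in text.splitlines():
        if line.strip():
            if blank_run >= blanks_required:
                headings.append(line.strip())
            blank_run = 0
        else:
            blank_run += 1
    return headings

sample = (
    "Title\n\n\n\n"
    "Chapter One\nFirst paragraph.\n\nSecond paragraph.\n\n\n\n"
    "Chapter Two\nMore text.\n"
)
found = headings_by_blank_lines(sample)
```

A single blank line between paragraphs never reaches the threshold, so ordinary paragraph breaks are not mistaken for headings.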

But, again, I only wanted to discuss how to use plain text in a certain context, when it is used. (And, on a personal note, I have a blind friend for whom plain text is the only format she can reliably read on all of her devices -- this has taught me to value simplicity.)

skreutzer · 05-04-2015, 05:14 AM

Quote:

Originally Posted by Hitch

As far as machine-readability, @skreutzer, were that true, we'd be using XML and XSLT. We're not. That should tell us all something, I'd think. However, if that were the goal, I'd argue for XML, simply because it's already well-established and in large document-archival systems, already in place (like medical records, etc.).

Well, if you construct EPUBs, the XHTML is XML, the NCX is XML, the OPF is XML. I don't know how you would generate an XHTML-TOC automatically, but XSLT would be a more flexible way to do that for programmers than to implement it in static code. But we don't need to discuss this further, because XML formats, while being plain-text files too, are for most people not easy to read and write, and typing them adds even more overhead; that's exactly why someone would consider using Markdown and generating an EPUB full of XML files from it.
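
As a rough illustration of generating a TOC from heading elements, here is a sketch in plain Python rather than XSLT, using only the standard library. The function names, the class names and the flat list layout are assumptions made for the example; a real EPUB nav document needs more structure than this.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
HEADING_TAGS = {f"{{{XHTML_NS}}}h1": 1, f"{{{XHTML_NS}}}h2": 2}

def collect_headings(xhtml_source):
    """Return (level, title, id) for every h1/h2 element, in document order."""
    root = ET.fromstring(xhtml_source)
    entries = []
    for element in root.iter():
        level = HEADING_TAGS.get(element.tag)
        if level is not None:
            title = "".join(element.itertext()).strip()
            entries.append((level, title, element.get("id")))
    return entries

def render_toc(entries, href):
    """Render the entries as a flat <ol> of links into the file `href`."""
    ol = ET.Element("ol")
    for level, title, anchor in entries:
        li = ET.SubElement(ol, "li", {"class": f"toc-level-{level}"})
        target = f"{href}#{anchor}" if anchor else href
        link = ET.SubElement(li, "a", {"href": target})
        link.text = title
    return ET.tostring(ol, encoding="unicode")

doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<h1 id="c1">Chapter 1</h1><p>Text</p>'
    '<h2 id="c1s1">A <em>scene</em></h2>'
    '</body></html>'
)
toc_markup = render_toc(collect_headings(doc), "ch1.xhtml")
```

The same selection of h1 and h2 elements could be expressed as an XSLT template; the point is only that well-formed XHTML makes the headings easy to pick out mechanically.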

Toxaris · 05-04-2015, 05:51 AM

Markdown or other text based file formats might be fine for archiving purposes, but not for writing. Even for archiving purposes it is not really usable, since there are so many dialects. In my opinion DocBook is a nice concept, also for archiving, but again not really usable. There are no good DocBook editors out there. Most of them are just fancy text editors. Converting to DocBook also does not work, since the power of DocBook is the possibility to really use structure for the text. In case of converting, the wrong or generic structure tags will be used, reducing the value.

RobertDDL · 06-02-2015, 11:37 AM

To provide an update, for those (few, I know) who may be interested.

The conversion workflow text to html to epub or mobi is now sufficiently tested (I hope). It supports title, author, heading levels, scene breaks, blank lines and italics, and something that might pass as footnotes.

With the plain text files from the Dunyazad Library, or any other plain text files that adhere to their very simple rules, you can use

dthtm myfile[.txt] [-c]
to create the htm file, and then
ebook-convert myfile.htm myfile.epub --epub-inline-toc --level1-toc //h:h1 --level2-toc //h:h2 --language en
or
ebook-convert myfile.htm myfile.mobi --level1-toc //h:h1 --level2-toc //h:h2 --language en
to create an epub or mobi file.

dthtm.exe is my tool (current version: 1.0), to be found on http://www.dunyazad-library.net/plaintext.htm (together with source code and documentation), and ebook-convert.exe is part of Calibre.

Again, I agree that there are other/better/more powerful/more flexible/well established/etc. solutions available, I just wanted to add my own minimalist one...
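
For anyone who wants to script the two steps above, here is a small Python sketch that assembles the same command lines. It takes the dthtm and ebook-convert arguments exactly as given in the post; the function name conversion_commands is made up, and the optional -c switch of dthtm is left out because the post does not say what it does.

```python
def conversion_commands(stem, target="epub", language="en"):
    """Return the two command lines: text to htm, then htm to epub or mobi."""
    if target not in ("epub", "mobi"):
        raise ValueError("target must be 'epub' or 'mobi'")
    to_html = ["dthtm", f"{stem}.txt"]
    to_ebook = ["ebook-convert", f"{stem}.htm", f"{stem}.{target}"]
    if target == "epub":
        # The post uses the inline TOC option for the epub conversion only.
        to_ebook.append("--epub-inline-toc")
    to_ebook += [
        "--level1-toc", "//h:h1",
        "--level2-toc", "//h:h2",
        "--language", language,
    ]
    return [to_html, to_ebook]

steps = conversion_commands("myfile", target="mobi")
# Each list could be passed to subprocess.run(..., check=True), provided
# dthtm and Calibre's ebook-convert are installed and on the PATH.
```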

