Quote:
Originally Posted by DNSB
The problem for me is that when I played with similar issues a few years back, I ended up spending more time on trying to have my code handle special cases and not convert my ebook to egarbage than it would have taken me to do a regex to generate/modify the headers and then copy/paste. Entries that did not have header tags (<p> or <div> are more fun), h? tags wrapped around images, multiple h? tags in the same file (e.g. for the chapter title and subtitle), multiple chapters in the same file mixed with chapters split across multiple files (thank you, Gutenberg!), headers with multiple <a>, <span> and <br /> tags between the h? tags.
It was not a simple project and I was never happy with the results but it taught me quite a bit about Python and making sure I had backup copies. 
Yes, depending on where your sources come from it can be a never-ending source of delight to discover all the ways people can "break" something as simple (in theory) as a chapter heading.

Special mention for Gutenberg and their "an hr is as good as a new file" structure.
But am I wrong in thinking that you were also using the html files as your starting point? If so, I think you're completely right: there are too many different possibilities to handle, you'll never manage to make something that can deal with all of them, and it's very, very likely you'll break something. Which is exactly why I am not trying to do this using regex. But,
BUT! If there is a good TOC in the file already, there could be a way to do a "reverse create TOC": instead of having to resolve all those tricky problems, you just go around them. I really believe it must be possible to automate that. Everything you need is already in the TOC: the text is there, so all you have to do is copy-paste it; the link is there, so all you have to do is follow it. All the necessary elements are already in the file.
I really do think that automating the copying of the original TOC titles back into the files they link to is as close to a perfect solution as it's possible to get. If you copy the title into an html comment, I cannot even see how you could break the file at all, and it would be one single operation, so you don't even have to figure out multiple scenarios. Obviously a bit of work would still be required after that to stick this text into the proper tag or add the attribute or what have you, but the most tedious and annoying part would already be done: no copying and pasting by hand between two files, no mucking about with regexes for various wEiRd CaSeS and random spans or one-to-three br's or a's or sup or anything else. The whole process would be much smoother because you wouldn't have nearly as many variations to adjust for.
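To make the idea concrete, here is a rough Python sketch of that "reverse create TOC" pass. It is only an illustration under several assumptions: the TOC is an EPUB 2 style toc.ncx, the html files are already loaded into a dict keyed by the same relative paths the TOC uses, and the function names (read_toc_entries, add_title_comment, annotate_files) are made up for this example rather than taken from any existing plugin. It does use one small regex, but only to find the <body> tag or the element carrying the link's anchor id, not to guess at what a heading looks like.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by EPUB 2 toc.ncx files.
NCX = "http://www.daisy.org/z3986/2005/ncx/"
NS = {"ncx": NCX}


def read_toc_entries(ncx_text):
    """Return (title, file_name, fragment) tuples, in TOC order."""
    root = ET.fromstring(ncx_text)
    entries = []
    for nav_point in root.iter("{%s}navPoint" % NCX):
        label = nav_point.find("ncx:navLabel/ncx:text", NS)
        content = nav_point.find("ncx:content", NS)
        if label is None or content is None:
            continue
        # "ch2.html#start" -> ("ch2.html", "start"); no "#" -> empty fragment
        file_name, _, fragment = content.get("src", "").partition("#")
        title = " ".join((label.text or "").split())  # tidy whitespace
        entries.append((title, file_name, fragment))
    return entries


def add_title_comment(html_text, title, fragment=""):
    """Insert the TOC title as an html comment at the link target."""
    # "--" is not allowed inside a comment, so collapse runs of hyphens.
    comment = "<!-- TOC: " + re.sub(r"-+", "-", title) + " -->"
    match = None
    if fragment:
        # Put the comment just before the tag whose id matches the anchor.
        pattern = (r'<[^<>]*\sid\s*=\s*["\']' + re.escape(fragment)
                   + r'["\'][^<>]*>')
        match = re.search(pattern, html_text)
        pos = match.start() if match else None
    if match is None:
        # No anchor (or anchor not found): put it right after <body ...>.
        body = re.search(r"<body\b[^>]*>", html_text, re.IGNORECASE)
        if body is None:
            return html_text  # nothing recognisable, leave the file alone
        pos = body.end()
    return html_text[:pos] + comment + html_text[pos:]


def annotate_files(ncx_text, files):
    """files maps file name -> html text; returns an annotated copy."""
    result = dict(files)
    for title, file_name, fragment in read_toc_entries(ncx_text):
        if file_name in result:
            result[file_name] = add_title_comment(
                result[file_name], title, fragment)
    return result
```

You would call annotate_files with the text of toc.ncx and the dict of html files, then write the returned texts back out (to copies, not the originals). The sketch does not handle URL-encoded paths, old-style name= anchors, or EPUB 3 nav documents, and if several TOC entries point at the same file without an anchor their comments all land right after <body>, the later entries first.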
I guess I am going to have to do as you did and use it to "learn a lot about Python" (lesson 1: apparently Python is what I'll have to learn if I want to make my own plugin). I have already learned the painful lesson about keeping backup copies during previous "experiments".
How hard is Python to learn? (Serious question.) I am completely comfortable with html and css, but I don't know any programming languages.