02-02-2017, 12:25 AM | #1 |
Enthusiast
Posts: 34
Karma: 10
Join Date: Feb 2017
Device: none
|
Unable to open htm-file.
Hi,
A friend of mine wrote a book in Word. I converted it as suggested in the User Guide: File -> Save As Filtered HTML. I was able to open it in version 0.7.1, but today I installed the new version 0.9.7 and it failed to open. The error is:
---------------------------------
The following file was not loaded due to invalid content or not well formed XML: [full path file name] (line 786: @787:43: That tag isn't allowed here Currently open tags: html, body, div..) Try setting the Clean Source preference to Mend XHTML Source Code on Open and reloading the file.
----------------------------------
1. First of all, there is no "Clean Source" preference, only "Mend XHTML Source Code On".
2. I tried opening the file with and without "Open" selected: same result. The file is not opened and there is only a "Close" button.
3. I don't understand why the older version is able to open the file but the latest one is not.
4. There are a lot of errors in the converted file and I am willing to clean them up, but how can I do that if I am not able to open it in Sigil?
Sigil Version: 0.9.7
Loaded Qt: 5.6.1
Build time: 2016.10.29 15:56:51 UTC
Thanks |
02-02-2017, 02:05 AM | #2 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Apparently Sigil thinks it is an XML file instead of HTML.
Have you considered using my Word add-in? It contains various tools and also enables you to create an ePUB directly from Word. The code it produces is clean in itself. If you want, the ePUB can be opened in Sigil automatically after being saved. |
02-02-2017, 02:32 AM | #3 | |
null operator (he/him)
Posts: 20,568
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Quote:
Or use the Import DOCX plugin for Sigil
Or convert the DOCX to EPUB via calibre (GUI or command line)
Or import the DOCX into the calibre book editor
Assuming you have Word 2007 or later, converting the DOCX via any of the above (including ePub Tools) is almost always a better place to start than Filtered HTML.
And there's a lot your friend can do in Word to make conversion easier, such as avoiding the use of 'white space' to align text (horizontally and vertically), and using Word Styles in an attached Template instead of inline styles.
BR |
|
02-02-2017, 03:09 AM | #4 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
The solution to your problem was actually in the error message. You can open any html in Sigil that is derived from a Word doc (filtered html), Word docx (filtered html), AbiWord html, ODF html or even Google Doc html using the following simple procedure in Sigil.
* Open Sigil 0.9.7 and go to Edit > Preferences > General. Then set Mend XHTML Source Code on: to Open and save.
* Now if you open your Word filtered html doc in Sigil you should have no problems.
Sigil 'mends' the html by replacing the XMLNS header with the correct version for the epub standard.
Last edited by slowsmile; 02-02-2017 at 03:18 AM. |
02-02-2017, 11:35 AM | #5 |
mostly an observer
Posts: 1,515
Karma: 987654
Join Date: Dec 2012
Device: Kindle
|
Rather than let Word interpret my docs, I run them through Word2CleanHtml.com online.
|
02-02-2017, 11:45 AM | #6 |
Grand Sorcerer
Posts: 27,549
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Which always seemed a bit silly (and way too resource-intensive) to me, when there are so many solutions that get the same (if not better) results without leaving the Word/Sigil environment. Most of them have been mentioned in this thread (and none of them involve having to upload your entire novel to someone else's servers).
|
02-02-2017, 10:55 PM | #7 |
Enthusiast
Posts: 34
Karma: 10
Join Date: Feb 2017
Device: none
|
It didn't help.
|
02-03-2017, 09:13 AM | #8 |
Grand Sorcerer
Posts: 27,549
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
I've never really come across an html file that Sigil's new parser (Google's Gumbo, by the way ... Sigil 0.7.x used htmlTidy, which was prone to destroying data) flat out refused to open. Is there any way you can produce a small, non-copyright-violating sample file that exhibits this issue? I'd like to see for myself what Word is putting out that Gumbo can't handle.
|
02-03-2017, 06:39 PM | #9 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
@NotJohn...I've used and tested the online Word2CleanHTML app. This app does indeed clean up the HTML, but unfortunately it also zaps most of the styling in the HTML. That sounds rather dumb to me, because it means that you must re-style your HTML file from scratch in Sigil. So you're actually duplicating your work and creating more unnecessary work for yourself. Is that sensible?
There are much better ways to clean up your html while, at the same time, also preserving all the styling. For instance, as one of its tasks, my OpenDocHTMLImport plugin will first thoroughly clean out and reformat both your html file and CSS file for you and will leave you with a working html file that you won't have to restyle from scratch in Sigil. I use mostly bs4 for cleaning out the proprietary dross from the html file and this works quite well.
I would also second DiapDealer's comment. Sigil's Mend HTML on Open facility is also a very useful way of loading html files derived from Word (as Web Page, Filtered html), AbiWord, Google and OpenDoc into Sigil. I've also found that, so far, the only html file that it won't load or accept is a Word doc saved as just a Web Page (not filtered html).
From the above, I'm also guessing that the OP probably just saved his Word doc as a 'Web Page', which won't work. But if he had saved his Word doc as 'Web Page, Filtered html' and set 'Mend XHTML file on Open' in Sigil Preferences as advised, then it would probably load into Sigil without any problems. I've also just followed this procedure using a Word filtered htm file and it loaded into Sigil without any problems at all.
Last edited by slowsmile; 02-03-2017 at 09:17 PM. |
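As a rough illustration of cleaning out Word-specific markup while keeping the real styling, here is a minimal Python sketch. It is not the OpenDocHTMLImport plugin's code and it does not use bs4; it only uses plain string handling to drop the `mso-*` properties that Word writes into style declarations. The function name `strip_mso_declarations` is made up for this example.

```python
def strip_mso_declarations(style):
    """Drop Word-specific mso-* declarations from a CSS declaration string.

    Illustrative only: splitting on ';' is naive and would break on values
    that themselves contain a semicolon (for example inside url(...)).
    """
    kept = []
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        if not sep:
            continue  # skip empty or malformed fragments
        if name.strip().lower().startswith("mso-"):
            continue  # proprietary Word property, not needed in an EPUB
        kept.append(f"{name.strip()}:{value.strip()}")
    return ";".join(kept)


if __name__ == "__main__":
    print(strip_mso_declarations("text-indent:21.3pt; mso-pagination:none; color:red"))
```

The ordinary declarations (`text-indent`, `color`) are kept, so the file would not need restyling from scratch.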
02-03-2017, 11:14 PM | #10 | |
Enthusiast
Posts: 34
Karma: 10
Join Date: Feb 2017
Device: none
|
Quote:
<p class=podpis style='text-indent:21.3pt'>
  <table cellpadding=0 cellspacing=0>
    <tr>
      <td width=196 height=0></td>
    </tr>
    <tr>
      <td></td>
      <td><img width=328 height=156 src="RoberMelamedBook_files/image002.jpg" alt="links/menashe.jpg"></td>
    </tr>
  </table>
  <br clear=ALL>
  Menashe people with Rabbi Avichail (right)</p> |
|
02-04-2017, 11:00 AM | #11 |
2B || !2B
Posts: 851
Karma: 194010
Join Date: Feb 2010
Location: Austria
Device: Sony PRS505/650/T1/tolino vision 5
|
Hi,
your code isn't valid html. (A table is not allowed inside a p: a p may only contain inline content, not block elements such as a table.) But I had a very interesting finding. Spoiler:
I'm on Sigil 0.9.1 on Windows
Last edited by Mark Nord; 02-04-2017 at 03:02 PM. Reason: Set Spoiler, as issue is resolved |
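To locate this kind of problem in a large converted file before loading it into Sigil, something like the following Python sketch could help. It uses only the standard library's `html.parser`; the class name, the function name and the list of tags are assumptions made for this example, and the tag list is deliberately partial. It does not emulate what Sigil or Gumbo do; it simply reports block-level start tags seen while a `<p>` is still open.

```python
from html.parser import HTMLParser

# Partial, illustrative list of elements that may not appear inside <p>.
NOT_ALLOWED_IN_P = {"table", "div", "ul", "ol", "blockquote", "pre",
                    "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockInParagraphFinder(HTMLParser):
    """Record block-level start tags that occur while a <p> is open."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.problems = []  # (tag, line, column) tuples

    def handle_starttag(self, tag, attrs):
        if self.in_p and tag in NOT_ALLOWED_IN_P:
            line, col = self.getpos()
            self.problems.append((tag, line, col))
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False


def find_blocks_in_p(html_text):
    """Return a list of (tag, line, column) for block tags found inside <p>."""
    finder = BlockInParagraphFinder()
    finder.feed(html_text)
    finder.close()
    return finder.problems


if __name__ == "__main__":
    sample = ("<p class=podpis style='text-indent:21.3pt'>"
              "<table cellpadding=0 cellspacing=0><tr><td></td></tr></table>"
              "<br clear=ALL> Caption</p>")
    for tag, line, col in find_blocks_in_p(sample):
        print(f"<{tag}> inside <p> at line {line}, column {col}")
```

The reported line numbers can then be used to fix the source by hand, for example by moving the table out of the paragraph.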
02-04-2017, 12:13 PM | #12 |
Grand Sorcerer
Posts: 27,549
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Good information. Thanks! Are you saying that your doctype div.xml file WON'T open with 0.9.1? Because I have no problem opening that one (after renaming to .html) with Sigil 0.9.7.
The only one I have trouble opening (or adding via add existing file) with Sigil v0.9.7 is the "html p.xml" file. |
02-04-2017, 12:29 PM | #13 |
Enthusiast
Posts: 34
Karma: 10
Join Date: Feb 2017
Device: none
|
Word add-in
I installed the Word add-in and received these errors. Pictures are attached. Again, it's a table conversion error.
|
02-04-2017, 12:38 PM | #14 |
Grand Sorcerer
Posts: 27,549
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Did you let Sigil try to automatically fix the error?
|
02-04-2017, 12:59 PM | #15 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
DiapDealer, I got the same results as you did. After renaming the files to .html, I was able to load them all except the html_p one.
The problem with the html_p.html file is the lack of a DOCTYPE in the file itself. It seems sigil-gumbo actually repairs differently depending on the DOCTYPE. This was something I did not know, but it now makes sense.
With no DOCTYPE in the html_p.html file, Sigil literally needs to clean the file twice to get it to a properly clean state. The first pass cleans up a bunch of garbage but not the table in p issue, and it adds the proper DOCTYPE at the end (our Sigil code does that). But without a clearly recognized DOCTYPE, gumbo cleans only to heavily transitional html (very weak cleaning). The second pass will see the DOCTYPE the first pass added, and then proceed to clean up the table in p problem.
If I simply edit html_p.html and add a <!DOCTYPE html>, or the epub2 version of that, at the top of the file before trying to load it, gumbo will properly clean everything in one pass.
So it appears that I will need to check for and add the DOCTYPE inside CleanSource::Mend before passing anything to gumbo, so that gumbo will properly repair the whole mess in one pass. I will keep playing around with this. Thanks for the test cases. |
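Until such a change lands in Sigil, the manual workaround described above (adding a DOCTYPE before loading the file) could be scripted. The following is a minimal Python sketch under stated assumptions: it is not Sigil's CleanSource::Mend code, the function name `ensure_doctype` is hypothetical, and the check is a simple text search rather than a real parse.

```python
import re

HTML5_DOCTYPE = "<!DOCTYPE html>"

# Naive checks: a DOCTYPE anywhere in the text, and an XML declaration at the top.
_DOCTYPE_RE = re.compile(r"<!DOCTYPE\b", re.IGNORECASE)
_XML_DECL_RE = re.compile(r"\A\s*<\?xml[^>]*\?>\s*")


def ensure_doctype(source, doctype=HTML5_DOCTYPE):
    """Return source unchanged if it has a DOCTYPE, else with one prepended."""
    if _DOCTYPE_RE.search(source):
        return source
    match = _XML_DECL_RE.match(source)
    if match:
        # Keep an XML declaration first and put the DOCTYPE right after it.
        head = match.group(0).rstrip()
        return head + "\n" + doctype + "\n" + source[match.end():]
    return doctype + "\n" + source.lstrip()


if __name__ == "__main__":
    print(ensure_doctype("<html><body><p>text</p></body></html>"))
```

Running a file's text through such a function and saving the result before opening it in Sigil would mirror the hand edit described above. The `doctype` parameter is there so that an epub2-style DOCTYPE can be passed instead.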
|