Problems editing XML documents
The last couple of books I've downloaded from Gutenberg in HTML format have arrived as XML documents, and so far I haven't discovered any way to edit these with MS Word. Anyone know if/how this can be done? I'm using Word 2003.
I do note that if I save the document as a Rich Text File, I can import it into Book Designer and all the extraneous material will be gone, but that means I have to completely edit a book in BD, and I'd rather do it in Word first. Any advice would be appreciated. Jim |
Are you sure they're not html documents just with an .xml extension? Try renaming them.
Can you give an example of such a document, or a link to where on Gutenberg you found them? It might be easier to help if I knew what they were like. |
Here's one of the books I downloaded:
http://www.gutenberg.org/etext/6801
Don't know if this is something new that Gutenberg's doing, or something to do with my version of Word. Jim |
That is a normal xhtml file, but I think some word processors might be confused by the first line of the code, which is:
Code:
<?xml version='1.0' encoding='UTF-8'?>
I don't have MS Word installed, but I was able to convert it to .doc format with AbiWord without making any changes to the file at all. With OpenOffice, I only had to delete this first line in a text editor before opening it, and then it worked fine. You could try deleting the first line and then opening them in Word and see if they work then. (Use a text editor like Notepad or Wordpad to delete the first line.) For good measure, I attach the .doc file created by OpenOffice here. |
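For anyone with a batch of these files, the "delete the first line" step can be scripted. The following is only a minimal sketch: the names `strip_xml_declaration` and `strip_file` are made up for illustration, and it assumes the file really is UTF-8, as the declaration quoted above says. It removes the declaration only when it is the first thing in the file and leaves everything else alone.

```python
import re
from pathlib import Path

# An XML declaration at the very start of the text, plus the whitespace after it.
# The "\s" after "xml" keeps this from matching other processing instructions
# such as <?xml-stylesheet ...?>.
XML_DECL = re.compile(r"^\s*<\?xml\s[^>]*\?>\s*")


def strip_xml_declaration(text: str) -> str:
    """Return text without a leading XML declaration, if one is present."""
    return XML_DECL.sub("", text, count=1)


def strip_file(src: str, dst: str) -> None:
    """Read src (assumed UTF-8), drop the declaration, write the result to dst."""
    # "utf-8-sig" also swallows a byte-order mark if the file starts with one.
    text = Path(src).read_text(encoding="utf-8-sig")
    Path(dst).write_text(strip_xml_declaration(text), encoding="utf-8")
```

Calling `strip_file("book.html", "book-for-word.html")` (both file names hypothetical) writes a copy and keeps the original download untouched.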
That's brilliant, frabjous. Deleted the first line in Notepad, opened the XHTML document, and it opened as a normal file. I'm much obliged to you for spending the time and for your expertise.
Jim |
In fact you can also delete the first line in OpenOffice.org, then save it, and reopen it. The first time it will be treated as a text file, the second time as (X)HTML.
|
Quote:
2. Having the XML declaration ("<?xml version='1.0' encoding='UTF-8'?>") at the start of an XHTML document (before the DOCTYPE) is mandatory. The fact that a lot of XHTML documents don't have it is beside the point. |
[accidental double post, sorry]
|
Quote:
You're right about regular html not being xml. I remembered that after I posted, but forgot to change it. Wasn't really the issue here; these documents are xhtml. (Saying "nothing to do" with it, however, is certainly misleading. They are two children of common ancestors, so to speak. Still I apologize if anyone was misled.) As for the tag, however, according to the W3C xhtml spec, this tag is not mandatory. Quote:
But even if some other spec somewhere says it's mandatory, given what this thread is about, and the problem the original poster was having, what could possibly be more to the point than the fact this is unusual? It's obvious that in this case it is this tag that was preventing the file from being correctly read and converted by Word, and, apparently, by OpenOffice. Pointing out that it's "mandatory" is in fact, given what the OP asked, what is beside the point. What is to the point is that removing it solves the problem. |
An even easier way is simply to open the document in Firefox (which has no problem with the xml declaration) then save the page. :)
|
Firefox saves the file as is, including the xml declaration. And rightly so, because it saves files verbatim.
And indeed, the XML declaration is not required. The XML spec says: `XML documents SHOULD begin with an XML declaration', not `XML documents MUST begin with an XML declaration'. The grammar also clearly marks it as optional. It is, however, the fault of OpenOffice.org that it doesn't recognise the file as XHTML. But this issue has been on the list of future enhancements since 2005. The problem is that OpenOffice.org doesn't have a proper XHTML import filter: it treats XHTML files as HTML, and therefore doesn't recognise the XML declaration. |
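The "optional in the grammar" point refers to the XML 1.0 production `prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?`, where the `?` after `XMLDecl` means zero or one. A quick way to see this in practice is to feed the same document to an XML parser with and without the declaration. The sketch below uses Python's standard-library parser purely as an illustration; the sample markup and the helper name `root_tag` are made up, and it says nothing about how Word or OpenOffice.org behave.

```python
import xml.etree.ElementTree as ET

# A tiny XHTML-like document, invented for this example.
BODY = (
    "<html xmlns='http://www.w3.org/1999/xhtml'>"
    "<head><title>t</title></head><body><p>Hi</p></body></html>"
)
WITH_DECL = "<?xml version='1.0' encoding='UTF-8'?>\n" + BODY


def root_tag(doc: str) -> str:
    """Parse the document as XML and return the (namespaced) root tag."""
    return ET.fromstring(doc.encode("utf-8")).tag


# Both variants are well-formed XML and parse to the same root element.
same = root_tag(WITH_DECL) == root_tag(BODY)
```

Both calls succeed, so a conforming XML parser is happy either way; the trouble in this thread comes from importers that read the file as HTML and trip over the declaration.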
Quote:
Quote:
XML by itself is used for many, many things other than XHTML and was created primarily for those "other" things. Quote:
My bad, I was wrong. Quote:
And OO and Word not being able to handle it... that's just damn stupid on their part. And pretty surprising for OO. |
I'll admit I was surprised that OO couldn't handle it. I gather from Pietvo's remark that it's already been reported as a bug, but perhaps I'll put in a "vote" for it.
I'm never surprised by Microsoft products not working well, but this is Word 2003. Maybe it's been fixed by now. |
Quote:
Removing the XML declaration makes Word just import it like any other HTML file. |
Quote:
Jim |
MobileRead.com is a privately owned, operated and funded community.