03-28-2010, 02:16 PM | #1 |
Guru
Posts: 744
Karma: 2825929
Join Date: Feb 2007
Location: Fresno
Device: Kindle 1; iPad Air; iPhone 7; Kobo Libra; Kindle Oasis 3
|
Problems editing XML documents
The last couple of books I've downloaded from Gutenberg in HTML format have arrived as XML documents, and so far I haven't discovered any way to edit these with MS Word. Anyone know if/how this can be done? I'm using Word 2003.
I do note that if I save the document as a Rich Text File, I can import it into Book Designer and all the extraneous material will be gone, but that means I have to completely edit a book in BD, and I'd rather do it in Word first. Any advice would be appreciated. Jim |
03-28-2010, 08:32 PM | #2 |
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
|
Are you sure they're not html documents just with an .xml extension? Try renaming them.
Can you give an example of such a document, or a link to where on Gutenberg you found them? It might be easier to help if I knew what they were like. |
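For anyone who wants to check this before renaming anything, here is a minimal sketch in Python. It only peeks at the start of the text to see whether it opens with an XML declaration and whether an `<html>` root element follows. The function name `sniff` and the 2048-character window are assumptions made for illustration, not anything Gutenberg or Word defines.

```python
import re

def sniff(text: str) -> dict:
    """Report what the beginning of a document looks like (hypothetical helper)."""
    # Skip a byte-order mark and leading whitespace, then inspect only the head.
    head = text.lstrip("\ufeff \t\r\n")[:2048].lower()
    return {
        # True when the text opens with an XML declaration such as <?xml version=...?>
        "xml_declaration": re.match(r"<\?xml\s", head) is not None,
        # True when an <html> start tag appears near the top, i.e. it is (X)HTML
        "html_root": re.search(r"<html[\s>]", head) is not None,
    }
```

If both values come back true, the file is most likely XHTML rather than some other kind of XML document.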
03-28-2010, 10:06 PM | #3 |
Guru
Posts: 744
Karma: 2825929
Join Date: Feb 2007
Location: Fresno
Device: Kindle 1; iPad Air; iPhone 7; Kobo Libra; Kindle Oasis 3
|
Here's one of the books I downloaded:
http://www.gutenberg.org/etext/6801
Don't know if this is something new that Gutenberg's doing, or something to do with my version of Word. Jim |
03-29-2010, 01:12 AM | #4 |
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
|
That is a normal xhtml file, but I think some word processors might be confused by the first line of the code, which is:
Code:
<?xml version='1.0' encoding='UTF-8'?>
I don't have MS Word installed, but I was able to convert it to .doc format with AbiWord without making any changes to the file at all. With OpenOffice, I only had to delete this first line in a text editor before opening it, and then it worked fine. You could try deleting the first line and then opening the files in Word to see whether they work. (Use a text editor like Notepad or WordPad to delete the first line.) For good measure, I attach the .doc file created by OpenOffice here. |
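If there are several books to fix, deleting the first line by hand gets tedious. The following is a rough sketch of the same step in Python, assuming the files are UTF-8 as the declaration claims. The names `strip_xml_declaration` and `strip_file` are made up for this example, and the regular expression is deliberately narrow: it only removes a declaration at the very start of the file.

```python
import re

# Matches an XML declaration at the very start of the text, together with an
# optional byte-order mark before it and any whitespace directly after it.
XML_DECL = re.compile(r"^\ufeff?<\?xml\s[^>]*\?>\s*")

def strip_xml_declaration(text: str) -> str:
    """Return text without a leading XML declaration, if one is present."""
    return XML_DECL.sub("", text, count=1)

def strip_file(src_path: str, dst_path: str, encoding: str = "utf-8") -> None:
    """Write a copy of src_path to dst_path with the declaration removed.

    The default encoding is an assumption; pass another one if needed.
    """
    with open(src_path, encoding=encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding=encoding) as dst:
        dst.write(strip_xml_declaration(text))
```

Calling `strip_file("book.xhtml", "book-for-word.html")` (both file names are placeholders) leaves the original untouched and writes a copy that starts at the DOCTYPE or `<html>` line.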
03-29-2010, 10:00 AM | #5 |
Guru
Posts: 744
Karma: 2825929
Join Date: Feb 2007
Location: Fresno
Device: Kindle 1; iPad Air; iPhone 7; Kobo Libra; Kindle Oasis 3
|
That's brilliant, frabjous. Deleted the first line in Notepad, opened the XHTML document, and it opened as a normal file. I'm much obliged to you for spending the time and for your expertise.
Jim |
03-29-2010, 02:17 PM | #6 |
Reader
Posts: 519
Karma: 24612
Join Date: Aug 2009
Location: Utrecht, NL
Device: Kobo Aura 2, iPhone, iPad
|
In fact you can also delete the first line in OpenOffice.org, then save it, and reopen it. The first time it will be treated as a text file, the second time as (X)HTML.
|
03-30-2010, 03:35 PM | #7 | |
Created Sigil, FlightCrew
Posts: 1,982
Karma: 350515
Join Date: Feb 2008
Device: Kobo Clara HD
|
Quote:
2. Having the XML declaration ("<?xml version='1.0' encoding='UTF-8'?>") at the start of an XHTML document (before the DOCTYPE) is mandatory. The fact that a lot of XHTML documents don't have it is beside the point. |
|
03-30-2010, 05:54 PM | #8 |
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
|
[accidental double post, sorry]
Last edited by frabjous; 03-30-2010 at 06:09 PM. |
03-30-2010, 06:03 PM | #9 | ||
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
|
Quote:
You're right about regular html not being xml. I remembered that after I posted, but forgot to change it. It wasn't really the issue here; these documents are xhtml. (Saying "nothing to do" with it, however, is certainly misleading. They are two children of common ancestors, so to speak. Still, I apologize if anyone was misled.) As for the tag, however, according to the W3C xhtml spec, this tag is not mandatory.
Quote:
But even if some other spec somewhere says it's mandatory, given what this thread is about, and the problem the original poster was having, what could possibly be more to the point than the fact that this is unusual? It's obvious in this case that it is this tag that was preventing the file from being correctly read and converted by Word and, apparently, by OpenOffice. Pointing out that it's "mandatory" is in fact, given what the OP asked, what is beside the point. What is to the point is that removing it solves the problem.
Last edited by frabjous; 03-30-2010 at 06:14 PM. |
||
03-30-2010, 07:24 PM | #10 |
Wizard
Posts: 1,196
Karma: 1281258
Join Date: Sep 2009
Device: PRS-505
|
An even easier way is simply to open the document in Firefox (which has no problem with the xml declaration) then save the page.
|
03-31-2010, 04:09 AM | #11 |
Reader
Posts: 519
Karma: 24612
Join Date: Aug 2009
Location: Utrecht, NL
Device: Kobo Aura 2, iPhone, iPad
|
Firefox saves the file as is, including the xml declaration. And rightly so, because it saves files verbatim.
And indeed, the XML declaration is not required. The XML spec says "XML documents SHOULD begin with an XML declaration", not "XML documents MUST begin with an XML declaration". The grammar also clearly marks it as optional. It is, however, the fault of OpenOffice.org that it doesn't recognise the file as XHTML, and this issue has been on the list of future enhancements since 2005. The problem is that OpenOffice.org doesn't have a proper XHTML import filter: it treats these files as HTML, and therefore it doesn't recognize the XML declaration. |
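A quick way to see that the declaration is optional is to feed the same document to an XML parser with and without it. The sketch below uses Python's standard `xml.etree.ElementTree`; the sample document and the helper name `root_tag` are invented for the illustration, and the DOCTYPE is left out to keep the example small.

```python
import xml.etree.ElementTree as ET

# A tiny XHTML document made up for this example (no DOCTYPE, for brevity).
BODY = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Sample</title></head>"
    "<body><p>Hello</p></body>"
    "</html>"
)
# The same document, preceded by the declaration discussed in this thread.
WITH_DECL = "<?xml version='1.0' encoding='UTF-8'?>\n" + BODY

def root_tag(document: str) -> str:
    """Parse the document as XML and return the root element's qualified name."""
    return ET.fromstring(document.encode("utf-8")).tag

# Both calls succeed and give the same root element, with or without the declaration.
print(root_tag(WITH_DECL))
print(root_tag(BODY))
```

A conforming XML parser accepts both forms, which is why the trouble here lies with the import filters rather than with the files.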
03-31-2010, 08:52 AM | #12 | ||||
Created Sigil, FlightCrew
Posts: 1,982
Karma: 350515
Join Date: Feb 2008
Device: Kobo Clara HD
|
Quote:
Quote:
XML by itself is used for many, many things other than XHTML and was created primarily for those "other" things.
Quote:
My bad, I was wrong.
Quote:
And OO and Word not being able to handle it... that's just damn stupid on their part. And pretty surprising for OO. |
||||
03-31-2010, 09:10 AM | #13 |
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
|
I'll admit I was surprised that OO couldn't handle it. I gather from Pietvo's remark that it's already been reported as a bug, but perhaps I'll put in a "vote" for it.
I'm never surprised by Microsoft products not working well, but this is Word 2003. Maybe it's been fixed by now. |
03-31-2010, 09:19 AM | #14 | |
Created Sigil, FlightCrew
Posts: 1,982
Karma: 350515
Join Date: Feb 2008
Device: Kobo Clara HD
|
Quote:
Removing the XML declaration makes Word just import it like any other HTML file. |
|
03-31-2010, 04:09 PM | #15 | |
Guru
Posts: 744
Karma: 2825929
Join Date: Feb 2007
Location: Fresno
Device: Kindle 1; iPad Air; iPhone 7; Kobo Libra; Kindle Oasis 3
|
Quote:
Jim |
|