Putting a soup back to the metadata
Hi,
I would like to modify the metadata. I'm reading it with bk.getmetadataxml() and using Sigil's own BeautifulSoup: from sigil_bs4 import BeautifulSoup. But I'm getting an html and a body element wrapped around the metadata :angry: Even using the lxml parser doesn't change this behaviour. Is there any way (other than deleting <html>, <body> and the corresponding closing tags myself) to prevent this?
Once I have made my changes to the metadata I would like to write it back. serialize_xhtml inserts unwanted elements as well. Is there another way, or do I need to use prettify() to get the metadata as a string and write it back via setmetadataxml(string)? I guess setmetadataxml does not accept a soup...
vroni
bk.getmetadataxml() returns a utf-8 encoded xml fragment from <metadata> to </metadata>.
bk.setmetadataxml() expects a similar utf-8 encoded xml fragment in return. All the processing that happens between those two calls is entirely up to you. But in my opinion neither serialize_xhtml() nor sigil_bs4's xhtml parser in general would be a wise choice for processing the data, considering that the opf file and the resulting metadata fragment are not, in fact, xhtml.
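To make the round trip described above concrete, here is a minimal sketch of a plugin body that reads the fragment, hands it to a processing step, and writes the result back. Only bk.getmetadataxml() and bk.setmetadataxml() come from this thread; FakeBook, rewrite_metadata and the string replacement are hypothetical stand-ins so the sketch can be run outside Sigil, and the fragment is assumed to arrive as a Python str.

```python
class FakeBook:
    """Hypothetical stand-in for Sigil's bk object (assumption, for testing only)."""

    def __init__(self, metadata_xml):
        self._metadata_xml = metadata_xml

    def getmetadataxml(self):
        return self._metadata_xml

    def setmetadataxml(self, new_xml):
        self._metadata_xml = new_xml


def rewrite_metadata(bk, transform):
    """Read the metadata fragment, apply transform, and write it back."""
    original = bk.getmetadataxml()
    updated = transform(original)
    # Refuse to write anything that is no longer a <metadata>...</metadata>
    # fragment, e.g. output that an html parser wrapped in <html><body>.
    stripped = updated.strip()
    if not (stripped.startswith("<metadata") and stripped.endswith("</metadata>")):
        raise ValueError("result is not a <metadata> fragment")
    bk.setmetadataxml(updated)
    return updated


def run(bk):
    # Placeholder processing step; a real plugin would use an xml parser here.
    rewrite_metadata(bk, lambda xml: xml.replace("Old Title", "New Title"))
    return 0  # Sigil plugins return 0 on success
```

The shape check is just a cheap guard against the html/body wrapping discussed in this thread; it is not a substitute for parsing the fragment properly.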
You mean I have to parse it myself? Phew, that would be the end of the development. I just want to alter one meta entry and maybe add another one, while keeping all the others. I don't want to do it via regex.
I already have the correct soup, but with these nasty html and body elements around it. Hmm, wasn't there a simple xml parser available?
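The edit described above (change one meta entry, add another, keep everything else) can also be sketched with Python's standard xml.etree.ElementTree. This is a swapped-in technique, not something suggested in this thread, and the sample fragment, the "cover" entry and the "example:flag" entry are made-up illustrations of what the metadata might contain.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OPF = "http://www.idpf.org/2007/opf"

# Register the prefixes so the output keeps dc: and opf: instead of ns0:, ns1:.
ET.register_namespace("dc", DC)
ET.register_namespace("opf", OPF)

# Assumed example of a metadata fragment; real fragments will differ.
SAMPLE = (
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" '
    'xmlns:opf="http://www.idpf.org/2007/opf">\n'
    '  <dc:title>Old Title</dc:title>\n'
    '  <dc:creator opf:role="aut">Jane Doe</dc:creator>\n'
    '  <meta name="cover" content="cover.jpg"/>\n'
    '</metadata>'
)


def update_metadata(fragment):
    """Change the cover meta entry, append one new meta entry, keep the rest."""
    root = ET.fromstring(fragment)
    for meta in root.findall("meta"):
        if meta.get("name") == "cover":
            meta.set("content", "new-cover.jpg")
    extra = ET.SubElement(root, "meta")
    extra.set("name", "example:flag")  # hypothetical entry name
    extra.set("content", "yes")
    # encoding="unicode" returns a str without an xml declaration.
    return ET.tostring(root, encoding="unicode")


result = update_metadata(SAMPLE)
```

One caveat with this approach: ElementTree only writes namespace declarations for namespaces that are actually used in the tree, so an unused xmlns declaration on the original metadata element would not survive the round trip.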
BS4 can use a pure xml parser such as lxml. There is also a built-in QuickParser (see the test plugin example and the epub3itizer plugin for examples of QuickParser use) which can happily parse fragments of xml or xhtml.
Hi Kevin,
thanks for the QuickParser hint. Regarding BS4, the lxml parser adds html and body elements as well (which I don't understand), as I wrote in my first post.
No, you need to tell lxml to use an xml parser and an xml serializer with bs4.
Check out Sigil/src/Resource_Files/python3lib/xmlprocessor.py for examples. For instance, performOPFUpdates in that file shows how to use an xmlbuilder to parse pure xml for bs4 and how to serialize it back using decodexml.
@Vroni:
The following minimal code should get you started: Code:
#!/usr/bin/env python |
Technically, I think the builder should be set to lxml-xml, or even just xml, if you do not manually set the TreeBuilder as is done in xmlprocessor.py.
The key is to make sure you use etree.XMLParser via lxml.
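As a rough illustration of the "xml" builder mentioned above, here is a sketch using the standalone bs4 package with lxml installed. Inside a Sigil plugin the import would be sigil_bs4 instead, and its serializer (decodexml in xmlprocessor.py) may behave differently, so treat this as an assumption-laden sketch rather than Sigil's own method. The sample fragment and the "example:flag" entry are made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: standalone bs4 with lxml installed

# Assumed example of a metadata fragment; real fragments will differ.
SAMPLE = (
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" '
    'xmlns:opf="http://www.idpf.org/2007/opf">'
    '<dc:title>Old Title</dc:title>'
    '<meta name="cover" content="cover.jpg"/>'
    '</metadata>'
)


def edit_with_bs4(fragment):
    """Parse as pure xml, change one meta entry, add another, return a string."""
    # "xml" selects lxml's XML parser, so no <html>/<body> wrapper is added.
    soup = BeautifulSoup(fragment, "xml")
    cover = soup.find("meta", attrs={"name": "cover"})
    if cover is not None:
        cover["content"] = "new-cover.jpg"
    new_meta = soup.new_tag(
        "meta", attrs={"name": "example:flag", "content": "yes"}  # hypothetical entry
    )
    soup.metadata.append(new_meta)
    # Serialize only the <metadata> element, not the whole document,
    # which would otherwise start with an xml declaration.
    return str(soup.metadata)


out = edit_with_bs4(SAMPLE)

# For contrast: the html builder is what wraps the fragment in <html><body>.
wrapped_by_html_parser = BeautifulSoup(SAMPLE, "lxml").find("body") is not None
```

The string in out could then be passed to bk.setmetadataxml(), since that call expects the fragment as text rather than a soup.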
Note lxml will parse both pure xml and html. You have to tell it which one to use by telling it which builder or parser to use. And you should also use an appropriate serializer. If you try to use an html parser and serializer on a pure xml fragment you will end up with exactly the error you reported.
KevinH
The problem is the parser (at this point). Putting debugging prints in the code, I can see that the html/body is already inserted into the soup by the parser.
Vroni
I didn't get any error message, but let's see where the difference is between doitsos' code and mine.