MobileRead Forums

Vroni 08-04-2019 10:34 AM

Putting a soup back to the metadata
 
Hi,

I would like to modify the metadata, so I'm reading it with bk.getmetadataxml() and using Sigil's own BeautifulSoup:

from sigil_bs4 import BeautifulSoup

But I'm getting an html and a body element wrapped around the metadata :angry:

Even using the lxml parser doesn't change this behaviour. Is there any way (other than deleting <html>, <body> and the corresponding closing tags myself) to prevent this?

Once I have made my changes to the metadata, I would like to write it back. serialize_xhtml inserts unwanted elements as well. Is there another way, or do I need to use prettify() to get the metadata as a string and write it back via setmetadataxml(string)? I guess setmetadataxml does not accept a soup...

vroni

DiapDealer 08-04-2019 01:32 PM

bk.getmetadataxml() returns a utf-8 encoded xml fragment from <metadata> to </metadata>.

bk.setmetadataxml() expects a similar utf-8 encoded xml fragment in return.

All the processing that happens in between those two events is entirely up to you. But neither serialize_xhtml() nor sigil_bs4's xhtml parser in general would be a wise choice, in my opinion, for processing the data, considering that the opf file and the resulting metadata fragment are not, in fact, xhtml.
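
As an illustration of the "fragment in, fragment out" round trip described above, here is a minimal sketch that treats the fragment as pure xml. It deliberately uses the Python standard library's xml.etree.ElementTree instead of sigil_bs4, so it can run outside Sigil. The helper name set_language and the sample fragment are hypothetical, and the exact shape of what bk.getmetadataxml() returns is assumed, not confirmed.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OPF_NS = "http://www.idpf.org/2007/opf"

# Keep the familiar prefixes when serializing (otherwise ns0, ns1, ...).
ET.register_namespace("dc", DC_NS)
ET.register_namespace("opf", OPF_NS)


def set_language(metadata_xml, lang):
    """Hypothetical helper: set dc:language to `lang`, adding it if missing.

    Takes a <metadata>...</metadata> fragment as a string and returns the
    modified fragment as a string. All other children are kept.
    """
    root = ET.fromstring(metadata_xml)           # pure xml parse, no wrappers
    node = root.find("{%s}language" % DC_NS)
    if node is None:
        node = ET.SubElement(root, "{%s}language" % DC_NS)
    node.text = lang
    return ET.tostring(root, encoding="unicode")


# Assumed sample; a real fragment from Sigil may look different.
sample = (
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" '
    'xmlns:opf="http://www.idpf.org/2007/opf">'
    '<dc:title>Example</dc:title>'
    '<dc:language>de</dc:language>'
    '</metadata>'
)
result = set_language(sample, "en-US")
print(result)
```

Inside a plugin's run(bk), this would presumably be used as bk.setmetadataxml(set_language(bk.getmetadataxml(), 'en-US')). Note that ElementTree discards xml comments and omits namespace declarations that are not used by any element or attribute, so this is only a starting point, not a drop-in replacement for the approach in xmlprocessor.py.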

Vroni 08-04-2019 03:32 PM

You mean I have to parse it myself? Phew, that would be the end of the development. I just want to alter one meta entry and maybe add another one, while keeping all the others. I don't want to do it via regex.

I already have the correct soup, but with these nasty html and body elements around it.

Hmm, wasn't there a simple xml parser available?

KevinH 08-04-2019 03:57 PM

BS4 can use a pure xml parser such as lxml. There is also a built-in QuickParser (see the test plugin example and the epub3itizer plugin for examples of QuickParser use), which can happily parse fragments of xml or xhtml.

Vroni 08-04-2019 04:37 PM

Hi Kevin,

thanks for the QuickParser hint.

Regarding BS4, the lxml parser adds html and body elements as well (which I don't understand), as written in my first post.

KevinH 08-04-2019 04:40 PM

No, you need to tell lxml to use an xml parser and an xml serializer with bs4.

Check out Sigil/src/Resource_Files/python3lib/xmlprocessor.py for examples.

For example, performOPFUpdates in that file shows how to use an xmlbuilder to parse pure xml for bs4 and how to serialize it back using decodexml.

Doitsu 08-04-2019 06:57 PM

@Vroni:

The following minimal code should get you started:

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
from sigil_bs4 import BeautifulSoup

def run(bk):
    # parse the <metadata> fragment handed over by Sigil
    metadata_soup = BeautifulSoup(bk.getmetadataxml(), 'lxml')
    # reuse an existing dc:language entry, or create one if it is missing
    dc_language = metadata_soup.find('dc:language')
    if not dc_language:
        dc_language = metadata_soup.new_tag('dc:language')
        metadata_soup.metadata.append(dc_language)
    dc_language.string = 'en-US'
    # serialize the soup and hand the result back to Sigil
    new_metadata = str(metadata_soup.prettyprint_xhtml())
    bk.setmetadataxml(new_metadata)
    print('Done')
    return 0

def main():
    # plugins are started by Sigil via run(bk), never via main()
    print('I reached main when I should not have\n')
    return -1

if __name__ == "__main__":
    sys.exit(main())

It'll change the language code to en-US or add a new en-US language metadata entry.

KevinH 08-04-2019 08:36 PM

Technically, I think the builder should be set to lxml-xml or even just xml if you do not manually set the TreeBuilder, as is done in xmlprocessor.py.

The key is to make sure you use etree.XMLParser via lxml.

Vroni 08-08-2019 09:32 AM

Quote:

Originally Posted by Doitsu (Post 3875398)
@Vroni:

The following minimal code should get you started:

Hi, thx for the example. I'm pretty sure I tested my code with lxml as the parser and got the html and body elements as well. But I will try this again and see what I did wrong as soon as my Sigil installation is working again :)

DiapDealer 08-08-2019 09:51 AM

Quote:

Originally Posted by Vroni (Post 3876536)
Hi, thx for the example. I'm pretty sure I tested my code with lxml as the parser and got the html and body elements as well.

If so, then I'll once again point out Kevin's suggestion of using xmlprocessor.py as an example of parsing/serializing pure xml with the tools available to Sigil plugins. The file can be found in the Sigil/python3lib folder of a Windows installation of Sigil, or in the src/Resource_Files/python3lib folder of the Sigil source code.

KevinH 08-08-2019 10:16 AM

Note that lxml will parse both pure xml and html. You have to tell it which one to use by telling it which builder or parser to use, and you should also use an appropriate serializer. If you try to use an html parser and serializer on a pure xml fragment, you will end up with exactly the error you reported.
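
One way to catch a wrong parser or serializer before anything is written back is to check the root element of the serialized result. The following is only a sketch under that assumption: it uses the standard library's xml.etree.ElementTree, and the helper names root_local_name and looks_like_metadata_fragment are made up for illustration and are not part of Sigil's plugin API.

```python
import xml.etree.ElementTree as ET


def root_local_name(xml_text):
    """Return the root element's tag name without any {namespace} part."""
    root = ET.fromstring(xml_text)
    return root.tag.rsplit("}", 1)[-1]


def looks_like_metadata_fragment(xml_text):
    """True if xml_text is well-formed xml whose root element is <metadata>.

    An html parser/serializer typically produces <html> as the root
    instead, which this check rejects.
    """
    try:
        return root_local_name(xml_text) == "metadata"
    except ET.ParseError:
        return False


good = '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/"><dc:title>T</dc:title></metadata>'
wrapped = '<html><body><metadata><title>T</title></metadata></body></html>'

print(looks_like_metadata_fragment(good))
print(looks_like_metadata_fragment(wrapped))
```

A plugin could call such a check on the string it is about to pass to bk.setmetadataxml() and stop with a message if the check fails.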

KevinH


Vroni 08-08-2019 10:37 AM

The problem is the parser (at this point). Putting debugging prints in the code, I can see that the html/body is already inserted into the soup by the parser.

Vroni

DiapDealer 08-08-2019 10:45 AM

Quote:

Originally Posted by Vroni (Post 3876555)
The problem is the parser (at this point). Putting debugging prints in the code, I can see that the html/body is already inserted into the soup by the parser.

Vroni

Then you're not configuring the parser correctly.

Vroni 08-08-2019 11:00 AM

I didn't get any error message, but let's see what the difference is between Doitsu's code and mine.

DiapDealer 08-08-2019 11:42 AM

Quote:

Originally Posted by Vroni (Post 3876563)
but let's see what the difference is between Doitsu's code and mine

And then maybe look at the code that both of Sigil's maintainers are trying really, really hard to steer you toward. ;)


All times are GMT -4.