Putting a soup back to the metadata - Page 2

DiapDealer · 08-08-2019, 12:25 PM

This plugin prints the xml snippet that bk.getmetadataxml() returns, prints the soup made from that snippet, adds the dc:language entry if not present, serializes the soup and prints the results, then ultimately writes the xml snippet back with bk.setmetadataxml().

You'll note that at no point in the process do any html or body tags get added.

Doitsu · 08-08-2019, 01:00 PM

Quote:

Originally Posted by DiapDealer

This plugin prints the xml snippet that bk.getmetadataxml() returns, prints the soup made from that snippet, adds the dc:language entry if not present, serializes and prints the soup, then ultimately writes the xml snippet back with bk.setmetadataxml().

You'll note that at no point in the process do any html or body tags get added.

Thanks for the code! You might want to add it to the Sigil API Framework documentation, because LXMLTreeBuilderForXML is somewhat "underdocumented."

DiapDealer · 08-08-2019, 01:09 PM

xmlprocessor.py also has examples of passing optional lists of relevant void tags to LXMLTreeBuilderForXML that are specific to xml file types, to assist in processing entire opf, ncx, and other xml files.

And the LXMLTreeBuilderForXML approach is probably overkill for simple epub metadata work. You can accomplish the same thing with:

Code:

from sigil_bs4 import BeautifulSoup 

metadata = bk.getmetadataxml()
metadata_soup = BeautifulSoup(metadata, "lxml-xml")
# ... stir the xml soup ...
new_metadata = metadata_soup.decodexml(indent_level=0, formatter='minimal', indent_chars="  ")
# or new_metadata = metadata_soup.decodexml() if you don't care about prettying.

The point is to avoid html parsers and (x)html serializers.

Vroni · 08-09-2019, 06:16 AM

Quote:

Originally Posted by DiapDealer

This plugin prints the xml snippet that bk.getmetadataxml() returns, prints the soup made from that snippet, adds the dc:language entry if not present, serializes the soup and prints the results, then ultimately writes the xml snippet back with bk.setmetadataxml().

That's the expected result, but it adds a second (third, fourth, and so on) dc:language element every time.

According to the documentation, find() returns None if it finds nothing, and in that case the code adds the element and sets the language to en-US.

So what's wrong with if not dc_language: ? It should not insert anything, just change the existing element.

Before:

Code:

  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:identifier id="BookId" opf:scheme="UUID">urn:uuid:7967fadc-d511-42ee-aad1-a472e662546a</dc:identifier>
    <dc:language>de</dc:language>
    <dc:title>[Title here]</dc:title>
  </metadata>

After:

Code:

<?xml version="1.0" encoding="utf-8" ?>
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
  <dc:identifier id="BookId" opf:scheme="UUID">urn:uuid:7967fadc-d511-42ee-aad1-a472e662546a</dc:identifier>
  <dc:language>de</dc:language>
  <dc:title>[Title here]</dc:title>
  <dc:language>en-US</dc:language>
</metadata>

By the way, the parser adds the xml declaration at the start. At least that doesn't mess up the content.opf file.

What exceeds my Python abilities now is why the if not dc_language check keeps firing. Debugging the code with print() shows that the content of the variable is None.

Vroni

Doitsu · 08-09-2019, 07:09 AM

Quote:

Originally Posted by Vroni

So what's wrong with if not dc_language: ?

Both my bs4 + lxml suggestion and DiapDealer's bs4 + LXMLTreeBuilderForXML code snippet work as designed.

If they don't work on your machine, please post your code.

I haven't tested DiapDealer's latest bs4 + lxml-xml suggestion, but, IIRC, if you're using the lxml-xml parser, you'll have to omit the dc: namespace prefix when using bs4 find:

Code:

dc_language = metadata_soup.find('language')
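
Whichever parser is used, the idea is the same: look the element up in a way that really matches it, and add a new one only when the lookup returns nothing. The sketch below illustrates that with the standard library's xml.etree.ElementTree, used here purely as a stand-in so it can run outside Sigil; it is not sigil_bs4, and the helper name ensure_language and the default "en-US" are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OPF_NS = "http://www.idpf.org/2007/opf"

# Keep the familiar dc:/opf: prefixes when serializing.
ET.register_namespace("dc", DC_NS)
ET.register_namespace("opf", OPF_NS)


def ensure_language(metadata_xml, default="en-US"):
    """Return the metadata XML, adding dc:language only if none exists.

    Hypothetical helper for illustration; not part of the Sigil API.
    """
    root = ET.fromstring(metadata_xml)
    # Look the element up by its namespace-qualified name, not its prefix.
    qualified = "{%s}language" % DC_NS
    if root.find(qualified) is None:
        lang = ET.SubElement(root, qualified)
        lang.text = default
    return ET.tostring(root, encoding="unicode")


SAMPLE = (
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" '
    'xmlns:opf="http://www.idpf.org/2007/opf">'
    '<dc:language>de</dc:language>'
    '<dc:title>[Title here]</dc:title>'
    '</metadata>'
)

result = ensure_language(SAMPLE)
```

With the sample above, the existing de entry is left alone and no second dc:language element is appended.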

DiapDealer · 08-09-2019, 07:23 AM

Yes, the xml declaration hurts nothing and affects nothing. That's why I didn't mention it. It's not relevant to writing the metadata soup back to the epub. But once again, the xmlprocessor.py file that we keep trying to point people to for examples of how to parse/serialize pure xml with sigil_bs4 has an example of how to easily strip the xml header.

As for the logic of adding the dc:language element or not; it was only ever intended as a simple example of diddling the metadata via bs4. If it doesn't work, then change the logic. My sample was addressing the proper way to parse/serialize pure xml fragments in a Sigil plugin. It's up to you to figure out how best to modify the metadata soup.

Vroni · 08-09-2019, 07:59 AM

Well, it might be the colon in dc:language that confuses your version, Diap, as this is not a plain tag but a tag with a namespace prefix. It's just not found; that's why it adds a new one. Always.

This is my code now:

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
from sigil_bs4 import BeautifulSoup
from sigil_bs4.builder._lxml import LXMLTreeBuilderForXML


def run(bk):
#    xmlbuilder = LXMLTreeBuilderForXML(parser=None)
    metadata = bk.getmetadataxml()
    print('...')
    print(metadata)
#    metadata_soup = BeautifulSoup(metadata, features=None, from_encoding="utf-8", builder=xmlbuilder)
    metadata_soup = BeautifulSoup(bk.getmetadataxml(), 'lxml')
    print('...')
    print(metadata_soup)
    print('...')    
    dc_language = metadata_soup.find({"dc:language"})
    print(dc_language)
   
    if dc_language is None:
        print('...')  
        print('Creating new element')
        dc_language = metadata_soup.new_tag('dc:language')
        metadata_soup.metadata.append(dc_language)
    dc_language.string = 'en-US'
    new_metadata = metadata_soup.decodexml(indent_level=0, formatter='minimal', indent_chars="  ")[40:]
    print('...')
    print(new_metadata)
    
    bk.setmetadataxml(new_metadata)
    print('Done')
    return 0

def main():
    print('I reached main when I should not have\n')
    return -1

if __name__ == "__main__":
    sys.exit(main())

If I use Doitsu's version, I get this:

Code:

<?xml version="1.0" encoding="utf-8" ?>
<html>
<body>
 <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf"><dc:identifier id="BookId" opf:scheme="UUID">urn:uuid:7967fadc-d511-42ee-aad1-a472e662546a</dc:identifier>
 <dc:title>[Title here]</dc:title>
 <dc:language>en-US</dc:language></metadata>
</body>
</html>

Vroni · 08-09-2019, 08:12 AM

So searching for language without the namespace prefix works fine in Diap's code, and in addition I'm slicing the xml declaration away.

For this, it's a good starting point!
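
Slicing off a fixed number of characters only works while the declaration has exactly that length. A slightly more defensive sketch, using only Python's re module, is shown below. The helper name strip_xml_declaration is hypothetical, and this is not claimed to be the approach used in xmlprocessor.py.

```python
import re

# Matches an optional leading XML declaration plus surrounding whitespace.
_XML_DECL = re.compile(r'^\s*<\?xml[^>]*\?>\s*')


def strip_xml_declaration(xml_text):
    """Remove a leading <?xml ... ?> declaration if one is present.

    Hypothetical helper for illustration; text without a declaration
    is returned unchanged.
    """
    return _XML_DECL.sub('', xml_text, count=1)


with_decl = '<?xml version="1.0" encoding="utf-8" ?>\n<metadata></metadata>'
without_decl = '<metadata></metadata>'

stripped = strip_xml_declaration(with_decl)
untouched = strip_xml_declaration(without_decl)
```

The result could then be passed to bk.setmetadataxml() in place of the [40:] slice.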

DiapDealer · 08-09-2019, 09:24 AM

Now after all that, is when I'll mention that I feel that using bs4/lxml parsing/serializing for simple changes/additions to an epub's metadata is considerable overkill. Like using a scalpel to peel an orange. Unless I'm planning on writing a plugin that grants a user considerable autonomy over making complex metadata edits, I'm using a quick regex to make the change I need and moving on. But to each their own.

Vroni · 08-09-2019, 09:42 AM

Well, I think I have a good knowledge of regex, but not of Python or BS, and this was a good chance to learn.

And you never know how complex this plugin will be in two years.

I have some ideas, but the constraint is time.

DiapDealer · 08-09-2019, 10:42 AM

Hey, I'm all for learning. I don't want to discourage anyone from broadening their knowledge.

Vroni · 08-10-2019, 02:18 AM

In addition, I remember thinking about doing it with regex when I started on the idea, but then I realized that attributes can appear in arbitrary order, such as:

Code:

<meta name="xyz" content="123">

Code:

<meta content="123" name="xyz">

This would have required more code: checking for the first variant and, if it's not found, trying the second variant to see if it's present.

DiapDealer · 08-10-2019, 10:43 AM

You would only have to do two checks if you didn't know what you were looking for. Otherwise, you simply search for what matters ... regardless of position.

If a meta tag with the name "xyz" is what you need to find, then you use a regex that doesn't care in what order the name attribute appears:

Code:

<meta[^>]*(?=name=\"xyz\")[^>]*>

If it's the content attribute that you're looking to match, then it's:

Code:

<meta[^>]*(?=content=\"123\")[^>]*>

Not trying to discourage you from using bs4/lxml for pure xml, just trying to point out that unless you've got a lot of complicated metadata editing to do, a simple find and replace could turn out to be much simpler and use fewer lines of code.
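
A quick way to check that the lookahead pattern really ignores attribute order is to run it against both variants from the earlier post. This is a minimal sketch with Python's re module; the sample strings and variable names are made up for the test.

```python
import re

# Same pattern as above; the quotes need no escaping in a raw string.
NAME_XYZ = re.compile(r'<meta[^>]*(?=name="xyz")[^>]*>')

samples = [
    '<meta name="xyz" content="123">',   # name first
    '<meta content="123" name="xyz">',   # name last
    '<meta name="abc" content="123">',   # different name, should not match
]

matches = [bool(NAME_XYZ.search(s)) for s in samples]
print(matches)  # -> [True, True, False]
```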

Vroni · 08-10-2019, 12:47 PM

I'm looking for calibre:series and need the content. Interested in your approach.

Doitsu · 08-10-2019, 03:53 PM

Quote:

Originally Posted by Vroni

I'm looking for calibre:series and need the content. Interested in your approach.

You might find KevinH's ePub3-itizer plugin helpful. It contains code to convert custom Calibre metadata entries to EPUB3 metadata entries. (Have a look at _convertOpf() in opf_converter.py.)
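
For the regex route discussed earlier, pulling out the content value can be sketched with two lookaheads, so that name and content may appear in either order. This assumes the entry has the form <meta name="calibre:series" content="..."/> with double-quoted attributes and no ">" inside the values; the helper name get_calibre_series is hypothetical and the code is not taken from ePub3-itizer.

```python
import re

# Both lookaheads start right after "<meta", so attribute order is irrelevant.
# The group inside the second lookahead captures the content value.
SERIES_META = re.compile(
    r'<meta\b'
    r'(?=[^>]*\bname="calibre:series")'
    r'(?=[^>]*\bcontent="([^"]*)")'
    r'[^>]*>'
)


def get_calibre_series(metadata_xml):
    """Return the calibre:series content value, or None if there is none.

    Hypothetical helper for illustration only.
    """
    match = SERIES_META.search(metadata_xml)
    return match.group(1) if match else None


name_first = '<meta name="calibre:series" content="Discworld"/>'
content_first = '<meta content="Discworld" name="calibre:series"/>'
index_only = '<meta name="calibre:series_index" content="3"/>'

found_a = get_calibre_series(name_first)
found_b = get_calibre_series(content_first)
found_c = get_calibre_series(index_only)
```

Because the closing quote is part of the name test, a calibre:series_index entry is not mistaken for the series entry.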
