#16 |
Grand Sorcerer
Posts: 28,591
Karma: 204624552
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
This plugin prints the xml snippet that bk.getmetadataxml() returns, prints the soup made from that snippet, adds the dc:language entry if not present, serializes the soup and prints the result, then finally writes the xml snippet back with bk.setmetadataxml().
You'll note that at no point in the process do any html or body tags get added.
Last edited by DiapDealer; 08-08-2019 at 11:58 AM. |
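For readers without Sigil at hand, the read / modify / write-back flow described above can be sketched with the standard library. This is only an illustration under stated assumptions: it swaps in xml.etree.ElementTree for sigil_bs4, and FakeBook is a hypothetical stand-in for the bk object that offers nothing but getmetadataxml() and setmetadataxml().

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OPF_NS = "http://www.idpf.org/2007/opf"

# Register the prefixes so serialization writes dc: and opf: instead of ns0:, ns1:
ET.register_namespace("dc", DC_NS)
ET.register_namespace("opf", OPF_NS)


class FakeBook:
    """Hypothetical stand-in for Sigil's bk object (assumption, not the real API)."""

    def __init__(self, metadata_xml):
        self._metadata = metadata_xml

    def getmetadataxml(self):
        return self._metadata

    def setmetadataxml(self, new_metadata):
        self._metadata = new_metadata


def ensure_language(bk, default_lang="en-US"):
    """Add a dc:language element only if the metadata has none, then write it back."""
    root = ET.fromstring(bk.getmetadataxml())
    if root.find("{%s}language" % DC_NS) is None:
        lang = ET.SubElement(root, "{%s}language" % DC_NS)
        lang.text = default_lang
    # encoding="unicode" returns a str and emits no XML declaration
    bk.setmetadataxml(ET.tostring(root, encoding="unicode"))


sample = (
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" '
    'xmlns:opf="http://www.idpf.org/2007/opf">'
    '<dc:identifier id="BookId" opf:scheme="UUID">'
    'urn:uuid:7967fadc-d511-42ee-aad1-a472e662546a</dc:identifier>'
    '<dc:title>[Title here]</dc:title>'
    '</metadata>'
)
book = FakeBook(sample)
ensure_language(book)
result = book.getmetadataxml()
print(result)
```

Because the fragment is parsed as XML rather than HTML, no html or body wrapper appears in the output, and a second call to ensure_language() leaves the existing dc:language element alone.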
#17 | |
Grand Sorcerer
Posts: 5,731
Karma: 24031401
Join Date: Dec 2010
Device: Kindle PW2
|
Quote:
|
|
#18 |
Grand Sorcerer
Posts: 28,591
Karma: 204624552
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
xmlprocessor.py also has examples of passing optional lists of relevant void tags to LXMLTreeBuilderForXML that are specific to xml file types, to assist in processing entire opf, ncx, and other xml file types.
And the LXMLTreeBuilderForXML approach is probably overkill for simple epub metadata work. You can accomplish the same thing with: Code:
    from sigil_bs4 import BeautifulSoup

    metadata = bk.getmetadataxml()
    metadata_soup = BeautifulSoup(metadata, "lxml-xml")

    # .
    # . stir the xml soup
    # .

    new_metadata = metadata_soup.decodexml(indent_level=0, formatter='minimal', indent_chars=" ")
    # or new_metadata = metadata_soup.decodexml() if you don't care about prettying.

Last edited by DiapDealer; 08-08-2019 at 03:34 PM. |
#19 | |
Banned
Posts: 168
Karma: 10010
Join Date: Oct 2018
Device: Tolino/PRS 650/Tablet
|
Quote:
So what's wrong with if not dc_language: ? It should not insert anything, just change the existing value.
Before: Code:
    <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
      <dc:identifier id="BookId" opf:scheme="UUID">urn:uuid:7967fadc-d511-42ee-aad1-a472e662546a</dc:identifier>
      <dc:language>de</dc:language>
      <dc:title>[Title here]</dc:title>
    </metadata>
After: Code:
    <?xml version="1.0" encoding="utf-8" ?>
    <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
      <dc:identifier id="BookId" opf:scheme="UUID">urn:uuid:7967fadc-d511-42ee-aad1-a472e662546a</dc:identifier>
      <dc:language>de</dc:language>
      <dc:title>[Title here]</dc:title>
      <dc:language>en-US</dc:language>
    </metadata>
What exceeds my Python abilities now is how the if not dc_language statement behaves. Debugging the code with print() shows that the content of the variable is None.
Vroni |
|
#20 |
Grand Sorcerer
Posts: 5,731
Karma: 24031401
Join Date: Dec 2010
Device: Kindle PW2
|
Both my bs4 + lxml suggestion and DiapDealer's bs4 + LXMLTreeBuilderForXML code snippets work as designed.
If they don't work on your machine, please post your code. I haven't tested DiapDealer's latest bs4 + lxml-xml suggestion, but, IIRC, if you're using the lxml-xml parser, you'll have to omit the dc: namespace prefix when using bs4 find: Code:
dc_language = metadata_soup.find('language') |
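Why the prefix matters can be seen in a small standalone sketch. Note the assumption: it uses the standard library's xml.etree.ElementTree, not sigil_bs4, and ElementTree's lookup rules differ from the bs4 behaviour described above. The point is only that dc: is an alias for a namespace URI, so every namespace-aware parser has its own rule for how such an element must be addressed.

```python
import xml.etree.ElementTree as ET

metadata = (
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:language>de</dc:language>'
    '</metadata>'
)
root = ET.fromstring(metadata)

# Map the prefix to its namespace URI for lookups
ns = {"dc": "http://purl.org/dc/elements/1.1/"}

by_prefix = root.find("dc:language", ns)   # prefix resolved through the ns mapping
by_uri = root.find("{http://purl.org/dc/elements/1.1/}language")  # fully qualified name
bare = root.find("language")               # no namespace: a different element name here

print(by_prefix.text, by_uri.text, bare)
```

In ElementTree the bare name finds nothing, while the post above reports that bs4 with lxml-xml wants exactly the bare name. Either way, printing the result of find() before acting on it shows quickly which spelling a given parser expects.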
#21 |
Grand Sorcerer
Posts: 28,591
Karma: 204624552
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Yes, the xml declaration hurts nothing and affects nothing. That's why I didn't mention it. It's not relevant to writing the metadata soup back to the epub. But once again, the xmlprocessor.py file that we keep trying to point people to for examples of how to parse/serialize pure xml with sigil_bs4 has an example of how to easily strip the xml header.
As for the logic of adding the dc:language element or not, it was only ever intended as a simple example of diddling the metadata via bs4. If it doesn't work, then change the logic. My sample was addressing the proper way to parse/serialize pure xml fragments in a Sigil plugin. It's up to you to figure out how best to modify the metadata soup. |
#22 |
Banned
Posts: 168
Karma: 10010
Join Date: Oct 2018
Device: Tolino/PRS 650/Tablet
|
Well, it might be the colon in dc:language which confuses your version, Diap, as this is not a plain tag but a tag with a namespace. It's just not found, that's why it adds a new one. Always.
This is my code now: Code:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import sys
    from sigil_bs4 import BeautifulSoup
    from sigil_bs4.builder._lxml import LXMLTreeBuilderForXML

    def run(bk):
        # xmlbuilder = LXMLTreeBuilderForXML(parser=None)
        metadata = bk.getmetadataxml()
        print('...')
        print(metadata)
        # metadata_soup = BeautifulSoup(metadata, features=None, from_encoding="utf-8", builder=xmlbuilder)
        metadata_soup = BeautifulSoup(bk.getmetadataxml(), 'lxml')
        print('...')
        print(metadata_soup)
        print('...')
        dc_language = metadata_soup.find({"dc:language"})
        print(dc_language)
        if dc_language is None:
            print('...')
            print('Creating new element')
            dc_language = metadata_soup.new_tag('dc:language')
            metadata_soup.metadata.append(dc_language)
        dc_language.string = 'en-US'
        new_metadata = metadata_soup.decodexml(indent_level=0, formatter='minimal', indent_chars=" ")[40:]
        print('...')
        print(new_metadata)
        bk.setmetadataxml(new_metadata)
        print('Done')
        return 0

    def main():
        print('I reached main when I should not have\n')
        return -1

    if __name__ == "__main__":
        sys.exit(main())
Code:
    <?xml version="1.0" encoding="utf-8" ?>
    <html>
    <body>
    <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf"><dc:identifier id="BookId" opf:scheme="UUID">urn:uuid:7967fadc-d511-42ee-aad1-a472e662546a</dc:identifier>
    <dc:title>[Title here]</dc:title>
    <dc:language>en-US</dc:language></metadata>
    </body>
    </html>
|
#23 |
Banned
Posts: 168
Karma: 10010
Join Date: Oct 2018
Device: Tolino/PRS 650/Tablet
|
So searching for language without the namespace works fine in Diap's code, and in addition I'm slicing the xml declaration away.
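A fixed slice such as [40:] only works while the declaration has exactly that length. A more tolerant way to drop it might look like the sketch below. This is an assumption on my part and not the code from xmlprocessor.py; strip_xml_declaration is a hypothetical helper name.

```python
import re

# Matches a leading XML declaration such as <?xml version="1.0" encoding="utf-8" ?>
XML_DECLARATION = re.compile(r'^\s*<\?xml[^>]*\?>\s*')


def strip_xml_declaration(xml_text):
    """Return xml_text without a leading XML declaration, if one is present."""
    return XML_DECLARATION.sub('', xml_text, count=1)


serialized = (
    '<?xml version="1.0" encoding="utf-8" ?>\n'
    '<metadata><dc:title>[Title here]</dc:title></metadata>'
)
fragment = strip_xml_declaration(serialized)
print(fragment)
```

Text that has no declaration passes through unchanged, so the helper can be applied unconditionally before bk.setmetadataxml().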
#24 |
Grand Sorcerer
Posts: 28,591
Karma: 204624552
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Now after all that, is when I'll mention that I feel that using bs4/lxml parsing/serializing for simple changes/additions to an epub's metadata is considerable overkill. Like using a scalpel to peel an orange. Unless I'm planning on writing a plugin that grants a user considerable autonomy over making complex metadata edits, I'm using a quick regex to make the change I need and moving on. But to each their own.
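For illustration only, a quick regex of that kind could look something like this. It is a sketch under assumptions, not necessarily the regex meant above: set_language is a hypothetical helper, and it assumes a dc:language element with plain text content and a closing </metadata> tag.

```python
import re

# Captures the opening and closing dc:language tags; the text between them is replaced
LANG_RE = re.compile(r'(<dc:language\b[^>]*>)[^<]*(</dc:language>)')


def set_language(metadata_xml, lang):
    """Change the first dc:language value, or add the element if there is none."""
    new_xml, count = LANG_RE.subn(
        lambda m: m.group(1) + lang + m.group(2), metadata_xml, count=1
    )
    if count == 0:
        # No dc:language found: insert one just before the closing metadata tag
        new_xml = metadata_xml.replace(
            '</metadata>', '<dc:language>%s</dc:language>\n</metadata>' % lang, 1
        )
    return new_xml


with_lang = (
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">\n'
    '<dc:language>de</dc:language>\n'
    '<dc:title>[Title here]</dc:title>\n'
    '</metadata>'
)
without_lang = with_lang.replace('<dc:language>de</dc:language>\n', '')

changed = set_language(with_lang, 'en-US')
added = set_language(without_lang, 'en-US')
print(changed)
print(added)
```

In a plugin the string would come from bk.getmetadataxml() and go back through bk.setmetadataxml(), with no parser or serializer involved.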
|
#25 |
Banned
Posts: 168
Karma: 10010
Join Date: Oct 2018
Device: Tolino/PRS 650/Tablet
|
Well, I think I have good knowledge of regex, but not of Python or BS, and this was a good chance to learn it.
And you never know how complex this plugin will be in 2 years. I have some ideas, but the constraint is time. |
#26 |
Grand Sorcerer
Posts: 28,591
Karma: 204624552
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Hey, I'm all for learning. I don't want to discourage anyone from broadening their knowledge.
#27 |
Banned
Posts: 168
Karma: 10010
Join Date: Oct 2018
Device: Tolino/PRS 650/Tablet
|
In addition, I remember thinking about doing it with regex when I started working on the idea, but then I realized that attributes can be in arbitrary order, such as
Code:
    <meta name="xyz" content="123">
Code:
    <meta content="123" name="xyz">
|
#28 |
Grand Sorcerer
Posts: 28,591
Karma: 204624552
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
You would only have to do two checks if you didn't know what you were looking for. Otherwise, you simply search for what matters ... regardless of position.
If a meta tag with the name "xyz" is what you need to find, then you use a regex that doesn't care in what order the name attribute appears: Code:
    <meta[^>]*(?=name=\"xyz\")[^>]*>
Code:
    <meta[^>]*(?=content=\"123\")[^>]*>
|
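A quick way to check the order-independence is to run the name pattern against both attribute orders in Python. This is a small sketch; the three sample tags are made up for the test, and the backslashes before the quotes are dropped because they are not needed inside a raw string.

```python
import re

# Same lookahead pattern as above, written as a raw string
NAME_XYZ = re.compile(r'<meta[^>]*(?=name="xyz")[^>]*>')

tags = [
    '<meta name="xyz" content="123">',   # name first
    '<meta content="123" name="xyz">',   # name last
    '<meta name="abc" content="123">',   # different name, should not match
]
matches = [bool(NAME_XYZ.search(tag)) for tag in tags]
print(matches)  # [True, True, False]
```

The lookahead only asserts that name="xyz" starts somewhere before the closing bracket, so whatever comes before or after it inside the tag is irrelevant.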
#29 |
Banned
Posts: 168
Karma: 10010
Join Date: Oct 2018
Device: Tolino/PRS 650/Tablet
|
I'm looking for calibre:series and need the content. Interested in your approach.
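One possible shape for such a lookup, as a rough sketch only: get_meta_content is a hypothetical helper, the series value is made-up sample data, and only double-quoted attribute values are handled. It collects each meta tag's attributes into a dict, so their order stops mattering.

```python
import re

META_TAG = re.compile(r'<meta\b[^>]*>')
ATTRIBUTE = re.compile(r'([\w:.-]+)\s*=\s*"([^"]*)"')


def get_meta_content(metadata_xml, name):
    """Return the content attribute of the first meta tag with the given name."""
    for tag in META_TAG.findall(metadata_xml):
        attrs = dict(ATTRIBUTE.findall(tag))
        if attrs.get('name') == name:
            return attrs.get('content')
    return None


# Made-up sample data with the attributes in both possible orders
sample_a = '<metadata><meta name="calibre:series" content="Example Series"/></metadata>'
sample_b = '<metadata><meta content="Example Series" name="calibre:series"/></metadata>'

print(get_meta_content(sample_a, 'calibre:series'))
print(get_meta_content(sample_b, 'calibre:series'))
```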
#30 | |
Grand Sorcerer
Posts: 5,731
Karma: 24031401
Join Date: Dec 2010
Device: Kindle PW2
|
Quote:
|
|