[GUI Plugin] X-Ray Creator - Page 15

Shark69 · 08-23-2018, 12:47 PM

I am sorry. I realized that the files did not work. In addition, we cannot attach copyrighted files, so I deleted them. There were two main problems. The first was that the plugin is not very good at recovering character aliases from the X-Ray file, so the list has to be cleaned up manually. The second and more important one is the quality of the ebook. It's quite poor, and the plugin does not work well with files that have strange data in the HTML tags. The problem could be solved with a few hours of work to regenerate a cleaner ebook, but I don't know if it is worth it.

KloudZ · 08-23-2018, 12:52 PM

Quote:

Originally Posted by Shark69

I am sorry. I realized that the files did not work. In addition, we cannot attach copyrighted files, so I deleted them. There were two main problems. The first was that the plugin is not very good at recovering character aliases from the X-Ray file, so the list has to be cleaned up manually. The second and more important one is the quality of the ebook. It's quite poor, and the plugin does not work well with files that have strange data in the HTML tags. The problem could be solved with a few hours of work to regenerate a cleaner ebook, but I don't know if it is worth it.

Sounds complicated. Thanks anyway!

Bulu009 · 08-31-2018, 04:18 PM

I have the following error. I am on KDE Neon 5.13.4 and Calibre 3.30. Please help me.

Starting job: Creating Files

09-01-2018 01:38:18 Initializing...
09-01-2018 01:38:18 Echo Burning - Lee Child
09-01-2018 01:38:18 Parsing Goodreads data...
Job: "Creating Files" failed with error:
Traceback (most recent call last):
File "site-packages/calibre/gui2/threaded_jobs.py", line 84, in start_work
File "calibre_plugins.xray_creator.lib.xray_creator ", line 284, in create_files_event
File "calibre_plugins.xray_creator.lib.book", line 223, in create_files_event
File "calibre_plugins.xray_creator.lib.book", line 443, in _parse_goodreads_data
File "calibre_plugins.xray_creator.lib.goodreads_parser ", line 40, in parse
File "calibre_plugins.xray_creator.lib.goodreads_parser ", line 50, in _get_xray
File "calibre_plugins.xray_creator.lib.goodreads_parser ", line 254, in _get_quotes
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201c' in position 0: ordinal not in range(256)

Called with args: (,) {u'abort': , u'notifications': , u'log': }
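
The last line of the traceback can be reproduced on its own. The snippet below is only a minimal sketch in Python 3 syntax (the log itself shows Python 2 style `u'...'` literals), and `try_encode` is a hypothetical helper, not part of the plugin. Where exactly the plugin encodes to Latin-1 is not visible in the log, so this shows the mechanism rather than the plugin's code path.

```python
# U+201C LEFT DOUBLE QUOTATION MARK is the character named in the error.
quote = '\u201c'

def try_encode(text, codec):
    """Return the encoded bytes, or a short failure note (hypothetical helper)."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError as exc:
        return 'failed: %s' % exc.reason

# Latin-1 only covers code points 0-255, so the curly quote cannot be encoded.
print(try_encode(quote, 'latin-1'))
# UTF-8 can represent every Unicode code point, so this succeeds.
print(try_encode(quote, 'utf-8'))
```

Curly quotes are common in quotation text, so any Latin-1 encoding step along the way would plausibly fail like this.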

szarroug3 · 08-31-2018, 11:30 PM

Hey everyone, so sorry I've been AWOL. I just don't have the time to work on this plugin anymore. I did just make a super minor fix that may have fixed some issues for you guys where some text was put into <i> tags instead of <p> tags. I'm not sure it'll actually help many of you but it might.

Shark69 · 09-01-2018, 04:44 AM

Thanks... New version has test files inside... FYI

Shark69 · 09-01-2018, 12:26 PM

Is this correct?
PARAGRAPH_PAT = re.compile(r'<(p|i|h\d) .*?>.+?(?:<\/\1)', re.I)
Because of the blank after the tag name, paragraphs like
<p>Hello!</p>
are not caught... I think....
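
A quick check of the point above, as a sketch: `POSTED_PAT` is the pattern exactly as quoted, while `RELAXED_PAT` is a hypothetical variant written only for comparison, not the plugin's actual fix.

```python
import re

# Pattern as posted: the literal space after the tag name is mandatory.
POSTED_PAT = re.compile(r'<(p|i|h\d) .*?>.+?(?:<\/\1)', re.I)

# Hypothetical relaxed variant: \b ends the tag name (so <pre> or <img> are
# not mistaken for <p> or <i>) and [^>]* makes the attributes optional.
RELAXED_PAT = re.compile(r'<(p|i|h\d)\b[^>]*>.+?</\1>', re.I)

samples = ['<p>Hello!</p>', '<p class="x">Hello!</p>', '<h2>Title</h2>']
for s in samples:
    # Show whether each pattern finds the sample paragraph.
    print(s, bool(POSTED_PAT.search(s)), bool(RELAXED_PAT.search(s)))
```

With the posted pattern only the sample that has an attribute is found. The relaxed variant finds all three.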

szarroug3 · 09-02-2018, 12:06 AM

Woah you are fast haha. I got rid of the test files and completely refactored the book parsing algorithm.. Now it uses all regex instead of a mix of regex and some other algorithms. It's much more accurate as far as I can tell and this way, I don't have to encode/decode the html which should make it work better for books with non-ascii characters.

szarroug3 · 09-02-2018, 12:07 AM

dammit, i still have an old pyc file in there!

szarroug3 · 09-02-2018, 12:20 AM

okay i removed the test files for realsies this time.. or did i?!

szarroug3 · 09-02-2018, 12:32 AM

So I just noticed it won't catch instances where there's a non-whitespace, non-alphabetic character right before the word, i.e. "Armansky won't be caught, because the Kindle would highlight the " as well as the word. Guess that's more regex work for me haha

Edit: now that I think about it, I think quotes are the only valid case for this. I'm not going to attempt to catch typos like forgetting a space after a period or comma. Nothing else coming before the word would make sense other than a quote of some type so guess i'll just make it check for those as well.

Edit2: I guess catching everything until the previous whitespace is another easy option. That way if it is a typo, people can still use it.. Decisions decisions.

Edit3: I decided to go with matching anything that's connected to the word, back to the previous whitespace. If someone gives me good reason to change this, I will.
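
The option chosen in Edit3 could look roughly like this. This is only a sketch of the idea, not the plugin's code: `find_with_prefix` is a hypothetical name, and what should happen to characters after the word is not covered in the post, so the sketch ignores them.

```python
import re

def find_with_prefix(text, word):
    """Find `word` plus anything attached in front of it, back to whitespace."""
    # \S* grabs any run of non-whitespace characters directly before the word.
    pat = re.compile(r'\S*' + re.escape(word))
    return [(m.start(), m.group(0)) for m in pat.finditer(text)]

hits = find_with_prefix('He said "Armansky left. Then,Armansky sat.', 'Armansky')
for start, matched in hits:
    print(start, matched)
```

Both the quoted occurrence and the missing-space typo are picked up, with the leading characters included in the matched span.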

Shark69 · 09-02-2018, 04:51 AM

Thanks for the new version. I have to study it more carefully because a lot has changed in the parsing, but I've found a problem that did not exist in the prior version. The count field in the entity table of the sqlite asc file is no longer updated. Another thing... names beginning or ending with a non-ASCII character, such as "René" or "Ángel", are not found.

szarroug3 · 09-05-2018, 02:24 AM

Quote:

Originally Posted by Shark69

Thanks for the new version. I have to study it more carefully because a lot has changed in the parsing, but I've found a problem that did not exist in the prior version. The count field in the entity table of the sqlite asc file is no longer updated. Another thing... names beginning or ending with a non-ASCII character, such as "René" or "Ángel", are not found.

Fixed the count thing along with a few other unrelated minor things. Not sure what's wrong with words that start with a non-ASCII character. I'll try to look tomorrow.

I did run a quick test using regexr. Looks like it works to me. Are you sure that the name is written correctly in the config and that it uses that same character in the book itself?

Shark69 · 09-05-2018, 01:19 PM

Sure.... I can provide you with an example... json file, test ebook file and asc file generated. Look at the pictures, please...
And thanks... of course

From json:

Quote:

"René": {"description": "Jefe de las tropas francesas. ", "aliases": ["René"]},


szarroug3 · 09-06-2018, 10:19 PM

Okay, so I've figured out what's wrong but I can't figure out how to fix it. In the regex pattern I wrote, I use \b around the word I'm looking for. Turns out that this doesn't work when the first or last character in the word is non-ASCII.

There are three different positions that qualify as word boundaries:

Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.

Basically, since the non-ascii character doesn't count as a "word" character, it doesn't fulfill any of these requirements.

I'm still working on it.
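
The behaviour described above can be reproduced with a short sketch. It assumes Python 3, where the `re.ASCII` flag restricts "word" characters to `[a-zA-Z0-9_]` and so stands in for the behaviour of a pattern compiled without Unicode awareness. `found` is a hypothetical helper.

```python
import re

text = 'Ayer habló René con Ángel.'

def found(name, flags=0):
    """True if \\bname\\b matches somewhere in the sample text."""
    return re.search(r'\b' + re.escape(name) + r'\b', text, flags) is not None

for name in ('René', 'Ángel'):
    # First column: ASCII-only word characters. Second: Unicode-aware matching.
    print(name, found(name, re.ASCII), found(name))
```

With ASCII-only matching, "é" followed by a space is two non-word characters in a row, so there is no boundary and the search fails. With Unicode-aware matching both names are found.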

Shark69 · 09-07-2018, 02:00 PM

As an alternative, and talking about the code before the refactoring (because I know it better), I'd suggest processing the text with four regexes:

For aliases inside the paragraph:
word_pat = re.compile(r'(?=([^a-zA-Z0-9_]' + r'[^a-zA-Z0-9_]|[^a-zA-Z0-9_]'.join(escaped_word_list) + r'[^a-zA-Z0-9_]))', re.I)

For aliases at the beginning of paragraph:
word_pat = re.compile(r'(?=(^' + r'[^a-zA-Z0-9_]|^'.join(escaped_word_list) + r'[^a-zA-Z0-9_]))', re.I)

For aliases at the end of paragraph:
word_pat = re.compile(r'(?=([^a-zA-Z0-9_]' + r'$|[^a-zA-Z0-9_]'.join(escaped_word_list) + r'$))', re.I)

and then for aliases found just as a paragraph:
word_pat = re.compile(r'(?=(^' + r'$|^'.join(escaped_word_list) + r'$))', re.I)

I've tested it successfully.
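
For comparison, the four cases above can also be covered by one pattern that uses negative lookarounds instead of explicit neighbouring characters. This is a different technique from the four-regex proposal, offered only as a sketch: it assumes Python 3, where `\w` is Unicode-aware for text patterns, and `build_alias_pattern` and `find_aliases` are hypothetical names.

```python
import re

def build_alias_pattern(aliases):
    """One pattern for all aliases; start/end of paragraph need no special case."""
    # Longest aliases first so a longer alias wins over one that is its prefix.
    escaped = [re.escape(a) for a in sorted(aliases, key=len, reverse=True)]
    # (?<!\w) and (?!\w): no word character directly before or after the alias.
    return re.compile(r'(?<!\w)(?:' + '|'.join(escaped) + r')(?!\w)', re.I)

def find_aliases(text, aliases):
    return [m.group(0) for m in build_alias_pattern(aliases).finditer(text)]

print(find_aliases('René habló con Ángel.', ['René', 'Ángel']))
print(find_aliases('Renée llegó tarde.', ['René']))
```

The lookarounds do not consume the surrounding characters, so an alias at the very start or end of a paragraph, or alone in a paragraph, is handled by the same pattern.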



