[GUI Plugin] FanFictionDownLoader - Page 235

JimmXinu · 11-16-2014, 11:33 PM

Quote:

Originally Posted by JimmXinu

Attached is (yet another) test version, with all of the changes previous, plus:
...

Still contains the self-closing change for that one ffnet story--haven't decided yet if I should include that or not in the next release.

I've figured out that, yes, self-closing does change the output in ways that aren't acceptable--to me at least. This test version returns ffnet processing to what it was before.

dman2 · 11-17-2014, 05:10 PM

I was reading on my Kindle, and noticed that there was a gap in the story. I looked into the HTML on this story (http://www.literotica.com/s/the-succ...eduction-ch-17) and it has odd formatting, specifically a <p align="center">...</p>. Everything after that on that page is lost. I looked around and found a few other instances of align center and <strong> </strong>. It would seem that basic HTML syntax is allowed in the stories, but I couldn't find a definitive list of allowed HTML.

JimmXinu · 11-17-2014, 06:55 PM

Quote:

Originally Posted by dman2

I was reading on my Kindle, and noticed that there was a gap in the story. I looked into the HTML on this story (http://www.literotica.com/s/the-succ...eduction-ch-17) and it has odd formatting, specifically a <p align="center">...</p>. Everything after that on that page is lost. I looked around and found a few other instances of align center and <strong> </strong>. It would seem that basic HTML syntax is allowed in the stories, but I couldn't find a definitive list of allowed HTML.

I'm not seeing quite the same behavior. On the page you linked to, there is a centered chapter title block in <p align="center">...</p>. That is not being included in the generated epub. But the text after it on that page is included. (Unless told otherwise, I assume all FFDL users download to epub and convert as needed to other formats.)

The "<p align="center">...</p>" is nested inside an outer <p> block, which technically is against standard. The existing literotica.com adapter tries to convert that top outer <p> to a div, but it appears to cause problems with nested tags.

Attached is a test version that attempts to use a different method to deal with that bit of HTML ugliness. It appears to work better to me, but I don't read that site, so I'm really only testing with a couple random stories. So while I don't think it will break for other stories, I can't guarantee it.

BTW, while that chapter title block will appear, it won't appear centered. The attribute align="center" isn't recognized by many e-readers and is stripped out. The other chapters are centered because they use <center> and there is an explicit replacement for <center> tags in FFDL.

JimmXinu · 11-18-2014, 05:35 PM

Version 2.0.10 - 18 Nov 2014

New Site: fhsarchive.com -- eFiction Base adapter.
Fixes for storiesonline.net site changes--'codes' are now 'sitetags', thanks Jeff.
Fix for literotica.com HTML.
Fix for AO3 fetch after login.
Fix for User-Agent with saved cookie jar.
Fix for ffnet adapter for 'get urls from page'.
Fix for images in FimFiction.net stories.
Fix handling of new books and custom_column_settings.
Fix for fimf not working with manual is_adult (caching issue).
Fix for calibre 2.10 keyboard shortcuts change.
Known issue: Specific metadata 'eroticatags' for literotica.com doesn't work on all stories.
Known issue: Metadata collection is not as complete for 'Base eFiction' adapters.

dman2 · 11-19-2014, 09:24 AM

Looks good to me, I will let you know if I find any issues it causes. Thanks!

JimmXinu · 11-21-2014, 12:22 AM

Quote:

Originally Posted by JimmXinu

The problem is being caused by some incorrect tag nesting in the chapter 27 text:
...
I'm sure cryzed is willing to remind me that there are newer parsers available, but they are not easy to include in all the different forms FFDL takes. I may take another look at them, but I'm more concerned right now about calibre 2.10 breaking keyboard shortcuts in FFDL.

FYI, I hacked on FFDL (command-line) today until I could get it to work with the current versions of BS and html5lib. Still failed to parse that chapter.

cryzed · 11-21-2014, 05:29 AM

Did you try explicitly specifying the parser for the BeautifulSoup instance?:

Code:

BeautifulSoup(markup, 'html5lib')

And if I remember correctly, the error occurred in the BaseAdapter.utf8FromSoup method. Is the BeautifulSoup instance that is passed to it really a BeautifulSoup 3 or BeautifulSoup 4 instance? It should be entirely dependent on the site adapter calling it.

If all this seems correct, the only thing I can think of is narrowing it down to the element that causes the error and extracting it (possibly via the soup instance if that doesn't already cause an error) before trying to turn the soup into a string, but I think you already tried something like that.

If all this doesn't help I'm a bit stumped, since the html5lib library is supposed to act exactly like a real browser when parsing HTML. I checked the code and there doesn't seem to be anything to indicate that the BeautifulSoup instance is modified improperly (which can easily lead to such errors). Is it possible that the raw HTML modifications the adapter does in places shortly beforehand are at fault?

JimmXinu · 11-21-2014, 12:46 PM

Quote:

Originally Posted by cryzed

Did you try explicitly specifying the parser for the BeautifulSoup instance?:

Yep.

Quote:

Originally Posted by cryzed

And if I remember correctly, the error occurred in the BaseAdapter.utf8FromSoup method. Is the BeautifulSoup instance that is passed to it really a BeautifulSoup 3 or BeautifulSoup 4 instance?

Yeah, I modified the adapter to use bs4 and BaseAdapter.utf8FromSoup to accept either.

The error is coming from the utf8FromSoup code that does a findAll on all tags to strip off extra attributes. If I bypass that it works--so a more forgiving method of spinning through the tags may work. The improperly nested tags cause confusion.
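As a rough illustration of the kind of attribute stripping described above (this is not the actual utf8FromSoup code), the sketch below uses bs4's find_all(True) to visit every tag and keep only an allow-list of attributes. The ALLOWED_ATTRS set, the strip_extra_attrs name, and the sample markup are all assumptions made for the example, and it uses Python's built-in 'html.parser' so it runs without html5lib.

```python
from bs4 import BeautifulSoup

# Hypothetical allow-list; the real FFDL list is not shown in this thread.
ALLOWED_ATTRS = {"href", "src", "alt", "title"}


def strip_extra_attrs(markup):
    """Return markup with every attribute outside ALLOWED_ATTRS removed."""
    soup = BeautifulSoup(markup, "html.parser")
    for tag in soup.find_all(True):  # True matches every tag
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in ALLOWED_ATTRS}
    return str(soup)


sample = '<p align="center" onclick="x()">Hi <a href="/s/1" rel="nofollow">there</a></p>'
cleaned = strip_extra_attrs(sample)
print(cleaned)
```

Because find_all(True) builds the full tag list before the loop starts, any failure while walking a badly nested tree would surface before a single tag is cleaned.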

Quote:

Originally Posted by cryzed

... is it possible that the raw HTML modifications the adapter does in places shortly beforehand are at fault?

That's a good question--I hadn't checked that. But no, skipping those didn't help.

BTW, I did consult your code from the package-magic branch and I'm using part of it, thanks for that.

cryzed · 11-21-2014, 01:13 PM

I'm glad it was useful in some way. I just checked the code in the BaseAdapter class again: Is there any reason you are using _getAttrMap() instead of the attrs attribute? I've never seen that method, I assume it's undocumented, although I really doubt that's the reason for the problems (it might be for some strange reason though!).

As a last resort you could try iterating over all elements and when you catch a recursion exception, navigate from the last unproblematic element onwards, i.e. skipping the next somehow or traversing the tree upwards with the parent attribute until the exception doesn't occur anymore. I doubt that's an ideal solution though unfortunately, and it's just me articulating half-processed ideas. I would have to see for myself first if something like that is even possible.

EDIT: Also sorry for never finishing up that branch. After creating the thread in the forum months ago, I somehow never got around to it, and it didn't seem so pressing at the time as you had mentioned.

JimmXinu · 11-22-2014, 08:43 PM

This post is about development issues. Users can safely ignore.

@cryzed - NP. I wasn't in any rush to worry about it then. I'm not now, really, but I had more spare time the last week. But I'm expecting to have less time next week; so I wanted to save the changes where you could see them and perhaps offer 'pythonic' help and advice.

(I did eventually get past the attribute issue, btw. Spinning on a generator instead of an iterator lets me process until it fails instead of failing before it processes. Not perfect, but better.)
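To make the generator point concrete, here is a small self-contained sketch. It is not FFDL code, and it assumes that "iterator" above means a result list built up front. The walk() function is a made-up stand-in for a tree traversal that raises RuntimeError partway through (Python 3 reports recursion-depth failures as RecursionError, which is a subclass of RuntimeError). Materializing the list first fails before anything is processed, while looping over the generator handles every item up to the failure.

```python
def walk(items):
    """Hypothetical traversal: yields items, blows up on a bad one."""
    for item in items:
        if item == "bad":
            raise RuntimeError("maximum recursion depth exceeded")
        yield item


def process_eagerly(items):
    """Materialize first: the error fires before any item is handled."""
    done = []
    try:
        for item in list(walk(items)):
            done.append(item.upper())
    except RuntimeError:
        pass
    return done


def process_lazily(items):
    """Consume the generator directly: handle items until it fails."""
    done = []
    try:
        for item in walk(items):
            done.append(item.upper())
    except RuntimeError:
        pass
    return done


tags = ["p", "b", "bad", "i"]
eager = process_eagerly(tags)  # nothing gets processed
lazy = process_lazily(tags)    # "p" and "b" get processed
print(eager, lazy)
```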

I've checked in a new branch 'bs4' that includes the six, html5lib, and bs4 libraries, changes to allow their import in all the different run environments, and some not-ready-for-prime-time changes in a small handful of adapters to test out the bs4 changes--I don't intend to use the adapters as is. The packages are all at the top level because it makes it much easier in web engine and plugin that way.

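My to-do list for this is:

Spoiler:

Is it worthwhile trying to make adapters that can use either bs3 or bs4? Or should each only use one? Leaning towards only one.
If 'either', should it be configurable per adapter?
If not 'either', should bs3 adapters and bs4 adapters have different parent base classes?
How should common code to avoid having BeautifulSoup(___,'html5lib') everywhere be implemented? Base adapter method?
How should stripHTML be changed to accommodate? Move to a method on base adapter? Needed at all with bs4?
What about non-adapters that use bs? html.py, geturls.py, etc.
chardet - should that library be updated and pulled up to the same level as bs4, etc?
AO3 adapter--<b> tags added directly as text are being treated as text.
bs4/html5lib do some things differently. For example, &nbsp; rather than becoming a space becomes \u00a0 -- a literal non-breaking space character. See the TtH adapter's date string.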

cryzed · 11-23-2014, 10:31 AM

Only using one sounds like the better idea, since bs4 is mostly just an improved bs3. Whenever possible new adapters should definitely use bs4.

I agree that creating a "new-style" base class for adapters using bs4 might be a good idea, however I'm not sure why _you_ want to have different parent base classes, because of methods such as utf8FromSoup? Those should work with soup objects created by both versions hopefully, I don't think the BaseSiteAdapter's methods need to be adapted for that.

However, if we actually go that way, it would be the ideal opportunity to smarten up some things: I would try to keep as much functionality as possible out of the BaseSiteAdapter class. Actually I'd argue that utf8FromSoup should be in the same module as stripHTML and various other utility functions. In my opinion, it should be primarily the subclasses' job to implement the logic to retrieve and parse the story's metadata and chapters. The BaseSiteAdapter should only have the most general functionality in it which is strictly necessary (e.g. caching, storing parsed metadata...). A case can be made for _fetchUrl to stay there, possibly under a new name, since it automatically decodes the retrieved content based on the configuration for the adapter. (When I had started on my mentioned FFDL clone a short while ago, I was actually using the requests library, which bundles chardet as a fallback option, and did something quite similar. Maybe this would be a good opportunity to add requests to the list of new modules?)

Regarding stripHTML, if you check here you will see that using regular expressions at all shouldn't be necessary. Simply joining over the strings or stripped_strings generator attribute, or using the get_text method should do exactly what you want, even decoding all HTML entities in the process.
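A minimal sketch of that idea, assuming bs4 is importable: strip_html here is a hypothetical replacement written for this example, not FFDL's existing stripHTML, and it uses the built-in 'html.parser' so it does not depend on html5lib.

```python
from bs4 import BeautifulSoup


def strip_html(markup):
    """Return the visible text of markup, with HTML entities decoded."""
    soup = BeautifulSoup(markup, "html.parser")
    # stripped_strings yields each text node with surrounding whitespace removed.
    return " ".join(soup.stripped_strings)


sample = "<p>Tom &amp; Jerry <b> run </b></p>"
text = strip_html(sample)

# get_text with a separator and strip=True gives the same result in one call.
same_text = BeautifulSoup(sample, "html.parser").get_text(" ", strip=True)
print(text)
```

No regular expression is involved; the parser handles both the tags and the &amp; entity.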

I've never taken a closer look at html.py or geturls.py, so I'm not sure what they do exactly. It looks like the HtmlProcessor is in charge of cleaning and normalizing the returned HTML content, and I assume geturls.py is used internally for the get URLs dialogue. If there were never problems with them before, I don't see a real reason to change them -- of course simply making them use bs4 instead of bs3 should work without any problems. However, regarding the geturls.py logic, I'm not quite sure what processing you do in there; it looks as if you are filtering the URLs? Why do you do that? Shouldn't it be possible to simply return all found anchor tags, absolutizing their URLs with urlparse.urljoin and then checking if any loaded adapter has a matching site URL pattern? That way you don't have to inspect them at all and can be sure to have only gotten valid story URLs.
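The suggested approach could look roughly like the sketch below. Everything named here is an assumption for illustration: ADAPTER_PATTERNS stands in for whatever site URL patterns the loaded adapters expose, the example-archive.com URLs are invented, and the standard library's HTMLParser is used in place of BeautifulSoup so the sketch has no dependencies. It is written for Python 3, where urlparse.urljoin lives at urllib.parse.urljoin.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin  # Python 2 spelled this urlparse.urljoin

# Hypothetical per-adapter patterns; real adapters would supply their own.
ADAPTER_PATTERNS = [
    re.compile(r"^https?://www\.example-archive\.com/s/\d+"),
]


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def story_urls(page_url, markup):
    """Absolutize every link, then keep those some adapter pattern accepts."""
    collector = AnchorCollector()
    collector.feed(markup)
    absolute = (urljoin(page_url, href) for href in collector.hrefs)
    return [url for url in absolute
            if any(p.match(url) for p in ADAPTER_PATTERNS)]


page = '<a href="/s/123">Story</a> <a href="/about">About</a>'
found = story_urls("http://www.example-archive.com/list", page)
print(found)
```

The page-level code never inspects link text or structure; the adapter patterns alone decide what counts as a story URL.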

I would definitely suggest updating chardet, there is no reason not to do so, there shouldn't be any breaking of backwards-compatibility.

Regarding AO3, I assume you mean that when authors paste text into their stories containing HTML tags they are turned into text (i.e. HTML characters are "escaped" and turned into HTML entities)? This should definitely not be FFDL's responsibility to fix, this is AO3's task and/or the author's, we can't possibly check and account for all user errors -- at least in my opinion.

I would handle the last issue like you did: simply account for the differences. Turning &nbsp; ([n]on-[b]reaking [sp]ace) HTML entities into their Unicode equivalent makes more sense anyways, and I'm sure that only very few sites will use these. Site adapters using bs4 will simply have to account for those differences; it's not like all old adapters using bs3 will suddenly get different Unicode output.
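For anyone who hasn't run into the difference, here is a small sketch of what "accounting for it" can mean. It uses the standard library's html.unescape as a stand-in for the entity decoding that bs4/html5lib perform, and the date string and its format are made up rather than taken from the TtH adapter.

```python
import html
from datetime import datetime

raw = "18&nbsp;Nov&nbsp;2014"    # made-up sample, not real site markup
decoded = html.unescape(raw)     # &nbsp; becomes U+00A0, not an ASCII space

has_nbsp = "\u00a0" in decoded            # the non-breaking space survives
matches_plain = decoded == "18 Nov 2014"  # so a plain-space comparison fails

# Normalize to ordinary spaces before comparing or parsing the string.
normalized = decoded.replace("\u00a0", " ")
updated = datetime.strptime(normalized, "%d %b %Y")
print(repr(decoded), updated.date())
```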

Let me know what you think and maybe detail the scale of changes you had planned in a bit more detail.

EDIT: Also regarding the insert_into_python_path function in downloader.py, I don't actually think that's strictly necessary as long as bs4 and html5lib are at the top-level in the directory structure, like they are currently; they'll be preferred by default.

EDIT2: I did some work on the branch, nothing major just fixed up downloader.py. I'm sure I have before, but I'll suggest it again: Do try the free PyCharm Community Edition, it's very helpful during development. Especially the "Find Usages" feature and jump to definition.

JimmXinu · 12-01-2014, 01:27 PM

(@cryzed - I haven't been ignoring you, I was away last week. So I guess technically, I was--but it wasn't deliberate.)

Attached is a test version with these changes:

A site-specific adapter for csi-forensics.com (instead of an eFiction Base adapter), thanks scout78;
New site fanfiction-junkies.de, again thanks scout78;
Change site castlefans.org/fanfic back to fanfic.castletv.net again.

cryzed · 12-01-2014, 01:32 PM

No worries.

JimmXinu · 12-01-2014, 07:50 PM

Updated test version with additional changes to fanfic.castletv.net, thanks again scout78.

arthurh3535 · 12-02-2014, 05:32 PM

Quote:

Originally Posted by JimmXinu

Version 2.0.10 - 18 Nov 2014

Fix for literotica.com HTML.
Known issue: Specific metadata 'eroticatags' for literotica.com doesn't work on all stories.

I can't get Literotica.com stories to download at all any more (in calibre).


