MobileRead Forums - View Single Post

cryzed · 11-23-2014, 10:31 AM

Only using one sounds like the better idea, since bs4 is mostly just an improved bs3. Whenever possible new adapters should definitely use bs4.

I agree that creating a "new-style" base class for adapters using bs4 might be a good idea, however I'm not sure why _you_ want to have different parent base classes, because of methods such as utf8FromSoup? Those should work with soup objects created by both versions hopefully, I don't think the BaseSiteAdapter's methods need to be adapted for that.

However if we actually go that way, it would be the ideal oppurtinity to smarten up some things: I would try to keep as much functionality as possible out of the BaseSiteAdapter class. Actually I'd argue that utf8FromSoup should be in the same module as stripHTML and various other utility functions. In my opinion, it should be primarily the subclasses' job to implement the logic to retrieve and parse the story's metadata and chapters. The BaseSiteAdapter should only have the most general functionality in it which is strictly necessary (e.g. caching, storing parsed metadata...). A case can be made for _fetchUrl to stay there under a new name possibly, since it automatically decodes the retrieved content based on the configuration for the adapter. (When I had started on my mentioned FFDL clone a short while ago, I was actually using the requests library that bundles chardet as a fallback option and did something quite similar, maybe this would be a good oppurtunity to add requests to the list of new modules?)

Regarding stripHTML, if you check here you will see that using regular expressions at all shouldn't be necessary. Simply joining over the strings or stripped_strings generator attribute, or using the get_text method should do exactly what you want, even decoding all HTML entities in the process.

I've never taken a closer look at html.py or geturls.py, so I'm not sure what they do exactly. It looks like the HtmlProcessor is in charge of cleaning and normalizing the returned HTML content, and I assume geturls.py is used internally for the get URLs dialogue. If there were never problems with them before, I don't see a real reason to change them -- Of course simply making them use bs4 instead of bs3 should work without any problems. However, regarding the geturls.py logic, I'm not quite sure what processing you do in there, it looks nearly like you are filtering the URLs? Why do you do that? Shouldn't it be simply possible to return all found anchor tags, absolutizing their URLs with urlparse.urljoin and then checking if any loaded adapter has a maching site URL pattern? That way you don't have to inspect them at all and can be sure to have only gotten valid story URLs.

I would definitely suggest updating chardet, there is no reason not to do so, there shouldn't be any breaking of backwards-compatibility.

Regarding AO3, I assume you mean that when authors paste text into their stories containing HTML tags they are turned into text (i.e. HTML characters are "escaped" and turned into HTML entities)? This should definitely not be FFDL's responsibility to fix, this is AO3's task and/or the author's, we can't possibly check and account for all user errors -- at least in my opinion.

I would handle the last issue like you did: simply account for the differences. Turning &nbsp ([n]on-[b]reaking [sp]ace) HTML entities into their Unicode equivalent makes more sense anyways, and I'm sure that only very few sites will use these. Site adapters using bs4 will simply have to account for those differences, it's not like all old adapters using bs3 will suddenly get different Unicode output.

Let me know what you think and maybe detail the scale of changes you had planned in a bit more detail.

EDIT: Also regarding the insert_into_python_path function in downloader.py, I don't actually think that's strictly necessary as long as bs4 and html5lib are at the top-level in the directory structure, like they are currently; they'll be preferred by default.

EDIT2: I did some work on the branch, nothing major just fixed up downloader.py. I'm sure I have before, but I'll suggest it again: Do try the free PyCharm Community Edition, it's very helpful during development. Especially the "Find Usages" feature and jump to definition.

11-23-2014, 10:31 AM	#3521
cryzed Evangelist Posts: 408 Karma: 1050547 Join Date: Mar 2011 Device: Kindle Oasis 2	Only using one sounds like the better idea, since bs4 is mostly just an improved bs3. Whenever possible new adapters should definitely use bs4. I agree that creating a "new-style" base class for adapters using bs4 might be a good idea, however I'm not sure why _you_ want to have different parent base classes, because of methods such as utf8FromSoup? Those should work with soup objects created by both versions hopefully, I don't think the BaseSiteAdapter's methods need to be adapted for that. However if we actually go that way, it would be the ideal oppurtinity to smarten up some things: I would try to keep as much functionality as possible out of the BaseSiteAdapter class. Actually I'd argue that utf8FromSoup should be in the same module as stripHTML and various other utility functions. In my opinion, it should be primarily the subclasses' job to implement the logic to retrieve and parse the story's metadata and chapters. The BaseSiteAdapter should only have the most general functionality in it which is strictly necessary (e.g. caching, storing parsed metadata...). A case can be made for _fetchUrl to stay there under a new name possibly, since it automatically decodes the retrieved content based on the configuration for the adapter. (When I had started on my mentioned FFDL clone a short while ago, I was actually using the requests library that bundles chardet as a fallback option and did something quite similar, maybe this would be a good oppurtunity to add requests to the list of new modules?) Regarding stripHTML, if you check here you will see that using regular expressions at all shouldn't be necessary. Simply joining over the strings or stripped_strings generator attribute, or using the get_text method should do exactly what you want, even decoding all HTML entities in the process. I've never taken a closer look at html.py or geturls.py, so I'm not sure what they do exactly. It looks like the HtmlProcessor is in charge of cleaning and normalizing the returned HTML content, and I assume geturls.py is used internally for the get URLs dialogue. If there were never problems with them before, I don't see a real reason to change them -- Of course simply making them use bs4 instead of bs3 should work without any problems. However, regarding the geturls.py logic, I'm not quite sure what processing you do in there, it looks nearly like you are filtering the URLs? Why do you do that? Shouldn't it be simply possible to return all found anchor tags, absolutizing their URLs with urlparse.urljoin and then checking if any loaded adapter has a maching site URL pattern? That way you don't have to inspect them at all and can be sure to have only gotten valid story URLs. I would definitely suggest updating chardet, there is no reason not to do so, there shouldn't be any breaking of backwards-compatibility. Regarding AO3, I assume you mean that when authors paste text into their stories containing HTML tags they are turned into text (i.e. HTML characters are "escaped" and turned into HTML entities)? This should definitely not be FFDL's responsibility to fix, this is AO3's task and/or the author's, we can't possibly check and account for all user errors -- at least in my opinion. I would handle the last issue like you did: simply account for the differences. Turning &nbsp ([n]on-[b]reaking [sp]ace) HTML entities into their Unicode equivalent makes more sense anyways, and I'm sure that only very few sites will use these. Site adapters using bs4 will simply have to account for those differences, it's not like all old adapters using bs3 will suddenly get different Unicode output. Let me know what you think and maybe detail the scale of changes you had planned in a bit more detail. EDIT: Also regarding the insert_into_python_path function in downloader.py, I don't actually think that's strictly necessary as long as bs4 and html5lib are at the top-level in the directory structure, like they are currently; they'll be preferred by default. EDIT2: I did some work on the branch, nothing major just fixed up downloader.py. I'm sure I have before, but I'll suggest it again: Do try the free PyCharm Community Edition, it's very helpful during development. Especially the "Find Usages" feature and jump to definition. Last edited by cryzed; 11-23-2014 at 03:07 PM.