#3511 |
Plugin Developer
Posts: 7,010
Karma: 4604635
Join Date: Dec 2011
Location: Midwest USA
Device: Kobo Clara Colour running KOReader
|
Test Version
I've figured out that, yes, self-closing <p> does change the output in ways that aren't acceptable--to me at least. This test version returns ffnet processing to what it was before.
Last edited by JimmXinu; 11-17-2014 at 09:58 PM. Reason: Remove obsolete test versions - replaced by newer test or released version. |
#3512 |
Junior Member
Posts: 2
Karma: 10
Join Date: Nov 2014
Device: Kindle Paperwhite
|
Literotica Issue
I was reading on my Kindle, and noticed that there was a gap in the story. I looked into the HTML on this story (http://www.literotica.com/s/the-succ...eduction-ch-17) and it has odd formatting, specifically a <p align="center">...</p>. Everything after that on that page is lost. I looked around and found a few other instances of align center and <strong> </strong>. It would seem that basic HTML syntax is allowed in the stories, but I couldn't find a definitive list of allowed HTML.
|
#3513 | |
Plugin Developer
Posts: 7,010
Karma: 4604635
Join Date: Dec 2011
Location: Midwest USA
Device: Kobo Clara Colour running KOReader
|
Test Version
Quote:
The "<p align="center">...</p>" is nested inside an outer <p> block, which is technically against the HTML standard. The existing literotica.com adapter tries to convert that outer <p> to a div, but that appears to cause problems with nested <p> tags.

Attached is a test version that uses a different method to deal with that bit of HTML ugliness. It appears to work better to me, but I don't read that site, so I'm really only testing with a couple of random stories. So while I don't think it will break other stories, I can't guarantee it.

BTW, while that chapter title block will appear, it won't appear centered. The attribute align="center" isn't recognized by many e-readers and is stripped out. The other chapters are centered because they use <center>, and there is an explicit replacement for <center> tags in FFDL.

Last edited by JimmXinu; 11-18-2014 at 09:34 PM. Reason: Remove obsolete test versions - replaced by newer test or released version. |
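As an illustration of the nesting problem described above, here is a minimal sketch that only detects a <p> opened inside another <p>. It is not the adapter's fix and does not use FFDL code: it relies on Python's standard html.parser, and the class name, function name, and sample markup are all made up for the example.

```python
from html.parser import HTMLParser


class NestedParagraphFinder(HTMLParser):
    """Count <p> tags that open while another <p> is still open."""

    def __init__(self):
        super().__init__()
        self.open_paragraphs = 0  # <p> tags currently open
        self.nested_count = 0     # <p> tags opened inside another <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self.open_paragraphs > 0:
                self.nested_count += 1
            self.open_paragraphs += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.open_paragraphs > 0:
            self.open_paragraphs -= 1


def count_nested_paragraphs(markup):
    """Return how many <p> tags in markup start inside an open <p>."""
    finder = NestedParagraphFinder()
    finder.feed(markup)
    finder.close()
    return finder.nested_count


# Hypothetical markup shaped like the chapter described above.
sample = '<p>Intro text<p align="center">Chapter 17</p>More story text</p>'
print(count_nested_paragraphs(sample))
```

A check like this could flag which pages need the special handling, since html.parser reports tags as written instead of repairing them the way a browser-style parser would.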
|
#3514 |
Plugin Developer
Posts: 7,010
Karma: 4604635
Join Date: Dec 2011
Location: Midwest USA
Device: Kobo Clara Colour running KOReader
|
New Release
Version 2.0.10 - 18 Nov 2014
|
#3515 |
Junior Member
Posts: 2
Karma: 10
Join Date: Nov 2014
Device: Kindle Paperwhite
|
Looks good to me, I will let you know if I find any issues it causes. Thanks!
|
#3516 | |
Plugin Developer
Posts: 7,010
Karma: 4604635
Join Date: Dec 2011
Location: Midwest USA
Device: Kobo Clara Colour running KOReader
|
Quote:
|
|
#3517 |
Evangelist
Posts: 408
Karma: 1050547
Join Date: Mar 2011
Device: Kindle Oasis 2
|
Did you try explicitly specifying the parser for the BeautifulSoup instance?
Code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(markup, 'html5lib')

If all this seems correct, the only thing I can think of is narrowing it down to the element that causes the error and extracting it (possibly via the soup instance, if that doesn't already cause an error) before trying to turn the soup into a string, but I think you already tried something like that.

If all this doesn't help, I'm a bit stumped, since the html5lib library is supposed to act exactly like a real browser when parsing HTML. I checked the code and there doesn't seem to be anything to indicate that the BeautifulSoup instance is modified improperly (which can easily lead to such errors). Is it possible that the raw HTML modifications the adapter does shortly beforehand in places are at fault?

Last edited by cryzed; 11-21-2014 at 05:34 AM. |
#3518 | |||
Plugin Developer
Posts: 7,010
Karma: 4604635
Join Date: Dec 2011
Location: Midwest USA
Device: Kobo Clara Colour running KOReader
|
Quote:
Quote:
The error is coming from the utf8FromSoup code that does a findAll on all tags to strip off extra attributes. If I bypass that it works--so a more forgiving method of spinning through the tags may work. The improperly nested tags cause confusion.
Quote:
BTW, I did consult your code from the package-magic branch and I'm using part of it, thanks for that. |
|||
#3519 |
Evangelist
Posts: 408
Karma: 1050547
Join Date: Mar 2011
Device: Kindle Oasis 2
|
I'm glad it was useful in some way. I just checked the code in the BaseAdapter class again: Is there any reason you are using _getAttrMap() instead of the attrs attribute? I've never seen that method, I assume it's undocumented, although I really doubt that's the reason for the problems (it might be for some strange reason though!).
As a last resort, you could try iterating over all elements and, when you catch a recursion exception, navigating onwards from the last unproblematic element, i.e. skipping the next one somehow or traversing the tree upwards with the parent attribute until the exception doesn't occur anymore. Unfortunately I doubt that's an ideal solution, and it's just me articulating half-processed ideas. I would have to see for myself first whether something like that is even possible.

EDIT: Also, sorry for never finishing up that branch. After creating the thread in the forum months ago, I somehow never got around to it, and it didn't seem so pressing at the time, as you had mentioned.

Last edited by cryzed; 11-21-2014 at 01:21 PM. |
#3520 |
Plugin Developer
Posts: 7,010
Karma: 4604635
Join Date: Dec 2011
Location: Midwest USA
Device: Kobo Clara Colour running KOReader
|
Development Notes
This post is about development issues. Users can safely ignore.
@cryzed - NP. I wasn't in any rush to worry about it then. I'm not now, really, but I had more spare time the last week. But I'm expecting to have less time next week, so I wanted to save the changes where you could see them and perhaps offer 'pythonic' help and advice.

(I did eventually get past the attribute issue, btw. Spinning on a generator instead of an iterator lets me process until it fails instead of failing before it processes. Not perfect, but better.)

I've checked in a new branch 'bs4' that includes the six, html5lib, and bs4 libraries, changes to allow their import in all the different run environments, and some not-ready-for-prime-time changes in a small handful of adapters to test out the bs4 changes--I don't intend to use the adapters as is. The packages are all at the top level because it makes it much easier in web engine and plugin that way.

My to-do list for this is:
Spoiler:
|
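The generator remark in the post above ("process until it fails instead of failing before it processes") can be illustrated with a small sketch. It uses neither FFDL nor BeautifulSoup code: walk_tags is a hypothetical stand-in for a tag traversal that raises RecursionError partway through, and reading "iterator" as "a result list built up front" is an assumption made for the example.

```python
def walk_tags(tags, bad_index):
    """Hypothetical traversal: yields tags lazily and fails at bad_index."""
    for index, tag in enumerate(tags):
        if index == bad_index:
            raise RecursionError("simulated failure on a badly nested tag")
        yield tag


def process_eagerly(tags, bad_index):
    """Build the full list first: the error fires before any tag is handled."""
    processed = []
    try:
        for tag in list(walk_tags(tags, bad_index)):
            processed.append(tag.upper())
    except RecursionError:
        pass
    return processed


def process_lazily(tags, bad_index):
    """Consume the generator directly: tags before the failure are handled."""
    processed = []
    try:
        for tag in walk_tags(tags, bad_index):
            processed.append(tag.upper())
    except RecursionError:
        pass
    return processed


tags = ["div", "p", "em", "strong"]
print(process_eagerly(tags, bad_index=2))  # []
print(process_lazily(tags, bad_index=2))   # ['DIV', 'P']
```

Both versions still stop at the bad element, which matches the "not perfect, but better" assessment: the lazy version only keeps the work done before the failure.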
#3521 |
Evangelist
Posts: 408
Karma: 1050547
Join Date: Mar 2011
Device: Kindle Oasis 2
|
Only using one sounds like the better idea, since bs4 is mostly just an improved bs3. Whenever possible new adapters should definitely use bs4.
I agree that creating a "new-style" base class for adapters using bs4 might be a good idea; however, I'm not sure why _you_ want to have different parent base classes. Is it because of methods such as utf8FromSoup? Those should hopefully work with soup objects created by both versions, so I don't think the BaseSiteAdapter's methods need to be adapted for that.

However, if we actually go that way, it would be the ideal opportunity to smarten up some things: I would try to keep as much functionality as possible out of the BaseSiteAdapter class. Actually, I'd argue that utf8FromSoup should be in the same module as stripHTML and various other utility functions. In my opinion, it should be primarily the subclasses' job to implement the logic to retrieve and parse the story's metadata and chapters. The BaseSiteAdapter should only have the most general functionality in it which is strictly necessary (e.g. caching, storing parsed metadata...).

A case can be made for _fetchUrl to stay there, possibly under a new name, since it automatically decodes the retrieved content based on the configuration for the adapter. (When I had started on my mentioned FFDL clone a short while ago, I was actually using the requests library, which bundles chardet as a fallback option, and did something quite similar. Maybe this would be a good opportunity to add requests to the list of new modules?)

Regarding stripHTML, if you check here you will see that using regular expressions at all shouldn't be necessary. Simply joining over the strings or stripped_strings generator attribute, or using the get_text method, should do exactly what you want, even decoding all HTML entities in the process.

I've never taken a closer look at html.py or geturls.py, so I'm not sure what they do exactly. It looks like the HtmlProcessor is in charge of cleaning and normalizing the returned HTML content, and I assume geturls.py is used internally for the get URLs dialogue.
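On the stripHTML point above (let a parser collect the text and decode entities instead of using regular expressions), here is a minimal sketch of the idea. To stay self-contained it swaps in Python's standard html.parser in place of bs4's get_text, so it shows the approach and is not FFDL's stripHTML; TextCollector and strip_html are hypothetical names.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect only the text nodes of a document, with entities decoded."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &amp; and &nbsp; before the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(markup):
    """Return the text content of markup without any tags."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks).strip()


print(strip_html("<p>Tom &amp; Jerry <b>ride</b> again</p>"))
```

With bs4 available, soup.get_text() or joining soup.stripped_strings would play the same role, as the post suggests.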
If there were never problems with them before, I don't see a real reason to change them. Of course, simply making them use bs4 instead of bs3 should work without any problems. However, regarding the geturls.py logic, I'm not quite sure what processing you do in there; it looks nearly like you are filtering the URLs? Why do you do that? Shouldn't it be possible to simply return all found anchor tags, absolutizing their URLs with urlparse.urljoin, and then check whether any loaded adapter has a matching site URL pattern? That way you don't have to inspect them at all and can be sure to have only gotten valid story URLs.

I would definitely suggest updating chardet. There is no reason not to do so, and it shouldn't break backwards compatibility.

Regarding AO3, I assume you mean that when authors paste text containing HTML tags into their stories, the tags are turned into text (i.e. HTML characters are "escaped" and turned into HTML entities)? This should definitely not be FFDL's responsibility to fix. This is AO3's task and/or the author's; we can't possibly check and account for all user errors, at least in my opinion.

I would handle the last issue like you did: simply account for the differences. Turning &nbsp; ([n]on-[b]reaking [sp]ace) HTML entities into their Unicode equivalent makes more sense anyways, and I'm sure that only very few sites will use these. Site adapters using bs4 will simply have to account for those differences; it's not like all old adapters using bs3 will suddenly get different Unicode output.

Let me know what you think, and maybe describe the scale of changes you had planned in a bit more detail.

EDIT: Also, regarding the insert_into_python_path function in downloader.py, I don't actually think that's strictly necessary as long as bs4 and html5lib are at the top level in the directory structure, like they are currently; they'll be preferred by default.

EDIT2: I did some work on the branch, nothing major, just fixed up downloader.py.
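For the geturls.py suggestion above (collect every anchor, absolutize it, then keep only URLs that match an adapter's site pattern), a minimal sketch might look like the following. It is not the existing geturls.py logic. The patterns, domains, and function names are hypothetical, the anchors are collected with the standard html.parser so the example runs without bs4, and urljoin is imported from urllib.parse, the Python 3 home of the urlparse.urljoin mentioned in the post.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical site URL patterns; real adapters would supply their own.
ADAPTER_PATTERNS = [
    re.compile(r"^https?://(www\.)?example-fiction\.test/s/\d+"),
    re.compile(r"^https?://archive\.example\.test/works/\d+"),
]


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_story_urls(markup, page_url):
    """Return absolute URLs from markup that match any adapter pattern."""
    collector = AnchorCollector()
    collector.feed(markup)
    collector.close()
    story_urls = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if any(pattern.match(absolute) for pattern in ADAPTER_PATTERNS):
            if absolute not in story_urls:  # drop duplicates, keep order
                story_urls.append(absolute)
    return story_urls


page = '<a href="/s/123/1/">Story</a> <a href="/about">About</a>'
print(find_story_urls(page, "http://www.example-fiction.test/index.html"))
```

The filtering then lives entirely in the adapters' patterns, so the collection step does not need to know anything about individual sites.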
I'm sure I have before, but I'll suggest it again: Do try the free PyCharm Community Edition; it's very helpful during development, especially the "Find Usages" feature and jump to definition.

Last edited by cryzed; 11-23-2014 at 03:07 PM. |
#3522 |
Plugin Developer
Posts: 7,010
Karma: 4604635
Join Date: Dec 2011
Location: Midwest USA
Device: Kobo Clara Colour running KOReader
|
(@cryzed - I haven't been ignoring you; I was away last week. So I guess technically, I was--but it wasn't deliberate.)
Attached is a test version with these changes:
Last edited by JimmXinu; 12-01-2014 at 07:50 PM. Reason: Remove obsolete test versions - replaced by newer test or released version. |
#3523 |
Evangelist
Posts: 408
Karma: 1050547
Join Date: Mar 2011
Device: Kindle Oasis 2
|
No worries
#3524 |
Plugin Developer
Posts: 7,010
Karma: 4604635
Join Date: Dec 2011
Location: Midwest USA
Device: Kobo Clara Colour running KOReader
|
Updated test version with additional changes to fanfic.castletv.net, thanks again scout78.
Last edited by JimmXinu; 12-09-2014 at 11:26 AM. Reason: Remove obsolete test versions - replaced by newer test or released version. |
#3525 | |
Junior Member
Posts: 5
Karma: 10
Join Date: Dec 2014
Device: Nexus7
|
Quote:
Last edited by arthurh3535; 12-02-2014 at 05:53 PM. |
|