MobileRead Forums

davidfor · 03-19-2016, 10:05 PM

Quote:

Originally Posted by JimmXinu

(version365 sent me some URLs)

It looks like that is the case. For whatever reason, those tags are only shown on the last 'page' of each chapter and they can be different for each chapter.

The architecture of FanFicFare is designed to collect metadata with as few page requests as possible for comparison before collecting story chapters separately. When FFF is showing you a progress bar "Downloading Metadata" before going to background, it's doing the metadata-only collection.

Fetching two additional pages for each chapter (the chapter's first 'page' to get the last page URL, and then the last page) during metadata-only collection strikes me as excessive.

Literotica also includes the tags in the meta keywords of each page in the chapter, so you can get them from the first page of the chapter. I have a version of the adapter that includes this:

Code:

    def getCategories(self, soup):
        # Collect categories from the page's <meta name="keywords"> tag.
        if self.getConfig("use_meta_keywords"):
            meta = soup.find("meta", {"name": "keywords"})
            if meta is None or not meta.get('content'):
                return
            categories = meta['content'].split(', ')
            # Older stories put the story/chapter name and author in the keywords.
            title = self.story.getMetadata('title')
            categories = [c for c in categories if title not in c]
            if self.story.getMetadata('author') in categories:
                categories.remove(self.story.getMetadata('author'))
            logger.debug("Meta = %s" % categories)
            for category in categories:
                # logger.debug("\tCategory=%s" % category)
                self.story.addToList('category', category.title())

It includes some cleanup as older stories seem to have the story/chapter name and author in the keywords.

I call that against the parsed page in extractChapterUrlsAndMetadata. In the current adapter, that page is the variable "soup1". I also call it from within the chapter processing so it can pick up extra tags from the later chapters.
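To show the idea outside of the adapter, here is a rough standalone sketch of cleaning the keywords and merging them across the first page and the later chapters. It works on plain strings rather than parsed pages, and the names clean_keywords and merge_categories are made up for this example; they are not part of FanFicFare.

```python
def clean_keywords(content, title, author):
    """Split a meta keywords string and drop entries naming the story or author.

    Hypothetical helper: mirrors the cleanup in getCategories, but takes
    the title and author as arguments instead of reading story metadata.
    """
    keywords = [k.strip() for k in content.split(",") if k.strip()]
    # Older stories repeat the story/chapter name in the keywords.
    keywords = [k for k in keywords if title not in k]
    # Drop the author name and title-case what is left.
    return [k.title() for k in keywords if k != author]


def merge_categories(keyword_strings, title, author):
    """Collect categories from several pages, keeping first-seen order."""
    merged = []
    for content in keyword_strings:
        for category in clean_keywords(content, title, author):
            if category not in merged:
                merged.append(category)
    return merged


if __name__ == "__main__":
    pages = [
        "My Story Ch. 01, Jane, romance, first time",
        "My Story Ch. 02, Jane, romance, office",
    ]
    print(merge_categories(pages, "My Story", "Jane"))
```

The explicit duplicate check is only there because this sketch keeps its own list. Inside the adapter the categories go through self.story.addToList instead.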

It's been a while since I looked at it, but my version does two other things:

Parses the title of chapters to normalize them to "Chapter 1" or "Part 4". If it can't do that, it uses the full title.
Puts the chapter description at the start of each chapter with a line to separate it from the text.
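Those two things could look something like the sketch below. This is not the code from my adapter, just an illustration: the regular expression, the function names, and the use of an hr tag as the separating line are all assumptions, and real chapter titles will likely need more patterns than this handles.

```python
import re
from html import escape

# Assumed title patterns: "Ch. 02", "Chapter 3", "Pt. 04", "Part 5".
_CHAPTER_RE = re.compile(r"\b(ch(?:apter)?|pt|part)\.?\s*0*(\d+)\b", re.IGNORECASE)


def normalize_chapter_title(title):
    """Return "Chapter N" or "Part N" if a number is found, else the full title."""
    match = _CHAPTER_RE.search(title)
    if not match:
        return title
    kind = "Part" if match.group(1).lower().startswith("p") else "Chapter"
    return "%s %d" % (kind, int(match.group(2)))


def prepend_description(description, chapter_html):
    """Put the chapter description first, separated from the text by a rule."""
    if not description:
        return chapter_html
    return "<p>%s</p>\n<hr/>\n%s" % (escape(description), chapter_html)


if __name__ == "__main__":
    print(normalize_chapter_title("My Story Ch. 02"))
    print(normalize_chapter_title("Epilogue"))
    print(prepend_description("Things get going.", "<p>Story text</p>"))
```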

There are some differences in how I parse the story text, but I think that's mainly code style, though I do strip the outer level of div tags from around each page.
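For the div stripping, a minimal sketch of the idea using only the standard library is below. The adapter works on the parsed page rather than on strings, so treat this as an illustration only; strip_outer_div is a made-up name, and it assumes the page HTML is reasonably well-formed.

```python
import re

# Matches opening and closing div tags; group 1 is "/" for a closing tag.
_DIV_TAG = re.compile(r"<(/?)div\b[^>]*>", re.IGNORECASE)


def strip_outer_div(html):
    """Remove one wrapping <div>...</div> if it encloses the whole fragment."""
    text = html.strip()
    tags = list(_DIV_TAG.finditer(text))
    if len(tags) < 2:
        return text
    first, last = tags[0], tags[-1]
    # The wrapper must open at the very start and close at the very end.
    if first.start() != 0 or first.group(1) or last.end() != len(text) or not last.group(1):
        return text
    depth = 0
    for tag in tags:
        depth += -1 if tag.group(1) else 1
        # If the first div closes before the last tag, it is not a wrapper.
        if depth == 0 and tag is not last:
            return text
    if depth != 0:
        return text
    return text[first.end():last.start()].strip()


if __name__ == "__main__":
    print(strip_outer_div('<div class="page"><p>Story text</p></div>'))
```

Only one level is removed per call, and sibling divs such as `<div>a</div><div>b</div>` are left alone.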