MobileRead Forums - View Single Post

chaley · 07-15-2012, 09:57 AM

The following seems to work, but I make no guarantees. It produces a list of numbers and a list of titles. The cruft in the middle is necessary to filter out ancillary text such as "aka". As far as I can tell from a brief look, the numbers and titles correspond until the numbers run out. The titles after that point seem to be anthologies or other "non-numbered" books.

This script runs with calibre-debug -e

Code:

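from contextlib import closing

from lxml import html
from calibre import browser

url = 'http://www.fantasticfiction.co.uk/p/james-patterson/'
br = browser()
with closing(br.open(url, timeout=10)) as f:
    doc = html.fromstring(f.read())

# Handle each div with class "sectionleft" separately
for data in doc.xpath('//div[@class="sectionleft"]'):
    # Bare text nodes of the div: the numbers plus ancillary text such as "aka"
    t = data.xpath('./text()')
    numbers = []
    for x in t:
        try:
            # Keep only the text nodes that parse as a number
            numbers.append(int(float(x)))
        except (ValueError, OverflowError):
            pass
    # Titles are the text of the links that point at .htm pages
    books = data.xpath('a[contains(@href,".htm")]/text()')
    # %-formatting so the line works under both Python 2 and Python 3
    print('%d %d %s %s' % (len(numbers), len(books), numbers, books))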
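
If you want the numbers attached to the titles rather than two separate lists, something like the following might do as a starting point. It is only a sketch: the function names extract_numbers and pair_numbers_with_titles are made up for illustration, the sample values are invented rather than taken from the site, and it assumes what I said above, namely that numbers and titles line up by position until the numbers run out.

```python
def extract_numbers(text_nodes):
    """Return the integers found in a list of text nodes, skipping
    anything that does not parse as a number (e.g. "aka")."""
    numbers = []
    for text in text_nodes:
        try:
            numbers.append(int(float(text)))
        except (ValueError, OverflowError):
            continue
    return numbers


def pair_numbers_with_titles(numbers, books):
    """Pair each title with the number at the same position.
    Titles past the end of `numbers` get None (assumed non-numbered)."""
    paired = []
    for i, title in enumerate(books):
        number = numbers[i] if i < len(numbers) else None
        paired.append((number, title))
    return paired


if __name__ == '__main__':
    # Invented sample data, standing in for one "sectionleft" div
    sample_text = ['\n', ' 1 ', ' aka ', '2.', '']
    sample_books = ['First Title', 'Second Title', 'Some Anthology']
    numbers = extract_numbers(sample_text)
    for number, title in pair_numbers_with_titles(numbers, sample_books):
        print('%s %s' % (number, title))
```

In the script above you would pass t and books for each div. Pairing by position is exactly as reliable as the correspondence between the two lists, so check the result by eye before trusting it.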
