#1 |
Enthusiast
Posts: 46
Karma: 10
Join Date: Dec 2011
Device: Kindle 3
|
Repubblica is the best Italian newspaper. I've been reading it since I bought my Kindle, but now it has become very hard to do so. The current recipe is too slow: it takes 1 hour to download everything, which is very strange. The previous recipe, by contrast, was fast but could not download the economy RSS feed (the other feeds were fine). Can anyone work a miracle?
#2 |
Junior Member
Posts: 8
Karma: 10
Join Date: Oct 2011
Device: Kindle 3g
|
This recipe is indeed problematic. I gave it a try, but it's beyond me. How about keeping only today's articles?
That should cut the time to 1/5. |
#3 |
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
|
The main problem with Repubblica is the economy feed. That section of the site is in HTML5, so I had to add some mumbo jumbo to try to clean it up, since calibre does not handle that so well. All those additions slowed down the article fetch. Unless Kovid has an idea how to solve this...
|
#4 |
creator of calibre
Posts: 45,195
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
@kiklop74: Which recipe is this (filename)?
|
#5 |
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
|
|
#6 |
creator of calibre
Posts: 45,195
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You mean the following code is slowing it down?
Code:
for item in soup.findAll(['hgroup','deresponsabilizzazione','per']):
    item.name = 'div'
    item.attrs = []
|
#7 |
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
|
No. The slowdown comes mainly from the regexps.
|
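One plausible reading of this exchange, offered as an assumption rather than a measurement of the actual recipe: a pattern that begins with a lazy `.*?` has no fixed starting point. If the substitution is applied globally, as `re.sub` does by default, then once the real `<head>` has been replaced the pattern is retried at every remaining position, and each failed attempt scans to the end of the page. The sketch below uses a synthetic page (the sizes and the `time_sub` helper are made up for illustration) so you can observe how the time grows with page length.

```python
import re
import time

# A pattern of the kind discussed in the thread: lazy wildcard, then <head>.
lazy_prefix = re.compile(r'.*?<head>', re.DOTALL | re.IGNORECASE)


def time_sub(n_chars):
    """Time one global substitution on a synthetic page with n_chars of body text."""
    page = '<html><head></head><body>' + 'x' * n_chars + '</body></html>'
    start = time.perf_counter()
    lazy_prefix.sub('<head>', page)
    return time.perf_counter() - start


# Doubling the page size lets you compare how the elapsed time scales.
for n in (2000, 4000, 8000):
    print(n, round(time_sub(n), 4))
```

Actual timings depend on the machine and the Python version, so none are quoted here.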
#8 |
creator of calibre
Posts: 45,195
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You should be able to replace all those regexes with a single regex that simply strips the head section. Why is there a regex that strips everything before the opening <head>?
|
#9 |
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
|
Because otherwise I get garbage HTML, with incorrectly processed HTML5 header content. That is the real reason I did all that stuff. For some reason calibre fails to properly clean up the HTML5 generated by La Repubblica's economy section.
|
#10 |
creator of calibre
Posts: 45,195
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Try using the regex
r'.*</head>', '<html>' instead; it should be much faster. |
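As a sketch of how that suggestion might look in the recipe's `preprocess_regexps` list: the `re.DOTALL` flag is an assumption added here so that `.` also matches newlines, and the sample page is invented for illustration.

```python
import re

# Single rule in the (compiled pattern, replacement function) form
# that the recipe's preprocess_regexps list already uses.
preprocess_regexps = [
    (re.compile(r'.*</head>', re.DOTALL), lambda match: '<html>'),
]

# Invented sample page, only to show the effect of the rule.
sample = '<!DOCTYPE html>\n<html>\n<head><title>T</title></head>\n<body>text</body></html>'

cleaned = sample
for pattern, replacement in preprocess_regexps:
    cleaned = pattern.sub(replacement, cleaned)

print(cleaned)
```

Everything up to and including `</head>` is replaced by a bare `<html>`, so the body is all that survives.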
#11 |
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
|
|
#12 |
creator of calibre
Posts: 45,195
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
OK, try implementing
Code:
def preprocess_raw_html(self, raw, url):
    return '<html><head>'+raw[raw.find('</head>'):]
|
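A standalone sketch of the same idea with one guard added. The guard is an assumption, not part of the suggestion above: `str.find` returns -1 when `</head>` is missing, and `raw[-1:]` would then keep only the last character of the page. The function name `strip_head` is hypothetical; inside a recipe the body would live in `preprocess_raw_html(self, raw, url)`.

```python
def strip_head(raw):
    """Drop everything before the closing </head> tag and open a fresh, empty head."""
    idx = raw.find('</head>')
    if idx == -1:
        # Assumed fallback: no closing head tag found, leave the page untouched.
        return raw
    # raw[idx:] starts at '</head>', so the result begins '<html><head></head>'.
    return '<html><head>' + raw[idx:]


# Invented sample inputs, only to show both branches.
page = '<!DOCTYPE html><html><head><title>T</title></head><body>text</body></html>'
print(strip_head(page))
print(strip_head('<p>fragment without a head</p>'))
```

Because `str.find` is a plain substring search, this approach avoids the regular expression engine entirely. Note that the search is case-sensitive, so a page that writes `</HEAD>` would fall through to the fallback.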
#13 |
Enthusiast
Posts: 46
Karma: 10
Join Date: Dec 2011
Device: Kindle 3
|
I don't know if this can help, but somewhere on the web (http://forum.simplicissimus.it) I found this possible tip:
Replace

preprocess_regexps = [
    (re.compile(r'.*?<head>', re.DOTALL|re.IGNORECASE), lambda match: '<head>'),
    (re.compile(r'<head>.*?<title>', re.DOTALL|re.IGNORECASE), lambda match: '<head><title>'),
    (re.compile(r'</title>.*?</head>', re.DOTALL|re.IGNORECASE), lambda match: '</title></head>')
]

with

preprocess_regexps = [
    (re.compile(r'<head>.*?', re.DOTALL|re.IGNORECASE), lambda match: '<head>'),
    (re.compile(r'<head>.*?<title>', re.DOTALL|re.IGNORECASE), lambda match: '<head><title>'),
    (re.compile(r'</title>.*?</head>', re.DOTALL|re.IGNORECASE), lambda match: '</title></head>')
]

It seems to work well! Credits: timewolf on http://forum.simplicissimus.it/calib...?topicseen#new
|
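It may be worth checking what the changed first pattern actually does. In Python's `re`, a lazy `.*?` with nothing after it matches the empty string, so `<head>.*?` matches just the literal `<head>` and replaces it with itself. The sketch below (the sample page is invented) suggests that the speed-up comes from no longer stripping the text before `<head>`, which is the step the recipe author said was needed for the economy section.

```python
import re

FLAGS = re.DOTALL | re.IGNORECASE

# First rule of the original list and of the suggested replacement list.
original_first = re.compile(r'.*?<head>', FLAGS)
suggested_first = re.compile(r'<head>.*?', FLAGS)

# Invented sample page with some markup before <head>.
sample = ('<!DOCTYPE html><html lang="it"><head><meta charset="utf-8">'
          '<title>T</title></head><body>x</body></html>')

stripped = original_first.sub(lambda match: '<head>', sample)
unchanged = suggested_first.sub(lambda match: '<head>', sample)

print(stripped)             # text before <head> has been removed
print(unchanged == sample)  # the suggested rule leaves the page as it was
```

If the prefix before `<head>` still has to go, a plain string search for the tag is one way to remove it without a leading wildcard.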
#14 |
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
|
|
#15 |
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
|
The new recipe is available at
https://bugs.launchpad.net/calibre/+bug/904387 and will be included in the next release of calibre. |