Repubblica recipe: a tragedy

faber1971 · 12-01-2011, 04:51 PM

Repubblica is the best Italian newspaper. I've been reading it since I bought my Kindle, but now it's very hard. The current recipe is too slow: it takes 1 hour to download everything, which is very strange. The previous recipe, by contrast, was fast but couldn't download the economy RSS feed (the other feeds were OK). Can anyone work a miracle?

SteliosGero · 12-04-2011, 04:47 PM

This recipe is indeed problematic. I gave it a try, but it's beyond me. How about keeping only today's articles?

That should limit the time to 1/5.

kiklop74 · 12-06-2011, 09:39 PM

The main problem with Repubblica is the economy feed. That section of the site is in HTML5, so I had to add some mumbo jumbo to try to clean things up, since calibre does not handle that so well. All those additions slowed down the article fetch. Unless Kovid has an idea how to solve this...

kovidgoyal · 12-06-2011, 10:29 PM

@kiklop74: Which recipe is this (filename)?

kiklop74 · 12-07-2011, 05:38 AM

Quote:

Originally Posted by kovidgoyal

@kiklop74: Which recipe is this (filename)?

la_repubblica.recipe

kovidgoyal · 12-07-2011, 10:36 PM

You mean the following code is slowing it down?

Code:

        for item in soup.findAll(['hgroup','deresponsabilizzazione','per']):
            item.name = 'div'
            item.attrs = []

Incidentally you should replace the del 'style' part with remove_attributes=['style']

kiklop74 · 12-08-2011, 05:00 PM

No. The slowdown comes mainly from the regexps.

kovidgoyal · 12-08-2011, 08:53 PM

You should be able to replace all those regexes with a single regex that simply strips the head section. Why is there a regex that strips the code before the opening <head>?

kiklop74 · 12-09-2011, 05:33 PM

Quote:

Originally Posted by kovidgoyal

You should be able to replace all those regexes with a single regex that simply strips the head section. Why is there a regex that strips the code before the opening <head>?

Because otherwise I get garbage HTML, with incorrectly processed HTML5 header content. That is the real reason I did all that stuff. For some reason calibre fails to properly clean up the HTML5 generated by La Repubblica's economy section.

kovidgoyal · 12-09-2011, 11:08 PM

Try using the regex

r'.*</head>', '<html>'

instead, should be much faster.
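
As a rough sketch of how that suggestion could be written as a single preprocess_regexps entry (the re.DOTALL and re.IGNORECASE flags and the sample page are assumptions, not part of the post; the pattern is applied here by hand with sub() just to show its effect):

```python
import re

# Assumption: DOTALL lets '.' cross newlines, so the match can span
# everything from the start of the page up to and including </head>.
preprocess_regexps = [
    (re.compile(r'.*</head>', re.DOTALL | re.IGNORECASE),
     lambda match: '<html>'),
]

# Hypothetical sample page, for illustration only.
sample = ('<!DOCTYPE html>\n<html>\n<head>\n<title>t</title>\n</head>\n'
          '<body><p>Hi</p></body></html>')

cleaned = sample
for pattern, replacement in preprocess_regexps:
    cleaned = pattern.sub(replacement, cleaned)

print(cleaned)
```

Everything up to and including the closing head tag is replaced by a bare opening html tag, and the body is left untouched.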

kiklop74 · 12-11-2011, 08:28 AM

Quote:

Originally Posted by kovidgoyal

Try using the regex

r'.*</head>', '<html>'

instead, should be much faster.

It does not help. The mere presence of any regex makes the news download 15-30 times slower for no apparent reason.

kovidgoyal · 12-11-2011, 09:52 PM

OK, try implementing

Code:

def preprocess_raw_html(self, raw, url):
   return '<html><head>'+raw[raw.find('</head>'):]
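
A minimal standalone sketch of the same string-slicing idea, with one added guard that is an assumption and not part of the post: str.find returns -1 when there is no closing head tag, and slicing from -1 would keep only the last character of the page. The helper name strip_head is hypothetical.

```python
def strip_head(raw):
    """Drop everything before </head> and prepend a minimal <html><head>."""
    idx = raw.find('</head>')
    if idx == -1:
        # Assumption: if there is no </head>, return the page unchanged
        # instead of slicing from -1 (which would keep one character).
        return raw
    return '<html><head>' + raw[idx:]


# Hypothetical sample pages, for illustration only.
with_head = '<!DOCTYPE html><html><head><meta x></head><body>a</body></html>'
without_head = '<p>no head here</p>'

print(strip_head(with_head))
print(strip_head(without_head))
```

Inside a recipe, preprocess_raw_html(self, raw, url) could simply return strip_head(raw). Plain find and slicing make a single pass over the page, with no regex engine involved.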

faber1971 · 12-14-2011, 04:32 AM

I don't know if this can help, but somewhere on the web (http://forum.simplicissimus.it) I found this possible tip:

replace

preprocess_regexps = [
    (re.compile(r'.*?<head>', re.DOTALL|re.IGNORECASE), lambda match: '<head>'),
    (re.compile(r'<head>.*?<title>', re.DOTALL|re.IGNORECASE), lambda match: '<head><title>'),
    (re.compile(r'</title>.*?</head>', re.DOTALL|re.IGNORECASE), lambda match: '</title></head>')
]

with

preprocess_regexps = [
    (re.compile(r'<head>.*?', re.DOTALL|re.IGNORECASE), lambda match: '<head>'),
    (re.compile(r'<head>.*?<title>', re.DOTALL|re.IGNORECASE), lambda match: '<head><title>'),
    (re.compile(r'</title>.*?</head>', re.DOTALL|re.IGNORECASE), lambda match: '</title></head>')
]

It seems to be working well!
Credits: timewolf on http://forum.simplicissimus.it/calib...?topicseen#new
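
One thing worth checking about this tip: a lazy `.*?` at the very end of a pattern is satisfied by the empty string, so the new first regex matches just the head tag and replaces it with itself. The sketch below compares the old and new first patterns on a hypothetical sample page; whether this is the actual reason for the speed-up is a guess.

```python
import re

flags = re.DOTALL | re.IGNORECASE
old_first = re.compile(r'.*?<head>', flags)  # strips everything before <head>
new_first = re.compile(r'<head>.*?', flags)  # trailing lazy .*? matches nothing

# Hypothetical sample page, for illustration only.
sample = ('<!DOCTYPE html>\n<html>\n<head>\n<meta charset="utf-8">\n'
          '<title>t</title>\n</head>\n<body></body>\n</html>')

old_result = old_first.sub(lambda match: '<head>', sample)
new_result = new_first.sub(lambda match: '<head>', sample)

print(new_first.search(sample).group(0))  # text matched by the new pattern
print(new_result == sample)               # True means the page is unchanged
print(old_result.startswith('<head>'))    # old pattern dropped the prefix
```

So the replacement list behaves as if the first entry were removed: the doctype and the opening html tag before the head are no longer stripped, and only the other two patterns still change the page.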

kiklop74 · 12-14-2011, 01:52 PM

Quote:

Originally Posted by kovidgoyal

OK, try implementing

Code:

def preprocess_raw_html(self, raw, url):
   return '<html><head>'+raw[raw.find('</head>'):]

OK, this solves the problem. I will post updated recipe to bug tracker shortly.

kiklop74 · 12-14-2011, 02:03 PM

The new recipe is available at

https://bugs.launchpad.net/calibre/+bug/904387

It will be included in the next release of calibre.
