Old 12-01-2011, 04:51 PM   #1
faber1971
Enthusiast
Posts: 46
Karma: 10
Join Date: Dec 2011
Device: Kindle 3
Repubblica recipe: a tragedy

Repubblica is the best Italian newspaper. I've been reading it since I bought my Kindle, but now it has become very hard. The current recipe is too slow: it takes 1 hour to download everything, which is very strange. The previous recipe, by contrast, was fast, but it could not download the economy RSS feed (the other feeds were OK). Can anyone work a miracle?
Old 12-04-2011, 04:47 PM   #2
SteliosGero
Junior Member
Posts: 8
Karma: 10
Join Date: Oct 2011
Device: Kindle 3g
This recipe is indeed problematic. I gave it a try, but it's beyond me. How about keeping only today's articles?

That should cut the download time to about a fifth.
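
In calibre recipes the article age cutoff is normally set with the `oldest_article` attribute (in days), often together with `max_articles_per_feed`. As a rough, calibre-independent sketch of what such a cutoff does, the hypothetical `keep_recent` helper below filters a list of (title, published) pairs. The helper name, the data shape, and the sample articles are all made up for illustration.

```python
from datetime import datetime, timedelta


def keep_recent(articles, now, oldest_article=1.0):
    """Return only the articles published within `oldest_article` days of `now`.

    `articles` is a list of (title, published_datetime) pairs; this shape is an
    assumption for the sketch, not calibre's internal representation.
    """
    cutoff = now - timedelta(days=oldest_article)
    return [(title, published) for title, published in articles if published >= cutoff]


# Made-up sample data
now = datetime(2011, 12, 4, 17, 0)
articles = [
    ("Today's story", datetime(2011, 12, 4, 8, 0)),
    ("Yesterday evening", datetime(2011, 12, 3, 20, 0)),
    ("Last week", datetime(2011, 11, 28, 9, 0)),
]
recent = keep_recent(articles, now)
print([title for title, _ in recent])
```

Whether this actually brings the time down to a fifth depends on how many older articles the feeds carry, so treat the estimate as a guess.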
Old 12-06-2011, 09:39 PM   #3
kiklop74
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
The main problem with Repubblica is the economy feed. That section of the site is in HTML5, so I had to add some mumbo jumbo to try to clean things up, since calibre does not handle that so well. All those additions slowed down the article fetch. Unless Kovid has an idea how to solve this...
Old 12-06-2011, 10:29 PM   #4
kovidgoyal
creator of calibre
Posts: 43,843
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
@kiklop74: Which recipe is this (filename)?
Old 12-07-2011, 05:38 AM   #5
kiklop74
Quote:
Originally Posted by kovidgoyal
@kiklop74: Which recipe is this (filename)?

la_repubblica.recipe
Old 12-07-2011, 10:36 PM   #6
kovidgoyal
You mean the following code is slowing it down?

Code:
        for item in soup.findAll(['hgroup','deresponsabilizzazione','per']):
            item.name = 'div'
            item.attrs = []
Incidentally, you should replace the del 'style' part with remove_attributes=['style'].
Old 12-08-2011, 05:00 PM   #7
kiklop74
No. The slowdown comes mainly from the regexps.
Old 12-08-2011, 08:53 PM   #8
kovidgoyal
You should be able to replace all those regexes with a single regex that simply strips the head section. Why is there a regex that strips code before the opening <head>?
Old 12-09-2011, 05:33 PM   #9
kiklop74
Quote:
Originally Posted by kovidgoyal
You should be able to replace all those regexes with a single regex that simply strips the head section. Why is there a regex that strips code before the opening <head>?

Because otherwise I get garbage HTML, with incorrectly processed HTML5 header content. That is the real reason I did all that. For some reason calibre fails to properly clean up the HTML5 generated by la Repubblica's economy section.
Old 12-09-2011, 11:08 PM   #10
kovidgoyal
Try using the regex

r'.*</head>', '<html>'

instead, should be much faster.
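
A minimal sketch of how that pattern and replacement pair behaves, written with plain `re` so it runs outside calibre. The rule uses the same (compiled pattern, replacement function) shape as the `preprocess_regexps` entries elsewhere in this thread. The `apply_rules` helper and the sample page are assumptions made for illustration, and `re.DOTALL` is added here because without it `.` would stop at the first line break.

```python
import re

# One rule in the (compiled pattern, replacement function) shape.
preprocess_regexps = [
    # re.DOTALL lets '.' cross line breaks, so the match runs from the start
    # of the page through the closing </head> tag.
    (re.compile(r'.*</head>', re.DOTALL | re.IGNORECASE), lambda match: '<html>'),
]


def apply_rules(raw, rules):
    """Hypothetical stand-in for the step that applies each rule in order."""
    for pattern, replacement in rules:
        raw = pattern.sub(replacement, raw)
    return raw


# Made-up sample page
sample = (
    '<!DOCTYPE html>\n<html lang="it">\n<head>\n'
    '<meta charset="utf-8">\n<title>Economia</title>\n</head>\n'
    '<body><p>Testo</p></body></html>'
)
cleaned = apply_rules(sample, preprocess_regexps)
print(cleaned)
```

Note that this throws away the whole head, including the `<title>`, which the three-regex version tried to keep.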
Old 12-11-2011, 08:28 AM   #11
kiklop74
Quote:
Originally Posted by kovidgoyal
Try using the regex

r'.*</head>', '<html>'

instead, should be much faster.

It does not help. The mere presence of any regex makes the news download 15-30 times slower for no apparent reason.
Old 12-11-2011, 09:52 PM   #12
kovidgoyal
OK, try implementing

Code:
def preprocess_raw_html(self, raw, url):
    # Drop everything before </head> and prepend a minimal <html><head> opener.
    return '<html><head>' + raw[raw.find('</head>'):]
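
The same slicing idea can be tried outside calibre as a plain function. The sketch below is only an illustration: the `strip_head` name, the sample pages, and the guard for a missing `</head>` are additions that are not part of the suggestion above. The guard is there because `str.find()` returns -1 when nothing is found, and `raw[-1:]` would then keep only the last character of the page.

```python
def strip_head(raw):
    """Standalone version of the slicing idea, with an added guard."""
    idx = raw.find('</head>')
    if idx == -1:
        # No closing head tag: leave the page untouched (added for this sketch).
        return raw
    # Keep everything from '</head>' onward and prepend a minimal opener.
    return '<html><head>' + raw[idx:]


# Made-up sample pages
sample = '<!DOCTYPE html><html><head><title>Economia</title></head><body><p>Testo</p></body></html>'
print(strip_head(sample))
print(strip_head('<p>no head here</p>'))
```

Because this is a single substring search and one slice per page, it avoids the regex machinery entirely.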
Old 12-14-2011, 04:32 AM   #13
faber1971
I don't know if this can help, but somewhere on the web (http://forum.simplicissimus.it) I found this possible tip:


Replace

preprocess_regexps = [
    (re.compile(r'.*?<head>', re.DOTALL|re.IGNORECASE), lambda match: '<head>'),
    (re.compile(r'<head>.*?<title>', re.DOTALL|re.IGNORECASE), lambda match: '<head><title>'),
    (re.compile(r'</title>.*?</head>', re.DOTALL|re.IGNORECASE), lambda match: '</title></head>')
]

with

preprocess_regexps = [
    (re.compile(r'<head>.*?', re.DOTALL|re.IGNORECASE), lambda match: '<head>'),
    (re.compile(r'<head>.*?<title>', re.DOTALL|re.IGNORECASE), lambda match: '<head><title>'),
    (re.compile(r'</title>.*?</head>', re.DOTALL|re.IGNORECASE), lambda match: '</title></head>')
]

It seems to be working well!
Credits: timewolf on http://forum.simplicissimus.it/calib...?topicseen#new
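
One thing worth checking about this tip: a lazy `.*?` at the very end of a pattern matches nothing, so the changed first rule only ever matches the literal `<head>` and replaces it with itself. The sketch below compares the two versions of that first rule on a made-up page. The `apply_rule` helper and the sample are assumptions for illustration, and whether this is the actual reason for the speed-up is a guess, not something measured.

```python
import re

flags = re.DOTALL | re.IGNORECASE
original_rule = (re.compile(r'.*?<head>', flags), lambda match: '<head>')
replaced_rule = (re.compile(r'<head>.*?', flags), lambda match: '<head>')

# Made-up sample page
sample = '<!DOCTYPE html>\n<html>\n<head>\n<title>Economia</title>\n</head>\n<body></body></html>'


def apply_rule(raw, rule):
    """Apply a single (pattern, replacement function) rule to the page."""
    pattern, replacement = rule
    return pattern.sub(replacement, raw)


# The original rule strips everything before <head>.
print(repr(apply_rule(sample, original_rule)))
# The replaced rule matches only the literal '<head>', so a page with a
# lowercase <head> tag comes back unchanged.
print(apply_rule(sample, replaced_rule) == sample)
```

If that reading is right, the markup before `<head>` is no longer stripped, so the output for the economy section should be checked for the garbage HTML described earlier.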
Old 12-14-2011, 01:52 PM   #14
kiklop74
Quote:
Originally Posted by kovidgoyal
OK, try implementing

Code:
def preprocess_raw_html(self, raw, url):
    return '<html><head>' + raw[raw.find('</head>'):]

OK, this solves the problem. I will post the updated recipe to the bug tracker shortly.
Old 12-14-2011, 02:03 PM   #15
kiklop74
The new recipe is available at

https://bugs.launchpad.net/calibre/+bug/904387

It will be included in the next release of calibre.

All times are GMT -4. The time now is 08:21 AM.

