Replace spans with their content

cyttorak · 11-29-2014, 08:01 AM

Hi

calibre converts this code:

PHP Code:


			
<div>- recordarles también que mola que de vez en cuando se pase alguien de otros grupos por las reus de <span class="il">comunicación</span> y que esta el correo&nbsp;<a href="mailto:comunicacioninterna@ganemosmadrid.info" target="_blank">comunicacioninterna@<wbr>ganemosmadrid.info</a> para que nos pasen las <span class="il">actas</span> y las necesidades de <span class="il">comunicación</span> que tengan.</div>

into this:

PHP Code:


			
<div class="calibre6"><p class="calibre9">- recordarles también que mola que de vez en cuando se pase alguien de otros grupos por las reus de </p><span>comunicación</span><p class="calibre9"> y que esta el correo </p><a href="mailto:comunicacioninterna@ganemosmadrid.info" target="_blank">comunicacioninterna@ganemosmadrid.info</a><p class="calibre9"> para que nos pasen las </p><span>actas</span><p class="calibre9"> y las necesidades de </p><span>comunicación</span><p class="calibre9"> que tengan.</p></div>

I don't know why it is putting that extra <p class="calibre9"> next to the <span>.

How can I avoid that?

My code:

Code:

import re
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1416065639(BasicNewsRecipe):
	title	= u'Ganemos'
	description = 'Actas de Ganemos'
	oldest_article = 365
	max_articles_per_feed = 100
	auto_cleanup = True
	reverse_article_order = True
	remove_empty_feeds = True
	language = 'es_ES'
	category = 'Rss'
	publisher = 'Ganemos'
	publication_type = 'actas'
	remove_attributes = ['class','id','name']
	feeds	= [
		(u'Feminismos', u'http://ganemosmadrid.info/category/actas/actas_feminismos/feed/')
		,(u'Programas y contenido', u'http://ganemosmadrid.info/category/actas/actas_programa/feed/')
		,(u'Candidaturas', u'http://ganemosmadrid.info/category/actas/actas_candidaturas/feed/')
		,(u'Comunicación', u'http://ganemosmadrid.info/category/actas/actas-comunicacion/feed/')
		,(u'Coordinación', u'http://ganemosmadrid.info/category/actas/actas_coordinacion/feed/')
		,(u'Herramientas y metodología', u'http://ganemosmadrid.info/category/actas/actas_herramientas/feed/')
		,(u'Movimiento municipalista', u'http://ganemosmadrid.info/category/actas/actas_movimiento/feed/')
	]
	extra_css = '.calibre_navbar {display:none;}'
	preprocess_regexps = [
		(re.compile(u'\xa0'), lambda match: ' ')
		,(re.compile(r'&nbsp;',re.DOTALL|re.IGNORECASE), lambda match: ' ')
		,(re.compile(r'\s*<p[^>]*>\s*</p>\s*',re.DOTALL|re.IGNORECASE), lambda match: '')
		,(re.compile(r'\s*<div[^>]*>\s*</div>\s*',re.DOTALL|re.IGNORECASE), lambda match: '')
	]

	conversion_options = {
		'comments' : description
		,'tags' : category
		,'language' : language
		,'publisher' : publisher
	}

	def get_cover_url(self):
		return 'http://ganemosmadrid.info/wp-content/uploads/2014/11/GM_ORG_SEPT.png'

	def parse_feeds (self):
		def parseFecha(d,m,a,f):
			if f:
				if len(f)==10:
					return f
				sf=re.split('[\/\-]',f)
				d=sf[0]
				m=sf[1]
				if len(m)==1:
					m='0'+m
				try:
					a=sf[2]
				except IndexError:
					a=None
			if len(d)==1:
				d='0'+d
			m=m.lower()
			# Map Spanish month names to two-digit numbers; numeric months pass through unchanged.
			meses = {
				'enero':'01', 'febrero':'02', 'marzo':'03', 'abril':'04',
				'mayo':'05', 'junio':'06', 'julio':'07', 'agosto':'08',
				'septiembre':'09', 'octubre':'10', 'noviembre':'11', 'diciembre':'12',
			}
			m=meses.get(m,m)
			if not a:
				if float(m)>5:
					a='2014'
				else:
					a='2015'
			elif len(a)==2:
				a='20'+a
			return d+'/'+m+'/'+a
		ordinal = re.compile(u'^(Acta )?(\d+)(er\.?|o\.?|a\.|º|ª) ', re.IGNORECASE|re.UNICODE)
		fecha1 = re.compile(u'.*?(\d\d?\/\d\d\/(20)?1\d).*', re.IGNORECASE|re.UNICODE)
		fecha2 = re.compile(u'.*?(\d+) (de )?(enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre)( de )?(201\d)?.*', re.IGNORECASE|re.UNICODE)
		gts = re.compile(u'.*?(Grupos territoriales|Cultura).*', re.IGNORECASE|re.UNICODE)
		feeds = BasicNewsRecipe.parse_feeds(self)
		for f in feeds:
			for a,c in enumerate(f.articles):
				g=u''
				self.log('==> '+c.title)
				mOr = ordinal.match(c.title)
				mF1 = fecha1.match(c.title)
				mF2 = fecha2.match(c.title)
				mGt = gts.match(c.title)
				if mGt:
					g=mGt.group(1).lower().capitalize()
				else:
					g=f.title
				if mOr:
					g=mOr.group(2)+'º '+g
				if mF1:
					g=parseFecha(None,None,None,mF1.group(1))+' '+g
				if mF2:
					g=parseFecha(mF2.group(1),mF2.group(3),mF2.group(5),None)+' '+g
				c.title=g
				self.log('<== '+c.title+'\n')
		return feeds

ireadtheinternet · 11-30-2014, 06:54 PM

I think that is normal. I also get calibre# classes inserted in my recipes. I don't think it causes any harm, and it seems to be needed for some internal processing.

eschwartz · 11-30-2014, 10:47 PM

It is because calibre flattens all CSS, in order to ensure it works as best as possible across all devices.

The end result is that it looks the way it was supposed to, on the ereader screen, and looks like highly-confusing garbage in the internals -- on the theory that conversions are not usually meant to be edited.

You can use ebook-convert via the command-line and pass an output name without an extension to get it to write the un-flattened OEB directory to that location. Not really sure who would use that, actually... If you want to do anything with the HTML, do it before calibre converts it. And then it won't matter afterward, what calibre does to it.
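
To illustrate the "do it before calibre converts it" advice, here is a minimal sketch that unwraps the <span> tags and drops the <wbr> hints in the raw HTML. The names SPAN_TAGS, WBR_TAGS and unwrap_spans are hypothetical and not part of the original recipe, and I have not verified that this removes the extra <p class="calibre9"> wrappers, particularly with auto_cleanup enabled.

```python
import re

# Hypothetical patterns (not from the original recipe):
# match opening and closing <span> tags so only their text remains,
# and match <wbr> line-break hints such as the one inside the e-mail link.
SPAN_TAGS = re.compile(r'</?span\b[^>]*>', re.IGNORECASE)
WBR_TAGS = re.compile(r'<wbr\s*/?>', re.IGNORECASE)

def unwrap_spans(html):
    """Return html with <span>/</span> and <wbr> tags removed, text kept."""
    html = SPAN_TAGS.sub('', html)
    return WBR_TAGS.sub('', html)

# Shortened sample modelled on the feed markup; the address is a placeholder.
sample = ('<div>las reus de <span class="il">comunicación</span> y '
          '<a href="mailto:x@example.org">x@<wbr>example.org</a></div>')
print(unwrap_spans(sample))
```

In the recipe itself the same patterns could presumably be appended to the existing preprocess_regexps list in the form it already uses, for example (SPAN_TAGS, lambda match: ''), so the substitution runs on the downloaded HTML.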
