03-28-2011, 09:34 AM   #1
hiperlink
Replacing tags after using them

Hi All!

I'm trying to clean up the really messy HTML of a newspaper site's pages, which rely heavily on tables.

In my recipe I was able to find the needed content and extract it via keep_only_tags and remove_tags.

Code:
    keep_only_tags = [
        dict(name='td', attrs={'class': ['content']}),
    ]
    remove_tags = [
        dict(name='div', attrs={'class': ['ad-container-outer',
                                          'tags noborder',
                                          'video-container',
                                          'h']}),
        dict(name='div', attrs={'style': ['width:17px; height:17px; background-color:#8D0648; margin-bottom:25px; float:right;']}),
        dict(name='td', attrs={'class': ['foot']}),
        dict(name='tfoot'),
    ]


But the articles sit inside an inner table (thead, tr, td), which doesn't look good when I convert the recipe output to MOBI for my Kindle: only the first screen is filled with text, and the second page is empty.

So I tried to get rid of the unnecessary tags, but without luck.

I tried postprocess_html:
Code:
def postprocess_html(self, soup, first):
    for rpltags in ['table','thead','tr','td','tbody','tfoot']:
        canfind = soup.find(rpltags)
        if canfind:
            for tags in soup.findAll(rpltags):
                tags.replaceWithChildren()
    
        return soup


But it gave me a TypeError:
Code:
Could not fetch link http://nol.hu/belfold/20110328-utcara_vonul_az_mszp
Traceback (most recent call last):
  File "site-packages/calibre/web/fetch/simple.py", line 457, in process_links
  File "site-packages/calibre/web/feeds/news.py", line 707, in _postprocess_html
  File "/tmp/calibre_0.7.52_tmp_vi75Xu/calibre_0.7.52_WW92bF_recipes/recipe0.py", line 84, in postprocess_html
    tags.replaceWithChildren()
TypeError: 'NoneType' object is not callable
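
A possible reading of this traceback, offered as an assumption rather than something confirmed in the thread: BeautifulSoup 3 treats an unknown attribute on a tag as a search for a child tag of that name. If the BeautifulSoup bundled with calibre 0.7.52 does not define replaceWithChildren, then tags.replaceWithChildren becomes a lookup for a child element called replaceWithChildren, which yields None, and calling None raises exactly this error. The sketch below mimics that lookup with a hypothetical FakeTag class; it is not calibre or BeautifulSoup code.

```python
class FakeTag:
    """Hypothetical stand-in for a BeautifulSoup 3 style tag object."""

    def find(self, name):
        # Pretend the tag has no child element called `name`.
        return None

    def __getattr__(self, name):
        # Only reached for attributes that are not defined on the class:
        # the unknown name is treated as a child-tag lookup.
        return self.find(name)


tag = FakeTag()
try:
    tag.replaceWithChildren()  # resolves to None, then None() is called
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
```

If this is what happened, the method name was never found as a method at all, which would explain why the error mentions NoneType instead of a missing attribute.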


Then I tried preprocess_regexps, but it gave me empty article pages:
Code:
    preprocess_regexps = [
        (re.compile(r'<table.*?>', re.IGNORECASE), lambda match: '<div>'),
        (re.compile(r'</table.*?>', re.IGNORECASE), lambda match: '</div>'),
        (re.compile(r'<thead.*?>', re.IGNORECASE), lambda match: '<div>'),
        (re.compile(r'</thead.*?>', re.IGNORECASE), lambda match: '</div>'),
        (re.compile(r'<tfoot.*?>', re.IGNORECASE), lambda match: '<div>'),
        (re.compile(r'</tfoot.*?>', re.IGNORECASE), lambda match: '</div>'),
        (re.compile(r'<tr.*?>', re.IGNORECASE), lambda match: '<div>'),
        (re.compile(r'</tr.*?>', re.IGNORECASE), lambda match: '</div>'),
        (re.compile(r'<td.*?>', re.IGNORECASE), lambda match: '<div>'),
        (re.compile(r'</td.*?>', re.IGNORECASE), lambda match: '</div>'),
    ]
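
One plausible reason for the empty pages, assuming calibre applies preprocess_regexps to the raw HTML before keep_only_tags is evaluated: once every td start tag has become a bare div, the td with class 'content' that keep_only_tags selects no longer exists, so nothing is kept. A minimal sketch on a made-up page fragment (the sample HTML is hypothetical, and only the td patterns are shown):

```python
import re

# The same kind of substitutions as in the recipe, shortened to the td patterns.
preprocess_regexps = [
    (re.compile(r'<td.*?>', re.IGNORECASE), lambda match: '<div>'),
    (re.compile(r'</td.*?>', re.IGNORECASE), lambda match: '</div>'),
]

# Hypothetical page fragment, not taken from the real site.
raw = '<table><tr><td class="content"><p>Article text</p></td></tr></table>'

cleaned = raw
for pattern, func in preprocess_regexps:
    cleaned = pattern.sub(func, cleaned)

# Check whether the cell that keep_only_tags looks for still exists.
has_content_cell = re.search(r'<td[^>]*class="content"', cleaned) is not None

print(cleaned)
print(has_content_cell)
```

Under that assumption, a substitution would have to keep the class attribute, or the renaming would have to happen after keep_only_tags has run.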


The recipe in its current state (which works fine if you are creating e.g. PDF output) is available here: https://github.com/zsoltika/.hu-reci...0_1_nap.recipe

So my question is: after cleaning up the article's HTML via keep_only_tags and remove_tags, how does one replace some tags (in my case table, thead, tfoot, tr, td; only the tag names, not their contents!) with another tag name (e.g. span)?

And one more thing came to mind: wouldn't it be nicer if the various API callables/overrides etc. at http://calibre-ebook.com/user_manual/news_recipe.html were numbered? I mean, I can't tell which of ['preprocess_html', 'preprocess_regexps', 'keep_only_tags', 'remove_tags'] is applied earlier in the process.

Thanks for any help!
03-28-2011, 10:04 AM   #2
Starson17
Quote:
Originally Posted by hiperlink
So my question is: after cleaning up the article's HTML via keep_only_tags and remove_tags, how does one replace some tags (in my case table, thead, tfoot, tr, td; only the tag names, not their contents!) with another tag name (e.g. span)?
The simplest way is:
Code:
'linearize_tables' : True
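
For context, that entry normally goes into the conversion_options dictionary, which is a class-level attribute of the recipe. A minimal sketch of just that attribute, assuming it is placed in the body of a BasicNewsRecipe subclass (the surrounding class is omitted here so the fragment stays self-contained):

```python
# Belongs in the body of the recipe class (a BasicNewsRecipe subclass).
# Asks the conversion step to flatten tables into linear content.
conversion_options = {
    'linearize_tables': True,
}
```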
Alternatively:
Code:
    def postprocess_html(self, soup, first_fetch):
        for t in soup.findAll(['table', 'tr', 'td']):
            t.name = 'div'
        # calibre expects the processed soup to be returned.
        return soup
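
To see why renaming is enough, here is a small stand-alone sketch of the same idea. It uses the standard library's xml.etree.ElementTree on a made-up, well-formed fragment instead of the BeautifulSoup object that calibre passes to postprocess_html, so the helper name rename_table_tags and the sample markup are illustrative assumptions. Renaming an element changes only its name; its children and text stay where they are.

```python
import xml.etree.ElementTree as ET

# Tag names to rename; mirrors the list from the original question.
TABLE_TAGS = {'table', 'thead', 'tbody', 'tfoot', 'tr', 'td'}


def rename_table_tags(root, new_name='div'):
    """Rename table-related elements in place, keeping their contents."""
    for element in root.iter():
        if element.tag in TABLE_TAGS:
            element.tag = new_name
    return root


# Hypothetical, well-formed sample markup.
sample = '<body><table><tr><td><p>Article text</p></td></tr></table></body>'
root = rename_table_tags(ET.fromstring(sample))
result = ET.tostring(root, encoding='unicode')
print(result)
```

The nesting is unchanged after the call; only the element names differ, which is the same effect the t.name assignment has on the soup.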
03-28-2011, 11:23 AM   #3
hiperlink
Worked like a charm, thank you!