Remove author tag from comics

BRGriff · 06-01-2011, 08:25 PM

I have worked diligently to resolve this issue myself. I have done lots of searching on this site and have made many improvements in the recipe I am trying to write. I have learned to rotate images and remove every tag except one to obtain ONLY the image file and nothing more. The tag I can not seem to remove is the name of the author.

For Arcamax comics, the original HTML code for the strip is:
<a href="/thefunnies/andycapp/bio" class="author bio" rel="/thefunnies/andycapp/bio?ajax" title="Reginald Smythe">Reginald Smythe</a>

For GoComics, the original HTML code is:
<h1 ><a href="/kitandcarlyle/2011/06/01">Kit 'N' Carlyle</a><span> by Larry Wright</span></h1>

The tag or information I want to remove is highlighted in "red".

I have tried in every way I know to remove the tag.

Here is the reason it matters: I am getting older and my eyesight is not what it used to be. That is the primary reason for getting an e-reader in the first place: so I can enlarge the text and read with more comfort. But comics are coming out too small to read on my Kindle 3!

My thought is if I remove as much information as possible, I have more screen area to see the comic. I have added extra CSS code to enlarge the image, within the limits of the Kindle 3 so it does not default to "fit to screen" but the author's name is interfering with getting the most out of that.

I have tried "debugging" the process from input to processed but can find nothing I am able to change there. I have tried "Inspect" in the Calibre Viewer and it shows the link as a "span" with a "class" equal to "underline". I might also note that the author's name in the conversion process is VERY different: In Arcamax, the author's name is in small type inline with the image. In GoComics, the author's name is very large and appears atop the comic.

I will be most appreciative of any and all help with this. I may be getting old, but there is still enough kid left in me to want to read my funnies every day.

Thank you,

Starson17 · 06-02-2011, 12:10 PM

For Arcamax comics, you can try:

Code:

    remove_tags    = [dict(name='a', attrs={'class':'author bio'})]

You might as well remove the entire a tag, as there's nothing left if you just remove the part you marked in red.

For GoComics, I'll make some suggestions. The place to look for how to do this is the BeautifulSoup documentation. You could remove all span tags, but that's probably too aggressive. Alternatively, you can remove the first span tag in each h1 tag; preprocess_html is usually used for this kind of thing.

Code:

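    def preprocess_html(self, soup):
        # Remove the first <span> (the " by Author" text) inside each <h1>.
        for h1 in soup.findAll('h1'):
            span = h1.find('span')
            if span:
                span.extract()
        # calibre expects preprocess_html to return the (modified) soup.
        return soup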

That's untested code.

Last, you can try using regular expressions in your remove_tags. Remove any span tag that has the " by " in it.
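
As a rough sketch of the regular-expression idea (untested against the live site): the pattern below matches a span whose text starts with " by " and deletes it from an HTML string. The sample markup is the h1 quoted in the first post; the pattern and the `strip_byline` helper name are made up for illustration and are not part of any existing recipe. It is plain Python that runs outside calibre.

```python
import re

# Matches a <span> whose text begins with " by " (e.g. "<span> by Larry Wright</span>").
# Hypothetical pattern for illustration; adjust it if the site's markup differs.
BYLINE_SPAN = re.compile(r'<span[^>]*>\s*by\s[^<]*</span>', re.IGNORECASE)

def strip_byline(html):
    """Return html with any ' by Author' span removed."""
    return BYLINE_SPAN.sub('', html)

sample = ('<h1 ><a href="/kitandcarlyle/2011/06/01">Kit \'N\' Carlyle</a>'
          '<span> by Larry Wright</span></h1>')
print(strip_byline(sample))
```

Inside a recipe, a compiled pattern like this would normally be listed in `preprocess_regexps` as a `(pattern, replacement_function)` pair, for example `(BYLINE_SPAN, lambda match: '')`, rather than called by hand.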

Here's some running code you can look over that hunts around in the soup for break tags and removes them based on attributes and the existence of Sibling tags.

Code:

    def preprocess_html(self, soup):
        # Remove any <br> whose previous and next sibling tags are both <br>.
        for br in soup.findAll('br'):
            prev = br.findPreviousSibling(True)
            if hasattr(prev, 'name') and prev.name == 'br':
                nxt = br.findNextSibling(True)
                if hasattr(nxt, 'name') and nxt.name == 'br':
                    br.extract()
        return soup

BRGriff · 06-02-2011, 04:50 PM

Thank you Starson17! The Arcamax fix is working great and the comics are larger and much easier to read.

GoComics is having trouble with its servers in the aftermath of its merger with Comics.com, so I am having difficulty testing changes to the recipe code. I did try the preprocess code you submitted above but it didn't work for me. Maybe I did something wrong. I placed the code in the recipe after the "articles = self.make_links(url)" line and before the "def make_links(self, url):" subroutine.

I have also tried working with the "remove_tags":

dict(name='h1', attrs={'span':['by']}), and also dict(name='span', attr={'by':['']}), neither one of which worked.
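
A likely reason neither attempt matched: the keys in `attrs` are compared with the tag's own HTML attributes (class, id, href and so on), not with child tags or with the text inside the tag, and the keyword has to be spelled `attrs`. The toy matcher below is only a sketch of that idea in plain Python; `matches` is a made-up helper, not calibre's or BeautifulSoup's actual code.

```python
def matches(tag_name, tag_attrs, spec):
    """Toy check: does a tag satisfy a remove_tags-style spec?

    tag_attrs holds the tag's own attributes, e.g. {'class': 'author bio'}.
    spec is a dict like dict(name='a', attrs={'class': 'author bio'}).
    """
    if spec.get('name') not in (None, tag_name):
        return False
    for key, wanted in spec.get('attrs', {}).items():
        allowed = wanted if isinstance(wanted, list) else [wanted]
        if tag_attrs.get(key) not in allowed:
            return False
    return True

# <span> by Larry Wright</span> has no attributes at all.
span_attrs = {}
print(matches('span', span_attrs, dict(name='span')))                      # True
print(matches('span', span_attrs, dict(name='span', attrs={'by': ['']})))  # False
# <h1 > has no attribute called "span" either.
print(matches('h1', {}, dict(name='h1', attrs={'span': ['by']})))          # False
```

The Arcamax link, by contrast, really does carry class="author bio", which is why that rule can match it.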

It has occurred to me that not only do I need to get rid of the author's name, but also the comic's name as shown in red below:
<h1 ><a href="/kitandcarlyle/2011/06/02">Kit 'N' Carlyle</a><span> by Larry Wright</span></h1>

What may be easier is that the url in the original site HTML appears elsewhere without the comic strip name or the author's name. The HTML is shown below:

<div class="social-box">
<ul>
<li>
<form id="myspacepostto" method="post" action="http://www.myspace.com/index.cfm?fuseaction=postto" target="_blank">
<input type="hidden" name="u" value="http://www.gocomics.com/kitandcarlyle/2011/06/02"/>
</li>
</ul>
</div>

I have edited out the extraneous HTML code. Once GoComics is up and running smoothly, I will try adding to "keep_only_tags" the code: dict(name='input', atrrs={'u':['value']}). Do you think that might work?
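
A sketch of how that hidden input could be picked out, using only Python's standard library rather than calibre's soup: the tag is selected by its name="u" attribute, and the URL is then read from its value attribute. The `HiddenInputFinder` class is hypothetical and the snippet is a trimmed copy of the HTML above. Note that a hidden input contains no image, so keeping only that tag would presumably leave nothing to display; the value is more useful as a source for the URL.

```python
from html.parser import HTMLParser

class HiddenInputFinder(HTMLParser):
    """Collects the value of every <input> whose name attribute is 'u'."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == 'input' and attr_map.get('name') == 'u':
            self.values.append(attr_map.get('value'))

snippet = '''
<div class="social-box"><ul><li>
<form id="myspacepostto" method="post" target="_blank">
<input type="hidden" name="u" value="http://www.gocomics.com/kitandcarlyle/2011/06/02"/>
</form></li></ul></div>
'''

finder = HiddenInputFinder()
finder.feed(snippet)
print(finder.values)
```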

I very much appreciate all your help and patience.

BRGriff · 06-03-2011, 05:30 PM

Just an update: I have been able to remove the author's name on GoComics with the simple "remove_tags = [dict(name='span')]".

As to removing the comic's name, i.e. "Kit 'N' Carlyle", it is "link text" and I am still working on that, with the same problem in testing since GoComics is still having difficulties with its server since the merger.

I am reading up on HTML parsing and regex but have yet to find the answer.

At present, both Arcamax and GoComics ARE larger on my Kindle 3 and much easier to read. So something good has come of all this but I really do want to maximize the image size further and will keep working on the solution. The final part of my effort will have to be addressed in the "Conversion" forum as the TOC text at the top of each page is too large and taking up valuable screen area.

I welcome any input anyone has to offer.

Starson17 · 06-06-2011, 09:27 AM

Quote:

Originally Posted by brgriff

Just an update: I have been able to remove the author's name on GoComics with the simple "remove_tags = [dict(name='span')]".

As to removing the comic's name, i.e. "Kit 'N' Carlyle", it is "link text" and I am still working on that, with the same problem in testing since GoComics is still having difficulties with its server since the merger.

Have you tried a simple:

Code:

remove_tags  =  [dict(name='h1')]

instead of removing the span?
The span is inside the h1 tag, so you can remove both with the above. The only possible issue is if there are other h1 elements you don't want removed.

You are following this more closely than I am, so let me know when the server seems to stabilize. I'll try to grab some time to fix anything in the recipe that needs fixing if no one else does it first.

Quote:

At present, both Arcamax and GoComics ARE larger on my Kindle 3 and much easier to read. So something good has come of all this but I really do want to maximize the image size further

You have three issues. One is the size of the incoming image, the next is the size of the conversion and the last is the size of the image displayed on the device. All can be controlled somewhat. For the first, you can see if the comic_size parameter used in the go_comic recipe still works after the merger. For the second, you can play with your specified device (calibre will resize images to fit) and for the third, you can adjust the CSS parameters, call on the image manipulation routines or control what's on your page (as you are doing now). All have an effect.

BRGriff · 06-06-2011, 04:23 PM

The element in the <h1> tag I want to keep is the link to the comic strip:

Quote:

<h1 ><a href="/kitandcarlyle/2011/06/01">Kit 'N' Carlyle</a><span> by Larry Wright</span></h1>

That is why I only removed the <span> and not the entire <h1>. Is my thinking correct on this issue?

As far as GoComics is concerned, it is now fairly stable and getting better by the day. I am now able to download all my comics from both the former Comics.com and GoComics using just the GoComics recipe with the new feed list. There are still some problems with editorial comics not downloading. I do not know if they have been removed or simply not put online yet. As to the general comics, all 25 of the ones I follow are coming through without a hitch.

As to the size of the image, I have never been able to get the recipe "comic_size=" to work. This may be because I am converting to .mobi. It did not work for me even before the merger. I adjust the size via CSS using pixels instead of percentages. One other manner of messing around with the image size is to get the "zoom" image from Gocomics. Since my present method of dealing with the image size is working, I have not messed around with the coding to obtain the zoomed image.

I use a different recipe from the one you have written. Since I have a Kindle 3 with a screen of only 600x800, it is important to ME that I maximize the space for the image and remove as much extraneous data as possible. Therefore, I strip the "Banner", the comic strip "Alink" and the "Author's Name". This leaves me with only the jpeg image and nothing more.

Thanks for your input and continuing help.

Starson17 · 06-07-2011, 08:49 AM

Quote:

Originally Posted by brgriff

The element in the <h1> tag I want to keep is the link to the comic strip:

The link goes along with something to click on. The item you click on is the name of the comic strip. If I understood you correctly, you said you wanted to remove the name of the strip, and that leaves nothing to click on, so there's no way to get to that link. I'd leave the link on the image and remove the entire h1 tag.

Quote:

As to the size of the image, I have never been able to get the recipe "comic_size=" to work.

That option controls the size of the image retrieved.

Quote:

This may be because I am converting to .mobi.

It's not affected by the format, but by the ereader device you've told calibre you have. Your specified device causes calibre to reduce the maximum size of images to fit the device display. There's no reason to grab an image larger than your device can display because calibre will reduce the size anyway. That's probably why you saw no effect. I don't even know if that option still works, but it did when I wrote the recipe.

Quote:

It did not work for me even before the merger. I adjust the size via CSS using pixels instead of percentages. One other manner of messing around with the image size is to get the "zoom" image from Gocomics. Since my present method of dealing with the image size is working, I have not messed around with the coding to obtain the zoomed image.

I'm not sure what the zoom image is, but perhaps it's the same as the parameter fetched by comic_size. That option controls what the site sends. The CSS controls how what is sent is displayed.

Quote:

I maximize the space for the image and remove as much extraneous data as possible. Therefore, I strip the "Banner"; comic strip "Alink" and the "Author's Name". This leaves me with only the jpeg image and nothing more.

I don't use my own recipe either. I force my display to portrait, then rotate images if needed so that the long dimension is always vertical.

AustinTim · 07-25-2011, 05:51 PM

BRGrif,
in your original post you mention that you learned how to rotate images... did you mean that you do this in the recipe/processing? so that the 3-panel strips come out using the long dimension of the kindle?

thanks,
-tim

Starson17 · 07-25-2011, 09:15 PM

Quote:

Originally Posted by AustinTim

BRGrif,
in your original post you mention that you learned how to rotate images... did you mean that you do this in the recipe/processing? so that the 3-panel strips come out using the long dimension of the kindle?

thanks,
-tim

I don't know about BRGrif, but I do this in recipes. Here's a link to code that finds the long dimension of each image and turns it vertical:
https://www.mobileread.com/forums/sho...7&postcount=11

BRGriff · 07-26-2011, 05:43 PM

Austin Tim,

Quote:

BRGrif,
in your original post you mention that you learned how to rotate images... did you mean that you do this in the recipe/processing? so that the 3-panel strips come out using the long dimension of the kindle?

thanks,
-tim

As Starson17 has replied, there is code for flipping the image in the recipe so it automatically appears in landscape mode on your Kindle. I tried it but took it out of my recipe for this reason: the resulting image is smaller and less readable in landscape if done by the recipe method rather than by manually rotating the screen after the comics have been loaded on your Kindle. The reason is that when you flip the comic via the recipe, the TOC header remains at the top of the portrait view (the right side in landscape view), thus limiting the length of the comic strip. Since the Kindle resizes images proportionally, the resulting overall image is going to be smaller than if you flip the image manually. Make sense? Let me know what your thoughts are if you try the code Starson17 pointed you to.
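
To put rough numbers on that, here is a small sketch of the proportional-fit arithmetic. The 600x800 screen is the Kindle 3 figure mentioned earlier in the thread; the 60-pixel header height and the 900x300 strip size are assumptions picked only for illustration, and `fitted_size` is a made-up helper.

```python
def fitted_size(img_w, img_h, box_w, box_h):
    """Scale (img_w, img_h) proportionally so it fits inside (box_w, box_h)."""
    scale = min(box_w / img_w, box_h / img_h)
    return round(img_w * scale), round(img_h * scale)

SCREEN_W, SCREEN_H = 600, 800   # Kindle 3 screen, portrait
HEADER = 60                      # assumed height of the TOC header, in pixels
strip_w, strip_h = 900, 300      # assumed size of a 3-panel strip

# Recipe rotation: the strip becomes 300x900 and shares the portrait page
# with the header, so its long side is limited to 800 - HEADER pixels.
rotated = fitted_size(strip_h, strip_w, SCREEN_W, SCREEN_H - HEADER)

# Manual rotation: the header sits along the top of the landscape view,
# so the strip can use the full 800-pixel length of the screen.
manual = fitted_size(strip_w, strip_h, SCREEN_H, SCREEN_W - HEADER)

print(rotated, manual)
```

Under these assumptions the recipe-rotated strip ends up 740 pixels long and the manually rotated one 800 pixels long, which is the difference described above.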

AustinTim · 07-26-2011, 05:46 PM

Starson,
I tried the code you referenced with the code for Arcamax comics and not only did it not work it somehow made it that the script did not even download the images...

any ideas of what might be wrong here?

Thanks,
-tim

Spoiler:

Code:

#!/usr/bin/env  python

__license__   = 'GPL v3'
__copyright__ = 'Copyright 2010 Starson17'
'''
www.arcamax.com
'''
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag

class Arcamax(BasicNewsRecipe):
    title               = 'ComicsArcamax'
    __author__          = 'TDS'
    __version__         = '1.05'
    __date__            = '12 May 2011'
    description         = u'Family Friendly Comics - Customize for more days/comics: Defaults to 7 days, 25 comics - 20 general, 5 editorial.'
    category            = 'news, comics'
    language            = 'en'
    use_embedded_content= False
    no_stylesheets      = True
    remove_javascript   = True
    cover_url           = 'http://www.arcamax.com/images/pub/amuse/leftcol/zits.jpg'

    ####### USER PREFERENCES - SET COMICS AND NUMBER OF COMICS TO RETRIEVE ########
    num_comics_to_get = 1
    # CHOOSE COMIC STRIPS BELOW - REMOVE COMMENT '# ' FROM IN FRONT OF DESIRED STRIPS

    conversion_options = {'linearize_tables'  : True
                        , 'comment'           : description
                        , 'tags'              : category
                        , 'language'          : language
                        }

    keep_only_tags     = [dict(name='div', attrs={'class':['comics-header']}),
                                        dict(name='b', attrs={'class':['current']}),
                                        dict(name='article', attrs={'class':['comic']}),
                                        ]

    remove_tags = [dict(name='div', attrs={'id':['comicfull' ]}),
                               dict(name='div', attrs={'class':['calendar' ]}),
                               dict(name='a', attrs={'class':['author bio']}),
                               dict(name='a', attrs={'href':['/']}),
                               dict(name='a', attrs={'href':['/comics']}),
                               dict(name='nav', attrs={'class':['calendar-nav' ]}),
                               ]

    def parse_index(self):
        feeds = []
        for title, url in [
                            ######## COMICS - GENERAL ########
                            #(u"9 Chickweed Lane", u"http://www.arcamax.com/thefunnies/ninechickweedlane"),
                            #(u"Agnes", u"http://www.arcamax.com/thefunnies/agnes"),
                            #(u"Andy Capp", u"http://www.arcamax.com/thefunnies/andycapp"),
                            #(u"BC", u"http://www.arcamax.com/thefunnies/bc"),
                            (u"Baby Blues", u"http://www.arcamax.com/thefunnies/babyblues"),
                            #(u"Beetle Bailey", u"http://www.arcamax.com/thefunnies/beetlebailey"),
                            #(u"Blondie", u"http://www.arcamax.com/thefunnies/blondie"),
                            #(u"Boondocks", u"http://www.arcamax.com/thefunnies/boondocks"),
                            #(u"Cathy", u"http://www.arcamax.com/thefunnies/cathy"),
                            #(u"Daddys Home", u"http://www.arcamax.com/thefunnies/daddyshome"),
                            (u"Dilbert", u"http://www.arcamax.com/thefunnies/dilbert"),
                            #(u"Dinette Set", u"http://www.arcamax.com/thefunnies/thedinetteset"),
                            #(u"Dog Eat Doug", u"http://www.arcamax.com/thefunnies/dogeatdoug"),
                            (u"Doonesbury", u"http://www.arcamax.com/thefunnies/doonesbury"),
                            #(u"Dustin", u"http://www.arcamax.com/thefunnies/dustin"),
                            #(u"Family Circus", u"http://www.arcamax.com/thefunnies/familycircus"),
                            #(u"Garfield", u"http://www.arcamax.com/thefunnies/garfield"),
                            #(u"Get Fuzzy", u"http://www.arcamax.com/thefunnies/getfuzzy"),
                            #(u"Girls and Sports", u"http://www.arcamax.com/thefunnies/girlsandsports"),
                            #(u"Hagar the Horrible", u"http://www.arcamax.com/thefunnies/hagarthehorrible"),
                            #(u"Heathcliff", u"http://www.arcamax.com/thefunnies/heathcliff"),
                            #(u"Jerry King Cartoons", u"http://www.arcamax.com/thefunnies/humorcartoon"),
                            #(u"Luann", u"http://www.arcamax.com/thefunnies/luann"),
                            #(u"Momma", u"http://www.arcamax.com/thefunnies/momma"),
                            #(u"Mother Goose and Grimm", u"http://www.arcamax.com/thefunnies/mothergooseandgrimm"),
                            #(u"Mutts", u"http://www.arcamax.com/thefunnies/mutts"),
                            #(u"Non Sequitur", u"http://www.arcamax.com/thefunnies/nonsequitur"),
                            (u"Pearls Before Swine", u"http://www.arcamax.com/thefunnies/pearlsbeforeswine"),
                            #(u"Pickles", u"http://www.arcamax.com/thefunnies/pickles"),
                            #(u"Red and Rover", u"http://www.arcamax.com/thefunnies/redandrover"),
                            #(u"Rubes", u"http://www.arcamax.com/thefunnies/rubes"),
                            #(u"Rugrats", u"http://www.arcamax.com/thefunnies/rugrats"),
                            (u"Speed Bump", u"http://www.arcamax.com/thefunnies/speedbump"),
                            #(u"Wizard of Id", u"http://www.arcamax.com/thefunnies/wizardofid"),
                            (u"Zits", u"http://www.arcamax.com/thefunnies/zits"),
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds

    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        pages = range(1, self.num_comics_to_get+1)
        for page in pages:
            page_soup = self.index_to_soup(url)
            if page_soup:
                title = self.tag_to_string(page_soup.find(name='div', attrs={'class':'comics-header'}).h1.contents[0])
                page_url = url
                # orig prev_page_url = 'http://www.arcamax.com' + page_soup.find('a', attrs={'class':'prev'}, text='Previous').parent['href']
                prev_page_url = 'http://www.arcamax.com' + page_soup.find('span', text='Previous').parent.parent['href']
                date = self.tag_to_string(page_soup.find(name='b', attrs={'class':['current']}))
            current_articles.append({'title': title, 'url': page_url, 'description':'', 'date': date})
            url = prev_page_url
        current_articles.reverse()
        return current_articles

    def preprocess_html(self, soup):
        for img_tag in soup.findAll('img'):
            parent_tag = img_tag.parent
            if parent_tag.name == 'a':
                new_tag = Tag(soup,'p')
                new_tag.insert(0,img_tag)
                parent_tag.replaceWith(new_tag)
            elif parent_tag.name == 'p':
                if not self.tag_to_string(parent_tag) == '':
                    new_div = Tag(soup,'div')
                    new_tag = Tag(soup,'p')
                    new_tag.insert(0,img_tag)
                    parent_tag.replaceWith(new_div)
                    new_div.insert(0,new_tag)
                    new_div.insert(1,parent_tag)
        return soup
		
    def postprocess_html(self, soup, first):
       # process all the images. assumes that the new html has the correct path
        for tag in soup.findAll(lambda tag: tag.name.lower()=='img' and tag.has_key('src')):
            iurl = tag['src']
            img = Image()
            img.open(iurl)
            width, height = img.size
            print 'img is: ', iurl, 'width is: ', width, 'height is: ', height 
            if img < 0:
                raise RuntimeError('Out of memory')
            pw = PixelWand()
            if( width > height ) :
                print 'Rotate image'
                img.rotate(pw, -90)
                img.save(iurl)
        return soup
		
    extra_css = '''
                    h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
                    h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
                    img {max-width:100%; min-width:100%;}
                    p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
		'''

BRGriff · 07-26-2011, 06:13 PM

Austin Tim,

You need to import the Pixel Wand by adding:

from calibre.utils.magick import Image, PixelWand

This goes alongside the other two import lines (BasicNewsRecipe and BeautifulSoup) at the top of the recipe.

See if that helps.

Starson17 · 07-27-2011, 02:08 PM

Quote:

Originally Posted by BRGriff

You need to import the Pixel Wand by adding:
from calibre.utils.magick import Image, PixelWand

That should fix it. It's the first thing in the post I linked to.

Purple Lady · 12-30-2011, 07:15 PM

@Starson17, for GoComics I would like to remove the entire line with the comic name, date, and author as well as the line that has "This article was downloaded by calibre from..". I originally tried

Code:

remove_tags  =  [dict(name='h1')]

to remove the first line but that wouldn't allow the comic to be retrieved at all, lol. So I removed it by removing the entire h1 in preprocess_html after the data was extracted from it by adding the code in bold

Code:

    def preprocess_html(self, soup):
        if soup.title:
            title_string = soup.title.string.strip()
            _cd = title_string.split(',',1)[1]
            comic_date = ' '.join(_cd.split(' ', 4)[0:-1])
        if soup.h1.span:
            artist = soup.h1.span.string
            soup.h1.span.string.replaceWith(comic_date + artist)
        feature_item = soup.find('p',attrs={'class':'feature_item'})
        for h1 in soup.findAll('h1'):
            h1.extract()
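
For anyone puzzling over the date-slicing lines in that excerpt, here is a worked example in plain Python. The title string is invented for illustration; the real GoComics page title may be formatted differently, in which case the slices would pick out different pieces.

```python
# Hypothetical page title; the real GoComics format may differ.
title_string = "Kit 'N' Carlyle Comic Strip, June 01, 2011 on GoComics.com".strip()

# Everything after the first comma: " June 01, 2011 on GoComics.com"
_cd = title_string.split(',', 1)[1]

# Split on the first four spaces, then drop the last piece ("on GoComics.com").
pieces = _cd.split(' ', 4)
comic_date = ' '.join(pieces[0:-1])

print(pieces)
print(repr(comic_date))
```

One thing to watch in the excerpt: `comic_date` is only assigned when `soup.title` exists, so a page without a title but with an h1 span would raise a NameError at `comic_date + artist`.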

I cannot figure out how to get rid of the line that has "This article was downloaded by calibre from..". Can you help?

I need to be able to make the comic as large as possible so I can read it, but there is one more problem - when I put my Sony 950 in landscape mode it makes it into two columns. Is this a problem with the Sony, or does the recipe make it this way? I noticed that with my news feed it also does two columns, but it keeps one column for a book.
