#1 |
Connoisseur
Posts: 58
Karma: 12
Join Date: May 2011
Location: Deland, Florida
Device: Kindle 3
|
Remove author tag from comics
I have worked diligently to resolve this issue myself. I have done lots of searching on this site and have made many improvements to the recipe I am trying to write. I have learned to rotate images and to remove every tag except one, so that I obtain ONLY the image file and nothing more. The tag I cannot seem to remove is the name of the author.
For Arcamax comics, the original HTML code for the strip is:
<a href="/thefunnies/andycapp/bio" class="author bio" rel="/thefunnies/andycapp/bio?ajax" title="Reginald Smythe">Reginald Smythe</a>
For GoComics, the original HTML code is:
<h1 ><a href="/kitandcarlyle/2011/06/01">Kit 'N' Carlyle</a><span> by Larry Wright</span></h1>
The tag or information I want to remove is highlighted in "red". I have tried in every way I know to remove the tag.
Here is the reason it matters: I am getting older and my eyesight is not what it used to be. That is the primary reason for getting an e-reader in the first place, so I can enlarge the text and read with more comfort. But comics are coming out too small to read on my Kindle 3! My thought is that if I remove as much information as possible, I have more screen area to see the comic. I have added extra CSS code to enlarge the image, within the limits of the Kindle 3, so it does not default to "fit to screen", but the author's name is interfering with getting the most out of that.
I have tried "debugging" the process from input to processed but can find nothing I am able to change there. I have tried "Inspect" in the Calibre Viewer and it shows the link as a "span" with a "class" equal to "underline". I might also note that the author's name comes out VERY differently in the conversion process: in Arcamax, the author's name is in small type inline with the image. In GoComics, the author's name is very large and appears atop the comic.
I will be most appreciative of any and all help with this. I may be getting old, but there is still enough kid left in me to want to read my funnies every day.
Thank you,
#2 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
For Arcamax comics, you can try:
Code:
remove_tags = [dict(name='a', attrs={'class':'author bio'})]
For GoComics, I'll make some suggestions. Here is where you go to see how to do this. It's the BeautifulSoup documentation. You can try removing all span tags. That's probably too aggressive. You can try removing the first span tag in each h1 tag. Usually preprocess_html is used.
Code:
def preprocess_html(self, soup):
    # Remove the first span (the " by Author" text) inside each h1
    for h1 in soup.findAll('h1'):
        span = h1.find('span')
        if span:
            span.extract()
    # calibre expects preprocess_html to return the soup
    return soup
Last, you can try using regular expressions in your remove_tags. Remove any span tag that has the " by " in it. Here's some running code you can look over that hunts around in the soup for break tags and removes them based on attributes and the existence of sibling tags.
Code:
def preprocess_html(self, soup):
    # Remove a br only when it sits between two other br tags
    for br in soup.findAll('br'):
        prev = br.findPreviousSibling(True)
        if hasattr(prev, 'name') and prev.name == 'br':
            next = br.findNextSibling(True)
            if hasattr(next, 'name') and next.name == 'br':
                br.extract()
    # calibre expects preprocess_html to return the soup
    return soup
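The regular-expression idea for the GoComics byline can also be tried on the raw HTML rather than inside remove_tags. What follows is only a minimal sketch, not code from any existing recipe: it runs plain Python `re` on the sample `<h1>` from the first post, with the rule written as a (compiled pattern, replacement function) pair, which is the shape calibre's `preprocess_regexps` recipe attribute takes. The helper `apply_regexps` is hypothetical and merely stands in for what calibre does with that list.

```python
import re

# Same shape as a recipe's preprocess_regexps list:
# (compiled pattern, function that receives the match and returns the replacement)
preprocess_regexps = [
    (re.compile(r'<span>\s*by\s[^<]*</span>', re.IGNORECASE), lambda match: ''),
]


def apply_regexps(raw_html, rules):
    """Hypothetical stand-in: apply each (pattern, function) rule in order."""
    for pattern, func in rules:
        raw_html = pattern.sub(func, raw_html)
    return raw_html


# Sample heading taken from the first post of the thread
sample = ('<h1 ><a href="/kitandcarlyle/2011/06/01">Kit \'N\' Carlyle</a>'
          '<span> by Larry Wright</span></h1>')

cleaned = apply_regexps(sample, preprocess_regexps)
print(cleaned)
```

Because the pattern requires the span text to start with "by", other span tags on the page should be left alone, which is less aggressive than removing every span.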
#3 |
Connoisseur
Posts: 58
Karma: 12
Join Date: May 2011
Location: Deland, Florida
Device: Kindle 3
|
Thank you Starson17! The Arcamax fix is working great and the comics are larger and much easier to read.
GoComics is having trouble with its servers in the aftermath of its merger with Comics.com. Therefore I am having difficulty testing changes to the recipe code. I did try the preprocess code you submitted above but it didn't work for me. Maybe I did something wrong. I placed the code in the recipe after the "articles = self.make_links(url)" subroutine and before the "def make_links(self, url):" subroutine.
I have also tried working with the "remove_tags": dict(name='h1', attrs={'span':['by']}), and also dict(name='span', attr={'by':['']}), neither of which worked.
It has occurred to me that I need to get rid of not only the author's name but also the comic's name, as shown in red below:
<h1 ><a href="/kitandcarlyle/2011/06/02">Kit 'N' Carlyle</a><span> by Larry Wright</span></h1>
What may be easier is that the URL appears elsewhere in the original site HTML without the comic strip name or the author's name. The HTML is shown below:
<div class="social-box">
<ul>
<li>
<form id="myspacepostto" method="post" action="http://www.myspace.com/index.cfm?fuseaction=postto" target="_blank">
<input type="hidden" name="u" value="http://www.gocomics.com/kitandcarlyle/2011/06/02"/>
</li>
</ul>
</div><!-- end div.social-box -->
I have edited out the extraneous HTML code. Once GoComics is up and running smoothly, I will try adding to "keep_only_tags" the code: dict(name='input', attrs={'u':['value']}). Do you think that might work?
I very much appreciate all your help and patience.
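One thing worth checking before trying that selector is which attributes the hidden input actually carries. The attrs dictionary in a tag selector maps attribute names to their values, so 'u' would be the value of the name attribute rather than an attribute in its own right, and a selector along the lines of dict(name='input', attrs={'name':'u'}) seems closer to the intent. The sketch below is only an illustration with Python's standard html.parser (not calibre or BeautifulSoup) run on the snippet quoted above; the class name `InputAttrLister` is made up for the example.

```python
from html.parser import HTMLParser


class InputAttrLister(HTMLParser):
    """Collect the attributes of every <input> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (attribute name, value) pairs
        if tag == 'input':
            self.inputs.append(dict(attrs))


# The social-box snippet quoted in the post, joined into one string
sample = (
    '<div class="social-box"><ul><li>'
    '<form id="myspacepostto" method="post" '
    'action="http://www.myspace.com/index.cfm?fuseaction=postto" target="_blank">'
    '<input type="hidden" name="u" '
    'value="http://www.gocomics.com/kitandcarlyle/2011/06/02"/>'
    '</li></ul></div>'
)

lister = InputAttrLister()
lister.feed(sample)
lister.close()

hidden = lister.inputs[0]
print(hidden)
```

Note also that a hidden input displays nothing by itself, so keeping only that tag would probably not keep the comic image; the value is more likely useful as a source for the strip's URL.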
#4 |
Connoisseur
Posts: 58
Karma: 12
Join Date: May 2011
Location: Deland, Florida
Device: Kindle 3
|
Progress report
Just an update: I have been able to remove the author's name on GoComics with the simple "remove_tags = [dict(name='span')]".
As to removing the comic's name, i.e. "Kit 'N' Carlyle", it is "link text" and I am still working on that, with the same testing problem since GoComics is still having difficulties with its server after the merger. I am reading up on HTML parsing and regex but have yet to find the answer.
At present, both Arcamax and GoComics ARE larger on my Kindle 3 and much easier to read. So something good has come of all this, but I really do want to maximize the image size further and will keep working on the solution. The final part of my effort will have to be addressed in the "Conversion" forum, as the TOC text at the top of each page is too large and takes up valuable screen area.
I welcome any input anyone has to offer.
#5 | ||
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Code:
remove_tags = [dict(name='h1')]
The span is inside the h1 tag, so you can remove both with the above. The only possible issue is if there are other h1 elements you don't want removed.
You are following this more closely than I am, so let me know when the server seems to stabilize. I'll try to grab some time to fix anything in the recipe that needs fixing if no one else does it first.
Quote:
|
||
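If other h1 elements do turn out to matter, one way to narrow the removal is to drop only the h1 that contains the " by " span. The following is a rough sketch with plain Python `re` on raw HTML, not tested against the live site; the "Site banner" heading and the strip.gif image are invented for the example, and `strip_comic_heading` is a hypothetical helper name.

```python
import re

# Match one <h1>...</h1> that contains a <span> starting with "by".
# The (?:(?!</h1>).)*? parts keep the match from running past the
# end of the h1 it started in.
H1_WITH_BYLINE = re.compile(
    r'<h1[^>]*>(?:(?!</h1>).)*?<span>\s*by\s(?:(?!</h1>).)*?</span>\s*</h1>',
    re.DOTALL | re.IGNORECASE,
)


def strip_comic_heading(raw_html):
    """Remove only the heading that carries the comic name and byline."""
    return H1_WITH_BYLINE.sub('', raw_html)


# Invented page fragment: one unrelated h1, then the comic heading
page = (
    '<h1>Site banner</h1>'
    '<h1 ><a href="/kitandcarlyle/2011/06/02">Kit \'N\' Carlyle</a>'
    '<span> by Larry Wright</span></h1>'
    '<img src="strip.gif"/>'
)

result = strip_comic_heading(page)
print(result)
```

The unrelated heading and the image survive, while the comic name and author go together, which is the same effect as removing the h1 but limited to the one heading.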
#6 | |
Connoisseur
Posts: 58
Karma: 12
Join Date: May 2011
Location: Deland, Florida
Device: Kindle 3
|
The element in the <h1> tag I want to keep is the link to the comic strip:
Quote:
As far as GoComics is concerned, it is now fairly stable and getting better by the day. I am now able to download all my comics from both the former Comics.com and GoComics with just the GoComics recipe and the new feed list. There are still some problems with editorial comics not downloading. I do not know if they have been removed or simply not put online yet. As to the general comics, all 25 of the ones I follow are coming through without a hitch.
As to the size of the image, I have never been able to get the recipe "comic_size=" to work. This may be because I am converting to .mobi. It did not work for me even before the merger. I adjust the size via CSS using pixels instead of percentages. Another way of changing the image size is to get the "zoom" image from GoComics. Since my present method of dealing with the image size is working, I have not messed around with the coding to obtain the zoomed image.
I use a different recipe from the one you have written. Since I have a Kindle 3 with a screen of only 600x800, it is important to ME that I maximize the space for the image and remove as much extraneous data as possible. Therefore, I strip the "Banner", the comic strip "Alink" and the "Author's Name". This leaves me with only the jpeg image and nothing more.
Thanks for your input and continuing help.
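For readers wondering what sizing "via CSS using pixels" might look like in a recipe, here is a small sketch. The actual rules used by the poster are not shown in the thread, so the selector, the 600-pixel width (taken from the 600x800 screen mentioned above) and the helper name `comic_css` are all assumptions for illustration; the resulting string is the kind of value a recipe's extra_css attribute would hold.

```python
# Screen width of the Kindle 3 in portrait, as given in the post (600x800).
SCREEN_WIDTH_PX = 600


def comic_css(width_px):
    """Build a CSS rule that pins comic images to a fixed pixel width."""
    return 'img {width: %dpx; height: auto;}' % width_px


# In a recipe, this string would be assigned to the extra_css attribute.
extra_css = comic_css(SCREEN_WIDTH_PX)
print(extra_css)
```

Whether a given reader honours the rule depends on the output format and device, so it is worth checking the result on the Kindle itself.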
|
#7 | |||||
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Quote:
Quote:
Quote:
Quote:
Last edited by Starson17; 06-07-2011 at 08:52 AM. |
|||||
#8 |
Member
Posts: 19
Karma: 10
Join Date: Feb 2011
Device: kindle 3
|
BRGrif,
in your original post you mention that you learned how to rotate images... did you mean that you do this in the recipe/processing? so that the 3-panel strips come out using the long dimension of the kindle? thanks, -tim |
#9 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
https://www.mobileread.com/forums/sho...7&postcount=11 |
|
#10 | |
Connoisseur
Posts: 58
Karma: 12
Join Date: May 2011
Location: Deland, Florida
Device: Kindle 3
|
Austin Tim,
Quote:
|
|
#11 |
Member
Posts: 19
Karma: 10
Join Date: Feb 2011
Device: kindle 3
|
rotating
Starson,
I tried the code you referenced with the code for Arcamax comics and not only did it not work it somehow made it that the script did not even download the images... any ideas of what might be wrong here? Thanks, -tim Spoiler:
|
#12 |
Connoisseur
Posts: 58
Karma: 12
Join Date: May 2011
Location: Deland, Florida
Device: Kindle 3
|
Austin Tim,
You need to import PixelWand by adding:
from calibre.utils.magick import Image, PixelWand
This goes along with the other two import statements, for BasicNewsRecipe and BeautifulSoup. See if that helps.
#13 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
|
#14 |
Grand Sorcerer
Posts: 5,698
Karma: 16542228
Join Date: Feb 2010
Location: Pennsylvania
Device: Huawei MediaPad M5, LG V30, Boyue T80S, Nexus 7 LTE, K3 3G, Fire HD8
|
@Starson17, for GoComics I would like to remove the entire line with the comic name, date, and author as well as the line that has "This article was downloaded by calibre from..". I originally tried
Code:
remove_tags = [dict(name='h1')]
Code:
def preprocess_html(self, soup):
    # Pull the comic date out of the page title
    if soup.title:
        title_string = soup.title.string.strip()
        _cd = title_string.split(',',1)[1]
        comic_date = ' '.join(_cd.split(' ', 4)[0:-1])
    # Put the date in front of the artist name in the heading
    if soup.h1.span:
        artist = soup.h1.span.string
        soup.h1.span.string.replaceWith(comic_date + artist)
    feature_item = soup.find('p',attrs={'class':'feature_item'})
    # Remove every h1, including the heading rewritten above
    for h1 in soup.findAll('h1'):
        h1.extract()
    # calibre expects preprocess_html to return the soup
    return soup
I need to be able to make the comic as large as possible so I can read it, but there is one more problem: when I put my Sony 950 in landscape mode, it splits the page into two columns. Is this a problem with the Sony, or does the recipe make it this way? I noticed that my news feed also comes out in two columns, but a book stays in one column.
Last edited by Purple Lady; 12-30-2011 at 07:18 PM.