Custom recipes (archive, read-only) - Page 169

Starson17 · 08-25-2010, 01:52 PM

Quote:

Originally Posted by naisren

Thanks for your help and sorry for my confusing expression.

That's OK. I looked at your site and ran your recipe (thank you for using code tags - you may also want to add spoiler tags to reduce the length).

I now understand your problem. The site has bad html. The page you are trying to parse to get feeds is seen as a giant NavigableString inside a single tag. There are no other tags within it as far as BeautifulSoup is concerned. I don't know exactly why, but I suspect it isn't solely due to the fact that it is using the " />" format to immediately close div tags, then trying to close them again with the normal </div>, so there are two closings (another bit of bad html.)

Whatever is going on is confusing Beautiful Soup to the extent that it can't find anything except the first surrounding tag. It should still be possible to extract feeds, but it will require much trickier programming to get links out of the giant string which is soup.contents[0].string. You will need to treat it as a string, then extract from the string, instead of trying to find tags within that string (although you may be able to use BS to convert it into a tag-based structure with some trickery).

It's an interesting problem, and I regret that I don't have time now to attack it. If you solve it, post your solution.
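A rough sketch of the treat-it-as-a-string approach described above: pull the href values out of the raw text with a regular expression instead of using tag searches. The pattern and the function name extract_links are illustrative assumptions and have not been run against the site in question.

```python
import re

# Matches the href value of an <a ...> tag in raw text.
# Assumption: the links are quoted with either single or double quotes.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def extract_links(raw_html):
    """Return every <a href="..."> target found in a plain string."""
    return HREF_RE.findall(raw_html)
```

In a recipe, the string to pass in would be soup.contents[0].string, and the returned URLs could then be used to build the feed list by hand.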

stewie1 · 08-25-2010, 04:08 PM

The Financial Times recipe that's currently posted isn't the complete print edition. I'm a subscriber and trying to put something together that will allow me to get the day's print edition (http://www.ft.com/us-edition). Unfortunately, there is no RSS feed for this.

Can anyone help, either by putting a recipe together or by directing me to a template I might be able to use to give it a shot myself? (Note that I am a complete novice at this.)

Thanks.

TonytheBookworm · 08-25-2010, 10:36 PM

Starson17, I thought I was getting the knack of this, but apparently not. If it is an indent problem then hit me. I don't think it is, though, but I can't seem to understand why the print is not being appended to the url. I looked at the newyorker as an example and for the most part I use the same print_version()

my code

Spoiler:

I also am using the test you gave me of
ebook-convert tmz.recipe output_dir --test -vv > myrecipe.txt

I don't really see any errors in there and I also don't see the print_url is: ... either

Sorry to bug you I will get it one of these days.

kerrware · 08-26-2010, 03:33 AM

Quote:

Originally Posted by Starson17

If you are seeing the article content stored locally (when running ebook-convert), and you can click through from the initial index.html to the index.html files in the folders to see that content, then I see no reason why you should have problems converting the html structure, with article content, to an EPUB. Where is the problem occurring? I'd check it for you, but have no username/password for the site.

Thanks for the response. I tried the "Add Book" function in Calibre using the debug output from ebook-convert and then converted it to an epub book and got the same result - no article content. I guess this means that the epub conversion process can't handle the html from the site for some reason.

The only thing I can think of is to try and introduce some "remove_tags" type code to try and simplify the html so it can be converted. This could take some time (I'm not that familiar with html or python code). Any suggestions as to what I can and can't remove?

Thanks.

Starson17 · 08-26-2010, 07:51 AM

Quote:

Originally Posted by kerrware

I tried the "Add Book" function in Calibre using the debug output from ebook-convert and then converted it to an epub book and got the same result - no article content. I guess this means that the epub conversion process can't handle the html from the site for some reason.

Send me your username/password by PM and I'll take a look.

Starson17 · 08-26-2010, 07:58 AM

Quote:

Originally Posted by TonytheBookworm

Starson17, I thought I was getting the knack of this, but apparently not. If it is an indent problem then hit me. I don't think it is, though, but I can't seem to understand why the print is not being appended to the url. I looked at the newyorker as an example and for the most part I use the same print_version()

my code

Spoiler:

I also am using the test you gave me of
ebook-convert tmz.recipe output_dir --test -vv > myrecipe.txt

I don't really see any errors in there and I also don't see the print_url is: ... either

Sorry to bug you I will get it one of these days.

I ran your code. It puts these in the output file:

Code:

print_url is:  http://www.tmz.com/2010/08/18/tmz-seinfeld-uncle-leo-crank-call-police-burbank-kim-kardashian-facebook/print
print_url is:  http://www.tmz.com/2010/08/25/mel-gibson-extortion-case-oksana-grigorieva-domestic-violence-sheriff-district-attorney-investigation/print

AFAICT, it's working perfectly. I even tested one of those links, and it works, too. What do you think is busted?
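For reference, a minimal sketch of a print_version() that would produce log lines like the ones above. It assumes the print page is simply the article URL with /print appended, which is what the output shows; the body is a guess, not the code from the spoiler. In a recipe this would be a method that also takes self.

```python
def print_version(url):
    """Build the print-page URL by appending /print to the article URL."""
    # Strip any trailing slash first so we never end up with '//print'.
    print_url = url.rstrip('/') + '/print'
    # Log the result so it shows up in the -vv output file.
    print('print_url is: ', print_url)
    return print_url
```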

cisaak · 08-26-2010, 09:11 AM

Quote:

Originally Posted by Starson17

Not without a Kindle (anyone want to send me one?) as I'm not sure where in the recipe the Kindle is picking up the masthead.

However, the masthead is only used in a few places in an EPUB. Open the EPUB, find the masthead and change the css file to modify its properties, then convert the EPUB to whatever format Kindle uses and see if that fixes it. If so, modify the extra_css in your recipe to make the same change.

If you have a problem understanding this, take it a step at a time, and let me know which step you have trouble with.

I used calibre to convert my recipe to EPUB form. I do not know how to "open" the EPUB and look for the masthead. I checked the Debug section of the calibre conversion and found the css in the input directory. The term "masthead" is not mentioned. What is my next step?

Starson17 · 08-26-2010, 10:15 AM

Quote:

Originally Posted by naisren

My recipe is

Spoiler:

OK, your problem was so interesting, I couldn't resist looking at it further. Your problem is the bad html code in your source page http://www.51voa.com/. The closing angle bracket of each opening tag is ' />', instead of ' >'. That results in each tag being closed twice. Beautiful Soup is confused and sees the entire page as a single [document] element having a single NavigableString of text, not as multiple tags, so none of the tag-based searches or manipulation commands will work. There are no tags for BeautifulSoup to find or work with.

To fix this, you first grab that page (as you have already done in your code):

Code:

        soup = self.index_to_soup('http://www.51voa.com/')

Then, grab the string that is in the contents of the big single [document] element and search and replace the bad closing brackets as follows:

Code:

        rawc = soup.contents[0].string.replace(' />', ' >')

Now it's fixed, but it's still text, so you next convert the string back into a BeautifulSoup object:

Code:

        soup = BeautifulSoup(rawc, fromEncoding=self.encoding)

(Also, add

Code:

from calibre.ebooks.BeautifulSoup import BeautifulSoup

to your recipe)

That's it! Put the two extra lines above after your first index_to_soup line. Be aware that any legitimate single element tags, such as <img>, <br> etc. will get mangled with the simple search and replace above. You may have to special case any tags that are allowed to have a closing slash inside the opening tag so they don't get mangled.

Edit: I forgot, you also need this line:
encoding = 'utf-8'
or else the final step will fail.
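Put together with the special-casing mentioned above, the string clean-up might look like the sketch below. It works on a plain string so it can be tried outside calibre. The VOID_TAGS list and the function name fix_self_closed_tags are assumptions for illustration; extend the list to whatever single-element tags the page really uses.

```python
import re

# Tags that may legitimately close themselves; this list is an assumption.
VOID_TAGS = {'img', 'br', 'hr', 'input', 'meta', 'link'}

# An opening tag ending in " />": group 1 is the tag name, group 2 its attributes.
_SELF_CLOSED = re.compile(r'<([A-Za-z][A-Za-z0-9]*)((?:\s[^<>]*?)?)\s/>')

def fix_self_closed_tags(raw):
    """Turn '<div ... />' into '<div ... >' but leave tags like <br /> alone."""
    def repl(match):
        name, attrs = match.group(1), match.group(2)
        if name.lower() in VOID_TAGS:
            return match.group(0)  # legitimate single-element tag, keep as-is
        return '<%s%s >' % (name, attrs)
    return _SELF_CLOSED.sub(repl, raw)
```

In the recipe, soup.contents[0].string would be passed through this function in place of the plain .replace(' />', ' >') call, and the result handed to BeautifulSoup as before.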

Starson17 · 08-26-2010, 10:21 AM

Quote:

Originally Posted by cisaak

I used calibre to convert my recipe to EPUB form. I do not know how to "open" the EPUB and look for the masthead.

Change the file extension from .epub to .zip and unpack it into a directory. You will find the css file there and will be able to open the unpacked book with your browser.
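The same thing can be done from Python without renaming anything, since an EPUB is a zip archive. A small sketch using the standard zipfile module; the function name find_css_files is hypothetical.

```python
import zipfile

def find_css_files(epub_path):
    """List the stylesheet entries inside an EPUB (which is a zip archive)."""
    with zipfile.ZipFile(epub_path) as book:
        return [name for name in book.namelist()
                if name.lower().endswith('.css')]
```

Calling book.extractall(some_directory) inside the same with block would unpack the whole book so it can be opened in a browser.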

Starson17 · 08-26-2010, 11:38 AM

Quote:

Originally Posted by kerrware

The only thing I can think of is to try and introduce some "remove_tags" type code to try and simplify the html so it can be converted. This could take some time (I'm not that familiar with html or python code). Any suggestions as to what I can and can't remove?

Try this:

Code:

    keep_only_tags = [dict(name='div', attrs={'id':['ds-headline','viewarticle']})]

You may find some other items you want to keep (use FireFox/FireBug to find them), but you're right, there's something in there that's messing up the conversion.

TonytheBookworm · 08-26-2010, 01:13 PM

Quote:

AFAICT, it's working perfectly. I even tested one of those links, and it works, too. What do you think is busted?

Notice in the attached screenshot that up top I have Home: Michael Bay.... I know I can remove the tags to get rid of that. Also, down at the bottom I have tags: Michael Bay ... etc.

The reason I suspected that it is not hitting the print_url is that when I navigate to the page manually, I do not see any of those tags at all (as in, if I look at the original html of the /print version). However, I do notice the tags if I do not append /print. Any thoughts?

Notice that when you go to http://www.tmz.com/2010/08/24/michae...r-pistol-whip/
you will get the div tag of <breadcrumbs> that I wish to remove.

But when you go to http://www.tmz.com/2010/08/24/michae...tol-whip/print
you do not get the div tag of <breadcrumbs>. Maybe I've got an indentation problem going on here that just isn't obvious to me, but for sure something is going on.

I'm a dumb_____. For whatever reason, the recipe that I posted to you is the correct version. Geany didn't save it, yet I thought it had. I went and opened it back up after rebooting my computer today and noticed I still had the old version, haha. Live and learn. Sorry for the trouble.

Starson17 · 08-26-2010, 02:21 PM

Quote:

Originally Posted by TonytheBookworm

I'm a dumb_____. For whatever reason, the recipe that I posted to you is the correct version. Geany didn't save it, yet I thought it had. I went and opened it back up after rebooting my computer today and noticed I still had the old version, haha. Live and learn. Sorry for the trouble.

No sweat - I've done the same thing a few times myself.

TonytheBookworm · 08-26-2010, 03:05 PM

Here is the tmz.

Ebookerr · 08-26-2010, 05:45 PM

The New England Journal of Medicine changed their format and the current Calibre version (0.7.15) recipe does not work.

Ebookerr

TonytheBookworm · 08-26-2010, 09:34 PM

Starson, I know I asked before, but I'm still not clear on how to do it.

If I have, for instance, the following link (example only)
http://www.testpage.com/doi/abs/10.1...495?ai=rv&af=R
but I need it to be the following
http://www.testpage.com/doi/full/10....viewType=Print

how would I do this?
I figure it would be some form of search and replace like I've seen in Beautiful Soup, but again I'm kinda clueless (any examples of this in action?).

The abs needs to be replaced with full, and then the ?view part simply appended, yet I'm not sure how this is done.

thanks again. I hope to learn something new.

Maybe something like this?

Spoiler:

Code:

def preprocess_html(soup)
    for abs in soup.findall('abs)
        newtext ='full'
        abs.replaceWith(newtext)

The above is just a non-working guess.
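One way to do the rewrite described above is plain string manipulation on the URL rather than a Beautiful Soup search. This is only a sketch under assumptions: that the existing query string (?ai=rv&af=R) should be dropped, and that the part to append is exactly ?viewType=Print (the example URLs are truncated, so neither is certain). In a recipe this would be the print_version method and take self as well.

```python
def print_version(url):
    """Rewrite an abstract URL into the (assumed) full-text print URL."""
    # Drop any existing query string (assumption: it is not needed).
    base = url.split('?', 1)[0]
    # Swap the first /doi/abs/ path segment for /doi/full/.
    base = base.replace('/doi/abs/', '/doi/full/', 1)
    # Append the print view parameter (assumed name and value).
    return base + '?viewType=Print'
```

For a made-up link such as http://www.testpage.com/doi/abs/10.1000/example?ai=rv&af=R (the DOI here is hypothetical), this returns http://www.testpage.com/doi/full/10.1000/example?viewType=Print.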
