Trouble with internal links

marumari · 04-09-2015, 11:20 AM

Hey there! I'm having trouble getting internal links working with a BasicNewsRecipe. I'm creating a recipe for The Codeless Code. Almost everything works, except for internal links. Basically, it is grabbing every article that has a URL that resembles this:

http://thecodelesscode.com/case/171

And adding that to the feed. All the way from 1 up to 184, and putting them into a nice book. In preprocess_html, I am stripping out all of the links, except the ones that begin with /case/. However, no matter what I do, I can't seem to get those links within the book to work at all.

If I leave them alone (ie, href="/case/152"), it generates this error:
Referenced file u'/case/152' not found

If I change it to the full URI (ie, http://thecodelesscode.com/case/152), then it works fine, but it leaves a hyperlink to the website, not to the chapter inside the ebook.

If I change it to a relative URI (ie, href="152") it will just say that it can't find u'152'.

Is there a trick to what I'm trying to do? Or is the BasicNewsRecipe just not intended for this sort of thing?

Thanks!

kovidgoyal · 04-09-2015, 11:31 AM

Downloaded articles are named according to a particular scheme, as feed_n/article_n/index.html

You need to convert your internal links to refer to those names. There is no easy way to do that, since the recipe download system is not designed for it. Essentially, you need to override create_opf() in your recipe class to store a mapping of article.orig_url -> filename

Then implement postprocess_book() to replace the links in the downloaded articles using the previously stored mapping.
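
To make the mapping idea concrete, here is a minimal sketch in plain Python that runs outside calibre. It assumes feeds and articles are numbered from zero in download order, and `build_link_mapping` and the `feeds` list are hypothetical stand-ins for whatever the recipe has available inside create_opf(), not calibre APIs.

```python
def build_link_mapping(feeds):
    """Map each article's original URL to its downloaded file name.

    `feeds` is a hypothetical stand-in: a list of feeds, each one a list
    of article URLs in download order. Zero-based numbering is assumed.
    """
    mapping = {}
    for feed_index, article_urls in enumerate(feeds):
        for article_index, url in enumerate(article_urls):
            # Follows the feed_n/article_n/index.html naming scheme.
            mapping[url] = 'feed_%d/article_%d/index.html' % (
                feed_index, article_index)
    return mapping


feeds = [[
    'http://thecodelesscode.com/case/1',
    'http://thecodelesscode.com/case/2',
]]
mapping = build_link_mapping(feeds)
print(mapping['http://thecodelesscode.com/case/2'])
```

In a real recipe the keys would have to match whatever form the hrefs take in the article HTML (for example /case/2 rather than the full URL).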

marumari · 04-09-2015, 11:34 AM

Awesome, thank you.

marumari · 04-09-2015, 02:32 PM

Okay, so I thought I had everything entirely figured out. I've generated the proper mappings in create_opf without any issue.

And in postprocess_book, I can even find every HTML file and fix the hrefs, for example:

Code:

    def postprocess_book(self, oeb, opts, log):
      # bs is assumed to be the recipe's BeautifulSoup import
      # Paths of the downloaded articles, relative to feed_0/
      output_files = list(self.path_remappings.values())

      for output in output_files:
        path = self.output_dir + '/feed_0/' + output

        # Load the HTML file in
        with open(path) as f:
          soup = bs(f)

        # Replace all the anchors (anchors without an href are skipped)
        for anchor in soup.findAll('a'):
          href = anchor.get('href', '')
          if '/case/' in href and href in self.path_remappings:
            anchor['href'] = '../' + self.path_remappings[href]

        # Write it back out; the with block closes the file
        with open(path, "wb") as f:
          f.write(unicode(soup).encode('utf-8'))

Looking at the Soup, I see that the href went from:
their <a href="/case/174">newly appointed</a> master-in-training Zjing decided that they should work in separate shifts -- Landhwa by day, Wangohan by night.</p>

To:
their <a href="../article_5/index.html">newly appointed</a> master-in-training Zjing decided that they should work in separate shifts -- Landhwa by day, Wangohan by night.</p>

However, in the very final file (article_5/index_u1.html), it ends up like this:
their <a href="../..//case/174">newly appointed</a> master-in-training Zjing decided that they should work in separate shifts -- Landhwa by day, Wangohan by night.</p>

Am I going about this the wrong way, by messing with the HTML files in the output directory? Should I instead be mucking around with some internal structure in oeb?

marumari · 04-09-2015, 05:30 PM

Never mind, I think I figured out how it's internally represented in memory. I'll post the final recipe when I'm all done.

kovidgoyal · 04-09-2015, 08:31 PM

You want to work with the oeb object, like this:

Code:

for item in oeb.spine:
   for a in item.data.xpath('//*[local-name()="a" and @href]'):
       href = a.get('href')
       if href in mapping:  # leave links that are not in the mapping alone
           a.set('href', mapping[href])
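
For anyone who wants to try the link rewrite outside calibre, here is a rough standalone sketch. It swaps in the standard library's ElementTree for the trees held by the oeb object (which look like lxml trees, given the .xpath call), so the namespace is spelled out instead of using local-name(). The `rewrite_links` helper, the sample page and the mapping are all made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = 'http://www.w3.org/1999/xhtml'


def rewrite_links(root, mapping):
    """Point every <a href> that appears in `mapping` at its new target.

    Returns the number of links rewritten; unmapped links are untouched.
    """
    changed = 0
    for a in root.iter('{%s}a' % XHTML_NS):
        href = a.get('href')
        if href in mapping:
            a.set('href', mapping[href])
            changed += 1
    return changed


# Hypothetical sample page with one internal and one external link.
page = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>'
    '<a href="/case/174">newly appointed</a> '
    '<a href="http://example.com/">elsewhere</a>'
    '</p></body></html>'
)
mapping = {'/case/174': '../article_5/index.html'}  # hypothetical mapping
count = rewrite_links(page, mapping)
hrefs = [a.get('href') for a in page.iter('{%s}a' % XHTML_NS)]
print(count, hrefs)
```

Checking membership before calling set() matters because a page can also contain links that were never part of the mapping, and a plain mapping[href] lookup would raise KeyError on those.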

marumari · 04-10-2015, 12:52 AM

Finished! It generates very nice epub files, and pretty darned nice mobi files:

https://github.com/marumari/codeless...esscode.recipe

Thanks again for your help in pointing me in the right direction.

marumari · 04-10-2015, 01:49 PM

I'm having a bit of trouble with having it automatically resize the images that it fetches. I've set:

scale_news_images = (600, 400)

But when I run ebook-convert, and go into feed_0, I see:

april@machine(feed_0)$ find . -name '*.jpg' -exec exiftool {} \; | grep Height
Image Height : 402
Image Height : 389
Image Height : 446
Image Height : 557
Image Height : 196
Image Height : 400
Image Height : 424

And so it didn't resize them at all. Is there something I'm missing here? Thanks!

PeterT · 04-10-2015, 02:21 PM

Are you setting compress_news_images to True?

From the documentation:

Quote:

compress_news_images = False
Set this to False to ignore all scaling and compression parameters and pass images through unmodified. If True and the other compression parameters are left at their default values, jpeg images will be scaled to fit in the screen dimensions set by the output profile and compressed to size at most (w * h)/16 where w x h are the scaled image dimensions.
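
Going by that description, the two settings would need to be set together. A minimal sketch of the relevant attributes follows; the class name is a placeholder, and in a real recipe these would sit on the BasicNewsRecipe subclass rather than on a plain class.

```python
class ImageSettings(object):
    # Placeholder class: in a real recipe these attributes would be set
    # on the BasicNewsRecipe subclass itself.

    # Without this, scaling and compression parameters are ignored.
    compress_news_images = True

    # Maximum (width, height) to scale images to, as used in this thread.
    scale_news_images = (600, 400)
```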

marumari · 04-10-2015, 02:35 PM

Derp! Thanks!

marumari · 04-10-2015, 06:23 PM

Okay, so I think I've gotten the recipe to pretty much a "final" state. Produces really nice EPUB and MOBI files now, without some of the superfluous stuff that comes with the BasicNewsRecipe. (ie, article listings, duplicate indexes, etc.)

How would I go about getting it included in the next version of Calibre?

Thanks!

kovidgoyal · 04-10-2015, 10:40 PM

You can send a pull request (put your recipe in the recipes folder)

kovidgoyal · 04-11-2015, 03:40 AM

You might want to update your recipe to take advantage of this

https://github.com/kovidgoyal/calibr...eb35357eefc698
