Lists getting changed in recipe processing

ptsefton · 05-18-2011, 11:03 PM

Hi all,

I'm new here - I had a look around but could not find anything on this problem.

I am working on recipes to scrape WordPress sites and I am running into problems with Calibre v0.8.1 changing the HTML format of pages.

For example, using this recipe: https://bitbucket.org/wwmm/schtml/sr...tsefton.recipe

With this command:

ebook-convert ptsefton.recipe .epub --debug-pipeline d --test

The recipe fetches the first page which has this code in it:

<ul><li><a href="#id2">Immediate future</a></li><li><a href="#id3">The future</a></li></ul>

I know that this code is still intact when postprocess_html returns the HTML, but in the debug output in the parsed directory it has changed to this:

<ul/><li/><a href="#id2">Immediate future</a><li/><a href="#id3">The future</a>

Does anyone have any idea why this would be happening?

Thanks,
Peter

kovidgoyal · 05-18-2011, 11:15 PM

Look for ASCII control codes in the raw HTML; they usually cause this sort of thing.
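
A minimal sketch of that check, assuming the raw HTML is already available as a Python string. The helper name `find_control_chars` and the sample string are made up for illustration and are not part of calibre:

```python
import re

# ASCII control characters, excluding tab (\x09), newline (\x0a) and
# carriage return (\x0d), which are legitimate in HTML source.
CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')


def find_control_chars(raw_html):
    """Return a list of (offset, code point) pairs, one per control character."""
    return [(m.start(), ord(m.group())) for m in CONTROL_CHARS.finditer(raw_html)]


# Hypothetical sample: a vertical tab (\x0b) hidden between two list items.
sample = '<ul><li>ok</li>\x0b<li>bad</li></ul>'
for offset, code in find_control_chars(sample):
    print('control character 0x%02x at offset %d' % (code, offset))
```

An empty result means control characters are probably not the cause and the problem lies elsewhere.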

ptsefton · 05-18-2011, 11:40 PM

Thanks @kovidgoyal for the prompt reply.

Turned out not to be control characters - I was returning only a div element instead of the whole page in the soup variable.

Solved.
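
For anyone hitting the same thing, here is a rough sketch of the fix described above. The actual recipe is not reproduced in this thread, so the class name, the `entry-content` selector and the `id` tweak are all assumptions made for illustration. The point is only that `postprocess_html` hands back the whole `soup` rather than a single element pulled out of it:

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so this sketch can be loaded outside calibre.
    class BasicNewsRecipe(object):
        pass


class ExampleRecipe(BasicNewsRecipe):
    title = 'Example'  # hypothetical recipe, not the one from this thread

    def postprocess_html(self, soup, first_fetch):
        # Hypothetical selector; the real recipe may look for something else.
        content = soup.find('div', attrs={'class': 'entry-content'})
        if content is not None:
            # Make any adjustments in place on the parsed tree.
            content['id'] = 'main-content'
        # Return the whole document. Returning `content` here hands back
        # a bare div fragment instead of the full page.
        return soup
```

Editing the tree in place and returning `soup` keeps the surrounding html and body elements intact for the later conversion stages.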



