#1
Junior Member
Posts: 4
Karma: 10
Join Date: May 2011
Device: none
Lists getting changed in recipe processing
Hi all,
I'm new here - I had a look around but could not find anything on this problem. I am working on recipes to scrape WordPress sites and I am running into problems with Calibre v0.8.1 changing the HTML format of pages.

For example, using this recipe: https://bitbucket.org/wwmm/schtml/sr...tsefton.recipe

With this command:
`ebook-convert ptsefton.recipe .epub --debug-pipeline d --test`

The recipe fetches the first page which has this code in it:
`<ul><li><a href="#id2">Immediate future</a></li><li><a href="#id3">The future</a></li></ul>`

I know that this code is still intact when postprocess_html returns the HTML, but in the debug output in the parsed directory it has changed to this:
`<ul/><li/><a href="#id2">Immediate future</a><li/><a href="#id3">The future</a>`

Does anyone have any idea why this would be happening?

Thanks,
Peter

#2
creator of calibre
Posts: 45,639
Karma: 28549046
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Look for ASCII control codes in the raw HTML; they usually cause this sort of thing.
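
One way to check for this, as a minimal sketch: scan the fetched markup for ASCII control characters other than tab, newline and carriage return. The helper names `find_control_chars` and `strip_control_chars` are hypothetical and not part of calibre; how you get hold of the raw HTML inside a recipe is left open here.

```python
import re

# ASCII control characters, excluding tab (\x09), newline (\x0a)
# and carriage return (\x0d), which are legitimate in HTML source.
CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')


def find_control_chars(raw_html):
    """Return (offset, code point) pairs for each control character found."""
    return [(m.start(), ord(m.group())) for m in CONTROL_CHARS.finditer(raw_html)]


def strip_control_chars(raw_html):
    """Return raw_html with the matched control characters removed."""
    return CONTROL_CHARS.sub('', raw_html)
```

If `find_control_chars` returns an empty list for the page in question, control characters are probably not the cause and the recipe's own HTML handling is the next place to look.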

#3
Junior Member
Posts: 4
Karma: 10
Join Date: May 2011
Device: none
Thanks @kovidgoyal for the prompt reply.
It turned out not to be control characters: I was returning only a div element instead of the whole page in the soup variable. Solved.
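
For anyone hitting the same thing, here is a rough sketch of the fix as I understand it, not the actual ptsefton recipe. It assumes the wanted content lives in a div with `id="content"` (a made-up id) and that the hook involved is `postprocess_html`; the class name `WholePageRecipe` is also hypothetical. The point is to trim the page in place and return the full soup rather than the bare div.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so this sketch can be imported outside calibre.
    BasicNewsRecipe = object


class WholePageRecipe(BasicNewsRecipe):  # hypothetical recipe name
    title = 'Whole page example'

    def postprocess_html(self, soup, first_fetch):
        # The id 'content' is an assumption for illustration.
        content = soup.find('div', attrs={'id': 'content'})
        if content is None:
            return soup
        # Buggy variant: `return content` hands back a bare fragment.
        # Instead, make the div the only child of <body> and return
        # the whole document.
        body = soup.find('body')
        if body is not None and content is not body:
            content.extract()
            for child in list(body.contents):
                child.extract()
            body.insert(0, content)
        return soup
```

If all you need is to keep a single element, the recipe attribute `keep_only_tags` may do the same job without a custom hook.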