11-06-2011, 09:36 AM | #1 |
Addict
Posts: 241
Karma: 1001369
Join Date: Sep 2010
Device: prs300, kindle keyboard 3g
|
using auto_cleanup and manual clean up together
After weeks of tinkering with the Daily Mirror recipe, I went back to the start and found auto_cleanup was doing a really good job, with a couple of exceptions:
1) The article's byline and date text are erased after the headline. 2) The text "Advertisement >>" is left intact. The article source for the date is Spoiler:
so I thought using auto_cleanup_keep = '//a[@class="published"]' or auto_cleanup_keep = '//*[@class="published"]' would mean the date got left in. It didn't. I also tried preprocess_regexps = [ (re.compile(r'Advertisement >>', re.IGNORECASE | re.DOTALL), lambda match: '')] to just delete "Advertisement >>", so that even if calibre created a class for it, the element would be empty. Again, no success. Is the call being ignored because auto_cleanup is being used? It would be nice to fix this, as the file created is smaller than my butchery and seems to be formatted more cleanly. Here's the simplified code as it stands Spoiler:
|
11-06-2011, 10:08 AM | #2 |
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
auto_cleanup_keep will typically fail if you put it on a low-level element like an <a> tag. Instead, find the <div> the <a> is in and try keeping that.
|
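A minimal sketch of the suggestion above, assuming the date link sits inside a <div> whose class is "article-meta". That class name and the recipe title are hypothetical, since the page source was not posted; substitute whatever the real enclosing <div> uses. The fallback class only exists so the snippet can be loaded outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    class BasicNewsRecipe(object):
        """Stand-in so the sketch can be loaded outside calibre."""


class DailyMirrorKeepSketch(BasicNewsRecipe):
    title = 'Daily Mirror (auto_cleanup_keep sketch)'  # hypothetical title
    auto_cleanup = True
    # Target the enclosing <div>, not the <a> itself.
    # 'article-meta' is an assumed class name, not taken from the real page.
    auto_cleanup_keep = '//div[@class="article-meta"]'
```

auto_cleanup_keep takes a single XPath expression, so several containers can be joined with "|" if more than one needs to survive the cleanup.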
11-06-2011, 11:06 AM | #3 | |
Addict
Posts: 241
Karma: 1001369
Join Date: Sep 2010
Device: prs300, kindle keyboard 3g
|
Quote:
I tried the div above, its parent, and both together. No good. Also, I thought that the use of the * as in auto_cleanup_keep = '//*[@class="important"]' meant all matching elements would be kept regardless of the tag the class is attached to. Also, is preprocess_regexps = [ (re.compile(r'Advertisement >>', re.IGNORECASE | re.DOTALL), lambda match: '')] not deleting instances of "Advertisement >>" because auto_cleanup overrides it? Can you do auto_cleanup followed by manual cleanup for any stray elements that get through? +++++++++++++++++ BTW, the whole reason I've gone down this path is that I discovered the text/paragraph after the first image in an article is being displayed to the right of the image (in the original Daily Mirror recipe). On my prs300 it's getting "displayed" off screen. I can't find a method to insert a CRLF after the image / before the image caption. |
|
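The thread never settles whether auto_cleanup interferes with preprocess_regexps. One thing that can be checked independently is whether the pattern matches the raw HTML at all. The sketch below assumes, without confirmation from the page source, that the site writes the arrows as &gt;&gt; entities; in that case the literal pattern would find nothing. The tolerant pattern, the sample string, and the apply_regexps helper are illustrative and not part of calibre.

```python
import re

# Pattern from the thread: matches only the literal text "Advertisement >>".
literal = re.compile(r'Advertisement >>', re.IGNORECASE | re.DOTALL)

# Assumed variant: optional whitespace, and each ">" may be written as "&gt;".
tolerant = re.compile(r'Advertisement\s*(?:>|&gt;){2}', re.IGNORECASE)

# Same (pattern, replacement function) shape that a recipe uses.
preprocess_regexps = [(tolerant, lambda match: '')]


def apply_regexps(html, rules):
    """Illustrative helper: apply each (pattern, function) pair in order."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


# Hypothetical raw HTML with the arrows encoded as entities.
sample = '<p>Advertisement &gt;&gt;</p><p>Story text</p>'

untouched = apply_regexps(sample, [(literal, lambda match: '')])
cleaned = apply_regexps(sample, preprocess_regexps)
```

With this sample, the literal pattern leaves the string unchanged and the tolerant one removes the advertisement text. If the real page does contain a literal ">>", the cause lies elsewhere.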
11-06-2011, 01:32 PM | #4 |
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Put img { display: block } in extra_css and set conversion_options = { 'linearize_tables' : True } |
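In a recipe, those two settings might look like the sketch below. The class name and title are placeholders, and the fallback class only exists so the snippet can be loaded outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    class BasicNewsRecipe(object):
        """Stand-in so the sketch can be loaded outside calibre."""


class DailyMirrorLayoutSketch(BasicNewsRecipe):
    title = 'Daily Mirror (layout sketch)'  # hypothetical title
    # Render each image as a block so the following text starts below it.
    extra_css = 'img { display: block }'
    # Flatten layout tables during conversion.
    conversion_options = {'linearize_tables': True}
```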
11-06-2011, 02:35 PM | #5 |
Addict
Posts: 241
Karma: 1001369
Join Date: Sep 2010
Device: prs300, kindle keyboard 3g
|
|
11-06-2011, 09:13 PM | #6 |
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Use the --debug-pipeline option and post one of the downloaded HTML files that displays this issue (also add no_stylesheets = True to your recipe).
|
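A rough sketch of that debugging setup, assuming the recipe is saved as a .recipe file and run through ebook-convert from a terminal. The file name, debug folder name, recipe title, and the debug_command helper are all placeholders; the helper only builds the argument list and runs nothing.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    class BasicNewsRecipe(object):
        """Stand-in so the sketch can be loaded outside calibre."""


class DailyMirrorDebugSketch(BasicNewsRecipe):
    title = 'Daily Mirror (debug sketch)'  # hypothetical title
    auto_cleanup = True
    # Ignore the site's own stylesheets while tracking down the layout issue.
    no_stylesheets = True


def debug_command(recipe_path, debug_dir='debug'):
    """Build an ebook-convert argument list that saves the intermediate files."""
    return ['ebook-convert', recipe_path, '.epub',
            '--test', '-vv', '--debug-pipeline', debug_dir]


command = debug_command('daily_mirror.recipe')
```

Run from a shell, this corresponds to: ebook-convert daily_mirror.recipe .epub --test -vv --debug-pipeline debug. The intermediate files, including the downloaded HTML, end up under the debug folder.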
|