I know exactly what you mean about the cruft Microsoft inserts. For PDFs, though, this wouldn't be a problem: the markup returned by pdftohtml is really simple.
I guess this partly depends on the type of PDF. At 20 MB, I'm assuming it's page images with a text layer underneath, which should process quickly. If it actually is 20 MB of text, then it may indeed take Calibre a while to process it...
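
If you want a rough check of which case you're in, something like the sketch below might help. It assumes poppler's pdftotext is installed and compares the size of the extracted text to the size of the file. The function names and the 0.05 threshold are my own guesses for illustration, not anything Calibre or pdftohtml uses, and since text inside a PDF is usually compressed, the ratio is only a crude indicator.

```python
import shutil
import subprocess
from pathlib import Path


def text_fraction(pdf_bytes: int, text_bytes: int) -> float:
    """Extracted text size divided by the PDF file size."""
    if pdf_bytes <= 0:
        raise ValueError("pdf_bytes must be positive")
    return text_bytes / pdf_bytes


def extracted_text_size(pdf_path: str) -> int:
    """Run pdftotext and return how many bytes of text it emits."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler-utils) not found on PATH")
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return len(result.stdout)


def looks_image_heavy(pdf_path: str, threshold: float = 0.05) -> bool:
    """Guess whether the file is mostly images (threshold is an assumption)."""
    pdf_bytes = Path(pdf_path).stat().st_size
    return text_fraction(pdf_bytes, extracted_text_size(pdf_path)) < threshold
```

For a 20 MB file that yields only a couple of hundred KB of text, looks_image_heavy would return True, which matches the "images with text underneath" case.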