To dither and render simultaneously, look at the animation demos. I split the two steps apart here so I could post a tiny video player that does not contain the transcoder, in the tradition of Linux tools that you just pipe together into a long one-liner.
My algorithms keep evolving in small increments each time I post a new demo.
The "no dither table" algorithm was my original creation, derived using logic tools including Karnaugh maps and De Morgan's theorem (usually used to simplify digital logic hardware). It was an evolutionary step beyond my "formula 42" dithering, a single formula that supported all Kindle models. The key idea was to make it cache-friendly, first by eliminating conditional branches, which can stall the CPU pipeline when mispredicted, and later by avoiding the small cache misses caused by RAM-resident table lookups. The latest branch-free formula also contains contrast and brightness adjustments. In the future it will do all that and process 4 pixels per operation at no extra cost (and perhaps with some savings, making it more than 4 times faster).
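For readers who want to see the general idea in code, here is a minimal C sketch of ordered dithering with no lookup table and no explicit branch in the per-pixel math. It is only an illustration of the technique, not the formula used in the demos. The function names (bayer2, bayer4, dither_pixel, dither_frame), the choice of a 4x4 Bayer pattern, and the 8-bit grayscale in, black-or-white out pixel format are all assumptions made for this example. It also leaves out the contrast and brightness adjustments and the 4-pixels-per-operation packing mentioned above.

```c
#include <stdint.h>
#include <stddef.h>

/* 2x2 Bayer rank (0..3) computed from the low bits of x and y.
 * Equivalent to the table {{0, 2}, {3, 1}} indexed as [y][x]. */
static inline unsigned bayer2(unsigned x, unsigned y)
{
    return (((x ^ y) & 1u) << 1) | (y & 1u);
}

/* 4x4 Bayer rank (0..15), built recursively from the 2x2 rank:
 * the low bits of x and y pick the coarse step, the next bits the fine step. */
static inline unsigned bayer4(unsigned x, unsigned y)
{
    return (bayer2(x, y) << 2) | bayer2(x >> 1, y >> 1);
}

/* Dither one 8-bit gray pixel to 0x00 (black) or 0xFF (white).
 * The comparison yields 0 or 1, and negating it gives an all-zeros
 * or all-ones mask, so no if/else is needed. */
static inline uint8_t dither_pixel(uint8_t gray, unsigned x, unsigned y)
{
    unsigned threshold = (bayer4(x, y) << 4) + 8u;   /* 8, 24, ..., 248 */
    return (uint8_t)(0u - (unsigned)(gray > threshold));
}

/* Dither a whole frame of width * height 8-bit gray pixels (hypothetical
 * layout: one byte per pixel, rows stored back to back). */
void dither_frame(const uint8_t *src, uint8_t *dst,
                  unsigned width, unsigned height)
{
    for (unsigned y = 0; y < height; y++) {
        for (unsigned x = 0; x < width; x++) {
            size_t i = (size_t)y * width + x;
            dst[i] = dither_pixel(src[i], x, y);
        }
    }
}
```

The threshold comes entirely from a few XOR, AND, and shift operations on the pixel coordinates, so nothing has to be fetched from a RAM-resident table. Whether the compiler emits truly branch-free machine code for the comparison depends on the target and optimization level, so it is worth checking the generated assembly on the actual device.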