Fast RGB pixel average on ARM

Here is a small trick for computing the average of two RGBA_8888 values extremely fast on an ARM processor, and an easy way to implement it in C++ with gcc (e.g. using the Android NDK).

We have two 4-byte values representing two pixels in RGBA_8888 format. The alpha channel needs no special treatment: assuming both pixels are opaque (alpha=0xff), the average of their alphas is 0xff as well. We want to produce a new 4-byte value holding the per-channel average of the two pixels.

Below we see the byte structure of the two pixels and of the desired average:
pixel1: r1|g1|b1|a1
pixel2: r2|g2|b2|a2
average: (r1+r2)/2 | (g1+g2)/2 | (b1+b2)/2 | (a1+a2)/2

To implement this operation in C, we’d need to extract the individual bytes (R,G,B,A) of each pixel, average them (four additions and four shifts) and finally recompose the bytes to form the result (e.g. with shifts and ORs).
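For reference, the byte-by-byte approach just described might look like the sketch below. The function name averageBytesNaive is made up for this example, and the division is assumed to round down.

```cpp
#include <cstdint>

// Portable baseline: average two packed 4-byte pixels one channel at a time.
// (averageBytesNaive is a hypothetical name used only in this sketch.)
inline std::uint32_t averageBytesNaive(std::uint32_t p1, std::uint32_t p2) {
    std::uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        std::uint32_t c1 = (p1 >> shift) & 0xFFu;  // extract one channel of pixel 1
        std::uint32_t c2 = (p2 >> shift) & 0xFFu;  // extract the same channel of pixel 2
        std::uint32_t avg = (c1 + c2) >> 1;        // add, then halve (rounds down)
        result |= avg << shift;                    // put the channel back in place
    }
    return result;
}
```

Each of the four channels costs a couple of shifts, masks, an addition and an OR, which is the overhead the rest of this post gets rid of.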

It turns out there is a much faster way to do it, which takes exactly one CPU instruction!

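The magic instruction is UHADD8 (unsigned halving add on bytes), available in the 32-bit ARM instruction set from ARMv6 on. It adds the corresponding bytes of two registers in pairs, halves each sum (rounding down, with no overflow between bytes) and packs the four results back into one register: exactly the operation we need for our RGBA average.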
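To make the per-byte behaviour concrete, here is a portable model of what UHADD8 computes. This is not the instruction itself but a well-known bit trick that gives the same result for every byte lane; the name uhadd8Model is made up for this sketch.

```cpp
#include <cstdint>

// Portable model of UHADD8: for each byte lane, floor((a + b) / 2).
// Uses the identity a + b = 2*(a & b) + (a ^ b), so the halved sum is
// (a & b) + ((a ^ b) >> 1). The 0xFEFEFEFE mask clears the lowest bit of
// every byte before the shift, so no bit leaks into the neighbouring lane.
// (uhadd8Model is a hypothetical name used only in this sketch.)
inline std::uint32_t uhadd8Model(std::uint32_t a, std::uint32_t b) {
    return (a & b) + (((a ^ b) & 0xFEFEFEFEu) >> 1);
}
```

For instance, a lane holding 0xff in one pixel and 0x00 in the other comes out as 0x7f: the 9-bit sum 0x0ff is halved and the fraction is dropped.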

But how do we access this ARM instruction from C++? It turns out, again, that there is a clean and simple way to do it, using gcc's inline assembly:

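// Per-byte unsigned average of two packed 4-byte values (32-bit ARM only).
extern inline unsigned averageBytes(unsigned a, unsigned b) {
    unsigned ret;
    // %0 is the output register (ret); %1 and %2 are the inputs (a, b).
    asm("uhadd8 %0, %1, %2": "=r" (ret): "r" (a), "r" (b));
    return ret;
}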

GCC is smart enough that, with optimization enabled, invoking this inline function compiles down to just one CPU instruction (the UHADD8) without any call overhead.
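If the same source also has to build for targets without UHADD8 (an x86 emulator image, say), one option is to guard the inline assembly behind a compile-time check and fall back to a portable bit trick elsewhere. The sketch below assumes the compiler defines the ACLE macro `__ARM_FEATURE_SIMD32` when the instruction is available; the names averagePixels and averageRows are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical wrapper: one UHADD8 on 32-bit ARM targets that have it,
// a portable equivalent everywhere else.
inline std::uint32_t averagePixels(std::uint32_t a, std::uint32_t b) {
#if defined(__arm__) && defined(__ARM_FEATURE_SIMD32)
    std::uint32_t ret;
    asm("uhadd8 %0, %1, %2" : "=r"(ret) : "r"(a), "r"(b));
    return ret;
#else
    // Same per-byte result without the instruction: halve the differing
    // bits and add the common bits; the mask keeps the lanes separate.
    return (a & b) + (((a ^ b) & 0xFEFEFEFEu) >> 1);
#endif
}

// Example use: blend two rows of RGBA_8888 pixels into dst, pixel by pixel.
inline void averageRows(const std::uint32_t* row1, const std::uint32_t* row2,
                        std::uint32_t* dst, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] = averagePixels(row1[i], row2[i]);
    }
}
```

A loop like averageRows is where the saving adds up, for example when halving the height of an image by averaging pairs of adjacent rows.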

So, problem solved: we have this neat C++ function for averaging two RGBA values, with a total cost of just one CPU instruction!
