Android ARM: PLD preload magic

I just discovered the incredible performance boost that can be achieved by using the PLD (“Preload Data”) ARM assembler instruction.

What I needed to do was convert image pixel data from RGB to RGBA format (from 3 bytes/pixel to 4 bytes/pixel), fullscreen and in real time during animation. More generally, the same situation comes up any time you need to process a large amount of data in RAM really fast.

while (n--) {
    *dest++ = *src++;
}

This loop is plain: it just copies data from a source buffer to a destination. It is used here just as a placeholder for some processing of the src data. (Of course, if you only need to copy the data, you should use memcpy instead.)

Let’s time this loop over 1 MB of data on a Samsung Galaxy Tab 10.1 with a Tegra 2 processor: it takes about 25 ms. What slows the loop down is waiting for data that is not in the processor cache to be fetched from main memory, which is slow. We can fix this by directing the CPU to prefetch data ahead of the read. We modify the loop by adding the PLD magic line:

while (n--) {
    asm ("PLD [%0, #128]"::"r" (src));
    *dest++ = *src++;
}

That asm line starts preloading data from memory to the CPU cache, 128 bytes ahead of the current src location, without blocking the CPU.
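If you want the same loop to build on non-ARM targets too, here is a minimal sketch that swaps the inline asm for the GCC/Clang builtin `__builtin_prefetch`. On ARM targets that have PLD, the compiler typically lowers the builtin to that instruction, though I have not verified this for every toolchain. The function name, the `uint32_t` element type, and the `PREFETCH_AHEAD` constant are my own assumptions for illustration; the loop above does not say what type src and dest point to.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: how many bytes ahead of the read to prefetch. */
#define PREFETCH_AHEAD 128

/* Copy n 32-bit words from src to dest, issuing a prefetch hint
 * PREFETCH_AHEAD bytes ahead of the current read position. */
void copy_with_prefetch(uint32_t *dest, const uint32_t *src, size_t n)
{
    while (n--) {
        /* Integer arithmetic keeps the address computation well defined
         * even when it points past the end of the buffer; the prefetch
         * itself is only a hint and does not fault on a bad address. */
        __builtin_prefetch((const void *)((uintptr_t)src + PREFETCH_AHEAD));
        *dest++ = *src++;
    }
}
```

The result is identical to the plain loop; only the timing can differ, so it is still worth profiling both on the actual device.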

We measure again, and the same loop over 1MB of data now takes only 8ms instead of 25ms — it is three times faster! Amazing for that 1-liner, I say. By the way, this is now very close to the performance of memcpy, which is itself implemented in highly-optimized ARM assembly.
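For anyone who wants to repeat the measurement, here is a rough sketch of a timing helper. It is not the harness used for the numbers above: the names `copy_fn`, `plain_copy` and `time_copy_ms` are hypothetical, and it uses the standard C `clock()` function, which reports CPU time. On a device you may prefer a wall-clock timer instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical signature shared by the loop variants under test. */
typedef void (*copy_fn)(uint32_t *dest, const uint32_t *src, size_t n);

/* Baseline: the plain loop without any prefetch hint. */
void plain_copy(uint32_t *dest, const uint32_t *src, size_t n)
{
    while (n--) {
        *dest++ = *src++;
    }
}

/* Run fn `repeats` times and return the best (smallest) time in
 * milliseconds. Returns a negative value if repeats is not positive. */
double time_copy_ms(copy_fn fn, uint32_t *dest, const uint32_t *src,
                    size_t n, int repeats)
{
    double best = -1.0;
    for (int i = 0; i < repeats; i++) {
        clock_t start = clock();
        fn(dest, src, n);
        clock_t end = clock();
        double ms = 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || ms < best) {
            best = ms;
        }
    }
    return best;
}
```

Calling `time_copy_ms(plain_copy, dest, src, n, 10)` and then the same call with a prefetching variant gives two numbers to compare. Taking the best of several runs reduces noise from other activity on the device.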

You may observe that our loop can be optimized a little further with partial unrolling, that is, processing more than a single element per iteration, so that one PLD covers several elements.

With partial loop unrolling:

n /= 4; //assume it's multiple of 4
while (n--) {
    asm ("PLD [%0, #128]"::"r" (src));
    *dest++ = *src++;
    *dest++ = *src++;
    *dest++ = *src++;
    *dest++ = *src++;
}
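The unrolled loop above assumes n is a multiple of 4. Here is a hedged sketch of one way to drop that assumption: process whole groups of four first, then finish the leftover elements one at a time. The function name and the `uint32_t` element type are assumptions of mine, and I use `__builtin_prefetch` (a GCC/Clang builtin) in place of the inline asm so the sketch also compiles off ARM.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n 32-bit words, four per iteration, with one prefetch hint
 * per group. Works for any n, not only multiples of 4. */
void copy_unrolled_prefetch(uint32_t *dest, const uint32_t *src, size_t n)
{
    size_t groups = n / 4; /* full groups of four elements */
    size_t tail = n % 4;   /* 0 to 3 leftover elements */

    while (groups--) {
        /* Hint 128 bytes ahead; integer arithmetic avoids forming an
         * out-of-bounds pointer near the end of the buffer. */
        __builtin_prefetch((const void *)((uintptr_t)src + 128));
        dest[0] = src[0];
        dest[1] = src[1];
        dest[2] = src[2];
        dest[3] = src[3];
        dest += 4;
        src += 4;
    }

    /* Copy whatever did not fit into a full group. */
    while (tail--) {
        *dest++ = *src++;
    }
}
```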

The conclusion is that if you find yourself optimizing to death some piece of C/C++ code on Android that reads a lot of memory, you should try adding a PLD and profile again to see if it helps. Enjoy!

asm ("PLD [%0, #128]"::"r" (src));

PS:
If you’re curious about the RGB_888 to RGBA_8888 conversion speed, it is possible to do a fullscreen conversion (1280×752 px) on the Tab in about 7 ms, which is quite impressive IMO. This is faster than the corresponding RGBA to RGBA memcpy(), which takes about 8 ms, probably because the conversion reads 3 bytes per pixel instead of 4. It thus makes the case for introducing an RGB_888 (3 bytes/pixel) Bitmap format in the Android Java API, as it saves RAM and memory bandwidth when the alpha channel isn’t needed.
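To make the conversion concrete, here is a deliberately simple sketch of an RGB_888 to RGBA_8888 loop with a prefetch hint. This is not the converter that produced the 7 ms figure, and a byte-at-a-time loop like this one is unlikely to reach that speed. The function name, the R, G, B, A byte order in memory, and the fully opaque alpha value 0xFF are all assumptions of mine.

```c
#include <stddef.h>
#include <stdint.h>

/* Expand `pixels` packed 3-byte RGB pixels into 4-byte RGBA pixels.
 * dest must have room for pixels * 4 bytes. */
void rgb888_to_rgba8888(uint8_t *dest, const uint8_t *src, size_t pixels)
{
    while (pixels--) {
        /* Prefetch hint 128 bytes ahead of the read position, via the
         * GCC/Clang builtin rather than inline PLD asm. */
        __builtin_prefetch((const void *)((uintptr_t)src + 128));
        dest[0] = src[0]; /* R */
        dest[1] = src[1]; /* G */
        dest[2] = src[2]; /* B */
        dest[3] = 0xFF;   /* A: assumed fully opaque */
        dest += 4;
        src += 3;
    }
}
```

For a 1280×752 frame, the call would be `rgb888_to_rgba8888(dest, src, 1280 * 752)`, reading about 2.9 MB and writing about 3.9 MB.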
