Android ARM: PLD preload magic

I just discovered the incredible performance boost that can be achieved by using the PLD (“Preload Data”) ARM assembler instruction.

What I needed to do was convert image pixel data from RGB to RGBA format, that is from 3 bytes/pixel to 4 bytes/pixel, fullscreen and in real time during animation. More generally, the technique applies any time you need to process a large amount of data in RAM really fast.

while (n--) {
    *dest++ = *src++;
}

This loop is plain: it just copies data from a memory source to a destination. It is used here just as a placeholder for some processing of the src data. (Of course, if you only need to copy the data, you should use memcpy instead.)

Let’s time this loop over 1MB of data, on a Samsung Galaxy Tab 10.1 with a Tegra 2 processor: it takes about 25ms. What slows the loop down is waiting for data that is not in the processor cache to be fetched from main memory, which is slow. We can fix this by directing the CPU to prefetch data ahead of the read. We modify the loop by adding the PLD magic line:

while (n--) {
    asm ("PLD [%0, #128]"::"r" (src));
    *dest++ = *src++;
}

That asm line starts preloading data from memory to the CPU cache, 128 bytes ahead of the current src location, without blocking the CPU.
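
For code that also has to build on non-ARM targets, the same idea can be sketched with the GCC/Clang builtin __builtin_prefetch, which the compiler typically lowers to PLD on ARM. This is only an illustration: the function name copy_words_prefetch, the PREFETCH_AHEAD constant and the uint32_t element type are assumptions, not part of the original code.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed constant mirroring the #128 offset used in the asm line above. */
#define PREFETCH_AHEAD 128

/* Hypothetical helper: the copy loop with a compiler prefetch hint.
   The element type (uint32_t) is an assumption for illustration. */
static void copy_words_prefetch(uint32_t *dest, const uint32_t *src, size_t n)
{
    while (n--) {
        /* Prefetch is only a hint; it does not change what the loop computes.
           Near the end of the buffer the hinted address lies past the data,
           which the builtin is documented to tolerate without faulting. */
        __builtin_prefetch((const char *)src + PREFETCH_AHEAD);
        *dest++ = *src++;
    }
}
```

Whether this helps at all still has to be measured on the target device.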

We measure again, and the same loop over 1MB of data now takes only 8ms instead of 25ms — it is three times faster! Amazing for that 1-liner, I say. By the way, this is now very close to the performance of memcpy, which is itself implemented in highly-optimized ARM assembly.
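
The measurement itself can be sketched roughly as below. This is not the harness used for the numbers above; the names copy_fn, copy_plain and best_time_ms are made up for illustration, and it uses the standard clock() function, which reports CPU time (on Android, clock_gettime with CLOCK_MONOTONIC would be the usual choice for wall-clock time).

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical signature shared by the loop variants being compared. */
typedef void (*copy_fn)(uint32_t *dest, const uint32_t *src, size_t n);

/* The plain loop from the article, wrapped as a function.
   The uint32_t element type is an assumption. */
static void copy_plain(uint32_t *dest, const uint32_t *src, size_t n)
{
    while (n--) {
        *dest++ = *src++;
    }
}

/* Runs fn `runs` times and returns the fastest timing in milliseconds.
   Taking the minimum reduces noise from other activity on the device. */
static double best_time_ms(copy_fn fn, uint32_t *dest, const uint32_t *src,
                           size_t n, int runs)
{
    double best = -1.0;
    for (int i = 0; i < runs; i++) {
        clock_t start = clock();
        fn(dest, src, n);
        clock_t end = clock();
        double ms = 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || ms < best) {
            best = ms;
        }
    }
    return best;
}
```

For example, best_time_ms(copy_plain, dest, src, n, 10) could be compared against the same call with the PLD variant. Keep in mind that an optimizing compiler may replace a pure copy loop with a memcpy call, so it is worth checking the generated assembly before trusting the numbers.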

The loop can be optimized a little further with partial unrolling: processing more than one element per iteration, so that a single PLD covers several elements instead of being issued for every one.

With partial loop unrolling:

n /= 4; // assumes n is a multiple of 4
while (n--) {
    asm ("PLD [%0, #128]"::"r" (src));
    *dest++ = *src++;
    *dest++ = *src++;
    *dest++ = *src++;
    *dest++ = *src++;
}
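
If n is not guaranteed to be a multiple of 4, the leftover elements need a small tail loop. A minimal sketch follows, assuming uint32_t elements and using the GCC/Clang builtin __builtin_prefetch in place of the inline asm so it also compiles on non-ARM targets; the function name copy_unrolled is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: unrolled copy that also handles n % 4 != 0. */
static void copy_unrolled(uint32_t *dest, const uint32_t *src, size_t n)
{
    size_t blocks = n / 4;  /* full groups of four elements */
    size_t tail = n % 4;    /* 0 to 3 leftover elements */

    while (blocks--) {
        /* One prefetch hint per group of four, 128 bytes ahead of src. */
        __builtin_prefetch((const char *)src + 128);
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
    }

    /* Copy whatever did not fit into a full group. */
    while (tail--) {
        *dest++ = *src++;
    }
}
```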

The conclusion is that if you find yourself optimizing to death some piece of C/C++ code on Android that reads a lot of memory, you should try using PLD and profile again to see if it helps. Enjoy!

asm ("PLD [%0, #128]"::"r" (src));

PS:
If you’re curious about the RGB_888 to RGBA_8888 conversion speed, it is possible to do a fullscreen conversion (1280×752 px) on the Tab in about 7ms, which is quite impressive IMO. This is faster than the corresponding memcpy() RGBA to RGBA which takes about 8ms, and thus makes the case for the introduction of the RGB_888 (3bytes/pixel) Bitmap format in the Android Java API (as it saves RAM and memory bandwidth when the Alpha channel isn’t needed).
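
The conversion code itself is not shown here, so the following is only a naive sketch of what an RGB to RGBA loop with a prefetch hint could look like, not the implementation behind the 7ms figure. It assumes bytes are stored in R, G, B, A order, that the added alpha should be fully opaque (0xFF), and that src and dest do not overlap; the function name rgb888_to_rgba8888 is made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: expands packed 3-byte pixels to 4-byte pixels.
   Assumes R, G, B, A byte order and an opaque alpha of 0xFF. */
static void rgb888_to_rgba8888(uint8_t *dest, const uint8_t *src,
                               size_t pixels)
{
    while (pixels--) {
        /* Prefetch hint 128 bytes ahead of the read position. Issuing it
           for every pixel is more often than needed; a tuned version would
           likely unroll the loop and hint once per group of pixels. */
        __builtin_prefetch(src + 128);
        dest[0] = src[0];  /* R */
        dest[1] = src[1];  /* G */
        dest[2] = src[2];  /* B */
        dest[3] = 0xFF;    /* A: opaque */
        src += 3;
        dest += 4;
    }
}
```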

{ 6 } Comments

  1. Matt | 2017-02-09 at 02:45 | Permalink

    Hi, why are you loading #128 ahead? Isn’t the current src value the next one to be loaded? Why don’t you preload that one?

  2. Mihai Preda | 2017-02-09 at 03:47 | Permalink

    The preload is “loading ahead of time”, before you need it. A bit ahead (128bytes) to give the CPU time to do the transfer from RAM to cache before you need to process the data.

  3. Matt | 2017-02-09 at 17:28 | Permalink

    Thanks for the reply! Is there a good way to calculate the best offset (I guess it depends on how many instructions are happening on each loop)? I’ve been trying many different values but the performance is always the same as with no preloading at all.

  4. Matt | 2017-02-09 at 17:34 | Permalink

    Oh, and is it bad to call pld like 6 times in a row, do you know how many plds the cache can actually use at once?

  5. Mihai Preda | 2017-02-09 at 22:18 | Permalink

    Yes it depends on the instructions in the loop (how long they take). I’d suggest trial and error. If you’ve already tried and you see no benefit, it may be the case that the particular CPU is smart and does the fetch-ahead to the cache automatically, thus no PLD needed.

    The cost of multiple PLDs is more in triggering transfers to cache and potentially affecting what’s in cache at a given moment.

    Anyway, keep in mind that this is a pretty low-level optimization, I would not sweat over it if you’ve tried it and saw no benefit.

  6. Matt | 2017-02-10 at 00:42 | Permalink

    Thanks! Yeah unfortunately it didn’t help much, I tried many combinations. I found something else though so it all worked out :)
