w3hello.com logo
Home PHP C# C++ Android Java Javascript Python IOS SQL HTML videos Categories
What memory access patterns are most efficient for outer-product-type double loops?

For large enough M, The Y vector will exceed the L1 cache size.* Thus on every new outer iteration, you'll be reloading Y from main memory (or at least, a slower cache). In other words, you won't be exploiting temporal locality in Y.

You should block up your accesses to Y; something like this:

for (jj = 0; jj < M; jj += CACHE_SIZE) {        // Iterate
over blocks
    for (i = 0; i < N; i++) {
        for (j = jj; j < (jj + CACHE_SIZE); j++) {   // Iterate within
block
            out[i*M + j] = X[i] * Y[j];
        }
    }
}

The above doesn't do anything smart with accesses to X, but new values are only being accessed 1/CACHE_SIZE as often, so the impact is probably negligible.


* If everything is small enough to already fit in cache, then you can't do better than what you already have (vectorisation opportunities notwithstanding).





© Copyright 2018 w3hello.com Publishing Limited. All rights reserved.