Performance difference while using shared data-structure instead of private data-structure in OpenMP

I decided to get rid of the global variable. Here is your code, modified in several places.

//timings.cpp
#include <sys/time.h>
#include <cstdlib>
#include <stdio.h>
#include <math.h>
#include <omp.h>
#include <unistd.h>

#define PI 3.14159265
#define large 100000

int main() {
    int i;
    timeval t1,t2;

    double elapsedtime;
    bool b=false;

    double e[large];
    double p[large];

    omp_set_num_threads(1);
    for(i=0;i<large;i++) {
        e[i]=9.0;
    }

   /* for(i=0;i<large;i++) {
       p[i]=9.0;
    }*/

    gettimeofday(&t1, NULL);
    #pragma omp parallel for firstprivate(b) private(i) shared(e)
    //#pragma omp parallel for firstprivate(b) private(e,i)
    for(i=0;i<large;i++) {
        if (!b)
        {
            printf("e[i]=%f, e address: %p, n=%d
",e[i],&e,omp_get_thread_num());
            b=true;
        }
       
        fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
        fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
        fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
        fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
        fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
        fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
        fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
        fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
        fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
    }

    gettimeofday(&t2, NULL);
    elapsedtime = (t2.tv_sec*1000000 + t2.tv_usec) - (t1.tv_sec*1000000 + t1.tv_usec);
    printf("%f ",elapsedtime/1000);
    return 0;
}
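
As a side note (not part of the original code): OpenMP provides its own portable wall-clock timer, omp_get_wtime(), which returns seconds as a double. A minimal sketch of the same measurement using it instead of gettimeofday would look like this:

#include <cstdio>
#include <omp.h>

int main() {
    double t1 = omp_get_wtime();           // wall-clock time in seconds
    // ... the parallel loop under test goes here ...
    double t2 = omp_get_wtime();
    printf("%f ", (t2 - t1) * 1000.0);     // elapsed milliseconds, same format as above
    return 0;
}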

We run it through the script "1.sh" to measure the timings automatically:

#!/bin/bash
# Swap which of the two "parallel for" pragmas is commented out:
# first prefix every pragma line with //, then strip the doubled prefix
# from the line that was already commented.
sed -i '/parallel/ s,#,//#,g' timings.cpp
sed -i '/parallel/ s,////#,#,g' timings.cpp
g++ -O0 -fopenmp timings.cpp -o timings
> time1.txt
for loopvar in {1..10}
do
if [ "$loopvar" -eq 1 ]
then
./timings >> time1.txt;
cat time1.txt;
echo;
else
./timings | tail -1 >> time1.txt;
fi
done
echo "---------"
echo "Total time:"
# Sum the space-separated timings on the last line, then divide by their count.
echo `tail -1 time1.txt | sed s/' '/'+'/g | sed s/$/0/ | bc -li | tail -1`/`tail -1 time1.txt | wc -w | sed s/$/.0/` | bc -li | tail -1

Here are the test results (Intel Core 2 Duo E8300):

1) #pragma omp parallel for firstprivate(b) private(i) shared(e)

user@comp:~ ./1.sh
Total time:
152.96380000000000000000

We get some strange latencies. Example output:

e[i]=9.000000, e address: 0x7fffb67c6960, n=0
e[i]=9.000000, e address: 0x7fffb67c6960, n=7
e[i]=9.000000, e address: 0x7fffb67c6960, n=8
//etc..

Note the address: it is the same in every thread (that is why the data is called shared).

2) #pragma omp parallel for firstprivate(e,b) private(i)

user@comp:~ ./1.sh
Total time:
157.48220000000000000000

The data in e is copied (firstprivate) to each thread. Example output:

e[i]=9.000000, e address: 0x7ff93c4238e0, n=1
e[i]=9.000000, e address: 0x7ff939c1e8e0, n=6
e[i]=9.000000, e address: 0x7ff93ac208e0, n=4

3) #pragma omp parallel for firstprivate(b) private(e,i)

Total time:
123.97110000000000000000

No copying of data, only allocation (the private copies are used uninitialized). Example output:

 e[i]=0.000000, e address: 0x7fca98bdb8e0, n=1
 e[i]=0.000000, e address: 0x7fffa2d10090, n=0
 e[i]=0.000000, e address: 0x7fca983da8e0, n=2

Here we have different addresses, but all e values contain memory garbage (the zeros are likely due to mmap page preallocation).
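
To keep the three clauses apart, here is a minimal self-contained sketch (my own illustration, not part of the timing code) that prints, per thread, the array address and first element under shared, firstprivate and private; its output follows the pattern described above.

// clauses.cpp -- compile with: g++ -fopenmp clauses.cpp -o clauses
#include <stdio.h>
#include <omp.h>

int main() {
    double a[4] = {9.0, 9.0, 9.0, 9.0};

    // shared: every thread sees the same array at the same address
    #pragma omp parallel shared(a)
    printf("shared       n=%d, a address: %p, a[0]=%f\n", omp_get_thread_num(), (void*)a, a[0]);

    // firstprivate: each thread gets its own copy, initialized from the original
    #pragma omp parallel firstprivate(a)
    printf("firstprivate n=%d, a address: %p, a[0]=%f\n", omp_get_thread_num(), (void*)a, a[0]);

    // private: each thread gets its own copy, left uninitialized (values are garbage)
    #pragma omp parallel private(a)
    printf("private      n=%d, a address: %p, a[0]=%f\n", omp_get_thread_num(), (void*)a, a[0]);

    return 0;
}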

To confirm that firstprivate(e) is slower because of the array copying, let's comment out all the calculations (the lines with "fmodf").

// #pragma omp parallel for firstprivate(b) private(i) shared(e)

Total time:
9.69700000000000000000

// #pragma omp parallel for firstprivate(e,b) private(i)

Total time:
12.83000000000000000000

// #pragma omp parallel for firstprivate(b) private(i,e)

Total time:
9.34880000000000000000

firstprivate(e) is slow because of the array copy; shared(e) is slow because of the calculation lines.

Compiling with -O3 -ftree-vectorize slightly decreases the time of the shared case:

// #pragma omp parallel for firstprivate(b) private(i) shared(e)

user@comp:~ ./1.sh
Total time:
141.38330000000000000000

// #pragma omp parallel for firstprivate(b) private(e,i)

Total time:
121.80390000000000000000

Using schedule(static, 256) doesn't do the trick.
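
For reference, the schedule clause was combined with the existing clauses roughly like this (a sketch of the form, not the exact line from the test):

#pragma omp parallel for schedule(static, 256) firstprivate(b) private(i) shared(e)
for(i=0;i<large;i++) {
    // ... same fmodf body as above ...
}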

Let's continue with the -O0 option. Comment out the array filling: // e[i]=9.0;

// #pragma omp parallel for firstprivate(b) private(i) shared(e)

Total time:
121.40780000000000000000

// #pragma omp parallel for firstprivate(b) private(e,i)

Total time:
122.33990000000000000000

So, "shared" is slower because of "private" data were used uninitialized (as proposed by commenters).

Let's see the dependence on thread number:

4threads
shared
Total time:
156.95030000000000000000
private
Total time:
121.11390000000000000000

2threads
shared
Total time:
155.96970000000000000000
private
Total time:
126.62130000000000000000

1 thread (performance drops roughly by half; I have a 2-core machine)
shared
Total time:
283.06280000000000000000
private
Total time:
229.37680000000000000000

To compile this with 1.sh, I manually uncommented both "parallel for" lines so that 1.sh would comment both of them out.

**1thread without parallel, initialized e[i]**
Total time:
281.22040000000000000000

**1thread without parallel, uninitialized e[i]** 
Total time:
231.66060000000000000000

So it's not an OpenMP issue, but a memory/cache usage issue. Generating the asm code with

g++ -O0 -S timings.cpp

shows only two differences between the two cases: one, which can be neglected, in the LC label numbering; the other, that one label (L3) contains not 1 but 5 asm lines when the e array is initialized:

L3:
movl    -800060(%rbp), %eax
movslq  %eax, %rdx
movabsq $4621256167635550208, %rax
movq    %rax, -800016(%rbp,%rdx,8)

(where the initialization occurs), plus the common line: addl $1, -800060(%rbp)

So it seems like a cache issue.

This is not a complete answer, but you can use the code above to study the problem further.
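
For example, one further experiment (a sketch of an assumption to test, not something measured above) would be to keep e private but write each element just before it is read, so that both the shared and the private variants compute on initialized memory:

// Sketch only: each thread gets a private, uninitialized copy of e, but every
// element is first-touched right before it is used in the calculation.
#pragma omp parallel for private(e,i)
for(i=0;i<large;i++) {
    e[i]=9.0;   // initialize the private element before using it
    fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
}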
