* How to make parallelizing loops and vectorization work at the same time?
From: Hanke Zhang @ 2023-09-15 11:20 UTC
To: gcc
Hi, I'm trying to accelerate my program with -ftree-vectorize and
-ftree-parallelize-loops.
Here are my test results using the different options (based on
gcc10.3.0 on i9-12900KF):
gcc-10 test.c -O3 -flto
> time: 29000 ms
gcc-10 test.c -O3 -flto -mavx2 -ftree-vectorize
> time: 17000 ms
gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24
> time: 5000 ms
gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24 -mavx2 -ftree-vectorize
> time: 5000 ms
I found that these two options do not seem to work together. Using
`-ftree-vectorize` alone gives a big speedup over the baseline, and
using `-ftree-parallelize-loops` alone also gives a big speedup. But
when I use both, vectorization appears to fail: I only get the benefit
of loop parallelization, not of vectorization.
I suspect that the loop can no longer be vectorized once it has been
parallelized. Is there any way to get the benefit of both
optimizations?
Here is my example program, adapted from the 462.libquantum in speccpu2006:
```
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAX_UNSIGNED unsigned long long

struct quantum_reg_node_struct {
    float _Complex *amplitude; /* alpha_j */
    MAX_UNSIGNED *state;       /* j */
};

typedef struct quantum_reg_node_struct quantum_reg_node;

struct quantum_reg_struct {
    int width; /* number of qubits in the qureg */
    int size;  /* number of non-zero vectors */
    int hashw; /* width of the hash array */
    quantum_reg_node *node;
    int *hash;
};

typedef struct quantum_reg_struct quantum_reg;

void quantum_toffoli(int control1, int control2, int target, quantum_reg *reg) {
    for (int i = 0; i < reg->size; i++) {
        if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control1)) {
            if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control2)) {
                reg->node->state[i] ^= ((MAX_UNSIGNED)1 << target);
            }
        }
    }
}

int get_random(void) {
    return rand() % 64;
}

void init(quantum_reg *reg) {
    reg->size = 2097152;
    for (int i = 0; i < reg->size; i++) {
        reg->node = (quantum_reg_node *)malloc(sizeof(quantum_reg_node));
        reg->node->state = (MAX_UNSIGNED *)malloc(sizeof(MAX_UNSIGNED) * reg->size);
        reg->node->amplitude = (float _Complex *)malloc(sizeof(float _Complex) * reg->size);
        if (i >= 1) break;
    }
    for (int i = 0; i < reg->size; i++) {
        reg->node->amplitude[i] = 0;
        reg->node->state[i] = 0;
    }
}

int main(void) {
    quantum_reg reg;
    init(&reg);
    for (int i = 0; i < 65000; i++) {
        quantum_toffoli(get_random(), get_random(), get_random(), &reg);
    }
}
```
Thanks so much.
* Re: How to make parallelizing loops and vectorization work at the same time?
From: Richard Biener @ 2023-09-15 11:59 UTC
To: Hanke Zhang; +Cc: gcc
On Fri, Sep 15, 2023 at 1:21 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
>
> Hi I'm trying to accelerate my program with -ftree-vectorize and
> -ftree-parallelize-loops.
>
> Here are my test results using the different options (based on
> gcc10.3.0 on i9-12900KF):
> gcc-10 test.c -O3 -flto
> > time: 29000 ms
> gcc-10 test.c -O3 -flto -mavx2 -ftree-vectorize
> > time: 17000 ms
> gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24
> > time: 5000 ms
> gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24 -mavx2 -ftree-vectorize
> > time: 5000 ms
>
First of all, -O3 already enables -ftree-vectorize; adding -mavx2 is what brings
the first gain. So adding -ftree-vectorize to the last command line is not
expected to change anything. Instead, you can add -fno-tree-vectorize to
the second-to-last one. Doing that, I get 111s vs. 41s, so doing both helps.
Note that parallelization hasn't seen any development in recent years.
Richard.
* Re: How to make parallelizing loops and vectorization work at the same time?
From: Hanke Zhang @ 2023-09-15 13:09 UTC
To: Richard Biener; +Cc: gcc
Richard Biener <richard.guenther@gmail.com> wrote on Fri, Sep 15, 2023 at 19:59:
>
> On Fri, Sep 15, 2023 at 1:21 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
> >
> > Hi I'm trying to accelerate my program with -ftree-vectorize and
> > -ftree-parallelize-loops.
> >
> > Here are my test results using the different options (based on
> > gcc10.3.0 on i9-12900KF):
> > gcc-10 test.c -O3 -flto
> > > time: 29000 ms
> > gcc-10 test.c -O3 -flto -mavx2 -ftree-vectorize
> > > time: 17000 ms
> > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24
> > > time: 5000 ms
> > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24 -mavx2 -ftree-vectorize
> > > time: 5000 ms
> >
>
> First of all -O3 already enables -ftree-vectorize, adding -mavx2 is what brings
> the first gain. So adding -ftree-vectorize to the last command-line is not
> expected to change anything. Instead you can use -fno-tree-vectorize on
> the second last one. Doing that I get 111s vs 41s thus doing both helps.
>
> Note parallelization hasn't seen any development in the last years.
>
> Richard.
Hi Richard:
Thank you for your sincere reply.
I get what you mean above. But I still see the following after adding
`-fopt-info-vec`:
gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec
> test.c:29:5: optimized: loop vectorized using 32 byte vectors
gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec -ftree-parallelize-loops=24
> (no output)
That suggests vectorization does not actually happen in the second case.
I also added `-fno-tree-vectorize` to the second-to-last command, and
it made no difference to performance on my computer.
So I still think only loop parallelization takes effect.
Hanke Zhang
* Re: How to make parallelizing loops and vectorization work at the same time?
From: Richard Biener @ 2023-09-15 13:13 UTC
To: Hanke Zhang; +Cc: gcc
On Fri, Sep 15, 2023 at 3:09 PM Hanke Zhang <hkzhang455@gmail.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> wrote on Fri, Sep 15, 2023 at 19:59:
>
> >
> > On Fri, Sep 15, 2023 at 1:21 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
> > >
> > > Hi I'm trying to accelerate my program with -ftree-vectorize and
> > > -ftree-parallelize-loops.
> > >
> > > Here are my test results using the different options (based on
> > > gcc10.3.0 on i9-12900KF):
> > > gcc-10 test.c -O3 -flto
> > > > time: 29000 ms
> > > gcc-10 test.c -O3 -flto -mavx2 -ftree-vectorize
> > > > time: 17000 ms
> > > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24
> > > > time: 5000 ms
> > > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24 -mavx2 -ftree-vectorize
> > > > time: 5000 ms
> > >
> >
> > First of all -O3 already enables -ftree-vectorize, adding -mavx2 is what brings
> > the first gain. So adding -ftree-vectorize to the last command-line is not
> > expected to change anything. Instead you can use -fno-tree-vectorize on
> > the second last one. Doing that I get 111s vs 41s thus doing both helps.
> >
> > Note parallelization hasn't seen any development in the last years.
> >
> > Richard.
>
> Hi Richard:
>
> Thank you for your sincere reply.
>
> I get what you mean above. But I still see the following after I add
> `-fopt-info-vec`:
>
> gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec
> > test.c:29:5: optimized: loop vectorized using 32 byte vectors
> gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec -ftree-parallelize-loops=24
> > nothing happened
>
> That means the vectorization does nothing help actually.
>
> At the same time, I added `-fno-tree-vectorize` to the second last one
> command. It did not bring about a performance change on my computer.
>
> So I still think only parallel loops work.
I checked GCC 13 and do see vectorized loops when parallelizing.
Richard.
* Re: How to make parallelizing loops and vectorization work at the same time?
From: Hanke Zhang @ 2023-09-15 14:07 UTC
To: Richard Biener; +Cc: gcc
I get it. It's an LTO problem: if I remove `-flto`, both optimizations work.
Thanks again for your help!
* Re: How to make parallelizing loops and vectorization work at the same time?
From: Richard Biener @ 2023-09-18 6:45 UTC
To: Hanke Zhang; +Cc: gcc
On Fri, Sep 15, 2023 at 4:07 PM Hanke Zhang <hkzhang455@gmail.com> wrote:
>
> I get it. It's a `lto` problem. If I remove `-flto`, both work.
That's odd - it might be that GCC thinks part of the program is cold and doesn't
optimize it. Does using -fwhole-program instead of -flto also not work?
Richard.