From: Richard Biener <richard.guenther@gmail.com>
Date: Mon, 18 Sep 2023 08:45:40 +0200
Message-ID: <CAFiYyc0F1uDS2da4rEbd+qtqaB30k-=t2bi=BNdf1Oo7NH7qag@mail.gmail.com>
Subject: Re: How to make parallelizing loops and vectorization work at the
 same time?
To: Hanke Zhang <hkzhang455@gmail.com>
Cc: gcc@gcc.gnu.org

On Fri, Sep 15, 2023 at 4:07 PM Hanke Zhang <hkzhang455@gmail.com> wrote:
>
> I get it. It's an LTO problem. If I remove `-flto`, both work.

That's odd - it might be that GCC thinks part of the program is cold and
doesn't optimize it.  Does using -fwhole-program instead of -flto also not
work?
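
One way to narrow this down might be to take the hot loop out of the struct
and compile it on its own. The following is only a sketch: `toffoli_kernel`
and `max_unsigned` are names made up here, not part of the posted test
program. The loop is meant to do the same work as `quantum_toffoli`, with the
two control masks combined up front.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long max_unsigned;

/* Hypothetical reduction of quantum_toffoli from the test program:
   the state array is passed directly and the masks are computed once,
   so the loop body does not go through reg->node->state. */
void toffoli_kernel(max_unsigned *state, size_t n,
                    int control1, int control2, int target)
{
    /* Shifting a 64-bit value by 64 or more is undefined behaviour. */
    assert(control1 >= 0 && control1 < 64);
    assert(control2 >= 0 && control2 < 64);
    assert(target >= 0 && target < 64);

    const max_unsigned controls =
        ((max_unsigned)1 << control1) | ((max_unsigned)1 << control2);
    const max_unsigned flip = (max_unsigned)1 << target;

    for (size_t i = 0; i < n; i++) {
        /* Flip the target bit only where both control bits are set. */
        if ((state[i] & controls) == controls)
            state[i] ^= flip;
    }
}
```

Compiling that file with something like
`gcc -O3 -mavx2 -ftree-parallelize-loops=24 -fopt-info-vec -c`, with and
without -flto, should show whether the vectorizer still reports the loop.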

Richard.

> Thanks for your help again!
>
> On Fri, Sep 15, 2023 at 21:13, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Fri, Sep 15, 2023 at 3:09 PM Hanke Zhang <hkzhang455@gmail.com> wrote:
> > >
> > > On Fri, Sep 15, 2023 at 19:59, Richard Biener <richard.guenther@gmail.com> wrote:
> > >
> > > >
> > > > On Fri, Sep 15, 2023 at 1:21 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
> > > > >
> > > > > Hi, I'm trying to accelerate my program with -ftree-vectorize and
> > > > > -ftree-parallelize-loops.
> > > > >
> > > > > Here are my test results using the different options (based on
> > > > > GCC 10.3.0 on an i9-12900KF):
> > > > > gcc-10 test.c -O3 -flto
> > > > > > time: 29000 ms
> > > > > gcc-10 test.c -O3 -flto -mavx2 -ftree-vectorize
> > > > > > time: 17000 ms
> > > > > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24
> > > > > > time: 5000 ms
> > > > > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24 -mavx2 -ftree-vectorize
> > > > > > time: 5000 ms
> > > > >
> > > >
> > > > First of all, -O3 already enables -ftree-vectorize; adding -mavx2 is
> > > > what brings the first gain.  So adding -ftree-vectorize to the last
> > > > command line is not expected to change anything.  Instead you can use
> > > > -fno-tree-vectorize on the second-to-last one.  Doing that I get 111s
> > > > vs 41s, so doing both helps.
> > > >
> > > > Note that parallelization hasn't seen any development in recent years.
> > > >
> > > > Richard.
> > >
> > > Hi Richard:
> > >
> > > Thank you for your sincere reply.
> > >
> > > I get what you mean above. But I still see the following after I add
> > > `-fopt-info-vec`:
> > >
> > > gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec
> > > > test.c:29:5: optimized: loop vectorized using 32 byte vectors
> > > gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec -ftree-parallelize-loops=24
> > > > nothing happened
> > >
> > > That means vectorization does not actually help.
> > >
> > > At the same time, I added `-fno-tree-vectorize` to the second-to-last
> > > command. It did not change performance on my computer.
> > >
> > > So I still think only loop parallelization takes effect.
> >
> > I checked GCC 13 and do see vectorized loops when parallelizing.
> >
> > Richard.
> >
> > > Hanke Zhang
> > >
> > > >
> > > > > I found that these two options do not work at the same time. That is,
> > > > > if I use the `-ftree-vectorize` option alone, it brings a big
> > > > > efficiency gain compared to doing nothing. Likewise, if I use the
> > > > > `-ftree-parallelize-loops` option alone, it also brings a big
> > > > > efficiency gain. But if I use both options, vectorization fails:
> > > > > I can't get the benefits of vectorization, only the benefits of
> > > > > parallelizing loops.
> > > > >
> > > > > I know that the reason may be that after parallelizing the loop,
> > > > > vectorization cannot be performed, but is there any way I can reap the
> > > > > benefits of both optimizations?
> > > > >
> > > > > Here is my example program, adapted from 462.libquantum in SPEC CPU2006:
> > > > >
> > > > > ```
> > > > > #include <stdio.h>
> > > > > #include <stdlib.h>
> > > > > #include <time.h>
> > > > >
> > > > > #define MAX_UNSIGNED unsigned long long
> > > > >
> > > > > struct quantum_reg_node_struct {
> > > > >     float _Complex *amplitude; /* alpha_j */
> > > > >     MAX_UNSIGNED *state;       /* j */
> > > > > };
> > > > >
> > > > > typedef struct quantum_reg_node_struct quantum_reg_node;
> > > > >
> > > > > struct quantum_reg_struct {
> > > > >     int width; /* number of qubits in the qureg */
> > > > >     int size;  /* number of non-zero vectors */
> > > > >     int hashw; /* width of the hash array */
> > > > >     quantum_reg_node *node;
> > > > >     int *hash;
> > > > > };
> > > > >
> > > > > typedef struct quantum_reg_struct quantum_reg;
> > > > >
> > > > > void quantum_toffoli(int control1, int control2, int target, quantum_reg *reg) {
> > > > >     for (int i = 0; i < reg->size; i++) {
> > > > >         if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control1)) {
> > > > >             if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control2))  {
> > > > >                 reg->node->state[i] ^= ((MAX_UNSIGNED)1 << target);
> > > > >             }
> > > > >         }
> > > > >     }
> > > > > }
> > > > >
> > > > > int get_random() {
> > > > >     return rand() % 64;
> > > > > }
> > > > >
> > > > > void init(quantum_reg *reg) {
> > > > >     reg->size = 2097152;
> > > > >     for (int i = 0; i < reg->size; i++)  {
> > > > >         reg->node = (quantum_reg_node *)malloc(sizeof(quantum_reg_node));
> > > > >         reg->node->state = (MAX_UNSIGNED *)malloc(sizeof(MAX_UNSIGNED) * reg->size);
> > > > >         reg->node->amplitude = (float _Complex *)malloc(sizeof(float _Complex) * reg->size);
> > > > >         if (i >= 1) break;
> > > > >     }
> > > > >     for (int i = 0; i < reg->size; i++)  {
> > > > >         reg->node->amplitude[i] = 0;
> > > > >         reg->node->state[i] = 0;
> > > > >     }
> > > > > }
> > > > >
> > > > > int main() {
> > > > >     quantum_reg reg;
> > > > >     init(&reg);
> > > > >     for (int i = 0; i < 65000; i++) {
> > > > >         quantum_toffoli(get_random(), get_random(), get_random(), &reg);
> > > > >     }
> > > > > }
> > > > > ```
> > > > >
> > > > > Thanks so much.