From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <CAM_DAs_cejW=sLEmo4tzkcjh6_AqxtKuem5Sv7aKtM3e=DozvQ@mail.gmail.com>
 <CAFiYyc33W1V09dAOx-uLy1EAy+Ym=QnfAHU7LxYWe7ZgLXjfQw@mail.gmail.com>
In-Reply-To: <CAFiYyc33W1V09dAOx-uLy1EAy+Ym=QnfAHU7LxYWe7ZgLXjfQw@mail.gmail.com>
From: Hanke Zhang <hkzhang455@gmail.com>
Date: Fri, 15 Sep 2023 21:09:21 +0800
Message-ID: <CAM_DAs8FEhBzsuSdyF3f=Eo=_QDPS_iB9woO9s2MXzwjUbML6A@mail.gmail.com>
Subject: Re: How to make parallelizing loops and vectorization work at the
 same time?
To: Richard Biener <richard.guenther@gmail.com>
Cc: gcc@gcc.gnu.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
List-Id: <gcc.gcc.gnu.org>

Richard Biener <richard.guenther@gmail.com> wrote on Fri, Sep 15, 2023 at 19:59:

>
> On Fri, Sep 15, 2023 at 1:21 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
> >
> > Hi I'm trying to accelerate my program with -ftree-vectorize and
> > -ftree-parallelize-loops.
> >
> > Here are my test results using the different options (based on
> > gcc10.3.0 on i9-12900KF):
> > gcc-10 test.c -O3 -flto
> > > time: 29000 ms
> > gcc-10 test.c -O3 -flto -mavx2 -ftree-vectorize
> > > time: 17000 ms
> > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24
> > > time: 5000 ms
> > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24 -mavx2 -ftree-vectorize
> > > time: 5000 ms
> >
>
> First of all -O3 already enables -ftree-vectorize, adding -mavx2 is what brings
> the first gain.  So adding -ftree-vectorize to the last command-line is not
> expected to change anything.  Instead you can use -fno-tree-vectorize on
> the second last one.  Doing that I get 111s vs 41s thus doing both helps.
>
> Note parallelization hasn't seen any development in the last years.
>
> Richard.

Hi Richard:

Thank you for your reply.

I see what you mean. But this is what I get after adding
`-fopt-info-vec`:

gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec
> test.c:29:5: optimized: loop vectorized using 32 byte vectors
gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec -ftree-parallelize-loops=24
> nothing was printed

So the vectorizer does not appear to do anything once loop
parallelization is enabled.

I also added `-fno-tree-vectorize` to the second-to-last command. It
made no noticeable performance difference on my machine.

So I still think only the loop parallelization takes effect.
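
One thing that might be worth trying instead of
`-ftree-parallelize-loops` is to state the parallelism explicitly with
OpenMP, so that each thread's chunk of the loop can still be
vectorized. This is only an untested sketch and nothing here was
measured in this thread. The function name `toffoli_omp_simd`, the
`max_unsigned` typedef and the flat `state`/`size` parameters are
hypothetical simplifications of `quantum_toffoli`, and the build is
assumed to add `-fopenmp` to `-O3 -mavx2`:

```c
#include <stddef.h>

typedef unsigned long long max_unsigned;

/* Hypothetical variant of quantum_toffoli that works directly on the
   state array. With -fopenmp the pragma splits the iterations across
   threads and requests SIMD inside each thread's chunk. Without
   -fopenmp the pragma is ignored and the loop runs serially. */
void toffoli_omp_simd(int control1, int control2, int target,
                      max_unsigned *state, int size)
{
    /* Loop-invariant masks, computed once outside the loop. */
    const max_unsigned controls = ((max_unsigned)1 << control1)
                                | ((max_unsigned)1 << control2);
    const max_unsigned flip = (max_unsigned)1 << target;

    #pragma omp parallel for simd
    for (int i = 0; i < size; i++) {
        /* Branch-free form: flip the target bit only when both
           control bits are set. Unlike the original, every element
           is written back, even when it is unchanged. */
        max_unsigned s = state[i];
        state[i] = ((s & controls) == controls) ? (s ^ flip) : s;
    }
}
```

In the example program the call would become something like
`toffoli_omp_simd(c1, c2, t, reg->node->state, reg->size)`. Whether
this actually beats the 5000 ms result would have to be measured.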

Hanke Zhang

>
> > I found that these two options do not work at the same time, that is,
> > if I use the `-ftree-vectorize` option alone, it can bring a big
> > efficiency gain compared to doing nothing; At the same time, if I use
> > the option of `-ftree-parallelize-loops` alone, it will also bring a
> > big efficiency gain. But if I use both options, vectorization fails,
> > that is, I can't get the benefits of vectorization, I can only get the
> > benefits of parallelizing loops.
> >
> > I know that the reason may be that after parallelizing the loop,
> > vectorization cannot be performed, but is there any way I can reap the
> > benefits of both optimizations?
> >
> > Here is my example program, adapted from the 462.libquantum in speccpu2006:
> >
> > ```
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <time.h>
> >
> > #define MAX_UNSIGNED unsigned long long
> >
> > struct quantum_reg_node_struct {
> >     float _Complex *amplitude; /* alpha_j */
> >     MAX_UNSIGNED *state;       /* j */
> > };
> >
> > typedef struct quantum_reg_node_struct quantum_reg_node;
> >
> > struct quantum_reg_struct {
> >     int width; /* number of qubits in the qureg */
> >     int size;  /* number of non-zero vectors */
> >     int hashw; /* width of the hash array */
> >     quantum_reg_node *node;
> >     int *hash;
> > };
> >
> > typedef struct quantum_reg_struct quantum_reg;
> >
> > void quantum_toffoli(int control1, int control2, int target, quantum_reg *reg) {
> >     for (int i = 0; i < reg->size; i++) {
> >         if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control1)) {
> >             if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control2))  {
> >                 reg->node->state[i] ^= ((MAX_UNSIGNED)1 << target);
> >             }
> >         }
> >     }
> > }
> >
> > int get_random() {
> >     return rand() % 64;
> > }
> >
> > void init(quantum_reg *reg) {
> >     reg->size = 2097152;
> >     for (int i = 0; i < reg->size; i++)  {
> >         reg->node = (quantum_reg_node *)malloc(sizeof(quantum_reg_node));
> >         reg->node->state = (MAX_UNSIGNED *)malloc(sizeof(MAX_UNSIGNED) * reg->size);
> >         reg->node->amplitude = (float _Complex *)malloc(sizeof(float _Complex) * reg->size);
> >         if (i >= 1) break;
> >     }
> >     for (int i = 0; i < reg->size; i++)  {
> >         reg->node->amplitude[i] = 0;
> >         reg->node->state[i] = 0;
> >     }
> > }
> >
> > int main() {
> >     quantum_reg reg;
> >     init(&reg);
> >     for (int i = 0; i < 65000; i++) {
> >         quantum_toffoli(get_random(), get_random(), get_random(), &reg);
> >     }
> > }
> > ```
> >
> > Thanks so much.