References: <CAM_DAs_cejW=sLEmo4tzkcjh6_AqxtKuem5Sv7aKtM3e=DozvQ@mail.gmail.com>
In-Reply-To: <CAM_DAs_cejW=sLEmo4tzkcjh6_AqxtKuem5Sv7aKtM3e=DozvQ@mail.gmail.com>
From: Richard Biener <richard.guenther@gmail.com>
Date: Fri, 15 Sep 2023 13:59:30 +0200
Message-ID: <CAFiYyc33W1V09dAOx-uLy1EAy+Ym=QnfAHU7LxYWe7ZgLXjfQw@mail.gmail.com>
Subject: Re: How to make parallelizing loops and vectorization work at the
 same time?
To: Hanke Zhang <hkzhang455@gmail.com>
Cc: gcc@gcc.gnu.org
Content-Type: text/plain; charset="UTF-8"
List-Id: <gcc.gcc.gnu.org>

On Fri, Sep 15, 2023 at 1:21 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
>
> Hi, I'm trying to accelerate my program with -ftree-vectorize and
> -ftree-parallelize-loops.
>
> Here are my test results using the different options (with
> GCC 10.3.0 on an i9-12900KF):
> gcc-10 test.c -O3 -flto
> > time: 29000 ms
> gcc-10 test.c -O3 -flto -mavx2 -ftree-vectorize
> > time: 17000 ms
> gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24
> > time: 5000 ms
> gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24 -mavx2 -ftree-vectorize
> > time: 5000 ms
>

First of all, -O3 already enables -ftree-vectorize; adding -mavx2 is what
brings the first gain.  So adding -ftree-vectorize to the last command line
is not expected to change anything.  Instead, you can use -fno-tree-vectorize
on the second-to-last one.  Doing that, I get 111s vs 41s, so doing both
helps.

Note that loop parallelization (-ftree-parallelize-loops) hasn't seen any
development in recent years.
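
If explicit annotation is acceptable, one alternative to relying on
-ftree-parallelize-loops is to mark the kernel with OpenMP.  The following
is only a minimal sketch under that assumption; it was not measured or
proposed in this thread.  The helper name toffoli_omp, the max_unsigned
typedef and the flat array interface are hypothetical and chosen for
illustration.  Testing both control bits with one combined mask is
equivalent to the two nested tests in the quoted program.

```c
#include <stddef.h>

typedef unsigned long long max_unsigned;

/* Hypothetical helper, not part of the original program: applies the
   Toffoli bit flip to a plain array so that the pointer and the trip
   count are loop-invariant. */
void toffoli_omp(max_unsigned *state, size_t n,
                 int control1, int control2, int target)
{
    /* Both control bits must be set; test them with one combined mask. */
    const max_unsigned controls =
        ((max_unsigned)1 << control1) | ((max_unsigned)1 << control2);
    const max_unsigned flip = (max_unsigned)1 << target;

    /* Assumption: compiled with -fopenmp.  Without it the pragma is
       ignored and the loop runs serially with the same result. */
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++) {
        if ((state[i] & controls) == controls)
            state[i] ^= flip;
    }
}
```

In the quoted test program this would be called as
toffoli_omp(reg->node->state, (size_t)reg->size, control1, control2, target),
built with something like -O3 -mavx2 -fopenmp.  Adding -fopt-info-vec makes
GCC report which loops it vectorized, which is a way to check whether the
vectorizer still applies to the loop.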

Richard.

> I found that these two options do not work at the same time. If I use
> `-ftree-vectorize` alone, it brings a big efficiency gain compared to
> doing nothing. If I use `-ftree-parallelize-loops` alone, it also brings
> a big efficiency gain. But if I use both options, vectorization fails:
> I can't get the benefits of vectorization, only the benefits of
> parallelizing loops.
>
> I know that the reason may be that after parallelizing the loop,
> vectorization cannot be performed, but is there any way I can reap the
> benefits of both optimizations?
>
> Here is my example program, adapted from 462.libquantum in SPEC CPU2006:
>
> ```
> #include <stdio.h>
> #include <stdlib.h>
> #include <time.h>
>
> #define MAX_UNSIGNED unsigned long long
>
> struct quantum_reg_node_struct {
>     float _Complex *amplitude; /* alpha_j */
>     MAX_UNSIGNED *state;       /* j */
> };
>
> typedef struct quantum_reg_node_struct quantum_reg_node;
>
> struct quantum_reg_struct {
>     int width; /* number of qubits in the qureg */
>     int size;  /* number of non-zero vectors */
>     int hashw; /* width of the hash array */
>     quantum_reg_node *node;
>     int *hash;
> };
>
> typedef struct quantum_reg_struct quantum_reg;
>
> void quantum_toffoli(int control1, int control2, int target, quantum_reg *reg) {
>     for (int i = 0; i < reg->size; i++) {
>         if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control1)) {
>             if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control2))  {
>                 reg->node->state[i] ^= ((MAX_UNSIGNED)1 << target);
>             }
>         }
>     }
> }
>
> int get_random() {
>     return rand() % 64;
> }
>
> void init(quantum_reg *reg) {
>     reg->size = 2097152;
>     for (int i = 0; i < reg->size; i++)  {
>         reg->node = (quantum_reg_node *)malloc(sizeof(quantum_reg_node));
>         reg->node->state = (MAX_UNSIGNED *)malloc(sizeof(MAX_UNSIGNED) * reg->size);
>         reg->node->amplitude = (float _Complex *)malloc(sizeof(float _Complex) * reg->size);
>         if (i >= 1) break;
>     }
>     for (int i = 0; i < reg->size; i++)  {
>         reg->node->amplitude[i] = 0;
>         reg->node->state[i] = 0;
>     }
> }
>
> int main() {
>     quantum_reg reg;
>     init(&reg);
>     for (int i = 0; i < 65000; i++) {
>         quantum_toffoli(get_random(), get_random(), get_random(), &reg);
>     }
> }
> ```
>
> Thanks so much.