From: Hanke Zhang
Date: Fri, 15 Sep 2023 22:07:24 +0800
Subject: Re: How to make parallelizing loops and vectorization work at the same time?
To: Richard Biener
Cc: gcc@gcc.gnu.org

I get it. It's an LTO problem. If I remove `-flto`, both work.

Thanks for your help again!

Richard Biener wrote on Fri, Sep 15, 2023 at 21:13:
>
> On Fri, Sep 15, 2023 at 3:09 PM Hanke Zhang wrote:
> >
> > Richard Biener wrote on Fri, Sep 15, 2023 at 19:59:
> > >
> > > On Fri, Sep 15, 2023 at 1:21 PM Hanke Zhang via Gcc wrote:
> > > >
> > > > Hi I'm trying to accelerate my program with -ftree-vectorize and
> > > > -ftree-parallelize-loops.
> > > >
> > > > Here are my test results using the different options (based on
> > > > gcc10.3.0 on i9-12900KF):
> > > > gcc-10 test.c -O3 -flto
> > > > > time: 29000 ms
> > > > gcc-10 test.c -O3 -flto -mavx2 -ftree-vectorize
> > > > > time: 17000 ms
> > > > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24
> > > > > time: 5000 ms
> > > > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24 -mavx2 -ftree-vectorize
> > > > > time: 5000 ms
> > > >
> > >
> > > First of all -O3 already enables -ftree-vectorize, adding -mavx2 is what brings
> > > the first gain.  So adding -ftree-vectorize to the last command-line is not
> > > expected to change anything.  Instead you can use -fno-tree-vectorize on
> > > the second last one.  Doing that I get 111s vs 41s thus doing both helps.
> > >
> > > Note parallelization hasn't seen any development in the last years.
> > >
> > > Richard.
> >
> > Hi Richard:
> >
> > Thank you for your sincere reply.
> >
> > I get what you mean above. But I still see the following after I add
> > `-fopt-info-vec`:
> >
> > gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec
> > > test.c:29:5: optimized: loop vectorized using 32 byte vectors
> > gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec -ftree-parallelize-loops=24
> > > nothing happened
> >
> > That means vectorization does not actually help.
> >
> > At the same time, I added `-fno-tree-vectorize` to the second-to-last
> > command. It did not bring about a performance change on my computer.
> >
> > So I still think only parallel loops work.
>
> I checked GCC 13 and do see vectorized loops when parallelizing.
>
> Richard.
>
> > Hanke Zhang
> >
> > >
> > > > I found that these two options do not work at the same time, that is,
> > > > if I use the `-ftree-vectorize` option alone, it can bring a big
> > > > efficiency gain compared to doing nothing; at the same time, if I use
> > > > the option of `-ftree-parallelize-loops` alone, it will also bring a
> > > > big efficiency gain. But if I use both options, vectorization fails,
> > > > that is, I can't get the benefits of vectorization, I can only get the
> > > > benefits of parallelizing loops.
> > > >
> > > > I know that the reason may be that after parallelizing the loop,
> > > > vectorization cannot be performed, but is there any way I can reap the
> > > > benefits of both optimizations?
> > > >
> > > > Here is my example program, adapted from 462.libquantum in SPEC CPU2006:
> > > >
> > > > ```
> > > > #include <stdlib.h>  /* malloc, rand */
> > > > /* two further #include lines lost their header names in the archive */
> > > >
> > > > #define MAX_UNSIGNED unsigned long long
> > > >
> > > > struct quantum_reg_node_struct {
> > > >     float _Complex *amplitude; /* alpha_j */
> > > >     MAX_UNSIGNED *state;       /* j */
> > > > };
> > > >
> > > > typedef struct quantum_reg_node_struct quantum_reg_node;
> > > >
> > > > struct quantum_reg_struct {
> > > >     int width; /* number of qubits in the qureg */
> > > >     int size;  /* number of non-zero vectors */
> > > >     int hashw; /* width of the hash array */
> > > >     quantum_reg_node *node;
> > > >     int *hash;
> > > > };
> > > >
> > > > typedef struct quantum_reg_struct quantum_reg;
> > > >
> > > > void quantum_toffoli(int control1, int control2, int target, quantum_reg *reg) {
> > > >     for (int i = 0; i < reg->size; i++) {
> > > >         if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control1)) {
> > > >             if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control2)) {
> > > >                 reg->node->state[i] ^= ((MAX_UNSIGNED)1 << target);
> > > >             }
> > > >         }
> > > >     }
> > > > }
> > > >
> > > > int get_random() {
> > > >     return rand() % 64;
> > > > }
> > > >
> > > > void init(quantum_reg *reg) {
> > > >     reg->size = 2097152;
> > > >     for (int i = 0; i < reg->size; i++) {
> > > >         reg->node = (quantum_reg_node *)malloc(sizeof(quantum_reg_node));
> > > >         reg->node->state = (MAX_UNSIGNED *)malloc(sizeof(MAX_UNSIGNED) * reg->size);
> > > >         reg->node->amplitude = (float _Complex *)malloc(sizeof(float _Complex) * reg->size);
> > > >         if (i >= 1) break;
> > > >     }
> > > >     for (int i = 0; i < reg->size; i++) {
> > > >         reg->node->amplitude[i] = 0;
> > > >         reg->node->state[i] = 0;
> > > >     }
> > > > }
> > > >
> > > > int main() {
> > > >     quantum_reg reg;
> > > >     init(&reg);
> > > >     for (int i = 0; i < 65000; i++) {
> > > >         quantum_toffoli(get_random(), get_random(), get_random(), &reg);
> > > >     }
> > > > }
> > > > ```
> > > >
> > > > Thanks so much.
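
For readers following the vectorization question, the conditional update in `quantum_toffoli` can also be written without branches, which is roughly the shape a compiler has to reach before it can process several `state[i]` elements per vector instruction. The following is only a minimal sketch and is not taken from the thread: the names `toffoli_masked`, `controls`, `flip` and `both` are hypothetical, it assumes 64-bit states (`uint64_t` in place of `MAX_UNSIGNED`), and whether a given GCC version vectorizes or parallelizes it still has to be checked, for example with the `-fopt-info-vec` output used above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical branch-free variant of the loop in quantum_toffoli:
   flip bit `target` in every state that has both control bits set. */
static void toffoli_masked(uint64_t *state, size_t n,
                           int control1, int control2, int target)
{
    /* Shifting a 64-bit value by 64 or more is undefined, so check the range. */
    assert(control1 >= 0 && control1 < 64);
    assert(control2 >= 0 && control2 < 64);
    assert(target >= 0 && target < 64);

    /* Loop-invariant masks, computed once outside the loop. */
    const uint64_t controls = ((uint64_t)1 << control1) | ((uint64_t)1 << control2);
    const uint64_t flip = (uint64_t)1 << target;

    for (size_t i = 0; i < n; i++) {
        /* All ones when both control bits are set, zero otherwise. */
        uint64_t both = (uint64_t)0 - (uint64_t)((state[i] & controls) == controls);
        /* XOR with zero leaves the element unchanged, so no branch is needed. */
        state[i] ^= flip & both;
    }
}
```

For example, calling `toffoli_masked(s, 5, 0, 1, 2)` on `{0, 1, 2, 3, 7}` leaves `{0, 1, 2, 7, 3}`: only the elements with bits 0 and 1 both set have bit 2 flipped. Unlike the original loop, this version writes every element, which is a trade-off of the branch-free form.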