From: Richard Biener
Date: Mon, 18 Sep 2023 08:45:40 +0200
Subject: Re: How to make parallelizing loops and vectorization work at the same time?
To: Hanke Zhang
Cc: gcc@gcc.gnu.org

On Fri, Sep 15, 2023 at 4:07 PM Hanke Zhang wrote:
>
> I get it. It's an `lto` problem. If I remove `-flto`, both work.

That's odd - it might be that GCC thinks part of the program is cold and
doesn't optimize it.  Does using -fwhole-program instead of -flto also not work?

Richard.

> Thanks for your help again!
>
> On Fri, Sep 15, 2023 at 21:13, Richard Biener wrote:
> >
> > On Fri, Sep 15, 2023 at 3:09 PM Hanke Zhang wrote:
> > >
> > > On Fri, Sep 15, 2023 at 19:59, Richard Biener wrote:
> > > >
> > > > On Fri, Sep 15, 2023 at 1:21 PM Hanke Zhang via Gcc wrote:
> > > > >
> > > > > Hi, I'm trying to accelerate my program with -ftree-vectorize and
> > > > > -ftree-parallelize-loops.
> > > > >
> > > > > Here are my test results using the different options (based on
> > > > > GCC 10.3.0 on an i9-12900KF):
> > > > > gcc-10 test.c -O3 -flto
> > > > > > time: 29000 ms
> > > > > gcc-10 test.c -O3 -flto -mavx2 -ftree-vectorize
> > > > > > time: 17000 ms
> > > > > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24
> > > > > > time: 5000 ms
> > > > > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24 -mavx2 -ftree-vectorize
> > > > > > time: 5000 ms
> > > > >
> > > >
> > > > First of all, -O3 already enables -ftree-vectorize; adding -mavx2 is what
> > > > brings the first gain.  So adding -ftree-vectorize to the last command line
> > > > is not expected to change anything.  Instead you can use -fno-tree-vectorize
> > > > on the second-to-last one.  Doing that I get 111s vs 41s, so doing both helps.
> > > >
> > > > Note that parallelization hasn't seen any development in recent years.
> > > >
> > > > Richard.
> > >
> > > Hi Richard:
> > >
> > > Thank you for your sincere reply.
> > >
> > > I get what you mean above. But I still see the following after I add
> > > `-fopt-info-vec`:
> > >
> > > gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec
> > > > test.c:29:5: optimized: loop vectorized using 32 byte vectors
> > > gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec -ftree-parallelize-loops=24
> > > > nothing happened
> > >
> > > That means the vectorization actually doesn't help.
> > >
> > > At the same time, I added `-fno-tree-vectorize` to the second-to-last
> > > command. It did not bring about a performance change on my computer.
> > >
> > > So I still think only parallel loops work.
> >
> > I checked GCC 13 and do see vectorized loops when parallelizing.
> >
> > Richard.
> >
> > > Hanke Zhang
> > >
> > > >
> > > > > I found that these two options do not work at the same time. That is,
> > > > > if I use the `-ftree-vectorize` option alone, it brings a big
> > > > > efficiency gain compared to doing nothing. Likewise, if I use the
> > > > > `-ftree-parallelize-loops` option alone, it also brings a big
> > > > > efficiency gain. But if I use both options, vectorization fails:
> > > > > I can't get the benefits of vectorization, only the benefits of
> > > > > parallelizing loops.
> > > > >
> > > > > I know that the reason may be that after parallelizing the loop,
> > > > > vectorization cannot be performed, but is there any way I can reap the
> > > > > benefits of both optimizations?
> > > > >
> > > > > Here is my example program, adapted from 462.libquantum in SPEC CPU2006:
> > > > >
> > > > > ```
> > > > > #include <stdlib.h> /* malloc, rand */
> > > > >
> > > > > #define MAX_UNSIGNED unsigned long long
> > > > >
> > > > > struct quantum_reg_node_struct {
> > > > >     float _Complex *amplitude; /* alpha_j */
> > > > >     MAX_UNSIGNED *state;       /* j */
> > > > > };
> > > > >
> > > > > typedef struct quantum_reg_node_struct quantum_reg_node;
> > > > >
> > > > > struct quantum_reg_struct {
> > > > >     int width; /* number of qubits in the qureg */
> > > > >     int size;  /* number of non-zero vectors */
> > > > >     int hashw; /* width of the hash array */
> > > > >     quantum_reg_node *node;
> > > > >     int *hash;
> > > > > };
> > > > >
> > > > > typedef struct quantum_reg_struct quantum_reg;
> > > > >
> > > > > void quantum_toffoli(int control1, int control2, int target, quantum_reg *reg) {
> > > > >     for (int i = 0; i < reg->size; i++) {
> > > > >         if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control1)) {
> > > > >             if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control2)) {
> > > > >                 reg->node->state[i] ^= ((MAX_UNSIGNED)1 << target);
> > > > >             }
> > > > >         }
> > > > >     }
> > > > > }
> > > > >
> > > > > int get_random() {
> > > > >     return rand() % 64;
> > > > > }
> > > > >
> > > > > void init(quantum_reg *reg) {
> > > > >     reg->size = 2097152;
> > > > >     for (int i = 0; i < reg->size; i++) {
> > > > >         reg->node = (quantum_reg_node *)malloc(sizeof(quantum_reg_node));
> > > > >         reg->node->state = (MAX_UNSIGNED *)malloc(sizeof(MAX_UNSIGNED) * reg->size);
> > > > >         reg->node->amplitude = (float _Complex *)malloc(sizeof(float _Complex) * reg->size);
> > > > >         if (i >= 1) break;
> > > > >     }
> > > > >     for (int i = 0; i < reg->size; i++) {
> > > > >         reg->node->amplitude[i] = 0;
> > > > >         reg->node->state[i] = 0;
> > > > >     }
> > > > > }
> > > > >
> > > > > int main() {
> > > > >     quantum_reg reg;
> > > > >     init(&reg);
> > > > >     for (int i = 0; i < 65000; i++) {
> > > > >         quantum_toffoli(get_random(), get_random(), get_random(), &reg);
> > > > >     }
> > > > > }
> > > > > ```
> > > > >
> > > > > Thanks so much.
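
One way to ask for both threading and vectorization explicitly, rather than
relying on -ftree-parallelize-loops, might be an OpenMP combined construct on
the hot loop. This is a different technique from the auto-parallelizer
discussed in this thread, and the sketch below rests on assumptions:
`toffoli_states` is a hypothetical helper (not part of the test program) that
takes the state array directly, and the two control tests are merged into one
mask comparison, which gives the same result as the nested `if`s.

```c
#include <stddef.h>

/* Hypothetical helper, not from the original test program: applies the
   Toffoli update to a plain array of basis states.  The pragma asks OpenMP
   to split the iterations across threads and to vectorize each chunk.
   Without -fopenmp the pragma is ignored and the loop runs serially. */
void toffoli_states(unsigned long long *state, int size,
                    int control1, int control2, int target)
{
    const unsigned long long controls =
        (1ULL << control1) | (1ULL << control2);
    const unsigned long long flip = 1ULL << target;

    #pragma omp parallel for simd
    for (int i = 0; i < size; i++) {
        /* Flip the target bit only when both control bits are set. */
        if ((state[i] & controls) == controls)
            state[i] ^= flip;
    }
}
```

Building this with something like `gcc -O3 -mavx2 -fopenmp -fopt-info-vec`
should show in the -fopt-info-vec output whether the loop was vectorized;
whether it is, and how much it helps, will likely depend on the GCC version.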