From: Hanke Zhang
Date: Fri, 15 Sep 2023 21:09:21 +0800
Subject: Re: How to make parallelizing loops and vectorization work at the same time?
To: Richard Biener
Cc: gcc@gcc.gnu.org

Richard Biener wrote on Fri, Sep 15, 2023 at 19:59:
>
> On Fri, Sep 15, 2023 at 1:21 PM Hanke Zhang via Gcc wrote:
> >
> > Hi I'm trying to accelerate my program with -ftree-vectorize and
> > -ftree-parallelize-loops.
> >
> > Here are my test results using the different options (based on
> > gcc10.3.0 on i9-12900KF):
> > gcc-10 test.c -O3 -flto
> > > time: 29000 ms
> > gcc-10 test.c -O3 -flto -mavx2 -ftree-vectorize
> > > time: 17000 ms
> > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24
> > > time: 5000 ms
> > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24 -mavx2 -ftree-vectorize
> > > time: 5000 ms
> >
>
> First of all -O3 already enables -ftree-vectorize, adding -mavx2 is what brings
> the first gain. So adding -ftree-vectorize to the last command-line is not
> expected to change anything.
> Instead you can use -fno-tree-vectorize on
> the second last one. Doing that I get 111s vs 41s thus doing both helps.
>
> Note parallelization hasn't seen any development in the last years.
>
> Richard.

Hi Richard,

Thank you for your reply.

I get what you mean above. But I still see the following after I add
`-fopt-info-vec`:

gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec
> test.c:29:5: optimized: loop vectorized using 32 byte vectors
gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec -ftree-parallelize-loops=24
> nothing happened

That means vectorization is not actually happening once the loop is
parallelized.

I also added `-fno-tree-vectorize` to the second-to-last command. It
did not change the performance on my computer.

So I still think only the loop parallelization takes effect.

Hanke Zhang

>
> > I found that these two options do not work at the same time, that is,
> > if I use the `-ftree-vectorize` option alone, it can bring a big
> > efficiency gain compared to doing nothing; At the same time, if I use
> > the option of `-ftree-parallelize-loops` alone, it will also bring a
> > big efficiency gain. But if I use both options, vectorization fails,
> > that is, I can't get the benefits of vectorization, I can only get the
> > benefits of parallelizing loops.
> >
> > I know that the reason may be that after parallelizing the loop,
> > vectorization cannot be performed, but is there any way I can reap the
> > benefits of both optimizations?
> >
> > Here is my example program, adapted from the 462.libquantum in speccpu2006:
> >
> > ```
> > #include <stdlib.h>
> >
> > #define MAX_UNSIGNED unsigned long long
> >
> > struct quantum_reg_node_struct {
> >     float _Complex *amplitude; /* alpha_j */
> >     MAX_UNSIGNED *state; /* j */
> > };
> >
> > typedef struct quantum_reg_node_struct quantum_reg_node;
> >
> > struct quantum_reg_struct {
> >     int width; /* number of qubits in the qureg */
> >     int size; /* number of non-zero vectors */
> >     int hashw; /* width of the hash array */
> >     quantum_reg_node *node;
> >     int *hash;
> > };
> >
> > typedef struct quantum_reg_struct quantum_reg;
> >
> > void quantum_toffoli(int control1, int control2, int target, quantum_reg *reg) {
> >     for (int i = 0; i < reg->size; i++) {
> >         if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control1)) {
> >             if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control2)) {
> >                 reg->node->state[i] ^= ((MAX_UNSIGNED)1 << target);
> >             }
> >         }
> >     }
> > }
> >
> > int get_random() {
> >     return rand() % 64;
> > }
> >
> > void init(quantum_reg *reg) {
> >     reg->size = 2097152;
> >     for (int i = 0; i < reg->size; i++) {
> >         reg->node = (quantum_reg_node *)malloc(sizeof(quantum_reg_node));
> >         reg->node->state = (MAX_UNSIGNED *)malloc(sizeof(MAX_UNSIGNED) * reg->size);
> >         reg->node->amplitude = (float _Complex *)malloc(sizeof(float _Complex) * reg->size);
> >         if (i >= 1) break;
> >     }
> >     for (int i = 0; i < reg->size; i++) {
> >         reg->node->amplitude[i] = 0;
> >         reg->node->state[i] = 0;
> >     }
> > }
> >
> > int main() {
> >     quantum_reg reg;
> >     init(&reg);
> >     for (int i = 0; i < 65000; i++) {
> >         quantum_toffoli(get_random(), get_random(), get_random(), &reg);
> >     }
> > }
> > ```
> >
> > Thanks so much.
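
One possible way to ask for both threading and SIMD on this loop is to
request them explicitly with OpenMP instead of relying on
-ftree-parallelize-loops. The sketch below is an assumption, not
something proposed in this thread: `toffoli_states` is a hypothetical
helper that works on a plain array of states, and the loop body is
rewritten without branches, which may or may not help the vectorizer on
a given GCC version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the original program: applies the
   Toffoli update to a plain array of basis states. */
void toffoli_states(uint64_t *state, size_t n,
                    unsigned control1, unsigned control2, unsigned target)
{
    /* Shift counts must stay below the 64-bit width. */
    assert(control1 < 64 && control2 < 64 && target < 64);

    /* With -fopenmp: split the iterations across threads and request SIMD
       code inside each thread's chunk. Without -fopenmp the pragma is
       ignored and the loop runs serially with the same result. */
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++) {
        /* 1 if both control bits are set in state[i], otherwise 0. */
        uint64_t both = (state[i] >> control1) & (state[i] >> control2) & 1u;
        /* Flip the target bit only when both controls are set. */
        state[i] ^= both << target;
    }
}
```

This could be built with something like `gcc -O3 -mavx2 -fopenmp
-fopt-info-vec`, and the -fopt-info-vec output would show whether the
loop is reported as vectorized. I have not measured it against the
timings above.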