From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr1-x42f.google.com (mail-wr1-x42f.google.com [IPv6:2a00:1450:4864:20::42f]) by sourceware.org (Postfix) with ESMTPS id 03027385828D for ; Mon, 14 Aug 2023 14:58:53 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 03027385828D Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=linaro.org Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=linaro.org Received: by mail-wr1-x42f.google.com with SMTP id ffacd0b85a97d-318015ade49so3987652f8f.0 for ; Mon, 14 Aug 2023 07:58:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; t=1692025131; x=1692629931; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=UfcNq/a1NEH4SfKrTdJu9nbVTgzSSjJimGvwevhSxBI=; b=dAhiFiGopNaBIOz6BbdJxUzDtwCxaW7gJXbw/nhgZ+YG1wRpArkjpJIVjyof/4hqRm megg/uBZe+ivuMqEZKWzJw+wdmYf8FyVc0Va2vEko+0Ma/0p/gVOyGmPg2Zc9fDFAoMn gy50U18XtEW5e6tO7tscPcWVbHKzOhAgqEJEowgojN+VK2yPkY/BDFx7YaNea/MyFBAl xw+cHpYLbLY/M5dQOl7FSAVxKrRWt6Z1QKmjyo1hKZ2Rno8GGdw/ZCrjkxiEhSoOXsyc 0tG3PvOb4RZ15lXUloYmO2bkTAXPnntgSH7FkC7XrezLoKqyFThvN7dQhAFXyunCMyUr xqpA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1692025131; x=1692629931; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=UfcNq/a1NEH4SfKrTdJu9nbVTgzSSjJimGvwevhSxBI=; b=OcwLkfO41/t1a91Du3+5crXisxQv8GblD9qYaEARrbumSt2PxmHjU3/jRN9TGQuvjR lB3jnPDjRWL3I1h01xQJTDuu42tENigOk6p6sZJrp6dQezNyiqzkdnmEZ23JKsDQVCjD cmU6T9YQwBeHWz/6J3jLIb/1n4uSUVh9VC1+EqzTgwj2eIikKQ5U5nvveDZn9c1yGmrf pGCQ3pLlkMOFuSx5wvvvErYXUU4xwLNdpt0yQhigkA8j2wSfDqaAVZETYOu40Cjmy/iE 1XB+N2fjlx4JFHpSeEyPohIF3Io2llZ+1KtURurwvaOgbPnriXKtzDfYCfa1BFR9a44x a9iw== X-Gm-Message-State: AOJu0Yx3CuZ7ej4PhUb2LdwwCf6/ftWIpPJXkNhdaM+YjASGTTfjZATK dE3RQgE71HAdeHgdp2BbtSMUvN+gSjGI8wFoXfwFiA== X-Google-Smtp-Source: AGHT+IHmK8eQW18aLbu7hf7ihTdW/QeJRw64P+yi8XTq+XbSyKYprS9HO1YNHz2vArFxOaSaL5PnVczlqBnstuhmCCo= X-Received: by 2002:adf:e648:0:b0:319:76ce:32ba with SMTP id b8-20020adfe648000000b0031976ce32bamr2436488wrn.21.1692025131271; Mon, 14 Aug 2023 07:58:51 -0700 (PDT) MIME-Version: 1.0 References: <20230728070552.50C1413276@imap2.suse-dmz.suse.de> <320f94c3-8805-ec64-dd4b-0454c8ecce14@gmail.com> In-Reply-To: From: Prathamesh Kulkarni Date: Mon, 14 Aug 2023 20:28:15 +0530 Message-ID: Subject: Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization To: Richard Biener Cc: Richard Biener , Jeff Law , gcc-patches@gcc.gnu.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-3.6 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Mon, 7 Aug 2023 at 13:19, Richard Biener wr= ote: > > On Mon, Aug 7, 2023 at 2:05=E2=80=AFAM Prathamesh Kulkarni via Gcc-patche= s > wrote: > > > > On Thu, 3 Aug 2023 at 17:48, Richard Biener wrote: > > > > > > On Thu, 3 Aug 2023, Richard Biener wrote: > > > > > > > On Thu, 3 Aug 2023, Richard Biener wrote: > > > > > > > > > On Thu, 3 Aug 2023, Prathamesh Kulkarni wrote: > > > > > > > > > > > On Wed, 2 Aug 2023 at 14:17, Richard Biener via Gcc-patches > > > > > > wrote: > > > > > > > > > > > > > > On Mon, 31 Jul 2023, Jeff Law wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On 7/28/23 01:05, Richard Biener via Gcc-patches wrote: > > > > > > > > > The following delays sinking of loads within the same inn= ermost > > > > > > > > > loop when it was unconditional before. That's a not unco= mmon > > > > > > > > > issue preventing vectorization when masked loads are not = available. > > > > > > > > > > > > > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu. > > > > > > > > > > > > > > > > > > I have a followup patch improving sinking that without th= is would > > > > > > > > > cause more of the problematic sinking - now that we have = a second > > > > > > > > > sink pass after loop opts this looks like a reasonable ap= proach? > > > > > > > > > > > > > > > > > > OK? > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > Richard. > > > > > > > > > > > > > > > > > > PR tree-optimization/92335 > > > > > > > > > * tree-ssa-sink.cc (select_best_block): Before loop > > > > > > > > > optimizations avoid sinking unconditional loads/stores > > > > > > > > > in innermost loops to conditional executed places. > > > > > > > > > > > > > > > > > > * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing. > > > > > > > > > * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c, > > > > > > > > > expect predictive commoning to happen instead of sinking= . > > > > > > > > > * gcc.dg/vect/pr65947-3.c: Adjust. > > > > > > > > I think it's reasonable -- there's probably going to be cas= es where it's not > > > > > > > > great, but more often than not I think it's going to be a r= easonable > > > > > > > > heuristic. > > > > > > > > > > > > > > > > If there is undesirable fallout, better to find it over the= coming months than > > > > > > > > next spring. So I'd suggest we go forward now to give more= time to find any > > > > > > > > pathological cases (if they exist). > > > > > > > > > > > > > > Agreed, I've pushed this now. > > > > > > Hi Richard, > > > > > > After this patch (committed in 399c8dd44ff44f4b496223c7cc980651= c4d6f6a0), > > > > > > pr65947-7.c "failed" for aarch64-linux-gnu: > > > > > > FAIL: gcc.dg/vect/pr65947-7.c scan-tree-dump-not vect "LOOP VEC= TORIZED" > > > > > > FAIL: gcc.dg/vect/pr65947-7.c -flto -ffat-lto-objects > > > > > > scan-tree-dump-not vect "LOOP VECTORIZED" > > > > > > > > > > > > /* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { t= arget { > > > > > > ! vect_fold_extract_last } } } } */ > > > > > > > > > > > > With your commit, condition_reduction in pr65947-7.c gets vecto= rized > > > > > > regardless of vect_fold_extract_last, > > > > > > which gates the above test (which is an improvement, because th= e > > > > > > function didn't get vectorized before the commit). > > > > > > > > > > > > The attached patch thus removes the gating on vect_fold_extract= _last, > > > > > > and the test passes again. > > > > > > OK to commit ? > > > > > > > > > > OK. > > > > > > > > Or wait - the loop doesn't vectorize on x86_64, so I guess one > > > > critical target condition is missing. Can you figure out which? > > > > > > I see > > > > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: > > > note: vect_is_simple_use: operand last_19 =3D PHI , > > > type of def: reduction > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: > > > note: vect_is_simple_use: vectype vector(4) int > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: > > > missed: multiple types in double reduction or condition reduction o= r > > > fold-left reduction. > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:13:1: > > > missed: not vectorized: relevant phi not supported: last_19 =3D PHI > > > > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: > > > missed: bad operation or unsupported loop bound. > > Hi Richard, > > Looking at the aarch64 vect dump, it seems the loop in > > condition_reduction gets vectorized with V4HI mode > > while fails for other modes in vectorizable_condition: > > > > if ((double_reduc || reduction_type !=3D TREE_CODE_REDUCTION) > > && ncopies > 1) > > { > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > "multiple types in double reduction or conditi= on " > > "reduction or fold-left reduction.\n"); > > return false; > > } > > > > From the dump: > > foo.c:9:21: note: =3D=3D=3D vect_analyze_loop_operations =3D=3D=3D > > foo.c:9:21: note: examining phi: last_19 =3D PHI > > foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of > > def: internal > > foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int > > foo.c:9:21: note: vect_is_simple_use: operand last_19 =3D PHI > > , type of def: reduction > > foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int > > > > For V8HI, VF =3D 8, and vectype_in =3D vector(4) int. > > Thus ncopies =3D VF / length(vectype_in) =3D 2, which is greater than 1= , > > and thus fails: > > foo.c:9:21: missed: multiple types in double reduction or condition > > reduction or fold-left reduction. > > foo.c:4:1: missed: not vectorized: relevant phi not supported: > > last_19 =3D PHI > > While for V4HI, VF =3D 4 and thus ncopies =3D 1, so it succeeds. > > > > For x86_64, it seems the vectorizer doesn't seem to try V4HI mode. > > If I "force" the vectorizer to use V4HI mode, we get the following dump= : > > foo.c:9:21: note: =3D=3D=3D vect_analyze_loop_operations =3D=3D=3D > > foo.c:9:21: note: examining phi: last_19 =3D PHI > > foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of > > def: internal > > foo.c:9:21: note: vect_is_simple_use: vectype vector(2) int > > foo.c:9:21: note: vect_is_simple_use: operand last_19 =3D PHI > > , type of def: reduction > > foo.c:9:21: note: vect_is_simple_use: vectype vector(2) int > > foo.c:9:21: missed: multiple types in double reduction or condition > > reduction or fold-left reduction. > > > > Not sure tho if this is the only reason for the test to fail to > > vectorize on the target. > > Will investigate in more details next week. > > The odd thing is that you say > > for (int i =3D 0; i < N; i++) > { > aval =3D a[i]; > if (b[i] < min_v) > last =3D aval; > } > > fails to vectorize but > > for (int i =3D 0; i < N; i++) > { > if (b[i] < min_v) > last =3D a[i]; > } > > succeeds? The IL difference should be irrelevant for the reduction Hi Richard, Sorry for late response. No this case containing a conditional load doesn't vectorize on aarch64 eit= her: foo2.c:9:21: note: =3D=3D=3D analyze_loop_nest =3D=3D=3D foo2.c:9:21: note: =3D=3D=3D vect_analyze_loop_form =3D=3D=3D foo2.c:9:21: missed: not vectorized: control flow in loop. foo2.c:9:21: missed: bad loop form. For this test: for (int i =3D 0; i < N; i++) { aval =3D a[i]; if (b[i] < min_v) last =3D aval; } IIUC sink pass made the load conditional preventing vectorization (similar to above), but your PR92335 fix delays the sinking of load before loop opts, and thus gets vectorized. Till vect pass, the dumps are similar for x86 and aarch64. > vectorization: > > [local count: 1049367889]: > # last_19 =3D PHI > # i_21 =3D PHI > # ivtmp_18 =3D PHI > _1 =3D (long unsigned int) i_21; > _2 =3D _1 * 2; > _3 =3D a_12(D) + _2; > aval_13 =3D *_3; > _5 =3D _1 * 4; > _6 =3D b_14(D) + _5; > _7 =3D *_6; > last_16 =3D (int) aval_13; > _9 =3D _7 < min_v_15(D); > last_8 =3D _9 ? last_16 : last_19; > i_17 =3D i_21 + 1; > ivtmp_10 =3D ivtmp_18 - 1; > if (ivtmp_10 !=3D 0) > goto ; [97.68%] > > vs > > [local count: 1049367889]: > # last_19 =3D PHI > # i_21 =3D PHI > # ivtmp_11 =3D PHI > _1 =3D (long unsigned int) i_21; > _2 =3D _1 * 4; > _3 =3D b_13(D) + _2; > _4 =3D *_3; > _5 =3D _4 < min_v_14(D); > _6 =3D _1 * 2; > _38 =3D _37 + _6; > _7 =3D (short int *) _38; > _8 =3D .MASK_LOAD (_7, 16B, _5); > last_16 =3D (int) _8; > last_9 =3D _5 ? last_16 : last_19; > i_17 =3D i_21 + 1; > ivtmp_10 =3D ivtmp_11 - 1; > if (ivtmp_10 !=3D 0) > goto ; [97.68%] > > maybe since the "mask" is used twice with the .MASK_LOAD > we are not actually looking at the def (the comparison) and it's > the comparison which would introduce the "multiple types"? > > That is, I wonder why not sinking the load, avoiding a conditional > load, makes a difference to vectorizing the condition/extract last reduct= ion. IIUC, the issue is that the vector type used for reduction seems to be different than the vector type used for determining VF, and it passes the above check in vectoriable_reduction, only if VF matches the length of the vector type used for reduction. For V4HI mode on aarch64, it sets VF =3D 4 which matches with the length of= vector type used for reduction (vector (4) int): foo.c:9:21: note: examining phi: last_19 =3D PHI foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of def: internal foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int foo.c:9:21: note: vect_is_simple_use: operand last_19 =3D PHI , type of def: reduction foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int > > It doesn't seem to make a difference for x86. That said, the "fix" is > probably sticking the correct target on the dump-check, it seems > that vect_fold_extract_last is no longer correct here. Um sorry, I did go thru various checks in target-supports.exp, but not sure which one will be appropriate for this case, and am stuck here :/ Could you please suggest how to proceed ? Thanks, Prathamesh > > Richard. > > > Thanks, > > Prathamesh > > > > > > Richard.