MIME-Version: 1.0
References: <20240513022737.3105192-1-hongtao.liu@intel.com> <CAFiYyc3snQb4M=mC8a7Bqx_kDr+104uxp35JnnpDQx4DYQLP1Q@mail.gmail.com>
In-Reply-To: <CAFiYyc3snQb4M=mC8a7Bqx_kDr+104uxp35JnnpDQx4DYQLP1Q@mail.gmail.com>
From: Hongtao Liu <crazylht@gmail.com>
Date: Wed, 15 May 2024 10:14:51 +0800
Message-ID: <CAMZc-bw51uL7MaS7-tL+9XHZxFPZjn__fFBYs5ShQcp1D_spkQ@mail.gmail.com>
Subject: Re: [PATCH] Don't reduce estimated unrolled size for innermost loop.
To: Richard Biener <richard.guenther@gmail.com>
Cc: liuhongt <hongtao.liu@intel.com>, gcc-patches@gcc.gnu.org
Content-Type: text/plain; charset="UTF-8"
List-Id: <gcc-patches.gcc.gnu.org>

On Mon, May 13, 2024 at 3:40 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Mon, May 13, 2024 at 4:29 AM liuhongt <hongtao.liu@intel.com> wrote:
> >
> > As testcase in the PR, O3 cunrolli may prevent vectorization for the
> > innermost loop and increase register pressure.
> > The patch removes the 1/3 reduction of unr_insns for innermost loop for UL_ALL.
> > ul != UL_ALL is needed since some small loop complete unrolling at O2 relies
> > on the reduction.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > No big impact for SPEC2017.
> > Ok for trunk?
>
> This removes the 1/3 reduction when unrolling a loop nest (the case I was
> concerned about).  Unrolling of a nest is by iterating in
> tree_unroll_loops_completely
> so the to be unrolled loop appears innermost.  So I think you need a new
> parameter on tree_unroll_loops_completely_1 indicating whether we're in the
> first iteration (or whether to assume inner most loops will "simplify").
Yes, that would be better.
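
A minimal standalone sketch of how such a parameter could gate the
reduction (an illustration only, not the actual patch). The names ending
in _sketch and the assume_simplify flag are hypothetical; the struct
fields are modelled on the ones estimated_unrolled_size reads from
struct loop_size, and plain C types stand in for HOST_WIDE_INT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for GCC's struct loop_size: only the fields
   that the size estimate reads are modelled here.  */
struct loop_size_sketch
{
  int overall;
  int eliminated_by_peeling;
  int last_iteration;
  int last_iteration_eliminated_by_peeling;
};

/* Estimate the size of the loop after NUNROLL unrollings.
   ASSUME_SIMPLIFY is a hypothetical flag supplied by the caller: when
   true, the usual optimistic 1/3 reduction is applied; when false, the
   raw estimate is returned.  */
static long
estimated_unrolled_size_sketch (const struct loop_size_sketch *size,
                                unsigned long nunroll,
                                bool assume_simplify)
{
  assert (size != NULL);

  long unr_insns = (long) nunroll
                   * (size->overall - size->eliminated_by_peeling);
  unr_insns += size->last_iteration
               - size->last_iteration_eliminated_by_peeling;

  /* Only be optimistic when the caller says cleanup is expected.  */
  if (assume_simplify)
    unr_insns = unr_insns * 2 / 3;

  /* Never report a non-positive size.  */
  if (unr_insns <= 0)
    unr_insns = 1;

  return unr_insns;
}
```

With overall = 30, eliminated_by_peeling = 6, last_iteration = 20,
last_iteration_eliminated_by_peeling = 5 and nunroll = 15, the sketch
returns 375 without the reduction and 250 with it, which is the kind of
gap that decides whether a loop stays under the size limit.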
>
> Few comments below
>
> > gcc/ChangeLog:
> >
> >         PR tree-optimization/112325
> >         * tree-ssa-loop-ivcanon.cc (estimated_unrolled_size): Add 2
> >         new parameters: loop and ul, and remove unr_insns reduction
> >         for innermost loop.
> >         (try_unroll_loop_completely): Pass loop and ul to
> >         estimated_unrolled_size.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.dg/tree-ssa/pr112325.c: New test.
> >         * gcc.dg/vect/pr69783.c: Add extra option --param
> >         max-completely-peeled-insns=300.
> > ---
> >  gcc/testsuite/gcc.dg/tree-ssa/pr112325.c | 57 ++++++++++++++++++++++++
> >  gcc/testsuite/gcc.dg/vect/pr69783.c      |  2 +-
> >  gcc/tree-ssa-loop-ivcanon.cc             | 16 +++++--
> >  3 files changed, 71 insertions(+), 4 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c b/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
> > new file mode 100644
> > index 00000000000..14208b3e7f8
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
> > @@ -0,0 +1,57 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -fdump-tree-cunrolli-details" } */
> > +
> > +typedef unsigned short ggml_fp16_t;
> > +static float table_f32_f16[1 << 16];
> > +
> > +inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
> > +    unsigned short s;
> > +    __builtin_memcpy(&s, &f, sizeof(unsigned short));
> > +    return table_f32_f16[s];
> > +}
> > +
> > +typedef struct {
> > +    ggml_fp16_t d;
> > +    ggml_fp16_t m;
> > +    unsigned char qh[4];
> > +    unsigned char qs[32 / 2];
> > +} block_q5_1;
> > +
> > +typedef struct {
> > +    float d;
> > +    float s;
> > +    char qs[32];
> > +} block_q8_1;
> > +
> > +void ggml_vec_dot_q5_1_q8_1(const int n, float * restrict s, const void * restrict vx, const void * restrict vy) {
> > +    const int qk = 32;
> > +    const int nb = n / qk;
> > +
> > +    const block_q5_1 * restrict x = vx;
> > +    const block_q8_1 * restrict y = vy;
> > +
> > +    float sumf = 0.0;
> > +
> > +    for (int i = 0; i < nb; i++) {
> > +        unsigned qh;
> > +        __builtin_memcpy(&qh, x[i].qh, sizeof(qh));
> > +
> > +        int sumi = 0;
> > +
> > +        for (int j = 0; j < qk/2; ++j) {
> > +            const unsigned char xh_0 = ((qh >> (j + 0)) << 4) & 0x10;
> > +            const unsigned char xh_1 = ((qh >> (j + 12)) ) & 0x10;
> > +
> > +            const int x0 = (x[i].qs[j] & 0xF) | xh_0;
> > +            const int x1 = (x[i].qs[j] >> 4) | xh_1;
> > +
> > +            sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
> > +        }
> > +
> > +        sumf += (ggml_lookup_fp16_to_fp32(x[i].d)*y[i].d)*sumi + ggml_lookup_fp16_to_fp32(x[i].m)*y[i].s;
> > +    }
> > +
> > +    *s = sumf;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump {(?n)Not unrolling loop [1-9] \(--param max-completely-peel-times limit reached} "cunrolli"} } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/pr69783.c b/gcc/testsuite/gcc.dg/vect/pr69783.c
> > index 5df95d0ce4e..a1f75514d72 100644
> > --- a/gcc/testsuite/gcc.dg/vect/pr69783.c
> > +++ b/gcc/testsuite/gcc.dg/vect/pr69783.c
> > @@ -1,6 +1,6 @@
> >  /* { dg-do compile } */
> >  /* { dg-require-effective-target vect_float } */
> > -/* { dg-additional-options "-Ofast -funroll-loops" } */
> > +/* { dg-additional-options "-Ofast -funroll-loops --param max-completely-peeled-insns=300" } */
>
> If we rely on unrolling of a loop can you put #pragma unroll [N]
> before the respective loop
> instead?
>
> >  #define NXX 516
> >  #define NYY 516
> > diff --git a/gcc/tree-ssa-loop-ivcanon.cc b/gcc/tree-ssa-loop-ivcanon.cc
> > index bf017137260..5e0eca647a1 100644
> > --- a/gcc/tree-ssa-loop-ivcanon.cc
> > +++ b/gcc/tree-ssa-loop-ivcanon.cc
> > @@ -444,7 +444,9 @@ tree_estimate_loop_size (class loop *loop, edge exit, edge edge_to_cancel,
> >
> >  static unsigned HOST_WIDE_INT
> >  estimated_unrolled_size (struct loop_size *size,
> > -                        unsigned HOST_WIDE_INT nunroll)
> > +                        unsigned HOST_WIDE_INT nunroll,
> > +                        enum unroll_level ul,
> > +                        class loop* loop)
> >  {
> >    HOST_WIDE_INT unr_insns = ((nunroll)
> >                              * (HOST_WIDE_INT) (size->overall
> > @@ -453,7 +455,15 @@ estimated_unrolled_size (struct loop_size *size,
> >      unr_insns = 0;
> >    unr_insns += size->last_iteration - size->last_iteration_eliminated_by_peeling;
> >
> > -  unr_insns = unr_insns * 2 / 3;
> > +  /* For innermost loop, loop body is not likely to be simplified as much as 1/3,
> > +     and may increase a lot of register pressure.
> > +     UL != UL_ALL is needed to unroll small loop at O2.  */
> > +  class loop *loop_father =3D loop_outer (loop);
> > +  if (loop->inner || !loop_father
>
> Do we ever get here for !loop_father?  We shouldn't.
>
> > +      || loop_father->latch == EXIT_BLOCK_PTR_FOR_FN (cfun)
>
> This means you exempt all loops that are direct children of the loop
> root tree.  That doesn't make much sense.
>
> > +      || ul != UL_ALL)
>
> This is also quite odd - we're being more optimistic for UL_NO_GROWTH
> than for UL_ALL?  This doesn't make much sense.
>
> Overall I think this means removal of being optimistic doesn't work so well?
They're mostly there to avoid testcase regressions. The regressed
testcases rely on the loop being completely unrolled by the first
unroll, but now it's only unrolled by the second one.
I checked some of them and the codegen is the same. I still need to go
through all of them; if the final codegen is the same or optimal, I'll
just adjust the testcases?

g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++14 LP64 note (test for warnings, line 56)
g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++14 note (test for warnings, line 66)
g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++17 LP64 note (test for warnings, line 56)
g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++17 note (test for warnings, line 66)
g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++20 LP64 note (test for warnings, line 56)
g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++20 note (test for warnings, line 66)
g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++98 LP64 note (test for warnings, line 56)
g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++98 note (test for warnings, line 66)
gcc: gcc.dg/Warray-bounds-68.c  (test for warnings, line 18)
gcc: gcc.dg/graphite/interchange-8.c execution test
gcc: gcc.dg/tree-prof/update-cunroll-2.c scan-tree-dump-not optimized "Invalid sum"
gcc: gcc.dg/tree-ssa/cunroll-1.c scan-tree-dump cunrolli "Last iteration exit edge was proved true."
gcc: gcc.dg/tree-ssa/cunroll-1.c scan-tree-dump cunrolli "loop with 2 iterations completely unrolled"
gcc: gcc.dg/tree-ssa/dump-6.c scan-tree-dump store-merging "MEM <unsigned long> \\[\\(char \\*\\)\\&a8] = "
gcc: gcc.dg/tree-ssa/loop-36.c scan-tree-dump-not dce3 "c.array"
gcc: gcc.dg/tree-ssa/ssa-dom-cse-5.c scan-tree-dump-times dom2 "return 3;" 1
gcc: gcc.dg/tree-ssa/update-cunroll.c scan-tree-dump-times optimized "Invalid sum" 0
gcc: gcc.dg/tree-ssa/vrp88.c scan-tree-dump vrp1 "Folded into: if.*"
gcc: gcc.dg/vect/no-vfa-vect-dv-2.c scan-tree-dump-times vect "vectorized 3 loops" 1
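
For tests that only need one particular loop unrolled, the #pragma route
suggested for pr69783.c could look roughly like the sketch below. GCC
spells the pragma "#pragma GCC unroll N" and it must sit directly before
the loop. The function sum4 is a made-up example, not code from any of
the tests listed here, and how each unrolling pass honours the request
would still need checking per test.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: sum the first four elements of A.  The pragma
   asks GCC to unroll the loop that follows it 4 times, instead of
   leaving the decision to the size heuristics alone.  */
static int
sum4 (const int *a)
{
  assert (a != NULL);

  int sum = 0;
#pragma GCC unroll 4
  for (int i = 0; i < 4; i++)
    sum += a[i];

  return sum;
}
```

The pragma is only an unrolling request, so the function computes the
same value with or without it.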

>
> If we need some extra leeway for UL_NO_GROWTH for what we expect
> to unroll it might be better to add sth like --param
> nogrowth-completely-peeled-insns
> specifying a fixed surplus size?  Or we need to look at what's the problem
> with the testcases regressing or the one you are trying to fix.
>
> I did experiment with better estimating cleanup done at some point
> (see attached),
> but didn't get to finishing that (and as said, as we're running VN on the result
> we'd ideally do that as part of the estimation somehow).
>
> Richard.
>
> > +    unr_insns = unr_insns * 2 / 3;
> > +
> >    if (unr_insns <= 0)
> >      unr_insns = 1;
> >
> > @@ -837,7 +847,7 @@ try_unroll_loop_completely (class loop *loop,
> >
> >           unsigned HOST_WIDE_INT ninsns = size.overall;
> >           unsigned HOST_WIDE_INT unr_insns
> > -           = estimated_unrolled_size (&size, n_unroll);
> > +           = estimated_unrolled_size (&size, n_unroll, ul, loop);
> >           if (dump_file && (dump_flags & TDF_DETAILS))
> >             {
> >               fprintf (dump_file, "  Loop size: %d\n", (int) ninsns);
> > --
> > 2.31.1
> >


-- 
BR,
Hongtao