Re: [PATCH] Don't reduce estimated unrolled size for innermost loop.

From: Hongtao Liu <crazylht@gmail.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: liuhongt <hongtao.liu@intel.com>, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Don't reduce estimated unrolled size for innermost loop.
Date: Wed, 15 May 2024 10:14:51 +0800
Message-ID: <CAMZc-bw51uL7MaS7-tL+9XHZxFPZjn__fFBYs5ShQcp1D_spkQ@mail.gmail.com>
In-Reply-To: <CAFiYyc3snQb4M=mC8a7Bqx_kDr+104uxp35JnnpDQx4DYQLP1Q@mail.gmail.com>

On Mon, May 13, 2024 at 3:40 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Mon, May 13, 2024 at 4:29 AM liuhongt <hongtao.liu@intel.com> wrote:
> >
> > As the testcase in the PR shows, cunrolli at -O3 may prevent vectorization
> > of the innermost loop and increase register pressure.
> > The patch removes the 1/3 reduction of unr_insns for the innermost loop
> > under UL_ALL.
> > The ul != UL_ALL check is needed because complete unrolling of some small
> > loops at -O2 relies on the reduction.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > No big impact for SPEC2017.
> > Ok for trunk?
>
> This removes the 1/3 reduction when unrolling a loop nest (the case I was
> concerned about).  Unrolling of a nest is done by iterating in
> tree_unroll_loops_completely, so the loop to be unrolled appears innermost.
> So I think you need a new parameter on tree_unroll_loops_completely_1
> indicating whether we're in the first iteration (or whether to assume
> innermost loops will "simplify").
yes, it would be better.
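To make the point about nest iteration concrete, here is a toy model, not GCC code. The struct toy_loop type, its fields, and unroll_nest are all made-up names, and the model assumes a simple chain of loops with at most one inner loop each. It only illustrates why "has no inner loop right now" cannot by itself tell a genuinely innermost loop from an outer loop whose body was already unrolled in an earlier pass.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy loop-tree node.  Names are illustrative, not GCC's class loop.  */
struct toy_loop
{
  struct toy_loop *inner;     /* single child loop, or NULL */
  bool unrolled;              /* already completely unrolled */
  int unrolled_in_pass;       /* pass that unrolled it, 0 if none */
  bool innermost_in_source;   /* true only if handled in pass 1 */
};

/* A loop "looks innermost" once every loop nested inside it is unrolled.  */
static bool
looks_innermost (const struct toy_loop *l)
{
  for (const struct toy_loop *c = l->inner; c; c = c->inner)
    if (!c->unrolled)
      return false;
  return true;
}

/* Unroll a chain of nested loops from the inside out, one level per
   pass, and return the number of passes that unrolled something.  */
int
unroll_nest (struct toy_loop *outermost)
{
  int pass = 0;
  for (;;)
    {
      struct toy_loop *candidate = NULL;
      for (struct toy_loop *l = outermost; l; l = l->inner)
        if (!l->unrolled && looks_innermost (l))
          candidate = l;
      if (!candidate)
        return pass;
      pass++;
      candidate->unrolled = true;
      candidate->unrolled_in_pass = pass;
      /* In this model only the pass number distinguishes a loop that
         was innermost in the source from a former outer loop.  */
      candidate->innermost_in_source = (pass == 1);
    }
}
```

For a three-level chain, the model needs three passes, and the outer two loops each "look innermost" when their turn comes, which is the situation a first-iteration flag would disambiguate.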
>
> A few comments below
>
> > gcc/ChangeLog:
> >
> >         PR tree-optimization/112325
> >         * tree-ssa-loop-ivcanon.cc (estimated_unrolled_size): Add 2
> >         new parameters: loop and ul, and remove unr_insns reduction
> >         for innermost loop.
> >         (try_unroll_loop_completely): Pass loop and ul to
> >         estimated_unrolled_size.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.dg/tree-ssa/pr112325.c: New test.
> >         * gcc.dg/vect/pr69783.c: Add extra option --param
> >         max-completely-peeled-insns=300.
> > ---
> >  gcc/testsuite/gcc.dg/tree-ssa/pr112325.c | 57 ++++++++++++++++++++++++
> >  gcc/testsuite/gcc.dg/vect/pr69783.c      |  2 +-
> >  gcc/tree-ssa-loop-ivcanon.cc             | 16 +++++--
> >  3 files changed, 71 insertions(+), 4 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c b/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
> > new file mode 100644
> > index 00000000000..14208b3e7f8
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
> > @@ -0,0 +1,57 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -fdump-tree-cunrolli-details" } */
> > +
> > +typedef unsigned short ggml_fp16_t;
> > +static float table_f32_f16[1 << 16];
> > +
> > +inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
> > +    unsigned short s;
> > +    __builtin_memcpy(&s, &f, sizeof(unsigned short));
> > +    return table_f32_f16[s];
> > +}
> > +
> > +typedef struct {
> > +    ggml_fp16_t d;
> > +    ggml_fp16_t m;
> > +    unsigned char qh[4];
> > +    unsigned char qs[32 / 2];
> > +} block_q5_1;
> > +
> > +typedef struct {
> > +    float d;
> > +    float s;
> > +    char qs[32];
> > +} block_q8_1;
> > +
> > +void ggml_vec_dot_q5_1_q8_1(const int n, float * restrict s, const void * restrict vx, const void * restrict vy) {
> > +    const int qk = 32;
> > +    const int nb = n / qk;
> > +
> > +    const block_q5_1 * restrict x = vx;
> > +    const block_q8_1 * restrict y = vy;
> > +
> > +    float sumf = 0.0;
> > +
> > +    for (int i = 0; i < nb; i++) {
> > +        unsigned qh;
> > +        __builtin_memcpy(&qh, x[i].qh, sizeof(qh));
> > +
> > +        int sumi = 0;
> > +
> > +        for (int j = 0; j < qk/2; ++j) {
> > +            const unsigned char xh_0 = ((qh >> (j + 0)) << 4) & 0x10;
> > +            const unsigned char xh_1 = ((qh >> (j + 12)) ) & 0x10;
> > +
> > +            const int x0 = (x[i].qs[j] & 0xF) | xh_0;
> > +            const int x1 = (x[i].qs[j] >> 4) | xh_1;
> > +
> > +            sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
> > +        }
> > +
> > +        sumf += (ggml_lookup_fp16_to_fp32(x[i].d)*y[i].d)*sumi + ggml_lookup_fp16_to_fp32(x[i].m)*y[i].s;
> > +    }
> > +
> > +    *s = sumf;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump {(?n)Not unrolling loop [1-9] \(--param max-completely-peel-times limit reached} "cunrolli"} } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/pr69783.c b/gcc/testsuite/gcc.dg/vect/pr69783.c
> > index 5df95d0ce4e..a1f75514d72 100644
> > --- a/gcc/testsuite/gcc.dg/vect/pr69783.c
> > +++ b/gcc/testsuite/gcc.dg/vect/pr69783.c
> > @@ -1,6 +1,6 @@
> >  /* { dg-do compile } */
> >  /* { dg-require-effective-target vect_float } */
> > -/* { dg-additional-options "-Ofast -funroll-loops" } */
> > +/* { dg-additional-options "-Ofast -funroll-loops --param max-completely-peeled-insns=300" } */
>
> If we rely on unrolling of a loop, can you put #pragma GCC unroll [N]
> before the respective loop instead?
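For reference, GCC spells this directive `#pragma GCC unroll N` and expects it immediately before the loop it applies to. A minimal sketch follows; the function name sum_rows and the array sizes are made up for illustration and are not taken from pr69783.c.

```c
#include <stddef.h>

enum { ROWS = 4, COLS = 8 };  /* illustrative sizes, not from the testcase */

/* Sum a small fixed-size matrix.  The pragma requests an unroll factor
   of 8 for the inner loop only; it does not apply to the outer loop.  */
float
sum_rows (const float m[ROWS][COLS])
{
  float total = 0.0f;
  for (size_t i = 0; i < ROWS; i++)
    {
#pragma GCC unroll 8
      for (size_t j = 0; j < COLS; j++)
        total += m[i][j];
    }
  return total;
}
```

Putting the request in the source ties the test to the loop it cares about instead of to a global size threshold.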
>
> >  #define NXX 516
> >  #define NYY 516
> > diff --git a/gcc/tree-ssa-loop-ivcanon.cc b/gcc/tree-ssa-loop-ivcanon.cc
> > index bf017137260..5e0eca647a1 100644
> > --- a/gcc/tree-ssa-loop-ivcanon.cc
> > +++ b/gcc/tree-ssa-loop-ivcanon.cc
> > @@ -444,7 +444,9 @@ tree_estimate_loop_size (class loop *loop, edge exit, edge edge_to_cancel,
> >
> >  static unsigned HOST_WIDE_INT
> >  estimated_unrolled_size (struct loop_size *size,
> > -                        unsigned HOST_WIDE_INT nunroll)
> > +                        unsigned HOST_WIDE_INT nunroll,
> > +                        enum unroll_level ul,
> > +                        class loop* loop)
> >  {
> >    HOST_WIDE_INT unr_insns = ((nunroll)
> >                              * (HOST_WIDE_INT) (size->overall
> > @@ -453,7 +455,15 @@ estimated_unrolled_size (struct loop_size *size,
> >      unr_insns = 0;
> >    unr_insns += size->last_iteration - size->last_iteration_eliminated_by_peeling;
> >
> > -  unr_insns = unr_insns * 2 / 3;
> > +  /* For the innermost loop, the loop body is not likely to be simplified
> > +     by as much as 1/3, and unrolling may increase register pressure a lot.
> > +     The UL != UL_ALL check is needed to unroll small loops at -O2.  */
> > +  class loop *loop_father = loop_outer (loop);
> > +  if (loop->inner || !loop_father
>
> Do we ever get here for !loop_father?  We shouldn't.
>
> > +      || loop_father->latch == EXIT_BLOCK_PTR_FOR_FN (cfun)
>
> This means you exempt all loops that are direct children of the loop
> tree root.  That doesn't make much sense.
>
> > +      || ul != UL_ALL)
>
> This is also quite odd - we're being more optimistic for UL_NO_GROWTH
> than for UL_ALL?  This doesn't make much sense.
>
> Overall, I think this means that removing the optimism doesn't work so well?
They're mostly there to avoid testcase regressions.  The regressed
testcases rely on complete unrolling happening in the first unroll,
but now the loop is only unrolled by the second unroll.
I checked some of them and the codegen is the same.  I need to go through
all of them; if the final codegen is the same or optimal, I'll just adjust
the testcases?

g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++14 LP64 note (test for warnings, line 56)
g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++14 note (test for warnings, line 66)
g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++17 LP64 note (test for warnings, line 56)
g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++17 note (test for warnings, line 66)
g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++20 LP64 note (test for warnings, line 56)
g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++20 note (test for warnings, line 66)
g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++98 LP64 note (test for warnings, line 56)
g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++98 note (test for warnings, line 66)
gcc: gcc.dg/Warray-bounds-68.c  (test for warnings, line 18)
gcc: gcc.dg/graphite/interchange-8.c execution test
gcc: gcc.dg/tree-prof/update-cunroll-2.c scan-tree-dump-not optimized "Invalid sum"
gcc: gcc.dg/tree-ssa/cunroll-1.c scan-tree-dump cunrolli "Last iteration exit edge was proved true."
gcc: gcc.dg/tree-ssa/cunroll-1.c scan-tree-dump cunrolli "loop with 2 iterations completely unrolled"
gcc: gcc.dg/tree-ssa/dump-6.c scan-tree-dump store-merging "MEM <unsigned long> \\[\\(char \\*\\)\\&a8] = "
gcc: gcc.dg/tree-ssa/loop-36.c scan-tree-dump-not dce3 "c.array"
gcc: gcc.dg/tree-ssa/ssa-dom-cse-5.c scan-tree-dump-times dom2 "return 3;" 1
gcc: gcc.dg/tree-ssa/update-cunroll.c scan-tree-dump-times optimized "Invalid sum" 0
gcc: gcc.dg/tree-ssa/vrp88.c scan-tree-dump vrp1 "Folded into: if.*"
gcc: gcc.dg/vect/no-vfa-vect-dv-2.c scan-tree-dump-times vect "vectorized 3 loops" 1

>
> If we need some extra leeway for UL_NO_GROWTH for what we expect
> to unroll, it might be better to add something like --param
> nogrowth-completely-peeled-insns specifying a fixed surplus size?
> Or we need to look at what the problem is with the regressing
> testcases or with the one you are trying to fix.
>
> At some point I experimented with better estimating the cleanup that is
> done (see attached), but didn't get around to finishing that (and, as
> said, since we're running VN on the result, we'd ideally do that as part
> of the estimation somehow).
>
> Richard.
>
> > +    unr_insns = unr_insns * 2 / 3;
> > +
> >    if (unr_insns <= 0)
> >      unr_insns = 1;
> >
> > @@ -837,7 +847,7 @@ try_unroll_loop_completely (class loop *loop,
> >
> >           unsigned HOST_WIDE_INT ninsns = size.overall;
> >           unsigned HOST_WIDE_INT unr_insns
> > -           = estimated_unrolled_size (&size, n_unroll);
> > +           = estimated_unrolled_size (&size, n_unroll, ul, loop);
> >           if (dump_file && (dump_flags & TDF_DETAILS))
> >             {
> >               fprintf (dump_file, "  Loop size: %d\n", (int) ninsns);
> > --
> > 2.31.1
> >
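The arithmetic under discussion can be sketched outside GCC as follows. This is a simplified model under stated assumptions, not the code in tree-ssa-loop-ivcanon.cc: struct toy_loop_size and toy_estimated_unrolled_size are made-up names, plain long stands in for HOST_WIDE_INT, and body_eliminated is a hypothetical placeholder for whatever the elided part of the quoted hunk subtracts from size->overall. The boolean apply_discount stands in for the condition the patch adds.

```c
#include <stdbool.h>

/* Stand-in for GCC's struct loop_size.  body_eliminated is a
   hypothetical name; the quoted hunk does not show that term.  */
struct toy_loop_size
{
  long overall;
  long body_eliminated;
  long last_iteration;
  long last_iteration_eliminated_by_peeling;
};

/* Estimate the instruction count after unrolling NUNROLL times.
   APPLY_DISCOUNT selects the optimistic 2/3 scaling that the patch
   makes conditional.  */
long
toy_estimated_unrolled_size (const struct toy_loop_size *size,
                             long nunroll, bool apply_discount)
{
  long unr_insns = nunroll * (size->overall - size->body_eliminated);
  unr_insns += size->last_iteration
               - size->last_iteration_eliminated_by_peeling;

  if (apply_discount)
    unr_insns = unr_insns * 2 / 3;

  /* Never report an empty body.  */
  if (unr_insns <= 0)
    unr_insns = 1;
  return unr_insns;
}
```

With made-up numbers (overall 20, body_eliminated 2, last_iteration 10, last_iteration_eliminated_by_peeling 4, nunroll 15), the estimate is 276 without the discount and 184 with it, so any size limit between those two values flips the unrolling decision.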



-- 
BR,
Hongtao

Thread overview: 12+ messages
2024-05-13  2:27 liuhongt
2024-05-13  7:40 ` Richard Biener
2024-05-15  2:14   ` Hongtao Liu [this message]
2024-05-15  9:24     ` Richard Biener
2024-05-15  9:49       ` Hongtao Liu
2024-05-21  2:35       ` Hongtao Liu
2024-05-21  6:14         ` Richard Biener
2024-05-22  5:07           ` [V2 PATCH] Don't reduce estimated unrolled size for innermost loop at cunrolli liuhongt
2024-05-23  1:55             ` Hongtao Liu
2024-05-23 11:59             ` Richard Biener
2024-05-24  7:29               ` [V3 PATCH] Don't reduce estimated unrolled size for innermost loop liuhongt
2024-05-29 11:22                 ` Richard Biener
