From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <crazylht@gmail.com>
MIME-Version: 1.0
References: <20221102033728.99379-1-hongyu.wang@intel.com> <CAFiYyc37s4Aoo6PmL5EAv0r_-UwtQ5tPbFZBTF_96==YQMUpDw@mail.gmail.com>
 <CAMZc-bxMH43VHD8oP1iN4DH=bN1-NXxpydTc4EGdQwfxwZjJPQ@mail.gmail.com> <CA+OydWnjXU0y4Xp4vxqXXUh5C4rn2kdu4ZNuGvXCUeqjoH2P_w@mail.gmail.com>
In-Reply-To: <CA+OydWnjXU0y4Xp4vxqXXUh5C4rn2kdu4ZNuGvXCUeqjoH2P_w@mail.gmail.com>
From: Hongtao Liu <crazylht@gmail.com>
Date: Mon, 14 Nov 2022 09:35:45 +0800
Message-ID: <CAMZc-bzNEyfermChHcf8DXF91KosuicAX7crn1yuyWfZ8YCOHw@mail.gmail.com>
Subject: Re: [PATCH V2] Enable small loop unrolling for O2
To: Hongyu Wang <wwwhhhyyy333@gmail.com>
Cc: Richard Biener <richard.guenther@gmail.com>, Hongyu Wang <hongyu.wang@intel.com>, 
	gcc-patches@gcc.gnu.org, ubizjak@gmail.com, hongtao.liu@intel.com
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
List-Id: <gcc-patches.gcc.gnu.org>

On Wed, Nov 9, 2022 at 9:29 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote:
>
> > Although ix86_small_unroll_ninsns is derived from issue_rate, it's tuned
> > for code size.
> > Making it exactly issue_rate and using factor * issue_width /
> > loop->ninsns may increase code size too much.
> > So I prefer to add those 2 parameters to the cost table for core
> > tunings instead of 1.
>
> Yes, here is the updated patch that changes the cost table.
>
> Bootstrapped & regrtested on x86_64-pc-linux-gnu.
>
> Ok for trunk?
OK. Note that the GCC documentation has been ported to Sphinx, so you need to
move the invoke.texi changes to the new Sphinx files.
>
> On Tue, Nov 8, 2022 at 11:05, Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> >
> > On Mon, Nov 7, 2022 at 10:25 PM Richard Biener via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > > On Wed, Nov 2, 2022 at 4:37 AM Hongyu Wang <hongyu.wang@intel.com> wrote:
> > > >
> > > > Hi, this is the updated patch of
> > > > https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604345.html,
> > > > > which uses targetm.loop_unroll_adjust as gate to enable small loop unroll.
> > > > >
> > > > > This patch does not change rs6000/s390 since I don't have machine to
> > > > > test them, but I suppose the default behavior is the same since they
> > > > > enable flag_unroll_loops at O2.
> > > >
> > > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > >
> > > > Ok for trunk?
> > > >
> > > > ---------- Patch content --------
> > > >
> > > > > Modern processors have multi-way instruction decoders.
> > > > > For x86, icelake/zen3 have a 5-uop decode width, so for a small loop
> > > > > with <= 4 instructions (usually 3 uops, with a cmp/jmp pair that can be
> > > > > macro-fused), the decoder has a 2-uop bubble in each iteration
> > > > > and the pipeline cannot be fully utilized.
> > > > >
> > > > > Therefore, this patch enables loop unrolling for small loops at O2
> > > > > to fill the decoder as much as possible. It turns on rtl loop
> > > > > unrolling when targetm.loop_unroll_adjust exists, at O2 and above,
> > > > > and only when optimizing for speed.
> > > > > In the x86 backend the default behavior is to unroll small loops with
> > > > > at most 4 insns once (unroll factor 2).
> > > > >
> > > > > This improves 548.exchange2 by 9% on icelake and 7.4% on zen3 with
> > > > > a 0.9% code size increase. For other benchmarks the variations are minor
> > > > > and overall code size increased by 0.2%.
> > > > >
> > > > > The kernel image size increased by 0.06%, and there is no impact on eembc.
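To make the rationale above concrete, here is a minimal hand-written sketch of
what unrolling a small loop by a factor of 2 ("unrolled once") amounts to at
the source level. It is only an illustration: the RTL unroller works on insns
rather than C, and the function names sum_rolled and sum_unrolled2 are made up
for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Original small loop: one add plus the compare/branch pair per trip.  */
long
sum_rolled (const int *a, size_t n)
{
  long s = 0;
  for (size_t i = 0; i < n; i++)
    s += a[i];
  return s;
}

/* Same loop unrolled by a factor of 2 (hypothetical hand-written form):
   the body is duplicated, so each trip does two adds per compare/branch
   pair.  A possible leftover element is handled after the loop.  */
long
sum_unrolled2 (const int *a, size_t n)
{
  assert (a != NULL || n == 0);
  long s = 0;
  size_t i = 0;
  for (; i + 1 < n; i += 2)
    {
      s += a[i];
      s += a[i + 1];
    }
  if (i < n)
    s += a[i];
  return s;
}
```

Both functions return the same sum; the unrolled one executes roughly half as
many compare/branch pairs, at the cost of a larger loop body.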
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > >         * common/config/i386/i386-common.cc (ix86_optimization_table):
> > > > >         Enable small loop unroll at O2 by default.
> > > > >         * config/i386/i386.cc (ix86_loop_unroll_adjust): Adjust unroll
> > > > >         factor if -munroll-only-small-loops enabled and -funroll-loops/
> > > > >         -funroll-all-loops are disabled.
> > > > >         * config/i386/i386.opt: Add -munroll-only-small-loops,
> > > > >         -param=x86-small-unroll-ninsns= for loop insn limit,
> > > > >         -param=x86-small-unroll-factor= for unroll factor.
> > > >         * doc/invoke.texi: Document -munroll-only-small-loops,
> > > >         x86-small-unroll-ninsns and x86-small-unroll-factor.
> > > >         * loop-init.cc (pass_rtl_unroll_loops::gate): Enable rtl
> > > >         loop unrolling for -O2-speed and above if target hook
> > > >         loop_unroll_adjust exists.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >         * gcc.dg/guality/loop-1.c: Add additional option
> > > >           -mno-unroll-only-small-loops.
> > > > >         * gcc.target/i386/pr86270.c: Add -mno-unroll-only-small-loops.
> > > >         * gcc.target/i386/pr93002.c: Likewise.
> > > > ---
> > > >  gcc/common/config/i386/i386-common.cc   |  1 +
> > > >  gcc/config/i386/i386.cc                 | 18 ++++++++++++++++++
> > > >  gcc/config/i386/i386.opt                | 13 +++++++++++++
> > > >  gcc/doc/invoke.texi                     | 16 ++++++++++++++++
> > > >  gcc/loop-init.cc                        | 10 +++++++---
> > > >  gcc/testsuite/gcc.dg/guality/loop-1.c   |  2 ++
> > > >  gcc/testsuite/gcc.target/i386/pr86270.c |  2 +-
> > > >  gcc/testsuite/gcc.target/i386/pr93002.c |  2 +-
> > > >  8 files changed, 59 insertions(+), 5 deletions(-)
> > > >
> > > > > diff --git a/gcc/common/config/i386/i386-common.cc b/gcc/common/config/i386/i386-common.cc
> > > > > index f66bdd5a2af..c6891486078 100644
> > > > > --- a/gcc/common/config/i386/i386-common.cc
> > > > > +++ b/gcc/common/config/i386/i386-common.cc
> > > > > @@ -1724,6 +1724,7 @@ static const struct default_options ix86_option_optimization_table[] =
> > > > >      /* The STC algorithm produces the smallest code at -Os, for x86.  */
> > > > >      { OPT_LEVELS_2_PLUS, OPT_freorder_blocks_algorithm_, NULL,
> > > > >        REORDER_BLOCKS_ALGORITHM_STC },
> > > > > +    { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_munroll_only_small_loops, NULL, 1 },
> > > >      /* Turn off -fschedule-insns by default.  It tends to make the
> > > >         problem with not enough registers even worse.  */
> > > >      { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
> > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > index c0f37149ed0..0f94a3b609e 100644
> > > > --- a/gcc/config/i386/i386.cc
> > > > +++ b/gcc/config/i386/i386.cc
> > > > @@ -23827,6 +23827,24 @@ ix86_loop_unroll_adjust (unsigned nunroll,=
 class loop *loop)
> > > >    unsigned i;
> > > >    unsigned mem_count =3D 0;
> > > >
> > > > +  /* Unroll small size loop when unroll factor is not explicitly
> > > > +     specified.  */
> > > > +  if (!(flag_unroll_loops
> > > > +       || flag_unroll_all_loops
> > > > +       || loop->unroll))
> > > > +    {
> > > > > +      nunroll = 1;
> > > > > +
> > > > > +      /* Any explicit -f{no-}unroll-{all-}loops turns off
> > > > > +        -munroll-only-small-loops.  */
> > > > > +      if (ix86_unroll_only_small_loops
> > > > > +         && !OPTION_SET_P (flag_unroll_loops))
> > > > > +       if (loop->ninsns <= (unsigned) ix86_small_unroll_ninsns)
> > >
> > > either add braces or combine the two if's
> > >
> > > > Otherwise the middle-end changes look OK.  The target maintainers need to decide
> > > > whether the two --params should be core tunings instead - I would assume that
> > > > given your rationale the decode and issue widths of the core plays an important
> > > > role here.  That might also suggest a single parameter instead and unrolling
> > > > (factor * issue_width) / loop->ninsns times instead of a static unroll_factor?
> > > Although ix86_small_unroll_ninsns is derived from issue_rate, it's tuned
> > > for code size.
> > > Making it exactly issue_rate and using factor * issue_width /
> > > loop->ninsns may increase code size too much.
> > > So I prefer to add those 2 parameters to the cost table for core
> > > tunings instead of 1.
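A rough worked comparison of the two schemes discussed above, assuming an
issue width of 4 and a factor of 2 purely for illustration (neither value is
taken from a real cost table). The helper names and the clamp to a minimum of
1 are assumptions of this sketch, not part of the patch.

```c
#include <assert.h>

/* Static scheme as in the patch: a fixed factor for loops at or below
   the insn limit, otherwise 1 (no unrolling).  */
unsigned
static_unroll (unsigned ninsns, unsigned limit, unsigned factor)
{
  assert (factor >= 1);
  return ninsns <= limit ? factor : 1;
}

/* Scaled scheme from the review suggestion:
   (factor * issue_width) / ninsns.  Clamping the result to at least 1
   is an assumption made here so the value is always a usable factor.  */
unsigned
scaled_unroll (unsigned ninsns, unsigned issue_width, unsigned factor)
{
  assert (ninsns >= 1);
  unsigned n = (factor * issue_width) / ninsns;
  return n ? n : 1;
}
```

With these assumed values, a 1-insn loop gets static_unroll (1, 4, 2) == 2
copies of the body but scaled_unroll (1, 4, 2) == 8 copies, and a 2-insn loop
gets 2 versus 4, which is the code size concern.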
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > > +         nunroll = (unsigned) ix86_small_unroll_factor;
> > > > +
> > > > +      return nunroll;
> > > > +    }
> > > > +
> > > >    if (!TARGET_ADJUST_UNROLL)
> > > >       return nunroll;
> > > >
> > > > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > > > index 53d534f6392..6da9c8d670d 100644
> > > > --- a/gcc/config/i386/i386.opt
> > > > +++ b/gcc/config/i386/i386.opt
> > > > @@ -1224,3 +1224,16 @@ mavxvnniint8
> > > >  Target Mask(ISA2_AVXVNNIINT8) Var(ix86_isa_flags2) Save
> > > >  Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and
> > > >  AVXVNNIINT8 built-in functions and code generation.
> > > > +
> > > > +munroll-only-small-loops
> > > > +Target Var(ix86_unroll_only_small_loops) Init(0) Save
> > > > +Enable conservative small loop unrolling.
> > > > +
> > > > > +-param=x86-small-unroll-ninsns=
> > > > +Target Joined UInteger Var(ix86_small_unroll_ninsns) Init(4) Param
> > > > +Insturctions number limit for loop to be unrolled under
> > > > +-munroll-only-small-loops.
> > > > +
> > > > > +-param=x86-small-unroll-factor=
> > > > +Target Joined UInteger Var(ix86_small_unroll_factor) Init(2) Param
> > > > +Unroll factor for -munroll-only-small-loops.
> > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > > index 550aec87809..487218bd0ce 100644
> > > > --- a/gcc/doc/invoke.texi
> > > > +++ b/gcc/doc/invoke.texi
> > > > @@ -15821,6 +15821,14 @@ The following choices of @var{name} are av=
ailable on i386 and x86_64 targets:
> > > >  @item x86-stlf-window-ninsns
> > > >  Instructions number above which STFL stall penalty can be compensa=
ted.
> > > >
> > > > +@item x86-small-unroll-ninsns
> > > > +If -munroll-only-small-loops is enabled, only unroll loops with in=
struction
> > > > +count less than this parameter. The default value is 4.
> > > > +
> > > > +@item x86-small-unroll-factor
> > > > +If -munroll-only-small-loops is enabled, reset the unroll factor w=
ith this
> > > > +value. The default value is 2 which means the loop will be unrolle=
d once.
> > > > +
> > > >  @end table
> > > >
> > > >  @end table
> > > > @@ -25232,6 +25240,14 @@ environments where no dynamic link is perf=
ormed, like firmwares, OS
> > > >  kernels, executables linked with @option{-static} or @option{-stat=
ic-pie}.
> > > >  @option{-mdirect-extern-access} is not compatible with @option{-fP=
IC} or
> > > >  @option{-fpic}.
> > > > +
> > > > +@item -munroll-only-small-loops
> > > > +@itemx -mno-unroll-only-small-loops
> > > > +@opindex munroll-only-small-loops
> > > > +Controls conservative small loop unrolling. It is default enbaled =
by
> > > > +O2, and unrolls loop with less than 4 insns by 1 time. Explicit
> > > > +-f[no-]unroll-[all-]loops would disable this flag to avoid any
> > > > +unintended unrolling behavior that user does not want.
> > > >  @end table
> > > >
> > > >  @node M32C Options
> > > > diff --git a/gcc/loop-init.cc b/gcc/loop-init.cc
> > > > index b9e07973dd6..9789efa1e11 100644
> > > > --- a/gcc/loop-init.cc
> > > > +++ b/gcc/loop-init.cc
> > > > @@ -565,9 +565,12 @@ public:
> > > >    {}
> > > >
> > > >    /* opt_pass methods: */
> > > > -  bool gate (function *) final override
> > > > +  bool gate (function *fun) final override
> > > >      {
> > > > > -      return (flag_unroll_loops || flag_unroll_all_loops || cfun->has_unroll);
> > > > > +      return (flag_unroll_loops || flag_unroll_all_loops || cfun->has_unroll
> > > > > +             || (targetm.loop_unroll_adjust
> > > > > +                 && optimize >= 2
> > > > +                 && optimize_function_for_speed_p (fun)));
> > > >      }
> > > >
> > > >    unsigned int execute (function *) final override;
> > > > @@ -583,7 +586,8 @@ pass_rtl_unroll_loops::execute (function *fun)
> > > >        if (dump_file)
> > > >         df_dump (dump_file);
> > > >
> > > > -      if (flag_unroll_loops)
> > > > +      if (flag_unroll_loops
> > > > +         || targetm.loop_unroll_adjust)
> > > > >         flags |= UAP_UNROLL;
> > > > >        if (flag_unroll_all_loops)
> > > > >         flags |= UAP_UNROLL_ALL;
> > > > > diff --git a/gcc/testsuite/gcc.dg/guality/loop-1.c b/gcc/testsuite/gcc.dg/guality/loop-1.c
> > > > index 1b1f6d32271..a32ea445a3f 100644
> > > > --- a/gcc/testsuite/gcc.dg/guality/loop-1.c
> > > > +++ b/gcc/testsuite/gcc.dg/guality/loop-1.c
> > > > @@ -1,5 +1,7 @@
> > > >  /* { dg-do run } */
> > > >  /* { dg-options "-fno-tree-scev-cprop -fno-tree-vectorize -g" } */
> > > > > +/* { dg-additional-options "-mno-unroll-only-small-loops" { target ia32 } } */
> > > > +
> > > >
> > > >  #include "../nop.h"
> > > >
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr86270.c b/gcc/testsuite/gcc.target/i386/pr86270.c
> > > > index 81841ef5bd7..cbc9fbb0450 100644
> > > > --- a/gcc/testsuite/gcc.target/i386/pr86270.c
> > > > +++ b/gcc/testsuite/gcc.target/i386/pr86270.c
> > > > @@ -1,5 +1,5 @@
> > > >  /* { dg-do compile } */
> > > > -/* { dg-options "-O2" } */
> > > > +/* { dg-options "-O2 -mno-unroll-only-small-loops" } */
> > > >
> > > >  int *a;
> > > >  long len;
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr93002.c b/gcc/testsuite/gcc.target/i386/pr93002.c
> > > > index 0248fcc00a5..f75a847f75d 100644
> > > > --- a/gcc/testsuite/gcc.target/i386/pr93002.c
> > > > +++ b/gcc/testsuite/gcc.target/i386/pr93002.c
> > > > @@ -1,6 +1,6 @@
> > > >  /* PR target/93002 */
> > > >  /* { dg-do compile } */
> > > > -/* { dg-options "-O2" } */
> > > > +/* { dg-options "-O2 -mno-unroll-only-small-loops" } */
> > > >  /* { dg-final { scan-assembler-not "cmp\[^\n\r]*-1" } } */
> > > >
> > > >  volatile int sink;
> > > > --
> > > > 2.18.1
> > > >
> >
> >
> >
> > --
> > BR,
> > Hongtao


-- 
BR,
Hongtao