From: Richard Biener
Date: Wed, 4 Aug 2021 13:31:42 +0200
Subject: Re: [PATCH 2/2] Rewrite more vector loads to scalar loads
To: Richard Biener
Cc: GCC Patches, Richard Sandiford, Hongtao Liu

On Mon, Aug 2, 2021 at 3:41 PM Richard Biener wrote:
>
> This teaches forwprop to rewrite more vector loads that are only
> used in BIT_FIELD_REFs as scalar loads.  This provides the
> remaining uplift to SPEC CPU 2017 510.parest_r on Zen 2, which
> has CPU gathers disabled.
>
> In particular, vector load + vec_unpack + bit-field-ref is turned
> into (extending) scalar loads, which avoids costly XMM/GPR
> transitions.  To not conflict with matching vector load +
> bit-field-ref + vector constructor to vector load + shuffle, the
> extended transform is only done after vector lowering.
>
> Overall the two patches provide a 22% speedup of 510.parest_r.
>
> I'm in the process of confirming speedups of 500.perlbench_r,
> 557.xz_r, 549.fotonik3d_r and 554.roms_r as well as slowdowns
> of 503.bwaves_r, 507.cactuBSSN_r and 538.imagick_r.

I have confirmed the 500.perlbench_r and 557.xz_r speedups; the
554.roms_r result was noise, as were the 503.bwaves_r and
507.cactuBSSN_r slowdowns.  The 538.imagick_r slowdown is real, but it
does not reproduce with -flto, and analyzing it does not show any
effect of the two patches on the code pointed to by perf.

I've now pushed [2/2] first because that makes more sense and thus its
effect can be assessed independently.

Richard.
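For reference, the kind of loop this targets looks roughly like the
vect-gather-1.c testcase amended below.  The snippet here is only an
illustrative sketch with INDEXTYPE spelled out as int; parameter names
beyond those visible in the hunk context are made up.

  /* An indexed (gather-like) access.  When vectorized with emulated
     gathers at -Ofast -msse2, the int index vector is loaded as a
     vector, widened via vec_unpack_{lo,hi} and the individual indices
     extracted with BIT_FIELD_REFs; the patch rewrites that into
     (extending) scalar loads of the indices instead.  */
  double
  vmul (int *rowstart, int *rowend, double *luval, double *dst)
  {
    double res = 0;
    for (int *col = rowstart; col != rowend; ++col, ++luval)
      res += *luval * dst[*col];
    return res;
  }

That is what the forwprop4 scan in the adjusted testcase checks for:
once the index loads are scalar, no vec_unpack remains in the dump.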
> 2021-07-30  Richard Biener
>
>         * tree-ssa-forwprop.c (pass_forwprop::execute): Split
>         out code to decompose vector loads ...
>         (optimize_vector_load): ... here.  Generalize it to
>         handle intermediate widening and TARGET_MEM_REF loads
>         and apply it to loads with a supported vector mode as well.
>
>         * gcc.target/i386/vect-gather-1.c: Amend.
> ---
>  gcc/testsuite/gcc.target/i386/vect-gather-1.c |   4 +-
>  gcc/tree-ssa-forwprop.c                       | 244 +++++++++++++-----
>  2 files changed, 185 insertions(+), 63 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/i386/vect-gather-1.c b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
> index 134aef39666..261b66be061 100644
> --- a/gcc/testsuite/gcc.target/i386/vect-gather-1.c
> +++ b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-Ofast -msse2 -fdump-tree-vect-details" } */
> +/* { dg-options "-Ofast -msse2 -fdump-tree-vect-details -fdump-tree-forwprop4" } */
>
>  #ifndef INDEXTYPE
>  #define INDEXTYPE int
> @@ -16,3 +16,5 @@ double vmul(INDEXTYPE *rowstart, INDEXTYPE *rowend,
>  /* With gather emulation this should be profitable to vectorize
>     even with plain SSE2.  */
>  /* { dg-final { scan-tree-dump "loop vectorized" "vect" } } */
> +/* The index vector loads and promotions should be scalar after forwprop.  */
> +/* { dg-final { scan-tree-dump-not "vec_unpack" "forwprop4" } } */
> diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
> index db3b18b275c..bd64b8e46bc 100644
> --- a/gcc/tree-ssa-forwprop.c
> +++ b/gcc/tree-ssa-forwprop.c
> @@ -2757,6 +2757,182 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi)
>  }
>
>
> +/* Rewrite the vector load at *GSI to component-wise loads if the load
> +   is only used in BIT_FIELD_REF extractions with eventual intermediate
> +   widening.  */
> +
> +static void
> +optimize_vector_load (gimple_stmt_iterator *gsi)
> +{
> +  gimple *stmt = gsi_stmt (*gsi);
> +  tree lhs = gimple_assign_lhs (stmt);
> +  tree rhs = gimple_assign_rhs1 (stmt);
> +
> +  /* Gather BIT_FIELD_REFs to rewrite, looking through
> +     VEC_UNPACK_{LO,HI}_EXPR.  */
> +  use_operand_p use_p;
> +  imm_use_iterator iter;
> +  bool rewrite = true;
> +  auto_vec<gimple *, 8> bf_stmts;
> +  auto_vec<tree, 8> worklist;
> +  worklist.quick_push (lhs);
> +  do
> +    {
> +      tree def = worklist.pop ();
> +      unsigned HOST_WIDE_INT def_eltsize
> +        = TREE_INT_CST_LOW (TYPE_SIZE (TREE_TYPE (TREE_TYPE (def))));
> +      FOR_EACH_IMM_USE_FAST (use_p, iter, def)
> +        {
> +          gimple *use_stmt = USE_STMT (use_p);
> +          if (is_gimple_debug (use_stmt))
> +            continue;
> +          if (!is_gimple_assign (use_stmt))
> +            {
> +              rewrite = false;
> +              break;
> +            }
> +          enum tree_code use_code = gimple_assign_rhs_code (use_stmt);
> +          tree use_rhs = gimple_assign_rhs1 (use_stmt);
> +          if (use_code == BIT_FIELD_REF
> +              && TREE_OPERAND (use_rhs, 0) == def
> +              /* If its on the VEC_UNPACK_{HI,LO}_EXPR
> +                 def need to verify it is element aligned.  */
> +              && (def == lhs
> +                  || (known_eq (bit_field_size (use_rhs), def_eltsize)
> +                      && constant_multiple_p (bit_field_offset (use_rhs),
> +                                              def_eltsize))))
> +            {
> +              bf_stmts.safe_push (use_stmt);
> +              continue;
> +            }
> +          /* Walk through one level of VEC_UNPACK_{LO,HI}_EXPR.  */
> +          if (def == lhs
> +              && (use_code == VEC_UNPACK_HI_EXPR
> +                  || use_code == VEC_UNPACK_LO_EXPR)
> +              && use_rhs == lhs)
> +            {
> +              worklist.safe_push (gimple_assign_lhs (use_stmt));
> +              continue;
> +            }
> +          rewrite = false;
> +          break;
> +        }
> +      if (!rewrite)
> +        break;
> +    }
> +  while (!worklist.is_empty ());
> +
> +  if (!rewrite)
> +    {
> +      gsi_next (gsi);
> +      return;
> +    }
> +  /* We now have all ultimate uses of the load to rewrite in bf_stmts.  */
> +
> +  /* Prepare the original ref to be wrapped in adjusted BIT_FIELD_REFs.
> +     For TARGET_MEM_REFs we have to separate the LEA from the reference.  */
> +  tree load_rhs = rhs;
> +  if (TREE_CODE (load_rhs) == TARGET_MEM_REF)
> +    {
> +      if (TREE_CODE (TREE_OPERAND (load_rhs, 0)) == ADDR_EXPR)
> +        mark_addressable (TREE_OPERAND (TREE_OPERAND (load_rhs, 0), 0));
> +      tree tem = make_ssa_name (TREE_TYPE (TREE_OPERAND (load_rhs, 0)));
> +      gimple *new_stmt
> +        = gimple_build_assign (tem, build1 (ADDR_EXPR, TREE_TYPE (tem),
> +                                            unshare_expr (load_rhs)));
> +      gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
> +      load_rhs = build2_loc (EXPR_LOCATION (load_rhs),
> +                             MEM_REF, TREE_TYPE (load_rhs), tem,
> +                             build_int_cst
> +                               (TREE_TYPE (TREE_OPERAND (load_rhs, 1)), 0));
> +    }
> +
> +  /* Rewrite the BIT_FIELD_REFs to be actual loads, re-emitting them at
> +     the place of the original load.  */
> +  for (gimple *use_stmt : bf_stmts)
> +    {
> +      tree bfr = gimple_assign_rhs1 (use_stmt);
> +      tree new_rhs = unshare_expr (load_rhs);
> +      if (TREE_OPERAND (bfr, 0) != lhs)
> +        {
> +          /* When the BIT_FIELD_REF is on the promoted vector we have to
> +             adjust it and emit a conversion afterwards.  */
> +          gimple *def_stmt
> +            = SSA_NAME_DEF_STMT (TREE_OPERAND (bfr, 0));
> +          enum tree_code def_code
> +            = gimple_assign_rhs_code (def_stmt);
> +
> +          /* The adjusted BIT_FIELD_REF is of the promotion source
> +             vector size and at half of the offset... */
> +          new_rhs = fold_build3 (BIT_FIELD_REF,
> +                                 TREE_TYPE (TREE_TYPE (lhs)),
> +                                 new_rhs,
> +                                 TYPE_SIZE (TREE_TYPE (TREE_TYPE (lhs))),
> +                                 size_binop (EXACT_DIV_EXPR,
> +                                             TREE_OPERAND (bfr, 2),
> +                                             bitsize_int (2)));
> +          /* ... and offsetted by half of the vector if VEC_UNPACK_HI_EXPR.  */
> +          if (def_code == (!BYTES_BIG_ENDIAN
> +                           ? VEC_UNPACK_HI_EXPR : VEC_UNPACK_LO_EXPR))
> +            TREE_OPERAND (new_rhs, 2)
> +              = size_binop (PLUS_EXPR, TREE_OPERAND (new_rhs, 2),
> +                            size_binop (EXACT_DIV_EXPR,
> +                                        TYPE_SIZE (TREE_TYPE (lhs)),
> +                                        bitsize_int (2)));
> +          tree tem = make_ssa_name (TREE_TYPE (TREE_TYPE (lhs)));
> +          gimple *new_stmt = gimple_build_assign (tem, new_rhs);
> +          location_t loc = gimple_location (use_stmt);
> +          gimple_set_location (new_stmt, loc);
> +          gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
> +          /* Perform scalar promotion.  */
> +          new_stmt = gimple_build_assign (gimple_assign_lhs (use_stmt),
> +                                          NOP_EXPR, tem);
> +          gimple_set_location (new_stmt, loc);
> +          gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
> +        }
> +      else
> +        {
> +          /* When the BIT_FIELD_REF is on the original load result
> +             we can just wrap that.  */
> +          tree new_rhs = fold_build3 (BIT_FIELD_REF, TREE_TYPE (bfr),
> +                                      unshare_expr (load_rhs),
> +                                      TREE_OPERAND (bfr, 1),
> +                                      TREE_OPERAND (bfr, 2));
> +          gimple *new_stmt = gimple_build_assign (gimple_assign_lhs (use_stmt),
> +                                                  new_rhs);
> +          location_t loc = gimple_location (use_stmt);
> +          gimple_set_location (new_stmt, loc);
> +          gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
> +        }
> +      gimple_stmt_iterator gsi2 = gsi_for_stmt (use_stmt);
> +      unlink_stmt_vdef (use_stmt);
> +      gsi_remove (&gsi2, true);
> +    }
> +
> +  /* Finally get rid of the intermediate stmts.  */
> +  gimple *use_stmt;
> +  FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
> +    {
> +      if (is_gimple_debug (use_stmt))
> +        {
> +          if (gimple_debug_bind_p (use_stmt))
> +            {
> +              gimple_debug_bind_reset_value (use_stmt);
> +              update_stmt (use_stmt);
> +            }
> +          continue;
> +        }
> +      gimple_stmt_iterator gsi2 = gsi_for_stmt (use_stmt);
> +      unlink_stmt_vdef (use_stmt);
> +      release_defs (use_stmt);
> +      gsi_remove (&gsi2, true);
> +    }
> +  /* And the original load.  */
> +  release_defs (stmt);
> +  gsi_remove (gsi, true);
> +}
> +
> +
>  /* Primitive "lattice" function for gimple_simplify.  */
>
>  static tree
> @@ -3007,71 +3183,15 @@ pass_forwprop::execute (function *fun)
>                gsi_next (&gsi);
>              }
>            else if (TREE_CODE (TREE_TYPE (lhs)) == VECTOR_TYPE
> -                   && TYPE_MODE (TREE_TYPE (lhs)) == BLKmode
> +                   && (TYPE_MODE (TREE_TYPE (lhs)) == BLKmode
> +                       /* After vector lowering rewrite all loads, but
> +                          initially do not since this conflicts with
> +                          vector CONSTRUCTOR to shuffle optimization.  */
> +                       || (fun->curr_properties & PROP_gimple_lvec))
>                     && gimple_assign_load_p (stmt)
>                     && !gimple_has_volatile_ops (stmt)
> -                   && (TREE_CODE (gimple_assign_rhs1 (stmt))
> -                       != TARGET_MEM_REF)
>                     && !stmt_can_throw_internal (cfun, stmt))
> -            {
> -              /* Rewrite loads used only in BIT_FIELD_REF extractions to
> -                 component-wise loads.  */
> -              use_operand_p use_p;
> -              imm_use_iterator iter;
> -              bool rewrite = true;
> -              FOR_EACH_IMM_USE_FAST (use_p, iter, lhs)
> -                {
> -                  gimple *use_stmt = USE_STMT (use_p);
> -                  if (is_gimple_debug (use_stmt))
> -                    continue;
> -                  if (!is_gimple_assign (use_stmt)
> -                      || gimple_assign_rhs_code (use_stmt) != BIT_FIELD_REF
> -                      || TREE_OPERAND (gimple_assign_rhs1 (use_stmt), 0) != lhs)
> -                    {
> -                      rewrite = false;
> -                      break;
> -                    }
> -                }
> -              if (rewrite)
> -                {
> -                  gimple *use_stmt;
> -                  FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
> -                    {
> -                      if (is_gimple_debug (use_stmt))
> -                        {
> -                          if (gimple_debug_bind_p (use_stmt))
> -                            {
> -                              gimple_debug_bind_reset_value (use_stmt);
> -                              update_stmt (use_stmt);
> -                            }
> -                          continue;
> -                        }
> -
> -                      tree bfr = gimple_assign_rhs1 (use_stmt);
> -                      tree new_rhs = fold_build3 (BIT_FIELD_REF,
> -                                                  TREE_TYPE (bfr),
> -                                                  unshare_expr (rhs),
> -                                                  TREE_OPERAND (bfr, 1),
> -                                                  TREE_OPERAND (bfr, 2));
> -                      gimple *new_stmt
> -                        = gimple_build_assign (gimple_assign_lhs (use_stmt),
> -                                               new_rhs);
> -
> -                      location_t loc = gimple_location (use_stmt);
> -                      gimple_set_location (new_stmt, loc);
> -                      gimple_stmt_iterator gsi2 = gsi_for_stmt (use_stmt);
> -                      unlink_stmt_vdef (use_stmt);
> -                      gsi_remove (&gsi2, true);
> -
> -                      gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
> -                    }
> -
> -                  release_defs (stmt);
> -                  gsi_remove (&gsi, true);
> -                }
> -              else
> -                gsi_next (&gsi);
> -            }
> +            optimize_vector_load (&gsi);
>
>            else if (code == COMPLEX_EXPR)
>              {
> --
> 2.31.1