public inbox for gcc-patches@gcc.gnu.org
* [RFC] test builtin ratio for loop distribution
@ 2021-01-27 12:40 Alexandre Oliva
  2021-01-27 15:12 ` Richard Biener
  2021-02-05  0:13 ` Jim Wilson
  0 siblings, 2 replies; 26+ messages in thread
From: Alexandre Oliva @ 2021-01-27 12:40 UTC (permalink / raw)
  To: gcc-patches; +Cc: Zdenek Dvorak


This patch attempts to fix a libgcc codegen regression introduced in
gcc-10, as -ftree-loop-distribute-patterns was enabled at -O2.


The ldist pass turns even very short loops into memset calls.  E.g.,
the TFmode emulation calls end with a loop of up to 3 iterations, to
zero out trailing words, and the loop distribution pass turns them
into calls of the memset builtin.
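
For illustration, a sketch of the kind of loop I mean (not the actual
libgcc code; the names are made up):

  /* Zero out up to 3 trailing words; N is 0..3 and DST is word-aligned.  */
  void
  clear_trailing_words (unsigned long *dst, unsigned n)
  {
    for (unsigned i = 0; i < n; i++)
      dst[i] = 0;
  }

With -ftree-loop-distribute-patterns, ldist replaces the loop with
__builtin_memset (dst, 0, n * sizeof (unsigned long)).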

Though short constant-length memsets are usually dealt with
efficiently, for non-constant-length ones, the options are setmemM, or
a function call.

RISC-V doesn't have any setmemM pattern, so the loops above end up
"optimized" into memset calls, incurring not only the overhead of an
explicit call, but also discarding the information the compiler has
about the alignment of the destination, and that the length is a
multiple of the word alignment.

This patch adds to the loop distribution pass some cost analysis based
on preexisting *_RATIO macros, so that we won't transform loops with
trip counts as low as the ratios we'd rather expand inline.


This patch is not finished; it needs adjustments to the testsuite, to
make up for the behavior changes it brings about.  Specifically, on an
x86_64-linux-gnu regstrap, it regresses:

> FAIL: gcc.dg/pr53265.c  (test for warnings, line 40)
> FAIL: gcc.dg/pr53265.c  (test for warnings, line 42)
> FAIL: gcc.dg/tree-ssa/ldist-38.c scan-tree-dump ldist "split to 0 loops and 1 library calls"
> FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++14  scan-tree-dump ldist "split to 0 loops and 1 library calls"
> FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++17  scan-tree-dump ldist "split to 0 loops and 1 library calls"
> FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++2a  scan-tree-dump ldist "split to 0 loops and 1 library calls"

I suppose just lengthening the loops will take care of ldist-38 and
pr78847, but the loss of the warnings in pr53265 is more concerning, and
will require investigation.

Nevertheless, I seek feedback on whether this is an acceptable approach,
or whether we should use alternate tuning parameters for ldist, or
something entirely different.  Thanks in advance,


for  gcc/ChangeLog

	* tree-loop-distribution.c (maybe_normalize_partition): New.
	(loop_distribution::distribute_loop): Call it.

[requires testsuite adjustments and investigation of a warning regression]
---
 gcc/tree-loop-distribution.c |   54 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index bb15fd3723fb6..b5198652817ee 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -2848,6 +2848,52 @@ fuse_memset_builtins (vec<struct partition *> *partitions)
     }
 }
 
+/* Return false if it's profitable to turn the LOOP PARTITION into a builtin
+   call; otherwise change the PARTITION to PKIND_NORMAL and return true.  */
+
+static bool
+maybe_normalize_partition (class loop *loop, struct partition *partition)
+{
+  unsigned HOST_WIDE_INT ratio;
+
+  switch (partition->kind)
+    {
+    case PKIND_NORMAL:
+    case PKIND_PARTIAL_MEMSET:
+      return false;
+
+    case PKIND_MEMSET:
+      if (integer_zerop (gimple_assign_rhs1 (DR_STMT
+					     (partition->builtin->dst_dr))))
+	ratio = CLEAR_RATIO (optimize_loop_for_speed_p (loop));
+      else
+	ratio = SET_RATIO (optimize_loop_for_speed_p (loop));
+      break;
+
+    case PKIND_MEMCPY:
+    case PKIND_MEMMOVE:
+      ratio = MOVE_RATIO (optimize_loop_for_speed_p (loop));
+      break;
+
+    default:
+      gcc_unreachable ();
+    }
+
+  tree niters = number_of_latch_executions (loop);
+  if (niters == NULL_TREE || niters == chrec_dont_know)
+    return false;
+
+  wide_int minit, maxit;
+  value_range_kind vrk = determine_value_range (niters, &minit, &maxit);
+  if (vrk == VR_RANGE && wi::ltu_p (maxit, ratio))
+    {
+      partition->kind = PKIND_NORMAL;
+      return true;
+    }
+
+  return false;
+}
+
 void
 loop_distribution::finalize_partitions (class loop *loop,
 					vec<struct partition *> *partitions,
@@ -3087,6 +3133,14 @@ loop_distribution::distribute_loop (class loop *loop, vec<gimple *> stmts,
     }
 
   finalize_partitions (loop, &partitions, &alias_ddrs);
+  {
+    bool any_changes_p = false;
+    for (i = 0; partitions.iterate (i, &partition); ++i)
+      if (maybe_normalize_partition (loop, partition))
+	any_changes_p = true;
+    if (any_changes_p)
+      finalize_partitions (loop, &partitions, &alias_ddrs);
+  }
 
   /* If there is a reduction in all partitions make sure the last one
      is not classified for builtin code generation.  */

-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist         GNU Toolchain Engineer
        Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


* Re: [RFC] test builtin ratio for loop distribution
  2021-01-27 12:40 [RFC] test builtin ratio for loop distribution Alexandre Oliva
@ 2021-01-27 15:12 ` Richard Biener
  2021-01-28  5:28   ` Alexandre Oliva
  2021-02-05  0:13 ` Jim Wilson
  1 sibling, 1 reply; 26+ messages in thread
From: Richard Biener @ 2021-01-27 15:12 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: GCC Patches, Zdenek Dvorak

On Wed, Jan 27, 2021 at 2:18 PM Alexandre Oliva <oliva@adacore.com> wrote:
>
>
> This patch attempts to fix a libgcc codegen regression introduced in
> gcc-10, as -ftree-loop-distribute-patterns was enabled at -O2.
>
>
> The ldist pass turns even very short loops into memset calls.  E.g.,
> the TFmode emulation calls end with a loop of up to 3 iterations, to
> zero out trailing words, and the loop distribution pass turns them
> into calls of the memset builtin.
>
> Though short constant-length memsets are usually dealt with
> efficiently, for non-constant-length ones, the options are setmemM, or
> a function call.

There's also folding which is expected to turn small memcpy/memset
into simple stmts again.

> RISC-V doesn't have any setmemM pattern, so the loops above end up
> "optimized" into memset calls, incurring not only the overhead of an
> explicit call, but also discarding the information the compiler has
> about the alignment of the destination, and that the length is a
> multiple of the word alignment.

But yes, this particular issue has come up before.  We do have some
on-the-side info for alignment and size estimates from profiling, notably
memset expansion will call

  if (currently_expanding_gimple_stmt)
    stringop_block_profile (currently_expanding_gimple_stmt,
                            &expected_align, &expected_size);

I'm not sure whether those would be a suitable vehicle to record
known alignments.  The alternative is to create builtin variants
that encode this info and are usable for the middle-end here.

That said, rather than not transforming the loop as you do I'd
say we want to re-inline small copies more forcefully during
loop distribution code-gen so we turn a loop that sets
3 'short int' to zero into an 'int' store and a 'short' store for example
(we have more code-size leeway here because we formerly had
a loop).
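
A sketch of the idea (assuming a 2-byte 'short', a 4-byte 'int', and an
'int'-aligned destination; illustration only, not what the patch emits):

  static void
  zero_three_shorts (short *a)
  {
    /* Instead of:  for (int i = 0; i < 3; i++) a[i] = 0;  */
    *(int *) a = 0;   /* one 'int' store covers a[0] and a[1] */
    a[2] = 0;         /* one 'short' store for the remainder */
  }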

Since you don't add a testcase I can't see whether the actual
case would be fixed by setting SSA pointer alignment on the
memset arguments (that is, whether they are invariant and thus
no SSA name is involved or not).

> This patch adds to the loop distribution pass some cost analysis based
> on preexisting *_RATIO macros, so that we won't transform loops with
> trip counts as low as the ratios we'd rather expand inline.
>
>
> This patch is not finished; it needs adjustments to the testsuite, to
> make up for the behavior changes it brings about.  Specifically, on an
> x86_64-linux-gnu regstrap, it regresses:
>
> > FAIL: gcc.dg/pr53265.c  (test for warnings, line 40)
> > FAIL: gcc.dg/pr53265.c  (test for warnings, line 42)
> > FAIL: gcc.dg/tree-ssa/ldist-38.c scan-tree-dump ldist "split to 0 loops and 1 library calls"
> > FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++14  scan-tree-dump ldist "split to 0 loops and 1 library calls"
> > FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++17  scan-tree-dump ldist "split to 0 loops and 1 library calls"
> > FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++2a  scan-tree-dump ldist "split to 0 loops and 1 library calls"
>
> I suppose just lengthening the loops will take care of ldist-38 and
> pr78847, but the loss of the warnings in pr53265 is more concerning, and
> will require investigation.
>
> Nevertheless, I seek feedback on whether this is an acceptable approach,
> or whether we should use alternate tuning parameters for ldist, or
> something entirely different.  Thanks in advance,
>
>
> for  gcc/ChangeLog
>
>         * tree-loop-distribution.c (maybe_normalize_partition): New.
>         (loop_distribution::distribute_loop): Call it.
>
> [requires testsuite adjustments and investigation of a warning regression]
> ---
>  gcc/tree-loop-distribution.c |   54 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 54 insertions(+)
>
> diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> index bb15fd3723fb6..b5198652817ee 100644
> --- a/gcc/tree-loop-distribution.c
> +++ b/gcc/tree-loop-distribution.c
> @@ -2848,6 +2848,52 @@ fuse_memset_builtins (vec<struct partition *> *partitions)
>      }
>  }
>
> +/* Return false if it's profitable to turn the LOOP PARTITION into a builtin
> +   call; otherwise change the PARTITION to PKIND_NORMAL and return true.  */
> +
> +static bool
> +maybe_normalize_partition (class loop *loop, struct partition *partition)
> +{
> +  unsigned HOST_WIDE_INT ratio;
> +
> +  switch (partition->kind)
> +    {
> +    case PKIND_NORMAL:
> +    case PKIND_PARTIAL_MEMSET:
> +      return false;
> +
> +    case PKIND_MEMSET:
> +      if (integer_zerop (gimple_assign_rhs1 (DR_STMT
> +                                            (partition->builtin->dst_dr))))
> +       ratio = CLEAR_RATIO (optimize_loop_for_speed_p (loop));
> +      else
> +       ratio = SET_RATIO (optimize_loop_for_speed_p (loop));
> +      break;
> +
> +    case PKIND_MEMCPY:
> +    case PKIND_MEMMOVE:
> +      ratio = MOVE_RATIO (optimize_loop_for_speed_p (loop));
> +      break;
> +
> +    default:
> +      gcc_unreachable ();
> +    }
> +
> +  tree niters = number_of_latch_executions (loop);
> +  if (niters == NULL_TREE || niters == chrec_dont_know)
> +    return false;
> +
> +  wide_int minit, maxit;
> +  value_range_kind vrk = determine_value_range (niters, &minit, &maxit);
> +  if (vrk == VR_RANGE && wi::ltu_p (maxit, ratio))
> +    {
> +      partition->kind = PKIND_NORMAL;
> +      return true;
> +    }
> +
> +  return false;
> +}
> +
>  void
>  loop_distribution::finalize_partitions (class loop *loop,
>                                         vec<struct partition *> *partitions,
> @@ -3087,6 +3133,14 @@ loop_distribution::distribute_loop (class loop *loop, vec<gimple *> stmts,
>      }
>
>    finalize_partitions (loop, &partitions, &alias_ddrs);
> +  {
> +    bool any_changes_p = false;
> +    for (i = 0; partitions.iterate (i, &partition); ++i)
> +      if (maybe_normalize_partition (loop, partition))
> +       any_changes_p = true;
> +    if (any_changes_p)
> +      finalize_partitions (loop, &partitions, &alias_ddrs);
> +  }
>
>    /* If there is a reduction in all partitions make sure the last one
>       is not classified for builtin code generation.  */
>
> --
> Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
>    Free Software Activist         GNU Toolchain Engineer
>         Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


* Re: [RFC] test builtin ratio for loop distribution
  2021-01-27 15:12 ` Richard Biener
@ 2021-01-28  5:28   ` Alexandre Oliva
  2021-01-28  8:59     ` Richard Biener
  0 siblings, 1 reply; 26+ messages in thread
From: Alexandre Oliva @ 2021-01-28  5:28 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, Zdenek Dvorak

On Jan 27, 2021, Richard Biener <richard.guenther@gmail.com> wrote:

> That said, rather than not transforming the loop as you do I'd
> say we want to re-inline small copies more forcefully during
> loop distribution code-gen so we turn a loop that sets
> 3 'short int' to zero into an 'int' store and a 'short' store for example
> (we have more code-size leeway here because we formerly had
> a loop).

Ok, that makes sense to me, if there's a chance of growing the access
stride.

> Since you don't add a testcase

Uhh, sorry, I mentioned TFmode emulation calls, but I wasn't explicit I
meant the soft-fp ones from libgcc.

./xgcc -B./ -O2 $srcdir/libgcc/soft-fp/fixtfdi.c \
  -I$srcdir/libgcc/config/riscv -I$srcdir/include \
  -S -o - | grep memset

> I can't see whether the actual case would be fixed by setting SSA
> pointer alignment on the memset arguments

The dest pointer alignment is known at the builtin expansion time,
that's not a problem.  What isn't known (*) is that the length is a
multiple of the word size: what gets to the expander is that it's
between 4 and 12 bytes (targeting 32-bit risc-v), but not that it's
either 4, 8 or 12.  Coming up with an efficient inline expansion becomes
a bit of a challenge without that extra knowledge.
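
To illustrate what that extra knowledge would buy (a sketch only,
assuming 32-bit risc-v, i.e. 4-byte words, and a word-aligned dst): if
the expander knew the length is one of 4, 8 or 12, it could emit at
most three word stores instead of a byte loop or a libcall:

  static void
  clear_4_8_or_12 (unsigned int *dst, unsigned len)
  {
    if (len >= 4)  dst[0] = 0;
    if (len >= 8)  dst[1] = 0;
    if (len >= 12) dst[2] = 0;
  }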


(*) at least before my related patch for get_nonzero_bits
https://gcc.gnu.org/pipermail/gcc-patches/2021-January/564344.html

-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist         GNU Toolchain Engineer
        Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


* Re: [RFC] test builtin ratio for loop distribution
  2021-01-28  5:28   ` Alexandre Oliva
@ 2021-01-28  8:59     ` Richard Biener
  2021-02-02 17:13       ` Alexandre Oliva
  0 siblings, 1 reply; 26+ messages in thread
From: Richard Biener @ 2021-01-28  8:59 UTC (permalink / raw)
  To: Alexandre Oliva, Jan Hubicka; +Cc: GCC Patches, Zdenek Dvorak

On Thu, Jan 28, 2021 at 6:28 AM Alexandre Oliva <oliva@adacore.com> wrote:
>
> On Jan 27, 2021, Richard Biener <richard.guenther@gmail.com> wrote:
>
> > That said, rather than not transforming the loop as you do I'd
> > say we want to re-inline small copies more forcefully during
> > loop distribution code-gen so we turn a loop that sets
> > 3 'short int' to zero into an 'int' store and a 'short' store for example
> > (we have more code-size leeway here because we formerly had
> > a loop).
>
> Ok, that makes sense to me, if there's a chance of growing the access
> stride.
>
> > Since you don't add a testcase
>
> Uhh, sorry, I mentioned TFmode emulation calls, but I wasn't explicit I
> meant the soft-fp ones from libgcc.
>
> ./xgcc -B./ -O2 $srcdir/libgcc/soft-fp/fixtfdi.c \
>   -I$srcdir/libgcc/config/riscv -I$srcdir/include \
>   -S -o - | grep memset
>
> > I can't see whether the actual case would be fixed by setting SSA
> > pointer alignment on the memset arguments
>
> The dest pointer alignment is known at the builtin expansion time,
> that's not a problem.  What isn't known (*) is that the length is a
> multiple of the word size: what gets to the expander is that it's
> between 4 and 12 bytes (targeting 32-bit risc-v), but not that it's
> either 4, 8 or 12.  Coming up with an efficient inline expansion becomes
> a bit of a challenge without that extra knowledge.

Ah, yes - that's also useful information.  I guess for memset
an enhanced builtin would then have calloc style
size and nmemb arguments where the destination is expected
to have at least 'size' alignment.  That would allow turning
back the memset into the original loop (but with optimal IVs, etc.).
So kind of the x86 rep mov{bwlq}.  Now think of sth similarly
useful for memcpy/move.
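
Interface-wise, maybe something like this (purely hypothetical, no such
builtin exists today):

  /* Like memset, but the length is given calloc-style as NMEMB elements
     of SIZE bytes each, and DST is known to be at least SIZE-aligned,
     so the expander can use a word-sized store loop.  */
  void *__builtin_memset_n (void *dst, int val, size_t size, size_t nmemb);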

Richard.

>
>
> (*) at least before my related patch for get_nonzero_bits
> https://gcc.gnu.org/pipermail/gcc-patches/2021-January/564344.html
>
> --
> Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
>    Free Software Activist         GNU Toolchain Engineer
>         Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


* Re: [RFC] test builtin ratio for loop distribution
  2021-01-28  8:59     ` Richard Biener
@ 2021-02-02 17:13       ` Alexandre Oliva
  2021-02-03  8:59         ` Richard Biener
  0 siblings, 1 reply; 26+ messages in thread
From: Alexandre Oliva @ 2021-02-02 17:13 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jan Hubicka, GCC Patches, Zdenek Dvorak

On Jan 28, 2021, Richard Biener <richard.guenther@gmail.com> wrote:

> That would allow turning back the memset into the original loop (but
> with optimal IVs, etc.).

Is this sort of what you had in mind?

I haven't tested the inline expansion of memset much yet; and that of
memcpy, not at all; this really is mainly to check that I understood
what you had in mind before I spend further time polishing it.


test builtin ratio for loop distribution

From: Alexandre Oliva <oliva@adacore.com>

The ldist pass turns even very short loops into memset calls.  E.g.,
the TFmode emulation calls end with a loop of up to 3 iterations, to
zero out trailing words, and the loop distribution pass turns them
into calls of the memset builtin.

Though short constant-length memsets are usually dealt with
efficiently, for non-constant-length ones, the options are setmemM, or
a function call.

RISC-V doesn't have any setmemM pattern, so the loops above end up
"optimized" into memset calls, incurring not only the overhead of an
explicit call, but also discarding the information the compiler has
about the alignment of the destination, and that the length is a
multiple of the word alignment.

This patch adds, to the loop distribution pass, the ability to perform
inline expansion of bounded variable-length memset and memcpy in ways
that take advantage of known alignments and factors of the size, when
preexisting *_RATIO macros suggest the inline expansion is
advantageous.


for  gcc/ChangeLog

	* tree-loop-distribution.c: Include builtins.h.
	(generate_memset_builtin): Introduce optional inline expansion
	of bounded variable-sized memset calls.
	(generate_memcpy_builtin): Likewise for memcpy only.
	(loop_distribution::execute): Fix loop structure afterwards.
---
 gcc/tree-loop-distribution.c |  280 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 279 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index bb15fd3723fb6..3be7a4c1ac281 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -115,6 +115,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-vectorizer.h"
 #include "tree-eh.h"
 #include "gimple-fold.h"
+#include "builtins.h"
 
 
 #define MAX_DATAREFS_NUM \
@@ -1148,6 +1149,23 @@ generate_memset_builtin (class loop *loop, partition *partition)
   /* The new statements will be placed before LOOP.  */
   gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
 
+  /* Compute builtin->size range and ctz before it's gimplified; sub-expressions
+     thereof are rewritten in place, so they end up referencing SSA_NAMEs for
+     which we don't have VR info.  */
+  unsigned align = get_pointer_alignment (builtin->dst_base) / BITS_PER_UNIT;
+  unsigned alctz, szctz, xbits;
+  wide_int szmin, szmax;
+  value_range_kind vrk;
+  if (align)
+    {
+      alctz = wi::ctz (align);
+      szctz = MIN (tree_ctz (builtin->size), alctz);
+      xbits = alctz - szctz;
+      vrk = determine_value_range (builtin->size, &szmin, &szmax);
+      if (szmin == szmax)
+	align = 0;
+    }
+
   nb_bytes = rewrite_to_non_trapping_overflow (builtin->size);
   nb_bytes = force_gimple_operand_gsi (&gsi, nb_bytes, true, NULL_TREE,
 				       false, GSI_CONTINUE_LINKING);
@@ -1172,6 +1190,127 @@ generate_memset_builtin (class loop *loop, partition *partition)
       val = tem;
     }
 
+  unsigned HOST_WIDE_INT ratio;
+  if (integer_zerop (val))
+    ratio = CLEAR_RATIO (optimize_loop_for_speed_p (loop));
+  else
+    ratio = SET_RATIO (optimize_loop_for_speed_p (loop));
+
+  /* Compare the number of (most-aligned) stores needed for szmax bytes with
+     the ratio: a loop that iterates up to szmax >> alctz,
+     and then one conditional store for each extra bit of alignment that the
+     destination has over the size.  */
+  if (align && vrk == VR_RANGE
+      && wi::ltu_p (wi::rshift (szmax, alctz, UNSIGNED) + xbits, ratio))
+    {
+      gimple *stmt;
+      tree bytes = create_tmp_var_raw (size_type_node, "ldistbytes");
+      tree ptr = create_tmp_var_raw (build_pointer_type (char_type_node),
+				     "ldistptr");
+      tree stval = force_gimple_operand_gsi (&gsi,
+					     fold_convert
+					     (unsigned_char_type_node, val),
+					     true, NULL_TREE, false,
+					     GSI_CONTINUE_LINKING);
+
+      for (unsigned i = 1; i != alctz; i *= 2)
+	{
+	  unsigned bits = i * 2 * BITS_PER_UNIT;
+	  tree type = build_nonstandard_integer_type (bits, true);
+	  stval = fold_convert (type, stval);
+	  tree op2 = fold_build2_loc (partition->loc, LSHIFT_EXPR,
+				      TREE_TYPE (stval), stval,
+				      build_int_cst (type, i * BITS_PER_UNIT));
+	  stval = fold_build2_loc (partition->loc, BIT_IOR_EXPR,
+				   TREE_TYPE (stval), stval, op2);
+	  stval = force_gimple_operand_gsi (&gsi, stval, true, NULL_TREE,
+					    false, GSI_CONTINUE_LINKING);
+	}
+
+      stmt = gimple_build_assign (bytes, nb_bytes);
+      gimple_set_location (stmt, partition->loc);
+      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+      stmt = gimple_build_assign (ptr, mem);
+      gimple_set_location (stmt, partition->loc);
+      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+      gsi = gsi_start_bb (split_block (gsi_bb (gsi), stmt)->dest);
+
+      for (unsigned i = 0; i <= xbits; i++)
+	{
+	  tree blksz = build_int_cst (size_type_node, align >> i);
+
+	  if (i > 0)
+	    {
+	      unsigned bits = (align >> i) * BITS_PER_UNIT;
+	      tree type = build_nonstandard_integer_type (bits, true);
+	      stval = fold_convert (type, stval);
+	      stval = force_gimple_operand_gsi (&gsi, stval, true, NULL_TREE,
+						false, GSI_CONTINUE_LINKING);
+	    }
+
+	  tree labcond = NULL; // create_artificial_label (partition->loc);
+	  tree labskip = NULL; // create_artificial_label (partition->loc);
+
+	  stmt = gimple_build_cond (GE_EXPR, bytes, blksz, labcond, labskip);
+	  gimple_set_location (stmt, partition->loc);
+	  gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+	  basic_block bbfrom = gsi_bb (gsi);
+	  basic_block bbcond = split_block (bbfrom, stmt)->dest;
+
+	  gsi = gsi_start_bb (bbcond);
+
+	  stmt = gimple_build_assign (fold_build2
+				      (MEM_REF, TREE_TYPE (stval), ptr,
+				       build_int_cst (TREE_TYPE (ptr), 0)),
+				      stval);
+	  gimple_set_location (stmt, partition->loc);
+	  gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+	  if (i == 0 || i < xbits)
+	    {
+	      stmt = gimple_build_assign (ptr,
+					  fold_build2_loc
+					  (partition->loc,
+					   POINTER_PLUS_EXPR,
+					   TREE_TYPE (ptr),
+					   ptr, blksz));
+	      gimple_set_location (stmt, partition->loc);
+	      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+	      stmt = gimple_build_assign (bytes,
+					  fold_build2_loc
+					  (partition->loc,
+					   MINUS_EXPR,
+					   TREE_TYPE (bytes),
+					   bytes, blksz));
+	      gimple_set_location (stmt, partition->loc);
+	      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+	    }
+
+	  basic_block bbskip = split_block (bbcond, stmt)->dest;
+
+	  if (i == 0)
+	    redirect_edge_succ (single_succ_edge (bbcond), bbfrom);
+
+	  single_succ_edge (bbfrom)->flags |= EDGE_TRUE_VALUE;
+	  single_succ_edge (bbfrom)->flags &= ~EDGE_FALLTHRU;
+	  make_edge (bbfrom, bbskip, EDGE_FALSE_VALUE);
+	  set_immediate_dominator (CDI_DOMINATORS, bbskip, bbfrom);
+	}
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "generated memset, inlined with %u-trip loop ctzs %u..%u \n",
+		 unsigned((wi::rshift (szmax, alctz, UNSIGNED)
+			   + xbits).to_uhwi ()),
+		 alctz, szctz);
+
+      return;
+    }
+
   fn = build_fold_addr_expr (builtin_decl_implicit (BUILT_IN_MEMSET));
   fn_call = gimple_build_call (fn, 3, mem, val, nb_bytes);
   gimple_set_location (fn_call, partition->loc);
@@ -1202,6 +1341,27 @@ generate_memcpy_builtin (class loop *loop, partition *partition)
   /* The new statements will be placed before LOOP.  */
   gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
 
+  /* Compute builtin->size range and ctz before it's gimplified; sub-expressions
+     thereof are rewritten in place, so they end up referencing SSA_NAMEs for
+     which we don't have VR info.  */
+  unsigned align = (MIN (get_pointer_alignment (builtin->dst_base),
+			 get_pointer_alignment (builtin->src_base))
+		    / BITS_PER_UNIT);
+  unsigned alctz, szctz, xbits;
+  wide_int szmin, szmax;
+  value_range_kind vrk;
+  if (align)
+    {
+      alctz = wi::ctz (align);
+      szctz = MIN (tree_ctz (builtin->size), alctz);
+      xbits = alctz - szctz;
+      vrk = determine_value_range (builtin->size, &szmin, &szmax);
+      /* A call with constant size will be expanded into an unrolled loop
+	 later.	 */
+      if (szmin == szmax)
+	align = 0;
+    }
+
   nb_bytes = rewrite_to_non_trapping_overflow (builtin->size);
   nb_bytes = force_gimple_operand_gsi (&gsi, nb_bytes, true, NULL_TREE,
 				       false, GSI_CONTINUE_LINKING);
@@ -1211,12 +1371,127 @@ generate_memcpy_builtin (class loop *loop, partition *partition)
       || ! ptr_derefs_may_alias_p (dest, src))
     kind = BUILT_IN_MEMCPY;
   else
-    kind = BUILT_IN_MEMMOVE;
+    {
+      kind = BUILT_IN_MEMMOVE;
+      /* The inlined loop won't handle overlapping memory areas.  */
+      align = 0;
+    }
 
   dest = force_gimple_operand_gsi (&gsi, dest, true, NULL_TREE,
 				   false, GSI_CONTINUE_LINKING);
   src = force_gimple_operand_gsi (&gsi, src, true, NULL_TREE,
 				  false, GSI_CONTINUE_LINKING);
+
+  unsigned HOST_WIDE_INT ratio;
+  ratio = MOVE_RATIO (optimize_loop_for_speed_p (loop));
+
+  /* Compare the number of (most-aligned) stores needed for szmax bytes with
+     the ratio: a loop that iterates up to szmax >> alctz,
+     and then one conditional store for each extra bit of alignment that the
+     destination has over the size.  */
+  if (align && vrk == VR_RANGE
+      && wi::ltu_p (wi::rshift (szmax, alctz, UNSIGNED) + xbits, ratio))
+    {
+      gimple *stmt;
+      tree bytes = create_tmp_var_raw (size_type_node, "ldistbytes");
+      tree dptr = create_tmp_var_raw (build_pointer_type (char_type_node),
+				     "ldistdptr");
+      tree sptr = create_tmp_var_raw (build_pointer_type (char_type_node),
+				     "ldistsptr");
+
+      stmt = gimple_build_assign (bytes, nb_bytes);
+      gimple_set_location (stmt, partition->loc);
+      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+      stmt = gimple_build_assign (dptr, dest);
+      gimple_set_location (stmt, partition->loc);
+      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+      stmt = gimple_build_assign (sptr, src);
+      gimple_set_location (stmt, partition->loc);
+      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+      gsi = gsi_start_bb (split_block (gsi_bb (gsi), stmt)->dest);
+
+      for (unsigned i = 0; i <= xbits; i++)
+	{
+	  tree blksz = build_int_cst (size_type_node, align >> i);
+	  unsigned bits = (align >> i) * BITS_PER_UNIT;
+	  tree type = build_nonstandard_integer_type (bits, true);
+	  tree val = make_ssa_name (type);
+
+	  stmt = gimple_build_cond (GE_EXPR, bytes, blksz, NULL, NULL);
+	  gimple_set_location (stmt, partition->loc);
+	  gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+	  basic_block bbfrom = gsi_bb (gsi);
+	  basic_block bbcond = split_block (bbfrom, stmt)->dest;
+
+	  gsi = gsi_start_bb (bbcond);
+	  stmt = gimple_build_assign (val,
+				      fold_build2
+				      (MEM_REF, TREE_TYPE (val), sptr,
+				       build_int_cst (TREE_TYPE (sptr), 0)));
+	  gimple_set_location (stmt, partition->loc);
+	  gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+	  stmt = gimple_build_assign (fold_build2
+				      (MEM_REF, TREE_TYPE (val), dptr,
+				       build_int_cst (TREE_TYPE (dptr), 0)),
+				      val);
+	  gimple_set_location (stmt, partition->loc);
+	  gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+	  if (i == 0 || i < xbits)
+	    {
+	      stmt = gimple_build_assign (sptr,
+					  fold_build2_loc
+					  (partition->loc,
+					   POINTER_PLUS_EXPR,
+					   TREE_TYPE (sptr),
+					   sptr, blksz));
+	      gimple_set_location (stmt, partition->loc);
+	      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+	      stmt = gimple_build_assign (dptr,
+					  fold_build2_loc
+					  (partition->loc,
+					   POINTER_PLUS_EXPR,
+					   TREE_TYPE (dptr),
+					   dptr, blksz));
+	      gimple_set_location (stmt, partition->loc);
+	      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+	      stmt = gimple_build_assign (bytes,
+					  fold_build2_loc
+					  (partition->loc,
+					   MINUS_EXPR,
+					   TREE_TYPE (bytes),
+					   bytes, blksz));
+	      gimple_set_location (stmt, partition->loc);
+	      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+	    }
+
+	  basic_block bbskip = split_block (bbcond, stmt)->dest;
+	  if (i == 0)
+	    redirect_edge_succ (single_succ_edge (bbcond), bbfrom);
+
+	  single_succ_edge (bbfrom)->flags |= EDGE_TRUE_VALUE;
+	  single_succ_edge (bbfrom)->flags &= ~EDGE_FALLTHRU;
+	  make_edge (bbfrom, bbskip, EDGE_FALSE_VALUE);
+	  set_immediate_dominator (CDI_DOMINATORS, bbskip, bbfrom);
+	}
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "generated memcpy, inlined with %u-trip loop ctzs %u..%u \n",
+		 unsigned((wi::rshift (szmax, alctz, UNSIGNED)
+			   + xbits).to_uhwi ()),
+		 alctz, szctz);
+
+      return;
+    }
+
   fn = build_fold_addr_expr (builtin_decl_implicit (kind));
   fn_call = gimple_build_call (fn, 3, dest, src, nb_bytes);
   gimple_set_location (fn_call, partition->loc);
@@ -3357,6 +3632,9 @@ loop_distribution::execute (function *fun)
       scev_reset_htab ();
       mark_virtual_operands_for_renaming (fun);
       rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
+      /* Newly-added loops, for inline expansions of memset/memcpy, need to be
+	 integrated.  */
+      fix_loop_structure (NULL);
     }
 
   checking_verify_loop_structure ();


-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist         GNU Toolchain Engineer
        Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


* Re: [RFC] test builtin ratio for loop distribution
  2021-02-02 17:13       ` Alexandre Oliva
@ 2021-02-03  8:59         ` Richard Biener
  2021-02-03 15:11           ` Alexandre Oliva
  0 siblings, 1 reply; 26+ messages in thread
From: Richard Biener @ 2021-02-03  8:59 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: Jan Hubicka, GCC Patches, Zdenek Dvorak

On Tue, Feb 2, 2021 at 6:14 PM Alexandre Oliva <oliva@adacore.com> wrote:
>
> On Jan 28, 2021, Richard Biener <richard.guenther@gmail.com> wrote:
>
> > That would allow turning back the memset into the original loop (but
> > with optimal IVs, etc.).
>
> Is this sort of what you had in mind?
>
> I haven't tested the inline expansion of memset much yet; and that of
> memcpy, not at all; this really is mainly to check that I understood
> what you had in mind before I spend further time polishing it.

So I think we should try to match what __builtin_memcpy/memset
expansion would do here, taking advantage of extra alignment
and size knowledge.  In particular,

 a) if __builtin_memcpy/memset would use setmem/cpymem optabs
     see if we can have variants of memcpy/memset transferring alignment
     and size knowledge

 b) if expansion would use BY_PIECES then expand to an unrolled loop

 c) if expansion would emit a memset/memcpy call but we know
     alignment and have a low bound on niters emit a loop (like your patch does)

a) might be difficult but adding the builtin variants may pay off in any case.

The patch itself could benefit from one or two helpers we already
have, first of all there's create_empty_loop_on_edge (so you don't
need the loop fixup) which conveniently adds the control IV for you.
If you want to use pointer IVs for simplicity there's create_iv.  In the
end you shouldn't need more than creating the actual memory GIMPLE
stmts.  If expansion would use BY_PIECES you can implement the
unrolled code by setting loop->unroll to the number of iterations
to (maximally) peel and cunroll will take care of that.

Note that for memmove if we know the dependence direction, we
can also emit a loop / unrolled code.

I think the builtins with alignment and calloc-style element count
will be useful on its own.

Richard.

>
> test builtin ratio for loop distribution
>
> From: Alexandre Oliva <oliva@adacore.com>
>
> The ldist pass turns even very short loops into memset calls.  E.g.,
> the TFmode emulation calls end with a loop of up to 3 iterations, to
> zero out trailing words, and the loop distribution pass turns them
> into calls of the memset builtin.
>
> Though short constant-length memsets are usually dealt with
> efficiently, for non-constant-length ones, the options are setmemM, or
> a function call.
>
> RISC-V doesn't have any setmemM pattern, so the loops above end up
> "optimized" into memset calls, incurring not only the overhead of an
> explicit call, but also discarding the information the compiler has
> about the alignment of the destination, and that the length is a
> multiple of the word alignment.
>
> This patch adds, to the loop distribution pass, the ability to perform
> inline expansion of bounded variable-length memset and memcpy in ways
> that take advantage of known alignments and factors of the size, when
> preexisting *_RATIO macros suggest the inline expansion is
> advantageous.
>
>
> for  gcc/ChangeLog
>
>         * tree-loop-distribution.c: Include builtins.h.
>         (generate_memset_builtin): Introduce optional inline expansion
>         of bounded variable-sized memset calls.
>         (generate_memcpy_builtin): Likewise for memcpy only.
>         (loop_distribution::execute): Fix loop structure afterwards.
> ---
>  gcc/tree-loop-distribution.c |  280 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 279 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> index bb15fd3723fb6..3be7a4c1ac281 100644
> --- a/gcc/tree-loop-distribution.c
> +++ b/gcc/tree-loop-distribution.c
> @@ -115,6 +115,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-vectorizer.h"
>  #include "tree-eh.h"
>  #include "gimple-fold.h"
> +#include "builtins.h"
>
>
>  #define MAX_DATAREFS_NUM \
> @@ -1148,6 +1149,23 @@ generate_memset_builtin (class loop *loop, partition *partition)
>    /* The new statements will be placed before LOOP.  */
>    gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
>
> +  /* Compute builtin->size range and ctz before it's gimplified; sub-expressions
> +     thereof are rewritten in place, so they end up referencing SSA_NAMEs for
> +     which we don't have VR info.  */
> +  unsigned align = get_pointer_alignment (builtin->dst_base) / BITS_PER_UNIT;
> +  unsigned alctz, szctz, xbits;
> +  wide_int szmin, szmax;
> +  value_range_kind vrk;
> +  if (align)
> +    {
> +      alctz = wi::ctz (align);
> +      szctz = MIN (tree_ctz (builtin->size), alctz);
> +      xbits = alctz - szctz;
> +      vrk = determine_value_range (builtin->size, &szmin, &szmax);
> +      if (szmin == szmax)
> +       align = 0;
> +    }
> +
>    nb_bytes = rewrite_to_non_trapping_overflow (builtin->size);
>    nb_bytes = force_gimple_operand_gsi (&gsi, nb_bytes, true, NULL_TREE,
>                                        false, GSI_CONTINUE_LINKING);
> @@ -1172,6 +1190,127 @@ generate_memset_builtin (class loop *loop, partition *partition)
>        val = tem;
>      }
>
> +  unsigned HOST_WIDE_INT ratio;
> +  if (integer_zerop (val))
> +    ratio = CLEAR_RATIO (optimize_loop_for_speed_p (loop));
> +  else
> +    ratio = SET_RATIO (optimize_loop_for_speed_p (loop));
> +
> +  /* Compare the number of (most-aligned) stores needed for szmax bytes with
> +     the ratio: a loop that iterates up to szmax >> alctz,
> +     and then one conditional store for each extra bit of alignment that the
> +     destination has over the size.  */
> +  if (align && vrk == VR_RANGE
> +      && wi::ltu_p (wi::rshift (szmax, alctz, UNSIGNED) + xbits, ratio))
> +    {
> +      gimple *stmt;
> +      tree bytes = create_tmp_var_raw (size_type_node, "ldistbytes");
> +      tree ptr = create_tmp_var_raw (build_pointer_type (char_type_node),
> +                                    "ldistptr");
> +      tree stval = force_gimple_operand_gsi (&gsi,
> +                                            fold_convert
> +                                            (unsigned_char_type_node, val),
> +                                            true, NULL_TREE, false,
> +                                            GSI_CONTINUE_LINKING);
> +
> +      for (unsigned i = 1; i != alctz; i *= 2)
> +       {
> +         unsigned bits = i * 2 * BITS_PER_UNIT;
> +         tree type = build_nonstandard_integer_type (bits, true);
> +         stval = fold_convert (type, stval);
> +         tree op2 = fold_build2_loc (partition->loc, LSHIFT_EXPR,
> +                                     TREE_TYPE (stval), stval,
> +                                     build_int_cst (type, i * BITS_PER_UNIT));
> +         stval = fold_build2_loc (partition->loc, BIT_IOR_EXPR,
> +                                  TREE_TYPE (stval), stval, op2);
> +         stval = force_gimple_operand_gsi (&gsi, stval, true, NULL_TREE,
> +                                           false, GSI_CONTINUE_LINKING);
> +       }
> +
> +      stmt = gimple_build_assign (bytes, nb_bytes);
> +      gimple_set_location (stmt, partition->loc);
> +      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +      stmt = gimple_build_assign (ptr, mem);
> +      gimple_set_location (stmt, partition->loc);
> +      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +      gsi = gsi_start_bb (split_block (gsi_bb (gsi), stmt)->dest);
> +
> +      for (unsigned i = 0; i <= xbits; i++)
> +       {
> +         tree blksz = build_int_cst (size_type_node, align >> i);
> +
> +         if (i > 0)
> +           {
> +             unsigned bits = (align >> i) * BITS_PER_UNIT;
> +             tree type = build_nonstandard_integer_type (bits, true);
> +             stval = fold_convert (type, stval);
> +             stval = force_gimple_operand_gsi (&gsi, stval, true, NULL_TREE,
> +                                               false, GSI_CONTINUE_LINKING);
> +           }
> +
> +         tree labcond = NULL; // create_artificial_label (partition->loc);
> +         tree labskip = NULL; // create_artificial_label (partition->loc);
> +
> +         stmt = gimple_build_cond (GE_EXPR, bytes, blksz, labcond, labskip);
> +         gimple_set_location (stmt, partition->loc);
> +         gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +         basic_block bbfrom = gsi_bb (gsi);
> +         basic_block bbcond = split_block (bbfrom, stmt)->dest;
> +
> +         gsi = gsi_start_bb (bbcond);
> +
> +         stmt = gimple_build_assign (fold_build2
> +                                     (MEM_REF, TREE_TYPE (stval), ptr,
> +                                      build_int_cst (TREE_TYPE (ptr), 0)),
> +                                     stval);
> +         gimple_set_location (stmt, partition->loc);
> +         gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +         if (i == 0 || i < xbits)
> +           {
> +             stmt = gimple_build_assign (ptr,
> +                                         fold_build2_loc
> +                                         (partition->loc,
> +                                          POINTER_PLUS_EXPR,
> +                                          TREE_TYPE (ptr),
> +                                          ptr, blksz));
> +             gimple_set_location (stmt, partition->loc);
> +             gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +             stmt = gimple_build_assign (bytes,
> +                                         fold_build2_loc
> +                                         (partition->loc,
> +                                          MINUS_EXPR,
> +                                          TREE_TYPE (bytes),
> +                                          bytes, blksz));
> +             gimple_set_location (stmt, partition->loc);
> +             gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +           }
> +
> +         basic_block bbskip = split_block (bbcond, stmt)->dest;
> +
> +         if (i == 0)
> +           redirect_edge_succ (single_succ_edge (bbcond), bbfrom);
> +
> +         single_succ_edge (bbfrom)->flags |= EDGE_TRUE_VALUE;
> +         single_succ_edge (bbfrom)->flags &= ~EDGE_FALLTHRU;
> +         make_edge (bbfrom, bbskip, EDGE_FALSE_VALUE);
> +         set_immediate_dominator (CDI_DOMINATORS, bbskip, bbfrom);
> +       }
> +
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       fprintf (dump_file,
> +                "generated memset, inlined with %u-trip loop ctzs %u..%u \n",
> +                unsigned((wi::rshift (szmax, alctz, UNSIGNED)
> +                          + xbits).to_uhwi ()),
> +                alctz, szctz);
> +
> +      return;
> +    }
> +
>    fn = build_fold_addr_expr (builtin_decl_implicit (BUILT_IN_MEMSET));
>    fn_call = gimple_build_call (fn, 3, mem, val, nb_bytes);
>    gimple_set_location (fn_call, partition->loc);
> @@ -1202,6 +1341,27 @@ generate_memcpy_builtin (class loop *loop, partition *partition)
>    /* The new statements will be placed before LOOP.  */
>    gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
>
> +  /* Compute builtin->size range and ctz before it's gimplified; sub-expressions
> +     thereof are rewritten in place, so they end up referencing SSA_NAMEs for
> +     which we don't have VR info.  */
> +  unsigned align = (MIN (get_pointer_alignment (builtin->dst_base),
> +                        get_pointer_alignment (builtin->src_base))
> +                   / BITS_PER_UNIT);
> +  unsigned alctz, szctz, xbits;
> +  wide_int szmin, szmax;
> +  value_range_kind vrk;
> +  if (align)
> +    {
> +      alctz = wi::ctz (align);
> +      szctz = MIN (tree_ctz (builtin->size), alctz);
> +      xbits = alctz - szctz;
> +      vrk = determine_value_range (builtin->size, &szmin, &szmax);
> +      /* A call with constant size will be expanded into an unrolled loop
> +        later.  */
> +      if (szmin == szmax)
> +       align = 0;
> +    }
> +
>    nb_bytes = rewrite_to_non_trapping_overflow (builtin->size);
>    nb_bytes = force_gimple_operand_gsi (&gsi, nb_bytes, true, NULL_TREE,
>                                        false, GSI_CONTINUE_LINKING);
> @@ -1211,12 +1371,127 @@ generate_memcpy_builtin (class loop *loop, partition *partition)
>        || ! ptr_derefs_may_alias_p (dest, src))
>      kind = BUILT_IN_MEMCPY;
>    else
> -    kind = BUILT_IN_MEMMOVE;
> +    {
> +      kind = BUILT_IN_MEMMOVE;
> +      /* The inlined loop won't handle overlapping memory areas.  */
> +      align = 0;
> +    }
>
>    dest = force_gimple_operand_gsi (&gsi, dest, true, NULL_TREE,
>                                    false, GSI_CONTINUE_LINKING);
>    src = force_gimple_operand_gsi (&gsi, src, true, NULL_TREE,
>                                   false, GSI_CONTINUE_LINKING);
> +
> +  unsigned HOST_WIDE_INT ratio;
> +  ratio = MOVE_RATIO (optimize_loop_for_speed_p (loop));
> +
> +  /* Compare the number of (most-aligned) stores needed for szmax bytes with
> +     the ratio: a loop that iterates up to szmax >> alctz,
> +     and then one conditional store for each extra bit of alignment that the
> +     destination has over the size.  */
> +  if (align && vrk == VR_RANGE
> +      && wi::ltu_p (wi::rshift (szmax, alctz, UNSIGNED) + xbits, ratio))
> +    {
> +      gimple *stmt;
> +      tree bytes = create_tmp_var_raw (size_type_node, "ldistbytes");
> +      tree dptr = create_tmp_var_raw (build_pointer_type (char_type_node),
> +                                    "ldistdptr");
> +      tree sptr = create_tmp_var_raw (build_pointer_type (char_type_node),
> +                                    "ldistsptr");
> +
> +      stmt = gimple_build_assign (bytes, nb_bytes);
> +      gimple_set_location (stmt, partition->loc);
> +      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +      stmt = gimple_build_assign (dptr, dest);
> +      gimple_set_location (stmt, partition->loc);
> +      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +      stmt = gimple_build_assign (sptr, src);
> +      gimple_set_location (stmt, partition->loc);
> +      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +      gsi = gsi_start_bb (split_block (gsi_bb (gsi), stmt)->dest);
> +
> +      for (unsigned i = 0; i <= xbits; i++)
> +       {
> +         tree blksz = build_int_cst (size_type_node, align >> i);
> +         unsigned bits = (align >> i) * BITS_PER_UNIT;
> +         tree type = build_nonstandard_integer_type (bits, true);
> +         tree val = make_ssa_name (type);
> +
> +         stmt = gimple_build_cond (GE_EXPR, bytes, blksz, NULL, NULL);
> +         gimple_set_location (stmt, partition->loc);
> +         gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +         basic_block bbfrom = gsi_bb (gsi);
> +         basic_block bbcond = split_block (bbfrom, stmt)->dest;
> +
> +         gsi = gsi_start_bb (bbcond);
> +         stmt = gimple_build_assign (val,
> +                                     fold_build2
> +                                     (MEM_REF, TREE_TYPE (val), sptr,
> +                                      build_int_cst (TREE_TYPE (sptr), 0)));
> +         gimple_set_location (stmt, partition->loc);
> +         gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +         stmt = gimple_build_assign (fold_build2
> +                                     (MEM_REF, TREE_TYPE (val), dptr,
> +                                      build_int_cst (TREE_TYPE (dptr), 0)),
> +                                     val);
> +         gimple_set_location (stmt, partition->loc);
> +         gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +         if (i == 0 || i < xbits)
> +           {
> +             stmt = gimple_build_assign (sptr,
> +                                         fold_build2_loc
> +                                         (partition->loc,
> +                                          POINTER_PLUS_EXPR,
> +                                          TREE_TYPE (sptr),
> +                                          sptr, blksz));
> +             gimple_set_location (stmt, partition->loc);
> +             gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +             stmt = gimple_build_assign (dptr,
> +                                         fold_build2_loc
> +                                         (partition->loc,
> +                                          POINTER_PLUS_EXPR,
> +                                          TREE_TYPE (dptr),
> +                                          dptr, blksz));
> +             gimple_set_location (stmt, partition->loc);
> +             gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +             stmt = gimple_build_assign (bytes,
> +                                         fold_build2_loc
> +                                         (partition->loc,
> +                                          MINUS_EXPR,
> +                                          TREE_TYPE (bytes),
> +                                          bytes, blksz));
> +             gimple_set_location (stmt, partition->loc);
> +             gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +           }
> +
> +         basic_block bbskip = split_block (bbcond, stmt)->dest;
> +         if (i == 0)
> +           redirect_edge_succ (single_succ_edge (bbcond), bbfrom);
> +
> +         single_succ_edge (bbfrom)->flags |= EDGE_TRUE_VALUE;
> +         single_succ_edge (bbfrom)->flags &= ~EDGE_FALLTHRU;
> +         make_edge (bbfrom, bbskip, EDGE_FALSE_VALUE);
> +         set_immediate_dominator (CDI_DOMINATORS, bbskip, bbfrom);
> +       }
> +
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       fprintf (dump_file,
> +                "generated memcpy, inlined with %u-trip loop ctzs %u..%u \n",
> +                unsigned((wi::rshift (szmax, alctz, UNSIGNED)
> +                          + xbits).to_uhwi ()),
> +                alctz, szctz);
> +
> +      return;
> +    }
> +
>    fn = build_fold_addr_expr (builtin_decl_implicit (kind));
>    fn_call = gimple_build_call (fn, 3, dest, src, nb_bytes);
>    gimple_set_location (fn_call, partition->loc);
> @@ -3357,6 +3632,9 @@ loop_distribution::execute (function *fun)
>        scev_reset_htab ();
>        mark_virtual_operands_for_renaming (fun);
>        rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
> +      /* Newly-added loops, for inline expansions of memset/memcpy, need to be
> +        integrated.  */
> +      fix_loop_structure (NULL);
>      }
>
>    checking_verify_loop_structure ();
>
>
> --
> Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
>    Free Software Activist         GNU Toolchain Engineer
>         Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


* Re: [RFC] test builtin ratio for loop distribution
  2021-02-03  8:59         ` Richard Biener
@ 2021-02-03 15:11           ` Alexandre Oliva
  2021-02-04  8:37             ` Richard Biener
  0 siblings, 1 reply; 26+ messages in thread
From: Alexandre Oliva @ 2021-02-03 15:11 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jan Hubicka, GCC Patches, Zdenek Dvorak

On Feb  3, 2021, Richard Biener <richard.guenther@gmail.com> wrote:

> So I think we should try to match what __builtin_memcpy/memset
> expansion would do here, taking advantage of extra alignment
> and size knowledge.  In particular,

>  a) if __builtin_memcpy/memset would use setmem/cpymem optabs
>      see if we can have variants of memcpy/memset transferring alignment
>      and size knowledge

We could add more optional parameters to them to convey the length's
known ctz.  However, the ctz can't be recovered reliably.  We can't even
recover it after gimplifying the length within ldist!

That said, my other patch already enables ctz calls to recover it, at
least in libgcc risc-v tfmode cases, and it's possible it's readily
available in other cases.  I'd rather leave that for someone dealing
with the machine-specific patterns to figure out whether a separate
argument would be useful.  RISC-V, which is what I'm dealing with,
doesn't have much to offer as far as these patterns are concerned.

>  b) if expansion would use BY_PIECES then expand to an unrolled loop

Why would that be better than keeping the constant-length memset call,
that would be turned into an unrolled loop during expand?

>  c) if expansion would emit a memset/memcpy call but we know
>      alignment and have a low bound on niters emit a loop (like your patch does)

> a) might be difficult but adding the builtin variants may pay off in any case.

*nod*

> The patch itself could benefit from one or two helpers we already
> have, first of all there's create_empty_loop_on_edge (so you don't
> need the loop fixup)

Uhh, thanks, but...  you realize nearly all of the gimple-building code
is one and the same for the loop and for trailing count misalignment?
There doesn't seem to be a lot of benefit to using this primitive, aside
from its dealing with creating the loop data structure which, again, I'd
only want to do in the loop case.

(I guess I should add more comments as to the inline expansion
 strategy.  it's equivalent to:

 i = len, ptr = base, blksz = 1 << alctz;
 while (i >= blksz) { *(ub<blksz>*)ptr = val; i -= blksz; ptr += blksz; }
 blksz >>= 1; if (i >= blksz) { *(ub<blksz>*)ptr = val; i -= blksz; ptr += blksz; }
 blksz >>= 1; if (i >= blksz) { *(ub<blksz>*)ptr = val; i -= blksz; ptr += blksz; }
 ... until blksz gets down to zero or to 1<<szctz
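
 Concretely, for a destination known to be 4-byte aligned (alctz == 2)
 and no known factor of the size (szctz == 0), that amounts to something
 like the following C rendering (a sketch, not the emitted gimple):

   static void
   inline_memset_sketch (unsigned char *ptr, unsigned char val, unsigned len)
   {
     unsigned int v4 = val * 0x01010101u;  /* replicate VAL into a word */
     unsigned i = len;
     while (i >= 4)                        /* word-sized store loop */
       { *(unsigned int *) ptr = v4; i -= 4; ptr += 4; }
     if (i >= 2)                           /* one conditional halfword store */
       { *(unsigned short *) ptr = (unsigned short) v4; i -= 2; ptr += 2; }
     if (i >= 1)                           /* one conditional byte store */
       *ptr = val;
   }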
 
> Note that for memmove if we know the dependence direction, we
> can also emit a loop / unrolled code.

*nod*, but the logic would have to be quite different, using bit tests,
and odds are we won't know the direction and have to output a test and
code for both possibilities, which would be quite unlikely to be
beneficial.  Though the original code would quite likely make the
direction visible; perhaps if the size is small enough that we would
expand a memcpy inline, we should refrain from transforming the loop
into a memmove call.

In the case at hand, there's no benefit at all to these transformations:
we start from a loop with the known alignment and a small loop count (0
to 2 words copied), and if everything goes all right with the
transformation, we may be lucky to get back to that.  It's not like the
transformation could even increase the known alignment, so why bother,
and throw away debug info by rewriting the loop into the same code minus
debug?

> I think the builtins with alignment and calloc-style element count
> will be useful on its own.

Oh, I see, you're suggesting actual separate builtin functions.  Uhh...
I'm not sure I want to go there.  I'd much rather recover the ctz of the
length, and use it in existing code.


I'd also prefer if the generic memset (and memcpy and memmove?) builtin
expanders dealt with non-constant lengths in the way I implemented.
That feels like the right spot for it.  That deprives us of gimple loop
optimizations in the inlined loop generated by the current patch, but if
we expand an unrolled loop with compares and offsets with small
constants, loop optimizations might not even be relevant.


FWIW, the patch I posted yesterday is broken, the regstrap test did not
even build libstdc++-v3 successfully.  I'm not sure whether to pursue it
further, or to reimplement it in the expander.  Suggestions are welcome.

-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist         GNU Toolchain Engineer
        Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


* Re: [RFC] test builtin ratio for loop distribution
  2021-02-03 15:11           ` Alexandre Oliva
@ 2021-02-04  8:37             ` Richard Biener
  2021-02-04 22:17               ` Alexandre Oliva
  0 siblings, 1 reply; 26+ messages in thread
From: Richard Biener @ 2021-02-04  8:37 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: Jan Hubicka, GCC Patches, Zdenek Dvorak

On Wed, Feb 3, 2021 at 4:11 PM Alexandre Oliva <oliva@adacore.com> wrote:
>
> On Feb  3, 2021, Richard Biener <richard.guenther@gmail.com> wrote:
>
> > So I think we should try to match what __builtin_memcpy/memset
> > expansion would do here, taking advantage of extra alignment
> > and size knowledge.  In particular,
>
> >  a) if __builtin_memcpy/memset would use setmem/cpymem optabs
> >      see if we can have variants of memcpy/memset transferring alignment
> >      and size knowledge
>
> We could add more optional parameters to them to convey the length's
> known ctz.  However, the ctz can't be recovered reliably.  We can't even
> recover it after gimplifying the length within ldist!
>
> That said, my other patch already enables ctz calls to recover it, at
> least in libgcc risc-v tfmode cases, and it's possible it's readily
> available in other cases.  I'd rather leave that for someone dealing
> with the machine-specific patterns to figure out whether a separate
> argument would be useful.  RISC-V, which is what I'm dealing with,
> doesn't have much to offer as far as these patterns are concerned.
>
> >  b) if expansion would use BY_PIECES then expand to an unrolled loop
>
> Why would that be better than keeping the constant-length memset call,
> that would be turned into an unrolled loop during expand?

Well, because of the possibly lost ctz and alignment info.  There's also
the eventual benefit of eliding the source (for memcpy) when the
variable accesses in the original loop or the address-taken for the
memcpy/memset are the only reasons it is not elided.  Of course that's then
again a general GIMPLE memcpy/memset inlining deficit (we only
"inline" it if it's possible to express the operation with one read and
one write).
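
A rough illustration of that single-read/single-write limit (the exact
folding conditions on GIMPLE are more involved than this sketch
suggests):

  struct S { long a, b; };
  void f (struct S *d, struct S *s)
  { __builtin_memcpy (d, s, sizeof *d); }  /* folded to *d = *s;  */
  void g (char *d, char *s)
  { __builtin_memcpy (d, s, 1000); }       /* remains a call on GIMPLE  */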

> >  c) if expansion would emit a memset/memcpy call but we know
> >      alignment and have a low bound on niters emit a loop (like your patch does)
>
> > a) might be difficult but adding the builtin variants may pay off in any case.
>
> *nod*
>
> > The patch itself could benefit from one or two helpers we already
> > have, first of all there's create_empty_loop_on_edge (so you don't
> > need the loop fixup)
>
> Uhh, thanks, but...  you realize nearly all of the gimple-building code
> is one and the same for the loop and for trailing count misalignment?

Sorry, the code lacked comments and so I didn't actually try deciphering
the code you generate ;)

> There doesn't seem to be a lot of benefit to using this primitive, aside
> from its dealing with creating the loop data structure which, again, I'd
> only want to do in the loop case.
>
> (I guess I should add more comments as to the inline expansion
>  strategy.  it's equivalent to:
>
>  i = len, ptr = base, blksz = 1 << alctz;
>  while (i >= blksz) { *(ub<blksz>*)ptr = val; i -= blksz; ptr += blksz; }
>  blksz >>= 1; if (i >= blksz) { *(ub<blksz>*)ptr = val; i -= blksz; ptr += blksz; }
>  blksz >>= 1; if (i >= blksz) { *(ub<blksz>*)ptr = val; i -= blksz; ptr += blksz; }
>  ... until blksz gets down to zero or to 1<<szctz

Oh, I see ... fancy.  Maybe a little too fancy ;)  I'd have emitted a loop
storing 1<<szctz pieces but then this would likely be simply the
original loop in most cases.  Which, for loop distribution, hints at
an alternative implementation of simply guessing whether a generated
memcpy/memset would be inlined by RTL expansion and in that
case _not_ generate a builtin but mark the partitioned loop for GIMPLE
complete unrolling.  That might not be optimal and relies on later
store-merging to form larger writes of course.
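
Concretely, for a memset-to-zero with szctz == 3, the loop I have in
mind would be roughly (illustrative only, alignment assumed):

  void clear_words (char *ptr, size_t len)  /* len a multiple of 8 */
  {
    for (size_t i = 0; i < len; i += 8)
      *(unsigned long long *) (ptr + i) = 0;  /* one 1<<szctz-byte store per iteration */
  }

which, as said, is likely just the original loop again.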

> > Note that for memmove if we know the dependence direction, we
> > can also emit a loop / unrolled code.
>
> *nod*, but the logic would have to be quite different, using bit tests,
> and odds are we won't know the direction and have to output a test and
> code for both possibilities, which would be quite unlikely to be
> beneficial.  Though the original code would quite likely make the
> direction visible; perhaps if the size is small enough that we would
> expand a memcpy inline, we should refrain from transforming the loop
> into a memmove call.
>
> In the case at hand, there's no benefit at all to these transformations:
> we start from a loop with the known alignment and a small loop count (0
> to 2 words copied), and if everything goes all right with the
> transformation, we may be lucky to get back to that.  It's not like the
> transformation could even increase the known alignment, so why bother,
> and throw away debug info by rewriting the loop into the same code minus
> debug?

The original motivation was really that esp. for small trip count loops
the target knows best how to implement them.  Now, that completely
fails of course in case the target doesn't implement any of this or
the generic code fails because we lost ctz and alignment info.

> > I think the builtins with alignment and calloc-style element count
> > will be useful on its own.
>
> Oh, I see, you're suggesting actual separate builtin functions.  Uhh...
> I'm not sure I want to go there.  I'd much rather recover the ctz of the
> length, and use it in existing code.

Yeah, but when we generate memcpy there might not be a way to
store the ctz info until RTL expansion where the magic should really happen ...

> I'd also prefer if the generic memset (and memcpy and memmove?) builtin
> expanders dealt with non-constant lengths in the way I implemented.

*nod*

> That feels like the right spot for it.  That deprives us of gimple loop
> optimizations in the inlined loop generated by the current patch, but if
> we expand an unrolled loop with compares and offsets with small
> constants, loop optimizations might not even be relevant.
>
>
> FWIW, the patch I posted yesterday is broken, the regstrap test did not
> even build libstdc++-v3 successfully.  I'm not sure whether to pursue it
> further, or to reimplement it in the expander.  Suggestions are welcome.

So I think we agree that ideally RTL expansion should be the place to
recover the loop.  It remains to be seen how to most easily transfer
alignment and ctz info there.  I guess alternatively one could indeed
make the point that _not_ doing the memcpy/memset replacement
is likely less surprising than trying to do fancy inlining in loop distribution
as compared to RTL expansion.

So I'd say go for improving RTL expansion.  If that for some reason
fails then anticipating the RTL expansion failure cases in loop
distribution and not generating the builtin there (but the loop
as it would be distributed) is probably the best and simplest thing.

We might consider doing RTL-like *_BY_PIECES expansion of
builtins on GIMPLE as well, not at random places through fold_stmt,
but at a select point in the pass pipeline.

Thanks,
Richard.

> --
> Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
>    Free Software Activist         GNU Toolchain Engineer
>         Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


* Re: [RFC] test builtin ratio for loop distribution
  2021-02-04  8:37             ` Richard Biener
@ 2021-02-04 22:17               ` Alexandre Oliva
  2021-02-05  8:02                 ` Richard Biener
  2021-02-11 10:19                 ` Alexandre Oliva
  0 siblings, 2 replies; 26+ messages in thread
From: Alexandre Oliva @ 2021-02-04 22:17 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jan Hubicka, GCC Patches, Zdenek Dvorak

On Feb  4, 2021, Richard Biener <richard.guenther@gmail.com> wrote:

>> >  b) if expansion would use BY_PIECES then expand to an unrolled loop
>> 
>> Why would that be better than keeping the constant-length memset call,
>> that would be turned into an unrolled loop during expand?

> Well, because of the possibly lost ctz and alignment info.

Funny you should mention that.  I got started with the expand-time
expansion yesterday, and found out that we're not using the alignment
information that is available.  Though the pointer is known to point to
an aligned object, we are going for 8-bit alignment for some reason.

The strategy I used there was to first check whether by_pieces would
expand inline a constant length near the max known length, then loop
over the bits in the variable length, expand in each iteration a
constant-length store-by-pieces for the fixed length corresponding to
that bit, and a test comparing the variable length with the fixed length
guarding the expansion of the store-by-pieces.  We may get larger code
this way (no loops), but only O(log(len)) compares.
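
To illustrate (not actual expander output; each constant-size memset
stands for the inline store_by_pieces sequence), for a length known to
be a multiple of 4 and at most 12, the emitted code amounts to:

  if (len >= 8) { memset (ptr, c, 8); ptr += 8; len -= 8; }
  if (len >= 4) memset (ptr, c, 4);

i.e. one compare per power of two between 1<<ctz(len) and the maximum
length, and no loop.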

I've also fixed some bugs in the ldist expander, so now it bootstraps,
but with a few regressions in the testsuite, that I'm yet to look into.

>> Uhh, thanks, but...  you realize nearly all of the gimple-building code
>> is one and the same for the loop and for trailing count misalignment?

> Sorry, the code lacked comments and so I didn't actually try decipering
> the code you generate ;)

Oh, come on, it was plainly obscure ;-D

Sorry for posting an early draft before polishing it up.

> The original motivation was really that esp. for small trip count loops
> the target knows best how to implement them.  Now, that completely
> fails of course in case the target doesn't implement any of this or
> the generic code fails because we lost ctz and alignment info.

In our case, generic code fails because it won't handle variable-sized
clear-by-pieces.  But then, I found out, when it's fixed-size, it also
makes the code worse, because it seems to expand to byte stores even
when the store-to object is known to have wider alignment:

union u {
  long long i;
  char c[8];
} x[8];
int s(union u *p, int k) {
  for (int i = k ? 0 : 3; i < 8; i++) {
    for (int j = 0; j < 8; j++) {
      p[i].c[j] = 0;
    } // becomes a memset to an 8-byte-aligned 8-byte object, then 8 byte stores
  }
}

>> > I think the builtins with alignment and calloc-style element count
>> > will be useful on its own.
>> 
>> Oh, I see, you're suggesting actual separate builtin functions.  Uhh...
>> I'm not sure I want to go there.  I'd much rather recover the ctz of the
>> length, and use it in existing code.

> Yeah, but when we generate memcpy there might not be a way to
> store the ctz info until RTL expansion where the magic should really happen ...

True.  It can be recovered without much difficulty in the cases I've
looked at, but it could be lost in others.

> So I'd say go for improving RTL expansion.

'k, thanks

-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist         GNU Toolchain Engineer
        Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


* Re: [RFC] test builtin ratio for loop distribution
  2021-01-27 12:40 [RFC] test builtin ratio for loop distribution Alexandre Oliva
  2021-01-27 15:12 ` Richard Biener
@ 2021-02-05  0:13 ` Jim Wilson
  2021-02-11 10:11   ` Alexandre Oliva
  1 sibling, 1 reply; 26+ messages in thread
From: Jim Wilson @ 2021-02-05  0:13 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: GCC Patches, Zdenek Dvorak

On Wed, Jan 27, 2021 at 4:40 AM Alexandre Oliva <oliva@adacore.com> wrote:

> This patch attempts to fix a libgcc codegen regression introduced in
> gcc-10, as -ftree-loop-distribute-patterns was enabled at -O2.
>
> RISC-V doesn't have any setmemM pattern, so the loops above end up
> "optimized" into memset calls, incurring not only the overhead of an
> explicit call, but also discarding the information the compiler has
> about the alignment of the destination, and that the length is a
> multiple of the word alignment.
>
FYI we have a bug report for this for a coremark regression which sounds
like the same problem.
     https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Jim


* Re: [RFC] test builtin ratio for loop distribution
  2021-02-04 22:17               ` Alexandre Oliva
@ 2021-02-05  8:02                 ` Richard Biener
  2021-02-11 10:19                 ` Alexandre Oliva
  1 sibling, 0 replies; 26+ messages in thread
From: Richard Biener @ 2021-02-05  8:02 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: Jan Hubicka, GCC Patches, Zdenek Dvorak

On Thu, Feb 4, 2021 at 11:18 PM Alexandre Oliva <oliva@adacore.com> wrote:
>
> On Feb  4, 2021, Richard Biener <richard.guenther@gmail.com> wrote:
>
> >> >  b) if expansion would use BY_PIECES then expand to an unrolled loop
> >>
> >> Why would that be better than keeping the constant-length memset call,
> >> that would be turned into an unrolled loop during expand?
>
> > Well, because of the possibly lost ctz and alignment info.
>
> Funny you should mention that.  I got started with the expand-time
> expansion yesterday, and found out that we're not using the alignment
> information that is available.  Though the pointer is known to point to
> an aligned object, we are going for 8-bit alignment for some reason.
>
> The strategy I used there was to first check whether by_pieces would
> expand inline a constant length near the max known length, then loop
> over the bits in the variable length, expand in each iteration a
> constant-length store-by-pieces for the fixed length corresponding to
> that bit, and a test comparing the variable length with the fixed length
> guarding the expansion of the store-by-pieces.  We may get larger code
> this way (no loops), but only O(log(len)) compares.
>
> I've also fixed some bugs in the ldist expander, so now it bootstraps,
> but with a few regressions in the testsuite, that I'm yet to look into.
>
> >> Uhh, thanks, but...  you realize nearly all of the gimple-building code
> >> is one and the same for the loop and for trailing count misalignment?
>
> > Sorry, the code lacked comments and so I didn't actually try decipering
> > the code you generate ;)
>
> Oh, come on, it was planly obscure ;-D
>
> Sorry for posting an early-draft before polishing it up.
>
> > The original motivation was really that esp. for small trip count loops
> > the target knows best how to implement them.  Now, that completely
> > fails of course in case the target doesn't implement any of this or
> > the generic code fails because we lost ctz and alignment info.
>
> In our case, generic code fails because it won't handle variable-sized
> clear-by-pieces.  But then, I found out, when it's fixed-size, it also
> makes the code worse, because it seems to expand to byte stores even
> when the store-to object is known to have wider alignment:
>
> union u {
>   long long i;
>   char c[8];
> } x[8];
> int s(union u *p, int k) {
>   for (int i = k ? 0 : 3; i < 8; i++) {
>     for (int j = 0; j < 8; j++) {
>       p[i].c[j] = 0;
>     } // becomes a memset to an 8-byte-aligned 8-byte object, then 8 byte stores
>   }
> }

On x86_64 I see two DImode stores generated, but that appears to be
done via setmem.

> >> > I think the builtins with alignment and calloc-style element count
> >> > will be useful on its own.
> >>
> >> Oh, I see, you're suggesting actual separate builtin functions.  Uhh...
> >> I'm not sure I want to go there.  I'd much rather recover the ctz of the
> >> length, and use it in existing code.
>
> > Yeah, but when we generate memcpy there might not be a way to
> > store the ctz info until RTL expansion where the magic should really happen ...
>
> True.  It can be recovered without much difficulty in the cases I've
> looked at, but it could be lost in others.
>
> > So I'd say go for improving RTL expansion.
>
> 'k, thanks
>
> --
> Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
>    Free Software Activist         GNU Toolchain Engineer
>         Vim, Vi, Voltei pro Emacs -- GNUlius Caesar

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] test builtin ratio for loop distribution
  2021-02-05  0:13 ` Jim Wilson
@ 2021-02-11 10:11   ` Alexandre Oliva
  0 siblings, 0 replies; 26+ messages in thread
From: Alexandre Oliva @ 2021-02-11 10:11 UTC (permalink / raw)
  To: Jim Wilson; +Cc: GCC Patches, Zdenek Dvorak

On Feb  4, 2021, Jim Wilson <jimw@sifive.com> wrote:

> FYI we have a bug report for this for a coremark regression which sounds
> like the same problem.
>      https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Indeed, thanks!

-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist         GNU Toolchain Engineer
        Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


* Re: [RFC] test builtin ratio for loop distribution
  2021-02-04 22:17               ` Alexandre Oliva
  2021-02-05  8:02                 ` Richard Biener
@ 2021-02-11 10:19                 ` Alexandre Oliva
  2021-02-11 12:14                   ` Alexandre Oliva
  2021-02-12 11:34                   ` Richard Biener
  1 sibling, 2 replies; 26+ messages in thread
From: Alexandre Oliva @ 2021-02-11 10:19 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jan Hubicka, GCC Patches, Zdenek Dvorak

On Feb  4, 2021, Alexandre Oliva <oliva@adacore.com> wrote:

> On Feb  4, 2021, Richard Biener <richard.guenther@gmail.com> wrote:
>>> >  b) if expansion would use BY_PIECES then expand to an unrolled loop
>>> 
>>> Why would that be better than keeping the constant-length memset call,
>>> that would be turned into an unrolled loop during expand?

>> Well, because of the possibly lost ctz and alignment info.

> Funny you should mention that.  I got started with the expand-time
> expansion yesterday, and found out that we're not using the alignment
> information that is available.  Though the pointer is known to point to
> an aligned object, we are going for 8-bit alignment for some reason.

> The strategy I used there was to first check whether by_pieces would
> expand inline a constant length near the max known length, then loop
> over the bits in the variable length, expand in each iteration a
> constant-length store-by-pieces for the fixed length corresponding to
> that bit, and a test comparing the variable length with the fixed length
> guarding the expansion of the store-by-pieces.  We may get larger code
> this way (no loops), but only O(log(len)) compares.

> I've also fixed some bugs in the ldist expander, so now it bootstraps,
> but with a few regressions in the testsuite, that I'm yet to look into.

A few more fixes later, this seems to work.

It encompasses the patch to recover tree_ctz from a MULT_EXPR def,
adds code to set up the alignment data for the ldist-generated memset
dest, and then expands memset as described above when the length is not
constant, setmem is not available, but the by-pieces machinery would
still be used for nearby constants.

How does this look?


[PR94092] introduce try store by multiple pieces

From: Alexandre Oliva <oliva@adacore.com>

The ldist pass turns even very short loops into memset calls.  E.g.,
the TFmode emulation calls end with a loop of up to 3 iterations, to
zero out trailing words, and the loop distribution pass turns them
into calls of the memset builtin.

Though short constant-length memsets are usually dealt with
efficiently, for non-constant-length ones, the options are setmemM, or
a function call.

RISC-V doesn't have any setmemM pattern, so the loops above end up
"optimized" into memset calls, incurring not only the overhead of an
explicit call, but also discarding the information the compiler has
about the alignment of the destination, and that the length is a
multiple of the word alignment.

This patch handles variable lengths with multiple conditional
power-of-2-constant-sized stores-by-pieces, so as to reduce the
overhead of length compares.

It also propagates the alignment of the memset dest, introduced by
loop-distribution, to the SSA pointer used to hold it, so that it is
not discarded right away, and may be recovered by the memset builtin
expander.

Finally, it computes the ctz of the length, and uses it to omit blocks
for stores of small byte counts when we're storing whole words.
tree_ctz is improved so as to look at the length DEF stmt, and zero
out the least-significant bits if it's a multiply by a power of two.
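
For instance (illustrative GIMPLE only), given

  _3 = _2 * 8;
  __builtin_memset (ptr_5, 0, _3);

tree_ctz for _3 now reports at least 3, so the expansion can omit the
conditional 1-, 2- and 4-byte trailing stores.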


for  gcc/ChangeLog

	PR tree-optimization/94092
	* builtins.c (try_store_by_multiple_pieces): New.
	(expand_builtin_memset_args): Use it.  If target_char_cast
	fails, proceed as for non-constant val.  Pass len's ctz to...
	* expr.c (clear_storage_hints): ... this.  Try store by
	multiple pieces after setmem.
	(clear_storage): Adjust.
	* expr.h (clear_storage_hints): Likewise.
	(try_store_by_multiple_pieces): Declare.
	* tree-loop-distribution.c: Include builtins.h.
	(generate_memset_builtin): Propagate dst_base alignment to mem.
	* tree-ssanames.c (get_nonzero_bits): Zero out low bits of
	integral types, when a MULT_EXPR INTEGER_CST operand ensures
	the result will be a multiple of a power of two.
---
 gcc/builtins.c               |  113 +++++++++++++++++++++++++++++++++++++++---
 gcc/expr.c                   |    9 +++
 gcc/expr.h                   |   12 ++++
 gcc/tree-loop-distribution.c |    9 +++
 gcc/tree-ssanames.c          |   23 ++++++++-
 5 files changed, 154 insertions(+), 12 deletions(-)

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 0aed008687ccc..44f3af92536a5 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -6600,6 +6600,97 @@ expand_builtin_memset (tree exp, rtx target, machine_mode mode)
   return expand_builtin_memset_args (dest, val, len, target, mode, exp);
 }
 
+/* Try to store VAL (or, if NULL_RTX, VALC) in LEN bytes starting at TO.
+   Return TRUE if successful, FALSE otherwise.  TO is assumed to be
+   aligned at an ALIGN boundary.  LEN is assumed to be a multiple of
+   1<<CTZ_LEN, and between min_len and max_len.
+
+   The strategy is to issue one store_by_pieces for each power of two,
+   for most to least significant, guarded by a test on whether there are
+   at least that many bytes to copy.  */
+
+bool
+try_store_by_multiple_pieces (rtx to, rtx len, int ctz_len,
+			      unsigned HOST_WIDE_INT min_len,
+			      unsigned HOST_WIDE_INT max_len,
+			      rtx val, char valc, unsigned int align)
+{
+  int max_bits = floor_log2 (max_len);
+  int min_bits = floor_log2 (min_len);
+
+  if (val)
+    valc = 1;
+
+  if (!can_store_by_pieces ((HOST_WIDE_INT_1U << max_bits) * 2
+			    - (HOST_WIDE_INT_1U << ctz_len),
+			    builtin_memset_read_str, &valc, align, true))
+    return false;
+
+  rtx (*constfun) (void *, HOST_WIDE_INT, scalar_int_mode);
+  void *constfundata;
+  if (val)
+    {
+      constfun = builtin_memset_gen_str;
+      constfundata = val = force_reg (TYPE_MODE (unsigned_char_type_node),
+				      val);
+    }
+  else
+    {
+      constfun = builtin_memset_read_str;
+      constfundata = &valc;
+    }
+
+  /* Bits above SHR_BITS don't need to be tested.  */
+  int shr_bits = (max_bits != min_bits
+		  ? max_bits
+		  : floor_log2 (max_len ^ min_len));
+
+  rtx ptr = copy_addr_to_reg (convert_to_mode (ptr_mode, XEXP (to, 0), 0));
+  rtx rem = copy_to_mode_reg (ptr_mode, convert_to_mode (ptr_mode, len, 0));
+  set_mem_align (to, align);
+
+  for (int i = max_bits; i >= ctz_len; i--)
+    {
+      unsigned HOST_WIDE_INT blksize = HOST_WIDE_INT_1U << i;
+      rtx_code_label *label = NULL;
+
+      if (i <= shr_bits)
+	{
+	  label = gen_label_rtx ();
+	  emit_cmp_and_jump_insns (rem, GEN_INT (blksize), LT, NULL,
+				   ptr_mode, 1, label,
+				   profile_probability::even ());
+	}
+      else if ((max_len & blksize) == 0)
+	continue;
+
+      if (align > blksize * BITS_PER_UNIT)
+	{
+	  align = blksize * BITS_PER_UNIT;
+	  set_mem_align (to, align);
+	}
+
+      to = store_by_pieces (to, blksize,
+			    constfun, constfundata,
+			    align, true, RETURN_END);
+
+      if (label)
+	{
+	  emit_move_insn (ptr, XEXP (to, 0));
+	  XEXP (to, 0) = ptr;
+
+	  clear_mem_offset (to);
+	  clear_mem_size (to);
+
+	  emit_move_insn (rem, plus_constant (ptr_mode, rem, -blksize));
+
+	  emit_label (label);
+	}
+    }
+
+  return true;
+}
+
 /* Helper function to do the actual work for expand_builtin_memset.  The
    arguments to the builtin_memset call DEST, VAL, and LEN are broken out
    so that this can also be called without constructing an actual CALL_EXPR.
@@ -6654,7 +6745,8 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
   dest_mem = get_memory_rtx (dest, len);
   val_mode = TYPE_MODE (unsigned_char_type_node);
 
-  if (TREE_CODE (val) != INTEGER_CST)
+  if (TREE_CODE (val) != INTEGER_CST
+      || target_char_cast (val, &c))
     {
       rtx val_rtx;
 
@@ -6678,7 +6770,12 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
       else if (!set_storage_via_setmem (dest_mem, len_rtx, val_rtx,
 					dest_align, expected_align,
 					expected_size, min_size, max_size,
-					probable_max_size))
+					probable_max_size)
+	       && !try_store_by_multiple_pieces (dest_mem, len_rtx,
+						 tree_ctz (len),
+						 min_size, max_size,
+						 val_rtx, 0,
+						 dest_align))
 	goto do_libcall;
 
       dest_mem = force_operand (XEXP (dest_mem, 0), NULL_RTX);
@@ -6686,9 +6783,6 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
       return dest_mem;
     }
 
-  if (target_char_cast (val, &c))
-    goto do_libcall;
-
   if (c)
     {
       if (tree_fits_uhwi_p (len)
@@ -6702,7 +6796,12 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
 					gen_int_mode (c, val_mode),
 					dest_align, expected_align,
 					expected_size, min_size, max_size,
-					probable_max_size))
+					probable_max_size)
+	       && !try_store_by_multiple_pieces (dest_mem, len_rtx,
+						 tree_ctz (len),
+						 min_size, max_size,
+						 NULL_RTX, c,
+						 dest_align))
 	goto do_libcall;
 
       dest_mem = force_operand (XEXP (dest_mem, 0), NULL_RTX);
@@ -6716,7 +6815,7 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
 				   ? BLOCK_OP_TAILCALL : BLOCK_OP_NORMAL,
 				   expected_align, expected_size,
 				   min_size, max_size,
-				   probable_max_size);
+				   probable_max_size, tree_ctz (len));
 
   if (dest_addr == 0)
     {
diff --git a/gcc/expr.c b/gcc/expr.c
index 04ef5ad114d06..b212e9ac575a7 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -3056,7 +3056,8 @@ clear_storage_hints (rtx object, rtx size, enum block_op_methods method,
 		     unsigned int expected_align, HOST_WIDE_INT expected_size,
 		     unsigned HOST_WIDE_INT min_size,
 		     unsigned HOST_WIDE_INT max_size,
-		     unsigned HOST_WIDE_INT probable_max_size)
+		     unsigned HOST_WIDE_INT probable_max_size,
+		     unsigned ctz_size)
 {
   machine_mode mode = GET_MODE (object);
   unsigned int align;
@@ -3103,6 +3104,10 @@ clear_storage_hints (rtx object, rtx size, enum block_op_methods method,
 				   expected_align, expected_size,
 				   min_size, max_size, probable_max_size))
     ;
+  else if (try_store_by_multiple_pieces (object, size, ctz_size,
+					 min_size, max_size,
+					 NULL_RTX, 0, align))
+    ;
   else if (ADDR_SPACE_GENERIC_P (MEM_ADDR_SPACE (object)))
     return set_storage_via_libcall (object, size, const0_rtx,
 				    method == BLOCK_OP_TAILCALL);
@@ -3120,7 +3125,7 @@ clear_storage (rtx object, rtx size, enum block_op_methods method)
     min = max = UINTVAL (size);
   else
     max = GET_MODE_MASK (GET_MODE (size));
-  return clear_storage_hints (object, size, method, 0, -1, min, max, max);
+  return clear_storage_hints (object, size, method, 0, -1, min, max, max, 0);
 }
 
 
diff --git a/gcc/expr.h b/gcc/expr.h
index 1f0177a4cfa5d..6ff2384df63f8 100644
--- a/gcc/expr.h
+++ b/gcc/expr.h
@@ -193,7 +193,8 @@ extern rtx clear_storage_hints (rtx, rtx, enum block_op_methods,
 			        unsigned int, HOST_WIDE_INT,
 				unsigned HOST_WIDE_INT,
 				unsigned HOST_WIDE_INT,
-				unsigned HOST_WIDE_INT);
+				unsigned HOST_WIDE_INT,
+				unsigned);
 /* The same, but always output an library call.  */
 extern rtx set_storage_via_libcall (rtx, rtx, rtx, bool = false);
 
@@ -224,6 +225,15 @@ extern int can_store_by_pieces (unsigned HOST_WIDE_INT,
 extern rtx store_by_pieces (rtx, unsigned HOST_WIDE_INT, by_pieces_constfn,
 			    void *, unsigned int, bool, memop_ret);
 
+/* If can_store_by_pieces passes for worst-case values near MAX_LEN, call
+   store_by_pieces within conditionals so as to handle variable LEN efficiently,
+   storing VAL, if non-NULL_RTX, or valc instead.  */
+extern bool try_store_by_multiple_pieces (rtx to, rtx len, int ctz_len,
+					  unsigned HOST_WIDE_INT min_len,
+					  unsigned HOST_WIDE_INT max_len,
+					  rtx val, char valc,
+					  unsigned int align);
+
 /* Emit insns to set X from Y.  */
 extern rtx_insn *emit_move_insn (rtx, rtx);
 extern rtx_insn *gen_move_insn (rtx, rtx);
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index bb15fd3723fb6..a64b89037a724 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -115,6 +115,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-vectorizer.h"
 #include "tree-eh.h"
 #include "gimple-fold.h"
+#include "builtins.h"
 
 
 #define MAX_DATAREFS_NUM \
@@ -1155,6 +1156,14 @@ generate_memset_builtin (class loop *loop, partition *partition)
   mem = force_gimple_operand_gsi (&gsi, mem, true, NULL_TREE,
 				  false, GSI_CONTINUE_LINKING);
 
+  if (TREE_CODE (mem) == SSA_NAME)
+    if (ptr_info_def *pi = get_ptr_info (mem))
+      {
+	unsigned al = get_pointer_alignment (builtin->dst_base);
+	if (al > pi->align || pi->misalign)
+	  set_ptr_info_alignment (pi, al, 0);
+      }
+
   /* This exactly matches the pattern recognition in classify_partition.  */
   val = gimple_assign_rhs1 (DR_STMT (builtin->dst_dr));
   /* Handle constants like 0x15151515 and similarly
diff --git a/gcc/tree-ssanames.c b/gcc/tree-ssanames.c
index 51a26d2fce1c2..c4b5bf2a4999a 100644
--- a/gcc/tree-ssanames.c
+++ b/gcc/tree-ssanames.c
@@ -546,10 +546,29 @@ get_nonzero_bits (const_tree name)
     }
 
   range_info_def *ri = SSA_NAME_RANGE_INFO (name);
+  wide_int ret;
   if (!ri)
-    return wi::shwi (-1, precision);
+    ret = wi::shwi (-1, precision);
+  else
+    ret = ri->get_nonzero_bits ();
+
+  /* If NAME is defined as a multiple of a constant C, we know the ctz(C) low
+     bits are zero.  ??? Should we handle LSHIFT_EXPR too?  Non-constants,
+     e.g. the minimum shift count, and ctz from both MULT_EXPR operands?  That
+     could make for deep recursion.  */
+  if (INTEGRAL_TYPE_P (TREE_TYPE (name))
+      && SSA_NAME_DEF_STMT (name)
+      && is_gimple_assign (SSA_NAME_DEF_STMT (name))
+      && gimple_assign_rhs_code (SSA_NAME_DEF_STMT (name)) == MULT_EXPR
+      && TREE_CODE (gimple_assign_rhs2 (SSA_NAME_DEF_STMT (name))) == INTEGER_CST)
+    {
+      unsigned HOST_WIDE_INT bits
+	= tree_ctz (gimple_assign_rhs2 (SSA_NAME_DEF_STMT (name)));
+      wide_int mask = wi::shwi (-1, precision) << bits;
+      ret &= mask;
+    }
 
-  return ri->get_nonzero_bits ();
+  return ret;
 }
 
 /* Return TRUE is OP, an SSA_NAME has a range of values [0..1], false


-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist         GNU Toolchain Engineer
        Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


* Re: [RFC] test builtin ratio for loop distribution
  2021-02-11 10:19                 ` Alexandre Oliva
@ 2021-02-11 12:14                   ` Alexandre Oliva
  2021-02-12 11:34                   ` Richard Biener
  1 sibling, 0 replies; 26+ messages in thread
From: Alexandre Oliva @ 2021-02-11 12:14 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jan Hubicka, GCC Patches, Zdenek Dvorak

On Feb 11, 2021, Alexandre Oliva <oliva@adacore.com> wrote:

> How does this look?


> for  gcc/ChangeLog

> 	PR tree-optimization/94092
> 	* builtins.c (try_store_by_multiple_pieces): New.
> 	(expand_builtin_memset_args): Use it.  If target_char_cast
> 	fails, proceed as for non-constant val.  Pass len's ctz to...
> 	* expr.c (clear_storage_hints): ... this.  Try store by
> 	multiple pieces after setmem.
> 	(clear_storage): Adjust.
> 	* expr.h (clear_storage_hints): Likewise.
> 	(try_store_by_multiple_pieces): Declare.
> 	* tree-loop-distribution.c: Include builtins.h.
> 	(generate_memset_builtin): Propagate dst_base alignmen to mem.
> 	* tree-ssanames.c (get_nonzero_bits): Zero out low bits of
> 	integral types, when a MULT_EXPR INTEGER_CST operand ensures
> 	the result will be a multiple of a power of two.

I forgot to mention this passed regstrap on x86_64-linux-gnu, as well as
some cross testing of riscv32-elf.

I've also regstrapped it on x86_64-linux-gnu along with a patch, for
testing purposes, that tries try_store_by_multiple_pieces before setmem
in all 3 locations where they are called, which gives me some confidence
that the implementation is reasonably robust.


Is this ok to install?  (if not right now, perhaps in stage1)

-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist         GNU Toolchain Engineer
        Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


* Re: [RFC] test builtin ratio for loop distribution
  2021-02-11 10:19                 ` Alexandre Oliva
  2021-02-11 12:14                   ` Alexandre Oliva
@ 2021-02-12 11:34                   ` Richard Biener
  2021-02-16  4:56                     ` Alexandre Oliva
  1 sibling, 1 reply; 26+ messages in thread
From: Richard Biener @ 2021-02-12 11:34 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: Jan Hubicka, GCC Patches, Zdenek Dvorak

On Thu, Feb 11, 2021 at 11:19 AM Alexandre Oliva <oliva@adacore.com> wrote:
>
> On Feb  4, 2021, Alexandre Oliva <oliva@adacore.com> wrote:
>
> > On Feb  4, 2021, Richard Biener <richard.guenther@gmail.com> wrote:
> >>> >  b) if expansion would use BY_PIECES then expand to an unrolled loop
> >>>
> >>> Why would that be better than keeping the constant-length memset call,
> >>> that would be turned into an unrolled loop during expand?
>
> >> Well, because of the possibly lost ctz and alignment info.
>
> > Funny you should mention that.  I got started with the expand-time
> > expansion yesterday, and found out that we're not using the alignment
> > information that is available.  Though the pointer is known to point to
> > an aligned object, we are going for 8-bit alignment for some reason.
>
> > The strategy I used there was to first check whether by_pieces would
> > expand inline a constant length near the max known length, then loop
> > over the bits in the variable length, expand in each iteration a
> > constant-length store-by-pieces for the fixed length corresponding to
> > that bit, and a test comparing the variable length with the fixed length
> > guarding the expansion of the store-by-pieces.  We may get larger code
> > this way (no loops), but only O(log(len)) compares.
>
> > I've also fixed some bugs in the ldist expander, so now it bootstraps,
> > but with a few regressions in the testsuite, that I'm yet to look into.
>
> A few more fixes later, this seems to work.
>
> It encompasses the patch to recover tree_ctz from a mult_expr def, it
> adds code to set up the alignment data for the ldist-generated memset
> dest, and then it expands memset as described above if length is not
> constant, setmem is not available, but the by-pieces machinery would
> still be used for nearby constants.
>
> How does this look?

Overall it looks good - I can't really review the RTL code generation
parts because I'm not too familiar with the usual pitfalls there.

Some comments below.

>
> [PR94092] introduce try store by multiple pieces
>
> From: Alexandre Oliva <oliva@adacore.com>
>
> The ldist pass turns even very short loops into memset calls.  E.g.,
> the TFmode emulation calls end with a loop of up to 3 iterations, to
> zero out trailing words, and the loop distribution pass turns them
> into calls of the memset builtin.
>
> Though short constant-length memsets are usually dealt with
> efficiently, for non-constant-length ones, the options are setmemM, or
> a function calls.
>
> RISC-V doesn't have any setmemM pattern, so the loops above end up
> "optimized" into memset calls, incurring not only the overhead of an
> explicit call, but also discarding the information the compiler has
> about the alignment of the destination, and that the length is a
> multiple of the word alignment.
>
> This patch handles variable lengths with multiple conditional
> power-of-2-constant-sized stores-by-pieces, so as to reduce the
> overhead of length compares.
>
> It also propagates the alignment of the memset dest, introduced by
> loop-distribution, to the SSA pointer used to hold it, so that it is
> not discarded right away, and may be recovered by the memset builtin
> expander.
>
> Finally, it computes the ctz of the length, and uses it to omit blocks
> for stores of small byte counts when we're storing whole words.
> tree_ctz is improved so as to look at the length DEF stmt, and zero
> out the least-significant bits if it's a multiply by a power of two.
>
>
> for  gcc/ChangeLog
>
>         PR tree-optimization/94092
>         * builtins.c (try_store_by_multiple_pieces): New.
>         (expand_builtin_memset_args): Use it.  If target_char_cast
>         fails, proceed as for non-constant val.  Pass len's ctz to...
>         * expr.c (clear_storage_hints): ... this.  Try store by
>         multiple pieces after setmem.
>         (clear_storage): Adjust.
>         * expr.h (clear_storage_hints): Likewise.
>         (try_store_by_multiple_pieces): Declare.
>         * tree-loop-distribution.c: Include builtins.h.
>         (generate_memset_builtin): Propagate dst_base alignmen to mem.
>         * tree-ssanames.c (get_nonzero_bits): Zero out low bits of
>         integral types, when a MULT_EXPR INTEGER_CST operand ensures
>         the result will be a multiple of a power of two.
> ---
>  gcc/builtins.c               |  113 +++++++++++++++++++++++++++++++++++++++---
>  gcc/expr.c                   |    9 +++
>  gcc/expr.h                   |   12 ++++
>  gcc/tree-loop-distribution.c |    9 +++
>  gcc/tree-ssanames.c          |   23 ++++++++-
>  5 files changed, 154 insertions(+), 12 deletions(-)
>
> diff --git a/gcc/builtins.c b/gcc/builtins.c
> index 0aed008687ccc..44f3af92536a5 100644
> --- a/gcc/builtins.c
> +++ b/gcc/builtins.c
> @@ -6600,6 +6600,97 @@ expand_builtin_memset (tree exp, rtx target, machine_mode mode)
>    return expand_builtin_memset_args (dest, val, len, target, mode, exp);
>  }
>
> +/* Try to store VAL (or, if NULL_RTX, VALC) in LEN bytes starting at TO.
> +   Return TRUE if successful, FALSE otherwise.  TO is assumed to be
> +   aligned at an ALIGN boundary.  LEN is assumed to be a multiple of
> +   1<<CTZ_LEN, and between min_len and max_len.
> +
> +   The strategy is to issue one store_by_pieces for each power of two,
> +   for most to least significant, guarded by a test on whether there are
> +   at least that many bytes to copy.  */
> +
> +bool
> +try_store_by_multiple_pieces (rtx to, rtx len, int ctz_len,
> +                             unsigned HOST_WIDE_INT min_len,
> +                             unsigned HOST_WIDE_INT max_len,
> +                             rtx val, char valc, unsigned int align)
> +{
> +  int max_bits = floor_log2 (max_len);
> +  int min_bits = floor_log2 (min_len);
> +
> +  if (val)
> +    valc = 1;
> +
> +  if (!can_store_by_pieces ((HOST_WIDE_INT_1U << max_bits) * 2
> +                           - (HOST_WIDE_INT_1U << ctz_len),
> +                           builtin_memset_read_str, &valc, align, true))
> +    return false;
> +
> +  rtx (*constfun) (void *, HOST_WIDE_INT, scalar_int_mode);
> +  void *constfundata;
> +  if (val)
> +    {
> +      constfun = builtin_memset_gen_str;
> +      constfundata = val = force_reg (TYPE_MODE (unsigned_char_type_node),
> +                                     val);
> +    }
> +  else
> +    {
> +      constfun = builtin_memset_read_str;
> +      constfundata = &valc;
> +    }
> +
> +  /* Bits above SHR_BITS don't need to be tested.  */
> +  int shr_bits = (max_bits != min_bits
> +                 ? max_bits
> +                 : floor_log2 (max_len ^ min_len));
> +
> +  rtx ptr = copy_addr_to_reg (convert_to_mode (ptr_mode, XEXP (to, 0), 0));
> +  rtx rem = copy_to_mode_reg (ptr_mode, convert_to_mode (ptr_mode, len, 0));
> +  set_mem_align (to, align);
> +
> +  for (int i = max_bits; i >= ctz_len; i--)
> +    {
> +      unsigned HOST_WIDE_INT blksize = HOST_WIDE_INT_1U << i;
> +      rtx_code_label *label = NULL;
> +
> +      if (i <= shr_bits)
> +       {
> +         label = gen_label_rtx ();
> +         emit_cmp_and_jump_insns (rem, GEN_INT (blksize), LT, NULL,
> +                                  ptr_mode, 1, label,
> +                                  profile_probability::even ());
> +       }
> +      else if ((max_len & blksize) == 0)
> +       continue;
> +
> +      if (align > blksize * BITS_PER_UNIT)
> +       {
> +         align = blksize * BITS_PER_UNIT;
> +         set_mem_align (to, align);
> +       }
> +
> +      to = store_by_pieces (to, blksize,
> +                           constfun, constfundata,
> +                           align, true, RETURN_END);
> +
> +      if (label)
> +       {
> +         emit_move_insn (ptr, XEXP (to, 0));
> +         XEXP (to, 0) = ptr;
> +
> +         clear_mem_offset (to);
> +         clear_mem_size (to);
> +
> +         emit_move_insn (rem, plus_constant (ptr_mode, rem, -blksize));
> +
> +         emit_label (label);
> +       }
> +    }
> +
> +  return true;
> +}
> +
>  /* Helper function to do the actual work for expand_builtin_memset.  The
>     arguments to the builtin_memset call DEST, VAL, and LEN are broken out
>     so that this can also be called without constructing an actual CALL_EXPR.
> @@ -6654,7 +6745,8 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>    dest_mem = get_memory_rtx (dest, len);
>    val_mode = TYPE_MODE (unsigned_char_type_node);
>
> -  if (TREE_CODE (val) != INTEGER_CST)
> +  if (TREE_CODE (val) != INTEGER_CST
> +      || target_char_cast (val, &c))
>      {
>        rtx val_rtx;
>
> @@ -6678,7 +6770,12 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>        else if (!set_storage_via_setmem (dest_mem, len_rtx, val_rtx,
>                                         dest_align, expected_align,
>                                         expected_size, min_size, max_size,
> -                                       probable_max_size))
> +                                       probable_max_size)
> +              && !try_store_by_multiple_pieces (dest_mem, len_rtx,
> +                                                tree_ctz (len),
> +                                                min_size, max_size,
> +                                                val_rtx, 0,
> +                                                dest_align))
>         goto do_libcall;
>
>        dest_mem = force_operand (XEXP (dest_mem, 0), NULL_RTX);
> @@ -6686,9 +6783,6 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>        return dest_mem;
>      }
>
> -  if (target_char_cast (val, &c))
> -    goto do_libcall;
> -
>    if (c)
>      {
>        if (tree_fits_uhwi_p (len)
> @@ -6702,7 +6796,12 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>                                         gen_int_mode (c, val_mode),
>                                         dest_align, expected_align,
>                                         expected_size, min_size, max_size,
> -                                       probable_max_size))
> +                                       probable_max_size)
> +              && !try_store_by_multiple_pieces (dest_mem, len_rtx,
> +                                                tree_ctz (len),
> +                                                min_size, max_size,
> +                                                NULL_RTX, c,
> +                                                dest_align))
>         goto do_libcall;
>
>        dest_mem = force_operand (XEXP (dest_mem, 0), NULL_RTX);
> @@ -6716,7 +6815,7 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>                                    ? BLOCK_OP_TAILCALL : BLOCK_OP_NORMAL,
>                                    expected_align, expected_size,
>                                    min_size, max_size,
> -                                  probable_max_size);
> +                                  probable_max_size, tree_ctz (len));
>
>    if (dest_addr == 0)
>      {
> diff --git a/gcc/expr.c b/gcc/expr.c
> index 04ef5ad114d06..b212e9ac575a7 100644
> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -3056,7 +3056,8 @@ clear_storage_hints (rtx object, rtx size, enum block_op_methods method,
>                      unsigned int expected_align, HOST_WIDE_INT expected_size,
>                      unsigned HOST_WIDE_INT min_size,
>                      unsigned HOST_WIDE_INT max_size,
> -                    unsigned HOST_WIDE_INT probable_max_size)
> +                    unsigned HOST_WIDE_INT probable_max_size,
> +                    unsigned ctz_size)
>  {
>    machine_mode mode = GET_MODE (object);
>    unsigned int align;
> @@ -3103,6 +3104,10 @@ clear_storage_hints (rtx object, rtx size, enum block_op_methods method,
>                                    expected_align, expected_size,
>                                    min_size, max_size, probable_max_size))
>      ;
> +  else if (try_store_by_multiple_pieces (object, size, ctz_size,
> +                                        min_size, max_size,
> +                                        NULL_RTX, 0, align))
> +    ;
>    else if (ADDR_SPACE_GENERIC_P (MEM_ADDR_SPACE (object)))
>      return set_storage_via_libcall (object, size, const0_rtx,
>                                     method == BLOCK_OP_TAILCALL);
> @@ -3120,7 +3125,7 @@ clear_storage (rtx object, rtx size, enum block_op_methods method)
>      min = max = UINTVAL (size);
>    else
>      max = GET_MODE_MASK (GET_MODE (size));
> -  return clear_storage_hints (object, size, method, 0, -1, min, max, max);
> +  return clear_storage_hints (object, size, method, 0, -1, min, max, max, 0);
>  }
>
>
> diff --git a/gcc/expr.h b/gcc/expr.h
> index 1f0177a4cfa5d..6ff2384df63f8 100644
> --- a/gcc/expr.h
> +++ b/gcc/expr.h
> @@ -193,7 +193,8 @@ extern rtx clear_storage_hints (rtx, rtx, enum block_op_methods,
>                                 unsigned int, HOST_WIDE_INT,
>                                 unsigned HOST_WIDE_INT,
>                                 unsigned HOST_WIDE_INT,
> -                               unsigned HOST_WIDE_INT);
> +                               unsigned HOST_WIDE_INT,
> +                               unsigned);
>  /* The same, but always output an library call.  */
>  extern rtx set_storage_via_libcall (rtx, rtx, rtx, bool = false);
>
> @@ -224,6 +225,15 @@ extern int can_store_by_pieces (unsigned HOST_WIDE_INT,
>  extern rtx store_by_pieces (rtx, unsigned HOST_WIDE_INT, by_pieces_constfn,
>                             void *, unsigned int, bool, memop_ret);
>
> +/* If can_store_by_pieces passes for worst-case values near MAX_LEN, call
> +   store_by_pieces within conditionals so as to handle variable LEN efficiently,
> +   storing VAL, if non-NULL_RTX, or valc instead.  */
> +extern bool try_store_by_multiple_pieces (rtx to, rtx len, int ctz_len,
> +                                         unsigned HOST_WIDE_INT min_len,
> +                                         unsigned HOST_WIDE_INT max_len,
> +                                         rtx val, char valc,
> +                                         unsigned int align);
> +
>  /* Emit insns to set X from Y.  */
>  extern rtx_insn *emit_move_insn (rtx, rtx);
>  extern rtx_insn *gen_move_insn (rtx, rtx);
> diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> index bb15fd3723fb6..a64b89037a724 100644
> --- a/gcc/tree-loop-distribution.c
> +++ b/gcc/tree-loop-distribution.c
> @@ -115,6 +115,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-vectorizer.h"
>  #include "tree-eh.h"
>  #include "gimple-fold.h"
> +#include "builtins.h"
>
>
>  #define MAX_DATAREFS_NUM \
> @@ -1155,6 +1156,14 @@ generate_memset_builtin (class loop *loop, partition *partition)
>    mem = force_gimple_operand_gsi (&gsi, mem, true, NULL_TREE,
>                                   false, GSI_CONTINUE_LINKING);
>
> +  if (TREE_CODE (mem) == SSA_NAME)
> +    if (ptr_info_def *pi = get_ptr_info (mem))
> +      {
> +       unsigned al = get_pointer_alignment (builtin->dst_base);
> +       if (al > pi->align || pi->misalign)

We still might prefer pi->align == 64 and pi->misalign == 32 over al == 16
so maybe factor that in, too.
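
Something like (rough sketch only; misal here would come from
get_pointer_alignment_1, and I'm glossing over the bits-vs-bytes units
of the two interfaces):

  unsigned old_eff = pi->misalign ? (pi->misalign & -pi->misalign) : pi->align;
  unsigned new_eff = misal ? (misal & -misal) : al;
  if (new_eff > old_eff)
    set_ptr_info_alignment (pi, al, misal);

i.e. keep whichever pair guarantees the larger effective alignment.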

Also what's still lost is the alignment constraint the original access has
(thus also the base pointer must have).  Maybe store this info in
loop distribution's builtin_info and compute it from the data reference,
for example via get_object_alignment (DR_REF (dr)).

> +         set_ptr_info_alignment (pi, al, 0);
> +      }
> +
>    /* This exactly matches the pattern recognition in classify_partition.  */
>    val = gimple_assign_rhs1 (DR_STMT (builtin->dst_dr));
>    /* Handle constants like 0x15151515 and similarly
> diff --git a/gcc/tree-ssanames.c b/gcc/tree-ssanames.c
> index 51a26d2fce1c2..c4b5bf2a4999a 100644
> --- a/gcc/tree-ssanames.c
> +++ b/gcc/tree-ssanames.c
> @@ -546,10 +546,29 @@ get_nonzero_bits (const_tree name)
>      }
>
>    range_info_def *ri = SSA_NAME_RANGE_INFO (name);
> +  wide_int ret;
>    if (!ri)
> -    return wi::shwi (-1, precision);
> +    ret = wi::shwi (-1, precision);
> +  else
> +    ret = ri->get_nonzero_bits ();
> +
> +  /* If NAME is defined as a multiple of a constant C, we know the ctz(C) low
> +     bits are zero.  ??? Should we handle LSHIFT_EXPR too?  Non-constants,
> +     e.g. the minimum shift count, and ctz from both MULT_EXPR operands?  That
> +     could make for deep recursion.  */
> +  if (INTEGRAL_TYPE_P (TREE_TYPE (name))
> +      && SSA_NAME_DEF_STMT (name)
> +      && is_gimple_assign (SSA_NAME_DEF_STMT (name))
> +      && gimple_assign_rhs_code (SSA_NAME_DEF_STMT (name)) == MULT_EXPR
> +      && TREE_CODE (gimple_assign_rhs2 (SSA_NAME_DEF_STMT (name))) == INTEGER_CST)
> +    {
> +      unsigned HOST_WIDE_INT bits
> +       = tree_ctz (gimple_assign_rhs2 (SSA_NAME_DEF_STMT (name)));
> +      wide_int mask = wi::shwi (-1, precision) << bits;
> +      ret &= mask;
> +    }
>
> -  return ri->get_nonzero_bits ();
> +  return ret;
>  }

So I wonder whether we should instead re-run CCP after loop opts which
computes nonzero bits as well instead of the above "hack".  Would
nonzero bits be ready to compute in the above way from loop distribution?
That is, you can do set_nonzero_bits just like you did
set_ptr_info_alignment ...
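
For example (names approximate, not actual patch code):

  if (TREE_CODE (nb_bytes) == SSA_NAME)
    {
      unsigned prec = TYPE_PRECISION (TREE_TYPE (nb_bytes));
      wide_int mask = wi::shwi (-1, prec) << tree_ctz (builtin->size);
      set_nonzero_bits (nb_bytes, mask);
    }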

Since CCP also performs copy propagation an obvious candidate would be
to replace the last pass_copy_prop with pass_ccp (with a comment noting
to compute up-to-date nonzero bits info).
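
I.e., something along these lines (the surrounding passes.def context
is elided here, not shown as it actually is):

-      NEXT_PASS (pass_copy_prop);
+      /* Re-run CCP to compute alignment and nonzero bits, and to
+	 copy-propagate, for the benefit of expansion.  */
+      NEXT_PASS (pass_ccp, true /* nonzero_p */);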

Thanks,
Richard.

>  /* Return TRUE is OP, an SSA_NAME has a range of values [0..1], false
>
>
> --
> Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
>    Free Software Activist         GNU Toolchain Engineer
>         Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


* Re: [RFC] test builtin ratio for loop distribution
  2021-02-12 11:34                   ` Richard Biener
@ 2021-02-16  4:56                     ` Alexandre Oliva
  2021-02-16 10:47                       ` Alexandre Oliva
  0 siblings, 1 reply; 26+ messages in thread
From: Alexandre Oliva @ 2021-02-16  4:56 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jan Hubicka, GCC Patches, Zdenek Dvorak

On Feb 12, 2021, Richard Biener <richard.guenther@gmail.com> wrote:

>> +  if (TREE_CODE (mem) == SSA_NAME)
>> +    if (ptr_info_def *pi = get_ptr_info (mem))
>> +      {
>> +       unsigned al = get_pointer_alignment (builtin->dst_base);
>> +       if (al > pi->align || pi->misalign)

> We still might prefer pi->align == 64 and pi->misalign == 32 over al == 16
> so maybe factor that in, too.

Ugh, apologies, I somehow posted an incorrect and outdated version of
the patch.  The improved (propagates both alignments) and fixed (divides
by BITS_PER_UNIT, fixing a regression in gfortran.dg/sms-2.f90) version
had this alternate hunk as the only difference:

@@ -1155,6 +1156,16 @@ generate_memset_builtin (class loop *loop, partition *partition)
   mem = force_gimple_operand_gsi (&gsi, mem, true, NULL_TREE,
 				  false, GSI_CONTINUE_LINKING);
 
+  if (TREE_CODE (mem) == SSA_NAME)
+    if (ptr_info_def *pi = get_ptr_info (mem))
+      {
+	unsigned al;
+	unsigned HOST_WIDE_INT misal;
+	if (get_pointer_alignment_1 (builtin->dst_base, &al, &misal))
+	  set_ptr_info_alignment (pi, al / BITS_PER_UNIT,
+				  misal / BITS_PER_UNIT);
+      }
+
   /* This exactly matches the pattern recognition in classify_partition.  */
   val = gimple_assign_rhs1 (DR_STMT (builtin->dst_dr));
   /* Handle constants like 0x15151515 and similarly



> So I wonder whether we should instead re-run CCP after loop opts which
> computes nonzero bits as well instead of the above "hack".  Would
> nonzero bits be ready to compute in the above way from loop distribution?
> That is, you can do set_nonzero_bits just like you did
> set_ptr_info_alignment ...

> Since CCP also performs copy propagation an obvious candidate would be
> to replace the last pass_copy_prop with pass_ccp (with a comment noting
> to compute up-to-date nonzero bits info).

I'll look into these possibilities.


-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist         GNU Toolchain Engineer
        Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


* Re: [RFC] test builtin ratio for loop distribution
  2021-02-16  4:56                     ` Alexandre Oliva
@ 2021-02-16 10:47                       ` Alexandre Oliva
  2021-02-16 12:11                         ` Richard Biener
  0 siblings, 1 reply; 26+ messages in thread
From: Alexandre Oliva @ 2021-02-16 10:47 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jan Hubicka, GCC Patches, Zdenek Dvorak

On Feb 16, 2021, Alexandre Oliva <oliva@adacore.com> wrote:

>> So I wonder whether we should instead re-run CCP after loop opts which
>> computes nonzero bits as well instead of the above "hack".

That works.  It takes care of both the dest alignment and the len ctz.

Explicitly masking out the len tz from nonzero bits also works, but I
suppose the ccp pass does a better job, and it takes care of memcpy and
memmove as well.


I noticed that ccp releases CDI_DOMINATORS information.  It looks like
the last copy_prop pass, now replaced with ccp in my tree, was late
enough that it doesn't seem to matter, but at first I thought you meant
an earlier copy_prop, that runs before graphite, and dropping dominator
info at that point was a problem for e.g. cunroll.

-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist         GNU Toolchain Engineer
        Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


* Re: [RFC] test builtin ratio for loop distribution
  2021-02-16 10:47                       ` Alexandre Oliva
@ 2021-02-16 12:11                         ` Richard Biener
  2021-02-19  8:08                           ` [PR94092] " Alexandre Oliva
  0 siblings, 1 reply; 26+ messages in thread
From: Richard Biener @ 2021-02-16 12:11 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: Jan Hubicka, GCC Patches, Zdenek Dvorak

On Tue, Feb 16, 2021 at 11:48 AM Alexandre Oliva <oliva@adacore.com> wrote:
>
> On Feb 16, 2021, Alexandre Oliva <oliva@adacore.com> wrote:
>
> >> So I wonder whether we should instead re-run CCP after loop opts which
> >> computes nonzero bits as well instead of the above "hack".
>
> That works.  It takes care of both the dest alignment and the len ctz.
>
> Explicitly masking out the len tz from nonzero bits also works, but I
> suppose the ccp pass does a better job, and it takes care of memcpy and
> memmove as well.

It's certainly more general, yes.

> I noticed that ccp releases CDI_DOMINATORS information.  It looks like
> the last copy_prop pass, now replaced with ccp in my tree, was late
> enough that it doesn't seem to matter, but at first I thought you meant
> an earlier copy_prop, that runs before graphite, and dropping dominator
> info at that point was a problem for e.g. cunroll.

We indeed shouldn't release CDI_DOMINATORS info, but if that is a problem
elsewhere, it means the affected pass fails to compute dominator info.  But yes,
in the loop pipeline we expect dominator info to be present (but fast
query can disable itself leading to issues if doms are not recomputed).
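
(That is, roughly, any such pass should make sure dominators are
available before querying them, e.g. with

  calculate_dominance_info (CDI_DOMINATORS);

at its start.)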

Richard.

> --
> Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
>    Free Software Activist         GNU Toolchain Engineer
>         Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


* [PR94092] Re: [RFC] test builtin ratio for loop distribution
  2021-02-16 12:11                         ` Richard Biener
@ 2021-02-19  8:08                           ` Alexandre Oliva
  2021-02-22  9:53                             ` Richard Biener
  0 siblings, 1 reply; 26+ messages in thread
From: Alexandre Oliva @ 2021-02-19  8:08 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jan Hubicka, GCC Patches, Zdenek Dvorak

Here's an improved version of the patch.  Regstrapped on
x86_64-linux-gnu, with and without a patchlet that moved multi-pieces
ahead of setmem, and also tested with riscv32-elf.

Is it ok to install?  Or should it wait for stage1?


[PR94092] introduce try store by multiple pieces

From: Alexandre Oliva <oliva@adacore.com>

The ldist pass turns even very short loops into memset calls.  E.g.,
the TFmode emulation calls end with a loop of up to 3 iterations, to
zero out trailing words, and the loop distribution pass turns them
into calls of the memset builtin.

Though short constant-length clearing memsets are usually dealt with
efficiently, for non-constant-length ones, the options are setmemM, or
a function call.

RISC-V doesn't have any setmemM pattern, so the loops above end up
"optimized" into memset calls, incurring not only the overhead of an
explicit call, but also discarding the information the compiler has
about the alignment of the destination, and that the length is a
multiple of the word alignment.

This patch handles variable lengths with multiple conditional
power-of-2-constant-sized stores-by-pieces, so as to reduce the
overhead of length compares.
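
As a rough illustration of the shape of the expansion (hand-written C
standing in for the generated code; names and sizes are made up,
assuming a 64-bit target, an 8-byte-aligned destination, and a length
that is a multiple of 8 and at most 24 bytes):

  /* Straight-line expansion: one guarded store_by_pieces block per
     power of two, from the largest needed down to 1 << ctz (len).  */
  void
  clear_up_to_24 (char *to, unsigned long len)  /* len in {0, 8, 16, 24} */
  {
    if (len >= 16)
      {
        ((unsigned long *) to)[0] = 0;  /* 16-byte block */
        ((unsigned long *) to)[1] = 0;
        to += 16;
        len -= 16;
      }
    if (len >= 8)
      *(unsigned long *) to = 0;        /* final 8-byte block */
  }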

It also changes the last copy-prop pass into ccp, so that pointer
alignment and length's nonzero bits are detected and made available
for the expander, even for ldist-introduced SSA_NAMEs.
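
Here's a minimal illustration of why that matters (hypothetical
source, not one of the affected testcases):

  void
  clear_words (unsigned long *p, unsigned long nwords)
  {
    /* On a 64-bit target the low three bits of nbytes are zero; the
       late ccp pass records that as nonzero bits, so the length's ctz
       and the pointer alignment are available at expand time.  */
    unsigned long nbytes = nwords * sizeof *p;
    __builtin_memset (p, 0, nbytes);
  }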


for  gcc/ChangeLog

	PR tree-optimization/94092
	* builtins.c (try_store_by_multiple_pieces): New.
	(expand_builtin_memset_args): Use it.  If target_char_cast
	fails, proceed as for non-constant val.  Pass len's ctz to...
	* expr.c (clear_storage_hints): ... this.  Try store by
	multiple pieces after setmem.
	(clear_storage): Adjust.
	* expr.h (clear_storage_hints): Likewise.
	(try_store_by_multiple_pieces): Declare.
	* passes.def: Replace the last copy_prop with ccp.
---
 gcc/builtins.c |  182 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 gcc/expr.c     |    9 ++-
 gcc/expr.h     |   13 ++++
 gcc/passes.def |    5 +-
 4 files changed, 197 insertions(+), 12 deletions(-)

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 0aed008687ccc..05b14f2a3f6d3 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -6600,6 +6600,166 @@ expand_builtin_memset (tree exp, rtx target, machine_mode mode)
   return expand_builtin_memset_args (dest, val, len, target, mode, exp);
 }
 
+/* Try to store VAL (or, if NULL_RTX, VALC) in LEN bytes starting at TO.
+   Return TRUE if successful, FALSE otherwise.  TO is assumed to be
+   aligned at an ALIGN-bits boundary.  LEN must be a multiple of
+   1<<CTZ_LEN between MIN_LEN and MAX_LEN.
+
+   The strategy is to issue one store_by_pieces for each power of two,
+   from most to least significant, guarded by a test on whether there
+   are at least that many bytes left to copy in LEN.
+
+   ??? Should we skip some powers of two in favor of loops?  Maybe start
+   at the max of TO/LEN/word alignment, at least when optimizing for
+   size, instead of ensuring O(log len) dynamic compares?  */
+
+bool
+try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
+			      unsigned HOST_WIDE_INT min_len,
+			      unsigned HOST_WIDE_INT max_len,
+			      rtx val, char valc, unsigned int align)
+{
+  int max_bits = floor_log2 (max_len);
+  int min_bits = floor_log2 (min_len);
+  int sctz_len = ctz_len;
+
+  gcc_checking_assert (sctz_len >= 0);
+
+  if (val)
+    valc = 1;
+
+  /* Bits more significant than TST_BITS are part of the shared prefix
+     in the binary representation of both min_len and max_len.  Since
+     they're identical, we don't need to test them in the loop.  */
+  int tst_bits = (max_bits != min_bits ? max_bits
+		  : floor_log2 (max_len ^ min_len));
+
+  /* Check whether it's profitable to start by storing a fixed BLKSIZE
+     bytes, to lower max_bits.  In the unlikely case of a constant LEN
+     (implied by identical MAX_LEN and MIN_LEN), we want to issue a
+     single store_by_pieces, but otherwise, select the minimum multiple
+     of the ALIGN (in bytes) and of the MCD of the possible LENs, that
+     brings MAX_LEN below TST_BITS, if that's lower than min_len.  */
+  unsigned HOST_WIDE_INT blksize;
+  if (max_len > min_len)
+    {
+      unsigned HOST_WIDE_INT alrng = MAX (HOST_WIDE_INT_1U << ctz_len,
+					  align / BITS_PER_UNIT);
+      blksize = max_len	- (HOST_WIDE_INT_1U << tst_bits) + alrng;
+      blksize &= ~(alrng - 1);
+    }
+  else if (max_len == min_len)
+    blksize = max_len;
+  else
+    gcc_unreachable ();
+  if (min_len >= blksize)
+    {
+      min_len -= blksize;
+      min_bits = floor_log2 (min_len);
+      max_len -= blksize;
+      max_bits = floor_log2 (max_len);
+
+      tst_bits = (max_bits != min_bits ? max_bits
+		 : floor_log2 (max_len ^ min_len));
+    }
+  else
+    blksize = 0;
+
+  /* Check that we can use store by pieces for the maximum store count
+     we may issue (initial fixed-size block, plus conditional
+     power-of-two-sized from max_bits to ctz_len.  */
+  unsigned HOST_WIDE_INT xlenest = blksize;
+  if (max_bits >= 0)
+    xlenest += ((HOST_WIDE_INT_1U << max_bits) * 2
+		- (HOST_WIDE_INT_1U << ctz_len));
+  if (!can_store_by_pieces (xlenest, builtin_memset_read_str,
+			    &valc, align, true))
+    return false;
+
+  rtx (*constfun) (void *, HOST_WIDE_INT, scalar_int_mode);
+  void *constfundata;
+  if (val)
+    {
+      constfun = builtin_memset_gen_str;
+      constfundata = val = force_reg (TYPE_MODE (unsigned_char_type_node),
+				      val);
+    }
+  else
+    {
+      constfun = builtin_memset_read_str;
+      constfundata = &valc;
+    }
+
+  rtx ptr = copy_addr_to_reg (convert_to_mode (ptr_mode, XEXP (to, 0), 0));
+  rtx rem = copy_to_mode_reg (ptr_mode, convert_to_mode (ptr_mode, len, 0));
+  to = replace_equiv_address (to, ptr);
+  set_mem_align (to, align);
+
+  if (blksize)
+    {
+      to = store_by_pieces (to, blksize,
+			    constfun, constfundata,
+			    align, true,
+			    max_len != 0 ? RETURN_END : RETURN_BEGIN);
+      if (max_len == 0)
+	return true;
+
+      /* Adjust PTR, TO and REM.  Since TO's address is likely
+	 PTR+offset, we have to replace it.  */
+      emit_move_insn (ptr, XEXP (to, 0));
+      to = replace_equiv_address (to, ptr);
+      emit_move_insn (rem, plus_constant (ptr_mode, rem, -blksize));
+    }
+
+  /* Iterate over power-of-two block sizes from the maximum length to
+     the least significant bit possibly set in the length.  */
+  for (int i = max_bits; i >= sctz_len; i--)
+    {
+      rtx_code_label *label = NULL;
+      blksize = HOST_WIDE_INT_1U << i;
+
+      /* If we're past the bits shared between min_ and max_len, expand
+	 a test on the dynamic length, comparing it with the
+	 BLKSIZE.  */
+      if (i <= tst_bits)
+	{
+	  label = gen_label_rtx ();
+	  emit_cmp_and_jump_insns (rem, GEN_INT (blksize), LT, NULL,
+				   ptr_mode, 1, label,
+				   profile_probability::even ());
+	}
+      /* If we are at a bit that is in the prefix shared by min_ and
+	 max_len, skip this BLKSIZE if the bit is clear.  */
+      else if ((max_len & blksize) == 0)
+	continue;
+
+      /* Issue a store of BLKSIZE bytes.  */
+      to = store_by_pieces (to, blksize,
+			    constfun, constfundata,
+			    align, true,
+			    i != sctz_len ? RETURN_END : RETURN_BEGIN);
+
+      /* Adjust REM and PTR, unless this is the last iteration.  */
+      if (i != sctz_len)
+	{
+	  emit_move_insn (ptr, XEXP (to, 0));
+	  to = replace_equiv_address (to, ptr);
+	  emit_move_insn (rem, plus_constant (ptr_mode, rem, -blksize));
+	}
+
+      if (label)
+	{
+	  emit_label (label);
+
+	  /* Given conditional stores, the offset can no longer be
+	     known, so clear it.  */
+	  clear_mem_offset (to);
+	}
+    }
+
+  return true;
+}
+
 /* Helper function to do the actual work for expand_builtin_memset.  The
    arguments to the builtin_memset call DEST, VAL, and LEN are broken out
    so that this can also be called without constructing an actual CALL_EXPR.
@@ -6654,7 +6814,8 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
   dest_mem = get_memory_rtx (dest, len);
   val_mode = TYPE_MODE (unsigned_char_type_node);
 
-  if (TREE_CODE (val) != INTEGER_CST)
+  if (TREE_CODE (val) != INTEGER_CST
+      || target_char_cast (val, &c))
     {
       rtx val_rtx;
 
@@ -6678,7 +6839,12 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
       else if (!set_storage_via_setmem (dest_mem, len_rtx, val_rtx,
 					dest_align, expected_align,
 					expected_size, min_size, max_size,
-					probable_max_size))
+					probable_max_size)
+	       && !try_store_by_multiple_pieces (dest_mem, len_rtx,
+						 tree_ctz (len),
+						 min_size, max_size,
+						 val_rtx, 0,
+						 dest_align))
 	goto do_libcall;
 
       dest_mem = force_operand (XEXP (dest_mem, 0), NULL_RTX);
@@ -6686,9 +6852,6 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
       return dest_mem;
     }
 
-  if (target_char_cast (val, &c))
-    goto do_libcall;
-
   if (c)
     {
       if (tree_fits_uhwi_p (len)
@@ -6702,7 +6865,12 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
 					gen_int_mode (c, val_mode),
 					dest_align, expected_align,
 					expected_size, min_size, max_size,
-					probable_max_size))
+					probable_max_size)
+	       && !try_store_by_multiple_pieces (dest_mem, len_rtx,
+						 tree_ctz (len),
+						 min_size, max_size,
+						 NULL_RTX, c,
+						 dest_align))
 	goto do_libcall;
 
       dest_mem = force_operand (XEXP (dest_mem, 0), NULL_RTX);
@@ -6716,7 +6884,7 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
 				   ? BLOCK_OP_TAILCALL : BLOCK_OP_NORMAL,
 				   expected_align, expected_size,
 				   min_size, max_size,
-				   probable_max_size);
+				   probable_max_size, tree_ctz (len));
 
   if (dest_addr == 0)
     {
diff --git a/gcc/expr.c b/gcc/expr.c
index 86dc1b6c97349..f92e4136f67e5 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -3056,7 +3056,8 @@ clear_storage_hints (rtx object, rtx size, enum block_op_methods method,
 		     unsigned int expected_align, HOST_WIDE_INT expected_size,
 		     unsigned HOST_WIDE_INT min_size,
 		     unsigned HOST_WIDE_INT max_size,
-		     unsigned HOST_WIDE_INT probable_max_size)
+		     unsigned HOST_WIDE_INT probable_max_size,
+		     unsigned ctz_size)
 {
   machine_mode mode = GET_MODE (object);
   unsigned int align;
@@ -3103,6 +3104,10 @@ clear_storage_hints (rtx object, rtx size, enum block_op_methods method,
 				   expected_align, expected_size,
 				   min_size, max_size, probable_max_size))
     ;
+  else if (try_store_by_multiple_pieces (object, size, ctz_size,
+					 min_size, max_size,
+					 NULL_RTX, 0, align))
+    ;
   else if (ADDR_SPACE_GENERIC_P (MEM_ADDR_SPACE (object)))
     return set_storage_via_libcall (object, size, const0_rtx,
 				    method == BLOCK_OP_TAILCALL);
@@ -3120,7 +3125,7 @@ clear_storage (rtx object, rtx size, enum block_op_methods method)
     min = max = UINTVAL (size);
   else
     max = GET_MODE_MASK (GET_MODE (size));
-  return clear_storage_hints (object, size, method, 0, -1, min, max, max);
+  return clear_storage_hints (object, size, method, 0, -1, min, max, max, 0);
 }
 
 
diff --git a/gcc/expr.h b/gcc/expr.h
index 1f0177a4cfa5d..e43f7cf3b9565 100644
--- a/gcc/expr.h
+++ b/gcc/expr.h
@@ -193,7 +193,8 @@ extern rtx clear_storage_hints (rtx, rtx, enum block_op_methods,
 			        unsigned int, HOST_WIDE_INT,
 				unsigned HOST_WIDE_INT,
 				unsigned HOST_WIDE_INT,
-				unsigned HOST_WIDE_INT);
+				unsigned HOST_WIDE_INT,
+				unsigned);
 /* The same, but always output an library call.  */
 extern rtx set_storage_via_libcall (rtx, rtx, rtx, bool = false);
 
@@ -224,6 +225,16 @@ extern int can_store_by_pieces (unsigned HOST_WIDE_INT,
 extern rtx store_by_pieces (rtx, unsigned HOST_WIDE_INT, by_pieces_constfn,
 			    void *, unsigned int, bool, memop_ret);
 
+/* If can_store_by_pieces passes for worst-case values near MAX_LEN, call
+   store_by_pieces within conditionals so as to handle variable LEN efficiently,
+   storing VAL, if non-NULL_RTX, or valc instead.  */
+extern bool try_store_by_multiple_pieces (rtx to, rtx len,
+					  unsigned int ctz_len,
+					  unsigned HOST_WIDE_INT min_len,
+					  unsigned HOST_WIDE_INT max_len,
+					  rtx val, char valc,
+					  unsigned int align);
+
 /* Emit insns to set X from Y.  */
 extern rtx_insn *emit_move_insn (rtx, rtx);
 extern rtx_insn *gen_move_insn (rtx, rtx);
diff --git a/gcc/passes.def b/gcc/passes.def
index e9ed3c7bc5773..8568151233d78 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -332,8 +332,9 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_thread_jumps);
       NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);
       /* Threading can leave many const/copy propagations in the IL.
-	 Clean them up.  */
-      NEXT_PASS (pass_copy_prop);
+	 Clean them up.  Instead of just copy_prop, we use ccp to
+	 compute alignment and nonzero bits.  */
+      NEXT_PASS (pass_ccp, true /* nonzero_p */);
       NEXT_PASS (pass_warn_restrict);
       NEXT_PASS (pass_dse);
       NEXT_PASS (pass_cd_dce, true /* update_address_taken_p */);


-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist         GNU Toolchain Engineer
        Vim, Vi, Voltei pro Emacs -- GNUlius Caesar

* Re: [PR94092] Re: [RFC] test builtin ratio for loop distribution
  2021-02-19  8:08                           ` [PR94092] " Alexandre Oliva
@ 2021-02-22  9:53                             ` Richard Biener
  2021-04-29  4:26                               ` Alexandre Oliva
  0 siblings, 1 reply; 26+ messages in thread
From: Richard Biener @ 2021-02-22  9:53 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: Jan Hubicka, GCC Patches, Zdenek Dvorak

On Fri, Feb 19, 2021 at 9:08 AM Alexandre Oliva <oliva@adacore.com> wrote:
>
> Here's an improved version of the patch.  Regstrapped on
> x86_64-linux-gnu, with and without a patchlet that moved multi-pieces
> ahead of setmem, and also tested with riscv32-elf.
>
> Is it ok to install?  Or should it wait for stage1?

It generally looks OK but I'd wait for stage1.  I'd also like to see
comments from others.  Note that I think we need to guard
the loop emitting on optimize_*_for_speed/!size.  It's also not entirely
clear to me how the code avoids expanding all > 1 byte block size
memsets to a loop (thus bypassing more optimal libc implementations
for larger sizes).  I also remember that we sometimes do a dynamic
dispatch between inlined and non-inlined code paths, though that might
be target specific handling in the x86 backend.

Thanks,
Richard.

> [PR94092] introduce try store by multiple pieces
>
> From: Alexandre Oliva <oliva@adacore.com>
>
> The ldist pass turns even very short loops into memset calls.  E.g.,
> the TFmode emulation calls end with a loop of up to 3 iterations, to
> zero out trailing words, and the loop distribution pass turns them
> into calls of the memset builtin.
>
> Though short constant-length clearing memsets are usually dealt with
> efficiently, for non-constant-length ones, the options are setmemM, or
> a function call.
>
> RISC-V doesn't have any setmemM pattern, so the loops above end up
> "optimized" into memset calls, incurring not only the overhead of an
> explicit call, but also discarding the information the compiler has
> about the alignment of the destination, and that the length is a
> multiple of the word alignment.
>
> This patch handles variable lengths with multiple conditional
> power-of-2-constant-sized stores-by-pieces, so as to reduce the
> overhead of length compares.
>
> It also changes the last copy-prop pass into ccp, so that pointer
> alignment and length's nonzero bits are detected and made available
> for the expander, even for ldist-introduced SSA_NAMEs.
>
>
> for  gcc/ChangeLog
>
>         PR tree-optimization/94092
>         * builtins.c (try_store_by_multiple_pieces): New.
>         (expand_builtin_memset_args): Use it.  If target_char_cast
>         fails, proceed as for non-constant val.  Pass len's ctz to...
>         * expr.c (clear_storage_hints): ... this.  Try store by
>         multiple pieces after setmem.
>         (clear_storage): Adjust.
>         * expr.h (clear_storage_hints): Likewise.
>         (try_store_by_multiple_pieces): Declare.
>         * passes.def: Replace the last copy_prop with ccp.
> ---
>  gcc/builtins.c |  182 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  gcc/expr.c     |    9 ++-
>  gcc/expr.h     |   13 ++++
>  gcc/passes.def |    5 +-
>  4 files changed, 197 insertions(+), 12 deletions(-)
>
> diff --git a/gcc/builtins.c b/gcc/builtins.c
> index 0aed008687ccc..05b14f2a3f6d3 100644
> --- a/gcc/builtins.c
> +++ b/gcc/builtins.c
> @@ -6600,6 +6600,166 @@ expand_builtin_memset (tree exp, rtx target, machine_mode mode)
>    return expand_builtin_memset_args (dest, val, len, target, mode, exp);
>  }
>
> +/* Try to store VAL (or, if NULL_RTX, VALC) in LEN bytes starting at TO.
> +   Return TRUE if successful, FALSE otherwise.  TO is assumed to be
> +   aligned at an ALIGN-bits boundary.  LEN must be a multiple of
> +   1<<CTZ_LEN between MIN_LEN and MAX_LEN.
> +
> +   The strategy is to issue one store_by_pieces for each power of two,
> +   from most to least significant, guarded by a test on whether there
> +   are at least that many bytes left to copy in LEN.
> +
> +   ??? Should we skip some powers of two in favor of loops?  Maybe start
> +   at the max of TO/LEN/word alignment, at least when optimizing for
> +   size, instead of ensuring O(log len) dynamic compares?  */
> +
> +bool
> +try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
> +                             unsigned HOST_WIDE_INT min_len,
> +                             unsigned HOST_WIDE_INT max_len,
> +                             rtx val, char valc, unsigned int align)
> +{
> +  int max_bits = floor_log2 (max_len);
> +  int min_bits = floor_log2 (min_len);
> +  int sctz_len = ctz_len;
> +
> +  gcc_checking_assert (sctz_len >= 0);
> +
> +  if (val)
> +    valc = 1;
> +
> +  /* Bits more significant than TST_BITS are part of the shared prefix
> +     in the binary representation of both min_len and max_len.  Since
> +     they're identical, we don't need to test them in the loop.  */
> +  int tst_bits = (max_bits != min_bits ? max_bits
> +                 : floor_log2 (max_len ^ min_len));
> +
> +  /* Check whether it's profitable to start by storing a fixed BLKSIZE
> +     bytes, to lower max_bits.  In the unlikely case of a constant LEN
> +     (implied by identical MAX_LEN and MIN_LEN), we want to issue a
> +     single store_by_pieces, but otherwise, select the minimum multiple
> +     of the ALIGN (in bytes) and of the MCD of the possible LENs, that
> +     brings MAX_LEN below TST_BITS, if that's lower than min_len.  */
> +  unsigned HOST_WIDE_INT blksize;
> +  if (max_len > min_len)
> +    {
> +      unsigned HOST_WIDE_INT alrng = MAX (HOST_WIDE_INT_1U << ctz_len,
> +                                         align / BITS_PER_UNIT);
> +      blksize = max_len        - (HOST_WIDE_INT_1U << tst_bits) + alrng;
> +      blksize &= ~(alrng - 1);
> +    }
> +  else if (max_len == min_len)
> +    blksize = max_len;
> +  else
> +    gcc_unreachable ();
> +  if (min_len >= blksize)
> +    {
> +      min_len -= blksize;
> +      min_bits = floor_log2 (min_len);
> +      max_len -= blksize;
> +      max_bits = floor_log2 (max_len);
> +
> +      tst_bits = (max_bits != min_bits ? max_bits
> +                : floor_log2 (max_len ^ min_len));
> +    }
> +  else
> +    blksize = 0;
> +
> +  /* Check that we can use store by pieces for the maximum store count
> +     we may issue (initial fixed-size block, plus conditional
> +     power-of-two-sized from max_bits to ctz_len.  */
> +  unsigned HOST_WIDE_INT xlenest = blksize;
> +  if (max_bits >= 0)
> +    xlenest += ((HOST_WIDE_INT_1U << max_bits) * 2
> +               - (HOST_WIDE_INT_1U << ctz_len));
> +  if (!can_store_by_pieces (xlenest, builtin_memset_read_str,
> +                           &valc, align, true))
> +    return false;
> +
> +  rtx (*constfun) (void *, HOST_WIDE_INT, scalar_int_mode);
> +  void *constfundata;
> +  if (val)
> +    {
> +      constfun = builtin_memset_gen_str;
> +      constfundata = val = force_reg (TYPE_MODE (unsigned_char_type_node),
> +                                     val);
> +    }
> +  else
> +    {
> +      constfun = builtin_memset_read_str;
> +      constfundata = &valc;
> +    }
> +
> +  rtx ptr = copy_addr_to_reg (convert_to_mode (ptr_mode, XEXP (to, 0), 0));
> +  rtx rem = copy_to_mode_reg (ptr_mode, convert_to_mode (ptr_mode, len, 0));
> +  to = replace_equiv_address (to, ptr);
> +  set_mem_align (to, align);
> +
> +  if (blksize)
> +    {
> +      to = store_by_pieces (to, blksize,
> +                           constfun, constfundata,
> +                           align, true,
> +                           max_len != 0 ? RETURN_END : RETURN_BEGIN);
> +      if (max_len == 0)
> +       return true;
> +
> +      /* Adjust PTR, TO and REM.  Since TO's address is likely
> +        PTR+offset, we have to replace it.  */
> +      emit_move_insn (ptr, XEXP (to, 0));
> +      to = replace_equiv_address (to, ptr);
> +      emit_move_insn (rem, plus_constant (ptr_mode, rem, -blksize));
> +    }
> +
> +  /* Iterate over power-of-two block sizes from the maximum length to
> +     the least significant bit possibly set in the length.  */
> +  for (int i = max_bits; i >= sctz_len; i--)
> +    {
> +      rtx_code_label *label = NULL;
> +      blksize = HOST_WIDE_INT_1U << i;
> +
> +      /* If we're past the bits shared between min_ and max_len, expand
> +        a test on the dynamic length, comparing it with the
> +        BLKSIZE.  */
> +      if (i <= tst_bits)
> +       {
> +         label = gen_label_rtx ();
> +         emit_cmp_and_jump_insns (rem, GEN_INT (blksize), LT, NULL,
> +                                  ptr_mode, 1, label,
> +                                  profile_probability::even ());
> +       }
> +      /* If we are at a bit that is in the prefix shared by min_ and
> +        max_len, skip this BLKSIZE if the bit is clear.  */
> +      else if ((max_len & blksize) == 0)
> +       continue;
> +
> +      /* Issue a store of BLKSIZE bytes.  */
> +      to = store_by_pieces (to, blksize,
> +                           constfun, constfundata,
> +                           align, true,
> +                           i != sctz_len ? RETURN_END : RETURN_BEGIN);
> +
> +      /* Adjust REM and PTR, unless this is the last iteration.  */
> +      if (i != sctz_len)
> +       {
> +         emit_move_insn (ptr, XEXP (to, 0));
> +         to = replace_equiv_address (to, ptr);
> +         emit_move_insn (rem, plus_constant (ptr_mode, rem, -blksize));
> +       }
> +
> +      if (label)
> +       {
> +         emit_label (label);
> +
> +         /* Given conditional stores, the offset can no longer be
> +            known, so clear it.  */
> +         clear_mem_offset (to);
> +       }
> +    }
> +
> +  return true;
> +}
> +
>  /* Helper function to do the actual work for expand_builtin_memset.  The
>     arguments to the builtin_memset call DEST, VAL, and LEN are broken out
>     so that this can also be called without constructing an actual CALL_EXPR.
> @@ -6654,7 +6814,8 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>    dest_mem = get_memory_rtx (dest, len);
>    val_mode = TYPE_MODE (unsigned_char_type_node);
>
> -  if (TREE_CODE (val) != INTEGER_CST)
> +  if (TREE_CODE (val) != INTEGER_CST
> +      || target_char_cast (val, &c))
>      {
>        rtx val_rtx;
>
> @@ -6678,7 +6839,12 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>        else if (!set_storage_via_setmem (dest_mem, len_rtx, val_rtx,
>                                         dest_align, expected_align,
>                                         expected_size, min_size, max_size,
> -                                       probable_max_size))
> +                                       probable_max_size)
> +              && !try_store_by_multiple_pieces (dest_mem, len_rtx,
> +                                                tree_ctz (len),
> +                                                min_size, max_size,
> +                                                val_rtx, 0,
> +                                                dest_align))
>         goto do_libcall;
>
>        dest_mem = force_operand (XEXP (dest_mem, 0), NULL_RTX);
> @@ -6686,9 +6852,6 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>        return dest_mem;
>      }
>
> -  if (target_char_cast (val, &c))
> -    goto do_libcall;
> -
>    if (c)
>      {
>        if (tree_fits_uhwi_p (len)
> @@ -6702,7 +6865,12 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>                                         gen_int_mode (c, val_mode),
>                                         dest_align, expected_align,
>                                         expected_size, min_size, max_size,
> -                                       probable_max_size))
> +                                       probable_max_size)
> +              && !try_store_by_multiple_pieces (dest_mem, len_rtx,
> +                                                tree_ctz (len),
> +                                                min_size, max_size,
> +                                                NULL_RTX, c,
> +                                                dest_align))
>         goto do_libcall;
>
>        dest_mem = force_operand (XEXP (dest_mem, 0), NULL_RTX);
> @@ -6716,7 +6884,7 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>                                    ? BLOCK_OP_TAILCALL : BLOCK_OP_NORMAL,
>                                    expected_align, expected_size,
>                                    min_size, max_size,
> -                                  probable_max_size);
> +                                  probable_max_size, tree_ctz (len));
>
>    if (dest_addr == 0)
>      {
> diff --git a/gcc/expr.c b/gcc/expr.c
> index 86dc1b6c97349..f92e4136f67e5 100644
> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -3056,7 +3056,8 @@ clear_storage_hints (rtx object, rtx size, enum block_op_methods method,
>                      unsigned int expected_align, HOST_WIDE_INT expected_size,
>                      unsigned HOST_WIDE_INT min_size,
>                      unsigned HOST_WIDE_INT max_size,
> -                    unsigned HOST_WIDE_INT probable_max_size)
> +                    unsigned HOST_WIDE_INT probable_max_size,
> +                    unsigned ctz_size)
>  {
>    machine_mode mode = GET_MODE (object);
>    unsigned int align;
> @@ -3103,6 +3104,10 @@ clear_storage_hints (rtx object, rtx size, enum block_op_methods method,
>                                    expected_align, expected_size,
>                                    min_size, max_size, probable_max_size))
>      ;
> +  else if (try_store_by_multiple_pieces (object, size, ctz_size,
> +                                        min_size, max_size,
> +                                        NULL_RTX, 0, align))
> +    ;
>    else if (ADDR_SPACE_GENERIC_P (MEM_ADDR_SPACE (object)))
>      return set_storage_via_libcall (object, size, const0_rtx,
>                                     method == BLOCK_OP_TAILCALL);
> @@ -3120,7 +3125,7 @@ clear_storage (rtx object, rtx size, enum block_op_methods method)
>      min = max = UINTVAL (size);
>    else
>      max = GET_MODE_MASK (GET_MODE (size));
> -  return clear_storage_hints (object, size, method, 0, -1, min, max, max);
> +  return clear_storage_hints (object, size, method, 0, -1, min, max, max, 0);
>  }
>
>
> diff --git a/gcc/expr.h b/gcc/expr.h
> index 1f0177a4cfa5d..e43f7cf3b9565 100644
> --- a/gcc/expr.h
> +++ b/gcc/expr.h
> @@ -193,7 +193,8 @@ extern rtx clear_storage_hints (rtx, rtx, enum block_op_methods,
>                                 unsigned int, HOST_WIDE_INT,
>                                 unsigned HOST_WIDE_INT,
>                                 unsigned HOST_WIDE_INT,
> -                               unsigned HOST_WIDE_INT);
> +                               unsigned HOST_WIDE_INT,
> +                               unsigned);
>  /* The same, but always output an library call.  */
>  extern rtx set_storage_via_libcall (rtx, rtx, rtx, bool = false);
>
> @@ -224,6 +225,16 @@ extern int can_store_by_pieces (unsigned HOST_WIDE_INT,
>  extern rtx store_by_pieces (rtx, unsigned HOST_WIDE_INT, by_pieces_constfn,
>                             void *, unsigned int, bool, memop_ret);
>
> +/* If can_store_by_pieces passes for worst-case values near MAX_LEN, call
> +   store_by_pieces within conditionals so as to handle variable LEN efficiently,
> +   storing VAL, if non-NULL_RTX, or valc instead.  */
> +extern bool try_store_by_multiple_pieces (rtx to, rtx len,
> +                                         unsigned int ctz_len,
> +                                         unsigned HOST_WIDE_INT min_len,
> +                                         unsigned HOST_WIDE_INT max_len,
> +                                         rtx val, char valc,
> +                                         unsigned int align);
> +
>  /* Emit insns to set X from Y.  */
>  extern rtx_insn *emit_move_insn (rtx, rtx);
>  extern rtx_insn *gen_move_insn (rtx, rtx);
> diff --git a/gcc/passes.def b/gcc/passes.def
> index e9ed3c7bc5773..8568151233d78 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -332,8 +332,9 @@ along with GCC; see the file COPYING3.  If not see
>        NEXT_PASS (pass_thread_jumps);
>        NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);
>        /* Threading can leave many const/copy propagations in the IL.
> -        Clean them up.  */
> -      NEXT_PASS (pass_copy_prop);
> +        Clean them up.  Instead of just copy_prop, we use ccp to
> +        compute alignment and nonzero bits.  */
> +      NEXT_PASS (pass_ccp, true /* nonzero_p */);
>        NEXT_PASS (pass_warn_restrict);
>        NEXT_PASS (pass_dse);
>        NEXT_PASS (pass_cd_dce, true /* update_address_taken_p */);
>
>
> --
> Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
>    Free Software Activist         GNU Toolchain Engineer
>         Vim, Vi, Voltei pro Emacs -- GNUlius Caesar

* Re: [PR94092] Re: [RFC] test builtin ratio for loop distribution
  2021-02-22  9:53                             ` Richard Biener
@ 2021-04-29  4:26                               ` Alexandre Oliva
  2021-04-30 14:42                                 ` Jeff Law
  0 siblings, 1 reply; 26+ messages in thread
From: Alexandre Oliva @ 2021-04-29  4:26 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jan Hubicka, GCC Patches, Zdenek Dvorak

On Feb 22, 2021, Richard Biener <richard.guenther@gmail.com> wrote:

> On Fri, Feb 19, 2021 at 9:08 AM Alexandre Oliva <oliva@adacore.com> wrote:
>> 
>> Here's an improved version of the patch.  Regstrapped on
>> x86_64-linux-gnu, with and without a patchlet that moved multi-pieces
>> ahead of setmem, and also tested with riscv32-elf.
>> 
>> Is it ok to install?  Or should it wait for stage1?

> It generally looks OK but I'd wait for stage1.

'k, I've retested the patch now, and it's still working as expected.

> I'd also like to see
> comments from others.  Note that I think we need to guard
> the loop emitting on optimize_*_for_speed/!size.

There's no loop: it's straight-line codegen with interspersed
conditionals, and it only kicks in if we'd be willing to output
straight-line code for a slightly larger constant-sized block.

> It's also not entirely
> clear to me how the code avoids expanding all > 1 byte block size
> memsets to a loop (thus bypassing more optimal libc implementations
> for larger sizes).

It doesn't preempt preexisting expansions, which may cover even
variable-length cases (e.g. with an insn expander); and when we do get
to it, it's limited by the same tuning defines that limit move by
pieces.
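
Concretely, the gate is the can_store_by_pieces check on the
worst-case total of all the blocks that might be emitted, roughly as
in the patch above:

  /* Worst case: the initial fixed-size block plus every conditional
     power-of-two-sized block from max_bits down to ctz_len.  */
  unsigned HOST_WIDE_INT xlenest = blksize;
  if (max_bits >= 0)
    xlenest += ((HOST_WIDE_INT_1U << max_bits) * 2
                - (HOST_WIDE_INT_1U << ctz_len));
  if (!can_store_by_pieces (xlenest, builtin_memset_read_str,
                            &valc, align, true))
    return false;  /* caller falls back to a library call */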

> I also remember that we sometimes do a dynamic
> dispatch between inlined and non-inlined code paths, though that might
> be target specific handling in the x86 backend.

Yup, no effect on those.


I've dropped references to PR94092, since it's a different issue.


Re-regstrapped on x86_64-linux-gnu, with and without a patchlet that
exercises it far more thoroughly.  Ok to install?



introduce try store by multiple pieces

The ldist pass turns even very short loops into memset calls.  E.g.,
the TFmode emulation calls end with a loop of up to 3 iterations, to
zero out trailing words, and the loop distribution pass turns them
into calls of the memset builtin.

Though short constant-length clearing memsets are usually dealt with
efficiently, for non-constant-length ones, the options are setmemM, or
a function call.

RISC-V doesn't have any setmemM pattern, so the loops above end up
"optimized" into memset calls, incurring not only the overhead of an
explicit call, but also discarding the information the compiler has
about the alignment of the destination, and that the length is a
multiple of the word alignment.

This patch handles variable lengths with multiple conditional
power-of-2-constant-sized stores-by-pieces, so as to reduce the
overhead of length compares.

It also changes the last copy-prop pass into ccp, so that pointer
alignment and length's nonzero bits are detected and made available
for the expander, even for ldist-introduced SSA_NAMEs.


for  gcc/ChangeLog

	* builtins.c (try_store_by_multiple_pieces): New.
	(expand_builtin_memset_args): Use it.  If target_char_cast
	fails, proceed as for non-constant val.  Pass len's ctz to...
	* expr.c (clear_storage_hints): ... this.  Try store by
	multiple pieces after setmem.
	(clear_storage): Adjust.
	* expr.h (clear_storage_hints): Likewise.
	(try_store_by_multiple_pieces): Declare.
	* passes.def: Replace the last copy_prop with ccp.
---
 gcc/builtins.c |  182 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 gcc/expr.c     |    9 ++-
 gcc/expr.h     |   13 ++++
 gcc/passes.def |    5 +-
 4 files changed, 197 insertions(+), 12 deletions(-)

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 8c5324bf7de9a..e3ccb3c17dbb3 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -6645,6 +6645,166 @@ expand_builtin_memset (tree exp, rtx target, machine_mode mode)
   return expand_builtin_memset_args (dest, val, len, target, mode, exp);
 }
 
+/* Try to store VAL (or, if NULL_RTX, VALC) in LEN bytes starting at TO.
+   Return TRUE if successful, FALSE otherwise.  TO is assumed to be
+   aligned at an ALIGN-bits boundary.  LEN must be a multiple of
+   1<<CTZ_LEN between MIN_LEN and MAX_LEN.
+
+   The strategy is to issue one store_by_pieces for each power of two,
+   from most to least significant, guarded by a test on whether there
+   are at least that many bytes left to copy in LEN.
+
+   ??? Should we skip some powers of two in favor of loops?  Maybe start
+   at the max of TO/LEN/word alignment, at least when optimizing for
+   size, instead of ensuring O(log len) dynamic compares?  */
+
+bool
+try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
+			      unsigned HOST_WIDE_INT min_len,
+			      unsigned HOST_WIDE_INT max_len,
+			      rtx val, char valc, unsigned int align)
+{
+  int max_bits = floor_log2 (max_len);
+  int min_bits = floor_log2 (min_len);
+  int sctz_len = ctz_len;
+
+  gcc_checking_assert (sctz_len >= 0);
+
+  if (val)
+    valc = 1;
+
+  /* Bits more significant than TST_BITS are part of the shared prefix
+     in the binary representation of both min_len and max_len.  Since
+     they're identical, we don't need to test them in the loop.  */
+  int tst_bits = (max_bits != min_bits ? max_bits
+		  : floor_log2 (max_len ^ min_len));
+
+  /* Check whether it's profitable to start by storing a fixed BLKSIZE
+     bytes, to lower max_bits.  In the unlikely case of a constant LEN
+     (implied by identical MAX_LEN and MIN_LEN), we want to issue a
+     single store_by_pieces, but otherwise, select the minimum multiple
+     of the ALIGN (in bytes) and of the MCD of the possible LENs, that
+     brings MAX_LEN below TST_BITS, if that's lower than min_len.  */
+  unsigned HOST_WIDE_INT blksize;
+  if (max_len > min_len)
+    {
+      unsigned HOST_WIDE_INT alrng = MAX (HOST_WIDE_INT_1U << ctz_len,
+					  align / BITS_PER_UNIT);
+      blksize = max_len - (HOST_WIDE_INT_1U << tst_bits) + alrng;
+      blksize &= ~(alrng - 1);
+    }
+  else if (max_len == min_len)
+    blksize = max_len;
+  else
+    gcc_unreachable ();
+  if (min_len >= blksize)
+    {
+      min_len -= blksize;
+      min_bits = floor_log2 (min_len);
+      max_len -= blksize;
+      max_bits = floor_log2 (max_len);
+
+      tst_bits = (max_bits != min_bits ? max_bits
+		 : floor_log2 (max_len ^ min_len));
+    }
+  else
+    blksize = 0;
+
+  /* Check that we can use store by pieces for the maximum store count
+     we may issue (initial fixed-size block, plus conditional
+     power-of-two-sized from max_bits to ctz_len.  */
+  unsigned HOST_WIDE_INT xlenest = blksize;
+  if (max_bits >= 0)
+    xlenest += ((HOST_WIDE_INT_1U << max_bits) * 2
+		- (HOST_WIDE_INT_1U << ctz_len));
+  if (!can_store_by_pieces (xlenest, builtin_memset_read_str,
+			    &valc, align, true))
+    return false;
+
+  rtx (*constfun) (void *, HOST_WIDE_INT, scalar_int_mode);
+  void *constfundata;
+  if (val)
+    {
+      constfun = builtin_memset_gen_str;
+      constfundata = val = force_reg (TYPE_MODE (unsigned_char_type_node),
+				      val);
+    }
+  else
+    {
+      constfun = builtin_memset_read_str;
+      constfundata = &valc;
+    }
+
+  rtx ptr = copy_addr_to_reg (convert_to_mode (ptr_mode, XEXP (to, 0), 0));
+  rtx rem = copy_to_mode_reg (ptr_mode, convert_to_mode (ptr_mode, len, 0));
+  to = replace_equiv_address (to, ptr);
+  set_mem_align (to, align);
+
+  if (blksize)
+    {
+      to = store_by_pieces (to, blksize,
+			    constfun, constfundata,
+			    align, true,
+			    max_len != 0 ? RETURN_END : RETURN_BEGIN);
+      if (max_len == 0)
+	return true;
+
+      /* Adjust PTR, TO and REM.  Since TO's address is likely
+	 PTR+offset, we have to replace it.  */
+      emit_move_insn (ptr, XEXP (to, 0));
+      to = replace_equiv_address (to, ptr);
+      emit_move_insn (rem, plus_constant (ptr_mode, rem, -blksize));
+    }
+
+  /* Iterate over power-of-two block sizes from the maximum length to
+     the least significant bit possibly set in the length.  */
+  for (int i = max_bits; i >= sctz_len; i--)
+    {
+      rtx_code_label *label = NULL;
+      blksize = HOST_WIDE_INT_1U << i;
+
+      /* If we're past the bits shared between min_ and max_len, expand
+	 a test on the dynamic length, comparing it with the
+	 BLKSIZE.  */
+      if (i <= tst_bits)
+	{
+	  label = gen_label_rtx ();
+	  emit_cmp_and_jump_insns (rem, GEN_INT (blksize), LT, NULL,
+				   ptr_mode, 1, label,
+				   profile_probability::even ());
+	}
+      /* If we are at a bit that is in the prefix shared by min_ and
+	 max_len, skip this BLKSIZE if the bit is clear.  */
+      else if ((max_len & blksize) == 0)
+	continue;
+
+      /* Issue a store of BLKSIZE bytes.  */
+      to = store_by_pieces (to, blksize,
+			    constfun, constfundata,
+			    align, true,
+			    i != sctz_len ? RETURN_END : RETURN_BEGIN);
+
+      /* Adjust REM and PTR, unless this is the last iteration.  */
+      if (i != sctz_len)
+	{
+	  emit_move_insn (ptr, XEXP (to, 0));
+	  to = replace_equiv_address (to, ptr);
+	  emit_move_insn (rem, plus_constant (ptr_mode, rem, -blksize));
+	}
+
+      if (label)
+	{
+	  emit_label (label);
+
+	  /* Given conditional stores, the offset can no longer be
+	     known, so clear it.  */
+	  clear_mem_offset (to);
+	}
+    }
+
+  return true;
+}
+
 /* Helper function to do the actual work for expand_builtin_memset.  The
    arguments to the builtin_memset call DEST, VAL, and LEN are broken out
    so that this can also be called without constructing an actual CALL_EXPR.
@@ -6699,7 +6859,8 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
   dest_mem = get_memory_rtx (dest, len);
   val_mode = TYPE_MODE (unsigned_char_type_node);
 
-  if (TREE_CODE (val) != INTEGER_CST)
+  if (TREE_CODE (val) != INTEGER_CST
+      || target_char_cast (val, &c))
     {
       rtx val_rtx;
 
@@ -6723,7 +6884,12 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
       else if (!set_storage_via_setmem (dest_mem, len_rtx, val_rtx,
 					dest_align, expected_align,
 					expected_size, min_size, max_size,
-					probable_max_size))
+					probable_max_size)
+	       && !try_store_by_multiple_pieces (dest_mem, len_rtx,
+						 tree_ctz (len),
+						 min_size, max_size,
+						 val_rtx, 0,
+						 dest_align))
 	goto do_libcall;
 
       dest_mem = force_operand (XEXP (dest_mem, 0), NULL_RTX);
@@ -6731,9 +6897,6 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
       return dest_mem;
     }
 
-  if (target_char_cast (val, &c))
-    goto do_libcall;
-
   if (c)
     {
       if (tree_fits_uhwi_p (len)
@@ -6747,7 +6910,12 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
 					gen_int_mode (c, val_mode),
 					dest_align, expected_align,
 					expected_size, min_size, max_size,
-					probable_max_size))
+					probable_max_size)
+	       && !try_store_by_multiple_pieces (dest_mem, len_rtx,
+						 tree_ctz (len),
+						 min_size, max_size,
+						 NULL_RTX, c,
+						 dest_align))
 	goto do_libcall;
 
       dest_mem = force_operand (XEXP (dest_mem, 0), NULL_RTX);
@@ -6761,7 +6929,7 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
 				   ? BLOCK_OP_TAILCALL : BLOCK_OP_NORMAL,
 				   expected_align, expected_size,
 				   min_size, max_size,
-				   probable_max_size);
+				   probable_max_size, tree_ctz (len));
 
   if (dest_addr == 0)
     {
diff --git a/gcc/expr.c b/gcc/expr.c
index a4a004dd177dd..9f2ec8a09a4ec 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -3086,7 +3086,8 @@ clear_storage_hints (rtx object, rtx size, enum block_op_methods method,
 		     unsigned int expected_align, HOST_WIDE_INT expected_size,
 		     unsigned HOST_WIDE_INT min_size,
 		     unsigned HOST_WIDE_INT max_size,
-		     unsigned HOST_WIDE_INT probable_max_size)
+		     unsigned HOST_WIDE_INT probable_max_size,
+		     unsigned ctz_size)
 {
   machine_mode mode = GET_MODE (object);
   unsigned int align;
@@ -3133,6 +3134,10 @@ clear_storage_hints (rtx object, rtx size, enum block_op_methods method,
 				   expected_align, expected_size,
 				   min_size, max_size, probable_max_size))
     ;
+  else if (try_store_by_multiple_pieces (object, size, ctz_size,
+					 min_size, max_size,
+					 NULL_RTX, 0, align))
+    ;
   else if (ADDR_SPACE_GENERIC_P (MEM_ADDR_SPACE (object)))
     return set_storage_via_libcall (object, size, const0_rtx,
 				    method == BLOCK_OP_TAILCALL);
@@ -3150,7 +3155,7 @@ clear_storage (rtx object, rtx size, enum block_op_methods method)
     min = max = UINTVAL (size);
   else
     max = GET_MODE_MASK (GET_MODE (size));
-  return clear_storage_hints (object, size, method, 0, -1, min, max, max);
+  return clear_storage_hints (object, size, method, 0, -1, min, max, max, 0);
 }
 
 
diff --git a/gcc/expr.h b/gcc/expr.h
index 1f0177a4cfa5d..e43f7cf3b9565 100644
--- a/gcc/expr.h
+++ b/gcc/expr.h
@@ -193,7 +193,8 @@ extern rtx clear_storage_hints (rtx, rtx, enum block_op_methods,
 			        unsigned int, HOST_WIDE_INT,
 				unsigned HOST_WIDE_INT,
 				unsigned HOST_WIDE_INT,
-				unsigned HOST_WIDE_INT);
+				unsigned HOST_WIDE_INT,
+				unsigned);
 /* The same, but always output an library call.  */
 extern rtx set_storage_via_libcall (rtx, rtx, rtx, bool = false);
 
@@ -224,6 +225,16 @@ extern int can_store_by_pieces (unsigned HOST_WIDE_INT,
 extern rtx store_by_pieces (rtx, unsigned HOST_WIDE_INT, by_pieces_constfn,
 			    void *, unsigned int, bool, memop_ret);
 
+/* If can_store_by_pieces passes for worst-case values near MAX_LEN, call
+   store_by_pieces within conditionals so as to handle variable LEN efficiently,
+   storing VAL, if non-NULL_RTX, or valc instead.  */
+extern bool try_store_by_multiple_pieces (rtx to, rtx len,
+					  unsigned int ctz_len,
+					  unsigned HOST_WIDE_INT min_len,
+					  unsigned HOST_WIDE_INT max_len,
+					  rtx val, char valc,
+					  unsigned int align);
+
 /* Emit insns to set X from Y.  */
 extern rtx_insn *emit_move_insn (rtx, rtx);
 extern rtx_insn *gen_move_insn (rtx, rtx);
diff --git a/gcc/passes.def b/gcc/passes.def
index f7c42897f43a6..55e8164d56b2b 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -336,8 +336,9 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_thread_jumps);
       NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);
       /* Threading can leave many const/copy propagations in the IL.
-	 Clean them up.  */
-      NEXT_PASS (pass_copy_prop);
+	 Clean them up.  Instead of just copy_prop, we use ccp to
+	 compute alignment and nonzero bits.  */
+      NEXT_PASS (pass_ccp, true /* nonzero_p */);
       NEXT_PASS (pass_warn_restrict);
       NEXT_PASS (pass_dse);
       NEXT_PASS (pass_cd_dce, true /* update_address_taken_p */);


-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist         GNU Toolchain Engineer
        Vim, Vi, Voltei pro Emacs -- GNUlius Caesar

* Re: [PR94092] Re: [RFC] test builtin ratio for loop distribution
  2021-04-29  4:26                               ` Alexandre Oliva
@ 2021-04-30 14:42                                 ` Jeff Law
  2021-05-03  8:55                                   ` Richard Biener
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff Law @ 2021-04-30 14:42 UTC (permalink / raw)
  To: Alexandre Oliva, Richard Biener; +Cc: GCC Patches, Jan Hubicka, Zdenek Dvorak


On 4/28/2021 10:26 PM, Alexandre Oliva wrote:
> On Feb 22, 2021, Richard Biener <richard.guenther@gmail.com> wrote:
>
>> On Fri, Feb 19, 2021 at 9:08 AM Alexandre Oliva <oliva@adacore.com> wrote:
>>> Here's an improved version of the patch.  Regstrapped on
>>> x86_64-linux-gnu, with and without a patchlet that moved multi-pieces
>>> ahead of setmem, and also tested with riscv32-elf.
>>>
>>> Is it ok to install?  Or should it wait for stage1?
>> It generally looks OK but I'd wait for stage1.
> 'k, I've retested the patch now, and it's still working as expected.

Then I'd go forward with it at this point.



>
> I've dropped references to PR94092, since it's a different issue.

ACK.  Note I just put a patch into pr94092 that may be of interest.  In 
simplest terms it tries to use REGNO_POINTER_ALIGN to increase the 
alignment of MEMs.  We'd been using it internally for a bit before we 
went a slightly different direction.  Feel free to use it if it's helpful.


jeff


* Re: [PR94092] Re: [RFC] test builtin ratio for loop distribution
  2021-04-30 14:42                                 ` Jeff Law
@ 2021-05-03  8:55                                   ` Richard Biener
  2021-05-04  1:59                                     ` Alexandre Oliva
  0 siblings, 1 reply; 26+ messages in thread
From: Richard Biener @ 2021-05-03  8:55 UTC (permalink / raw)
  To: Jeff Law; +Cc: Alexandre Oliva, GCC Patches, Jan Hubicka, Zdenek Dvorak

On Fri, Apr 30, 2021 at 4:42 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
>
> On 4/28/2021 10:26 PM, Alexandre Oliva wrote:
> > On Feb 22, 2021, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> >> On Fri, Feb 19, 2021 at 9:08 AM Alexandre Oliva <oliva@adacore.com> wrote:
> >>> Here's an improved version of the patch.  Regstrapped on
> >>> x86_64-linux-gnu, with and without a patchlet that moved multi-pieces
> >>> ahead of setmem, and also tested with riscv32-elf.
> >>>
> >>> Is it ok to install?  Or should it wait for stage1?
> >> It generally looks OK but I'd wait for stage1.
> > 'k, I've retested the patch now, and it's still working as expected.
>
> Then I'd go forward with it at this point.

Yes, it looks good to me as well.

Thanks,
Richard.

>
>
> >
> > I've dropped references to PR94092, since it's a different issue.
>
> ACK.  Note I just put a patch into pr94092 that may be of interest.  In
> simplest terms it tries to use REGNO_POINTER_ALIGN to increase the
> alignment of MEMs.  We'd been using it internally for a bit before we
> went a slightly different direction.  Feel free to use it if it's helpful.
>
>
> jeff
>

* Re: [PR94092] Re: [RFC] test builtin ratio for loop distribution
  2021-05-03  8:55                                   ` Richard Biener
@ 2021-05-04  1:59                                     ` Alexandre Oliva
  2021-05-04  5:49                                       ` Prathamesh Kulkarni
  0 siblings, 1 reply; 26+ messages in thread
From: Alexandre Oliva @ 2021-05-04  1:59 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, GCC Patches, Jan Hubicka, Zdenek Dvorak

On May  3, 2021, Richard Biener <richard.guenther@gmail.com> wrote:

> On Fri, Apr 30, 2021 at 4:42 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
>> 
>> 
>> On 4/28/2021 10:26 PM, Alexandre Oliva wrote:
>> > On Feb 22, 2021, Richard Biener <richard.guenther@gmail.com> wrote:
>> >
>> >> On Fri, Feb 19, 2021 at 9:08 AM Alexandre Oliva <oliva@adacore.com> wrote:
>> >>> Here's an improved version of the patch.  Regstrapped on
>> >>> x86_64-linux-gnu, with and without a patchlet that moved multi-pieces
>> >>> ahead of setmem, and also tested with riscv32-elf.
>> >>>
>> >>> Is it ok to install?  Or should it wait for stage1?
>> >> It generally looks OK but I'd wait for stage1.
>> > 'k, I've retested the patch now, and it's still working as expected.
>> 
>> Then I'd go forward with it at this point.

> Yes, it looks good to me as well.

Thanks, I've put it in.

-- 
Alexandre Oliva, happy hacker                https://FSFLA.org/blogs/lxo/
   Free Software Activist                       GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>

* Re: [PR94092] Re: [RFC] test builtin ratio for loop distribution
  2021-05-04  1:59                                     ` Alexandre Oliva
@ 2021-05-04  5:49                                       ` Prathamesh Kulkarni
  2021-05-04  6:09                                         ` Alexandre Oliva
  0 siblings, 1 reply; 26+ messages in thread
From: Prathamesh Kulkarni @ 2021-05-04  5:49 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: Richard Biener, GCC Patches, Jan Hubicka, Zdenek Dvorak

On Tue, 4 May 2021 at 07:30, Alexandre Oliva <oliva@adacore.com> wrote:
>
> On May  3, 2021, Richard Biener <richard.guenther@gmail.com> wrote:
>
> > On Fri, Apr 30, 2021 at 4:42 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
> >>
> >>
> >> On 4/28/2021 10:26 PM, Alexandre Oliva wrote:
> >> > On Feb 22, 2021, Richard Biener <richard.guenther@gmail.com> wrote:
> >> >
> >> >> On Fri, Feb 19, 2021 at 9:08 AM Alexandre Oliva <oliva@adacore.com> wrote:
> >> >>> Here's an improved version of the patch.  Regstrapped on
> >> >>> x86_64-linux-gnu, with and without a patchlet that moved multi-pieces
> >> >>> ahead of setmem, and also tested with riscv32-elf.
> >> >>>
> >> >>> Is it ok to install?  Or should it wait for stage1?
> >> >> It generally looks OK but I'd wait for stage1.
> >> > 'k, I've retested the patch now, and it's still working as expected.
> >>
> >> Then I'd go forward with it at this point.
>
> > Yes, it looks good to me as well.
>
> Thanks, I've put it in.
Hi,
This caused the following build error:
../../gcc/gcc/builtins.c:6750:18: error: invalid conversion from
‘rtx_def* (*)(void*, void*, long int, scalar_int_mode)’ to
‘rtx_def* (*)(void*, long int, scalar_int_mode)’ [-fpermissive]
 6750 |       constfun = builtin_memset_gen_str;

It looks like constfun's prototype had a typo: the 2nd void * param was missing.
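
For reference, here's a sketch of the corrected declaration, as
implied by the error message above (not copied verbatim from the fix
commit):

  /* by_pieces_constfn now takes an extra void * (the PR90773 change),
     so the local function-pointer declaration needs it too.  */
  rtx (*constfun) (void *, void *, HOST_WIDE_INT, scalar_int_mode);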
I committed the fix as obvious in the following commit:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=5fbe6a8e73b52c6ebc28b9111456226c1cda6472

Thanks,
Prathamesh
>
> --
> Alexandre Oliva, happy hacker                https://FSFLA.org/blogs/lxo/
>    Free Software Activist                       GNU Toolchain Engineer
> Disinformation flourishes because many people care deeply about injustice
> but very few check the facts.  Ask me about <https://stallmansupport.org>

* Re: [PR94092] Re: [RFC] test builtin ratio for loop distribution
  2021-05-04  5:49                                       ` Prathamesh Kulkarni
@ 2021-05-04  6:09                                         ` Alexandre Oliva
  0 siblings, 0 replies; 26+ messages in thread
From: Alexandre Oliva @ 2021-05-04  6:09 UTC (permalink / raw)
  To: Prathamesh Kulkarni
  Cc: Richard Biener, GCC Patches, Jan Hubicka, Zdenek Dvorak

On May  4, 2021, Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> wrote:

> It looks like constfun's prototype had a typo with missing 2nd param for void *.

Ugh, the patch for https://gcc.gnu.org/PR90773 added the param the day
after I retested the patch, and I did not give it yet another spin
before checking it in :-(

> I committed the fix as obvious in the following commit:
> https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=5fbe6a8e73b52c6ebc28b9111456226c1cda6472

Thank you!

-- 
Alexandre Oliva, happy hacker                https://FSFLA.org/blogs/lxo/
   Free Software Activist                       GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>


Thread overview: 26+ messages
2021-01-27 12:40 [RFC] test builtin ratio for loop distribution Alexandre Oliva
2021-01-27 15:12 ` Richard Biener
2021-01-28  5:28   ` Alexandre Oliva
2021-01-28  8:59     ` Richard Biener
2021-02-02 17:13       ` Alexandre Oliva
2021-02-03  8:59         ` Richard Biener
2021-02-03 15:11           ` Alexandre Oliva
2021-02-04  8:37             ` Richard Biener
2021-02-04 22:17               ` Alexandre Oliva
2021-02-05  8:02                 ` Richard Biener
2021-02-11 10:19                 ` Alexandre Oliva
2021-02-11 12:14                   ` Alexandre Oliva
2021-02-12 11:34                   ` Richard Biener
2021-02-16  4:56                     ` Alexandre Oliva
2021-02-16 10:47                       ` Alexandre Oliva
2021-02-16 12:11                         ` Richard Biener
2021-02-19  8:08                           ` [PR94092] " Alexandre Oliva
2021-02-22  9:53                             ` Richard Biener
2021-04-29  4:26                               ` Alexandre Oliva
2021-04-30 14:42                                 ` Jeff Law
2021-05-03  8:55                                   ` Richard Biener
2021-05-04  1:59                                     ` Alexandre Oliva
2021-05-04  5:49                                       ` Prathamesh Kulkarni
2021-05-04  6:09                                         ` Alexandre Oliva
2021-02-05  0:13 ` Jim Wilson
2021-02-11 10:11   ` Alexandre Oliva
