From: Stefan Schulze Frielinghaus <stefansf@linux.ibm.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [RFC] ldist: Recognize rawmemchr loop patterns
Date: Thu, 25 Feb 2021 18:01:33 +0100 [thread overview]
Message-ID: <YDfX7V4iLtHjs1um@localhost.localdomain> (raw)
In-Reply-To: <YCj7GniMRQ5DkS37@localhost.localdomain>
Ping
On Sun, Feb 14, 2021 at 11:27:40AM +0100, Stefan Schulze Frielinghaus wrote:
> On Tue, Feb 09, 2021 at 09:57:58AM +0100, Richard Biener wrote:
> > On Mon, Feb 8, 2021 at 3:11 PM Stefan Schulze Frielinghaus via
> > Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > This patch adds support for recognizing loops which mimic the behaviour
> > > of the function rawmemchr, and replaces them with an internal function
> > > call in case the target provides a suitable instruction. In contrast to
> > > the original rawmemchr function, this patch also supports variants where
> > > the memory pointed to and the pattern are interpreted as 8-, 16-, or
> > > 32-bit sized elements, respectively.
> > >
> > > This patch is not final and I'm looking for some feedback:
> > >
> > > Previously, only loops which mimic the behaviours of functions memset,
> > > memcpy, and memmove have been detected and replaced by corresponding
> > > function calls. One characteristic of those loops/partitions is that
> > > they don't have a reduction. In contrast, loops which mimic the
> > > behaviour of rawmemchr compute a result and therefore have a reduction.
> > > My current attempt is to ensure that the reduction statement is not used
> > > in any other partition, and only in that case to ignore the reduction and
> > > replace the loop by a function call. We then only need to replace the
> > > reduction variable which carried the loop result with the lhs of the
> > > internal function call. This should ensure that the transformation is
> > > correct independently of how partitions are fused/distributed in the
> > > end. Any thoughts about this?
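Concretely, the rewiring described above amounts to the following source-level before/after (a C sketch; rawmemchr_like is a stand-in for the internal function call, not a real GCC symbol):

```c
#include <stdint.h>

/* Stand-in for the IFN_RAWMEMCHR8 call that replaces the partition:
   returns the address of the first byte equal to PATTERN.  */
static uint8_t *
rawmemchr_like (uint8_t *p, uint8_t pattern)
{
  while (*p != pattern)
    ++p;
  return p;
}

/* Before: the loop itself computes the reduction, i.e. the final
   value of the pointer.  */
uint8_t *
before (uint8_t *p, uint8_t pattern)
{
  while (*p != pattern)
    ++p;
  return p;                      /* use of the reduction variable */
}

/* After: the partition is replaced by the call, and the use of the
   old reduction variable is rewired to the lhs of the call.  */
uint8_t *
after (uint8_t *p, uint8_t pattern)
{
  uint8_t *result = rawmemchr_like (p, pattern);
  return result;                 /* use of the call's lhs */
}
```

Both functions return the same address for any input in which the pattern occurs, which is why the replacement stays valid regardless of how the remaining partitions are fused or distributed.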
> >
> > Currently we're forcing reduction partitions last (and forcing a single
> > one by fusing all partitions containing a reduction) because code-generation
> > does not properly update SSA form for the reduction results. ISTR that
> > might be just because we do not copy the LC PHI nodes or do not adjust
> > them when copying. That might not be an issue in case you replace the
> > partition with a call. I guess you can try to have a testcase with
> > two rawmemchr patterns and a regular loop part that has to be scheduled
> > in between both for correctness.
>
> Ah ok, in that case I updated my patch by removing the constraint that
> the reduction statement must be in precisely one partition. Please find
> attached the testcases I have come up with so far. Since transforming a
> loop into a rawmemchr function call is backend dependent, I planned to
> include those only in my backend patch. I wasn't able to come up with
> any testcase where a loop is distributed into multiple partitions and
> where one is classified as a rawmemchr builtin. The latter boils down
> to a for loop with an empty body only, in which case I suspect that
> loop distribution shouldn't be done anyway.
>
> > > Furthermore, I simply added two new members (pattern, fn) to structure
> > > builtin_info, which I consider rather hacky. For the long run I thought
> > > about splitting up structure builtin_info into a union where each member
> > > is a structure for a particular builtin of a partition, i.e., something
> > > like this:
> > >
> > > union builtin_info
> > > {
> > > struct binfo_memset *memset;
> > > struct binfo_memcpymove *memcpymove;
> > > struct binfo_rawmemchr *rawmemchr;
> > > };
> > >
> > > That way, a structure for one builtin does not get "polluted" by a
> > > different one. Any thoughts about this?
> >
> > Probably makes sense if the list of recognized patterns grows further.
> >
> > I see you use internal functions rather than builtin functions. I guess
> > that's OK. But you use new target hooks for expansion, where I think
> > new optab entries similar to cmpmem would be more appropriate,
> > since the distinction between 8, 16 or 32 bits can be encoded in
> > the modes.
>
> The optab implementation is really nice: it allows me to use iterators
> in the backend, which in the end saves me some boilerplate code
> compared to the previous implementation :)
>
> While using optabs now, I only require one additional member (pattern)
> in the builtin_info struct. Thus I didn't want to overcomplicate things
> and kept the single struct approach as is.
>
> For the long run, should I resubmit this patch once stage 1 opens or how
> would you propose to proceed?
>
> Thanks for your review so far!
>
> Cheers,
> Stefan
>
> >
> > Richard.
> >
> > > Cheers,
> > > Stefan
> > > ---
> > > gcc/internal-fn.c | 42 ++++++
> > > gcc/internal-fn.def | 3 +
> > > gcc/target-insns.def | 3 +
> > > gcc/tree-loop-distribution.c | 257 ++++++++++++++++++++++++++++++-----
> > > 4 files changed, 272 insertions(+), 33 deletions(-)
> > >
> > > diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
> > > index dd7173126fb..9cd62544a1a 100644
> > > --- a/gcc/internal-fn.c
> > > +++ b/gcc/internal-fn.c
> > > @@ -2917,6 +2917,48 @@ expand_VEC_CONVERT (internal_fn, gcall *)
> > > gcc_unreachable ();
> > > }
> > >
> > > +static void
> > > +expand_RAWMEMCHR8 (internal_fn, gcall *stmt)
> > > +{
> > > + if (targetm.have_rawmemchr8 ())
> > > + {
> > > + rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, EXPAND_WRITE);
> > > + rtx start = expand_normal (gimple_call_arg (stmt, 0));
> > > + rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
> > > + emit_insn (targetm.gen_rawmemchr8 (result, start, pattern));
> > > + }
> > > + else
> > > + gcc_unreachable ();
> > > +}
> > > +
> > > +static void
> > > +expand_RAWMEMCHR16 (internal_fn, gcall *stmt)
> > > +{
> > > + if (targetm.have_rawmemchr16 ())
> > > + {
> > > + rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, EXPAND_WRITE);
> > > + rtx start = expand_normal (gimple_call_arg (stmt, 0));
> > > + rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
> > > + emit_insn (targetm.gen_rawmemchr16 (result, start, pattern));
> > > + }
> > > + else
> > > + gcc_unreachable ();
> > > +}
> > > +
> > > +static void
> > > +expand_RAWMEMCHR32 (internal_fn, gcall *stmt)
> > > +{
> > > + if (targetm.have_rawmemchr32 ())
> > > + {
> > > + rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, EXPAND_WRITE);
> > > + rtx start = expand_normal (gimple_call_arg (stmt, 0));
> > > + rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
> > > + emit_insn (targetm.gen_rawmemchr32 (result, start, pattern));
> > > + }
> > > + else
> > > + gcc_unreachable ();
> > > +}
> > > +
> > > /* Expand the IFN_UNIQUE function according to its first argument. */
> > >
> > > static void
> > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > > index daeace7a34e..34247859704 100644
> > > --- a/gcc/internal-fn.def
> > > +++ b/gcc/internal-fn.def
> > > @@ -348,6 +348,9 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
> > > DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
> > > DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
> > > DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
> > > +DEF_INTERNAL_FN (RAWMEMCHR8, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
> > > +DEF_INTERNAL_FN (RAWMEMCHR16, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
> > > +DEF_INTERNAL_FN (RAWMEMCHR32, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
> > >
> > > /* An unduplicable, uncombinable function. Generally used to preserve
> > > a CFG property in the face of jump threading, tail merging or
> > > diff --git a/gcc/target-insns.def b/gcc/target-insns.def
> > > index 672c35698d7..9248554cbf3 100644
> > > --- a/gcc/target-insns.def
> > > +++ b/gcc/target-insns.def
> > > @@ -106,3 +106,6 @@ DEF_TARGET_INSN (trap, (void))
> > > DEF_TARGET_INSN (unique, (void))
> > > DEF_TARGET_INSN (untyped_call, (rtx x0, rtx x1, rtx x2))
> > > DEF_TARGET_INSN (untyped_return, (rtx x0, rtx x1))
> > > +DEF_TARGET_INSN (rawmemchr8, (rtx x0, rtx x1, rtx x2))
> > > +DEF_TARGET_INSN (rawmemchr16, (rtx x0, rtx x1, rtx x2))
> > > +DEF_TARGET_INSN (rawmemchr32, (rtx x0, rtx x1, rtx x2))
> > > diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> > > index 7ee19fc8677..f5b24bf53bc 100644
> > > --- a/gcc/tree-loop-distribution.c
> > > +++ b/gcc/tree-loop-distribution.c
> > > @@ -218,7 +218,7 @@ enum partition_kind {
> > > be unnecessary and removed once distributed memset can be understood
> > > and analyzed in data reference analysis. See PR82604 for more. */
> > > PKIND_PARTIAL_MEMSET,
> > > - PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE
> > > + PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE, PKIND_RAWMEMCHR
> > > };
> > >
> > > /* Type of distributed loop. */
> > > @@ -244,6 +244,8 @@ struct builtin_info
> > > is only used in memset builtin distribution for now. */
> > > tree dst_base_base;
> > > unsigned HOST_WIDE_INT dst_base_offset;
> > > + tree pattern;
> > > + internal_fn fn;
> > > };
> > >
> > > /* Partition for loop distribution. */
> > > @@ -588,7 +590,8 @@ class loop_distribution
> > > bool
> > > classify_partition (loop_p loop,
> > > struct graph *rdg, partition *partition,
> > > - bitmap stmt_in_all_partitions);
> > > + bitmap stmt_in_all_partitions,
> > > + vec<struct partition *> *partitions);
> > >
> > >
> > > /* Returns true when PARTITION1 and PARTITION2 access the same memory
> > > @@ -1232,6 +1235,67 @@ generate_memcpy_builtin (class loop *loop, partition *partition)
> > > }
> > > }
> > >
> > > +/* Generate a call to rawmemchr{8,16,32} for PARTITION in LOOP. */
> > > +
> > > +static void
> > > +generate_rawmemchr_builtin (class loop *loop, partition *partition)
> > > +{
> > > + gimple_stmt_iterator gsi;
> > > + tree mem, pattern;
> > > + struct builtin_info *builtin = partition->builtin;
> > > + gimple *fn_call;
> > > +
> > > + data_reference_p dr = builtin->src_dr;
> > > + tree base = builtin->src_base;
> > > +
> > > + tree result_old = TREE_OPERAND (DR_REF (dr), 0);
> > > + tree result_new = copy_ssa_name (result_old);
> > > +
> > > + /* The new statements will be placed before LOOP. */
> > > + gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
> > > +
> > > + mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false, GSI_CONTINUE_LINKING);
> > > + pattern = builtin->pattern;
> > > + if (TREE_CODE (pattern) == INTEGER_CST)
> > > + pattern = fold_convert (integer_type_node, pattern);
> > > + fn_call = gimple_build_call_internal (builtin->fn, 2, mem, pattern);
> > > + gimple_call_set_lhs (fn_call, result_new);
> > > + gimple_set_location (fn_call, partition->loc);
> > > + gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
> > > +
> > > + imm_use_iterator iter;
> > > + gimple *stmt;
> > > + use_operand_p use_p;
> > > + FOR_EACH_IMM_USE_STMT (stmt, iter, result_old)
> > > + {
> > > + FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
> > > + SET_USE (use_p, result_new);
> > > +
> > > + update_stmt (stmt);
> > > + }
> > > +
> > > + fold_stmt (&gsi);
> > > +
> > > + if (dump_file && (dump_flags & TDF_DETAILS))
> > > + switch (builtin->fn)
> > > + {
> > > + case IFN_RAWMEMCHR8:
> > > + fprintf (dump_file, "generated rawmemchr8\n");
> > > + break;
> > > +
> > > + case IFN_RAWMEMCHR16:
> > > + fprintf (dump_file, "generated rawmemchr16\n");
> > > + break;
> > > +
> > > + case IFN_RAWMEMCHR32:
> > > + fprintf (dump_file, "generated rawmemchr32\n");
> > > + break;
> > > +
> > > + default:
> > > + gcc_unreachable ();
> > > + }
> > > +}
> > > +
> > > /* Remove and destroy the loop LOOP. */
> > >
> > > static void
> > > @@ -1334,6 +1398,10 @@ generate_code_for_partition (class loop *loop,
> > > generate_memcpy_builtin (loop, partition);
> > > break;
> > >
> > > + case PKIND_RAWMEMCHR:
> > > + generate_rawmemchr_builtin (loop, partition);
> > > + break;
> > > +
> > > default:
> > > gcc_unreachable ();
> > > }
> > > @@ -1525,44 +1593,53 @@ find_single_drs (class loop *loop, struct graph *rdg, partition *partition,
> > > }
> > > }
> > >
> > > - if (!single_st)
> > > + if (!single_ld && !single_st)
> > > return false;
> > >
> > > - /* Bail out if this is a bitfield memory reference. */
> > > - if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
> > > - && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
> > > - return false;
> > > -
> > > - /* Data reference must be executed exactly once per iteration of each
> > > - loop in the loop nest. We only need to check dominance information
> > > - against the outermost one in a perfect loop nest because a bb can't
> > > - dominate outermost loop's latch without dominating inner loop's. */
> > > - basic_block bb_st = gimple_bb (DR_STMT (single_st));
> > > - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
> > > - return false;
> > > + basic_block bb_ld = NULL;
> > > + basic_block bb_st = NULL;
> > >
> > > if (single_ld)
> > > {
> > > - gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
> > > - /* Direct aggregate copy or via an SSA name temporary. */
> > > - if (load != store
> > > - && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
> > > - return false;
> > > -
> > > /* Bail out if this is a bitfield memory reference. */
> > > if (TREE_CODE (DR_REF (single_ld)) == COMPONENT_REF
> > > && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_ld), 1)))
> > > return false;
> > >
> > > - /* Load and store must be in the same loop nest. */
> > > - basic_block bb_ld = gimple_bb (DR_STMT (single_ld));
> > > - if (bb_st->loop_father != bb_ld->loop_father)
> > > + /* Data reference must be executed exactly once per iteration of each
> > > + loop in the loop nest. We only need to check dominance information
> > > + against the outermost one in a perfect loop nest because a bb can't
> > > + dominate outermost loop's latch without dominating inner loop's. */
> > > + bb_ld = gimple_bb (DR_STMT (single_ld));
> > > + if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
> > > + return false;
> > > + }
> > > +
> > > + if (single_st)
> > > + {
> > > + /* Bail out if this is a bitfield memory reference. */
> > > + if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
> > > + && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
> > > return false;
> > >
> > > /* Data reference must be executed exactly once per iteration.
> > > - Same as single_st, we only need to check against the outermost
> > > + Same as single_ld, we only need to check against the outermost
> > > loop. */
> > > - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
> > > + bb_st = gimple_bb (DR_STMT (single_st));
> > > + if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
> > > + return false;
> > > + }
> > > +
> > > + if (single_ld && single_st)
> > > + {
> > > + gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
> > > + /* Direct aggregate copy or via an SSA name temporary. */
> > > + if (load != store
> > > + && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
> > > + return false;
> > > +
> > > + /* Load and store must be in the same loop nest. */
> > > + if (bb_st->loop_father != bb_ld->loop_father)
> > > return false;
> > >
> > > edge e = single_exit (bb_st->loop_father);
> > > @@ -1681,6 +1758,84 @@ alloc_builtin (data_reference_p dst_dr, data_reference_p src_dr,
> > > return builtin;
> > > }
> > >
> > > +/* Given data reference DR in loop nest LOOP, classify if it forms builtin
> > > + rawmemchr{8,16,32} call. */
> > > +
> > > +static bool
> > > +classify_builtin_rawmemchr (loop_p loop, partition *partition, data_reference_p dr, tree loop_result)
> > > +{
> > > + tree dr_ref = DR_REF (dr);
> > > + tree dr_access_base = build_fold_addr_expr (dr_ref);
> > > + tree dr_access_size = TYPE_SIZE_UNIT (TREE_TYPE (dr_ref));
> > > + gimple *dr_stmt = DR_STMT (dr);
> > > + tree rhs1 = gimple_assign_rhs1 (dr_stmt);
> > > + affine_iv iv;
> > > + tree pattern;
> > > +
> > > + if (TREE_OPERAND (rhs1, 0) != loop_result)
> > > + return false;
> > > +
> > > + /* A limitation of the current implementation is that we only support
> > > + constant patterns. */
> > > + gcond *cond_stmt = as_a <gcond *> (last_stmt (loop->header));
> > > + pattern = gimple_cond_rhs (cond_stmt);
> > > + if (gimple_cond_code (cond_stmt) != NE_EXPR
> > > + || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (dr_stmt)
> > > + || TREE_CODE (pattern) != INTEGER_CST)
> > > + return false;
> > > +
> > > + /* Bail out if no affine induction variable with constant step can be
> > > + determined. */
> > > + if (!simple_iv (loop, loop, dr_access_base, &iv, false))
> > > + return false;
> > > +
> > > + /* Bail out if memory accesses are not consecutive. */
> > > + if (!operand_equal_p (iv.step, dr_access_size, 0))
> > > + return false;
> > > +
> > > + /* Bail out if direction of memory accesses is not growing. */
> > > + if (get_range_pos_neg (iv.step) != 1)
> > > + return false;
> > > +
> > > + internal_fn fn;
> > > + switch (TREE_INT_CST_LOW (iv.step))
> > > + {
> > > + case 1:
> > > + if (!targetm.have_rawmemchr8 ())
> > > + return false;
> > > + fn = IFN_RAWMEMCHR8;
> > > + break;
> > > +
> > > + case 2:
> > > + if (!targetm.have_rawmemchr16 ())
> > > + return false;
> > > + fn = IFN_RAWMEMCHR16;
> > > + break;
> > > +
> > > + case 4:
> > > + if (!targetm.have_rawmemchr32 ())
> > > + return false;
> > > + fn = IFN_RAWMEMCHR32;
> > > + break;
> > > +
> > > + default:
> > > + return false;
> > > + }
> > > +
> > > + struct builtin_info *builtin;
> > > + builtin = alloc_builtin (NULL, NULL, NULL_TREE, NULL_TREE, NULL_TREE);
> > > + builtin->src_dr = dr;
> > > + builtin->src_base = iv.base;
> > > + builtin->pattern = pattern;
> > > + builtin->fn = fn;
> > > +
> > > + partition->loc = gimple_location (dr_stmt);
> > > + partition->builtin = builtin;
> > > + partition->kind = PKIND_RAWMEMCHR;
> > > +
> > > + return true;
> > > +}
> > > +
> > > /* Given data reference DR in loop nest LOOP, classify if it forms builtin
> > > memset call. */
> > >
> > > @@ -1792,12 +1947,16 @@ loop_distribution::classify_builtin_ldst (loop_p loop, struct graph *rdg,
> > > bool
> > > loop_distribution::classify_partition (loop_p loop,
> > > struct graph *rdg, partition *partition,
> > > - bitmap stmt_in_all_partitions)
> > > + bitmap stmt_in_all_partitions,
> > > + vec<struct partition *> *partitions)
> > > {
> > > bitmap_iterator bi;
> > > unsigned i;
> > > data_reference_p single_ld = NULL, single_st = NULL;
> > > bool volatiles_p = false, has_reduction = false;
> > > + unsigned nreductions = 0;
> > > + gimple *reduction_stmt = NULL;
> > > + bool has_interpar_reduction = false;
> > >
> > > EXECUTE_IF_SET_IN_BITMAP (partition->stmts, 0, i, bi)
> > > {
> > > @@ -1821,6 +1980,19 @@ loop_distribution::classify_partition (loop_p loop,
> > > partition->reduction_p = true;
> > > else
> > > has_reduction = true;
> > > +
> > > + /* Determine whether the reduction statement occurs in other
> > > + partitions than the current one. */
> > > + struct partition *piter;
> > > + for (unsigned j = 0; partitions->iterate (j, &piter); ++j)
> > > + {
> > > + if (piter == partition)
> > > + continue;
> > > + if (bitmap_bit_p (piter->stmts, i))
> > > + has_interpar_reduction = true;
> > > + }
> > > + reduction_stmt = stmt;
> > > + ++nreductions;
> > > }
> > > }
> > >
> > > @@ -1840,6 +2012,30 @@ loop_distribution::classify_partition (loop_p loop,
> > > if (!find_single_drs (loop, rdg, partition, &single_st, &single_ld))
> > > return has_reduction;
> > >
> > > + /* If we determined a single load and a single reduction statement which does
> > > + not occur in any other partition, then try to classify this partition as a
> > > + rawmemchr builtin. */
> > > + if (single_ld != NULL
> > > + && single_st == NULL
> > > + && nreductions == 1
> > > + && !has_interpar_reduction
> > > + && is_gimple_assign (reduction_stmt))
> > > + {
> > > + /* If we classified the partition as a builtin, then ignoring the single
> > > + reduction is safe, since the reduction variable is not used in other
> > > + partitions. */
> > > + tree reduction_var = gimple_assign_lhs (reduction_stmt);
> > > + return !classify_builtin_rawmemchr (loop, partition, single_ld, reduction_var);
> > > + }
> > > +
> > > + if (single_st == NULL)
> > > + return has_reduction;
> > > +
> > > + /* Don't distribute loop if niters is unknown. */
> > > + tree niters = number_of_latch_executions (loop);
> > > + if (niters == NULL_TREE || niters == chrec_dont_know)
> > > + return has_reduction;
> > > +
> > > partition->loc = gimple_location (DR_STMT (single_st));
> > >
> > > /* Classify the builtin kind. */
> > > @@ -2979,7 +3175,7 @@ loop_distribution::distribute_loop (class loop *loop, vec<gimple *> stmts,
> > > FOR_EACH_VEC_ELT (partitions, i, partition)
> > > {
> > > reduction_in_all
> > > - |= classify_partition (loop, rdg, partition, stmt_in_all_partitions);
> > > + |= classify_partition (loop, rdg, partition, stmt_in_all_partitions, &partitions);
> > > any_builtin |= partition_builtin_p (partition);
> > > }
> > >
> > > @@ -3290,11 +3486,6 @@ loop_distribution::execute (function *fun)
> > > && !optimize_loop_for_speed_p (loop)))
> > > continue;
> > >
> > > - /* Don't distribute loop if niters is unknown. */
> > > - tree niters = number_of_latch_executions (loop);
> > > - if (niters == NULL_TREE || niters == chrec_dont_know)
> > > - continue;
> > > -
> > > /* Get the perfect loop nest for distribution. */
> > > loop = prepare_perfect_loop_nest (loop);
> > > for (; loop; loop = loop->inner)
> > > --
> > > 2.23.0
> > >
> commit bf792239150
> Author: Stefan Schulze Frielinghaus <stefansf@linux.ibm.com>
> Date: Mon Feb 8 10:35:30 2021 +0100
>
> ldist: Recognize rawmemchr loop patterns
>
> This patch adds support for recognizing loops which mimic the behaviour
> of the function rawmemchr, and replaces them with an internal function
> call in case the target provides a suitable instruction. In contrast
> to the original rawmemchr function, this patch also supports variants
> where the memory pointed to and the pattern are interpreted as 8-,
> 16-, or 32-bit sized elements, respectively.
>
> diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
> index dd7173126fb..18e12b863c6 100644
> --- a/gcc/internal-fn.c
> +++ b/gcc/internal-fn.c
> @@ -2917,6 +2917,31 @@ expand_VEC_CONVERT (internal_fn, gcall *)
> gcc_unreachable ();
> }
>
> +void
> +expand_RAWMEMCHR (internal_fn, gcall *stmt)
> +{
> + expand_operand ops[3];
> +
> + tree lhs = gimple_call_lhs (stmt);
> + tree lhs_type = TREE_TYPE (lhs);
> + rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> + create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (lhs_type));
> +
> + for (unsigned int i = 0; i < 2; ++i)
> + {
> + tree rhs = gimple_call_arg (stmt, i);
> + tree rhs_type = TREE_TYPE (rhs);
> + rtx rhs_rtx = expand_normal (rhs);
> + create_input_operand (&ops[i + 1], rhs_rtx, TYPE_MODE (rhs_type));
> + }
> +
> + insn_code icode = direct_optab_handler (rawmemchr_optab, ops[2].mode);
> +
> + expand_insn (icode, 3, ops);
> + if (!rtx_equal_p (lhs_rtx, ops[0].value))
> + emit_move_insn (lhs_rtx, ops[0].value);
> +}
> +
> /* Expand the IFN_UNIQUE function according to its first argument. */
>
> static void
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index daeace7a34e..95c76795648 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -348,6 +348,7 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
> DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
> DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
> DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
> +DEF_INTERNAL_FN (RAWMEMCHR, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
>
> /* An unduplicable, uncombinable function. Generally used to preserve
> a CFG property in the face of jump threading, tail merging or
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index b192a9d070b..f7c69f914ce 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -267,6 +267,7 @@ OPTAB_D (cpymem_optab, "cpymem$a")
> OPTAB_D (movmem_optab, "movmem$a")
> OPTAB_D (setmem_optab, "setmem$a")
> OPTAB_D (strlen_optab, "strlen$a")
> +OPTAB_D (rawmemchr_optab, "rawmemchr$I$a")
>
> OPTAB_DC(fma_optab, "fma$a4", FMA)
> OPTAB_D (fms_optab, "fms$a4")
> diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> index 7ee19fc8677..09f200da61f 100644
> --- a/gcc/tree-loop-distribution.c
> +++ b/gcc/tree-loop-distribution.c
> @@ -115,6 +115,10 @@ along with GCC; see the file COPYING3. If not see
> #include "tree-vectorizer.h"
> #include "tree-eh.h"
> #include "gimple-fold.h"
> +#include "rtl.h"
> +#include "memmodel.h"
> +#include "insn-codes.h"
> +#include "optabs.h"
>
>
> #define MAX_DATAREFS_NUM \
> @@ -218,7 +222,7 @@ enum partition_kind {
> be unnecessary and removed once distributed memset can be understood
> and analyzed in data reference analysis. See PR82604 for more. */
> PKIND_PARTIAL_MEMSET,
> - PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE
> + PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE, PKIND_RAWMEMCHR
> };
>
> /* Type of distributed loop. */
> @@ -244,6 +248,8 @@ struct builtin_info
> is only used in memset builtin distribution for now. */
> tree dst_base_base;
> unsigned HOST_WIDE_INT dst_base_offset;
> + /* Pattern is used only in rawmemchr builtin distribution for now. */
> + tree pattern;
> };
>
> /* Partition for loop distribution. */
> @@ -1232,6 +1238,66 @@ generate_memcpy_builtin (class loop *loop, partition *partition)
> }
> }
>
> +/* Generate a call to rawmemchr for PARTITION in LOOP. */
> +
> +static void
> +generate_rawmemchr_builtin (class loop *loop, partition *partition)
> +{
> + gimple_stmt_iterator gsi;
> + tree mem, pattern;
> + struct builtin_info *builtin = partition->builtin;
> + gimple *fn_call;
> +
> + data_reference_p dr = builtin->src_dr;
> + tree base = builtin->src_base;
> +
> + tree result_old = build_fold_addr_expr (DR_REF (dr));
> + tree result_new = copy_ssa_name (result_old);
> +
> + /* The new statements will be placed before LOOP. */
> + gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
> +
> + mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false,
> + GSI_CONTINUE_LINKING);
> + pattern = builtin->pattern;
> + fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, mem, pattern);
> + gimple_call_set_lhs (fn_call, result_new);
> + gimple_set_location (fn_call, partition->loc);
> + gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
> +
> + imm_use_iterator iter;
> + gimple *stmt;
> + use_operand_p use_p;
> + FOR_EACH_IMM_USE_STMT (stmt, iter, result_old)
> + {
> + FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
> + SET_USE (use_p, result_new);
> +
> + update_stmt (stmt);
> + }
> +
> + fold_stmt (&gsi);
> +
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + switch (TYPE_MODE (TREE_TYPE (pattern)))
> + {
> + case QImode:
> + fprintf (dump_file, "generated rawmemchrqi\n");
> + break;
> +
> + case HImode:
> + fprintf (dump_file, "generated rawmemchrhi\n");
> + break;
> +
> + case SImode:
> + fprintf (dump_file, "generated rawmemchrsi\n");
> + break;
> +
> + default:
> + gcc_unreachable ();
> + }
> +}
> +
> /* Remove and destroy the loop LOOP. */
>
> static void
> @@ -1334,6 +1400,10 @@ generate_code_for_partition (class loop *loop,
> generate_memcpy_builtin (loop, partition);
> break;
>
> + case PKIND_RAWMEMCHR:
> + generate_rawmemchr_builtin (loop, partition);
> + break;
> +
> default:
> gcc_unreachable ();
> }
> @@ -1525,44 +1595,53 @@ find_single_drs (class loop *loop, struct graph *rdg, partition *partition,
> }
> }
>
> - if (!single_st)
> + if (!single_ld && !single_st)
> return false;
>
> - /* Bail out if this is a bitfield memory reference. */
> - if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
> - && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
> - return false;
> -
> - /* Data reference must be executed exactly once per iteration of each
> - loop in the loop nest. We only need to check dominance information
> - against the outermost one in a perfect loop nest because a bb can't
> - dominate outermost loop's latch without dominating inner loop's. */
> - basic_block bb_st = gimple_bb (DR_STMT (single_st));
> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
> - return false;
> + basic_block bb_ld = NULL;
> + basic_block bb_st = NULL;
>
> if (single_ld)
> {
> - gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
> - /* Direct aggregate copy or via an SSA name temporary. */
> - if (load != store
> - && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
> - return false;
> -
> /* Bail out if this is a bitfield memory reference. */
> if (TREE_CODE (DR_REF (single_ld)) == COMPONENT_REF
> && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_ld), 1)))
> return false;
>
> - /* Load and store must be in the same loop nest. */
> - basic_block bb_ld = gimple_bb (DR_STMT (single_ld));
> - if (bb_st->loop_father != bb_ld->loop_father)
> + /* Data reference must be executed exactly once per iteration of each
> + loop in the loop nest. We only need to check dominance information
> + against the outermost one in a perfect loop nest because a bb can't
> + dominate outermost loop's latch without dominating inner loop's. */
> + bb_ld = gimple_bb (DR_STMT (single_ld));
> + if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
> + return false;
> + }
> +
> + if (single_st)
> + {
> + /* Bail out if this is a bitfield memory reference. */
> + if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
> + && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
> return false;
>
> /* Data reference must be executed exactly once per iteration.
> - Same as single_st, we only need to check against the outermost
> + Same as single_ld, we only need to check against the outermost
> loop. */
> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
> + bb_st = gimple_bb (DR_STMT (single_st));
> + if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
> + return false;
> + }
> +
> + if (single_ld && single_st)
> + {
> + gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
> + /* Direct aggregate copy or via an SSA name temporary. */
> + if (load != store
> + && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
> + return false;
> +
> + /* Load and store must be in the same loop nest. */
> + if (bb_st->loop_father != bb_ld->loop_father)
> return false;
>
> edge e = single_exit (bb_st->loop_father);
> @@ -1681,6 +1760,68 @@ alloc_builtin (data_reference_p dst_dr, data_reference_p src_dr,
> return builtin;
> }
>
> +/* Given data reference DR in loop nest LOOP, classify if it forms builtin
> + rawmemchr call. */
> +
> +static bool
> +classify_builtin_rawmemchr (loop_p loop, partition *partition,
> + data_reference_p dr, tree loop_result)
> +{
> + tree dr_ref = DR_REF (dr);
> + tree dr_access_base = build_fold_addr_expr (dr_ref);
> + tree dr_access_size = TYPE_SIZE_UNIT (TREE_TYPE (dr_ref));
> + gimple *dr_stmt = DR_STMT (dr);
> + affine_iv iv;
> + tree pattern;
> +
> + if (dr_access_base != loop_result)
> + return false;
> +
> + /* A limitation of the current implementation is that we only support
> + constant patterns. */
> + gcond *cond_stmt = as_a <gcond *> (last_stmt (loop->header));
> + pattern = gimple_cond_rhs (cond_stmt);
> + if (gimple_cond_code (cond_stmt) != NE_EXPR
> + || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (dr_stmt)
> + || TREE_CODE (pattern) != INTEGER_CST)
> + return false;
> +
> + /* Bail out if no affine induction variable with constant step can be
> + determined. */
> + if (!simple_iv (loop, loop, dr_access_base, &iv, false))
> + return false;
> +
> + /* Bail out if memory accesses are not consecutive. */
> + if (!operand_equal_p (iv.step, dr_access_size, 0))
> + return false;
> +
> + /* Bail out if direction of memory accesses is not growing. */
> + if (get_range_pos_neg (iv.step) != 1)
> + return false;
> +
> + /* Bail out if target does not provide rawmemchr for a certain mode. */
> + machine_mode mode;
> + switch (TREE_INT_CST_LOW (iv.step))
> + {
> + case 1: mode = QImode; break;
> + case 2: mode = HImode; break;
> + case 4: mode = SImode; break;
> + default: return false;
> + }
> + if (direct_optab_handler (rawmemchr_optab, mode) == CODE_FOR_nothing)
> + return false;
> +
> + struct builtin_info *builtin;
> + builtin = alloc_builtin (NULL, dr, NULL_TREE, iv.base, NULL_TREE);
> + builtin->pattern = pattern;
> +
> + partition->loc = gimple_location (dr_stmt);
> + partition->builtin = builtin;
> + partition->kind = PKIND_RAWMEMCHR;
> +
> + return true;
> +}
> +
> /* Given data reference DR in loop nest LOOP, classify if it forms builtin
> memset call. */
>
> @@ -1798,6 +1939,8 @@ loop_distribution::classify_partition (loop_p loop,
> unsigned i;
> data_reference_p single_ld = NULL, single_st = NULL;
> bool volatiles_p = false, has_reduction = false;
> + unsigned nreductions = 0;
> + gimple *reduction_stmt = NULL;
>
> EXECUTE_IF_SET_IN_BITMAP (partition->stmts, 0, i, bi)
> {
> @@ -1821,6 +1964,10 @@ loop_distribution::classify_partition (loop_p loop,
> partition->reduction_p = true;
> else
> has_reduction = true;
> +
> + /* Determine whether STMT is the only reduction statement or not. */
> + reduction_stmt = stmt;
> + ++nreductions;
> }
> }
>
> @@ -1840,6 +1987,27 @@ loop_distribution::classify_partition (loop_p loop,
> if (!find_single_drs (loop, rdg, partition, &single_st, &single_ld))
> return has_reduction;
>
> + /* If we determined a single load and a single reduction statement, then try
> + to classify this partition as a rawmemchr builtin. */
> + if (single_ld != NULL
> + && single_st == NULL
> + && nreductions == 1
> + && is_gimple_assign (reduction_stmt))
> + {
> + /* If we classified the partition as a builtin, then ignoring the single
> + reduction is safe, since the whole partition is replaced by a call. */
> + tree reduction_var = gimple_assign_lhs (reduction_stmt);
> + return !classify_builtin_rawmemchr (loop, partition, single_ld, reduction_var);
> + }
> +
> + if (single_st == NULL)
> + return has_reduction;
> +
> + /* Don't distribute loop if niters is unknown. */
> + tree niters = number_of_latch_executions (loop);
> + if (niters == NULL_TREE || niters == chrec_dont_know)
> + return has_reduction;
> +
> partition->loc = gimple_location (DR_STMT (single_st));
>
> /* Classify the builtin kind. */
> @@ -3290,11 +3458,6 @@ loop_distribution::execute (function *fun)
> && !optimize_loop_for_speed_p (loop)))
> continue;
>
> - /* Don't distribute loop if niters is unknown. */
> - tree niters = number_of_latch_executions (loop);
> - if (niters == NULL_TREE || niters == chrec_dont_know)
> - continue;
> -
> /* Get the perfect loop nest for distribution. */
> loop = prepare_perfect_loop_nest (loop);
> for (; loop; loop = loop->inner)
> /* { dg-do compile } */
> /* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
>
> #include <cstdint>
>
> template<typename T, T pattern>
> T *rawmemchr(T *s)
> {
> while (*s != pattern)
> ++s;
> return s;
> }
>
> template uint8_t *rawmemchr<uint8_t, 0xab>(uint8_t *);
> template uint16_t *rawmemchr<uint16_t, 0xabcd>(uint16_t *);
> template uint32_t *rawmemchr<uint32_t, 0xabcdef15>(uint32_t *);
>
> template int8_t *rawmemchr<int8_t, (int8_t)0xab>(int8_t *);
> template int16_t *rawmemchr<int16_t, (int16_t)0xabcd>(int16_t *);
> template int32_t *rawmemchr<int32_t, (int32_t)0xabcdef15>(int32_t *);
>
> /* { dg-final { scan-tree-dump-times "generated rawmemchrqi" 2 "ldist" } } */
> /* { dg-final { scan-tree-dump-times "generated rawmemchrhi" 2 "ldist" } } */
> /* { dg-final { scan-tree-dump-times "generated rawmemchrsi" 2 "ldist" } } */
> /* { dg-do run } */
> /* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details -mzarch -march=z13" } */
>
> #include <cstring>
> #include <cassert>
> #include <cstdint>
> #include <iostream>
>
> template <typename T, T pattern>
> __attribute__((noinline,noclone))
> T* rawmemchr (T *s)
> {
> while (*s != pattern)
> ++s;
> return s;
> }
>
> template <typename T, T pattern>
> void doit()
> {
> T *buf = new T[4096 * 2];
> assert (buf != NULL);
> memset (buf, 0xa, 4096 * 2);
> // q = first 4096-byte boundary strictly above buf (pointer
> // arithmetic on T* counts elements, so compute in bytes)
> T *q = (T *)(((uintptr_t)buf | 4095) + 1);
> T *p;
>
> // unaligned + block boundary + 1st load
> p = (T *) ((uintptr_t)q - 8);
> p[2] = pattern;
> assert ((rawmemchr<T, pattern> (&p[0]) == &p[2]));
> p[2] = (T) 0xaaaaaaaa;
>
> // unaligned + block boundary + 2nd load
> p = (T *) ((uintptr_t)q - 8);
> p[6] = pattern;
> assert ((rawmemchr<T, pattern> (&p[0]) == &p[6]));
> p[6] = (T) 0xaaaaaaaa;
>
>
>
> // unaligned + 1st load
> q[5] = pattern;
> assert ((rawmemchr<T, pattern>(&q[2]) == &q[5]));
> q[5] = (T) 0xaaaaaaaa;
>
> // unaligned + 2nd load
> q[14] = pattern;
> assert ((rawmemchr<T, pattern>(&q[2]) == &q[14]));
> q[14] = (T) 0xaaaaaaaa;
>
> // unaligned + 3rd load
> q[19] = pattern;
> assert ((rawmemchr<T, pattern>(&q[2]) == &q[19]));
> q[19] = (T) 0xaaaaaaaa;
>
> // unaligned + 4th load
> q[25] = pattern;
> assert ((rawmemchr<T, pattern>(&q[2]) == &q[25]));
> q[25] = (T) 0xaaaaaaaa;
>
>
>
> // aligned + 1st load
> q[5] = pattern;
> assert ((rawmemchr<T, pattern>(&q[0]) == &q[5]));
> q[5] = (T) 0xaaaaaaaa;
>
> // aligned + 2nd load
> q[14] = pattern;
> assert ((rawmemchr<T, pattern>(&q[0]) == &q[14]));
> q[14] = (T) 0xaaaaaaaa;
>
> // aligned + 3rd load
> q[19] = pattern;
> assert ((rawmemchr<T, pattern>(&q[0]) == &q[19]));
> q[19] = (T) 0xaaaaaaaa;
>
> // aligned + 4th load
> q[25] = pattern;
> assert ((rawmemchr<T, pattern>(&q[0]) == &q[25]));
> q[25] = (T) 0xaaaaaaaa;
>
> delete[] buf;
> }
>
> int main(void)
> {
> doit<uint8_t, 0xde> ();
> doit<int8_t, (int8_t)0xde> ();
> doit<uint16_t, 0xdead> ();
> doit<int16_t, (int16_t)0xdead> ();
> doit<uint32_t, 0xdeadbeef> ();
> doit<int32_t, (int32_t)0xdeadbeef> ();
> return 0;
> }
>
> /* { dg-final { scan-tree-dump-times "generated rawmemchrqi" 2 "ldist" } } */
> /* { dg-final { scan-tree-dump-times "generated rawmemchrhi" 2 "ldist" } } */
> /* { dg-final { scan-tree-dump-times "generated rawmemchrsi" 2 "ldist" } } */
Thread overview: 29+ messages
2021-02-08 12:47 Stefan Schulze Frielinghaus
2021-02-09 8:57 ` Richard Biener
2021-02-14 10:27 ` Stefan Schulze Frielinghaus
2021-02-25 17:01 ` Stefan Schulze Frielinghaus [this message]
2021-02-25 23:49 ` Jeff Law
2021-03-02 12:29 ` Richard Biener
2021-03-03 17:17 ` Stefan Schulze Frielinghaus
2021-03-16 17:13 ` Stefan Schulze Frielinghaus
2021-04-08 8:23 ` Stefan Schulze Frielinghaus
2021-05-04 17:25 ` Stefan Schulze Frielinghaus
2021-05-05 9:36 ` Richard Biener
2021-05-05 10:03 ` Richard Biener
2021-05-07 12:32 ` Stefan Schulze Frielinghaus
2021-05-20 9:24 ` Richard Biener
2021-05-20 18:37 ` Stefan Schulze Frielinghaus
2021-06-14 17:26 ` Stefan Schulze Frielinghaus
2021-06-16 14:22 ` Richard Biener
2021-06-25 10:23 ` Stefan Schulze Frielinghaus
2021-08-06 14:02 ` Stefan Schulze Frielinghaus
2021-08-20 10:35 ` Richard Biener
2021-09-03 8:00 ` Stefan Schulze Frielinghaus
2021-09-06 9:56 ` Richard Biener
2021-09-13 14:53 ` Stefan Schulze Frielinghaus
2021-09-17 8:08 ` Richard Biener
2021-10-11 16:02 ` Stefan Schulze Frielinghaus
2022-01-31 13:16 ` Tom de Vries
2022-01-31 15:00 ` Richard Biener
2022-01-31 16:26 ` [PATCH][ldist] Don't add lib calls with -fno-tree-loop-distribute-patterns Tom de Vries
2022-02-01 7:04 ` Richard Biener