Date: Fri, 6 Aug 2021 16:02:09 +0200
From: Stefan Schulze Frielinghaus
To: Richard Biener
Cc: GCC Patches
Subject: Re: [RFC] ldist: Recognize rawmemchr loop patterns
ping

On Fri, Jun 25, 2021 at 12:23:32PM +0200, Stefan Schulze Frielinghaus wrote: > On Wed, Jun 16, 2021 at 04:22:35PM +0200, Richard Biener wrote: > > On Mon, Jun 14, 2021 at 7:26 PM Stefan Schulze Frielinghaus > > wrote: > > > > > > On Thu, May 20, 2021 at 08:37:24PM +0200, Stefan Schulze Frielinghaus wrote: > > > [...] > > > > > but we won't ever arrive here because of the niters condition. But > > > > > yes, doing the pattern matching in the innermost loop processing code > > > > > looks good to me - for the specific case it would be > > > > > > > > > > /* Don't distribute loop if niters is unknown. */ > > > > > tree niters = number_of_latch_executions (loop); > > > > > if (niters == NULL_TREE || niters == chrec_dont_know) > > > > > ---> here? > > > > > continue; > > > > > > > > Right, please find attached a new version of the patch where everything > > > > is included in the loop distribution pass. I will do a bootstrap and > > > > regtest on IBM Z over night. If you give me green light I will also do > > > > the same on x86_64. > > > > > > Meanwhile I gave it a shot on x86_64 where the testsuite runs fine (at > > > least the ldist-strlen testcase). If you are Ok with the patch, then I > > > would rebase and run the testsuites again and post a patch series > > > including the rawmemchr implementation for IBM Z. > > > > @@ -3257,6 +3261,464 @@ find_seed_stmts_for_distribution (class loop > > *loop, vec *work_list) > > return work_list->length () > 0; > > } > > > > +static void > > +generate_rawmemchr_builtin (loop_p loop, tree reduction_var, > > + data_reference_p store_dr, tree base, tree pattern, > > + location_t loc) > > +{ > > > > this new function needs a comment. Applies to all of the new ones, btw. > > Done. > > > + gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base)) > > + && TREE_TYPE (TREE_TYPE (base)) == TREE_TYPE (pattern)); > > > > this looks fragile and is probably unnecessary as well. > > > > + gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (base)); > > > > in general you want types_compatible_p () checks which for pointers means > > all pointers are compatible ... > > True, I removed both asserts. > > > (skipping stuff) > > > > @@ -3321,10 +3783,20 @@ loop_distribution::execute (function *fun) > > && !optimize_loop_for_speed_p (loop))) > > continue; > > > > - /* Don't distribute loop if niters is unknown. */ > > + /* If niters is unknown don't distribute loop but rather try to transform > > + it to a call to a builtin. */ > > tree niters = number_of_latch_executions (loop); > > if (niters == NULL_TREE || niters == chrec_dont_know) > > - continue; > > + { > > + if (transform_reduction_loop (loop)) > > + { > > + changed = true; > > + loops_to_be_destroyed.safe_push (loop); > > + if (dump_file) > > + fprintf (dump_file, "Loop %d transformed into a > > builtin.\n", loop->num); > > + } > > + continue; > > + } > > > > please look at > > > > if (nb_generated_loops + nb_generated_calls > 0) > > { > > changed = true; > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, > > loc, "Loop%s %d distributed: split to > > %d loops " > > "and %d library calls.\n", str, loop->num, > > nb_generated_loops, nb_generated_calls); > > > > and follow the use of dump_* and MSG_OPTIMIZED_LOCATIONS so the > > transforms are reported with -fopt-info-loop > > Done.
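For illustration, a minimal sketch of the kind of loop this new code path is meant to catch once the niters check fails (not part of the patch; the function name is made up, the shape and constant mirror the ldist-rawmemchr testcases further down):

  #include <stdint.h>

  /* The number of iterations is unknown at compile time, so classic loop
     distribution gives up and transform_reduction_loop gets a chance.
     Single load, constant pattern, no store, and the pointer itself is
     the only value live after the loop.  */
  uint32_t *scan32 (uint32_t *p)
  {
    while (*p != 0xabcdef15)
      ++p;
    return p;
  }

With the patch applied and -O2 -ftree-loop-distribution, on a target that provides the rawmemchr optab for SImode (the IBM Z implementation is to be posted separately, per the thread) such a loop should be replaced by a call to the new IFN_RAWMEMCHR internal function, and with -fopt-info-loop the transform should show up via the "Loop %d transformed into a builtin." message quoted above.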
> > > + > > + return transform_reduction_loop_1 (loop, load_dr, store_dr, reduction_var); > > +} > > > > what's the point in tail-calling here and visually splitting the > > function in half? > > In the first place I thought that this is more pleasant since in > transform_reduction_loop_1 it is settled that we have a single load, > store, and reduction variable. After refactoring this isn't true > anymore and I inlined the function and made this clear via a comment. > > > > > (sorry for picking random pieces now ;)) > > > > + for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi); > > + gsi_next (&bsi), ++ninsns) > > + { > > > > this counts debug insns, I guess you want gsi_next_nondebug at least. > > not sure why you are counting PHIs at all btw - for the loops you match > > you are expecting at most two, one IV and eventually one for the virtual > > operand of the store? > > Yes, I removed the counting for the phi loop and changed to > gsi_next_nondebug for both loops. > > > > > + if (gimple_has_volatile_ops (phi)) > > + return false; > > > > PHIs never have volatile ops. > > > > + if (gimple_clobber_p (phi)) > > + continue; > > > > or are clobbers. > > Removed both. > > > Btw, can you factor out a helper from find_single_drs working on a > > stmt to reduce code duplication? > > Ahh sorry for that. I've already done this in one of my first patches > but didn't copy that over. Although my changes do not require a RDG the > whole pass is based upon this data structure. Therefore, in order to > share more code I decided to temporarily build the RDG so that I can > call into find_single_drs. Since the graph is rather small I guess the > overhead is acceptable w.r.t. code sharing. > > struct graph *rdg = build_rdg (loop, NULL); > if (rdg == NULL) > { > if (dump_file && (dump_flags & TDF_DETAILS)) > fprintf (dump_file, > "Loop %d not transformed: failed to build the RDG.\n", > loop->num); > > return false; > } > auto_bitmap partition_stmts; > bitmap_set_range (partition_stmts, 0, rdg->n_vertices); > find_single_drs (loop, rdg, partition_stmts, &store_dr, &load_dr); > free_rdg (rdg); > > As a side-effect of this, now, I also have to (de)allocate the class > member datarefs_vec prior/after calling into transform_reduction_loop: > > /* If niters is unknown don't distribute loop but rather try to transform > it to a call to a builtin. */ > tree niters = number_of_latch_executions (loop); > if (niters == NULL_TREE || niters == chrec_dont_know) > { > datarefs_vec.create (20); > if (transform_reduction_loop (loop)) > { > changed = true; > loops_to_be_destroyed.safe_push (loop); > if (dump_enabled_p ()) > { > dump_user_location_t loc = find_loop_location (loop); > dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, > loc, "Loop %d transformed into a builtin.\n", > loop->num); > } > } > free_data_refs (datarefs_vec); > continue; > } > > > > > + tree reduction_var; > > + switch (gimple_code (reduction_stmt)) > > + { > > + case GIMPLE_PHI: > > + reduction_var = gimple_phi_result (reduction_stmt); > > + break; > > + case GIMPLE_ASSIGN: > > + reduction_var = gimple_assign_lhs (reduction_stmt); > > + break; > > + default: > > + /* Bail out e.g. for GIMPLE_CALL. */ > > + return false; > > > > gimple_get_lhs (reduction_stmt); would work for both PHIs > > and assigns. > > Done. > > > > > + if (reduction_var == NULL) > > + return false; > > > > it can never be NULL here. > > True, otherwise the reduction statement wouldn't have a dependence outside > the loop. => Removed. 
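To make the reduction-variable discussion concrete, here is a small sketch (essentially test_pos from the new ldist-strlen-2.c testcase further down, not additional functionality):

  /* strlen-like loop with a non-zero start index.  The only value defined
     inside the loop and used after it is the counter i, so the statement
     defining the live-out value of i is the reduction statement and
     reduction_var is its lhs; gimple_get_lhs covers both the PHI and the
     assignment form.  For the rawmemchr-like loops the reduction variable
     is the pointer induction variable instead.  */
  int test_pos (char *s)
  {
    int i;
    for (i = 42; s[i]; ++i);
    return i;
  }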
> > > > > + /* Bail out if this is a bitfield memory reference. */ > > + if (TREE_CODE (DR_REF (load_dr)) == COMPONENT_REF > > + && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (load_dr), 1))) > > + return false; > > ... > > > > I see this is again quite some code copied from find_single_drs, please > > see how to avoid this much duplication by splitting out helpers. > > Sorry again. Hope the solution above is more appropriate. > > > > > +static bool > > +transform_reduction_loop_1 (loop_p loop, > > + data_reference_p load_dr, > > + data_reference_p store_dr, > > + tree reduction_var) > > +{ > > + tree load_ref = DR_REF (load_dr); > > + tree load_type = TREE_TYPE (load_ref); > > + tree load_access_base = build_fold_addr_expr (load_ref); > > + tree load_access_size = TYPE_SIZE_UNIT (load_type); > > + affine_iv load_iv, reduction_iv; > > + tree pattern; > > + > > + /* A limitation of the current implementation is that we only support > > + constant patterns. */ > > + edge e = single_exit (loop); > > + gcond *cond_stmt = safe_dyn_cast (last_stmt (e->src)); > > + if (!cond_stmt) > > + return false; > > > > that looks like checks to be done at the start of > > transform_reduction_loop, not this late. > > Pulled this to the very beginning of transform_reduction_loop. > > > > > + if (gimple_cond_code (cond_stmt) != NE_EXPR > > + || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (DR_STMT (load_dr)) > > + || TREE_CODE (pattern) != INTEGER_CST) > > + return false; > > > > half of this as well. Btw, there's no canonicalization for > > the tests so you have to verify the false edge actually exits > > the loop and allow for EQ_EXPR in case the false edge does. > > Uh good point. I added checks for that and pulled most of it to the > beginning of transform_reduction_loop. > > > > > + /* Handle strlen like loops. */ > > + if (store_dr == NULL > > + && integer_zerop (pattern) > > + && TREE_CODE (reduction_iv.base) == INTEGER_CST > > + && TREE_CODE (reduction_iv.step) == INTEGER_CST > > + && integer_onep (reduction_iv.step) > > + && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node) > > + || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))) > > + { > > > > I wonder what goes wrong with a larger or smaller wrapping IV type? > > The iteration > > only stops when you load a NUL and the increments just wrap along (you're > > using the pointer IVs to compute the strlen result). Can't you simply truncate? > > I think truncation is enough as long as no overflow occurs in strlen or > strlen_using_rawmemchr. > > > For larger than size_type_node (actually larger than ptr_type_node would matter > > I guess), the argument is that since pointer wrapping would be undefined anyway > > the IV cannot wrap either. Now, the correct check here would IMHO be > > > > TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION > > (ptr_type_node) > > || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var)) > > > > ? > > Regarding the implementation which makes use of rawmemchr: > > We can count at most PTRDIFF_MAX many bytes without an overflow. Thus, > the maximal length we can determine of a string where each character has > size S is PTRDIFF_MAX / S without an overflow. Since an overflow for > ptrdiff type is undefined we have to make sure that if an overflow > occurs, then an overflow occurs for reduction variable, too, and that > this is undefined, too. However, I'm not sure anymore whether we want > to respect overflows in all cases. 
If TYPE_PRECISION (ptr_type_node) > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then > this would mean that a single string consumes more than half of the > virtual addressable memory. At least for architectures where > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable > to neglect the case where computing pointer difference may overflow. > Otherwise we are talking about strings with lengths of multiple > pebibytes. For other architectures we might have to be more precise > and make sure that reduction variable overflows first and that this is > undefined. > > Thus a conservative condition would be (I assumed that the size of any > integral type is a power of two, though I'm not sure this really holds; > IIRC the C standard requires only that the alignment is a power of two > but not necessarily the size so I might need to change this): > > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - log2 (sizeof (load_type))) > or in other words return true if reduction variable overflows first > and false otherwise. */ > > static bool > reduction_var_overflows_first (tree reduction_var, tree load_type) > { > unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node); > unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE (reduction_var)); > unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT (load_type))); > return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - size_exponent); > } > > TYPE_PRECISION (ptrdiff_type_node) == 64 > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)) > && reduction_var_overflows_first (reduction_var, load_type)) > > Regarding the implementation which makes use of strlen: > > I'm not sure what it means if strlen is called for a string with a > length greater than SIZE_MAX. Therefore, similar to the implementation > using rawmemchr where we neglect the case of an overflow for 64-bit > architectures, a conservative condition would be: > > TYPE_PRECISION (size_type_node) == 64 > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)) > && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node)) > > I still included the overflow undefined check for reduction variable in > order to rule out situations where the reduction variable is unsigned > and overflows as many times as needed until strlen(,_using_rawmemchr) overflows, > too. Maybe this is all theoretical nonsense but I'm afraid of uncommon > architectures. Anyhow, while writing this down it becomes clear that > this deserves a comment which I will add once it becomes clear which way > to go. > > > > + if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))) > > + { > > + const char *msg = G_("assuming signed overflow does not occur " > > + "when optimizing strlen like loop"); > > + fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC); > > + } > > > > no, please don't add any new strict-overflow warnings ;) > > I just stumbled over code which produces such a warning and thought this > is a hard requirement :D The new patch doesn't contain it anymore. > > > > > > The generate_*_builtin routines need some factoring - if you code-generate > > into a gimple_seq you could use gimple_build () which would do the fold_stmt > > (not sure why you do that - you should see to fold the call, not necessarily > > the rest). The replacement of reduction_var and the dumping could be shared. > > There's also GET_MODE_NAME for the printing. > > I wasn't really sure which way to go.
Use a gsi, as it is done by > existing generate_* functions, or make use of gimple_seq. Since the > latter uses internally also gsi I thought it is better to stick to gsi > in the first place. Now, after changing to gimple_seq I see the beauty > of it :) > > I created two helper functions generate_strlen_builtin_1 and > generate_reduction_builtin_1 in order to reduce code duplication. > > In function generate_strlen_builtin I changed from using > builtin_decl_implicit (BUILT_IN_STRLEN) to builtin_decl_explicit > (BUILT_IN_STRLEN) since the former could return a NULL pointer. I'm not > sure whether my intuition about the difference between implicit and > explicit builtins is correct. In builtins.def there is a small example > given which I would paraphrase as "use builtin_decl_explicit if the > semantics of the builtin is defined by the C standard; otherwise use > builtin_decl_implicit" but probably my intuition is wrong? > > Beside that I'm not sure whether I really have to call > build_fold_addr_expr which looks superfluous to me since > gimple_build_call can deal with ADDR_EXPR as well as FUNCTION_DECL: > > tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN)); > gimple *fn_call = gimple_build_call (fn, 1, mem); > > However, since it is also used that way in the context of > generate_memset_builtin I didn't remove it so far. > > > I think overall the approach is sound now but the details still need work. > > Once again thank you very much for your review. Really appreciated! > > Cheers, > Stefan > commit d24639c895ce3c0f9539570bab7b6510e98b1ffa > Author: Stefan Schulze Frielinghaus > Date: Wed Mar 17 09:00:06 2021 +0100 > > foo > > diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c > index d209a52f823..633f41d9e6d 100644 > --- a/gcc/internal-fn.c > +++ b/gcc/internal-fn.c > @@ -2929,6 +2929,35 @@ expand_VEC_CONVERT (internal_fn, gcall *) > gcc_unreachable (); > } > > +/* Expand IFN_RAWMEMCHAR internal function. */ > + > +void > +expand_RAWMEMCHR (internal_fn, gcall *stmt) > +{ > + expand_operand ops[3]; > + > + tree lhs = gimple_call_lhs (stmt); > + if (!lhs) > + return; > + tree lhs_type = TREE_TYPE (lhs); > + rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE); > + create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (lhs_type)); > + > + for (unsigned int i = 0; i < 2; ++i) > + { > + tree rhs = gimple_call_arg (stmt, i); > + tree rhs_type = TREE_TYPE (rhs); > + rtx rhs_rtx = expand_normal (rhs); > + create_input_operand (&ops[i + 1], rhs_rtx, TYPE_MODE (rhs_type)); > + } > + > + insn_code icode = direct_optab_handler (rawmemchr_optab, ops[2].mode); > + > + expand_insn (icode, 3, ops); > + if (!rtx_equal_p (lhs_rtx, ops[0].value)) > + emit_move_insn (lhs_rtx, ops[0].value); > +} > + > /* Expand the IFN_UNIQUE function according to its first argument. */ > > static void > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def > index daeace7a34e..95c76795648 100644 > --- a/gcc/internal-fn.def > +++ b/gcc/internal-fn.def > @@ -348,6 +348,7 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL) > DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL) > DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL) > DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL) > +DEF_INTERNAL_FN (RAWMEMCHR, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL) > > /* An unduplicable, uncombinable function. 
Generally used to preserve > a CFG property in the face of jump threading, tail merging or > diff --git a/gcc/optabs.def b/gcc/optabs.def > index b192a9d070b..f7c69f914ce 100644 > --- a/gcc/optabs.def > +++ b/gcc/optabs.def > @@ -267,6 +267,7 @@ OPTAB_D (cpymem_optab, "cpymem$a") > OPTAB_D (movmem_optab, "movmem$a") > OPTAB_D (setmem_optab, "setmem$a") > OPTAB_D (strlen_optab, "strlen$a") > +OPTAB_D (rawmemchr_optab, "rawmemchr$I$a") > > OPTAB_DC(fma_optab, "fma$a4", FMA) > OPTAB_D (fms_optab, "fms$a4") > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c > new file mode 100644 > index 00000000000..6db62d7644d > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c > @@ -0,0 +1,72 @@ > +/* { dg-do run { target s390x-*-* } } */ > +/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */ > +/* { dg-final { scan-tree-dump-times "generated rawmemchrQI" 2 "ldist" { target s390x-*-* } } } */ > +/* { dg-final { scan-tree-dump-times "generated rawmemchrHI" 2 "ldist" { target s390x-*-* } } } */ > +/* { dg-final { scan-tree-dump-times "generated rawmemchrSI" 2 "ldist" { target s390x-*-* } } } */ > + > +/* Rawmemchr pattern: reduction stmt but no store */ > + > +#include > +#include > + > +typedef __SIZE_TYPE__ size_t; > +extern void* malloc (size_t); > +extern void* memset (void*, int, size_t); > + > +#define test(T, pattern) \ > +__attribute__((noinline)) \ > +T *test_##T (T *p) \ > +{ \ > + while (*p != (T)pattern) \ > + ++p; \ > + return p; \ > +} > + > +test (uint8_t, 0xab) > +test (uint16_t, 0xabcd) > +test (uint32_t, 0xabcdef15) > + > +test (int8_t, 0xab) > +test (int16_t, 0xabcd) > +test (int32_t, 0xabcdef15) > + > +#define run(T, pattern, i) \ > +{ \ > +T *q = p; \ > +q[i] = (T)pattern; \ > +assert (test_##T (p) == &q[i]); \ > +q[i] = 0; \ > +} > + > +int main(void) > +{ > + void *p = malloc (1024); > + assert (p); > + memset (p, 0, 1024); > + > + run (uint8_t, 0xab, 0); > + run (uint8_t, 0xab, 1); > + run (uint8_t, 0xab, 13); > + > + run (uint16_t, 0xabcd, 0); > + run (uint16_t, 0xabcd, 1); > + run (uint16_t, 0xabcd, 13); > + > + run (uint32_t, 0xabcdef15, 0); > + run (uint32_t, 0xabcdef15, 1); > + run (uint32_t, 0xabcdef15, 13); > + > + run (int8_t, 0xab, 0); > + run (int8_t, 0xab, 1); > + run (int8_t, 0xab, 13); > + > + run (int16_t, 0xabcd, 0); > + run (int16_t, 0xabcd, 1); > + run (int16_t, 0xabcd, 13); > + > + run (int32_t, 0xabcdef15, 0); > + run (int32_t, 0xabcdef15, 1); > + run (int32_t, 0xabcdef15, 13); > + > + return 0; > +} > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c > new file mode 100644 > index 00000000000..00d6ea0f8e9 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c > @@ -0,0 +1,83 @@ > +/* { dg-do run { target s390x-*-* } } */ > +/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */ > +/* { dg-final { scan-tree-dump-times "generated rawmemchrQI" 2 "ldist" { target s390x-*-* } } } */ > +/* { dg-final { scan-tree-dump-times "generated rawmemchrHI" 2 "ldist" { target s390x-*-* } } } */ > +/* { dg-final { scan-tree-dump-times "generated rawmemchrSI" 2 "ldist" { target s390x-*-* } } } */ > + > +/* Rawmemchr pattern: reduction stmt and store */ > + > +#include > +#include > + > +typedef __SIZE_TYPE__ size_t; > +extern void* malloc (size_t); > +extern void* memset (void*, int, size_t); > + > +uint8_t *p_uint8_t; > +uint16_t *p_uint16_t; > +uint32_t *p_uint32_t; 
> + > +int8_t *p_int8_t; > +int16_t *p_int16_t; > +int32_t *p_int32_t; > + > +#define test(T, pattern) \ > +__attribute__((noinline)) \ > +T *test_##T (void) \ > +{ \ > + while (*p_##T != pattern) \ > + ++p_##T; \ > + return p_##T; \ > +} > + > +test (uint8_t, 0xab) > +test (uint16_t, 0xabcd) > +test (uint32_t, 0xabcdef15) > + > +test (int8_t, (int8_t)0xab) > +test (int16_t, (int16_t)0xabcd) > +test (int32_t, (int32_t)0xabcdef15) > + > +#define run(T, pattern, i) \ > +{ \ > +T *q = p; \ > +q[i] = pattern; \ > +p_##T = p; \ > +T *r = test_##T (); \ > +assert (r == p_##T); \ > +assert (r == &q[i]); \ > +q[i] = 0; \ > +} > + > +int main(void) > +{ > + void *p = malloc (1024); > + assert (p); > + memset (p, '\0', 1024); > + > + run (uint8_t, 0xab, 0); > + run (uint8_t, 0xab, 1); > + run (uint8_t, 0xab, 13); > + > + run (uint16_t, 0xabcd, 0); > + run (uint16_t, 0xabcd, 1); > + run (uint16_t, 0xabcd, 13); > + > + run (uint32_t, 0xabcdef15, 0); > + run (uint32_t, 0xabcdef15, 1); > + run (uint32_t, 0xabcdef15, 13); > + > + run (int8_t, (int8_t)0xab, 0); > + run (int8_t, (int8_t)0xab, 1); > + run (int8_t, (int8_t)0xab, 13); > + > + run (int16_t, (int16_t)0xabcd, 0); > + run (int16_t, (int16_t)0xabcd, 1); > + run (int16_t, (int16_t)0xabcd, 13); > + > + run (int32_t, (int32_t)0xabcdef15, 0); > + run (int32_t, (int32_t)0xabcdef15, 1); > + run (int32_t, (int32_t)0xabcdef15, 13); > + > + return 0; > +} > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c > new file mode 100644 > index 00000000000..918b60099e4 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c > @@ -0,0 +1,100 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */ > +/* { dg-final { scan-tree-dump-times "generated strlenQI\n" 4 "ldist" } } */ > +/* { dg-final { scan-tree-dump-times "generated strlenHI\n" 4 "ldist" { target s390x-*-* } } } */ > +/* { dg-final { scan-tree-dump-times "generated strlenSI\n" 4 "ldist" { target s390x-*-* } } } */ > + > +#include > +#include > + > +typedef __SIZE_TYPE__ size_t; > +extern void* malloc (size_t); > +extern void* memset (void*, int, size_t); > + > +#define test(T, U) \ > +__attribute__((noinline)) \ > +U test_##T##U (T *s) \ > +{ \ > + U i; \ > + for (i=0; s[i]; ++i); \ > + return i; \ > +} > + > +test (uint8_t, size_t) > +test (uint16_t, size_t) > +test (uint32_t, size_t) > +test (uint8_t, int) > +test (uint16_t, int) > +test (uint32_t, int) > + > +test (int8_t, size_t) > +test (int16_t, size_t) > +test (int32_t, size_t) > +test (int8_t, int) > +test (int16_t, int) > +test (int32_t, int) > + > +#define run(T, U, i) \ > +{ \ > +T *q = p; \ > +q[i] = 0; \ > +assert (test_##T##U (p) == i); \ > +memset (&q[i], 0xf, sizeof (T)); \ > +} > + > +int main(void) > +{ > + void *p = malloc (1024); > + assert (p); > + memset (p, 0xf, 1024); > + > + run (uint8_t, size_t, 0); > + run (uint8_t, size_t, 1); > + run (uint8_t, size_t, 13); > + > + run (int8_t, size_t, 0); > + run (int8_t, size_t, 1); > + run (int8_t, size_t, 13); > + > + run (uint8_t, int, 0); > + run (uint8_t, int, 1); > + run (uint8_t, int, 13); > + > + run (int8_t, int, 0); > + run (int8_t, int, 1); > + run (int8_t, int, 13); > + > + run (uint16_t, size_t, 0); > + run (uint16_t, size_t, 1); > + run (uint16_t, size_t, 13); > + > + run (int16_t, size_t, 0); > + run (int16_t, size_t, 1); > + run (int16_t, size_t, 13); > + > + run (uint16_t, int, 0); > + run (uint16_t, int, 1); > + run (uint16_t, int, 13); > + > + run 
(int16_t, int, 0); > + run (int16_t, int, 1); > + run (int16_t, int, 13); > + > + run (uint32_t, size_t, 0); > + run (uint32_t, size_t, 1); > + run (uint32_t, size_t, 13); > + > + run (int32_t, size_t, 0); > + run (int32_t, size_t, 1); > + run (int32_t, size_t, 13); > + > + run (uint32_t, int, 0); > + run (uint32_t, int, 1); > + run (uint32_t, int, 13); > + > + run (int32_t, int, 0); > + run (int32_t, int, 1); > + run (int32_t, int, 13); > + > + return 0; > +} > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c > new file mode 100644 > index 00000000000..e25d6ea5b56 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c > @@ -0,0 +1,58 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */ > +/* { dg-final { scan-tree-dump-times "generated strlenQI\n" 3 "ldist" } } */ > + > +#include > + > +typedef __SIZE_TYPE__ size_t; > +extern void* malloc (size_t); > +extern void* memset (void*, int, size_t); > + > +__attribute__((noinline)) > +int test_pos (char *s) > +{ > + int i; > + for (i=42; s[i]; ++i); > + return i; > +} > + > +__attribute__((noinline)) > +int test_neg (char *s) > +{ > + int i; > + for (i=-42; s[i]; ++i); > + return i; > +} > + > +__attribute__((noinline)) > +int test_including_null_char (char *s) > +{ > + int i; > + for (i=1; s[i-1]; ++i); > + return i; > +} > + > +int main(void) > +{ > + void *p = malloc (1024); > + assert (p); > + memset (p, 0xf, 1024); > + char *s = (char *)p + 100; > + > + s[42+13] = 0; > + assert (test_pos (s) == 42+13); > + s[42+13] = 0xf; > + > + s[13] = 0; > + assert (test_neg (s) == 13); > + s[13] = 0xf; > + > + s[-13] = 0; > + assert (test_neg (s) == -13); > + s[-13] = 0xf; > + > + s[13] = 0; > + assert (test_including_null_char (s) == 13+1); > + > + return 0; > +} > diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c > index 65aa1df4aba..6bd4bc8588a 100644 > --- a/gcc/tree-loop-distribution.c > +++ b/gcc/tree-loop-distribution.c > @@ -116,6 +116,10 @@ along with GCC; see the file COPYING3. If not see > #include "tree-eh.h" > #include "gimple-fold.h" > #include "tree-affine.h" > +#include "intl.h" > +#include "rtl.h" > +#include "memmodel.h" > +#include "optabs.h" > > > #define MAX_DATAREFS_NUM \ > @@ -650,6 +654,10 @@ class loop_distribution > control_dependences *cd, int *nb_calls, bool *destroy_p, > bool only_patterns_p); > > + /* Transform loops which mimic the effects of builtins rawmemchr or strlen and > + replace them accordingly. */ > + bool transform_reduction_loop (loop_p loop); > + > /* Compute topological order for basic blocks. Topological order is > needed because data dependence is computed for data references in > lexicographical order. */ > @@ -1490,14 +1498,14 @@ loop_distribution::build_rdg_partition_for_vertex (struct graph *rdg, int v) > data references. 
*/ > > static bool > -find_single_drs (class loop *loop, struct graph *rdg, partition *partition, > +find_single_drs (class loop *loop, struct graph *rdg, const bitmap &partition_stmts, > data_reference_p *dst_dr, data_reference_p *src_dr) > { > unsigned i; > data_reference_p single_ld = NULL, single_st = NULL; > bitmap_iterator bi; > > - EXECUTE_IF_SET_IN_BITMAP (partition->stmts, 0, i, bi) > + EXECUTE_IF_SET_IN_BITMAP (partition_stmts, 0, i, bi) > { > gimple *stmt = RDG_STMT (rdg, i); > data_reference_p dr; > @@ -1538,44 +1546,47 @@ find_single_drs (class loop *loop, struct graph *rdg, partition *partition, > } > } > > - if (!single_st) > - return false; > - > - /* Bail out if this is a bitfield memory reference. */ > - if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF > - && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1))) > + if (!single_ld && !single_st) > return false; > > - /* Data reference must be executed exactly once per iteration of each > - loop in the loop nest. We only need to check dominance information > - against the outermost one in a perfect loop nest because a bb can't > - dominate outermost loop's latch without dominating inner loop's. */ > - basic_block bb_st = gimple_bb (DR_STMT (single_st)); > - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st)) > - return false; > + basic_block bb_ld = NULL; > + basic_block bb_st = NULL; > > if (single_ld) > { > - gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld); > - /* Direct aggregate copy or via an SSA name temporary. */ > - if (load != store > - && gimple_assign_lhs (load) != gimple_assign_rhs1 (store)) > - return false; > - > /* Bail out if this is a bitfield memory reference. */ > if (TREE_CODE (DR_REF (single_ld)) == COMPONENT_REF > && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_ld), 1))) > return false; > > - /* Load and store must be in the same loop nest. */ > - basic_block bb_ld = gimple_bb (DR_STMT (single_ld)); > - if (bb_st->loop_father != bb_ld->loop_father) > + /* Data reference must be executed exactly once per iteration of each > + loop in the loop nest. We only need to check dominance information > + against the outermost one in a perfect loop nest because a bb can't > + dominate outermost loop's latch without dominating inner loop's. */ > + bb_ld = gimple_bb (DR_STMT (single_ld)); > + if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld)) > + return false; > + } > + > + if (single_st) > + { > + /* Bail out if this is a bitfield memory reference. */ > + if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF > + && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1))) > return false; > > /* Data reference must be executed exactly once per iteration. > - Same as single_st, we only need to check against the outermost > + Same as single_ld, we only need to check against the outermost > loop. */ > - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld)) > + bb_st = gimple_bb (DR_STMT (single_st)); > + if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st)) > + return false; > + } > + > + if (single_ld && single_st) > + { > + /* Load and store must be in the same loop nest. */ > + if (bb_st->loop_father != bb_ld->loop_father) > return false; > > edge e = single_exit (bb_st->loop_father); > @@ -1850,9 +1861,19 @@ loop_distribution::classify_partition (loop_p loop, > return has_reduction; > > /* Find single load/store data references for builtin partition. 
*/ > - if (!find_single_drs (loop, rdg, partition, &single_st, &single_ld)) > + if (!find_single_drs (loop, rdg, partition->stmts, &single_st, &single_ld) > + || !single_st) > return has_reduction; > > + if (single_ld && single_st) > + { > + gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld); > + /* Direct aggregate copy or via an SSA name temporary. */ > + if (load != store > + && gimple_assign_lhs (load) != gimple_assign_rhs1 (store)) > + return has_reduction; > + } > + > partition->loc = gimple_location (DR_STMT (single_st)); > > /* Classify the builtin kind. */ > @@ -3257,6 +3278,379 @@ find_seed_stmts_for_distribution (class loop *loop, vec *work_list) > return work_list->length () > 0; > } > > +/* A helper function for generate_{rawmemchr,strlen}_builtin functions in order > + to place new statements SEQ before LOOP and replace the old reduction > + variable with the new one. */ > + > +static void > +generate_reduction_builtin_1 (loop_p loop, gimple_seq &seq, > + tree reduction_var_old, tree reduction_var_new, > + const char *info, machine_mode load_mode) > +{ > + /* Place new statements before LOOP. */ > + gimple_stmt_iterator gsi = gsi_last_bb (loop_preheader_edge (loop)->src); > + gsi_insert_seq_after (&gsi, seq, GSI_CONTINUE_LINKING); > + > + /* Replace old reduction variable with new one. */ > + imm_use_iterator iter; > + gimple *stmt; > + use_operand_p use_p; > + FOR_EACH_IMM_USE_STMT (stmt, iter, reduction_var_old) > + { > + FOR_EACH_IMM_USE_ON_STMT (use_p, iter) > + SET_USE (use_p, reduction_var_new); > + > + update_stmt (stmt); > + } > + > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, info, GET_MODE_NAME (load_mode)); > +} > + > +/* Generate a call to rawmemchr and place it before LOOP. REDUCTION_VAR is > + replaced with a fresh SSA name representing the result of the call. */ > + > +static void > +generate_rawmemchr_builtin (loop_p loop, tree reduction_var, > + data_reference_p store_dr, tree base, tree pattern, > + location_t loc) > +{ > + gimple_seq seq = NULL; > + > + tree mem = force_gimple_operand (base, &seq, true, NULL_TREE); > + gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, mem, pattern); > + tree reduction_var_new = copy_ssa_name (reduction_var); > + gimple_call_set_lhs (fn_call, reduction_var_new); > + gimple_set_location (fn_call, loc); > + gimple_seq_add_stmt (&seq, fn_call); > + > + if (store_dr) > + { > + gassign *g = gimple_build_assign (DR_REF (store_dr), reduction_var_new); > + gimple_seq_add_stmt (&seq, g); > + } > + > + generate_reduction_builtin_1 (loop, seq, reduction_var, reduction_var_new, > + "generated rawmemchr%s\n", > + TYPE_MODE (TREE_TYPE (TREE_TYPE (base)))); > +} > + > +/* Helper function for generate_strlen_builtin(,_using_rawmemchr) */ > + > +static void > +generate_strlen_builtin_1 (loop_p loop, gimple_seq &seq, > + tree reduction_var_old, tree reduction_var_new, > + machine_mode mode, tree start_len) > +{ > + /* REDUCTION_VAR_NEW has either size type or ptrdiff type and must be > + converted if types of old and new reduction variable are not compatible. */ > + reduction_var_new = gimple_convert (&seq, TREE_TYPE (reduction_var_old), > + reduction_var_new); > + > + /* Loops of the form `for (i=42; s[i]; ++i);` have an additional start > + length. 
*/ > + if (!integer_zerop (start_len)) > + { > + tree lhs = make_ssa_name (TREE_TYPE (reduction_var_new)); > + gimple *g = gimple_build_assign (lhs, PLUS_EXPR, reduction_var_new, > + start_len); > + gimple_seq_add_stmt (&seq, g); > + reduction_var_new = lhs; > + } > + > + generate_reduction_builtin_1 (loop, seq, reduction_var_old, reduction_var_new, > + "generated strlen%s\n", mode); > +} > + > +/* Generate a call to strlen and place it before LOOP. REDUCTION_VAR is > + replaced with a fresh SSA name representing the result of the call. */ > + > +static void > +generate_strlen_builtin (loop_p loop, tree reduction_var, tree base, > + tree start_len, location_t loc) > +{ > + gimple_seq seq = NULL; > + > + tree reduction_var_new = make_ssa_name (size_type_node); > + > + tree mem = force_gimple_operand (base, &seq, true, NULL_TREE); > + tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN)); > + gimple *fn_call = gimple_build_call (fn, 1, mem); > + gimple_call_set_lhs (fn_call, reduction_var_new); > + gimple_set_location (fn_call, loc); > + gimple_seq_add_stmt (&seq, fn_call); > + > + generate_strlen_builtin_1 (loop, seq, reduction_var, reduction_var_new, > + QImode, start_len); > +} > + > +/* Generate code in order to mimic the behaviour of strlen but this time over > + an array of elements with mode different than QI. REDUCTION_VAR is replaced > + with a fresh SSA name representing the result, i.e., the length. */ > + > +static void > +generate_strlen_builtin_using_rawmemchr (loop_p loop, tree reduction_var, > + tree base, tree start_len, > + location_t loc) > +{ > + gimple_seq seq = NULL; > + > + tree start = force_gimple_operand (base, &seq, true, NULL_TREE); > + tree zero = build_zero_cst (TREE_TYPE (TREE_TYPE (start))); > + gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, start, zero); > + tree end = make_ssa_name (TREE_TYPE (base)); > + gimple_call_set_lhs (fn_call, end); > + gimple_set_location (fn_call, loc); > + gimple_seq_add_stmt (&seq, fn_call); > + > + /* Determine the number of elements between START and END by > + evaluating (END - START) / sizeof (*START). */ > + tree diff = make_ssa_name (ptrdiff_type_node); > + gimple *diff_stmt = gimple_build_assign (diff, POINTER_DIFF_EXPR, end, start); > + gimple_seq_add_stmt (&seq, diff_stmt); > + /* Let SIZE be the size of the the pointed-to type of START. */ > + tree size = gimple_convert (&seq, ptrdiff_type_node, > + TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (start)))); > + tree count = make_ssa_name (ptrdiff_type_node); > + gimple *count_stmt = gimple_build_assign (count, TRUNC_DIV_EXPR, diff, size); > + gimple_seq_add_stmt (&seq, count_stmt); > + > + generate_strlen_builtin_1 (loop, seq, reduction_var, count, > + TYPE_MODE (TREE_TYPE (TREE_TYPE (base))), > + start_len); > +} > + > +static bool > +reduction_var_overflows_first (tree reduction_var, tree load_type) > +{ > + unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node); > + unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE (reduction_var)); > + unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT (load_type))); > + return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - size_exponent); > +} > + > +/* Transform loops which mimic the effects of builtins rawmemchr or strlen and > + replace them accordingly. For example, a loop of the form > + > + for (; *p != 42; ++p); > + > + is replaced by > + > + p = rawmemchr (p, 42); > + > + under the assumption that rawmemchr is available for a particular MODE. 
> + Another example is > + > + int i; > + for (i = 42; s[i]; ++i); > + > + which is replaced by > + > + i = (int)strlen (&s[42]) + 42; > + > + for some character array S. In case array S is not of type character array > + we end up with > + > + i = (int)(rawmemchr (&s[42], 0) - &s[42]) + 42; > + > + assuming that rawmemchr is available for a particular MODE. */ > + > +bool > +loop_distribution::transform_reduction_loop (loop_p loop) > +{ > + gimple *reduction_stmt = NULL; > + data_reference_p load_dr = NULL, store_dr = NULL; > + > + edge e = single_exit (loop); > + gcond *cond = safe_dyn_cast (last_stmt (e->src)); > + if (!cond) > + return false; > + /* Ensure loop condition is an (in)equality test and loop is exited either if > + the inequality test fails or the equality test succeeds. */ > + if (!(e->flags & EDGE_FALSE_VALUE && gimple_cond_code (cond) == NE_EXPR) > + && !(e->flags & EDGE_TRUE_VALUE && gimple_cond_code (cond) == EQ_EXPR)) > + return false; > + /* A limitation of the current implementation is that we only support > + constant patterns in (in)equality tests. */ > + tree pattern = gimple_cond_rhs (cond); > + if (TREE_CODE (pattern) != INTEGER_CST) > + return false; > + > + basic_block *bbs = get_loop_body (loop); > + > + for (unsigned i = 0, ninsns = 0; i < loop->num_nodes; ++i) > + { > + basic_block bb = bbs[i]; > + > + for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi); > + gsi_next_nondebug (&bsi)) > + { > + gphi *phi = bsi.phi (); > + if (virtual_operand_p (gimple_phi_result (phi))) > + continue; > + if (stmt_has_scalar_dependences_outside_loop (loop, phi)) > + { > + if (reduction_stmt) > + return false; > + reduction_stmt = phi; > + } > + } > + > + for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); > + gsi_next_nondebug (&bsi), ++ninsns) > + { > + /* Bail out early for loops which are unlikely to match. */ > + if (ninsns > 16) > + return false; > + gimple *stmt = gsi_stmt (bsi); > + if (gimple_clobber_p (stmt)) > + continue; > + if (gimple_code (stmt) == GIMPLE_LABEL) > + continue; > + if (gimple_has_volatile_ops (stmt)) > + return false; > + if (stmt_has_scalar_dependences_outside_loop (loop, stmt)) > + { > + if (reduction_stmt) > + return false; > + reduction_stmt = stmt; > + } > + } > + } > + > + /* A limitation of the current implementation is that we require a reduction > + statement. Therefore, loops without a reduction statement as in the > + following are not recognized: > + int *p; > + void foo (void) { for (; *p; ++p); } */ > + if (reduction_stmt == NULL) > + return false; > + > + /* Reduction variables are guaranteed to be SSA names. */ > + tree reduction_var; > + switch (gimple_code (reduction_stmt)) > + { > + case GIMPLE_ASSIGN: > + case GIMPLE_PHI: > + reduction_var = gimple_get_lhs (reduction_stmt); > + break; > + default: > + /* Bail out e.g. for GIMPLE_CALL. */ > + return false; > + } > + > + struct graph *rdg = build_rdg (loop, NULL); > + if (rdg == NULL) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, > + "Loop %d not transformed: failed to build the RDG.\n", > + loop->num); > + > + return false; > + } > + auto_bitmap partition_stmts; > + bitmap_set_range (partition_stmts, 0, rdg->n_vertices); > + find_single_drs (loop, rdg, partition_stmts, &store_dr, &load_dr); > + free_rdg (rdg); > + > + /* Bail out if there is no single load. 
*/ > + if (load_dr == NULL) > + return false; > + > + /* Reaching this point we have a loop with a single reduction variable, > + a single load, and an optional single store. */ > + > + tree load_ref = DR_REF (load_dr); > + tree load_type = TREE_TYPE (load_ref); > + tree load_access_base = build_fold_addr_expr (load_ref); > + tree load_access_size = TYPE_SIZE_UNIT (load_type); > + affine_iv load_iv, reduction_iv; > + > + if (!INTEGRAL_TYPE_P (load_type) > + || !type_has_mode_precision_p (load_type)) > + return false; > + > + /* We already ensured that the loop condition tests for (in)equality where the > + rhs is a constant pattern. Now ensure that the lhs is the result of the > + load. */ > + if (gimple_cond_lhs (cond) != gimple_assign_lhs (DR_STMT (load_dr))) > + return false; > + > + /* Bail out if no affine induction variable with constant step can be > + determined. */ > + if (!simple_iv (loop, loop, load_access_base, &load_iv, false)) > + return false; > + > + /* Bail out if memory accesses are not consecutive or not growing. */ > + if (!operand_equal_p (load_iv.step, load_access_size, 0)) > + return false; > + > + if (!simple_iv (loop, loop, reduction_var, &reduction_iv, false)) > + return false; > + > + /* Handle rawmemchr like loops. */ > + if (operand_equal_p (load_iv.base, reduction_iv.base) > + && operand_equal_p (load_iv.step, reduction_iv.step)) > + { > + if (store_dr) > + { > + /* Ensure that we store to X and load from X+I where I>0. */ > + if (TREE_CODE (load_iv.base) != POINTER_PLUS_EXPR > + || !integer_onep (TREE_OPERAND (load_iv.base, 1))) > + return false; > + tree ptr_base = TREE_OPERAND (load_iv.base, 0); > + if (TREE_CODE (ptr_base) != SSA_NAME) > + return false; > + gimple *def = SSA_NAME_DEF_STMT (ptr_base); > + if (!gimple_assign_single_p (def) > + || gimple_assign_rhs1 (def) != DR_REF (store_dr)) > + return false; > + /* Ensure that the reduction value is stored. */ > + if (gimple_assign_rhs1 (DR_STMT (store_dr)) != reduction_var) > + return false; > + } > + /* Bail out if target does not provide rawmemchr for a certain mode. */ > + machine_mode mode = TYPE_MODE (load_type); > + if (direct_optab_handler (rawmemchr_optab, mode) == CODE_FOR_nothing) > + return false; > + location_t loc = gimple_location (DR_STMT (load_dr)); > + generate_rawmemchr_builtin (loop, reduction_var, store_dr, load_iv.base, > + pattern, loc); > + return true; > + } > + > + /* Handle strlen like loops. 
*/ > + if (store_dr == NULL > + && integer_zerop (pattern) > + && TREE_CODE (reduction_iv.base) == INTEGER_CST > + && TREE_CODE (reduction_iv.step) == INTEGER_CST > + && integer_onep (reduction_iv.step)) > + { > + location_t loc = gimple_location (DR_STMT (load_dr)); > + if (TYPE_MODE (load_type) == TYPE_MODE (char_type_node) > + && TYPE_PRECISION (load_type) == TYPE_PRECISION (char_type_node) > + && (TYPE_PRECISION (size_type_node) == 64 > + || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)) > + && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node)))) > + generate_strlen_builtin (loop, reduction_var, load_iv.base, > + reduction_iv.base, loc); > + else if (direct_optab_handler (rawmemchr_optab, TYPE_MODE (load_type)) > + != CODE_FOR_nothing > + && (TYPE_PRECISION (ptrdiff_type_node) == 64 > + || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)) > + && reduction_var_overflows_first (reduction_var, load_type)))) > + generate_strlen_builtin_using_rawmemchr (loop, reduction_var, > + load_iv.base, > + reduction_iv.base, loc); > + else > + return false; > + return true; > + } > + > + return false; > +} > + > /* Given innermost LOOP, return the outermost enclosing loop that forms a > perfect loop nest. */ > > @@ -3321,10 +3715,27 @@ loop_distribution::execute (function *fun) > && !optimize_loop_for_speed_p (loop))) > continue; > > - /* Don't distribute loop if niters is unknown. */ > + /* If niters is unknown don't distribute loop but rather try to transform > + it to a call to a builtin. */ > tree niters = number_of_latch_executions (loop); > if (niters == NULL_TREE || niters == chrec_dont_know) > - continue; > + { > + datarefs_vec.create (20); > + if (transform_reduction_loop (loop)) > + { > + changed = true; > + loops_to_be_destroyed.safe_push (loop); > + if (dump_enabled_p ()) > + { > + dump_user_location_t loc = find_loop_location (loop); > + dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, > + loc, "Loop %d transformed into a builtin.\n", > + loop->num); > + } > + } > + free_data_refs (datarefs_vec); > + continue; > + } > > /* Get the perfect loop nest for distribution. */ > loop = prepare_perfect_loop_nest (loop);