From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 20 May 2021 20:37:23 +0200
From: Stefan Schulze Frielinghaus
To: Richard Biener
Cc: GCC Patches
Subject: Re: [RFC] ldist: Recognize rawmemchr loop patterns
References: <20210208124716.487709-1-stefansf@linux.ibm.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="T3p6VPitb2u5EUVs"
Content-Disposition: inline

--T3p6VPitb2u5EUVs
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Thu, May 20, 2021 at 11:24:57AM +0200, Richard Biener wrote:
> On Fri, May 7, 2021 at 2:32 PM Stefan Schulze Frielinghaus
> wrote:
> >
> > On Wed, May 05, 2021 at 11:36:41AM +0200, Richard Biener wrote:
> > > On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus
> > > wrote:
> > > >
> > > > [snip]
> > > >
> > > > Please find attached a new version of the patch.  A major change
> > > > compared to the previous patch is that I created a separate pass,
> > > > which hopefully also makes reviewing easier since it is almost
> > > > self-contained.  After realizing that detecting loops which mimic
> > > > the behavior of the rawmemchr/strlen functions does not really fit
> > > > into the topic of loop distribution, I created a separate pass.
> > >
> > > It's true that these reduction-like patterns are more difficult than
> > > the existing memcpy/memset cases.
> > >
> > > > Due to this I was also able to play around a bit and schedule the
> > > > pass at different times.  Currently it is scheduled right before
> > > > loop distribution, where loop header copying has already taken
> > > > place, which leads to the following effect.  Running this setup
> > > > over
> > > >
> > > > char *t (char *p)
> > > > {
> > > >   for (; *p; ++p);
> > > >   return p;
> > > > }
> > > >
> > > > the new pass transforms
> > > >
> > > > char * t (char * p)
> > > > {
> > > >   char _1;
> > > >   char _7;
> > > >
> > > >   [local count: 118111600]:
> > > >   _7 = *p_3(D);
> > > >   if (_7 != 0)
> > > >     goto ; [89.00%]
> > > >   else
> > > >     goto ; [11.00%]
> > > >
> > > >   [local count: 105119324]:
> > > >
> > > >   [local count: 955630225]:
> > > >   # p_8 = PHI
> > > >   p_6 = p_8 + 1;
> > > >   _1 = *p_6;
> > > >   if (_1 != 0)
> > > >     goto ; [89.00%]
> > > >   else
> > > >     goto ; [11.00%]
> > > >
> > > >   [local count: 105119324]:
> > > >   # p_2 = PHI
> > > >   goto ; [100.00%]
> > > >
> > > >   [local count: 850510901]:
> > > >   goto ; [100.00%]
> > > >
> > > >   [local count: 12992276]:
> > > >
> > > >   [local count: 118111600]:
> > > >   # p_9 = PHI
> > > >   return p_9;
> > > >
> > > > }
> > > >
> > > > into
> > > >
> > > > char * t (char * p)
> > > > {
> > > >   char * _5;
> > > >   char _7;
> > > >
> > > >   [local count: 118111600]:
> > > >   _7 = *p_3(D);
> > > >   if (_7 != 0)
> > > >     goto ; [89.00%]
> > > >   else
> > > >     goto ; [11.00%]
> > > >
> > > >   [local count: 105119324]:
> > > >   _5 = p_3(D) + 1;
> > > >   p_10 = .RAWMEMCHR (_5, 0);
> > > >
> > > >   [local count: 118111600]:
> > > >   # p_9 = PHI
> > > >   return p_9;
> > > >
> > > > }
> > > >
> > > > which is fine so far.  However, I haven't made up my mind yet
> > > > whether it is worthwhile to spend more time in order to also
> > > > eliminate the "first unrolling" of the loop.
> > >
> > > Might be a phiopt transform ;)  Might apply to quite some set of
> > > builtins.  I wonder what the strlen case looks like, though.
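
(For reference, the semantics assumed for the new IFN throughout this
thread are those of a glibc-style rawmemchr, generalized to wider element
types.  A minimal C sketch -- rawmemchr16 is a made-up name for
illustration, not part of the patch:)

  #include <stdint.h>

  /* Scan forward from S for the first element equal to PATTERN and return
     its address.  Unlike memchr there is no length bound; the caller
     guarantees that PATTERN occurs, otherwise behavior is undefined.  */
  static uint16_t *
  rawmemchr16 (const uint16_t *s, uint16_t pattern)
  {
    while (*s != pattern)
      ++s;
    return (uint16_t *) s;
  }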
> > > >
> > > > I gave it a shot by scheduling the pass prior to the loop header
> > > > copying pass and ended up with:
> > > >
> > > > char * t (char * p)
> > > > {
> > > >   [local count: 118111600]:
> > > >   p_5 = .RAWMEMCHR (p_3(D), 0);
> > > >   return p_5;
> > > >
> > > > }
> > > >
> > > > which seems optimal to me.  The downside of this is that I have to
> > > > initialize scalar evolution analysis, which might be undesired that
> > > > early.
> > > >
> > > > All this brings me to the question: where do you see this piece of
> > > > code running?  If in a separate pass, when would you schedule it?
> > > > If in an existing pass, which one would you choose?
> > >
> > > I think it still fits loop distribution.  If you manage to detect it
> > > with your pass standalone then you should be able to detect it in
> > > loop distribution.
> >
> > If a loop is distributed only because one of the partitions matches a
> > rawmemchr/strlen-like loop pattern, then we have at least two
> > partitions which walk over the same memory region.  Since a
> > rawmemchr/strlen-like loop has no body (neglecting expression-3 of a
> > for-loop, where just an increment happens), it is governed by the
> > memory accesses in the loop condition.  Therefore, in such a case loop
> > distribution would result in performance degradation.  This is why I
> > think that it does not fit conceptually into the ldist pass.  However,
> > since I make use of a couple of helper functions from the ldist pass,
> > it may still fit technically.
> >
> > Since currently all ldist optimizations operate over loops where
> > niters is known, and for rawmemchr/strlen-like loops this is not the
> > case, it is not possible that those optimizations expose a loop which
> > is suitable for the rawmemchr/strlen optimization.
>
> True - though that seems to be an unnecessary restriction.
>
> > Therefore, what do you think about scheduling the rawmemchr/strlen
> > optimization right between those if-statements of function
> > loop_distribution::execute?
> >
> >   if (nb_generated_loops + nb_generated_calls > 0)
> >     {
> >       changed = true;
> >       if (dump_enabled_p ())
> >         dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
> >                          loc, "Loop%s %d distributed: split to %d loops "
> >                          "and %d library calls.\n", str, loop->num,
> >                          nb_generated_loops, nb_generated_calls);
> >
> >       break;
> >     }
> >
> >   // rawmemchr/strlen like loops
> >
> >   if (dump_file && (dump_flags & TDF_DETAILS))
> >     fprintf (dump_file, "Loop%s %d not distributed.\n", str, loop->num);
>
> but we won't ever arrive here because of the niters condition.  But
> yes, doing the pattern matching in the innermost loop processing code
> looks good to me - for the specific case it would be
>
>   /* Don't distribute loop if niters is unknown.  */
>   tree niters = number_of_latch_executions (loop);
>   if (niters == NULL_TREE || niters == chrec_dont_know)
> ---> here?
>     continue;

Right, please find attached a new version of the patch where everything is
included in the loop distribution pass.  I will do a bootstrap and regtest
on IBM Z overnight.  If you give me the green light, I will also do the
same on x86_64.

> > > Can you explain what part is "easier" as standalone pass?
> >
> > Yeah, that term is rather misleading.  It was probably easier for me
> > to understand the underlying problem and to play around with my code.
> > There are no technical reasons for a standalone pass.
>
> And sorry for the late response...
>
> Richard.
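
(Concretely, the early exit quoted above becomes the entry point for the
new pattern matching.  Roughly, as a sketch of the corresponding hunk at
the end of the attached diff -- the dump-file output of the actual diff is
omitted here:)

  /* If niters is unknown, don't distribute the loop but instead try to
     transform it into a call to a builtin.  */
  tree niters = number_of_latch_executions (loop);
  if (niters == NULL_TREE || niters == chrec_dont_know)
    {
      if (transform_reduction_loop (loop))
        {
          changed = true;
          loops_to_be_destroyed.safe_push (loop);
        }
      continue;
    }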
> > Cheers,
> > Stefan
> >
> > > > Another topic which came up is whether there exists a more elegant
> > > > solution to my current implementation in order to deal with stores
> > > > (I'm speaking of the `if (store_dr)` statement inside of function
> > > > transform_loop_1).  For example,
> > > >
> > > > extern char *p;
> > > > char *t ()
> > > > {
> > > >   for (; *p; ++p);
> > > >   return p;
> > > > }
> > > >
> > > > ends up as
> > > >
> > > > char * t ()
> > > > {
> > > >   char * _1;
> > > >   char * _2;
> > > >   char _3;
> > > >   char * p.1_8;
> > > >   char _9;
> > > >   char * p.1_10;
> > > >   char * p.1_11;
> > > >
> > > >   [local count: 118111600]:
> > > >   p.1_8 = p;
> > > >   _9 = *p.1_8;
> > > >   if (_9 != 0)
> > > >     goto ; [89.00%]
> > > >   else
> > > >     goto ; [11.00%]
> > > >
> > > >   [local count: 105119324]:
> > > >
> > > >   [local count: 955630225]:
> > > >   # p.1_10 = PHI <_1(6), p.1_8(5)>
> > > >   _1 = p.1_10 + 1;
> > > >   p = _1;
> > > >   _3 = *_1;
> > > >   if (_3 != 0)
> > > >     goto ; [89.00%]
> > > >   else
> > > >     goto ; [11.00%]
> > > >
> > > >   [local count: 105119324]:
> > > >   # _2 = PHI <_1(3)>
> > > >   goto ; [100.00%]
> > > >
> > > >   [local count: 850510901]:
> > > >   goto ; [100.00%]
> > > >
> > > >   [local count: 12992276]:
> > > >
> > > >   [local count: 118111600]:
> > > >   # p.1_11 = PHI <_2(8), p.1_8(7)>
> > > >   return p.1_11;
> > > >
> > > > }
> > > >
> > > > where inside the loop a load and a store occur.  For a
> > > > rawmemchr-like loop I have to show that we never load from a memory
> > > > location to which we write.  Currently I solve this by hard-coding
> > > > those facts, which is not generic at all.  I gave
> > > > compute_data_dependences_for_loop a try, which failed to determine
> > > > the fact that stores only happen to p[0] and loads from p[i] where
> > > > i>0.  Maybe there are more generic solutions to express this in
> > > > contrast to my current one?
> > >
> > > So the example loop is not valid to be transformed to rawmemchr since
> > > it's valid to call it with p = &p; - but sure, once you pass the
> > > first *p != 0 check things become fishy, but GCC isn't able to turn
> > > that into a non-dependence.
> > >
> > > Why is the case of stores inside the loop important?  The fact that
> > > you think it is makes a case for integrating this with loop
> > > distribution, since loop distribution would be able to prove (if
> > > possible) that the stores can be separated into a different loop.
> > >
> > > And sorry for the delay in answering ...
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > Thanks again for your input so far.  Really appreciated.
> > > >
> > > > Cheers,
> > > > Stefan
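
(To make the p = &p hazard concrete: a C sketch, not from the patch, of a
caller that makes the store in the loop alias the scanned memory itself,
which is why the transform has to prove the store/load non-dependence:)

  extern char *p;

  char *t (void)
  {
    /* Each iteration stores the incremented pointer back to p and then
       reloads through it.  */
    for (; *p; ++p);
    return p;
  }

  void self_alias (void)
  {
    /* Legal, if obscure: now p points into its own storage, so the store
       to p inside t() rewrites the very bytes the loop is scanning, and a
       rawmemchr over the original buffer would read stale data.  */
    p = (char *) &p;
    t ();
  }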

--T3p6VPitb2u5EUVs
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="rawmemchr.diff"

diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index d209a52f823..b39a0172f01 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -2929,6 +2929,33 @@ expand_VEC_CONVERT (internal_fn, gcall *)
   gcc_unreachable ();
 }
 
+void
+expand_RAWMEMCHR (internal_fn, gcall *stmt)
+{
+  expand_operand ops[3];
+
+  tree lhs = gimple_call_lhs (stmt);
+  if (!lhs)
+    return;
+  tree lhs_type = TREE_TYPE (lhs);
+  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (lhs_type));
+
+  for (unsigned int i = 0; i < 2; ++i)
+    {
+      tree rhs = gimple_call_arg (stmt, i);
+      tree rhs_type = TREE_TYPE (rhs);
+      rtx rhs_rtx = expand_normal (rhs);
+      create_input_operand (&ops[i + 1], rhs_rtx, TYPE_MODE (rhs_type));
+    }
+
+  insn_code icode = direct_optab_handler (rawmemchr_optab, ops[2].mode);
+
+  expand_insn (icode, 3, ops);
+  if (!rtx_equal_p (lhs_rtx, ops[0].value))
+    emit_move_insn (lhs_rtx, ops[0].value);
+}
+
 /* Expand the IFN_UNIQUE function according to its first argument.  */
 
 static void
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index daeace7a34e..95c76795648 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -348,6 +348,7 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (RAWMEMCHR, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
 
 /* An unduplicable, uncombinable function.  Generally used to preserve
    a CFG property in the face of jump threading, tail merging or
diff --git a/gcc/optabs.def b/gcc/optabs.def
index b192a9d070b..f7c69f914ce 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -267,6 +267,7 @@ OPTAB_D (cpymem_optab, "cpymem$a")
 OPTAB_D (movmem_optab, "movmem$a")
 OPTAB_D (setmem_optab, "setmem$a")
 OPTAB_D (strlen_optab, "strlen$a")
+OPTAB_D (rawmemchr_optab, "rawmemchr$I$a")
 
 OPTAB_DC(fma_optab, "fma$a4", FMA)
 OPTAB_D (fms_optab, "fms$a4")
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
new file mode 100644
index 00000000000..e998dd16b29
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
@@ -0,0 +1,72 @@
+/* { dg-do run { target s390x-*-* } } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrqi" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrhi" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrsi" 2 "ldist" { target s390x-*-* } } } */
+
+/* Rawmemchr pattern: reduction stmt but no store */
+
+#include <stdint.h>
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+#define test(T, pattern)   \
+__attribute__((noinline))  \
+T *test_##T (T *p)         \
+{                          \
+  while (*p != (T)pattern) \
+    ++p;                   \
+  return p;                \
+}
+
+test (uint8_t, 0xab)
+test (uint16_t, 0xabcd)
+test (uint32_t, 0xabcdef15)
+
+test (int8_t, 0xab)
+test (int16_t, 0xabcd)
+test (int32_t, 0xabcdef15)
+
+#define run(T, pattern, i)      \
+{                               \
+T *q = p;                       \
+q[i] = (T)pattern;              \
+assert (test_##T (p) == &q[i]); \
+q[i] = 0;                       \
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, 0, 1024);
+
+  run (uint8_t, 0xab, 0);
+  run (uint8_t, 0xab, 1);
+  run (uint8_t, 0xab, 13);
+
+  run (uint16_t, 0xabcd, 0);
+  run (uint16_t, 0xabcd, 1);
+  run (uint16_t, 0xabcd, 13);
+
+  run (uint32_t, 0xabcdef15, 0);
+  run (uint32_t, 0xabcdef15, 1);
+  run (uint32_t, 0xabcdef15, 13);
+
+  run (int8_t, 0xab, 0);
+  run (int8_t, 0xab, 1);
+  run (int8_t, 0xab, 13);
+
+  run (int16_t, 0xabcd, 0);
+  run (int16_t, 0xabcd, 1);
+  run (int16_t, 0xabcd, 13);
+
+  run (int32_t, 0xabcdef15, 0);
+  run (int32_t, 0xabcdef15, 1);
+  run (int32_t, 0xabcdef15, 13);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
new file mode 100644
index 00000000000..046450ea7e8
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
@@ -0,0 +1,83 @@
+/* { dg-do run { target s390x-*-* } } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrqi" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrhi" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrsi" 2 "ldist" { target s390x-*-* } } } */
+
+/* Rawmemchr pattern: reduction stmt and store */
+
+#include <stdint.h>
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+uint8_t *p_uint8_t;
+uint16_t *p_uint16_t;
+uint32_t *p_uint32_t;
+
+int8_t *p_int8_t;
+int16_t *p_int16_t;
+int32_t *p_int32_t;
+
+#define test(T, pattern)    \
+__attribute__((noinline))   \
+T *test_##T (void)          \
+{                           \
+  while (*p_##T != pattern) \
+    ++p_##T;                \
+  return p_##T;             \
+}
+
+test (uint8_t, 0xab)
+test (uint16_t, 0xabcd)
+test (uint32_t, 0xabcdef15)
+
+test (int8_t, (int8_t)0xab)
+test (int16_t, (int16_t)0xabcd)
+test (int32_t, (int32_t)0xabcdef15)
+
+#define run(T, pattern, i) \
+{                          \
+T *q = p;                  \
+q[i] = pattern;            \
+p_##T = p;                 \
+T *r = test_##T ();        \
+assert (r == p_##T);       \
+assert (r == &q[i]);       \
+q[i] = 0;                  \
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, '\0', 1024);
+
+  run (uint8_t, 0xab, 0);
+  run (uint8_t, 0xab, 1);
+  run (uint8_t, 0xab, 13);
+
+  run (uint16_t, 0xabcd, 0);
+  run (uint16_t, 0xabcd, 1);
+  run (uint16_t, 0xabcd, 13);
+
+  run (uint32_t, 0xabcdef15, 0);
+  run (uint32_t, 0xabcdef15, 1);
+  run (uint32_t, 0xabcdef15, 13);
+
+  run (int8_t, (int8_t)0xab, 0);
+  run (int8_t, (int8_t)0xab, 1);
+  run (int8_t, (int8_t)0xab, 13);
+
+  run (int16_t, (int16_t)0xabcd, 0);
+  run (int16_t, (int16_t)0xabcd, 1);
+  run (int16_t, (int16_t)0xabcd, 13);
+
+  run (int32_t, (int32_t)0xabcdef15, 0);
+  run (int32_t, (int32_t)0xabcdef15, 1);
+  run (int32_t, (int32_t)0xabcdef15, 13);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c
new file mode 100644
index 00000000000..c88d1db0a93
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c
@@ -0,0 +1,100 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated strlen\n" 4 "ldist" } } */
+/* { dg-final { scan-tree-dump-times "generated strlen using rawmemchrhi\n" 4 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated strlen using rawmemchrsi\n" 4 "ldist" { target s390x-*-* } } } */
+
+#include <stdint.h>
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+#define test(T, U)        \
+__attribute__((noinline)) \
+U test_##T##U (T *s)      \
+{                         \
+  U i;                    \
+  for (i=0; s[i]; ++i);   \
+  return i;               \
+}
+
+test (uint8_t, size_t)
+test (uint16_t, size_t)
+test (uint32_t, size_t)
+test (uint8_t, int)
+test (uint16_t, int)
+test (uint32_t, int)
+
+test (int8_t, size_t)
+test (int16_t, size_t)
+test (int32_t, size_t)
+test (int8_t, int)
+test (int16_t, int)
+test (int32_t, int)
+
+#define run(T, U, i)             \
+{                                \
+T *q = p;                        \
+q[i] = 0;                        \
+assert (test_##T##U (p) == i);   \
+memset (&q[i], 0xf, sizeof (T)); \
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, 0xf, 1024);
+
+  run (uint8_t, size_t, 0);
+  run (uint8_t, size_t, 1);
+  run (uint8_t, size_t, 13);
+
+  run (int8_t, size_t, 0);
+  run (int8_t, size_t, 1);
+  run (int8_t, size_t, 13);
+
+  run (uint8_t, int, 0);
+  run (uint8_t, int, 1);
+  run (uint8_t, int, 13);
+
+  run (int8_t, int, 0);
+  run (int8_t, int, 1);
+  run (int8_t, int, 13);
+
+  run (uint16_t, size_t, 0);
+  run (uint16_t, size_t, 1);
+  run (uint16_t, size_t, 13);
+
+  run (int16_t, size_t, 0);
+  run (int16_t, size_t, 1);
+  run (int16_t, size_t, 13);
+
+  run (uint16_t, int, 0);
+  run (uint16_t, int, 1);
+  run (uint16_t, int, 13);
+
+  run (int16_t, int, 0);
+  run (int16_t, int, 1);
+  run (int16_t, int, 13);
+
+  run (uint32_t, size_t, 0);
+  run (uint32_t, size_t, 1);
+  run (uint32_t, size_t, 13);
+
+  run (int32_t, size_t, 0);
+  run (int32_t, size_t, 1);
+  run (int32_t, size_t, 13);
+
+  run (uint32_t, int, 0);
+  run (uint32_t, int, 1);
+  run (uint32_t, int, 13);
+
+  run (int32_t, int, 0);
+  run (int32_t, int, 1);
+  run (int32_t, int, 13);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c
new file mode 100644
index 00000000000..cd06e4a27cb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c
@@ -0,0 +1,58 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated strlen\n" 3 "ldist" } } */
+
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+__attribute__((noinline))
+int test_pos (char *s)
+{
+  int i;
+  for (i=42; s[i]; ++i);
+  return i;
+}
+
+__attribute__((noinline))
+int test_neg (char *s)
+{
+  int i;
+  for (i=-42; s[i]; ++i);
+  return i;
+}
+
+__attribute__((noinline))
+int test_including_null_char (char *s)
+{
+  int i;
+  for (i=1; s[i-1]; ++i);
+  return i;
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, 0xf, 1024);
+  char *s = (char *)p + 100;
+
+  s[42+13] = 0;
+  assert (test_pos (s) == 42+13);
+  s[42+13] = 0xf;
+
+  s[13] = 0;
+  assert (test_neg (s) == 13);
+  s[13] = 0xf;
+
+  s[-13] = 0;
+  assert (test_neg (s) == -13);
+  s[-13] = 0xf;
+
+  s[13] = 0;
+  assert (test_including_null_char (s) == 13+1);
+
+  return 0;
+}
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 65aa1df4aba..8c0963b1836 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -116,6 +116,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-eh.h"
 #include "gimple-fold.h"
 #include "tree-affine.h"
+#include "intl.h"
+#include "rtl.h"
+#include "memmodel.h"
+#include "optabs.h"
 
 #define MAX_DATAREFS_NUM \
@@ -3257,6 +3261,464 @@ find_seed_stmts_for_distribution (class loop *loop, vec<gimple *> *work_list)
   return work_list->length () > 0;
 }
 
+static void
+generate_rawmemchr_builtin (loop_p loop, tree reduction_var,
+                            data_reference_p store_dr, tree base, tree pattern,
+                            location_t loc)
+{
+  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base))
+                       && TREE_TYPE (TREE_TYPE (base)) == TREE_TYPE (pattern));
+  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (base));
+
+  /* The new statements will be placed before LOOP.  */
+  gimple_stmt_iterator gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
+
+  tree mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false,
+                                       GSI_CONTINUE_LINKING);
+  gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, mem, pattern);
+  tree reduction_var_new = copy_ssa_name (reduction_var);
+  gimple_call_set_lhs (fn_call, reduction_var_new);
+  gimple_set_location (fn_call, loc);
+  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
+
+  if (store_dr)
+    {
+      gassign *g = gimple_build_assign (DR_REF (store_dr), reduction_var_new);
+      gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING);
+    }
+
+  imm_use_iterator iter;
+  gimple *stmt;
+  use_operand_p use_p;
+  FOR_EACH_IMM_USE_STMT (stmt, iter, reduction_var)
+    {
+      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+        SET_USE (use_p, reduction_var_new);
+
+      update_stmt (stmt);
+    }
+
+  fold_stmt (&gsi);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    switch (TYPE_MODE (TREE_TYPE (pattern)))
+      {
+      case QImode:
+        fprintf (dump_file, "generated rawmemchrqi\n");
+        break;
+
+      case HImode:
+        fprintf (dump_file, "generated rawmemchrhi\n");
+        break;
+
+      case SImode:
+        fprintf (dump_file, "generated rawmemchrsi\n");
+        break;
+
+      default:
+        gcc_unreachable ();
+      }
+}
+
+static void
+generate_strlen_builtin (loop_p loop, tree reduction_var, tree base,
+                         tree start_len, location_t loc)
+{
+  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base)));
+  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (start_len));
+
+  /* The new statements will be placed before LOOP.  */
+  gimple_stmt_iterator gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
+
+  tree reduction_var_new = make_ssa_name (size_type_node);
+
+  tree mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false,
+                                       GSI_CONTINUE_LINKING);
+  tree fn = build_fold_addr_expr (builtin_decl_implicit (BUILT_IN_STRLEN));
+  gimple *fn_call = gimple_build_call (fn, 1, mem);
+  gimple_call_set_lhs (fn_call, reduction_var_new);
+  gimple_set_location (fn_call, loc);
+  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
+
+  /* In case reduction type is not compatible with size type, then
+     conversion is sound even in case an overflow occurs since we previously
+     ensured that for reduction type an overflow is undefined.  */
+  tree convert = fold_convert (TREE_TYPE (reduction_var), reduction_var_new);
+  reduction_var_new = force_gimple_operand_gsi (&gsi, convert, true, NULL_TREE,
+                                                false, GSI_CONTINUE_LINKING);
+
+  /* Loops of the form `for (i=42; s[i]; ++i);` have an additional start
+     length.  */
+  if (!integer_zerop (start_len))
+    {
+      tree fn_result = reduction_var_new;
+      reduction_var_new = make_ssa_name (TREE_TYPE (reduction_var));
+      gimple *add_stmt = gimple_build_assign (reduction_var_new, PLUS_EXPR,
+                                              fn_result, start_len);
+      gsi_insert_after (&gsi, add_stmt, GSI_CONTINUE_LINKING);
+    }
+
+  imm_use_iterator iter;
+  gimple *stmt;
+  use_operand_p use_p;
+  FOR_EACH_IMM_USE_STMT (stmt, iter, reduction_var)
+    {
+      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+        SET_USE (use_p, reduction_var_new);
+
+      update_stmt (stmt);
+    }
+
+  fold_stmt (&gsi);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "generated strlen\n");
+}
+
+static void
+generate_strlen_builtin_using_rawmemchr (loop_p loop, tree reduction_var,
+                                         tree base, tree start_len,
+                                         location_t loc)
+{
+  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base)));
+  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (start_len));
+
+  /* The new statements will be placed before LOOP.  */
+  gimple_stmt_iterator gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
+
+  tree mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false,
+                                       GSI_CONTINUE_LINKING);
+  tree zero = build_zero_cst (TREE_TYPE (TREE_TYPE (mem)));
+  gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, mem, zero);
+  tree end = make_ssa_name (TREE_TYPE (base));
+  gimple_call_set_lhs (fn_call, end);
+  gimple_set_location (fn_call, loc);
+  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
+
+  tree diff = make_ssa_name (ptrdiff_type_node);
+  gimple *diff_stmt = gimple_build_assign (diff, POINTER_DIFF_EXPR, end, mem);
+  gsi_insert_after (&gsi, diff_stmt, GSI_CONTINUE_LINKING);
+
+  tree convert = fold_convert (ptrdiff_type_node,
+                               TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (mem))));
+  tree size = force_gimple_operand_gsi (&gsi, convert, true, NULL_TREE, false,
+                                        GSI_CONTINUE_LINKING);
+
+  tree count = make_ssa_name (ptrdiff_type_node);
+  gimple *count_stmt = gimple_build_assign (count, TRUNC_DIV_EXPR, diff, size);
+  gsi_insert_after (&gsi, count_stmt, GSI_CONTINUE_LINKING);
+
+  convert = fold_convert (TREE_TYPE (reduction_var), count);
+  tree reduction_var_new = force_gimple_operand_gsi (&gsi, convert, true,
+                                                     NULL_TREE, false,
+                                                     GSI_CONTINUE_LINKING);
+
+  /* Loops of the form `for (i=42; s[i]; ++i);` have an additional start
+     length.  */
+  if (!integer_zerop (start_len))
+    {
+      tree lhs = make_ssa_name (TREE_TYPE (reduction_var_new));
+      gimple *g = gimple_build_assign (lhs, PLUS_EXPR, reduction_var_new,
+                                       start_len);
+      gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING);
+      reduction_var_new = lhs;
+    }
+
+  imm_use_iterator iter;
+  gimple *stmt;
+  use_operand_p use_p;
+  FOR_EACH_IMM_USE_STMT (stmt, iter, reduction_var)
+    {
+      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+        SET_USE (use_p, reduction_var_new);
+
+      update_stmt (stmt);
+    }
+
+  fold_stmt (&gsi);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    switch (TYPE_MODE (TREE_TYPE (zero)))
+      {
+      case HImode:
+        fprintf (dump_file, "generated strlen using rawmemchrhi\n");
+        break;
+
+      case SImode:
+        fprintf (dump_file, "generated strlen using rawmemchrsi\n");
+        break;
+
+      default:
+        gcc_unreachable ();
+      }
+}
+
+static bool
+transform_reduction_loop_1 (loop_p loop,
+                            data_reference_p load_dr,
+                            data_reference_p store_dr,
+                            tree reduction_var)
+{
+  tree load_ref = DR_REF (load_dr);
+  tree load_type = TREE_TYPE (load_ref);
+  tree load_access_base = build_fold_addr_expr (load_ref);
+  tree load_access_size = TYPE_SIZE_UNIT (load_type);
+  affine_iv load_iv, reduction_iv;
+  tree pattern;
+
+  /* A limitation of the current implementation is that we only support
+     constant patterns.  */
+  edge e = single_exit (loop);
+  gcond *cond_stmt = safe_dyn_cast <gcond *> (last_stmt (e->src));
+  if (!cond_stmt)
+    return false;
+  pattern = gimple_cond_rhs (cond_stmt);
+  if (gimple_cond_code (cond_stmt) != NE_EXPR
+      || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (DR_STMT (load_dr))
+      || TREE_CODE (pattern) != INTEGER_CST)
+    return false;
+
+  /* Bail out if no affine induction variable with constant step can be
+     determined.  */
+  if (!simple_iv (loop, loop, load_access_base, &load_iv, false))
+    return false;
+
+  /* Bail out if memory accesses are not consecutive or not growing.  */
+  if (!operand_equal_p (load_iv.step, load_access_size, 0))
+    return false;
+
+  if (!INTEGRAL_TYPE_P (load_type)
+      || !type_has_mode_precision_p (load_type))
+    return false;
+
+  if (!simple_iv (loop, loop, reduction_var, &reduction_iv, false))
+    return false;
+
+  /* Handle rawmemchr like loops.  */
+  if (operand_equal_p (load_iv.base, reduction_iv.base)
+      && operand_equal_p (load_iv.step, reduction_iv.step))
+    {
+      if (store_dr)
+        {
+          /* Ensure that we store to X and load from X+I where I>0.  */
+          if (TREE_CODE (load_iv.base) != POINTER_PLUS_EXPR
+              || !integer_onep (TREE_OPERAND (load_iv.base, 1)))
+            return false;
+          tree ptr_base = TREE_OPERAND (load_iv.base, 0);
+          if (TREE_CODE (ptr_base) != SSA_NAME)
+            return false;
+          gimple *def = SSA_NAME_DEF_STMT (ptr_base);
+          if (!gimple_assign_single_p (def)
+              || gimple_assign_rhs1 (def) != DR_REF (store_dr))
+            return false;
+          /* Ensure that the reduction value is stored.  */
+          if (gimple_assign_rhs1 (DR_STMT (store_dr)) != reduction_var)
+            return false;
+        }
+      /* Bail out if target does not provide rawmemchr for a certain mode.  */
+      machine_mode mode = TYPE_MODE (load_type);
+      if (direct_optab_handler (rawmemchr_optab, mode) == CODE_FOR_nothing)
+        return false;
+      location_t loc = gimple_location (DR_STMT (load_dr));
+      generate_rawmemchr_builtin (loop, reduction_var, store_dr, load_iv.base,
+                                  pattern, loc);
+      return true;
+    }
+
+  /* Handle strlen like loops.  */
+  if (store_dr == NULL
+      && integer_zerop (pattern)
+      && TREE_CODE (reduction_iv.base) == INTEGER_CST
+      && TREE_CODE (reduction_iv.step) == INTEGER_CST
+      && integer_onep (reduction_iv.step)
+      && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
+          || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))))
+    {
+      location_t loc = gimple_location (DR_STMT (load_dr));
+      if (TYPE_MODE (load_type) == TYPE_MODE (char_type_node)
+          && TYPE_PRECISION (load_type) == TYPE_PRECISION (char_type_node))
+        generate_strlen_builtin (loop, reduction_var, load_iv.base,
+                                 reduction_iv.base, loc);
+      else if (direct_optab_handler (rawmemchr_optab, TYPE_MODE (load_type))
+               != CODE_FOR_nothing)
+        generate_strlen_builtin_using_rawmemchr (loop, reduction_var,
+                                                 load_iv.base,
+                                                 reduction_iv.base, loc);
+      else
+        return false;
+      if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))
+        {
+          const char *msg = G_("assuming signed overflow does not occur "
+                               "when optimizing strlen like loop");
+          fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC);
+        }
+      return true;
+    }
+
+  return false;
+}
+
+static bool
+transform_reduction_loop (loop_p loop)
+{
+  gimple *reduction_stmt = NULL;
+  data_reference_p load_dr = NULL, store_dr = NULL;
+
+  basic_block *bbs = get_loop_body (loop);
+
+  for (unsigned i = 0, ninsns = 0; i < loop->num_nodes; ++i)
+    {
+      basic_block bb = bbs[i];
+
+      for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
+           gsi_next (&bsi), ++ninsns)
+        {
+          /* Bail out early for loops which are unlikely to match.  */
+          if (ninsns > 16)
+            return false;
+          gphi *phi = bsi.phi ();
+          if (gimple_has_volatile_ops (phi))
+            return false;
+          if (gimple_clobber_p (phi))
+            continue;
+          if (virtual_operand_p (gimple_phi_result (phi)))
+            continue;
+          if (stmt_has_scalar_dependences_outside_loop (loop, phi))
+            {
+              if (reduction_stmt)
+                return false;
+              reduction_stmt = phi;
+            }
+        }
+
+      for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi);
+           gsi_next (&bsi), ++ninsns)
+        {
+          /* Bail out early for loops which are unlikely to match.  */
+          if (ninsns > 16)
+            return false;
+
+          gimple *stmt = gsi_stmt (bsi);
+
+          if (gimple_clobber_p (stmt))
+            continue;
+
+          if (gimple_code (stmt) == GIMPLE_LABEL || is_gimple_debug (stmt))
+            continue;
+
+          if (gimple_has_volatile_ops (stmt))
+            return false;
+
+          if (stmt_has_scalar_dependences_outside_loop (loop, stmt))
+            {
+              if (reduction_stmt)
+                return false;
+              reduction_stmt = stmt;
+            }
+
+          /* Any scalar stmts are ok.  */
+          if (!gimple_vuse (stmt))
+            continue;
+
+          /* Otherwise just regular loads/stores.  */
+          if (!gimple_assign_single_p (stmt))
+            return false;
+
+          auto_vec<data_reference_p> dr_vec;
+          if (!find_data_references_in_stmt (loop, stmt, &dr_vec))
+            return false;
+          data_reference_p dr;
+          unsigned j;
+          FOR_EACH_VEC_ELT (dr_vec, j, dr)
+            {
+              tree type = TREE_TYPE (DR_REF (dr));
+              if (!ADDR_SPACE_GENERIC_P (TYPE_ADDR_SPACE (type)))
+                return false;
+              if (DR_IS_READ (dr))
+                {
+                  if (load_dr != NULL)
+                    return false;
+                  load_dr = dr;
+                }
+              else
+                {
+                  if (store_dr != NULL)
+                    return false;
+                  store_dr = dr;
+                }
+            }
+        }
+    }
+
+  /* A limitation of the current implementation is that we require a reduction
+     statement.  Therefore, loops without a reduction statement as in the
+     following are not recognized:
+     int *p;
+     void foo (void) { for (; *p; ++p); }  */
+  if (load_dr == NULL || reduction_stmt == NULL)
+    return false;
+
+  /* Reduction variables are guaranteed to be SSA names.  */
+  tree reduction_var;
+  switch (gimple_code (reduction_stmt))
+    {
+    case GIMPLE_PHI:
+      reduction_var = gimple_phi_result (reduction_stmt);
+      break;
+    case GIMPLE_ASSIGN:
+      reduction_var = gimple_assign_lhs (reduction_stmt);
+      break;
+    default:
+      /* Bail out e.g. for GIMPLE_CALL.  */
+      return false;
+    }
+  if (reduction_var == NULL)
+    return false;
+
+  /* Bail out if this is a bitfield memory reference.  */
+  if (TREE_CODE (DR_REF (load_dr)) == COMPONENT_REF
+      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (load_dr), 1)))
+    return false;
+
+  /* Data reference must be executed exactly once per iteration of each
+     loop in the loop nest.  We only need to check dominance information
+     against the outermost one in a perfect loop nest because a bb can't
+     dominate outermost loop's latch without dominating inner loop's.  */
+  basic_block load_bb = gimple_bb (DR_STMT (load_dr));
+  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, load_bb))
+    return false;
+
+  if (store_dr)
+    {
+      /* Bail out if this is a bitfield memory reference.  */
+      if (TREE_CODE (DR_REF (store_dr)) == COMPONENT_REF
+          && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (store_dr), 1)))
+        return false;
+
+      /* Data reference must be executed exactly once per iteration of each
+         loop in the loop nest.  We only need to check dominance information
+         against the outermost one in a perfect loop nest because a bb can't
+         dominate outermost loop's latch without dominating inner loop's.  */
+      basic_block store_bb = gimple_bb (DR_STMT (store_dr));
+      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, store_bb))
+        return false;
+
+      /* Load and store must be in the same loop nest.  */
+      if (store_bb->loop_father != load_bb->loop_father)
+        return false;
+
+      edge e = single_exit (store_bb->loop_father);
+      if (!e)
+        return false;
+      bool load_dom = dominated_by_p (CDI_DOMINATORS, e->src, load_bb);
+      bool store_dom = dominated_by_p (CDI_DOMINATORS, e->src, store_bb);
+      if (load_dom != store_dom)
+        return false;
+    }
+
+  return transform_reduction_loop_1 (loop, load_dr, store_dr, reduction_var);
+}
+
 /* Given innermost LOOP, return the outermost enclosing loop that forms a
    perfect loop nest.  */
@@ -3321,10 +3783,20 @@ loop_distribution::execute (function *fun)
               && !optimize_loop_for_speed_p (loop)))
         continue;
 
-      /* Don't distribute loop if niters is unknown.  */
+      /* If niters is unknown don't distribute loop but rather try to
+         transform it to a call to a builtin.  */
       tree niters = number_of_latch_executions (loop);
       if (niters == NULL_TREE || niters == chrec_dont_know)
-       continue;
+       {
+         if (transform_reduction_loop (loop))
+           {
+             changed = true;
+             loops_to_be_destroyed.safe_push (loop);
+             if (dump_file)
+               fprintf (dump_file, "Loop %d transformed into a builtin.\n",
+                        loop->num);
+           }
+         continue;
+       }
 
       /* Get the perfect loop nest for distribution.  */
       loop = prepare_perfect_loop_nest (loop);

--T3p6VPitb2u5EUVs--