From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <stefansf@linux.ibm.com>
Date: Mon, 11 Oct 2021 18:02:53 +0200
From: Stefan Schulze Frielinghaus <stefansf@linux.ibm.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [RFC] ldist: Recognize rawmemchr loop patterns
Message-ID: <YWRgLalC+6mRpZeI@localhost.localdomain>
References: <CAFiYyc3RR5P4GZXEUiwN2=5d96_pGaHPwPTYCOY=ax-6W3=raw@mail.gmail.com>
 <YKasY3VQ5GDBRomk@localhost.localdomain>
 <YMeRMgR17PE0KTFx@localhost.localdomain>
 <CAFiYyc2JW0H-+Z7X7vwD39T3i2ndF5No6gQ55Qs=0VBtCF5=WA@mail.gmail.com>
 <YNWuowVwITc48yeL@localhost.localdomain>
 <CAFiYyc1xG0TPaygUj6UoGABA-kFkOZGD32Ky+Nva18gdtHcvCA@mail.gmail.com>
 <YTHWOO+D4kpeDcHJ@localhost.localdomain>
 <CAFiYyc3w=0QJsYhgWWEQXiN-3NErNCVm4L4fkDbfE0tLNXvtvw@mail.gmail.com>
 <YT9l12mswNfteDHb@localhost.localdomain>
 <CAFiYyc3zDoGU3iRRfXBg-4AyGb4r_Ut7gQUS4VA_g8+ZiLg3KQ@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAFiYyc3zDoGU3iRRfXBg-4AyGb4r_Ut7gQUS4VA_g8+ZiLg3KQ@mail.gmail.com>
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Mon, 11 Oct 2021 16:03:02 -0000

On Fri, Sep 17, 2021 at 10:08:27AM +0200, Richard Biener wrote:
> On Mon, Sep 13, 2021 at 4:53 PM Stefan Schulze Frielinghaus
> <stefansf@linux.ibm.com> wrote:
> >
> > On Mon, Sep 06, 2021 at 11:56:21AM +0200, Richard Biener wrote:
> > > On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus
> > > <stefansf@linux.ibm.com> wrote:
> > > >
> > > > On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
> > > > [...]
> > > > > > >
> > > > > > > +  /* Handle strlen like loops.  */
> > > > > > > +  if (store_dr == NULL
> > > > > > > +      && integer_zerop (pattern)
> > > > > > > +      && TREE_CODE (reduction_iv.base) == INTEGER_CST
> > > > > > > +      && TREE_CODE (reduction_iv.step) == INTEGER_CST
> > > > > > > +      && integer_onep (reduction_iv.step)
> > > > > > > +      && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
> > > > > > > +         || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))))
> > > > > > > +    {
> > > > > > >
> > > > > > > > I wonder what goes wrong with a larger or smaller wrapping IV type?
> > > > > > > > The iteration only stops when you load a NUL and the increments just
> > > > > > > > wrap along (you're using the pointer IVs to compute the strlen result).
> > > > > > > > Can't you simply truncate?
> > > > > >
> > > > > > I think truncation is enough as long as no overflow occurs in strlen or
> > > > > > strlen_using_rawmemchr.
> > > > > >
> > > > > > > For larger than size_type_node (actually larger than ptr_type_node would matter
> > > > > > > I guess), the argument is that since pointer wrapping would be undefined anyway
> > > > > > > the IV cannot wrap either.  Now, the correct check here would IMHO be
> > > > > > >
> > > > > > > >       TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION (ptr_type_node)
> > > > > > > >        || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> > > > > > >
> > > > > > > ?
> > > > > >
> > > > > > Regarding the implementation which makes use of rawmemchr:
> > > > > >
> > > > > > We can count at most PTRDIFF_MAX many bytes without an overflow.  Thus,
> > > > > > the maximal length we can determine of a string where each character has
> > > > > > size S is PTRDIFF_MAX / S without an overflow.  Since an overflow for
> > > > > > ptrdiff type is undefined we have to make sure that if an overflow
> > > > > > occurs, then an overflow occurs for reduction variable, too, and that
> > > > > > this is undefined, too.  However, I'm not sure anymore whether we want
> > > > > > to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
> > > > > > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
> > > > > > this would mean that a single string consumes more than half of the
> > > > > > virtual addressable memory.  At least for architectures where
> > > > > > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable
> > > > > > to neglect the case where computing pointer difference may overflow.
> > > > > > > Otherwise we are talking about strings with lengths of multiple
> > > > > > pebibytes.  For other architectures we might have to be more precise
> > > > > > and make sure that reduction variable overflows first and that this is
> > > > > > undefined.
> > > > > >
> > > > > > > Thus a conservative condition would be (I assumed that the size of any
> > > > > > > integral type is a power of two, though I'm not sure this really holds;
> > > > > > > IIRC the C standard requires only that the alignment is a power of two,
> > > > > > > not necessarily the size, so I might need to change this):
> > > > > >
> > > > > > > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - log2 (sizeof (load_type)))
> > > > > >    or in other words return true if reduction variable overflows first
> > > > > >    and false otherwise.  */
> > > > > >
> > > > > > static bool
> > > > > > reduction_var_overflows_first (tree reduction_var, tree load_type)
> > > > > > {
> > > > > >   unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> > > > > >   unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE (reduction_var));
> > > > > >   unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT (load_type)));
> > > > > >   return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - size_exponent);
> > > > > > }
> > > > > >
> > > > > > TYPE_PRECISION (ptrdiff_type_node) == 64
> > > > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > > > > >     && reduction_var_overflows_first (reduction_var, load_type))
> > > > > >
> > > > > > Regarding the implementation which makes use of strlen:
> > > > > >
> > > > > > I'm not sure what it means if strlen is called for a string with a
> > > > > > length greater than SIZE_MAX.  Therefore, similar to the implementation
> > > > > > using rawmemchr where we neglect the case of an overflow for 64bit
> > > > > > architectures, a conservative condition would be:
> > > > > >
> > > > > > TYPE_PRECISION (size_type_node) == 64
> > > > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > > > > >     && TYPE_PRECISION (TREE_TYPE (reduction_var)) <= TYPE_PRECISION (size_type_node))
> > > > > >
> > > > > > I still included the overflow undefined check for reduction variable in
> > > > > > order to rule out situations where the reduction variable is unsigned
> > > > > > and overflows as many times until strlen(,_using_rawmemchr) overflows,
> > > > > > too.  Maybe this is all theoretical nonsense but I'm afraid of uncommon
> > > > > > architectures.  Anyhow, while writing this down it becomes clear that
> > > > > > this deserves a comment which I will add once it becomes clear which way
> > > > > > to go.
> > > > >
> > > > > I think all the arguments about objects bigger than half of the address-space
> > > > > also are valid for 32bit targets and thus 32bit size_type_node (or
> > > > > 32bit pointer size).
> > > > > I'm not actually sure what's the canonical type to check against, whether
> > > > > it's size_type_node (Cs size_t), ptr_type_node (Cs void *) or sizetype (the
> > > > > middle-end "offset" type used for all address computations).  For weird reasons
> > > > > I'd lean towards 'sizetype' (for example some embedded targets have 24bit
> > > > > pointers but 16bit 'sizetype').
> > > >
> > > > Ok, for the strlen implementation I changed from size_type_node to
> > > > sizetype and assume that no overflow occurs for string objects bigger
> > > > than half of the address space for 32-bit targets and up:
> > > >
> > > >   (TYPE_PRECISION (sizetype) >= TYPE_PRECISION (ptr_type_node) - 1
> > > >    && TYPE_PRECISION (ptr_type_node) >= 32)
> > > >   || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > > >       && TYPE_PRECISION (TREE_TYPE (reduction_var)) <= TYPE_PRECISION (sizetype))
> > > >
> > > > and similarly for the rawmemchr implementation:
> > > >
> > > >   (TYPE_PRECISION (ptrdiff_type_node) == TYPE_PRECISION (ptr_type_node)
> > > >    && TYPE_PRECISION (ptrdiff_type_node) >= 32)
> > > >   || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > >       && reduction_var_overflows_first (reduction_var, load_type))
> > > >
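
As a stand-alone sketch of the arithmetic behind the
reduction_var_overflows_first check quoted above: this is plain C, not GCC
internals.  The names exact_log2_u and reduction_var_overflows_first_sketch
and their parameters are made up for illustration, and precisions are passed
as plain bit counts instead of trees.  The idea is that with a ptrdiff
precision of p bits and a load size of S = 2^k bytes, at most about
2^(p - 1 - k) elements can be counted without a ptrdiff overflow, so a
reduction variable with fewer than p - 1 - k bits overflows first.

```c
#include <assert.h>
#include <stdbool.h>

/* Return the exact base-2 logarithm of SIZE, or -1 if SIZE is not a
   power of two.  Hypothetical helper for this sketch only.  */
int
exact_log2_u (unsigned size)
{
  if (size == 0 || (size & (size - 1)) != 0)
    return -1;
  int exponent = 0;
  while ((size >>= 1) != 0)
    ++exponent;
  return exponent;
}

/* Return true if a reduction variable with PREC_REDUCTION bits overflows
   before a pointer difference with PREC_PTRDIFF bits does, when each
   loaded element is LOAD_SIZE bytes wide.  This sketch assumes LOAD_SIZE
   is a power of two and simply asserts it.  */
bool
reduction_var_overflows_first_sketch (unsigned prec_reduction,
                                      unsigned prec_ptrdiff,
                                      unsigned load_size)
{
  int size_exponent = exact_log2_u (load_size);
  assert (size_exponent >= 0);
  /* Guard against unsigned wrap-around in the subtraction below.  */
  assert (prec_ptrdiff > 1u + (unsigned) size_exponent);
  return prec_reduction < prec_ptrdiff - 1 - (unsigned) size_exponent;
}
```

For example, with a 64-bit ptrdiff type and 2-byte loads the bound is
64 - 1 - 1 = 62 bits, so a 32-bit reduction variable overflows first while a
64-bit one does not.
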
> > > > >
> > > > > > >
> > > > > > > +      if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))
> > > > > > > +       {
> > > > > > > +         const char *msg = G_("assuming signed overflow does not occur "
> > > > > > > +                              "when optimizing strlen like loop");
> > > > > > > +         fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC);
> > > > > > > +       }
> > > > > > >
> > > > > > > no, please don't add any new strict-overflow warnings ;)
> > > > > >
> > > > > > I just stumbled over code which produces such a warning and thought this
> > > > > > is a hard requirement :D The new patch doesn't contain it anymore.
> > > > > >
> > > > > > >
> > > > > > > The generate_*_builtin routines need some factoring - if you code-generate
> > > > > > > into a gimple_seq you could use gimple_build () which would do the fold_stmt
> > > > > > > (not sure why you do that - you should see to fold the call, not necessarily
> > > > > > > the rest).  The replacement of reduction_var and the dumping could be shared.
> > > > > > > There's also GET_MODE_NAME for the printing.
> > > > > >
> > > > > > > I wasn't really sure which way to go:  use a gsi, as is done by the
> > > > > > > existing generate_* functions, or make use of gimple_seq.  Since the
> > > > > > > latter also uses a gsi internally, I thought it was better to stick to
> > > > > > > gsi in the first place.  Now, after changing to gimple_seq, I see the
> > > > > > > beauty of it :)
> > > > > >
> > > > > > I created two helper functions generate_strlen_builtin_1 and
> > > > > > generate_reduction_builtin_1 in order to reduce code duplication.
> > > > > >
> > > > > > In function generate_strlen_builtin I changed from using
> > > > > > builtin_decl_implicit (BUILT_IN_STRLEN) to builtin_decl_explicit
> > > > > > (BUILT_IN_STRLEN) since the former could return a NULL pointer. I'm not
> > > > > > sure whether my intuition about the difference between implicit and
> > > > > > explicit builtins is correct.  In builtins.def there is a small example
> > > > > > given which I would paraphrase as "use builtin_decl_explicit if the
> > > > > > semantics of the builtin is defined by the C standard; otherwise use
> > > > > > builtin_decl_implicit" but probably my intuition is wrong?
> > > > > >
> > > > > > > Besides that, I'm not sure whether I really have to call
> > > > > > build_fold_addr_expr which looks superfluous to me since
> > > > > > gimple_build_call can deal with ADDR_EXPR as well as FUNCTION_DECL:
> > > > > >
> > > > > > tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
> > > > > > gimple *fn_call = gimple_build_call (fn, 1, mem);
> > > > > >
> > > > > > However, since it is also used that way in the context of
> > > > > > generate_memset_builtin I didn't remove it so far.
> > > > > >
> > > > > > > I think overall the approach is sound now but the details still need work.
> > > > > >
> > > > > > Once again thank you very much for your review.  Really appreciated!
> > > > >
> > > > > The patch lacks a changelog entry / description.  It's nice if patches sent
> > > > > out for review are basically the rev as git format-patch produces.
> > > > >
> > > > > The rawmemchr optab needs documenting in md.texi
> > > >
> > > > While writing the documentation in md.texi I realised that other
> > > > instructions expect an address to be a memory operand which is not the
> > > > case for rawmemchr currently. At the moment the address is either an
> > > > SSA_NAME or ADDR_EXPR with a tree pointer type in expand_RAWMEMCHR. As a
> > > > consequence in the backend define_expand rawmemchr<mode> expects a
> > > > register operand and not a memory operand. Would it make sense to build
> > > > a MEM_REF out of SSA_NAME/ADDR_EXPR in expand_RAWMEMCHR? Not sure if
> > > > MEM_REF is supposed to be the canonical form here.
> > >
> > > > I suppose the expander could use code similar to what
> > > > expand_builtin_memset_args does, using get_memory_rtx.  I suppose that
> > > > we're using MEM operands because those can convey things like alias info
> > > > or alignment info, something which REG operands cannot (easily).  I
> > > > wouldn't build a MEM_REF and try to expand that.
> >
> > The new patch contains the following changes:
> >
> > - In expand_RAWMEMCHR I'm using get_memory_rtx now.  This means I had to
> >   change linkage of get_memory_rtx to extern.
> >
> > - In function generate_strlen_builtin_using_rawmemchr I'm not
> >   reconstructing the load type anymore from the base pointer but rather
> >   pass it as a parameter from function transform_reduction_loop where we
> >   also ensured that it is of integral type.  Reconstructing the load
> >   type was error-prone since, e.g., I didn't distinguish between
> >   pointer_plus_expr and addr_expr.  Thus passing the load type should
> >   be more robust.
> >
> > Regtested on IBM Z and x86.  Ok for mainline?
> 
> OK, and sorry for all the repeated delays.

No problem at all.  I'm glad to see how the patch evolved over each
iteration.  That being said:  Thanks for all your reviews and hints!

The patch implementing the rawmemchr expander for IBM Z was also ack'd and
I pushed both commits today.

For the xalancbmk benchmark we now recognize 1081 rawmemchr-like loops
where at least one is in the hot path.  Utilising a specialised rawmemchr
implementation for 16-bit characters gives good results on IBM Z ...
just saying maybe other archs are interested, too ;-)
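
For illustration, loops of roughly the following shape are what I mean by
rawmemchr-like and strlen-like.  This is only a minimal sketch in plain C:
the function names find_u16 and len_u16 are made up for this example and are
not taken from the patch or from xalancbmk, and whether a particular loop is
actually transformed depends on the conditions discussed in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* rawmemchr-like loop over 16-bit characters: scan for PATTERN without
   any length bound.  The caller must guarantee that PATTERN occurs in
   the buffer, otherwise the loop reads past its end.  */
const uint16_t *
find_u16 (const uint16_t *s, uint16_t pattern)
{
  assert (s != NULL);
  while (*s != pattern)
    ++s;
  return s;
}

/* strlen-like loop: count the elements before the terminating zero
   using an integer reduction variable with step one.  */
size_t
len_u16 (const uint16_t *s)
{
  assert (s != NULL);
  size_t n = 0;
  while (s[n] != 0)
    ++n;
  return n;
}
```

Both functions describe the same search when the pattern is zero:
find_u16 (s, 0) - s equals len_u16 (s), which is the relationship the
rawmemchr-based strlen variant relies on.
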

Thanks,
Stefan


> 
> Thanks,
> Richard.
> 
> > Thanks,
> > Stefan
> >
> > >
> > > > >
> > > > > +}
> > > > > +
> > > > > +static bool
> > > > > +reduction_var_overflows_first (tree reduction_var, tree load_type)
> > > > > +{
> > > > > +  unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> > > > >
> > > > > this function needs a comment.
> > > >
> > > > Done.
> > > >
> > > > >
> > > > > +         if (stmt_has_scalar_dependences_outside_loop (loop, phi))
> > > > > +           {
> > > > > +             if (reduction_stmt)
> > > > > +               return false;
> > > > >
> > > > > you leak bbs here and elsewhere where you early exit the function.
> > > > > In fact you fail to free it at all.
> > > >
> > > > Whoopsy. I factored the whole loop out into static function
> > > > determine_reduction_stmt in order to deal with all early exits.
> > > >
> > > > >
> > > > > Otherwise the patch looks good - thanks for all the improvements.
> > > > >
> > > > > What I do wonder is
> > > > >
> > > > > +  tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
> > > > > +  gimple *fn_call = gimple_build_call (fn, 1, mem);
> > > > >
> > > > > using builtin_decl_explicit means that in a TU where strlen is neither
> > > > > declared nor used we can end up emitting calls to it.  For memcpy/memmove
> > > > > that's usually OK since we require those to be present even in a
> > > > > freestanding environment.  But I'm not sure about strlen here so I'd
> > > > > lean towards using builtin_decl_implicit and checking that for NULL which
> > > > > IIRC should prevent emitting strlen when it's not declared and maybe even
> > > > > if it's declared but not used.  All other uses that generate STRLEN
> > > > > use that at least.
> > > >
> > > > Thanks for clarification.  I changed it back to builtin_decl_implicit
> > > > and check for null pointers.
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > Thanks,
> > > > Stefan