From: Richard Sandiford
To: Richard Biener
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] middle-end/113576 - avoid out-of-bound vector element access
References: <20240206113819.2335F3858402@sourceware.org> <75s55n4q-2768-6o3s-0q19-n0r55349423n@fhfr.qr>
Date: Thu, 15 Feb 2024 09:46:31 +0000
In-Reply-To: (Richard Biener's message of "Thu, 15 Feb 2024 09:12:47 +0100 (CET)")

Richard Biener writes:
> On Wed, 14 Feb 2024, Richard Sandiford wrote:
>
>> Richard Biener writes:
>> > On Wed, 14 Feb 2024, Richard Sandiford wrote:
>> >
>> >> Richard Biener writes:
>> >> > The following avoids accessing out-of-bound vector elements when
>> >> > native encoding a boolean vector with sub-BITS_PER_UNIT precision
>> >> > elements.  The error was basing the number of elements to extract
>> >> > on the rounded-up total byte size involved; the patch bases
>> >> > everything on the total number of elements to extract instead.
>> >>
>> >> It's too long ago to be certain, but I think this was a deliberate
>> >> choice.  The point of the new vector constant encoding is that it
>> >> can give an allegedly sensible value for any given index, even
>> >> out-of-range ones.
>> >>
>> >> Since the padding bits are undefined, we should in principle have a
>> >> free choice of what to use.  And for VLA, it's often better to
>> >> continue the existing pattern rather than force it to zero.
>> >>
>> >> I don't strongly object to changing it.  I think we should be
>> >> careful about relying on zeroing for correctness, though.  The bits
>> >> are in principle undefined and we can't rely on reading zeros from
>> >> equivalent memory or register values.
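To make "continue the existing pattern" concrete, here is a rough, self-contained sketch of how a pattern-based encoding can extrapolate a value for any index, in range or not.  It is a toy model only; the names mirror the npatterns/nelts_per_pattern terminology but this is not the real vector_cst_elt in tree.cc:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of a pattern-encoded vector constant: `encoded` holds
// npatterns * nelts_per_pattern values, interleaved by pattern.
// Indices past the encoded elements are extrapolated by continuing
// each pattern's final step (or by repeating its final value).
int64_t elt (const std::vector<int64_t> &encoded,
             unsigned npatterns, unsigned nelts_per_pattern, unsigned i)
{
  if (i < encoded.size ())
    return encoded[i];
  unsigned pattern = i % npatterns;
  unsigned pos = i / npatterns;  // position within this pattern's sequence
  int64_t last = encoded[(nelts_per_pattern - 1) * npatterns + pattern];
  if (nelts_per_pattern < 3)
    return last;                 // non-stepped: repeat the final value
  int64_t prev = encoded[(nelts_per_pattern - 2) * npatterns + pattern];
  int64_t step = last - prev;
  return last + int64_t (pos - (nelts_per_pattern - 1)) * step;
}
```

E.g. a reversed V8 index vector { 7, 6, 5, ... } is encoded as the three elements { 7, 6, 5 }: elt (..., 1, 3, 5) extrapolates to 2, and even the out-of-range elt (..., 1, 3, 9) yields the "allegedly sensible" -2.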
>> >
>> > The main motivation for a change here is to allow catching
>> > out-of-bound indices again for VECTOR_CST_ELT, at least for constant
>> > nunits, because it might be a programming error like fat-fingering
>> > the index.  I do think it's a regression that we no longer catch
>> > those.
>> >
>> > It's probably also a bit non-obvious how an encoding continues, and
>> > there might be DImode masks that can be represented by a
>> > zero-extended QImode immediate but that, "continued", would require
>> > a larger immediate.
>> >
>> > The change also effectively only changes something for 1-byte
>> > encodings, since nunits is a power of two and so is the element
>> > size in bits.
>>
>> Yeah, but even there, there's an argument that all-1s (0xff) is a more
>> obvious value for an all-1s mask.
>>
>> > A patch restoring the VECTOR_CST_ELT checking might be the
>> > following:
>> >
>> > diff --git a/gcc/tree.cc b/gcc/tree.cc
>> > index 046a558d1b0..4c9b05167fd 100644
>> > --- a/gcc/tree.cc
>> > +++ b/gcc/tree.cc
>> > @@ -10325,6 +10325,9 @@ vector_cst_elt (const_tree t, unsigned int i)
>> >    if (i < encoded_nelts)
>> >      return VECTOR_CST_ENCODED_ELT (t, i);
>> >
>> > +  /* Catch out-of-bound element accesses.  */
>> > +  gcc_checking_assert (maybe_gt (VECTOR_CST_NELTS (t), i));
>> > +
>> >    /* If there are no steps, the final encoded value is the right
>> >       one.  */
>> >    if (!VECTOR_CST_STEPPED_P (t))
>> >      {
>> >
>> > but it triggers quite a bit via const_binop, for example
>> >
>> > #2  0x00000000011c1506 in const_binop (code=PLUS_EXPR, arg1=, arg2=)
>> > (gdb) p debug_generic_expr (arg1)
>> > { 12, 13, 14, 15 }
>> > $5 = void
>> > (gdb) p debug_generic_expr (arg2)
>> > { -2, -2, -2, -3 }
>> > (gdb) p count
>> > $4 = 6
>> > (gdb) l
>> > 1711          if (!elts.new_binary_operation (type, arg1, arg2, step_ok_p))
>> > 1712            return NULL_TREE;
>> > 1713          unsigned int count = elts.encoded_nelts ();
>> > 1714          for (unsigned int i = 0; i < count; ++i)
>> > 1715            {
>> > 1716              tree elem1 = VECTOR_CST_ELT (arg1, i);
>> > 1717              tree elem2 = VECTOR_CST_ELT (arg2, i);
>> > 1718
>> > 1719              tree elt = const_binop (code, elem1, elem2);
>> >
>> > This seems like an error to me - why would we, for fixed-size
>> > vectors and for PLUS, ever create a vector encoding with 6
>> > elements?!  That seems at least inefficient to me.
>>
>> It's a case of picking your poison.  On the other side, operating
>> individually on each element of a V64QI is inefficient when the
>> representation says up-front that all elements are equal.
>
> True, though I wonder why for VLS vectors new_binary_operation
> doesn't cap the number of encoded elts at the fixed vector size,
> like doing
>
>   encoded_elts = ordered_min (TYPE_VECTOR_SUBPARTS (..), encoded_elts);
>
> or, if there's no good way to write it so that it applies to both VLA
> and VLS, do it only when TYPE_VECTOR_SUBPARTS is constant.

ordered_min can't be used because there's no guarantee that encoded_elts
and TYPE_VECTOR_SUBPARTS are well-ordered for the VLA case.  E.g. for a
stepped (3-element) encoding and a length of 2+2X, the stepped encoding
is longer for X==0 and the vector is longer for X>0.

But yeah, in general, trying to enforce this for VLS would probably lead
to a proliferation of more "if VLA do one thing, if VLS do some other
thing".  The aim was to avoid that where it didn't seem strictly
necessary.
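The missing ordering can be shown with a toy poly-style length (hypothetical helpers; this is not GCC's real poly_int API).  For a stepped 3-element encoding against a 2+2X-element vector, neither side is known to be the smaller one for all X, so no ordered minimum exists:

```cpp
#include <cassert>

// Toy model of a length of the form a + b*X, where X is the runtime
// vector-length parameter (X >= 0) and b >= 0.
struct poly_len { long a, b; };

long eval (poly_len p, long x) { return p.a + p.b * x; }

// "known_" comparisons must hold for every X >= 0; "maybe_" comparisons
// need only hold for some X >= 0.
bool known_ge (poly_len p, long c) { return p.a >= c; }
bool maybe_gt (poly_len p, long c) { return p.b > 0 || p.a > c; }
```

For the 2+2X example: eval ({2, 2}, 0) == 2 is below 3 but eval ({2, 2}, 1) == 4 is above it, so maybe_gt ({2, 2}, 3) is true while known_ge ({2, 2}, 3) is false, which is exactly the case where a minimum can't be chosen without knowing X.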
>> Fundamentally, operations on VLA vectors are treated as functions that
>> map patterns to patterns.  The number of elements that are consumed
>> isn't really relevant to the function itself.  The VLA folders
>> therefore rely on being able to read an element from a pattern even if
>> the index is outside TYPE_VECTOR_SUBPARTS.
>>
>> There were two reasons for using VLA paths for VLS vectors.  One I
>> mentioned above: it saves time when all elements are equal, or have a
>> similarly compact representation.  The other is that it makes VLA less
>> special and ensures that the code gets more testing.
>>
>> Maybe one compromise between that and the assert would be:
>>
>> (1) enforce the assert only for VLS and
>
> that's what I did by using maybe_gt?
>
>> (2) add new checks to ensure that a VLA-friendly operation will never
>>     read out-of-bounds for VLS vectors
>>
>> But I think this would be awkward.  E.g. we now build reversed vectors
>> as a 3-element series N-1, N-2, N-3.  It would be nice not to have to
>> special-case N==2 by suppressing N-3.  And the condition for (2) might
>> not always be obvious.
>
> If we know only N <= 2 is problematic then having must_eq ()
> special cases there makes sense IMO.

I'm not sure it's just that.  E.g. a zip is { 0, N, 1, N+1, 2, N+2, ... },
which would need to be curtailed for N==2 or N==4.

>> Another option would be to have special accessors that are allowed to
>> read out of bounds, and add the assert (for both VLA & VLS) to
>> VECTOR_CST_ELT.  It might take a while to root out all the places that
>> need to change though.
>
> Yes, that's another possibility.
>
> The point is (and we're seeing it now with the behavior of encoding
> QImode bitmasks) that the behavior isn't entirely obvious when VLS
> vectors are involved.  There's a memset zero of the QImode so anything
> other than zero-padding is odd.

Which memset do you mean?  The one in:

  /* Zero the buffer and then set bits later where necessary.  */
  int extract_bytes = MIN (len, total_bytes - off);
  if (ptr)
    memset (ptr, 0, extract_bytes);

?  If so, that's there for elt_bits > 1.  It means that the canonical
form of a 2-bit active mask element is 01 rather than 11.  (And that's
something that might end up needing to be target-configurable.  The
current behaviour is guided by SVE, although it shouldn't matter much in
practice there.)

> It's also not clear to me we'd get reliable sign extension
> (we probably don't?).

For which case?  The idea isn't that we always sign-extend or always
zero-extend, but that we repeat the pattern.  So an all-1s mask would
give an all-1s QImode, a 1010 mask would give a 10101010 QI, etc.

Thanks,
Richard
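P.S. A toy sketch of the repeat-the-pattern behaviour for the QImode case (a hypothetical helper, not the actual native-encode code): the sub-byte mask pattern is tiled across the byte, rather than zero- or sign-extended:

```cpp
#include <cassert>
#include <cstdint>

// Tile an nbits-wide mask pattern (LSB-first) across an 8-bit value.
// An all-1s 4-bit pattern therefore yields 0xff, and 1010 yields
// 10101010, unlike zero-extension (0x0f) or sign-extension.
uint8_t repeat_mask_pattern (uint8_t pattern, unsigned nbits)
{
  uint8_t out = 0;
  for (unsigned i = 0; i < 8; ++i)
    if ((pattern >> (i % nbits)) & 1)
      out |= uint8_t (1u << i);
  return out;
}
```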