From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-408312-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 71025 invoked by alias); 25 Sep 2015 11:30:14 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 71007 invoked by uid 89); 25 Sep 2015 11:30:12 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.5 required=5.0 tests=AWL,BAYES_00,SPF_PASS autolearn=ham version=3.3.2
X-HELO: eu-smtp-delivery-143.mimecast.com
Received: from eu-smtp-delivery-143.mimecast.com (HELO eu-smtp-delivery-143.mimecast.com) (207.82.80.143) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 25 Sep 2015 11:30:10 +0000
Received: from cam-owa1.Emea.Arm.com (fw-tnat.cambridge.arm.com [217.140.96.140]) by eu-smtp-1.mimecast.com with ESMTP id uk-mta-36-06SQmDFzRdOLA8Na0n7XVg-1; Fri, 25 Sep 2015 12:30:04 +0100
Received: from [10.2.207.14] ([10.1.2.79]) by cam-owa1.Emea.Arm.com with Microsoft SMTPSVC(6.0.3790.3959);	 Fri, 25 Sep 2015 12:30:04 +0100
Subject: Re: [PATCH v2][GCC] Algorithmic optimization in match and simplify
To: Richard Biener <richard.guenther@gmail.com>
References: <55E092D9.8070700@arm.com> <alpine.DEB.2.20.1508281956570.9451@laptop-mg.saclay.inria.fr> <55E5AACB.3050205@arm.com> <CAFiYyc0j=1JYycYi9zBFMZooCE7C-mxZUUjiXzrAX47jgKaRvw@mail.gmail.com> <55E82AFF.4020401@arm.com> <CAFiYyc2znWLc-S7txfaJT2VatwSmg_7pZ81__ehoGpU1Qdtf6w@mail.gmail.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>
From: Andre Vieira <Andre.SimoesDiasVieira@arm.com>
Message-ID: <5605303B.7080102@arm.com>
Date: Fri, 25 Sep 2015 11:44:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.2.0
MIME-Version: 1.0
In-Reply-To: <CAFiYyc2znWLc-S7txfaJT2VatwSmg_7pZ81__ehoGpU1Qdtf6w@mail.gmail.com>
X-MC-Unique: 06SQmDFzRdOLA8Na0n7XVg-1
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
X-IsSubscribed: yes
X-SW-Source: 2015-09/txt/msg01937.txt.bz2

On 17/09/15 10:46, Richard Biener wrote:
> On Thu, Sep 3, 2015 at 1:11 PM, Andre Vieira
> <Andre.SimoesDiasVieira@arm.com> wrote:
>> On 01/09/15 15:01, Richard Biener wrote:
>>>
>>> On Tue, Sep 1, 2015 at 3:40 PM, Andre Vieira
>>> <Andre.SimoesDiasVieira@arm.com> wrote:
>>>>
>>>> Hi Marc,
>>>>
>>>> On 28/08/15 19:07, Marc Glisse wrote:
>>>>>
>>>>>
>>>>> (not a review, I haven't even read the whole patch)
>>>>>
>>>>> On Fri, 28 Aug 2015, Andre Vieira wrote:
>>>>>
>>>>>> 2015-08-03  Andre Vieira  <andre.simoesdiasvieira@arm.com>
>>>>>>
>>>>>>     * match.pd: Added new patterns:
>>>>>>       ((X {&,<<,>>} C0) {|,^} C1) {^,|} C2)
>>>>>>       (X {|,^,&} C0) {<<,>>} C1 -> (X {<<,>>} C1) {|,^,&} (C0 {<<,>>}
>>>>>> C1)
>>>>>
>>>>>
>>>>>
>>>>> +(for op0 (rshift rshift lshift lshift bit_and bit_and)
>>>>> + op1 (bit_ior bit_xor bit_ior bit_xor bit_ior bit_xor)
>>>>> + op2 (bit_xor bit_ior bit_xor bit_ior bit_xor bit_ior)
>>>>>
>>>>> You can nest for-loops, it seems clearer as:
>>>>> (for op0 (rshift lshift bit_and)
>>>>>      (for op1 (bit_ior bit_xor)
>>>>>           op2 (bit_xor bit_ior)
>>>>
>>>>
>>>>
>>>> Will do, thank you for pointing it out.
>>>>>
>>>>>
>>>>>
>>>>> +(simplify
>>>>> + (op2:c
>>>>> +  (op1:c
>>>>> +   (op0 @0 INTEGER_CST@1) INTEGER_CST@2) INTEGER_CST@3)
>>>>>
>>>>> I suspect you will want more :s (single_use) and less :c
>>>>> (canonicalization
>>>>> should put constants in second position).
>>>>>
>>>> I can't find the definition of :s (single_use).
>>>
>>>
>>> Sorry for that - didn't get along updating it yet :/  It restricts
>>> matching to
>>> sub-expressions that have a single-use.  So
>>>
>>> +  a &=3D 0xd123;
>>> +  unsigned short tem =3D a ^ 0x6040;
>>> +  a =3D tem | 0xc031; /* Simplify _not_ to ((a & 0xd123) | 0xe071).  */
>>> ... use of tem ...
>>>
>>> we shouldn't do the simplifcation here because the expression
>>> (a & 0x123) ^ 0x6040 is kept live by 'tem'.
>>>
>>>> GCC internals do point out
>>>> that canonicalization does put constants in the second position, didnt
>>>> see
>>>> that first. Thank you for pointing it out.
>>>>
>>>>> +       C1 =3D wi::bit_and_not (C1,C2);
>>>>>
>>>>> Space after ','.
>>>>>
>>>> Will do.
>>>>
>>>>> Having wide_int_storage in many places is surprising, I can't find
>>>>> similar
>>>>> code anywhere else in gcc.
>>>>>
>>>>>
>>>>
>>>> I tried looking for examples of something similar, I think I ended up
>>>> using
>>>> wide_int because I was able to convert easily to and from it and it has
>>>> the
>>>> "mask" and "wide_int_to_tree" functions. I welcome any suggestions on
>>>> what I
>>>> should be using here for integer constant transformations and
>>>> comparisons.
>>>
>>>
>>> Using wide-ints is fine, but you shouldn't need 'wide_int_storage'
>>> constructors - those
>>> are indeed odd.  Is it just for the initializers of wide-ints?
>>>
>>> +    wide_int zero_mask =3D wi::zero (prec);
>>> +    wide_int C0 =3D wide_int_storage (@1);
>>> +    wide_int C1 =3D wide_int_storage (@2);
>>> +    wide_int C2 =3D wide_int_storage (@3);
>>> ...
>>> +       zero_mask =3D wide_int_storage (wi::mask (C0.to_uhwi (), false,
>>> prec));
>>>
>>> tree_to_uhwi (@1) should do the trick as well
>>>
>>> +       C1 =3D wi::bit_and_not (C1,C2);
>>> +       cst_emit =3D wi::bit_or (C1, C2);
>>>
>>> the ops should be replacable with @2 and @3, the store to C1 obviously =
not
>>> (but you can use a tree temporary and use wide_int_to_tree here to avoid
>>> the back-and-forth for the case where C1 is not assigned to).
>>>
>>> Note that transforms only doing association are prone to endless recurs=
ion
>>> in case some other pattern does the reverse op...
>>>
>>> Richard.
>>>
>>>
>>>> BR,
>>>> Andre
>>>>
>>>
>> Thank you for all the comments, see reworked version:
>>
>> Two new algorithmic optimisations:
>>    1.((X op0 C0) op1 C1) op2 C2)
>>      with op0 =3D {&, >>, <<}, op1 =3D {|,^}, op2 =3D {|,^} and op1 !=3D=
 op2
>>      zero_mask has 1's for all bits that are sure to be 0 in (X op0 C0)
>>      and 0's otherwise.
>>      if (op1 =3D=3D '^') C1 &=3D ~C2 (Only changed if actually emitted)
>>      if ((C1 & ~zero_mask) =3D=3D 0) then emit (X op0 C0) op2 (C1 op2 C2)
>>      if ((C2 & ~zero_mask) =3D=3D 0) then emit (X op0 C0) op1 (C1 op2 C2)
>>    2. (X {|,^,&} C0) {<<,>>} C1 -> (X {<<,>>} C1) {|,^,&} (C0 {<<,>>} C1)
>>
>>
>> This patch does two algorithmic optimisations that target patterns like:
>> (((x << 24) | 0x00FFFFFF) ^ 0xFF000000) and ((x ^ 0x40000002) >> 1) |
>> 0x80000000.
>>
>> The transformation uses the knowledge of which bits are zero after
>> operations like (X {&,<<,(unsigned)>>}) to combine constants, reducing
>> run-time operations.
>> The two examples above would be transformed into (X << 24) ^ 0xFFFFFFFF =
and
>> (X >> 1) ^ 0xa0000001 respectively.
>>
>> The second transformation enables more applications of the first. Also s=
ome
>> targets may benefit from delaying shift operations. I am aware that such=
 an
>> optimization, in combination with one or more optimizations that cause t=
he
>> reverse transformation, may lead to an infinite loop. Though such behavi=
or
>> has not been detected during regression testing and bootstrapping on
>> aarch64.
>
> +/* (X bit_op C0) rshift C1 -> (X rshift C0) bit_op (C0 rshift C1) */
> +(for bit_op (bit_ior bit_xor bit_and)
> +(simplify
> + (rshift (bit_op:s @0 INTEGER_CST@1) INTEGER_CST@2)
> + (bit_op
> +  (rshift @0 @2)
> +  { wide_int_to_tree (type, wi::rshift (@1, @2, TYPE_SIGN (type))); })))
> +
> +/* (X bit_op C0) lshift C1 -> (X lshift C0) bit_op (C0 lshift C1) */
> +(for bit_op (bit_ior bit_xor bit_and)
> +(simplify
> + (lshift (bit_op:s @0 INTEGER_CST@1) INTEGER_CST@2)
> + (bit_op
> +  (lshift @0 @2)
> +  { wide_int_to_tree (type, wi::lshift (@1, @2)); })))

Will do, good to see that my second transformation still picks up the=20
shift @1 @2 as a constant. I'm assuming there is some constant folding=20
going on between simplify transformations?

>
> this may be one case where not using wide-ints to be able to combine the
> patterns makes sense.  Thus,
>
> (for shift (lshift rshift)
>   (simplify
>    (shift ...)
>    (bit_op
>     (shift @0 @2)
>     (shift @1 @2))))
>
> note that I'm worried we'd take on "invalid" ubsan here when the above
> applies to
>
> int foo (int i)
> {
>    return (i & 0x7fffffff) >> 3;
> }
> int main () { return foo (0x80000007); }
>
> and i is negative.  That's because right-shift of negative values
> invokes undefined
> behavior.  IIRC in the middle-end we will not be taking advantage of
> that but the
> simplifications apply to GENERIC as well and thus may hit before ubsan
> instrumentation triggers(?)  It would be nice if you could double-check t=
hat.

I was looking into this and I understand your worries, though, I found=20
out that for the particular case of shifts and bit_and there already is=20
a simplify transformation that does the same, regardless of the sign.

/* Fold (X & C2) << C1 into (X << C1) & (C2 << C1)
    (X & C2) >> C1 into (X >> C1) & (C2 >> C1).  */
(for shift (lshift rshift)
  (simplify
   (shift (convert?:s (bit_and:s @0 INTEGER_CST@2)) INTEGER_CST@1)
   (if (tree_nop_conversion_p (type, TREE_TYPE (@0)))
    (with { tree mask =3D int_const_binop (shift, fold_convert (type, @2),=
=20
@1); }
     (bit_and (shift (convert @0) @1) { mask; })))))

Now, I don't quite understand what you mean by ubsan instrumentation,=20
will GCC introduce guards into code where it detects potential undefined=20
behavior? Also, it might be worth to note that right shift of negative=20
values is denoted as "implementation defined" by the C standard. GCC=20
however doesn't include it in its list of implementation defined=20
behavior so I guess that is why you refer to it as undefined?

Should we perhaps disable transformations where we can not guarantee=20
that the right shift produced is not one of negative values? I.e.=20
prohibit this transformation for:
1) signed types and op1 =3D=3D bit_xor
2) signed types and op1 =3D=3D bit_and and C1 has sign bit set.

Also would it be useful in cases where you have signed shift and=20
bit_and, and C1 is not negative, to do the transformation but replace=20
the signed shift by an unsigned shift. This to make sure we don't=20
introduce undefined/implementation defined behavior were there was none.

This does however change the current behavior!

>
> + (if (!(op0 =3D=3D RSHIFT_EXPR && !TYPE_UNSIGNED (type)) && wi::fits_uhw=
i_p (@1))
>
> you only need fits_uhwi_p (@1) in the op0 !=3D BIT_AND_EXPR case it
> seems, so better
> move it down to those cases.

So I used to, but I had the problem that I didn't know how to make it=20
"fail" the matching if this was not the case. For instance if op0 is a=20
lshift for which the constant doesn't fit uhwi, then it would fall=20
through and never set the zero mask, potentially leading to a wrong=20
transformation. Now I'm not sure this can happen, since that would mean=20
that either constant @2 or @3 need to be all 1's and that might already=20
be caught by some other transformation, but its wrong to rely on such=20
behavior IMO. So for now I have changed it to:

(simplify
  (op2
   (op1:s
    (op0@4 @0 INTEGER_CST@1) INTEGER_CST@2) INTEGER_CST@3)
  (if (!(op0 =3D=3D RSHIFT_EXPR && !TYPE_UNSIGNED (type)) &&
       (op0 =3D=3D BIT_AND_EXPR || wi::fits_uhwi_p (@1)))


It would be cool to have a FAIL expression, usable in the with clauses,=20
to make the pattern match fail a bit like the one in the machine=20
description language.

>
> +   (if (wi::eq_p (wi::bit_and (C1, zero_mask_not), wi::zero (prec)))
>
> I think you can write
>
>     (if (wi::bit_and (...) =3D=3D 0)
>
> or at least wi:eq_p (wi::bit_and (...), 0).
>

wi::bit_and (...) =3D=3D 0 seems to be doing the trick.

> I wonder if we shouldn't improve the pattern by handling (X op0 C0)
> transparently
> via using get_nonzero_bits (yeah, that's not exactly zero_mask but its
> inverse AFAIK).
> We'd wrap get_nonzero_bits in a helper that can handle GENERIC and your
> &, >>, << cases (hmm, such function must already exist somewhere...).  So=
 you'd
> reduce the pattern to
>
> + (for op1 (bit_ior bit_xor)
> +      op2 (bit_xor bit_ior)
> +(simplify
> + (op2
> +  (op1:s @0 INTEGER_CST@2) INTEGER_CST@3))
>     (with
>      {
>        wide_int zero_mask_not =3D get_nonzero_bits (@0);
> ...
>      }
>
> This would make use of value-range information determined by VRP for exam=
ple.

I'll go look for such a function.

>
> note that with your pattern you'd want to capture (op0:s @0 INTEGER_CST@1)
> like via (op0@4 @0 INTEGER_CST@1) so you can re-use it in the replacement
> like so:
>
> +   (if (wi::eq_p (wi::bit_and (C1, zero_mask_not), wi::zero (prec)))
> +    (op2 @4 { wide_int_to_tree (type, cst_emit); })
> +    (if (wi::eq_p (wi::bit_and (@3, zero_mask_not), wi::zero (prec)))
> +     (op1 @4 { wide_int_to_tree (type, cst_emit); }))))))))
>
> the expression doesn't need a :s then obviously.

Yeah makes sense.
>
> Thanks and sorry for the delay in reviewing this.
> Richard.
>

Thank you for all the comments!
>
>> gcc/ChangeLog:
>>
>> 2015-08-03  Andre Vieira  <andre.simoesdiasvieira@arm.com>
>>
>>    * match.pd: Added new patterns:
>>      ((X {&,<<,>>} C0) {|,^} C1) {^,|} C2)
>>      (X {|,^,&} C0) {<<,>>} C1 -> (X {<<,>>} C1) {|,^,&} (C0 {<<,>>} C1)
>>
>> gcc/testsuite/ChangeLog:
>>
>> 2015-08-03  Andre Vieira  <andre.simoesdiasvieira@arm.com>
>>              Hale Wang  <hale.wang@arm.com>
>>
>>    * gcc.dg/tree-ssa/forwprop-33.c: New test.
>