From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <929d7ab8-7ca2-4cd7-a68b-1a24862c7077@ventanamicro.com>
Date: Wed, 22 Nov 2023 10:59:04 -0700
From: Jeff Law
To: gcc-patches@gcc.gnu.org, Jivan Hakobyan, richard.sandiford@arm.com
Subject: Re: [RFA] New pass for sign/zero extension elimination
References: <6d5f8ba7-0c60-4789-87ae-68617ce6ac2c@ventanamicro.com>
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Content-Type: text/plain; charset=UTF-8

On 11/20/23 11:26, Richard Sandiford wrote:
>> +
>> +/* If we know the destination of CODE only uses some low bits
>> +   (say just the QI bits of an SI operation), then return true
>> +   if we can propagate the need for just the subset of bits
>> +   from the destination to the sources.  */
>> +
>> +static bool
>> +safe_for_live_propagation (rtx_code code)
>> +{
>> +  /* First handle rtx classes which as a whole are known to
>> +     be either safe or unsafe.  */
>> +  switch (GET_RTX_CLASS (code))
>> +    {
>> +    case RTX_OBJ:
>> +      return true;
>> +
>> +    case RTX_COMPARE:
>> +    case RTX_COMM_COMPARE:
>> +    case RTX_TERNARY:
>
> I suppose operands 1 and 2 of an IF_THEN_ELSE would be safe.

Yes.  The only downside is we'd need to special case IF_THEN_ELSE
because it doesn't apply to operand 0.  Right now we're pretty
conservative with anything other than binary codes.  Comment added
about the possibility of handling IF_THEN_ELSE as well.

> This made me wonder: is this safe for !TRULY_NOOP_TRUNCATION?  But I
> suppose it is.  What !TRULY_NOOP_TRUNCATION models is that the target
> mode has a canonical form that must be maintained, and wouldn't be by
> a plain subreg.  So TRULY_NOOP_TRUNCATION is more of an issue for
> consumers of the liveness information, rather than for computing the
> liveness information itself.

Really interesting question.  I think ext-dce is safe.  As you note
this is more a consumer side question and on the consumer side we don't
muck with TRUNCATE at all.
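To make the underlying property concrete: for the codes
safe_for_live_propagation accepts, the low bits of the result depend
only on the low bits of the operands, so truncation commutes with the
operation.  A standalone sanity check (the helper names are just for
illustration, not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Low-8-bit view of a value -- the kind of "live chunk" the pass tracks.  */
uint8_t
low8 (uint32_t v)
{
  return (uint8_t) v;
}

/* For a safe code like PLUS, truncating the inputs first must not
   change the truncated result:
     low8 (x + y) == low8 (low8 (x) + low8 (y)).  */
int
plus_propagates (uint32_t x, uint32_t y)
{
  return low8 (x + y) == low8 ((uint32_t) low8 (x) + (uint32_t) low8 (y));
}
```

Saturating codes such as SS_PLUS fail exactly this identity, which is
why they don't belong in the safe list.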
>> +    case SS_TRUNCATE:
>> +    case US_TRUNCATE:
>> +    case PLUS:
>> +    case MULT:
>> +    case SS_MULT:
>> +    case US_MULT:
>> +    case SMUL_HIGHPART:
>> +    case UMUL_HIGHPART:
>> +    case AND:
>> +    case IOR:
>> +    case XOR:
>> +    case SS_PLUS:
>> +    case US_PLUS:
>
> I don't think it's safe to propagate through saturating ops.
> They don't have the property that (x op y)%z == (x%z op y%z)%z.

Yea, you're probably right.  Removed.

>> +
>> +      /* We don't support vector destinations or destinations
>> +         wider than DImode.  It is safe to continue this loop.
>> +         At worst, it will leave things live which could have
>> +         been made dead.  */
>> +      if (VECTOR_MODE_P (GET_MODE (x)) || GET_MODE (x) > E_DImode)
>> +        continue;
>
> The E_DImode comparison hard-codes an assumption about the order of
> the mode enum.  How about using something like:
>
>   scalar_int_mode outer_mode;
>   if (!is_a <scalar_int_mode> (GET_MODE (x), &outer_mode)
>       || GET_MODE_BITSIZE (outer_mode) > 64)
>     continue;

Guilty as charged :-)  Not surprised you called that out.

Wouldn't we also want to verify that the size is constant, or is it the
case that all the variable cases are vector (and would we want to
actually depend on that)?

> The other continues use iter.skip_subrtxes (); when continuing.
> I don't think it matters for correctness whether we do that or not,
> since SETs and CLOBBERs shouldn't be nested.  But skipping should
> be faster.

My thought on not skipping the sub-rtxs in this case was to make sure
we processed things like memory addresses which could have embedded
side effects.  It probably doesn't matter in practice though.

> Maybe it would be worth splitting the SET/CLOBBER code out into
> a subfunction, to make the loop iteration easier to handle?

Yea, it could use another round of such work.  In the original, set and
use handling were one big function which drove me nuts.

>> +          /* Transfer all the LIVENOW bits for X into LIVE_TMP.  */
>> +          HOST_WIDE_INT rn = REGNO (SUBREG_REG (x));
>> +          for (HOST_WIDE_INT i = 4 * rn; i < 4 * rn + 4; i++)
>> +            if (bitmap_bit_p (livenow, i))
>> +              bitmap_set_bit (live_tmp, i);
>> +
>> +          /* The mode of the SUBREG tells us how many bits we can
>> +             clear.  */
>> +          machine_mode mode = GET_MODE (x);
>> +          HOST_WIDE_INT size = GET_MODE_SIZE (mode).to_constant ();
>> +          bitmap_clear_range (livenow, 4 * rn, size);
>
> Is clearing SIZE bytes correct?  Feels like it should be clearing
> something like log2 (size) + 1 instead.

Yea, I think you're right.  Fixed.

>> +          bit = SUBREG_BYTE (x).to_constant () * BITS_PER_UNIT;
>> +          if (WORDS_BIG_ENDIAN)
>> +            bit = (GET_MODE_BITSIZE (GET_MODE (SUBREG_REG (x))).to_constant ()
>> +                   - GET_MODE_BITSIZE (GET_MODE (x)).to_constant () - bit);
>> +
>> +          /* Catch big endian correctness issues rather than triggering
>> +             undefined behavior.  */
>> +          gcc_assert (bit < sizeof (HOST_WIDE_INT) * 8);
>
> This could probably use subreg_lsb, to avoid the inline endianness
> adjustment.

That's the routine I was looking for!  The original totally mucked up
the endianness adjustment and I kept thinking we must have an existing
routine to do this for us but didn't find it immediately, so I just
banged out a trivial endianness adjustment.

>> +
>> +          mask = GET_MODE_MASK (GET_MODE (SUBREG_REG (x))) << bit;
>> +          if (!mask)
>> +            mask = -0x100000000ULL;
>
> Not sure I follow this.  What does the -0x100000000ULL constant indicate?
> Also, isn't it the mask of the outer register that is shifted, rather
> than the mask of the inner mode?  E.g. if we have:

Inherited.  I should have marked it like the other one as needing
investigation.  Probably the fastest way is to just rip it out for a
test to see what breaks.

>> +      /* Some ports generate (clobber (const_int)).  */
>> +      else if (CONST_INT_P (x))
>
> Ugh.  What on earth do they do that for?  I thought that (with const0_rtx)
> was reserved as combine's "something went wrong" representation.

I stumbled over it on avr I think.
I didn't dig into why other than to note it occurred in an insn pattern
as opposed to rtl templates for generation from define_expands,
splitters, etc.

>> +/* INSN has a sign/zero extended source inside SET that we will
>> +   try to turn into a SUBREG.  */
>> +static void
>> +ext_dce_try_optimize_insn (rtx_insn *insn, rtx set, bitmap changed_pseudos)
>> +{
>> +  rtx src = SET_SRC (set);
>> +  rtx inner = XEXP (src, 0);
>> +
>> +  /* Avoid (subreg (mem)) and other constructs which may are valid RTL, but
>
> s/are// or s/may are/may be/

Fixed.  "may be".

>> +
>> +/* Process uses in INSN.  Set appropriate bits in LIVENOW for any chunks of
>> +   pseudos that become live, potentially filtering using bits from LIVE_TMP.
>> +
>> +   If MODIFIED is true, then optimize sign/zero extensions to SUBREGs when
>
> MODIFY

Fixed.

>> +
>> +      /* ?!? How much of this should mirror SET handling, potentially
>> +         being shared?  */
>> +      if (SUBREG_BYTE (dst).is_constant () && SUBREG_P (dst))
>> +        {
>> +          bit = SUBREG_BYTE (dst).to_constant () * BITS_PER_UNIT;
>> +          if (WORDS_BIG_ENDIAN)
>> +            bit = (GET_MODE_BITSIZE (GET_MODE (SUBREG_REG (dst))).to_constant ()
>> +                   - GET_MODE_BITSIZE (GET_MODE (dst)).to_constant () - bit);
>
> Same subreg_lsb thing here.

Yea, I'll fix them all.  There's 2 or 3 IIRC.

>> +
>> +          /* ?!? What is the point of this adjustment to DST_MASK?  */
>> +          if (code == PLUS || code == MINUS
>> +              || code == MULT || code == ASHIFT)
>> +            dst_mask
>> +              = dst_mask ? ((2ULL << floor_log2 (dst_mask)) - 1) : 0;
>
> Yeah, sympathise with the ?!? here :)

Inherited.  Like the other bit of magic I think I'll do a test with
them pulled out to see if I can make something undesirable trigger.

>> +          /* We will handle the other operand of a binary operator
>> +             at the bottom of the loop by resetting Y.  */
>> +          if (BINARY_P (src))
>> +            y = XEXP (src, 0);
>
> What about UNARY_P, given that NOT is included in the codes above?

We'll break that inner for(;;) then iterate into the subobject, marking
the relevant bits live.

FWIW, the control flow of this code continues to be my biggest concern
from a maintenance standpoint.  Figuring it out was a major pain and
I've tried to document what is and what is not safe.  But it's still
painful to walk through.  I pondered if note_uses/note_stores would be
better, but concluded we'd just end up with a ton of state objects to
carry around and reasoning about that would be just as hard.

>> +          else
>> +            y = src;
>> +
>> +          /* We're inside a SET and want to process the source operands
>> +             making things live.  Breaking from this loop will cause
>> +             the iterator to work on sub-rtxs, so it is safe to break
>> +             if we see something we don't know how to handle.  */
>> +          for (;;)
>> +            {
>> +              /* Strip an outer STRICT_LOW_PART or paradoxical subreg.
>> +                 That has the effect of making the whole referenced
>> +                 register live.  We might be able to avoid that for
>> +                 STRICT_LOW_PART at some point.  */
>> +              if (GET_CODE (x) == STRICT_LOW_PART
>> +                  || paradoxical_subreg_p (x))
>> +                x = XEXP (x, 0);
>
> Isn't "x" still the set at this point?  If it was intended to be "y",
> then I thought STRICT_LOW_PART was for destinations only.

I'll have to go back through the history.  Something definitely does
not look right here.

>> +              else if (!CONSTANT_P (y))
>> +                break;
>> +              /* We might have (ashift (const_int 1) (reg...))  */
>> +              else if (CONSTANT_P (y)
>> +                       && binop_implies_op2_fully_live (GET_CODE (src)))
>
> The CONSTANT_P (y) condition is redundant given the above.
> But couldn't the binop_implies_op2_fully_live be shared by both paths?
> Maybe it could be tested...

Yea, that was a fun little twisty maze.  The CONSTANT_P is definitely
redundant.  I had a bunch of fixes related to shifts in the code
initially, but Jivan and I slowly cleaned up and refactored over time.
I think there's still some work to do here, though.
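On the shift handling, the reason binop_implies_op2_fully_live has to
keep the count operand fully live, even when only a few low bits of the
result are wanted, can be shown with a tiny standalone example (the
name is illustrative, not the pass's code):

```c
#include <assert.h>
#include <stdint.h>

/* Low-8-bit view of a shift result.  Knowing only the low bits of the
   count is not enough: counts 1 and 9 agree in their low three bits,
   yet give different low-8-bit results.  */
uint8_t
low8_of_shift (uint32_t x, uint32_t n)
{
  return (uint8_t) (x << n);
}
```

So unlike the value operand, no bits of the count can be pruned.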
>> +
>> +  do
>> +    {
>> +      qin = qout = worklist;
>> +
>> +      /* Put every block on the worklist.  */
>> +      int *rpo = XNEWVEC (int, n_basic_blocks_for_fn (cfun));
>> +      int n = inverted_rev_post_order_compute (cfun, rpo);
>
> Does this need to be recalculated each iteration, or could it be done
> outside the loop?

I think it can/should move outside the loop.  Done.

> Was also wondering if this could use a DF_BACKWARD df_simple_dataflow
> (with its precomputed order), rather than doing the worklist management
> manually.  If df_simple_dataflow isn't efficient enough then we should
> try to fix that...

I'll have to take a look.  I actually never really had to look at this
code as all the fixes/cleanups were in the sets/uses processing side.
The liveness iteration and such "just worked".

Thanks a ton for all the feedback.  I've got several of the fixes done
and I'll throw them into testing today and get a V2 out as a few people
are playing with this.  V2 won't have fixes for everything you've
pointed out, so it probably won't be worth the time to review the V2
knowing a V3 is inevitable.

jeff
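P.S.  For anyone following along at home, the shape of the iteration
being discussed above -- with the visit order computed once, outside
the fixpoint loop -- is roughly the following.  The CFG, GEN/KILL sets
and names are all made up for illustration; this is not the pass's
code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NBLOCKS 4

/* Toy diamond CFG: 0 -> {1,2} -> 3; -1 marks a missing successor.  */
static const int succ[NBLOCKS][2] = { {1, 2}, {3, -1}, {3, -1}, {-1, -1} };
static const uint64_t gen_[NBLOCKS]  = { 0x0, 0x1, 0x0, 0x2 };
static const uint64_t kill_[NBLOCKS] = { 0x0, 0x0, 0x2, 0x0 };

/* Backward liveness to a fixpoint.  ORDER (a stand-in for what
   inverted_rev_post_order_compute provides) is computed once, before
   the loop, rather than on every pass.  */
uint64_t
toy_livein (int which)
{
  const int order[NBLOCKS] = { 3, 2, 1, 0 };
  uint64_t livein[NBLOCKS];
  memset (livein, 0, sizeof livein);
  for (int changed = 1; changed;)
    {
      changed = 0;
      for (int i = 0; i < NBLOCKS; i++)
        {
          int b = order[i];
          uint64_t liveout = 0;
          for (int j = 0; j < 2; j++)
            if (succ[b][j] >= 0)
              liveout |= livein[succ[b][j]];
          uint64_t in = gen_[b] | (liveout & ~kill_[b]);
          if (in != livein[b])
            {
              livein[b] = in;
              changed = 1;
            }
        }
    }
  return livein[which];
}
```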