From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=qweC=CD=suse.de=rguenther@sourceware.org>
Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28])
	by sourceware.org (Postfix) with ESMTPS id 013753858D3C
	for <gcc-patches@gcc.gnu.org>; Thu, 15 Jun 2023 09:52:14 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 013753858D3C
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=suse.de
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de
Received: from relay2.suse.de (relay2.suse.de [149.44.160.134])
	by smtp-out1.suse.de (Postfix) with ESMTP id CD6CE223E8;
	Thu, 15 Jun 2023 09:52:13 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa;
	t=1686822733; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc:
	 mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=ubQ+stY7f9VL9+/mFOyr4mXrQdDU6+Hhems0kqRL8FQ=;
	b=t1BDqO0kOSwMAnVaAn4h3pv+/TvnSxuC8LUVTl3h+Yz4lZ3zt1NKmfAXHoVYuXrJX4uGmH
	XfJEMRsocYSfBgjDnq4eV9bvTMwGzy+CJr4re/ocI+Z7wx9spOwUcOd4p7q0rlyoLe2cGq
	NlC1Zf/RCbZOw6mNpUf8e18XRUzvjt4=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
	s=susede2_ed25519; t=1686822733;
	h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc:
	 mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=ubQ+stY7f9VL9+/mFOyr4mXrQdDU6+Hhems0kqRL8FQ=;
	b=QhOmL24vls0nSTjVuuqilIfc1aalVUFkCmG/7XHKEbxuI2emI3hMIkQUqMM+b7tTPpL/qi
	g9DE+TSweCgJ41Bw==
Received: from wotan.suse.de (wotan.suse.de [10.160.0.1])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by relay2.suse.de (Postfix) with ESMTPS id 803822C141;
	Thu, 15 Jun 2023 09:52:13 +0000 (UTC)
Date: Thu, 15 Jun 2023 09:52:13 +0000 (UTC)
From: Richard Biener <rguenther@suse.de>
To: Robin Dapp <rdapp.gcc@gmail.com>
cc: "juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai>, 
    gcc-patches <gcc-patches@gcc.gnu.org>, 
    "richard.sandiford" <richard.sandiford@arm.com>, 
    krebbel <krebbel@linux.ibm.com>, uweigand <uweigand@de.ibm.com>, 
    linkw@linux.ibm.com
Subject: Re: [PATCH V2] VECT: Support LEN_MASK_ LOAD/STORE to support flow
 control for length loop control
In-Reply-To: <93123844-a2a8-90ab-d47a-6f859fb3a8d6@gmail.com>
Message-ID: <nycvar.YFH.7.77.849.2306150922480.4723@jbgna.fhfr.qr>
References: <20230612041438.272885-1-juzhe.zhong@rivai.ai> <nycvar.YFH.7.77.849.2306150718540.4723@jbgna.fhfr.qr> <1A27F1FE56A91998+2023061516473052116877@rivai.ai> <7e31f226-476a-75d1-6a0f-13fa9134e311@gmail.com> <nycvar.YFH.7.77.849.2306150914360.4723@jbgna.fhfr.qr>
 <93123844-a2a8-90ab-d47a-6f859fb3a8d6@gmail.com>
User-Agent: Alpine 2.22 (LSU 394 2020-01-19)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
X-Spam-Status: No, score=-4.9 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On Thu, 15 Jun 2023, Robin Dapp wrote:

> > Meh, PoP is now behind a paywall, trying to get through ... I wonder
> > if there's a nice online html documenting the s390 len_load/store
> > instructions to better understand the need for the bias.
> 
> https://publibfp.dhe.ibm.com/epubs/pdf/a227832c.pdf
> 
> Look for vector load with length (store).  The length operand specifies
> the highest bytes to load instead of the actual length.

Hmm.  It indeed cannot represent len == 0, so you are making sure
that never happens?  Because when it is actually zero you are
going to get -1 here?  At least I don't see the bias operand used at
all:

; Implement len_load/len_store optabs with vll/vstl.
(define_expand "len_load_v16qi"
  [(match_operand:V16QI 0 "register_operand")
   (match_operand:V16QI 1 "memory_operand")
   (match_operand:QI 2 "register_operand")
   (match_operand:QI 3 "vll_bias_operand")
  ]
  "TARGET_VX && TARGET_64BIT"
{
  rtx mem = adjust_address (operands[1], BLKmode, 0);

  rtx len = gen_reg_rtx (SImode);
  emit_move_insn (len, gen_rtx_ZERO_EXTEND (SImode, operands[2]));
  emit_insn (gen_vllv16qi (operands[0], len, mem));
  DONE;
})

the docs of len_load say

"
@cindex @code{len_load_@var{m}} instruction pattern
@item @samp{len_load_@var{m}}
Load (operand 2 - operand 3) elements from vector memory operand 1
into vector register operand 0, setting the other elements of
operand 0 to undefined values.  Operands 0 and 1 have mode @var{m},
which must be a vector mode.  Operand 2 has whichever integer mode the
target prefers.  Operand 3 conceptually has mode @code{QI}. 

Operand 2 can be a variable or a constant amount.  Operand 3 specifies a
constant bias: it is either a constant 0 or a constant -1.  The predicate 
on
operand 3 must only accept the bias values that the target actually 
supports.
GCC handles a bias of 0 more efficiently than a bias of -1.

If (operand 2 - operand 3) exceeds the number of elements in mode
@var{m}, the behavior is undefined.

If the target prefers the length to be measured in bytes rather than
elements, it should only implement this pattern for vectors of @code{QI}
elements."

the minus in 'operand 2 - operand 3' should be a plus if the
bias is really zero or -1.  I suppose

'If (operand 2 - operand 3) exceeds the number of elements in mode
@var{m}, the behavior is undefined.'

means that the vectorizer has to make sure the biased element
count never underflows?

That is, for a loop like

void foo (double *x, float *y, int n)
{
  for (int i = 0; i < n; ++i)
    y[i] = x[i];
}

you should get

   x1 = len_load (...);
   x2 = len_load (...);
   y = VEC_PACK_TRUNC_EXPR <x1, x2>
   len_store (..., y);

but then the x2 load can end up with a len of zero and thus
trap (since you will load either a full vector or the first
byte of it).  I see you do

  /* If the backend requires a bias of -1 for LEN_LOAD, we must not emit
     len_loads with a length of zero.  In order to avoid that we prohibit
     more than one loop length here.  */
  if (partial_load_bias == -1
      && LOOP_VINFO_LENS (loop_vinfo).length () > 1)
    return false;

that's quite conservative.  I think you can do better when the
loads are aligned, reading an extra byte when ignoring the bias
is OK and you at least know the very first element is used.
For stores you would need to emit compare&jump for all but
the first store of a group though ...

That said, I'm still not seeing where you actually apply the bias.

Richard.