From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 11 Jul 2023 16:01:24 +0100 (BST)
From: "Maciej W. Rozycki" <macro@embecosm.com>
To: Richard Biener <richard.guenther@gmail.com>
cc: Rainer Orth <ro@cebitec.uni-bielefeld.de>, 
    Mike Stump <mikestump@comcast.net>, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 2/3] testsuite: Require 128-bit vectors for
 bb-slp-pr95839.c
In-Reply-To: <CAFiYyc23ujqdOfJ=R5uQL1YycJvxgdiyMuYY=-+J2ppSu7JDvA@mail.gmail.com>
Message-ID: <alpine.DEB.2.20.2307111423190.28892@tpp.orcam.me.uk>
References: <alpine.DEB.2.20.2307062118070.28892@tpp.orcam.me.uk> <alpine.DEB.2.20.2307062153310.28892@tpp.orcam.me.uk> <CAFiYyc23ujqdOfJ=R5uQL1YycJvxgdiyMuYY=-+J2ppSu7JDvA@mail.gmail.com>
User-Agent: Alpine 2.20 (DEB 67 2015-01-07)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
List-Id: <gcc-patches.gcc.gnu.org>

On Fri, 7 Jul 2023, Richard Biener wrote:

> > The bb-slp-pr95839.c test assumes quad-single float vector support, but
> > some targets only support pairs of floats, causing this test to fail
> > with such targets.  Limit this test to targets that support at least
> > 128-bit vectors then, and add a complementing test that can be run with
> > targets that have support for 64-bit vectors only.  There is no need to
> > adjust bb-slp-pr95839-2.c as 128 bits are needed even for the smallest
> > vector of doubles, so support is implied by the presence of vectors of
> > doubles.
> 
> I wonder why you see the testcase FAIL, on x86-64 when doing
> 
> typedef float __attribute__((vector_size(32))) v4f32;
> 
> v4f32 f(v4f32 a, v4f32 b)
> {
>   /* Check that we vectorize this CTOR without any loads.  */
>   return (v4f32){a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3],
>   a[4] + b[4], a[5] + b[5], a[6] + b[6], a[7] + b[7]};
> }
> 
> I see we vectorize the add and the "store".  We fail to perform
> extraction from the incoming vectors (unless you enable AVX),
> that's a missed optimization.
> 
> So with paired floats I would expect sth similar?  Maybe
> x86 is saved by kind-of-presence (but disabled) of V8SFmode vectors.

 I am not familiar enough with this stuff to answer your question.

 As we pass and return V2SF data in FP registers with this hardware, just 
as with complex float data, the function from my bb-slp-pr95839-v8.c 
expands to a single vector FP add instruction followed by a function 
return.

 Conversely, the original function from bb-slp-pr95839.c expands to a 
sequence of 22 instructions to extract incoming vector FP data from 4 
64-bit GPRs into 8 FPRs, add the vectors piecemeal with 4 scalar FP add 
instructions, and then insert outgoing vector FP data from 4 FPRs back to 
2 64-bit GPRs.  As an experiment I have modified the backend minimally so 
as to pass and return V4SF data in FP registers as well, but that didn't 
make the vectoriser trigger.
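
 For reference, here is a minimal sketch of the two shapes of function 
compared above, written with the GCC vector extension.  The names add_v2 
and add_v4 and the typedef v2f32 are hypothetical, chosen so that both 
functions fit in one file.  The 64-bit variant is an assumption about what 
a pair-of-floats counterpart looks like, not a copy of the patch:

```c
/* Sketch only: illustrative names, not the testsuite sources.  */
typedef float __attribute__((vector_size(8))) v2f32;   /* 2 x float, 64 bits */
typedef float __attribute__((vector_size(16))) v4f32;  /* 4 x float, 128 bits */

/* 64-bit case: a CTOR built from two element-wise sums.  */
v2f32 add_v2(v2f32 a, v2f32 b)
{
  return (v2f32){a[0] + b[0], a[1] + b[1]};
}

/* 128-bit case: the same pattern with four element-wise sums.  */
v4f32 add_v4(v4f32 a, v4f32 b)
{
  return (v4f32){a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]};
}
```

Both functions compute a plain element-wise sum; the difference discussed 
here is only in whether the vectoriser recognises the pattern at a given 
vector width.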

> That said, we should handle this better so can you file an
> enhancement bugreport for this?

 Filed as PR -optimization/110630.  I can't publish RISC-V information 
related to the hardware affected, but as a quick check I ran the MIPS 
compiler:

$ mips-linux-gnu-gcc -march=mips64 -mabi=64 -mpaired-single -O2 -S bb-slp-pr95839*.c

and got this code for bb-slp-pr95839-v8.c (mind the branch delay slot):

	jr	$31
	add.ps	$f0,$f12,$f13

vs code for bb-slp-pr95839.c:

	daddiu	$sp,$sp,-64
	sd	$5,24($sp)
	sd	$7,40($sp)
	lwc1	$f0,24($sp)
	lwc1	$f1,40($sp)
	sd	$4,16($sp)
	sd	$6,32($sp)
	add.s	$f3,$f0,$f1
	lwc1	$f0,28($sp)
	lwc1	$f1,44($sp)
	lwc1	$f4,36($sp)
	swc1	$f3,56($sp)
	add.s	$f2,$f0,$f1
	lwc1	$f0,16($sp)
	lwc1	$f1,32($sp)
	swc1	$f2,60($sp)
	add.s	$f1,$f0,$f1
	lwc1	$f0,20($sp)
	ld	$3,56($sp)
	add.s	$f0,$f0,$f4
	swc1	$f1,48($sp)
	swc1	$f0,52($sp)
	ld	$2,48($sp)
	jr	$31
	daddiu	$sp,$sp,64

so this is essentially the same scenario (differing only in the machine 
instruction count), and therefore the problem seems backend-agnostic.  I 
can imagine the latter case expanding to something like the following (the 
instruction reordering surely needed for performance is omitted for 
clarity):

	dmtc1	$4,$f0
	dmtc1	$5,$f1
	dmtc1	$6,$f2
	dmtc1	$7,$f3
	add.ps	$f0,$f0,$f2
	add.ps	$f1,$f1,$f3
	dmfc1	$2,$f0
	jr	$31
	dmfc1	$3,$f1

saving a lot of cycles, and removing the need for spilling temporaries to 
the stack and for frame creation in the first place.
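
 The same idea can be sketched at the source level: the 128-bit addition is 
split into two 64-bit halves, each of which a target with pair-of-floats 
vectors could in principle handle with one vector add.  The function name 
add_v4_as_two_v2 and the typedef v2f32 are hypothetical, and it has not 
been verified here what code any particular compiler emits for it:

```c
/* Sketch only: illustrative names, not part of the testsuite.  */
typedef float __attribute__((vector_size(8))) v2f32;   /* 2 x float, 64 bits */
typedef float __attribute__((vector_size(16))) v4f32;  /* 4 x float, 128 bits */

v4f32 add_v4_as_two_v2(v4f32 a, v4f32 b)
{
  /* Split each 128-bit operand into its low and high 64-bit halves.  */
  v2f32 alo = {a[0], a[1]}, ahi = {a[2], a[3]};
  v2f32 blo = {b[0], b[1]}, bhi = {b[2], b[3]};

  /* One 64-bit vector add per half: low with low, high with high.  */
  v2f32 lo = alo + blo;
  v2f32 hi = ahi + bhi;

  /* Reassemble the 128-bit result from the two halves.  */
  return (v4f32){lo[0], lo[1], hi[0], hi[1]};
}
```

The low halves of both operands are added together, and likewise the high 
halves, so the result matches the element-wise sum of the full vectors.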

 Do you agree it still makes sense to include bb-slp-pr95839-v8.c in the 
testsuite?

  Maciej