From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <alpine.DEB.2.20.2307062118070.28892@tpp.orcam.me.uk>
 <alpine.DEB.2.20.2307062153310.28892@tpp.orcam.me.uk> <CAFiYyc23ujqdOfJ=R5uQL1YycJvxgdiyMuYY=-+J2ppSu7JDvA@mail.gmail.com>
 <alpine.DEB.2.20.2307111423190.28892@tpp.orcam.me.uk>
In-Reply-To: <alpine.DEB.2.20.2307111423190.28892@tpp.orcam.me.uk>
From: Richard Biener <richard.guenther@gmail.com>
Date: Wed, 12 Jul 2023 09:15:01 +0200
Message-ID: <CAFiYyc0JBDFQBxS_A+80pcQ1CWUTYacwp+RkG3Qys3gC84c8Dw@mail.gmail.com>
Subject: Re: [PATCH 2/3] testsuite: Require 128-bit vectors for bb-slp-pr95839.c
To: "Maciej W. Rozycki" <macro@embecosm.com>
Cc: Rainer Orth <ro@cebitec.uni-bielefeld.de>, Mike Stump <mikestump@comcast.net>, 
	gcc-patches@gcc.gnu.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
List-Id: <gcc-patches.gcc.gnu.org>

On Tue, Jul 11, 2023 at 5:01 PM Maciej W. Rozycki <macro@embecosm.com> wrote:
>
> On Fri, 7 Jul 2023, Richard Biener wrote:
>
> > > The bb-slp-pr95839.c test assumes quad-single float vector support, but
> > > some targets only support pairs of floats, causing this test to fail
> > > with such targets.  Limit this test to targets that support at least
> > > 128-bit vectors then, and add a complementing test that can be run with
> > > targets that have support for 64-bit vectors only.  There is no need to
> > > adjust bb-slp-pr95839-2.c as 128 bits are needed even for the smallest
> > > vector of doubles, so support is implied by the presence of vectors of
> > > doubles.
> >
> > I wonder why you see the testcase FAIL, on x86-64 when doing
> >
> > typedef float __attribute__((vector_size(32))) v4f32;
> >
> > v4f32 f(v4f32 a, v4f32 b)
> > {
> >   /* Check that we vectorize this CTOR without any loads.  */
> >   return (v4f32){a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3],
> >   a[4] + b[4], a[5] + b[5], a[6] + b[6], a[7] + b[7]};
> > }
> >
> > I see we vectorize the add and the "store".  We fail to perform
> > extraction from the incoming vectors (unless you enable AVX),
> > that's a missed optimization.
> >
> > So with paired floats I would expect sth similar?  Maybe
> > x86 is saved by kind-of-presence (but disabled) of V8SFmode vectors.
>
>  I am not familiar enough with this stuff to answer your question.
>
>  As we pass and return V2SF data in FP registers just as with complex
> float data with this hardware the function from my bb-slp-pr95839-v8.c
> expands to a single vector FP add instruction, followed by a function
> return.
>
>  Conversely, the original function from bb-slp-pr95839.c expands to a
> sequence of 22 instructions to extract incoming vector FP data from 4
> 64-bit GPRs into 8 FPRs, add the vectors piecemeal with 4 scalar FP add
> instructions, and then insert outgoing vector FP data from 4 FPRs back to
> 2 64-bit GPRs.  As an experiment I have modified the backend minimally so
> as to pass and return V4SF data in FP registers as well, but that didn't
> make the vectoriser trigger.
>
> > That said, we should handle this better so can you file an
> > enhancement bugreport for this?
>
>  Filed as PR -optimization/110630.

Thanks!
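
For the record, the shape of the transformation being asked for can be
written out by hand at the source level.  This is only a rough sketch
under GNU C vector extensions, not what the vectoriser does or will do;
the names v2f32, add_by_halves and add_by_halves_matches are made up
for illustration.  It adds two 128-bit vectors using nothing wider than
64-bit vector adds:

```c
#include <string.h>

typedef float __attribute__((vector_size(16))) v4f32;  /* 4 floats, 128 bits */
typedef float __attribute__((vector_size(8)))  v2f32;  /* 2 floats, 64 bits  */

/* Hypothetical helper: add two 128-bit vectors with two 64-bit adds.  */
v4f32 add_by_halves(v4f32 a, v4f32 b)
{
  v2f32 a_half[2], b_half[2], r_half[2];
  v4f32 r;

  /* Split each 128-bit operand into its low and high 64-bit halves.  */
  memcpy(a_half, &a, sizeof a);
  memcpy(b_half, &b, sizeof b);

  /* One paired-float add per half.  */
  r_half[0] = a_half[0] + b_half[0];
  r_half[1] = a_half[1] + b_half[1];

  /* Concatenate the two halves back into a 128-bit result.  */
  memcpy(&r, r_half, sizeof r);
  return r;
}

/* Hypothetical check: returns 1 if the split add agrees lane by lane
   with a plain 128-bit vector add, 0 otherwise.  */
int add_by_halves_matches(v4f32 a, v4f32 b)
{
  v4f32 split = add_by_halves(a, b);
  v4f32 whole = a + b;

  for (int i = 0; i < 4; i++)
    if (split[i] != whole[i])
      return 0;
  return 1;
}
```

Whether a given target turns this into two paired-float adds depends on
the backend and the ABI; the sketch only shows the data flow.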

>  I can't publish RISC-V information
> related to the hardware affected, but as a quick check I ran the MIPS
> compiler:
>
> $ mips-linux-gnu-gcc -march=mips64 -mabi=64 -mpaired-single -O2 -S bb-slp-pr95839*.c
>
> and got this code for bb-slp-pr95839-v8.c (mind the branch delay slot):
>
>         jr      $31
>         add.ps  $f0,$f12,$f13
>
> vs code for bb-slp-pr95839.c:
>
>         daddiu  $sp,$sp,-64
>         sd      $5,24($sp)
>         sd      $7,40($sp)
>         lwc1    $f0,24($sp)
>         lwc1    $f1,40($sp)
>         sd      $4,16($sp)
>         sd      $6,32($sp)
>         add.s   $f3,$f0,$f1
>         lwc1    $f0,28($sp)
>         lwc1    $f1,44($sp)
>         lwc1    $f4,36($sp)
>         swc1    $f3,56($sp)
>         add.s   $f2,$f0,$f1
>         lwc1    $f0,16($sp)
>         lwc1    $f1,32($sp)
>         swc1    $f2,60($sp)
>         add.s   $f1,$f0,$f1
>         lwc1    $f0,20($sp)
>         ld      $3,56($sp)
>         add.s   $f0,$f0,$f4
>         swc1    $f1,48($sp)
>         swc1    $f0,52($sp)
>         ld      $2,48($sp)
>         jr      $31
>         daddiu  $sp,$sp,64
>
> so this is essentially the same scenario (up to the machine instruction
> count), and therefore it seems backend-agnostic.  I can imagine the latter
> case could expand to something like (instruction reordering surely needed
> for performance omitted for clarity):
>
>         dmtc1   $4,$f0
>         dmtc1   $5,$f1
>         dmtc1   $6,$f2
>         dmtc1   $7,$f3
>         add.ps  $f0,$f0,$f1
>         add.ps  $f2,$f2,$f3
>         dmfc1   $2,$f0
>         jr      $31
>         dmfc1   $3,$f2
>
> saving a lot of cycles, and removing the need for spilling temporaries to
> the stack and for frame creation in the first place.
>
>  Do you agree it still makes sense to include bb-slp-pr95839-v8.c with the
> testsuite?

Sure, more coverage is always nice.
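
For readers without the patch at hand, a 64-bit variant of the function
would presumably look something like the sketch below.  This is an
assumption modelled on the function quoted above, not a copy of
bb-slp-pr95839-v8.c; the typedef name v2f32 is made up and the DejaGnu
directives are left out:

```c
/* Two floats in a 64-bit vector (GNU C vector extension).  */
typedef float __attribute__((vector_size(8))) v2f32;

v2f32 f(v2f32 a, v2f32 b)
{
  /* Same CTOR pattern as the 128-bit test, with two lanes only.  */
  return (v2f32){a[0] + b[0], a[1] + b[1]};
}
```

On a paired-single target this is the case that should come out as a
single vector FP add.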

Richard.

>   Maciej