From: Richard Biener
Date: Wed, 12 Jul 2023 09:15:01 +0200
Subject: Re: [PATCH 2/3] testsuite: Require 128-bit vectors for bb-slp-pr95839.c
To: "Maciej W. Rozycki"
Cc: Rainer Orth, Mike Stump, gcc-patches@gcc.gnu.org

On Tue, Jul 11, 2023 at 5:01 PM Maciej W. Rozycki wrote:
>
> On Fri, 7 Jul 2023, Richard Biener wrote:
>
> > > The bb-slp-pr95839.c test assumes quad-single float vector support, but
> > > some targets only support pairs of floats, causing this test to fail
> > > with such targets. Limit this test to targets that support at least
> > > 128-bit vectors then, and add a complementing test that can be run with
> > > targets that have support for 64-bit vectors only. There is no need to
> > > adjust bb-slp-pr95839-2.c as 128 bits are needed even for the smallest
> > > vector of doubles, so support is implied by the presence of vectors of
> > > doubles.
> >
> > I wonder why you see the testcase FAIL, on x86-64 when doing
> >
> > typedef float __attribute__((vector_size(32))) v4f32;
> >
> > v4f32 f(v4f32 a, v4f32 b)
> > {
> >   /* Check that we vectorize this CTOR without any loads.
> >      */
> >   return (v4f32){a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3],
> >                  a[4] + b[4], a[5] + b[5], a[6] + b[6], a[7] + b[7]};
> > }
> >
> > I see we vectorize the add and the "store". We fail to perform
> > extraction from the incoming vectors (unless you enable AVX),
> > that's a missed optimization.
> >
> > So with paired floats I would expect sth similar? Maybe
> > x86 is saved by kind-of-presence (but disabled) of V8SFmode vectors.
>
> I am not familiar enough with this stuff to answer your question.
>
> As we pass and return V2SF data in FP registers just as with complex
> float data with this hardware the function from my bb-slp-pr95839-v8.c
> expands to a single vector FP add instruction, followed by a function
> return.
>
> Conversely, the original function from bb-slp-pr95839.c expands to a
> sequence of 22 instructions to extract incoming vector FP data from 4
> 64-bit GPRs into 8 FPRs, add the vectors piecemeal with 4 scalar FP add
> instructions, and then insert outgoing vector FP data from 4 FPRs back to
> 2 64-bit GPRs. As an experiment I have modified the backend minimally so
> as to pass and return V4SF data in FP registers as well, but that didn't
> make the vectoriser trigger.
>
> > That said, we should handle this better so can you file an
> > enhancement bugreport for this?
>
> Filed as PR -optimization/110630.

Thanks!

> I can't publish RISC-V information
> related to the hardware affected, but as a quick check I ran the MIPS
> compiler:
>
> $ mips-linux-gnu-gcc -march=mips64 -mabi=64 -mpaired-single -O2 -S bb-slp-pr95839*.c
>
> and got this code for bb-slp-pr95839-v8.c (mind the branch delay slot):
>
>         jr      $31
>         add.ps  $f0,$f12,$f13
>
> vs code for bb-slp-pr95839.c:
>
>         daddiu  $sp,$sp,-64
>         sd      $5,24($sp)
>         sd      $7,40($sp)
>         lwc1    $f0,24($sp)
>         lwc1    $f1,40($sp)
>         sd      $4,16($sp)
>         sd      $6,32($sp)
>         add.s   $f3,$f0,$f1
>         lwc1    $f0,28($sp)
>         lwc1    $f1,44($sp)
>         lwc1    $f4,36($sp)
>         swc1    $f3,56($sp)
>         add.s   $f2,$f0,$f1
>         lwc1    $f0,16($sp)
>         lwc1    $f1,32($sp)
>         swc1    $f2,60($sp)
>         add.s   $f1,$f0,$f1
>         lwc1    $f0,20($sp)
>         ld      $3,56($sp)
>         add.s   $f0,$f0,$f4
>         swc1    $f1,48($sp)
>         swc1    $f0,52($sp)
>         ld      $2,48($sp)
>         jr      $31
>         daddiu  $sp,$sp,64
>
> so this is essentially the same scenario (up to the machine instruction
> count), and therefore it seems backend-agnostic. I can imagine the latter
> case could expand to something like (instruction reordering surely needed
> for performance omitted for clarity):
>
>         dmtc1   $4,$f0
>         dmtc1   $5,$f1
>         dmtc1   $6,$f2
>         dmtc1   $7,$f3
>         add.ps  $f0,$f0,$f1
>         add.ps  $f2,$f2,$f3
>         dmfc1   $2,$f0
>         jr      $31
>         dmfc1   $3,$f2
>
> saving a lot of cycles, and removing the need for spilling temporaries to
> the stack and for frame creation in the first place.
>
> Do you agree it still makes sense to include bb-slp-pr95839-v8.c with the
> testsuite?

Sure, more coverage is always nice.

Richard.

> Maciej
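
For readers without the testsuite at hand, here is a minimal sketch of the two function shapes discussed in this thread: a 128-bit one over four floats and a 64-bit one over a pair of floats. The type and function names (v2f32, add_v2, add_v4, self_check) are illustrative assumptions and are not copied from bb-slp-pr95839.c or bb-slp-pr95839-v8.c. The sketch relies on the GCC vector_size extension and only checks the computed values, not the generated code.

```c
/* Sketch only: names are hypothetical, not taken from the testsuite.  */
#include <assert.h>

typedef float __attribute__((vector_size(8)))  v2f32;  /* 64-bit: pair of floats */
typedef float __attribute__((vector_size(16))) v4f32;  /* 128-bit: four floats */

/* 64-bit variant: element-wise add rebuilt through a constructor.  */
v2f32 add_v2(v2f32 a, v2f32 b)
{
  return (v2f32){a[0] + b[0], a[1] + b[1]};
}

/* 128-bit variant of the same pattern.  */
v4f32 add_v4(v4f32 a, v4f32 b)
{
  return (v4f32){a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]};
}

/* Asserts that both variants produce the expected sums; returns 1.  */
int self_check(void)
{
  v2f32 r2 = add_v2((v2f32){1.0f, 2.0f}, (v2f32){10.0f, 20.0f});
  v4f32 r4 = add_v4((v4f32){1.0f, 2.0f, 3.0f, 4.0f},
                    (v4f32){10.0f, 20.0f, 30.0f, 40.0f});

  assert(r2[0] == 11.0f && r2[1] == 22.0f);
  assert(r4[0] == 11.0f && r4[1] == 22.0f);
  assert(r4[2] == 33.0f && r4[3] == 44.0f);
  return 1;
}
```

Compiling this with -O2 -S for the target of interest and comparing the bodies of add_v2 and add_v4 should show whether the constructor is vectorized or expanded piecemeal. The outcome will depend on the target and on how its ABI passes vector arguments.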