From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 11 Jul 2023 16:01:24 +0100 (BST)
From: "Maciej W. Rozycki"
To: Richard Biener
cc: Rainer Orth, Mike Stump, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 2/3] testsuite: Require 128-bit vectors for bb-slp-pr95839.c

On Fri, 7 Jul 2023, Richard Biener wrote:

> > The bb-slp-pr95839.c test assumes quad-single float vector support, but
> > some targets only support pairs of floats, causing this test to fail
> > with such targets.  Limit this test to targets that support at least
> > 128-bit vectors then, and add a complementing test that can be run with
> > targets that have support for 64-bit vectors only.  There is no need to
> > adjust bb-slp-pr95839-2.c as 128 bits are needed even for the smallest
> > vector of doubles, so support is implied by the presence of vectors of
> > doubles.
>
> I wonder why you see the testcase FAIL, on x86-64 when doing
>
> typedef float __attribute__((vector_size(32))) v4f32;
>
> v4f32 f(v4f32 a, v4f32 b)
> {
>   /* Check that we vectorize this CTOR without any loads.  */
>   return (v4f32){a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3],
>                  a[4] + b[4], a[5] + b[5], a[6] + b[6], a[7] + b[7]};
> }
>
> I see we vectorize the add and the "store".  We fail to perform
> extraction from the incoming vectors (unless you enable AVX),
> that's a missed optimization.
>
> So with paired floats I would expect sth similar?  Maybe
> x86 is saved by kind-of-presence (but disabled) of V8SFmode vectors.

 I am not familiar enough with this stuff to answer your question.

 As we pass and return V2SF data in FP registers just as with complex
float data with this hardware the function from my bb-slp-pr95839-v8.c
expands to a single vector FP add instruction, followed by a function
return.

 Conversely, the original function from bb-slp-pr95839.c expands to a
sequence of 22 instructions to extract incoming vector FP data from 4
64-bit GPRs into 8 FPRs, add the vectors piecemeal with 4 scalar FP add
instructions, and then insert outgoing vector FP data from 4 FPRs back to
2 64-bit GPRs.  As an experiment I have modified the backend minimally so
as to pass and return V4SF data in FP registers as well, but that didn't
make the vectoriser trigger.

> That said, we should handle this better so can you file an
> enhancement bugreport for this?

 Filed as PR -optimization/110630.
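
 To make the comparison concrete, here is a minimal sketch of the two
function shapes under discussion, written with GCC's generic vector
extensions.  The type and function names (v2f32, v4f32, add_v2, add_v4)
are illustrative assumptions and are not copied from the testsuite files;
the dg- directives are left out as well.

```c
/* Sketch only: names are hypothetical, not taken from the testsuite.  */
typedef float v2f32 __attribute__ ((vector_size (8)));   /* pair of floats */
typedef float v4f32 __attribute__ ((vector_size (16)));  /* quad of floats */

/* 64-bit case: one constructor built from two element-wise sums.  */
v2f32 add_v2 (v2f32 a, v2f32 b)
{
  return (v2f32) { a[0] + b[0], a[1] + b[1] };
}

/* 128-bit case: the same pattern with four element-wise sums.  */
v4f32 add_v4 (v4f32 a, v4f32 b)
{
  return (v4f32) { a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3] };
}
```

 With paired-single hardware the first shape is the one that collapses to
a single vector add, while the second is the one that currently gets
expanded piecemeal.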

 I can't publish RISC-V information related to the hardware affected, but
as a quick check I ran the MIPS compiler:

$ mips-linux-gnu-gcc -march=mips64 -mabi=64 -mpaired-single -O2 -S bb-slp-pr95839*.c

and got this code for bb-slp-pr95839-v8.c (mind the branch delay slot):

	jr	$31
	add.ps	$f0,$f12,$f13

vs code for bb-slp-pr95839.c:

	daddiu	$sp,$sp,-64
	sd	$5,24($sp)
	sd	$7,40($sp)
	lwc1	$f0,24($sp)
	lwc1	$f1,40($sp)
	sd	$4,16($sp)
	sd	$6,32($sp)
	add.s	$f3,$f0,$f1
	lwc1	$f0,28($sp)
	lwc1	$f1,44($sp)
	lwc1	$f4,36($sp)
	swc1	$f3,56($sp)
	add.s	$f2,$f0,$f1
	lwc1	$f0,16($sp)
	lwc1	$f1,32($sp)
	swc1	$f2,60($sp)
	add.s	$f1,$f0,$f1
	lwc1	$f0,20($sp)
	ld	$3,56($sp)
	add.s	$f0,$f0,$f4
	swc1	$f1,48($sp)
	swc1	$f0,52($sp)
	ld	$2,48($sp)
	jr	$31
	daddiu	$sp,$sp,64

so this is essentially the same scenario (up to the machine instruction
count), and therefore it seems backend-agnostic.

 I can imagine the latter case could expand to something like (instruction
reordering surely needed for performance omitted for clarity):

	dmtc1	$4,$f0
	dmtc1	$5,$f1
	dmtc1	$6,$f2
	dmtc1	$7,$f3
	add.ps	$f0,$f0,$f2
	add.ps	$f1,$f1,$f3
	dmfc1	$2,$f0
	jr	$31
	dmfc1	$3,$f1

(the first operand arrives in $4/$5 and the second in $6/$7, as the
scalar code above shows, so the halves are paired up accordingly), saving
a lot of cycles, and removing the need for spilling temporaries to the
stack and for frame creation in the first place.

 Do you agree it still makes sense to include bb-slp-pr95839-v8.c with the
testsuite?

  Maciej
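
 P.S.  A rough C-level model of the hoped-for expansion described above,
assuming nothing beyond GCC's generic vector extensions: each V4SF
operand is split into two V2SF halves, matching halves are added with one
paired operation each, and the two results are put back together.  The
names (add_v4_by_halves, v2f32, v4f32) are hypothetical, and this only
illustrates the data flow, not what the vectoriser would actually emit.

```c
#include <string.h>

/* Sketch only: names are hypothetical, not taken from the testsuite.  */
typedef float v2f32 __attribute__ ((vector_size (8)));   /* pair of floats */
typedef float v4f32 __attribute__ ((vector_size (16)));  /* quad of floats */

v4f32 add_v4_by_halves (v4f32 a, v4f32 b)
{
  v2f32 alo, ahi, blo, bhi, lo, hi;
  v4f32 r;

  /* Split each 128-bit operand into its low and high 64-bit halves.  */
  memcpy (&alo, (const char *) &a, sizeof alo);
  memcpy (&ahi, (const char *) &a + sizeof alo, sizeof ahi);
  memcpy (&blo, (const char *) &b, sizeof blo);
  memcpy (&bhi, (const char *) &b + sizeof blo, sizeof bhi);

  /* Two paired-float additions, one per half.  */
  lo = alo + blo;
  hi = ahi + bhi;

  /* Reassemble the 128-bit result from the two halves.  */
  memcpy ((char *) &r, &lo, sizeof lo);
  memcpy ((char *) &r + sizeof lo, &hi, sizeof hi);
  return r;
}
```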