From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lj1-x232.google.com (mail-lj1-x232.google.com [IPv6:2a00:1450:4864:20::232]) by sourceware.org (Postfix) with ESMTPS id 273923858D38 for ; Mon, 9 Oct 2023 11:33:37 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 273923858D38 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com Received: by mail-lj1-x232.google.com with SMTP id 38308e7fff4ca-2b95d5ee18dso52971601fa.1 for ; Mon, 09 Oct 2023 04:33:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1696851215; x=1697456015; darn=gcc.gnu.org; h=content-transfer-encoding:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=43DDDIQbtM3ZWu3B1q0zBug+GLbg7c7wiIsv5OeOnco=; b=I9K6IWXO3co7qE2WLqxBHLzvlCsT0dyaBVIpM6tdnQRvoZbWzrFcglYqYn/VZESD8F HzuAVEdoZuF5wmm5Rr2BafVct8ns7HrNCLiD9Tgv20wL1YjqTR3JmmcjLbR4jmEPtEwL 9DVw1tf60Ey3IlN2bk0qhh+Km+/DQq9Wqpo2zoUJOESy87fDf2/CYO/eVj9OvYD1k+7r 9AJeNLNJnbm81so1UqbKeWi90wesXRykiql27nPl3HsebfZrdz8wearPjM/fuQZy8ZMf phfJHRZCGvXsbOgFhB6SnkU7/juXd2ScaIuHxCLdgk7ry+di0TmjGnCBAHE99q7JelJP 3VWw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1696851215; x=1697456015; h=content-transfer-encoding:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=43DDDIQbtM3ZWu3B1q0zBug+GLbg7c7wiIsv5OeOnco=; b=wCCSDEHREOXUt8o3XXdOXYbxKpFVjDg6Mxy0FHfhR5+k5IkCqhABPcYAzavv7XPahd skIo6vyPcrh5VkVmuR9V0okLZXyMIrdsgxwDfrfQGl/7GN8KhNHqfnqHm0U96WIx2jx4 TV4ZrIQDQM0KhoQjPFtKN87wovzE+QDFjwIxTCML6MilT+3Il884bn79ncJYwg8O6b4x Ii1Tb9cN+xuI3x34FJqokfiYDMJ/FQxlwHCIwCfSBUnpXh7YDqyAMuHRikT/gFHvSFhO AN0zkY6nxlNYQZQbds5aBvAdmvWkkM7mx9o6393fQK0W7rp4s69E1UDHi8JXdpE+BW3k mQSw== X-Gm-Message-State: AOJu0YxqW1OKP1yPjDeJPgSzYGNSFdUfhBNW8z3YPsTDNsvX3UG7Zbuy oANQE6cQuShC+XE8qC+MvascIduJKo5on8ozTIU= X-Google-Smtp-Source: AGHT+IESC/h/KDDL3ArysWTS+yAER5UKHYr4vH1E4yKl48ayPsDQGP6tqdPAJRlvpuEY29AT1k2CIwRMO3Dcud+fngQ= X-Received: by 2002:ac2:4c4b:0:b0:503:cca:e52f with SMTP id o11-20020ac24c4b000000b005030ccae52fmr16203398lfk.51.1696851215059; Mon, 09 Oct 2023 04:33:35 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Richard Biener Date: Mon, 9 Oct 2023 13:30:47 +0200 Message-ID: Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign. To: Tamar Christina , Richard Biener , "gcc-patches@gcc.gnu.org" , nd , Richard Earnshaw , Marcus Shawcroft , Kyrylo Tkachov , richard.sandiford@arm.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-1.2 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,KAM_MANYTO,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Mon, Oct 9, 2023 at 12:17=E2=80=AFPM Richard Sandiford wrote: > > Tamar Christina writes: > >> -----Original Message----- > >> From: Richard Sandiford > >> Sent: Monday, October 9, 2023 10:56 AM > >> To: Tamar Christina > >> Cc: Richard Biener ; gcc-patches@gcc.gnu.o= rg; > >> nd ; Richard Earnshaw ; > >> Marcus Shawcroft ; Kyrylo Tkachov > >> > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign. > >> > >> Tamar Christina writes: > >> >> -----Original Message----- > >> >> From: Richard Sandiford > >> >> Sent: Saturday, October 7, 2023 10:58 AM > >> >> To: Richard Biener > >> >> Cc: Tamar Christina ; > >> >> gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw > >> >> ; Marcus Shawcroft > >> >> ; Kyrylo Tkachov > >> > >> >> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysig= n. > >> >> > >> >> Richard Biener writes: > >> >> > On Thu, Oct 5, 2023 at 10:46=E2=80=AFPM Tamar Christina > >> >> wrote: > >> >> >> > >> >> >> > -----Original Message----- > >> >> >> > From: Richard Sandiford > >> >> >> > Sent: Thursday, October 5, 2023 9:26 PM > >> >> >> > To: Tamar Christina > >> >> >> > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw > >> >> >> > ; Marcus Shawcroft > >> >> >> > ; Kyrylo Tkachov > >> >> > >> >> >> > Subject: Re: [PATCH]AArch64 Add SVE implementation for > >> >> cond_copysign. > >> >> >> > > >> >> >> > Tamar Christina writes: > >> >> >> > >> -----Original Message----- > >> >> >> > >> From: Richard Sandiford > >> >> >> > >> Sent: Thursday, October 5, 2023 8:29 PM > >> >> >> > >> To: Tamar Christina > >> >> >> > >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard > >> >> >> > >> Earnshaw ; Marcus Shawcroft > >> >> >> > >> ; Kyrylo Tkachov > >> >> >> > > >> >> >> > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for > >> >> cond_copysign. > >> >> >> > >> > >> >> >> > >> Tamar Christina writes: > >> >> >> > >> > Hi All, > >> >> >> > >> > > >> >> >> > >> > This adds an implementation for masked copysign along wit= h > >> >> >> > >> > an optimized pattern for masked copysign (x, -1). > >> >> >> > >> > >> >> >> > >> It feels like we're ending up with a lot of AArch64-specifi= c > >> >> >> > >> code that just hard- codes the observation that changing th= e > >> >> >> > >> sign is equivalent to changing the top bit. We then need t= o > >> >> >> > >> make sure that we choose the best way of changing the top b= it > >> >> >> > >> for any > >> >> given situation. > >> >> >> > >> > >> >> >> > >> Hard-coding the -1/negative case is one instance of that. > >> >> >> > >> But it looks like we also fail to use the best sequence for= SVE2. E.g. > >> >> >> > >> [https://godbolt.org/z/ajh3MM5jv]: > >> >> >> > >> > >> >> >> > >> #include > >> >> >> > >> > >> >> >> > >> void f(double *restrict a, double *restrict b) { > >> >> >> > >> for (int i =3D 0; i < 100; ++i) > >> >> >> > >> a[i] =3D __builtin_copysign(a[i], b[i]); } > >> >> >> > >> > >> >> >> > >> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t= c) { > >> >> >> > >> for (int i =3D 0; i < 100; ++i) > >> >> >> > >> a[i] =3D (a[i] & ~c) | (b[i] & c); } > >> >> >> > >> > >> >> >> > >> gives: > >> >> >> > >> > >> >> >> > >> f: > >> >> >> > >> mov x2, 0 > >> >> >> > >> mov w3, 100 > >> >> >> > >> whilelo p7.d, wzr, w3 > >> >> >> > >> .L2: > >> >> >> > >> ld1d z30.d, p7/z, [x0, x2, lsl 3] > >> >> >> > >> ld1d z31.d, p7/z, [x1, x2, lsl 3] > >> >> >> > >> and z30.d, z30.d, #0x7fffffffffffffff > >> >> >> > >> and z31.d, z31.d, #0x8000000000000000 > >> >> >> > >> orr z31.d, z31.d, z30.d > >> >> >> > >> st1d z31.d, p7, [x0, x2, lsl 3] > >> >> >> > >> incd x2 > >> >> >> > >> whilelo p7.d, w2, w3 > >> >> >> > >> b.any .L2 > >> >> >> > >> ret > >> >> >> > >> g: > >> >> >> > >> mov x3, 0 > >> >> >> > >> mov w4, 100 > >> >> >> > >> mov z29.d, x2 > >> >> >> > >> whilelo p7.d, wzr, w4 > >> >> >> > >> .L6: > >> >> >> > >> ld1d z30.d, p7/z, [x0, x3, lsl 3] > >> >> >> > >> ld1d z31.d, p7/z, [x1, x3, lsl 3] > >> >> >> > >> bsl z31.d, z31.d, z30.d, z29.d > >> >> >> > >> st1d z31.d, p7, [x0, x3, lsl 3] > >> >> >> > >> incd x3 > >> >> >> > >> whilelo p7.d, w3, w4 > >> >> >> > >> b.any .L6 > >> >> >> > >> ret > >> >> >> > >> > >> >> >> > >> I saw that you originally tried to do this in match.pd and > >> >> >> > >> that the decision was to fold to copysign instead. But > >> >> >> > >> perhaps there's a compromise where isel does something with > >> >> >> > >> the (new) copysign canonical > >> >> >> > form? > >> >> >> > >> I.e. could we go with your new version of the match.pd patc= h, > >> >> >> > >> and add some isel stuff as a follow-on? > >> > >> [A] > >> > >> >> >> > >> > >> >> >> > > > >> >> >> > > Sure if that's what's desired.... But.. > >> >> >> > > > >> >> >> > > The example you posted above is for instance worse for x86 > >> >> >> > > https://godbolt.org/z/x9ccqxW6T where the first operation ha= s > >> >> >> > > a dependency chain of 2 and the latter of 3. It's likely an= y > >> >> >> > > open coding of this > >> >> >> > operation is going to hurt a target. > >> >> >> > > > >> >> >> > > So I'm unsure what isel transform this into... > >> >> >> > > >> >> >> > I didn't mean that we should go straight to using isel for the > >> >> >> > general case, just for the new case. The example above was > >> >> >> > instead trying to show the general point that hiding the logic > >> >> >> > ops in target code is > >> >> a double-edged sword. > >> >> >> > >> >> >> I see.. but the problem here is that transforming copysign (x, -= 1) > >> >> >> into (x | 0x8000000) would require an integer operation on an FP > >> >> >> value. I'm happy to do it but it seems like it'll be an AArch64 > >> >> >> only thing > >> >> anyway. > >> >> >> > >> >> >> If we want to do this we need to check can_change_mode_class or = a > >> hook. > >> >> >> Most targets including x86 reject the conversion. So it'll just > >> >> >> be effectively an AArch64 thing. > >> >> >> > >> >> >> You're right that the actual equivalent transformation is this > >> >> >> https://godbolt.org/z/KesfrMv5z But the target won't allow it. > >> >> >> > >> >> >> > > >> >> >> > The x86_64 example for the -1 case would be > >> >> >> > https://godbolt.org/z/b9s6MaKs8 where the isel change would be > >> >> >> > an improvement. Without that, I guess > >> >> >> > x86_64 will need to have a similar patch to the AArch64 one. > >> >> >> > > >> >> >> > >> >> >> I think that's to be expected. I think it's logical that every > >> >> >> target just needs to implement their optabs optimally. > >> >> >> > >> >> >> > That said, https://godbolt.org/z/e6nqoqbMh suggests that > >> >> >> > powerpc64 is probably relying on the current copysign -> neg/a= bs > >> transform. > >> >> >> > (Not sure why the second function uses different IVs from the > >> >> >> > first.) > >> >> >> > > >> >> >> > Personally, I wouldn't be against a target hook that indicated > >> >> >> > whether float bit manipulation is "free" for a given mode, if = it > >> >> >> > comes to > >> >> that. > >> >> >> > >> >> >> I'm really sure Richi would agree there. Generally speaking I > >> >> >> don't think people see doing FP operations on Int as beneficial. > >> >> >> But it's always > >> >> the case on AArch64. > >> >> > > >> >> > IIRC we're doing fpclassify "expansion" early for example. > >> >> > > >> >> > Note the issue I see is that the middle-end shouldn't get in the > >> >> > way of targets that have a copysign optab. In case it's worthwhi= le > >> >> > to special-case a "setsign" thing then maybe provide an optab for > >> >> > that as well. Now, that doesn't help if we want setsign to be > >> >> > expanded from the middle-end but still wan the copysign optab (an= d > >> >> > not require targets to implement both if they want to escape > >> >> > middle-end generic > >> >> expansion of setsign). > >> >> > > >> >> > But yes, I also thought the , 1 and , -1 cases could be special > >> >> > cased by RTL expansion (or ISEL). But it would need to invoke > >> >> > costing which likely means target specific changes anyway... :/ > >> >> > >> >> FWIW, if we had the target hook I suggested, I don't think AArch64 > >> >> would need to implement copysign or xorsign optabs. That's probabl= y > >> >> true of > >> >> AArch32 too (at least for all combinations that are likely to matte= r these > >> days). > >> >> > >> >> I'd go one step further and say that, if a target wants to do its o= wn > >> >> thing for copysign and xorsign, it clearly doesn't meet the > >> >> requirement that bit manipulation of floats is "free" for that mode= . > >> >> > >> > > >> > So I'm aware I have no say here, but I'll make one last effort. > >> > > >> > The patch isn't just implementing the fneg (fabs ()) optimization ju= st > >> because. > >> > The location where it's implemented makes a big difference. > >> > > >> > If this optimization is done late, it doesn't fix the regression > >> > fully, because doing It after all optimization passes have run means= it can't > >> properly be optimized. > >> > > >> > The point of doing the rewrite early to ORR accomplished two things: > >> > > >> > 1. It makes PRE realize that the block it's splitting would only hav= e 1 > >> instruction in it > >> > and that such a split is not beneficial. This is why I'm agains= t doing such > >> optimizations > >> > so later. Optimizations don=E2=80=99t' happen in isolation, and = when they make > >> sense should > >> > happen early. Transforming fneg (fabs (..)) in isel not only fe= els wrong but > >> results in a 4% > >> > performance loss. > >> > > >> > 2. Doing it early also gets the ORRs to be reassociated changing whe= re the > >> loop dependent > >> > variable lands. Early makes it land in the merging MOVPRFX rat= her than > >> requiring a SEL > >> > at the end of the iteration. > >> > > >> > Replacing the fneg (fabs (..)) with copysign accomplishes 1 but no 2= . > >> > This results in a 2-3% loss, but I can live with that given doing 1 = gets us back > >> to GCC 12 levels. > >> > > >> > Doing fneg (fabs (..)) in isel would have no meaning for me and not > >> > fix the regression. I won't be looking to do that in that case. > >> > > >> > If it's acceptable I can transform COPYSIGN (X, -1) in gimple-isel. > >> > >> FWIW, this is what I meant by [A] above. I.e. go with your latest tar= get- > >> independent change, and make isel replace COPYSIGN (X, -1) with logic > >> operations, under control of a new target hook that says that logic op= erations > >> on float values are as cheap as logic operations on integer values (fo= r a given > >> float mode). In future, hopefully all COPYSIGNs and XORSIGNs would be > >> handled the same way, under control of the same hook, but that would b= e a > >> separate future change. > > > > Ok, would you prefer isel or in the expander that Richi suggested? > > I assume isel is better if the intention is to later remove the expansi= on code? > > Yeah, isel sounds better to me FWIW (for that reason), but Richi would > know better. ISEL is supposed to be "RTL expansion on GIMPLE" and since we have copysign builtin expansion with several variants in the optab code and also direct IFN expansion I think it wouldn't be good to add another place pre-empting those. > At least with isel, there's the theoretical possibility of doing > simple optimisation before expand, or taking surrounding stmts > into account, if that ever becomes necessary for future changes. Yeah, but I doubt there's anything for copysign (x, -1) here. Unfortunately IFN_COPYSIGN expansion isn't routed through expand_copysign but goes through expand_fn_using_insn(?). I suppose we can go back to DEF_INTERNAL_OPTAB_FN for it and implement -1 handing in expand_COPYSIGN? There's also the case being made for targets not implementing a vector copysign expander to handle this in pattern recognition. Richard. > > Thanks, > Richard