From: Manolis Tsamis
Date: Fri, 1 Sep 2023 13:44:46 +0300
Subject: Re: [x86_64 PATCH] Improve __int128 argument passing (in ix86_expand_move).
To: Uros Bizjak
Cc: Roger Sayle, gcc-patches@gcc.gnu.org

Hi Roger,

I've (accidentally) found a codegen regression that I bisected down to
this patch.
For these two functions:

typedef struct {
  float minx, miny;
  float maxx, maxy;
} AABB;

int TestOverlap(AABB a, AABB b) {
  return a.minx <= b.maxx
      && a.miny <= b.maxy
      && a.maxx >= b.minx
      && a.maxx >= b.minx;
}

int TestOverlap2(AABB a, AABB b) {
  return a.miny <= b.maxy
      && a.maxx >= b.minx;
}

GCC used to produce this code:

TestOverlap:
        comiss  xmm3, xmm0
        movq    rdx, xmm0
        movq    rsi, xmm1
        movq    rax, xmm3
        jb      .L10
        shr     rdx, 32
        shr     rax, 32
        movd    xmm0, eax
        movd    xmm4, edx
        comiss  xmm0, xmm4
        jb      .L10
        movd    xmm1, esi
        xor     eax, eax
        comiss  xmm1, xmm2
        setnb   al
        ret
.L10:
        xor     eax, eax
        ret
TestOverlap2:
        shufps  xmm0, xmm0, 85
        shufps  xmm3, xmm3, 85
        comiss  xmm3, xmm0
        jb      .L17
        xor     eax, eax
        comiss  xmm1, xmm2
        setnb   al
        ret
.L17:
        xor     eax, eax
        ret

After this patch codegen gets much worse:

TestOverlap:
        movq    rax, xmm1
        movq    rdx, xmm2
        movq    rsi, xmm0
        mov     rdi, rax
        movq    rax, xmm3
        mov     rcx, rsi
        xchg    rdx, rax
        movd    xmm1, edx
        mov     rsi, rax
        mov     rax, rdx
        comiss  xmm1, xmm0
        jb      .L10
        shr     rcx, 32
        shr     rax, 32
        movd    xmm0, eax
        movd    xmm4, ecx
        comiss  xmm0, xmm4
        jb      .L10
        movd    xmm0, esi
        movd    xmm1, edi
        xor     eax, eax
        comiss  xmm1, xmm0
        setnb   al
        ret
.L10:
        xor     eax, eax
        ret
TestOverlap2:
        movq    rdx, xmm2
        movq    rax, xmm3
        movq    rsi, xmm0
        xchg    rdx, rax
        mov     rcx, rsi
        mov     rsi, rax
        mov     rax, rdx
        shr     rcx, 32
        shr     rax, 32
        movd    xmm4, ecx
        movd    xmm0, eax
        comiss  xmm0, xmm4
        jb      .L17
        movd    xmm0, esi
        xor     eax, eax
        comiss  xmm1, xmm0
        setnb   al
        ret
.L17:
        xor     eax, eax
        ret

I saw that you've been improving i386 argument passing, so maybe this
is just a missed case of these additions?

(Can also be seen here https://godbolt.org/z/E4xrEn6KW)

PS: I found the code that clang generates, with cmpleps + pextrw to
avoid the fp->int->fp + shr interesting. I wonder if something like
this could be added to GCC as well.

Thanks!
Manolis

On Thu, Jul 6, 2023 at 5:21 PM Uros Bizjak via Gcc-patches wrote:
>
> On Thu, Jul 6, 2023 at 3:48 PM Roger Sayle wrote:
> >
> > > On Thu, Jul 6, 2023 at 2:04 PM Roger Sayle
> > > wrote:
> > > >
> > > >
> > > > Passing 128-bit integer (TImode) parameters on x86_64 can sometimes
> > > > result in surprising code.  Consider the example below (from PR 43644):
> > > >
> > > > __uint128 foo(__uint128 x, unsigned long long y) {
> > > >   return x+y;
> > > > }
> > > >
> > > > which currently results in 6 consecutive movq instructions:
> > > >
> > > > foo:    movq    %rsi, %rax
> > > >         movq    %rdi, %rsi
> > > >         movq    %rdx, %rcx
> > > >         movq    %rax, %rdi
> > > >         movq    %rsi, %rax
> > > >         movq    %rdi, %rdx
> > > >         addq    %rcx, %rax
> > > >         adcq    $0, %rdx
> > > >         ret
> > > >
> > > > The underlying issue is that during RTL expansion, we generate the
> > > > following initial RTL for the x argument:
> > > >
> > > > (insn 4 3 5 2 (set (reg:TI 85)
> > > >         (subreg:TI (reg:DI 86) 0)) "pr43644-2.c":5:1 -1
> > > >      (nil))
> > > > (insn 5 4 6 2 (set (subreg:DI (reg:TI 85) 8)
> > > >         (reg:DI 87)) "pr43644-2.c":5:1 -1
> > > >      (nil))
> > > > (insn 6 5 7 2 (set (reg/v:TI 84 [ x ])
> > > >         (reg:TI 85)) "pr43644-2.c":5:1 -1
> > > >      (nil))
> > > >
> > > > which by combine/reload becomes
> > > >
> > > > (insn 25 3 22 2 (set (reg/v:TI 84 [ x ])
> > > >         (const_int 0 [0])) "pr43644-2.c":5:1 -1
> > > >      (nil))
> > > > (insn 22 25 23 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 0)
> > > >         (reg:DI 93)) "pr43644-2.c":5:1 90 {*movdi_internal}
> > > >      (expr_list:REG_DEAD (reg:DI 93)
> > > >         (nil)))
> > > > (insn 23 22 28 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 8)
> > > >         (reg:DI 94)) "pr43644-2.c":5:1 90 {*movdi_internal}
> > > >      (expr_list:REG_DEAD (reg:DI 94)
> > > >         (nil)))
> > > >
> > > > where the heavy use of SUBREG SET_DESTs creates challenges for both
> > > > combine and register allocation.
> > > >
> > > > The improvement proposed here is to avoid these problematic SUBREGs by
> > > > adding (two) special cases to ix86_expand_move.  For insn 4, which
> > > > sets a TImode destination from a paradoxical SUBREG, to assign the
> > > > lowpart, we can use an explicit zero extension (zero_extendditi2 was
> > > > added in July 2022), and for insn 5, which sets the highpart of a
> > > > TImode register we can use the *insvti_highpart_1 instruction (that
> > > > was added in May 2023, after being approved for stage1 in January).
> > > > This allows combine to work its magic, merging these insns into a
> > > > *concatditi3 and from there into other optimized forms.
> > >
> > > How about we introduce *insvti_lowpart_1, similar to *insvti_highpart_1, in the
> > > hope that combine is smart enough to also combine these two instructions? IMO,
> > > faking insert to lowpart of the register with zero_extend is a bit overkill, and could
> > > hinder some other optimization opportunities (as perhaps hinted by failing
> > > testcases).
> >
> > The use of ZERO_EXTEND serves two purposes, both the setting of the lowpart
> > and of informing the RTL passes that the highpart is dead.  Notice in the original
> > RTL stream, i.e. current GCC, insn 25 is inserted by the .286r.init-regs pass, clearing
> > the entirety of the TImode register (like a clobber), and preventing TI:84 from
> > occupying the same registers as DI:93 and DI:94.
> >
> > If the middle-end had asked the backend to generate a SET to STRICT_LOWPART
> > then our hands would be tied, but a paradoxical SUBREG allows us the freedom
> > to set the highpart bits to a defined value (we could have used sign extension if
> > that was cheap), which then simplifies data-flow and liveness analysis.  Allowing the
> > highpart to contain undefined or untouched data is exactly the sort of security
> > side-channel leakage that the clear regs pass attempts to address.
> >
> > I can investigate an *insvti_lowpart_1, but I don't think it will help with this
> > issue, i.e. it won't prevent init-regs from clobbering/clearing TImode parameters.
>
> Thanks for the explanation, the patch is OK then.
>
> Thanks,
> Uros.
> > >
> > > > So for the test case above, we now generate only a single movq:
> > > >
> > > > foo:    movq    %rdx, %rax
> > > >         xorl    %edx, %edx
> > > >         addq    %rdi, %rax
> > > >         adcq    %rsi, %rdx
> > > >         ret
> > > >
> > > > But there is a little bad news.  This patch causes two (minor) missed
> > > > optimization regressions on x86_64; gcc.target/i386/pr82580.c and
> > > > gcc.target/i386/pr91681-1.c.  As shown in the test case above, we're
> > > > no longer generating adcq $0, but instead using xorl.  For the other
> > > > FAIL, register allocation now has more freedom and is (arbitrarily)
> > > > choosing a register assignment that doesn't match what the test is
> > > > expecting.  These issues are easier to explain and fix once this patch
> > > > is in the tree.
> > > >
> > > > The good news is that this approach fixes a number of long standing
> > > > issues, that need to checked in bugzilla, including PR target/110533
> > > > which was just opened/reported earlier this week.
> > > >
> > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > > and make -k check, both with and without --target_board=unix{-m32}
> > > > with only the two new FAILs described above.  Ok for mainline?
> > > >
> > > > 2023-07-06  Roger Sayle
> > > >
> > > > gcc/ChangeLog
> > > >         PR target/43644
> > > >         PR target/110533
> > > >         * config/i386/i386-expand.cc (ix86_expand_move): Convert SETs of
> > > >         TImode destinations from paradoxical SUBREGs (setting the lowpart)
> > > >         into explicit zero extensions.  Use *insvti_highpart_1 instruction
> > > >         to set the highpart of a TImode destination.
> > > >
> > > > gcc/testsuite/ChangeLog
> > > >         PR target/43644
> > > >         PR target/110533
> > > >         * gcc.target/i386/pr110533.c: New test case.
> > > >         * gcc.target/i386/pr43644-2.c: Likewise.
> > > >
> > > >
> > > > Thanks in advance,
> > > > Roger
> > > > --
> > > >
> >
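
The two special cases in the quoted patch description (zero-extend to set
the low part, then insert the high part, which combine can then merge into
a concatenation) can be sketched at the C level roughly as below. This only
illustrates the value semantics; it is not the RTL and not the code in
ix86_expand_move. The helper names set_lowpart_zext, insert_highpart,
concat_di_ti and check_concat are hypothetical, and unsigned __int128 is
assumed to be available (GCC or Clang on a 64-bit target).

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the compiler provides unsigned __int128 (GCC/Clang, 64-bit). */
typedef unsigned __int128 u128;

/* Hypothetical helper: set the low 64 bits and give the high 64 bits a
   defined value (zero), like the explicit zero extension for the lowpart. */
static inline u128 set_lowpart_zext(uint64_t lo)
{
  return (u128)lo;
}

/* Hypothetical helper: replace the high 64 bits and keep the low 64 bits,
   like an insert into the highpart of a 128-bit value. */
static inline u128 insert_highpart(u128 x, uint64_t hi)
{
  return (x & (u128)UINT64_MAX) | ((u128)hi << 64);
}

/* The two steps together amount to concatenating two 64-bit halves. */
static inline u128 concat_di_ti(uint64_t lo, uint64_t hi)
{
  return insert_highpart(set_lowpart_zext(lo), hi);
}

/* Same shape as the quoted foo example: 128-bit plus 64-bit addition. */
static inline u128 foo(u128 x, unsigned long long y)
{
  return x + y;
}

/* Small self-check: the halves round-trip, and a carry out of the low
   half shows up in the high half. */
static inline void check_concat(void)
{
  u128 v = concat_di_ti(1, 2);
  assert((uint64_t)v == 1);
  assert((uint64_t)(v >> 64) == 2);
  assert(foo(concat_di_ti(UINT64_MAX, 0), 1) == concat_di_ti(0, 1));
}
```

Calling check_concat() from a main function is enough to exercise the
sketch. Because the zero extension defines all 128 bits up front, no part
of the value is left undefined before the high half is inserted, which is
the property the quoted explanation relies on.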