From: Uros Bizjak
Date: Thu, 13 May 2021 11:40:55 +0200
Subject: Re: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
To: Hongtao Liu
Cc: GCC Patches, Jakub Jelinek, "H. J. Lu"

On Thu, May 13, 2021 at 11:18 AM Hongtao Liu wrote:
>
> Hi:
>   When __builtin_ia32_vzeroupper is called explicitly, the corresponding
> vzeroupper pattern does not carry any CLOBBERs or SETs before LRA,
> which leads to incorrect optimization in pass_reload.
>   In order to solve this problem, this patch introduces a pre_reload
> splitter which adds CLOBBERs to vzeroupper's pattern; it can solve the
> problem in the PR.
>
>   At the same time, in order to optimize the low 128 bits in
> post_reload CSE, this patch also transforms those CLOBBERs to SETs in
> pass_vzeroupper.
>
>   It works fine except for TARGET_64BIT_MS_ABI, under which xmm6-xmm15
> are callee-saved, so even if there are no other uses of xmm6-xmm15 in the
> function, because of vzeroupper's pattern, pro_epilog will save and
> restore those registers, which is obviously redundant. In order to
> eliminate this redundancy, a post_reload splitter is introduced, which
> drops those SETs, until an epilogue_completed splitter adds those SETs
> back; it looks to be safe since there's no CSE between post_reload
> split2 and epilogue_completed split3??? Also frame info needs to be
> updated in pro_epilog, which saves and restores xmm6-xmm15 only if
> there's usage other than the explicit vzeroupper pattern.
>
>   Bootstrapped and regtested on x86_64-linux-gnu{-m32,}
>   Ok for trunk?
Some time ago, support for a CLOBBER_HIGH RTX was added (and later
removed for some reason). Perhaps we could resurrect that patch for the
purpose of ferrying 128-bit modes via the vzeroupper RTX?

+(define_split
+  [(match_parallel 0 "vzeroupper_pattern"
+     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
+  "TARGET_AVX && ix86_pre_reload_split ()"
+  [(match_dup 0)]
+{
+  /* When vzeroupper is explicitly used, for LRA purposes, make it clear
+     that the instruction kills sse registers.  */
+  gcc_assert (cfun->machine->has_explicit_vzeroupper);
+  unsigned int nregs = TARGET_64BIT ? 16 : 8;
+  rtvec vec = rtvec_alloc (nregs + 1);
+  RTVEC_ELT (vec, 0) = gen_rtx_UNSPEC_VOLATILE (VOIDmode,
+						gen_rtvec (1, const1_rtx),
+						UNSPECV_VZEROUPPER);
+  for (unsigned int i = 0; i < nregs; ++i)
+    {
+      unsigned int regno = GET_SSE_REGNO (i);
+      rtx reg = gen_rtx_REG (V2DImode, regno);
+      RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
+    }
+  operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
+})

Wouldn't this also kill lower 128-bit values that are not touched by
vzeroupper? A CLOBBER_HIGH would be more appropriate here.

Uros.

> gcc/ChangeLog:
>
>         PR target/82735
>         * config/i386/i386-expand.c (ix86_expand_builtin): Count
>         number of __builtin_ia32_vzeroupper.
>         * config/i386/i386-features.c (ix86_add_reg_usage_to_vzerouppers):
>         Transform CLOBBERs to SETs for explicit vzeroupper pattern so
>         that CSE can optimize lower 128 bits.
>         * config/i386/i386.c (ix86_handle_explicit_vzeroupper_in_pro_epilog):
>         New.
>         (ix86_save_reg): If there's no use of xmm6~xmm15 other than
>         explicit vzeroupper under TARGET_64BIT_MS_ABI, no need to save
>         REGNO.
>         (ix86_finalize_stack_frame_flags): Recompute frame layout if
>         there's explicit vzeroupper under TARGET_64BIT_MS_ABI.
>         * config/i386/i386.h (struct machine_function): Change type of
>         has_explicit_vzeroupper from BOOL_BITFIELD to unsigned int.
>         * config/i386/sse.md (*avx_vzeroupper_2): New post-reload
>         splitter which will drop all SETs for explicit vzeroupper
>         patterns.
>         (*avx_vzeroupper_1): Generate SET reg to reg instead of
>         CLOBBER, and add pre-reload splitter after it.
>
> gcc/testsuite/ChangeLog:
>
>         PR target/82735
>         * gcc.target/i386/pr82735-1.c: New test.
>         * gcc.target/i386/pr82735-2.c: New test.
>         * gcc.target/i386/pr82735-3.c: New test.
>         * gcc.target/i386/pr82735-4.c: New test.
>         * gcc.target/i386/pr82735-5.c: New test.
>
> --
> BR,
> Hongtao