From: Richard Biener
Date: Thu, 2 Jun 2022 12:04:01 +0200
Subject: Re: [x86 PATCH] Add peephole2 to reduce double word register shuffling.
To: Uros Bizjak
Cc: Roger Sayle, GCC Patches

On Thu, Jun 2, 2022 at 11:48 AM Uros Bizjak via Gcc-patches wrote:
>
> On Thu, Jun 2, 2022 at 9:20 AM Roger Sayle wrote:
> >
> > The simple test case below demonstrates an interesting register
> > allocation challenge facing x86_64, imposed by ABI requirements
> > on __int128.
> >
> > __int128 foo(__int128 x, __int128 y)
> > {
> >   return x+y;
> > }
> >
> > For which GCC currently generates the unusual sequence:
> >
> >         movq    %rsi, %rax
> >         movq    %rdi, %r8
> >         movq    %rax, %rdi
> >         movq    %rdx, %rax
> >         movq    %rcx, %rdx
> >         addq    %r8, %rax
> >         adcq    %rdi, %rdx
> >         ret
> >
> > The challenge is that the x86_64 ABI requires passing the first __int128,
> > x, in %rsi:%rdi (highpart in %rsi, lowpart in %rdi), whereas internally
> > GCC prefers TImode (double word) integers to be register allocated as
> > %rdi:%rsi (highpart in %rdi, lowpart in %rsi).  So after reload we have
> > four mov instructions: two to move the double word to temporary registers
> > and then two to move them back.
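
[Aside, for illustration only (not part of the patch): a minimal C sketch
of the ABI mapping Roger describes.  The low half of the first __int128
argument arrives in %rdi and the high half in %rsi, so extracting either
half should compile to a single register move:

    /* Hypothetical helper functions, just to show the register mapping. */
    unsigned long long lo (__int128 x)
    {
      return (unsigned long long) x;          /* low 64 bits, i.e. %rdi */
    }

    unsigned long long hi (__int128 x)
    {
      return (unsigned long long) (x >> 64);  /* high 64 bits, i.e. %rsi */
    }

At -O2 these should assemble to "movq %rdi, %rax; ret" and
"movq %rsi, %rax; ret" respectively.]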
> >
> > This patch adds a peephole2 to spot this register shuffling, and with
> > -Os generates an xchg instruction, to produce:
> >
> >         xchgq   %rsi, %rdi
> >         movq    %rdx, %rax
> >         movq    %rcx, %rdx
> >         addq    %rsi, %rax
> >         adcq    %rdi, %rdx
> >         ret
> >
> > or, when optimizing for speed, a three-mov sequence using just one of
> > the temporary registers, which ultimately results in the improved:
> >
> >         movq    %rdi, %r8
> >         movq    %rdx, %rax
> >         movq    %rcx, %rdx
> >         addq    %r8, %rax
> >         adcq    %rsi, %rdx
> >         ret
> >
> > I have a follow-up patch which improves things further, and with the
> > output in flux, I'd like to add the new testcase with part 2, once
> > we're back down to requiring only two movq instructions.
>
> Shouldn't we rather do something about:
>
> (insn 2 9 3 2 (set (reg:DI 85)
>         (reg:DI 5 di [ x ])) "dword-2.c":2:1 82 {*movdi_internal}
>      (nil))
> (insn 3 2 4 2 (set (reg:DI 86)
>         (reg:DI 4 si [ x+8 ])) "dword-2.c":2:1 82 {*movdi_internal}
>      (nil))
> (insn 4 3 5 2 (set (reg:TI 84)
>         (subreg:TI (reg:DI 85) 0)) "dword-2.c":2:1 81 {*movti_internal}
>      (nil))
> (insn 5 4 6 2 (set (subreg:DI (reg:TI 84) 8)
>         (reg:DI 86)) "dword-2.c":2:1 82 {*movdi_internal}
>      (nil))
> (insn 6 5 7 2 (set (reg/v:TI 83 [ x ])
>         (reg:TI 84)) "dword-2.c":2:1 81 {*movti_internal}
>      (nil))
>
> The above is how the TImode function argument is constructed.
>
> The other problem is that double-word addition gets split only after
> reload, mostly for RA reasons.  In the past it was determined that
> the RA creates better code when registers are split late (this reason
> probably no longer holds), but nowadays the limitation remains
> only for arithmetic and shifts.

Hmm.  Presumably the lower-subreg pass doesn't split the above after
the double-word adds are split?  Or maybe we simply do it too late.

> Attached to this message, please find a patch that performs the
> double-word arithmetic splitting before reload.  It improves the
> generated code somewhat, but due to the above argument construction
> sequence the bulk of the moves remain.  Unfortunately, under register
> pressure (e.g. on 32-bit targets) the peephole approach becomes
> ineffective due to register spilling, so IMO the root of the problem
> should be fixed.
>
> Uros.
>
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}, with
> > no new failures.  OK for mainline?
> >
> > 2022-06-02  Roger Sayle
> >
> > gcc/ChangeLog
> >         * config/i386/i386.md (define_peephole2): Recognize double word
> >         swap sequences, and replace them with more efficient idioms,
> >         including using xchg when optimizing for size.
> >
> > Thanks in advance,
> > Roger
> > --
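
[Aside, for illustration only: what the split form of the double-word
addition discussed above computes, as a standalone C sketch.  Function
and variable names are illustrative; the lo/hi pair models the
little-endian register pair:

    /* addq produces the low half and the carry flag; adcq folds that
       carry into the high half. */
    unsigned long long
    add_double_word (unsigned long long xlo, unsigned long long xhi,
                     unsigned long long ylo, unsigned long long yhi,
                     unsigned long long *reslo)
    {
      unsigned long long lo = xlo + ylo;       /* addq: may wrap */
      unsigned long long carry = lo < xlo;     /* 1 iff the add carried */
      *reslo = lo;
      return xhi + yhi + carry;                /* adcq */
    }
]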