From: Uros Bizjak
Date: Thu, 2 Jun 2022 11:36:58 +0200
Subject: Re: [x86 PATCH] Add peephole2 to reduce double word register shuffling.
To: Roger Sayle
Cc: GCC Patches

On Thu, Jun 2, 2022 at 11:32 AM Uros Bizjak wrote:
>
> On Thu, Jun 2, 2022 at 9:20 AM Roger Sayle wrote:
> >
> > The simple test case below demonstrates an interesting register
> > allocation challenge facing x86_64, imposed by ABI requirements
> > on int128.
> >
> > __int128 foo(__int128 x, __int128 y)
> > {
> >   return x+y;
> > }
> >
> > For which GCC currently generates the unusual sequence:
> >
> >         movq    %rsi, %rax
> >         movq    %rdi, %r8
> >         movq    %rax, %rdi
> >         movq    %rdx, %rax
> >         movq    %rcx, %rdx
> >         addq    %r8, %rax
> >         adcq    %rdi, %rdx
> >         ret
> >
> > The challenge is that the x86_64 ABI requires passing the first __int128,
> > x, in %rsi:%rdi (highpart in %rsi, lowpart in %rdi), where internally
> > GCC prefers TI mode (double word) integers to be register allocated as
> > %rdi:%rsi (highpart in %rdi, lowpart in %rsi).  So after reload, we have
> > four mov instructions, two to move the double word to temporary registers
> > and then two to move them back.
> >
> > This patch adds a peephole2 to spot this register shuffling, and with
> > -Os generates an xchg instruction, to produce:
> >
> >         xchgq   %rsi, %rdi
> >         movq    %rdx, %rax
> >         movq    %rcx, %rdx
> >         addq    %rsi, %rax
> >         adcq    %rdi, %rdx
> >         ret
> >
> > or when optimizing for speed, a three mov sequence, using just one of
> > the temporary registers, which ultimately results in the improved:
> >
> >         movq    %rdi, %r8
> >         movq    %rdx, %rax
> >         movq    %rcx, %rdx
> >         addq    %r8, %rax
> >         adcq    %rsi, %rdx
> >         ret
> >
> > I've a follow-up patch which improves things further, and with the
> > output in flux, I'd like to add the new testcase with part 2, once
> > we're back down to requiring only two movq instructions.
>
> Shouldn't we rather do something about:
>
> (insn 2 9 3 2 (set (reg:DI 85)
>         (reg:DI 5 di [ x ])) "dword-2.c":2:1 82 {*movdi_internal}
>      (nil))
> (insn 3 2 4 2 (set (reg:DI 86)
>         (reg:DI 4 si [ x+8 ])) "dword-2.c":2:1 82 {*movdi_internal}
>      (nil))
> (insn 4 3 5 2 (set (reg:TI 84)
>         (subreg:TI (reg:DI 85) 0)) "dword-2.c":2:1 81 {*movti_internal}
>      (nil))
> (insn 5 4 6 2 (set (subreg:DI (reg:TI 84) 8)
>         (reg:DI 86)) "dword-2.c":2:1 82 {*movdi_internal}
>      (nil))
> (insn 6 5 7 2 (set (reg/v:TI 83 [ x ])
>         (reg:TI 84)) "dword-2.c":2:1 81 {*movti_internal}
>      (nil))
>
> The above is how the TImode function argument is constructed.
>
> The other problem is that double-word addition gets split only after
> reload, mostly due to RA reasons. In the past it was determined that
> RA creates better code when registers are split late (this reason
> probably does not hold anymore), but nowadays the limitation remains
> only for arithmetic and shifts.

FYI, the effect of the patch can be seen with the following testcase:

--cut here--
#include <stdint.h>

void test (int64_t n)
{
  while (1)
    {
      n++;

      asm volatile ("#" :: "b" ((int32_t)n), "c" ((int32_t)(n >> 32)));
    }
}
--cut here--

Please compile this with -O2 -m32 with a patched and an unpatched compiler.

Uros.
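P.S. The ABI lowpart/highpart split Roger describes can also be observed in
isolation with a couple of trivial extractors (an illustrative sketch only,
the function names are made up; compile with -O2 -S and check which incoming
register each function copies to %rax):

--cut here--
#include <stdint.h>

/* The psABI passes the first __int128 argument with its low 64 bits
   in %rdi and its high 64 bits in %rsi, so each extractor should
   compile to a single movq from that register into %rax.  */

uint64_t int128_lo (__int128 x)
{
  return (uint64_t) x;			/* movq %rdi, %rax */
}

uint64_t int128_hi (__int128 x)
{
  return (uint64_t) (x >> 64);		/* movq %rsi, %rax */
}
--cut here--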