From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id 681C0385841B; Wed, 19 Jan 2022 17:40:36 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 681C0385841B
From: "munroesj at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/104124] New: Poor optimization for vector splat DW with
 small consts
Date: Wed, 19 Jan 2022 17:40:36 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 11.1.1
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: normal
X-Bugzilla-Who: munroesj at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status
 bug_severity priority component assigned_to reporter target_milestone
Message-ID: <bug-104124-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Wed, 19 Jan 2022 17:40:36 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104124

            Bug ID: 104124
           Summary: Poor optimization for vector splat DW with small
                    consts
           Product: gcc
           Version: 11.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

It looks to me like the compiler is seeing register pressure caused by loading
all the vector long long constants I need in my code. This is leaf code of a
size that can run entirely out of volatile registers (no stack frame). But this
puts more pressure on volatile VRs, VSRs, and GPRs, especially GPRs, because it
is loading from .rodata when it could (and should) use a vector immediate.

For example:

vui64_t
__test_splatudi_0_V0 (void)
{
  return vec_splats ((unsigned long long) 0);
}

vi64_t
__test_splatudi_1_V0 (void)
{
  return vec_splats ((signed long long) -1);
}

These generate:
00000000000001a0 <__test_splatudi_0_V0>:
     1a0:       8c 03 40 10     vspltisw v2,0
     1a4:       20 00 80 4e     blr

00000000000001c0 <__test_splatudi_1_V0>:
     1c0:       8c 03 5f 10     vspltisw v2,-1
     1c4:       20 00 80 4e     blr
        ...

But other cases that could use immediates, like:

vui64_t
__test_splatudi_12_V0 (void)
{
  return vec_splats ((unsigned long long) 12);
}

GCC 9/10/11 generates for power8:

0000000000000170 <__test_splatudi_12_V0>:
     170:       00 00 4c 3c     addis   r2,r12,0
                        170: R_PPC64_REL16_HA   .TOC.
     174:       00 00 42 38     addi    r2,r2,0
                        174: R_PPC64_REL16_LO   .TOC.+0x4
     178:       00 00 22 3d     addis   r9,r2,0
                        178: R_PPC64_TOC16_HA   .rodata.cst16+0x20
     17c:       00 00 29 39     addi    r9,r9,0
                        17c: R_PPC64_TOC16_LO   .rodata.cst16+0x20
     180:       ce 48 40 7c     lvx     v2,0,r9
     184:       20 00 80 4e     blr

and for Power9:
0000000000000000 <__test_splatisd_12_PWR9>:
       0:       d1 62 40 f0     xxspltib vs34,12
       4:       02 16 58 10     vextsb2d v2,v2
       8:       20 00 80 4e     blr

So why can't the power8 target generate:

00000000000000f0 <__test_splatudi_12_V1>:
      f0:       8c 03 4c 10     vspltisw v2,12
      f4:       4e 16 40 10     vupkhsw v2,v2
      f8:       20 00 80 4e     blr

This is 4 cycles vs 9 (best case, and it is always 9 cycles because GCC does
not exploit immediate fusion).
In fact GCC 8 (AT12) does this.
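
To make the equivalence concrete, here is a minimal portable C sketch that
models the two-instruction sequence (vspltisw, then vupkhsw) with plain
integers. The struct types and the model_* function names are hypothetical and
exist only for this illustration; they are not part of <altivec.h> or of GCC.
The sketch assumes big-endian element numbering for "high", which does not
affect the result because all four words hold the same value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit vector models; not types from <altivec.h>. */
typedef struct { int32_t w[4]; } model_v4si;
typedef struct { int64_t d[2]; } model_v2di;

/* Models vspltisw: sign-extend a 5-bit immediate into all four words. */
static model_v4si model_vspltisw(int imm)
{
  assert(imm >= -16 && imm <= 15);   /* the immediate field is 5 bits, signed */
  model_v4si r;
  for (int i = 0; i < 4; i++)
    r.w[i] = imm;
  return r;
}

/* Models vupkhsw: sign-extend the two high-order words to doublewords.
   Index 0 is taken as the high-order element (big-endian numbering). */
static model_v2di model_vupkhsw(model_v4si a)
{
  model_v2di r;
  r.d[0] = (int64_t) a.w[0];
  r.d[1] = (int64_t) a.w[1];
  return r;
}

/* The proposed power8 sequence: vspltisw followed by vupkhsw. */
static model_v2di model_splat_s64(int imm)
{
  return model_vupkhsw(model_vspltisw(imm));
}

/* Returns 1 when both doublewords equal the expected constant. */
static int model_is_splat(model_v2di v, int64_t expected)
{
  return v.d[0] == expected && v.d[1] == expected;
}
```

In this model, model_splat_s64 (12) leaves 12 in both doublewords, which is
what vec_splats ((unsigned long long) 12) asks for. The sequence only covers
constants in the range -16 to 15, since that is all the vspltisw immediate can
encode.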

So I tried defining my own vec_splatudi:

vi64_t
__test_splatudi_12_V1 (void)
{
  vi32_t vwi = vec_splat_s32 (12);
  return vec_unpackl (vwi);
}

Which generates the <__test_splatudi_12_V1> sequence above for GCC 8. But for
GCC 9/10/11 it generates:

0000000000000110 <__test_splatudi_12_V1>:
     110:       00 00 4c 3c     addis   r2,r12,0
                        110: R_PPC64_REL16_HA   .TOC.
     114:       00 00 42 38     addi    r2,r2,0
                        114: R_PPC64_REL16_LO   .TOC.+0x4
     118:       00 00 22 3d     addis   r9,r2,0
                        118: R_PPC64_TOC16_HA   .rodata.cst16+0x20
     11c:       00 00 29 39     addi    r9,r9,0
                        11c: R_PPC64_TOC16_LO   .rodata.cst16+0x20
     120:       ce 48 40 7c     lvx     v2,0,r9
     124:       20 00 80 4e     blr

Again! GCC has gone out of its way to be this clever! Badly! While it can be
appropriately clever for power9!

I have tried many permutations of this and the only way I have found to prevent
this (GCC 9/10/11) cleverness is to use inline __asm (which has other bad side
effects).