From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 647E03858C39; Thu, 16 Feb 2023 17:53:32 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 647E03858C39
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1676570012;
	bh=VmhpvI/TYJIS64aqjfbS93JdGlILQbbyr42jL72Fgo8=;
	h=From:To:Subject:Date:From;
	b=oGQ41+orsojd1iYXYgiaLKsDth0DL0GFlYRNrWvkWn1NH4UBCeQuBu7qM7+32wemN
	 OJq1QfNvuFq9xpZfSKWSRYEdDNlqHQCG96IE9reyB1ML3DEbsVAE8OBZ9vh+2NZAch
	 R4xDALgZd+uDIc5uB4NJGw6T4zB8XxsTQSBFvuho=
From: "tkoenig at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug rtl-optimization/108826] New: Inefficient address generation on
 POWER and RISC-V
Date: Thu, 16 Feb 2023 17:53:31 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: rtl-optimization
X-Bugzilla-Version: unknown
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: enhancement
X-Bugzilla-Who: tkoenig at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status
 bug_severity priority component assigned_to reporter target_milestone
Message-ID: <bug-108826-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108826

            Bug ID: 108826
           Summary: Inefficient address generation on POWER and RISC-V
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

For the code (reduced from embench)

struct {
  unsigned int table[4][100];
} * _nettle_aes_decrypt_T;
unsigned int _nettle_aes_decrypt_w1;
void _nettle_aes_decrypt() {
  _nettle_aes_decrypt_T->table[2][0] =
      _nettle_aes_decrypt_T->table[2][_nettle_aes_decrypt_w1 >> 6 & 5];
}

current trunk generates, on POWER,

0:      addis 2,12,.TOC.-.LCF0@ha
        addi 2,2,.TOC.-.LCF0@l
        .localentry     _nettle_aes_decrypt,.-_nettle_aes_decrypt
        addis 9,2,.LANCHOR0+8@toc@ha
        lwz 9,.LANCHOR0+8@toc@l(9)
        addis 10,2,.LANCHOR0@toc@ha
        ld 10,.LANCHOR0@toc@l(10)
        srwi 9,9,6
        andi. 9,9,0x5
        addi 9,9,200
        sldi 9,9,2
        lwzx 9,10,9
        stw 9,800(10)
        blr

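After the TOC loading, this shifts the value right by 6, does the and, adds 200
and then shifts the result left by 2 to scale it. The second shift is not
necessary: the scaling can be folded into the first shift, the mask and the
constant offset.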

A better alternative would be something like (please excuse any errors)

        srwi 9,9,4
        andi. 9,9,20
        add  9,9,10
        lwz  9,800(9)
        stw  9,800(10)

saving an instruction.
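
The folding relies on ((w1 >> 6) & 5) << 2 being equal to (w1 >> 4) & 20, and
on 200 << 2 being 800. The following is only a sketch to check that identity,
assuming a 32-bit unsigned int with 4-byte elements; the function names are
made up for illustration and are not part of the test case.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset into *_nettle_aes_decrypt_T as trunk computes it today:
   build the element index, add 200 (= 2 * 100 for table[2]), then scale. */
static uint32_t offset_index_then_scale(uint32_t w1)
{
    uint32_t index = (w1 >> 6) & 5;   /* srwi 6, andi. 5 */
    return (index + 200) << 2;        /* addi 200, sldi 2 */
}

/* Same byte offset with the scaling folded into the shift count, the mask
   and the displacement, so no second shift is needed. */
static uint32_t offset_folded(uint32_t w1)
{
    return ((w1 >> 4) & 20) + 800;    /* srwi 4, andi. 20, displacement 800 */
}

/* Only bits 6 and 8 of w1 can influence the result, so looping over all
   16-bit inputs covers every distinct case. */
void check_offsets_agree(void)
{
    for (uint32_t w1 = 0; w1 < (1u << 16); w1++)
        assert(offset_index_then_scale(w1) == offset_folded(w1));
}
```

Calling check_offsets_agree() from a small main should pass silently if the
two forms are equivalent as claimed.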

RISC-V does something similar.  According to godbolt:

        lui     a5,%hi(_nettle_aes_decrypt_w1)
        lw      a5,%lo(_nettle_aes_decrypt_w1)(a5)
        lui     a4,%hi(_nettle_aes_decrypt_T)
        ld      a4,%lo(_nettle_aes_decrypt_T)(a4)
        srliw   a5,a5,6
        andi    a5,a5,5
        addi    a5,a5,200
        slli    a5,a5,2
        add     a5,a4,a5
        lw      a5,0(a5)
        sw      a5,800(a4)
        ret


(which is why I think this is a general RTL optimization issue).
x86 is much better:

        movl    _nettle_aes_decrypt_w1(%rip), %eax
        movq    _nettle_aes_decrypt_T(%rip), %rdx
        shrl    $6, %eax
        andl    $5, %eax
        movl    800(%rdx,%rax,4), %eax
        movl    %eax, 800(%rdx)
        ret

but it can use the complex addressing modes on x86.