From mboxrd@z Thu Jan  1 00:00:00 1970
From: "wilco at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/95285] AArch64:aarch64 medium code model proposal
Date: Tue, 26 May 2020 13:14:44 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 11.0
X-Bugzilla-Severity: normal
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P3
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

Wilco changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilco at gcc dot gnu.org

--- Comment #2 from Wilco ---
(In reply to Bu Le from comment #0)
> Created attachment 48584 [details]
> proposed patch
>
> I would like to propose an implementation of the medium code model in
> aarch64. A prototype is attached; it passed bootstrap and the regression
> tests.
>
> -mcmodel=medium is a code model that is missing on the aarch64
> architecture but supported on x86. It describes a situation where small
> data is relocated using the small code model while large data is
> relocated using the large code model. The official description of the
> medium code model is on page 34 of the x86 ABI document:
> https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf
>
> The key difference between x86 and aarch64 is that x86 can use
> lea+movabs instructions to implement a dynamically relocatable large
> code model. Currently, the large code model on AArch64 relocates symbols
> using an ldr instruction, which can only be statically linked. The small
> code model, however, uses adrp + ldr instructions, which can be
> dynamically linked. Therefore, the medium code model cannot be
> implemented simply by setting a threshold: a dynamically relocatable
> large code model is needed first for a functional medium code model.
>
> I met this problem when compiling CESM, climate forecasting software
> that is widely used in the HPC field. In some configurations that
> manipulate large arrays, a large code model with dynamic relocation is
> needed. The following case is abstracted from CESM for this scenario.
>
> program main
>   common/baz/a,b,c
>   real a,b,c
>   b = 1.0
>   call foo()
>   print*, b
> end
>
> subroutine foo()
>   common/baz/a,b,c
>   real a,b,c
>
>   integer, parameter :: nx = 1024
>   integer, parameter :: ny = 1024
>   integer, parameter :: nz = 1024
>   integer, parameter :: nf = 1
>   real :: bar(nf,nx*ny*nz)
>   real :: bar1(nf,nx*ny*nz)
>   bar = 0.0
>   bar1 = 0.0
>   b = bar(1,1024*1024*100)
>   b = bar1(1,1)
>
>   return
> end
>
> Compiling with -mcmodel=small -fPIC gives the following error, due to
> the access to the bar1 array:
> test.f90:(.text+0x28): relocation truncated to fit:
> R_AARCH64_ADR_PREL_PG_HI21 against `.bss'
> test.f90:(.text+0x6c): relocation truncated to fit:
> R_AARCH64_ADR_PREL_PG_HI21 against `.bss'
>
> Compiling with -mcmodel=large -fPIC gives an "unsupported" error:
> f951: sorry, unimplemented: code model ‘large’ with ‘-fPIC’
>
> As discussed at the beginning, to tackle this problem we first have to
> solve the problem of the static-only large code model. My solution here
> is to use the R_AARCH64_MOVW_PREL_Gx group relocations, together with
> instructions that calculate the current PC value.
>
> Before change (mcmodel=small):
>         adrp    x0, bar1.2782
>         add     x0, x0, :lo12:bar1.2782
>
> After change (mcmodel=medium, proposed):
>         movz    x0, :prel_g3:bar1.2782
>         movk    x0, :prel_g2_nc:bar1.2782
>         movk    x0, :prel_g1_nc:bar1.2782
>         movk    x0, :prel_g0_nc:bar1.2782
>         adr     x1, .
>         sub     x1, x1, 0x4
>         add     x0, x0, x1
>
> The first four instructions (movz plus three movk) calculate the 64-bit
> offset between bar1 and the last movk instruction, which fulfils the
> requirement of the large code model (64-bit relocation).
> The adr+sub instructions calculate the PC address of the last movk
> instruction. By adding the offset to that PC address, bar1 can be
> located dynamically.
>
> Because this relocation sequence is time consuming, a threshold is set
> to classify the size of the data to be relocated, as on x86.
> The default value of the threshold is set to 65536, which is the
> maximum relocation capability of the small code model.
> This implementation also requires a change to the linker in binutils so
> that the four mov instructions all calculate the same PC offset as the
> last movk instruction.
>
> The good side of this implementation is that it can use existing
> relocation types to prototype a medium code model.
>
> This implementation also has drawbacks.
> For a start, the four mov instructions and the adr instruction must be
> emitted in this order. No other instruction may be inserted into the
> sequence, since that would lead to a wrong symbol address. This might
> impede instruction scheduling optimizations.
> Secondly, the linker needs a corresponding change so that every mov
> instruction calculates the same PC offset. For example, in my
> implementation, the first movz instruction needs to add 12 to the
> result of ":prel_g3:bar1.2782" to make up the PC offset.
>
> I haven't figured out a suitable solution for these problems yet. You
> are most welcome to leave your suggestions regarding these issues.

Is the main usage scenario huge arrays? If so, these could easily be
allocated via malloc at startup rather than using bss. It means an extra
indirection in some cases (to load the pointer), but it should be much more
efficient than using a large code model with all the overheads.
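
A minimal sketch of the heap-allocation suggestion above, written in C
because the reply names malloc. The functions init_arrays, sum_elements and
free_arrays are hypothetical, bar and bar1 only mirror the arrays in the
Fortran test case, and the element count is left to the caller so that the
sketch can be run with small sizes. Only the two pointers are static data,
so they should stay within reach of the small code model; each array access
pays one extra load to fetch the pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the Fortran arrays bar and bar1.
   Only these two pointers are static data; the arrays themselves
   live on the heap. */
static float *bar;
static float *bar1;

/* Allocate both arrays once, at startup. calloc zero-fills the memory,
   matching "bar = 0.0" and "bar1 = 0.0" in the test case.
   Returns 0 on success, -1 if either allocation fails. */
int init_arrays(size_t n)
{
    assert(n > 0);
    bar = calloc(n, sizeof *bar);
    bar1 = calloc(n, sizeof *bar1);
    if (bar == NULL || bar1 == NULL) {
        free(bar);   /* free(NULL) is a no-op */
        free(bar1);
        bar = bar1 = NULL;
        return -1;
    }
    return 0;
}

/* Read one element from each array through its pointer.
   The caller must pass i < n and must have called init_arrays first. */
float sum_elements(size_t i)
{
    return bar[i] + bar1[0];
}

/* Release both arrays and reset the pointers. */
void free_arrays(void)
{
    free(bar);
    free(bar1);
    bar = bar1 = NULL;
}
```

A caller would run init_arrays(n) once before any access and check the
result: with the sizes from the test case (1024*1024*1024 elements of a
4-byte real) each array needs 4 GiB, so the allocation can fail. In Fortran
the analogous change would presumably be to declare the arrays allocatable
and allocate them at startup.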