public inbox for gcc-patches@gcc.gnu.org
* [PATCH 0/3] Power10 PCREL_OPT support (September 5th 2020)
@ 2020-09-05  4:27 Michael Meissner
  2020-09-05  4:31 ` [PATCH 1/3] power10: Add PCREL_OPT load support Michael Meissner
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Michael Meissner @ 2020-09-05  4:27 UTC
  To: gcc-patches, Segher Boessenkool, David Edelsohn,
	Michael Meissner, Bill Schmidt, Peter Bergner

The ELF-v2 ISA 3.1 support for Power10 has relocations to optimize cases where
the code references an external variable in only one location.  This patch
is similar to the optimizations that the linker already does to optimize TOC
accesses.

This patch is a revision of the patches last submitted on August 18th, 2020.
Compared to those patches, the current patches do:

    1) The dataflow (DF) functions are used to find the single defs and single
    ref to that def in the same basic block.  Less parsing of the RTL is being
    done.

    2) I switched to use the validate_change and apply_change_group functions
    instead of manually replacing the PATTERN if the value did not pan out.  I
    tested this by temporarily disabling some of the patterns, and validated
    that the results were undone (and then I re-instated the changes to allow
    these patches to do the work).

    3) I tried to clean up the comments, and added some more comments in
    places.

    4) I verified that the patches could be moved earlier, and these patches
    move the code just before the 2nd scheduler pass (previously it was after
    the 2nd scheduler pass).  I did try out an earlier set of patches, and I
    could move the pass to be just after the reload pass.  In building Spec
    2017, I noticed that moving the changes before the 2nd scheduler pass
    caught more cases than moving it after the register allocation or
    after the 2nd scheduler pass.

    5) I removed the '%r' case to print_operand, and instead used a common
    function to create the .reloc.
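
The all-or-nothing behaviour that item 2 gets from validate_change and
apply_change_group can be pictured with a toy model in plain C.  This is
only a sketch of the pattern, not GCC's code: the types and names below
(change_group, queue_change, commit_group, non_negative) are hypothetical,
and plain ints stand in for RTL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHANGES 8

/* One tentative edit: where it was made and what was there before.  */
struct change {
  int *loc;
  int old_value;
};

struct change_group {
  struct change changes[MAX_CHANGES];
  size_t count;
};

/* Make an edit immediately, but remember how to undo it.  */
void
queue_change (struct change_group *g, int *loc, int new_value)
{
  assert (g->count < MAX_CHANGES);
  g->changes[g->count].loc = loc;
  g->changes[g->count].old_value = *loc;
  g->count++;
  *loc = new_value;
}

/* Keep every queued edit if all edited locations pass IS_VALID; otherwise
   undo all of them in reverse order.  Returns true if the edits were kept.  */
bool
commit_group (struct change_group *g, bool (*is_valid) (int))
{
  bool ok = true;
  for (size_t i = 0; i < g->count; i++)
    if (!is_valid (*g->changes[i].loc))
      ok = false;

  if (!ok)
    for (size_t i = g->count; i-- > 0;)
      *g->changes[i].loc = g->changes[i].old_value;

  g->count = 0;
  return ok;
}

/* Hypothetical validity rule for the demonstration: values must be >= 0.  */
bool
non_negative (int v)
{
  return v >= 0;
}
```

Queueing two edits where one fails the check leaves both locations with
their original values, which mirrors the test described in item 2 where
disabling a pattern caused the tentative changes to be undone.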

I will be submitting 3 patches as follow-ups to this message:

    * The first patch adds support for PCREL_OPT loads;
    * The second patch adds support for PCREL_OPT stores; (and)
    * The third patch adds the tests.

I have built the compiler with/without the patches, and there were no
regressions in the testsuite.  Can I check these patches into the master
branch?  I do not anticipate needing to backport these changes to GCC 10.3.

If the program is compiled as the main program, and the variable is defined
in the main program, these relocations let the linker replace the usual
two-step sequence (load the address of the external variable, then load or
store through that address) with a single prefixed load or store, and turn
the second instruction into a NOP.

For example, consider the following program:

        extern int ext_variable;

        int ret_var (void)
        {
          return ext_variable;
        }

        void store_var (int i)
        {
          ext_variable = i;
        }

Currently on power10, the compiler compiles this as:

        ret_var:
                pld 9,ext_variable@got@pcrel
                lwa 3,0(9)
                blr

        store_var:
                pld 9,ext_variable@got@pcrel
                stw 3,0(9)
                blr

That is, it loads up the address of 'ext_variable' from the GOT into
register r9, and then uses r9 as a base register to reference the actual
variable.

The linker does optimize the case where you are compiling the main program, and
the variable is also defined in the main program to be:

        ret_var:
                pla     9,ext_variable
                lwa     3,0(9)
                blr

        store_var:
                pla     9,ext_variable
                stw     3,0(9)
                blr

These patches generate:

        ret_var:
                pld     9,ext_variable@got@pcrel
        .Lpcrel1:
                .reloc .Lpcrel1-8,R_PPC64_PCREL_OPT,.-(.Lpcrel1-8)
                lwa     3,0(9)
                blr

        store_var:
                pld     9,ext_variable@got@pcrel
        .Lpcrel2:
                .reloc .Lpcrel2-8,R_PPC64_PCREL_OPT,.-(.Lpcrel2-8)
                stw     3,0(9)
                blr

Note, the label for locating the PLD occurs after the PLD and not before it.
This is so that if the assembler adds a NOP in front of the PLD to align it,
the relocations will still work.
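
The arithmetic behind the .reloc directive can be sketched in C.  This is
an illustration under assumptions, not assembler or linker code: the
struct and function names are hypothetical (the field names are borrowed
from ELF Rela entries), and it assumes the directive's third operand ends
up as the relocation addend.

```c
#include <assert.h>
#include <stdint.h>

/* A prefixed instruction such as pld occupies 8 bytes.  */
enum { PREFIXED_INSN_SIZE = 8 };

struct pcrel_opt_reloc {
  uint64_t r_offset;   /* Address the relocation applies to: the pld.  */
  int64_t r_addend;    /* Distance from the pld to the load or store.  */
};

/* LABEL_ADDR is the address of the label emitted immediately after the
   pld.  MEM_INSN_ADDR is the address of the load or store that uses the
   base register, i.e. the value of '.' where the .reloc appears.  */
struct pcrel_opt_reloc
make_pcrel_opt_reloc (uint64_t label_addr, uint64_t mem_insn_addr)
{
  /* Stepping back from the label always lands on the pld, even if the
     assembler put an alignment nop in front of the pld.  */
  uint64_t pld_addr = label_addr - PREFIXED_INSN_SIZE;

  struct pcrel_opt_reloc r;
  r.r_offset = pld_addr;
  r.r_addend = (int64_t) (mem_insn_addr - pld_addr);
  return r;
}
```

For the ret_var layout above, where the lwa directly follows the pld, the
computed distance is 8.  If other instructions sit between the two, the
distance grows but the relocation still points at the pld.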

If the linker can, it will convert the code into:

        ret_var:
                plwa    3,ext_variable@pcrel(0),1
                nop
                blr

        store_var:
                pstw    3,ext_variable@pcrel(0),1
                nop
                blr

These patches allow the load of the address to not be physically adjacent to
the actual load or store, which should allow for better code.

For loads, there must be no references to the register that is being loaded
between the PLD and the actual load.

For stores, it becomes a little trickier, in that the register being stored
must be live at the time the PLD instruction is done, and it must continue to
be live and unmodified between the PLD and the store.

For both loads and stores, there must be only one reference to the address
being loaded into a base register, and that base register must die at the point
of the load/store.
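
The three conditions above can be sketched as a check over a simplified
basic block.  This is a toy model, not the pass in these patches: the insn
struct, the register numbering, and can_pcrel_opt are hypothetical, and
the two liveness facts are passed in by the caller rather than computed
from dataflow information.

```c
#include <stdbool.h>
#include <stddef.h>

enum kind { ADDR_LOAD, MEM_LOAD, MEM_STORE, OTHER };

struct insn {
  enum kind kind;
  int def;      /* Register written, or -1 if none.  */
  int use[2];   /* Registers read, or -1.  For MEM_LOAD and MEM_STORE,
                   use[0] is the base; for MEM_STORE, use[1] is the value.  */
};

/* Return true if instruction I reads or writes register REG.  */
bool
references (const struct insn *i, int reg)
{
  return i->def == reg || i->use[0] == reg || i->use[1] == reg;
}

/* Decide whether the address load at ADDR_IDX and the load or store at
   MEM_IDX in block BB (N instructions) may be linked by PCREL_OPT.  */
bool
can_pcrel_opt (const struct insn *bb, size_t n, size_t addr_idx,
               size_t mem_idx, bool value_live_at_addr,
               bool base_dead_after_mem)
{
  if (addr_idx >= mem_idx || mem_idx >= n)
    return false;

  const struct insn *addr = &bb[addr_idx];
  const struct insn *mem = &bb[mem_idx];
  if (addr->kind != ADDR_LOAD || addr->def < 0)
    return false;
  if (mem->kind != MEM_LOAD && mem->kind != MEM_STORE)
    return false;

  /* The base register has one use, and it dies at that use.  */
  int base = addr->def;
  if (mem->use[0] != base || !base_dead_after_mem)
    return false;

  if (mem->kind == MEM_LOAD && mem->def < 0)
    return false;
  if (mem->kind == MEM_STORE && (mem->use[1] < 0 || !value_live_at_addr))
    return false;

  for (size_t i = addr_idx + 1; i < mem_idx; i++)
    {
      /* No other reference to the base register.  */
      if (references (&bb[i], base))
        return false;
      /* Loads: the destination must not be touched in between.  */
      if (mem->kind == MEM_LOAD && references (&bb[i], mem->def))
        return false;
      /* Stores: the stored value must not be modified in between.  */
      if (mem->kind == MEM_STORE && bb[i].def == mem->use[1])
        return false;
    }
  return true;
}
```

The reasoning is that once the linker fuses the pair, the memory access
effectively happens where the PLD was, so anything that would observe or
change the data register between the two positions makes the rewrite
unsafe.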

For reference, here is what the current compiler generates for a medium code
model system targeting power9 with the TOC support:

                .section        ".toc","aw"
        .LC0:
                .quad   ext_variable
                .section        ".text"

        ret_var:
        .LCF0:
        0:      addis   2,12,.TOC.-.LCF0@ha
                addi    2,2,.TOC.-.LCF0@l
                .localentry     ret_var,.-ret_var
                addis   9,2,.LC0@toc@ha
                ld      9,.LC0@toc@l(9)
                lwa     3,0(9)
                blr

                .section        ".toc","aw"
                .set .LC1,.LC0

                .section        ".text"
        store_var:
        .LCF1:
        0:      addis   2,12,.TOC.-.LCF1@ha
                addi    2,2,.TOC.-.LCF1@l
                .localentry     store_var,.-store_var
                addis   9,2,.LC1@toc@ha
                ld      9,.LC1@toc@l(9)
                stw     3,0(9)
                blr

And the linker optimizes this to:

        ret_var:
                lis     2,.TOC@ha
                addi    2,2,.TOC@l
                .localentry     ret_var,.-ret_var
                nop                     ; addis eliminated due to small TOC
                addi    9,2,<offset>    ; ld converted into addi
                lwa     3,0(9)          ; actual load

        store_var:
                lis     2,.TOC@ha
                addi    2,2,.TOC@l
                .localentry     store_var,.-store_var
                nop                     ; addis eliminated due to small TOC
                addi    9,2,<offset>    ; ld converted into addi
                stw     3,0(9)          ; actual store

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meissner@linux.ibm.com, phone: +1 (978) 899-4797

* [PATCH 0/3] Power10 PCREL_OPT support
@ 2020-08-18  6:31 Michael Meissner
  2020-08-18  6:34 ` [PATCH 1/3] Power10: Add PCREL_OPT load support Michael Meissner
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Meissner @ 2020-08-18  6:31 UTC
  To: gcc-patches, Segher Boessenkool, David Edelsohn,
	Michael Meissner, Bill Schmidt, Peter Bergner, Alan Modra

The ELF-v2 ISA 3.1 support for Power10 has relocations to optimize cases where
the code references an external variable in only one location.  This patch
is similar to the optimizations that the linker already does to optimize TOC
accesses.

I will be submitting 3 patches as follow-ups to this message:

    * The first patch adds support for PCREL_OPT loads;
    * The second patch adds support for PCREL_OPT stores; (and)
    * The third patch adds the tests.

If the program is compiled as the main program, and the variable is defined
in the main program, these relocations let the linker replace the usual
two-step sequence (load the address of the external variable, then load or
store through that address) with a single prefixed load or store, and turn
the second instruction into a NOP.

For example, consider the following program:

	extern int ext_variable;

	int ret_var (void)
	{
	  return ext_variable;
	}

	void store_var (int i)
	{
	  ext_variable = i;
	}

Currently on power10, the compiler compiles this as:

	ret_var:
		pld 9,ext_variable@got@pcrel
		lwa 3,0(9)
		blr

	store_var:
		pld 9,ext_variable@got@pcrel
		stw 3,0(9)
		blr

That is, it loads up the address of 'ext_variable' from the GOT table into
register r9, and then uses r9 as a base register to reference the actual
variable.

The linker does optimize the case where you are compiling the main program, and
the variable is also defined in the main program to be:

	ret_var:
		pla	9,ext_variable,1
		lwa	3,0(9)
		blr

	store_var:
		pla	9,ext_variable,1
		stw	3,0(9)
		blr

These patches generate:

	ret_var:
		pld	9,ext_variable@got@pcrel
	.Lpcrel1:
		.reloc .Lpcrel1-8,R_PPC64_PCREL_OPT,.-(.Lpcrel1-8)
		lwa	3,0(9)
		blr

	store_var:
		pld	9,ext_variable@got@pcrel
	.Lpcrel2:
		.reloc .Lpcrel2-8,R_PPC64_PCREL_OPT,.-(.Lpcrel2-8)
		stw	3,0(9)
		blr

Note, the label for locating the PLD occurs after the PLD and not before it.
This is so that if the assembler adds a NOP in front of the PLD to align it,
the relocations will still work.

If the linker can, it will convert the code into:

	ret_var:
		plwa	3,ext_variable,1
		nop
		blr

	store_var:
		pstw	3,ext_variable,1
		nop
		blr

These patches allow the load of the address to not be physically adjacent to
the actual load or store, which should allow for better code.

For loads, there must be no references to the register that is being loaded
between the PLD and the actual load.

For stores, it becomes a little trickier, in that the register being stored
must be live at the time the PLD instruction is done, and it must continue to
be live and unmodified between the PLD and the store.

For both loads and stores, there must be only one reference to the address
being loaded into a base register, and that base register must die at the point
of the load/store.

In order to do this, the pass that converts the load address and load/store
must occur late in the compilation cycle.  In particular, the second scheduler
pass will duplicate and optimize some of the references and it will produce an
invalid program.  In the past, Segher has said that we should be able to move
it earlier.  I have my doubts whether that is feasible.  What I would like to
do is put these patches into GCC 11, which will enable many of the cases that
we want to optimize.

Then somebody else can take a swing at doing the optimization to allow the code
to do this optimization earlier.  That way, even if we can't get the super
optimized code to work, we at least will get the majority of cases to work.

For reference, here is what the current compiler generates for a medium code
model system targeting power9 with the TOC support:

		.section        ".toc","aw"
	.LC0:
		.quad   ext_variable
		.section	".text"

	ret_var:
	.LCF0:
	0:      addis	2,12,.TOC.-.LCF0@ha
		addi	2,2,.TOC.-.LCF0@l
		.localentry     ret_var,.-ret_var
		addis	9,2,.LC0@toc@ha
		ld	9,.LC0@toc@l(9)
		lwa	3,0(9)
		blr

		.section        ".toc","aw"
		.set .LC1,.LC0

		.section        ".text"
	store_var:
	.LCF1:
	0:      addis	2,12,.TOC.-.LCF1@ha
		addi	2,2,.TOC.-.LCF1@l
		.localentry     store_var,.-store_var
		addis	9,2,.LC1@toc@ha
		ld	9,.LC1@toc@l(9)
		stw	3,0(9)
		blr

And the linker optimizes this to:

	ret_var:
		lis	2,.TOC@ha
		addi	2,2,.TOC@l
		.localentry     ret_var,.-ret_var
		nop			; addis eliminated due to small TOC
		addi	9,2,<offset>	; ld converted into addi
		lwa	3,0(9)		; actual load

	store_var:
		lis	2,.TOC@ha
		addi	2,2,.TOC@l
		.localentry	store_var,.-store_var
		nop			; addis eliminated due to small TOC
		addi	9,2,<offset>	; ld converted into addi
		stw	3,0(9)		; actual store

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meissner@linux.ibm.com, phone: +1 (978) 899-4797


end of thread, other threads:[~2020-09-22  3:22 UTC | newest]

Thread overview: 9+ messages
2020-09-05  4:27 [PATCH 0/3] Power10 PCREL_OPT support (September 5th 2020) Michael Meissner
2020-09-05  4:31 ` [PATCH 1/3] power10: Add PCREL_OPT load support Michael Meissner
2020-09-05  4:35 ` [PATCH 2/3] power10: Add PCREL_OPT store support Michael Meissner
2020-09-05  4:36 ` [PATCH 3/3] power10: Add tests for PCREL_OPT Michael Meissner
2020-09-05 15:20 ` [PATCH 0/3] Power10 PCREL_OPT support (September 5th 2020) Michael Meissner
2020-09-22  3:21 ` Ping: " Michael Meissner
  -- strict thread matches above, loose matches on Subject: below --
2020-08-18  6:31 [PATCH 0/3] Power10 PCREL_OPT support Michael Meissner
2020-08-18  6:34 ` [PATCH 1/3] Power10: Add PCREL_OPT load support Michael Meissner
2020-08-21  2:09   ` Segher Boessenkool
2020-09-03 17:24     ` Michael Meissner
