From: Alan Hayward <Alan.Hayward@arm.com>
To: Jeff Law <law@redhat.com>
Cc: Richard Biener <richard.guenther@gmail.com>,
"gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
nd <nd@arm.com>
Subject: Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression
Date: Thu, 30 Nov 2017 11:16:00 -0000 [thread overview]
Message-ID: <90F53098-5814-4513-BB43-C6221E4125D2@arm.com> (raw)
In-Reply-To: <599c2582-6d8d-a909-a878-28e8135d5100@redhat.com>
> On 27 Nov 2017, at 17:29, Jeff Law <law@redhat.com> wrote:
>
> On 11/23/2017 04:11 AM, Alan Hayward wrote:
>>
>>> On 22 Nov 2017, at 17:33, Jeff Law <law@redhat.com> wrote:
>>>
>>> On 11/22/2017 04:31 AM, Alan Hayward wrote:
>>>>
>>>>> On 21 Nov 2017, at 03:13, Jeff Law <law@redhat.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> You might also look at TARGET_HARD_REGNO_CALL_PART_CLOBBERED. I'd
>>>>>>> totally forgotten about it. And in fact it seems to come pretty close
>>>>>>> to what you need…
>>>>>>
>>>>>> Yes, some of the code is similar to the way
>>>>>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED works. Both that code and the
>>>>>> CLOBBER expr code served as a starting point for writing the patch. The main difference
>>>>>> here, is that _PART_CLOBBERED is around all calls and is not tied to a specific Instruction,
>>>>>> it’s part of the calling abi. Whereas clobber_high is explicitly tied to an expression (tls_desc).
>>>>>> It meant there wasn’t really any opportunity to resume any existing code.
>>>>> Understood. Though your first patch mentions that you're trying to
>>>>> describe partial preservation "around TLS calls". Presumably those are
>>>>> represented as normal insns, not call_insn.
>>>>>
>>>>> That brings me back to Richi's idea of exposing a set of the low subreg
>>>>> to itself using whatever mode is wide enough to cover the neon part of
>>>>> the register.
>>>>>
>>>>> That should tell the generic parts of the compiler that you're just
>>>>> clobbering the upper part and at least in theory you can implement in
>>>>> the aarch64 backend and the rest of the compiler should "just work"
>>>>> because that's the existing semantics of a subreg store.
>>>>>
>>>>> The only worry would be if a pass tried to get overly smart and
>>>>> considered that kind of set a nop -- but I think I'd argue that's simply
>>>>> wrong given the semantics of a partial store.
>>>>>
>>>>
>>>> So, the instead of using clobber_high(reg X), to use set(reg X, reg X).
>>>> It’s something we considered, and then dismissed.
>>>>
>>>> The problem then is you are now using SET semantics on those registers, and it
>>>> would make the register live around the function, which might not be the case.
>>>> Whereas clobber semantics will just make the register dead - which is exactly
>>>> what we want (but only conditionally).
>>> ?!? A set of the subreg is the *exact* semantics you want. It says the
>>> low part is preserved while the upper part is clobbered across the TLS
>>> insns.
>>>
>>> jeff
>>
>> Consider where the TLS call is inside a loop. The compiler would normally want
>> to hoist that out of the loop. By adding a set(x,x) into the parallel of the tls_desc we
>> are now making x live across the loop, x is dependant on the value from the previous
>> iteration, and the tls_desc can no longer be hoisted.
> Hmm. I think I see the problem you're trying to point out. Let me
> restate it and see if you agree.
>
> The low subreg set does clearly indicate the upper part of the SVE
> register is clobbered. The problem is from a liveness standpoint the
> compiler is considering the low part live, even though it's a self-set.
>
Yes.
> In fact, if that is the case, then a single TLS call (independent of a
> loop) would make the low part of the register globally live. This
> should be testable. Include one of these low part self sets on the
> existing TLS calls and compile a little test function and let's look at
> the liveness data.
>
Ok, so I’ve put
(set (reg:TI V0_REGNUM) (reg:TI V0_REGNUM))
(set (reg:TI V1_REGNUM) (reg:TI V1_REGNUM))
etc up to V31 in the UNSPEC_TLSDESC pattern in aarch64.md, using up all the neon registers.
In an ideal world, this change would do nothing as neon reg size is TI.
First I compiled up sve_tls_preserve_1.c (Test1 below)
Without the SET changes, gcc does not backup any neon registers before the tlsdesccall (good).
With the SET changes, gcc backs up all the neon registers (not ideal).
Without the SET changes, dump logs give:
cse1:
;; regs ever live 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 66 [cc]
cprop1:
;; lr in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
;; lr use 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
;; lr def 0 [x0] 30 [x30] 32 [v0] 66 [cc] 79 80 81 82 84 85 86 87 88 89
;; live in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
;; live gen 0 [x0] 32 [v0] 78 79 80 81 82 83 84 85 86 87 88 89
;; live kill 30 [x30] 66 [cc]
;; lr out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
;; live out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
reload:
;; regs ever live 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 66 [cc]
With the set changes:
cse1:
;; regs ever live 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]
cprop1:
;; lr in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
;; lr use 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
;; lr def 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc] 79 80 81 82 84 85 86 87 88 89
;; live in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 64 [sfp] 65 [ap]
;; live gen 0 [x0] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 78 79 80 81 82 83 84 85 86 87 88 89
;; live kill 30 [x30] 66 [cc]
;; lr out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
;; live out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
reload:
;; regs ever live 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]
Is there any extra dump data you think would be useful?
For completeness, I also ran through some more tests…..
Second I tried putting the tls call inside a simple for loop (Test2 below)
Both before and after the SET changes, gcc hoisted the tlsdesccall call out of the loop.
This has already happened before the expand pass.
(and as before, with the SET changes, gcc backs up all the neon registers before the tlsdesccall).
Third, I made the tls variable an array, and accessed elements from it in the loop. (Test3 below)
Both before and after the SET changes, gcc fails to hoist the tlsdesccall call.
That’s a missed optimisation chance - I’m not sure how tricky or worthwhile that’d be to fix (at the
tree level the tls load looks just like a MEM base plus index access. I guess you’d need something
specific in an rtl pass to be the hoist).
Anyway, we now have a tlsdec inside a loop, which is what I wanted.
Without the SET changes, nothing is backed up whilst inside the loop.
With the SET changes, gcc tries to back up the live neon registers, and ICEs with
"error: unable to find a register to spill” in lra-assigns.c.
Forth, I took the original test and made two tls accesses in a row with an asm volatile in-between (Test4 below).
With the SET changes, gcc ICEs in the same way.
Looking at the dump files,
Without the SET changes, dump logs for the final test are similar to the very first test:
;; regs ever live 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]
;; lr in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
;; lr use 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
;; lr def 0 [x0] 30 [x30] 32 [v0] 66 [cc] 81 84 85 86 88 89 90 91 95 96 97 98 99 100
;; live in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
;; live gen 0 [x0] 32 [v0] 81 84 85 86 88 89 90 91 92 95 96 97 98 99 100
;; live kill 30 [x30] 66 [cc]
;; lr out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
;; live out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
With the SET changes:
;; regs ever live 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 66 [cc]
;; lr in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
;; lr use 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
;; lr def 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc] 81 84 85 86 88 89 90 91 95 96 97 98 99 100
;; live in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 64 [sfp] 65 [ap]
;; live gen 0 [x0] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 81 84 85 86 88 89 90 91 92 95 96 97 98 99 100
;; live kill 30 [x30] 66 [cc]
;; lr out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
;; live out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
Test1:
__thread v4si tx;
v4si foo (v4si a, v4si b, v4si c)
{
v4si y;
y = a + tx + b + c;
return y + 7;
}
Test2:
__thread v4si tx;
v4si foo (v4si a, v4si *b, v4si *c)
{
v4si y;
for(int i=0; i<MAX; i++)
y += a + tx + b[i] + c[i];
return y + 7;
}
Test3:
__thread v4si tx[MAX];
v4si foo (v4si a, v4si *b, v4si *c)
{
v4si y;
for(int i=0; i<MAX; i++)
y += a + tx[i] + b[i] + c[i];
return y + 7;
}
Test4:
__thread v4si tx;
__thread v4si ty;
v4si foo (v4si a, v4si b, v4si c)
{
v4si y;
y = a + tx + b + c;
asm volatile ("" : : : "memory");
y += a + ty + b + c;
return y + 7;
}
>
> Now it could be the case that various local analysis could sub-optimally
> handle things. You mention LICM. I know our original LICM did have a
> problem in that if it saw a use of a hard reg in a loop without seeing a
> set of that hard reg it considered the register varying within the loop.
> I have no idea if we carried that forward when the loop code was
> rewritten (when I looked at this it was circa 1992).
>
>
>>
>> Or consider a stream of code containing two tls_desc calls (ok, the compiler might
>> optimise one of the tls calls away, but this approach should be reusable for other exprs).
>> Between the two set(x,x)’s x is considered live so the register allocator can’t use that
>> register.
>> Given that we are applying this to all the neon registers, the register allocator now throws
>> an ICE because it can’t find any free hard neon registers to use.
> Given your statements it sounds like the liveness infrastructure is
> making those neon regs globally live when it sees the low part subreg
> self-set. Let's confirm that one way or the other and see where it
> takes us.
>
> Jeff
next prev parent reply other threads:[~2017-11-30 11:04 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-11-16 12:34 Alan Hayward
2017-11-16 18:15 ` Jeff Law
2017-11-16 18:39 ` Richard Biener
2017-11-16 18:57 ` Alan Hayward
2017-11-17 19:45 ` Jeff Law
2017-11-20 15:42 ` Alan Hayward
2017-11-21 6:19 ` Jeff Law
2017-11-22 11:41 ` Alan Hayward
2017-11-22 17:44 ` Jeff Law
2017-11-23 11:25 ` Alan Hayward
2017-11-27 17:47 ` Jeff Law
2017-11-28 11:58 ` Richard Biener
2017-11-28 15:17 ` Jeff Law
2017-11-30 11:16 ` Alan Hayward [this message]
2017-12-12 11:11 ` Alan Hayward
2017-12-19 10:12 ` Alan Hayward
2017-12-19 16:27 ` Jeff Law
2018-01-12 12:14 ` Alan Hayward
2018-01-24 12:20 ` Alan Hayward
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=90F53098-5814-4513-BB43-C6221E4125D2@arm.com \
--to=alan.hayward@arm.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=law@redhat.com \
--cc=nd@arm.com \
--cc=richard.guenther@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).