From: Alan Hayward <Alan.Hayward@arm.com>
To: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
Jeff Law <law@redhat.com>,
Richard Biener <richard.guenther@gmail.com>
Cc: nd <nd@arm.com>
Subject: Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression
Date: Tue, 19 Dec 2017 10:12:00 -0000 [thread overview]
Message-ID: <8C508103-8D47-4049-BF01-89C9A8F39F5E@arm.com> (raw)
In-Reply-To: <E7A82AB6-4CE5-442D-96E4-A3812FDE630D@arm.com>
Ping ping.
I think there should be enough information in the first test to show that any "set to self”
registers become live. Let me know if there’s anything I’ve missed.
Thanks,
Alan.
> On 12 Dec 2017, at 11:11, Alan Hayward <Alan.Hayward@arm.com> wrote:
>
> Ping.
>
>> On 30 Nov 2017, at 11:03, Alan Hayward <Alan.Hayward@arm.com> wrote:
>>
>>
>>> On 27 Nov 2017, at 17:29, Jeff Law <law@redhat.com> wrote:
>>>
>>> On 11/23/2017 04:11 AM, Alan Hayward wrote:
>>>>
>>>>> On 22 Nov 2017, at 17:33, Jeff Law <law@redhat.com> wrote:
>>>>>
>>>>> On 11/22/2017 04:31 AM, Alan Hayward wrote:
>>>>>>
>>>>>>> On 21 Nov 2017, at 03:13, Jeff Law <law@redhat.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> You might also look at TARGET_HARD_REGNO_CALL_PART_CLOBBERED. I'd
>>>>>>>>> totally forgotten about it. And in fact it seems to come pretty close
>>>>>>>>> to what you need…
>>>>>>>>
>>>>>>>> Yes, some of the code is similar to the way
>>>>>>>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED works. Both that code and the
>>>>>>>> CLOBBER expr code served as a starting point for writing the patch. The main difference
>>>>>>>> here, is that _PART_CLOBBERED is around all calls and is not tied to a specific Instruction,
>>>>>>>> it’s part of the calling abi. Whereas clobber_high is explicitly tied to an expression (tls_desc).
>>>>>>>> It meant there wasn’t really any opportunity to resume any existing code.
>>>>>>> Understood. Though your first patch mentions that you're trying to
>>>>>>> describe partial preservation "around TLS calls". Presumably those are
>>>>>>> represented as normal insns, not call_insn.
>>>>>>>
>>>>>>> That brings me back to Richi's idea of exposing a set of the low subreg
>>>>>>> to itself using whatever mode is wide enough to cover the neon part of
>>>>>>> the register.
>>>>>>>
>>>>>>> That should tell the generic parts of the compiler that you're just
>>>>>>> clobbering the upper part and at least in theory you can implement in
>>>>>>> the aarch64 backend and the rest of the compiler should "just work"
>>>>>>> because that's the existing semantics of a subreg store.
>>>>>>>
>>>>>>> The only worry would be if a pass tried to get overly smart and
>>>>>>> considered that kind of set a nop -- but I think I'd argue that's simply
>>>>>>> wrong given the semantics of a partial store.
>>>>>>>
>>>>>>
>>>>>> So, the instead of using clobber_high(reg X), to use set(reg X, reg X).
>>>>>> It’s something we considered, and then dismissed.
>>>>>>
>>>>>> The problem then is you are now using SET semantics on those registers, and it
>>>>>> would make the register live around the function, which might not be the case.
>>>>>> Whereas clobber semantics will just make the register dead - which is exactly
>>>>>> what we want (but only conditionally).
>>>>> ?!? A set of the subreg is the *exact* semantics you want. It says the
>>>>> low part is preserved while the upper part is clobbered across the TLS
>>>>> insns.
>>>>>
>>>>> jeff
>>>>
>>>> Consider where the TLS call is inside a loop. The compiler would normally want
>>>> to hoist that out of the loop. By adding a set(x,x) into the parallel of the tls_desc we
>>>> are now making x live across the loop, x is dependant on the value from the previous
>>>> iteration, and the tls_desc can no longer be hoisted.
>>> Hmm. I think I see the problem you're trying to point out. Let me
>>> restate it and see if you agree.
>>>
>>> The low subreg set does clearly indicate the upper part of the SVE
>>> register is clobbered. The problem is from a liveness standpoint the
>>> compiler is considering the low part live, even though it's a self-set.
>>>
>>
>> Yes.
>>
>>> In fact, if that is the case, then a single TLS call (independent of a
>>> loop) would make the low part of the register globally live. This
>>> should be testable. Include one of these low part self sets on the
>>> existing TLS calls and compile a little test function and let's look at
>>> the liveness data.
>>>
>>
>>
>> Ok, so I’ve put
>> (set (reg:TI V0_REGNUM) (reg:TI V0_REGNUM))
>> (set (reg:TI V1_REGNUM) (reg:TI V1_REGNUM))
>> etc up to V31 in the UNSPEC_TLSDESC pattern in aarch64.md, using up all the neon registers.
>>
>> In an ideal world, this change would do nothing as neon reg size is TI.
>>
>> First I compiled up sve_tls_preserve_1.c (Test1 below)
>>
>> Without the SET changes, gcc does not backup any neon registers before the tlsdesccall (good).
>> With the SET changes, gcc backs up all the neon registers (not ideal).
>>
>> Without the SET changes, dump logs give:
>>
>> cse1:
>> ;; regs ever live 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 66 [cc]
>>
>> cprop1:
>> ;; lr in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; lr use 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; lr def 0 [x0] 30 [x30] 32 [v0] 66 [cc] 79 80 81 82 84 85 86 87 88 89
>> ;; live in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; live gen 0 [x0] 32 [v0] 78 79 80 81 82 83 84 85 86 87 88 89
>> ;; live kill 30 [x30] 66 [cc]
>> ;; lr out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>> ;; live out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>>
>> reload:
>> ;; regs ever live 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 66 [cc]
>>
>> With the set changes:
>>
>> cse1:
>> ;; regs ever live 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]
>>
>> cprop1:
>> ;; lr in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
>> ;; lr use 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
>> ;; lr def 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc] 79 80 81 82 84 85 86 87 88 89
>> ;; live in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 64 [sfp] 65 [ap]
>> ;; live gen 0 [x0] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 78 79 80 81 82 83 84 85 86 87 88 89
>> ;; live kill 30 [x30] 66 [cc]
>> ;; lr out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>> ;; live out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>>
>> reload:
>> ;; regs ever live 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]
>>
>>
>> Is there any extra dump data you think would be useful?
>>
>>
>> For completeness, I also ran through some more tests…..
>>
>>
>>
>> Second I tried putting the tls call inside a simple for loop (Test2 below)
>>
>> Both before and after the SET changes, gcc hoisted the tlsdesccall call out of the loop.
>> This has already happened before the expand pass.
>> (and as before, with the SET changes, gcc backs up all the neon registers before the tlsdesccall).
>>
>>
>> Third, I made the tls variable an array, and accessed elements from it in the loop. (Test3 below)
>>
>> Both before and after the SET changes, gcc fails to hoist the tlsdesccall call.
>> That’s a missed optimisation chance - I’m not sure how tricky or worthwhile that’d be to fix (at the
>> tree level the tls load looks just like a MEM base plus index access. I guess you’d need something
>> specific in an rtl pass to be the hoist).
>> Anyway, we now have a tlsdec inside a loop, which is what I wanted.
>> Without the SET changes, nothing is backed up whilst inside the loop.
>> With the SET changes, gcc tries to back up the live neon registers, and ICEs with
>> "error: unable to find a register to spill” in lra-assigns.c.
>>
>>
>> Forth, I took the original test and made two tls accesses in a row with an asm volatile in-between (Test4 below).
>>
>> With the SET changes, gcc ICEs in the same way.
>> Looking at the dump files,
>>
>> Without the SET changes, dump logs for the final test are similar to the very first test:
>>
>> ;; regs ever live 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]
>>
>> ;; lr in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; lr use 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; lr def 0 [x0] 30 [x30] 32 [v0] 66 [cc] 81 84 85 86 88 89 90 91 95 96 97 98 99 100
>> ;; live in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; live gen 0 [x0] 32 [v0] 81 84 85 86 88 89 90 91 92 95 96 97 98 99 100
>> ;; live kill 30 [x30] 66 [cc]
>> ;; lr out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>> ;; live out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>>
>>
>> With the SET changes:
>>
>> ;; regs ever live 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 66 [cc]
>>
>> ;; lr in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
>> ;; lr use 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
>> ;; lr def 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc] 81 84 85 86 88 89 90 91 95 96 97 98 99 100
>> ;; live in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 64 [sfp] 65 [ap]
>> ;; live gen 0 [x0] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 81 84 85 86 88 89 90 91 92 95 96 97 98 99 100
>> ;; live kill 30 [x30] 66 [cc]
>> ;; lr out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>> ;; live out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>>
>>
>>
>>
>> Test1:
>>
>> __thread v4si tx;
>>
>> v4si foo (v4si a, v4si b, v4si c)
>> {
>> v4si y;
>>
>> y = a + tx + b + c;
>>
>> return y + 7;
>> }
>>
>>
>> Test2:
>>
>>
>> __thread v4si tx;
>>
>> v4si foo (v4si a, v4si *b, v4si *c)
>> {
>> v4si y;
>>
>> for(int i=0; i<MAX; i++)
>> y += a + tx + b[i] + c[i];
>>
>> return y + 7;
>> }
>>
>>
>> Test3:
>>
>> __thread v4si tx[MAX];
>>
>> v4si foo (v4si a, v4si *b, v4si *c)
>> {
>> v4si y;
>>
>> for(int i=0; i<MAX; i++)
>> y += a + tx[i] + b[i] + c[i];
>>
>> return y + 7;
>> }
>>
>>
>> Test4:
>>
>> __thread v4si tx;
>> __thread v4si ty;
>>
>> v4si foo (v4si a, v4si b, v4si c)
>> {
>> v4si y;
>>
>> y = a + tx + b + c;
>>
>> asm volatile ("" : : : "memory");
>>
>> y += a + ty + b + c;
>>
>> return y + 7;
>> }
>>
>>
>>>
>>> Now it could be the case that various local analysis could sub-optimally
>>> handle things. You mention LICM. I know our original LICM did have a
>>> problem in that if it saw a use of a hard reg in a loop without seeing a
>>> set of that hard reg it considered the register varying within the loop.
>>> I have no idea if we carried that forward when the loop code was
>>> rewritten (when I looked at this it was circa 1992).
>>>
>>>
>>>>
>>>> Or consider a stream of code containing two tls_desc calls (ok, the compiler might
>>>> optimise one of the tls calls away, but this approach should be reusable for other exprs).
>>>> Between the two set(x,x)’s x is considered live so the register allocator can’t use that
>>>> register.
>>>> Given that we are applying this to all the neon registers, the register allocator now throws
>>>> an ICE because it can’t find any free hard neon registers to use.
>>> Given your statements it sounds like the liveness infrastructure is
>>> making those neon regs globally live when it sees the low part subreg
>>> self-set. Let's confirm that one way or the other and see where it
>>> takes us.
>>>
>>> Jeff
>>
>
next prev parent reply other threads:[~2017-12-19 10:12 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-11-16 12:34 Alan Hayward
2017-11-16 18:15 ` Jeff Law
2017-11-16 18:39 ` Richard Biener
2017-11-16 18:57 ` Alan Hayward
2017-11-17 19:45 ` Jeff Law
2017-11-20 15:42 ` Alan Hayward
2017-11-21 6:19 ` Jeff Law
2017-11-22 11:41 ` Alan Hayward
2017-11-22 17:44 ` Jeff Law
2017-11-23 11:25 ` Alan Hayward
2017-11-27 17:47 ` Jeff Law
2017-11-28 11:58 ` Richard Biener
2017-11-28 15:17 ` Jeff Law
2017-11-30 11:16 ` Alan Hayward
2017-12-12 11:11 ` Alan Hayward
2017-12-19 10:12 ` Alan Hayward [this message]
2017-12-19 16:27 ` Jeff Law
2018-01-12 12:14 ` Alan Hayward
2018-01-24 12:20 ` Alan Hayward
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8C508103-8D47-4049-BF01-89C9A8F39F5E@arm.com \
--to=alan.hayward@arm.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=law@redhat.com \
--cc=nd@arm.com \
--cc=richard.guenther@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).