Discussion about arm testcase failures seen with patch for PR111673

public inbox for gcc@gcc.gnu.org

* Discussion about arm testcase failures seen with patch for PR111673
@ 2023-11-03 11:28 Surya Kumari Jangala
  2023-11-24  8:09 ` Surya Kumari Jangala
  0 siblings, 1 reply; 10+ messages in thread
From: Surya Kumari Jangala @ 2023-11-03 11:28 UTC
  To: Richard Sandiford, Peter Bergner; +Cc: GCC Development, vmakarov

Hi Richard,
I submitted a patch for review (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/631849.html)
that scales the save/restore costs of callee-saved registers by block
frequency in the IRA pass (PR111673).

This patch has been approved by VMakarov 
(https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632089.html).

With this patch, we are seeing performance improvements on SPEC on x86
(exchange: 5%, xalancbmk: 2.5%) and on Power (perlbench: 5.57%).

I received a mail from Linaro about some failures seen in the CI pipeline with
this patch. I have analyzed the failures and I wish to discuss the analysis with you.

One failure reported by the Linaro CI is:

FAIL: gcc.target/arm/pr111235.c scan-assembler-times ldrexd\tr[0-9]+, r[0-9]+, \\[r[0-9]+\\] 2

The diff in the assembly between trunk and patch is:

93c93
<       push    {r4, r5}
---
>       push    {fp}
95c95
<       ldrexd  r4, r5, [r0]
---
>       ldrexd  fp, ip, [r0]
99c99
<       pop     {r4, r5}
---
>       ldr     fp, [sp], #4

The test fails with the patch because the ldrexd insn now uses the fp and ip
registers, which the pattern r[0-9]+ does not match.

However, the code produced with the patch is better: it pushes and restores only
one register (fp) instead of two (r4, r5), since ip is call-clobbered and needs
no save. Hence, the test can be modified to allow it to pass on arm. Please let
me know what you think.

If you need more information, please let me know. I will be sending separate mails
for the other test failures.

Regards,
Surya


* Re: Discussion about arm testcase failures seen with patch for PR111673
  2023-11-03 11:28 Discussion about arm testcase failures seen with patch for PR111673 Surya Kumari Jangala
@ 2023-11-24  8:09 ` Surya Kumari Jangala
  2023-11-24 10:48   ` Richard Earnshaw
  0 siblings, 1 reply; 10+ messages in thread
From: Surya Kumari Jangala @ 2023-11-24  8:09 UTC
  To: Richard Sandiford, Peter Bergner; +Cc: GCC Development, vmakarov

Hi Richard,
Ping. Please let me know whether the test failure described in my earlier mail
can be handled by changing the expected generated code. I am not familiar with
Arm, so I would appreciate your help.

Regards,
Surya

* Re: Discussion about arm testcase failures seen with patch for PR111673
  2023-11-24  8:09 ` Surya Kumari Jangala
@ 2023-11-24 10:48   ` Richard Earnshaw
  2023-11-28 12:52     ` Discussion about arm/aarch64 " Surya Kumari Jangala
  0 siblings, 1 reply; 10+ messages in thread
From: Richard Earnshaw @ 2023-11-24 10:48 UTC
  To: Surya Kumari Jangala, Richard Sandiford, Peter Bergner
  Cc: GCC Development, vmakarov



On 24/11/2023 08:09, Surya Kumari Jangala via Gcc wrote:
> Hi Richard,
> Ping. Please let me know if the test failure that I mentioned in the mail below can be handled by changing the expected generated code. I am not conversant with arm, and hence would appreciate your help.
> 
> Regards,
> Surya
> 
> On 03/11/23 4:58 pm, Surya Kumari Jangala wrote:
>> Hi Richard,
>> I had submitted a patch for review (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/631849.html)
>> regarding scaling save/restore costs of callee save registers with block
>> frequency in the IRA pass (PR111673).
>>
>> This patch has been approved by VMakarov
>> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632089.html).
>>
>> With this patch, we are seeing performance improvements with spec on x86
>> (exchange: 5%, xalancbmk: 2.5%) and on Power (perlbench: 5.57%).
>>
>> I received a mail from Linaro about some failures seen in the CI pipeline with
>> this patch. I have analyzed the failures and I wish to discuss the analysis with you.
>>
>> One failure reported by the Linaro CI is:
>>
>> FAIL: gcc.target/arm/pr111235.c scan-assembler-times ldrexd\tr[0-9]+, r[0-9]+, \\[r[0-9]+\\] 2
>>
>> The diff in the assembly between trunk and patch is:
>>
>> 93c93
>> <       push    {r4, r5}
>> ---
>> >       push    {fp}
>> 95c95
>> <       ldrexd  r4, r5, [r0]
>> ---
>> >       ldrexd  fp, ip, [r0]
>> 99c99
>> <       pop     {r4, r5}
>> ---
>> >       ldr     fp, [sp], #4
>>
>>
>> The test fails with patch because the ldrexd insn uses fp & ip registers instead
>> of r[0-9]+
>>
>> But the code produced by patch is better because it is pushing and restoring only
>> one register (fp) instead of two registers (r4, r5). Hence, this test can be
>> modified to allow it to pass on arm. Please let me know what you think.
>>
>> If you need more information, please let me know. I will be sending separate mails
>> for the other test failures.
>>

Thanks for looking at this.


The key part of this test is that the compiler generates LDREXD.  The 
registers used for that are pretty much irrelevant as we don't match 
them to any other operations within the test.  So I'd recommend just 
testing for the mnemonic and not for any of the operands (ie just match 
"ldrexd\t").

R.
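
For illustration, here is a minimal C sketch of why the operand-specific
pattern stops matching while a mnemonic-only pattern keeps matching. It uses
POSIX extended regexps as a stand-in for the Tcl regexps that
scan-assembler-times actually uses, and the helper and constant names are made
up for this example; it is not the testsuite's own code.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if the POSIX extended regexp `pattern` matches anywhere in
   `line`.  POSIX ERE is only a stand-in here for DejaGnu's Tcl regexps. */
static bool line_matches(const char *pattern, const char *line)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    assert(rc == 0);    /* the patterns below are expected to compile */
    bool hit = regexec(&re, line, 0, NULL, 0) == 0;
    regfree(&re);
    return hit;
}

/* Operand-specific pattern, modelled on the failing check. */
static const char STRICT_PATTERN[] = "ldrexd\tr[0-9]+, r[0-9]+, \\[r[0-9]+\\]";

/* Mnemonic-only pattern, as suggested above. */
static const char LOOSE_PATTERN[] = "ldrexd\t";

/* The two instruction lines from the assembly diff. */
static const char TRUNK_INSN[] = "\tldrexd\tr4, r5, [r0]";
static const char PATCHED_INSN[] = "\tldrexd\tfp, ip, [r0]";
```

With these definitions, line_matches(STRICT_PATTERN, PATCHED_INSN) is false
because fp and ip are not of the form r[0-9]+, while LOOSE_PATTERN matches both
the trunk and the patched instruction.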


* Re: Discussion about arm/aarch64 testcase failures seen with patch for PR111673
  2023-11-24 10:48   ` Richard Earnshaw
@ 2023-11-28 12:52     ` Surya Kumari Jangala
  2023-11-28 13:48       ` Richard Earnshaw
  0 siblings, 1 reply; 10+ messages in thread
From: Surya Kumari Jangala @ 2023-11-28 12:52 UTC
  To: Richard Earnshaw, Richard Sandiford, Peter Bergner
  Cc: GCC Development, vmakarov

Hi Richard,
Thanks a lot for your response!

Another failure reported by the Linaro CI is as follows.
(Note: I plan to send a separate mail for each failure, as this will make
the discussion easier to track.)

FAIL: gcc.target/aarch64/sve/acle/general/cpy_1.c -march=armv8.2-a+sve -moverride=tune=none  check-function-bodies dup_x0_m 

Expected code:

      ...
      add     (x[0-9]+), x0, #?1
      mov     (p[0-7])\.b, p15\.b
      mov     z0\.d, \2/m, \1
      ...
      ret


Code obtained w/o patch:
        addvl   sp, sp, #-1
        str     p15, [sp]
        add     x0, x0, 1
        mov     p3.b, p15.b
        mov     z0.d, p3/m, x0
        ldr     p15, [sp]
        addvl   sp, sp, #1
        ret

Code obtained w/ patch:
        addvl   sp, sp, #-1
        str     p15, [sp]
        mov     p3.b, p15.b
        add     x0, x0, 1
        mov     z0.d, p3/m, x0
        ldr     p15, [sp]
        addvl   sp, sp, #1
        ret

As we can see, with the patch, the following two instructions are interchanged:
        add     x0, x0, 1
        mov     p3.b, p15.b

I believe that this is fine and the test can be modified to allow it to pass on
aarch64. Please let me know what you think.

Regards,
Surya



* Re: Discussion about arm/aarch64 testcase failures seen with patch for PR111673
  2023-11-28 12:52     ` Discussion about arm/aarch64 " Surya Kumari Jangala
@ 2023-11-28 13:48       ` Richard Earnshaw
  2023-11-28 15:41         ` Richard Sandiford
  2023-12-14  7:17         ` Surya Kumari Jangala
  0 siblings, 2 replies; 10+ messages in thread
From: Richard Earnshaw @ 2023-11-28 13:48 UTC
  To: Surya Kumari Jangala, Richard Sandiford, Peter Bergner
  Cc: GCC Development, vmakarov



On 28/11/2023 12:52, Surya Kumari Jangala wrote:
> Hi Richard,
> Thanks a lot for your response!
> 
> Another failure reported by the Linaro CI is as follows :
> (Note: I am planning to send a separate mail for each failure, as this will make
> the discussion easy to track)
> 
> FAIL: gcc.target/aarch64/sve/acle/general/cpy_1.c -march=armv8.2-a+sve -moverride=tune=none  check-function-bodies dup_x0_m
> 
> Expected code:
> 
>        ...
>        add     (x[0-9]+), x0, #?1
>        mov     (p[0-7])\.b, p15\.b
>        mov     z0\.d, \2/m, \1
>        ...
>        ret
> 
> 
> Code obtained w/o patch:
>          addvl   sp, sp, #-1
>          str     p15, [sp]
>          add     x0, x0, 1
>          mov     p3.b, p15.b
>          mov     z0.d, p3/m, x0
>          ldr     p15, [sp]
>          addvl   sp, sp, #1
>          ret
> 
> Code obtained w/ patch:
>          addvl   sp, sp, #-1
>          str     p15, [sp]
>          mov     p3.b, p15.b
>          add     x0, x0, 1
>          mov     z0.d, p3/m, x0
>          ldr     p15, [sp]
>          addvl   sp, sp, #1
>          ret
> 
> As we can see, with the patch, the following two instructions are interchanged:
>          add     x0, x0, 1
>          mov     p3.b, p15.b

Indeed, both look like acceptable results to me, especially given that we
don't schedule instructions at -O1.

There are two ways of fixing this:
1) Simply swap the order to match what the compiler currently generates (which
is a little fragile, since it might flip back someday).
2) Write the test as


** (
**       add     (x[0-9]+), x0, #?1
**       mov     (p[0-7])\.b, p15\.b
**       mov     z0\.d, \2/m, \1
** |
**       mov     (p[0-7])\.b, p15\.b
**       add     (x[0-9]+), x0, #?1
**       mov     z0\.d, \1/m, \2
** )

Note, we need to swap the match names in the third insn to account for 
the different order of the earlier instructions.

Neither is ideal, but the second is perhaps a little more bomb proof.

I don't really have a strong feeling either way, but perhaps the second 
is slightly preferable.

Richard S: thoughts?

R.


* Re: Discussion about arm/aarch64 testcase failures seen with patch for PR111673
  2023-11-28 13:48       ` Richard Earnshaw
@ 2023-11-28 15:41         ` Richard Sandiford
  2023-12-14  7:17         ` Surya Kumari Jangala
  1 sibling, 0 replies; 10+ messages in thread
From: Richard Sandiford @ 2023-11-28 15:41 UTC
  To: Richard Earnshaw
  Cc: Surya Kumari Jangala, Peter Bergner, GCC Development, vmakarov

Richard Earnshaw <Richard.Earnshaw@foss.arm.com> writes:
> On 28/11/2023 12:52, Surya Kumari Jangala wrote:
>> Hi Richard,
>> Thanks a lot for your response!
>> 
>> Another failure reported by the Linaro CI is as follows :
>> (Note: I am planning to send a separate mail for each failure, as this will make
>> the discussion easy to track)
>> 
>> FAIL: gcc.target/aarch64/sve/acle/general/cpy_1.c -march=armv8.2-a+sve -moverride=tune=none  check-function-bodies dup_x0_m
>> 
>> Expected code:
>> 
>>        ...
>>        add     (x[0-9]+), x0, #?1
>>        mov     (p[0-7])\.b, p15\.b
>>        mov     z0\.d, \2/m, \1
>>        ...
>>        ret
>> 
>> 
>> Code obtained w/o patch:
>>          addvl   sp, sp, #-1
>>          str     p15, [sp]
>>          add     x0, x0, 1
>>          mov     p3.b, p15.b
>>          mov     z0.d, p3/m, x0
>>          ldr     p15, [sp]
>>          addvl   sp, sp, #1
>>          ret
>> 
>> Code obtained w/ patch:
>>          addvl   sp, sp, #-1
>>          str     p15, [sp]
>>          mov     p3.b, p15.b
>>          add     x0, x0, 1
>>          mov     z0.d, p3/m, x0
>>          ldr     p15, [sp]
>>          addvl   sp, sp, #1
>>          ret
>> 
>> As we can see, with the patch, the following two instructions are interchanged:
>>          add     x0, x0, 1
>>          mov     p3.b, p15.b
>
> Indeed, both look acceptable results to me, especially given that we 
> don't schedule results at -O1.
>
> There's two ways of fixing this:
> 1) Simply swap the order to what the compiler currently generates (which 
> is a little fragile, since it might flip back someday).
> 2) Write the test as
>
>
> ** (
> **       add     (x[0-9]+), x0, #?1
> **       mov     (p[0-7])\.b, p15\.b
> **       mov     z0\.d, \2/m, \1
> ** |
> **       mov     (p[0-7])\.b, p15\.b
> **       add     (x[0-9]+), x0, #?1
> **       mov     z0\.d, \1/m, \2
> ** )
>
> Note, we need to swap the match names in the third insn to account for 
> the different order of the earlier instructions.
>
> Neither is ideal, but the second is perhaps a little more bomb proof.
>
> I don't really have a strong feeling either way, but perhaps the second 
> is slightly preferable.
>
> Richard S: thoughts?

Yeah, I agree the second is probably better.  The | doesn't reset the
capture numbers, so I think the final instruction needs to be:

**       mov     z0\.d, \3/m, \4

Thanks,
Richard
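
As a small sketch of the numbering point above: capture groups are numbered by
the position of their opening parenthesis across the whole pattern, so the
groups in the second alternative are 3 and 4, not 1 and 2. The C code below
shows this with POSIX extended regexps, used here as a stand-in for the Tcl
regexps behind check-function-bodies (which, as far as I know, number groups
the same way). The pattern is a cut-down version of the test's and all names
are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Branch 1 holds groups 1 (x reg) and 2 (p reg);
   branch 2 holds groups 3 (p reg) and 4 (x reg). */
static const char ALT_PATTERN[] = "(x[0-9]+) (p[0-7])|(p[0-7]) (x[0-9]+)";

enum { NGROUPS = 5 };   /* slot 0 is the whole match, slots 1..4 the groups */

/* Match `text` against ALT_PATTERN and fill `m`.  Returns the number of
   parenthesised groups in the pattern, or -1 if there is no match. */
static int match_groups(const char *text, regmatch_t m[NGROUPS])
{
    regex_t re;
    int rc = regcomp(&re, ALT_PATTERN, REG_EXTENDED);
    assert(rc == 0);    /* the pattern is expected to compile */
    int nsub = (int) re.re_nsub;
    rc = regexec(&re, text, NGROUPS, m, 0);
    regfree(&re);
    return rc == 0 ? nsub : -1;
}

/* True if group `i` took part in the match (unused groups have rm_so == -1). */
static int group_set(const regmatch_t m[NGROUPS], int i)
{
    return m[i].rm_so != -1;
}
```

Matching "p3 x0" (the mov-then-add order) leaves groups 1 and 2 unset and fills
groups 3 and 4, which is why the last line of the second alternative has to
refer to \3 and \4.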


* Re: Discussion about arm/aarch64 testcase failures seen with patch for PR111673
  2023-11-28 13:48       ` Richard Earnshaw
  2023-11-28 15:41         ` Richard Sandiford
@ 2023-12-14  7:17         ` Surya Kumari Jangala
  2023-12-14 16:11           ` Richard Earnshaw (lists)
  1 sibling, 1 reply; 10+ messages in thread
From: Surya Kumari Jangala @ 2023-12-14  7:17 UTC
  To: Richard Earnshaw, Richard Sandiford, Peter Bergner
  Cc: GCC Development, vmakarov

Hi Richard,
Thanks a lot for your response!

Another failure reported by the Linaro CI is as follows:

Running gcc:gcc.dg/dg.exp ...
FAIL: gcc.dg/ira-shrinkwrap-prep-1.c scan-rtl-dump pro_and_epilogue "Performing shrink-wrapping"
FAIL: gcc.dg/pr10474.c scan-rtl-dump pro_and_epilogue "Performing shrink-wrapping"

I analyzed the failures, and the root cause is the same for both.

The test pr10474.c is as follows:

void f(int *i)
{
        if (!i)
                return;
        else
        {
                __builtin_printf("Hi");
                *i=0;
        }
}

With the patch (for PR111673), x1 (a volatile register) is assigned to hold the
value of x0 (the first parameter). Since x1 is volatile, it is saved to the stack
because there is a call later on. The save is generated by the LRA pass in the
entry basic block. Because the entry bb then uses the stack pointer, the testcase
fails to be shrink-wrapped.

The reason why LRA generates the store insn in the entry bb is as follows:
LRA emits insns to save volatile registers in the inheritance/splitting pass.
In this pass, LRA builds EBBs (extended basic blocks) and traverses the insns in
the EBBs in reverse order from the last insn to the first insn. When LRA sees a
write to a pseudo (that has been assigned a volatile register), and there is a
read following the write, with an intervening call insn between the write and read,
then LRA generates a spill immediately after the write and a restore immediately
before the read. In the above test, there is an EBB containing the entry bb and
the bb with the printf call. In the entry bb, there is a write to x1 (basically
a copy from x0 to x1) and in the printf bb, there is a read of x1 after the call
insn. So LRA generates a spill in the entry bb.

Without the patch, x19 is chosen to hold the value of x0. Since x19 is non-volatile,
the input RTL to the shrink-wrap pass does not contain any code to save x19 to the
stack. Only the insn that copies x0 to x19 is present in the entry bb. In the
shrink-wrap pass, this insn is moved down the CFG to the bb containing the call
to printf, thereby allowing the prologue to be emitted only where needed. Thus
shrink-wrapping succeeds.

Shrink-wrapping can be made to succeed if the save of x1 is emitted just before
the call insn instead of immediately after the write to x1. This ensures that the
spill does not occur in the entry bb. In fact, it is more efficient for the save
to occur only on the path containing the printf call rather than in the
entry bb.
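
To make the two placements concrete, here is a toy C model of the reverse walk
over an EBB for a single pseudo. It is only a sketch: the types, names and the
four-insn outline of pr10474.c are hypothetical, and none of it is LRA code. It
reports which basic block would receive the save under each policy.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these are not LRA data structures. */
enum insn_kind { INSN_WRITE, INSN_CALL, INSN_READ, INSN_OTHER };
enum save_policy { SAVE_AFTER_WRITE, SAVE_BEFORE_CALL };

struct insn {
    enum insn_kind kind;
    int bb;             /* index of the basic block holding the insn */
};

/* Walk the EBB from the last insn to the first.  If a write is followed by
   a call and then a read, return the basic block that would receive the
   save under `policy`; return -1 if no save is needed. */
static int save_block(const struct insn *ebb, size_t n, enum save_policy policy)
{
    assert(ebb != NULL || n == 0);
    int seen_read = 0;
    int call_bb = -1;

    for (size_t i = n; i-- > 0; ) {
        switch (ebb[i].kind) {
        case INSN_READ:
            seen_read = 1;
            break;
        case INSN_CALL:
            if (seen_read)
                call_bb = ebb[i].bb;    /* the value is live across this call */
            break;
        case INSN_WRITE:
            if (call_bb >= 0)
                return policy == SAVE_AFTER_WRITE ? ebb[i].bb : call_bb;
            seen_read = 0;              /* earlier reads see an older value */
            break;
        case INSN_OTHER:
            break;
        }
    }
    return -1;
}

/* pr10474.c in outline: bb 0 is the entry block, bb 1 holds the printf call. */
static const struct insn PR10474_EBB[] = {
    { INSN_WRITE, 0 },  /* x1 = x0 */
    { INSN_OTHER, 0 },  /* branch on i being null */
    { INSN_CALL,  1 },  /* call to printf */
    { INSN_READ,  1 },  /* *i = 0 reads x1 */
};
```

In this model SAVE_AFTER_WRITE puts the save in bb 0 (the entry block), while
SAVE_BEFORE_CALL puts it in bb 1, leaving the entry block free of stack accesses.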

I have a patch (bootstrapped and regtested on powerpc) that makes changes in
LRA to save volatile registers before a call instead of after the write to the
volatile. With this patch, both the above tests pass.

Since the patch for PR111673 has been approved by Vladimir, I plan to
commit the patch to trunk. And I will fix the test failures after doing the
commit.

Regards,
Surya


* Re: Discussion about arm/aarch64 testcase failures seen with patch for PR111673
  2023-12-14  7:17         ` Surya Kumari Jangala
@ 2023-12-14 16:11           ` Richard Earnshaw (lists)
  2023-12-15 17:04             ` Surya Kumari Jangala
  0 siblings, 1 reply; 10+ messages in thread
From: Richard Earnshaw (lists) @ 2023-12-14 16:11 UTC
  To: Surya Kumari Jangala, Richard Earnshaw, Richard Sandiford, Peter Bergner
  Cc: GCC Development, vmakarov

On 14/12/2023 07:17, Surya Kumari Jangala via Gcc wrote:
> Hi Richard,
> Thanks a lot for your response!
> 
> Another failure reported by the Linaro CI is as follows:
> 
> Running gcc:gcc.dg/dg.exp ...
> FAIL: gcc.dg/ira-shrinkwrap-prep-1.c scan-rtl-dump pro_and_epilogue "Performing shrink-wrapping"
> FAIL: gcc.dg/pr10474.c scan-rtl-dump pro_and_epilogue "Performing shrink-wrapping"
> 
> I analyzed the failures and the root cause is the same for both the failures.
> 
> The test pr10474.c is as follows:
> 
> void f(int *i)
> {
>         if (!i)
>                 return;
>         else
>         {
>                 __builtin_printf("Hi");
>                 *i=0;
>         }
> }
> 
> 
> With the patch (for PR111673), x1 (volatile) is being assigned to hold value of
> x0 (first parameter). Since it is a volatile, x1 is saved to the stack as there
> is a call later on. The save to the stack is generated in the LRA pass. The save
> is generated in the entry basic block. Due to the usage of the stack pointer in
> the entry bb, the testcase fails to be shrink wrapped.

I'm not entirely sure I understand what you mean from a quick glance.  Do you mean that X1 has the /v flag marked on it (ie it's printed in dumps as "reg/v")?  If so, that's not volatile, it just means that the register is associated with a user variable (as opposed to a compiler-generated temporary variable):

From the manual:

@item REG_USERVAR_P (@var{x})
In a @code{reg}, nonzero if it corresponds to a variable present in
the user's source code.  Zero for temporaries generated internally by
the compiler.  Stored in the @code{volatil} field and printed as
@samp{/v}.

There are several other cases where we re-use this bit on different RTL constructs to mean things other than 'volatile': it pretty much only has the conventional meaning on MEM objects.
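
To make the bit re-use concrete, here is a minimal sketch in C. It is a toy model only: the toy_* names are hypothetical and far simpler than the real rtx definition in GCC's rtl.h. It just shows one flag bit whose meaning depends on the code of the object that carries it.

```c
#include <stdbool.h>

/* Toy model only: hypothetical names, not GCC's actual rtx layout.  */
enum toy_code { TOY_REG, TOY_MEM };

struct toy_rtx {
  enum toy_code code;
  unsigned int volatil : 1;   /* One flag bit; meaning depends on code.  */
};

/* On a REG, the bit means "corresponds to a user variable".  */
bool toy_reg_uservar_p (const struct toy_rtx *x)
{
  return x->code == TOY_REG && x->volatil;
}

/* On a MEM, the bit has the conventional "volatile access" meaning.  */
bool toy_mem_volatile_p (const struct toy_rtx *x)
{
  return x->code == TOY_MEM && x->volatil;
}

/* Describe what a "/v" printed in a dump would mean for this object.  */
const char *toy_describe_v_flag (const struct toy_rtx *x)
{
  if (!x->volatil)
    return "no /v flag";
  return x->code == TOY_REG ? "user variable" : "volatile memory";
}
```

In this model a register printed with "/v" answers true to toy_reg_uservar_p and false to toy_mem_volatile_p, which is the distinction drawn above.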

> 
> The reason why LRA generates the store insn in the entry bb is as follows:
> LRA emits insns to save volatile registers in the inheritance/splitting pass.
> In this pass, LRA builds EBBs (Extended Basic Block) and traverses the insns in
> the EBBs in reverse order from the last insn to the first insn. When LRA sees a
> write to a pseudo (that has been assigned a volatile register), and there is a
> read following the write, with an intervening call insn between the write and read,
> then LRA generates a spill immediately after the write and a restore immediately
> before the read. In the above test, there is an EBB containing the entry bb and
> the bb with the printf call. In the entry bb, there is a write to x1 (basically
> a copy from x0 to x1) and in the printf bb, there is a read of x1 after the call
> insn. So LRA generates a spill in the entry bb.
> 
> Without patch, x19 is chosen to hold the value of x0. Since x19 is a non-volatile,
> the input RTL to the shrink wrap pass does not have any code to save x19 to the
> stack. Only the insn that copies x0 to x19 is present in the entry bb. In the
> shrink wrap pass, this insn is moved down the cfg to the bb containing the call
> to printf, thereby allowing prolog to be allocated only where needed. Thus shrink
> wrap succeeds.
> 
> 
> Shrink wrap can be made to succeed if the save of x1 occurs just before the call
> insn, instead of generating it after the write to x1. This will ensure that the
> spill does not occur in the entry bb. In fact, it is more efficient if the save
> occurs only in the path containing the printf call instead of occurring in the
> entry bb.
> 
> I have a patch (bootstrapped and regtested on powerpc) that makes changes in
> LRA to save volatile registers before a call instead of after the write to the
> volatile. With this patch, both the above tests pass.
> 
> Since the patch for PR111673 has been approved by Vladimir, I plan to
> commit the patch to trunk. And I will fix the test failures after doing the
> commit.
> 

I think I'd probably understand this better if you could give some example RTL (before and after).  Do you have that?

R.

> Regards,
> Surya
> 
> 
> 
> On 28/11/23 7:18 pm, Richard Earnshaw wrote:
>>
>>
>> On 28/11/2023 12:52, Surya Kumari Jangala wrote:
>>> Hi Richard,
>>> Thanks a lot for your response!
>>>
>>> Another failure reported by the Linaro CI is as follows :
>>> (Note: I am planning to send a separate mail for each failure, as this will make
>>> the discussion easy to track)
>>>
>>> FAIL: gcc.target/aarch64/sve/acle/general/cpy_1.c -march=armv8.2-a+sve -moverride=tune=none  check-function-bodies dup_x0_m
>>>
>>> Expected code:
>>>
>>>        ...
>>>        add     (x[0-9]+), x0, #?1
>>>        mov     (p[0-7])\.b, p15\.b
>>>        mov     z0\.d, \2/m, \1
>>>        ...
>>>        ret
>>>
>>>
>>> Code obtained w/o patch:
>>>          addvl   sp, sp, #-1
>>>          str     p15, [sp]
>>>          add     x0, x0, 1
>>>          mov     p3.b, p15.b
>>>          mov     z0.d, p3/m, x0
>>>          ldr     p15, [sp]
>>>          addvl   sp, sp, #1
>>>          ret
>>>
>>> Code obtained w/ patch:
>>>     addvl   sp, sp, #-1
>>>          str     p15, [sp]
>>>          mov     p3.b, p15.b
>>>          add     x0, x0, 1
>>>          mov     z0.d, p3/m, x0
>>>          ldr     p15, [sp]
>>>          addvl   sp, sp, #1
>>>          ret
>>>
>>> As we can see, with the patch, the following two instructions are interchanged:
>>>          add     x0, x0, 1
>>>          mov     p3.b, p15.b
>>
>> Indeed, both look acceptable results to me, especially given that we don't schedule results at -O1.
>>
>> There's two ways of fixing this:
>> 1) Simply swap the order to what the compiler currently generates (which is a little fragile, since it might flip back someday).
>> 2) Write the test as
>>
>>
>> ** (
>> **       add     (x[0-9]+), x0, #?1
>> **       mov     (p[0-7])\.b, p15\.b
>> **       mov     z0\.d, \2/m, \1
>> ** |
>> **       mov     (p[0-7])\.b, p15\.b
>> **       add     (x[0-9]+), x0, #?1
>> **       mov     z0\.d, \1/m, \2
>> ** )
>>
>> Note, we need to swap the match names in the third insn to account for the different order of the earlier instructions.
>>
>> Neither is ideal, but the second is perhaps a little more bomb proof.
>>
>> I don't really have a strong feeling either way, but perhaps the second is slightly preferable.
>>
>> Richard S: thoughts?
>>
>> R.
>>
>>> I believe that this is fine and the test can be modified to allow it to pass on
>>> aarch64. Please let me know what you think.
>>>
>>> Regards,
>>> Surya
>>>
>>>
>>> On 24/11/23 4:18 pm, Richard Earnshaw wrote:
>>>>
>>>>
>>>> On 24/11/2023 08:09, Surya Kumari Jangala via Gcc wrote:
>>>>> Hi Richard,
>>>>> Ping. Please let me know if the test failure that I mentioned in the mail below can be handled by changing the expected generated code. I am not conversant with arm, and hence would appreciate your help.
>>>>>
>>>>> Regards,
>>>>> Surya
>>>>>
>>>>> On 03/11/23 4:58 pm, Surya Kumari Jangala wrote:
>>>>>> Hi Richard,
>>>>>> I had submitted a patch for review (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/631849.html)
>>>>>> regarding scaling save/restore costs of callee save registers with block
>>>>>> frequency in the IRA pass (PR111673).
>>>>>>
>>>>>> This patch has been approved by VMakarov
>>>>>> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632089.html).
>>>>>>
>>>>>> With this patch, we are seeing performance improvements with spec on x86
>>>>>> (exchange: 5%, xalancbmk: 2.5%) and on Power (perlbench: 5.57%).
>>>>>>
>>>>>> I received a mail from Linaro about some failures seen in the CI pipeline with
>>>>>> this patch. I have analyzed the failures and I wish to discuss the analysis with you.
>>>>>>
>>>>>> One failure reported by the Linaro CI is:
>>>>>>
>>>>>> FAIL: gcc.target/arm/pr111235.c scan-assembler-times ldrexd\tr[0-9]+, r[0-9]+, \\[r[0-9]+\\] 2
>>>>>>
>>>>>> The diff in the assembly between trunk and patch is:
>>>>>>
>>>>>> 93c93
>>>>>> <       push    {r4, r5}
>>>>>> ---
>>>>>>>         push    {fp}
>>>>>> 95c95
>>>>>> <       ldrexd  r4, r5, [r0]
>>>>>> ---
>>>>>>>         ldrexd  fp, ip, [r0]
>>>>>> 99c99
>>>>>> <       pop     {r4, r5}
>>>>>> ---
>>>>>>>         ldr     fp, [sp], #4
>>>>>>
>>>>>>
>>>>>> The test fails with patch because the ldrexd insn uses fp & ip registers instead
>>>>>> of r[0-9]+
>>>>>>
>>>>>> But the code produced by patch is better because it is pushing and restoring only
>>>>>> one register (fp) instead of two registers (r4, r5). Hence, this test can be
>>>>>> modified to allow it to pass on arm. Please let me know what you think.
>>>>>>
>>>>>> If you need more information, please let me know. I will be sending separate mails
>>>>>> for the other test failures.
>>>>>>
>>>>
>>>> Thanks for looking at this.
>>>>
>>>>
>>>> The key part of this test is that the compiler generates LDREXD.  The registers used for that are pretty much irrelevant as we don't match them to any other operations within the test.  So I'd recommend just testing for the mnemonic and not for any of the operands (ie just match "ldrexd\t").
>>>>
>>>> R.
>>>>
>>>>>> Regards,
>>>>>> Surya
>>>>>>
>>>>>>
>>>>>>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Discussion about arm/aarch64 testcase failures seen with patch for PR111673
  2023-12-14 16:11           ` Richard Earnshaw (lists)
@ 2023-12-15 17:04             ` Surya Kumari Jangala
  2024-01-29  6:14               ` Surya Kumari Jangala
  0 siblings, 1 reply; 10+ messages in thread
From: Surya Kumari Jangala @ 2023-12-15 17:04 UTC (permalink / raw)
  To: Richard Earnshaw (lists),
	Richard Earnshaw, Richard Sandiford, Peter Bergner
  Cc: GCC Development, vmakarov

Hi Richard,
Here are more details about the testcase failure and my analysis/fix:

Testcase:

void f(int *i)
{
        if (!i)
                return;
        else
        {
                __builtin_printf("Hi");
                *i=0;
        }
}

----------

Assembly w/o patch:
        cbz     x0, .L7
        stp     x29, x30, [sp, -32]!
        mov     x29, sp
        str     x19, [sp, 16]
        mov     x19, x0
        adrp    x0, .LC0
        add     x0, x0, :lo12:.LC0
        bl      printf
        str     wzr, [x19]
        ldr     x19, [sp, 16]
        ldp     x29, x30, [sp], 32
        ret
        .p2align 2,,3
.L7:
        ret

-----------

Assembly w/ patch:
        stp     x29, x30, [sp, -32]!
        mov     x29, sp
        str     x0, [sp, 24]
        cbz     x0, .L1
        adrp    x0, .LC0
        add     x0, x0, :lo12:.LC0
        bl      printf
        ldr     x1, [sp, 24]
        str     wzr, [x1]
.L1:
        ldp     x29, x30, [sp], 32
        ret


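As we can see above, w/o patch the test case gets shrink wrapped: the frame
setup (stp/mov) is executed only on the path that calls printf. W/ patch, the
frame setup is executed at function entry, before the cbz.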

Input RTL to the LRA pass (the RTL is the same both w/ and w/o patch):

BB2:
  set r95, x0
  set r92, r95
  if (r92 eq 0) jump BB4
BB3:
  set x0, symbol-ref("Hi")
  x0 = call printf
  set mem(r92), 0
BB4:
  ret


Register assignment by IRA:
w/o patch:
  r92-->x19
  r95-->x0
  r94-->x0

w/ patch:
  r92-->x1
  r95-->x0
  r94-->x0


RTL after LRA:

w/o patch:
BB2:
  set x19, x0
  if (x19 eq 0) jump BB4
BB3:
  set x0, symbol-ref("Hi")
  x0 = call printf
  set mem(x19), 0
BB4:
  ret


w/ patch:
BB2:
  set x1, x0
  set mem(sp+24), x1
  if (x1 eq 0) jump BB4
BB3:
  set x0, symbol-ref("Hi")
  x0 = call printf
  set x1, mem(sp+24)
  set mem(x1), 0
BB4:
  ret


The difference between w/o patch and w/ patch is that w/o patch, a callee-save
register (x19) is chosen to hold the value of x0 (input parameter register),
while w/ patch, a caller-save register (x1) is chosen.

W/o patch, during the shrink wrap pass, first copy propagation is done and
the 'if' insn in BB2 is changed as follows:
  set x19, x0
  if (x19 eq 0) jump BB4

changed to:
  set x19, x0
  if (x0 eq 0) jump BB4   

Next, the insn "set x19, x0" is moved down the cfg to BB3. Since x19 is a
callee-save register, the prologue gets generated in BB3, thereby resulting in
successful shrink wrapping.

W/ patch, during the shrink wrap pass, copy propagation changes BB2 as follows:
  set x1, x0
  set mem(sp+24), x1
  if (x1 eq 0) jump BB4

changed to:
  set x1, x0
  set mem(sp+24), x0
  if (x0 eq 0) jump BB4

However, the store insn (set mem(sp+24), x0) cannot be moved down to BB3.
Hence the prologue gets generated in BB2 itself due to the use of 'sp', and
shrink wrapping fails.
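
The condition described above can be sketched as a small toy model in C. This is only an illustration under simplifying assumptions, not the logic of GCC's shrink-wrap pass: the toy_* names are hypothetical, and a block is assumed to need the prologue exactly when it touches the stack, sets a callee-save register, or contains a call.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy description of a basic block; hypothetical, not GCC's basic_block.  */
struct toy_bb {
  const char *name;
  bool uses_stack;         /* e.g. "set mem(sp+24), x0"  */
  bool sets_callee_saved;  /* e.g. "set x19, x0"  */
  bool has_call;           /* e.g. "x0 = call printf"  */
};

/* Assumption of this sketch: any of the three properties forces the
   prologue to have run before the block executes.  */
bool toy_bb_needs_prologue (const struct toy_bb *bb)
{
  return bb->uses_stack || bb->sets_callee_saved || bb->has_call;
}

/* blocks[0] is the entry block.  Shrink wrapping is possible in this
   model only if the entry block itself does not need the prologue and
   some later block does.  */
bool toy_can_shrink_wrap (const struct toy_bb *blocks, size_t n)
{
  if (n == 0 || toy_bb_needs_prologue (&blocks[0]))
    return false;
  for (size_t i = 1; i < n; i++)
    if (toy_bb_needs_prologue (&blocks[i]))
      return true;
  return false;
}
```

With BB2 marked as using the stack (the w/ patch case), toy_can_shrink_wrap returns false. With the copy to x19 sunk into BB3 (the w/o patch case), BB2 has none of the three properties and the function returns true.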

The store insn (which basically saves x1 to stack) is generated by the
LRA pass. This insn is needed because x1 is a caller-save register and we
have a call insn that will clobber this register. However, the store insn is generated
in the entry BB (BB2) instead of in BB3 which has the call insn. If the store
is generated in BB3, then the testcase will be shrink wrapped successfully.
In fact, it is more efficient if the store occurs only in the path containing
the printf call instead of occurring in the entry bb.

The reason why LRA generates the store insn in the entry bb is as follows:
LRA emits insns to save caller-save registers in the inheritance/splitting pass.
In this pass, LRA builds EBBs (Extended Basic Blocks) and traverses the insns in
the EBBs in reverse order from the last insn to the first insn. When LRA sees a
write to a pseudo (that has been assigned a caller-save register), and there is a
read following the write, with an intervening call insn between the write and read,
then LRA generates a spill immediately after the write and a restore immediately
before the read. The spill is needed because the call insn will clobber the
caller-save register.

In the above testcase, LRA forms two EBBs: the first EBB contains BB2 & BB3 while
the second EBB contains BB4. 

In BB2, there is a write to x1 in the insn:
  set r92, r95      // r92 is assigned x1 and r95 is assigned x0

In BB3, there is a read of x1 after the call insn:
  set mem(r92), 0   // r92 is assigned x1

So LRA generates a spill in BB2 after the write to x1.

I have a patch (bootstrapped and regtested on powerpc) that makes changes in
LRA to save caller-save registers before a call instead of after the write to the
caller-save register. With this patch, both of the failing tests
(ira-shrinkwrap-prep-1.c and pr10474.c) get successfully shrink wrapped. After
committing the patch for PR111673, I plan to get the LRA fix reviewed.
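
To illustrate the two placements, here is a minimal sketch in C. It is a toy model, not LRA code: all toy_* names are hypothetical, the EBB is a flat array of insns for a single caller-save register, and the array is scanned forward for simplicity (LRA itself walks the EBB in reverse). It reports the basic block in which the save lands under each strategy.

```c
#include <stddef.h>

/* Toy insn kinds with respect to one caller-save register.  */
enum toy_kind { TOY_WRITE, TOY_READ, TOY_CALL, TOY_OTHER };

struct toy_insn {
  enum toy_kind kind;
  int bb;               /* Number of the basic block holding the insn.  */
};

/* Find the first write that is followed by a call and then a read,
   i.e. a value that is live across the call.  Returns 1 and fills in
   the indices if such a pattern exists, 0 otherwise.  */
int toy_find_write_and_call (const struct toy_insn *insns, size_t n,
                             size_t *write_idx, size_t *call_idx)
{
  size_t w = 0, c, r;

  while (w < n && insns[w].kind != TOY_WRITE)
    w++;
  if (w == n)
    return 0;
  for (c = w + 1; c < n && insns[c].kind != TOY_CALL; c++)
    ;
  if (c >= n)
    return 0;
  for (r = c + 1; r < n && insns[r].kind != TOY_READ; r++)
    ;
  if (r >= n)
    return 0;
  *write_idx = w;
  *call_idx = c;
  return 1;
}

/* Current behaviour: save right after the write.  Returns the block
   that receives the save, or -1 if no save is needed.  */
int toy_save_bb_after_write (const struct toy_insn *insns, size_t n)
{
  size_t w, c;
  return toy_find_write_and_call (insns, n, &w, &c) ? insns[w].bb : -1;
}

/* Proposed behaviour: save right before the call.  */
int toy_save_bb_before_call (const struct toy_insn *insns, size_t n)
{
  size_t w, c;
  return toy_find_write_and_call (insns, n, &w, &c) ? insns[c].bb : -1;
}
```

For the testcase above (write and compare in BB2; call and read in BB3), the first strategy places the save in BB2 and the second places it in BB3, which keeps the stack access out of the entry block.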

Please let me know if you need more information.

Regards,
Surya



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Discussion about arm/aarch64 testcase failures seen with patch for PR111673
  2023-12-15 17:04             ` Surya Kumari Jangala
@ 2024-01-29  6:14               ` Surya Kumari Jangala
  0 siblings, 0 replies; 10+ messages in thread
From: Surya Kumari Jangala @ 2024-01-29  6:14 UTC (permalink / raw)
  To: Richard Earnshaw (lists),
	Richard Earnshaw, Richard Sandiford, Peter Bergner
  Cc: GCC Development, vmakarov

Hi Richard,
Ping. 
I have provided a root cause analysis of two test failures on arm with my patch for PR111673. I have also proposed a fix in LRA for these failures.
Please let me know if the LRA fix is fine. If so, I can ignore these two arm test failures for now, and check in the LRA patch after checking in the patch for PR111673.

Regards,
Surya

On 15/12/23 10:34 pm, Surya Kumari Jangala wrote:
> Hi Richard,
> Here are more details about the testcase failure and my analysis/fix:
> 
> Testcase:
> 
> void f(int *i)
> {
>         if (!i)
>                 return;
>         else
>         {
>                 __builtin_printf("Hi");
>                 *i=0;
>         }
> }
> 
> ----------
> 
> Assembly w/o patch:
>         cbz     x0, .L7
>         stp     x29, x30, [sp, -32]!
>         mov     x29, sp
>         str     x19, [sp, 16]
>         mov     x19, x0
>         adrp    x0, .LC0
>         add     x0, x0, :lo12:.LC0
>         bl      printf
>         str     wzr, [x19]
>         ldr     x19, [sp, 16]
>         ldp     x29, x30, [sp], 32
>         ret
>         .p2align 2,,3
> .L7:
>         ret
> 
> -----------
> 
> Assembly w/ patch:
>         stp     x29, x30, [sp, -32]!
>         mov     x29, sp
>         str     x0, [sp, 24]
>         cbz     x0, .L1
>         adrp    x0, .LC0
>         add     x0, x0, :lo12:.LC0
>         bl      printf
>         ldr     x1, [sp, 24]
>         str     wzr, [x1]
> .L1:
>         ldp     x29, x30, [sp], 32
>         ret
> 
> 
> As we can see above, w/o patch the test case gets shrink wrapped.
> 
> Input RTL to the LRA pass (the RTL is same both w/ and w/o patch):
> 
> BB2:
>   set r95, x0
>   set r92, r95
>   if (r92 eq 0) jump BB4
> BB3:
>   set x0, symbol-ref("Hi")
>   x0 = call printf
>   set mem(r92), 0
> BB4:
>   ret
> 
> 
> Register assignment by IRA:
> w/o patch:
>   r92-->x19
>   r95-->x0
>   r94-->x0
> 
> w/ patch:
>   r92-->x1
>   r95-->x0
>   r94-->x0
> 
> 
> RTL after LRA:
> 
> w/o patch:
> BB2:
>   set x19, x0
>   if (x19 eq 0) jump BB4
> BB3:
>   set x0, symbol-ref("Hi")
>   x0 = call printf
>   set mem(x19), 0
> BB4:
>   ret
> 
> 
> w/ patch:
> BB2:
>   set x1, x0
>   set mem(sp+24), x1
>   if (x1 eq 0) jump BB4
> BB3:
>   set x0, symbol-ref("Hi")
>   x0 = call printf
>   set x1, mem(sp+24)
>   set mem(x1), 0
> BB4:
>   ret
> 
> 
> The difference between w/o patch and w/ patch is that w/o patch, a callee-save
> register (x19) is chosen to hold the value of x0 (input parameter register). While
> w/ patch, a caller-save register (x1) is chosen.
> 
> W/o patch, during the shrink wrap pass, first copy propagation is done and
> the 'if' insn in BB2 is changed as follows:
>   set x19, x0
>   if (x19 eq 0) jump BB4
> 
> changed to:
>   set x19, x0
>   if (x0 eq 0) jump BB4   
> 
> Next, the insn "set x19, x0" is moved down the cfg to BB3. Since x19 is a
> callee-save register, prolog gets generated in BB3 thereby resulting in
> successful shrink wrapping.
> 
> W/ patch, during the shrink wrap pass, copy propagation changes BB2 as follows:
>   set x1, x0
>   set mem(sp+24), x1
>   if (x1 eq 0) jump BB4
> 
> changed to:
>   set x1, x0
>   set mem(sp+24), x0
>   if (x0 eq 0) jump BB4
> 
> However the store insn (set mem[sp+24], x0) cannot be moved down to BB3.
> hence prolog gets generated in BB2 itself due to the use of 'sp'. Thereby
> shrink wrap fails.
> 
> The store insn (which basically saves x1 to stack) is generated by the
> LRA pass. This insn is needed because x1 is a caller-save register and we
> have a call insn that will clobber this register. However, the store insn is generated
> in the entry BB (BB2) instead of in BB3 which has the call insn. If the store
> is generated in BB3, then the testcase will be shrink wrapped successfully.
> In fact, it is more efficient if the store occurs only in the path containing
> the printf call instead of occurring in the entry bb.
> 
> The reason why LRA generates the store insn in the entry bb is as follows:
> LRA emits insns to save caller-save registers in the inheritance/splitting pass.
> In this pass, LRA builds EBBs (Extended Basic Block) and traverses the insns in
> the EBBs in reverse order from the last insn to the first insn. When LRA sees a
> write to a pseudo (that has been assigned a caller-save register), and there is a
> read following the write, with an intervening call insn between the write and read,
> then LRA generates a spill immediately after the write and a restore immediately
> before the read. The spill is needed because the call insn will clobber the
> caller-save register.
> 
> In the above testcase, LRA forms two EBBs: the first EBB contains BB2 & BB3 while
> the second EBB contains BB4. 
> 
> In BB2, there is a write to x1 in the insn : 
> set r92, r95 //r92 is assigned x1 and r95 is assigned x0
> 
> In BB3, there is a read of x1 after the call
> insn.
> set mem(r92), 0   // r92 is assigned x1
> 
> So LRA generates a spill in BB2 after the write to x1.
> 
> I have a patch (bootstrapped and regtested on powerpc) that makes changes in
> LRA to save caller-save registers before a call instead of after the write to the
> caller-save register. With this patch, both the above test gets successfully
> shrink wrapped. After committing the patch for PR111673, I plan to get the 
> LRA fix reviewed.
> 
> Please let me know if you need more information.
> 
> Regards,
> Surya
> 
> 
> On 14/12/23 9:41 pm, Richard Earnshaw (lists) wrote:
>> On 14/12/2023 07:17, Surya Kumari Jangala via Gcc wrote:
>>> Hi Richard,
>>> Thanks a lot for your response!
>>>
>>> Another failure reported by the Linaro CI is as follows:
>>>
>>> Running gcc:gcc.dg/dg.exp ...
>>> FAIL: gcc.dg/ira-shrinkwrap-prep-1.c scan-rtl-dump pro_and_epilogue "Performing shrink-wrapping"
>>> FAIL: gcc.dg/pr10474.c scan-rtl-dump pro_and_epilogue "Performing shrink-wrapping"
>>>
>>> I analyzed the failures and the root cause is the same for both the failures.
>>>
>>> The test pr10474.c is as follows:
>>>
>>> void f(int *i)
>>> {
>>>         if (!i)
>>>                 return;
>>>         else
>>>         {
>>>                 __builtin_printf("Hi");
>>>                 *i=0;
>>>         }
>>> }
>>>
>>>
>>> With the patch (for PR111673), x1 (volatile) is being assigned to hold value of
>>> x0 (first parameter). Since it is a volatile, x1 is saved to the stack as there
>>> is a call later on. The save to the stack is generated in the LRA pass. The save
>>> is generated in the entry basic block. Due to the usage of the stack pointer in
>>> the entry bb, the testcase fails to be shrink wrapped.
>>
>> I'm not entirely sure I understand what you mean from a quick glance.  Do you mean that X1 has the /v flag marked on it (ie it's printed in dumps as "reg/v")?  If so, that's not volatile, it just means that the register is associated with a user variable (as opposed to a compiler-generated temporary variable):
>>
>> From the manual:
>>
>> @item REG_USERVAR_P (@var{x})
>> In a @code{reg}, nonzero if it corresponds to a variable present in
>> the user's source code.  Zero for temporaries generated internally by
>> the compiler.  Stored in the @code{volatil} field and printed as
>> @samp{/v}.
>>
>> There are several other cases where we re-use this bit on different RTL constructs to mean things other than 'volatile': it pretty much only has the conventional meaning on MEM objects.
>>
>>>
>>> The reason why LRA generates the store insn in the entry bb is as follows:
>>> LRA emits insns to save volatile registers in the inheritance/splitting pass.
>>> In this pass, LRA builds EBBs (Extended Basic Block) and traverses the insns in
>>> the EBBs in reverse order from the last insn to the first insn. When LRA sees a
>>> write to a pseudo (that has been assigned a volatile register), and there is a
>>> read following the write, with an intervening call insn between the write and read,
>>> then LRA generates a spill immediately after the write and a restore immediately
>>> before the read. In the above test, there is an EBB containing the entry bb and
>>> the bb with the printf call. In the entry bb, there is a write to x1 (basically
>>> a copy from x0 to x1) and in the printf bb, there is a read of x1 after the call
>>> insn. So LRA generates a spill in the entry bb.
>>>
>>> Without patch, x19 is chosen to hold the value of x0. Since x19 is a non-volatile,
>>> the input RTL to the shrink wrap pass does not have any code to save x19 to the
>>> stack. Only the insn that copies x0 to x19 is present in the entry bb. In the
>>> shrink wrap pass, this insn is moved down the cfg to the bb containing the call
>>> to printf, thereby allowing prolog to be allocated only where needed. Thus shrink
>>> wrap succeeds.
>>>
>>>
>>> Shrink wrap can be made to succeed if the save of x1 occurs just before the call
>>> insn, instead of generating it after the write to x1. This will ensure that the
>>> spill does not occur in the entry bb. In fact, it is more efficient if the save
>>> occurs only in the path containing the printf call instead of occurring in the
>>> entry bb.
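
The two save placements described above can be illustrated with a toy model. This is only a sketch and not LRA code: the insn tuples, the block names and the function `insert_save_restore` are hypothetical, and the model assumes a single write, a single call and a read after the call inside one EBB. It walks the insns from last to first, as the mail describes, and shows which block ends up holding the save.

```python
def insert_save_restore(insns, reg, save_before_call=False):
    """Toy model: insns is a list of (bb, op, operand) in program order.

    Returns a new list with a save and a restore of `reg` inserted around
    a call that sits between a write and a later read of `reg`.
    """
    out = list(insns)
    read_idx = None
    call_idx = None
    # Walk from the last insn to the first.
    for i in range(len(insns) - 1, -1, -1):
        bb, op, operand = insns[i]
        if op == "read" and operand == reg and call_idx is None:
            read_idx = i
        elif op == "call" and read_idx is not None:
            call_idx = i
        elif op == "write" and operand == reg and call_idx is not None:
            if save_before_call:
                # Proposed placement: save immediately before the call.
                save_idx, save_bb = call_idx, insns[call_idx][0]
            else:
                # Current placement: save immediately after the write.
                save_idx, save_bb = i + 1, bb
            # Insert the later insn first so the earlier index stays valid.
            out.insert(read_idx, (insns[read_idx][0], "restore", reg))
            out.insert(save_idx, (save_bb, "save", reg))
            break
    return out


# Hypothetical EBB: the entry bb followed by the bb with the printf call.
ebb = [
    ("entry", "write", "x1"),        # copy of x0 into x1
    ("entry", "branch", None),
    ("printf_bb", "call", "printf"),
    ("printf_bb", "read", "x1"),
]

after_write = insert_save_restore(ebb, "x1")
before_call = insert_save_restore(ebb, "x1", save_before_call=True)
```

In this model the first placement puts the save in the entry block, while the second keeps the entry block free of stack accesses, which is the property shrink wrapping needs.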
>>>
>>> I have a patch (bootstrapped and regtested on powerpc) that makes changes in
>>> LRA to save volatile registers before a call instead of after the write to the
>>> volatile. With this patch, both the above tests pass.
>>>
>>> Since the patch for PR111673 has been approved by Vladimir, I plan to
>>> commit the patch to trunk. And I will fix the test failures after doing the
>>> commit.
>>>
>>
>> I think I'd probably understand this better if you could give some example RTL (before and after).  Do you have that?
>>
>> R.
>>
>>> Regards,
>>> Surya
>>>
>>>
>>>
>>> On 28/11/23 7:18 pm, Richard Earnshaw wrote:
>>>>
>>>>
>>>> On 28/11/2023 12:52, Surya Kumari Jangala wrote:
>>>>> Hi Richard,
>>>>> Thanks a lot for your response!
>>>>>
>>>>> Another failure reported by the Linaro CI is as follows:
>>>>> (Note: I am planning to send a separate mail for each failure, as this will make
>>>>> the discussion easy to track)
>>>>>
>>>>> FAIL: gcc.target/aarch64/sve/acle/general/cpy_1.c -march=armv8.2-a+sve -moverride=tune=none  check-function-bodies dup_x0_m
>>>>>
>>>>> Expected code:
>>>>>
>>>>>        ...
>>>>>        add     (x[0-9]+), x0, #?1
>>>>>        mov     (p[0-7])\.b, p15\.b
>>>>>        mov     z0\.d, \2/m, \1
>>>>>        ...
>>>>>        ret
>>>>>
>>>>>
>>>>> Code obtained w/o patch:
>>>>>          addvl   sp, sp, #-1
>>>>>          str     p15, [sp]
>>>>>          add     x0, x0, 1
>>>>>          mov     p3.b, p15.b
>>>>>          mov     z0.d, p3/m, x0
>>>>>          ldr     p15, [sp]
>>>>>          addvl   sp, sp, #1
>>>>>          ret
>>>>>
>>>>> Code obtained w/ patch:
>>>>>          addvl   sp, sp, #-1
>>>>>          str     p15, [sp]
>>>>>          mov     p3.b, p15.b
>>>>>          add     x0, x0, 1
>>>>>          mov     z0.d, p3/m, x0
>>>>>          ldr     p15, [sp]
>>>>>          addvl   sp, sp, #1
>>>>>          ret
>>>>>
>>>>> As we can see, with the patch, the following two instructions are interchanged:
>>>>>          add     x0, x0, 1
>>>>>          mov     p3.b, p15.b
>>>>
>>>> Indeed, both look acceptable results to me, especially given that we don't schedule results at -O1.
>>>>
>>>> There are two ways of fixing this:
>>>> 1) Simply swap the order to what the compiler currently generates (which is a little fragile, since it might flip back someday).
>>>> 2) Write the test as
>>>>
>>>>
>>>> ** (
>>>> **       add     (x[0-9]+), x0, #?1
>>>> **       mov     (p[0-7])\.b, p15\.b
>>>> **       mov     z0\.d, \2/m, \1
>>>> ** |
>>>> **       mov     (p[0-7])\.b, p15\.b
>>>> **       add     (x[0-9]+), x0, #?1
>>>> **       mov     z0\.d, \1/m, \2
>>>> ** )
>>>>
>>>> Note, we need to swap the match names in the third insn to account for the different order of the earlier instructions.
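
The either-order idea can be tried out with an ordinary regex engine. The sketch below uses Python's `re` module, not the DejaGnu check-function-bodies harness, and assumes tab-separated assembly. One point to watch: Python numbers capture groups across the whole pattern, so the groups in the second alternative are 3 and 4 rather than 1 and 2. Whether the test harness numbers groups the same way is not verified here and should be checked before copying the back-references.

```python
import re

# Either instruction order is accepted; the back-references tie the final
# mov to the registers captured in the same alternative.
body = re.compile(
    r"(?:"
    r"\tadd\t(x[0-9]+), x0, #?1\n"
    r"\tmov\t(p[0-7])\.b, p15\.b\n"
    r"\tmov\tz0\.d, \2/m, \1\n"
    r"|"
    r"\tmov\t(p[0-7])\.b, p15\.b\n"
    r"\tadd\t(x[0-9]+), x0, #?1\n"
    r"\tmov\tz0\.d, \3/m, \4\n"
    r")"
)

# The three relevant insns from the two listings quoted above.
without_patch = "\tadd\tx0, x0, 1\n\tmov\tp3.b, p15.b\n\tmov\tz0.d, p3/m, x0\n"
with_patch = "\tmov\tp3.b, p15.b\n\tadd\tx0, x0, 1\n\tmov\tz0.d, p3/m, x0\n"
# Hypothetical bad output: the final mov uses a different predicate register.
wrong_pred = "\tmov\tp3.b, p15.b\n\tadd\tx0, x0, 1\n\tmov\tz0.d, p2/m, x0\n"


def matches(asm):
    return body.search(asm) is not None
```

With this pattern both quoted orderings match, and an output whose final mov uses a register other than the captured one is still rejected.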
>>>>
>>>> Neither is ideal, but the second is perhaps a little more bomb proof.
>>>>
>>>> I don't really have a strong feeling either way, but perhaps the second is slightly preferable.
>>>>
>>>> Richard S: thoughts?
>>>>
>>>> R.
>>>>
>>>>> I believe that this is fine and the test can be modified to allow it to pass on
>>>>> aarch64. Please let me know what you think.
>>>>>
>>>>> Regards,
>>>>> Surya
>>>>>
>>>>>
>>>>> On 24/11/23 4:18 pm, Richard Earnshaw wrote:
>>>>>>
>>>>>>
>>>>>> On 24/11/2023 08:09, Surya Kumari Jangala via Gcc wrote:
>>>>>>> Hi Richard,
>>>>>>> Ping. Please let me know if the test failure that I mentioned in the mail below can be handled by changing the expected generated code. I am not conversant with arm, and hence would appreciate your help.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Surya
>>>>>>>
>>>>>>> On 03/11/23 4:58 pm, Surya Kumari Jangala wrote:
>>>>>>>> Hi Richard,
>>>>>>>> I had submitted a patch for review (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/631849.html)
>>>>>>>> regarding scaling save/restore costs of callee save registers with block
>>>>>>>> frequency in the IRA pass (PR111673).
>>>>>>>>
>>>>>>>> This patch has been approved by VMakarov
>>>>>>>> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632089.html).
>>>>>>>>
>>>>>>>> With this patch, we are seeing performance improvements with spec on x86
>>>>>>>> (exchange: 5%, xalancbmk: 2.5%) and on Power (perlbench: 5.57%).
>>>>>>>>
>>>>>>>> I received a mail from Linaro about some failures seen in the CI pipeline with
>>>>>>>> this patch. I have analyzed the failures and I wish to discuss the analysis with you.
>>>>>>>>
>>>>>>>> One failure reported by the Linaro CI is:
>>>>>>>>
>>>>>>>> FAIL: gcc.target/arm/pr111235.c scan-assembler-times ldrexd\tr[0-9]+, r[0-9]+, \\[r[0-9]+\\] 2
>>>>>>>>
>>>>>>>> The diff in the assembly between trunk and patch is:
>>>>>>>>
>>>>>>>> 93c93
>>>>>>>> <       push    {r4, r5}
>>>>>>>> ---
>>>>>>>>>         push    {fp}
>>>>>>>> 95c95
>>>>>>>> <       ldrexd  r4, r5, [r0]
>>>>>>>> ---
>>>>>>>>>         ldrexd  fp, ip, [r0]
>>>>>>>> 99c99
>>>>>>>> <       pop     {r4, r5}
>>>>>>>> ---
>>>>>>>>>         ldr     fp, [sp], #4
>>>>>>>>
>>>>>>>>
>>>>>>>> The test fails with patch because the ldrexd insn uses fp & ip registers instead
>>>>>>>> of r[0-9]+
>>>>>>>>
>>>>>>>> But the code produced by patch is better because it is pushing and restoring only
>>>>>>>> one register (fp) instead of two registers (r4, r5). Hence, this test can be
>>>>>>>> modified to allow it to pass on arm. Please let me know what you think.
>>>>>>>>
>>>>>>>> If you need more information, please let me know. I will be sending separate mails
>>>>>>>> for the other test failures.
>>>>>>>>
>>>>>>
>>>>>> Thanks for looking at this.
>>>>>>
>>>>>>
>>>>>> The key part of this test is that the compiler generates LDREXD.  The registers used for that are pretty much irrelevant as we don't match them to any other operations within the test.  So I'd recommend just testing for the mnemonic and not for any of the operands (ie just match "ldrexd\t").
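
As a quick check of the mnemonic-only suggestion, the two patterns can be compared on the lines from the diff. This is a Python sketch rather than the actual scan-assembler-times machinery; it assumes the assembler output separates the mnemonic from its operands with a tab, as the original pattern does, and the helper `count` is made up for illustration.

```python
import re

# Lines taken from the diff quoted above (trunk vs. patched compiler).
trunk_asm = "\tpush\t{r4, r5}\n\tldrexd\tr4, r5, [r0]\n\tpop\t{r4, r5}\n"
patched_asm = "\tpush\t{fp}\n\tldrexd\tfp, ip, [r0]\n\tldr\tfp, [sp], #4\n"

# Pattern from the failing directive, written as a Python regex.
strict = re.compile(r"ldrexd\tr[0-9]+, r[0-9]+, \[r[0-9]+\]")
# Relaxed pattern: the mnemonic only, no operands.
relaxed = re.compile(r"ldrexd\t")


def count(pattern, asm):
    """Number of non-overlapping matches of pattern in asm."""
    return len(pattern.findall(asm))
```

On these snippets the strict pattern stops matching once fp and ip are used, while the relaxed one counts the instruction in both outputs.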
>>>>>>
>>>>>> R.
>>>>>>
>>>>>>>> Regards,
>>>>>>>> Surya
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>

end of thread, other threads:[~2024-01-29  6:15 UTC | newest]

Thread overview: 10+ messages
2023-11-03 11:28 Discussion about arm testcase failures seen with patch for PR111673 Surya Kumari Jangala
2023-11-24  8:09 ` Surya Kumari Jangala
2023-11-24 10:48   ` Richard Earnshaw
2023-11-28 12:52     ` Discussion about arm/aarch64 " Surya Kumari Jangala
2023-11-28 13:48       ` Richard Earnshaw
2023-11-28 15:41         ` Richard Sandiford
2023-12-14  7:17         ` Surya Kumari Jangala
2023-12-14 16:11           ` Richard Earnshaw (lists)
2023-12-15 17:04             ` Surya Kumari Jangala
2024-01-29  6:14               ` Surya Kumari Jangala
