* [PATCH] LoongArch: Increase division costs
@ 2024-03-26 9:48 Xi Ruoyao
2024-03-27 2:38 ` chenglulu
2024-03-27 7:54 ` Richard Biener
0 siblings, 2 replies; 12+ messages in thread
From: Xi Ruoyao @ 2024-03-26 9:48 UTC (permalink / raw)
To: gcc-patches; +Cc: chenglulu, i, xuchenghua, Xi Ruoyao
The latency of LA464 and LA664 division instructions depends on the
input. When I updated the costs in r14-6642, I unintentionally set the
division costs to the best-case latency (when the first operand is 0).
Per a recent discussion [1] we should use "something sensible" instead
of it.
Use the average of the minimum and maximum latency observed instead.
This enables multiplication to reciprocal sequence reduction and speeds
up the following test case for about 30%:
int
main (void)
{
unsigned long stat = 0xdeadbeef;
for (int i = 0; i < 100000000; i++)
stat = (stat * stat + stat * 114514 + 1919810) % 1000000007;
asm(""::"r"(stat));
}
[1]: https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648348.html
gcc/ChangeLog:
* config/loongarch/loongarch-def.cc
(loongarch_rtx_cost_data::loongarch_rtx_cost_data): Increase
default division cost to the average of the best case and worst
case senarios observed.
gcc/testsuite/ChangeLog:
* gcc.target/loongarch/div-const-reduction.c: New test.
---
Bootstrapped and regtested on loongarch64-linux-gnu. Ok for trunk?
gcc/config/loongarch/loongarch-def.cc | 8 ++++----
gcc/testsuite/gcc.target/loongarch/div-const-reduction.c | 9 +++++++++
2 files changed, 13 insertions(+), 4 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/loongarch/div-const-reduction.c
diff --git a/gcc/config/loongarch/loongarch-def.cc b/gcc/config/loongarch/loongarch-def.cc
index e8c129ce643..93e72a520d5 100644
--- a/gcc/config/loongarch/loongarch-def.cc
+++ b/gcc/config/loongarch/loongarch-def.cc
@@ -95,12 +95,12 @@ loongarch_rtx_cost_data::loongarch_rtx_cost_data ()
: fp_add (COSTS_N_INSNS (5)),
fp_mult_sf (COSTS_N_INSNS (5)),
fp_mult_df (COSTS_N_INSNS (5)),
- fp_div_sf (COSTS_N_INSNS (8)),
- fp_div_df (COSTS_N_INSNS (8)),
+ fp_div_sf (COSTS_N_INSNS (12)),
+ fp_div_df (COSTS_N_INSNS (15)),
int_mult_si (COSTS_N_INSNS (4)),
int_mult_di (COSTS_N_INSNS (4)),
- int_div_si (COSTS_N_INSNS (5)),
- int_div_di (COSTS_N_INSNS (5)),
+ int_div_si (COSTS_N_INSNS (14)),
+ int_div_di (COSTS_N_INSNS (22)),
movcf2gr (COSTS_N_INSNS (7)),
movgr2cf (COSTS_N_INSNS (15)),
branch_cost (6),
diff --git a/gcc/testsuite/gcc.target/loongarch/div-const-reduction.c b/gcc/testsuite/gcc.target/loongarch/div-const-reduction.c
new file mode 100644
index 00000000000..0ee86410dd7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/loongarch/div-const-reduction.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=la464" } */
+/* { dg-final { scan-assembler-not "div\.\[dw\]" } } */
+
+int
+test (int a)
+{
+ return a % 1000000007;
+}
--
2.44.0
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] LoongArch: Increase division costs
2024-03-26 9:48 [PATCH] LoongArch: Increase division costs Xi Ruoyao
@ 2024-03-27 2:38 ` chenglulu
2024-03-27 10:39 ` Xi Ruoyao
2024-03-27 7:54 ` Richard Biener
1 sibling, 1 reply; 12+ messages in thread
From: chenglulu @ 2024-03-27 2:38 UTC (permalink / raw)
To: Xi Ruoyao, gcc-patches; +Cc: i, xuchenghua
在 2024/3/26 下午5:48, Xi Ruoyao 写道:
> The latency of LA464 and LA664 division instructions depends on the
> input. When I updated the costs in r14-6642, I unintentionally set the
> division costs to the best-case latency (when the first operand is 0).
> Per a recent discussion [1] we should use "something sensible" instead
> of it.
>
> Use the average of the minimum and maximum latency observed instead.
> This enables multiplication to reciprocal sequence reduction and speeds
> up the following test case for about 30%:
>
> int
> main (void)
> {
> unsigned long stat = 0xdeadbeef;
> for (int i = 0; i < 100000000; i++)
> stat = (stat * stat + stat * 114514 + 1919810) % 1000000007;
> asm(""::"r"(stat));
> }
>
> [1]: https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648348.html
The test case div-const-reduction.c is modified to assemble the instruction
sequence as follows:
lu12i.w $r12,999997440>>12 # 0x3b9ac000
ori $r12,$r12,2567
mod.w $r13,$r13,$r12
This sequence of instructions takes 5 clock cycles.
The sequence of instructions after adding the patch is:
lu12i.w $r15,1152917504>>12 # 0x44b82000
ori $r15,$r15,3993
mulh.w $r12,$r16,$r15
srai.w $r14,$r16,31
lu12i.w $r13,999997440>>12 # 0x3b9ac000
ori $r13,$r13,2567
srai.w $r12,$r12,28
sub.w $r12,$r12,$r14
mul.w $r12,$r12,$r13
sub.w $r16,$r16,$r12
This sequence of instructions takes 11 clock cycles.
This test case is optimized and takes 6 more clock cycles than before optimization,
so I need to run the spec.
Thanks!
> gcc/ChangeLog:
>
> * config/loongarch/loongarch-def.cc
> (loongarch_rtx_cost_data::loongarch_rtx_cost_data): Increase
> default division cost to the average of the best case and worst
> case senarios observed.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/loongarch/div-const-reduction.c: New test.
> ---
>
> Bootstrapped and regtested on loongarch64-linux-gnu. Ok for trunk?
>
> gcc/config/loongarch/loongarch-def.cc | 8 ++++----
> gcc/testsuite/gcc.target/loongarch/div-const-reduction.c | 9 +++++++++
> 2 files changed, 13 insertions(+), 4 deletions(-)
> create mode 100644 gcc/testsuite/gcc.target/loongarch/div-const-reduction.c
>
> diff --git a/gcc/config/loongarch/loongarch-def.cc b/gcc/config/loongarch/loongarch-def.cc
> index e8c129ce643..93e72a520d5 100644
> --- a/gcc/config/loongarch/loongarch-def.cc
> +++ b/gcc/config/loongarch/loongarch-def.cc
> @@ -95,12 +95,12 @@ loongarch_rtx_cost_data::loongarch_rtx_cost_data ()
> : fp_add (COSTS_N_INSNS (5)),
> fp_mult_sf (COSTS_N_INSNS (5)),
> fp_mult_df (COSTS_N_INSNS (5)),
> - fp_div_sf (COSTS_N_INSNS (8)),
> - fp_div_df (COSTS_N_INSNS (8)),
> + fp_div_sf (COSTS_N_INSNS (12)),
> + fp_div_df (COSTS_N_INSNS (15)),
> int_mult_si (COSTS_N_INSNS (4)),
> int_mult_di (COSTS_N_INSNS (4)),
> - int_div_si (COSTS_N_INSNS (5)),
> - int_div_di (COSTS_N_INSNS (5)),
> + int_div_si (COSTS_N_INSNS (14)),
> + int_div_di (COSTS_N_INSNS (22)),
> movcf2gr (COSTS_N_INSNS (7)),
> movgr2cf (COSTS_N_INSNS (15)),
> branch_cost (6),
> diff --git a/gcc/testsuite/gcc.target/loongarch/div-const-reduction.c b/gcc/testsuite/gcc.target/loongarch/div-const-reduction.c
> new file mode 100644
> index 00000000000..0ee86410dd7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/loongarch/div-const-reduction.c
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=la464" } */
> +/* { dg-final { scan-assembler-not "div\.\[dw\]" } } */
> +
> +int
> +test (int a)
> +{
> + return a % 1000000007;
> +}
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] LoongArch: Increase division costs
2024-03-26 9:48 [PATCH] LoongArch: Increase division costs Xi Ruoyao
2024-03-27 2:38 ` chenglulu
@ 2024-03-27 7:54 ` Richard Biener
2024-03-27 12:20 ` Xi Ruoyao
1 sibling, 1 reply; 12+ messages in thread
From: Richard Biener @ 2024-03-27 7:54 UTC (permalink / raw)
To: Xi Ruoyao; +Cc: gcc-patches, chenglulu, i, xuchenghua
On Tue, Mar 26, 2024 at 10:52 AM Xi Ruoyao <xry111@xry111.site> wrote:
>
> The latency of LA464 and LA664 division instructions depends on the
> input. When I updated the costs in r14-6642, I unintentionally set the
> division costs to the best-case latency (when the first operand is 0).
> Per a recent discussion [1] we should use "something sensible" instead
> of it.
>
> Use the average of the minimum and maximum latency observed instead.
> This enables multiplication to reciprocal sequence reduction and speeds
> up the following test case for about 30%:
>
> int
> main (void)
> {
> unsigned long stat = 0xdeadbeef;
> for (int i = 0; i < 100000000; i++)
> stat = (stat * stat + stat * 114514 + 1919810) % 1000000007;
> asm(""::"r"(stat));
> }
I think you should be able to see a constant divisor and thus could do
better than return the same latency for everything. For non-constant
divisors using the best-case latency shouldn't be a problem.
> [1]: https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648348.html
>
> gcc/ChangeLog:
>
> * config/loongarch/loongarch-def.cc
> (loongarch_rtx_cost_data::loongarch_rtx_cost_data): Increase
> default division cost to the average of the best case and worst
> case senarios observed.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/loongarch/div-const-reduction.c: New test.
> ---
>
> Bootstrapped and regtested on loongarch64-linux-gnu. Ok for trunk?
>
> gcc/config/loongarch/loongarch-def.cc | 8 ++++----
> gcc/testsuite/gcc.target/loongarch/div-const-reduction.c | 9 +++++++++
> 2 files changed, 13 insertions(+), 4 deletions(-)
> create mode 100644 gcc/testsuite/gcc.target/loongarch/div-const-reduction.c
>
> diff --git a/gcc/config/loongarch/loongarch-def.cc b/gcc/config/loongarch/loongarch-def.cc
> index e8c129ce643..93e72a520d5 100644
> --- a/gcc/config/loongarch/loongarch-def.cc
> +++ b/gcc/config/loongarch/loongarch-def.cc
> @@ -95,12 +95,12 @@ loongarch_rtx_cost_data::loongarch_rtx_cost_data ()
> : fp_add (COSTS_N_INSNS (5)),
> fp_mult_sf (COSTS_N_INSNS (5)),
> fp_mult_df (COSTS_N_INSNS (5)),
> - fp_div_sf (COSTS_N_INSNS (8)),
> - fp_div_df (COSTS_N_INSNS (8)),
> + fp_div_sf (COSTS_N_INSNS (12)),
> + fp_div_df (COSTS_N_INSNS (15)),
> int_mult_si (COSTS_N_INSNS (4)),
> int_mult_di (COSTS_N_INSNS (4)),
> - int_div_si (COSTS_N_INSNS (5)),
> - int_div_di (COSTS_N_INSNS (5)),
> + int_div_si (COSTS_N_INSNS (14)),
> + int_div_di (COSTS_N_INSNS (22)),
> movcf2gr (COSTS_N_INSNS (7)),
> movgr2cf (COSTS_N_INSNS (15)),
> branch_cost (6),
> diff --git a/gcc/testsuite/gcc.target/loongarch/div-const-reduction.c b/gcc/testsuite/gcc.target/loongarch/div-const-reduction.c
> new file mode 100644
> index 00000000000..0ee86410dd7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/loongarch/div-const-reduction.c
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=la464" } */
> +/* { dg-final { scan-assembler-not "div\.\[dw\]" } } */
> +
> +int
> +test (int a)
> +{
> + return a % 1000000007;
> +}
> --
> 2.44.0
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] LoongArch: Increase division costs
2024-03-27 2:38 ` chenglulu
@ 2024-03-27 10:39 ` Xi Ruoyao
2024-03-27 12:42 ` Xi Ruoyao
0 siblings, 1 reply; 12+ messages in thread
From: Xi Ruoyao @ 2024-03-27 10:39 UTC (permalink / raw)
To: chenglulu, gcc-patches; +Cc: i, xuchenghua
On Wed, 2024-03-27 at 10:38 +0800, chenglulu wrote:
>
> 在 2024/3/26 下午5:48, Xi Ruoyao 写道:
> > The latency of LA464 and LA664 division instructions depends on the
> > input. When I updated the costs in r14-6642, I unintentionally set the
> > division costs to the best-case latency (when the first operand is 0).
> > Per a recent discussion [1] we should use "something sensible" instead
> > of it.
> >
> > Use the average of the minimum and maximum latency observed instead.
> > This enables multiplication to reciprocal sequence reduction and speeds
> > up the following test case for about 30%:
> >
> > int
> > main (void)
> > {
> > unsigned long stat = 0xdeadbeef;
> > for (int i = 0; i < 100000000; i++)
> > stat = (stat * stat + stat * 114514 + 1919810) % 1000000007;
> > asm(""::"r"(stat));
> > }
> >
> > [1]: https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648348.html
>
> The test case div-const-reduction.c is modified to assemble the instruction
> sequence as follows:
> lu12i.w $r12,999997440>>12 # 0x3b9ac000
> ori $r12,$r12,2567
> mod.w $r13,$r13,$r12
>
> This sequence of instructions takes 5 clock cycles.
Hmm indeed, it seems a waste to do this reduction for int / 1000000007.
I'll try to make a better heuristic as Richard suggests...
--
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] LoongArch: Increase division costs
2024-03-27 7:54 ` Richard Biener
@ 2024-03-27 12:20 ` Xi Ruoyao
2024-03-27 12:28 ` Richard Biener
0 siblings, 1 reply; 12+ messages in thread
From: Xi Ruoyao @ 2024-03-27 12:20 UTC (permalink / raw)
To: Richard Biener; +Cc: gcc-patches, chenglulu, i, xuchenghua
On Wed, 2024-03-27 at 08:54 +0100, Richard Biener wrote:
> On Tue, Mar 26, 2024 at 10:52 AM Xi Ruoyao <xry111@xry111.site> wrote:
> >
> > The latency of LA464 and LA664 division instructions depends on the
> > input. When I updated the costs in r14-6642, I unintentionally set the
> > division costs to the best-case latency (when the first operand is 0).
> > Per a recent discussion [1] we should use "something sensible" instead
> > of it.
> >
> > Use the average of the minimum and maximum latency observed instead.
> > This enables multiplication to reciprocal sequence reduction and speeds
> > up the following test case for about 30%:
> >
> > int
> > main (void)
> > {
> > unsigned long stat = 0xdeadbeef;
> > for (int i = 0; i < 100000000; i++)
> > stat = (stat * stat + stat * 114514 + 1919810) % 1000000007;
> > asm(""::"r"(stat));
> > }
>
> I think you should be able to see a constant divisor and thus could do
> better than return the same latency for everything. For non-constant
> divisors using the best-case latency shouldn't be a problem.
Hmm, it seems not really possible as at now. expand_divmod does
something like:
max_cost = (unsignedp
? udiv_cost (speed, compute_mode)
: sdiv_cost (speed, compute_mode));
which is reading the pre-calculated costs from a table. Thus we don't
really know the denominator and cannot estimate the cost based on it :(.
CSE really invokes the cost hook with the actual (mod (a, (const_int
1000000007)) RTX but it's less important.
--
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] LoongArch: Increase division costs
2024-03-27 12:20 ` Xi Ruoyao
@ 2024-03-27 12:28 ` Richard Biener
0 siblings, 0 replies; 12+ messages in thread
From: Richard Biener @ 2024-03-27 12:28 UTC (permalink / raw)
To: Xi Ruoyao; +Cc: gcc-patches, chenglulu, i, xuchenghua
On Wed, Mar 27, 2024 at 1:20 PM Xi Ruoyao <xry111@xry111.site> wrote:
>
> On Wed, 2024-03-27 at 08:54 +0100, Richard Biener wrote:
> > On Tue, Mar 26, 2024 at 10:52 AM Xi Ruoyao <xry111@xry111.site> wrote:
> > >
> > > The latency of LA464 and LA664 division instructions depends on the
> > > input. When I updated the costs in r14-6642, I unintentionally set the
> > > division costs to the best-case latency (when the first operand is 0).
> > > Per a recent discussion [1] we should use "something sensible" instead
> > > of it.
> > >
> > > Use the average of the minimum and maximum latency observed instead.
> > > This enables multiplication to reciprocal sequence reduction and speeds
> > > up the following test case for about 30%:
> > >
> > > int
> > > main (void)
> > > {
> > > unsigned long stat = 0xdeadbeef;
> > > for (int i = 0; i < 100000000; i++)
> > > stat = (stat * stat + stat * 114514 + 1919810) % 1000000007;
> > > asm(""::"r"(stat));
> > > }
> >
> > I think you should be able to see a constant divisor and thus could do
> > better than return the same latency for everything. For non-constant
> > divisors using the best-case latency shouldn't be a problem.
>
> Hmm, it seems not really possible as at now. expand_divmod does
> something like:
>
> max_cost = (unsignedp
> ? udiv_cost (speed, compute_mode)
> : sdiv_cost (speed, compute_mode));
>
> which is reading the pre-calculated costs from a table. Thus we don't
> really know the denominator and cannot estimate the cost based on it :(.
Ah, too bad. OTOH for the actual case it decomposes it could compute
the real cost, avoiding the table which is filled with reg-reg operations only.
> CSE really invokes the cost hook with the actual (mod (a, (const_int
> 1000000007)) RTX but it's less important.
>
> --
> Xi Ruoyao <xry111@xry111.site>
> School of Aerospace Science and Technology, Xidian University
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] LoongArch: Increase division costs
2024-03-27 10:39 ` Xi Ruoyao
@ 2024-03-27 12:42 ` Xi Ruoyao
2024-03-28 3:18 ` chenglulu
2024-03-29 1:23 ` chenglulu
0 siblings, 2 replies; 12+ messages in thread
From: Xi Ruoyao @ 2024-03-27 12:42 UTC (permalink / raw)
To: chenglulu, gcc-patches; +Cc: i, xuchenghua
On Wed, 2024-03-27 at 18:39 +0800, Xi Ruoyao wrote:
> On Wed, 2024-03-27 at 10:38 +0800, chenglulu wrote:
> >
> > 在 2024/3/26 下午5:48, Xi Ruoyao 写道:
> > > The latency of LA464 and LA664 division instructions depends on the
> > > input. When I updated the costs in r14-6642, I unintentionally set the
> > > division costs to the best-case latency (when the first operand is 0).
> > > Per a recent discussion [1] we should use "something sensible" instead
> > > of it.
> > >
> > > Use the average of the minimum and maximum latency observed instead.
> > > This enables multiplication to reciprocal sequence reduction and speeds
> > > up the following test case for about 30%:
> > >
> > > int
> > > main (void)
> > > {
> > > unsigned long stat = 0xdeadbeef;
> > > for (int i = 0; i < 100000000; i++)
> > > stat = (stat * stat + stat * 114514 + 1919810) % 1000000007;
> > > asm(""::"r"(stat));
> > > }
> > >
> > > [1]: https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648348.html
> >
> > The test case div-const-reduction.c is modified to assemble the instruction
> > sequence as follows:
> > lu12i.w $r12,999997440>>12 # 0x3b9ac000
> > ori $r12,$r12,2567
> > mod.w $r13,$r13,$r12
> >
> > This sequence of instructions takes 5 clock cycles.
It actually may take 5 to 8 cycles depending on the input. And
multiplication is fully pipelined while division is not, so the
reciprocal sequence should still produce a better throughput.
> Hmm indeed, it seems a waste to do this reduction for int / 1000000007.
> I'll try to make a better heuristic as Richard suggests...
Oops, it seems impossible (w/o refactoring the generic code). See my
reply to Richi :(.
Can you also try benchmarking with the costs of SI and DI division
increased to (10, 10) instead of (14, 22) - allowing more CSE but not
reciprocal sequence reduction, and (10, 22) - only allowing reduction
for DI but not SI?
--
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] LoongArch: Increase division costs
2024-03-27 12:42 ` Xi Ruoyao
@ 2024-03-28 3:18 ` chenglulu
2024-03-29 1:23 ` chenglulu
1 sibling, 0 replies; 12+ messages in thread
From: chenglulu @ 2024-03-28 3:18 UTC (permalink / raw)
To: Xi Ruoyao, gcc-patches; +Cc: i, xuchenghua
在 2024/3/27 下午8:42, Xi Ruoyao 写道:
> On Wed, 2024-03-27 at 18:39 +0800, Xi Ruoyao wrote:
>> On Wed, 2024-03-27 at 10:38 +0800, chenglulu wrote:
>>> 在 2024/3/26 下午5:48, Xi Ruoyao 写道:
>>>> The latency of LA464 and LA664 division instructions depends on the
>>>> input. When I updated the costs in r14-6642, I unintentionally set the
>>>> division costs to the best-case latency (when the first operand is 0).
>>>> Per a recent discussion [1] we should use "something sensible" instead
>>>> of it.
>>>>
>>>> Use the average of the minimum and maximum latency observed instead.
>>>> This enables multiplication to reciprocal sequence reduction and speeds
>>>> up the following test case for about 30%:
>>>>
>>>> int
>>>> main (void)
>>>> {
>>>> unsigned long stat = 0xdeadbeef;
>>>> for (int i = 0; i < 100000000; i++)
>>>> stat = (stat * stat + stat * 114514 + 1919810) % 1000000007;
>>>> asm(""::"r"(stat));
>>>> }
>>>>
>>>> [1]: https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648348.html
>>> The test case div-const-reduction.c is modified to assemble the instruction
>>> sequence as follows:
>>> lu12i.w $r12,999997440>>12 # 0x3b9ac000
>>> ori $r12,$r12,2567
>>> mod.w $r13,$r13,$r12
>>>
>>> This sequence of instructions takes 5 clock cycles.
> It actually may take 5 to 8 cycles depending on the input. And
> multiplication is fully pipelined while division is not, so the
> reciprocal sequence should still produce a better throughput.
>
>> Hmm indeed, it seems a waste to do this reduction for int / 1000000007.
>> I'll try to make a better heuristic as Richard suggests...
> Oops, it seems impossible (w/o refactoring the generic code). See my
> reply to Richi :(.
>
> Can you also try benchmarking with the costs of SI and DI division
> increased to (10, 10) instead of (14, 22) - allowing more CSE but not
> reciprocal sequence reduction, and (10, 22) - only allowing reduction
> for DI but not SI?
No problem, I'll test both cases ((10,10) and (10,22)).
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] LoongArch: Increase division costs
2024-03-27 12:42 ` Xi Ruoyao
2024-03-28 3:18 ` chenglulu
@ 2024-03-29 1:23 ` chenglulu
2024-04-01 1:29 ` Xi Ruoyao
1 sibling, 1 reply; 12+ messages in thread
From: chenglulu @ 2024-03-29 1:23 UTC (permalink / raw)
To: Xi Ruoyao, gcc-patches; +Cc: i, xuchenghua
在 2024/3/27 下午8:42, Xi Ruoyao 写道:
> On Wed, 2024-03-27 at 18:39 +0800, Xi Ruoyao wrote:
>> On Wed, 2024-03-27 at 10:38 +0800, chenglulu wrote:
>>> 在 2024/3/26 下午5:48, Xi Ruoyao 写道:
>>>> The latency of LA464 and LA664 division instructions depends on the
>>>> input. When I updated the costs in r14-6642, I unintentionally set the
>>>> division costs to the best-case latency (when the first operand is 0).
>>>> Per a recent discussion [1] we should use "something sensible" instead
>>>> of it.
>>>>
>>>> Use the average of the minimum and maximum latency observed instead.
>>>> This enables multiplication to reciprocal sequence reduction and speeds
>>>> up the following test case for about 30%:
>>>>
>>>> int
>>>> main (void)
>>>> {
>>>> unsigned long stat = 0xdeadbeef;
>>>> for (int i = 0; i < 100000000; i++)
>>>> stat = (stat * stat + stat * 114514 + 1919810) % 1000000007;
>>>> asm(""::"r"(stat));
>>>> }
>>>>
>>>> [1]: https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648348.html
>>> The test case div-const-reduction.c is modified to assemble the instruction
>>> sequence as follows:
>>> lu12i.w $r12,999997440>>12 # 0x3b9ac000
>>> ori $r12,$r12,2567
>>> mod.w $r13,$r13,$r12
>>>
>>> This sequence of instructions takes 5 clock cycles.
> It actually may take 5 to 8 cycles depending on the input. And
> multiplication is fully pipelined while division is not, so the
> reciprocal sequence should still produce a better throughput.
>
>> Hmm indeed, it seems a waste to do this reduction for int / 1000000007.
>> I'll try to make a better heuristic as Richard suggests...
> Oops, it seems impossible (w/o refactoring the generic code). See my
> reply to Richi :(.
>
> Can you also try benchmarking with the costs of SI and DI division
> increased to (10, 10) instead of (14, 22) - allowing more CSE but not
> reciprocal sequence reduction, and (10, 22) - only allowing reduction
> for DI but not SI?
I tested spec2006. In the floating-point program, the test items with large
fluctuations are removed, and the rest is basically unchanged.
The fixed-point 464.h264ref (10,10) was 6.7% higher than (5,5) and (10,22).
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] LoongArch: Increase division costs
2024-03-29 1:23 ` chenglulu
@ 2024-04-01 1:29 ` Xi Ruoyao
2024-04-01 2:22 ` chenglulu
0 siblings, 1 reply; 12+ messages in thread
From: Xi Ruoyao @ 2024-04-01 1:29 UTC (permalink / raw)
To: chenglulu, gcc-patches; +Cc: i, xuchenghua
On Fri, 2024-03-29 at 09:23 +0800, chenglulu wrote:
> I tested spec2006. In the floating-point program, the test items with large
>
> fluctuations are removed, and the rest is basically unchanged.
>
> The fixed-point 464.h264ref (10,10) was 6.7% higher than (5,5) and (10,22).
So IIUC (10,10) is better than (5,5), (10,22), and the originally
proposed (14,22)? Then should I make a change to make all 4 costs (SF,
DF, SI, DI) 10?
I'd still want DI % 1000000007 to be reduced as reciprocal sequence (but
not SI % 1000000007) since DI % (smaller const) is quite important for
some workloads like competitive programming. However "adapting with
different modulos" is not possible w/o refactoring generic code so it
must be deferred to at least GCC 15.
--
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] LoongArch: Increase division costs
2024-04-01 1:29 ` Xi Ruoyao
@ 2024-04-01 2:22 ` chenglulu
2024-04-01 2:23 ` Xi Ruoyao
0 siblings, 1 reply; 12+ messages in thread
From: chenglulu @ 2024-04-01 2:22 UTC (permalink / raw)
To: Xi Ruoyao, gcc-patches; +Cc: i, xuchenghua
在 2024/4/1 上午9:29, Xi Ruoyao 写道:
> On Fri, 2024-03-29 at 09:23 +0800, chenglulu wrote:
>
>> I tested spec2006. In the floating-point program, the test items with large
>>
>> fluctuations are removed, and the rest is basically unchanged.
>>
>> The fixed-point 464.h264ref (10,10) was 6.7% higher than (5,5) and (10,22).
> So IIUC (10,10) is better than (5,5), (10,22), and the originally
> proposed (14,22)? Then should I make a change to make all 4 costs (SF,
> DF, SI, DI) 10?
I think this may require the analysis of the spec's test case. I took a
look at the test results again,
where the scores of SPEC INT 462.libquantum fluctuated greatly, but the
combination of (10,22)
showed an overall upward trend compared to the scores of the other two
combinations.
I don't know if (10,22) this combination happens to have the kind of
test cases in the changelog.
So can we change it together in GCC15?
>
> I'd still want DI % 1000000007 to be reduced as reciprocal sequence (but
> not SI % 1000000007) since DI % (smaller const) is quite important for
> some workloads like competitive programming. However "adapting with
> different modulos" is not possible w/o refactoring generic code so it
> must be deferred to at least GCC 15.
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] LoongArch: Increase division costs
2024-04-01 2:22 ` chenglulu
@ 2024-04-01 2:23 ` Xi Ruoyao
0 siblings, 0 replies; 12+ messages in thread
From: Xi Ruoyao @ 2024-04-01 2:23 UTC (permalink / raw)
To: chenglulu, gcc-patches; +Cc: i, xuchenghua
On Mon, 2024-04-01 at 10:22 +0800, chenglulu wrote:
>
> 在 2024/4/1 上午9:29, Xi Ruoyao 写道:
> > On Fri, 2024-03-29 at 09:23 +0800, chenglulu wrote:
> >
> > > I tested spec2006. In the floating-point program, the test items with large
> > >
> > > fluctuations are removed, and the rest is basically unchanged.
> > >
> > > The fixed-point 464.h264ref (10,10) was 6.7% higher than (5,5) and (10,22).
> > So IIUC (10,10) is better than (5,5), (10,22), and the originally
> > proposed (14,22)? Then should I make a change to make all 4 costs (SF,
> > DF, SI, DI) 10?
>
> I think this may require the analysis of the spec's test case. I took a
> look at the test results again,
>
> where the scores of SPEC INT 462.libquantum fluctuated greatly, but the
> combination of (10,22)
>
> showed an overall upward trend compared to the scores of the other two
> combinations.
>
> I don't know if (10,22) this combination happens to have the kind of
> test cases in the changelog.
>
> So can we change it together in GCC15?
Ok. Abandoning this patch then.
--
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2024-04-01 2:24 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-26 9:48 [PATCH] LoongArch: Increase division costs Xi Ruoyao
2024-03-27 2:38 ` chenglulu
2024-03-27 10:39 ` Xi Ruoyao
2024-03-27 12:42 ` Xi Ruoyao
2024-03-28 3:18 ` chenglulu
2024-03-29 1:23 ` chenglulu
2024-04-01 1:29 ` Xi Ruoyao
2024-04-01 2:22 ` chenglulu
2024-04-01 2:23 ` Xi Ruoyao
2024-03-27 7:54 ` Richard Biener
2024-03-27 12:20 ` Xi Ruoyao
2024-03-27 12:28 ` Richard Biener
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).