[PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT

public inbox for gcc-patches@gcc.gnu.org

* [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.
@ 2023-12-12 11:14 Jiahao Xu
  2023-12-12 11:26 ` Xi Ruoyao
  0 siblings, 1 reply; 11+ messages in thread
From: Jiahao Xu @ 2023-12-12 11:14 UTC (permalink / raw)
  To: gcc-patches; +Cc: xry111, i, chenglulu, xuchenghua, Jiahao Xu

Define LOGICAL_OP_NON_SHORT_CIRCUIT as 0 so that a short-circuit branch uses
the short-circuit operation instead of the non-short-circuit operation.

This gives a 1.8% improvement in SPECCPU 2017 fprate on 3A6000.

gcc/ChangeLog:

	* config/loongarch/loongarch.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Define.

gcc/testsuite/ChangeLog:

	* gcc.target/loongarch/short-circuit.c: New test.

diff --git a/gcc/config/loongarch/loongarch.h b/gcc/config/loongarch/loongarch.h
index f1350b6048f..880c576c35b 100644
--- a/gcc/config/loongarch/loongarch.h
+++ b/gcc/config/loongarch/loongarch.h
@@ -869,6 +869,7 @@ typedef struct {
    1 is the default; other values are interpreted relative to that.  */
 
 #define BRANCH_COST(speed_p, predictable_p) loongarch_branch_cost
+#define LOGICAL_OP_NON_SHORT_CIRCUIT 0
 
 /* Return the asm template for a conditional branch instruction.
    OPCODE is the opcode's mnemonic and OPERANDS is the asm template for
diff --git a/gcc/testsuite/gcc.target/loongarch/short-circuit.c b/gcc/testsuite/gcc.target/loongarch/short-circuit.c
new file mode 100644
index 00000000000..bed585ee172
--- /dev/null
+++ b/gcc/testsuite/gcc.target/loongarch/short-circuit.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ffast-math -fdump-tree-gimple" } */
+
+int
+short_circuit (float *a)
+{
+  float t1x = a[0];
+  float t2x = a[1];
+  float t1y = a[2];
+  float t2y = a[3];
+  float t1z = a[4];
+  float t2z = a[5];
+
+  if (t1x > t2y  || t2x < t1y  || t1x > t2z || t2x < t1z || t1y > t2z || t2y < t1z)
+    return 0;
+
+  return 1;
+}
+/* { dg-final { scan-tree-dump-times "if" 6 "gimple" } } */
-- 
2.20.1



* Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.
  2023-12-12 11:14 [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT Jiahao Xu
@ 2023-12-12 11:26 ` Xi Ruoyao
  2023-12-12 11:59   ` Jiahao Xu
  0 siblings, 1 reply; 11+ messages in thread
From: Xi Ruoyao @ 2023-12-12 11:26 UTC (permalink / raw)
  To: Jiahao Xu, gcc-patches; +Cc: i, chenglulu, xuchenghua

On Tue, 2023-12-12 at 19:14 +0800, Jiahao Xu wrote:
> Define LOGICAL_OP_NON_SHORT_CIRCUIT as 0, for a short-circuit branch, use the
> short-circuit operation instead of the non-short-circuit operation.
> 
> This gives a 1.8% improvement in SPECCPU 2017 fprate on 3A6000.

In r14-15 we removed the LOGICAL_OP_NON_SHORT_CIRCUIT definition because the
default value (1 for all current LoongArch CPUs with branch_cost = 6)
may reduce the number of conditional branch instructions.
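
For context, a minimal sketch of how that default value comes about. It
assumes the middle-end fallback compares BRANCH_COST against 2 when the
target leaves the macro undefined; the macro names mirror GCC's, but this
is a standalone model and default_non_short_circuit is a hypothetical
helper, not compiler source.

```c
/* Standalone model: branch_cost = 6 is the value quoted above.  */
static int loongarch_branch_cost = 6;

#define BRANCH_COST(speed_p, predictable_p) loongarch_branch_cost

/* Assumed shape of the fallback used when the target does not define
   LOGICAL_OP_NON_SHORT_CIRCUIT itself.  */
#ifndef LOGICAL_OP_NON_SHORT_CIRCUIT
#define LOGICAL_OP_NON_SHORT_CIRCUIT (BRANCH_COST (1, 0) >= 2)
#endif

/* Hypothetical helper: report the value the fallback evaluates to.  */
int
default_non_short_circuit (void)
{
  return LOGICAL_OP_NON_SHORT_CIRCUIT;
}
```

With branch_cost = 6 the comparison holds, so the fallback evaluates to 1,
which matches the default described above.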

I guess the problem here is that the floating-point compare instruction is
much more costly than other instructions, but this fact is not correctly
modeled yet.  Could you try
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
where I've raised the fp_add cost (which is used for estimating the
floating-point compare cost) to 5 instructions and see if it solves your
problem without LOGICAL_OP_NON_SHORT_CIRCUIT?

If not, I guess you can try increasing the floating-point comparison cost
further in loongarch_rtx_costs:

    case UNLT:
      /* Branch comparisons have VOIDmode, so use the first operand's
         mode instead.  */
      mode = GET_MODE (XEXP (x, 0));
      if (FLOAT_MODE_P (mode))
        {
          *total = loongarch_cost->fp_add;


Try to make it fp_add + something?

          return false;
        }
      *total = loongarch_binary_cost (x, COSTS_N_INSNS (1), COSTS_N_INSNS (4),
                                      speed);
      return true;


If adjusting the cost model does not work I'd say this is a middle-end
issue and we should submit a bug report.
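
As a rough illustration of the "fp_add + something" idea, here is a toy
model of the arithmetic only.  It assumes GCC's usual COSTS_N_INSNS scaling
of 4 units per instruction; struct toy_costs and toy_fp_compare_cost are
hypothetical names, and the extra amount is a placeholder to be tuned, not
a recommended value.

```c
/* GCC scales instruction counts by 4 when expressing RTX costs.  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Toy stand-in for the relevant field of the LoongArch cost table.  */
struct toy_costs
{
  int fp_add;
};

/* Hypothetical variant of the FLOAT_MODE_P branch: charge fp_add plus
   EXTRA for a floating-point comparison.  */
int
toy_fp_compare_cost (const struct toy_costs *costs, int extra)
{
  return costs->fp_add + extra;
}
```

With fp_add set to COSTS_N_INSNS (5), an extra of COSTS_N_INSNS (2) would
make a floating-point comparison count as seven instructions in this model.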

> gcc/ChangeLog:
> 
> 	* config/loongarch/loongarch.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Define.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/loongarch/short-circuit.c: New test.
> 
> diff --git a/gcc/config/loongarch/loongarch.h b/gcc/config/loongarch/loongarch.h
> index f1350b6048f..880c576c35b 100644
> --- a/gcc/config/loongarch/loongarch.h
> +++ b/gcc/config/loongarch/loongarch.h
> @@ -869,6 +869,7 @@ typedef struct {
>     1 is the default; other values are interpreted relative to that.  */
>  
>  #define BRANCH_COST(speed_p, predictable_p) loongarch_branch_cost
> +#define LOGICAL_OP_NON_SHORT_CIRCUIT 0
>  
>  /* Return the asm template for a conditional branch instruction.
>     OPCODE is the opcode's mnemonic and OPERANDS is the asm template for
> diff --git a/gcc/testsuite/gcc.target/loongarch/short-circuit.c b/gcc/testsuite/gcc.target/loongarch/short-circuit.c
> new file mode 100644
> index 00000000000..bed585ee172
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/loongarch/short-circuit.c
> @@ -0,0 +1,19 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ffast-math -fdump-tree-gimple" } */
> +
> +int
> +short_circuit (float *a)
> +{
> +  float t1x = a[0];
> +  float t2x = a[1];
> +  float t1y = a[2];
> +  float t2y = a[3];
> +  float t1z = a[4];
> +  float t2z = a[5];
> +
> +  if (t1x > t2y  || t2x < t1y  || t1x > t2z || t2x < t1z || t1y > t2z || t2y < t1z)
> +    return 0;
> +
> +  return 1;
> +}
> +/* { dg-final { scan-tree-dump-times "if" 6 "gimple" } } */

-- 
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University


* Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.
  2023-12-12 11:26 ` Xi Ruoyao
@ 2023-12-12 11:59   ` Jiahao Xu
  2023-12-12 12:39     ` Xi Ruoyao
  0 siblings, 1 reply; 11+ messages in thread
From: Jiahao Xu @ 2023-12-12 11:59 UTC (permalink / raw)
  To: Xi Ruoyao, gcc-patches; +Cc: i, chenglulu, xuchenghua


On 2023/12/12 at 7:26 PM, Xi Ruoyao wrote:
> On Tue, 2023-12-12 at 19:14 +0800, Jiahao Xu wrote:
>> Define LOGICAL_OP_NON_SHORT_CIRCUIT as 0, for a short-circuit branch, use the
>> short-circuit operation instead of the non-short-circuit operation.
>>
>> This gives a 1.8% improvement in SPECCPU 2017 fprate on 3A6000.
> In r14-15 we removed LOGICAL_OP_NON_SHORT_CIRCUIT definition because the
> default value (1 for all current LoongArch CPUs with branch_cost = 6)
> may reduce the number of conditional branch instructions.
>
> I guess here the problem is floating-point compare instruction is much
> more costly than other instructions but the fact is not correctly
> modeled yet.  Could you try
> https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
> where I've raised fp_add cost (which is used for estimating floating-
> point compare cost) to 5 instructions and see if it solves your problem
> without LOGICAL_OP_NON_SHORT_CIRCUIT?
I think this is not the same issue as the cost of floating-point 
comparison instructions. The definition of LOGICAL_OP_NON_SHORT_CIRCUIT 
affects how the short-circuit branch, such as (A AND-IF B), is executed, 
and it is not directly related to the cost of floating-point comparison 
instructions. I will try to test it using SPECCPU 2017.
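
To make the distinction concrete, here is a small sketch of the two shapes
an OR-IF condition can take.  The function names are hypothetical, and an
optimizing compiler is free to turn either form into the other; the code
only illustrates the source-level difference.

```c
#include <stdbool.h>

/* Short-circuit form: the second comparison runs only if the first is
   false, which costs one conditional branch per comparison.  */
bool
overlap_short_circuit (float a, float b, float c, float d)
{
  if (a > b)
    return true;
  return c < d;
}

/* Non-short-circuit form: both comparisons always run and their 0/1
   results are combined with a bitwise OR, so no branch is needed
   between them.  */
bool
overlap_non_short_circuit (float a, float b, float c, float d)
{
  return (a > b) | (c < d);
}
```

Both functions return the same value for every input because the
comparisons have no side effects; only the number of branches and of
executed comparisons differs.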
> If not I guess you can try increasing the floating-point comparison cost
> more in loongarch_rtx_costs:
>
>      case UNLT:
>        /* Branch comparisons have VOIDmode, so use the first operand's
>           mode instead.  */
>        mode = GET_MODE (XEXP (x, 0));
>        if (FLOAT_MODE_P (mode))
>          {
>            *total = loongarch_cost->fp_add;
>
>
> Try to make it fp_add + something?
>
>            return false;
>          }
>        *total = loongarch_binary_cost (x, COSTS_N_INSNS (1), COSTS_N_INSNS (4),
>                                        speed);
>        return true;
>
>
> If adjusting the cost model does not work I'd say this is a middle-end
> issue and we should submit a bug report.
>
>> gcc/ChangeLog:
>>
>> 	* config/loongarch/loongarch.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Define.
>>
>> gcc/testsuite/ChangeLog:
>>
>> 	* gcc.target/loongarch/short-circuit.c: New test.
>>
>> diff --git a/gcc/config/loongarch/loongarch.h b/gcc/config/loongarch/loongarch.h
>> index f1350b6048f..880c576c35b 100644
>> --- a/gcc/config/loongarch/loongarch.h
>> +++ b/gcc/config/loongarch/loongarch.h
>> @@ -869,6 +869,7 @@ typedef struct {
>>      1 is the default; other values are interpreted relative to that.  */
>>   
>>   #define BRANCH_COST(speed_p, predictable_p) loongarch_branch_cost
>> +#define LOGICAL_OP_NON_SHORT_CIRCUIT 0
>>   
>>   /* Return the asm template for a conditional branch instruction.
>>      OPCODE is the opcode's mnemonic and OPERANDS is the asm template for
>> diff --git a/gcc/testsuite/gcc.target/loongarch/short-circuit.c b/gcc/testsuite/gcc.target/loongarch/short-circuit.c
>> new file mode 100644
>> index 00000000000..bed585ee172
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/loongarch/short-circuit.c
>> @@ -0,0 +1,19 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -ffast-math -fdump-tree-gimple" } */
>> +
>> +int
>> +short_circuit (float *a)
>> +{
>> +  float t1x = a[0];
>> +  float t2x = a[1];
>> +  float t1y = a[2];
>> +  float t2y = a[3];
>> +  float t1z = a[4];
>> +  float t2z = a[5];
>> +
>> +  if (t1x > t2y  || t2x < t1y  || t1x > t2z || t2x < t1z || t1y > t2z || t2y < t1z)
>> +    return 0;
>> +
>> +  return 1;
>> +}
>> +/* { dg-final { scan-tree-dump-times "if" 6 "gimple" } } */



* Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.
  2023-12-12 11:59   ` Jiahao Xu
@ 2023-12-12 12:39     ` Xi Ruoyao
  2023-12-12 18:27       ` Xi Ruoyao
  0 siblings, 1 reply; 11+ messages in thread
From: Xi Ruoyao @ 2023-12-12 12:39 UTC (permalink / raw)
  To: Jiahao Xu, gcc-patches; +Cc: i, chenglulu, xuchenghua

On Tue, 2023-12-12 at 19:59 +0800, Jiahao Xu wrote:
> > I guess here the problem is floating-point compare instruction is much
> > more costly than other instructions but the fact is not correctly
> > modeled yet.  Could you try
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
> > where I've raised fp_add cost (which is used for estimating floating-
> > point compare cost) to 5 instructions and see if it solves your problem
> > without LOGICAL_OP_NON_SHORT_CIRCUIT?
> I think this is not the same issue as the cost of floating-point 
> comparison instructions. The definition of LOGICAL_OP_NON_SHORT_CIRCUIT 
> affects how the short-circuit branch, such as (A AND-IF B), is executed, 
> and it is not directly related to the cost of floating-point comparison 
> instructions. I will try to test it using SPECCPU 2017.

The point is that if the cost of floating-point comparison is very high, the
middle end *should* short-circuit floating-point comparisons even if
LOGICAL_OP_NON_SHORT_CIRCUIT = 1.

I've created https://gcc.gnu.org/PR112985.

Another factor regressing the code is that we have not modeled the movcf2gr
instruction yet, so we are not really eliding the branches as
LOGICAL_OP_NON_SHORT_CIRCUIT = 1 is supposed to do.

-- 
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University


* Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.
  2023-12-12 12:39     ` Xi Ruoyao
@ 2023-12-12 18:27       ` Xi Ruoyao
  2023-12-13  1:30         ` chenglulu
                           ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Xi Ruoyao @ 2023-12-12 18:27 UTC (permalink / raw)
  To: Jiahao Xu, gcc-patches; +Cc: i, chenglulu, xuchenghua

On Tue, 2023-12-12 at 20:39 +0800, Xi Ruoyao wrote:
> On Tue, 2023-12-12 at 19:59 +0800, Jiahao Xu wrote:
> > > I guess here the problem is floating-point compare instruction is much
> > > more costly than other instructions but the fact is not correctly
> > > modeled yet.  Could you try
> > > https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
> > > where I've raised fp_add cost (which is used for estimating floating-
> > > point compare cost) to 5 instructions and see if it solves your problem
> > > without LOGICAL_OP_NON_SHORT_CIRCUIT?
> > I think this is not the same issue as the cost of floating-point 
> > comparison instructions. The definition of LOGICAL_OP_NON_SHORT_CIRCUIT 
> > affects how the short-circuit branch, such as (A AND-IF B), is executed, 
> > and it is not directly related to the cost of floating-point comparison 
> > instructions. I will try to test it using SPECCPU 2017.
> 
> The point is if the cost of floating-point comparison is very high, the
> middle end *should* short cut floating-point comparisons even if
> LOGICAL_OP_NON_SHORT_CIRCUIT = 1.
> 
> I've created https://gcc.gnu.org/PR112985.
> 
> Another factor regressing the code is we don't have modeled movcf2gr
> instruction yet, so we are not really eliding the branches as
> LOGICAL_OP_NON_SHORT_CIRCUIT = 1 supposes to do.

I came up with this:

diff --git a/gcc/config/loongarch/loongarch.md b/gcc/config/loongarch/loongarch.md
index a5d0dcd65fe..84d828ebd0f 100644
--- a/gcc/config/loongarch/loongarch.md
+++ b/gcc/config/loongarch/loongarch.md
@@ -3169,6 +3169,42 @@ (define_insn "s<code>_<ANYF:mode>_using_FCCmode"
   [(set_attr "type" "fcmp")
    (set_attr "mode" "FCC")])
 
+(define_insn "movcf2gr<GPR:mode>"
+  [(set (match_operand:GPR 0 "register_operand" "=r")
+	(if_then_else:GPR (ne (match_operand:FCC 1 "register_operand" "z")
+			      (const_int 0))
+			  (const_int 1)
+			  (const_int 0)))]
+  "TARGET_HARD_FLOAT"
+  "movcf2gr\t%0,%1"
+  [(set_attr "type" "move")
+   (set_attr "mode" "FCC")])
+
+(define_expand "cstore<ANYF:mode>4"
+  [(set (match_operand:SI 0 "register_operand")
+	(match_operator:SI 1 "loongarch_fcmp_operator"
+	  [(match_operand:ANYF 2 "register_operand")
+	   (match_operand:ANYF 3 "register_operand")]))]
+  ""
+  {
+    rtx fcc = gen_reg_rtx (FCCmode);
+    rtx cmp = gen_rtx_fmt_ee (GET_CODE (operands[1]), FCCmode,
+			      operands[2], operands[3]);
+
+    emit_insn (gen_rtx_SET (fcc, cmp));
+    if (TARGET_64BIT)
+      {
+	rtx gpr = gen_reg_rtx (DImode);
+	emit_insn (gen_movcf2grdi (gpr, fcc));
+	emit_insn (gen_rtx_SET (operands[0],
+				lowpart_subreg (SImode, gpr, DImode)));
+      }
+    else
+      emit_insn (gen_movcf2grsi (operands[0], fcc));
+
+    DONE;
+  })
+
 

 ;;
 ;;  ....................
diff --git a/gcc/config/loongarch/predicates.md b/gcc/config/loongarch/predicates.md
index 9e9ce58cb53..83fea08315c 100644
--- a/gcc/config/loongarch/predicates.md
+++ b/gcc/config/loongarch/predicates.md
@@ -590,6 +590,10 @@ (define_predicate "order_operator"
 (define_predicate "loongarch_cstore_operator"
   (match_code "ne,eq,gt,gtu,ge,geu,lt,ltu,le,leu"))
 
+(define_predicate "loongarch_fcmp_operator"
+  (match_code
+    "unordered,uneq,unlt,unle,eq,lt,le,ordered,ltgt,ne,ge,gt,unge,ungt"))
+
 (define_predicate "small_data_pattern"
   (and (match_code "set,parallel,unspec,unspec_volatile,prefetch")
        (match_test "loongarch_small_data_pattern_p (op)")))

and now this function is compiled to (with LOGICAL_OP_NON_SHORT_CIRCUIT
= 1):

	fld.s	$f1,$r4,0
	fld.s	$f0,$r4,4
	fld.s	$f3,$r4,8
	fld.s	$f2,$r4,12
	fcmp.slt.s	$fcc1,$f0,$f3
	fcmp.sgt.s	$fcc0,$f1,$f2
	movcf2gr	$r13,$fcc1
	movcf2gr	$r12,$fcc0
	or	$r12,$r12,$r13
	bnez	$r12,.L3
	fld.s	$f4,$r4,16
	fld.s	$f5,$r4,20
	or	$r4,$r0,$r0
	fcmp.sgt.s	$fcc1,$f1,$f5
	fcmp.slt.s	$fcc0,$f0,$f4
	movcf2gr	$r12,$fcc1
	movcf2gr	$r13,$fcc0
	or	$r12,$r12,$r13
	bnez	$r12,.L2
	fcmp.sgt.s	$fcc1,$f3,$f5
	fcmp.slt.s	$fcc0,$f2,$f4
	movcf2gr	$r4,$fcc1
	movcf2gr	$r12,$fcc0
	or	$r4,$r4,$r12
	xori	$r4,$r4,1
	slli.w	$r4,$r4,0
	jr	$r1
	.align	4
.L3:
	or	$r4,$r0,$r0
	.align	4
.L2:
	jr	$r1

Per my micro-benchmark this is much faster than
LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
when the branches are not predictable).
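
The micro-benchmark itself is not included in this message.  A harness
along the following lines could reproduce the setup; it is only a sketch,
with fill_random and count_hits as hypothetical helpers, and the timing
code around count_hits left to the reader.

```c
#include <stdlib.h>

/* Same predicate as the test case in the patch.  */
int
short_circuit (const float *a)
{
  float t1x = a[0], t2x = a[1];
  float t1y = a[2], t2y = a[3];
  float t1z = a[4], t2z = a[5];

  if (t1x > t2y || t2x < t1y || t1x > t2z || t2x < t1z
      || t1y > t2z || t2y < t1z)
    return 0;

  return 1;
}

/* Fill N_BOXES groups of six floats with pseudo-random values so the
   comparison outcomes are hard to predict.  */
void
fill_random (float *buf, size_t n_boxes, unsigned seed)
{
  srand (seed);
  for (size_t i = 0; i < n_boxes * 6; i++)
    buf[i] = (float) rand () / (float) RAND_MAX;
}

/* Count how many groups satisfy the predicate.  Timing this loop, for
   example with clock (), gives the measurement.  */
size_t
count_hits (const float *buf, size_t n_boxes)
{
  size_t hits = 0;
  for (size_t i = 0; i < n_boxes; i++)
    hits += short_circuit (buf + 6 * i);
  return hits;
}
```

Building the same harness once with LOGICAL_OP_NON_SHORT_CIRCUIT = 0 and
once with 1, and comparing the time spent in count_hits, would give the
kind of comparison described above.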

Note that there is a redundant slli.w instruction in the compiled code
and I couldn't find a way to remove it (my trick in the TARGET_64BIT
branch only works for simple examples).  We may be able to handle it via
the ext_dce pass [1] in the future.

[1]:https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637320.html

-- 
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University


* Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.
  2023-12-12 18:27       ` Xi Ruoyao
@ 2023-12-13  1:30         ` chenglulu
  2023-12-13  2:37         ` chenglulu
  2023-12-13  6:17         ` Jiahao Xu
  2 siblings, 0 replies; 11+ messages in thread
From: chenglulu @ 2023-12-13  1:30 UTC (permalink / raw)
  To: Xi Ruoyao, Jiahao Xu, gcc-patches; +Cc: i, xuchenghua

[-- Attachment #1: Type: text/plain, Size: 1293 bytes --]


On 2023/12/13 at 2:27 AM, Xi Ruoyao wrote:
>
> 	fld.s	$f1,$r4,0
> 	fld.s	$f0,$r4,4
> 	fld.s	$f3,$r4,8
> 	fld.s	$f2,$r4,12
> 	fcmp.slt.s	$fcc1,$f0,$f3
> 	fcmp.sgt.s	$fcc0,$f1,$f2
> 	movcf2gr	$r13,$fcc1
> 	movcf2gr	$r12,$fcc0
> 	or	$r12,$r12,$r13
> 	bnez	$r12,.L3
> 	fld.s	$f4,$r4,16
> 	fld.s	$f5,$r4,20
> 	or	$r4,$r0,$r0
> 	fcmp.sgt.s	$fcc1,$f1,$f5
> 	fcmp.slt.s	$fcc0,$f0,$f4
> 	movcf2gr	$r12,$fcc1
> 	movcf2gr	$r13,$fcc0
> 	or	$r12,$r12,$r13
> 	bnez	$r12,.L2
> 	fcmp.sgt.s	$fcc1,$f3,$f5
> 	fcmp.slt.s	$fcc0,$f2,$f4
> 	movcf2gr	$r4,$fcc1
> 	movcf2gr	$r12,$fcc0
> 	or	$r4,$r4,$r12
> 	xori	$r4,$r4,1
> 	slli.w	$r4,$r4,0
> 	jr	$r1
> 	.align	4
> .L3:
> 	or	$r4,$r0,$r0
> 	.align	4
> .L2:
> 	jr	$r1
>
> Per my micro-benchmark this is much faster than
> LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
> when the branches are not predictable).
>
> Note that there is a redundant slli.w instruction in the compiled code
> and I couldn't find a way to remove it (my trick in the TARGET_64BIT
> branch only works for simple examples).  We may be able to handle via
> the ext_dce pass [1] in the future.

The attached patch can remove the remaining sign-extension instructions
from the assembly.
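
As background on why such a sign extension can be redundant, the sketch
below models a 64-bit register holding a sign-extended 32-bit value.  The
helper names are hypothetical, and the narrowing cast relies on the
wrap-around behaviour that GCC documents for out-of-range conversions to
signed types.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model a 64-bit register holding a sign-extended 32-bit value.  */
int64_t
sext32 (uint32_t v)
{
  return (int64_t) (int32_t) v;
}

/* True if REG already equals the sign extension of its low 32 bits,
   i.e. a following "slli.w reg,reg,0" would leave it unchanged.  */
bool
is_sext32 (int64_t reg)
{
  return reg == (int64_t) (int32_t) (uint32_t) reg;
}

/* Full-width and/or/xor/nor of two sign-extended inputs always yield a
   sign-extended result, so no extra extension is needed.  */
bool
bitwise_keeps_sext32 (uint32_t x, uint32_t y)
{
  int64_t a = sext32 (x), b = sext32 (y);
  return is_sext32 (a & b) && is_sext32 (a | b)
	 && is_sext32 (a ^ b) && is_sext32 (~(a | b));
}
```

Addition does not share this property: sext32 (0x7fffffff) + sext32 (1) is
no longer a sign-extended 32-bit value, which is why the arithmetic
instructions need dedicated 32-bit forms while the bitwise ones do not.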

> [1]:https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637320.html
>

[-- Attachment #2: v1-0001-LoongArch-Optimized-some-of-the-symbolic-expansio.patch --]
[-- Type: text/x-patch, Size: 10324 bytes --]

From 01eea237e13056fad9839219ed1aa70037cd3b60 Mon Sep 17 00:00:00 2001
From: Lulu Cheng <chenglulu@loongson.cn>
Date: Fri, 8 Dec 2023 10:16:48 +0800
Subject: [PATCH v1] LoongArch: Optimized some of the symbolic expansion
 instructions generated during bitwise operations

There are two mode iterators defined in the loongarch.md:
	(define_mode_iterator GPR [SI (DI "TARGET_64BIT")])
  and
	(define_mode_iterator X [(SI "!TARGET_64BIT") (DI "TARGET_64BIT")])
Replace the mode in the bit arithmetic from GPR to X.

Since the bitwise operation instructions do not distinguish between operand
widths (64-bit, 32-bit, etc.), a sign extension is needed when the bitwise
operation is narrower than 64 bits.
The original definition generated a lot of redundant sign-extension
instructions.  This is optimized with reference to the RISC-V
implementation.

gcc/ChangeLog:

	* config/loongarch/loongarch.md (one_cmpl<mode>2): Replace GPR with X.
	(*nor<mode>3): Likewise.
	(nor<mode>3): Likewise.
	(*branch_on_bit<X:mode>): Likewise.
	(*branch_on_bit_range<X:mode>): Likewise.
	(*negsi2_extended): New template.
	(*<optab>si3_internal): Likewise.
	(*one_cmplsi2_internal): Likewise.
	(*norsi3_internal): Likewise.
	(*<optab>nsi_internal): Likewise.
	(bytepick_w_<bytepick_imm>_extend): Modify this template according to the
	modified bit operation to make the optimization work.
	* config/loongarch/predicates.md (branch_on_bit_operand): New predicate.

gcc/testsuite/ChangeLog:

	* gcc.target/loongarch/sign-extend-1.c: New test.
---
 gcc/config/loongarch/loongarch.md             | 148 +++++++++++++++---
 gcc/config/loongarch/predicates.md            |   5 +
 .../gcc.target/loongarch/sign-extend-1.c      |  21 +++
 3 files changed, 151 insertions(+), 23 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/loongarch/sign-extend-1.c

diff --git a/gcc/config/loongarch/loongarch.md b/gcc/config/loongarch/loongarch.md
index 7a101dd64b7..35788deafc7 100644
--- a/gcc/config/loongarch/loongarch.md
+++ b/gcc/config/loongarch/loongarch.md
@@ -721,7 +721,7 @@ (define_insn "sub<mode>3"
 
 (define_insn "sub<mode>3"
   [(set (match_operand:GPR 0 "register_operand" "=r")
-	(minus:GPR (match_operand:GPR 1 "register_operand" "rJ")
+	(minus:GPR (match_operand:GPR 1 "register_operand" "r")
 		   (match_operand:GPR 2 "register_operand" "r")))]
   ""
   "sub.<d>\t%0,%z1,%2"
@@ -1327,13 +1327,13 @@ (define_insn "neg<mode>2"
   [(set_attr "alu_type"	"sub")
    (set_attr "mode" "<MODE>")])
 
-(define_insn "one_cmpl<mode>2"
-  [(set (match_operand:GPR 0 "register_operand" "=r")
-	(not:GPR (match_operand:GPR 1 "register_operand" "r")))]
-  ""
-  "nor\t%0,%.,%1"
-  [(set_attr "alu_type" "not")
-   (set_attr "mode" "<MODE>")])
+(define_insn "*negsi2_extended"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(sign_extend:DI (neg:SI (match_operand:SI 1 "register_operand" "r"))))]
+  "TARGET_64BIT"
+  "sub.w\t%0,%.,%1"
+  [(set_attr "alu_type"	"sub")
+   (set_attr "mode" "SI")])
 
 (define_insn "neg<mode>2"
   [(set (match_operand:ANYF 0 "register_operand" "=f")
@@ -1353,14 +1353,39 @@ (define_insn "neg<mode>2"
 ;;
 
 (define_insn "<optab><mode>3"
-  [(set (match_operand:GPR 0 "register_operand" "=r,r")
-	(any_bitwise:GPR (match_operand:GPR 1 "register_operand" "%r,r")
-			 (match_operand:GPR 2 "uns_arith_operand" "r,K")))]
+  [(set (match_operand:X 0 "register_operand" "=r,r")
+	(any_bitwise:X (match_operand:X 1 "register_operand" "%r,r")
+		       (match_operand:X 2 "uns_arith_operand" "r,K")))]
   ""
   "<insn>%i2\t%0,%1,%2"
   [(set_attr "type" "logical")
    (set_attr "mode" "<MODE>")])
 
+(define_insn "*<optab>si3_internal"
+  [(set (match_operand:SI                 0 "register_operand" "=r,r")
+	(any_bitwise:SI (match_operand:SI 1 "register_operand" "%r,r")
+			(match_operand:SI 2 "uns_arith_operand"    " r,K")))]
+  "TARGET_64BIT"
+  "<insn>%i2\t%0,%1,%2"
+  [(set_attr "type" "logical")
+   (set_attr "mode" "SI")])
+
+(define_insn "one_cmpl<mode>2"
+  [(set (match_operand:X 0 "register_operand" "=r")
+	(not:X (match_operand:X 1 "register_operand" "r")))]
+  ""
+  "nor\t%0,%.,%1"
+  [(set_attr "alu_type" "not")
+   (set_attr "mode" "<MODE>")])
+
+(define_insn "*one_cmplsi2_internal"
+  [(set (match_operand:SI         0 "register_operand" "=r")
+	(not:SI (match_operand:SI 1 "register_operand" " r")))]
+  "TARGET_64BIT"
+  "nor\t%0,%.,%1"
+  [(set_attr "type" "logical")
+   (set_attr "mode" "SI")])
+
 (define_insn "and<mode>3_extended"
   [(set (match_operand:GPR 0 "register_operand" "=r")
 	(and:GPR (match_operand:GPR 1 "nonimmediate_operand" "r")
@@ -1476,25 +1501,43 @@ (define_insn "*iorhi3"
   [(set_attr "type" "logical")
    (set_attr "mode" "HI")])
 
-(define_insn "*nor<mode>3"
-  [(set (match_operand:GPR 0 "register_operand" "=r")
-	(and:GPR (not:GPR (match_operand:GPR 1 "register_operand" "%r"))
-		 (not:GPR (match_operand:GPR 2 "register_operand" "r"))))]
+(define_insn "nor<mode>3"
+  [(set (match_operand:X 0 "register_operand" "=r")
+	(and:X (not:X (match_operand:X 1 "register_operand" "%r"))
+		 (not:X (match_operand:X 2 "register_operand" "r"))))]
   ""
   "nor\t%0,%1,%2"
   [(set_attr "type" "logical")
    (set_attr "mode" "<MODE>")])
 
+(define_insn "*norsi3_internal"
+  [(set (match_operand:SI 0 "register_operand" "=r")
+	(and:SI (not:SI (match_operand:SI 1 "register_operand" "%r"))
+		 (not:SI (match_operand:SI 2 "register_operand" "r"))))]
+  "TARGET_64BIT"
+  "nor\t%0,%1,%2"
+  [(set_attr "type" "logical")
+   (set_attr "mode" "SI")])
+
 (define_insn "<optab>n<mode>"
-  [(set (match_operand:GPR 0 "register_operand" "=r")
-	(neg_bitwise:GPR
-	    (not:GPR (match_operand:GPR 1 "register_operand" "r"))
-	    (match_operand:GPR 2 "register_operand" "r")))]
+  [(set (match_operand:X 0 "register_operand" "=r")
+	(neg_bitwise:X
+	    (not:X (match_operand:X 1 "register_operand" "r"))
+	    (match_operand:X 2 "register_operand" "r")))]
   ""
   "<insn>n\t%0,%2,%1"
   [(set_attr "type" "logical")
    (set_attr "mode" "<MODE>")])
 
+(define_insn "*<optab>nsi_internal"
+  [(set (match_operand:SI 0 "register_operand" "=r")
+	(neg_bitwise:SI
+	    (not:SI (match_operand:SI 1 "register_operand" "r"))
+	    (match_operand:SI 2 "register_operand" "r")))]
+  "TARGET_64BIT"
+  "<insn>n\t%0,%2,%1"
+  [(set_attr "type" "logical")
+   (set_attr "mode" "SI")])
 \f
 ;;
 ;;  ....................
@@ -2976,6 +3019,62 @@ (define_expand "condjump"
 		      (label_ref (match_operand 1))
 		      (pc)))])
 
+(define_insn_and_split "*branch_on_bit<X:mode>"
+  [(set (pc)
+	(if_then_else
+	    (match_operator 0 "equality_operator"
+	        [(zero_extract:X (match_operand:X 2 "register_operand" "r")
+				 (const_int 1)
+				 (match_operand 3 "branch_on_bit_operand"))
+				 (const_int 0)])
+	    (label_ref (match_operand 1))
+	    (pc)))
+   (clobber (match_scratch:X 4 "=&r"))]
+  ""
+  "#"
+  "reload_completed"
+  [(set (match_dup 4)
+	(ashift:X (match_dup 2) (match_dup 3)))
+   (set (pc)
+	(if_then_else
+	    (match_op_dup 0 [(match_dup 4) (const_int 0)])
+	    (label_ref (match_operand 1))
+	    (pc)))]
+{
+  int shift = GET_MODE_BITSIZE (<MODE>mode) - 1 - INTVAL (operands[3]);
+  operands[3] = GEN_INT (shift);
+
+  if (GET_CODE (operands[0]) == EQ)
+    operands[0] = gen_rtx_GE (<MODE>mode, operands[4], const0_rtx);
+  else
+    operands[0] = gen_rtx_LT (<MODE>mode, operands[4], const0_rtx);
+})
+
+(define_insn_and_split "*branch_on_bit_range<X:mode>"
+  [(set (pc)
+	(if_then_else
+	    (match_operator 0 "equality_operator"
+		[(zero_extract:X (match_operand:X 2 "register_operand" "r")
+				 (match_operand 3 "branch_on_bit_operand")
+				 (const_int 0))
+				 (const_int 0)])
+	    (label_ref (match_operand 1))
+	    (pc)))
+   (clobber (match_scratch:X 4 "=&r"))]
+  ""
+  "#"
+  "reload_completed"
+  [(set (match_dup 4)
+	(ashift:X (match_dup 2) (match_dup 3)))
+   (set (pc)
+	(if_then_else
+	    (match_op_dup 0 [(match_dup 4) (const_int 0)])
+	    (label_ref (match_operand 1))
+	    (pc)))]
+{
+  operands[3] = GEN_INT (GET_MODE_BITSIZE (<MODE>mode) - INTVAL (operands[3]));
+})
+
 
 \f
 ;;
@@ -3762,10 +3861,13 @@ (define_insn "bytepick_w_<bytepick_imm>"
 (define_insn "bytepick_w_<bytepick_imm>_extend"
   [(set (match_operand:DI 0 "register_operand" "=r")
 	(sign_extend:DI
-	  (ior:SI (lshiftrt (match_operand:SI 1 "register_operand" "r")
-			    (const_int <bytepick_w_lshiftrt_amount>))
-		  (ashift (match_operand:SI 2 "register_operand" "r")
-			  (const_int bytepick_w_ashift_amount)))))]
+	 (subreg:SI
+	  (ior:DI (subreg:DI (lshiftrt
+			      (match_operand:SI 1 "register_operand" "r")
+			      (const_int <bytepick_w_lshiftrt_amount>)) 0)
+		  (subreg:DI (ashift
+			      (match_operand:SI 2 "register_operand" "r")
+			      (const_int bytepick_w_ashift_amount)) 0)) 0)))]
   "TARGET_64BIT"
   "bytepick.w\t%0,%1,%2,<bytepick_imm>"
   [(set_attr "mode" "SI")])
diff --git a/gcc/config/loongarch/predicates.md b/gcc/config/loongarch/predicates.md
index d02e846cb12..5084752171a 100644
--- a/gcc/config/loongarch/predicates.md
+++ b/gcc/config/loongarch/predicates.md
@@ -67,6 +67,11 @@ (define_predicate "arith_operand"
   (ior (match_operand 0 "const_arith_operand")
        (match_operand 0 "register_operand")))
 
+;; Only use branch-on-bit sequences when the mask is not an ANDI immediate.
+(define_predicate "branch_on_bit_operand"
+  (and (match_code "const_int")
+       (match_test "INTVAL (op) >= IMM_BITS - 1")))
+
 (define_predicate "plus_di_operand"
   (ior (match_operand 0 "arith_operand")
        (match_operand 0 "const_dual_imm12_operand")
diff --git a/gcc/testsuite/gcc.target/loongarch/sign-extend-1.c b/gcc/testsuite/gcc.target/loongarch/sign-extend-1.c
new file mode 100644
index 00000000000..c294ba6c407
--- /dev/null
+++ b/gcc/testsuite/gcc.target/loongarch/sign-extend-1.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-options "-mabi=lp64d -O2" } */
+/* { dg-final { scan-assembler-not "slli.w" } } */
+
+struct pmop
+{
+  unsigned int op_pmflags;
+  unsigned int op_pmpermflags;
+};
+unsigned int PL_hints;
+
+struct pmop *pmop;
+void
+Perl_newPMOP (int type, int flags)
+{
+  if (PL_hints & 0x00100000)
+    pmop->op_pmpermflags |= 0x0001;
+  if (PL_hints & 0x00000004)
+    pmop->op_pmpermflags |= 0x0800;
+  pmop->op_pmflags = pmop->op_pmpermflags;
+}
-- 
2.39.3



* Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.
  2023-12-12 18:27       ` Xi Ruoyao
  2023-12-13  1:30         ` chenglulu
@ 2023-12-13  2:37         ` chenglulu
  2023-12-13  6:17         ` Jiahao Xu
  2 siblings, 0 replies; 11+ messages in thread
From: chenglulu @ 2023-12-13  2:37 UTC (permalink / raw)
  To: Xi Ruoyao, Jiahao Xu, gcc-patches; +Cc: i, xuchenghua


On 2023/12/13 at 2:27 AM, Xi Ruoyao wrote:
> On Tue, 2023-12-12 at 20:39 +0800, Xi Ruoyao wrote:
>
> 	fld.s	$f1,$r4,0
> 	fld.s	$f0,$r4,4
> 	fld.s	$f3,$r4,8
> 	fld.s	$f2,$r4,12
> 	fcmp.slt.s	$fcc1,$f0,$f3
> 	fcmp.sgt.s	$fcc0,$f1,$f2
> 	movcf2gr	$r13,$fcc1
> 	movcf2gr	$r12,$fcc0

There is also a problem: on 3A5000, MOVCF2GR requires 7 cycles, whereas
MOVCF2FR+MOVFR2GR takes a cycle.  The 3A6000 does not have this problem.

> 	or	$r12,$r12,$r13
> 	bnez	$r12,.L3
> 	fld.s	$f4,$r4,16
> 	fld.s	$f5,$r4,20
> 	or	$r4,$r0,$r0
> 	fcmp.sgt.s	$fcc1,$f1,$f5
> 	fcmp.slt.s	$fcc0,$f0,$f4
> 	movcf2gr	$r12,$fcc1
> 	movcf2gr	$r13,$fcc0
> 	or	$r12,$r12,$r13
> 	bnez	$r12,.L2
> 	fcmp.sgt.s	$fcc1,$f3,$f5
> 	fcmp.slt.s	$fcc0,$f2,$f4
> 	movcf2gr	$r4,$fcc1
> 	movcf2gr	$r12,$fcc0
> 	or	$r4,$r4,$r12
> 	xori	$r4,$r4,1
> 	slli.w	$r4,$r4,0
> 	jr	$r1
> 	.align	4
> .L3:
> 	or	$r4,$r0,$r0
> 	.align	4
> .L2:
> 	jr	$r1
>
> Per my micro-benchmark this is much faster than
> LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
> when the branches are not predictable).
>
> Note that there is a redundant slli.w instruction in the compiled code
> and I couldn't find a way to remove it (my trick in the TARGET_64BIT
> branch only works for simple examples).  We may be able to handle via
> the ext_dce pass [1] in the future.
>
> [1]:https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637320.html
>



* Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.
  2023-12-12 18:27       ` Xi Ruoyao
  2023-12-13  1:30         ` chenglulu
  2023-12-13  2:37         ` chenglulu
@ 2023-12-13  6:17         ` Jiahao Xu
  2023-12-13  6:21           ` Xi Ruoyao
  2 siblings, 1 reply; 11+ messages in thread
From: Jiahao Xu @ 2023-12-13  6:17 UTC (permalink / raw)
  To: Xi Ruoyao, gcc-patches; +Cc: i, chenglulu, xuchenghua


On 2023/12/13 at 2:27 AM, Xi Ruoyao wrote:
> On Tue, 2023-12-12 at 20:39 +0800, Xi Ruoyao wrote:
>> On Tue, 2023-12-12 at 19:59 +0800, Jiahao Xu wrote:
>>>> I guess here the problem is floating-point compare instruction is much
>>>> more costly than other instructions but the fact is not correctly
>>>> modeled yet.  Could you try
>>>> https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
>>>> where I've raised fp_add cost (which is used for estimating floating-
>>>> point compare cost) to 5 instructions and see if it solves your problem
>>>> without LOGICAL_OP_NON_SHORT_CIRCUIT?
>>> I think this is not the same issue as the cost of floating-point
>>> comparison instructions. The definition of LOGICAL_OP_NON_SHORT_CIRCUIT
>>> affects how the short-circuit branch, such as (A AND-IF B), is executed,
>>> and it is not directly related to the cost of floating-point comparison
>>> instructions. I will try to test it using SPECCPU 2017.
>> The point is if the cost of floating-point comparison is very high, the
>> middle end *should* short cut floating-point comparisons even if
>> LOGICAL_OP_NON_SHORT_CIRCUIT = 1.
>>
>> I've created https://gcc.gnu.org/PR112985.
>>
>> Another factor regressing the code is that we have not modeled the
>> movcf2gr instruction yet, so we are not really eliding the branches as
>> LOGICAL_OP_NON_SHORT_CIRCUIT = 1 is supposed to do.
> I made up this:
>
> diff --git a/gcc/config/loongarch/loongarch.md b/gcc/config/loongarch/loongarch.md
> index a5d0dcd65fe..84d828ebd0f 100644
> --- a/gcc/config/loongarch/loongarch.md
> +++ b/gcc/config/loongarch/loongarch.md
> @@ -3169,6 +3169,42 @@ (define_insn "s<code>_<ANYF:mode>_using_FCCmode"
>     [(set_attr "type" "fcmp")
>      (set_attr "mode" "FCC")])
>   
> +(define_insn "movcf2gr<GPR:mode>"
> +  [(set (match_operand:GPR 0 "register_operand" "=r")
> +	(if_then_else:GPR (ne (match_operand:FCC 1 "register_operand" "z")
> +			      (const_int 0))
> +			  (const_int 1)
> +			  (const_int 0)))]
> +  "TARGET_HARD_FLOAT"
> +  "movcf2gr\t%0,%1"
> +  [(set_attr "type" "move")
> +   (set_attr "mode" "FCC")])
> +
> +(define_expand "cstore<ANYF:mode>4"
> +  [(set (match_operand:SI 0 "register_operand")
> +	(match_operator:SI 1 "loongarch_fcmp_operator"
> +	  [(match_operand:ANYF 2 "register_operand")
> +	   (match_operand:ANYF 3 "register_operand")]))]
> +  ""
> +  {
> +    rtx fcc = gen_reg_rtx (FCCmode);
> +    rtx cmp = gen_rtx_fmt_ee (GET_CODE (operands[1]), FCCmode,
> +			      operands[2], operands[3]);
> +
> +    emit_insn (gen_rtx_SET (fcc, cmp));
> +    if (TARGET_64BIT)
> +      {
> +	rtx gpr = gen_reg_rtx (DImode);
> +	emit_insn (gen_movcf2grdi (gpr, fcc));
> +	emit_insn (gen_rtx_SET (operands[0],
> +				lowpart_subreg (SImode, gpr, DImode)));
> +      }
> +    else
> +      emit_insn (gen_movcf2grsi (operands[0], fcc));
> +
> +    DONE;
> +  })
> +
>   
>
>   ;;
>   ;;  ....................
> diff --git a/gcc/config/loongarch/predicates.md b/gcc/config/loongarch/predicates.md
> index 9e9ce58cb53..83fea08315c 100644
> --- a/gcc/config/loongarch/predicates.md
> +++ b/gcc/config/loongarch/predicates.md
> @@ -590,6 +590,10 @@ (define_predicate "order_operator"
>   (define_predicate "loongarch_cstore_operator"
>     (match_code "ne,eq,gt,gtu,ge,geu,lt,ltu,le,leu"))
>   
> +(define_predicate "loongarch_fcmp_operator"
> +  (match_code
> +    "unordered,uneq,unlt,unle,eq,lt,le,ordered,ltgt,ne,ge,gt,unge,ungt"))
> +
>   (define_predicate "small_data_pattern"
>     (and (match_code "set,parallel,unspec,unspec_volatile,prefetch")
>          (match_test "loongarch_small_data_pattern_p (op)")))
>
> and now this function is compiled to (with LOGICAL_OP_NON_SHORT_CIRCUIT
> = 1):
>
> 	fld.s	$f1,$r4,0
> 	fld.s	$f0,$r4,4
> 	fld.s	$f3,$r4,8
> 	fld.s	$f2,$r4,12
> 	fcmp.slt.s	$fcc1,$f0,$f3
> 	fcmp.sgt.s	$fcc0,$f1,$f2
> 	movcf2gr	$r13,$fcc1
> 	movcf2gr	$r12,$fcc0
> 	or	$r12,$r12,$r13
> 	bnez	$r12,.L3
> 	fld.s	$f4,$r4,16
> 	fld.s	$f5,$r4,20
> 	or	$r4,$r0,$r0
> 	fcmp.sgt.s	$fcc1,$f1,$f5
> 	fcmp.slt.s	$fcc0,$f0,$f4
> 	movcf2gr	$r12,$fcc1
> 	movcf2gr	$r13,$fcc0
> 	or	$r12,$r12,$r13
> 	bnez	$r12,.L2
> 	fcmp.sgt.s	$fcc1,$f3,$f5
> 	fcmp.slt.s	$fcc0,$f2,$f4
> 	movcf2gr	$r4,$fcc1
> 	movcf2gr	$r12,$fcc0
> 	or	$r4,$r4,$r12
> 	xori	$r4,$r4,1
> 	slli.w	$r4,$r4,0
> 	jr	$r1
> 	.align	4
> .L3:
> 	or	$r4,$r0,$r0
> 	.align	4
> .L2:
> 	jr	$r1
>
> Per my micro-benchmark this is much faster than
> LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
> when the branches are not predictable).
>
> Note that there is a redundant slli.w instruction in the compiled code
> and I couldn't find a way to remove it (my trick in the TARGET_64BIT
> branch only works for simple examples).  We may be able to handle via
> the ext_dce pass [1] in the future.
>
> [1]:https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637320.html
>
This test was extracted from the hot functions of 526.blender_r. Setting
LOGICAL_OP_NON_SHORT_CIRCUIT to 0 resulted in a 26% decrease in dynamic
instruction count and a 13.4% performance improvement. After applying
the patch above, the assembly code looks much better with
LOGICAL_OP_NON_SHORT_CIRCUIT = 1, bringing an 11% improvement to 526.
On top of that, setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 further
improved the performance of 526 by 3%. The definition of
LOGICAL_OP_NON_SHORT_CIRCUIT determines how GIMPLE is generated, while
the optimizations you made determine how RTL is generated. They do not
conflict, and combining them yields better results.  So far I have only
tested 526; I will continue testing the impact on the entire SPEC 2017
suite.
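
To illustrate the GIMPLE-level difference described above, here is a
minimal C sketch of the two shapes the test condition can take. The
function names are hypothetical, and the second function only
approximates at source level what the non-short-circuit form amounts
to; it is not the output of any GCC pass.

```c
#include <assert.h>

/* Short-circuit shape: each comparison guards the next one, so there is
   one conditional branch per comparison (six in total, matching the six
   "if"s the test case scans for).  */
int
short_circuit_form (const float *a)
{
  if (a[0] > a[3]) return 0;   /* t1x > t2y */
  if (a[1] < a[2]) return 0;   /* t2x < t1y */
  if (a[0] > a[5]) return 0;   /* t1x > t2z */
  if (a[1] < a[4]) return 0;   /* t2x < t1z */
  if (a[2] > a[5]) return 0;   /* t1y > t2z */
  if (a[3] < a[4]) return 0;   /* t2y < t1z */
  return 1;
}

/* Non-short-circuit shape: every comparison is evaluated and the 0/1
   results are combined with bitwise OR, leaving a single final test.  */
int
non_short_circuit_form (const float *a)
{
  int any = (a[0] > a[3]) | (a[1] < a[2]) | (a[0] > a[5])
	    | (a[1] < a[4]) | (a[2] > a[5]) | (a[3] < a[4]);
  return !any;
}

/* Helper for experimenting: both shapes must return the same value.  */
void
check_forms_agree (const float *a)
{
  assert (short_circuit_form (a) == non_short_circuit_form (a));
}
```

The two functions return the same value for any input; they differ only
in how many comparisons are executed and how many branches are needed,
which is the trade-off the numbers above are measuring.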



* Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.
  2023-12-13  6:17         ` Jiahao Xu
@ 2023-12-13  6:21           ` Xi Ruoyao
  2023-12-13  6:32             ` Jiahao Xu
  0 siblings, 1 reply; 11+ messages in thread
From: Xi Ruoyao @ 2023-12-13  6:21 UTC (permalink / raw)
  To: Jiahao Xu, gcc-patches; +Cc: i, chenglulu, xuchenghua

On Wed, 2023-12-13 at 14:17 +0800, Jiahao Xu wrote:
> This test was extracted from the hot functions of 526.blender_r. Setting 
> LOGICAL_OP_NON_SHORT_CIRCUIT to 0 resulted in a 26% decrease in dynamic 
> instruction count and a 13.4% performance improvement. After applying 
> the patch mentioned above, the assembly code looks much better with 
> LOGICAL_OP_NON_SHORT_CIRCUIT=1, bringing an 11% improvement to 526. 
> Based on this, setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 further 
> improved the performance of 526 by 3%. The definition of 
> LOGICAL_OP_NON_SHORT_CIRCUIT determines how gimple is generated, while
> the optimizations you made determine how rtl is generated. They are not 
> conflicting and combining them would yield better results.  Currently, I 
> have only tested it on 526, and I will continue testing its impact on 
> the entire SPEC 2017 suite.

The problem with LOGICAL_OP_NON_SHORT_CIRCUIT = 0 is that it may regress
fixed-point-only code.  In practice the use of -ffast-math is very
rare ("real" Linux packages doing floating-point operations often
just malfunction with it), and it does not seem good to regress common
cases for the sake of uncommon ones.
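
As a sketch of the integer-only case this concern is about (the
function names are hypothetical, and which shape GCC actually emits
depends on the target's cost settings, which are not modeled here):
when the comparisons are cheap integer ones, evaluating both and
OR-ing the results can replace two conditional branches with one.

```c
#include <assert.h>

/* Branchy shape: up to two conditional branches.  */
int
in_range_branchy (int x, int lo, int hi)
{
  if (x < lo)
    return 0;
  if (x > hi)
    return 0;
  return 1;
}

/* Branch-free shape: both integer compares are always evaluated and
   their 0/1 results are combined with bitwise OR.  */
int
in_range_branchless (int x, int lo, int hi)
{
  return !((x < lo) | (x > hi));
}

/* Helper for experimenting: both shapes must agree.  */
void
check_in_range_agree (int x, int lo, int hi)
{
  assert (in_range_branchy (x, lo, hi) == in_range_branchless (x, lo, hi));
}
```

Whether the second shape is faster depends on how predictable the
branches in the first shape are, so this is only an illustration of the
trade-off, not a measurement.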

-- 
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University


* Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.
  2023-12-13  6:21           ` Xi Ruoyao
@ 2023-12-13  6:32             ` Jiahao Xu
  2023-12-13  7:30               ` Xi Ruoyao
  0 siblings, 1 reply; 11+ messages in thread
From: Jiahao Xu @ 2023-12-13  6:32 UTC (permalink / raw)
  To: Xi Ruoyao, gcc-patches; +Cc: i, chenglulu, xuchenghua


On 2023/12/13 2:21 PM, Xi Ruoyao wrote:
> On Wed, 2023-12-13 at 14:17 +0800, Jiahao Xu wrote:
>> This test was extracted from the hot functions of 526.blender_r. Setting
>> LOGICAL_OP_NON_SHORT_CIRCUIT to 0 resulted in a 26% decrease in dynamic
>> instruction count and a 13.4% performance improvement. After applying
>> the patch mentioned above, the assembly code looks much better with
>> LOGICAL_OP_NON_SHORT_CIRCUIT=1, bringing an 11% improvement to 526.
>> Based on this, setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 further
>> improved the performance of 526 by 3%. The definition of
>> LOGICAL_OP_NON_SHORT_CIRCUIT determines how gimple is generated, while
>> the optimizations you made determine how rtl is generated. They are not
>> conflicting and combining them would yield better results.  Currently, I
>> have only tested it on 526, and I will continue testing its impact on
>> the entire SPEC 2017 suite.
> The problem with LOGICAL_OP_NON_SHORT_CIRCUIT = 0 is it may regress
> fixed-point only code.  In practice the usage of -ffast-math is very
> rare ("real" Linux packages invoking floating-point operations often
> just malfunction with it) and it seems not good to regress common cases
> with uncommon cases.
>
Setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 on the SPEC2017 intrate
benchmark results in a 1.6% decrease in dynamic instruction count and an
overall performance improvement of 0.5%. Most of the SPEC2017 int
programs show a decrease in instruction count, and no performance
regressions were observed.



* Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.
  2023-12-13  6:32             ` Jiahao Xu
@ 2023-12-13  7:30               ` Xi Ruoyao
  0 siblings, 0 replies; 11+ messages in thread
From: Xi Ruoyao @ 2023-12-13  7:30 UTC (permalink / raw)
  To: Jiahao Xu, gcc-patches; +Cc: i, chenglulu, xuchenghua

On Wed, 2023-12-13 at 14:32 +0800, Jiahao Xu wrote:
> 
> On 2023/12/13 2:21 PM, Xi Ruoyao wrote:
> > On Wed, 2023-12-13 at 14:17 +0800, Jiahao Xu wrote:
> > > This test was extracted from the hot functions of 526.blender_r. Setting
> > > LOGICAL_OP_NON_SHORT_CIRCUIT to 0 resulted in a 26% decrease in dynamic
> > > instruction count and a 13.4% performance improvement. After applying
> > > the patch mentioned above, the assembly code looks much better with
> > > LOGICAL_OP_NON_SHORT_CIRCUIT=1, bringing an 11% improvement to 526.
> > > Based on this, setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 further
> > > improved the performance of 526 by 3%. The definition of
> > > LOGICAL_OP_NON_SHORT_CIRCUIT determines how gimple is generated, while
> > > the optimizations you made determine how rtl is generated. They are not
> > > conflicting and combining them would yield better results.  Currently, I
> > > have only tested it on 526, and I will continue testing its impact on
> > > the entire SPEC 2017 suite.
> > The problem with LOGICAL_OP_NON_SHORT_CIRCUIT = 0 is it may regress
> > fixed-point only code.  In practice the usage of -ffast-math is very
> > rare ("real" Linux packages invoking floating-point operations often
> > just malfunction with it) and it seems not good to regress common cases
> > with uncommon cases.
> > 
> Setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 in SPEC2017 intrate benchmark 
> results in a 1.6% decrease in dynamic instruction count and an overall
> performance improvement of 0.5%. Most of the SPEC2017 int programs 
> experience a decrease in instruction count, and there are no instances
> of performance regression observed.

OK then.  But please add this information to the commit message.

-- 
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University


Thread overview: 11+ messages
2023-12-12 11:14 [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT Jiahao Xu
2023-12-12 11:26 ` Xi Ruoyao
2023-12-12 11:59   ` Jiahao Xu
2023-12-12 12:39     ` Xi Ruoyao
2023-12-12 18:27       ` Xi Ruoyao
2023-12-13  1:30         ` chenglulu
2023-12-13  2:37         ` chenglulu
2023-12-13  6:17         ` Jiahao Xu
2023-12-13  6:21           ` Xi Ruoyao
2023-12-13  6:32             ` Jiahao Xu
2023-12-13  7:30               ` Xi Ruoyao
