Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.

public inbox for gcc-patches@gcc.gnu.org

From: Jiahao Xu <xujiahao@loongson.cn>
To: Xi Ruoyao <xry111@xry111.site>, gcc-patches@gcc.gnu.org
Cc: i@xen0n.name, chenglulu@loongson.cn, xuchenghua@loongson.cn
Subject: Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.
Date: Wed, 13 Dec 2023 14:17:05 +0800
Message-ID: <2af90567-058c-d4f9-3805-da646cd51c1f@loongson.cn>
In-Reply-To: <4646f6a313c83cde74e682d4cba4419120947fd1.camel@xry111.site>


On 2023/12/13 at 2:27 AM, Xi Ruoyao wrote:
> On Tue, 2023-12-12 at 20:39 +0800, Xi Ruoyao wrote:
>> On Tue, 2023-12-12 at 19:59 +0800, Jiahao Xu wrote:
>>>> I guess the problem here is that the floating-point compare instruction
>>>> is much more costly than other instructions, but this fact is not
>>>> correctly modeled yet.  Could you try
>>>> https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
>>>> where I've raised fp_add cost (which is used for estimating floating-
>>>> point compare cost) to 5 instructions and see if it solves your problem
>>>> without LOGICAL_OP_NON_SHORT_CIRCUIT?
>>> I think this is not the same issue as the cost of floating-point
>>> comparison instructions. The definition of LOGICAL_OP_NON_SHORT_CIRCUIT
>>> affects how the short-circuit branch, such as (A AND-IF B), is executed,
>>> and it is not directly related to the cost of floating-point comparison
>>> instructions. I will try to test it using SPECCPU 2017.
>> The point is that if the cost of floating-point comparison is very high,
>> the middle end *should* short-circuit floating-point comparisons even if
>> LOGICAL_OP_NON_SHORT_CIRCUIT = 1.
>>
>> I've created https://gcc.gnu.org/PR112985.
>>
>> Another factor regressing the code is that we have not modeled the
>> movcf2gr instruction yet, so we are not really eliding the branches as
>> LOGICAL_OP_NON_SHORT_CIRCUIT = 1 is supposed to do.
> I came up with this:
>
> diff --git a/gcc/config/loongarch/loongarch.md b/gcc/config/loongarch/loongarch.md
> index a5d0dcd65fe..84d828ebd0f 100644
> --- a/gcc/config/loongarch/loongarch.md
> +++ b/gcc/config/loongarch/loongarch.md
> @@ -3169,6 +3169,42 @@ (define_insn "s<code>_<ANYF:mode>_using_FCCmode"
>     [(set_attr "type" "fcmp")
>      (set_attr "mode" "FCC")])
>   
> +(define_insn "movcf2gr<GPR:mode>"
> +  [(set (match_operand:GPR 0 "register_operand" "=r")
> +	(if_then_else:GPR (ne (match_operand:FCC 1 "register_operand" "z")
> +			      (const_int 0))
> +			  (const_int 1)
> +			  (const_int 0)))]
> +  "TARGET_HARD_FLOAT"
> +  "movcf2gr\t%0,%1"
> +  [(set_attr "type" "move")
> +   (set_attr "mode" "FCC")])
> +
> +(define_expand "cstore<ANYF:mode>4"
> +  [(set (match_operand:SI 0 "register_operand")
> +	(match_operator:SI 1 "loongarch_fcmp_operator"
> +	  [(match_operand:ANYF 2 "register_operand")
> +	   (match_operand:ANYF 3 "register_operand")]))]
> +  ""
> +  {
> +    rtx fcc = gen_reg_rtx (FCCmode);
> +    rtx cmp = gen_rtx_fmt_ee (GET_CODE (operands[1]), FCCmode,
> +			      operands[2], operands[3]);
> +
> +    emit_insn (gen_rtx_SET (fcc, cmp));
> +    if (TARGET_64BIT)
> +      {
> +	rtx gpr = gen_reg_rtx (DImode);
> +	emit_insn (gen_movcf2grdi (gpr, fcc));
> +	emit_insn (gen_rtx_SET (operands[0],
> +				lowpart_subreg (SImode, gpr, DImode)));
> +      }
> +    else
> +      emit_insn (gen_movcf2grsi (operands[0], fcc));
> +
> +    DONE;
> +  })
> +
>   
>
>   ;;
>   ;;  ....................
> diff --git a/gcc/config/loongarch/predicates.md b/gcc/config/loongarch/predicates.md
> index 9e9ce58cb53..83fea08315c 100644
> --- a/gcc/config/loongarch/predicates.md
> +++ b/gcc/config/loongarch/predicates.md
> @@ -590,6 +590,10 @@ (define_predicate "order_operator"
>   (define_predicate "loongarch_cstore_operator"
>     (match_code "ne,eq,gt,gtu,ge,geu,lt,ltu,le,leu"))
>   
> +(define_predicate "loongarch_fcmp_operator"
> +  (match_code
> +    "unordered,uneq,unlt,unle,eq,lt,le,ordered,ltgt,ne,ge,gt,unge,ungt"))
> +
>   (define_predicate "small_data_pattern"
>     (and (match_code "set,parallel,unspec,unspec_volatile,prefetch")
>          (match_test "loongarch_small_data_pattern_p (op)")))
>
> and now this function is compiled to (with LOGICAL_OP_NON_SHORT_CIRCUIT
> = 1):
>
> 	fld.s	$f1,$r4,0
> 	fld.s	$f0,$r4,4
> 	fld.s	$f3,$r4,8
> 	fld.s	$f2,$r4,12
> 	fcmp.slt.s	$fcc1,$f0,$f3
> 	fcmp.sgt.s	$fcc0,$f1,$f2
> 	movcf2gr	$r13,$fcc1
> 	movcf2gr	$r12,$fcc0
> 	or	$r12,$r12,$r13
> 	bnez	$r12,.L3
> 	fld.s	$f4,$r4,16
> 	fld.s	$f5,$r4,20
> 	or	$r4,$r0,$r0
> 	fcmp.sgt.s	$fcc1,$f1,$f5
> 	fcmp.slt.s	$fcc0,$f0,$f4
> 	movcf2gr	$r12,$fcc1
> 	movcf2gr	$r13,$fcc0
> 	or	$r12,$r12,$r13
> 	bnez	$r12,.L2
> 	fcmp.sgt.s	$fcc1,$f3,$f5
> 	fcmp.slt.s	$fcc0,$f2,$f4
> 	movcf2gr	$r4,$fcc1
> 	movcf2gr	$r12,$fcc0
> 	or	$r4,$r4,$r12
> 	xori	$r4,$r4,1
> 	slli.w	$r4,$r4,0
> 	jr	$r1
> 	.align	4
> .L3:
> 	or	$r4,$r0,$r0
> 	.align	4
> .L2:
> 	jr	$r1
>
> Per my micro-benchmark this is much faster than
> LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
> when the branches are not predictable).
>
> Note that there is a redundant slli.w instruction in the compiled code
> and I couldn't find a way to remove it (my trick in the TARGET_64BIT
> branch only works for simple examples).  We may be able to handle it via
> the ext_dce pass [1] in the future.
>
> [1]:https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637320.html
>
This test was extracted from the hot functions of 526.blender_r. Setting
LOGICAL_OP_NON_SHORT_CIRCUIT to 0 resulted in a 26% decrease in dynamic
instruction count and a 13.4% performance improvement. After applying
the patch quoted above, the assembly code looks much better with
LOGICAL_OP_NON_SHORT_CIRCUIT = 1, bringing an 11% improvement to 526.
On top of that patch, setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 improved
the performance of 526 by a further 3%. The definition of
LOGICAL_OP_NON_SHORT_CIRCUIT determines how GIMPLE is generated, while
your changes determine how RTL is generated. The two do not conflict,
and combining them yields better results. So far I have only tested
526; I will continue by testing the impact on the entire SPEC 2017
suite.
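
For readers following the thread, the GIMPLE-level difference can be
sketched in C roughly as follows. This is only an illustration, not the
code from 526.blender_r: the function names and the array layout are
hypothetical, chosen to resemble the first pair of comparisons in the
assembly above. When LOGICAL_OP_NON_SHORT_CIRCUIT is nonzero, the middle
end may turn the first form into something like the second, provided the
operands are simple and free of side effects.

```c
#include <stdbool.h>

/* Hypothetical example: p points to at least four floats. */

/* Short-circuit form: the second comparison is evaluated only when
   the first one is false, so each || may become its own branch. */
bool outside_short_circuit(const float *p)
{
    return p[1] < p[2] || p[0] > p[3];
}

/* Non-short-circuit form: both comparisons are always evaluated and
   their results are combined with a bitwise OR, so a single branch
   (or none) is enough to consume the result. */
bool outside_non_short_circuit(const float *p)
{
    bool a = p[1] < p[2];
    bool b = p[0] > p[3];
    return a | b;
}
```

Both functions return the same value for any input, because the
comparisons have no side effects. Which form runs faster depends on how
predictable the branches are and on how expensive the comparisons are,
which is the trade-off discussed in this thread.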
