public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
@ 2021-07-09  2:07 Jiufu Guo
  2021-07-09  6:43 ` Richard Biener
  0 siblings, 1 reply; 20+ messages in thread
From: Jiufu Guo @ 2021-07-09  2:07 UTC (permalink / raw)
  To: gcc-patches
  Cc: guojiufu, wschmidt, segher, dje.gcc, rguenther, jlaw, amker.cheng

Currently, doloop.xx variable is using the type as niter which may shorter
than word size.  For some cases, it may be better to use word size type.
For example, on some 64bit system, to access 32bit niter, subreg maybe used.
Then using 64bit type would not need to use subreg if the value can be
present in both 32bit and 64bit.

This patch updates doloop iv to BIT_PER_WORD size if it is fine.

Bootstrap and regtest pass on powerpc64le and x86, is this ok for trunk?

BR.
Jiufu

gcc/ChangeLog:

2021-07-08  Jiufu Guo  <guojiufu@linux.ibm.com>

	PR target/61837
	* tree-ssa-loop-ivopts.c (add_iv_candidate_for_doloop):
	Update iv on BITS_PER_WORD for niter.

gcc/testsuite/ChangeLog:

2021-07-08  Jiufu Guo  <guojiufu@linux.ibm.com>

	PR target/61837
	* gcc.target/powerpc/pr61837.c: New test.

---
 gcc/testsuite/gcc.target/powerpc/pr61837.c | 16 ++++++++++++++++
 gcc/tree-ssa-loop-ivopts.c                 | 10 ++++++++++
 2 files changed, 26 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr61837.c

diff --git a/gcc/testsuite/gcc.target/powerpc/pr61837.c b/gcc/testsuite/gcc.target/powerpc/pr61837.c
new file mode 100644
index 00000000000..dc44eb9cb41
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr61837.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+void foo(int *p1, long *p2, int s)
+{
+  int n, v, i;
+
+  v = 0;
+  for (n = 0; n <= 100; n++) {
+     for (i = 0; i < s; i++)
+        if (p2[i] == n)
+           p1[i] = v;
+     v += 88;
+  }
+}
+
+/* { dg-final { scan-assembler-not {\mrldicl\M} } } */
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 12a8a49a307..c3c2f97918d 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -5690,6 +5690,16 @@ add_iv_candidate_for_doloop (struct ivopts_data *data)
 
   tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
 			   build_int_cst (ntype, 1));
+
+  /* Use type in word size may fast.  */
+  if (TYPE_PRECISION (ntype) < BITS_PER_WORD
+      && TYPE_PRECISION (long_unsigned_type_node) == BITS_PER_WORD
+      && wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype))))
+    {
+      ntype = long_unsigned_type_node;
+      base = fold_convert (ntype, base);
+    }
+
   add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, NULL, true);
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
  2021-07-09  2:07 [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837] Jiufu Guo
@ 2021-07-09  6:43 ` Richard Biener
  2021-07-09 17:08   ` Segher Boessenkool
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Biener @ 2021-07-09  6:43 UTC (permalink / raw)
  To: Jiufu Guo; +Cc: gcc-patches, wschmidt, segher, dje.gcc, jlaw, amker.cheng

On Fri, 9 Jul 2021, Jiufu Guo wrote:

> Currently, doloop.xx variable is using the type as niter which may shorter
> than word size.  For some cases, it may be better to use word size type.
> For example, on some 64bit system, to access 32bit niter, subreg maybe used.
> Then using 64bit type would not need to use subreg if the value can be
> present in both 32bit and 64bit.
> 
> This patch updates doloop iv to BIT_PER_WORD size if it is fine.
> 
> Bootstrap and regtest pass on powerpc64le and x86, is this ok for trunk?
> 
> BR.
> Jiufu
> 
> gcc/ChangeLog:
> 
> 2021-07-08  Jiufu Guo  <guojiufu@linux.ibm.com>
> 
> 	PR target/61837
> 	* tree-ssa-loop-ivopts.c (add_iv_candidate_for_doloop):
> 	Update iv on BITS_PER_WORD for niter.
> 
> gcc/testsuite/ChangeLog:
> 
> 2021-07-08  Jiufu Guo  <guojiufu@linux.ibm.com>
> 
> 	PR target/61837
> 	* gcc.target/powerpc/pr61837.c: New test.
> 
> ---
>  gcc/testsuite/gcc.target/powerpc/pr61837.c | 16 ++++++++++++++++
>  gcc/tree-ssa-loop-ivopts.c                 | 10 ++++++++++
>  2 files changed, 26 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr61837.c
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr61837.c b/gcc/testsuite/gcc.target/powerpc/pr61837.c
> new file mode 100644
> index 00000000000..dc44eb9cb41
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr61837.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +void foo(int *p1, long *p2, int s)
> +{
> +  int n, v, i;
> +
> +  v = 0;
> +  for (n = 0; n <= 100; n++) {
> +     for (i = 0; i < s; i++)
> +        if (p2[i] == n)
> +           p1[i] = v;
> +     v += 88;
> +  }
> +}
> +
> +/* { dg-final { scan-assembler-not {\mrldicl\M} } } */
> diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
> index 12a8a49a307..c3c2f97918d 100644
> --- a/gcc/tree-ssa-loop-ivopts.c
> +++ b/gcc/tree-ssa-loop-ivopts.c
> @@ -5690,6 +5690,16 @@ add_iv_candidate_for_doloop (struct ivopts_data *data)
>  
>    tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
>  			   build_int_cst (ntype, 1));
> +
> +  /* Use type in word size may fast.  */
> +  if (TYPE_PRECISION (ntype) < BITS_PER_WORD
> +      && TYPE_PRECISION (long_unsigned_type_node) == BITS_PER_WORD
> +      && wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype))))

I wonder if there's a way to query the target what modes the doloop
pattern can handle (not being too familiar with the doloop code).

Why do you need to do any checks besides the new type being able to
represent all IV values?  The original doloop IV will never wrap
(OTOH if niter is U*_MAX then we compute niter + 1 which will become
zero ... I suppose the doloop might still do the correct thing here
but it also still will with a IV with larger type).

I'd have expected sth like

   ntype = lang_hooks.types.type_for_mode (word_mode, TYPE_UNSIGNED 
(ntype));

thus the decision made using a mode - which is also why I wonder
if there's a way to query the target for this.  As you say,
it _may_ be fast, so better check (somehow).

> +    {
> +      ntype = long_unsigned_type_node;
> +      base = fold_convert (ntype, base);
> +    }
> +
>    add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, NULL, true);
>  }
>  
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
  2021-07-09  6:43 ` Richard Biener
@ 2021-07-09 17:08   ` Segher Boessenkool
  2021-07-12  6:20     ` Richard Biener
  0 siblings, 1 reply; 20+ messages in thread
From: Segher Boessenkool @ 2021-07-09 17:08 UTC (permalink / raw)
  To: Richard Biener
  Cc: Jiufu Guo, gcc-patches, wschmidt, dje.gcc, jlaw, amker.cheng

On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
> I wonder if there's a way to query the target what modes the doloop
> pattern can handle (not being too familiar with the doloop code).

You can look what modes are allowed for operand 0 of doloop_end,
perhaps?  Although that is a define_expand, not a define_insn, so it is
hard to introspect.

> Why do you need to do any checks besides the new type being able to
> represent all IV values?  The original doloop IV will never wrap
> (OTOH if niter is U*_MAX then we compute niter + 1 which will become
> zero ... I suppose the doloop might still do the correct thing here
> but it also still will with a IV with larger type).

doloop_valid_p guarantees it is simple and doesn't wrap.

> I'd have expected sth like
> 
>    ntype = lang_hooks.types.type_for_mode (word_mode, TYPE_UNSIGNED 
> (ntype));
> 
> thus the decision made using a mode - which is also why I wonder
> if there's a way to query the target for this.  As you say,
> it _may_ be fast, so better check (somehow).

Almost all targets just use Pmode, but there is no such guarantee I
think, and esp. some targets that do not have machine insns for this
(but want to generate different code for this anyway) can do pretty much
anything.

Maybe using just Pmode here is good enough though?


Segher

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
  2021-07-09 17:08   ` Segher Boessenkool
@ 2021-07-12  6:20     ` Richard Biener
  2021-07-12  8:21       ` guojiufu
  2021-07-13 15:38       ` Segher Boessenkool
  0 siblings, 2 replies; 20+ messages in thread
From: Richard Biener @ 2021-07-12  6:20 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Jiufu Guo, gcc-patches, wschmidt, dje.gcc, jlaw, amker.cheng

On Fri, 9 Jul 2021, Segher Boessenkool wrote:

> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
> > I wonder if there's a way to query the target what modes the doloop
> > pattern can handle (not being too familiar with the doloop code).
> 
> You can look what modes are allowed for operand 0 of doloop_end,
> perhaps?  Although that is a define_expand, not a define_insn, so it is
> hard to introspect.
> 
> > Why do you need to do any checks besides the new type being able to
> > represent all IV values?  The original doloop IV will never wrap
> > (OTOH if niter is U*_MAX then we compute niter + 1 which will become
> > zero ... I suppose the doloop might still do the correct thing here
> > but it also still will with a IV with larger type).
> 
> doloop_valid_p guarantees it is simple and doesn't wrap.
> 
> > I'd have expected sth like
> > 
> >    ntype = lang_hooks.types.type_for_mode (word_mode, TYPE_UNSIGNED 
> > (ntype));
> > 
> > thus the decision made using a mode - which is also why I wonder
> > if there's a way to query the target for this.  As you say,
> > it _may_ be fast, so better check (somehow).
> 
> Almost all targets just use Pmode, but there is no such guarantee I
> think, and esp. some targets that do not have machine insns for this
> (but want to generate different code for this anyway) can do pretty much
> anything.
> 
> Maybe using just Pmode here is good enough though?

I think Pmode is a particularly bad choice and I'd prefer word_mode
if we go for any hardcoded mode.  s390x for example seems to handle
both SImode and DImode (but names the helper gen_doloop_si64
for SImode?!).  But indeed it looks like somehow querying doloop_end
is going to be difficult since the expander doesn't have any mode,
so we'd have to actually try emit RTL here.

Richard.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
  2021-07-12  6:20     ` Richard Biener
@ 2021-07-12  8:21       ` guojiufu
  2021-07-12  8:57         ` Richard Biener
  2021-07-13 15:38       ` Segher Boessenkool
  1 sibling, 1 reply; 20+ messages in thread
From: guojiufu @ 2021-07-12  8:21 UTC (permalink / raw)
  To: Richard Biener
  Cc: Segher Boessenkool, gcc-patches, wschmidt, dje.gcc, jlaw, amker.cheng

On 2021-07-12 14:20, Richard Biener wrote:
> On Fri, 9 Jul 2021, Segher Boessenkool wrote:
> 
>> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
>> > I wonder if there's a way to query the target what modes the doloop
>> > pattern can handle (not being too familiar with the doloop code).
>> 
>> You can look what modes are allowed for operand 0 of doloop_end,
>> perhaps?  Although that is a define_expand, not a define_insn, so it 
>> is
>> hard to introspect.
>> 
>> > Why do you need to do any checks besides the new type being able to
>> > represent all IV values?  The original doloop IV will never wrap
>> > (OTOH if niter is U*_MAX then we compute niter + 1 which will become
>> > zero ... I suppose the doloop might still do the correct thing here
>> > but it also still will with a IV with larger type).

The issue comes from U*_MAX (original short MAX), as you said: on which
niter + 1 becomes zero.  And because the step for doloop is -1; then, on
larger type 'zero - 1' will be a very large number on larger type
(e.g. 0xff...ff); but on the original short type 'zero - 1' is a small 
value
(e.g. "0xff").

>> 
>> doloop_valid_p guarantees it is simple and doesn't wrap.
>> 
>> > I'd have expected sth like
>> >
>> >    ntype = lang_hooks.types.type_for_mode (word_mode, TYPE_UNSIGNED
>> > (ntype));
>> >
>> > thus the decision made using a mode - which is also why I wonder
>> > if there's a way to query the target for this.  As you say,
>> > it _may_ be fast, so better check (somehow).


I was also thinking of using hooks like type_for_size/type_for_mode.
     /* Use type in word size may fast.  */
     if (TYPE_PRECISION (ntype) < BITS_PER_WORD
         && Wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE 
(ntype))))
       {
         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
         base = fold_convert (ntype, base);
       }

As you pointed out, this does not query the mode from targets.
As Segher pointed out "doloop_end" checks unsupported mode, while it 
seems
not easy to use it in tree-ssa-loop-ivopts.c.
For implementations of doloop_end, tartgets like rs6000/aarch64/ia64 
requires
Pmode/DImode; while there are other targets that work on other 'mode' 
(e.g. SI).


In doloop_optimize, there is code:

```
     mode = desc->mode;
.....
     doloop_reg = gen_reg_rtx (mode);
     rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, 
start_label);

     word_mode_size = GET_MODE_PRECISION (word_mode);
     word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;
     if (! doloop_seq
         && mode != word_mode
         /* Before trying mode different from the one in that # of 
iterations is
            computed, we must be sure that the number of iterations fits 
into
            the new mode.  */
         && (word_mode_size >= GET_MODE_PRECISION (mode)
             || wi::leu_p (iterations_max, word_mode_max)))
       {
         if (word_mode_size > GET_MODE_PRECISION (mode))
           count = simplify_gen_unary (ZERO_EXTEND, word_mode, count, 
mode);
         else
           count = lowpart_subreg (word_mode, count, mode);
         PUT_MODE (doloop_reg, word_mode);
         doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
       }
     if (! doloop_seq)
       {
         if (dump_file)
           fprintf (dump_file,
                    "Doloop: Target unwilling to use doloop pattern!\n");
         return false;
       }
```
The above code first tries the mode of niter_desc by call 
targetm.gen_doloop_end
to see if the target can generate doloop insns, if fail, then try to use
'word_mode' against gen_doloop_end.


>> 
>> Almost all targets just use Pmode, but there is no such guarantee I
>> think, and esp. some targets that do not have machine insns for this
>> (but want to generate different code for this anyway) can do pretty 
>> much
>> anything.
>> 
>> Maybe using just Pmode here is good enough though?
> 
> I think Pmode is a particularly bad choice and I'd prefer word_mode
> if we go for any hardcoded mode.  s390x for example seems to handle
> both SImode and DImode (but names the helper gen_doloop_si64
> for SImode?!).  But indeed it looks like somehow querying doloop_end
> is going to be difficult since the expander doesn't have any mode,
> so we'd have to actually try emit RTL here.

Instead of using hardcode mode, maybe we could add a hook for targets to 
return
the preferred mode.


Thanks for those valuable comments!

Jiufu Guo



> 
> Richard.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
  2021-07-12  8:21       ` guojiufu
@ 2021-07-12  8:57         ` Richard Biener
  2021-07-12  9:43           ` guojiufu
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Biener @ 2021-07-12  8:57 UTC (permalink / raw)
  To: guojiufu
  Cc: Segher Boessenkool, gcc-patches, wschmidt, dje.gcc, jlaw, amker.cheng

On Mon, 12 Jul 2021, guojiufu wrote:

> On 2021-07-12 14:20, Richard Biener wrote:
> > On Fri, 9 Jul 2021, Segher Boessenkool wrote:
> > 
> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
> >> > I wonder if there's a way to query the target what modes the doloop
> >> > pattern can handle (not being too familiar with the doloop code).
> >> 
> >> You can look what modes are allowed for operand 0 of doloop_end,
> >> perhaps?  Although that is a define_expand, not a define_insn, so it is
> >> hard to introspect.
> >> 
> >> > Why do you need to do any checks besides the new type being able to
> >> > represent all IV values?  The original doloop IV will never wrap
> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will become
> >> > zero ... I suppose the doloop might still do the correct thing here
> >> > but it also still will with a IV with larger type).
> 
> The issue comes from U*_MAX (original short MAX), as you said: on which
> niter + 1 becomes zero.  And because the step for doloop is -1; then, on
> larger type 'zero - 1' will be a very large number on larger type
> (e.g. 0xff...ff); but on the original short type 'zero - 1' is a small value
> (e.g. "0xff").

But for the larger type the small type MAX + 1 fits and does not yield
zero so it should still work exactly as before, no?  Of course you
have to compute the + 1 in the larger type.

> >> 
> >> doloop_valid_p guarantees it is simple and doesn't wrap.
> >> 
> >> > I'd have expected sth like
> >> >
> >> >    ntype = lang_hooks.types.type_for_mode (word_mode, TYPE_UNSIGNED
> >> > (ntype));
> >> >
> >> > thus the decision made using a mode - which is also why I wonder
> >> > if there's a way to query the target for this.  As you say,
> >> > it _may_ be fast, so better check (somehow).
> 
> 
> I was also thinking of using hooks like type_for_size/type_for_mode.
>     /* Use type in word size may fast.  */
>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD
>         && Wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE
> (ntype))))
>       {
>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>         base = fold_convert (ntype, base);
>       }
> 
> As you pointed out, this does not query the mode from targets.
> As Segher pointed out "doloop_end" checks unsupported mode, while it seems
> not easy to use it in tree-ssa-loop-ivopts.c.
> For implementations of doloop_end, tartgets like rs6000/aarch64/ia64 requires
> Pmode/DImode; while there are other targets that work on other 'mode' (e.g.
> SI).
> 
> 
> In doloop_optimize, there is code:
> 
> ```
>     mode = desc->mode;
> .....
>     doloop_reg = gen_reg_rtx (mode);
>     rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, 
> start_label);
> 
>     word_mode_size = GET_MODE_PRECISION (word_mode);
>     word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;
>     if (! doloop_seq
>         && mode != word_mode
>         /* Before trying mode different from the one in that # of 
> iterations is
>            computed, we must be sure that the number of iterations fits into
>            the new mode.  */
>         && (word_mode_size >= GET_MODE_PRECISION (mode)
>             || wi::leu_p (iterations_max, word_mode_max)))
>       {
>         if (word_mode_size > GET_MODE_PRECISION (mode))
>           count = simplify_gen_unary (ZERO_EXTEND, word_mode, count, mode);
>         else
>           count = lowpart_subreg (word_mode, count, mode);
>         PUT_MODE (doloop_reg, word_mode);
>         doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
>       }
>     if (! doloop_seq)
>       {
>         if (dump_file)
>           fprintf (dump_file,
>                    "Doloop: Target unwilling to use doloop pattern!\n");
>         return false;
>       }
> ```
> The above code first tries the mode of niter_desc by call
> targetm.gen_doloop_end
> to see if the target can generate doloop insns, if fail, then try to use
> 'word_mode' against gen_doloop_end.
> 
> 
> >> 
> >> Almost all targets just use Pmode, but there is no such guarantee I
> >> think, and esp. some targets that do not have machine insns for this
> >> (but want to generate different code for this anyway) can do pretty much
> >> anything.
> >> 
> >> Maybe using just Pmode here is good enough though?
> > 
> > I think Pmode is a particularly bad choice and I'd prefer word_mode
> > if we go for any hardcoded mode.  s390x for example seems to handle
> > both SImode and DImode (but names the helper gen_doloop_si64
> > for SImode?!).  But indeed it looks like somehow querying doloop_end
> > is going to be difficult since the expander doesn't have any mode,
> > so we'd have to actually try emit RTL here.
> 
> Instead of using hardcode mode, maybe we could add a hook for targets to
> return
> the preferred mode.

That's a possiblity of course.  Like the following which just shows the
default implementation then (pass in current mode, return a more preferred
mode or the mode itself)

enum machine_mode
prefered_doloop_mode (enum machine_mode mode)
{
  return mode;
}

> 
> Thanks for those valuable comments!
> 
> Jiufu Guo
> 
> 
> 
> > 
> > Richard.
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
  2021-07-12  8:57         ` Richard Biener
@ 2021-07-12  9:43           ` guojiufu
  2021-07-12 10:02             ` Richard Biener
  0 siblings, 1 reply; 20+ messages in thread
From: guojiufu @ 2021-07-12  9:43 UTC (permalink / raw)
  To: Richard Biener
  Cc: Segher Boessenkool, gcc-patches, wschmidt, dje.gcc, jlaw, amker.cheng

On 2021-07-12 16:57, Richard Biener wrote:
> On Mon, 12 Jul 2021, guojiufu wrote:
> 
>> On 2021-07-12 14:20, Richard Biener wrote:
>> > On Fri, 9 Jul 2021, Segher Boessenkool wrote:
>> >
>> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
>> >> > I wonder if there's a way to query the target what modes the doloop
>> >> > pattern can handle (not being too familiar with the doloop code).
>> >>
>> >> You can look what modes are allowed for operand 0 of doloop_end,
>> >> perhaps?  Although that is a define_expand, not a define_insn, so it is
>> >> hard to introspect.
>> >>
>> >> > Why do you need to do any checks besides the new type being able to
>> >> > represent all IV values?  The original doloop IV will never wrap
>> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will become
>> >> > zero ... I suppose the doloop might still do the correct thing here
>> >> > but it also still will with a IV with larger type).
>> 
>> The issue comes from U*_MAX (original short MAX), as you said: on 
>> which
>> niter + 1 becomes zero.  And because the step for doloop is -1; then, 
>> on
>> larger type 'zero - 1' will be a very large number on larger type
>> (e.g. 0xff...ff); but on the original short type 'zero - 1' is a small 
>> value
>> (e.g. "0xff").
> 
> But for the larger type the small type MAX + 1 fits and does not yield
> zero so it should still work exactly as before, no?  Of course you
> have to compute the + 1 in the larger type.
> 
You are right, if compute the "+ 1" in the larger type it is ok, as 
below code:
```
    /* Use type in word size may fast.  */
     if (TYPE_PRECISION (ntype) < BITS_PER_WORD)
       {
         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
         niter = fold_convert (ntype, niter);
       }

     tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
                              build_int_cst (ntype, 1));


     add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, 
NULL, true);
```
The issue of this is, this code generates more stmt for doloop.xxx:
   _12 = (unsigned int) xx(D);
   _10 = _12 + 4294967295;
   _24 = (long unsigned int) _10;
   doloop.6_8 = _24 + 1;

if use previous patch, "+ 1" on original type, then the stmts will looks 
like:
   _12 = (unsigned int) xx(D);
   doloop.6_8 = (long unsigned int) _12;

This is the reason for checking
    wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype)))

>> >>
>> >> doloop_valid_p guarantees it is simple and doesn't wrap.
>> >>
>> >> > I'd have expected sth like
>> >> >
>> >> >    ntype = lang_hooks.types.type_for_mode (word_mode, TYPE_UNSIGNED
>> >> > (ntype));
>> >> >
>> >> > thus the decision made using a mode - which is also why I wonder
>> >> > if there's a way to query the target for this.  As you say,
>> >> > it _may_ be fast, so better check (somehow).
>> 
>> 
>> I was also thinking of using hooks like type_for_size/type_for_mode.
>>     /* Use type in word size may fast.  */
>>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD
>>         && Wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE
>> (ntype))))
>>       {
>>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>>         base = fold_convert (ntype, base);
>>       }
>> 
>> As you pointed out, this does not query the mode from targets.
>> As Segher pointed out "doloop_end" checks unsupported mode, while it 
>> seems
>> not easy to use it in tree-ssa-loop-ivopts.c.
>> For implementations of doloop_end, tartgets like rs6000/aarch64/ia64 
>> requires
>> Pmode/DImode; while there are other targets that work on other 'mode' 
>> (e.g.
>> SI).
>> 
>> 
>> In doloop_optimize, there is code:
>> 
>> ```
>>     mode = desc->mode;
>> .....
>>     doloop_reg = gen_reg_rtx (mode);
>>     rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg,
>> start_label);
>> 
>>     word_mode_size = GET_MODE_PRECISION (word_mode);
>>     word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 
>> 1;
>>     if (! doloop_seq
>>         && mode != word_mode
>>         /* Before trying mode different from the one in that # of
>> iterations is
>>            computed, we must be sure that the number of iterations 
>> fits into
>>            the new mode.  */
>>         && (word_mode_size >= GET_MODE_PRECISION (mode)
>>             || wi::leu_p (iterations_max, word_mode_max)))
>>       {
>>         if (word_mode_size > GET_MODE_PRECISION (mode))
>>           count = simplify_gen_unary (ZERO_EXTEND, word_mode, count, 
>> mode);
>>         else
>>           count = lowpart_subreg (word_mode, count, mode);
>>         PUT_MODE (doloop_reg, word_mode);
>>         doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
>>       }
>>     if (! doloop_seq)
>>       {
>>         if (dump_file)
>>           fprintf (dump_file,
>>                    "Doloop: Target unwilling to use doloop 
>> pattern!\n");
>>         return false;
>>       }
>> ```
>> The above code first tries the mode of niter_desc by call
>> targetm.gen_doloop_end
>> to see if the target can generate doloop insns, if fail, then try to 
>> use
>> 'word_mode' against gen_doloop_end.
>> 
>> 
>> >>
>> >> Almost all targets just use Pmode, but there is no such guarantee I
>> >> think, and esp. some targets that do not have machine insns for this
>> >> (but want to generate different code for this anyway) can do pretty much
>> >> anything.
>> >>
>> >> Maybe using just Pmode here is good enough though?
>> >
>> > I think Pmode is a particularly bad choice and I'd prefer word_mode
>> > if we go for any hardcoded mode.  s390x for example seems to handle
>> > both SImode and DImode (but names the helper gen_doloop_si64
>> > for SImode?!).  But indeed it looks like somehow querying doloop_end
>> > is going to be difficult since the expander doesn't have any mode,
>> > so we'd have to actually try emit RTL here.
>> 
>> Instead of using hardcode mode, maybe we could add a hook for targets 
>> to
>> return
>> the preferred mode.
> 
> That's a possiblity of course.  Like the following which just shows the
> default implementation then (pass in current mode, return a more 
> preferred
> mode or the mode itself)
> 
> enum machine_mode
> prefered_doloop_mode (enum machine_mode mode)
> {
>   return mode;
> }
> 
Yes, thanks!

Checking current do_loop_end in targets, in general, when do_loop_end 
requires
SI mode, the target is defining Pmode as SImode and word_mode (from 
BITS_PER_WORD
which defaults from UNITS_PER_WORD) is also defined to align with SI 
mode.
When do_loop_end requires DI mode, the target is defining Pmode as 
DImode
and word_mode/UNITS_PER_WORD is also defined to align with DI mode.

So, if aggressively, then by default we may just return word_mode.

BR,

Jiufu Guo.


>> 
>> Thanks for those valuable comments!
>> 
>> Jiufu Guo
>> 
>> 
>> 
>> >
>> > Richard.
>> 
>> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
  2021-07-12  9:43           ` guojiufu
@ 2021-07-12 10:02             ` Richard Biener
  2021-07-12 14:08               ` guojiufu
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Biener @ 2021-07-12 10:02 UTC (permalink / raw)
  To: guojiufu
  Cc: Segher Boessenkool, gcc-patches, wschmidt, dje.gcc, jlaw, amker.cheng

On Mon, 12 Jul 2021, guojiufu wrote:

> On 2021-07-12 16:57, Richard Biener wrote:
> > On Mon, 12 Jul 2021, guojiufu wrote:
> > 
> >> On 2021-07-12 14:20, Richard Biener wrote:
> >> > On Fri, 9 Jul 2021, Segher Boessenkool wrote:
> >> >
> >> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
> >> >> > I wonder if there's a way to query the target what modes the doloop
> >> >> > pattern can handle (not being too familiar with the doloop code).
> >> >>
> >> >> You can look what modes are allowed for operand 0 of doloop_end,
> >> >> perhaps?  Although that is a define_expand, not a define_insn, so it is
> >> >> hard to introspect.
> >> >>
> >> >> > Why do you need to do any checks besides the new type being able to
> >> >> > represent all IV values?  The original doloop IV will never wrap
> >> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will become
> >> >> > zero ... I suppose the doloop might still do the correct thing here
> >> >> > but it also still will with a IV with larger type).
> >> 
> >> The issue comes from U*_MAX (original short MAX), as you said: on which
> >> niter + 1 becomes zero.  And because the step for doloop is -1; then, on
> >> larger type 'zero - 1' will be a very large number on larger type
> >> (e.g. 0xff...ff); but on the original short type 'zero - 1' is a small
> >> value
> >> (e.g. "0xff").
> > 
> > But for the larger type the small type MAX + 1 fits and does not yield
> > zero so it should still work exactly as before, no?  Of course you
> > have to compute the + 1 in the larger type.
> > 
> You are right, if compute the "+ 1" in the larger type it is ok, as below
> code:
> ```
>    /* Use type in word size may fast.  */
>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD)
>       {
>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>         niter = fold_convert (ntype, niter);
>       }
> 
>     tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
>                              build_int_cst (ntype, 1));
> 
> 
>     add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, NULL,
> true);
> ```
> The issue of this is, this code generates more stmt for doloop.xxx:
>   _12 = (unsigned int) xx(D);
>   _10 = _12 + 4294967295;
>   _24 = (long unsigned int) _10;
>   doloop.6_8 = _24 + 1;
> 
> if use previous patch, "+ 1" on original type, then the stmts will looks like:
>   _12 = (unsigned int) xx(D);
>   doloop.6_8 = (long unsigned int) _12;
> 
> This is the reason for checking
>    wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype)))

But this then only works when there's an upper bound on the number
of iterations.  Note you should not use TYPE_MAX_VALUE here but
you can instead use

     wi::ltu_p (niter_desc->max, wi::to_widest (wi::max_value 
(TYPE_PRECISION (ntype), TYPE_SIGN (ntype))));

I think the -1 above comes from number of latch iterations vs. header
entries - it's a common source for this kind of issues.  range analysis
might be able to prove that we can still merge the two adds even with
the intermediate extension.

Is this pre-loop extra add really offsetting the in-loop doloop
improvements?

> >> >>
> >> >> doloop_valid_p guarantees it is simple and doesn't wrap.
> >> >>
> >> >> > I'd have expected sth like
> >> >> >
> >> >> >    ntype = lang_hooks.types.type_for_mode (word_mode, TYPE_UNSIGNED
> >> >> > (ntype));
> >> >> >
> >> >> > thus the decision made using a mode - which is also why I wonder
> >> >> > if there's a way to query the target for this.  As you say,
> >> >> > it _may_ be fast, so better check (somehow).
> >> 
> >> 
> >> I was also thinking of using hooks like type_for_size/type_for_mode.
> >>     /* Use type in word size may fast.  */
> >>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD
> >>         && Wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE
> >> (ntype))))
> >>       {
> >>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
> >>         base = fold_convert (ntype, base);
> >>       }
> >> 
> >> As you pointed out, this does not query the mode from targets.
> >> As Segher pointed out "doloop_end" checks unsupported mode, while it seems
> >> not easy to use it in tree-ssa-loop-ivopts.c.
> >> For implementations of doloop_end, tartgets like rs6000/aarch64/ia64
> >> requires
> >> Pmode/DImode; while there are other targets that work on other 'mode' (e.g.
> >> SI).
> >> 
> >> 
> >> In doloop_optimize, there is code:
> >> 
> >> ```
> >>     mode = desc->mode;
> >> .....
> >>     doloop_reg = gen_reg_rtx (mode);
> >>     rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg,
> >> start_label);
> >> 
> >>     word_mode_size = GET_MODE_PRECISION (word_mode);
> >>     word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 
> >> 1;
> >>     if (! doloop_seq
> >>         && mode != word_mode
> >>         /* Before trying mode different from the one in that # of
> >> iterations is
> >>            computed, we must be sure that the number of iterations fits
> >> into
> >>            the new mode.  */
> >>         && (word_mode_size >= GET_MODE_PRECISION (mode)
> >>             || wi::leu_p (iterations_max, word_mode_max)))
> >>       {
> >>         if (word_mode_size > GET_MODE_PRECISION (mode))
> >>           count = simplify_gen_unary (ZERO_EXTEND, word_mode, count, mode);
> >>         else
> >>           count = lowpart_subreg (word_mode, count, mode);
> >>         PUT_MODE (doloop_reg, word_mode);
> >>         doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
> >>       }
> >>     if (! doloop_seq)
> >>       {
> >>         if (dump_file)
> >>           fprintf (dump_file,
> >>                    "Doloop: Target unwilling to use doloop pattern!\n");
> >>         return false;
> >>       }
> >> ```
> >> The above code first tries the mode of niter_desc by call
> >> targetm.gen_doloop_end
> >> to see if the target can generate doloop insns, if fail, then try to use
> >> 'word_mode' against gen_doloop_end.
> >> 
> >> 
> >> >>
> >> >> Almost all targets just use Pmode, but there is no such guarantee I
> >> >> think, and esp. some targets that do not have machine insns for this
> >> >> (but want to generate different code for this anyway) can do pretty much
> >> >> anything.
> >> >>
> >> >> Maybe using just Pmode here is good enough though?
> >> >
> >> > I think Pmode is a particularly bad choice and I'd prefer word_mode
> >> > if we go for any hardcoded mode.  s390x for example seems to handle
> >> > both SImode and DImode (but names the helper gen_doloop_si64
> >> > for SImode?!).  But indeed it looks like somehow querying doloop_end
> >> > is going to be difficult since the expander doesn't have any mode,
> >> > so we'd have to actually try emit RTL here.
> >> 
> >> Instead of using hardcode mode, maybe we could add a hook for targets to
> >> return
> >> the preferred mode.
> > 
> > That's a possiblity of course.  Like the following which just shows the
> > default implementation then (pass in current mode, return a more preferred
> > mode or the mode itself)
> > 
> > enum machine_mode
> > prefered_doloop_mode (enum machine_mode mode)
> > {
> >   return mode;
> > }
> > 
> Yes, thanks!
> 
> Checking current do_loop_end in targets, in general, when do_loop_end requires
> SI mode, the target is defining Pmode as SImode and word_mode (from
> BITS_PER_WORD
> which defaults from UNITS_PER_WORD) is also defined to align with SI mode.
> When do_loop_end requires DI mode, the target is defining Pmode as DImode
> and word_mode/UNITS_PER_WORD is also defined to align with DI mode.
> 
> So, if aggressively, then by default we may just return word_mode.

Note we still have to check whether the prefered mode is valid
(passing in TImode but then returning DImode would be wrong).

Richard.

> BR,
> 
> Jiufu Guo.
> 
> 
> >> 
> >> Thanks for those valuable comments!
> >> 
> >> Jiufu Guo
> >> 
> >> 
> >> 
> >> >
> >> > Richard.
> >> 
> >> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
  2021-07-12 10:02             ` Richard Biener
@ 2021-07-12 14:08               ` guojiufu
  2021-07-12 14:46                 ` Richard Biener
  0 siblings, 1 reply; 20+ messages in thread
From: guojiufu @ 2021-07-12 14:08 UTC (permalink / raw)
  To: Richard Biener
  Cc: Segher Boessenkool, gcc-patches, wschmidt, dje.gcc, jlaw, amker.cheng

On 2021-07-12 18:02, Richard Biener wrote:
> On Mon, 12 Jul 2021, guojiufu wrote:
> 
>> On 2021-07-12 16:57, Richard Biener wrote:
>> > On Mon, 12 Jul 2021, guojiufu wrote:
>> >
>> >> On 2021-07-12 14:20, Richard Biener wrote:
>> >> > On Fri, 9 Jul 2021, Segher Boessenkool wrote:
>> >> >
>> >> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
>> >> >> > I wonder if there's a way to query the target what modes the doloop
>> >> >> > pattern can handle (not being too familiar with the doloop code).
>> >> >>
>> >> >> You can look what modes are allowed for operand 0 of doloop_end,
>> >> >> perhaps?  Although that is a define_expand, not a define_insn, so it is
>> >> >> hard to introspect.
>> >> >>
>> >> >> > Why do you need to do any checks besides the new type being able to
>> >> >> > represent all IV values?  The original doloop IV will never wrap
>> >> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will become
>> >> >> > zero ... I suppose the doloop might still do the correct thing here
>> >> >> > but it also still will with a IV with larger type).
>> >>
>> >> The issue comes from U*_MAX (original short MAX), as you said: on which
>> >> niter + 1 becomes zero.  And because the step for doloop is -1; then, on
>> >> larger type 'zero - 1' will be a very large number on larger type
>> >> (e.g. 0xff...ff); but on the original short type 'zero - 1' is a small
>> >> value
>> >> (e.g. "0xff").
>> >
>> > But for the larger type the small type MAX + 1 fits and does not yield
>> > zero so it should still work exactly as before, no?  Of course you
>> > have to compute the + 1 in the larger type.
>> >
>> You are right, if compute the "+ 1" in the larger type it is ok, as 
>> below
>> code:
>> ```
>>    /* Use type in word size may fast.  */
>>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD)
>>       {
>>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>>         niter = fold_convert (ntype, niter);
>>       }
>> 
>>     tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
>>                              build_int_cst (ntype, 1));
>> 
>> 
>>     add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, 
>> NULL,
>> true);
>> ```
>> The issue of this is, this code generates more stmt for doloop.xxx:
>>   _12 = (unsigned int) xx(D);
>>   _10 = _12 + 4294967295;
>>   _24 = (long unsigned int) _10;
>>   doloop.6_8 = _24 + 1;
>> 
>> if use previous patch, "+ 1" on original type, then the stmts will 
>> looks like:
>>   _12 = (unsigned int) xx(D);
>>   doloop.6_8 = (long unsigned int) _12;
>> 
>> This is the reason for checking
>>    wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype)))
> 
> But this then only works when there's an upper bound on the number
> of iterations.  Note you should not use TYPE_MAX_VALUE here but
> you can instead use
> 
>      wi::ltu_p (niter_desc->max, wi::to_widest (wi::max_value
> (TYPE_PRECISION (ntype), TYPE_SIGN (ntype))));

Ok, Thanks!
I remember you mentioned that:
widest_int::from (wi::max_value (TYPE_PRECISION (ntype), TYPE_SIGN 
(ntype)), TYPE_SIGN (ntype))
would be better than
wi::to_widest (TYPE_MAX_VALUE (ntype)).

It seems that:
"TYPE_MAX_VALUE (ntype)" is "NUMERICAL_TYPE_CHECK 
(NODE)->type_non_common.maxval"
which do a numerical-check and return the field of maxval.  And then 
call to
wi::to_widest

The other code "widest_int::from (wi::max_value (..,..),..)" calls 
wi::max_value
and widest_int::from.

I'm wondering if wi::to_widest (TYPE_MAX_VALUE (ntype)) is cheaper?

> 
> I think the -1 above comes from number of latch iterations vs. header
> entries - it's a common source for this kind of issues.  range analysis
> might be able to prove that we can still merge the two adds even with
> the intermediate extension.
Yes, as you mentioned here, it relates to number of latch iterations
For loop looks like : while (l < n) or for (i = 0; i < n; i++)
This kind of loop, the niter is used to be 'n - 1' after transformed
into 'do-while' form.
I would see how to merge these two adds safely at this point
when generating doloop iv. (maybe range info, thanks!

> 
> Is this pre-loop extra add really offsetting the in-loop doloop
> improvements?
I'm not catching this question too much, sorry.  I guess your concern
is if the "+1" is an offset: it may not, "+1" may be just that doloop.xx
is decreasing niter until 0 (all number >0).
If misunderstand,  thanks for point out.

> 
>> >> >>
>> >> >> doloop_valid_p guarantees it is simple and doesn't wrap.
>> >> >>
>> >> >> > I'd have expected sth like
>> >> >> >
>> >> >> >    ntype = lang_hooks.types.type_for_mode (word_mode, TYPE_UNSIGNED
>> >> >> > (ntype));
>> >> >> >
>> >> >> > thus the decision made using a mode - which is also why I wonder
>> >> >> > if there's a way to query the target for this.  As you say,
>> >> >> > it _may_ be fast, so better check (somehow).
>> >>
>> >>
>> >> I was also thinking of using hooks like type_for_size/type_for_mode.
>> >>     /* Use type in word size may fast.  */
>> >>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD
>> >>         && Wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE
>> >> (ntype))))
>> >>       {
>> >>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>> >>         base = fold_convert (ntype, base);
>> >>       }
>> >>
>> >> As you pointed out, this does not query the mode from targets.
>> >> As Segher pointed out "doloop_end" checks unsupported mode, while it seems
>> >> not easy to use it in tree-ssa-loop-ivopts.c.
>> >> For implementations of doloop_end, tartgets like rs6000/aarch64/ia64
>> >> requires
>> >> Pmode/DImode; while there are other targets that work on other 'mode' (e.g.
>> >> SI).
>> >>
>> >>
>> >> In doloop_optimize, there is code:
>> >>
>> >> ```
>> >>     mode = desc->mode;
>> >> .....
>> >>     doloop_reg = gen_reg_rtx (mode);
>> >>     rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg,
>> >> start_label);
>> >>
>> >>     word_mode_size = GET_MODE_PRECISION (word_mode);
>> >>     word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) -
>> >> 1;
>> >>     if (! doloop_seq
>> >>         && mode != word_mode
>> >>         /* Before trying mode different from the one in that # of
>> >> iterations is
>> >>            computed, we must be sure that the number of iterations fits
>> >> into
>> >>            the new mode.  */
>> >>         && (word_mode_size >= GET_MODE_PRECISION (mode)
>> >>             || wi::leu_p (iterations_max, word_mode_max)))
>> >>       {
>> >>         if (word_mode_size > GET_MODE_PRECISION (mode))
>> >>           count = simplify_gen_unary (ZERO_EXTEND, word_mode, count, mode);
>> >>         else
>> >>           count = lowpart_subreg (word_mode, count, mode);
>> >>         PUT_MODE (doloop_reg, word_mode);
>> >>         doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
>> >>       }
>> >>     if (! doloop_seq)
>> >>       {
>> >>         if (dump_file)
>> >>           fprintf (dump_file,
>> >>                    "Doloop: Target unwilling to use doloop pattern!\n");
>> >>         return false;
>> >>       }
>> >> ```
>> >> The above code first tries the mode of niter_desc by call
>> >> targetm.gen_doloop_end
>> >> to see if the target can generate doloop insns, if fail, then try to use
>> >> 'word_mode' against gen_doloop_end.
>> >>
>> >>
>> >> >>
>> >> >> Almost all targets just use Pmode, but there is no such guarantee I
>> >> >> think, and esp. some targets that do not have machine insns for this
>> >> >> (but want to generate different code for this anyway) can do pretty much
>> >> >> anything.
>> >> >>
>> >> >> Maybe using just Pmode here is good enough though?
>> >> >
>> >> > I think Pmode is a particularly bad choice and I'd prefer word_mode
>> >> > if we go for any hardcoded mode.  s390x for example seems to handle
>> >> > both SImode and DImode (but names the helper gen_doloop_si64
>> >> > for SImode?!).  But indeed it looks like somehow querying doloop_end
>> >> > is going to be difficult since the expander doesn't have any mode,
>> >> > so we'd have to actually try emit RTL here.
>> >>
>> >> Instead of using hardcode mode, maybe we could add a hook for targets to
>> >> return
>> >> the preferred mode.
>> >
>> > That's a possiblity of course.  Like the following which just shows the
>> > default implementation then (pass in current mode, return a more preferred
>> > mode or the mode itself)
>> >
>> > enum machine_mode
>> > prefered_doloop_mode (enum machine_mode mode)
>> > {
>> >   return mode;
>> > }
>> >
>> Yes, thanks!
>> 
>> Checking current do_loop_end in targets, in general, when do_loop_end 
>> requires
>> SI mode, the target is defining Pmode as SImode and word_mode (from
>> BITS_PER_WORD
>> which defaults from UNITS_PER_WORD) is also defined to align with SI 
>> mode.
>> When do_loop_end requires DI mode, the target is defining Pmode as 
>> DImode
>> and word_mode/UNITS_PER_WORD is also defined to align with DI mode.
>> 
>> So, if aggressively, then by default we may just return word_mode.
> 
> Note we still have to check whether the prefered mode is valid
> (passing in TImode but then returning DImode would be wrong).
Yes, some code like in doloop_optimize (rtl code)
```
mode != word_mode
&& (word_mode_size >= GET_MODE_PRECISION (mode)
     || wi::leu_p (iterations_max, word_mode_max))
```

Thanks again for your comments!

BR,
Jiufu Guo
> 
> Richard.
> 
>> BR,
>> 
>> Jiufu Guo.
>> 
>> 
>> >>
>> >> Thanks for those valuable comments!
>> >>
>> >> Jiufu Guo
>> >>
>> >>
>> >>
>> >> >
>> >> > Richard.
>> >>
>> >>
>> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
  2021-07-12 14:08               ` guojiufu
@ 2021-07-12 14:46                 ` Richard Biener
  2021-07-12 15:53                   ` guojiufu
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Biener @ 2021-07-12 14:46 UTC (permalink / raw)
  To: guojiufu
  Cc: Segher Boessenkool, gcc-patches, wschmidt, dje.gcc, jlaw, amker.cheng

On Mon, 12 Jul 2021, guojiufu wrote:

> On 2021-07-12 18:02, Richard Biener wrote:
> > On Mon, 12 Jul 2021, guojiufu wrote:
> > 
> >> On 2021-07-12 16:57, Richard Biener wrote:
> >> > On Mon, 12 Jul 2021, guojiufu wrote:
> >> >
> >> >> On 2021-07-12 14:20, Richard Biener wrote:
> >> >> > On Fri, 9 Jul 2021, Segher Boessenkool wrote:
> >> >> >
> >> >> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
> >> >> >> > I wonder if there's a way to query the target what modes the doloop
> >> >> >> > pattern can handle (not being too familiar with the doloop code).
> >> >> >>
> >> >> >> You can look what modes are allowed for operand 0 of doloop_end,
> >> >> >> perhaps?  Although that is a define_expand, not a define_insn, so it
> >> >> >> is
> >> >> >> hard to introspect.
> >> >> >>
> >> >> >> > Why do you need to do any checks besides the new type being able to
> >> >> >> > represent all IV values?  The original doloop IV will never wrap
> >> >> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will
> >> >> >> > become
> >> >> >> > zero ... I suppose the doloop might still do the correct thing here
> >> >> >> > but it also still will with a IV with larger type).
> >> >>
> >> >> The issue comes from U*_MAX (original short MAX), as you said: on which
> >> >> niter + 1 becomes zero.  And because the step for doloop is -1; then, on
> >> >> larger type 'zero - 1' will be a very large number on larger type
> >> >> (e.g. 0xff...ff); but on the original short type 'zero - 1' is a small
> >> >> value
> >> >> (e.g. "0xff").
> >> >
> >> > But for the larger type the small type MAX + 1 fits and does not yield
> >> > zero so it should still work exactly as before, no?  Of course you
> >> > have to compute the + 1 in the larger type.
> >> >
> >> You are right, if compute the "+ 1" in the larger type it is ok, as below
> >> code:
> >> ```
> >>    /* Use type in word size may fast.  */
> >>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD)
> >>       {
> >>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
> >>         niter = fold_convert (ntype, niter);
> >>       }
> >> 
> >>     tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
> >>                              build_int_cst (ntype, 1));
> >> 
> >> 
> >>     add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, NULL,
> >> true);
> >> ```
> >> The issue of this is, this code generates more stmt for doloop.xxx:
> >>   _12 = (unsigned int) xx(D);
> >>   _10 = _12 + 4294967295;
> >>   _24 = (long unsigned int) _10;
> >>   doloop.6_8 = _24 + 1;
> >> 
> >> if use previous patch, "+ 1" on original type, then the stmts will looks
> >> like:
> >>   _12 = (unsigned int) xx(D);
> >>   doloop.6_8 = (long unsigned int) _12;
> >> 
> >> This is the reason for checking
> >>    wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype)))
> > 
> > But this then only works when there's an upper bound on the number
> > of iterations.  Note you should not use TYPE_MAX_VALUE here but
> > you can instead use
> > 
> >      wi::ltu_p (niter_desc->max, wi::to_widest (wi::max_value
> > (TYPE_PRECISION (ntype), TYPE_SIGN (ntype))));
> 
> Ok, Thanks!
> I remember you mentioned that:
> widest_int::from (wi::max_value (TYPE_PRECISION (ntype), TYPE_SIGN (ntype)),
> TYPE_SIGN (ntype))
> would be better than
> wi::to_widest (TYPE_MAX_VALUE (ntype)).
> 
> It seems that:
> "TYPE_MAX_VALUE (ntype)" is "NUMERICAL_TYPE_CHECK
> (NODE)->type_non_common.maxval"
> which do a numerical-check and return the field of maxval.  And then call to
> wi::to_widest
> 
> The other code "widest_int::from (wi::max_value (..,..),..)" calls
> wi::max_value
> and widest_int::from.
> 
> I'm wondering if wi::to_widest (TYPE_MAX_VALUE (ntype)) is cheaper?

TYPE_MAX_VALUE can be "suprising", it does not necessarily match the
underlying modes precision.  At some point we've tried to eliminate
most of its uses, not sure what the situation/position is right now.

> > I think the -1 above comes from number of latch iterations vs. header
> > entries - it's a common source for this kind of issues.  range analysis
> > might be able to prove that we can still merge the two adds even with
> > the intermediate extension.
> Yes, as you mentioned here, it relates to number of latch iterations
> For loop looks like : while (l < n) or for (i = 0; i < n; i++)
> This kind of loop, the niter is used to be 'n - 1' after transformed
> into 'do-while' form.
> I would see how to merge these two adds safely at this point
> when generating doloop iv. (maybe range info, thanks!
>
> > 
> > Is this pre-loop extra add really offsetting the in-loop doloop
> > improvements?
> I'm not catching this question too much, sorry.  I guess your concern
> is if the "+1" is an offset: it may not, "+1" may be just that doloop.xx
> is decreasing niter until 0 (all number >0).
> If misunderstand,  thanks for point out.

I'm questioning the argument that not being able to eliminate the +1-1
pair effects the overall cost improvement for the doloop IV type change
which hopefully is _not_ just loop IV setup cost but improving
performance in the loop body itself?

Richard.
 
> > 
> >> >> >>
> >> >> >> doloop_valid_p guarantees it is simple and doesn't wrap.
> >> >> >>
> >> >> >> > I'd have expected sth like
> >> >> >> >
> >> >> >> >    ntype = lang_hooks.types.type_for_mode (word_mode, TYPE_UNSIGNED
> >> >> >> > (ntype));
> >> >> >> >
> >> >> >> > thus the decision made using a mode - which is also why I wonder
> >> >> >> > if there's a way to query the target for this.  As you say,
> >> >> >> > it _may_ be fast, so better check (somehow).
> >> >>
> >> >>
> >> >> I was also thinking of using hooks like type_for_size/type_for_mode.
> >> >>     /* Use type in word size may fast.  */
> >> >>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD
> >> >>         && Wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE
> >> >> (ntype))))
> >> >>       {
> >> >>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
> >> >>         base = fold_convert (ntype, base);
> >> >>       }
> >> >>
> >> >> As you pointed out, this does not query the mode from targets.
> >> >> As Segher pointed out "doloop_end" checks unsupported mode, while it
> >> >> seems
> >> >> not easy to use it in tree-ssa-loop-ivopts.c.
> >> >> For implementations of doloop_end, tartgets like rs6000/aarch64/ia64
> >> >> requires
> >> >> Pmode/DImode; while there are other targets that work on other 'mode'
> >> >> (e.g.
> >> >> SI).
> >> >>
> >> >>
> >> >> In doloop_optimize, there is code:
> >> >>
> >> >> ```
> >> >>     mode = desc->mode;
> >> >> .....
> >> >>     doloop_reg = gen_reg_rtx (mode);
> >> >>     rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg,
> >> >> start_label);
> >> >>
> >> >>     word_mode_size = GET_MODE_PRECISION (word_mode);
> >> >>     word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) -
> >> >> 1;
> >> >>     if (! doloop_seq
> >> >>         && mode != word_mode
> >> >>         /* Before trying mode different from the one in that # of
> >> >> iterations is
> >> >>            computed, we must be sure that the number of iterations fits
> >> >> into
> >> >>            the new mode.  */
> >> >>         && (word_mode_size >= GET_MODE_PRECISION (mode)
> >> >>             || wi::leu_p (iterations_max, word_mode_max)))
> >> >>       {
> >> >>         if (word_mode_size > GET_MODE_PRECISION (mode))
> >> >>           count = simplify_gen_unary (ZERO_EXTEND, word_mode, count,
> >> >>         mode);
> >> >>         else
> >> >>           count = lowpart_subreg (word_mode, count, mode);
> >> >>         PUT_MODE (doloop_reg, word_mode);
> >> >>         doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
> >> >>       }
> >> >>     if (! doloop_seq)
> >> >>       {
> >> >>         if (dump_file)
> >> >>           fprintf (dump_file,
> >> >>                    "Doloop: Target unwilling to use doloop pattern!\n");
> >> >>         return false;
> >> >>       }
> >> >> ```
> >> >> The above code first tries the mode of niter_desc by call
> >> >> targetm.gen_doloop_end
> >> >> to see if the target can generate doloop insns, if fail, then try to use
> >> >> 'word_mode' against gen_doloop_end.
> >> >>
> >> >>
> >> >> >>
> >> >> >> Almost all targets just use Pmode, but there is no such guarantee I
> >> >> >> think, and esp. some targets that do not have machine insns for this
> >> >> >> (but want to generate different code for this anyway) can do pretty
> >> >> >> much
> >> >> >> anything.
> >> >> >>
> >> >> >> Maybe using just Pmode here is good enough though?
> >> >> >
> >> >> > I think Pmode is a particularly bad choice and I'd prefer word_mode
> >> >> > if we go for any hardcoded mode.  s390x for example seems to handle
> >> >> > both SImode and DImode (but names the helper gen_doloop_si64
> >> >> > for SImode?!).  But indeed it looks like somehow querying doloop_end
> >> >> > is going to be difficult since the expander doesn't have any mode,
> >> >> > so we'd have to actually try emit RTL here.
> >> >>
> >> >> Instead of using hardcode mode, maybe we could add a hook for targets to
> >> >> return
> >> >> the preferred mode.
> >> >
> >> > That's a possiblity of course.  Like the following which just shows the
> >> > default implementation then (pass in current mode, return a more
> >> > preferred
> >> > mode or the mode itself)
> >> >
> >> > enum machine_mode
> >> > prefered_doloop_mode (enum machine_mode mode)
> >> > {
> >> >   return mode;
> >> > }
> >> >
> >> Yes, thanks!
> >> 
> >> Checking current do_loop_end in targets, in general, when do_loop_end
> >> requires
> >> SI mode, the target is defining Pmode as SImode and word_mode (from
> >> BITS_PER_WORD
> >> which defaults from UNITS_PER_WORD) is also defined to align with SI mode.
> >> When do_loop_end requires DI mode, the target is defining Pmode as DImode
> >> and word_mode/UNITS_PER_WORD is also defined to align with DI mode.
> >> 
> >> So, if aggressively, then by default we may just return word_mode.
> > 
> > Note we still have to check whether the prefered mode is valid
> > (passing in TImode but then returning DImode would be wrong).
> Yes, some code like in doloop_optimize (rtl code)
> ```
> mode != word_mode
> && (word_mode_size >= GET_MODE_PRECISION (mode)
>     || wi::leu_p (iterations_max, word_mode_max))
> ```
> 
> Thanks again for your comments!
> 
> BR,
> Jiufu Guo
> > 
> > Richard.
> > 
> >> BR,
> >> 
> >> Jiufu Guo.
> >> 
> >> 
> >> >>
> >> >> Thanks for those valuable comments!
> >> >>
> >> >> Jiufu Guo
> >> >>
> >> >>
> >> >>
> >> >> >
> >> >> > Richard.
> >> >>
> >> >>
> >> 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
  2021-07-12 14:46                 ` Richard Biener
@ 2021-07-12 15:53                   ` guojiufu
  2021-07-12 15:59                     ` guojiufu
  2021-07-13  2:09                     ` guojiufu
  0 siblings, 2 replies; 20+ messages in thread
From: guojiufu @ 2021-07-12 15:53 UTC (permalink / raw)
  To: Richard Biener
  Cc: Segher Boessenkool, gcc-patches, wschmidt, dje.gcc, jlaw, amker.cheng

On 2021-07-12 22:46, Richard Biener wrote:
> On Mon, 12 Jul 2021, guojiufu wrote:
> 
>> On 2021-07-12 18:02, Richard Biener wrote:
>> > On Mon, 12 Jul 2021, guojiufu wrote:
>> >
>> >> On 2021-07-12 16:57, Richard Biener wrote:
>> >> > On Mon, 12 Jul 2021, guojiufu wrote:
>> >> >
>> >> >> On 2021-07-12 14:20, Richard Biener wrote:
>> >> >> > On Fri, 9 Jul 2021, Segher Boessenkool wrote:
>> >> >> >
>> >> >> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
>> >> >> >> > I wonder if there's a way to query the target what modes the doloop
>> >> >> >> > pattern can handle (not being too familiar with the doloop code).
>> >> >> >>
>> >> >> >> You can look what modes are allowed for operand 0 of doloop_end,
>> >> >> >> perhaps?  Although that is a define_expand, not a define_insn, so it
>> >> >> >> is
>> >> >> >> hard to introspect.
>> >> >> >>
>> >> >> >> > Why do you need to do any checks besides the new type being able to
>> >> >> >> > represent all IV values?  The original doloop IV will never wrap
>> >> >> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will
>> >> >> >> > become
>> >> >> >> > zero ... I suppose the doloop might still do the correct thing here
>> >> >> >> > but it also still will with a IV with larger type).
>> >> >>
>> >> >> The issue comes from U*_MAX (original short MAX), as you said: on which
>> >> >> niter + 1 becomes zero.  And because the step for doloop is -1; then, on
>> >> >> larger type 'zero - 1' will be a very large number on larger type
>> >> >> (e.g. 0xff...ff); but on the original short type 'zero - 1' is a small
>> >> >> value
>> >> >> (e.g. "0xff").
>> >> >
>> >> > But for the larger type the small type MAX + 1 fits and does not yield
>> >> > zero so it should still work exactly as before, no?  Of course you
>> >> > have to compute the + 1 in the larger type.
>> >> >
>> >> You are right, if compute the "+ 1" in the larger type it is ok, as below
>> >> code:
>> >> ```
>> >>    /* Use type in word size may fast.  */
>> >>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD)
>> >>       {
>> >>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>> >>         niter = fold_convert (ntype, niter);
>> >>       }
>> >>
>> >>     tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
>> >>                              build_int_cst (ntype, 1));
>> >>
>> >>
>> >>     add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, NULL,
>> >> true);
>> >> ```
>> >> The issue of this is, this code generates more stmt for doloop.xxx:
>> >>   _12 = (unsigned int) xx(D);
>> >>   _10 = _12 + 4294967295;
>> >>   _24 = (long unsigned int) _10;
>> >>   doloop.6_8 = _24 + 1;
>> >>
>> >> if use previous patch, "+ 1" on original type, then the stmts will looks
>> >> like:
>> >>   _12 = (unsigned int) xx(D);
>> >>   doloop.6_8 = (long unsigned int) _12;
>> >>
>> >> This is the reason for checking
>> >>    wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype)))
>> >
>> > But this then only works when there's an upper bound on the number
>> > of iterations.  Note you should not use TYPE_MAX_VALUE here but
>> > you can instead use
>> >
>> >      wi::ltu_p (niter_desc->max, wi::to_widest (wi::max_value
>> > (TYPE_PRECISION (ntype), TYPE_SIGN (ntype))));
>> 
>> Ok, Thanks!
>> I remember you mentioned that:
>> widest_int::from (wi::max_value (TYPE_PRECISION (ntype), TYPE_SIGN 
>> (ntype)),
>> TYPE_SIGN (ntype))
>> would be better than
>> wi::to_widest (TYPE_MAX_VALUE (ntype)).
>> 
>> It seems that:
>> "TYPE_MAX_VALUE (ntype)" is "NUMERICAL_TYPE_CHECK
>> (NODE)->type_non_common.maxval"
>> which do a numerical-check and return the field of maxval.  And then 
>> call to
>> wi::to_widest
>> 
>> The other code "widest_int::from (wi::max_value (..,..),..)" calls
>> wi::max_value
>> and widest_int::from.
>> 
>> I'm wondering if wi::to_widest (TYPE_MAX_VALUE (ntype)) is cheaper?
> 
> TYPE_MAX_VALUE can be "suprising", it does not necessarily match the
> underlying modes precision.  At some point we've tried to eliminate
> most of its uses, not sure what the situation/position is right now.
Ok, get it, thanks.
I will use "widest_int::from (wi::max_value (..,..),..)".

> 
>> > I think the -1 above comes from number of latch iterations vs. header
>> > entries - it's a common source for this kind of issues.  range analysis
>> > might be able to prove that we can still merge the two adds even with
>> > the intermediate extension.
>> Yes, as you mentioned here, it relates to number of latch iterations
>> For loop looks like : while (l < n) or for (i = 0; i < n; i++)
>> This kind of loop, the niter is used to be 'n - 1' after transformed
>> into 'do-while' form.
For this kind of loop, the max value for the number of iteration "n - 1"
would be "max_value_type(n) - 1" which is wi::ltu than max_value_type.
This kind of loop is already common, and we could use wi::ltu (max, 
max_value_type)
to check.

For loop looks like:
   do ;
   while (n-- > 0); /* while  (n-- > low); */

The niter_desc->max will wi::eq to max_value_type, and niter would be 
"n",
and then doloop.xx is 'n+1'.

>> I would see how to merge these two adds safely at this point
>> when generating doloop iv. (maybe range info, thanks!
>> 
>> >
>> > Is this pre-loop extra add really offsetting the in-loop doloop
>> > improvements?
>> I'm not catching this question too much, sorry.  I guess your concern
>> is if the "+1" is an offset: it may not, "+1" may be just that 
>> doloop.xx
>> is decreasing niter until 0 (all number >0).
>> If misunderstand,  thanks for point out.
> 
> I'm questioning the argument that not being able to eliminate the +1-1
> pair effects the overall cost improvement for the doloop IV type change
> which hopefully is _not_ just loop IV setup cost but improving
> performance in the loop body itself?
Yes, I think so.  It would affect doloop IV setup cost, it may also 
affect
loop itself on some targets, if the target prefers a changed type:
using one jump-counter instruction for wider type, but cmp/jump 
instructions
for other types.
If the loop is nested in outer-loop, eliminating +1-1 would improving
outer-loop performance.

BR,
Jiufu Guo.

> 
> Richard.
> 
>> >
>> >> >> >>
>> >> >> >> doloop_valid_p guarantees it is simple and doesn't wrap.
>> >> >> >>
>> >> >> >> > I'd have expected sth like
>> >> >> >> >
>> >> >> >> >    ntype = lang_hooks.types.type_for_mode (word_mode, TYPE_UNSIGNED
>> >> >> >> > (ntype));
>> >> >> >> >
>> >> >> >> > thus the decision made using a mode - which is also why I wonder
>> >> >> >> > if there's a way to query the target for this.  As you say,
>> >> >> >> > it _may_ be fast, so better check (somehow).
>> >> >>
>> >> >>
>> >> >> I was also thinking of using hooks like type_for_size/type_for_mode.
>> >> >>     /* Use type in word size may fast.  */
>> >> >>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD
>> >> >>         && Wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE
>> >> >> (ntype))))
>> >> >>       {
>> >> >>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>> >> >>         base = fold_convert (ntype, base);
>> >> >>       }
>> >> >>
>> >> >> As you pointed out, this does not query the mode from targets.
>> >> >> As Segher pointed out "doloop_end" checks unsupported mode, while it
>> >> >> seems
>> >> >> not easy to use it in tree-ssa-loop-ivopts.c.
>> >> >> For implementations of doloop_end, tartgets like rs6000/aarch64/ia64
>> >> >> requires
>> >> >> Pmode/DImode; while there are other targets that work on other 'mode'
>> >> >> (e.g.
>> >> >> SI).
>> >> >>
>> >> >>
>> >> >> In doloop_optimize, there is code:
>> >> >>
>> >> >> ```
>> >> >>     mode = desc->mode;
>> >> >> .....
>> >> >>     doloop_reg = gen_reg_rtx (mode);
>> >> >>     rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg,
>> >> >> start_label);
>> >> >>
>> >> >>     word_mode_size = GET_MODE_PRECISION (word_mode);
>> >> >>     word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) -
>> >> >> 1;
>> >> >>     if (! doloop_seq
>> >> >>         && mode != word_mode
>> >> >>         /* Before trying mode different from the one in that # of
>> >> >> iterations is
>> >> >>            computed, we must be sure that the number of iterations fits
>> >> >> into
>> >> >>            the new mode.  */
>> >> >>         && (word_mode_size >= GET_MODE_PRECISION (mode)
>> >> >>             || wi::leu_p (iterations_max, word_mode_max)))
>> >> >>       {
>> >> >>         if (word_mode_size > GET_MODE_PRECISION (mode))
>> >> >>           count = simplify_gen_unary (ZERO_EXTEND, word_mode, count,
>> >> >>         mode);
>> >> >>         else
>> >> >>           count = lowpart_subreg (word_mode, count, mode);
>> >> >>         PUT_MODE (doloop_reg, word_mode);
>> >> >>         doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
>> >> >>       }
>> >> >>     if (! doloop_seq)
>> >> >>       {
>> >> >>         if (dump_file)
>> >> >>           fprintf (dump_file,
>> >> >>                    "Doloop: Target unwilling to use doloop pattern!\n");
>> >> >>         return false;
>> >> >>       }
>> >> >> ```
>> >> >> The above code first tries the mode of niter_desc by call
>> >> >> targetm.gen_doloop_end
>> >> >> to see if the target can generate doloop insns, if fail, then try to use
>> >> >> 'word_mode' against gen_doloop_end.
>> >> >>
>> >> >>
>> >> >> >>
>> >> >> >> Almost all targets just use Pmode, but there is no such guarantee I
>> >> >> >> think, and esp. some targets that do not have machine insns for this
>> >> >> >> (but want to generate different code for this anyway) can do pretty
>> >> >> >> much
>> >> >> >> anything.
>> >> >> >>
>> >> >> >> Maybe using just Pmode here is good enough though?
>> >> >> >
>> >> >> > I think Pmode is a particularly bad choice and I'd prefer word_mode
>> >> >> > if we go for any hardcoded mode.  s390x for example seems to handle
>> >> >> > both SImode and DImode (but names the helper gen_doloop_si64
>> >> >> > for SImode?!).  But indeed it looks like somehow querying doloop_end
>> >> >> > is going to be difficult since the expander doesn't have any mode,
>> >> >> > so we'd have to actually try emit RTL here.
>> >> >>
>> >> >> Instead of using hardcode mode, maybe we could add a hook for targets to
>> >> >> return
>> >> >> the preferred mode.
>> >> >
>> >> > That's a possiblity of course.  Like the following which just shows the
>> >> > default implementation then (pass in current mode, return a more
>> >> > preferred
>> >> > mode or the mode itself)
>> >> >
>> >> > enum machine_mode
>> >> > prefered_doloop_mode (enum machine_mode mode)
>> >> > {
>> >> >   return mode;
>> >> > }
>> >> >
>> >> Yes, thanks!
>> >>
>> >> Checking current do_loop_end in targets, in general, when do_loop_end
>> >> requires
>> >> SI mode, the target is defining Pmode as SImode and word_mode (from
>> >> BITS_PER_WORD
>> >> which defaults from UNITS_PER_WORD) is also defined to align with SI mode.
>> >> When do_loop_end requires DI mode, the target is defining Pmode as DImode
>> >> and word_mode/UNITS_PER_WORD is also defined to align with DI mode.
>> >>
>> >> So, if aggressively, then by default we may just return word_mode.
>> >
>> > Note we still have to check whether the prefered mode is valid
>> > (passing in TImode but then returning DImode would be wrong).
>> Yes, some code like in doloop_optimize (rtl code)
>> ```
>> mode != word_mode
>> && (word_mode_size >= GET_MODE_PRECISION (mode)
>>     || wi::leu_p (iterations_max, word_mode_max))
>> ```
>> 
>> Thanks again for your comments!
>> 
>> BR,
>> Jiufu Guo
>> >
>> > Richard.
>> >
>> >> BR,
>> >>
>> >> Jiufu Guo.
>> >>
>> >>
>> >> >>
>> >> >> Thanks for those valuable comments!
>> >> >>
>> >> >> Jiufu Guo
>> >> >>
>> >> >>
>> >> >>
>> >> >> >
>> >> >> > Richard.
>> >> >>
>> >> >>
>> >>
>> 
>> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
  2021-07-12 15:53                   ` guojiufu
@ 2021-07-12 15:59                     ` guojiufu
  2021-07-13  2:09                     ` guojiufu
  1 sibling, 0 replies; 20+ messages in thread
From: guojiufu @ 2021-07-12 15:59 UTC (permalink / raw)
  To: guojiufu
  Cc: Richard Biener, Segher Boessenkool, gcc-patches, wschmidt, jlaw,
	dje.gcc, Gcc-patches

On 2021-07-12 23:53, guojiufu via Gcc-patches wrote:
> On 2021-07-12 22:46, Richard Biener wrote:
>> On Mon, 12 Jul 2021, guojiufu wrote:
>> 
>>> On 2021-07-12 18:02, Richard Biener wrote:
>>> > On Mon, 12 Jul 2021, guojiufu wrote:
>>> >
>>> >> On 2021-07-12 16:57, Richard Biener wrote:
>>> >> > On Mon, 12 Jul 2021, guojiufu wrote:
>>> >> >
>>> >> >> On 2021-07-12 14:20, Richard Biener wrote:
>>> >> >> > On Fri, 9 Jul 2021, Segher Boessenkool wrote:
>>> >> >> >
>>> >> >> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
>>> >> >> >> > I wonder if there's a way to query the target what modes the doloop
>>> >> >> >> > pattern can handle (not being too familiar with the doloop code).
>>> >> >> >>
>>> >> >> >> You can look what modes are allowed for operand 0 of doloop_end,
>>> >> >> >> perhaps?  Although that is a define_expand, not a define_insn, so it
>>> >> >> >> is
>>> >> >> >> hard to introspect.
>>> >> >> >>
>>> >> >> >> > Why do you need to do any checks besides the new type being able to
>>> >> >> >> > represent all IV values?  The original doloop IV will never wrap
>>> >> >> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will
>>> >> >> >> > become
>>> >> >> >> > zero ... I suppose the doloop might still do the correct thing here
>>> >> >> >> > but it also still will with a IV with larger type).
>>> >> >>
>>> >> >> The issue comes from U*_MAX (original short MAX), as you said: on which
>>> >> >> niter + 1 becomes zero.  And because the step for doloop is -1; then, on
>>> >> >> larger type 'zero - 1' will be a very large number on larger type
>>> >> >> (e.g. 0xff...ff); but on the original short type 'zero - 1' is a small
>>> >> >> value
>>> >> >> (e.g. "0xff").
>>> >> >
>>> >> > But for the larger type the small type MAX + 1 fits and does not yield
>>> >> > zero so it should still work exactly as before, no?  Of course you
>>> >> > have to compute the + 1 in the larger type.
>>> >> >
>>> >> You are right, if compute the "+ 1" in the larger type it is ok, as below
>>> >> code:
>>> >> ```
>>> >>    /* Use type in word size may fast.  */
>>> >>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD)
>>> >>       {
>>> >>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>>> >>         niter = fold_convert (ntype, niter);
>>> >>       }
>>> >>
>>> >>     tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
>>> >>                              build_int_cst (ntype, 1));
>>> >>
>>> >>
>>> >>     add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, NULL,
>>> >> true);
>>> >> ```
>>> >> The issue of this is, this code generates more stmt for doloop.xxx:
>>> >>   _12 = (unsigned int) xx(D);
>>> >>   _10 = _12 + 4294967295;
>>> >>   _24 = (long unsigned int) _10;
>>> >>   doloop.6_8 = _24 + 1;
>>> >>
>>> >> if use previous patch, "+ 1" on original type, then the stmts will looks
>>> >> like:
>>> >>   _12 = (unsigned int) xx(D);
>>> >>   doloop.6_8 = (long unsigned int) _12;
>>> >>
>>> >> This is the reason for checking
>>> >>    wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype)))
>>> >
>>> > But this then only works when there's an upper bound on the number
>>> > of iterations.  Note you should not use TYPE_MAX_VALUE here but
>>> > you can instead use
>>> >
>>> >      wi::ltu_p (niter_desc->max, wi::to_widest (wi::max_value
>>> > (TYPE_PRECISION (ntype), TYPE_SIGN (ntype))));
>>> 
>>> Ok, Thanks!
>>> I remember you mentioned that:
>>> widest_int::from (wi::max_value (TYPE_PRECISION (ntype), TYPE_SIGN 
>>> (ntype)),
>>> TYPE_SIGN (ntype))
>>> would be better than
>>> wi::to_widest (TYPE_MAX_VALUE (ntype)).
>>> 
>>> It seems that:
>>> "TYPE_MAX_VALUE (ntype)" is "NUMERICAL_TYPE_CHECK
>>> (NODE)->type_non_common.maxval"
>>> which do a numerical-check and return the field of maxval.  And then 
>>> call to
>>> wi::to_widest
>>> 
>>> The other code "widest_int::from (wi::max_value (..,..),..)" calls
>>> wi::max_value
>>> and widest_int::from.
>>> 
>>> I'm wondering if wi::to_widest (TYPE_MAX_VALUE (ntype)) is cheaper?
>> 
>> TYPE_MAX_VALUE can be "suprising", it does not necessarily match the
>> underlying modes precision.  At some point we've tried to eliminate
>> most of its uses, not sure what the situation/position is right now.
> Ok, get it, thanks.
> I will use "widest_int::from (wi::max_value (..,..),..)".
> 
>> 
>>> > I think the -1 above comes from number of latch iterations vs. header
>>> > entries - it's a common source for this kind of issues.  range analysis
>>> > might be able to prove that we can still merge the two adds even with
>>> > the intermediate extension.
>>> Yes, as you mentioned here, it relates to number of latch iterations
>>> For loop looks like : while (l < n) or for (i = 0; i < n; i++)
>>> This kind of loop, the niter is used to be 'n - 1' after transformed
>>> into 'do-while' form.
> For this kind of loop, the max value for the number of iteration "n - 
> 1"
> would be "max_value_type(n) - 1" which is wi::ltu than max_value_type.
> This kind of loop is already common, and we could use wi::ltu (max,
> max_value_type)
> to check.
> 
> For loop looks like:
>   do ;
>   while (n-- > 0); /* while  (n-- > low); */
> 
> The niter_desc->max will wi::eq to max_value_type, and niter would be 
> "n",
> and then doloop.xx is 'n+1'.
> 
>>> I would see how to merge these two adds safely at this point
>>> when generating doloop iv. (maybe range info, thanks!
>>> 
>>> >
>>> > Is this pre-loop extra add really offsetting the in-loop doloop
>>> > improvements?
>>> I'm not catching this question too much, sorry.  I guess your concern
>>> is if the "+1" is an offset: it may not, "+1" may be just that 
>>> doloop.xx
>>> is decreasing niter until 0 (all number >0).
>>> If misunderstand,  thanks for point out.
>> 
>> I'm questioning the argument that not being able to eliminate the +1-1
>> pair effects the overall cost improvement for the doloop IV type 
>> change
>> which hopefully is _not_ just loop IV setup cost but improving
>> performance in the loop body itself?
> Yes, I think so.  It would affect doloop IV setup cost, it may also 
> affect
> loop itself on some targets, if the target prefers a changed type:
> using one jump-counter instruction for wider type, but cmp/jump 
> instructions
> for other types.
Oh, this may be an inaccurate example. -:)

BR,
Jiufu Guo.

> If the loop is nested in outer-loop, eliminating +1-1 would improving
> outer-loop performance.
> 
> BR,
> Jiufu Guo.
> 
>> 
>> Richard.
>> 
>>> >
>>> >> >> >>
>>> >> >> >> doloop_valid_p guarantees it is simple and doesn't wrap.
>>> >> >> >>
>>> >> >> >> > I'd have expected sth like
>>> >> >> >> >
>>> >> >> >> >    ntype = lang_hooks.types.type_for_mode (word_mode, TYPE_UNSIGNED
>>> >> >> >> > (ntype));
>>> >> >> >> >
>>> >> >> >> > thus the decision made using a mode - which is also why I wonder
>>> >> >> >> > if there's a way to query the target for this.  As you say,
>>> >> >> >> > it _may_ be fast, so better check (somehow).
>>> >> >>
>>> >> >>
>>> >> >> I was also thinking of using hooks like type_for_size/type_for_mode.
>>> >> >>     /* Use type in word size may fast.  */
>>> >> >>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD
>>> >> >>         && Wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE
>>> >> >> (ntype))))
>>> >> >>       {
>>> >> >>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>>> >> >>         base = fold_convert (ntype, base);
>>> >> >>       }
>>> >> >>
>>> >> >> As you pointed out, this does not query the mode from targets.
>>> >> >> As Segher pointed out "doloop_end" checks unsupported mode, while it
>>> >> >> seems
>>> >> >> not easy to use it in tree-ssa-loop-ivopts.c.
>>> >> >> For implementations of doloop_end, tartgets like rs6000/aarch64/ia64
>>> >> >> requires
>>> >> >> Pmode/DImode; while there are other targets that work on other 'mode'
>>> >> >> (e.g.
>>> >> >> SI).
>>> >> >>
>>> >> >>
>>> >> >> In doloop_optimize, there is code:
>>> >> >>
>>> >> >> ```
>>> >> >>     mode = desc->mode;
>>> >> >> .....
>>> >> >>     doloop_reg = gen_reg_rtx (mode);
>>> >> >>     rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg,
>>> >> >> start_label);
>>> >> >>
>>> >> >>     word_mode_size = GET_MODE_PRECISION (word_mode);
>>> >> >>     word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) -
>>> >> >> 1;
>>> >> >>     if (! doloop_seq
>>> >> >>         && mode != word_mode
>>> >> >>         /* Before trying mode different from the one in that # of
>>> >> >> iterations is
>>> >> >>            computed, we must be sure that the number of iterations fits
>>> >> >> into
>>> >> >>            the new mode.  */
>>> >> >>         && (word_mode_size >= GET_MODE_PRECISION (mode)
>>> >> >>             || wi::leu_p (iterations_max, word_mode_max)))
>>> >> >>       {
>>> >> >>         if (word_mode_size > GET_MODE_PRECISION (mode))
>>> >> >>           count = simplify_gen_unary (ZERO_EXTEND, word_mode, count,
>>> >> >>         mode);
>>> >> >>         else
>>> >> >>           count = lowpart_subreg (word_mode, count, mode);
>>> >> >>         PUT_MODE (doloop_reg, word_mode);
>>> >> >>         doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
>>> >> >>       }
>>> >> >>     if (! doloop_seq)
>>> >> >>       {
>>> >> >>         if (dump_file)
>>> >> >>           fprintf (dump_file,
>>> >> >>                    "Doloop: Target unwilling to use doloop pattern!\n");
>>> >> >>         return false;
>>> >> >>       }
>>> >> >> ```
>>> >> >> The above code first tries the mode of niter_desc by call
>>> >> >> targetm.gen_doloop_end
>>> >> >> to see if the target can generate doloop insns, if fail, then try to use
>>> >> >> 'word_mode' against gen_doloop_end.
>>> >> >>
>>> >> >>
>>> >> >> >>
>>> >> >> >> Almost all targets just use Pmode, but there is no such guarantee I
>>> >> >> >> think, and esp. some targets that do not have machine insns for this
>>> >> >> >> (but want to generate different code for this anyway) can do pretty
>>> >> >> >> much
>>> >> >> >> anything.
>>> >> >> >>
>>> >> >> >> Maybe using just Pmode here is good enough though?
>>> >> >> >
>>> >> >> > I think Pmode is a particularly bad choice and I'd prefer word_mode
>>> >> >> > if we go for any hardcoded mode.  s390x for example seems to handle
>>> >> >> > both SImode and DImode (but names the helper gen_doloop_si64
>>> >> >> > for SImode?!).  But indeed it looks like somehow querying doloop_end
>>> >> >> > is going to be difficult since the expander doesn't have any mode,
>>> >> >> > so we'd have to actually try emit RTL here.
>>> >> >>
>>> >> >> Instead of using hardcode mode, maybe we could add a hook for targets to
>>> >> >> return
>>> >> >> the preferred mode.
>>> >> >
>>> >> > That's a possiblity of course.  Like the following which just shows the
>>> >> > default implementation then (pass in current mode, return a more
>>> >> > preferred
>>> >> > mode or the mode itself)
>>> >> >
>>> >> > enum machine_mode
>>> >> > prefered_doloop_mode (enum machine_mode mode)
>>> >> > {
>>> >> >   return mode;
>>> >> > }
>>> >> >
>>> >> Yes, thanks!
>>> >>
>>> >> Checking current do_loop_end in targets, in general, when do_loop_end
>>> >> requires
>>> >> SI mode, the target is defining Pmode as SImode and word_mode (from
>>> >> BITS_PER_WORD
>>> >> which defaults from UNITS_PER_WORD) is also defined to align with SI mode.
>>> >> When do_loop_end requires DI mode, the target is defining Pmode as DImode
>>> >> and word_mode/UNITS_PER_WORD is also defined to align with DI mode.
>>> >>
>>> >> So, if aggressively, then by default we may just return word_mode.
>>> >
>>> > Note we still have to check whether the prefered mode is valid
>>> > (passing in TImode but then returning DImode would be wrong).
>>> Yes, some code like in doloop_optimize (rtl code)
>>> ```
>>> mode != word_mode
>>> && (word_mode_size >= GET_MODE_PRECISION (mode)
>>>     || wi::leu_p (iterations_max, word_mode_max))
>>> ```
>>> 
>>> Thanks again for your comments!
>>> 
>>> BR,
>>> Jiufu Guo
>>> >
>>> > Richard.
>>> >
>>> >> BR,
>>> >>
>>> >> Jiufu Guo.
>>> >>
>>> >>
>>> >> >>
>>> >> >> Thanks for those valuable comments!
>>> >> >>
>>> >> >> Jiufu Guo
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> >
>>> >> >> > Richard.
>>> >> >>
>>> >> >>
>>> >>
>>> 
>>> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
  2021-07-12 15:53                   ` guojiufu
  2021-07-12 15:59                     ` guojiufu
@ 2021-07-13  2:09                     ` guojiufu
  2021-07-13  7:09                       ` Richard Biener
  2021-07-13 15:51                       ` Segher Boessenkool
  1 sibling, 2 replies; 20+ messages in thread
From: guojiufu @ 2021-07-13  2:09 UTC (permalink / raw)
  To: guojiufu
  Cc: Richard Biener, Segher Boessenkool, gcc-patches, wschmidt, jlaw,
	dje.gcc, Gcc-patches

On 2021-07-12 23:53, guojiufu via Gcc-patches wrote:
> On 2021-07-12 22:46, Richard Biener wrote:
>> On Mon, 12 Jul 2021, guojiufu wrote:
>> 
>>> On 2021-07-12 18:02, Richard Biener wrote:
>>> > On Mon, 12 Jul 2021, guojiufu wrote:
>>> >
>>> >> On 2021-07-12 16:57, Richard Biener wrote:
>>> >> > On Mon, 12 Jul 2021, guojiufu wrote:
>>> >> >
>>> >> >> On 2021-07-12 14:20, Richard Biener wrote:
>>> >> >> > On Fri, 9 Jul 2021, Segher Boessenkool wrote:
>>> >> >> >
>>> >> >> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
>>> >> >> >> > I wonder if there's a way to query the target what modes the doloop
>>> >> >> >> > pattern can handle (not being too familiar with the doloop code).
>>> >> >> >>
>>> >> >> >> You can look what modes are allowed for operand 0 of doloop_end,
>>> >> >> >> perhaps?  Although that is a define_expand, not a define_insn, so it
>>> >> >> >> is
>>> >> >> >> hard to introspect.
>>> >> >> >>
>>> >> >> >> > Why do you need to do any checks besides the new type being able to
>>> >> >> >> > represent all IV values?  The original doloop IV will never wrap
>>> >> >> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will
>>> >> >> >> > become
>>> >> >> >> > zero ... I suppose the doloop might still do the correct thing here
>>> >> >> >> > but it also still will with a IV with larger type).
>>> >> >>
>>> >> >> The issue comes from U*_MAX (original short MAX), as you said: on which
>>> >> >> niter + 1 becomes zero.  And because the step for doloop is -1; then, on
>>> >> >> larger type 'zero - 1' will be a very large number on larger type
>>> >> >> (e.g. 0xff...ff); but on the original short type 'zero - 1' is a small
>>> >> >> value
>>> >> >> (e.g. "0xff").
>>> >> >
>>> >> > But for the larger type the small type MAX + 1 fits and does not yield
>>> >> > zero so it should still work exactly as before, no?  Of course you
>>> >> > have to compute the + 1 in the larger type.
>>> >> >
>>> >> You are right, if compute the "+ 1" in the larger type it is ok, as below
>>> >> code:
>>> >> ```
>>> >>    /* Use type in word size may fast.  */
>>> >>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD)
>>> >>       {
>>> >>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>>> >>         niter = fold_convert (ntype, niter);
>>> >>       }
>>> >>
>>> >>     tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
>>> >>                              build_int_cst (ntype, 1));
>>> >>
>>> >>
>>> >>     add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, NULL,
>>> >> true);
>>> >> ```
>>> >> The issue of this is, this code generates more stmt for doloop.xxx:
>>> >>   _12 = (unsigned int) xx(D);
>>> >>   _10 = _12 + 4294967295;
>>> >>   _24 = (long unsigned int) _10;
>>> >>   doloop.6_8 = _24 + 1;
>>> >>
>>> >> if use previous patch, "+ 1" on original type, then the stmts will looks
>>> >> like:
>>> >>   _12 = (unsigned int) xx(D);
>>> >>   doloop.6_8 = (long unsigned int) _12;
>>> >>
>>> >> This is the reason for checking
>>> >>    wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype)))
>>> >
>>> > But this then only works when there's an upper bound on the number
>>> > of iterations.  Note you should not use TYPE_MAX_VALUE here but
>>> > you can instead use
>>> >
>>> >      wi::ltu_p (niter_desc->max, wi::to_widest (wi::max_value
>>> > (TYPE_PRECISION (ntype), TYPE_SIGN (ntype))));
>>> 
>>> Ok, Thanks!
>>> I remember you mentioned that:
>>> widest_int::from (wi::max_value (TYPE_PRECISION (ntype), TYPE_SIGN 
>>> (ntype)),
>>> TYPE_SIGN (ntype))
>>> would be better than
>>> wi::to_widest (TYPE_MAX_VALUE (ntype)).
>>> 
>>> It seems that:
>>> "TYPE_MAX_VALUE (ntype)" is "NUMERICAL_TYPE_CHECK
>>> (NODE)->type_non_common.maxval"
>>> which do a numerical-check and return the field of maxval.  And then 
>>> call to
>>> wi::to_widest
>>> 
>>> The other code "widest_int::from (wi::max_value (..,..),..)" calls
>>> wi::max_value
>>> and widest_int::from.
>>> 
>>> I'm wondering if wi::to_widest (TYPE_MAX_VALUE (ntype)) is cheaper?
>> 
>> TYPE_MAX_VALUE can be "suprising", it does not necessarily match the
>> underlying modes precision.  At some point we've tried to eliminate
>> most of its uses, not sure what the situation/position is right now.
> Ok, get it, thanks.
> I will use "widest_int::from (wi::max_value (..,..),..)".
> 
>> 
>>> > I think the -1 above comes from number of latch iterations vs. header
>>> > entries - it's a common source for this kind of issues.  range analysis
>>> > might be able to prove that we can still merge the two adds even with
>>> > the intermediate extension.
>>> Yes, as you mentioned here, it relates to number of latch iterations
>>> For loop looks like : while (l < n) or for (i = 0; i < n; i++)
>>> This kind of loop, the niter is used to be 'n - 1' after transformed
>>> into 'do-while' form.
> For this kind of loop, the max value for the number of iteration "n - 
> 1"
> would be "max_value_type(n) - 1" which is wi::ltu than max_value_type.
> This kind of loop is already common, and we could use wi::ltu (max,
> max_value_type)
> to check.
> 
> For loop looks like:
>   do ;
>   while (n-- > 0); /* while  (n-- > low); */
> 
> The niter_desc->max will wi::eq to max_value_type, and niter would be 
> "n",
> and then doloop.xx is 'n+1'.
> 
>>> I would see how to merge these two adds safely at this point
>>> when generating doloop iv. (maybe range info, thanks!
>>> 

For some cases, it may not easy to merge -1 +1 pair:
(from doloop-1.c)
```
int __attribute__ ((noinline))
test (unsigned char n)
{
   do ;
   while (--n > 0);
   return n;
}

int main ()
{
   unsigned char z = 0;

   return test (z);
}
```

For this loop, niter_desc->max is 255, and niter is 'n - 1', then
doloop.xx is "'n - 1' + 1",  'n - 1' should be compute at uchar type
(QImode).
It is ok to compute '- 1' and "+ 1" under the original shorter type.
And it is ok if computing "n - 1" under original type and computing
"+ 1" under larger mode.
But it is not ok to compute both "- 1" and "+ 1" in large mode.

As discussed previously:  when 'n - 1' is U*_MAX (original short MAX),
'(n - 1) + 1' becomes zero.  And the step for doloop is -1; then,
on the larger type 'zero - 1' will be a very large number.

It is ok: Original
   unsigned char doloop.5;
   # doloop.5_4 = PHI <n_2(D)(2), doloop.5_6(5)>
   doloop.5_6 = doloop.5_4 - 1;

it would be error if:
   doloop.5_7 = (long unsigned int) n_2(D);
   # doloop.5_4 = PHI <doloop.5_7(2), doloop.5_6(5)>
   doloop.5_6 = doloop.5_4 - 1;


>>> >
>>> > Is this pre-loop extra add really offsetting the in-loop doloop
>>> > improvements?
>>> I'm not catching this question too much, sorry.  I guess your concern
>>> is if the "+1" is an offset: it may not, "+1" may be just that 
>>> doloop.xx
>>> is decreasing niter until 0 (all number >0).
>>> If misunderstand,  thanks for point out.
>> 
>> I'm questioning the argument that not being able to eliminate the +1-1
>> pair effects the overall cost improvement for the doloop IV type 
>> change
>> which hopefully is _not_ just loop IV setup cost but improving
>> performance in the loop body itself?
> Yes, I think so.  It would affect doloop IV setup cost, it may also 
> affect
> loop itself on some targets, if the target prefers a changed type:
> using one jump-counter instruction for wider type, but cmp/jump 
> instructions
> for other types.
> If the loop is nested in outer-loop, eliminating +1-1 would improving
> outer-loop performance.

There is a patch that could mitigate "-1 +1 pair" in rtl part.
https://gcc.gnu.org/g:8a15faa730f99100f6f3ed12663563356ec5a2c0

With that patch, one minor thing is doloop.xxx is using shorter type on 
some
cases, then need subreg to access it.  I have this patch is trying to 
use better
type and avoid to use subreg.

BR,

Jiufu Guo.

> 
> BR,
> Jiufu Guo.
> 
>> 
>> Richard.
>> 
>>> >
>>> >> >> >>
>>> >> >> >> doloop_valid_p guarantees it is simple and doesn't wrap.
>>> >> >> >>
>>> >> >> >> > I'd have expected sth like
>>> >> >> >> >
>>> >> >> >> >    ntype = lang_hooks.types.type_for_mode (word_mode, TYPE_UNSIGNED
>>> >> >> >> > (ntype));
>>> >> >> >> >
>>> >> >> >> > thus the decision made using a mode - which is also why I wonder
>>> >> >> >> > if there's a way to query the target for this.  As you say,
>>> >> >> >> > it _may_ be fast, so better check (somehow).
>>> >> >>
>>> >> >>
>>> >> >> I was also thinking of using hooks like type_for_size/type_for_mode.
>>> >> >>     /* Use type in word size may fast.  */
>>> >> >>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD
>>> >> >>         && Wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE
>>> >> >> (ntype))))
>>> >> >>       {
>>> >> >>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>>> >> >>         base = fold_convert (ntype, base);
>>> >> >>       }
>>> >> >>
>>> >> >> As you pointed out, this does not query the mode from targets.
>>> >> >> As Segher pointed out "doloop_end" checks unsupported mode, while it
>>> >> >> seems
>>> >> >> not easy to use it in tree-ssa-loop-ivopts.c.
>>> >> >> For implementations of doloop_end, tartgets like rs6000/aarch64/ia64
>>> >> >> requires
>>> >> >> Pmode/DImode; while there are other targets that work on other 'mode'
>>> >> >> (e.g.
>>> >> >> SI).
>>> >> >>
>>> >> >>
>>> >> >> In doloop_optimize, there is code:
>>> >> >>
>>> >> >> ```
>>> >> >>     mode = desc->mode;
>>> >> >> .....
>>> >> >>     doloop_reg = gen_reg_rtx (mode);
>>> >> >>     rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg,
>>> >> >> start_label);
>>> >> >>
>>> >> >>     word_mode_size = GET_MODE_PRECISION (word_mode);
>>> >> >>     word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) -
>>> >> >> 1;
>>> >> >>     if (! doloop_seq
>>> >> >>         && mode != word_mode
>>> >> >>         /* Before trying mode different from the one in that # of
>>> >> >> iterations is
>>> >> >>            computed, we must be sure that the number of iterations fits
>>> >> >> into
>>> >> >>            the new mode.  */
>>> >> >>         && (word_mode_size >= GET_MODE_PRECISION (mode)
>>> >> >>             || wi::leu_p (iterations_max, word_mode_max)))
>>> >> >>       {
>>> >> >>         if (word_mode_size > GET_MODE_PRECISION (mode))
>>> >> >>           count = simplify_gen_unary (ZERO_EXTEND, word_mode, count,
>>> >> >>         mode);
>>> >> >>         else
>>> >> >>           count = lowpart_subreg (word_mode, count, mode);
>>> >> >>         PUT_MODE (doloop_reg, word_mode);
>>> >> >>         doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
>>> >> >>       }
>>> >> >>     if (! doloop_seq)
>>> >> >>       {
>>> >> >>         if (dump_file)
>>> >> >>           fprintf (dump_file,
>>> >> >>                    "Doloop: Target unwilling to use doloop pattern!\n");
>>> >> >>         return false;
>>> >> >>       }
>>> >> >> ```
>>> >> >> The above code first tries the mode of niter_desc by call
>>> >> >> targetm.gen_doloop_end
>>> >> >> to see if the target can generate doloop insns, if fail, then try to use
>>> >> >> 'word_mode' against gen_doloop_end.
>>> >> >>
>>> >> >>
>>> >> >> >>
>>> >> >> >> Almost all targets just use Pmode, but there is no such guarantee I
>>> >> >> >> think, and esp. some targets that do not have machine insns for this
>>> >> >> >> (but want to generate different code for this anyway) can do pretty
>>> >> >> >> much
>>> >> >> >> anything.
>>> >> >> >>
>>> >> >> >> Maybe using just Pmode here is good enough though?
>>> >> >> >
>>> >> >> > I think Pmode is a particularly bad choice and I'd prefer word_mode
>>> >> >> > if we go for any hardcoded mode.  s390x for example seems to handle
>>> >> >> > both SImode and DImode (but names the helper gen_doloop_si64
>>> >> >> > for SImode?!).  But indeed it looks like somehow querying doloop_end
>>> >> >> > is going to be difficult since the expander doesn't have any mode,
>>> >> >> > so we'd have to actually try emit RTL here.
>>> >> >>
>>> >> >> Instead of using hardcode mode, maybe we could add a hook for targets to
>>> >> >> return
>>> >> >> the preferred mode.
>>> >> >
>>> >> > That's a possiblity of course.  Like the following which just shows the
>>> >> > default implementation then (pass in current mode, return a more
>>> >> > preferred
>>> >> > mode or the mode itself)
>>> >> >
>>> >> > enum machine_mode
>>> >> > prefered_doloop_mode (enum machine_mode mode)
>>> >> > {
>>> >> >   return mode;
>>> >> > }
>>> >> >
>>> >> Yes, thanks!
>>> >>
>>> >> Checking current do_loop_end in targets, in general, when do_loop_end
>>> >> requires
>>> >> SI mode, the target is defining Pmode as SImode and word_mode (from
>>> >> BITS_PER_WORD
>>> >> which defaults from UNITS_PER_WORD) is also defined to align with SI mode.
>>> >> When do_loop_end requires DI mode, the target is defining Pmode as DImode
>>> >> and word_mode/UNITS_PER_WORD is also defined to align with DI mode.
>>> >>
>>> >> So, if aggressively, then by default we may just return word_mode.
>>> >
>>> > Note we still have to check whether the prefered mode is valid
>>> > (passing in TImode but then returning DImode would be wrong).
>>> Yes, some code like in doloop_optimize (rtl code)
>>> ```
>>> mode != word_mode
>>> && (word_mode_size >= GET_MODE_PRECISION (mode)
>>>     || wi::leu_p (iterations_max, word_mode_max))
>>> ```
>>> 
>>> Thanks again for your comments!
>>> 
>>> BR,
>>> Jiufu Guo
>>> >
>>> > Richard.
>>> >
>>> >> BR,
>>> >>
>>> >> Jiufu Guo.
>>> >>
>>> >>
>>> >> >>
>>> >> >> Thanks for those valuable comments!
>>> >> >>
>>> >> >> Jiufu Guo
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> >
>>> >> >> > Richard.
>>> >> >>
>>> >> >>
>>> >>
>>> 
>>> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
  2021-07-13  2:09                     ` guojiufu
@ 2021-07-13  7:09                       ` Richard Biener
  2021-07-13  8:16                         ` guojiufu
  2021-07-13 15:51                       ` Segher Boessenkool
  1 sibling, 1 reply; 20+ messages in thread
From: Richard Biener @ 2021-07-13  7:09 UTC (permalink / raw)
  To: guojiufu
  Cc: Segher Boessenkool, gcc-patches, wschmidt, jlaw, dje.gcc, Gcc-patches

On Tue, 13 Jul 2021, guojiufu wrote:

> On 2021-07-12 23:53, guojiufu via Gcc-patches wrote:
> > On 2021-07-12 22:46, Richard Biener wrote:
> >> On Mon, 12 Jul 2021, guojiufu wrote:
> >> 
> >>> On 2021-07-12 18:02, Richard Biener wrote:
> >>> > On Mon, 12 Jul 2021, guojiufu wrote:
> >>> >
> >>> >> On 2021-07-12 16:57, Richard Biener wrote:
> >>> >> > On Mon, 12 Jul 2021, guojiufu wrote:
> >>> >> >
> >>> >> >> On 2021-07-12 14:20, Richard Biener wrote:
> >>> >> >> > On Fri, 9 Jul 2021, Segher Boessenkool wrote:
> >>> >> >> >
> >>> >> >> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
> >>> >> >> >> > I wonder if there's a way to query the target what modes the
> >>> >> >> >> > doloop
> >>> >> >> >> > pattern can handle (not being too familiar with the doloop
> >>> >> >> >> > code).
> >>> >> >> >>
> >>> >> >> >> You can look what modes are allowed for operand 0 of doloop_end,
> >>> >> >> >> perhaps?  Although that is a define_expand, not a define_insn, so
> >>> >> >> >> it
> >>> >> >> >> is
> >>> >> >> >> hard to introspect.
> >>> >> >> >>
> >>> >> >> >> > Why do you need to do any checks besides the new type being
> >>> >> >> >> > able to
> >>> >> >> >> > represent all IV values?  The original doloop IV will never
> >>> >> >> >> > wrap
> >>> >> >> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will
> >>> >> >> >> > become
> >>> >> >> >> > zero ... I suppose the doloop might still do the correct thing
> >>> >> >> >> > here
> >>> >> >> >> > but it also still will with a IV with larger type).
> >>> >> >>
> >>> >> >> The issue comes from U*_MAX (original short MAX), as you said: on
> >>> >> >> which
> >>> >> >> niter + 1 becomes zero.  And because the step for doloop is -1;
> >>> >> >> then, on
> >>> >> >> larger type 'zero - 1' will be a very large number on larger type
> >>> >> >> (e.g. 0xff...ff); but on the original short type 'zero - 1' is a
> >>> >> >> small
> >>> >> >> value
> >>> >> >> (e.g. "0xff").
> >>> >> >
> >>> >> > But for the larger type the small type MAX + 1 fits and does not
> >>> >> > yield
> >>> >> > zero so it should still work exactly as before, no?  Of course you
> >>> >> > have to compute the + 1 in the larger type.
> >>> >> >
> >>> >> You are right, if compute the "+ 1" in the larger type it is ok, as
> >>> >> below
> >>> >> code:
> >>> >> ```
> >>> >>    /* Use type in word size may fast.  */
> >>> >>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD)
> >>> >>       {
> >>> >>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
> >>> >>         niter = fold_convert (ntype, niter);
> >>> >>       }
> >>> >>
> >>> >>     tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
> >>> >>                              build_int_cst (ntype, 1));
> >>> >>
> >>> >>
> >>> >>     add_candidate (data, base, build_int_cst (ntype, -1), true, NULL,
> >>> >> NULL,
> >>> >> true);
> >>> >> ```
> >>> >> The issue of this is, this code generates more stmt for doloop.xxx:
> >>> >>   _12 = (unsigned int) xx(D);
> >>> >>   _10 = _12 + 4294967295;
> >>> >>   _24 = (long unsigned int) _10;
> >>> >>   doloop.6_8 = _24 + 1;
> >>> >>
> >>> >> if use previous patch, "+ 1" on original type, then the stmts will
> >>> >> looks
> >>> >> like:
> >>> >>   _12 = (unsigned int) xx(D);
> >>> >>   doloop.6_8 = (long unsigned int) _12;
> >>> >>
> >>> >> This is the reason for checking
> >>> >>    wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype)))
> >>> >
> >>> > But this then only works when there's an upper bound on the number
> >>> > of iterations.  Note you should not use TYPE_MAX_VALUE here but
> >>> > you can instead use
> >>> >
> >>> >      wi::ltu_p (niter_desc->max, wi::to_widest (wi::max_value
> >>> > (TYPE_PRECISION (ntype), TYPE_SIGN (ntype))));
> >>> 
> >>> Ok, Thanks!
> >>> I remember you mentioned that:
> >>> widest_int::from (wi::max_value (TYPE_PRECISION (ntype), TYPE_SIGN
> >>> (ntype)),
> >>> TYPE_SIGN (ntype))
> >>> would be better than
> >>> wi::to_widest (TYPE_MAX_VALUE (ntype)).
> >>> 
> >>> It seems that:
> >>> "TYPE_MAX_VALUE (ntype)" is "NUMERICAL_TYPE_CHECK
> >>> (NODE)->type_non_common.maxval"
> >>> which do a numerical-check and return the field of maxval.  And then call
> >>> to
> >>> wi::to_widest
> >>> 
> >>> The other code "widest_int::from (wi::max_value (..,..),..)" calls
> >>> wi::max_value
> >>> and widest_int::from.
> >>> 
> >>> I'm wondering if wi::to_widest (TYPE_MAX_VALUE (ntype)) is cheaper?
> >> 
> >> TYPE_MAX_VALUE can be "suprising", it does not necessarily match the
> >> underlying modes precision.  At some point we've tried to eliminate
> >> most of its uses, not sure what the situation/position is right now.
> > Ok, get it, thanks.
> > I will use "widest_int::from (wi::max_value (..,..),..)".
> > 
> >> 
> >>> > I think the -1 above comes from number of latch iterations vs. header
> >>> > entries - it's a common source for this kind of issues.  range analysis
> >>> > might be able to prove that we can still merge the two adds even with
> >>> > the intermediate extension.
> >>> Yes, as you mentioned here, it relates to number of latch iterations
> >>> For loop looks like : while (l < n) or for (i = 0; i < n; i++)
> >>> This kind of loop, the niter is used to be 'n - 1' after transformed
> >>> into 'do-while' form.
> > For this kind of loop, the max value for the number of iteration "n - 1"
> > would be "max_value_type(n) - 1" which is wi::ltu than max_value_type.
> > This kind of loop is already common, and we could use wi::ltu (max,
> > max_value_type)
> > to check.
> > 
> > For loop looks like:
> >   do ;
> >   while (n-- > 0); /* while  (n-- > low); */
> > 
> > The niter_desc->max will wi::eq to max_value_type, and niter would be "n",
> > and then doloop.xx is 'n+1'.
> > 
> >>> I would see how to merge these two adds safely at this point
> >>> when generating doloop iv. (maybe range info, thanks!
> >>> 
> 
> For some cases, it may not easy to merge -1 +1 pair:
> (from doloop-1.c)
> ```
> int __attribute__ ((noinline))
> test (unsigned char n)
> {
>   do ;
>   while (--n > 0);
>   return n;
> }
> 
> int main ()
> {
>   unsigned char z = 0;
> 
>   return test (z);
> }
> ```
> 
> For this loop, niter_desc->max is 255, and niter is 'n - 1', then
> doloop.xx is "'n - 1' + 1",  'n - 1' should be compute at uchar type
> (QImode).
> It is ok to compute '- 1' and "+ 1" under the original shorter type.
> And it is ok if computing "n - 1" under original type and computing
> "+ 1" under larger mode.
> But it is not ok to compute both "- 1" and "+ 1" in large mode.
> 
> As discussed previously:  when 'n - 1' is U*_MAX (original short MAX),
> '(n - 1) + 1' becomes zero.  And the step for doloop is -1; then,
> on the larger type 'zero - 1' will be a very large number.
> 
> It is ok: Original
>   unsigned char doloop.5;
>   # doloop.5_4 = PHI <n_2(D)(2), doloop.5_6(5)>
>   doloop.5_6 = doloop.5_4 - 1;
> 
> it would be error if:
>   doloop.5_7 = (long unsigned int) n_2(D);
>   # doloop.5_4 = PHI <doloop.5_7(2), doloop.5_6(5)>
>   doloop.5_6 = doloop.5_4 - 1;
> 
> 
> >>> >
> >>> > Is this pre-loop extra add really offsetting the in-loop doloop
> >>> > improvements?
> >>> I'm not catching this question too much, sorry.  I guess your concern
> >>> is if the "+1" is an offset: it may not, "+1" may be just that doloop.xx
> >>> is decreasing niter until 0 (all number >0).
> >>> If misunderstand,  thanks for point out.
> >> 
> >> I'm questioning the argument that not being able to eliminate the +1-1
> >> pair effects the overall cost improvement for the doloop IV type change
> >> which hopefully is _not_ just loop IV setup cost but improving
> >> performance in the loop body itself?
> > Yes, I think so.  It would affect doloop IV setup cost, it may also affect
> > loop itself on some targets, if the target prefers a changed type:
> > using one jump-counter instruction for wider type, but cmp/jump instructions
> > for other types.
> > If the loop is nested in outer-loop, eliminating +1-1 would improving
> > outer-loop performance.

Yes.  So we're asking targets whether using a different IV mode for
the doloop is going to improve performance for the doloop related
instructions, for other uses IVOPTs usual cost model will apply.
Now that you only add one IV candidate for doloop it's a on/off
decision here?

Still the setup cost should still make the inner loop faster, you
can keep the < TYPE_MAX check and perform the -1 adjustment in the
original type but I think it's not a reason to not perform the
doloop IV promotion if we cannot do that.

> There is a patch that could mitigate "-1 +1 pair" in rtl part.
> https://gcc.gnu.org/g:8a15faa730f99100f6f3ed12663563356ec5a2c0
> 
> With that patch, one minor thing is doloop.xxx is using shorter type on some
> cases, then need subreg to access it.  I have this patch is trying to use
> better
> type and avoid to use subreg.

Yes, that's understood.

Richard.

> BR,
> 
> Jiufu Guo.
> 
> > 
> > BR,
> > Jiufu Guo.
> > 
> >> 
> >> Richard.
> >> 
> >>> >
> >>> >> >> >>
> >>> >> >> >> doloop_valid_p guarantees it is simple and doesn't wrap.
> >>> >> >> >>
> >>> >> >> >> > I'd have expected sth like
> >>> >> >> >> >
> >>> >> >> >> >    ntype = lang_hooks.types.type_for_mode (word_mode,
> >>> >> >> >> > TYPE_UNSIGNED
> >>> >> >> >> > (ntype));
> >>> >> >> >> >
> >>> >> >> >> > thus the decision made using a mode - which is also why I
> >>> >> >> >> > wonder
> >>> >> >> >> > if there's a way to query the target for this.  As you say,
> >>> >> >> >> > it _may_ be fast, so better check (somehow).
> >>> >> >>
> >>> >> >>
> >>> >> >> I was also thinking of using hooks like type_for_size/type_for_mode.
> >>> >> >>     /* Use type in word size may fast.  */
> >>> >> >>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD
> >>> >> >>         && Wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE
> >>> >> >> (ntype))))
> >>> >> >>       {
> >>> >> >>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
> >>> >> >>         base = fold_convert (ntype, base);
> >>> >> >>       }
> >>> >> >>
> >>> >> >> As you pointed out, this does not query the mode from targets.
> >>> >> >> As Segher pointed out "doloop_end" checks unsupported mode, while it
> >>> >> >> seems
> >>> >> >> not easy to use it in tree-ssa-loop-ivopts.c.
> >>> >> >> For implementations of doloop_end, tartgets like rs6000/aarch64/ia64
> >>> >> >> requires
> >>> >> >> Pmode/DImode; while there are other targets that work on other
> >>> >> >> 'mode'
> >>> >> >> (e.g.
> >>> >> >> SI).
> >>> >> >>
> >>> >> >>
> >>> >> >> In doloop_optimize, there is code:
> >>> >> >>
> >>> >> >> ```
> >>> >> >>     mode = desc->mode;
> >>> >> >> .....
> >>> >> >>     doloop_reg = gen_reg_rtx (mode);
> >>> >> >>     rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg,
> >>> >> >> start_label);
> >>> >> >>
> >>> >> >>     word_mode_size = GET_MODE_PRECISION (word_mode);
> >>> >> >>     word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1)
> >>> >> >>     -
> >>> >> >> 1;
> >>> >> >>     if (! doloop_seq
> >>> >> >>         && mode != word_mode
> >>> >> >>         /* Before trying mode different from the one in that # of
> >>> >> >> iterations is
> >>> >> >>            computed, we must be sure that the number of iterations
> >>> >> >> fits
> >>> >> >> into
> >>> >> >>            the new mode.  */
> >>> >> >>         && (word_mode_size >= GET_MODE_PRECISION (mode)
> >>> >> >>             || wi::leu_p (iterations_max, word_mode_max)))
> >>> >> >>       {
> >>> >> >>         if (word_mode_size > GET_MODE_PRECISION (mode))
> >>> >> >>           count = simplify_gen_unary (ZERO_EXTEND, word_mode, count,
> >>> >> >>         mode);
> >>> >> >>         else
> >>> >> >>           count = lowpart_subreg (word_mode, count, mode);
> >>> >> >>         PUT_MODE (doloop_reg, word_mode);
> >>> >> >>         doloop_seq = targetm.gen_doloop_end (doloop_reg,
> >>> >> >>         start_label);
> >>> >> >>       }
> >>> >> >>     if (! doloop_seq)
> >>> >> >>       {
> >>> >> >>         if (dump_file)
> >>> >> >>           fprintf (dump_file,
> >>> >> >>                    "Doloop: Target unwilling to use doloop
> >>> >> >>         pattern!\n");
> >>> >> >>         return false;
> >>> >> >>       }
> >>> >> >> ```
> >>> >> >> The above code first tries the mode of niter_desc by call
> >>> >> >> targetm.gen_doloop_end
> >>> >> >> to see if the target can generate doloop insns, if fail, then try to
> >>> >> >> use
> >>> >> >> 'word_mode' against gen_doloop_end.
> >>> >> >>
> >>> >> >>
> >>> >> >> >>
> >>> >> >> >> Almost all targets just use Pmode, but there is no such guarantee
> >>> >> >> >> I
> >>> >> >> >> think, and esp. some targets that do not have machine insns for
> >>> >> >> >> this
> >>> >> >> >> (but want to generate different code for this anyway) can do
> >>> >> >> >> pretty
> >>> >> >> >> much
> >>> >> >> >> anything.
> >>> >> >> >>
> >>> >> >> >> Maybe using just Pmode here is good enough though?
> >>> >> >> >
> >>> >> >> > I think Pmode is a particularly bad choice and I'd prefer
> >>> >> >> > word_mode
> >>> >> >> > if we go for any hardcoded mode.  s390x for example seems to
> >>> >> >> > handle
> >>> >> >> > both SImode and DImode (but names the helper gen_doloop_si64
> >>> >> >> > for SImode?!).  But indeed it looks like somehow querying
> >>> >> >> > doloop_end
> >>> >> >> > is going to be difficult since the expander doesn't have any mode,
> >>> >> >> > so we'd have to actually try emit RTL here.
> >>> >> >>
> >>> >> >> Instead of using hardcode mode, maybe we could add a hook for
> >>> >> >> targets to
> >>> >> >> return
> >>> >> >> the preferred mode.
> >>> >> >
> >>> >> > That's a possiblity of course.  Like the following which just shows
> >>> >> > the
> >>> >> > default implementation then (pass in current mode, return a more
> >>> >> > preferred
> >>> >> > mode or the mode itself)
> >>> >> >
> >>> >> > enum machine_mode
> >>> >> > prefered_doloop_mode (enum machine_mode mode)
> >>> >> > {
> >>> >> >   return mode;
> >>> >> > }
> >>> >> >
> >>> >> Yes, thanks!
> >>> >>
> >>> >> Checking current do_loop_end in targets, in general, when do_loop_end
> >>> >> requires
> >>> >> SI mode, the target is defining Pmode as SImode and word_mode (from
> >>> >> BITS_PER_WORD
> >>> >> which defaults from UNITS_PER_WORD) is also defined to align with SI
> >>> >> mode.
> >>> >> When do_loop_end requires DI mode, the target is defining Pmode as
> >>> >> DImode
> >>> >> and word_mode/UNITS_PER_WORD is also defined to align with DI mode.
> >>> >>
> >>> >> So, if aggressively, then by default we may just return word_mode.
> >>> >
> >>> > Note we still have to check whether the prefered mode is valid
> >>> > (passing in TImode but then returning DImode would be wrong).
> >>> Yes, some code like in doloop_optimize (rtl code)
> >>> ```
> >>> mode != word_mode
> >>> && (word_mode_size >= GET_MODE_PRECISION (mode)
> >>>     || wi::leu_p (iterations_max, word_mode_max))
> >>> ```
> >>> 
> >>> Thanks again for your comments!
> >>> 
> >>> BR,
> >>> Jiufu Guo
> >>> >
> >>> > Richard.
> >>> >
> >>> >> BR,
> >>> >>
> >>> >> Jiufu Guo.
> >>> >>
> >>> >>
> >>> >> >>
> >>> >> >> Thanks for those valuable comments!
> >>> >> >>
> >>> >> >> Jiufu Guo
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >> >
> >>> >> >> > Richard.
> >>> >> >>
> >>> >> >>
> >>> >>
> >>> 
> >>> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
  2021-07-13  7:09                       ` Richard Biener
@ 2021-07-13  8:16                         ` guojiufu
  0 siblings, 0 replies; 20+ messages in thread
From: guojiufu @ 2021-07-13  8:16 UTC (permalink / raw)
  To: Richard Biener
  Cc: Segher Boessenkool, gcc-patches, wschmidt, jlaw, dje.gcc, Gcc-patches

On 2021-07-13 15:09, Richard Biener wrote:
> On Tue, 13 Jul 2021, guojiufu wrote:
> 
>> On 2021-07-12 23:53, guojiufu via Gcc-patches wrote:
>> > On 2021-07-12 22:46, Richard Biener wrote:
>> >> On Mon, 12 Jul 2021, guojiufu wrote:
>> >>
>> >>> On 2021-07-12 18:02, Richard Biener wrote:
>> >>> > On Mon, 12 Jul 2021, guojiufu wrote:
>> >>> >
>> >>> >> On 2021-07-12 16:57, Richard Biener wrote:
>> >>> >> > On Mon, 12 Jul 2021, guojiufu wrote:
>> >>> >> >
>> >>> >> >> On 2021-07-12 14:20, Richard Biener wrote:
>> >>> >> >> > On Fri, 9 Jul 2021, Segher Boessenkool wrote:
>> >>> >> >> >
>> >>> >> >> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
>> >>> >> >> >> > I wonder if there's a way to query the target what modes the
>> >>> >> >> >> > doloop
>> >>> >> >> >> > pattern can handle (not being too familiar with the doloop
>> >>> >> >> >> > code).
>> >>> >> >> >>
>> >>> >> >> >> You can look what modes are allowed for operand 0 of doloop_end,
>> >>> >> >> >> perhaps?  Although that is a define_expand, not a define_insn, so
>> >>> >> >> >> it
>> >>> >> >> >> is
>> >>> >> >> >> hard to introspect.
>> >>> >> >> >>
>> >>> >> >> >> > Why do you need to do any checks besides the new type being
>> >>> >> >> >> > able to
>> >>> >> >> >> > represent all IV values?  The original doloop IV will never
>> >>> >> >> >> > wrap
>> >>> >> >> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will
>> >>> >> >> >> > become
>> >>> >> >> >> > zero ... I suppose the doloop might still do the correct thing
>> >>> >> >> >> > here
>> >>> >> >> >> > but it also still will with a IV with larger type).
>> >>> >> >>
>> >>> >> >> The issue comes from U*_MAX (original short MAX), as you said: on
>> >>> >> >> which
>> >>> >> >> niter + 1 becomes zero.  And because the step for doloop is -1;
>> >>> >> >> then, on
>> >>> >> >> larger type 'zero - 1' will be a very large number on larger type
>> >>> >> >> (e.g. 0xff...ff); but on the original short type 'zero - 1' is a
>> >>> >> >> small
>> >>> >> >> value
>> >>> >> >> (e.g. "0xff").
>> >>> >> >
>> >>> >> > But for the larger type the small type MAX + 1 fits and does not
>> >>> >> > yield
>> >>> >> > zero so it should still work exactly as before, no?  Of course you
>> >>> >> > have to compute the + 1 in the larger type.
>> >>> >> >
>> >>> >> You are right, if compute the "+ 1" in the larger type it is ok, as
>> >>> >> below
>> >>> >> code:
>> >>> >> ```
>> >>> >>    /* Use type in word size may fast.  */
>> >>> >>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD)
>> >>> >>       {
>> >>> >>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>> >>> >>         niter = fold_convert (ntype, niter);
>> >>> >>       }
>> >>> >>
>> >>> >>     tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
>> >>> >>                              build_int_cst (ntype, 1));
>> >>> >>
>> >>> >>
>> >>> >>     add_candidate (data, base, build_int_cst (ntype, -1), true, NULL,
>> >>> >> NULL,
>> >>> >> true);
>> >>> >> ```
>> >>> >> The issue of this is, this code generates more stmt for doloop.xxx:
>> >>> >>   _12 = (unsigned int) xx(D);
>> >>> >>   _10 = _12 + 4294967295;
>> >>> >>   _24 = (long unsigned int) _10;
>> >>> >>   doloop.6_8 = _24 + 1;
>> >>> >>
>> >>> >> if use previous patch, "+ 1" on original type, then the stmts will
>> >>> >> looks
>> >>> >> like:
>> >>> >>   _12 = (unsigned int) xx(D);
>> >>> >>   doloop.6_8 = (long unsigned int) _12;
>> >>> >>
>> >>> >> This is the reason for checking
>> >>> >>    wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype)))
>> >>> >
>> >>> > But this then only works when there's an upper bound on the number
>> >>> > of iterations.  Note you should not use TYPE_MAX_VALUE here but
>> >>> > you can instead use
>> >>> >
>> >>> >      wi::ltu_p (niter_desc->max, wi::to_widest (wi::max_value
>> >>> > (TYPE_PRECISION (ntype), TYPE_SIGN (ntype))));
>> >>>
>> >>> Ok, Thanks!
>> >>> I remember you mentioned that:
>> >>> widest_int::from (wi::max_value (TYPE_PRECISION (ntype), TYPE_SIGN
>> >>> (ntype)),
>> >>> TYPE_SIGN (ntype))
>> >>> would be better than
>> >>> wi::to_widest (TYPE_MAX_VALUE (ntype)).
>> >>>
>> >>> It seems that:
>> >>> "TYPE_MAX_VALUE (ntype)" is "NUMERICAL_TYPE_CHECK
>> >>> (NODE)->type_non_common.maxval"
>> >>> which do a numerical-check and return the field of maxval.  And then call
>> >>> to
>> >>> wi::to_widest
>> >>>
>> >>> The other code "widest_int::from (wi::max_value (..,..),..)" calls
>> >>> wi::max_value
>> >>> and widest_int::from.
>> >>>
>> >>> I'm wondering if wi::to_widest (TYPE_MAX_VALUE (ntype)) is cheaper?
>> >>
>> >> TYPE_MAX_VALUE can be "suprising", it does not necessarily match the
>> >> underlying modes precision.  At some point we've tried to eliminate
>> >> most of its uses, not sure what the situation/position is right now.
>> > Ok, get it, thanks.
>> > I will use "widest_int::from (wi::max_value (..,..),..)".
>> >
>> >>
>> >>> > I think the -1 above comes from number of latch iterations vs. header
>> >>> > entries - it's a common source for this kind of issues.  range analysis
>> >>> > might be able to prove that we can still merge the two adds even with
>> >>> > the intermediate extension.
>> >>> Yes, as you mentioned here, it relates to number of latch iterations
>> >>> For loop looks like : while (l < n) or for (i = 0; i < n; i++)
>> >>> This kind of loop, the niter is used to be 'n - 1' after transformed
>> >>> into 'do-while' form.
>> > For this kind of loop, the max value for the number of iteration "n - 1"
>> > would be "max_value_type(n) - 1" which is wi::ltu than max_value_type.
>> > This kind of loop is already common, and we could use wi::ltu (max,
>> > max_value_type)
>> > to check.
>> >
>> > For loop looks like:
>> >   do ;
>> >   while (n-- > 0); /* while  (n-- > low); */
>> >
>> > The niter_desc->max will wi::eq to max_value_type, and niter would be "n",
>> > and then doloop.xx is 'n+1'.
>> >
>> >>> I would see how to merge these two adds safely at this point
>> >>> when generating doloop iv. (maybe range info, thanks!
>> >>>
>> 
>> For some cases, it may not easy to merge -1 +1 pair:
>> (from doloop-1.c)
>> ```
>> int __attribute__ ((noinline))
>> test (unsigned char n)
>> {
>>   do ;
>>   while (--n > 0);
>>   return n;
>> }
>> 
>> int main ()
>> {
>>   unsigned char z = 0;
>> 
>>   return test (z);
>> }
>> ```
>> 
>> For this loop, niter_desc->max is 255, and niter is 'n - 1', then
>> doloop.xx is "'n - 1' + 1",  'n - 1' should be compute at uchar type
>> (QImode).
>> It is ok to compute '- 1' and "+ 1" under the original shorter type.
>> And it is ok if computing "n - 1" under original type and computing
>> "+ 1" under larger mode.
>> But it is not ok to compute both "- 1" and "+ 1" in large mode.
>> 
>> As discussed previously:  when 'n - 1' is U*_MAX (original short MAX),
>> '(n - 1) + 1' becomes zero.  And the step for doloop is -1; then,
>> on the larger type 'zero - 1' will be a very large number.
>> 
>> It is ok: Original
>>   unsigned char doloop.5;
>>   # doloop.5_4 = PHI <n_2(D)(2), doloop.5_6(5)>
>>   doloop.5_6 = doloop.5_4 - 1;
>> 
>> it would be error if:
>>   doloop.5_7 = (long unsigned int) n_2(D);
>>   # doloop.5_4 = PHI <doloop.5_7(2), doloop.5_6(5)>
>>   doloop.5_6 = doloop.5_4 - 1;
>> 
>> 
>> >>> >
>> >>> > Is this pre-loop extra add really offsetting the in-loop doloop
>> >>> > improvements?
>> >>> I'm not catching this question too much, sorry.  I guess your concern
>> >>> is if the "+1" is an offset: it may not, "+1" may be just that doloop.xx
>> >>> is decreasing niter until 0 (all number >0).
>> >>> If misunderstand,  thanks for point out.
>> >>
>> >> I'm questioning the argument that not being able to eliminate the +1-1
>> >> pair effects the overall cost improvement for the doloop IV type change
>> >> which hopefully is _not_ just loop IV setup cost but improving
>> >> performance in the loop body itself?
>> > Yes, I think so.  It would affect doloop IV setup cost, it may also affect
>> > loop itself on some targets, if the target prefers a changed type:
>> > using one jump-counter instruction for wider type, but cmp/jump instructions
>> > for other types.
>> > If the loop is nested in outer-loop, eliminating +1-1 would improving
>> > outer-loop performance.
> 
> Yes.  So we're asking targets whether using a different IV mode for
> the doloop is going to improve performance for the doloop related
> instructions, for other uses IVOPTs usual cost model will apply.
> Now that you only add one IV candidate for doloop it's a on/off
> decision here?
Yes, there is one doloop IV, so if there is a preferred mode on the 
target,
then try to use the mode for the doloop IV.
I was once thinking of adding more doloop IVs and use the cost model to
select the best one.  While it seems not necessary: it may not get a 
better
decision than update the doloop IV mode.

> 
> Still the setup cost should still make the inner loop faster, you
> can keep the < TYPE_MAX check and perform the -1 adjustment in the
> original type but I think it's not a reason to not perform the
> doloop IV promotion if we cannot do that.

Ok, during refine patch, I would add code to perform:
-1 in the original type and +1 in promoted type if "-1"/"+1" pair is
not suitable to compute together in the preferred mode.

Thanks for your suggestions!

BR,
Jiufu

> 
>> There is a patch that could mitigate "-1 +1 pair" in rtl part.
>> https://gcc.gnu.org/g:8a15faa730f99100f6f3ed12663563356ec5a2c0
>> 
>> With that patch, one minor thing is doloop.xxx is using shorter type 
>> on some
>> cases, then need subreg to access it.  I have this patch is trying to 
>> use
>> better
>> type and avoid to use subreg.
> 
> Yes, that's understood.
> 
> Richard.
> 
>> BR,
>> 
>> Jiufu Guo.
>> 
>> >
>> > BR,
>> > Jiufu Guo.
>> >
>> >>
>> >> Richard.
>> >>
>> >>> >
>> >>> >> >> >>
>> >>> >> >> >> doloop_valid_p guarantees it is simple and doesn't wrap.
>> >>> >> >> >>
>> >>> >> >> >> > I'd have expected sth like
>> >>> >> >> >> >
>> >>> >> >> >> >    ntype = lang_hooks.types.type_for_mode (word_mode,
>> >>> >> >> >> > TYPE_UNSIGNED
>> >>> >> >> >> > (ntype));
>> >>> >> >> >> >
>> >>> >> >> >> > thus the decision made using a mode - which is also why I
>> >>> >> >> >> > wonder
>> >>> >> >> >> > if there's a way to query the target for this.  As you say,
>> >>> >> >> >> > it _may_ be fast, so better check (somehow).
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> I was also thinking of using hooks like type_for_size/type_for_mode.
>> >>> >> >>     /* Use type in word size may fast.  */
>> >>> >> >>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD
>> >>> >> >>         && Wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE
>> >>> >> >> (ntype))))
>> >>> >> >>       {
>> >>> >> >>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>> >>> >> >>         base = fold_convert (ntype, base);
>> >>> >> >>       }
>> >>> >> >>
>> >>> >> >> As you pointed out, this does not query the mode from targets.
>> >>> >> >> As Segher pointed out "doloop_end" checks unsupported mode, while it
>> >>> >> >> seems
>> >>> >> >> not easy to use it in tree-ssa-loop-ivopts.c.
>> >>> >> >> For implementations of doloop_end, tartgets like rs6000/aarch64/ia64
>> >>> >> >> requires
>> >>> >> >> Pmode/DImode; while there are other targets that work on other
>> >>> >> >> 'mode'
>> >>> >> >> (e.g.
>> >>> >> >> SI).
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> In doloop_optimize, there is code:
>> >>> >> >>
>> >>> >> >> ```
>> >>> >> >>     mode = desc->mode;
>> >>> >> >> .....
>> >>> >> >>     doloop_reg = gen_reg_rtx (mode);
>> >>> >> >>     rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg,
>> >>> >> >> start_label);
>> >>> >> >>
>> >>> >> >>     word_mode_size = GET_MODE_PRECISION (word_mode);
>> >>> >> >>     word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1)
>> >>> >> >>     -
>> >>> >> >> 1;
>> >>> >> >>     if (! doloop_seq
>> >>> >> >>         && mode != word_mode
>> >>> >> >>         /* Before trying mode different from the one in that # of
>> >>> >> >> iterations is
>> >>> >> >>            computed, we must be sure that the number of iterations
>> >>> >> >> fits
>> >>> >> >> into
>> >>> >> >>            the new mode.  */
>> >>> >> >>         && (word_mode_size >= GET_MODE_PRECISION (mode)
>> >>> >> >>             || wi::leu_p (iterations_max, word_mode_max)))
>> >>> >> >>       {
>> >>> >> >>         if (word_mode_size > GET_MODE_PRECISION (mode))
>> >>> >> >>           count = simplify_gen_unary (ZERO_EXTEND, word_mode, count,
>> >>> >> >>         mode);
>> >>> >> >>         else
>> >>> >> >>           count = lowpart_subreg (word_mode, count, mode);
>> >>> >> >>         PUT_MODE (doloop_reg, word_mode);
>> >>> >> >>         doloop_seq = targetm.gen_doloop_end (doloop_reg,
>> >>> >> >>         start_label);
>> >>> >> >>       }
>> >>> >> >>     if (! doloop_seq)
>> >>> >> >>       {
>> >>> >> >>         if (dump_file)
>> >>> >> >>           fprintf (dump_file,
>> >>> >> >>                    "Doloop: Target unwilling to use doloop
>> >>> >> >>         pattern!\n");
>> >>> >> >>         return false;
>> >>> >> >>       }
>> >>> >> >> ```
>> >>> >> >> The above code first tries the mode of niter_desc by call
>> >>> >> >> targetm.gen_doloop_end
>> >>> >> >> to see if the target can generate doloop insns, if fail, then try to
>> >>> >> >> use
>> >>> >> >> 'word_mode' against gen_doloop_end.
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> >>
>> >>> >> >> >> Almost all targets just use Pmode, but there is no such guarantee
>> >>> >> >> >> I
>> >>> >> >> >> think, and esp. some targets that do not have machine insns for
>> >>> >> >> >> this
>> >>> >> >> >> (but want to generate different code for this anyway) can do
>> >>> >> >> >> pretty
>> >>> >> >> >> much
>> >>> >> >> >> anything.
>> >>> >> >> >>
>> >>> >> >> >> Maybe using just Pmode here is good enough though?
>> >>> >> >> >
>> >>> >> >> > I think Pmode is a particularly bad choice and I'd prefer
>> >>> >> >> > word_mode
>> >>> >> >> > if we go for any hardcoded mode.  s390x for example seems to
>> >>> >> >> > handle
>> >>> >> >> > both SImode and DImode (but names the helper gen_doloop_si64
>> >>> >> >> > for SImode?!).  But indeed it looks like somehow querying
>> >>> >> >> > doloop_end
>> >>> >> >> > is going to be difficult since the expander doesn't have any mode,
>> >>> >> >> > so we'd have to actually try emit RTL here.
>> >>> >> >>
>> >>> >> >> Instead of using hardcode mode, maybe we could add a hook for
>> >>> >> >> targets to
>> >>> >> >> return
>> >>> >> >> the preferred mode.
>> >>> >> >
>> >>> >> > That's a possiblity of course.  Like the following which just shows
>> >>> >> > the
>> >>> >> > default implementation then (pass in current mode, return a more
>> >>> >> > preferred
>> >>> >> > mode or the mode itself)
>> >>> >> >
>> >>> >> > enum machine_mode
>> >>> >> > prefered_doloop_mode (enum machine_mode mode)
>> >>> >> > {
>> >>> >> >   return mode;
>> >>> >> > }
>> >>> >> >
>> >>> >> Yes, thanks!
>> >>> >>
>> >>> >> Checking current do_loop_end in targets, in general, when do_loop_end
>> >>> >> requires
>> >>> >> SI mode, the target is defining Pmode as SImode and word_mode (from
>> >>> >> BITS_PER_WORD
>> >>> >> which defaults from UNITS_PER_WORD) is also defined to align with SI
>> >>> >> mode.
>> >>> >> When do_loop_end requires DI mode, the target is defining Pmode as
>> >>> >> DImode
>> >>> >> and word_mode/UNITS_PER_WORD is also defined to align with DI mode.
>> >>> >>
>> >>> >> So, if aggressively, then by default we may just return word_mode.
>> >>> >
>> >>> > Note we still have to check whether the prefered mode is valid
>> >>> > (passing in TImode but then returning DImode would be wrong).
>> >>> Yes, some code like in doloop_optimize (rtl code)
>> >>> ```
>> >>> mode != word_mode
>> >>> && (word_mode_size >= GET_MODE_PRECISION (mode)
>> >>>     || wi::leu_p (iterations_max, word_mode_max))
>> >>> ```
>> >>>
>> >>> Thanks again for your comments!
>> >>>
>> >>> BR,
>> >>> Jiufu Guo
>> >>> >
>> >>> > Richard.
>> >>> >
>> >>> >> BR,
>> >>> >>
>> >>> >> Jiufu Guo.
>> >>> >>
>> >>> >>
>> >>> >> >>
>> >>> >> >> Thanks for those valuable comments!
>> >>> >> >>
>> >>> >> >> Jiufu Guo
>> >>> >> >>
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> >
>> >>> >> >> > Richard.
>> >>> >> >>
>> >>> >> >>
>> >>> >>
>> >>>
>> >>>
>> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
  2021-07-12  6:20     ` Richard Biener
  2021-07-12  8:21       ` guojiufu
@ 2021-07-13 15:38       ` Segher Boessenkool
  2021-07-14  3:10         ` guojiufu
  1 sibling, 1 reply; 20+ messages in thread
From: Segher Boessenkool @ 2021-07-13 15:38 UTC (permalink / raw)
  To: Richard Biener
  Cc: Jiufu Guo, gcc-patches, wschmidt, dje.gcc, jlaw, amker.cheng

On Mon, Jul 12, 2021 at 08:20:14AM +0200, Richard Biener wrote:
> On Fri, 9 Jul 2021, Segher Boessenkool wrote:
> > Almost all targets just use Pmode, but there is no such guarantee I
> > think, and esp. some targets that do not have machine insns for this
> > (but want to generate different code for this anyway) can do pretty much
> > anything.
> > 
> > Maybe using just Pmode here is good enough though?
> 
> I think Pmode is a particularly bad choice and I'd prefer word_mode
> if we go for any hardcoded mode.

In many important cases you use a pointer as iteration variable.

Is word_mode register size on most current targets?

> s390x for example seems to handle
> both SImode and DImode (but names the helper gen_doloop_si64
> for SImode?!).

Yes, so Pmode will work fine for 390.  It would be nice if we could
allow multiple modes here, certainly.  Can we?

> But indeed it looks like somehow querying doloop_end
> is going to be difficult since the expander doesn't have any mode,
> so we'd have to actually try emit RTL here.

Or add a well-designed target macro for this.  "Which modes do we like
for IVs", perhaps?


Segher

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
  2021-07-13  2:09                     ` guojiufu
  2021-07-13  7:09                       ` Richard Biener
@ 2021-07-13 15:51                       ` Segher Boessenkool
  2021-07-14  2:38                         ` guojiufu
  1 sibling, 1 reply; 20+ messages in thread
From: Segher Boessenkool @ 2021-07-13 15:51 UTC (permalink / raw)
  To: guojiufu
  Cc: Richard Biener, gcc-patches, wschmidt, jlaw, dje.gcc, Gcc-patches

On Tue, Jul 13, 2021 at 10:09:25AM +0800, guojiufu wrote:
> >For loop looks like:
> >  do ;
> >  while (n-- > 0); /* while  (n-- > low); */

(This whole loop as written will be optimised away, but :-) )

> There is a patch that could mitigate "-1 +1 pair" in rtl part.
> https://gcc.gnu.org/g:8a15faa730f99100f6f3ed12663563356ec5a2c0

Does that solve PR67288 (and its many duplicates)?


Segher

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
  2021-07-13 15:51                       ` Segher Boessenkool
@ 2021-07-14  2:38                         ` guojiufu
  0 siblings, 0 replies; 20+ messages in thread
From: guojiufu @ 2021-07-14  2:38 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Richard Biener, gcc-patches, wschmidt, jlaw, dje.gcc, Gcc-patches

On 2021-07-13 23:51, Segher Boessenkool wrote:
> On Tue, Jul 13, 2021 at 10:09:25AM +0800, guojiufu wrote:
>> >For loop looks like:
>> >  do ;
>> >  while (n-- > 0); /* while  (n-- > low); */
> 
> (This whole loop as written will be optimised away, but :-) )
At -O2, the loop is optimized away.
At -O1, the loop is there.
         .cfi_startproc
         addi %r3,%r3,1
.L2:
         addi %r9,%r3,-1
         mr %r3,%r9
         andi. %r9,%r9,0xff
         bne %cr0,.L2
The patch v2 
(https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574596.html)
could help it to be:
         .cfi_startproc
         addi %r3,%r3,1
         mtctr %r3
.L2:
         addi %r3,%r3,-1
         bdnz .L2

> 
>> There is a patch that could mitigate "-1 +1 pair" in rtl part.
>> https://gcc.gnu.org/g:8a15faa730f99100f6f3ed12663563356ec5a2c0
> 
> Does that solve PR67288 (and its many duplicates)?
Had a test, yes, the "-1 +1" issue in PR67288 was fixed by that patch.

BR,
Jiufu.
> 
> 
> Segher

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
  2021-07-13 15:38       ` Segher Boessenkool
@ 2021-07-14  3:10         ` guojiufu
  2021-07-14  7:17           ` Richard Biener
  0 siblings, 1 reply; 20+ messages in thread
From: guojiufu @ 2021-07-14  3:10 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Richard Biener, gcc-patches, wschmidt, dje.gcc, jlaw, amker.cheng

On 2021-07-13 23:38, Segher Boessenkool wrote:
> On Mon, Jul 12, 2021 at 08:20:14AM +0200, Richard Biener wrote:
>> On Fri, 9 Jul 2021, Segher Boessenkool wrote:
>> > Almost all targets just use Pmode, but there is no such guarantee I
>> > think, and esp. some targets that do not have machine insns for this
>> > (but want to generate different code for this anyway) can do pretty much
>> > anything.
>> >
>> > Maybe using just Pmode here is good enough though?
>> 
>> I think Pmode is a particularly bad choice and I'd prefer word_mode
>> if we go for any hardcoded mode.
> 
> In many important cases you use a pointer as iteration variable.
> 
> Is word_mode register size on most current targets?
Had a search on the implementation, word_mode is the mode on size 
BITS_PER_WORD
in MODE_INTs.  Actually, when targets define Pmode and BITS_PER_WORD, 
these two
macros are aligned -:), and it seems most targets define both these two 
macros.

> 
>> s390x for example seems to handle
>> both SImode and DImode (but names the helper gen_doloop_si64
>> for SImode?!).
> 
> Yes, so Pmode will work fine for 390.  It would be nice if we could
> allow multiple modes here, certainly.  Can we?

:), for other IVs, multiple modes are allowed to add as candidates;
while only one doloop iv is added.  Comparing the supporting more
doloop IVs, it seems changing the doloop iv mode is easy relatively
for me.  So, the patch is trying to update doloop iv.

> 
>> But indeed it looks like somehow querying doloop_end
>> is going to be difficult since the expander doesn't have any mode,
>> so we'd have to actually try emit RTL here.
> 
> Or add a well-designed target macro for this.  "Which modes do we like
> for IVs", perhaps?

In the new patch, a target hook preferred_doloop_mode is introduced. 
While
this hook is only for doloop iv at this time.
Maybe we could have preferred_iv_mode if needed. In the current code, 
IVs
are free to be added in different types, and the cost model is applied
to determine which IV may be better. The iv mode would be one factor for 
cost.


BR,
Jiufu

> 
> 
> Segher

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
  2021-07-14  3:10         ` guojiufu
@ 2021-07-14  7:17           ` Richard Biener
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Biener @ 2021-07-14  7:17 UTC (permalink / raw)
  To: guojiufu
  Cc: Segher Boessenkool, Richard Biener, Jeff Law, GCC Patches,
	Bill Schmidt, David Edelsohn

On Wed, Jul 14, 2021 at 5:11 AM guojiufu via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On 2021-07-13 23:38, Segher Boessenkool wrote:
> > On Mon, Jul 12, 2021 at 08:20:14AM +0200, Richard Biener wrote:
> >> On Fri, 9 Jul 2021, Segher Boessenkool wrote:
> >> > Almost all targets just use Pmode, but there is no such guarantee I
> >> > think, and esp. some targets that do not have machine insns for this
> >> > (but want to generate different code for this anyway) can do pretty much
> >> > anything.
> >> >
> >> > Maybe using just Pmode here is good enough though?
> >>
> >> I think Pmode is a particularly bad choice and I'd prefer word_mode
> >> if we go for any hardcoded mode.
> >
> > In many important cases you use a pointer as iteration variable.
> >
> > Is word_mode register size on most current targets?
> Had a search on the implementation, word_mode is the mode on size
> BITS_PER_WORD
> in MODE_INTs.  Actually, when targets define Pmode and BITS_PER_WORD,
> these two
> macros are aligned -:), and it seems most targets define both these two
> macros.
>
> >
> >> s390x for example seems to handle
> >> both SImode and DImode (but names the helper gen_doloop_si64
> >> for SImode?!).
> >
> > Yes, so Pmode will work fine for 390.  It would be nice if we could
> > allow multiple modes here, certainly.  Can we?
>
> :), for other IVs, multiple modes are allowed to add as candidates;
> while only one doloop iv is added.  Comparing the supporting more
> doloop IVs, it seems changing the doloop iv mode is easy relatively
> for me.  So, the patch is trying to update doloop iv.
>
> >
> >> But indeed it looks like somehow querying doloop_end
> >> is going to be difficult since the expander doesn't have any mode,
> >> so we'd have to actually try emit RTL here.
> >
> > Or add a well-designed target macro for this.  "Which modes do we like
> > for IVs", perhaps?
>
> In the new patch, a target hook preferred_doloop_mode is introduced.
> While
> this hook is only for doloop iv at this time.
> Maybe we could have preferred_iv_mode if needed. In the current code,
> IVs
> are free to be added in different types, and the cost model is applied
> to determine which IV may be better. The iv mode would be one factor for
> cost.

I guess IVOPTs could directly look at PROMOTE_MODE & friends to
decide whether the IV increment has extra cost due to subregs/extensions
and account for that.  The rest should be really driven by the cost to express
IV uses, I don't think putting another target hook in here will help
reduce GIGO.

Richard.

>
> BR,
> Jiufu
>
> >
> >
> > Segher

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2021-07-14  7:18 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-09  2:07 [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837] Jiufu Guo
2021-07-09  6:43 ` Richard Biener
2021-07-09 17:08   ` Segher Boessenkool
2021-07-12  6:20     ` Richard Biener
2021-07-12  8:21       ` guojiufu
2021-07-12  8:57         ` Richard Biener
2021-07-12  9:43           ` guojiufu
2021-07-12 10:02             ` Richard Biener
2021-07-12 14:08               ` guojiufu
2021-07-12 14:46                 ` Richard Biener
2021-07-12 15:53                   ` guojiufu
2021-07-12 15:59                     ` guojiufu
2021-07-13  2:09                     ` guojiufu
2021-07-13  7:09                       ` Richard Biener
2021-07-13  8:16                         ` guojiufu
2021-07-13 15:51                       ` Segher Boessenkool
2021-07-14  2:38                         ` guojiufu
2021-07-13 15:38       ` Segher Boessenkool
2021-07-14  3:10         ` guojiufu
2021-07-14  7:17           ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).