Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: guojiufu <guojiufu@linux.ibm.com>
To: guojiufu <guojiufu@linux.ibm.com>
Cc: Richard Biener <rguenther@suse.de>,
	Segher Boessenkool <segher@kernel.crashing.org>,
	gcc-patches@gcc.gnu.org, wschmidt@linux.ibm.com,
	jlaw@tachyum.com, dje.gcc@gmail.com,
	Gcc-patches
	<gcc-patches-bounces+guojiufu=linux.ibm.com@gcc.gnu.org>
Subject: Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
Date: Mon, 12 Jul 2021 23:59:16 +0800	[thread overview]
Message-ID: <595a8cc8b2d12d2cc30c49d9799c3360@imap.linux.ibm.com> (raw)
In-Reply-To: <5200541b7141d6ca90b18bc6721dea60@imap.linux.ibm.com>

On 2021-07-12 23:53, guojiufu via Gcc-patches wrote:
> On 2021-07-12 22:46, Richard Biener wrote:
>> On Mon, 12 Jul 2021, guojiufu wrote:
>> 
>>> On 2021-07-12 18:02, Richard Biener wrote:
>>> > On Mon, 12 Jul 2021, guojiufu wrote:
>>> >
>>> >> On 2021-07-12 16:57, Richard Biener wrote:
>>> >> > On Mon, 12 Jul 2021, guojiufu wrote:
>>> >> >
>>> >> >> On 2021-07-12 14:20, Richard Biener wrote:
>>> >> >> > On Fri, 9 Jul 2021, Segher Boessenkool wrote:
>>> >> >> >
>>> >> >> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
>>> >> >> >> > I wonder if there's a way to query the target what modes the doloop
>>> >> >> >> > pattern can handle (not being too familiar with the doloop code).
>>> >> >> >>
>>> >> >> >> You can look what modes are allowed for operand 0 of doloop_end,
>>> >> >> >> perhaps?  Although that is a define_expand, not a define_insn, so it
>>> >> >> >> is
>>> >> >> >> hard to introspect.
>>> >> >> >>
>>> >> >> >> > Why do you need to do any checks besides the new type being able to
>>> >> >> >> > represent all IV values?  The original doloop IV will never wrap
>>> >> >> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will
>>> >> >> >> > become
>>> >> >> >> > zero ... I suppose the doloop might still do the correct thing here
>>> >> >> >> > but it also still will with a IV with larger type).
>>> >> >>
>>> >> >> The issue comes from U*_MAX (original short MAX), as you said: on which
>>> >> >> niter + 1 becomes zero.  And because the step for doloop is -1; then, on
>>> >> >> larger type 'zero - 1' will be a very large number on larger type
>>> >> >> (e.g. 0xff...ff); but on the original short type 'zero - 1' is a small
>>> >> >> value
>>> >> >> (e.g. "0xff").
>>> >> >
>>> >> > But for the larger type the small type MAX + 1 fits and does not yield
>>> >> > zero so it should still work exactly as before, no?  Of course you
>>> >> > have to compute the + 1 in the larger type.
>>> >> >
>>> >> You are right, if compute the "+ 1" in the larger type it is ok, as below
>>> >> code:
>>> >> ```
>>> >>    /* Use type in word size may fast.  */
>>> >>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD)
>>> >>       {
>>> >>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>>> >>         niter = fold_convert (ntype, niter);
>>> >>       }
>>> >>
>>> >>     tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
>>> >>                              build_int_cst (ntype, 1));
>>> >>
>>> >>
>>> >>     add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, NULL,
>>> >> true);
>>> >> ```
>>> >> The issue of this is, this code generates more stmt for doloop.xxx:
>>> >>   _12 = (unsigned int) xx(D);
>>> >>   _10 = _12 + 4294967295;
>>> >>   _24 = (long unsigned int) _10;
>>> >>   doloop.6_8 = _24 + 1;
>>> >>
>>> >> if use previous patch, "+ 1" on original type, then the stmts will looks
>>> >> like:
>>> >>   _12 = (unsigned int) xx(D);
>>> >>   doloop.6_8 = (long unsigned int) _12;
>>> >>
>>> >> This is the reason for checking
>>> >>    wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype)))
>>> >
>>> > But this then only works when there's an upper bound on the number
>>> > of iterations.  Note you should not use TYPE_MAX_VALUE here but
>>> > you can instead use
>>> >
>>> >      wi::ltu_p (niter_desc->max, wi::to_widest (wi::max_value
>>> > (TYPE_PRECISION (ntype), TYPE_SIGN (ntype))));
>>> 
>>> Ok, Thanks!
>>> I remember you mentioned that:
>>> widest_int::from (wi::max_value (TYPE_PRECISION (ntype), TYPE_SIGN 
>>> (ntype)),
>>> TYPE_SIGN (ntype))
>>> would be better than
>>> wi::to_widest (TYPE_MAX_VALUE (ntype)).
>>> 
>>> It seems that:
>>> "TYPE_MAX_VALUE (ntype)" is "NUMERICAL_TYPE_CHECK
>>> (NODE)->type_non_common.maxval"
>>> which do a numerical-check and return the field of maxval.  And then 
>>> call to
>>> wi::to_widest
>>> 
>>> The other code "widest_int::from (wi::max_value (..,..),..)" calls
>>> wi::max_value
>>> and widest_int::from.
>>> 
>>> I'm wondering if wi::to_widest (TYPE_MAX_VALUE (ntype)) is cheaper?
>> 
>> TYPE_MAX_VALUE can be "suprising", it does not necessarily match the
>> underlying modes precision.  At some point we've tried to eliminate
>> most of its uses, not sure what the situation/position is right now.
> Ok, get it, thanks.
> I will use "widest_int::from (wi::max_value (..,..),..)".
> 
>> 
>>> > I think the -1 above comes from number of latch iterations vs. header
>>> > entries - it's a common source for this kind of issues.  range analysis
>>> > might be able to prove that we can still merge the two adds even with
>>> > the intermediate extension.
>>> Yes, as you mentioned here, it relates to number of latch iterations
>>> For loop looks like : while (l < n) or for (i = 0; i < n; i++)
>>> This kind of loop, the niter is used to be 'n - 1' after transformed
>>> into 'do-while' form.
> For this kind of loop, the max value for the number of iteration "n - 
> 1"
> would be "max_value_type(n) - 1" which is wi::ltu than max_value_type.
> This kind of loop is already common, and we could use wi::ltu (max,
> max_value_type)
> to check.
> 
> For loop looks like:
>   do ;
>   while (n-- > 0); /* while  (n-- > low); */
> 
> The niter_desc->max will wi::eq to max_value_type, and niter would be 
> "n",
> and then doloop.xx is 'n+1'.
> 
>>> I would see how to merge these two adds safely at this point
>>> when generating doloop iv. (maybe range info, thanks!
>>> 
>>> >
>>> > Is this pre-loop extra add really offsetting the in-loop doloop
>>> > improvements?
>>> I'm not catching this question too much, sorry.  I guess your concern
>>> is if the "+1" is an offset: it may not, "+1" may be just that 
>>> doloop.xx
>>> is decreasing niter until 0 (all number >0).
>>> If misunderstand,  thanks for point out.
>> 
>> I'm questioning the argument that not being able to eliminate the +1-1
>> pair effects the overall cost improvement for the doloop IV type 
>> change
>> which hopefully is _not_ just loop IV setup cost but improving
>> performance in the loop body itself?
> Yes, I think so.  It would affect doloop IV setup cost, it may also 
> affect
> loop itself on some targets, if the target prefers a changed type:
> using one jump-counter instruction for wider type, but cmp/jump 
> instructions
> for other types.
Oh, this may be an inaccurate example. -:)

BR,
Jiufu Guo.

> If the loop is nested in outer-loop, eliminating +1-1 would improving
> outer-loop performance.
> 
> BR,
> Jiufu Guo.
> 
>> 
>> Richard.
>> 
>>> >
>>> >> >> >>
>>> >> >> >> doloop_valid_p guarantees it is simple and doesn't wrap.
>>> >> >> >>
>>> >> >> >> > I'd have expected sth like
>>> >> >> >> >
>>> >> >> >> >    ntype = lang_hooks.types.type_for_mode (word_mode, TYPE_UNSIGNED
>>> >> >> >> > (ntype));
>>> >> >> >> >
>>> >> >> >> > thus the decision made using a mode - which is also why I wonder
>>> >> >> >> > if there's a way to query the target for this.  As you say,
>>> >> >> >> > it _may_ be fast, so better check (somehow).
>>> >> >>
>>> >> >>
>>> >> >> I was also thinking of using hooks like type_for_size/type_for_mode.
>>> >> >>     /* Use type in word size may fast.  */
>>> >> >>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD
>>> >> >>         && Wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE
>>> >> >> (ntype))))
>>> >> >>       {
>>> >> >>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>>> >> >>         base = fold_convert (ntype, base);
>>> >> >>       }
>>> >> >>
>>> >> >> As you pointed out, this does not query the mode from targets.
>>> >> >> As Segher pointed out "doloop_end" checks unsupported mode, while it
>>> >> >> seems
>>> >> >> not easy to use it in tree-ssa-loop-ivopts.c.
>>> >> >> For implementations of doloop_end, tartgets like rs6000/aarch64/ia64
>>> >> >> requires
>>> >> >> Pmode/DImode; while there are other targets that work on other 'mode'
>>> >> >> (e.g.
>>> >> >> SI).
>>> >> >>
>>> >> >>
>>> >> >> In doloop_optimize, there is code:
>>> >> >>
>>> >> >> ```
>>> >> >>     mode = desc->mode;
>>> >> >> .....
>>> >> >>     doloop_reg = gen_reg_rtx (mode);
>>> >> >>     rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg,
>>> >> >> start_label);
>>> >> >>
>>> >> >>     word_mode_size = GET_MODE_PRECISION (word_mode);
>>> >> >>     word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) -
>>> >> >> 1;
>>> >> >>     if (! doloop_seq
>>> >> >>         && mode != word_mode
>>> >> >>         /* Before trying mode different from the one in that # of
>>> >> >> iterations is
>>> >> >>            computed, we must be sure that the number of iterations fits
>>> >> >> into
>>> >> >>            the new mode.  */
>>> >> >>         && (word_mode_size >= GET_MODE_PRECISION (mode)
>>> >> >>             || wi::leu_p (iterations_max, word_mode_max)))
>>> >> >>       {
>>> >> >>         if (word_mode_size > GET_MODE_PRECISION (mode))
>>> >> >>           count = simplify_gen_unary (ZERO_EXTEND, word_mode, count,
>>> >> >>         mode);
>>> >> >>         else
>>> >> >>           count = lowpart_subreg (word_mode, count, mode);
>>> >> >>         PUT_MODE (doloop_reg, word_mode);
>>> >> >>         doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
>>> >> >>       }
>>> >> >>     if (! doloop_seq)
>>> >> >>       {
>>> >> >>         if (dump_file)
>>> >> >>           fprintf (dump_file,
>>> >> >>                    "Doloop: Target unwilling to use doloop pattern!\n");
>>> >> >>         return false;
>>> >> >>       }
>>> >> >> ```
>>> >> >> The above code first tries the mode of niter_desc by call
>>> >> >> targetm.gen_doloop_end
>>> >> >> to see if the target can generate doloop insns, if fail, then try to use
>>> >> >> 'word_mode' against gen_doloop_end.
>>> >> >>
>>> >> >>
>>> >> >> >>
>>> >> >> >> Almost all targets just use Pmode, but there is no such guarantee I
>>> >> >> >> think, and esp. some targets that do not have machine insns for this
>>> >> >> >> (but want to generate different code for this anyway) can do pretty
>>> >> >> >> much
>>> >> >> >> anything.
>>> >> >> >>
>>> >> >> >> Maybe using just Pmode here is good enough though?
>>> >> >> >
>>> >> >> > I think Pmode is a particularly bad choice and I'd prefer word_mode
>>> >> >> > if we go for any hardcoded mode.  s390x for example seems to handle
>>> >> >> > both SImode and DImode (but names the helper gen_doloop_si64
>>> >> >> > for SImode?!).  But indeed it looks like somehow querying doloop_end
>>> >> >> > is going to be difficult since the expander doesn't have any mode,
>>> >> >> > so we'd have to actually try emit RTL here.
>>> >> >>
>>> >> >> Instead of using hardcode mode, maybe we could add a hook for targets to
>>> >> >> return
>>> >> >> the preferred mode.
>>> >> >
>>> >> > That's a possiblity of course.  Like the following which just shows the
>>> >> > default implementation then (pass in current mode, return a more
>>> >> > preferred
>>> >> > mode or the mode itself)
>>> >> >
>>> >> > enum machine_mode
>>> >> > prefered_doloop_mode (enum machine_mode mode)
>>> >> > {
>>> >> >   return mode;
>>> >> > }
>>> >> >
>>> >> Yes, thanks!
>>> >>
>>> >> Checking current do_loop_end in targets, in general, when do_loop_end
>>> >> requires
>>> >> SI mode, the target is defining Pmode as SImode and word_mode (from
>>> >> BITS_PER_WORD
>>> >> which defaults from UNITS_PER_WORD) is also defined to align with SI mode.
>>> >> When do_loop_end requires DI mode, the target is defining Pmode as DImode
>>> >> and word_mode/UNITS_PER_WORD is also defined to align with DI mode.
>>> >>
>>> >> So, if aggressively, then by default we may just return word_mode.
>>> >
>>> > Note we still have to check whether the prefered mode is valid
>>> > (passing in TImode but then returning DImode would be wrong).
>>> Yes, some code like in doloop_optimize (rtl code)
>>> ```
>>> mode != word_mode
>>> && (word_mode_size >= GET_MODE_PRECISION (mode)
>>>     || wi::leu_p (iterations_max, word_mode_max))
>>> ```
>>> 
>>> Thanks again for your comments!
>>> 
>>> BR,
>>> Jiufu Guo
>>> >
>>> > Richard.
>>> >
>>> >> BR,
>>> >>
>>> >> Jiufu Guo.
>>> >>
>>> >>
>>> >> >>
>>> >> >> Thanks for those valuable comments!
>>> >> >>
>>> >> >> Jiufu Guo
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> >
>>> >> >> > Richard.
>>> >> >>
>>> >> >>
>>> >>
>>> 
>>>

next prev parent reply	other threads:[~2021-07-12 15:59 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-09  2:07 Jiufu Guo
2021-07-09  6:43 ` Richard Biener
2021-07-09 17:08   ` Segher Boessenkool
2021-07-12  6:20     ` Richard Biener
2021-07-12  8:21       ` guojiufu
2021-07-12  8:57         ` Richard Biener
2021-07-12  9:43           ` guojiufu
2021-07-12 10:02             ` Richard Biener
2021-07-12 14:08               ` guojiufu
2021-07-12 14:46                 ` Richard Biener
2021-07-12 15:53                   ` guojiufu
2021-07-12 15:59                     ` guojiufu [this message]
2021-07-13  2:09                     ` guojiufu
2021-07-13  7:09                       ` Richard Biener
2021-07-13  8:16                         ` guojiufu
2021-07-13 15:51                       ` Segher Boessenkool
2021-07-14  2:38                         ` guojiufu
2021-07-13 15:38       ` Segher Boessenkool
2021-07-14  3:10         ` guojiufu
2021-07-14  7:17           ` Richard Biener

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=595a8cc8b2d12d2cc30c49d9799c3360@imap.linux.ibm.com \
    --to=guojiufu@linux.ibm.com \
    --cc=dje.gcc@gmail.com \
    --cc=gcc-patches-bounces+guojiufu=linux.ibm.com@gcc.gnu.org \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=jlaw@tachyum.com \
    --cc=rguenther@suse.de \
    --cc=segher@kernel.crashing.org \
    --cc=wschmidt@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).