From: guojiufu <guojiufu@linux.ibm.com>
To: guojiufu <guojiufu@linux.ibm.com>
Cc: Richard Biener <rguenther@suse.de>,
Segher Boessenkool <segher@kernel.crashing.org>,
gcc-patches@gcc.gnu.org, wschmidt@linux.ibm.com,
jlaw@tachyum.com, dje.gcc@gmail.com,
Gcc-patches
<gcc-patches-bounces+guojiufu=linux.ibm.com@gcc.gnu.org>
Subject: Re: [PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]
Date: Mon, 12 Jul 2021 23:59:16 +0800 [thread overview]
Message-ID: <595a8cc8b2d12d2cc30c49d9799c3360@imap.linux.ibm.com> (raw)
In-Reply-To: <5200541b7141d6ca90b18bc6721dea60@imap.linux.ibm.com>
On 2021-07-12 23:53, guojiufu via Gcc-patches wrote:
> On 2021-07-12 22:46, Richard Biener wrote:
>> On Mon, 12 Jul 2021, guojiufu wrote:
>>
>>> On 2021-07-12 18:02, Richard Biener wrote:
>>> > On Mon, 12 Jul 2021, guojiufu wrote:
>>> >
>>> >> On 2021-07-12 16:57, Richard Biener wrote:
>>> >> > On Mon, 12 Jul 2021, guojiufu wrote:
>>> >> >
>>> >> >> On 2021-07-12 14:20, Richard Biener wrote:
>>> >> >> > On Fri, 9 Jul 2021, Segher Boessenkool wrote:
>>> >> >> >
>>> >> >> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
>>> >> >> >> > I wonder if there's a way to query the target what modes the doloop
>>> >> >> >> > pattern can handle (not being too familiar with the doloop code).
>>> >> >> >>
>>> >> >> >> You can look what modes are allowed for operand 0 of doloop_end,
>>> >> >> >> perhaps? Although that is a define_expand, not a define_insn, so it
>>> >> >> >> is
>>> >> >> >> hard to introspect.
>>> >> >> >>
>>> >> >> >> > Why do you need to do any checks besides the new type being able to
>>> >> >> >> > represent all IV values? The original doloop IV will never wrap
>>> >> >> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will
>>> >> >> >> > become
>>> >> >> >> > zero ... I suppose the doloop might still do the correct thing here
>>> >> >> >> > but it also still will with a IV with larger type).
>>> >> >>
>>> >> >> The issue comes from U*_MAX (original short MAX), as you said: on which
>>> >> >> niter + 1 becomes zero. And because the step for doloop is -1; then, on
>>> >> >> larger type 'zero - 1' will be a very large number on larger type
>>> >> >> (e.g. 0xff...ff); but on the original short type 'zero - 1' is a small
>>> >> >> value
>>> >> >> (e.g. "0xff").
>>> >> >
>>> >> > But for the larger type the small type MAX + 1 fits and does not yield
>>> >> > zero so it should still work exactly as before, no? Of course you
>>> >> > have to compute the + 1 in the larger type.
>>> >> >
>>> >> You are right, if compute the "+ 1" in the larger type it is ok, as below
>>> >> code:
>>> >> ```
>>> >> /* Use type in word size may fast. */
>>> >> if (TYPE_PRECISION (ntype) < BITS_PER_WORD)
>>> >> {
>>> >> ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>>> >> niter = fold_convert (ntype, niter);
>>> >> }
>>> >>
>>> >> tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
>>> >> build_int_cst (ntype, 1));
>>> >>
>>> >>
>>> >> add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, NULL,
>>> >> true);
>>> >> ```
>>> >> The issue of this is, this code generates more stmt for doloop.xxx:
>>> >> _12 = (unsigned int) xx(D);
>>> >> _10 = _12 + 4294967295;
>>> >> _24 = (long unsigned int) _10;
>>> >> doloop.6_8 = _24 + 1;
>>> >>
>>> >> if use previous patch, "+ 1" on original type, then the stmts will looks
>>> >> like:
>>> >> _12 = (unsigned int) xx(D);
>>> >> doloop.6_8 = (long unsigned int) _12;
>>> >>
>>> >> This is the reason for checking
>>> >> wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype)))
>>> >
>>> > But this then only works when there's an upper bound on the number
>>> > of iterations. Note you should not use TYPE_MAX_VALUE here but
>>> > you can instead use
>>> >
>>> > wi::ltu_p (niter_desc->max, wi::to_widest (wi::max_value
>>> > (TYPE_PRECISION (ntype), TYPE_SIGN (ntype))));
>>>
>>> Ok, Thanks!
>>> I remember you mentioned that:
>>> widest_int::from (wi::max_value (TYPE_PRECISION (ntype), TYPE_SIGN
>>> (ntype)),
>>> TYPE_SIGN (ntype))
>>> would be better than
>>> wi::to_widest (TYPE_MAX_VALUE (ntype)).
>>>
>>> It seems that:
>>> "TYPE_MAX_VALUE (ntype)" is "NUMERICAL_TYPE_CHECK
>>> (NODE)->type_non_common.maxval"
>>> which do a numerical-check and return the field of maxval. And then
>>> call to
>>> wi::to_widest
>>>
>>> The other code "widest_int::from (wi::max_value (..,..),..)" calls
>>> wi::max_value
>>> and widest_int::from.
>>>
>>> I'm wondering if wi::to_widest (TYPE_MAX_VALUE (ntype)) is cheaper?
>>
>> TYPE_MAX_VALUE can be "suprising", it does not necessarily match the
>> underlying modes precision. At some point we've tried to eliminate
>> most of its uses, not sure what the situation/position is right now.
> Ok, get it, thanks.
> I will use "widest_int::from (wi::max_value (..,..),..)".
>
>>
>>> > I think the -1 above comes from number of latch iterations vs. header
>>> > entries - it's a common source for this kind of issues. range analysis
>>> > might be able to prove that we can still merge the two adds even with
>>> > the intermediate extension.
>>> Yes, as you mentioned here, it relates to number of latch iterations
>>> For loop looks like : while (l < n) or for (i = 0; i < n; i++)
>>> This kind of loop, the niter is used to be 'n - 1' after transformed
>>> into 'do-while' form.
> For this kind of loop, the max value for the number of iteration "n -
> 1"
> would be "max_value_type(n) - 1" which is wi::ltu than max_value_type.
> This kind of loop is already common, and we could use wi::ltu (max,
> max_value_type)
> to check.
>
> For loop looks like:
> do ;
> while (n-- > 0); /* while (n-- > low); */
>
> The niter_desc->max will wi::eq to max_value_type, and niter would be
> "n",
> and then doloop.xx is 'n+1'.
>
>>> I would see how to merge these two adds safely at this point
>>> when generating doloop iv. (maybe range info, thanks!
>>>
>>> >
>>> > Is this pre-loop extra add really offsetting the in-loop doloop
>>> > improvements?
>>> I'm not catching this question too much, sorry. I guess your concern
>>> is if the "+1" is an offset: it may not, "+1" may be just that
>>> doloop.xx
>>> is decreasing niter until 0 (all number >0).
>>> If misunderstand, thanks for point out.
>>
>> I'm questioning the argument that not being able to eliminate the +1-1
>> pair effects the overall cost improvement for the doloop IV type
>> change
>> which hopefully is _not_ just loop IV setup cost but improving
>> performance in the loop body itself?
> Yes, I think so. It would affect doloop IV setup cost, it may also
> affect
> loop itself on some targets, if the target prefers a changed type:
> using one jump-counter instruction for wider type, but cmp/jump
> instructions
> for other types.
Oh, this may be an inaccurate example. -:)
BR,
Jiufu Guo.
> If the loop is nested in outer-loop, eliminating +1-1 would improving
> outer-loop performance.
>
> BR,
> Jiufu Guo.
>
>>
>> Richard.
>>
>>> >
>>> >> >> >>
>>> >> >> >> doloop_valid_p guarantees it is simple and doesn't wrap.
>>> >> >> >>
>>> >> >> >> > I'd have expected sth like
>>> >> >> >> >
>>> >> >> >> > ntype = lang_hooks.types.type_for_mode (word_mode, TYPE_UNSIGNED
>>> >> >> >> > (ntype));
>>> >> >> >> >
>>> >> >> >> > thus the decision made using a mode - which is also why I wonder
>>> >> >> >> > if there's a way to query the target for this. As you say,
>>> >> >> >> > it _may_ be fast, so better check (somehow).
>>> >> >>
>>> >> >>
>>> >> >> I was also thinking of using hooks like type_for_size/type_for_mode.
>>> >> >> /* Use type in word size may fast. */
>>> >> >> if (TYPE_PRECISION (ntype) < BITS_PER_WORD
>>> >> >> && Wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE
>>> >> >> (ntype))))
>>> >> >> {
>>> >> >> ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>>> >> >> base = fold_convert (ntype, base);
>>> >> >> }
>>> >> >>
>>> >> >> As you pointed out, this does not query the mode from targets.
>>> >> >> As Segher pointed out "doloop_end" checks unsupported mode, while it
>>> >> >> seems
>>> >> >> not easy to use it in tree-ssa-loop-ivopts.c.
>>> >> >> For implementations of doloop_end, tartgets like rs6000/aarch64/ia64
>>> >> >> requires
>>> >> >> Pmode/DImode; while there are other targets that work on other 'mode'
>>> >> >> (e.g.
>>> >> >> SI).
>>> >> >>
>>> >> >>
>>> >> >> In doloop_optimize, there is code:
>>> >> >>
>>> >> >> ```
>>> >> >> mode = desc->mode;
>>> >> >> .....
>>> >> >> doloop_reg = gen_reg_rtx (mode);
>>> >> >> rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg,
>>> >> >> start_label);
>>> >> >>
>>> >> >> word_mode_size = GET_MODE_PRECISION (word_mode);
>>> >> >> word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) -
>>> >> >> 1;
>>> >> >> if (! doloop_seq
>>> >> >> && mode != word_mode
>>> >> >> /* Before trying mode different from the one in that # of
>>> >> >> iterations is
>>> >> >> computed, we must be sure that the number of iterations fits
>>> >> >> into
>>> >> >> the new mode. */
>>> >> >> && (word_mode_size >= GET_MODE_PRECISION (mode)
>>> >> >> || wi::leu_p (iterations_max, word_mode_max)))
>>> >> >> {
>>> >> >> if (word_mode_size > GET_MODE_PRECISION (mode))
>>> >> >> count = simplify_gen_unary (ZERO_EXTEND, word_mode, count,
>>> >> >> mode);
>>> >> >> else
>>> >> >> count = lowpart_subreg (word_mode, count, mode);
>>> >> >> PUT_MODE (doloop_reg, word_mode);
>>> >> >> doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
>>> >> >> }
>>> >> >> if (! doloop_seq)
>>> >> >> {
>>> >> >> if (dump_file)
>>> >> >> fprintf (dump_file,
>>> >> >> "Doloop: Target unwilling to use doloop pattern!\n");
>>> >> >> return false;
>>> >> >> }
>>> >> >> ```
>>> >> >> The above code first tries the mode of niter_desc by call
>>> >> >> targetm.gen_doloop_end
>>> >> >> to see if the target can generate doloop insns, if fail, then try to use
>>> >> >> 'word_mode' against gen_doloop_end.
>>> >> >>
>>> >> >>
>>> >> >> >>
>>> >> >> >> Almost all targets just use Pmode, but there is no such guarantee I
>>> >> >> >> think, and esp. some targets that do not have machine insns for this
>>> >> >> >> (but want to generate different code for this anyway) can do pretty
>>> >> >> >> much
>>> >> >> >> anything.
>>> >> >> >>
>>> >> >> >> Maybe using just Pmode here is good enough though?
>>> >> >> >
>>> >> >> > I think Pmode is a particularly bad choice and I'd prefer word_mode
>>> >> >> > if we go for any hardcoded mode. s390x for example seems to handle
>>> >> >> > both SImode and DImode (but names the helper gen_doloop_si64
>>> >> >> > for SImode?!). But indeed it looks like somehow querying doloop_end
>>> >> >> > is going to be difficult since the expander doesn't have any mode,
>>> >> >> > so we'd have to actually try emit RTL here.
>>> >> >>
>>> >> >> Instead of using hardcode mode, maybe we could add a hook for targets to
>>> >> >> return
>>> >> >> the preferred mode.
>>> >> >
>>> >> > That's a possiblity of course. Like the following which just shows the
>>> >> > default implementation then (pass in current mode, return a more
>>> >> > preferred
>>> >> > mode or the mode itself)
>>> >> >
>>> >> > enum machine_mode
>>> >> > prefered_doloop_mode (enum machine_mode mode)
>>> >> > {
>>> >> > return mode;
>>> >> > }
>>> >> >
>>> >> Yes, thanks!
>>> >>
>>> >> Checking current do_loop_end in targets, in general, when do_loop_end
>>> >> requires
>>> >> SI mode, the target is defining Pmode as SImode and word_mode (from
>>> >> BITS_PER_WORD
>>> >> which defaults from UNITS_PER_WORD) is also defined to align with SI mode.
>>> >> When do_loop_end requires DI mode, the target is defining Pmode as DImode
>>> >> and word_mode/UNITS_PER_WORD is also defined to align with DI mode.
>>> >>
>>> >> So, if aggressively, then by default we may just return word_mode.
>>> >
>>> > Note we still have to check whether the prefered mode is valid
>>> > (passing in TImode but then returning DImode would be wrong).
>>> Yes, some code like in doloop_optimize (rtl code)
>>> ```
>>> mode != word_mode
>>> && (word_mode_size >= GET_MODE_PRECISION (mode)
>>> || wi::leu_p (iterations_max, word_mode_max))
>>> ```
>>>
>>> Thanks again for your comments!
>>>
>>> BR,
>>> Jiufu Guo
>>> >
>>> > Richard.
>>> >
>>> >> BR,
>>> >>
>>> >> Jiufu Guo.
>>> >>
>>> >>
>>> >> >>
>>> >> >> Thanks for those valuable comments!
>>> >> >>
>>> >> >> Jiufu Guo
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> >
>>> >> >> > Richard.
>>> >> >>
>>> >> >>
>>> >>
>>>
>>>
next prev parent reply other threads:[~2021-07-12 15:59 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-07-09 2:07 Jiufu Guo
2021-07-09 6:43 ` Richard Biener
2021-07-09 17:08 ` Segher Boessenkool
2021-07-12 6:20 ` Richard Biener
2021-07-12 8:21 ` guojiufu
2021-07-12 8:57 ` Richard Biener
2021-07-12 9:43 ` guojiufu
2021-07-12 10:02 ` Richard Biener
2021-07-12 14:08 ` guojiufu
2021-07-12 14:46 ` Richard Biener
2021-07-12 15:53 ` guojiufu
2021-07-12 15:59 ` guojiufu [this message]
2021-07-13 2:09 ` guojiufu
2021-07-13 7:09 ` Richard Biener
2021-07-13 8:16 ` guojiufu
2021-07-13 15:51 ` Segher Boessenkool
2021-07-14 2:38 ` guojiufu
2021-07-13 15:38 ` Segher Boessenkool
2021-07-14 3:10 ` guojiufu
2021-07-14 7:17 ` Richard Biener
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=595a8cc8b2d12d2cc30c49d9799c3360@imap.linux.ibm.com \
--to=guojiufu@linux.ibm.com \
--cc=dje.gcc@gmail.com \
--cc=gcc-patches-bounces+guojiufu=linux.ibm.com@gcc.gnu.org \
--cc=gcc-patches@gcc.gnu.org \
--cc=jlaw@tachyum.com \
--cc=rguenther@suse.de \
--cc=segher@kernel.crashing.org \
--cc=wschmidt@linux.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).