From: Richard Sandiford <richard.sandiford@arm.com>
To: Richard Biener <rguenther@suse.de>
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Teach VN about masked/len stores
Date: Tue, 02 Aug 2022 09:15:43 +0100
Message-ID: <mptv8rbqcwg.fsf@arm.com>
In-Reply-To: <20220721090131.E297413A1B@imap2.suse-dmz.suse.de> (Richard Biener's message of "Thu, 21 Jul 2022 11:01:31 +0200 (CEST)")

Richard Biener <rguenther@suse.de> writes:
> The following teaches VN to handle reads from .MASK_STORE and
> .LEN_STORE. For this push_partial_def is extended first for
> convenience so we don't have to handle the full def case in the
> caller (possibly other paths can be simplified then). Also
> the partial definition stored value can have an offset applied
> so we don't have to build a fake RHS when we register the pieces
> of an existing store.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, Kewen is
> going to test on powerpc.
>
> I'm not sure about whether it's worth (or easily possible) to
> handle .MASK_STORE_LANES, I think handling the constant def case
> might be possible but since it has an intrinsic permute it might
> make more sense to rewrite the constant def case into a .MASK_STORE?
> (does the mask apply to the destination memory order or the source
> lane order?)

There's one mask bit for each structure, rather than one for each
element. So converting a constant .MASK_STORE_LANES to a constant
.MASK_STORE would require some massaging of the predicate.
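
To illustrate (purely a sketch, not something the patch does): to treat
a constant .MASK_STORE_LANES as a .MASK_STORE over the interleaved
memory, each per-structure mask bit would have to be replicated across
that structure's elements, and the stored vectors would need the
interleaving permute applied to put them into memory order.  Roughly
(the function name, the flat arrays and group_size are made up for
illustration):

/* Sketch only: expand a per-structure MASK_STORE_LANES predicate into
   the per-element predicate an equivalent MASK_STORE would need.  */
static void
expand_lanes_mask (const unsigned char *struct_mask, unsigned nstructs,
		   unsigned group_size, unsigned char *elem_mask)
{
  for (unsigned i = 0; i < nstructs; ++i)
    for (unsigned j = 0; j < group_size; ++j)
      /* Structure I is stored either completely or not at all, so its
	 single bit covers GROUP_SIZE consecutive elements in memory.  */
      elem_mask[i * group_size + j] = struct_mask[i];
}

E.g. for group_size == 2, struct_mask {1,0,1,1} would become elem_mask
{1,1,0,0,1,1,1,1}, and the value stored by the .MASK_STORE would be the
interleave of the two source vectors.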

Thanks,
Richard
>
> PR tree-optimization/106365
> * tree-ssa-sccvn.cc (pd_data::rhs_off): New field determining
> the offset to start encoding of RHS from.
> (vn_walk_cb_data::vn_walk_cb_data): Initialize it.
> (vn_walk_cb_data::push_partial_def): Allow the first partial
> definition to be fully providing the def. Offset RHS
> before encoding if requested.
> (vn_reference_lookup_3): Initialize def_rhs everywhere.
> Add support for .MASK_STORE and .LEN_STORE (partial) definitions.
>
> * gcc.target/i386/vec-maskstore-vn.c: New testcase.
> ---
> .../gcc.target/i386/vec-maskstore-vn.c | 30 +++
> gcc/tree-ssa-sccvn.cc | 255 ++++++++++++++----
> 2 files changed, 228 insertions(+), 57 deletions(-)
> create mode 100644 gcc/testsuite/gcc.target/i386/vec-maskstore-vn.c
>
> diff --git a/gcc/testsuite/gcc.target/i386/vec-maskstore-vn.c b/gcc/testsuite/gcc.target/i386/vec-maskstore-vn.c
> new file mode 100644
> index 00000000000..98213905ece
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/vec-maskstore-vn.c
> @@ -0,0 +1,30 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -mavx2 -fdump-tree-fre5" } */
> +
> +void __attribute__((noinline,noclone))
> +foo (int *out, int *res)
> +{
> + int mask[] = { 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1 };
> + int i;
> + for (i = 0; i < 16; ++i)
> + {
> + if (mask[i])
> + out[i] = i;
> + }
> + int o0 = out[0];
> + int o7 = out[7];
> + int o14 = out[14];
> + int o15 = out[15];
> + res[0] = o0;
> + res[2] = o7;
> + res[4] = o14;
> + res[6] = o15;
> +}
> +
> +/* Vectorization produces .MASK_STORE, unrolling will unroll the two
> + vector iterations. FRE5 after that should be able to CSE
> + out[7] and out[15], but leave out[0] and out[14] alone. */
> +/* { dg-final { scan-tree-dump " = o0_\[0-9\]+;" "fre5" } } */
> +/* { dg-final { scan-tree-dump " = 7;" "fre5" } } */
> +/* { dg-final { scan-tree-dump " = o14_\[0-9\]+;" "fre5" } } */
> +/* { dg-final { scan-tree-dump " = 15;" "fre5" } } */
> diff --git a/gcc/tree-ssa-sccvn.cc b/gcc/tree-ssa-sccvn.cc
> index f41d5031365..7d947b55a27 100644
> --- a/gcc/tree-ssa-sccvn.cc
> +++ b/gcc/tree-ssa-sccvn.cc
> @@ -1790,6 +1790,7 @@ struct pd_range
> struct pd_data
> {
> tree rhs;
> + HOST_WIDE_INT rhs_off;
> HOST_WIDE_INT offset;
> HOST_WIDE_INT size;
> };
> @@ -1816,6 +1817,7 @@ struct vn_walk_cb_data
> unsigned int pos = 0, prec = w.get_precision ();
> pd_data pd;
> pd.rhs = build_constructor (NULL_TREE, NULL);
> + pd.rhs_off = 0;
> /* When bitwise and with a constant is done on a memory load,
> we don't really need all the bits to be defined or defined
> to constants, we don't really care what is in the position
> @@ -1976,6 +1978,7 @@ vn_walk_cb_data::push_partial_def (pd_data pd,
>
> bool pd_constant_p = (TREE_CODE (pd.rhs) == CONSTRUCTOR
> || CONSTANT_CLASS_P (pd.rhs));
> + pd_range *r;
> if (partial_defs.is_empty ())
> {
> /* If we get a clobber upfront, fail. */
> @@ -1989,65 +1992,70 @@ vn_walk_cb_data::push_partial_def (pd_data pd,
> first_set = set;
> first_base_set = base_set;
> last_vuse_ptr = NULL;
> - /* Continue looking for partial defs. */
> - return NULL;
> - }
> -
> - if (!known_ranges)
> - {
> - /* ??? Optimize the case where the 2nd partial def completes things. */
> - gcc_obstack_init (&ranges_obstack);
> - known_ranges = splay_tree_new_with_allocator (pd_range_compare, 0, 0,
> - pd_tree_alloc,
> - pd_tree_dealloc, this);
> - splay_tree_insert (known_ranges,
> - (splay_tree_key)&first_range.offset,
> - (splay_tree_value)&first_range);
> - }
> -
> - pd_range newr = { pd.offset, pd.size };
> - splay_tree_node n;
> - pd_range *r;
> - /* Lookup the predecessor of offset + 1 and see if we need to merge. */
> - HOST_WIDE_INT loffset = newr.offset + 1;
> - if ((n = splay_tree_predecessor (known_ranges, (splay_tree_key)&loffset))
> - && ((r = (pd_range *)n->value), true)
> - && ranges_known_overlap_p (r->offset, r->size + 1,
> - newr.offset, newr.size))
> - {
> - /* Ignore partial defs already covered. Here we also drop shadowed
> - clobbers arriving here at the floor. */
> - if (known_subrange_p (newr.offset, newr.size, r->offset, r->size))
> - return NULL;
> - r->size = MAX (r->offset + r->size, newr.offset + newr.size) - r->offset;
> + r = &first_range;
> + /* Go check if the first partial definition was a full one in case
> + the caller didn't optimize for this. */
> }
> else
> {
> - /* newr.offset wasn't covered yet, insert the range. */
> - r = XOBNEW (&ranges_obstack, pd_range);
> - *r = newr;
> - splay_tree_insert (known_ranges, (splay_tree_key)&r->offset,
> - (splay_tree_value)r);
> - }
> - /* Merge r which now contains newr and is a member of the splay tree with
> - adjacent overlapping ranges. */
> - pd_range *rafter;
> - while ((n = splay_tree_successor (known_ranges, (splay_tree_key)&r->offset))
> - && ((rafter = (pd_range *)n->value), true)
> - && ranges_known_overlap_p (r->offset, r->size + 1,
> - rafter->offset, rafter->size))
> - {
> - r->size = MAX (r->offset + r->size,
> - rafter->offset + rafter->size) - r->offset;
> - splay_tree_remove (known_ranges, (splay_tree_key)&rafter->offset);
> - }
> - /* If we get a clobber, fail. */
> - if (TREE_CLOBBER_P (pd.rhs))
> - return (void *)-1;
> - /* Non-constants are OK as long as they are shadowed by a constant. */
> - if (!pd_constant_p)
> - return (void *)-1;
> - partial_defs.safe_push (pd);
> + if (!known_ranges)
> + {
> + /* ??? Optimize the case where the 2nd partial def completes
> + things. */
> + gcc_obstack_init (&ranges_obstack);
> + known_ranges = splay_tree_new_with_allocator (pd_range_compare, 0, 0,
> + pd_tree_alloc,
> + pd_tree_dealloc, this);
> + splay_tree_insert (known_ranges,
> + (splay_tree_key)&first_range.offset,
> + (splay_tree_value)&first_range);
> + }
> +
> + pd_range newr = { pd.offset, pd.size };
> + splay_tree_node n;
> + /* Lookup the predecessor of offset + 1 and see if we need to merge. */
> + HOST_WIDE_INT loffset = newr.offset + 1;
> + if ((n = splay_tree_predecessor (known_ranges, (splay_tree_key)&loffset))
> + && ((r = (pd_range *)n->value), true)
> + && ranges_known_overlap_p (r->offset, r->size + 1,
> + newr.offset, newr.size))
> + {
> + /* Ignore partial defs already covered. Here we also drop shadowed
> + clobbers arriving here at the floor. */
> + if (known_subrange_p (newr.offset, newr.size, r->offset, r->size))
> + return NULL;
> + r->size
> + = MAX (r->offset + r->size, newr.offset + newr.size) - r->offset;
> + }
> + else
> + {
> + /* newr.offset wasn't covered yet, insert the range. */
> + r = XOBNEW (&ranges_obstack, pd_range);
> + *r = newr;
> + splay_tree_insert (known_ranges, (splay_tree_key)&r->offset,
> + (splay_tree_value)r);
> + }
> + /* Merge r which now contains newr and is a member of the splay tree with
> + adjacent overlapping ranges. */
> + pd_range *rafter;
> + while ((n = splay_tree_successor (known_ranges,
> + (splay_tree_key)&r->offset))
> + && ((rafter = (pd_range *)n->value), true)
> + && ranges_known_overlap_p (r->offset, r->size + 1,
> + rafter->offset, rafter->size))
> + {
> + r->size = MAX (r->offset + r->size,
> + rafter->offset + rafter->size) - r->offset;
> + splay_tree_remove (known_ranges, (splay_tree_key)&rafter->offset);
> + }
> + /* If we get a clobber, fail. */
> + if (TREE_CLOBBER_P (pd.rhs))
> + return (void *)-1;
> + /* Non-constants are OK as long as they are shadowed by a constant. */
> + if (!pd_constant_p)
> + return (void *)-1;
> + partial_defs.safe_push (pd);
> + }
>
> /* Now we have merged newr into the range tree. When we have covered
> [offseti, sizei] then the tree will contain exactly one node which has
> @@ -2081,7 +2089,8 @@ vn_walk_cb_data::push_partial_def (pd_data pd,
> else
> {
> len = native_encode_expr (pd.rhs, this_buffer, bufsize,
> - MAX (0, -pd.offset) / BITS_PER_UNIT);
> + (MAX (0, -pd.offset)
> + + pd.rhs_off) / BITS_PER_UNIT);
> if (len <= 0
> || len < (ROUND_UP (pd.size, BITS_PER_UNIT) / BITS_PER_UNIT
> - MAX (0, -pd.offset) / BITS_PER_UNIT))
> @@ -2906,6 +2915,7 @@ vn_reference_lookup_3 (ao_ref *ref, tree vuse, void *data_,
> {
> pd_data pd;
> pd.rhs = build_constructor (NULL_TREE, NULL);
> + pd.rhs_off = 0;
> pd.offset = offset2i;
> pd.size = leni << LOG2_BITS_PER_UNIT;
> return data->push_partial_def (pd, 0, 0, offseti, maxsizei);
> @@ -2955,6 +2965,7 @@ vn_reference_lookup_3 (ao_ref *ref, tree vuse, void *data_,
> by a later def. */
> pd_data pd;
> pd.rhs = gimple_assign_rhs1 (def_stmt);
> + pd.rhs_off = 0;
> pd.offset = offset2i;
> pd.size = size2i;
> return data->push_partial_def (pd, ao_ref_alias_set (&lhs_ref),
> @@ -3107,6 +3118,7 @@ vn_reference_lookup_3 (ao_ref *ref, tree vuse, void *data_,
> if (TREE_CODE (rhs) == SSA_NAME)
> rhs = SSA_VAL (rhs);
> pd.rhs = rhs;
> + pd.rhs_off = 0;
> pd.offset = offset2i;
> pd.size = size2i;
> return data->push_partial_def (pd, ao_ref_alias_set (&lhs_ref),
> @@ -3186,6 +3198,7 @@ vn_reference_lookup_3 (ao_ref *ref, tree vuse, void *data_,
> {
> pd_data pd;
> pd.rhs = SSA_VAL (def_rhs);
> + pd.rhs_off = 0;
> pd.offset = offset2i;
> pd.size = size2i;
> return data->push_partial_def (pd, ao_ref_alias_set (&lhs_ref),
> @@ -3195,6 +3208,133 @@ vn_reference_lookup_3 (ao_ref *ref, tree vuse, void *data_,
> }
> }
>
> + /* 4b) Assignment done via one of the vectorizer internal store
> + functions where we may be able to access pieces from or we can
> + combine to a larger entity. */
> + else if (known_eq (ref->size, maxsize)
> + && is_gimple_reg_type (vr->type)
> + && !reverse_storage_order_for_component_p (vr->operands)
> + && !contains_storage_order_barrier_p (vr->operands)
> + && is_gimple_call (def_stmt)
> + && gimple_call_internal_p (def_stmt)
> + && internal_store_fn_p (gimple_call_internal_fn (def_stmt)))
> + {
> + gcall *call = as_a <gcall *> (def_stmt);
> + internal_fn fn = gimple_call_internal_fn (call);
> + tree def_rhs = gimple_call_arg (call,
> + internal_fn_stored_value_index (fn));
> + def_rhs = vn_valueize (def_rhs);
> + if (TREE_CODE (def_rhs) != VECTOR_CST)
> + return (void *)-1;
> +
> + tree mask = NULL_TREE, len = NULL_TREE, bias = NULL_TREE;
> + switch (fn)
> + {
> + case IFN_MASK_STORE:
> + mask = gimple_call_arg (call, internal_fn_mask_index (fn));
> + mask = vn_valueize (mask);
> + if (TREE_CODE (mask) != VECTOR_CST)
> + return (void *)-1;
> + break;
> + case IFN_LEN_STORE:
> + len = gimple_call_arg (call, 2);
> + bias = gimple_call_arg (call, 4);
> + if (!tree_fits_uhwi_p (len) || !tree_fits_shwi_p (bias))
> + return (void *)-1;
> + break;
> + default:
> + return (void *)-1;
> + }
> + ao_ref_init_from_ptr_and_size (&lhs_ref,
> + vn_valueize (gimple_call_arg (call, 0)),
> + TYPE_SIZE_UNIT (TREE_TYPE (def_rhs)));
> + tree base2;
> + poly_int64 offset2, size2, maxsize2;
> + HOST_WIDE_INT offset2i, size2i, offseti;
> + base2 = ao_ref_base (&lhs_ref);
> + offset2 = lhs_ref.offset;
> + size2 = lhs_ref.size;
> + maxsize2 = lhs_ref.max_size;
> + if (known_size_p (maxsize2)
> + && known_eq (maxsize2, size2)
> + && adjust_offsets_for_equal_base_address (base, &offset,
> + base2, &offset2)
> + && maxsize.is_constant (&maxsizei)
> + && offset.is_constant (&offseti)
> + && offset2.is_constant (&offset2i)
> + && size2.is_constant (&size2i))
> + {
> + if (!ranges_maybe_overlap_p (offset, maxsize, offset2, size2))
> + /* Poor-mans disambiguation. */
> + return NULL;
> + else if (ranges_known_overlap_p (offset, maxsize, offset2, size2))
> + {
> + pd_data pd;
> + pd.rhs = def_rhs;
> + tree aa = gimple_call_arg (call, 1);
> + alias_set_type set = get_deref_alias_set (TREE_TYPE (aa));
> + tree vectype = TREE_TYPE (def_rhs);
> + unsigned HOST_WIDE_INT elsz
> + = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (vectype)));
> + if (mask)
> + {
> + HOST_WIDE_INT start = 0, len = 0;
> + unsigned mask_idx = 0;
> + do
> + {
> + if (integer_zerop (VECTOR_CST_ELT (mask, mask_idx)))
> + {
> + if (len != 0)
> + {
> + pd.rhs_off = start;
> + pd.offset = offset2i + start;
> + pd.size = len;
> + if (ranges_known_overlap_p
> + (offset, maxsize, pd.offset, pd.size))
> + {
> + void *res = data->push_partial_def
> + (pd, set, set, offseti, maxsizei);
> + if (res != NULL)
> + return res;
> + }
> + }
> + start = (mask_idx + 1) * elsz;
> + len = 0;
> + }
> + else
> + len += elsz;
> + mask_idx++;
> + }
> + while (known_lt (mask_idx, TYPE_VECTOR_SUBPARTS (vectype)));
> + if (len != 0)
> + {
> + pd.rhs_off = start;
> + pd.offset = offset2i + start;
> + pd.size = len;
> + if (ranges_known_overlap_p (offset, maxsize,
> + pd.offset, pd.size))
> + return data->push_partial_def (pd, set, set,
> + offseti, maxsizei);
> + }
> + }
> + else if (fn == IFN_LEN_STORE)
> + {
> + pd.rhs_off = 0;
> + pd.offset = offset2i;
> + pd.size = (tree_to_uhwi (len)
> + + -tree_to_shwi (bias)) * BITS_PER_UNIT;
> + if (ranges_known_overlap_p (offset, maxsize,
> + pd.offset, pd.size))
> + return data->push_partial_def (pd, set, set,
> + offseti, maxsizei);
> + }
> + else
> + gcc_unreachable ();
> + return NULL;
> + }
> + }
> + }
> +
> /* 5) For aggregate copies translate the reference through them if
> the copy kills ref. */
> else if (data->vn_walk_kind == VN_WALKREWRITE
> @@ -3327,6 +3467,7 @@ vn_reference_lookup_3 (ao_ref *ref, tree vuse, void *data_,
> {
> pd_data pd;
> pd.rhs = val;
> + pd.rhs_off = 0;
> pd.offset = 0;
> pd.size = maxsizei;
> return data->push_partial_def (pd, ao_ref_alias_set (&lhs_ref),