From: Richard Biener <rguenther@suse.de>
To: "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com>
Cc: Jakub Jelinek <jakub@redhat.com>,
	 Richard Sandiford <richard.sandiford@arm.com>,
	 "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] Teach vectorizer to deal with bitfield accesses (was: [RFC] Teach vectorizer to deal with bitfield reads)
Date: Wed, 17 Aug 2022 12:49:08 +0000 (UTC)
Message-ID: <nycvar.YFH.7.77.849.2208171201490.13569@jbgna.fhfr.qr>
In-Reply-To: <aa8a155a-d420-83e2-7687-7b4fb5c50f2d@arm.com>

On Tue, 16 Aug 2022, Andre Vieira (lists) wrote:

> Hi,
> 
> New version of the patch attached, but haven't recreated the ChangeLog yet,
> just waiting to see if this is what you had in mind. See also some replies to
> your comments in-line below:
> 
> On 09/08/2022 15:34, Richard Biener wrote:
> 
> > @@ -2998,7 +3013,7 @@ ifcvt_split_critical_edges (class loop *loop, bool
> > aggressive_if_conv)
> >     auto_vec<edge> critical_edges;
> >
> >     /* Loop is not well formed.  */
> > -  if (num <= 2 || loop->inner || !single_exit (loop))
> > +  if (num <= 2 || loop->inner)
> >       return false;
> >
> >     body = get_loop_body (loop);
> >
> > this doesn't appear in the ChangeLog nor is it clear to me why it's
> > needed?  Likewise
> So both these and...
> >
> > -  /* Save BB->aux around loop_version as that uses the same field.  */
> > -  save_length = loop->inner ? loop->inner->num_nodes : loop->num_nodes;
> > -  void **saved_preds = XALLOCAVEC (void *, save_length);
> > -  for (unsigned i = 0; i < save_length; i++)
> > -    saved_preds[i] = ifc_bbs[i]->aux;
> > +  void **saved_preds = NULL;
> > +  if (any_complicated_phi || need_to_predicate)
> > +    {
> > +      /* Save BB->aux around loop_version as that uses the same field.
> > */
> > +      save_length = loop->inner ? loop->inner->num_nodes :
> > loop->num_nodes;
> > +      saved_preds = XALLOCAVEC (void *, save_length);
> > +      for (unsigned i = 0; i < save_length; i++)
> > +       saved_preds[i] = ifc_bbs[i]->aux;
> > +    }
> >
> > is that just premature optimization?
> 
> .. these changes are to make sure we can still use the loop versioning code
> even for cases where there are bitfields to lower but no ifcvts (i.e. num of
> BBs <= 2).
> I wasn't sure about the loop->inner condition, but the small examples I tried
> seemed to work, that is, loop versioning seems to be able to handle nested loops.
> 
> The single_exit condition is still required for both, because the code to
> create the loop versions depends on it. It does look like I missed this in the
> ChangeLog...
> 
> > +  /* BITSTART and BITEND describe the region we can safely load from
> > inside the
> > +     structure.  BITPOS is the bit position of the value inside the
> > +     representative that we will end up loading OFFSET bytes from the
> > start
> > +     of the struct.  BEST_MODE is the mode describing the optimal size of
> > the
> > +     representative chunk we load.  If this is a write we will store the
> > same
> > +     sized representative back, after we have changed the appropriate
> > bits.  */
> > +  get_bit_range (&bitstart, &bitend, comp_ref, &bitpos, &offset);
> >
> > I think you need to give up when get_bit_range sets bitstart = bitend to
> > zero
> >
> > +  if (get_best_mode (bitsize, bitpos.to_constant (), bitstart, bitend,
> > +                    TYPE_ALIGN (TREE_TYPE (struct_expr)),
> > +                    INT_MAX, false, &best_mode))
> >
> > +  tree rep_decl = build_decl (UNKNOWN_LOCATION, FIELD_DECL,
> > +                             NULL_TREE, rep_type);
> > +  /* Load from the start of 'offset + bitpos % alignment'.  */
> > +  uint64_t extra_offset = bitpos.to_constant ();
> >
> > you shouldn't build a new FIELD_DECL.  Either you use
> > DECL_BIT_FIELD_REPRESENTATIVE directly or you use a
> > BIT_FIELD_REF accessing the "representative".
> > DECL_BIT_FIELD_REPRESENTATIVE exists so it can maintain
> > a variable field offset, you can also subset that with an
> > intermediate BIT_FIELD_REF if DECL_BIT_FIELD_REPRESENTATIVE is
> > too large for your taste.
> >
> > I'm not sure all the offset calculation you do is correct, but
> > since you shouldn't invent a new FIELD_DECL it probably needs
> > to change anyway ...
> I can use the DECL_BIT_FIELD_REPRESENTATIVE, but I'll still need some offset
> calculation/extraction. It's easier to explain with an example:
> 
> In vect-bitfield-read-3.c the struct:
> typedef struct {
>     int  c;
>     int  b;
>     bool a : 1;
> } struct_t;
> 
> and a field access 'vect_false[i].a' or 'vect_true[i].a' will lead to a
> DECL_BIT_FIELD_REPRESENTATIVE with a TYPE_SIZE of 8 (and TYPE_PRECISION is
> also 8, as expected). However, the DECL_FIELD_OFFSET of either the original field
> decl, the actual bitfield member, or the DECL_BIT_FIELD_REPRESENTATIVE is 0
> and the DECL_FIELD_BIT_OFFSET is 64. These will lead to the correct load:
> _1 = vect_false[i].D;
> 
> D here being the representative, this is an 8-bit load from vect_false[i] + 64 bits.
> So all good there. However, when we construct BIT_FIELD_REF we can't simply
> use DECL_FIELD_BIT_OFFSET (field_decl) as the BIT_FIELD_REF's bitpos.  During
> `verify_gimple` it checks that bitpos + bitsize < TYPE_SIZE (TREE_TYPE (load))
> where BIT_FIELD_REF (load, bitsize, bitpos).

Yes, of course.  What you need to do is subtract DECL_FIELD_BIT_OFFSET
of the representative from DECL_FIELD_BIT_OFFSET of the original bitfield
access - that's the offset within the representative (by construction
both fields share DECL_FIELD_OFFSET).
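
Something along these lines should give the position inside the
representative (an untested sketch; rep_bitpos is just an illustrative
name):

  tree repr = DECL_BIT_FIELD_REPRESENTATIVE (field_decl);
  /* By construction both FIELD_DECLs share DECL_FIELD_OFFSET, so the
     position inside the representative is the difference of the two
     DECL_FIELD_BIT_OFFSETs.  */
  unsigned HOST_WIDE_INT rep_bitpos
    = tree_to_uhwi (DECL_FIELD_BIT_OFFSET (field_decl))
      - tree_to_uhwi (DECL_FIELD_BIT_OFFSET (repr));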

> So instead I change bitpos such that:
> align_of_representative = TYPE_ALIGN (TREE_TYPE (representative));
> bitpos -= bitpos.to_constant () / align_of_representative * align_of_representative;

?  Not sure why alignment comes into play here?

> I've now rewritten this to:
> poly_int64 q, r;
> if (can_div_trunc_p (bitpos, align_of_representative, &q, &r))
>   bitpos = r;
> 
> It makes it slightly clearer, also because I no longer need the changes to the
> original tree offset as I'm just using D for the load.
>
> > Note that for optimization it will be important that all
> > accesses to the bitfield members of the same bitfield use the
> > same underlying area (CSE and store-forwarding will thank you).
> >
> > +
> > +  need_to_lower_bitfields = bitfields_to_lower_p (loop,
> > &bitfields_to_lower);
> > +  if (!ifcvt_split_critical_edges (loop, aggressive_if_conv)
> > +      && !need_to_lower_bitfields)
> >       goto cleanup;
> >
> > so we lower bitfields even when we cannot split critical edges?
> > why?
> >
> > +  need_to_ifcvt
> > +    = if_convertible_loop_p (loop) && dbg_cnt (if_conversion_tree);
> > +  if (!need_to_ifcvt && !need_to_lower_bitfields)
> >       goto cleanup;
> >
> > likewise - if_convertible_loop_p performs other checks, the only
> > one we want to elide is the loop->num_nodes <= 2 check since
> > we want to lower bitfields in single-block loops as well.  That
> > means we only have to scan for bitfield accesses in the first
> > block "prematurely".  So I would interwind the need_to_lower_bitfields
> > into if_convertible_loop_p and if_convertible_loop_p_1 and
> > put the loop->num_nodes <= 2 after it when !need_to_lower_bitfields.
> I'm not sure I understood this. But I'd rather keep the 'need_to_ifcvt' (new)
> and 'need_to_lower_bitfields' separate. One thing I did change is that we no
> longer check for bitfields to lower if there are if-stmts that we can't lower,
> since we will not be vectorizing this loop anyway so no point in wasting time
> lowering bitfields. At the same time though, I'd like to be able to
> lower bitfields if there are no ifcvts.

Sure.

> > +  if (!useless_type_conversion_p (TREE_TYPE (lhs), ret_type))
> > +    {
> > +      pattern_stmt
> > +       = gimple_build_assign (vect_recog_temp_ssa_var (ret_type, NULL),
> > +                              NOP_EXPR, lhs);
> > +      lhs = gimple_get_lhs (pattern_stmt);
> > +      append_pattern_def_seq (vinfo, stmt_info, pattern_stmt);
> > +    }
> >
> > hm - so you have for example
> >
> >   int _1 = MEM;
> >   int:3 _2 = BIT_FIELD_REF <_1, ...>
> >   type _3 = (type) _2;
> >
> > and that _3 = (type) _2 is because of integer promotion and you
> > perform all the shifting in that type.  I suppose you should
> > verify that the cast is indeed promoting, not narrowing, since
> > otherwise you'll produce wrong code?  That said, shouldn't you
> > perform the shift / mask in the type of _1 instead?  (the hope
> > is, of course, that typeof (_1) == type in most cases)
> >
> > Similar comments apply to vect_recog_bit_insert_pattern.
> Good shout, I hadn't realized that yet because the testcases didn't expose that
> problem, but when using the REPRESENTATIVE macro they do now. I don't think
> the bit_insert is a problem though. In bit_insert, 'value' always
> has the relevant bits starting at its LSB. So regardless of whether the load
> (and store) type is larger or smaller than the type, performing the shifts and
> masks in this type should be OK as you'll only be 'cutting off' the MSBs,
> which would be the ones that would get truncated anyway? Or am I missing
> something here?

Not sure what you are saying but "yes", all shifting and masking should
happen in the type of the representative.

+  tree bitpos_tree = build_int_cst (bitsizetype, bitpos);

for your convenience there's bitsize_int (bitpos) you can use.
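
That is, simply:

  tree bitpos_tree = bitsize_int (bitpos);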

I don't think you are using the correct bitpos though, you fail to
adjust it for the BIT_FIELD_REF/BIT_INSERT_EXPR.

+                        build_int_cst (bitsizetype, TYPE_PRECISION (bf_type)),

the size of the bitfield reference is DECL_SIZE of the original
FIELD_DECL - it might be bigger than the precision of its type.
You probably want to double-check it's equal to the precision
(because of the insert but also because of all the masking) and
refuse to lower if not.
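
Something like this (an untested sketch; the early return stands for
whatever the appropriate bail-out is at that point) would do:

  /* The masking and the insert assume the bitfield occupies exactly
     TYPE_PRECISION (bf_type) bits of the representative, so punt if the
     field size disagrees with the precision.  */
  if (!tree_fits_uhwi_p (DECL_SIZE (field_decl))
      || tree_to_uhwi (DECL_SIZE (field_decl)) != TYPE_PRECISION (bf_type))
    return NULL;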

+/* Return TRUE if there are bitfields to lower in this LOOP.  Fill TO_LOWER
+   with data structures representing these bitfields.  */
+
+static bool
+bitfields_to_lower_p (class loop *loop,
+                     vec <gassign *> &reads_to_lower,
+                     vec <gassign *> &writes_to_lower)
+{
+  basic_block *bbs = get_loop_body (loop);
+  gimple_stmt_iterator gsi;

as said, I'd prefer to do this walk as part of the other walks we
already do - if only because get_loop_body () is a DFS walk over
the loop body (you should at least share that).

+         gassign *stmt = dyn_cast<gassign*> (gsi_stmt (gsi));
+         if (!stmt)
+           continue;
+
+         tree op = gimple_get_lhs (stmt);

gimple_assign_lhs (stmt)

+         bool write = TREE_CODE (op) == COMPONENT_REF;
+
+         if (!write)
+           op = gimple_assign_rhs1 (stmt);
+
+         if (TREE_CODE (op) != COMPONENT_REF)
+           continue;
+
+         if (DECL_BIT_FIELD_TYPE (TREE_OPERAND (op, 1)))
+           {

rumors say that at least with Ada you can have non-integral, maybe
even aggregate "bitfields", so please add

  && INTEGRAL_TYPE_P (TREE_TYPE (op))
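
i.e. the check would become something like

	  if (DECL_BIT_FIELD_TYPE (TREE_OPERAND (op, 1))
	      && INTEGRAL_TYPE_P (TREE_TYPE (op)))
	    {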

@@ -3269,12 +3474,18 @@ tree_if_conversion (class loop *loop, vec<gimple *> *preds)
   unsigned int todo = 0;
   bool aggressive_if_conv;
   class loop *rloop;
+  vec <gassign *> reads_to_lower;
+  vec <gassign *> writes_to_lower;
   bitmap exit_bbs;

you should be able to use auto_vec<> here

  again:
+  reads_to_lower.create (4);
+  writes_to_lower.create (4);

I think repeated .create will not release what is there.  With
auto_vec<> above there's no need to .create, just do
truncate (0) here.
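
I.e., roughly:

  auto_vec<gassign *, 4> reads_to_lower;
  auto_vec<gassign *, 4> writes_to_lower;
  ...
 again:
  reads_to_lower.truncate (0);
  writes_to_lower.truncate (0);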

+  tree mask = build_int_cst (TREE_TYPE (lhs),
+                            ((1ULL << mask_i) - 1) << shift_n);

please use wide_int_to_tree (TREE_TYPE (lhs),
                             wi::shifted_mask (shift_n, mask_i, false,
                                               TYPE_PRECISION (TREE_TYPE (lhs))));

1ULL would better be (unsigned HOST_WIDE_INT)1 or HOST_WIDE_INT_1U.
But note the representative could be __int128_t where uint64_t
mask operations fall apart...

Btw, instead of (val & mask) >> shift it might be better to use
(val >> shift) & mask since the resulting mask values are "smaller"
and maybe easier to code generate?
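
For example (an untested sketch, reusing the patch's mask_i and shift_n):

  tree type = TREE_TYPE (lhs);
  tree mask = wide_int_to_tree (type, wi::mask (mask_i, false,
                                                TYPE_PRECISION (type)));
  /* ... and then emit (val >> shift_n) & mask instead of
     (val & (mask << shift_n)) >> shift_n.  */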

+   patt1 = _2 & mask;              // Clearing of the non-relevant bits in the
+                                   // 'to-write value'.
+   patt2 = patt1 << bitpos;        // Shift the cleaned value into place.
+   patt3 = _1 & ~(mask << bitpos);  // Clearing the bits we want to write to,

same here, shifting patt1 first and then masking allows you to just
invert the mask (or use andn), so there is no need for two different
(constant) masks?
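
I.e. something like (with MASK being mask << bitpos, computed once):

   patt1 = _2 << bitpos;      // Shift the to-write value into place first.
   patt2 = patt1 & MASK;      // Clear the non-relevant bits.
   patt3 = _1 & ~MASK;        // Clear the destination bits with the
                              // inverted mask (or andn).
   patt4 = patt2 | patt3;     // Merge.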

+      value = fold_build1 (NOP_EXPR, load_type, value);

fold_convert (load_type, value)

+      if (!CONSTANT_CLASS_P (value))
+       {
+         pattern_stmt
+           = gimple_build_assign (vect_recog_temp_ssa_var (load_type, NULL),
+                                  value);
+         value = gimple_get_lhs (pattern_stmt);

there's in principle

     gimple_seq stmts = NULL;
     value = gimple_convert (&stmts, load_type, value);
     if (!gimple_seq_empty_p (stmts))
       {
         pattern_stmt = gimple_seq_first_stmt (stmts);
         append_pattern_def_seq (vinfo, stmt_info, pattern_stmt);
       }

though an append_pattern_def_seq helper that adds a whole convenience
sequence would be nice to have here.
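
Something along these lines (untested, name invented here) would do:

  /* Append all statements in SEQ to the pattern definition sequence of
     STMT_INFO.  */
  static void
  append_pattern_def_seq_stmts (vec_info *vinfo, stmt_vec_info stmt_info,
                                gimple_seq *seq)
  {
    for (gimple_stmt_iterator gsi = gsi_start (*seq); !gsi_end_p (gsi);)
      {
        gimple *stmt = gsi_stmt (gsi);
        /* gsi_remove advances the iterator to the next statement.  */
        gsi_remove (&gsi, false);
        append_pattern_def_seq (vinfo, stmt_info, stmt);
      }
  }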

+  pattern_stmt
+    = gimple_build_assign (vect_recog_temp_ssa_var (load_type, NULL),
+                          fold_build2 (BIT_AND_EXPR, load_type, value,
+                                       mask_t));

please avoid building GENERIC and then gimple from it.  Either use

  gimple_build_assign (..., BIT_AND_EXPR, load_type, value, mask_t);

or, if you want to fold, use

  result_value = gimple_build (&stmts, BIT_AND_EXPR, load_type, value, mask_t);

as above with gimple_convert.  See my comment about the nice to have
helper so you can block-process the 'stmts' sequence as pattern
def sequence.

+  mask_i <<= shift_n;
+  mask_i = ~mask_i;

you have to use wide_ints again, a HOST_WIDE_INT might not be
large enough.
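
For example (untested; mask_width standing for the number of bits in the
bitfield):

  wide_int inv_mask_wi = wi::shifted_mask (shift_n, mask_width, true,
                                           TYPE_PRECISION (load_type));
  tree inv_mask = wide_int_to_tree (load_type, inv_mask_wi);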

You probably want to double-check your lowering code by
bootstrapping / testing with -ftree-loop-if-convert.

Richard.

Thread overview: 25+ messages
2022-07-26 10:00 [RFC] Teach vectorizer to deal with bitfield reads Andre Vieira (lists)
2022-07-27 11:37 ` Richard Biener
2022-07-29  8:57   ` Andre Vieira (lists)
2022-07-29  9:11     ` Richard Biener
2022-07-29 10:31     ` Jakub Jelinek
2022-07-29 10:52       ` Richard Biener
2022-08-01 10:21         ` Andre Vieira (lists)
2022-08-01 13:16           ` Richard Biener
2022-08-08 14:06             ` [PATCH] Teach vectorizer to deal with bitfield accesses (was: [RFC] Teach vectorizer to deal with bitfield reads) Andre Vieira (lists)
2022-08-09 14:34               ` Richard Biener
2022-08-16 10:24                 ` Andre Vieira (lists)
2022-08-17 12:49                   ` Richard Biener [this message]
2022-08-25  9:09                     ` Andre Vieira (lists)
2022-09-08  9:07                       ` Andre Vieira (lists)
2022-09-08 11:51                       ` Richard Biener
2022-09-26 15:23                         ` Andre Vieira (lists)
2022-09-27 12:34                           ` Richard Biener
2022-09-28  9:43                             ` Andre Vieira (lists)
2022-09-28 17:31                               ` Andre Vieira (lists)
2022-09-29  7:54                                 ` Richard Biener
2022-10-07 14:20                                   ` Andre Vieira (lists)
2022-10-12  1:55                                     ` Hongtao Liu
2022-10-12  2:11                                       ` Hongtao Liu
2022-08-01 10:13       ` [RFC] Teach vectorizer to deal with bitfield reads Andre Vieira (lists)
2022-10-12  9:02 ` Eric Botcazou
