Hi,

New version of the patch attached, but I haven't recreated the ChangeLog yet;
I'm just waiting to see if this is what you had in mind. See also some replies
to your comments in-line below.

On 09/08/2022 15:34, Richard Biener wrote:
> @@ -2998,7 +3013,7 @@ ifcvt_split_critical_edges (class loop *loop, bool aggressive_if_conv)
>    auto_vec<edge> critical_edges;
>
>    /* Loop is not well formed.  */
> -  if (num <= 2 || loop->inner || !single_exit (loop))
> +  if (num <= 2 || loop->inner)
>      return false;
>
>    body = get_loop_body (loop);
>
> this doesn't appear in the ChangeLog nor is it clear to me why it's
> needed?  Likewise

So both these and...

> -  /* Save BB->aux around loop_version as that uses the same field.  */
> -  save_length = loop->inner ? loop->inner->num_nodes : loop->num_nodes;
> -  void **saved_preds = XALLOCAVEC (void *, save_length);
> -  for (unsigned i = 0; i < save_length; i++)
> -    saved_preds[i] = ifc_bbs[i]->aux;
> +  void **saved_preds = NULL;
> +  if (any_complicated_phi || need_to_predicate)
> +    {
> +      /* Save BB->aux around loop_version as that uses the same field.  */
> +      save_length = loop->inner ? loop->inner->num_nodes : loop->num_nodes;
> +      saved_preds = XALLOCAVEC (void *, save_length);
> +      for (unsigned i = 0; i < save_length; i++)
> +        saved_preds[i] = ifc_bbs[i]->aux;
> +    }
>
> is that just premature optimization?

... these changes are there to make sure we can still use the loop-versioning
code even for cases where there are bitfields to lower but no if-conversions
to perform (i.e. the number of BBs is <= 2). I wasn't sure about the
loop->inner condition, but in the small examples I tried it seemed to work,
i.e. loop_version seems to be able to handle nested loops. The single_exit
condition is still required for both, because the code that creates the loop
versions depends on it. It does look like I missed this in the ChangeLog...

> +  /* BITSTART and BITEND describe the region we can safely load from inside
> +     the structure.  BITPOS is the bit position of the value inside the
> +     representative that we will end up loading OFFSET bytes from the start
> +     of the struct.  BEST_MODE is the mode describing the optimal size of
> +     the representative chunk we load.  If this is a write we will store
> +     the same sized representative back, after we have changed the
> +     appropriate bits.  */
> +  get_bit_range (&bitstart, &bitend, comp_ref, &bitpos, &offset);
>
> I think you need to give up when get_bit_range sets bitstart = bitend to
> zero
>
> +      if (get_best_mode (bitsize, bitpos.to_constant (), bitstart, bitend,
> +                         TYPE_ALIGN (TREE_TYPE (struct_expr)),
> +                         INT_MAX, false, &best_mode))
>
> +  tree rep_decl = build_decl (UNKNOWN_LOCATION, FIELD_DECL,
> +                              NULL_TREE, rep_type);
> +  /* Load from the start of 'offset + bitpos % alignment'.  */
> +  uint64_t extra_offset = bitpos.to_constant ();
>
> you shouldn't build a new FIELD_DECL.  Either you use
> DECL_BIT_FIELD_REPRESENTATIVE directly or you use a
> BIT_FIELD_REF accessing the "representative".
> DECL_BIT_FIELD_REPRESENTATIVE exists so it can maintain
> a variable field offset, you can also subset that with an
> intermediate BIT_FIELD_REF if DECL_BIT_FIELD_REPRESENTATIVE is
> too large for your taste.
>
> I'm not sure all the offset calculation you do is correct, but
> since you shouldn't invent a new FIELD_DECL it probably needs
> to change anyway ...

I can use the DECL_BIT_FIELD_REPRESENTATIVE, but I'll still need some offset
calculation/extraction.
It's easier to explain with an example. In vect-bitfield-read-3.c the struct:

typedef struct {
    int  c;
    int  b;
    bool a : 1;
} struct_t;

and the field access 'vect_false[i].a' or 'vect_true[i].a' will lead to a
DECL_BIT_FIELD_REPRESENTATIVE with a TYPE_SIZE of 8 (and TYPE_PRECISION is
also 8, as expected). However, the DECL_FIELD_OFFSET of either the original
field decl (the actual bitfield member) or the DECL_BIT_FIELD_REPRESENTATIVE
is 0, and the DECL_FIELD_BIT_OFFSET is 64. These will lead to the correct load:

_1 = vect_false[i].D;

D here being the representative, this is an 8-bit load from
vect_false[i] + 64 bits. So all good there. However, when we construct the
BIT_FIELD_REF we can't simply use DECL_FIELD_BIT_OFFSET (field_decl) as the
BIT_FIELD_REF's bitpos: verify_gimple checks that bitpos + bitsize does not
exceed TYPE_SIZE (TREE_TYPE (load)) for BIT_FIELD_REF (load, bitsize, bitpos).
So instead I change bitpos such that:

align_of_representative = TYPE_ALIGN (TREE_TYPE (representative));
bitpos -= bitpos.to_constant () / align_of_representative
	  * align_of_representative;

I've now rewritten this to:

poly_int64 q, r;
if (can_div_trunc_p (bitpos, align_of_representative, &q, &r))
  bitpos = r;

which makes it slightly clearer, also because I no longer need the changes to
the original tree offset as I'm just using D for the load.

> Note that for optimization it will be important that all
> accesses to the bitfield members of the same bitfield use the
> same underlying area (CSE and store-forwarding will thank you).
>
> +
> +  need_to_lower_bitfields = bitfields_to_lower_p (loop,
> +                                                  &bitfields_to_lower);
> +  if (!ifcvt_split_critical_edges (loop, aggressive_if_conv)
> +      && !need_to_lower_bitfields)
>      goto cleanup;
>
> so we lower bitfields even when we cannot split critical edges?
> why?
>
> +  need_to_ifcvt
> +    = if_convertible_loop_p (loop) && dbg_cnt (if_conversion_tree);
> +  if (!need_to_ifcvt && !need_to_lower_bitfields)
>      goto cleanup;
>
> likewise - if_convertible_loop_p performs other checks, the only
> one we want to elide is the loop->num_nodes <= 2 check since
> we want to lower bitfields in single-block loops as well.  That
> means we only have to scan for bitfield accesses in the first
> block "prematurely".  So I would interwind the need_to_lower_bitfields
> into if_convertible_loop_p and if_convertible_loop_p_1 and
> put the loop->num_nodes <= 2 after it when !need_to_lower_bitfields.

I'm not sure I understood this, but I'd rather keep the (new) 'need_to_ifcvt'
and 'need_to_lower_bitfields' separate. One thing I did change is that we no
longer check for bitfields to lower if there are if-stmts that we can't
lower, since we will not be vectorizing this loop anyway, so there is no
point in wasting time lowering bitfields. At the same time though, I'd like
to be able to lower bitfields when there are no if-stmts to convert.

> +      if (!useless_type_conversion_p (TREE_TYPE (lhs), ret_type))
> +        {
> +          pattern_stmt
> +            = gimple_build_assign (vect_recog_temp_ssa_var (ret_type, NULL),
> +                                   NOP_EXPR, lhs);
> +          lhs = gimple_get_lhs (pattern_stmt);
> +          append_pattern_def_seq (vinfo, stmt_info, pattern_stmt);
> +        }
>
> hm - so you have for example
>
>  int _1 = MEM;
>  int:3 _2 = BIT_FIELD_REF <_1, ...>
>  type _3 = (type) _2;
>
> and that _3 = (type) _2 is because of integer promotion and you
> perform all the shifting in that type.  I suppose you should
> verify that the cast is indeed promoting, not narrowing, since
> otherwise you'll produce wrong code?  That said, shouldn't you
> perform the shift / mask in the type of _1 instead?
> (the hope
> is, of course, that typeof (_1) == type in most cases)
>
> Similar comments apply to vect_recog_bit_insert_pattern.

Good shout. I hadn't realized that yet, because the testcases didn't expose
that problem, but now that I use the REPRESENTATIVE macro they do test it.

I don't think bit_insert is a problem though. In bit_insert, 'value' always
has the relevant bits starting at its LSB. So regardless of whether the load
(and store) type is larger or smaller than the type of 'value', performing
the shifts and masks in the load (and store) type should be OK, as you'll
only be 'cutting off' the MSBs, which are the ones that would get truncated
anyway. Or am I missing something here?
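
To make that argument concrete, here is a rough C model of the read and write
sequences I have in mind. This is just a sketch assuming a 32-bit
representative; the helper names and parameters below are made up for
illustration and don't correspond to anything in the patch:

#include <stdint.h>

/* Extract the BITSIZE-wide field at BITPOS from the representative REP.
   The shift/mask is done in the representative's (unsigned) type; the
   caller then converts the result to the field's type.  */
static inline uint32_t
extract_field (uint32_t rep, unsigned int bitpos, unsigned int bitsize)
{
  uint32_t mask = (bitsize < 32 ? (1u << bitsize) : 0u) - 1u;
  return (rep >> bitpos) & mask;
}

/* Insert VALUE into the BITSIZE-wide field at BITPOS of REP.  VALUE's
   relevant bits sit at its LSB, so masking in the representative's type
   only drops the bits above BITSIZE -- exactly the bits a store to the
   bitfield would have truncated anyway.  */
static inline uint32_t
insert_field (uint32_t rep, uint32_t value, unsigned int bitpos,
              unsigned int bitsize)
{
  uint32_t mask = (bitsize < 32 ? (1u << bitsize) : 0u) - 1u;
  return (rep & ~(mask << bitpos)) | ((value & mask) << bitpos);
}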