* [RFC] Implementing detection of saturation and rounding arithmetic
@ 2021-06-03 16:34 Andre Vieira (lists)
2021-06-03 16:41 ` [RFC][ivopts] Generate better code for IVs with uses outside the loop (was Re: [RFC] Implementing detection of saturation and rounding arithmetic) Andre Vieira (lists)
2021-06-07 11:28 ` [RFC] Implementing detection of saturation and rounding arithmetic Bin.Cheng
0 siblings, 2 replies; 5+ messages in thread
From: Andre Vieira (lists) @ 2021-06-03 16:34 UTC (permalink / raw)
To: gcc-patches; +Cc: Richard Biener, Richard Sandiford, bin.cheng
[-- Attachment #1: Type: text/plain, Size: 6567 bytes --]
Hi,
This RFC is motivated by the IV sharing RFC in
https://gcc.gnu.org/pipermail/gcc-patches/2021-May/569502.html and the
need to have the IVOPTS pass be able to clean up IV's shared between
multiple loops. When creating a similar problem with C code I noticed
IVOPTs treated IV's with uses outside the loop differently, this didn't
even required multiple loops, take for instance the following example
using SVE intrinsics:
#include <arm_sve.h>
#include <limits.h>
extern void use (char *);
void bar (char * __restrict__ a, char * __restrict__ b, char *
__restrict__ c, unsigned n)
{
svbool_t all_true = svptrue_b8 ();
unsigned i = 0;
if (n < (UINT_MAX - svcntb() - 1))
{
for (; i < n; i += svcntb())
{
svuint8_t va = svld1 (all_true, (uint8_t*)a);
svuint8_t vb = svld1 (all_true, (uint8_t*)b);
svst1 (all_true, (uint8_t *)c, svadd_z (all_true, va,vb));
a += svcntb();
b += svcntb();
c += svcntb();
}
}
use (a);
}
IVOPTs tends to generate a shared IV for SVE memory accesses, as we
don't have a post-increment for SVE load/stores. If we had not included
'use (a);' in this example, IVOPTs would have replaced the IV's for a, b
and c with a single one, (also used for the loop-control). See:
<bb 4> [local count: 955630225]:
# ivtmp.7_8 = PHI <ivtmp.7_25(7), 0(6)>
va_14 = MEM <svuint8_t> [(unsigned char *)a_10(D) + ivtmp.7_8 * 1];
vb_15 = MEM <svuint8_t> [(unsigned char *)b_11(D) + ivtmp.7_8 * 1];
_2 = svadd_u8_z ({ -1, ... }, va_14, vb_15);
MEM <__SVUint8_t> [(unsigned char *)c_12(D) + ivtmp.7_8 * 1] = _2;
ivtmp.7_25 = ivtmp.7_8 + POLY_INT_CST [16, 16];
i_23 = (unsigned int) ivtmp.7_25;
if (n_9(D) > i_23)
goto <bb 7>; [89.00%]
else
goto <bb 5>; [11.00%]
However, due to the 'use (a);' it will create two IVs one for
loop-control, b and c and one for a. See:
<bb 4> [local count: 955630225]:
# a_28 = PHI <a_18(7), a_11(D)(6)>
# ivtmp.7_25 = PHI <ivtmp.7_24(7), 0(6)>
va_15 = MEM <svuint8_t> [(unsigned char *)a_28];
vb_16 = MEM <svuint8_t> [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
_2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
a_18 = a_28 + POLY_INT_CST [16, 16];
ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
i_8 = (unsigned int) ivtmp.7_24;
if (n_10(D) > i_8)
goto <bb 7>; [89.00%]
else
goto <bb 10>; [11.00%]
With the first patch attached in this RFC 'no_cost.patch', I tell IVOPTs
to not cost uses outside of the loop. This makes IVOPTs generate a
single IV, but unfortunately it decides to create the variable for the
use inside the loop and it also seems to use the pre-increment value of
the shared-IV and add the [16,16] to it. See:
<bb 4> [local count: 955630225]:
# ivtmp.7_25 = PHI <ivtmp.7_24(7), 0(6)>
va_15 = MEM <svuint8_t> [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
vb_16 = MEM <svuint8_t> [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
_2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
_8 = (unsigned long) a_11(D);
_7 = _8 + ivtmp.7_25;
_6 = _7 + POLY_INT_CST [16, 16];
a_18 = (char * restrict) _6;
ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
i_5 = (unsigned int) ivtmp.7_24;
if (n_10(D) > i_5)
goto <bb 7>; [89.00%]
else
goto <bb 10>; [11.00%]
With the patch 'var_after.patch' I make get_computation_aff_1 use
'cand->var_after' for outside uses thus using the post-increment var of
the candidate IV. This means I have to insert it in a different place
and make sure to delete the old use->stmt. I'm sure there is a better
way to do this using IVOPTs current framework, but I didn't find one
yet. See the result:
<bb 4> [local count: 955630225]:
# ivtmp.7_25 = PHI <ivtmp.7_24(7), 0(6)>
va_15 = MEM <svuint8_t> [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
vb_16 = MEM <svuint8_t> [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
_2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
_8 = (unsigned long) a_11(D);
_7 = _8 + ivtmp.7_24;
a_18 = (char * restrict) _7;
i_6 = (unsigned int) ivtmp.7_24;
if (n_10(D) > i_6)
goto <bb 7>; [89.00%]
else
goto <bb 10>; [11.00%]
This is still not optimal as we are still doing the update inside the
loop and there is absolutely no need for that. I found that running sink
would solve it and it seems someone has added a second sink pass, so
that saves me a third patch :) see after sink2:
<bb 4> [local count: 955630225]:
# ivtmp.7_25 = PHI <ivtmp.7_24(7), 0(6)>
va_15 = MEM <svuint8_t> [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
vb_16 = MEM <svuint8_t> [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
_2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
i_6 = (unsigned int) ivtmp.7_24;
if (i_6 < n_10(D))
goto <bb 7>; [89.00%]
else
goto <bb 10>; [11.00%]
<bb 10> [local count: 105119324]:
_8 = (unsigned long) a_11(D);
_7 = _8 + ivtmp.7_24;
a_18 = (char * restrict) _7;
goto <bb 5>; [100.00%]
I haven't tested this at all, but I wanted to get the opinion of someone
more knowledgeable in IVOPTs before I continued this avenue. I have two
main questions:
1) How should we be costing outside uses, right now I use a nocost, but
that's not entirely accurate. Should we use a constant multiply factor
for inside loop uses to make them outweigh outside uses? Should we use
iteration count if available? Do we want to use a backend hook to let
targets provide their own costing for these?
2) Is there a cleaner way to generate the optimal 'post-increment' use
for the outside-use variable? I first thought the position in the
candidate might be something I could use or even the var_at_stmt
functionality, but the outside IV has the actual increment of the
variable as it's use, rather than the outside uses. This is this RFC's
main weakness I find.
Kind regards,
Andre
[-- Attachment #2: no_cost.patch --]
[-- Type: text/plain, Size: 2073 bytes --]
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 12a8a49a3071c09f222fbb6aef68c2a24a107252..1e80da3826ec427fefc9d9e8d882c21d2b3b05c8 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -413,6 +413,9 @@ struct iv_use
tree addr_base; /* Base address with const offset stripped. */
poly_uint64_pod addr_offset;
/* Const offset stripped from base address. */
+ bool outside; /* True if the use of this IV is outside of the loop,
+ use this to make such uses 'less costly' and avoid
+ updating it inside the loop. */
};
/* Group of uses. */
@@ -1538,6 +1541,7 @@ record_use (struct iv_group *group, tree *use_p, struct iv *iv,
use->op_p = use_p;
use->addr_base = addr_base;
use->addr_offset = addr_offset;
+ use->outside = false;
group->vuses.safe_push (use);
return use;
@@ -1666,6 +1670,23 @@ find_interesting_uses_op (struct ivopts_data *data, tree op)
use = record_group_use (data, NULL, iv, stmt, USE_NONLINEAR_EXPR, NULL_TREE);
iv->nonlin_use = use;
+
+ /* Find out whether this is only used outside of the loop. */
+ use->outside = true;
+ tree def;
+ if (gimple_code (stmt) == GIMPLE_PHI)
+ def = PHI_RESULT (stmt);
+ else
+ def = gimple_get_lhs (stmt);
+
+ imm_use_iterator imm_iter;
+ FOR_EACH_IMM_USE_STMT (stmt, imm_iter, def)
+ {
+ /* Do not count it's own PHI. */
+ if (gimple_code (stmt) != GIMPLE_PHI
+ && flow_bb_inside_loop_p (data->current_loop, gimple_bb (stmt)))
+ use->outside = false;
+ }
return use;
}
@@ -4958,7 +4979,8 @@ determine_group_iv_cost_generic (struct ivopts_data *data,
original biv, the cost is 0. This also prevents us from counting the
cost of increment twice -- once at this use and once in the cost of
the candidate. */
- if (cand->pos == IP_ORIGINAL && cand->incremented_at == use->stmt)
+ if (use->outside
+ || (cand->pos == IP_ORIGINAL && cand->incremented_at == use->stmt))
cost = no_cost;
else
cost = get_computation_cost (data, use, cand, false,
[-- Attachment #3: var_after.patch --]
[-- Type: text/plain, Size: 1890 bytes --]
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 1e80da3826ec427fefc9d9e8d882c21d2b3b05c8..ba6ced36e27b7b3a30d51135fd6aba72d66dbe0d 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -3994,7 +3994,13 @@ get_computation_aff_1 (class loop *loop, gimple *at, struct iv_use *use,
if (TYPE_PRECISION (utype) > TYPE_PRECISION (ctype))
return false;
- var = var_at_stmt (loop, cand, at);
+ if (use->outside)
+ {
+ var = cand->var_after;
+ ubase = fold_build2 (MINUS_EXPR, utype, ubase, ustep);
+ }
+ else
+ var = var_at_stmt (loop, cand, at);
uutype = unsigned_type_for (utype);
/* If the conversion is not noop, perform it. */
@@ -7328,19 +7334,32 @@ rewrite_use_nonlinear_expr (struct ivopts_data *data,
}
}
- gsi_insert_seq_before (&bsi, stmt_list, GSI_SAME_STMT);
- if (gimple_code (use->stmt) == GIMPLE_PHI)
+ if (use->outside)
{
+ gcc_assert (gimple_code (use->stmt) != GIMPLE_PHI);
ass = gimple_build_assign (tgt, comp);
- gsi_insert_before (&bsi, ass, GSI_SAME_STMT);
-
+ gimple_seq_add_stmt (&stmt_list, ass);
+ bsi = gsi_for_stmt (SSA_NAME_DEF_STMT (cand->var_after));
+ gsi_insert_seq_after (&bsi, stmt_list, GSI_SAME_STMT);
bsi = gsi_for_stmt (use->stmt);
- remove_phi_node (&bsi, false);
+ gsi_remove (&bsi, true);
}
else
{
- gimple_assign_set_rhs_from_tree (&bsi, comp);
- use->stmt = gsi_stmt (bsi);
+ gsi_insert_seq_before (&bsi, stmt_list, GSI_SAME_STMT);
+ if (gimple_code (use->stmt) == GIMPLE_PHI)
+ {
+ ass = gimple_build_assign (tgt, comp);
+ gsi_insert_before (&bsi, ass, GSI_SAME_STMT);
+
+ bsi = gsi_for_stmt (use->stmt);
+ remove_phi_node (&bsi, false);
+ }
+ else
+ {
+ gimple_assign_set_rhs_from_tree (&bsi, comp);
+ use->stmt = gsi_stmt (bsi);
+ }
}
}
^ permalink raw reply [flat|nested] 5+ messages in thread
* [RFC][ivopts] Generate better code for IVs with uses outside the loop (was Re: [RFC] Implementing detection of saturation and rounding arithmetic)
2021-06-03 16:34 [RFC] Implementing detection of saturation and rounding arithmetic Andre Vieira (lists)
@ 2021-06-03 16:41 ` Andre Vieira (lists)
2021-06-07 11:28 ` [RFC] Implementing detection of saturation and rounding arithmetic Bin.Cheng
1 sibling, 0 replies; 5+ messages in thread
From: Andre Vieira (lists) @ 2021-06-03 16:41 UTC (permalink / raw)
To: gcc-patches; +Cc: Richard Sandiford, bin.cheng, Richard Biener
Streams got crossed there and used the wrong subject ...
On 03/06/2021 17:34, Andre Vieira (lists) via Gcc-patches wrote:
> Hi,
>
> This RFC is motivated by the IV sharing RFC in
> https://gcc.gnu.org/pipermail/gcc-patches/2021-May/569502.html and the
> need to have the IVOPTS pass be able to clean up IV's shared between
> multiple loops. When creating a similar problem with C code I noticed
> IVOPTs treated IV's with uses outside the loop differently, this
> didn't even required multiple loops, take for instance the following
> example using SVE intrinsics:
>
> #include <arm_sve.h>
> #include <limits.h>
> extern void use (char *);
> void bar (char * __restrict__ a, char * __restrict__ b, char *
> __restrict__ c, unsigned n)
> {
> svbool_t all_true = svptrue_b8 ();
> unsigned i = 0;
> if (n < (UINT_MAX - svcntb() - 1))
> {
> for (; i < n; i += svcntb())
> {
> svuint8_t va = svld1 (all_true, (uint8_t*)a);
> svuint8_t vb = svld1 (all_true, (uint8_t*)b);
> svst1 (all_true, (uint8_t *)c, svadd_z (all_true,
> va,vb));
> a += svcntb();
> b += svcntb();
> c += svcntb();
> }
> }
> use (a);
> }
>
> IVOPTs tends to generate a shared IV for SVE memory accesses, as we
> don't have a post-increment for SVE load/stores. If we had not
> included 'use (a);' in this example, IVOPTs would have replaced the
> IV's for a, b and c with a single one, (also used for the
> loop-control). See:
>
> <bb 4> [local count: 955630225]:
> # ivtmp.7_8 = PHI <ivtmp.7_25(7), 0(6)>
> va_14 = MEM <svuint8_t> [(unsigned char *)a_10(D) + ivtmp.7_8 * 1];
> vb_15 = MEM <svuint8_t> [(unsigned char *)b_11(D) + ivtmp.7_8 * 1];
> _2 = svadd_u8_z ({ -1, ... }, va_14, vb_15);
> MEM <__SVUint8_t> [(unsigned char *)c_12(D) + ivtmp.7_8 * 1] = _2;
> ivtmp.7_25 = ivtmp.7_8 + POLY_INT_CST [16, 16];
> i_23 = (unsigned int) ivtmp.7_25;
> if (n_9(D) > i_23)
> goto <bb 7>; [89.00%]
> else
> goto <bb 5>; [11.00%]
>
> However, due to the 'use (a);' it will create two IVs one for
> loop-control, b and c and one for a. See:
>
> <bb 4> [local count: 955630225]:
> # a_28 = PHI <a_18(7), a_11(D)(6)>
> # ivtmp.7_25 = PHI <ivtmp.7_24(7), 0(6)>
> va_15 = MEM <svuint8_t> [(unsigned char *)a_28];
> vb_16 = MEM <svuint8_t> [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
> _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
> MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
> a_18 = a_28 + POLY_INT_CST [16, 16];
> ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
> i_8 = (unsigned int) ivtmp.7_24;
> if (n_10(D) > i_8)
> goto <bb 7>; [89.00%]
> else
> goto <bb 10>; [11.00%]
>
> With the first patch attached in this RFC 'no_cost.patch', I tell
> IVOPTs to not cost uses outside of the loop. This makes IVOPTs
> generate a single IV, but unfortunately it decides to create the
> variable for the use inside the loop and it also seems to use the
> pre-increment value of the shared-IV and add the [16,16] to it. See:
>
> <bb 4> [local count: 955630225]:
> # ivtmp.7_25 = PHI <ivtmp.7_24(7), 0(6)>
> va_15 = MEM <svuint8_t> [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
> vb_16 = MEM <svuint8_t> [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
> _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
> MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
> _8 = (unsigned long) a_11(D);
> _7 = _8 + ivtmp.7_25;
> _6 = _7 + POLY_INT_CST [16, 16];
> a_18 = (char * restrict) _6;
> ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
> i_5 = (unsigned int) ivtmp.7_24;
> if (n_10(D) > i_5)
> goto <bb 7>; [89.00%]
> else
> goto <bb 10>; [11.00%]
>
> With the patch 'var_after.patch' I make get_computation_aff_1 use
> 'cand->var_after' for outside uses thus using the post-increment var
> of the candidate IV. This means I have to insert it in a different
> place and make sure to delete the old use->stmt. I'm sure there is a
> better way to do this using IVOPTs current framework, but I didn't
> find one yet. See the result:
>
> <bb 4> [local count: 955630225]:
> # ivtmp.7_25 = PHI <ivtmp.7_24(7), 0(6)>
> va_15 = MEM <svuint8_t> [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
> vb_16 = MEM <svuint8_t> [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
> _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
> MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
> ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
> _8 = (unsigned long) a_11(D);
> _7 = _8 + ivtmp.7_24;
> a_18 = (char * restrict) _7;
> i_6 = (unsigned int) ivtmp.7_24;
> if (n_10(D) > i_6)
> goto <bb 7>; [89.00%]
> else
> goto <bb 10>; [11.00%]
>
>
> This is still not optimal as we are still doing the update inside the
> loop and there is absolutely no need for that. I found that running
> sink would solve it and it seems someone has added a second sink pass,
> so that saves me a third patch :) see after sink2:
>
> <bb 4> [local count: 955630225]:
> # ivtmp.7_25 = PHI <ivtmp.7_24(7), 0(6)>
> va_15 = MEM <svuint8_t> [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
> vb_16 = MEM <svuint8_t> [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
> _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
> MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
> ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
> i_6 = (unsigned int) ivtmp.7_24;
> if (i_6 < n_10(D))
> goto <bb 7>; [89.00%]
> else
> goto <bb 10>; [11.00%]
>
> <bb 10> [local count: 105119324]:
> _8 = (unsigned long) a_11(D);
> _7 = _8 + ivtmp.7_24;
> a_18 = (char * restrict) _7;
> goto <bb 5>; [100.00%]
>
>
> I haven't tested this at all, but I wanted to get the opinion of
> someone more knowledgeable in IVOPTs before I continued this avenue. I
> have two main questions:
> 1) How should we be costing outside uses, right now I use a nocost,
> but that's not entirely accurate. Should we use a constant multiply
> factor for inside loop uses to make them outweigh outside uses? Should
> we use iteration count if available? Do we want to use a backend hook
> to let targets provide their own costing for these?
> 2) Is there a cleaner way to generate the optimal 'post-increment' use
> for the outside-use variable? I first thought the position in the
> candidate might be something I could use or even the var_at_stmt
> functionality, but the outside IV has the actual increment of the
> variable as it's use, rather than the outside uses. This is this RFC's
> main weakness I find.
>
> Kind regards,
> Andre
>
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC] Implementing detection of saturation and rounding arithmetic
2021-06-03 16:34 [RFC] Implementing detection of saturation and rounding arithmetic Andre Vieira (lists)
2021-06-03 16:41 ` [RFC][ivopts] Generate better code for IVs with uses outside the loop (was Re: [RFC] Implementing detection of saturation and rounding arithmetic) Andre Vieira (lists)
@ 2021-06-07 11:28 ` Bin.Cheng
2021-06-08 15:00 ` Andre Simoes Dias Vieira
1 sibling, 1 reply; 5+ messages in thread
From: Bin.Cheng @ 2021-06-07 11:28 UTC (permalink / raw)
To: Andre Vieira (lists)
Cc: gcc-patches, Richard Sandiford, bin.cheng, Richard Biener
On Fri, Jun 4, 2021 at 12:35 AM Andre Vieira (lists) via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi,
>
> This RFC is motivated by the IV sharing RFC in
> https://gcc.gnu.org/pipermail/gcc-patches/2021-May/569502.html and the
> need to have the IVOPTS pass be able to clean up IV's shared between
> multiple loops. When creating a similar problem with C code I noticed
> IVOPTs treated IV's with uses outside the loop differently, this didn't
> even required multiple loops, take for instance the following example
> using SVE intrinsics:
>
> #include <arm_sve.h>
> #include <limits.h>
> extern void use (char *);
> void bar (char * __restrict__ a, char * __restrict__ b, char *
> __restrict__ c, unsigned n)
> {
> svbool_t all_true = svptrue_b8 ();
> unsigned i = 0;
> if (n < (UINT_MAX - svcntb() - 1))
> {
> for (; i < n; i += svcntb())
> {
> svuint8_t va = svld1 (all_true, (uint8_t*)a);
> svuint8_t vb = svld1 (all_true, (uint8_t*)b);
> svst1 (all_true, (uint8_t *)c, svadd_z (all_true, va,vb));
> a += svcntb();
> b += svcntb();
> c += svcntb();
> }
> }
> use (a);
> }
>
> IVOPTs tends to generate a shared IV for SVE memory accesses, as we
> don't have a post-increment for SVE load/stores. If we had not included
> 'use (a);' in this example, IVOPTs would have replaced the IV's for a, b
> and c with a single one, (also used for the loop-control). See:
>
> <bb 4> [local count: 955630225]:
> # ivtmp.7_8 = PHI <ivtmp.7_25(7), 0(6)>
> va_14 = MEM <svuint8_t> [(unsigned char *)a_10(D) + ivtmp.7_8 * 1];
> vb_15 = MEM <svuint8_t> [(unsigned char *)b_11(D) + ivtmp.7_8 * 1];
> _2 = svadd_u8_z ({ -1, ... }, va_14, vb_15);
> MEM <__SVUint8_t> [(unsigned char *)c_12(D) + ivtmp.7_8 * 1] = _2;
> ivtmp.7_25 = ivtmp.7_8 + POLY_INT_CST [16, 16];
> i_23 = (unsigned int) ivtmp.7_25;
> if (n_9(D) > i_23)
> goto <bb 7>; [89.00%]
> else
> goto <bb 5>; [11.00%]
>
> However, due to the 'use (a);' it will create two IVs one for
> loop-control, b and c and one for a. See:
>
> <bb 4> [local count: 955630225]:
> # a_28 = PHI <a_18(7), a_11(D)(6)>
> # ivtmp.7_25 = PHI <ivtmp.7_24(7), 0(6)>
> va_15 = MEM <svuint8_t> [(unsigned char *)a_28];
> vb_16 = MEM <svuint8_t> [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
> _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
> MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
> a_18 = a_28 + POLY_INT_CST [16, 16];
> ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
> i_8 = (unsigned int) ivtmp.7_24;
> if (n_10(D) > i_8)
> goto <bb 7>; [89.00%]
> else
> goto <bb 10>; [11.00%]
>
> With the first patch attached in this RFC 'no_cost.patch', I tell IVOPTs
> to not cost uses outside of the loop. This makes IVOPTs generate a
> single IV, but unfortunately it decides to create the variable for the
> use inside the loop and it also seems to use the pre-increment value of
> the shared-IV and add the [16,16] to it. See:
>
> <bb 4> [local count: 955630225]:
> # ivtmp.7_25 = PHI <ivtmp.7_24(7), 0(6)>
> va_15 = MEM <svuint8_t> [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
> vb_16 = MEM <svuint8_t> [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
> _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
> MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
> _8 = (unsigned long) a_11(D);
> _7 = _8 + ivtmp.7_25;
> _6 = _7 + POLY_INT_CST [16, 16];
> a_18 = (char * restrict) _6;
> ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
> i_5 = (unsigned int) ivtmp.7_24;
> if (n_10(D) > i_5)
> goto <bb 7>; [89.00%]
> else
> goto <bb 10>; [11.00%]
>
> With the patch 'var_after.patch' I make get_computation_aff_1 use
> 'cand->var_after' for outside uses thus using the post-increment var of
> the candidate IV. This means I have to insert it in a different place
> and make sure to delete the old use->stmt. I'm sure there is a better
> way to do this using IVOPTs current framework, but I didn't find one
> yet. See the result:
>
> <bb 4> [local count: 955630225]:
> # ivtmp.7_25 = PHI <ivtmp.7_24(7), 0(6)>
> va_15 = MEM <svuint8_t> [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
> vb_16 = MEM <svuint8_t> [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
> _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
> MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
> ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
> _8 = (unsigned long) a_11(D);
> _7 = _8 + ivtmp.7_24;
> a_18 = (char * restrict) _7;
> i_6 = (unsigned int) ivtmp.7_24;
> if (n_10(D) > i_6)
> goto <bb 7>; [89.00%]
> else
> goto <bb 10>; [11.00%]
>
>
> This is still not optimal as we are still doing the update inside the
> loop and there is absolutely no need for that. I found that running sink
> would solve it and it seems someone has added a second sink pass, so
> that saves me a third patch :) see after sink2:
>
> <bb 4> [local count: 955630225]:
> # ivtmp.7_25 = PHI <ivtmp.7_24(7), 0(6)>
> va_15 = MEM <svuint8_t> [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
> vb_16 = MEM <svuint8_t> [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
> _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
> MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
> ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
> i_6 = (unsigned int) ivtmp.7_24;
> if (i_6 < n_10(D))
> goto <bb 7>; [89.00%]
> else
> goto <bb 10>; [11.00%]
>
> <bb 10> [local count: 105119324]:
> _8 = (unsigned long) a_11(D);
> _7 = _8 + ivtmp.7_24;
> a_18 = (char * restrict) _7;
> goto <bb 5>; [100.00%]
>
>
> I haven't tested this at all, but I wanted to get the opinion of someone
> more knowledgeable in IVOPTs before I continued this avenue. I have two
> main questions:
> 1) How should we be costing outside uses, right now I use a nocost, but
> that's not entirely accurate. Should we use a constant multiply factor
> for inside loop uses to make them outweigh outside uses? Should we use
> iteration count if available? Do we want to use a backend hook to let
> targets provide their own costing for these?
Hi Andre,
I didn't look into the details of the IV sharing RFC. It seems to me
costing outside uses is trying to generate better code for later code
(epilogue loop here). The only problem is IVOPTs doesn't know that
the outside use is not in the final form - which will be transformed
by IVOPTs again.
I think this example is not good at describing your problem because it
shows exactly that considering outside use results in better code,
compared to the other two approaches.
> 2) Is there a cleaner way to generate the optimal 'post-increment' use
> for the outside-use variable? I first thought the position in the
> candidate might be something I could use or even the var_at_stmt
> functionality, but the outside IV has the actual increment of the
> variable as it's use, rather than the outside uses. This is this RFC's
> main weakness I find.
To answer why IVOPTs behaves like this w/o your two patches. The main
problem is the point IVOPTs rewrites outside use IV - I don't remember
the exact point - but looks like at the end of loop while before
incrementing instruction of main IV. It's a known issue that outside
use should be costed/re-written on the exit edge along which its value
flows out of loop. I had a patch a long time ago but discarded it,
because it didn't bring obvious improvement and is complicated in case
of multi-exit edges.
But in general, I am less convinced that any of the two patches is the
right direction solving IV sharing issue between vectorized loop and
epilogue loop. I would need to read the previous RFC before giving
further comments though.
Thanks,
bin
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC] Implementing detection of saturation and rounding arithmetic
2021-06-07 11:28 ` [RFC] Implementing detection of saturation and rounding arithmetic Bin.Cheng
@ 2021-06-08 15:00 ` Andre Simoes Dias Vieira
2021-06-10 11:51 ` [RFC][ivopts] Generate better code for IVs with uses outside the loop Andre Vieira (lists)
0 siblings, 1 reply; 5+ messages in thread
From: Andre Simoes Dias Vieira @ 2021-06-08 15:00 UTC (permalink / raw)
To: Bin.Cheng; +Cc: gcc-patches, Richard Sandiford, bin.cheng, Richard Biener
Hi Bin,
Thank you for the reply, I have some questions, see below.
On 07/06/2021 12:28, Bin.Cheng wrote:
> On Fri, Jun 4, 2021 at 12:35 AM Andre Vieira (lists) via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>
> Hi Andre,
> I didn't look into the details of the IV sharing RFC. It seems to me
> costing outside uses is trying to generate better code for later code
> (epilogue loop here). The only problem is IVOPTs doesn't know that
> the outside use is not in the final form - which will be transformed
> by IVOPTs again.
>
> I think this example is not good at describing your problem because it
> shows exactly that considering outside use results in better code,
> compared to the other two approaches.
I don't quite understand what you are saying here :( What do you mean by
final form? It seems to me that costing uses inside and outside loop the
same way is wrong because calculating the IV inside the loop has to be
done every iteration, whereas if you can resolve it to a single update
(without an IV) then you can sink it outside the loop. This is why I
think this example shows why we need to cost these uses differently.
>> 2) Is there a cleaner way to generate the optimal 'post-increment' use
>> for the outside-use variable? I first thought the position in the
>> candidate might be something I could use or even the var_at_stmt
>> functionality, but the outside IV has the actual increment of the
>> variable as it's use, rather than the outside uses. This is this RFC's
>> main weakness I find.
> To answer why IVOPTs behaves like this w/o your two patches. The main
> problem is the point IVOPTs rewrites outside use IV - I don't remember
> the exact point - but looks like at the end of loop while before
> incrementing instruction of main IV. It's a known issue that outside
> use should be costed/re-written on the exit edge along which its value
> flows out of loop. I had a patch a long time ago but discarded it,
> because it didn't bring obvious improvement and is complicated in case
> of multi-exit edges.
Yeah I haven't looked at multi-exit edges and I understand that
complicates things. But for now we could disable the special casing of
outside uses when dealing with multi-exit loops and keep the current
behavior.
>
> But in general, I am less convinced that any of the two patches is the
> right direction solving IV sharing issue between vectorized loop and
> epilogue loop. I would need to read the previous RFC before giving
> further comments though.
The previous RFC still has a lot of unanswered questions too, but
regardless of that, take the following (non-vectorizer) example:
#include <arm_neon.h>
#include <arm_sve.h>
void bar (char * __restrict__ a, char * __restrict__ b, char *
__restrict__ c, unsigned long long n)
{
svbool_t all_true = svptrue_b8 ();
unsigned long long i = 0;
for (; i < (n & ~(svcntb() - 1)); i += svcntb()) {
svuint8_t va = svld1 (all_true, (uint8_t*)a);
svuint8_t vb = svld1 (all_true, (uint8_t*)b);
svst1 (all_true, (uint8_t *)c, svadd_z (all_true, va,vb));
a += svcntb();
b += svcntb();
c += svcntb();
}
svbool_t pred;
for (; i < (n); i += svcntb()) {
pred = svwhilelt_b8 (i, n);
svuint8_t va = svld1 (pred, (uint8_t*)a);
svuint8_t vb = svld1 (pred, (uint8_t*)b);
svst1 (pred, (uint8_t *)c, svadd_z (pred, va,vb));
a += svcntb();
b += svcntb();
c += svcntb();
}
Current IVOPTs will use 4 iterators for the first loop, when it could do
with just 1. In fact, if you use my patches it will create just a single
IV and sink the uses and it is then able to merge them with loads &
stores of the next loop.
I am not saying setting outside costs to 0 is the right thing to do by
the way. It is absolutely not! It will break cost considerations for
other cases. Like I said above I've been playing around with using
'!use->outside' as a multiplier for the cost. Unfortunately it won't
help with the case above, because this seems to choose 'infinite_cost'
because the candidate IV has a lower precision than the use IV. I don't
quite understand yet how candidates are created, but something I'm going
to try to look at. Just wanted to show this as an example of how IVOPTs
would not improve code with multiple loops that don't involve the
vectorizer.
BR,
Andre
>
> Thanks,
> bin
^ permalink raw reply [flat|nested] 5+ messages in thread
* [RFC][ivopts] Generate better code for IVs with uses outside the loop
2021-06-08 15:00 ` Andre Simoes Dias Vieira
@ 2021-06-10 11:51 ` Andre Vieira (lists)
0 siblings, 0 replies; 5+ messages in thread
From: Andre Vieira (lists) @ 2021-06-10 11:51 UTC (permalink / raw)
To: Bin.Cheng; +Cc: Richard Sandiford, bin.cheng, gcc-patches, Richard Biener
On 08/06/2021 16:00, Andre Simoes Dias Vieira via Gcc-patches wrote:
> Hi Bin,
>
> Thank you for the reply, I have some questions, see below.
>
> On 07/06/2021 12:28, Bin.Cheng wrote:
>> On Fri, Jun 4, 2021 at 12:35 AM Andre Vieira (lists) via Gcc-patches
>> <gcc-patches@gcc.gnu.org> wrote:
>>
>> Hi Andre,
>> I didn't look into the details of the IV sharing RFC. It seems to me
>> costing outside uses is trying to generate better code for later code
>> (epilogue loop here). The only problem is IVOPTs doesn't know that
>> the outside use is not in the final form - which will be transformed
>> by IVOPTs again.
>>
>> I think this example is not good at describing your problem because it
>> shows exactly that considering outside use results in better code,
>> compared to the other two approaches.
> I don't quite understand what you are saying here :( What do you mean
> by final form? It seems to me that costing uses inside and outside
> loop the same way is wrong because calculating the IV inside the loop
> has to be done every iteration, whereas if you can resolve it to a
> single update (without an IV) then you can sink it outside the loop.
> This is why I think this example shows why we need to cost these uses
> differently.
>>> 2) Is there a cleaner way to generate the optimal 'post-increment' use
>>> for the outside-use variable? I first thought the position in the
>>> candidate might be something I could use or even the var_at_stmt
>>> functionality, but the outside IV has the actual increment of the
>>> variable as it's use, rather than the outside uses. This is this RFC's
>>> main weakness I find.
>> To answer why IVOPTs behaves like this w/o your two patches. The main
>> problem is the point IVOPTs rewrites outside use IV - I don't remember
>> the exact point - but looks like at the end of loop while before
>> incrementing instruction of main IV. It's a known issue that outside
>> use should be costed/re-written on the exit edge along which its value
>> flows out of loop. I had a patch a long time ago but discarded it,
>> because it didn't bring obvious improvement and is complicated in case
>> of multi-exit edges.
> Yeah I haven't looked at multi-exit edges and I understand that
> complicates things. But for now we could disable the special casing of
> outside uses when dealing with multi-exit loops and keep the current
> behavior.
>>
>> But in general, I am less convinced that any of the two patches is the
>> right direction solving IV sharing issue between vectorized loop and
>> epilogue loop. I would need to read the previous RFC before giving
>> further comments though.
>
> The previous RFC still has a lot of unanswered questions too, but
> regardless of that, take the following (non-vectorizer) example:
>
> #include <arm_neon.h>
> #include <arm_sve.h>
>
> void bar (char * __restrict__ a, char * __restrict__ b, char *
> __restrict__ c, unsigned long long n)
> {
> svbool_t all_true = svptrue_b8 ();
> unsigned long long i = 0;
> for (; i < (n & ~(svcntb() - 1)); i += svcntb()) {
> svuint8_t va = svld1 (all_true, (uint8_t*)a);
> svuint8_t vb = svld1 (all_true, (uint8_t*)b);
> svst1 (all_true, (uint8_t *)c, svadd_z (all_true, va,vb));
> a += svcntb();
> b += svcntb();
> c += svcntb();
> }
> svbool_t pred;
> for (; i < (n); i += svcntb()) {
> pred = svwhilelt_b8 (i, n);
> svuint8_t va = svld1 (pred, (uint8_t*)a);
> svuint8_t vb = svld1 (pred, (uint8_t*)b);
> svst1 (pred, (uint8_t *)c, svadd_z (pred, va,vb));
> a += svcntb();
> b += svcntb();
> c += svcntb();
> }
>
>
> Current IVOPTs will use 4 iterators for the first loop, when it could
> do with just 1. In fact, if you use my patches it will create just a
> single IV and sink the uses and it is then able to merge them with
> loads & stores of the next loop.
I mixed things up here, I think an earlier version of my patch (with
even more hacks) managed to rewrite these properly, but it looks like
the current ones are messing things up.
I'll continue to try to understand how this works as I do still think
IVOPTs should be able to do better.
You mentioned you had a patch you thought might help earlier, but you
dropped it. Do you still have it lying around anywhere?
>
> I am not saying setting outside costs to 0 is the right thing to do by
> the way. It is absolutely not! It will break cost considerations for
> other cases. Like I said above I've been playing around with using
> '!use->outside' as a multiplier for the cost. Unfortunately it won't
> help with the case above, because this seems to choose 'infinite_cost'
> because the candidate IV has a lower precision than the use IV. I
> don't quite understand yet how candidates are created, but something
> I'm going to try to look at. Just wanted to show this as an example of
> how IVOPTs would not improve code with multiple loops that don't
> involve the vectorizer.
>
> BR,
> Andre
>
>
>>
>> Thanks,
>> bin
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2021-06-10 11:51 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-03 16:34 [RFC] Implementing detection of saturation and rounding arithmetic Andre Vieira (lists)
2021-06-03 16:41 ` [RFC][ivopts] Generate better code for IVs with uses outside the loop (was Re: [RFC] Implementing detection of saturation and rounding arithmetic) Andre Vieira (lists)
2021-06-07 11:28 ` [RFC] Implementing detection of saturation and rounding arithmetic Bin.Cheng
2021-06-08 15:00 ` Andre Simoes Dias Vieira
2021-06-10 11:51 ` [RFC][ivopts] Generate better code for IVs with uses outside the loop Andre Vieira (lists)
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).