From mboxrd@z Thu Jan  1 00:00:00 1970
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
MIME-Version: 1.0
References: <mpt36eui76b.fsf@arm.com> <mpt4kzagsa9.fsf@arm.com>
In-Reply-To: <mpt4kzagsa9.fsf@arm.com>
From: Richard Biener <richard.guenther@gmail.com>
Date: Mon, 18 Nov 2019 11:05:00 -0000
Message-ID: <CAFiYyc0yFq+EzCzr5wGmeQ_UhG34XJjXVUJdyAvZ4qac3GzJtA@mail.gmail.com>
Subject: Re: [8/8] Optimise WAR and WAW alias checks
To: GCC Patches <gcc-patches@gcc.gnu.org>, Richard Sandiford <richard.sandiford@arm.com>
Content-Type: text/plain; charset="UTF-8"
X-IsSubscribed: yes
X-SW-Source: 2019-11/txt/msg01686.txt.bz2

On Mon, Nov 11, 2019 at 7:52 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> For:
>
>   void
>   f1 (int *x, int *y)
>   {
>     for (int i = 0; i < 32; ++i)
>       x[i] += y[i];
>   }
>
> we checked at runtime whether one vector at x would overlap one vector
> at y.  But in cases like this, the vector code would handle x <= y just
> fine, since any write to address A still happens after any read from
> address A.  The only problem is if x is ahead of y by less than a
> vector.
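
For illustration only (not part of the patch), here is a minimal sketch of
that claim.  It models the vector loop as whole-vector loads followed by a
whole-vector store, assuming a hypothetical vectorization factor VF = 4, and
compares the result with the scalar loop for a given placement of x and y
inside one buffer.  All names below are made up for the example.

```c
#include <assert.h>
#include <string.h>

enum { N = 32, VF = 4, BUF = 64 };

/* Scalar reference: the loop from f1.  */
static void
f1_scalar (int *x, const int *y)
{
  for (int i = 0; i < N; ++i)
    x[i] += y[i];
}

/* Model of the vector loop: load VF elements of y and x, add them,
   then store VF elements of x.  */
static void
f1_vector_model (int *x, const int *y)
{
  for (int i = 0; i < N; i += VF)
    {
      int vy[VF], vx[VF];
      memcpy (vy, y + i, sizeof vy);
      memcpy (vx, x + i, sizeof vx);
      for (int j = 0; j < VF; ++j)
        vx[j] += vy[j];
      memcpy (x + i, vx, sizeof vx);
    }
}

/* Return 1 if both versions leave the buffer in the same state when
   x = buf + x_off and y = buf + y_off.  */
static int
f1_models_agree (int x_off, int y_off)
{
  int a[BUF], b[BUF];
  assert (x_off >= 0 && x_off + N <= BUF);
  assert (y_off >= 0 && y_off + N <= BUF);
  for (int i = 0; i < BUF; ++i)
    a[i] = b[i] = i + 1;
  f1_scalar (a + x_off, a + y_off);
  f1_vector_model (b + x_off, b + y_off);
  return memcmp (a, b, sizeof a) == 0;
}
```

Under these assumptions the two versions agree whenever x <= y, and also
when x is a full vector or more ahead of y.  They disagree only when x is
ahead of y by one to three elements.
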
>
> The same is true for two writes:
>
>   void
>   f2 (int *x, int *y)
>   {
>     for (int i = 0; i < 32; ++i)
>       {
>         x[i] = i;
>         y[i] = 2;
>       }
>   }
>
> If y <= x, a vector write at y after a vector write at x has the same
> net effect as the original scalar writes.
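
The same kind of sketch works for the two-write case (again illustrative
only, with a hypothetical VF = 4 and invented names).  The vector model
performs the whole-vector store to x first and the whole-vector store to y
second, which is the ordering the description above relies on.

```c
#include <assert.h>
#include <string.h>

enum { N = 32, VF = 4, BUF = 64 };

/* Scalar reference: the loop from f2.  */
static void
f2_scalar (int *x, int *y)
{
  for (int i = 0; i < N; ++i)
    {
      x[i] = i;
      y[i] = 2;
    }
}

/* Model of the vector loop: one vector store to x, then one to y.  */
static void
f2_vector_model (int *x, int *y)
{
  for (int i = 0; i < N; i += VF)
    {
      for (int j = 0; j < VF; ++j)
        x[i + j] = i + j;
      for (int j = 0; j < VF; ++j)
        y[i + j] = 2;
    }
}

/* Return 1 if both versions leave the buffer in the same state when
   x = buf + x_off and y = buf + y_off.  */
static int
f2_models_agree (int x_off, int y_off)
{
  int a[BUF], b[BUF];
  assert (x_off >= 0 && x_off + N <= BUF);
  assert (y_off >= 0 && y_off + N <= BUF);
  for (int i = 0; i < BUF; ++i)
    a[i] = b[i] = -1;
  f2_scalar (a + x_off, a + y_off);
  f2_vector_model (b + x_off, b + y_off);
  return memcmp (a, b, sizeof a) == 0;
}
```

Here the problem case is the mirror image of f1: the versions agree for any
y <= x and differ only when y is ahead of x by one to three elements.
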
>
> This patch optimises the alias checks for these two cases.  E.g.,
> before the patch, f1 used:
>
>         add     x2, x0, 15
>         sub     x2, x2, x1
>         cmp     x2, 30
>         bls     .L2
>
> whereas after the patch it uses:
>
>         add     x2, x1, 4
>         sub     x2, x0, x2
>         cmp     x2, 8
>         bls     .L2
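
Reading the two sequences as C may help.  In this sketch x and y stand for
the values in x0 and x1 (the first two argument registers under AAPCS64),
and a nonzero return corresponds to the bls being taken, assuming that .L2
is the non-vector fallback.  The function names are invented for the
example.

```c
#include <stdint.h>

/* Old test: taken when a 16-byte vector at x overlaps a 16-byte
   vector at y, i.e. when x - y is in [-15, 15].  */
static int
old_check_hits (uint64_t x, uint64_t y)
{
  return (uint64_t) (x + 15 - y) <= 30;
}

/* New test: taken only when x - y is in [4, 12], i.e. when x is
   ahead of y by less than a vector.  */
static int
new_check_hits (uint64_t x, uint64_t y)
{
  return (uint64_t) (x - (y + 4)) <= 8;
}
```

Both rely on unsigned wraparound to fold a two-sided range test into a
single comparison.  With 4-byte-aligned pointers the new test fires only
for x - y in {4, 8, 12}, so x == y and x < y now stay on the vector path.
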
>
> Read-after-write cases like:
>
>   int
>   f3 (int *x, int *y)
>   {
>     int res = 0;
>     for (int i = 0; i < 32; ++i)
>       {
>         x[i] = i;
>         res += y[i];
>       }
>     return res;
>   }
>
> can cope with x == y, but otherwise don't allow overlap in either
> direction.  Since checking for x == y at runtime would require extra
> code, we're probably better off sticking with the current overlap test.
>
> An overlap test is also needed if the scalar or vector accesses covered
> by the alias check are mixed together, rather than all statements for
> the second access following all statements for the first access.
>
> The new code for gcc.target/aarch64/sve/var_stride_[135].c is slightly
> better than before.

OK.

Thanks,
Richard.

>
> 2019-11-11  Richard Sandiford  <richard.sandiford@arm.com>
>
> gcc/
>         * tree-data-ref.c (create_intersect_range_checks_index): If the
>         alias pair describes simple WAW and WAR dependencies, just check
>         whether the first B access overlaps later A accesses.
>         (create_waw_or_war_checks): New function that performs the same
>         optimization on addresses.
>         (create_intersect_range_checks): Call it.
>
> gcc/testsuite/
>         * gcc.dg/vect/vect-alias-check-8.c: Expect WAR/WAW checks to be used.
>         * gcc.dg/vect/vect-alias-check-14.c: Likewise.
>         * gcc.dg/vect/vect-alias-check-15.c: Likewise.
>         * gcc.dg/vect/vect-alias-check-18.c: Likewise.
>         * gcc.dg/vect/vect-alias-check-19.c: Likewise.
>         * gcc.target/aarch64/sve/var_stride_1.c: Update expected sequence.
>         * gcc.target/aarch64/sve/var_stride_2.c: Likewise.
>         * gcc.target/aarch64/sve/var_stride_3.c: Likewise.
>         * gcc.target/aarch64/sve/var_stride_5.c: Likewise.
>
> Index: gcc/tree-data-ref.c
> ===================================================================
> --- gcc/tree-data-ref.c 2019-11-11 18:32:12.000000000 +0000
> +++ gcc/tree-data-ref.c 2019-11-11 18:32:13.186616541 +0000
> @@ -1806,6 +1806,8 @@ create_intersect_range_checks_index (cla
>                            abs_step, &niter_access2))
>      return false;
>
> +  bool waw_or_war_p = (alias_pair.flags & ~(DR_ALIAS_WAR | DR_ALIAS_WAW)) == 0;
> +
>    unsigned int i;
>    for (i = 0; i < DR_NUM_DIMENSIONS (dr_a.dr); i++)
>      {
> @@ -1907,16 +1909,57 @@ create_intersect_range_checks_index (cla
>
>          Combining the tests requires limit to be computable in an unsigned
>          form of the index type; if it isn't, we fall back to the usual
> -        pointer-based checks.  */
> -      poly_offset_int limit = (idx_len1 + idx_access1 - 1
> -                              + idx_len2 + idx_access2 - 1);
> +        pointer-based checks.
> +
> +        We can do better if DR_B is a write and if DR_A and DR_B are
> +        well-ordered in both the original and the new code (see the
> +        comment above the DR_ALIAS_* flags for details).  In this case
> +        we know that for each i in [0, n-1], the write performed by
> +        access i of DR_B occurs after access numbers j<=i of DR_A in
> +        both the original and the new code.  Any write or anti
> +        dependencies wrt those DR_A accesses are therefore maintained.
> +
> +        We just need to make sure that each individual write in DR_B does not
> +        overlap any higher-indexed access in DR_A; such DR_A accesses happen
> +        after the DR_B access in the original code but happen before it in
> +        the new code.
> +
> +        We know the steps for both accesses are equal, so by induction, we
> +        just need to test whether the first write of DR_B overlaps a later
> +        access of DR_A.  In other words, we need to move min1 along by
> +        one iteration:
> +
> +          min1' = min1 + idx_step
> +
> +        and use the ranges:
> +
> +          [min1' + low_offset1', min1' + high_offset1' + idx_access1 - 1]
> +
> +        and:
> +
> +          [min2, min2 + idx_access2 - 1]
> +
> +        where:
> +
> +           low_offset1' = +ve step ? 0 : -(idx_len1 - |idx_step|)
> +          high_offset1' = +ve step ? idx_len1 - |idx_step| : 0.  */
> +      if (waw_or_war_p)
> +       idx_len1 -= abs_idx_step;
> +
> +      poly_offset_int limit = idx_len1 + idx_access1 - 1 + idx_access2 - 1;
> +      if (!waw_or_war_p)
> +       limit += idx_len2;
> +
>        tree utype = unsigned_type_for (TREE_TYPE (min1));
>        if (!wi::fits_to_tree_p (limit, utype))
>         return false;
>
>        poly_offset_int low_offset1 = neg_step ? -idx_len1 : 0;
> -      poly_offset_int high_offset2 = neg_step ? 0 : idx_len2;
> +      poly_offset_int high_offset2 = neg_step || waw_or_war_p ? 0 : idx_len2;
>        poly_offset_int bias = high_offset2 + idx_access2 - 1 - low_offset1;
> +      /* Equivalent to adding IDX_STEP to MIN1.  */
> +      if (waw_or_war_p)
> +       bias -= wi::to_offset (idx_step);
>
>        tree subject = fold_build2 (MINUS_EXPR, utype,
>                                   fold_convert (utype, min2),
> @@ -1932,7 +1975,169 @@ create_intersect_range_checks_index (cla
>         *cond_expr = part_cond_expr;
>      }
>    if (dump_enabled_p ())
> -    dump_printf (MSG_NOTE, "using an index-based overlap test\n");
> +    {
> +      if (waw_or_war_p)
> +       dump_printf (MSG_NOTE, "using an index-based WAR/WAW test\n");
> +      else
> +       dump_printf (MSG_NOTE, "using an index-based overlap test\n");
> +    }
> +  return true;
> +}
> +
> +/* A subroutine of create_intersect_range_checks, with a subset of the
> +   same arguments.  Try to optimize cases in which the second access
> +   is a write and in which some overlap is valid.  */
> +
> +static bool
> +create_waw_or_war_checks (tree *cond_expr,
> +                         const dr_with_seg_len_pair_t &alias_pair)
> +{
> +  const dr_with_seg_len& dr_a = alias_pair.first;
> +  const dr_with_seg_len& dr_b = alias_pair.second;
> +
> +  /* Check for cases in which:
> +
> +     (a) DR_B is always a write;
> +     (b) the accesses are well-ordered in both the original and new code
> +        (see the comment above the DR_ALIAS_* flags for details); and
> +     (c) the DR_STEPs describe all access pairs covered by ALIAS_PAIR.  */
> +  if (alias_pair.flags & ~(DR_ALIAS_WAR | DR_ALIAS_WAW))
> +    return false;
> +
> +  /* Check for equal (but possibly variable) steps.  */
> +  tree step = DR_STEP (dr_a.dr);
> +  if (!operand_equal_p (step, DR_STEP (dr_b.dr)))
> +    return false;
> +
> +  /* Make sure that we can operate on sizetype without loss of precision.  */
> +  tree addr_type = TREE_TYPE (DR_BASE_ADDRESS (dr_a.dr));
> +  if (TYPE_PRECISION (addr_type) != TYPE_PRECISION (sizetype))
> +    return false;
> +
> +  /* All addresses involved are known to have a common alignment ALIGN.
> +     We can therefore subtract ALIGN from an exclusive endpoint to get
> +     an inclusive endpoint.  In the best (and common) case, ALIGN is the
> +     same as the access sizes of both DRs, and so subtracting ALIGN
> +     cancels out the addition of an access size.  */
> +  unsigned int align = MIN (dr_a.align, dr_b.align);
> +  poly_uint64 last_chunk_a = dr_a.access_size - align;
> +  poly_uint64 last_chunk_b = dr_b.access_size - align;
> +
> +  /* Get a boolean expression that is true when the step is negative.  */
> +  tree indicator = dr_direction_indicator (dr_a.dr);
> +  tree neg_step = fold_build2 (LT_EXPR, boolean_type_node,
> +                              fold_convert (ssizetype, indicator),
> +                              ssize_int (0));
> +
> +  /* Get lengths in sizetype.  */
> +  tree seg_len_a
> +    = fold_convert (sizetype, rewrite_to_non_trapping_overflow (dr_a.seg_len));
> +  step = fold_convert (sizetype, rewrite_to_non_trapping_overflow (step));
> +
> +  /* Each access has the following pattern:
> +
> +         <- |seg_len| ->
> +         <--- A: -ve step --->
> +         +-----+-------+-----+-------+-----+
> +         | n-1 | ..... |  0  | ..... | n-1 |
> +         +-----+-------+-----+-------+-----+
> +                       <--- B: +ve step --->
> +                       <- |seg_len| ->
> +                       |
> +                  base address
> +
> +     where "n" is the number of scalar iterations covered by the segment.
> +
> +     A is the range of bytes accessed when the step is negative,
> +     B is the range when the step is positive.
> +
> +     We know that DR_B is a write.  We also know (from checking that
> +     DR_A and DR_B are well-ordered) that for each i in [0, n-1],
> +     the write performed by access i of DR_B occurs after access numbers
> +     j<=i of DR_A in both the original and the new code.  Any write or
> +     anti dependencies wrt those DR_A accesses are therefore maintained.
> +
> +     We just need to make sure that each individual write in DR_B does not
> +     overlap any higher-indexed access in DR_A; such DR_A accesses happen
> +     after the DR_B access in the original code but happen before it in
> +     the new code.
> +
> +     We know the steps for both accesses are equal, so by induction, we
> +     just need to test whether the first write of DR_B overlaps a later
> +     access of DR_A.  In other words, we need to move addr_a along by
> +     one iteration:
> +
> +       addr_a' = addr_a + step
> +
> +     and check whether:
> +
> +       [addr_b, addr_b + last_chunk_b]
> +
> +     overlaps:
> +
> +       [addr_a' + low_offset_a, addr_a' + high_offset_a + last_chunk_a]
> +
> +     where [low_offset_a, high_offset_a] spans accesses [1, n-1].  I.e.:
> +
> +       low_offset_a = +ve step ? 0 : seg_len_a - step
> +       high_offset_a = +ve step ? seg_len_a - step : 0
> +
> +     This is equivalent to testing whether:
> +
> +       addr_a' + low_offset_a <= addr_b + last_chunk_b
> +       && addr_b <= addr_a' + high_offset_a + last_chunk_a
> +
> +     Converting this into a single test, there is an overlap if:
> +
> +       0 <= addr_b + last_chunk_b - addr_a' - low_offset_a <= limit
> +
> +     where limit = high_offset_a - low_offset_a + last_chunk_a + last_chunk_b
> +
> +     If DR_A is performed, limit + |step| - last_chunk_b is known to be
> +     less than the size of the object underlying DR_A.  We also know
> +     that last_chunk_b <= |step|; this is checked elsewhere if it isn't
> +     guaranteed at compile time.  There can therefore be no overflow if
> +     "limit" is calculated in an unsigned type with pointer precision.  */
> +  tree addr_a = fold_build_pointer_plus (DR_BASE_ADDRESS (dr_a.dr),
> +                                        DR_OFFSET (dr_a.dr));
> +  addr_a = fold_build_pointer_plus (addr_a, DR_INIT (dr_a.dr));
> +
> +  tree addr_b = fold_build_pointer_plus (DR_BASE_ADDRESS (dr_b.dr),
> +                                        DR_OFFSET (dr_b.dr));
> +  addr_b = fold_build_pointer_plus (addr_b, DR_INIT (dr_b.dr));
> +
> +  /* Advance ADDR_A by one iteration and adjust the length to compensate.  */
> +  addr_a = fold_build_pointer_plus (addr_a, step);
> +  tree seg_len_a_minus_step = fold_build2 (MINUS_EXPR, sizetype,
> +                                          seg_len_a, step);
> +  if (!CONSTANT_CLASS_P (seg_len_a_minus_step))
> +    seg_len_a_minus_step = build1 (SAVE_EXPR, sizetype, seg_len_a_minus_step);
> +
> +  tree low_offset_a = fold_build3 (COND_EXPR, sizetype, neg_step,
> +                                  seg_len_a_minus_step, size_zero_node);
> +  if (!CONSTANT_CLASS_P (low_offset_a))
> +    low_offset_a = build1 (SAVE_EXPR, sizetype, low_offset_a);
> +
> +  /* We could use COND_EXPR <neg_step, size_zero_node, seg_len_a_minus_step>,
> +     but it's usually more efficient to reuse the LOW_OFFSET_A result.  */
> +  tree high_offset_a = fold_build2 (MINUS_EXPR, sizetype, seg_len_a_minus_step,
> +                                   low_offset_a);
> +
> +  /* The amount added to addr_b - addr_a'.  */
> +  tree bias = fold_build2 (MINUS_EXPR, sizetype,
> +                          size_int (last_chunk_b), low_offset_a);
> +
> +  tree limit = fold_build2 (MINUS_EXPR, sizetype, high_offset_a, low_offset_a);
> +  limit = fold_build2 (PLUS_EXPR, sizetype, limit,
> +                      size_int (last_chunk_a + last_chunk_b));
> +
> +  tree subject = fold_build2 (POINTER_DIFF_EXPR, ssizetype, addr_b, addr_a);
> +  subject = fold_build2 (PLUS_EXPR, sizetype,
> +                        fold_convert (sizetype, subject), bias);
> +
> +  *cond_expr = fold_build2 (GT_EXPR, boolean_type_node, subject, limit);
> +  if (dump_enabled_p ())
> +    dump_printf (MSG_NOTE, "using an address-based WAR/WAW test\n");
>    return true;
>  }
>
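
As a rough model of the comparison that create_waw_or_war_checks builds,
the following sketch redoes the arithmetic from the comment on plain 64-bit
integers instead of trees.  The function name, and the choice of passing
last_chunk_a and last_chunk_b in directly, are assumptions made for the
example.  A nonzero result corresponds to *cond_expr being true, i.e. no
conflicting overlap.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical integer model of the address-based WAR/WAW test.
   STEP and SEG_LEN_A are signed byte quantities with the same sign.  */
static int
war_waw_no_conflict (uint64_t addr_a, uint64_t addr_b, int64_t step,
                     int64_t seg_len_a, uint64_t last_chunk_a,
                     uint64_t last_chunk_b)
{
  /* The segment must extend in the direction of the step.  */
  assert ((step < 0) == (seg_len_a < 0));

  /* All arithmetic wraps, as it would in sizetype.  */
  uint64_t ustep = (uint64_t) step;
  uint64_t seg_len_a_minus_step = (uint64_t) seg_len_a - ustep;
  uint64_t low_offset_a = step < 0 ? seg_len_a_minus_step : 0;
  uint64_t high_offset_a = seg_len_a_minus_step - low_offset_a;

  /* addr_a' in the comment: skip the first access of DR_A.  */
  uint64_t addr_a_next = addr_a + ustep;

  uint64_t bias = last_chunk_b - low_offset_a;
  uint64_t limit = (high_offset_a - low_offset_a
                    + last_chunk_a + last_chunk_b);
  uint64_t subject = addr_b - addr_a_next + bias;

  /* There is an overlap iff 0 <= subject <= limit.  */
  return subject > limit;
}
```

For f1, with DR_A the read of y and DR_B the write of x, and assuming
step = 4, seg_len_a = 12 and both last chunks zero, this reduces to
x - (y + 4) > 8, the complement of the condition tested by the new AArch64
sequence.
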
> @@ -2036,6 +2241,9 @@ create_intersect_range_checks (class loo
>    if (create_intersect_range_checks_index (loop, cond_expr, alias_pair))
>      return;
>
> +  if (create_waw_or_war_checks (cond_expr, alias_pair))
> +    return;
> +
>    unsigned HOST_WIDE_INT min_align;
>    tree_code cmp_code;
>    /* We don't have to check DR_ALIAS_MIXED_STEPS here, since both versions
> Index: gcc/testsuite/gcc.dg/vect/vect-alias-check-8.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/vect/vect-alias-check-8.c      2019-11-11 18:32:12.000000000 +0000
> +++ gcc/testsuite/gcc.dg/vect/vect-alias-check-8.c      2019-11-11 18:32:13.186616541 +0000
> @@ -60,5 +60,5 @@ main (void)
>  }
>
>  /* { dg-final { scan-tree-dump {flags: *WAR\n} "vect" { target vect_int } } } */
> -/* { dg-final { scan-tree-dump "using an index-based overlap test" "vect" } } */
> +/* { dg-final { scan-tree-dump "using an index-based WAR/WAW test" "vect" } } */
>  /* { dg-final { scan-tree-dump-not "using an address-based" "vect" } } */
> Index: gcc/testsuite/gcc.dg/vect/vect-alias-check-14.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/vect/vect-alias-check-14.c     2019-11-11 18:32:12.000000000 +0000
> +++ gcc/testsuite/gcc.dg/vect/vect-alias-check-14.c     2019-11-11 18:32:13.186616541 +0000
> @@ -60,5 +60,5 @@ main (void)
>
>  /* { dg-final { scan-tree-dump {flags: *WAR\n} "vect" { target vect_int } } } */
>  /* { dg-final { scan-tree-dump-not {flags: [^\n]*ARBITRARY\n} "vect" } } */
> -/* { dg-final { scan-tree-dump "using an address-based overlap test" "vect" } } */
> +/* { dg-final { scan-tree-dump "using an address-based WAR/WAW test" "vect" } } */
>  /* { dg-final { scan-tree-dump-not "using an index-based" "vect" } } */
> Index: gcc/testsuite/gcc.dg/vect/vect-alias-check-15.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/vect/vect-alias-check-15.c     2019-11-11 18:32:12.000000000 +0000
> +++ gcc/testsuite/gcc.dg/vect/vect-alias-check-15.c     2019-11-11 18:32:13.186616541 +0000
> @@ -57,5 +57,5 @@ main (void)
>  }
>
>  /* { dg-final { scan-tree-dump {flags: *WAW\n} "vect" { target vect_int } } } */
> -/* { dg-final { scan-tree-dump "using an address-based overlap test" "vect" } } */
> +/* { dg-final { scan-tree-dump "using an address-based WAR/WAW test" "vect" } } */
>  /* { dg-final { scan-tree-dump-not "using an index-based" "vect" } } */
> Index: gcc/testsuite/gcc.dg/vect/vect-alias-check-18.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/vect/vect-alias-check-18.c     2019-11-11 18:32:12.000000000 +0000
> +++ gcc/testsuite/gcc.dg/vect/vect-alias-check-18.c     2019-11-11 18:32:13.186616541 +0000
> @@ -60,5 +60,5 @@ main (void)
>  }
>
>  /* { dg-final { scan-tree-dump {flags: *WAR\n} "vect" { target vect_int } } } */
> -/* { dg-final { scan-tree-dump "using an index-based overlap test" "vect" } } */
> +/* { dg-final { scan-tree-dump "using an index-based WAR/WAW test" "vect" } } */
>  /* { dg-final { scan-tree-dump-not "using an address-based" "vect" } } */
> Index: gcc/testsuite/gcc.dg/vect/vect-alias-check-19.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/vect/vect-alias-check-19.c     2019-11-11 18:32:12.000000000 +0000
> +++ gcc/testsuite/gcc.dg/vect/vect-alias-check-19.c     2019-11-11 18:32:13.186616541 +0000
> @@ -58,5 +58,5 @@ main (void)
>  }
>
>  /* { dg-final { scan-tree-dump {flags: *WAW\n} "vect" { target vect_int } } } */
> -/* { dg-final { scan-tree-dump "using an index-based overlap test" "vect" } } */
> +/* { dg-final { scan-tree-dump "using an index-based WAR/WAW test" "vect" } } */
>  /* { dg-final { scan-tree-dump-not "using an address-based" "vect" } } */
> Index: gcc/testsuite/gcc.target/aarch64/sve/var_stride_1.c
> ===================================================================
> --- gcc/testsuite/gcc.target/aarch64/sve/var_stride_1.c 2019-11-11 18:32:12.000000000 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve/var_stride_1.c 2019-11-11 18:32:13.186616541 +0000
> @@ -15,13 +15,9 @@ f (TYPE *x, TYPE *y, unsigned short n, l
>  /* { dg-final { scan-assembler {\tst1w\tz[0-9]+} } } */
>  /* { dg-final { scan-assembler {\tldr\tw[0-9]+} } } */
>  /* { dg-final { scan-assembler {\tstr\tw[0-9]+} } } */
> -/* Should multiply by (VF-1)*4 rather than (257-1)*4.  */
> -/* { dg-final { scan-assembler-not {, 1024} } } */
> -/* { dg-final { scan-assembler-not {\t.bfiz\t} } } */
> -/* { dg-final { scan-assembler-not {lsl[^\n]*[, ]10} } } */
> -/* { dg-final { scan-assembler-not {\tcmp\tx[0-9]+, 0} } } */
> -/* { dg-final { scan-assembler-not {\tcmp\tw[0-9]+, 0} } } */
> -/* { dg-final { scan-assembler-not {\tcsel\tx[0-9]+} } } */
> -/* Two range checks and a check for n being zero.  */
> -/* { dg-final { scan-assembler-times {\tcmp\t} 1 } } */
> -/* { dg-final { scan-assembler-times {\tccmp\t} 2 } } */
> +/* Should use a WAR check that multiplies by (VF-2)*4 rather than
> +   an overlap check that multiplies by (257-1)*4.  */
> +/* { dg-final { scan-assembler {\tcntb\t(x[0-9]+)\n.*\tsub\tx[0-9]+, \1, #8\n.*\tmul\tx[0-9]+,[^\n]*\1} } } */
> +/* One range check and a check for n being zero.  */
> +/* { dg-final { scan-assembler-times {\t(?:cmp|tst)\t} 1 } } */
> +/* { dg-final { scan-assembler-times {\tccmp\t} 1 } } */
> Index: gcc/testsuite/gcc.target/aarch64/sve/var_stride_2.c
> ===================================================================
> --- gcc/testsuite/gcc.target/aarch64/sve/var_stride_2.c 2019-11-11 18:32:12.000000000 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve/var_stride_2.c 2019-11-11 18:32:13.186616541 +0000
> @@ -15,7 +15,7 @@ f (TYPE *x, TYPE *y, unsigned short n, u
>  /* { dg-final { scan-assembler {\tst1w\tz[0-9]+} } } */
>  /* { dg-final { scan-assembler {\tldr\tw[0-9]+} } } */
>  /* { dg-final { scan-assembler {\tstr\tw[0-9]+} } } */
> -/* Should multiply by (257-1)*4 rather than (VF-1)*4.  */
> +/* Should multiply by (257-1)*4 rather than (VF-1)*4 or (VF-2)*4.  */
>  /* { dg-final { scan-assembler-times {\tubfiz\tx[0-9]+, x2, 10, 16\n} 1 } } */
>  /* { dg-final { scan-assembler-times {\tubfiz\tx[0-9]+, x3, 10, 16\n} 1 } } */
>  /* { dg-final { scan-assembler-not {\tcmp\tx[0-9]+, 0} } } */
> Index: gcc/testsuite/gcc.target/aarch64/sve/var_stride_3.c
> ===================================================================
> --- gcc/testsuite/gcc.target/aarch64/sve/var_stride_3.c 2019-11-11 18:32:12.000000000 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve/var_stride_3.c 2019-11-11 18:32:13.186616541 +0000
> @@ -15,13 +15,10 @@ f (TYPE *x, TYPE *y, int n, long m __att
>  /* { dg-final { scan-assembler {\tst1w\tz[0-9]+} } } */
>  /* { dg-final { scan-assembler {\tldr\tw[0-9]+} } } */
>  /* { dg-final { scan-assembler {\tstr\tw[0-9]+} } } */
> -/* Should multiply by (VF-1)*4 rather than (257-1)*4.  */
> -/* { dg-final { scan-assembler-not {, 1024} } } */
> -/* { dg-final { scan-assembler-not {\t.bfiz\t} } } */
> -/* { dg-final { scan-assembler-not {lsl[^\n]*[, ]10} } } */
> -/* { dg-final { scan-assembler-not {\tcmp\tx[0-9]+, 0} } } */
> -/* { dg-final { scan-assembler {\tcmp\tw2, 0} } } */
> -/* { dg-final { scan-assembler-times {\tcsel\tx[0-9]+} 2 } } */
> -/* Two range checks and a check for n being zero.  */
> -/* { dg-final { scan-assembler {\tcmp\t} } } */
> -/* { dg-final { scan-assembler-times {\tccmp\t} 2 } } */
> +/* Should use a WAR check that multiplies by (VF-2)*4 rather than
> +   an overlap check that multiplies by (257-1)*4.  */
> +/* { dg-final { scan-assembler {\tcntb\t(x[0-9]+)\n.*\tsub\tx[0-9]+, \1, #8\n.*\tmul\tx[0-9]+,[^\n]*\1} } } */
> +/* { dg-final { scan-assembler-times {\tcsel\tx[0-9]+[^\n]*xzr} 1 } } */
> +/* One range check and a check for n being zero.  */
> +/* { dg-final { scan-assembler-times {\tcmp\t} 1 } } */
> +/* { dg-final { scan-assembler-times {\tccmp\t} 1 } } */
> Index: gcc/testsuite/gcc.target/aarch64/sve/var_stride_5.c
> ===================================================================
> --- gcc/testsuite/gcc.target/aarch64/sve/var_stride_5.c 2019-11-11 18:32:12.000000000 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve/var_stride_5.c 2019-11-11 18:32:13.186616541 +0000
> @@ -15,13 +15,10 @@ f (TYPE *x, TYPE *y, long n, long m __at
>  /* { dg-final { scan-assembler {\tst1d\tz[0-9]+} } } */
>  /* { dg-final { scan-assembler {\tldr\td[0-9]+} } } */
>  /* { dg-final { scan-assembler {\tstr\td[0-9]+} } } */
> -/* Should multiply by (VF-1)*8 rather than (257-1)*8.  */
> -/* { dg-final { scan-assembler-not {, 2048} } } */
> -/* { dg-final { scan-assembler-not {\t.bfiz\t} } } */
> -/* { dg-final { scan-assembler-not {lsl[^\n]*[, ]11} } } */
> -/* { dg-final { scan-assembler {\tcmp\tx[0-9]+, 0} } } */
> -/* { dg-final { scan-assembler-not {\tcmp\tw[0-9]+, 0} } } */
> -/* { dg-final { scan-assembler-times {\tcsel\tx[0-9]+} 2 } } */
> -/* Two range checks and a check for n being zero.  */
> -/* { dg-final { scan-assembler {\tcmp\t} } } */
> -/* { dg-final { scan-assembler-times {\tccmp\t} 2 } } */
> +/* Should use a WAR check that multiplies by (VF-2)*8 rather than
> +   an overlap check that multiplies by (257-1)*8.  */
> +/* { dg-final { scan-assembler {\tcntb\t(x[0-9]+)\n.*\tsub\tx[0-9]+, \1, #16\n.*\tmul\tx[0-9]+,[^\n]*\1} } } */
> +/* { dg-final { scan-assembler-times {\tcsel\tx[0-9]+[^\n]*xzr} 1 } } */
> +/* One range check and a check for n being zero.  */
> +/* { dg-final { scan-assembler-times {\tcmp\t} 1 } } */
> +/* { dg-final { scan-assembler-times {\tccmp\t} 1 } } */