From: Richard Sandiford <richard.sandiford@arm.com>
To: Tamar Christina
Cc: gcc-patches@gcc.gnu.org, nd, rguenther@suse.de
Subject: Re: [PATCH 3/4]middle-end: Implement preferred_div_as_shifts_over_mult [PR108583]
Date: Wed, 08 Mar 2023 08:55:34 +0000
In-Reply-To: (Tamar Christina's message of "Mon, 6 Mar 2023 11:23:49 +0000")

Tamar Christina writes:
> Ping,
>
> Also updated the hook to allow differentiating between ISAs.
>
> As Andy said before, initializing a ranger instance is cheap but not free, and if
> the intention is to call it often during a pass it should be instantiated at
> pass startup and passed along to the places that need it.  This is a big
> refactoring and doesn't seem right to do in this PR.  But we should in GCC 14.
>
> Currently we only instantiate it after a long series of much cheaper checks.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>         PR target/108583
>         * target.def (preferred_div_as_shifts_over_mult): New.
>         * doc/tm.texi.in: Document it.
>         * doc/tm.texi: Regenerate.
>         * targhooks.cc (default_preferred_div_as_shifts_over_mult): New.
>         * targhooks.h (default_preferred_div_as_shifts_over_mult): New.
>         * tree-vect-patterns.cc (vect_recog_divmod_pattern): Use it.
>
> gcc/testsuite/ChangeLog:
>
>         PR target/108583
>         * gcc.dg/vect/vect-div-bitmask-4.c: New test.
>         * gcc.dg/vect/vect-div-bitmask-5.c: New test.
>
> --- inline copy of patch ---
>
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 50a8872a6695b18b9bed0d393bacf733833633db..f69f7f036272e867ea1c3fee851b117f057f68c5 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -6137,6 +6137,10 @@ instruction pattern.  There is no need for the hook to handle these two
> implementation approaches itself.
> @end deftypefn
>
> +@deftypefn {Target Hook} bool TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT (const_tree @var{type})
> +If possible, when decomposing a division operation of vectors of
> +type @var{type} during vectorization, prefer to use shifts rather than
> +multiplication by magic constants.
> @end deftypefn
>
> @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned @var{code}, tree @var{vec_type_out}, tree @var{vec_type_in})
> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> index 3e07978a02f4e6077adae6cadc93ea4273295f1f..0051017a7fd67691a343470f36ad4fc32c8e7e15 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -4173,6 +4173,7 @@ address; but often a machine-dependent strategy can generate better code.
>
> @hook TARGET_VECTORIZE_VEC_PERM_CONST
>
> +@hook TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT
>
> @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
>
> diff --git a/gcc/target.def b/gcc/target.def
> index e0a5c7adbd962f5d08ed08d1d81afa2c2baa64a5..bdee9b7f9c941508738fac49593b5baa525e2915 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -1868,6 +1868,16 @@ correct for most targets.",
>  poly_uint64, (const_tree type),
>  default_preferred_vector_alignment)
>
> +/* Returns whether the target has a preference for decomposing divisions using
> +   shifts rather than multiplies.  */
> +DEFHOOK
> +(preferred_div_as_shifts_over_mult,
> + "If possible, when decomposing a division operation of vectors of\n\
> +type @var{type} during vectorization, prefer to use shifts rather than\n\
> +multiplication by magic constants.",

Both approaches require shifts though.  How about:

  Sometimes it is possible to implement a vector division using a sequence
  of two addition-shift pairs, giving four instructions in total.
  Return true if taking this approach for @var{vectype} is likely
  to be better than using a sequence involving highpart multiplication.

It should also say what the default is; more on that below.

> + bool, (const_tree type),
> + default_preferred_div_as_shifts_over_mult)
> +
> /* Return true if vector alignment is reachable (by peeling N
>    iterations) for the given scalar type.  */
> DEFHOOK
> diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> index a6a4809ca91baa5d7fad2244549317a31390f0c2..a207963b9e6eb9300df0043e1b79aa6c941d0f7f 100644
> --- a/gcc/targhooks.h
> +++ b/gcc/targhooks.h
> @@ -53,6 +53,8 @@ extern scalar_int_mode default_unwind_word_mode (void);
> extern unsigned HOST_WIDE_INT default_shift_truncation_mask
>   (machine_mode);
> extern unsigned int default_min_divisions_for_recip_mul (machine_mode);
> +extern bool default_preferred_div_as_shifts_over_mult
> +  (const_tree);
> extern int default_mode_rep_extended (scalar_int_mode, scalar_int_mode);
>
> extern tree default_stack_protect_guard (void);
> diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> index 211525720a620d6f533e2da91e03877337a931e7..becea6ef4b6329cfa0b676f8d844630fbdc97f20 100644
> --- a/gcc/targhooks.cc
> +++ b/gcc/targhooks.cc
> @@ -1483,6 +1483,15 @@ default_preferred_vector_alignment (const_tree type)
>   return TYPE_ALIGN (type);
> }
>
> +/* The default implementation of
> +   TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT.  */
> +
> +bool
> +default_preferred_div_as_shifts_over_mult (const_tree /* type */)
> +{
> +  return false;
> +}
> +

I think the default should be true for targets without highpart
multiplication, since the fallback isn't possible then.  Either that,
or we should skip calling the hook when the fallback isn't possible.
E.g. maybe we could test and record can_mult_highpart_p before the new
code, and skip the hook test when can_mult_highpart_p is false.

> /* By default assume vectors of element TYPE require a multiple of the natural
>    alignment of TYPE.  TYPE is naturally aligned if IS_PACKED is false.  */
> bool
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..c81f8946922250234bf759e0a0a04ea8c1f73e3c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> @@ -0,0 +1,25 @@
> +/* { dg-require-effective-target vect_int } */
> +
> +#include <stdint.h>
> +#include "tree-vect.h"
> +
> +typedef unsigned __attribute__((__vector_size__ (16))) V;
> +
> +static __attribute__((__noinline__)) __attribute__((__noclone__)) V
> +foo (V v, unsigned short i)
> +{
> +  v /= i;
> +  return v;
> +}
> +
> +int
> +main (void)
> +{
> +  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff }, 0xffff);
> +  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
> +    if (v[i] != 0x00010001)
> +      __builtin_abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern: detected" "vect" { target aarch64*-*-* } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..b4eb1a4dacba481e6306b49914d2a29b933de625
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> @@ -0,0 +1,58 @@
> +/* { dg-require-effective-target vect_int } */
> +
> +#include <stdint.h>
> +#include <stdio.h>
> +#include "tree-vect.h"
> +
> +#define N 50
> +#define TYPE uint8_t
> +
> +#ifndef DEBUG
> +#define DEBUG 0
> +#endif
> +
> +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> +
> +
> +__attribute__((noipa, noinline, optimize("O1")))
> +void fun1(TYPE* restrict pixel, TYPE level, int n)
> +{
> +  for (int i = 0; i < n; i+=1)
> +    pixel[i] = (pixel[i] + level) / 0xff;
> +}
> +
> +__attribute__((noipa, noinline, optimize("O3")))
> +void fun2(TYPE* restrict pixel, TYPE level, int n)
> +{
> +  for (int i = 0; i < n; i+=1)
> +    pixel[i] = (pixel[i] + level) / 0xff;
> +}
> +
> +int main ()
> +{
> +  TYPE a[N];
> +  TYPE b[N];
> +
> +  for (int i = 0; i < N; ++i)
> +    {
> +      a[i] = BASE + i * 13;
> +      b[i] = BASE + i * 13;
> +      if (DEBUG)
> +        printf ("%d: 0x%x\n", i, a[i]);
> +    }
> +
> +  fun1 (a, N / 2, N);
> +  fun2 (b, N / 2, N);
> +
> +  for (int i = 0; i < N; ++i)
> +    {
> +      if (DEBUG)
> +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> +
> +      if (a[i] != b[i])
> +        __builtin_abort ();
> +    }
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect" { target aarch64*-*-* } } } */
> diff --git a/gcc/tree-ssa-math-opts.cc b/gcc/tree-ssa-math-opts.cc
> index 5ab5b944a573ad24ce8427aff24fc5215bf05dac..26ed91d58fa4709a67c903ad446d267a3113c172 100644
> --- a/gcc/tree-ssa-math-opts.cc
> +++ b/gcc/tree-ssa-math-opts.cc
> @@ -3346,6 +3346,20 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
>                                         param_avoid_fma_max_bits));
>   bool defer = check_defer;
>   bool seen_negate_p = false;
> +
> +  /* There is no numerical difference between fused and unfused integer FMAs,
> +     and the assumption below that FMA is as cheap as addition is unlikely
> +     to be true, especially if the multiplication occurs multiple times on
> +     the same chain.  E.g., for something like:
> +
> +         (((a * b) + c) >> 1) + (a * b)
> +
> +     we do not want to duplicate the a * b into two additions, not least
> +     because the result is not a natural FMA chain.  */
> +  if (ANY_INTEGRAL_TYPE_P (type)
> +      && !has_single_use (mul_result))
> +    return false;
> +
>   /* Make sure that the multiplication statement becomes dead after
>      the transformation, thus that all uses are transformed to FMAs.
>      This means we assume that an FMA operation has the same cost

I think this should be a separate patch, with its own testcase.  Sorry
for not saying that until now.  The testcase from that thread would be
enough.

> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index 1766ce277d6b88d8aa3be77e7c8abb504a10a735..27fb4c6a59f0182c4a836f96ab6b5e2e405a18a0 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -3913,6 +3913,84 @@ vect_recog_divmod_pattern (vec_info *vinfo,
>
>       return pattern_stmt;
>     }
> +  else if ((cst = uniform_integer_cst_p (oprnd1))
> +           && TYPE_UNSIGNED (itype)
> +           && rhs_code == TRUNC_DIV_EXPR
> +           && vectype
> +           && targetm.vectorize.preferred_div_as_shifts_over_mult (vectype))

The else isn't really necessary here.

> +    {
> +      /* div optimizations using narrowings
> +         we can do the division e.g. shorts by 255 faster by calculating it as
> +         (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> +         double the precision of x.
> +
> +         If we imagine a short as being composed of two blocks of bytes then
> +         adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> +         adding 1 to each sub component:
> +
> +              short value of 16-bits
> +         ┌──────────────┬────────────────┐
> +         │              │                │
> +         └──────────────┴────────────────┘
> +           8-bit part1 ▲  8-bit part2   ▲
> +                       │                │
> +                       │                │
> +                      +1               +1
> +
> +         after the first addition, we have to shift right by 8, and narrow the
> +         results back to a byte.  Remember that the addition must be done in
> +         double the precision of the input.  However if we know that the addition
> +         `x + 257` does not overflow then we can do the operation in the current
> +         precision.  In which case we don't need the pack and unpacks.  */

I think this needs rewording.  The last two sentences describe the real
constraint: x + 257 must not overflow.  So the parts about "assuming the
operation is done in double the precision of x" and narrowing "the
results back to a byte" don't apply.

AIUI, this is really an instance of the general transform:

  x // N == ((x+N+2) // (N+1) + x) // (N+1)   for 0 <= x < N(N+3)

(Hope I've got that right.  Proof sketch below if this isn't an
off-the-shelf result.)

So when 0 <= x < N(N+3) is guaranteed by the precision of the type,
the question becomes whether the operation overflows, like you say.
For smaller N, it's the N(N+3) bound that matters.

How about:

  /* We can use the relationship:

       x // N == ((x+N+2) // (N+1) + x) // (N+1)

     to optimize cases where N+1 is a power of 2, and where // (N+1)
     is therefore a shift right.
     When operating in modes that are multiples of a byte in size, there
     are two cases:

       (1) N(N+3) is not representable, in which case the question becomes
           whether the replacement expression overflows.  It is enough to
           test that x+N+2 does not overflow, i.e. that x < MAX-(N+1).

       (2) N(N+3) is representable, in which case it is the (only) bound
           that we need to check.

     ??? For now we just handle the case where // (N+1) is a shift right
     by half the precision, since some architectures can optimize the
     associated addition and shift combinations into single instructions.  */

> +      auto wcst = wi::to_wide (cst);
> +      int pow = wi::exact_log2 (wcst + 1);
> +      if (pow == (int) (element_precision (vectype) / 2))

Seems like this should be equivalent to:

  if (pow == prec / 2)

which would come in handy for the comment below.

> +        {
> +          gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> +
> +          gimple_ranger ranger;
> +          int_range_max r;
> +
> +          /* Check that no overflow will occur.  If we don't have range
> +             information we can't perform the optimization.  */
> +
> +          if (ranger.range_of_expr (r, oprnd0, stmt))
> +            {
> +              wide_int max = r.upper_bound ();
> +              wide_int one = wi::to_wide (build_one_cst (itype));

Looks like this could be:

  auto one = wi::shwi (1, prec);

We shouldn't build trees just to convert them to wide_ints.

> +              wide_int adder = wi::add (one, wi::lshift (one, pow));
> +              wi::overflow_type ovf;
> +              wi::add (max, adder, UNSIGNED, &ovf);
> +              if (ovf == wi::OVF_NONE)
> +                {
> +                  *type_out = vectype;
> +                  tree tadder = wide_int_to_tree (itype, adder);
> +                  tree rshift = wide_int_to_tree (itype, pow);
> +
> +                  tree new_lhs1 = vect_recog_temp_ssa_var (itype, NULL);
> +                  gassign *patt1
> +                    = gimple_build_assign (new_lhs1, PLUS_EXPR, oprnd0, tadder);
> +                  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
> +
> +                  tree new_lhs2 = vect_recog_temp_ssa_var (itype, NULL);
> +                  patt1 = gimple_build_assign (new_lhs2, RSHIFT_EXPR, new_lhs1,
> +                                               rshift);
> +                  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
> +
> +                  tree new_lhs3 = vect_recog_temp_ssa_var (itype, NULL);
> +                  patt1 = gimple_build_assign (new_lhs3, PLUS_EXPR, new_lhs2,
> +                                               oprnd0);
> +                  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
> +
> +                  tree new_lhs4 = vect_recog_temp_ssa_var (itype, NULL);
> +                  pattern_stmt = gimple_build_assign (new_lhs4, RSHIFT_EXPR,
> +                                                      new_lhs3, rshift);
> +
> +                  return pattern_stmt;
> +                }
> +            }
> +        }
> +    }
>
>   if (prec > HOST_BITS_PER_WIDE_INT
>       || integer_zerop (oprnd1))

LGTM otherwise, but would prefer to see the updated patch before acking.

Thanks,
Richard

----

Proof sketch, probably unnecessarily verbose/indirect:

The transform is:

  x // N == ((x+N+2) // (N+1) + x) // (N+1)   [A]

For this to be valid, we need:

  0 <= x - N(((x+N+2) // (N+1) + x) // (N+1)) < N   [B]

Dividing into two cases:

(1) when 0 <= x < N:

    Then N+1 < x+N+2 < 2(N+1), so

      (x+N+2) // (N+1) == 1

    Substituting into [B] gives:

      0 <= x - N((1+x) // (N+1)) < N   [C]

    And 0 <= x < N implies 1 <= 1+x < N+1, so

      (1+x) // (N+1) == 0.

    Substituting into [C] gives:

      0 <= x < N

    which is given.

(2) when x = K(N+1)-1+L for integral K and L, 0 <= L <= N:

    Then:

      (x+N+2) // (N+1)
        == (K(N+1)-1+L+N+2) // (N+1)
        == ((K+1)(N+1)+L) // (N+1)
        == K+1

    due to the range of L.
    Substituting into [B] gives:

      0 <= K(N+1)-1+L - N((K+1+K(N+1)-1+L) // (N+1)) < N
      0 <= K(N+1)-1+L - N((K+L+K(N+1)) // (N+1)) < N
      0 <= K(N+1)-1+L - N((K+L) // (N+1) + K) < N
      0 <= K+L-1 - N((K+L) // (N+1)) < N
      0 <= K+L - N((K+L) // (N+1)) < N+1   [D]

    Dividing into three subcases:

    (2a) when (K+L) // (N+1) == 0

         This division result implies 0 <= K+L < N+1.

         Also, substituting the division into [D] gives:

           0 <= K+L < N+1

         which is the same condition.

    (2b) when (K+L) // (N+1) == 1

         This division result implies N+1 <= K+L < 2N+2.

         Also, substituting the division into [D] gives:

           0 <= K+L < 2N+1

         So for this case, [A] gives the wrong result iff K+L == 2N+1.
         Since L is bounded by N, the minimum K for which this is true
         is K == N+1.  Therefore, for this case, the minimum invalid
         value of x is (N+1)(N+1)-1+N == N(N+3).

    (2c) when (K+L) // (N+1) >= 2

         This division result implies 2N+2 <= K+L.

         Since L is bounded by N, K >= N+2, and so x >= (N+2)(N+1)-1,
         i.e. x >= N(N+3)+1.  So the minimum invalid x for this case is
         greater than the one for (2b).

So [A] is valid if (but not only if) 0 <= x < N(N+3).
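
FWIW, the bound is also easy to sanity-check by brute force for the
motivating N == 255 case, where // (N+1) is a shift right by 8.  A quick
throwaway program along these lines (untested sketch, not part of the
patch) agrees that [A] holds for every x below N(N+3) and first fails at
exactly x == N(N+3):

  #include <stdint.h>
  #include <stdio.h>

  int
  main (void)
  {
    /* Check x / N == ((x + N + 2) / (N + 1) + x) / (N + 1) for N == 255,
       where dividing by N + 1 == 256 is a shift right by 8.  */
    const uint64_t n = 255;
    for (uint64_t x = 0; x < n * (n + 3); x++)
      if ((((x + n + 2) >> 8) + x) >> 8 != x / n)
        {
          printf ("unexpected mismatch at x = %llu\n", (unsigned long long) x);
          return 1;
        }
    uint64_t x = n * (n + 3);
    printf ("first mismatch at N(N+3)? %s\n",
            (((x + n + 2) >> 8) + x) >> 8 != x / n ? "yes" : "no");
    return 0;
  }

This only exercises the arithmetic identity itself, not the overflow
check that the patch adds for types where N(N+3) isn't representable.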