From: Richard Biener
To: Kenneth Zadeck
Cc: Mike Stump, gcc-patches, Lawrence Crowl, rdsandiford@googlemail.com, Ian Lance Taylor
Date: Fri, 19 Apr 2013 15:35:00 -0000
Subject: Re: patch to fix constant math - 4th patch - the wide-int class - patch ping for the next stage 1

On Tue, Apr 16, 2013 at 10:07 PM, Kenneth Zadeck wrote:
> Richard,
>
> I made major changes to wide-int along the lines you suggested.  Each of
> the binary operations is now a template.  There are 5 possible
> implementations of those operations, one for each of HWI, unsigned HWI,
> wide-int, rtl, and tree.  Note that this is not exactly as you suggested,
> but it is along the same lines.
>
> The HWI template sign extends the value to the precision of the first
> operand; the unsigned HWI one is the same except that it does an unsigned
> extension.  The wide-int version is used as before, but is in truth rarely
> used.  The rtl and tree versions "logically" convert the value to a
> wide-int, but in practice do something more efficient than converting to
> the wide-int.  What they do is look inside the rtl or the tree and pass a
> pointer to the data and a length to the binary operation.  This is
> perfectly safe in the position of a second operand to the binary operation
> because the lifetime is guaranteed to be very short.  The wide-int
> implementation was also modified to do the same pointer trick, allowing
> all 5 templates to share the same use of the data.
>
> Note that currently the tree code is more crufty than one would like.
> This will clean up nicely when the tree-cst is changed to represent the
> value with an array and a length field.
>
> So now, at least for the second operand of binary operations, the storage
> is never copied.  I do not believe that there is a good similar trick for
> the first operand.  I did not consider something like wide_int::add (a, b)
> to be a viable option; it seems to miss the point of using an
> object-oriented language.  So I think that you really have to copy the
> data into an instance of a wide_int.
>
> However, while all of this avoids ever having to pass a precision into the
> second operand, this patch does preserve the finite math implementation of
> wide-int.  Finite math is really what people expect an optimizer to do,
> because it seamlessly matches what the machine is going to do.
>
> I hope that at this point I can get a comprehensive review of these
> patches.  I believe that I have done what is required.
>
> There are two other patches that will be submitted in the next few
> minutes.  The first one is an updated version of the rtl level patch.  The
> only changes from what you have seen before are that the binary operations
> now use the templated binary operations.  The second one is the first of
> the tree level patches.  It converts builtins.c to use wide-int and
> removes all assumptions that tree-csts are built with two HWIs.
>
> Once builtins.c is accepted, I will convert the rest of the middle-end
> patches.  They will all be converted in a similar way.
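[A rough sketch of the second-operand scheme described above, for readers
following along.  The operation name and the worker routine here are
hypothetical, not code from the patch; the point is only that a per-type
helper hands the binary operation a (pointer, length) view of the operand,
so rtl and tree constants are read in place rather than copied:]

  /* Hypothetical illustration only.  */
  template <typename T>
  wide_int
  wide_int::binary_and (T c) const
  {
    HOST_WIDE_INT scratch[WIDE_INT_MAX_ELTS];
    int cl;
    /* For a HWI the helper fills SCRATCH; for wide-int, rtl and tree
       it can simply return a pointer to the existing storage.  */
    const HOST_WIDE_INT *s = to_shwi2 (scratch, &cl, c);
    return and_large (val, len, s, cl, precision);  /* hypothetical worker */
  }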
+   number of elements of the vector that are in use.  When LEN *
+   HOST_BITS_PER_WIDE_INT < the precision, the value has been
+   compressed.  The values of the elements of the vector greater than
+   LEN - 1. are all equal to the highest order bit of LEN.

equal to the highest order bit of element LEN - 1. ?  Especially
_not_ equal to the precision - 1 bit of the value, correct?

+   The representation does not contain any information inherant about
+   signedness of the represented value, so it can be used to represent
+   both signed and unsigned numbers.  For operations where the results
+   depend on signedness (division, comparisons), the signedness must
+   be specified separately.  For operations where the signness
+   matters, one of the operands to the operation specifies either
+   wide_int::SIGNED or wide_int::UNSIGNED.

The last sentence is somehow duplicated.

+   The numbers are stored as sign entended numbers as a means of
+   compression.  Leading HOST_WIDE_INTS that contain strings of either
+   -1 or 0 are removed as long as they can be reconstructed from the
+   top bit that is being represented.

I'd put this paragraph before the one that talks about signedness, next to
the one that already talks about encoding.
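[For concreteness, my reading of that encoding, assuming a 64-bit
HOST_WIDE_INT and precision 128 - these layouts are an illustration,
not text from the patch:

  value -1    ->  len 1, val = { -1 }    (dropped words reconstruct as -1)
  value  1    ->  len 1, val = { 1 }     (dropped words reconstruct as 0)
  value 2^64  ->  len 2, val = { 0, 1 }  (top stored word cannot be dropped)

i.e. the words above VAL[LEN - 1] are recovered by sign-extending from
the top bit of VAL[LEN - 1].]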
+   All constructors for wide_int take either a precision, an enum
+   machine_mode or tree_type.  */

That's probably no longer true (I'll now check).

+class wide_int {
+  /* Internal representation.  */
+
+  /* VAL is set to a size that is capable of computing a full
+     multiplication on the largest mode that is represented on the
+     target.  The full multiplication is use by tree-vrp.  tree-vpn
+     currently does a 2x largest mode by 2x largest mode yielding a 4x
+     largest mode result.  If operations are added that require larger
+     buffers, then VAL needs to be changed.  */
+  HOST_WIDE_INT val[WIDE_INT_MAX_ELTS];
+  unsigned short len;
+  unsigned int precision;

I wonder if there is a technical reason to stick to HOST_WIDE_INTs?
I'd say for efficiency HOST_WIDEST_FAST_INT would be more appropriate
(to get a 32bit value on 32bit x86 for example).  I of course see that
conversion to/from HOST_WIDE_INT is an important operation that would
get slightly more complicated.

Maybe just quickly checking the code generated on 32bit x86 for
HOST_WIDE_INT vs. HOST_WIDEST_FAST_INT tells us whether it's worth
considering (it would be bad if each add/multiply would end up calling
to libgcc for example - I know that doesn't happen for x86, but maybe
it would happen for an arm hosted gcc targeting x86_64?)

+  enum ShiftOp {
+    NONE,
+    /* There are two uses for the wide-int shifting functions.  The
+       first use is as an emulation of the target hardware.  The
+       second use is as service routines for other optimizations.  The
+       first case needs to be identified by passing TRUNC as the value
+       of ShiftOp so that shift amount is properly handled according to the
+       SHIFT_COUNT_TRUNCATED flag.  For the second case, the shift
+       amount is always truncated by the bytesize of the mode of
+       THIS.  */
+    TRUNC
+  };

double-int simply honors SHIFT_COUNT_TRUNCATED.  Why differ from that
(and thus change behavior in existing code - not sure if you do that
with introducing wide-int)?

+  enum SignOp {
+    /* Many of the math functions produce different results depending
+       on if they are SIGNED or UNSIGNED.  In general, there are two
+       different functions, whose names are prefixed with an 'S" and
+       or an 'U'.  However, for some math functions there is also a
+       routine that does not have the prefix and takes an SignOp
+       parameter of SIGNED or UNSIGNED.  */
+    SIGNED,
+    UNSIGNED
+  };

You seem to insist on that.  It should propagate to the various parts
of the compiler that have settled for the 'uns' integer argument.
Having one piece behave differently is just weird.  I suppose I will
find code like

  wi.ext (prec, uns ? UNSIGNED : SIGNED)

in your conversion?

+  static wide_int from_shwi (HOST_WIDE_INT op0,
+                             unsigned int precision, bool *overflow = 0);

+  if (precision < HOST_BITS_PER_WIDE_INT)
+    {
+      HOST_WIDE_INT t = sext_hwi (op0, precision);
+      if (t != op0 && overflow)
+        *overflow = true;
+      op0 = t;
+    }

Hm.  I'd say input values should be properly extended already (we
certainly expect that from RTL or tree constants).  So we might want to
assert the above instead of tracking it via *overflow.
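[A sketch of what the asserting variant could look like - my guess, not
code from the patch:

  wide_int
  wide_int::from_shwi (HOST_WIDE_INT op0, unsigned int precision)
  {
    wide_int result;
    /* Callers must pass a value already sign-extended to PRECISION.  */
    gcc_checking_assert (precision >= HOST_BITS_PER_WIDE_INT
                         || op0 == sext_hwi (op0, precision));
    result.val[0] = op0;
    result.len = 1;
    result.precision = precision;
    return result;
  }
]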
+  static wide_int from_array (const HOST_WIDE_INT* op0,
+                              unsigned int len,
+                              unsigned int precision, bool need_canon = true);
+  inline static wide_int from_array (const HOST_WIDE_INT* op0,
+                                     unsigned int len,
+                                     enum machine_mode mode);
+  inline static wide_int from_array (const HOST_WIDE_INT* op0,
+                                     unsigned int len,
+                                     const_tree type);

I still don't like the overloads precision vs. machine_mode vs. tree type.
It's much clearer what you are specifying if you write

  from_array (&x, len, GET_MODE_PRECISION (mode))

instead of

  from_array (&x, len, mode)

and it's not that much more typing either.

+  static wide_int from_double_int (double_int,
+                                   unsigned int precision);
+  inline static wide_int from_double_int (double_int, enum machine_mode);

this one would lack the tree type overload (I of course think it has an
excessive overload for machine_mode).

+  inline HOST_WIDE_INT to_shwi (unsigned int prec = 0) const;
+  inline unsigned HOST_WIDE_INT to_uhwi (unsigned int prec = 0) const;

this is

  wi.sext (prec);
  wi.to_shwi ();

which is less odd than handling prec == 0 specially.

+  static wide_int max_value (unsigned int prec,
+                             unsigned int precision, SignOp sgn);

two precisions - ugh ;)  In some places having to have a precision is ugly :/

+  inline static wide_int minus_one (unsigned int prec);
+  inline static wide_int zero (unsigned int prec);
+  inline static wide_int one (unsigned int prec);
+  inline static wide_int two (unsigned int prec);

here as well.  I see two (1) properly truncates the value to zero ;)
It just comes to my mind that these could have "arbitrary" precision.
"arbitrary" == MAX_SIZE * HOST_BITS_PER_WIDE_INT.  Which would get us
back to mixed precision operations ...

+  /* Printing functions.  */
+
+  void print_dec (char *buf, SignOp sgn) const;
+  void print_dec (FILE *file, SignOp sgn) const;
+  void print_decs (char *buf) const;
+  void print_decs (FILE *file) const;
+  void print_decu (char *buf) const;
+  void print_decu (FILE *file) const;
+  void print_hex (char *buf) const;
+  void print_hex (FILE *file) const;

making those non-member functions would allow getting rid of stdio.h
from wide-int.h (similarly, making the tree / rtx / mode argument
methods either standalone or making them templates and externalizing
the implementation to a separate template class would get rid of the
rtl / tree dependency)

+  inline bool neg_p () const;

how can you test that?  wide-int doesn't have sign information.

+bool
+wide_int::neg_p () const
+{
+  return sign_mask () != 0;

not exactly efficient either ...

+HOST_WIDE_INT
+wide_int::sign_mask () const
+{
+  int i = len - 1;
+  if (precision < HOST_BITS_PER_WIDE_INT)
+    return ((val[0] << (HOST_BITS_PER_WIDE_INT - precision))
+            >> (HOST_BITS_PER_WIDE_INT - 1));
+
+  /* VRP appears to be badly broken and this is a very ugly fix.  */
+  if (i >= 0)
+    return val[i] >> (HOST_BITS_PER_WIDE_INT - 1);
+
+  gcc_unreachable ();
+#if 0
+  return val[len - 1] >> (HOST_BITS_PER_WIDE_INT - 1);
+#endif
+}

I can't see what the "fix" is.  It seems to be equivalent to the
commented-out code.  Apart from not handling len == 0, for which you
then ICE.

Back to neg_p ... it computes wrong information.
from_uhwi (-1, HOST_BITS_PER_WIDE_INT).neg_p () returns true.

I suppose you want to rename it to msb () (implementing that
efficiently and using it from sign_mask, too).
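[Something along these lines is what I'd expect for such an msb (),
relying on the documented compression; a sketch only, this function is
not in the posted patch:

  bool
  wide_int::msb () const
  {
    unsigned int bit = precision - 1;
    unsigned int block = bit / HOST_BITS_PER_WIDE_INT;
    /* Blocks at or above LEN are sign copies of block LEN - 1.  */
    if (block >= (unsigned int) len)
      return val[len - 1] < 0;
    return (val[block] >> (bit % HOST_BITS_PER_WIDE_INT)) & 1;
  }
]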
+  template <typename T>
+  inline bool gt_p (T c, SignOp sgn) const;
+  template <typename T>
+  inline bool gts_p (T c) const;
+  template <typename T>
+  inline bool gtu_p (T c) const;

it's bad that we can't use the sign information we have available in
almost all cases ... (where precision is not an exact multiple of
HOST_BITS_PER_WIDE_INT and len == precision / HOST_BITS_PER_WIDE_INT).
It isn't hard to encode a sign - you just have to possibly waste a word
of zeroes for positive values where at the moment precision is an exact
multiple of HOST_BITS_PER_WIDE_INT and len == precision /
HOST_BITS_PER_WIDE_INT.  Which of course means that the encoding can be
one word larger than maximally required by 'precision'.

+  wide_int force_to_size (unsigned int precision, SignOp sgn) const;
+  inline wide_int force_to_size (enum machine_mode mode, SignOp sgn) const;
+  inline wide_int force_to_size (const_tree type) const;
+  inline wide_int force_to_size (const_tree type, SignOp sgn) const;
+
+  inline wide_int sforce_to_size (enum machine_mode mode) const;
+  inline wide_int sforce_to_size (const_tree type) const;
+  inline wide_int zforce_to_size (enum machine_mode mode) const;
+  inline wide_int zforce_to_size (const_tree type) const;

too many overloads again

Looking at

+template <typename T>
+wide_int
+wide_int::operator & (T c) const
+{
+  wide_int result;
+  HOST_WIDE_INT ws[WIDE_INT_MAX_ELTS];
+  const HOST_WIDE_INT *s;
+  int cl;
+
+  s = to_shwi2 (ws, &cl, c);
+

I see

+  template <typename T>
+  static inline const HOST_WIDE_INT* to_shwi2 (HOST_WIDE_INT *s, int *l, T x)
+  {
+    s[0] = x;
+    if (~(T)0 < (T)0
...
+  static inline const HOST_WIDE_INT* to_shwi2 (HOST_WIDE_INT *s ATTRIBUTE_UNUSED,
+                                               int *l, const wide_int &y)
+  {
+    *l = y.len;
...

I think you want to use template specialization here, not overloading.  Thus,

  template <>
  static inline const HOST_WIDE_INT* to_shwi2 (HOST_WIDE_INT *s, int *l,
                                                const wide_int &y)

and adjust the HWI case to look like

  template <typename T>
  static inline const HOST_WIDE_INT* to_shwi2 (HOST_WIDE_INT *s, int *l,
                                                const T& x)

you want to move that ~(T)0 < (T)0 check up to template substitution
level to allow SFINAE to disable the template if that's not computable.

+    s[0] = x;
+    if (~(T)0 < (T)0
+        || sizeof (T) < sizeof (HOST_WIDE_INT))
+      {
+        *l = 1;
+      }
+    else
+      {
+        s[1] = 0;
+        *l = 2;

that's only required if the MSB is set, otherwise it's not properly
compressed

+      }
+    return s;

hmm, looking at from_uhwi I see that you are adding the extra zero
words as I suggested ... which means you _do_ have reliable sign
information available (well, not for the case where len would end up
bigger than MAX_LEN).  So - can we simplify some things with that?  I
can think of the ordered comparisons for example.  Also the encoding
description should be clarified that there _is_ sign information
available (and that len can be bigger than precision /
HOST_BITS_PER_WIDE_INT).

Returning to to_shwi2 (odd name, btw):

+      /* Copy the result because it does not fit in two HWIs.  */
+      s[0] = TREE_INT_CST_LOW (tcst);
+      s[1] = TREE_INT_CST_HIGH (tcst);
+      s[2] = 0;
+      *l = 3;
+      return s;

ah, this is why you need the extra argument ;)  Btw, what happens when
a proper encoding does not fit in MAX_LEN words?  That is, we'd need an
extra zero word?

+  /* There should logically be an overload for rtl here, but it cannot
+     be here because of circular include issues.  It is in rtl.h.  */
+  static inline const HOST_WIDE_INT* to_shwi2
+    (HOST_WIDE_INT *s ATTRIBUTE_UNUSED, int *l, rtx rcst);

in rtl.c I suppose.  Btw, this is why I'm suggesting to defer the
implementation to a template - you can implement that outside of
wide-int.h even without declaring the specialization here.

Is there an accessible branch with the wide-int work?  I may be tempted
to propose some changes in the form of patches.  If not, then I think
the work is in a state that is suitable to put on a merge-branch.

+wide_int
+wide_int::add (const wide_int &op1, SignOp sgn, bool *overflow) const
+{
...
+  /* Uncompress the rest.  */
+  for (i = len; i < op1.len; i++)
+    {
+      o0 = val[i];
+      o1 = mask1;
+      x = o0 + o1 + carry;
+      result.val[i] = x;
+      old_carry = carry;
+      carry = x < o0;
+    }
+  for (i = len; i < op1.len; i++)
+    {
+      o0 = mask0;
+      o1 = op1.val[i];
+      x = o0 + o1 + carry;
+      result.val[i] = x;
+      old_carry = carry;
+      carry = x < o0;
+    }

Cut & paste error here, I think.  At least one should run up to
this->len, no?  And the first loop should run to min (len, op1.len)
only.

How much testing coverage does the code have for "interesting" values?
A standalone testing program will probably reveal such issues (due to
the richness of the current API, covering everything will be hard :/)
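[To illustrate the mixed-length handling I would expect there, a
self-contained toy version of the uncompressing add - an illustration
only, not the patch code:

  /* Add the sign-compressed arrays A[0..ALEN) and B[0..BLEN), writing
     MAX (ALEN, BLEN) words to RES; missing high words of the shorter
     operand are its sign mask.  Returns the carry out of the top word.  */
  static unsigned HOST_WIDE_INT
  add_encoded (HOST_WIDE_INT *res,
               const HOST_WIDE_INT *a, unsigned int alen,
               const HOST_WIDE_INT *b, unsigned int blen)
  {
    unsigned HOST_WIDE_INT amask = a[alen - 1] < 0 ? -1 : 0;
    unsigned HOST_WIDE_INT bmask = b[blen - 1] < 0 ? -1 : 0;
    unsigned int len = alen > blen ? alen : blen;
    unsigned HOST_WIDE_INT carry = 0;
    for (unsigned int i = 0; i < len; i++)
      {
        unsigned HOST_WIDE_INT o0
          = i < alen ? (unsigned HOST_WIDE_INT) a[i] : amask;
        unsigned HOST_WIDE_INT o1
          = i < blen ? (unsigned HOST_WIDE_INT) b[i] : bmask;
        unsigned HOST_WIDE_INT t = o0 + o1;
        unsigned HOST_WIDE_INT x = t + carry;
        res[i] = x;
        carry = (t < o0) | (x < t);
      }
    return carry;
  }
]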
Looking at

+HOST_WIDE_INT
+wide_int::sign_mask () const
+{
+  int i = len - 1;
+  if (precision < HOST_BITS_PER_WIDE_INT)
+    return ((val[0] << (HOST_BITS_PER_WIDE_INT - precision))
+            >> (HOST_BITS_PER_WIDE_INT - 1));
+
+  /* VRP appears to be badly broken and this is a very ugly fix.  */
+  if (i >= 0)
+    return val[i] >> (HOST_BITS_PER_WIDE_INT - 1);
+
+  gcc_unreachable ();
+#if 0
+  return val[len - 1] >> (HOST_BITS_PER_WIDE_INT - 1);
+#endif
+}

again - this looks like it does overly complicated work.  Our encoding
of values guarantees that the following is correct:

  HOST_WIDE_INT
  wide_int::sign_mask () const
  {
    if (val[len - 1] < 0)
      return -1;
    else
      return 0;
  }

I suppose quite some code that calls sign_mask can now be simplified
(thanks to the changed encoding).

Back to ::add ():

+  small_prec = precision & (HOST_BITS_PER_WIDE_INT - 1);
+  if (small_prec == 0)
+    {
+      if (sgn == wide_int::SIGNED)
+        {
+          if (((!(o0 ^ o1)) & (x ^ o0)) >> (HOST_BITS_PER_WIDE_INT - 1))
+            *overflow = true;

shouldn't this share code with the non-overflow add_large and simply do
overflow detection conditional on overflow being non-NULL?  Because the
add_large code seems to be correct with respect to the issues noted
above ...

+ ex:
+  result.canonize ();
+  return result;
+}

canonizing is not strictly necessary - in fact it is unlikely to do
anything useful most of the time.  I'd say we should only canonize
before building a tree or RTL with that representation.  Note that
canonize is wrong, too:

+wide_int::canonize ()
+{
+  int small_prec = precision & (HOST_BITS_PER_WIDE_INT - 1);
+  int blocks_needed = BLOCKS_NEEDED (precision);
+  HOST_WIDE_INT top;
+  int i;
+
+  if (len > blocks_needed)
+    len = blocks_needed;

that will set a UHWI -1 to have len == 1, ignoring the extra needed
zero word.  Thus, either BLOCKS_NEEDED is broken or its uses need to be
audited.

+
+  /* Clean up the top bits for any mode that is not a multiple of a HWI.  */
+  if (len == blocks_needed && small_prec)
+    val[len - 1] = sext_hwi (val[len - 1], small_prec);

That's not necessary.  By construction all arithmetic operations will
preserve proper extension.  And you unconditionally sign-extend small
precision "large" positive constants here, which is even wrong
(-1U, precision == 32, HWI == 64, original rep { 0x00..00fffff },
broken { 0xfff...fff }).

+  if (len == 1)
+    return;

In fact, for len == 1 we should do nothing in this function from the
start.  Callers that also want to properly sign- or zero-extend the
result value should do so using ext (), not abuse another
implementation inside canonize () (which should only optimize the
encoding).

Ok, I'll stop reviewing here.  It's a lot better than before - thanks
for doing the changes.  I'd still like you to cut down the interface
somewhat and _test_ what is remaining.  Put the code on a branch off
trunk so I and others can produce patches against it.

Thanks,
Richard.