From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH, rs6000] 2/3 Add x86 SSE intrinsics to GCC PPC64LE target
From: Steven Munroe
Reply-To: munroesj@linux.vnet.ibm.com
To: Segher Boessenkool
Cc: gcc-patches, David Edelsohn
In-Reply-To: <20170817052841.GH13471@gate.crashing.org>
References: <1502915740.16102.62.camel@oc7878010663> <20170817052841.GH13471@gate.crashing.org>
Date: Fri, 18 Aug 2017 08:03:00 -0000
Message-Id: <1503020434.7915.65.camel@oc7878010663>
X-SW-Source: 2017-08/txt/msg01106.txt.bz2

On Thu, 2017-08-17 at 00:28 -0500, Segher Boessenkool wrote:
> Hi!
>
> On Wed, Aug 16, 2017 at 03:35:40PM -0500, Steven Munroe wrote:
> > +extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
> > +_mm_add_ss (__m128 __A, __m128 __B)
> > +{
> > +#ifdef _ARCH_PWR7
> > +  __m128 a, b, c;
> > +  static const __vector unsigned int mask = {0xffffffff, 0, 0, 0};
> > +  /* PowerISA VSX does not allow partial (for just lower double)
> > +   * results. So to ensure we don't generate spurious exceptions
> > +   * (from the upper double values) we splat the lower double
> > +   * before we do the operation.  */
>
> No leading stars in comments please.

Fixed

> > +  a = vec_splat (__A, 0);
> > +  b = vec_splat (__B, 0);
> > +  c = a + b;
> > +  /* Then we merge the lower float result with the original upper
> > +   * float elements from __A.  */
> > +  return (vec_sel (__A, c, mask));
> > +#else
> > +  __A[0] = __A[0] + __B[0];
> > +  return (__A);
> > +#endif
> > +}
>
> It would be nice if we could just write the #else version and get the
> more optimised code, but I guess we get something horrible going through
> memory, instead?

No, even with GCC8-trunk this field access is going through storage.
The generated code for splat, op, select is shorter even when you
include loading the constant. Vector <-> scalar float is just nasty!

> > +extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
> > +_mm_rcp_ps (__m128 __A)
> > +{
> > +  __v4sf result;
> > +
> > +  __asm__(
> > +      "xvresp %x0,%x1;\n"
> > +      : "=v" (result)
> > +      : "v" (__A)
> > +      : );
> > +
> > +  return (result);
> > +}
>
> There is a builtin for this (__builtin_vec_re).

Yes, not sure how I missed that. Fixed.

> > +/* Convert the lower SPFP value to a 32-bit integer according to the current
> > +   rounding mode.
> > +   */
> > +extern __inline int __attribute__((__gnu_inline__, __always_inline__, __artificial__))
> > +_mm_cvtss_si32 (__m128 __A)
> > +{
> > +  __m64 res = 0;
> > +#ifdef _ARCH_PWR8
> > +  __m128 vtmp;
> > +  __asm__(
> > +      "xxsldwi %x1,%x2,%x2,3;\n"
> > +      "xscvspdp %x1,%x1;\n"
> > +      "fctiw %1,%1;\n"
> > +      "mfvsrd %0,%x1;\n"
> > +      : "=r" (res),
> > +        "=&wi" (vtmp)
> > +      : "wa" (__A)
> > +      : );
> > +#endif
> > +  return (res);
> > +}
>
> Maybe it could do something better than return the wrong answer for non-p8?

Ok, this gets tricky. Before _ARCH_PWR8 the vector to scalar transfer
would go through storage. But that is not the worst of it. The semantic
of cvtss requires rint or llrint. But __builtin_rint will generate a
call to libm unless we assert -ffast-math. And we don't have builtins
to generate fctiw/fctid directly.

So I will add the #else using __builtin_rint if that libm dependency is
ok (this will pop in the DG test for older machines).

> > +extern __inline int __attribute__((__gnu_inline__, __always_inline__, __artificial__))

> > +#ifdef __LITTLE_ENDIAN__
> > +  return result[1];
> > +#elif __BIG_ENDIAN__
> > +  return result [0];
>
> Remove the extra space here?

> > +_mm_max_pi16 (__m64 __A, __m64 __B)

> > +  res.as_short[0] = (m1.as_short[0] > m2.as_short[0])? m1.as_short[0]: m2.as_short[0];
> > +  res.as_short[1] = (m1.as_short[1] > m2.as_short[1])? m1.as_short[1]: m2.as_short[1];
> > +  res.as_short[2] = (m1.as_short[2] > m2.as_short[2])? m1.as_short[2]: m2.as_short[2];
> > +  res.as_short[3] = (m1.as_short[3] > m2.as_short[3])? m1.as_short[3]: m2.as_short[3];
>
> Space before ? and : .

Done

> > +_mm_min_pi16 (__m64 __A, __m64 __B)
>
> In this function, too.
> > +extern __inline int __attribute__((__gnu_inline__, __always_inline__, __artificial__))
> > +_m_pmovmskb (__m64 __A)
> > +{
> > +  return _mm_movemask_pi8 (__A);
> > +}
> > +/* Multiply four unsigned 16-bit values in A by four unsigned 16-bit values
> > +   in B and produce the high 16 bits of the 32-bit results.  */
> > +extern __inline __m64 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>
> Newline before the comment?

Done

> > +_mm_sad_pu8 (__m64 __A, __m64 __B)

> > +  /* Sum four groups of bytes into integers.  */
> > +  vsum = (__vector signed int) vec_sum4s (vabsdiff, zero);
> > +  /* sum across four integers with integer result.  */
> > +  vsum = vec_sums (vsum, (__vector signed int) zero);
> > +  /* the sum is in the right most 32-bits of the vector result.
> > +     Transfer to a GPR and truncate to 16 bits.  */
>
> That last line has an indentation problem. Sentences should start with
> a capital, in those last two comments.

Fixed

> > +/* Stores the data in A to the address P without polluting the caches.  */
> > +extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
> > +_mm_stream_pi (__m64 *__P, __m64 __A)
> > +{
> > +  /* Use the data cache block touch for store transient.  */
> > +  __asm__ (
> > +    "    dcbtstt 0,%0;"
>
> No need for the ";". dcbtstt isn't really the same semantics but I don't
> think you can do better.

Removed the ';'.

> The rest looks great.
>
> Please do something about the _mm_cvtss_si32, have a look at the other
> comments, and it is ready to commit. Thanks,

Thanks.