From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH, rs6000] 2/3 Add x86 SSE intrinsics to GCC PPC64LE target
From: Steven Munroe
Reply-To: munroesj@linux.vnet.ibm.com
To: Segher Boessenkool
Cc: gcc-patches, David Edelsohn
In-Reply-To: <20170817052841.GH13471@gate.crashing.org>
References: <1502915740.16102.62.camel@oc7878010663> <20170817052841.GH13471@gate.crashing.org>
Date: Fri, 18 Aug 2017 08:03:00 -0000
Message-Id: <1503020434.7915.65.camel@oc7878010663>
X-SW-Source: 2017-08/txt/msg01106.txt.bz2

On Thu, 2017-08-17 at 00:28 -0500, Segher Boessenkool wrote:
> Hi!
>
> On Wed, Aug 16, 2017 at 03:35:40PM -0500, Steven Munroe wrote:
> > +extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
> > +_mm_add_ss (__m128 __A, __m128 __B)
> > +{
> > +#ifdef _ARCH_PWR7
> > +  __m128 a, b, c;
> > +  static const __vector unsigned int mask = {0xffffffff, 0, 0, 0};
> > +  /* PowerISA VSX does not allow partial (for just lower double)
> > +   * results. So to ensure we don't generate spurious exceptions
> > +   * (from the upper double values) we splat the lower double
> > +   * before we do the operation.  */
>
> No leading stars in comments please.

Fixed

> > +  a = vec_splat (__A, 0);
> > +  b = vec_splat (__B, 0);
> > +  c = a + b;
> > +  /* Then we merge the lower float result with the original upper
> > +   * float elements from __A.  */
> > +  return (vec_sel (__A, c, mask));
> > +#else
> > +  __A[0] = __A[0] + __B[0];
> > +  return (__A);
> > +#endif
> > +}
>
> It would be nice if we could just write the #else version and get the
> more optimised code, but I guess we get something horrible going through
> memory, instead?

No, even with GCC8-trunk this field access is going through storage.
The generated code for splat, op, select is shorter even when you
include loading the constant. Vector <-> scalar float is just nasty!

> > +extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
> > +_mm_rcp_ps (__m128 __A)
> > +{
> > +  __v4sf result;
> > +
> > +  __asm__(
> > +      "xvresp %x0,%x1;\n"
> > +      : "=v" (result)
> > +      : "v" (__A)
> > +      : );
> > +
> > +  return (result);
> > +}
>
> There is a builtin for this (__builtin_vec_re).

Yes, not sure how I missed that. Fixed.

> > +/* Convert the lower SPFP value to a 32-bit integer according to the current
> > +   rounding mode.
> > +   */
> > +extern __inline int __attribute__((__gnu_inline__, __always_inline__, __artificial__))
> > +_mm_cvtss_si32 (__m128 __A)
> > +{
> > +  __m64 res = 0;
> > +#ifdef _ARCH_PWR8
> > +  __m128 vtmp;
> > +  __asm__(
> > +      "xxsldwi %x1,%x2,%x2,3;\n"
> > +      "xscvspdp %x1,%x1;\n"
> > +      "fctiw %1,%1;\n"
> > +      "mfvsrd %0,%x1;\n"
> > +      : "=r" (res),
> > +        "=&wi" (vtmp)
> > +      : "wa" (__A)
> > +      : );
> > +#endif
> > +  return (res);
> > +}
>
> Maybe it could do something better than return the wrong answer for non-p8?

Ok, this gets tricky. Before _ARCH_PWR8 the vector to scalar transfer
would go through storage. But that is not the worst of it. The semantic
of cvtss requires rint or llrint. But __builtin_rint will generate a
call to libm unless we assert -ffast-math. And we don't have builtins
to generate fctiw/fctid directly.

So I will add the #else using __builtin_rint if that libm dependency is
ok (this will pop in the DG test for older machines).

> > +extern __inline int __attribute__((__gnu_inline__, __always_inline__, __artificial__))

> > +#ifdef __LITTLE_ENDIAN__
> > +  return result[1];
> > +#elif __BIG_ENDIAN__
> > +  return result [0];
>
> Remove the extra space here?

> > +_mm_max_pi16 (__m64 __A, __m64 __B)

> > +  res.as_short[0] = (m1.as_short[0] > m2.as_short[0])? m1.as_short[0]: m2.as_short[0];
> > +  res.as_short[1] = (m1.as_short[1] > m2.as_short[1])? m1.as_short[1]: m2.as_short[1];
> > +  res.as_short[2] = (m1.as_short[2] > m2.as_short[2])? m1.as_short[2]: m2.as_short[2];
> > +  res.as_short[3] = (m1.as_short[3] > m2.as_short[3])? m1.as_short[3]: m2.as_short[3];
>
> Space before ? and : .

Done

> > +_mm_min_pi16 (__m64 __A, __m64 __B)
>
> In this function, too.
> > +extern __inline int __attribute__((__gnu_inline__, __always_inline__, __artificial__))
> > +_m_pmovmskb (__m64 __A)
> > +{
> > +  return _mm_movemask_pi8 (__A);
> > +}
> > +/* Multiply four unsigned 16-bit values in A by four unsigned 16-bit values
> > +   in B and produce the high 16 bits of the 32-bit results.  */
> > +extern __inline __m64 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>
> Newline before the comment?

Done

> > +_mm_sad_pu8 (__m64 __A, __m64 __B)

> > +  /* Sum four groups of bytes into integers.  */
> > +  vsum = (__vector signed int) vec_sum4s (vabsdiff, zero);
> > +  /* sum across four integers with integer result.  */
> > +  vsum = vec_sums (vsum, (__vector signed int) zero);
> > +  /* the sum is in the right most 32-bits of the vector result.
> > +     Transfer to a GPR and truncate to 16 bits.  */
>
> That last line has an indentation problem. Sentences should start with
> a capital, in those last two comments.

Fixed

> > +/* Stores the data in A to the address P without polluting the caches.  */
> > +extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
> > +_mm_stream_pi (__m64 *__P, __m64 __A)
> > +{
> > +  /* Use the data cache block touch for store transient.  */
> > +  __asm__ (
> > +    "    dcbtstt 0,%0;"
>
> No need for the ";". dcbtstt isn't really the same semantics but I don't
> think you can do better.

Removed the ';'.

> The rest looks great.
>
> Please do something about the _mm_cvtss_si32, have a look at the other
> comments, and it is ready to commit. Thanks,

Thanks.