From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <murphyp@linux.ibm.com>
Subject: Re: [PATCHv4 2/2] powerpc64le: ifunc (almost) all *f128 routines in
 multiarch mode
To: Adhemerval Zanella <adhemerval.zanella@linaro.org>,
 libc-alpha@sourceware.org,
 Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
References: <bf94613b3c06a1a165c4e3818f09a8ea5d477df8.1592254374.git.murphyp@linux.vnet.ibm.com>
 <f22cb700cb51c7e3562b22e459677047b0f6b03a.1592254374.git.murphyp@linux.vnet.ibm.com>
 <4ea0388c-f261-eb53-7f65-c176fdda3eb5@linaro.org>
From: Paul E Murphy <murphyp@linux.ibm.com>
Message-ID: <110e6c9e-e4b8-3d85-a98f-7b4a818b82ff@linux.ibm.com>
Date: Mon, 22 Jun 2020 18:04:25 -0500
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101
 Thunderbird/68.9.0
MIME-Version: 1.0
In-Reply-To: <4ea0388c-f261-eb53-7f65-c176fdda3eb5@linaro.org>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <http://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <http://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Mon, 22 Jun 2020 23:04:32 -0000


On 6/22/20 11:57 AM, Adhemerval Zanella via Libc-alpha wrote:
> 
> 
> On 15/06/2020 17:59, Paul E. Murphy via Libc-alpha wrote:
>> See the Makefile changes for high level design/commentary.
>>
>> V4 changes -
>>    * Drop patch to add libm_alias_exclusive_ldouble.  After
>>      recent refactoring of fmaf128, it showed some unfixable
>>      flaws.  Instead, use macro renaming for nextafterf128 to
>>      generate the needed symbols, and rework.
>>
>> V3 changes -
>>    * Cleanup comments.
>>    * Rebase against fmaf128 cleanup
>>    * Use Makeconfig trick to set var in le/power9 sysdep dir to
>>      determine if ifunc support is necessary.  This works with
>>      the upcoming CPU detection patch.
>>    * fmaf128 patch is no longer needed.
>>
>> V2 changes -
>>    * move duplicate redirect macros into float128-ifunc-redirect-macros.h
>>    * replace subshell usage with command sequencing
>>    * Add more instructive documentation in Makefile about how all
>>      these ugly pieces work together
>>    * Minor comment cleanup throughout
>>    * Improve inline documentation/commentary throughout
>>
>> ---8<---
>>
>> Programmatically generate simple wrappers for most libm *f128
>> objects and a set of ifunc objects to unify them.
>>
>> A second set of implementation files is generated which simply
>> includes the first implementation encountered along the search
>> path.  This usually works, except when a wrapper is overridden
>> and makefile search order slightly diverges from include order.
>>
>> A set of additional headers are included which primarily rely
>> on asm redirects to rename, and less frequently macro renames
>> where an asm redirect is not possible.  These intercept several
>> common headers to install redirect and disable macros at specific
>> times.  This works surprisingly well.  Notably, some ugliness
>> occurs when header inclusion must be coerced at certain times
>> before turning off aliasing and plt bypass wrappers.
>>
>> Notably, the only special case is s_significandf128.c.  It is
>> doubly special as it exists to support ldouble redirects, and
>> exposes a subtle difference between makefile rules and search path
>> orders.  Commentary is inlined.
>>
>> Admittedly, this makes shared maintenance a tiny bit more
>> difficult, but lays groundwork for supporting more optimized
>> float128 routines which very overtly assume a soft-fp runtime.
>> Changes to internal float128 API should fail at compile time,
>> thus build-many-glibcs.py should readily catch any divergence.
>>
>> Finally, don't build this support if requested CPU is newer
>> than power8.
>>


>> fixup f128 ifunc
>>
>> drop the patch to introduce the new macro to assist simplification of
>> s_nextafter.c.  It wasn't thought out well enough.  Instead just add
>> the ugly macro redirections needed to generate the appropriate
>> nexttoward symbols.

This is refactoring noise, and while not wrong is not meant to be
in the final commit message.

> 
> I am trying to digest the requirements to add such complexity on the
> powerpc64le build rules, especially the internal Makefile hackery
> required.

This is addressed in the notes.  To put it mildly, soft-fp code
generation on P8 is quite limited.  This is pretty easy to identify in
any non-trivial binary128 function, e.g. expf128 is almost 1/3 the
size on P9.  Likewise, many complex functions are almost 1/2 the size.
Anything soft-fp touches massively increases code size and impedes
instruction scheduling.

I can get some more concrete numbers, but my hope is this enables us
to make even more meaningful improvements to common code when hardware
support is available.

> 
> So if I understood correctly, let's say we have these targets:
> 
>    1. powerpc64le-linux-gnu
>    2. powerpc64le-linux-gnu with --with-cpu=power9
> 
> The ifunc mechanism to build optimized versions for power9 will be
> built only for 1, while for 2 only versions that use hardware
> instructions for __float128 (-mfloat128-hardware gcc option)
> will be used.

In case 2 (and with any newer cpu), this patch is a no-op.
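
To make the case 1 behaviour concrete, here is a rough, portable
sketch of the dispatch idea.  It is not the patch's code: it emulates
an ifunc with a function pointer, uses double as a stand-in for
_Float128, and every name in it (scale, scale_generic, scale_power9,
resolve_scale) is hypothetical.  A real ifunc resolver runs at
relocation time and inspects the hwcap bits rather than an argument.

```c
#include <stddef.h>

/* Hypothetical stand-ins for two builds of one routine.  In case 1
   these would be the soft-fp and the power9 objects built from the
   same source; both must compute the same result.  */
static double scale_generic (double x) { return x * 2.0; }
static double scale_power9 (double x) { return x * 2.0; }

typedef double (*scale_fn) (double);

/* Resolver: choose a variant from a capability flag.  The flag is an
   assumption standing in for a hardware binary128 check.  */
static scale_fn
resolve_scale (int has_hw_float128)
{
  return has_hw_float128 ? scale_power9 : scale_generic;
}

/* Variant chosen on first use; NULL until resolved.  */
static scale_fn selected_scale = NULL;

/* Public entry point.  Callers only ever see this name.  The probe
   result is hard-coded to 0 here purely for illustration.  */
double
scale (double x)
{
  if (selected_scale == NULL)
    selected_scale = resolve_scale (0);
  return selected_scale (x);
}
```

In case 2 there is nothing to select between, so only one variant
would be built and scale would bind to it directly.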

> 
> So all the redirection machinery done in the float128-ifunc-* is to
> list and redirect internal libm symbols to their float128 counterparts.
> One initial issue is this tends to be fragile: it requires changing
> arch-specific code when generic code is changed (for instance by
> changing the internal symbol name or the caller implementation)

The interesting symbol names are likely to see less change, and those
that do should mostly be hidden via local calls.  This is the price
the ppc64le maintainers pay to support multiarch for a large swath
of libm.  This greatly simplifies the most mundane and error prone
pieces.
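
For anyone unfamiliar with the asm redirect trick the commit message
refers to, a minimal sketch follows.  It relies on the GCC asm-label
extension on an ELF target; the names my_expf128 and
my_expf128_power9 are made up for illustration, and double stands in
for _Float128.

```c
/* Redirect: from here on, the C name my_expf128 refers to the
   assembler symbol my_expf128_power9 (GCC asm label extension).  */
extern double my_expf128 (double) __asm__ ("my_expf128_power9");

/* The source is unchanged and still defines my_expf128, but the
   object file exports my_expf128_power9 instead.  */
double
my_expf128 (double x)
{
  return x + 1.0;
}

/* A plain declaration of the renamed symbol, as an ifunc selector
   elsewhere might see it.  It resolves to the definition above.  */
extern double my_expf128_power9 (double);
```

Calls through either C name reach the same code, which is why the
generic implementation files can be reused without modification.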

> 
> Another issue is the rule exceptions (such as s_totalorderf128) that
> require additional care to check if they result in correct code.

Such is already tested via the existing test suite.

> 
> Another possible maintenance issue is to keep updating the exported
> symbol list at float128-ifunc.c, float128-ifunc.h, and
> float128_private.h for each new possible symbol in future versions.
> It again means correcting/changing arch-specific code for generic
> changes.

Note that float128-ifunc.c only defines compat symbols for the old
finite entry points. That set should never grow.

> 
> It also increases code size considerably with the potential to keep
> increasing with the addition of new libm functions.

Stripping debug info, the code size increase of libm is about 220kb
added to a 1210kb library.  Not trivial, but not overwhelming.

> 
> Finally the question is how useful would be this change on real
> world cases to justify this huge build and permutation complexity.

Code size is an interesting metric to measure.  The P9 variants
are substantially smaller where soft-fp is involved. expf128 is almost
1/3 the size.

> 
> What I would expect in real-world cases is, if the workload really
> uses float128 extensively, for it to be built with -mcpu=power9 and/or
> -mfloat128/-mfloat128-hardware. It should cover most of the required
> hotspots and glibc can focus on providing only cases where adding
> a specialized ifunc variant does make sense (as for the x86_64
> sysdeps/x86_64/fpu/multiarch/mp*) for instance.
> 
> Also, if an optimized float128 glibc build is paramount, a much
> simpler solution would be to just provide a -mcpu=power9 built one.

That kicks the can to the distros.  I think few ship such libraries. 
The whole value of multiarch is to expose these benefits without having 
to make the end user jump through such hurdles.  I don't think the x86 
comparison holds.  Adding a couple of helpful instructions is tame 
compared to going from soft to hard fp.