public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
* pthread_cond performance Discussion
@ 2020-03-16  7:30 liqingqing
  2020-03-18 12:12 ` Carlos O'Donell
  2020-05-23  4:04 ` liqingqing
  0 siblings, 2 replies; 32+ messages in thread
From: liqingqing @ 2020-03-16  7:30 UTC (permalink / raw)
  To: libc-alpha, triegel; +Cc: Hushiyuan, Liusirui, wangshuo47

The new condvar implementation provides stronger ordering guarantees.
To order waiters without expanding the size of struct pthread_cond_t, it uses a few spare bits to maintain a state machine with two waiter groups, G1 and G2.
This algorithm is very clever.
But when I tested MySQL performance, I found that the new condvar implementation hurts performance when a machine has many cores.
The scenario: on my ARM server, 200 terminals read and write the database in a 4-socket environment (256 cores in total),
and I get better performance with the old algorithm. Here is the performance I measured:

old algorithm   new algorithm
755449.3        668712.05


I suspect there is a lot of cache-line false sharing in my environment. Has anyone seen the same problem? And is there room to optimize the new algorithm?


The test steps are:
[root@client]# ./runBenchmark.sh props.mysql_4p_arm
[root@client]# cat props.mysql_4p_arm
db=mysql
driver=com.mysql.cj.jdbc.Driver
#conn=jdbc:mysql://222.222.222.11:3306/tpccpart
#conn=jdbc:mysql://222.222.222.132:3306/tpcc1000
#conn=jdbc:mysql://222.222.222.145:3306/tpcc
conn=jdbc:mysql://222.222.222.212:3306/tpcc
user=root
password=123456

warehouses=1000
loadWorkers=30

terminals=200
//To run specified transactions per terminal- runMins must equal zero
runTxnsPerTerminal=0
//To run for specified minutes- runTxnsPerTerminal must equal zero
runMins=5
//Number of total transactions per minute
limitTxnsPerMin=1000000000

//Set to true to run in 4.x compatible mode. Set to false to use the
//entire configured database evenly.
terminalWarehouseFixed=true

//The following five values must add up to 100
newOrderWeight=45
paymentWeight=43
orderStatusWeight=4
deliveryWeight=4
stockLevelWeight=4

// Directory name to create for collecting detailed result data.
// Comment this out to suppress.
//resultDirectory=my_result_%tY-%tm-%td_%tH%tM%tS
//osCollectorScript=./misc/os_collector_linux.py
//osCollectorInterval=1
//osCollectorSSHAddr=user@dbhost
//osCollectorDevices=net_eth0 blk_sda



Below is the definition of pthread_cond_t:

/* Common definition of pthread_cond_t.  Note: consumer and producer
   fields may end up in the same cache line.  */
struct __pthread_cond_s
{
  __extension__ union
  {
    __extension__ unsigned long long int __wseq;  /* LSB is the index of current G2.  */
    struct
    {
      unsigned int __low;   /* Sequence number of waiters, G2.  */
      unsigned int __high;  /* Sequence number of waiters, G1.  */
    } __wseq32;
  };
  __extension__ union
  {
    __extension__ unsigned long long int __g1_start;  /* LSB is the index of current G2.  */
    struct
    {
      unsigned int __low;
      unsigned int __high;
    } __g1_start32;
  };
  unsigned int __g_refs[2] __LOCK_ALIGNMENT;  /* LSB is true if waiters should run
                                                 futex_wake when they remove the
                                                 last reference.  */
  unsigned int __g_size[2];
  unsigned int __g1_orig_size;  /* Initial size of G1.  */
  unsigned int __wrefs;  /* Bit 2 is true if waiters should run futex_wake when
                            they remove the last reference.
                            pthread_cond_destroy uses this as the futex word.
                            Bit 1 is the clock ID (0 == CLOCK_REALTIME,
                            1 == CLOCK_MONOTONIC).
                            Bit 0 is true iff this is a process-shared condvar.  */
  unsigned int __g_signals[2];  /* LSB is true iff this group has been completely
                                   signaled (i.e., it is closed).  */
};
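
For scale, the entire structure fits within a single typical 64-byte cache line, so every waiter and signaler necessarily touches the same line. A trivial check (my own test program; on Linux/glibc it prints 48):

#include <pthread.h>
#include <stdio.h>

int
main (void)
{
  /* pthread_cond_t wraps struct __pthread_cond_s; on Linux/glibc its
     size is 48 bytes, i.e. less than one cache line.  */
  printf ("sizeof (pthread_cond_t) = %zu\n", sizeof (pthread_cond_t));
  return 0;
}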


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: pthread_cond performance Discussion
  2020-03-16  7:30 pthread_cond performance Discussion liqingqing
@ 2020-03-18 12:12 ` Carlos O'Donell
  2020-03-18 12:53   ` Torvald Riegel
  2020-05-23  4:04 ` liqingqing
  1 sibling, 1 reply; 32+ messages in thread
From: Carlos O'Donell @ 2020-03-18 12:12 UTC (permalink / raw)
  To: liqingqing, libc-alpha, triegel; +Cc: Hushiyuan, Liusirui, wangshuo47

On 3/16/20 3:30 AM, liqingqing wrote:
> The new condvar implementation provides stronger ordering
> guarantees. To order waiters without expanding the size of struct
> pthread_cond_t, it uses a few spare bits to maintain a state machine
> with two waiter groups, G1 and G2. This algorithm is very clever.
> But when I tested MySQL performance, I found that the new condvar
> implementation hurts performance when a machine has many cores. The
> scenario: on my ARM server, 200 terminals read and write the
> database in a 4-socket environment (256 cores in total), and I get
> better performance with the old algorithm.

Are you able to look at any hardware performance counters to see if
there are increased cache line miss rates?
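
For example, something along these lines (illustrative commands only; the mysqld process name is assumed, exact event names depend on the PMU, and perf c2c may not be supported on every ARM kernel):

  perf stat -e cache-references,cache-misses -p $(pidof mysqld) -- sleep 10
  perf c2c record -p $(pidof mysqld) -- sleep 10
  perf c2c report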

> I suspect there is a lot of cache-line false sharing in my
> environment. Has anyone seen the same problem? And is there room to
> optimize the new algorithm?

I have not seen anyone report a performance problem on large machines.

Unfortunately from an ABI perspective we cannot increase the size of
the structure, nor change the required alignment.

We may be able to play with the order of the layout of elements
within the condvar. That's something you could experiment with and
report back to the list with your findings.

For example:
- Change the layout by moving elements around to attempt to
  avoid cache-line sharing (see the sketch after this list).
- Recompile glibc.
- Install into your system.
  - PTHREAD_COND_INITIALIZER should be all-zero bytes so you should
    not need to recompile applications.
- Retest performance.
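
A purely hypothetical sketch of what such a reordering could look like (same members and total size as the real struct, with the unions elided for brevity; which fields end up grouped together is the experimental variable, not a recommendation):

struct __pthread_cond_s
{
  /* Fields updated mostly by waiters.  */
  __extension__ unsigned long long int __wseq;
  unsigned int __g_refs[2] __LOCK_ALIGNMENT;
  unsigned int __wrefs;
  /* Fields updated mostly by signalers.  */
  __extension__ unsigned long long int __g1_start;
  unsigned int __g_signals[2];
  unsigned int __g_size[2];
  unsigned int __g1_orig_size;
};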

-- 
Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: pthread_cond performance Discussion
  2020-03-18 12:12 ` Carlos O'Donell
@ 2020-03-18 12:53   ` Torvald Riegel
  2020-03-18 14:42     ` Carlos O'Donell
  0 siblings, 1 reply; 32+ messages in thread
From: Torvald Riegel @ 2020-03-18 12:53 UTC (permalink / raw)
  To: Carlos O'Donell, liqingqing, libc-alpha
  Cc: Hushiyuan, Liusirui, wangshuo47

On Wed, 2020-03-18 at 08:12 -0400, Carlos O'Donell wrote:
> On 3/16/20 3:30 AM, liqingqing wrote:
> > The new condvar implementation provides stronger ordering
> > guarantees. To order waiters without expanding the size of struct
> > pthread_cond_t, it uses a few spare bits to maintain a state
> > machine with two waiter groups, G1 and G2. This algorithm is very
> > clever. But when I tested MySQL performance, I found that the new
> > condvar implementation hurts performance when a machine has many
> > cores. The scenario: on my ARM server, 200 terminals read and
> > write the database in a 4-socket environment (256 cores in total),
> > and I get better performance with the old algorithm.
> 
> Are you able to look at any hardware performance counters to see if
> there are increased cache line miss rates?
> 
> > I suspect there is a lot of cache-line false sharing in my
> > environment. Has anyone seen the same problem? And is there room
> > to optimize the new algorithm?
> 
> I have not seen anyone report a performance problem on large
> machines.
> 
> Unfortunately from an ABI perspective we cannot increase the size of
> the structure, nor change the required alignment.
> 
> We may be able to play with the order of the layout of elements
> within the condvar. That's something you could experiment with and
> report back to the list with your findings.
> 
> For example:
> - Change the layout by moving elements around to attempt to
>   avoid cache-line sharing.
> - Recompile glibc.
> - Install into your system.
>   - PTHREAD_COND_INITIALIZER should be all-zero bytes so you should
>     not need to recompile applications.
> - Retest performance.

The other thing I would recommend is to investigate whether you can
improve the synchronization in MySQL.  Condition variables are just one
way to do synchronization, and the root synchronization problem to
solve is in the application.  This could potentially give you a much
bigger performance improvement than any optimization of the condition
variable implementation could. 
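
For instance (a rough, hypothetical sketch; all names are invented, and shard setup with pthread_mutex_init/pthread_cond_init is omitted): if many threads currently block on a single global condvar, sharding the wait queue spreads both the wakeups and the cache-line traffic across independent mutex/condvar pairs.

#include <pthread.h>

#define NSHARDS 16

/* Hypothetical sharded wait queue: each shard has its own lock and
   condvar, so a wakeup in one shard never touches another shard's
   pthread_cond_t.  Initialize every shard before use.  */
struct shard
{
  pthread_mutex_t lock;
  pthread_cond_t cond;
  int work_available;
};

static struct shard shards[NSHARDS];

static void
wait_for_work (unsigned int worker_id)
{
  struct shard *s = &shards[worker_id % NSHARDS];
  pthread_mutex_lock (&s->lock);
  while (!s->work_available)
    pthread_cond_wait (&s->cond, &s->lock);
  s->work_available = 0;
  pthread_mutex_unlock (&s->lock);
}

static void
post_work (unsigned int shard_id)
{
  struct shard *s = &shards[shard_id % NSHARDS];
  pthread_mutex_lock (&s->lock);
  s->work_available = 1;
  pthread_cond_signal (&s->cond);
  pthread_mutex_unlock (&s->lock);
}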


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: pthread_cond performance Discussion
  2020-03-18 12:53   ` Torvald Riegel
@ 2020-03-18 14:42     ` Carlos O'Donell
  0 siblings, 0 replies; 32+ messages in thread
From: Carlos O'Donell @ 2020-03-18 14:42 UTC (permalink / raw)
  To: Torvald Riegel, liqingqing, libc-alpha; +Cc: Hushiyuan, Liusirui, wangshuo47

On 3/18/20 8:53 AM, Torvald Riegel wrote:
> On Wed, 2020-03-18 at 08:12 -0400, Carlos O'Donell wrote:
>> On 3/16/20 3:30 AM, liqingqing wrote:
>>> The new condvar implementation provides stronger ordering
>>> guarantees. To order waiters without expanding the size of struct
>>> pthread_cond_t, it uses a few spare bits to maintain a state
>>> machine with two waiter groups, G1 and G2. This algorithm is very
>>> clever. But when I tested MySQL performance, I found that the new
>>> condvar implementation hurts performance when a machine has many
>>> cores. The scenario: on my ARM server, 200 terminals read and
>>> write the database in a 4-socket environment (256 cores in total),
>>> and I get better performance with the old algorithm.
>>
>> Are you able to look at any hardware performance counters to see if
>> there are increased cache line miss rates?
>>
>>> I suspect there is a lot of cache-line false sharing in my
>>> environment. Has anyone seen the same problem? And is there room
>>> to optimize the new algorithm?
>>
>> I have not seen anyone report a performance problem on large
>> machines.
>>
>> Unfortunately from an ABI perspective we cannot increase the size of
>> the structure, nor change the required alignment.
>>
>> We may be able to play with the order of the layout of elements
>> within the condvar. That's something you could experiment with and
>> report back to the list with your findings.
>>
>> For example:
>> - Change the layout by moving elements around to attempt to
>>   avoid cache-line sharing.
>> - Recompile glibc.
>> - Install into your system.
>>   - PTHREAD_COND_INITIALIZER should be all-zero bytes so you should
>>     not need to recompile applications.
>> - Retest performance.
> 
> The other thing I would recommend is to investigate whether you can
> improve the synchronization in MySQL.  Condition variables are just one
> way to do synchronization, and the root synchronization problem to
> solve is in the application.  This could potentially give you a much
> bigger performance improvement than any optimization of the condition
> variable implementation could. 
 
Agreed. The ABI limitations on the current interface are the biggest
problem. Fixing the ABI issues would require structure allocation and
freeing in the background, which has its own issues and performance
impact (it requires fallback PSHARED handling). Therefore, if you could
parallelize differently, it would certainly help. I expect that this is
harder than you think, though, since it requires refactoring the
locking in MySQL. Meanwhile, making the current condvar layout more
cache-line friendly (even probabilistically so) would improve all
applications that use it.

-- 
Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: pthread_cond performance Discussion
  2020-03-16  7:30 pthread_cond performance Discussion liqingqing
  2020-03-18 12:12 ` Carlos O'Donell
@ 2020-05-23  4:04 ` liqingqing
  2020-05-23  4:10   ` [PATCH]x86: update REP_STOSB_THRESHOLD's default value from 2k to 1M liqingqing
  1 sibling, 1 reply; 32+ messages in thread
From: liqingqing @ 2020-05-23  4:04 UTC (permalink / raw)
  To: libc-alpha, hjl.tools; +Cc: Hushiyuan

Commit 830566307f038387ca0af3fd327706a8d1a2f595 optimized the implementation of memset
and set the macro REP_STOSB_THRESHOLD's default value to 2KB. When the input length is less than 2KB the data flow is the same as before, and when it is larger than 2KB,
this function uses REP STOSB instead of MOVQ.

But when I tested this function on an x86_64 platform,
I found that the default value is not appropriate for some input lengths. Here are the environment and results:

test suite: libMicro-0.4.0
	./memset -E -C 200 -L -S -W -N "memset_4k"    -s 4k    -I 250
	./memset -E -C 200 -L -S -W -N "memset_4k_uc" -s 4k    -u -I 400
	./memset -E -C 200 -L -S -W -N "memset_1m"    -s 1m   -I 200000
	./memset -E -C 200 -L -S -W -N "memset_10m"   -s 10m -I 2000000

hardware platform:
	Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
	L1d cache: 32KB
	L1i cache: 32KB
	L2 cache: 1MB
	L3 cache: 60MB

The result is that when the input length is between the processor's L1 data cache and L2 cache sizes, REP_STOSB_THRESHOLD=2KB reduces performance.

              before this commit   after this commit
                          cycles              cycles
memset_4k                    249                  96
memset_10k                   657                 185
memset_36k                  2773                3767
memset_100k                 7594               10002
memset_500k                37678               52149
memset_1m                  86780              108044
memset_10m               1307238             1148994

              before this commit       after this commit
          MLC cache misses (10 sec)   MLC cache misses (10 sec)
memset_4k          1,09,33,823             1,01,79,270
memset_10k         1,23,78,958             1,05,41,087
memset_36k         3,61,64,244             4,07,22,429
memset_100k        8,25,33,052             9,31,81,253
memset_500k       37,32,55,449            43,56,70,395
memset_1m         75,16,28,239            88,29,90,237
memset_10m      9,36,61,67,397          8,96,69,49,522


Although REP_STOSB_THRESHOLD can be modified at build time with -DREP_STOSB_THRESHOLD=xxx,
I think the default value is not a good one, because most processors' L2 caches are larger than 2KB, so I am submitting the patch below:



From 44314a556239a7524b5a6451025737c1bdbb1cd0 Mon Sep 17 00:00:00 2001
From: liqingqing <liqingqing3@huawei.com>
Date: Thu, 21 May 2020 11:23:06 +0800
Subject: [PATCH] update REP_STOSB_THRESHOLD's default value from 2k to 1M

The current REP_STOSB_THRESHOLD value reduces memset performance when the input
length is between the processor's L1 data cache and L2 cache sizes, so update
the default value to eliminate the regression.

---
 sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index dcd63c92..92c08eed 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -65,7 +65,7 @@
    Enhanced REP STOSB.  Since the stored value is fixed, larger register
    size has minimal impact on threshold.  */
 #ifndef REP_STOSB_THRESHOLD
-# define REP_STOSB_THRESHOLD           2048
+# define REP_STOSB_THRESHOLD           1048576
 #endif

 #ifndef SECTION
-- 
2.19.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH]x86: update REP_STOSB_THRESHOLD's default value from 2k to 1M
  2020-05-23  4:04 ` liqingqing
@ 2020-05-23  4:10   ` liqingqing
  2020-05-23  4:37     ` [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables H.J. Lu
  2020-12-21  4:38     ` [PATCH]x86: update REP_STOSB_THRESHOLD's default value from 2k to 1M Siddhesh Poyarekar
  0 siblings, 2 replies; 32+ messages in thread
From: liqingqing @ 2020-05-23  4:10 UTC (permalink / raw)
  To: libc-alpha, hjl.tools, Hushiyuan

Commit 830566307f038387ca0af3fd327706a8d1a2f595 optimized the implementation of memset
and set the macro REP_STOSB_THRESHOLD's default value to 2KB. When the input length is less than 2KB the data flow is the same as before, and when it is larger than 2KB,
this function uses REP STOSB instead of MOVQ.

But when I tested this function on an x86_64 platform,
I found that the default value is not appropriate for some input lengths. Here are the environment and results:

test suite: libMicro-0.4.0
	./memset -E -C 200 -L -S -W -N "memset_4k"    -s 4k    -I 250
	./memset -E -C 200 -L -S -W -N "memset_4k_uc" -s 4k    -u -I 400
	./memset -E -C 200 -L -S -W -N "memset_1m"    -s 1m   -I 200000
	./memset -E -C 200 -L -S -W -N "memset_10m"   -s 10m -I 2000000

hardware platform:
	Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
	L1d cache: 32KB
	L1i cache: 32KB
	L2 cache: 1MB
	L3 cache: 60MB

The result is that when the input length is between the processor's L1 data cache and L2 cache sizes, REP_STOSB_THRESHOLD=2KB reduces performance.

              before this commit   after this commit
                          cycles              cycles
memset_4k                    249                  96
memset_10k                   657                 185
memset_36k                  2773                3767
memset_100k                 7594               10002
memset_500k                37678               52149
memset_1m                  86780              108044
memset_10m               1307238             1148994

              before this commit       after this commit
          MLC cache misses (10 sec)   MLC cache misses (10 sec)
memset_4k          1,09,33,823             1,01,79,270
memset_10k         1,23,78,958             1,05,41,087
memset_36k         3,61,64,244             4,07,22,429
memset_100k        8,25,33,052             9,31,81,253
memset_500k       37,32,55,449            43,56,70,395
memset_1m         75,16,28,239            88,29,90,237
memset_10m      9,36,61,67,397          8,96,69,49,522


Although REP_STOSB_THRESHOLD can be modified at build time with -DREP_STOSB_THRESHOLD=xxx,
I think the default value is not a good one, because most processors' L2 caches are larger than 2KB, so I am submitting the patch below:



From 44314a556239a7524b5a6451025737c1bdbb1cd0 Mon Sep 17 00:00:00 2001
From: liqingqing <liqingqing3@huawei.com>
Date: Thu, 21 May 2020 11:23:06 +0800
Subject: [PATCH] update REP_STOSB_THRESHOLD's default value from 2k to 1M

The current REP_STOSB_THRESHOLD value reduces memset performance when the input
length is between the processor's L1 data cache and L2 cache sizes, so update
the default value to eliminate the regression.

---
 sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index dcd63c92..92c08eed 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -65,7 +65,7 @@
    Enhanced REP STOSB.  Since the stored value is fixed, larger register
    size has minimal impact on threshold.  */
 #ifndef REP_STOSB_THRESHOLD
-# define REP_STOSB_THRESHOLD           2048
+# define REP_STOSB_THRESHOLD           1048576
 #endif

 #ifndef SECTION
-- 
2.19.1



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
  2020-05-23  4:10   ` [PATCH]x86: update REP_STOSB_THRESHOLD's default value from 2k to 1M liqingqing
@ 2020-05-23  4:37     ` H.J. Lu
  2020-05-28 11:56       ` H.J. Lu
  2020-05-29 13:13       ` Carlos O'Donell
  2020-12-21  4:38     ` [PATCH]x86: update REP_STOSB_THRESHOLD's default value from 2k to 1M Siddhesh Poyarekar
  1 sibling, 2 replies; 32+ messages in thread
From: H.J. Lu @ 2020-05-23  4:37 UTC (permalink / raw)
  To: liqingqing; +Cc: libc-alpha, Hushiyuan

[-- Attachment #1: Type: text/plain, Size: 3008 bytes --]

On Fri, May 22, 2020 at 9:10 PM liqingqing <liqingqing3@huawei.com> wrote:
>
> Commit 830566307f038387ca0af3fd327706a8d1a2f595 optimized the implementation of memset
> and set the macro REP_STOSB_THRESHOLD's default value to 2KB. When the input length is less than 2KB the data flow is the same as before, and when it is larger than 2KB,
> this function uses REP STOSB instead of MOVQ.
>
> But when I tested this function on an x86_64 platform,
> I found that the default value is not appropriate for some input lengths. Here are the environment and results:
>
> test suite: libMicro-0.4.0
>         ./memset -E -C 200 -L -S -W -N "memset_4k"    -s 4k    -I 250
>         ./memset -E -C 200 -L -S -W -N "memset_4k_uc" -s 4k    -u -I 400
>         ./memset -E -C 200 -L -S -W -N "memset_1m"    -s 1m   -I 200000
>         ./memset -E -C 200 -L -S -W -N "memset_10m"   -s 10m -I 2000000
>
> hardware platform:
>         Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
>         L1d cache: 32KB
>         L1i cache: 32KB
>         L2 cache: 1MB
>         L3 cache: 60MB
>
> The result is that when the input length is between the processor's L1 data cache and L2 cache sizes, REP_STOSB_THRESHOLD=2KB reduces performance.
>
>               before this commit   after this commit
>                           cycles              cycles
> memset_4k                    249                  96
> memset_10k                   657                 185
> memset_36k                  2773                3767
> memset_100k                 7594               10002
> memset_500k                37678               52149
> memset_1m                  86780              108044
> memset_10m               1307238             1148994
>
>               before this commit       after this commit
>           MLC cache misses (10 sec)   MLC cache misses (10 sec)
> memset_4k          1,09,33,823             1,01,79,270
> memset_10k         1,23,78,958             1,05,41,087
> memset_36k         3,61,64,244             4,07,22,429
> memset_100k        8,25,33,052             9,31,81,253
> memset_500k       37,32,55,449            43,56,70,395
> memset_1m         75,16,28,239            88,29,90,237
> memset_10m      9,36,61,67,397          8,96,69,49,522
>
>
> Although REP_STOSB_THRESHOLD can be modified at build time with -DREP_STOSB_THRESHOLD=xxx,
> I think the default value is not a good one, because most processors' L2 caches are larger than 2KB, so I am submitting the patch below:
>
>
>
> From 44314a556239a7524b5a6451025737c1bdbb1cd0 Mon Sep 17 00:00:00 2001
> From: liqingqing <liqingqing3@huawei.com>
> Date: Thu, 21 May 2020 11:23:06 +0800
> Subject: [PATCH] update REP_STOSB_THRESHOLD's default value from 2k to 1M
>
> The current REP_STOSB_THRESHOLD value reduces memset performance when the input
> length is between the processor's L1 data cache and L2 cache sizes, so update
> the default value to eliminate the regression.
>

There is no single threshold value which is good for all workloads.
I don't think we should change REP_STOSB_THRESHOLD to 1MB.
On the other hand, the fixed threshold isn't flexible.  Please try this
patch to see if you can set the threshold for your specific workload.
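
With the patch applied, both thresholds become settable per process through the GLIBC_TUNABLES environment variable, for example (the threshold values here are illustrative only, reusing the libMicro invocation from your report):

  GLIBC_TUNABLES=glibc.cpu.x86_rep_stosb_threshold=1048576 ./memset -E -C 200 -L -S -W -N "memset_1m" -s 1m -I 200000
  GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=8192:glibc.cpu.x86_rep_stosb_threshold=1048576 ./your-benchmark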

-- 
H.J.

[-- Attachment #2: 0001-x86-Add-thresholds-for-rep-movsb-stosb-to-tunables.patch --]
[-- Type: text/x-patch, Size: 8962 bytes --]

From 7d2e0c0b843d509716d92960b9b139b32eacea54 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Sat, 9 May 2020 11:13:57 -0700
Subject: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables

Add x86_rep_movsb_threshold and x86_rep_stosb_threshold to tunables
to update thresholds for "rep movsb" and "rep stosb" at run-time.

Note that the user specified threshold for "rep movsb" smaller than
the minimum threshold will be ignored.
---
 manual/tunables.texi                          | 16 +++++++
 sysdeps/x86/cacheinfo.c                       | 46 +++++++++++++++++++
 sysdeps/x86/cpu-features.c                    |  4 ++
 sysdeps/x86/cpu-features.h                    |  4 ++
 sysdeps/x86/dl-tunables.list                  |  6 +++
 .../multiarch/memmove-vec-unaligned-erms.S    | 16 +------
 .../multiarch/memset-vec-unaligned-erms.S     | 12 +----
 7 files changed, 78 insertions(+), 26 deletions(-)

diff --git a/manual/tunables.texi b/manual/tunables.texi
index ec18b10834..8054f79be0 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -396,6 +396,22 @@ to set threshold in bytes for non temporal store.
 This tunable is specific to i386 and x86-64.
 @end deftp
 
+@deftp Tunable glibc.cpu.x86_rep_movsb_threshold
+The @code{glibc.cpu.x86_rep_movsb_threshold} tunable allows the user
+to set threshold in bytes to start using "rep movsb".  Note that the
+user specified threshold smaller than the minimum threshold will be
+ignored.
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
+@deftp Tunable glibc.cpu.x86_rep_stosb_threshold
+The @code{glibc.cpu.x86_rep_stosb_threshold} tunable allows the user
+to set threshold in bytes to start using "rep stosb".
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
 @deftp Tunable glibc.cpu.x86_ibt
 The @code{glibc.cpu.x86_ibt} tunable allows the user to control how
 indirect branch tracking (IBT) should be enabled.  Accepted values are
diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index 311502dee3..4322328a1b 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -530,6 +530,23 @@ long int __x86_raw_shared_cache_size attribute_hidden = 1024 * 1024;
 /* Threshold to use non temporal store.  */
 long int __x86_shared_non_temporal_threshold attribute_hidden;
 
+/* Threshold to use Enhanced REP MOVSB.  Since there is overhead to set
+   up REP MOVSB operation, REP MOVSB isn't faster on short data.  The
+   memcpy micro benchmark in glibc shows that 2KB is the approximate
+   value above which REP MOVSB becomes faster than SSE2 optimization
+   on processors with Enhanced REP MOVSB.  Since larger register size
+   can move more data with a single load and store, the threshold is
+   higher with larger register size.  */
+long int __x86_rep_movsb_threshold attribute_hidden = 2048;
+
+/* Threshold to use Enhanced REP STOSB.  Since there is overhead to set
+   up REP STOSB operation, REP STOSB isn't faster on short data.  The
+   memset micro benchmark in glibc shows that 2KB is the approximate
+   value above which REP STOSB becomes faster on processors with
+   Enhanced REP STOSB.  Since the stored value is fixed, larger register
+   size has minimal impact on threshold.  */
+long int __x86_rep_stosb_threshold attribute_hidden = 2048;
+
 #ifndef DISABLE_PREFETCHW
 /* PREFETCHW support flag for use in memory and string routines.  */
 int __x86_prefetchw attribute_hidden;
@@ -872,6 +889,35 @@ init_cacheinfo (void)
     = (cpu_features->non_temporal_threshold != 0
        ? cpu_features->non_temporal_threshold
        : __x86_shared_cache_size * threads * 3 / 4);
+
+  /* NB: The REP MOVSB threshold must be greater than VEC_SIZE * 8.  */
+  unsigned int minimum_rep_movsb_threshold;
+  /* NB: The default REP MOVSB threshold is 2048 * (VEC_SIZE / 16).  */
+  unsigned int rep_movsb_threshold;
+  if (CPU_FEATURES_ARCH_P (cpu_features, AVX512F_Usable)
+      && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512))
+    {
+      rep_movsb_threshold = 2048 * (64 / 16);
+      minimum_rep_movsb_threshold = 64 * 8;
+    }
+  else if (CPU_FEATURES_ARCH_P (cpu_features,
+				AVX_Fast_Unaligned_Load))
+    {
+      rep_movsb_threshold = 2048 * (32 / 16);
+      minimum_rep_movsb_threshold = 32 * 8;
+    }
+  else
+    {
+      rep_movsb_threshold = 2048 * (16 / 16);
+      minimum_rep_movsb_threshold = 16 * 8;
+    }
+  if (cpu_features->rep_movsb_threshold > minimum_rep_movsb_threshold)
+    __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
+  else
+    __x86_rep_movsb_threshold = rep_movsb_threshold;
+
+  if (cpu_features->rep_stosb_threshold)
+    __x86_rep_stosb_threshold = cpu_features->rep_stosb_threshold;
 }
 
 #endif
diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index 916bbf5242..14f847320f 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -564,6 +564,10 @@ no_cpuid:
   TUNABLE_GET (hwcaps, tunable_val_t *, TUNABLE_CALLBACK (set_hwcaps));
   cpu_features->non_temporal_threshold
     = TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
+  cpu_features->rep_movsb_threshold
+    = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
+  cpu_features->rep_stosb_threshold
+    = TUNABLE_GET (x86_rep_stosb_threshold, long int, NULL);
   cpu_features->data_cache_size
     = TUNABLE_GET (x86_data_cache_size, long int, NULL);
   cpu_features->shared_cache_size
diff --git a/sysdeps/x86/cpu-features.h b/sysdeps/x86/cpu-features.h
index f05d5ce158..7410324e83 100644
--- a/sysdeps/x86/cpu-features.h
+++ b/sysdeps/x86/cpu-features.h
@@ -91,6 +91,10 @@ struct cpu_features
   unsigned long int shared_cache_size;
   /* Threshold to use non temporal store.  */
   unsigned long int non_temporal_threshold;
+  /* Threshold to use "rep movsb".  */
+  unsigned long int rep_movsb_threshold;
+  /* Threshold to use "rep stosb".  */
+  unsigned long int rep_stosb_threshold;
 };
 
 /* Used from outside of glibc to get access to the CPU features
diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
index 251b926ce4..43bf6c2389 100644
--- a/sysdeps/x86/dl-tunables.list
+++ b/sysdeps/x86/dl-tunables.list
@@ -30,6 +30,12 @@ glibc {
     x86_non_temporal_threshold {
       type: SIZE_T
     }
+    x86_rep_movsb_threshold {
+      type: SIZE_T
+    }
+    x86_rep_stosb_threshold {
+      type: SIZE_T
+    }
     x86_data_cache_size {
       type: SIZE_T
     }
diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index 74953245aa..bd5dc1a3f3 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -56,17 +56,6 @@
 # endif
 #endif
 
-/* Threshold to use Enhanced REP MOVSB.  Since there is overhead to set
-   up REP MOVSB operation, REP MOVSB isn't faster on short data.  The
-   memcpy micro benchmark in glibc shows that 2KB is the approximate
-   value above which REP MOVSB becomes faster than SSE2 optimization
-   on processors with Enhanced REP MOVSB.  Since larger register size
-   can move more data with a single load and store, the threshold is
-   higher with larger register size.  */
-#ifndef REP_MOVSB_THRESHOLD
-# define REP_MOVSB_THRESHOLD	(2048 * (VEC_SIZE / 16))
-#endif
-
 #ifndef PREFETCH
 # define PREFETCH(addr) prefetcht0 addr
 #endif
@@ -253,9 +242,6 @@ L(movsb):
 	leaq	(%rsi,%rdx), %r9
 	cmpq	%r9, %rdi
 	/* Avoid slow backward REP MOVSB.  */
-# if REP_MOVSB_THRESHOLD <= (VEC_SIZE * 8)
-#  error Unsupported REP_MOVSB_THRESHOLD and VEC_SIZE!
-# endif
 	jb	L(more_8x_vec_backward)
 1:
 	mov	%RDX_LP, %RCX_LP
@@ -331,7 +317,7 @@ L(between_2_3):
 
 #if defined USE_MULTIARCH && IS_IN (libc)
 L(movsb_more_2x_vec):
-	cmpq	$REP_MOVSB_THRESHOLD, %rdx
+	cmp	__x86_rep_movsb_threshold(%rip), %RDX_LP
 	ja	L(movsb)
 #endif
 L(more_2x_vec):
diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index af2299709c..2bfc95de05 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -58,16 +58,6 @@
 # endif
 #endif
 
-/* Threshold to use Enhanced REP STOSB.  Since there is overhead to set
-   up REP STOSB operation, REP STOSB isn't faster on short data.  The
-   memset micro benchmark in glibc shows that 2KB is the approximate
-   value above which REP STOSB becomes faster on processors with
-   Enhanced REP STOSB.  Since the stored value is fixed, larger register
-   size has minimal impact on threshold.  */
-#ifndef REP_STOSB_THRESHOLD
-# define REP_STOSB_THRESHOLD		2048
-#endif
-
 #ifndef SECTION
 # error SECTION is not defined!
 #endif
@@ -181,7 +171,7 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))
 	ret
 
 L(stosb_more_2x_vec):
-	cmpq	$REP_STOSB_THRESHOLD, %rdx
+	cmp	__x86_rep_stosb_threshold(%rip), %RDX_LP
 	ja	L(stosb)
 #endif
 L(more_2x_vec):
-- 
2.26.2


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
  2020-05-23  4:37     ` [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables H.J. Lu
@ 2020-05-28 11:56       ` H.J. Lu
  2020-05-28 13:47         ` liqingqing
  2020-05-29 13:13       ` Carlos O'Donell
  1 sibling, 1 reply; 32+ messages in thread
From: H.J. Lu @ 2020-05-28 11:56 UTC (permalink / raw)
  To: liqingqing; +Cc: libc-alpha, Hushiyuan

On Fri, May 22, 2020 at 9:37 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Fri, May 22, 2020 at 9:10 PM liqingqing <liqingqing3@huawei.com> wrote:
> >
> > Commit 830566307f038387ca0af3fd327706a8d1a2f595 optimized the implementation of memset
> > and set the macro REP_STOSB_THRESHOLD's default value to 2KB. When the input length is less than 2KB the data flow is the same as before, and when it is larger than 2KB,
> > this function uses REP STOSB instead of MOVQ.
> >
> > But when I tested this function on an x86_64 platform,
> > I found that the default value is not appropriate for some input lengths. Here are the environment and results:
> >
> > test suite: libMicro-0.4.0
> >         ./memset -E -C 200 -L -S -W -N "memset_4k"    -s 4k    -I 250
> >         ./memset -E -C 200 -L -S -W -N "memset_4k_uc" -s 4k    -u -I 400
> >         ./memset -E -C 200 -L -S -W -N "memset_1m"    -s 1m   -I 200000
> >         ./memset -E -C 200 -L -S -W -N "memset_10m"   -s 10m -I 2000000
> >
> > hardware platform:
> >         Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
> >         L1d cache: 32KB
> >         L1i cache: 32KB
> >         L2 cache: 1MB
> >         L3 cache: 60MB
> >
> > The result is that when the input length is between the processor's L1 data cache and L2 cache sizes, REP_STOSB_THRESHOLD=2KB reduces performance.
> >
> >               before this commit   after this commit
> >                           cycles              cycles
> > memset_4k                    249                  96
> > memset_10k                   657                 185
> > memset_36k                  2773                3767
> > memset_100k                 7594               10002
> > memset_500k                37678               52149
> > memset_1m                  86780              108044
> > memset_10m               1307238             1148994
> >
> >               before this commit       after this commit
> >           MLC cache misses (10 sec)   MLC cache misses (10 sec)
> > memset_4k          1,09,33,823             1,01,79,270
> > memset_10k         1,23,78,958             1,05,41,087
> > memset_36k         3,61,64,244             4,07,22,429
> > memset_100k        8,25,33,052             9,31,81,253
> > memset_500k       37,32,55,449            43,56,70,395
> > memset_1m         75,16,28,239            88,29,90,237
> > memset_10m      9,36,61,67,397          8,96,69,49,522
> >
> >
> > Although REP_STOSB_THRESHOLD can be modified at build time with -DREP_STOSB_THRESHOLD=xxx,
> > I think the default value is not a good one, because most processors' L2 caches are larger than 2KB, so I am submitting the patch below:
> >
> >
> >
> > From 44314a556239a7524b5a6451025737c1bdbb1cd0 Mon Sep 17 00:00:00 2001
> > From: liqingqing <liqingqing3@huawei.com>
> > Date: Thu, 21 May 2020 11:23:06 +0800
> > Subject: [PATCH] update REP_STOSB_THRESHOLD's default value from 2k to 1M
> >
> > The current REP_STOSB_THRESHOLD value reduces memset performance when the input
> > length is between the processor's L1 data cache and L2 cache sizes, so update
> > the default value to eliminate the regression.
> >
>
> There is no single threshold value which is good for all workloads.
> I don't think we should change REP_STOSB_THRESHOLD to 1MB.
> On the other hand, the fixed threshold isn't flexible.  Please try this
> patch to see if you can set the threshold for your specific workload.
>

Any comments, objections?

https://sourceware.org/pipermail/libc-alpha/2020-May/114281.html

-- 
H.J.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
  2020-05-28 11:56       ` H.J. Lu
@ 2020-05-28 13:47         ` liqingqing
  0 siblings, 0 replies; 32+ messages in thread
From: liqingqing @ 2020-05-28 13:47 UTC (permalink / raw)
  To: H.J. Lu; +Cc: libc-alpha, Hushiyuan

Hi Lu, thank you for your comment.
The REP_STOSB_THRESHOLD value I proposed suits the hardware platform that I used.
Since I do not have any other x86 environments, I cannot be sure that the change is good for all of them, so you are right.


On 2020/5/28 19:56, H.J. Lu wrote:
> On Fri, May 22, 2020 at 9:37 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>> On Fri, May 22, 2020 at 9:10 PM liqingqing <liqingqing3@huawei.com> wrote:
>>>
>>> Commit 830566307f038387ca0af3fd327706a8d1a2f595 optimized the implementation of memset
>>> and set the macro REP_STOSB_THRESHOLD's default value to 2KB. When the input length is less than 2KB the data flow is the same as before, and when it is larger than 2KB,
>>> this function uses REP STOSB instead of MOVQ.
>>>
>>> But when I tested this function on an x86_64 platform,
>>> I found that the default value is not appropriate for some input lengths. Here are the environment and results:
>>>
>>> test suite: libMicro-0.4.0
>>>         ./memset -E -C 200 -L -S -W -N "memset_4k"    -s 4k    -I 250
>>>         ./memset -E -C 200 -L -S -W -N "memset_4k_uc" -s 4k    -u -I 400
>>>         ./memset -E -C 200 -L -S -W -N "memset_1m"    -s 1m   -I 200000
>>>         ./memset -E -C 200 -L -S -W -N "memset_10m"   -s 10m -I 2000000
>>>
>>> hardware platform:
>>>         Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
>>>         L1d cache: 32KB
>>>         L1i cache: 32KB
>>>         L2 cache: 1MB
>>>         L3 cache: 60MB
>>>
>>> The result is that when the input length is between the processor's L1 data cache and L2 cache sizes, REP_STOSB_THRESHOLD=2KB reduces performance.
>>>
>>>               before this commit   after this commit
>>>                           cycles              cycles
>>> memset_4k                    249                  96
>>> memset_10k                   657                 185
>>> memset_36k                  2773                3767
>>> memset_100k                 7594               10002
>>> memset_500k                37678               52149
>>> memset_1m                  86780              108044
>>> memset_10m               1307238             1148994
>>>
>>>               before this commit       after this commit
>>>           MLC cache misses (10 sec)   MLC cache misses (10 sec)
>>> memset_4k          1,09,33,823             1,01,79,270
>>> memset_10k         1,23,78,958             1,05,41,087
>>> memset_36k         3,61,64,244             4,07,22,429
>>> memset_100k        8,25,33,052             9,31,81,253
>>> memset_500k       37,32,55,449            43,56,70,395
>>> memset_1m         75,16,28,239            88,29,90,237
>>> memset_10m      9,36,61,67,397          8,96,69,49,522
>>>
>>>
>>> Although REP_STOSB_THRESHOLD can be modified at build time with -DREP_STOSB_THRESHOLD=xxx,
>>> I think the default value is not a good one, because most processors' L2 caches are larger than 2KB, so I am submitting the patch below:
>>>
>>>
>>>
>>> From 44314a556239a7524b5a6451025737c1bdbb1cd0 Mon Sep 17 00:00:00 2001
>>> From: liqingqing <liqingqing3@huawei.com>
>>> Date: Thu, 21 May 2020 11:23:06 +0800
>>> Subject: [PATCH] update REP_STOSB_THRESHOLD's default value from 2k to 1M
>>>
>>> The current REP_STOSB_THRESHOLD value reduces memset performance when the input
>>> length is between the processor's L1 data cache and L2 cache sizes, so update
>>> the default value to eliminate the regression.
>>>
>>
>> There is no single threshold value which is good for all workloads.
>> I don't think we should change REP_STOSB_THRESHOLD to 1MB.
>> On the other hand, the fixed threshold isn't flexible.  Please try this
>> patch to see if you can set the threshold for your specific workload.
>>
> 
> Any comments, objections?
> 
> https://sourceware.org/pipermail/libc-alpha/2020-May/114281.html
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
  2020-05-23  4:37     ` [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables H.J. Lu
  2020-05-28 11:56       ` H.J. Lu
@ 2020-05-29 13:13       ` Carlos O'Donell
  2020-05-29 13:21         ` H.J. Lu
  1 sibling, 1 reply; 32+ messages in thread
From: Carlos O'Donell @ 2020-05-29 13:13 UTC (permalink / raw)
  To: H.J. Lu, liqingqing; +Cc: Hushiyuan, libc-alpha

On 5/23/20 12:37 AM, H.J. Lu via Libc-alpha wrote:
> There is no single threshold value which is good for all workloads.
> I don't think we should change REP_STOSB_THRESHOLD to 1MB.
> On the other hand, the fixed threshold isn't flexible.  Please try this
> patch to see if you can set the threshold for your specific workload.

My request here is that the manual include documentation of what the
minimums are for the tunable. Even an example reference for the minimum
value would be useful, e.g. on AVX512 systems this value is X, on AVX
systems it is Y, and on all other systems Z.

-- 
Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
  2020-05-29 13:13       ` Carlos O'Donell
@ 2020-05-29 13:21         ` H.J. Lu
  2020-05-29 16:18           ` Carlos O'Donell
  0 siblings, 1 reply; 32+ messages in thread
From: H.J. Lu @ 2020-05-29 13:21 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: liqingqing, Hushiyuan, libc-alpha

On Fri, May 29, 2020 at 6:13 AM Carlos O'Donell <carlos@redhat.com> wrote:
>
> On 5/23/20 12:37 AM, H.J. Lu via Libc-alpha wrote:
> > There is no single threshold value which is good for all workloads.
> > I don't think we should change REP_STOSB_THRESHOLD to 1MB.
> > On the other hand, the fixed threshold isn't flexible.  Please try this
> > patch to see if you can set the threshold for your specific workload.
>
> My request here is that the manual include documentation of what the
> minimums are for the tunable. Even an example reference for the minimum
> value would be useful, e.g. on AVX512 systems this value is X, on AVX
> systems it is Y, and on all other systems Z.
>

The logic of the thresholds is:

 /* NB: The REP MOVSB threshold must be greater than VEC_SIZE * 8.  */
  unsigned int minimum_rep_movsb_threshold;
  /* NB: The default REP MOVSB threshold is 2048 * (VEC_SIZE / 16).  */
  unsigned int rep_movsb_threshold;
  if (CPU_FEATURES_ARCH_P (cpu_features, AVX512F_Usable)
      && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512))
    {
      rep_movsb_threshold = 2048 * (64 / 16);
      minimum_rep_movsb_threshold = 64 * 8;
    }
  else if (CPU_FEATURES_ARCH_P (cpu_features,
				AVX_Fast_Unaligned_Load))
    {
      rep_movsb_threshold = 2048 * (32 / 16);
      minimum_rep_movsb_threshold = 32 * 8;
    }
  else
    {
      rep_movsb_threshold = 2048 * (16 / 16);
      minimum_rep_movsb_threshold = 16 * 8;
    }
  if (cpu_features->rep_movsb_threshold > minimum_rep_movsb_threshold)
    __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
  else
    __x86_rep_movsb_threshold = rep_movsb_threshold;
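
Concretely, this logic gives (default, minimum) pairs of (2048 * 4 = 8192, 64 * 8 = 512) bytes on machines where AVX512F is usable, (4096, 256) with AVX_Fast_Unaligned_Load, and (2048, 128) otherwise; a user-specified glibc.cpu.x86_rep_movsb_threshold is honored only when it is greater than the minimum, and the default is used otherwise.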

We can't simply say AVX512 machines will use ZMM and AVX machines
will use YMM.  It depends on other factors which are invisible to users.
Can you suggest a paragraph for the libc manual?

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
  2020-05-29 13:21         ` H.J. Lu
@ 2020-05-29 16:18           ` Carlos O'Donell
  2020-06-01 19:32             ` H.J. Lu
  0 siblings, 1 reply; 32+ messages in thread
From: Carlos O'Donell @ 2020-05-29 16:18 UTC (permalink / raw)
  To: H.J. Lu; +Cc: liqingqing, Hushiyuan, libc-alpha

On 5/29/20 9:21 AM, H.J. Lu wrote:
> On Fri, May 29, 2020 at 6:13 AM Carlos O'Donell <carlos@redhat.com> wrote:
>>
>> On 5/23/20 12:37 AM, H.J. Lu via Libc-alpha wrote:
>>> There is no single threshold value which is good for all workloads.
>>> I don't think we should change REP_STOSB_THRESHOLD to 1MB.
>>> On the other hand, the fixed threshold isn't flexible.  Please try this
>>> patch to see if you can set the threshold for your specific workload.
>>
>> My request here is that the manual include documentation of what the
>> minimums are for the tunable. Even an example reference for the minimum
>> value would be useful, e.g. on AVX512 systems this value is X, on AVX
>> systems it is Y, and on all other systems Z.
>>
> 
> The logic of the thresholds is:
> 
>  /* NB: The REP MOVSB threshold must be greater than VEC_SIZE * 8.  */
>   unsigned int minimum_rep_movsb_threshold;
>   /* NB: The default REP MOVSB threshold is 2048 * (VEC_SIZE / 16).  */
>   unsigned int rep_movsb_threshold;
>   if (CPU_FEATURES_ARCH_P (cpu_features, AVX512F_Usable)
>       && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512))
>     {
>       rep_movsb_threshold = 2048 * (64 / 16);
>       minimum_rep_movsb_threshold = 64 * 8;
>     }
>   else if (CPU_FEATURES_ARCH_P (cpu_features,
> 				AVX_Fast_Unaligned_Load))
>     {
>       rep_movsb_threshold = 2048 * (32 / 16);
>       minimum_rep_movsb_threshold = 32 * 8;
>     }
>   else
>     {
>       rep_movsb_threshold = 2048 * (16 / 16);
>       minimum_rep_movsb_threshold = 16 * 8;
>     }
>   if (cpu_features->rep_movsb_threshold > minimum_rep_movsb_threshold)
>     __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
>   else
>     __x86_rep_movsb_threshold = rep_movsb_threshold;
> 
> We can't simply say AVX512 machines will use ZMM and AVX machines
> will use YMM.  It depends on other factors which are invisible to users.
> Can you suggest a paragraph for the libc manual?

We must tell the users the lower limit, so they can avoid having their
settings ignored.

If we can't tell them the lower limit in the manual, then we must add
a way to print it.

Augment the libc.so.6 main() entry point to print all tunables with
a --list-tunables option and print the limit? Then in the manual just
say you have to look it up?

-- 
Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
  2020-05-29 16:18           ` Carlos O'Donell
@ 2020-06-01 19:32             ` H.J. Lu
  2020-06-01 19:38               ` Carlos O'Donell
  0 siblings, 1 reply; 32+ messages in thread
From: H.J. Lu @ 2020-06-01 19:32 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: liqingqing, Hushiyuan, libc-alpha

On Fri, May 29, 2020 at 9:18 AM Carlos O'Donell <carlos@redhat.com> wrote:
>
> On 5/29/20 9:21 AM, H.J. Lu wrote:
> > On Fri, May 29, 2020 at 6:13 AM Carlos O'Donell <carlos@redhat.com> wrote:
> >>
> >> On 5/23/20 12:37 AM, H.J. Lu via Libc-alpha wrote:
> >>> There is no single threshold value which is good for all workloads.
> >>> I don't think we should change REP_STOSB_THRESHOLD to 1MB.
> >>> On the other hand, the fixed threshold isn't flexible.  Please try this
> >>> patch to see if you can set the threshold for your specific workload.
> >>
> >> My request here is that the manual include documentation of what the
> >> minimums are for the tunable. Even an example reference for the minimum
> >> value would be useful, e.g. on AVX512 systems this value is X, on AVX
> >> systems it is Y, and on all other systems Z.
> >>
> >
> > The logic of the thresholds is:
> >
> >  /* NB: The REP MOVSB threshold must be greater than VEC_SIZE * 8.  */
> >   unsigned int minimum_rep_movsb_threshold;
> >   /* NB: The default REP MOVSB threshold is 2048 * (VEC_SIZE / 16).  */
> >   unsigned int rep_movsb_threshold;
> >   if (CPU_FEATURES_ARCH_P (cpu_features, AVX512F_Usable)
> >       && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512))
> >     {
> >       rep_movsb_threshold = 2048 * (64 / 16);
> >       minimum_rep_movsb_threshold = 64 * 8;
> >     }
> >   else if (CPU_FEATURES_ARCH_P (cpu_features,
> > 				AVX_Fast_Unaligned_Load))
> >     {
> >       rep_movsb_threshold = 2048 * (32 / 16);
> >       minimum_rep_movsb_threshold = 32 * 8;
> >     }
> >   else
> >     {
> >       rep_movsb_threshold = 2048 * (16 / 16);
> >       minimum_rep_movsb_threshold = 16 * 8;
> >     }
> >   if (cpu_features->rep_movsb_threshold > minimum_rep_movsb_threshold)
> >     __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
> >   else
> >     __x86_rep_movsb_threshold = rep_movsb_threshold;
> >
> > We can't simply say AVX512 machines will use ZMM and AVX machines
> > will use YMM.  It depends on other factors which are invisible to users.
> > Can you suggest a paragraph for the libc manual?
>
> We must tell the users the lower limit, so they can avoid having their
> settings ignored.
>
> If we can't tell them the lower limit in the manual, then we must add
> a way to print it.
>
> Augment the libc.so.6 main() entry point to print all tunables with
> a --list-tunables option and print the limit? Then in the manual just

Did you mean adding  --list-tunables to ld.so?  libc.so.6 doesn't take
any arguments.

> say you have to look it up?
>
> --
> Cheers,
> Carlos.
>


-- 
H.J.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
  2020-06-01 19:32             ` H.J. Lu
@ 2020-06-01 19:38               ` Carlos O'Donell
  2020-06-01 20:15                 ` H.J. Lu
  0 siblings, 1 reply; 32+ messages in thread
From: Carlos O'Donell @ 2020-06-01 19:38 UTC (permalink / raw)
  To: H.J. Lu; +Cc: liqingqing, Hushiyuan, libc-alpha

On Mon, Jun 1, 2020 at 3:33 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> Did you mean adding  --list-tunables to ld.so?  libc.so.6 doesn't take
> any arguments.

Yes, I mean adding argument processing to libc.so.6, and handling
--list-tunables.

We have enough infrastructure in place that wiring that up shouldn't be too bad?

Then, even in trimmed down containers, you can just run
/lib64/libc.so.6 --list-tunables and get back the list of tunables and
their min, max, and security values.

The alternative is a glibc-tunables binary which does only this, but
that seems like a waste.

Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
  2020-06-01 19:38               ` Carlos O'Donell
@ 2020-06-01 20:15                 ` H.J. Lu
  2020-06-01 20:19                   ` H.J. Lu
  0 siblings, 1 reply; 32+ messages in thread
From: H.J. Lu @ 2020-06-01 20:15 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: liqingqing, Hushiyuan, libc-alpha

On Mon, Jun 1, 2020 at 12:38 PM Carlos O'Donell <carlos@redhat.com> wrote:
>
> On Mon, Jun 1, 2020 at 3:33 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > Did you mean adding  --list-tunables to ld.so?  libc.so.6 doesn't take
> > any arguments.
>
> Yes, I mean adding argument processing to libc.so.6, and handling
> --list-tunables.
>
> We have enough infrastructure in place that wiring that up shouldn't be too bad?
>
> Then, even in trimmed down containers, you can just run
> /lib64/libc.so.6 --list-tunables and get back the list of tunables and
> their min, max, and security values.

Adding an argument to libc.so.6 is difficult since argument passing is
processor specific.  Adding --list-tunables to ld.so is more doable.

> The alternative is a glibc-tunables binary which does only this, but
> that seems like waste.
>
> Cheers,
> Carlos.
>


-- 
H.J.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
  2020-06-01 20:15                 ` H.J. Lu
@ 2020-06-01 20:19                   ` H.J. Lu
  2020-06-01 20:48                     ` Florian Weimer
  0 siblings, 1 reply; 32+ messages in thread
From: H.J. Lu @ 2020-06-01 20:19 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: liqingqing, Hushiyuan, libc-alpha

On Mon, Jun 1, 2020 at 1:15 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Mon, Jun 1, 2020 at 12:38 PM Carlos O'Donell <carlos@redhat.com> wrote:
> >
> > On Mon, Jun 1, 2020 at 3:33 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > Did you mean adding  --list-tunables to ld.so?  libc.so.6 doesn't take
> > > any arguments.
> >
> > Yes, I mean adding argument processing to libc.so.6, and handling
> > --list-tunables.
> >
> > We have enough infrastructure in place that wiring that up shouldn't be too bad?
> >
> > Then, even in trimmed down containers, you can just run
> > /lib64/libc.so.6 --list-tunables and get back the list of tunables and
> > their min, max, and security values.
>
> Adding an argument to libc.so.6 is difficult since argument passing is
> processor specific.  Adding --list-tunables to ld.so is more doable.

But tunables are in libc.so.


-- 
H.J.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
  2020-06-01 20:19                   ` H.J. Lu
@ 2020-06-01 20:48                     ` Florian Weimer
  2020-06-01 20:56                       ` Carlos O'Donell
  0 siblings, 1 reply; 32+ messages in thread
From: Florian Weimer @ 2020-06-01 20:48 UTC (permalink / raw)
  To: H.J. Lu via Libc-alpha; +Cc: Carlos O'Donell, H.J. Lu, Hushiyuan

* H. J. Lu via Libc-alpha:

> On Mon, Jun 1, 2020 at 1:15 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>> On Mon, Jun 1, 2020 at 12:38 PM Carlos O'Donell <carlos@redhat.com> wrote:
>> >
>> > On Mon, Jun 1, 2020 at 3:33 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>> > > Did you mean adding  --list-tunables to ld.so?  libc.so.6 doesn't take
>> > > any arguments.
>> >
>> > Yes, I mean adding argument processing to libc.so.6, and handling
>> > --list-tunables.
>> >
>> > We have enough infrastructure in place that wiring that up shouldn't be too bad?
>> >
>> > Then, even in trimmed down containers, you can just run
>> > /lib64/libc.so.6 --list-tunables and get back the list of tunables and
>> > their min, max, and security values.
>>
>> Adding an argument to libc.so.6 is difficult since argument passing is
>> processor specific.  Adding --list-tunables to ld.so is more doable.
>
> But tunables are in libc.so.

If this is really a problem, we can load libc.so and call a
GLIBC_PRIVATE function to print the information.
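
Roughly like this (an entirely hypothetical sketch; the GLIBC_PRIVATE entry point "__print_tunables" does not exist and is invented here only to illustrate the mechanism):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int
main (void)
{
  void *handle = dlopen ("libc.so.6", RTLD_NOW);
  if (handle == NULL)
    {
      fprintf (stderr, "dlopen: %s\n", dlerror ());
      return 1;
    }
  /* GLIBC_PRIVATE symbols are versioned, hence dlvsym.  */
  void (*print_tunables) (void)
    = (void (*) (void)) dlvsym (handle, "__print_tunables", "GLIBC_PRIVATE");
  if (print_tunables == NULL)
    {
      fprintf (stderr, "dlvsym: %s\n", dlerror ());
      return 1;
    }
  print_tunables ();
  return 0;
}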

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
  2020-06-01 20:48                     ` Florian Weimer
@ 2020-06-01 20:56                       ` Carlos O'Donell
  2020-06-01 21:13                         ` H.J. Lu
  0 siblings, 1 reply; 32+ messages in thread
From: Carlos O'Donell @ 2020-06-01 20:56 UTC (permalink / raw)
  To: Florian Weimer; +Cc: H.J. Lu via Libc-alpha, H.J. Lu, Hushiyuan

On Mon, Jun 1, 2020 at 4:48 PM Florian Weimer <fw@deneb.enyo.de> wrote:
>
> * H. J. Lu via Libc-alpha:
>
> > On Mon, Jun 1, 2020 at 1:15 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >>
> >> On Mon, Jun 1, 2020 at 12:38 PM Carlos O'Donell <carlos@redhat.com> wrote:
> >> >
> >> > On Mon, Jun 1, 2020 at 3:33 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >> > > Did you mean adding  --list-tunables to ld.so?  libc.so.6 doesn't take
> >> > > any arguments.
> >> >
> >> > Yes, I mean adding argument processing to libc.so.6, and handling
> >> > --list-tunables.
> >> >
> >> > We have enough infrastructure in place that wiring that up shouldn't be too bad?
> >> >
> >> > Then, even in trimmed down containers, you can just run
> >> > /lib64/libc.so.6 --list-tunables and get back the list of tunables and
> >> > their min, max, and security values.
> >>
> >> Adding an argument to libc.so.6 is difficult since argument passing is
>> processor specific.  Adding --list-tunables to ld.so is more doable.
> >
> > But tunables are in libc.so.
>
> If this is really a problem, we can load libc.so and call a
> GLIBC_PRIVATE function to print the information.

Agreed.

Please keep in mind the original problem we are trying to solve.

We want a tunable for a parameter that is difficult to explain to the user.

To make it easier for our users to use the tunable we are going to
provide them a way to look at the tunable settings in detail.

Yes, it requires a target system, but we can't avoid that in some cases.

Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
  2020-06-01 20:56                       ` Carlos O'Donell
@ 2020-06-01 21:13                         ` H.J. Lu
  2020-06-01 22:43                           ` H.J. Lu
  0 siblings, 1 reply; 32+ messages in thread
From: H.J. Lu @ 2020-06-01 21:13 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: Florian Weimer, H.J. Lu via Libc-alpha, Hushiyuan

[-- Attachment #1: Type: text/plain, Size: 1686 bytes --]

On Mon, Jun 1, 2020 at 1:57 PM Carlos O'Donell <carlos@redhat.com> wrote:
>
> On Mon, Jun 1, 2020 at 4:48 PM Florian Weimer <fw@deneb.enyo.de> wrote:
> >
> > * H. J. Lu via Libc-alpha:
> >
> > > On Mon, Jun 1, 2020 at 1:15 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >>
> > >> On Mon, Jun 1, 2020 at 12:38 PM Carlos O'Donell <carlos@redhat.com> wrote:
> > >> >
> > >> > On Mon, Jun 1, 2020 at 3:33 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >> > > Did you mean adding  --list-tunables to ld.so?  libc.so.6 doesn't take
> > >> > > any arguments.
> > >> >
> > >> > Yes, I mean adding argument processing to libc.so.6, and handling
> > >> > --list-tunables.
> > >> >
> > >> > We have enough infrastructure in place that wiring that up shouldn't be too bad?
> > >> >
> > >> > Then, even in trimmed down containers, you can just run
> > >> > /lib64/libc.so.6 --list-tunables and get back the list of tunables and
> > >> > their min, max, and security values.
> > >>
> > >> Adding an argument to libc.so.6 is difficult since argument passing is
> > >> processor specific.  Adding --list-tunables to ld.so is more doable.
> > >
> > > But tunables are in libc.so.
> >
> > If this is really a problem, we can load libc.so and call a
> > GLIBC_PRIVATE function to print the information.
>
> Agreed.
>
> Please keep in mind the original problem we are trying to solve.
>
> We want a tunable for a parameter that is difficult to explain to the user.
>
> To make it easier for our users to use the tunable we are going to
> provide them a way to look at the tunable settings in detail.
>
> Yes, it requires a target system, but we can't avoid that in some cases.
>

Something like this?

-- 
H.J.

[-- Attachment #2: 0001-x86-Pass-argc-and-argv-to-__libc_main.patch --]
[-- Type: text/x-patch, Size: 7825 bytes --]

From 17a661ac6f10e9cf51a664ce95ee95c8113c74e8 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Mon, 1 Jun 2020 14:11:32 -0700
Subject: [PATCH] x86: Pass argc and argv to __libc_main

---
 csu/Makefile                      |  4 ++--
 csu/{version.c => libc-version.c} |  8 +++++--
 sysdeps/x86/Makefile              |  1 +
 sysdeps/x86/libc-main.S           | 38 +++++++++++++++++++++++++++++++
 sysdeps/x86/libc-version.c        | 30 ++++++++++++++++++++++++
 sysdeps/x86_64/start.S            | 34 +++++++++++++++++++++------
 6 files changed, 104 insertions(+), 11 deletions(-)
 rename csu/{version.c => libc-version.c} (93%)
 create mode 100644 sysdeps/x86/libc-main.S
 create mode 100644 sysdeps/x86/libc-version.c

diff --git a/csu/Makefile b/csu/Makefile
index 555ae27dea..951a093f15 100644
--- a/csu/Makefile
+++ b/csu/Makefile
@@ -26,8 +26,8 @@ subdir := csu
 
 include ../Makeconfig
 
-routines = init-first libc-start $(libc-init) sysdep version check_fds \
-	   libc-tls elf-init dso_handle
+routines = init-first libc-start $(libc-init) sysdep libc-version \
+	   check_fds libc-tls elf-init dso_handle
 aux	 = errno
 elide-routines.os = libc-tls
 static-only-routines = elf-init
diff --git a/csu/version.c b/csu/libc-version.c
similarity index 93%
rename from csu/version.c
rename to csu/libc-version.c
index 57b49dfd8a..9b0f4cb94b 100644
--- a/csu/version.c
+++ b/csu/libc-version.c
@@ -61,12 +61,16 @@ __gnu_get_libc_version (void)
 }
 weak_alias (__gnu_get_libc_version, gnu_get_libc_version)
 
+#ifndef LIBC_MAIN
+# define LIBC_MAIN __libc_main
+#endif
+
 /* This function is the entry point for the shared object.
    Running the library as a program will get here.  */
 
-extern void __libc_main (void) __attribute__ ((noreturn));
+extern void LIBC_MAIN (void) __attribute__ ((noreturn, visibility ("hidden")));
 void
-__libc_main (void)
+LIBC_MAIN (void)
 {
   __libc_print_version ();
   _exit (0);
diff --git a/sysdeps/x86/Makefile b/sysdeps/x86/Makefile
index beab426f67..de6aed89ee 100644
--- a/sysdeps/x86/Makefile
+++ b/sysdeps/x86/Makefile
@@ -1,5 +1,6 @@
 ifeq ($(subdir),csu)
 gen-as-const-headers += cpu-features-offsets.sym
+routines += libc-main
 endif
 
 ifeq ($(subdir),elf)
diff --git a/sysdeps/x86/libc-main.S b/sysdeps/x86/libc-main.S
new file mode 100644
index 0000000000..2f8dd0be73
--- /dev/null
+++ b/sysdeps/x86/libc-main.S
@@ -0,0 +1,38 @@
+/* Copyright (C) 2020 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   In addition to the permissions in the GNU Lesser General Public
+   License, the Free Software Foundation gives you unlimited
+   permission to link the compiled version of this file with other
+   programs, and to distribute those programs without any restriction
+   coming from the use of this file. (The GNU Lesser General Public
+   License restrictions do apply in other respects; for example, they
+   cover modification of the file, and distribution when not linked
+   into another program.)
+
+   Note that people who make modified versions of this file are not
+   obligated to grant this special exception for their modified
+   versions; it is their choice whether to do so. The GNU Lesser
+   General Public License gives permission to release a modified
+   version without this exception; this exception also makes it
+   possible to release a modified version which carries forward this
+   exception.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+	.hidden __x86_libc_main
+
+#define LIBC_MAIN __x86_libc_main
+#include "start.S"
diff --git a/sysdeps/x86/libc-version.c b/sysdeps/x86/libc-version.c
new file mode 100644
index 0000000000..bac0cda6c7
--- /dev/null
+++ b/sysdeps/x86/libc-version.c
@@ -0,0 +1,30 @@
+/* Copyright (C) 2020 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#define LIBC_MAIN __generic_libc_main
+#include <csu/libc-version.c>
+
+/* This function is the entry point for the shared object.
+   Running the library as a program will get here.  */
+
+extern void __x86_libc_main (int, char **) __attribute__ ((noreturn));
+
+void
+__x86_libc_main (int argc, char **argv)
+{
+  __generic_libc_main ();
+}
diff --git a/sysdeps/x86_64/start.S b/sysdeps/x86_64/start.S
index 7477b632f7..18d910257b 100644
--- a/sysdeps/x86_64/start.S
+++ b/sysdeps/x86_64/start.S
@@ -55,7 +55,13 @@
 
 #include <sysdep.h>
 
-ENTRY (_start)
+#ifdef LIBC_MAIN
+# define START __libc_main
+#else
+# define START _start
+#endif
+
+ENTRY (START)
 	/* Clearing frame pointer is insufficient, use CFI.  */
 	cfi_undefined (rip)
 	/* Clear the frame pointer.  The ABI suggests this be done, to mark
@@ -76,16 +82,24 @@ ENTRY (_start)
 	rtld_fini:	%r9
 	stack_end:	stack.	*/
 
+#ifdef LIBC_MAIN
+# define ARGC_REG	rdi
+# define ARGV_REG	RSI_LP
+#else
+# define ARGC_REG	rsi
+# define ARGV_REG	RDX_LP
+#endif
+
 	mov %RDX_LP, %R9_LP	/* Address of the shared library termination
 				   function.  */
 #ifdef __ILP32__
 	mov (%rsp), %esi	/* Simulate popping 4-byte argument count.  */
 	add $4, %esp
 #else
-	popq %rsi		/* Pop the argument count.  */
+	popq %ARGC_REG		/* Pop the argument count.  */
 #endif
 	/* argv starts just at the current stack top.  */
-	mov %RSP_LP, %RDX_LP
+	mov %RSP_LP, %ARGV_REG
 	/* Align the stack to a 16 byte boundary to follow the ABI.  */
 	and  $~15, %RSP_LP
 
@@ -96,19 +110,22 @@ ENTRY (_start)
 	   which grow downwards).  */
 	pushq %rsp
 
-#ifdef PIC
+#ifdef LIBC_MAIN
+	call LIBC_MAIN
+#else
+# ifdef PIC
 	/* Pass address of our own entry points to .fini and .init.  */
 	mov __libc_csu_fini@GOTPCREL(%rip), %R8_LP
 	mov __libc_csu_init@GOTPCREL(%rip), %RCX_LP
 
 	mov main@GOTPCREL(%rip), %RDI_LP
-#else
+# else
 	/* Pass address of our own entry points to .fini and .init.  */
 	mov $__libc_csu_fini, %R8_LP
 	mov $__libc_csu_init, %RCX_LP
 
 	mov $main, %RDI_LP
-#endif
+# endif
 
 	/* Call the user's main function, and exit with its value.
 	   But let the libc call main.  Since __libc_start_main in
@@ -118,10 +135,12 @@ ENTRY (_start)
 	   2.26 or above can convert indirect branch into direct
 	   branch.  */
 	call *__libc_start_main@GOTPCREL(%rip)
+#endif
 
 	hlt			/* Crash if somehow `exit' does return.	 */
-END (_start)
+END (START)
 
+#ifndef LIBC_MAIN
 /* Define a symbol for the first piece of initialized data.  */
 	.data
 	.globl __data_start
@@ -129,3 +148,4 @@ __data_start:
 	.long 0
 	.weak data_start
 	data_start = __data_start
+#endif
-- 
2.26.2


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
  2020-06-01 21:13                         ` H.J. Lu
@ 2020-06-01 22:43                           ` H.J. Lu
  2020-06-02  2:08                             ` Carlos O'Donell
  0 siblings, 1 reply; 32+ messages in thread
From: H.J. Lu @ 2020-06-01 22:43 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: Florian Weimer, H.J. Lu via Libc-alpha, Hushiyuan

On Mon, Jun 1, 2020 at 2:13 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Mon, Jun 1, 2020 at 1:57 PM Carlos O'Donell <carlos@redhat.com> wrote:
> >
> > On Mon, Jun 1, 2020 at 4:48 PM Florian Weimer <fw@deneb.enyo.de> wrote:
> > >
> > > * H. J. Lu via Libc-alpha:
> > >
> > > > On Mon, Jun 1, 2020 at 1:15 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >>
> > > >> On Mon, Jun 1, 2020 at 12:38 PM Carlos O'Donell <carlos@redhat.com> wrote:
> > > >> >
> > > >> > On Mon, Jun 1, 2020 at 3:33 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >> > > Did you mean adding  --list-tunables to ld.so?  libc.so.6 doesn't take
> > > >> > > any arguments.
> > > >> >
> > > >> > Yes, I mean adding argument processing to libc.so.6, and handling
> > > >> > --list-tunables.
> > > >> >
> > > >> > We have enough infrastructure in place that wiring that up shouldn't be too bad?
> > > >> >
> > > >> > Then, even in trimmed down containers, you can just run
> > > >> > /lib64/libc.so.6 --list-tunables and get back the list of tunables and
> > > >> > their min, max, and security values.
> > > >>
> > > >> Adding an argument to libc.so.6 is difficult since argument passing is
> > > >> processor specific.  Adding --list-tunables to ld.so is more doable.
> > > >
> > > > But tunables are in libc.so.
> > >
> > > If this is really a problem, we can load libc.so and call a
> > > GLIBC_PRIVATE function to print the information.
> >
> > Agreed.
> >
> > Please keep in mind the original problem we are trying to solve.
> >
> > We want a tunable for a parameter that is difficult to explain to the user.
> >
> > To make it easier for our users to use the tunable we are going to
> > provide them a way to look at the tunable settings in detail.
> >
> > Yes, it requires a target system, but we can't avoid that in some cases.
> >
>
> Something like this?
>

Tunables are designed to pass info from user to glibc, not the other
way around.  When __libc_main is called, init_cacheinfo is never
called.  I can call init_cacheinfo from __libc_main.  But there is no
interface to update min and max values from init_cacheinfo.  I don't
think --list-tunables will work here without changes to tunables.
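
To make the gap concrete: the missing hook would have to carry the
bounds along with the value.  A rough shape, as a sketch only (the
names anticipate the patch posted later in this thread, and the
threshold variables are placeholders):

/* Let an early initializer such as init_cacheinfo adjust a tunable's
   value together with its min/max bounds.  */
extern void __tunable_update_val (tunable_id_t id, void *valp,
                                  void *minp, void *maxp);

/* Hypothetical caller from the cache-info setup path.  */
uint64_t val = rep_movsb_threshold;
uint64_t min = minimum_rep_movsb_threshold;
uint64_t max = (uint64_t) -1;
__tunable_update_val (TUNABLE_ENUM_NAME (glibc, cpu, x86_rep_movsb_threshold),
                      &val, &min, &max);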



-- 
H.J.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
  2020-06-01 22:43                           ` H.J. Lu
@ 2020-06-02  2:08                             ` Carlos O'Donell
  2020-06-04 21:00                               ` [PATCH] libc.so: Add --list-tunables H.J. Lu
  0 siblings, 1 reply; 32+ messages in thread
From: Carlos O'Donell @ 2020-06-02  2:08 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Florian Weimer, H.J. Lu via Libc-alpha, Hushiyuan

On Mon, Jun 1, 2020 at 6:44 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> Tunables are designed to pass info from user to glibc, not the other
> way around.  When __libc_main is called, init_cacheinfo is never
> called.  I can call init_cacheinfo from __libc_main.  But there is no
> interface to update min and max values from init_cacheinfo.  I don't
> think --list-tunables will work here without changes to tunables.

You have a dynamic threshold.

You have to tell the user what that minimum is, otherwise they can't
use the tunable reliably.

This is the first instance of a min/max that is dynamically determined.

You must fetch the cache info ahead of the tunable initialization; that
is, you must call init_cacheinfo before __tunables_init.

You can initialize the tunable data dynamically like this:

/* Dynamically set the min and max of glibc.foo.bar.  */
tunable_id_t id = TUNABLE_ENUM_NAME (glibc, foo, bar);
tunable_list[id].type.min = lowval;
tunable_list[id].type.max = highval;

We do something similar for maybe_enable_malloc_check.

Then once the tunables are parsed, and the cpu features are loaded
you can print the tunables, and the printed tunables will have meaningful
min and max values.

If you have circular dependency, then you must process the cpu features
first without reading from the tunables, then allow the tunables to be
initialized from the system, *then* process the tunables to alter the existing
cpu feature settings.
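
Putting those steps together, the whole sequence might be sketched like
this (illustrative only: init_cpu_features_basic and apply_cpu_tunables
are placeholder names; lowval/highval come from the cache info):

/* Sketch of the initialization order for a dynamically bounded tunable.  */
static void
early_init (char **envp)
{
  /* 1. Read the hardware first, without consulting any tunables.  */
  init_cpu_features_basic ();

  /* 2. Seed the dynamic min/max before parsing GLIBC_TUNABLES.  */
  tunable_id_t id = TUNABLE_ENUM_NAME (glibc, cpu, x86_rep_movsb_threshold);
  tunable_list[id].type.min = lowval;
  tunable_list[id].type.max = highval;

  /* 3. Parse the environment; user values get clamped to that range.  */
  __tunables_init (envp);

  /* 4. Let the parsed tunables override the CPU feature defaults.  */
  apply_cpu_tunables ();
}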

Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH] libc.so: Add --list-tunables
  2020-06-02  2:08                             ` Carlos O'Donell
@ 2020-06-04 21:00                               ` H.J. Lu
  2020-06-05 22:45                                 ` V2 " H.J. Lu
  0 siblings, 1 reply; 32+ messages in thread
From: H.J. Lu @ 2020-06-04 21:00 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: Florian Weimer, H.J. Lu via Libc-alpha, Hushiyuan

[-- Attachment #1: Type: text/plain, Size: 3536 bytes --]

On Mon, Jun 1, 2020 at 7:08 PM Carlos O'Donell <carlos@redhat.com> wrote:
>
> On Mon, Jun 1, 2020 at 6:44 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > Tunables are designed to pass info from user to glibc, not the other
> > way around.  When __libc_main is called, init_cacheinfo is never
> > called.  I can call init_cacheinfo from __libc_main.  But there is no
> > interface to update min and max values from init_cacheinfo.  I don't
> > think --list-tunables will work here without changes to tunables.
>
> You have a dynamic threshold.
>
> You have to tell the user what that minimum is, otherwise they can't
> use the tunable reliably.
>
> This is the first instance of a min/max that is dynamically determined.
>
> You must fetch the cache info ahead of the tunable initialization; that
> is, you must call init_cacheinfo before __tunables_init.
>
> You can initialize the tunable data dynamically like this:
>
> /* Dynamically set the min and max of glibc.foo.bar.  */
> tunable_id_t id = TUNABLE_ENUM_NAME (glibc, foo, bar);
> tunable_list[id].type.min = lowval;
> tunable_list[id].type.max = highval;
>
> We do something similar for maybe_enable_malloc_check.
>
> Then once the tunables are parsed, and the cpu features are loaded
> you can print the tunables, and the printed tunables will have meaningful
> min and max values.
>
> If you have circular dependency, then you must process the cpu features
> first without reading from the tunables, then allow the tunables to be
> initialized from the system, *then* process the tunables to alter the existing
> cpu feature settings.
>

How about this?  I got

[hjl@gnu-cfl-2 build-x86_64-linux]$ ./elf/ld.so ./libc.so --list-tunables
tunables:
  glibc.elision.skip_lock_after_retries: 0x3 (min: 0x80000000, max: 0x7fffffff)
  glibc.malloc.trim_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
  glibc.malloc.perturb: 0x0 (min: 0x0, max: 0xff)
  glibc.cpu.x86_shared_cache_size: 0x0 (min: 0x0, max: 0xffffffffffffffff)
  glibc.elision.tries: 0x3 (min: 0x80000000, max: 0x7fffffff)
  glibc.elision.enable: 0x0 (min: 0x0, max: 0x1)
  glibc.cpu.x86_rep_movsb_threshold: 0x1000 (min: 0x100, max: 0xffffffffffffffff)
  glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
  glibc.elision.skip_lock_busy: 0x3 (min: 0x80000000, max: 0x7fffffff)
  glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
  glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x0, max: 0xffffffffffffffff)
  glibc.cpu.x86_non_temporal_threshold: 0x600000 (min: 0x0, max: 0xffffffffffffffff)
  glibc.cpu.x86_shstk:
  glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
  glibc.malloc.mmap_max: 0x0 (min: 0x80000000, max: 0x7fffffff)
  glibc.elision.skip_trylock_internal_abort: 0x3 (min: 0x80000000, max: 0x7fffffff)
  glibc.malloc.tcache_unsorted_limit: 0x0 (min: 0x0, max: 0xffffffffffffffff)
  glibc.cpu.x86_ibt:
  glibc.cpu.hwcaps:
  glibc.elision.skip_lock_internal_abort: 0x3 (min: 0x80000000, max: 0x7fffffff)
  glibc.malloc.arena_max: 0x0 (min: 0x1, max: 0xffffffffffffffff)
  glibc.malloc.mmap_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
  glibc.cpu.x86_data_cache_size: 0x0 (min: 0x0, max: 0xffffffffffffffff)
  glibc.malloc.tcache_count: 0x0 (min: 0x0, max: 0xffffffffffffffff)
  glibc.malloc.arena_test: 0x0 (min: 0x1, max: 0xffffffffffffffff)
  glibc.pthread.mutex_spin_count: 0x64 (min: 0x0, max: 0x7fff)
  glibc.malloc.tcache_max: 0x0 (min: 0x0, max: 0xffffffffffffffff)
  glibc.malloc.check: 0x0 (min: 0x0, max: 0x3)
[hjl@gnu-cfl-2 build-x86_64-linux]$

-- 
H.J.

[-- Attachment #2: 0001-libc.so-Add-list-tunables.patch --]
[-- Type: application/x-patch, Size: 19860 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* V2 [PATCH] libc.so: Add --list-tunables
  2020-06-04 21:00                               ` [PATCH] libc.so: Add --list-tunables H.J. Lu
@ 2020-06-05 22:45                                 ` H.J. Lu
  2020-06-06 21:51                                   ` V3 [PATCH] libc.so: Add --list-tunables support to __libc_main H.J. Lu
  0 siblings, 1 reply; 32+ messages in thread
From: H.J. Lu @ 2020-06-05 22:45 UTC (permalink / raw)
  To: libc-alpha; +Cc: Carlos O'Donell, Florian Weimer, Hushiyuan

On Thu, Jun 04, 2020 at 02:00:35PM -0700, H.J. Lu wrote:
> On Mon, Jun 1, 2020 at 7:08 PM Carlos O'Donell <carlos@redhat.com> wrote:
> >
> > On Mon, Jun 1, 2020 at 6:44 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > Tunables are designed to pass info from user to glibc, not the other
> > > way around.  When __libc_main is called, init_cacheinfo is never
> > > called.  I can call init_cacheinfo from __libc_main.  But there is no
> > > interface to update min and max values from init_cacheinfo.  I don't
> > > think --list-tunables will work here without changes to tunables.
> >
> > You have a dynamic threshold.
> >
> > You have to tell the user what that minimum is, otherwise they can't
> > use the tunable reliably.
> >
> > This is the first instance of a min/max that is dynamically determined.
> >
> > You must fetch the cache info ahead of the tunable initialization; that
> > is, you must call init_cacheinfo before __tunables_init.
> >
> > You can initialize the tunable data dynamically like this:
> >
> > /* Dynamically set the min and max of glibc.foo.bar.  */
> > tunable_id_t id = TUNABLE_ENUM_NAME (glibc, foo, bar);
> > tunable_list[id].type.min = lowval;
> > tunable_list[id].type.max = highval;
> >
> > We do something similar for maybe_enable_malloc_check.
> >
> > Then once the tunables are parsed, and the cpu features are loaded
> > you can print the tunables, and the printed tunables will have meaningful
> > min and max values.
> >
> > If you have circular dependency, then you must process the cpu features
> > first without reading from the tunables, then allow the tunables to be
> > initialized from the system, *then* process the tunables to alter the existing
> > cpu feature settings.
> >
> 
> How about this?  I got
> 

Here is the updated patch, which depends on

https://sourceware.org/pipermail/libc-alpha/2020-June/114820.html

to add "%d" support to _dl_debug_vdprintf.  I got

$ ./elf/ld.so ./libc.so --list-tunables
glibc.elision.skip_lock_after_retries: 3 (min: -2147483648, max: 2147483647)
glibc.malloc.trim_threshold: 0x0 (min: 0x0, max: 0xffffffff)
glibc.malloc.perturb: 0 (min: 0, max: 255)
glibc.cpu.x86_shared_cache_size: 0x100000 (min: 0x0, max: 0xffffffff)
glibc.elision.tries: 3 (min: -2147483648, max: 2147483647)
glibc.elision.enable: 0 (min: 0, max: 1)
glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffff)
glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffff)
glibc.cpu.x86_non_temporal_threshold: 0x600000 (min: 0x0, max: 0xffffffff)
glibc.cpu.x86_shstk:
glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffff)
glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
glibc.elision.skip_trylock_internal_abort: 3 (min: -2147483648, max: 2147483647)
glibc.malloc.tcache_unsorted_limit: 0x0 (min: 0x0, max: 0xffffffff)
glibc.cpu.x86_ibt:
glibc.cpu.hwcaps:
glibc.elision.skip_lock_internal_abort: 3 (min: -2147483648, max: 2147483647)
glibc.malloc.arena_max: 0x0 (min: 0x1, max: 0xffffffff)
glibc.malloc.mmap_threshold: 0x0 (min: 0x0, max: 0xffffffff)
glibc.cpu.x86_data_cache_size: 0x8000 (min: 0x0, max: 0xffffffff)
glibc.malloc.tcache_count: 0x0 (min: 0x0, max: 0xffffffff)
glibc.malloc.arena_test: 0x0 (min: 0x1, max: 0xffffffff)
glibc.pthread.mutex_spin_count: 100 (min: 0, max: 32767)
glibc.malloc.tcache_max: 0x0 (min: 0x0, max: 0xffffffff)
glibc.malloc.check: 0 (min: 0, max: 3)
$

Ok for master?

Thanks.

H.J.
---
Add --list-tunables to __libc_main to print tunables with min and max
values.  In order to pass argc and argv to __libc_main, each target must
provide a suitable __libc_main, which is currently only provided by x86.

Two functions, __tunable_update_val and __tunables_print, are added to
update a tunable's value and bounds and to print all tunables.

X86 processor cache info is moved to cpu_features so that it is available
for __tunables_print with --list-tunables.
---
 csu/Makefile                                |   4 +-
 csu/{version.c => libc-version.c}           |   8 +-
 elf/Versions                                |   6 +
 elf/dl-tunables.c                           |  88 +-
 elf/dl-tunables.h                           |  17 +
 sysdeps/i386/cacheinfo.c                    |   3 -
 sysdeps/i386/start.S                        |  28 +-
 sysdeps/x86/Makefile                        |  14 +-
 sysdeps/x86/cacheinfo.c                     | 852 ++------------------
 sysdeps/x86/cpu-features.c                  |  19 +-
 sysdeps/x86/cpu-features.h                  |  26 +
 sysdeps/x86/{cacheinfo.c => dl-cacheinfo.c} | 200 ++---
 sysdeps/x86/init-arch.h                     |   3 +
 sysdeps/x86/libc-main.S                     |  36 +
 sysdeps/x86/libc-version.c                  |  53 ++
 sysdeps/x86_64/start.S                      |  36 +-
 16 files changed, 446 insertions(+), 947 deletions(-)
 rename csu/{version.c => libc-version.c} (93%)
 delete mode 100644 sysdeps/i386/cacheinfo.c
 copy sysdeps/x86/{cacheinfo.c => dl-cacheinfo.c} (83%)
 create mode 100644 sysdeps/x86/libc-main.S
 create mode 100644 sysdeps/x86/libc-version.c

diff --git a/csu/Makefile b/csu/Makefile
index 555ae27dea..951a093f15 100644
--- a/csu/Makefile
+++ b/csu/Makefile
@@ -26,8 +26,8 @@ subdir := csu
 
 include ../Makeconfig
 
-routines = init-first libc-start $(libc-init) sysdep version check_fds \
-	   libc-tls elf-init dso_handle
+routines = init-first libc-start $(libc-init) sysdep libc-version \
+	   check_fds libc-tls elf-init dso_handle
 aux	 = errno
 elide-routines.os = libc-tls
 static-only-routines = elf-init
diff --git a/csu/version.c b/csu/libc-version.c
similarity index 93%
rename from csu/version.c
rename to csu/libc-version.c
index 57b49dfd8a..9b0f4cb94b 100644
--- a/csu/version.c
+++ b/csu/libc-version.c
@@ -61,12 +61,16 @@ __gnu_get_libc_version (void)
 }
 weak_alias (__gnu_get_libc_version, gnu_get_libc_version)
 
+#ifndef LIBC_MAIN
+# define LIBC_MAIN __libc_main
+#endif
+
 /* This function is the entry point for the shared object.
    Running the library as a program will get here.  */
 
-extern void __libc_main (void) __attribute__ ((noreturn));
+extern void LIBC_MAIN (void) __attribute__ ((noreturn, visibility ("hidden")));
 void
-__libc_main (void)
+LIBC_MAIN (void)
 {
   __libc_print_version ();
   _exit (0);
diff --git a/elf/Versions b/elf/Versions
index be88c48e6d..bf9d7dff9b 100644
--- a/elf/Versions
+++ b/elf/Versions
@@ -76,5 +76,11 @@ ld {
 
     # Set value of a tunable.
     __tunable_get_val;
+
+    # Update value of a tunable.
+    __tunable_update_val;
+
+    # Print all tunables.
+    __tunables_print;
   }
 }
diff --git a/elf/dl-tunables.c b/elf/dl-tunables.c
index 26e6e26612..c9f11e3b26 100644
--- a/elf/dl-tunables.c
+++ b/elf/dl-tunables.c
@@ -100,31 +100,39 @@ get_next_env (char **envp, char **name, size_t *namelen, char **val,
     }									      \
 })
 
+#define TUNABLE_UPDATE_VAL(__cur, __val, __min, __max, __type)		      \
+({									      \
+  (__cur)->type.min = (__min);						      \
+  (__cur)->type.max = (__max);						      \
+  (__cur)->val.numval = (__val);					      \
+  (__cur)->initialized = true;						      \
+})
+
 static void
-do_tunable_update_val (tunable_t *cur, const void *valp)
+do_tunable_update_val (tunable_t *cur, const void *valp,
+		       const void *minp, const void *maxp)
 {
-  uint64_t val;
+  uint64_t val, min, max;
 
   if (cur->type.type_code != TUNABLE_TYPE_STRING)
-    val = *((int64_t *) valp);
+    {
+      val = *((int64_t *) valp);
+      if (minp)
+	min = *((int64_t *) minp);
+      if (maxp)
+	max = *((int64_t *) maxp);
+    }
 
   switch (cur->type.type_code)
     {
     case TUNABLE_TYPE_INT_32:
-	{
-	  TUNABLE_SET_VAL_IF_VALID_RANGE (cur, val, int64_t);
-	  break;
-	}
     case TUNABLE_TYPE_UINT_64:
-	{
-	  TUNABLE_SET_VAL_IF_VALID_RANGE (cur, val, uint64_t);
-	  break;
-	}
     case TUNABLE_TYPE_SIZE_T:
-	{
-	  TUNABLE_SET_VAL_IF_VALID_RANGE (cur, val, uint64_t);
-	  break;
-	}
+      if (minp && maxp)
+	TUNABLE_UPDATE_VAL (cur, val, min, max, int64_t);
+      else
+	TUNABLE_SET_VAL_IF_VALID_RANGE (cur, val, int64_t);
+      break;
     case TUNABLE_TYPE_STRING:
 	{
 	  cur->val.strval = valp;
@@ -153,7 +161,7 @@ tunable_initialize (tunable_t *cur, const char *strval)
       cur->initialized = true;
       valp = strval;
     }
-  do_tunable_update_val (cur, valp);
+  do_tunable_update_val (cur, valp, NULL, NULL);
 }
 
 void
@@ -161,8 +169,17 @@ __tunable_set_val (tunable_id_t id, void *valp)
 {
   tunable_t *cur = &tunable_list[id];
 
-  do_tunable_update_val (cur, valp);
+  do_tunable_update_val (cur, valp, NULL, NULL);
+}
+
+void
+__tunable_update_val (tunable_id_t id, void *valp, void *minp, void *maxp)
+{
+  tunable_t *cur = &tunable_list[id];
+
+  do_tunable_update_val (cur, valp, minp, maxp);
 }
+rtld_hidden_def (__tunable_update_val)
 
 #if TUNABLES_FRONTEND == TUNABLES_FRONTEND_valstring
 /* Parse the tunable string TUNESTR and adjust it to drop any tunables that may
@@ -361,6 +378,43 @@ __tunables_init (char **envp)
     }
 }
 
+void
+__tunables_print (void)
+{
+  for (int i = 0; i < sizeof (tunable_list) / sizeof (tunable_t); i++)
+    {
+      tunable_t *cur = &tunable_list[i];
+      _dl_printf ("%s: ", cur->name);
+      switch (cur->type.type_code)
+	{
+	case TUNABLE_TYPE_INT_32:
+	  _dl_printf ("%d (min: %d, max: %d)\n",
+		      (int) cur->val.numval,
+		      (int) cur->type.min,
+		      (int) cur->type.max);
+	  break;
+	case TUNABLE_TYPE_UINT_64:
+	  _dl_printf ("0x%lx (min: 0x%lx, max: 0x%lx)\n",
+		      (long int) cur->val.numval,
+		      (long int) cur->type.min,
+		      (long int) cur->type.max);
+	  break;
+	case TUNABLE_TYPE_SIZE_T:
+	  _dl_printf ("0x%Zx (min: 0x%Zx, max: 0x%Zx)\n",
+		      (size_t) cur->val.numval,
+		      (size_t) cur->type.min,
+		      (size_t) cur->type.max);
+	  break;
+	case TUNABLE_TYPE_STRING:
+	  _dl_printf ("%s\n", cur->val.strval ? cur->val.strval : "");
+	  break;
+	default:
+	  __builtin_unreachable ();
+	}
+    }
+}
+rtld_hidden_def (__tunables_print)
+
 /* Set the tunable value.  This is called by the module that the tunable exists
    in. */
 void
diff --git a/elf/dl-tunables.h b/elf/dl-tunables.h
index 969e50327b..577c5d3369 100644
--- a/elf/dl-tunables.h
+++ b/elf/dl-tunables.h
@@ -67,10 +67,14 @@ typedef struct _tunable tunable_t;
 # include "dl-tunable-list.h"
 
 extern void __tunables_init (char **);
+extern void __tunables_print (void);
 extern void __tunable_get_val (tunable_id_t, void *, tunable_callback_t);
 extern void __tunable_set_val (tunable_id_t, void *);
+extern void __tunable_update_val (tunable_id_t, void *, void *, void *);
 rtld_hidden_proto (__tunables_init)
+rtld_hidden_proto (__tunables_print)
 rtld_hidden_proto (__tunable_get_val)
+rtld_hidden_proto (__tunable_update_val)
 
 /* Define TUNABLE_GET and TUNABLE_SET in short form if TOP_NAMESPACE and
    TUNABLE_NAMESPACE are defined.  This is useful shorthand to get and set
@@ -80,11 +84,16 @@ rtld_hidden_proto (__tunable_get_val)
   TUNABLE_GET_FULL (TOP_NAMESPACE, TUNABLE_NAMESPACE, __id, __type, __cb)
 # define TUNABLE_SET(__id, __type, __val) \
   TUNABLE_SET_FULL (TOP_NAMESPACE, TUNABLE_NAMESPACE, __id, __type, __val)
+# define TUNABLE_UPDATE(__id, __type, __val, __min, __max) \
+  TUNABLE_UPDATE_FULL (TOP_NAMESPACE, TUNABLE_NAMESPACE, __id, __type, \
+		       __val, __min, __max)
 #else
 # define TUNABLE_GET(__top, __ns, __id, __type, __cb) \
   TUNABLE_GET_FULL (__top, __ns, __id, __type, __cb)
 # define TUNABLE_SET(__top, __ns, __id, __type, __val) \
   TUNABLE_SET_FULL (__top, __ns, __id, __type, __val)
+# define TUNABLE_UPDATE(__top, __ns, __id, __type, __val, __min, __max) \
+  TUNABLE_UPDATE_FULL (__top, __ns, __id, __type, __val, __min, __max)
 #endif
 
 /* Get and return a tunable value.  If the tunable was set externally and __CB
@@ -104,6 +113,14 @@ rtld_hidden_proto (__tunable_get_val)
 			& (__type) {__val});				      \
 })
 
+/* Update a tunable value.  */
+# define TUNABLE_UPDATE_FULL(__top, __ns, __id, __type, __val, __min, __max) \
+({									      \
+  __tunable_update_val (TUNABLE_ENUM_NAME (__top, __ns, __id),		      \
+			& (__type) {__val},  & (__type) {__min},	      \
+			& (__type) {__max});				      \
+})
+
 /* Namespace sanity for callback functions.  Use this macro to keep the
    namespace of the modules clean.  */
 # define TUNABLE_CALLBACK(__name) _dl_tunable_ ## __name
diff --git a/sysdeps/i386/cacheinfo.c b/sysdeps/i386/cacheinfo.c
deleted file mode 100644
index f15fe0779a..0000000000
--- a/sysdeps/i386/cacheinfo.c
+++ /dev/null
@@ -1,3 +0,0 @@
-#define DISABLE_PREFETCHW
-
-#include <sysdeps/x86/cacheinfo.c>
diff --git a/sysdeps/i386/start.S b/sysdeps/i386/start.S
index c57b25f055..6d2e76e5cb 100644
--- a/sysdeps/i386/start.S
+++ b/sysdeps/i386/start.S
@@ -54,7 +54,13 @@
 
 #include <sysdep.h>
 
-ENTRY (_start)
+#ifdef LIBC_MAIN
+# define START __libc_main
+#else
+# define START _start
+#endif
+
+ENTRY (START)
 	/* Clearing frame pointer is insufficient, use CFI.  */
 	cfi_undefined (eip)
 	/* Clear the frame pointer.  The ABI suggests this be done, to mark
@@ -75,6 +81,11 @@ ENTRY (_start)
 	pushl %eax		/* Push garbage because we allocate
 				   28 more bytes.  */
 
+#ifdef LIBC_MAIN
+	pushl %ecx		/* Push second argument: argv.  */
+	pushl %esi		/* Push first argument: argc.  */
+	call LIBC_MAIN
+#else
 	/* Provide the highest stack address to the user code (for stacks
 	   which grow downwards).  */
 	pushl %esp
@@ -82,7 +93,7 @@ ENTRY (_start)
 	pushl %edx		/* Push address of the shared library
 				   termination function.  */
 
-#ifdef PIC
+# ifdef PIC
 	/* Load PIC register.  */
 	call 1f
 	addl $_GLOBAL_OFFSET_TABLE_, %ebx
@@ -96,9 +107,9 @@ ENTRY (_start)
 	pushl %ecx		/* Push second argument: argv.  */
 	pushl %esi		/* Push first argument: argc.  */
 
-# ifdef SHARED
+#  ifdef SHARED
 	pushl main@GOT(%ebx)
-# else
+#  else
 	/* Avoid relocation in static PIE since _start is called before
 	   it is relocated.  Don't use "leal main@GOTOFF(%ebx), %eax"
 	   since main may be in a shared object.  Linker will convert
@@ -106,12 +117,12 @@ ENTRY (_start)
 	   if main is defined locally.  */
 	movl main@GOT(%ebx), %eax
 	pushl %eax
-# endif
+#  endif
 
 	/* Call the user's main function, and exit with its value.
 	   But let the libc call main.    */
 	call __libc_start_main@PLT
-#else
+# else
 	/* Push address of our own entry points to .fini and .init.  */
 	pushl $__libc_csu_fini
 	pushl $__libc_csu_init
@@ -124,6 +135,7 @@ ENTRY (_start)
 	/* Call the user's main function, and exit with its value.
 	   But let the libc call main.    */
 	call __libc_start_main
+# endif
 #endif
 
 	hlt			/* Crash if somehow `exit' does return.  */
@@ -132,8 +144,9 @@ ENTRY (_start)
 1:	movl	(%esp), %ebx
 	ret
 #endif
-END (_start)
+END (START)
 
+#ifndef LIBC_MAIN
 /* To fulfill the System V/i386 ABI we need this symbol.  Yuck, it's so
    meaningless since we don't support machines < 80386.  */
 	.section .rodata
@@ -149,3 +162,4 @@ __data_start:
 	.long 0
 	.weak data_start
 	data_start = __data_start
+#endif
diff --git a/sysdeps/x86/Makefile b/sysdeps/x86/Makefile
index beab426f67..22f3866b2d 100644
--- a/sysdeps/x86/Makefile
+++ b/sysdeps/x86/Makefile
@@ -1,9 +1,10 @@
 ifeq ($(subdir),csu)
 gen-as-const-headers += cpu-features-offsets.sym
+routines += libc-main
 endif
 
 ifeq ($(subdir),elf)
-sysdep-dl-routines += dl-get-cpu-features
+sysdep-dl-routines += dl-get-cpu-features dl-cacheinfo
 
 tests += tst-get-cpu-features tst-get-cpu-features-static
 tests-static += tst-get-cpu-features-static
@@ -141,3 +142,14 @@ $(objpfx)check-cet.out: $(..)sysdeps/x86/check-cet.awk \
 generated += check-cet.out
 endif
 endif
+
+ifeq ($(subdir)$(have-tunables)$(build-shared),elfyesyes)
+tests-special += $(objpfx)list-tunables.out
+generated += list-tunables.out
+
+$(objpfx)list-tunables.out:$(common-objpfx)elf/ld.so \
+  $(common-objpfx)libc.so
+	$(common-objpfx)elf/ld.so $(common-objpfx)libc.so \
+		--list-tunables > $@; \
+	$(evaluate-test)
+endif
diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index 311502dee3..8c4c7f9972 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -18,498 +18,9 @@
 
 #if IS_IN (libc)
 
-#include <assert.h>
-#include <stdbool.h>
-#include <stdlib.h>
 #include <unistd.h>
-#include <cpuid.h>
 #include <init-arch.h>
 
-static const struct intel_02_cache_info
-{
-  unsigned char idx;
-  unsigned char assoc;
-  unsigned char linesize;
-  unsigned char rel_name;
-  unsigned int size;
-} intel_02_known [] =
-  {
-#define M(sc) ((sc) - _SC_LEVEL1_ICACHE_SIZE)
-    { 0x06,  4, 32, M(_SC_LEVEL1_ICACHE_SIZE),    8192 },
-    { 0x08,  4, 32, M(_SC_LEVEL1_ICACHE_SIZE),   16384 },
-    { 0x09,  4, 32, M(_SC_LEVEL1_ICACHE_SIZE),   32768 },
-    { 0x0a,  2, 32, M(_SC_LEVEL1_DCACHE_SIZE),    8192 },
-    { 0x0c,  4, 32, M(_SC_LEVEL1_DCACHE_SIZE),   16384 },
-    { 0x0d,  4, 64, M(_SC_LEVEL1_DCACHE_SIZE),   16384 },
-    { 0x0e,  6, 64, M(_SC_LEVEL1_DCACHE_SIZE),   24576 },
-    { 0x21,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
-    { 0x22,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),   524288 },
-    { 0x23,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  1048576 },
-    { 0x25,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
-    { 0x29,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
-    { 0x2c,  8, 64, M(_SC_LEVEL1_DCACHE_SIZE),   32768 },
-    { 0x30,  8, 64, M(_SC_LEVEL1_ICACHE_SIZE),   32768 },
-    { 0x39,  4, 64, M(_SC_LEVEL2_CACHE_SIZE),   131072 },
-    { 0x3a,  6, 64, M(_SC_LEVEL2_CACHE_SIZE),   196608 },
-    { 0x3b,  2, 64, M(_SC_LEVEL2_CACHE_SIZE),   131072 },
-    { 0x3c,  4, 64, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
-    { 0x3d,  6, 64, M(_SC_LEVEL2_CACHE_SIZE),   393216 },
-    { 0x3e,  4, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
-    { 0x3f,  2, 64, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
-    { 0x41,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),   131072 },
-    { 0x42,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
-    { 0x43,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
-    { 0x44,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
-    { 0x45,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),  2097152 },
-    { 0x46,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
-    { 0x47,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  8388608 },
-    { 0x48, 12, 64, M(_SC_LEVEL2_CACHE_SIZE),  3145728 },
-    { 0x49, 16, 64, M(_SC_LEVEL2_CACHE_SIZE),  4194304 },
-    { 0x4a, 12, 64, M(_SC_LEVEL3_CACHE_SIZE),  6291456 },
-    { 0x4b, 16, 64, M(_SC_LEVEL3_CACHE_SIZE),  8388608 },
-    { 0x4c, 12, 64, M(_SC_LEVEL3_CACHE_SIZE), 12582912 },
-    { 0x4d, 16, 64, M(_SC_LEVEL3_CACHE_SIZE), 16777216 },
-    { 0x4e, 24, 64, M(_SC_LEVEL2_CACHE_SIZE),  6291456 },
-    { 0x60,  8, 64, M(_SC_LEVEL1_DCACHE_SIZE),   16384 },
-    { 0x66,  4, 64, M(_SC_LEVEL1_DCACHE_SIZE),    8192 },
-    { 0x67,  4, 64, M(_SC_LEVEL1_DCACHE_SIZE),   16384 },
-    { 0x68,  4, 64, M(_SC_LEVEL1_DCACHE_SIZE),   32768 },
-    { 0x78,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
-    { 0x79,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   131072 },
-    { 0x7a,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
-    { 0x7b,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
-    { 0x7c,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
-    { 0x7d,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),  2097152 },
-    { 0x7f,  2, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
-    { 0x80,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
-    { 0x82,  8, 32, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
-    { 0x83,  8, 32, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
-    { 0x84,  8, 32, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
-    { 0x85,  8, 32, M(_SC_LEVEL2_CACHE_SIZE),  2097152 },
-    { 0x86,  4, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
-    { 0x87,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
-    { 0xd0,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),   524288 },
-    { 0xd1,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),  1048576 },
-    { 0xd2,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
-    { 0xd6,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  1048576 },
-    { 0xd7,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
-    { 0xd8,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
-    { 0xdc, 12, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
-    { 0xdd, 12, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
-    { 0xde, 12, 64, M(_SC_LEVEL3_CACHE_SIZE),  8388608 },
-    { 0xe2, 16, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
-    { 0xe3, 16, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
-    { 0xe4, 16, 64, M(_SC_LEVEL3_CACHE_SIZE),  8388608 },
-    { 0xea, 24, 64, M(_SC_LEVEL3_CACHE_SIZE), 12582912 },
-    { 0xeb, 24, 64, M(_SC_LEVEL3_CACHE_SIZE), 18874368 },
-    { 0xec, 24, 64, M(_SC_LEVEL3_CACHE_SIZE), 25165824 },
-  };
-
-#define nintel_02_known (sizeof (intel_02_known) / sizeof (intel_02_known [0]))
-
-static int
-intel_02_known_compare (const void *p1, const void *p2)
-{
-  const struct intel_02_cache_info *i1;
-  const struct intel_02_cache_info *i2;
-
-  i1 = (const struct intel_02_cache_info *) p1;
-  i2 = (const struct intel_02_cache_info *) p2;
-
-  if (i1->idx == i2->idx)
-    return 0;
-
-  return i1->idx < i2->idx ? -1 : 1;
-}
-
-
-static long int
-__attribute__ ((noinline))
-intel_check_word (int name, unsigned int value, bool *has_level_2,
-		  bool *no_level_2_or_3,
-		  const struct cpu_features *cpu_features)
-{
-  if ((value & 0x80000000) != 0)
-    /* The register value is reserved.  */
-    return 0;
-
-  /* Fold the name.  The _SC_ constants are always in the order SIZE,
-     ASSOC, LINESIZE.  */
-  int folded_rel_name = (M(name) / 3) * 3;
-
-  while (value != 0)
-    {
-      unsigned int byte = value & 0xff;
-
-      if (byte == 0x40)
-	{
-	  *no_level_2_or_3 = true;
-
-	  if (folded_rel_name == M(_SC_LEVEL3_CACHE_SIZE))
-	    /* No need to look further.  */
-	    break;
-	}
-      else if (byte == 0xff)
-	{
-	  /* CPUID leaf 0x4 contains all the information.  We need to
-	     iterate over it.  */
-	  unsigned int eax;
-	  unsigned int ebx;
-	  unsigned int ecx;
-	  unsigned int edx;
-
-	  unsigned int round = 0;
-	  while (1)
-	    {
-	      __cpuid_count (4, round, eax, ebx, ecx, edx);
-
-	      enum { null = 0, data = 1, inst = 2, uni = 3 } type = eax & 0x1f;
-	      if (type == null)
-		/* That was the end.  */
-		break;
-
-	      unsigned int level = (eax >> 5) & 0x7;
-
-	      if ((level == 1 && type == data
-		   && folded_rel_name == M(_SC_LEVEL1_DCACHE_SIZE))
-		  || (level == 1 && type == inst
-		      && folded_rel_name == M(_SC_LEVEL1_ICACHE_SIZE))
-		  || (level == 2 && folded_rel_name == M(_SC_LEVEL2_CACHE_SIZE))
-		  || (level == 3 && folded_rel_name == M(_SC_LEVEL3_CACHE_SIZE))
-		  || (level == 4 && folded_rel_name == M(_SC_LEVEL4_CACHE_SIZE)))
-		{
-		  unsigned int offset = M(name) - folded_rel_name;
-
-		  if (offset == 0)
-		    /* Cache size.  */
-		    return (((ebx >> 22) + 1)
-			    * (((ebx >> 12) & 0x3ff) + 1)
-			    * ((ebx & 0xfff) + 1)
-			    * (ecx + 1));
-		  if (offset == 1)
-		    return (ebx >> 22) + 1;
-
-		  assert (offset == 2);
-		  return (ebx & 0xfff) + 1;
-		}
-
-	      ++round;
-	    }
-	  /* There is no other cache information anywhere else.  */
-	  break;
-	}
-      else
-	{
-	  if (byte == 0x49 && folded_rel_name == M(_SC_LEVEL3_CACHE_SIZE))
-	    {
-	      /* Intel reused this value.  For family 15, model 6 it
-		 specifies the 3rd level cache.  Otherwise the 2nd
-		 level cache.  */
-	      unsigned int family = cpu_features->basic.family;
-	      unsigned int model = cpu_features->basic.model;
-
-	      if (family == 15 && model == 6)
-		{
-		  /* The level 3 cache is encoded for this model like
-		     the level 2 cache is for other models.  Pretend
-		     the caller asked for the level 2 cache.  */
-		  name = (_SC_LEVEL2_CACHE_SIZE
-			  + (name - _SC_LEVEL3_CACHE_SIZE));
-		  folded_rel_name = M(_SC_LEVEL2_CACHE_SIZE);
-		}
-	    }
-
-	  struct intel_02_cache_info *found;
-	  struct intel_02_cache_info search;
-
-	  search.idx = byte;
-	  found = bsearch (&search, intel_02_known, nintel_02_known,
-			   sizeof (intel_02_known[0]), intel_02_known_compare);
-	  if (found != NULL)
-	    {
-	      if (found->rel_name == folded_rel_name)
-		{
-		  unsigned int offset = M(name) - folded_rel_name;
-
-		  if (offset == 0)
-		    /* Cache size.  */
-		    return found->size;
-		  if (offset == 1)
-		    return found->assoc;
-
-		  assert (offset == 2);
-		  return found->linesize;
-		}
-
-	      if (found->rel_name == M(_SC_LEVEL2_CACHE_SIZE))
-		*has_level_2 = true;
-	    }
-	}
-
-      /* Next byte for the next round.  */
-      value >>= 8;
-    }
-
-  /* Nothing found.  */
-  return 0;
-}
-
-
-static long int __attribute__ ((noinline))
-handle_intel (int name, const struct cpu_features *cpu_features)
-{
-  unsigned int maxidx = cpu_features->basic.max_cpuid;
-
-  /* Return -1 for older CPUs.  */
-  if (maxidx < 2)
-    return -1;
-
-  /* OK, we can use the CPUID instruction to get all info about the
-     caches.  */
-  unsigned int cnt = 0;
-  unsigned int max = 1;
-  long int result = 0;
-  bool no_level_2_or_3 = false;
-  bool has_level_2 = false;
-
-  while (cnt++ < max)
-    {
-      unsigned int eax;
-      unsigned int ebx;
-      unsigned int ecx;
-      unsigned int edx;
-      __cpuid (2, eax, ebx, ecx, edx);
-
-      /* The low byte of EAX in the first round contain the number of
-	 rounds we have to make.  At least one, the one we are already
-	 doing.  */
-      if (cnt == 1)
-	{
-	  max = eax & 0xff;
-	  eax &= 0xffffff00;
-	}
-
-      /* Process the individual registers' value.  */
-      result = intel_check_word (name, eax, &has_level_2,
-				 &no_level_2_or_3, cpu_features);
-      if (result != 0)
-	return result;
-
-      result = intel_check_word (name, ebx, &has_level_2,
-				 &no_level_2_or_3, cpu_features);
-      if (result != 0)
-	return result;
-
-      result = intel_check_word (name, ecx, &has_level_2,
-				 &no_level_2_or_3, cpu_features);
-      if (result != 0)
-	return result;
-
-      result = intel_check_word (name, edx, &has_level_2,
-				 &no_level_2_or_3, cpu_features);
-      if (result != 0)
-	return result;
-    }
-
-  if (name >= _SC_LEVEL2_CACHE_SIZE && name <= _SC_LEVEL3_CACHE_LINESIZE
-      && no_level_2_or_3)
-    return -1;
-
-  return 0;
-}
-
-
-static long int __attribute__ ((noinline))
-handle_amd (int name)
-{
-  unsigned int eax;
-  unsigned int ebx;
-  unsigned int ecx;
-  unsigned int edx;
-  __cpuid (0x80000000, eax, ebx, ecx, edx);
-
-  /* No level 4 cache (yet).  */
-  if (name > _SC_LEVEL3_CACHE_LINESIZE)
-    return 0;
-
-  unsigned int fn = 0x80000005 + (name >= _SC_LEVEL2_CACHE_SIZE);
-  if (eax < fn)
-    return 0;
-
-  __cpuid (fn, eax, ebx, ecx, edx);
-
-  if (name < _SC_LEVEL1_DCACHE_SIZE)
-    {
-      name += _SC_LEVEL1_DCACHE_SIZE - _SC_LEVEL1_ICACHE_SIZE;
-      ecx = edx;
-    }
-
-  switch (name)
-    {
-    case _SC_LEVEL1_DCACHE_SIZE:
-      return (ecx >> 14) & 0x3fc00;
-
-    case _SC_LEVEL1_DCACHE_ASSOC:
-      ecx >>= 16;
-      if ((ecx & 0xff) == 0xff)
-	/* Fully associative.  */
-	return (ecx << 2) & 0x3fc00;
-      return ecx & 0xff;
-
-    case _SC_LEVEL1_DCACHE_LINESIZE:
-      return ecx & 0xff;
-
-    case _SC_LEVEL2_CACHE_SIZE:
-      return (ecx & 0xf000) == 0 ? 0 : (ecx >> 6) & 0x3fffc00;
-
-    case _SC_LEVEL2_CACHE_ASSOC:
-      switch ((ecx >> 12) & 0xf)
-	{
-	case 0:
-	case 1:
-	case 2:
-	case 4:
-	  return (ecx >> 12) & 0xf;
-	case 6:
-	  return 8;
-	case 8:
-	  return 16;
-	case 10:
-	  return 32;
-	case 11:
-	  return 48;
-	case 12:
-	  return 64;
-	case 13:
-	  return 96;
-	case 14:
-	  return 128;
-	case 15:
-	  return ((ecx >> 6) & 0x3fffc00) / (ecx & 0xff);
-	default:
-	  return 0;
-	}
-      /* NOTREACHED */
-
-    case _SC_LEVEL2_CACHE_LINESIZE:
-      return (ecx & 0xf000) == 0 ? 0 : ecx & 0xff;
-
-    case _SC_LEVEL3_CACHE_SIZE:
-      return (edx & 0xf000) == 0 ? 0 : (edx & 0x3ffc0000) << 1;
-
-    case _SC_LEVEL3_CACHE_ASSOC:
-      switch ((edx >> 12) & 0xf)
-	{
-	case 0:
-	case 1:
-	case 2:
-	case 4:
-	  return (edx >> 12) & 0xf;
-	case 6:
-	  return 8;
-	case 8:
-	  return 16;
-	case 10:
-	  return 32;
-	case 11:
-	  return 48;
-	case 12:
-	  return 64;
-	case 13:
-	  return 96;
-	case 14:
-	  return 128;
-	case 15:
-	  return ((edx & 0x3ffc0000) << 1) / (edx & 0xff);
-	default:
-	  return 0;
-	}
-      /* NOTREACHED */
-
-    case _SC_LEVEL3_CACHE_LINESIZE:
-      return (edx & 0xf000) == 0 ? 0 : edx & 0xff;
-
-    default:
-      assert (! "cannot happen");
-    }
-  return -1;
-}
-
-
-static long int __attribute__ ((noinline))
-handle_zhaoxin (int name)
-{
-  unsigned int eax;
-  unsigned int ebx;
-  unsigned int ecx;
-  unsigned int edx;
-
-  int folded_rel_name = (M(name) / 3) * 3;
-
-  unsigned int round = 0;
-  while (1)
-    {
-      __cpuid_count (4, round, eax, ebx, ecx, edx);
-
-      enum { null = 0, data = 1, inst = 2, uni = 3 } type = eax & 0x1f;
-      if (type == null)
-        break;
-
-      unsigned int level = (eax >> 5) & 0x7;
-
-      if ((level == 1 && type == data
-        && folded_rel_name == M(_SC_LEVEL1_DCACHE_SIZE))
-        || (level == 1 && type == inst
-            && folded_rel_name == M(_SC_LEVEL1_ICACHE_SIZE))
-        || (level == 2 && folded_rel_name == M(_SC_LEVEL2_CACHE_SIZE))
-        || (level == 3 && folded_rel_name == M(_SC_LEVEL3_CACHE_SIZE)))
-        {
-          unsigned int offset = M(name) - folded_rel_name;
-
-          if (offset == 0)
-            /* Cache size.  */
-            return (((ebx >> 22) + 1)
-                * (((ebx >> 12) & 0x3ff) + 1)
-                * ((ebx & 0xfff) + 1)
-                * (ecx + 1));
-          if (offset == 1)
-            return (ebx >> 22) + 1;
-
-          assert (offset == 2);
-          return (ebx & 0xfff) + 1;
-        }
-
-      ++round;
-    }
-
-  /* Nothing found.  */
-  return 0;
-}
-
-
-/* Get the value of the system variable NAME.  */
-long int
-attribute_hidden
-__cache_sysconf (int name)
-{
-  const struct cpu_features *cpu_features = __get_cpu_features ();
-
-  if (cpu_features->basic.kind == arch_kind_intel)
-    return handle_intel (name, cpu_features);
-
-  if (cpu_features->basic.kind == arch_kind_amd)
-    return handle_amd (name);
-
-  if (cpu_features->basic.kind == arch_kind_zhaoxin)
-    return handle_zhaoxin (name);
-
-  // XXX Fill in more vendors.
-
-  /* CPU not known, we have no information.  */
-  return 0;
-}
-
-
 /* Data cache size for use in memory and string routines, typically
    L1 size, rounded to multiple of 256 bytes.  */
 long int __x86_data_cache_size_half attribute_hidden = 32 * 1024 / 2;
@@ -530,348 +41,85 @@ long int __x86_raw_shared_cache_size attribute_hidden = 1024 * 1024;
 /* Threshold to use non temporal store.  */
 long int __x86_shared_non_temporal_threshold attribute_hidden;
 
-#ifndef DISABLE_PREFETCHW
+#ifndef __x86_64__
 /* PREFETCHW support flag for use in memory and string routines.  */
 int __x86_prefetchw attribute_hidden;
 #endif
 
-
-static void
-get_common_cache_info (long int *shared_ptr, unsigned int *threads_ptr,
-                long int core)
+/* Get the value of the system variable NAME.  */
+long int
+attribute_hidden
+__cache_sysconf (int name)
 {
-  unsigned int eax;
-  unsigned int ebx;
-  unsigned int ecx;
-  unsigned int edx;
-
-  /* Number of logical processors sharing L2 cache.  */
-  int threads_l2;
-
-  /* Number of logical processors sharing L3 cache.  */
-  int threads_l3;
-
   const struct cpu_features *cpu_features = __get_cpu_features ();
-  int max_cpuid = cpu_features->basic.max_cpuid;
-  unsigned int family = cpu_features->basic.family;
-  unsigned int model = cpu_features->basic.model;
-  long int shared = *shared_ptr;
-  unsigned int threads = *threads_ptr;
-  bool inclusive_cache = true;
-  bool support_count_mask = true;
-
-  /* Try L3 first.  */
-  unsigned int level = 3;
-
-  if (cpu_features->basic.kind == arch_kind_zhaoxin && family == 6)
-    support_count_mask = false;
-
-  if (shared <= 0)
-    {
-      /* Try L2 otherwise.  */
-      level  = 2;
-      shared = core;
-      threads_l2 = 0;
-      threads_l3 = -1;
-    }
-  else
-    {
-      threads_l2 = 0;
-      threads_l3 = 0;
-    }
-
-  /* A value of 0 for the HTT bit indicates there is only a single
-     logical processor.  */
-  if (HAS_CPU_FEATURE (HTT))
+  switch (name)
     {
-      /* Figure out the number of logical threads that share the
-         highest cache level.  */
-      if (max_cpuid >= 4)
-        {
-          int i = 0;
-
-          /* Query until cache level 2 and 3 are enumerated.  */
-          int check = 0x1 | (threads_l3 == 0) << 1;
-          do
-            {
-              __cpuid_count (4, i++, eax, ebx, ecx, edx);
+    case _SC_LEVEL1_ICACHE_SIZE:
+      return cpu_features->level1_icache_size;
 
-              /* There seems to be a bug in at least some Pentium Ds
-                 which sometimes fail to iterate all cache parameters.
-                 Do not loop indefinitely here, stop in this case and
-                 assume there is no such information.  */
-              if (cpu_features->basic.kind == arch_kind_intel
-                  && (eax & 0x1f) == 0 )
-                goto intel_bug_no_cache_info;
+    case _SC_LEVEL1_DCACHE_SIZE:
+      return cpu_features->level1_dcache_size;
 
-              switch ((eax >> 5) & 0x7)
-                {
-                  default:
-                    break;
-                  case 2:
-                    if ((check & 0x1))
-                      {
-                        /* Get maximum number of logical processors
-                           sharing L2 cache.  */
-                        threads_l2 = (eax >> 14) & 0x3ff;
-                        check &= ~0x1;
-                      }
-                    break;
-                  case 3:
-                    if ((check & (0x1 << 1)))
-                      {
-                        /* Get maximum number of logical processors
-                           sharing L3 cache.  */
-                        threads_l3 = (eax >> 14) & 0x3ff;
+    case _SC_LEVEL1_DCACHE_ASSOC:
+      return cpu_features->level1_dcache_assoc;
 
-                        /* Check if L2 and L3 caches are inclusive.  */
-                        inclusive_cache = (edx & 0x2) != 0;
-                        check &= ~(0x1 << 1);
-                      }
-                    break;
-                }
-            }
-          while (check);
+    case _SC_LEVEL1_DCACHE_LINESIZE:
+      return cpu_features->level1_dcache_linesize;
 
-          /* If max_cpuid >= 11, THREADS_L2/THREADS_L3 are the maximum
-             numbers of addressable IDs for logical processors sharing
-             the cache, instead of the maximum number of threads
-             sharing the cache.  */
-          if (max_cpuid >= 11 && support_count_mask)
-            {
-              /* Find the number of logical processors shipped in
-                 one core and apply count mask.  */
-              i = 0;
+    case _SC_LEVEL2_CACHE_SIZE:
+      return cpu_features->level2_cache_size;
 
-              /* Count SMT only if there is L3 cache.  Always count
-                 core if there is no L3 cache.  */
-              int count = ((threads_l2 > 0 && level == 3)
-                           | ((threads_l3 > 0
-                               || (threads_l2 > 0 && level == 2)) << 1));
+    case _SC_LEVEL2_CACHE_ASSOC:
+      return cpu_features->level2_cache_assoc;
 
-              while (count)
-                {
-                  __cpuid_count (11, i++, eax, ebx, ecx, edx);
+    case _SC_LEVEL2_CACHE_LINESIZE:
+      return cpu_features->level2_cache_linesize;
 
-                  int shipped = ebx & 0xff;
-                  int type = ecx & 0xff00;
-                  if (shipped == 0 || type == 0)
-                    break;
-                  else if (type == 0x100)
-                    {
-                      /* Count SMT.  */
-                      if ((count & 0x1))
-                        {
-                          int count_mask;
+    case _SC_LEVEL3_CACHE_SIZE:
+      return cpu_features->level3_cache_size;
 
-                          /* Compute count mask.  */
-                          asm ("bsr %1, %0"
-                               : "=r" (count_mask) : "g" (threads_l2));
-                          count_mask = ~(-1 << (count_mask + 1));
-                          threads_l2 = (shipped - 1) & count_mask;
-                          count &= ~0x1;
-                        }
-                    }
-                  else if (type == 0x200)
-                    {
-                      /* Count core.  */
-                      if ((count & (0x1 << 1)))
-                        {
-                          int count_mask;
-                          int threads_core
-                            = (level == 2 ? threads_l2 : threads_l3);
+    case _SC_LEVEL3_CACHE_ASSOC:
+      return cpu_features->level3_cache_assoc;
 
-                          /* Compute count mask.  */
-                          asm ("bsr %1, %0"
-                               : "=r" (count_mask) : "g" (threads_core));
-                          count_mask = ~(-1 << (count_mask + 1));
-                          threads_core = (shipped - 1) & count_mask;
-                          if (level == 2)
-                            threads_l2 = threads_core;
-                          else
-                            threads_l3 = threads_core;
-                          count &= ~(0x1 << 1);
-                        }
-                    }
-                }
-            }
-          if (threads_l2 > 0)
-            threads_l2 += 1;
-          if (threads_l3 > 0)
-            threads_l3 += 1;
-          if (level == 2)
-            {
-              if (threads_l2)
-                {
-                  threads = threads_l2;
-                  if (cpu_features->basic.kind == arch_kind_intel
-                      && threads > 2
-                      && family == 6)
-                    switch (model)
-                      {
-                        case 0x37:
-                        case 0x4a:
-                        case 0x4d:
-                        case 0x5a:
-                        case 0x5d:
-                          /* Silvermont has L2 cache shared by 2 cores.  */
-                          threads = 2;
-                          break;
-                        default:
-                          break;
-                      }
-                }
-            }
-          else if (threads_l3)
-            threads = threads_l3;
-        }
-      else
-        {
-intel_bug_no_cache_info:
-          /* Assume that all logical threads share the highest cache
-             level.  */
-          threads
-            = ((cpu_features->cpuid[COMMON_CPUID_INDEX_1].ebx
-                >> 16) & 0xff);
-        }
+    case _SC_LEVEL3_CACHE_LINESIZE:
+      return cpu_features->level3_cache_linesize;
 
-        /* Cap usage of highest cache level to the number of supported
-           threads.  */
-        if (shared > 0 && threads > 0)
-          shared /= threads;
-    }
+    case _SC_LEVEL4_CACHE_SIZE:
+      return cpu_features->level4_cache_size;
 
-  /* Account for non-inclusive L2 and L3 caches.  */
-  if (!inclusive_cache)
-    {
-      if (threads_l2 > 0)
-        core /= threads_l2;
-      shared += core;
+    default:
+      break;
     }
-
-  *shared_ptr = shared;
-  *threads_ptr = threads;
+  return -1;
 }
 
-
 static void
 __attribute__((constructor))
 init_cacheinfo (void)
 {
-  /* Find out what brand of processor.  */
-  unsigned int ebx;
-  unsigned int ecx;
-  unsigned int edx;
-  int max_cpuid_ex;
-  long int data = -1;
-  long int shared = -1;
-  long int core;
-  unsigned int threads = 0;
   const struct cpu_features *cpu_features = __get_cpu_features ();
+  long int data = cpu_features->data_cache_size;
+  __x86_raw_data_cache_size_half = data / 2;
+  __x86_raw_data_cache_size = data;
+  /* Round data cache size to multiple of 256 bytes.  */
+  data = data & ~255L;
+  __x86_data_cache_size_half = data / 2;
+  __x86_data_cache_size = data;
+
+  long int shared = cpu_features->shared_cache_size;
+  __x86_raw_shared_cache_size_half = shared / 2;
+  __x86_raw_shared_cache_size = shared;
+  /* Round shared cache size to multiple of 256 bytes.  */
+  shared = shared & ~255L;
+  __x86_shared_cache_size_half = shared / 2;
+  __x86_shared_cache_size = shared;
 
-  if (cpu_features->basic.kind == arch_kind_intel)
-    {
-      data = handle_intel (_SC_LEVEL1_DCACHE_SIZE, cpu_features);
-      core = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
-      shared = handle_intel (_SC_LEVEL3_CACHE_SIZE, cpu_features);
-
-      get_common_cache_info (&shared, &threads, core);
-    }
-  else if (cpu_features->basic.kind == arch_kind_zhaoxin)
-    {
-      data = handle_zhaoxin (_SC_LEVEL1_DCACHE_SIZE);
-      core = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
-      shared = handle_zhaoxin (_SC_LEVEL3_CACHE_SIZE);
-
-      get_common_cache_info (&shared, &threads, core);
-    }
-  else if (cpu_features->basic.kind == arch_kind_amd)
-    {
-      data   = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
-      long int core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
-      shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
-
-      /* Get maximum extended function. */
-      __cpuid (0x80000000, max_cpuid_ex, ebx, ecx, edx);
-
-      if (shared <= 0)
-	/* No shared L3 cache.  All we have is the L2 cache.  */
-	shared = core;
-      else
-	{
-	  /* Figure out the number of logical threads that share L3.  */
-	  if (max_cpuid_ex >= 0x80000008)
-	    {
-	      /* Get width of APIC ID.  */
-	      __cpuid (0x80000008, max_cpuid_ex, ebx, ecx, edx);
-	      threads = 1 << ((ecx >> 12) & 0x0f);
-	    }
-
-	  if (threads == 0)
-	    {
-	      /* If APIC ID width is not available, use logical
-		 processor count.  */
-	      __cpuid (0x00000001, max_cpuid_ex, ebx, ecx, edx);
-
-	      if ((edx & (1 << 28)) != 0)
-		threads = (ebx >> 16) & 0xff;
-	    }
-
-	  /* Cap usage of highest cache level to the number of
-	     supported threads.  */
-	  if (threads > 0)
-	    shared /= threads;
-
-	  /* Account for exclusive L2 and L3 caches.  */
-	  shared += core;
-	}
+  __x86_shared_non_temporal_threshold
+    = cpu_features->non_temporal_threshold;
 
-#ifndef DISABLE_PREFETCHW
-      if (max_cpuid_ex >= 0x80000001)
-	{
-	  unsigned int eax;
-	  __cpuid (0x80000001, eax, ebx, ecx, edx);
-	  /*  PREFETCHW     || 3DNow!  */
-	  if ((ecx & 0x100) || (edx & 0x80000000))
-	    __x86_prefetchw = -1;
-	}
+#ifndef __x86_64__
+  __x86_prefetchw = cpu_features->prefetchw;
 #endif
-    }
-
-  if (cpu_features->data_cache_size != 0)
-    data = cpu_features->data_cache_size;
-
-  if (data > 0)
-    {
-      __x86_raw_data_cache_size_half = data / 2;
-      __x86_raw_data_cache_size = data;
-      /* Round data cache size to multiple of 256 bytes.  */
-      data = data & ~255L;
-      __x86_data_cache_size_half = data / 2;
-      __x86_data_cache_size = data;
-    }
-
-  if (cpu_features->shared_cache_size != 0)
-    shared = cpu_features->shared_cache_size;
-
-  if (shared > 0)
-    {
-      __x86_raw_shared_cache_size_half = shared / 2;
-      __x86_raw_shared_cache_size = shared;
-      /* Round shared cache size to multiple of 256 bytes.  */
-      shared = shared & ~255L;
-      __x86_shared_cache_size_half = shared / 2;
-      __x86_shared_cache_size = shared;
-    }
-
-  /* The large memcpy micro benchmark in glibc shows that 6 times of
-     shared cache size is the approximate value above which non-temporal
-     store becomes faster on a 8-core processor.  This is the 3/4 of the
-     total shared cache size.  */
-  __x86_shared_non_temporal_threshold
-    = (cpu_features->non_temporal_threshold != 0
-       ? cpu_features->non_temporal_threshold
-       : __x86_shared_cache_size * threads * 3 / 4);
 }
 
 #endif
diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index 916bbf5242..3d1596bd89 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -19,6 +19,7 @@
 #include <cpuid.h>
 #include <cpu-features.h>
 #include <dl-hwcap.h>
+#include <init-arch.h>
 #include <libc-pointer-arith.h>
 
 #if HAVE_TUNABLES
@@ -560,20 +561,14 @@ no_cpuid:
   cpu_features->basic.model = model;
   cpu_features->basic.stepping = stepping;
 
+  __init_cacheinfo ();
+
 #if HAVE_TUNABLES
   TUNABLE_GET (hwcaps, tunable_val_t *, TUNABLE_CALLBACK (set_hwcaps));
-  cpu_features->non_temporal_threshold
-    = TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
-  cpu_features->data_cache_size
-    = TUNABLE_GET (x86_data_cache_size, long int, NULL);
-  cpu_features->shared_cache_size
-    = TUNABLE_GET (x86_shared_cache_size, long int, NULL);
-#endif
-
-  /* Reuse dl_platform, dl_hwcap and dl_hwcap_mask for x86.  */
-#if !HAVE_TUNABLES && defined SHARED
-  /* The glibc.cpu.hwcap_mask tunable is initialized already, so no need to do
-     this.  */
+#elif defined SHARED
+  /* Reuse dl_platform, dl_hwcap and dl_hwcap_mask for x86.  The
+     glibc.cpu.hwcap_mask tunable is initialized already, so no
+     need to do this.  */
   GLRO(dl_hwcap_mask) = HWCAP_IMPORTANT;
 #endif
 
diff --git a/sysdeps/x86/cpu-features.h b/sysdeps/x86/cpu-features.h
index f05d5ce158..636b270e3b 100644
--- a/sysdeps/x86/cpu-features.h
+++ b/sysdeps/x86/cpu-features.h
@@ -91,6 +91,32 @@ struct cpu_features
   unsigned long int shared_cache_size;
   /* Threshold to use non temporal store.  */
   unsigned long int non_temporal_threshold;
+  /* _SC_LEVEL1_ICACHE_SIZE.  */
+  unsigned long int level1_icache_size;
+  /* _SC_LEVEL1_DCACHE_SIZE.  */
+  unsigned long int level1_dcache_size;
+  /* _SC_LEVEL1_DCACHE_ASSOC.  */
+  unsigned long int level1_dcache_assoc;
+  /* _SC_LEVEL1_DCACHE_LINESIZE.  */
+  unsigned long int level1_dcache_linesize;
+  /* _SC_LEVEL2_CACHE_SIZE.  */
+  unsigned long int level2_cache_size;
+  /* _SC_LEVEL2_CACHE_ASSOC.  */
+  unsigned long int level2_cache_assoc;
+  /* _SC_LEVEL2_CACHE_LINESIZE.  */
+  unsigned long int level2_cache_linesize;
+  /* _SC_LEVEL3_CACHE_SIZE.  */
+  unsigned long int level3_cache_size;
+  /* _SC_LEVEL3_CACHE_ASSOC.  */
+  unsigned long int level3_cache_assoc;
+  /* _SC_LEVEL3_CACHE_LINESIZE.  */
+  unsigned long int level3_cache_linesize;
+  /* _SC_LEVEL4_CACHE_SIZE.  */
+  unsigned long int level4_cache_size;
+#ifndef __x86_64__
+  /* PREFETCHW support flag for use in memory and string routines.  */
+  unsigned long int prefetchw;
+#endif
 };
 
 /* Used from outside of glibc to get access to the CPU features
diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/dl-cacheinfo.c
similarity index 83%
copy from sysdeps/x86/cacheinfo.c
copy to sysdeps/x86/dl-cacheinfo.c
index 311502dee3..aa059dc99b 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/dl-cacheinfo.c
@@ -1,5 +1,5 @@
-/* x86_64 cache info.
-   Copyright (C) 2003-2020 Free Software Foundation, Inc.
+/* x86 cache info.
+   Copyright (C) 2020 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -16,14 +16,16 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
-#if IS_IN (libc)
-
 #include <assert.h>
 #include <stdbool.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <cpuid.h>
 #include <init-arch.h>
+#if HAVE_TUNABLES
+# define TUNABLE_NAMESPACE cpu
+# include <elf/dl-tunables.h>
+#endif
 
 static const struct intel_02_cache_info
 {
@@ -487,55 +489,6 @@ handle_zhaoxin (int name)
 }
 
 
-/* Get the value of the system variable NAME.  */
-long int
-attribute_hidden
-__cache_sysconf (int name)
-{
-  const struct cpu_features *cpu_features = __get_cpu_features ();
-
-  if (cpu_features->basic.kind == arch_kind_intel)
-    return handle_intel (name, cpu_features);
-
-  if (cpu_features->basic.kind == arch_kind_amd)
-    return handle_amd (name);
-
-  if (cpu_features->basic.kind == arch_kind_zhaoxin)
-    return handle_zhaoxin (name);
-
-  // XXX Fill in more vendors.
-
-  /* CPU not known, we have no information.  */
-  return 0;
-}
-
-
-/* Data cache size for use in memory and string routines, typically
-   L1 size, rounded to multiple of 256 bytes.  */
-long int __x86_data_cache_size_half attribute_hidden = 32 * 1024 / 2;
-long int __x86_data_cache_size attribute_hidden = 32 * 1024;
-/* Similar to __x86_data_cache_size_half, but not rounded.  */
-long int __x86_raw_data_cache_size_half attribute_hidden = 32 * 1024 / 2;
-/* Similar to __x86_data_cache_size, but not rounded.  */
-long int __x86_raw_data_cache_size attribute_hidden = 32 * 1024;
-/* Shared cache size for use in memory and string routines, typically
-   L2 or L3 size, rounded to multiple of 256 bytes.  */
-long int __x86_shared_cache_size_half attribute_hidden = 1024 * 1024 / 2;
-long int __x86_shared_cache_size attribute_hidden = 1024 * 1024;
-/* Similar to __x86_shared_cache_size_half, but not rounded.  */
-long int __x86_raw_shared_cache_size_half attribute_hidden = 1024 * 1024 / 2;
-/* Similar to __x86_shared_cache_size, but not rounded.  */
-long int __x86_raw_shared_cache_size attribute_hidden = 1024 * 1024;
-
-/* Threshold to use non temporal store.  */
-long int __x86_shared_non_temporal_threshold attribute_hidden;
-
-#ifndef DISABLE_PREFETCHW
-/* PREFETCHW support flag for use in memory and string routines.  */
-int __x86_prefetchw attribute_hidden;
-#endif
-
-
 static void
 get_common_cache_info (long int *shared_ptr, unsigned int *threads_ptr,
                 long int core)
@@ -753,10 +706,8 @@ intel_bug_no_cache_info:
   *threads_ptr = threads;
 }
 
-
-static void
-__attribute__((constructor))
-init_cacheinfo (void)
+void
+__init_cacheinfo (void)
 {
   /* Find out what brand of processor.  */
   unsigned int ebx;
@@ -767,7 +718,18 @@ init_cacheinfo (void)
   long int shared = -1;
   long int core;
   unsigned int threads = 0;
-  const struct cpu_features *cpu_features = __get_cpu_features ();
+  unsigned long int level1_icache_size = -1;
+  unsigned long int level1_dcache_size = -1;
+  unsigned long int level1_dcache_assoc = -1;
+  unsigned long int level1_dcache_linesize = -1;
+  unsigned long int level2_cache_size = -1;
+  unsigned long int level2_cache_assoc = -1;
+  unsigned long int level2_cache_linesize = -1;
+  unsigned long int level3_cache_size = -1;
+  unsigned long int level3_cache_assoc = -1;
+  unsigned long int level3_cache_linesize = -1;
+  unsigned long int level4_cache_size = -1;
+  struct cpu_features *cpu_features = __get_cpu_features ();
 
   if (cpu_features->basic.kind == arch_kind_intel)
     {
@@ -775,6 +737,26 @@ init_cacheinfo (void)
       core = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
       shared = handle_intel (_SC_LEVEL3_CACHE_SIZE, cpu_features);
 
+      level1_icache_size
+	= handle_intel (_SC_LEVEL1_ICACHE_SIZE, cpu_features);
+      level1_dcache_size = data;
+      level1_dcache_assoc
+	= handle_intel (_SC_LEVEL1_DCACHE_ASSOC, cpu_features);
+      level1_dcache_linesize
+	= handle_intel (_SC_LEVEL1_DCACHE_LINESIZE, cpu_features);
+      level2_cache_size = core;
+      level2_cache_assoc
+	= handle_intel (_SC_LEVEL2_CACHE_ASSOC, cpu_features);
+      level2_cache_linesize
+	= handle_intel (_SC_LEVEL2_CACHE_LINESIZE, cpu_features);
+      level3_cache_size = shared;
+      level3_cache_assoc
+	= handle_intel (_SC_LEVEL3_CACHE_ASSOC, cpu_features);
+      level3_cache_linesize
+	= handle_intel (_SC_LEVEL3_CACHE_LINESIZE, cpu_features);
+      level4_cache_size
+	= handle_intel (_SC_LEVEL4_CACHE_SIZE, cpu_features);
+
       get_common_cache_info (&shared, &threads, core);
     }
   else if (cpu_features->basic.kind == arch_kind_zhaoxin)
@@ -783,14 +765,36 @@ init_cacheinfo (void)
       core = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
       shared = handle_zhaoxin (_SC_LEVEL3_CACHE_SIZE);
 
+      level1_icache_size = handle_zhaoxin (_SC_LEVEL1_ICACHE_SIZE);
+      level1_dcache_size = data;
+      level1_dcache_assoc = handle_zhaoxin (_SC_LEVEL1_DCACHE_ASSOC);
+      level1_dcache_linesize = handle_zhaoxin (_SC_LEVEL1_DCACHE_LINESIZE);
+      level2_cache_size = core;
+      level2_cache_assoc = handle_zhaoxin (_SC_LEVEL2_CACHE_ASSOC);
+      level2_cache_linesize = handle_zhaoxin (_SC_LEVEL2_CACHE_LINESIZE);
+      level3_cache_size = shared;
+      level3_cache_assoc = handle_zhaoxin (_SC_LEVEL3_CACHE_ASSOC);
+      level3_cache_linesize = handle_zhaoxin (_SC_LEVEL3_CACHE_LINESIZE);
+
       get_common_cache_info (&shared, &threads, core);
     }
   else if (cpu_features->basic.kind == arch_kind_amd)
     {
-      data   = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
-      long int core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
+      data = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
+      core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
       shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
 
+      level1_icache_size = handle_amd (_SC_LEVEL1_ICACHE_SIZE);
+      level1_dcache_size = data;
+      level1_dcache_assoc = handle_amd (_SC_LEVEL1_DCACHE_ASSOC);
+      level1_dcache_linesize = handle_amd (_SC_LEVEL1_DCACHE_LINESIZE);
+      level2_cache_size = core;
+      level2_cache_assoc = handle_amd (_SC_LEVEL2_CACHE_ASSOC);
+      level2_cache_linesize = handle_amd (_SC_LEVEL2_CACHE_LINESIZE);
+      level3_cache_size = shared;
+      level3_cache_assoc = handle_amd (_SC_LEVEL3_CACHE_ASSOC);
+      level3_cache_linesize = handle_amd (_SC_LEVEL3_CACHE_LINESIZE);
+
       /* Get maximum extended function. */
       __cpuid (0x80000000, max_cpuid_ex, ebx, ecx, edx);
 
@@ -826,52 +830,62 @@ init_cacheinfo (void)
 	  shared += core;
 	}
 
-#ifndef DISABLE_PREFETCHW
+#ifndef __x86_64__
       if (max_cpuid_ex >= 0x80000001)
 	{
 	  unsigned int eax;
 	  __cpuid (0x80000001, eax, ebx, ecx, edx);
 	  /*  PREFETCHW     || 3DNow!  */
 	  if ((ecx & 0x100) || (edx & 0x80000000))
-	    __x86_prefetchw = -1;
+	    cpu_features->prefetchw = -1;
 	}
 #endif
     }
 
-  if (cpu_features->data_cache_size != 0)
-    data = cpu_features->data_cache_size;
-
-  if (data > 0)
-    {
-      __x86_raw_data_cache_size_half = data / 2;
-      __x86_raw_data_cache_size = data;
-      /* Round data cache size to multiple of 256 bytes.  */
-      data = data & ~255L;
-      __x86_data_cache_size_half = data / 2;
-      __x86_data_cache_size = data;
-    }
-
-  if (cpu_features->shared_cache_size != 0)
-    shared = cpu_features->shared_cache_size;
-
-  if (shared > 0)
-    {
-      __x86_raw_shared_cache_size_half = shared / 2;
-      __x86_raw_shared_cache_size = shared;
-      /* Round shared cache size to multiple of 256 bytes.  */
-      shared = shared & ~255L;
-      __x86_shared_cache_size_half = shared / 2;
-      __x86_shared_cache_size = shared;
-    }
+  cpu_features->level1_icache_size = level1_icache_size;
+  cpu_features->level1_dcache_size = level1_dcache_size;
+  cpu_features->level1_dcache_assoc = level1_dcache_assoc;
+  cpu_features->level1_dcache_linesize = level1_dcache_linesize;
+  cpu_features->level2_cache_size = level2_cache_size;
+  cpu_features->level2_cache_assoc = level2_cache_assoc;
+  cpu_features->level2_cache_linesize = level2_cache_linesize;
+  cpu_features->level3_cache_size = level3_cache_size;
+  cpu_features->level3_cache_assoc = level3_cache_assoc;
+  cpu_features->level3_cache_linesize = level3_cache_linesize;
+  cpu_features->level4_cache_size = level4_cache_size;
+
+  unsigned long int non_temporal_threshold;
+
+#if HAVE_TUNABLES
+  long int tunable_size;
+  tunable_size = TUNABLE_GET (x86_data_cache_size, long int, NULL);
+  if (tunable_size != 0)
+    data = tunable_size;
+  tunable_size = TUNABLE_GET (x86_shared_cache_size, long int, NULL);
+  if (tunable_size != 0)
+    shared = tunable_size;
 
   /* The large memcpy micro benchmark in glibc shows that 6 times of
      shared cache size is the approximate value above which non-temporal
      store becomes faster on a 8-core processor.  This is the 3/4 of the
      total shared cache size.  */
-  __x86_shared_non_temporal_threshold
-    = (cpu_features->non_temporal_threshold != 0
-       ? cpu_features->non_temporal_threshold
-       : __x86_shared_cache_size * threads * 3 / 4);
-}
-
+  tunable_size = TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
+  if (tunable_size != 0)
+    non_temporal_threshold = tunable_size;
+  else
 #endif
+    non_temporal_threshold = (shared * threads * 3 / 4);
+
+  cpu_features->data_cache_size = data;
+  cpu_features->shared_cache_size = shared;
+  cpu_features->non_temporal_threshold = non_temporal_threshold;
+
+#if HAVE_TUNABLES
+  TUNABLE_UPDATE (x86_data_cache_size, long int,
+		  data, 0, (long int) -1);
+  TUNABLE_UPDATE (x86_shared_cache_size, long int,
+		  shared, 0, (long int) -1);
+  TUNABLE_UPDATE (x86_non_temporal_threshold, long int,
+		  non_temporal_threshold, 0, (long int) -1);
+#endif
+}
diff --git a/sysdeps/x86/init-arch.h b/sysdeps/x86/init-arch.h
index d6f59cf962..272ed10902 100644
--- a/sysdeps/x86/init-arch.h
+++ b/sysdeps/x86/init-arch.h
@@ -23,6 +23,9 @@
 #include <ifunc-init.h>
 #include <isa.h>
 
+extern void __init_cacheinfo (void)
+  __attribute__ ((visibility ("hidden")));
+
 #ifndef __x86_64__
 /* Due to the reordering and the other nifty extensions in i686, it is
    not really good to use heavily i586 optimized code on an i686.  It's
diff --git a/sysdeps/x86/libc-main.S b/sysdeps/x86/libc-main.S
new file mode 100644
index 0000000000..5ef80d5b63
--- /dev/null
+++ b/sysdeps/x86/libc-main.S
@@ -0,0 +1,36 @@
+/* Copyright (C) 2020 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   In addition to the permissions in the GNU Lesser General Public
+   License, the Free Software Foundation gives you unlimited
+   permission to link the compiled version of this file with other
+   programs, and to distribute those programs without any restriction
+   coming from the use of this file. (The GNU Lesser General Public
+   License restrictions do apply in other respects; for example, they
+   cover modification of the file, and distribution when not linked
+   into another program.)
+
+   Note that people who make modified versions of this file are not
+   obligated to grant this special exception for their modified
+   versions; it is their choice whether to do so. The GNU Lesser
+   General Public License gives permission to release a modified
+   version without this exception; this exception also makes it
+   possible to release a modified version which carries forward this
+   exception.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#define LIBC_MAIN __x86_libc_main
+#include "start.S"
diff --git a/sysdeps/x86/libc-version.c b/sysdeps/x86/libc-version.c
new file mode 100644
index 0000000000..1e6e9459f1
--- /dev/null
+++ b/sysdeps/x86/libc-version.c
@@ -0,0 +1,53 @@
+/* Copyright (C) 2020 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#if HAVE_TUNABLES
+# include <stdint.h>
+# include <stdbool.h>
+# include <string.h>
+# include <elf/dl-tunables.h>
+#endif
+
+#define LIBC_MAIN __generic_libc_main
+#include <csu/libc-version.c>
+
+/* This function is the entry point for the shared object.
+   Running the library as a program will get here.  */
+
+extern void __x86_libc_main (int, char **)
+  __attribute__ ((noreturn, visibility ("hidden")));
+
+void
+__x86_libc_main (int argc, char **argv)
+{
+#if HAVE_TUNABLES
+  bool stop = false;
+  while (argc > 1)
+    if (! strcmp (argv[1], "--list-tunables"))
+      {
+	__tunables_print ();
+	stop = true;
+	argc--;
+      }
+    else
+      break;
+
+  if (stop)
+    _exit (0);
+#endif
+  __generic_libc_main ();
+}
diff --git a/sysdeps/x86_64/start.S b/sysdeps/x86_64/start.S
index 7477b632f7..01496027ca 100644
--- a/sysdeps/x86_64/start.S
+++ b/sysdeps/x86_64/start.S
@@ -55,7 +55,13 @@
 
 #include <sysdep.h>
 
-ENTRY (_start)
+#ifdef LIBC_MAIN
+# define START __libc_main
+#else
+# define START _start
+#endif
+
+ENTRY (START)
 	/* Clearing frame pointer is insufficient, use CFI.  */
 	cfi_undefined (rip)
 	/* Clear the frame pointer.  The ABI suggests this be done, to mark
@@ -76,16 +82,24 @@ ENTRY (_start)
 	rtld_fini:	%r9
 	stack_end:	stack.	*/
 
+#ifdef LIBC_MAIN
+# define ARGC_REG	RDI_LP
+# define ARGV_REG	RSI_LP
+#else
+# define ARGC_REG	RSI_LP
+# define ARGV_REG	RDX_LP
+#endif
+
 	mov %RDX_LP, %R9_LP	/* Address of the shared library termination
 				   function.  */
 #ifdef __ILP32__
-	mov (%rsp), %esi	/* Simulate popping 4-byte argument count.  */
+	mov (%rsp), %ARGC_REG	/* Simulate popping 4-byte argument count.  */
 	add $4, %esp
 #else
-	popq %rsi		/* Pop the argument count.  */
+	popq %ARGC_REG		/* Pop the argument count.  */
 #endif
 	/* argv starts just at the current stack top.  */
-	mov %RSP_LP, %RDX_LP
+	mov %RSP_LP, %ARGV_REG
 	/* Align the stack to a 16 byte boundary to follow the ABI.  */
 	and  $~15, %RSP_LP
 
@@ -96,19 +110,22 @@ ENTRY (_start)
 	   which grow downwards).  */
 	pushq %rsp
 
-#ifdef PIC
+#ifdef LIBC_MAIN
+	call LIBC_MAIN
+#else
+# ifdef PIC
 	/* Pass address of our own entry points to .fini and .init.  */
 	mov __libc_csu_fini@GOTPCREL(%rip), %R8_LP
 	mov __libc_csu_init@GOTPCREL(%rip), %RCX_LP
 
 	mov main@GOTPCREL(%rip), %RDI_LP
-#else
+# else
 	/* Pass address of our own entry points to .fini and .init.  */
 	mov $__libc_csu_fini, %R8_LP
 	mov $__libc_csu_init, %RCX_LP
 
 	mov $main, %RDI_LP
-#endif
+# endif
 
 	/* Call the user's main function, and exit with its value.
 	   But let the libc call main.  Since __libc_start_main in
@@ -118,10 +135,12 @@ ENTRY (_start)
 	   2.26 or above can convert indirect branch into direct
 	   branch.  */
 	call *__libc_start_main@GOTPCREL(%rip)
+#endif
 
 	hlt			/* Crash if somehow `exit' does return.	 */
-END (_start)
+END (START)
 
+#ifndef LIBC_MAIN
 /* Define a symbol for the first piece of initialized data.  */
 	.data
 	.globl __data_start
@@ -129,3 +148,4 @@ __data_start:
 	.long 0
 	.weak data_start
 	data_start = __data_start
+#endif
-- 
2.26.2
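
[Editorial note: a usage-level illustration of where the relocated cache
data ends up.  This is a minimal sketch, not part of the patch above:
sysconf remains the query interface, now answered from the cpu_features
level*_cache_* fields that __init_cacheinfo fills in.]

#include <stdio.h>
#include <unistd.h>

int
main (void)
{
  /* These names are served from cpu_features->level2_cache_* and
     ->level3_cache_* after the patch; a non-positive result means the
     geometry could not be determined on this CPU.  */
  printf ("L2 size:     %ld\n", sysconf (_SC_LEVEL2_CACHE_SIZE));
  printf ("L2 linesize: %ld\n", sysconf (_SC_LEVEL2_CACHE_LINESIZE));
  printf ("L3 size:     %ld\n", sysconf (_SC_LEVEL3_CACHE_SIZE));
  return 0;
}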


^ permalink raw reply	[flat|nested] 32+ messages in thread

* V3 [PATCH] libc.so: Add --list-tunables support to __libc_main
  2020-06-05 22:45                                 ` V2 " H.J. Lu
@ 2020-06-06 21:51                                   ` H.J. Lu
  2020-07-02 18:00                                     ` Carlos O'Donell
  0 siblings, 1 reply; 32+ messages in thread
From: H.J. Lu @ 2020-06-06 21:51 UTC (permalink / raw)
  To: GNU C Library; +Cc: Carlos O'Donell, Florian Weimer, Hushiyuan

[-- Attachment #1: Type: text/plain, Size: 4055 bytes --]

On Fri, Jun 5, 2020 at 3:45 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Thu, Jun 04, 2020 at 02:00:35PM -0700, H.J. Lu wrote:
> > On Mon, Jun 1, 2020 at 7:08 PM Carlos O'Donell <carlos@redhat.com> wrote:
> > >
> > > On Mon, Jun 1, 2020 at 6:44 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > Tunables are designed to pass info from user to glibc, not the other
> > > > way around.  When __libc_main is called, init_cacheinfo is never
> > > > called.  I can call init_cacheinfo from __libc_main.  But there is no
> > > > interface to update min and max values from init_cacheinfo.  I don't
> > > > think --list-tunables will work here without changes to tunables.
> > >
> > > You have a dynamic threshold.
> > >
> > > You have to tell the user what that minimum is, otherwise they can't
> > > use the tunable reliably.
> > >
> > > This is the first instance of a min/max that is dynamically determined.
> > >
> > > You must fetch the cache info ahead of the tunable initialization, that
> > > is you must call init_cacheinfo before __init_tunables.
> > >
> > > You can initialize the tunable data dynamically like this:
> > >
> > > /* Dynamically set the min and max of glibc.foo.bar.  */
> > > tunable_id_t id = TUNABLE_ENUM_NAME (glibc, foo, bar);
> > > tunable_list[id].type.min = lowval;
> > > tunable_list[id].type.max = highval;
> > >
> > > We do something similar for maybe_enable_malloc_check.
> > >
> > > Then once the tunables are parsed, and the cpu features are loaded
> > > you can print the tunables, and the printed tunables will have meaningful
> > > min and max values.
> > >
> > > If you have circular dependency, then you must process the cpu features
> > > first without reading from the tunables, then allow the tunables to be
> > > initialized from the system, *then* process the tunables to alter the existing
> > > cpu feature settings.
> > >
> >
> > How about this?  I got
> >
>
> Here is the updated patch, which depends on
>
> https://sourceware.org/pipermail/libc-alpha/2020-June/114820.html
>
> to add "%d" support to _dl_debug_vdprintf.  I got
>
> $ ./elf/ld.so ./libc.so --list-tunables
> glibc.elision.skip_lock_after_retries: 3 (min: -2147483648, max: 2147483647)
> glibc.malloc.trim_threshold: 0x0 (min: 0x0, max: 0xffffffff)
> glibc.malloc.perturb: 0 (min: 0, max: 255)
> glibc.cpu.x86_shared_cache_size: 0x100000 (min: 0x0, max: 0xffffffff)
> glibc.elision.tries: 3 (min: -2147483648, max: 2147483647)
> glibc.elision.enable: 0 (min: 0, max: 1)
> glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffff)
> glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffff)
> glibc.cpu.x86_non_temporal_threshold: 0x600000 (min: 0x0, max: 0xffffffff)
> glibc.cpu.x86_shstk:
> glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffff)
> glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> glibc.elision.skip_trylock_internal_abort: 3 (min: -2147483648, max: 2147483647)
> glibc.malloc.tcache_unsorted_limit: 0x0 (min: 0x0, max: 0xffffffff)
> glibc.cpu.x86_ibt:
> glibc.cpu.hwcaps:
> glibc.elision.skip_lock_internal_abort: 3 (min: -2147483648, max: 2147483647)
> glibc.malloc.arena_max: 0x0 (min: 0x1, max: 0xffffffff)
> glibc.malloc.mmap_threshold: 0x0 (min: 0x0, max: 0xffffffff)
> glibc.cpu.x86_data_cache_size: 0x8000 (min: 0x0, max: 0xffffffff)
> glibc.malloc.tcache_count: 0x0 (min: 0x0, max: 0xffffffff)
> glibc.malloc.arena_test: 0x0 (min: 0x1, max: 0xffffffff)
> glibc.pthread.mutex_spin_count: 100 (min: 0, max: 32767)
> glibc.malloc.tcache_max: 0x0 (min: 0x0, max: 0xffffffff)
> glibc.malloc.check: 0 (min: 0, max: 3)
> $
>
> Ok for master?
>

Here is the updated patch.  To support --list-tunables, a target should add

CPPFLAGS-version.c = -DLIBC_MAIN=__libc_main_body
CPPFLAGS-libc-main.S = -DLIBC_MAIN=__libc_main_body

and start.S should be updated to define __libc_main and call
__libc_main_body:

extern void __libc_main_body (int argc, char **argv)
  __attribute__ ((noreturn, visibility ("hidden")));

when LIBC_MAIN is defined.
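
As a rough picture of that contract, here is an illustrative mock, not
code from the patch: the real body is csu/version.c built with
-DLIBC_MAIN=__libc_main_body, and the tunables printer is stubbed here
so the sketch stands alone as an ordinary program.

#include <stdio.h>
#include <string.h>
#include <unistd.h>

extern void __libc_main_body (int argc, char **argv)
  __attribute__ ((noreturn, visibility ("hidden")));

/* Stand-in for __tunables_print from elf/dl-tunables.c.  */
static void
tunables_print_stub (void)
{
  puts ("glibc.cpu.x86_data_cache_size: 0x8000 (min: 0x0, max: 0xffffffff)");
}

void
__libc_main_body (int argc, char **argv)
{
  /* --list-tunables prints the tunables and exits...  */
  if (argc > 1 && strcmp (argv[1], "--list-tunables") == 0)
    {
      tunables_print_stub ();
      _exit (0);
    }
  /* ...otherwise the body falls through to the version banner.  */
  puts ("GNU C Library (mock banner)");
  _exit (0);
}

int
main (int argc, char **argv)
{
  __libc_main_body (argc, argv);
}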


--
H.J.

[-- Attachment #2: 0001-libc.so-Add-list-tunables-support-to-__libc_main.patch --]
[-- Type: text/x-patch, Size: 80352 bytes --]

From 2566891069e34af8f768b43a3e88278d9db50d98 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Mon, 1 Jun 2020 14:11:32 -0700
Subject: [PATCH] libc.so: Add --list-tunables support to __libc_main

Pass --list-tunables to __libc_main to print tunables with min and max
values.  To support --list-tunables, a target should add

CPPFLAGS-version.c = -DLIBC_MAIN=__libc_main_body
CPPFLAGS-libc-main.S = -DLIBC_MAIN=__libc_main_body

and start.S should be updated to define __libc_main and call
__libc_main_body:

extern void __libc_main_body (int argc, char **argv)
  __attribute__ ((noreturn, visibility ("hidden")));

when LIBC_MAIN is defined.

Currently this option is only functional on i386 and x86-64.

Two functions, __tunable_update_val and __tunables_print, are added to
update and print tunable values, respectively.

X86 processor cache info is moved to cpu_features so that it is available
for __tunables_print with --list-tunables.
---
 csu/Makefile                         |  12 +-
 csu/libc-main.S                      |  44 ++
 csu/version.c                        |  29 +-
 elf/Versions                         |   6 +
 elf/dl-tunables.c                    |  88 ++-
 elf/dl-tunables.h                    |  17 +
 manual/tunables.texi                 |  33 +
 sysdeps/i386/cacheinfo.c             |   3 -
 sysdeps/i386/start.S                 |  28 +-
 sysdeps/mach/hurd/i386/localplt.data |   1 +
 sysdeps/x86/Makefile                 |   4 +-
 sysdeps/x86/cacheinfo.c              | 852 ++-----------------------
 sysdeps/x86/cpu-features.c           |  19 +-
 sysdeps/x86/cpu-features.h           |  26 +
 sysdeps/x86/dl-cacheinfo.c           | 888 +++++++++++++++++++++++++++
 sysdeps/x86/init-arch.h              |   3 +
 sysdeps/x86_64/start.S               |  36 +-
 17 files changed, 1236 insertions(+), 853 deletions(-)
 create mode 100644 csu/libc-main.S
 delete mode 100644 sysdeps/i386/cacheinfo.c
 create mode 100644 sysdeps/x86/dl-cacheinfo.c

diff --git a/csu/Makefile b/csu/Makefile
index 555ae27dea..2dda9c1894 100644
--- a/csu/Makefile
+++ b/csu/Makefile
@@ -27,7 +27,7 @@ subdir := csu
 include ../Makeconfig
 
 routines = init-first libc-start $(libc-init) sysdep version check_fds \
-	   libc-tls elf-init dso_handle
+	   libc-tls elf-init dso_handle libc-main
 aux	 = errno
 elide-routines.os = libc-tls
 static-only-routines = elf-init
@@ -73,6 +73,10 @@ extra-objs += gmon-start.o
 endif
 install-lib += S$(start-installed-name)
 generated += start.os
+ifeq ($(have-tunables)$(cross-compiling),yesno)
+tests-special += $(objpfx)list-tunables.out
+generated += list-tunables.out
+endif
 else
 extra-objs += gmon-start.o
 endif
@@ -191,3 +195,9 @@ ifneq ($(multidir),.)
 $(addprefix $(objpfx)$(multidir)/, $(install-lib)): $(addprefix $(objpfx), $(install-lib))
 	$(make-link-multidir)
 endif
+
+$(objpfx)list-tunables.out:$(common-objpfx)elf/ld.so \
+  $(common-objpfx)libc.so
+	$(common-objpfx)elf/ld.so $(common-objpfx)libc.so \
+		--list-tunables > $@; \
+	$(evaluate-test)
diff --git a/csu/libc-main.S b/csu/libc-main.S
new file mode 100644
index 0000000000..129cf20d8e
--- /dev/null
+++ b/csu/libc-main.S
@@ -0,0 +1,44 @@
+/* Startup code for libc.so.
+   Copyright (C) 2020 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   In addition to the permissions in the GNU Lesser General Public
+   License, the Free Software Foundation gives you unlimited
+   permission to link the compiled version of this file with other
+   programs, and to distribute those programs without any restriction
+   coming from the use of this file. (The GNU Lesser General Public
+   License restrictions do apply in other respects; for example, they
+   cover modification of the file, and distribution when not linked
+   into another program.)
+
+   Note that people who make modified versions of this file are not
+   obligated to grant this special exception for their modified
+   versions; it is their choice whether to do so. The GNU Lesser
+   General Public License gives permission to release a modified
+   version without this exception; this exception also makes it
+   possible to release a modified version which carries forward this
+   exception.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/* This file should define __libc_main and call __libc_main_body:
+
+     extern void __libc_main_body (int argc, char **argv)
+	__attribute__ ((noreturn, visibility ("hidden")));
+ */
+
+#ifdef LIBC_MAIN
+# include "start.S"
+#endif
diff --git a/csu/version.c b/csu/version.c
index 57b49dfd8a..58b214249f 100644
--- a/csu/version.c
+++ b/csu/version.c
@@ -19,6 +19,14 @@
 #include <tls.h>
 #include <libc-abis.h>
 #include <gnu/libc-version.h>
+#ifndef LIBC_MAIN
+# define LIBC_MAIN __libc_main
+#endif
+#if HAVE_TUNABLES
+# include <stdbool.h>
+# include <string.h>
+# include <elf/dl-tunables.h>
+#endif
 
 static const char __libc_release[] = RELEASE;
 static const char __libc_version[] = VERSION;
@@ -64,10 +72,27 @@ weak_alias (__gnu_get_libc_version, gnu_get_libc_version)
 /* This function is the entry point for the shared object.
    Running the library as a program will get here.  */
 
-extern void __libc_main (void) __attribute__ ((noreturn));
+extern void LIBC_MAIN (int, char **)
+  __attribute__ ((noreturn, visibility ("hidden")));
+
 void
-__libc_main (void)
+LIBC_MAIN (int argc, char **argv)
 {
+#if HAVE_TUNABLES
+  bool stop = false;
+  while (argc > 1)
+    if (! strcmp (argv[1], "--list-tunables"))
+      {
+	__tunables_print ();
+	stop = true;
+	argc--;
+      }
+    else
+      break;
+
+  if (stop)
+    _exit (0);
+#endif
   __libc_print_version ();
   _exit (0);
 }
diff --git a/elf/Versions b/elf/Versions
index be88c48e6d..bf9d7dff9b 100644
--- a/elf/Versions
+++ b/elf/Versions
@@ -76,5 +76,11 @@ ld {
 
     # Set value of a tunable.
     __tunable_get_val;
+
+    # Update value of a tunable.
+    __tunable_update_val;
+
+    # Print all tunables.
+    __tunables_print;
   }
 }
diff --git a/elf/dl-tunables.c b/elf/dl-tunables.c
index 26e6e26612..c9f11e3b26 100644
--- a/elf/dl-tunables.c
+++ b/elf/dl-tunables.c
@@ -100,31 +100,39 @@ get_next_env (char **envp, char **name, size_t *namelen, char **val,
     }									      \
 })
 
+#define TUNABLE_UPDATE_VAL(__cur, __val, __min, __max, __type)		      \
+({									      \
+  (__cur)->type.min = (__min);						      \
+  (__cur)->type.max = (__max);						      \
+  (__cur)->val.numval = (__val);					      \
+  (__cur)->initialized = true;						      \
+})
+
 static void
-do_tunable_update_val (tunable_t *cur, const void *valp)
+do_tunable_update_val (tunable_t *cur, const void *valp,
+		       const void *minp, const void *maxp)
 {
-  uint64_t val;
+  uint64_t val, min, max;
 
   if (cur->type.type_code != TUNABLE_TYPE_STRING)
-    val = *((int64_t *) valp);
+    {
+      val = *((int64_t *) valp);
+      if (minp)
+	min = *((int64_t *) minp);
+      if (maxp)
+	max = *((int64_t *) maxp);
+    }
 
   switch (cur->type.type_code)
     {
     case TUNABLE_TYPE_INT_32:
-	{
-	  TUNABLE_SET_VAL_IF_VALID_RANGE (cur, val, int64_t);
-	  break;
-	}
     case TUNABLE_TYPE_UINT_64:
-	{
-	  TUNABLE_SET_VAL_IF_VALID_RANGE (cur, val, uint64_t);
-	  break;
-	}
     case TUNABLE_TYPE_SIZE_T:
-	{
-	  TUNABLE_SET_VAL_IF_VALID_RANGE (cur, val, uint64_t);
-	  break;
-	}
+      if (minp && maxp)
+	TUNABLE_UPDATE_VAL (cur, val, min, max, int64_t);
+      else
+	TUNABLE_SET_VAL_IF_VALID_RANGE (cur, val, int64_t);
+      break;
     case TUNABLE_TYPE_STRING:
 	{
 	  cur->val.strval = valp;
@@ -153,7 +161,7 @@ tunable_initialize (tunable_t *cur, const char *strval)
       cur->initialized = true;
       valp = strval;
     }
-  do_tunable_update_val (cur, valp);
+  do_tunable_update_val (cur, valp, NULL, NULL);
 }
 
 void
@@ -161,8 +169,17 @@ __tunable_set_val (tunable_id_t id, void *valp)
 {
   tunable_t *cur = &tunable_list[id];
 
-  do_tunable_update_val (cur, valp);
+  do_tunable_update_val (cur, valp, NULL, NULL);
+}
+
+void
+__tunable_update_val (tunable_id_t id, void *valp, void *minp, void *maxp)
+{
+  tunable_t *cur = &tunable_list[id];
+
+  do_tunable_update_val (cur, valp, minp, maxp);
 }
+rtld_hidden_def (__tunable_update_val)
 
 #if TUNABLES_FRONTEND == TUNABLES_FRONTEND_valstring
 /* Parse the tunable string TUNESTR and adjust it to drop any tunables that may
@@ -361,6 +378,43 @@ __tunables_init (char **envp)
     }
 }
 
+void
+__tunables_print (void)
+{
+  for (int i = 0; i < sizeof (tunable_list) / sizeof (tunable_t); i++)
+    {
+      tunable_t *cur = &tunable_list[i];
+      _dl_printf ("%s: ", cur->name);
+      switch (cur->type.type_code)
+	{
+	case TUNABLE_TYPE_INT_32:
+	  _dl_printf ("%d (min: %d, max: %d)\n",
+		      (int) cur->val.numval,
+		      (int) cur->type.min,
+		      (int) cur->type.max);
+	  break;
+	case TUNABLE_TYPE_UINT_64:
+	  _dl_printf ("0x%lx (min: 0x%lx, max: 0x%lx)\n",
+		      (long int) cur->val.numval,
+		      (long int) cur->type.min,
+		      (long int) cur->type.max);
+	  break;
+	case TUNABLE_TYPE_SIZE_T:
+	  _dl_printf ("0x%Zx (min: 0x%Zx, max: 0x%Zx)\n",
+		      (size_t) cur->val.numval,
+		      (size_t) cur->type.min,
+		      (size_t) cur->type.max);
+	  break;
+	case TUNABLE_TYPE_STRING:
+	  _dl_printf ("%s\n", cur->val.strval ? cur->val.strval : "");
+	  break;
+	default:
+	  __builtin_unreachable ();
+	}
+    }
+}
+rtld_hidden_def (__tunables_print)
+
 /* Set the tunable value.  This is called by the module that the tunable exists
    in. */
 void
diff --git a/elf/dl-tunables.h b/elf/dl-tunables.h
index 969e50327b..577c5d3369 100644
--- a/elf/dl-tunables.h
+++ b/elf/dl-tunables.h
@@ -67,10 +67,14 @@ typedef struct _tunable tunable_t;
 # include "dl-tunable-list.h"
 
 extern void __tunables_init (char **);
+extern void __tunables_print (void);
 extern void __tunable_get_val (tunable_id_t, void *, tunable_callback_t);
 extern void __tunable_set_val (tunable_id_t, void *);
+extern void __tunable_update_val (tunable_id_t, void *, void *, void *);
 rtld_hidden_proto (__tunables_init)
+rtld_hidden_proto (__tunables_print)
 rtld_hidden_proto (__tunable_get_val)
+rtld_hidden_proto (__tunable_update_val)
 
 /* Define TUNABLE_GET and TUNABLE_SET in short form if TOP_NAMESPACE and
    TUNABLE_NAMESPACE are defined.  This is useful shorthand to get and set
@@ -80,11 +84,16 @@ rtld_hidden_proto (__tunable_get_val)
   TUNABLE_GET_FULL (TOP_NAMESPACE, TUNABLE_NAMESPACE, __id, __type, __cb)
 # define TUNABLE_SET(__id, __type, __val) \
   TUNABLE_SET_FULL (TOP_NAMESPACE, TUNABLE_NAMESPACE, __id, __type, __val)
+# define TUNABLE_UPDATE(__id, __type, __val, __min, __max) \
+  TUNABLE_UPDATE_FULL (TOP_NAMESPACE, TUNABLE_NAMESPACE, __id, __type, \
+		       __val, __min, __max)
 #else
 # define TUNABLE_GET(__top, __ns, __id, __type, __cb) \
   TUNABLE_GET_FULL (__top, __ns, __id, __type, __cb)
 # define TUNABLE_SET(__top, __ns, __id, __type, __val) \
   TUNABLE_SET_FULL (__top, __ns, __id, __type, __val)
+# define TUNABLE_UPDATE(__top, __ns, __id, __type, __val, __min, __max) \
+  TUNABLE_UPDATE_FULL (__top, __ns, __id, __type, __val, __min, __max)
 #endif
 
 /* Get and return a tunable value.  If the tunable was set externally and __CB
@@ -104,6 +113,14 @@ rtld_hidden_proto (__tunable_get_val)
 			& (__type) {__val});				      \
 })
 
+/* Update a tunable value.  */
+# define TUNABLE_UPDATE_FULL(__top, __ns, __id, __type, __val, __min, __max) \
+({									      \
+  __tunable_update_val (TUNABLE_ENUM_NAME (__top, __ns, __id),		      \
+			& (__type) {__val},  & (__type) {__min},	      \
+			& (__type) {__max});				      \
+})
+
 /* Namespace sanity for callback functions.  Use this macro to keep the
    namespace of the modules clean.  */
 # define TUNABLE_CALLBACK(__name) _dl_tunable_ ## __name
diff --git a/manual/tunables.texi b/manual/tunables.texi
index ec18b10834..cd1bf90359 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -28,6 +28,39 @@ Finally, the set of tunables available may vary between distributions as
 the tunables feature allows distributions to add their own tunables under
 their own namespace.
 
+Passing @option{--list-tunables} to @samp{libc.so.6} prints all tunables
+with minimum and maximum values:
+
+@example
+$ /lib64/libc.so.6 --list-tunables
+glibc.elision.skip_lock_after_retries: 3 (min: -2147483648, max: 2147483647)
+glibc.malloc.trim_threshold: 0x0 (min: 0x0, max: 0xffffffff)
+glibc.malloc.perturb: 0 (min: 0, max: 255)
+glibc.cpu.x86_shared_cache_size: 0x100000 (min: 0x0, max: 0xffffffff)
+glibc.elision.tries: 3 (min: -2147483648, max: 2147483647)
+glibc.elision.enable: 0 (min: 0, max: 1)
+glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffff)
+glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
+glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffff)
+glibc.cpu.x86_non_temporal_threshold: 0x600000 (min: 0x0, max: 0xffffffff)
+glibc.cpu.x86_shstk:
+glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffff)
+glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
+glibc.elision.skip_trylock_internal_abort: 3 (min: -2147483648, max: 2147483647)
+glibc.malloc.tcache_unsorted_limit: 0x0 (min: 0x0, max: 0xffffffff)
+glibc.cpu.x86_ibt:
+glibc.cpu.hwcaps:
+glibc.elision.skip_lock_internal_abort: 3 (min: -2147483648, max: 2147483647)
+glibc.malloc.arena_max: 0x0 (min: 0x1, max: 0xffffffff)
+glibc.malloc.mmap_threshold: 0x0 (min: 0x0, max: 0xffffffff)
+glibc.cpu.x86_data_cache_size: 0x8000 (min: 0x0, max: 0xffffffff)
+glibc.malloc.tcache_count: 0x0 (min: 0x0, max: 0xffffffff)
+glibc.malloc.arena_test: 0x0 (min: 0x1, max: 0xffffffff)
+glibc.pthread.mutex_spin_count: 100 (min: 0, max: 32767)
+glibc.malloc.tcache_max: 0x0 (min: 0x0, max: 0xffffffff)
+glibc.malloc.check: 0 (min: 0, max: 3)
+@end example
+
 @menu
 * Tunable names::  The structure of a tunable name
 * Memory Allocation Tunables::  Tunables in the memory allocation subsystem
diff --git a/sysdeps/i386/cacheinfo.c b/sysdeps/i386/cacheinfo.c
deleted file mode 100644
index f15fe0779a..0000000000
--- a/sysdeps/i386/cacheinfo.c
+++ /dev/null
@@ -1,3 +0,0 @@
-#define DISABLE_PREFETCHW
-
-#include <sysdeps/x86/cacheinfo.c>
diff --git a/sysdeps/i386/start.S b/sysdeps/i386/start.S
index c57b25f055..6d2e76e5cb 100644
--- a/sysdeps/i386/start.S
+++ b/sysdeps/i386/start.S
@@ -54,7 +54,13 @@
 
 #include <sysdep.h>
 
-ENTRY (_start)
+#ifdef LIBC_MAIN
+# define START __libc_main
+#else
+# define START _start
+#endif
+
+ENTRY (START)
 	/* Clearing frame pointer is insufficient, use CFI.  */
 	cfi_undefined (eip)
 	/* Clear the frame pointer.  The ABI suggests this be done, to mark
@@ -75,6 +81,11 @@ ENTRY (_start)
 	pushl %eax		/* Push garbage because we allocate
 				   28 more bytes.  */
 
+#ifdef LIBC_MAIN
+	pushl %ecx		/* Push second argument: argv.  */
+	pushl %esi		/* Push first argument: argc.  */
+	call LIBC_MAIN
+#else
 	/* Provide the highest stack address to the user code (for stacks
 	   which grow downwards).  */
 	pushl %esp
@@ -82,7 +93,7 @@ ENTRY (_start)
 	pushl %edx		/* Push address of the shared library
 				   termination function.  */
 
-#ifdef PIC
+# ifdef PIC
 	/* Load PIC register.  */
 	call 1f
 	addl $_GLOBAL_OFFSET_TABLE_, %ebx
@@ -96,9 +107,9 @@ ENTRY (_start)
 	pushl %ecx		/* Push second argument: argv.  */
 	pushl %esi		/* Push first argument: argc.  */
 
-# ifdef SHARED
+#  ifdef SHARED
 	pushl main@GOT(%ebx)
-# else
+#  else
 	/* Avoid relocation in static PIE since _start is called before
 	   it is relocated.  Don't use "leal main@GOTOFF(%ebx), %eax"
 	   since main may be in a shared object.  Linker will convert
@@ -106,12 +117,12 @@ ENTRY (_start)
 	   if main is defined locally.  */
 	movl main@GOT(%ebx), %eax
 	pushl %eax
-# endif
+#  endif
 
 	/* Call the user's main function, and exit with its value.
 	   But let the libc call main.    */
 	call __libc_start_main@PLT
-#else
+# else
 	/* Push address of our own entry points to .fini and .init.  */
 	pushl $__libc_csu_fini
 	pushl $__libc_csu_init
@@ -124,6 +135,7 @@ ENTRY (_start)
 	/* Call the user's main function, and exit with its value.
 	   But let the libc call main.    */
 	call __libc_start_main
+# endif
 #endif
 
 	hlt			/* Crash if somehow `exit' does return.  */
@@ -132,8 +144,9 @@ ENTRY (_start)
 1:	movl	(%esp), %ebx
 	ret
 #endif
-END (_start)
+END (START)
 
+#ifndef LIBC_MAIN
 /* To fulfill the System V/i386 ABI we need this symbol.  Yuck, it's so
    meaningless since we don't support machines < 80386.  */
 	.section .rodata
@@ -149,3 +162,4 @@ __data_start:
 	.long 0
 	.weak data_start
 	data_start = __data_start
+#endif
diff --git a/sysdeps/mach/hurd/i386/localplt.data b/sysdeps/mach/hurd/i386/localplt.data
index 541c3f32ae..9d0c7e4253 100644
--- a/sysdeps/mach/hurd/i386/localplt.data
+++ b/sysdeps/mach/hurd/i386/localplt.data
@@ -49,6 +49,7 @@ ld.so: _dl_init_first
 ld.so: _dl_mcount
 ld.so: ___tls_get_addr
 ld.so: __tunable_get_val
+ld.so: __tunable_update_val
 #
 # These should ideally be avoided, but is currently difficult
 libc.so: siglongjmp ?
diff --git a/sysdeps/x86/Makefile b/sysdeps/x86/Makefile
index beab426f67..bf67eaaa02 100644
--- a/sysdeps/x86/Makefile
+++ b/sysdeps/x86/Makefile
@@ -1,9 +1,11 @@
 ifeq ($(subdir),csu)
 gen-as-const-headers += cpu-features-offsets.sym
+CPPFLAGS-version.c = -DLIBC_MAIN=__libc_main_body
+CPPFLAGS-libc-main.S = -DLIBC_MAIN=__libc_main_body
 endif
 
 ifeq ($(subdir),elf)
-sysdep-dl-routines += dl-get-cpu-features
+sysdep-dl-routines += dl-get-cpu-features dl-cacheinfo
 
 tests += tst-get-cpu-features tst-get-cpu-features-static
 tests-static += tst-get-cpu-features-static
diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index 311502dee3..8c4c7f9972 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -18,498 +18,9 @@
 
 #if IS_IN (libc)
 
-#include <assert.h>
-#include <stdbool.h>
-#include <stdlib.h>
 #include <unistd.h>
-#include <cpuid.h>
 #include <init-arch.h>
 
-static const struct intel_02_cache_info
-{
-  unsigned char idx;
-  unsigned char assoc;
-  unsigned char linesize;
-  unsigned char rel_name;
-  unsigned int size;
-} intel_02_known [] =
-  {
-#define M(sc) ((sc) - _SC_LEVEL1_ICACHE_SIZE)
-    { 0x06,  4, 32, M(_SC_LEVEL1_ICACHE_SIZE),    8192 },
-    { 0x08,  4, 32, M(_SC_LEVEL1_ICACHE_SIZE),   16384 },
-    { 0x09,  4, 32, M(_SC_LEVEL1_ICACHE_SIZE),   32768 },
-    { 0x0a,  2, 32, M(_SC_LEVEL1_DCACHE_SIZE),    8192 },
-    { 0x0c,  4, 32, M(_SC_LEVEL1_DCACHE_SIZE),   16384 },
-    { 0x0d,  4, 64, M(_SC_LEVEL1_DCACHE_SIZE),   16384 },
-    { 0x0e,  6, 64, M(_SC_LEVEL1_DCACHE_SIZE),   24576 },
-    { 0x21,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
-    { 0x22,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),   524288 },
-    { 0x23,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  1048576 },
-    { 0x25,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
-    { 0x29,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
-    { 0x2c,  8, 64, M(_SC_LEVEL1_DCACHE_SIZE),   32768 },
-    { 0x30,  8, 64, M(_SC_LEVEL1_ICACHE_SIZE),   32768 },
-    { 0x39,  4, 64, M(_SC_LEVEL2_CACHE_SIZE),   131072 },
-    { 0x3a,  6, 64, M(_SC_LEVEL2_CACHE_SIZE),   196608 },
-    { 0x3b,  2, 64, M(_SC_LEVEL2_CACHE_SIZE),   131072 },
-    { 0x3c,  4, 64, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
-    { 0x3d,  6, 64, M(_SC_LEVEL2_CACHE_SIZE),   393216 },
-    { 0x3e,  4, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
-    { 0x3f,  2, 64, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
-    { 0x41,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),   131072 },
-    { 0x42,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
-    { 0x43,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
-    { 0x44,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
-    { 0x45,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),  2097152 },
-    { 0x46,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
-    { 0x47,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  8388608 },
-    { 0x48, 12, 64, M(_SC_LEVEL2_CACHE_SIZE),  3145728 },
-    { 0x49, 16, 64, M(_SC_LEVEL2_CACHE_SIZE),  4194304 },
-    { 0x4a, 12, 64, M(_SC_LEVEL3_CACHE_SIZE),  6291456 },
-    { 0x4b, 16, 64, M(_SC_LEVEL3_CACHE_SIZE),  8388608 },
-    { 0x4c, 12, 64, M(_SC_LEVEL3_CACHE_SIZE), 12582912 },
-    { 0x4d, 16, 64, M(_SC_LEVEL3_CACHE_SIZE), 16777216 },
-    { 0x4e, 24, 64, M(_SC_LEVEL2_CACHE_SIZE),  6291456 },
-    { 0x60,  8, 64, M(_SC_LEVEL1_DCACHE_SIZE),   16384 },
-    { 0x66,  4, 64, M(_SC_LEVEL1_DCACHE_SIZE),    8192 },
-    { 0x67,  4, 64, M(_SC_LEVEL1_DCACHE_SIZE),   16384 },
-    { 0x68,  4, 64, M(_SC_LEVEL1_DCACHE_SIZE),   32768 },
-    { 0x78,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
-    { 0x79,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   131072 },
-    { 0x7a,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
-    { 0x7b,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
-    { 0x7c,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
-    { 0x7d,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),  2097152 },
-    { 0x7f,  2, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
-    { 0x80,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
-    { 0x82,  8, 32, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
-    { 0x83,  8, 32, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
-    { 0x84,  8, 32, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
-    { 0x85,  8, 32, M(_SC_LEVEL2_CACHE_SIZE),  2097152 },
-    { 0x86,  4, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
-    { 0x87,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
-    { 0xd0,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),   524288 },
-    { 0xd1,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),  1048576 },
-    { 0xd2,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
-    { 0xd6,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  1048576 },
-    { 0xd7,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
-    { 0xd8,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
-    { 0xdc, 12, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
-    { 0xdd, 12, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
-    { 0xde, 12, 64, M(_SC_LEVEL3_CACHE_SIZE),  8388608 },
-    { 0xe2, 16, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
-    { 0xe3, 16, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
-    { 0xe4, 16, 64, M(_SC_LEVEL3_CACHE_SIZE),  8388608 },
-    { 0xea, 24, 64, M(_SC_LEVEL3_CACHE_SIZE), 12582912 },
-    { 0xeb, 24, 64, M(_SC_LEVEL3_CACHE_SIZE), 18874368 },
-    { 0xec, 24, 64, M(_SC_LEVEL3_CACHE_SIZE), 25165824 },
-  };
-
-#define nintel_02_known (sizeof (intel_02_known) / sizeof (intel_02_known [0]))
-
-static int
-intel_02_known_compare (const void *p1, const void *p2)
-{
-  const struct intel_02_cache_info *i1;
-  const struct intel_02_cache_info *i2;
-
-  i1 = (const struct intel_02_cache_info *) p1;
-  i2 = (const struct intel_02_cache_info *) p2;
-
-  if (i1->idx == i2->idx)
-    return 0;
-
-  return i1->idx < i2->idx ? -1 : 1;
-}
-
-
-static long int
-__attribute__ ((noinline))
-intel_check_word (int name, unsigned int value, bool *has_level_2,
-		  bool *no_level_2_or_3,
-		  const struct cpu_features *cpu_features)
-{
-  if ((value & 0x80000000) != 0)
-    /* The register value is reserved.  */
-    return 0;
-
-  /* Fold the name.  The _SC_ constants are always in the order SIZE,
-     ASSOC, LINESIZE.  */
-  int folded_rel_name = (M(name) / 3) * 3;
-
-  while (value != 0)
-    {
-      unsigned int byte = value & 0xff;
-
-      if (byte == 0x40)
-	{
-	  *no_level_2_or_3 = true;
-
-	  if (folded_rel_name == M(_SC_LEVEL3_CACHE_SIZE))
-	    /* No need to look further.  */
-	    break;
-	}
-      else if (byte == 0xff)
-	{
-	  /* CPUID leaf 0x4 contains all the information.  We need to
-	     iterate over it.  */
-	  unsigned int eax;
-	  unsigned int ebx;
-	  unsigned int ecx;
-	  unsigned int edx;
-
-	  unsigned int round = 0;
-	  while (1)
-	    {
-	      __cpuid_count (4, round, eax, ebx, ecx, edx);
-
-	      enum { null = 0, data = 1, inst = 2, uni = 3 } type = eax & 0x1f;
-	      if (type == null)
-		/* That was the end.  */
-		break;
-
-	      unsigned int level = (eax >> 5) & 0x7;
-
-	      if ((level == 1 && type == data
-		   && folded_rel_name == M(_SC_LEVEL1_DCACHE_SIZE))
-		  || (level == 1 && type == inst
-		      && folded_rel_name == M(_SC_LEVEL1_ICACHE_SIZE))
-		  || (level == 2 && folded_rel_name == M(_SC_LEVEL2_CACHE_SIZE))
-		  || (level == 3 && folded_rel_name == M(_SC_LEVEL3_CACHE_SIZE))
-		  || (level == 4 && folded_rel_name == M(_SC_LEVEL4_CACHE_SIZE)))
-		{
-		  unsigned int offset = M(name) - folded_rel_name;
-
-		  if (offset == 0)
-		    /* Cache size.  */
-		    return (((ebx >> 22) + 1)
-			    * (((ebx >> 12) & 0x3ff) + 1)
-			    * ((ebx & 0xfff) + 1)
-			    * (ecx + 1));
-		  if (offset == 1)
-		    return (ebx >> 22) + 1;
-
-		  assert (offset == 2);
-		  return (ebx & 0xfff) + 1;
-		}
-
-	      ++round;
-	    }
-	  /* There is no other cache information anywhere else.  */
-	  break;
-	}
-      else
-	{
-	  if (byte == 0x49 && folded_rel_name == M(_SC_LEVEL3_CACHE_SIZE))
-	    {
-	      /* Intel reused this value.  For family 15, model 6 it
-		 specifies the 3rd level cache.  Otherwise the 2nd
-		 level cache.  */
-	      unsigned int family = cpu_features->basic.family;
-	      unsigned int model = cpu_features->basic.model;
-
-	      if (family == 15 && model == 6)
-		{
-		  /* The level 3 cache is encoded for this model like
-		     the level 2 cache is for other models.  Pretend
-		     the caller asked for the level 2 cache.  */
-		  name = (_SC_LEVEL2_CACHE_SIZE
-			  + (name - _SC_LEVEL3_CACHE_SIZE));
-		  folded_rel_name = M(_SC_LEVEL2_CACHE_SIZE);
-		}
-	    }
-
-	  struct intel_02_cache_info *found;
-	  struct intel_02_cache_info search;
-
-	  search.idx = byte;
-	  found = bsearch (&search, intel_02_known, nintel_02_known,
-			   sizeof (intel_02_known[0]), intel_02_known_compare);
-	  if (found != NULL)
-	    {
-	      if (found->rel_name == folded_rel_name)
-		{
-		  unsigned int offset = M(name) - folded_rel_name;
-
-		  if (offset == 0)
-		    /* Cache size.  */
-		    return found->size;
-		  if (offset == 1)
-		    return found->assoc;
-
-		  assert (offset == 2);
-		  return found->linesize;
-		}
-
-	      if (found->rel_name == M(_SC_LEVEL2_CACHE_SIZE))
-		*has_level_2 = true;
-	    }
-	}
-
-      /* Next byte for the next round.  */
-      value >>= 8;
-    }
-
-  /* Nothing found.  */
-  return 0;
-}
-
-
-static long int __attribute__ ((noinline))
-handle_intel (int name, const struct cpu_features *cpu_features)
-{
-  unsigned int maxidx = cpu_features->basic.max_cpuid;
-
-  /* Return -1 for older CPUs.  */
-  if (maxidx < 2)
-    return -1;
-
-  /* OK, we can use the CPUID instruction to get all info about the
-     caches.  */
-  unsigned int cnt = 0;
-  unsigned int max = 1;
-  long int result = 0;
-  bool no_level_2_or_3 = false;
-  bool has_level_2 = false;
-
-  while (cnt++ < max)
-    {
-      unsigned int eax;
-      unsigned int ebx;
-      unsigned int ecx;
-      unsigned int edx;
-      __cpuid (2, eax, ebx, ecx, edx);
-
-      /* The low byte of EAX in the first round contain the number of
-	 rounds we have to make.  At least one, the one we are already
-	 doing.  */
-      if (cnt == 1)
-	{
-	  max = eax & 0xff;
-	  eax &= 0xffffff00;
-	}
-
-      /* Process the individual registers' value.  */
-      result = intel_check_word (name, eax, &has_level_2,
-				 &no_level_2_or_3, cpu_features);
-      if (result != 0)
-	return result;
-
-      result = intel_check_word (name, ebx, &has_level_2,
-				 &no_level_2_or_3, cpu_features);
-      if (result != 0)
-	return result;
-
-      result = intel_check_word (name, ecx, &has_level_2,
-				 &no_level_2_or_3, cpu_features);
-      if (result != 0)
-	return result;
-
-      result = intel_check_word (name, edx, &has_level_2,
-				 &no_level_2_or_3, cpu_features);
-      if (result != 0)
-	return result;
-    }
-
-  if (name >= _SC_LEVEL2_CACHE_SIZE && name <= _SC_LEVEL3_CACHE_LINESIZE
-      && no_level_2_or_3)
-    return -1;
-
-  return 0;
-}
-
-
-static long int __attribute__ ((noinline))
-handle_amd (int name)
-{
-  unsigned int eax;
-  unsigned int ebx;
-  unsigned int ecx;
-  unsigned int edx;
-  __cpuid (0x80000000, eax, ebx, ecx, edx);
-
-  /* No level 4 cache (yet).  */
-  if (name > _SC_LEVEL3_CACHE_LINESIZE)
-    return 0;
-
-  unsigned int fn = 0x80000005 + (name >= _SC_LEVEL2_CACHE_SIZE);
-  if (eax < fn)
-    return 0;
-
-  __cpuid (fn, eax, ebx, ecx, edx);
-
-  if (name < _SC_LEVEL1_DCACHE_SIZE)
-    {
-      name += _SC_LEVEL1_DCACHE_SIZE - _SC_LEVEL1_ICACHE_SIZE;
-      ecx = edx;
-    }
-
-  switch (name)
-    {
-    case _SC_LEVEL1_DCACHE_SIZE:
-      return (ecx >> 14) & 0x3fc00;
-
-    case _SC_LEVEL1_DCACHE_ASSOC:
-      ecx >>= 16;
-      if ((ecx & 0xff) == 0xff)
-	/* Fully associative.  */
-	return (ecx << 2) & 0x3fc00;
-      return ecx & 0xff;
-
-    case _SC_LEVEL1_DCACHE_LINESIZE:
-      return ecx & 0xff;
-
-    case _SC_LEVEL2_CACHE_SIZE:
-      return (ecx & 0xf000) == 0 ? 0 : (ecx >> 6) & 0x3fffc00;
-
-    case _SC_LEVEL2_CACHE_ASSOC:
-      switch ((ecx >> 12) & 0xf)
-	{
-	case 0:
-	case 1:
-	case 2:
-	case 4:
-	  return (ecx >> 12) & 0xf;
-	case 6:
-	  return 8;
-	case 8:
-	  return 16;
-	case 10:
-	  return 32;
-	case 11:
-	  return 48;
-	case 12:
-	  return 64;
-	case 13:
-	  return 96;
-	case 14:
-	  return 128;
-	case 15:
-	  return ((ecx >> 6) & 0x3fffc00) / (ecx & 0xff);
-	default:
-	  return 0;
-	}
-      /* NOTREACHED */
-
-    case _SC_LEVEL2_CACHE_LINESIZE:
-      return (ecx & 0xf000) == 0 ? 0 : ecx & 0xff;
-
-    case _SC_LEVEL3_CACHE_SIZE:
-      return (edx & 0xf000) == 0 ? 0 : (edx & 0x3ffc0000) << 1;
-
-    case _SC_LEVEL3_CACHE_ASSOC:
-      switch ((edx >> 12) & 0xf)
-	{
-	case 0:
-	case 1:
-	case 2:
-	case 4:
-	  return (edx >> 12) & 0xf;
-	case 6:
-	  return 8;
-	case 8:
-	  return 16;
-	case 10:
-	  return 32;
-	case 11:
-	  return 48;
-	case 12:
-	  return 64;
-	case 13:
-	  return 96;
-	case 14:
-	  return 128;
-	case 15:
-	  return ((edx & 0x3ffc0000) << 1) / (edx & 0xff);
-	default:
-	  return 0;
-	}
-      /* NOTREACHED */
-
-    case _SC_LEVEL3_CACHE_LINESIZE:
-      return (edx & 0xf000) == 0 ? 0 : edx & 0xff;
-
-    default:
-      assert (! "cannot happen");
-    }
-  return -1;
-}
-
-
-static long int __attribute__ ((noinline))
-handle_zhaoxin (int name)
-{
-  unsigned int eax;
-  unsigned int ebx;
-  unsigned int ecx;
-  unsigned int edx;
-
-  int folded_rel_name = (M(name) / 3) * 3;
-
-  unsigned int round = 0;
-  while (1)
-    {
-      __cpuid_count (4, round, eax, ebx, ecx, edx);
-
-      enum { null = 0, data = 1, inst = 2, uni = 3 } type = eax & 0x1f;
-      if (type == null)
-        break;
-
-      unsigned int level = (eax >> 5) & 0x7;
-
-      if ((level == 1 && type == data
-        && folded_rel_name == M(_SC_LEVEL1_DCACHE_SIZE))
-        || (level == 1 && type == inst
-            && folded_rel_name == M(_SC_LEVEL1_ICACHE_SIZE))
-        || (level == 2 && folded_rel_name == M(_SC_LEVEL2_CACHE_SIZE))
-        || (level == 3 && folded_rel_name == M(_SC_LEVEL3_CACHE_SIZE)))
-        {
-          unsigned int offset = M(name) - folded_rel_name;
-
-          if (offset == 0)
-            /* Cache size.  */
-            return (((ebx >> 22) + 1)
-                * (((ebx >> 12) & 0x3ff) + 1)
-                * ((ebx & 0xfff) + 1)
-                * (ecx + 1));
-          if (offset == 1)
-            return (ebx >> 22) + 1;
-
-          assert (offset == 2);
-          return (ebx & 0xfff) + 1;
-        }
-
-      ++round;
-    }
-
-  /* Nothing found.  */
-  return 0;
-}
-
-
-/* Get the value of the system variable NAME.  */
-long int
-attribute_hidden
-__cache_sysconf (int name)
-{
-  const struct cpu_features *cpu_features = __get_cpu_features ();
-
-  if (cpu_features->basic.kind == arch_kind_intel)
-    return handle_intel (name, cpu_features);
-
-  if (cpu_features->basic.kind == arch_kind_amd)
-    return handle_amd (name);
-
-  if (cpu_features->basic.kind == arch_kind_zhaoxin)
-    return handle_zhaoxin (name);
-
-  // XXX Fill in more vendors.
-
-  /* CPU not known, we have no information.  */
-  return 0;
-}
-
-
 /* Data cache size for use in memory and string routines, typically
    L1 size, rounded to multiple of 256 bytes.  */
 long int __x86_data_cache_size_half attribute_hidden = 32 * 1024 / 2;
@@ -530,348 +41,85 @@ long int __x86_raw_shared_cache_size attribute_hidden = 1024 * 1024;
 /* Threshold to use non temporal store.  */
 long int __x86_shared_non_temporal_threshold attribute_hidden;
 
-#ifndef DISABLE_PREFETCHW
+#ifndef __x86_64__
 /* PREFETCHW support flag for use in memory and string routines.  */
 int __x86_prefetchw attribute_hidden;
 #endif
 
-
-static void
-get_common_cache_info (long int *shared_ptr, unsigned int *threads_ptr,
-                long int core)
+/* Get the value of the system variable NAME.  */
+long int
+attribute_hidden
+__cache_sysconf (int name)
 {
-  unsigned int eax;
-  unsigned int ebx;
-  unsigned int ecx;
-  unsigned int edx;
-
-  /* Number of logical processors sharing L2 cache.  */
-  int threads_l2;
-
-  /* Number of logical processors sharing L3 cache.  */
-  int threads_l3;
-
   const struct cpu_features *cpu_features = __get_cpu_features ();
-  int max_cpuid = cpu_features->basic.max_cpuid;
-  unsigned int family = cpu_features->basic.family;
-  unsigned int model = cpu_features->basic.model;
-  long int shared = *shared_ptr;
-  unsigned int threads = *threads_ptr;
-  bool inclusive_cache = true;
-  bool support_count_mask = true;
-
-  /* Try L3 first.  */
-  unsigned int level = 3;
-
-  if (cpu_features->basic.kind == arch_kind_zhaoxin && family == 6)
-    support_count_mask = false;
-
-  if (shared <= 0)
-    {
-      /* Try L2 otherwise.  */
-      level  = 2;
-      shared = core;
-      threads_l2 = 0;
-      threads_l3 = -1;
-    }
-  else
-    {
-      threads_l2 = 0;
-      threads_l3 = 0;
-    }
-
-  /* A value of 0 for the HTT bit indicates there is only a single
-     logical processor.  */
-  if (HAS_CPU_FEATURE (HTT))
+  switch (name)
     {
-      /* Figure out the number of logical threads that share the
-         highest cache level.  */
-      if (max_cpuid >= 4)
-        {
-          int i = 0;
-
-          /* Query until cache level 2 and 3 are enumerated.  */
-          int check = 0x1 | (threads_l3 == 0) << 1;
-          do
-            {
-              __cpuid_count (4, i++, eax, ebx, ecx, edx);
+    case _SC_LEVEL1_ICACHE_SIZE:
+      return cpu_features->level1_icache_size;
 
-              /* There seems to be a bug in at least some Pentium Ds
-                 which sometimes fail to iterate all cache parameters.
-                 Do not loop indefinitely here, stop in this case and
-                 assume there is no such information.  */
-              if (cpu_features->basic.kind == arch_kind_intel
-                  && (eax & 0x1f) == 0 )
-                goto intel_bug_no_cache_info;
+    case _SC_LEVEL1_DCACHE_SIZE:
+      return cpu_features->level1_dcache_size;
 
-              switch ((eax >> 5) & 0x7)
-                {
-                  default:
-                    break;
-                  case 2:
-                    if ((check & 0x1))
-                      {
-                        /* Get maximum number of logical processors
-                           sharing L2 cache.  */
-                        threads_l2 = (eax >> 14) & 0x3ff;
-                        check &= ~0x1;
-                      }
-                    break;
-                  case 3:
-                    if ((check & (0x1 << 1)))
-                      {
-                        /* Get maximum number of logical processors
-                           sharing L3 cache.  */
-                        threads_l3 = (eax >> 14) & 0x3ff;
+    case _SC_LEVEL1_DCACHE_ASSOC:
+      return cpu_features->level1_dcache_assoc;
 
-                        /* Check if L2 and L3 caches are inclusive.  */
-                        inclusive_cache = (edx & 0x2) != 0;
-                        check &= ~(0x1 << 1);
-                      }
-                    break;
-                }
-            }
-          while (check);
+    case _SC_LEVEL1_DCACHE_LINESIZE:
+      return cpu_features->level1_dcache_linesize;
 
-          /* If max_cpuid >= 11, THREADS_L2/THREADS_L3 are the maximum
-             numbers of addressable IDs for logical processors sharing
-             the cache, instead of the maximum number of threads
-             sharing the cache.  */
-          if (max_cpuid >= 11 && support_count_mask)
-            {
-              /* Find the number of logical processors shipped in
-                 one core and apply count mask.  */
-              i = 0;
+    case _SC_LEVEL2_CACHE_SIZE:
+      return cpu_features->level2_cache_size;
 
-              /* Count SMT only if there is L3 cache.  Always count
-                 core if there is no L3 cache.  */
-              int count = ((threads_l2 > 0 && level == 3)
-                           | ((threads_l3 > 0
-                               || (threads_l2 > 0 && level == 2)) << 1));
+    case _SC_LEVEL2_CACHE_ASSOC:
+      return cpu_features->level2_cache_assoc;
 
-              while (count)
-                {
-                  __cpuid_count (11, i++, eax, ebx, ecx, edx);
+    case _SC_LEVEL2_CACHE_LINESIZE:
+      return cpu_features->level2_cache_linesize;
 
-                  int shipped = ebx & 0xff;
-                  int type = ecx & 0xff00;
-                  if (shipped == 0 || type == 0)
-                    break;
-                  else if (type == 0x100)
-                    {
-                      /* Count SMT.  */
-                      if ((count & 0x1))
-                        {
-                          int count_mask;
+    case _SC_LEVEL3_CACHE_SIZE:
+      return cpu_features->level3_cache_size;
 
-                          /* Compute count mask.  */
-                          asm ("bsr %1, %0"
-                               : "=r" (count_mask) : "g" (threads_l2));
-                          count_mask = ~(-1 << (count_mask + 1));
-                          threads_l2 = (shipped - 1) & count_mask;
-                          count &= ~0x1;
-                        }
-                    }
-                  else if (type == 0x200)
-                    {
-                      /* Count core.  */
-                      if ((count & (0x1 << 1)))
-                        {
-                          int count_mask;
-                          int threads_core
-                            = (level == 2 ? threads_l2 : threads_l3);
+    case _SC_LEVEL3_CACHE_ASSOC:
+      return cpu_features->level3_cache_assoc;
 
-                          /* Compute count mask.  */
-                          asm ("bsr %1, %0"
-                               : "=r" (count_mask) : "g" (threads_core));
-                          count_mask = ~(-1 << (count_mask + 1));
-                          threads_core = (shipped - 1) & count_mask;
-                          if (level == 2)
-                            threads_l2 = threads_core;
-                          else
-                            threads_l3 = threads_core;
-                          count &= ~(0x1 << 1);
-                        }
-                    }
-                }
-            }
-          if (threads_l2 > 0)
-            threads_l2 += 1;
-          if (threads_l3 > 0)
-            threads_l3 += 1;
-          if (level == 2)
-            {
-              if (threads_l2)
-                {
-                  threads = threads_l2;
-                  if (cpu_features->basic.kind == arch_kind_intel
-                      && threads > 2
-                      && family == 6)
-                    switch (model)
-                      {
-                        case 0x37:
-                        case 0x4a:
-                        case 0x4d:
-                        case 0x5a:
-                        case 0x5d:
-                          /* Silvermont has L2 cache shared by 2 cores.  */
-                          threads = 2;
-                          break;
-                        default:
-                          break;
-                      }
-                }
-            }
-          else if (threads_l3)
-            threads = threads_l3;
-        }
-      else
-        {
-intel_bug_no_cache_info:
-          /* Assume that all logical threads share the highest cache
-             level.  */
-          threads
-            = ((cpu_features->cpuid[COMMON_CPUID_INDEX_1].ebx
-                >> 16) & 0xff);
-        }
+    case _SC_LEVEL3_CACHE_LINESIZE:
+      return cpu_features->level3_cache_linesize;
 
-        /* Cap usage of highest cache level to the number of supported
-           threads.  */
-        if (shared > 0 && threads > 0)
-          shared /= threads;
-    }
+    case _SC_LEVEL4_CACHE_SIZE:
+      return cpu_features->level4_cache_size;
 
-  /* Account for non-inclusive L2 and L3 caches.  */
-  if (!inclusive_cache)
-    {
-      if (threads_l2 > 0)
-        core /= threads_l2;
-      shared += core;
+    default:
+      break;
     }
-
-  *shared_ptr = shared;
-  *threads_ptr = threads;
+  return -1;
 }
 
-
 static void
 __attribute__((constructor))
 init_cacheinfo (void)
 {
-  /* Find out what brand of processor.  */
-  unsigned int ebx;
-  unsigned int ecx;
-  unsigned int edx;
-  int max_cpuid_ex;
-  long int data = -1;
-  long int shared = -1;
-  long int core;
-  unsigned int threads = 0;
   const struct cpu_features *cpu_features = __get_cpu_features ();
+  long int data = cpu_features->data_cache_size;
+  __x86_raw_data_cache_size_half = data / 2;
+  __x86_raw_data_cache_size = data;
+  /* Round data cache size to multiple of 256 bytes.  */
+  data = data & ~255L;
+  __x86_data_cache_size_half = data / 2;
+  __x86_data_cache_size = data;
+
+  long int shared = cpu_features->shared_cache_size;
+  __x86_raw_shared_cache_size_half = shared / 2;
+  __x86_raw_shared_cache_size = shared;
+  /* Round shared cache size to multiple of 256 bytes.  */
+  shared = shared & ~255L;
+  __x86_shared_cache_size_half = shared / 2;
+  __x86_shared_cache_size = shared;
 
-  if (cpu_features->basic.kind == arch_kind_intel)
-    {
-      data = handle_intel (_SC_LEVEL1_DCACHE_SIZE, cpu_features);
-      core = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
-      shared = handle_intel (_SC_LEVEL3_CACHE_SIZE, cpu_features);
-
-      get_common_cache_info (&shared, &threads, core);
-    }
-  else if (cpu_features->basic.kind == arch_kind_zhaoxin)
-    {
-      data = handle_zhaoxin (_SC_LEVEL1_DCACHE_SIZE);
-      core = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
-      shared = handle_zhaoxin (_SC_LEVEL3_CACHE_SIZE);
-
-      get_common_cache_info (&shared, &threads, core);
-    }
-  else if (cpu_features->basic.kind == arch_kind_amd)
-    {
-      data   = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
-      long int core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
-      shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
-
-      /* Get maximum extended function. */
-      __cpuid (0x80000000, max_cpuid_ex, ebx, ecx, edx);
-
-      if (shared <= 0)
-	/* No shared L3 cache.  All we have is the L2 cache.  */
-	shared = core;
-      else
-	{
-	  /* Figure out the number of logical threads that share L3.  */
-	  if (max_cpuid_ex >= 0x80000008)
-	    {
-	      /* Get width of APIC ID.  */
-	      __cpuid (0x80000008, max_cpuid_ex, ebx, ecx, edx);
-	      threads = 1 << ((ecx >> 12) & 0x0f);
-	    }
-
-	  if (threads == 0)
-	    {
-	      /* If APIC ID width is not available, use logical
-		 processor count.  */
-	      __cpuid (0x00000001, max_cpuid_ex, ebx, ecx, edx);
-
-	      if ((edx & (1 << 28)) != 0)
-		threads = (ebx >> 16) & 0xff;
-	    }
-
-	  /* Cap usage of highest cache level to the number of
-	     supported threads.  */
-	  if (threads > 0)
-	    shared /= threads;
-
-	  /* Account for exclusive L2 and L3 caches.  */
-	  shared += core;
-	}
+  __x86_shared_non_temporal_threshold
+    = cpu_features->non_temporal_threshold;
 
-#ifndef DISABLE_PREFETCHW
-      if (max_cpuid_ex >= 0x80000001)
-	{
-	  unsigned int eax;
-	  __cpuid (0x80000001, eax, ebx, ecx, edx);
-	  /*  PREFETCHW     || 3DNow!  */
-	  if ((ecx & 0x100) || (edx & 0x80000000))
-	    __x86_prefetchw = -1;
-	}
+#ifndef __x86_64__
+  __x86_prefetchw = cpu_features->prefetchw;
 #endif
-    }
-
-  if (cpu_features->data_cache_size != 0)
-    data = cpu_features->data_cache_size;
-
-  if (data > 0)
-    {
-      __x86_raw_data_cache_size_half = data / 2;
-      __x86_raw_data_cache_size = data;
-      /* Round data cache size to multiple of 256 bytes.  */
-      data = data & ~255L;
-      __x86_data_cache_size_half = data / 2;
-      __x86_data_cache_size = data;
-    }
-
-  if (cpu_features->shared_cache_size != 0)
-    shared = cpu_features->shared_cache_size;
-
-  if (shared > 0)
-    {
-      __x86_raw_shared_cache_size_half = shared / 2;
-      __x86_raw_shared_cache_size = shared;
-      /* Round shared cache size to multiple of 256 bytes.  */
-      shared = shared & ~255L;
-      __x86_shared_cache_size_half = shared / 2;
-      __x86_shared_cache_size = shared;
-    }
-
-  /* The large memcpy micro benchmark in glibc shows that 6 times of
-     shared cache size is the approximate value above which non-temporal
-     store becomes faster on a 8-core processor.  This is the 3/4 of the
-     total shared cache size.  */
-  __x86_shared_non_temporal_threshold
-    = (cpu_features->non_temporal_threshold != 0
-       ? cpu_features->non_temporal_threshold
-       : __x86_shared_cache_size * threads * 3 / 4);
 }
 
 #endif
diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index 916bbf5242..3d1596bd89 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -19,6 +19,7 @@
 #include <cpuid.h>
 #include <cpu-features.h>
 #include <dl-hwcap.h>
+#include <init-arch.h>
 #include <libc-pointer-arith.h>
 
 #if HAVE_TUNABLES
@@ -560,20 +561,14 @@ no_cpuid:
   cpu_features->basic.model = model;
   cpu_features->basic.stepping = stepping;
 
+  __init_cacheinfo ();
+
 #if HAVE_TUNABLES
   TUNABLE_GET (hwcaps, tunable_val_t *, TUNABLE_CALLBACK (set_hwcaps));
-  cpu_features->non_temporal_threshold
-    = TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
-  cpu_features->data_cache_size
-    = TUNABLE_GET (x86_data_cache_size, long int, NULL);
-  cpu_features->shared_cache_size
-    = TUNABLE_GET (x86_shared_cache_size, long int, NULL);
-#endif
-
-  /* Reuse dl_platform, dl_hwcap and dl_hwcap_mask for x86.  */
-#if !HAVE_TUNABLES && defined SHARED
-  /* The glibc.cpu.hwcap_mask tunable is initialized already, so no need to do
-     this.  */
+#elif defined SHARED
+  /* Reuse dl_platform, dl_hwcap and dl_hwcap_mask for x86.  The
+     glibc.cpu.hwcap_mask tunable is initialized already, so no
+     need to do this.  */
   GLRO(dl_hwcap_mask) = HWCAP_IMPORTANT;
 #endif
 
diff --git a/sysdeps/x86/cpu-features.h b/sysdeps/x86/cpu-features.h
index f05d5ce158..636b270e3b 100644
--- a/sysdeps/x86/cpu-features.h
+++ b/sysdeps/x86/cpu-features.h
@@ -91,6 +91,32 @@ struct cpu_features
   unsigned long int shared_cache_size;
   /* Threshold to use non temporal store.  */
   unsigned long int non_temporal_threshold;
+  /* _SC_LEVEL1_ICACHE_SIZE.  */
+  unsigned long int level1_icache_size;
+  /* _SC_LEVEL1_DCACHE_SIZE.  */
+  unsigned long int level1_dcache_size;
+  /* _SC_LEVEL1_DCACHE_ASSOC.  */
+  unsigned long int level1_dcache_assoc;
+  /* _SC_LEVEL1_DCACHE_LINESIZE.  */
+  unsigned long int level1_dcache_linesize;
+  /* _SC_LEVEL2_CACHE_SIZE.  */
+  unsigned long int level2_cache_size;
+  /* _SC_LEVEL2_CACHE_ASSOC.  */
+  unsigned long int level2_cache_assoc;
+  /* _SC_LEVEL2_CACHE_LINESIZE.  */
+  unsigned long int level2_cache_linesize;
+  /* _SC_LEVEL3_CACHE_SIZE.  */
+  unsigned long int level3_cache_size;
+  /* _SC_LEVEL3_CACHE_ASSOC.  */
+  unsigned long int level3_cache_assoc;
+  /* _SC_LEVEL3_CACHE_LINESIZE.  */
+  unsigned long int level3_cache_linesize;
+  /* _SC_LEVEL4_CACHE_SIZE.  */
+  unsigned long int level4_cache_size;
+#ifndef __x86_64__
+  /* PREFETCHW support flag for use in memory and string routines.  */
+  unsigned long int prefetchw;
+#endif
 };
 
 /* Used from outside of glibc to get access to the CPU features
diff --git a/sysdeps/x86/dl-cacheinfo.c b/sysdeps/x86/dl-cacheinfo.c
new file mode 100644
index 0000000000..7cd6426a9a
--- /dev/null
+++ b/sysdeps/x86/dl-cacheinfo.c
@@ -0,0 +1,888 @@
+/* x86 cache info.
+   Copyright (C) 2020 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <assert.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <cpuid.h>
+#include <init-arch.h>
+#if HAVE_TUNABLES
+# define TUNABLE_NAMESPACE cpu
+# include <elf/dl-tunables.h>
+#endif
+
+static const struct intel_02_cache_info
+{
+  unsigned char idx;
+  unsigned char assoc;
+  unsigned char linesize;
+  unsigned char rel_name;
+  unsigned int size;
+} intel_02_known [] =
+  {
+#define M(sc) ((sc) - _SC_LEVEL1_ICACHE_SIZE)
+    { 0x06,  4, 32, M(_SC_LEVEL1_ICACHE_SIZE),    8192 },
+    { 0x08,  4, 32, M(_SC_LEVEL1_ICACHE_SIZE),   16384 },
+    { 0x09,  4, 32, M(_SC_LEVEL1_ICACHE_SIZE),   32768 },
+    { 0x0a,  2, 32, M(_SC_LEVEL1_DCACHE_SIZE),    8192 },
+    { 0x0c,  4, 32, M(_SC_LEVEL1_DCACHE_SIZE),   16384 },
+    { 0x0d,  4, 64, M(_SC_LEVEL1_DCACHE_SIZE),   16384 },
+    { 0x0e,  6, 64, M(_SC_LEVEL1_DCACHE_SIZE),   24576 },
+    { 0x21,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
+    { 0x22,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),   524288 },
+    { 0x23,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  1048576 },
+    { 0x25,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
+    { 0x29,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
+    { 0x2c,  8, 64, M(_SC_LEVEL1_DCACHE_SIZE),   32768 },
+    { 0x30,  8, 64, M(_SC_LEVEL1_ICACHE_SIZE),   32768 },
+    { 0x39,  4, 64, M(_SC_LEVEL2_CACHE_SIZE),   131072 },
+    { 0x3a,  6, 64, M(_SC_LEVEL2_CACHE_SIZE),   196608 },
+    { 0x3b,  2, 64, M(_SC_LEVEL2_CACHE_SIZE),   131072 },
+    { 0x3c,  4, 64, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
+    { 0x3d,  6, 64, M(_SC_LEVEL2_CACHE_SIZE),   393216 },
+    { 0x3e,  4, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
+    { 0x3f,  2, 64, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
+    { 0x41,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),   131072 },
+    { 0x42,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
+    { 0x43,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
+    { 0x44,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
+    { 0x45,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),  2097152 },
+    { 0x46,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
+    { 0x47,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  8388608 },
+    { 0x48, 12, 64, M(_SC_LEVEL2_CACHE_SIZE),  3145728 },
+    { 0x49, 16, 64, M(_SC_LEVEL2_CACHE_SIZE),  4194304 },
+    { 0x4a, 12, 64, M(_SC_LEVEL3_CACHE_SIZE),  6291456 },
+    { 0x4b, 16, 64, M(_SC_LEVEL3_CACHE_SIZE),  8388608 },
+    { 0x4c, 12, 64, M(_SC_LEVEL3_CACHE_SIZE), 12582912 },
+    { 0x4d, 16, 64, M(_SC_LEVEL3_CACHE_SIZE), 16777216 },
+    { 0x4e, 24, 64, M(_SC_LEVEL2_CACHE_SIZE),  6291456 },
+    { 0x60,  8, 64, M(_SC_LEVEL1_DCACHE_SIZE),   16384 },
+    { 0x66,  4, 64, M(_SC_LEVEL1_DCACHE_SIZE),    8192 },
+    { 0x67,  4, 64, M(_SC_LEVEL1_DCACHE_SIZE),   16384 },
+    { 0x68,  4, 64, M(_SC_LEVEL1_DCACHE_SIZE),   32768 },
+    { 0x78,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
+    { 0x79,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   131072 },
+    { 0x7a,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
+    { 0x7b,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
+    { 0x7c,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
+    { 0x7d,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),  2097152 },
+    { 0x7f,  2, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
+    { 0x80,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
+    { 0x82,  8, 32, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
+    { 0x83,  8, 32, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
+    { 0x84,  8, 32, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
+    { 0x85,  8, 32, M(_SC_LEVEL2_CACHE_SIZE),  2097152 },
+    { 0x86,  4, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
+    { 0x87,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
+    { 0xd0,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),   524288 },
+    { 0xd1,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),  1048576 },
+    { 0xd2,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
+    { 0xd6,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  1048576 },
+    { 0xd7,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
+    { 0xd8,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
+    { 0xdc, 12, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
+    { 0xdd, 12, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
+    { 0xde, 12, 64, M(_SC_LEVEL3_CACHE_SIZE),  8388608 },
+    { 0xe2, 16, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
+    { 0xe3, 16, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
+    { 0xe4, 16, 64, M(_SC_LEVEL3_CACHE_SIZE),  8388608 },
+    { 0xea, 24, 64, M(_SC_LEVEL3_CACHE_SIZE), 12582912 },
+    { 0xeb, 24, 64, M(_SC_LEVEL3_CACHE_SIZE), 18874368 },
+    { 0xec, 24, 64, M(_SC_LEVEL3_CACHE_SIZE), 25165824 },
+  };
+
+#define nintel_02_known (sizeof (intel_02_known) / sizeof (intel_02_known [0]))
+
+static int
+intel_02_known_compare (const void *p1, const void *p2)
+{
+  const struct intel_02_cache_info *i1;
+  const struct intel_02_cache_info *i2;
+
+  i1 = (const struct intel_02_cache_info *) p1;
+  i2 = (const struct intel_02_cache_info *) p2;
+
+  if (i1->idx == i2->idx)
+    return 0;
+
+  return i1->idx < i2->idx ? -1 : 1;
+}
+
+
+static long int
+__attribute__ ((noinline))
+intel_check_word (int name, unsigned int value, bool *has_level_2,
+		  bool *no_level_2_or_3,
+		  const struct cpu_features *cpu_features)
+{
+  if ((value & 0x80000000) != 0)
+    /* The register value is reserved.  */
+    return 0;
+
+  /* Fold the name.  The _SC_ constants are always in the order SIZE,
+     ASSOC, LINESIZE.  */
+  int folded_rel_name = (M(name) / 3) * 3;
+
+  while (value != 0)
+    {
+      unsigned int byte = value & 0xff;
+
+      if (byte == 0x40)
+	{
+	  *no_level_2_or_3 = true;
+
+	  if (folded_rel_name == M(_SC_LEVEL3_CACHE_SIZE))
+	    /* No need to look further.  */
+	    break;
+	}
+      else if (byte == 0xff)
+	{
+	  /* CPUID leaf 0x4 contains all the information.  We need to
+	     iterate over it.  */
+	  unsigned int eax;
+	  unsigned int ebx;
+	  unsigned int ecx;
+	  unsigned int edx;
+
+	  unsigned int round = 0;
+	  while (1)
+	    {
+	      __cpuid_count (4, round, eax, ebx, ecx, edx);
+
+	      enum { null = 0, data = 1, inst = 2, uni = 3 } type = eax & 0x1f;
+	      if (type == null)
+		/* That was the end.  */
+		break;
+
+	      unsigned int level = (eax >> 5) & 0x7;
+
+	      if ((level == 1 && type == data
+		   && folded_rel_name == M(_SC_LEVEL1_DCACHE_SIZE))
+		  || (level == 1 && type == inst
+		      && folded_rel_name == M(_SC_LEVEL1_ICACHE_SIZE))
+		  || (level == 2 && folded_rel_name == M(_SC_LEVEL2_CACHE_SIZE))
+		  || (level == 3 && folded_rel_name == M(_SC_LEVEL3_CACHE_SIZE))
+		  || (level == 4 && folded_rel_name == M(_SC_LEVEL4_CACHE_SIZE)))
+		{
+		  unsigned int offset = M(name) - folded_rel_name;
+
+		  if (offset == 0)
+		    /* Cache size.  */
+		    return (((ebx >> 22) + 1)
+			    * (((ebx >> 12) & 0x3ff) + 1)
+			    * ((ebx & 0xfff) + 1)
+			    * (ecx + 1));
+		  if (offset == 1)
+		    return (ebx >> 22) + 1;
+
+		  assert (offset == 2);
+		  return (ebx & 0xfff) + 1;
+		}
+
+	      ++round;
+	    }
+	  /* There is no other cache information anywhere else.  */
+	  break;
+	}
+      else
+	{
+	  if (byte == 0x49 && folded_rel_name == M(_SC_LEVEL3_CACHE_SIZE))
+	    {
+	      /* Intel reused this value.  For family 15, model 6 it
+		 specifies the 3rd level cache.  Otherwise the 2nd
+		 level cache.  */
+	      unsigned int family = cpu_features->basic.family;
+	      unsigned int model = cpu_features->basic.model;
+
+	      if (family == 15 && model == 6)
+		{
+		  /* The level 3 cache is encoded for this model like
+		     the level 2 cache is for other models.  Pretend
+		     the caller asked for the level 2 cache.  */
+		  name = (_SC_LEVEL2_CACHE_SIZE
+			  + (name - _SC_LEVEL3_CACHE_SIZE));
+		  folded_rel_name = M(_SC_LEVEL2_CACHE_SIZE);
+		}
+	    }
+
+	  struct intel_02_cache_info *found;
+	  struct intel_02_cache_info search;
+
+	  search.idx = byte;
+	  found = bsearch (&search, intel_02_known, nintel_02_known,
+			   sizeof (intel_02_known[0]), intel_02_known_compare);
+	  if (found != NULL)
+	    {
+	      if (found->rel_name == folded_rel_name)
+		{
+		  unsigned int offset = M(name) - folded_rel_name;
+
+		  if (offset == 0)
+		    /* Cache size.  */
+		    return found->size;
+		  if (offset == 1)
+		    return found->assoc;
+
+		  assert (offset == 2);
+		  return found->linesize;
+		}
+
+	      if (found->rel_name == M(_SC_LEVEL2_CACHE_SIZE))
+		*has_level_2 = true;
+	    }
+	}
+
+      /* Next byte for the next round.  */
+      value >>= 8;
+    }
+
+  /* Nothing found.  */
+  return 0;
+}
+
+
+static long int __attribute__ ((noinline))
+handle_intel (int name, const struct cpu_features *cpu_features)
+{
+  unsigned int maxidx = cpu_features->basic.max_cpuid;
+
+  /* Return -1 for older CPUs.  */
+  if (maxidx < 2)
+    return -1;
+
+  /* OK, we can use the CPUID instruction to get all info about the
+     caches.  */
+  unsigned int cnt = 0;
+  unsigned int max = 1;
+  long int result = 0;
+  bool no_level_2_or_3 = false;
+  bool has_level_2 = false;
+
+  while (cnt++ < max)
+    {
+      unsigned int eax;
+      unsigned int ebx;
+      unsigned int ecx;
+      unsigned int edx;
+      __cpuid (2, eax, ebx, ecx, edx);
+
+      /* The low byte of EAX in the first round contains the number of
+	 rounds we have to make.  At least one, the one we are already
+	 doing.  */
+      if (cnt == 1)
+	{
+	  max = eax & 0xff;
+	  eax &= 0xffffff00;
+	}
+
+      /* Process the individual registers' value.  */
+      result = intel_check_word (name, eax, &has_level_2,
+				 &no_level_2_or_3, cpu_features);
+      if (result != 0)
+	return result;
+
+      result = intel_check_word (name, ebx, &has_level_2,
+				 &no_level_2_or_3, cpu_features);
+      if (result != 0)
+	return result;
+
+      result = intel_check_word (name, ecx, &has_level_2,
+				 &no_level_2_or_3, cpu_features);
+      if (result != 0)
+	return result;
+
+      result = intel_check_word (name, edx, &has_level_2,
+				 &no_level_2_or_3, cpu_features);
+      if (result != 0)
+	return result;
+    }
+
+  if (name >= _SC_LEVEL2_CACHE_SIZE && name <= _SC_LEVEL3_CACHE_LINESIZE
+      && no_level_2_or_3)
+    return -1;
+
+  return 0;
+}
+
+
+static long int __attribute__ ((noinline))
+handle_amd (int name)
+{
+  unsigned int eax;
+  unsigned int ebx;
+  unsigned int ecx;
+  unsigned int edx;
+  __cpuid (0x80000000, eax, ebx, ecx, edx);
+
+  /* No level 4 cache (yet).  */
+  if (name > _SC_LEVEL3_CACHE_LINESIZE)
+    return 0;
+
+  unsigned int fn = 0x80000005 + (name >= _SC_LEVEL2_CACHE_SIZE);
+  if (eax < fn)
+    return 0;
+
+  __cpuid (fn, eax, ebx, ecx, edx);
+
+  if (name < _SC_LEVEL1_DCACHE_SIZE)
+    {
+      name += _SC_LEVEL1_DCACHE_SIZE - _SC_LEVEL1_ICACHE_SIZE;
+      ecx = edx;
+    }
+
+  switch (name)
+    {
+    case _SC_LEVEL1_DCACHE_SIZE:
+      return (ecx >> 14) & 0x3fc00;
+
+    case _SC_LEVEL1_DCACHE_ASSOC:
+      ecx >>= 16;
+      if ((ecx & 0xff) == 0xff)
+	/* Fully associative.  */
+	return (ecx << 2) & 0x3fc00;
+      return ecx & 0xff;
+
+    case _SC_LEVEL1_DCACHE_LINESIZE:
+      return ecx & 0xff;
+
+    case _SC_LEVEL2_CACHE_SIZE:
+      return (ecx & 0xf000) == 0 ? 0 : (ecx >> 6) & 0x3fffc00;
+
+    case _SC_LEVEL2_CACHE_ASSOC:
+      switch ((ecx >> 12) & 0xf)
+	{
+	case 0:
+	case 1:
+	case 2:
+	case 4:
+	  return (ecx >> 12) & 0xf;
+	case 6:
+	  return 8;
+	case 8:
+	  return 16;
+	case 10:
+	  return 32;
+	case 11:
+	  return 48;
+	case 12:
+	  return 64;
+	case 13:
+	  return 96;
+	case 14:
+	  return 128;
+	case 15:
+	  return ((ecx >> 6) & 0x3fffc00) / (ecx & 0xff);
+	default:
+	  return 0;
+	}
+      /* NOTREACHED */
+
+    case _SC_LEVEL2_CACHE_LINESIZE:
+      return (ecx & 0xf000) == 0 ? 0 : ecx & 0xff;
+
+    case _SC_LEVEL3_CACHE_SIZE:
+      return (edx & 0xf000) == 0 ? 0 : (edx & 0x3ffc0000) << 1;
+
+    case _SC_LEVEL3_CACHE_ASSOC:
+      switch ((edx >> 12) & 0xf)
+	{
+	case 0:
+	case 1:
+	case 2:
+	case 4:
+	  return (edx >> 12) & 0xf;
+	case 6:
+	  return 8;
+	case 8:
+	  return 16;
+	case 10:
+	  return 32;
+	case 11:
+	  return 48;
+	case 12:
+	  return 64;
+	case 13:
+	  return 96;
+	case 14:
+	  return 128;
+	case 15:
+	  return ((edx & 0x3ffc0000) << 1) / (edx & 0xff);
+	default:
+	  return 0;
+	}
+      /* NOTREACHED */
+
+    case _SC_LEVEL3_CACHE_LINESIZE:
+      return (edx & 0xf000) == 0 ? 0 : edx & 0xff;
+
+    default:
+      assert (! "cannot happen");
+    }
+  return -1;
+}
+
+
+static long int __attribute__ ((noinline))
+handle_zhaoxin (int name)
+{
+  unsigned int eax;
+  unsigned int ebx;
+  unsigned int ecx;
+  unsigned int edx;
+
+  int folded_rel_name = (M(name) / 3) * 3;
+
+  unsigned int round = 0;
+  while (1)
+    {
+      __cpuid_count (4, round, eax, ebx, ecx, edx);
+
+      enum { null = 0, data = 1, inst = 2, uni = 3 } type = eax & 0x1f;
+      if (type == null)
+        break;
+
+      unsigned int level = (eax >> 5) & 0x7;
+
+      if ((level == 1 && type == data
+        && folded_rel_name == M(_SC_LEVEL1_DCACHE_SIZE))
+        || (level == 1 && type == inst
+            && folded_rel_name == M(_SC_LEVEL1_ICACHE_SIZE))
+        || (level == 2 && folded_rel_name == M(_SC_LEVEL2_CACHE_SIZE))
+        || (level == 3 && folded_rel_name == M(_SC_LEVEL3_CACHE_SIZE)))
+        {
+          unsigned int offset = M(name) - folded_rel_name;
+
+          if (offset == 0)
+            /* Cache size.  */
+            return (((ebx >> 22) + 1)
+                * (((ebx >> 12) & 0x3ff) + 1)
+                * ((ebx & 0xfff) + 1)
+                * (ecx + 1));
+          if (offset == 1)
+            return (ebx >> 22) + 1;
+
+          assert (offset == 2);
+          return (ebx & 0xfff) + 1;
+        }
+
+      ++round;
+    }
+
+  /* Nothing found.  */
+  return 0;
+}
+
+
+static void
+get_common_cache_info (long int *shared_ptr, unsigned int *threads_ptr,
+                long int core)
+{
+  unsigned int eax;
+  unsigned int ebx;
+  unsigned int ecx;
+  unsigned int edx;
+
+  /* Number of logical processors sharing L2 cache.  */
+  int threads_l2;
+
+  /* Number of logical processors sharing L3 cache.  */
+  int threads_l3;
+
+  const struct cpu_features *cpu_features = __get_cpu_features ();
+  int max_cpuid = cpu_features->basic.max_cpuid;
+  unsigned int family = cpu_features->basic.family;
+  unsigned int model = cpu_features->basic.model;
+  long int shared = *shared_ptr;
+  unsigned int threads = *threads_ptr;
+  bool inclusive_cache = true;
+  bool support_count_mask = true;
+
+  /* Try L3 first.  */
+  unsigned int level = 3;
+
+  if (cpu_features->basic.kind == arch_kind_zhaoxin && family == 6)
+    support_count_mask = false;
+
+  if (shared <= 0)
+    {
+      /* Try L2 otherwise.  */
+      level  = 2;
+      shared = core;
+      threads_l2 = 0;
+      threads_l3 = -1;
+    }
+  else
+    {
+      threads_l2 = 0;
+      threads_l3 = 0;
+    }
+
+  /* A value of 0 for the HTT bit indicates there is only a single
+     logical processor.  */
+  if (HAS_CPU_FEATURE (HTT))
+    {
+      /* Figure out the number of logical threads that share the
+         highest cache level.  */
+      if (max_cpuid >= 4)
+        {
+          int i = 0;
+
+          /* Query until cache level 2 and 3 are enumerated.  */
+          int check = 0x1 | (threads_l3 == 0) << 1;
+          do
+            {
+              __cpuid_count (4, i++, eax, ebx, ecx, edx);
+
+              /* There seems to be a bug in at least some Pentium Ds
+                 which sometimes fail to iterate all cache parameters.
+                 Do not loop indefinitely here, stop in this case and
+                 assume there is no such information.  */
+              if (cpu_features->basic.kind == arch_kind_intel
+                  && (eax & 0x1f) == 0 )
+                goto intel_bug_no_cache_info;
+
+              switch ((eax >> 5) & 0x7)
+                {
+                  default:
+                    break;
+                  case 2:
+                    if ((check & 0x1))
+                      {
+                        /* Get maximum number of logical processors
+                           sharing L2 cache.  */
+                        threads_l2 = (eax >> 14) & 0x3ff;
+                        check &= ~0x1;
+                      }
+                    break;
+                  case 3:
+                    if ((check & (0x1 << 1)))
+                      {
+                        /* Get maximum number of logical processors
+                           sharing L3 cache.  */
+                        threads_l3 = (eax >> 14) & 0x3ff;
+
+                        /* Check if L2 and L3 caches are inclusive.  */
+                        inclusive_cache = (edx & 0x2) != 0;
+                        check &= ~(0x1 << 1);
+                      }
+                    break;
+                }
+            }
+          while (check);
+
+          /* If max_cpuid >= 11, THREADS_L2/THREADS_L3 are the maximum
+             numbers of addressable IDs for logical processors sharing
+             the cache, instead of the maximum number of threads
+             sharing the cache.  */
+          if (max_cpuid >= 11 && support_count_mask)
+            {
+              /* Find the number of logical processors shipped in
+                 one core and apply count mask.  */
+              i = 0;
+
+              /* Count SMT only if there is L3 cache.  Always count
+                 core if there is no L3 cache.  */
+              int count = ((threads_l2 > 0 && level == 3)
+                           | ((threads_l3 > 0
+                               || (threads_l2 > 0 && level == 2)) << 1));
+
+              while (count)
+                {
+                  __cpuid_count (11, i++, eax, ebx, ecx, edx);
+
+                  int shipped = ebx & 0xff;
+                  int type = ecx & 0xff00;
+                  if (shipped == 0 || type == 0)
+                    break;
+                  else if (type == 0x100)
+                    {
+                      /* Count SMT.  */
+                      if ((count & 0x1))
+                        {
+                          int count_mask;
+
+                          /* Compute count mask.  */
+                          asm ("bsr %1, %0"
+                               : "=r" (count_mask) : "g" (threads_l2));
+                          count_mask = ~(-1 << (count_mask + 1));
+                          threads_l2 = (shipped - 1) & count_mask;
+                          count &= ~0x1;
+                        }
+                    }
+                  else if (type == 0x200)
+                    {
+                      /* Count core.  */
+                      if ((count & (0x1 << 1)))
+                        {
+                          int count_mask;
+                          int threads_core
+                            = (level == 2 ? threads_l2 : threads_l3);
+
+                          /* Compute count mask.  */
+                          asm ("bsr %1, %0"
+                               : "=r" (count_mask) : "g" (threads_core));
+                          count_mask = ~(-1 << (count_mask + 1));
+                          threads_core = (shipped - 1) & count_mask;
+                          if (level == 2)
+                            threads_l2 = threads_core;
+                          else
+                            threads_l3 = threads_core;
+                          count &= ~(0x1 << 1);
+                        }
+                    }
+                }
+            }
+          if (threads_l2 > 0)
+            threads_l2 += 1;
+          if (threads_l3 > 0)
+            threads_l3 += 1;
+          if (level == 2)
+            {
+              if (threads_l2)
+                {
+                  threads = threads_l2;
+                  if (cpu_features->basic.kind == arch_kind_intel
+                      && threads > 2
+                      && family == 6)
+                    switch (model)
+                      {
+                        case 0x37:
+                        case 0x4a:
+                        case 0x4d:
+                        case 0x5a:
+                        case 0x5d:
+                          /* Silvermont has L2 cache shared by 2 cores.  */
+                          threads = 2;
+                          break;
+                        default:
+                          break;
+                      }
+                }
+            }
+          else if (threads_l3)
+            threads = threads_l3;
+        }
+      else
+        {
+intel_bug_no_cache_info:
+          /* Assume that all logical threads share the highest cache
+             level.  */
+          threads
+            = ((cpu_features->cpuid[COMMON_CPUID_INDEX_1].ebx
+                >> 16) & 0xff);
+        }
+
+        /* Cap usage of highest cache level to the number of supported
+           threads.  */
+        if (shared > 0 && threads > 0)
+          shared /= threads;
+    }
+
+  /* Account for non-inclusive L2 and L3 caches.  */
+  if (!inclusive_cache)
+    {
+      if (threads_l2 > 0)
+        core /= threads_l2;
+      shared += core;
+    }
+
+  *shared_ptr = shared;
+  *threads_ptr = threads;
+}
+
+void
+__init_cacheinfo (void)
+{
+  /* Find out what brand of processor.  */
+  unsigned int ebx;
+  unsigned int ecx;
+  unsigned int edx;
+  int max_cpuid_ex;
+  long int data = -1;
+  long int shared = -1;
+  long int core;
+  unsigned int threads = 0;
+  unsigned long int level1_icache_size = -1;
+  unsigned long int level1_dcache_size = -1;
+  unsigned long int level1_dcache_assoc = -1;
+  unsigned long int level1_dcache_linesize = -1;
+  unsigned long int level2_cache_size = -1;
+  unsigned long int level2_cache_assoc = -1;
+  unsigned long int level2_cache_linesize = -1;
+  unsigned long int level3_cache_size = -1;
+  unsigned long int level3_cache_assoc = -1;
+  unsigned long int level3_cache_linesize = -1;
+  unsigned long int level4_cache_size = -1;
+  struct cpu_features *cpu_features = __get_cpu_features ();
+
+  if (cpu_features->basic.kind == arch_kind_intel)
+    {
+      data = handle_intel (_SC_LEVEL1_DCACHE_SIZE, cpu_features);
+      core = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
+      shared = handle_intel (_SC_LEVEL3_CACHE_SIZE, cpu_features);
+
+      level1_icache_size
+	= handle_intel (_SC_LEVEL1_ICACHE_SIZE, cpu_features);
+      level1_dcache_size = data;
+      level1_dcache_assoc
+	= handle_intel (_SC_LEVEL1_DCACHE_ASSOC, cpu_features);
+      level1_dcache_linesize
+	= handle_intel (_SC_LEVEL1_DCACHE_LINESIZE, cpu_features);
+      level2_cache_size = core;
+      level2_cache_assoc
+	= handle_intel (_SC_LEVEL2_CACHE_ASSOC, cpu_features);
+      level2_cache_linesize
+	= handle_intel (_SC_LEVEL2_CACHE_LINESIZE, cpu_features);
+      level3_cache_size = shared;
+      level3_cache_assoc
+	= handle_intel (_SC_LEVEL3_CACHE_ASSOC, cpu_features);
+      level3_cache_linesize
+	= handle_intel (_SC_LEVEL3_CACHE_LINESIZE, cpu_features);
+      level4_cache_size
+	= handle_intel (_SC_LEVEL4_CACHE_SIZE, cpu_features);
+
+      get_common_cache_info (&shared, &threads, core);
+    }
+  else if (cpu_features->basic.kind == arch_kind_zhaoxin)
+    {
+      data = handle_zhaoxin (_SC_LEVEL1_DCACHE_SIZE);
+      core = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
+      shared = handle_zhaoxin (_SC_LEVEL3_CACHE_SIZE);
+
+      level1_icache_size = handle_zhaoxin (_SC_LEVEL1_ICACHE_SIZE);
+      level1_dcache_size = data;
+      level1_dcache_assoc = handle_zhaoxin (_SC_LEVEL1_DCACHE_ASSOC);
+      level1_dcache_linesize = handle_zhaoxin (_SC_LEVEL1_DCACHE_LINESIZE);
+      level2_cache_size = core;
+      level2_cache_assoc = handle_zhaoxin (_SC_LEVEL2_CACHE_ASSOC);
+      level2_cache_linesize = handle_zhaoxin (_SC_LEVEL2_CACHE_LINESIZE);
+      level3_cache_size = shared;
+      level3_cache_assoc = handle_zhaoxin (_SC_LEVEL3_CACHE_ASSOC);
+      level3_cache_linesize = handle_zhaoxin (_SC_LEVEL3_CACHE_LINESIZE);
+
+      get_common_cache_info (&shared, &threads, core);
+    }
+  else if (cpu_features->basic.kind == arch_kind_amd)
+    {
+      data = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
+      core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
+      shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
+
+      level1_icache_size = handle_amd (_SC_LEVEL1_ICACHE_SIZE);
+      level1_dcache_size = data;
+      level1_dcache_assoc = handle_amd (_SC_LEVEL1_DCACHE_ASSOC);
+      level1_dcache_linesize = handle_amd (_SC_LEVEL1_DCACHE_LINESIZE);
+      level2_cache_size = core;
+      level2_cache_assoc = handle_amd (_SC_LEVEL2_CACHE_ASSOC);
+      level2_cache_linesize = handle_amd (_SC_LEVEL2_CACHE_LINESIZE);
+      level3_cache_size = shared;
+      level3_cache_assoc = handle_amd (_SC_LEVEL3_CACHE_ASSOC);
+      level3_cache_linesize = handle_amd (_SC_LEVEL3_CACHE_LINESIZE);
+
+      /* Get maximum extended function. */
+      __cpuid (0x80000000, max_cpuid_ex, ebx, ecx, edx);
+
+      if (shared <= 0)
+	/* No shared L3 cache.  All we have is the L2 cache.  */
+	shared = core;
+      else
+	{
+	  /* Figure out the number of logical threads that share L3.  */
+	  if (max_cpuid_ex >= 0x80000008)
+	    {
+	      /* Get width of APIC ID.  */
+	      __cpuid (0x80000008, max_cpuid_ex, ebx, ecx, edx);
+	      threads = 1 << ((ecx >> 12) & 0x0f);
+	    }
+
+	  if (threads == 0)
+	    {
+	      /* If APIC ID width is not available, use logical
+		 processor count.  */
+	      __cpuid (0x00000001, max_cpuid_ex, ebx, ecx, edx);
+
+	      if ((edx & (1 << 28)) != 0)
+		threads = (ebx >> 16) & 0xff;
+	    }
+
+	  /* Cap usage of highest cache level to the number of
+	     supported threads.  */
+	  if (threads > 0)
+	    shared /= threads;
+
+	  /* Account for exclusive L2 and L3 caches.  */
+	  shared += core;
+	}
+
+#ifndef __x86_64__
+      if (max_cpuid_ex >= 0x80000001)
+	{
+	  unsigned int eax;
+	  __cpuid (0x80000001, eax, ebx, ecx, edx);
+	  /*  PREFETCHW     || 3DNow!  */
+	  if ((ecx & 0x100) || (edx & 0x80000000))
+	    cpu_features->prefetchw = -1;
+	}
+#endif
+    }
+
+  cpu_features->level1_icache_size = level1_icache_size;
+  cpu_features->level1_dcache_size = level1_dcache_size;
+  cpu_features->level1_dcache_assoc = level1_dcache_assoc;
+  cpu_features->level1_dcache_linesize = level1_dcache_linesize;
+  cpu_features->level2_cache_size = level2_cache_size;
+  cpu_features->level2_cache_assoc = level2_cache_assoc;
+  cpu_features->level2_cache_linesize = level2_cache_linesize;
+  cpu_features->level3_cache_size = level3_cache_size;
+  cpu_features->level3_cache_assoc = level3_cache_assoc;
+  cpu_features->level3_cache_linesize = level3_cache_linesize;
+  cpu_features->level4_cache_size = level4_cache_size;
+
+  /* The large memcpy micro benchmark in glibc shows that about 6 times
+     the shared cache size is the threshold above which non-temporal
+     stores become faster on an 8-core processor.  This is 3/4 of the
+     total shared cache size.  */
+  unsigned long int non_temporal_threshold = (shared * threads * 3 / 4);
+
+#if HAVE_TUNABLES
+  long int tunable_size;
+  tunable_size = TUNABLE_GET (x86_data_cache_size, long int, NULL);
+  if (tunable_size != 0)
+    data = tunable_size;
+  tunable_size = TUNABLE_GET (x86_shared_cache_size, long int, NULL);
+  if (tunable_size != 0)
+    shared = tunable_size;
+  tunable_size = TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
+  if (tunable_size != 0)
+    non_temporal_threshold = tunable_size;
+#endif
+
+  cpu_features->data_cache_size = data;
+  cpu_features->shared_cache_size = shared;
+  cpu_features->non_temporal_threshold = non_temporal_threshold;
+
+#if HAVE_TUNABLES
+  TUNABLE_UPDATE (x86_data_cache_size, long int,
+		  data, 0, (long int) -1);
+  TUNABLE_UPDATE (x86_shared_cache_size, long int,
+		  shared, 0, (long int) -1);
+  TUNABLE_UPDATE (x86_non_temporal_threshold, long int,
+		  non_temporal_threshold, 0, (long int) -1);
+#endif
+}
diff --git a/sysdeps/x86/init-arch.h b/sysdeps/x86/init-arch.h
index d6f59cf962..272ed10902 100644
--- a/sysdeps/x86/init-arch.h
+++ b/sysdeps/x86/init-arch.h
@@ -23,6 +23,9 @@
 #include <ifunc-init.h>
 #include <isa.h>
 
+extern void __init_cacheinfo (void)
+  __attribute__ ((visibility ("hidden")));
+
 #ifndef __x86_64__
 /* Due to the reordering and the other nifty extensions in i686, it is
    not really good to use heavily i586 optimized code on an i686.  It's
diff --git a/sysdeps/x86_64/start.S b/sysdeps/x86_64/start.S
index 7477b632f7..01496027ca 100644
--- a/sysdeps/x86_64/start.S
+++ b/sysdeps/x86_64/start.S
@@ -55,7 +55,13 @@
 
 #include <sysdep.h>
 
-ENTRY (_start)
+#ifdef LIBC_MAIN
+# define START __libc_main
+#else
+# define START _start
+#endif
+
+ENTRY (START)
 	/* Clearing frame pointer is insufficient, use CFI.  */
 	cfi_undefined (rip)
 	/* Clear the frame pointer.  The ABI suggests this be done, to mark
@@ -76,16 +82,24 @@ ENTRY (_start)
 	rtld_fini:	%r9
 	stack_end:	stack.	*/
 
+#ifdef LIBC_MAIN
+# define ARGC_REG	RDI_LP
+# define ARGV_REG	RSI_LP
+#else
+# define ARGC_REG	RSI_LP
+# define ARGV_REG	RDX_LP
+#endif
+
 	mov %RDX_LP, %R9_LP	/* Address of the shared library termination
 				   function.  */
 #ifdef __ILP32__
-	mov (%rsp), %esi	/* Simulate popping 4-byte argument count.  */
+	mov (%rsp), %ARGC_REG	/* Simulate popping 4-byte argument count.  */
 	add $4, %esp
 #else
-	popq %rsi		/* Pop the argument count.  */
+	popq %ARGC_REG		/* Pop the argument count.  */
 #endif
 	/* argv starts just at the current stack top.  */
-	mov %RSP_LP, %RDX_LP
+	mov %RSP_LP, %ARGV_REG
 	/* Align the stack to a 16 byte boundary to follow the ABI.  */
 	and  $~15, %RSP_LP
 
@@ -96,19 +110,22 @@ ENTRY (_start)
 	   which grow downwards).  */
 	pushq %rsp
 
-#ifdef PIC
+#ifdef LIBC_MAIN
+	call LIBC_MAIN
+#else
+# ifdef PIC
 	/* Pass address of our own entry points to .fini and .init.  */
 	mov __libc_csu_fini@GOTPCREL(%rip), %R8_LP
 	mov __libc_csu_init@GOTPCREL(%rip), %RCX_LP
 
 	mov main@GOTPCREL(%rip), %RDI_LP
-#else
+# else
 	/* Pass address of our own entry points to .fini and .init.  */
 	mov $__libc_csu_fini, %R8_LP
 	mov $__libc_csu_init, %RCX_LP
 
 	mov $main, %RDI_LP
-#endif
+# endif
 
 	/* Call the user's main function, and exit with its value.
 	   But let the libc call main.  Since __libc_start_main in
@@ -118,10 +135,12 @@ ENTRY (_start)
 	   2.26 or above can convert indirect branch into direct
 	   branch.  */
 	call *__libc_start_main@GOTPCREL(%rip)
+#endif
 
 	hlt			/* Crash if somehow `exit' does return.	 */
-END (_start)
+END (START)
 
+#ifndef LIBC_MAIN
 /* Define a symbol for the first piece of initialized data.  */
 	.data
 	.globl __data_start
@@ -129,3 +148,4 @@ __data_start:
 	.long 0
 	.weak data_start
 	data_start = __data_start
+#endif
-- 
2.26.2
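
For reference, the per-level values this patch caches in cpu_features
are the same ones user code already reads back through sysconf, so a
quick sanity check after applying it looks like this (a hypothetical
standalone test program, not part of the patch):

#include <stdio.h>
#include <unistd.h>

int
main (void)
{
  /* Each value is now served by __cache_sysconf from the precomputed
     cpu_features fields instead of re-running CPUID.  */
  printf ("L1d size:     %ld\n", sysconf (_SC_LEVEL1_DCACHE_SIZE));
  printf ("L1d linesize: %ld\n", sysconf (_SC_LEVEL1_DCACHE_LINESIZE));
  printf ("L2 size:      %ld\n", sysconf (_SC_LEVEL2_CACHE_SIZE));
  printf ("L3 size:      %ld\n", sysconf (_SC_LEVEL3_CACHE_SIZE));
  return 0;
}

The numbers also tie together: in the --list-tunables output quoted
later in the thread, x86_shared_cache_size is 0x100000 and
x86_non_temporal_threshold is 0x600000, which is consistent with the
shared * threads * 3 / 4 default above for threads == 8 on that
machine.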


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: V3 [PATCH] libc.so: Add --list-tunables support to __libc_main
  2020-06-06 21:51                                   ` V3 [PATCH] libc.so: Add --list-tunables support to __libc_main H.J. Lu
@ 2020-07-02 18:00                                     ` Carlos O'Donell
  2020-07-02 19:08                                       ` [PATCH] Update tunable min/max values H.J. Lu
  0 siblings, 1 reply; 32+ messages in thread
From: Carlos O'Donell @ 2020-07-02 18:00 UTC (permalink / raw)
  To: H.J. Lu, GNU C Library; +Cc: Florian Weimer, Hushiyuan

On 6/6/20 5:51 PM, H.J. Lu wrote:
> On Fri, Jun 5, 2020 at 3:45 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>> On Thu, Jun 04, 2020 at 02:00:35PM -0700, H.J. Lu wrote:
>>> On Mon, Jun 1, 2020 at 7:08 PM Carlos O'Donell <carlos@redhat.com> wrote:
>>>>
>>>> On Mon, Jun 1, 2020 at 6:44 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>> Tunables are designed to pass info from user to glibc, not the other
>>>>> way around.  When __libc_main is called, init_cacheinfo is never
>>>>> called.  I can call init_cacheinfo from __libc_main.  But there is no
>>>>> interface to update min and max values from init_cacheinfo.  I don't
>>>>> think --list-tunables will work here without changes to tunables.
>>>>
>>>> You have a dynamic threshold.
>>>>
>>>> You have to tell the user what that minimum is, otherwise they can't
>>>> use the tunable reliably.
>>>>
>>>> This is the first instance of a min/max that is dynamically determined.
>>>>
>>>> You must fetch the cache info ahead of the tunable initialization, that
>>>> is you must call init_cacheinfo before __init_tunables.
>>>>
>>>> You can initialize the tunable data dynamically like this:
>>>>
>>>> /* Dynamically set the min and max of glibc.foo.bar.  */
>>>> tunable_id_t id = TUNABLE_ENUM_NAME (glibc, foo, bar);
>>>> tunable_list[id].type.min = lowval;
>>>> tunable_list[id].type.max = highval;
>>>>
>>>> We do something similar for maybe_enable_malloc_check.
>>>>
>>>> Then once the tunables are parsed, and the cpu features are loaded
>>>> you can print the tunables, and the printed tunables will have meaningful
>>>> min and max values.
>>>>
>>>> If you have circular dependency, then you must process the cpu features
>>>> first without reading from the tunables, then allow the tunables to be
>>>> initialized from the system, *then* process the tunables to alter the existing
>>>> cpu feature settings.
>>>>
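To spell that out against the tunable in this patch (a sketch only:
tunable_id_t, TUNABLE_ENUM_NAME and tunable_list are the private
machinery in elf/dl-tunables.c, and the function name and minimum
chosen here are illustrative, not from any posted patch):

/* Would live where tunable_list is visible, i.e. in elf/dl-tunables.c,
   and run after the cache geometry is known but before the tunable
   values are printed.  */
static void
update_non_temporal_bounds (long int shared)
{
  tunable_id_t id
    = TUNABLE_ENUM_NAME (glibc, cpu, x86_non_temporal_threshold);

  /* Illustrative bounds: below one shared cache's worth the
     non-temporal path cannot win; mirror the open-ended maximum used
     elsewhere.  */
  tunable_list[id].type.min = shared;
  tunable_list[id].type.max = (long int) -1;
}

The TUNABLE_UPDATE (x86_non_temporal_threshold, ...) call at the end
of __init_cacheinfo in the posted patch is presumably the wrapped form
of exactly this kind of store.
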
>>>
>>> How about this?  I got
>>>
>>
>> Here is the updated patch, which depends on
>>
>> https://sourceware.org/pipermail/libc-alpha/2020-June/114820.html
>>
>> to add "%d" support to _dl_debug_vdprintf.  I got
>>
>> $ ./elf/ld.so ./libc.so --list-tunables
>> glibc.elision.skip_lock_after_retries: 3 (min: -2147483648, max: 2147483647)
>> glibc.malloc.trim_threshold: 0x0 (min: 0x0, max: 0xffffffff)
>> glibc.malloc.perturb: 0 (min: 0, max: 255)
>> glibc.cpu.x86_shared_cache_size: 0x100000 (min: 0x0, max: 0xffffffff)
>> glibc.elision.tries: 3 (min: -2147483648, max: 2147483647)
>> glibc.elision.enable: 0 (min: 0, max: 1)
>> glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffff)
>> glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
>> glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffff)
>> glibc.cpu.x86_non_temporal_threshold: 0x600000 (min: 0x0, max: 0xffffffff)
>> glibc.cpu.x86_shstk:
>> glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffff)
>> glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
>> glibc.elision.skip_trylock_internal_abort: 3 (min: -2147483648, max: 2147483647)
>> glibc.malloc.tcache_unsorted_limit: 0x0 (min: 0x0, max: 0xffffffff)
>> glibc.cpu.x86_ibt:
>> glibc.cpu.hwcaps:
>> glibc.elision.skip_lock_internal_abort: 3 (min: -2147483648, max: 2147483647)
>> glibc.malloc.arena_max: 0x0 (min: 0x1, max: 0xffffffff)
>> glibc.malloc.mmap_threshold: 0x0 (min: 0x0, max: 0xffffffff)
>> glibc.cpu.x86_data_cache_size: 0x8000 (min: 0x0, max: 0xffffffff)
>> glibc.malloc.tcache_count: 0x0 (min: 0x0, max: 0xffffffff)
>> glibc.malloc.arena_test: 0x0 (min: 0x1, max: 0xffffffff)
>> glibc.pthread.mutex_spin_count: 100 (min: 0, max: 32767)
>> glibc.malloc.tcache_max: 0x0 (min: 0x0, max: 0xffffffff)
>> glibc.malloc.check: 0 (min: 0, max: 3)
>> $
>>
>> Ok for master?
>>
> 
> Here is the updated patch.  To support --list-tunables, a target should add
> 
> CPPFLAGS-version.c = -DLIBC_MAIN=__libc_main_body
> CPPFLAGS-libc-main.S = -DLIBC_MAIN=__libc_main_body
> 
> and start.S should be updated to define __libc_main and call
> __libc_main_body:
> 
> extern void __libc_main_body (int argc, char **argv)
>   __attribute__ ((noreturn, visibility ("hidden")));
> 
> when LIBC_MAIN is defined.

I like where this patch is going, but the __libc_main wiring up means
we'll have to delay this until glibc 2.33 opens for development and
give the architectures time to fill in the required pieces of assembly.

Can we split this into:

(a) The minimum required to implement the feature, e.g. just the tunable
    without my requested changes.

(b) A second patch which implements --list-tunables, so that users can
    see which values they can choose.

That way we can commit (a) right now, and then commit (b) when we
reopen for development?

-- 
Cheers,
Carlos.
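
To make the sequence Carlos outlines concrete, here is a minimal sketch of
the dynamic min/max initialization, with lowval/highval standing in for
values computed by init_cacheinfo (the tunable name is only illustrative):

  /* 1. Compute the cache geometry first, without reading any tunables.  */
  init_cacheinfo ();

  /* 2. Clamp the dynamically determined bounds into the tunable table
     before the tunable strings are parsed.  */
  tunable_id_t id = TUNABLE_ENUM_NAME (glibc, cpu, x86_non_temporal_threshold);
  tunable_list[id].type.min = lowval;
  tunable_list[id].type.max = highval;

  /* 3. Parse the tunable strings from the environment...  */
  __tunables_init (environ);

  /* 4. ...and only then apply the parsed tunables back onto the cpu
     feature settings, breaking the circular dependency.  */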


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH] Update tunable min/max values
  2020-07-02 18:00                                     ` Carlos O'Donell
@ 2020-07-02 19:08                                       ` H.J. Lu
  2020-07-03 16:14                                         ` Carlos O'Donell
  0 siblings, 1 reply; 32+ messages in thread
From: H.J. Lu @ 2020-07-02 19:08 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: GNU C Library, Florian Weimer, Hushiyuan

On Thu, Jul 02, 2020 at 02:00:54PM -0400, Carlos O'Donell wrote:
> On 6/6/20 5:51 PM, H.J. Lu wrote:
> > On Fri, Jun 5, 2020 at 3:45 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >>
> >> On Thu, Jun 04, 2020 at 02:00:35PM -0700, H.J. Lu wrote:
> >>> On Mon, Jun 1, 2020 at 7:08 PM Carlos O'Donell <carlos@redhat.com> wrote:
> >>>>
> >>>> On Mon, Jun 1, 2020 at 6:44 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >>>>> Tunables are designed to pass info from user to glibc, not the other
> >>>>> way around.  When __libc_main is called, init_cacheinfo is never
> >>>>> called.  I can call init_cacheinfo from __libc_main.  But there is no
> >>>>> interface to update min and max values from init_cacheinfo.  I don't
> >>>>> think --list-tunables will work here without changes to tunables.
> >>>>
> >>>> You have a dynamic threshold.
> >>>>
> >>>> You have to tell the user what that minimum is; otherwise they can't
> >>>> use the tunable reliably.
> >>>>
> >>>> This is the first instance of a min/max that is dynamically determined.
> >>>>
> >>>> You must fetch the cache info ahead of the tunable initialization, that
> >>>> is, you must call init_cacheinfo before __tunables_init.
> >>>>
> >>>> You can initialize the tunable data dynamically like this:
> >>>>
> >>>> /* Dynamically set the min and max of glibc.foo.bar.  */
> >>>> tunable_id_t id = TUNABLE_ENUM_NAME (glibc, foo, bar);
> >>>> tunable_list[id].type.min = lowval;
> >>>> tunable_list[id].type.max = highval;
> >>>>
> >>>> We do something similar for maybe_enable_malloc_check.
> >>>>
> >>>> Then once the tunables are parsed and the cpu features are loaded,
> >>>> you can print the tunables, and the printed tunables will have meaningful
> >>>> min and max values.
> >>>>
> >>>> If you have a circular dependency, then you must process the cpu features
> >>>> first without reading from the tunables, then allow the tunables to be
> >>>> initialized from the system, *then* process the tunables to alter the existing
> >>>> cpu feature settings.
> >>>>
> >>>
> >>> How about this?  I got
> >>>
> >>
> >> Here is the updated patch, which depends on
> >>
> >> https://sourceware.org/pipermail/libc-alpha/2020-June/114820.html
> >>
> >> to add "%d" support to _dl_debug_vdprintf.  I got
> >>
> >> $ ./elf/ld.so ./libc.so --list-tunables
> >> glibc.elision.skip_lock_after_retries: 3 (min: -2147483648, max: 2147483647)
> >> glibc.malloc.trim_threshold: 0x0 (min: 0x0, max: 0xffffffff)
> >> glibc.malloc.perturb: 0 (min: 0, max: 255)
> >> glibc.cpu.x86_shared_cache_size: 0x100000 (min: 0x0, max: 0xffffffff)
> >> glibc.elision.tries: 3 (min: -2147483648, max: 2147483647)
> >> glibc.elision.enable: 0 (min: 0, max: 1)
> >> glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffff)
> >> glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> >> glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffff)
> >> glibc.cpu.x86_non_temporal_threshold: 0x600000 (min: 0x0, max: 0xffffffff)
> >> glibc.cpu.x86_shstk:
> >> glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffff)
> >> glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> >> glibc.elision.skip_trylock_internal_abort: 3 (min: -2147483648, max: 2147483647)
> >> glibc.malloc.tcache_unsorted_limit: 0x0 (min: 0x0, max: 0xffffffff)
> >> glibc.cpu.x86_ibt:
> >> glibc.cpu.hwcaps:
> >> glibc.elision.skip_lock_internal_abort: 3 (min: -2147483648, max: 2147483647)
> >> glibc.malloc.arena_max: 0x0 (min: 0x1, max: 0xffffffff)
> >> glibc.malloc.mmap_threshold: 0x0 (min: 0x0, max: 0xffffffff)
> >> glibc.cpu.x86_data_cache_size: 0x8000 (min: 0x0, max: 0xffffffff)
> >> glibc.malloc.tcache_count: 0x0 (min: 0x0, max: 0xffffffff)
> >> glibc.malloc.arena_test: 0x0 (min: 0x1, max: 0xffffffff)
> >> glibc.pthread.mutex_spin_count: 100 (min: 0, max: 32767)
> >> glibc.malloc.tcache_max: 0x0 (min: 0x0, max: 0xffffffff)
> >> glibc.malloc.check: 0 (min: 0, max: 3)
> >> $
> >>
> >> Ok for master?
> >>
> > 
> > Here is the updated patch.  To support --list-tunables, a target should add
> > 
> > CPPFLAGS-version.c = -DLIBC_MAIN=__libc_main_body
> > CPPFLAGS-libc-main.S = -DLIBC_MAIN=__libc_main_body
> > 
> > and start.S should be updated to define __libc_main and call
> > __libc_main_body:
> > 
> > extern void __libc_main_body (int argc, char **argv)
> >   __attribute__ ((noreturn, visibility ("hidden")));
> > 
> > when LIBC_MAIN is defined.
> 
> I like where this patch is going, but the __libc_main wiring up means
> we'll have to delay this until glibc 2.33 opens for development and
> give the architectures time to fill in the required pieces of assembly.
> 
> Can we split this into:
> 
> (a) The minimum required to implement the feature, e.g. just the tunable
>     without my requested changes.
> 
> (b) A second patch which implements --list-tunables, so that users can
>     see which values they can choose.
> 
> That way we can commit (a) right now, and then commit (b) when we
> reopen for development?
> 

Like this?

Thanks.

H.J.
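
Expressed in C for clarity, the start.S wiring described earlier in the
thread would amount to something like this sketch (real targets provide
__libc_main in assembly; the rename comes from
-DLIBC_MAIN=__libc_main_body):

  extern void __libc_main_body (int argc, char **argv)
    __attribute__ ((noreturn, visibility ("hidden")));

  /* Sketch only: the target's entry code defines __libc_main and hands
     control to the renamed C body.  */
  void
  __libc_main (int argc, char **argv)
  {
    __libc_main_body (argc, argv);
  }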
---
Add __tunable_update_val to update tunable min/max values and move x86
processor cache info to cpu_features.
---
 elf/dl-tunables.c          |  51 ++-
 elf/dl-tunables.h          |  15 +
 sysdeps/i386/cacheinfo.c   |   3 -
 sysdeps/x86/Makefile       |   2 +-
 sysdeps/x86/cacheinfo.c    | 852 +++--------------------------------
 sysdeps/x86/cpu-features.c |  19 +-
 sysdeps/x86/cpu-features.h |  26 ++
 sysdeps/x86/dl-cacheinfo.c | 888 +++++++++++++++++++++++++++++++++++++
 sysdeps/x86/init-arch.h    |   3 +
 9 files changed, 1024 insertions(+), 835 deletions(-)
 delete mode 100644 sysdeps/i386/cacheinfo.c
 create mode 100644 sysdeps/x86/dl-cacheinfo.c
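
For orientation before the diff: with this patch, code that computes a
dynamic threshold can publish the value together with its min/max bounds
in a single call.  A usage sketch (the variable names are illustrative,
not part of the patch):

  /* In a file compiled with the cpu namespace selected, as
     dl-cacheinfo.c below does:
       #define TUNABLE_NAMESPACE cpu
       #include <elf/dl-tunables.h>  */
  TUNABLE_UPDATE (x86_non_temporal_threshold, long int,
                  computed_threshold, computed_min, computed_max);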

diff --git a/elf/dl-tunables.c b/elf/dl-tunables.c
index 26e6e26612..7c9f1ca31f 100644
--- a/elf/dl-tunables.c
+++ b/elf/dl-tunables.c
@@ -100,31 +100,39 @@ get_next_env (char **envp, char **name, size_t *namelen, char **val,
     }									      \
 })
 
+#define TUNABLE_UPDATE_VAL(__cur, __val, __min, __max, __type)		      \
+({									      \
+  (__cur)->type.min = (__min);						      \
+  (__cur)->type.max = (__max);						      \
+  (__cur)->val.numval = (__val);					      \
+  (__cur)->initialized = true;						      \
+})
+
 static void
-do_tunable_update_val (tunable_t *cur, const void *valp)
+do_tunable_update_val (tunable_t *cur, const void *valp,
+		       const void *minp, const void *maxp)
 {
-  uint64_t val;
+  uint64_t val, min, max;
 
   if (cur->type.type_code != TUNABLE_TYPE_STRING)
-    val = *((int64_t *) valp);
+    {
+      val = *((int64_t *) valp);
+      if (minp)
+	min = *((int64_t *) minp);
+      if (maxp)
+	max = *((int64_t *) maxp);
+    }
 
   switch (cur->type.type_code)
     {
     case TUNABLE_TYPE_INT_32:
-	{
-	  TUNABLE_SET_VAL_IF_VALID_RANGE (cur, val, int64_t);
-	  break;
-	}
     case TUNABLE_TYPE_UINT_64:
-	{
-	  TUNABLE_SET_VAL_IF_VALID_RANGE (cur, val, uint64_t);
-	  break;
-	}
     case TUNABLE_TYPE_SIZE_T:
-	{
-	  TUNABLE_SET_VAL_IF_VALID_RANGE (cur, val, uint64_t);
-	  break;
-	}
+      if (minp && maxp)
+	TUNABLE_UPDATE_VAL (cur, val, min, max, int64_t);
+      else
+	TUNABLE_SET_VAL_IF_VALID_RANGE (cur, val, int64_t);
+      break;
     case TUNABLE_TYPE_STRING:
 	{
 	  cur->val.strval = valp;
@@ -153,7 +161,7 @@ tunable_initialize (tunable_t *cur, const char *strval)
       cur->initialized = true;
       valp = strval;
     }
-  do_tunable_update_val (cur, valp);
+  do_tunable_update_val (cur, valp, NULL, NULL);
 }
 
 void
@@ -161,8 +169,17 @@ __tunable_set_val (tunable_id_t id, void *valp)
 {
   tunable_t *cur = &tunable_list[id];
 
-  do_tunable_update_val (cur, valp);
+  do_tunable_update_val (cur, valp, NULL, NULL);
+}
+
+void
+__tunable_update_val (tunable_id_t id, void *valp, void *minp, void *maxp)
+{
+  tunable_t *cur = &tunable_list[id];
+
+  do_tunable_update_val (cur, valp, minp, maxp);
 }
+rtld_hidden_def (__tunable_update_val)
 
 #if TUNABLES_FRONTEND == TUNABLES_FRONTEND_valstring
 /* Parse the tunable string TUNESTR and adjust it to drop any tunables that may
diff --git a/elf/dl-tunables.h b/elf/dl-tunables.h
index f05eb50c2f..f6bf7379af 100644
--- a/elf/dl-tunables.h
+++ b/elf/dl-tunables.h
@@ -71,8 +71,10 @@ typedef struct _tunable tunable_t;
 extern void __tunables_init (char **);
 extern void __tunable_get_val (tunable_id_t, void *, tunable_callback_t);
 extern void __tunable_set_val (tunable_id_t, void *);
+extern void __tunable_update_val (tunable_id_t, void *, void *, void *);
 rtld_hidden_proto (__tunables_init)
 rtld_hidden_proto (__tunable_get_val)
+rtld_hidden_proto (__tunable_update_val)
 
 /* Define TUNABLE_GET and TUNABLE_SET in short form if TOP_NAMESPACE and
    TUNABLE_NAMESPACE are defined.  This is useful shorthand to get and set
@@ -82,11 +84,16 @@ rtld_hidden_proto (__tunable_get_val)
   TUNABLE_GET_FULL (TOP_NAMESPACE, TUNABLE_NAMESPACE, __id, __type, __cb)
 # define TUNABLE_SET(__id, __type, __val) \
   TUNABLE_SET_FULL (TOP_NAMESPACE, TUNABLE_NAMESPACE, __id, __type, __val)
+# define TUNABLE_UPDATE(__id, __type, __val, __min, __max) \
+  TUNABLE_UPDATE_FULL (TOP_NAMESPACE, TUNABLE_NAMESPACE, __id, __type, \
+		       __val, __min, __max)
 #else
 # define TUNABLE_GET(__top, __ns, __id, __type, __cb) \
   TUNABLE_GET_FULL (__top, __ns, __id, __type, __cb)
 # define TUNABLE_SET(__top, __ns, __id, __type, __val) \
   TUNABLE_SET_FULL (__top, __ns, __id, __type, __val)
+# define TUNABLE_UPDATE(__top, __ns, __id, __type, __val, __min, __max) \
+  TUNABLE_UPDATE_FULL (__top, __ns, __id, __type, __val, __min, __max)
 #endif
 
 /* Get and return a tunable value.  If the tunable was set externally and __CB
@@ -106,6 +113,14 @@ rtld_hidden_proto (__tunable_get_val)
 			& (__type) {__val});				      \
 })
 
+/* Update a tunable value.  */
+# define TUNABLE_UPDATE_FULL(__top, __ns, __id, __type, __val, __min, __max) \
+({									      \
+  __tunable_update_val (TUNABLE_ENUM_NAME (__top, __ns, __id),		      \
+			& (__type) {__val},  & (__type) {__min},	      \
+			& (__type) {__max});				      \
+})
+
 /* Namespace sanity for callback functions.  Use this macro to keep the
    namespace of the modules clean.  */
 # define TUNABLE_CALLBACK(__name) _dl_tunable_ ## __name
diff --git a/sysdeps/i386/cacheinfo.c b/sysdeps/i386/cacheinfo.c
deleted file mode 100644
index f15fe0779a..0000000000
--- a/sysdeps/i386/cacheinfo.c
+++ /dev/null
@@ -1,3 +0,0 @@
-#define DISABLE_PREFETCHW
-
-#include <sysdeps/x86/cacheinfo.c>
diff --git a/sysdeps/x86/Makefile b/sysdeps/x86/Makefile
index beab426f67..0872e0e655 100644
--- a/sysdeps/x86/Makefile
+++ b/sysdeps/x86/Makefile
@@ -3,7 +3,7 @@ gen-as-const-headers += cpu-features-offsets.sym
 endif
 
 ifeq ($(subdir),elf)
-sysdep-dl-routines += dl-get-cpu-features
+sysdep-dl-routines += dl-get-cpu-features dl-cacheinfo
 
 tests += tst-get-cpu-features tst-get-cpu-features-static
 tests-static += tst-get-cpu-features-static
diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index 311502dee3..8c4c7f9972 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -18,498 +18,9 @@
 
 #if IS_IN (libc)
 
-#include <assert.h>
-#include <stdbool.h>
-#include <stdlib.h>
 #include <unistd.h>
-#include <cpuid.h>
 #include <init-arch.h>
 
-static const struct intel_02_cache_info
-{
-  unsigned char idx;
-  unsigned char assoc;
-  unsigned char linesize;
-  unsigned char rel_name;
-  unsigned int size;
-} intel_02_known [] =
-  {
-#define M(sc) ((sc) - _SC_LEVEL1_ICACHE_SIZE)
-    { 0x06,  4, 32, M(_SC_LEVEL1_ICACHE_SIZE),    8192 },
-    { 0x08,  4, 32, M(_SC_LEVEL1_ICACHE_SIZE),   16384 },
-    { 0x09,  4, 32, M(_SC_LEVEL1_ICACHE_SIZE),   32768 },
-    { 0x0a,  2, 32, M(_SC_LEVEL1_DCACHE_SIZE),    8192 },
-    { 0x0c,  4, 32, M(_SC_LEVEL1_DCACHE_SIZE),   16384 },
-    { 0x0d,  4, 64, M(_SC_LEVEL1_DCACHE_SIZE),   16384 },
-    { 0x0e,  6, 64, M(_SC_LEVEL1_DCACHE_SIZE),   24576 },
-    { 0x21,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
-    { 0x22,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),   524288 },
-    { 0x23,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  1048576 },
-    { 0x25,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
-    { 0x29,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
-    { 0x2c,  8, 64, M(_SC_LEVEL1_DCACHE_SIZE),   32768 },
-    { 0x30,  8, 64, M(_SC_LEVEL1_ICACHE_SIZE),   32768 },
-    { 0x39,  4, 64, M(_SC_LEVEL2_CACHE_SIZE),   131072 },
-    { 0x3a,  6, 64, M(_SC_LEVEL2_CACHE_SIZE),   196608 },
-    { 0x3b,  2, 64, M(_SC_LEVEL2_CACHE_SIZE),   131072 },
-    { 0x3c,  4, 64, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
-    { 0x3d,  6, 64, M(_SC_LEVEL2_CACHE_SIZE),   393216 },
-    { 0x3e,  4, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
-    { 0x3f,  2, 64, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
-    { 0x41,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),   131072 },
-    { 0x42,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
-    { 0x43,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
-    { 0x44,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
-    { 0x45,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),  2097152 },
-    { 0x46,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
-    { 0x47,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  8388608 },
-    { 0x48, 12, 64, M(_SC_LEVEL2_CACHE_SIZE),  3145728 },
-    { 0x49, 16, 64, M(_SC_LEVEL2_CACHE_SIZE),  4194304 },
-    { 0x4a, 12, 64, M(_SC_LEVEL3_CACHE_SIZE),  6291456 },
-    { 0x4b, 16, 64, M(_SC_LEVEL3_CACHE_SIZE),  8388608 },
-    { 0x4c, 12, 64, M(_SC_LEVEL3_CACHE_SIZE), 12582912 },
-    { 0x4d, 16, 64, M(_SC_LEVEL3_CACHE_SIZE), 16777216 },
-    { 0x4e, 24, 64, M(_SC_LEVEL2_CACHE_SIZE),  6291456 },
-    { 0x60,  8, 64, M(_SC_LEVEL1_DCACHE_SIZE),   16384 },
-    { 0x66,  4, 64, M(_SC_LEVEL1_DCACHE_SIZE),    8192 },
-    { 0x67,  4, 64, M(_SC_LEVEL1_DCACHE_SIZE),   16384 },
-    { 0x68,  4, 64, M(_SC_LEVEL1_DCACHE_SIZE),   32768 },
-    { 0x78,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
-    { 0x79,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   131072 },
-    { 0x7a,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
-    { 0x7b,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
-    { 0x7c,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
-    { 0x7d,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),  2097152 },
-    { 0x7f,  2, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
-    { 0x80,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
-    { 0x82,  8, 32, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
-    { 0x83,  8, 32, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
-    { 0x84,  8, 32, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
-    { 0x85,  8, 32, M(_SC_LEVEL2_CACHE_SIZE),  2097152 },
-    { 0x86,  4, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
-    { 0x87,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
-    { 0xd0,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),   524288 },
-    { 0xd1,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),  1048576 },
-    { 0xd2,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
-    { 0xd6,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  1048576 },
-    { 0xd7,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
-    { 0xd8,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
-    { 0xdc, 12, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
-    { 0xdd, 12, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
-    { 0xde, 12, 64, M(_SC_LEVEL3_CACHE_SIZE),  8388608 },
-    { 0xe2, 16, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
-    { 0xe3, 16, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
-    { 0xe4, 16, 64, M(_SC_LEVEL3_CACHE_SIZE),  8388608 },
-    { 0xea, 24, 64, M(_SC_LEVEL3_CACHE_SIZE), 12582912 },
-    { 0xeb, 24, 64, M(_SC_LEVEL3_CACHE_SIZE), 18874368 },
-    { 0xec, 24, 64, M(_SC_LEVEL3_CACHE_SIZE), 25165824 },
-  };
-
-#define nintel_02_known (sizeof (intel_02_known) / sizeof (intel_02_known [0]))
-
-static int
-intel_02_known_compare (const void *p1, const void *p2)
-{
-  const struct intel_02_cache_info *i1;
-  const struct intel_02_cache_info *i2;
-
-  i1 = (const struct intel_02_cache_info *) p1;
-  i2 = (const struct intel_02_cache_info *) p2;
-
-  if (i1->idx == i2->idx)
-    return 0;
-
-  return i1->idx < i2->idx ? -1 : 1;
-}
-
-
-static long int
-__attribute__ ((noinline))
-intel_check_word (int name, unsigned int value, bool *has_level_2,
-		  bool *no_level_2_or_3,
-		  const struct cpu_features *cpu_features)
-{
-  if ((value & 0x80000000) != 0)
-    /* The register value is reserved.  */
-    return 0;
-
-  /* Fold the name.  The _SC_ constants are always in the order SIZE,
-     ASSOC, LINESIZE.  */
-  int folded_rel_name = (M(name) / 3) * 3;
-
-  while (value != 0)
-    {
-      unsigned int byte = value & 0xff;
-
-      if (byte == 0x40)
-	{
-	  *no_level_2_or_3 = true;
-
-	  if (folded_rel_name == M(_SC_LEVEL3_CACHE_SIZE))
-	    /* No need to look further.  */
-	    break;
-	}
-      else if (byte == 0xff)
-	{
-	  /* CPUID leaf 0x4 contains all the information.  We need to
-	     iterate over it.  */
-	  unsigned int eax;
-	  unsigned int ebx;
-	  unsigned int ecx;
-	  unsigned int edx;
-
-	  unsigned int round = 0;
-	  while (1)
-	    {
-	      __cpuid_count (4, round, eax, ebx, ecx, edx);
-
-	      enum { null = 0, data = 1, inst = 2, uni = 3 } type = eax & 0x1f;
-	      if (type == null)
-		/* That was the end.  */
-		break;
-
-	      unsigned int level = (eax >> 5) & 0x7;
-
-	      if ((level == 1 && type == data
-		   && folded_rel_name == M(_SC_LEVEL1_DCACHE_SIZE))
-		  || (level == 1 && type == inst
-		      && folded_rel_name == M(_SC_LEVEL1_ICACHE_SIZE))
-		  || (level == 2 && folded_rel_name == M(_SC_LEVEL2_CACHE_SIZE))
-		  || (level == 3 && folded_rel_name == M(_SC_LEVEL3_CACHE_SIZE))
-		  || (level == 4 && folded_rel_name == M(_SC_LEVEL4_CACHE_SIZE)))
-		{
-		  unsigned int offset = M(name) - folded_rel_name;
-
-		  if (offset == 0)
-		    /* Cache size.  */
-		    return (((ebx >> 22) + 1)
-			    * (((ebx >> 12) & 0x3ff) + 1)
-			    * ((ebx & 0xfff) + 1)
-			    * (ecx + 1));
-		  if (offset == 1)
-		    return (ebx >> 22) + 1;
-
-		  assert (offset == 2);
-		  return (ebx & 0xfff) + 1;
-		}
-
-	      ++round;
-	    }
-	  /* There is no other cache information anywhere else.  */
-	  break;
-	}
-      else
-	{
-	  if (byte == 0x49 && folded_rel_name == M(_SC_LEVEL3_CACHE_SIZE))
-	    {
-	      /* Intel reused this value.  For family 15, model 6 it
-		 specifies the 3rd level cache.  Otherwise the 2nd
-		 level cache.  */
-	      unsigned int family = cpu_features->basic.family;
-	      unsigned int model = cpu_features->basic.model;
-
-	      if (family == 15 && model == 6)
-		{
-		  /* The level 3 cache is encoded for this model like
-		     the level 2 cache is for other models.  Pretend
-		     the caller asked for the level 2 cache.  */
-		  name = (_SC_LEVEL2_CACHE_SIZE
-			  + (name - _SC_LEVEL3_CACHE_SIZE));
-		  folded_rel_name = M(_SC_LEVEL2_CACHE_SIZE);
-		}
-	    }
-
-	  struct intel_02_cache_info *found;
-	  struct intel_02_cache_info search;
-
-	  search.idx = byte;
-	  found = bsearch (&search, intel_02_known, nintel_02_known,
-			   sizeof (intel_02_known[0]), intel_02_known_compare);
-	  if (found != NULL)
-	    {
-	      if (found->rel_name == folded_rel_name)
-		{
-		  unsigned int offset = M(name) - folded_rel_name;
-
-		  if (offset == 0)
-		    /* Cache size.  */
-		    return found->size;
-		  if (offset == 1)
-		    return found->assoc;
-
-		  assert (offset == 2);
-		  return found->linesize;
-		}
-
-	      if (found->rel_name == M(_SC_LEVEL2_CACHE_SIZE))
-		*has_level_2 = true;
-	    }
-	}
-
-      /* Next byte for the next round.  */
-      value >>= 8;
-    }
-
-  /* Nothing found.  */
-  return 0;
-}
-
-
-static long int __attribute__ ((noinline))
-handle_intel (int name, const struct cpu_features *cpu_features)
-{
-  unsigned int maxidx = cpu_features->basic.max_cpuid;
-
-  /* Return -1 for older CPUs.  */
-  if (maxidx < 2)
-    return -1;
-
-  /* OK, we can use the CPUID instruction to get all info about the
-     caches.  */
-  unsigned int cnt = 0;
-  unsigned int max = 1;
-  long int result = 0;
-  bool no_level_2_or_3 = false;
-  bool has_level_2 = false;
-
-  while (cnt++ < max)
-    {
-      unsigned int eax;
-      unsigned int ebx;
-      unsigned int ecx;
-      unsigned int edx;
-      __cpuid (2, eax, ebx, ecx, edx);
-
-      /* The low byte of EAX in the first round contain the number of
-	 rounds we have to make.  At least one, the one we are already
-	 doing.  */
-      if (cnt == 1)
-	{
-	  max = eax & 0xff;
-	  eax &= 0xffffff00;
-	}
-
-      /* Process the individual registers' value.  */
-      result = intel_check_word (name, eax, &has_level_2,
-				 &no_level_2_or_3, cpu_features);
-      if (result != 0)
-	return result;
-
-      result = intel_check_word (name, ebx, &has_level_2,
-				 &no_level_2_or_3, cpu_features);
-      if (result != 0)
-	return result;
-
-      result = intel_check_word (name, ecx, &has_level_2,
-				 &no_level_2_or_3, cpu_features);
-      if (result != 0)
-	return result;
-
-      result = intel_check_word (name, edx, &has_level_2,
-				 &no_level_2_or_3, cpu_features);
-      if (result != 0)
-	return result;
-    }
-
-  if (name >= _SC_LEVEL2_CACHE_SIZE && name <= _SC_LEVEL3_CACHE_LINESIZE
-      && no_level_2_or_3)
-    return -1;
-
-  return 0;
-}
-
-
-static long int __attribute__ ((noinline))
-handle_amd (int name)
-{
-  unsigned int eax;
-  unsigned int ebx;
-  unsigned int ecx;
-  unsigned int edx;
-  __cpuid (0x80000000, eax, ebx, ecx, edx);
-
-  /* No level 4 cache (yet).  */
-  if (name > _SC_LEVEL3_CACHE_LINESIZE)
-    return 0;
-
-  unsigned int fn = 0x80000005 + (name >= _SC_LEVEL2_CACHE_SIZE);
-  if (eax < fn)
-    return 0;
-
-  __cpuid (fn, eax, ebx, ecx, edx);
-
-  if (name < _SC_LEVEL1_DCACHE_SIZE)
-    {
-      name += _SC_LEVEL1_DCACHE_SIZE - _SC_LEVEL1_ICACHE_SIZE;
-      ecx = edx;
-    }
-
-  switch (name)
-    {
-    case _SC_LEVEL1_DCACHE_SIZE:
-      return (ecx >> 14) & 0x3fc00;
-
-    case _SC_LEVEL1_DCACHE_ASSOC:
-      ecx >>= 16;
-      if ((ecx & 0xff) == 0xff)
-	/* Fully associative.  */
-	return (ecx << 2) & 0x3fc00;
-      return ecx & 0xff;
-
-    case _SC_LEVEL1_DCACHE_LINESIZE:
-      return ecx & 0xff;
-
-    case _SC_LEVEL2_CACHE_SIZE:
-      return (ecx & 0xf000) == 0 ? 0 : (ecx >> 6) & 0x3fffc00;
-
-    case _SC_LEVEL2_CACHE_ASSOC:
-      switch ((ecx >> 12) & 0xf)
-	{
-	case 0:
-	case 1:
-	case 2:
-	case 4:
-	  return (ecx >> 12) & 0xf;
-	case 6:
-	  return 8;
-	case 8:
-	  return 16;
-	case 10:
-	  return 32;
-	case 11:
-	  return 48;
-	case 12:
-	  return 64;
-	case 13:
-	  return 96;
-	case 14:
-	  return 128;
-	case 15:
-	  return ((ecx >> 6) & 0x3fffc00) / (ecx & 0xff);
-	default:
-	  return 0;
-	}
-      /* NOTREACHED */
-
-    case _SC_LEVEL2_CACHE_LINESIZE:
-      return (ecx & 0xf000) == 0 ? 0 : ecx & 0xff;
-
-    case _SC_LEVEL3_CACHE_SIZE:
-      return (edx & 0xf000) == 0 ? 0 : (edx & 0x3ffc0000) << 1;
-
-    case _SC_LEVEL3_CACHE_ASSOC:
-      switch ((edx >> 12) & 0xf)
-	{
-	case 0:
-	case 1:
-	case 2:
-	case 4:
-	  return (edx >> 12) & 0xf;
-	case 6:
-	  return 8;
-	case 8:
-	  return 16;
-	case 10:
-	  return 32;
-	case 11:
-	  return 48;
-	case 12:
-	  return 64;
-	case 13:
-	  return 96;
-	case 14:
-	  return 128;
-	case 15:
-	  return ((edx & 0x3ffc0000) << 1) / (edx & 0xff);
-	default:
-	  return 0;
-	}
-      /* NOTREACHED */
-
-    case _SC_LEVEL3_CACHE_LINESIZE:
-      return (edx & 0xf000) == 0 ? 0 : edx & 0xff;
-
-    default:
-      assert (! "cannot happen");
-    }
-  return -1;
-}
-
-
-static long int __attribute__ ((noinline))
-handle_zhaoxin (int name)
-{
-  unsigned int eax;
-  unsigned int ebx;
-  unsigned int ecx;
-  unsigned int edx;
-
-  int folded_rel_name = (M(name) / 3) * 3;
-
-  unsigned int round = 0;
-  while (1)
-    {
-      __cpuid_count (4, round, eax, ebx, ecx, edx);
-
-      enum { null = 0, data = 1, inst = 2, uni = 3 } type = eax & 0x1f;
-      if (type == null)
-        break;
-
-      unsigned int level = (eax >> 5) & 0x7;
-
-      if ((level == 1 && type == data
-        && folded_rel_name == M(_SC_LEVEL1_DCACHE_SIZE))
-        || (level == 1 && type == inst
-            && folded_rel_name == M(_SC_LEVEL1_ICACHE_SIZE))
-        || (level == 2 && folded_rel_name == M(_SC_LEVEL2_CACHE_SIZE))
-        || (level == 3 && folded_rel_name == M(_SC_LEVEL3_CACHE_SIZE)))
-        {
-          unsigned int offset = M(name) - folded_rel_name;
-
-          if (offset == 0)
-            /* Cache size.  */
-            return (((ebx >> 22) + 1)
-                * (((ebx >> 12) & 0x3ff) + 1)
-                * ((ebx & 0xfff) + 1)
-                * (ecx + 1));
-          if (offset == 1)
-            return (ebx >> 22) + 1;
-
-          assert (offset == 2);
-          return (ebx & 0xfff) + 1;
-        }
-
-      ++round;
-    }
-
-  /* Nothing found.  */
-  return 0;
-}
-
-
-/* Get the value of the system variable NAME.  */
-long int
-attribute_hidden
-__cache_sysconf (int name)
-{
-  const struct cpu_features *cpu_features = __get_cpu_features ();
-
-  if (cpu_features->basic.kind == arch_kind_intel)
-    return handle_intel (name, cpu_features);
-
-  if (cpu_features->basic.kind == arch_kind_amd)
-    return handle_amd (name);
-
-  if (cpu_features->basic.kind == arch_kind_zhaoxin)
-    return handle_zhaoxin (name);
-
-  // XXX Fill in more vendors.
-
-  /* CPU not known, we have no information.  */
-  return 0;
-}
-
-
 /* Data cache size for use in memory and string routines, typically
    L1 size, rounded to multiple of 256 bytes.  */
 long int __x86_data_cache_size_half attribute_hidden = 32 * 1024 / 2;
@@ -530,348 +41,85 @@ long int __x86_raw_shared_cache_size attribute_hidden = 1024 * 1024;
 /* Threshold to use non temporal store.  */
 long int __x86_shared_non_temporal_threshold attribute_hidden;
 
-#ifndef DISABLE_PREFETCHW
+#ifndef __x86_64__
 /* PREFETCHW support flag for use in memory and string routines.  */
 int __x86_prefetchw attribute_hidden;
 #endif
 
-
-static void
-get_common_cache_info (long int *shared_ptr, unsigned int *threads_ptr,
-                long int core)
+/* Get the value of the system variable NAME.  */
+long int
+attribute_hidden
+__cache_sysconf (int name)
 {
-  unsigned int eax;
-  unsigned int ebx;
-  unsigned int ecx;
-  unsigned int edx;
-
-  /* Number of logical processors sharing L2 cache.  */
-  int threads_l2;
-
-  /* Number of logical processors sharing L3 cache.  */
-  int threads_l3;
-
   const struct cpu_features *cpu_features = __get_cpu_features ();
-  int max_cpuid = cpu_features->basic.max_cpuid;
-  unsigned int family = cpu_features->basic.family;
-  unsigned int model = cpu_features->basic.model;
-  long int shared = *shared_ptr;
-  unsigned int threads = *threads_ptr;
-  bool inclusive_cache = true;
-  bool support_count_mask = true;
-
-  /* Try L3 first.  */
-  unsigned int level = 3;
-
-  if (cpu_features->basic.kind == arch_kind_zhaoxin && family == 6)
-    support_count_mask = false;
-
-  if (shared <= 0)
-    {
-      /* Try L2 otherwise.  */
-      level  = 2;
-      shared = core;
-      threads_l2 = 0;
-      threads_l3 = -1;
-    }
-  else
-    {
-      threads_l2 = 0;
-      threads_l3 = 0;
-    }
-
-  /* A value of 0 for the HTT bit indicates there is only a single
-     logical processor.  */
-  if (HAS_CPU_FEATURE (HTT))
+  switch (name)
     {
-      /* Figure out the number of logical threads that share the
-         highest cache level.  */
-      if (max_cpuid >= 4)
-        {
-          int i = 0;
-
-          /* Query until cache level 2 and 3 are enumerated.  */
-          int check = 0x1 | (threads_l3 == 0) << 1;
-          do
-            {
-              __cpuid_count (4, i++, eax, ebx, ecx, edx);
+    case _SC_LEVEL1_ICACHE_SIZE:
+      return cpu_features->level1_icache_size;
 
-              /* There seems to be a bug in at least some Pentium Ds
-                 which sometimes fail to iterate all cache parameters.
-                 Do not loop indefinitely here, stop in this case and
-                 assume there is no such information.  */
-              if (cpu_features->basic.kind == arch_kind_intel
-                  && (eax & 0x1f) == 0 )
-                goto intel_bug_no_cache_info;
+    case _SC_LEVEL1_DCACHE_SIZE:
+      return cpu_features->level1_dcache_size;
 
-              switch ((eax >> 5) & 0x7)
-                {
-                  default:
-                    break;
-                  case 2:
-                    if ((check & 0x1))
-                      {
-                        /* Get maximum number of logical processors
-                           sharing L2 cache.  */
-                        threads_l2 = (eax >> 14) & 0x3ff;
-                        check &= ~0x1;
-                      }
-                    break;
-                  case 3:
-                    if ((check & (0x1 << 1)))
-                      {
-                        /* Get maximum number of logical processors
-                           sharing L3 cache.  */
-                        threads_l3 = (eax >> 14) & 0x3ff;
+    case _SC_LEVEL1_DCACHE_ASSOC:
+      return cpu_features->level1_dcache_assoc;
 
-                        /* Check if L2 and L3 caches are inclusive.  */
-                        inclusive_cache = (edx & 0x2) != 0;
-                        check &= ~(0x1 << 1);
-                      }
-                    break;
-                }
-            }
-          while (check);
+    case _SC_LEVEL1_DCACHE_LINESIZE:
+      return cpu_features->level1_dcache_linesize;
 
-          /* If max_cpuid >= 11, THREADS_L2/THREADS_L3 are the maximum
-             numbers of addressable IDs for logical processors sharing
-             the cache, instead of the maximum number of threads
-             sharing the cache.  */
-          if (max_cpuid >= 11 && support_count_mask)
-            {
-              /* Find the number of logical processors shipped in
-                 one core and apply count mask.  */
-              i = 0;
+    case _SC_LEVEL2_CACHE_SIZE:
+      return cpu_features->level2_cache_size;
 
-              /* Count SMT only if there is L3 cache.  Always count
-                 core if there is no L3 cache.  */
-              int count = ((threads_l2 > 0 && level == 3)
-                           | ((threads_l3 > 0
-                               || (threads_l2 > 0 && level == 2)) << 1));
+    case _SC_LEVEL2_CACHE_ASSOC:
+      return cpu_features->level2_cache_assoc;
 
-              while (count)
-                {
-                  __cpuid_count (11, i++, eax, ebx, ecx, edx);
+    case _SC_LEVEL2_CACHE_LINESIZE:
+      return cpu_features->level2_cache_linesize;
 
-                  int shipped = ebx & 0xff;
-                  int type = ecx & 0xff00;
-                  if (shipped == 0 || type == 0)
-                    break;
-                  else if (type == 0x100)
-                    {
-                      /* Count SMT.  */
-                      if ((count & 0x1))
-                        {
-                          int count_mask;
+    case _SC_LEVEL3_CACHE_SIZE:
+      return cpu_features->level3_cache_size;
 
-                          /* Compute count mask.  */
-                          asm ("bsr %1, %0"
-                               : "=r" (count_mask) : "g" (threads_l2));
-                          count_mask = ~(-1 << (count_mask + 1));
-                          threads_l2 = (shipped - 1) & count_mask;
-                          count &= ~0x1;
-                        }
-                    }
-                  else if (type == 0x200)
-                    {
-                      /* Count core.  */
-                      if ((count & (0x1 << 1)))
-                        {
-                          int count_mask;
-                          int threads_core
-                            = (level == 2 ? threads_l2 : threads_l3);
+    case _SC_LEVEL3_CACHE_ASSOC:
+      return cpu_features->level3_cache_assoc;
 
-                          /* Compute count mask.  */
-                          asm ("bsr %1, %0"
-                               : "=r" (count_mask) : "g" (threads_core));
-                          count_mask = ~(-1 << (count_mask + 1));
-                          threads_core = (shipped - 1) & count_mask;
-                          if (level == 2)
-                            threads_l2 = threads_core;
-                          else
-                            threads_l3 = threads_core;
-                          count &= ~(0x1 << 1);
-                        }
-                    }
-                }
-            }
-          if (threads_l2 > 0)
-            threads_l2 += 1;
-          if (threads_l3 > 0)
-            threads_l3 += 1;
-          if (level == 2)
-            {
-              if (threads_l2)
-                {
-                  threads = threads_l2;
-                  if (cpu_features->basic.kind == arch_kind_intel
-                      && threads > 2
-                      && family == 6)
-                    switch (model)
-                      {
-                        case 0x37:
-                        case 0x4a:
-                        case 0x4d:
-                        case 0x5a:
-                        case 0x5d:
-                          /* Silvermont has L2 cache shared by 2 cores.  */
-                          threads = 2;
-                          break;
-                        default:
-                          break;
-                      }
-                }
-            }
-          else if (threads_l3)
-            threads = threads_l3;
-        }
-      else
-        {
-intel_bug_no_cache_info:
-          /* Assume that all logical threads share the highest cache
-             level.  */
-          threads
-            = ((cpu_features->cpuid[COMMON_CPUID_INDEX_1].ebx
-                >> 16) & 0xff);
-        }
+    case _SC_LEVEL3_CACHE_LINESIZE:
+      return cpu_features->level3_cache_linesize;
 
-        /* Cap usage of highest cache level to the number of supported
-           threads.  */
-        if (shared > 0 && threads > 0)
-          shared /= threads;
-    }
+    case _SC_LEVEL4_CACHE_SIZE:
+      return cpu_features->level4_cache_size;
 
-  /* Account for non-inclusive L2 and L3 caches.  */
-  if (!inclusive_cache)
-    {
-      if (threads_l2 > 0)
-        core /= threads_l2;
-      shared += core;
+    default:
+      break;
     }
-
-  *shared_ptr = shared;
-  *threads_ptr = threads;
+  return -1;
 }
 
-
 static void
 __attribute__((constructor))
 init_cacheinfo (void)
 {
-  /* Find out what brand of processor.  */
-  unsigned int ebx;
-  unsigned int ecx;
-  unsigned int edx;
-  int max_cpuid_ex;
-  long int data = -1;
-  long int shared = -1;
-  long int core;
-  unsigned int threads = 0;
   const struct cpu_features *cpu_features = __get_cpu_features ();
+  long int data = cpu_features->data_cache_size;
+  __x86_raw_data_cache_size_half = data / 2;
+  __x86_raw_data_cache_size = data;
+  /* Round data cache size to multiple of 256 bytes.  */
+  data = data & ~255L;
+  __x86_data_cache_size_half = data / 2;
+  __x86_data_cache_size = data;
+
+  long int shared = cpu_features->shared_cache_size;
+  __x86_raw_shared_cache_size_half = shared / 2;
+  __x86_raw_shared_cache_size = shared;
+  /* Round shared cache size to multiple of 256 bytes.  */
+  shared = shared & ~255L;
+  __x86_shared_cache_size_half = shared / 2;
+  __x86_shared_cache_size = shared;
 
-  if (cpu_features->basic.kind == arch_kind_intel)
-    {
-      data = handle_intel (_SC_LEVEL1_DCACHE_SIZE, cpu_features);
-      core = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
-      shared = handle_intel (_SC_LEVEL3_CACHE_SIZE, cpu_features);
-
-      get_common_cache_info (&shared, &threads, core);
-    }
-  else if (cpu_features->basic.kind == arch_kind_zhaoxin)
-    {
-      data = handle_zhaoxin (_SC_LEVEL1_DCACHE_SIZE);
-      core = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
-      shared = handle_zhaoxin (_SC_LEVEL3_CACHE_SIZE);
-
-      get_common_cache_info (&shared, &threads, core);
-    }
-  else if (cpu_features->basic.kind == arch_kind_amd)
-    {
-      data   = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
-      long int core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
-      shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
-
-      /* Get maximum extended function. */
-      __cpuid (0x80000000, max_cpuid_ex, ebx, ecx, edx);
-
-      if (shared <= 0)
-	/* No shared L3 cache.  All we have is the L2 cache.  */
-	shared = core;
-      else
-	{
-	  /* Figure out the number of logical threads that share L3.  */
-	  if (max_cpuid_ex >= 0x80000008)
-	    {
-	      /* Get width of APIC ID.  */
-	      __cpuid (0x80000008, max_cpuid_ex, ebx, ecx, edx);
-	      threads = 1 << ((ecx >> 12) & 0x0f);
-	    }
-
-	  if (threads == 0)
-	    {
-	      /* If APIC ID width is not available, use logical
-		 processor count.  */
-	      __cpuid (0x00000001, max_cpuid_ex, ebx, ecx, edx);
-
-	      if ((edx & (1 << 28)) != 0)
-		threads = (ebx >> 16) & 0xff;
-	    }
-
-	  /* Cap usage of highest cache level to the number of
-	     supported threads.  */
-	  if (threads > 0)
-	    shared /= threads;
-
-	  /* Account for exclusive L2 and L3 caches.  */
-	  shared += core;
-	}
+  __x86_shared_non_temporal_threshold
+    = cpu_features->non_temporal_threshold;
 
-#ifndef DISABLE_PREFETCHW
-      if (max_cpuid_ex >= 0x80000001)
-	{
-	  unsigned int eax;
-	  __cpuid (0x80000001, eax, ebx, ecx, edx);
-	  /*  PREFETCHW     || 3DNow!  */
-	  if ((ecx & 0x100) || (edx & 0x80000000))
-	    __x86_prefetchw = -1;
-	}
+#ifndef __x86_64__
+  __x86_prefetchw = cpu_features->prefetchw;
 #endif
-    }
-
-  if (cpu_features->data_cache_size != 0)
-    data = cpu_features->data_cache_size;
-
-  if (data > 0)
-    {
-      __x86_raw_data_cache_size_half = data / 2;
-      __x86_raw_data_cache_size = data;
-      /* Round data cache size to multiple of 256 bytes.  */
-      data = data & ~255L;
-      __x86_data_cache_size_half = data / 2;
-      __x86_data_cache_size = data;
-    }
-
-  if (cpu_features->shared_cache_size != 0)
-    shared = cpu_features->shared_cache_size;
-
-  if (shared > 0)
-    {
-      __x86_raw_shared_cache_size_half = shared / 2;
-      __x86_raw_shared_cache_size = shared;
-      /* Round shared cache size to multiple of 256 bytes.  */
-      shared = shared & ~255L;
-      __x86_shared_cache_size_half = shared / 2;
-      __x86_shared_cache_size = shared;
-    }
-
-  /* The large memcpy micro benchmark in glibc shows that 6 times of
-     shared cache size is the approximate value above which non-temporal
-     store becomes faster on a 8-core processor.  This is the 3/4 of the
-     total shared cache size.  */
-  __x86_shared_non_temporal_threshold
-    = (cpu_features->non_temporal_threshold != 0
-       ? cpu_features->non_temporal_threshold
-       : __x86_shared_cache_size * threads * 3 / 4);
 }
 
 #endif
diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index c351bdd54a..e718204c18 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -19,6 +19,7 @@
 #include <cpuid.h>
 #include <cpu-features.h>
 #include <dl-hwcap.h>
+#include <init-arch.h>
 #include <libc-pointer-arith.h>
 
 #if HAVE_TUNABLES
@@ -602,20 +603,14 @@ no_cpuid:
   cpu_features->basic.model = model;
   cpu_features->basic.stepping = stepping;
 
+  __init_cacheinfo ();
+
 #if HAVE_TUNABLES
   TUNABLE_GET (hwcaps, tunable_val_t *, TUNABLE_CALLBACK (set_hwcaps));
-  cpu_features->non_temporal_threshold
-    = TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
-  cpu_features->data_cache_size
-    = TUNABLE_GET (x86_data_cache_size, long int, NULL);
-  cpu_features->shared_cache_size
-    = TUNABLE_GET (x86_shared_cache_size, long int, NULL);
-#endif
-
-  /* Reuse dl_platform, dl_hwcap and dl_hwcap_mask for x86.  */
-#if !HAVE_TUNABLES && defined SHARED
-  /* The glibc.cpu.hwcap_mask tunable is initialized already, so no need to do
-     this.  */
+#elif defined SHARED
+  /* Reuse dl_platform, dl_hwcap and dl_hwcap_mask for x86.  The
+     glibc.cpu.hwcap_mask tunable is initialized already, so no
+     need to do this.  */
   GLRO(dl_hwcap_mask) = HWCAP_IMPORTANT;
 #endif
 
diff --git a/sysdeps/x86/cpu-features.h b/sysdeps/x86/cpu-features.h
index d66dc206f7..3aaed33cbc 100644
--- a/sysdeps/x86/cpu-features.h
+++ b/sysdeps/x86/cpu-features.h
@@ -102,6 +102,32 @@ struct cpu_features
   unsigned long int shared_cache_size;
   /* Threshold to use non temporal store.  */
   unsigned long int non_temporal_threshold;
+  /* _SC_LEVEL1_ICACHE_SIZE.  */
+  unsigned long int level1_icache_size;
+  /* _SC_LEVEL1_DCACHE_SIZE.  */
+  unsigned long int level1_dcache_size;
+  /* _SC_LEVEL1_DCACHE_ASSOC.  */
+  unsigned long int level1_dcache_assoc;
+  /* _SC_LEVEL1_DCACHE_LINESIZE.  */
+  unsigned long int level1_dcache_linesize;
+  /* _SC_LEVEL2_CACHE_SIZE.  */
+  unsigned long int level2_cache_size;
+  /* _SC_LEVEL2_CACHE_ASSOC.  */
+  unsigned long int level2_cache_assoc;
+  /* _SC_LEVEL2_CACHE_LINESIZE.  */
+  unsigned long int level2_cache_linesize;
+  /* _SC_LEVEL3_CACHE_SIZE.  */
+  unsigned long int level3_cache_size;
+  /* _SC_LEVEL3_CACHE_ASSOC.  */
+  unsigned long int level3_cache_assoc;
+  /* _SC_LEVEL3_CACHE_LINESIZE.  */
+  unsigned long int level3_cache_linesize;
+  /* _SC_LEVEL4_CACHE_SIZE.  */
+  unsigned long int level4_cache_size;
+#ifndef __x86_64__
+  /* PREFETCHW support flag for use in memory and string routines.  */
+  unsigned long int prefetchw;
+#endif
 };
 
 /* Used from outside of glibc to get access to the CPU features
diff --git a/sysdeps/x86/dl-cacheinfo.c b/sysdeps/x86/dl-cacheinfo.c
new file mode 100644
index 0000000000..8e2a6f552c
--- /dev/null
+++ b/sysdeps/x86/dl-cacheinfo.c
@@ -0,0 +1,888 @@
+/* x86 cache info.
+   Copyright (C) 2020 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <assert.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <cpuid.h>
+#include <init-arch.h>
+#if HAVE_TUNABLES
+# define TUNABLE_NAMESPACE cpu
+# include <elf/dl-tunables.h>
+#endif
+
+static const struct intel_02_cache_info
+{
+  unsigned char idx;
+  unsigned char assoc;
+  unsigned char linesize;
+  unsigned char rel_name;
+  unsigned int size;
+} intel_02_known [] =
+  {
+#define M(sc) ((sc) - _SC_LEVEL1_ICACHE_SIZE)
+    { 0x06,  4, 32, M(_SC_LEVEL1_ICACHE_SIZE),    8192 },
+    { 0x08,  4, 32, M(_SC_LEVEL1_ICACHE_SIZE),   16384 },
+    { 0x09,  4, 32, M(_SC_LEVEL1_ICACHE_SIZE),   32768 },
+    { 0x0a,  2, 32, M(_SC_LEVEL1_DCACHE_SIZE),    8192 },
+    { 0x0c,  4, 32, M(_SC_LEVEL1_DCACHE_SIZE),   16384 },
+    { 0x0d,  4, 64, M(_SC_LEVEL1_DCACHE_SIZE),   16384 },
+    { 0x0e,  6, 64, M(_SC_LEVEL1_DCACHE_SIZE),   24576 },
+    { 0x21,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
+    { 0x22,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),   524288 },
+    { 0x23,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  1048576 },
+    { 0x25,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
+    { 0x29,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
+    { 0x2c,  8, 64, M(_SC_LEVEL1_DCACHE_SIZE),   32768 },
+    { 0x30,  8, 64, M(_SC_LEVEL1_ICACHE_SIZE),   32768 },
+    { 0x39,  4, 64, M(_SC_LEVEL2_CACHE_SIZE),   131072 },
+    { 0x3a,  6, 64, M(_SC_LEVEL2_CACHE_SIZE),   196608 },
+    { 0x3b,  2, 64, M(_SC_LEVEL2_CACHE_SIZE),   131072 },
+    { 0x3c,  4, 64, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
+    { 0x3d,  6, 64, M(_SC_LEVEL2_CACHE_SIZE),   393216 },
+    { 0x3e,  4, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
+    { 0x3f,  2, 64, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
+    { 0x41,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),   131072 },
+    { 0x42,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
+    { 0x43,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
+    { 0x44,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
+    { 0x45,  4, 32, M(_SC_LEVEL2_CACHE_SIZE),  2097152 },
+    { 0x46,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
+    { 0x47,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  8388608 },
+    { 0x48, 12, 64, M(_SC_LEVEL2_CACHE_SIZE),  3145728 },
+    { 0x49, 16, 64, M(_SC_LEVEL2_CACHE_SIZE),  4194304 },
+    { 0x4a, 12, 64, M(_SC_LEVEL3_CACHE_SIZE),  6291456 },
+    { 0x4b, 16, 64, M(_SC_LEVEL3_CACHE_SIZE),  8388608 },
+    { 0x4c, 12, 64, M(_SC_LEVEL3_CACHE_SIZE), 12582912 },
+    { 0x4d, 16, 64, M(_SC_LEVEL3_CACHE_SIZE), 16777216 },
+    { 0x4e, 24, 64, M(_SC_LEVEL2_CACHE_SIZE),  6291456 },
+    { 0x60,  8, 64, M(_SC_LEVEL1_DCACHE_SIZE),   16384 },
+    { 0x66,  4, 64, M(_SC_LEVEL1_DCACHE_SIZE),    8192 },
+    { 0x67,  4, 64, M(_SC_LEVEL1_DCACHE_SIZE),   16384 },
+    { 0x68,  4, 64, M(_SC_LEVEL1_DCACHE_SIZE),   32768 },
+    { 0x78,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
+    { 0x79,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   131072 },
+    { 0x7a,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
+    { 0x7b,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
+    { 0x7c,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
+    { 0x7d,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),  2097152 },
+    { 0x7f,  2, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
+    { 0x80,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
+    { 0x82,  8, 32, M(_SC_LEVEL2_CACHE_SIZE),   262144 },
+    { 0x83,  8, 32, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
+    { 0x84,  8, 32, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
+    { 0x85,  8, 32, M(_SC_LEVEL2_CACHE_SIZE),  2097152 },
+    { 0x86,  4, 64, M(_SC_LEVEL2_CACHE_SIZE),   524288 },
+    { 0x87,  8, 64, M(_SC_LEVEL2_CACHE_SIZE),  1048576 },
+    { 0xd0,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),   524288 },
+    { 0xd1,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),  1048576 },
+    { 0xd2,  4, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
+    { 0xd6,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  1048576 },
+    { 0xd7,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
+    { 0xd8,  8, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
+    { 0xdc, 12, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
+    { 0xdd, 12, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
+    { 0xde, 12, 64, M(_SC_LEVEL3_CACHE_SIZE),  8388608 },
+    { 0xe2, 16, 64, M(_SC_LEVEL3_CACHE_SIZE),  2097152 },
+    { 0xe3, 16, 64, M(_SC_LEVEL3_CACHE_SIZE),  4194304 },
+    { 0xe4, 16, 64, M(_SC_LEVEL3_CACHE_SIZE),  8388608 },
+    { 0xea, 24, 64, M(_SC_LEVEL3_CACHE_SIZE), 12582912 },
+    { 0xeb, 24, 64, M(_SC_LEVEL3_CACHE_SIZE), 18874368 },
+    { 0xec, 24, 64, M(_SC_LEVEL3_CACHE_SIZE), 25165824 },
+  };
+
+#define nintel_02_known (sizeof (intel_02_known) / sizeof (intel_02_known [0]))
+
+static int
+intel_02_known_compare (const void *p1, const void *p2)
+{
+  const struct intel_02_cache_info *i1;
+  const struct intel_02_cache_info *i2;
+
+  i1 = (const struct intel_02_cache_info *) p1;
+  i2 = (const struct intel_02_cache_info *) p2;
+
+  if (i1->idx == i2->idx)
+    return 0;
+
+  return i1->idx < i2->idx ? -1 : 1;
+}
+
+
+static long int
+__attribute__ ((noinline))
+intel_check_word (int name, unsigned int value, bool *has_level_2,
+		  bool *no_level_2_or_3,
+		  const struct cpu_features *cpu_features)
+{
+  if ((value & 0x80000000) != 0)
+    /* The register value is reserved.  */
+    return 0;
+
+  /* Fold the name.  The _SC_ constants are always in the order SIZE,
+     ASSOC, LINESIZE.  */
+  int folded_rel_name = (M(name) / 3) * 3;
+
+  while (value != 0)
+    {
+      unsigned int byte = value & 0xff;
+
+      if (byte == 0x40)
+	{
+	  *no_level_2_or_3 = true;
+
+	  if (folded_rel_name == M(_SC_LEVEL3_CACHE_SIZE))
+	    /* No need to look further.  */
+	    break;
+	}
+      else if (byte == 0xff)
+	{
+	  /* CPUID leaf 0x4 contains all the information.  We need to
+	     iterate over it.  */
+	  unsigned int eax;
+	  unsigned int ebx;
+	  unsigned int ecx;
+	  unsigned int edx;
+
+	  unsigned int round = 0;
+	  while (1)
+	    {
+	      __cpuid_count (4, round, eax, ebx, ecx, edx);
+
+	      enum { null = 0, data = 1, inst = 2, uni = 3 } type = eax & 0x1f;
+	      if (type == null)
+		/* That was the end.  */
+		break;
+
+	      unsigned int level = (eax >> 5) & 0x7;
+
+	      if ((level == 1 && type == data
+		   && folded_rel_name == M(_SC_LEVEL1_DCACHE_SIZE))
+		  || (level == 1 && type == inst
+		      && folded_rel_name == M(_SC_LEVEL1_ICACHE_SIZE))
+		  || (level == 2 && folded_rel_name == M(_SC_LEVEL2_CACHE_SIZE))
+		  || (level == 3 && folded_rel_name == M(_SC_LEVEL3_CACHE_SIZE))
+		  || (level == 4 && folded_rel_name == M(_SC_LEVEL4_CACHE_SIZE)))
+		{
+		  unsigned int offset = M(name) - folded_rel_name;
+
+		  if (offset == 0)
+		    /* Cache size.  */
+		    return (((ebx >> 22) + 1)
+			    * (((ebx >> 12) & 0x3ff) + 1)
+			    * ((ebx & 0xfff) + 1)
+			    * (ecx + 1));
+		  if (offset == 1)
+		    return (ebx >> 22) + 1;
+
+		  assert (offset == 2);
+		  return (ebx & 0xfff) + 1;
+		}
+
+	      ++round;
+	    }
+	  /* There is no other cache information anywhere else.  */
+	  break;
+	}
+      else
+	{
+	  if (byte == 0x49 && folded_rel_name == M(_SC_LEVEL3_CACHE_SIZE))
+	    {
+	      /* Intel reused this value.  For family 15, model 6 it
+		 specifies the 3rd level cache.  Otherwise the 2nd
+		 level cache.  */
+	      unsigned int family = cpu_features->basic.family;
+	      unsigned int model = cpu_features->basic.model;
+
+	      if (family == 15 && model == 6)
+		{
+		  /* The level 3 cache is encoded for this model like
+		     the level 2 cache is for other models.  Pretend
+		     the caller asked for the level 2 cache.  */
+		  name = (_SC_LEVEL2_CACHE_SIZE
+			  + (name - _SC_LEVEL3_CACHE_SIZE));
+		  folded_rel_name = M(_SC_LEVEL2_CACHE_SIZE);
+		}
+	    }
+
+	  struct intel_02_cache_info *found;
+	  struct intel_02_cache_info search;
+
+	  search.idx = byte;
+	  found = bsearch (&search, intel_02_known, nintel_02_known,
+			   sizeof (intel_02_known[0]), intel_02_known_compare);
+	  if (found != NULL)
+	    {
+	      if (found->rel_name == folded_rel_name)
+		{
+		  unsigned int offset = M(name) - folded_rel_name;
+
+		  if (offset == 0)
+		    /* Cache size.  */
+		    return found->size;
+		  if (offset == 1)
+		    return found->assoc;
+
+		  assert (offset == 2);
+		  return found->linesize;
+		}
+
+	      if (found->rel_name == M(_SC_LEVEL2_CACHE_SIZE))
+		*has_level_2 = true;
+	    }
+	}
+
+      /* Next byte for the next round.  */
+      value >>= 8;
+    }
+
+  /* Nothing found.  */
+  return 0;
+}
+
+
+static long int __attribute__ ((noinline))
+handle_intel (int name, const struct cpu_features *cpu_features)
+{
+  unsigned int maxidx = cpu_features->basic.max_cpuid;
+
+  /* Return -1 for older CPUs.  */
+  if (maxidx < 2)
+    return -1;
+
+  /* OK, we can use the CPUID instruction to get all info about the
+     caches.  */
+  unsigned int cnt = 0;
+  unsigned int max = 1;
+  long int result = 0;
+  bool no_level_2_or_3 = false;
+  bool has_level_2 = false;
+
+  while (cnt++ < max)
+    {
+      unsigned int eax;
+      unsigned int ebx;
+      unsigned int ecx;
+      unsigned int edx;
+      __cpuid (2, eax, ebx, ecx, edx);
+
+      /* The low byte of EAX in the first round contains the number of
+	 rounds we have to make.  At least one, the one we are already
+	 doing.  */
+      if (cnt == 1)
+	{
+	  max = eax & 0xff;
+	  eax &= 0xffffff00;
+	}
+
+      /* Process the individual registers' value.  */
+      result = intel_check_word (name, eax, &has_level_2,
+				 &no_level_2_or_3, cpu_features);
+      if (result != 0)
+	return result;
+
+      result = intel_check_word (name, ebx, &has_level_2,
+				 &no_level_2_or_3, cpu_features);
+      if (result != 0)
+	return result;
+
+      result = intel_check_word (name, ecx, &has_level_2,
+				 &no_level_2_or_3, cpu_features);
+      if (result != 0)
+	return result;
+
+      result = intel_check_word (name, edx, &has_level_2,
+				 &no_level_2_or_3, cpu_features);
+      if (result != 0)
+	return result;
+    }
+
+  if (name >= _SC_LEVEL2_CACHE_SIZE && name <= _SC_LEVEL3_CACHE_LINESIZE
+      && no_level_2_or_3)
+    return -1;
+
+  return 0;
+}
+
+
+static long int __attribute__ ((noinline))
+handle_amd (int name)
+{
+  unsigned int eax;
+  unsigned int ebx;
+  unsigned int ecx;
+  unsigned int edx;
+  __cpuid (0x80000000, eax, ebx, ecx, edx);
+
+  /* No level 4 cache (yet).  */
+  if (name > _SC_LEVEL3_CACHE_LINESIZE)
+    return 0;
+
+  unsigned int fn = 0x80000005 + (name >= _SC_LEVEL2_CACHE_SIZE);
+  if (eax < fn)
+    return 0;
+
+  __cpuid (fn, eax, ebx, ecx, edx);
+
+  if (name < _SC_LEVEL1_DCACHE_SIZE)
+    {
+      name += _SC_LEVEL1_DCACHE_SIZE - _SC_LEVEL1_ICACHE_SIZE;
+      ecx = edx;
+    }
+
+  switch (name)
+    {
+    case _SC_LEVEL1_DCACHE_SIZE:
+      return (ecx >> 14) & 0x3fc00;
+
+    case _SC_LEVEL1_DCACHE_ASSOC:
+      ecx >>= 16;
+      if ((ecx & 0xff) == 0xff)
+	/* Fully associative.  */
+	return (ecx << 2) & 0x3fc00;
+      return ecx & 0xff;
+
+    case _SC_LEVEL1_DCACHE_LINESIZE:
+      return ecx & 0xff;
+
+    case _SC_LEVEL2_CACHE_SIZE:
+      return (ecx & 0xf000) == 0 ? 0 : (ecx >> 6) & 0x3fffc00;
+
+    case _SC_LEVEL2_CACHE_ASSOC:
+      switch ((ecx >> 12) & 0xf)
+	{
+	case 0:
+	case 1:
+	case 2:
+	case 4:
+	  return (ecx >> 12) & 0xf;
+	case 6:
+	  return 8;
+	case 8:
+	  return 16;
+	case 10:
+	  return 32;
+	case 11:
+	  return 48;
+	case 12:
+	  return 64;
+	case 13:
+	  return 96;
+	case 14:
+	  return 128;
+	case 15:
+	  return ((ecx >> 6) & 0x3fffc00) / (ecx & 0xff);
+	default:
+	  return 0;
+	}
+      /* NOTREACHED */
+
+    case _SC_LEVEL2_CACHE_LINESIZE:
+      return (ecx & 0xf000) == 0 ? 0 : ecx & 0xff;
+
+    case _SC_LEVEL3_CACHE_SIZE:
+      return (edx & 0xf000) == 0 ? 0 : (edx & 0x3ffc0000) << 1;
+
+    case _SC_LEVEL3_CACHE_ASSOC:
+      switch ((edx >> 12) & 0xf)
+	{
+	case 0:
+	case 1:
+	case 2:
+	case 4:
+	  return (edx >> 12) & 0xf;
+	case 6:
+	  return 8;
+	case 8:
+	  return 16;
+	case 10:
+	  return 32;
+	case 11:
+	  return 48;
+	case 12:
+	  return 64;
+	case 13:
+	  return 96;
+	case 14:
+	  return 128;
+	case 15:
+	  return ((edx & 0x3ffc0000) << 1) / (edx & 0xff);
+	default:
+	  return 0;
+	}
+      /* NOTREACHED */
+
+    case _SC_LEVEL3_CACHE_LINESIZE:
+      return (edx & 0xf000) == 0 ? 0 : edx & 0xff;
+
+    default:
+      assert (! "cannot happen");
+    }
+  return -1;
+}
+
+
+static long int __attribute__ ((noinline))
+handle_zhaoxin (int name)
+{
+  unsigned int eax;
+  unsigned int ebx;
+  unsigned int ecx;
+  unsigned int edx;
+
+  int folded_rel_name = (M(name) / 3) * 3;
+
+  unsigned int round = 0;
+  while (1)
+    {
+      __cpuid_count (4, round, eax, ebx, ecx, edx);
+
+      enum { null = 0, data = 1, inst = 2, uni = 3 } type = eax & 0x1f;
+      if (type == null)
+        break;
+
+      unsigned int level = (eax >> 5) & 0x7;
+
+      if ((level == 1 && type == data
+        && folded_rel_name == M(_SC_LEVEL1_DCACHE_SIZE))
+        || (level == 1 && type == inst
+            && folded_rel_name == M(_SC_LEVEL1_ICACHE_SIZE))
+        || (level == 2 && folded_rel_name == M(_SC_LEVEL2_CACHE_SIZE))
+        || (level == 3 && folded_rel_name == M(_SC_LEVEL3_CACHE_SIZE)))
+        {
+          unsigned int offset = M(name) - folded_rel_name;
+
+          if (offset == 0)
+            /* Cache size.  */
+            return (((ebx >> 22) + 1)
+                * (((ebx >> 12) & 0x3ff) + 1)
+                * ((ebx & 0xfff) + 1)
+                * (ecx + 1));
+          if (offset == 1)
+            return (ebx >> 22) + 1;
+
+          assert (offset == 2);
+          return (ebx & 0xfff) + 1;
+        }
+
+      ++round;
+    }
+
+  /* Nothing found.  */
+  return 0;
+}
+
+
+static void
+get_common_cache_info (long int *shared_ptr, unsigned int *threads_ptr,
+                long int core)
+{
+  unsigned int eax;
+  unsigned int ebx;
+  unsigned int ecx;
+  unsigned int edx;
+
+  /* Number of logical processors sharing L2 cache.  */
+  int threads_l2;
+
+  /* Number of logical processors sharing L3 cache.  */
+  int threads_l3;
+
+  const struct cpu_features *cpu_features = __get_cpu_features ();
+  int max_cpuid = cpu_features->basic.max_cpuid;
+  unsigned int family = cpu_features->basic.family;
+  unsigned int model = cpu_features->basic.model;
+  long int shared = *shared_ptr;
+  unsigned int threads = *threads_ptr;
+  bool inclusive_cache = true;
+  bool support_count_mask = true;
+
+  /* Try L3 first.  */
+  unsigned int level = 3;
+
+  if (cpu_features->basic.kind == arch_kind_zhaoxin && family == 6)
+    support_count_mask = false;
+
+  if (shared <= 0)
+    {
+      /* Try L2 otherwise.  */
+      level  = 2;
+      shared = core;
+      threads_l2 = 0;
+      threads_l3 = -1;
+    }
+  else
+    {
+      threads_l2 = 0;
+      threads_l3 = 0;
+    }
+
+  /* A value of 0 for the HTT bit indicates there is only a single
+     logical processor.  */
+  if (HAS_CPU_FEATURE (HTT))
+    {
+      /* Figure out the number of logical threads that share the
+         highest cache level.  */
+      if (max_cpuid >= 4)
+        {
+          int i = 0;
+
+          /* Query until cache level 2 and 3 are enumerated.  */
+          int check = 0x1 | (threads_l3 == 0) << 1;
+          do
+            {
+              __cpuid_count (4, i++, eax, ebx, ecx, edx);
+
+              /* There seems to be a bug in at least some Pentium Ds
+                 which sometimes fail to iterate all cache parameters.
+                 Do not loop indefinitely here, stop in this case and
+                 assume there is no such information.  */
+              if (cpu_features->basic.kind == arch_kind_intel
+                  && (eax & 0x1f) == 0 )
+                goto intel_bug_no_cache_info;
+
+              switch ((eax >> 5) & 0x7)
+                {
+                  default:
+                    break;
+                  case 2:
+                    if ((check & 0x1))
+                      {
+                        /* Get maximum number of logical processors
+                           sharing L2 cache.  */
+                        threads_l2 = (eax >> 14) & 0x3ff;
+                        check &= ~0x1;
+                      }
+                    break;
+                  case 3:
+                    if ((check & (0x1 << 1)))
+                      {
+                        /* Get maximum number of logical processors
+                           sharing L3 cache.  */
+                        threads_l3 = (eax >> 14) & 0x3ff;
+
+                        /* Check if L2 and L3 caches are inclusive.  */
+                        inclusive_cache = (edx & 0x2) != 0;
+                        check &= ~(0x1 << 1);
+                      }
+                    break;
+                }
+            }
+          while (check);
+
+          /* If max_cpuid >= 11, THREADS_L2/THREADS_L3 are the maximum
+             numbers of addressable IDs for logical processors sharing
+             the cache, instead of the maximum number of threads
+             sharing the cache.  */
+          if (max_cpuid >= 11 && support_count_mask)
+            {
+              /* Find the number of logical processors shipped in
+                 one core and apply count mask.  */
+              i = 0;
+
+              /* Count SMT only if there is L3 cache.  Always count
+                 core if there is no L3 cache.  */
+              int count = ((threads_l2 > 0 && level == 3)
+                           | ((threads_l3 > 0
+                               || (threads_l2 > 0 && level == 2)) << 1));
+
+              while (count)
+                {
+                  __cpuid_count (11, i++, eax, ebx, ecx, edx);
+
+                  int shipped = ebx & 0xff;
+                  int type = ecx & 0xff00;
+                  if (shipped == 0 || type == 0)
+                    break;
+                  else if (type == 0x100)
+                    {
+                      /* Count SMT.  */
+                      if ((count & 0x1))
+                        {
+                          int count_mask;
+
+                          /* Compute count mask.  */
+                          asm ("bsr %1, %0"
+                               : "=r" (count_mask) : "g" (threads_l2));
+                          count_mask = ~(-1 << (count_mask + 1));
+                          threads_l2 = (shipped - 1) & count_mask;
+                          count &= ~0x1;
+                        }
+                    }
+                  else if (type == 0x200)
+                    {
+                      /* Count core.  */
+                      if ((count & (0x1 << 1)))
+                        {
+                          int count_mask;
+                          int threads_core
+                            = (level == 2 ? threads_l2 : threads_l3);
+
+                          /* Compute count mask.  */
+                          asm ("bsr %1, %0"
+                               : "=r" (count_mask) : "g" (threads_core));
+                          count_mask = ~(-1 << (count_mask + 1));
+                          threads_core = (shipped - 1) & count_mask;
+                          if (level == 2)
+                            threads_l2 = threads_core;
+                          else
+                            threads_l3 = threads_core;
+                          count &= ~(0x1 << 1);
+                        }
+                    }
+                }
+            }
+          if (threads_l2 > 0)
+            threads_l2 += 1;
+          if (threads_l3 > 0)
+            threads_l3 += 1;
+          if (level == 2)
+            {
+              if (threads_l2)
+                {
+                  threads = threads_l2;
+                  if (cpu_features->basic.kind == arch_kind_intel
+                      && threads > 2
+                      && family == 6)
+                    switch (model)
+                      {
+                        case 0x37:
+                        case 0x4a:
+                        case 0x4d:
+                        case 0x5a:
+                        case 0x5d:
+                          /* Silvermont has L2 cache shared by 2 cores.  */
+                          threads = 2;
+                          break;
+                        default:
+                          break;
+                      }
+                }
+            }
+          else if (threads_l3)
+            threads = threads_l3;
+        }
+      else
+        {
+intel_bug_no_cache_info:
+          /* Assume that all logical threads share the highest cache
+             level.  */
+          threads
+            = ((cpu_features->cpuid[COMMON_CPUID_INDEX_1].ebx
+                >> 16) & 0xff);
+        }
+
+        /* Cap usage of highest cache level to the number of supported
+           threads.  */
+        if (shared > 0 && threads > 0)
+          shared /= threads;
+    }
+
+  /* Account for non-inclusive L2 and L3 caches.  */
+  if (!inclusive_cache)
+    {
+      if (threads_l2 > 0)
+        core /= threads_l2;
+      shared += core;
+    }
+
+  *shared_ptr = shared;
+  *threads_ptr = threads;
+}
+
+void
+__init_cacheinfo (void)
+{
+  /* Find out what brand of processor.  */
+  unsigned int ebx;
+  unsigned int ecx;
+  unsigned int edx;
+  int max_cpuid_ex;
+  long int data = -1;
+  long int shared = -1;
+  long int core;
+  unsigned int threads = 0;
+  unsigned long int level1_icache_size = -1;
+  unsigned long int level1_dcache_size = -1;
+  unsigned long int level1_dcache_assoc = -1;
+  unsigned long int level1_dcache_linesize = -1;
+  unsigned long int level2_cache_size = -1;
+  unsigned long int level2_cache_assoc = -1;
+  unsigned long int level2_cache_linesize = -1;
+  unsigned long int level3_cache_size = -1;
+  unsigned long int level3_cache_assoc = -1;
+  unsigned long int level3_cache_linesize = -1;
+  unsigned long int level4_cache_size = -1;
+  struct cpu_features *cpu_features = __get_cpu_features ();
+
+  if (cpu_features->basic.kind == arch_kind_intel)
+    {
+      data = handle_intel (_SC_LEVEL1_DCACHE_SIZE, cpu_features);
+      core = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
+      shared = handle_intel (_SC_LEVEL3_CACHE_SIZE, cpu_features);
+
+      level1_icache_size
+	= handle_intel (_SC_LEVEL1_ICACHE_SIZE, cpu_features);
+      level1_dcache_size = data;
+      level1_dcache_assoc
+	= handle_intel (_SC_LEVEL1_DCACHE_ASSOC, cpu_features);
+      level1_dcache_linesize
+	= handle_intel (_SC_LEVEL1_DCACHE_LINESIZE, cpu_features);
+      level2_cache_size = core;
+      level2_cache_assoc
+	= handle_intel (_SC_LEVEL2_CACHE_ASSOC, cpu_features);
+      level2_cache_linesize
+	= handle_intel (_SC_LEVEL2_CACHE_LINESIZE, cpu_features);
+      level3_cache_size = shared;
+      level3_cache_assoc
+	= handle_intel (_SC_LEVEL3_CACHE_ASSOC, cpu_features);
+      level3_cache_linesize
+	= handle_intel (_SC_LEVEL3_CACHE_LINESIZE, cpu_features);
+      level4_cache_size
+	= handle_intel (_SC_LEVEL4_CACHE_SIZE, cpu_features);
+
+      get_common_cache_info (&shared, &threads, core);
+    }
+  else if (cpu_features->basic.kind == arch_kind_zhaoxin)
+    {
+      data = handle_zhaoxin (_SC_LEVEL1_DCACHE_SIZE);
+      core = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
+      shared = handle_zhaoxin (_SC_LEVEL3_CACHE_SIZE);
+
+      level1_icache_size = handle_zhaoxin (_SC_LEVEL1_ICACHE_SIZE);
+      level1_dcache_size = data;
+      level1_dcache_assoc = handle_zhaoxin (_SC_LEVEL1_DCACHE_ASSOC);
+      level1_dcache_linesize = handle_zhaoxin (_SC_LEVEL1_DCACHE_LINESIZE);
+      level2_cache_size = core;
+      level2_cache_assoc = handle_zhaoxin (_SC_LEVEL2_CACHE_ASSOC);
+      level2_cache_linesize = handle_zhaoxin (_SC_LEVEL2_CACHE_LINESIZE);
+      level3_cache_size = shared;
+      level3_cache_assoc = handle_zhaoxin (_SC_LEVEL3_CACHE_ASSOC);
+      level3_cache_linesize = handle_zhaoxin (_SC_LEVEL3_CACHE_LINESIZE);
+
+      get_common_cache_info (&shared, &threads, core);
+    }
+  else if (cpu_features->basic.kind == arch_kind_amd)
+    {
+      data  = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
+      core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
+      shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
+
+      level1_icache_size = handle_amd (_SC_LEVEL1_ICACHE_SIZE);
+      level1_dcache_size = data;
+      level1_dcache_assoc = handle_amd (_SC_LEVEL1_DCACHE_ASSOC);
+      level1_dcache_linesize = handle_amd (_SC_LEVEL1_DCACHE_LINESIZE);
+      level2_cache_size = core;
+      level2_cache_assoc = handle_amd (_SC_LEVEL2_CACHE_ASSOC);
+      level2_cache_linesize = handle_amd (_SC_LEVEL2_CACHE_LINESIZE);
+      level3_cache_size = shared;
+      level3_cache_assoc = handle_amd (_SC_LEVEL3_CACHE_ASSOC);
+      level3_cache_linesize = handle_amd (_SC_LEVEL3_CACHE_LINESIZE);
+
+      /* Get maximum extended function. */
+      __cpuid (0x80000000, max_cpuid_ex, ebx, ecx, edx);
+
+      if (shared <= 0)
+	/* No shared L3 cache.  All we have is the L2 cache.  */
+	shared = core;
+      else
+	{
+	  /* Figure out the number of logical threads that share L3.  */
+	  if (max_cpuid_ex >= 0x80000008)
+	    {
+	      /* Get width of APIC ID.  */
+	      __cpuid (0x80000008, max_cpuid_ex, ebx, ecx, edx);
+	      threads = 1 << ((ecx >> 12) & 0x0f);
+	    }
+
+	  if (threads == 0)
+	    {
+	      /* If APIC ID width is not available, use logical
+		 processor count.  */
+	      __cpuid (0x00000001, max_cpuid_ex, ebx, ecx, edx);
+
+	      if ((edx & (1 << 28)) != 0)
+		threads = (ebx >> 16) & 0xff;
+	    }
+
+	  /* Cap usage of highest cache level to the number of
+	     supported threads.  */
+	  if (threads > 0)
+	    shared /= threads;
+
+	  /* Account for exclusive L2 and L3 caches.  */
+	  shared += core;
+	}
+
+#ifndef __x86_64__
+      if (max_cpuid_ex >= 0x80000001)
+	{
+	  unsigned int eax;
+	  __cpuid (0x80000001, eax, ebx, ecx, edx);
+	  /*  PREFETCHW     || 3DNow!  */
+	  if ((ecx & 0x100) || (edx & 0x80000000))
+	    cpu_features->prefetchw = -1;
+	}
+#endif
+    }
+
+  cpu_features->level1_icache_size = level1_icache_size;
+  cpu_features->level1_dcache_size = level1_dcache_size;
+  cpu_features->level1_dcache_assoc = level1_dcache_assoc;
+  cpu_features->level1_dcache_linesize = level1_dcache_linesize;
+  cpu_features->level2_cache_size = level2_cache_size;
+  cpu_features->level2_cache_assoc = level2_cache_assoc;
+  cpu_features->level2_cache_linesize = level2_cache_linesize;
+  cpu_features->level3_cache_size = level3_cache_size;
+  cpu_features->level3_cache_assoc = level3_cache_assoc;
+  cpu_features->level3_cache_linesize = level3_cache_linesize;
+  cpu_features->level4_cache_size = level4_cache_size;
+
+  /* The large memcpy micro benchmark in glibc shows that six times the
+     shared cache size is the approximate value above which non-temporal
+     stores become faster on an 8-core processor.  This is 3/4 of the
+     total shared cache size.  */
+  unsigned long int non_temporal_threshold = (shared * threads * 3 / 4);
+
+#if HAVE_TUNABLES
+  long int tunable_size;
+  tunable_size = TUNABLE_GET (x86_data_cache_size, long int, NULL);
+  if (tunable_size != 0)
+    data = tunable_size;
+  tunable_size = TUNABLE_GET (x86_shared_cache_size, long int, NULL);
+  if (tunable_size != 0)
+    shared = tunable_size;
+  tunable_size = TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
+  if (tunable_size != 0)
+    non_temporal_threshold = tunable_size;
+#endif
+
+  cpu_features->data_cache_size = data;
+  cpu_features->shared_cache_size = shared;
+  cpu_features->non_temporal_threshold = non_temporal_threshold;
+
+#if HAVE_TUNABLES
+  TUNABLE_UPDATE (x86_data_cache_size, long int,
+		  data, 0, (long int) -1);
+  TUNABLE_UPDATE (x86_shared_cache_size, long int,
+		  shared, 0, (long int) -1);
+  TUNABLE_UPDATE (x86_non_temporal_threshold, long int,
+		  non_temporal_threshold, 0, (long int) -1);
+#endif
+}
diff --git a/sysdeps/x86/init-arch.h b/sysdeps/x86/init-arch.h
index d6f59cf962..272ed10902 100644
--- a/sysdeps/x86/init-arch.h
+++ b/sysdeps/x86/init-arch.h
@@ -23,6 +23,9 @@
 #include <ifunc-init.h>
 #include <isa.h>
 
+extern void __init_cacheinfo (void)
+  __attribute__ ((visibility ("hidden")));
+
 #ifndef __x86_64__
 /* Due to the reordering and the other nifty extensions in i686, it is
    not really good to use heavily i586 optimized code on an i686.  It's
-- 
2.26.2
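
As a worked example of the non_temporal_threshold computation above,
assuming an 8-core part whose 8 logical threads share one L3 (the same
configuration as the --list-tunables output quoted later in this thread):

    shared (per thread)    = 8 MiB / 8 threads    = 0x100000 (1 MiB)
    non_temporal_threshold = 0x100000 * 8 * 3 / 4 = 0x600000 (6 MiB)

i.e. 3/4 of the total shared cache size, matching the
glibc.cpu.x86_non_temporal_threshold and glibc.cpu.x86_shared_cache_size
values shown in that output.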


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Update tunable min/max values
  2020-07-02 19:08                                       ` [PATCH] Update tunable min/max values H.J. Lu
@ 2020-07-03 16:14                                         ` Carlos O'Donell
  2020-07-03 16:54                                           ` [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables H.J. Lu
  0 siblings, 1 reply; 32+ messages in thread
From: Carlos O'Donell @ 2020-07-03 16:14 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GNU C Library, Florian Weimer, Hushiyuan

On 7/2/20 3:08 PM, H.J. Lu wrote:
> On Thu, Jul 02, 2020 at 02:00:54PM -0400, Carlos O'Donell wrote:
>> On 6/6/20 5:51 PM, H.J. Lu wrote:
>>> On Fri, Jun 5, 2020 at 3:45 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>
>>>> On Thu, Jun 04, 2020 at 02:00:35PM -0700, H.J. Lu wrote:
>>>>> On Mon, Jun 1, 2020 at 7:08 PM Carlos O'Donell <carlos@redhat.com> wrote:
>>>>>>
>>>>>> On Mon, Jun 1, 2020 at 6:44 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>>>> Tunables are designed to pass info from user to glibc, not the other
>>>>>>> way around.  When __libc_main is called, init_cacheinfo is never
>>>>>>> called.  I can call init_cacheinfo from __libc_main.  But there is no
>>>>>>> interface to update min and max values from init_cacheinfo.  I don't
>>>>>>> think --list-tunables will work here without changes to tunables.
>>>>>>
>>>>>> You have a dynamic threshold.
>>>>>>
>>>>>> You have to tell the user what that minimum is, otherwise they can't
>>>>>> use the tunable reliably.
>>>>>>
>>>>>> This is the first instance of a min/max that is dynamically determined.
>>>>>>
>>>>>> You must fetch the cache info ahead of the tunable initialization, that
>>>>>> is you must call init_cacheinfo before __init_tunables.
>>>>>>
>>>>>> You can initialize the tunable data dynamically like this:
>>>>>>
>>>>>> /* Dynamically set the min and max of glibc.foo.bar.  */
>>>>>> tunable_id_t id = TUNABLE_ENUM_NAME (glibc, foo, bar);
>>>>>> tunable_list[id].type.min = lowval;
>>>>>> tunable_list[id].type.max = highval;
>>>>>>
>>>>>> We do something similar for maybe_enable_malloc_check.
>>>>>>
>>>>>> Then once the tunables are parsed, and the cpu features are loaded
>>>>>> you can print the tunables, and the printed tunables will have meaningful
>>>>>> min and max values.
>>>>>>
>>>>>> If you have circular dependency, then you must process the cpu features
>>>>>> first without reading from the tunables, then allow the tunables to be
>>>>>> initialized from the system, *then* process the tunables to alter the existing
>>>>>> cpu feature settings.
>>>>>>
>>>>>
>>>>> How about this?  I got
>>>>>
>>>>
>>>> Here is the updated patch, which depends on
>>>>
>>>> https://sourceware.org/pipermail/libc-alpha/2020-June/114820.html
>>>>
>>>> to add "%d" support to _dl_debug_vdprintf.  I got
>>>>
>>>> $ ./elf/ld.so ./libc.so --list-tunables
>>>> glibc.elision.skip_lock_after_retries: 3 (min: -2147483648, max: 2147483647)
>>>> glibc.malloc.trim_threshold: 0x0 (min: 0x0, max: 0xffffffff)
>>>> glibc.malloc.perturb: 0 (min: 0, max: 255)
>>>> glibc.cpu.x86_shared_cache_size: 0x100000 (min: 0x0, max: 0xffffffff)
>>>> glibc.elision.tries: 3 (min: -2147483648, max: 2147483647)
>>>> glibc.elision.enable: 0 (min: 0, max: 1)
>>>> glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffff)
>>>> glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
>>>> glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffff)
>>>> glibc.cpu.x86_non_temporal_threshold: 0x600000 (min: 0x0, max: 0xffffffff)
>>>> glibc.cpu.x86_shstk:
>>>> glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffff)
>>>> glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
>>>> glibc.elision.skip_trylock_internal_abort: 3 (min: -2147483648, max: 2147483647)
>>>> glibc.malloc.tcache_unsorted_limit: 0x0 (min: 0x0, max: 0xffffffff)
>>>> glibc.cpu.x86_ibt:
>>>> glibc.cpu.hwcaps:
>>>> glibc.elision.skip_lock_internal_abort: 3 (min: -2147483648, max: 2147483647)
>>>> glibc.malloc.arena_max: 0x0 (min: 0x1, max: 0xffffffff)
>>>> glibc.malloc.mmap_threshold: 0x0 (min: 0x0, max: 0xffffffff)
>>>> glibc.cpu.x86_data_cache_size: 0x8000 (min: 0x0, max: 0xffffffff)
>>>> glibc.malloc.tcache_count: 0x0 (min: 0x0, max: 0xffffffff)
>>>> glibc.malloc.arena_test: 0x0 (min: 0x1, max: 0xffffffff)
>>>> glibc.pthread.mutex_spin_count: 100 (min: 0, max: 32767)
>>>> glibc.malloc.tcache_max: 0x0 (min: 0x0, max: 0xffffffff)
>>>> glibc.malloc.check: 0 (min: 0, max: 3)
>>>> $
>>>>
>>>> Ok for master?
>>>>
>>>
>>> Here is the updated patch.  To support --list-tunables, a target should add
>>>
>>> CPPFLAGS-version.c = -DLIBC_MAIN=__libc_main_body
>>> CPPFLAGS-libc-main.S = -DLIBC_MAIN=__libc_main_body
>>>
>>> and start.S should be updated to define __libc_main and call
>>> __libc_main_body:
>>>
>>> extern void __libc_main_body (int argc, char **argv)
>>>   __attribute__ ((noreturn, visibility ("hidden")));
>>>
>>> when LIBC_MAIN is defined.
>>
>> I like where this patch is going, but the __libc_main wiring up means
>> we'll have to delay this until glibc 2.33 opens for development and
>> give the architectures time to fill in the required pieces of assembly.
>>
>> Can we split this into:
>>
>> (a) Minimum required to implement the feature e.g. just the tunable without
>>     my requested changes.
>>
>> (b) A second patch which implements the --list-tunables that users can
>>     then use to know what the values they can choose are.
>>
>> That way we can commit (a) right now, and then commit (b) when we
>> reopen for development?
>>
> 
> Like this?

Almost.

Why do we still use a constructor?

Why don't we accurately set the min and max?

+#if HAVE_TUNABLES
+  TUNABLE_UPDATE (x86_non_temporal_threshold, long int,
+		  __x86_shared_non_temporal_threshold, 0,
+		  (long int) -1);
+  TUNABLE_UPDATE (x86_rep_movsb_threshold, long int,
+		  __x86_rep_movsb_threshold,
+		  minimum_rep_movsb_threshold, (long int) -1);
+  TUNABLE_UPDATE (x86_rep_stosb_threshold, long int,
+		  __x86_rep_stosb_threshold, 0, (long int) -1);

A min and max of 0 and -1 respectively could have been set in the tunables
list file and are not dynamic?
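
(For instance, a static minimum could simply be written in
sysdeps/x86/dl-tunables.list using the minval/maxval attributes that
file already supports; a sketch only, with an illustrative bound:

  x86_rep_stosb_threshold {
    type: SIZE_T
    minval: 1
  }
)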

I'd expect your patch would do everything except actually implement
--list-tunables.

We need a manual page, and I accept that showing a "lower value" will
have to wait for --list-tunables.

Otherwise the patch is looking ready.

-- 
Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
  2020-07-03 16:14                                         ` Carlos O'Donell
@ 2020-07-03 16:54                                           ` H.J. Lu
  2020-07-03 17:43                                             ` Carlos O'Donell
  0 siblings, 1 reply; 32+ messages in thread
From: H.J. Lu @ 2020-07-03 16:54 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: GNU C Library, Florian Weimer, Hushiyuan

On Fri, Jul 03, 2020 at 12:14:01PM -0400, Carlos O'Donell wrote:
> On 7/2/20 3:08 PM, H.J. Lu wrote:
> > On Thu, Jul 02, 2020 at 02:00:54PM -0400, Carlos O'Donell wrote:
> >> [...]
> > 
> > Like this?
> 
> Almost.
> 
> Why do we still use a constructor?
> 
> Why don't we accurately set the min and max?
> 
> +#if HAVE_TUNABLES
> +  TUNABLE_UPDATE (x86_non_temporal_threshold, long int,
> +		  __x86_shared_non_temporal_threshold, 0,
> +		  (long int) -1);
> +  TUNABLE_UPDATE (x86_rep_movsb_threshold, long int,
> +		  __x86_rep_movsb_threshold,
> +		  minimum_rep_movsb_threshold, (long int) -1);
> +  TUNABLE_UPDATE (x86_rep_stosb_threshold, long int,
> +		  __x86_rep_stosb_threshold, 0, (long int) -1);
> 
> A min and max of 0 and -1 respectively could have been set in the tunables
> list file and are not dynamic?
> 
> I'd expect your patch would do everything except actually implement
> --list-tunables.

Here is the followup patch which does it.

> 
> We need a manual page, and I accept that showing a "lower value" will
> have to wait for --list-tunables.
> 
> Otherwise the patch is looking ready.


Are these 2 patches OK for trunk?

Thanks.

H.J.
---
Add x86_rep_movsb_threshold and x86_rep_stosb_threshold to tunables
to update thresholds for "rep movsb" and "rep stosb" at run-time.

Note that a user-specified threshold for "rep movsb" that is smaller
than the minimum threshold will be ignored.
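
With the tunables in place, the thresholds can then be adjusted per
process through the existing GLIBC_TUNABLES environment variable, e.g.
(the values below are purely illustrative):

    GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=16384 ./app
    GLIBC_TUNABLES=glibc.cpu.x86_rep_stosb_threshold=1048576 ./app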
---
 manual/tunables.texi                          | 14 +++++++
 sysdeps/x86/cacheinfo.c                       | 20 ++++++++++
 sysdeps/x86/cpu-features.h                    |  4 ++
 sysdeps/x86/dl-cacheinfo.c                    | 38 +++++++++++++++++++
 sysdeps/x86/dl-tunables.list                  |  6 +++
 .../multiarch/memmove-vec-unaligned-erms.S    | 16 +-------
 .../multiarch/memset-vec-unaligned-erms.S     | 12 +-----
 7 files changed, 84 insertions(+), 26 deletions(-)

diff --git a/manual/tunables.texi b/manual/tunables.texi
index ec18b10834..61edd62425 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -396,6 +396,20 @@ to set threshold in bytes for non temporal store.
 This tunable is specific to i386 and x86-64.
 @end deftp
 
+@deftp Tunable glibc.cpu.x86_rep_movsb_threshold
+The @code{glibc.cpu.x86_rep_movsb_threshold} tunable allows the user
+to set the threshold in bytes at which to start using "rep movsb".
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
+@deftp Tunable glibc.cpu.x86_rep_stosb_threshold
+The @code{glibc.cpu.x86_rep_stosb_threshold} tunable allows the user
+to set the threshold in bytes at which to start using "rep stosb".
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
 @deftp Tunable glibc.cpu.x86_ibt
 The @code{glibc.cpu.x86_ibt} tunable allows the user to control how
 indirect branch tracking (IBT) should be enabled.  Accepted values are
diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index 8c4c7f9972..bb536d96ef 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -41,6 +41,23 @@ long int __x86_raw_shared_cache_size attribute_hidden = 1024 * 1024;
 /* Threshold to use non temporal store.  */
 long int __x86_shared_non_temporal_threshold attribute_hidden;
 
+/* Threshold to use Enhanced REP MOVSB.  Since there is overhead to set
+   up REP MOVSB operation, REP MOVSB isn't faster on short data.  The
+   memcpy micro benchmark in glibc shows that 2KB is the approximate
+   value above which REP MOVSB becomes faster than SSE2 optimization
+   on processors with Enhanced REP MOVSB.  Since larger register size
+   can move more data with a single load and store, the threshold is
+   higher with larger register size.  */
+long int __x86_rep_movsb_threshold attribute_hidden = 2048;
+
+/* Threshold to use Enhanced REP STOSB.  Since there is overhead to set
+   up REP STOSB operation, REP STOSB isn't faster on short data.  The
+   memset micro benchmark in glibc shows that 2KB is the approximate
+   value above which REP STOSB becomes faster on processors with
+   Enhanced REP STOSB.  Since the stored value is fixed, larger register
+   size has minimal impact on threshold.  */
+long int __x86_rep_stosb_threshold attribute_hidden = 2048;
+
 #ifndef __x86_64__
 /* PREFETCHW support flag for use in memory and string routines.  */
 int __x86_prefetchw attribute_hidden;
@@ -117,6 +134,9 @@ init_cacheinfo (void)
   __x86_shared_non_temporal_threshold
     = cpu_features->non_temporal_threshold;
 
+  __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
+  __x86_rep_stosb_threshold = cpu_features->rep_stosb_threshold;
+
 #ifndef __x86_64__
   __x86_prefetchw = cpu_features->prefetchw;
 #endif
diff --git a/sysdeps/x86/cpu-features.h b/sysdeps/x86/cpu-features.h
index 3aaed33cbc..002e12e11f 100644
--- a/sysdeps/x86/cpu-features.h
+++ b/sysdeps/x86/cpu-features.h
@@ -128,6 +128,10 @@ struct cpu_features
   /* PREFETCHW support flag for use in memory and string routines.  */
   unsigned long int prefetchw;
 #endif
+  /* Threshold to use "rep movsb".  */
+  unsigned long int rep_movsb_threshold;
+  /* Threshold to use "rep stosb".  */
+  unsigned long int rep_stosb_threshold;
 };
 
 /* Used from outside of glibc to get access to the CPU features
diff --git a/sysdeps/x86/dl-cacheinfo.c b/sysdeps/x86/dl-cacheinfo.c
index 8e2a6f552c..aff9bd1067 100644
--- a/sysdeps/x86/dl-cacheinfo.c
+++ b/sysdeps/x86/dl-cacheinfo.c
@@ -860,6 +860,31 @@ __init_cacheinfo (void)
      total shared cache size.  */
   unsigned long int non_temporal_threshold = (shared * threads * 3 / 4);
 
+  /* NB: The REP MOVSB threshold must be greater than VEC_SIZE * 8.  */
+  unsigned long int minimum_rep_movsb_threshold;
+  /* NB: The default REP MOVSB threshold is 2048 * (VEC_SIZE / 16).  See
+     comments for __x86_rep_movsb_threshold in cacheinfo.c.  */
+  unsigned long int rep_movsb_threshold;
+  if (CPU_FEATURES_ARCH_P (cpu_features, AVX512F_Usable)
+      && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512))
+    {
+      rep_movsb_threshold = 2048 * (64 / 16);
+      minimum_rep_movsb_threshold = 64 * 8;
+    }
+  else if (CPU_FEATURES_ARCH_P (cpu_features,
+				AVX_Fast_Unaligned_Load))
+    {
+      rep_movsb_threshold = 2048 * (32 / 16);
+      minimum_rep_movsb_threshold = 32 * 8;
+    }
+  else
+    {
+      rep_movsb_threshold = 2048 * (16 / 16);
+      minimum_rep_movsb_threshold = 16 * 8;
+    }
+  /* NB: See comments for __x86_rep_stosb_threshold in cacheinfo.c.  */
+  unsigned long int rep_stosb_threshold = 2048;
+
 #if HAVE_TUNABLES
   long int tunable_size;
   tunable_size = TUNABLE_GET (x86_data_cache_size, long int, NULL);
@@ -871,11 +896,19 @@ __init_cacheinfo (void)
   tunable_size = TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
   if (tunable_size != 0)
     non_temporal_threshold = tunable_size;
+  tunable_size = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
+  if (tunable_size > minimum_rep_movsb_threshold)
+    rep_movsb_threshold = tunable_size;
+  tunable_size = TUNABLE_GET (x86_rep_stosb_threshold, long int, NULL);
+  if (tunable_size != 0)
+    rep_stosb_threshold = tunable_size;
 #endif
 
   cpu_features->data_cache_size = data;
   cpu_features->shared_cache_size = shared;
   cpu_features->non_temporal_threshold = non_temporal_threshold;
+  cpu_features->rep_movsb_threshold = rep_movsb_threshold;
+  cpu_features->rep_stosb_threshold = rep_stosb_threshold;
 
 #if HAVE_TUNABLES
   TUNABLE_UPDATE (x86_data_cache_size, long int,
@@ -884,5 +917,10 @@ __init_cacheinfo (void)
 		  shared, 0, (long int) -1);
   TUNABLE_UPDATE (x86_non_temporal_threshold, long int,
 		  non_temporal_threshold, 0, (long int) -1);
+  TUNABLE_UPDATE (x86_rep_movsb_threshold, long int,
+		  rep_movsb_threshold, minimum_rep_movsb_threshold,
+		  (long int) -1);
+  TUNABLE_UPDATE (x86_rep_stosb_threshold, long int,
+		  rep_stosb_threshold, 0, (long int) -1);
 #endif
 }
diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
index 251b926ce4..43bf6c2389 100644
--- a/sysdeps/x86/dl-tunables.list
+++ b/sysdeps/x86/dl-tunables.list
@@ -30,6 +30,12 @@ glibc {
     x86_non_temporal_threshold {
       type: SIZE_T
     }
+    x86_rep_movsb_threshold {
+      type: SIZE_T
+    }
+    x86_rep_stosb_threshold {
+      type: SIZE_T
+    }
     x86_data_cache_size {
       type: SIZE_T
     }
diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index 74953245aa..bd5dc1a3f3 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -56,17 +56,6 @@
 # endif
 #endif
 
-/* Threshold to use Enhanced REP MOVSB.  Since there is overhead to set
-   up REP MOVSB operation, REP MOVSB isn't faster on short data.  The
-   memcpy micro benchmark in glibc shows that 2KB is the approximate
-   value above which REP MOVSB becomes faster than SSE2 optimization
-   on processors with Enhanced REP MOVSB.  Since larger register size
-   can move more data with a single load and store, the threshold is
-   higher with larger register size.  */
-#ifndef REP_MOVSB_THRESHOLD
-# define REP_MOVSB_THRESHOLD	(2048 * (VEC_SIZE / 16))
-#endif
-
 #ifndef PREFETCH
 # define PREFETCH(addr) prefetcht0 addr
 #endif
@@ -253,9 +242,6 @@ L(movsb):
 	leaq	(%rsi,%rdx), %r9
 	cmpq	%r9, %rdi
 	/* Avoid slow backward REP MOVSB.  */
-# if REP_MOVSB_THRESHOLD <= (VEC_SIZE * 8)
-#  error Unsupported REP_MOVSB_THRESHOLD and VEC_SIZE!
-# endif
 	jb	L(more_8x_vec_backward)
 1:
 	mov	%RDX_LP, %RCX_LP
@@ -331,7 +317,7 @@ L(between_2_3):
 
 #if defined USE_MULTIARCH && IS_IN (libc)
 L(movsb_more_2x_vec):
-	cmpq	$REP_MOVSB_THRESHOLD, %rdx
+	cmp	__x86_rep_movsb_threshold(%rip), %RDX_LP
 	ja	L(movsb)
 #endif
 L(more_2x_vec):
diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index af2299709c..2bfc95de05 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -58,16 +58,6 @@
 # endif
 #endif
 
-/* Threshold to use Enhanced REP STOSB.  Since there is overhead to set
-   up REP STOSB operation, REP STOSB isn't faster on short data.  The
-   memset micro benchmark in glibc shows that 2KB is the approximate
-   value above which REP STOSB becomes faster on processors with
-   Enhanced REP STOSB.  Since the stored value is fixed, larger register
-   size has minimal impact on threshold.  */
-#ifndef REP_STOSB_THRESHOLD
-# define REP_STOSB_THRESHOLD		2048
-#endif
-
 #ifndef SECTION
 # error SECTION is not defined!
 #endif
@@ -181,7 +171,7 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))
 	ret
 
 L(stosb_more_2x_vec):
-	cmpq	$REP_STOSB_THRESHOLD, %rdx
+	cmp	__x86_rep_stosb_threshold(%rip), %RDX_LP
 	ja	L(stosb)
 #endif
 L(more_2x_vec):
-- 
2.26.2
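
For reference, the defaults chosen above work out as follows, straight
from the rep_movsb_threshold = 2048 * (VEC_SIZE / 16) formula and the
VEC_SIZE * 8 minimum in the patch:

    SSE2    (VEC_SIZE = 16): threshold 2048, minimum  128 bytes
    AVX     (VEC_SIZE = 32): threshold 4096, minimum  256 bytes
    AVX-512 (VEC_SIZE = 64): threshold 8192, minimum  512 bytes

rep_stosb_threshold defaults to 2048 in all three cases.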


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
  2020-07-03 16:54                                           ` [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables H.J. Lu
@ 2020-07-03 17:43                                             ` Carlos O'Donell
  2020-07-03 17:53                                               ` H.J. Lu
  0 siblings, 1 reply; 32+ messages in thread
From: Carlos O'Donell @ 2020-07-03 17:43 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GNU C Library, Florian Weimer, Hushiyuan

On 7/3/20 12:54 PM, H.J. Lu wrote:
> On Fri, Jul 03, 2020 at 12:14:01PM -0400, Carlos O'Donell wrote:
>> On 7/2/20 3:08 PM, H.J. Lu wrote:
>>> [...]
>>>
>>> Like this?
>>
>> Almost.
>>
>> Why do we still use a constructor?
>>
>> Why don't we accurately set the min and max?
>>
>> +#if HAVE_TUNABLES
>> +  TUNABLE_UPDATE (x86_non_temporal_threshold, long int,
>> +		  __x86_shared_non_temporal_threshold, 0,
>> +		  (long int) -1);
>> +  TUNABLE_UPDATE (x86_rep_movsb_threshold, long int,
>> +		  __x86_rep_movsb_threshold,
>> +		  minimum_rep_movsb_threshold, (long int) -1);
>> +  TUNABLE_UPDATE (x86_rep_stosb_threshold, long int,
>> +		  __x86_rep_stosb_threshold, 0, (long int) -1);
>>
>> A min and max of 0 and -1 respectively could have been set in the tunables
>> list file and are not dynamic?
>>
>> I'd expect your patch would do everything except actually implement
>> --list-tunables.
> 
> Here is the followup patch which does it.
> 
>>
>> We need a manual page, and I accept that showing a "lower value" will
>> have to wait for --list-tunables.
>>
>> Otherwise the patch is looking ready.
> 
> 
> Are these 2 patches OK for trunk?

Could you please post the patches in a distinct thread with a clear
subject, that way I know exactly what I'm applying and testing.
I'll review those ASAP so we can get something in place.

-- 
Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
  2020-07-03 17:43                                             ` Carlos O'Donell
@ 2020-07-03 17:53                                               ` H.J. Lu
  0 siblings, 0 replies; 32+ messages in thread
From: H.J. Lu @ 2020-07-03 17:53 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: GNU C Library, Florian Weimer, Hushiyuan

On Fri, Jul 3, 2020 at 10:43 AM Carlos O'Donell <carlos@redhat.com> wrote:
>
> On 7/3/20 12:54 PM, H.J. Lu wrote:
> > On Fri, Jul 03, 2020 at 12:14:01PM -0400, Carlos O'Donell wrote:
> >> On 7/2/20 3:08 PM, H.J. Lu wrote:
> >>> [...]
> >>> Like this?
> >>
> >> Almost.
> >>
> >> Why do we still use a constructor?
> >>
> >> Why don't we accurately set the min and max?
> >>
> >> +#if HAVE_TUNABLES
> >> +  TUNABLE_UPDATE (x86_non_temporal_threshold, long int,
> >> +              __x86_shared_non_temporal_threshold, 0,
> >> +              (long int) -1);
> >> +  TUNABLE_UPDATE (x86_rep_movsb_threshold, long int,
> >> +              __x86_rep_movsb_threshold,
> >> +              minimum_rep_movsb_threshold, (long int) -1);
> >> +  TUNABLE_UPDATE (x86_rep_stosb_threshold, long int,
> >> +              __x86_rep_stosb_threshold, 0, (long int) -1);
> >>
> >> A min and max of 0 and -1 respectively could have been set in the tunables
> >> list file and are not dynamic?
> >>
> >> I'd expect your patch would do everything except actually implement
> >> --list-tunables.
> >
> > Here is the followup patch which does it.
> >
> >>
> >> We need a manual page, and I accept that showing a "lower value" will
> >> have to wait for --list-tunables.
> >>
> >> Otherwise the patch is looking ready.
> >
> >
> > Are these 2 patches OK for trunk?
>
> Could you please post the patches in a distinct thread with a clear
> subject, that way I know exactly what I'm applying and testing.
> I'll review those ASAP so we can get something in place.
>

Done:

https://sourceware.org/pipermail/libc-alpha/2020-July/115759.html

-- 
H.J.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH]x86: update REP_STOSB_THRESHOLD's default value from 2k to 1M
  2020-05-23  4:10   ` [PATCH]x86: update REP_STOSB_THRESHOLD's default value from 2k to 1M liqingqing
  2020-05-23  4:37     ` [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables H.J. Lu
@ 2020-12-21  4:38     ` Siddhesh Poyarekar
  2020-12-22  1:02       ` Qingqing Li
  1 sibling, 1 reply; 32+ messages in thread
From: Siddhesh Poyarekar @ 2020-12-21  4:38 UTC (permalink / raw)
  To: liqingqing, libc-alpha, hjl.tools, Hushiyuan

On 5/23/20 9:40 AM, liqingqing wrote:
> this commitid 830566307f038387ca0af3fd327706a8d1a2f595 optimize implementation of function memset,
> and set macro REP_STOSB_THRESHOLD's default value to 2KB, when the input value is less than 2KB, the data flow is the same, and when the input value is large than 2KB,
> this api will use STOB to instead of  MOVQ
> 
> but when I test this API on x86_64 platform
> and found that this default value is not appropriate for some input length. here it's the enviornment and result

This patch is not needed anymore since the threshold has been made a 
tunable: glibc.cpu.x86_rep_stosb_threshold.
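
For example, the 1M value this patch proposed can now be requested per
process (the value is just the patch's proposal, not a recommendation):

  GLIBC_TUNABLES=glibc.cpu.x86_rep_stosb_threshold=1048576 ./app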

Siddhesh

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH]x86: update REP_STOSB_THRESHOLD's default value from 2k to 1M
  2020-12-21  4:38     ` [PATCH]x86: update REP_STOSB_THRESHOLD's default value from 2k to 1M Siddhesh Poyarekar
@ 2020-12-22  1:02       ` Qingqing Li
  0 siblings, 0 replies; 32+ messages in thread
From: Qingqing Li @ 2020-12-22  1:02 UTC (permalink / raw)
  To: Siddhesh Poyarekar, libc-alpha, hjl.tools, Hushiyuan

OK,  thanks.

On 2020/12/21 12:38, Siddhesh Poyarekar wrote:
> On 5/23/20 9:40 AM, liqingqing wrote:
>> [...]
>
> This patch is not needed anymore since the threshold has been made a tunable: glibc.cpu.x86_rep_stosb_threshold.
>
> Siddhesh
> .

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2020-12-22  1:06 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-16  7:30 pthread_cond performence Discussion liqingqing
2020-03-18 12:12 ` Carlos O'Donell
2020-03-18 12:53   ` Torvald Riegel
2020-03-18 14:42     ` Carlos O'Donell
2020-05-23  4:04 ` liqingqing
2020-05-23  4:10   ` [PATCH]x86: update REP_STOSB_THRESHOLD's default value from 2k to 1M liqingqing
2020-05-23  4:37     ` [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables H.J. Lu
2020-05-28 11:56       ` H.J. Lu
2020-05-28 13:47         ` liqingqing
2020-05-29 13:13       ` Carlos O'Donell
2020-05-29 13:21         ` H.J. Lu
2020-05-29 16:18           ` Carlos O'Donell
2020-06-01 19:32             ` H.J. Lu
2020-06-01 19:38               ` Carlos O'Donell
2020-06-01 20:15                 ` H.J. Lu
2020-06-01 20:19                   ` H.J. Lu
2020-06-01 20:48                     ` Florian Weimer
2020-06-01 20:56                       ` Carlos O'Donell
2020-06-01 21:13                         ` H.J. Lu
2020-06-01 22:43                           ` H.J. Lu
2020-06-02  2:08                             ` Carlos O'Donell
2020-06-04 21:00                               ` [PATCH] libc.so: Add --list-tunables H.J. Lu
2020-06-05 22:45                                 ` V2 " H.J. Lu
2020-06-06 21:51                                   ` V3 [PATCH] libc.so: Add --list-tunables support to __libc_main H.J. Lu
2020-07-02 18:00                                     ` Carlos O'Donell
2020-07-02 19:08                                       ` [PATCH] Update tunable min/max values H.J. Lu
2020-07-03 16:14                                         ` Carlos O'Donell
2020-07-03 16:54                                           ` [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables H.J. Lu
2020-07-03 17:43                                             ` Carlos O'Donell
2020-07-03 17:53                                               ` H.J. Lu
2020-12-21  4:38     ` [PATCH]x86: update REP_STOSB_THRESHOLD's default value from 2k to 1M Siddhesh Poyarekar
2020-12-22  1:02       ` Qingqing Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).