From: DJ Delorie
To: Noah Goldstein
Cc: libc-alpha@sourceware.org, goldstein.w.n@gmail.com, hjl.tools@gmail.com,
    carlos@systemhalted.org
Subject: Re: [PATCH v9 1/3] x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4`
In-Reply-To: <20230513051906.1287611-1-goldstein.w.n@gmail.com>
Date: Thu, 25 May 2023 23:34:42 -0400

Ignoring the benchmarks for now, technical review...  LGTM.

Reviewed-by: DJ Delorie

>  static void
> -get_common_cache_info (long int *shared_ptr, unsigned int *threads_ptr,
> +get_common_cache_info (long int *shared_ptr, long int * shared_per_thread_ptr, unsigned int *threads_ptr,
>                         long int core)

Adding a new parameter, ok.

> +  long int shared_per_thread = *shared_per_thread_ptr;

No NULL check, but all callers pass address-of-variable, so OK.

> +          shared_per_thread = core;

Ok.

> @@ -597,29 +599,28 @@ get_common_cache_info (long int *shared_ptr, unsigned int *threads_ptr,
>          }
>        else
>          {
> -intel_bug_no_cache_info:
> -          /* Assume that all logical threads share the highest cache
> -             level.  */
> -          threads
> -            = ((cpu_features->features[CPUID_INDEX_1].cpuid.ebx >> 16)
> -               & 0xff);
> -        }
> -
> -      /* Cap usage of highest cache level to the number of supported
> -         threads.  */
> -      if (shared > 0 && threads > 0)
> -        shared /= threads;
> +        intel_bug_no_cache_info:
> +          /* Assume that all logical threads share the highest cache
> +             level.  */
> +          threads = ((cpu_features->features[CPUID_INDEX_1].cpuid.ebx >> 16)
> +                     & 0xff);
> +
> +          /* Get per-thread size of highest level cache.  */
> +          if (shared_per_thread > 0 && threads > 0)
> +            shared_per_thread /= threads;
> +        }
>      }

Mostly whitespace, but real change is to modify shared_per_thread instead
of shared.  Ok.

>        if (threads_l2 > 0)
> -        core /= threads_l2;
> +        shared_per_thread += core / threads_l2;
>        shared += core;

shared is still unscaled, shared_per_thread has the scaled version.  Ok.

>      }
>
>    *shared_ptr = shared;
> +  *shared_per_thread_ptr = shared_per_thread;

Ok.

> @@ -629,6 +630,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    /* Find out what brand of processor.  */
>    long int data = -1;
>    long int shared = -1;
> +  long int shared_per_thread = -1;

Ok.

> @@ -649,6 +651,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>        data = handle_intel (_SC_LEVEL1_DCACHE_SIZE, cpu_features);
>        core = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
>        shared = handle_intel (_SC_LEVEL3_CACHE_SIZE, cpu_features);
> +      shared_per_thread = shared;

Ok.

> -      get_common_cache_info (&shared, &threads, core);
> +      get_common_cache_info (&shared, &shared_per_thread, &threads, core);

Ok.

>        shared = handle_zhaoxin (_SC_LEVEL3_CACHE_SIZE);
> +      shared_per_thread = shared;

Ok.

> -      get_common_cache_info (&shared, &threads, core);
> +      get_common_cache_info (&shared, &shared_per_thread, &threads, core);

Ok.

>        shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
> +      shared_per_thread = shared;

Ok.

> @@ -715,6 +720,9 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>        if (shared <= 0)
>          /* No shared L3 cache.  All we have is the L2 cache.  */
>          shared = core;
> +
> +      if (shared_per_thread <= 0)
> +        shared_per_thread = shared;

Reasonable default, ok.

> @@ -730,17 +738,25 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    cpu_features->level3_cache_linesize = level3_cache_linesize;
>    cpu_features->level4_cache_size = level4_cache_size;
>
> -  /* The default setting for the non_temporal threshold is 3/4 of one
> -     thread's share of the chip's cache. For most Intel and AMD processors
> -     with an initial release date between 2017 and 2020, a thread's typical
> -     share of the cache is from 500 KBytes to 2 MBytes. Using the 3/4
> -     threshold leaves 125 KBytes to 500 KBytes of the thread's data
> -     in cache after a maximum temporal copy, which will maintain
> -     in cache a reasonable portion of the thread's stack and other
> -     active data. If the threshold is set higher than one thread's
> -     share of the cache, it has a substantial risk of negatively
> -     impacting the performance of other threads running on the chip. */
> -  unsigned long int non_temporal_threshold = shared * 3 / 4;
> +  /* The default setting for the non_temporal threshold is 1/4 of size
> +     of the chip's cache. For most Intel and AMD processors with an
> +     initial release date between 2017 and 2023, a thread's typical
> +     share of the cache is from 18-64MB. Using the 1/4 L3 is meant to
> +     estimate the point where non-temporal stores begin outcompeting
> +     REP MOVSB.
> +     As well the point where the fact that non-temporal
> +     stores are forced back to main memory would already occurred to the
> +     majority of the lines in the copy. Note, concerns about the
> +     entire L3 cache being evicted by the copy are mostly alleviated
> +     by the fact that modern HW detects streaming patterns and
> +     provides proper LRU hints so that the maximum thrashing
> +     capped at 1/associativity. */
> +  unsigned long int non_temporal_threshold = shared / 4;

Changing from 3/4 of a scaled amount, to 1/4 of the total amount.  Ok.

> +  /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
> +     a higher risk of actually thrashing the cache as they don't have a HW LRU
> +     hint. As well, there performance in highly parallel situations is
> +     noticeably worse. */
> +  if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> +    non_temporal_threshold = shared_per_thread * 3 / 4;

This is the same as the old default, so OK.
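
To make the shared_per_thread bookkeeping concrete, here is a small
standalone sketch in plain C (not the glibc code itself); the cache sizes
and thread counts below are made-up example values, not anything read
from CPUID:

#include <stdio.h>

int
main (void)
{
  long int shared = 32 * 1024 * 1024;   /* total L3; hypothetical 32 MiB.  */
  long int core = 1024 * 1024;          /* L2 per core; hypothetical 1 MiB.  */
  unsigned int threads = 16;            /* logical threads sharing the L3.  */
  unsigned int threads_l2 = 2;          /* logical threads sharing one L2.  */

  /* shared_per_thread starts out as the whole L3...  */
  long int shared_per_thread = shared;

  /* ...is scaled down to one thread's slice of the L3...  */
  if (shared_per_thread > 0 && threads > 0)
    shared_per_thread /= threads;

  /* ...and gains one thread's slice of the L2, while the unscaled total
     still accumulates the whole L2 as before.  */
  if (threads_l2 > 0)
    shared_per_thread += core / threads_l2;
  shared += core;

  printf ("shared=%ld shared_per_thread=%ld\n", shared, shared_per_thread);
  return 0;
}

The point being that shared keeps the raw totals, while shared_per_thread
ends up as one thread's slice of the L3 plus one thread's slice of the L2.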
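
Likewise, a sketch of the resulting threshold selection: 1/4 of the whole
L3 when ERMS is usable, otherwise the old 3/4-of-one-thread's-share sizing.
pick_non_temporal_threshold and has_erms are names invented for the
example; in glibc the logic sits inline in dl_init_cacheinfo:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper wrapping the reviewed selection logic.  */
static unsigned long int
pick_non_temporal_threshold (long int shared, long int shared_per_thread,
                             bool has_erms)
{
  /* New default: 1/4 of the chip's whole L3.  */
  unsigned long int non_temporal_threshold = shared / 4;

  /* Without ERMS, fall back to the old per-thread sizing.  */
  if (!has_erms)
    non_temporal_threshold = shared_per_thread * 3 / 4;

  return non_temporal_threshold;
}

int
main (void)
{
  long int shared = 32 * 1024 * 1024;            /* hypothetical 32 MiB L3.  */
  long int shared_per_thread = 2 * 1024 * 1024;  /* hypothetical 2 MiB/thread.  */

  printf ("with ERMS:    %lu\n",
          pick_non_temporal_threshold (shared, shared_per_thread, true));
  printf ("without ERMS: %lu\n",
          pick_non_temporal_threshold (shared, shared_per_thread, false));
  return 0;
}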