From: DJ Delorie
To: Noah Goldstein
Cc: libc-alpha@sourceware.org, goldstein.w.n@gmail.com, hjl.tools@gmail.com,
    carlos@systemhalted.org
Subject: Re: [PATCH v9 1/3] x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4`
In-Reply-To: <20230513051906.1287611-1-goldstein.w.n@gmail.com>
Date: Thu, 25 May 2023 23:34:42 -0400

Ignoring the benchmarks for now, technical review...  LGTM.

Reviewed-by: DJ Delorie

>  static void
> -get_common_cache_info (long int *shared_ptr, unsigned int *threads_ptr,
> +get_common_cache_info (long int *shared_ptr, long int * shared_per_thread_ptr, unsigned int *threads_ptr,
>                         long int core)

Adding a new parameter, ok.

> +  long int shared_per_thread = *shared_per_thread_ptr;

No NULL check, but all callers pass address-of-variable, so OK.

> +          shared_per_thread = core;

Ok.

> @@ -597,29 +599,28 @@ get_common_cache_info (long int *shared_ptr, unsigned int *threads_ptr,
>          }
>        else
>          {
> -intel_bug_no_cache_info:
> -          /* Assume that all logical threads share the highest cache
> -             level.  */
> -          threads
> -            = ((cpu_features->features[CPUID_INDEX_1].cpuid.ebx >> 16)
> -               & 0xff);
> -        }
> -
> -      /* Cap usage of highest cache level to the number of supported
> -         threads.  */
> -      if (shared > 0 && threads > 0)
> -        shared /= threads;
> +        intel_bug_no_cache_info:
> +          /* Assume that all logical threads share the highest cache
> +             level.  */
> +          threads = ((cpu_features->features[CPUID_INDEX_1].cpuid.ebx >> 16)
> +                     & 0xff);
> +
> +          /* Get per-thread size of highest level cache.  */
> +          if (shared_per_thread > 0 && threads > 0)
> +            shared_per_thread /= threads;
> +        }
>      }

Mostly whitespace, but real change is to modify shared_per_thread instead
of shared.  Ok.

>        if (threads_l2 > 0)
> -        core /= threads_l2;
> +        shared_per_thread += core / threads_l2;
>        shared += core;

shared is still unscaled, shared_per_thread has the scaled version.  Ok.

>      }
>
>    *shared_ptr = shared;
> +  *shared_per_thread_ptr = shared_per_thread;

Ok.

> @@ -629,6 +630,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    /* Find out what brand of processor.  */
>    long int data = -1;
>    long int shared = -1;
> +  long int shared_per_thread = -1;

Ok.

> @@ -649,6 +651,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>        data = handle_intel (_SC_LEVEL1_DCACHE_SIZE, cpu_features);
>        core = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
>        shared = handle_intel (_SC_LEVEL3_CACHE_SIZE, cpu_features);
> +      shared_per_thread = shared;

Ok.

> -      get_common_cache_info (&shared, &threads, core);
> +      get_common_cache_info (&shared, &shared_per_thread, &threads, core);

Ok.

>        shared = handle_zhaoxin (_SC_LEVEL3_CACHE_SIZE);
> +      shared_per_thread = shared;

Ok.

> -      get_common_cache_info (&shared, &threads, core);
> +      get_common_cache_info (&shared, &shared_per_thread, &threads, core);

Ok.

>        shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
> +      shared_per_thread = shared;

Ok.

> @@ -715,6 +720,9 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>        if (shared <= 0)
>          /* No shared L3 cache.  All we have is the L2 cache.  */
>          shared = core;
> +
> +      if (shared_per_thread <= 0)
> +        shared_per_thread = shared;

Reasonable default, ok.

> @@ -730,17 +738,25 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    cpu_features->level3_cache_linesize = level3_cache_linesize;
>    cpu_features->level4_cache_size = level4_cache_size;
>
> -  /* The default setting for the non_temporal threshold is 3/4 of one
> -     thread's share of the chip's cache. For most Intel and AMD processors
> -     with an initial release date between 2017 and 2020, a thread's typical
> -     share of the cache is from 500 KBytes to 2 MBytes. Using the 3/4
> -     threshold leaves 125 KBytes to 500 KBytes of the thread's data
> -     in cache after a maximum temporal copy, which will maintain
> -     in cache a reasonable portion of the thread's stack and other
> -     active data. If the threshold is set higher than one thread's
> -     share of the cache, it has a substantial risk of negatively
> -     impacting the performance of other threads running on the chip. */
> -  unsigned long int non_temporal_threshold = shared * 3 / 4;
> +  /* The default setting for the non_temporal threshold is 1/4 of size
> +     of the chip's cache. For most Intel and AMD processors with an
> +     initial release date between 2017 and 2023, a thread's typical
> +     share of the cache is from 18-64MB. Using the 1/4 L3 is meant to
> +     estimate the point where non-temporal stores begin outcompeting
> +     REP MOVSB.
> +     As well the point where the fact that non-temporal
> +     stores are forced back to main memory would already occurred to the
> +     majority of the lines in the copy. Note, concerns about the
> +     entire L3 cache being evicted by the copy are mostly alleviated
> +     by the fact that modern HW detects streaming patterns and
> +     provides proper LRU hints so that the maximum thrashing
> +     capped at 1/associativity. */
> +  unsigned long int non_temporal_threshold = shared / 4;

Changing from 3/4 of a scaled amount, to 1/4 of the total amount.  Ok.

> +  /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
> +     a higher risk of actually thrashing the cache as they don't have a HW LRU
> +     hint. As well, there performance in highly parallel situations is
> +     noticeably worse. */
> +  if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> +    non_temporal_threshold = shared_per_thread * 3 / 4;

This is the same as the old default, so OK.
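
To make the shared_per_thread bookkeeping concrete, here is a small
standalone sketch in plain C (not the glibc code itself); the cache sizes
and thread counts below are made-up example values, not anything read
from CPUID:

#include <stdio.h>

int
main (void)
{
  long int shared = 32 * 1024 * 1024;   /* total L3; hypothetical 32 MiB.  */
  long int core = 1024 * 1024;          /* L2 per core; hypothetical 1 MiB.  */
  unsigned int threads = 16;            /* logical threads sharing the L3.  */
  unsigned int threads_l2 = 2;          /* logical threads sharing one L2.  */

  /* shared_per_thread starts out as the whole L3...  */
  long int shared_per_thread = shared;

  /* ...is scaled down to one thread's slice of the L3...  */
  if (shared_per_thread > 0 && threads > 0)
    shared_per_thread /= threads;

  /* ...and gains one thread's slice of the L2, while the unscaled total
     still accumulates the whole L2 as before.  */
  if (threads_l2 > 0)
    shared_per_thread += core / threads_l2;
  shared += core;

  printf ("shared=%ld shared_per_thread=%ld\n", shared, shared_per_thread);
  return 0;
}

The point being that shared keeps the raw totals, while shared_per_thread
ends up as one thread's slice of the L3 plus one thread's slice of the L2.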
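
Likewise, a sketch of the resulting threshold selection: 1/4 of the whole
L3 when ERMS is usable, otherwise the old 3/4-of-one-thread's-share sizing.
pick_non_temporal_threshold and has_erms are names invented for the
example; in glibc the logic sits inline in dl_init_cacheinfo:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper wrapping the reviewed selection logic.  */
static unsigned long int
pick_non_temporal_threshold (long int shared, long int shared_per_thread,
                             bool has_erms)
{
  /* New default: 1/4 of the chip's whole L3.  */
  unsigned long int non_temporal_threshold = shared / 4;

  /* Without ERMS, fall back to the old per-thread sizing.  */
  if (!has_erms)
    non_temporal_threshold = shared_per_thread * 3 / 4;

  return non_temporal_threshold;
}

int
main (void)
{
  long int shared = 32 * 1024 * 1024;            /* hypothetical 32 MiB L3.  */
  long int shared_per_thread = 2 * 1024 * 1024;  /* hypothetical 2 MiB/thread.  */

  printf ("with ERMS:    %lu\n",
          pick_non_temporal_threshold (shared, shared_per_thread, true));
  printf ("without ERMS: %lu\n",
          pick_non_temporal_threshold (shared, shared_per_thread, false));
  return 0;
}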