From: Adhemerval Zanella
To: libc-alpha@sourceware.org
Cc: "H . J . Lu", Noah Goldstein, Sajan Karumanchi, bmerry@sarao.ac.za, pmallapp@amd.com
Subject: [PATCH v3 1/3] x86: Fix Zen3/Zen4 ERMS selection (BZ 30994)
Date: Thu, 8 Feb 2024 10:08:38 -0300
Message-Id: <20240208130840.533348-2-adhemerval.zanella@linaro.org>
In-Reply-To: <20240208130840.533348-1-adhemerval.zanella@linaro.org>
References: <20240208130840.533348-1-adhemerval.zanella@linaro.org>

The REP MOVSB usage on memcpy/memmove does not show much performance
improvement on Zen3/Zen4 cores compared to the vectorized loops.  Also,
as reported in BZ 30994, if the source is aligned and the destination
is not, the performance can be 20x slower.

The performance difference is most noticeable with small buffer sizes,
close to the lower bound at which memcpy/memmove starts to use ERMS.
The performance of REP MOVSB is similar to that of the vectorized
instructions at the upper size limit (the L2 cache).  Also, there is no
drawback to multiple cores sharing the cache.

Checked on x86_64-linux-gnu on Zen3.
---
 sysdeps/x86/dl-cacheinfo.h | 38 ++++++++++++++++++--------------------
 1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index d5101615e3..f34d12846c 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -791,7 +791,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   long int data = -1;
   long int shared = -1;
   long int shared_per_thread = -1;
-  long int core = -1;
   unsigned int threads = 0;
   unsigned long int level1_icache_size = -1;
   unsigned long int level1_icache_linesize = -1;
@@ -809,7 +808,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   if (cpu_features->basic.kind == arch_kind_intel)
     {
       data = handle_intel (_SC_LEVEL1_DCACHE_SIZE, cpu_features);
-      core = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
       shared = handle_intel (_SC_LEVEL3_CACHE_SIZE, cpu_features);
       shared_per_thread = shared;
 
@@ -822,7 +820,8 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
         = handle_intel (_SC_LEVEL1_DCACHE_ASSOC, cpu_features);
       level1_dcache_linesize
         = handle_intel (_SC_LEVEL1_DCACHE_LINESIZE, cpu_features);
-      level2_cache_size = core;
+      level2_cache_size
+        = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
       level2_cache_assoc
         = handle_intel (_SC_LEVEL2_CACHE_ASSOC, cpu_features);
       level2_cache_linesize
@@ -835,12 +834,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       level4_cache_size
         = handle_intel (_SC_LEVEL4_CACHE_SIZE, cpu_features);
 
-      get_common_cache_info (&shared, &shared_per_thread, &threads, core);
+      get_common_cache_info (&shared, &shared_per_thread, &threads,
+                             level2_cache_size);
     }
   else if (cpu_features->basic.kind == arch_kind_zhaoxin)
     {
       data = handle_zhaoxin (_SC_LEVEL1_DCACHE_SIZE);
-      core = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
       shared = handle_zhaoxin (_SC_LEVEL3_CACHE_SIZE);
       shared_per_thread = shared;
 
@@ -849,19 +848,19 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       level1_dcache_size = data;
       level1_dcache_assoc = handle_zhaoxin (_SC_LEVEL1_DCACHE_ASSOC);
       level1_dcache_linesize = handle_zhaoxin (_SC_LEVEL1_DCACHE_LINESIZE);
-      level2_cache_size = core;
+      level2_cache_size = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
       level2_cache_assoc = handle_zhaoxin (_SC_LEVEL2_CACHE_ASSOC);
       level2_cache_linesize = handle_zhaoxin (_SC_LEVEL2_CACHE_LINESIZE);
       level3_cache_size = shared;
       level3_cache_assoc = handle_zhaoxin (_SC_LEVEL3_CACHE_ASSOC);
       level3_cache_linesize = handle_zhaoxin (_SC_LEVEL3_CACHE_LINESIZE);
 
-      get_common_cache_info (&shared, &shared_per_thread, &threads, core);
+      get_common_cache_info (&shared, &shared_per_thread, &threads,
+                             level2_cache_size);
     }
   else if (cpu_features->basic.kind == arch_kind_amd)
     {
       data = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
-      core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
       shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
 
       level1_icache_size = handle_amd (_SC_LEVEL1_ICACHE_SIZE);
@@ -869,7 +868,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       level1_dcache_size = data;
       level1_dcache_assoc = handle_amd (_SC_LEVEL1_DCACHE_ASSOC);
       level1_dcache_linesize = handle_amd (_SC_LEVEL1_DCACHE_LINESIZE);
-      level2_cache_size = core;
+      level2_cache_size = handle_amd (_SC_LEVEL2_CACHE_SIZE);
       level2_cache_assoc = handle_amd (_SC_LEVEL2_CACHE_ASSOC);
       level2_cache_linesize = handle_amd (_SC_LEVEL2_CACHE_LINESIZE);
       level3_cache_size = shared;
@@ -880,12 +879,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       if (shared <= 0)
         {
           /* No shared L3 cache.  All we have is the L2 cache.  */
-          shared = core;
+          shared = level2_cache_size;
         }
       else if (cpu_features->basic.family < 0x17)
         {
           /* Account for exclusive L2 and L3 caches.  */
-          shared += core;
+          shared += level2_cache_size;
         }
 
       shared_per_thread = shared;
@@ -987,6 +986,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   if (CPU_FEATURE_USABLE_P (cpu_features, FSRM))
     rep_movsb_threshold = 2112;
 
+  /* For AMD CPUs that support ERMS (Zen3+), REP MOVSB is in a lot of
+     cases slower than the vectorized path (and for some alignments,
+     it is really slow, check BZ #30994).  */
+  if (cpu_features->basic.kind == arch_kind_amd)
+    rep_movsb_threshold = non_temporal_threshold;
+
   /* The default threshold to use Enhanced REP STOSB.  */
   unsigned long int rep_stosb_threshold = 2048;
 
@@ -1028,16 +1033,9 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
                            SIZE_MAX);
 
   unsigned long int rep_movsb_stop_threshold;
-  /* ERMS feature is implemented from AMD Zen3 architecture and it is
-     performing poorly for data above L2 cache size. Henceforth, adding
-     an upper bound threshold parameter to limit the usage of Enhanced
-     REP MOVSB operations and setting its value to L2 cache size.  */
-  if (cpu_features->basic.kind == arch_kind_amd)
-    rep_movsb_stop_threshold = core;
   /* Setting the upper bound of ERMS to the computed value of
-     non-temporal threshold for architectures other than AMD.  */
-  else
-    rep_movsb_stop_threshold = non_temporal_threshold;
+     non-temporal threshold for all architectures.  */
+  rep_movsb_stop_threshold = non_temporal_threshold;
 
   cpu_features->data_cache_size = data;
   cpu_features->shared_cache_size = shared;
-- 
2.34.1
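
For readers following the threshold changes, here is a minimal standalone sketch of the selection logic after this patch. It is not the glibc code: `enum cpu_kind`, `struct erms_window`, `select_erms_window` and the `default_threshold` parameter are hypothetical names used only for illustration, and the sketch assumes the string functions consider REP MOVSB only for sizes from `start` up to (but not including) `stop`. The value 2112 for FSRM and the two AMD assignments are taken from the diff; the default lower bound is computed elsewhere in the file and is passed in here as an assumption.

```c
#include <stdbool.h>

/* Hypothetical stand-in for cpu_features->basic.kind.  */
enum cpu_kind { KIND_INTEL, KIND_ZHAOXIN, KIND_AMD, KIND_OTHER };

/* Hypothetical pair holding the two thresholds touched by the patch.  */
struct erms_window
{
  unsigned long int start; /* Models rep_movsb_threshold.  */
  unsigned long int stop;  /* Models rep_movsb_stop_threshold.  */
};

/* Sketch of the post-patch selection.  DEFAULT_THRESHOLD is whatever
   lower bound was computed before the FSRM check (an assumption here;
   the diff does not show that computation).  */
static struct erms_window
select_erms_window (enum cpu_kind kind, bool has_fsrm,
                    unsigned long int default_threshold,
                    unsigned long int non_temporal_threshold)
{
  struct erms_window w;

  w.start = default_threshold;
  if (has_fsrm)
    w.start = 2112;

  /* New in the patch: on AMD the lower bound is raised to the
     non-temporal threshold.  */
  if (kind == KIND_AMD)
    w.start = non_temporal_threshold;

  /* New in the patch: the upper bound is the non-temporal threshold
     on every architecture, no longer the L2 size on AMD.  */
  w.stop = non_temporal_threshold;
  return w;
}

/* True when no size falls inside [start, stop), i.e. REP MOVSB would
   never be chosen under the assumption stated above.  */
static bool
erms_window_is_empty (struct erms_window w)
{
  return w.start >= w.stop;
}
```

Under these assumptions, an AMD configuration always yields `start == stop`, so the window is empty and the vectorized path covers every size below the non-temporal threshold, while other vendors keep a non-empty window whenever the lower bound is below the non-temporal threshold.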