Subject: Re: memcpy performance on skylake server
To: "Ji, Cheng", "H.J. Lu"
Cc: Libc-help
From: Patrick McGehearty
Date: Thu, 15 Jul 2021 11:51:17 -0500

More in-depth discussion of tuning non-temporal stores for x86 can be
found at:
http://patches-tcwg.linaro.org/patch/41797/

- Patrick McGehearty

On 7/15/2021 2:32 AM, Ji, Cheng via Libc-help wrote:
> Thanks for the information. We did some quick experiments. Indeed, using
> normal temporal stores is ~20% faster than using non-temporal stores in
> this case.
>
> Cheng
>
> On Wed, Jul 14, 2021 at 9:27 PM H.J. Lu wrote:
>
>> On Wed, Jul 14, 2021 at 5:58 AM Adhemerval Zanella wrote:
>>>
>>> On 06/07/2021 05:17, Ji, Cheng via Libc-help wrote:
>>>> Hello,
>>>>
>>>> I found that memcpy is slower on skylake server CPUs during our
>>>> optimization work, and I can't really explain what we got and need some
>>>> guidance here.
>>>>
>>>> The problem is that memcpy is noticeably slower than a simple for loop when
>>>> copying large chunks of data. This genuinely sounds like an amateur mistake
>>>> in our testing code but here's what we have tried:
>>>>
>>>> * The test data is large enough: 1GB.
>>>> * We noticed a change quite a while ago regarding skylake and AVX512:
>>>>   https://patchwork.ozlabs.org/project/glibc/patch/20170418183712.GA22211@intel.com/
>>>> * We updated glibc from 2.17 to the latest 2.33, we did see memcpy is 5%
>>>>   faster but still slower than a simple loop.
>>>> * We tested on multiple bare metal machines with different cpus: Xeon Gold
>>>>   6132, Gold 6252, Silver 4114, as well as a virtual machine on google cloud,
>>>>   the result is reproducible.
>>>> * On an older generation Xeon E5-2630 v3, memcpy is about 50% faster than
>>>>   the simple loop. On my desktop (i7-7700k) memcpy is also significantly
>>>>   faster.
>>>> * numactl is used to ensure everything is running on a single core.
>>>> * The code is compiled by gcc 10.3
>>>>
>>>> The numbers on a Xeon Gold 6132, with glibc 2.33:
>>>> simple_memcpy 4.18 seconds, 4.79 GiB/s 5.02 GB/s
>>>> simple_copy 3.68 seconds, 5.44 GiB/s 5.70 GB/s
>>>> simple_memcpy 4.18 seconds, 4.79 GiB/s 5.02 GB/s
>>>> simple_copy 3.68 seconds, 5.44 GiB/s 5.71 GB/s
>>>>
>>>> The result is worse with system provided glibc 2.17:
>>>> simple_memcpy 4.38 seconds, 4.57 GiB/s 4.79 GB/s
>>>> simple_copy 3.68 seconds, 5.43 GiB/s 5.70 GB/s
>>>> simple_memcpy 4.38 seconds, 4.56 GiB/s 4.78 GB/s
>>>> simple_copy 3.68 seconds, 5.44 GiB/s 5.70 GB/s
>>>>
>>>>
>>>> The code to generate this result (compiled with g++ -O2 -g, run with: numactl
>>>> --membind 0 --physcpubind 0 -- ./a.out)
>>>> =====
>>>>
>>>> #include <chrono>
>>>> #include <cstdint>
>>>> #include <cstdio>
>>>> #include <cstring>
>>>> #include <functional>
>>>> #include <string>
>>>> #include <vector>
>>>>
>>>> class TestCase {
>>>>     using clock_t = std::chrono::high_resolution_clock;
>>>>     using sec_t = std::chrono::duration<double, std::ratio<1, 1>>;
>>>>
>>>> public:
>>>>     static constexpr size_t NUM_VALUES = 128 * (1 << 20); // 128 million * 8 bytes = 1GiB
>>>>
>>>>     void init() {
>>>>         vals_.resize(NUM_VALUES);
>>>>         for (size_t i = 0; i < NUM_VALUES; ++i) {
>>>>             vals_[i] = i;
>>>>         }
>>>>         dest_.resize(NUM_VALUES);
>>>>     }
>>>>
>>>>     void run(std::string name,
>>>>              std::function<void(const int64_t *, int64_t *, size_t)> &&func) {
>>>>         // ignore the result from first run
>>>>         func(vals_.data(), dest_.data(), vals_.size());
>>>>         constexpr size_t count = 20;
>>>>         auto start = clock_t::now();
>>>>         for (size_t i = 0; i < count; ++i) {
>>>>             func(vals_.data(), dest_.data(), vals_.size());
>>>>         }
>>>>         auto end = clock_t::now();
>>>>         double duration =
>>>>             std::chrono::duration_cast<sec_t>(end-start).count();
>>>>         printf("%s %.2f seconds, %.2f GiB/s, %.2f GB/s\n", name.data(),
>>>>                duration,
>>>>                sizeof(int64_t) * NUM_VALUES / double(1 << 30) * count /
>>>>                    duration,
>>>>                sizeof(int64_t) * NUM_VALUES / double(1e9) * count /
>>>>                    duration);
>>>>     }
>>>>
>>>> private:
>>>>     std::vector<int64_t> vals_;
>>>>     std::vector<int64_t> dest_;
>>>> };
>>>>
>>>> void simple_memcpy(const int64_t *src, int64_t *dest, size_t n) {
>>>>     memcpy(dest, src, n * sizeof(int64_t));
>>>> }
>>>>
>>>> void simple_copy(const int64_t *src, int64_t *dest, size_t n) {
>>>>     for (size_t i = 0; i < n; ++i) {
>>>>         dest[i] = src[i];
>>>>     }
>>>> }
>>>>
>>>> int main(int, char **) {
>>>>     TestCase c;
>>>>     c.init();
>>>>
>>>>     c.run("simple_memcpy", simple_memcpy);
>>>>     c.run("simple_copy", simple_copy);
>>>>     c.run("simple_memcpy", simple_memcpy);
>>>>     c.run("simple_copy", simple_copy);
>>>> }
>>>>
>>>> =====
>>>>
>>>> The assembly of simple_copy generated by gcc is very simple:
>>>> Dump of assembler code for function _Z11simple_copyPKlPlm:
>>>> 0x0000000000401440 <+0>:  mov %rdx,%rcx
>>>> 0x0000000000401443 <+3>:  test %rdx,%rdx
>>>> 0x0000000000401446 <+6>:  je 0x401460 <_Z11simple_copyPKlPlm+32>
>>>> 0x0000000000401448 <+8>:  xor %eax,%eax
>>>> 0x000000000040144a <+10>: nopw 0x0(%rax,%rax,1)
>>>> 0x0000000000401450 <+16>: mov (%rdi,%rax,8),%rdx
>>>> 0x0000000000401454 <+20>: mov %rdx,(%rsi,%rax,8)
>>>> 0x0000000000401458 <+24>: inc %rax
>>>> 0x000000000040145b <+27>: cmp %rax,%rcx
>>>> 0x000000000040145e <+30>: jne 0x401450 <_Z11simple_copyPKlPlm+16>
>>>> 0x0000000000401460 <+32>: retq
>>>>
>>>> When compiling with -O3, gcc vectorized the loop using xmm0, the
>>>> simple_loop is around 1% faster.
>>> Usually differences of that magnitude either fall within the noise or are
>>> related to OS jitter.
>>>
>>>> I took a brief look at the glibc source code. Though I don't have enough
>>>> knowledge to understand it yet, I'm curious about the underlying mechanism.
>>>> Thanks.
>>> H.J, do you have any idea what might be happening here?
>> From Intel optimization guide:
>>
>> 2.2.2 Non-Temporal Stores on Skylake Server Microarchitecture
>> Because of the change in the size of each bank of last level cache on
>> Skylake Server microarchitecture, if an application, library, or driver
>> only considers the last level cache to determine the size of on-chip
>> cache per-core, it may see a reduction with Skylake Server
>> microarchitecture and may use non-temporal store with smaller blocks of
>> memory writes. Since non-temporal stores evict cache lines back to
>> memory, this may result in an increase in the number of subsequent cache
>> misses and memory bandwidth demands on Skylake Server microarchitecture,
>> compared to the previous Intel Xeon processor family.
>> Also, because of a change in the handling of accesses resulting from
>> non-temporal stores by Skylake Server microarchitecture, the resources
>> within each core remain busy for a longer duration compared to similar
>> accesses on the previous Intel Xeon processor family. As a result, if a
>> series of such instructions are executed, there is a potential that the
>> processor may run out of resources and stall, thus limiting the memory
>> write bandwidth from each core.
>> The increase in cache misses due to overuse of non-temporal stores and
>> the limit on the memory write bandwidth per core for non-temporal stores
>> may result in reduced performance for some applications.
>>
>> --
>> H.J.
>>
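
For anyone who wants to reproduce the temporal versus non-temporal comparison outside of glibc, the two store flavors described in the guide excerpt above can be sketched roughly as follows. This is only an illustration under stated assumptions, not glibc's memcpy implementation: copy_temporal, copy_nontemporal and copies_match are hypothetical names, the streaming path assumes an x86 target with SSE2 (it falls back to memcpy elsewhere), and the real library code uses wider vectors and size thresholds that are not shown here.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <cstring>
#include <vector>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics: _mm_loadu_si128, _mm_stream_si128
#endif

// Hypothetical helper: ordinary (temporal) stores, which go through the cache.
void copy_temporal(const int64_t *src, int64_t *dest, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dest[i] = src[i];
    }
}

// Hypothetical helper: non-temporal (streaming) stores where SSE2 is
// available. This is an assumption-laden sketch, not what glibc does.
void copy_nontemporal(const int64_t *src, int64_t *dest, size_t n) {
#if defined(__SSE2__)
    size_t i = 0;
    // _mm_stream_si128 needs a 16-byte-aligned destination, so copy
    // leading elements with normal stores until dest + i is aligned.
    while (i < n && reinterpret_cast<uintptr_t>(dest + i) % 16 != 0) {
        dest[i] = src[i];
        ++i;
    }
    // Main loop: two int64_t values (16 bytes) per streaming store.
    for (; i + 2 <= n; i += 2) {
        const __m128i v =
            _mm_loadu_si128(reinterpret_cast<const __m128i *>(src + i));
        _mm_stream_si128(reinterpret_cast<__m128i *>(dest + i), v);
    }
    // Tail: at most one remaining element, stored normally.
    for (; i < n; ++i) {
        dest[i] = src[i];
    }
    // Order the streaming stores before any later stores.
    _mm_sfence();
#else
    // No SSE2: fall back to the library copy so the sketch still builds.
    std::memcpy(dest, src, n * sizeof(int64_t));
#endif
}

// Hypothetical self-check: both variants must produce identical copies.
bool copies_match(size_t n) {
    std::vector<int64_t> src(n), a(n, -1), b(n, -1);
    for (size_t i = 0; i < n; ++i) {
        src[i] = static_cast<int64_t>(i) * 3;
    }
    copy_temporal(src.data(), a.data(), n);
    copy_nontemporal(src.data(), b.data(), n);
    return a == src && b == src;
}
```

Either function has the same (src, dest, n) signature as simple_memcpy and simple_copy in the quoted test program, so it should be possible to pass them to TestCase::run in the same way. As far as I know, the size at which glibc itself switches to non-temporal stores can also be adjusted at run time through the glibc.cpu.x86_non_temporal_threshold tunable (set via GLIBC_TUNABLES), which avoids rebuilding anything; please check the tunables chapter of the manual for your glibc version.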