Message-ID: <5cac325af2cd4af04e1665475690ba96b5058fc8.camel@linux.intel.com>
Subject: Re: [PATCH v2] x86-64: Optimize strcmp/wcscmp with AVX2
From: Leonardo Sandoval
To: Alexander Monakov
Cc: "H.J. Lu", GNU C Library
Date: Tue, 05 Jun 2018 14:02:00 -0000
References: <20180529185339.11541-1-leonardo.sandoval.gonzalez@linux.intel.com> <03bdf89c47880fd0734fc5b82213fc3c98eab372.camel@linux.intel.com> <3a3ebd816fd263cc9eb76f904594f4f0105e5c9a.camel@linux.intel.com>

On Tue, 2018-06-05 at 13:13 +0300, Alexander Monakov wrote:
> On Mon, 4 Jun 2018, Leonardo Sandoval wrote:
> > right, perhaps microbenchmarks does not tell us much on this case
> > because AVX and non-AVX is not mixed. Also, if you look at the patch,
> > upper ymm bits are cleared (vzeroupper) before returning from strcmp,
> > thus there is no perf penalty in storing these and then restoring when
> > other AVX code is called again.
>
> Agreed, but I don't understand why you're bringing up the vzeroupper
> aspect, my concern was about frequency limits only.
>
> > As I said before, using strcmp wont hurt performance at all (internal
> > HW perf team confirmed what I said) because we are not using any opcode
> > that that may drop frequency.
>
> Okay. I didn't manage to find confirmations on the Internet though.
> In my previous mail I gave a link to an Intel whitepaper that makes
> no such indications.

I read the paper you shared (thanks for the link), and it does contain
such indications. See the FAQ at the end:

Q: Will workloads not using Intel AVX instructions also operate at a
   reduced frequency?
A: Workloads not using Intel AVX instructions will continue to operate
   at or above the TDP frequency.

Q: Will running a small number of Intel AVX instructions reduce
   frequency below the regular marked frequency?
A: No, frequency will be reduced below the regular marked frequency
   only if a real power or thermal constraint is reached, not just due
   to the presence of Intel AVX instructions. Some workloads that
   utilize Intel AVX instructions could still achieve turbo above the
   marked TDP frequency.

> Also there's a presentation from CERN saying,
>
>   "Compiling with AVX, or even just using a handful of AVX-256
>   instructions at runtime, will most probably make your program
>   globally slower"
>
> (in context of using AVX on Haswell)
> URL: https://indico.cern.ch/event/327306/contributions/760669/attachments/635800/875267/HaswellConundrum.pdf

I will look at it.

> > if you have a test scenario to prove the 5% drop, I would like to
> > test it and discuss it further.
>
> I don't have access to a range of Haswell/Broadwell/Skylake CPUs to test.
> If the PDFs I've referenced are in fact incomplete or in error w.r.t.
> AVX frequency limits, and you have links to more accurate documents,
> can you please share them?
>
> FWIW, on one Haswell CPU I was able to reproduce turbo limits appearing
> with non-FMA FP AVX usage, but not INT AVX2.
> This indicates that on Haswell the situation is different than what you
> said initially ("partially true for AVX2 FMA and AVX512").
>
> Alexander
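
For readers following the vzeroupper point quoted above, here is a minimal
sketch in C intrinsics of the pattern being described: do the comparison
with integer AVX2 instructions, then clear the upper ymm halves before
returning. This is an illustration only, not the code from the patch. The
names blocks_equal_32 and blocks_equal_32_avx2 are hypothetical, and the
sketch assumes GCC or Clang (for the target attribute and
__builtin_cpu_supports). It compares one fixed 32-byte block and does not
handle string terminators.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__) && defined(__GNUC__)
#define HAVE_X86_AVX2_PATH 1
#include <immintrin.h>

/* Hypothetical helper: compare two 32-byte blocks using integer AVX2 only. */
__attribute__((target("avx2")))
static int blocks_equal_32_avx2(const unsigned char *a, const unsigned char *b)
{
    /* Unaligned 256-bit loads of both inputs. */
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);

    /* Byte-wise equality: each equal byte becomes 0xFF, else 0x00. */
    __m256i eq = _mm256_cmpeq_epi8(va, vb);

    /* Collect the top bit of each byte into a 32-bit mask. */
    int mask = _mm256_movemask_epi8(eq);

    /* Clear the upper halves of the ymm registers before returning,
       which is the vzeroupper step discussed in the thread. */
    _mm256_zeroupper();

    /* All 32 bits set means every byte matched. */
    return mask == -1;
}
#endif

/* Hypothetical entry point: use the AVX2 path when the CPU supports it,
   otherwise fall back to memcmp. Returns 1 if the blocks match, else 0. */
int blocks_equal_32(const unsigned char *a, const unsigned char *b)
{
    assert(a != NULL && b != NULL);
#ifdef HAVE_X86_AVX2_PATH
    if (__builtin_cpu_supports("avx2"))
        return blocks_equal_32_avx2(a, b);
#endif
    return memcmp(a, b, 32) == 0;
}
```

The sketch only shows where the vzeroupper step sits relative to the
integer compare. It says nothing about frequency behaviour, which is the
open question in this thread.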