From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-alpha-return-92906-listarch-libc-alpha=sources.redhat.com@sourceware.org>
Received: (qmail 45882 invoked by alias); 5 Jun 2018 10:14:27 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Received: (qmail 39635 invoked by uid 89); 5 Jun 2018 10:14:16 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.9 required=5.0 tests=AWL,BAYES_00,KAM_NUMSUBJECT,SPF_PASS autolearn=no version=3.3.2 spammy=H*r:MSK, Haswell, H*M:intra, indications
X-HELO: smtp.ispras.ru
Date: Tue, 05 Jun 2018 10:14:00 -0000
From: Alexander Monakov <amonakov@ispras.ru>
To: Leonardo Sandoval <leonardo.sandoval.gonzalez@linux.intel.com>
cc: "H.J. Lu" <hjl.tools@gmail.com>, GNU C Library <libc-alpha@sourceware.org>
Subject: Re: [PATCH v2] x86-64: Optimize strcmp/wcscmp with AVX2
In-Reply-To: <3a3ebd816fd263cc9eb76f904594f4f0105e5c9a.camel@linux.intel.com>
Message-ID: <alpine.LNX.2.20.13.1806051235260.10950@monopod.intra.ispras.ru>
References: <20180529185339.11541-1-leonardo.sandoval.gonzalez@linux.intel.com>   <CAMe9rOpKpR6pOLkxyMuTPBA1zSx4MmYYsTOwHz5pTxjdR57p1A@mail.gmail.com>   <alpine.LNX.2.20.13.1806011824140.1892@monopod.intra.ispras.ru>  <03bdf89c47880fd0734fc5b82213fc3c98eab372.camel@linux.intel.com>
  <alpine.LNX.2.20.13.1806021022140.1892@monopod.intra.ispras.ru> <3a3ebd816fd263cc9eb76f904594f4f0105e5c9a.camel@linux.intel.com>
User-Agent: Alpine 2.20.13 (LNX 116 2015-12-14)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
X-SW-Source: 2018-06/txt/msg00060.txt.bz2

On Mon, 4 Jun 2018, Leonardo Sandoval wrote:
> 
> right, perhaps microbenchmarks does not tell us much on this case
> because AVX and non-AVX is not mixed. Also, if you look at the patch,
> upper ymm bits are cleared (vzeroupper) before returning from strcmp,
> thus there is no perf penalty in storing these and then restoring when
> other AVX code is called again.

Agreed, but I don't understand why you're bringing up the vzeroupper
aspect; my concern was about frequency limits only.

> As I said before, using strcmp wont hurt performance at all (internal
> HW perf team confirmed what I said) because we are not using any opcode
> that that may drop frequency.

Okay. I didn't manage to find confirmation of that on the Internet, though.
In my previous mail I gave a link to an Intel whitepaper that gives
no such indication. Also, there's a presentation from CERN saying:

    "Compiling with AVX, or even just using a handful of AVX-256 
     instructions at runtime, will most probably make your program
     globally slower"

(in the context of using AVX on Haswell)
URL: https://indico.cern.ch/event/327306/contributions/760669/attachments/635800/875267/HaswellConundrum.pdf

> if you have a test scenario to prove the 5% drop, I would like to  
> test it and discuss it further.

I don't have access to a range of Haswell/Broadwell/Skylake CPUs to test.
If the PDFs I've referenced are in fact incomplete or in error w.r.t.
AVX frequency limits, and you have links to more accurate documents, can
you please share them?

FWIW, on one Haswell CPU I was able to reproduce turbo limits kicking in
with non-FMA FP AVX usage, but not with integer AVX2. This indicates that on
Haswell the situation is different from what you said initially ("partially
true for AVX2 FMA and AVX512").

Alexander