From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <20210504233226.1514601-1-goldstein.w.n@gmail.com>
 <CAMe9rOrjd3U=JQ2MT95X6CBKZNCTToufdXxYJGP0GB_BUYZajg@mail.gmail.com>
 <CAFUsyfLxpYrgaqHstPb6JeEH15FwC3mbbP0KUbZ6=3aDFBuyfg@mail.gmail.com>
 <CAMe9rOq1+k27Pxhg3eGeoU93sawm1-0bA7weu=bSj-vtas5yfg@mail.gmail.com>
 <CAFUsyfLyja8JSEtw-McVQGATPx-gojiwTyGV0W8Qc+4MGJOigw@mail.gmail.com>
 <CAMe9rOrD83t9cXmi7t70g3r88SSOvx=-k614exxadu3GT15fZw@mail.gmail.com>
 <CAFUsyfLfMHm=qJMgxoDMB=cE5SbqxkT-5M5nPka=uiJihjafbg@mail.gmail.com>
In-Reply-To: <CAFUsyfLfMHm=qJMgxoDMB=cE5SbqxkT-5M5nPka=uiJihjafbg@mail.gmail.com>
From: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Wed, 5 May 2021 11:44:49 -0700
Message-ID: <CAFUsyfKBViDdwN9VVJgGa3tHr4OQshsJ7=ocY20NvmY=VNo7dg@mail.gmail.com>
Subject: Re: [PATCH v1] x86: Add EVEX optimized memchr family not safe for RTM
To: "H.J. Lu" <hjl.tools@gmail.com>
Cc: GNU C Library <libc-alpha@sourceware.org>,
 "Carlos O'Donell" <carlos@systemhalted.org>
Content-Type: text/plain; charset="UTF-8"

On Wed, May 5, 2021 at 11:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Wed, May 5, 2021 at 11:29 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Wed, May 5, 2021 at 11:19 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > On Wed, May 5, 2021 at 1:55 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >
> > > > On Wed, May 5, 2021 at 9:25 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > > >
> > > > > On Wed, May 5, 2021 at 9:23 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > >
> > > > > > On Tue, May 4, 2021 at 4:34 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > > > > >
> > > > > > > No bug.
> > > > > > >
> > > > > > > This commit adds a new implementation for EVEX memchr that is not safe
> > > > > > > for RTM because it uses vzeroupper. The benefit is that by using
> > > > > >
> > > > > > EVEX memchr won't cause RTM abort if YMM16-YMM31 are used
> > > > > > since there is no need to use vzeroupper.  Please remove vzeroupper from
> > > > > > EVEX memchr and remove EVEX RTM functions.
> > > > >
> > > > > That's impossible for this implementation.
> > > > >
> > > > > The reason ymm0-ymm15 are used is so that we can use vpcmpeq which is
> > > > > not encodable with ymm16-ymm31.
> > > > >
> > > > > > This implementation is optimized for CPUs which don't support RTM but
> > > > > do support EVEX.
> > > > >
> > > >
> > > > > Are you seeing something along the lines of Prefer_AVX2_STRCMP:
> > >
> > > Yes.
> > >
> > > > For at least some functions I think EVEX + AVX2 is probably the ideal
> > > > implementation unless you have to worry about RTM. And as of right now
> > > > there are a fair number of x86_64 chips out there w/ avx512 but w/o
> > > > RTM (or Intel might fix the issue where vzeroupper aborts transactions
> > > > in the future).
> > > >
> > > > For small sizes, even if EVEX costs an extra instruction (i.e. strchr
> > > > vpxor + vpmin + vpcmp can be replaced with vpcmpeq + vpcmp), the
> > > > overhead of vzeroupper makes EVEX perform better. But once the main
> > > > loop is hit, the overhead of vzeroupper isn't really a concern, and
> > > > vpcmpeq isn't really replaceable with vpcmp from a logic perspective
> > > > (i.e. 3x vpcmpeq + vptern isn't doable with vpcmp), and vpcmp is slower
> > > > for tput and latency. As a small add-on, the EVEX instructions all have
> > > > a larger code footprint than AVX2, so keeping instruction length <= 6 bytes
> > > > for the DSB isn't really doable.
> >
> > The only reason to use YMM16-YMM31 is to avoid vzeroupper.   Since
> > we are going to issue vzeroupper anyway,  please use YMM0 to YMM15.
>
> Throughout or just in the 4x loop? Using YMM0-YMM15 throughout is a
> regression for the small sizes where the overhead of vzeroupper is
> meaningful. I.e. the strchr example above.

To be clear: the way it is organized now, there is a vzeroupper only if the
code hits the 4x loop, which is a performance improvement over the same
code w/ vzeroupper on every return path.
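
A rough C model of that layout, in case it helps. This is only a sketch,
not the patch itself: the scalar scan() stands in for the vector compares,
SMALL_MAX (4 * VEC_SIZE) is an assumed cutoff rather than the real
threshold, and vzeroupper_count (a made-up counter) just marks where an
actual vzeroupper would execute.

```c
#include <stddef.h>
#include <string.h>

#define VEC_SIZE  32                /* one YMM register */
#define SMALL_MAX (4 * VEC_SIZE)    /* assumed cutoff, not taken from the patch */

/* Hypothetical counter: bumped wherever a real vzeroupper would run. */
static int vzeroupper_count;

/* Scalar stand-in for the vector compares over n bytes. */
static const void *scan(const unsigned char *p, int c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (p[i] == (unsigned char)c)
            return p + i;
    return NULL;
}

const void *memchr_model(const void *s, int c, size_t n)
{
    const unsigned char *p = s;

    /* Small sizes: modeled as EVEX compares on ymm16-ymm31,
       so every return from here skips vzeroupper. */
    size_t head = n < SMALL_MAX ? n : SMALL_MAX;
    const void *r = scan(p, c, head);
    if (r != NULL || n <= SMALL_MAX)
        return r;

    /* 4x loop: modeled as vpcmpeq on ymm0-ymm15, four vectors per
       iteration, paying for exactly one vzeroupper on the way out. */
    p += head;
    n -= head;
    while (r == NULL && n > 0) {
        size_t step = n < 4 * VEC_SIZE ? n : 4 * VEC_SIZE;
        r = scan(p, c, step);
        p += step;
        n -= step;
    }
    vzeroupper_count++;
    return r;
}
```

In this model a match (or a miss) within the first SMALL_MAX bytes never
touches the counter; only calls that reach the loop pay for it once.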

>
> >
> > > I haven't taken that long of a look at strcmp but would guess it would
> > > benefit in a similar way as memchr from EVEX for sizes [0..160] then
> > > AVX2 for the loop (possibly augmented with vptern for 3way reduction)
> > >
> > > >
> > > > commit 1da50d4bda07f04135dca39f40e79fc9eabed1f8
> > > > Author: H.J. Lu <hjl.tools@gmail.com>
> > > > Date:   Fri Feb 26 05:36:59 2021 -0800
> > > >
> > > >     x86: Set Prefer_No_VZEROUPPER and add Prefer_AVX2_STRCMP
> > > >
> > > >     1. Set Prefer_No_VZEROUPPER if RTM is usable to avoid RTM abort triggered
> > > >     by VZEROUPPER inside a transactionally executing RTM region.
> > > >     2. Since to compare 2 32-byte strings, 256-bit EVEX strcmp requires 2
> > > >     loads, 3 VPCMPs and 2 KORDs while AVX2 strcmp requires 1 load, 2 VPCMPEQs,
> > > >     1 VPMINU and 1 VPMOVMSKB, AVX2 strcmp is faster than EVEX strcmp.  Add
> > > >     Prefer_AVX2_STRCMP to prefer AVX2 strcmp family functions.
> > > >
> > > >
> > > > --
> > > > H.J.
> >
> >
> >
> > --
> > H.J.