From mboxrd@z Thu Jan 1 00:00:00 1970
From: "H.J. Lu"
Date: Tue, 6 Apr 2021 05:34:21 -0700
Subject: Re: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs
To: Jan Hubicka
Cc: Hongyu Wang, Hongtao Liu, GCC Patches, Hongyu Wang
List-Id: Gcc-patches mailing list
In-Reply-To: <20210406095148.GC51111@kam.mff.cuni.cz>
References: <20210322131636.58461-1-hjl.tools@gmail.com> <20210322131636.58461-3-hjl.tools@gmail.com> <20210405211407.GC91941@kam.mff.cuni.cz> <20210406095148.GC51111@kam.mff.cuni.cz>

On Tue, Apr 6, 2021 at 2:51 AM Jan Hubicka wrote:
>
> > > Do you know which of the three changes (preferring rep movsb/stosb,
> > > the CLEAR_RATIO change, and the algorithm choice change) causes the
> > > two speedups on EEMBC?
> >
> > An extracted testcase from nnet_test: https://godbolt.org/z/c8KdsohTP
> >
> > This loop is transformed to builtin_memcpy and builtin_memset with size 280.
> >
> > The current strategy for Skylake is {512, unrolled_loop, false} for such
> > a size, so it generates unrolled loops with mov, while the patch
> > generates a memcpy/memset libcall that uses vector moves.
>
> This is good - I originally set the table based on this
> micro-benchmarking script, and apparently the glibc used at that time
> had a more expensive memcpy for small blocks.
>
> One thing to consider, however, is that calling an external memcpy also
> has the additional cost of clobbering all caller-saved registers.
> Especially for code that uses SSE this is painful, since everything then
> has to be spilled to the stack. So I am not completely sure how
> representative the micro-benchmark is in this respect, since it does not
> use any SSE and register pressure is generally small.
>
> So with the current glibc it seems a libcall is a win for blocks of size
> greater than 64 or 128, at least if register pressure is not high.
> In this respect your change looks good.
>
> > > My patch generates "rep movsb" only in very limited cases:
> > >
> > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
> > >    load and store for up to 16 * 16 (256) bytes when the data size is
> > >    fixed and known.
> > > 2. Inline only if the data size is known to be <= 256.
> > >    a. Use "rep movsb/stosb" with a simple code sequence if the data
> > >       size is a constant.
> > >    b. Use a loop if the data size is not a constant.
>
> Aha, this is very hard to read from the algorithm descriptor. So we
> still have the check that maxsize == minsize and use rep movsb only for
> constant-sized blocks when the corresponding TARGET macro is defined.
>
> I think it would be more readable if we introduced rep_1_byte_constant.
> The descriptor is supposed to read as a sequence of rules where the
> first one applies. It is not obvious that we have another TARGET_*
> macro that makes rep_1_byte be ignored in some cases.
> (The TARGET macro will also interfere with the micro-benchmarking
> script.)
>
> Still, I do not understand why a compile-time constant makes rep
> movsb/stosb better than a loop. Is the CPU special-casing it at decode
> time and requiring an explicit mov instruction? Or is it only because
> rep movsb is not good for blocks smaller than 128 bits?

Non-constant "rep movsb" triggers more machine clear events in hot loops
of some workloads:

https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/mo-machine-clear-overhead.html

> > > As a result, "rep stosb" is generated only when 128 < data size < 256
> > > with -mno-sse.
>
> > > > Do you have some data for blocks of size 8...256 being faster with
> > > > rep1 compared to an unrolled loop, for perhaps more real-world
> > > > benchmarks?
> > >
> > > "rep movsb" isn't generated with my patch in this case, since
> > > MOVE_RATIO == 17 can copy up to 16 * 16 (256) bytes with
> > > XMM registers.
>
> OK, so I guess:
>
>   {libcall,
>    {{256, rep_1_byte, true},
>     {256, unrolled_loop, false},
>     {-1, libcall, false}}},
>   {libcall,
>    {{256, rep_1_loop, true},
>     {256, unrolled_loop, false},
>     {-1, libcall, false}}}};
>
> may still perform better, but the difference between loop and unrolled
> loop is within a 10% margin.
>
> So I guess the patch is OK, and we should look into cleaning up the
> descriptors. I can make a patch for that once I understand the logic
> above.

I am checking in my patch. We will improve it for GCC 12. We will also
revisit:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90773

for GCC 12.

Thanks.

-- 
H.J.