From mboxrd@z Thu Jan 1 00:00:00 1970
From: "H.J. Lu"
Date: Tue, 6 Apr 2021 05:34:21 -0700
Subject: Re: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs
To: Jan Hubicka
Cc: Hongyu Wang, Hongtao Liu, GCC Patches, Hongyu Wang
List-Id: Gcc-patches mailing list
In-Reply-To: <20210406095148.GC51111@kam.mff.cuni.cz>
References: <20210322131636.58461-1-hjl.tools@gmail.com> <20210322131636.58461-3-hjl.tools@gmail.com> <20210405211407.GC91941@kam.mff.cuni.cz> <20210406095148.GC51111@kam.mff.cuni.cz>

On Tue, Apr 6, 2021 at 2:51 AM Jan Hubicka wrote:
>
> > > Do you know which of the three changes (preferring rep movsb/stosb,
> > > the CLEAR_RATIO change, and the algorithm choice change) causes the
> > > two speedups on EEMBC?
> >
> > An extracted testcase from nnet_test: https://godbolt.org/z/c8KdsohTP
> >
> > This loop is transformed to builtin_memcpy and builtin_memset with size 280.
> >
> > The current strategy for Skylake is {512, unrolled_loop, false} for such
> > a size, so it generates unrolled loops with mov, while the patch
> > generates a memcpy/memset libcall that uses vector moves.
>
> This is good - I originally set the table based on this
> micro-benchmarking script, and apparently the glibc used at that time
> had a more expensive memcpy for small blocks.
>
> One thing to consider, however, is that calling an external memcpy also
> has the additional cost of clobbering all caller-saved registers.
> Especially for code that uses SSE this is painful, since everything then
> has to be spilled to the stack. So I am not completely sure how
> representative the micro-benchmark is in this respect, since it does not
> use any SSE and register pressure is generally small.
>
> So with the current glibc it seems a libcall is a win for blocks of size
> greater than 64 or 128, at least if register pressure is not high.
> In this respect your change looks good.
>
> > > My patch generates "rep movsb" only in very limited cases:
> > >
> > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
> > >    load and store for up to 16 * 16 (256) bytes when the data size is
> > >    fixed and known.
> > > 2. Inline only if the data size is known to be <= 256.
> > >    a. Use "rep movsb/stosb" with a simple code sequence if the data
> > >       size is a constant.
> > >    b. Use a loop if the data size is not a constant.
>
> Aha, this is very hard to read from the algorithm descriptor. So we
> still have the check that maxsize == minsize and use rep movsb only for
> constant-sized blocks when the corresponding TARGET macro is defined.
>
> I think it would be more readable if we introduced rep_1_byte_constant.
> The descriptor is supposed to read as a sequence of rules where the
> first one applies. It is not obvious that we have another TARGET_*
> macro that makes rep_1_byte be ignored in some cases.
> (The TARGET macro will also interfere with the micro-benchmarking
> script.)
>
> Still, I do not understand why a compile-time constant makes rep
> movsb/stosb better than a loop. Is the CPU special-casing it at decode
> time and requiring an explicit mov instruction? Or is it only because
> rep movsb is not good for blocks smaller than 128 bits?

Non-constant "rep movsb" triggers more machine clear events in hot loops
of some workloads:

https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/mo-machine-clear-overhead.html

> > > As a result, "rep stosb" is generated only when 128 < data size < 256
> > > with -mno-sse.
>
> > > > Do you have some data for blocks of size 8...256 being faster with
> > > > rep1 compared to an unrolled loop, for perhaps more real-world
> > > > benchmarks?
> > >
> > > "rep movsb" isn't generated with my patch in this case, since
> > > MOVE_RATIO == 17 can copy up to 16 * 16 (256) bytes with
> > > XMM registers.
>
> OK, so I guess:
>
>   {libcall,
>    {{256, rep_1_byte, true},
>     {256, unrolled_loop, false},
>     {-1, libcall, false}}},
>   {libcall,
>    {{256, rep_1_loop, true},
>     {256, unrolled_loop, false},
>     {-1, libcall, false}}}};
>
> may still perform better, but the difference between loop and unrolled
> loop is within a 10% margin.
>
> So I guess the patch is OK, and we should look into cleaning up the
> descriptors. I can make a patch for that once I understand the logic
> above.

I am checking in my patch. We will improve it for GCC 12. We will also
revisit:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90773

for GCC 12.

Thanks.

-- 
H.J.