From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr1-x42d.google.com (mail-wr1-x42d.google.com [IPv6:2a00:1450:4864:20::42d]) by sourceware.org (Postfix) with ESMTPS id EDF74388E80F; Thu, 24 Jun 2021 11:42:18 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org EDF74388E80F Received: by mail-wr1-x42d.google.com with SMTP id b3so6297222wrm.6; Thu, 24 Jun 2021 04:42:18 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:reply-to :from:date:message-id:subject:to:cc; bh=dERu5dpUDChrYLP2yA2kUx4bvtrTS/nNX+gHGBKMIwY=; b=L9Y4p4LM3Ob2xJvySOj0WNBcvtB4wlUr2Jfc+WWb3pU2g/GS+1iPCrhELZyCVV4nRR w+DaSYYjN3nQI/r2lD1xn4TFBp0sw8iZblDSl7YVse7eQAzSiuhslbOUzD15SbyFI2X8 sbbdZ0x0VFdFK7X7ESgDpm6duI0VagPF2IJS2mGn00zIFTsQ1YDD8wLa1zlaJzz8lhZW ZfHM+iJK18+54mDJM/EvRzSC890t+X3WKH0iZ6g02NgHaNiubiePzmGwDS7WStDD7W9P RXkC7yEqblphdMMfSNiicr/w7H5u3zwYZySAtLK7WI8qJwnSc7ICx8oKnDvt98IjoSuX Usow== X-Gm-Message-State: AOAM530Pvc0J+gfI2dc0Kjp595kv/PKWAMaiAMD9W4/OV1F7s00evltx bs0zjjKlI9r2zbBw/F/1pplR497py+VzSyRmfY8= X-Google-Smtp-Source: ABdhPJxIVhyrEncQA4N+BxE/EcbphWTLCVdg8l+dnzAMYFpbjwJCSZIecEo+H3NqYf5cWF8WAx6xfvYpcmjJ6Qfe3rM= X-Received: by 2002:a5d:60c2:: with SMTP id x2mr4035546wrt.254.1624534937959; Thu, 24 Jun 2021 04:42:17 -0700 (PDT) MIME-Version: 1.0 References: <34C4F25A-6333-4C08-BBFF-8E86A5A9B764@dell.com> In-Reply-To: Reply-To: cassio.neri@gmail.com From: Cassio Neri Date: Thu, 24 Jun 2021 12:42:06 +0100 Message-ID: Subject: Re: [PATCH] libstdc++: More efficient std::chrono::year::leap. To: Jonathan Wakely Cc: "Koning, Paul" , "libstdc++" , Gcc-patches X-Spam-Status: No, score=-10.2 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, HTML_MESSAGE, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org Content-Type: text/plain; charset="UTF-8" X-Content-Filtered-By: Mailman/MimeDel 2.1.29 X-BeenThere: libstdc++@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libstdc++ mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Jun 2021 11:42:21 -0000 Thanks Jonathan. Here are some benchmarks (assembly in [1]): https://quick-bench.com/q/jclBXmi4QLDcRMLuuVpxTUsFmQw Unfortunately, quick-bench times out unless some implementations are commented out. You can copy the code and run it locally (needs google benchmark) to get the full picture. I really like Ulrich Drepper's implementation (version 8). GCC 11 emmits really good code and looking at previous versions this has been very consistent. This implementation also highlights very clearly that my hack to calculate __is_multiple_of_100 (as opposed to % 100 used in version 7) saves a ror instruction with remarkable performance effects. In my AMD Ryzen 7 box, his implementation doesn't seem to be the fastest. Nevertheless, compared to other fast alternatives, performance differences are small and results seem very much platform dependent. In my opinion, his implementation is the best, especially, given recent compiler changes. My reasoning follows. My original motivation was contrasting 'year % 400 == 0' with 'year % 16' as in versions 1 and 2: __is_multiple_of_100 ? year % 400 == 0 : year % 4 == 0; // version 1 __is_multiple_of_100 ? year % 16 == 0 : year % 4 == 0; // version 2 It's fair to expect that version 2 won't be slower than version 1. However, this is the case and by quite a big margin. The emitted instructions are very reasonable and, I guess, the issue is the choice of registers. I've reimplemented half of version 2 in inline asm using similar register choices as version 1 and got the performance boost that I was expecting. (By no means I'm suggesting a non portable implementation here. It was just an exercise to make my point. Also, it performs very poorly when compiled by clang.) The point here is that code emitted for sources like versions 1 and 2 (involving %) seems sensitive to very small variations and has been changing across compiler releases in the last 3 years or so. (This can be checked on godbolt.) Clang also seems very confused [2]. Even if the emitted code for version 2 and others were good today, I fear a new compiler release could regress. The stability of the code emitted for Ulrich Drepper's implementation is much more reliable, its performance is very good and its clarity is only compromised by my own hack for __is_multiple_of_100. (Recall, however, that this hack is crucial for his implementation's performance.) Best wishes, Cassio. [1] https://godbolt.org/z/TbGYEqx33 [2] https://bugs.llvm.org/show_bug.cgi?id=50334 On Wed, Jun 23, 2021 at 6:52 PM Jonathan Wakely wrote: > On 23/06/21 18:51 +0100, Jonathan Wakely wrote: > >Here's what I've committed. Tested x86_64-linux and powerpc64le-linux. > >Pushed to trunk. > > > > > > > > >commit b92d12d3fe3f1aa56d190d960e40c62869a6cfbb > >Author: Cassio Neri > >Date: Wed Jun 23 15:32:16 2021 > > > > libstdc++: More efficient std::chrono::year::leap > > > > Simple change to std::chrono::year::is_leap. If a year is multiple of > 100, > > then it's divisible by 400 if and only if it's divisible by 16. The > latter > > allows for better code generation. > > > > The expression is then either y%16 or y%4 which are both powers of two > > and so it can be rearranged to use simple bitmask operations. > > > > Co-authored-by: Jonathan Wakely > > Co-authored-by: Ulrich Drepper > > > > libstdc++-v3/ChangeLog: > > > > * include/std/chrono (chrono::year::is_leap()): Optimize. > > > >diff --git a/libstdc++-v3/include/std/chrono > b/libstdc++-v3/include/std/chrono > >index 4631a727d73..863b6a27bdf 100644 > >--- a/libstdc++-v3/include/std/chrono > >+++ b/libstdc++-v3/include/std/chrono > >@@ -1606,13 +1606,18 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > > // [1] https://github.com/cassioneri/calendar > > // [2] > https://accu.org/journals/overload/28/155/overload155.pdf#page=16 > > > >+ // Furthermore, if y%100 != 0, then y%400==0 is equivalent to > y%16==0, > >+ // so we can rearrange the expression to (mult_100 ? y % 4 : y % > 16)==0 > > But Ulrich pointed out I got my boolean logic all muddled up in the > comment. Fixed with the attached patch! > >