From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: by sourceware.org (Postfix, from userid 48) id 0EC993861845; Tue, 20 Oct 2020 18:16:53 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 0EC993861845 From: "tkoenig at gcc dot gnu.org" To: gcc-bugs@gcc.gnu.org Subject: [Bug rtl-optimization/97459] __uint128_t remainder for division by 3 Date: Tue, 20 Oct 2020 18:16:52 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: gcc X-Bugzilla-Component: rtl-optimization X-Bugzilla-Version: 11.0 X-Bugzilla-Keywords: missed-optimization X-Bugzilla-Severity: normal X-Bugzilla-Who: tkoenig at gcc dot gnu.org X-Bugzilla-Status: UNCONFIRMED X-Bugzilla-Resolution: X-Bugzilla-Priority: P3 X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: gcc-bugs@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-bugs mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 20 Oct 2020 18:16:53 -0000 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D97459 --- Comment #9 from Thomas Koenig --- (In reply to Jakub Jelinek from comment #7) > So, can we use this for anything but modulo 3, or 5, or 17, or 257 (all of > those have 2^32 mod N =3D=3D 2^64 mod N =3D=3D 2^128 mod N =3D=3D 1) I think so, too. > probably also > keyed on the target having corresponding uaddv4_optab handler, normal > expansion not being able to handle it and emitting a libcall? Again, yes. This can also be used as a building block for handling division and remainder base 10. Here's a benchmark for this (it uses the sum of digits base 10 instead). qsum1 uses the standard method, which you can find (for example) in libgfortran. div_rem5_v2 first calculates the remainder of the division by 5 using this method, then does an exact division by multiplying with its modular inverse for 2^128. div_rem10_v2 then uses div_rem5_v2 to calculate the value and remainder of the division by 10, and qsum_v2 uses that to calculate the sum of digits. The timings are about a factor of 2 faster than the straightforward libcall version: s =3D 360398898 qsum_v1: 1.091621 s s =3D 360398898 qsum_v2: 0.485509 s #include #include #include #include #include #include #include #define ONE ((__uint128_t) 1) #define TWO_64 (ONE << 64) typedef __uint128_t mytype; double this_time () { struct timeval tv; gettimeofday (&tv, NULL); return tv.tv_sec + tv.tv_usec * 1e-6; } unsigned qsum_v1 (mytype n) { unsigned ret; ret =3D 0; while (n > 0) { ret +=3D n % 10; n =3D n / 10; } return ret; } static void inline __attribute__((always_inline)) div_rem_5_v2 (mytype n, mytype *div, unsigned *rem) { unsigned long a, b, c; /* The modular inverse to 5 modulo 2^128 */ const mytype magic =3D (0xCCCCCCCCCCCCCCCC * TWO_64 + 0xCCCCCCCCCCCCCCCD * ONE); b =3D n >> 64; c =3D n; if (__builtin_add_overflow (b, c, &a)) a++; *rem =3D a % 5; *div =3D (n-*rem) * magic; } static void inline __attribute__((always_inline)) div_rem_10_v2 (mytype n, mytype *div, unsigned *rem) { mytype n5; unsigned rem5; div_rem_5_v2 (n, &n5, &rem5); *rem =3D rem5 + (n5 % 2) * 5; *div =3D n5/2; } unsigned qsum_v2 (mytype n) { unsigned ret; unsigned rem; mytype n_new; ret =3D 0; while (n > 0) { div_rem_10_v2 (n, &n_new, &rem); ret +=3D rem; n =3D n_new; } return ret; } #define N 10000000 int main() { mytype *a; unsigned long int s; double t1, t2; int fd; long int i; a =3D malloc (sizeof (*a) * N); fd =3D open ("/dev/urandom", O_RDONLY); read (fd, a, sizeof (*a) * N); s =3D 0; t1 =3D this_time(); for (i=3D0; i