From: "rsaxvc at gmail dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/106484] Failure to optimize uint64_t/constant division on ARM32
Date: Thu, 04 May 2023 12:55:25 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484

--- Comment #4 from rsaxvc at gmail dot com ---
Benchmarking shows the speedup to be highly variable, depending on the CPU core and the __aeabi_uldivmod() implementation, and somewhat on the numerator. The best __aeabi_uldivmod() implementations I've seen do use 32-bit division instructions when available, and in that case the umulh()-based approach is only 2-3x faster.

When umull (32x32 multiply with a 64-bit result) is available and udiv is either not available or not used by libc, the umulh()-based approach proposed here completes 28-38x faster on Cortex-M4, measured via GPIO and oscilloscope. The wide variation in relative speed is due to the variable execution time of __aeabi_uldivmod(). Results are similar on ARM11.

There's a partial list of contemporary cores that have udiv here:
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/divide-and-conquer

It does look like things are headed towards more cores having udiv available.
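
For illustration, the umulh()-based approach can be sketched in C roughly as follows. This is a minimal sketch and not the code that was benchmarked: the divisor 10, its reciprocal constant 0xCCCCCCCCCCCCCCCD with a post-shift of 3, and the helper name udiv10() are assumptions chosen for the example, and the body of umulh() is just one common way to write it. The high half of the 64x64 product is built from four 32x32->64 multiplies, which is the shape a compiler can map onto umull on ARM32.

```c
#include <stdint.h>

/* High 64 bits of the 128-bit product a * b, using only 32x32->64
 * multiplies (each one is a candidate for a single umull on ARM32). */
static uint64_t umulh(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t lo_lo = (uint64_t)a_lo * b_lo;
    uint64_t hi_lo = (uint64_t)a_hi * b_lo;
    uint64_t lo_hi = (uint64_t)a_lo * b_hi;
    uint64_t hi_hi = (uint64_t)a_hi * b_hi;

    /* Sum the middle column; the operands are bounded so that this
     * addition cannot overflow 64 bits. */
    uint64_t cross = (lo_lo >> 32) + (uint32_t)hi_lo + lo_hi;

    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

/* Hypothetical example: n / 10 without calling __aeabi_uldivmod().
 * 0xCCCCCCCCCCCCCCCD is ceil(2^67 / 10), so the quotient is the high
 * half of the product shifted right by 3. */
static uint64_t udiv10(uint64_t n)
{
    return umulh(n, 0xCCCCCCCCCCCCCCCDull) >> 3;
}
```

Other divisors need their own reciprocal constant and shift, and some need an extra add-and-shift fixup step, so the constant above should not be reused for anything other than 10.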