Yes, this pass can detect only bitwise (naive) implementations of CRC32C and replace them with hardware crc32* instructions (if supported), or alternatively generate CRC using CLMUL-based or table-based methods. Here is an example of such a CRC32C implementation, where the entire loop will be replaced with the crc32b instruction: uint32_t crc32c(const uint8_t *data, size_t length) { uint32_t crc = 0xFFFFFFFF; for (size_t i = 0; i < length; ++i) { crc ^= data[i]; for (int j = 0; j < 8; ++j) { crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78 : 0); } } return ~crc; } I have added support for this optimization, as detailed in this patch: https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652615.html . The patch description mentions a polynomial of 0x1EDC6F41, but CRC32C actually uses the reflected version of this polynomial: 0x82F63B78. It might be clearer to use 0x82F63B78 in the description. Unfortunately, the example from the link you provided uses a table-based version of CRC32C, which we don't detect. Currently, we only optimize bitwise implementations, which are slower than all other implementations. If necessary, I can add support for detecting table-based CRC implementations. This would require a completely different algorithm than the one used for bitwise implementations. Thanks, Mariam On Sun, May 26, 2024 at 10:23 PM NightStrike wrote: > > > On Fri, May 24, 2024, 04:42 Mariam Arutunian > wrote: > >> Hello! >> This patch set detects bitwise CRC implementation loops (with branches) >> in the GIMPLE optimizers and replaces them with more optimal CRC >> implementations in RTL. These patches introduce new internal functions, >> built-in functions, and expanders for CRC generation, leveraging hardware >> instructions where available. Additionally, various tests are included >> to check CRC detection and generation. Main Features: >> >> 1. >> >> CRC Loop Detection and Replacement: >> - Detection of CRC loops involves two stages: fast checks to identify >> potential candidates and verification using symbolic execution. The >> algorithm detects only CRCs (8, 16, 32, and 64 bits, both bit-forward and >> bit-reversed) with constant polynomials used without the leading 1. This >> part can be improved to detect more implementation types. >> - Once identified, the CRC loops are replaced with calls to newly >> added internal functions. These internal functions use target-specific >> expanders if available, otherwise generating table-based CRCs. >> 2. >> >> Architecture-Specific Expanders: >> - Expanders are added for RISC-V, aarch64, and i386 architectures. >> - These expanders generate CRCs using either carry-less >> multiplication instructions or direct CRC instructions, based on the target >> architecture's capabilities. >> 3. >> >> New Internal and Built-In Functions: >> - Introduces internal functions and built-in functions for CRC >> generation, supporting various CRC and data sizes (8, 16, 32, and 64 bits). >> >> I presented this work during the GNU Tools Cauldron 2023. You can view >> the presentation here: GCC CRC optimization presentation >> >> . >> >> Previously, I submitted a patch to GCC upstream that included built-in >> parts and expanders for RISC-V. However, the main component of the >> previously sent patch has been changed. You can find the patch here: >> https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626279.html >> >> >> Best regards, >> Mariam >> > > Could this detect a specific CRC32C calculation as well (vs any urge CRC) > and replace it with optimized calls to the dedicated crc32c hardware > instructions via __builtin_ia32_crc32*i? I'm asking because ironically I > was just a few days ago trying to see how current GCC optimizes simpler but > slower approaches compared to hand tuning. For example, > https://github.com/htot/crc32c. Almost every implementation I found was > ultimately based on Mark Adler's (of zlib and that fun Mars mission fame) > implementation posted to stack overflow here: > https://stackoverflow.com/a/17646775 > > Because so many current libraries are based on that, it would be nice to > see if GCC can improve upon it with your patch. Or if they are no longer > necessary, because your patch optimizes the simpler software approach to > basically yield Mark's solution. > >>