Yes, this pass can detect only bitwise (naive) implementations of CRC32C
and replace them with hardware crc32* instructions (if supported), or
alternatively
generate CRC using CLMUL-based or table-based methods.
Here is an example of such a CRC32C implementation, where the entire loop
will be replaced with the crc32b instruction:
uint32_t crc32c(const uint8_t *data, size_t length) { uint32_t crc =
0xFFFFFFFF; for (size_t i = 0; i < length; ++i) { crc ^= data[i]; for (int
j = 0; j < 8; ++j) { crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78 : 0); } }
return ~crc; }
I have added support for this optimization, as detailed in this patch:
https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652615.html . The patch
description mentions a polynomial of 0x1EDC6F41, but CRC32C actually uses
the reflected version of this polynomial: 0x82F63B78. It might be clearer
to use 0x82F63B78 in the description.
Unfortunately, the example from the link you provided uses a table-based
version of CRC32C, which we don't detect. Currently, we only optimize
bitwise implementations, which are slower than all other implementations. If
necessary, I can add support for detecting table-based CRC implementations.
This would require a completely different algorithm than the one used for
bitwise implementations.
Thanks,
Mariam

On Sun, May 26, 2024 at 10:23 PM NightStrike <nightstrike@gmail.com> wrote:

>
>
> On Fri, May 24, 2024, 04:42 Mariam Arutunian <mariamarutunian@gmail.com>
> wrote:
>
>> Hello!
>> This patch set detects bitwise CRC implementation loops (with branches)
>> in the GIMPLE optimizers and replaces them with more optimal CRC
>> implementations in RTL. These patches introduce new internal functions,
>> built-in functions, and expanders for CRC generation, leveraging hardware
>> instructions where available. Additionally, various tests are included
>> to check CRC detection and generation. Main Features:
>>
>>    1.
>>
>>    CRC Loop Detection and Replacement:
>>    - Detection of CRC loops involves two stages: fast checks to identify
>>       potential candidates and verification using symbolic execution. The
>>       algorithm detects only CRCs (8, 16, 32, and 64 bits, both bit-forward and
>>       bit-reversed) with constant polynomials used without the leading 1. This
>>       part can be improved to detect more implementation types.
>>       - Once identified, the CRC loops are replaced with calls to newly
>>       added internal functions. These internal functions use target-specific
>>       expanders if available, otherwise generating table-based CRCs.
>>    2.
>>
>>    Architecture-Specific Expanders:
>>    - Expanders are added for RISC-V, aarch64, and i386 architectures.
>>       - These expanders generate CRCs using either carry-less
>>       multiplication instructions or direct CRC instructions, based on the target
>>       architecture's capabilities.
>>    3.
>>
>>    New Internal and Built-In Functions:
>>    - Introduces internal functions and built-in functions for CRC
>>       generation, supporting various CRC and data sizes (8, 16, 32, and 64 bits).
>>
>> I presented this work during the GNU Tools Cauldron 2023. You can view
>> the presentation here: GCC CRC optimization presentation
>> <https://gcc.gnu.org/wiki/cauldron2023talks?action=AttachFile&do=view&target=GCC+CRC+optimization.pdf>
>> .
>>
>> Previously, I submitted a patch to GCC upstream that included built-in
>> parts and expanders for RISC-V. However, the main component of the
>> previously sent patch has been changed. You can find the patch here:
>> https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626279.html
>>
>>
>> Best regards,
>> Mariam
>>
>
> Could this detect a specific CRC32C calculation as well (vs any urge CRC)
> and replace it with optimized calls to the dedicated crc32c hardware
> instructions via __builtin_ia32_crc32*i? I'm asking because ironically I
> was just a few days ago trying to see how current GCC optimizes simpler but
> slower approaches compared to hand tuning. For example,
> https://github.com/htot/crc32c. Almost every implementation I found was
> ultimately based on Mark Adler's (of zlib and that fun Mars mission fame)
> implementation posted to stack overflow here:
> https://stackoverflow.com/a/17646775
>
> Because so many current libraries are based on that, it would be nice to
> see if GCC can improve upon it with your patch. Or if they are no longer
> necessary, because your patch optimizes the simpler software approach to
> basically yield Mark's solution.
>
>>