public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [RFC/RFA][PATCH 00/12] CRC optimization
@ 2024-05-24  8:41 Mariam Arutunian
  2024-05-25  1:53 ` Jeff Law
  2024-05-26 18:23 ` NightStrike
  0 siblings, 2 replies; 4+ messages in thread
From: Mariam Arutunian @ 2024-05-24  8:41 UTC (permalink / raw)
  To: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 2068 bytes --]

Hello!
This patch set detects bitwise CRC implementation loops (with branches) in
the GIMPLE optimizers and replaces them with more optimal CRC
implementations in RTL. These patches introduce new internal functions,
built-in functions, and expanders for CRC generation, leveraging hardware
instructions where available. Additionally, various tests are included to
check CRC detection and generation. Main Features:

   1.

   CRC Loop Detection and Replacement:
   - Detection of CRC loops involves two stages: fast checks to identify
      potential candidates and verification using symbolic execution. The
      algorithm detects only CRCs (8, 16, 32, and 64 bits, both bit-forward and
      bit-reversed) with constant polynomials used without the leading 1. This
      part can be improved to detect more implementation types.
      - Once identified, the CRC loops are replaced with calls to newly
      added internal functions. These internal functions use target-specific
      expanders if available, otherwise generating table-based CRCs.
   2.

   Architecture-Specific Expanders:
   - Expanders are added for RISC-V, aarch64, and i386 architectures.
      - These expanders generate CRCs using either carry-less
      multiplication instructions or direct CRC instructions, based on
the target
      architecture's capabilities.
   3.

   New Internal and Built-In Functions:
   - Introduces internal functions and built-in functions for CRC
      generation, supporting various CRC and data sizes (8, 16, 32,
and 64 bits).

I presented this work during the GNU Tools Cauldron 2023. You can view the
presentation here: GCC CRC optimization presentation
<https://gcc.gnu.org/wiki/cauldron2023talks?action=AttachFile&do=view&target=GCC+CRC+optimization.pdf>
.

Previously, I submitted a patch to GCC upstream that included built-in
parts and expanders for RISC-V. However, the main component of the
previously sent patch has been changed. You can find the patch here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626279.html


Best regards,
Mariam

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC/RFA][PATCH 00/12] CRC optimization
  2024-05-24  8:41 [RFC/RFA][PATCH 00/12] CRC optimization Mariam Arutunian
@ 2024-05-25  1:53 ` Jeff Law
  2024-05-26 18:23 ` NightStrike
  1 sibling, 0 replies; 4+ messages in thread
From: Jeff Law @ 2024-05-25  1:53 UTC (permalink / raw)
  To: Mariam Arutunian, GCC Patches



On 5/24/24 2:41 AM, Mariam Arutunian wrote:
> Hello!
> 
> This patch set detects bitwise CRC implementation loops (with branches) 
> in the GIMPLE optimizers and replaces them with more optimal CRC 
> implementations in RTL. These patches introduce new internal functions, 
> built-in functions, and expanders for CRC generation, leveraging 
> hardware instructions where available. Additionally, various tests are 
> included to check CRC detection and generation.
Thanks so much for getting this process started.  It's a bit quicker 
than I was ready, but no worries.



>  2.
> 
>     Architecture-Specific Expanders:
> 
>       * Expanders are added for RISC-V, aarch64, and i386 architectures.
>       * These expanders generate CRCs using either carry-less
>         multiplication instructions or direct CRC instructions, based on
>         the target architecture's capabilities.
Also note for the wider audience, this work can also generate table 
lookup based CRC implementations.  This has proven exceedingly helpful 
during the testing phase as we were able to run this code on a wide 
variety of the embedded targets to shake out target dependencies.

On Ventana's V1 design the clmul variant was a small, but clear winner 
over the table lookup.  Obviously the bitwise implementation found in 
coremark was the worst performing.

On our V2 design clmul outperforms the table lookup by a wide margin, 
largely due to reduced latency of clmul.


Jeff

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC/RFA][PATCH 00/12] CRC optimization
  2024-05-24  8:41 [RFC/RFA][PATCH 00/12] CRC optimization Mariam Arutunian
  2024-05-25  1:53 ` Jeff Law
@ 2024-05-26 18:23 ` NightStrike
  2024-05-27  9:56   ` Mariam Arutunian
  1 sibling, 1 reply; 4+ messages in thread
From: NightStrike @ 2024-05-26 18:23 UTC (permalink / raw)
  To: Mariam Arutunian; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 3067 bytes --]

On Fri, May 24, 2024, 04:42 Mariam Arutunian <mariamarutunian@gmail.com>
wrote:

> Hello!
> This patch set detects bitwise CRC implementation loops (with branches) in
> the GIMPLE optimizers and replaces them with more optimal CRC
> implementations in RTL. These patches introduce new internal functions,
> built-in functions, and expanders for CRC generation, leveraging hardware
> instructions where available. Additionally, various tests are included to
> check CRC detection and generation. Main Features:
>
>    1.
>
>    CRC Loop Detection and Replacement:
>    - Detection of CRC loops involves two stages: fast checks to identify
>       potential candidates and verification using symbolic execution. The
>       algorithm detects only CRCs (8, 16, 32, and 64 bits, both bit-forward and
>       bit-reversed) with constant polynomials used without the leading 1. This
>       part can be improved to detect more implementation types.
>       - Once identified, the CRC loops are replaced with calls to newly
>       added internal functions. These internal functions use target-specific
>       expanders if available, otherwise generating table-based CRCs.
>    2.
>
>    Architecture-Specific Expanders:
>    - Expanders are added for RISC-V, aarch64, and i386 architectures.
>       - These expanders generate CRCs using either carry-less
>       multiplication instructions or direct CRC instructions, based on the target
>       architecture's capabilities.
>    3.
>
>    New Internal and Built-In Functions:
>    - Introduces internal functions and built-in functions for CRC
>       generation, supporting various CRC and data sizes (8, 16, 32, and 64 bits).
>
> I presented this work during the GNU Tools Cauldron 2023. You can view the
> presentation here: GCC CRC optimization presentation
> <https://gcc.gnu.org/wiki/cauldron2023talks?action=AttachFile&do=view&target=GCC+CRC+optimization.pdf>
> .
>
> Previously, I submitted a patch to GCC upstream that included built-in
> parts and expanders for RISC-V. However, the main component of the
> previously sent patch has been changed. You can find the patch here:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626279.html
>
>
> Best regards,
> Mariam
>

Could this detect a specific CRC32C calculation as well (vs any urge CRC)
and replace it with optimized calls to the dedicated crc32c hardware
instructions via __builtin_ia32_crc32*i? I'm asking because ironically I
was just a few days ago trying to see how current GCC optimizes simpler but
slower approaches compared to hand tuning. For example,
https://github.com/htot/crc32c. Almost every implementation I found was
ultimately based on Mark Adler's (of zlib and that fun Mars mission fame)
implementation posted to stack overflow here:
https://stackoverflow.com/a/17646775

Because so many current libraries are based on that, it would be nice to
see if GCC can improve upon it with your patch. Or if they are no longer
necessary, because your patch optimizes the simpler software approach to
basically yield Mark's solution.

>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC/RFA][PATCH 00/12] CRC optimization
  2024-05-26 18:23 ` NightStrike
@ 2024-05-27  9:56   ` Mariam Arutunian
  0 siblings, 0 replies; 4+ messages in thread
From: Mariam Arutunian @ 2024-05-27  9:56 UTC (permalink / raw)
  To: NightStrike; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 4633 bytes --]

Yes, this pass can detect only bitwise (naive) implementations of CRC32C
and replace them with hardware crc32* instructions (if supported), or
alternatively
generate CRC using CLMUL-based or table-based methods.
Here is an example of such a CRC32C implementation, where the entire loop
will be replaced with the crc32b instruction:
uint32_t crc32c(const uint8_t *data, size_t length) { uint32_t crc =
0xFFFFFFFF; for (size_t i = 0; i < length; ++i) { crc ^= data[i]; for (int
j = 0; j < 8; ++j) { crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78 : 0); } }
return ~crc; }
I have added support for this optimization, as detailed in this patch:
https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652615.html . The patch
description mentions a polynomial of 0x1EDC6F41, but CRC32C actually uses
the reflected version of this polynomial: 0x82F63B78. It might be clearer
to use 0x82F63B78 in the description.
Unfortunately, the example from the link you provided uses a table-based
version of CRC32C, which we don't detect. Currently, we only optimize
bitwise implementations, which are slower than all other implementations. If
necessary, I can add support for detecting table-based CRC implementations.
This would require a completely different algorithm than the one used for
bitwise implementations.
Thanks,
Mariam

On Sun, May 26, 2024 at 10:23 PM NightStrike <nightstrike@gmail.com> wrote:

>
>
> On Fri, May 24, 2024, 04:42 Mariam Arutunian <mariamarutunian@gmail.com>
> wrote:
>
>> Hello!
>> This patch set detects bitwise CRC implementation loops (with branches)
>> in the GIMPLE optimizers and replaces them with more optimal CRC
>> implementations in RTL. These patches introduce new internal functions,
>> built-in functions, and expanders for CRC generation, leveraging hardware
>> instructions where available. Additionally, various tests are included
>> to check CRC detection and generation. Main Features:
>>
>>    1.
>>
>>    CRC Loop Detection and Replacement:
>>    - Detection of CRC loops involves two stages: fast checks to identify
>>       potential candidates and verification using symbolic execution. The
>>       algorithm detects only CRCs (8, 16, 32, and 64 bits, both bit-forward and
>>       bit-reversed) with constant polynomials used without the leading 1. This
>>       part can be improved to detect more implementation types.
>>       - Once identified, the CRC loops are replaced with calls to newly
>>       added internal functions. These internal functions use target-specific
>>       expanders if available, otherwise generating table-based CRCs.
>>    2.
>>
>>    Architecture-Specific Expanders:
>>    - Expanders are added for RISC-V, aarch64, and i386 architectures.
>>       - These expanders generate CRCs using either carry-less
>>       multiplication instructions or direct CRC instructions, based on the target
>>       architecture's capabilities.
>>    3.
>>
>>    New Internal and Built-In Functions:
>>    - Introduces internal functions and built-in functions for CRC
>>       generation, supporting various CRC and data sizes (8, 16, 32, and 64 bits).
>>
>> I presented this work during the GNU Tools Cauldron 2023. You can view
>> the presentation here: GCC CRC optimization presentation
>> <https://gcc.gnu.org/wiki/cauldron2023talks?action=AttachFile&do=view&target=GCC+CRC+optimization.pdf>
>> .
>>
>> Previously, I submitted a patch to GCC upstream that included built-in
>> parts and expanders for RISC-V. However, the main component of the
>> previously sent patch has been changed. You can find the patch here:
>> https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626279.html
>>
>>
>> Best regards,
>> Mariam
>>
>
> Could this detect a specific CRC32C calculation as well (vs any urge CRC)
> and replace it with optimized calls to the dedicated crc32c hardware
> instructions via __builtin_ia32_crc32*i? I'm asking because ironically I
> was just a few days ago trying to see how current GCC optimizes simpler but
> slower approaches compared to hand tuning. For example,
> https://github.com/htot/crc32c. Almost every implementation I found was
> ultimately based on Mark Adler's (of zlib and that fun Mars mission fame)
> implementation posted to stack overflow here:
> https://stackoverflow.com/a/17646775
>
> Because so many current libraries are based on that, it would be nice to
> see if GCC can improve upon it with your patch. Or if they are no longer
> necessary, because your patch optimizes the simpler software approach to
> basically yield Mark's solution.
>
>>

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-05-27  9:56 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-24  8:41 [RFC/RFA][PATCH 00/12] CRC optimization Mariam Arutunian
2024-05-25  1:53 ` Jeff Law
2024-05-26 18:23 ` NightStrike
2024-05-27  9:56   ` Mariam Arutunian

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).