* [RFC/RFA][PATCH 00/12] CRC optimization @ 2024-05-24 8:41 Mariam Arutunian 2024-05-25 1:53 ` Jeff Law 2024-05-26 18:23 ` NightStrike 0 siblings, 2 replies; 4+ messages in thread From: Mariam Arutunian @ 2024-05-24 8:41 UTC (permalink / raw) To: GCC Patches [-- Attachment #1: Type: text/plain, Size: 2068 bytes --] Hello! This patch set detects bitwise CRC implementation loops (with branches) in the GIMPLE optimizers and replaces them with more optimal CRC implementations in RTL. These patches introduce new internal functions, built-in functions, and expanders for CRC generation, leveraging hardware instructions where available. Additionally, various tests are included to check CRC detection and generation. Main Features: 1. CRC Loop Detection and Replacement: - Detection of CRC loops involves two stages: fast checks to identify potential candidates and verification using symbolic execution. The algorithm detects only CRCs (8, 16, 32, and 64 bits, both bit-forward and bit-reversed) with constant polynomials used without the leading 1. This part can be improved to detect more implementation types. - Once identified, the CRC loops are replaced with calls to newly added internal functions. These internal functions use target-specific expanders if available, otherwise generating table-based CRCs. 2. Architecture-Specific Expanders: - Expanders are added for RISC-V, aarch64, and i386 architectures. - These expanders generate CRCs using either carry-less multiplication instructions or direct CRC instructions, based on the target architecture's capabilities. 3. New Internal and Built-In Functions: - Introduces internal functions and built-in functions for CRC generation, supporting various CRC and data sizes (8, 16, 32, and 64 bits). I presented this work during the GNU Tools Cauldron 2023. You can view the presentation here: GCC CRC optimization presentation <https://gcc.gnu.org/wiki/cauldron2023talks?action=AttachFile&do=view&target=GCC+CRC+optimization.pdf> . Previously, I submitted a patch to GCC upstream that included built-in parts and expanders for RISC-V. However, the main component of the previously sent patch has been changed. You can find the patch here: https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626279.html Best regards, Mariam ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [RFC/RFA][PATCH 00/12] CRC optimization 2024-05-24 8:41 [RFC/RFA][PATCH 00/12] CRC optimization Mariam Arutunian @ 2024-05-25 1:53 ` Jeff Law 2024-05-26 18:23 ` NightStrike 1 sibling, 0 replies; 4+ messages in thread From: Jeff Law @ 2024-05-25 1:53 UTC (permalink / raw) To: Mariam Arutunian, GCC Patches On 5/24/24 2:41 AM, Mariam Arutunian wrote: > Hello! > > This patch set detects bitwise CRC implementation loops (with branches) > in the GIMPLE optimizers and replaces them with more optimal CRC > implementations in RTL. These patches introduce new internal functions, > built-in functions, and expanders for CRC generation, leveraging > hardware instructions where available. Additionally, various tests are > included to check CRC detection and generation. Thanks so much for getting this process started. It's a bit quicker than I was ready, but no worries. > 2. > > Architecture-Specific Expanders: > > * Expanders are added for RISC-V, aarch64, and i386 architectures. > * These expanders generate CRCs using either carry-less > multiplication instructions or direct CRC instructions, based on > the target architecture's capabilities. Also note for the wider audience, this work can also generate table lookup based CRC implementations. This has proven exceedingly helpful during the testing phase as we were able to run this code on a wide variety of the embedded targets to shake out target dependencies. On Ventana's V1 design the clmul variant was a small, but clear winner over the table lookup. Obviously the bitwise implementation found in coremark was the worst performing. On our V2 design clmul outperforms the table lookup by a wide margin, largely due to reduced latency of clmul. Jeff ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [RFC/RFA][PATCH 00/12] CRC optimization 2024-05-24 8:41 [RFC/RFA][PATCH 00/12] CRC optimization Mariam Arutunian 2024-05-25 1:53 ` Jeff Law @ 2024-05-26 18:23 ` NightStrike 2024-05-27 9:56 ` Mariam Arutunian 1 sibling, 1 reply; 4+ messages in thread From: NightStrike @ 2024-05-26 18:23 UTC (permalink / raw) To: Mariam Arutunian; +Cc: GCC Patches [-- Attachment #1: Type: text/plain, Size: 3067 bytes --] On Fri, May 24, 2024, 04:42 Mariam Arutunian <mariamarutunian@gmail.com> wrote: > Hello! > This patch set detects bitwise CRC implementation loops (with branches) in > the GIMPLE optimizers and replaces them with more optimal CRC > implementations in RTL. These patches introduce new internal functions, > built-in functions, and expanders for CRC generation, leveraging hardware > instructions where available. Additionally, various tests are included to > check CRC detection and generation. Main Features: > > 1. > > CRC Loop Detection and Replacement: > - Detection of CRC loops involves two stages: fast checks to identify > potential candidates and verification using symbolic execution. The > algorithm detects only CRCs (8, 16, 32, and 64 bits, both bit-forward and > bit-reversed) with constant polynomials used without the leading 1. This > part can be improved to detect more implementation types. > - Once identified, the CRC loops are replaced with calls to newly > added internal functions. These internal functions use target-specific > expanders if available, otherwise generating table-based CRCs. > 2. > > Architecture-Specific Expanders: > - Expanders are added for RISC-V, aarch64, and i386 architectures. > - These expanders generate CRCs using either carry-less > multiplication instructions or direct CRC instructions, based on the target > architecture's capabilities. > 3. > > New Internal and Built-In Functions: > - Introduces internal functions and built-in functions for CRC > generation, supporting various CRC and data sizes (8, 16, 32, and 64 bits). > > I presented this work during the GNU Tools Cauldron 2023. You can view the > presentation here: GCC CRC optimization presentation > <https://gcc.gnu.org/wiki/cauldron2023talks?action=AttachFile&do=view&target=GCC+CRC+optimization.pdf> > . > > Previously, I submitted a patch to GCC upstream that included built-in > parts and expanders for RISC-V. However, the main component of the > previously sent patch has been changed. You can find the patch here: > https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626279.html > > > Best regards, > Mariam > Could this detect a specific CRC32C calculation as well (vs any urge CRC) and replace it with optimized calls to the dedicated crc32c hardware instructions via __builtin_ia32_crc32*i? I'm asking because ironically I was just a few days ago trying to see how current GCC optimizes simpler but slower approaches compared to hand tuning. For example, https://github.com/htot/crc32c. Almost every implementation I found was ultimately based on Mark Adler's (of zlib and that fun Mars mission fame) implementation posted to stack overflow here: https://stackoverflow.com/a/17646775 Because so many current libraries are based on that, it would be nice to see if GCC can improve upon it with your patch. Or if they are no longer necessary, because your patch optimizes the simpler software approach to basically yield Mark's solution. > ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [RFC/RFA][PATCH 00/12] CRC optimization 2024-05-26 18:23 ` NightStrike @ 2024-05-27 9:56 ` Mariam Arutunian 0 siblings, 0 replies; 4+ messages in thread From: Mariam Arutunian @ 2024-05-27 9:56 UTC (permalink / raw) To: NightStrike; +Cc: GCC Patches [-- Attachment #1: Type: text/plain, Size: 4633 bytes --] Yes, this pass can detect only bitwise (naive) implementations of CRC32C and replace them with hardware crc32* instructions (if supported), or alternatively generate CRC using CLMUL-based or table-based methods. Here is an example of such a CRC32C implementation, where the entire loop will be replaced with the crc32b instruction: uint32_t crc32c(const uint8_t *data, size_t length) { uint32_t crc = 0xFFFFFFFF; for (size_t i = 0; i < length; ++i) { crc ^= data[i]; for (int j = 0; j < 8; ++j) { crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78 : 0); } } return ~crc; } I have added support for this optimization, as detailed in this patch: https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652615.html . The patch description mentions a polynomial of 0x1EDC6F41, but CRC32C actually uses the reflected version of this polynomial: 0x82F63B78. It might be clearer to use 0x82F63B78 in the description. Unfortunately, the example from the link you provided uses a table-based version of CRC32C, which we don't detect. Currently, we only optimize bitwise implementations, which are slower than all other implementations. If necessary, I can add support for detecting table-based CRC implementations. This would require a completely different algorithm than the one used for bitwise implementations. Thanks, Mariam On Sun, May 26, 2024 at 10:23 PM NightStrike <nightstrike@gmail.com> wrote: > > > On Fri, May 24, 2024, 04:42 Mariam Arutunian <mariamarutunian@gmail.com> > wrote: > >> Hello! >> This patch set detects bitwise CRC implementation loops (with branches) >> in the GIMPLE optimizers and replaces them with more optimal CRC >> implementations in RTL. These patches introduce new internal functions, >> built-in functions, and expanders for CRC generation, leveraging hardware >> instructions where available. Additionally, various tests are included >> to check CRC detection and generation. Main Features: >> >> 1. >> >> CRC Loop Detection and Replacement: >> - Detection of CRC loops involves two stages: fast checks to identify >> potential candidates and verification using symbolic execution. The >> algorithm detects only CRCs (8, 16, 32, and 64 bits, both bit-forward and >> bit-reversed) with constant polynomials used without the leading 1. This >> part can be improved to detect more implementation types. >> - Once identified, the CRC loops are replaced with calls to newly >> added internal functions. These internal functions use target-specific >> expanders if available, otherwise generating table-based CRCs. >> 2. >> >> Architecture-Specific Expanders: >> - Expanders are added for RISC-V, aarch64, and i386 architectures. >> - These expanders generate CRCs using either carry-less >> multiplication instructions or direct CRC instructions, based on the target >> architecture's capabilities. >> 3. >> >> New Internal and Built-In Functions: >> - Introduces internal functions and built-in functions for CRC >> generation, supporting various CRC and data sizes (8, 16, 32, and 64 bits). >> >> I presented this work during the GNU Tools Cauldron 2023. You can view >> the presentation here: GCC CRC optimization presentation >> <https://gcc.gnu.org/wiki/cauldron2023talks?action=AttachFile&do=view&target=GCC+CRC+optimization.pdf> >> . >> >> Previously, I submitted a patch to GCC upstream that included built-in >> parts and expanders for RISC-V. However, the main component of the >> previously sent patch has been changed. You can find the patch here: >> https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626279.html >> >> >> Best regards, >> Mariam >> > > Could this detect a specific CRC32C calculation as well (vs any urge CRC) > and replace it with optimized calls to the dedicated crc32c hardware > instructions via __builtin_ia32_crc32*i? I'm asking because ironically I > was just a few days ago trying to see how current GCC optimizes simpler but > slower approaches compared to hand tuning. For example, > https://github.com/htot/crc32c. Almost every implementation I found was > ultimately based on Mark Adler's (of zlib and that fun Mars mission fame) > implementation posted to stack overflow here: > https://stackoverflow.com/a/17646775 > > Because so many current libraries are based on that, it would be nice to > see if GCC can improve upon it with your patch. Or if they are no longer > necessary, because your patch optimizes the simpler software approach to > basically yield Mark's solution. > >> ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2024-05-27 9:56 UTC | newest] Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2024-05-24 8:41 [RFC/RFA][PATCH 00/12] CRC optimization Mariam Arutunian 2024-05-25 1:53 ` Jeff Law 2024-05-26 18:23 ` NightStrike 2024-05-27 9:56 ` Mariam Arutunian
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).