* Re: RISC-V: Added support for CRC.
@ 2023-09-23 23:05 Joern Rennecke
2023-09-24 11:41 ` Alexander Monakov
` (3 more replies)
0 siblings, 4 replies; 30+ messages in thread
From: Joern Rennecke @ 2023-09-23 23:05 UTC (permalink / raw)
To: GCC Patches
Cc: Mariam Harutyunyan, Alexander Monakov, Jeff Law, Richard Biener,
Oleg Endo
Mariam Harutyunyan:
+++ b/gcc/ChangeLog
@@ -1,3 +1,45 @@
+2023-08-03 Mariam Arutunian <mariamarutunian@gmail.com>
+
It is common courtesy to include all authors in the list of authors
for the ChangeLog; also,
this may help people in the future understand the history of the code better.
While most of your patch is new, it still contains non-trivial parts of mine
( https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591744.html ).
And stripping out the comment why, currently, we can't use linkonce
for crc tables on the RISC-V target is
not helpful to someone who wants to understand the code.
See also the discussion to put this into loop distribution:
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591821.html
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591866.html
Mariam Harutyunyan:
> It adds internal
> functions and built-ins specifically designed to handle CRC computations
> efficiently.
This sounds like finished work, although defining built-in
functions to calculate the CRC of single data elements, and recognizers
for some C idioms that do these calculations,
are just a starting point.
Alexander Monakov :
> Jeff, as I understand this all is happening only because Coremark contains
> use of bitwise CRC that affects benchmark scores. In another universe where
> - Coremark was careful to checksum outputs outside of timed sections, or
> - implemented CRC in a manner that is not transparent to the compiler, or
> - did not use CRC at all
> we would not be spending effort on this, correct?
It is a stated goal of coremark to test performance for CRC. They do
not use a library call to implement CRC, but a specific bit-banging
algorithm they have chosen. That algorithm is, for the vast majority
of processors, not representative of the target's performance
potential in calculating CRCs. Thus, if a compiler fails to translate
this into the CRC implementation that would be used for performance
code, the compiler frustrates coremark's goal of giving a measure of
CRC calculation performance.
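For concreteness, the kind of bit-banging loop under discussion looks
roughly like this (a hypothetical sketch of a bit-at-a-time
CRC-16/CCITT-FALSE, poly 0x1021, init 0xFFFF; not Coremark's actual
source):

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative bit-at-a-time CRC-16/CCITT-FALSE: one conditional
   XOR with the polynomial per message bit.  This is the idiom a
   compiler would need to recognize to substitute a faster method.  */
uint16_t
crc16_bitwise (const uint8_t *buf, size_t len)
{
  uint16_t crc = 0xFFFF;
  for (size_t i = 0; i < len; i++)
    {
      crc ^= (uint16_t) buf[i] << 8;
      for (int bit = 0; bit < 8; bit++)
        crc = (crc & 0x8000) ? (uint16_t) ((crc << 1) ^ 0x1021)
                             : (uint16_t) (crc << 1);
    }
  return crc;
}
```

With the standard check string "123456789", this yields the
well-known check value 0x29B1 for these CRC parameters.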
> At best we might have
> a discussion on providing a __builtin_clmul for carry-less multiplication
> (which _is_ a fundamental primitive, unlike __builtin_crc), and move on.
Some processors have specialized instructions for CRC computations.
> Instead, efficient CRC loops have the following structure:
> - they carry an unreduced remainder in the loop, performing final reduction
> modulo polynomial only once after the loop — this halves the CLMUL count;
> - they overlap multiple CLMUL chains to make the loop throughput-bound
> rather than latency-bound. The typical unroll factor is about 4x-8x.
If you want to recognize a loop that does a CRC on a block, it makes
sense to start with recognizing the CRC computation for single array
elements first. We have to learn to
walk before we can run.
Note that my initial patch already contained a match.pd stanza to
recognize two chained single-element CRC calculations.
Jeff Law:
> The intention is to provide a useful
> builtin_crc while at the same time putting one side of the
> infrastructure we need for automatic detection of CRC loops and turning
> them into table lookups or CLMULs.
Note that when optimizing for size, for a target with tiny memory, or
when using a non-constant (or constant but undiscoverable by the
compiler) polynomial, we can't use the table lookup. But still, even
if we don't have CLMUL, we can generally speed up CRC computation over
the coremark algorithm by using something more suited to the target,
like the crcu16_1 function I put into comments in my patch.
Alexander Monakov :
> So... just provide a library? A library code is easier to develop and audit,
> it can be released independently, people can use it with their compiler of
> choice. Not everything needs to be in libgcc.
A library can be used to implement built-ins in gcc (we still need to
define one for block operations, one step at a time...).
However, someone or something needs to rewrite the existing code to
use the library.
It is commonly accepted that an efficient way to do this is to make a
compiler do the
necessary transformations, as long as it can be made to churn out good
enough code.
Alexander Monakov:
> Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
> These consumers need high-performance blockwise CRC, offering them
> a latency-bound elementwise CRC primitive is a disservice. And what
> should they use as a fallback when __builtin_crc is unavailable?
We can provide a fallback implementation for all targets with table
lookup and/or shifts.
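A generic table-lookup fallback could look roughly like this
(hypothetical sketch, again using CRC-16/CCITT-FALSE parameters for
illustration; one 256-entry table, one lookup per input byte):

```c
#include <stdint.h>
#include <stddef.h>

/* 256-entry lookup table for CRC-16 with polynomial 0x1021.
   Each entry is the CRC of its index placed in the top byte.  */
static uint16_t crc16_table[256];

static void
crc16_table_init (void)
{
  for (int n = 0; n < 256; n++)
    {
      uint16_t crc = (uint16_t) (n << 8);
      for (int bit = 0; bit < 8; bit++)
        crc = (crc & 0x8000) ? (uint16_t) ((crc << 1) ^ 0x1021)
                             : (uint16_t) (crc << 1);
      crc16_table[n] = crc;
    }
}

/* Byte-at-a-time CRC using the table: equivalent to the bitwise
   loop, but with eight shift/XOR steps folded into one lookup.  */
uint16_t
crc16_by_table (const uint8_t *buf, size_t len)
{
  uint16_t crc = 0xFFFF;
  for (size_t i = 0; i < len; i++)
    crc = (uint16_t) ((crc << 8) ^ crc16_table[(crc >> 8) ^ buf[i]]);
  return crc;
}
```

It produces bit-identical results to the bitwise loop, at the cost of
512 bytes of table, which is exactly the trade-off that rules it out
for tiny-memory targets or -Os.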
Alexander Monakov:
> I think offering a conventional library for CRC has substantial advantages.
Are you volunteering? It would make our work to emit code for block
CRC easier if we could
just use a library call when we recognize a block CRC (although making
that recognizer is likely still considerable work if we want to get
good coverage over different coding styles).
Although maybe Oleg Endo's library, as mentioned in
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591748.html ,
might be suitable? What is the license for that?
Jeff Law :
> I'm not real comfortable with directly supporting sub-word sizes for the
> output operand. The vast majority of insns on the risc-v port have
> outputs that use either the X iterator (which maps to either SI or DI
> mode for rv32 and rv64 respectively) or they use the GPR iterator. I
> don't think many use ANYI.
> Which ultimately makes sense since operations actually write X mode
> bits.
> Presumably the goal here is to capture cases where the resultant CRC
> value is a sub-word size.
Yes, the builtin functions naturally have lots of narrow operations,
and thus the operands are narrow.
Note that in my implementation of crcsihi4 / crchihi4, I use a Pmode
paradoxical subreg of the output operand, so that the move itself can
be in Pmode.
I suppose we could make the output mode wider than the data size, but
that can only be for the targets that want it; we can't bend the
target-independent built-in for the convenience of RISC-V.
So the builtin function expanding machinery would have to know to use
the mode of the pattern instead of the one implied by the type.
Jeff Law :
> I wonder if all the table based generation support code should move into a
> generic file so that it could be used by other targets. I don't offhand
> see anything in there that is RISC-V specific.
Well, the avoidance of linkonce is. OTOH, other targets might not
have weak. On the gripping hand, making the table static and
duplicating it all over would be sad.
Jeff Law :
>+ sprintf (buf, "crc_table_for_%u_bit_crc_%llx_polynomial", crc_bits, polynom);
> So I think we need to be more careful here with namespace pollution.
> Ideally I think we'd want this symbol to be static and const.
I had:
+ sprintf (buf, "__gcc_crc%s%s4_" HOST_WIDE_INT_PRINT_HEX,
+ mode_name[data_bits / BITS_PER_UNIT],
+ mode_name[crc_bits / BITS_PER_UNIT], polynom);
Jeff Law:
> Note that if you were to create a DECL node for the table and attach a
> suitable DECL_INITIAL to it with its initialization data I think most,
> if not all of the assembly output complexity would be handled for you
> automatically.
I suppose we could make it a string constant, to take advantage of
string commoning on targets where that is supported. However, that
would likely get us into trouble on targets with addressable units
other than 8 bits.
Jeff Law:
> + rtx op1 = gen_rtx_ASHIFTRT (crc_mode, crc,
> + GEN_INT (mode_bit_size - 8));
> In general, I'd suggest gen_int_mode rather than GEN_INT.
Well, in this specific case, we have a non-negative number that fits
well into the space provided.
Although the latter might get tight on a target with 8 bit shift
counts, so if we want to make this properly portable, you have a
point.
However, the code generation actually is somewhat target specific
beyond these technicalities.
The widths of arithmetic used, and how we order it with memory
accesses (which might need arithmetic transformations), depends on the
facilities available and latency considerations to get good code.
Well, we can still have some generic implementation, but for optimal
performance it makes sense to look at what assembly code you get and
consider target-specific code generation if you can do better.
The other port we used this optimization for actually has dedicated
crc instructions in some configurations, and is generally
memory-tight, so the fallback is tightly coded library functions.
Jeff Law:
> Hmm, I'm not sure we can depend on the target actually supporting
> ZERO_EXTEND.
> Similarly, don't we have to worry about whether or not the resulting
> address is actually valid?
It wouldn't be on an Alpha, since that doesn't have indexed
addressing. And that has nothing to do with 12 bits; tab is a symbol
(which usually needs lots of bits, and multiple machine instructions
to set up for RISC-V), and ix is an index.
..
> And here I don't think we can necessarily depend on the target
> supporting shifts by a constant amount.
..
> Similarly here, some targets may not support AND with immediate or may
> not have enough bits to handle all the immediates that can arise here.
Yes, it's RISC-V specific code in a RISC-V file. If we want to make
this more general, we might be better off constructing some trees or
gimple and expanding them. We might indeed do a check if the internal
function or a library function is defined, and substitute a table
lookup otherwise. (or leave it alone for -Os in that case).
> So did we ever figure out why we're treating the CRC optab as a
> conversion optab?
Input and output are different sizes. At least for some of the
variants. So you get an ICE if you treat it as something other than a
conversion.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: RISC-V: Added support for CRC.
2023-09-23 23:05 RISC-V: Added support for CRC Joern Rennecke
@ 2023-09-24 11:41 ` Alexander Monakov
2023-09-24 15:53 ` Joern Rennecke
2023-09-26 6:19 ` Mariam Harutyunyan
` (2 subsequent siblings)
3 siblings, 1 reply; 30+ messages in thread
From: Alexander Monakov @ 2023-09-24 11:41 UTC (permalink / raw)
To: Joern Rennecke
Cc: GCC Patches, Mariam Harutyunyan, Jeff Law, Richard Biener, Oleg Endo
On Sun, 24 Sep 2023, Joern Rennecke wrote:
> It is a stated goal of coremark to test performance for CRC.
I would expect a good CRC benchmark to print CRC throughput in
bytes per cycle or megabytes per second.
I don't see where Coremark states that goal. In the readme at
https://github.com/eembc/coremark/blob/main/README.md
it enumerates the three subcategories (linked list, matrix ops,
state machine) and indicates that CRC is used for validation.
If it claims that elsewhere, the way its code employs CRC does not
correspond to real-world use patterns, like in the Linux kernel for
protocol and filesystem checksumming, or decompression libraries.
> They do not use a library call to implement CRC, but a specific
> bit-banging algorithm they have chosen. That algorithm is, for the
> vast majority of processors, not representative of the target's
> performance potential in calculating CRCs,
It is, however, representative of the target CPU's ability to run
those basic bitwise ops with good overlap with the rest of computation,
which is far more relevant for the real-world performance of the CPU.
> thus if a compiler fails to translate this into the CRC implementation
> that would be used for performance code, the compiler frustrates this
> goal of coremark to give a measure of CRC calculation performance.
Are you seriously saying that if a customer chooses CPU A over CPU B
based on Coremark scores, and then discovers that actual performance
in, say, zlib (which uses slice-by-N for CRC) is better on CPU B, that's
entirely fair and the benchmarks scores they saw were not misleading?
> > At best we might have
> > a discussion on providing a __builtin_clmul for carry-less multiplication
> > (which _is_ a fundamental primitive, unlike __builtin_crc), and move on.
>
> Some processors have specialized instructions for CRC computations.
Only for one or two fixed polynomials. For that matter, some processors
have instructions for AES and SHA, but that doesn't change that clmul is
a more fundamental and flexible primitive than "CRC".
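For reference, carry-less multiplication is ordinary long
multiplication with XOR in place of addition when accumulating
partial products; a portable sketch of the primitive (hypothetical
illustration, not a proposed GCC interface — hardware versions are
e.g. RISC-V Zbc clmul or x86 PCLMULQDQ):

```c
#include <stdint.h>

/* Carry-less multiply of two 32-bit values into a 64-bit result:
   shift-and-XOR instead of shift-and-add, so partial products
   never generate carries between bit positions.  */
uint64_t
clmul32 (uint32_t a, uint32_t b)
{
  uint64_t acc = 0;
  for (int i = 0; i < 32; i++)
    if ((b >> i) & 1)
      acc ^= (uint64_t) a << i;
  return acc;
}
```

For example, clmul32 (3, 3) is 5 rather than 9, because the middle
partial products cancel under XOR instead of carrying.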
> If you want to recognize a loop that does a CRC on a block, it makes
> sense to start with recognizing the CRC computation for single array
> elements first. We have to learn to walk before we can run.
If only the "walk before you run" logic was applied in favor of
implementing a portable clmul builtin prior to all this.
> A library can be used to implement built-ins in gcc (we still need to
> define one for block operations, one step at a time...). However,
> someone or something needs to rewrite the existing code to use the
> library. It is commonly accepted that an efficient way to do this is
> to make a compiler do the necessary transformations, as long as it can
> be made to churn out good enough code.
How does this apply to the real world? Among CRC implementations in the
Linux kernel, ffmpeg, zlib, bzip2, xz-utils, and zstd I'm aware of only
a single instance where bitwise CRC is used. It's in the table
initialization function in xz-utils. The compiler would transform that
to copying one table into another. Is that a valuable transform?
> Alexander Monakov:
> > Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
> > These consumers need high-performance blockwise CRC, offering them
> > a latency-bound elementwise CRC primitive is a disservice. And what
> > should they use as a fallback when __builtin_crc is unavailable?
>
> We can provide a fallback implementation for all targets with table
> lookup and/or shifts.
How would it help when they are compiled with LLVM, or GCC version
earlier than 14?
Alexander
* Re: RISC-V: Added support for CRC.
2023-09-24 11:41 ` Alexander Monakov
@ 2023-09-24 15:53 ` Joern Rennecke
0 siblings, 0 replies; 30+ messages in thread
From: Joern Rennecke @ 2023-09-24 15:53 UTC (permalink / raw)
To: Alexander Monakov
Cc: GCC Patches, Mariam Harutyunyan, Jeff Law, Richard Biener, Oleg Endo
On Sun, 24 Sept 2023 at 12:41, Alexander Monakov <amonakov@ispras.ru> wrote:
>
>
> On Sun, 24 Sep 2023, Joern Rennecke wrote:
>
> > It is a stated goal of coremark to test performance for CRC.
>
> I would expect a good CRC benchmark to print CRC throughput in
> bytes per cycle or megabytes per second.
>
> I don't see where Coremark states that goal. In the readme at
> https://github.com/eembc/coremark/blob/main/README.md
> it enumerates the three subcategories (linked list, matrix ops,
> state machine) and indicates that CRC is used for validation.
At https://www.eembc.org/coremark/index.php , they state under the
Details heading:
...
Replacing the antiquated Dhrystone benchmark, Coremark contains
implementations of the following algorithms: list processing (find and
sort), matrix manipulation (common matrix operations), state machine
(determine if an input stream contains valid numbers), and CRC (cyclic
redundancy check).
...
The CRC algorithm serves a dual function; it provides a workload
commonly seen in embedded applications and ensures correct operation
of the CoreMark benchmark, essentially providing a self-checking
mechanism.
...
They also point to a whitepaper there, which states:
Since CRC is also a commonly used function in embedded applications, this
calculation is included in the timed portion of the CoreMark.
> If it claims that elsewhere, the way its code employs CRC does not
> correspond to real-world use patterns, like in the Linux kernel for
> protocol and filesystem checksumming, or decompression libraries.
That may be so, but we should still strive to optimize the code to
serve its intended purpose.
> It is, however, representative of the target CPU's ability to run
> those basic bitwise ops with good overlap with the rest of computation,
> which is far more relevant for the real-world performance of the CPU.
That depends on how much CRC calculation your application does. You can
disable specific compiler optimizations in GCC for specialized testing.
> > thus if a compiler fails to translate this into the CRC implementation
> > that would be used for performance code, the compiler frustrates this
> > goal of coremark to give a measure of CRC calculation performance.
>
> Are you seriously saying that if a customer chooses CPU A over CPU B
> based on Coremark scores, and then discovers that actual performance
> in, say, zlib (which uses slice-by-N for CRC) is better on CPU B, that's
> entirely fair and the benchmarks scores they saw were not misleading?
Using coremark as a yardstick for any one application is always
likely to give an inaccurate assessment - unless your application is
identical to coremark. I don't see why whatever implementation is
chosen for the short-length CRC in coremark should be closer to or
farther from slice-by-N CRC; I would expect it to be pseudo-random.
Unless CPU B has worse GCC support or available hardware instructions
for short-range CRC, in which case the manufacturer might consider
improving support (particularly if it's about GCC support ;-)
Actually, if CRC optimization is implemented via table lookup on both
CPU A and CPU B, it gets a bit closer to slice-by-N, since both do
table lookups, although for slice-by-N you trade latency for register
pressure.
No single benchmark can be a good performance predictor for all
applications. If you care a lot about performance for a particular
load, you should benchmark that load, or something that is known to
be a close proxy.
> > > At best we might have
> > > a discussion on providing a __builtin_clmul for carry-less multiplication
> > > (which _is_ a fundamental primitive, unlike __builtin_crc), and move on.
> >
> > Some processors have specialized instructions for CRC computations.
>
> Only for one or two fixed polynomials. For that matter, some processors
> have instructions for AES and SHA, but that doesn't change that clmul is
> a more fundamental and flexible primitive than "CRC".
So it is, but when analyzing user programs that haven't been written
by experts with a focus on performance, CRC is more likely to come up
than clmul. I agree that it would make sense to have a builtin for
clmul that can be used uniformly across architectures that support
this operation, but I'm not volunteering to write a patch for that.
> If only the "walk before you run" logic was applied in favor of
> implementing a portable clmul builtin prior to all this.
I started writing the CRC patch for an architecture that didn't have
a clmul instruction in its base configuration, so a clmul builtin
would not have helped.
>
> > A library can be used to implement built-ins in gcc (we still need to
> > define one for block operations, one step at a time...). However,
> > someone or something needs to rewrite the existing code to use the
> > library. It is commonly accepted that an efficient way to do this is
> > to make a compiler do the necessary transformations, as long as it can
> > be made to churn out good enough code.
>
> How does this apply to the real world? Among CRC implementations in the
> Linux kernel, ffmpeg, zlib, bzip2, xz-utils, and zstd I'm aware of only
> a single instance where bitwise CRC is used. It's in the table
> initialization function in xz-utils. The compiler would transform that
> to copying one table into another. Is that a valuable transform?
As long as the target sets sensible costs, we can compare the cost of
the analyzed code to the target implementation,
and choose not to do a transformation if there is no gain.
Although there might be corner cases where we can gain benefits when
we see different table based implementations mixed, or table and
non-table-based implementations, and we'd need WPA to detect that.
Well, we can worry about that when we see it. Or not, depending on
the causes and the impact. Is it the user's fault, and/or
insignificant? Or is it an emerging problem from throwing lots of
code together, where we can get some significant gain from merging
the implementations?
> > We can provide a fallback implementation for all targets with table
> lookup and/or shifts.
>
> How would it help when they are compiled with LLVM, or GCC version
> earlier than 14?
It wouldn't, at least not initially. That's in the nature of
improving compilers, which leaves older versions behind. In the case
of using LLVM, it might help those users too if/when LLVM catches up
(if they haven't already implemented a comparable or superior
optimization by the time we release an improved GCC).
* Re: RISC-V: Added support for CRC.
2023-09-23 23:05 RISC-V: Added support for CRC Joern Rennecke
2023-09-24 11:41 ` Alexander Monakov
@ 2023-09-26 6:19 ` Mariam Harutyunyan
2023-09-26 13:05 ` Oleg Endo
2023-09-26 13:18 ` Jeff Law
3 siblings, 0 replies; 30+ messages in thread
From: Mariam Harutyunyan @ 2023-09-26 6:19 UTC (permalink / raw)
To: Joern Rennecke; +Cc: GCC Patches, Jeff Law, Richard Biener, Oleg Endo
On Sun, Sep 24, 2023, 00:05 Joern Rennecke <joern.rennecke@embecosm.com>
wrote:
> Mariam Harutyunyan:
> +++ b/gcc/ChangeLog
> @@ -1,3 +1,45 @@
> +2023-08-03 Mariam Arutunian <mariamarutunian@gmail.com>
> +
>
> It is common courtesy to include all authors in the list of authors
> for the ChangeLog; also,
> this may help people in the future understand the history of the code
> better.
> While most of your patch is new, it still contains non-trivial parts of
> mine
Sorry, I didn't know that when using others' code snippets, we should
include their names. I'll add your name.
Although the code has been removed in the new patch version, your
work was really helpful.
Thank you,
Mariam
* Re: RISC-V: Added support for CRC.
2023-09-23 23:05 RISC-V: Added support for CRC Joern Rennecke
2023-09-24 11:41 ` Alexander Monakov
2023-09-26 6:19 ` Mariam Harutyunyan
@ 2023-09-26 13:05 ` Oleg Endo
2023-09-26 13:18 ` Jeff Law
3 siblings, 0 replies; 30+ messages in thread
From: Oleg Endo @ 2023-09-26 13:05 UTC (permalink / raw)
To: Joern Rennecke, GCC Patches
Cc: Mariam Harutyunyan, Alexander Monakov, Jeff Law, Richard Biener
On Sun, 2023-09-24 at 00:05 +0100, Joern Rennecke wrote:
>
> Although maybe Oleg Endo's library, as mentioned in
> https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591748.html ,
> might be suitable? What is the license for that?
>
>
I haven't published the library, but I think I could do that.
It's a C++-14 header-only thing and uses templates + constexpr to generate
the .rodata lookup tables. It's convenient for an application project, as
it doesn't require any generator tool in the build. This might not be a
big advantage in the context of GCC.
Since the tables are computed during compile-time, there is no particular
optimization implemented. The run-time function is also nothing fancy:
  static constexpr uint8_t table_index (value_type rem, uint8_t x)
  {
    if (ReflectInput)
      return x ^ rem;
    else
      return x ^ (BitCount > 8 ? (rem >> (BitCount - 8))
                               : (rem << (8 - BitCount)));
  }

  static constexpr value_type shift (value_type rem)
  {
    return ReflectInput ? rem >> 8 : rem << 8;
  }

  static value_type
  default_process_bytes (value_type rem, const uint8_t* in,
                         const uint8_t* in_end)
  {
    for (; in != in_end; ++in)
      {
        auto i = table_index (rem, *in);
        rem = table[i] ^ shift (rem);
      }
    return rem;
  }
Anyway, let me know if anyone is interested.
Cheers,
Oleg
* Re: RISC-V: Added support for CRC.
2023-09-23 23:05 RISC-V: Added support for CRC Joern Rennecke
` (2 preceding siblings ...)
2023-09-26 13:05 ` Oleg Endo
@ 2023-09-26 13:18 ` Jeff Law
2023-09-26 16:23 ` Alexander Monakov
2023-09-26 18:56 ` Joern Rennecke
3 siblings, 2 replies; 30+ messages in thread
From: Jeff Law @ 2023-09-26 13:18 UTC (permalink / raw)
To: Joern Rennecke, GCC Patches
Cc: Mariam Harutyunyan, Alexander Monakov, Richard Biener, Oleg Endo
On 9/23/23 17:05, Joern Rennecke wrote:
> Mariam Harutyunyan:
> +++ b/gcc/ChangeLog
> @@ -1,3 +1,45 @@
> +2023-08-03 Mariam Arutunian <mariamarutunian@gmail.com>
> +
>
> It is common courtesy to include all authors in the list of authors
> for the ChangeLog; also,
> this may help people in the future understand the history of the code better.
> While most of your patch is new, it still contains non-trivial parts of mine
> ( https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591744.html ).
> And stripping out the comment why, currently, we can't use linkonce
> for crc tables on the RISC-V target is
> not helpful to someone who wants to understand the code.
Thanks for pointing this out, Joern. Neither Mariam nor I were aware
that some of your code was used to seed this work. We are happy to
include you as a co-author.
Mariam -- the way to do this is to add a "Co-Authored-By:" line to your
commit message. Upstream commit
5ff4431675c0d0c800d4a983254e94a6b401c14d
shows the right format.
>
> See also the discussion to put this into loop distribution:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591821.html
> https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591866.html
Yes, this was the gist of the discussion I had with Richi. CRC
optimization has a lot in common with distribution, reduction and final
value replacement.
For early testing and development we focused on detecting a CRC function
and replacing it at a function level. That allowed Mariam to avoid a
class of problems for a while and focus on an end-to-end solution. Once
that was working reasonably well, Mariam went back and enhanced the first
pass filter and validation so that it would work if the CRC loop was
embedded inside a larger function, either at the source level or due to
inlining.
The placement we chose was just after loop distribution. I think
there's some wiggle room in terms of pass placement.
The one thing I'm hesitant to do is make this part of an existing pass.
I don't see many benefits of integrating inside loop dist, and I very much
like its clean separation.
>
> Mariam Harutyunyan:
>> It adds internal
>> functions and built-ins specifically designed to handle CRC computations
>> efficiently.
>
> This sounds like finished work, although defining built-in
> functions to calculate the CRC of single data elements, and recognizers
> for some C idioms that do these calculations,
> are just a starting point.
More precisely, it was carved out of a larger piece of work in the hopes
that it would be useful independently and allow upstreaming of the
backend work. Alexander was fairly dismissive of the utility of doing
that. There was some interest from Cauldron attendees in potentially
reviving that idea.
>
> Alexander Monakov :
>
>> Jeff, as I understand this all is happening only because Coremark contains
>> use of bitwise CRC that affects benchmark scores. In another universe where
>> - Coremark was careful to checksum outputs outside of timed sections, or
>> - implemented CRC in a manner that is not transparent to the compiler, or
>> - did not use CRC at all
>> we would not be spending effort on this, correct?
>
> It is a stated goal of coremark to test performance for CRC. They do
> not use a library call
> to implement CRC, but a specific bit-banging algorithm they have
> chosen. That algorithm is,
> for the vast majority of processors, not representative of the target's
> performance potential in calculating CRCs, thus if a compiler fails to
> translate this into the CRC implementation that
> would be used for performance code, the compiler frustrates this goal
> of coremark to give a measure of CRC calculation performance.
All true. But the Coremark code is what it is. This isn't a whole lot
different than the work in the 90s which rewrote loops and compromised
some of the spec benchmarks, or the eqntott hack to simplify that one
key loop in eqntott.
What ultimately pushed us to keep moving forward on this effort was
discovering numerous CRC loop implementations out in the wild, including
4 implementations (IIRC) in the kernel itself.
>
>> At best we might have
>> a discussion on providing a __builtin_clmul for carry-less multiplication
>> (which _is_ a fundamental primitive, unlike __builtin_crc), and move on.
>
> Some processors have specialized instructions for CRC computations.
They do. But I think the discussion about providing intrinsic access to
a clmul instruction is orthogonal to the discussion about whether or not
to provide a builtin for crc.
I can easily see creating a clmul RTL opcode for targets which support
it and hoisting the clmul vs lookup table selection into generic code.
I'm still pondering if we're likely to ever see cases where we want a
vector clmul intrinsic or support in the autovectorizer for clmul.
We've certainly got targets with vector clmul in the ISA, the question
is using it.
>
>> Instead, efficient CRC loops have the following structure:
>> - they carry an unreduced remainder in the loop, performing final reduction
>> modulo polynomial only once after the loop — this halves the CLMUL count;
>> - they overlap multiple CLMUL chains to make the loop throughput-bound
>> rather than latency-bound. The typical unroll factor is about 4x-8x.
>
> If you want to recognize a loop that does a CRC on a block, it makes
> sense to start with recognizing the CRC computation for single array
> elements first. We have to learn to
> walk before we can run.
>
> Note that my initial patch already contained a match.pd stanza to
> recognize two chained single-element CRC calculations.
Mariam's code goes well beyond simple match.pd recognition of a block.
The hope is to do a few rounds of internal review/cleanup then post it
for the world to see.
Probably the biggest task in that space right now is to see if we can
avoid the symbolic execution engine by re-using ranger.
>
> Jeff Law:
>> The intention is to provide a useful
>> builtin_crc while at the same time putting one side of the
>> infrastructure we need for automatic detection of CRC loops and turning
>> them into table lookups or CLMULs.
>
> Note that when optimizing for size, for a target with tiny memory, or
> when using a non-constant (or constant but undiscoverable by the
> compiler) polynomial, we can't use the table lookup. But still, even if
> we don't have CLMUL, we can generally speed up CRC computation over
> the coremark algorithm by using something more suited to the target,
> like the crcu16_1 function I
> put into comments in my patch.
Sure. And once the CRC loop is discovered and an IFN created, the
backend would have the opportunity to guide code generation.
>
> Alexander Monakov:
>> Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
>> These consumers need high-performance blockwise CRC, offering them
>> a latency-bound elementwise CRC primitive is a disservice. And what
>> should they use as a fallback when __builtin_crc is unavailable?
>
> We can provide a fallback implementation for all targets with table
> lookup and/or shifts.
And as I've stated before, the latency of clmuls is dropping. I
wouldn't be terribly surprised to see single-cycle clmul
implementations showing up within the next 18-24 months. It's really
just a matter of gate budget vs expected value.
>
> Alexander Monakov:
>> I think offering a conventional library for CRC has substantial advantages.
> Are you volunteering? It would make our work to emit code for block
> CRC easier if we could just use a library call when we recognize a
> block CRC (although making that recognizer is likely still
> considerable work if we want to get good coverage over different
> coding styles).
>
> Although maybe Oleg Endo's library, as mentioned in
> https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591748.html ,
> might be suitable? What is the license for that?
But going this route as the primary direction means that any such code
has to change its implementation to take advantage of a faster CRC
implementation.
To reiterate, the real goal here is to take code as-is and make it
significantly faster. While the original target was Coremark, we've
found similar bitwise implementations of CRCs all over the place.
There's no good reason that code should have to change.
The idea of exposing a CRC builtin was an intermediate step that would
allow those willing to change their code or writing new code to write
their CRC in a trivial way and let the compiler figure out a sensible
implementation while we clean up the CRC loop detection/validation code.
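For reference, the "trivial way" being discussed is a bit-at-a-time loop like the following, which the recognizer turns into a table lookup or clmul sequence. This is a generic sketch of the common idiom (the function name is illustrative, not from the patch); 0xA001 is the reflected CRC-16 polynomial Coremark uses.

```c
#include <stdint.h>

/* Bit-at-a-time CRC-16 update over one byte, reflected polynomial
   0xA001 -- the style of loop the detection pass recognizes.  A
   generic sketch of the common idiom, not the exact Coremark source.  */
static uint16_t
crcu16_update (uint16_t crc, uint8_t data)
{
  for (int i = 0; i < 8; i++)
    {
      unsigned lsb = (crc ^ data) & 1;  /* next feedback bit */
      crc >>= 1;
      if (lsb)
        crc ^= 0xA001;                  /* xor in the reflected polynomial */
      data >>= 1;
    }
  return crc;
}
```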
>
> Yes, the builtin functions naturally have lots of narrow operations,
> and thus the operands are narrow.
> Note that in my implementation of crcsihi4 / crchihi4, I use a Pmode
> paradoxical subreg of the output operand, so that the move itself can
> be in Pmode.
>
> I suppose we could make the output mode wider than the data size, but
> that can only be for the targets that want it; we can't bend the
> target-independent built-in for the convenience of RISC-V.
> So the builtin function expanding machinery would have to know to use
> the mode of the pattern instead of the one implied by the type.
I'm not at all suggesting we do something weird in the middle end for
RISC-V. My point was we should look for a better solution. This feels
like it ought to work more like the existing widening/narrowing
capabilities we have in the expansion code.
>
> Jeff Law :
>
>> I wonder if all the table based generation support code should move into a
>> generic file so that it could be used by other targets. I don't offhand
>> see anything in there that is RISC-V specific.
>
> Well, the avoidance of linkonce is. OTOH, other targets might not
> have weak. On the gripping hand, making the table static and
> duplicating it all over would be sad.
Right. But making the table a DECL node with an initializer allows us
to clean up a whole mess of things -- including the ability to take
advantage of linkonce on targets that support it. The whole point is to
take what was fairly target specific and generalize it so that it's
useful elsewhere.
>
> I suppose we could make it a string constant, to take advantage of
> string commoning on targets where that is supported. However, that
> would likely get us into trouble on targets with addressable units
> other than 8 bits.
I think we can leave this as a follow-up. I'm not convinced it's a
major issue.
>
> However, the code generation actually is somewhat target specific
> beyond these technicalities.
> The widths of arithmetic used, and how we order it with memory
> accesses (which might need arithmetic transformations), depends on the
> facilities available and latency considerations to get good code.
> Well, we can still have some generic implementation, but for optimal
> performance it makes sense to look at what assembly code you get and
> consider target-specific code generation if you can do better.
The widths and such as well as "does this instruction support
immediates" should be handled by using the existing expansion
interfaces. That's the broader point.
As written the code makes various assumptions about the ISA. By moving
to standard expansion interfaces most of those problems are solved.
>
> The other port we used this optimization for actually has dedicated
> crc instructions in some configurations, and is generally
> memory-tight, so the fallback is tightly coded library functions.
ACK. And I think we can get to a point where those ports that have CRC
functions (presumably with a limited set of polynomials) are trivially
handled.
>
> It wouldn't work on an alpha, since that doesn't have indexed
> addressing. And that has nothing to do with 12 bit; tab is a symbol
> (usually needs lots of bits, and multiple machine instructions
> to set up for RISC-V), and ix is an index.
Right. This isn't about 12bit immediates or anything like that, just a
general need to be careful about how we generate code to ensure it's
target independent.
> ..
>> And here I don't think we can necessarily depend on the target
>> supporting shifts by a constant amount.
> ..
>> Similarly here, some targets may not support AND with immediate or may
>> not have enough bits to handle all the immediates that can arise here.
>
> Yes, it's RISC-V specific code in a RISC-V file. If we want to make
> this more general, we might be better off constructing some trees or
> gimple and expanding them. We might indeed do a check if the internal
> function or a library function is defined, and substitute a table
> lookup otherwise. (or leave it alone for -Os in that case).
Which is basically what I was suggesting by moving to the various expand
interfaces. That code does not need to be RISC-V specific and in fact
it's much better if it is not and can be shared across architectures.
>
>> So did we ever figure out why we're treating the CRC optab as a
>> conversion optab?
>
> Input and output are different sizes. At least for some of the
> variants. So you get an ICE if you treat it as something other than a
> conversion.
That's what I was guessing. Thanks for the confirmation.
jeff
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: RISC-V: Added support for CRC.
2023-09-26 13:18 ` Jeff Law
@ 2023-09-26 16:23 ` Alexander Monakov
2023-09-26 18:56 ` Joern Rennecke
1 sibling, 0 replies; 30+ messages in thread
From: Alexander Monakov @ 2023-09-26 16:23 UTC (permalink / raw)
To: Jeff Law
Cc: Joern Rennecke, GCC Patches, Mariam Harutyunyan, Richard Biener,
Oleg Endo
On Tue, 26 Sep 2023, Jeff Law wrote:
> What ultimately pushed us to keep moving forward on this effort was
> discovering numerous CRC loop implementations out in the wild, including 4
> implementations (IIRC) in the kernel itself.
The kernel employs bitwise CRC only in look-up table generators.
Which run at build time. You are quite literally slowing down the compiler
in order to speed up generators that don't account for even one millisecond
of kernel build time, and have no relation to its run-time performance.
(incidentally you can't detect the actual CRC impls using those tables)
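For context, a kernel-style table generator runs the bitwise loop only at build time, once per table entry; roughly like this (a generic sketch of the pattern, not the kernel's actual code):

```c
#include <stdint.h>
#include <stdio.h>

/* Build-time generator for a 256-entry CRC-32 lookup table (reflected
   polynomial 0xEDB88320): the bitwise loop runs only here, once per
   table entry, and the run-time CRC code then indexes the emitted
   table.  A generic sketch, not the kernel's actual code.  */
static void
crc32_gen_table (uint32_t table[256])
{
  for (unsigned n = 0; n < 256; n++)
    {
      uint32_t crc = n;
      for (int k = 0; k < 8; k++)
        crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
      table[n] = crc;
    }
}

/* Emit the table as C source, the typical generator output.  */
static void
crc32_print_table (const uint32_t table[256])
{
  for (unsigned n = 0; n < 256; n++)
    printf ("0x%08x,%c", table[n], (n & 3) == 3 ? '\n' : ' ');
}
```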
> And as I've stated before, the latency of clmuls is dropping. I wouldn't be
> terribly surprised to see single-cycle clmul implementations showing up
> within the next 18-24 months. It's really just a matter of gate budget vs
> expected value.
In a commercial implementation? I'll take that bet. You spend gate budget
like that after better avenues for raising ILP are exhausted (like adding
more ALUs that can do clmul at a reasonable 3c-4c latency).
> To reiterate the real goal here is to take code as-is and make it
> significantly faster.
Which code? Table generators in the kernel and xz-utils?
> While the original target was Coremark, we've found similar bitwise
> implementations of CRCs all over the place. There's no good reason that code
> should have to change.
But did you look at them? There's no point to optimize table generators either.
Alexander
* Re: RISC-V: Added support for CRC.
2023-09-26 13:18 ` Jeff Law
2023-09-26 16:23 ` Alexander Monakov
@ 2023-09-26 18:56 ` Joern Rennecke
2023-09-27 21:44 ` Jeff Law
1 sibling, 1 reply; 30+ messages in thread
From: Joern Rennecke @ 2023-09-26 18:56 UTC (permalink / raw)
To: Jeff Law
Cc: GCC Patches, Mariam Harutyunyan, Alexander Monakov,
Richard Biener, Oleg Endo
On Tue, 26 Sept 2023 at 14:18, Jeff Law <jeffreyalaw@gmail.com> wrote:
> But the Coremark code is what it is. This isn't a whole lot
> different than the work in the 90s which rewrote loops and compromised
> some of the spec benchmarks, or the eqntott hack to simplify that one
> key loop in eqntott.
I think the stated purpose of the benchmark matters. If dhrystone had been
pushed as an abstraction-penalty benchmark, it would have been fine to
present results with WPA, inlining and dead code elimination as ordinary
dhrystone results. But since it's supposed to exercise specific hardware
features, and not have the tests for these optimized away, that's not
appropriate.
So, first, we make the compiled program perform the work that the benchmark
was supposed to include in the measurement, just more efficiently.
Second, we not only optimize the benchmark, but also make the target-optimized
code generation available for other programs. For new programs targeted at
GNU C, that is minimally achieved by providing a built-in function, and in
general for new code, by being able to replicate the idiom from Coremark
that is recognized by GCC. The mere existence of a C idiom in a popular
benchmark also makes this idiom a common idiom, if it hasn't already been
that before.
As we find new patterns that are used to implement CRC which would
be better replaced with a target-specific implementation, we can add these.
This is similar to rotate operations, which are supported directly by some
processors, and even for other targets, there are generally preferred ways
to expand the code, but there are a couple of different variants depending
on the available instruction set, registers, and the microarchitecture
(pipeline, latency etc). We started out with one pattern that was
recognized, and as new patterns were identified in C code, we improved
GCC to recognize these other patterns.
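To make the rotate analogy concrete, the portable C idiom below is the kind of pattern GCC recognizes and expands to a single rotate instruction on targets that have one (a sketch; the masking keeps the shift count well defined for n == 0):

```c
#include <stdint.h>

/* The portable C rotate-left idiom that compilers recognize and map to
   a single hardware rotate where one exists (e.g. rol on x86, rori
   with Zbb on RISC-V).  Masking the count avoids the undefined shift
   by 32 when n == 0.  */
static uint32_t
rotl32 (uint32_t x, unsigned n)
{
  n &= 31;
  return (x << n) | (x >> ((32 - n) & 31));
}
```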
> What ultimately pushed us to keep moving forward on this effort was
> discovering numerous CRC loop implementations out in the wild, including
> 4 implementations (IIRC) in the kernel itself.
I have always assumed that such must exist (CRCs are useful for a number
of problems, and besides, they wouldn't have been included in coremark as
a part of the workload if they were not relevant), but it is good to have
confirmation, and even better to have code that can detect and analyse an
entire class of idioms that is in such widespread use.
This still leaves room for further improvements, like detecting fully-unrolled
code, table lookup, or slice-by-N, and replacing them with better
target-optimized code where this is indicated by the optimization flags to
save execution time or code/rodata size. Not something we have to tackle
now, but just because we don't do it now, doesn't mean we couldn't address
these in the future if that appears worthwhile.
> I can easily see creating a clmul RTL opcode for targets which support
> it and hoisting the clmul vs lookup table selection into generic code.
> I'm still pondering if we're likely to ever see cases where we want a
> vector clmul intrinsic or support in the autovectorizer for clmul.
> We've certainly got targets with vector clmul in the ISA, the question
> is using it.
If we aim for optimal code, I think it more likely that we want to detect a
block CRC computation, and have a target expander decide to do that
with inline code or a library call that uses vectorized clmul. At the time
we add such block-CRC expansion code, it also makes sense to add a
builtin for block CRC so that new GNU C programs can directly request
that functionality without having to go through the cargo cult of matching
a supported idiom.
Now, the library might be written in GNU C, and for that it might be useful
to have a vector clmul intrinsic so that we can express this algorithm more
easily.
> Probably the biggest task in that space right now is to see if we can
> avoid the symbolic execution engine by re-using ranger.
I'll be interested to see what you'll come up with, but if reverting to the
symbolic execution engine, the compile time cost isn't much if you only
use it for a proper match. So whatever heuristics are used before deciding
to use the engine matter. Can all the cases detected by the engine be
recognized as a loop with a reduction? We might use different heuristics
for different optimization levels, i.e. allow more false negatives at -O1,
and more false positives at -O2 / -fexpensive-optimizations.
> To reiterate the real goal here is to take code as-is and make it
> significantly faster. While the original target was Coremark, we've
> found similar bitwise implementations of CRCs all over the place.
> There's no good reason that code should have to change.
>
> The idea of exposing a CRC builtin was an intermediate step that would
> allow those willing to change their code or writing new code to write
> their CRC in a trivial way and let the compiler figure out a sensible
> implementation while we clean up the CRC loop detection/validation code.
While I concur that we want existing code to be able to take advantage of
this optimization by gcc recognizing whatever method the code uses to
compute CRC (within reason, we are still bound by the laws of
computability and resource constraints for the compilation), I would
like to stress that I don't think the builtin will lose its value over time.
It can be used in tests to make sure we exercise the code generation for the
internal function. It can be used in new GNU C code to make it easier to
do a CRC computation without worrying about the implementation. If
an accord with other major compiler suppliers (e.g. the LLVM community)
is reached, it might even see more widespread use.
> Which is basically what I was suggesting by moving to the various expand
> interfaces. That code does not need to be RISC-V specific and in fact
> it's much better if it is not and can be shared across architectures.
For the RISC-V implementation, I started out with writing some variants in
C and assembler optimized for latency, register use, and code size, picking
the one that's best for speed for constant polynomials (crcu16_1 is still
viable for other cases, but I didn't have time to implement *and test*
-Os and variable-polynomial paths), and then I crafted an expander so that
I'd get the desired assembler at the end of compilation. I had this idea that
every port wants its own implementation (with the previous port going a
completely different route than RISC-V). But I suppose you are right that
a lot of GCCs ports that are in mainline are just like RISC-V in that they
will work well with a lookup table, and to get an initial version that works,
it makes sense to start with a single or a handful of generic
implementation(s), and do target-specific refinements later.
I suppose for such an initial implementation, RISC and x86 32/64-bit
targets will be fine with the table lookup, while small-address-space
targets like avr would be better off with a library function along the
lines of crcu16_1, tailored to their ALU and register sizes. The 'For
base instruction set' variant should actually work with int or short just
as well as long; long just happens to be faster for RISC-V as it doesn't
need extra extensions, while avr will likely be much better off with
short.
FWIW, for the table lookup code, there are probably some places where
force_operand is convenient to make the code portable across targets.
* Re: RISC-V: Added support for CRC.
2023-09-26 18:56 ` Joern Rennecke
@ 2023-09-27 21:44 ` Jeff Law
0 siblings, 0 replies; 30+ messages in thread
From: Jeff Law @ 2023-09-27 21:44 UTC (permalink / raw)
To: Joern Rennecke
Cc: GCC Patches, Mariam Harutyunyan, Alexander Monakov,
Richard Biener, Oleg Endo
On 9/26/23 12:56, Joern Rennecke wrote:
>
>> What ultimately pushed us to keep moving forward on this effort was
>> discovering numerous CRC loop implementations out in the wild, including
>> 4 implementations (IIRC) in the kernel itself.
>
> I have always assumed that such must exist (CRCs are useful for a number
> of problems, and besides, they wouldn't have been included in coremark as
> a part of the workload if they were not relevant), but it is good to have
> confirmation, and even better to have code that can detect and analyse an
> entire class of idioms that is in such widespread use.
I was personally surprised at how many we found. While there were a
bunch of table generation routines which obviously aren't at all
interesting, there were enough in the cases we analyzed that it
convinced me this wasn't just catering to a benchmark.
>
> This still leaves room for further improvements, like detecting fully-unrolled
> code, table lookup, or slice-by-N, and replacing them with better
> target-optimized code where this is indicated by the optimization flags to
> save execution time or code/rodata size. Not something we have to tackle
> now, but just because we don't do it now, doesn't mean we couldn't address
> these in the future if that appears worthwhile.
Absolutely. In fact, I have encouraged Mariam to capture some of the
cases we can't currently handle in the testsuite, essentially building a
bit of a TODO list.
>
>> I can easily see creating a clmul RTL opcode for targets which support
>> it and hoisting the clmul vs lookup table selection into generic code.
>> I'm still pondering if we're likely to ever see cases where we want a
>> vector clmul intrinsic or support in the autovectorizer for clmul.
>> We've certainly got targets with vector clmul in the ISA, the question
>> is using it.
>
> If we aim for optimal code, I think it more likely that we want to detect a
> block CRC computation, and have a target expander decide to do that
> with inline code or a library call that uses vectorized clmul. At the time
> we add such block-CRC expansion code, it also makes sense to add a
> builtin for block CRC so that new GNU C programs can directly request
> that functionality without having to go through the cargo cult of matching
> a supported idiom.
And I think we've got a structure which largely allows further
improvements, both in the recognition/validation step and in the code
expansion step.
>
> Now, the library might be written in GNU C, and for that it might be useful
> to have a vector clmul intrinsic so that we can express this algorithm more
> easily.
Agreed. It's also worth noting that LLVM has a clmul in their IL and I
suspect they expose it via a builtin/intrinsic. I'd expect we'll
ultimately end up in the same boat.
>
>> Probably the biggest task in that space right now is to see if we can
>> avoid the symbolic execution engine by re-using ranger.
>
> I'll be interested to see what you'll come up with, but if reverting to the
> symbolic execution engine, the compile time cost isn't much if you only
> use it for a proper match. So whatever heuristics are used before deciding
> to use the engine matter. Can all the cases detected by the engine be
> recognized as a loop with a reduction? We might use different heuristics
> for different optimization levels, i.e. allow more false negatives at -O1,
> and more false positives at -O2 / -fexpensive-optimizations.
It's mostly a desire not to add (yet another) symbolic execution engine
to GCC. We've already got the engine for CCP as well as symbolic
execution capabilities in Ranger. I'd like to avoid adding another if
we can do so.
For an LFSR validation step we need to track 4 potential states for each
bit in an object: 0, 1, x, !x, where "x" is the state of the bit from a
different object. If it were just tracking 0, 1, x, !x for an entire
object, Ranger could probably already do that. But we need to do it for
each bit within an object.
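A minimal sketch of what tracking those four states per bit could look like (the names and the xor rule below are hypothetical illustrations for this discussion, not the actual implementation):

```c
/* Per-bit symbolic state for LFSR validation: each bit of the tracked
   object is known-0, known-1, equal to some other object's bit ("x"),
   or its complement ("!x").  Hypothetical sketch, not the actual
   implementation.  */
enum bit_state { BIT_ZERO, BIT_ONE, BIT_REF, BIT_NOT_REF };

/* XOR of two tracked bits, assuming any BIT_REF/BIT_NOT_REF operands
   refer to the same source bit: 0 is the identity, 1 complements,
   x ^ x = 0, and x ^ !x = 1.  */
static enum bit_state
bit_xor_same_ref (enum bit_state a, enum bit_state b)
{
  if (a == BIT_ZERO)
    return b;
  if (b == BIT_ZERO)
    return a;
  if (a == BIT_ONE)
    return b == BIT_ONE ? BIT_ZERO : (b == BIT_REF ? BIT_NOT_REF : BIT_REF);
  if (b == BIT_ONE)
    return a == BIT_REF ? BIT_NOT_REF : BIT_REF;
  return a == b ? BIT_ZERO : BIT_ONE;  /* x ^ x = 0, x ^ !x = 1 */
}
```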
We haven't seen compile-time cost be a real issue. But we also haven't
looked too closely at that problem.
>
> While I concur that we want existing code to be able to take advantage of
> this optimization by gcc recognizing whatever method the code uses to
> compute CRC (within reason, we are still bound by the laws of
> computability and resource constraints for the compilation), I would
> like to stress that I don't think the builtin will lose its value over time.
> It can be used in tests to make sure we exercise the code generation for the
> internal function. It can be used in new GNU C code to make it easier to
> do a CRC computation without worrying about the implementation. If
> an accord with other major compiler suppliers (e.g. the LLVM community)
> is reached, it might even see more widespread use.
Which somewhat dovetails with Alex's position -- namely that it's not
that valuable. Hence the de-emphasis on this approach. We'll continue to
focus on the end-to-end solution.
We may still want a clmul as an RTL opcode and builtins to utilize it.
>
>> Which is basically what I was suggesting by moving to the various expand
>> interfaces. That code does not need to be RISC-V specific and in fact
>> it's much better if it is not and can be shared across architectures.
>
> For the RISC-V implementation, I started out with writing some variants in
> C and assembler optimized for latency, register use, and code size, picking
> the one that's best for speed for constant polynomials (crcu16_1 is still
> viable for other cases, but I didn't have time to implement *and test*
> -Os and variable-polynomial paths), and then I crafted an expander so that
> I'd get the desired assembler at the end of compilation. I had this idea that
> every port wants its own implementation (with the previous port going a
> completely different route than RISC-V). But I suppose you are right that
> a lot of GCCs ports that are in mainline are just like RISC-V in that they
> will work well with a lookup table, and to get an initial version that works,
> it makes sense to start with a single or a handful of generic
> implementation(s), and do target-specific refinements later.
The thing about a table is it'll work just about everywhere as it
doesn't require anything special from a hardware standpoint. So we
ought to be able to verify the table based version quite thoroughly
without relying on other port maintainers to wire up a carryless
multiply in their port. And it'll be crazy faster than a bitwise
implementation.
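To put a number behind "crazy faster": the table-based form replaces eight conditional shift/xor steps per byte with a single lookup. A generic sketch of the technique (standard reflected CRC-32; illustration, not code from the patch):

```c
#include <stdint.h>
#include <stddef.h>

/* Table-driven CRC-32 (reflected polynomial 0xEDB88320): one table
   lookup and one shift/xor per input byte, versus eight conditional
   shift/xor steps per byte for the bitwise loop.  A generic sketch of
   the technique, not code from the patch.  */
static uint32_t crc32_table[256];

static void
crc32_init (void)
{
  for (unsigned n = 0; n < 256; n++)
    {
      uint32_t c = n;
      for (int k = 0; k < 8; k++)
        c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
      crc32_table[n] = c;
    }
}

static uint32_t
crc32_update (uint32_t crc, const uint8_t *buf, size_t len)
{
  crc = ~crc;                   /* standard CRC-32 init / final xor */
  for (size_t i = 0; i < len; i++)
    crc = crc32_table[(crc ^ buf[i]) & 0xff] ^ (crc >> 8);
  return ~crc;
}
```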
> I suppose for such an initial implementation, RISC and x86 32/64-bit
> targets will be fine with the table lookup, while small-address-space
> targets like avr would be better off with a library function along the
> lines of crcu16_1, tailored to their ALU and register sizes. The 'For
> base instruction set' variant should actually work with int or short
> just as well as long; long just happens to be faster for RISC-V as it
> doesn't need extra extensions, while avr will likely be much better
> off with short.
Right. And there's nothing preventing someone from putting a target
custom routine in libgcc, but I doubt we'll see a lot of that. I
suspect that 99% of the time clmul or a table is going to be the desired
codegen for a detected CRC. The remaining cases would be space
sensitive and likely using -Os/-Oz -- where we'd disable CRC
optimization if there isn't a clmul capability in the backend.
>
> FWIW, for the table lookup code, there are probably some places where
> force_operand is convenient to make the code portable across targets.
Absolutely. But I think we're doing too much direct RTL generation.
I'd like to squash that out and hopefully what's left is just calls to
expand_* and we don't have to worry about force_operand and such.
jeff
* Re: RISC-V: Added support for CRC.
2023-08-16 4:59 ` Jeff Law
@ 2023-08-21 13:39 ` Mariam Harutyunyan
0 siblings, 0 replies; 30+ messages in thread
From: Mariam Harutyunyan @ 2023-08-21 13:39 UTC (permalink / raw)
To: Jeff Law; +Cc: gcc-patches
Thank you for the review.
I'm already working on suggested changes.
The answers to the few questions are attached here.
Thanks,
Mariam
On Wed, Aug 16, 2023 at 8:59 AM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
>
> On 8/3/23 13:37, Mariam Harutyunyan via Gcc-patches wrote:
> > This patch adds CRC support for the RISC-V architecture. It adds internal
> > functions and built-ins specifically designed to handle CRC computations
> > efficiently.
> >
> > If the target is ZBC, the clmul instruction is used for the CRC code
> > generation; otherwise, table-based CRC is generated. A table with 256
> > elements is used to store precomputed CRCs.
> >
> > These CRC calculation algorithms have higher performance than the naive
> > CRC calculation algorithm.
> [ ... ]
> Various comments attached.
>
[-- Attachment #2: reply-to-comments --]
diff --git a/gcc/builtin-types.def b/gcc/builtin-types.def
index 43381bc8949..e33837c27d0 100644
--- a/gcc/builtin-types.def
+++ b/gcc/builtin-types.def
@@ -829,6 +829,26 @@ DEF_FUNCTION_TYPE_3 (BT_FN_PTR_SIZE_SIZE_PTRMODE,
BT_PTR, BT_SIZE, BT_SIZE, BT_PTRMODE)
DEF_FUNCTION_TYPE_3 (BT_FN_VOID_PTR_UINT8_PTRMODE, BT_VOID, BT_PTR, BT_UINT8,
BT_PTRMODE)
+DEF_FUNCTION_TYPE_3 (BT_FN_UINT8_UINT8_UINT8_CONST_SIZE, BT_UINT8, BT_UINT8,
+ BT_UINT8, BT_CONST_SIZE)
[ ... ]
Presumably the reason we need so many variants is due to the desire to
support various types for the inputs and outputs?
- Yes.
diff --git a/gcc/config/riscv/bitmanip.md b/gcc/config/riscv/bitmanip.md
index c42e7b890db..4c896303242 100644
--- a/gcc/config/riscv/bitmanip.md
+++ b/gcc/config/riscv/bitmanip.md
@@ -856,3 +856,38 @@
"TARGET_ZBC"
"clmulr\t%0,%1,%2"
[(set_attr "type" "clmul")])
+
+;; Iterator for hardware-supported integer modes, same as ANYI
+(define_mode_iterator ANYI2 [QI HI SI (DI "TARGET_64BIT")])
+
+;; CRC 8, 16, 32, (64 for TARGET_64)
+(define_expand "crc<ANYI2:mode><ANYI:mode>4"
+ ;; return value (calculated CRC)
+ [(set (match_operand:ANYI 0 "register_operand" "=r")
+ ;; initial CRC
+ (unspec:ANYI [(match_operand:ANYI 1 "register_operand" "r")
+ ;; data
+ (match_operand:ANYI2 2 "register_operand" "r")
+ ;; polynomial
+ (match_operand:ANYI 3)]
+ UNSPEC_CRC))]
I'm not real comfortable with directly supporting sub-word sizes for the
output operand. The vast majority of insns on the risc-v port have
outputs that use either the X iterator (which maps to either SI or DI
mode for rv32 and rv64 respectively) or they use the GPR iterator. I
don't think many use ANYI.
Which ultimately makes sense since operations actually write X mode
bits.
Presumably the goal here is to capture cases where the resultant CRC
value is a sub-word size.
- Yes. ANYI isn't used that much, but the only appropriate one I found is this.
- Here I need to know the actual sizes of the CRC and the data, so I use sub-word sizes.
@@ -3996,6 +4028,7 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
#define direct_cond_len_unary_optab_supported_p direct_optab_supported_p
#define direct_cond_len_binary_optab_supported_p direct_optab_supported_p
#define direct_cond_len_ternary_optab_supported_p direct_optab_supported_p
+#define direct_crc_optab_supported_p convert_optab_supported_p
So did we ever figure out why we're treating the CRC optab as a
conversion optab?
- CRC and data may be of different types. For different types I need to have different modes (as mentioned above).
* Re: RISC-V: Added support for CRC.
2023-08-16 19:42 ` Philipp Tomsich
2023-08-16 20:23 ` Paul Koning
@ 2023-08-17 13:36 ` Alexander Monakov
1 sibling, 0 replies; 30+ messages in thread
From: Alexander Monakov @ 2023-08-17 13:36 UTC (permalink / raw)
To: Philipp Tomsich; +Cc: Jeff Law, Mariam Harutyunyan, gcc-patches
On Wed, 16 Aug 2023, Philipp Tomsich wrote:
> > > I fully expect that latency to drop within the next 12-18 months. In that
> > > world, there's not going to be much benefit to using hand-coded libraries vs
> > > just letting the compiler do it.
>
> I would also hope that the hand-coded libraries would eventually have
> a code path for compilers that support the built-in.
You seem to be working with the false assumption that the interface of the
proposed builtin matches how high-performance CRC computation is structured.
It is not. State-of-the-art CRC keeps unreduced intermediate residual, split
over multiple temporaries to allow overlapping CLMULs in the CPU. The
intermediate residuals are reduced only once, when the final CRC value is
needed. In contrast, the proposed builtin has data dependencies between
all adjacent instructions, and cannot allow the CPU to work at IPC > 1.0.
Shame how little you apparently understand of the "mindbending math".
Alexander
* Re: RISC-V: Added support for CRC.
2023-08-16 19:10 ` Alexander Monakov
2023-08-16 19:42 ` Philipp Tomsich
@ 2023-08-17 3:12 ` Jeff Law
1 sibling, 0 replies; 30+ messages in thread
From: Jeff Law @ 2023-08-17 3:12 UTC (permalink / raw)
To: Alexander Monakov; +Cc: Mariam Harutyunyan, gcc-patches
On 8/16/23 13:10, Alexander Monakov wrote:
>
> On Tue, 15 Aug 2023, Jeff Law wrote:
>
>> Because if the compiler can optimize it automatically, then the projects have
>> to do literally nothing to take advantage of it. They just compile normally
>> and their bitwise CRC gets optimized down to either a table lookup or a clmul
>> variant. That's the real goal here.
>
> The only high-profile FOSS project that carries a bitwise CRC implementation
> I'm aware of is the 'xz' compression library. There bitwise CRC is used for
> populating the lookup table under './configure --enable-small':
>
> https://github.com/tukaani-project/xz/blob/2b871f4dbffe3801d0da3f89806b5935f758d5f3/src/liblzma/check/crc64_small.c
>
> It's a well-reasoned choice and your compiler would be undoing it
> (reintroducing the table when the bitwise CRC is employed specifically
> to avoid carrying the table).
If they don't want the table variant, there would obviously be ways to
turn that off. It's essentially no different than any speed improving
optimization that makes things larger.
>
>> One final note. Elsewhere in this thread you described performance concerns.
>> Right now clmuls can be implemented in 4c, fully piped.
>
> Pipelining doesn't matter in the implementation being proposed here, because
> the builtin is expanded to
>
> li a4,quotient
> li a5,polynomial
> xor a0,a1,a0
> clmul a0,a0,a4
> srli a0,a0,crc_size
> clmul a0,a0,a5
> slli a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size
> srli a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size
>
> making CLMULs data-dependent, so the second can only be started one cycle
> after the first finishes, and consecutive invocations of __builtin_crc
> are likewise data-dependent (with three cycles between CLMUL). So even
> when you get CLMUL down to 3c latency, you'll have two CLMULs and 10 cycles
> per input block, while state of the art is one widening CLMUL per input block
> (one CLMUL per 32-bit block on a 64-bit CPU) limited by throughput, not latency.
>
I expect it'll actually be 2c latency. We're approaching the point
where it just won't make that much sense to call out to a library when
you can emit the pair of clmuls and a couple shifts.
jeff
* Re: RISC-V: Added support for CRC.
2023-08-16 19:42 ` Philipp Tomsich
@ 2023-08-16 20:23 ` Paul Koning
2023-08-17 13:36 ` Alexander Monakov
1 sibling, 0 replies; 30+ messages in thread
From: Paul Koning @ 2023-08-16 20:23 UTC (permalink / raw)
To: Philipp Tomsich
Cc: Alexander Monakov, Jeff Law, Mariam Harutyunyan, gcc-patches
> On Aug 16, 2023, at 3:42 PM, Philipp Tomsich <philipp.tomsich@vrull.eu> wrote:
>
> On Wed, 16 Aug 2023 at 21:10, Alexander Monakov <amonakov@ispras.ru> wrote:
>>
>>
>> On Tue, 15 Aug 2023, Jeff Law wrote:
>>
>>> Because if the compiler can optimize it automatically, then the projects have
>>> to do literally nothing to take advantage of it. They just compile normally
>>> and their bitwise CRC gets optimized down to either a table lookup or a clmul
>>> variant. That's the real goal here.
>>
>> The only high-profile FOSS project that carries a bitwise CRC implementation
>> I'm aware of is the 'xz' compression library. There bitwise CRC is used for
>> populating the lookup table under './configure --enable-small':
>>
>> https://github.com/tukaani-project/xz/blob/2b871f4dbffe3801d0da3f89806b5935f758d5f3/src/liblzma/check/crc64_small.c
>>
>> It's a well-reasoned choice and your compiler would be undoing it
>> (reintroducing the table when the bitwise CRC is employed specifically
>> to avoid carrying the table).
Is that compiled with -Os? It would seem sensible for that to be the case, and for the table optimization to be suppressed if that switch is used.
paul
* Re: RISC-V: Added support for CRC.
2023-08-16 19:10 ` Alexander Monakov
@ 2023-08-16 19:42 ` Philipp Tomsich
2023-08-16 20:23 ` Paul Koning
2023-08-17 13:36 ` Alexander Monakov
2023-08-17 3:12 ` Jeff Law
1 sibling, 2 replies; 30+ messages in thread
From: Philipp Tomsich @ 2023-08-16 19:42 UTC (permalink / raw)
To: Alexander Monakov; +Cc: Jeff Law, Mariam Harutyunyan, gcc-patches
On Wed, 16 Aug 2023 at 21:10, Alexander Monakov <amonakov@ispras.ru> wrote:
>
>
> On Tue, 15 Aug 2023, Jeff Law wrote:
>
> > Because if the compiler can optimize it automatically, then the projects have
> > to do literally nothing to take advantage of it. They just compile normally
> > and their bitwise CRC gets optimized down to either a table lookup or a clmul
> > variant. That's the real goal here.
>
> The only high-profile FOSS project that carries a bitwise CRC implementation
> I'm aware of is the 'xz' compression library. There bitwise CRC is used for
> populating the lookup table under './configure --enable-small':
>
> https://github.com/tukaani-project/xz/blob/2b871f4dbffe3801d0da3f89806b5935f758d5f3/src/liblzma/check/crc64_small.c
>
> It's a well-reasoned choice and your compiler would be undoing it
> (reintroducing the table when the bitwise CRC is employed specifically
> to avoid carrying the table).
>
> > One final note. Elsewhere in this thread you described performance concerns.
> > Right now clmuls can be implemented in 4c, fully piped.
>
> Pipelining doesn't matter in the implementation being proposed here, because
> the builtin is expanded to
>
> li a4,quotient
> li a5,polynomial
> xor a0,a1,a0
> clmul a0,a0,a4
> srli a0,a0,crc_size
> clmul a0,a0,a5
> slli a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size
> srli a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size
>
> making CLMULs data-dependent, so the second can only be started one cycle
> after the first finishes, and consecutive invocations of __builtin_crc
> are likewise data-dependent (with three cycles between CLMUL). So even
> when you get CLMUL down to 3c latency, you'll have two CLMULs and 10 cycles
> per input block, while state of the art is one widening CLMUL per input block
> (one CLMUL per 32-bit block on a 64-bit CPU) limited by throughput, not latency.
>
> > I fully expect that latency to drop within the next 12-18 months. In that
> > world, there's not going to be much benefit to using hand-coded libraries vs
> > just letting the compiler do it.
I would also hope that the hand-coded libraries would eventually have
a code path for compilers that support the built-in.
For what it's worth, there now is CRC in Boost:
https://www.boost.org/doc/libs/1_83_0/doc/html/crc.html
Cheers,
philipp.
* Re: RISC-V: Added support for CRC.
2023-08-16 4:12 ` Jeff Law
@ 2023-08-16 19:10 ` Alexander Monakov
2023-08-16 19:42 ` Philipp Tomsich
2023-08-17 3:12 ` Jeff Law
0 siblings, 2 replies; 30+ messages in thread
From: Alexander Monakov @ 2023-08-16 19:10 UTC (permalink / raw)
To: Jeff Law; +Cc: Mariam Harutyunyan, gcc-patches
On Tue, 15 Aug 2023, Jeff Law wrote:
> Because if the compiler can optimize it automatically, then the projects have
> to do literally nothing to take advantage of it. They just compile normally
> and their bitwise CRC gets optimized down to either a table lookup or a clmul
> variant. That's the real goal here.
The only high-profile FOSS project that carries a bitwise CRC implementation
I'm aware of is the 'xz' compression library. There bitwise CRC is used for
populating the lookup table under './configure --enable-small':
https://github.com/tukaani-project/xz/blob/2b871f4dbffe3801d0da3f89806b5935f758d5f3/src/liblzma/check/crc64_small.c
It's a well-reasoned choice and your compiler would be undoing it
(reintroducing the table when the bitwise CRC is employed specifically
to avoid carrying the table).
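[For context, the --enable-small configuration computes the table once at startup from the bitwise definition instead of shipping it in .rodata. A minimal sketch of that pattern (illustrative, not xz's actual code; standard reflected CRC-32, polynomial 0xEDB88320):]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t crc32_table[256];

/* Fill the lookup table at startup from the bitwise loop, trading a
   small amount of init time for 1 KiB of read-only data.  This is the
   bitwise CRC a compiler pass would "helpfully" turn back into a table.  */
static void crc32_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t crc = i;
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
        crc32_table[i] = crc;
    }
}

/* Table-driven CRC-32 (reflected, init and final xor 0xFFFFFFFF).  */
static uint32_t crc32(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = crc32_table[(crc ^ buf[i]) & 0xFF] ^ (crc >> 8);
    return crc ^ 0xFFFFFFFFu;
}
```

[The standard check value for this CRC-32 over the string "123456789" is 0xCBF43926.]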
> One final note. Elsewhere in this thread you described performance concerns.
> Right now clmuls can be implemented in 4c, fully piped.
Pipelining doesn't matter in the implementation being proposed here, because
the builtin is expanded to
li a4,quotient
li a5,polynomial
xor a0,a1,a0
clmul a0,a0,a4
srli a0,a0,crc_size
clmul a0,a0,a5
slli a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size
srli a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size
making CLMULs data-dependent, so the second can only be started one cycle
after the first finishes, and consecutive invocations of __builtin_crc
are likewise data-dependent (with three cycles between CLMUL). So even
when you get CLMUL down to 3c latency, you'll have two CLMULs and 10 cycles
per input block, while state of the art is one widening CLMUL per input block
(one CLMUL per 32-bit block on a 64-bit CPU) limited by throughput, not latency.
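[For reference, the expansion above is a Barrett-style reduction: the first CLMUL estimates the quotient of dividing the shifted message bits by the CRC polynomial, the second multiplies back so the low crc_size bits are the remainder. A portable C model of one 8-bit step, with a software clmul and illustrative constants (MU = floor(x^16 / P) = 0x107, computed here for the CRC-8 polynomial 0x07; these values are not taken from the patch):]

```c
#include <assert.h>
#include <stdint.h>

/* Software carry-less multiply over GF(2)[x], modeling the clmul insn.  */
static uint64_t clmul64(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1)
            r ^= a << i;
    return r;
}

/* Reference: bit-at-a-time CRC-8, polynomial x^8 + x^2 + x + 1 (0x07).  */
static uint8_t crc8_bitwise(uint8_t crc, uint8_t data)
{
    crc ^= data;
    for (int i = 0; i < 8; i++)
        crc = (uint8_t)((crc << 1) ^ ((crc & 0x80) ? 0x07 : 0));
    return crc;
}

/* Barrett-style step mirroring the expansion: xor, clmul, srl, clmul,
   mask.  The two clmuls are data-dependent, which is the latency chain
   being discussed above.  */
static uint8_t crc8_clmul(uint8_t crc, uint8_t data)
{
    uint64_t a = crc ^ data;               /* xor  a0,a1,a0          */
    uint64_t q = clmul64(a, 0x107) >> 8;   /* clmul with quotient;
                                              srli a0,a0,crc_size    */
    return (uint8_t)clmul64(q, 0x107);     /* clmul with polynomial;
                                              slli/srli mask to 8 bits */
}
```

[The cast to uint8_t plays the role of the final slli/srli pair, keeping only the crc_size low bits.]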
> I fully expect that latency to drop within the next 12-18 months. In that
> world, there's not going to be much benefit to using hand-coded libraries vs
> just letting the compiler do it.
...
Alexander
* Re: RISC-V: Added support for CRC.
2023-08-03 19:37 Mariam Harutyunyan
2023-08-03 19:56 ` Andrew Pinski
2023-08-03 20:54 ` Jeff Law
@ 2023-08-16 4:59 ` Jeff Law
2023-08-21 13:39 ` Mariam Harutyunyan
2 siblings, 1 reply; 30+ messages in thread
From: Jeff Law @ 2023-08-16 4:59 UTC (permalink / raw)
To: Mariam Harutyunyan, gcc-patches
[-- Attachment #1: Type: text/plain, Size: 561 bytes --]
On 8/3/23 13:37, Mariam Harutyunyan via Gcc-patches wrote:
> This patch adds CRC support for the RISC-V architecture. It adds internal
> functions and built-ins specifically designed to handle CRC computations
> efficiently.
>
> If the target is ZBC, the clmul instruction is used for the CRC code
> generation; otherwise, table-based CRC is generated. A table with 256
> elements is used to store precomputed CRCs.
>
> These CRC calculation algorithms have higher performance than the naive CRC
> calculation algorithm.
[ ... ]
Various comments attached.
[-- Attachment #2: comments-on-CRC-backend-support --]
[-- Type: text/x-patch, Size: 11804 bytes --]
From 9d2e9023c222501a1d9519bea3d5cdbd32b5a91e Mon Sep 17 00:00:00 2001
From: Mariam Arutunian <mariamarutunian@gmail.com>
Date: Thu, 3 Aug 2023 15:59:57 +0400
Subject: [PATCH] RISC-V: Added support for CRC.
If the target is ZBC, then the clmul instruction is used for the CRC code generation;
otherwise, table-based CRC is generated. A table with 256 elements is used to store precomputed CRCs.
gcc/ChangeLog:
*builtin-types.def (BT_FN_UINT8_UINT8_UINT8_CONST_SIZE): Define.
(BT_FN_UINT16_UINT16_UINT8_CONST_SIZE): Likewise.
(BT_FN_UINT16_UINT16_UINT16_CONST_SIZE): Likewise.
(BT_FN_UINT32_UINT32_UINT8_CONST_SIZE): Likewise.
(BT_FN_UINT32_UINT32_UINT16_CONST_SIZE): Likewise.
(BT_FN_UINT32_UINT32_UINT32_CONST_SIZE): Likewise.
(BT_FN_UINT64_UINT64_UINT8_CONST_SIZE): Likewise.
(BT_FN_UINT64_UINT64_UINT16_CONST_SIZE): Likewise.
(BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
(BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
* builtins.cc (associated_internal_fn): Handle BUILT_IN_CRC8_DATA8,
BUILT_IN_CRC16_DATA8, BUILT_IN_CRC16_DATA16,
BUILT_IN_CRC32_DATA8, BUILT_IN_CRC32_DATA16, BUILT_IN_CRC32_DATA32,
BUILT_IN_CRC64_DATA8, BUILT_IN_CRC64_DATA16, BUILT_IN_CRC64_DATA32,
BUILT_IN_CRC64_DATA64.
* builtins.def (BUILT_IN_CRC8_DATA8): New builtin.
(BUILT_IN_CRC16_DATA8): Likewise.
(BUILT_IN_CRC16_DATA16): Likewise.
(BUILT_IN_CRC32_DATA8): Likewise.
(BUILT_IN_CRC32_DATA16): Likewise.
(BUILT_IN_CRC32_DATA32): Likewise.
(BUILT_IN_CRC64_DATA8): Likewise.
(BUILT_IN_CRC64_DATA16): Likewise.
(BUILT_IN_CRC64_DATA32): Likewise.
(BUILT_IN_CRC64_DATA64): Likewise.
* config/riscv/bitmanip.md (crc<ANYI2:mode><ANYI:mode>4): New expander.
* config/riscv/riscv-protos.h (expand_crc_table_based): Declare.
(expand_crc_using_clmul): Likewise.
* config/riscv/riscv.cc (gf2n_poly_long_div_quotient): New function.
(generate_crc): Likewise.
(generate_crc_table): Likewise.
(expand_crc_table_based): Likewise.
(expand_crc_using_clmul): Likewise.
* config/riscv/riscv.md (UNSPEC_CRC): New unspec for CRC.
* internal-fn.cc (crc_direct): Define.
(expand_crc_optab_fn): New function.
(direct_crc_optab_supported_p): Define.
* internal-fn.def (CRC): New internal optab function.
* optabs.def (crc_optab): New optab.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/crc-builtin-table-target32.c: New test.
* gcc.target/riscv/crc-builtin-table-target64.c: New test.
* gcc.target/riscv/crc-builtin-zbc32.c: New test.
* gcc.target/riscv/crc-builtin-zbc64.c: New test.
---
diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 2eab466a9f8..748d8be384b 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
In general, the ChangeLog entry is just sent like you've done above and
not included as a diff.
diff --git a/gcc/builtin-types.def b/gcc/builtin-types.def
index 43381bc8949..e33837c27d0 100644
--- a/gcc/builtin-types.def
+++ b/gcc/builtin-types.def
@@ -829,6 +829,26 @@ DEF_FUNCTION_TYPE_3 (BT_FN_PTR_SIZE_SIZE_PTRMODE,
BT_PTR, BT_SIZE, BT_SIZE, BT_PTRMODE)
DEF_FUNCTION_TYPE_3 (BT_FN_VOID_PTR_UINT8_PTRMODE, BT_VOID, BT_PTR, BT_UINT8,
BT_PTRMODE)
+DEF_FUNCTION_TYPE_3 (BT_FN_UINT8_UINT8_UINT8_CONST_SIZE, BT_UINT8, BT_UINT8,
+ BT_UINT8, BT_CONST_SIZE)
[ ... ]
Presumably the reason we need so many variants is due to the desire to
support various types for the inputs and outputs?
diff --git a/gcc/config/riscv/bitmanip.md b/gcc/config/riscv/bitmanip.md
index c42e7b890db..4c896303242 100644
--- a/gcc/config/riscv/bitmanip.md
+++ b/gcc/config/riscv/bitmanip.md
@@ -856,3 +856,38 @@
"TARGET_ZBC"
"clmulr\t%0,%1,%2"
[(set_attr "type" "clmul")])
+
+;; Iterator for hardware-supported integer modes, same as ANYI
+(define_mode_iterator ANYI2 [QI HI SI (DI "TARGET_64BIT")])
+
+;; CRC 8, 16, 32, (64 for TARGET_64)
+(define_expand "crc<ANYI2:mode><ANYI:mode>4"
+ ;; return value (calculated CRC)
+ [(set (match_operand:ANYI 0 "register_operand" "=r")
+ ;; initial CRC
+ (unspec:ANYI [(match_operand:ANYI 1 "register_operand" "r")
+ ;; data
+ (match_operand:ANYI2 2 "register_operand" "r")
+ ;; polynomial
+ (match_operand:ANYI 3)]
+ UNSPEC_CRC))]
I'm not real comfortable with directly supporting sub-word sizes for the
output operand. The vast majority of insns on the risc-v port have
outputs that use either the X iterator (which maps to either SI or DI
mode for rv32 and rv64 respectively) or they use the GPR iterator. I
don't think many use ANYI.
Which ultimately makes sense since operations actually write X mode
bits.
Presumably the goal here is to capture cases where the resultant CRC
value is a sub-word size.
+""
+{
+ /* TODO: We only support cases where the data size is not greater
+ than the CRC size. */
+ if (GET_MODE (operands[0]) >= GET_MODE (operands[2]))
+ {
+ /* If we have the ZBC extension (ie, clmul) and
+ it is possible to store the quotient within a single variable
+ (E.g. CRC64's quotient may need 65 bits,
+ we can't keep it in 64 bit variable.)
+ then use clmul instruction to implement the CRC,
+ otherwise generate table-based CRC. */
+ if (TARGET_ZBC && ((TARGET_64BIT && GET_MODE (operands[0]) != DImode)
+ || (!TARGET_64BIT && GET_MODE (operands[0]) < SImode)))
Formatting NIT. We'd tend to prefer to write this as
if (TARGET_ZBC
&& ((TARGET_64BIT && GET_MODE (operands[0]) != DImode)
|| (!TARGET_64BIT && GET_MODE (operands[0]) < SImode)))
But I think the condition you're really looking for is
if (TARGET_ZBC && GET_MODE (operands[0]) == word_mode)
Meaning the output is 32bits for rv32 or 64bits for rv64. If we fix the
mode in the expander pattern to use X, then this condition would
simplify to just (TARGET_ZBC) since the output would be known to be word
sized already.
+ expand_crc_using_clmul (operands);
+ else
+ expand_crc_table_based (operands);
+ }
+ DONE;
+})
\ No newline at end of file
diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index c9520f689e2..35bf19806df 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -131,6 +131,8 @@ extern bool riscv_shamt_matches_mask_p (int, HOST_WIDE_INT);
extern void riscv_subword_address (rtx, rtx *, rtx *, rtx *, rtx *);
extern void riscv_lshift_subword (machine_mode, rtx, rtx, rtx *);
extern enum memmodel riscv_union_memmodels (enum memmodel, enum memmodel);
+extern void expand_crc_table_based (rtx *);
+extern void expand_crc_using_clmul (rtx *);
I wonder if all the table based generation support code should move into a
generic file so that it could be used by other targets. I don't offhand
see anything in there that is RISC-V specific.
/* Routines implemented in riscv-c.cc. */
void riscv_cpu_cpp_builtins (cpp_reader *);
diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 332fa720f01..e15850910a2 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
+/* Generates CRC lookup table by calculating CRC for all possible
+ 8-bit data values. The table is stored with a specific name in the read-only
+ data section. */
+
+rtx
+generate_crc_table (unsigned HOST_WIDE_INT polynom, unsigned crc_bits)
+{
+ gcc_assert (crc_bits <= 64);
+ FILE *out = asm_out_file;
+
+ /* Buf size - 33 letters
+ + 18 for numbers (2 for crc bit size + 2 for 0x + 16 for 64 bit polynomial)
+ + 1 for \0. */
+ char buf[33 + 20 + 1];
+ sprintf (buf, "crc_table_for_%u_bit_crc_%llx_polynomial", crc_bits, polynom);
+ tree id = maybe_get_identifier (buf);
+ if (id)
+ return gen_rtx_SYMBOL_REF (Pmode, IDENTIFIER_POINTER (id));
+ id = get_identifier (buf);
So I think we need to be more careful here with namespace pollution.
Ideally I think we'd want this symbol to be static and const.
+ rtx tab = gen_rtx_SYMBOL_REF (Pmode, IDENTIFIER_POINTER (id));
+
+ /* Create a table with 256 elements. */
+ unsigned table_elements_n = 0x100;
+ char val_align_op[7];
+ if (crc_bits <= 8)
+ sprintf (val_align_op, ".byte");
+ else if (crc_bits <= 16)
+ sprintf (val_align_op, ".half");
+ else if (crc_bits <= 32)
+ sprintf (val_align_op, ".word");
+ else
+ sprintf (val_align_op, ".dword");
There are standard target hooks for emitting data and a routine that
given the size of an object will return the right assembler
directive. Look for integer_asm_op.
More generally, look in varasm.cc for various routines dealing with
generating assembler output.
+
+ /* Write in read only data section. */
There are also standard ways to switch sections and standard section
names based on the target. Again, varasm.cc will be the right place to
wander.
Note that if you were to create a DECL node for the table and attach a
suitable DECL_INITIAL to it with its initialization data I think most,
if not all of the assembly output complexity would be handled for you
automatically.
If that is unwieldly to implement, then you'll definitely want to be
using various helpers from varasm.cc.
+ for (int i = 0; i < GET_MODE_SIZE (data_mode).to_constant (); i++)
+ {
+ /* crc >> (bit_size - 8). */
+ rtx op1 = gen_rtx_ASHIFTRT (crc_mode, crc,
+ GEN_INT (mode_bit_size - 8));
In general, I'd suggest gen_int_mode rather than GEN_INT.
+ /* crc_mode is the return value's mode,
+ depends on CRC function's CRC size. */
+ if (crc_mode != word_mode)
+ ix = gen_rtx_ZERO_EXTEND (word_mode, ix);
+ ix = gen_rtx_ASHIFT (word_mode, ix, GEN_INT (exact_log2 (
+ GET_MODE_SIZE (crc_mode).to_constant ())));
+ ix = force_reg (word_mode, ix);
Hmm, I'm not sure we can depend on the target actually supporting
ZERO_EXTEND. Some targets use AND with suitable immediates. Some might
use shifts, etc.
+
+ /* crc_table[(crc >> (bit_size - 8)) ^ data_8bit] */
+ rtx tab_el = gen_rtx_MEM (crc_mode, gen_rtx_PLUS (word_mode, ix, tab));
Similarly, don't we have to worry about whether or not the resulting
address is actually valid? They probably always are on RISC-V as I
don't think TAB will require more than 12 bits of offsets to address.
But some targets can only do simple register indirect addressing.
+
+ /* (crc << 8) */
+ rtx high = gen_rtx_ASHIFT (crc_mode, crc, GEN_INT (8));
And here I don't think we can necessarily depend on the target
supporting shifts by a constant amount. There's certainly targets out
there that can only shift one or two positions at a time. Those targets
typically lie and claim they have more general shifting capabilities,
but the point is we need to be much more careful about the RTL we
generate.
+ if (crc_mode != word_mode)
+ high = force_reg (crc_mode,
+ gen_rtx_AND (crc_mode, high,
+ GEN_INT (GET_MODE_MASK (crc_mode))));
Similarly here, some targets may not support AND with immediate or may
not have enough bits to handle all the immediates that can arise here.
+ /* crc = (crc << 8) ^ crc_table[(crc >> (bit_size - 8)) ^ data_8bit]; */
+ crc = force_reg (crc_mode, gen_rtx_XOR (crc_mode, tab_el, high));
+ }
+ riscv_emit_move (operands[0], crc);
CRC is a REG. Isn't operands[0] also a REG? Can we just use
emit_move_insn here rather than riscv_emit_move? Conceptually this code
should be target independent, so embedding risc-v isms in here will make
it harder to use on other targets.
@@ -3996,6 +4028,7 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
#define direct_cond_len_unary_optab_supported_p direct_optab_supported_p
#define direct_cond_len_binary_optab_supported_p direct_optab_supported_p
#define direct_cond_len_ternary_optab_supported_p direct_optab_supported_p
+#define direct_crc_optab_supported_p convert_optab_supported_p
So did we ever figure out why we're treating the CRC optab as a
conversion optab?
* Re: RISC-V: Added support for CRC.
2023-08-09 13:02 ` Paul Koning
@ 2023-08-16 4:15 ` Jeff Law
0 siblings, 0 replies; 30+ messages in thread
From: Jeff Law @ 2023-08-16 4:15 UTC (permalink / raw)
To: Paul Koning, Alexander Monakov; +Cc: Mariam Harutyunyan, gcc-patches
On 8/9/23 07:02, Paul Koning wrote:
>
>
>> On Aug 9, 2023, at 2:32 AM, Alexander Monakov <amonakov@ispras.ru> wrote:
>>
>>
>> On Tue, 8 Aug 2023, Jeff Law wrote:
>>
>>> If the compiler can identify a CRC and collapse it down to a table or clmul,
>>> that's a major win and such code does exist in the real world. That was the
>>> whole point behind the Fedora experiment -- to determine if these things are
>>> showing up in the real world or if this is just a benchmarking exercise.
>>
>> Can you share the results of the experiment and give your estimate of what
>> sort of real-world improvement is expected? I already listed the popular
>> FOSS projects where CRC performance is important: the Linux kernel and
>> a few compression libraries. Those projects do not use a bitwise CRC loop,
>> except sometimes for table generation on startup (which needs less time
>> than a page fault that may be necessary to bring in a hardcoded table).
>>
>> For those projects that need a better CRC, why is the chosen solution
>> to optimize it in the compiler instead of offering them a library they
>> could use with any compiler?
>>
>> Was there any thought given to embedded projects that use bitwise CRC
>> exactly because they have little space for a hardcoded table to spare?
>
> Or those that use smaller tables -- for example, the classic VAX microcode approach with a 16-entry table, doing CRC 4 bits at a time.
Yup. I think we settled on 8 bits at a time for the table variant. It
seemed like a good tradeoff between size of the tables and speed.
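[The 4-bit variant Paul describes needs only a 16-entry table (64 bytes for CRC-32) at the cost of two lookups per byte. A sketch for the standard reflected CRC-32 polynomial (illustrative; the VAX microcode used its own polynomial and table):]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t nib_table[16];   /* 64 bytes instead of 1 KiB */

/* Build the 16-entry table from the bitwise definition, 4 shift steps
   per entry.  */
static void crc32_nibble_init(void)
{
    for (uint32_t i = 0; i < 16; i++) {
        uint32_t crc = i;
        for (int b = 0; b < 4; b++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
        nib_table[i] = crc;
    }
}

/* Two 4-bit lookups per byte, low nibble first (reflected CRC).  */
static uint32_t crc32_nibble(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc = nib_table[(crc ^ buf[i]) & 0xF] ^ (crc >> 4);
        crc = nib_table[(crc ^ (uint32_t)(buf[i] >> 4)) & 0xF] ^ (crc >> 4);
    }
    return crc ^ 0xFFFFFFFFu;
}
```

[Byte-at-a-time halves the lookups but costs 16x the table space; that is the size/speed tradeoff being weighed here.]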
>
> I agree that this seems an odd thing to optimize. CRC is a well known CPU hog with well established efficient solutions, and it's hard to see why anyone who needs good performance would fail to understand and apply that knowledge.
As I've said, what started us down this path was Coremark. But what
convinced me that this was useful beyond juicing benchmark data was
finding the various implementations in the wild.
Jeff
* Re: RISC-V: Added support for CRC.
2023-08-09 6:32 ` Alexander Monakov
2023-08-09 13:02 ` Paul Koning
@ 2023-08-16 4:12 ` Jeff Law
2023-08-16 19:10 ` Alexander Monakov
1 sibling, 1 reply; 30+ messages in thread
From: Jeff Law @ 2023-08-16 4:12 UTC (permalink / raw)
To: Alexander Monakov; +Cc: Mariam Harutyunyan, gcc-patches
On 8/9/23 00:32, Alexander Monakov wrote:
>
> On Tue, 8 Aug 2023, Jeff Law wrote:
>
>> If the compiler can identify a CRC and collapse it down to a table or clmul,
>> that's a major win and such code does exist in the real world. That was the
>> whole point behind the Fedora experiment -- to determine if these things are
>> showing up in the real world or if this is just a benchmarking exercise.
>
> Can you share the results of the experiment and give your estimate of what
> sort of real-world improvement is expected? I already listed the popular
> FOSS projects where CRC performance is important: the Linux kernel and
> a few compression libraries. Those projects do not use a bitwise CRC loop,
> except sometimes for table generation on startup (which needs less time
> than a page fault that may be necessary to bring in a hardcoded table).
That experiment was ~7 months ago. I don't think any of the data is
still around except for some extracted testcases.
>
> For those projects that need a better CRC, why is the chosen solution
> to optimize it in the compiler instead of offering them a library they
> could use with any compiler?
Because if the compiler can optimize it automatically, then the projects
have to do literally nothing to take advantage of it. They just compile
normally and their bitwise CRC gets optimized down to either a table
lookup or a clmul variant. That's the real goal here.
If a step where we provide the backend bits hooked up to a builtin isn't
useful, then we won't pursue it. The thinking was it would provide
value for those willing to make a slight change to their sources and at
the same time we get real world exposure for the backend work of the CRC
optimization effort while we polish the gimple detection bits.
>
> Was there any thought given to embedded projects that use bitwise CRC
> exactly because they have little space for a hardcoded table to spare?
It wasn't an explicit goal, but the ability to select between a table
implementation and a clmul implementation in the backend seemed useful,
so we wired up both.
>
> No, not if the compiler is not GCC, or its version is less than 14. And
> those projects are not going to sacrifice their portability just for
> __builtin_crc.
You may be right. I don't think it's so clear-cut, though.
>
>>> I think offering a conventional library for CRC has substantial advantages.
>> That's not what I asked. If you think there's room for improvement to a
>> builtin API, I'd love to hear it.
>>
>> But it seems you don't think this is worth the effort at all. That's
>> unfortunate, but if that's the consensus, then so be it.
>
> I think it's a strange application of development effort. You'd get more
> done coding a library.
Not if the end goal is to detect the CRC and optimize it into a table or
clmul without the user having to do anything special.
Again, what we've proposed in this patch is a piece of that larger body
of work, specifically the backend bits that we thought would have value
independently. If the community doesn't see that carved out chunk as
helpful we'll table it until the whole end-to-end path is ready for
submission.
>
>> I'll note LLVM is likely going forward with CRC detection and optimization at
>> some point in the next ~6 months (effectively moving the implementation from
>> the hexagon port into the generic parts of their loop optimizer).
>
> I don't see CRC detection in the Hexagon port. There is a recognizer for
> polynomial multiplication (CRC is division, not multiplication).
Yes, you need the recognizer so that you can detect a CRC loop, then
with a bit of math you turn that into a carryless multiply sequence. I
find the math here mindbending, but the Hexagon bits are precisely to
optimize CRC loops. Sadly the Hexagon bits are fairly specific to the
CRC implementation inside coremark. The GCC bits we've been working on
are much more general.
One final note. Elsewhere in this thread you described performance
concerns. Right now clmuls can be implemented in 4c, fully piped. I
fully expect that latency to drop within the next 12-18 months. In that
world, there's not going to be much benefit to using hand-coded
libraries vs just letting the compiler do it.
Jeff
* Re: RISC-V: Added support for CRC.
2023-08-09 6:32 ` Alexander Monakov
@ 2023-08-09 13:02 ` Paul Koning
2023-08-16 4:15 ` Jeff Law
2023-08-16 4:12 ` Jeff Law
1 sibling, 1 reply; 30+ messages in thread
From: Paul Koning @ 2023-08-09 13:02 UTC (permalink / raw)
To: Alexander Monakov; +Cc: Jeff Law, Mariam Harutyunyan, gcc-patches
> On Aug 9, 2023, at 2:32 AM, Alexander Monakov <amonakov@ispras.ru> wrote:
>
>
> On Tue, 8 Aug 2023, Jeff Law wrote:
>
>> If the compiler can identify a CRC and collapse it down to a table or clmul,
>> that's a major win and such code does exist in the real world. That was the
>> whole point behind the Fedora experiment -- to determine if these things are
>> showing up in the real world or if this is just a benchmarking exercise.
>
> Can you share the results of the experiment and give your estimate of what
> sort of real-world improvement is expected? I already listed the popular
> FOSS projects where CRC performance is important: the Linux kernel and
> a few compression libraries. Those projects do not use a bitwise CRC loop,
> except sometimes for table generation on startup (which needs less time
> than a page fault that may be necessary to bring in a hardcoded table).
>
> For those projects that need a better CRC, why is the chosen solution
> to optimize it in the compiler instead of offering them a library they
> could use with any compiler?
>
> Was there any thought given to embedded projects that use bitwise CRC
> exactly because they have little space for a hardcoded table to spare?
Or those that use smaller tables -- for example, the classic VAX microcode approach with a 16-entry table, doing CRC 4 bits at a time.
I agree that this seems an odd thing to optimize. CRC is a well known CPU hog with well established efficient solutions, and it's hard to see why anyone who needs good performance would fail to understand and apply that knowledge.
paul
* Re: RISC-V: Added support for CRC.
2023-08-08 23:17 ` Jeff Law
2023-08-08 23:31 ` Andrew Pinski
@ 2023-08-09 6:32 ` Alexander Monakov
2023-08-09 13:02 ` Paul Koning
2023-08-16 4:12 ` Jeff Law
1 sibling, 2 replies; 30+ messages in thread
From: Alexander Monakov @ 2023-08-09 6:32 UTC (permalink / raw)
To: Jeff Law; +Cc: Mariam Harutyunyan, gcc-patches
On Tue, 8 Aug 2023, Jeff Law wrote:
> If the compiler can identify a CRC and collapse it down to a table or clmul,
> that's a major win and such code does exist in the real world. That was the
> whole point behind the Fedora experiment -- to determine if these things are
> showing up in the real world or if this is just a benchmarking exercise.
Can you share the results of the experiment and give your estimate of what
sort of real-world improvement is expected? I already listed the popular
FOSS projects where CRC performance is important: the Linux kernel and
a few compression libraries. Those projects do not use a bitwise CRC loop,
except sometimes for table generation on startup (which needs less time
than a page fault that may be necessary to bring in a hardcoded table).
For those projects that need a better CRC, why is the chosen solution
to optimize it in the compiler instead of offering them a library they
could use with any compiler?
Was there any thought given to embedded projects that use bitwise CRC
exactly because they have little space for a hardcoded table to spare?
> > Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
> > These consumers need high-performance blockwise CRC, offering them
> > a latency-bound elementwise CRC primitive is a disservice. And what
> > should they use as a fallback when __builtin_crc is unavailable?
> THe point is builtin_crc would always be available. If there is no clmul,
> then the RTL backend can expand to a table lookup version.
No, not if the compiler is not GCC, or its version is less than 14. And
those projects are not going to sacrifice their portability just for
__builtin_crc.
> > I think offering a conventional library for CRC has substantial advantages.
> That's not what I asked. If you think there's room for improvement to a
> builtin API, I'd love to hear it.
>
> But it seems you don't think this is worth the effort at all. That's
> unfortunate, but if that's the consensus, then so be it.
I think it's a strange application of development effort. You'd get more
done coding a library.
> I'll note LLVM is likely going forward with CRC detection and optimization at
> some point in the next ~6 months (effectively moving the implementation from
> the hexagon port into the generic parts of their loop optimizer).
I don't see CRC detection in the Hexagon port. There is a recognizer for
polynomial multiplication (CRC is division, not multiplication).
Alexander
* Re: RISC-V: Added support for CRC.
2023-08-08 23:31 ` Andrew Pinski
@ 2023-08-09 0:01 ` Jeff Law
0 siblings, 0 replies; 30+ messages in thread
From: Jeff Law @ 2023-08-09 0:01 UTC (permalink / raw)
To: Andrew Pinski; +Cc: Alexander Monakov, Mariam Harutyunyan, gcc-patches
On 8/8/23 17:31, Andrew Pinski wrote:
> On Tue, Aug 8, 2023 at 4:17 PM Jeff Law via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>>
>>
>> On 8/8/23 10:38, Alexander Monakov wrote:
>>>
>>> On Tue, 8 Aug 2023, Jeff Law wrote:
>>>
>>>> That was my thinking at one time. Then we started looking at the distros and
>>>> found enough crc implementations in there to change my mind about the overall
>>>> utility.
>>>
>>> The ones I'm familiar with are all table-based and look impossible to
>>> pattern-match (and hence already fairly efficient compared to the bitwise
>>> loop in Coremark).
>> We found dozens that were the usual looking loops and, IIRC ~200 table
>> lookups after analyzing about half of the packages in Fedora.
>
> I will make a note we do handle table lookups to detect count leading
> zeros, see check_ctz_array in tree-ssa-forwprop.cc for that detection.
> (that was done to improve a SPEC benchmark even).
> So if the tables are statically defined at compile time, there is
> already an example of how it can be detected too.
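[The ctz detection Andrew mentions matches statically initialized lookup tables such as the classic de Bruijn trick; a sketch of the idiom it can recognize (illustrative, not the SPEC code in question):]

```c
#include <assert.h>
#include <stdint.h>

/* A statically defined de Bruijn table for count-trailing-zeros: the
   kind of compile-time-constant table that check_ctz_array in
   tree-ssa-forwprop.cc can prove equivalent to a ctz instruction.  */
static const int ctz_table[32] = {
     0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
};

/* Valid for nonzero x: isolate the lowest set bit, multiply by a
   de Bruijn constant, and index the table with the top 5 bits.  */
static int ctz32(uint32_t x)
{
    return ctz_table[((x & -x) * 0x077CB531u) >> 27];
}
```

[Once the table contents are proven to encode ctz, the load can be replaced by the target's ctz insn; the CRC case is harder because the table contents depend on the polynomial.]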
Detecting CRC is a bit more complex than ctz/popcount and friends
because of the field polynomial's impact on the table.
But there is some value in detecting a table version, then converting it
to clmul. I don't remember the data offhand, but clmul was noticeably
better than a table.
Jeff
* Re: RISC-V: Added support for CRC.
2023-08-08 23:17 ` Jeff Law
@ 2023-08-08 23:31 ` Andrew Pinski
2023-08-09 0:01 ` Jeff Law
2023-08-09 6:32 ` Alexander Monakov
1 sibling, 1 reply; 30+ messages in thread
From: Andrew Pinski @ 2023-08-08 23:31 UTC (permalink / raw)
To: Jeff Law; +Cc: Alexander Monakov, Mariam Harutyunyan, gcc-patches
On Tue, Aug 8, 2023 at 4:17 PM Jeff Law via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
>
>
> On 8/8/23 10:38, Alexander Monakov wrote:
> >
> > On Tue, 8 Aug 2023, Jeff Law wrote:
> >
> >> That was my thinking at one time. Then we started looking at the distros and
> >> found enough crc implementations in there to change my mind about the overall
> >> utility.
> >
> > The ones I'm familiar with are all table-based and look impossible to
> > pattern-match (and hence already fairly efficient compared to the bitwise
> > loop in Coremark).
> We found dozens that were the usual-looking loops and, IIRC, ~200 table
> lookups after analyzing about half of the packages in Fedora.
I will make a note we do handle table lookups to detect count leading
zeros, see check_ctz_array in tree-ssa-forwprop.cc for that detection.
(that was done to improve a SPEC benchmark even).
So if the tables are statically defined at compile time, there is
already an example of how it can be detected too.
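For reference, the statically-defined-table idiom that check_ctz_array matches looks roughly like the classic de Bruijn multiply-and-lookup trick (a sketch using the well-known 32-bit constant; not code taken from GCC or SPEC):

```c
#include <stdint.h>

/* Count trailing zeros via a 32-entry de Bruijn lookup table.
   (x & -x) isolates the lowest set bit; multiplying by the de
   Bruijn constant hashes it to a unique 5-bit index, and the
   table maps that index back to the bit position.  Undefined
   for x == 0, like a raw hardware ctz.  */
static const int debruijn_ctz[32] = {
   0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
  31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
};

int
ctz32 (uint32_t x)
{
  return debruijn_ctz[((x & -x) * 0x077CB531u) >> 27];
}
```

Since the table contents are compile-time constants, a pass can verify that the lookup computes ctz for every possible input.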
Thanks,
Andrew Pinski
>
>
> >
> > So... just provide a library? Library code is easier to develop and audit,
> > it can be released independently, people can use it with their compiler of
> > choice. Not everything needs to be in libgcc.
> If the compiler can identify a CRC and collapse it down to a table or
> clmul, that's a major win and such code does exist in the real world.
> That was the whole point behind the Fedora experiment -- to determine if
> these things are showing up in the real world or if this is just a
> benchmarking exercise.
>
> And just to be clear, we're not proposing anything for libgcc.
>
> >
> > I'm talking about factoring a long chain into multiple independent chains
> > for latency hiding.
> And that could potentially be an extension. But even without this a
> standard looking CRC loop will be much faster using table lookups or
> simple generation with clmul.
>
> Also note that latency of clmuls is improving on modern hardware. 4c
> isn't hard to achieve and I wouldn't be surprised to see 2c clmuls in
> the near future.
>
>
> >
> > Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
> > These consumers need high-performance blockwise CRC, offering them
> > a latency-bound elementwise CRC primitive is a disservice. And what
> > should they use as a fallback when __builtin_crc is unavailable?
> The point is __builtin_crc would always be available. If there is no
> clmul, then the RTL backend can expand to a table lookup version.
>
> >
> >> while at the same time putting one side of the infrastructure we need for
> >> automatic detection of CRC loops and turning them into table lookups or
> >> CLMULs.
> >>
> >> With that in mind I'm certain Mariam & I would love feedback on a builtin API
> >> that would be more useful.
> >
> > I think offering a conventional library for CRC has substantial advantages.
> That's not what I asked. If you think there's room for improvement to a
> builtin API, I'd love to hear it.
>
> But it seems you don't think this is worth the effort at all. That's
> unfortunate, but if that's the consensus, then so be it.
>
> I'll note LLVM is likely going forward with CRC detection and
> optimization at some point in the next ~6 months (effectively moving the
> implementation from the hexagon port into the generic parts of their
> loop optimizer).
>
>
>
> Jeff
* Re: RISC-V: Added support for CRC.
2023-08-08 16:38 ` Alexander Monakov
@ 2023-08-08 23:17 ` Jeff Law
2023-08-08 23:31 ` Andrew Pinski
2023-08-09 6:32 ` Alexander Monakov
0 siblings, 2 replies; 30+ messages in thread
From: Jeff Law @ 2023-08-08 23:17 UTC (permalink / raw)
To: Alexander Monakov; +Cc: Mariam Harutyunyan, gcc-patches
On 8/8/23 10:38, Alexander Monakov wrote:
>
> On Tue, 8 Aug 2023, Jeff Law wrote:
>
>> That was my thinking at one time. Then we started looking at the distros and
>> found enough crc implementations in there to change my mind about the overall
>> utility.
>
> The ones I'm familiar with are all table-based and look impossible to
> pattern-match (and hence already fairly efficient compared to the bitwise
> loop in Coremark).
We found dozens that were the usual-looking loops and, IIRC, ~200 table
lookups after analyzing about half of the packages in Fedora.
>
> So... just provide a library? Library code is easier to develop and audit,
> it can be released independently, people can use it with their compiler of
> choice. Not everything needs to be in libgcc.
If the compiler can identify a CRC and collapse it down to a table or
clmul, that's a major win and such code does exist in the real world.
That was the whole point behind the Fedora experiment -- to determine if
these things are showing up in the real world or if this is just a
benchmarking exercise.
And just to be clear, we're not proposing anything for libgcc.
>
> I'm talking about factoring a long chain into multiple independent chains
> for latency hiding.
And that could potentially be an extension. But even without this a
standard looking CRC loop will be much faster using table lookups or
simple generation with clmul.
Also note that latency of clmuls is improving on modern hardware. 4c
isn't hard to achieve and I wouldn't be surprised to see 2c clmuls in
the near future.
>
> Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
> These consumers need high-performance blockwise CRC, offering them
> a latency-bound elementwise CRC primitive is a disservice. And what
> should they use as a fallback when __builtin_crc is unavailable?
The point is __builtin_crc would always be available. If there is no
clmul, then the RTL backend can expand to a table lookup version.
>
>> while at the same time putting one side of the infrastructure we need for
>> automatic detection of CRC loops and turning them into table lookups or
>> CLMULs.
>>
>> With that in mind I'm certain Mariam & I would love feedback on a builtin API
>> that would be more useful.
>
> I think offering a conventional library for CRC has substantial advantages.
That's not what I asked. If you think there's room for improvement to a
builtin API, I'd love to hear it.
But it seems you don't think this is worth the effort at all. That's
unfortunate, but if that's the consensus, then so be it.
I'll note LLVM is likely going forward with CRC detection and
optimization at some point in the next ~6 months (effectively moving the
implementation from the hexagon port into the generic parts of their
loop optimizer).
Jeff
* Re: RISC-V: Added support for CRC.
2023-08-08 15:31 ` Jeff Law
@ 2023-08-08 16:38 ` Alexander Monakov
2023-08-08 23:17 ` Jeff Law
0 siblings, 1 reply; 30+ messages in thread
From: Alexander Monakov @ 2023-08-08 16:38 UTC (permalink / raw)
To: Jeff Law; +Cc: Mariam Harutyunyan, gcc-patches
On Tue, 8 Aug 2023, Jeff Law wrote:
> That was my thinking at one time. Then we started looking at the distros and
> found enough crc implementations in there to change my mind about the overall
> utility.
The ones I'm familiar with are all table-based and look impossible to
pattern-match (and hence already fairly efficient compared to the bitwise
loop in Coremark).
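For context, the table-based implementations in question typically look like this generic zlib-style sketch (reflected CRC-32; not code from any particular package):

```c
#include <stdint.h>
#include <stddef.h>

static uint32_t crc_table[256];

/* Precompute one table entry per byte value for the reflected
   CRC-32 polynomial 0xEDB88320.  */
static void
init_crc_table (void)
{
  for (int i = 0; i < 256; i++)
    {
      uint32_t c = (uint32_t) i;
      for (int k = 0; k < 8; k++)
	c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
      crc_table[i] = c;
    }
}

/* One table lookup per input byte.  Hard to pattern-match because
   the polynomial is baked into the table contents rather than
   appearing as a constant in the loop.  */
uint32_t
crc32_table_based (uint32_t crc, const unsigned char *buf, size_t len)
{
  crc = ~crc;
  while (len--)
    crc = crc_table[(crc ^ *buf++) & 0xFF] ^ (crc >> 8);
  return ~crc;
}
```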
> If we need to do something to make it more useful, we're certainly open to
> that.
So... just provide a library? Library code is easier to develop and audit,
it can be released independently, people can use it with their compiler of
choice. Not everything needs to be in libgcc.
> > - they overlap multiple CLMUL chains to make the loop throughput-bound
> > rather than latency-bound. The typical unroll factor is about 4x-8x.
> We do have the ability to build longer chains. We actually use that in the
> coremark benchmark where the underlying primitives are 8-bit CRCs that are
> composed into 16/32 bit CRCs.
I'm talking about factoring a long chain into multiple independent chains
for latency hiding.
> > Hence, I am concerned that proposed __builtin_crc is not useful for FOSS
> > that actually needs high-performance CRC (the Linux kernel, compression
> > and image libraries).
> >
> > I think this crosses the line of "cheating in benchmarks" and not something
> > we should do in GCC.
> Certainly not the intention. The intention is to provide a useful __builtin_crc
Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
These consumers need high-performance blockwise CRC, offering them
a latency-bound elementwise CRC primitive is a disservice. And what
should they use as a fallback when __builtin_crc is unavailable?
> while at the same time putting one side of the infrastructure we need for
> automatic detection of CRC loops and turning them into table lookups or
> CLMULs.
>
> With that in mind I'm certain Mariam & I would love feedback on a builtin API
> that would be more useful.
I think offering a conventional library for CRC has substantial advantages.
Alexander
* Re: RISC-V: Added support for CRC.
2023-08-08 15:22 ` Alexander Monakov
@ 2023-08-08 15:31 ` Jeff Law
2023-08-08 16:38 ` Alexander Monakov
0 siblings, 1 reply; 30+ messages in thread
From: Jeff Law @ 2023-08-08 15:31 UTC (permalink / raw)
To: Alexander Monakov; +Cc: Mariam Harutyunyan, gcc-patches
On 8/8/23 09:22, Alexander Monakov wrote:
> Jeff, as I understand this all is happening only because Coremark contains
> use of bitwise CRC that affects benchmark scores. In another universe where
>
> - Coremark was careful to checksum outputs outside of timed sections, or
> - implemented CRC in a manner that is not transparent to the compiler, or
> - did not use CRC at all
>
> we would not be spending effort on this, correct? At best we might have
> a discussion on providing a __builtin_clmul for carry-less multiplication
> (which _is_ a fundamental primitive, unlike __builtin_crc), and move on.
That was my thinking at one time. Then we started looking at the
distros and found enough crc implementations in there to change my mind
about the overall utility.
>
> Note that the proposed __builtin_crc does not match how a high-performance CRC
> over a variable-size array is implemented. You don't want to do two
> back-to-back CLMULs to compute a new CRC given an old CRC. That makes your
> loop latency-constrained to 2*L*N where L is latency of the CLMUL instruction
> and N is the number of loop iterations.
If we need to do something to make it more useful, we're certainly open
to that.
>
> Instead, efficient CRC loops have the following structure:
>
> - they carry an unreduced remainder in the loop, performing final reduction
> modulo polynomial only once after the loop — this halves the CLMUL count;
>
> - they overlap multiple CLMUL chains to make the loop throughput-bound
> rather than latency-bound. The typical unroll factor is about 4x-8x.
We do have the ability to build longer chains. We actually use that in
the Coremark benchmark where the underlying primitives are 8-bit CRCs
that are composed into 16/32 bit CRCs.
>
> A relatively easy to follow explanation is provided by Pete Cawley at
> https://www.corsix.org/content/alternative-exposition-crc32_4k_pclmulqdq
> (there are other sources for similar optimization of table-based CRC).
>
> Also note that in __builtin_crc care is needed regarding how the
> polynomial is specified (which term is dropped, and what bit order is used).
Absolutely.
>
> Hence, I am concerned that the proposed __builtin_crc is not useful for FOSS
> that actually needs high-performance CRC (the Linux kernel, compression
> and image libraries).
>
> I think this crosses the line of "cheating in benchmarks" and is not something
> we should do in GCC.
Certainly not the intention. The intention is to provide a useful
__builtin_crc while at the same time putting one side of the
infrastructure we need for automatic detection of CRC loops and turning
them into table lookups or CLMULs.
With that in mind I'm certain Mariam & I would love feedback on a
builtin API that would be more useful.
jeff
* Re: RISC-V: Added support for CRC.
2023-08-03 20:54 ` Jeff Law
@ 2023-08-08 15:22 ` Alexander Monakov
2023-08-08 15:31 ` Jeff Law
0 siblings, 1 reply; 30+ messages in thread
From: Alexander Monakov @ 2023-08-08 15:22 UTC (permalink / raw)
To: Jeff Law; +Cc: Mariam Harutyunyan, gcc-patches
On Thu, 3 Aug 2023, Jeff Law wrote:
> The end goal here is to actually detect bitwise CRC implementations in the
> gimple optimizers and turn them into table lookups or carryless multiplies in
> RTL.
>
> Mariam has that working end-to-end and has proposed a talk for the Cauldron on
> the topic.
>
> The idea here is to carve out the RTL side which we think provides potential
> value to end users (the ability to use the builtin to get a performant CRC
> implementation) and to get community feedback on the implementation.
Jeff, as I understand this all is happening only because Coremark contains
use of bitwise CRC that affects benchmark scores. In another universe where
- Coremark was careful to checksum outputs outside of timed sections, or
- implemented CRC in a manner that is not transparent to the compiler, or
- did not use CRC at all
we would not be spending effort on this, correct? At best we might have
a discussion on providing a __builtin_clmul for carry-less multiplication
(which _is_ a fundamental primitive, unlike __builtin_crc), and move on.
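(For the record, carry-less multiplication itself is simple to specify. A portable sketch of what a 32x32->64 clmul fallback could compute when no clmul/pclmul instruction is available; the function name is illustrative, not an existing builtin:)

```c
#include <stdint.h>

/* Carry-less (GF(2)) multiply: like schoolbook multiplication,
   but partial products are combined with XOR instead of
   addition, so there are no carries between bit positions.  */
uint64_t
clmul32 (uint32_t a, uint32_t b)
{
  uint64_t r = 0;
  for (int i = 0; i < 32; i++)
    if ((b >> i) & 1)
      r ^= (uint64_t) a << i;
  return r;
}
```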
Note that the proposed __builtin_crc does not match how a high-performance CRC
over a variable-size array is implemented. You don't want to do two
back-to-back CLMULs to compute a new CRC given an old CRC. That makes your
loop latency-constrained to 2*L*N where L is latency of the CLMUL instruction
and N is the number of loop iterations.
Instead, efficient CRC loops have the following structure:
- they carry an unreduced remainder in the loop, performing final reduction
modulo polynomial only once after the loop — this halves the CLMUL count;
- they overlap multiple CLMUL chains to make the loop throughput-bound
rather than latency-bound. The typical unroll factor is about 4x-8x.
A relatively easy to follow explanation is provided by Pete Cawley at
https://www.corsix.org/content/alternative-exposition-crc32_4k_pclmulqdq
(there are other sources for similar optimization of table-based CRC).
Also note that in __builtin_crc care is needed regarding how the
polynomial is specified (which term is dropped, and what bit order is used).
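(To illustrate the ambiguity: the CRC-32 generator can be written MSB-first with the x^32 term dropped, 0x04C11DB7, or bit-reflected, 0xEDB88320, and an API has to pin down which convention it accepts. The two encodings are related by a plain bit reversal:)

```c
#include <stdint.h>

/* Reverse the 32 bits of X.  Reversing the retained coefficients
   of the "normal" CRC-32 polynomial 0x04C11DB7 yields the
   "reflected" form 0xEDB88320; the x^32 term is implicit in both
   representations.  */
uint32_t
bitrev32 (uint32_t x)
{
  uint32_t r = 0;
  for (int i = 0; i < 32; i++)
    r = (r << 1) | ((x >> i) & 1);
  return r;
}
```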
Hence, I am concerned that the proposed __builtin_crc is not useful for FOSS
that actually needs high-performance CRC (the Linux kernel, compression
and image libraries).
I think this crosses the line of "cheating in benchmarks" and is not something
we should do in GCC.
Alexander
* Re: RISC-V: Added support for CRC.
2023-08-03 19:37 Mariam Harutyunyan
2023-08-03 19:56 ` Andrew Pinski
@ 2023-08-03 20:54 ` Jeff Law
2023-08-08 15:22 ` Alexander Monakov
2023-08-16 4:59 ` Jeff Law
2 siblings, 1 reply; 30+ messages in thread
From: Jeff Law @ 2023-08-03 20:54 UTC (permalink / raw)
To: Mariam Harutyunyan, gcc-patches
On 8/3/23 13:37, Mariam Harutyunyan via Gcc-patches wrote:
> This patch adds CRC support for the RISC-V architecture. It adds internal
> functions and built-ins specifically designed to handle CRC computations
> efficiently.
>
> If the target supports the Zbc extension, the clmul instruction is used for the CRC code
> generation; otherwise, table-based CRC is generated. A table with 256
> elements is used to store precomputed CRCs.
>
> These CRC calculation algorithms have higher performance than the naive CRC
> calculation algorithm.
So just a little bit of additional information for others who might be
looking at this thread.
The end goal here is to actually detect bitwise CRC implementations in
the gimple optimizers and turn them into table lookups or carryless
multiplies in RTL.
Mariam has that working end-to-end and has proposed a talk for the
Cauldron on the topic.
The idea here is to carve out the RTL side which we think provides
potential value to end users (the ability to use the builtin to get a
performant CRC implementation) and to get community feedback on the
implementation.
So this isn't the end of the line for this work, just the first carved
out chunk.
Jeff
* Re: RISC-V: Added support for CRC.
2023-08-03 19:56 ` Andrew Pinski
@ 2023-08-03 19:59 ` Mariam Harutyunyan
0 siblings, 0 replies; 30+ messages in thread
From: Mariam Harutyunyan @ 2023-08-03 19:59 UTC (permalink / raw)
To: Andrew Pinski; +Cc: gcc-patches
Hi.
Thank you. I'll add.
Best regards,
Mariam
On Thu, Aug 3, 2023, 23:56 Andrew Pinski <pinskia@gmail.com> wrote:
> On Thu, Aug 3, 2023 at 12:38 PM Mariam Harutyunyan via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > This patch adds CRC support for the RISC-V architecture. It adds internal
> > functions and built-ins specifically designed to handle CRC computations
> > efficiently.
> >
> > If the target supports the Zbc extension, the clmul instruction is used for the CRC code
> > generation; otherwise, table-based CRC is generated. A table with 256
> > elements is used to store precomputed CRCs.
> >
> > These CRC calculation algorithms have higher performance than the naive
> CRC
> > calculation algorithm.
>
> A few things about this patch:
> You created a generic (non-target-specific) builtin but didn't
> document it in doc/extend.texi.
> You created a generic builtin with no fallback in libgcc.
> You created a new named (RTL) pattern, crc, and didn't document it in
> the `Standard Names` section of doc/md.texi.
>
> Thanks,
> Andrew Pinski
>
> >
> > gcc/ChangeLog:
> > * builtin-types.def (BT_FN_UINT8_UINT8_UINT8_CONST_SIZE):
> Define.
> > (BT_FN_UINT16_UINT16_UINT8_CONST_SIZE): Likewise.
> > (BT_FN_UINT16_UINT16_UINT16_CONST_SIZE): Likewise.
> > (BT_FN_UINT32_UINT32_UINT8_CONST_SIZE): Likewise.
> > (BT_FN_UINT32_UINT32_UINT16_CONST_SIZE): Likewise.
> > (BT_FN_UINT32_UINT32_UINT32_CONST_SIZE): Likewise.
> > (BT_FN_UINT64_UINT64_UINT8_CONST_SIZE): Likewise.
> > (BT_FN_UINT64_UINT64_UINT16_CONST_SIZE): Likewise.
> > (BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
> > (BT_FN_UINT64_UINT64_UINT64_CONST_SIZE): Likewise.
> > * builtins.cc (associated_internal_fn): Handle
> > BUILT_IN_CRC8_DATA8,
> > BUILT_IN_CRC16_DATA8, BUILT_IN_CRC16_DATA16,
> > BUILT_IN_CRC32_DATA8, BUILT_IN_CRC32_DATA16,
> > BUILT_IN_CRC32_DATA32,
> > BUILT_IN_CRC64_DATA8, BUILT_IN_CRC64_DATA16,
> > BUILT_IN_CRC64_DATA32,
> > BUILT_IN_CRC64_DATA64.
> > * builtins.def (BUILT_IN_CRC8_DATA8): New builtin.
> > (BUILT_IN_CRC16_DATA8): Likewise.
> > (BUILT_IN_CRC16_DATA16): Likewise.
> > (BUILT_IN_CRC32_DATA8): Likewise.
> > (BUILT_IN_CRC32_DATA16): Likewise.
> > (BUILT_IN_CRC32_DATA32): Likewise.
> > (BUILT_IN_CRC64_DATA8): Likewise.
> > (BUILT_IN_CRC64_DATA16): Likewise.
> > (BUILT_IN_CRC64_DATA32): Likewise.
> > (BUILT_IN_CRC64_DATA64): Likewise.
> > * config/riscv/bitmanip.md (crc<ANYI2:mode><ANYI:mode>4): New
> > expander.
> > * config/riscv/riscv-protos.h (expand_crc_table_based):
> Declare.
> > (expand_crc_using_clmul): Likewise.
> > * config/riscv/riscv.cc (gf2n_poly_long_div_quotient): New
> > function.
> > (generate_crc): Likewise.
> > (generate_crc_table): Likewise.
> > (expand_crc_table_based): Likewise.
> > (expand_crc_using_clmul): Likewise.
> > * config/riscv/riscv.md (UNSPEC_CRC): New unspec for CRC.
> > * internal-fn.cc (crc_direct): Define.
> > (expand_crc_optab_fn): New function.
> > (direct_crc_optab_supported_p): Define.
> > * internal-fn.def (CRC): New internal optab function.
> > * optabs.def (crc_optab): New optab.
> >
> > gcc/testsuite/ChangeLog:
> > * gcc.target/riscv/crc-builtin-table-target32.c: New test.
> > * gcc.target/riscv/crc-builtin-table-target64.c: New test.
> > * gcc.target/riscv/crc-builtin-zbc32.c: New test.
> > * gcc.target/riscv/crc-builtin-zbc64.c: New test.
>
* Re: RISC-V: Added support for CRC.
2023-08-03 19:37 Mariam Harutyunyan
@ 2023-08-03 19:56 ` Andrew Pinski
2023-08-03 19:59 ` Mariam Harutyunyan
2023-08-03 20:54 ` Jeff Law
2023-08-16 4:59 ` Jeff Law
2 siblings, 1 reply; 30+ messages in thread
From: Andrew Pinski @ 2023-08-03 19:56 UTC (permalink / raw)
To: Mariam Harutyunyan; +Cc: gcc-patches
On Thu, Aug 3, 2023 at 12:38 PM Mariam Harutyunyan via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> This patch adds CRC support for the RISC-V architecture. It adds internal
> functions and built-ins specifically designed to handle CRC computations
> efficiently.
>
> If the target supports the Zbc extension, the clmul instruction is used for the CRC code
> generation; otherwise, table-based CRC is generated. A table with 256
> elements is used to store precomputed CRCs.
>
> These CRC calculation algorithms have higher performance than the naive CRC
> calculation algorithm.
A few things about this patch:
You created a generic (non-target-specific) builtin but didn't
document it in doc/extend.texi.
You created a generic builtin with no fallback in libgcc.
You created a new named (RTL) pattern, crc, and didn't document it in
the `Standard Names` section of doc/md.texi.
Thanks,
Andrew Pinski
>
> gcc/ChangeLog:
> * builtin-types.def (BT_FN_UINT8_UINT8_UINT8_CONST_SIZE): Define.
> (BT_FN_UINT16_UINT16_UINT8_CONST_SIZE): Likewise.
> (BT_FN_UINT16_UINT16_UINT16_CONST_SIZE): Likewise.
> (BT_FN_UINT32_UINT32_UINT8_CONST_SIZE): Likewise.
> (BT_FN_UINT32_UINT32_UINT16_CONST_SIZE): Likewise.
> (BT_FN_UINT32_UINT32_UINT32_CONST_SIZE): Likewise.
> (BT_FN_UINT64_UINT64_UINT8_CONST_SIZE): Likewise.
> (BT_FN_UINT64_UINT64_UINT16_CONST_SIZE): Likewise.
> (BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
> (BT_FN_UINT64_UINT64_UINT64_CONST_SIZE): Likewise.
> * builtins.cc (associated_internal_fn): Handle
> BUILT_IN_CRC8_DATA8,
> BUILT_IN_CRC16_DATA8, BUILT_IN_CRC16_DATA16,
> BUILT_IN_CRC32_DATA8, BUILT_IN_CRC32_DATA16,
> BUILT_IN_CRC32_DATA32,
> BUILT_IN_CRC64_DATA8, BUILT_IN_CRC64_DATA16,
> BUILT_IN_CRC64_DATA32,
> BUILT_IN_CRC64_DATA64.
> * builtins.def (BUILT_IN_CRC8_DATA8): New builtin.
> (BUILT_IN_CRC16_DATA8): Likewise.
> (BUILT_IN_CRC16_DATA16): Likewise.
> (BUILT_IN_CRC32_DATA8): Likewise.
> (BUILT_IN_CRC32_DATA16): Likewise.
> (BUILT_IN_CRC32_DATA32): Likewise.
> (BUILT_IN_CRC64_DATA8): Likewise.
> (BUILT_IN_CRC64_DATA16): Likewise.
> (BUILT_IN_CRC64_DATA32): Likewise.
> (BUILT_IN_CRC64_DATA64): Likewise.
> * config/riscv/bitmanip.md (crc<ANYI2:mode><ANYI:mode>4): New
> expander.
> * config/riscv/riscv-protos.h (expand_crc_table_based): Declare.
> (expand_crc_using_clmul): Likewise.
> * config/riscv/riscv.cc (gf2n_poly_long_div_quotient): New
> function.
> (generate_crc): Likewise.
> (generate_crc_table): Likewise.
> (expand_crc_table_based): Likewise.
> (expand_crc_using_clmul): Likewise.
> * config/riscv/riscv.md (UNSPEC_CRC): New unspec for CRC.
> * internal-fn.cc (crc_direct): Define.
> (expand_crc_optab_fn): New function.
> (direct_crc_optab_supported_p): Define.
> * internal-fn.def (CRC): New internal optab function.
> * optabs.def (crc_optab): New optab.
>
> gcc/testsuite/ChangeLog:
> * gcc.target/riscv/crc-builtin-table-target32.c: New test.
> * gcc.target/riscv/crc-builtin-table-target64.c: New test.
> * gcc.target/riscv/crc-builtin-zbc32.c: New test.
> * gcc.target/riscv/crc-builtin-zbc64.c: New test.
* RISC-V: Added support for CRC.
@ 2023-08-03 19:37 Mariam Harutyunyan
2023-08-03 19:56 ` Andrew Pinski
` (2 more replies)
0 siblings, 3 replies; 30+ messages in thread
From: Mariam Harutyunyan @ 2023-08-03 19:37 UTC (permalink / raw)
To: gcc-patches
This patch adds CRC support for the RISC-V architecture. It adds internal
functions and built-ins specifically designed to handle CRC computations
efficiently.
If the target supports the Zbc extension, the clmul instruction is used for the CRC code
generation; otherwise, table-based CRC is generated. A table with 256
elements is used to store precomputed CRCs.
These CRC calculation algorithms have higher performance than the naive CRC
calculation algorithm.
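(For comparison, the "naive" algorithm referred to above is the one-bit-per-iteration loop. A generic sketch for the reflected CRC-32 polynomial, updating the CRC one byte per call; this is the loop shape the planned GIMPLE detection would recognize and replace with a table lookup or a clmul sequence:)

```c
#include <stdint.h>

/* Bitwise CRC-32 update for a single byte, reflected polynomial
   0xEDB88320: XOR the byte into the low bits, then divide by the
   polynomial one bit at a time over eight iterations.  */
uint32_t
crc32_bitwise_update (uint32_t crc, uint8_t data)
{
  crc ^= data;
  for (int i = 0; i < 8; i++)
    crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
  return crc;
}
```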
gcc/ChangeLog:
* builtin-types.def (BT_FN_UINT8_UINT8_UINT8_CONST_SIZE): Define.
(BT_FN_UINT16_UINT16_UINT8_CONST_SIZE): Likewise.
(BT_FN_UINT16_UINT16_UINT16_CONST_SIZE): Likewise.
(BT_FN_UINT32_UINT32_UINT8_CONST_SIZE): Likewise.
(BT_FN_UINT32_UINT32_UINT16_CONST_SIZE): Likewise.
(BT_FN_UINT32_UINT32_UINT32_CONST_SIZE): Likewise.
(BT_FN_UINT64_UINT64_UINT8_CONST_SIZE): Likewise.
(BT_FN_UINT64_UINT64_UINT16_CONST_SIZE): Likewise.
(BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
(BT_FN_UINT64_UINT64_UINT64_CONST_SIZE): Likewise.
* builtins.cc (associated_internal_fn): Handle
BUILT_IN_CRC8_DATA8,
BUILT_IN_CRC16_DATA8, BUILT_IN_CRC16_DATA16,
BUILT_IN_CRC32_DATA8, BUILT_IN_CRC32_DATA16,
BUILT_IN_CRC32_DATA32,
BUILT_IN_CRC64_DATA8, BUILT_IN_CRC64_DATA16,
BUILT_IN_CRC64_DATA32,
BUILT_IN_CRC64_DATA64.
* builtins.def (BUILT_IN_CRC8_DATA8): New builtin.
(BUILT_IN_CRC16_DATA8): Likewise.
(BUILT_IN_CRC16_DATA16): Likewise.
(BUILT_IN_CRC32_DATA8): Likewise.
(BUILT_IN_CRC32_DATA16): Likewise.
(BUILT_IN_CRC32_DATA32): Likewise.
(BUILT_IN_CRC64_DATA8): Likewise.
(BUILT_IN_CRC64_DATA16): Likewise.
(BUILT_IN_CRC64_DATA32): Likewise.
(BUILT_IN_CRC64_DATA64): Likewise.
* config/riscv/bitmanip.md (crc<ANYI2:mode><ANYI:mode>4): New
expander.
* config/riscv/riscv-protos.h (expand_crc_table_based): Declare.
(expand_crc_using_clmul): Likewise.
* config/riscv/riscv.cc (gf2n_poly_long_div_quotient): New
function.
(generate_crc): Likewise.
(generate_crc_table): Likewise.
(expand_crc_table_based): Likewise.
(expand_crc_using_clmul): Likewise.
* config/riscv/riscv.md (UNSPEC_CRC): New unspec for CRC.
* internal-fn.cc (crc_direct): Define.
(expand_crc_optab_fn): New function.
(direct_crc_optab_supported_p): Define.
* internal-fn.def (CRC): New internal optab function.
* optabs.def (crc_optab): New optab.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/crc-builtin-table-target32.c: New test.
* gcc.target/riscv/crc-builtin-table-target64.c: New test.
* gcc.target/riscv/crc-builtin-zbc32.c: New test.
* gcc.target/riscv/crc-builtin-zbc64.c: New test.
[-- Attachment #2: 0001-RISC-V-Added-support-for-CRC.patch --]
[-- Type: application/x-patch, Size: 31807 bytes --]
end of thread, other threads:[~2023-09-27 21:44 UTC | newest]
Thread overview: 30+ messages
2023-09-23 23:05 RISC-V: Added support for CRC Joern Rennecke
2023-09-24 11:41 ` Alexander Monakov
2023-09-24 15:53 ` Joern Rennecke
2023-09-26 6:19 ` Mariam Harutyunyan
2023-09-26 13:05 ` Oleg Endo
2023-09-26 13:18 ` Jeff Law
2023-09-26 16:23 ` Alexander Monakov
2023-09-26 18:56 ` Joern Rennecke
2023-09-27 21:44 ` Jeff Law
-- strict thread matches above, loose matches on Subject: below --
2023-08-03 19:37 Mariam Harutyunyan
2023-08-03 19:56 ` Andrew Pinski
2023-08-03 19:59 ` Mariam Harutyunyan
2023-08-03 20:54 ` Jeff Law
2023-08-08 15:22 ` Alexander Monakov
2023-08-08 15:31 ` Jeff Law
2023-08-08 16:38 ` Alexander Monakov
2023-08-08 23:17 ` Jeff Law
2023-08-08 23:31 ` Andrew Pinski
2023-08-09 0:01 ` Jeff Law
2023-08-09 6:32 ` Alexander Monakov
2023-08-09 13:02 ` Paul Koning
2023-08-16 4:15 ` Jeff Law
2023-08-16 4:12 ` Jeff Law
2023-08-16 19:10 ` Alexander Monakov
2023-08-16 19:42 ` Philipp Tomsich
2023-08-16 20:23 ` Paul Koning
2023-08-17 13:36 ` Alexander Monakov
2023-08-17 3:12 ` Jeff Law
2023-08-16 4:59 ` Jeff Law
2023-08-21 13:39 ` Mariam Harutyunyan