From: Jeff Law <jeffreyalaw@gmail.com>
To: Joern Rennecke <joern.rennecke@embecosm.com>,
	GCC Patches <gcc-patches@gcc.gnu.org>
Cc: Mariam Harutyunyan <mariamarutunian@gmail.com>,
	Alexander Monakov <amonakov@ispras.ru>,
	Richard Biener <richard.guenther@gmail.com>,
	Oleg Endo <oleg.endo@t-online.de>
Subject: Re: RISC-V: Added support for CRC.
Date: Tue, 26 Sep 2023 07:18:28 -0600	[thread overview]
Message-ID: <a3a879a1-9f0d-4967-a976-4f00eca0292f@gmail.com> (raw)
In-Reply-To: <CAMqJFCpEoMtLiW3hqABa_-qm3yPaW49pkrve7U2tFLe7_Q=B7A@mail.gmail.com>



On 9/23/23 17:05, Joern Rennecke wrote:
> Mariam Harutyunyan:
> +++ b/gcc/ChangeLog
> @@ -1,3 +1,45 @@
> +2023-08-03  Mariam Arutunian  <mariamarutunian@gmail.com>
> +
> 
> It is common courtesy to include all authors in the list of authors
> for the ChangeLog; also,
> this may help people in the future understand the history of the code better.
> While most of your patch is new, it still contains non-trivial parts of mine
> ( https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591744.html ).
> And stripping out the comment explaining why, currently, we can't use linkonce
> for crc tables on the RISC-V target is
> not helpful to someone who wants to understand the code.
Thanks for pointing this out Joern.  Neither Mariam nor I were aware 
that some of your code was used to seed this work.  We are happy to 
include you as a co-author.


Mariam -- the way to do this is to add a "Co-Authored-By:" line to your 
commit message.  Upstream commit

5ff4431675c0d0c800d4a983254e94a6b401c14d

shows the right format.


> 
> See also the discussion to put this into loop distribution:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591821.html
> https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591866.html
Yes, this was the gist of the discussion I had with Richi.  CRC 
optimization has a lot in common with distribution, reduction and final 
value replacement.

For early testing and development we focused on detecting a CRC function 
and replacing it at a function level.  That allowed Mariam to avoid a 
class of problems for a while and focus on an end-to-end solution.  Once 
that was working reasonably well Mariam went back and enhanced the first 
pass filter and validation so that it would work if the CRC loop was 
embedded inside a larger function either at the source level or due to 
inlining.

The placement we chose was just after loop distribution.  I think 
there's some wiggle room in terms of pass placement.

The one thing I'm hesitant to do is make this part of an existing pass. 
I don't see many benefits to integrating inside loop dist, and I very much 
like its clean separation.


> 
> Mariam Harutyunyan:
>> It adds internal
>> functions and built-ins specifically designed to handle CRC computations
>> efficiently.
> 
> This sounds like this is a finished work, although defining built-in
> functions to calculate the CRC of single data elements and recognizers
> for some C idioms that do these calculations,
> is just a starting point.
More precisely, it was carved out of a larger piece of work in the hopes 
that it would be useful independently and allow upstreaming of the 
backend work.  Alexander was fairly dismissive of the utility of doing 
that.  There was some interest from Cauldron attendees in potentially 
reviving that idea.


> 
> Alexander Monakov :
> 
>> Jeff, as I understand this all is happening only because Coremark contains
>> use of bitwise CRC that affects benchmark scores. In another universe where
>> - Coremark was careful to checksum outputs outside of timed sections, or
>> - implemented CRC in a manner that is not transparent to the compiler, or
>> - did not use CRC at all
>> we would not be spending effort on this, correct?
> 
> It is a stated goal of coremark to test performance for CRC.  They do
> not use a library call
> to implement CRC, but a specific bit-banging algorithm they have
> chosen.  That algorithm is,
> for the vast majority of processors, not representative of the target's
> performance potential in calculating CRCs; thus if a compiler fails to
> translate this into the CRC implementation that
> would be used for performance code, the compiler frustrates this goal
> of coremark to give a measure of CRC calculation performance.
All true.  But the Coremark code is what it is.  This isn't a whole lot 
different than the work in the 90s which rewrote loops and compromised 
some of the spec benchmarks, or the eqntott hack to simplify that one 
key loop in eqntott.

What ultimately pushed us to keep moving forward on this effort was 
discovering numerous CRC loop implementations out in the wild, including 
4 implementations (IIRC) in the kernel itself.
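
For anyone reading along, the kind of bit-at-a-time loop we're talking 
about recognizing looks roughly like the following.  This is an 
illustrative sketch in the Coremark/kernel style, not the exact code 
from either; the reflected CRC-16 polynomial 0x8408 is just an example:

#include <stdint.h>

uint16_t
crc16_bitwise (uint16_t crc, uint8_t data)
{
  crc ^= data;
  for (int i = 0; i < 8; i++)
    {
      if (crc & 1)
        crc = (crc >> 1) ^ 0x8408;  /* fold in the (reflected) polynomial */
      else
        crc >>= 1;
    }
  return crc;
}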


> 
>> At best we might have
>> a discussion on providing a __builtin_clmul for carry-less multiplication
>> (which _is_ a fundamental primitive, unlike __builtin_crc), and move on.
> 
> Some processors have specialized instructions for CRC computations.
They do.  But I think the discussion about providing intrinsic access to 
a clmul instruction is orthogonal to the discussion about whether or not 
to provide a builtin for crc.

I can easily see creating a clmul RTL opcode for targets which support 
it and hoisting the clmul vs lookup table selection into generic code. 
I'm still pondering if we're likely to ever see cases where we want a 
vector clmul intrinsic or support in the autovectorizer for clmul. 
We've certainly got targets with vector clmul in the ISA; the question 
is whether we'll actually use it.
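
For comparison, when clmul isn't available or profitable, the selection 
would fall back to the usual byte-per-iteration table lookup.  A rough 
C-level sketch follows (the real output is GIMPLE/RTL, and the 256-entry 
table would be generated by the pass from the detected polynomial; it is 
elided here):

#include <stdint.h>

/* Assumed to be generated for the detected (reflected) polynomial.  */
extern const uint16_t crc16_table[256];

uint16_t
crc16_by_table (uint16_t crc, uint8_t data)
{
  return (crc >> 8) ^ crc16_table[(crc ^ data) & 0xff];
}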


> 
>> Instead, efficient CRC loops have the following structure:
>> - they carry an unreduced remainder in the loop, performing final reduction
>>   modulo polynomial only once after the loop — this halves the CLMUL count;
>> - they overlap multiple CLMUL chains to make the loop throughput-bound
>> rather than latency-bound. The typical unroll factor is about 4x-8x.
> 
> If you want to recognize a loop that does a CRC on a block, it makes
> sense to start with recognizing the CRC computation for single array
> elements first.  We have to learn to
> walk before we can run.
> 
> Note that my initial patch already contained a match.pd stanza to
> recognize two chained single-element CRC calculations.
Mariam's code goes well beyond simple match.pd recognition of a block. 
The hope is to do a few rounds of internal review/cleanup then post it 
for the world to see.

Probably the biggest task in that space right now is to see if we can 
avoid the symbolic execution engine by re-using ranger.

> 
> Jeff Law:
>> The intention is to provide a useful
>> builtin_crc while at the same time putting one side of the
>> infrastructure we need for automatic detection of CRC loops and turning
>> them into table lookups or CLMULs.
> 
> Note that when optimizing for size, for a target with tiny memory, or
> when using a non-constant (or constant but undiscoverable by the
> compiler) polynomial, we can't use the table lookup.  But still, even if
> we don't have CLMUL, we can generally speed up CRC computation over
> the coremark algorithm by using something more suited to the target,
> like the crcu16_1 function I
> put into comments in my patch.
Sure.  And once the CRC loop is discovered and an IFN created, the 
backend would have the opportunity to guide code generation.





> 
> Alexander Monakov:
>> Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
>> These consumers need high-performance blockwise CRC, offering them
>> a latency-bound elementwise CRC primitive is a disservice. And what
>> should they use as a fallback when __builtin_crc is unavailable?
> 
> We can provide a fallback implementation for all targets with table
> lookup and/or shifts.
And as I've stated before, the latency of clmuls is dropping.   I 
wouldn't be terribly surprised to see single-cycle clmul 
implementations showing up within the next 18-24 months.  It's really 
just a matter of gate budget vs expected value.


> 
> Alexander Monakov:
>> I think offering a conventional library for CRC has substantial advantages.
> Are you volunteering?  It would make our work to emit code for block
> CRC easier if we could
> just use a library call when we recognize a block CRC (although making
> that recognizer is likely still considerable work if we want to get
> good coverage over different coding styles).
> 
> Although maybe Oleg Endo's library, as mentioned in
> https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591748.html ,
> might be suitable?  What is the license for that?
But going this route as the primary direction means that any such code 
has to change its implementation to take advantage of a faster CRC 
implementation.

To reiterate, the real goal here is to take code as-is and make it 
significantly faster.  While the original target was Coremark, we've 
found similar bitwise implementations of CRCs all over the place. 
There's no good reason that code should have to change.

The idea of exposing a CRC builtin was an intermediate step that would 
allow those willing to change their code, or those writing new code, to write 
their CRC in a trivial way and let the compiler figure out a sensible 
implementation while we clean up the CRC loop detection/validation code.

> 
> Yes, the builtin functions naturally have lots of narrow operations,
> and thus the operands are narrow.
> Note that in my implementation of crcsihi4 / crchihi4, I use a Pmode
> paradoxical subreg of the output operand, so that the move itself can
> be in Pmode.
> 
> I suppose we could make the output mode wider than the data size, but
> that can only be for the targets that want it; we can't bend the
> target-independent built-in for the convenience of RISC-V.
> So the builtin function expanding machinery would have to know to use
> the mode of the pattern instead of the one implied by the type.
I'm not at all suggesting we do something weird in the middle end for 
RISC-V.  My point was we should look for a better solution.  This feels 
like it ought to work more like the existing widening/narrowing 
capabilities we have in the expansion code.


> 
> Jeff Law :
> 
>> I wonder if all the table based generation support code should move into a
>> generic file so that it could be used by other targets.    I don't offhand
>> see anything in there that is RISC-V specific.
> 
> Well, the avoidance of linkonce is.  OTOH, other targets might not
> have weak.  On the gripping hand, making the table static and
> duplicating it all over would be sad.
Right.  But making the table a DECL node with an initializer allows us 
to clean up a whole mess of things -- including the ability to take 
advantage of linkonce on targets that support it.  The whole point is to 
take what was fairly target specific and generalize it so that it's 
useful elsewhere.
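
To make that concrete, here is a rough sketch of building the table as a 
DECL with an initializer so varasm can put it in a linkonce/comdat 
section where the target supports it.  This is GCC-internals pseudocode 
in spirit: the function name, the fixed 256-entry size, and the 
public/comdat handling are illustrative, not taken from the actual patch:

/* Hedged sketch: emit the CRC lookup table as a static, read-only
   VAR_DECL with a CONSTRUCTOR initializer.  Simplified; assumes the
   usual tree.h/varasm.h/cgraph.h includes.  */
static tree
build_crc_table_decl (tree elem_type, vec<constructor_elt, va_gc> *elts)
{
  tree array_type = build_array_type_nelts (elem_type, 256);
  tree decl = build_decl (UNKNOWN_LOCATION, VAR_DECL,
                          get_identifier ("__crc_table"), array_type);
  TREE_STATIC (decl) = 1;
  TREE_READONLY (decl) = 1;
  DECL_INITIAL (decl) = build_constructor (array_type, elts);
  /* Where the target supports it, allow the table to be commoned
     across translation units; otherwise it stays a local static.  */
  if (SUPPORTS_ONE_ONLY)
    make_decl_one_only (decl, DECL_ASSEMBLER_NAME (decl));
  varpool_node::finalize_decl (decl);
  return decl;
}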


> 
> I suppose we could make it a string constant, to take advantage of
> string commoning on targets where that is supported.  However, that
> would likely get us into trouble on targets with addressable units
> other than 8 bits.
I think we can leave this as a follow-up.  I'm not convinced it's a 
major issue.


> 
> However, the code generation actually is somewhat target specific
> beyond these technicalities.
> The widths of arithmetic used, and how we order it with memory
> accesses (which might need arithmetic transformations), depends on the
> facilities available and latency considerations to get good code.
> Well, we can still have some generic implementation, but for optimal
> performance it makes sense to look at what assembly code you get and
> consider target-specific code generation if you can do better.
The widths and such as well as "does this instruction support 
immediates" should be handled by using the existing expansion 
interfaces.  That's the broader point.

As written the code makes various assumptions about the ISA.  By moving 
to standard expansion interfaces most of those problems are solved.
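
To illustrate what I mean (a sketch only, not the actual patch; it 
assumes a simple reflected polynomial and the usual rtl.h/optabs.h/expr.h 
includes), the shift/xor step can go through the generic expanders, which 
pick an immediate form, a register form, or a libcall as the target 
allows:

/* Hedged sketch (GCC internals): expand crc = (crc >> 1) ^ poly without
   assuming the target has shift-by-immediate or xor-with-immediate.  */
static rtx
expand_crc_step (rtx crc, unsigned HOST_WIDE_INT poly)
{
  rtx shifted = expand_simple_binop (word_mode, LSHIFTRT, crc, const1_rtx,
                                     NULL_RTX, 1, OPTAB_LIB_WIDEN);
  return expand_simple_binop (word_mode, XOR, shifted,
                              gen_int_mode (poly, word_mode),
                              NULL_RTX, 1, OPTAB_LIB_WIDEN);
}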

> 
> The other port we used this optimization for actually has dedicated
> crc instructions in some configurations, and is generally
> memory-tight, so the fallback is tightly coded library functions.
ACK.  And I think we can get to a point where those ports that have CRC 
functions (presumably with a limited set of polynomials) are trivially 
handled.


> 
> It wouldn't be an Alpha, since that doesn't have indexed
> addressing.  And that has nothing to do with 12 bits; tab is a symbol
> (usually needs lots of bits, and multiple machine instructions
> to set up for RISC-V), and ix is an index.
Right.  This isn't about 12-bit immediates or anything like that, just a 
general need to be careful about how we generate code to ensure it's 
target independent.


> ..
>> And here I don't think we can necessarily depend on the target
>> supporting shifts by a constant amount.
> ..
>> Similarly here, some targets may not support AND with immediate or may
>> not have enough bits to handle all the immediates that can arise here.
> 
> Yes, it's RISC-V specific code in a RISC-V file.  If we want to make
> this more general, we might be better off constructing some trees or
> gimple and expanding them.  We might indeed do a check if the internal
> function or a library function is defined, and substitute a table
> lookup otherwise.  (or leave it alone for -Os in that case).
Which is basically what I was suggesting by moving to the various expand 
interfaces.  That code does not need to be RISC-V specific and in fact 
it's much better if it is not and can be shared across architectures.


> 
>> So did we ever figure out why we're treating the CRC optab as a
>> conversion optab?
> 
> Input and output are different sizes.  At least for some of the
> variants.  So you get an ICE if you treat it as something other than a
> conversion.
That's what I was guessing.   Thanks for the confirmation.
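
For reference, conversion optabs are the two-mode entries declared with 
OPTAB_CD in optabs.def.  A hypothetical entry in that style (the exact 
optab name and pattern string here are illustrative, not necessarily 
what the patch uses) would look like:

/* gcc/optabs.def -- hypothetical sketch only.  */
OPTAB_CD(crc_optab, "crc$a$b4")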

jeff
