From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=jGZ0=AS=rivosinc.com=andrea@sourceware.org>
Received: from mail-wm1-x32c.google.com (mail-wm1-x32c.google.com [IPv6:2a00:1450:4864:20::32c])
	by sourceware.org (Postfix) with ESMTPS id 4EB303858D33
	for <gcc-patches@gcc.gnu.org>; Thu, 27 Apr 2023 17:20:20 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 4EB303858D33
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=rivosinc.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=rivosinc.com
Received: by mail-wm1-x32c.google.com with SMTP id 5b1f17b1804b1-3f315712406so32567325e9.0
        for <gcc-patches@gcc.gnu.org>; Thu, 27 Apr 2023 10:20:20 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=rivosinc-com.20221208.gappssmtp.com; s=20221208; t=1682616019; x=1685208019;
        h=in-reply-to:content-transfer-encoding:content-disposition
         :mime-version:references:message-id:subject:cc:to:from:date:from:to
         :cc:subject:date:message-id:reply-to;
        bh=MnvQVQPJ66343OTwrjrpfgMDMORp4+04yvOSvSGluSo=;
        b=bBUdtbXiS+AhyYFiJoBzxTQsaFeNrIaHLRdBlld29QC3TJeu0dzea+7+QKhtob2YFe
         Qi/8mIrg5q/VuZXPCNZh/E120IoVE9MZroh4lP+yJaUnVZOby4rq4EQL5IcrSW2AMK05
         U9KdlN+Ufh2Q1mZ0DWSmV9ApCQ7Rr29yBUZJMKCr9msf4acfLK6uK53yWNOZLiUT34u8
         VN5yzik0F8w5AIhnsQ2JHJyoUknxMRyOH7hXTZKjmO/pYGagpdHok6Di/fVYWYuqQ3b9
         S6vHqVmNSN9cKLTfelLEPpzKCDbJN9OwyeT+m9SuH0woH8zNi4dSAPJNTbdMU6nTbUsb
         xEbg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1682616019; x=1685208019;
        h=in-reply-to:content-transfer-encoding:content-disposition
         :mime-version:references:message-id:subject:cc:to:from:date
         :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=MnvQVQPJ66343OTwrjrpfgMDMORp4+04yvOSvSGluSo=;
        b=FIU3+qz3zhPTon1KotVes+4kQz33eWfjTnfScHkbZV86D/73qzJq5DWUQaCP1m/RF4
         JW2bF2YAasb37nmX5tSWyP0BYK7AaY41rtJ7hdOn3Npq0QpxVCCvzbqcM0VnRqSyQH31
         GyfXSj2okdKgwJ4Y6hUDkA8dXVP9WYOPi3Yo024fwcqvneRmauXsItDntMRP+QLUT3lC
         8lt3b2KXFEXXPIfXfmCt23RnOqFF8+lOkrbAZ6H9uABaYkZMAO4pUQfcSfr/Jvsy3vTM
         Ud6ptJ+r4gRFAoEhrg3wVZbF6n1DjpBNwvGKAlBhRvvRtT5k8F47frvn8PeesjivQJs+
         I+1g==
X-Gm-Message-State: AC+VfDxhb/h7JhVlsUB5rXC/1VzzWzxwjVIoIL1GpmgDpJxghp73H8yA
	MAtUeLiBydsNfYdYKkZyK6X0MQ==
X-Google-Smtp-Source: ACHHUZ6c7J4t3ibgS4R58JT0CCULWT1poHTEJmU6FQxUPSXJgc7m34SbaLooF8p26/2rxP4xuzxdhQ==
X-Received: by 2002:adf:fa82:0:b0:2f7:625:7997 with SMTP id h2-20020adffa82000000b002f706257997mr4327238wrr.28.1682616018579;
        Thu, 27 Apr 2023 10:20:18 -0700 (PDT)
Received: from andrea ([51.52.155.69])
        by smtp.gmail.com with ESMTPSA id o4-20020a056000010400b002fa67f77c16sm18924986wrx.57.2023.04.27.10.20.17
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Thu, 27 Apr 2023 10:20:18 -0700 (PDT)
Date: Thu, 27 Apr 2023 19:20:14 +0200
From: Andrea Parri <andrea@rivosinc.com>
To: Patrick O'Neill <patrick@rivosinc.com>
Cc: gcc-patches@gcc.gnu.org, palmer@rivosinc.com,
	gnu-toolchain@rivosinc.com, vineetg@rivosinc.com, andrew@sifive.com,
	kito.cheng@sifive.com, dlustig@nvidia.com, cmuellner@gcc.gnu.org,
	hboehm@google.com, jeffreyalaw@gmail.com
Subject: Re: [PATCH v5 00/11] RISC-V: Implement ISA Manual Table A.6 Mappings
Message-ID: <ZEquzlKEegm0k3lx@andrea>
References: <20230414170942.1695672-1-patrick@rivosinc.com>
 <20230427162301.1151333-1-patrick@rivosinc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20230427162301.1151333-1-patrick@rivosinc.com>
X-Spam-Status: No, score=-2.2 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,KAM_SHORT,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On Thu, Apr 27, 2023 at 09:22:50AM -0700, Patrick O'Neill wrote:
> This patchset aims to make the RISCV atomics implementation stronger
> than the recommended mapping present in table A.6 of the ISA manual.
> https://github.com/riscv/riscv-isa-manual/blob/c7cf84547b3aefacab5463add1734c1602b67a49/src/memory.tex#L1083-L1157 
> 
> Context
> ---------
> GCC defined RISC-V mappings [1] before the Memory Model task group
> finalized their work and provided the ISA Manual Table A.6/A.7 mappings[2].
> 
> For at least a year now, we've known that the mappings were different,
> but it wasn't clear if these unique mappings had correctness issues.
> 
> Andrea Parri found an issue with the GCC mappings, showing that
> atomic_compare_exchange_weak_explicit(-,-,-,release,relaxed) mappings do
> not enforce release ordering guarantees. (Meaning the GCC mappings have
> a correctness issue).
>   https://inbox.sourceware.org/gcc-patches/Y1GbJuhcBFpPGJQ0@andrea/ 
> 
> Why not A.6?
> ---------
> We can update our mappings now, so the obvious choice would be to
> implement Table A.6 (what LLVM implements/ISA manual recommends).
> 
> The reason why that isn't the best path forward for GCC is due to a
> proposal by Hans Boehm to add L{d|w|b|h}.aq/rl and S{d|w|b|h}.aq/rl.
> 
> For context, there is discussion about fast-tracking the addition of
> these instructions. The RISCV architectural review committee supports
> adopting a "new and common atomics ABI for gcc and LLVM toochains ...
> that assumes the addition of the preceding instructions”. That common
> ABI is likely to be A.7.
>   https://lists.riscv.org/g/tech-privileged/message/1284 
> 
> Transitioning from A.6 to A.7 will cause an ABI break. We can hedge
> against that risk by emitting a conservative fence after SEQ_CST stores
> to make the mapping compatible with both A.6 and A.7.
> 
> What does a mapping compatible with both A.6 & A.7 look like?
> ---------
> It is exactly the same as Table A.6, but SEQ_CST stores have a trailing
> fence rw,rw. It's strictly stronger than Table A.6.
> 
> Microbenchmark
> ---------
> Hans Boehm helpfully wrote a microbenchmark [3] that uses ARM to give a
> rough estimate for the performance benefits/penalties of the different
> mappings. The microbenchmark is single threaded and almost-write-only.
> This case seems unlikely but is useful for getting a rough idea of the
> workload that would be impacted the most.
> 
> Testcases
> -------
> Control: A simple volatile store. This is most similar to a relaxed
> store.
> Release Store: This is most similar to Sw.rl (one of the instructions in
> Hans' proposal).
> Store with release fence: This is most similar to the mapping present in
> Table A.6.
> Store with two fences: This is most similar to the compatibility mapping
> present in this patchset.
> 
> Machines
> -------
> Intel(R) Core(TM) i7-8650U (sanity check only): x86 TSO
> Cortex A53 (Raspberry pi): ARM In order core
> Cortex A55 (Pixel 6 Pro): ARM In order core
> Cortex A76 (Pixel 6 Pro): ARM Out of order core
> Cortex X1 (Pixel 6 Pro): ARM Out of order core
> 
> Microbenchmark Results [4]
> --------
> Units are nsecs per iteration.
> 
> Sanity check
> Machine    	   CONTROL   REL_STORE   STORE_REL_FENCE   STORE_TWO_FENCE
> -------    	   -------   ---------   ---------------   ---------------
> Intel i7-8650U 1.34812   1.30038     1.2933            18.0474
> 
> 
> Machine    	CONTROL   REL_STORE   STORE_REL_FENCE   STORE_TWO_FENCE
> -------    	-------   ---------   ---------------   ---------------
> Cortex A53 	7.15224   10.7282     7.15221           10.013
> Cortex A55 	2.77965   8.89654     4.44787           7.78331
> Cortex A76 	1.78021   1.86095     5.33088           8.88462
> Cortex X1  	2.14252   2.14258     4.32982           7.05234
> 
> Reordered tests (using -r flag on microbenchmark)
> Machine    	CONTROL   REL_STORE   STORE_REL_FENCE   STORE_TWO_FENCE
> -------    	-------   ---------   ---------------   ---------------
> Cortex A53 	7.15227   10.7282     7.16113           10.034
> Cortex A55 	2.78024   8.89574     4.44844           7.78428
> Cortex A76 	1.77686   1.81081     5.3301            8.88346
> Cortex X1  	2.14254   2.14251     4.3273            7.05239
> 
> Benchmark Interpretation
> --------
> As expected, out of order machines are significantly faster with the
> REL_STORE mappings. Unexpectedly, the in-order machines are
> significantly slower with REL_STORE rather than REL_STORE_FENCE.
> 
> Most machines in the wild are expected to use Table A.7 once the
> instructions are introduced. 
> Incurring this added cost now will make it easier for compiled RISC-V
> binaries to transition to the A.7 memory model mapping.
> 
> The performance benefits of moving to A.7 can be more clearly seen using
> an almost-all-load microbenchmark (included on page 3 of Hans’
> proposal). The code for that microbenchmark is attached below [5].
>   https://lists.riscv.org/g/tech-unprivileged/attachment/382/0/load-acquire110422.pdf 
>   https://lists.riscv.org/g/tech-unprivileged/topic/92916241 
> 
> Caveats
> --------
> This is a very synthetic microbenchmark that represents what is expected
> to be a very unlikely workload. Nevertheless, it's helpful to see the
> worst-case price we are paying for compatibility. 
> 
> “All times include an entire loop iteration, indirect dispatch and all.
> The benchmark alternates tests, but does not lock CPU frequency. Since a
> single core was in use, I expect this was running at basically full
> speed. Any throttling affected everything more or less uniformly.”
> - Hans Boehm
> 
> Patchset overview
> --------
> Patch 1 simplifies the memmodel to ignore MEMMODEL_SYNC_* cases (legacy
> cases that aren't handled differently for RISC-V).
> Patches 2-6 make the mappings strictly stronger.
> Patches 7-9 weaken the mappings to be in line with table A.6 of the ISA
> manual.
> Patch 11 adds some basic conformance tests to ensure the implemented
> mapping matches table A.6 with stronger SEQ_CST stores.
> 
> Conformance test cases notes
> --------
> The conformance tests in this patch are a good sanity check but do not
> guarantee exactly following Table A.6. It checks that the right
> instructions are emitted (ex. fence rw,r) but not the order of those
> instructions.
> 
> LLVM mapping notes
> --------
> LLVM emits corresponding fences for atomic_signal_fence instructions.
> This seems to be an oversight since AFAIK atomic_signal_fence acts as a
> compiler directive. GCC does not emit any fences for atomic_signal_fence
> instructions.
> 
> Future work
> --------
> There still remains some work to be done in this space after this
> patchset fixes the correctness of the GCC mappings. 
> * Look into explicitly handling subword loads/stores.
> * Look into using AMOSWAP.rl for store words/doubles.
> * L{b|h|w|d}.aq/rl & S{b|h|w|d}.aq/rl support once ratified.
> * zTSO mappings.
> 
> Prior Patchsets
> --------
> Patchset v1:
>   https://gcc.gnu.org/pipermail/gcc-patches/2022-April/592950.html 
> 
> Patchset v2:
>   https://gcc.gnu.org/pipermail/gcc-patches/2023-April/615264.html 
> 
> Patchset v3:
>   https://gcc.gnu.org/pipermail/gcc-patches/2023-April/615431.html 
> 
> Patchset v4:
>   https://gcc.gnu.org/pipermail/gcc-patches/2023-April/615748.html 
> 
> Changelogs
> --------
> Changes for v2:
> * Use memmodel_base rather than a custom simplify_memmodel function
>   (Inspired by Christoph Muellner's patch 1/9)
> * Move instruction styling change from [v1 5/7] to [v2 3/8] to reduce
>   [v2 6/8]'s complexity
> * Eliminated %K flag for atomic store introduced in v1 in favor of
>   if/else
> * Rebase/test
> 
> Changes for v3:
> * Use a trailing fence for atomic stores to be compatible with table A.7
> * Emit an optimized fence r,rw following a SEQ_CST load
> * Consolidate tests in [PATCH v3 10/10]
> * Add tests for basic A.6 conformance
> 
> Changes for v4:
> * Update cover letter to cover more of the reasoning behind moving to a
>   compatibility mapping
> * Improve conformance testcases patch assertions and add new
>   compare-exchange testcases
> 
> Changes for v5:
> * Update cover letter to cover more context and reasoning behind moving
>   to a compatibility mapping
> * Rebase to include the subword-atomic cases introduced here:
>   https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616080.html
> * Add basic amo-add subword atomic testcases
> * Reformat changelogs
> * Fix misc. whitespace issues
> 
> [1] GCC port with mappings merged 06 Feb 2017
>   https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=09cae7507d9e88f2b05cf3a9404bf181e65ccbac
> 
> [2] A.6 mappings added to ISA manual 12 Dec 2017
> https://github.com/riscv/riscv-isa-manual/commit/9da1a115bcc4fe327f35acceb851d4850d12e9fa 
> 
> [3] Hans Boehm almost-all-store Microbenchmark:
> // Copyright 2023 Google LLC.
> // SPDX-License-Identifier: Apache-2.0
> 
> #include <atomic>
> #include <iostream>
> #include <time.h>
> 
> static constexpr int INNER_ITERS = 10'000'000;
> static constexpr int OUTER_ITERS = 20;
> static constexpr int N_TESTS = 4;
> 
> volatile int the_volatile(17);
> std::atomic<int> the_atomic(17);
> 
> void test1(int i) {
>   the_volatile = i;
> }
> 
> void test2(int i) {
>   the_atomic.store(i, std::memory_order_release);
> }
> 
> void test3(int i) {
>   atomic_thread_fence(std::memory_order_release);
>   the_atomic.store(i, std::memory_order_relaxed);
> }
> 
> void test4(int i) {
>   atomic_thread_fence(std::memory_order_release);
>   the_atomic.store(i, std::memory_order_relaxed);
>   atomic_thread_fence(std::memory_order_seq_cst);
> }
> 
> typedef void (*int_func)(int);
> 
> uint64_t getnanos() {
>   struct timespec result;
>   if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &result) != 0) {
> 	std::cerr << "clock_gettime() failed\n";
> 	exit(1);
>   }
>   return (uint64_t)result.tv_nsec + 1'000'000'000 * (uint64_t)result.tv_sec;
> }
> 
> int_func tests[N_TESTS] = { test1, test2, test3, test4 };
> const char *test_names[N_TESTS] =
> 	{ "control", "release store", "store with release fence", "store with two fences" };
> uint64_t total_time[N_TESTS] = { 0 };
> 
> int main(int argc, char **argv) {
>   struct timespec res;
>   if (clock_getres(CLOCK_PROCESS_CPUTIME_ID, &res) != 0) {
> 	std::cerr << "clock_getres() failed\n";
> 	exit(1);
>   } else {
> 	std::cout << "nsec resolution = " << res.tv_nsec << std::endl;
>   }
>   if (argc == 2 && argv[1][0] == 'r') {
> 	// Run tests in reverse order.
> 	for (int i = 0; i < N_TESTS / 2; ++i) {
>   	std::swap(tests[i], tests[N_TESTS - 1 - i]);
>   	std::swap(test_names[i], test_names[N_TESTS - 1 - i]);
> 	}
>   }
>   for (int i = 0; i < OUTER_ITERS; ++i) {
> 	// Alternate tests to minimize bias due to thermal throttling.
> 	for (int j = 0; j < N_TESTS; ++j) {
>   	uint64_t start_time = getnanos();
>   	for (int k = 1; k <= INNER_ITERS; ++k) {
>     	tests[j](k); // Provides memory accesses between tests.
>   	}
>   	// Ignore first iteration for all tests. The first iteration of the first test is
>   	// empirically slightly slower.
>   	if (i != 0) {
>     	total_time[j] += getnanos() - start_time;
>   	}
>   	if ((tests[j] == test1 ? the_volatile : the_atomic.load()) != INNER_ITERS) {
>     	std::cerr << "result check failed, test = " << j << ", " << the_volatile << std::endl;
>     	exit(1);
>   	}
> 	}
>   }
>   for (int i = 0; i < N_TESTS; ++i) {
> 	double nsecs_per_iter = (double) total_time[i] / INNER_ITERS / (OUTER_ITERS - 1);
> 	std::cout << test_names[i] << " took " << nsecs_per_iter << " nseconds per iteration\n";
>   }
>   exit(0);
> }
> 
> [4] Hans Boehm Raw Microbenchmark Results
> Intel(R) Core(TM) i7-8650U (sanity check only):
> 
> hboehm@hboehm-glaptop0:~/tests$ ./a.out
> nsec resolution = 1
> control took 1.34812 nseconds per iteration
> release store took 1.30038 nseconds per iteration
> store with release fence took 1.2933 nseconds per iteration
> store with two fences took 18.0474 nseconds per iteration
> 
> Cortex A53 (Raspberry pi)
> hboehm@rpi3-20210823:~/tests$ ./a.out
> nsec resolution = 1
> control took 7.15224 nseconds per iteration
> release store took 10.7282 nseconds per iteration
> store with release fence took 7.15221 nseconds per iteration
> store with two fences took 10.013 nseconds per iteration
> hboehm@rpi3-20210823:~/tests$ ./a.out -r
> nsec resolution = 1
> control took 7.15227 nseconds per iteration
> release store took 10.7282 nseconds per iteration
> store with release fence took 7.16133 nseconds per iteration
> store with two fences took 10.034 nseconds per iteration
> 
> Cortex A55 (Pixel 6 Pro)
> 
> raven:/data/tmp # taskset 0f ./release-timer
> nsec resolution = 1
> control took 2.77965 nseconds per iteration
> release store took 8.89654 nseconds per iteration
> store with release fence took 4.44787 nseconds per iteration
> store with two fences took 7.78331 nseconds per iteration
> raven:/data/tmp # taskset 0f ./release-timer -r                                                                 	 
> nsec resolution = 1
> control took 2.78024 nseconds per iteration
> release store took 8.89574 nseconds per iteration
> store with release fence took 4.44844 nseconds per iteration
> store with two fences took 7.78428 nseconds per iteration
> 
> Cortex A76 (Pixel 6 Pro)
> raven:/data/tmp # taskset 30 ./release-timer -r                                                                 	 
> nsec resolution = 1
> control took 1.77686 nseconds per iteration
> release store took 1.81081 nseconds per iteration
> store with release fence took 5.3301 nseconds per iteration
> store with two fences took 8.88346 nseconds per iteration
> raven:/data/tmp # taskset 30 ./release-timer                                                                   	 
> nsec resolution = 1
> control took 1.78021 nseconds per iteration
> release store took 1.86095 nseconds per iteration
> store with release fence took 5.33088 nseconds per iteration
> store with two fences took 8.88462 nseconds per iteration
> 
> Cortex X1 (Pixel 6 Pro)
> raven:/data/tmp # taskset c0 ./release-timer                                                                   	 
> nsec resolution = 1
> control took 2.14252 nseconds per iteration
> release store took 2.14258 nseconds per iteration
> store with release fence took 4.32982 nseconds per iteration
> store with two fences took 7.05234 nseconds per iteration
> raven:/data/tmp # taskset c0 ./release-timer -r                                                                 	 
> nsec resolution = 1
> control took 2.14254 nseconds per iteration
> release store took 2.14251 nseconds per iteration
> store with release fence took 4.3273 nseconds per iteration
> store with two fences took 7.05239 nseconds per iteration
> 
> [5] Hans Boehm almost-all-load Microbenchmark:
> // Copyright 2023 Google LLC.
> // SPDX-License-Identifier: Apache-2.0
> 
> #include <atomic>
> #include <iostream>
> #include <time.h>
> 
> static constexpr int INNER_ITERS = 10'000'000;
> static constexpr int OUTER_ITERS = 20;
> static constexpr int N_TESTS = 4;
> 
> volatile int the_volatile(17);
> std::atomic<int> the_atomic(17);
> 
> int test1() {
>   return the_volatile;
> }
> 
> int test2() {
>   return the_atomic.load(std::memory_order_acquire);
> }
> 
> int test3() {
>   int result = the_atomic.load(std::memory_order_relaxed);
>   atomic_thread_fence(std::memory_order_acquire);
>   return result;
> }
> 
> int test4() {
>   atomic_thread_fence(std::memory_order_seq_cst);
>   int result = the_atomic.load(std::memory_order_relaxed);
>   atomic_thread_fence(std::memory_order_acquire);
>   return result;
> }
> 
> typedef int (*int_func)();
> 
> uint64_t getnanos() {
>   struct timespec result;
>   if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &result) != 0) {
> 	std::cerr << "clock_gettime() failed\n";
> 	exit(1);
>   }
>   return (uint64_t)result.tv_nsec + 1'000'000'000 * (uint64_t)result.tv_sec;
> }
> 
> int_func tests[N_TESTS] = { test1, test2, test3, test4 };
> const char *test_names[N_TESTS] =
> 	{ "control", "acquire load", "load with acquire fence", "load with two fences" };
> uint64_t total_time[N_TESTS] = { 0 };
> 
> uint sum, last_sum = 0;
> 
> int main(int argc, char **argv) {
>   struct timespec res;
>   if (clock_getres(CLOCK_PROCESS_CPUTIME_ID, &res) != 0) {
> 	std::cerr << "clock_getres() failed\n";
> 	exit(1);
>   } else {
> 	std::cout << "nsec resolution = " << res.tv_nsec << std::endl;
>   }
>   if (argc == 2 && argv[1][0] == 'r') {
> 	// Run tests in reverse order.
> 	for (int i = 0; i < N_TESTS / 2; ++i) {
>   	std::swap(tests[i], tests[N_TESTS - 1 - i]);
>   	std::swap(test_names[i], test_names[N_TESTS - 1 - i]);
> 	}
>   }
>   for (int i = 0; i < OUTER_ITERS; ++i) {
> 	// Alternate tests to minimize bias due to thermal throttling.
> 	for (int j = 0; j < N_TESTS; ++j) {
>   	sum = 0;
>   	uint64_t start_time = getnanos();
>   	for (int k = 0; k < INNER_ITERS; ++k) {
>     	sum += tests[j](); // Provides memory accesses between tests.
>   	}
>   	// Ignore first iteration for all tests. The first iteration of the first test is
>   	// empirically slightly slower.
>   	if (i != 0) {
>     	total_time[j] += getnanos() - start_time;
>   	}
>   	if (sum == 0 || last_sum != 0 && sum != last_sum) {
>     	std::cerr << "result check failed";
>     	exit(1);
>   	}
>   	last_sum = sum;
> 	}
>   }
>   for (int i = 0; i < N_TESTS; ++i) {
> 	double nsecs_per_iter = (double) total_time[i] / INNER_ITERS / (OUTER_ITERS - 1);
> 	std::cout << test_names[i] << " took " << nsecs_per_iter << " nseconds per iteration\n";
>   }
>   exit(0);
> }
> 
> Patrick O'Neill (11):
>   RISC-V: Eliminate SYNC memory models
>   RISC-V: Enforce Libatomic LR/SC SEQ_CST
>   RISC-V: Enforce subword atomic LR/SC SEQ_CST
>   RISC-V: Enforce atomic compare_exchange SEQ_CST
>   RISC-V: Add AMO release bits
>   RISC-V: Strengthen atomic stores
>   RISC-V: Eliminate AMO op fences
>   RISC-V: Weaken LR/SC pairs
>   RISC-V: Weaken mem_thread_fence
>   RISC-V: Weaken atomic loads
>   RISC-V: Table A.6 conformance tests
> 
>  gcc/config/riscv/riscv-protos.h               |   3 +
>  gcc/config/riscv/riscv.cc                     |  66 ++++--
>  gcc/config/riscv/sync.md                      | 194 ++++++++++++------
>  .../riscv/amo-table-a-6-amo-add-1.c           |   8 +
>  .../riscv/amo-table-a-6-amo-add-2.c           |   8 +
>  .../riscv/amo-table-a-6-amo-add-3.c           |   8 +
>  .../riscv/amo-table-a-6-amo-add-4.c           |   8 +
>  .../riscv/amo-table-a-6-amo-add-5.c           |   8 +
>  .../riscv/amo-table-a-6-compare-exchange-1.c  |  10 +
>  .../riscv/amo-table-a-6-compare-exchange-2.c  |  10 +
>  .../riscv/amo-table-a-6-compare-exchange-3.c  |  10 +
>  .../riscv/amo-table-a-6-compare-exchange-4.c  |  10 +
>  .../riscv/amo-table-a-6-compare-exchange-5.c  |  10 +
>  .../riscv/amo-table-a-6-compare-exchange-6.c  |  11 +
>  .../riscv/amo-table-a-6-compare-exchange-7.c  |  10 +
>  .../gcc.target/riscv/amo-table-a-6-fence-1.c  |   8 +
>  .../gcc.target/riscv/amo-table-a-6-fence-2.c  |  10 +
>  .../gcc.target/riscv/amo-table-a-6-fence-3.c  |  10 +
>  .../gcc.target/riscv/amo-table-a-6-fence-4.c  |  10 +
>  .../gcc.target/riscv/amo-table-a-6-fence-5.c  |  10 +
>  .../gcc.target/riscv/amo-table-a-6-load-1.c   |   9 +
>  .../gcc.target/riscv/amo-table-a-6-load-2.c   |  11 +
>  .../gcc.target/riscv/amo-table-a-6-load-3.c   |  11 +
>  .../gcc.target/riscv/amo-table-a-6-store-1.c  |   9 +
>  .../gcc.target/riscv/amo-table-a-6-store-2.c  |  11 +
>  .../riscv/amo-table-a-6-store-compat-3.c      |  11 +
>  .../riscv/amo-table-a-6-subword-amo-add-1.c   |   9 +
>  .../riscv/amo-table-a-6-subword-amo-add-2.c   |   9 +
>  .../riscv/amo-table-a-6-subword-amo-add-3.c   |   9 +
>  .../riscv/amo-table-a-6-subword-amo-add-4.c   |   9 +
>  .../riscv/amo-table-a-6-subword-amo-add-5.c   |   9 +
>  gcc/testsuite/gcc.target/riscv/pr89835.c      |   9 +
>  libgcc/config/riscv/atomic.c                  |   4 +-
>  33 files changed, 467 insertions(+), 75 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-amo-add-1.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-amo-add-2.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-amo-add-3.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-amo-add-4.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-amo-add-5.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-compare-exchange-1.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-compare-exchange-2.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-compare-exchange-3.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-compare-exchange-4.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-compare-exchange-5.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-compare-exchange-6.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-compare-exchange-7.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-fence-1.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-fence-2.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-fence-3.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-fence-4.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-fence-5.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-load-1.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-load-2.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-load-3.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-store-1.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-store-2.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-store-compat-3.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-subword-amo-add-1.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-subword-amo-add-2.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-subword-amo-add-3.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-subword-amo-add-4.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-table-a-6-subword-amo-add-5.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/pr89835.c

These changes address and fix all the issues I reported/I'm aware of,
thank you!

Tested-by: Andrea Parri <andrea@rivosinc.com>

  Andrea