public inbox for gcc-patches@gcc.gnu.org
From: Richard Sandiford <richard.sandiford@arm.com>
To: Tamar Christina <tamar.christina@arm.com>
Cc: gcc-patches@gcc.gnu.org,  nd@arm.com,  Richard.Earnshaw@arm.com,
	 Marcus.Shawcroft@arm.com,  Kyrylo.Tkachov@arm.com
Subject: Re: [PATCH 2/2]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583]
Date: Fri, 10 Feb 2023 14:10:16 +0000	[thread overview]
Message-ID: <mpt5yc94nnb.fsf@arm.com> (raw)
In-Reply-To: <Y+UrwQSz3hWz+Puo@arm.com> (Tamar Christina's message of "Thu, 9 Feb 2023 17:22:09 +0000")

I was asking in the 1/2 review whether we need the optab, but that
decision doesn't affect the other patterns, so:

Tamar Christina <tamar.christina@arm.com> writes:
> Hi All,
>
> This replaces the custom division hook with an implementation based on
> add_highpart.  For NEON we implement the add highpart (addition + extraction
> of the upper half of the register, in the same precision) as ADD + LSR.
>
> This representation allows us to easily optimize the sequence using existing
> patterns, and it gets us a pretty decent sequence using SRA:
>
>         umull   v1.8h, v0.8b, v3.8b
>         umull2  v0.8h, v0.16b, v3.16b
>         add     v5.8h, v1.8h, v2.8h
>         add     v4.8h, v0.8h, v2.8h
>         usra    v1.8h, v5.8h, 8
>         usra    v0.8h, v4.8h, 8
>         uzp2    v1.16b, v1.16b, v0.16b
>
> To get the optimal sequence, however, we match (a + ((b + c) >> n)), where n
> is half the precision of the mode of the operation, into addhn + uaddw, which
> is a generally good optimization on its own and gets us back to:
>
> .L4:
>         ldr     q0, [x3]
>         umull   v1.8h, v0.8b, v5.8b
>         umull2  v0.8h, v0.16b, v5.16b
>         addhn   v3.8b, v1.8h, v4.8h
>         addhn   v2.8b, v0.8h, v4.8h
>         uaddw   v1.8h, v1.8h, v3.8b
>         uaddw   v0.8h, v0.8h, v2.8b
>         uzp2    v1.16b, v1.16b, v0.16b
>         str     q1, [x3], 16
>         cmp     x3, x4
>         bne     .L4
>
> For SVE2 we optimize the initial sequence to the same ADD + LSR, which gets us:
>
> .L3:
>         ld1b    z0.h, p0/z, [x0, x3]
>         mul     z0.h, p1/m, z0.h, z2.h
>         add     z1.h, z0.h, z3.h
>         usra    z0.h, z1.h, #8
>         lsr     z0.h, z0.h, #8
>         st1b    z0.h, p0, [x0, x3]
>         inch    x3
>         whilelo p0.h, w3, w2
>         b.any   .L3
> .L1:
>         ret
>
> and to get the optimal sequence I match (a + b) >> n (same constraint on n)
> to addhnb, which gets us to:
>
> .L3:
>         ld1b    z0.h, p0/z, [x0, x3]
>         mul     z0.h, p1/m, z0.h, z2.h
>         addhnb  z1.b, z0.h, z3.h
>         addhnb  z0.b, z0.h, z1.h
>         st1b    z0.h, p0, [x0, x3]
>         inch    x3
>         whilelo p0.h, w3, w2
>         b.any   .L3
>
> There are multiple possible RTL representations for these optimizations.  I did
> not represent them using a zero_extend because we seem very inconsistent about
> this in the backend.  Since they are unspecs we won't match them from vector ops
> anyway.  I figured maintainers would prefer this, but my maintainer ouija board
> is still out for repairs :)

I agree this is the best approach as things stand.  Personally, I'd like
to have some way for the target to define simplification rules based on
unspecs, so that unspecs act more like target-specific rtl codes.  But I
know others disagree, and it wouldn't really apply to this case anyway.

> There are no new tests, as correctness tests were added to the mid-end and
> codegen tests for this already exist.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going> issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	PR target/108583
> 	* config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv<mode>3): Remove.
> 	(<su>add<mode>3_highpart, *bitmask_shift_plus<mode>): New.
> 	* config/aarch64/aarch64-sve2.md (<su>add<mode>3_highpart,
> 	*bitmask_shift_plus<mode>): New.
> 	(@aarch64_bitmask_udiv<mode>3): Remove.
> 	* config/aarch64/aarch64.cc
> 	(aarch64_vectorize_can_special_div_by_constant): Removed.
> 	* config/aarch64/iterators.md (UNSPEC_SADD_HIGHPART,
> 	UNSPEC_UADD_HIGHPART, ADD_HIGHPART): New.
>
> --- inline copy of patch -- 
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 7f212bf37cd2c120dceb7efa733c9fa76226f029..26871a56d1fdb134f0ad9d828ce68a8df0272c53 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -4867,62 +4867,48 @@ (define_expand "aarch64_<sur><addsub>hn2<mode>"
>    }
>  )
>  
> -;; div optimizations using narrowings
> -;; we can do the division e.g. shorts by 255 faster by calculating it as
> -;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> -;; double the precision of x.
> -;;
> -;; If we imagine a short as being composed of two blocks of bytes then
> -;; adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> -;; adding 1 to each sub component:
> -;;
> -;;      short value of 16-bits
> -;; ┌──────────────┬────────────────┐
> -;; │              │                │
> -;; └──────────────┴────────────────┘
> -;;   8-bit part1 ▲  8-bit part2   ▲
> -;;               │                │
> -;;               │                │
> -;;              +1               +1
> -;;
> -;; after the first addition, we have to shift right by 8, and narrow the
> -;; results back to a byte.  Remember that the addition must be done in
> -;; double the precision of the input.  Since 8 is half the size of a short
> -;; we can use a narrowing halfing instruction in AArch64, addhn which also
> -;; does the addition in a wider precision and narrows back to a byte.  The
> -;; shift itself is implicit in the operation as it writes back only the top
> -;; half of the result. i.e. bits 2*esize-1:esize.
> -;;
> -;; Since we have narrowed the result of the first part back to a byte, for
> -;; the second addition we can use a widening addition, uaddw.
> -;;
> -;; For the final shift, since it's unsigned arithmetic we emit an ushr by 8.
> -;;
> -;; The shift is later optimized by combine to a uzp2 with movi #0.
> -(define_expand "@aarch64_bitmask_udiv<mode>3"
> -  [(match_operand:VQN 0 "register_operand")
> -   (match_operand:VQN 1 "register_operand")
> -   (match_operand:VQN 2 "immediate_operand")]
> +;; Implement add_highpart as ADD + RSHIFT.  We have various optimizations for
> +;; narrowing represented as shifts and so this representation will allow us to
> +;; further optimize this should the result require narrowing. The alternative
> +;; representation of ADDHN + UXTL is less efficient and harder to further
> +;; optimize.
> +(define_expand "<su>add<mode>3_highpart"
> +  [(set (match_operand:VQN 0 "register_operand")
> +	(unspec:VQN [(match_operand:VQN 1 "register_operand")
> +		     (match_operand:VQN 2 "register_operand")]
> +		    ADD_HIGHPART))]
> +  "TARGET_SIMD"
> +{
> +  rtx result = gen_reg_rtx (<MODE>mode);
> +  int shift_amount = GET_MODE_UNIT_BITSIZE (<MODE>mode) / 2;
> +  rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode,
> +							shift_amount);
> +  emit_insn (gen_add<mode>3 (result, operands[1], operands[2]));
> +  emit_insn (gen_aarch64_simd_lshr<mode> (operands[0], result, shift_vector));
> +  DONE;
> +})
> +
> +;; Optimize ((a + b) >> n) + c where n is half the bitsize of the vector
> +(define_insn_and_split "*bitmask_shift_plus<mode>"
> +  [(set (match_operand:VQN 0 "register_operand" "=w")
> +	(plus:VQN
> +	  (lshiftrt:VQN
> +	    (plus:VQN (match_operand:VQN 1 "register_operand" "w")
> +		      (match_operand:VQN 2 "register_operand" "w"))
> +	    (match_operand:VQN 3 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
> +	  (match_operand:VQN 4 "register_operand" "w")))]
>    "TARGET_SIMD"
> +  "#"
> +  "&& !reload_completed"

This is an ICE trap, since "#" forces a split while "!reload_completed"
prevents one after reload.

I think the theoretically correct way would be to use operand 0 as a
temporary when reload_completed, which in turn means making it an
earlyclobber.
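
Something like this, completely untested and just to illustrate the idea
(the RTL and the split body are taken from the patch, with only the
earlyclobber and the choice of temporary changed):

(define_insn_and_split "*bitmask_shift_plus<mode>"
  [(set (match_operand:VQN 0 "register_operand" "=&w")
        (plus:VQN
          (lshiftrt:VQN
            (plus:VQN (match_operand:VQN 1 "register_operand" "w")
                      (match_operand:VQN 2 "register_operand" "w"))
            (match_operand:VQN 3 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
          (match_operand:VQN 4 "register_operand" "w")))]
  "TARGET_SIMD"
  "#"
  "&& true"
  [(const_int 0)]
{
  /* After reload we can no longer create new pseudos, so reuse the
     (now earlyclobbered) destination as the narrow temporary instead.  */
  rtx tmp = (reload_completed
             ? lowpart_subreg (<VNARROWQ>mode, operands[0], <MODE>mode)
             : gen_reg_rtx (<VNARROWQ>mode));
  emit_insn (gen_aarch64_addhn<mode> (tmp, operands[1], operands[2]));
  emit_insn (gen_aarch64_uaddw<Vnarrowq> (operands[0], operands[4], tmp));
  DONE;
})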

However, IIUC, this pattern would only be formed from combining
three distinct patterns.  Is that right?  If so, we should be able
to handle it as a plain define_split, with no define_insn.
That should make things simpler, so would be worth trying before
the changes I mentioned above.
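
FWIW, I'd expect the define_split version to look something like this
(equally untested; the body is the same as in the patch):

(define_split
  [(set (match_operand:VQN 0 "register_operand")
        (plus:VQN
          (lshiftrt:VQN
            (plus:VQN (match_operand:VQN 1 "register_operand")
                      (match_operand:VQN 2 "register_operand"))
            (match_operand:VQN 3 "aarch64_simd_shift_imm_vec_exact_top"))
          (match_operand:VQN 4 "register_operand")))]
  "TARGET_SIMD && can_create_pseudo_p ()"
  [(const_int 0)]
{
  rtx tmp = gen_reg_rtx (<VNARROWQ>mode);
  emit_insn (gen_aarch64_addhn<mode> (tmp, operands[1], operands[2]));
  emit_insn (gen_aarch64_uaddw<Vnarrowq> (operands[0], operands[4], tmp));
  DONE;
})

The can_create_pseudo_p () check is only belt and braces: if this form is
only ever created (and immediately split) by combine, it will always be
true at that point.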

> +  [(const_int 0)]
>  {
> -  unsigned HOST_WIDE_INT size
> -    = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode)) - 1;
> -  rtx elt = unwrap_const_vec_duplicate (operands[2]);
> -  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
> -    FAIL;
> -
> -  rtx addend = gen_reg_rtx (<MODE>mode);
> -  rtx val = aarch64_simd_gen_const_vector_dup (<VNARROWQ2>mode, 1);
> -  emit_move_insn (addend, lowpart_subreg (<MODE>mode, val, <VNARROWQ2>mode));
> -  rtx tmp1 = gen_reg_rtx (<VNARROWQ>mode);
> -  rtx tmp2 = gen_reg_rtx (<MODE>mode);
> -  emit_insn (gen_aarch64_addhn<mode> (tmp1, operands[1], addend));
> -  unsigned bitsize = GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode);
> -  rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode, bitsize);
> -  emit_insn (gen_aarch64_uaddw<Vnarrowq> (tmp2, operands[1], tmp1));
> -  emit_insn (gen_aarch64_simd_lshr<mode> (operands[0], tmp2, shift_vector));
> +  rtx tmp = gen_reg_rtx (<VNARROWQ>mode);
> +  emit_insn (gen_aarch64_addhn<mode> (tmp, operands[1], operands[2]));
> +  emit_insn (gen_aarch64_uaddw<Vnarrowq> (operands[0], operands[4], tmp));
>    DONE;
> -})
> +}
> +  [(set_attr "type" "neon_add_halve<q>")]

I think we should leave this out, since it's a multi-instruction pattern.

> +)
>  
>  ;; pmul.
>  
> diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
> index 40c0728a7e6f00c395c360ce7625bc2e4a018809..ad01c1ddf9257cec951ed0c16558a3c4d856813b 100644
> --- a/gcc/config/aarch64/aarch64-sve2.md
> +++ b/gcc/config/aarch64/aarch64-sve2.md
> @@ -2317,39 +2317,51 @@ (define_insn "@aarch64_sve_<optab><mode>"
>  ;; ---- [INT] Misc optab implementations
>  ;; -------------------------------------------------------------------------
>  ;; Includes:
> -;; - aarch64_bitmask_udiv
> +;; - add_highpart
>  ;; -------------------------------------------------------------------------
>  
> -;; div optimizations using narrowings
> -;; we can do the division e.g. shorts by 255 faster by calculating it as
> -;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> -;; double the precision of x.
> -;;
> -;; See aarch64-simd.md for bigger explanation.
> -(define_expand "@aarch64_bitmask_udiv<mode>3"
> -  [(match_operand:SVE_FULL_HSDI 0 "register_operand")
> -   (match_operand:SVE_FULL_HSDI 1 "register_operand")
> -   (match_operand:SVE_FULL_HSDI 2 "immediate_operand")]
> +;; Implement add_highpart as ADD + RSHIFT.  We have various optimizations for
> +;; narrowing represented as shifts and so this representation will allow us to
> +;; further optimize this should the result require narrowing. The alternative
> +;; representation of ADDHN + UXTL is less efficient and harder to further
> +;; optimize.
> +(define_expand "<su>add<mode>3_highpart"
> +  [(set (match_operand:SVE_FULL_HSDI 0 "register_operand")
> +	(unspec:SVE_FULL_HSDI
> +	  [(match_operand:SVE_FULL_HSDI 1 "register_operand")
> +	   (match_operand:SVE_FULL_HSDI 2 "register_operand")]
> +	  ADD_HIGHPART))]
>    "TARGET_SVE2"
>  {
> -  unsigned HOST_WIDE_INT size
> -    = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROW>mode)) - 1;
> -  rtx elt = unwrap_const_vec_duplicate (operands[2]);
> -  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
> -    FAIL;
> +  rtx result = gen_reg_rtx (<MODE>mode);
> +  int shift_amount = GET_MODE_UNIT_BITSIZE (<MODE>mode) / 2;
> +  rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode,
> +							shift_amount);
> +  emit_insn (gen_add<mode>3 (result, operands[1], operands[2]));
> +  emit_insn (gen_vlshr<mode>3 (operands[0], result, shift_vector));
> +  DONE;
> +})
>  
> -  rtx addend = gen_reg_rtx (<MODE>mode);
> +;; Optimize ((a + b) >> n) where n is half the bitsize of the vector
> +(define_insn_and_split "*bitmask_shift_plus<mode>"
> +  [(set (match_operand:SVE_FULL_HSDI 0 "register_operand" "=w")
> +	(unspec:SVE_FULL_HSDI [
> +	    (match_operand:<VPRED> 1 "register_operand" "Upl")

Looks like this can be:

  (match_operand:<VPRED> 1)

since the predicate isn't used.

> +	    (lshiftrt:SVE_FULL_HSDI
> +	      (plus:SVE_FULL_HSDI
> +		(match_operand:SVE_FULL_HSDI 2 "register_operand" "w")
> +		(match_operand:SVE_FULL_HSDI 3 "register_operand" "w"))
> +	      (match_operand:SVE_FULL_HSDI 4 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
> +        ] UNSPEC_PRED_X))]

Very minor nit, but the formatting used in the file follows the style
in the earlier pattern above, with [ immediately before ( and ]
immediately after ).  Not that that's inherently better or anything,
it's just a consistency thing.
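
I.e. (folding in the operand 1 change from above) something like:

  (unspec:SVE_FULL_HSDI
    [(match_operand:<VPRED> 1)
     (lshiftrt:SVE_FULL_HSDI
       ...)]
    UNSPEC_PRED_X)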

> +  "TARGET_SVE2"
> +  "#"
> +  "&& !reload_completed"
> +  [(const_int 0)]
> +{
>    rtx tmp1 = gen_reg_rtx (<VNARROW>mode);
> -  rtx tmp2 = gen_reg_rtx (<VNARROW>mode);
> -  rtx val = aarch64_simd_gen_const_vector_dup (<VNARROW>mode, 1);
> -  emit_move_insn (addend, lowpart_subreg (<MODE>mode, val, <VNARROW>mode));
> -  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp1, operands[1],
> -			      addend));
> -  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp2, operands[1],
> -			      lowpart_subreg (<MODE>mode, tmp1,
> -					      <VNARROW>mode)));
> +  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp1, operands[2], operands[3]));
>    emit_move_insn (operands[0],
> -		  lowpart_subreg (<MODE>mode, tmp2, <VNARROW>mode));
> +		  lowpart_subreg (<MODE>mode, tmp1, <VNARROW>mode));
>    DONE;
>  })

Since this is a single instruction, I'm not sure it's worth splitting it.
Perhaps there would be CSE opportunities from having a single form,
but it seems unlikely.  And doing the unsplit form is nice and safe.
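
I.e. keep it as a single define_insn that emits the instruction directly,
something like the below (untested, and the <Ventype>/<Vetype> spellings
are from memory, so worth double-checking against the other addhnb
patterns):

(define_insn "*bitmask_shift_plus<mode>"
  [(set (match_operand:SVE_FULL_HSDI 0 "register_operand" "=w")
        (unspec:SVE_FULL_HSDI
          [(match_operand:<VPRED> 1)
           (lshiftrt:SVE_FULL_HSDI
             (plus:SVE_FULL_HSDI
               (match_operand:SVE_FULL_HSDI 2 "register_operand" "w")
               (match_operand:SVE_FULL_HSDI 3 "register_operand" "w"))
             (match_operand:SVE_FULL_HSDI 4
               "aarch64_simd_shift_imm_vec_exact_top" "Dr"))]
          UNSPEC_PRED_X))]
  "TARGET_SVE2"
  "addhnb\t%0.<Ventype>, %2.<Vetype>, %3.<Vetype>"
)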

But yeah, generating the patterns this way seems like a good approach.
It might even help optimise open-coded versions of the same trick.

Thanks,
Richard


>  
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index e6f47cbbb0d04a6f33b9a741ebb614cabd0204b9..8a04feb29e6bfb423a09dde2cd64853e69d0e1ba 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -24363,46 +24363,6 @@ aarch64_vectorize_vec_perm_const (machine_mode vmode, machine_mode op_mode,
>  
>    return ret;
>  }
> -
> -/* Implement TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST.  */
> -
> -bool
> -aarch64_vectorize_can_special_div_by_constant (enum tree_code code,
> -					       tree vectype, wide_int cst,
> -					       rtx *output, rtx in0, rtx in1)
> -{
> -  if (code != TRUNC_DIV_EXPR
> -      || !TYPE_UNSIGNED (vectype))
> -    return false;
> -
> -  machine_mode mode = TYPE_MODE (vectype);
> -  unsigned int flags = aarch64_classify_vector_mode (mode);
> -  if ((flags & VEC_ANY_SVE) && !TARGET_SVE2)
> -    return false;
> -
> -  int pow = wi::exact_log2 (cst + 1);
> -  auto insn_code = maybe_code_for_aarch64_bitmask_udiv3 (TYPE_MODE (vectype));
> -  /* SVE actually has a div operator, we may have gotten here through
> -     that route.  */
> -  if (pow != (int) (element_precision (vectype) / 2)
> -      || insn_code == CODE_FOR_nothing)
> -    return false;
> -
> -  /* We can use the optimized pattern.  */
> -  if (in0 == NULL_RTX && in1 == NULL_RTX)
> -    return true;
> -
> -  gcc_assert (output);
> -
> -  expand_operand ops[3];
> -  create_output_operand (&ops[0], *output, mode);
> -  create_input_operand (&ops[1], in0, mode);
> -  create_fixed_operand (&ops[2], in1);
> -  expand_insn (insn_code, 3, ops);
> -  *output = ops[0].value;
> -  return true;
> -}
> -
>  /* Generate a byte permute mask for a register of mode MODE,
>     which has NUNITS units.  */
>  
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index 6cbc97cc82c06a68259bdf4dec8a0eab230081e5..ae627ae56cbd1e8b882e596dba974e74ef396e0e 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -750,6 +750,8 @@ (define_c_enum "unspec"
>      UNSPEC_REVH		; Used in aarch64-sve.md.
>      UNSPEC_REVW		; Used in aarch64-sve.md.
>      UNSPEC_REVBHW	; Used in aarch64-sve.md.
> +    UNSPEC_SADD_HIGHPART ; Used in aarch64-sve.md.
> +    UNSPEC_UADD_HIGHPART ; Used in aarch64-sve.md.
>      UNSPEC_SMUL_HIGHPART ; Used in aarch64-sve.md.
>      UNSPEC_UMUL_HIGHPART ; Used in aarch64-sve.md.
>      UNSPEC_FMLA		; Used in aarch64-sve.md.
> @@ -2704,6 +2706,7 @@ (define_int_iterator UNPACK [UNSPEC_UNPACKSHI UNSPEC_UNPACKUHI
>  
>  (define_int_iterator UNPACK_UNSIGNED [UNSPEC_UNPACKULO UNSPEC_UNPACKUHI])
>  
> +(define_int_iterator ADD_HIGHPART [UNSPEC_SADD_HIGHPART UNSPEC_UADD_HIGHPART])
>  (define_int_iterator MUL_HIGHPART [UNSPEC_SMUL_HIGHPART UNSPEC_UMUL_HIGHPART])
>  
>  (define_int_iterator CLAST [UNSPEC_CLASTA UNSPEC_CLASTB])
> @@ -3342,6 +3345,8 @@ (define_int_attr su [(UNSPEC_SADDV "s")
>  		     (UNSPEC_UNPACKUHI "u")
>  		     (UNSPEC_UNPACKSLO "s")
>  		     (UNSPEC_UNPACKULO "u")
> +		     (UNSPEC_SADD_HIGHPART "s")
> +		     (UNSPEC_UADD_HIGHPART "u")
>  		     (UNSPEC_SMUL_HIGHPART "s")
>  		     (UNSPEC_UMUL_HIGHPART "u")
>  		     (UNSPEC_COND_FCVTZS "s")


Thread overview: 47+ messages
2023-02-09 17:16 [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583] Tamar Christina
2023-02-09 17:22 ` [PATCH 2/2]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583] Tamar Christina
2023-02-10 10:35   ` Tamar Christina
2023-02-10 14:10   ` Richard Sandiford [this message]
2023-02-10 10:34 ` [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583] Tamar Christina
2023-02-10 13:13 ` Richard Biener
2023-02-10 13:36 ` Richard Sandiford
2023-02-10 13:52   ` Richard Biener
2023-02-10 14:13   ` Tamar Christina
2023-02-10 14:30     ` Richard Sandiford
2023-02-10 14:54       ` Tamar Christina
2023-02-27 11:09       ` Tamar Christina
2023-02-27 12:11         ` Richard Sandiford
2023-02-27 12:14           ` Tamar Christina
2023-02-27 21:33             ` Richard Sandiford
2023-02-27 22:10               ` Tamar Christina
2023-02-28 11:08                 ` Richard Sandiford
2023-02-28 11:12                   ` Tamar Christina
2023-02-28 12:03                     ` Richard Sandiford
2023-03-01 11:30                       ` Richard Biener
2023-02-10 15:56     ` Richard Sandiford
2023-02-10 16:09       ` Tamar Christina
2023-02-10 16:25         ` Richard Sandiford
2023-02-10 16:33           ` Tamar Christina
2023-02-10 16:57             ` Richard Sandiford
2023-02-10 17:01               ` Richard Sandiford
2023-02-10 17:14               ` Tamar Christina
2023-02-10 18:12                 ` Richard Sandiford
2023-02-10 18:34                   ` Richard Biener
2023-02-10 20:58                     ` Andrew MacLeod
2023-02-13  9:54                       ` Tamar Christina
2023-02-15 12:51                         ` Tamar Christina
2023-02-15 16:05                           ` Andrew MacLeod
2023-02-15 17:13                             ` Tamar Christina
2023-02-15 17:50                               ` Andrew MacLeod
2023-02-15 18:42                                 ` Andrew MacLeod
2023-02-22 12:51                                   ` Tamar Christina
2023-02-22 16:41                                   ` Andrew MacLeod
2023-02-22 18:03                                     ` Tamar Christina
2023-02-22 18:33                                       ` Andrew MacLeod
2023-02-23  8:36                                         ` Tamar Christina
2023-02-23 16:39                                           ` Andrew MacLeod
2023-02-23 16:56                                             ` Tamar Christina
2023-03-01 16:57                                             ` Andrew Carlotti
2023-03-01 18:16                                               ` Tamar Christina
2023-02-22 13:06                                 ` Tamar Christina
2023-02-22 15:19                                   ` Andrew MacLeod
