On 03/08/15 14:01, James Greenhalgh wrote: > On Fri, Jul 24, 2015 at 11:55:33AM +0100, Kyrill Tkachov wrote: >> Hi all, >> >> This patch implements an aarch64-specific expansion of the signed modulo by a power of 2. >> The proposed sequence makes use of the conditional negate instruction CSNEG. >> For a power of N, x % N can be calculated with: >> negs x1, x0 >> and x0, x0, #(N - 1) >> and x1, x1, #(N - 1) >> csneg x0, x0, x1, mi >> >> So, for N == 256 this would be: >> negs x1, x0 >> and x0, x0, #255 >> and x1, x1, #255 >> csneg x0, x0, x1, mi >> >> For comparison, the existing sequence emitted by expand_smod_pow2 in expmed.c is: >> asr x1, x0, 63 >> lsr x1, x1, 56 >> add x0, x0, x1 >> and x0, x0, 255 >> sub x0, x0, x1 >> >> Note that the CSNEG sequence is one instruction shorter and that the two and operations >> are independent, compared to the existing sequence where all instructions are dependent >> on the preceeding instructions. >> >> For the special case of N == 2 we can do even better: >> cmp x0, xzr >> and x0, x0, 1 >> csneg x0, x0, x0, ge >> >> I first tried implementing this in the generic code in expmed.c but that didn't work >> out for a few reasons: >> >> * This relies on having a conditional-negate instruction. We could gate it on >> HAVE_conditional_move and the combiner is capable of merging the final negate into >> the conditional move if a conditional negate is available (like on aarch64) but on >> targets without a conditional negate this would end up emitting a separate negate. >> >> * The first negs has to be a negs for the sequence to be a win i.e. having a separate >> negate and compare makes the sequence slower than the existing one (at least in my >> microbenchmarking) and I couldn't get subsequent passes to combine the negate and combine >> into the negs (presumably due to the use of the negated result in one of the ands). >> Doing it in the aarch64 backend where I could just call the exact gen_* functions that >> I need worked much more cleanly. >> >> The costing logic is updated to reflect this sequence during the intialisation of >> expmed.c where it calculates the smod_pow2_cheap metric. >> >> The tests will come in patch 3 of the series which are partly shared with the equivalent >> arm implementation. >> >> Bootstrapped and tested on aarch64. >> Ok for trunk? >> >> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c >> index 9d88a60..7bb4a55 100644 >> --- a/gcc/config/aarch64/aarch64.c >> +++ b/gcc/config/aarch64/aarch64.c >> @@ -6639,8 +6639,26 @@ cost_plus: >> if (VECTOR_MODE_P (mode)) >> *cost += extra_cost->vect.alu; >> else if (GET_MODE_CLASS (mode) == MODE_INT) >> - *cost += (extra_cost->mult[mode == DImode].add >> - + extra_cost->mult[mode == DImode].idiv); >> + { >> + /* We can expand signed mod by power of 2 using a >> + NEGS, two parallel ANDs and a CSNEG. Assume here >> + that CSNEG is COSTS_N_INSNS (1). This case should >> + only ever be reached through the set_smod_pow2_cheap check >> + in expmed.c. */ >> + if (code == MOD >> + && CONST_INT_P (XEXP (x, 1)) >> + && exact_log2 (INTVAL (XEXP (x, 1))) > 0 >> + && (mode == SImode || mode == DImode)) >> + { >> + *cost += COSTS_N_INSNS (3) >> + + 2 * extra_cost->alu.logical >> + + extra_cost->alu.arith; >> + return true; >> + } >> + >> + *cost += (extra_cost->mult[mode == DImode].add >> + + extra_cost->mult[mode == DImode].idiv); >> + } >> else if (mode == DFmode) >> *cost += (extra_cost->fp[1].mult >> + extra_cost->fp[1].div); > This looks like it calculates the wrong cost for !speed. I think we will > still expand through mod3 when compiling for size, so we probably > still want to cost the multiple instructions. > > Have I misunderstood? You're right, the logic needs a bit of wiggling to do the right thing. I've moved this case into a separate MOD case and added a gate on speed for the extra costs addition. Ok? Thanks, Kyrill 2015-08-13 Kyrylo Tkachov * config/aarch64/aarch64.md (mod3): New define_expand. (*neg2_compare0): Rename to... (neg2_compare0): ... This. * config/aarch64/aarch64.c (aarch64_rtx_costs, MOD case): Move check for speed inside the if-then-elses. Reflect CSNEG sequence in MOD by power of 2 case. > > Thanks, > James >