Richard Sandiford wrote:
>
> I see Ralf's already answered this.

Yup, and I ran it earlier.  Took quite a bit to finish, but I figured it
wasn't going to be a quick endeavor.  There's a couple of failures, but
I'm guessing some failures are expected.  Not sure what counts and what
doesn't, so I've attached it.  It was run from a fully-compiled gcc, not
bootstrap, so I'm unsure if that affects the output any.

> There doesn't seem to be anything in the description linking the
> FP multiplier cpu_unit with the division and square root cpu_units,
> so I'm pretty sure it isn't modelled.  Which is fine.  I just think
> you should add something like "We don't model this at present."
> to the end of the comment.
>
> (There's no shame in that.  It's common to omit some details from the
> DFA description, and only mention them in the comments.  The aim after
> all is to get good code, not to describe the pipeline with complete accuracy.
> Sometimes omitting details gives better code.)

Ah!  That's probably because I wasn't sure how to link the division and
square-root units to the multiplier.  I knew that they had to be linked,
because as the R10K manual stated, they're separate/parallel units, but
their issue & completion logic is shared by the multiplier.  So I know
that if the multiplier is busy in either of these two stages, it'll cause
a delay for these other two units, right?

That I think is why I had only three automata, and was funneling square
root and division into the r10k_fp automaton.  I figured this represented
"linking" the multiplier and these two units.  I suppose that wasn't
accurate, though?  Can we even model the issue and completion stages of a
cpu unit?

> Add:
>
>     (automata_option "v")
>
> to one of the .md files and do "make insn-automata.c".  This will
> create a file called "mips.dfa" in the build directory.  At the end
> of that file is a summary of the automata.  The interesting thing is
> the number of DFA states and DFA arcs in the r10k_* automata.
>
> (Hadn't realised it was so hard to get at this information these days.
> It used to be printed on stderr.  There's also support for adding "-v"
> to the genautomata command line, but it seems to have bitrotted and
> no longer works.)

Thanks, this worked great.  I captured the entire build output looking
for this verbosity, but didn't see it; I guess it was hidden at some
point.

Breaking out those two units into their own automata changes things quite
a bit.  The resulting mips.dfa file is only about 300,000 lines long, and
the r10k_fp automaton now has only 8 states (division has 22 and square
root has 36 states).  Originally, the r10k_fp automaton had 6336 states
(and the mips.dfa file was 500,000+ lines long).  So this seems to bring
the state numbers down to look more like the other mips cpu automatons.
I can pass those along if you're interested.

And yeah, I looked at the option parsing bit in genautomata.c.  It looks
like it should work, but that if-then-else construct it's got going just
seems to fail somehow.

> "logical" means things like AND, OR, XOR and NOR.  These insns used
> to be lumped into "arith", but were split out for the benefit of a
> pipeline that doesn't issue all old-"arith" insns in the same way.
>
> (That's the general model.  We split "type" attributes up on an
> as-needed basis, rather than trying to predict in advance what
> would be the finest useful granularity.)

Ah, so R10K is probably in the older class of just handling arith &
logical as one, thus it's safe to lump them into the same insn
reservation.

> TBH, the only way to know is to try it and measure the result.
>
> And like I say, there's absolutely no need to try it.  I was just trying
> to say that the comment should mention bypasses instead of lo_operand.

Curiosity demands I at least look it up :)

I wrote that comment based on what I thought was the way to check -- I
wasn't aware that multiplications and divisions clobbered both HI and LO,
so I can see why bypasses are the way to go.

Speaking of predicates, I get what to do now.  Define a custom predicate
in mips.c (I guess, "mips_check_insn_hi_p"?).  Here's what I think so far
by looking at those two predicates that you mentioned:

    mips_check_insn_hi_p(rtx insn) {
        return IS_INSN_HI(insn);
    }

Then in 10000.md, something like:

    (define_bypass 6 "r10k_imul_single" "mips_check_insn_hi_p")

And I use IS_INSN_HI as a placeholder because I have no idea what
function/macro in the gcc internals checks an insn to see if it's a HI or
LO one.  Is there such a check?  I perused mips.c and poked into ia64.c
looking for something that checks for HI or LO, but nothing stood out to
me really.  Probably 'cause I'm not real sure what I need to be looking
for.

But I assume that would be a decent predicate definition and usage,
right?  Assuming there's a basic "Is this HI?" mechanism that returns
true if yes and false if no, it makes sense to assign that straight to
that predicate's return value, right?  Then define_bypass knows to use
latency 6 for that insn if it knows that it's on the HI side of things.

FYI, I tweaked the 10000.md file to use the LO latencies by default, per
your earlier mention that using LO is the common case.  I also re-wrote
that comment in case I can't figure out a working predicate to check for
HI.

And do you know what the attr type is for MULTU or DMULTU?  imul, imul3,
and imadd don't seem to fit (I am assuming MULTU/DMULTU and friends are
for unsigned?).  The R10K manual has different latencies for those, and
it looks like I don't have insn reservations defined for those.  Or is
this another define_bypass + custom predicate to check for
signed/unsigned?

> Yes.  The costs array is indexed by "enum processor_type".
>
> You also need to remove the r12000, r14000 and r16000 "cpu" attributes,
> because "cpu" must be a carbon copy of "enum processor_type".
> (It's a nasty wart of the infrastructure that we need to define both.)

Done.

> Experimentation, basically.  Costs are used to choose between
> two equivalent implementations of an operation.  E.g. multiplication
> by a constant can be done using a single multiplication insn or by
> a sequence of shifts and adds.
>
> The target-independent code calculates the cost of a sequence of
> insns simply by adding them up.  It doesn't take into account how
> the pipeline might issue them, or what the repeat rates are.
>
> So COSTS_N_INSNS (latency) is a good start, but is often too high on
> superscalar pipelines, where breaking a monolithic operation into
> smaller operations can exploit the parallelism better.  For example,
> if multiplication takes 5 cycles on a dual-issue target, a multiplication
> is often (but not always!) more expensive than 5 single-cycle insns.
>
> The costs are just heuristics, and you have to accept that any given
> choice of values is going to make some things better and some things
> worse.  When I've done scheduling work in the past, I simply tried
> various values and run the result through (commercial) benchmarks.

Well, I know of no commercial benchmarking tools for Linux/MIPS on SGI
systems, and since it sounds like it's mostly guesswork to begin with, I
guess using the same values as the latencies should be kosher.  I don't
suppose there's any rule of thumb involving superscalar pipelines out
there that might say, slice a couple digits off these default latencies?

> I think you've misunderstood what I meant.  I was simply saying
> that you shouldn't define those new TARGET_* and TUNE_* macros.
> They're not used anywhere in your patch, so they're just dead code.
>
> TARGET_FOO should only be defined if some code tests TARGET_FOO.
> Likewise TUNE_FOO.
>
> I certainly wasn't talking about changing the -march options.
> Please keep them all, but map them to PROCESSOR_R10000, which is
> exactly what your revised patch did.

Yup, I see what you were getting at.  I pulled the unneeded
TARGET_R1[246]000 and TUNE_1[246]000 options, as well as the three insn
costs enums related to them.

> You need to add the new options to doc/invoke.texi.

Done.

Attached is round three.  Other changes not mentioned above include
adding frdiv1/2 and frsqrt1/2 insns to the existing reservations.  No
idea if R10K supports these, but better safe than sorry.  I also added
the 'move' insn, even though the manual makes no explicit mention of an
unconditional integer register move operand (only integer condmove).

Also, the R10K manual doesn't seem to differentiate between fmadd and
imadd.  In the latency table, it simply states "MADD" -- might I infer
from this that R10K itself doesn't distinguish between imadd and fmadd,
treats them the same, and so I need to follow suit?  (I've got imadd set
to run on ALU2, whereas fmadd runs on the fp multiplier.)

And do you happen to know what kind of insns LWC1/LDC1/LWXC1/LDXC1 match?
fpload/fpidxload by chance?  Referenced in the manual, they look like
loads, but they have a different latency (which is how I coded them, but
wanted to double check).

Thanks for the feedback!


gcc/
    * config/mips/10000.md: Add R10000 scheduler
    * config/mips/mips.c: Add r10000 params & costs
    * config/mips/mips.h: Add R10k constant
    * config/mips/mips.md: Add r10000 params & incl 10000.md

diff -Naurp gcc.orig/gcc/config/mips/10000.md gcc/gcc/config/mips/10000.md
--- gcc.orig/gcc/config/mips/10000.md  1969-12-31 19:00:00.000000000 -0500
+++ gcc/gcc/config/mips/10000.md  2008-08-04 02:37:13.000000000 -0400
@@ -0,0 +1,223 @@
+;; DFA-based pipeline description for the VR1x000.
+;; Copyright (C) 2005, 2006, 2008 Free Software Foundation, Inc.
+;;
+;; This file is part of GCC.
+
+;; GCC is free software; you can redistribute it and/or modify it
+;; under the terms of the GNU General Public License as published
+;; by the Free Software Foundation; either version 3, or (at your
+;; option) any later version.
+
+;; GCC is distributed in the hope that it will be useful, but WITHOUT
+;; ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+;; or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+;; License for more details.
+
+;; You should have received a copy of the GNU General Public License
+;; along with GCC; see the file COPYING3.  If not see
+;; .
+
+
+;; R12K/R14K/R16K are derivatives of R10K, thus copy its description
+;; until specific tuning for each is added.
+
+;; R10000 has int queue, fp queue, address queue.
+;; We split the fp queue into standard fp, fp division, and
+;; fp square root to further optimize the automata, though.
+(define_automaton "r10k_int, r10k_fp, r10k_fpdivision,
+                   r10k_fpsqroot, r10k_addr")
+
+;; R10000 has 2 integer ALUs, fp-adder and fp-multiplier, load/store.
+(define_cpu_unit "r10k_alu1" "r10k_int")
+(define_cpu_unit "r10k_alu2" "r10k_int")
+(define_cpu_unit "r10k_fpadd" "r10k_fp")
+(define_cpu_unit "r10k_fpmpy" "r10k_fp")
+(define_cpu_unit "r10k_loadstore" "r10k_addr")
+
+;; R10000 has separate fp-div and fp-sqrt units as well and these can
+;; execute in parallel, however their issue & completion logic is shared
+;; by the fp-multiplier.
+(define_cpu_unit "r10k_fpdiv" "r10k_fpdivision")
+(define_cpu_unit "r10k_fpsqrt" "r10k_fpsqroot")
+
+
+;; R10k Loader.
+(define_insn_reservation "r10k_load" 2
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "load,prefetch,prefetchx"))
+  "r10k_loadstore")
+
+(define_insn_reservation "r10k_store" 0
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "store,fpstore,fpidxstore"))
+  "r10k_loadstore")
+
+(define_insn_reservation "r10k_fpload" 3
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "fpload,fpidxload"))
+  "r10k_loadstore")
+
+
+;; Integer add/sub + logic ops, and mf/mt hi/lo can be done by alu1 or alu2.
+;; Miscellaneous arith goes here too (this is a guess).
+(define_insn_reservation "r10k_arith" 1
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "arith,mfhilo,mthilo,slt,clz,const,nop,trap,logical"))
+  "r10k_alu1 | r10k_alu2")
+
+
+;; ALU1 handles shifts, branch eval, and condmove.
+;;
+;; Brancher is separate, but part of ALU1, but can only
+;; do one branch per cycle (needs implementing?).
+;;
+;; Unsure if the brancher handles jumps and calls as well, but since
+;; they're related, we'll add them here for now.
+(define_insn_reservation "r10k_shift" 1
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "shift,branch,jump,call,move"))
+  "r10k_alu1")
+
+(define_insn_reservation "r10k_int_cmove" 1
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "condmove")
+            (eq_attr "mode" "SI,DI")))
+  "r10k_alu1")
+
+
+;; Coprocessor Moves.
+;; mtc1/dmtc1 are handled by ALU1.
+;; mfc1/dmfc1 are handled by the fp-multiplier.
+(define_insn_reservation "r10k_mt_xfer" 3
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "mtc"))
+  "r10k_alu1")
+
+(define_insn_reservation "r10k_mf_xfer" 2
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "mfc"))
+  "r10k_fpmpy")
+
+
+;; Only ALU2 does int multiplications and divisions.
+;;
+;; According to the Vr10000 series user manual,
+;; integer mult and div insns can be issued one
+;; cycle earlier if using register Lo, but this is
+;; not modeled here.  We use the latency for the
+;; Lo register, however, as this is the common case.
+;;
+;; Divides keep ALU2 busy, but this isn't expressed here (I think?).
+(define_insn_reservation "r10k_imul_single" 5
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "imul,imul3,imadd")
+            (eq_attr "mode" "SI")))
+  "r10k_alu2 * 6")
+
+(define_insn_reservation "r10k_imul_double" 9
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "imul,imul3,imadd")
+            (eq_attr "mode" "DI")))
+  "r10k_alu2 * 10")
+
+(define_insn_reservation "r10k_idiv_single" 34
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "idiv")
+            (eq_attr "mode" "SI")))
+  "r10k_alu2 * 35")
+
+(define_insn_reservation "r10k_idiv_double" 66
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "idiv")
+            (eq_attr "mode" "DI")))
+  "r10k_alu2 * 67")
+
+
+;; Floating point add/sub, mul, abs value, neg, comp, & moves.
+(define_insn_reservation "r10k_fp_miscadd" 2
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "fadd,fabs,fneg,fcmp"))
+  "r10k_fpadd")
+
+(define_insn_reservation "r10k_fp_miscmul" 2
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "fmul,fmove"))
+  "r10k_fpmpy")
+
+(define_insn_reservation "r10k_fp_cmove" 2
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "condmove")
+            (eq_attr "mode" "SF,DF")))
+  "r10k_fpmpy")
+
+
+;; The fcvt.s.[wl] insn has latency 4, repeat 2.
+;; All other fcvt have latency 2, repeat 1.
+(define_insn_reservation "r10k_fcvt_single" 4
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "fcvt")
+            (eq_attr "cnv_mode" "I2S")))
+  "r10k_fpadd * 2")
+
+(define_insn_reservation "r10k_fcvt_other" 2
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "fcvt")
+            (eq_attr "cnv_mode" "!I2S")))
+  "r10k_fpadd")
+
+
+;; Run the fmadd insn through fp-adder first, then fp-multiplier.
+;;
+;; The latency for fmadd is 2 cycles if the result is used
+;; by another fmadd instruction.
+(define_insn_reservation "r10k_fmadd" 4
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "fmadd"))
+  "r10k_fpadd, r10k_fpmpy")
+
+(define_bypass 2 "r10k_fmadd" "r10k_fmadd")
+
+
+;; Floating point Divisions & square roots.
+(define_insn_reservation "r10k_fdiv_single" 12
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "fdiv,frdiv,frdiv1,frdiv2")
+            (eq_attr "mode" "SF")))
+  "r10k_fpdiv * 14")
+
+(define_insn_reservation "r10k_fdiv_double" 19
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "fdiv,frdiv,frdiv1,frdiv2")
+            (eq_attr "mode" "DF")))
+  "r10k_fpdiv * 21")
+
+(define_insn_reservation "r10k_fsqrt_single" 18
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "fsqrt")
+            (eq_attr "mode" "SF")))
+  "r10k_fpsqrt * 20")
+
+(define_insn_reservation "r10k_fsqrt_double" 33
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "fsqrt")
+            (eq_attr "mode" "DF")))
+  "r10k_fpsqrt * 35")
+
+(define_insn_reservation "r10k_frsqrt_single" 30
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "frsqrt,frsqrt1,frsqrt2")
+            (eq_attr "mode" "SF")))
+  "r10k_fpsqrt * 20")
+
+(define_insn_reservation "r10k_frsqrt_double" 52
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "frsqrt,frsqrt1,frsqrt2")
+            (eq_attr "mode" "DF")))
+  "r10k_fpsqrt * 35")
+
+
+;; Handle unknown/multi insns here (this is a guess).
+(define_insn_reservation "r10k_unknown" 1
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "unknown,multi"))
+  "r10k_alu1 + r10k_alu2")
diff -Naurp gcc.orig/gcc/config/mips/mips.c gcc/gcc/config/mips/mips.c
--- gcc.orig/gcc/config/mips/mips.c  2008-08-01 21:55:41.000000000 -0400
+++ gcc/gcc/config/mips/mips.c  2008-08-04 01:26:49.000000000 -0400
@@ -593,6 +593,10 @@ static const struct mips_cpu_info mips_c

   /* MIPS IV processors. */
   { "r8000", PROCESSOR_R8000, 4, 0 },
+  { "r10000", PROCESSOR_R10000, 4, 0 },
+  { "r12000", PROCESSOR_R10000, 4, 0 },
+  { "r14000", PROCESSOR_R10000, 4, 0 },
+  { "r16000", PROCESSOR_R10000, 4, 0 },
   { "vr5000", PROCESSOR_R5000, 4, 0 },
   { "vr5400", PROCESSOR_R5400, 4, 0 },
   { "vr5500", PROCESSOR_R5500, 4, PTF_AVOID_BRANCHLIKELY },
@@ -988,6 +992,19 @@ static const struct mips_rtx_cost_data m
     1, /* branch_cost */
     4 /* memory_latency */
   },
+  { /* R1x000 */
+    COSTS_N_INSNS (2), /* fp_add */
+    COSTS_N_INSNS (2), /* fp_mult_sf */
+    COSTS_N_INSNS (2), /* fp_mult_df */
+    COSTS_N_INSNS (12), /* fp_div_sf */
+    COSTS_N_INSNS (19), /* fp_div_df */
+    COSTS_N_INSNS (5), /* int_mult_si */
+    COSTS_N_INSNS (9), /* int_mult_di */
+    COSTS_N_INSNS (34), /* int_div_si */
+    COSTS_N_INSNS (66), /* int_div_di */
+    1, /* branch_cost */
+    4 /* memory_latency */
+  },
   { /* SB1 */
     /* These costs are the same as the SB-1A below. */
     COSTS_N_INSNS (4), /* fp_add */
@@ -9872,7 +9889,10 @@ mips_issue_rate (void)
          but in reality only a maximum of 3 insns can be issued as
          floating-point loads and stores also require a slot in the
          AGEN pipe. */
-      return 4;
+    case PROCESSOR_R10000:
+      /* All R10K Processors are quad-issue (being the first MIPS
+         processors to support this feature). */
+      return 4;

     case PROCESSOR_20KC:
     case PROCESSOR_R4130:
diff -Naurp gcc.orig/gcc/config/mips/mips.h gcc/gcc/config/mips/mips.h
--- gcc.orig/gcc/config/mips/mips.h  2008-08-01 21:55:41.000000000 -0400
+++ gcc/gcc/config/mips/mips.h  2008-08-04 00:05:27.000000000 -0400
@@ -66,6 +66,7 @@ enum processor_type {
   PROCESSOR_R7000,
   PROCESSOR_R8000,
   PROCESSOR_R9000,
+  PROCESSOR_R10000,
   PROCESSOR_SB1,
   PROCESSOR_SB1A,
   PROCESSOR_SR71000,
@@ -241,6 +242,7 @@ enum mips_code_readable_setting {
 #define TARGET_MIPS5500 (mips_arch == PROCESSOR_R5500)
 #define TARGET_MIPS7000 (mips_arch == PROCESSOR_R7000)
 #define TARGET_MIPS9000 (mips_arch == PROCESSOR_R9000)
+#define TARGET_MIPS10000 (mips_arch == PROCESSOR_R10000)
 #define TARGET_SB1 (mips_arch == PROCESSOR_SB1 \
                     || mips_arch == PROCESSOR_SB1A)
 #define TARGET_SR71K (mips_arch == PROCESSOR_SR71000)
@@ -267,6 +269,7 @@ enum mips_code_readable_setting {
 #define TUNE_MIPS6000 (mips_tune == PROCESSOR_R6000)
 #define TUNE_MIPS7000 (mips_tune == PROCESSOR_R7000)
 #define TUNE_MIPS9000 (mips_tune == PROCESSOR_R9000)
+#define TUNE_MIPS10000 (mips_tune == PROCESSOR_R10000)
 #define TUNE_SB1 (mips_tune == PROCESSOR_SB1 \
                   || mips_tune == PROCESSOR_SB1A)

diff -Naurp gcc.orig/gcc/config/mips/mips.md gcc/gcc/config/mips/mips.md
--- gcc.orig/gcc/config/mips/mips.md  2008-08-01 21:55:41.000000000 -0400
+++ gcc/gcc/config/mips/mips.md  2008-08-01 23:05:01.000000000 -0400
@@ -553,7 +553,7 @@
 ;; Attribute describing the processor.  This attribute must match exactly
 ;; with the processor_type enumeration in mips.h.
 (define_attr "cpu"
-  "r3000,4kc,4kp,5kc,5kf,20kc,24kc,24kf2_1,24kf1_1,74kc,74kf2_1,74kf1_1,74kf3_2,loongson_2e,loongson_2f,m4k,r3900,r6000,r4000,r4100,r4111,r4120,r4130,r4300,r4600,r4650,r5000,r5400,r5500,r7000,r8000,r9000,sb1,sb1a,sr71000,xlr"
+  "r3000,4kc,4kp,5kc,5kf,20kc,24kc,24kf2_1,24kf1_1,74kc,74kf2_1,74kf1_1,74kf3_2,loongson_2e,loongson_2f,m4k,r3900,r6000,r4000,r4100,r4111,r4120,r4130,r4300,r4600,r4650,r5000,r5400,r5500,r7000,r8000,r9000,r10000,r12000,r14000,r16000,sb1,sb1a,sr71000,xlr"
   (const (symbol_ref "mips_tune")))

 ;; The type of hardware hazard associated with this instruction.
@@ -903,6 +903,7 @@
 (include "6000.md")
 (include "7000.md")
 (include "9000.md")
+(include "10000.md")
 (include "sb1.md")
 (include "sr71k.md")
 (include "xlr.md")
diff -Naurp gcc.orig/gcc/doc/invoke.texi gcc/gcc/doc/invoke.texi
--- gcc.orig/gcc/doc/invoke.texi  2008-08-01 21:51:46.000000000 -0400
+++ gcc/gcc/doc/invoke.texi  2008-08-04 00:09:12.000000000 -0400
@@ -11980,6 +11980,7 @@ The processor names are:
 @samp{r2000}, @samp{r3000}, @samp{r3900}, @samp{r4000}, @samp{r4400},
 @samp{r4600}, @samp{r4650}, @samp{r6000}, @samp{r8000},
 @samp{rm7000}, @samp{rm9000},
+@samp{r10000}, @samp{r12000}, @samp{r14000}, @samp{r16000},
 @samp{sb1},
 @samp{sr71000},
 @samp{vr4100}, @samp{vr4111}, @samp{vr4120}, @samp{vr4130}, @samp{vr4300},
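
A side note on the cost discussion quoted earlier in this mail: since the target-independent code compares alternatives by simply summing per-insn costs, the COSTS_N_INSNS values in the R1x000 table decide when a multiplication by a constant loses to a shift/add sequence.  Below is a minimal standalone sketch of that comparison.  It is not GCC code: COSTS_N_INSNS is written the way I understand rtl.h to define it (N * 4), and sequence_cost, prefer_multiply and R10K_INT_MULT_SI are hypothetical names made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: mirrors the scaling in GCC's rtl.h, where one simple
   insn costs 4 units.  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* int_mult_si value taken from the R1x000 cost table in the patch.  */
#define R10K_INT_MULT_SI COSTS_N_INSNS (5)

/* Hypothetical helper: the cost of a sequence is the plain sum of its
   parts; issue width and repeat rates are deliberately ignored.  */
static int
sequence_cost (const int *costs, size_t n)
{
  int total = 0;
  for (size_t i = 0; i < n; i++)
    total += costs[i];
  return total;
}

/* Hypothetical helper: true if a single multiply is no more expensive
   than the replacement sequence, going by summed costs alone.  */
static bool
prefer_multiply (int mult_cost, const int *seq, size_t n)
{
  return mult_cost <= sequence_cost (seq, n);
}
```

With these numbers, x * 10 done as (x << 3) + (x << 1) is three single-cycle insns, so it sums to 12 against 20 for the multiply and the shifts win.  A six-insn sequence sums to 24 and the multiply wins.  Lowering int_mult_si below the raw latency would move that crossover, which is the knob the experimentation advice is about.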