From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 67154 invoked by alias); 23 Jan 2018 09:32:48 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 67074 invoked by uid 89); 23 Jan 2018 09:32:42 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-25.9 required=5.0 tests=BAYES_00,GIT_PATCH_0,GIT_PATCH_1,GIT_PATCH_2,GIT_PATCH_3,KAM_LAZY_DOMAIN_SECURITY,T_RP_MATCHES_RCVD autolearn=ham version=3.3.2 spammy= X-HELO: foss.arm.com Received: from foss.arm.com (HELO foss.arm.com) (217.140.101.70) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 23 Jan 2018 09:32:40 +0000 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 26A5180D; Tue, 23 Jan 2018 01:32:38 -0800 (PST) Received: from [10.2.207.77] (e100706-lin.cambridge.arm.com [10.2.207.77]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 476373F41F; Tue, 23 Jan 2018 01:32:37 -0800 (PST) Message-ID: <5A670133.4010303@foss.arm.com> Date: Tue, 23 Jan 2018 09:46:00 -0000 From: Kyrill Tkachov User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0 MIME-Version: 1.0 To: Luis Machado , "gcc-patches@gcc.gnu.org" CC: James Greenhalgh , Richard Earnshaw Subject: Re: [PATCH 1/2] Introduce prefetch-minimum stride option References: <1516628770-25036-1-git-send-email-luis.machado@linaro.org> <1516628770-25036-2-git-send-email-luis.machado@linaro.org> In-Reply-To: <1516628770-25036-2-git-send-email-luis.machado@linaro.org> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-SW-Source: 2018-01/txt/msg01920.txt.bz2 Hi Luis, On 22/01/18 13:46, Luis Machado wrote: > This patch adds a new option to control the minimum stride, for a memory > reference, after which the loop prefetch pass may issue software prefetch > hints for. There are two motivations: > > * Make the pass less aggressive, only issuing prefetch hints for bigger strides > that are more likely to benefit from prefetching. I've noticed a case in cpu2017 > where we were issuing thousands of hints, for example. > I've noticed a large amount of prefetch hints being issued as well, but had not analysed it further. > * For processors that have a hardware prefetcher, like Falkor, it allows the > loop prefetch pass to defer prefetching of smaller (less than the threshold) > strides to the hardware prefetcher instead. This prevents conflicts between > the software prefetcher and the hardware prefetcher. > > I've noticed considerable reduction in the number of prefetch hints and > slightly positive performance numbers. This aligns GCC and LLVM in terms of > prefetch behavior for Falkor. Do you, by any chance, have a link to the LLVM review that implemented that behavior? It's okay if you don't, but I think it would be useful context. > > The default settings should guarantee no changes for existing targets. Those > are free to tweak the settings as necessary. > > No regressions in the testsuite and bootstrapped ok on aarch64-linux. > > Ok? > Are there any benchmark numbers you can share? I think this approach is sensible. Since your patch touches generic code as well as AArch64 code you'll need an approval from a midend maintainer as well as an AArch64 maintainer. Also, GCC development is now in the regression fixing stage, so unless this fixes a regression it may have to wait until GCC 9 development is opened. Thanks, Kyrill > 2018-01-22 Luis Machado > > Introduce option to limit software prefetching to known constant > strides above a specific threshold with the goal of preventing > conflicts with a hardware prefetcher. > > gcc/ > * config/aarch64/aarch64-protos.h (cpu_prefetch_tune) > : New const int field. > * config/aarch64/aarch64.c (generic_prefetch_tune): Update to include > minimum_stride field. > (exynosm1_prefetch_tune): Likewise. > (thunderxt88_prefetch_tune): Likewise. > (thunderx_prefetch_tune): Likewise. > (thunderx2t99_prefetch_tune): Likewise. > (qdf24xx_prefetch_tune): Likewise. Set minimum_stride to 2048. > (aarch64_override_options_internal): Update to set > PARAM_PREFETCH_MINIMUM_STRIDE. > * doc/invoke.texi (prefetch-minimum-stride): Document new option. > * params.def (PARAM_PREFETCH_MINIMUM_STRIDE): New. > * params.h (PARAM_PREFETCH_MINIMUM_STRIDE): Define. > * tree-ssa-loop-prefetch.c (should_issue_prefetch_p): Return false if > stride is constant and is below the minimum stride threshold. > --- > gcc/config/aarch64/aarch64-protos.h | 3 +++ > gcc/config/aarch64/aarch64.c | 13 ++++++++++++- > gcc/doc/invoke.texi | 15 +++++++++++++++ > gcc/params.def | 9 +++++++++ > gcc/params.h | 2 ++ > gcc/tree-ssa-loop-prefetch.c | 16 ++++++++++++++++ > 6 files changed, 57 insertions(+), 1 deletion(-) > > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h > index ef1b0bc..8736bd9 100644 > --- a/gcc/config/aarch64/aarch64-protos.h > +++ b/gcc/config/aarch64/aarch64-protos.h > @@ -230,6 +230,9 @@ struct cpu_prefetch_tune > const int l1_cache_size; > const int l1_cache_line_size; > const int l2_cache_size; > + /* The minimum constant stride beyond which we should use prefetch > + hints for. */ > + const int minimum_stride; > const int default_opt_level; > }; > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > index 174310c..0ed9f14 100644 > --- a/gcc/config/aarch64/aarch64.c > +++ b/gcc/config/aarch64/aarch64.c > @@ -547,6 +547,7 @@ static const cpu_prefetch_tune generic_prefetch_tune = > -1, /* l1_cache_size */ > -1, /* l1_cache_line_size */ > -1, /* l2_cache_size */ > + -1, /* minimum_stride */ > -1 /* default_opt_level */ > }; > > @@ -556,6 +557,7 @@ static const cpu_prefetch_tune exynosm1_prefetch_tune = > -1, /* l1_cache_size */ > 64, /* l1_cache_line_size */ > -1, /* l2_cache_size */ > + -1, /* minimum_stride */ > -1 /* default_opt_level */ > }; > > @@ -565,7 +567,8 @@ static const cpu_prefetch_tune qdf24xx_prefetch_tune = > 32, /* l1_cache_size */ > 64, /* l1_cache_line_size */ > 1024, /* l2_cache_size */ > - -1 /* default_opt_level */ > + 2048, /* minimum_stride */ > + 3 /* default_opt_level */ > }; > > static const cpu_prefetch_tune thunderxt88_prefetch_tune = > @@ -574,6 +577,7 @@ static const cpu_prefetch_tune thunderxt88_prefetch_tune = > 32, /* l1_cache_size */ > 128, /* l1_cache_line_size */ > 16*1024, /* l2_cache_size */ > + -1, /* minimum_stride */ > 3 /* default_opt_level */ > }; > > @@ -583,6 +587,7 @@ static const cpu_prefetch_tune thunderx_prefetch_tune = > 32, /* l1_cache_size */ > 128, /* l1_cache_line_size */ > -1, /* l2_cache_size */ > + -1, /* minimum_stride */ > -1 /* default_opt_level */ > }; > > @@ -592,6 +597,7 @@ static const cpu_prefetch_tune thunderx2t99_prefetch_tune = > 32, /* l1_cache_size */ > 64, /* l1_cache_line_size */ > 256, /* l2_cache_size */ > + -1, /* minimum_stride */ > -1 /* default_opt_level */ > }; > > @@ -10461,6 +10467,11 @@ aarch64_override_options_internal (struct gcc_options *opts) > aarch64_tune_params.prefetch->l2_cache_size, > opts->x_param_values, > global_options_set.x_param_values); > + if (aarch64_tune_params.prefetch->minimum_stride >= 0) > + maybe_set_param_value (PARAM_PREFETCH_MINIMUM_STRIDE, > + aarch64_tune_params.prefetch->minimum_stride, > + opts->x_param_values, > + global_options_set.x_param_values); > > /* Use the alternative scheduling-pressure algorithm by default. */ > maybe_set_param_value (PARAM_SCHED_PRESSURE_ALGORITHM, SCHED_PRESSURE_MODEL, > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > index 27c5974..1cb1ef5 100644 > --- a/gcc/doc/invoke.texi > +++ b/gcc/doc/invoke.texi > @@ -10567,6 +10567,21 @@ The size of L1 cache, in kilobytes. > @item l2-cache-size > The size of L2 cache, in kilobytes. > > +@item prefetch-minimum-stride > +Minimum constant stride, in bytes, to start using prefetch hints for. If > +the stride is less than this threshold, prefetch hints will not be issued. > + > +This setting is useful for processors that have hardware prefetchers, in > +which case there may be conflicts between the hardware prefetchers and > +the software prefetchers. If the hardware prefetchers have a maximum > +stride they can handle, it should be used here to improve the use of > +software prefetchers. > + > +A value of -1, the default, means we don't have a threshold and therefore > +prefetch hints can be issued for any constant stride. > + > +This setting is only useful for strides that are known and constant. > + > @item loop-interchange-max-num-stmts > The maximum number of stmts in a loop to be interchanged. > > diff --git a/gcc/params.def b/gcc/params.def > index 930b318..bf2d12c 100644 > --- a/gcc/params.def > +++ b/gcc/params.def > @@ -790,6 +790,15 @@ DEFPARAM (PARAM_L2_CACHE_SIZE, > "The size of L2 cache.", > 512, 0, 0) > > +/* The minimum constant stride beyond which we should use prefetch hints > + for. */ > + > +DEFPARAM (PARAM_PREFETCH_MINIMUM_STRIDE, > + "prefetch-minimum-stride", > + "The minimum constant stride beyond which we should use prefetch " > + "hints for.", > + -1, 0, 0) > + > /* Maximum number of statements in loop nest for loop interchange. */ > > DEFPARAM (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS, > diff --git a/gcc/params.h b/gcc/params.h > index 98249d2..96012db 100644 > --- a/gcc/params.h > +++ b/gcc/params.h > @@ -196,6 +196,8 @@ extern void init_param_values (int *params); > PARAM_VALUE (PARAM_L1_CACHE_LINE_SIZE) > #define L2_CACHE_SIZE \ > PARAM_VALUE (PARAM_L2_CACHE_SIZE) > +#define PREFETCH_MINIMUM_STRIDE \ > + PARAM_VALUE (PARAM_PREFETCH_MINIMUM_STRIDE) > #define USE_CANONICAL_TYPES \ > PARAM_VALUE (PARAM_USE_CANONICAL_TYPES) > #define IRA_MAX_LOOPS_NUM \ > diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c > index 2f10db1..112ccac 100644 > --- a/gcc/tree-ssa-loop-prefetch.c > +++ b/gcc/tree-ssa-loop-prefetch.c > @@ -992,6 +992,22 @@ prune_by_reuse (struct mem_ref_group *groups) > static bool > should_issue_prefetch_p (struct mem_ref *ref) > { > + /* Some processors may have a hardware prefetcher that may conflict with > + prefetch hints for a range of strides. Make sure we don't issue > + prefetches for such cases if the stride is within this particular > + range. */ > + if (cst_and_fits_in_hwi (ref->group->step) > + && absu_hwi (int_cst_value (ref->group->step)) < PREFETCH_MINIMUM_STRIDE) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, > + "Step for reference %u:%u (%d) is less than the mininum " > + " required stride of %d\n", > + ref->group->uid, ref->uid, int_cst_value (ref->group->step), > + PREFETCH_MINIMUM_STRIDE); > + return false; > + } > + > /* For now do not issue prefetches for only first few of the > iterations. */ > if (ref->prefetch_before != PREFETCH_ALL) > -- > 2.7.4 >