From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ed1-x532.google.com (mail-ed1-x532.google.com [IPv6:2a00:1450:4864:20::532]) by sourceware.org (Postfix) with ESMTPS id E9A263848024 for ; Thu, 20 May 2021 07:51:31 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org E9A263848024 Received: by mail-ed1-x532.google.com with SMTP id s6so18219424edu.10 for ; Thu, 20 May 2021 00:51:31 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=cM1O1yDGxMxhHLC+bTafjPwZrcTspoGgSYRuy2I7W1g=; b=GNGO6cEO6BPvGxoGHcRgjaHGE+kJIr8d69C3fSFXLpit5Aa2kHrD+0r7zyYKQlFgEQ wpajdQYjKxcLj/x+slgZg2OTAdQvybNUr1/FqP5UP07Rt6zjEipi/lcWf8NZxbB1sI9P tOnlYh7pt3oIKpMEiXFK34NBmful+yPFmCrtJWWm5hGpnFDz1N7LfBJLq+5ZWNTH57tK Okljrhq0UtR8rYIXzNyLY/jv+qH4I8zQC1wXuZMl2mnK8AnjUCd2mfkmE7VvhAZTpyyJ ckFdYEIXtMXOecsAupWTWZXFzmQZ8g145uUOtEjZ2bXzr28m9h1GCiH/dlIsW2jJQ51V ImEw== X-Gm-Message-State: AOAM530BhSOtO7uXbpG8UcUKL2mrbMQuQFDnBLpUw2Ht8Z3kfYU/NHWt jXoATqD1mvibuGy3J74NuajVgfcI9DoNmiISbSs= X-Google-Smtp-Source: ABdhPJxRTsVzeJ2ZZ3UEw5C9r3ry1rejWF1DfdQI7Yqff4Pppdb+8vlyLirTcVwxqexyAuFX6QDRzhyzykETutpGn/4= X-Received: by 2002:a05:6402:4256:: with SMTP id g22mr3486442edb.214.1621497090880; Thu, 20 May 2021 00:51:30 -0700 (PDT) MIME-Version: 1.0 References: <20210518191646.3071005-1-hjl.tools@gmail.com> <20210518191646.3071005-13-hjl.tools@gmail.com> In-Reply-To: From: Richard Biener Date: Thu, 20 May 2021 09:51:20 +0200 Message-ID: Subject: Re: [PATCH v4 12/12] constructor: Check if it is faster to load constant from memory To: "H.J. Lu" Cc: GCC Patches , Richard Sandiford , Uros Bizjak , Bernd Edlinger Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-7.1 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, LOTS_OF_MONEY, MONEY_NOHTML, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 20 May 2021 07:51:34 -0000 On Wed, May 19, 2021 at 3:22 PM H.J. Lu wrote: > > On Wed, May 19, 2021 at 2:33 AM Richard Biener > wrote: > > > > On Tue, May 18, 2021 at 9:16 PM H.J. Lu wrote: > > > > > > When expanding a constant constructor, don't call expand_constructor if > > > it is more efficient to load the data from the memory via move by pieces. > > > > > > gcc/ > > > > > > PR middle-end/90773 > > > * expr.c (expand_expr_real_1): Don't call expand_constructor if > > > it is more efficient to load the data from the memory. > > > > > > gcc/testsuite/ > > > > > > PR middle-end/90773 > > > * gcc.target/i386/pr90773-24.c: New test. > > > * gcc.target/i386/pr90773-25.c: Likewise. > > > --- > > > gcc/expr.c | 10 ++++++++++ > > > gcc/testsuite/gcc.target/i386/pr90773-24.c | 22 ++++++++++++++++++++++ > > > gcc/testsuite/gcc.target/i386/pr90773-25.c | 20 ++++++++++++++++++++ > > > 3 files changed, 52 insertions(+) > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c > > > > > > diff --git a/gcc/expr.c b/gcc/expr.c > > > index d09ee42e262..80e01ea1cbe 100644 > > > --- a/gcc/expr.c > > > +++ b/gcc/expr.c > > > @@ -10886,6 +10886,16 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode, > > > unsigned HOST_WIDE_INT ix; > > > tree field, value; > > > > > > + /* Check if it is more efficient to load the data from > > > + the memory directly. FIXME: How many stores do we > > > + need here if not moved by pieces? */ > > > + unsigned HOST_WIDE_INT bytes > > > + = tree_to_uhwi (TYPE_SIZE_UNIT (type)); > > > > that's prone to fail - it could be a VLA. > > What do you mean by fail? Is it ICE or missed optimization? > Do you have a testcase? > > > > > > + if ((bytes / UNITS_PER_WORD) > 2 > > > + && MOVE_MAX_PIECES > UNITS_PER_WORD > > > + && can_move_by_pieces (bytes, TYPE_ALIGN (type))) > > > + goto normal_inner_ref; > > > + > > > > It looks like you're concerned about aggregate copies but this also handles > > non-aggregates (which on GIMPLE might already be optimized of course). > > Here I check if we copy more than 2 words and we can move more than > a word in a single instruction. > > > Also you say "if it's cheaper" but I see no cost considerations. How do > > we generally handle immed const vs. load from constant pool costs? > > This trades 2 (update to 8) stores with one load plus one store. Is there > a way to check which one is faster? I'm not sure - it depends on whether the target can do stores from immediates at all or what restrictions apply, what the immediate value actually is (zero or all-ones should be way cheaper than sth arbitrary) and how the pressure on the load unit is. can_move_by_pieces (bytes, TYPE_ALIGN (type)) also does not guarantee it will actually move pieces larger than UNITS_PER_WORD, that might depend on alignment. There's by_pieces_ninsns that might provide some hint here. I'm sure it works well for x86. I wonder if the existing code is in the appropriate place and we shouldn't instead handle this somewhere upthread where we ask to copy 'exp' into some other memory location. For your testcase that's expand_assignment but I can imagine passing array[0] by value to a function resulting in similar copying. Testing that shows we get pushq array+56(%rip) .cfi_def_cfa_offset 24 pushq array+48(%rip) .cfi_def_cfa_offset 32 pushq array+40(%rip) .cfi_def_cfa_offset 40 pushq array+32(%rip) .cfi_def_cfa_offset 48 pushq array+24(%rip) .cfi_def_cfa_offset 56 pushq array+16(%rip) .cfi_def_cfa_offset 64 pushq array+8(%rip) .cfi_def_cfa_offset 72 pushq array(%rip) .cfi_def_cfa_offset 80 call bar for that. We do have the by-pieces infrastructure to generally do this kind of copying but in both of these cases we do not seem to use it. I also wonder if the by-pieces infrastructure can pick up constant initializers automagically (we could native_encode the initializer part and feed the by-pieces infrastructure with an array of bytes). There for example might be easy to immediate-store byte parts and difficult ones where we could decide on a case-by-case basis whether to load+store or immediate-store them. For example if I change your testcase to have the array[] initializer all-zero we currently emit pxor %xmm0, %xmm0 movups %xmm0, (%rdi) movups %xmm0, 16(%rdi) movups %xmm0, 32(%rdi) movups %xmm0, 48(%rdi) ret will your patch cause us to emit 4 loads? OTHO if I do const struct S array[] = { { 0, 0, 0, 7241, 124764, 48, 16, 33, 10, 96, 2, 0, 0, 4 } }; we get movq $0, (%rdi) movl $0, 8(%rdi) movl $0, 12(%rdi) movl $7241, 16(%rdi) ... ideally we'd have sth like pxor %xmm0, %xmm0 movups %xmm0, (%rdi) movaps array+16(%rip), %xmm0 movups %xmm0, 16(%rdi) ... thus have the zeros written as immediates and the remaining pieces with load+stores. The by-pieces infrastructure eventually get's to see (mem/u/c:BLK (symbol_ref:DI ("array") [flags 0x2] ) [1 array+0 S64 A256]) where the MEM_EXPR should provide a way to access the constant initializer. That said I do agree the current code is a bit premature optimization - but maybe it should be fend off in expand_constructor which has the cheap clear_storage first and which already does check can_move_by_pieces with some heuristics, but that seems to be guarded by || (tree_fits_uhwi_p (TYPE_SIZE_UNIT (type)) && (! can_move_by_pieces (tree_to_uhwi (TYPE_SIZE_UNIT (type)), TYPE_ALIGN (type))) && ! mostly_zeros_p (exp)))) which is odd (we _can_ move by pieces, but how does this apply to TREE_CONSTANT CTORs and avoid_temp_mem?). That said, I wonder if we want to elide expand_constructor when the CTOR is TREE_STATIC && TREE_CONSTANT and !mostly_zeros_p and we can_move_by_pieces. So sth like diff --git a/gcc/expr.c b/gcc/expr.c index 7139545d543..76b3bdf0c01 100644 --- a/gcc/expr.c +++ b/gcc/expr.c @@ -8504,6 +8504,12 @@ expand_constructor (tree exp, rtx target, enum expand_modifier modifier, && (! can_move_by_pieces (tree_to_uhwi (TYPE_SIZE_UNIT (type)), TYPE_ALIGN (type))) + && ! mostly_zeros_p (exp)) + || (TREE_CONSTANT (exp) + && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type)) + && (can_move_by_pieces + (tree_to_uhwi (TYPE_SIZE_UNIT (type)), + TYPE_ALIGN (type))) && ! mostly_zeros_p (exp)))) || ((modifier == EXPAND_INITIALIZER || modifier == EXPAND_CONST_ADDRESS) && TREE_CONSTANT (exp))) which handles your initializer and the all-zero one optimal? Richard. > > > FOR_EACH_CONSTRUCTOR_ELT (CONSTRUCTOR_ELTS (init), ix, > > > field, value) > > > if (tree_int_cst_equal (field, index)) > > > diff --git a/gcc/testsuite/gcc.target/i386/pr90773-24.c b/gcc/testsuite/gcc.target/i386/pr90773-24.c > > > new file mode 100644 > > > index 00000000000..4a4b62533dc > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.target/i386/pr90773-24.c > > > @@ -0,0 +1,22 @@ > > > +/* { dg-do compile } */ > > > +/* { dg-options "-O2 -march=x86-64" } */ > > > + > > > +struct S > > > +{ > > > + long long s1 __attribute__ ((aligned (8))); > > > + unsigned s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14; > > > +}; > > > + > > > +const struct S array[] = { > > > + { 0, 60, 640, 2112543726, 39682, 48, 16, 33, 10, 96, 2, 0, 0, 4 } > > > +}; > > > + > > > +void > > > +foo (struct S *x) > > > +{ > > > + x[0] = array[0]; > > > +} > > > +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */ > > > +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 16\\(%\[\^,\]+\\)" 1 } } */ > > > +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 32\\(%\[\^,\]+\\)" 1 } } */ > > > +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 48\\(%\[\^,\]+\\)" 1 } } */ > > > diff --git a/gcc/testsuite/gcc.target/i386/pr90773-25.c b/gcc/testsuite/gcc.target/i386/pr90773-25.c > > > new file mode 100644 > > > index 00000000000..2520b670989 > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.target/i386/pr90773-25.c > > > @@ -0,0 +1,20 @@ > > > +/* { dg-do compile } */ > > > +/* { dg-options "-O2 -march=skylake" } */ > > > + > > > +struct S > > > +{ > > > + long long s1 __attribute__ ((aligned (8))); > > > + unsigned s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14; > > > +}; > > > + > > > +const struct S array[] = { > > > + { 0, 60, 640, 2112543726, 39682, 48, 16, 33, 10, 96, 2, 0, 0, 4 } > > > +}; > > > + > > > +void > > > +foo (struct S *x) > > > +{ > > > + x[0] = array[0]; > > > +} > > > +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */ > > > +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, 32\\(%\[\^,\]+\\)" 1 } } */ > > > -- > > > 2.31.1 > > > > > > > -- > H.J.