Date: Tue, 07 May 2024 14:24:23 -0700 (PDT)
Subject: Re: [committed] [RISC-V] Allow uarchs to set TARGET_OVERLAP_OP_BY_PIECES_P
CC: gcc-patches@gcc.gnu.org
From: Palmer Dabbelt
To: Jeff Law

On Tue, 07 May 2024 14:18:36 PDT (-0700), Jeff Law wrote:
> This is almost exclusively work from the VRULL team.
>
> As we've discussed in the Tuesday meeting in the past, we'd like to have
> a knob in the tuning structure to indicate that overlapped stores during
> move_by_pieces expansion of memcpy & friends are acceptable.
>
> This patch adds the that capability in our tuning structure. It's off
> for all the uarchs upstream, but we have been using it inside Ventana
> for our uarch with success. So technically it's NFC upstream, but puts
> in the infrastructure multiple organizations likely need.
>
>
> Built and tested rv64gc. Pushing to the trunk shortly.
> jeff
> commit 300393484dbfa9fd3891174ea47aa3fb41915abc
> Author: Christoph Müllner
> Date:   Tue May 7 15:16:21 2024 -0600
>
>     [committed] [RISC-V] Allow uarchs to set TARGET_OVERLAP_OP_BY_PIECES_P
>
>     This is almost exclusively work from the VRULL team.
>
>     As we've discussed in the Tuesday meeting in the past, we'd like to have a knob
>     in the tuning structure to indicate that overlapped stores during
>     move_by_pieces expansion of memcpy & friends are acceptable.
>
>     This patch adds the that capability in our tuning structure. It's off for all
>     the uarchs upstream, but we have been using it inside Ventana for our uarch
>     with success. So technically it's NFC upstream, but puts in the infrastructure
>     multiple organizations likely need.
>
>     gcc/
>
>             * config/riscv/riscv.cc (struct riscv_tune_param): Add new
>             "overlap_op_by_pieces" field.
>             (rocket_tune_info, sifive_7_tune_info): Set it.
>             (sifive_p400_tune_info, sifive_p600_tune_info): Likewise.
>             (thead_c906_tune_info, xiangshan_nanhu_tune_info): Likewise.
>             (generic_ooo_tune_info, optimize_size_tune_info): Likewise.
>             (riscv_overlap_op_by_pieces): New function.
>             (TARGET_OVERLAP_OP_BY_PIECES_P): define.
>
>     gcc/testsuite/
>
>             * gcc.target/riscv/memcpy-nonoverlapping.c: New test.
>             * gcc.target/riscv/memset-nonoverlapping.c: New test.
>
> diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
> index 545e68566dc..a9b57d41184 100644
> --- a/gcc/config/riscv/riscv.cc
> +++ b/gcc/config/riscv/riscv.cc
> @@ -288,6 +288,7 @@ struct riscv_tune_param
>    unsigned short fmv_cost;
>    bool slow_unaligned_access;
>    bool use_divmod_expansion;
> +  bool overlap_op_by_pieces;
>    unsigned int fusible_ops;
>    const struct cpu_vector_cost *vec_costs;
>  };
> @@ -427,6 +428,7 @@ static const struct riscv_tune_param rocket_tune_info = {
>    8, /* fmv_cost */
>    true, /* slow_unaligned_access */
>    false, /* use_divmod_expansion */
> +  false, /* overlap_op_by_pieces */
>    RISCV_FUSE_NOTHING, /* fusible_ops */
>    NULL, /* vector cost */
>  };
> @@ -444,6 +446,7 @@ static const struct riscv_tune_param sifive_7_tune_info = {
>    8, /* fmv_cost */
>    true, /* slow_unaligned_access */
>    false, /* use_divmod_expansion */
> +  false, /* overlap_op_by_pieces */
>    RISCV_FUSE_NOTHING, /* fusible_ops */
>    NULL, /* vector cost */
>  };
> @@ -461,6 +464,7 @@ static const struct riscv_tune_param sifive_p400_tune_info = {
>    4, /* fmv_cost */
>    true, /* slow_unaligned_access */
>    false, /* use_divmod_expansion */
> +  false, /* overlap_op_by_pieces */
>    RISCV_FUSE_LUI_ADDI | RISCV_FUSE_AUIPC_ADDI, /* fusible_ops */
>    &generic_vector_cost, /* vector cost */
>  };
> @@ -478,6 +482,7 @@ static const struct riscv_tune_param sifive_p600_tune_info = {
>    4, /* fmv_cost */
>    true, /* slow_unaligned_access */
>    false, /* use_divmod_expansion */
> +  false, /* overlap_op_by_pieces */
>    RISCV_FUSE_LUI_ADDI | RISCV_FUSE_AUIPC_ADDI, /* fusible_ops */
>    &generic_vector_cost, /* vector cost */
>  };
> @@ -495,6 +500,7 @@ static const struct riscv_tune_param thead_c906_tune_info = {
>    8, /* fmv_cost */
>    false, /* slow_unaligned_access */
>    false, /* use_divmod_expansion */
> +  false, /* overlap_op_by_pieces */
>    RISCV_FUSE_NOTHING, /* fusible_ops */
>    NULL, /* vector cost */
>  };
> @@ -512,6 +518,7 @@ static const struct riscv_tune_param xiangshan_nanhu_tune_info = {
>    3, /* fmv_cost */
>    true, /* slow_unaligned_access */
>    false, /* use_divmod_expansion */
> +  false, /* overlap_op_by_pieces */
>    RISCV_FUSE_ZEXTW | RISCV_FUSE_ZEXTH, /* fusible_ops */
>    NULL, /* vector cost */
>  };
> @@ -529,6 +536,7 @@ static const struct riscv_tune_param generic_ooo_tune_info = {
>    4, /* fmv_cost */
>    false, /* slow_unaligned_access */
>    false, /* use_divmod_expansion */
> +  false, /* overlap_op_by_pieces */

IMO we should turn this on for the generic OOO tuning -- the benchmarks
say it's not faster for the T-Head OOO cores, but we were all so
surprised to find that I don't think we even fully trust the benchmarks.
I'd assume OOO cores are faster with the overlapping stores, so we
should just lean into it and let vendors say something if that's the
wrong assumption.

>    RISCV_FUSE_NOTHING, /* fusible_ops */
>    &generic_vector_cost, /* vector cost */
>  };
> @@ -546,6 +554,7 @@ static const struct riscv_tune_param optimize_size_tune_info = {
>    8, /* fmv_cost */
>    false, /* slow_unaligned_access */
>    false, /* use_divmod_expansion */
> +  false, /* overlap_op_by_pieces */
>    RISCV_FUSE_NOTHING, /* fusible_ops */
>    NULL, /* vector cost */
>  };
> @@ -9979,6 +9988,12 @@ riscv_slow_unaligned_access (machine_mode, unsigned int)
>    return riscv_slow_unaligned_access_p;
>  }
>
> +static bool
> +riscv_overlap_op_by_pieces (void)
> +{
> +  return tune_param->overlap_op_by_pieces;
> +}
> +
>  /* Implement TARGET_CAN_CHANGE_MODE_CLASS. */
>
>  static bool
> @@ -11420,6 +11435,9 @@ riscv_get_raw_result_mode (int regno)
>  #undef TARGET_SLOW_UNALIGNED_ACCESS
>  #define TARGET_SLOW_UNALIGNED_ACCESS riscv_slow_unaligned_access
>
> +#undef TARGET_OVERLAP_OP_BY_PIECES_P
> +#define TARGET_OVERLAP_OP_BY_PIECES_P riscv_overlap_op_by_pieces
> +
>  #undef TARGET_SECONDARY_MEMORY_NEEDED
>  #define TARGET_SECONDARY_MEMORY_NEEDED riscv_secondary_memory_needed
>
> diff --git a/gcc/testsuite/gcc.target/riscv/memcpy-nonoverlapping.c b/gcc/testsuite/gcc.target/riscv/memcpy-nonoverlapping.c
> new file mode 100644
> index 00000000000..1c99e13fc26
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/memcpy-nonoverlapping.c
> @@ -0,0 +1,54 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mcpu=sifive-u74 -march=rv64gc -mabi=lp64" } */
> +/* { dg-skip-if "" { *-*-* } { "-O0" "-Os" "-Oz" "-Og" } } */
> +
> +
> +#define COPY_N(N) \
> +void copy##N (char *src, char *dst) \
> +{ \
> +  dst = __builtin_assume_aligned (dst, 4096); \
> +  src = __builtin_assume_aligned (src, 4096); \
> +  __builtin_memcpy (dst, src, N); \
> +}
> +
> +/* Emits 1x {ld,sd} and 1x {lhu,lbu,sh,sb}. */
> +COPY_N(11)
> +
> +/* Emits 1x {ld,sd} and 1x {lw,lbu,sw,sb}. */
> +COPY_N(13)
> +
> +/* Emits 1x {ld,sd} and 1x {lw,lhu,sw,sh}. */
> +COPY_N(14)
> +
> +/* Emits 1x {ld,sd} and 1x {lw,lhu,lbu,sw,sh,sb}. */
> +COPY_N(15)
> +
> +/* Emits 2x {ld,sd} and 1x {lhu,lbu,sh,sb}. */
> +COPY_N(19)
> +
> +/* Emits 2x {ld,sd} and 1x {lw,lhu,lbu,sw,sh,sb}. */
> +COPY_N(23)
> +
> +/* The by-pieces infrastructure handles up to 24 bytes.
> +   So the code below is emitted via cpymemsi/block_move_straight. */
> +
> +/* Emits 3x {ld,sd} and 1x {lhu,lbu,sh,sb}. */
> +COPY_N(27)
> +
> +/* Emits 3x {ld,sd} and 1x {lw,lbu,sw,sb}. */
> +COPY_N(29)
> +
> +/* Emits 3x {ld,sd} and 1x {lw,lhu,lbu,sw,sh,sb}. */
> +COPY_N(31)
> +
> +/* { dg-final { scan-assembler-times "ld\t" 17 } } */
> +/* { dg-final { scan-assembler-times "sd\t" 17 } } */
> +
> +/* { dg-final { scan-assembler-times "lw\t" 6 } } */
> +/* { dg-final { scan-assembler-times "sw\t" 6 } } */
> +
> +/* { dg-final { scan-assembler-times "lhu\t" 7 } } */
> +/* { dg-final { scan-assembler-times "sh\t" 7 } } */
> +
> +/* { dg-final { scan-assembler-times "lbu\t" 8 } } */
> +/* { dg-final { scan-assembler-times "sb\t" 8 } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/memset-nonoverlapping.c b/gcc/testsuite/gcc.target/riscv/memset-nonoverlapping.c
> new file mode 100644
> index 00000000000..c4311c7a8d0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/memset-nonoverlapping.c
> @@ -0,0 +1,45 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mcpu=sifive-u74 -march=rv64gc -mabi=lp64" } */
> +/* { dg-skip-if "" { *-*-* } { "-O0" "-Os" "-Oz" "-Og" } } */
> +
> +#define ZERO_N(N) \
> +void zero##N (char *dst) \
> +{ \
> +  dst = __builtin_assume_aligned (dst, 4096); \
> +  __builtin_memset (dst, 0, N); \
> +}
> +
> +/* Emits 1x sd and 1x {sh,sb}. */
> +ZERO_N(11)
> +
> +/* Emits 1x sd and 1x {sw,sb}. */
> +ZERO_N(13)
> +
> +/* Emits 1x sd and 1x {sw,sh}. */
> +ZERO_N(14)
> +
> +/* Emits 1x sd and 1x {sw,sh,sb}. */
> +ZERO_N(15)
> +
> +/* Emits 2x sd and 1x {sh,sb}. */
> +ZERO_N(19)
> +
> +/* Emits 2x sd and 1x {sw,sh,sb}. */
> +ZERO_N(23)
> +
> +/* The by-pieces infrastructure handles up to 24 bytes.
> +   So the code below is emitted via cpymemsi/block_move_straight. */
> +
> +/* Emits 3x sd and 1x {sh,sb}. */
> +ZERO_N(27)
> +
> +/* Emits 3x sd and 1x {sw,sb}. */
> +ZERO_N(29)
> +
> +/* Emits 3x sd and 1x {sw,sh,sb}. */
> +ZERO_N(31)
> +
> +/* { dg-final { scan-assembler-times "sd\t" 17 } } */
> +/* { dg-final { scan-assembler-times "sw\t" 6 } } */
> +/* { dg-final { scan-assembler-times "sh\t" 7 } } */
> +/* { dg-final { scan-assembler-times "sb\t" 8 } } */
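
For anyone skimming the thread, here is a rough C sketch of what "overlapping" means for a 15-byte copy. It is only an illustration under stated assumptions: copy15_nonoverlapping and copy15_overlapping are hypothetical helper names, not anything in GCC, and the real by-pieces expansion works on RTL rather than on C source. The first helper mirrors the {ld,sd} plus {lw,lhu,lbu,sw,sh,sb} shape that the memcpy test above expects for COPY_N(15). The second shows roughly what an overlapping expansion could look like: two 8-byte accesses at offsets 0 and 7, so byte 7 is copied twice.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: non-overlapping copy of 15 bytes as 8 + 4 + 2 + 1,
   i.e. four load/store pairs of decreasing width.  The fixed-size memcpy
   calls stand in for single loads and stores of that width.  */
static void
copy15_nonoverlapping (unsigned char *dst, const unsigned char *src)
{
  uint64_t a;
  uint32_t b;
  uint16_t c;
  uint8_t d;

  memcpy (&a, src, 8);
  memcpy (dst, &a, 8);
  memcpy (&b, src + 8, 4);
  memcpy (dst + 8, &b, 4);
  memcpy (&c, src + 12, 2);
  memcpy (dst + 12, &c, 2);
  memcpy (&d, src + 14, 1);
  memcpy (dst + 14, &d, 1);
}

/* Hypothetical helper: overlapping copy of 15 bytes as two 8-byte
   accesses covering bytes 0..7 and 7..14.  Byte 7 is written twice
   with the same value, so the result is identical.  Both loads happen
   before either store, as a register-based expansion would do.  */
static void
copy15_overlapping (unsigned char *dst, const unsigned char *src)
{
  uint64_t lo;
  uint64_t hi;

  memcpy (&lo, src, 8);
  memcpy (&hi, src + 7, 8);
  memcpy (dst, &lo, 8);
  memcpy (dst + 7, &hi, 8);
}
```

The overlapping form needs two load/store pairs instead of four, but the second pair sits at offset 7, so it is misaligned even when the buffers themselves are aligned. That is why it only makes sense as a per-uarch tuning choice.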