Subject: Re: [PATCH] ifcvt/vect: Emit COND_ADD for conditional scalar reduction.
From: Robin Dapp
To: Richard Biener
Cc: rdapp.gcc@gmail.com, Tamar Christina, gcc-patches, richard.sandiford@arm.com
Date: Thu, 19 Oct 2023 22:07:16 +0200

Ugh, I didn't push yet because with a rebased trunk I am seeing different
behavior for some riscv testcases.  A reduction is not recognized because
there is yet another "double use" occurrence in check_reduction_path.  I
guess it's reasonable to loosen the restriction for conditional operations
here as well.

The only change to v4 therefore is:

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index ebab1953b9c..64654a55e4c 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -4085,7 +4094,15 @@ pop:
 	      || flow_bb_inside_loop_p (loop, gimple_bb (op_use_stmt))))
 	    FOR_EACH_IMM_USE_ON_STMT (use_p, imm_iter)
 	      cnt++;
-	  if (cnt != 1)
+
+	  bool cond_fn_p = op.code.is_internal_fn ()
+	    && (conditional_internal_fn_code (internal_fn (*code))
+		!= ERROR_MARK);
+
+	  /* In case of a COND_OP (mask, op1, op2, op1) reduction we might have
+	     op1 twice (once as definition, once as else) in the same operation.
+	     Allow this.  */
+	  if ((!cond_fn_p && cnt != 1) || (opi == 1 && cond_fn_p && cnt != 2))

Bootstrapped and regtested again on x86, aarch64 and power10.
Testsuite on riscv unchanged.

Regards
 Robin

Subject: [PATCH v5] ifcvt/vect: Emit COND_OP for conditional scalar reduction.

As described in PR111401 we currently emit a COND and a PLUS expression
for conditional reductions.  This makes it difficult to combine both
into a masked reduction statement later.
This patch improves that by directly emitting a COND_ADD/COND_OP during
ifcvt and adjusting some vectorizer code to handle it.

It also makes neutral_op_for_reduction return -0 if HONOR_SIGNED_ZEROS
is true.

gcc/ChangeLog:

	PR middle-end/111401
	* tree-if-conv.cc (convert_scalar_cond_reduction): Emit COND_OP
	if supported.
	(predicate_scalar_phi): Add whitespace.
	* tree-vect-loop.cc (fold_left_reduction_fn): Add IFN_COND_OP.
	(neutral_op_for_reduction): Return -0 for PLUS.
	(check_reduction_path): Don't count else operand in COND_OP.
	(vect_is_simple_reduction): Ditto.
	(vect_create_epilog_for_reduction): Fix whitespace.
	(vectorize_fold_left_reduction): Add COND_OP handling.
	(vectorizable_reduction): Don't count else operand in COND_OP.
	(vect_transform_reduction): Add COND_OP handling.
	* tree-vectorizer.h (neutral_op_for_reduction): Add default
	parameter.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/vect-cond-reduc-in-order-2-signed-zero.c: New test.
* gcc.target/riscv/rvv/autovec/cond/pr111401.c: New test. * gcc.target/riscv/rvv/autovec/reduc/reduc_call-2.c: Adjust. * gcc.target/riscv/rvv/autovec/reduc/reduc_call-4.c: Ditto. --- .../vect-cond-reduc-in-order-2-signed-zero.c | 141 +++++++++++++++ .../riscv/rvv/autovec/cond/pr111401.c | 139 +++++++++++++++ .../riscv/rvv/autovec/reduc/reduc_call-2.c | 4 +- .../riscv/rvv/autovec/reduc/reduc_call-4.c | 4 +- gcc/tree-if-conv.cc | 49 +++-- gcc/tree-vect-loop.cc | 168 ++++++++++++++---- gcc/tree-vectorizer.h | 2 +- 7 files changed, 456 insertions(+), 51 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cond-reduc-in-order-2-signed-zero.c create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/cond/pr111401.c diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-reduc-in-order-2-signed-zero.c b/gcc/testsuite/gcc.dg/vect/vect-cond-reduc-in-order-2-signed-zero.c new file mode 100644 index 00000000000..7b46e7d8a2a --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-cond-reduc-in-order-2-signed-zero.c @@ -0,0 +1,141 @@ +/* Make sure a -0 stays -0 when we perform a conditional reduction. */ +/* { dg-do run } */ +/* { dg-require-effective-target vect_double } */ +/* { dg-add-options ieee } */ +/* { dg-additional-options "-std=gnu99 -fno-fast-math" } */ + +#include "tree-vect.h" + +#include + +#define N (VECTOR_BITS * 17) + +double __attribute__ ((noinline, noclone)) +reduc_plus_double (double *restrict a, double init, int *cond, int n) +{ + double res = init; + for (int i = 0; i < n; i++) + if (cond[i]) + res += a[i]; + return res; +} + +double __attribute__ ((noinline, noclone, optimize ("0"))) +reduc_plus_double_ref (double *restrict a, double init, int *cond, int n) +{ + double res = init; + for (int i = 0; i < n; i++) + if (cond[i]) + res += a[i]; + return res; +} + +double __attribute__ ((noinline, noclone)) +reduc_minus_double (double *restrict a, double init, int *cond, int n) +{ + double res = init; + for (int i = 0; i < n; i++) + if (cond[i]) + res -= a[i]; + return res; +} + +double __attribute__ ((noinline, noclone, optimize ("0"))) +reduc_minus_double_ref (double *restrict a, double init, int *cond, int n) +{ + double res = init; + for (int i = 0; i < n; i++) + if (cond[i]) + res -= a[i]; + return res; +} + +int __attribute__ ((optimize (1))) +main () +{ + int n = 19; + double a[N]; + int cond1[N], cond2[N]; + + for (int i = 0; i < N; i++) + { + a[i] = (i * 0.1) * (i & 1 ? 1 : -1); + cond1[i] = 0; + cond2[i] = i & 4 ? 
1 : 0; + asm volatile ("" ::: "memory"); + } + + double res1 = reduc_plus_double (a, -0.0, cond1, n); + double ref1 = reduc_plus_double_ref (a, -0.0, cond1, n); + double res2 = reduc_minus_double (a, -0.0, cond1, n); + double ref2 = reduc_minus_double_ref (a, -0.0, cond1, n); + double res3 = reduc_plus_double (a, -0.0, cond1, n); + double ref3 = reduc_plus_double_ref (a, -0.0, cond1, n); + double res4 = reduc_minus_double (a, -0.0, cond1, n); + double ref4 = reduc_minus_double_ref (a, -0.0, cond1, n); + + if (res1 != ref1 || signbit (res1) != signbit (ref1)) + __builtin_abort (); + if (res2 != ref2 || signbit (res2) != signbit (ref2)) + __builtin_abort (); + if (res3 != ref3 || signbit (res3) != signbit (ref3)) + __builtin_abort (); + if (res4 != ref4 || signbit (res4) != signbit (ref4)) + __builtin_abort (); + + res1 = reduc_plus_double (a, 0.0, cond1, n); + ref1 = reduc_plus_double_ref (a, 0.0, cond1, n); + res2 = reduc_minus_double (a, 0.0, cond1, n); + ref2 = reduc_minus_double_ref (a, 0.0, cond1, n); + res3 = reduc_plus_double (a, 0.0, cond1, n); + ref3 = reduc_plus_double_ref (a, 0.0, cond1, n); + res4 = reduc_minus_double (a, 0.0, cond1, n); + ref4 = reduc_minus_double_ref (a, 0.0, cond1, n); + + if (res1 != ref1 || signbit (res1) != signbit (ref1)) + __builtin_abort (); + if (res2 != ref2 || signbit (res2) != signbit (ref2)) + __builtin_abort (); + if (res3 != ref3 || signbit (res3) != signbit (ref3)) + __builtin_abort (); + if (res4 != ref4 || signbit (res4) != signbit (ref4)) + __builtin_abort (); + + res1 = reduc_plus_double (a, -0.0, cond2, n); + ref1 = reduc_plus_double_ref (a, -0.0, cond2, n); + res2 = reduc_minus_double (a, -0.0, cond2, n); + ref2 = reduc_minus_double_ref (a, -0.0, cond2, n); + res3 = reduc_plus_double (a, -0.0, cond2, n); + ref3 = reduc_plus_double_ref (a, -0.0, cond2, n); + res4 = reduc_minus_double (a, -0.0, cond2, n); + ref4 = reduc_minus_double_ref (a, -0.0, cond2, n); + + if (res1 != ref1 || signbit (res1) != signbit (ref1)) + __builtin_abort (); + if (res2 != ref2 || signbit (res2) != signbit (ref2)) + __builtin_abort (); + if (res3 != ref3 || signbit (res3) != signbit (ref3)) + __builtin_abort (); + if (res4 != ref4 || signbit (res4) != signbit (ref4)) + __builtin_abort (); + + res1 = reduc_plus_double (a, 0.0, cond2, n); + ref1 = reduc_plus_double_ref (a, 0.0, cond2, n); + res2 = reduc_minus_double (a, 0.0, cond2, n); + ref2 = reduc_minus_double_ref (a, 0.0, cond2, n); + res3 = reduc_plus_double (a, 0.0, cond2, n); + ref3 = reduc_plus_double_ref (a, 0.0, cond2, n); + res4 = reduc_minus_double (a, 0.0, cond2, n); + ref4 = reduc_minus_double_ref (a, 0.0, cond2, n); + + if (res1 != ref1 || signbit (res1) != signbit (ref1)) + __builtin_abort (); + if (res2 != ref2 || signbit (res2) != signbit (ref2)) + __builtin_abort (); + if (res3 != ref3 || signbit (res3) != signbit (ref3)) + __builtin_abort (); + if (res4 != ref4 || signbit (res4) != signbit (ref4)) + __builtin_abort (); + + return 0; +} diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/cond/pr111401.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/cond/pr111401.c new file mode 100644 index 00000000000..83dbd61b3f3 --- /dev/null +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/cond/pr111401.c @@ -0,0 +1,139 @@ +/* { dg-do run { target { riscv_v } } } */ +/* { dg-additional-options "-march=rv64gcv -mabi=lp64d --param riscv-autovec-preference=scalable -fdump-tree-vect-details" } */ + +double +__attribute__ ((noipa)) +foo2 (double *__restrict a, double init, int *__restrict cond, int n) +{ + for 
(int i = 0; i < n; i++) + if (cond[i]) + init += a[i]; + return init; +} + +double +__attribute__ ((noipa)) +foo3 (double *__restrict a, double init, int *__restrict cond, int n) +{ + for (int i = 0; i < n; i++) + if (cond[i]) + init -= a[i]; + return init; +} + +double +__attribute__ ((noipa)) +foo4 (double *__restrict a, double init, int *__restrict cond, int n) +{ + for (int i = 0; i < n; i++) + if (cond[i]) + init *= a[i]; + return init; +} + +int +__attribute__ ((noipa)) +foo5 (int *__restrict a, int init, int *__restrict cond, int n) +{ + for (int i = 0; i < n; i++) + if (cond[i]) + init &= a[i]; + return init; +} + +int +__attribute__ ((noipa)) +foo6 (int *__restrict a, int init, int *__restrict cond, int n) +{ + for (int i = 0; i < n; i++) + if (cond[i]) + init |= a[i]; + return init; +} + +int +__attribute__ ((noipa)) +foo7 (int *__restrict a, int init, int *__restrict cond, int n) +{ + for (int i = 0; i < n; i++) + if (cond[i]) + init ^= a[i]; + return init; +} + +#define SZ 125 + +int +main () +{ + double res1 = 0, res2 = 0, res3 = 0; + double a1[SZ], a2[SZ], a3[SZ]; + int c1[SZ], c2[SZ], c3[SZ]; + + int a4[SZ], a5[SZ], a6[SZ]; + int res4 = 0, res5 = 0, res6 = 0; + int c4[SZ], c5[SZ], c6[SZ]; + + for (int i = 0; i < SZ; i++) + { + a1[i] = i * 3 + (i & 4) - (i & 7); + a2[i] = i * 3 + (i & 4) - (i & 7); + a3[i] = i * 0.05 + (i & 4) - (i & 7); + a4[i] = i * 3 + (i & 4) - (i & 7); + a5[i] = i * 3 + (i & 4) - (i & 7); + a6[i] = i * 3 + (i & 4) - (i & 7); + c1[i] = i & 1; + c2[i] = i & 2; + c3[i] = i & 3; + c4[i] = i & 4; + c5[i] = i & 5; + c6[i] = i & 6; + __asm__ volatile ("" : : : "memory"); + } + + double init1 = 2.7, init2 = 8.2, init3 = 0.1; + double ref1 = init1, ref2 = init2, ref3 = init3; + + int init4 = 87, init5 = 11, init6 = -123894344; + int ref4 = init4, ref5 = init5, ref6 = init6; + +#pragma GCC novector + for (int i = 0; i < SZ; i++) + { + if (c1[i]) + ref1 += a1[i]; + if (c2[i]) + ref2 -= a2[i]; + if (c3[i]) + ref3 *= a3[i]; + if (c4[i]) + ref4 &= a4[i]; + if (c5[i]) + ref5 |= a5[i]; + if (c6[i]) + ref6 ^= a6[i]; + } + + res1 = foo2 (a1, init1, c1, SZ); + res2 = foo3 (a2, init2, c2, SZ); + res3 = foo4 (a3, init3, c3, SZ); + res4 = foo5 (a4, init4, c4, SZ); + res5 = foo6 (a5, init5, c5, SZ); + res6 = foo7 (a6, init6, c6, SZ); + + if (res1 != ref1) + __builtin_abort (); + if (res2 != ref2) + __builtin_abort (); + if (res3 != ref3) + __builtin_abort (); + if (res4 != ref4) + __builtin_abort (); + if (res5 != ref5) + __builtin_abort (); + if (res6 != ref6) + __builtin_abort (); +} + +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 5 "vect" } } */ +/* { dg-final { scan-tree-dump-not "VCOND_MASK" "vect" } } */ diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_call-2.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_call-2.c index cc07a047cd5..7be22d60bf2 100644 --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_call-2.c +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_call-2.c @@ -3,4 +3,6 @@ #include "reduc_call-1.c" -/* { dg-final { scan-assembler-times {vfmacc\.vv\s+v[0-9]+,v[0-9]+,v[0-9]+,v0.t} 1 } } */ +/* { dg-final { scan-assembler-times {vfmadd\.vv\s+v[0-9]+,v[0-9]+,v[0-9]+} 1 } } */ +/* { dg-final { scan-assembler-times {vfadd\.vv\s+v[0-9]+,v[0-9]+,v[0-9]+,v0.t} 1 } } */ +/* { dg-final { scan-assembler-not {vmerge} } } */ diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_call-4.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_call-4.c index 6d00c404d2a..83beabeff97 100644 --- 
a/gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_call-4.c +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_call-4.c @@ -3,4 +3,6 @@ #include "reduc_call-1.c" -/* { dg-final { scan-assembler {vfmacc\.vv\s+v[0-9]+,v[0-9]+,v[0-9]+,v0.t} } } */ +/* { dg-final { scan-assembler {vfmadd\.vv\s+v[0-9]+,v[0-9]+,v[0-9]+} } } */ +/* { dg-final { scan-assembler {vfadd\.vv\s+v[0-9]+,v[0-9]+,v[0-9]+,v0.t} } } */ +/* { dg-final { scan-assembler-not {vmerge} } } */ diff --git a/gcc/tree-if-conv.cc b/gcc/tree-if-conv.cc index c381d14b801..9571351805c 100644 --- a/gcc/tree-if-conv.cc +++ b/gcc/tree-if-conv.cc @@ -1856,10 +1856,12 @@ convert_scalar_cond_reduction (gimple *reduc, gimple_stmt_iterator *gsi, gimple *new_assign; tree rhs; tree rhs1 = gimple_assign_rhs1 (reduc); + tree lhs = gimple_assign_lhs (reduc); tree tmp = make_temp_ssa_name (TREE_TYPE (rhs1), NULL, "_ifc_"); tree c; enum tree_code reduction_op = gimple_assign_rhs_code (reduc); - tree op_nochange = neutral_op_for_reduction (TREE_TYPE (rhs1), reduction_op, NULL); + tree op_nochange = neutral_op_for_reduction (TREE_TYPE (rhs1), reduction_op, + NULL, false); gimple_seq stmts = NULL; if (dump_file && (dump_flags & TDF_DETAILS)) @@ -1868,19 +1870,36 @@ convert_scalar_cond_reduction (gimple *reduc, gimple_stmt_iterator *gsi, print_gimple_stmt (dump_file, reduc, 0, TDF_SLIM); } - /* Build cond expression using COND and constant operand - of reduction rhs. */ - c = fold_build_cond_expr (TREE_TYPE (rhs1), - unshare_expr (cond), - swap ? op_nochange : op1, - swap ? op1 : op_nochange); - - /* Create assignment stmt and insert it at GSI. */ - new_assign = gimple_build_assign (tmp, c); - gsi_insert_before (gsi, new_assign, GSI_SAME_STMT); - /* Build rhs for unconditional increment/decrement/logic_operation. */ - rhs = gimple_build (&stmts, reduction_op, - TREE_TYPE (rhs1), op0, tmp); + /* If possible create a COND_OP instead of a COND_EXPR and an OP_EXPR. + The COND_OP will have a neutral_op else value. */ + internal_fn ifn; + ifn = get_conditional_internal_fn (reduction_op); + if (ifn != IFN_LAST + && vectorized_internal_fn_supported_p (ifn, TREE_TYPE (lhs)) + && !swap) + { + gcall *cond_call = gimple_build_call_internal (ifn, 4, + unshare_expr (cond), + op0, op1, op0); + gsi_insert_before (gsi, cond_call, GSI_SAME_STMT); + gimple_call_set_lhs (cond_call, tmp); + rhs = tmp; + } + else + { + /* Build cond expression using COND and constant operand + of reduction rhs. */ + c = fold_build_cond_expr (TREE_TYPE (rhs1), + unshare_expr (cond), + swap ? op_nochange : op1, + swap ? op1 : op_nochange); + /* Create assignment stmt and insert it at GSI. */ + new_assign = gimple_build_assign (tmp, c); + gsi_insert_before (gsi, new_assign, GSI_SAME_STMT); + /* Build rhs for unconditional increment/decrement/logic_operation. */ + rhs = gimple_build (&stmts, reduction_op, + TREE_TYPE (rhs1), op0, tmp); + } if (has_nop) { @@ -2292,7 +2311,7 @@ predicate_scalar_phi (gphi *phi, gimple_stmt_iterator *gsi) { /* Convert reduction stmt into vectorizable form. 
*/ rhs = convert_scalar_cond_reduction (reduc, gsi, cond, op0, op1, - swap,has_nop, nop_reduc); + swap, has_nop, nop_reduc); redundant_ssa_names.safe_push (std::make_pair (res, rhs)); } new_stmt = gimple_build_assign (res, rhs); diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index ebab1953b9c..1c455701c73 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -3762,7 +3762,10 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) static bool fold_left_reduction_fn (code_helper code, internal_fn *reduc_fn) { - if (code == PLUS_EXPR) + /* We support MINUS_EXPR by negating the operand. This also preserves an + initial -0.0 since -0.0 - 0.0 (neutral op for MINUS_EXPR) == -0.0 + + (-0.0) = -0.0. */ + if (code == PLUS_EXPR || code == MINUS_EXPR) { *reduc_fn = IFN_FOLD_LEFT_PLUS; return true; @@ -3841,23 +3844,29 @@ reduction_fn_for_scalar_code (code_helper code, internal_fn *reduc_fn) by the introduction of additional X elements, return that X, otherwise return null. CODE is the code of the reduction and SCALAR_TYPE is type of the scalar elements. If the reduction has just a single initial value - then INITIAL_VALUE is that value, otherwise it is null. */ + then INITIAL_VALUE is that value, otherwise it is null. + If AS_INITIAL is TRUE the value is supposed to be used as initial value. + In that case no signed zero is returned. */ tree neutral_op_for_reduction (tree scalar_type, code_helper code, - tree initial_value) + tree initial_value, bool as_initial) { if (code.is_tree_code ()) switch (tree_code (code)) { - case WIDEN_SUM_EXPR: case DOT_PROD_EXPR: case SAD_EXPR: - case PLUS_EXPR: case MINUS_EXPR: case BIT_IOR_EXPR: case BIT_XOR_EXPR: return build_zero_cst (scalar_type); + case WIDEN_SUM_EXPR: + case PLUS_EXPR: + if (!as_initial && HONOR_SIGNED_ZEROS (scalar_type)) + return build_real (scalar_type, dconstm0); + else + return build_zero_cst (scalar_type); case MULT_EXPR: return build_one_cst (scalar_type); @@ -4085,7 +4094,15 @@ pop: || flow_bb_inside_loop_p (loop, gimple_bb (op_use_stmt)))) FOR_EACH_IMM_USE_ON_STMT (use_p, imm_iter) cnt++; - if (cnt != 1) + + bool cond_fn_p = op.code.is_internal_fn () + && (conditional_internal_fn_code (internal_fn (*code)) + != ERROR_MARK); + + /* In case of a COND_OP (mask, op1, op2, op1) reduction we might have + op1 twice (once as definition, once as else) in the same operation. + Allow this. */ + if ((!cond_fn_p && cnt != 1) || (opi == 1 && cond_fn_p && cnt != 2)) { fail = true; break; @@ -4187,8 +4204,14 @@ vect_is_simple_reduction (loop_vec_info loop_info, stmt_vec_info phi_info, return NULL; } - nphi_def_loop_uses++; - phi_use_stmt = use_stmt; + /* In case of a COND_OP (mask, op1, op2, op1) reduction we might have + op1 twice (once as definition, once as else) in the same operation. + Only count it as one. 
*/ + if (use_stmt != phi_use_stmt) + { + nphi_def_loop_uses++; + phi_use_stmt = use_stmt; + } } tree latch_def = PHI_ARG_DEF_FROM_EDGE (phi, loop_latch_edge (loop)); @@ -6122,7 +6145,7 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo, gcc_assert (STMT_VINFO_IN_PATTERN_P (orig_stmt_info)); gcc_assert (STMT_VINFO_RELATED_STMT (orig_stmt_info) == stmt_info); } - + scalar_dest = gimple_get_lhs (orig_stmt_info->stmt); scalar_type = TREE_TYPE (scalar_dest); scalar_results.truncate (0); @@ -6459,7 +6482,7 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo, if (REDUC_GROUP_FIRST_ELEMENT (stmt_info)) initial_value = reduc_info->reduc_initial_values[0]; neutral_op = neutral_op_for_reduction (TREE_TYPE (vectype), code, - initial_value); + initial_value, false); } if (neutral_op) vector_identity = gimple_build_vector_from_val (&seq, vectype, @@ -6941,8 +6964,8 @@ vectorize_fold_left_reduction (loop_vec_info loop_vinfo, gimple_stmt_iterator *gsi, gimple **vec_stmt, slp_tree slp_node, gimple *reduc_def_stmt, - tree_code code, internal_fn reduc_fn, - tree ops[3], tree vectype_in, + code_helper code, internal_fn reduc_fn, + tree *ops, int num_ops, tree vectype_in, int reduc_index, vec_loop_masks *masks, vec_loop_lens *lens) { @@ -6958,17 +6981,48 @@ vectorize_fold_left_reduction (loop_vec_info loop_vinfo, gcc_assert (!nested_in_vect_loop_p (loop, stmt_info)); gcc_assert (ncopies == 1); - gcc_assert (TREE_CODE_LENGTH (code) == binary_op); + + bool is_cond_op = false; + if (!code.is_tree_code ()) + { + code = conditional_internal_fn_code (internal_fn (code)); + gcc_assert (code != ERROR_MARK); + is_cond_op = true; + } + + gcc_assert (TREE_CODE_LENGTH (tree_code (code)) == binary_op); if (slp_node) - gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (vectype_out), - TYPE_VECTOR_SUBPARTS (vectype_in))); + { + if (is_cond_op) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "fold-left reduction on SLP not supported.\n"); + return false; + } - tree op0 = ops[1 - reduc_index]; + gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (vectype_out), + TYPE_VECTOR_SUBPARTS (vectype_in))); + } + + /* The operands either come from a binary operation or an IFN_COND operation. + The former is a gimple assign with binary rhs and the latter is a + gimple call with four arguments. */ + gcc_assert (num_ops == 2 || num_ops == 4); + tree op0, opmask; + if (!is_cond_op) + op0 = ops[1 - reduc_index]; + else + { + op0 = ops[2]; + opmask = ops[0]; + gcc_assert (!slp_node); + } int group_size = 1; stmt_vec_info scalar_dest_def_info; - auto_vec vec_oprnds0; + auto_vec vec_oprnds0, vec_opmask; if (slp_node) { auto_vec > vec_defs (2); @@ -6984,9 +7038,15 @@ vectorize_fold_left_reduction (loop_vec_info loop_vinfo, vect_get_vec_defs_for_operand (loop_vinfo, stmt_info, 1, op0, &vec_oprnds0); scalar_dest_def_info = stmt_info; + + /* For an IFN_COND_OP we also need the vector mask operand. 
*/ + if (is_cond_op) + vect_get_vec_defs_for_operand (loop_vinfo, stmt_info, 1, + opmask, &vec_opmask); } - tree scalar_dest = gimple_assign_lhs (scalar_dest_def_info->stmt); + gimple *sdef = scalar_dest_def_info->stmt; + tree scalar_dest = gimple_get_lhs (sdef); tree scalar_type = TREE_TYPE (scalar_dest); tree reduc_var = gimple_phi_result (reduc_def_stmt); @@ -7020,13 +7080,16 @@ vectorize_fold_left_reduction (loop_vec_info loop_vinfo, tree bias = NULL_TREE; if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) mask = vect_get_loop_mask (loop_vinfo, gsi, masks, vec_num, vectype_in, i); + else if (is_cond_op) + mask = vec_opmask[0]; if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) { len = vect_get_loop_len (loop_vinfo, gsi, lens, vec_num, vectype_in, i, 1); signed char biasval = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo); bias = build_int_cst (intQI_type_node, biasval); - mask = build_minus_one_cst (truth_type_for (vectype_in)); + if (!is_cond_op) + mask = build_minus_one_cst (truth_type_for (vectype_in)); } /* Handle MINUS by adding the negative. */ @@ -7038,7 +7101,8 @@ vectorize_fold_left_reduction (loop_vec_info loop_vinfo, def0 = negated; } - if (mask && mask_reduc_fn == IFN_LAST) + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) + && mask && mask_reduc_fn == IFN_LAST) def0 = merge_with_identity (gsi, mask, vectype_out, def0, vector_identity); @@ -7069,8 +7133,8 @@ vectorize_fold_left_reduction (loop_vec_info loop_vinfo, } else { - reduc_var = vect_expand_fold_left (gsi, scalar_dest_var, code, - reduc_var, def0); + reduc_var = vect_expand_fold_left (gsi, scalar_dest_var, + tree_code (code), reduc_var, def0); new_stmt = SSA_NAME_DEF_STMT (reduc_var); /* Remove the statement, so that we can use the same code paths as for statements that we've just created. */ @@ -7521,8 +7585,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo, if (i == STMT_VINFO_REDUC_IDX (stmt_info)) continue; + /* For an IFN_COND_OP we might hit the reduction definition operand + twice (once as definition, once as else). */ + if (op.ops[i] == op.ops[STMT_VINFO_REDUC_IDX (stmt_info)]) + continue; + /* There should be only one cycle def in the stmt, the one - leading to reduc_def. */ + leading to reduc_def. */ if (VECTORIZABLE_CYCLE_DEF (dt)) return false; @@ -7721,6 +7790,15 @@ vectorizable_reduction (loop_vec_info loop_vinfo, when generating the code inside the loop. */ code_helper orig_code = STMT_VINFO_REDUC_CODE (phi_info); + + /* If conversion might have created a conditional operation like + IFN_COND_ADD already. Use the internal code for the following checks. */ + if (orig_code.is_internal_fn ()) + { + tree_code new_code = conditional_internal_fn_code (internal_fn (orig_code)); + orig_code = new_code != ERROR_MARK ? 
new_code : orig_code; + } + STMT_VINFO_REDUC_CODE (reduc_info) = orig_code; vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info); @@ -7759,7 +7837,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, { if (dump_enabled_p ()) dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, - "reduction: not commutative/associative"); + "reduction: not commutative/associative\n"); return false; } } @@ -8143,9 +8221,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo, LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false; } else if (reduction_type == FOLD_LEFT_REDUCTION - && reduc_fn == IFN_LAST + && internal_fn_mask_index (reduc_fn) == -1 && FLOAT_TYPE_P (vectype_in) - && HONOR_SIGNED_ZEROS (vectype_in) && HONOR_SIGN_DEPENDENT_ROUNDING (vectype_in)) { if (dump_enabled_p ()) @@ -8294,6 +8371,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo, code_helper code = canonicalize_code (op.code, op.type); internal_fn cond_fn = get_conditional_internal_fn (code, op.type); + vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo); vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo); bool mask_by_cond_expr = use_mask_by_cond_expr_p (code, cond_fn, vectype_in); @@ -8312,17 +8390,29 @@ vect_transform_reduction (loop_vec_info loop_vinfo, if (code == COND_EXPR) gcc_assert (ncopies == 1); + /* A binary COND_OP reduction must have the same definition and else + value. */ + bool cond_fn_p = code.is_internal_fn () + && conditional_internal_fn_code (internal_fn (code)) != ERROR_MARK; + if (cond_fn_p) + { + gcc_assert (code == IFN_COND_ADD || code == IFN_COND_SUB + || code == IFN_COND_MUL || code == IFN_COND_AND + || code == IFN_COND_IOR || code == IFN_COND_XOR); + gcc_assert (op.num_ops == 4 && (op.ops[1] == op.ops[3])); + } + bool masked_loop_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo); vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info); if (reduction_type == FOLD_LEFT_REDUCTION) { internal_fn reduc_fn = STMT_VINFO_REDUC_FN (reduc_info); - gcc_assert (code.is_tree_code ()); + gcc_assert (code.is_tree_code () || cond_fn_p); return vectorize_fold_left_reduction (loop_vinfo, stmt_info, gsi, vec_stmt, slp_node, reduc_def_phi, - tree_code (code), reduc_fn, op.ops, vectype_in, reduc_index, masks, - lens); + code, reduc_fn, op.ops, op.num_ops, vectype_in, + reduc_index, masks, lens); } bool single_defuse_cycle = STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info); @@ -8335,14 +8425,20 @@ vect_transform_reduction (loop_vec_info loop_vinfo, tree scalar_dest = gimple_get_lhs (stmt_info->stmt); tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out); + /* Get NCOPIES vector definitions for all operands except the reduction + definition. */ vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, ncopies, single_defuse_cycle && reduc_index == 0 ? NULL_TREE : op.ops[0], &vec_oprnds0, single_defuse_cycle && reduc_index == 1 ? NULL_TREE : op.ops[1], &vec_oprnds1, - op.num_ops == 3 - && !(single_defuse_cycle && reduc_index == 2) + op.num_ops == 4 + || (op.num_ops == 3 + && !(single_defuse_cycle && reduc_index == 2)) ? op.ops[2] : NULL_TREE, &vec_oprnds2); + + /* For single def-use cycles get one copy of the vectorized reduction + definition. 
*/ if (single_defuse_cycle) { gcc_assert (!slp_node); @@ -8382,7 +8478,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo, } else { - if (op.num_ops == 3) + if (op.num_ops >= 3) vop[2] = vec_oprnds2[i]; if (masked_loop_p && mask_by_cond_expr) @@ -8395,10 +8491,16 @@ vect_transform_reduction (loop_vec_info loop_vinfo, if (emulated_mixed_dot_prod) new_stmt = vect_emulate_mixed_dot_prod (loop_vinfo, stmt_info, gsi, vec_dest, vop); - else if (code.is_internal_fn ()) + + else if (code.is_internal_fn () && !cond_fn_p) new_stmt = gimple_build_call_internal (internal_fn (code), op.num_ops, vop[0], vop[1], vop[2]); + else if (code.is_internal_fn () && cond_fn_p) + new_stmt = gimple_build_call_internal (internal_fn (code), + op.num_ops, + vop[0], vop[1], vop[2], + vop[1]); else new_stmt = gimple_build_assign (vec_dest, tree_code (op.code), vop[0], vop[1], vop[2]); diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index a4043e4a656..254d172231d 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -2350,7 +2350,7 @@ extern tree vect_create_addr_base_for_vector_ref (vec_info *, tree); /* In tree-vect-loop.cc. */ -extern tree neutral_op_for_reduction (tree, code_helper, tree); +extern tree neutral_op_for_reduction (tree, code_helper, tree, bool = true); extern widest_int vect_iv_limit_for_partial_vectors (loop_vec_info loop_vinfo); bool vect_rgroup_iv_might_wrap_p (loop_vec_info, rgroup_controls *); /* Used in tree-vect-loop-manip.cc */ -- 2.41.0
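
P.S. For reference, a small illustrative sketch (not part of the patch) of
the kind of loop the new pr111401.c test exercises and of the if-conversion
difference described in the cover letter; the GIMPLE in the comment is
simplified and the temporary/function names are made up:

/* Conditional sum reduction, as in foo2 of pr111401.c.  */
double
cond_sum (double *restrict a, double init, int *restrict cond, int n)
{
  double res = init;
  for (int i = 0; i < n; i++)
    if (cond[i])
      /* Without the patch if-conversion emits two statements here,
	   _ifc_1 = cond[i] != 0 ? a[i] : 0.0;   <- COND_EXPR
	   res = res + _ifc_1;                   <- PLUS_EXPR
	 while with the patch a single conditional internal call is used,
	   res = .COND_ADD (cond[i] != 0, res, a[i], res);
	 keeping mask and reduction in one statement that the vectorizer
	 can turn into a masked reduction.  */
      res += a[i];
  return res;
}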