public inbox for gcc-patches@gcc.gnu.org
* [PATCH 0/7] Support vector load/store with length
@ 2020-05-26  5:49 Kewen.Lin
  2020-05-26  5:51 ` [PATCH 1/7] ifn/optabs: " Kewen.Lin
                   ` (7 more replies)
  0 siblings, 8 replies; 80+ messages in thread
From: Kewen.Lin @ 2020-05-26  5:49 UTC (permalink / raw)
  To: GCC Patches
  Cc: Bill Schmidt, Segher Boessenkool, Richard Sandiford,
	Richard Guenther, dje.gcc

Hi all,

This patch set adds support for vector load/store with length.  Power
ISA 3.0 brings the instructions lxvl/stxvl to perform vector load/store
with a given length in bytes, which is good to exploit for cases where
we don't have enough elements to fill a whole vector, such as epilogues.
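
As a small illustration (just a sketch of mine, not part of the patches),
this is the kind of loop whose tail the length-based accesses target:
when n isn't a multiple of the elements per vector, the leftover
iterations can be covered by one length-limited load/store instead of a
scalar epilogue:

  /* Hypothetical example; the tail of n % VF iterations is the natural
     candidate for lxvl/stxvl style accesses.  */
  void
  saxpy (float *restrict x, float *restrict y, float a, int n)
  {
    for (int i = 0; i < n; i++)
      y[i] = a * x[i] + y[i];
  }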

This support mainly follows the existing handling for fully-predicated
loops, but it also covers the epilogue usage.  It currently supports two
modes controlled by the parameter vect-with-length-scope: vectorize any
loop fully with length, or only those cases whose iteration counts are
less than VF, such as epilogues.  For now I don't have a ready
environment to benchmark it, but given the currently inefficient length
generation, I don't think it's a good idea to adopt vector access with
length for all loops: for a main loop that would be vectorized anyway,
it increases register pressure and introduces extra computation for the
lengths, and the icache benefit doesn't seem comparable.  Still, I think
it's worth keeping this parameter for functionality testing, further
benchmarking and other ports' potential future support.

As we don't have any benchmarking yet, this support isn't enabled by
default for any particular CPU; all testing is done with the parameter
set explicitly.
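
For reference, explicit parameter setting here means something like the
following command line (just an example, assuming a Power9 target and a
file test.c; vectorization has to be enabled explicitly at -O2):

  gcc -O2 -ftree-vectorize -mcpu=power9 \
      --param=vect-with-length-scope=1 -S test.c

where scope 1 only applies the length-based accesses to epilogues/small
loops and scope 2 applies them to any loop where possible.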

Bootstrapped on powerpc64le-linux-gnu (P9) with all vect-with-length-scope
settings (0/1/2).  The regression test passed with vect-with-length-scope
0; for the other two settings, several vector-related test cases need to
be updated, but no remarkable failures were found.  BTW, P9 is the CPU
that supports the functionality, but it isn't ready for evaluating the
performance.

There are still many things to be supported or improved, including but
not limited to:
  - reduction/live-out support
  - Cost model adding/tweaking
  - IFN gimple folding
  - Some unnecessary ops improvements eg: vector_size check
  - Some possible refactoring
I'll implement and post patches for these gradually.

Any comments are highly appreciated.

BR,
Kewen
-----

Patch set outline:
  [PATCH 1/7] ifn/optabs: Support vector load/store with length
  [PATCH 2/7] rs6000: lenload/lenstore optab support
  [PATCH 3/7] vect: Factor out codes for niters smaller than vf check
  [PATCH 4/7] hook/rs6000: Add vectorize length mode for vector with length
  [PATCH 5/7] vect: Support vector load/store with length in vectorizer
  [PATCH 6/7] ivopts: Add handlings for vector with length IFNs
  [PATCH 7/7] rs6000/testsuite: Vector with length test cases

 gcc/config/rs6000/rs6000.c                                  |   3 +
 gcc/config/rs6000/vsx.md                                    |  30 ++++++++++
 gcc/doc/invoke.texi                                         |   7 +++
 gcc/doc/md.texi                                             |  16 ++++++
 gcc/doc/tm.texi                                             |   6 ++
 gcc/doc/tm.texi.in                                          |   2 +
 gcc/internal-fn.c                                           |  13 ++++-
 gcc/internal-fn.def                                         |   6 ++
 gcc/optabs.def                                              |   2 +
 gcc/params.opt                                              |   4 ++
 gcc/target.def                                              |   7 +++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-1.h          |  18 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-2.h          |  17 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-3.h          |  31 +++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-4.h          |  24 ++++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-5.h          |  29 ++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-6.h          |  32 +++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-1.c     |  15 +++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-2.c     |  15 +++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-3.c     |  18 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-4.c     |  15 +++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-5.c     |  15 +++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-6.c     |  16 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-1.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-2.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-3.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-4.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-5.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-6.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-1.c     |  16 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-2.c     |  16 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-3.c     |  17 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-4.c     |  16 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-5.c     |  16 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-6.c     |  16 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-1.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-2.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-3.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-4.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-5.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-6.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-1.h      |  34 ++++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-2.h      |  36 ++++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-3.h      |  34 ++++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-4.h      |  62 +++++++++++++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-5.h      |  45 +++++++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-6.h      |  52 +++++++++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length.h            |  14 +++++
 gcc/tree-ssa-loop-ivopts.c                                  |   4 ++
 gcc/tree-vect-loop-manip.c                                  | 268 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 gcc/tree-vect-loop.c                                        | 272 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 gcc/tree-vect-stmts.c                                       | 152 ++++++++++++++++++++++++++++++++++++++++++++++++++
 gcc/tree-vectorizer.h                                       |  32 +++++++++++
 53 files changed, 1545 insertions(+), 18 deletions(-)


* [PATCH 1/7] ifn/optabs: Support vector load/store with length
  2020-05-26  5:49 [PATCH 0/7] Support vector load/store with length Kewen.Lin
@ 2020-05-26  5:51 ` Kewen.Lin
  2020-06-10  6:41   ` [PATCH 1/7 V2] " Kewen.Lin
  2020-05-26  5:53 ` [PATCH 2/7] rs6000: lenload/lenstore optab support Kewen.Lin
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-05-26  5:51 UTC (permalink / raw)
  To: GCC Patches
  Cc: Richard Guenther, Bill Schmidt, dje.gcc, Segher Boessenkool,
	Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 686 bytes --]

gcc/ChangeLog:

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* doc/md.texi (lenload@var{m}@var{n}): Document.
	(lenstore@var{m}@var{n}): Likewise.
	* internal-fn.c (len_load_direct): New macro.
	(len_store_direct): Likewise.
	(expand_len_load_optab_fn): Likewise.
	(expand_len_store_optab_fn): Likewise.
	(direct_len_load_optab_supported_p): Likewise.
	(direct_len_store_optab_supported_p): Likewise.
	(internal_load_fn_p): Handle IFN_LEN_LOAD.
	(internal_store_fn_p): Handle IFN_LEN_STORE.
	(internal_fn_stored_value_index): Handle IFN_LEN_STORE.
	* internal-fn.def (LEN_LOAD): New internal function.
	(LEN_STORE): Likewise.
	* optabs.def (lenload_optab, lenstore_optab): New optab.
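
To make the intended semantics concrete, here is a rough scalar model
(illustrative only, not the actual expanders; whether bytes beyond the
length are zeroed, as lxvl does on Power, is a target detail rather than
part of the IFN contract):

  #include <string.h>
  #include <stdint.h>

  typedef uint8_t vec_bytes[16];

  /* Load only LEN bytes (clamped to the vector size) into DST.  */
  static inline void
  len_load_model (vec_bytes dst, const uint8_t *src, unsigned len)
  {
    unsigned n = len < 16 ? len : 16;
    memset (dst, 0, 16);   /* models lxvl-style zeroing of the rest */
    memcpy (dst, src, n);
  }

  /* Store only LEN bytes (clamped to the vector size) from SRC.  */
  static inline void
  len_store_model (uint8_t *dst, const vec_bytes src, unsigned len)
  {
    unsigned n = len < 16 ? len : 16;
    memcpy (dst, src, n);  /* bytes beyond LEN are not written */
  }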



[-- Attachment #2: 0001-IFN-for-vector-load-store-with-length-and-related-op.patch --]
[-- Type: text/plain, Size: 6763 bytes --]

---
 gcc/doc/md.texi     | 16 ++++++++++++++++
 gcc/internal-fn.c   | 13 +++++++++++--
 gcc/internal-fn.def |  6 ++++++
 gcc/optabs.def      |  2 ++
 4 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 2c67c818da5..b0c19cd3b81 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5167,6 +5167,22 @@ mode @var{n}.
 
 This pattern is not allowed to @code{FAIL}.
 
+@cindex @code{lenload@var{m}@var{n}} instruction pattern
+@item @samp{lenload@var{m}@var{n}}
+Perform a vector load with length from memory operand 1 of mode @var{m}
+into register operand 0.  Length is provided in register operand 2 of
+mode @var{n}.
+
+This pattern is not allowed to @code{FAIL}.
+
+@cindex @code{lenstore@var{m}@var{n}} instruction pattern
+@item @samp{lenstore@var{m}@var{n}}
+Perform a vector store with length from register operand 1 of mode @var{m}
+into memory operand 0.  Length is provided in register operand 2 of
+mode @var{n}.
+
+This pattern is not allowed to @code{FAIL}.
+
 @cindex @code{vec_perm@var{m}} instruction pattern
 @item @samp{vec_perm@var{m}}
 Output a (variable) vector permutation.  Operand 0 is the destination
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 5e9aa60721e..be64cd86c07 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -104,10 +104,12 @@ init_internal_fns ()
 #define load_lanes_direct { -1, -1, false }
 #define mask_load_lanes_direct { -1, -1, false }
 #define gather_load_direct { 3, 1, false }
+#define len_load_direct { -1, 2, false }
 #define mask_store_direct { 3, 2, false }
 #define store_lanes_direct { 0, 0, false }
 #define mask_store_lanes_direct { 0, 0, false }
 #define scatter_store_direct { 3, 1, false }
+#define len_store_direct { 3, 2, false }
 #define unary_direct { 0, 0, true }
 #define binary_direct { 0, 0, true }
 #define ternary_direct { 0, 0, true }
@@ -2478,7 +2480,7 @@ expand_call_mem_ref (tree type, gcall *stmt, int index)
   return fold_build2 (MEM_REF, type, addr, build_int_cst (alias_ptr_type, 0));
 }
 
-/* Expand MASK_LOAD{,_LANES} call STMT using optab OPTAB.  */
+/* Expand MASK_LOAD{,_LANES} and LEN_LOAD call STMT using optab OPTAB.  */
 
 static void
 expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
@@ -2514,8 +2516,9 @@ expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 }
 
 #define expand_mask_load_lanes_optab_fn expand_mask_load_optab_fn
+#define expand_len_load_optab_fn expand_mask_load_optab_fn
 
-/* Expand MASK_STORE{,_LANES} call STMT using optab OPTAB.  */
+/* Expand MASK_STORE{,_LANES} and LEN_STORE call STMT using optab OPTAB.  */
 
 static void
 expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
@@ -2547,6 +2550,7 @@ expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 }
 
 #define expand_mask_store_lanes_optab_fn expand_mask_store_optab_fn
+#define expand_len_store_optab_fn expand_mask_store_optab_fn
 
 static void
 expand_ABNORMAL_DISPATCHER (internal_fn, gcall *)
@@ -3128,10 +3132,12 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
 #define direct_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_gather_load_optab_supported_p convert_optab_supported_p
+#define direct_len_load_optab_supported_p direct_optab_supported_p
 #define direct_mask_store_optab_supported_p direct_optab_supported_p
 #define direct_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_scatter_store_optab_supported_p convert_optab_supported_p
+#define direct_len_store_optab_supported_p direct_optab_supported_p
 #define direct_while_optab_supported_p convert_optab_supported_p
 #define direct_fold_extract_optab_supported_p direct_optab_supported_p
 #define direct_fold_left_optab_supported_p direct_optab_supported_p
@@ -3498,6 +3504,7 @@ internal_load_fn_p (internal_fn fn)
     case IFN_MASK_LOAD_LANES:
     case IFN_GATHER_LOAD:
     case IFN_MASK_GATHER_LOAD:
+    case IFN_LEN_LOAD:
       return true;
 
     default:
@@ -3517,6 +3524,7 @@ internal_store_fn_p (internal_fn fn)
     case IFN_MASK_STORE_LANES:
     case IFN_SCATTER_STORE:
     case IFN_MASK_SCATTER_STORE:
+    case IFN_LEN_STORE:
       return true;
 
     default:
@@ -3577,6 +3585,7 @@ internal_fn_stored_value_index (internal_fn fn)
     case IFN_MASK_STORE:
     case IFN_SCATTER_STORE:
     case IFN_MASK_SCATTER_STORE:
+    case IFN_LEN_STORE:
       return 3;
 
     default:
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 1d190d492ff..ed6561f296a 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -49,11 +49,13 @@ along with GCC; see the file COPYING3.  If not see
    - load_lanes: currently just vec_load_lanes
    - mask_load_lanes: currently just vec_mask_load_lanes
    - gather_load: used for {mask_,}gather_load
+   - len_load: currently just lenload
 
    - mask_store: currently just maskstore
    - store_lanes: currently just vec_store_lanes
    - mask_store_lanes: currently just vec_mask_store_lanes
    - scatter_store: used for {mask_,}scatter_store
+   - len_store: currently just lenstore
 
    - unary: a normal unary optab, such as vec_reverse_<mode>
    - binary: a normal binary optab, such as vec_interleave_lo_<mode>
@@ -127,6 +129,8 @@ DEF_INTERNAL_OPTAB_FN (GATHER_LOAD, ECF_PURE, gather_load, gather_load)
 DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
 		       mask_gather_load, gather_load)
 
+DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, lenload, len_load)
+
 DEF_INTERNAL_OPTAB_FN (SCATTER_STORE, 0, scatter_store, scatter_store)
 DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
 		       mask_scatter_store, scatter_store)
@@ -136,6 +140,8 @@ DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
 DEF_INTERNAL_OPTAB_FN (MASK_STORE_LANES, 0,
 		       vec_mask_store_lanes, mask_store_lanes)
 
+DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, lenstore, len_store)
+
 DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
 DEF_INTERNAL_OPTAB_FN (CHECK_RAW_PTRS, ECF_CONST | ECF_NOTHROW,
 		       check_raw_ptrs, check_ptrs)
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 0c64eb52a8d..0551a191ad0 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -97,6 +97,8 @@ OPTAB_CD(scatter_store_optab, "scatter_store$a$b")
 OPTAB_CD(mask_scatter_store_optab, "mask_scatter_store$a$b")
 OPTAB_CD(vec_extract_optab, "vec_extract$a$b")
 OPTAB_CD(vec_init_optab, "vec_init$a$b")
+OPTAB_CD(lenload_optab, "lenload$a$b")
+OPTAB_CD(lenstore_optab, "lenstore$a$b")
 
 OPTAB_CD (while_ult_optab, "while_ult$a$b")
 
-- 


* [PATCH 2/7] rs6000: lenload/lenstore optab support
  2020-05-26  5:49 [PATCH 0/7] Support vector load/store with length Kewen.Lin
  2020-05-26  5:51 ` [PATCH 1/7] ifn/optabs: " Kewen.Lin
@ 2020-05-26  5:53 ` Kewen.Lin
  2020-06-10  6:43   ` [PATCH 2/7 V2] " Kewen.Lin
  2020-05-26  5:54 ` [PATCH 3/7] vect: Factor out codes for niters smaller than vf check Kewen.Lin
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-05-26  5:53 UTC (permalink / raw)
  To: GCC Patches; +Cc: Bill Schmidt, dje.gcc, Segher Boessenkool

[-- Attachment #1: Type: text/plain, Size: 155 bytes --]

gcc/ChangeLog:

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* config/rs6000/vsx.md (lenload<mode>di): New define_expand.
	(lenstore<mode>di): Likewise.



[-- Attachment #2: 0002-Add-rs6000-lenload-lenstore-optab-support.patch --]
[-- Type: text/plain, Size: 1415 bytes --]

---
 gcc/config/rs6000/vsx.md | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 2a28215ac5b..cc098d3ccb5 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -5082,6 +5082,36 @@
   operands[3] = gen_reg_rtx (DImode);
 })
 
+;; Define optab for vector access with length vectorization exploitation.
+(define_expand "lenload<mode>di"
+  [(match_operand:VEC_A 0 "vlogical_operand")
+   (match_operand:VEC_A 1 "memory_operand")
+   (match_operand:DI 2 "int_reg_operand")]
+  "TARGET_P9_VECTOR && TARGET_64BIT"
+{
+  rtx mem = XEXP (operands[1], 0);
+  mem = force_reg (DImode, mem);
+  rtx res = gen_reg_rtx (V16QImode);
+  emit_insn (gen_lxvl (res, mem, operands[2]));
+  emit_move_insn (operands[0], gen_lowpart (<MODE>mode, res));
+  DONE;
+})
+
+(define_expand "lenstore<mode>di"
+  [(match_operand:VEC_A 0 "memory_operand")
+   (match_operand:VEC_A 1 "vlogical_operand")
+   (match_operand:DI 2 "int_reg_operand")
+  ]
+  "TARGET_P9_VECTOR && TARGET_64BIT"
+{
+  rtx val = gen_reg_rtx (V16QImode);
+  emit_move_insn (val, gen_lowpart (V16QImode, operands[1]));
+  rtx mem = XEXP (operands[0], 0);
+  mem = force_reg (DImode, mem);
+  emit_insn (gen_stxvl (val, mem, operands[2]));
+  DONE;
+})
+
 (define_insn "*stxvl"
   [(set (mem:V16QI (match_operand:DI 1 "gpc_reg_operand" "b"))
 	(unspec:V16QI
-- 


* [PATCH 3/7] vect: Factor out codes for niters smaller than vf check
  2020-05-26  5:49 [PATCH 0/7] Support vector load/store with length Kewen.Lin
  2020-05-26  5:51 ` [PATCH 1/7] ifn/optabs: " Kewen.Lin
  2020-05-26  5:53 ` [PATCH 2/7] rs6000: lenload/lenstore optab support Kewen.Lin
@ 2020-05-26  5:54 ` Kewen.Lin
  2020-05-26  5:55 ` [PATCH 4/7] hook/rs6000: Add vectorize length mode for vector with length Kewen.Lin
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 80+ messages in thread
From: Kewen.Lin @ 2020-05-26  5:54 UTC (permalink / raw)
  To: GCC Patches
  Cc: Richard Guenther, Bill Schmidt, dje.gcc, Segher Boessenkool,
	Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 190 bytes --]

gcc/ChangeLog:

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* tree-vect-loop.c (known_niters_smaller_than_vf): New function, 
	factored out from ...
	(vect_analyze_loop_costing): ... here.

[-- Attachment #2: 0003-refatoring-out-function-niter-small-than-VF.patch --]
[-- Type: text/plain, Size: 1997 bytes --]

---
 gcc/tree-vect-loop.c | 31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 4f94b4baad9..80e33b61be7 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -155,6 +155,7 @@ along with GCC; see the file COPYING3.  If not see
 static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *);
 static stmt_vec_info vect_is_simple_reduction (loop_vec_info, stmt_vec_info,
 					       bool *, bool *);
+static bool known_niters_smaller_than_vf (loop_vec_info);
 
 /* Subroutine of vect_determine_vf_for_stmt that handles only one
    statement.  VECTYPE_MAYBE_SET_P is true if STMT_VINFO_VECTYPE
@@ -1631,15 +1632,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
      vectorization factor.  */
   if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
     {
-      HOST_WIDE_INT max_niter;
-
-      if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
-	max_niter = LOOP_VINFO_INT_NITERS (loop_vinfo);
-      else
-	max_niter = max_stmt_executions_int (loop);
-
-      if (max_niter != -1
-	  && (unsigned HOST_WIDE_INT) max_niter < assumed_vf)
+      if (known_niters_smaller_than_vf (loop_vinfo))
 	{
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -9231,3 +9224,23 @@ vect_iv_limit_for_full_masking (loop_vec_info loop_vinfo)
   return iv_limit;
 }
 
+/* Return true if we know the iteration count is smaller than the
+   vectorization factor, otherwise return false.  */
+
+static bool
+known_niters_smaller_than_vf (loop_vec_info loop_vinfo)
+{
+  unsigned int assumed_vf = vect_vf_for_cost (loop_vinfo);
+
+  HOST_WIDE_INT max_niter;
+  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+    max_niter = LOOP_VINFO_INT_NITERS (loop_vinfo);
+  else
+    max_niter = max_stmt_executions_int (LOOP_VINFO_LOOP (loop_vinfo));
+
+  if (max_niter != -1 && (unsigned HOST_WIDE_INT) max_niter < assumed_vf)
+    return true;
+
+  return false;
+}
+
-- 


* [PATCH 4/7] hook/rs6000: Add vectorize length mode for vector with length
  2020-05-26  5:49 [PATCH 0/7] Support vector load/store with length Kewen.Lin
                   ` (2 preceding siblings ...)
  2020-05-26  5:54 ` [PATCH 3/7] vect: Factor out codes for niters smaller than vf check Kewen.Lin
@ 2020-05-26  5:55 ` Kewen.Lin
  2020-06-10  6:44   ` [PATCH 4/7 V2] " Kewen.Lin
  2020-05-26  5:57 ` [PATCH 5/7] vect: Support vector load/store with length in vectorizer Kewen.Lin
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-05-26  5:55 UTC (permalink / raw)
  To: GCC Patches
  Cc: Richard Guenther, Bill Schmidt, dje.gcc, Segher Boessenkool,
	Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 213 bytes --]

gcc/ChangeLog:

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* config/rs6000/rs6000.c (TARGET_VECTORIZE_LENGTH_MODE): New macro.
	* doc/tm.texi: Regenerate.
	* doc/tm.texi.in: New hook.
	* target.def: Likewise.



[-- Attachment #2: 0004-add-length-mode-target-hookpod-support.patch --]
[-- Type: text/plain, Size: 2573 bytes --]

---
 gcc/config/rs6000/rs6000.c | 3 +++
 gcc/doc/tm.texi            | 6 ++++++
 gcc/doc/tm.texi.in         | 2 ++
 gcc/target.def             | 7 +++++++
 4 files changed, 18 insertions(+)

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 8435bc15d72..c4d9d558b2f 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1659,6 +1659,9 @@ static const struct attribute_spec rs6000_attribute_table[] =
 #undef TARGET_HAVE_COUNT_REG_DECR_P
 #define TARGET_HAVE_COUNT_REG_DECR_P true
 
+#undef TARGET_VECTORIZE_LENGTH_MODE
+#define TARGET_VECTORIZE_LENGTH_MODE DImode
+
 /* 1000000000 is infinite cost in IVOPTs.  */
 #undef TARGET_DOLOOP_COST_FOR_GENERIC
 #define TARGET_DOLOOP_COST_FOR_GENERIC 1000000000
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 6e7d9dc54a9..5ea8734a191 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6084,6 +6084,12 @@ The default implementation returns a @code{MODE_VECTOR_INT} with the
 same size and number of elements as @var{mode}, if such a mode exists.
 @end deftypefn
 
+@deftypevr {Target Hook} scalar_int_mode TARGET_VECTORIZE_LENGTH_MODE
+For the targets which support vector memory access with length, return
+the scalar int mode to use for the length in bytes.
+The default is to use @code{word_mode}.
+@end deftypevr
+
 @deftypefn {Target Hook} bool TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE (unsigned @var{ifn})
 This hook returns true if masked internal function @var{ifn} (really of
 type @code{internal_fn}) should be considered expensive when the mask is
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 3be984bbd5c..83034176b56 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4181,6 +4181,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_VECTORIZE_GET_MASK_MODE
 
+@hook TARGET_VECTORIZE_LENGTH_MODE
+
 @hook TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE
 
 @hook TARGET_VECTORIZE_INIT_COST
diff --git a/gcc/target.def b/gcc/target.def
index 07059a87caf..b58d87e1496 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1969,6 +1969,13 @@ same size and number of elements as @var{mode}, if such a mode exists.",
  (machine_mode mode),
  default_get_mask_mode)
 
+DEFHOOKPOD
+(length_mode,
+ "For the targets which support vector memory access with length, return\n\
+the scalar int mode to use for the length in bytes.\n\
+The default is to use @code{word_mode}.",
+ scalar_int_mode, word_mode)
+
 /* Function to say whether a masked operation is expensive when the
    mask is all zeros.  */
 DEFHOOK
-- 


* [PATCH 5/7] vect: Support vector load/store with length in vectorizer
  2020-05-26  5:49 [PATCH 0/7] Support vector load/store with length Kewen.Lin
                   ` (3 preceding siblings ...)
  2020-05-26  5:55 ` [PATCH 4/7] hook/rs6000: Add vectorize length mode for vector with length Kewen.Lin
@ 2020-05-26  5:57 ` Kewen.Lin
  2020-05-26 12:49   ` Richard Sandiford
  2020-05-26  5:58 ` [PATCH 6/7] ivopts: Add handlings for vector with length IFNs Kewen.Lin
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-05-26  5:57 UTC (permalink / raw)
  To: GCC Patches
  Cc: Richard Guenther, Bill Schmidt, dje.gcc, Segher Boessenkool,
	Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 2116 bytes --]

gcc/ChangeLog

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* doc/invoke.texi (vect-with-length-scope): Document new option.
	* params.opt (vect-with-length-scope): New.
	* tree-vect-loop-manip.c (vect_set_loop_lens_directly): New function.
	(vect_set_loop_condition_len): Likewise.
	(vect_set_loop_condition): Call vect_set_loop_condition_len for loop with
	length.
	(vect_gen_vector_loop_niters): Use VF as the step for loop with length.
	(vect_do_peeling): Adjust for loop with length.
	* tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize
	can_with_length_p and fully_with_length_p.
	(release_vec_loop_lens): New function.
	(_loop_vec_info::~_loop_vec_info): Use it to free the loop lens.
	(vect_verify_loop_lens): New function.
	(vect_analyze_loop_costing): Adjust for loop fully with length.
	(determine_peel_for_niter): Don't peel if loop fully with length.
	(vect_analyze_loop_2): Save LOOP_VINFO_CAN_WITH_LENGTH_P around retries,
	and free the length rgroups before retrying.  Check loop-wide reasons for
	disabling loops with length.  Make the final decision about using
	vector access with length or not.
	(vect_analyze_loop): Add handling for the epilogue of a loop that can
	use vector access with length but doesn't.
	(vect_estimate_min_profitable_iters): Adjust for loop with length.
	(vectorizable_reduction): Disable loop with length.
	(vectorizable_live_operation): Likewise.
	(vect_record_loop_len): New function.
	(vect_get_loop_len): Likewise.
	(vect_transform_loop): Flag that the final loop iteration could be a
	partial vector for a loop with length.
	* tree-vect-stmts.c (check_load_store_with_len): New function.
	(vectorizable_store): Handle vector loop with length.
	(vectorizable_load): Likewise.
	(vect_gen_len): New function.
	* tree-vectorizer.h (struct rgroup_lens): New structure.
	(vec_loop_lens): New typedef.
	(_loop_vec_info): Add lens, can_with_length_p and fully_with_length_p.
	(LOOP_VINFO_CAN_WITH_LENGTH_P): New macro.
	(LOOP_VINFO_FULLY_WITH_LENGTH_P): Likewise.
	(LOOP_VINFO_LENS): Likewise.
	(vect_record_loop_len): New declaration.
	(vect_get_loop_len): Likewise.
	(vect_gen_len): Likewise.



[-- Attachment #2: 0005-vector-with-length-support-in-vectorizer.patch --]
[-- Type: text/plain, Size: 40424 bytes --]

---
 gcc/doc/invoke.texi        |   7 +
 gcc/params.opt             |   4 +
 gcc/tree-vect-loop-manip.c | 268 ++++++++++++++++++++++++++++++++++++-
 gcc/tree-vect-loop.c       | 241 ++++++++++++++++++++++++++++++++-
 gcc/tree-vect-stmts.c      | 152 +++++++++++++++++++++
 gcc/tree-vectorizer.h      |  32 +++++
 6 files changed, 697 insertions(+), 7 deletions(-)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 8b9935dfe65..ac765feab13 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -13079,6 +13079,13 @@ by the copy loop headers pass.
 @item vect-epilogues-nomask
 Enable loop epilogue vectorization using smaller vector size.
 
+@item vect-with-length-scope
+Control the scope of vector memory access with length exploitation.  0 means we
+don't exploit any vector memory access with length, 1 means we only exploit
+vector memory access with length for those loops whose iteration counts are
+less than VF, such as very small loops or epilogues, 2 means we want to exploit
+vector memory access with length for any loops if possible.
+
 @item slp-max-insns-in-bb
 Maximum number of instructions in basic block to be
 considered for SLP vectorization.
diff --git a/gcc/params.opt b/gcc/params.opt
index 4aec480798b..d4309101067 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -964,4 +964,8 @@ Bound on number of runtime checks inserted by the vectorizer's loop versioning f
 Common Joined UInteger Var(param_vect_max_version_for_alignment_checks) Init(6) Param Optimization
 Bound on number of runtime checks inserted by the vectorizer's loop versioning for alignment check.
 
+-param=vect-with-length-scope=
+Common Joined UInteger Var(param_vect_with_length_scope) Init(0) IntegerRange(0, 2) Param Optimization
+Control the vector with length exploitation scope.
+
 ; This comment is to ensure we retain the blank line above.
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 8c5e696b995..3d5dec6f65c 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -747,6 +747,263 @@ vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo,
   return cond_stmt;
 }
 
+/* Helper for vect_set_loop_condition_len.  Like vect_set_loop_masks_directly,
+   generate definitions for all the lengths in RGL and return a length that is
+   nonzero when the loop needs to iterate.  Add any new preheader statements to
+   PREHEADER_SEQ.  Use LOOP_COND_GSI to insert code before the exit gcond.
+
+   RGL belongs to loop LOOP.  The loop originally iterated NITERS
+   times and has been vectorized according to LOOP_VINFO.  Each iteration
+   of the vectorized loop handles VF iterations of the scalar loop.
+
+   IV_LIMIT is the limit which induction variable can reach, that will be used
+   to check whether induction variable can wrap before hit the niters.  */
+
+static tree
+vect_set_loop_lens_directly (class loop *loop, loop_vec_info loop_vinfo,
+			      gimple_seq *preheader_seq,
+			      gimple_stmt_iterator loop_cond_gsi,
+			      rgroup_lens *rgl, tree niters, widest_int iv_limit)
+{
+  scalar_int_mode len_mode = targetm.vectorize.length_mode;
+  unsigned int len_prec = GET_MODE_PRECISION (len_mode);
+  tree len_type = build_nonstandard_integer_type (len_prec, true);
+
+  tree vec_type = rgl->vec_type;
+  unsigned int nbytes_per_iter = rgl->nbytes_per_iter;
+  poly_uint64 vector_size = GET_MODE_SIZE (TYPE_MODE (vec_type));
+  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  tree vec_size = build_int_cst (len_type, vector_size);
+
+  /* See whether zero-based IV would ever generate zero length before
+     wrapping around.  */
+  bool might_wrap_p = (iv_limit == -1);
+  if (!might_wrap_p)
+    {
+      widest_int iv_limit_max = iv_limit * nbytes_per_iter;
+      might_wrap_p = wi::min_precision (iv_limit_max, UNSIGNED) > len_prec;
+    }
+
+  /* Calculate the maximum number of bytes of scalars that the rgroup
+     handles in total, the number that it handles for each iteration
+     of the vector loop.  */
+  tree nbytes_total = niters;
+  tree nbytes_step = build_int_cst (len_type, vf);
+  if (nbytes_per_iter != 1)
+    {
+      tree factor = build_int_cst (len_type, nbytes_per_iter);
+      nbytes_total = gimple_build (preheader_seq, MULT_EXPR, len_type,
+				   nbytes_total, factor);
+      nbytes_step = gimple_build (preheader_seq, MULT_EXPR, len_type,
+				  nbytes_step, factor);
+    }
+
+  /* Create an induction variable that counts the processed bytes of scalars. */
+  tree index_before_incr, index_after_incr;
+  gimple_stmt_iterator incr_gsi;
+  bool insert_after;
+  standard_iv_increment_position (loop, &incr_gsi, &insert_after);
+  create_iv (build_int_cst (len_type, 0), nbytes_step, NULL_TREE, loop,
+	     &incr_gsi, insert_after, &index_before_incr, &index_after_incr);
+
+  tree zero_index = build_int_cst (len_type, 0);
+  tree test_index, test_limit, first_limit;
+  gimple_stmt_iterator *test_gsi;
+
+  /* For the first iteration it doesn't matter whether the IV hits
+     a value above NBYTES_TOTAL.  That only matters for the latch
+     condition.  */
+  first_limit = nbytes_total;
+
+  if (might_wrap_p)
+    {
+      test_index = index_before_incr;
+      tree adjust = gimple_convert (preheader_seq, len_type, nbytes_step);
+      test_limit = gimple_build (preheader_seq, MAX_EXPR, len_type,
+				 nbytes_total, adjust);
+      test_limit = gimple_build (preheader_seq, MINUS_EXPR, len_type,
+				 test_limit, adjust);
+      test_gsi = &incr_gsi;
+    }
+  else
+    {
+      /* Test the incremented IV, which will always hit a value above
+	 the bound before wrapping.  */
+      test_index = index_after_incr;
+      test_limit = nbytes_total;
+      test_gsi = &loop_cond_gsi;
+    }
+
+  /* Provide a definition of each length in the group.  */
+  tree next_len = NULL_TREE;
+  tree len;
+  unsigned int i;
+  FOR_EACH_VEC_ELT_REVERSE (rgl->lens, i, len)
+    {
+      /* Previous lengths will cover BIAS scalars.  This length covers the
+	 next batch.  Each batch's length should be vector_size.  */
+      poly_uint64 bias = vector_size * i;
+      tree bias_tree = build_int_cst (len_type, bias);
+
+      /* See whether the first iteration of the vector loop is known
+	 to have a full vector size.  */
+      poly_uint64 const_limit;
+      bool first_iteration_full
+	= (poly_int_tree_p (first_limit, &const_limit)
+	   && known_ge (const_limit, (i + 1) * vector_size));
+
+      /* Rather than have a new IV that starts at BIAS and goes up to
+	 TEST_LIMIT, prefer to use the same 0-based IV for each length
+	 and adjust the bound down by BIAS.  */
+      tree this_test_limit = test_limit;
+      if (i != 0)
+	{
+	  this_test_limit = gimple_build (preheader_seq, MAX_EXPR, len_type,
+					  this_test_limit, bias_tree);
+	  this_test_limit = gimple_build (preheader_seq, MINUS_EXPR, len_type,
+					  this_test_limit, bias_tree);
+	}
+
+      /* Create the initial length.  First include all scalar bytes that
+	 are within the loop limit.  */
+      tree init_len = NULL_TREE;
+      if (!first_iteration_full)
+	{
+	  tree start, end;
+	  if (first_limit == test_limit)
+	    {
+	      /* Use a natural test between zero (the initial IV value)
+		 and the loop limit.  The "else" block would be valid too,
+		 but this choice can avoid the need to load BIAS_TREE into
+		 a register.  */
+	      start = zero_index;
+	      end = this_test_limit;
+	    }
+	  else
+	    {
+	      /* FIRST_LIMIT is the maximum number of scalar bytes handled by
+		 the first iteration of the vector loop.  Test the portion
+		 associated with this length.  */
+	      start = bias_tree;
+	      end = first_limit;
+	    }
+
+	  init_len = make_temp_ssa_name (len_type, NULL, "max_len");
+	  gimple_seq seq = vect_gen_len (init_len, start, end, vec_size);
+	  gimple_seq_add_seq (preheader_seq, seq);
+	}
+
+      /* First iteration is full.  */
+      if (!init_len)
+	init_len = vec_size;
+
+      /* Get the length value for the next iteration of the loop.  */
+      next_len = make_temp_ssa_name (len_type, NULL, "next_len");
+      tree end = this_test_limit;
+      gimple_seq seq = vect_gen_len (next_len, test_index, end, vec_size);
+      gsi_insert_seq_before (test_gsi, seq, GSI_SAME_STMT);
+
+      /* Use mask routine for length.  */
+      vect_set_loop_mask (loop, len, init_len, next_len);
+    }
+
+  return next_len;
+}
+
+/* Like vect_set_loop_condition_masked, but handle the case of vector access
+   with length.  */
+
+static gcond *
+vect_set_loop_condition_len (class loop *loop, loop_vec_info loop_vinfo,
+				tree niters, tree final_iv,
+				bool niters_maybe_zero,
+				gimple_stmt_iterator loop_cond_gsi)
+{
+  gimple_seq preheader_seq = NULL;
+  gimple_seq header_seq = NULL;
+  tree orig_niters = niters;
+
+  /* Type of the initial value of NITERS.  */
+  tree ni_actual_type = TREE_TYPE (niters);
+  unsigned int ni_actual_prec = TYPE_PRECISION (ni_actual_type);
+
+  /* Obtain target supported length type.  */
+  scalar_int_mode len_mode = targetm.vectorize.length_mode;
+  unsigned int len_prec = GET_MODE_PRECISION (len_mode);
+  tree len_type = build_nonstandard_integer_type (len_prec, true);
+
+  /* Calculate the value that the induction variable must be able to hit in
+     order to ensure that we end the loop with a zero length.  */
+  widest_int iv_limit = -1;
+  unsigned HOST_WIDE_INT max_vf = vect_max_vf (loop_vinfo);
+  if (max_loop_iterations (loop, &iv_limit))
+    {
+      /* Round this value down to the previous vector alignment boundary and
+	 then add an extra full iteration.  */
+      poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+      iv_limit = (iv_limit & -(int) known_alignment (vf)) + max_vf;
+    }
+
+  /* Convert NITERS to the same size as the length.  */
+  if (niters_maybe_zero || (len_prec > ni_actual_prec))
+    {
+      /* We know that there is always at least one iteration, so if the
+	 count is zero then it must have wrapped.  Cope with this by
+	 subtracting 1 before the conversion and adding 1 to the result.  */
+      gcc_assert (TYPE_UNSIGNED (ni_actual_type));
+      niters = gimple_build (&preheader_seq, PLUS_EXPR, ni_actual_type, niters,
+			     build_minus_one_cst (ni_actual_type));
+      niters = gimple_convert (&preheader_seq, len_type, niters);
+      niters = gimple_build (&preheader_seq, PLUS_EXPR, len_type, niters,
+			     build_one_cst (len_type));
+    }
+  else
+    niters = gimple_convert (&preheader_seq, len_type, niters);
+
+  /* Iterate over all the rgroups and fill in their lengths.  We could use
+     the first length from any rgroup for the loop condition; here we
+     arbitrarily pick the last.  */
+  tree test_len = NULL_TREE;
+  rgroup_lens *rgl;
+  unsigned int i;
+  vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
+
+  FOR_EACH_VEC_ELT (*lens, i, rgl)
+    if (!rgl->lens.is_empty ())
+      /* Set up all lens for this group.  */
+      test_len
+	= vect_set_loop_lens_directly (loop, loop_vinfo, &preheader_seq,
+				       loop_cond_gsi, rgl, niters, iv_limit);
+
+  /* Emit all accumulated statements.  */
+  add_preheader_seq (loop, preheader_seq);
+  add_header_seq (loop, header_seq);
+
+  /* Get a boolean result that tells us whether to iterate.  */
+  edge exit_edge = single_exit (loop);
+  tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? EQ_EXPR : NE_EXPR;
+  tree zero_len = build_zero_cst (TREE_TYPE (test_len));
+  gcond *cond_stmt
+    = gimple_build_cond (code, test_len, zero_len, NULL_TREE, NULL_TREE);
+  gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
+
+  /* The loop iterates (NITERS - 1) / VF + 1 times.
+     Subtract one from this to get the latch count.  */
+  tree step = build_int_cst (len_type, LOOP_VINFO_VECT_FACTOR (loop_vinfo));
+  tree niters_minus_one
+    = fold_build2 (PLUS_EXPR, len_type, niters, build_minus_one_cst (len_type));
+  loop->nb_iterations
+    = fold_build2 (TRUNC_DIV_EXPR, len_type, niters_minus_one, step);
+
+  if (final_iv)
+    {
+      gassign *assign = gimple_build_assign (final_iv, orig_niters);
+      gsi_insert_on_edge_immediate (single_exit (loop), assign);
+    }
+
+  return cond_stmt;
+}
+
 /* Like vect_set_loop_condition, but handle the case in which there
    are no loop masks.  */
 
@@ -916,6 +1173,10 @@ vect_set_loop_condition (class loop *loop, loop_vec_info loop_vinfo,
     cond_stmt = vect_set_loop_condition_masked (loop, loop_vinfo, niters,
 						final_iv, niters_maybe_zero,
 						loop_cond_gsi);
+  else if (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    cond_stmt = vect_set_loop_condition_len (loop, loop_vinfo, niters,
+						final_iv, niters_maybe_zero,
+						loop_cond_gsi);
   else
     cond_stmt = vect_set_loop_condition_unmasked (loop, niters, step,
 						  final_iv, niters_maybe_zero,
@@ -1939,7 +2200,8 @@ vect_gen_vector_loop_niters (loop_vec_info loop_vinfo, tree niters,
 
   unsigned HOST_WIDE_INT const_vf;
   if (vf.is_constant (&const_vf)
-      && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+      && !LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
     {
       /* Create: niters >> log2(vf) */
       /* If it's known that niters == number of latch executions + 1 doesn't
@@ -2472,6 +2734,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   poly_uint64 bound_epilog = 0;
   if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+      && !LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
       && LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
     bound_epilog += vf - 1;
   if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
@@ -2567,7 +2830,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   if (vect_epilogues
       && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
       && prolog_peeling >= 0
-      && known_eq (vf, lowest_vf))
+      && known_eq (vf, lowest_vf)
+      && !LOOP_VINFO_FULLY_WITH_LENGTH_P (epilogue_vinfo))
     {
       unsigned HOST_WIDE_INT eiters
 	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 80e33b61be7..d61f46becfd 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -815,6 +815,8 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     vectorizable (false),
     can_fully_mask_p (true),
     fully_masked_p (false),
+    can_with_length_p (param_vect_with_length_scope != 0),
+    fully_with_length_p (false),
     peeling_for_gaps (false),
     peeling_for_niter (false),
     no_data_dependencies (false),
@@ -887,6 +889,18 @@ release_vec_loop_masks (vec_loop_masks *masks)
   masks->release ();
 }
 
+/* Free all levels of LENS.  */
+
+void
+release_vec_loop_lens (vec_loop_lens *lens)
+{
+  rgroup_lens *rgl;
+  unsigned int i;
+  FOR_EACH_VEC_ELT (*lens, i, rgl)
+    rgl->lens.release ();
+  lens->release ();
+}
+
 /* Free all memory used by the _loop_vec_info, as well as all the
    stmt_vec_info structs of all the stmts in the loop.  */
 
@@ -895,6 +909,7 @@ _loop_vec_info::~_loop_vec_info ()
   free (bbs);
 
   release_vec_loop_masks (&masks);
+  release_vec_loop_lens (&lens);
   delete ivexpr_map;
   delete scan_map;
   epilogue_vinfos.release ();
@@ -1056,6 +1071,44 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
   return true;
 }
 
+/* Check whether we can use vector access with length based on precision
+   comparison.  So far, to keep it simple, we only allow the case that the
+   precision of the target supported length is larger than the precision
+   required by loop niters.  */
+
+static bool
+vect_verify_loop_lens (loop_vec_info loop_vinfo)
+{
+  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
+
+  if (LOOP_VINFO_LENS (loop_vinfo).is_empty ())
+    return false;
+
+  /* Get the maximum number of iterations that is representable
+     in the counter type.  */
+  tree ni_type = TREE_TYPE (LOOP_VINFO_NITERSM1 (loop_vinfo));
+  widest_int max_ni = wi::to_widest (TYPE_MAX_VALUE (ni_type)) + 1;
+
+  /* Get a more refined estimate for the number of iterations.  */
+  widest_int max_back_edges;
+  if (max_loop_iterations (loop, &max_back_edges))
+    max_ni = wi::smin (max_ni, max_back_edges + 1);
+
+  /* Account for rgroup lengths, scaled by the bytes per iteration.  */
+  rgroup_lens *rgl = &(*lens)[lens->length () - 1];
+  max_ni *= rgl->nbytes_per_iter;
+
+  /* Work out how many bits we need to represent the limit.  */
+  unsigned int min_ni_width = wi::min_precision (max_ni, UNSIGNED);
+
+  unsigned len_bits = GET_MODE_PRECISION (targetm.vectorize.length_mode);
+  if (len_bits < min_ni_width)
+    return false;
+
+  return true;
+}
+
 /* Calculate the cost of one scalar iteration of the loop.  */
 static void
 vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo)
@@ -1630,7 +1683,8 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
 
   /* Only fully-masked loops can have iteration counts less than the
      vectorization factor.  */
-  if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+      && !LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
     {
       if (known_niters_smaller_than_vf (loop_vinfo))
 	{
@@ -1858,7 +1912,8 @@ determine_peel_for_niter (loop_vec_info loop_vinfo)
     th = LOOP_VINFO_COST_MODEL_THRESHOLD (LOOP_VINFO_ORIG_LOOP_INFO
 					  (loop_vinfo));
 
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+      || LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
     /* The main loop handles all iterations.  */
     LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
   else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
@@ -2048,6 +2103,7 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts)
     }
 
   bool saved_can_fully_mask_p = LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo);
+  bool saved_can_with_length_p = LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo);
 
   /* We don't expect to have to roll back to anything other than an empty
      set of rgroups.  */
@@ -2144,6 +2200,71 @@ start_over:
 			 "not using a fully-masked loop.\n");
     }
 
+  /* Decide whether we can use vector access with length.  */
+
+  if ((LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+       || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
+      && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use vector access with length because peeling"
+			 " for alignment or gaps is required.\n");
+      LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false;
+    }
+
+  if (LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo)
+      && !vect_verify_loop_lens (loop_vinfo))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use vector access with length because the"
+			 " length precision verification fails.\n");
+      LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false;
+    }
+
+  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use vector access with length because the"
+			 " loop will be fully-masked.\n");
+      LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false;
+    }
+
+  if (LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo))
+    {
+      /* One special case: if the loop's max niters is less than VF, we can
+	 simply vectorize its body with length.  */
+      if (param_vect_with_length_scope == 1)
+	{
+	  /* This is the epilogue, should be less than VF.  */
+	  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+	    LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = true;
+	  /* Otherwise, ensure the loop iteration less than VF.  */
+	  else if (known_niters_smaller_than_vf (loop_vinfo))
+	    LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = true;
+	}
+      else
+	{
+	  gcc_assert (param_vect_with_length_scope == 2);
+	  LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = true;
+	}
+    }
+  else
+    /* Always set it as false in case previous tries set it.  */
+    LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = false;
+
+  if (dump_enabled_p ())
+    {
+      if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+	dump_printf_loc (MSG_NOTE, vect_location, "using vector access with"
+						  " length for loop fully.\n");
+      else
+	dump_printf_loc (MSG_NOTE, vect_location, "not using vector access with"
+						  " length for loop fully.\n");
+    }
+
   /* If epilog loop is required because of data accesses with gaps,
      one additional iteration needs to be peeled.  Check if there is
      enough iterations for vectorization.  */
@@ -2164,6 +2285,7 @@ start_over:
      loop or a loop that has a lower VF than the main loop.  */
   if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
       && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+      && !LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
       && maybe_ge (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
 		   LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo)))
     return opt_result::failure_at (vect_location,
@@ -2362,12 +2484,14 @@ again:
     = init_cost (LOOP_VINFO_LOOP (loop_vinfo));
   /* Reset accumulated rgroup information.  */
   release_vec_loop_masks (&LOOP_VINFO_MASKS (loop_vinfo));
+  release_vec_loop_lens (&LOOP_VINFO_LENS (loop_vinfo));
   /* Reset assorted flags.  */
   LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
   LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) = false;
   LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = 0;
   LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo) = 0;
   LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = saved_can_fully_mask_p;
+  LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = saved_can_with_length_p;
 
   goto start_over;
 }
@@ -2646,8 +2770,10 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 	      if (ordered_p (lowest_th, th))
 		lowest_th = ordered_min (lowest_th, th);
 	    }
-	  else
-	    delete loop_vinfo;
+	  else {
+	      delete loop_vinfo;
+	      loop_vinfo = opt_loop_vec_info::success (NULL);
+	  }
 
 	  /* Only vectorize epilogues if PARAM_VECT_EPILOGUES_NOMASK is
 	     enabled, SIMDUID is not set, it is the innermost loop and we have
@@ -2672,6 +2798,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
       else
 	{
 	  delete loop_vinfo;
+	  loop_vinfo = opt_loop_vec_info::success (NULL);
 	  if (fatal)
 	    {
 	      gcc_checking_assert (first_loop_vinfo == NULL);
@@ -2679,6 +2806,21 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 	    }
 	}
 
+      /* If the original loop can use vector access with length but we still
+	 get true vect_epilogue here, it would try vector access with length
+	 on epilogue and with the same mode.  */
+      if (vect_epilogues && loop_vinfo
+	  && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo))
+	{
+	  gcc_assert (!LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "***** Re-trying analysis with same vector"
+			     " mode %s for epilogue with length.\n",
+			     GET_MODE_NAME (loop_vinfo->vector_mode));
+	  continue;
+	}
+
       if (mode_i < vector_modes.length ()
 	  && VECTOR_MODE_P (autodetected_vector_mode)
 	  && (related_vector_mode (vector_modes[mode_i],
@@ -3519,6 +3661,11 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 			    target_cost_data, num_masks - 1, vector_stmt,
 			    NULL, NULL_TREE, 0, vect_body);
     }
+  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      peel_iters_prologue = 0;
+      peel_iters_epilogue = 0;
+    }
   else if (npeel < 0)
     {
       peel_iters_prologue = assumed_vf / 2;
@@ -3809,6 +3956,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 		 min_profitable_iters);
 
   if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+      && !LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
       && min_profitable_iters < (assumed_vf + peel_iters_prologue))
     /* We want the vectorized loop to execute at least once.  */
     min_profitable_iters = assumed_vf + peel_iters_prologue;
@@ -6761,6 +6909,16 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
     dump_printf_loc (MSG_NOTE, vect_location,
 		     "using an in-order (fold-left) reduction.\n");
   STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
+
+  if (loop_vinfo && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use vector access with length due to"
+			 " reduction operation.\n");
+      LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false;
+    }
+
   /* All but single defuse-cycle optimized, lane-reducing and fold-left
      reductions go through their own vectorizable_* routines.  */
   if (!single_defuse_cycle
@@ -8041,6 +8199,16 @@ vectorizable_live_operation (loop_vec_info loop_vinfo,
 				     1, vectype, NULL);
 	    }
 	}
+
+      if (loop_vinfo && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo))
+	{
+	  LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false;
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "can't use vector access with length due to"
+			     " live operation.\n");
+	}
+
       return true;
     }
 
@@ -8354,6 +8522,66 @@ vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
   return mask;
 }
 
+/* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS
+   lengths for vector access with length that each control a vector of type
+   VECTYPE.  */
+
+void
+vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
+		       unsigned int nvectors, tree vectype)
+{
+  gcc_assert (nvectors != 0);
+  if (lens->length () < nvectors)
+    lens->safe_grow_cleared (nvectors);
+  rgroup_lens *rgl = &(*lens)[nvectors - 1];
+
+  /* The number of scalars per iteration, the total bytes for them and the
+     number of vectors are all compile-time constants.  */
+  poly_uint64 vector_size = GET_MODE_SIZE (TYPE_MODE (vectype));
+  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  unsigned int nbytes_per_iter
+    = exact_div (nvectors * vector_size, vf).to_constant ();
+
+  /* The rgroup associated with the same nvectors should have the same bytes
+     per iteration.  */
+  if (!rgl->vec_type)
+    {
+      rgl->vec_type = vectype;
+      rgl->nbytes_per_iter = nbytes_per_iter;
+    }
+  else
+    gcc_assert (rgl->nbytes_per_iter == nbytes_per_iter);
+}
+
+/* Given a complete set of length LENS, extract length number INDEX for an
+   rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.  */
+
+tree
+vect_get_loop_len (vec_loop_lens *lens, unsigned int nvectors, unsigned int index)
+{
+  rgroup_lens *rgl = &(*lens)[nvectors - 1];
+
+  /* Populate the rgroup's len array, if this is the first time we've
+     used it.  */
+  if (rgl->lens.is_empty ())
+    {
+      rgl->lens.safe_grow_cleared (nvectors);
+      for (unsigned int i = 0; i < nvectors; ++i)
+	{
+	  scalar_int_mode len_mode = targetm.vectorize.length_mode;
+	  unsigned int len_prec = GET_MODE_PRECISION (len_mode);
+	  tree len_type = build_nonstandard_integer_type (len_prec, true);
+	  tree len = make_temp_ssa_name (len_type, NULL, "loop_len");
+
+	  /* Provide a dummy definition until the real one is available.  */
+	  SSA_NAME_DEF_STMT (len) = gimple_build_nop ();
+	  rgl->lens[i] = len;
+	}
+    }
+
+  return rgl->lens[index];
+}
+
 /* Scale profiling counters by estimation for LOOP which is vectorized
    by factor VF.  */
 
@@ -8714,6 +8942,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
     {
       if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
 	  && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+	  && !LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
 	  && known_eq (lowest_vf, vf))
 	{
 	  niters_vector
@@ -8881,7 +9110,9 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
 
   /* True if the final iteration might not handle a full vector's
      worth of scalar iterations.  */
-  bool final_iter_may_be_partial = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
+  bool final_iter_may_be_partial
+    = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+      || LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo);
   /* The minimum number of iterations performed by the epilogue.  This
      is 1 when peeling for gaps because we always need a final scalar
      iteration.  */
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index e7822c44951..d6be39e1831 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -1879,6 +1879,66 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
     gcc_unreachable ();
 }
 
+/* Check whether a load or store statement in the loop described by
+   LOOP_VINFO can use vector access with length.  This is testing whether
+   the vectorizer pass has the appropriate support, as well as whether
+   the target does.
+
+   VLS_TYPE says whether the statement is a load or store and VECTYPE
+   is the type of the vector being loaded or stored.  MEMORY_ACCESS_TYPE
+   says how the load or store is going to be implemented and GROUP_SIZE
+   is the number of load or store statements in the containing group.
+
+   Clear LOOP_VINFO_CAN_WITH_LENGTH_P if a length-based access isn't
+   possible, otherwise record the required lengths.  */
+
+static void
+check_load_store_with_len (loop_vec_info loop_vinfo, tree vectype,
+		      vec_load_store_type vls_type, int group_size,
+		      vect_memory_access_type memory_access_type)
+{
+  /* Invariant loads need no special support.  */
+  if (memory_access_type == VMAT_INVARIANT)
+    return;
+
+  if (memory_access_type != VMAT_CONTIGUOUS
+      && memory_access_type != VMAT_CONTIGUOUS_PERMUTE)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use vector access with length"
+			 " because an access isn't contiguous.\n");
+      LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false;
+      return;
+    }
+
+  machine_mode vecmode = TYPE_MODE (vectype);
+  bool is_load = (vls_type == VLS_LOAD);
+  optab op = is_load ? lenload_optab : lenstore_optab;
+
+  if (!VECTOR_MODE_P (vecmode)
+      || !convert_optab_handler (op, vecmode, targetm.vectorize.length_mode))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use vector access with length because"
+			 " the target doesn't have the appropriate"
+			 " load or store with length.\n");
+      LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false;
+      return;
+    }
+
+  vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
+  poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
+  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  unsigned int nvectors;
+
+  if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
+    vect_record_loop_len (loop_vinfo, lens, nvectors, vectype);
+  else
+    gcc_unreachable ();
+}
+
 /* Return the mask input to a masked load or store.  VEC_MASK is the vectorized
    form of the scalar mask condition and LOOP_MASK, if nonnull, is the mask
    that needs to be applied to all loads and stores in a vectorized loop.
@@ -7532,6 +7592,10 @@ vectorizable_store (vec_info *vinfo,
 	check_load_store_masking (loop_vinfo, vectype, vls_type, group_size,
 				  memory_access_type, &gs_info, mask);
 
+      if (loop_vinfo && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo))
+	check_load_store_with_len (loop_vinfo, vectype, vls_type, group_size,
+				      memory_access_type);
+
       if (slp_node
 	  && !vect_maybe_update_slp_op_vectype (SLP_TREE_CHILDREN (slp_node)[0],
 						vectype))
@@ -8068,6 +8132,15 @@ vectorizable_store (vec_info *vinfo,
     = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
        ? &LOOP_VINFO_MASKS (loop_vinfo)
        : NULL);
+
+  vec_loop_lens *loop_lens
+    = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+	 ? &LOOP_VINFO_LENS (loop_vinfo)
+	 : NULL);
+
+  /* Shouldn't use length-based accesses if the loop is fully masked.  */
+  gcc_assert (!loop_lens || (loop_lens && !loop_masks));
+
   /* Targets with store-lane instructions must not require explicit
      realignment.  vect_supportable_dr_alignment always returns either
      dr_aligned or dr_unaligned_supported for masked operations.  */
@@ -8320,10 +8393,15 @@ vectorizable_store (vec_info *vinfo,
 	      unsigned HOST_WIDE_INT align;
 
 	      tree final_mask = NULL_TREE;
+	      tree final_len = NULL_TREE;
 	      if (loop_masks)
 		final_mask = vect_get_loop_mask (gsi, loop_masks,
 						 vec_num * ncopies,
 						 vectype, vec_num * j + i);
+	      else if (loop_lens)
+		final_len = vect_get_loop_len (loop_lens, vec_num * ncopies,
+					       vec_num * j + i);
+
 	      if (vec_mask)
 		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
 						      vec_mask, gsi);
@@ -8403,6 +8481,17 @@ vectorizable_store (vec_info *vinfo,
 		  new_stmt_info
 		    = vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
 		}
+	      else if (final_len)
+		{
+		  align = least_bit_hwi (misalign | align);
+		  tree ptr = build_int_cst (ref_type, align);
+		  gcall *call
+		    = gimple_build_call_internal (IFN_LEN_STORE, 4, dataref_ptr,
+						  ptr, final_len, vec_oprnd);
+		  gimple_call_set_nothrow (call, true);
+		  new_stmt_info
+		    = vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
+		}
 	      else
 		{
 		  data_ref = fold_build2 (MEM_REF, vectype,
@@ -8839,6 +8928,10 @@ vectorizable_load (vec_info *vinfo,
 	check_load_store_masking (loop_vinfo, vectype, VLS_LOAD, group_size,
 				  memory_access_type, &gs_info, mask);
 
+      if (loop_vinfo && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo))
+	check_load_store_with_len (loop_vinfo, vectype, VLS_LOAD, group_size,
+				      memory_access_type);
+
       STMT_VINFO_TYPE (stmt_info) = load_vec_info_type;
       vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type,
 			    slp_node, cost_vec);
@@ -8937,6 +9030,7 @@ vectorizable_load (vec_info *vinfo,
 
       gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
       gcc_assert (!nested_in_vect_loop);
+      gcc_assert (!LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
 
       if (grouped_load)
 	{
@@ -9234,6 +9328,15 @@ vectorizable_load (vec_info *vinfo,
     = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
        ? &LOOP_VINFO_MASKS (loop_vinfo)
        : NULL);
+
+  vec_loop_lens *loop_lens
+    = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+	 ? &LOOP_VINFO_LENS (loop_vinfo)
+	 : NULL);
+
+  /* Shouldn't use length-based accesses if the loop is fully masked.  */
+  gcc_assert (!loop_lens || (loop_lens && !loop_masks));
+
   /* Targets with store-lane instructions must not require explicit
      realignment.  vect_supportable_dr_alignment always returns either
      dr_aligned or dr_unaligned_supported for masked operations.  */
@@ -9555,15 +9658,20 @@ vectorizable_load (vec_info *vinfo,
 	  for (i = 0; i < vec_num; i++)
 	    {
 	      tree final_mask = NULL_TREE;
+	      tree final_len = NULL_TREE;
 	      if (loop_masks
 		  && memory_access_type != VMAT_INVARIANT)
 		final_mask = vect_get_loop_mask (gsi, loop_masks,
 						 vec_num * ncopies,
 						 vectype, vec_num * j + i);
+	      else if (loop_lens && memory_access_type != VMAT_INVARIANT)
+		final_len = vect_get_loop_len (loop_lens, vec_num * ncopies,
+					       vec_num * j + i);
 	      if (vec_mask)
 		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
 						      vec_mask, gsi);
 
+
 	      if (i > 0)
 		dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
 					       gsi, stmt_info, bump);
@@ -9629,6 +9737,18 @@ vectorizable_load (vec_info *vinfo,
 			new_stmt = call;
 			data_ref = NULL_TREE;
 		      }
+		    else if (final_len)
+		      {
+			align = least_bit_hwi (misalign | align);
+			tree ptr = build_int_cst (ref_type, align);
+			gcall *call
+			  = gimple_build_call_internal (IFN_LEN_LOAD, 3,
+							dataref_ptr, ptr,
+							final_len);
+			gimple_call_set_nothrow (call, true);
+			new_stmt = call;
+			data_ref = NULL_TREE;
+		      }
 		    else
 		      {
 			tree ltype = vectype;
@@ -12480,3 +12600,35 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
   *nunits_vectype_out = nunits_vectype;
   return opt_result::success ();
 }
+
+/* Generate and return a statement sequence that sets vector length LEN to:
+
+   min_of_start_and_end = min (START_INDEX, END_INDEX);
+   left_bytes = END_INDEX - min_of_start_and_end;
+   rhs = min (left_bytes, VECTOR_SIZE);
+   LEN = rhs;
+
+   TODO: for now the rs6000 vector access with length only cares about 8 bits
+   of the length, so if left_bytes is larger than 255 it can't simply be
+   saturated to the vector size.  A target hook can be provided if other
+   ports don't have this restriction.  */
+
+gimple_seq
+vect_gen_len (tree len, tree start_index, tree end_index, tree vector_size)
+{
+  gimple_seq stmts = NULL;
+  tree len_type = TREE_TYPE (len);
+  gcc_assert (TREE_TYPE (start_index) == len_type);
+
+  tree min = fold_build2 (MIN_EXPR, len_type, start_index, end_index);
+  tree left_bytes = fold_build2 (MINUS_EXPR, len_type, end_index, min);
+  left_bytes = fold_build2 (MIN_EXPR, len_type, left_bytes, vector_size);
+
+  tree rhs = force_gimple_operand (left_bytes, &stmts, true, NULL_TREE);
+  gimple *new_stmt = gimple_build_assign (len, rhs);
+  gimple_stmt_iterator i = gsi_last (stmts);
+  gsi_insert_after_without_update (&i, new_stmt, GSI_CONTINUE_LINKING);
+
+  return stmts;
+}
+
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 2eb3ab5d280..774d5025639 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -476,6 +476,21 @@ struct rgroup_masks {
 
 typedef auto_vec<rgroup_masks> vec_loop_masks;
 
+/* Similar to masks above, the lengths needed by rgroups with nV vectors.  */
+struct rgroup_lens
+{
+  /* The total number of bytes for the nS scalars in each iteration.  */
+  unsigned int nbytes_per_iter;
+
+  /* One of the vector types that uses these lengths.  */
+  tree vec_type;
+
+  /* A vector of nV lengths, in iteration order.  */
+  vec<tree> lens;
+};
+
+typedef auto_vec<rgroup_lens> vec_loop_lens;
+
 typedef auto_vec<std::pair<data_reference*, tree> > drs_init_vec;
 
 /*-----------------------------------------------------------------*/
@@ -523,6 +538,10 @@ public:
      on inactive scalars.  */
   vec_loop_masks masks;
 
+  /* The lengths that a length-based loop should use to avoid operating
+     on inactive scalars.  */
+  vec_loop_lens lens;
+
   /* Set of scalar conditions that have loop mask applied.  */
   scalar_cond_masked_set_type scalar_cond_masked_set;
 
@@ -626,6 +645,12 @@ public:
   /* True if have decided to use a fully-masked loop.  */
   bool fully_masked_p;
 
+  /* Records whether we still have the option of using a length-based loop.  */
+  bool can_with_length_p;
+
+  /* True if we have decided to use a fully length-based loop.  */
+  bool fully_with_length_p;
+
   /* When we have grouped data accesses with gaps, we may introduce invalid
      memory accesses.  We peel the last iteration of the loop to prevent
      this.  */
@@ -689,6 +714,9 @@ public:
 #define LOOP_VINFO_VECTORIZABLE_P(L)       (L)->vectorizable
 #define LOOP_VINFO_CAN_FULLY_MASK_P(L)     (L)->can_fully_mask_p
 #define LOOP_VINFO_FULLY_MASKED_P(L)       (L)->fully_masked_p
+#define LOOP_VINFO_CAN_WITH_LENGTH_P(L)    (L)->can_with_length_p
+#define LOOP_VINFO_FULLY_WITH_LENGTH_P(L)  (L)->fully_with_length_p
+#define LOOP_VINFO_LENS(L)                 (L)->lens
 #define LOOP_VINFO_VECT_FACTOR(L)          (L)->vectorization_factor
 #define LOOP_VINFO_MAX_VECT_FACTOR(L)      (L)->max_vectorization_factor
 #define LOOP_VINFO_MASKS(L)                (L)->masks
@@ -1842,6 +1870,10 @@ extern void vect_record_loop_mask (loop_vec_info, vec_loop_masks *,
 				   unsigned int, tree, tree);
 extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
 				unsigned int, tree, unsigned int);
+extern void vect_record_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
+				  tree);
+extern tree vect_get_loop_len (vec_loop_lens *, unsigned int, unsigned int);
+extern gimple_seq vect_gen_len (tree, tree, tree, tree);
 extern stmt_vec_info info_for_reduction (vec_info *, stmt_vec_info);
 
 /* Drive for loop transformation stage.  */
-- 
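[Editorial note: as a quick illustration of the sequence vect_gen_len emits
above, here is a minimal scalar sketch of the same computation.  It is an
assumption-laden example rather than part of the patch: START_INDEX and
END_INDEX are taken to be byte offsets, VECTOR_SIZE is the vector size in
bytes (16 for VSX), and the helper name is made up.]

  /* Hypothetical scalar model of the gimple built by vect_gen_len.  */
  static unsigned int
  scalar_len_model (unsigned int start_index, unsigned int end_index,
                    unsigned int vector_size)
  {
    /* min () guards against start_index running past end_index.  */
    unsigned int m = start_index < end_index ? start_index : end_index;
    unsigned int left_bytes = end_index - m;  /* 0 once start >= end.  */
    return left_bytes < vector_size ? left_bytes : vector_size;
  }

For example, end_index = 127, start_index = 112 and vector_size = 16 give a
length of 15, so the final lxvl/stxvl only touches the 15 remaining bytes;
once start_index reaches end_index the length becomes 0.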

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 6/7] ivopts: Add handlings for vector with length IFNs
  2020-05-26  5:49 [PATCH 0/7] Support vector load/store with length Kewen.Lin
                   ` (4 preceding siblings ...)
  2020-05-26  5:57 ` [PATCH 5/7] vect: Support vector load/store with length in vectorizer Kewen.Lin
@ 2020-05-26  5:58 ` Kewen.Lin
  2020-07-22 12:51   ` Richard Sandiford
  2020-05-26  5:59 ` [PATCH 7/7] rs6000/testsuite: Vector with length test cases Kewen.Lin
  2020-05-26  7:12 ` [PATCH 0/7] Support vector load/store with length Richard Biener
  7 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-05-26  5:58 UTC (permalink / raw)
  To: GCC Patches
  Cc: Richard Guenther, Bill Schmidt, dje.gcc, Segher Boessenkool,
	bin.cheng, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 208 bytes --]

gcc/ChangeLog

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* tree-ssa-loop-ivopts.c (get_mem_type_for_internal_fn): Handle
	IFN_LEN_LOAD and IFN_LEN_STORE.
	(get_alias_ptr_type_for_ptr_address): Likewise.



[-- Attachment #2: 0006-ivopts-for-vector-with-length.patch --]
[-- Type: text/plain, Size: 1138 bytes --]

---
 gcc/tree-ssa-loop-ivopts.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 1d2697ae1ba..45b31640e75 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -2436,12 +2436,14 @@ get_mem_type_for_internal_fn (gcall *call, tree *op_p)
     {
     case IFN_MASK_LOAD:
     case IFN_MASK_LOAD_LANES:
+    case IFN_LEN_LOAD:
       if (op_p == gimple_call_arg_ptr (call, 0))
 	return TREE_TYPE (gimple_call_lhs (call));
       return NULL_TREE;
 
     case IFN_MASK_STORE:
     case IFN_MASK_STORE_LANES:
+    case IFN_LEN_STORE:
       if (op_p == gimple_call_arg_ptr (call, 0))
 	return TREE_TYPE (gimple_call_arg (call, 3));
       return NULL_TREE;
@@ -7415,6 +7417,8 @@ get_alias_ptr_type_for_ptr_address (iv_use *use)
     case IFN_MASK_STORE:
     case IFN_MASK_LOAD_LANES:
     case IFN_MASK_STORE_LANES:
+    case IFN_LEN_LOAD:
+    case IFN_LEN_STORE:
       /* The second argument contains the correct alias type.  */
       gcc_assert (use->op_p = gimple_call_arg_ptr (call, 0));
       return TREE_TYPE (gimple_call_arg (call, 1));
-- 
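[Editorial note: for context, the calls these two switches now recognize look
roughly like this in gimple; the shapes follow how patch 5/7 builds them, but
the SSA names here are made up for illustration.]

  vect_1 = .LEN_LOAD (addr_2, align_ptr_3, loop_len_4);
  .LEN_STORE (addr_5, align_ptr_3, loop_len_4, vect_6);

Argument 0 is the address that ivopts may rewrite, argument 1 carries the
alias pointer type (hence get_alias_ptr_type_for_ptr_address returns the
TREE_TYPE of argument 1), and for LEN_STORE the stored vector is argument 3,
which is why get_mem_type_for_internal_fn looks at gimple_call_arg (call, 3).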

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 7/7] rs6000/testsuite: Vector with length test cases
  2020-05-26  5:49 [PATCH 0/7] Support vector load/store with length Kewen.Lin
                   ` (5 preceding siblings ...)
  2020-05-26  5:58 ` [PATCH 6/7] ivopts: Add handlings for vector with length IFNs Kewen.Lin
@ 2020-05-26  5:59 ` Kewen.Lin
  2020-07-10 10:07   ` [PATCH 7/7 v2] " Kewen.Lin
  2020-05-26  7:12 ` [PATCH 0/7] Support vector load/store with length Richard Biener
  7 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-05-26  5:59 UTC (permalink / raw)
  To: GCC Patches; +Cc: Bill Schmidt, dje.gcc, Segher Boessenkool, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 2148 bytes --]

gcc/testsuite/ChangeLog

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* gcc.target/powerpc/p9-vec-length-1.h: New test.
	* gcc.target/powerpc/p9-vec-length-2.h: New test.
	* gcc.target/powerpc/p9-vec-length-3.h: New test.
	* gcc.target/powerpc/p9-vec-length-4.h: New test.
	* gcc.target/powerpc/p9-vec-length-5.h: New test.
	* gcc.target/powerpc/p9-vec-length-6.h: New test.
	* gcc.target/powerpc/p9-vec-length-epil-1.c: New test.
	* gcc.target/powerpc/p9-vec-length-epil-2.c: New test.
	* gcc.target/powerpc/p9-vec-length-epil-3.c: New test.
	* gcc.target/powerpc/p9-vec-length-epil-4.c: New test.
	* gcc.target/powerpc/p9-vec-length-epil-5.c: New test.
	* gcc.target/powerpc/p9-vec-length-epil-6.c: New test.
	* gcc.target/powerpc/p9-vec-length-epil-run-1.c: New test.
	* gcc.target/powerpc/p9-vec-length-epil-run-2.c: New test.
	* gcc.target/powerpc/p9-vec-length-epil-run-3.c: New test.
	* gcc.target/powerpc/p9-vec-length-epil-run-4.c: New test.
	* gcc.target/powerpc/p9-vec-length-epil-run-5.c: New test.
	* gcc.target/powerpc/p9-vec-length-epil-run-6.c: New test.
	* gcc.target/powerpc/p9-vec-length-full-1.c: New test.
	* gcc.target/powerpc/p9-vec-length-full-2.c: New test.
	* gcc.target/powerpc/p9-vec-length-full-3.c: New test.
	* gcc.target/powerpc/p9-vec-length-full-4.c: New test.
	* gcc.target/powerpc/p9-vec-length-full-5.c: New test.
	* gcc.target/powerpc/p9-vec-length-full-6.c: New test.
	* gcc.target/powerpc/p9-vec-length-full-run-1.c: New test.
	* gcc.target/powerpc/p9-vec-length-full-run-2.c: New test.
	* gcc.target/powerpc/p9-vec-length-full-run-3.c: New test.
	* gcc.target/powerpc/p9-vec-length-full-run-4.c: New test.
	* gcc.target/powerpc/p9-vec-length-full-run-5.c: New test.
	* gcc.target/powerpc/p9-vec-length-full-run-6.c: New test.
	* gcc.target/powerpc/p9-vec-length-run-1.h: New test.
	* gcc.target/powerpc/p9-vec-length-run-2.h: New test.
	* gcc.target/powerpc/p9-vec-length-run-3.h: New test.
	* gcc.target/powerpc/p9-vec-length-run-4.h: New test.
	* gcc.target/powerpc/p9-vec-length-run-5.h: New test.
	* gcc.target/powerpc/p9-vec-length-run-6.h: New test.
	* gcc.target/powerpc/p9-vec-length.h: New test.



[-- Attachment #2: 0007-test-cases.patch --]
[-- Type: text/plain, Size: 52144 bytes --]

---
 .../gcc.target/powerpc/p9-vec-length-1.h      | 18 ++++++
 .../gcc.target/powerpc/p9-vec-length-2.h      | 17 +++++
 .../gcc.target/powerpc/p9-vec-length-3.h      | 31 ++++++++++
 .../gcc.target/powerpc/p9-vec-length-4.h      | 24 +++++++
 .../gcc.target/powerpc/p9-vec-length-5.h      | 29 +++++++++
 .../gcc.target/powerpc/p9-vec-length-6.h      | 32 ++++++++++
 .../gcc.target/powerpc/p9-vec-length-epil-1.c | 15 +++++
 .../gcc.target/powerpc/p9-vec-length-epil-2.c | 15 +++++
 .../gcc.target/powerpc/p9-vec-length-epil-3.c | 18 ++++++
 .../gcc.target/powerpc/p9-vec-length-epil-4.c | 15 +++++
 .../gcc.target/powerpc/p9-vec-length-epil-5.c | 15 +++++
 .../gcc.target/powerpc/p9-vec-length-epil-6.c | 16 +++++
 .../powerpc/p9-vec-length-epil-run-1.c        | 10 +++
 .../powerpc/p9-vec-length-epil-run-2.c        | 10 +++
 .../powerpc/p9-vec-length-epil-run-3.c        | 10 +++
 .../powerpc/p9-vec-length-epil-run-4.c        | 10 +++
 .../powerpc/p9-vec-length-epil-run-5.c        | 10 +++
 .../powerpc/p9-vec-length-epil-run-6.c        | 10 +++
 .../gcc.target/powerpc/p9-vec-length-full-1.c | 16 +++++
 .../gcc.target/powerpc/p9-vec-length-full-2.c | 16 +++++
 .../gcc.target/powerpc/p9-vec-length-full-3.c | 17 +++++
 .../gcc.target/powerpc/p9-vec-length-full-4.c | 16 +++++
 .../gcc.target/powerpc/p9-vec-length-full-5.c | 16 +++++
 .../gcc.target/powerpc/p9-vec-length-full-6.c | 16 +++++
 .../powerpc/p9-vec-length-full-run-1.c        | 10 +++
 .../powerpc/p9-vec-length-full-run-2.c        | 10 +++
 .../powerpc/p9-vec-length-full-run-3.c        | 10 +++
 .../powerpc/p9-vec-length-full-run-4.c        | 10 +++
 .../powerpc/p9-vec-length-full-run-5.c        | 10 +++
 .../powerpc/p9-vec-length-full-run-6.c        | 10 +++
 .../gcc.target/powerpc/p9-vec-length-run-1.h  | 34 ++++++++++
 .../gcc.target/powerpc/p9-vec-length-run-2.h  | 36 +++++++++++
 .../gcc.target/powerpc/p9-vec-length-run-3.h  | 34 ++++++++++
 .../gcc.target/powerpc/p9-vec-length-run-4.h  | 62 +++++++++++++++++++
 .../gcc.target/powerpc/p9-vec-length-run-5.h  | 45 ++++++++++++++
 .../gcc.target/powerpc/p9-vec-length-run-6.h  | 52 ++++++++++++++++
 .../gcc.target/powerpc/p9-vec-length.h        | 14 +++++
 37 files changed, 739 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-1.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-2.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-3.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-4.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-5.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-6.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-2.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-3.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-4.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-5.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-6.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-2.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-3.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-4.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-5.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-6.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-2.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-3.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-4.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-5.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-6.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-2.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-3.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-4.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-5.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-6.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-1.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-2.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-3.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-4.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-5.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-6.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-vec-length.h

diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-1.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-1.h
new file mode 100644
index 00000000000..50da5817013
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-1.h
@@ -0,0 +1,18 @@
+#include "p9-vec-length.h"
+
+/* Test the case where the loop iteration count is known.  */
+
+#define N 127
+
+#define test(TYPE)                                                             \
+  extern TYPE a_##TYPE[N];                                                     \
+  extern TYPE b_##TYPE[N];                                                     \
+  extern TYPE c_##TYPE[N];                                                     \
+  void __attribute__ ((noinline, noclone)) test##TYPE ()                       \
+  {                                                                            \
+    unsigned int i = 0;                                                        \
+    for (i = 0; i < N; i++)                                                    \
+      c_##TYPE[i] = a_##TYPE[i] + b_##TYPE[i];                                 \
+  }
+
+TEST_ALL (test)
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-2.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-2.h
new file mode 100644
index 00000000000..b275dba0fde
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-2.h
@@ -0,0 +1,17 @@
+#include "p9-vec-length.h"
+
+/* Test the case where the loop iteration count is unknown.  */
+#define N 255
+
+#define test(TYPE)                                                             \
+  extern TYPE a_##TYPE[N];                                                     \
+  extern TYPE b_##TYPE[N];                                                     \
+  extern TYPE c_##TYPE[N];                                                     \
+  void __attribute__ ((noinline, noclone)) test##TYPE (unsigned int n)         \
+  {                                                                            \
+    unsigned int i = 0;                                                        \
+    for (i = 0; i < n; i++)                                                    \
+      c_##TYPE[i] = a_##TYPE[i] + b_##TYPE[i];                                 \
+  }
+
+TEST_ALL (test)
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-3.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-3.h
new file mode 100644
index 00000000000..c79b9b30910
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-3.h
@@ -0,0 +1,31 @@
+#include "p9-vec-length.h"
+
+/* Test the case where the loop iteration count is less than VF.  */
+
+/* For char.  */
+#define N_uint8_t 15
+#define N_int8_t 15
+/* For short.  */
+#define N_uint16_t 6
+#define N_int16_t 6
+/* For int/float.  */
+#define N_uint32_t 3
+#define N_int32_t 3
+#define N_float 3
+/* For long/double.  */
+#define N_uint64_t 1
+#define N_int64_t 1
+#define N_double 1
+
+#define test(TYPE)                                                             \
+  extern TYPE a_##TYPE[N_##TYPE];                                              \
+  extern TYPE b_##TYPE[N_##TYPE];                                              \
+  extern TYPE c_##TYPE[N_##TYPE];                                              \
+  void __attribute__ ((noinline, noclone)) test##TYPE ()                       \
+  {                                                                            \
+    unsigned int i = 0;                                                        \
+    for (i = 0; i < N_##TYPE; i++)                                             \
+      c_##TYPE[i] = a_##TYPE[i] + b_##TYPE[i];                                 \
+  }
+
+TEST_ALL (test)
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-4.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-4.h
new file mode 100644
index 00000000000..0ee7fc84502
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-4.h
@@ -0,0 +1,24 @@
+#include "p9-vec-length.h"
+
+/* Test the case where the loop has multiple vectors (concatenated vectors)
+   of the same vector type.  */
+
+#define test(TYPE)                                                             \
+  void __attribute__ ((noinline, noclone))                                     \
+    test_mv_##TYPE (TYPE *restrict a, TYPE *restrict b, TYPE *restrict c,      \
+		    int n)                                                     \
+  {                                                                            \
+    for (int i = 0; i < n; ++i)                                                \
+      {                                                                        \
+	a[i] += 1;                                                             \
+	b[i * 2] += 2;                                                         \
+	b[i * 2 + 1] += 3;                                                     \
+	c[i * 4] += 4;                                                         \
+	c[i * 4 + 1] += 5;                                                     \
+	c[i * 4 + 2] += 6;                                                     \
+	c[i * 4 + 3] += 7;                                                     \
+      }                                                                        \
+  }
+
+TEST_ALL (test)
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-5.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-5.h
new file mode 100644
index 00000000000..406daaa3d3e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-5.h
@@ -0,0 +1,29 @@
+#include "p9-vec-length.h"
+
+/* Test the case where the loop has multiple vectors (concatenated vectors)
+   of different types.  */
+
+#define test(TYPE1, TYPE2)                                                     \
+  void __attribute__ ((noinline, noclone))                                     \
+    test_mv_##TYPE1##TYPE2 (TYPE1 *restrict a, TYPE2 *restrict b, int n)       \
+  {                                                                            \
+    for (int i = 0; i < n; ++i)                                                \
+      {                                                                        \
+	a[i * 2] += 1;                                                         \
+	a[i * 2 + 1] += 2;                                                     \
+	b[i * 2] += 3;                                                         \
+	b[i * 2 + 1] += 4;                                                     \
+      }                                                                        \
+  }
+
+#define TEST_ALL2(T)                                                           \
+  T (int8_t, uint16_t)                                                         \
+  T (uint8_t, int16_t)                                                         \
+  T (int16_t, uint32_t)                                                        \
+  T (uint16_t, int32_t)                                                        \
+  T (int32_t, double)                                                          \
+  T (uint32_t, int64_t)                                                        \
+  T (float, uint64_t)
+
+TEST_ALL2 (test)
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-6.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-6.h
new file mode 100644
index 00000000000..58b151e18f8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-6.h
@@ -0,0 +1,32 @@
+#include "p9-vec-length.h"
+
+/* Test the case where the loop has the same concatenated vectors (same
+   size per iteration) but from different types.  */
+
+#define test(TYPE1, TYPE2)                                                     \
+  void __attribute__ ((noinline, noclone))                                     \
+    test_mv_##TYPE1##TYPE2 (TYPE1 *restrict a, TYPE2 *restrict b, int n)       \
+  {                                                                            \
+    for (int i = 0; i < n; i++)                                                \
+      {                                                                        \
+	a[i * 2] += 1;                                                         \
+	a[i * 2 + 1] += 2;                                                     \
+	b[i * 4] += 3;                                                         \
+	b[i * 4 + 1] += 4;                                                     \
+	b[i * 4 + 2] += 5;                                                     \
+	b[i * 4 + 3] += 6;                                                     \
+      }                                                                        \
+  }
+
+#define TEST_ALL2(T)                                                           \
+  T (int16_t, uint8_t)                                                         \
+  T (uint16_t, int8_t)                                                         \
+  T (int32_t, uint16_t)                                                        \
+  T (uint32_t, int16_t)                                                        \
+  T (float, uint16_t)                                                          \
+  T (int64_t, float)                                                           \
+  T (uint64_t, int32_t)                                                        \
+  T (double, uint32_t)
+
+TEST_ALL2 (test)
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-1.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-1.c
new file mode 100644
index 00000000000..aba49a46695
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-1.c
@@ -0,0 +1,15 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=1" } */
+
+/* Test that only the epilogue is vectorized with length-based vector
+   accesses; the main body still uses normal vector loads/stores.  */
+
+#include "p9-vec-length-1.h"
+
+/* { dg-final { scan-assembler-times {\mlxv\M|\mlxvx\M} 20 } } */
+/* { dg-final { scan-assembler-times {\mstxv\M|\mstxvx\M} 10 } } */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 20 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 10 } } */
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-2.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-2.c
new file mode 100644
index 00000000000..66a78a2b312
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-2.c
@@ -0,0 +1,15 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=1" } */
+
+/* Test that only the epilogue is vectorized with length-based vector
+   accesses; the main body still uses normal vector loads/stores.  */
+
+#include "p9-vec-length-2.h"
+
+/* { dg-final { scan-assembler-times {\mlxv\M|\mlxvx\M} 20 } } */
+/* { dg-final { scan-assembler-times {\mstxv\M|\mstxvx\M} 10 } } */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 20 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 10 } } */
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-3.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-3.c
new file mode 100644
index 00000000000..86d71afc0fd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-3.c
@@ -0,0 +1,18 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=1" } */
+
+/* Test that only the epilogue is vectorized with length-based vector
+   accesses; the main body still uses normal vector loads/stores.  */
+
+#include "p9-vec-length-3.h"
+
+/* { dg-final { scan-assembler-not   {\mlxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mlxvx\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxvx\M} } } */
+/* 64bit types get completely unrolled, so only check the others.  */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 14 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 7 } } */
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-4.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-4.c
new file mode 100644
index 00000000000..83f98a119e8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-4.c
@@ -0,0 +1,15 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=1" } */
+
+/* Test that only the epilogue is vectorized with length-based vector
+   accesses; the main body still uses normal vector loads/stores.  */
+
+#include "p9-vec-length-4.h"
+
+/* { dg-final { scan-assembler-times {\mlxv\M|\mlxvx\M} 120 } } */
+/* { dg-final { scan-assembler-times {\mstxv\M|\mstxvx\M} 70 } } */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 70 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 70 } } */
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-5.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-5.c
new file mode 100644
index 00000000000..cd646700acf
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-5.c
@@ -0,0 +1,15 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=1" } */
+
+/* Test that only the epilogue is vectorized with length-based vector
+   accesses; the main body still uses normal vector loads/stores.  */
+
+#include "p9-vec-length-5.h"
+
+/* { dg-final { scan-assembler-times {\mlxv\M|\mlxvx\M} 49 } } */
+/* { dg-final { scan-assembler-times {\mstxv\M|\mstxvx\M} 21 } } */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 21 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 21 } } */
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-6.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-6.c
new file mode 100644
index 00000000000..48ac191ddcb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-6.c
@@ -0,0 +1,16 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=1" } */
+
+/* Test that only the epilogue is vectorized with length-based vector
+   accesses; the main body still uses normal vector loads/stores.  */
+
+#include "p9-vec-length-6.h"
+
+/* { dg-final { scan-assembler-times {\mlxv\M|\mlxvx\M} 42 } } */
+/* { dg-final { scan-assembler-times {\mstxv\M|\mstxvx\M} 16 } } */
+/* 64bit/32bit pairs don't have the epilogues.  */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 10 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 10 } } */
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-1.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-1.c
new file mode 100644
index 00000000000..ea624b027c7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-1.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=1" } */
+
+/* Check whether it runs successfully if we only vectorize the epilogue
+   with vector access with length.  */
+
+#include "p9-vec-length-run-1.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-2.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-2.c
new file mode 100644
index 00000000000..2e8d0430151
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-2.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=1" } */
+
+/* Check whether it runs successfully if we only vectorize the epilogue
+   with vector access with length.  */
+
+#include "p9-vec-length-run-2.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-3.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-3.c
new file mode 100644
index 00000000000..3a842220b64
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-3.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=1" } */
+
+/* Check whether it runs successfully if we only vectorize the epilogue
+   with vector access with length.  */
+
+#include "p9-vec-length-run-3.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-4.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-4.c
new file mode 100644
index 00000000000..ecbd00207dc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-4.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=1" } */
+
+/* Check whether it runs successfully if we only vectorize the epilogue
+   with vector access with length.  */
+
+#include "p9-vec-length-run-4.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-5.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-5.c
new file mode 100644
index 00000000000..34cbf56ac2c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-5.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=1" } */
+
+/* Check whether it runs successfully if we only vectorize the epilogue
+   with vector access with length.  */
+
+#include "p9-vec-length-run-5.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-6.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-6.c
new file mode 100644
index 00000000000..584dd99a7bd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-6.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=1" } */
+
+/* Check whether it runs successfully if we only vectorize the epilogue
+   with vector access with length.  */
+
+#include "p9-vec-length-run-6.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-1.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-1.c
new file mode 100644
index 00000000000..bac275ea61a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-1.c
@@ -0,0 +1,16 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=2" } */
+
+/* Test fully length-based vectorization: the loop body uses vector accesses
+   with length and there should not be any epilogues.  */
+
+#include "p9-vec-length-1.h"
+
+/* { dg-final { scan-assembler-not   {\mlxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mlxvx\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxvx\M} } } */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 20 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 10 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-2.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-2.c
new file mode 100644
index 00000000000..eb6f43abbdc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-2.c
@@ -0,0 +1,16 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=2" } */
+
+/* Test fully length-based vectorization: the loop body uses vector accesses
+   with length and there should not be any epilogues.  */
+
+#include "p9-vec-length-2.h"
+
+/* { dg-final { scan-assembler-not   {\mlxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mlxvx\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxvx\M} } } */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 20 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 10 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-3.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-3.c
new file mode 100644
index 00000000000..91524b1bb1a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-3.c
@@ -0,0 +1,17 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=2" } */
+
+/* Test fully length-based vectorization: the loop body uses vector accesses
+   with length and there should not be any epilogues.  */
+
+#include "p9-vec-length-3.h"
+
+/* { dg-final { scan-assembler-not   {\mlxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mlxvx\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxvx\M} } } */
+/* 64bit types get completely unrolled, so only check the others.  */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 14 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 7 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-4.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-4.c
new file mode 100644
index 00000000000..05ea5ccdb80
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-4.c
@@ -0,0 +1,16 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=2" } */
+
+/* Test fully length-based vectorization: the loop body uses vector accesses
+   with length and there should not be any epilogues.  */
+
+#include "p9-vec-length-4.h"
+
+/* Normal vector loads can still be used for loading constant vectors.  */
+/* { dg-final { scan-assembler-not   {\mstxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mlxvx\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxvx\M} } } */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 70 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 70 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-5.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-5.c
new file mode 100644
index 00000000000..6045a444148
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-5.c
@@ -0,0 +1,16 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=2" } */
+
+/* Test fully length-based vectorization: the loop body uses vector accesses
+   with length and there should not be any epilogues.  */
+
+#include "p9-vec-length-5.h"
+
+/* Normal vector loads can still be used for loading constant vectors.  */
+/* { dg-final { scan-assembler-not   {\mstxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mlxvx\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxvx\M} } } */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 21 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 21 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-6.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-6.c
new file mode 100644
index 00000000000..c4d67799644
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-6.c
@@ -0,0 +1,16 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=2" } */
+
+/* Test fully length-based vectorization: the loop body uses vector accesses
+   with length and there should not be any epilogues.  */
+
+#include "p9-vec-length-6.h"
+
+/* Normal vector loads can still be used for loading constant vectors.  */
+/* { dg-final { scan-assembler-not   {\mstxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mlxvx\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxvx\M} } } */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 16 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 16 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-1.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-1.c
new file mode 100644
index 00000000000..4ccf0e0a4e0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-1.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=2" } */
+
+/* Check whether it runs successfully if we vectorize the loop fully
+   with vector access with length.  */
+
+#include "p9-vec-length-run-1.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-2.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-2.c
new file mode 100644
index 00000000000..456a6ce1440
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-2.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=2" } */
+
+/* Check whether it runs successfully if we vectorize the loop fully
+   with vector access with length.  */
+
+#include "p9-vec-length-run-2.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-3.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-3.c
new file mode 100644
index 00000000000..35c31cc8ed8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-3.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=2" } */
+
+/* Check whether it runs successfully if we vectorize the loop fully
+   with vector access with length.  */
+
+#include "p9-vec-length-run-3.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-4.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-4.c
new file mode 100644
index 00000000000..ff66b56dff0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-4.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=2" } */
+
+/* Check whether it runs successfully if we vectorize the loop fully
+   with vector access with length.  */
+
+#include "p9-vec-length-run-4.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-5.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-5.c
new file mode 100644
index 00000000000..37550881aea
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-5.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=2" } */
+
+/* Check whether it runs successfully if we vectorize the loop fully
+   with vector access with length.  */
+
+#include "p9-vec-length-run-5.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-6.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-6.c
new file mode 100644
index 00000000000..9209b682c1c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-6.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-with-length-scope=2" } */
+
+/* Check whether it runs successfully if we vectorize the loop fully
+   with vector access with length.  */
+
+#include "p9-vec-length-run-6.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-1.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-1.h
new file mode 100644
index 00000000000..b397fd1ac30
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-1.h
@@ -0,0 +1,34 @@
+#include "p9-vec-length-1.h"
+
+#define decl(TYPE)                                                             \
+  TYPE a_##TYPE[N];                                                            \
+  TYPE b_##TYPE[N];                                                            \
+  TYPE c_##TYPE[N];
+
+#define run(TYPE)                                                              \
+  {                                                                            \
+    unsigned int i = 0;                                                        \
+    for (i = 0; i < N; i++)                                                    \
+      {                                                                        \
+	a_##TYPE[i] = i * 2 + 1;                                               \
+	b_##TYPE[i] = i % 2 - 2;                                               \
+      }                                                                        \
+    test##TYPE ();                                                             \
+    for (i = 0; i < N; i++)                                                    \
+      {                                                                        \
+	TYPE a1 = i * 2 + 1;                                                   \
+	TYPE b1 = i % 2 - 2;                                                   \
+	TYPE exp_c = a1 + b1;                                                  \
+	if (c_##TYPE[i] != exp_c)                                              \
+	  __builtin_abort ();                                                  \
+      }                                                                        \
+  }
+
+TEST_ALL (decl)
+
+int
+main (void)
+{
+  TEST_ALL (run)
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-2.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-2.h
new file mode 100644
index 00000000000..a0f2d6ccb23
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-2.h
@@ -0,0 +1,36 @@
+#include "p9-vec-length-2.h"
+
+#define decl(TYPE)                                                             \
+  TYPE a_##TYPE[N];                                                            \
+  TYPE b_##TYPE[N];                                                            \
+  TYPE c_##TYPE[N];
+
+#define N1 195
+
+#define run(TYPE)                                                              \
+  {                                                                            \
+    unsigned int i = 0;                                                        \
+    for (i = 0; i < N; i++)                                                    \
+      {                                                                        \
+	a_##TYPE[i] = i * 2 + 1;                                               \
+	b_##TYPE[i] = i % 2 - 2;                                               \
+      }                                                                        \
+    test##TYPE (N1);                                                           \
+    for (i = 0; i < N1; i++)                                                   \
+      {                                                                        \
+	TYPE a1 = i * 2 + 1;                                                   \
+	TYPE b1 = i % 2 - 2;                                                   \
+	TYPE exp_c = a1 + b1;                                                  \
+	if (c_##TYPE[i] != exp_c)                                              \
+	  __builtin_abort ();                                                  \
+      }                                                                        \
+  }
+
+TEST_ALL (decl)
+
+int
+main (void)
+{
+  TEST_ALL (run)
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-3.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-3.h
new file mode 100644
index 00000000000..5d2f5c34b6a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-3.h
@@ -0,0 +1,34 @@
+#include "p9-vec-length-3.h"
+
+#define decl(TYPE)                                                             \
+  TYPE a_##TYPE[N_##TYPE];                                                     \
+  TYPE b_##TYPE[N_##TYPE];                                                     \
+  TYPE c_##TYPE[N_##TYPE];
+
+#define run(TYPE)                                                              \
+  {                                                                            \
+    unsigned int i = 0;                                                        \
+    for (i = 0; i < N_##TYPE; i++)                                             \
+      {                                                                        \
+	a_##TYPE[i] = i * 2 + 1;                                               \
+	b_##TYPE[i] = i % 2 - 2;                                               \
+      }                                                                        \
+    test##TYPE ();                                                             \
+    for (i = 0; i < N_##TYPE; i++)                                             \
+      {                                                                        \
+	TYPE a1 = i * 2 + 1;                                                   \
+	TYPE b1 = i % 2 - 2;                                                   \
+	TYPE exp_c = a1 + b1;                                                  \
+	if (c_##TYPE[i] != exp_c)                                              \
+	  __builtin_abort ();                                                  \
+      }                                                                        \
+  }
+
+TEST_ALL (decl)
+
+int
+main (void)
+{
+  TEST_ALL (run)
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-4.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-4.h
new file mode 100644
index 00000000000..2f3b911d0d1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-4.h
@@ -0,0 +1,62 @@
+#include "p9-vec-length-4.h"
+
+/* Check extra elements to ensure there is no out-of-bound vector access.  */
+#define N  144
+/* Array size actually used by the test function.  */
+#define NF 123
+
+#define run(TYPE)                                                              \
+  {                                                                            \
+    unsigned int i = 0;                                                        \
+    TYPE a[N], b[N * 2], c[N * 4];                                             \
+    for (i = 0; i < N; i++)                                                    \
+      {                                                                        \
+	a[i] = i + i % 2;                                                      \
+	b[i * 2] = i * 2 + i % 3;                                              \
+	b[i * 2 + 1] = i * 3 + i % 4;                                          \
+	c[i * 4] = i * 4 + i % 5;                                              \
+	c[i * 4 + 1] = i * 5 + i % 6;                                          \
+	c[i * 4 + 2] = i * 6 + i % 7;                                          \
+	c[i * 4 + 3] = i * 7 + i % 8;                                          \
+      }                                                                        \
+    test_mv_##TYPE (a, b, c, NF);                                              \
+    for (i = 0; i < N; i++)                                                    \
+      {                                                                        \
+	TYPE a1 = i + i % 2;                                                   \
+	TYPE b1 = i * 2 + i % 3;                                               \
+	TYPE b2 = i * 3 + i % 4;                                               \
+	TYPE c1 = i * 4 + i % 5;                                               \
+	TYPE c2 = i * 5 + i % 6;                                               \
+	TYPE c3 = i * 6 + i % 7;                                               \
+	TYPE c4 = i * 7 + i % 8;                                               \
+                                                                               \
+	TYPE exp_a = a1;                                                       \
+	TYPE exp_b1 = b1;                                                      \
+	TYPE exp_b2 = b2;                                                      \
+	TYPE exp_c1 = c1;                                                      \
+	TYPE exp_c2 = c2;                                                      \
+	TYPE exp_c3 = c3;                                                      \
+	TYPE exp_c4 = c4;                                                      \
+	if (i < NF)                                                            \
+	  {                                                                    \
+	    exp_a += 1;                                                        \
+	    exp_b1 += 2;                                                       \
+	    exp_b2 += 3;                                                       \
+	    exp_c1 += 4;                                                       \
+	    exp_c2 += 5;                                                       \
+	    exp_c3 += 6;                                                       \
+	    exp_c4 += 7;                                                       \
+	  }                                                                    \
+	if (a[i] != exp_a || b[i * 2] != exp_b1 || b[i * 2 + 1] != exp_b2      \
+	    || c[i * 4] != exp_c1 || c[i * 4 + 1] != exp_c2                    \
+	    || c[i * 4 + 2] != exp_c3 || c[i * 4 + 3] != exp_c4)               \
+	  __builtin_abort ();                                                  \
+      }                                                                        \
+  }
+
+int
+main (void)
+{
+  TEST_ALL (run)
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-5.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-5.h
new file mode 100644
index 00000000000..ca4b3d56351
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-5.h
@@ -0,0 +1,45 @@
+#include "p9-vec-length-5.h"
+
+/* Check extra elements to ensure there is no out-of-bound vector access.  */
+#define N 155
+/* Array size actually used by the test function.  */
+#define NF 127
+
+#define run(TYPE1, TYPE2)                                                      \
+  {                                                                            \
+    unsigned int i = 0;                                                        \
+    TYPE1 a[N * 2];                                                            \
+    TYPE2 b[N * 2];                                                            \
+    for (i = 0; i < N; i++)                                                    \
+      {                                                                        \
+	a[i * 2] = i * 2 + i % 3;                                              \
+	a[i * 2 + 1] = i * 3 + i % 4;                                          \
+	b[i * 2] = i * 7 + i / 5;                                              \
+	b[i * 2 + 1] = i * 8 + i / 6;                                          \
+      }                                                                        \
+    test_mv_##TYPE1##TYPE2 (a, b, NF);                                         \
+    for (i = 0; i < N; i++)                                                    \
+      {                                                                        \
+	TYPE1 exp_a1 = i * 2 + i % 3;                                          \
+	TYPE1 exp_a2 = i * 3 + i % 4;                                          \
+	TYPE2 exp_b1 = i * 7 + i / 5;                                          \
+	TYPE2 exp_b2 = i * 8 + i / 6;                                          \
+	if (i < NF)                                                            \
+	  {                                                                    \
+	    exp_a1 += 1;                                                        \
+	    exp_a2 += 2;                                                       \
+	    exp_b1 += 3;                                                       \
+	    exp_b2 += 4;                                                       \
+	  }                                                                    \
+	if (a[i * 2] != exp_a1 || a[i * 2 + 1] != exp_a2 || b[i * 2] != exp_b1 \
+	    || b[i * 2 + 1] != exp_b2)                                         \
+	  __builtin_abort ();                                                  \
+      }                                                                        \
+  }
+
+int
+main (void)
+{
+  TEST_ALL2 (run)
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-6.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-6.h
new file mode 100644
index 00000000000..814e4059bdf
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-6.h
@@ -0,0 +1,52 @@
+#include "p9-vec-length-6.h"
+
+/* Check extra elements to ensure there is no out-of-bound vector access.  */
+#define N 275
+/* Array size actually used by the test function.  */
+#define NF 255
+
+#define run(TYPE1, TYPE2)                                                      \
+  {                                                                            \
+    unsigned int i = 0;                                                        \
+    TYPE1 a[N * 2];                                                            \
+    TYPE2 b[N * 4];                                                            \
+    for (i = 0; i < N; i++)                                                    \
+      {                                                                        \
+	a[i * 2] = i * 2 + i % 3;                                              \
+	a[i * 2 + 1] = i * 3 + i % 4;                                          \
+	b[i * 4] = i * 4 + i / 5;                                              \
+	b[i * 4 + 1] = i * 5 + i / 6;                                          \
+	b[i * 4 + 2] = i * 6 + i / 7;                                          \
+	b[i * 4 + 3] = i * 7 + i / 8;                                          \
+      }                                                                        \
+    test_mv_##TYPE1##TYPE2 (a, b, NF);                                         \
+    for (i = 0; i < N; i++)                                                    \
+      {                                                                        \
+	TYPE1 exp_a1 = i * 2 + i % 3;                                          \
+	TYPE1 exp_a2 = i * 3 + i % 4;                                          \
+	TYPE2 exp_b1 = i * 4 + i / 5;                                          \
+	TYPE2 exp_b2 = i * 5 + i / 6;                                          \
+	TYPE2 exp_b3 = i * 6 + i / 7;                                          \
+	TYPE2 exp_b4 = i * 7 + i / 8;                                          \
+	if (i < NF)                                                            \
+	  {                                                                    \
+	    exp_a1 += 1;                                                       \
+	    exp_a2 += 2;                                                       \
+	    exp_b1 += 3;                                                       \
+	    exp_b2 += 4;                                                       \
+	    exp_b3 += 5;                                                       \
+	    exp_b4 += 6;                                                       \
+	  }                                                                    \
+	if (a[i * 2] != exp_a1 || a[i * 2 + 1] != exp_a2 || b[i * 4] != exp_b1 \
+	    || b[i * 4 + 1] != exp_b2 || b[i * 4 + 2] != exp_b3                \
+	    || b[i * 4 + 3] != exp_b4)                                         \
+	  __builtin_abort ();                                                  \
+      }                                                                        \
+  }
+
+int
+main (void)
+{
+  TEST_ALL2 (run)
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length.h
new file mode 100644
index 00000000000..83418b0b641
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length.h
@@ -0,0 +1,14 @@
+#include <stdint.h>
+
+#define TEST_ALL(T)                                                            \
+  T (int8_t)                                                                   \
+  T (uint8_t)                                                                  \
+  T (int16_t)                                                                  \
+  T (uint16_t)                                                                 \
+  T (int32_t)                                                                  \
+  T (uint32_t)                                                                 \
+  T (int64_t)                                                                  \
+  T (uint64_t)                                                                 \
+  T (float)                                                                    \
+  T (double)
+
-- 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/7] Support vector load/store with length
  2020-05-26  5:49 [PATCH 0/7] Support vector load/store with length Kewen.Lin
                   ` (6 preceding siblings ...)
  2020-05-26  5:59 ` [PATCH 7/7] rs6000/testsuite: Vector with length test cases Kewen.Lin
@ 2020-05-26  7:12 ` Richard Biener
  2020-05-26  8:51   ` Kewen.Lin
  2020-05-26 22:34   ` Jim Wilson
  7 siblings, 2 replies; 80+ messages in thread
From: Richard Biener @ 2020-05-26  7:12 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Bill Schmidt, Segher Boessenkool, Richard Sandiford,
	dje.gcc

On Tue, 26 May 2020, Kewen.Lin wrote:

> Hi all,
> 
> This patch set adds support for vector load/store with length, Power 
> ISA 3.0 brings instructions lxvl/stxvl to perform vector load/store with
> length, it's good to be exploited for those cases we don't have enough
> stuffs to fill in the whole vector like epilogues.
> 
> This support mainly refers to the handlings for fully-predicated loop
> but it also covers the epilogue usage.  Now it supports two modes
> controlled by parameter vect-with-length-scope, it can support any
> loops fully with length or just for those cases with small iteration
> counts less than VF like epilogue, for now I don't have ready env to
> benchmark it, but based on the current inefficient length generation,
> I don't think it's a good idea to adopt vector with length for any loops.
> For the main loop which used to be vectorized, it increases register
> pressure and introduces extra computation for length, the pro for icache
> seems not comparable.  But I think it might be a good idea to keep this
> parameter there for functionality testing, further benchmarking and other
> ports' potential future supports.

Can you explain in more detail what "vector load/store with length" does?
Is that a "simplified" masked load/store which instead of masking 
arbitrary elements (and need a mask computed in the first place), masks
elements > N (the length operand)?  Thus assuming a lane IV decrementing
to zero that IV would be the natural argument for the length operand?
If that's correct, what data are the remaining lanes filled with?

From a look at the series description below you seem to add a new way
of doing loads for this.  Did you review other ISAs (those I'm not
familiar with myself too much are SVE, RISC-V and GCN) in GCC whether
they have similar support and whether your approach can be supported
there?  ISTR SVE must have some similar support - what's the reason
you do not piggy-back on that?

I think a load like I described above might be represented as

_1 = __VIEW_CONVERT <v4df_t> (__MEM <double[n_2]> ((double *)p_3));

not sure if that actually works out though.  But given it seems it
is a contiguous load we shouldn't need an internal function here?
[there's a possible size mismatch in the __VIEW_CONVERT above, I guess
on RTL you end up with a paradoxical subreg - or an UNSPEC]

That said, I'm not very happy seeing yet another way of doing loads
[for fully predicated loops].  I'd rather like to see a single
representation on GIMPLE at least.

Will dig into the patch once the actual workings of those load/store with
length is confirmed.

I don't spot tree-vect-slp.c being changed - maybe that's not necessary
for SLP operation, but please do not introduce new vectorizer features
without supporting SLP operation at this point.

Thanks,
Richard.

> As we don't have any benchmarking, this support isn't enabled by default
> for any particular cpus, all testings are with explicit parameter setting.
> 
> Bootstrapped on powerpc64le-linux-gnu P9 with all vect-with-length-scope
> settings (0/1/2).  Regress-test passed with vector-with-length-scope 0,
> for the other twos, several vector related cases need to be updated, no
> remarkable failures found.  BTW, P9 is the one which supports the
> functionality but not ready to evaluate the performance.
> 
> Here still are many things to be supported or improved, not limited to:
>   - reduction/live-out support
>   - Cost model adding/tweaking
>   - IFN gimple folding
>   - Some unnecessary ops improvements eg: vector_size check
>   - Some possible refactoring
> I'll support/post the patches gradually.
> 
> Any comments are highly appreciated.
> 
> BR,
> Kewen
> -----
> 
> Patch set outline:
>   [PATCH 1/7] ifn/optabs: Support vector load/store with length
>   [PATCH 2/7] rs6000: lenload/lenstore optab support
>   [PATCH 3/7] vect: Factor out codes for niters smaller than vf check
>   [PATCH 4/7] hook/rs6000: Add vectorize length mode for vector with length
>   [PATCH 5/7] vect: Support vector load/store with length in vectorizer
>   [PATCH 6/7] ivopts: Add handlings for vector with length IFNs
>   [PATCH 7/7] rs6000/testsuite: Vector with length test cases
> 
>  gcc/config/rs6000/rs6000.c                                  |   3 +
>  gcc/config/rs6000/vsx.md                                    |  30 ++++++++++
>  gcc/doc/invoke.texi                                         |   7 +++
>  gcc/doc/md.texi                                             |  16 ++++++
>  gcc/doc/tm.texi                                             |   6 ++
>  gcc/doc/tm.texi.in                                          |   2 +
>  gcc/internal-fn.c                                           |  13 ++++-
>  gcc/internal-fn.def                                         |   6 ++
>  gcc/optabs.def                                              |   2 +
>  gcc/params.opt                                              |   4 ++
>  gcc/target.def                                              |   7 +++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-1.h          |  18 ++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-2.h          |  17 ++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-3.h          |  31 +++++++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-4.h          |  24 ++++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-5.h          |  29 ++++++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-6.h          |  32 +++++++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-1.c     |  15 +++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-2.c     |  15 +++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-3.c     |  18 ++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-4.c     |  15 +++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-5.c     |  15 +++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-6.c     |  16 ++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-1.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-2.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-3.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-4.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-5.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-6.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-1.c     |  16 ++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-2.c     |  16 ++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-3.c     |  17 ++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-4.c     |  16 ++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-5.c     |  16 ++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-6.c     |  16 ++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-1.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-2.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-3.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-4.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-5.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-6.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-1.h      |  34 ++++++++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-2.h      |  36 ++++++++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-3.h      |  34 ++++++++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-4.h      |  62 +++++++++++++++++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-5.h      |  45 +++++++++++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-6.h      |  52 +++++++++++++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length.h            |  14 +++++
>  gcc/tree-ssa-loop-ivopts.c                                  |   4 ++
>  gcc/tree-vect-loop-manip.c                                  | 268 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  gcc/tree-vect-loop.c                                        | 272 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
>  gcc/tree-vect-stmts.c                                       | 152 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  gcc/tree-vectorizer.h                                       |  32 +++++++++++
>  53 files changed, 1545 insertions(+), 18 deletions(-)
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/7] Support vector load/store with length
  2020-05-26  7:12 ` [PATCH 0/7] Support vector load/store with length Richard Biener
@ 2020-05-26  8:51   ` Kewen.Lin
  2020-05-26  9:44     ` Richard Biener
  2020-05-26 22:34   ` Jim Wilson
  1 sibling, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-05-26  8:51 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, Bill Schmidt, Segher Boessenkool, Richard Sandiford,
	dje.gcc

Hi Richi,

on 2020/5/26 3:12 PM, Richard Biener wrote:
> On Tue, 26 May 2020, Kewen.Lin wrote:
> 
>> Hi all,
>>
>> This patch set adds support for vector load/store with length, Power 
>> ISA 3.0 brings instructions lxvl/stxvl to perform vector load/store with
>> length, it's good to be exploited for those cases we don't have enough
>> stuffs to fill in the whole vector like epilogues.
>>
>> This support mainly refers to the handlings for fully-predicated loop
>> but it also covers the epilogue usage.  Now it supports two modes
>> controlled by parameter vect-with-length-scope, it can support any
>> loops fully with length or just for those cases with small iteration
>> counts less than VF like epilogue, for now I don't have ready env to
>> benchmark it, but based on the current inefficient length generation,
>> I don't think it's a good idea to adopt vector with length for any loops.
>> For the main loop which used to be vectorized, it increases register
>> pressure and introduces extra computation for length, the pro for icache
>> seems not comparable.  But I think it might be a good idea to keep this
>> parameter there for functionality testing, further benchmarking and other
>> ports' potential future supports.
> 
> Can you explain in more detail what "vector load/store with length" does?
> Is that a "simplified" masked load/store which instead of masking 
> arbitrary elements (and need a mask computed in the first place), masks
> elements > N (the length operand)?  Thus assuming a lane IV decrementing
> to zero that IV would be the natural argument for the length operand?
> If that's correct, what data are the remaining lanes filled with?
> 

The vector load/store instructions take one GPR holding a length in bytes
(called n here) that controls how many bytes we load/store.  If n > vector_size
(16 on Power), n is taken as 16; if n is zero, no storage access happens
and the load result is a zero vector; if 0 < n < vector_size, the remaining
lanes are filled with zero.  On Power, only bits 0:7 of the length GPR are
checked, so the length has to be adjusted accordingly.

Your understanding is correct!  It's a "simplified" masked load/store: it
needs the length in bytes to be computed and only supports contiguous
accesses.  A lane IV would have to be multiplied by the lane size in bytes
and adjusted if needed.
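
To make that concrete, here is a rough scalar model of the load side
(purely illustrative, with made-up names; it is not how the IFN or the
hardware is implemented):

  #define VEC_BYTES 16

  /* Model: only the first N bytes come from memory, the remaining
     byte lanes are filled with zero.  */
  static void
  len_load_model (unsigned char dst[VEC_BYTES], const unsigned char *src,
                  unsigned long n)
  {
    if (n > VEC_BYTES)
      n = VEC_BYTES;
    for (unsigned int i = 0; i < VEC_BYTES; i++)
      dst[i] = i < n ? src[i] : 0;
  }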

> From a look at the series description below you seem to add a new way
> of doing loads for this.  Did you review other ISAs (those I'm not
> familiar with myself too much are SVE, RISC-V and GCN) in GCC whether
> they have similar support and whether your approach can be supported
> there?  ISTR SVE must have some similar support - what's the reason
> you do not piggy-back on that?

I may be missing something, but I didn't find any existing support in GCC
that matches this.  Good suggestion on the ISAs; I didn't review other
ISAs.  For the current implementation I referred heavily to the existing
SVE fully-predicated loop support; it's stronger than "with length" since
it can deal with arbitrary elements, handle the loop fully, etc.

> 
> I think a load like I described above might be represented as
> 
> _1 = __VIEW_CONVERT <v4df_t> (__MEM <double[n_2]> ((double *)p_3));
> 
> not sure if that actually works out though.  But given it seems it
> is a contiguous load we shouldn't need an internal function here?
> [there's a possible size mismatch in the __VIEW_CONVERT above, I guess
> on RTL you end up with a paradoxical subreg - or an UNSPEC]

IIUC, what you suggested is to avoid using an IFN here.  May I know the
reason?  Is it due to the maintenance effort for the various IFNs?  I
thought using an IFN was gentle.  And agreed, I had the same concern about
the size mismatch problem here, since the length can be large (extremely
large probably, depending on the target's saturation limit) while the
converted vector size can be small.  Besides, the length can be a variable.
> 
> That said, I'm not very happy seeing yet another way of doing loads
> [for fully predicated loops].  I'd rather like to see a single
> representation on GIMPLE at least.

OK.  Does it mean IFN isn't counted into this scope?  :)

> 
> Will dig into the patch once the actual workings of those load/store with
> length is confirmed.

Thanks a lot for your time!

> 
> I don't spot tree-vect-slp.c being changed - maybe that's not necessary
> for SLP operation, but please do not introduce new vectorizer features
> without supporting SLP operation at this point.

Got it.  SLP is also one of the parts to be supported; I forgot to state
that in the original email.  I'll leave it for now.  :)

BR,
Kewen

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/7] Support vector load/store with length
  2020-05-26  8:51   ` Kewen.Lin
@ 2020-05-26  9:44     ` Richard Biener
  2020-05-26 10:10       ` Kewen.Lin
  0 siblings, 1 reply; 80+ messages in thread
From: Richard Biener @ 2020-05-26  9:44 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Bill Schmidt, Segher Boessenkool, Richard Sandiford,
	dje.gcc

On Tue, 26 May 2020, Kewen.Lin wrote:

> Hi Richi,
> 
> on 2020/5/26 3:12 PM, Richard Biener wrote:
> > On Tue, 26 May 2020, Kewen.Lin wrote:
> > 
> >> Hi all,
> >>
> >> This patch set adds support for vector load/store with length, Power 
> >> ISA 3.0 brings instructions lxvl/stxvl to perform vector load/store with
> >> length, it's good to be exploited for those cases we don't have enough
> >> stuffs to fill in the whole vector like epilogues.
> >>
> >> This support mainly refers to the handlings for fully-predicated loop
> >> but it also covers the epilogue usage.  Now it supports two modes
> >> controlled by parameter vect-with-length-scope, it can support any
> >> loops fully with length or just for those cases with small iteration
> >> counts less than VF like epilogue, for now I don't have ready env to
> >> benchmark it, but based on the current inefficient length generation,
> >> I don't think it's a good idea to adopt vector with length for any loops.
> >> For the main loop which used to be vectorized, it increases register
> >> pressure and introduces extra computation for length, the pro for icache
> >> seems not comparable.  But I think it might be a good idea to keep this
> >> parameter there for functionality testing, further benchmarking and other
> >> ports' potential future supports.
> > 
> > Can you explain in more detail what "vector load/store with length" does?
> > Is that a "simplified" masked load/store which instead of masking 
> > arbitrary elements (and need a mask computed in the first place), masks
> > elements > N (the length operand)?  Thus assuming a lane IV decrementing
> > to zero that IV would be the natural argument for the length operand?
> > If that's correct, what data are the remaining lanes filled with?
> > 
> 
> The vector load/store have one GPR holding one length in bytes (called as
> n here) to control how many bytes we want to load/store.  If n > vector_size
> (on Power it's 16), n will be taken as 16, if n is zero, the storage access
> won't happen, the result for load is vector zero, if n > 0 but < vector_size,
> the remaining lanes will be filled with zero.  On Power, it checks 0:7 bits
> of the length GPR, so the length should be adjusted.
> 
> Your understanding is correct!  It's a "simplified" masked load/store, need
> the length in bytes computed, only support continuous access.  For the lane
> IV, the one should multiply with the lane bytes and be adjusted if needed.
> 
> > From a look at the series description below you seem to add a new way
> > of doing loads for this.  Did you review other ISAs (those I'm not
> > familiar with myself too much are SVE, RISC-V and GCN) in GCC whether
> > they have similar support and whether your approach can be supported
> > there?  ISTR SVE must have some similar support - what's the reason
> > you do not piggy-back on that?
> 
> I may miss something, but I didn't find there is any support that meet with
> this in GCC.  Good suggestion on ISAs, I didn't review other ISAs, for the
> current implementation, I heavily referred to existing SVE fully predicated
> loop support, it's stronger than "with length", it can deal with arbitrary
> elements, only perform for the loop fully etc.
> 
> > 
> > I think a load like I described above might be represented as
> > 
> > _1 = __VIEW_CONVERT <v4df_t> (__MEM <double[n_2]> ((double *)p_3));
> > 
> > not sure if that actually works out though.  But given it seems it
> > is a contiguous load we shouldn't need an internal function here?
> > [there's a possible size mismatch in the __VIEW_CONVERT above, I guess
> > on RTL you end up with a paradoxical subreg - or an UNSPEC]
> 
> IIUC, what you suggested is to avoid use IFN here.  May I know the reason?
> is it due to the maintenance efforts on various IFNs?  I thought using
> IFN is gentle.  And agreed, I had the concern that the size mismatch
> problem here since the length can be large (extremely large probably, it
> depends on target saturated limit), the converted vector size can be small.
> Besides, the length can be a variable.
>
> > 
> > That said, I'm not very happy seeing yet another way of doing loads
> > [for fully predicated loops].  I'd rather like to see a single
> > representation on GIMPLE at least.
> 
> OK.  Does it mean IFN isn't counted into this scope?  :)

Sure going with a new IFN is a possibility.  I'd just avoid that if
possible ;)  How does SVE represent loads/stores for whilelt loops?
Would it not be possible to share that representation somehow?

Richard.

> > 
> > Will dig into the patch once the actual workings of those load/store with
> > length is confirmed.
> 
> Thanks a lot for your time!
> 
> > 
> > I don't spot tree-vect-slp.c being changed - maybe that's not necessary
> > for SLP operation, but please do not introduce new vectorizer features
> > without supporting SLP operation at this point.
> 
> Got it.  SLP is also one part to be supported, I forgot to state in the
> original email.  I'll let it be for now.  :)
> 
> BR,
> Kewen
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/7] Support vector load/store with length
  2020-05-26  9:44     ` Richard Biener
@ 2020-05-26 10:10       ` Kewen.Lin
  2020-05-26 12:29         ` Richard Sandiford
  0 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-05-26 10:10 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, Bill Schmidt, Segher Boessenkool, Richard Sandiford,
	dje.gcc

on 2020/5/26 5:44 PM, Richard Biener wrote:
> On Tue, 26 May 2020, Kewen.Lin wrote:
> 
>> Hi Richi,
>>
>> on 2020/5/26 3:12 PM, Richard Biener wrote:
>>> On Tue, 26 May 2020, Kewen.Lin wrote:
>>>
>>>> Hi all,
>>>>
>>>> This patch set adds support for vector load/store with length, Power 
>>>> ISA 3.0 brings instructions lxvl/stxvl to perform vector load/store with
>>>> length, it's good to be exploited for those cases we don't have enough
>>>> stuffs to fill in the whole vector like epilogues.
>>>>
>>>> This support mainly refers to the handlings for fully-predicated loop
>>>> but it also covers the epilogue usage.  Now it supports two modes
>>>> controlled by parameter vect-with-length-scope, it can support any
>>>> loops fully with length or just for those cases with small iteration
>>>> counts less than VF like epilogue, for now I don't have ready env to
>>>> benchmark it, but based on the current inefficient length generation,
>>>> I don't think it's a good idea to adopt vector with length for any loops.
>>>> For the main loop which used to be vectorized, it increases register
>>>> pressure and introduces extra computation for length, the pro for icache
>>>> seems not comparable.  But I think it might be a good idea to keep this
>>>> parameter there for functionality testing, further benchmarking and other
>>>> ports' potential future supports.
>>>
>>> Can you explain in more detail what "vector load/store with length" does?
>>> Is that a "simplified" masked load/store which instead of masking 
>>> arbitrary elements (and need a mask computed in the first place), masks
>>> elements > N (the length operand)?  Thus assuming a lane IV decrementing
>>> to zero that IV would be the natural argument for the length operand?
>>> If that's correct, what data are the remaining lanes filled with?
>>>
>>
>> The vector load/store have one GPR holding one length in bytes (called as
>> n here) to control how many bytes we want to load/store.  If n > vector_size
>> (on Power it's 16), n will be taken as 16, if n is zero, the storage access
>> won't happen, the result for load is vector zero, if n > 0 but < vector_size,
>> the remaining lanes will be filled with zero.  On Power, it checks 0:7 bits
>> of the length GPR, so the length should be adjusted.
>>
>> Your understanding is correct!  It's a "simplified" masked load/store, need
>> the length in bytes computed, only support continuous access.  For the lane
>> IV, the one should multiply with the lane bytes and be adjusted if needed.
>>
>>> From a look at the series description below you seem to add a new way
>>> of doing loads for this.  Did you review other ISAs (those I'm not
>>> familiar with myself too much are SVE, RISC-V and GCN) in GCC whether
>>> they have similar support and whether your approach can be supported
>>> there?  ISTR SVE must have some similar support - what's the reason
>>> you do not piggy-back on that?
>>
>> I may miss something, but I didn't find there is any support that meet with
>> this in GCC.  Good suggestion on ISAs, I didn't review other ISAs, for the
>> current implementation, I heavily referred to existing SVE fully predicated
>> loop support, it's stronger than "with length", it can deal with arbitrary
>> elements, only perform for the loop fully etc.
>>
>>>
>>> I think a load like I described above might be represented as
>>>
>>> _1 = __VIEW_CONVERT <v4df_t> (__MEM <double[n_2]> ((double *)p_3));
>>>
>>> not sure if that actually works out though.  But given it seems it
>>> is a contiguous load we shouldn't need an internal function here?
>>> [there's a possible size mismatch in the __VIEW_CONVERT above, I guess
>>> on RTL you end up with a paradoxical subreg - or an UNSPEC]
>>
>> IIUC, what you suggested is to avoid use IFN here.  May I know the reason?
>> is it due to the maintenance efforts on various IFNs?  I thought using
>> IFN is gentle.  And agreed, I had the concern that the size mismatch
>> problem here since the length can be large (extremely large probably, it
>> depends on target saturated limit), the converted vector size can be small.
>> Besides, the length can be a variable.
>>
>>>
>>> That said, I'm not very happy seeing yet another way of doing loads
>>> [for fully predicated loops].  I'd rather like to see a single
>>> representation on GIMPLE at least.
>>
>> OK.  Does it mean IFN isn't counted into this scope?  :)
> 
> Sure going with a new IFN is a possibility.  I'd just avoid that if
> possible ;)  How does SVE represent loads/stores for whilelt loops?
> Would it not be possible to share that representation somehow?
> 
> Richard.
> 

Got it.  :) Currently SVE uses the IFNs .MASK_LOAD and .MASK_STORE for its
WHILE_ULT loops:

  vect__1.5_14 = .MASK_LOAD (_11, 4B, loop_mask_9);
  ...
  .MASK_STORE (_1, 4B, loop_mask_9, vect__3.9_18);
  ...
  next_mask_26 = .WHILE_ULT (_2, 127, { 0, ... });
  if (next_mask_26 != { 0, ... })

I did consider sharing that representation, but I didn't feel it was a
good idea, since we would be treating something that isn't actually a mask
as a mask, which could easily confuse people.  OTOH, Power or other
potential targets probably support these kinds of masking instructions
too, which could confuse people even further.

But I definitely agree that we can refactor some of the code so it is
shared where possible.

Btw, here is what the code with the proposed IFNs looks like:

  vect__1.5_4 = .LEN_LOAD (_15, 4B, loop_len_5);
  ...
  .LEN_STORE (_1, 4B, loop_len_5, vect__3.9_18);
  ivtmp_23 = ivtmp_22 + 16;
  _25 = MIN_EXPR <ivtmp_23, 508>;
  _26 = 508 - _25;
  _27 = MIN_EXPR <_26, 16>;
  if (ivtmp_23 <= 507)
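
Spelled out in scalar terms, the length computation above is just "the
remaining byte count, capped at the vector size".  A rough C sketch of
the control flow (508 bytes in total and 16-byte vectors, matching the
dump above; MIN stands in for MIN_EXPR):

  #define MIN(a, b) ((a) < (b) ? (a) : (b))

  unsigned long iv = 0;
  unsigned long len = MIN (508, 16);
  do
    {
      /* ... .LEN_LOAD / .LEN_STORE of LEN bytes at byte offset IV ... */
      iv += 16;
      len = MIN (508 - MIN (iv, 508), 16);
    }
  while (iv <= 507);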


BR,
Kewen

>>>
>>> Will dig into the patch once the actual workings of those load/store with
>>> length is confirmed.
>>
>> Thanks a lot for your time!
>>
>>>
>>> I don't spot tree-vect-slp.c being changed - maybe that's not necessary
>>> for SLP operation, but please do not introduce new vectorizer features
>>> without supporting SLP operation at this point.
>>
>> Got it.  SLP is also one part to be supported, I forgot to state in the
>> original email.  I'll let it be for now.  :)
>>
>> BR,
>> Kewen
>>
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/7] Support vector load/store with length
  2020-05-26 10:10       ` Kewen.Lin
@ 2020-05-26 12:29         ` Richard Sandiford
  2020-05-27  0:09           ` Segher Boessenkool
  0 siblings, 1 reply; 80+ messages in thread
From: Richard Sandiford @ 2020-05-26 12:29 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Richard Biener, GCC Patches, Bill Schmidt, Segher Boessenkool, dje.gcc

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> on 2020/5/26 5:44 PM, Richard Biener wrote:
>> On Tue, 26 May 2020, Kewen.Lin wrote:
>> 
>>> Hi Richi,
>>>
>>> on 2020/5/26 3:12 PM, Richard Biener wrote:
>>>> On Tue, 26 May 2020, Kewen.Lin wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> This patch set adds support for vector load/store with length, Power 
>>>>> ISA 3.0 brings instructions lxvl/stxvl to perform vector load/store with
>>>>> length, it's good to be exploited for those cases we don't have enough
>>>>> stuffs to fill in the whole vector like epilogues.
>>>>>
>>>>> This support mainly refers to the handlings for fully-predicated loop
>>>>> but it also covers the epilogue usage.  Now it supports two modes
>>>>> controlled by parameter vect-with-length-scope, it can support any
>>>>> loops fully with length or just for those cases with small iteration
>>>>> counts less than VF like epilogue, for now I don't have ready env to
>>>>> benchmark it, but based on the current inefficient length generation,
>>>>> I don't think it's a good idea to adopt vector with length for any loops.
>>>>> For the main loop which used to be vectorized, it increases register
>>>>> pressure and introduces extra computation for length, the pro for icache
>>>>> seems not comparable.  But I think it might be a good idea to keep this
>>>>> parameter there for functionality testing, further benchmarking and other
>>>>> ports' potential future supports.
>>>>
>>>> Can you explain in more detail what "vector load/store with length" does?
>>>> Is that a "simplified" masked load/store which instead of masking 
>>>> arbitrary elements (and need a mask computed in the first place), masks
>>>> elements > N (the length operand)?  Thus assuming a lane IV decrementing
>>>> to zero that IV would be the natural argument for the length operand?
>>>> If that's correct, what data are the remaining lanes filled with?
>>>>
>>>
>>> The vector load/store have one GPR holding one length in bytes (called as
>>> n here) to control how many bytes we want to load/store.  If n > vector_size
>>> (on Power it's 16), n will be taken as 16, if n is zero, the storage access
>>> won't happen, the result for load is vector zero, if n > 0 but < vector_size,
>>> the remaining lanes will be filled with zero.  On Power, it checks 0:7 bits
>>> of the length GPR, so the length should be adjusted.
>>>
>>> Your understanding is correct!  It's a "simplified" masked load/store, need
>>> the length in bytes computed, only support continuous access.  For the lane
>>> IV, the one should multiply with the lane bytes and be adjusted if needed.
>>>
>>>> From a look at the series description below you seem to add a new way
>>>> of doing loads for this.  Did you review other ISAs (those I'm not
>>>> familiar with myself too much are SVE, RISC-V and GCN) in GCC whether
>>>> they have similar support and whether your approach can be supported
>>>> there?  ISTR SVE must have some similar support - what's the reason
>>>> you do not piggy-back on that?
>>>
>>> I may miss something, but I didn't find there is any support that meet with
>>> this in GCC.  Good suggestion on ISAs, I didn't review other ISAs, for the
>>> current implementation, I heavily referred to existing SVE fully predicated
>>> loop support, it's stronger than "with length", it can deal with arbitrary
>>> elements, only perform for the loop fully etc.
>>>
>>>>
>>>> I think a load like I described above might be represented as
>>>>
>>>> _1 = __VIEW_CONVERT <v4df_t> (__MEM <double[n_2]> ((double *)p_3));
>>>>
>>>> not sure if that actually works out though.  But given it seems it
>>>> is a contiguous load we shouldn't need an internal function here?
>>>> [there's a possible size mismatch in the __VIEW_CONVERT above, I guess
>>>> on RTL you end up with a paradoxical subreg - or an UNSPEC]
>>>
>>> IIUC, what you suggested is to avoid use IFN here.  May I know the reason?
>>> is it due to the maintenance efforts on various IFNs?  I thought using
>>> IFN is gentle.  And agreed, I had the concern that the size mismatch
>>> problem here since the length can be large (extremely large probably, it
>>> depends on target saturated limit), the converted vector size can be small.
>>> Besides, the length can be a variable.
>>>
>>>>
>>>> That said, I'm not very happy seeing yet another way of doing loads
>>>> [for fully predicated loops].  I'd rather like to see a single
>>>> representation on GIMPLE at least.
>>>
>>> OK.  Does it mean IFN isn't counted into this scope?  :)
>> 
>> Sure going with a new IFN is a possibility.  I'd just avoid that if
>> possible ;)  How does SVE represent loads/stores for whilelt loops?
>> Would it not be possible to share that representation somehow?
>> 
>> Richard.
>> 
>
> Got it.  :) Currently SVE uses IFNs .MASK_LOAD and .MASK_STORE for 
> whileult loops:
>
>   vect__1.5_14 = .MASK_LOAD (_11, 4B, loop_mask_9);
>   ...
>   .MASK_STORE (_1, 4B, loop_mask_9, vect__3.9_18);
>   ...
>   next_mask_26 = .WHILE_ULT (_2, 127, { 0, ... });
>   if (next_mask_26 != { 0, ... })
>
> I thought to share it once, but didn't feel it's a good idea since it's
> like we take something as masks that isn't actually masks, easily confuse
> people.  OTOH, Power or other potential targets probably supports these
> kinds of masking instructions, easily to confuse people further.
>
> But I definitely agree that we can refactor some codes to share with
> each other when possible.
>
> Btw, here the code with proposed IFNs looks like:
>
>   vect__1.5_4 = .LEN_LOAD (_15, 4B, loop_len_5);
>   ...
>   .LEN_STORE (_1, 4B, loop_len_5, vect__3.9_18);
>   ivtmp_23 = ivtmp_22 + 16;
>   _25 = MIN_EXPR <ivtmp_23, 508>;
>   _26 = 508 - _25;
>   _27 = MIN_EXPR <_26, 16>;
>   if (ivtmp_23 <= 507)

FWIW, I agree adding .LEN_LOAD and .LEN_STORE seems like a good
approach.  I think it'll be more maintainable in the long run than
trying to have .MASK_LOADs and .MASK_STOREs that need a special mask
operand.  (That would be too similar to VEC_COND_EXPR :-))

Not sure yet what the exact semantics wrt out-of-range values should be
for the IFN/optab though.  Maybe we should instead have some kind of
abstract, target-specific cookie created by a separate intrinsic.
Haven't thought much about it yet...

Thanks,
Richard

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/7] vect: Support vector load/store with length in vectorizer
  2020-05-26  5:57 ` [PATCH 5/7] vect: Support vector load/store with length in vectorizer Kewen.Lin
@ 2020-05-26 12:49   ` Richard Sandiford
  2020-05-26 12:52     ` Richard Sandiford
  2020-05-27  8:25     ` Kewen.Lin
  0 siblings, 2 replies; 80+ messages in thread
From: Richard Sandiford @ 2020-05-26 12:49 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Richard Guenther, Bill Schmidt, dje.gcc, Segher Boessenkool

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> @@ -626,6 +645,12 @@ public:
>    /* True if have decided to use a fully-masked loop.  */
>    bool fully_masked_p;
>  
> +  /* Records whether we still have the option of using a length access loop.  */
> +  bool can_with_length_p;
> +
> +  /* True if have decided to use length access for the loop fully.  */
> +  bool fully_with_length_p;

Rather than duplicate the flags like this, I think we should have
three bits of information:

(1) Can the loop operate on partial vectors?  Starts off optimistically
    assuming "yes", gets set to "no" when we find a counter-example.

(2) If we do decide to use partial vectors, will we need loop masks?

(3) If we do decide to use partial vectors, will we need lengths?

Vectorisation using partial vectors succeeds if (1) && ((2) != (3))

LOOP_VINFO_CAN_FULLY_MASK_P currently tracks (1) and
LOOP_VINFO_MASKS currently tracks (2).  In pathological cases it's
already possible to have (1) && !(2), see r9-6240 for an example.

With the new support, LOOP_VINFO_LENS tracks (3).

So I don't think we need the can_with_length_p.  What is now
LOOP_VINFO_CAN_FULLY_MASK_P can continue to track (1) for both
approaches, with the final choice of approach only being made
at the end.  Maybe it would be worth renaming it to something
more generic though, now that we have two approaches to partial
vectorisation.

I think we can assume for now that no arch will be asymmetrical,
and require (say) loop masks for loads and lengths for stores.
So if that does happen (i.e. if (2) && (3) ends up being true)
we should just be able to punt on partial vectorisation.
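
In pseudo-code, with made-up names rather than the real LOOP_VINFO
accessors, the final decision would then simply be:

  bool can_partial_p;  /* (1): starts true, cleared on a counter-example.  */
  bool want_masks_p;   /* (2): a loop mask was recorded.  */
  bool want_lens_p;    /* (3): a loop length was recorded.  */

  /* Punt if a target would need both masks and lengths.  */
  bool use_partial_vectors_p
    = can_partial_p && (want_masks_p != want_lens_p);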

Some of the new length code looks like it's copied and adjusted from the
corresponding mask code.  It would be good to share the code instead
where possible, e.g. when deciding whether an IV can overflow.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/7] vect: Support vector load/store with length in vectorizer
  2020-05-26 12:49   ` Richard Sandiford
@ 2020-05-26 12:52     ` Richard Sandiford
  2020-05-27  8:25     ` Kewen.Lin
  1 sibling, 0 replies; 80+ messages in thread
From: Richard Sandiford @ 2020-05-26 12:52 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Richard Guenther, Bill Schmidt, dje.gcc, Segher Boessenkool

Richard Sandiford <richard.sandiford@arm.com> writes:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> @@ -626,6 +645,12 @@ public:
>>    /* True if have decided to use a fully-masked loop.  */
>>    bool fully_masked_p;
>>  
>> +  /* Records whether we still have the option of using a length access loop.  */
>> +  bool can_with_length_p;
>> +
>> +  /* True if have decided to use length access for the loop fully.  */
>> +  bool fully_with_length_p;
>
> Rather than duplicate the flags like this, I think we should have
> three bits of information:
>
> (1) Can the loop operate on partial vectors?  Starts off optimistically
>     assuming "yes", gets set to "no" when we find a counter-example.
>
> (2) If we do decide to use partial vectors, will we need loop masks?
>
> (3) If we do decide to use partial vectors, will we need lengths?
>
> Vectorisation using partial vectors succeeds if (1) && ((2) != (3))
>
> LOOP_VINFO_CAN_FULLY_MASK_P currently tracks (1) and
> LOOP_VINFO_MASKS currently tracks (2).  In pathological cases it's
> already possible to have (1) && !(2), see r9-6240 for an example.

Oops, I meant r8-6240.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/7] Support vector load/store with length
  2020-05-26  7:12 ` [PATCH 0/7] Support vector load/store with length Richard Biener
  2020-05-26  8:51   ` Kewen.Lin
@ 2020-05-26 22:34   ` Jim Wilson
  2020-05-27  7:21     ` Richard Biener
  1 sibling, 1 reply; 80+ messages in thread
From: Jim Wilson @ 2020-05-26 22:34 UTC (permalink / raw)
  To: Richard Biener
  Cc: Kewen.Lin, Bill Schmidt, GCC Patches, David Edelsohn, Segher Boessenkool

On Tue, May 26, 2020 at 12:12 AM Richard Biener <rguenther@suse.de> wrote:
> From a look at the series description below you seem to add a new way
> of doing loads for this.  Did you review other ISAs (those I'm not
> familiar with myself too much are SVE, RISC-V and GCN) in GCC whether
> they have similar support and whether your approach can be supported
> there?  ISTR SVE must have some similar support - what's the reason
> you do not piggy-back on that?

There isn't any RISC-V Vector support in GCC yet.  The RVV spec is
still in draft and still occasionally changing in incompatible ways.
We've done some experimenting with gcc patches, but all we have are
intrinsics.  We haven't implemented any auto vectorization support, so
we haven't defined tree representations for anything yet, other than
the types we need for intrinsics support.  But if it looks OK for SVE
then it probably will be OK for RVV.

Jim

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/7] Support vector load/store with length
  2020-05-26 12:29         ` Richard Sandiford
@ 2020-05-27  0:09           ` Segher Boessenkool
  2020-05-27  7:25             ` Richard Biener
  0 siblings, 1 reply; 80+ messages in thread
From: Segher Boessenkool @ 2020-05-27  0:09 UTC (permalink / raw)
  To: Kewen.Lin, Richard Biener, GCC Patches, Bill Schmidt, dje.gcc,
	richard.sandiford

Hi!

On Tue, May 26, 2020 at 01:29:30PM +0100, Richard Sandiford wrote:
> FWIW, I agree adding .LEN_LOAD and .LEN_STORE seems like a good
> approach.  I think it'll be more maintainable in the long run than
> trying to have .MASK_LOADs and .MASK_STOREs that need a special mask
> operand.  (That would be too similar to VEC_COND_EXPR :-))
> 
> Not sure yet what the exact semantics wrt out-of-range values for
> the IFN/optab though.  Maybe we should instead have some kind of
> abstract, target-specific cookie created by a separate intrinsic.
> Haven't thought much about it yet...

Or maybe only support 0..N with N the length of the vector?  It is
pretty important to support 0 and N, but greater than N isn't as
important (it is useful for tricky hand-written code, but not as much
for compiler-generated code -- we only support an 8-bit number here on
Power, maybe that is why ;-) )


Segher

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/7] Support vector load/store with length
  2020-05-26 22:34   ` Jim Wilson
@ 2020-05-27  7:21     ` Richard Biener
  2020-05-27  7:46       ` Richard Sandiford
  0 siblings, 1 reply; 80+ messages in thread
From: Richard Biener @ 2020-05-27  7:21 UTC (permalink / raw)
  To: Jim Wilson
  Cc: Kewen.Lin, Bill Schmidt, GCC Patches, David Edelsohn, Segher Boessenkool

On Tue, 26 May 2020, Jim Wilson wrote:

> On Tue, May 26, 2020 at 12:12 AM Richard Biener <rguenther@suse.de> wrote:
> > From a look at the series description below you seem to add a new way
> > of doing loads for this.  Did you review other ISAs (those I'm not
> > familiar with myself too much are SVE, RISC-V and GCN) in GCC whether
> > they have similar support and whether your approach can be supported
> > there?  ISTR SVE must have some similar support - what's the reason
> > you do not piggy-back on that?
> 
> There isn't any RISC-V Vector support in GCC yet.  The RVV spec is
> still in draft and still occasionally changing in incompatible ways.
> We've done some experimenting with gcc patches, but all we have are
> intrinsics.  We haven't implemented any auto vectorization support, so
> we haven't defined tree representations for anything yet, other than
> the types we need for intrinsics support.  But if it looks OK for SVE
> then it probably will be OK for RVV.

Btw, I'm specifically looking for other load/store-with-length
implementations and whether they agree on taking bytes for
the length rather than, for example, the number of lanes.  I guess
exposing this detail on GIMPLE can help IV selection, but if we'd
ever get a differing semantics ISA we'd have to add another set
of IFNs, so maybe the PPC ones should be named in a more specific
way like _WITH_BYTES or _BYTES or _WITH_BYTE_LENGTH or so to
allow _WITH_LANES?

Richard.

> Jim
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/7] Support vector load/store with length
  2020-05-27  0:09           ` Segher Boessenkool
@ 2020-05-27  7:25             ` Richard Biener
  2020-05-27  8:50               ` Kewen.Lin
  2020-05-27 14:08               ` Segher Boessenkool
  0 siblings, 2 replies; 80+ messages in thread
From: Richard Biener @ 2020-05-27  7:25 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Kewen.Lin, GCC Patches, Bill Schmidt, dje.gcc, richard.sandiford

On Tue, 26 May 2020, Segher Boessenkool wrote:

> Hi!
> 
> On Tue, May 26, 2020 at 01:29:30PM +0100, Richard Sandiford wrote:
> > FWIW, I agree adding .LEN_LOAD and .LEN_STORE seems like a good
> > approach.  I think it'll be more maintainable in the long run than
> > trying to have .MASK_LOADs and .MASK_STOREs that need a special mask
> > operand.  (That would be too similar to VEC_COND_EXPR :-))
> > 
> > Not sure yet what the exact semantics wrt out-of-range values for
> > the IFN/optab though.  Maybe we should instead have some kind of
> > abstract, target-specific cookie created by a separate intrinsic.
> > Haven't thought much about it yet...
> 
> Or maybe only support 0..N with N the length of the vector?  It is
> pretty important to support 0 and N, but greater than N isn't as
> important (it is useful for tricky hand-written code, but not as much
> for compiler-generate code -- we only support an 8-bit number here on
> Power, maybe that is why ;-) )

The question is one of semantics - if Power masks the length to an
8-bit number it's important to preprocess the IV.  As with my
other suggestion the question is what to expose to the IL (to GIMPLE)
here.  Exposing as much as possible will help IV selection but
will eventually require IFN variations for different semantics.

So yes, 0..N sounds about right here and we'll require a MIN ()
operation and likely need to teach IV selection about this to at least
possibly get an IV with the byte size multiplication factored.
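
E.g. with a byte-counting IV the per-iteration work would reduce to a
single MIN, with the multiplication by the element size hoisted out of
the loop.  Rough sketch only, with made-up names:

  #define MIN(a, b) ((a) < (b) ? (a) : (b))
  #define VEC_BYTES 16

  unsigned long nbytes = niters * sizeof (TYPE);  /* hoisted  */
  for (unsigned long byte_iv = 0; byte_iv < nbytes; byte_iv += VEC_BYTES)
    {
      unsigned long len = MIN (nbytes - byte_iv, VEC_BYTES);
      /* ... .LEN_LOAD / .LEN_STORE of LEN bytes at offset BYTE_IV ... */
    }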

Richard.

> 
> Segher
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/7] Support vector load/store with length
  2020-05-27  7:21     ` Richard Biener
@ 2020-05-27  7:46       ` Richard Sandiford
  0 siblings, 0 replies; 80+ messages in thread
From: Richard Sandiford @ 2020-05-27  7:46 UTC (permalink / raw)
  To: Richard Biener
  Cc: Jim Wilson, GCC Patches, Bill Schmidt, Segher Boessenkool,
	David Edelsohn

Richard Biener <rguenther@suse.de> writes:
> On Tue, 26 May 2020, Jim Wilson wrote:
>
>> On Tue, May 26, 2020 at 12:12 AM Richard Biener <rguenther@suse.de> wrote:
>> > From a look at the series description below you seem to add a new way
>> > of doing loads for this.  Did you review other ISAs (those I'm not
>> > familiar with myself too much are SVE, RISC-V and GCN) in GCC whether
>> > they have similar support and whether your approach can be supported
>> > there?  ISTR SVE must have some similar support - what's the reason
>> > you do not piggy-back on that?
>> 
>> There isn't any RISC-V Vector support in GCC yet.  The RVV spec is
>> still in draft and still occasionally changing in incompatible ways.
>> We've done some experimenting with gcc patches, but all we have are
>> intrinsics.  We haven't implemented any auto vectorization support, so
>> we haven't defined tree representations for anything yet, other than
>> the types we need for intrinsics support.  But if it looks OK for SVE
>> then it probably will be OK for RVV.
>
> Btw, I'm specifically looking for other load/store with length
> implementations and whether they agree on taking bytes for the
> length rather than, for example, the number of lanes.  I guess
> exposing this detail on GIMPLE can help IV selection, but if we
> ever get an ISA with differing semantics we'd have to add another
> set of IFNs, so maybe the PPC ones should be named in a more
> specific way like _WITH_BYTES or _BYTES or _WITH_BYTE_LENGTH or
> so, to leave room for _WITH_LANES?

Maybe that detail is another thing that a cookie could hide.  We'd then
potentially need one IFN per approach to calculating the length parameter
(bytes vs. elements, self-capping vs. explicit capping, etc.), but it would
only be one IFN per approach, rather than the combinatorial explosion
we'd get from one IFN per approach*load/store-kind.

It doesn't make much difference when we only have one LOAD and one STORE
per approach.  But I imagine this will be useful for MVE, and there we'll
want extending loads, truncating stores, gathers and scatters too.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/7] vect: Support vector load/store with length in vectorizer
  2020-05-26 12:49   ` Richard Sandiford
  2020-05-26 12:52     ` Richard Sandiford
@ 2020-05-27  8:25     ` Kewen.Lin
  2020-05-27 10:02       ` Richard Sandiford
  1 sibling, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-05-27  8:25 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford
  Cc: Richard Guenther, Bill Schmidt, dje.gcc, Segher Boessenkool

[-- Attachment #1: Type: text/plain, Size: 3820 bytes --]

Hi Richard,

Thanks for your comments!

on 2020/5/26 8:49 PM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> @@ -626,6 +645,12 @@ public:
>>    /* True if have decided to use a fully-masked loop.  */
>>    bool fully_masked_p;
>>  
>> +  /* Records whether we still have the option of using a length access loop.  */
>> +  bool can_with_length_p;
>> +
>> +  /* True if have decided to use length access for the loop fully.  */
>> +  bool fully_with_length_p;
> 
> Rather than duplicate the flags like this, I think we should have
> three bits of information:
> 
> (1) Can the loop operate on partial vectors?  Starts off optimistically
>     assuming "yes", gets set to "no" when we find a counter-example.
> 
> (2) If we do decide to use partial vectors, will we need loop masks?
> 
> (3) If we do decide to use partial vectors, will we need lengths?
> 
> Vectorisation using partial vectors succeeds if (1) && ((2) != (3))
> 
> LOOP_VINFO_CAN_FULLY_MASK_P currently tracks (1) and
> LOOP_VINFO_MASKS currently tracks (2).  In pathological cases it's
> already possible to have (1) && !(2), see r9-6240 for an example.
> 
> With the new support, LOOP_VINFO_LENS tracks (3).
> 
> So I don't think we need the can_with_length_p.  What is now
> LOOP_VINFO_CAN_FULLY_MASK_P can continue to track (1) for both
> approaches, with the final choice of approach only being made
> at the end.  Maybe it would be worth renaming it to something
> more generic though, now that we have two approaches to partial
> vectorisation.

I like this idea!  I could be wrong, but I'm afraid we cannot share
one common flag between both approaches: the check criteria differ,
and a counter-example for length may still be acceptable for masking.
For instance, length only allows CONTIGUOUS-related access modes,
while masking can support more.  When we see an acceptable
VMAT_LOAD_STORE_LANES access we leave LOOP_VINFO_CAN_FULLY_MASK_P
true; should the later length checking then turn it to false?  I
guess not.  But if it stays true, LOOP_VINFO_CAN_FULLY_MASK_P would
mean partial vectorization for masking only, not for both.  We could
clear LOOP_VINFO_LENS when the length checking fails, but an empty
vector only tells us that no lengths were recorded, not that partial
vectorization with length is impossible; whenever we see
LOOP_VINFO_CAN_FULLY_MASK_P true, we could still record lengths into
it if possible.
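
Just to spell out the proposed check in one place (illustrative C
with a made-up helper name; in the patch this information lives in
LOOP_VINFO_CAN_FULLY_MASK_P, LOOP_VINFO_MASKS and LOOP_VINFO_LENS
rather than three plain booleans):

  #include <stdbool.h>

  /* (1) can the loop use partial vectors, (2) do we need masks,
     (3) do we need lengths.  */
  static bool
  use_partial_vectors_p (bool can_partial_p, bool need_masks_p,
                         bool need_lens_p)
  {
    /* (1) && ((2) != (3)): exactly one of the two mechanisms.  */
    return can_partial_p && (need_masks_p != need_lens_p);
  }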

> 
> I think we can assume for now that no arch will be asymmetrical,
> and require (say) loop masks for loads and lengths for stores.
> So if that does happen (i.e. if (2) && (3) ends up being true)
> we should just be able to punt on partial vectorisation.
> 

Agreed.  The current implementation prefers masking: if the loop is
fully_masked, we disable vector with length.

> Some of the new length code looks like it's copied and adjusted from the
> corresponding mask code.  It would be good to share the code instead
> where possible, e.g. when deciding whether an IV can overflow.
> 

Yes, some refactoring can be done; it's on my to-do list, and I'll
give it priority as you suggest.

V2 attached with some changes against V1:
  1) use rgroup_objs for both masks and lengths.
  2) merge both mask and length handling into
     vect_set_loop_condition_partial, which is renamed and extended
     from vect_set_loop_condition_masked.
  3) rename and update vect_set_loop_masks_directly to
     vect_set_loop_objs_directly.
  4) rename vect_set_loop_condition_unmasked to
     vect_set_loop_condition_normal.
  5) factor out min_prec_for_max_niters.
  6) add macro LOOP_VINFO_PARTIAL_VECT_P since a few places need to check
     (LOOP_VINFO_FULLY_MASKED_P || LOOP_VINFO_FULLY_WITH_LENGTH_P).

Tested with the ppc64le test cases; I will post an update with the
changelog if everything goes well.
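
For reference, the kind of loop this support targets is an ordinary
contiguous-access loop like the one below (illustration only, not one
of the new testcases).  With --param vect-with-length-scope=1 the
length-based accesses would cover loops/epilogues with fewer than VF
iterations, and with =2 any loop where possible:

  /* Plain C; the vectorizer would use IFN_LEN_LOAD/IFN_LEN_STORE for
     the iterations that do not fill a whole vector.  */
  void
  add_one (int *restrict a, const int *restrict b, int n)
  {
    for (int i = 0; i < n; i++)
      a[i] = b[i] + 1;
  }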

BR,
Kewen

[-- Attachment #2: 0005-vector-with-length-v2.patch --]
[-- Type: text/plain, Size: 56551 bytes --]

---
 gcc/doc/invoke.texi        |   7 +
 gcc/params.opt             |   4 +
 gcc/tree-vect-loop-manip.c | 266 ++++++++++++++++++-------------
 gcc/tree-vect-loop.c       | 311 ++++++++++++++++++++++++++++++++-----
 gcc/tree-vect-stmts.c      | 152 ++++++++++++++++++
 gcc/tree-vectorizer.h      |  57 +++++--
 6 files changed, 639 insertions(+), 158 deletions(-)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 8b9935dfe65..ac765feab13 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -13079,6 +13079,13 @@ by the copy loop headers pass.
 @item vect-epilogues-nomask
 Enable loop epilogue vectorization using smaller vector size.
 
+@item vect-with-length-scope
+Control the scope of vector memory access with length exploitation.  0 means we
+don't exploit any vector memory access with length, 1 means we only exploit
+vector memory access with length for those loops whose iteration number is
+less than VF, such as a very small loop or an epilogue, and 2 means we exploit
+vector memory access with length for any loops if possible.
+
 @item slp-max-insns-in-bb
 Maximum number of instructions in basic block to be
 considered for SLP vectorization.
diff --git a/gcc/params.opt b/gcc/params.opt
index 4aec480798b..d4309101067 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -964,4 +964,8 @@ Bound on number of runtime checks inserted by the vectorizer's loop versioning f
 Common Joined UInteger Var(param_vect_max_version_for_alignment_checks) Init(6) Param Optimization
 Bound on number of runtime checks inserted by the vectorizer's loop versioning for alignment check.
 
+-param=vect-with-length-scope=
+Common Joined UInteger Var(param_vect_with_length_scope) Init(0) IntegerRange(0, 2) Param Optimization
+Control the vector with length exploitation scope.
+
 ; This comment is to ensure we retain the blank line above.
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 8c5e696b995..0a5770c7d28 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -256,17 +256,17 @@ adjust_phi_and_debug_stmts (gimple *update_phi, edge e, tree new_def)
 			gimple_bb (update_phi));
 }
 
-/* Define one loop mask MASK from loop LOOP.  INIT_MASK is the value that
-   the mask should have during the first iteration and NEXT_MASK is the
+/* Define one loop mask/length OBJ from loop LOOP.  INIT_OBJ is the value that
+   the mask/length should have during the first iteration and NEXT_OBJ is the
    value that it should have on subsequent iterations.  */
 
 static void
-vect_set_loop_mask (class loop *loop, tree mask, tree init_mask,
-		    tree next_mask)
+vect_set_loop_mask_or_len (class loop *loop, tree obj, tree init_obj,
+			   tree next_obj)
 {
-  gphi *phi = create_phi_node (mask, loop->header);
-  add_phi_arg (phi, init_mask, loop_preheader_edge (loop), UNKNOWN_LOCATION);
-  add_phi_arg (phi, next_mask, loop_latch_edge (loop), UNKNOWN_LOCATION);
+  gphi *phi = create_phi_node (obj, loop->header);
+  add_phi_arg (phi, init_obj, loop_preheader_edge (loop), UNKNOWN_LOCATION);
+  add_phi_arg (phi, next_obj, loop_latch_edge (loop), UNKNOWN_LOCATION);
 }
 
 /* Add SEQ to the end of LOOP's preheader block.  */
@@ -320,8 +320,8 @@ interleave_supported_p (vec_perm_indices *indices, tree vectype,
    latter.  Return true on success, adding any new statements to SEQ.  */
 
 static bool
-vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_masks *dest_rgm,
-			       rgroup_masks *src_rgm)
+vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_objs *dest_rgm,
+			       rgroup_objs *src_rgm)
 {
   tree src_masktype = src_rgm->mask_type;
   tree dest_masktype = dest_rgm->mask_type;
@@ -338,10 +338,10 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_masks *dest_rgm,
       machine_mode dest_mode = insn_data[icode1].operand[0].mode;
       gcc_assert (dest_mode == insn_data[icode2].operand[0].mode);
       tree unpack_masktype = vect_halve_mask_nunits (src_masktype, dest_mode);
-      for (unsigned int i = 0; i < dest_rgm->masks.length (); ++i)
+      for (unsigned int i = 0; i < dest_rgm->objs.length (); ++i)
 	{
-	  tree src = src_rgm->masks[i / 2];
-	  tree dest = dest_rgm->masks[i];
+	  tree src = src_rgm->objs[i / 2];
+	  tree dest = dest_rgm->objs[i];
 	  tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
 			    ? VEC_UNPACK_HI_EXPR
 			    : VEC_UNPACK_LO_EXPR);
@@ -371,10 +371,10 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_masks *dest_rgm,
       tree masks[2];
       for (unsigned int i = 0; i < 2; ++i)
 	masks[i] = vect_gen_perm_mask_checked (src_masktype, indices[i]);
-      for (unsigned int i = 0; i < dest_rgm->masks.length (); ++i)
+      for (unsigned int i = 0; i < dest_rgm->objs.length (); ++i)
 	{
-	  tree src = src_rgm->masks[i / 2];
-	  tree dest = dest_rgm->masks[i];
+	  tree src = src_rgm->objs[i / 2];
+	  tree dest = dest_rgm->objs[i];
 	  gimple *stmt = gimple_build_assign (dest, VEC_PERM_EXPR,
 					      src, src, masks[i & 1]);
 	  gimple_seq_add_stmt (seq, stmt);
@@ -384,60 +384,80 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_masks *dest_rgm,
   return false;
 }
 
-/* Helper for vect_set_loop_condition_masked.  Generate definitions for
-   all the masks in RGM and return a mask that is nonzero when the loop
+/* Helper for vect_set_loop_condition_partial.  Generate definitions for
+   all the objs in RGO and return an obj that is nonzero when the loop
    needs to iterate.  Add any new preheader statements to PREHEADER_SEQ.
    Use LOOP_COND_GSI to insert code before the exit gcond.
 
-   RGM belongs to loop LOOP.  The loop originally iterated NITERS
+   RGO belongs to loop LOOP.  The loop originally iterated NITERS
    times and has been vectorized according to LOOP_VINFO.
 
    If NITERS_SKIP is nonnull, the first iteration of the vectorized loop
    starts with NITERS_SKIP dummy iterations of the scalar loop before
-   the real work starts.  The mask elements for these dummy iterations
+   the real work starts.  The obj elements for these dummy iterations
    must be 0, to ensure that the extra iterations do not have an effect.
 
    It is known that:
 
-     NITERS * RGM->max_nscalars_per_iter
+     NITERS * RGO->max_nscalars_per_iter
 
    does not overflow.  However, MIGHT_WRAP_P says whether an induction
    variable that starts at 0 and has step:
 
-     VF * RGM->max_nscalars_per_iter
+     VF * RGO->max_nscalars_per_iter
 
    might overflow before hitting a value above:
 
-     (NITERS + NITERS_SKIP) * RGM->max_nscalars_per_iter
+     (NITERS + NITERS_SKIP) * RGO->max_nscalars_per_iter
 
    This means that we cannot guarantee that such an induction variable
-   would ever hit a value that produces a set of all-false masks for RGM.  */
+   would ever hit a value that produces a set of all-false masks or
+   zero byte length for RGO.  */
 
 static tree
-vect_set_loop_masks_directly (class loop *loop, loop_vec_info loop_vinfo,
+vect_set_loop_objs_directly (class loop *loop, loop_vec_info loop_vinfo,
 			      gimple_seq *preheader_seq,
 			      gimple_stmt_iterator loop_cond_gsi,
-			      rgroup_masks *rgm, tree niters, tree niters_skip,
+			      rgroup_objs *rgo, tree niters, tree niters_skip,
 			      bool might_wrap_p)
 {
   tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);
   tree iv_type = LOOP_VINFO_MASK_IV_TYPE (loop_vinfo);
-  tree mask_type = rgm->mask_type;
-  unsigned int nscalars_per_iter = rgm->max_nscalars_per_iter;
-  poly_uint64 nscalars_per_mask = TYPE_VECTOR_SUBPARTS (mask_type);
+
+  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
+  if (!vect_for_masking)
+    {
+      /* Obtain target supported length type.  */
+      scalar_int_mode len_mode = targetm.vectorize.length_mode;
+      unsigned int len_prec = GET_MODE_PRECISION (len_mode);
+      compare_type = build_nonstandard_integer_type (len_prec, true);
+      /* Simply set iv_type to be the same as compare_type.  */
+      iv_type = compare_type;
+    }
+
+  tree obj_type = rgo->mask_type;
+  /* Here, take nscalars_per_iter as nbytes_per_iter for length.  */
+  unsigned int nscalars_per_iter = rgo->max_nscalars_per_iter;
+  poly_uint64 nscalars_per_obj = TYPE_VECTOR_SUBPARTS (obj_type);
+  poly_uint64 vector_size = GET_MODE_SIZE (TYPE_MODE (obj_type));
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  tree vec_size = NULL_TREE;
+  /* For length, we probably need vec_size to check length in range.  */
+  if (!vect_for_masking)
+    vec_size = build_int_cst (compare_type, vector_size);
 
   /* Calculate the maximum number of scalar values that the rgroup
      handles in total, the number that it handles for each iteration
      of the vector loop, and the number that it should skip during the
-     first iteration of the vector loop.  */
+     first iteration of the vector loop.  For vector with length, take
+     scalar values as bytes.  */
   tree nscalars_total = niters;
   tree nscalars_step = build_int_cst (iv_type, vf);
   tree nscalars_skip = niters_skip;
   if (nscalars_per_iter != 1)
     {
-      /* We checked before choosing to use a fully-masked loop that these
-	 multiplications don't overflow.  */
+      /* We checked before choosing to use a fully-masked or fully with length
+	 loop that these multiplications don't overflow.  */
       tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
       tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
       nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
@@ -541,28 +561,28 @@ vect_set_loop_masks_directly (class loop *loop, loop_vec_info loop_vinfo,
   test_index = gimple_convert (&test_seq, compare_type, test_index);
   gsi_insert_seq_before (test_gsi, test_seq, GSI_SAME_STMT);
 
-  /* Provide a definition of each mask in the group.  */
-  tree next_mask = NULL_TREE;
-  tree mask;
+  /* Provide a definition of each obj in the group.  */
+  tree next_obj = NULL_TREE;
+  tree obj;
   unsigned int i;
-  FOR_EACH_VEC_ELT_REVERSE (rgm->masks, i, mask)
+  poly_uint64 batch_cnt = vect_for_masking ? nscalars_per_obj : vector_size;
+  FOR_EACH_VEC_ELT_REVERSE (rgo->objs, i, obj)
     {
-      /* Previous masks will cover BIAS scalars.  This mask covers the
+      /* Previous objs will cover BIAS scalars.  This obj covers the
 	 next batch.  */
-      poly_uint64 bias = nscalars_per_mask * i;
+      poly_uint64 bias = batch_cnt * i;
       tree bias_tree = build_int_cst (compare_type, bias);
-      gimple *tmp_stmt;
 
       /* See whether the first iteration of the vector loop is known
-	 to have a full mask.  */
+	 to have a full mask or length.  */
       poly_uint64 const_limit;
       bool first_iteration_full
 	= (poly_int_tree_p (first_limit, &const_limit)
-	   && known_ge (const_limit, (i + 1) * nscalars_per_mask));
+	   && known_ge (const_limit, (i + 1) * batch_cnt));
 
       /* Rather than have a new IV that starts at BIAS and goes up to
 	 TEST_LIMIT, prefer to use the same 0-based IV for each mask
-	 and adjust the bound down by BIAS.  */
+	 or length and adjust the bound down by BIAS.  */
       tree this_test_limit = test_limit;
       if (i != 0)
 	{
@@ -574,9 +594,9 @@ vect_set_loop_masks_directly (class loop *loop, loop_vec_info loop_vinfo,
 					  bias_tree);
 	}
 
-      /* Create the initial mask.  First include all scalars that
+      /* Create the initial obj.  First include all scalars that
 	 are within the loop limit.  */
-      tree init_mask = NULL_TREE;
+      tree init_obj = NULL_TREE;
       if (!first_iteration_full)
 	{
 	  tree start, end;
@@ -598,9 +618,18 @@ vect_set_loop_masks_directly (class loop *loop, loop_vec_info loop_vinfo,
 	      end = first_limit;
 	    }
 
-	  init_mask = make_temp_ssa_name (mask_type, NULL, "max_mask");
-	  tmp_stmt = vect_gen_while (init_mask, start, end);
-	  gimple_seq_add_stmt (preheader_seq, tmp_stmt);
+	  if (vect_for_masking)
+	    {
+	      init_obj = make_temp_ssa_name (obj_type, NULL, "max_mask");
+	      gimple *tmp_stmt = vect_gen_while (init_obj, start, end);
+	      gimple_seq_add_stmt (preheader_seq, tmp_stmt);
+	    }
+	  else
+	    {
+	      init_obj = make_temp_ssa_name (compare_type, NULL, "max_len");
+	      gimple_seq seq = vect_gen_len (init_obj, start, end, vec_size);
+	      gimple_seq_add_seq (preheader_seq, seq);
+	    }
 	}
 
       /* Now AND out the bits that are within the number of skipped
@@ -610,51 +639,76 @@ vect_set_loop_masks_directly (class loop *loop, loop_vec_info loop_vinfo,
 	  && !(poly_int_tree_p (nscalars_skip, &const_skip)
 	       && known_le (const_skip, bias)))
 	{
-	  tree unskipped_mask = vect_gen_while_not (preheader_seq, mask_type,
+	  tree unskipped_mask = vect_gen_while_not (preheader_seq, obj_type,
 						    bias_tree, nscalars_skip);
-	  if (init_mask)
-	    init_mask = gimple_build (preheader_seq, BIT_AND_EXPR, mask_type,
-				      init_mask, unskipped_mask);
+	  if (init_obj)
+	    init_obj = gimple_build (preheader_seq, BIT_AND_EXPR, obj_type,
+				      init_obj, unskipped_mask);
 	  else
-	    init_mask = unskipped_mask;
+	    init_obj = unskipped_mask;
+	  gcc_assert (vect_for_masking);
 	}
 
-      if (!init_mask)
-	/* First iteration is full.  */
-	init_mask = build_minus_one_cst (mask_type);
+      /* First iteration is full.  */
+      if (!init_obj)
+	{
+	  if (vect_for_masking)
+	    init_obj = build_minus_one_cst (obj_type);
+	  else
+	    init_obj = vec_size;
+	}
 
-      /* Get the mask value for the next iteration of the loop.  */
-      next_mask = make_temp_ssa_name (mask_type, NULL, "next_mask");
-      gcall *call = vect_gen_while (next_mask, test_index, this_test_limit);
-      gsi_insert_before (test_gsi, call, GSI_SAME_STMT);
+      /* Get the obj value for the next iteration of the loop.  */
+      if (vect_for_masking)
+	{
+	  next_obj = make_temp_ssa_name (obj_type, NULL, "next_mask");
+	  gcall *call = vect_gen_while (next_obj, test_index, this_test_limit);
+	  gsi_insert_before (test_gsi, call, GSI_SAME_STMT);
+	}
+      else
+	{
+	  next_obj = make_temp_ssa_name (compare_type, NULL, "next_len");
+	  tree end = this_test_limit;
+	  gimple_seq seq = vect_gen_len (next_obj, test_index, end, vec_size);
+	  gsi_insert_seq_before (test_gsi, seq, GSI_SAME_STMT);
+	}
 
-      vect_set_loop_mask (loop, mask, init_mask, next_mask);
+      vect_set_loop_mask_or_len (loop, obj, init_obj, next_obj);
     }
-  return next_mask;
+  return next_obj;
 }
 
-/* Make LOOP iterate NITERS times using masking and WHILE_ULT calls.
-   LOOP_VINFO describes the vectorization of LOOP.  NITERS is the
-   number of iterations of the original scalar loop that should be
-   handled by the vector loop.  NITERS_MAYBE_ZERO and FINAL_IV are
-   as for vect_set_loop_condition.
+/* Make LOOP iterate NITERS times using objects like masks (and
+   WHILE_ULT calls) or lengths.  LOOP_VINFO describes the vectorization
+   of LOOP.  NITERS is the number of iterations of the original scalar
+   loop that should be handled by the vector loop.  NITERS_MAYBE_ZERO
+   and FINAL_IV are as for vect_set_loop_condition.
 
    Insert the branch-back condition before LOOP_COND_GSI and return the
    final gcond.  */
 
 static gcond *
-vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo,
-				tree niters, tree final_iv,
-				bool niters_maybe_zero,
-				gimple_stmt_iterator loop_cond_gsi)
+vect_set_loop_condition_partial (class loop *loop, loop_vec_info loop_vinfo,
+				 tree niters, tree final_iv,
+				 bool niters_maybe_zero,
+				 gimple_stmt_iterator loop_cond_gsi)
 {
   gimple_seq preheader_seq = NULL;
   gimple_seq header_seq = NULL;
 
+  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
+
   tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);
+  if (!vect_for_masking)
+    {
+      /* Obtain target supported length type as compare_type.  */
+      scalar_int_mode len_mode = targetm.vectorize.length_mode;
+      unsigned len_prec = GET_MODE_PRECISION (len_mode);
+      compare_type = build_nonstandard_integer_type (len_prec, true);
+    }
   unsigned int compare_precision = TYPE_PRECISION (compare_type);
-  tree orig_niters = niters;
 
+  tree orig_niters = niters;
   /* Type of the initial value of NITERS.  */
   tree ni_actual_type = TREE_TYPE (niters);
   unsigned int ni_actual_precision = TYPE_PRECISION (ni_actual_type);
@@ -677,42 +731,45 @@ vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo,
   else
     niters = gimple_convert (&preheader_seq, compare_type, niters);
 
-  widest_int iv_limit = vect_iv_limit_for_full_masking (loop_vinfo);
+  widest_int iv_limit = vect_iv_limit_for_partial_vect (loop_vinfo);
 
-  /* Iterate over all the rgroups and fill in their masks.  We could use
-     the first mask from any rgroup for the loop condition; here we
+  /* Iterate over all the rgroups and fill in their objs.  We could use
+     the first obj from any rgroup for the loop condition; here we
      arbitrarily pick the last.  */
-  tree test_mask = NULL_TREE;
-  rgroup_masks *rgm;
+  tree test_obj = NULL_TREE;
+  rgroup_objs *rgo;
   unsigned int i;
-  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
-  FOR_EACH_VEC_ELT (*masks, i, rgm)
-    if (!rgm->masks.is_empty ())
+  auto_vec<rgroup_objs> *objs = vect_for_masking
+				  ? &LOOP_VINFO_MASKS (loop_vinfo)
+				  : &LOOP_VINFO_LENS (loop_vinfo);
+
+  FOR_EACH_VEC_ELT (*objs, i, rgo)
+    if (!rgo->objs.is_empty ())
       {
 	/* First try using permutes.  This adds a single vector
 	   instruction to the loop for each mask, but needs no extra
 	   loop invariants or IVs.  */
 	unsigned int nmasks = i + 1;
-	if ((nmasks & 1) == 0)
+	if (vect_for_masking && (nmasks & 1) == 0)
 	  {
-	    rgroup_masks *half_rgm = &(*masks)[nmasks / 2 - 1];
-	    if (!half_rgm->masks.is_empty ()
-		&& vect_maybe_permute_loop_masks (&header_seq, rgm, half_rgm))
+	    rgroup_objs *half_rgo = &(*objs)[nmasks / 2 - 1];
+	    if (!half_rgo->objs.is_empty ()
+		&& vect_maybe_permute_loop_masks (&header_seq, rgo, half_rgo))
 	      continue;
 	  }
 
 	/* See whether zero-based IV would ever generate all-false masks
-	   before wrapping around.  */
+	   or zero byte length before wrapping around.  */
 	bool might_wrap_p
 	  = (iv_limit == -1
-	     || (wi::min_precision (iv_limit * rgm->max_nscalars_per_iter,
+	     || (wi::min_precision (iv_limit * rgo->max_nscalars_per_iter,
 				    UNSIGNED)
 		 > compare_precision));
 
-	/* Set up all masks for this group.  */
-	test_mask = vect_set_loop_masks_directly (loop, loop_vinfo,
+	/* Set up all masks/lengths for this group.  */
+	test_obj = vect_set_loop_objs_directly (loop, loop_vinfo,
 						  &preheader_seq,
-						  loop_cond_gsi, rgm,
+						  loop_cond_gsi, rgo,
 						  niters, niters_skip,
 						  might_wrap_p);
       }
@@ -724,8 +781,8 @@ vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo,
   /* Get a boolean result that tells us whether to iterate.  */
   edge exit_edge = single_exit (loop);
   tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? EQ_EXPR : NE_EXPR;
-  tree zero_mask = build_zero_cst (TREE_TYPE (test_mask));
-  gcond *cond_stmt = gimple_build_cond (code, test_mask, zero_mask,
+  tree zero_obj = build_zero_cst (TREE_TYPE (test_obj));
+  gcond *cond_stmt = gimple_build_cond (code, test_obj, zero_obj,
 					NULL_TREE, NULL_TREE);
   gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
 
@@ -748,13 +805,12 @@ vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo,
 }
 
 /* Like vect_set_loop_condition, but handle the case in which there
-   are no loop masks.  */
+   are no loop masks/lengths.  */
 
 static gcond *
-vect_set_loop_condition_unmasked (class loop *loop, tree niters,
-				  tree step, tree final_iv,
-				  bool niters_maybe_zero,
-				  gimple_stmt_iterator loop_cond_gsi)
+vect_set_loop_condition_normal (class loop *loop, tree niters, tree step,
+			      tree final_iv, bool niters_maybe_zero,
+			      gimple_stmt_iterator loop_cond_gsi)
 {
   tree indx_before_incr, indx_after_incr;
   gcond *cond_stmt;
@@ -912,14 +968,14 @@ vect_set_loop_condition (class loop *loop, loop_vec_info loop_vinfo,
   gcond *orig_cond = get_loop_exit_condition (loop);
   gimple_stmt_iterator loop_cond_gsi = gsi_for_stmt (orig_cond);
 
-  if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
-    cond_stmt = vect_set_loop_condition_masked (loop, loop_vinfo, niters,
-						final_iv, niters_maybe_zero,
-						loop_cond_gsi);
+  if (loop_vinfo && LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo))
+    cond_stmt
+      = vect_set_loop_condition_partial (loop, loop_vinfo, niters, final_iv,
+					 niters_maybe_zero, loop_cond_gsi);
   else
-    cond_stmt = vect_set_loop_condition_unmasked (loop, niters, step,
-						  final_iv, niters_maybe_zero,
-						  loop_cond_gsi);
+    cond_stmt
+      = vect_set_loop_condition_normal (loop, niters, step, final_iv,
+					niters_maybe_zero, loop_cond_gsi);
 
   /* Remove old loop exit test.  */
   stmt_vec_info orig_cond_info;
@@ -1938,8 +1994,7 @@ vect_gen_vector_loop_niters (loop_vec_info loop_vinfo, tree niters,
     ni_minus_gap = niters;
 
   unsigned HOST_WIDE_INT const_vf;
-  if (vf.is_constant (&const_vf)
-      && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  if (vf.is_constant (&const_vf) && !LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo))
     {
       /* Create: niters >> log2(vf) */
       /* If it's known that niters == number of latch executions + 1 doesn't
@@ -2471,7 +2526,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   poly_uint64 bound_epilog = 0;
-  if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+  if (!LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo)
       && LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
     bound_epilog += vf - 1;
   if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
@@ -2567,7 +2622,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   if (vect_epilogues
       && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
       && prolog_peeling >= 0
-      && known_eq (vf, lowest_vf))
+      && known_eq (vf, lowest_vf)
+      && !LOOP_VINFO_FULLY_WITH_LENGTH_P (epilogue_vinfo))
     {
       unsigned HOST_WIDE_INT eiters
 	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 80e33b61be7..cbf498e87dd 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -815,6 +815,8 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     vectorizable (false),
     can_fully_mask_p (true),
     fully_masked_p (false),
+    can_with_length_p (param_vect_with_length_scope != 0),
+    fully_with_length_p (false),
     peeling_for_gaps (false),
     peeling_for_niter (false),
     no_data_dependencies (false),
@@ -880,13 +882,25 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
 void
 release_vec_loop_masks (vec_loop_masks *masks)
 {
-  rgroup_masks *rgm;
+  rgroup_objs *rgm;
   unsigned int i;
   FOR_EACH_VEC_ELT (*masks, i, rgm)
-    rgm->masks.release ();
+    rgm->objs.release ();
   masks->release ();
 }
 
+/* Free all levels of LENS.  */
+
+void
+release_vec_loop_lens (vec_loop_lens *lens)
+{
+  rgroup_objs *rgl;
+  unsigned int i;
+  FOR_EACH_VEC_ELT (*lens, i, rgl)
+    rgl->objs.release ();
+  lens->release ();
+}
+
 /* Free all memory used by the _loop_vec_info, as well as all the
    stmt_vec_info structs of all the stmts in the loop.  */
 
@@ -895,6 +909,7 @@ _loop_vec_info::~_loop_vec_info ()
   free (bbs);
 
   release_vec_loop_masks (&masks);
+  release_vec_loop_lens (&lens);
   delete ivexpr_map;
   delete scan_map;
   epilogue_vinfos.release ();
@@ -935,7 +950,7 @@ cse_and_gimplify_to_preheader (loop_vec_info loop_vinfo, tree expr)
 static bool
 can_produce_all_loop_masks_p (loop_vec_info loop_vinfo, tree cmp_type)
 {
-  rgroup_masks *rgm;
+  rgroup_objs *rgm;
   unsigned int i;
   FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm)
     if (rgm->mask_type != NULL_TREE
@@ -954,12 +969,40 @@ vect_get_max_nscalars_per_iter (loop_vec_info loop_vinfo)
 {
   unsigned int res = 1;
   unsigned int i;
-  rgroup_masks *rgm;
+  rgroup_objs *rgm;
   FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm)
     res = MAX (res, rgm->max_nscalars_per_iter);
   return res;
 }
 
+/* Calculate the minimal bits necessary to represent the maximal iteration
+   count of loop with loop_vec_info LOOP_VINFO which is scaling with a given
+   factor FACTOR.  */
+
+static unsigned
+min_prec_for_max_niters (loop_vec_info loop_vinfo, unsigned int factor)
+{
+  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+
+  /* Get the maximum number of iterations that is representable
+     in the counter type.  */
+  tree ni_type = TREE_TYPE (LOOP_VINFO_NITERSM1 (loop_vinfo));
+  widest_int max_ni = wi::to_widest (TYPE_MAX_VALUE (ni_type)) + 1;
+
+  /* Get a more refined estimate for the number of iterations.  */
+  widest_int max_back_edges;
+  if (max_loop_iterations (loop, &max_back_edges))
+    max_ni = wi::smin (max_ni, max_back_edges + 1);
+
+  /* Account for factor, in which each bit is replicated N times.  */
+  max_ni *= factor;
+
+  /* Work out how many bits we need to represent the limit.  */
+  unsigned int min_ni_width = wi::min_precision (max_ni, UNSIGNED);
+
+  return min_ni_width;
+}
+
 /* Each statement in LOOP_VINFO can be masked where necessary.  Check
    whether we can actually generate the masks required.  Return true if so,
    storing the type of the scalar IV in LOOP_VINFO_MASK_COMPARE_TYPE.  */
@@ -967,7 +1010,6 @@ vect_get_max_nscalars_per_iter (loop_vec_info loop_vinfo)
 static bool
 vect_verify_full_masking (loop_vec_info loop_vinfo)
 {
-  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   unsigned int min_ni_width;
   unsigned int max_nscalars_per_iter
     = vect_get_max_nscalars_per_iter (loop_vinfo);
@@ -978,27 +1020,14 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
   if (LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
     return false;
 
-  /* Get the maximum number of iterations that is representable
-     in the counter type.  */
-  tree ni_type = TREE_TYPE (LOOP_VINFO_NITERSM1 (loop_vinfo));
-  widest_int max_ni = wi::to_widest (TYPE_MAX_VALUE (ni_type)) + 1;
-
-  /* Get a more refined estimate for the number of iterations.  */
-  widest_int max_back_edges;
-  if (max_loop_iterations (loop, &max_back_edges))
-    max_ni = wi::smin (max_ni, max_back_edges + 1);
-
-  /* Account for rgroup masks, in which each bit is replicated N times.  */
-  max_ni *= max_nscalars_per_iter;
-
   /* Work out how many bits we need to represent the limit.  */
-  min_ni_width = wi::min_precision (max_ni, UNSIGNED);
+  min_ni_width = min_prec_for_max_niters (loop_vinfo, max_nscalars_per_iter);
 
   /* Find a scalar mode for which WHILE_ULT is supported.  */
   opt_scalar_int_mode cmp_mode_iter;
   tree cmp_type = NULL_TREE;
   tree iv_type = NULL_TREE;
-  widest_int iv_limit = vect_iv_limit_for_full_masking (loop_vinfo);
+  widest_int iv_limit = vect_iv_limit_for_partial_vect (loop_vinfo);
   unsigned int iv_precision = UINT_MAX;
 
   if (iv_limit != -1)
@@ -1056,6 +1085,33 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
   return true;
 }
 
+/* Check whether we can use vector access with length based on precision
+   comparison.  So far, to keep it simple, we only allow the case that the
+   precision of the target supported length is not smaller than the precision
+   required by loop niters.  */
+
+static bool
+vect_verify_loop_lens (loop_vec_info loop_vinfo)
+{
+  vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
+
+  if (LOOP_VINFO_LENS (loop_vinfo).is_empty ())
+    return false;
+
+  /* The rgroup with the largest nV should have the max bytes per iter.  */
+  rgroup_objs *rgl = &(*lens)[lens->length () - 1];
+
+  /* Work out how many bits we need to represent the limit.  */
+  unsigned int min_ni_width
+    = min_prec_for_max_niters (loop_vinfo, rgl->nbytes_per_iter);
+
+  unsigned len_bits = GET_MODE_PRECISION (targetm.vectorize.length_mode);
+  if (len_bits < min_ni_width)
+    return false;
+
+  return true;
+}
+
 /* Calculate the cost of one scalar iteration of the loop.  */
 static void
 vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo)
@@ -1628,9 +1684,9 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   unsigned int assumed_vf = vect_vf_for_cost (loop_vinfo);
 
-  /* Only fully-masked loops can have iteration counts less than the
-     vectorization factor.  */
-  if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  /* Only fully-masked or fully with length loops can have iteration counts less
+     than the vectorization factor.  */
+  if (!LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo))
     {
       if (known_niters_smaller_than_vf (loop_vinfo))
 	{
@@ -1858,7 +1914,7 @@ determine_peel_for_niter (loop_vec_info loop_vinfo)
     th = LOOP_VINFO_COST_MODEL_THRESHOLD (LOOP_VINFO_ORIG_LOOP_INFO
 					  (loop_vinfo));
 
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  if (LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo))
     /* The main loop handles all iterations.  */
     LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
   else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
@@ -2048,6 +2104,7 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts)
     }
 
   bool saved_can_fully_mask_p = LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo);
+  bool saved_can_with_length_p = LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo);
 
   /* We don't expect to have to roll back to anything other than an empty
      set of rgroups.  */
@@ -2144,6 +2201,71 @@ start_over:
 			 "not using a fully-masked loop.\n");
     }
 
+  /* Decide whether we can use vector access with length.  */
+
+  if ((LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+       || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
+      && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use vector access with length because peeling"
+			 " for alignment or gaps is required.\n");
+      LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false;
+    }
+
+  if (LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo)
+      && !vect_verify_loop_lens (loop_vinfo))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use vector access with length because the"
+			 " length precision verification fails.\n");
+      LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false;
+    }
+
+  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use vector access with length because the"
+			 " loop will be fully-masked.\n");
+      LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false;
+    }
+
+  if (LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo))
+    {
+      /* One special case: if the loop's max niters is less than VF, we can
+	 simply handle the whole loop body with length.  */
+      if (param_vect_with_length_scope == 1)
+	{
+	  /* This is the epilogue, should be less than VF.  */
+	  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+	    LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = true;
+	  /* Otherwise, ensure the loop iteration count is less than VF.  */
+	  else if (known_niters_smaller_than_vf (loop_vinfo))
+	    LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = true;
+	}
+      else
+	{
+	  gcc_assert (param_vect_with_length_scope == 2);
+	  LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = true;
+	}
+    }
+  else
+    /* Always set it as false in case previous tries set it.  */
+    LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = false;
+
+  if (dump_enabled_p ())
+    {
+      if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+	dump_printf_loc (MSG_NOTE, vect_location, "using vector access with"
+						  " length for loop fully.\n");
+      else
+	dump_printf_loc (MSG_NOTE, vect_location, "not using vector access with"
+						  " length for loop fully.\n");
+    }
+
   /* If epilog loop is required because of data accesses with gaps,
      one additional iteration needs to be peeled.  Check if there is
      enough iterations for vectorization.  */
@@ -2163,7 +2285,7 @@ start_over:
   /* If we're vectorizing an epilogue loop, we either need a fully-masked
      loop or a loop that has a lower VF than the main loop.  */
   if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
-      && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+      && !LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo)
       && maybe_ge (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
 		   LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo)))
     return opt_result::failure_at (vect_location,
@@ -2362,12 +2484,14 @@ again:
     = init_cost (LOOP_VINFO_LOOP (loop_vinfo));
   /* Reset accumulated rgroup information.  */
   release_vec_loop_masks (&LOOP_VINFO_MASKS (loop_vinfo));
+  release_vec_loop_lens (&LOOP_VINFO_LENS (loop_vinfo));
   /* Reset assorted flags.  */
   LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
   LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) = false;
   LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = 0;
   LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo) = 0;
   LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = saved_can_fully_mask_p;
+  LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = saved_can_with_length_p;
 
   goto start_over;
 }
@@ -2646,8 +2770,10 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 	      if (ordered_p (lowest_th, th))
 		lowest_th = ordered_min (lowest_th, th);
 	    }
-	  else
-	    delete loop_vinfo;
+	  else {
+	      delete loop_vinfo;
+	      loop_vinfo = opt_loop_vec_info::success (NULL);
+	  }
 
 	  /* Only vectorize epilogues if PARAM_VECT_EPILOGUES_NOMASK is
 	     enabled, SIMDUID is not set, it is the innermost loop and we have
@@ -2672,6 +2798,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
       else
 	{
 	  delete loop_vinfo;
+	  loop_vinfo = opt_loop_vec_info::success (NULL);
 	  if (fatal)
 	    {
 	      gcc_checking_assert (first_loop_vinfo == NULL);
@@ -2679,6 +2806,21 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 	    }
 	}
 
+      /* If the original loop can use vector access with length and
+	 vect_epilogues is still true here, retry vector access with length
+	 for the epilogue, using the same vector mode.  */
+      if (vect_epilogues && loop_vinfo
+	  && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo))
+	{
+	  gcc_assert (!LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "***** Re-trying analysis with same vector"
+			     " mode %s for epilogue with length.\n",
+			     GET_MODE_NAME (loop_vinfo->vector_mode));
+	  continue;
+	}
+
       if (mode_i < vector_modes.length ()
 	  && VECTOR_MODE_P (autodetected_vector_mode)
 	  && (related_vector_mode (vector_modes[mode_i],
@@ -3493,7 +3635,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 
       /* Calculate how many masks we need to generate.  */
       unsigned int num_masks = 0;
-      rgroup_masks *rgm;
+      rgroup_objs *rgm;
       unsigned int num_vectors_m1;
       FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
 	if (rgm->mask_type)
@@ -3519,6 +3661,11 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 			    target_cost_data, num_masks - 1, vector_stmt,
 			    NULL, NULL_TREE, 0, vect_body);
     }
+  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      peel_iters_prologue = 0;
+      peel_iters_epilogue = 0;
+    }
   else if (npeel < 0)
     {
       peel_iters_prologue = assumed_vf / 2;
@@ -3808,7 +3955,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 		 "  Calculated minimum iters for profitability: %d\n",
 		 min_profitable_iters);
 
-  if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+  if (!LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo)
       && min_profitable_iters < (assumed_vf + peel_iters_prologue))
     /* We want the vectorized loop to execute at least once.  */
     min_profitable_iters = assumed_vf + peel_iters_prologue;
@@ -6761,6 +6908,16 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
     dump_printf_loc (MSG_NOTE, vect_location,
 		     "using an in-order (fold-left) reduction.\n");
   STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
+
+  if (loop_vinfo && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use vector access with length due to"
+			 " reduction operation.\n");
+      LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false;
+    }
+
   /* All but single defuse-cycle optimized, lane-reducing and fold-left
      reductions go through their own vectorizable_* routines.  */
   if (!single_defuse_cycle
@@ -8041,6 +8198,16 @@ vectorizable_live_operation (loop_vec_info loop_vinfo,
 				     1, vectype, NULL);
 	    }
 	}
+
+      if (loop_vinfo && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo))
+	{
+	  LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false;
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "can't use vector access with length due to"
+			     " live operation.\n");
+	}
+
       return true;
     }
 
@@ -8285,7 +8452,7 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
   gcc_assert (nvectors != 0);
   if (masks->length () < nvectors)
     masks->safe_grow_cleared (nvectors);
-  rgroup_masks *rgm = &(*masks)[nvectors - 1];
+  rgroup_objs *rgm = &(*masks)[nvectors - 1];
   /* The number of scalars per iteration and the number of vectors are
      both compile-time constants.  */
   unsigned int nscalars_per_iter
@@ -8316,24 +8483,24 @@ tree
 vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
 		    unsigned int nvectors, tree vectype, unsigned int index)
 {
-  rgroup_masks *rgm = &(*masks)[nvectors - 1];
+  rgroup_objs *rgm = &(*masks)[nvectors - 1];
   tree mask_type = rgm->mask_type;
 
   /* Populate the rgroup's mask array, if this is the first time we've
      used it.  */
-  if (rgm->masks.is_empty ())
+  if (rgm->objs.is_empty ())
     {
-      rgm->masks.safe_grow_cleared (nvectors);
+      rgm->objs.safe_grow_cleared (nvectors);
       for (unsigned int i = 0; i < nvectors; ++i)
 	{
 	  tree mask = make_temp_ssa_name (mask_type, NULL, "loop_mask");
 	  /* Provide a dummy definition until the real one is available.  */
 	  SSA_NAME_DEF_STMT (mask) = gimple_build_nop ();
-	  rgm->masks[i] = mask;
+	  rgm->objs[i] = mask;
 	}
     }
 
-  tree mask = rgm->masks[index];
+  tree mask = rgm->objs[index];
   if (maybe_ne (TYPE_VECTOR_SUBPARTS (mask_type),
 		TYPE_VECTOR_SUBPARTS (vectype)))
     {
@@ -8354,6 +8521,66 @@ vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
   return mask;
 }
 
+/* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS
+   lengths for vector access with length that each control a vector of type
+   VECTYPE.  */
+
+void
+vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
+		       unsigned int nvectors, tree vectype)
+{
+  gcc_assert (nvectors != 0);
+  if (lens->length () < nvectors)
+    lens->safe_grow_cleared (nvectors);
+  rgroup_objs *rgl = &(*lens)[nvectors - 1];
+
+  /* The number of scalars per iteration, their total size in bytes, and the
+     number of vectors are all compile-time constants.  */
+  poly_uint64 vector_size = GET_MODE_SIZE (TYPE_MODE (vectype));
+  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  unsigned int nbytes_per_iter
+    = exact_div (nvectors * vector_size, vf).to_constant ();
+
+  /* The rgroup associated with the same nvectors should have the same number
+     of bytes per iteration.  */
+  if (!rgl->vec_type)
+    {
+      rgl->vec_type = vectype;
+      rgl->nbytes_per_iter = nbytes_per_iter;
+    }
+  else
+    gcc_assert (rgl->nbytes_per_iter == nbytes_per_iter);
+}
+
+/* Given a complete set of length LENS, extract length number INDEX for an
+   rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.  */
+
+tree
+vect_get_loop_len (vec_loop_lens *lens, unsigned int nvectors, unsigned int index)
+{
+  rgroup_objs *rgl = &(*lens)[nvectors - 1];
+
+  /* Populate the rgroup's len array, if this is the first time we've
+     used it.  */
+  if (rgl->objs.is_empty ())
+    {
+      rgl->objs.safe_grow_cleared (nvectors);
+      for (unsigned int i = 0; i < nvectors; ++i)
+	{
+	  scalar_int_mode len_mode = targetm.vectorize.length_mode;
+	  unsigned int len_prec = GET_MODE_PRECISION (len_mode);
+	  tree len_type = build_nonstandard_integer_type (len_prec, true);
+	  tree len = make_temp_ssa_name (len_type, NULL, "loop_len");
+
+	  /* Provide a dummy definition until the real one is available.  */
+	  SSA_NAME_DEF_STMT (len) = gimple_build_nop ();
+	  rgl->objs[i] = len;
+	}
+    }
+
+  return rgl->objs[index];
+}
+
 /* Scale profiling counters by estimation for LOOP which is vectorized
    by factor VF.  */
 
@@ -8713,7 +8940,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   if (niters_vector == NULL_TREE)
     {
       if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
-	  && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+	  && !LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo)
 	  && known_eq (lowest_vf, vf))
 	{
 	  niters_vector
@@ -8881,7 +9108,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
 
   /* True if the final iteration might not handle a full vector's
      worth of scalar iterations.  */
-  bool final_iter_may_be_partial = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
+  bool final_iter_may_be_partial = LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo);
   /* The minimum number of iterations performed by the epilogue.  This
      is 1 when peeling for gaps because we always need a final scalar
      iteration.  */
@@ -9184,12 +9411,14 @@ optimize_mask_stores (class loop *loop)
 }
 
 /* Decide whether it is possible to use a zero-based induction variable
-   when vectorizing LOOP_VINFO with a fully-masked loop.  If it is,
-   return the value that the induction variable must be able to hold
-   in order to ensure that the loop ends with an all-false mask.
+   when vectorizing LOOP_VINFO with a fully-masked or fully with length
+   loop.  If it is, return the value that the induction variable must
+   be able to hold in order to ensure that the loop ends with an
+   all-false mask or zero byte length.
    Return -1 otherwise.  */
+
 widest_int
-vect_iv_limit_for_full_masking (loop_vec_info loop_vinfo)
+vect_iv_limit_for_partial_vect (loop_vec_info loop_vinfo)
 {
   tree niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index e7822c44951..d6be39e1831 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -1879,6 +1879,66 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
     gcc_unreachable ();
 }
 
+/* Check whether a load or store statement in the loop described by
+   LOOP_VINFO can be implemented with length.  This is testing whether
+   the vectorizer pass has the appropriate support, as well as whether
+   the target does.
+
+   VLS_TYPE says whether the statement is a load or store and VECTYPE
+   is the type of the vector being loaded or stored.  MEMORY_ACCESS_TYPE
+   says how the load or store is going to be implemented and GROUP_SIZE
+   is the number of load or store statements in the containing group.
+
+   Clear LOOP_VINFO_CAN_WITH_LENGTH_P if it can't go with length, otherwise
+   record the required length types.  */
+
+static void
+check_load_store_with_len (loop_vec_info loop_vinfo, tree vectype,
+		      vec_load_store_type vls_type, int group_size,
+		      vect_memory_access_type memory_access_type)
+{
+  /* Invariant loads need no special support.  */
+  if (memory_access_type == VMAT_INVARIANT)
+    return;
+
+  if (memory_access_type != VMAT_CONTIGUOUS
+      && memory_access_type != VMAT_CONTIGUOUS_PERMUTE)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use vector access with length"
+			 " because an access isn't contiguous.\n");
+      LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false;
+      return;
+    }
+
+  machine_mode vecmode = TYPE_MODE (vectype);
+  bool is_load = (vls_type == VLS_LOAD);
+  optab op = is_load ? lenload_optab : lenstore_optab;
+
+  if (!VECTOR_MODE_P (vecmode)
+      || !convert_optab_handler (op, vecmode, targetm.vectorize.length_mode))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use vector access with length because"
+			 " the target doesn't have the appropriate"
+			 " load or store with length.\n");
+      LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false;
+      return;
+    }
+
+  vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
+  poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
+  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  unsigned int nvectors;
+
+  if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
+    vect_record_loop_len (loop_vinfo, lens, nvectors, vectype);
+  else
+    gcc_unreachable ();
+}
+
 /* Return the mask input to a masked load or store.  VEC_MASK is the vectorized
    form of the scalar mask condition and LOOP_MASK, if nonnull, is the mask
    that needs to be applied to all loads and stores in a vectorized loop.
@@ -7532,6 +7592,10 @@ vectorizable_store (vec_info *vinfo,
 	check_load_store_masking (loop_vinfo, vectype, vls_type, group_size,
 				  memory_access_type, &gs_info, mask);
 
+      if (loop_vinfo && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo))
+	check_load_store_with_len (loop_vinfo, vectype, vls_type, group_size,
+				      memory_access_type);
+
       if (slp_node
 	  && !vect_maybe_update_slp_op_vectype (SLP_TREE_CHILDREN (slp_node)[0],
 						vectype))
@@ -8068,6 +8132,15 @@ vectorizable_store (vec_info *vinfo,
     = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
        ? &LOOP_VINFO_MASKS (loop_vinfo)
        : NULL);
+
+  vec_loop_lens *loop_lens
+    = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+	 ? &LOOP_VINFO_LENS (loop_vinfo)
+	 : NULL);
+
+  /* Shouldn't go with length if fully masked.  */
+  gcc_assert (!loop_lens || (loop_lens && !loop_masks));
+
   /* Targets with store-lane instructions must not require explicit
      realignment.  vect_supportable_dr_alignment always returns either
      dr_aligned or dr_unaligned_supported for masked operations.  */
@@ -8320,10 +8393,15 @@ vectorizable_store (vec_info *vinfo,
 	      unsigned HOST_WIDE_INT align;
 
 	      tree final_mask = NULL_TREE;
+	      tree final_len = NULL_TREE;
 	      if (loop_masks)
 		final_mask = vect_get_loop_mask (gsi, loop_masks,
 						 vec_num * ncopies,
 						 vectype, vec_num * j + i);
+	      else if (loop_lens)
+		final_len = vect_get_loop_len (loop_lens, vec_num * ncopies,
+					       vec_num * j + i);
+
 	      if (vec_mask)
 		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
 						      vec_mask, gsi);
@@ -8403,6 +8481,17 @@ vectorizable_store (vec_info *vinfo,
 		  new_stmt_info
 		    = vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
 		}
+	      else if (final_len)
+		{
+		  align = least_bit_hwi (misalign | align);
+		  tree ptr = build_int_cst (ref_type, align);
+		  gcall *call
+		    = gimple_build_call_internal (IFN_LEN_STORE, 4, dataref_ptr,
+						  ptr, final_len, vec_oprnd);
+		  gimple_call_set_nothrow (call, true);
+		  new_stmt_info
+		    = vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
+		}
 	      else
 		{
 		  data_ref = fold_build2 (MEM_REF, vectype,
@@ -8839,6 +8928,10 @@ vectorizable_load (vec_info *vinfo,
 	check_load_store_masking (loop_vinfo, vectype, VLS_LOAD, group_size,
 				  memory_access_type, &gs_info, mask);
 
+      if (loop_vinfo && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo))
+	check_load_store_with_len (loop_vinfo, vectype, VLS_LOAD, group_size,
+				      memory_access_type);
+
       STMT_VINFO_TYPE (stmt_info) = load_vec_info_type;
       vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type,
 			    slp_node, cost_vec);
@@ -8937,6 +9030,7 @@ vectorizable_load (vec_info *vinfo,
 
       gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
       gcc_assert (!nested_in_vect_loop);
+      gcc_assert (!LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
 
       if (grouped_load)
 	{
@@ -9234,6 +9328,15 @@ vectorizable_load (vec_info *vinfo,
     = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
        ? &LOOP_VINFO_MASKS (loop_vinfo)
        : NULL);
+
+  vec_loop_lens *loop_lens
+    = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+	 ? &LOOP_VINFO_LENS (loop_vinfo)
+	 : NULL);
+
+  /* Shouldn't go with length if fully masked.  */
+  gcc_assert (!loop_lens || (loop_lens && !loop_masks));
+
   /* Targets with store-lane instructions must not require explicit
      realignment.  vect_supportable_dr_alignment always returns either
      dr_aligned or dr_unaligned_supported for masked operations.  */
@@ -9555,15 +9658,20 @@ vectorizable_load (vec_info *vinfo,
 	  for (i = 0; i < vec_num; i++)
 	    {
 	      tree final_mask = NULL_TREE;
+	      tree final_len = NULL_TREE;
 	      if (loop_masks
 		  && memory_access_type != VMAT_INVARIANT)
 		final_mask = vect_get_loop_mask (gsi, loop_masks,
 						 vec_num * ncopies,
 						 vectype, vec_num * j + i);
+	      else if (loop_lens && memory_access_type != VMAT_INVARIANT)
+		final_len = vect_get_loop_len (loop_lens, vec_num * ncopies,
+					       vec_num * j + i);
 	      if (vec_mask)
 		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
 						      vec_mask, gsi);
 
+
 	      if (i > 0)
 		dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
 					       gsi, stmt_info, bump);
@@ -9629,6 +9737,18 @@ vectorizable_load (vec_info *vinfo,
 			new_stmt = call;
 			data_ref = NULL_TREE;
 		      }
+		    else if (final_len)
+		      {
+			align = least_bit_hwi (misalign | align);
+			tree ptr = build_int_cst (ref_type, align);
+			gcall *call
+			  = gimple_build_call_internal (IFN_LEN_LOAD, 3,
+							dataref_ptr, ptr,
+							final_len);
+			gimple_call_set_nothrow (call, true);
+			new_stmt = call;
+			data_ref = NULL_TREE;
+		      }
 		    else
 		      {
 			tree ltype = vectype;
@@ -12480,3 +12600,35 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
   *nunits_vectype_out = nunits_vectype;
   return opt_result::success ();
 }
+
+/* Generate and return statement sequence that sets vector length LEN that is:
+
+   min_of_start_and_end = min (START_INDEX, END_INDEX);
+   left_bytes = END_INDEX - min_of_start_and_end;
+   rhs = min (left_bytes, VECTOR_SIZE);
+   LEN = rhs;
+
+   TODO: for now, the rs6000 vector with length support only cares about the
+   low 8 bits of the length, which means that if left_bytes is larger than
+   255 it can't be saturated to the vector size.  A target hook can be
+   provided if other ports don't have this restriction.  */
+
+gimple_seq
+vect_gen_len (tree len, tree start_index, tree end_index, tree vector_size)
+{
+  gimple_seq stmts = NULL;
+  tree len_type = TREE_TYPE (len);
+  gcc_assert (TREE_TYPE (start_index) == len_type);
+
+  tree min = fold_build2 (MIN_EXPR, len_type, start_index, end_index);
+  tree left_bytes = fold_build2 (MINUS_EXPR, len_type, end_index, min);
+  left_bytes = fold_build2 (MIN_EXPR, len_type, left_bytes, vector_size);
+
+  tree rhs = force_gimple_operand (left_bytes, &stmts, true, NULL_TREE);
+  gimple *new_stmt = gimple_build_assign (len, rhs);
+  gimple_stmt_iterator i = gsi_last (stmts);
+  gsi_insert_after_without_update (&i, new_stmt, GSI_CONTINUE_LINKING);
+
+  return stmts;
+}
+
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 2eb3ab5d280..78e260e5611 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -461,20 +461,32 @@ is_a_helper <_bb_vec_info *>::test (vec_info *i)
    first level being indexed by nV - 1 (since nV == 0 doesn't exist) and
    the second being indexed by the mask index 0 <= i < nV.  */
 
-/* The masks needed by rgroups with nV vectors, according to the
-   description above.  */
-struct rgroup_masks {
-  /* The largest nS for all rgroups that use these masks.  */
-  unsigned int max_nscalars_per_iter;
-
-  /* The type of mask to use, based on the highest nS recorded above.  */
-  tree mask_type;
+/* The masks/lengths (called as objects) needed by rgroups with nV vectors,
+   according to the description above.  */
+struct rgroup_objs {
+  union
+  {
+    /* The largest nS for all rgroups that use these masks.  */
+    unsigned int max_nscalars_per_iter;
+    /* The total bytes for any nS per iteration.  */
+    unsigned int nbytes_per_iter;
+  };
 
-  /* A vector of nV masks, in iteration order.  */
-  vec<tree> masks;
+  union
+  {
+    /* The type of mask to use, based on the highest nS recorded above.  */
+    tree mask_type;
+    /* Any vector type to use these lengths.  */
+    tree vec_type;
+  };
+
+  /* A vector of nV objs, in iteration order.  */
+  vec<tree> objs;
 };
 
-typedef auto_vec<rgroup_masks> vec_loop_masks;
+typedef auto_vec<rgroup_objs> vec_loop_masks;
+
+typedef auto_vec<rgroup_objs> vec_loop_lens;
 
 typedef auto_vec<std::pair<data_reference*, tree> > drs_init_vec;
 
@@ -523,6 +535,10 @@ public:
      on inactive scalars.  */
   vec_loop_masks masks;
 
+  /* The lengths that a loop with length should use to avoid operating
+     on inactive scalars.  */
+  vec_loop_lens lens;
+
   /* Set of scalar conditions that have loop mask applied.  */
   scalar_cond_masked_set_type scalar_cond_masked_set;
 
@@ -626,6 +642,12 @@ public:
   /* True if have decided to use a fully-masked loop.  */
   bool fully_masked_p;
 
+  /* Records whether we still have the option of using a length access loop.  */
+  bool can_with_length_p;
+
+  /* True if have decided to use length access for the loop fully.  */
+  bool fully_with_length_p;
+
   /* When we have grouped data accesses with gaps, we may introduce invalid
      memory accesses.  We peel the last iteration of the loop to prevent
      this.  */
@@ -689,6 +711,9 @@ public:
 #define LOOP_VINFO_VECTORIZABLE_P(L)       (L)->vectorizable
 #define LOOP_VINFO_CAN_FULLY_MASK_P(L)     (L)->can_fully_mask_p
 #define LOOP_VINFO_FULLY_MASKED_P(L)       (L)->fully_masked_p
+#define LOOP_VINFO_CAN_WITH_LENGTH_P(L)    (L)->can_with_length_p
+#define LOOP_VINFO_FULLY_WITH_LENGTH_P(L)  (L)->fully_with_length_p
+#define LOOP_VINFO_LENS(L)                 (L)->lens
 #define LOOP_VINFO_VECT_FACTOR(L)          (L)->vectorization_factor
 #define LOOP_VINFO_MAX_VECT_FACTOR(L)      (L)->max_vectorization_factor
 #define LOOP_VINFO_MASKS(L)                (L)->masks
@@ -741,6 +766,10 @@ public:
    || LOOP_REQUIRES_VERSIONING_FOR_NITERS (L)		\
    || LOOP_REQUIRES_VERSIONING_FOR_SIMD_IF_COND (L))
 
+/* Whether the loop operates on partial vectors.  */
+#define LOOP_VINFO_PARTIAL_VECT_P(L)                                           \
+  (LOOP_VINFO_FULLY_MASKED_P (L) || LOOP_VINFO_FULLY_WITH_LENGTH_P (L))
+
 #define LOOP_VINFO_NITERS_KNOWN_P(L)          \
   (tree_fits_shwi_p ((L)->num_iters) && tree_to_shwi ((L)->num_iters) > 0)
 
@@ -1824,7 +1853,7 @@ extern tree vect_create_addr_base_for_vector_ref (vec_info *,
 						  tree, tree = NULL_TREE);
 
 /* In tree-vect-loop.c.  */
-extern widest_int vect_iv_limit_for_full_masking (loop_vec_info loop_vinfo);
+extern widest_int vect_iv_limit_for_partial_vect (loop_vec_info loop_vinfo);
 /* Used in tree-vect-loop-manip.c */
 extern void determine_peel_for_niter (loop_vec_info);
 /* Used in gimple-loop-interchange.c and tree-parloops.c.  */
@@ -1842,6 +1871,10 @@ extern void vect_record_loop_mask (loop_vec_info, vec_loop_masks *,
 				   unsigned int, tree, tree);
 extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
 				unsigned int, tree, unsigned int);
+extern void vect_record_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
+				  tree);
+extern tree vect_get_loop_len (vec_loop_lens *, unsigned int, unsigned int);
+extern gimple_seq vect_gen_len (tree, tree, tree, tree);
 extern stmt_vec_info info_for_reduction (vec_info *, stmt_vec_info);
 
 /* Drive for loop transformation stage.  */
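
For illustration only (not part of the patch): a minimal standalone C model
of the semantics the .LEN_LOAD / .LEN_STORE calls built above are intended
to have, i.e. a byte length that saturates at the vector size.  The 16-byte
vector size and the helper names here are assumptions; this is a sketch
rather than GCC code.

  #include <string.h>

  #define VEC_BYTES 16   /* assumed vector size in bytes */

  /* Load LEN bytes from SRC into a full-width vector buffer DST; the
     untouched lanes are modelled as zero here.  Lengths beyond the
     vector size saturate.  */
  static void
  len_load (unsigned char dst[VEC_BYTES], const unsigned char *src,
            unsigned int len)
  {
    if (len > VEC_BYTES)
      len = VEC_BYTES;
    memset (dst, 0, VEC_BYTES);
    memcpy (dst, src, len);
  }

  /* Store only the first LEN bytes of the vector buffer SRC to DST.  */
  static void
  len_store (unsigned char *dst, const unsigned char src[VEC_BYTES],
             unsigned int len)
  {
    if (len > VEC_BYTES)
      len = VEC_BYTES;
    memcpy (dst, src, len);
  }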
-- 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/7] Support vector load/store with length
  2020-05-27  7:25             ` Richard Biener
@ 2020-05-27  8:50               ` Kewen.Lin
  2020-05-27 14:08               ` Segher Boessenkool
  1 sibling, 0 replies; 80+ messages in thread
From: Kewen.Lin @ 2020-05-27  8:50 UTC (permalink / raw)
  To: Richard Biener, Segher Boessenkool
  Cc: GCC Patches, Bill Schmidt, dje.gcc, richard.sandiford

on 2020/5/27 3:25 PM, Richard Biener wrote:
> On Tue, 26 May 2020, Segher Boessenkool wrote:
> 
>> Hi!
>>
>> On Tue, May 26, 2020 at 01:29:30PM +0100, Richard Sandiford wrote:
>>> FWIW, I agree adding .LEN_LOAD and .LEN_STORE seems like a good
>>> approach.  I think it'll be more maintainable in the long run than
>>> trying to have .MASK_LOADs and .MASK_STOREs that need a special mask
>>> operand.  (That would be too similar to VEC_COND_EXPR :-))
>>>
>>> Not sure yet what the exact semantics wrt out-of-range values for
>>> the IFN/optab though.  Maybe we should instead have some kind of
>>> abstract, target-specific cookie created by a separate intrinsic.
>>> Haven't thought much about it yet...
>>
>> Or maybe only support 0..N with N the length of the vector?  It is
>> pretty important to support 0 and N, but greater than N isn't as
>> important (it is useful for tricky hand-written code, but not as much
>> for compiler-generated code -- we only support an 8-bit number here on
>> Power, maybe that is why ;-) )
> 
> The question is one of semantics - if power masks the length to an
> 8 bit number it's important to preprocess the IV.  As with my
> other suggestion the question is what to expose to the IL (to GIMPLE)
> here.  Exposing as much as possible will help IV selection but
> will eventually require IFN variations for different semantics.
> 

In the current implementation, we don't use an IFN for the length
computation; it generates something like:

  ivtmp_28 = ivtmp_27 + 16;
  _39 = MIN_EXPR <ivtmp_28, _32>;  // _32 is the limit
  _40 = _32 - _39;                 // get the zero bytes for the ending
  _41 = MIN_EXPR <_40, 16>;        // check for vector size
  if (ivtmp_28 < _32)

My initial thought was that the len load/store IFNs accept any length
(any value the length mode can hold).  Since a length larger than the
vector size makes no sense, the hardware can treat it as saturated to the
vector size; if the hardware only honours some bits of the length, as ppc
does, we can add one hook to guard the MIN requirement for length
generation.  For now, the MIN is mandatory since ppc is the only user.

FWIW, if we mostly adopt this for epilogues or small loops (iteration
count < VF), the range can be analyzed at compile time, so these MIN
computations can theoretically be optimized away.
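
For reference, a standalone C sketch of the MIN computations above (an
illustration only, not from the patch; 16 is the assumed vector size in
bytes):

  #include <stdio.h>

  /* LEN = MIN (END - MIN (START, END), 16): the bytes left to process,
     saturated at the vector size.  START and END are byte indices.  */
  static unsigned int
  gen_len (unsigned int start, unsigned int end)
  {
    unsigned int min = start < end ? start : end;  /* MIN_EXPR <start, end> */
    unsigned int left_bytes = end - min;           /* remaining bytes */
    return left_bytes < 16 ? left_bytes : 16;      /* MIN_EXPR <left, 16> */
  }

  int
  main (void)
  {
    /* With 35 bytes in total this prints 16 3 0.  */
    printf ("%u %u %u\n", gen_len (16, 35), gen_len (32, 35),
            gen_len (48, 35));
    return 0;
  }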

> So yes, 0..N sounds about right here and we'll require a MIN ()
> operation and likely need to teach IV selection about this to at least
> possibly get an IV with the byte size multiplication factored.
> 

FWIW, in the current implementation, the step/limit are already multiplied
by the lane size in bytes, so the IV computation does not need that
multiplication.

BR,
Kewen

> Richard.
> 
>>
>> Segher
>>
> 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/7] vect: Support vector load/store with length in vectorizer
  2020-05-27  8:25     ` Kewen.Lin
@ 2020-05-27 10:02       ` Richard Sandiford
  2020-05-28  1:21         ` Kewen.Lin
  0 siblings, 1 reply; 80+ messages in thread
From: Richard Sandiford @ 2020-05-27 10:02 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Richard Guenther, Bill Schmidt, dje.gcc, Segher Boessenkool

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi Richard,
>
> Thanks for your comments!
>
>> on 2020/5/26 8:49 PM, Richard Sandiford wrote:
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>> @@ -626,6 +645,12 @@ public:
>>>    /* True if have decided to use a fully-masked loop.  */
>>>    bool fully_masked_p;
>>>  
>>> +  /* Records whether we still have the option of using a length access loop.  */
>>> +  bool can_with_length_p;
>>> +
>>> +  /* True if have decided to use length access for the loop fully.  */
>>> +  bool fully_with_length_p;
>> 
>> Rather than duplicate the flags like this, I think we should have
>> three bits of information:
>> 
>> (1) Can the loop operate on partial vectors?  Starts off optimistically
>>     assuming "yes", gets set to "no" when we find a counter-example.
>> 
>> (2) If we do decide to use partial vectors, will we need loop masks?
>> 
>> (3) If we do decide to use partial vectors, will we need lengths?
>> 
>> Vectorisation using partial vectors succeeds if (1) && ((2) != (3))
>> 
>> LOOP_VINFO_CAN_FULLY_MASK_P currently tracks (1) and
>> LOOP_VINFO_MASKS currently tracks (2).  In pathological cases it's
>> already possible to have (1) && !(2), see r9-6240 for an example.
>> 
>> With the new support, LOOP_VINFO_LENS tracks (3).
>> 
>> So I don't think we need the can_with_length_p.  What is now
>> LOOP_VINFO_CAN_FULLY_MASK_P can continue to track (1) for both
>> approaches, with the final choice of approach only being made
>> at the end.  Maybe it would be worth renaming it to something
>> more generic though, now that we have two approaches to partial
>> vectorisation.
>
> I like this idea!  I could be wrong, but I'm afraid that we
> can not have one common flag to be shared for both approaches,
> the check criterias could be different for both approaches, one
> counter example for length could be acceptable for masking, such
> as length can only allow CONTIGUOUS related modes, but masking
> can support more.  When we see acceptable VMAT_LOAD_STORE_LANES,
> we leave LOOP_VINFO_CAN_FULLY_MASK_P true, later should length
> checking turn it to false?  I guess no, assuming still true, then 
> LOOP_VINFO_CAN_FULLY_MASK_P will mean only partial vectorization
> for masking, not for both.  We can probably clean LOOP_VINFO_LENS
> when the length checking is false, but we just know the vec is empty,
> not sure we are unable to do partial vectorization with length,
> when we see LOOP_VINFO_CAN_FULLY_MASK_P true, we could still
> record length into it if possible.

Let's call the flag in (1) CAN_USE_PARTIAL_VECTORS_P rather than
CAN_FULLY_MASK_P to (try to) avoid any confusion from the current name.

What I meant is that each vectorizable_* routine has the responsibility
of finding a way of coping with partial vectorisation, or setting
CAN_USE_PARTIAL_VECTORS_P to false if it can't.

vectorizable_load chooses the VMAT first, and then decides based on that
whether partial vectorisation is supported.  There's no influence in
the other direction (partial vectorisation doesn't determine the VMAT).

So once it has chosen a VMAT, vectorizable_load needs to try to find a way
of handling the operation with partial vectorisation.  Currently the only
way of doing that for VMAT_LOAD_STORE_LANES is using masks.  So at the
moment there are two possible outcomes:

- The target supports the necessary IFN_MASK_LOAD_LANES function.
  If so, we can use partial vectorisation for the statement, so we
  leave CAN_USE_PARTIAL_VECTORS_P true and record the necessary masks
  in LOOP_VINFO_MASKS.

- The target doesn't support the necessary IFN_MASK_LOAD_LANES function.
  If so, we can't use partial vectorisation, so we clear
  CAN_USE_PARTIAL_VECTORS_P.

That's how things work at the moment.  It would work in the same way
for lengths if we ever supported IFN_LEN_LOAD_LANES: we'd check whether
IFN_LEN_LOAD_LANES is available and record the length in LOOP_VINFO_LENS
if so.  If partial vectorisation isn't supported (via masks or lengths),
we'd continue to clear CAN_USE_PARTIAL_VECTORS_P.

But equally, if we never add support for IFN_LEN_LOAD_LANES, the current
code continues to work with length-based approaches.  We'll continue to
clear CAN_USE_PARTIAL_VECTORS_P for VMAT_LOAD_STORE_LANES when the
target provides no IFN_MASK_LOAD_LANES function.

As I say, this is all predicated on the assumption that we don't need
to mix both masks and lengths in the same loop, and so can decide not
to use partial vectorisation when both masks and lengths have been
recorded.  This is a check that would happen at the end, after all
statements have been analysed.

(There's no reason in principle why we *couldn't* support both
approaches in the same loop, but it's not worth adding the code
for that until there's a use case.)
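
Schematically, the end-of-analysis decision described above could be
modelled like this (a sketch with made-up names, not the actual GCC code):

  #include <stdbool.h>

  /* Inputs summarise what per-statement analysis produced: (1) no
     statement vetoed partial vectors, (2) some statement recorded loop
     masks, (3) some statement recorded loop lengths.  */
  static bool
  use_partial_vectors_p (bool can_use_partial_vectors_p,
                         bool masks_recorded_p, bool lens_recorded_p)
  {
    /* Partial vectorisation succeeds if (1) && ((2) != (3)): mixing
       masks and lengths in one loop isn't supported, so exactly one of
       the two mechanisms must have been requested.  */
    return can_use_partial_vectors_p
           && (masks_recorded_p != lens_recorded_p);
  }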

Thanks,
Richard

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/7] Support vector load/store with length
  2020-05-27  7:25             ` Richard Biener
  2020-05-27  8:50               ` Kewen.Lin
@ 2020-05-27 14:08               ` Segher Boessenkool
  1 sibling, 0 replies; 80+ messages in thread
From: Segher Boessenkool @ 2020-05-27 14:08 UTC (permalink / raw)
  To: Richard Biener
  Cc: Kewen.Lin, GCC Patches, Bill Schmidt, dje.gcc, richard.sandiford

On Wed, May 27, 2020 at 09:25:43AM +0200, Richard Biener wrote:
> On Tue, 26 May 2020, Segher Boessenkool wrote:
> > On Tue, May 26, 2020 at 01:29:30PM +0100, Richard Sandiford wrote:
> > > FWIW, I agree adding .LEN_LOAD and .LEN_STORE seems like a good
> > > approach.  I think it'll be more maintainable in the long run than
> > > trying to have .MASK_LOADs and .MASK_STOREs that need a special mask
> > > operand.  (That would be too similar to VEC_COND_EXPR :-))
> > > 
> > > Not sure yet what the exact semantics wrt out-of-range values for
> > > the IFN/optab though.  Maybe we should instead have some kind of
> > > abstract, target-specific cookie created by a separate intrinsic.
> > > Haven't thought much about it yet...
> > 
> > Or maybe only support 0..N with N the length of the vector?  It is
> > pretty important to support 0 and N, but greater than N isn't as
> > important (it is useful for tricky hand-written code, but not as much
> > for compiler-generated code -- we only support an 8-bit number here on
> > Power, maybe that is why ;-) )
> 
> The question is one of semantics - if power masks the length to an
> 8 bit number it's important to preprocess the IV.

In the instructions it *is* an 8 bit number (it is the top 8 bits of a
GPR).
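
For concreteness, a small sketch of what that means for the register
operand (an illustration only, not GCC code; it assumes a 64-bit GPR with
the byte count in bits 0:7 as described above):

  /* Put a byte length in the top 8 bits of the GPR that lxvl/stxvl
     read; lengths beyond the vector size behave as saturated.  */
  static inline unsigned long long
  length_to_gpr (unsigned int nbytes)
  {
    return (unsigned long long) nbytes << 56;
  }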

> As with my
> other suggestion the question is what to expose to the IL (to GIMPLE)
> here.

Yes, I understand that.  Hence my answer :-)

Only multiples of element size would be fine as well of course.

> Exposing as much as possible will help IV selection but
> will eventually require IFN variations for different semantics.
> 
> So yes, 0..N sounds about right here and we'll require a MIN ()
> operation and likely need to teach IV selection about this to at least
> possibly get an IV with the byte size multiplication factored.

Maybe we should have a hook to say which lengths are allowed for which
element type?

And, how does this work for variable lengths (the usual case!)?


Segher

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/7] vect: Support vector load/store with length in vectorizer
  2020-05-27 10:02       ` Richard Sandiford
@ 2020-05-28  1:21         ` Kewen.Lin
  2020-05-29  8:32           ` Richard Sandiford
  0 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-05-28  1:21 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford
  Cc: Richard Guenther, Bill Schmidt, dje.gcc, Segher Boessenkool

on 2020/5/27 6:02 PM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> Hi Richard,
>>
>> Thanks for your comments!
>>
>> on 2020/5/26 8:49 PM, Richard Sandiford wrote:
>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>> @@ -626,6 +645,12 @@ public:
>>>>    /* True if have decided to use a fully-masked loop.  */
>>>>    bool fully_masked_p;
>>>>  
>>>> +  /* Records whether we still have the option of using a length access loop.  */
>>>> +  bool can_with_length_p;
>>>> +
>>>> +  /* True if have decided to use length access for the loop fully.  */
>>>> +  bool fully_with_length_p;
>>>
>>> Rather than duplicate the flags like this, I think we should have
>>> three bits of information:
>>>
>>> (1) Can the loop operate on partial vectors?  Starts off optimistically
>>>     assuming "yes", gets set to "no" when we find a counter-example.
>>>
>>> (2) If we do decide to use partial vectors, will we need loop masks?
>>>
>>> (3) If we do decide to use partial vectors, will we need lengths?
>>>
>>> Vectorisation using partial vectors succeeds if (1) && ((2) != (3))
>>>
>>> LOOP_VINFO_CAN_FULLY_MASK_P currently tracks (1) and
>>> LOOP_VINFO_MASKS currently tracks (2).  In pathological cases it's
>>> already possible to have (1) && !(2), see r9-6240 for an example.
>>>
>>> With the new support, LOOP_VINFO_LENS tracks (3).
>>>
>>> So I don't think we need the can_with_length_p.  What is now
>>> LOOP_VINFO_CAN_FULLY_MASK_P can continue to track (1) for both
>>> approaches, with the final choice of approach only being made
>>> at the end.  Maybe it would be worth renaming it to something
>>> more generic though, now that we have two approaches to partial
>>> vectorisation.
>>
>> I like this idea!  I could be wrong, but I'm afraid that we
>> can not have one common flag to be shared for both approaches,
>> the check criterias could be different for both approaches, one
>> counter example for length could be acceptable for masking, such
>> as length can only allow CONTIGUOUS related modes, but masking
>> can support more.  When we see acceptable VMAT_LOAD_STORE_LANES,
>> we leave LOOP_VINFO_CAN_FULLY_MASK_P true, later should length
>> checking turn it to false?  I guess no, assuming still true, then 
>> LOOP_VINFO_CAN_FULLY_MASK_P will mean only partial vectorization
>> for masking, not for both.  We can probably clean LOOP_VINFO_LENS
>> when the length checking is false, but we just know the vec is empty,
>> not sure we are unable to do partial vectorization with length,
>> when we see LOOP_VINFO_CAN_FULLY_MASK_P true, we could still
>> record length into it if possible.
> 
> Let's call the flag in (1) CAN_USE_PARTIAL_VECTORS_P rather than
> CAN_FULLY_MASK_P to (try to) avoid any confusion from the current name.
> 
> What I meant is that each vectorizable_* routine has the responsibility
> of finding a way of coping with partial vectorisation, or setting
> CAN_USE_PARTIAL_VECTORS_P to false if it can't.
> 
> vectorizable_load chooses the VMAT first, and then decides based on that
> whether partial vectorisation is supported.  There's no influence in
> the other direction (partial vectorisation doesn't determine the VMAT).
> 
> So once it has chosen a VMAT, vectorizable_load needs to try to find a way
> of handling the operation with partial vectorisation.  Currently the only
> way of doing that for VMAT_LOAD_STORE_LANES is using masks.  So at the
> moment there are two possible outcomes:
> 
> - The target supports the necessary IFN_MASK_LOAD_LANES function.
>   If so, we can use partial vectorisation for the statement, so we
>   leave CAN_USE_PARTIAL_VECTORS_P true and record the necessary masks
>   in LOOP_VINFO_MASKS.
> 
> - The target doesn't support the necessary IFN_MASK_LOAD_LANES function.
>   If so, we can't use partial vectorisation, so we clear
>   CAN_USE_PARTIAL_VECTORS_P.
> 
> That's how things work at the moment.  It would work in the same way
> for lengths if we ever supported IFN_LEN_LOAD_LANES: we'd check whether
> IFN_LEN_LOAD_LANES is available and record the length in LOOP_VINFO_LENS
> if so.  If partial vectorisation isn't supported (via masks or lengths),
> we'd continue to clear CAN_USE_PARTIAL_VECTORS_P.
> 
> But equally, if we never add support for IFN_LEN_LOAD_LANES, the current
> code continues to work with length-based approaches.  We'll continue to
> clear CAN_USE_PARTIAL_VECTORS_P for VMAT_LOAD_STORE_LANES when the
> target provides no IFN_MASK_LOAD_LANES function.
> 

Thanks a lot for your detailed explanation!  This proposal looks good
based on the current implementation of both masking and length.  I may
be overthinking it, but I have a concern, described below, about targets
that support both masking and length in the future, such as ppc adding
masking support like SVE.

I assumed that you meant each vectorizable_* routine should record the
objs for every available partial vectorisation approach.  If one target
supports both, we would have both recorded but finally decide not to do
partial vectorisation since both have records.  The target can resolve
that by disabling length, for example through the optab, but the masking
support could be imperfect initially since ISA support can arrive
gradually; that could make some vectorizable_* check or the final
verification fail for masking while the length approach would work, yet
it gets disabled.  We could miss using partial vectorisation here.

The other assumption is that each vectorizable_* routine records only
the first available partial vectorisation approach; let's assume masking
takes preference.  Then it's fine to record just one here even if a
target supports both approaches, but we still have the possibility of
missing the partial vectorisation chance when some check/verification
fails with masking but would be fine with length.

Does this concern make sense?

BR,
Kewen

> As I say, this is all predicated on the assumption that we don't need
> to mix both masks and lengths in the same loop, and so can decide not
> to use partial vectorisation when both masks and lengths have been
> recorded.  This is a check that would happen at the end, after all
> statements have been analysed.
> 
> (There's no reason in principle why we *couldn't* support both
> approaches in the same loop, but it's not worth adding the code
> for that until there's a use case.)
> 
> Thanks,
> Richard
> 



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/7] vect: Support vector load/store with length in vectorizer
  2020-05-28  1:21         ` Kewen.Lin
@ 2020-05-29  8:32           ` Richard Sandiford
  2020-05-29 12:38             ` Segher Boessenkool
  2020-06-02  9:03             ` [PATCH 5/7 v3] " Kewen.Lin
  0 siblings, 2 replies; 80+ messages in thread
From: Richard Sandiford @ 2020-05-29  8:32 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Richard Guenther, Bill Schmidt, dje.gcc, Segher Boessenkool

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> on 2020/5/27 6:02 PM, Richard Sandiford wrote:
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>> Hi Richard,
>>>
>>> Thanks for your comments!
>>>
>>> on 2020/5/26 8:49 PM, Richard Sandiford wrote:
>>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>>> @@ -626,6 +645,12 @@ public:
>>>>>    /* True if have decided to use a fully-masked loop.  */
>>>>>    bool fully_masked_p;
>>>>>  
>>>>> +  /* Records whether we still have the option of using a length access loop.  */
>>>>> +  bool can_with_length_p;
>>>>> +
>>>>> +  /* True if have decided to use length access for the loop fully.  */
>>>>> +  bool fully_with_length_p;
>>>>
>>>> Rather than duplicate the flags like this, I think we should have
>>>> three bits of information:
>>>>
>>>> (1) Can the loop operate on partial vectors?  Starts off optimistically
>>>>     assuming "yes", gets set to "no" when we find a counter-example.
>>>>
>>>> (2) If we do decide to use partial vectors, will we need loop masks?
>>>>
>>>> (3) If we do decide to use partial vectors, will we need lengths?
>>>>
>>>> Vectorisation using partial vectors succeeds if (1) && ((2) != (3))
>>>>
>>>> LOOP_VINFO_CAN_FULLY_MASK_P currently tracks (1) and
>>>> LOOP_VINFO_MASKS currently tracks (2).  In pathological cases it's
>>>> already possible to have (1) && !(2), see r9-6240 for an example.
>>>>
>>>> With the new support, LOOP_VINFO_LENS tracks (3).
>>>>
>>>> So I don't think we need the can_with_length_p.  What is now
>>>> LOOP_VINFO_CAN_FULLY_MASK_P can continue to track (1) for both
>>>> approaches, with the final choice of approach only being made
>>>> at the end.  Maybe it would be worth renaming it to something
>>>> more generic though, now that we have two approaches to partial
>>>> vectorisation.
>>>
>>> I like this idea!  I could be wrong, but I'm afraid that we
>>> can not have one common flag to be shared for both approaches,
>>> the check criterias could be different for both approaches, one
>>> counter example for length could be acceptable for masking, such
>>> as length can only allow CONTIGUOUS related modes, but masking
>>> can support more.  When we see acceptable VMAT_LOAD_STORE_LANES,
>>> we leave LOOP_VINFO_CAN_FULLY_MASK_P true, later should length
>>> checking turn it to false?  I guess no, assuming still true, then 
>>> LOOP_VINFO_CAN_FULLY_MASK_P will mean only partial vectorization
>>> for masking, not for both.  We can probably clean LOOP_VINFO_LENS
>>> when the length checking is false, but we just know the vec is empty,
>>> not sure we are unable to do partial vectorization with length,
>>> when we see LOOP_VINFO_CAN_FULLY_MASK_P true, we could still
>>> record length into it if possible.
>> 
>> Let's call the flag in (1) CAN_USE_PARTIAL_VECTORS_P rather than
>> CAN_FULLY_MASK_P to (try to) avoid any confusion from the current name.
>> 
>> What I meant is that each vectorizable_* routine has the responsibility
>> of finding a way of coping with partial vectorisation, or setting
>> CAN_USE_PARTIAL_VECTORS_P to false if it can't.
>> 
>> vectorizable_load chooses the VMAT first, and then decides based on that
>> whether partial vectorisation is supported.  There's no influence in
>> the other direction (partial vectorisation doesn't determine the VMAT).
>> 
>> So once it has chosen a VMAT, vectorizable_load needs to try to find a way
>> of handling the operation with partial vectorisation.  Currently the only
>> way of doing that for VMAT_LOAD_STORE_LANES is using masks.  So at the
>> moment there are two possible outcomes:
>> 
>> - The target supports the necessary IFN_MASK_LOAD_LANES function.
>>   If so, we can use partial vectorisation for the statement, so we
>>   leave CAN_USE_PARTIAL_VECTORS_P true and record the necessary masks
>>   in LOOP_VINFO_MASKS.
>> 
>> - The target doesn't support the necessary IFN_MASK_LOAD_LANES function.
>>   If so, we can't use partial vectorisation, so we clear
>>   CAN_USE_PARTIAL_VECTORS_P.
>> 
>> That's how things work at the moment.  It would work in the same way
>> for lengths if we ever supported IFN_LEN_LOAD_LANES: we'd check whether
>> IFN_LEN_LOAD_LANES is available and record the length in LOOP_VINFO_LENS
>> if so.  If partial vectorisation isn't supported (via masks or lengths),
>> we'd continue to clear CAN_USE_PARTIAL_VECTORS_P.
>> 
>> But equally, if we never add support for IFN_LEN_LOAD_LANES, the current
>> code continues to work with length-based approaches.  We'll continue to
>> clear CAN_USE_PARTIAL_VECTORS_P for VMAT_LOAD_STORE_LANES when the
>> target provides no IFN_MASK_LOAD_LANES function.
>> 
>
> Thanks a lot for your detailed explanation!  This proposal looks good
> based on the current implementation of both masking and length.  I may
> think too much, but I had a bit concern as below when some targets have
> both masking and length supports in future, such as ppc adds masking
> support like SVE.
>
> I assumed that you meant each vectorizable_* routine should record the
> objs for any available partial vectorisation approaches.  If one target
> supports both, we would have both recorded but decide not to do partial
> vectorisation finally since both have records.  The target can disable
> length like through optab to resolve it, but there is one possibility
> that the masking support can be imperfect initially since ISA support
> could be gradual, it further leads some vectorizable_* check or final
> verification to fail for masking, and length approach may work here but
> it gets disabled.  We can miss to use partial vectorisation here.
>
> The other assumption is that each vectorizable_* routine record the 
> first available partial vectorisation approach, let's assume masking
> takes preference, then it's fine to record just one here even if one
> target supports both approaches, but we still have the possibility to
> miss the partial vectorisation chance as some check/verify fail with
> masking but fine with length.
>
> Does this concern make sense?

There's nothing to stop us using masks and lengths in the same loop
in future if we need to.  It would “just” be a case of setting up both
the masks and the lengths in vect_set_loop_condition.  But the point is
that doing that would be extra code, and there's no point writing that
extra code until it's needed.

If some future arch does support both mask-based and length-based
approaches, I think that's even less reason to make a binary choice
between them.  How we prioritise the length and mask approaches when
both are available is something that we'll have to decide at the time.

If your concern is that the arch might support masked operations
without wanting them to be used for loop control, we could test for
that case by checking whether while_ult_optab is implemented.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/7] vect: Support vector load/store with length in vectorizer
  2020-05-29  8:32           ` Richard Sandiford
@ 2020-05-29 12:38             ` Segher Boessenkool
  2020-06-02  9:03             ` [PATCH 5/7 v3] " Kewen.Lin
  1 sibling, 0 replies; 80+ messages in thread
From: Segher Boessenkool @ 2020-05-29 12:38 UTC (permalink / raw)
  To: Kewen.Lin, GCC Patches, Richard Guenther, Bill Schmidt, dje.gcc,
	richard.sandiford

Hi!

On Fri, May 29, 2020 at 09:32:49AM +0100, Richard Sandiford wrote:
> There's nothing to stop us using masks and lengths in the same loop
> in future if we need to.  It would “just” be a case of setting up both
> the masks and the lengths in vect_set_loop_condition.  But the point is
> that doing that would be extra code, and there's no point writing that
> extra code until it's needed.

You won't ever get it right even, because you do not know exactly what
will be needed :-)

> If some future arch does support both mask-based and length-based
> approaches, I think that's even less reason to make a binary choice
> between them.  How we prioritise the length and mask approaches when
> both are available is something that we'll have to decide at the time.
> 
> If your concern is that the arch might support masked operations
> without wanting them to be used for loop control, we could test for
> that case by checking whether while_ult_optab is implemented.

Heh, sneaky.  But at least for now it will work fine, and it is local,
and not hard to change later.


Segher

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 5/7 v3] vect: Support vector load/store with length in vectorizer
  2020-05-29  8:32           ` Richard Sandiford
  2020-05-29 12:38             ` Segher Boessenkool
@ 2020-06-02  9:03             ` Kewen.Lin
  2020-06-02 11:50               ` Richard Sandiford
  1 sibling, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-06-02  9:03 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford
  Cc: Richard Guenther, Bill Schmidt, dje.gcc, Segher Boessenkool

[-- Attachment #1: Type: text/plain, Size: 8947 bytes --]

Hi Richard,

on 2020/5/29 4:32 PM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> on 2020/5/27 6:02 PM, Richard Sandiford wrote:
>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>> Hi Richard,
>>>>

Snip ...

>>
>> Thanks a lot for your detailed explanation!  This proposal looks good
>> based on the current implementation of both masking and length.  I may
>> think too much, but I had a bit concern as below when some targets have
>> both masking and length supports in future, such as ppc adds masking
>> support like SVE.
>>
>> I assumed that you meant each vectorizable_* routine should record the
>> objs for any available partial vectorisation approaches.  If one target
>> supports both, we would have both recorded but decide not to do partial
>> vectorisation finally since both have records.  The target can disable
>> length like through optab to resolve it, but there is one possibility
>> that the masking support can be imperfect initially since ISA support
>> could be gradual, it further leads some vectorizable_* check or final
>> verification to fail for masking, and length approach may work here but
>> it gets disabled.  We can miss to use partial vectorisation here.
>>
>> The other assumption is that each vectorizable_* routine record the 
>> first available partial vectorisation approach, let's assume masking
>> takes preference, then it's fine to record just one here even if one
>> target supports both approaches, but we still have the possiblity to
>> miss the partial vectorisation chance as some check/verify fail with
>> masking but fine with length.
>>
>> Does this concern make sense?
> 
> There's nothing to stop us using masks and lengths in the same loop
> in future if we need to.  It would “just” be a case of setting up both
> the masks and the lengths in vect_set_loop_condition.  But the point is
> that doing that would be extra code, and there's no point writing that
> extra code until it's needed.
> 
> If some future arch does support both mask-based and length-based
> approaches, I think that's even less reason to make a binary choice
> between them.  How we prioritise the length and mask approaches when
> both are available is something that we'll have to decide at the time.
> 
> If your concern is that the arch might support masked operations
> without wanting them to be used for loop control, we could test for
> that case by checking whether while_ult_optab is implemented.
> 
> Thanks,
> Richard
> 

Thanks for your further explanation.  As you pointed out, my concern
is just one case of mixing the mask-based and length-based approaches.
I hadn't realized that and had assumed we would still use one approach
per loop, but that assumption doesn't hold.

The attached v3 patch uses can_partial_vect_p.  In regression testing
with explicit vect-with-length-scope settings, I saw several reduction
failures, so I updated vectorizable_condition to set can_partial_vect_p
to false for !EXTRACT_LAST_REDUCTION, following your guidance that it
should either record something or clear can_partial_vect_p.

Bootstrapped/regtested on powerpc64le-linux-gnu P9 and no remarkable
failures found even with explicit vect-with-length-scope settings.

But I hit one regression failure on aarch64-linux-gnu, shown below:

PASS->FAIL: gcc.target/aarch64/sve/reduc_8.c -march=armv8.2-a+sve  scan-assembler-not \\tcmpeq\\tp[0-9]+\\.s,

It's caused by the vectorizable_condition change; without the change,
the outer loop can be fully masked.  The reduction_type is
TREE_CODE_REDUCTION here, so can_partial_vect_p gets cleared.

From the optimized dump, the previous IR looks fine.  It's doing a
reduction for the inner loop, but we are checking partial vectorisation
for the outer loop.  I'm not sure whether adjusting the current guard
is reasonable for this case.  Could you give some insights?  Thanks in
advance!

BR,
Kewen
------
gcc/ChangeLog

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* doc/invoke.texi (vect-with-length-scope): Document new option.
	* params.opt (vect-with-length-scope): New.
	* tree-vect-loop-manip.c (vect_set_loop_mask): Renamed to ...
	(vect_set_loop_mask_or_len): ... this.  Update variable names
	accordingly.
	(vect_maybe_permute_loop_masks): Replace rgroup_masks with rgroup_objs.
	(vect_set_loop_masks_directly): Renamed to ...
	(vect_set_loop_objs_directly): ... this.  Extend the support to cover
	vector with length, call vect_gen_len for length, replace rgroup_masks
	with rgroup_objs, replace vect_set_loop_mask with
	vect_set_loop_mask_or_len.
	(vect_set_loop_condition_masked): Renamed to ...
	(vect_set_loop_condition_partial): ... this.  Extend the support to
	cover length-based partial vectorization, replace rgroup_masks with
	rgroup_objs, replace vect_iv_limit_for_full_masking with
	vect_iv_limit_for_partial_vect.
	(vect_set_loop_condition_unmasked): Renamed to ...
	(vect_set_loop_condition_normal): ... this.
	(vect_set_loop_condition): Replace vect_set_loop_condition_masked with
	vect_set_loop_condition_partial, replace
	vect_set_loop_condition_unmasked with vect_set_loop_condition_normal.
	(vect_gen_vector_loop_niters): Use LOOP_VINFO_PARTIAL_VECT_P for
	partial vectorization case instead of LOOP_VINFO_FULLY_MASKED_P.
	(vect_do_peeling): Use LOOP_VINFO_PARTIAL_VECT_P for partial
	vectorization case instead of LOOP_VINFO_FULLY_MASKED_P, adjust for
	epilogue handling for length-based partial vectorization.
	* tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize
	fully_with_length_p and epil_partial_vect_p, replace can_fully_mask_p
	with can_partial_vect_p.
	(release_vec_loop_masks): Replace rgroup_masks with rgroup_objs.
	(release_vec_loop_lens): New function.
	(_loop_vec_info::~_loop_vec_info): Use it to free the loop lens.
	(can_produce_all_loop_masks_p): Replace rgroup_masks with rgroup_objs.
	(vect_get_max_nscalars_per_iter): Likewise.
	(min_prec_for_max_niters): New function.  Factored out from ...
	(vect_verify_full_masking): ... this.  Replace
	vect_iv_limit_for_full_masking with vect_iv_limit_for_partial_vect.
	(vect_verify_loop_lens): New function.
	(vect_analyze_loop_costing): Use LOOP_VINFO_PARTIAL_VECT_P for partial
	vectorization case instead of LOOP_VINFO_FULLY_MASKED_P.
	(determine_peel_for_niter): Likewise.
	(vect_analyze_loop_2): Replace LOOP_VINFO_CAN_FULLY_MASK_P with
	LOOP_VINFO_CAN_PARTIAL_VECT_P, replace LOOP_VINFO_FULLY_MASKED_P with
	LOOP_VINFO_PARTIAL_VECT_P.  Check loop-wide reasons for disabling loops
	with length.  Make the final decision about whether to use vector
	access with length or not.  Disable LOOP_VINFO_CAN_PARTIAL_VECT_P if
	both mask-based and length-based approaches are recorded.  Mark the
	epilogue to go with the length-based approach if suitable.
	(vect_analyze_loop): Add handlings for epilogue of loop that is marked
	to use partial vectorization approach.
	(vect_estimate_min_profitable_iters): Replace rgroup_masks with
	rgroup_objs.  Adjust for loop with length-based partial vectorization.
	(vectorizable_reduction): Replace LOOP_VINFO_CAN_FULLY_MASK_P with
	LOOP_VINFO_CAN_PARTIAL_VECT_P, adjust some dumpings.
	(vectorizable_live_operation): Likewise.
	(vect_record_loop_mask): Replace rgroup_masks with rgroup_objs.
	(vect_get_loop_mask): Likewise.
	(vect_record_loop_len): New function.
	(vect_get_loop_len): Likewise.
	(vect_transform_loop): Use LOOP_VINFO_PARTIAL_VECT_P for partial
	vectorization case instead of LOOP_VINFO_FULLY_MASKED_P.
	(vect_iv_limit_for_full_masking): Renamed to ...
	(vect_iv_limit_for_partial_vect): ... here. 
	* tree-vect-stmts.c (permute_vec_elements):
	(check_load_store_masking): Renamed to ...
	(check_load_store_partial_vect): ... here.  Add length-based partial
	vectorization checks.
	(vectorizable_operation): Replace LOOP_VINFO_CAN_FULLY_MASK_P with
	LOOP_VINFO_CAN_PARTIAL_VECT_P.
	(vectorizable_store): Replace check_load_store_masking with
	check_load_store_partial_vect.  Add handlings for length-based partial
	vectorization.
	(vectorizable_load): Likewise.
	(vectorizable_condition): Replace LOOP_VINFO_CAN_FULLY_MASK_P with
	LOOP_VINFO_CAN_PARTIAL_VECT_P.  Guard partial vectorization reduction
	only for EXTRACT_LAST_REDUCTION.
	(vect_gen_len): New function.
	* tree-vectorizer.h (struct rgroup_masks): Renamed to ...
	(struct rgroup_objs): ... this.  Add anonymous union to field
	max_nscalars_per_iter and mask_type.
	(vec_loop_lens): New typedef.
	(_loop_vec_info): Add lens, fully_with_length_p and
	epil_partial_vect_p.  Rename can_fully_mask_p to can_partial_vect_p.
	(LOOP_VINFO_CAN_FULLY_MASK_P): Renamed to ...
	(LOOP_VINFO_CAN_PARTIAL_VECT_P): ... this.
	(LOOP_VINFO_FULLY_WITH_LENGTH_P): New macro.
	(LOOP_VINFO_EPIL_PARTIAL_VECT_P): Likewise.
	(LOOP_VINFO_LENS): Likewise.
	(LOOP_VINFO_PARTIAL_VECT_P): Likewise.
	(vect_iv_limit_for_full_masking): Renamed to ...
	(vect_iv_limit_for_partial_vect): ... this.
	(vect_record_loop_len): New declare.
	(vect_get_loop_len): Likewise.
	(vect_gen_len): Likewise.


[-- Attachment #2: vect_with_length_v3.diff --]
[-- Type: text/plain, Size: 65404 bytes --]

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 8b9935dfe65..ac765feab13 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -13079,6 +13079,13 @@ by the copy loop headers pass.
 @item vect-epilogues-nomask
 Enable loop epilogue vectorization using smaller vector size.
 
+@item vect-with-length-scope
+Control the scope of vector memory access with length exploitation.  0 means we
+don't expliot any vector memory access with length, 1 means we only exploit
+vector memory access with length for those loops whose iteration number are
+less than VF, such as very small loop or epilogue, 2 means we want to exploit
+vector memory access with length for any loops if possible.
+
 @item slp-max-insns-in-bb
 Maximum number of instructions in basic block to be
 considered for SLP vectorization.
diff --git a/gcc/params.opt b/gcc/params.opt
index 4aec480798b..d4309101067 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -964,4 +964,8 @@ Bound on number of runtime checks inserted by the vectorizer's loop versioning f
 Common Joined UInteger Var(param_vect_max_version_for_alignment_checks) Init(6) Param Optimization
 Bound on number of runtime checks inserted by the vectorizer's loop versioning for alignment check.
 
+-param=vect-with-length-scope=
+Common Joined UInteger Var(param_vect_with_length_scope) Init(0) IntegerRange(0, 2) Param Optimization
+Control the vector with length exploitation scope.
+
 ; This comment is to ensure we retain the blank line above.
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 8c5e696b995..0a5770c7d28 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -256,17 +256,17 @@ adjust_phi_and_debug_stmts (gimple *update_phi, edge e, tree new_def)
 			gimple_bb (update_phi));
 }
 
-/* Define one loop mask MASK from loop LOOP.  INIT_MASK is the value that
-   the mask should have during the first iteration and NEXT_MASK is the
+/* Define one loop mask/length OBJ from loop LOOP.  INIT_OBJ is the value that
+   the mask/length should have during the first iteration and NEXT_OBJ is the
    value that it should have on subsequent iterations.  */
 
 static void
-vect_set_loop_mask (class loop *loop, tree mask, tree init_mask,
-		    tree next_mask)
+vect_set_loop_mask_or_len (class loop *loop, tree obj, tree init_obj,
+			   tree next_obj)
 {
-  gphi *phi = create_phi_node (mask, loop->header);
-  add_phi_arg (phi, init_mask, loop_preheader_edge (loop), UNKNOWN_LOCATION);
-  add_phi_arg (phi, next_mask, loop_latch_edge (loop), UNKNOWN_LOCATION);
+  gphi *phi = create_phi_node (obj, loop->header);
+  add_phi_arg (phi, init_obj, loop_preheader_edge (loop), UNKNOWN_LOCATION);
+  add_phi_arg (phi, next_obj, loop_latch_edge (loop), UNKNOWN_LOCATION);
 }
 
 /* Add SEQ to the end of LOOP's preheader block.  */
@@ -320,8 +320,8 @@ interleave_supported_p (vec_perm_indices *indices, tree vectype,
    latter.  Return true on success, adding any new statements to SEQ.  */
 
 static bool
-vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_masks *dest_rgm,
-			       rgroup_masks *src_rgm)
+vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_objs *dest_rgm,
+			       rgroup_objs *src_rgm)
 {
   tree src_masktype = src_rgm->mask_type;
   tree dest_masktype = dest_rgm->mask_type;
@@ -338,10 +338,10 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_masks *dest_rgm,
       machine_mode dest_mode = insn_data[icode1].operand[0].mode;
       gcc_assert (dest_mode == insn_data[icode2].operand[0].mode);
       tree unpack_masktype = vect_halve_mask_nunits (src_masktype, dest_mode);
-      for (unsigned int i = 0; i < dest_rgm->masks.length (); ++i)
+      for (unsigned int i = 0; i < dest_rgm->objs.length (); ++i)
 	{
-	  tree src = src_rgm->masks[i / 2];
-	  tree dest = dest_rgm->masks[i];
+	  tree src = src_rgm->objs[i / 2];
+	  tree dest = dest_rgm->objs[i];
 	  tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
 			    ? VEC_UNPACK_HI_EXPR
 			    : VEC_UNPACK_LO_EXPR);
@@ -371,10 +371,10 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_masks *dest_rgm,
       tree masks[2];
       for (unsigned int i = 0; i < 2; ++i)
 	masks[i] = vect_gen_perm_mask_checked (src_masktype, indices[i]);
-      for (unsigned int i = 0; i < dest_rgm->masks.length (); ++i)
+      for (unsigned int i = 0; i < dest_rgm->objs.length (); ++i)
 	{
-	  tree src = src_rgm->masks[i / 2];
-	  tree dest = dest_rgm->masks[i];
+	  tree src = src_rgm->objs[i / 2];
+	  tree dest = dest_rgm->objs[i];
 	  gimple *stmt = gimple_build_assign (dest, VEC_PERM_EXPR,
 					      src, src, masks[i & 1]);
 	  gimple_seq_add_stmt (seq, stmt);
@@ -384,60 +384,80 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_masks *dest_rgm,
   return false;
 }
 
-/* Helper for vect_set_loop_condition_masked.  Generate definitions for
-   all the masks in RGM and return a mask that is nonzero when the loop
+/* Helper for vect_set_loop_condition_partial.  Generate definitions for
+   all the objs in RGO and return an obj that is nonzero when the loop
    needs to iterate.  Add any new preheader statements to PREHEADER_SEQ.
    Use LOOP_COND_GSI to insert code before the exit gcond.
 
-   RGM belongs to loop LOOP.  The loop originally iterated NITERS
+   RGO belongs to loop LOOP.  The loop originally iterated NITERS
    times and has been vectorized according to LOOP_VINFO.
 
    If NITERS_SKIP is nonnull, the first iteration of the vectorized loop
    starts with NITERS_SKIP dummy iterations of the scalar loop before
-   the real work starts.  The mask elements for these dummy iterations
+   the real work starts.  The obj elements for these dummy iterations
    must be 0, to ensure that the extra iterations do not have an effect.
 
    It is known that:
 
-     NITERS * RGM->max_nscalars_per_iter
+     NITERS * RGO->max_nscalars_per_iter
 
    does not overflow.  However, MIGHT_WRAP_P says whether an induction
    variable that starts at 0 and has step:
 
-     VF * RGM->max_nscalars_per_iter
+     VF * RGO->max_nscalars_per_iter
 
    might overflow before hitting a value above:
 
-     (NITERS + NITERS_SKIP) * RGM->max_nscalars_per_iter
+     (NITERS + NITERS_SKIP) * RGO->max_nscalars_per_iter
 
    This means that we cannot guarantee that such an induction variable
-   would ever hit a value that produces a set of all-false masks for RGM.  */
+   would ever hit a value that produces a set of all-false masks or
+   zero byte length for RGO.  */
 
 static tree
-vect_set_loop_masks_directly (class loop *loop, loop_vec_info loop_vinfo,
+vect_set_loop_objs_directly (class loop *loop, loop_vec_info loop_vinfo,
 			      gimple_seq *preheader_seq,
 			      gimple_stmt_iterator loop_cond_gsi,
-			      rgroup_masks *rgm, tree niters, tree niters_skip,
+			      rgroup_objs *rgo, tree niters, tree niters_skip,
 			      bool might_wrap_p)
 {
   tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);
   tree iv_type = LOOP_VINFO_MASK_IV_TYPE (loop_vinfo);
-  tree mask_type = rgm->mask_type;
-  unsigned int nscalars_per_iter = rgm->max_nscalars_per_iter;
-  poly_uint64 nscalars_per_mask = TYPE_VECTOR_SUBPARTS (mask_type);
+
+  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
+  if (!vect_for_masking)
+    {
+      /* Obtain target supported length type.  */
+      scalar_int_mode len_mode = targetm.vectorize.length_mode;
+      unsigned int len_prec = GET_MODE_PRECISION (len_mode);
+      compare_type = build_nonstandard_integer_type (len_prec, true);
+      /* Simply set iv_type as same as compare_type.  */
+      iv_type = compare_type;
+    }
+
+  tree obj_type = rgo->mask_type;
+  /* Here, take nscalars_per_iter as nbytes_per_iter for length.  */
+  unsigned int nscalars_per_iter = rgo->max_nscalars_per_iter;
+  poly_uint64 nscalars_per_obj = TYPE_VECTOR_SUBPARTS (obj_type);
+  poly_uint64 vector_size = GET_MODE_SIZE (TYPE_MODE (obj_type));
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  tree vec_size = NULL_TREE;
+  /* For length, we probably need vec_size to check length in range.  */
+  if (!vect_for_masking)
+    vec_size = build_int_cst (compare_type, vector_size);
 
   /* Calculate the maximum number of scalar values that the rgroup
      handles in total, the number that it handles for each iteration
      of the vector loop, and the number that it should skip during the
-     first iteration of the vector loop.  */
+     first iteration of the vector loop.  For vector with length, take
+     scalar values as bytes.  */
   tree nscalars_total = niters;
   tree nscalars_step = build_int_cst (iv_type, vf);
   tree nscalars_skip = niters_skip;
   if (nscalars_per_iter != 1)
     {
-      /* We checked before choosing to use a fully-masked loop that these
-	 multiplications don't overflow.  */
+      /* We checked before choosing to use a fully-masked or fully with length
+	 loop that these multiplications don't overflow.  */
       tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
       tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
       nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
@@ -541,28 +561,28 @@ vect_set_loop_masks_directly (class loop *loop, loop_vec_info loop_vinfo,
   test_index = gimple_convert (&test_seq, compare_type, test_index);
   gsi_insert_seq_before (test_gsi, test_seq, GSI_SAME_STMT);
 
-  /* Provide a definition of each mask in the group.  */
-  tree next_mask = NULL_TREE;
-  tree mask;
+  /* Provide a definition of each obj in the group.  */
+  tree next_obj = NULL_TREE;
+  tree obj;
   unsigned int i;
-  FOR_EACH_VEC_ELT_REVERSE (rgm->masks, i, mask)
+  poly_uint64 batch_cnt = vect_for_masking ? nscalars_per_obj : vector_size;
+  FOR_EACH_VEC_ELT_REVERSE (rgo->objs, i, obj)
     {
-      /* Previous masks will cover BIAS scalars.  This mask covers the
+      /* Previous objs will cover BIAS scalars.  This obj covers the
 	 next batch.  */
-      poly_uint64 bias = nscalars_per_mask * i;
+      poly_uint64 bias = batch_cnt * i;
       tree bias_tree = build_int_cst (compare_type, bias);
-      gimple *tmp_stmt;
 
       /* See whether the first iteration of the vector loop is known
-	 to have a full mask.  */
+	 to have a full mask or length.  */
       poly_uint64 const_limit;
       bool first_iteration_full
 	= (poly_int_tree_p (first_limit, &const_limit)
-	   && known_ge (const_limit, (i + 1) * nscalars_per_mask));
+	   && known_ge (const_limit, (i + 1) * batch_cnt));
 
       /* Rather than have a new IV that starts at BIAS and goes up to
 	 TEST_LIMIT, prefer to use the same 0-based IV for each mask
-	 and adjust the bound down by BIAS.  */
+	 or length and adjust the bound down by BIAS.  */
       tree this_test_limit = test_limit;
       if (i != 0)
 	{
@@ -574,9 +594,9 @@ vect_set_loop_masks_directly (class loop *loop, loop_vec_info loop_vinfo,
 					  bias_tree);
 	}
 
-      /* Create the initial mask.  First include all scalars that
+      /* Create the initial obj.  First include all scalars that
 	 are within the loop limit.  */
-      tree init_mask = NULL_TREE;
+      tree init_obj = NULL_TREE;
       if (!first_iteration_full)
 	{
 	  tree start, end;
@@ -598,9 +618,18 @@ vect_set_loop_masks_directly (class loop *loop, loop_vec_info loop_vinfo,
 	      end = first_limit;
 	    }
 
-	  init_mask = make_temp_ssa_name (mask_type, NULL, "max_mask");
-	  tmp_stmt = vect_gen_while (init_mask, start, end);
-	  gimple_seq_add_stmt (preheader_seq, tmp_stmt);
+	  if (vect_for_masking)
+	    {
+	      init_obj = make_temp_ssa_name (obj_type, NULL, "max_mask");
+	      gimple *tmp_stmt = vect_gen_while (init_obj, start, end);
+	      gimple_seq_add_stmt (preheader_seq, tmp_stmt);
+	    }
+	  else
+	    {
+	      init_obj = make_temp_ssa_name (compare_type, NULL, "max_len");
+	      gimple_seq seq = vect_gen_len (init_obj, start, end, vec_size);
+	      gimple_seq_add_seq (preheader_seq, seq);
+	    }
 	}
 
       /* Now AND out the bits that are within the number of skipped
@@ -610,51 +639,76 @@ vect_set_loop_masks_directly (class loop *loop, loop_vec_info loop_vinfo,
 	  && !(poly_int_tree_p (nscalars_skip, &const_skip)
 	       && known_le (const_skip, bias)))
 	{
-	  tree unskipped_mask = vect_gen_while_not (preheader_seq, mask_type,
+	  tree unskipped_mask = vect_gen_while_not (preheader_seq, obj_type,
 						    bias_tree, nscalars_skip);
-	  if (init_mask)
-	    init_mask = gimple_build (preheader_seq, BIT_AND_EXPR, mask_type,
-				      init_mask, unskipped_mask);
+	  if (init_obj)
+	    init_obj = gimple_build (preheader_seq, BIT_AND_EXPR, obj_type,
+				      init_obj, unskipped_mask);
 	  else
-	    init_mask = unskipped_mask;
+	    init_obj = unskipped_mask;
+	  gcc_assert (vect_for_masking);
 	}
 
-      if (!init_mask)
-	/* First iteration is full.  */
-	init_mask = build_minus_one_cst (mask_type);
+      /* First iteration is full.  */
+      if (!init_obj)
+	{
+	  if (vect_for_masking)
+	    init_obj = build_minus_one_cst (obj_type);
+	  else
+	    init_obj = vec_size;
+	}
 
-      /* Get the mask value for the next iteration of the loop.  */
-      next_mask = make_temp_ssa_name (mask_type, NULL, "next_mask");
-      gcall *call = vect_gen_while (next_mask, test_index, this_test_limit);
-      gsi_insert_before (test_gsi, call, GSI_SAME_STMT);
+      /* Get the obj value for the next iteration of the loop.  */
+      if (vect_for_masking)
+	{
+	  next_obj = make_temp_ssa_name (obj_type, NULL, "next_mask");
+	  gcall *call = vect_gen_while (next_obj, test_index, this_test_limit);
+	  gsi_insert_before (test_gsi, call, GSI_SAME_STMT);
+	}
+      else
+	{
+	  next_obj = make_temp_ssa_name (compare_type, NULL, "next_len");
+	  tree end = this_test_limit;
+	  gimple_seq seq = vect_gen_len (next_obj, test_index, end, vec_size);
+	  gsi_insert_seq_before (test_gsi, seq, GSI_SAME_STMT);
+	}
 
-      vect_set_loop_mask (loop, mask, init_mask, next_mask);
+      vect_set_loop_mask_or_len (loop, obj, init_obj, next_obj);
     }
-  return next_mask;
+  return next_obj;
 }
 
-/* Make LOOP iterate NITERS times using masking and WHILE_ULT calls.
-   LOOP_VINFO describes the vectorization of LOOP.  NITERS is the
-   number of iterations of the original scalar loop that should be
-   handled by the vector loop.  NITERS_MAYBE_ZERO and FINAL_IV are
-   as for vect_set_loop_condition.
+/* Make LOOP iterate NITERS times using objects like masks (and
+   WHILE_ULT calls) or lengths.  LOOP_VINFO describes the vectorization
+   of LOOP.  NITERS is the number of iterations of the original scalar
+   loop that should be handled by the vector loop.  NITERS_MAYBE_ZERO
+   and FINAL_IV are as for vect_set_loop_condition.
 
    Insert the branch-back condition before LOOP_COND_GSI and return the
    final gcond.  */
 
 static gcond *
-vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo,
-				tree niters, tree final_iv,
-				bool niters_maybe_zero,
-				gimple_stmt_iterator loop_cond_gsi)
+vect_set_loop_condition_partial (class loop *loop, loop_vec_info loop_vinfo,
+				 tree niters, tree final_iv,
+				 bool niters_maybe_zero,
+				 gimple_stmt_iterator loop_cond_gsi)
 {
   gimple_seq preheader_seq = NULL;
   gimple_seq header_seq = NULL;
 
+  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
+
   tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);
+  if (!vect_for_masking)
+    {
+      /* Obtain target supported length type as compare_type.  */
+      scalar_int_mode len_mode = targetm.vectorize.length_mode;
+      unsigned len_prec = GET_MODE_PRECISION (len_mode);
+      compare_type = build_nonstandard_integer_type (len_prec, true);
+    }
   unsigned int compare_precision = TYPE_PRECISION (compare_type);
-  tree orig_niters = niters;
 
+  tree orig_niters = niters;
   /* Type of the initial value of NITERS.  */
   tree ni_actual_type = TREE_TYPE (niters);
   unsigned int ni_actual_precision = TYPE_PRECISION (ni_actual_type);
@@ -677,42 +731,45 @@ vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo,
   else
     niters = gimple_convert (&preheader_seq, compare_type, niters);
 
-  widest_int iv_limit = vect_iv_limit_for_full_masking (loop_vinfo);
+  widest_int iv_limit = vect_iv_limit_for_partial_vect (loop_vinfo);
 
-  /* Iterate over all the rgroups and fill in their masks.  We could use
-     the first mask from any rgroup for the loop condition; here we
+  /* Iterate over all the rgroups and fill in their objs.  We could use
+     the first obj from any rgroup for the loop condition; here we
      arbitrarily pick the last.  */
-  tree test_mask = NULL_TREE;
-  rgroup_masks *rgm;
+  tree test_obj = NULL_TREE;
+  rgroup_objs *rgo;
   unsigned int i;
-  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
-  FOR_EACH_VEC_ELT (*masks, i, rgm)
-    if (!rgm->masks.is_empty ())
+  auto_vec<rgroup_objs> *objs = vect_for_masking
+				  ? &LOOP_VINFO_MASKS (loop_vinfo)
+				  : &LOOP_VINFO_LENS (loop_vinfo);
+
+  FOR_EACH_VEC_ELT (*objs, i, rgo)
+    if (!rgo->objs.is_empty ())
       {
 	/* First try using permutes.  This adds a single vector
 	   instruction to the loop for each mask, but needs no extra
 	   loop invariants or IVs.  */
 	unsigned int nmasks = i + 1;
-	if ((nmasks & 1) == 0)
+	if (vect_for_masking && (nmasks & 1) == 0)
 	  {
-	    rgroup_masks *half_rgm = &(*masks)[nmasks / 2 - 1];
-	    if (!half_rgm->masks.is_empty ()
-		&& vect_maybe_permute_loop_masks (&header_seq, rgm, half_rgm))
+	    rgroup_objs *half_rgo = &(*objs)[nmasks / 2 - 1];
+	    if (!half_rgo->objs.is_empty ()
+		&& vect_maybe_permute_loop_masks (&header_seq, rgo, half_rgo))
 	      continue;
 	  }
 
 	/* See whether zero-based IV would ever generate all-false masks
-	   before wrapping around.  */
+	   or zero byte length before wrapping around.  */
 	bool might_wrap_p
 	  = (iv_limit == -1
-	     || (wi::min_precision (iv_limit * rgm->max_nscalars_per_iter,
+	     || (wi::min_precision (iv_limit * rgo->max_nscalars_per_iter,
 				    UNSIGNED)
 		 > compare_precision));
 
-	/* Set up all masks for this group.  */
-	test_mask = vect_set_loop_masks_directly (loop, loop_vinfo,
+	/* Set up all masks/lengths for this group.  */
+	test_obj = vect_set_loop_objs_directly (loop, loop_vinfo,
 						  &preheader_seq,
-						  loop_cond_gsi, rgm,
+						  loop_cond_gsi, rgo,
 						  niters, niters_skip,
 						  might_wrap_p);
       }
@@ -724,8 +781,8 @@ vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo,
   /* Get a boolean result that tells us whether to iterate.  */
   edge exit_edge = single_exit (loop);
   tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? EQ_EXPR : NE_EXPR;
-  tree zero_mask = build_zero_cst (TREE_TYPE (test_mask));
-  gcond *cond_stmt = gimple_build_cond (code, test_mask, zero_mask,
+  tree zero_obj = build_zero_cst (TREE_TYPE (test_obj));
+  gcond *cond_stmt = gimple_build_cond (code, test_obj, zero_obj,
 					NULL_TREE, NULL_TREE);
   gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
 
@@ -748,13 +805,12 @@ vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo,
 }
 
 /* Like vect_set_loop_condition, but handle the case in which there
-   are no loop masks.  */
+   are no loop masks/lengths.  */
 
 static gcond *
-vect_set_loop_condition_unmasked (class loop *loop, tree niters,
-				  tree step, tree final_iv,
-				  bool niters_maybe_zero,
-				  gimple_stmt_iterator loop_cond_gsi)
+vect_set_loop_condition_normal (class loop *loop, tree niters, tree step,
+			      tree final_iv, bool niters_maybe_zero,
+			      gimple_stmt_iterator loop_cond_gsi)
 {
   tree indx_before_incr, indx_after_incr;
   gcond *cond_stmt;
@@ -912,14 +968,14 @@ vect_set_loop_condition (class loop *loop, loop_vec_info loop_vinfo,
   gcond *orig_cond = get_loop_exit_condition (loop);
   gimple_stmt_iterator loop_cond_gsi = gsi_for_stmt (orig_cond);
 
-  if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
-    cond_stmt = vect_set_loop_condition_masked (loop, loop_vinfo, niters,
-						final_iv, niters_maybe_zero,
-						loop_cond_gsi);
+  if (loop_vinfo && LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo))
+    cond_stmt
+      = vect_set_loop_condition_partial (loop, loop_vinfo, niters, final_iv,
+					 niters_maybe_zero, loop_cond_gsi);
   else
-    cond_stmt = vect_set_loop_condition_unmasked (loop, niters, step,
-						  final_iv, niters_maybe_zero,
-						  loop_cond_gsi);
+    cond_stmt
+      = vect_set_loop_condition_normal (loop, niters, step, final_iv,
+					niters_maybe_zero, loop_cond_gsi);
 
   /* Remove old loop exit test.  */
   stmt_vec_info orig_cond_info;
@@ -1938,8 +1994,7 @@ vect_gen_vector_loop_niters (loop_vec_info loop_vinfo, tree niters,
     ni_minus_gap = niters;
 
   unsigned HOST_WIDE_INT const_vf;
-  if (vf.is_constant (&const_vf)
-      && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  if (vf.is_constant (&const_vf) && !LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo))
     {
       /* Create: niters >> log2(vf) */
       /* If it's known that niters == number of latch executions + 1 doesn't
@@ -2471,7 +2526,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   poly_uint64 bound_epilog = 0;
-  if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+  if (!LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo)
       && LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
     bound_epilog += vf - 1;
   if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
@@ -2567,7 +2622,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   if (vect_epilogues
       && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
       && prolog_peeling >= 0
-      && known_eq (vf, lowest_vf))
+      && known_eq (vf, lowest_vf)
+      && !LOOP_VINFO_FULLY_WITH_LENGTH_P (epilogue_vinfo))
     {
       unsigned HOST_WIDE_INT eiters
 	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 80e33b61be7..99e6cb904ba 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -813,8 +813,10 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     vec_outside_cost (0),
     vec_inside_cost (0),
     vectorizable (false),
-    can_fully_mask_p (true),
+    can_partial_vect_p (true),
     fully_masked_p (false),
+    fully_with_length_p (false),
+    epil_partial_vect_p (false),
     peeling_for_gaps (false),
     peeling_for_niter (false),
     no_data_dependencies (false),
@@ -880,13 +882,25 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
 void
 release_vec_loop_masks (vec_loop_masks *masks)
 {
-  rgroup_masks *rgm;
+  rgroup_objs *rgm;
   unsigned int i;
   FOR_EACH_VEC_ELT (*masks, i, rgm)
-    rgm->masks.release ();
+    rgm->objs.release ();
   masks->release ();
 }
 
+/* Free all levels of LENS.  */
+
+void
+release_vec_loop_lens (vec_loop_lens *lens)
+{
+  rgroup_objs *rgl;
+  unsigned int i;
+  FOR_EACH_VEC_ELT (*lens, i, rgl)
+    rgl->objs.release ();
+  lens->release ();
+}
+
 /* Free all memory used by the _loop_vec_info, as well as all the
    stmt_vec_info structs of all the stmts in the loop.  */
 
@@ -895,6 +909,7 @@ _loop_vec_info::~_loop_vec_info ()
   free (bbs);
 
   release_vec_loop_masks (&masks);
+  release_vec_loop_lens (&lens);
   delete ivexpr_map;
   delete scan_map;
   epilogue_vinfos.release ();
@@ -935,7 +950,7 @@ cse_and_gimplify_to_preheader (loop_vec_info loop_vinfo, tree expr)
 static bool
 can_produce_all_loop_masks_p (loop_vec_info loop_vinfo, tree cmp_type)
 {
-  rgroup_masks *rgm;
+  rgroup_objs *rgm;
   unsigned int i;
   FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm)
     if (rgm->mask_type != NULL_TREE
@@ -954,12 +969,40 @@ vect_get_max_nscalars_per_iter (loop_vec_info loop_vinfo)
 {
   unsigned int res = 1;
   unsigned int i;
-  rgroup_masks *rgm;
+  rgroup_objs *rgm;
   FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm)
     res = MAX (res, rgm->max_nscalars_per_iter);
   return res;
 }
 
+/* Calculate the minimal bits necessary to represent the maximal iteration
+   count of loop with loop_vec_info LOOP_VINFO which is scaling with a given
+   factor FACTOR.  */
+
+static unsigned
+min_prec_for_max_niters (loop_vec_info loop_vinfo, unsigned int factor)
+{
+  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+
+  /* Get the maximum number of iterations that is representable
+     in the counter type.  */
+  tree ni_type = TREE_TYPE (LOOP_VINFO_NITERSM1 (loop_vinfo));
+  widest_int max_ni = wi::to_widest (TYPE_MAX_VALUE (ni_type)) + 1;
+
+  /* Get a more refined estimate for the number of iterations.  */
+  widest_int max_back_edges;
+  if (max_loop_iterations (loop, &max_back_edges))
+    max_ni = wi::smin (max_ni, max_back_edges + 1);
+
+  /* Account for factor, in which each bit is replicated N times.  */
+  max_ni *= factor;
+
+  /* Work out how many bits we need to represent the limit.  */
+  unsigned int min_ni_width = wi::min_precision (max_ni, UNSIGNED);
+
+  return min_ni_width;
+}
+
 /* Each statement in LOOP_VINFO can be masked where necessary.  Check
    whether we can actually generate the masks required.  Return true if so,
    storing the type of the scalar IV in LOOP_VINFO_MASK_COMPARE_TYPE.  */
@@ -967,7 +1010,6 @@ vect_get_max_nscalars_per_iter (loop_vec_info loop_vinfo)
 static bool
 vect_verify_full_masking (loop_vec_info loop_vinfo)
 {
-  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   unsigned int min_ni_width;
   unsigned int max_nscalars_per_iter
     = vect_get_max_nscalars_per_iter (loop_vinfo);
@@ -978,27 +1020,14 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
   if (LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
     return false;
 
-  /* Get the maximum number of iterations that is representable
-     in the counter type.  */
-  tree ni_type = TREE_TYPE (LOOP_VINFO_NITERSM1 (loop_vinfo));
-  widest_int max_ni = wi::to_widest (TYPE_MAX_VALUE (ni_type)) + 1;
-
-  /* Get a more refined estimate for the number of iterations.  */
-  widest_int max_back_edges;
-  if (max_loop_iterations (loop, &max_back_edges))
-    max_ni = wi::smin (max_ni, max_back_edges + 1);
-
-  /* Account for rgroup masks, in which each bit is replicated N times.  */
-  max_ni *= max_nscalars_per_iter;
-
   /* Work out how many bits we need to represent the limit.  */
-  min_ni_width = wi::min_precision (max_ni, UNSIGNED);
+  min_ni_width = min_prec_for_max_niters (loop_vinfo, max_nscalars_per_iter);
 
   /* Find a scalar mode for which WHILE_ULT is supported.  */
   opt_scalar_int_mode cmp_mode_iter;
   tree cmp_type = NULL_TREE;
   tree iv_type = NULL_TREE;
-  widest_int iv_limit = vect_iv_limit_for_full_masking (loop_vinfo);
+  widest_int iv_limit = vect_iv_limit_for_partial_vect (loop_vinfo);
   unsigned int iv_precision = UINT_MAX;
 
   if (iv_limit != -1)
@@ -1056,6 +1085,33 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
   return true;
 }
 
+/* Check whether we can use vector access with length based on precision
+   comparison.  So far, to keep it simple, we only allow the case in which
+   the precision of the target supported length is no less than the
+   precision required by the loop niters.  */
+
+static bool
+vect_verify_loop_lens (loop_vec_info loop_vinfo)
+{
+  vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
+
+  if (LOOP_VINFO_LENS (loop_vinfo).is_empty ())
+    return false;
+
+  /* The rgroup with the largest nV should have the max bytes per iter.  */
+  rgroup_objs *rgl = &(*lens)[lens->length () - 1];
+
+  /* Work out how many bits we need to represent the limit.  */
+  unsigned int min_ni_width
+    = min_prec_for_max_niters (loop_vinfo, rgl->nbytes_per_iter);
+
+  unsigned len_bits = GET_MODE_PRECISION (targetm.vectorize.length_mode);
+  if (len_bits < min_ni_width)
+    return false;
+
+  return true;
+}
+
 /* Calculate the cost of one scalar iteration of the loop.  */
 static void
 vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo)
@@ -1628,9 +1684,9 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   unsigned int assumed_vf = vect_vf_for_cost (loop_vinfo);
 
-  /* Only fully-masked loops can have iteration counts less than the
-     vectorization factor.  */
-  if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  /* Only fully-masked or fully with length loops can have iteration counts less
+     than the vectorization factor.  */
+  if (!LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo))
     {
       if (known_niters_smaller_than_vf (loop_vinfo))
 	{
@@ -1858,7 +1914,7 @@ determine_peel_for_niter (loop_vec_info loop_vinfo)
     th = LOOP_VINFO_COST_MODEL_THRESHOLD (LOOP_VINFO_ORIG_LOOP_INFO
 					  (loop_vinfo));
 
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  if (LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo))
     /* The main loop handles all iterations.  */
     LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
   else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
@@ -2047,7 +2103,7 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts)
       vect_optimize_slp (loop_vinfo);
     }
 
-  bool saved_can_fully_mask_p = LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo);
+  bool saved_can_partial_vect_p = LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo);
 
   /* We don't expect to have to roll back to anything other than an empty
      set of rgroups.  */
@@ -2129,10 +2185,24 @@ start_over:
       return ok;
     }
 
+  /* For now we don't expect to mix the masking and length approaches in one
+     loop, so disable partial vectorization if both are recorded.  */
+  if (LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo)
+      && !LOOP_VINFO_MASKS (loop_vinfo).is_empty ()
+      && !LOOP_VINFO_LENS (loop_vinfo).is_empty ())
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use a partial vectorized loop because we"
+			 " don't expect to mix partial vectorization"
+			 " approaches for the same loop.\n");
+      LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
+    }
+
   /* Decide whether to use a fully-masked loop for this vectorization
      factor.  */
   LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
-    = (LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo)
+    = (LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo)
        && vect_verify_full_masking (loop_vinfo));
   if (dump_enabled_p ())
     {
@@ -2144,6 +2214,50 @@ start_over:
 			 "not using a fully-masked loop.\n");
     }
 
+  /* Decide whether to use vector access with length.  */
+  LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+    = (LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo)
+       && vect_verify_loop_lens (loop_vinfo));
+
+  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+      && (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+	  || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use vector access with length because peeling"
+			 " for alignment or gaps is required.\n");
+      LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = false;
+    }
+
+  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      if (param_vect_with_length_scope == 0)
+	LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = false;
+      /* The epilogue and other cases with known niters smaller than VF can
+	 still use vector access with length fully.  */
+      else if (param_vect_with_length_scope == 1
+	       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+	       && !known_niters_smaller_than_vf (loop_vinfo))
+	{
+	  LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = false;
+	  LOOP_VINFO_EPIL_PARTIAL_VECT_P (loop_vinfo) = true;
+	}
+    }
+  else
+    /* Always set it as false in case previous tries set it.  */
+    LOOP_VINFO_EPIL_PARTIAL_VECT_P (loop_vinfo) = false;
+
+  if (dump_enabled_p ())
+    {
+      if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+	dump_printf_loc (MSG_NOTE, vect_location, "using vector access with"
+						  " length for loop fully.\n");
+      else
+	dump_printf_loc (MSG_NOTE, vect_location, "not using vector access with"
+						  " length for loop fully.\n");
+    }
+
   /* If epilog loop is required because of data accesses with gaps,
      one additional iteration needs to be peeled.  Check if there is
      enough iterations for vectorization.  */
@@ -2163,7 +2277,7 @@ start_over:
   /* If we're vectorizing an epilogue loop, we either need a fully-masked
      loop or a loop that has a lower VF than the main loop.  */
   if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
-      && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+      && !LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo)
       && maybe_ge (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
 		   LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo)))
     return opt_result::failure_at (vect_location,
@@ -2362,12 +2476,13 @@ again:
     = init_cost (LOOP_VINFO_LOOP (loop_vinfo));
   /* Reset accumulated rgroup information.  */
   release_vec_loop_masks (&LOOP_VINFO_MASKS (loop_vinfo));
+  release_vec_loop_lens (&LOOP_VINFO_LENS (loop_vinfo));
   /* Reset assorted flags.  */
   LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
   LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) = false;
   LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = 0;
   LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo) = 0;
-  LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = saved_can_fully_mask_p;
+  LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = saved_can_partial_vect_p;
 
   goto start_over;
 }
@@ -2646,8 +2761,11 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 	      if (ordered_p (lowest_th, th))
 		lowest_th = ordered_min (lowest_th, th);
 	    }
-	  else
-	    delete loop_vinfo;
+	  else
+	    {
+	      delete loop_vinfo;
+	      loop_vinfo = opt_loop_vec_info::success (NULL);
+	    }
 
 	  /* Only vectorize epilogues if PARAM_VECT_EPILOGUES_NOMASK is
 	     enabled, SIMDUID is not set, it is the innermost loop and we have
@@ -2672,6 +2789,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
       else
 	{
 	  delete loop_vinfo;
+	  loop_vinfo = opt_loop_vec_info::success (NULL);
 	  if (fatal)
 	    {
 	      gcc_checking_assert (first_loop_vinfo == NULL);
@@ -2679,6 +2797,22 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 	    }
 	}
 
+      /* Handle the case in which the original loop can use partial
+	 vectorization, but we only want to adopt it for the epilogue.  The
+	 retry should use the same vector mode as the original.  */
+      if (vect_epilogues && loop_vinfo
+	  && LOOP_VINFO_EPIL_PARTIAL_VECT_P (loop_vinfo))
+	{
+	  gcc_assert (LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo)
+		      && !LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo));
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "***** Re-trying analysis with same vector mode"
+			     " %s for epilogue with partial vectorization.\n",
+			     GET_MODE_NAME (loop_vinfo->vector_mode));
+	  continue;
+	}
+
       if (mode_i < vector_modes.length ()
 	  && VECTOR_MODE_P (autodetected_vector_mode)
 	  && (related_vector_mode (vector_modes[mode_i],
@@ -3493,7 +3627,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 
       /* Calculate how many masks we need to generate.  */
       unsigned int num_masks = 0;
-      rgroup_masks *rgm;
+      rgroup_objs *rgm;
       unsigned int num_vectors_m1;
       FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
 	if (rgm->mask_type)
@@ -3519,6 +3653,11 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 			    target_cost_data, num_masks - 1, vector_stmt,
 			    NULL, NULL_TREE, 0, vect_body);
     }
+  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      peel_iters_prologue = 0;
+      peel_iters_epilogue = 0;
+    }
   else if (npeel < 0)
     {
       peel_iters_prologue = assumed_vf / 2;
@@ -3808,7 +3947,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 		 "  Calculated minimum iters for profitability: %d\n",
 		 min_profitable_iters);
 
-  if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+  if (!LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo)
       && min_profitable_iters < (assumed_vf + peel_iters_prologue))
     /* We want the vectorized loop to execute at least once.  */
     min_profitable_iters = assumed_vf + peel_iters_prologue;
@@ -6761,6 +6900,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
     dump_printf_loc (MSG_NOTE, vect_location,
 		     "using an in-order (fold-left) reduction.\n");
   STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
+
   /* All but single defuse-cycle optimized, lane-reducing and fold-left
      reductions go through their own vectorizable_* routines.  */
   if (!single_defuse_cycle
@@ -6779,7 +6919,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
       STMT_VINFO_DEF_TYPE (vect_orig_stmt (tem)) = vect_internal_def;
       STMT_VINFO_DEF_TYPE (tem) = vect_internal_def;
     }
-  else if (loop_vinfo && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
+  else if (loop_vinfo && LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo))
     {
       vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
       internal_fn cond_fn = get_conditional_internal_fn (code);
@@ -6792,9 +6932,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 	{
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			     "can't use a fully-masked loop because no"
+			     "can't use a partial vectorized loop because no"
 			     " conditional operation is available.\n");
-	  LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+	  LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
 	}
       else if (reduction_type == FOLD_LEFT_REDUCTION
 	       && reduc_fn == IFN_LAST
@@ -6804,9 +6944,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 	{
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			     "can't use a fully-masked loop because no"
+			     "can't use a partial vectorized loop because no"
 			     " conditional operation is available.\n");
-	  LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+	  LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
 	}
       else
 	vect_record_loop_mask (loop_vinfo, masks, ncopies * vec_num,
@@ -8005,33 +8145,33 @@ vectorizable_live_operation (loop_vec_info loop_vinfo,
   if (!vec_stmt_p)
     {
       /* No transformation required.  */
-      if (LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
+      if (LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo))
 	{
 	  if (!direct_internal_fn_supported_p (IFN_EXTRACT_LAST, vectype,
 					       OPTIMIZE_FOR_SPEED))
 	    {
 	      if (dump_enabled_p ())
 		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-				 "can't use a fully-masked loop because "
+				 "can't use a partial vectorized loop because "
 				 "the target doesn't support extract last "
 				 "reduction.\n");
-	      LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+	      LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
 	    }
 	  else if (slp_node)
 	    {
 	      if (dump_enabled_p ())
 		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-				 "can't use a fully-masked loop because an "
-				 "SLP statement is live after the loop.\n");
-	      LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+				 "can't use a partial vectorized loop because "
+				 "an SLP statement is live after the loop.\n");
+	      LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
 	    }
 	  else if (ncopies > 1)
 	    {
 	      if (dump_enabled_p ())
 		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-				 "can't use a fully-masked loop because"
+				 "can't use a partial vectorized loop because"
 				 " ncopies is greater than 1.\n");
-	      LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+	      LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
 	    }
 	  else
 	    {
@@ -8041,6 +8181,7 @@ vectorizable_live_operation (loop_vec_info loop_vinfo,
 				     1, vectype, NULL);
 	    }
 	}
+
       return true;
     }
 
@@ -8285,7 +8426,7 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
   gcc_assert (nvectors != 0);
   if (masks->length () < nvectors)
     masks->safe_grow_cleared (nvectors);
-  rgroup_masks *rgm = &(*masks)[nvectors - 1];
+  rgroup_objs *rgm = &(*masks)[nvectors - 1];
   /* The number of scalars per iteration and the number of vectors are
      both compile-time constants.  */
   unsigned int nscalars_per_iter
@@ -8316,24 +8457,24 @@ tree
 vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
 		    unsigned int nvectors, tree vectype, unsigned int index)
 {
-  rgroup_masks *rgm = &(*masks)[nvectors - 1];
+  rgroup_objs *rgm = &(*masks)[nvectors - 1];
   tree mask_type = rgm->mask_type;
 
   /* Populate the rgroup's mask array, if this is the first time we've
      used it.  */
-  if (rgm->masks.is_empty ())
+  if (rgm->objs.is_empty ())
     {
-      rgm->masks.safe_grow_cleared (nvectors);
+      rgm->objs.safe_grow_cleared (nvectors);
       for (unsigned int i = 0; i < nvectors; ++i)
 	{
 	  tree mask = make_temp_ssa_name (mask_type, NULL, "loop_mask");
 	  /* Provide a dummy definition until the real one is available.  */
 	  SSA_NAME_DEF_STMT (mask) = gimple_build_nop ();
-	  rgm->masks[i] = mask;
+	  rgm->objs[i] = mask;
 	}
     }
 
-  tree mask = rgm->masks[index];
+  tree mask = rgm->objs[index];
   if (maybe_ne (TYPE_VECTOR_SUBPARTS (mask_type),
 		TYPE_VECTOR_SUBPARTS (vectype)))
     {
@@ -8354,6 +8495,66 @@ vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
   return mask;
 }
 
+/* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS
+   lengths for vector access with length that each control a vector of type
+   VECTYPE.  */
+
+void
+vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
+		       unsigned int nvectors, tree vectype)
+{
+  gcc_assert (nvectors != 0);
+  if (lens->length () < nvectors)
+    lens->safe_grow_cleared (nvectors);
+  rgroup_objs *rgl = &(*lens)[nvectors - 1];
+
+  /* The number of scalars per iteration, the total bytes they occupy and the
+     number of vectors are all compile-time constants.  */
+  poly_uint64 vector_size = GET_MODE_SIZE (TYPE_MODE (vectype));
+  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  unsigned int nbytes_per_iter
+    = exact_div (nvectors * vector_size, vf).to_constant ();
+
+  /* All rgroups recorded for the same nvectors should have the same bytes
+     per iteration.  */
+  if (!rgl->vec_type)
+    {
+      rgl->vec_type = vectype;
+      rgl->nbytes_per_iter = nbytes_per_iter;
+    }
+  else
+    gcc_assert (rgl->nbytes_per_iter == nbytes_per_iter);
+}
+
+/* Given a complete set of length LENS, extract length number INDEX for an
+   rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.  */
+
+tree
+vect_get_loop_len (vec_loop_lens *lens, unsigned int nvectors, unsigned int index)
+{
+  rgroup_objs *rgl = &(*lens)[nvectors - 1];
+
+  /* Populate the rgroup's len array, if this is the first time we've
+     used it.  */
+  if (rgl->objs.is_empty ())
+    {
+      rgl->objs.safe_grow_cleared (nvectors);
+      for (unsigned int i = 0; i < nvectors; ++i)
+	{
+	  scalar_int_mode len_mode = targetm.vectorize.length_mode;
+	  unsigned int len_prec = GET_MODE_PRECISION (len_mode);
+	  tree len_type = build_nonstandard_integer_type (len_prec, true);
+	  tree len = make_temp_ssa_name (len_type, NULL, "loop_len");
+
+	  /* Provide a dummy definition until the real one is available.  */
+	  SSA_NAME_DEF_STMT (len) = gimple_build_nop ();
+	  rgl->objs[i] = len;
+	}
+    }
+
+  return rgl->objs[index];
+}
+
 /* Scale profiling counters by estimation for LOOP which is vectorized
    by factor VF.  */
 
@@ -8713,7 +8914,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   if (niters_vector == NULL_TREE)
     {
       if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
-	  && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+	  && !LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo)
 	  && known_eq (lowest_vf, vf))
 	{
 	  niters_vector
@@ -8881,7 +9082,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
 
   /* True if the final iteration might not handle a full vector's
      worth of scalar iterations.  */
-  bool final_iter_may_be_partial = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
+  bool final_iter_may_be_partial = LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo);
   /* The minimum number of iterations performed by the epilogue.  This
      is 1 when peeling for gaps because we always need a final scalar
      iteration.  */
@@ -9184,12 +9385,14 @@ optimize_mask_stores (class loop *loop)
 }
 
 /* Decide whether it is possible to use a zero-based induction variable
-   when vectorizing LOOP_VINFO with a fully-masked loop.  If it is,
-   return the value that the induction variable must be able to hold
-   in order to ensure that the loop ends with an all-false mask.
+   when vectorizing LOOP_VINFO with a fully-masked or fully with length
+   loop.  If it is, return the value that the induction variable must
+   be able to hold in order to ensure that the loop ends with an
+   all-false mask or zero byte length.
    Return -1 otherwise.  */
+
 widest_int
-vect_iv_limit_for_full_masking (loop_vec_info loop_vinfo)
+vect_iv_limit_for_partial_vect (loop_vec_info loop_vinfo)
 {
   tree niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index e7822c44951..1bd2d2bd581 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -1771,9 +1771,9 @@ static tree permute_vec_elements (vec_info *, tree, tree, tree, stmt_vec_info,
 				  gimple_stmt_iterator *);
 
 /* Check whether a load or store statement in the loop described by
-   LOOP_VINFO is possible in a fully-masked loop.  This is testing
-   whether the vectorizer pass has the appropriate support, as well as
-   whether the target does.
+   LOOP_VINFO is possible in a fully-masked or fully with length loop.
+   This is testing whether the vectorizer pass has the appropriate support,
+   as well as whether the target does.
 
    VLS_TYPE says whether the statement is a load or store and VECTYPE
    is the type of the vector being loaded or stored.  MEMORY_ACCESS_TYPE
@@ -1783,14 +1783,14 @@ static tree permute_vec_elements (vec_info *, tree, tree, tree, stmt_vec_info,
    its arguments.  If the load or store is conditional, SCALAR_MASK is the
    condition under which it occurs.
 
-   Clear LOOP_VINFO_CAN_FULLY_MASK_P if a fully-masked loop is not
-   supported, otherwise record the required mask types.  */
+   Clear LOOP_VINFO_CAN_PARTIAL_VECT_P if a fully-masked or fully with
+   length loop is not supported, otherwise record the required mask types.  */
 
 static void
-check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
-			  vec_load_store_type vls_type, int group_size,
-			  vect_memory_access_type memory_access_type,
-			  gather_scatter_info *gs_info, tree scalar_mask)
+check_load_store_partial_vect (loop_vec_info loop_vinfo, tree vectype,
+			       vec_load_store_type vls_type, int group_size,
+			       vect_memory_access_type memory_access_type,
+			       gather_scatter_info *gs_info, tree scalar_mask)
 {
   /* Invariant loads need no special support.  */
   if (memory_access_type == VMAT_INVARIANT)
@@ -1807,10 +1807,10 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
 	{
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			     "can't use a fully-masked loop because the"
-			     " target doesn't have an appropriate masked"
+			     "can't use a partial vectorized loop because"
+			     " the target doesn't have an appropriate"
 			     " load/store-lanes instruction.\n");
-	  LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+	  LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
 	  return;
 	}
       unsigned int ncopies = vect_get_num_copies (loop_vinfo, vectype);
@@ -1830,10 +1830,10 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
 	{
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			     "can't use a fully-masked loop because the"
-			     " target doesn't have an appropriate masked"
+			     "can't use a partial vectorized loop because"
+			     " the target doesn't have an appropriate"
 			     " gather load or scatter store instruction.\n");
-	  LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+	  LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
 	  return;
 	}
       unsigned int ncopies = vect_get_num_copies (loop_vinfo, vectype);
@@ -1848,35 +1848,61 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
 	 scalar loop.  We need more work to support other mappings.  */
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "can't use a fully-masked loop because an access"
-			 " isn't contiguous.\n");
-      LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+			 "can't use a partial vectorized loop because an"
+			 " access isn't contiguous.\n");
+      LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
       return;
     }
 
-  machine_mode mask_mode;
-  if (!VECTOR_MODE_P (vecmode)
-      || !targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
-      || !can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
+  if (!VECTOR_MODE_P (vecmode))
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "can't use a fully-masked loop because the target"
-			 " doesn't have the appropriate masked load or"
-			 " store.\n");
-      LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+			 "can't use a partial vectorized loop because of"
+			 " the unexpected mode.\n");
+      LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
       return;
     }
-  /* We might load more scalars than we need for permuting SLP loads.
-     We checked in get_group_load_store_type that the extra elements
-     don't leak into a new vector.  */
+
   poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   unsigned int nvectors;
-  if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
-    vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype, scalar_mask);
-  else
-    gcc_unreachable ();
+  machine_mode mask_mode;
+  bool partial_vectorized_p = false;
+  if (targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
+      && can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
+    {
+      /* We might load more scalars than we need for permuting SLP loads.
+	 We checked in get_group_load_store_type that the extra elements
+	 don't leak into a new vector.  */
+      if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
+	vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype,
+			       scalar_mask);
+      else
+	gcc_unreachable ();
+      partial_vectorized_p = true;
+    }
+
+  optab op = is_load ? lenload_optab : lenstore_optab;
+  if (convert_optab_handler (op, vecmode, targetm.vectorize.length_mode))
+    {
+      vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
+      if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
+	vect_record_loop_len (loop_vinfo, lens, nvectors, vectype);
+      else
+	gcc_unreachable ();
+      partial_vectorized_p = true;
+    }
+
+  if (!partial_vectorized_p)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use a partial vectorized loop because the"
+			 " target doesn't have the appropriate partial"
+			 " vectorized load or store.\n");
+      LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
+    }
 }
 
 /* Return the mask input to a masked load or store.  VEC_MASK is the vectorized
@@ -6187,7 +6213,7 @@ vectorizable_operation (vec_info *vinfo,
 	 should only change the active lanes of the reduction chain,
 	 keeping the inactive lanes as-is.  */
       if (loop_vinfo
-	  && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo)
+	  && LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo)
 	  && reduc_idx >= 0)
 	{
 	  if (cond_fn == IFN_LAST
@@ -6198,7 +6224,7 @@ vectorizable_operation (vec_info *vinfo,
 		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
 				 "can't use a fully-masked loop because no"
 				 " conditional operation is available.\n");
-	      LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+	      LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
 	    }
 	  else
 	    vect_record_loop_mask (loop_vinfo, masks, ncopies * vec_num,
@@ -7527,10 +7553,10 @@ vectorizable_store (vec_info *vinfo,
     {
       STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) = memory_access_type;
 
-      if (loop_vinfo
-	  && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
-	check_load_store_masking (loop_vinfo, vectype, vls_type, group_size,
-				  memory_access_type, &gs_info, mask);
+      if (loop_vinfo && LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo))
+	check_load_store_partial_vect (loop_vinfo, vectype, vls_type,
+				       group_size, memory_access_type, &gs_info,
+				       mask);
 
       if (slp_node
 	  && !vect_maybe_update_slp_op_vectype (SLP_TREE_CHILDREN (slp_node)[0],
@@ -8068,6 +8094,15 @@ vectorizable_store (vec_info *vinfo,
     = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
        ? &LOOP_VINFO_MASKS (loop_vinfo)
        : NULL);
+
+  vec_loop_lens *loop_lens
+    = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+	 ? &LOOP_VINFO_LENS (loop_vinfo)
+	 : NULL);
+
+  /* Shouldn't go with length if fully masked.  */
+  gcc_assert (!loop_lens || (loop_lens && !loop_masks));
+
   /* Targets with store-lane instructions must not require explicit
      realignment.  vect_supportable_dr_alignment always returns either
      dr_aligned or dr_unaligned_supported for masked operations.  */
@@ -8320,10 +8355,15 @@ vectorizable_store (vec_info *vinfo,
 	      unsigned HOST_WIDE_INT align;
 
 	      tree final_mask = NULL_TREE;
+	      tree final_len = NULL_TREE;
 	      if (loop_masks)
 		final_mask = vect_get_loop_mask (gsi, loop_masks,
 						 vec_num * ncopies,
 						 vectype, vec_num * j + i);
+	      else if (loop_lens)
+		final_len = vect_get_loop_len (loop_lens, vec_num * ncopies,
+					       vec_num * j + i);
+
 	      if (vec_mask)
 		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
 						      vec_mask, gsi);
@@ -8403,6 +8443,17 @@ vectorizable_store (vec_info *vinfo,
 		  new_stmt_info
 		    = vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
 		}
+	      else if (final_len)
+		{
+		  align = least_bit_hwi (misalign | align);
+		  tree ptr = build_int_cst (ref_type, align);
+		  gcall *call
+		    = gimple_build_call_internal (IFN_LEN_STORE, 4, dataref_ptr,
+						  ptr, final_len, vec_oprnd);
+		  gimple_call_set_nothrow (call, true);
+		  new_stmt_info
+		    = vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
+		}
 	      else
 		{
 		  data_ref = fold_build2 (MEM_REF, vectype,
@@ -8834,10 +8885,10 @@ vectorizable_load (vec_info *vinfo,
       if (!slp)
 	STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) = memory_access_type;
 
-      if (loop_vinfo
-	  && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
-	check_load_store_masking (loop_vinfo, vectype, VLS_LOAD, group_size,
-				  memory_access_type, &gs_info, mask);
+      if (loop_vinfo && LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo))
+	check_load_store_partial_vect (loop_vinfo, vectype, VLS_LOAD,
+				       group_size, memory_access_type, &gs_info,
+				       mask);
 
       STMT_VINFO_TYPE (stmt_info) = load_vec_info_type;
       vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type,
@@ -8937,6 +8988,7 @@ vectorizable_load (vec_info *vinfo,
 
       gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
       gcc_assert (!nested_in_vect_loop);
+      gcc_assert (!LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
 
       if (grouped_load)
 	{
@@ -9234,6 +9286,15 @@ vectorizable_load (vec_info *vinfo,
     = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
        ? &LOOP_VINFO_MASKS (loop_vinfo)
        : NULL);
+
+  vec_loop_lens *loop_lens
+    = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+	 ? &LOOP_VINFO_LENS (loop_vinfo)
+	 : NULL);
+
+  /* Shouldn't go with length if fully masked.  */
+  gcc_assert (!loop_lens || (loop_lens && !loop_masks));
+
   /* Targets with store-lane instructions must not require explicit
      realignment.  vect_supportable_dr_alignment always returns either
      dr_aligned or dr_unaligned_supported for masked operations.  */
@@ -9555,15 +9616,20 @@ vectorizable_load (vec_info *vinfo,
 	  for (i = 0; i < vec_num; i++)
 	    {
 	      tree final_mask = NULL_TREE;
+	      tree final_len = NULL_TREE;
 	      if (loop_masks
 		  && memory_access_type != VMAT_INVARIANT)
 		final_mask = vect_get_loop_mask (gsi, loop_masks,
 						 vec_num * ncopies,
 						 vectype, vec_num * j + i);
+	      else if (loop_lens && memory_access_type != VMAT_INVARIANT)
+		final_len = vect_get_loop_len (loop_lens, vec_num * ncopies,
+					       vec_num * j + i);
 	      if (vec_mask)
 		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
 						      vec_mask, gsi);
 
+
 	      if (i > 0)
 		dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
 					       gsi, stmt_info, bump);
@@ -9629,6 +9695,18 @@ vectorizable_load (vec_info *vinfo,
 			new_stmt = call;
 			data_ref = NULL_TREE;
 		      }
+		    else if (final_len)
+		      {
+			align = least_bit_hwi (misalign | align);
+			tree ptr = build_int_cst (ref_type, align);
+			gcall *call
+			  = gimple_build_call_internal (IFN_LEN_LOAD, 3,
+							dataref_ptr, ptr,
+							final_len);
+			gimple_call_set_nothrow (call, true);
+			new_stmt = call;
+			data_ref = NULL_TREE;
+		      }
 		    else
 		      {
 			tree ltype = vectype;
@@ -10279,11 +10357,16 @@ vectorizable_condition (vec_info *vinfo,
 	  return false;
 	}
 
-      if (loop_vinfo
-	  && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo)
-	  && reduction_type == EXTRACT_LAST_REDUCTION)
-	vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
-			       ncopies * vec_num, vectype, NULL);
+      /* For reduction, we expect EXTRACT_LAST_REDUCTION so far.  */
+      if (loop_vinfo && for_reduction
+	  && LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo))
+	{
+	  if (reduction_type == EXTRACT_LAST_REDUCTION)
+	    vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
+				   ncopies * vec_num, vectype, NULL);
+	  else
+	    LOOP_VINFO_CAN_PARTIAL_VECT_P (loop_vinfo) = false;
+	}
 
       STMT_VINFO_TYPE (stmt_info) = condition_vec_info_type;
       vect_model_simple_cost (vinfo, stmt_info, ncopies, dts, ndts, slp_node,
@@ -12480,3 +12563,35 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
   *nunits_vectype_out = nunits_vectype;
   return opt_result::success ();
 }
+
+/* Generate and return a statement sequence that sets the vector length LEN:
+
+   min_of_start_and_end = min (START_INDEX, END_INDEX);
+   left_bytes = END_INDEX - min_of_start_and_end;
+   rhs = min (left_bytes, VECTOR_SIZE);
+   LEN = rhs;
+
+   TODO: For now the rs6000 vector-with-length support only cares about 8
+   bits of the length, so if left_bytes is larger than 255 it can't be
+   saturated to the vector size.  A target hook can be provided if other
+   ports don't have this restriction.  */
+
+gimple_seq
+vect_gen_len (tree len, tree start_index, tree end_index, tree vector_size)
+{
+  gimple_seq stmts = NULL;
+  tree len_type = TREE_TYPE (len);
+  gcc_assert (TREE_TYPE (start_index) == len_type);
+
+  tree min = fold_build2 (MIN_EXPR, len_type, start_index, end_index);
+  tree left_bytes = fold_build2 (MINUS_EXPR, len_type, end_index, min);
+  left_bytes = fold_build2 (MIN_EXPR, len_type, left_bytes, vector_size);
+
+  tree rhs = force_gimple_operand (left_bytes, &stmts, true, NULL_TREE);
+  gimple *new_stmt = gimple_build_assign (len, rhs);
+  gimple_stmt_iterator i = gsi_last (stmts);
+  gsi_insert_after_without_update (&i, new_stmt, GSI_CONTINUE_LINKING);
+
+  return stmts;
+}
+
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 2eb3ab5d280..9d84766d724 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -461,20 +461,32 @@ is_a_helper <_bb_vec_info *>::test (vec_info *i)
    first level being indexed by nV - 1 (since nV == 0 doesn't exist) and
    the second being indexed by the mask index 0 <= i < nV.  */
 
-/* The masks needed by rgroups with nV vectors, according to the
-   description above.  */
-struct rgroup_masks {
-  /* The largest nS for all rgroups that use these masks.  */
-  unsigned int max_nscalars_per_iter;
-
-  /* The type of mask to use, based on the highest nS recorded above.  */
-  tree mask_type;
+/* The masks/lengths (collectively called objects) needed by rgroups with nV
+   vectors, according to the description above.  */
+struct rgroup_objs {
+  union
+  {
+    /* The largest nS for all rgroups that use these masks.  */
+    unsigned int max_nscalars_per_iter;
+    /* The total bytes for any nS per iteration.  */
+    unsigned int nbytes_per_iter;
+  };
 
-  /* A vector of nV masks, in iteration order.  */
-  vec<tree> masks;
+  union
+  {
+    /* The type of mask to use, based on the highest nS recorded above.  */
+    tree mask_type;
+    /* A representative vector type that these lengths are used with.  */
+    tree vec_type;
+  };
+
+  /* A vector of nV objs, in iteration order.  */
+  vec<tree> objs;
 };
 
-typedef auto_vec<rgroup_masks> vec_loop_masks;
+typedef auto_vec<rgroup_objs> vec_loop_masks;
+
+typedef auto_vec<rgroup_objs> vec_loop_lens;
 
 typedef auto_vec<std::pair<data_reference*, tree> > drs_init_vec;
 
@@ -523,6 +535,10 @@ public:
      on inactive scalars.  */
   vec_loop_masks masks;
 
+  /* The lengths that a loop with length should use to avoid operating
+     on inactive scalars.  */
+  vec_loop_lens lens;
+
   /* Set of scalar conditions that have loop mask applied.  */
   scalar_cond_masked_set_type scalar_cond_masked_set;
 
@@ -620,12 +636,20 @@ public:
   /* Is the loop vectorizable? */
   bool vectorizable;
 
-  /* Records whether we still have the option of using a fully-masked loop.  */
-  bool can_fully_mask_p;
+  /* Records whether we still have the option of using partial vector
+     approaches for this loop; for now we support masking and length.  */
+  bool can_partial_vect_p;
 
   /* True if have decided to use a fully-masked loop.  */
   bool fully_masked_p;
 
+  /* True if have decided to use length access for the loop fully.  */
+  bool fully_with_length_p;
+
+  /* Records whether we can use partial vector approaches for the epilogue of
+     this loop; for now we only support the length approach.  */
+  bool epil_partial_vect_p;
+
   /* When we have grouped data accesses with gaps, we may introduce invalid
      memory accesses.  We peel the last iteration of the loop to prevent
      this.  */
@@ -687,8 +711,11 @@ public:
 #define LOOP_VINFO_COST_MODEL_THRESHOLD(L) (L)->th
 #define LOOP_VINFO_VERSIONING_THRESHOLD(L) (L)->versioning_threshold
 #define LOOP_VINFO_VECTORIZABLE_P(L)       (L)->vectorizable
-#define LOOP_VINFO_CAN_FULLY_MASK_P(L)     (L)->can_fully_mask_p
+#define LOOP_VINFO_CAN_PARTIAL_VECT_P(L)   (L)->can_partial_vect_p
 #define LOOP_VINFO_FULLY_MASKED_P(L)       (L)->fully_masked_p
+#define LOOP_VINFO_FULLY_WITH_LENGTH_P(L)  (L)->fully_with_length_p
+#define LOOP_VINFO_EPIL_PARTIAL_VECT_P(L)  (L)->epil_partial_vect_p
+#define LOOP_VINFO_LENS(L)                 (L)->lens
 #define LOOP_VINFO_VECT_FACTOR(L)          (L)->vectorization_factor
 #define LOOP_VINFO_MAX_VECT_FACTOR(L)      (L)->max_vectorization_factor
 #define LOOP_VINFO_MASKS(L)                (L)->masks
@@ -741,6 +768,10 @@ public:
    || LOOP_REQUIRES_VERSIONING_FOR_NITERS (L)		\
    || LOOP_REQUIRES_VERSIONING_FOR_SIMD_IF_COND (L))
 
+/* Whether the loop operates on partial vectors.  */
+#define LOOP_VINFO_PARTIAL_VECT_P(L)                                           \
+  (LOOP_VINFO_FULLY_MASKED_P (L) || LOOP_VINFO_FULLY_WITH_LENGTH_P (L))
+
 #define LOOP_VINFO_NITERS_KNOWN_P(L)          \
   (tree_fits_shwi_p ((L)->num_iters) && tree_to_shwi ((L)->num_iters) > 0)
 
@@ -1824,7 +1855,7 @@ extern tree vect_create_addr_base_for_vector_ref (vec_info *,
 						  tree, tree = NULL_TREE);
 
 /* In tree-vect-loop.c.  */
-extern widest_int vect_iv_limit_for_full_masking (loop_vec_info loop_vinfo);
+extern widest_int vect_iv_limit_for_partial_vect (loop_vec_info loop_vinfo);
 /* Used in tree-vect-loop-manip.c */
 extern void determine_peel_for_niter (loop_vec_info);
 /* Used in gimple-loop-interchange.c and tree-parloops.c.  */
@@ -1842,6 +1873,10 @@ extern void vect_record_loop_mask (loop_vec_info, vec_loop_masks *,
 				   unsigned int, tree, tree);
 extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
 				unsigned int, tree, unsigned int);
+extern void vect_record_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
+				  tree);
+extern tree vect_get_loop_len (vec_loop_lens *, unsigned int, unsigned int);
+extern gimple_seq vect_gen_len (tree, tree, tree, tree);
 extern stmt_vec_info info_for_reduction (vec_info *, stmt_vec_info);
 
 /* Drive for loop transformation stage.  */

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/7 v3] vect: Support vector load/store with length in vectorizer
  2020-06-02  9:03             ` [PATCH 5/7 v3] " Kewen.Lin
@ 2020-06-02 11:50               ` Richard Sandiford
  2020-06-02 17:01                 ` Segher Boessenkool
  2020-06-03  6:33                 ` Kewen.Lin
  0 siblings, 2 replies; 80+ messages in thread
From: Richard Sandiford @ 2020-06-02 11:50 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Richard Guenther, Bill Schmidt, dje.gcc, Segher Boessenkool

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi Richard,
>
> on 2020/5/29 4:32 PM, Richard Sandiford wrote:
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>> on 2020/5/27 下午6:02, Richard Sandiford wrote:
>>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>>> Hi Richard,
>>>>>
>
> Snip ...
>
>>>
>>> Thanks a lot for your detailed explanation!  This proposal looks good
>>> based on the current implementation of both masking and length.  I may
>>> think too much, but I had a bit concern as below when some targets have
>>> both masking and length supports in future, such as ppc adds masking
>>> support like SVE.
>>>
>>> I assumed that you meant each vectorizable_* routine should record the
>>> objs for any available partial vectorisation approaches.  If one target
>>> supports both, we would have both recorded but decide not to do partial
>>> vectorisation finally since both have records.  The target can disable
>>> length like through optab to resolve it, but there is one possibility
>>> that the masking support can be imperfect initially since ISA support
>>> could be gradual, it further leads some vectorizable_* check or final
>>> verification to fail for masking, and length approach may work here but
>>> it gets disabled.  We can miss to use partial vectorisation here.
>>>
>>> The other assumption is that each vectorizable_* routine record the 
>>> first available partial vectorisation approach, let's assume masking
>>> takes preference, then it's fine to record just one here even if one
>>> target supports both approaches, but we still have the possiblity to
>>> miss the partial vectorisation chance as some check/verify fail with
>>> masking but fine with length.
>>>
>>> Does this concern make sense?
>> 
>> There's nothing to stop us using masks and lengths in the same loop
>> in future if we need to.  It would “just” be a case of setting up both
>> the masks and the lengths in vect_set_loop_condition.  But the point is
>> that doing that would be extra code, and there's no point writing that
>> extra code until it's needed.
>> 
>> If some future arch does support both mask-based and length-based
>> approaches, I think that's even less reason to make a binary choice
>> between them.  How we prioritise the length and mask approaches when
>> both are available is something that we'll have to decide at the time.
>> 
>> If your concern is that the arch might support masked operations
>> without wanting them to be used for loop control, we could test for
>> that case by checking whether while_ult_optab is implemented.
>> 
>> Thanks,
>> Richard
>> 
>
> Thanks for your further explanation.  As you pointed out, my concern
> is just one case of mixing the mask-based and length-based approaches.
> I hadn't realized that and thought we would still use only one
> approach per loop at a time, so the concern doesn't really hold.
>
> The attached v3 patch uses can_partial_vect_p.  In regression testing
> with explicit vect-with-length-scope settings, I saw several reduction
> failures, so I updated vectorizable_condition to set can_partial_vect_p
> to false for !EXTRACT_LAST_REDUCTION, following your guidance that it
> should either record something or clear can_partial_vect_p.
>
> Bootstrapped/regtested on powerpc64le-linux-gnu P9 and no remarkable
> failures found even with explicit vect-with-length-scope settings.
>
> But I met one regression failure on aarch64-linux-gnu as below:
>
> PASS->FAIL: gcc.target/aarch64/sve/reduc_8.c -march=armv8.2-a+sve  scan-assembler-not \\tcmpeq\\tp[0-9]+\\.s,
>
> It's caused by the vectorizable_condition change; without the change,
> full masking can be used for the outer loop.  The reduction_type is
> TREE_CODE_REDUCTION here, so can_partial_vect_p gets cleared.
>
> From the optimized dump, the previous IR looks fine.  The reduction is
> for the inner loop, but we are checking partial vectorisation for the
> outer loop.  I'm not sure whether adjusting the current guard is
> reasonable for this case.  Could you give some insights?  Thanks in
> advance!
>
> BR,
> Kewen
> ------
> gcc/ChangeLog

It would be easier to review, and perhaps easier to bisect,
if some of the mechanical changes were split out.  E.g.:

- Rename can_fully_mask_p to can_use_partial_vectors_p.

- Rename fully_masked_p to using_partial_vectors_p.

- Rename things related to rgroup_masks.  I think “rgroup_controls”
  or “rgroup_guards” might be more descriptive than “rgroup_objs”.

These should be fairly mechanical changes and can happen ahead of
the main series.  It'll then be easier to see what's different
for masks and lengths, separately from the more mechanical stuff.

As far as:

+  union
+  {
+    /* The type of mask to use, based on the highest nS recorded above.  */
+    tree mask_type;
+    /* Any vector type to use these lengths.  */
+    tree vec_type;
+  };

goes, some parts of the code seem to use mask_type for lengths too,
which I'm a bit nervous about.  I think we should either be consistent
about which union field we use (always mask_type for masks, always
vec_type for lengths) or we should just rename mask_type to something
more generic.  Just "type" might be good enough with a suitable comment.
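
Just to illustrate (a rough sketch, not something I've tried to build),
the generalised structure could end up looking like:

  /* The controls (masks or lengths) needed by rgroups with nV vectors,
     according to the description above.  */
  struct rgroup_controls
  {
    /* The largest nS for all rgroups that use these controls.  */
    unsigned int max_nscalars_per_iter;

    /* For masks, the mask type, based on the highest nS recorded above;
       for lengths, a vector type that the lengths are measured against.  */
    tree type;

    /* A vector of nV controls, in iteration order.  */
    vec<tree> controls;
  };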

>  {
>    tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);
>    tree iv_type = LOOP_VINFO_MASK_IV_TYPE (loop_vinfo);
> -  tree mask_type = rgm->mask_type;
> -  unsigned int nscalars_per_iter = rgm->max_nscalars_per_iter;
> -  poly_uint64 nscalars_per_mask = TYPE_VECTOR_SUBPARTS (mask_type);
> +
> +  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
> +  if (!vect_for_masking)
> +    {
> +      /* Obtain target supported length type.  */
> +      scalar_int_mode len_mode = targetm.vectorize.length_mode;
> +      unsigned int len_prec = GET_MODE_PRECISION (len_mode);
> +      compare_type = build_nonstandard_integer_type (len_prec, true);
> +      /* Simply set iv_type as same as compare_type.  */
> +      iv_type = compare_type;

This might not be the best time to bring this up :-) but it seems
odd to be asking the target for the induction variable type here.
I got the impression that the hook was returning DImode, whereas
the PowerPC instructions only looked at the low 8 bits of the length.
If so, forcing a naturally 32-bit IV to DImode would insert extra
sign/zero extensions, even though the input to the length intrinsics
would have been happy with the 32-bit IV.

I think it would make sense to ask the target for its minimum
precision P (which would be 8 bits if the above is correct).
The starting point would then be the maximum of:

- this P
- the IV's natural precision
- the precision needed to hold:
    the maximum number of scalar iterations multiplied by the scale factor
    (to convert scalar counts to bytes)

If the IV might wrap at that precision without producing all-zero lengths,
it would be worth doubling the precision to avoid the wrapping issue,
provided that we don't go beyond BITS_PER_WORD.
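
To make that concrete, here's a purely illustrative sketch; the helper
name and its parameters are made up rather than existing vectorizer code:

  unsigned int
  length_iv_precision (unsigned int target_min_prec,   /* e.g. 8 */
                       unsigned int iv_natural_prec,   /* e.g. 32 */
                       unsigned long long max_niters,
                       unsigned int bytes_per_scalar)  /* scale factor */
  {
    /* Precision needed to hold max_niters * bytes_per_scalar.  */
    unsigned long long max_bytes = max_niters * bytes_per_scalar;
    unsigned int ni_prec = 0;
    while (ni_prec < 64 && (max_bytes >> ni_prec) != 0)
      ni_prec++;

    /* Take the maximum of the three quantities above.  */
    unsigned int prec = target_min_prec;
    if (iv_natural_prec > prec)
      prec = iv_natural_prec;
    if (ni_prec > prec)
      prec = ni_prec;

    /* A caller would then double PREC if the IV might wrap at this
       precision, provided the result stays within BITS_PER_WORD.  */
    return prec;
  }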

> +  tree obj_type = rgo->mask_type;
> +  /* Here, take nscalars_per_iter as nbytes_per_iter for length.  */
> +  unsigned int nscalars_per_iter = rgo->max_nscalars_per_iter;

I think whether we count scalars or count bytes is really a separate
decision that shouldn't be tied directly to using lengths.  Length-based
loads and stores on other arches might want to count scalars too.
I'm not saying you should add support for that (it wouldn't be tested),
but I think we should avoid structuring the code to make it harder to
add in future.

So I think nscalars_per_iter should always count scalars and anything
length-based should be separate.  Would it make sense to store the
length scale factor as a separate field?  I.e. using the terms
above the rgroup_masks comment, the length IV step is:

   factor * nS * VF == factor * nV * nL

That way, applying the factor becomes separate from lengths vs. masks.
The factor would also be useful in calculating the IV precision above.
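
Roughly, I'm imagining something like this (just a sketch of the shape,
not final code):

  struct rgroup_controls {
    /* nS: always a number of scalars, for both masks and lengths.  */
    unsigned int max_nscalars_per_iter;
    /* Scale factor applied when converting scalar counts into the
       length IV step: 1 for masks, bytes per scalar for byte-based
       lengths.  */
    unsigned int factor;
    /* The mask type, or the vector type that the lengths control.  */
    tree type;
    /* The controls themselves (masks or lengths).  */
    vec<tree> controls;
  };

so that the per-iteration length IV step is
factor * max_nscalars_per_iter * VF.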

> [...]
> -/* Make LOOP iterate NITERS times using masking and WHILE_ULT calls.
> -   LOOP_VINFO describes the vectorization of LOOP.  NITERS is the
> -   number of iterations of the original scalar loop that should be
> -   handled by the vector loop.  NITERS_MAYBE_ZERO and FINAL_IV are
> -   as for vect_set_loop_condition.
> +/* Make LOOP iterate NITERS times using objects like masks (and
> +   WHILE_ULT calls) or lengths.  LOOP_VINFO describes the vectorization
> +   of LOOP.  NITERS is the number of iterations of the original scalar
> +   loop that should be handled by the vector loop.  NITERS_MAYBE_ZERO
> +   and FINAL_IV are as for vect_set_loop_condition.
>  
>     Insert the branch-back condition before LOOP_COND_GSI and return the
>     final gcond.  */
>  
>  static gcond *
> -vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo,
> -				tree niters, tree final_iv,
> -				bool niters_maybe_zero,
> -				gimple_stmt_iterator loop_cond_gsi)
> +vect_set_loop_condition_partial (class loop *loop, loop_vec_info loop_vinfo,
> +				 tree niters, tree final_iv,
> +				 bool niters_maybe_zero,
> +				 gimple_stmt_iterator loop_cond_gsi)
>  {
>    gimple_seq preheader_seq = NULL;
>    gimple_seq header_seq = NULL;
>  
> +  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
> +
>    tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);
> +  if (!vect_for_masking)
> +    {
> +      /* Obtain target supported length type as compare_type.  */
> +      scalar_int_mode len_mode = targetm.vectorize.length_mode;
> +      unsigned len_prec = GET_MODE_PRECISION (len_mode);
> +      compare_type = build_nonstandard_integer_type (len_prec, true);

Same comment as above about the choice of IV type.  We shouldn't
recalculate this multiple times.  It would be better to calculate
it upfront and store it in the loop_vinfo.

> @@ -2567,7 +2622,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>    if (vect_epilogues
>        && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>        && prolog_peeling >= 0
> -      && known_eq (vf, lowest_vf))
> +      && known_eq (vf, lowest_vf)
> +      && !LOOP_VINFO_FULLY_WITH_LENGTH_P (epilogue_vinfo))

Why's this check needed?

>      {
>        unsigned HOST_WIDE_INT eiters
>  	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 80e33b61be7..99e6cb904ba 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -813,8 +813,10 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
>      vec_outside_cost (0),
>      vec_inside_cost (0),
>      vectorizable (false),
> -    can_fully_mask_p (true),
> +    can_partial_vect_p (true),

I think “can_use_partial_vectors_p” reads better

>      fully_masked_p (false),
> +    fully_with_length_p (false),

I think it would be better if these two were a single flag
(using_partial_vectors_p), with masking vs. lengths being derived
information.

> +    epil_partial_vect_p (false),
>      peeling_for_gaps (false),
>      peeling_for_niter (false),
>      no_data_dependencies (false),
> @@ -880,13 +882,25 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
>  void
>  release_vec_loop_masks (vec_loop_masks *masks)
>  {
> -  rgroup_masks *rgm;
> +  rgroup_objs *rgm;
>    unsigned int i;
>    FOR_EACH_VEC_ELT (*masks, i, rgm)
> -    rgm->masks.release ();
> +    rgm->objs.release ();
>    masks->release ();
>  }
>  
> +/* Free all levels of LENS.  */
> +
> +void
> +release_vec_loop_lens (vec_loop_lens *lens)
> +{
> +  rgroup_objs *rgl;
> +  unsigned int i;
> +  FOR_EACH_VEC_ELT (*lens, i, rgl)
> +    rgl->objs.release ();
> +  lens->release ();
> +}
> +

There's no need to duplicate this function.

The overall approach looks good though.  I just think we need to work
through the details a bit more.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/7 v3] vect: Support vector load/store with length in vectorizer
  2020-06-02 11:50               ` Richard Sandiford
@ 2020-06-02 17:01                 ` Segher Boessenkool
  2020-06-03  6:33                 ` Kewen.Lin
  1 sibling, 0 replies; 80+ messages in thread
From: Segher Boessenkool @ 2020-06-02 17:01 UTC (permalink / raw)
  To: Kewen.Lin, GCC Patches, Richard Guenther, Bill Schmidt, dje.gcc,
	richard.sandiford

On Tue, Jun 02, 2020 at 12:50:25PM +0100, Richard Sandiford wrote:
> This might not be the best time to bring this up :-) but it seems
> odd to be asking the target for the induction variable type here.
> I got the impression that the hook was returning DImode, whereas
> the PowerPC instructions only looked at the low 8 bits of the length.

It's the top(!) 8 bits of the register actually (the other 56 bits are
"do not care" bits).

> If so, forcing a naturally 32-bit IV to DImode would insert extra
> sign/zero extensions, even though the input to the length intrinsics
> would have been happy with the 32-bit IV.

It's a shift left always.  All bits beyond the 8 drop out.
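
For example (illustrative code only), the expanders put the byte count
into bits 0:7 with a shift before the load, so something like

	sldi 9,4,56	# byte count from r4 into bits 0:7 of r9
	lxvl 0,3,9	# load that many bytes from (r3) into vs0

simply discards whatever was in the upper bits of the original count.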


Thanks for the great reviews Richard, much appreciated!


Segher

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/7 v3] vect: Support vector load/store with length in vectorizer
  2020-06-02 11:50               ` Richard Sandiford
  2020-06-02 17:01                 ` Segher Boessenkool
@ 2020-06-03  6:33                 ` Kewen.Lin
  2020-06-10  9:19                   ` [PATCH 5/7 v4] " Kewen.Lin
  1 sibling, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-06-03  6:33 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford
  Cc: Richard Guenther, Bill Schmidt, dje.gcc, Segher Boessenkool

Hi Richard,

Thanks a lot for your great comments!

on 2020/6/2 7:50 PM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> Hi Richard,
>>
>> on 2020/5/29 4:32 PM, Richard Sandiford wrote:
>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>> on 2020/5/27 6:02 PM, Richard Sandiford wrote:
>>>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>>>> Hi Richard,
>>>>>>

snip ...

> 
> It would be easier to review, and perhaps easier to bisect,
> if some of the mechanical changes were split out.  E.g.:
> 
> - Rename can_fully_mask_p to can_use_partial_vectors_p.
> 
> - Rename fully_masked_p to using_partial_vectors_p.
> 
> - Rename things related to rgroup_masks.  I think “rgroup_controls”
>   or “rgroup_guards” might be more descriptive than “rgroup_objs”.
> 
> These should be fairly mechanical changes and can happen ahead of
> the main series.  It'll then be easier to see what's different
> for masks and lengths, separately from the more mechanical stuff.
> 

Good suggestion.  My fault, I should have done it before. 
Will split it into some NFC patches.

> As far as:
> 
> +  union
> +  {
> +    /* The type of mask to use, based on the highest nS recorded above.  */
> +    tree mask_type;
> +    /* Any vector type to use these lengths.  */
> +    tree vec_type;
> +  };
> 
> goes, some parts of the code seem to use mask_type for lengths too,
> which I'm a bit nervous about.  I think we should either be consistent
> about which union field we use (always mask_type for masks, always
> vec_type for lengths) or we should just rename mask_type to something
> more generic.  Just "type" might be good enough with a suitable comment.

Will fix it.

> 
>>  {
>>    tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);
>>    tree iv_type = LOOP_VINFO_MASK_IV_TYPE (loop_vinfo);
>> -  tree mask_type = rgm->mask_type;
>> -  unsigned int nscalars_per_iter = rgm->max_nscalars_per_iter;
>> -  poly_uint64 nscalars_per_mask = TYPE_VECTOR_SUBPARTS (mask_type);
>> +
>> +  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
>> +  if (!vect_for_masking)
>> +    {
>> +      /* Obtain target supported length type.  */
>> +      scalar_int_mode len_mode = targetm.vectorize.length_mode;
>> +      unsigned int len_prec = GET_MODE_PRECISION (len_mode);
>> +      compare_type = build_nonstandard_integer_type (len_prec, true);
>> +      /* Simply set iv_type as same as compare_type.  */
>> +      iv_type = compare_type;
> 
> This might not be the best time to bring this up :-) but it seems
> odd to be asking the target for the induction variable type here.
> I got the impression that the hook was returning DImode, whereas
> the PowerPC instructions only looked at the low 8 bits of the length.
> If so, forcing a naturally 32-bit IV to DImode would insert extra
> sign/zero extensions, even though the input to the length intrinsics
> would have been happy with the 32-bit IV.
> 

Good point, I'll check it with some cases.  As Segher pointed out, the
length goes in bits 0-7 (the top bits, unusual I admit), and these
vector-with-length instructions are only available in 64-bit mode.
IIUC, with the current setting the extra sign/zero extensions would only
appear in the preheader?  At the time I thought an IV with less
precision than the length mode would have to be converted later for the
length anyway, so simply using the length mode looked good.

> I think it would make sense to ask the target for its minimum
> precision P (which would be 8 bits if the above is correct).
> The starting point would then be the maximum of:
> 
> - this P
> - the IV's natural precision
> - the precision needed to hold:
>     the maximum number of scalar iterations multiplied by the scale factor
>     (to convert scalar counts to bytes)
> 
> If the IV might wrap at that precision without producing all-zero lengths,
> it would be worth doubling the precision to avoid the wrapping issue,
> provided that we don't go beyond BITS_PER_WORD.
> 
Thanks! Will think/test more on this part.

>> +  tree obj_type = rgo->mask_type;
>> +  /* Here, take nscalars_per_iter as nbytes_per_iter for length.  */
>> +  unsigned int nscalars_per_iter = rgo->max_nscalars_per_iter;
> 
> I think whether we count scalars or count bytes is really a separate
> decision that shouldn't be tied directly to using lengths.  Length-based
> loads and stores on other arches might want to count scalars too.
> I'm not saying you should add support for that (it wouldn't be tested),
> but I think we should avoid structuring the code to make it harder to
> add in future.
> 

It makes sense, will update it.

> So I think nscalars_per_iter should always count scalars and anything
> length-based should be separate.  Would it make sense to store the
> length scale factor as a separate field?  I.e. using the terms
> above the rgroup_masks comment, the length IV step is:
> 
>    factor * nS * VF == factor * nV * nL
> 

Yeah, factor * nS gives what we want for the length-based approach (in
bytes), and factor * nL would be the vector size.

> That way, applying the factor becomes separate from lengths vs. masks.
> The factor would also be useful in calculating the IV precision above.
> 

Yeah, nice!

>> [...]
>> -/* Make LOOP iterate NITERS times using masking and WHILE_ULT calls.
>> -   LOOP_VINFO describes the vectorization of LOOP.  NITERS is the
>> -   number of iterations of the original scalar loop that should be
>> -   handled by the vector loop.  NITERS_MAYBE_ZERO and FINAL_IV are
>> -   as for vect_set_loop_condition.
>> +/* Make LOOP iterate NITERS times using objects like masks (and
>> +   WHILE_ULT calls) or lengths.  LOOP_VINFO describes the vectorization
>> +   of LOOP.  NITERS is the number of iterations of the original scalar
>> +   loop that should be handled by the vector loop.  NITERS_MAYBE_ZERO
>> +   and FINAL_IV are as for vect_set_loop_condition.
>>  
>>     Insert the branch-back condition before LOOP_COND_GSI and return the
>>     final gcond.  */
>>  
>>  static gcond *
>> -vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo,
>> -				tree niters, tree final_iv,
>> -				bool niters_maybe_zero,
>> -				gimple_stmt_iterator loop_cond_gsi)
>> +vect_set_loop_condition_partial (class loop *loop, loop_vec_info loop_vinfo,
>> +				 tree niters, tree final_iv,
>> +				 bool niters_maybe_zero,
>> +				 gimple_stmt_iterator loop_cond_gsi)
>>  {
>>    gimple_seq preheader_seq = NULL;
>>    gimple_seq header_seq = NULL;
>>  
>> +  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
>> +
>>    tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);
>> +  if (!vect_for_masking)
>> +    {
>> +      /* Obtain target supported length type as compare_type.  */
>> +      scalar_int_mode len_mode = targetm.vectorize.length_mode;
>> +      unsigned len_prec = GET_MODE_PRECISION (len_mode);
>> +      compare_type = build_nonstandard_integer_type (len_prec, true);
> 
> Same comment as above about the choice of IV type.  We shouldn't
> recalculate this multiple times.  It would be better to calculate
> it upfront and store it in the loop_vinfo.

OK.

> 
>> @@ -2567,7 +2622,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>>    if (vect_epilogues
>>        && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>>        && prolog_peeling >= 0
>> -      && known_eq (vf, lowest_vf))
>> +      && known_eq (vf, lowest_vf)
>> +      && !LOOP_VINFO_FULLY_WITH_LENGTH_P (epilogue_vinfo))
> 
> Why's this check needed?
> 

It's mainly for the length-based epilogue handling.

       while (!(constant_multiple_p
                (GET_MODE_SIZE (loop_vinfo->vector_mode),
                 GET_MODE_SIZE (epilogue_vinfo->vector_mode), &ratio)
                 && eiters >= lowest_vf / ratio + epilogue_gaps))

This "if" part checks whether the remaining eiters are enough for the
epilogue; if eiters is less than the epilogue's lowest_vf, it backs out
of the epilogue.  But for partial vectors that should be acceptable,
since partial iterations can be handled.  Probably I should use
using_partial_vectors_p here instead of LOOP_VINFO_FULLY_WITH_LENGTH_P;
although masking won't get the chance to handle the epilogue, the
concept would be the same.

>>      {
>>        unsigned HOST_WIDE_INT eiters
>>  	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
>> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>> index 80e33b61be7..99e6cb904ba 100644
>> --- a/gcc/tree-vect-loop.c
>> +++ b/gcc/tree-vect-loop.c
>> @@ -813,8 +813,10 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
>>      vec_outside_cost (0),
>>      vec_inside_cost (0),
>>      vectorizable (false),
>> -    can_fully_mask_p (true),
>> +    can_partial_vect_p (true),
> 
> I think “can_use_partial_vectors_p” reads better

Will update with it.

> 
>>      fully_masked_p (false),
>> +    fully_with_length_p (false),
> 
> I think it would be better if these two were a single flag
> (using_partial_vectors_p), with masking vs. lengths being derived
> information.
> 

Will update it.

>> +    epil_partial_vect_p (false),
>>      peeling_for_gaps (false),
>>      peeling_for_niter (false),
>>      no_data_dependencies (false),
>> @@ -880,13 +882,25 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
>>  void
>>  release_vec_loop_masks (vec_loop_masks *masks)
>>  {
>> -  rgroup_masks *rgm;
>> +  rgroup_objs *rgm;
>>    unsigned int i;
>>    FOR_EACH_VEC_ELT (*masks, i, rgm)
>> -    rgm->masks.release ();
>> +    rgm->objs.release ();
>>    masks->release ();
>>  }
>>  
>> +/* Free all levels of LENS.  */
>> +
>> +void
>> +release_vec_loop_lens (vec_loop_lens *lens)
>> +{
>> +  rgroup_objs *rgl;
>> +  unsigned int i;
>> +  FOR_EACH_VEC_ELT (*lens, i, rgl)
>> +    rgl->objs.release ();
>> +  lens->release ();
>> +}
>> +
> 
> There's no need to duplicate this function.
> 

Good catch, will rename and merge them.

BR,
Kewen

> The overall approach looks good though.  I just think we need to work
> through the details a bit more.
> 
> Thanks,
> Richard
> 



^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 1/7 V2] ifn/optabs: Support vector load/store with length
  2020-05-26  5:51 ` [PATCH 1/7] ifn/optabs: " Kewen.Lin
@ 2020-06-10  6:41   ` Kewen.Lin
  2020-06-10  9:22     ` Richard Sandiford
  0 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-06-10  6:41 UTC (permalink / raw)
  To: GCC Patches
  Cc: Bill Schmidt, Richard Guenther, Segher Boessenkool, dje.gcc,
	Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 1015 bytes --]

V2: Convert the length to targetm.vectorize.length_mode in
related IFN expand functions.

By the way, since this patch now depends on that hook, it should be
committed after the hook patch, if that is accepted in future.

gcc/ChangeLog:

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* doc/md.texi (lenload@var{m}@var{n}): Document.
	(lenstore@var{m}@var{n}): Likewise.
	* internal-fn.c (len_load_direct): New macro.
	(len_store_direct): Likewise.
	(expand_len_load_optab_fn): Likewise.
	(expand_len_store_optab_fn): Likewise.
	(direct_len_load_optab_supported_p): Likewise.
	(direct_len_store_optab_supported_p): Likewise.
	(expand_mask_load_optab_fn): Add handlings for lenload_optab.
	(expand_mask_store_optab_fn): Add handlings for lenstore_optab.
	(internal_load_fn_p): Handle IFN_LEN_LOAD.
	(internal_store_fn_p): Handle IFN_LEN_STORE.
	(internal_fn_stored_value_index): Handle IFN_LEN_STORE.
	* internal-fn.def (LEN_LOAD): New internal function.
	(LEN_STORE): Likewise.
	* optabs.def (lenload_optab, lenstore_optab): New optab.

---

[-- Attachment #2: lenload_ifn_v2.patch --]
[-- Type: text/plain, Size: 8726 bytes --]


---
 gcc/doc/md.texi     | 16 ++++++++++++++++
 gcc/internal-fn.c   | 31 +++++++++++++++++++++++++++----
 gcc/internal-fn.def |  6 ++++++
 gcc/optabs.def      |  2 ++
 4 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 2c67c818da5..b0c19cd3b81 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5167,6 +5167,22 @@ mode @var{n}.
 
 This pattern is not allowed to @code{FAIL}.
 
+@cindex @code{lenload@var{m}@var{n}} instruction pattern
+@item @samp{lenload@var{m}@var{n}}
+Perform a vector load with length from memory operand 1 of mode @var{m}
+into register operand 0.  Length is provided in register operand 2 of
+mode @var{n}.
+
+This pattern is not allowed to @code{FAIL}.
+
+@cindex @code{lenstore@var{m}@var{n}} instruction pattern
+@item @samp{lenstore@var{m}@var{n}}
+Perform a vector store with length from register operand 1 of mode @var{m}
+into memory operand 0.  Length is provided in register operand 2 of
+mode @var{n}.
+
+This pattern is not allowed to @code{FAIL}.
+
 @cindex @code{vec_perm@var{m}} instruction pattern
 @item @samp{vec_perm@var{m}}
 Output a (variable) vector permutation.  Operand 0 is the destination
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 5e9aa60721e..f896666796e 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -104,10 +104,12 @@ init_internal_fns ()
 #define load_lanes_direct { -1, -1, false }
 #define mask_load_lanes_direct { -1, -1, false }
 #define gather_load_direct { 3, 1, false }
+#define len_load_direct { -1, 2, false }
 #define mask_store_direct { 3, 2, false }
 #define store_lanes_direct { 0, 0, false }
 #define mask_store_lanes_direct { 0, 0, false }
 #define scatter_store_direct { 3, 1, false }
+#define len_store_direct { 3, 2, false }
 #define unary_direct { 0, 0, true }
 #define binary_direct { 0, 0, true }
 #define ternary_direct { 0, 0, true }
@@ -2478,7 +2480,7 @@ expand_call_mem_ref (tree type, gcall *stmt, int index)
   return fold_build2 (MEM_REF, type, addr, build_int_cst (alias_ptr_type, 0));
 }
 
-/* Expand MASK_LOAD{,_LANES} call STMT using optab OPTAB.  */
+/* Expand MASK_LOAD{,_LANES} and LEN_LOAD call STMT using optab OPTAB.  */
 
 static void
 expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
@@ -2497,6 +2499,9 @@ expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 
   if (optab == vec_mask_load_lanes_optab)
     icode = get_multi_vector_move (type, optab);
+  else if (optab == lenload_optab)
+    icode = convert_optab_handler (optab, TYPE_MODE (type),
+				   targetm.vectorize.length_mode);
   else
     icode = convert_optab_handler (optab, TYPE_MODE (type),
 				   TYPE_MODE (TREE_TYPE (maskt)));
@@ -2507,15 +2512,20 @@ expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
   target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   create_output_operand (&ops[0], target, TYPE_MODE (type));
   create_fixed_operand (&ops[1], mem);
-  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
+  if (optab == lenload_optab)
+    create_convert_operand_from (&ops[2], mask, targetm.vectorize.length_mode,
+				 TYPE_UNSIGNED (TREE_TYPE (maskt)));
+  else
+    create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
   expand_insn (icode, 3, ops);
   if (!rtx_equal_p (target, ops[0].value))
     emit_move_insn (target, ops[0].value);
 }
 
 #define expand_mask_load_lanes_optab_fn expand_mask_load_optab_fn
+#define expand_len_load_optab_fn expand_mask_load_optab_fn
 
-/* Expand MASK_STORE{,_LANES} call STMT using optab OPTAB.  */
+/* Expand MASK_STORE{,_LANES} and LEN_STORE call STMT using optab OPTAB.  */
 
 static void
 expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
@@ -2532,6 +2542,9 @@ expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 
   if (optab == vec_mask_store_lanes_optab)
     icode = get_multi_vector_move (type, optab);
+  else if (optab == lenstore_optab)
+    icode = convert_optab_handler (optab, TYPE_MODE (type),
+				   targetm.vectorize.length_mode);
   else
     icode = convert_optab_handler (optab, TYPE_MODE (type),
 				   TYPE_MODE (TREE_TYPE (maskt)));
@@ -2542,11 +2555,16 @@ expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
   reg = expand_normal (rhs);
   create_fixed_operand (&ops[0], mem);
   create_input_operand (&ops[1], reg, TYPE_MODE (type));
-  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
+  if (optab == lenstore_optab)
+    create_convert_operand_from (&ops[2], mask, targetm.vectorize.length_mode,
+				 TYPE_UNSIGNED (TREE_TYPE (maskt)));
+  else
+    create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
   expand_insn (icode, 3, ops);
 }
 
 #define expand_mask_store_lanes_optab_fn expand_mask_store_optab_fn
+#define expand_len_store_optab_fn expand_mask_store_optab_fn
 
 static void
 expand_ABNORMAL_DISPATCHER (internal_fn, gcall *)
@@ -3128,10 +3146,12 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
 #define direct_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_gather_load_optab_supported_p convert_optab_supported_p
+#define direct_len_load_optab_supported_p direct_optab_supported_p
 #define direct_mask_store_optab_supported_p direct_optab_supported_p
 #define direct_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_scatter_store_optab_supported_p convert_optab_supported_p
+#define direct_len_store_optab_supported_p direct_optab_supported_p
 #define direct_while_optab_supported_p convert_optab_supported_p
 #define direct_fold_extract_optab_supported_p direct_optab_supported_p
 #define direct_fold_left_optab_supported_p direct_optab_supported_p
@@ -3498,6 +3518,7 @@ internal_load_fn_p (internal_fn fn)
     case IFN_MASK_LOAD_LANES:
     case IFN_GATHER_LOAD:
     case IFN_MASK_GATHER_LOAD:
+    case IFN_LEN_LOAD:
       return true;
 
     default:
@@ -3517,6 +3538,7 @@ internal_store_fn_p (internal_fn fn)
     case IFN_MASK_STORE_LANES:
     case IFN_SCATTER_STORE:
     case IFN_MASK_SCATTER_STORE:
+    case IFN_LEN_STORE:
       return true;
 
     default:
@@ -3577,6 +3599,7 @@ internal_fn_stored_value_index (internal_fn fn)
     case IFN_MASK_STORE:
     case IFN_SCATTER_STORE:
     case IFN_MASK_SCATTER_STORE:
+    case IFN_LEN_STORE:
       return 3;
 
     default:
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 1d190d492ff..ed6561f296a 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -49,11 +49,13 @@ along with GCC; see the file COPYING3.  If not see
    - load_lanes: currently just vec_load_lanes
    - mask_load_lanes: currently just vec_mask_load_lanes
    - gather_load: used for {mask_,}gather_load
+   - len_load: currently just lenload
 
    - mask_store: currently just maskstore
    - store_lanes: currently just vec_store_lanes
    - mask_store_lanes: currently just vec_mask_store_lanes
    - scatter_store: used for {mask_,}scatter_store
+   - len_store: currently just lenstore
 
    - unary: a normal unary optab, such as vec_reverse_<mode>
    - binary: a normal binary optab, such as vec_interleave_lo_<mode>
@@ -127,6 +129,8 @@ DEF_INTERNAL_OPTAB_FN (GATHER_LOAD, ECF_PURE, gather_load, gather_load)
 DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
 		       mask_gather_load, gather_load)
 
+DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, lenload, len_load)
+
 DEF_INTERNAL_OPTAB_FN (SCATTER_STORE, 0, scatter_store, scatter_store)
 DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
 		       mask_scatter_store, scatter_store)
@@ -136,6 +140,8 @@ DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
 DEF_INTERNAL_OPTAB_FN (MASK_STORE_LANES, 0,
 		       vec_mask_store_lanes, mask_store_lanes)
 
+DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, lenstore, len_store)
+
 DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
 DEF_INTERNAL_OPTAB_FN (CHECK_RAW_PTRS, ECF_CONST | ECF_NOTHROW,
 		       check_raw_ptrs, check_ptrs)
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 0c64eb52a8d..0551a191ad0 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -97,6 +97,8 @@ OPTAB_CD(scatter_store_optab, "scatter_store$a$b")
 OPTAB_CD(mask_scatter_store_optab, "mask_scatter_store$a$b")
 OPTAB_CD(vec_extract_optab, "vec_extract$a$b")
 OPTAB_CD(vec_init_optab, "vec_init$a$b")
+OPTAB_CD(lenload_optab, "lenload$a$b")
+OPTAB_CD(lenstore_optab, "lenstore$a$b")
 
 OPTAB_CD (while_ult_optab, "while_ult$a$b")
 
-- 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 2/7 V2] rs6000: lenload/lenstore optab support
  2020-05-26  5:53 ` [PATCH 2/7] rs6000: lenload/lenstore optab support Kewen.Lin
@ 2020-06-10  6:43   ` Kewen.Lin
  2020-06-10 12:39     ` [PATCH 2/7 V3] " Kewen.Lin
  0 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-06-10  6:43 UTC (permalink / raw)
  To: GCC Patches; +Cc: Bill Schmidt, Segher Boessenkool, dje.gcc

[-- Attachment #1: Type: text/plain, Size: 199 bytes --]

V2: Update the define_expand to use QImode.

gcc/ChangeLog:

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* config/rs6000/vsx.md (lenload<mode>qi): New define_expand.
	(lenstore<mode>qi): Likewise.


[-- Attachment #2: rs6000_v2.patch --]
[-- Type: text/plain, Size: 1499 bytes --]


---
 gcc/config/rs6000/vsx.md | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 2a28215ac5b..6ee4cabc964 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -5082,6 +5082,38 @@
   operands[3] = gen_reg_rtx (DImode);
 })
 
+;; Define optab for vector access with length vectorization exploitation.
+(define_expand "lenload<mode>qi"
+  [(match_operand:VEC_A 0 "vlogical_operand")
+   (match_operand:VEC_A 1 "memory_operand")
+   (match_operand:QI 2 "int_reg_operand")]
+  "TARGET_P9_VECTOR && TARGET_64BIT"
+{
+  rtx mem = XEXP (operands[1], 0);
+  mem = force_reg (DImode, mem);
+  rtx len = gen_lowpart (DImode, operands[2]);
+  rtx res = gen_reg_rtx (V16QImode);
+  emit_insn (gen_lxvl (res, mem, len));
+  emit_move_insn (operands[0], gen_lowpart (<MODE>mode, res));
+  DONE;
+})
+
+(define_expand "lenstore<mode>qi"
+  [(match_operand:VEC_A 0 "memory_operand")
+   (match_operand:VEC_A 1 "vlogical_operand")
+   (match_operand:QI 2 "int_reg_operand")
+  ]
+  "TARGET_P9_VECTOR && TARGET_64BIT"
+{
+  rtx val = gen_reg_rtx (V16QImode);
+  emit_move_insn (val, gen_lowpart (V16QImode, operands[1]));
+  rtx mem = XEXP (operands[0], 0);
+  mem = force_reg (DImode, mem);
+  rtx len = gen_lowpart (DImode, operands[2]);
+  emit_insn (gen_stxvl (val, mem, len));
+  DONE;
+})
+
 (define_insn "*stxvl"
   [(set (mem:V16QI (match_operand:DI 1 "gpc_reg_operand" "b"))
 	(unspec:V16QI
-- 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 4/7 V2] hook/rs6000: Add vectorize length mode for vector with length
  2020-05-26  5:55 ` [PATCH 4/7] hook/rs6000: Add vectorize length mode for vector with length Kewen.Lin
@ 2020-06-10  6:44   ` Kewen.Lin
  0 siblings, 0 replies; 80+ messages in thread
From: Kewen.Lin @ 2020-06-10  6:44 UTC (permalink / raw)
  To: GCC Patches
  Cc: Bill Schmidt, Richard Guenther, Segher Boessenkool, dje.gcc,
	Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 295 bytes --]

v2: Update rs6000 length_mode hook to QImode, also update description of the hook.

gcc/ChangeLog:

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* config/rs6000/rs6000.c (TARGET_VECTORIZE_LENGTH_MODE): New macro.
	* doc/tm.texi: Regenerate.
	* doc/tm.texi.in: New hook.
	* target.def: Likewise.

[-- Attachment #2: length_mode_v2.patch --]
[-- Type: text/plain, Size: 2863 bytes --]


---
 gcc/config/rs6000/rs6000.c | 3 +++
 gcc/doc/tm.texi            | 8 ++++++++
 gcc/doc/tm.texi.in         | 2 ++
 gcc/target.def             | 9 +++++++++
 4 files changed, 22 insertions(+)

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 8435bc15d72..89881a352f0 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1659,6 +1659,9 @@ static const struct attribute_spec rs6000_attribute_table[] =
 #undef TARGET_HAVE_COUNT_REG_DECR_P
 #define TARGET_HAVE_COUNT_REG_DECR_P true
 
+#undef TARGET_VECTORIZE_LENGTH_MODE
+#define TARGET_VECTORIZE_LENGTH_MODE QImode
+
 /* 1000000000 is infinite cost in IVOPTs.  */
 #undef TARGET_DOLOOP_COST_FOR_GENERIC
 #define TARGET_DOLOOP_COST_FOR_GENERIC 1000000000
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 6e7d9dc54a9..087a39b840d 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6084,6 +6084,14 @@ The default implementation returns a @code{MODE_VECTOR_INT} with the
 same size and number of elements as @var{mode}, if such a mode exists.
 @end deftypefn
 
+@deftypevr {Target Hook} scalar_int_mode TARGET_VECTORIZE_LENGTH_MODE
+For the targets which support length-based partial vectorization, also
+known as vector memory access with length, return the scalar int mode to
+be used for the length.  Normally it should be set according to the
+required minimum precision of supported length.
+The default is to use @code{word_mode}.
+@end deftypevr
+
 @deftypefn {Target Hook} bool TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE (unsigned @var{ifn})
 This hook returns true if masked internal function @var{ifn} (really of
 type @code{internal_fn}) should be considered expensive when the mask is
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 3be984bbd5c..83034176b56 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4181,6 +4181,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_VECTORIZE_GET_MASK_MODE
 
+@hook TARGET_VECTORIZE_LENGTH_MODE
+
 @hook TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE
 
 @hook TARGET_VECTORIZE_INIT_COST
diff --git a/gcc/target.def b/gcc/target.def
index 07059a87caf..f64861189cb 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1969,6 +1969,15 @@ same size and number of elements as @var{mode}, if such a mode exists.",
  (machine_mode mode),
  default_get_mask_mode)
 
+DEFHOOKPOD
+(length_mode,
+ "For the targets which support length-based partial vectorization, also\n\
+known as vector memory access with length, return the scalar int mode to\n\
+be used for the length.  Normally it should be set according to the\n\
+required minimum precision of supported length.\n\
+The default is to use @code{word_mode}.",
+ scalar_int_mode, word_mode)
+
 /* Function to say whether a masked operation is expensive when the
    mask is all zeros.  */
 DEFHOOK
-- 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 5/7 v4] vect: Support vector load/store with length in vectorizer
  2020-06-03  6:33                 ` Kewen.Lin
@ 2020-06-10  9:19                   ` Kewen.Lin
  2020-06-22  8:33                     ` [PATCH 5/7 v5] " Kewen.Lin
  0 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-06-10  9:19 UTC (permalink / raw)
  To: GCC Patches
  Cc: richard.sandiford, Bill Schmidt, Richard Guenther,
	Segher Boessenkool, dje.gcc

[-- Attachment #1: Type: text/plain, Size: 3310 bytes --]

Hi all,

Compared with v3, v4 is mainly based on Richard's comments and consists of these changes:
  - split out some renaming and refactoring.
  - use QImode for length.
  - update the iv type determination.
  - introduce factor into rgroup_controls.
  - use using_partial_vectors_p for both approaches.

Bootstrapped/regtested on powerpc64le-linux-gnu P9 and no remarkable
failures found even with explicit vect-with-length-scope settings.

Still saw the regression failure on aarch64-linux-gnu as below:
PASS->FAIL: gcc.target/aarch64/sve/reduc_8.c -march=armv8.2-a+sve  scan-assembler-not \\tcmpeq\\tp[0-9]+\\.s,

It's caused by the change to vectorizable_condition: without it, the
outer loop can be fully masked.  The reduction_type here is
TREE_CODE_REDUCTION, so can_use_partial_vectors_p gets cleared.

From the optimized dump, the IRs before this change look fine.  The
reduction is for the inner loop, but we are checking partial
vectorization for the outer loop.  I'm not sure whether adjusting the
current guard would be reasonable for this case.

Any comments are highly appreciated!

BR,
Kewen
----
gcc/ChangeLog

	* doc/invoke.texi (vect-with-length-scope): Document new option.
	* params.opt (vect-with-length-scope): New.
	* tree-vect-loop-manip.c (vect_set_loop_controls_directly): Add the
	handlings for length-based partial vectorization, call vect_gen_len
	for length generation.
	(vect_set_loop_condition_partial_vectors): Add the handlings for
	length-based partial vectorization.
	(vect_do_peeling): Allow remaining eiters less than epilogue vf for
	LOOP_VINFO_USING_PARTIAL_VECTORS_P.
	* tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Init
	epil_using_partial_vectors_p.
	(_loop_vec_info::~_loop_vec_info): Call release_vec_loop_controls
	for lengths destruction.
	(vect_verify_loop_lens): New function.
	(vect_analyze_loop_2): Add the check to allow only one partial
	vectorization approach at the same time.  Check loop-wide reasons
	length-based partial vectorization decision.  Mark
	LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P if the epilogue is
	considerable to use length-based approach.  Call
	release_vec_loop_controls for lengths destruction.
	(vect_analyze_loop): Add handlings for epilogue of loop when it's
	marked to use partial vectorization approach.
	(vect_estimate_min_profitable_iters): Adjust for loop with
	length-based partial vectorization.
	(vect_record_loop_mask): Init factor to 1 for mask-based partial
	vectorization.
	(vect_record_loop_len): New function.
	(vect_get_loop_len): New function.
	* tree-vect-stmts.c (check_load_store_for_partial_vectors): Add
	length-based partial vectorization related checks.
	(vectorizable_store): Add handlings for length-based partial
	vectorization.
	(vectorizable_load): Likewise.
	(vectorizable_condition): Guard partial vectorization reduction
	only for EXTRACT_LAST_REDUCTION.
	(vect_gen_len): New function.
	* tree-vectorizer.h (struct rgroup_controls): Add field factor
	for length-based partial vectorization.
	(vec_loop_lens): New typedef.
	(_loop_vec_info): Add lens and epil_using_partial_vectors_p.
	(LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P): New macro.
	(LOOP_VINFO_LENS): Likewise.
	(LOOP_VINFO_FULLY_WITH_LENGTH_P): Likewise.
	(vect_record_loop_len): New declare.
	(vect_get_loop_len): Likewise.
	(vect_gen_len): Likewise.


[-- Attachment #2: vector_with_length_v4.patch --]
[-- Type: text/plain, Size: 36737 bytes --]

---
 gcc/doc/invoke.texi        |   7 ++
 gcc/params.opt             |   4 +
 gcc/tree-vect-loop-manip.c |  97 ++++++++++-----
 gcc/tree-vect-loop.c       | 243 ++++++++++++++++++++++++++++++++++++-
 gcc/tree-vect-stmts.c      | 155 ++++++++++++++++++++---
 gcc/tree-vectorizer.h      |  43 ++++++-
 6 files changed, 494 insertions(+), 55 deletions(-)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 8b9935dfe65..ac765feab13 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -13079,6 +13079,13 @@ by the copy loop headers pass.
 @item vect-epilogues-nomask
 Enable loop epilogue vectorization using smaller vector size.
 
+@item vect-with-length-scope
+Control the scope of vector memory access with length exploitation.  0 means we
+don't exploit any vector memory access with length, 1 means we only exploit
+vector memory access with length for those loops whose iteration count is
+less than VF, such as a very small loop or an epilogue, 2 means we want to
+exploit vector memory access with length for any loop if possible.
+
 @item slp-max-insns-in-bb
 Maximum number of instructions in basic block to be
 considered for SLP vectorization.
diff --git a/gcc/params.opt b/gcc/params.opt
index 4aec480798b..d4309101067 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -964,4 +964,8 @@ Bound on number of runtime checks inserted by the vectorizer's loop versioning f
 Common Joined UInteger Var(param_vect_max_version_for_alignment_checks) Init(6) Param Optimization
 Bound on number of runtime checks inserted by the vectorizer's loop versioning for alignment check.
 
+-param=vect-with-length-scope=
+Common Joined UInteger Var(param_vect_with_length_scope) Init(0) IntegerRange(0, 2) Param Optimization
+Control the vector with length exploitation scope.
+
 ; This comment is to ensure we retain the blank line above.
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 1fac5898525..1eaf6e1c3ea 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -399,19 +399,20 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
 
    It is known that:
 
-     NITERS * RGC->max_nscalars_per_iter
+     NITERS * RGC->max_nscalars_per_iter * RGC->factor
 
    does not overflow.  However, MIGHT_WRAP_P says whether an induction
    variable that starts at 0 and has step:
 
-     VF * RGC->max_nscalars_per_iter
+     VF * RGC->max_nscalars_per_iter * RGC->factor
 
    might overflow before hitting a value above:
 
-     (NITERS + NITERS_SKIP) * RGC->max_nscalars_per_iter
+     (NITERS + NITERS_SKIP) * RGC->max_nscalars_per_iter * RGC->factor
 
    This means that we cannot guarantee that such an induction variable
-   would ever hit a value that produces a set of all-false masks for RGC.  */
+   would ever hit a value that produces a set of all-false masks or zero
+   lengths for RGC.  */
 
 static tree
 vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
@@ -422,10 +423,20 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 {
   tree compare_type = LOOP_VINFO_COMPARE_TYPE (loop_vinfo);
   tree iv_type = LOOP_VINFO_IV_TYPE (loop_vinfo);
+  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
+
   tree ctrl_type = rgc->type;
-  unsigned int nscalars_per_iter = rgc->max_nscalars_per_iter;
+  /* Scale up nscalars per iteration with factor.  */
+  unsigned int nscalars_per_iter_ft = rgc->max_nscalars_per_iter * rgc->factor;
   poly_uint64 nscalars_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type);
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  tree length_limit = NULL_TREE;
+  /* For length, we probably need length_limit to check length in range.  */
+  if (!vect_for_masking)
+    {
+      poly_uint64 len_limit = nscalars_per_ctrl * rgc->factor;
+      length_limit = build_int_cst (compare_type, len_limit);
+    }
 
   /* Calculate the maximum number of scalar values that the rgroup
      handles in total, the number that it handles for each iteration
@@ -434,12 +445,12 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
   tree nscalars_total = niters;
   tree nscalars_step = build_int_cst (iv_type, vf);
   tree nscalars_skip = niters_skip;
-  if (nscalars_per_iter != 1)
+  if (nscalars_per_iter_ft != 1)
     {
       /* We checked before choosing to use a partial vectorization loop that
 	 these multiplications don't overflow.  */
-      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
-      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
+      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter_ft);
+      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter_ft);
       nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
 				     nscalars_total, compare_factor);
       nscalars_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
@@ -509,7 +520,7 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 	     NSCALARS_SKIP to that cannot overflow.  */
 	  tree const_limit = build_int_cst (compare_type,
 					    LOOP_VINFO_VECT_FACTOR (loop_vinfo)
-					    * nscalars_per_iter);
+					    * nscalars_per_iter_ft);
 	  first_limit = gimple_build (preheader_seq, MIN_EXPR, compare_type,
 				      nscalars_total, const_limit);
 	  first_limit = gimple_build (preheader_seq, PLUS_EXPR, compare_type,
@@ -549,16 +560,16 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
     {
       /* Previous controls will cover BIAS scalars.  This control covers the
 	 next batch.  */
-      poly_uint64 bias = nscalars_per_ctrl * i;
+      poly_uint64 batch_nscalars_ft = nscalars_per_ctrl * rgc->factor;
+      poly_uint64 bias = batch_nscalars_ft * i;
       tree bias_tree = build_int_cst (compare_type, bias);
-      gimple *tmp_stmt;
 
       /* See whether the first iteration of the vector loop is known
 	 to have a full control.  */
       poly_uint64 const_limit;
       bool first_iteration_full
 	= (poly_int_tree_p (first_limit, &const_limit)
-	   && known_ge (const_limit, (i + 1) * nscalars_per_ctrl));
+	   && known_ge (const_limit, (i + 1) * batch_nscalars_ft));
 
       /* Rather than have a new IV that starts at BIAS and goes up to
 	 TEST_LIMIT, prefer to use the same 0-based IV for each control
@@ -598,9 +609,19 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 	      end = first_limit;
 	    }
 
-	  init_ctrl = make_temp_ssa_name (ctrl_type, NULL, "max_mask");
-	  tmp_stmt = vect_gen_while (init_ctrl, start, end);
-	  gimple_seq_add_stmt (preheader_seq, tmp_stmt);
+	  if (vect_for_masking)
+	    {
+	      init_ctrl = make_temp_ssa_name (ctrl_type, NULL, "max_mask");
+	      gimple *tmp_stmt = vect_gen_while (init_ctrl, start, end);
+	      gimple_seq_add_stmt (preheader_seq, tmp_stmt);
+	    }
+	  else
+	    {
+	      init_ctrl = make_temp_ssa_name (compare_type, NULL, "max_len");
+	      gimple_seq seq = vect_gen_len (init_ctrl, start,
+					     end, length_limit);
+	      gimple_seq_add_seq (preheader_seq, seq);
+	    }
 	}
 
       /* Now AND out the bits that are within the number of skipped
@@ -617,16 +638,32 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 				      init_ctrl, unskipped_mask);
 	  else
 	    init_ctrl = unskipped_mask;
+	  gcc_assert (vect_for_masking);
 	}
 
+      /* First iteration is full.  */
       if (!init_ctrl)
-	/* First iteration is full.  */
-	init_ctrl = build_minus_one_cst (ctrl_type);
+	{
+	  if (vect_for_masking)
+	    init_ctrl = build_minus_one_cst (ctrl_type);
+	  else
+	    init_ctrl = length_limit;
+	}
 
       /* Get the control value for the next iteration of the loop.  */
-      next_ctrl = make_temp_ssa_name (ctrl_type, NULL, "next_mask");
-      gcall *call = vect_gen_while (next_ctrl, test_index, this_test_limit);
-      gsi_insert_before (test_gsi, call, GSI_SAME_STMT);
+      if (vect_for_masking)
+	{
+	  next_ctrl = make_temp_ssa_name (ctrl_type, NULL, "next_mask");
+	  gcall *call = vect_gen_while (next_ctrl, test_index, this_test_limit);
+	  gsi_insert_before (test_gsi, call, GSI_SAME_STMT);
+	}
+      else
+	{
+	  next_ctrl = make_temp_ssa_name (compare_type, NULL, "next_len");
+	  gimple_seq seq = vect_gen_len (next_ctrl, test_index, this_test_limit,
+					 length_limit);
+	  gsi_insert_seq_before (test_gsi, seq, GSI_SAME_STMT);
+	}
 
       vect_set_loop_control (loop, ctrl, init_ctrl, next_ctrl);
     }
@@ -651,6 +688,7 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
   gimple_seq preheader_seq = NULL;
   gimple_seq header_seq = NULL;
 
+  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
   tree compare_type = LOOP_VINFO_COMPARE_TYPE (loop_vinfo);
   unsigned int compare_precision = TYPE_PRECISION (compare_type);
   tree orig_niters = niters;
@@ -685,28 +723,30 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
   tree test_ctrl = NULL_TREE;
   rgroup_controls *rgc;
   unsigned int i;
-  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
-  FOR_EACH_VEC_ELT (*masks, i, rgc)
+  auto_vec<rgroup_controls> *controls = vect_for_masking
+					  ? &LOOP_VINFO_MASKS (loop_vinfo)
+					  : &LOOP_VINFO_LENS (loop_vinfo);
+  FOR_EACH_VEC_ELT (*controls, i, rgc)
     if (!rgc->controls.is_empty ())
       {
 	/* First try using permutes.  This adds a single vector
 	   instruction to the loop for each mask, but needs no extra
 	   loop invariants or IVs.  */
 	unsigned int nmasks = i + 1;
-	if ((nmasks & 1) == 0)
+	if (vect_for_masking && (nmasks & 1) == 0)
 	  {
-	    rgroup_controls *half_rgc = &(*masks)[nmasks / 2 - 1];
+	    rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
 	    if (!half_rgc->controls.is_empty ()
 		&& vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc))
 	      continue;
 	  }
 
 	/* See whether zero-based IV would ever generate all-false masks
-	   before wrapping around.  */
+	   or zero length before wrapping around.  */
+	unsigned nscalars_ft = rgc->max_nscalars_per_iter * rgc->factor;
 	bool might_wrap_p
 	  = (iv_limit == -1
-	     || (wi::min_precision (iv_limit * rgc->max_nscalars_per_iter,
-				    UNSIGNED)
+	     || (wi::min_precision (iv_limit * nscalars_ft, UNSIGNED)
 		 > compare_precision));
 
 	/* Set up all controls for this group.  */
@@ -2567,7 +2607,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   if (vect_epilogues
       && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
       && prolog_peeling >= 0
-      && known_eq (vf, lowest_vf))
+      && known_eq (vf, lowest_vf)
+      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (epilogue_vinfo))
     {
       unsigned HOST_WIDE_INT eiters
 	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index b6e96f77f69..19a37af2f56 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -815,6 +815,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     vectorizable (false),
     can_use_partial_vectors_p (true),
     using_partial_vectors_p (false),
+    epil_using_partial_vectors_p (false),
     peeling_for_gaps (false),
     peeling_for_niter (false),
     no_data_dependencies (false),
@@ -895,6 +896,7 @@ _loop_vec_info::~_loop_vec_info ()
   free (bbs);
 
   release_vec_loop_controls (&masks);
+  release_vec_loop_controls (&lens);
   delete ivexpr_map;
   delete scan_map;
   epilogue_vinfos.release ();
@@ -1070,6 +1072,88 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
   return true;
 }
 
+/* Check whether we can use vector access with length based on precision
+   comparison.  So far, to keep it simple, we only allow the case that the
+   precision of the target supported length is larger than the precision
+   required by loop niters.  */
+
+static bool
+vect_verify_loop_lens (loop_vec_info loop_vinfo)
+{
+  vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
+
+  if (LOOP_VINFO_LENS (loop_vinfo).is_empty ())
+    return false;
+
+  /* The one which has the largest NV should have max bytes per iter.  */
+  rgroup_controls *rgl = &(*lens)[lens->length () - 1];
+
+  /* Work out how many bits we need to represent the length limit.  */
+  unsigned int nscalars_per_iter_ft = rgl->max_nscalars_per_iter * rgl->factor;
+  unsigned int min_ni_prec
+    = min_prec_for_max_niters (loop_vinfo, nscalars_per_iter_ft);
+
+  /* Now use the maximum of below precisions for one suitable IV type:
+     - the IV's natural precision
+     - the precision needed to hold: the maximum number of scalar
+       iterations multiplied by the scale factor (min_ni_prec above)
+     - the Pmode precision
+  */
+
+  /* If min_ni_prec is less than the precision of the current niters,
+     we prefer to still use the niters type.  */
+  unsigned int ni_prec
+    = TYPE_PRECISION (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)));
+  /* Prefer to use Pmode and wider IV to avoid narrow conversions.  */
+  unsigned int pmode_prec = GET_MODE_BITSIZE (Pmode);
+
+  unsigned int required_prec = ni_prec;
+  if (required_prec < pmode_prec)
+    required_prec = pmode_prec;
+
+  tree iv_type = NULL_TREE;
+  if (min_ni_prec > required_prec)
+    {
+      opt_scalar_int_mode tmode_iter;
+      unsigned standard_bits = 0;
+      FOR_EACH_MODE_IN_CLASS (tmode_iter, MODE_INT)
+      {
+	scalar_mode tmode = tmode_iter.require ();
+	unsigned int tbits = GET_MODE_BITSIZE (tmode);
+
+	/* ??? Do we really want to construct one IV whose precision exceeds
+	   BITS_PER_WORD?  */
+	if (tbits > BITS_PER_WORD)
+	  break;
+
+	/* Find the first available standard integral type.  */
+	if (tbits >= min_ni_prec && targetm.scalar_mode_supported_p (tmode))
+	  {
+	    standard_bits = tbits;
+	    break;
+	  }
+      }
+      if (standard_bits != 0)
+	iv_type = build_nonstandard_integer_type (standard_bits, true);
+    }
+  else
+    iv_type = build_nonstandard_integer_type (required_prec, true);
+
+  if (!iv_type)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use length-based partial vectorization"
+			 " due to no suitable iv type.\n");
+      return false;
+    }
+
+  LOOP_VINFO_COMPARE_TYPE (loop_vinfo) = iv_type;
+  LOOP_VINFO_IV_TYPE (loop_vinfo) = iv_type;
+
+  return true;
+}
+
 /* Calculate the cost of one scalar iteration of the loop.  */
 static void
 vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo)
@@ -2144,11 +2228,63 @@ start_over:
       return ok;
     }
 
-  /* Decide whether to use a fully-masked loop for this vectorization
-     factor.  */
-  LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
-    = (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
-       && vect_verify_full_masking (loop_vinfo));
+  /* For now, we don't expect to mix both masking and length approaches for one
+     loop, disable it if both are recorded.  */
+  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+      && !LOOP_VINFO_MASKS (loop_vinfo).is_empty ()
+      && !LOOP_VINFO_LENS (loop_vinfo).is_empty ())
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use a partial vectorized loop because we"
+			 " don't expect to mix partial vectorization"
+			 " approaches for the same loop.\n");
+      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+    }
+
+  /* Decide whether to use a partial vectorization loop for this
+     vectorization factor.  */
+  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+    {
+      /* Decide whether to use fully-masked approach.  */
+      if (vect_verify_full_masking (loop_vinfo))
+	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
+      /* Decide whether to use length-based approach.  */
+      else if (vect_verify_loop_lens (loop_vinfo))
+	{
+	  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+	      || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				 "can't use length-based partial vectorization"
+				 " approach because peeling for alignment or"
+				 " gaps is required.\n");
+	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	    }
+	  else if (param_vect_with_length_scope == 0)
+	    LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	  /* The epilogue and other cases with niters known to be less
+	     than VF can still use vector access with length fully.  */
+	  else if (param_vect_with_length_scope == 1
+		   && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+		   && !known_niters_smaller_than_vf (loop_vinfo))
+	    {
+	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	      LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
+	    }
+	  else
+	    {
+	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
+	      LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	    }
+	}
+      else
+	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+    }
+  else
+    LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+
   if (dump_enabled_p ())
     {
       if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
@@ -2157,6 +2293,15 @@ start_over:
       else
 	dump_printf_loc (MSG_NOTE, vect_location,
 			 "not using a fully-masked loop.\n");
+
+      if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "using length-based partial"
+			 " vectorization for loop fully.\n");
+      else
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "not using length-based partial"
+			 " vectorization for loop fully.\n");
     }
 
   /* If epilog loop is required because of data accesses with gaps,
@@ -2377,6 +2522,7 @@ again:
     = init_cost (LOOP_VINFO_LOOP (loop_vinfo));
   /* Reset accumulated rgroup information.  */
   release_vec_loop_controls (&LOOP_VINFO_MASKS (loop_vinfo));
+  release_vec_loop_controls (&LOOP_VINFO_LENS (loop_vinfo));
   /* Reset assorted flags.  */
   LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
   LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) = false;
@@ -2663,7 +2809,10 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 		lowest_th = ordered_min (lowest_th, th);
 	    }
 	  else
-	    delete loop_vinfo;
+	    {
+	      delete loop_vinfo;
+	      loop_vinfo = opt_loop_vec_info::success (NULL);
+	    }
 
 	  /* Only vectorize epilogues if PARAM_VECT_EPILOGUES_NOMASK is
 	     enabled, SIMDUID is not set, it is the innermost loop and we have
@@ -2688,6 +2837,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
       else
 	{
 	  delete loop_vinfo;
+	  loop_vinfo = opt_loop_vec_info::success (NULL);
 	  if (fatal)
 	    {
 	      gcc_checking_assert (first_loop_vinfo == NULL);
@@ -2695,6 +2845,23 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 	    }
 	}
 
+      /* Handle the case where the original loop can use partial
+	 vectorization, but we only want to adopt it for the epilogue.
+	 The retry should be in the same mode as the original.  */
+      if (vect_epilogues
+	  && loop_vinfo
+	  && LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo))
+	{
+	  gcc_assert (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+		      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo));
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "***** Re-trying analysis with same vector mode"
+			     " %s for epilogue with partial vectorization.\n",
+			     GET_MODE_NAME (loop_vinfo->vector_mode));
+	  continue;
+	}
+
       if (mode_i < vector_modes.length ()
 	  && VECTOR_MODE_P (autodetected_vector_mode)
 	  && (related_vector_mode (vector_modes[mode_i],
@@ -3535,6 +3702,11 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 			    target_cost_data, num_masks - 1, vector_stmt,
 			    NULL, NULL_TREE, 0, vect_body);
     }
+  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      peel_iters_prologue = 0;
+      peel_iters_epilogue = 0;
+    }
   else if (npeel < 0)
     {
       peel_iters_prologue = assumed_vf / 2;
@@ -8319,6 +8491,7 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
     {
       rgm->max_nscalars_per_iter = nscalars_per_iter;
       rgm->type = truth_type_for (vectype);
+      rgm->factor = 1;
     }
 }
 
@@ -8371,6 +8544,64 @@ vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
   return mask;
 }
 
+/* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS
+   lengths for vector access with length, each of which controls a vector
+   of type VECTYPE.  */
+
+void
+vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
+		      unsigned int nvectors, tree vectype)
+{
+  gcc_assert (nvectors != 0);
+  if (lens->length () < nvectors)
+    lens->safe_grow_cleared (nvectors);
+  rgroup_controls *rgl = &(*lens)[nvectors - 1];
+
+  /* The number of scalars per iteration, the bytes occupied by each
+     scalar and the number of vectors are all compile-time constants.  */
+  unsigned int nscalars_per_iter
+    = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
+		 LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
+
+  if (rgl->max_nscalars_per_iter < nscalars_per_iter)
+    {
+      rgl->max_nscalars_per_iter = nscalars_per_iter;
+      rgl->type = vectype;
+      /* For now, the length-based approach uses lengths in bytes.
+	 FIXME if the length-based approach supports more, e.g. lengths
+	 in scalar counts.  */
+      rgl->factor = int_cst_value (TYPE_SIZE_UNIT (TREE_TYPE (vectype)));
+    }
+}
+
+/* Given a complete set of lengths LENS, extract length number INDEX for an
+   rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.  */
+
+tree
+vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
+		   unsigned int nvectors, unsigned int index)
+{
+  rgroup_controls *rgl = &(*lens)[nvectors - 1];
+
+  /* Populate the rgroup's len array, if this is the first time we've
+     used it.  */
+  if (rgl->controls.is_empty ())
+    {
+      rgl->controls.safe_grow_cleared (nvectors);
+      for (unsigned int i = 0; i < nvectors; ++i)
+	{
+	  tree len_type = LOOP_VINFO_COMPARE_TYPE (loop_vinfo);
+	  gcc_assert (len_type != NULL_TREE);
+	  tree len = make_temp_ssa_name (len_type, NULL, "loop_len");
+
+	  /* Provide a dummy definition until the real one is available.  */
+	  SSA_NAME_DEF_STMT (len) = gimple_build_nop ();
+	  rgl->controls[i] = len;
+	}
+    }
+
+  return rgl->controls[index];
+}
+
 /* Scale profiling counters by estimation for LOOP which is vectorized
    by factor VF.  */
 
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 484470091a8..98f166d742f 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -1855,29 +1855,56 @@ check_load_store_for_partial_vectors (
       return;
     }
 
-  machine_mode mask_mode;
-  if (!VECTOR_MODE_P (vecmode)
-      || !targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
-      || !can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
+  if (!VECTOR_MODE_P (vecmode))
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "can't use a fully-masked loop because the target"
-			 " doesn't have the appropriate masked load or"
-			 " store.\n");
+			 "can't use a partial vectorization loop because the"
+			 " mode isn't a vector mode.\n");
       LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
       return;
     }
-  /* We might load more scalars than we need for permuting SLP loads.
-     We checked in get_group_load_store_type that the extra elements
-     don't leak into a new vector.  */
+
   poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   unsigned int nvectors;
-  if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
-    vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype, scalar_mask);
-  else
-    gcc_unreachable ();
+
+  machine_mode mask_mode;
+  bool partial_vectorization_p = false;
+  if (targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
+      && can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
+    {
+      /* We might load more scalars than we need for permuting SLP loads.
+	 We checked in get_group_load_store_type that the extra elements
+	 don't leak into a new vector.  */
+      if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
+	vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype,
+			       scalar_mask);
+      else
+	gcc_unreachable ();
+      partial_vectorization_p = true;
+    }
+
+  optab op = is_load ? lenload_optab : lenstore_optab;
+  if (convert_optab_handler (op, vecmode, targetm.vectorize.length_mode))
+    {
+      vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
+      if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
+	vect_record_loop_len (loop_vinfo, lens, nvectors, vectype);
+      else
+	gcc_unreachable ();
+      partial_vectorization_p = true;
+    }
+
+  if (!partial_vectorization_p)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use a partial vectorization loop because the"
+			 " target doesn't have the appropriate partial"
+			 " vectorization load or store.\n");
+      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+    }
 }
 
 /* Return the mask input to a masked load or store.  VEC_MASK is the vectorized
@@ -8070,6 +8097,14 @@ vectorizable_store (vec_info *vinfo,
     = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
        ? &LOOP_VINFO_MASKS (loop_vinfo)
        : NULL);
+  vec_loop_lens *loop_lens
+    = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+       ? &LOOP_VINFO_LENS (loop_vinfo)
+       : NULL);
+
+  /* Shouldn't go with length-based approach if fully masked.  */
+  gcc_assert (!loop_lens || (loop_lens && !loop_masks));
+
   /* Targets with store-lane instructions must not require explicit
      realignment.  vect_supportable_dr_alignment always returns either
      dr_aligned or dr_unaligned_supported for masked operations.  */
@@ -8322,10 +8357,16 @@ vectorizable_store (vec_info *vinfo,
 	      unsigned HOST_WIDE_INT align;
 
 	      tree final_mask = NULL_TREE;
+	      tree final_len = NULL_TREE;
 	      if (loop_masks)
 		final_mask = vect_get_loop_mask (gsi, loop_masks,
 						 vec_num * ncopies,
 						 vectype, vec_num * j + i);
+	      else if (loop_lens)
+		final_len = vect_get_loop_len (loop_vinfo, loop_lens,
+					       vec_num * ncopies,
+					       vec_num * j + i);
+
 	      if (vec_mask)
 		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
 						      vec_mask, gsi);
@@ -8405,6 +8446,17 @@ vectorizable_store (vec_info *vinfo,
 		  new_stmt_info
 		    = vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
 		}
+	      else if (final_len)
+		{
+		  align = least_bit_hwi (misalign | align);
+		  tree ptr = build_int_cst (ref_type, align);
+		  gcall *call
+		    = gimple_build_call_internal (IFN_LEN_STORE, 4, dataref_ptr,
+						  ptr, final_len, vec_oprnd);
+		  gimple_call_set_nothrow (call, true);
+		  new_stmt_info
+		    = vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
+		}
 	      else
 		{
 		  data_ref = fold_build2 (MEM_REF, vectype,
@@ -8939,6 +8991,7 @@ vectorizable_load (vec_info *vinfo,
       tree dr_offset;
 
       gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
+      gcc_assert (!LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
       gcc_assert (!nested_in_vect_loop);
 
       if (grouped_load)
@@ -9237,6 +9290,14 @@ vectorizable_load (vec_info *vinfo,
     = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
        ? &LOOP_VINFO_MASKS (loop_vinfo)
        : NULL);
+  vec_loop_lens *loop_lens
+    = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+       ? &LOOP_VINFO_LENS (loop_vinfo)
+       : NULL);
+
+  /* Shouldn't go with length-based approach if fully masked.  */
+  gcc_assert (!loop_lens || (loop_lens && !loop_masks));
+
   /* Targets with store-lane instructions must not require explicit
      realignment.  vect_supportable_dr_alignment always returns either
      dr_aligned or dr_unaligned_supported for masked operations.  */
@@ -9558,11 +9619,18 @@ vectorizable_load (vec_info *vinfo,
 	  for (i = 0; i < vec_num; i++)
 	    {
 	      tree final_mask = NULL_TREE;
+	      tree final_len = NULL_TREE;
 	      if (loop_masks
 		  && memory_access_type != VMAT_INVARIANT)
 		final_mask = vect_get_loop_mask (gsi, loop_masks,
 						 vec_num * ncopies,
 						 vectype, vec_num * j + i);
+	      else if (loop_lens
+		  && memory_access_type != VMAT_INVARIANT)
+		final_len = vect_get_loop_len (loop_vinfo, loop_lens,
+					       vec_num * ncopies,
+					       vec_num * j + i);
+
 	      if (vec_mask)
 		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
 						      vec_mask, gsi);
@@ -9632,6 +9700,18 @@ vectorizable_load (vec_info *vinfo,
 			new_stmt = call;
 			data_ref = NULL_TREE;
 		      }
+		    else if (final_len)
+		      {
+			align = least_bit_hwi (misalign | align);
+			tree ptr = build_int_cst (ref_type, align);
+			gcall *call
+			  = gimple_build_call_internal (IFN_LEN_LOAD, 3,
+							dataref_ptr, ptr,
+							final_len);
+			gimple_call_set_nothrow (call, true);
+			new_stmt = call;
+			data_ref = NULL_TREE;
+		      }
 		    else
 		      {
 			tree ltype = vectype;
@@ -10282,11 +10362,17 @@ vectorizable_condition (vec_info *vinfo,
 	  return false;
 	}
 
+      /* For reduction, we expect EXTRACT_LAST_REDUCTION so far.  */
       if (loop_vinfo
-	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
-	  && reduction_type == EXTRACT_LAST_REDUCTION)
-	vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
-			       ncopies * vec_num, vectype, NULL);
+	  && for_reduction
+	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+	{
+	  if (reduction_type == EXTRACT_LAST_REDUCTION)
+	    vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
+				   ncopies * vec_num, vectype, NULL);
+	  else
+	    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	}
 
       STMT_VINFO_TYPE (stmt_info) = condition_vec_info_type;
       vect_model_simple_cost (vinfo, stmt_info, ncopies, dts, ndts, slp_node,
@@ -12483,3 +12569,36 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
   *nunits_vectype_out = nunits_vectype;
   return opt_result::success ();
 }
+
+/* Generate and return a statement sequence that sets vector length LEN:
+
+   min_of_start_and_end = min (START_INDEX, END_INDEX);
+   left_len = END_INDEX - min_of_start_and_end;
+   rhs = min (left_len, LEN_LIMIT);
+   LEN = rhs;
+
+   TODO: for now, the rs6000 vector with length support only cares about
+   8 bits of the length, which means that if left_len in bytes is larger
+   than 255, it can't be saturated to the vector limit (vector size).  A
+   target hook can be provided if other ports don't have this restriction.  */
+
+gimple_seq
+vect_gen_len (tree len, tree start_index, tree end_index, tree len_limit)
+{
+  gimple_seq stmts = NULL;
+  tree len_type = TREE_TYPE (len);
+  gcc_assert (TREE_TYPE (start_index) == len_type);
+
+  tree min = fold_build2 (MIN_EXPR, len_type, start_index, end_index);
+  tree left_len = fold_build2 (MINUS_EXPR, len_type, end_index, min);
+  left_len = fold_build2 (MIN_EXPR, len_type, left_len, len_limit);
+
+  tree rhs = force_gimple_operand (left_len, &stmts, true, NULL_TREE);
+  gimple *new_stmt = gimple_build_assign (len, rhs);
+  gimple_stmt_iterator i = gsi_last (stmts);
+  gsi_insert_after_without_update (&i, new_stmt, GSI_CONTINUE_LINKING);
+
+  return stmts;
+}
+
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 857b4a9db15..57da8db43a2 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -407,6 +407,16 @@ is_a_helper <_bb_vec_info *>::test (vec_info *i)
    are compile-time constants but VF and nL can be variable (if the target
    supports variable-length vectors).
 
+   Moreover, for some partial vectorization approaches such as length-based
+   in bytes, we care about the number of bytes occupied by each scalar.
+   Provided that each scalar occupies "factor" bytes, the total number of
+   scalar bytes becomes factor * N and the above equation becomes:
+
+       factor * N = factor * NS * VF = factor * NV * NL
+
+   where factor * NS is the number of bytes per iteration and factor * NL
+   is the vector size in bytes.
+
    In classical vectorization, each iteration of the vector loop would
    handle exactly VF iterations of the original scalar loop.  However,
    in a partial vectorization loop, a particular iteration of the vector
@@ -462,14 +472,19 @@ is_a_helper <_bb_vec_info *>::test (vec_info *i)
    first level being indexed by nV - 1 (since nV == 0 doesn't exist) and
    the second being indexed by the mask index 0 <= i < nV.  */
 
-/* The controls (like masks) needed by rgroups with nV vectors,
+/* The controls (like masks, lengths) needed by rgroups with nV vectors,
    according to the description above.  */
 struct rgroup_controls {
   /* The largest nS for all rgroups that use these controls.  */
   unsigned int max_nscalars_per_iter;
 
-  /* The type of control to use, based on the highest nS recorded above.
-     For mask-based approach, it's used for mask_type.  */
+  /* For now, this is mainly used for the length-based in bytes approach;
+     it records the number of bytes occupied by each scalar.  */
+  unsigned int factor;
+
+  /* This type is based on the highest nS recorded above.
+     For the mask-based approach, it is the mask type to use.
+     For the length-based approach, it is the appropriate vector type.  */
   tree type;
 
   /* A vector of nV controls, in iteration order.  */
@@ -478,6 +493,8 @@ struct rgroup_controls {
 
 typedef auto_vec<rgroup_controls> vec_loop_masks;
 
+typedef auto_vec<rgroup_controls> vec_loop_lens;
+
 typedef auto_vec<std::pair<data_reference*, tree> > drs_init_vec;
 
 /*-----------------------------------------------------------------*/
@@ -525,6 +542,10 @@ public:
      on inactive scalars.  */
   vec_loop_masks masks;
 
+  /* The lengths that a loop using length-based partial vectors should
+     use to avoid operating on inactive scalars.  */
+  vec_loop_lens lens;
+
   /* Set of scalar conditions that have loop mask applied.  */
   scalar_cond_masked_set_type scalar_cond_masked_set;
 
@@ -630,6 +651,10 @@ public:
   /* True if have decided to use partial vectorization for this loop.  */
   bool using_partial_vectors_p;
 
+  /* Records whether we can use partial vector approaches for the epilogue
+     of this loop; for now only the length-based approach is supported.  */
+  bool epil_using_partial_vectors_p;
+
   /* When we have grouped data accesses with gaps, we may introduce invalid
      memory accesses.  We peel the last iteration of the loop to prevent
      this.  */
@@ -693,9 +718,12 @@ public:
 #define LOOP_VINFO_VECTORIZABLE_P(L)       (L)->vectorizable
 #define LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P(L) (L)->can_use_partial_vectors_p
 #define LOOP_VINFO_USING_PARTIAL_VECTORS_P(L) (L)->using_partial_vectors_p
+#define LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P(L)                             \
+  (L)->epil_using_partial_vectors_p
 #define LOOP_VINFO_VECT_FACTOR(L)          (L)->vectorization_factor
 #define LOOP_VINFO_MAX_VECT_FACTOR(L)      (L)->max_vectorization_factor
 #define LOOP_VINFO_MASKS(L)                (L)->masks
+#define LOOP_VINFO_LENS(L)                 (L)->lens
 #define LOOP_VINFO_MASK_SKIP_NITERS(L)     (L)->mask_skip_niters
 #define LOOP_VINFO_COMPARE_TYPE(L)         (L)->compare_type
 #define LOOP_VINFO_IV_TYPE(L)              (L)->iv_type
@@ -733,6 +761,10 @@ public:
   (LOOP_VINFO_USING_PARTIAL_VECTORS_P (L)	\
    && !LOOP_VINFO_MASKS (L).is_empty ())
 
+#define LOOP_VINFO_FULLY_WITH_LENGTH_P(L)	\
+  (LOOP_VINFO_USING_PARTIAL_VECTORS_P (L)	\
+   && !LOOP_VINFO_LENS (L).is_empty ())
+
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L)	\
   ((L)->may_misalign_stmts.length () > 0)
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIAS(L)		\
@@ -1850,6 +1882,11 @@ extern void vect_record_loop_mask (loop_vec_info, vec_loop_masks *,
 				   unsigned int, tree, tree);
 extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
 				unsigned int, tree, unsigned int);
+extern void vect_record_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
+				  tree);
+extern tree vect_get_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
+			       unsigned int);
+extern gimple_seq vect_gen_len (tree, tree, tree, tree);
 extern stmt_vec_info info_for_reduction (vec_info *, stmt_vec_info);
 
 /* Drive for loop transformation stage.  */
-- 



* Re: [PATCH 1/7 V2] ifn/optabs: Support vector load/store with length
  2020-06-10  6:41   ` [PATCH 1/7 V2] " Kewen.Lin
@ 2020-06-10  9:22     ` Richard Sandiford
  2020-06-10 12:36       ` [PATCH 1/7 V3] " Kewen.Lin
  0 siblings, 1 reply; 80+ messages in thread
From: Richard Sandiford @ 2020-06-10  9:22 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Bill Schmidt, Richard Guenther, Segher Boessenkool, dje.gcc

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> @@ -2497,6 +2499,9 @@ expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>  
>    if (optab == vec_mask_load_lanes_optab)
>      icode = get_multi_vector_move (type, optab);
> +  else if (optab == lenload_optab)
> +    icode = convert_optab_handler (optab, TYPE_MODE (type),
> +				   targetm.vectorize.length_mode);
>    else
>      icode = convert_optab_handler (optab, TYPE_MODE (type),
>  				   TYPE_MODE (TREE_TYPE (maskt)));

I think lenload_optab should just be a direct optab, based only on
the vector mode.  It seems unlikely that targets would provide the
“same” load with different length modes.
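
Something along these lines should then be enough (untested sketch,
just keying the icode lookup off the vector mode):

  else if (optab == lenload_optab)
    icode = direct_optab_handler (optab, TYPE_MODE (type));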

> @@ -2507,15 +2512,20 @@ expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>    target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
>    create_output_operand (&ops[0], target, TYPE_MODE (type));
>    create_fixed_operand (&ops[1], mem);
> -  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
> +  if (optab == lenload_optab)
> +    create_convert_operand_from (&ops[2], mask, targetm.vectorize.length_mode,
> +				 TYPE_UNSIGNED (TREE_TYPE (maskt)));

The mode argument should be TYPE_MODE (TREE_TYPE (maskt)) -- i.e. the
arguments should specify the precision and signedness of the existing rtx.
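
I.e. something like (sketch only):

  create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
			       TYPE_UNSIGNED (TREE_TYPE (maskt)));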

Hopefully this means that we don't need the target hook at all.

Thanks,
Richard


* [PATCH 1/7 V3] ifn/optabs: Support vector load/store with length
  2020-06-10  9:22     ` Richard Sandiford
@ 2020-06-10 12:36       ` Kewen.Lin
  2020-06-22  8:51         ` [PATCH 1/7 V4] " Kewen.Lin
  0 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-06-10 12:36 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford
  Cc: Bill Schmidt, Richard Guenther, Segher Boessenkool, dje.gcc

[-- Attachment #1: Type: text/plain, Size: 2482 bytes --]

on 2020/6/10 5:22 PM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> @@ -2497,6 +2499,9 @@ expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>>  
>>    if (optab == vec_mask_load_lanes_optab)
>>      icode = get_multi_vector_move (type, optab);
>> +  else if (optab == lenload_optab)
>> +    icode = convert_optab_handler (optab, TYPE_MODE (type),
>> +				   targetm.vectorize.length_mode);
>>    else
>>      icode = convert_optab_handler (optab, TYPE_MODE (type),
>>  				   TYPE_MODE (TREE_TYPE (maskt)));
> 
> I think lenload_optab should just be a direct optab, based only on
> the vector mode.  It seems unlikely that targets would provide the
> “same” load with different length modes.

Good point!  Yes, targets unlikely have this need.

> 
>> @@ -2507,15 +2512,20 @@ expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>>    target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
>>    create_output_operand (&ops[0], target, TYPE_MODE (type));
>>    create_fixed_operand (&ops[1], mem);
>> -  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
>> +  if (optab == lenload_optab)
>> +    create_convert_operand_from (&ops[2], mask, targetm.vectorize.length_mode,
>> +				 TYPE_UNSIGNED (TREE_TYPE (maskt)));
> 
> The mode argument should be TYPE_MODE (TREE_TYPE (maskt)) -- i.e. the
> arguments should specify the precision and signedness of the existing rtx.
> 

Thanks for correcting this.  I found I misunderstood its usage.

> Hopefully this means that we don't need the target hook at all.
> 

New version v3 attached to get rid of length mode hook.

BR,
Kewen

---
gcc/ChangeLog:

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* doc/md.texi (lenload@var{m}): Document.
	(lenstore@var{m}): Likewise.
	* internal-fn.c (len_load_direct): New macro.
	(len_store_direct): Likewise.
	(expand_len_load_optab_fn): Likewise.
	(expand_len_store_optab_fn): Likewise.
	(direct_len_load_optab_supported_p): Likewise.
	(direct_len_store_optab_supported_p): Likewise.
	(expand_mask_load_optab_fn): Add handling for lenload_optab.
	(expand_mask_store_optab_fn): Add handling for lenstore_optab.
	(internal_load_fn_p): Handle IFN_LEN_LOAD.
	(internal_store_fn_p): Handle IFN_LEN_STORE.
	(internal_fn_stored_value_index): Handle IFN_LEN_STORE.
	* internal-fn.def (LEN_LOAD): New internal function.
	(LEN_STORE): Likewise.
	* optabs.def (lenload_optab, lenstore_optab): New optab.



[-- Attachment #2: lenload_ifn_v3.patch --]
[-- Type: text/plain, Size: 8756 bytes --]

---
 gcc/doc/md.texi     | 18 ++++++++++++++++++
 gcc/internal-fn.c   | 29 +++++++++++++++++++++++++----
 gcc/internal-fn.def |  6 ++++++
 gcc/optabs.def      |  2 ++
 4 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 2c67c818da5..dd0d3ec203b 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5167,6 +5167,24 @@ mode @var{n}.
 
 This pattern is not allowed to @code{FAIL}.
 
+@cindex @code{lenload@var{m}} instruction pattern
+@item @samp{lenload@var{m}}
+Perform a vector load with length from memory operand 1 of mode @var{m}
+into register operand 0.  The length is provided in register operand 2;
+its mode should provide the maximal required precision for any available
+length.
+
+This pattern is not allowed to @code{FAIL}.
+
+@cindex @code{lenstore@var{m}} instruction pattern
+@item @samp{lenstore@var{m}}
+Perform a vector store with length from register operand 1 of mode @var{m}
+into memory operand 0.  The length is provided in register operand 2;
+its mode should provide the maximal required precision for any available
+length.
+
+This pattern is not allowed to @code{FAIL}.
+
 @cindex @code{vec_perm@var{m}} instruction pattern
 @item @samp{vec_perm@var{m}}
 Output a (variable) vector permutation.  Operand 0 is the destination
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 5e9aa60721e..e85df5cbd92 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -104,10 +104,12 @@ init_internal_fns ()
 #define load_lanes_direct { -1, -1, false }
 #define mask_load_lanes_direct { -1, -1, false }
 #define gather_load_direct { 3, 1, false }
+#define len_load_direct { -1, 2, false }
 #define mask_store_direct { 3, 2, false }
 #define store_lanes_direct { 0, 0, false }
 #define mask_store_lanes_direct { 0, 0, false }
 #define scatter_store_direct { 3, 1, false }
+#define len_store_direct { 3, 2, false }
 #define unary_direct { 0, 0, true }
 #define binary_direct { 0, 0, true }
 #define ternary_direct { 0, 0, true }
@@ -2478,7 +2480,7 @@ expand_call_mem_ref (tree type, gcall *stmt, int index)
   return fold_build2 (MEM_REF, type, addr, build_int_cst (alias_ptr_type, 0));
 }
 
-/* Expand MASK_LOAD{,_LANES} call STMT using optab OPTAB.  */
+/* Expand MASK_LOAD{,_LANES} and LEN_LOAD call STMT using optab OPTAB.  */
 
 static void
 expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
@@ -2497,6 +2499,8 @@ expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 
   if (optab == vec_mask_load_lanes_optab)
     icode = get_multi_vector_move (type, optab);
+  else if (optab == lenload_optab)
+    icode = direct_optab_handler (optab, TYPE_MODE (type));
   else
     icode = convert_optab_handler (optab, TYPE_MODE (type),
 				   TYPE_MODE (TREE_TYPE (maskt)));
@@ -2507,15 +2511,20 @@ expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
   target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   create_output_operand (&ops[0], target, TYPE_MODE (type));
   create_fixed_operand (&ops[1], mem);
-  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
+  if (optab == lenload_optab)
+    create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
+				 TYPE_UNSIGNED (TREE_TYPE (maskt)));
+  else
+    create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
   expand_insn (icode, 3, ops);
   if (!rtx_equal_p (target, ops[0].value))
     emit_move_insn (target, ops[0].value);
 }
 
 #define expand_mask_load_lanes_optab_fn expand_mask_load_optab_fn
+#define expand_len_load_optab_fn expand_mask_load_optab_fn
 
-/* Expand MASK_STORE{,_LANES} call STMT using optab OPTAB.  */
+/* Expand MASK_STORE{,_LANES} and LEN_STORE call STMT using optab OPTAB.  */
 
 static void
 expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
@@ -2532,6 +2541,8 @@ expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 
   if (optab == vec_mask_store_lanes_optab)
     icode = get_multi_vector_move (type, optab);
+  else if (optab == lenstore_optab)
+    icode = direct_optab_handler (optab, TYPE_MODE (type));
   else
     icode = convert_optab_handler (optab, TYPE_MODE (type),
 				   TYPE_MODE (TREE_TYPE (maskt)));
@@ -2542,11 +2553,16 @@ expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
   reg = expand_normal (rhs);
   create_fixed_operand (&ops[0], mem);
   create_input_operand (&ops[1], reg, TYPE_MODE (type));
-  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
+  if (optab == lenstore_optab)
+    create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
+				 TYPE_UNSIGNED (TREE_TYPE (maskt)));
+  else
+    create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
   expand_insn (icode, 3, ops);
 }
 
 #define expand_mask_store_lanes_optab_fn expand_mask_store_optab_fn
+#define expand_len_store_optab_fn expand_mask_store_optab_fn
 
 static void
 expand_ABNORMAL_DISPATCHER (internal_fn, gcall *)
@@ -3128,10 +3144,12 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
 #define direct_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_gather_load_optab_supported_p convert_optab_supported_p
+#define direct_len_load_optab_supported_p direct_optab_supported_p
 #define direct_mask_store_optab_supported_p direct_optab_supported_p
 #define direct_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_scatter_store_optab_supported_p convert_optab_supported_p
+#define direct_len_store_optab_supported_p direct_optab_supported_p
 #define direct_while_optab_supported_p convert_optab_supported_p
 #define direct_fold_extract_optab_supported_p direct_optab_supported_p
 #define direct_fold_left_optab_supported_p direct_optab_supported_p
@@ -3498,6 +3516,7 @@ internal_load_fn_p (internal_fn fn)
     case IFN_MASK_LOAD_LANES:
     case IFN_GATHER_LOAD:
     case IFN_MASK_GATHER_LOAD:
+    case IFN_LEN_LOAD:
       return true;
 
     default:
@@ -3517,6 +3536,7 @@ internal_store_fn_p (internal_fn fn)
     case IFN_MASK_STORE_LANES:
     case IFN_SCATTER_STORE:
     case IFN_MASK_SCATTER_STORE:
+    case IFN_LEN_STORE:
       return true;
 
     default:
@@ -3577,6 +3597,7 @@ internal_fn_stored_value_index (internal_fn fn)
     case IFN_MASK_STORE:
     case IFN_SCATTER_STORE:
     case IFN_MASK_SCATTER_STORE:
+    case IFN_LEN_STORE:
       return 3;
 
     default:
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 1d190d492ff..ed6561f296a 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -49,11 +49,13 @@ along with GCC; see the file COPYING3.  If not see
    - load_lanes: currently just vec_load_lanes
    - mask_load_lanes: currently just vec_mask_load_lanes
    - gather_load: used for {mask_,}gather_load
+   - len_load: currently just lenload
 
    - mask_store: currently just maskstore
    - store_lanes: currently just vec_store_lanes
    - mask_store_lanes: currently just vec_mask_store_lanes
    - scatter_store: used for {mask_,}scatter_store
+   - len_store: currently just lenstore
 
    - unary: a normal unary optab, such as vec_reverse_<mode>
    - binary: a normal binary optab, such as vec_interleave_lo_<mode>
@@ -127,6 +129,8 @@ DEF_INTERNAL_OPTAB_FN (GATHER_LOAD, ECF_PURE, gather_load, gather_load)
 DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
 		       mask_gather_load, gather_load)
 
+DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, lenload, len_load)
+
 DEF_INTERNAL_OPTAB_FN (SCATTER_STORE, 0, scatter_store, scatter_store)
 DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
 		       mask_scatter_store, scatter_store)
@@ -136,6 +140,8 @@ DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
 DEF_INTERNAL_OPTAB_FN (MASK_STORE_LANES, 0,
 		       vec_mask_store_lanes, mask_store_lanes)
 
+DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, lenstore, len_store)
+
 DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
 DEF_INTERNAL_OPTAB_FN (CHECK_RAW_PTRS, ECF_CONST | ECF_NOTHROW,
 		       check_raw_ptrs, check_ptrs)
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 0c64eb52a8d..9fe4ac1840d 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -435,3 +435,5 @@ OPTAB_D (check_war_ptrs_optab, "check_war_ptrs$a")
 OPTAB_DC (vec_duplicate_optab, "vec_duplicate$a", VEC_DUPLICATE)
 OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
 OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
+OPTAB_D (lenload_optab, "lenload$a")
+OPTAB_D (lenstore_optab, "lenstore$a")
-- 



* [PATCH 2/7 V3] rs6000: lenload/lenstore optab support
  2020-06-10  6:43   ` [PATCH 2/7 V2] " Kewen.Lin
@ 2020-06-10 12:39     ` Kewen.Lin
  2020-06-11 22:55       ` Segher Boessenkool
  2020-06-23  3:58       ` [PATCH 2/7 v4] " Kewen.Lin
  0 siblings, 2 replies; 80+ messages in thread
From: Kewen.Lin @ 2020-06-10 12:39 UTC (permalink / raw)
  To: GCC Patches; +Cc: Bill Schmidt, dje.gcc, Segher Boessenkool

[-- Attachment #1: Type: text/plain, Size: 199 bytes --]

V3: Update the define_expands as per the optab changes.

gcc/ChangeLog:

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* config/rs6000/vsx.md (lenload<mode>): New define_expand.
	(lenstore<mode>): Likewise.



[-- Attachment #2: rs6000_v3.patch --]
[-- Type: text/plain, Size: 1517 bytes --]


---
 gcc/config/rs6000/vsx.md | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 2a28215ac5b..349da294877 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -5082,6 +5082,38 @@ (define_expand "stxvl"
   operands[3] = gen_reg_rtx (DImode);
 })
 
+;; Define optab for vector access with length vectorization exploitation.
+(define_expand "lenload<mode>"
+  [(match_operand:VEC_A 0 "vlogical_operand")
+   (match_operand:VEC_A 1 "memory_operand")
+   (match_operand:QI 2 "int_reg_operand")]
+  "TARGET_P9_VECTOR && TARGET_64BIT"
+{
+  rtx mem = XEXP (operands[1], 0);
+  mem = force_reg (DImode, mem);
+  rtx len = gen_lowpart (DImode, operands[2]);
+  rtx res = gen_reg_rtx (V16QImode);
+  emit_insn (gen_lxvl (res, mem, len));
+  emit_move_insn (operands[0], gen_lowpart (<MODE>mode, res));
+  DONE;
+})
+
+(define_expand "lenstore<mode>"
+  [(match_operand:VEC_A 0 "memory_operand")
+   (match_operand:VEC_A 1 "vlogical_operand")
+   (match_operand:QI 2 "int_reg_operand")
+  ]
+  "TARGET_P9_VECTOR && TARGET_64BIT"
+{
+  rtx val = gen_reg_rtx (V16QImode);
+  emit_move_insn (val, gen_lowpart (V16QImode, operands[1]));
+  rtx mem = XEXP (operands[0], 0);
+  mem = force_reg (DImode, mem);
+  rtx len = gen_lowpart (DImode, operands[2]);
+  emit_insn (gen_stxvl (val, mem, len));
+  DONE;
+})
+
 (define_insn "*stxvl"
   [(set (mem:V16QI (match_operand:DI 1 "gpc_reg_operand" "b"))
 	(unspec:V16QI
-- 


* Re: [PATCH 2/7 V3] rs6000: lenload/lenstore optab support
  2020-06-10 12:39     ` [PATCH 2/7 V3] " Kewen.Lin
@ 2020-06-11 22:55       ` Segher Boessenkool
  2020-06-12  3:02         ` Kewen.Lin
  2020-06-23  3:58       ` [PATCH 2/7 v4] " Kewen.Lin
  1 sibling, 1 reply; 80+ messages in thread
From: Segher Boessenkool @ 2020-06-11 22:55 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Patches, Bill Schmidt, dje.gcc

Hi!

On Wed, Jun 10, 2020 at 08:39:19PM +0800, Kewen.Lin wrote:
> +;; Define optab for vector access with length vectorization exploitation.
> +(define_expand "lenload<mode>"
> +  [(match_operand:VEC_A 0 "vlogical_operand")
> +   (match_operand:VEC_A 1 "memory_operand")
> +   (match_operand:QI 2 "int_reg_operand")]

Why this?  gpc_reg_operand will just work, no?  (Even just
register_operand should, but let's not go there today ;-) )
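
I.e. just something like (sketch):

   (match_operand:QI 2 "gpc_reg_operand")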

Okay for trunk with that change, or with some explanation.  Thanks!


Segher


* Re: [PATCH 2/7 V3] rs6000: lenload/lenstore optab support
  2020-06-11 22:55       ` Segher Boessenkool
@ 2020-06-12  3:02         ` Kewen.Lin
  0 siblings, 0 replies; 80+ messages in thread
From: Kewen.Lin @ 2020-06-12  3:02 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: GCC Patches, Bill Schmidt, dje.gcc

Hi Segher,

on 2020/6/12 6:55 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Wed, Jun 10, 2020 at 08:39:19PM +0800, Kewen.Lin wrote:
>> +;; Define optab for vector access with length vectorization exploitation.
>> +(define_expand "lenload<mode>"
>> +  [(match_operand:VEC_A 0 "vlogical_operand")
>> +   (match_operand:VEC_A 1 "memory_operand")
>> +   (match_operand:QI 2 "int_reg_operand")]
> 
> Why this?  gpc_reg_operand will just work, no?  (Even just
> register_operand should, but let's not go there today ;-) )
> 

Good question!  The existing lxvl requires register_operand;
yeah, gpc_reg_operand looks fine too.  I was thinking this
operand for length would only ever live in a GPR, so
int_reg_operand looks more reasonable here?

> Okay for trunk with that change, or with some explanation.  Thanks!
> 

Thanks!

BR,
Kewen

> 
> Segher
> 


* [PATCH 5/7 v5] vect: Support vector load/store with length in vectorizer
  2020-06-10  9:19                   ` [PATCH 5/7 v4] " Kewen.Lin
@ 2020-06-22  8:33                     ` Kewen.Lin
  2020-06-29  6:33                       ` [PATCH 5/7 v6] " Kewen.Lin
  0 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-06-22  8:33 UTC (permalink / raw)
  To: GCC Patches
  Cc: Richard Guenther, Bill Schmidt, dje.gcc, Segher Boessenkool,
	Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 3110 bytes --]

Hi,

v5 changes against v4:
  - Updated the conditions for clearing LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P
    in vectorizable_condition (which fixed the aarch64 regression failure).
  - Rebased and updated some macro and function names as per the
    renaming/refactoring patch.
  - Updated some comments and dump messages.

v4 changes against v3:
  - split out some renaming and refactoring.
  - use QImode for length.
  - update the iv type determination.
  - introduce factor into rgroup_controls.
  - use using_partial_vectors_p for both approaches.

Bootstrapped/regtested on powerpc64le-linux-gnu P9 and no remarkable
failures found even with explicit vect-with-length-scope settings 1/2
(only some trivial test case issues).
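
For reference, the new behaviour can also be exercised explicitly with a
command line along these lines (the source file name is just a placeholder):

  gcc -O2 -mcpu=power9 -ftree-vectorize \
      --param vect-with-length-scope=2 -fdump-tree-vect-details test.c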

Also bootstrapped/regtested on aarch64-linux-gnu.

Is it ok for trunk?

BR,
Kewen
----
gcc/ChangeLog

	* doc/invoke.texi (vect-with-length-scope): Document new option.
	* params.opt (vect-with-length-scope): New.
	* tree-vect-loop-manip.c (vect_set_loop_controls_directly): Add
	handling for vectorization using length-based partial vectors; call
	vect_gen_len for length generation.
	(vect_set_loop_condition_partial_vectors): Add handling for
	vectorization using length-based partial vectors.
	(vect_do_peeling): Allow remaining eiters less than epilogue vf for
	LOOP_VINFO_USING_PARTIAL_VECTORS_P.
	* tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Init
	epil_using_partial_vectors_p.
	(_loop_vec_info::~_loop_vec_info): Call release_vec_loop_controls
	to release the lengths.
	(vect_verify_loop_lens): New function.
	(vect_analyze_loop_2): Add a check to allow only one partial
	vectorization approach at the same time.  Check loop-wide reasons
	for the decision to use length-based partial vectors.  Mark
	LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P if the epilogue is a
	candidate for the length-based approach.  Call
	release_vec_loop_controls to release the lengths.
	(vect_analyze_loop): Add handling for the loop epilogue when it's
	marked to use vectorization with partial vectors.
	(vect_estimate_min_profitable_iters): Adjust for loop vectorization
	using length-based partial vectors.
	(vect_record_loop_mask): Init factor to 1 for vectorization using
	mask-based partial vectors.
	(vect_record_loop_len): New function.
	(vect_get_loop_len): New function.
	* tree-vect-stmts.c (check_load_store_for_partial_vectors): Add
	checks for vectorization using length-based partial vectors.
	(vectorizable_store): Add handling when using length-based partial
	vectors.
	(vectorizable_load): Likewise.
	(vectorizable_condition): Add some checks to disable vectorization
	using partial vectors for reduction.
	(vect_gen_len): New function.
	* tree-vectorizer.h (struct rgroup_controls): Add field factor
	mainly for length-based partial vectors.
	(vec_loop_lens): New typedef.
	(_loop_vec_info): Add lens and epil_using_partial_vectors_p.
	(LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P): New macro.
	(LOOP_VINFO_LENS): Likewise.
	(LOOP_VINFO_FULLY_WITH_LENGTH_P): Likewise.
	(vect_record_loop_len): New declare.
	(vect_get_loop_len): Likewise.
	(vect_gen_len): Likewise.

[-- Attachment #2: vect_with_length_v5.diff --]
[-- Type: text/plain, Size: 37134 bytes --]

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 06a04e3d7dd..284c15705ea 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -13389,6 +13389,13 @@ by the copy loop headers pass.
 @item vect-epilogues-nomask
 Enable loop epilogue vectorization using smaller vector size.
 
+@item vect-with-length-scope
+Control the scope of vector memory access with length exploitation.  0 means we
+don't exploit any vector memory access with length, 1 means we only exploit
+vector memory access with length for those loops whose iteration count is
+less than VF, such as very small loops or epilogues, and 2 means we want to
+exploit vector memory access with length for any loop if possible.
+
 @item slp-max-insns-in-bb
 Maximum number of instructions in basic block to be
 considered for SLP vectorization.
diff --git a/gcc/params.opt b/gcc/params.opt
index 9b564bb046c..daa6e8a2beb 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -968,4 +968,8 @@ Bound on number of runtime checks inserted by the vectorizer's loop versioning f
 Common Joined UInteger Var(param_vect_max_version_for_alignment_checks) Init(6) Param Optimization
 Bound on number of runtime checks inserted by the vectorizer's loop versioning for alignment check.
 
+-param=vect-with-length-scope=
+Common Joined UInteger Var(param_vect_with_length_scope) Init(0) IntegerRange(0, 2) Param Optimization
+Control the vector with length exploitation scope.
+
 ; This comment is to ensure we retain the blank line above.
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 458a6675c47..bc88dde7079 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -399,19 +399,20 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
 
    It is known that:
 
-     NITERS * RGC->max_nscalars_per_iter
+     NITERS * RGC->max_nscalars_per_iter * RGC->factor
 
    does not overflow.  However, MIGHT_WRAP_P says whether an induction
    variable that starts at 0 and has step:
 
-     VF * RGC->max_nscalars_per_iter
+     VF * RGC->max_nscalars_per_iter * RGC->factor
 
    might overflow before hitting a value above:
 
-     (NITERS + NITERS_SKIP) * RGC->max_nscalars_per_iter
+     (NITERS + NITERS_SKIP) * RGC->max_nscalars_per_iter * RGC->factor
 
    This means that we cannot guarantee that such an induction variable
-   would ever hit a value that produces a set of all-false masks for RGC.  */
+   would ever hit a value that produces a set of all-false masks or zero
+   lengths for RGC.  */
 
 static tree
 vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
@@ -422,10 +423,20 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 {
   tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
   tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
+  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
+
   tree ctrl_type = rgc->type;
-  unsigned int nscalars_per_iter = rgc->max_nscalars_per_iter;
+  /* Scale up nscalars per iteration with factor.  */
+  unsigned int nscalars_per_iter_ft = rgc->max_nscalars_per_iter * rgc->factor;
   poly_uint64 nscalars_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type);
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  tree length_limit = NULL_TREE;
+  /* For length, we need length_limit to check the length is in range.  */
+  if (!vect_for_masking)
+    {
+      poly_uint64 len_limit = nscalars_per_ctrl * rgc->factor;
+      length_limit = build_int_cst (compare_type, len_limit);
+    }
 
   /* Calculate the maximum number of scalar values that the rgroup
      handles in total, the number that it handles for each iteration
@@ -434,12 +445,12 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
   tree nscalars_total = niters;
   tree nscalars_step = build_int_cst (iv_type, vf);
   tree nscalars_skip = niters_skip;
-  if (nscalars_per_iter != 1)
+  if (nscalars_per_iter_ft != 1)
     {
       /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
 	 these multiplications don't overflow.  */
-      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
-      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
+      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter_ft);
+      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter_ft);
       nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
 				     nscalars_total, compare_factor);
       nscalars_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
@@ -509,7 +520,7 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 	     NSCALARS_SKIP to that cannot overflow.  */
 	  tree const_limit = build_int_cst (compare_type,
 					    LOOP_VINFO_VECT_FACTOR (loop_vinfo)
-					    * nscalars_per_iter);
+					    * nscalars_per_iter_ft);
 	  first_limit = gimple_build (preheader_seq, MIN_EXPR, compare_type,
 				      nscalars_total, const_limit);
 	  first_limit = gimple_build (preheader_seq, PLUS_EXPR, compare_type,
@@ -549,16 +560,16 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
     {
       /* Previous controls will cover BIAS scalars.  This control covers the
 	 next batch.  */
-      poly_uint64 bias = nscalars_per_ctrl * i;
+      poly_uint64 batch_nscalars_ft = nscalars_per_ctrl * rgc->factor;
+      poly_uint64 bias = batch_nscalars_ft * i;
       tree bias_tree = build_int_cst (compare_type, bias);
-      gimple *tmp_stmt;
 
       /* See whether the first iteration of the vector loop is known
 	 to have a full control.  */
       poly_uint64 const_limit;
       bool first_iteration_full
 	= (poly_int_tree_p (first_limit, &const_limit)
-	   && known_ge (const_limit, (i + 1) * nscalars_per_ctrl));
+	   && known_ge (const_limit, (i + 1) * batch_nscalars_ft));
 
       /* Rather than have a new IV that starts at BIAS and goes up to
 	 TEST_LIMIT, prefer to use the same 0-based IV for each control
@@ -598,9 +609,19 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 	      end = first_limit;
 	    }
 
-	  init_ctrl = make_temp_ssa_name (ctrl_type, NULL, "max_mask");
-	  tmp_stmt = vect_gen_while (init_ctrl, start, end);
-	  gimple_seq_add_stmt (preheader_seq, tmp_stmt);
+	  if (vect_for_masking)
+	    {
+	      init_ctrl = make_temp_ssa_name (ctrl_type, NULL, "max_mask");
+	      gimple *tmp_stmt = vect_gen_while (init_ctrl, start, end);
+	      gimple_seq_add_stmt (preheader_seq, tmp_stmt);
+	    }
+	  else
+	    {
+	      init_ctrl = make_temp_ssa_name (compare_type, NULL, "max_len");
+	      gimple_seq seq = vect_gen_len (init_ctrl, start,
+					     end, length_limit);
+	      gimple_seq_add_seq (preheader_seq, seq);
+	    }
 	}
 
       /* Now AND out the bits that are within the number of skipped
@@ -617,16 +638,32 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 				      init_ctrl, unskipped_mask);
 	  else
 	    init_ctrl = unskipped_mask;
+	  gcc_assert (vect_for_masking);
 	}
 
+      /* First iteration is full.  */
       if (!init_ctrl)
-	/* First iteration is full.  */
-	init_ctrl = build_minus_one_cst (ctrl_type);
+	{
+	  if (vect_for_masking)
+	    init_ctrl = build_minus_one_cst (ctrl_type);
+	  else
+	    init_ctrl = length_limit;
+	}
 
       /* Get the control value for the next iteration of the loop.  */
-      next_ctrl = make_temp_ssa_name (ctrl_type, NULL, "next_mask");
-      gcall *call = vect_gen_while (next_ctrl, test_index, this_test_limit);
-      gsi_insert_before (test_gsi, call, GSI_SAME_STMT);
+      if (vect_for_masking)
+	{
+	  next_ctrl = make_temp_ssa_name (ctrl_type, NULL, "next_mask");
+	  gcall *call = vect_gen_while (next_ctrl, test_index, this_test_limit);
+	  gsi_insert_before (test_gsi, call, GSI_SAME_STMT);
+	}
+      else
+	{
+	  next_ctrl = make_temp_ssa_name (compare_type, NULL, "next_len");
+	  gimple_seq seq = vect_gen_len (next_ctrl, test_index, this_test_limit,
+					 length_limit);
+	  gsi_insert_seq_before (test_gsi, seq, GSI_SAME_STMT);
+	}
 
       vect_set_loop_control (loop, ctrl, init_ctrl, next_ctrl);
     }
@@ -652,6 +689,7 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
   gimple_seq preheader_seq = NULL;
   gimple_seq header_seq = NULL;
 
+  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
   tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
   unsigned int compare_precision = TYPE_PRECISION (compare_type);
   tree orig_niters = niters;
@@ -686,28 +724,30 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
   tree test_ctrl = NULL_TREE;
   rgroup_controls *rgc;
   unsigned int i;
-  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
-  FOR_EACH_VEC_ELT (*masks, i, rgc)
+  auto_vec<rgroup_controls> *controls = vect_for_masking
+					  ? &LOOP_VINFO_MASKS (loop_vinfo)
+					  : &LOOP_VINFO_LENS (loop_vinfo);
+  FOR_EACH_VEC_ELT (*controls, i, rgc)
     if (!rgc->controls.is_empty ())
       {
 	/* First try using permutes.  This adds a single vector
 	   instruction to the loop for each mask, but needs no extra
 	   loop invariants or IVs.  */
 	unsigned int nmasks = i + 1;
-	if ((nmasks & 1) == 0)
+	if (vect_for_masking && (nmasks & 1) == 0)
 	  {
-	    rgroup_controls *half_rgc = &(*masks)[nmasks / 2 - 1];
+	    rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
 	    if (!half_rgc->controls.is_empty ()
 		&& vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc))
 	      continue;
 	  }
 
 	/* See whether zero-based IV would ever generate all-false masks
-	   before wrapping around.  */
+	   or zero length before wrapping around.  */
+	unsigned nscalars_ft = rgc->max_nscalars_per_iter * rgc->factor;
 	bool might_wrap_p
 	  = (iv_limit == -1
-	     || (wi::min_precision (iv_limit * rgc->max_nscalars_per_iter,
-				    UNSIGNED)
+	     || (wi::min_precision (iv_limit * nscalars_ft, UNSIGNED)
 		 > compare_precision));
 
 	/* Set up all controls for this group.  */
@@ -2568,7 +2608,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   if (vect_epilogues
       && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
       && prolog_peeling >= 0
-      && known_eq (vf, lowest_vf))
+      && known_eq (vf, lowest_vf)
+      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (epilogue_vinfo))
     {
       unsigned HOST_WIDE_INT eiters
 	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 6311e795204..58b1860b8a5 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -816,6 +816,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     vectorizable (false),
     can_use_partial_vectors_p (true),
     using_partial_vectors_p (false),
+    epil_using_partial_vectors_p (false),
     peeling_for_gaps (false),
     peeling_for_niter (false),
     no_data_dependencies (false),
@@ -898,6 +899,7 @@ _loop_vec_info::~_loop_vec_info ()
   free (bbs);
 
   release_vec_loop_controls (&masks);
+  release_vec_loop_controls (&lens);
   delete ivexpr_map;
   delete scan_map;
   epilogue_vinfos.release ();
@@ -1072,6 +1074,88 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
   return true;
 }
 
+/* Check whether we can use vector access with length based on precision
+   comparison.  So far, to keep it simple, we only allow the case that the
+   precision of the target supported length is larger than the precision
+   required by loop niters.  */
+
+static bool
+vect_verify_loop_lens (loop_vec_info loop_vinfo)
+{
+  vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
+
+  if (LOOP_VINFO_LENS (loop_vinfo).is_empty ())
+    return false;
+
+  /* The rgroup with the largest NV should have the max bytes per iter.  */
+  rgroup_controls *rgl = &(*lens)[lens->length () - 1];
+
+  /* Work out how many bits we need to represent the length limit.  */
+  unsigned int nscalars_per_iter_ft = rgl->max_nscalars_per_iter * rgl->factor;
+  unsigned int min_ni_prec
+    = vect_min_prec_for_max_niters (loop_vinfo, nscalars_per_iter_ft);
+
+  /* Now use the maximum of below precisions for one suitable IV type:
+     - the IV's natural precision
+     - the precision needed to hold: the maximum number of scalar
+       iterations multiplied by the scale factor (min_ni_prec above)
+     - the Pmode precision
+  */
+
+  /* If min_ni_prec is less than the precision of the current niters,
+     we prefer to still use the niters type.  */
+  unsigned int ni_prec
+    = TYPE_PRECISION (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)));
+  /* Prefer to use Pmode and wider IV to avoid narrow conversions.  */
+  unsigned int pmode_prec = GET_MODE_BITSIZE (Pmode);
+
+  unsigned int required_prec = ni_prec;
+  if (required_prec < pmode_prec)
+    required_prec = pmode_prec;
+
+  tree iv_type = NULL_TREE;
+  if (min_ni_prec > required_prec)
+    {
+      opt_scalar_int_mode tmode_iter;
+      unsigned standard_bits = 0;
+      FOR_EACH_MODE_IN_CLASS (tmode_iter, MODE_INT)
+      {
+	scalar_mode tmode = tmode_iter.require ();
+	unsigned int tbits = GET_MODE_BITSIZE (tmode);
+
+	/* ??? Do we really want to construct one IV whose precision exceeds
+	   BITS_PER_WORD?  */
+	if (tbits > BITS_PER_WORD)
+	  break;
+
+	/* Find the first available standard integral type.  */
+	if (tbits >= min_ni_prec && targetm.scalar_mode_supported_p (tmode))
+	  {
+	    standard_bits = tbits;
+	    break;
+	  }
+      }
+      if (standard_bits != 0)
+	iv_type = build_nonstandard_integer_type (standard_bits, true);
+    }
+  else
+    iv_type = build_nonstandard_integer_type (required_prec, true);
+
+  if (!iv_type)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't vectorize with length-based partial vectors"
+			 " due to no suitable iv type.\n");
+      return false;
+    }
+
+  LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = iv_type;
+  LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
+
+  return true;
+}
+
 /* Calculate the cost of one scalar iteration of the loop.  */
 static void
 vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo)
@@ -2170,11 +2254,64 @@ start_over:
       return ok;
     }
 
-  /* Decide whether to use a fully-masked loop for this vectorization
-     factor.  */
-  LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
-    = (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
-       && vect_verify_full_masking (loop_vinfo));
+  /* For now, we don't expect to mix the masking and length approaches for
+     one loop, so disable the use of partial vectors if both are recorded.  */
+  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+      && !LOOP_VINFO_MASKS (loop_vinfo).is_empty ()
+      && !LOOP_VINFO_LENS (loop_vinfo).is_empty ())
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't vectorize a loop with partial vectors"
+			 " because we don't expect to mix different"
+			 " approaches with partial vectors for the"
+			 " same loop.\n");
+      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+    }
+
+  /* Decide whether to vectorize a loop with partial vectors for
+     this vectorization factor.  */
+  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+    {
+      /* Decide whether to use fully-masked approach.  */
+      if (vect_verify_full_masking (loop_vinfo))
+	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
+      /* Decide whether to use length-based approach.  */
+      else if (vect_verify_loop_lens (loop_vinfo))
+	{
+	  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+	      || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				 "can't vectorize this loop with length-based"
+				 " partial vectors approach because peeling"
+				 " for alignment or gaps is required.\n");
+	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	    }
+	  else if (param_vect_with_length_scope == 0)
+	    LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	  /* The epilogue and other cases with known niters less than VF
+	     can still use vector access with length fully.  */
+	  else if (param_vect_with_length_scope == 1
+		   && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+		   && !vect_known_niters_smaller_than_vf (loop_vinfo))
+	    {
+	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	      LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
+	    }
+	  else
+	    {
+	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
+	      LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	    }
+	}
+      else
+	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+    }
+  else
+    LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+
   if (dump_enabled_p ())
     {
       if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
@@ -2183,6 +2320,15 @@ start_over:
       else
 	dump_printf_loc (MSG_NOTE, vect_location,
 			 "not using a fully-masked loop.\n");
+
+      if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "fully using length-based partial"
+			 " vectors for this loop.\n");
+      else
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "not fully using length-based partial"
+			 " vectors for this loop.\n");
     }
 
   /* If epilog loop is required because of data accesses with gaps,
@@ -2406,6 +2552,7 @@ again:
     = init_cost (LOOP_VINFO_LOOP (loop_vinfo));
   /* Reset accumulated rgroup information.  */
   release_vec_loop_controls (&LOOP_VINFO_MASKS (loop_vinfo));
+  release_vec_loop_controls (&LOOP_VINFO_LENS (loop_vinfo));
   /* Reset assorted flags.  */
   LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
   LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) = false;
@@ -2692,7 +2839,10 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 		lowest_th = ordered_min (lowest_th, th);
 	    }
 	  else
-	    delete loop_vinfo;
+	    {
+	      delete loop_vinfo;
+	      loop_vinfo = opt_loop_vec_info::success (NULL);
+	    }
 
 	  /* Only vectorize epilogues if PARAM_VECT_EPILOGUES_NOMASK is
 	     enabled, SIMDUID is not set, it is the innermost loop and we have
@@ -2717,6 +2867,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
       else
 	{
 	  delete loop_vinfo;
+	  loop_vinfo = opt_loop_vec_info::success (NULL);
 	  if (fatal)
 	    {
 	      gcc_checking_assert (first_loop_vinfo == NULL);
@@ -2724,6 +2875,23 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 	    }
 	}
 
+      /* Handle the case in which the original loop can use partial
+	 vectorization, but we only want to adopt it for the epilogue.
+	 The retry should use the same vector mode as the original loop.  */
+      if (vect_epilogues
+	  && loop_vinfo
+	  && LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo))
+	{
+	  gcc_assert (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+		      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo));
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "***** Re-trying analysis with same vector mode"
+			     " %s for epilogue with partial vectors.\n",
+			     GET_MODE_NAME (loop_vinfo->vector_mode));
+	  continue;
+	}
+
       if (mode_i < vector_modes.length ()
 	  && VECTOR_MODE_P (autodetected_vector_mode)
 	  && (related_vector_mode (vector_modes[mode_i],
@@ -3564,6 +3732,11 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 			    target_cost_data, num_masks - 1, vector_stmt,
 			    NULL, NULL_TREE, 0, vect_body);
     }
+  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      peel_iters_prologue = 0;
+      peel_iters_epilogue = 0;
+    }
   else if (npeel < 0)
     {
       peel_iters_prologue = assumed_vf / 2;
@@ -8197,6 +8370,7 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
     {
       rgm->max_nscalars_per_iter = nscalars_per_iter;
       rgm->type = truth_type_for (vectype);
+      rgm->factor = 1;
     }
 }
 
@@ -8249,6 +8423,64 @@ vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
   return mask;
 }
 
+/* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS
+   lengths for length-based vector accesses, each controlling a vector of
+   type VECTYPE.  */
+
+void
+vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
+		      unsigned int nvectors, tree vectype)
+{
+  gcc_assert (nvectors != 0);
+  if (lens->length () < nvectors)
+    lens->safe_grow_cleared (nvectors);
+  rgroup_controls *rgl = &(*lens)[nvectors - 1];
+
+  /* The number of scalars per iteration, the number of bytes each scalar
+     occupies and the number of vectors are all compile-time constants.  */
+  unsigned int nscalars_per_iter
+    = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
+		 LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
+
+  if (rgl->max_nscalars_per_iter < nscalars_per_iter)
+    {
+      rgl->max_nscalars_per_iter = nscalars_per_iter;
+      rgl->type = vectype;
+      /* For now, the length-based approach measures the length in bytes.
+	 FIXME: adjust this if it ever supports other units, e.g. scalar counts.  */
+      rgl->factor = int_cst_value (TYPE_SIZE_UNIT (TREE_TYPE (vectype)));
+    }
+}
+
+/* Given a complete set of lengths LENS, extract length number INDEX for an
+   rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.  */
+
+tree
+vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
+		   unsigned int nvectors, unsigned int index)
+{
+  rgroup_controls *rgl = &(*lens)[nvectors - 1];
+
+  /* Populate the rgroup's len array, if this is the first time we've
+     used it.  */
+  if (rgl->controls.is_empty ())
+    {
+      rgl->controls.safe_grow_cleared (nvectors);
+      for (unsigned int i = 0; i < nvectors; ++i)
+	{
+	  tree len_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
+	  gcc_assert (len_type != NULL_TREE);
+	  tree len = make_temp_ssa_name (len_type, NULL, "loop_len");
+
+	  /* Provide a dummy definition until the real one is available.  */
+	  SSA_NAME_DEF_STMT (len) = gimple_build_nop ();
+	  rgl->controls[i] = len;
+	}
+    }
+
+  return rgl->controls[index];
+}
+
 /* Scale profiling counters by estimation for LOOP which is vectorized
    by factor VF.  */
 
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index cdd6f6c5e5d..1da67f31859 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -1742,29 +1742,56 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
       return;
     }
 
-  machine_mode mask_mode;
-  if (!VECTOR_MODE_P (vecmode)
-      || !targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
-      || !can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
+  if (!VECTOR_MODE_P (vecmode))
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "can't use a fully-masked loop because the target"
-			 " doesn't have the appropriate masked load or"
-			 " store.\n");
+			 "can't operate on partial vectors because of"
+			 " the unexpected mode.\n");
       LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
       return;
     }
-  /* We might load more scalars than we need for permuting SLP loads.
-     We checked in get_group_load_store_type that the extra elements
-     don't leak into a new vector.  */
+
   poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   unsigned int nvectors;
-  if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
-    vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype, scalar_mask);
-  else
-    gcc_unreachable ();
+
+  machine_mode mask_mode;
+  bool with_partial_vectors_p = false;
+  if (targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
+      && can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
+    {
+      /* We might load more scalars than we need for permuting SLP loads.
+	 We checked in get_group_load_store_type that the extra elements
+	 don't leak into a new vector.  */
+      if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
+	vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype,
+			       scalar_mask);
+      else
+	gcc_unreachable ();
+      with_partial_vectors_p = true;
+    }
+
+  optab op = is_load ? lenload_optab : lenstore_optab;
+  if (optab_handler (op, vecmode) != CODE_FOR_nothing)
+    {
+      vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
+      if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
+	vect_record_loop_len (loop_vinfo, lens, nvectors, vectype);
+      else
+	gcc_unreachable ();
+      with_partial_vectors_p = true;
+    }
+
+  if (!with_partial_vectors_p)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't operate on partial vectors because the"
+			 " target doesn't have the appropriate partial"
+			 " vectorization load or store.\n");
+      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+    }
 }
 
 /* Return the mask input to a masked load or store.  VEC_MASK is the vectorized
@@ -7655,6 +7682,14 @@ vectorizable_store (vec_info *vinfo,
     = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
        ? &LOOP_VINFO_MASKS (loop_vinfo)
        : NULL);
+  vec_loop_lens *loop_lens
+    = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+       ? &LOOP_VINFO_LENS (loop_vinfo)
+       : NULL);
+
+  /* Shouldn't go with length-based approach if fully masked.  */
+  gcc_assert (!loop_lens || (loop_lens && !loop_masks));
+
   /* Targets with store-lane instructions must not require explicit
      realignment.  vect_supportable_dr_alignment always returns either
      dr_aligned or dr_unaligned_supported for masked operations.  */
@@ -7911,10 +7946,16 @@ vectorizable_store (vec_info *vinfo,
 	      unsigned HOST_WIDE_INT align;
 
 	      tree final_mask = NULL_TREE;
+	      tree final_len = NULL_TREE;
 	      if (loop_masks)
 		final_mask = vect_get_loop_mask (gsi, loop_masks,
 						 vec_num * ncopies,
 						 vectype, vec_num * j + i);
+	      else if (loop_lens)
+		final_len = vect_get_loop_len (loop_vinfo, loop_lens,
+					       vec_num * ncopies,
+					       vec_num * j + i);
+
 	      if (vec_mask)
 		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
 						      vec_mask, gsi);
@@ -7994,6 +8035,17 @@ vectorizable_store (vec_info *vinfo,
 		  vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
 		  new_stmt = call;
 		}
+	      else if (final_len)
+		{
+		  align = least_bit_hwi (misalign | align);
+		  tree ptr = build_int_cst (ref_type, align);
+		  gcall *call
+		    = gimple_build_call_internal (IFN_LEN_STORE, 4, dataref_ptr,
+						  ptr, final_len, vec_oprnd);
+		  gimple_call_set_nothrow (call, true);
+		  vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
+		  new_stmt = call;
+		}
 	      else
 		{
 		  data_ref = fold_build2 (MEM_REF, vectype,
@@ -8531,6 +8583,7 @@ vectorizable_load (vec_info *vinfo,
       tree dr_offset;
 
       gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
+      gcc_assert (!LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
       gcc_assert (!nested_in_vect_loop);
 
       if (grouped_load)
@@ -8819,6 +8872,14 @@ vectorizable_load (vec_info *vinfo,
     = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
        ? &LOOP_VINFO_MASKS (loop_vinfo)
        : NULL);
+  vec_loop_lens *loop_lens
+    = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+       ? &LOOP_VINFO_LENS (loop_vinfo)
+       : NULL);
+
+  /* Shouldn't go with length-based approach if fully masked.  */
+  gcc_assert (!loop_lens || (loop_lens && !loop_masks));
+
   /* Targets with store-lane instructions must not require explicit
      realignment.  vect_supportable_dr_alignment always returns either
      dr_aligned or dr_unaligned_supported for masked operations.  */
@@ -9134,11 +9195,18 @@ vectorizable_load (vec_info *vinfo,
 	  for (i = 0; i < vec_num; i++)
 	    {
 	      tree final_mask = NULL_TREE;
+	      tree final_len = NULL_TREE;
 	      if (loop_masks
 		  && memory_access_type != VMAT_INVARIANT)
 		final_mask = vect_get_loop_mask (gsi, loop_masks,
 						 vec_num * ncopies,
 						 vectype, vec_num * j + i);
+	      else if (loop_lens
+		  && memory_access_type != VMAT_INVARIANT)
+		final_len = vect_get_loop_len (loop_vinfo, loop_lens,
+					       vec_num * ncopies,
+					       vec_num * j + i);
+
 	      if (vec_mask)
 		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
 						      vec_mask, gsi);
@@ -9207,6 +9275,18 @@ vectorizable_load (vec_info *vinfo,
 			new_stmt = call;
 			data_ref = NULL_TREE;
 		      }
+		    else if (final_len)
+		      {
+			align = least_bit_hwi (misalign | align);
+			tree ptr = build_int_cst (ref_type, align);
+			gcall *call
+			  = gimple_build_call_internal (IFN_LEN_LOAD, 3,
+							dataref_ptr, ptr,
+							final_len);
+			gimple_call_set_nothrow (call, true);
+			new_stmt = call;
+			data_ref = NULL_TREE;
+		      }
 		    else
 		      {
 			tree ltype = vectype;
@@ -9850,11 +9930,30 @@ vectorizable_condition (vec_info *vinfo,
 	  return false;
 	}
 
-      if (loop_vinfo
-	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
-	  && reduction_type == EXTRACT_LAST_REDUCTION)
-	vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
-			       ncopies * vec_num, vectype, NULL);
+      if (loop_vinfo && for_reduction
+	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+	{
+	  if (reduction_type == EXTRACT_LAST_REDUCTION)
+	    vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
+				   ncopies * vec_num, vectype, NULL);
+	  /* Using partial vectors can introduce inactive lanes in the last
+	     iteration, so operating on a full vector of condition results
+	     would be unsafe here.  But if we can AND the condition mask
+	     with the loop mask, it becomes safe again.  */
+	  else if (!loop_vinfo->scalar_cond_masked_set.is_empty ())
+	    {
+	      scalar_cond_masked_key cond (cond_expr, ncopies * vec_num);
+	      if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
+		{
+		  bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
+		  cond.code = invert_tree_comparison (cond.code, honor_nans);
+		  if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
+		    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+		}
+	    }
+	  else
+	    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	}
 
       STMT_VINFO_TYPE (stmt_info) = condition_vec_info_type;
       vect_model_simple_cost (vinfo, stmt_info, ncopies, dts, ndts, slp_node,
@@ -11910,3 +12009,36 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
   *nunits_vectype_out = nunits_vectype;
   return opt_result::success ();
 }
+
+/* Generate and return a statement sequence that sets the vector length LEN:
+
+   min_of_start_and_end = min (START_INDEX, END_INDEX);
+   left_len = END_INDEX - min_of_start_and_end;
+   rhs = min (left_len, LEN_LIMIT);
+   LEN = rhs;
+
+   TODO: for now, the rs6000 support for vectors with length only looks at the
+   low 8 bits of the length, which means a left_len larger than 255 bytes
+   cannot be saturated to the vector limit (vector size).  A target hook can
+   be provided if other ports don't have this restriction.
+*/
+
+gimple_seq
+vect_gen_len (tree len, tree start_index, tree end_index, tree len_limit)
+{
+  gimple_seq stmts = NULL;
+  tree len_type = TREE_TYPE (len);
+  gcc_assert (TREE_TYPE (start_index) == len_type);
+
+  tree min = fold_build2 (MIN_EXPR, len_type, start_index, end_index);
+  tree left_len = fold_build2 (MINUS_EXPR, len_type, end_index, min);
+  left_len = fold_build2 (MIN_EXPR, len_type, left_len, len_limit);
+
+  tree rhs = force_gimple_operand (left_len, &stmts, true, NULL_TREE);
+  gimple *new_stmt = gimple_build_assign (len, rhs);
+  gimple_stmt_iterator i = gsi_last (stmts);
+  gsi_insert_after_without_update (&i, new_stmt, GSI_CONTINUE_LINKING);
+
+  return stmts;
+}
+
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 6c830ad09f4..3752ce51e64 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -417,6 +417,16 @@ is_a_helper <_bb_vec_info *>::test (vec_info *i)
    are compile-time constants but VF and nL can be variable (if the target
    supports variable-length vectors).
 
+   Moreover, an approach that controls partial vectors by length (in bytes)
+   needs to know how many bytes each scalar occupies.  If each scalar
+   occupies factor bytes, the total number of bytes becomes factor * N and
+   the above equation becomes:
+
+       factor * N = factor * NS * VF = factor * NV * NL
+
+   where factor * NS is the number of bytes accessed per scalar iteration
+   and factor * NL is the vector size in bytes.
+
    In classical vectorization, each iteration of the vector loop would
    handle exactly VF iterations of the original scalar loop.  However,
    in vector loops that are able to operate on partial vectors, a
@@ -473,14 +483,19 @@ is_a_helper <_bb_vec_info *>::test (vec_info *i)
    first level being indexed by nV - 1 (since nV == 0 doesn't exist) and
    the second being indexed by the mask index 0 <= i < nV.  */
 
-/* The controls (like masks) needed by rgroups with nV vectors,
+/* The controls (like masks, lengths) needed by rgroups with nV vectors,
    according to the description above.  */
 struct rgroup_controls {
   /* The largest nS for all rgroups that use these controls.  */
   unsigned int max_nscalars_per_iter;
 
-  /* The type of control to use, based on the highest nS recorded above.
-     For mask-based approach, it's used for mask_type.  */
+  /* For now this is only used by the length-based (in bytes) approach;
+     it records the number of bytes occupied by each scalar.  */
+  unsigned int factor;
+
+  /* The type of control to use, based on the highest nS recorded above.
+     For the mask-based approach this is the mask type to use.
+     For the length-based approach this is the vector type being accessed.  */
   tree type;
 
   /* A vector of nV controls, in iteration order.  */
@@ -489,6 +504,8 @@ struct rgroup_controls {
 
 typedef auto_vec<rgroup_controls> vec_loop_masks;
 
+typedef auto_vec<rgroup_controls> vec_loop_lens;
+
 typedef auto_vec<std::pair<data_reference*, tree> > drs_init_vec;
 
 /*-----------------------------------------------------------------*/
@@ -536,6 +553,10 @@ public:
      on inactive scalars.  */
   vec_loop_masks masks;
 
+  /* The lengths that a length-controlled loop should use to avoid
+     operating on inactive scalars.  */
+  vec_loop_lens lens;
+
   /* Set of scalar conditions that have loop mask applied.  */
   scalar_cond_masked_set_type scalar_cond_masked_set;
 
@@ -644,6 +665,10 @@ public:
      the vector loop can handle fewer than VF scalars.  */
   bool using_partial_vectors_p;
 
+  /* True if we've decided to use partially-populated vectors for the
+     epilogue of this loop; for now only the length-based approach does this.  */
+  bool epil_using_partial_vectors_p;
+
   /* When we have grouped data accesses with gaps, we may introduce invalid
      memory accesses.  We peel the last iteration of the loop to prevent
      this.  */
@@ -707,9 +732,12 @@ public:
 #define LOOP_VINFO_VECTORIZABLE_P(L)       (L)->vectorizable
 #define LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P(L) (L)->can_use_partial_vectors_p
 #define LOOP_VINFO_USING_PARTIAL_VECTORS_P(L) (L)->using_partial_vectors_p
+#define LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P(L)                             \
+  (L)->epil_using_partial_vectors_p
 #define LOOP_VINFO_VECT_FACTOR(L)          (L)->vectorization_factor
 #define LOOP_VINFO_MAX_VECT_FACTOR(L)      (L)->max_vectorization_factor
 #define LOOP_VINFO_MASKS(L)                (L)->masks
+#define LOOP_VINFO_LENS(L)                 (L)->lens
 #define LOOP_VINFO_MASK_SKIP_NITERS(L)     (L)->mask_skip_niters
 #define LOOP_VINFO_RGROUP_COMPARE_TYPE(L)  (L)->rgroup_compare_type
 #define LOOP_VINFO_RGROUP_IV_TYPE(L)       (L)->rgroup_iv_type
@@ -747,6 +775,10 @@ public:
   (LOOP_VINFO_USING_PARTIAL_VECTORS_P (L)	\
    && !LOOP_VINFO_MASKS (L).is_empty ())
 
+#define LOOP_VINFO_FULLY_WITH_LENGTH_P(L)	\
+  (LOOP_VINFO_USING_PARTIAL_VECTORS_P (L)	\
+   && !LOOP_VINFO_LENS (L).is_empty ())
+
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L)	\
   ((L)->may_misalign_stmts.length () > 0)
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIAS(L)		\
@@ -1866,6 +1898,11 @@ extern void vect_record_loop_mask (loop_vec_info, vec_loop_masks *,
 				   unsigned int, tree, tree);
 extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
 				unsigned int, tree, unsigned int);
+extern void vect_record_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
+				  tree);
+extern tree vect_get_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
+			       unsigned int);
+extern gimple_seq vect_gen_len (tree, tree, tree, tree);
 extern stmt_vec_info info_for_reduction (vec_info *, stmt_vec_info);
 
 /* Drive for loop transformation stage.  */

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 1/7 V4] ifn/optabs: Support vector load/store with length
  2020-06-10 12:36       ` [PATCH 1/7 V3] " Kewen.Lin
@ 2020-06-22  8:51         ` Kewen.Lin
  2020-06-22 19:59           ` Richard Sandiford
  0 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-06-22  8:51 UTC (permalink / raw)
  To: GCC Patches
  Cc: richard.sandiford, Richard Guenther, Bill Schmidt, dje.gcc,
	Segher Boessenkool

[-- Attachment #1: Type: text/plain, Size: 947 bytes --]

Hi,

v4: Update len_load_direct/len_store_direct to align with direct optab.

v3: Get rid of length mode hook.

Thanks for reviewing!

BR,
Kewen
---
gcc/ChangeLog:

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* doc/md.texi (lenload@var{m}): Document.
	(lenstore@var{m}): Likewise.
	* internal-fn.c (len_load_direct): New macro.
	(len_store_direct): Likewise.
	(expand_len_load_optab_fn): Likewise.
	(expand_len_store_optab_fn): Likewise.
	(direct_len_load_optab_supported_p): Likewise.
	(direct_len_store_optab_supported_p): Likewise.
	(expand_mask_load_optab_fn): Add handlings for lenload_optab.
	(expand_mask_store_optab_fn): Add handlings for lenstore_optab.
	(internal_load_fn_p): Handle IFN_LEN_LOAD.
	(internal_store_fn_p): Handle IFN_LEN_STORE.
	(internal_fn_stored_value_index): Handle IFN_LEN_STORE.
	* internal-fn.def (LEN_LOAD): New internal function.
	(LEN_STORE): Likewise.
	* optabs.def (lenload_optab, lenstore_optab): New optab.

[-- Attachment #2: ifn_v4.diff --]
[-- Type: text/plain, Size: 8533 bytes --]

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 2c67c818da5..dd0d3ec203b 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5167,6 +5167,24 @@ mode @var{n}.
 
 This pattern is not allowed to @code{FAIL}.
 
+@cindex @code{lenload@var{m}} instruction pattern
+@item @samp{lenload@var{m}}
+Perform a vector load with length from memory operand 1 of mode @var{m}
+into register operand 0.  Length is provided in register operand 2 with
+appropriate mode which should afford the maximal required precision of
+any available lengths.
+
+This pattern is not allowed to @code{FAIL}.
+
+@cindex @code{lenstore@var{m}} instruction pattern
+@item @samp{lenstore@var{m}}
+Perform a vector store with length from register operand 1 of mode @var{m}
+into memory operand 0.  Length is provided in register operand 2 with
+appropriate mode which should afford the maximal required precision of
+any available lengths.
+
+This pattern is not allowed to @code{FAIL}.
+
 @cindex @code{vec_perm@var{m}} instruction pattern
 @item @samp{vec_perm@var{m}}
 Output a (variable) vector permutation.  Operand 0 is the destination
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 5e9aa60721e..3d590517e5d 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -104,10 +104,12 @@ init_internal_fns ()
 #define load_lanes_direct { -1, -1, false }
 #define mask_load_lanes_direct { -1, -1, false }
 #define gather_load_direct { 3, 1, false }
+#define len_load_direct { -1, -1, false }
 #define mask_store_direct { 3, 2, false }
 #define store_lanes_direct { 0, 0, false }
 #define mask_store_lanes_direct { 0, 0, false }
 #define scatter_store_direct { 3, 1, false }
+#define len_store_direct { 3, 3, false }
 #define unary_direct { 0, 0, true }
 #define binary_direct { 0, 0, true }
 #define ternary_direct { 0, 0, true }
@@ -2478,7 +2480,7 @@ expand_call_mem_ref (tree type, gcall *stmt, int index)
   return fold_build2 (MEM_REF, type, addr, build_int_cst (alias_ptr_type, 0));
 }
 
-/* Expand MASK_LOAD{,_LANES} call STMT using optab OPTAB.  */
+/* Expand MASK_LOAD{,_LANES} and LEN_LOAD call STMT using optab OPTAB.  */
 
 static void
 expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
@@ -2497,6 +2499,8 @@ expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 
   if (optab == vec_mask_load_lanes_optab)
     icode = get_multi_vector_move (type, optab);
+  else if (optab == lenload_optab)
+    icode = direct_optab_handler (optab, TYPE_MODE (type));
   else
     icode = convert_optab_handler (optab, TYPE_MODE (type),
 				   TYPE_MODE (TREE_TYPE (maskt)));
@@ -2507,15 +2511,20 @@ expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
   target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   create_output_operand (&ops[0], target, TYPE_MODE (type));
   create_fixed_operand (&ops[1], mem);
-  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
+  if (optab == lenload_optab)
+    create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
+				 TYPE_UNSIGNED (TREE_TYPE (maskt)));
+  else
+    create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
   expand_insn (icode, 3, ops);
   if (!rtx_equal_p (target, ops[0].value))
     emit_move_insn (target, ops[0].value);
 }
 
 #define expand_mask_load_lanes_optab_fn expand_mask_load_optab_fn
+#define expand_len_load_optab_fn expand_mask_load_optab_fn
 
-/* Expand MASK_STORE{,_LANES} call STMT using optab OPTAB.  */
+/* Expand MASK_STORE{,_LANES} and LEN_STORE call STMT using optab OPTAB.  */
 
 static void
 expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
@@ -2532,6 +2541,8 @@ expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 
   if (optab == vec_mask_store_lanes_optab)
     icode = get_multi_vector_move (type, optab);
+  else if (optab == lenstore_optab)
+    icode = direct_optab_handler (optab, TYPE_MODE (type));
   else
     icode = convert_optab_handler (optab, TYPE_MODE (type),
 				   TYPE_MODE (TREE_TYPE (maskt)));
@@ -2542,11 +2553,16 @@ expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
   reg = expand_normal (rhs);
   create_fixed_operand (&ops[0], mem);
   create_input_operand (&ops[1], reg, TYPE_MODE (type));
-  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
+  if (optab == lenstore_optab)
+    create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
+				 TYPE_UNSIGNED (TREE_TYPE (maskt)));
+  else
+    create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
   expand_insn (icode, 3, ops);
 }
 
 #define expand_mask_store_lanes_optab_fn expand_mask_store_optab_fn
+#define expand_len_store_optab_fn expand_mask_store_optab_fn
 
 static void
 expand_ABNORMAL_DISPATCHER (internal_fn, gcall *)
@@ -3128,10 +3144,12 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
 #define direct_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_gather_load_optab_supported_p convert_optab_supported_p
+#define direct_len_load_optab_supported_p direct_optab_supported_p
 #define direct_mask_store_optab_supported_p direct_optab_supported_p
 #define direct_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_scatter_store_optab_supported_p convert_optab_supported_p
+#define direct_len_store_optab_supported_p direct_optab_supported_p
 #define direct_while_optab_supported_p convert_optab_supported_p
 #define direct_fold_extract_optab_supported_p direct_optab_supported_p
 #define direct_fold_left_optab_supported_p direct_optab_supported_p
@@ -3498,6 +3516,7 @@ internal_load_fn_p (internal_fn fn)
     case IFN_MASK_LOAD_LANES:
     case IFN_GATHER_LOAD:
     case IFN_MASK_GATHER_LOAD:
+    case IFN_LEN_LOAD:
       return true;
 
     default:
@@ -3517,6 +3536,7 @@ internal_store_fn_p (internal_fn fn)
     case IFN_MASK_STORE_LANES:
     case IFN_SCATTER_STORE:
     case IFN_MASK_SCATTER_STORE:
+    case IFN_LEN_STORE:
       return true;
 
     default:
@@ -3577,6 +3597,7 @@ internal_fn_stored_value_index (internal_fn fn)
     case IFN_MASK_STORE:
     case IFN_SCATTER_STORE:
     case IFN_MASK_SCATTER_STORE:
+    case IFN_LEN_STORE:
       return 3;
 
     default:
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 1d190d492ff..ed6561f296a 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -49,11 +49,13 @@ along with GCC; see the file COPYING3.  If not see
    - load_lanes: currently just vec_load_lanes
    - mask_load_lanes: currently just vec_mask_load_lanes
    - gather_load: used for {mask_,}gather_load
+   - len_load: currently just lenload
 
    - mask_store: currently just maskstore
    - store_lanes: currently just vec_store_lanes
    - mask_store_lanes: currently just vec_mask_store_lanes
    - scatter_store: used for {mask_,}scatter_store
+   - len_store: currently just lenstore
 
    - unary: a normal unary optab, such as vec_reverse_<mode>
    - binary: a normal binary optab, such as vec_interleave_lo_<mode>
@@ -127,6 +129,8 @@ DEF_INTERNAL_OPTAB_FN (GATHER_LOAD, ECF_PURE, gather_load, gather_load)
 DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
 		       mask_gather_load, gather_load)
 
+DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, lenload, len_load)
+
 DEF_INTERNAL_OPTAB_FN (SCATTER_STORE, 0, scatter_store, scatter_store)
 DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
 		       mask_scatter_store, scatter_store)
@@ -136,6 +140,8 @@ DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
 DEF_INTERNAL_OPTAB_FN (MASK_STORE_LANES, 0,
 		       vec_mask_store_lanes, mask_store_lanes)
 
+DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, lenstore, len_store)
+
 DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
 DEF_INTERNAL_OPTAB_FN (CHECK_RAW_PTRS, ECF_CONST | ECF_NOTHROW,
 		       check_raw_ptrs, check_ptrs)
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 0c64eb52a8d..9fe4ac1840d 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -435,3 +435,5 @@ OPTAB_D (check_war_ptrs_optab, "check_war_ptrs$a")
 OPTAB_DC (vec_duplicate_optab, "vec_duplicate$a", VEC_DUPLICATE)
 OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
 OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
+OPTAB_D (lenload_optab, "lenload$a")
+OPTAB_D (lenstore_optab, "lenstore$a")

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/7 V4] ifn/optabs: Support vector load/store with length
  2020-06-22  8:51         ` [PATCH 1/7 V4] " Kewen.Lin
@ 2020-06-22 19:59           ` Richard Sandiford
  2020-06-22 22:19             ` Segher Boessenkool
                               ` (2 more replies)
  0 siblings, 3 replies; 80+ messages in thread
From: Richard Sandiford @ 2020-06-22 19:59 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Richard Guenther, Bill Schmidt, dje.gcc, Segher Boessenkool

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> @@ -5167,6 +5167,24 @@ mode @var{n}.
>  
>  This pattern is not allowed to @code{FAIL}.
>  
> +@cindex @code{lenload@var{m}} instruction pattern
> +@item @samp{lenload@var{m}}
> +Perform a vector load with length from memory operand 1 of mode @var{m}
> +into register operand 0.  Length is provided in register operand 2 with
> +appropriate mode which should afford the maximal required precision of
> +any available lengths.

I think we need to say in more detail what “load with length” actually
means.  How about:

  Load the number of bytes specified by operand 2 from memory operand 1
  into register operand 0, setting the other bytes of operand 0 to
  undefined values.  Operands 0 and 1 have mode @var{m}.  Operand 2 has
  whichever integer mode the target prefers.

> +@cindex @code{lenstore@var{m}} instruction pattern
> +@item @samp{lenstore@var{m}}
> +Perform a vector store with length from register operand 1 of mode @var{m}
> +into memory operand 0.  Length is provided in register operand 2 with
> +appropriate mode which should afford the maximal required precision of
> +any available lengths.

Similarly here:

  Store the number of bytes specified by operand 2 from nonmemory operand 1
  into memory operand 0, leaving the other bytes of operand 0 unchanged.
  Operands 0 and 1 have mode @var{m}.  Operand 2 has whichever integer
  mode the target prefers.
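
For illustration, a rough scalar reference model of these two patterns might
look like the sketch below.  It is only a sketch: the function names are made
up, LEN is counted in bytes, a 16-byte vector mode is assumed, and it also
assumes (which the wording above does not say) that LEN never exceeds the
vector size.

  /* Hypothetical reference semantics for the len load/store patterns.  */
  static void
  ref_len_load (unsigned char dest[16], const unsigned char *mem,
		unsigned int len)
  {
    for (unsigned int i = 0; i < 16 && i < len; i++)
      dest[i] = mem[i];
    /* dest[len] ... dest[15] are left with undefined values.  */
  }

  static void
  ref_len_store (unsigned char *mem, const unsigned char src[16],
		 unsigned int len)
  {
    for (unsigned int i = 0; i < 16 && i < len; i++)
      mem[i] = src[i];
    /* mem[len] ... mem[15] are left unchanged.  */
  }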

> @@ -2478,7 +2480,7 @@ expand_call_mem_ref (tree type, gcall *stmt, int index)
>    return fold_build2 (MEM_REF, type, addr, build_int_cst (alias_ptr_type, 0));
>  }
>  
> -/* Expand MASK_LOAD{,_LANES} call STMT using optab OPTAB.  */
> +/* Expand MASK_LOAD{,_LANES} and LEN_LOAD call STMT using optab OPTAB.  */

s/and/or/.

>  
>  static void
>  expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)

Think it would be worth generalising the name, e.g. to
expand_partial_load_optab_fn, and adding a #define for
expand_mask_load_optab_fn before the other two #defines.

Same comments for stores.

> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 1d190d492ff..ed6561f296a 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -49,11 +49,13 @@ along with GCC; see the file COPYING3.  If not see
>     - load_lanes: currently just vec_load_lanes
>     - mask_load_lanes: currently just vec_mask_load_lanes
>     - gather_load: used for {mask_,}gather_load
> +   - len_load: currently just lenload
>  
>     - mask_store: currently just maskstore
>     - store_lanes: currently just vec_store_lanes
>     - mask_store_lanes: currently just vec_mask_store_lanes
>     - scatter_store: used for {mask_,}scatter_store
> +   - len_store: currently just lenstore
>  
>     - unary: a normal unary optab, such as vec_reverse_<mode>
>     - binary: a normal binary optab, such as vec_interleave_lo_<mode>
> @@ -127,6 +129,8 @@ DEF_INTERNAL_OPTAB_FN (GATHER_LOAD, ECF_PURE, gather_load, gather_load)
>  DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
>  		       mask_gather_load, gather_load)
>  
> +DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, lenload, len_load)
> +
>  DEF_INTERNAL_OPTAB_FN (SCATTER_STORE, 0, scatter_store, scatter_store)
>  DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
>  		       mask_scatter_store, scatter_store)
> @@ -136,6 +140,8 @@ DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
>  DEF_INTERNAL_OPTAB_FN (MASK_STORE_LANES, 0,
>  		       vec_mask_store_lanes, mask_store_lanes)
>  
> +DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, lenstore, len_store)
> +
>  DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
>  DEF_INTERNAL_OPTAB_FN (CHECK_RAW_PTRS, ECF_CONST | ECF_NOTHROW,
>  		       check_raw_ptrs, check_ptrs)
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 0c64eb52a8d..9fe4ac1840d 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -435,3 +435,5 @@ OPTAB_D (check_war_ptrs_optab, "check_war_ptrs$a")
>  OPTAB_DC (vec_duplicate_optab, "vec_duplicate$a", VEC_DUPLICATE)
>  OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
>  OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
> +OPTAB_D (lenload_optab, "lenload$a")
> +OPTAB_D (lenstore_optab, "lenstore$a")

Sorry, I should have picked up on this last time, but I think we should
be consistent about whether there's an underscore after “len” or not.
I realise this is just replicating what happens for IFN_MASK_LOAD/
“maskload” and IFN_MASK_STORE/“maskstore”, but it's something I kept
tripping over when implementing those for SVE.

Personally I think it is easier to read with the underscore, so this
would be “len_load_optab” and “len_load$a” (or “len_load_$a”,
there's no real consistency on that).  Same for stores.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/7 V4] ifn/optabs: Support vector load/store with length
  2020-06-22 19:59           ` Richard Sandiford
@ 2020-06-22 22:19             ` Segher Boessenkool
  2020-06-23  3:54             ` [PATCH 1/7 v5] " Kewen.Lin
  2020-06-23  6:47             ` [PATCH 1/7 V4] " Richard Biener
  2 siblings, 0 replies; 80+ messages in thread
From: Segher Boessenkool @ 2020-06-22 22:19 UTC (permalink / raw)
  To: Kewen.Lin, GCC Patches, Richard Guenther, Bill Schmidt, dje.gcc,
	richard.sandiford

Hi!

On Mon, Jun 22, 2020 at 08:59:48PM +0100, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
> > @@ -5167,6 +5167,24 @@ mode @var{n}.
> >  
> >  This pattern is not allowed to @code{FAIL}.
> >  
> > +@cindex @code{lenload@var{m}} instruction pattern
> > +@item @samp{lenload@var{m}}
> > +Perform a vector load with length from memory operand 1 of mode @var{m}
> > +into register operand 0.  Length is provided in register operand 2 with
> > +appropriate mode which should afford the maximal required precision of
> > +any available lengths.
> 
> I think we need to say in more detail what “load with length” actually
> means.  How about:
> 
>   Load the number of bytes specified by operand 2 from memory operand 1
>   into register operand 0, setting the other bytes of operand 0 to
>   undefined values.  Operands 0 and 1 have mode @var{m}.  Operand 2 has
>   whichever integer mode the target prefers.

The Power instructions set the other bytes to 0.  There is no great way
to zero them out with other insns either, so lenloadM should do it, from
our viewpoint.  (The problem case is when the length is not constant).
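
To spell the non-constant case out with a rough sketch (made-up scalar code,
not a proposed sequence): zeroing the tail separately needs a mask, and while
for a constant length that mask folds to a compile-time constant, for a
run-time length it first has to be materialized, e.g.

  /* Illustration only: clear the bytes at and beyond LEN of a 16-byte V.  */
  static void
  zero_tail (unsigned char v[16], unsigned int len)
  {
    unsigned char mask[16];
    for (unsigned int i = 0; i < 16; i++)
      mask[i] = i < len ? 0xff : 0x00;
    for (unsigned int i = 0; i < 16; i++)
      v[i] &= mask[i];
  }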

> > +@cindex @code{lenstore@var{m}} instruction pattern
> > +@item @samp{lenstore@var{m}}
> > +Perform a vector store with length from register operand 1 of mode @var{m}
> > +into memory operand 0.  Length is provided in register operand 2 with
> > +appropriate mode which should afford the maximal required precision of
> > +any available lengths.
> 
> Similarly here:
> 
>   Store the number of bytes specified by operand 2 from nonmemory operand 1
>   into memory operand 0, leaving the other bytes of operand 0 unchanged.
>   Operands 0 and 1 have mode @var{m}.  Operand 2 has whichever integer
>   mode the target prefers.

That is what our insns do, yes, so that is fine with me :-)

> Sorry, I should have picked up on this last time, but I think we should
> be consistent about whether there's an underscore after “len” or not.
> I realise this is just replicating what happens for IFN_MASK_LOAD/
> “maskload” and IFN_MASK_STORE/“maskstore”, but it's something I kept
> tripping over when implementing those for SVE.
> 
> Personally I think it is easier to read with the underscore, so this
> would be “len_load_optab” and “len_load$a” (or “len_load_$a”,
> there's no real consistency on that).  Same for stores.

With a second underscore it is more readable than with just one (the mode
does not apply to just the part after the underscore).

Thanks,


Segher

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 1/7 v5] ifn/optabs: Support vector load/store with length
  2020-06-22 19:59           ` Richard Sandiford
  2020-06-22 22:19             ` Segher Boessenkool
@ 2020-06-23  3:54             ` Kewen.Lin
  2020-06-23  9:52               ` Richard Sandiford
  2020-06-23  6:47             ` [PATCH 1/7 V4] " Richard Biener
  2 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-06-23  3:54 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford
  Cc: Richard Biener, Bill Schmidt, dje.gcc, Segher Boessenkool

[-- Attachment #1: Type: text/plain, Size: 4393 bytes --]

Hi Richard,

Thanks for your comments!

on 2020/6/23 3:59 AM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> @@ -5167,6 +5167,24 @@ mode @var{n}.
>>  
>>  This pattern is not allowed to @code{FAIL}.
>>  
>> +@cindex @code{lenload@var{m}} instruction pattern
>> +@item @samp{lenload@var{m}}
>> +Perform a vector load with length from memory operand 1 of mode @var{m}
>> +into register operand 0.  Length is provided in register operand 2 with
>> +appropriate mode which should afford the maximal required precision of
>> +any available lengths.
> 
> I think we need to say in more detail what “load with length” actually
> means.  How about:
> 
>   Load the number of bytes specified by operand 2 from memory operand 1
>   into register operand 0, setting the other bytes of operand 0 to
>   undefined values.  Operands 0 and 1 have mode @var{m}.  Operand 2 has
>   whichever integer mode the target prefers.
> 

Thanks for the nice wording!  Updated; for "... to undefined values" I changed it
to "... to undefined values or zeros", per Segher's comment, to match the behavior
on Power.

> Similarly here:
> 
>   Store the number of bytes specified by operand 2 from nonmemory operand 1
>   into memory operand 0, leaving the other bytes of operand 0 unchanged.
>   Operands 0 and 1 have mode @var{m}.  Operand 2 has whichever integer
>   mode the target prefers.
> 

Done.

>> @@ -2478,7 +2480,7 @@ expand_call_mem_ref (tree type, gcall *stmt, int index)
>>    return fold_build2 (MEM_REF, type, addr, build_int_cst (alias_ptr_type, 0));
>>  }
>>  
>> -/* Expand MASK_LOAD{,_LANES} call STMT using optab OPTAB.  */
>> +/* Expand MASK_LOAD{,_LANES} and LEN_LOAD call STMT using optab OPTAB.  */
> 
> s/and/or/.
> 

Done.

>>  
>>  static void
>>  expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
> 
> Think it would be worth generalising the name, e.g. to
> expand_partial_load_optab_fn, and adding a #define for
> expand_mask_load_optab_fn before the other two #defines.
> 
> Same comments for stores.
> 

Done.

>> +OPTAB_D (lenload_optab, "lenload$a")
>> +OPTAB_D (lenstore_optab, "lenstore$a")
> 
> Sorry, I should have picked up on this last time, but I think we should
> be consistent about whether there's an underscore after “len” or not.
> I realise this is just replicating what happens for IFN_MASK_LOAD/
> “maskload” and IFN_MASK_STORE/“maskstore”, but it's something I kept
> tripping over when implementing those for SVE.
> 
> Personally I think it is easier to read with the underscore, so this
> would be “len_load_optab” and “len_load$a” (or “len_load_$a”,
> there's no real consistency on that).  Same for stores.
> 

Good point!  I found there are two flavors of optab naming.

  OPTAB_CD(maskload_optab, "maskload$a$b")
  OPTAB_CD(gather_load_optab, "gather_load$a$b")
  ...
vs. 
  OPTAB_D (vec_realign_load_optab, "vec_realign_load_$a")

In the end I chose two underscores, following Segher's comment on readability.

----------------------------

v5:
  - Updated lenload/lenstore optab to len_load/len_store and the docs.
  - Rename expand_mask_{load,store}_optab_fn to expand_partial_{load,store}_optab_fn
  - Added/updated macros for expand_mask_{load,store}_optab_fn
    and expand_len_{load,store}_optab_fn

v4: Update len_load_direct/len_store_direct to align with direct optab.

v3: Get rid of length mode hook.


BR,
Kewen
---
gcc/ChangeLog:

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* doc/md.texi (len_load_@var{m}): Document.
	(len_store_@var{m}): Likewise.
	* internal-fn.c (len_load_direct): New macro.
	(len_store_direct): Likewise.
	(expand_len_load_optab_fn): Likewise.
	(expand_len_store_optab_fn): Likewise.
	(direct_len_load_optab_supported_p): Likewise.
	(direct_len_store_optab_supported_p): Likewise.
	(expand_mask_load_optab_fn): New macro.  Original renamed to ...
	(expand_partial_load_optab_fn): ... here.  Add handlings for
	len_load_optab.
	(expand_mask_store_optab_fn): New macro.  Original renamed to ...
	(expand_partial_store_optab_fn): ... here. Add handlings for
	len_store_optab.
	(internal_load_fn_p): Handle IFN_LEN_LOAD.
	(internal_store_fn_p): Handle IFN_LEN_STORE.
	(internal_fn_stored_value_index): Handle IFN_LEN_STORE.
	* internal-fn.def (LEN_LOAD): New internal function.
	(LEN_STORE): Likewise.
	* optabs.def (len_load_optab, len_store_optab): New optab.

[-- Attachment #2: ifn_v5.diff --]
[-- Type: text/plain, Size: 9189 bytes --]

commit f6012656a8968f239ad781c2cd388a9210675e11
Author: Kewen Lin <linkw@gcc.gnu.org>
Date:   Mon May 25 10:55:16 2020 +0800

    IFN for vector load/store with length and related optabs V5

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 2c67c818da5..23918136345 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5167,6 +5167,24 @@ mode @var{n}.
 
 This pattern is not allowed to @code{FAIL}.
 
+@cindex @code{len_load_@var{m}} instruction pattern
+@item @samp{len_load_@var{m}}
+Load the number of bytes specified by operand 2 from memory operand 1
+into register operand 0, setting the other bytes of operand 0 to
+undefined values or zeros.  Operands 0 and 1 have mode @var{m}.
+Operand 2 has whichever integer mode the target prefers.
+
+This pattern is not allowed to @code{FAIL}.
+
+@cindex @code{len_store_@var{m}} instruction pattern
+@item @samp{len_store_@var{m}}
+Store the number of bytes specified by operand 2 from nonmemory operand 1
+into memory operand 0, leaving the other bytes of operand 0 unchanged.
+Operands 0 and 1 have mode @var{m}.  Operand 2 has whichever integer
+mode the target prefers.
+
+This pattern is not allowed to @code{FAIL}.
+
 @cindex @code{vec_perm@var{m}} instruction pattern
 @item @samp{vec_perm@var{m}}
 Output a (variable) vector permutation.  Operand 0 is the destination
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 5e9aa60721e..f9e851069a5 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -104,10 +104,12 @@ init_internal_fns ()
 #define load_lanes_direct { -1, -1, false }
 #define mask_load_lanes_direct { -1, -1, false }
 #define gather_load_direct { 3, 1, false }
+#define len_load_direct { -1, -1, false }
 #define mask_store_direct { 3, 2, false }
 #define store_lanes_direct { 0, 0, false }
 #define mask_store_lanes_direct { 0, 0, false }
 #define scatter_store_direct { 3, 1, false }
+#define len_store_direct { 3, 3, false }
 #define unary_direct { 0, 0, true }
 #define binary_direct { 0, 0, true }
 #define ternary_direct { 0, 0, true }
@@ -2478,10 +2480,10 @@ expand_call_mem_ref (tree type, gcall *stmt, int index)
   return fold_build2 (MEM_REF, type, addr, build_int_cst (alias_ptr_type, 0));
 }
 
-/* Expand MASK_LOAD{,_LANES} call STMT using optab OPTAB.  */
+/* Expand MASK_LOAD{,_LANES} or LEN_LOAD call STMT using optab OPTAB.  */
 
 static void
-expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
+expand_partial_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 {
   class expand_operand ops[3];
   tree type, lhs, rhs, maskt;
@@ -2497,6 +2499,8 @@ expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 
   if (optab == vec_mask_load_lanes_optab)
     icode = get_multi_vector_move (type, optab);
+  else if (optab == len_load_optab)
+    icode = direct_optab_handler (optab, TYPE_MODE (type));
   else
     icode = convert_optab_handler (optab, TYPE_MODE (type),
 				   TYPE_MODE (TREE_TYPE (maskt)));
@@ -2507,18 +2511,24 @@ expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
   target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   create_output_operand (&ops[0], target, TYPE_MODE (type));
   create_fixed_operand (&ops[1], mem);
-  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
+  if (optab == len_load_optab)
+    create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
+				 TYPE_UNSIGNED (TREE_TYPE (maskt)));
+  else
+    create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
   expand_insn (icode, 3, ops);
   if (!rtx_equal_p (target, ops[0].value))
     emit_move_insn (target, ops[0].value);
 }
 
+#define expand_mask_load_optab_fn expand_partial_load_optab_fn
 #define expand_mask_load_lanes_optab_fn expand_mask_load_optab_fn
+#define expand_len_load_optab_fn expand_partial_load_optab_fn
 
-/* Expand MASK_STORE{,_LANES} call STMT using optab OPTAB.  */
+/* Expand MASK_STORE{,_LANES} or LEN_STORE call STMT using optab OPTAB.  */
 
 static void
-expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
+expand_partial_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 {
   class expand_operand ops[3];
   tree type, lhs, rhs, maskt;
@@ -2532,6 +2542,8 @@ expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 
   if (optab == vec_mask_store_lanes_optab)
     icode = get_multi_vector_move (type, optab);
+  else if (optab == len_store_optab)
+    icode = direct_optab_handler (optab, TYPE_MODE (type));
   else
     icode = convert_optab_handler (optab, TYPE_MODE (type),
 				   TYPE_MODE (TREE_TYPE (maskt)));
@@ -2542,11 +2554,17 @@ expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
   reg = expand_normal (rhs);
   create_fixed_operand (&ops[0], mem);
   create_input_operand (&ops[1], reg, TYPE_MODE (type));
-  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
+  if (optab == len_store_optab)
+    create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
+				 TYPE_UNSIGNED (TREE_TYPE (maskt)));
+  else
+    create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
   expand_insn (icode, 3, ops);
 }
 
+#define expand_mask_store_optab_fn expand_partial_store_optab_fn
 #define expand_mask_store_lanes_optab_fn expand_mask_store_optab_fn
+#define expand_len_store_optab_fn expand_partial_store_optab_fn
 
 static void
 expand_ABNORMAL_DISPATCHER (internal_fn, gcall *)
@@ -3128,10 +3146,12 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
 #define direct_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_gather_load_optab_supported_p convert_optab_supported_p
+#define direct_len_load_optab_supported_p direct_optab_supported_p
 #define direct_mask_store_optab_supported_p direct_optab_supported_p
 #define direct_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_scatter_store_optab_supported_p convert_optab_supported_p
+#define direct_len_store_optab_supported_p direct_optab_supported_p
 #define direct_while_optab_supported_p convert_optab_supported_p
 #define direct_fold_extract_optab_supported_p direct_optab_supported_p
 #define direct_fold_left_optab_supported_p direct_optab_supported_p
@@ -3498,6 +3518,7 @@ internal_load_fn_p (internal_fn fn)
     case IFN_MASK_LOAD_LANES:
     case IFN_GATHER_LOAD:
     case IFN_MASK_GATHER_LOAD:
+    case IFN_LEN_LOAD:
       return true;
 
     default:
@@ -3517,6 +3538,7 @@ internal_store_fn_p (internal_fn fn)
     case IFN_MASK_STORE_LANES:
     case IFN_SCATTER_STORE:
     case IFN_MASK_SCATTER_STORE:
+    case IFN_LEN_STORE:
       return true;
 
     default:
@@ -3577,6 +3599,7 @@ internal_fn_stored_value_index (internal_fn fn)
     case IFN_MASK_STORE:
     case IFN_SCATTER_STORE:
     case IFN_MASK_SCATTER_STORE:
+    case IFN_LEN_STORE:
       return 3;
 
     default:
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 1d190d492ff..17dac128e83 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -49,11 +49,13 @@ along with GCC; see the file COPYING3.  If not see
    - load_lanes: currently just vec_load_lanes
    - mask_load_lanes: currently just vec_mask_load_lanes
    - gather_load: used for {mask_,}gather_load
+   - len_load: currently just len_load
 
    - mask_store: currently just maskstore
    - store_lanes: currently just vec_store_lanes
    - mask_store_lanes: currently just vec_mask_store_lanes
    - scatter_store: used for {mask_,}scatter_store
+   - len_store: currently just len_store
 
    - unary: a normal unary optab, such as vec_reverse_<mode>
    - binary: a normal binary optab, such as vec_interleave_lo_<mode>
@@ -127,6 +129,8 @@ DEF_INTERNAL_OPTAB_FN (GATHER_LOAD, ECF_PURE, gather_load, gather_load)
 DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
 		       mask_gather_load, gather_load)
 
+DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, len_load, len_load)
+
 DEF_INTERNAL_OPTAB_FN (SCATTER_STORE, 0, scatter_store, scatter_store)
 DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
 		       mask_scatter_store, scatter_store)
@@ -136,6 +140,8 @@ DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
 DEF_INTERNAL_OPTAB_FN (MASK_STORE_LANES, 0,
 		       vec_mask_store_lanes, mask_store_lanes)
 
+DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
+
 DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
 DEF_INTERNAL_OPTAB_FN (CHECK_RAW_PTRS, ECF_CONST | ECF_NOTHROW,
 		       check_raw_ptrs, check_ptrs)
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 0c64eb52a8d..78409aa1453 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -435,3 +435,5 @@ OPTAB_D (check_war_ptrs_optab, "check_war_ptrs$a")
 OPTAB_DC (vec_duplicate_optab, "vec_duplicate$a", VEC_DUPLICATE)
 OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
 OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
+OPTAB_D (len_load_optab, "len_load_$a")
+OPTAB_D (len_store_optab, "len_store_$a")

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 2/7 v4] rs6000: lenload/lenstore optab support
  2020-06-10 12:39     ` [PATCH 2/7 V3] " Kewen.Lin
  2020-06-11 22:55       ` Segher Boessenkool
@ 2020-06-23  3:58       ` Kewen.Lin
  2020-06-29  6:32         ` [PATCH 2/7 v5] " Kewen.Lin
  1 sibling, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-06-23  3:58 UTC (permalink / raw)
  To: GCC Patches; +Cc: Bill Schmidt, Segher Boessenkool, dje.gcc

[-- Attachment #1: Type: text/plain, Size: 279 bytes --]

Hi,

V4: Update the define_expand names to follow the optab name changes.

V3: Update the define_expand following the optab changes.


BR,
Kewen
-----
gcc/ChangeLog:

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* config/rs6000/vsx.md (len_load_<mode>): New define_expand.
	(len_store_<mode>): Likewise.


[-- Attachment #2: rs6000_v4.diff --]
[-- Type: text/plain, Size: 1413 bytes --]

diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 2a28215ac5b..e443960afe3 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -5082,6 +5082,38 @@ (define_expand "stxvl"
   operands[3] = gen_reg_rtx (DImode);
 })
 
+;; Define optab for vector access with length vectorization exploitation.
+(define_expand "len_load_<mode>"
+  [(match_operand:VEC_A 0 "vlogical_operand")
+   (match_operand:VEC_A 1 "memory_operand")
+   (match_operand:QI 2 "gpc_reg_operand")]
+  "TARGET_P9_VECTOR && TARGET_64BIT"
+{
+  rtx mem = XEXP (operands[1], 0);
+  mem = force_reg (DImode, mem);
+  rtx len = gen_lowpart (DImode, operands[2]);
+  rtx res = gen_reg_rtx (V16QImode);
+  emit_insn (gen_lxvl (res, mem, len));
+  emit_move_insn (operands[0], gen_lowpart (<MODE>mode, res));
+  DONE;
+})
+
+(define_expand "len_store_<mode>"
+  [(match_operand:VEC_A 0 "memory_operand")
+   (match_operand:VEC_A 1 "vlogical_operand")
+   (match_operand:QI 2 "gpc_reg_operand")
+  ]
+  "TARGET_P9_VECTOR && TARGET_64BIT"
+{
+  rtx val = gen_reg_rtx (V16QImode);
+  emit_move_insn (val, gen_lowpart (V16QImode, operands[1]));
+  rtx mem = XEXP (operands[0], 0);
+  mem = force_reg (DImode, mem);
+  rtx len = gen_lowpart (DImode, operands[2]);
+  emit_insn (gen_stxvl (val, mem, len));
+  DONE;
+})
+
 (define_insn "*stxvl"
   [(set (mem:V16QI (match_operand:DI 1 "gpc_reg_operand" "b"))
 	(unspec:V16QI

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/7 V4] ifn/optabs: Support vector load/store with length
  2020-06-22 19:59           ` Richard Sandiford
  2020-06-22 22:19             ` Segher Boessenkool
  2020-06-23  3:54             ` [PATCH 1/7 v5] " Kewen.Lin
@ 2020-06-23  6:47             ` Richard Biener
  2 siblings, 0 replies; 80+ messages in thread
From: Richard Biener @ 2020-06-23  6:47 UTC (permalink / raw)
  To: Richard Sandiford
  Cc: Kewen.Lin, GCC Patches, Bill Schmidt, dje.gcc, Segher Boessenkool

On Mon, 22 Jun 2020, Richard Sandiford wrote:

> "Kewen.Lin" <linkw@linux.ibm.com> writes:
> > @@ -5167,6 +5167,24 @@ mode @var{n}.
> >  
> >  This pattern is not allowed to @code{FAIL}.
> >  
> > +@cindex @code{lenload@var{m}} instruction pattern
> > +@item @samp{lenload@var{m}}
> > +Perform a vector load with length from memory operand 1 of mode @var{m}
> > +into register operand 0.  Length is provided in register operand 2 with
> > +appropriate mode which should afford the maximal required precision of
> > +any available lengths.
> 
> I think we need to say in more detail what “load with length” actually
> means.  How about:
> 
>   Load the number of bytes specified by operand 2 from memory operand 1
>   into register operand 0, setting the other bytes of operand 0 to
>   undefined values.  Operands 0 and 1 have mode @var{m}.  Operand 2 has
>   whichever integer mode the target prefers.

IMHO it should also say whether length may be >= the mode byte size
(should we say "units" instead of bytes everywhere?), and if so what
the behavior is.  That is, whether explicit masking is required
or if there's implicit masking or whether values >= the mode byte size
will load all bytes.

Does the number of bytes have to be a multiple of the vector component
size?  That is, is there any difference between lenloadv4si and
lenloadv8hi?

> > +@cindex @code{lenstore@var{m}} instruction pattern
> > +@item @samp{lenstore@var{m}}
> > +Perform a vector store with length from register operand 1 of mode @var{m}
> > +into memory operand 0.  Length is provided in register operand 2 with
> > +appropriate mode which should afford the maximal required precision of
> > +any available lengths.
> 
> Similarly here:
> 
>   Store the number of bytes specified by operand 2 from nonmemory operand 1
>   into memory operand 0, leaving the other bytes of operand 0 unchanged.
>   Operands 0 and 1 have mode @var{m}.  Operand 2 has whichever integer
>   mode the target prefers.
> 
> > @@ -2478,7 +2480,7 @@ expand_call_mem_ref (tree type, gcall *stmt, int index)
> >    return fold_build2 (MEM_REF, type, addr, build_int_cst (alias_ptr_type, 0));
> >  }
> >  
> > -/* Expand MASK_LOAD{,_LANES} call STMT using optab OPTAB.  */
> > +/* Expand MASK_LOAD{,_LANES} and LEN_LOAD call STMT using optab OPTAB.  */
> 
> s/and/or/.
> 
> >  
> >  static void
> >  expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
> 
> Think it would be worth generalising the name, e.g. to
> expand_partial_load_optab_fn, and adding a #define for
> expand_mask_load_optab_fn before the other two #defines.
> 
> Same comments for stores.
> 
> > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > index 1d190d492ff..ed6561f296a 100644
> > --- a/gcc/internal-fn.def
> > +++ b/gcc/internal-fn.def
> > @@ -49,11 +49,13 @@ along with GCC; see the file COPYING3.  If not see
> >     - load_lanes: currently just vec_load_lanes
> >     - mask_load_lanes: currently just vec_mask_load_lanes
> >     - gather_load: used for {mask_,}gather_load
> > +   - len_load: currently just lenload
> >  
> >     - mask_store: currently just maskstore
> >     - store_lanes: currently just vec_store_lanes
> >     - mask_store_lanes: currently just vec_mask_store_lanes
> >     - scatter_store: used for {mask_,}scatter_store
> > +   - len_store: currently just lenstore
> >  
> >     - unary: a normal unary optab, such as vec_reverse_<mode>
> >     - binary: a normal binary optab, such as vec_interleave_lo_<mode>
> > @@ -127,6 +129,8 @@ DEF_INTERNAL_OPTAB_FN (GATHER_LOAD, ECF_PURE, gather_load, gather_load)
> >  DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
> >  		       mask_gather_load, gather_load)
> >  
> > +DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, lenload, len_load)
> > +
> >  DEF_INTERNAL_OPTAB_FN (SCATTER_STORE, 0, scatter_store, scatter_store)
> >  DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
> >  		       mask_scatter_store, scatter_store)
> > @@ -136,6 +140,8 @@ DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
> >  DEF_INTERNAL_OPTAB_FN (MASK_STORE_LANES, 0,
> >  		       vec_mask_store_lanes, mask_store_lanes)
> >  
> > +DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, lenstore, len_store)
> > +
> >  DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
> >  DEF_INTERNAL_OPTAB_FN (CHECK_RAW_PTRS, ECF_CONST | ECF_NOTHROW,
> >  		       check_raw_ptrs, check_ptrs)
> > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > index 0c64eb52a8d..9fe4ac1840d 100644
> > --- a/gcc/optabs.def
> > +++ b/gcc/optabs.def
> > @@ -435,3 +435,5 @@ OPTAB_D (check_war_ptrs_optab, "check_war_ptrs$a")
> >  OPTAB_DC (vec_duplicate_optab, "vec_duplicate$a", VEC_DUPLICATE)
> >  OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
> >  OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
> > +OPTAB_D (lenload_optab, "lenload$a")
> > +OPTAB_D (lenstore_optab, "lenstore$a")
> 
> Sorry, I should have picked up on this last time, but I think we should
> be consistent about whether there's an underscore after “len” or not.
> I realise this is just replicating what happens for IFN_MASK_LOAD/
> “maskload” and IFN_MASK_STORE/“maskstore”, but it's something I kept
> tripping over when implementing those for SVE.
> 
> Personally I think it is easier to read with the underscore, so this
> would be “len_load_optab” and “len_load$a” (or “len_load_$a”,
> there's no real consistency on that).  Same for stores.
> 
> Thanks,
> Richard
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/7 v5] ifn/optabs: Support vector load/store with length
  2020-06-23  3:54             ` [PATCH 1/7 v5] " Kewen.Lin
@ 2020-06-23  9:52               ` Richard Sandiford
  2020-06-23 11:25                 ` Richard Biener
  0 siblings, 1 reply; 80+ messages in thread
From: Richard Sandiford @ 2020-06-23  9:52 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Richard Biener, Bill Schmidt, dje.gcc, Segher Boessenkool

Things have moved on due to the IRC conversation, but…

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> on 2020/6/23 3:59 AM, Richard Sandiford wrote:
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>> @@ -5167,6 +5167,24 @@ mode @var{n}.
>>>  
>>>  This pattern is not allowed to @code{FAIL}.
>>>  
>>> +@cindex @code{lenload@var{m}} instruction pattern
>>> +@item @samp{lenload@var{m}}
>>> +Perform a vector load with length from memory operand 1 of mode @var{m}
>>> +into register operand 0.  Length is provided in register operand 2 with
>>> +appropriate mode which should afford the maximal required precision of
>>> +any available lengths.
>> 
>> I think we need to say in more detail what “load with length” actually
>> means.  How about:
>> 
>>   Load the number of bytes specified by operand 2 from memory operand 1
>>   into register operand 0, setting the other bytes of operand 0 to
>>   undefined values.  Operands 0 and 1 have mode @var{m}.  Operand 2 has
>>   whichever integer mode the target prefers.
>> 
>
> Thanks for nice wordings!  Updated, for "... to undefined values" I changed it
> to "... to undefined values or zeros" as Segher's comments to match the behavior
> on Power.

“set … to undefined values” means that the values are not defined by
the optab interface.  In other words, the target can set the bytes
to whatever it wants, and gimple code can't make any assumptions about
what the values of the bytes are.

So setting the bytes to zero (as Power does) would conform to the
interface.  So would leaving the bytes in operand 0 untouched.
So would using an instruction that really does leave the other
bytes with undefined values, etc.

So I think we should keep it as just “… to undefined values”.

The alternative would be to define the interface so that targets *must*
ensure that the other bytes are zeros.  But at the moment, the only
intended use of the optabs and ifns is for autovectorisation, and the
vectoriser won't care about the values of “inactive” bytes/lanes.
Forcing the target to set them to a specific value like zero would be
unnecessarily restrictive.
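
To make the contract concrete, here is a minimal C model of what a
conforming target may do (purely illustrative; this function is not
part of the patch):

  /* Minimal C model of the len_load contract for a 16-byte vector:
     only the first LEN bytes of DST are defined by the interface.  */
  void
  model_len_load (unsigned char dst[16], const unsigned char *src,
                  unsigned int len)
  {
    if (len > 16)
      len = 16;
    for (unsigned int i = 0; i < len; i++)
      dst[i] = src[i];
    /* Bytes dst[len..15] are unspecified: a target may zero them (as
       lxvl on Power does), leave them untouched, or write arbitrary
       values, and gimple code must not rely on any particular choice.  */
  }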

Thanks,
Richard

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/7 v5] ifn/optabs: Support vector load/store with length
  2020-06-23  9:52               ` Richard Sandiford
@ 2020-06-23 11:25                 ` Richard Biener
  2020-06-23 12:20                   ` Richard Sandiford
  0 siblings, 1 reply; 80+ messages in thread
From: Richard Biener @ 2020-06-23 11:25 UTC (permalink / raw)
  To: Kewen.Lin, GCC Patches, Richard Biener, Bill Schmidt,
	David Edelsohn, Segher Boessenkool, Richard Sandiford

On Tue, Jun 23, 2020 at 11:53 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Things have moved on due to the IRC conversation, but…
>
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
> > on 2020/6/23 3:59 AM, Richard Sandiford wrote:
> >> "Kewen.Lin" <linkw@linux.ibm.com> writes:
> >>> @@ -5167,6 +5167,24 @@ mode @var{n}.
> >>>
> >>>  This pattern is not allowed to @code{FAIL}.
> >>>
> >>> +@cindex @code{lenload@var{m}} instruction pattern
> >>> +@item @samp{lenload@var{m}}
> >>> +Perform a vector load with length from memory operand 1 of mode @var{m}
> >>> +into register operand 0.  Length is provided in register operand 2 with
> >>> +appropriate mode which should afford the maximal required precision of
> >>> +any available lengths.
> >>
> >> I think we need to say in more detail what “load with length” actually
> >> means.  How about:
> >>
> >>   Load the number of bytes specified by operand 2 from memory operand 1
> >>   into register operand 0, setting the other bytes of operand 0 to
> >>   undefined values.  Operands 0 and 1 have mode @var{m}.  Operand 2 has
> >>   whichever integer mode the target prefers.
> >>
> >
> > Thanks for nice wordings!  Updated, for "... to undefined values" I changed it
> > to "... to undefined values or zeros" as Segher's comments to match the behavior
> > on Power.
>
> “set … to undefined values” means that the values are not defined by
> the optab interface.  In other words, the target can set the bytes
> to whatever it wants, and gimple code can't make any assumptions about
> what the values of the bytes are.
>
> So setting the bytes to zero (as Power does) would conform to the
> interface.  So would leaving the bytes in operand 0 untouched.
> So would using an instruction that really does leave the other
> bytes with undefined values, etc.
>
> So I think we should keep it as just “… to undefined values”,
>
> The alternative would be to define the interface so that targets *must*
> ensure that the other bytes are zeros.  But at the moment, the only
> intended use of the optabs and ifns is for autovectorisation, and the
> vectoriser won't care about the values of “inactive” bytes/lanes.
> Forcing the target to set them to a specific value like zero would be
> unnecessarily restrictive.

Actually it _does_ care.  This is supposed to be used for fully masked
loops and 'unspecified values' would require us to explicitly zero
them for any FP op because of possible sNaN representations.  It
also precludes us from bitwise ORing in an appropriately masked
vector of 1s to make integer division happy (OK, no vector ISA supports
integer division).

So unless we have evidence that there exists an ISA that does _not_
zero the excess bits I'd rather specify it does.

Richard.

>
> Thanks,
> Richard

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/7 v5] ifn/optabs: Support vector load/store with length
  2020-06-23 11:25                 ` Richard Biener
@ 2020-06-23 12:20                   ` Richard Sandiford
  2020-06-24  2:40                     ` Jim Wilson
  2020-06-24 23:56                     ` [PATCH 1/7 v5] " Segher Boessenkool
  0 siblings, 2 replies; 80+ messages in thread
From: Richard Sandiford @ 2020-06-23 12:20 UTC (permalink / raw)
  To: Richard Biener
  Cc: Kewen.Lin, GCC Patches, Bill Schmidt, David Edelsohn,
	Segher Boessenkool, wilson

Richard Biener <richard.guenther@gmail.com> writes:
> On Tue, Jun 23, 2020 at 11:53 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Things have moved on due to the IRC conversation, but…
>>
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> > on 2020/6/23 上午3:59, Richard Sandiford wrote:
>> >> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> >>> @@ -5167,6 +5167,24 @@ mode @var{n}.
>> >>>
>> >>>  This pattern is not allowed to @code{FAIL}.
>> >>>
>> >>> +@cindex @code{lenload@var{m}} instruction pattern
>> >>> +@item @samp{lenload@var{m}}
>> >>> +Perform a vector load with length from memory operand 1 of mode @var{m}
>> >>> +into register operand 0.  Length is provided in register operand 2 with
>> >>> +appropriate mode which should afford the maximal required precision of
>> >>> +any available lengths.
>> >>
>> >> I think we need to say in more detail what “load with length” actually
>> >> means.  How about:
>> >>
>> >>   Load the number of bytes specified by operand 2 from memory operand 1
>> >>   into register operand 0, setting the other bytes of operand 0 to
>> >>   undefined values.  Operands 0 and 1 have mode @var{m}.  Operand 2 has
>> >>   whichever integer mode the target prefers.
>> >>
>> >
>> > Thanks for nice wordings!  Updated, for "... to undefined values" I changed it
>> > to "... to undefined values or zeros" as Segher's comments to match the behavior
>> > on Power.
>>
>> “set … to undefined values” means that the values are not defined by
>> the optab interface.  In other words, the target can set the bytes
>> to whatever it wants, and gimple code can't make any assumptions about
>> what the values of the bytes are.
>>
>> So setting the bytes to zero (as Power does) would conform to the
>> interface.  So would leaving the bytes in operand 0 untouched.
>> So would using an instruction that really does leave the other
>> bytes with undefined values, etc.
>>
>> So I think we should keep it as just “… to undefined values”,
>>
>> The alternative would be to define the interface so that targets *must*
>> ensure that the other bytes are zeros.  But at the moment, the only
>> intended use of the optabs and ifns is for autovectorisation, and the
>> vectoriser won't care about the values of “inactive” bytes/lanes.
>> Forcing the target to set them to a specific value like zero would be
>> unnecessarily restrictive.
>
> Actually it _does_ care.

I'd argue it doesn't, but for essentially the same reasons :-)

> This is supposed to be used for fully masked
> loops and 'unspecified values' would require us to explicitely zero
> them for any FP op because of possible sNaN representations.  It
> also precludes us from bitwise ORing in an appropriately masked
> vector of 1s to make integer division happy (OK, no vector ISA supports
> integer division).

Zeros would be a problem for FP division too.  And even if we require
loads to set inactive lanes to zero, we couldn't infer from that that
any given FP addition (say) won't raise an exception.  E.g. the inputs
could be the result of converting integers and adding them could trigger
an inexact exception.  Or the values could be the result of simple
bitcasts, giving arbitrary FP values.  (AIUI, current bfloat code
works this way.)

The vectoriser currently only allows potentially-trapping FP operations
on partial vectors if the target provides an appropriate IFN_COND_*
function.  (That's one of the main use cases for those functions.)
In other cases it requires the loop to operate on full vectors.
This should be relaxed in future to support inbranch partial
vectorisation of simd calls.

This means that the current patch series will/should simply punt
for “length”-based loop control if the loop contains FP operations
that (as far as gimple is concerned) might trap.
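
For instance (an illustrative example rather than code from the patch
series), even a plain FP loop like the following contains an addition
that gimple treats as potentially trapping under the default
-ftrapping-math:

  /* Per the above, this addition would not be vectorized with mask- or
     length-based partial vectors unless the target provides a suitable
     IFN_COND_ADD; otherwise the loop has to operate on full vectors.  */
  void
  vadd (double *restrict x, const double *restrict y,
        const double *restrict z, int n)
  {
    for (int i = 0; i < n; i++)
      x[i] = y[i] + z[i];
  }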

If we're thinking about how to relax that, then IMO it will need
to be done either at the level of each FP operation or by some
kind of “global” vectorisation subpass that introduces known-safe
values for inactive lanes.  The first would be easier, the second
would be more optimal.

I don't think that's specific to “length” vectorisation though.
The same concerns apply to if-converted loops that operate on full
vectors.  I think the approach would be essentially the same for both.

In that scenario, removing zeroing of an IFN_LEN_LOAD would “just” be
an optimisation, and could potentially be left to RTL code if necessary.
(But see my main point below.)

SVE supports integer division btw. :-)

> So unless we have evidence that there exists an ISA that does _not_
> zero the excess bits I'd rather specify it does.

I think the known architectures that might use this are:

- MVE
- Power
- RVV

MVE and Power both set inactive lanes to zero.  But I'm not sure about RVV.
AIUI, for RVV the approach instead would be to reduce the effective vector
length for the final iteration of the vector loop, and I'm not sure
whether in that situation it makes sense to say that the other elements
still exist and are guaranteed to be zero.

I'm the last person who should be speculating on that though.  Let's see
whether Jim has any comments.

In summary, I'm not saying we should never define the inactive values
to be zero.  I just think that we should leave it until it matters.
And I don't think it does/should matter for the current patch series.

IFN_MASK_LOAD has been around for quite a long time now and we've never
had to define the values of inactive lanes there.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/7 v5] ifn/optabs: Support vector load/store with length
  2020-06-23 12:20                   ` Richard Sandiford
@ 2020-06-24  2:40                     ` Jim Wilson
  2020-06-24  7:34                       ` Richard Sandiford
  2020-06-24 23:56                     ` [PATCH 1/7 v5] " Segher Boessenkool
  1 sibling, 1 reply; 80+ messages in thread
From: Jim Wilson @ 2020-06-24  2:40 UTC (permalink / raw)
  To: Richard Biener, Kewen.Lin, GCC Patches, Bill Schmidt,
	David Edelsohn, Segher Boessenkool, Jim Wilson,
	Richard Sandiford

On Tue, Jun 23, 2020 at 5:21 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
> MVE and Power both set inactive lanes to zero.  But I'm not sure about RVV.
> AIUI, for RVV the approach instead would be to reduce the effective vector
> length for the final iteration of the vector loop, and I'm not sure
> whether in that situation it makes sense to say that the other elements
> still exist and are guaranteed to be zero.
>
> I'm the last person who should be speculating on that though.  Let's see
> whether Jim has any comments.

The RVV spec supports two policies for tail elements, i.e. elements
beyond the current vector length.  They can be undisturbed or
agnostic.  In the undisturbed case, the tail elements retain their
old values.  In the agnostic case, the implementation can choose to
either retain their old values, or set them to all ones, and this
choice can be different from lane to lane.  The latter case is useful
because registers may be wider than the execution unit, and current
vector length may not be a multiple of the width of the execution
unit.  So for instance if the vector registers can hold 8 elements,
and the execution unit works on 4 elements at a time, and the current
vector length is 2, then it might make sense to leave the last four
elements unmodified to avoid an iteration across the registers, but
the third and fourth elements might be set to all ones because you
have to write to them anyways.  The choice is left up to the
implementation because we have multiple parties designing vector
units, and some are targeted at the low-cost embedded market and some
at high performance, and they couldn't agree on a single best
way to implement this.  The software is expected to choose agnostic
only if it doesn't care about what happens to tail elements, and
undisturbed if it wants to preserve them.  The value of all ones was
chosen to discourage software developers from trying to use the values
in tail elements.  The choice of undisturbed or agnostic can be
changed every time you set the current vector length and type.

In most cases, I think RVV programs will use agnostic for tail
elements, since we can change the vector length at will, and it will
be rare that we will care about elements beyond the current vector
length.

Tail elements can't cause exceptions so there is no need to worry
about whether those elements hold valid values.
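
A rough C model of the two policies (illustrative only, not RVV code;
lane_choice is a hypothetical stand-in for the per-lane freedom the
spec allows):

  /* Model of one register update: elements [0, vl) receive the computed
     results; tail elements [vl, vlmax) follow the chosen policy.  */
  enum tail_policy { TAIL_UNDISTURBED, TAIL_AGNOSTIC };

  void
  update_register (unsigned char reg[], const unsigned char result[],
                   unsigned vl, unsigned vlmax, enum tail_policy policy,
                   int (*lane_choice) (unsigned))
  {
    for (unsigned i = 0; i < vl; i++)
      reg[i] = result[i];
    if (policy == TAIL_AGNOSTIC)
      for (unsigned i = vl; i < vlmax; i++)
        if (lane_choice (i))
          reg[i] = 0xff;   /* may become all ones...  */
    /* ...or keep its old value; TAIL_UNDISTURBED always keeps the
       old values.  */
  }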

Jim

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/7 v5] ifn/optabs: Support vector load/store with length
  2020-06-24  2:40                     ` Jim Wilson
@ 2020-06-24  7:34                       ` Richard Sandiford
  2020-06-29  6:32                         ` [PATCH 1/7 v6] " Kewen.Lin
  0 siblings, 1 reply; 80+ messages in thread
From: Richard Sandiford @ 2020-06-24  7:34 UTC (permalink / raw)
  To: Jim Wilson
  Cc: Richard Biener, Kewen.Lin, GCC Patches, Bill Schmidt,
	David Edelsohn, Segher Boessenkool, Jim Wilson

Jim Wilson <jimw@sifive.com> writes:
> On Tue, Jun 23, 2020 at 5:21 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>> MVE and Power both set inactive lanes to zero.  But I'm not sure about RVV.
>> AIUI, for RVV the approach instead would be to reduce the effective vector
>> length for the final iteration of the vector loop, and I'm not sure
>> whether in that situation it makes sense to say that the other elements
>> still exist and are guaranteed to be zero.
>>
>> I'm the last person who should be speculating on that though.  Let's see
>> whether Jim has any comments.
>
> The RVV spec supports two policies for tail elements, i.e. elements
> beyond the current vector length.  They can be undisturbed or
> agnostic.  In the undisturbed case, the trail elements retain their
> old values.  In the agnostic case, the implementation can choose to
> either retain their old values, or set them to all ones, and this
> choice can be different from lane to lane.  The latter case is useful
> because registers may be wider than the execution unit, and current
> vector length may not be a multiple of the width of the execution
> unit.  So for instance if the vector registers can hold 8 elements,
> and the execution unit works on 4 elements at a time, and the current
> vector length is 2, then it might make sense to leave the last four
> elements unmodified to avoid an iteration across the registers, but
> the third and fourth elements might be set to all ones because you
> have to write to them anyways.  The choice is left up to the
> implementation because we have multiple parties designing vector
> units, and some are target for low cost embedded market, and some are
> target for high performance, and they couldn't agree on a single best
> way to implement this.  The software is expected to choose agnostic
> only if it doesn't care about what happens to tail elements, and
> undisturbed if you want to preserve them.  The value of all ones was
> chosen to discourage software developers from trying to use the values
> in tail elements.  The choice of undisturbed or agnostic can be
> changed every time you set the current vector length and type.
>
> In most cases, I think RVV programs will use agnostic for tail
> elements, since we can change the vector length at will, and it will
> be rare that we will care about elements beyond the current vector
> length.
>
> Tail elements can't cause exceptions so there is no need to worry
> about whether those elements hold valid values.

Thanks for the info.  Based on that, I guess GCC should leave the values
of extra inactive lanes undefined for now, so that the agnostic case
is supported.

Maybe in future we could have IFN_LEN_* versions of arithmetic
operations too, similar to the IFN_COND_* ones, so that they explicitly
ignore the inactive elements.

Richard

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/7 v5] ifn/optabs: Support vector load/store with length
  2020-06-23 12:20                   ` Richard Sandiford
  2020-06-24  2:40                     ` Jim Wilson
@ 2020-06-24 23:56                     ` Segher Boessenkool
  1 sibling, 0 replies; 80+ messages in thread
From: Segher Boessenkool @ 2020-06-24 23:56 UTC (permalink / raw)
  To: Richard Biener, Kewen.Lin, GCC Patches, Bill Schmidt,
	David Edelsohn, wilson, richard.sandiford

Hi!

On Tue, Jun 23, 2020 at 01:20:53PM +0100, Richard Sandiford wrote:
> SVE supports integer division btw. :-)

So does Power (ISA 3.1, power10).

> In summary, I'm not saying we should never define the inactive values
> to be zero.  I just think that we should leave it until it matters.
> And I don't think it does/should matter for the current patch series.

I am perfectly happy with that.  Thanks for looking at it!

> IFN_MASK_LOAD has been around for quite a long time now and we've never
> had to define the values of inactive lanes there.

Yeah, but typically the insns that consume the values loaded will use
the same masks again, so that may not be such a strong point.


Segher

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 1/7 v6] ifn/optabs: Support vector load/store with length
  2020-06-24  7:34                       ` Richard Sandiford
@ 2020-06-29  6:32                         ` Kewen.Lin
  2020-06-29 10:07                           ` Richard Sandiford
  0 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-06-29  6:32 UTC (permalink / raw)
  To: GCC Patches
  Cc: Jim Wilson, Richard Biener, Bill Schmidt, David Edelsohn,
	Segher Boessenkool, Jim Wilson, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 2285 bytes --]

Hi Richard S./Richi/Jim/Segher,

Thanks a lot for your comments to make this patch more solid.

Based on our discussion, for the vector load/store with length
optab, the length unit is measured in lanes by default.
Targets which support length measured in bytes, like Power,
should only define VnQI modes to wrap the other vector modes
of the same size.  If the length is larger than the total
lane/byte count of the given mode, all lanes/bytes are loaded
implicitly.  The remaining lanes/bytes not covered by the length
are treated as undefined values.  For length in bytes, the byte
count is required to be a multiple of the element size of the
wrapped vector; otherwise the behavior is undefined.
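
As an illustration of the byte-length flavor (not part of the patch;
the helper below is hypothetical): on Power a group of V4SI accesses is
wrapped as V16QI, so the length operand is the scalar count scaled by
the element size, capped at the vector's total count:

  /* Sketch of how a scalar count becomes the length operand on a
     byte-length target: e.g. V4SI wrapped as V16QI gives FACTOR == 4,
     while FACTOR == 1 means the target measures length in lanes.  The
     result never exceeds LIMIT, the lane/byte count of the mode.  */
  unsigned int
  length_operand (unsigned int nscalars, unsigned int factor,
                  unsigned int limit)
  {
    unsigned int len = nscalars * factor;
    return len < limit ? len : limit;
  }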

This patch has been updated as attached.

2/7 for the rs6000 optab definition has been updated to use V16QI.
5/7 for the vectorizer change has been updated accordingly.

-----

v6: Updated optab descriptions.

v5:
  - Updated lenload/lenstore optab to len_load/len_store and the docs.
  - Rename expand_mask_{load,store}_optab_fn to expand_partial_{load,store}_optab_fn
  - Added/updated macros for expand_mask_{load,store}_optab_fn
    and expand_len_{load,store}_optab_fn

v4: Update len_load_direct/len_store_direct to align with direct optab.

v3: Get rid of length mode hook.

BR,
Kewen
-----
gcc/ChangeLog:

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* doc/md.texi (len_load_@var{m}): Document.
	(len_store_@var{m}): Likewise.
	* internal-fn.c (len_load_direct): New macro.
	(len_store_direct): Likewise.
	(expand_len_load_optab_fn): Likewise.
	(expand_len_store_optab_fn): Likewise.
	(direct_len_load_optab_supported_p): Likewise.
	(direct_len_store_optab_supported_p): Likewise.
	(expand_mask_load_optab_fn): New macro.  Original renamed to ...
	(expand_partial_load_optab_fn): ... here.  Add handlings for
	len_load_optab.
	(expand_mask_store_optab_fn): New macro.  Original renamed to ...
	(expand_partial_store_optab_fn): ... here. Add handlings for
	len_store_optab.
	(internal_load_fn_p): Handle IFN_LEN_LOAD.
	(internal_store_fn_p): Handle IFN_LEN_STORE.
	(internal_fn_stored_value_index): Handle IFN_LEN_STORE.
	* internal-fn.def (LEN_LOAD): New internal function.
	(LEN_STORE): Likewise.
	* optabs.def (len_load_optab, len_store_optab): New optab.

[-- Attachment #2: ifn_v6.diff --]
[-- Type: text/plain, Size: 9707 bytes --]

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 2c67c818da5..690c384ff66 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5167,6 +5167,34 @@ mode @var{n}.
 
 This pattern is not allowed to @code{FAIL}.
 
+@cindex @code{len_load_@var{m}} instruction pattern
+@item @samp{len_load_@var{m}}
+Load the number of units specified by operand 2 from memory operand 1
+into register operand 0, setting the other bytes of operand 0 to
+undefined values.  Operands 0 and 1 have mode @var{m}.  Operand 2 has
+whichever integer mode the target prefers.  If operand 2 exceeds the
+maximum units of mode @var{m}, it will be set to the maximum units of
+mode @var{m}.  Targets which support length measured in bytes should
+only define the VnQI mode to wrap the other vector modes of the same
+size.  In that case, the byte count is required to be a multiple of
+the element size of the wrapped vector.
+
+This pattern is not allowed to @code{FAIL}.
+
+@cindex @code{len_store_@var{m}} instruction pattern
+@item @samp{len_store_@var{m}}
+Store the number of units specified by operand 2 from nonmemory operand 1
+into memory operand 0, leaving the other bytes of operand 0 unchanged.
+Operands 0 and 1 have mode @var{m}.  Operand 2 has whichever integer
+mode the target prefers.  If operand 2 exceeds the maximum units of mode
+@var{m}, it will be set to the maximum units of mode @var{m}.  Targets
+which support length measured in bytes should only define the VnQI mode
+to wrap the other vector modes of the same size.  In that case, the
+byte count is required to be a multiple of the element size of the
+wrapped vector.
+
+This pattern is not allowed to @code{FAIL}.
+
 @cindex @code{vec_perm@var{m}} instruction pattern
 @item @samp{vec_perm@var{m}}
 Output a (variable) vector permutation.  Operand 0 is the destination
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 4f088de48d5..1e53ced60eb 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -104,10 +104,12 @@ init_internal_fns ()
 #define load_lanes_direct { -1, -1, false }
 #define mask_load_lanes_direct { -1, -1, false }
 #define gather_load_direct { 3, 1, false }
+#define len_load_direct { -1, -1, false }
 #define mask_store_direct { 3, 2, false }
 #define store_lanes_direct { 0, 0, false }
 #define mask_store_lanes_direct { 0, 0, false }
 #define scatter_store_direct { 3, 1, false }
+#define len_store_direct { 3, 3, false }
 #define unary_direct { 0, 0, true }
 #define binary_direct { 0, 0, true }
 #define ternary_direct { 0, 0, true }
@@ -2478,10 +2480,10 @@ expand_call_mem_ref (tree type, gcall *stmt, int index)
   return fold_build2 (MEM_REF, type, addr, build_int_cst (alias_ptr_type, 0));
 }
 
-/* Expand MASK_LOAD{,_LANES} call STMT using optab OPTAB.  */
+/* Expand MASK_LOAD{,_LANES} or LEN_LOAD call STMT using optab OPTAB.  */
 
 static void
-expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
+expand_partial_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 {
   class expand_operand ops[3];
   tree type, lhs, rhs, maskt;
@@ -2497,6 +2499,8 @@ expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 
   if (optab == vec_mask_load_lanes_optab)
     icode = get_multi_vector_move (type, optab);
+  else if (optab == len_load_optab)
+    icode = direct_optab_handler (optab, TYPE_MODE (type));
   else
     icode = convert_optab_handler (optab, TYPE_MODE (type),
 				   TYPE_MODE (TREE_TYPE (maskt)));
@@ -2507,18 +2511,24 @@ expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
   target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   create_output_operand (&ops[0], target, TYPE_MODE (type));
   create_fixed_operand (&ops[1], mem);
-  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
+  if (optab == len_load_optab)
+    create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
+				 TYPE_UNSIGNED (TREE_TYPE (maskt)));
+  else
+    create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
   expand_insn (icode, 3, ops);
   if (!rtx_equal_p (target, ops[0].value))
     emit_move_insn (target, ops[0].value);
 }
 
+#define expand_mask_load_optab_fn expand_partial_load_optab_fn
 #define expand_mask_load_lanes_optab_fn expand_mask_load_optab_fn
+#define expand_len_load_optab_fn expand_partial_load_optab_fn
 
-/* Expand MASK_STORE{,_LANES} call STMT using optab OPTAB.  */
+/* Expand MASK_STORE{,_LANES} or LEN_STORE call STMT using optab OPTAB.  */
 
 static void
-expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
+expand_partial_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 {
   class expand_operand ops[3];
   tree type, lhs, rhs, maskt;
@@ -2532,6 +2542,8 @@ expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 
   if (optab == vec_mask_store_lanes_optab)
     icode = get_multi_vector_move (type, optab);
+  else if (optab == len_store_optab)
+    icode = direct_optab_handler (optab, TYPE_MODE (type));
   else
     icode = convert_optab_handler (optab, TYPE_MODE (type),
 				   TYPE_MODE (TREE_TYPE (maskt)));
@@ -2542,11 +2554,17 @@ expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
   reg = expand_normal (rhs);
   create_fixed_operand (&ops[0], mem);
   create_input_operand (&ops[1], reg, TYPE_MODE (type));
-  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
+  if (optab == len_store_optab)
+    create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
+				 TYPE_UNSIGNED (TREE_TYPE (maskt)));
+  else
+    create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
   expand_insn (icode, 3, ops);
 }
 
+#define expand_mask_store_optab_fn expand_partial_store_optab_fn
 #define expand_mask_store_lanes_optab_fn expand_mask_store_optab_fn
+#define expand_len_store_optab_fn expand_partial_store_optab_fn
 
 static void
 expand_ABNORMAL_DISPATCHER (internal_fn, gcall *)
@@ -3128,10 +3146,12 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
 #define direct_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_gather_load_optab_supported_p convert_optab_supported_p
+#define direct_len_load_optab_supported_p direct_optab_supported_p
 #define direct_mask_store_optab_supported_p convert_optab_supported_p
 #define direct_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_scatter_store_optab_supported_p convert_optab_supported_p
+#define direct_len_store_optab_supported_p direct_optab_supported_p
 #define direct_while_optab_supported_p convert_optab_supported_p
 #define direct_fold_extract_optab_supported_p direct_optab_supported_p
 #define direct_fold_left_optab_supported_p direct_optab_supported_p
@@ -3498,6 +3518,7 @@ internal_load_fn_p (internal_fn fn)
     case IFN_MASK_LOAD_LANES:
     case IFN_GATHER_LOAD:
     case IFN_MASK_GATHER_LOAD:
+    case IFN_LEN_LOAD:
       return true;
 
     default:
@@ -3517,6 +3538,7 @@ internal_store_fn_p (internal_fn fn)
     case IFN_MASK_STORE_LANES:
     case IFN_SCATTER_STORE:
     case IFN_MASK_SCATTER_STORE:
+    case IFN_LEN_STORE:
       return true;
 
     default:
@@ -3577,6 +3599,7 @@ internal_fn_stored_value_index (internal_fn fn)
     case IFN_MASK_STORE:
     case IFN_SCATTER_STORE:
     case IFN_MASK_SCATTER_STORE:
+    case IFN_LEN_STORE:
       return 3;
 
     default:
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 1d190d492ff..17dac128e83 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -49,11 +49,13 @@ along with GCC; see the file COPYING3.  If not see
    - load_lanes: currently just vec_load_lanes
    - mask_load_lanes: currently just vec_mask_load_lanes
    - gather_load: used for {mask_,}gather_load
+   - len_load: currently just len_load
 
    - mask_store: currently just maskstore
    - store_lanes: currently just vec_store_lanes
    - mask_store_lanes: currently just vec_mask_store_lanes
    - scatter_store: used for {mask_,}scatter_store
+   - len_store: currently just len_store
 
    - unary: a normal unary optab, such as vec_reverse_<mode>
    - binary: a normal binary optab, such as vec_interleave_lo_<mode>
@@ -127,6 +129,8 @@ DEF_INTERNAL_OPTAB_FN (GATHER_LOAD, ECF_PURE, gather_load, gather_load)
 DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
 		       mask_gather_load, gather_load)
 
+DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, len_load, len_load)
+
 DEF_INTERNAL_OPTAB_FN (SCATTER_STORE, 0, scatter_store, scatter_store)
 DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
 		       mask_scatter_store, scatter_store)
@@ -136,6 +140,8 @@ DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
 DEF_INTERNAL_OPTAB_FN (MASK_STORE_LANES, 0,
 		       vec_mask_store_lanes, mask_store_lanes)
 
+DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
+
 DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
 DEF_INTERNAL_OPTAB_FN (CHECK_RAW_PTRS, ECF_CONST | ECF_NOTHROW,
 		       check_raw_ptrs, check_ptrs)
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 0c64eb52a8d..78409aa1453 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -435,3 +435,5 @@ OPTAB_D (check_war_ptrs_optab, "check_war_ptrs$a")
 OPTAB_DC (vec_duplicate_optab, "vec_duplicate$a", VEC_DUPLICATE)
 OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
 OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
+OPTAB_D (len_load_optab, "len_load_$a")
+OPTAB_D (len_store_optab, "len_store_$a")

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 2/7 v5] rs6000: lenload/lenstore optab support
  2020-06-23  3:58       ` [PATCH 2/7 v4] " Kewen.Lin
@ 2020-06-29  6:32         ` Kewen.Lin
  2020-06-29 17:57           ` Segher Boessenkool
  0 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-06-29  6:32 UTC (permalink / raw)
  To: GCC Patches
  Cc: Bill Schmidt, dje.gcc, Segher Boessenkool, Richard Biener,
	Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 291 bytes --]

Hi,

V5: Like V4.

V4: Update the define_expand names to match the optab name changes.

V3: Update the define_expands to match the optab changes.

BR,
Kewen
------
gcc/ChangeLog:

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* config/rs6000/vsx.md (len_load_v16qi): New define_expand.
	(len_store_v16qi): Likewise.


[-- Attachment #2: rs6000_v5.diff --]
[-- Type: text/plain, Size: 1224 bytes --]

diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 2a28215ac5b..fe85f60c681 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -5082,6 +5082,34 @@ (define_expand "stxvl"
   operands[3] = gen_reg_rtx (DImode);
 })
 
+;; Define optab for vector access with length vectorization exploitation.
+(define_expand "len_load_v16qi"
+  [(match_operand:V16QI 0 "vlogical_operand")
+   (match_operand:V16QI 1 "memory_operand")
+   (match_operand:QI 2 "gpc_reg_operand")]
+  "TARGET_P9_VECTOR && TARGET_64BIT"
+{
+  rtx mem = XEXP (operands[1], 0);
+  mem = force_reg (DImode, mem);
+  rtx len = gen_lowpart (DImode, operands[2]);
+  emit_insn (gen_lxvl (operands[0], mem, len));
+  DONE;
+})
+
+(define_expand "len_store_v16qi"
+  [(match_operand:V16QI 0 "memory_operand")
+   (match_operand:V16QI 1 "vlogical_operand")
+   (match_operand:QI 2 "gpc_reg_operand")
+  ]
+  "TARGET_P9_VECTOR && TARGET_64BIT"
+{
+  rtx mem = XEXP (operands[0], 0);
+  mem = force_reg (DImode, mem);
+  rtx len = gen_lowpart (DImode, operands[2]);
+  emit_insn (gen_stxvl (operands[1], mem, len));
+  DONE;
+})
+
 (define_insn "*stxvl"
   [(set (mem:V16QI (match_operand:DI 1 "gpc_reg_operand" "b"))
 	(unspec:V16QI

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 5/7 v6] vect: Support vector load/store with length in vectorizer
  2020-06-22  8:33                     ` [PATCH 5/7 v5] " Kewen.Lin
@ 2020-06-29  6:33                       ` Kewen.Lin
  2020-06-30 19:53                         ` Richard Sandiford
  0 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-06-29  6:33 UTC (permalink / raw)
  To: GCC Patches
  Cc: Bill Schmidt, Richard Biener, Segher Boessenkool, dje.gcc,
	Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 3471 bytes --]

Hi,

v6 changes against v5:
  - As the len_load/store optabs changed, added function can_vec_len_load_store_p
    and vect_get_same_size_vec_for_len.
  - Updated several places like vectorizable_load/store for the optab changes.

v5 changes against v4:
  - Updated the conditions of clearing LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P
    in vectorizable_condition (which fixed the aarch64 regression failure).
  - Rebased and updated some macro and function names as the
    renaming/refactoring patch.
  - Updated some comments and dumpings.

v4 changes against v3:
  - split out some renaming and refactoring.
  - use QImode for length.
  - update the iv type determination.
  - introduce factor into rgroup_controls.
  - use using_partial_vectors_p for both approaches.

Bootstrapped/regtested on aarch64-linux-gnu and powerpc64le-linux-gnu P9.
Even with explicit vect-with-length-scope settings 1/2, I didn't find
any remarkable failures (only some trivial test case issues).

Is it ok for trunk?

BR,
Kewen
----
gcc/ChangeLog

	* doc/invoke.texi (vect-with-length-scope): Document new option.
	* optabs-query.c (can_vec_len_load_store_p): New function.
	* optabs-query.h (can_vec_len_load_store_p): New declare.
	* params.opt (vect-with-length-scope): New.
	* tree-vect-loop-manip.c (vect_set_loop_controls_directly): Add the
	handlings for vectorization using length-based partial vectors, call
	vect_gen_len for length generation.
	(vect_set_loop_condition_partial_vectors): Add the handlings for
	vectorization using length-based partial vectors.
	(vect_do_peeling): Allow remaining eiters less than epilogue vf for
	LOOP_VINFO_USING_PARTIAL_VECTORS_P.
	* tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Init
	epil_using_partial_vectors_p.
	(_loop_vec_info::~_loop_vec_info): Call release_vec_loop_controls
	for lengths destruction.
	(vect_verify_loop_lens): New function.
	(vect_analyze_loop_2): Add the check to allow only one vectorization
	approach using partial vectorization at the same time.  Check
	loop-wide reasons using length-based partial vectors decision.  Mark
	LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P if the epilogue is
	considerable to use length-based approach.  Call
	release_vec_loop_controls for lengths destruction.
	(vect_analyze_loop): Add handlings for epilogue of loop when it's
	marked to use vectorization using partial vectors.
	(vect_estimate_min_profitable_iters): Adjust for loop vectorization
	using length-based partial vectors.
	(vect_record_loop_mask): Init factor to 1 for vectorization using
	mask-based partial vectors.
	(vect_record_loop_len): New function.
	(vect_get_loop_len): New function.
	* tree-vect-stmts.c (check_load_store_for_partial_vectors): Add
	checks for vectorization using length-based partial vectors.
	(vect_get_same_size_vec_for_len): New function.
	(vectorizable_store): Add handlings when using length-based partial
	vectors.
	(vectorizable_load): Likewise.
	(vectorizable_condition): Add some checks to disable vectorization
	using partial vectors for reduction.
	(vect_gen_len): New function.
	* tree-vectorizer.h (struct rgroup_controls): Add field factor
	mainly for length-based partial vectors.
	(vec_loop_lens): New typedef.
	(_loop_vec_info): Add lens and epil_using_partial_vectors_p.
	(LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P): New macro.
	(LOOP_VINFO_LENS): Likewise.
	(LOOP_VINFO_FULLY_WITH_LENGTH_P): Likewise.
	(vect_record_loop_len): New declare.
	(vect_get_loop_len): Likewise.
	(vect_gen_len): Likewise.

[-- Attachment #2: vector_v6.diff --]
[-- Type: text/plain, Size: 41854 bytes --]

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 06a04e3d7dd..284c15705ea 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -13389,6 +13389,13 @@ by the copy loop headers pass.
 @item vect-epilogues-nomask
 Enable loop epilogue vectorization using smaller vector size.
 
+@item vect-with-length-scope
+Control the scope of vector memory access with length exploitation.  0 means
+we don't exploit any vector memory access with length, 1 means we only exploit
+vector memory access with length for those loops whose iteration count is
+less than VF, such as very small loops or epilogues, 2 means we exploit
+vector memory access with length for any loop if possible.
+
 @item slp-max-insns-in-bb
 Maximum number of instructions in basic block to be
 considered for SLP vectorization.
diff --git a/gcc/optabs-query.c b/gcc/optabs-query.c
index 215d68e4225..9c351759204 100644
--- a/gcc/optabs-query.c
+++ b/gcc/optabs-query.c
@@ -606,6 +606,60 @@ can_vec_mask_load_store_p (machine_mode mode,
   return false;
 }
 
+/* Return true if the target supports vector load/store with length for
+   vector mode MODE.  There are two flavors of vector load/store with
+   length: one measures the length in bytes, the other in lanes.  As the
+   len_{load,store} optabs specify, for the flavor in bytes we use VnQI
+   to wrap the other supportable vector modes of the same size.  On return,
+   *FACTOR indicates whether VnQI wrapping is used: a value greater than 1
+   gives the number of bytes per element of the wrapped vector mode.  */
+
+bool
+can_vec_len_load_store_p (machine_mode mode, bool is_load, unsigned int *factor)
+{
+  optab op = is_load ? len_load_optab : len_store_optab;
+  gcc_assert (VECTOR_MODE_P (mode));
+
+  /* Check if length in lanes supported for this mode directly.  */
+  if (direct_optab_handler (op, mode))
+    {
+      *factor = 1;
+      return true;
+    }
+
+  /* Check if length in bytes supported for VnQI with the same vector size.  */
+  scalar_mode emode = QImode;
+  poly_uint64 esize = GET_MODE_SIZE (emode);
+  poly_uint64 vsize = GET_MODE_SIZE (mode);
+  poly_uint64 nunits;
+
+  /* To get how many nunits it would have if the element is QImode.  */
+  if (multiple_p (vsize, esize, &nunits))
+    {
+      machine_mode vmode;
+      /* Check whether the related VnQI vector mode exists, as well as
+	 optab supported.  */
+      if (related_vector_mode (mode, emode, nunits).exists (&vmode)
+	  && direct_optab_handler (op, vmode))
+	{
+	  unsigned int mul;
+	  scalar_mode orig_emode = GET_MODE_INNER (mode);
+	  poly_uint64 orig_esize = GET_MODE_SIZE (orig_emode);
+
+	  if (constant_multiple_p (orig_esize, esize, &mul))
+	    *factor = mul;
+	  else
+	    gcc_unreachable ();
+
+	  return true;
+	}
+    }
+  else
+    gcc_unreachable ();
+
+  return false;
+}
+
 /* Return true if there is a compare_and_swap pattern.  */
 
 bool
diff --git a/gcc/optabs-query.h b/gcc/optabs-query.h
index 729e1fdfc81..9db9c91994a 100644
--- a/gcc/optabs-query.h
+++ b/gcc/optabs-query.h
@@ -188,6 +188,7 @@ enum insn_code find_widening_optab_handler_and_mode (optab, machine_mode,
 						     machine_mode *);
 int can_mult_highpart_p (machine_mode, bool);
 bool can_vec_mask_load_store_p (machine_mode, machine_mode, bool);
+bool can_vec_len_load_store_p (machine_mode, bool, unsigned int *);
 bool can_compare_and_swap_p (machine_mode, bool);
 bool can_atomic_exchange_p (machine_mode, bool);
 bool can_atomic_load_p (machine_mode);
diff --git a/gcc/params.opt b/gcc/params.opt
index 9b564bb046c..daa6e8a2beb 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -968,4 +968,8 @@ Bound on number of runtime checks inserted by the vectorizer's loop versioning f
 Common Joined UInteger Var(param_vect_max_version_for_alignment_checks) Init(6) Param Optimization
 Bound on number of runtime checks inserted by the vectorizer's loop versioning for alignment check.
 
+-param=vect-with-length-scope=
+Common Joined UInteger Var(param_vect_with_length_scope) Init(0) IntegerRange(0, 2) Param Optimization
+Control the scope of vector access with length exploitation.
+
 ; This comment is to ensure we retain the blank line above.
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 458a6675c47..9b9bfb88b1a 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -399,19 +399,20 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
 
    It is known that:
 
-     NITERS * RGC->max_nscalars_per_iter
+     NITERS * RGC->max_nscalars_per_iter * RGC->factor
 
    does not overflow.  However, MIGHT_WRAP_P says whether an induction
    variable that starts at 0 and has step:
 
-     VF * RGC->max_nscalars_per_iter
+     VF * RGC->max_nscalars_per_iter * RGC->factor
 
    might overflow before hitting a value above:
 
-     (NITERS + NITERS_SKIP) * RGC->max_nscalars_per_iter
+     (NITERS + NITERS_SKIP) * RGC->max_nscalars_per_iter * RGC->factor
 
    This means that we cannot guarantee that such an induction variable
-   would ever hit a value that produces a set of all-false masks for RGC.  */
+   would ever hit a value that produces a set of all-false masks or zero
+   lengths for RGC.  */
 
 static tree
 vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
@@ -422,10 +423,20 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 {
   tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
   tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
+  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
+
   tree ctrl_type = rgc->type;
-  unsigned int nscalars_per_iter = rgc->max_nscalars_per_iter;
+  /* Scale up nscalars per iteration with factor.  */
+  unsigned int nscalars_per_iter_ft = rgc->max_nscalars_per_iter * rgc->factor;
   poly_uint64 nscalars_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type);
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  tree length_limit = NULL_TREE;
+  /* For length, we need length_limit to check length in range.  */
+  if (!vect_for_masking)
+    {
+      poly_uint64 len_limit = nscalars_per_ctrl * rgc->factor;
+      length_limit = build_int_cst (compare_type, len_limit);
+    }
 
   /* Calculate the maximum number of scalar values that the rgroup
      handles in total, the number that it handles for each iteration
@@ -434,12 +445,12 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
   tree nscalars_total = niters;
   tree nscalars_step = build_int_cst (iv_type, vf);
   tree nscalars_skip = niters_skip;
-  if (nscalars_per_iter != 1)
+  if (nscalars_per_iter_ft != 1)
     {
       /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
 	 these multiplications don't overflow.  */
-      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
-      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
+      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter_ft);
+      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter_ft);
       nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
 				     nscalars_total, compare_factor);
       nscalars_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
@@ -509,7 +520,7 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 	     NSCALARS_SKIP to that cannot overflow.  */
 	  tree const_limit = build_int_cst (compare_type,
 					    LOOP_VINFO_VECT_FACTOR (loop_vinfo)
-					    * nscalars_per_iter);
+					    * nscalars_per_iter_ft);
 	  first_limit = gimple_build (preheader_seq, MIN_EXPR, compare_type,
 				      nscalars_total, const_limit);
 	  first_limit = gimple_build (preheader_seq, PLUS_EXPR, compare_type,
@@ -549,16 +560,16 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
     {
       /* Previous controls will cover BIAS scalars.  This control covers the
 	 next batch.  */
-      poly_uint64 bias = nscalars_per_ctrl * i;
+      poly_uint64 batch_nscalars_ft = nscalars_per_ctrl * rgc->factor;
+      poly_uint64 bias = batch_nscalars_ft * i;
       tree bias_tree = build_int_cst (compare_type, bias);
-      gimple *tmp_stmt;
 
       /* See whether the first iteration of the vector loop is known
 	 to have a full control.  */
       poly_uint64 const_limit;
       bool first_iteration_full
 	= (poly_int_tree_p (first_limit, &const_limit)
-	   && known_ge (const_limit, (i + 1) * nscalars_per_ctrl));
+	   && known_ge (const_limit, (i + 1) * batch_nscalars_ft));
 
       /* Rather than have a new IV that starts at BIAS and goes up to
 	 TEST_LIMIT, prefer to use the same 0-based IV for each control
@@ -598,9 +609,19 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 	      end = first_limit;
 	    }
 
-	  init_ctrl = make_temp_ssa_name (ctrl_type, NULL, "max_mask");
-	  tmp_stmt = vect_gen_while (init_ctrl, start, end);
-	  gimple_seq_add_stmt (preheader_seq, tmp_stmt);
+	  if (vect_for_masking)
+	    {
+	      init_ctrl = make_temp_ssa_name (ctrl_type, NULL, "max_mask");
+	      gimple *tmp_stmt = vect_gen_while (init_ctrl, start, end);
+	      gimple_seq_add_stmt (preheader_seq, tmp_stmt);
+	    }
+	  else
+	    {
+	      init_ctrl = make_temp_ssa_name (compare_type, NULL, "max_len");
+	      gimple_seq seq = vect_gen_len (init_ctrl, start,
+					     end, length_limit);
+	      gimple_seq_add_seq (preheader_seq, seq);
+	    }
 	}
 
       /* Now AND out the bits that are within the number of skipped
@@ -617,16 +638,32 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 				      init_ctrl, unskipped_mask);
 	  else
 	    init_ctrl = unskipped_mask;
+	  gcc_assert (vect_for_masking);
 	}
 
+      /* First iteration is full.  */
       if (!init_ctrl)
-	/* First iteration is full.  */
-	init_ctrl = build_minus_one_cst (ctrl_type);
+	{
+	  if (vect_for_masking)
+	    init_ctrl = build_minus_one_cst (ctrl_type);
+	  else
+	    init_ctrl = length_limit;
+	}
 
       /* Get the control value for the next iteration of the loop.  */
-      next_ctrl = make_temp_ssa_name (ctrl_type, NULL, "next_mask");
-      gcall *call = vect_gen_while (next_ctrl, test_index, this_test_limit);
-      gsi_insert_before (test_gsi, call, GSI_SAME_STMT);
+      if (vect_for_masking)
+	{
+	  next_ctrl = make_temp_ssa_name (ctrl_type, NULL, "next_mask");
+	  gcall *call = vect_gen_while (next_ctrl, test_index, this_test_limit);
+	  gsi_insert_before (test_gsi, call, GSI_SAME_STMT);
+	}
+      else
+	{
+	  next_ctrl = make_temp_ssa_name (compare_type, NULL, "next_len");
+	  gimple_seq seq = vect_gen_len (next_ctrl, test_index, this_test_limit,
+					 length_limit);
+	  gsi_insert_seq_before (test_gsi, seq, GSI_SAME_STMT);
+	}
 
       vect_set_loop_control (loop, ctrl, init_ctrl, next_ctrl);
     }
@@ -652,6 +689,7 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
   gimple_seq preheader_seq = NULL;
   gimple_seq header_seq = NULL;
 
+  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
   tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
   unsigned int compare_precision = TYPE_PRECISION (compare_type);
   tree orig_niters = niters;
@@ -686,28 +724,30 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
   tree test_ctrl = NULL_TREE;
   rgroup_controls *rgc;
   unsigned int i;
-  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
-  FOR_EACH_VEC_ELT (*masks, i, rgc)
+  auto_vec<rgroup_controls> *controls = vect_for_masking
+					  ? &LOOP_VINFO_MASKS (loop_vinfo)
+					  : &LOOP_VINFO_LENS (loop_vinfo);
+  FOR_EACH_VEC_ELT (*controls, i, rgc)
     if (!rgc->controls.is_empty ())
       {
 	/* First try using permutes.  This adds a single vector
 	   instruction to the loop for each mask, but needs no extra
 	   loop invariants or IVs.  */
 	unsigned int nmasks = i + 1;
-	if ((nmasks & 1) == 0)
+	if (vect_for_masking && (nmasks & 1) == 0)
 	  {
-	    rgroup_controls *half_rgc = &(*masks)[nmasks / 2 - 1];
+	    rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
 	    if (!half_rgc->controls.is_empty ()
 		&& vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc))
 	      continue;
 	  }
 
 	/* See whether zero-based IV would ever generate all-false masks
-	   before wrapping around.  */
+	   or zero length before wrapping around.  */
+	unsigned nscalars_ft = rgc->max_nscalars_per_iter * rgc->factor;
 	bool might_wrap_p
 	  = (iv_limit == -1
-	     || (wi::min_precision (iv_limit * rgc->max_nscalars_per_iter,
-				    UNSIGNED)
+	     || (wi::min_precision (iv_limit * nscalars_ft, UNSIGNED)
 		 > compare_precision));
 
 	/* Set up all controls for this group.  */
@@ -2568,7 +2608,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   if (vect_epilogues
       && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
       && prolog_peeling >= 0
-      && known_eq (vf, lowest_vf))
+      && known_eq (vf, lowest_vf)
+      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (epilogue_vinfo))
     {
       unsigned HOST_WIDE_INT eiters
 	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 6311e795204..1079807534b 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -816,6 +816,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     vectorizable (false),
     can_use_partial_vectors_p (true),
     using_partial_vectors_p (false),
+    epil_using_partial_vectors_p (false),
     peeling_for_gaps (false),
     peeling_for_niter (false),
     no_data_dependencies (false),
@@ -898,6 +899,7 @@ _loop_vec_info::~_loop_vec_info ()
   free (bbs);
 
   release_vec_loop_controls (&masks);
+  release_vec_loop_controls (&lens);
   delete ivexpr_map;
   delete scan_map;
   epilogue_vinfos.release ();
@@ -1072,6 +1074,88 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
   return true;
 }
 
+/* Check whether we can use vector access with length based on precision
+   comparison.  So far, to keep it simple, we only allow the case that the
+   precision of the target supported length is larger than the precision
+   required by loop niters.  */
+
+static bool
+vect_verify_loop_lens (loop_vec_info loop_vinfo)
+{
+  vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
+
+  if (LOOP_VINFO_LENS (loop_vinfo).is_empty ())
+    return false;
+
+  /* The one which has the largest NV should have max bytes per iter.  */
+  rgroup_controls *rgl = &(*lens)[lens->length () - 1];
+
+  /* Work out how many bits we need to represent the length limit.  */
+  unsigned int nscalars_per_iter_ft = rgl->max_nscalars_per_iter * rgl->factor;
+  unsigned int min_ni_prec
+    = vect_min_prec_for_max_niters (loop_vinfo, nscalars_per_iter_ft);
+
+  /* Now use the maximum of the precisions below for one suitable IV type:
+     - the IV's natural precision
+     - the precision needed to hold: the maximum number of scalar
+       iterations multiplied by the scale factor (min_ni_prec above)
+     - the Pmode precision
+  */
+
+  /* If min_ni_prec is less than the precision of the current niters,
+     we prefer to still use the niters type.  */
+  unsigned int ni_prec
+    = TYPE_PRECISION (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)));
+  /* Prefer to use Pmode and wider IV to avoid narrow conversions.  */
+  unsigned int pmode_prec = GET_MODE_BITSIZE (Pmode);
+
+  unsigned int required_prec = ni_prec;
+  if (required_prec < pmode_prec)
+    required_prec = pmode_prec;
+
+  tree iv_type = NULL_TREE;
+  if (min_ni_prec > required_prec)
+    {
+      opt_scalar_int_mode tmode_iter;
+      unsigned standard_bits = 0;
+      FOR_EACH_MODE_IN_CLASS (tmode_iter, MODE_INT)
+      {
+	scalar_mode tmode = tmode_iter.require ();
+	unsigned int tbits = GET_MODE_BITSIZE (tmode);
+
+	/* ??? Do we really want to construct one IV whose precision exceeds
+	   BITS_PER_WORD?  */
+	if (tbits > BITS_PER_WORD)
+	  break;
+
+	/* Find the first available standard integral type.  */
+	if (tbits >= min_ni_prec && targetm.scalar_mode_supported_p (tmode))
+	  {
+	    standard_bits = tbits;
+	    break;
+	  }
+      }
+      if (standard_bits != 0)
+	iv_type = build_nonstandard_integer_type (standard_bits, true);
+    }
+  else
+    iv_type = build_nonstandard_integer_type (required_prec, true);
+
+  if (!iv_type)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't vectorize with length-based partial vectors"
+			 " due to no suitable iv type.\n");
+      return false;
+    }
+
+  LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = iv_type;
+  LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
+
+  return true;
+}
+
 /* Calculate the cost of one scalar iteration of the loop.  */
 static void
 vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo)
@@ -2170,11 +2254,64 @@ start_over:
       return ok;
     }
 
-  /* Decide whether to use a fully-masked loop for this vectorization
-     factor.  */
-  LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
-    = (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
-       && vect_verify_full_masking (loop_vinfo));
+  /* For now, we don't expect to mix both masking and length approaches for one
+     loop; disable partial vectors if both are recorded.  */
+  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+      && !LOOP_VINFO_MASKS (loop_vinfo).is_empty ()
+      && !LOOP_VINFO_LENS (loop_vinfo).is_empty ())
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't vectorize a loop with partial vectors"
+			 " because we don't expect to mix different"
+			 " approaches with partial vectors for the"
+			 " same loop.\n");
+      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+    }
+
+  /* Decide whether to vectorize a loop with partial vectors for
+     this vectorization factor.  */
+  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+    {
+      /* Decide whether to use fully-masked approach.  */
+      if (vect_verify_full_masking (loop_vinfo))
+	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
+      /* Decide whether to use length-based approach.  */
+      else if (vect_verify_loop_lens (loop_vinfo))
+	{
+	  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+	      || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				 "can't vectorize this loop with length-based"
+				 " partial vectors approach because peeling"
+				 " for alignment or gaps is required.\n");
+	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	    }
+	  else if (param_vect_with_length_scope == 0)
+	    LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	  /* The epilogue and other cases with known niters less than VF
+	     can still fully use vector access with length.  */
+	  else if (param_vect_with_length_scope == 1
+		   && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+		   && !vect_known_niters_smaller_than_vf (loop_vinfo))
+	    {
+	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	      LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
+	    }
+	  else
+	    {
+	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
+	      LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	    }
+	}
+      else
+	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+    }
+  else
+    LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+
   if (dump_enabled_p ())
     {
       if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
@@ -2183,6 +2320,15 @@ start_over:
       else
 	dump_printf_loc (MSG_NOTE, vect_location,
 			 "not using a fully-masked loop.\n");
+
+      if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "using length-based partial"
+			 " vectors for the loop fully.\n");
+      else
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "not using length-based partial"
+			 " vectors for the loop fully.\n");
     }
 
   /* If epilog loop is required because of data accesses with gaps,
@@ -2406,6 +2552,7 @@ again:
     = init_cost (LOOP_VINFO_LOOP (loop_vinfo));
   /* Reset accumulated rgroup information.  */
   release_vec_loop_controls (&LOOP_VINFO_MASKS (loop_vinfo));
+  release_vec_loop_controls (&LOOP_VINFO_LENS (loop_vinfo));
   /* Reset assorted flags.  */
   LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
   LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) = false;
@@ -2692,7 +2839,10 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 		lowest_th = ordered_min (lowest_th, th);
 	    }
 	  else
-	    delete loop_vinfo;
+	    {
+	      delete loop_vinfo;
+	      loop_vinfo = opt_loop_vec_info::success (NULL);
+	    }
 
 	  /* Only vectorize epilogues if PARAM_VECT_EPILOGUES_NOMASK is
 	     enabled, SIMDUID is not set, it is the innermost loop and we have
@@ -2717,6 +2867,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
       else
 	{
 	  delete loop_vinfo;
+	  loop_vinfo = opt_loop_vec_info::success (NULL);
 	  if (fatal)
 	    {
 	      gcc_checking_assert (first_loop_vinfo == NULL);
@@ -2724,6 +2875,23 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 	    }
 	}
 
+      /* Handle the case in which the original loop can use partial
+	 vectorization, but we only want to adopt it for the epilogue.
+	 The retry should be in the same mode as the original.  */
+      if (vect_epilogues
+	  && loop_vinfo
+	  && LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo))
+	{
+	  gcc_assert (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+		      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo));
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "***** Re-trying analysis with same vector mode"
+			     " %s for epilogue with partial vectors.\n",
+			     GET_MODE_NAME (loop_vinfo->vector_mode));
+	  continue;
+	}
+
       if (mode_i < vector_modes.length ()
 	  && VECTOR_MODE_P (autodetected_vector_mode)
 	  && (related_vector_mode (vector_modes[mode_i],
@@ -3564,6 +3732,11 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 			    target_cost_data, num_masks - 1, vector_stmt,
 			    NULL, NULL_TREE, 0, vect_body);
     }
+  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      peel_iters_prologue = 0;
+      peel_iters_epilogue = 0;
+    }
   else if (npeel < 0)
     {
       peel_iters_prologue = assumed_vf / 2;
@@ -8197,6 +8370,7 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
     {
       rgm->max_nscalars_per_iter = nscalars_per_iter;
       rgm->type = truth_type_for (vectype);
+      rgm->factor = 1;
     }
 }
 
@@ -8249,6 +8423,63 @@ vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
   return mask;
 }
 
+/* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS
+   lengths for vector access with length that each control a vector of type
+   VECTYPE.  FACTOR is only meaningful when the length is measured in bytes,
+   and indicates the number of bytes occupied by each element (lane).  */
+
+void
+vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
+		      unsigned int nvectors, tree vectype, unsigned int factor)
+{
+  gcc_assert (nvectors != 0);
+  if (lens->length () < nvectors)
+    lens->safe_grow_cleared (nvectors);
+  rgroup_controls *rgl = &(*lens)[nvectors - 1];
+
+  /* The number of scalars per iteration, the bytes occupied by each scalar
+     and the number of vectors are all compile-time constants.  */
+  unsigned int nscalars_per_iter
+    = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
+		 LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
+
+  if (rgl->max_nscalars_per_iter < nscalars_per_iter)
+    {
+      rgl->max_nscalars_per_iter = nscalars_per_iter;
+      rgl->type = vectype;
+      rgl->factor = factor;
+    }
+}
+
+/* Given a complete set of lengths LENS, extract length number INDEX for an
+   rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.  */
+
+tree
+vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
+		   unsigned int nvectors, unsigned int index)
+{
+  rgroup_controls *rgl = &(*lens)[nvectors - 1];
+
+  /* Populate the rgroup's len array, if this is the first time we've
+     used it.  */
+  if (rgl->controls.is_empty ())
+    {
+      rgl->controls.safe_grow_cleared (nvectors);
+      for (unsigned int i = 0; i < nvectors; ++i)
+	{
+	  tree len_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
+	  gcc_assert (len_type != NULL_TREE);
+	  tree len = make_temp_ssa_name (len_type, NULL, "loop_len");
+
+	  /* Provide a dummy definition until the real one is available.  */
+	  SSA_NAME_DEF_STMT (len) = gimple_build_nop ();
+	  rgl->controls[i] = len;
+	}
+    }
+
+  return rgl->controls[index];
+}
+
 /* Scale profiling counters by estimation for LOOP which is vectorized
    by factor VF.  */
 
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index cdd6f6c5e5d..e0ffbab1d02 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -1742,29 +1742,56 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
       return;
     }
 
-  machine_mode mask_mode;
-  if (!VECTOR_MODE_P (vecmode)
-      || !targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
-      || !can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
+  if (!VECTOR_MODE_P (vecmode))
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "can't use a fully-masked loop because the target"
-			 " doesn't have the appropriate masked load or"
-			 " store.\n");
+			 "can't operate on partial vectors because of"
+			 " the unexpected mode.\n");
       LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
       return;
     }
-  /* We might load more scalars than we need for permuting SLP loads.
-     We checked in get_group_load_store_type that the extra elements
-     don't leak into a new vector.  */
+
   poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   unsigned int nvectors;
-  if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
-    vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype, scalar_mask);
-  else
-    gcc_unreachable ();
+
+  machine_mode mask_mode;
+  bool using_partial_vectors_p = false;
+  if (targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
+      && can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
+    {
+      /* We might load more scalars than we need for permuting SLP loads.
+	 We checked in get_group_load_store_type that the extra elements
+	 don't leak into a new vector.  */
+      if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
+	vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype,
+			       scalar_mask);
+      else
+	gcc_unreachable ();
+      using_partial_vectors_p = true;
+    }
+
+  unsigned int factor;
+  if (can_vec_len_load_store_p (vecmode, is_load, &factor))
+    {
+      vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
+      if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
+	vect_record_loop_len (loop_vinfo, lens, nvectors, vectype, factor);
+      else
+	gcc_unreachable ();
+      using_partial_vectors_p = true;
+    }
+
+  if (!using_partial_vectors_p)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't operate on partial vectors because the"
+			 " target doesn't have the appropriate partial"
+			 " vectorization load or store.\n");
+      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+    }
 }
 
 /* Return the mask input to a masked load or store.  VEC_MASK is the vectorized
@@ -6936,6 +6963,28 @@ vectorizable_scan_store (vec_info *vinfo,
   return true;
 }
 
+/* For the vector type VTYPE, return the same-size vector type with
+   QImode elements, which is mainly used for vector load/store with
+   length in bytes.  */
+
+static tree
+vect_get_same_size_vec_for_len (tree vtype)
+{
+  gcc_assert (VECTOR_TYPE_P (vtype));
+  machine_mode v_mode = TYPE_MODE (vtype);
+  gcc_assert (GET_MODE_INNER (v_mode) != QImode);
+
+  /* Obtain new element counts with QImode.  */
+  poly_uint64 vsize = GET_MODE_SIZE (v_mode);
+  poly_uint64 esize = GET_MODE_SIZE (QImode);
+  poly_uint64 nelts = exact_div (vsize, esize);
+
+  /* Build element type with QImode.  */
+  unsigned int eprec = GET_MODE_PRECISION (QImode);
+  tree etype = build_nonstandard_integer_type (eprec, 1);
+
+  return build_vector_type (etype, nelts);
+}
 
 /* Function vectorizable_store.
 
@@ -7655,6 +7704,14 @@ vectorizable_store (vec_info *vinfo,
     = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
        ? &LOOP_VINFO_MASKS (loop_vinfo)
        : NULL);
+  vec_loop_lens *loop_lens
+    = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+       ? &LOOP_VINFO_LENS (loop_vinfo)
+       : NULL);
+
+  /* Shouldn't go with length-based approach if fully masked.  */
+  gcc_assert (!loop_lens || (loop_lens && !loop_masks));
+
   /* Targets with store-lane instructions must not require explicit
      realignment.  vect_supportable_dr_alignment always returns either
      dr_aligned or dr_unaligned_supported for masked operations.  */
@@ -7911,10 +7968,16 @@ vectorizable_store (vec_info *vinfo,
 	      unsigned HOST_WIDE_INT align;
 
 	      tree final_mask = NULL_TREE;
+	      tree final_len = NULL_TREE;
 	      if (loop_masks)
 		final_mask = vect_get_loop_mask (gsi, loop_masks,
 						 vec_num * ncopies,
 						 vectype, vec_num * j + i);
+	      else if (loop_lens)
+		final_len = vect_get_loop_len (loop_vinfo, loop_lens,
+					       vec_num * ncopies,
+					       vec_num * j + i);
+
 	      if (vec_mask)
 		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
 						      vec_mask, gsi);
@@ -7994,6 +8057,34 @@ vectorizable_store (vec_info *vinfo,
 		  vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
 		  new_stmt = call;
 		}
+	      else if (final_len)
+		{
+		  align = least_bit_hwi (misalign | align);
+		  tree ptr = build_int_cst (ref_type, align);
+		  tree vtype = TREE_TYPE (vec_oprnd);
+		  /* Need conversion if it's wrapped with VnQI.  */
+		  if (!direct_optab_handler (len_store_optab,
+					     TYPE_MODE (vtype)))
+		    {
+		      tree new_vtype = vect_get_same_size_vec_for_len (vtype);
+		      tree var
+			= vect_get_new_ssa_name (new_vtype, vect_simple_var);
+		      vec_oprnd
+			= build1 (VIEW_CONVERT_EXPR, new_vtype, vec_oprnd);
+		      gassign *new_stmt
+			= gimple_build_assign (var, VIEW_CONVERT_EXPR,
+					       vec_oprnd);
+		      vect_finish_stmt_generation (vinfo, stmt_info, new_stmt,
+						   gsi);
+		      vec_oprnd = var;
+		    }
+		  gcall *call
+		    = gimple_build_call_internal (IFN_LEN_STORE, 4, dataref_ptr,
+						  ptr, final_len, vec_oprnd);
+		  gimple_call_set_nothrow (call, true);
+		  vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
+		  new_stmt = call;
+		}
 	      else
 		{
 		  data_ref = fold_build2 (MEM_REF, vectype,
@@ -8531,6 +8622,7 @@ vectorizable_load (vec_info *vinfo,
       tree dr_offset;
 
       gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
+      gcc_assert (!LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
       gcc_assert (!nested_in_vect_loop);
 
       if (grouped_load)
@@ -8819,6 +8911,14 @@ vectorizable_load (vec_info *vinfo,
     = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
        ? &LOOP_VINFO_MASKS (loop_vinfo)
        : NULL);
+  vec_loop_lens *loop_lens
+    = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+       ? &LOOP_VINFO_LENS (loop_vinfo)
+       : NULL);
+
+  /* Shouldn't go with length-based approach if fully masked.  */
+  gcc_assert (!loop_lens || (loop_lens && !loop_masks));
+
   /* Targets with store-lane instructions must not require explicit
      realignment.  vect_supportable_dr_alignment always returns either
      dr_aligned or dr_unaligned_supported for masked operations.  */
@@ -9134,11 +9234,18 @@ vectorizable_load (vec_info *vinfo,
 	  for (i = 0; i < vec_num; i++)
 	    {
 	      tree final_mask = NULL_TREE;
+	      tree final_len = NULL_TREE;
 	      if (loop_masks
 		  && memory_access_type != VMAT_INVARIANT)
 		final_mask = vect_get_loop_mask (gsi, loop_masks,
 						 vec_num * ncopies,
 						 vectype, vec_num * j + i);
+	      else if (loop_lens
+		  && memory_access_type != VMAT_INVARIANT)
+		final_len = vect_get_loop_len (loop_vinfo, loop_lens,
+					       vec_num * ncopies,
+					       vec_num * j + i);
+
 	      if (vec_mask)
 		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
 						      vec_mask, gsi);
@@ -9207,6 +9314,35 @@ vectorizable_load (vec_info *vinfo,
 			new_stmt = call;
 			data_ref = NULL_TREE;
 		      }
+		    else if (final_len)
+		      {
+			align = least_bit_hwi (misalign | align);
+			tree ptr = build_int_cst (ref_type, align);
+			gcall *call
+			  = gimple_build_call_internal (IFN_LEN_LOAD, 3,
+							dataref_ptr, ptr,
+							final_len);
+			gimple_call_set_nothrow (call, true);
+			new_stmt = call;
+			data_ref = NULL_TREE;
+
+			/* Need conversion if it's wrapped with VnQI.  */
+			if (!direct_optab_handler (len_load_optab,
+						   TYPE_MODE (vectype)))
+			  {
+			    tree new_vtype
+			      = vect_get_same_size_vec_for_len (vectype);
+			    tree var = vect_get_new_ssa_name (new_vtype,
+							      vect_simple_var);
+			    gimple_set_lhs (call, var);
+			    vect_finish_stmt_generation (vinfo, stmt_info, call,
+							 gsi);
+			    tree op = build1 (VIEW_CONVERT_EXPR, vectype, var);
+			    new_stmt
+			      = gimple_build_assign (vec_dest,
+						     VIEW_CONVERT_EXPR, op);
+			  }
+		      }
 		    else
 		      {
 			tree ltype = vectype;
@@ -9850,11 +9986,30 @@ vectorizable_condition (vec_info *vinfo,
 	  return false;
 	}
 
-      if (loop_vinfo
-	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
-	  && reduction_type == EXTRACT_LAST_REDUCTION)
-	vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
-			       ncopies * vec_num, vectype, NULL);
+      if (loop_vinfo && for_reduction
+	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+	{
+	  if (reduction_type == EXTRACT_LAST_REDUCTION)
+	    vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
+				   ncopies * vec_num, vectype, NULL);
+	  /* Using partial vectors can introduce inactive lanes in the last
+	     iteration, and since the full vector of condition results is
+	     operated on, it's unsafe here.  But if we can AND the condition
+	     mask with the loop mask, it would be safe then.  */
+	  else if (!loop_vinfo->scalar_cond_masked_set.is_empty ())
+	    {
+	      scalar_cond_masked_key cond (cond_expr, ncopies * vec_num);
+	      if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
+		{
+		  bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
+		  cond.code = invert_tree_comparison (cond.code, honor_nans);
+		  if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
+		    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+		}
+	    }
+	  else
+	    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	}
 
       STMT_VINFO_TYPE (stmt_info) = condition_vec_info_type;
       vect_model_simple_cost (vinfo, stmt_info, ncopies, dts, ndts, slp_node,
@@ -11910,3 +12065,36 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
   *nunits_vectype_out = nunits_vectype;
   return opt_result::success ();
 }
+
+/* Generate and return a statement sequence that sets vector length LEN, that is:
+
+   min_of_start_and_end = min (START_INDEX, END_INDEX);
+   left_len = END_INDEX - min_of_start_and_end;
+   rhs = min (left_len, LEN_LIMIT);
+   LEN = rhs;
+
+   TODO: for now, the rs6000 vector-with-length support only looks at the low
+   8 bits of the length, which means that a left_len larger than 255 bytes
+   can't simply be saturated to the vector limit (vector size).  A target hook
+   can be provided if other ports don't have this restriction.
+*/
+
+gimple_seq
+vect_gen_len (tree len, tree start_index, tree end_index, tree len_limit)
+{
+  gimple_seq stmts = NULL;
+  tree len_type = TREE_TYPE (len);
+  gcc_assert (TREE_TYPE (start_index) == len_type);
+
+  tree min = fold_build2 (MIN_EXPR, len_type, start_index, end_index);
+  tree left_len = fold_build2 (MINUS_EXPR, len_type, end_index, min);
+  left_len = fold_build2 (MIN_EXPR, len_type, left_len, len_limit);
+
+  tree rhs = force_gimple_operand (left_len, &stmts, true, NULL_TREE);
+  gimple *new_stmt = gimple_build_assign (len, rhs);
+  gimple_stmt_iterator i = gsi_last (stmts);
+  gsi_insert_after_without_update (&i, new_stmt, GSI_CONTINUE_LINKING);
+
+  return stmts;
+}
+
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 6c830ad09f4..4155ffe1d49 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -417,6 +417,16 @@ is_a_helper <_bb_vec_info *>::test (vec_info *i)
    are compile-time constants but VF and nL can be variable (if the target
    supports variable-length vectors).
 
+   Moreover, an approach with partial vectors that is controlled by length
+   (in bytes) has to care about the number of bytes occupied by each scalar.
+   Provided that each scalar occupies factor bytes, the total number of
+   bytes becomes factor * N and the above equation becomes:
+
+       factor * N = factor * NS * VF = factor * NV * NL
+
+   where factor * NS is the number of bytes handled per iteration and
+   factor * NL is the vector size in bytes.
+
    In classical vectorization, each iteration of the vector loop would
    handle exactly VF iterations of the original scalar loop.  However,
    in vector loops that are able to operate on partial vectors, a
@@ -473,14 +483,19 @@ is_a_helper <_bb_vec_info *>::test (vec_info *i)
    first level being indexed by nV - 1 (since nV == 0 doesn't exist) and
    the second being indexed by the mask index 0 <= i < nV.  */
 
-/* The controls (like masks) needed by rgroups with nV vectors,
+/* The controls (like masks or lengths) needed by rgroups with nV vectors,
    according to the description above.  */
 struct rgroup_controls {
   /* The largest nS for all rgroups that use these controls.  */
   unsigned int max_nscalars_per_iter;
 
-  /* The type of control to use, based on the highest nS recorded above.
-     For mask-based approach, it's used for mask_type.  */
+  /* For now, this is mainly used by the length-based (in bytes) approach;
+     it records the number of bytes occupied by each scalar.  */
+  unsigned int factor;
+
+  /* The type of control to use, based on the highest nS recorded above.
+     For the mask-based approach, it is the mask type to use.
+     For the length-based approach, it is the appropriate vector type.  */
   tree type;
 
   /* A vector of nV controls, in iteration order.  */
@@ -489,6 +504,8 @@ struct rgroup_controls {
 
 typedef auto_vec<rgroup_controls> vec_loop_masks;
 
+typedef auto_vec<rgroup_controls> vec_loop_lens;
+
 typedef auto_vec<std::pair<data_reference*, tree> > drs_init_vec;
 
 /*-----------------------------------------------------------------*/
@@ -536,6 +553,10 @@ public:
      on inactive scalars.  */
   vec_loop_masks masks;
 
+  /* The lengths that a loop with length-based partial vectors should use
+     to avoid operating on inactive scalars.  */
+  vec_loop_lens lens;
+
   /* Set of scalar conditions that have loop mask applied.  */
   scalar_cond_masked_set_type scalar_cond_masked_set;
 
@@ -644,6 +665,10 @@ public:
      the vector loop can handle fewer than VF scalars.  */
   bool using_partial_vectors_p;
 
+  /* True if we've decided to use partially-populated vectors for the
+     epilogue of this loop; for now, only the length-based approach uses it.  */
+  bool epil_using_partial_vectors_p;
+
   /* When we have grouped data accesses with gaps, we may introduce invalid
      memory accesses.  We peel the last iteration of the loop to prevent
      this.  */
@@ -707,9 +732,12 @@ public:
 #define LOOP_VINFO_VECTORIZABLE_P(L)       (L)->vectorizable
 #define LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P(L) (L)->can_use_partial_vectors_p
 #define LOOP_VINFO_USING_PARTIAL_VECTORS_P(L) (L)->using_partial_vectors_p
+#define LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P(L)                             \
+  (L)->epil_using_partial_vectors_p
 #define LOOP_VINFO_VECT_FACTOR(L)          (L)->vectorization_factor
 #define LOOP_VINFO_MAX_VECT_FACTOR(L)      (L)->max_vectorization_factor
 #define LOOP_VINFO_MASKS(L)                (L)->masks
+#define LOOP_VINFO_LENS(L)                 (L)->lens
 #define LOOP_VINFO_MASK_SKIP_NITERS(L)     (L)->mask_skip_niters
 #define LOOP_VINFO_RGROUP_COMPARE_TYPE(L)  (L)->rgroup_compare_type
 #define LOOP_VINFO_RGROUP_IV_TYPE(L)       (L)->rgroup_iv_type
@@ -747,6 +775,10 @@ public:
   (LOOP_VINFO_USING_PARTIAL_VECTORS_P (L)	\
    && !LOOP_VINFO_MASKS (L).is_empty ())
 
+#define LOOP_VINFO_FULLY_WITH_LENGTH_P(L)	\
+  (LOOP_VINFO_USING_PARTIAL_VECTORS_P (L)	\
+   && !LOOP_VINFO_LENS (L).is_empty ())
+
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L)	\
   ((L)->may_misalign_stmts.length () > 0)
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIAS(L)		\
@@ -1866,6 +1898,11 @@ extern void vect_record_loop_mask (loop_vec_info, vec_loop_masks *,
 				   unsigned int, tree, tree);
 extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
 				unsigned int, tree, unsigned int);
+extern void vect_record_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
+				  tree, unsigned int);
+extern tree vect_get_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
+			       unsigned int);
+extern gimple_seq vect_gen_len (tree, tree, tree, tree);
 extern stmt_vec_info info_for_reduction (vec_info *, stmt_vec_info);
 
 /* Drive for loop transformation stage.  */

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/7 v6] ifn/optabs: Support vector load/store with length
  2020-06-29  6:32                         ` [PATCH 1/7 v6] " Kewen.Lin
@ 2020-06-29 10:07                           ` Richard Sandiford
  2020-06-29 10:39                             ` [PATCH 1/7 v7] " Kewen.Lin
  0 siblings, 1 reply; 80+ messages in thread
From: Richard Sandiford @ 2020-06-29 10:07 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Jim Wilson, Richard Biener, Bill Schmidt,
	David Edelsohn, Segher Boessenkool, Jim Wilson

Thanks for the update.  I agree with the summary of the IRC discussion
except for…

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi Richard S./Richi/Jim/Segher,
>
> Thanks a lot for your comments to make this patch more solid.
>
> Based on our discussion, for the vector load/store with length
> optab, the length unit would be measured in lanes by default.
> For the targets which support length measured in bytes like Power,
> they should only define VnQI modes to wrap the other same size
> vector modes.  If the length is larger than total lane/byte count
> of the given mode, it's taken to load all lanes/bytes implicitly.

…this last bit.  IMO the behaviour of the optab should be undefined
when the supplied length is greater than the number of lanes.

I think that also makes things better for the lxvl implementation,
which ignores the upper 56 bits of the length.  It sounds like the
above semantics would instead require Power to saturate the value
at 255 before shifting it.
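
To illustrate why that matters, here's a rough sketch of the effective
lxvl semantics (not the ISA wording, and assuming the usual sequence that
shifts the length into the top byte of the register beforehand):

	nbytes = MIN (len & 0xff, 16);	/* upper 56 bits of len are lost */

so a length of e.g. 256 would end up loading 0 bytes rather than a full
vector, unless we first clamped the value.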

Richard

> For the remaining lanes/bytes which aren't specified by length,
> they would be taken as undefined values.  For length in bytes,
> it's required that the byte count should be a multiple of the
> element size (wrapped vector), otherwise it's undefined.
>
> This patch has been updated as attached.
>
2/7 for the rs6000 optab definition has been updated to use V16QI.
5/7 for the vectorizer change has been updated accordingly.
>
> -----
>
> v6: Updated optab descriptions.
>
> v5:
>   - Updated lenload/lenstore optab to len_load/len_store and the docs.
>   - Rename expand_mask_{load,store}_optab_fn to expand_partial_{load,store}_optab_fn
>   - Added/updated macros for expand_mask_{load,store}_optab_fn
>     and expand_len_{load,store}_optab_fn
>
> v4: Update len_load_direct/len_store_direct to align with direct optab.
>
> v3: Get rid of length mode hook.
>
> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
> 2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>
>
> 	* doc/md.texi (len_load_@var{m}): Document.
> 	(len_store_@var{m}): Likewise.
> 	* internal-fn.c (len_load_direct): New macro.
> 	(len_store_direct): Likewise.
> 	(expand_len_load_optab_fn): Likewise.
> 	(expand_len_store_optab_fn): Likewise.
> 	(direct_len_load_optab_supported_p): Likewise.
> 	(direct_len_store_optab_supported_p): Likewise.
> 	(expand_mask_load_optab_fn): New macro.  Original renamed to ...
> 	(expand_partial_load_optab_fn): ... here.  Add handlings for
> 	len_load_optab.
> 	(expand_mask_store_optab_fn): New macro.  Original renamed to ...
> 	(expand_partial_store_optab_fn): ... here. Add handlings for
> 	len_store_optab.
> 	(internal_load_fn_p): Handle IFN_LEN_LOAD.
> 	(internal_store_fn_p): Handle IFN_LEN_STORE.
> 	(internal_fn_stored_value_index): Handle IFN_LEN_STORE.
> 	* internal-fn.def (LEN_LOAD): New internal function.
> 	(LEN_STORE): Likewise.
> 	* optabs.def (len_load_optab, len_store_optab): New optab.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 1/7 v7] ifn/optabs: Support vector load/store with length
  2020-06-29 10:07                           ` Richard Sandiford
@ 2020-06-29 10:39                             ` Kewen.Lin
  2020-06-30 15:32                               ` Richard Sandiford
  0 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-06-29 10:39 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford
  Cc: Jim Wilson, Richard Biener, Bill Schmidt, David Edelsohn,
	Segher Boessenkool, Jim Wilson

[-- Attachment #1: Type: text/plain, Size: 2872 bytes --]

Hi Richard,

Thanks for the comments!

on 2020/6/29 6:07 PM, Richard Sandiford wrote:
> Thanks for the update.  I agree with the summary of the IRC discussion
> except for…
> 
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> Hi Richard S./Richi/Jim/Segher,
>>
>> Thanks a lot for your comments to make this patch more solid.
>>
>> Based on our discussion, for the vector load/store with length
>> optab, the length unit would be measured in lanes by default.
>> For the targets which support length measured in bytes like Power,
>> they should only define VnQI modes to wrap the other same size
>> vector modes.  If the length is larger than total lane/byte count
>> of the given mode, it's taken to load all lanes/bytes implicitly.
> 
> …this last bit.  IMO the behaviour of the optab should be undefined
> when the supplied length is greater than the number of lanes.
> 
> I think that also makes things better for the lxvl implementation,
> which ignores the upper 56 bits of the length.  It sounds like the
> above semantics would instead require Power to saturate the value
> at 255 before shifting it.
> 

Good catch, I just realized that this part is inconsistent with what I
implemented in patch 5/7, where the function vect_gen_len still does
the min operation between the given length and length_limit.

This patch is updated accordingly to state that the behavior is undefined.
The others aren't required to change.
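
For reference, at the gimple level the two new IFNs look like the
following (hand-written illustration rather than an actual dump; the
operand order matches what patch 5/7 generates: pointer, alignment hint,
length, and additionally the stored value for the store):

  vect__1 = .LEN_LOAD (dataref_ptr, 16B, loop_len_);
  .LEN_STORE (dataref_ptr, 16B, loop_len_, vect__2);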

Could you have a further look? Thanks in advance!

v6/v7: Updated optab descriptions.

v5:
  - Updated lenload/lenstore optab to len_load/len_store and the docs.
  - Rename expand_mask_{load,store}_optab_fn to expand_partial_{load,store}_optab_fn
  - Added/updated macros for expand_mask_{load,store}_optab_fn
    and expand_len_{load,store}_optab_fn

v4: Update len_load_direct/len_store_direct to align with direct optab.

v3: Get rid of length mode hook.

BR,
Kewen
-----
gcc/ChangeLog:

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* doc/md.texi (len_load_@var{m}): Document.
	(len_store_@var{m}): Likewise.
	* internal-fn.c (len_load_direct): New macro.
	(len_store_direct): Likewise.
	(expand_len_load_optab_fn): Likewise.
	(expand_len_store_optab_fn): Likewise.
	(direct_len_load_optab_supported_p): Likewise.
	(direct_len_store_optab_supported_p): Likewise.
	(expand_mask_load_optab_fn): New macro.  Original renamed to ...
	(expand_partial_load_optab_fn): ... here.  Add handlings for
	len_load_optab.
	(expand_mask_store_optab_fn): New macro.  Original renamed to ...
	(expand_partial_store_optab_fn): ... here. Add handlings for
	len_store_optab.
	(internal_load_fn_p): Handle IFN_LEN_LOAD.
	(internal_store_fn_p): Handle IFN_LEN_STORE.
	(internal_fn_stored_value_index): Handle IFN_LEN_STORE.
	* internal-fn.def (LEN_LOAD): New internal function.
	(LEN_STORE): Likewise.
	* optabs.def (len_load_optab, len_store_optab): New optab.

[-- Attachment #2: ifn_v7.diff --]
[-- Type: text/plain, Size: 9654 bytes --]

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 2c67c818da5..c8d7bcc7f62 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5167,6 +5167,33 @@ mode @var{n}.
 
 This pattern is not allowed to @code{FAIL}.
 
+@cindex @code{len_load_@var{m}} instruction pattern
+@item @samp{len_load_@var{m}}
+Load the number of units specified by operand 2 from memory operand 1
+into register operand 0, setting the other bytes of operand 0 to
+undefined values.  Operands 0 and 1 have mode @var{m}.  Operand 2 has
+whichever integer mode the target prefers.  If operand 2 exceeds the
+maximum units of mode @var{m}, the behavior is undefined.  For targets
+which support length measured in bytes, they should only define VnQI
+mode to wrap the other vector modes with the same size.  Meanwhile,
+it's required that the byte count should be a multiple of the element
+size (wrapped vector).
+
+This pattern is not allowed to @code{FAIL}.
+
+@cindex @code{len_store_@var{m}} instruction pattern
+@item @samp{len_store_@var{m}}
+Store the number of units specified by operand 2 from nonmemory operand 1
+into memory operand 0, leaving the other bytes of operand 0 unchanged.
+Operands 0 and 1 have mode @var{m}.  Operand 2 has whichever integer
+mode the target prefers.  If operand 2 exceeds the maximum units of mode
+@var{m}, the behavior is undefined.  For targets which support length
+measured in bytes, they should only define VnQI mode to wrap the other
+vector modes with the same size.  Meanwhile, it's required that the byte
+count should be a multiple of the element size (wrapped vector).
+
+This pattern is not allowed to @code{FAIL}.
+
 @cindex @code{vec_perm@var{m}} instruction pattern
 @item @samp{vec_perm@var{m}}
 Output a (variable) vector permutation.  Operand 0 is the destination
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 4f088de48d5..1e53ced60eb 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -104,10 +104,12 @@ init_internal_fns ()
 #define load_lanes_direct { -1, -1, false }
 #define mask_load_lanes_direct { -1, -1, false }
 #define gather_load_direct { 3, 1, false }
+#define len_load_direct { -1, -1, false }
 #define mask_store_direct { 3, 2, false }
 #define store_lanes_direct { 0, 0, false }
 #define mask_store_lanes_direct { 0, 0, false }
 #define scatter_store_direct { 3, 1, false }
+#define len_store_direct { 3, 3, false }
 #define unary_direct { 0, 0, true }
 #define binary_direct { 0, 0, true }
 #define ternary_direct { 0, 0, true }
@@ -2478,10 +2480,10 @@ expand_call_mem_ref (tree type, gcall *stmt, int index)
   return fold_build2 (MEM_REF, type, addr, build_int_cst (alias_ptr_type, 0));
 }
 
-/* Expand MASK_LOAD{,_LANES} call STMT using optab OPTAB.  */
+/* Expand MASK_LOAD{,_LANES} or LEN_LOAD call STMT using optab OPTAB.  */
 
 static void
-expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
+expand_partial_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 {
   class expand_operand ops[3];
   tree type, lhs, rhs, maskt;
@@ -2497,6 +2499,8 @@ expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 
   if (optab == vec_mask_load_lanes_optab)
     icode = get_multi_vector_move (type, optab);
+  else if (optab == len_load_optab)
+    icode = direct_optab_handler (optab, TYPE_MODE (type));
   else
     icode = convert_optab_handler (optab, TYPE_MODE (type),
 				   TYPE_MODE (TREE_TYPE (maskt)));
@@ -2507,18 +2511,24 @@ expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
   target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   create_output_operand (&ops[0], target, TYPE_MODE (type));
   create_fixed_operand (&ops[1], mem);
-  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
+  if (optab == len_load_optab)
+    create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
+				 TYPE_UNSIGNED (TREE_TYPE (maskt)));
+  else
+    create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
   expand_insn (icode, 3, ops);
   if (!rtx_equal_p (target, ops[0].value))
     emit_move_insn (target, ops[0].value);
 }
 
+#define expand_mask_load_optab_fn expand_partial_load_optab_fn
 #define expand_mask_load_lanes_optab_fn expand_mask_load_optab_fn
+#define expand_len_load_optab_fn expand_partial_load_optab_fn
 
-/* Expand MASK_STORE{,_LANES} call STMT using optab OPTAB.  */
+/* Expand MASK_STORE{,_LANES} or LEN_STORE call STMT using optab OPTAB.  */
 
 static void
-expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
+expand_partial_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 {
   class expand_operand ops[3];
   tree type, lhs, rhs, maskt;
@@ -2532,6 +2542,8 @@ expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 
   if (optab == vec_mask_store_lanes_optab)
     icode = get_multi_vector_move (type, optab);
+  else if (optab == len_store_optab)
+    icode = direct_optab_handler (optab, TYPE_MODE (type));
   else
     icode = convert_optab_handler (optab, TYPE_MODE (type),
 				   TYPE_MODE (TREE_TYPE (maskt)));
@@ -2542,11 +2554,17 @@ expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
   reg = expand_normal (rhs);
   create_fixed_operand (&ops[0], mem);
   create_input_operand (&ops[1], reg, TYPE_MODE (type));
-  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
+  if (optab == len_store_optab)
+    create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
+				 TYPE_UNSIGNED (TREE_TYPE (maskt)));
+  else
+    create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
   expand_insn (icode, 3, ops);
 }
 
+#define expand_mask_store_optab_fn expand_partial_store_optab_fn
 #define expand_mask_store_lanes_optab_fn expand_mask_store_optab_fn
+#define expand_len_store_optab_fn expand_partial_store_optab_fn
 
 static void
 expand_ABNORMAL_DISPATCHER (internal_fn, gcall *)
@@ -3128,10 +3146,12 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
 #define direct_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_gather_load_optab_supported_p convert_optab_supported_p
+#define direct_len_load_optab_supported_p direct_optab_supported_p
 #define direct_mask_store_optab_supported_p convert_optab_supported_p
 #define direct_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_scatter_store_optab_supported_p convert_optab_supported_p
+#define direct_len_store_optab_supported_p direct_optab_supported_p
 #define direct_while_optab_supported_p convert_optab_supported_p
 #define direct_fold_extract_optab_supported_p direct_optab_supported_p
 #define direct_fold_left_optab_supported_p direct_optab_supported_p
@@ -3498,6 +3518,7 @@ internal_load_fn_p (internal_fn fn)
     case IFN_MASK_LOAD_LANES:
     case IFN_GATHER_LOAD:
     case IFN_MASK_GATHER_LOAD:
+    case IFN_LEN_LOAD:
       return true;
 
     default:
@@ -3517,6 +3538,7 @@ internal_store_fn_p (internal_fn fn)
     case IFN_MASK_STORE_LANES:
     case IFN_SCATTER_STORE:
     case IFN_MASK_SCATTER_STORE:
+    case IFN_LEN_STORE:
       return true;
 
     default:
@@ -3577,6 +3599,7 @@ internal_fn_stored_value_index (internal_fn fn)
     case IFN_MASK_STORE:
     case IFN_SCATTER_STORE:
     case IFN_MASK_SCATTER_STORE:
+    case IFN_LEN_STORE:
       return 3;
 
     default:
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 1d190d492ff..17dac128e83 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -49,11 +49,13 @@ along with GCC; see the file COPYING3.  If not see
    - load_lanes: currently just vec_load_lanes
    - mask_load_lanes: currently just vec_mask_load_lanes
    - gather_load: used for {mask_,}gather_load
+   - len_load: currently just len_load
 
    - mask_store: currently just maskstore
    - store_lanes: currently just vec_store_lanes
    - mask_store_lanes: currently just vec_mask_store_lanes
    - scatter_store: used for {mask_,}scatter_store
+   - len_store: currently just len_store
 
    - unary: a normal unary optab, such as vec_reverse_<mode>
    - binary: a normal binary optab, such as vec_interleave_lo_<mode>
@@ -127,6 +129,8 @@ DEF_INTERNAL_OPTAB_FN (GATHER_LOAD, ECF_PURE, gather_load, gather_load)
 DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
 		       mask_gather_load, gather_load)
 
+DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, len_load, len_load)
+
 DEF_INTERNAL_OPTAB_FN (SCATTER_STORE, 0, scatter_store, scatter_store)
 DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
 		       mask_scatter_store, scatter_store)
@@ -136,6 +140,8 @@ DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
 DEF_INTERNAL_OPTAB_FN (MASK_STORE_LANES, 0,
 		       vec_mask_store_lanes, mask_store_lanes)
 
+DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
+
 DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
 DEF_INTERNAL_OPTAB_FN (CHECK_RAW_PTRS, ECF_CONST | ECF_NOTHROW,
 		       check_raw_ptrs, check_ptrs)
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 0c64eb52a8d..78409aa1453 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -435,3 +435,5 @@ OPTAB_D (check_war_ptrs_optab, "check_war_ptrs$a")
 OPTAB_DC (vec_duplicate_optab, "vec_duplicate$a", VEC_DUPLICATE)
 OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
 OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
+OPTAB_D (len_load_optab, "len_load_$a")
+OPTAB_D (len_store_optab, "len_store_$a")

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 2/7 v5] rs6000: lenload/lenstore optab support
  2020-06-29  6:32         ` [PATCH 2/7 v5] " Kewen.Lin
@ 2020-06-29 17:57           ` Segher Boessenkool
  0 siblings, 0 replies; 80+ messages in thread
From: Segher Boessenkool @ 2020-06-29 17:57 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Bill Schmidt, dje.gcc, Richard Biener, Richard Sandiford

Hi Kewen,

On Mon, Jun 29, 2020 at 02:32:47PM +0800, Kewen.Lin wrote:
> V5: Like V4.

It is still okay for trunk, like before :-)


Segher

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/7 v7] ifn/optabs: Support vector load/store with length
  2020-06-29 10:39                             ` [PATCH 1/7 v7] " Kewen.Lin
@ 2020-06-30 15:32                               ` Richard Sandiford
  2020-07-01 13:35                                 ` [PATCH 1/7 v8] " Kewen.Lin
  0 siblings, 1 reply; 80+ messages in thread
From: Richard Sandiford @ 2020-06-30 15:32 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Jim Wilson, Richard Biener, Bill Schmidt,
	David Edelsohn, Segher Boessenkool, Jim Wilson

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi Richard,
>
> Thanks for the comments!
>
> on 2020/6/29 6:07 PM, Richard Sandiford wrote:
>> Thanks for the update.  I agree with the summary of the IRC discussion
>> except for…
>> 
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>> Hi Richard S./Richi/Jim/Segher,
>>>
>>> Thanks a lot for your comments to make this patch more solid.
>>>
>>> Based on our discussion, for the vector load/store with length
>>> optab, the length unit would be measured in lanes by default.
>>> For the targets which support length measured in bytes like Power,
>>> they should only define VnQI modes to wrap the other same size
>>> vector modes.  If the length is larger than total lane/byte count
>>> of the given mode, it's taken to load all lanes/bytes implicitly.
>> 
>> …this last bit.  IMO the behaviour of the optab should be undefined
>> when the supplied length is greater than the number of lanes.
>> 
>> I think that also makes things better for the lxvl implementation,
>> which ignores the upper 56 bits of the length.  It sounds like the
>> above semantics would instead require Power to saturate the value
>> at 255 before shifting it.
>> 
>
> Good catch, I just realized that this part is inconsistent with what I
> implemented in patch 5/7, where the function vect_gen_len still does
> the min operation between the given length and length_limit.
>
> This patch is updated accordingly to state that the behavior is undefined.
> The others aren't required to change.
>
> Could you have a further look? Thanks in advance!
>
> v6/v7: Updated optab descriptions.
>
> v5:
>   - Updated lenload/lenstore optab to len_load/len_store and the docs.
>   - Rename expand_mask_{load,store}_optab_fn to expand_partial_{load,store}_optab_fn
>   - Added/updated macros for expand_mask_{load,store}_optab_fn
>     and expand_len_{load,store}_optab_fn
>
> v4: Update len_load_direct/len_store_direct to align with direct optab.
>
> v3: Get rid of length mode hook.

Thanks, mostly looks good, just some comments about the documentation…

>
> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
> 2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>
>
> 	* doc/md.texi (len_load_@var{m}): Document.
> 	(len_store_@var{m}): Likewise.
> 	* internal-fn.c (len_load_direct): New macro.
> 	(len_store_direct): Likewise.
> 	(expand_len_load_optab_fn): Likewise.
> 	(expand_len_store_optab_fn): Likewise.
> 	(direct_len_load_optab_supported_p): Likewise.
> 	(direct_len_store_optab_supported_p): Likewise.
> 	(expand_mask_load_optab_fn): New macro.  Original renamed to ...
> 	(expand_partial_load_optab_fn): ... here.  Add handlings for
> 	len_load_optab.
> 	(expand_mask_store_optab_fn): New macro.  Original renamed to ...
> 	(expand_partial_store_optab_fn): ... here. Add handlings for
> 	len_store_optab.
> 	(internal_load_fn_p): Handle IFN_LEN_LOAD.
> 	(internal_store_fn_p): Handle IFN_LEN_STORE.
> 	(internal_fn_stored_value_index): Handle IFN_LEN_STORE.
> 	* internal-fn.def (LEN_LOAD): New internal function.
> 	(LEN_STORE): Likewise.
> 	* optabs.def (len_load_optab, len_store_optab): New optab.
>
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 2c67c818da5..c8d7bcc7f62 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5167,6 +5167,33 @@ mode @var{n}.
>  
>  This pattern is not allowed to @code{FAIL}.
>  
> +@cindex @code{len_load_@var{m}} instruction pattern
> +@item @samp{len_load_@var{m}}
> +Load the number of units specified by operand 2 from memory operand 1

s/units/vector elements/

> +into register operand 0, setting the other bytes of operand 0 to

s/bytes/elements/

Maybe s/register operand 0/vector register operand 0/ would be clearer,
now that we're explicitly measuring elements rather than bytes.

> +undefined values.  Operands 0 and 1 have mode @var{m}.  Operand 2 has

and maybe here “…@var{m}, which must be a vector mode”.

> +whichever integer mode the target prefers.  If operand 2 exceeds the
> +maximum units of mode @var{m}, the behavior is undefined.  For targets

Maybe s/maximum units of/number of elements in/

> +which support length measured in bytes, they should only define VnQI
> +mode to wrap the other vector modes with the same size.  Meanwhile,

How about:

  If the target prefers the length to be measured in bytes
  rather than elements, it should only implement this pattern
  for vectors of @code{QI} elements.

> +it's required that the byte count should be a multiple of the element
> +size (wrapped vector).

This last sentence doesn't apply now that the length is measured in
elements (lanes).
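
(For e.g. a V4SI access on such a target, the vectoriser would then wrap
the access in a VIEW_CONVERT_EXPR, as patch 5/7 already does; roughly,
with made-up SSA names:

   _1 = .LEN_LOAD (dataref_ptr, 16B, loop_len_);
   vect__x = VIEW_CONVERT_EXPR<vector(4) int>(_1);

so only the V16QI len_load/len_store patterns need to exist.)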

> +
> +This pattern is not allowed to @code{FAIL}.
> +
> +@cindex @code{len_store_@var{m}} instruction pattern
> +@item @samp{len_store_@var{m}}
> +Store the number of units specified by operand 2 from nonmemory operand 1
> +into memory operand 0, leaving the other bytes of operand 0 unchanged.
> +Operands 0 and 1 have mode @var{m}.  Operand 2 has whichever integer
> +mode the target prefers.  If operand 2 exceeds the maximum units of mode
> +@var{m}, the behavior is undefined.  For targets which support length
> +measured in bytes, they should only define VnQI mode to wrap the other
> +vector modes with the same size.  Meanwhile, it's required that the byte
> +count should be a multiple of the element size (wrapped vector).

Equivalent changes here too.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/7 v6] vect: Support vector load/store with length in vectorizer
  2020-06-29  6:33                       ` [PATCH 5/7 v6] " Kewen.Lin
@ 2020-06-30 19:53                         ` Richard Sandiford
  2020-07-01 13:23                           ` Kewen.Lin
  2020-07-10  9:55                           ` [PATCH 5/7 v7] " Kewen.Lin
  0 siblings, 2 replies; 80+ messages in thread
From: Richard Sandiford @ 2020-06-30 19:53 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Bill Schmidt, Richard Biener, Segher Boessenkool, dje.gcc

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 06a04e3d7dd..284c15705ea 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -13389,6 +13389,13 @@ by the copy loop headers pass.
>  @item vect-epilogues-nomask
>  Enable loop epilogue vectorization using smaller vector size.
>  
> +@item vect-with-length-scope

In principle there's nothing length-specific about this option.
We could do the same for masks or for any future loop control
mechanism.  So how about vect-partial-vector-usage instead?

> +Control the scope of vector memory access with length exploitation.  0 means we
> +don't exploit any vector memory access with length, 1 means we only exploit
> +vector memory access with length for those loops whose iteration number are
> +less than VF, such as very small loop or epilogue, 2 means we want to exploit
> +vector memory access with length for any loops if possible.

Maybe:

  Controls when the loop vectorizer considers using partial vector loads
  and stores as an alternative to falling back to scalar code.  0 stops
  the vectorizer from ever using partial vector loads and stores.  1 allows
  partial vector loads and stores if vectorization removes the need for the
  code to iterate.  2 allows partial vector loads and stores in all loops.
  The parameter only has an effect on targets that support partial
  vector loads and stores.
  
> diff --git a/gcc/optabs-query.c b/gcc/optabs-query.c
> index 215d68e4225..9c351759204 100644
> --- a/gcc/optabs-query.c
> +++ b/gcc/optabs-query.c
> @@ -606,6 +606,60 @@ can_vec_mask_load_store_p (machine_mode mode,
>    return false;
>  }
>  
> +/* Return true if target supports vector load/store with length for vector
> +   mode MODE.  There are two flavors for vector load/store with length, one
> +   is to measure length with bytes, the other is to measure length with lanes.
> +   As len_{load,store} optabs point out, for the flavor with bytes, we use
> +   VnQI to wrap the other supportable same size vector modes.  Here the
> +   pointer FACTOR is to indicate that it is using VnQI to wrap if its value
> +   more than 1 and how many bytes for one element of wrapped vector mode.  */
> +
> +bool
> +can_vec_len_load_store_p (machine_mode mode, bool is_load, unsigned int *factor)
> +{
> +  optab op = is_load ? len_load_optab : len_store_optab;
> +  gcc_assert (VECTOR_MODE_P (mode));
> +
> +  /* Check if length in lanes supported for this mode directly.  */
> +  if (direct_optab_handler (op, mode))
> +    {
> +      *factor = 1;
> +      return true;
> +    }
> +
> +  /* Check if length in bytes supported for VnQI with the same vector size.  */
> +  scalar_mode emode = QImode;
> +  poly_uint64 esize = GET_MODE_SIZE (emode);

This is always equal to 1, so…

> +  poly_uint64 vsize = GET_MODE_SIZE (mode);
> +  poly_uint64 nunits;
> +
> +  /* To get how many nunits it would have if the element is QImode.  */
> +  if (multiple_p (vsize, esize, &nunits))
> +    {

…we can just set nunits to GET_MODE_SIZE (mode).

> +      machine_mode vmode;
> +      /* Check whether the related VnQI vector mode exists, as well as
> +	 optab supported.  */
> +      if (related_vector_mode (mode, emode, nunits).exists (&vmode)
> +	  && direct_optab_handler (op, vmode))
> +	{
> +	  unsigned int mul;
> +	  scalar_mode orig_emode = GET_MODE_INNER (mode);
> +	  poly_uint64 orig_esize = GET_MODE_SIZE (orig_emode);
> +
> +	  if (constant_multiple_p (orig_esize, esize, &mul))
> +	    *factor = mul;
> +	  else
> +	    gcc_unreachable ();

This is just:

	  *factor = GET_MODE_UNIT_SIZE (mode);

However, I think it would be better to return the vector mode that the
load or store should use, instead of this factor.  That way we can reuse
it when generating the load and store statements.

So maybe call the function get_len_load_store_mode and return an
opt_machine_mode.
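
Untested sketch of that interface, just to illustrate (keeping the body
close to what the patch already does):

  /* If the target supports a vector load/store-with-length for vector
     mode MODE, return the mode that the access should actually use:
     MODE itself if the length is measured in lanes, or the same-sized
     VnQI mode if the target measures the length in bytes.  Return
     opt_machine_mode () on failure.  */

  opt_machine_mode
  get_len_load_store_mode (machine_mode mode, bool is_load)
  {
    optab op = is_load ? len_load_optab : len_store_optab;
    gcc_assert (VECTOR_MODE_P (mode));

    /* Length measured in lanes, on MODE itself.  */
    if (direct_optab_handler (op, mode) != CODE_FOR_nothing)
      return mode;

    /* Length measured in bytes, via the same-sized VnQI mode.  */
    machine_mode vmode;
    if (related_vector_mode (mode, QImode, GET_MODE_SIZE (mode)).exists (&vmode)
	&& direct_optab_handler (op, vmode) != CODE_FOR_nothing)
      return vmode;

    return opt_machine_mode ();
  }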

> +
> +	  return true;
> +	}
> +    }
> +  else
> +    gcc_unreachable ();
> +
> +  return false;
> +}
> +
>  /* Return true if there is a compare_and_swap pattern.  */
>  
>  bool
> […]
> diff --git a/gcc/params.opt b/gcc/params.opt
> index 9b564bb046c..daa6e8a2beb 100644
> --- a/gcc/params.opt
> +++ b/gcc/params.opt
> @@ -968,4 +968,8 @@ Bound on number of runtime checks inserted by the vectorizer's loop versioning f
>  Common Joined UInteger Var(param_vect_max_version_for_alignment_checks) Init(6) Param Optimization
>  Bound on number of runtime checks inserted by the vectorizer's loop versioning for alignment check.
>  
> +-param=vect-with-length-scope=
> +Common Joined UInteger Var(param_vect_with_length_scope) Init(0) IntegerRange(0, 2) Param Optimization
> +Control the vector with length exploitation scope.

Think this should be a bit more descriptive, at least saying what the
three values are (but in a more abbreviated form than the .texi above).

I think the default should be 2, with targets actively turning it down
where necessary.  That way, the decision to turn it down is more likely
to have a comment explaining why.

> […]
> @@ -422,10 +423,20 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>  {
>    tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
>    tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
> +  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);

IMO just “use_masks_p” would be more readable.  Same later on.

> +
>    tree ctrl_type = rgc->type;
> -  unsigned int nscalars_per_iter = rgc->max_nscalars_per_iter;
> +  /* Scale up nscalars per iteration with factor.  */
> +  unsigned int nscalars_per_iter_ft = rgc->max_nscalars_per_iter * rgc->factor;

Maybe “scaled_nscalars_per_iter”?  Not sure the comment really adds
anything here.

Or maybe “nitems_per_iter”, to keep the names shorter?

>    poly_uint64 nscalars_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type);

Maybe worth inserting a scaled_nscalars_per_ctrl or nitems_per_ctrl
here, since it's used in two places below (length_limit and as
batch_nscalars_ft).

>    poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> +  tree length_limit = NULL_TREE;
> +  /* For length, we need length_limit to check length in range.  */
> +  if (!vect_for_masking)
> +    {
> +      poly_uint64 len_limit = nscalars_per_ctrl * rgc->factor;
> +      length_limit = build_int_cst (compare_type, len_limit);
> +    }
>  
>    /* Calculate the maximum number of scalar values that the rgroup
>       handles in total, the number that it handles for each iteration
> @@ -434,12 +445,12 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>    tree nscalars_total = niters;
>    tree nscalars_step = build_int_cst (iv_type, vf);
>    tree nscalars_skip = niters_skip;
> -  if (nscalars_per_iter != 1)
> +  if (nscalars_per_iter_ft != 1)
>      {
>        /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
>  	 these multiplications don't overflow.  */
> -      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
> -      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
> +      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter_ft);
> +      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter_ft);
>        nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
>  				     nscalars_total, compare_factor);
>        nscalars_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
> @@ -509,7 +520,7 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>  	     NSCALARS_SKIP to that cannot overflow.  */
>  	  tree const_limit = build_int_cst (compare_type,
>  					    LOOP_VINFO_VECT_FACTOR (loop_vinfo)
> -					    * nscalars_per_iter);
> +					    * nscalars_per_iter_ft);
>  	  first_limit = gimple_build (preheader_seq, MIN_EXPR, compare_type,
>  				      nscalars_total, const_limit);
>  	  first_limit = gimple_build (preheader_seq, PLUS_EXPR, compare_type,

It looks odd that we don't need to adjust the other nscalars_* values too.
E.g. the above seems to be comparing an unscaled nscalars_total with
a scaled nscalars_per_iter.  I think the units ought to “agree”,
both here and in the rest of the function.

> […]
> @@ -617,16 +638,32 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>  				      init_ctrl, unskipped_mask);
>  	  else
>  	    init_ctrl = unskipped_mask;
> +	  gcc_assert (vect_for_masking);

I think this ought to go at the beginning of the { … } block,
rather than the end.

>  	}
>  
> +      /* First iteration is full.  */

This comment belongs inside the “if”.

>        if (!init_ctrl)
> -	/* First iteration is full.  */
> -	init_ctrl = build_minus_one_cst (ctrl_type);
> +	{
> +	  if (vect_for_masking)
> +	    init_ctrl = build_minus_one_cst (ctrl_type);
> +	  else
> +	    init_ctrl = length_limit;
> +	}
>  
> […]
> @@ -2568,7 +2608,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>    if (vect_epilogues
>        && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>        && prolog_peeling >= 0
> -      && known_eq (vf, lowest_vf))
> +      && known_eq (vf, lowest_vf)
> +      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (epilogue_vinfo))
>      {
>        unsigned HOST_WIDE_INT eiters
>  	= (LOOP_VINFO_INT_NITERS (loop_vinfo)

I'm still not really convinced that this check is right.  It feels
like it's hiding a problem elsewhere.

> […]
> @@ -1072,6 +1074,88 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
>    return true;
>  }
>  
> +/* Check whether we can use vector access with length based on precison
> +   comparison.  So far, to keep it simple, we only allow the case that the
> +   precision of the target supported length is larger than the precision
> +   required by loop niters.  */
> +
> +static bool
> +vect_verify_loop_lens (loop_vec_info loop_vinfo)
> +{
> +  vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
> +
> +  if (LOOP_VINFO_LENS (loop_vinfo).is_empty ())
> +    return false;
> +
> +  /* The one which has the largest NV should have max bytes per iter.  */
> +  rgroup_controls *rgl = &(*lens)[lens->length () - 1];

“lens->last ()”.  Using a reference feels more natural here.

> +
> +  /* Work out how many bits we need to represent the length limit.  */
> +  unsigned int nscalars_per_iter_ft = rgl->max_nscalars_per_iter * rgl->factor;

I think this breaks the abstraction.  There's no guarantee that the
factor is the same for each rgroup_control, so there's no guarantee
that the maximum bytes per iter comes from the last entry.  (Also, it'd
be better to avoid talking about bytes if we're trying to be general.)
I think we should take the maximum of each entry instead.

> +  unsigned int min_ni_prec
> +    = vect_min_prec_for_max_niters (loop_vinfo, nscalars_per_iter_ft);
> +
> +  /* Now use the maximum of below precisions for one suitable IV type:
> +     - the IV's natural precision
> +     - the precision needed to hold: the maximum number of scalar
> +       iterations multiplied by the scale factor (min_ni_prec above)
> +     - the Pmode precision
> +  */
> +
> +  /* If min_ni_width is less than the precision of the current niters,

min_ni_prec

> +     we perfer to still use the niters type.  */
> +  unsigned int ni_prec
> +    = TYPE_PRECISION (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)));
> +  /* Prefer to use Pmode and wider IV to avoid narrow conversions.  */
> +  unsigned int pmode_prec = GET_MODE_BITSIZE (Pmode);
> +
> +  unsigned int required_prec = ni_prec;
> +  if (required_prec < pmode_prec)
> +    required_prec = pmode_prec;
> +
> +  tree iv_type = NULL_TREE;
> +  if (min_ni_prec > required_prec)
> +    {

Do we need this condition?  Looks like we could just do:

  min_ni_prec = MAX (min_ni_prec, GET_MODE_BITSIZE (Pmode));
  min_ni_prec = MAX (min_ni_prec, ni_prec);

and then run the loop below.

> +      opt_scalar_int_mode tmode_iter;
> +      unsigned standard_bits = 0;
> +      FOR_EACH_MODE_IN_CLASS (tmode_iter, MODE_INT)
> +      {
> +	scalar_mode tmode = tmode_iter.require ();
> +	unsigned int tbits = GET_MODE_BITSIZE (tmode);
> +
> +	/* ??? Do we really want to construct one IV whose precision exceeds
> +	   BITS_PER_WORD?  */
> +	if (tbits > BITS_PER_WORD)
> +	  break;
> +
> +	/* Find the first available standard integral type.  */
> +	if (tbits >= min_ni_prec && targetm.scalar_mode_supported_p (tmode))
> +	  {
> +	    standard_bits = tbits;
> +	    break;
> +	  }
> +      }
> +      if (standard_bits != 0)
> +	iv_type = build_nonstandard_integer_type (standard_bits, true);

I don't think there's any need for “standard_bits” here, we can just
set “iv_type” directly before breaking.
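
I.e. something like (illustrative only):

	/* Find the first available standard integral type.  */
	if (tbits >= min_ni_prec && targetm.scalar_mode_supported_p (tmode))
	  {
	    iv_type = build_nonstandard_integer_type (tbits, true);
	    break;
	  }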

> +    }
> +  else
> +    iv_type = build_nonstandard_integer_type (required_prec, true);
> +
> +  if (!iv_type)
> +    {
> +      if (dump_enabled_p ())
> +	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +			 "can't vectorize with length-based partial vectors"
> +			 " due to no suitable iv type.\n");
> +      return false;
> +    }
> +
> +  LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = iv_type;
> +  LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
> +
> +  return true;
> +}
> +
>  /* Calculate the cost of one scalar iteration of the loop.  */
>  static void
>  vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo)
> @@ -2170,11 +2254,64 @@ start_over:
>        return ok;
>      }
>  
> -  /* Decide whether to use a fully-masked loop for this vectorization
> -     factor.  */
> -  LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
> -    = (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> -       && vect_verify_full_masking (loop_vinfo));
> +  /* For now, we don't expect to mix both masking and length approaches for one
> +     loop, disable it if both are recorded.  */
> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> +      && !LOOP_VINFO_MASKS (loop_vinfo).is_empty ()
> +      && !LOOP_VINFO_LENS (loop_vinfo).is_empty ())
> +    {
> +      if (dump_enabled_p ())
> +	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +			 "can't vectorize a loop with partial vectors"
> +			 " because we don't expect to mix different"
> +			 " approaches with partial vectors for the"
> +			 " same loop.\n");
> +      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +    }
> +
> +  /* Decide whether to vectorize a loop with partial vectors for
> +     this vectorization factor.  */
> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> +    {
> +      /* Decide whether to use fully-masked approach.  */
> +      if (vect_verify_full_masking (loop_vinfo))
> +	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
> +      /* Decide whether to use length-based approach.  */
> +      else if (vect_verify_loop_lens (loop_vinfo))
> +	{
> +	  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> +	      || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
> +	    {
> +	      if (dump_enabled_p ())
> +		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +				 "can't vectorize this loop with length-based"
> +				 " partial vectors approach becuase peeling"
> +				 " for alignment or gaps is required.\n");
> +	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +	    }

Why are these peeling cases necessary?  Peeling for gaps should
just mean subtracting one scalar iteration from the iteration count
and shouldn't otherwise affect the main loop.  Similarly, peeling for
alignment can be handled in the normal way, with a scalar prologue loop.

> +	  else if (param_vect_with_length_scope == 0)
> +	    LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;

As above, I don't think this should be length-specific.  Same for the
== 1 handling, which we could do afterwards.

> +	  /* The epilogue and other known niters less than VF
> +	    cases can still use vector access with length fully.  */
> +	  else if (param_vect_with_length_scope == 1
> +		   && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> +		   && !vect_known_niters_smaller_than_vf (loop_vinfo))
> +	    {
> +	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +	      LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
> +	    }
> +	  else
> +	    {
> +	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
> +	      LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;

Think it's better to leave this last line out, otherwise it raises
the question why we don't set it to false elsewhere as well.

> +	    }
> +	}
> +      else
> +	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +    }
> +  else
> +    LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +
>    if (dump_enabled_p ())
>      {
>        if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> @@ -2183,6 +2320,15 @@ start_over:
>        else
>  	dump_printf_loc (MSG_NOTE, vect_location,
>  			 "not using a fully-masked loop.\n");
> +
> +      if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> +	dump_printf_loc (MSG_NOTE, vect_location,
> +			 "using length-based partial"
> +			 " vectors for loop fully.\n");
> +      else
> +	dump_printf_loc (MSG_NOTE, vect_location,
> +			 "not using length-based partial"
> +			 " vectors for loop fully.\n");

Think just one message for all three cases is better, perhaps with

  "operating only on full vectors.\n"

instead of "not using a fully-masked loop.\n".  Might need some
testsuite updates though -- probably worth splitting the wording
change out into a separate patch if so.

>      }
>  
>    /* If epilog loop is required because of data accesses with gaps,
> @@ -8249,6 +8423,63 @@ vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
>    return mask;
>  }
>  
> +/* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS
> +   lengths for vector access with length that each control a vector of type
> +   VECTYPE.  FACTOR is only meaningful for length in bytes, and to indicate
> +   how many bytes for each element (lane).  */

Maybe:

/* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS
   lengths for controlling an operation on VECTYPE.  The operation splits
   each element of VECTYPE into FACTOR separate subelements, measuring
   the length as a number of these subelements.  */

> +
> +void
> +vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
> +		      unsigned int nvectors, tree vectype, unsigned int factor)
> +{
> +  gcc_assert (nvectors != 0);
> +  if (lens->length () < nvectors)
> +    lens->safe_grow_cleared (nvectors);
> +  rgroup_controls *rgl = &(*lens)[nvectors - 1];
> +
> +  /* The number of scalars per iteration, scalar occupied bytes and
> +     the number of vectors are both compile-time constants.  */
> +  unsigned int nscalars_per_iter
> +    = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
> +		 LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
> +
> +  if (rgl->max_nscalars_per_iter < nscalars_per_iter)
> +    {
> +      rgl->max_nscalars_per_iter = nscalars_per_iter;
> +      rgl->type = vectype;
> +      rgl->factor = factor;
> +    }

This is dangerous because it ignores “factor” otherwise, and ignores
the previous factor if we overwrite it.

I think instead we should have:

  /* For now, we only support cases in which all loads and stores fall back
     to VnQI or none do.  */
  gcc_assert (!rgl->max_nscalars_per_iter
	      || (rgl->factor == 1 && factor == 1)
	      || (rgl->max_nscalars_per_iter * rgl->factor
		  == nscalars_per_iter * factor));

before changing rgl.

> […]
> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index cdd6f6c5e5d..e0ffbab1d02 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -1742,29 +1742,56 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
>        return;
>      }
>  
> -  machine_mode mask_mode;
> -  if (!VECTOR_MODE_P (vecmode)
> -      || !targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
> -      || !can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
> +  if (!VECTOR_MODE_P (vecmode))
>      {
>        if (dump_enabled_p ())
>  	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> -			 "can't use a fully-masked loop because the target"
> -			 " doesn't have the appropriate masked load or"
> -			 " store.\n");
> +			 "can't operate on partial vectors because of"
> +			 " the unexpected mode.\n");

Maybe: “can't operate on partial vectors when emulating vector operations”

>        LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>        return;
>      }
> -  /* We might load more scalars than we need for permuting SLP loads.
> -     We checked in get_group_load_store_type that the extra elements
> -     don't leak into a new vector.  */
> +
>    poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
>    poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>    unsigned int nvectors;
> -  if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
> -    vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype, scalar_mask);
> -  else
> -    gcc_unreachable ();
> +
> +  machine_mode mask_mode;
> +  bool using_partial_vectors_p = false;
> +  if (targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
> +      && can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
> +    {
> +      /* We might load more scalars than we need for permuting SLP loads.
> +	 We checked in get_group_load_store_type that the extra elements
> +	 don't leak into a new vector.  */
> +      if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))

Please split this out into a lambda that returns the number of vectors,
and keep the comment with it.  That way we can use it here and below.
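
E.g. a sketch of what I mean (the name is just a placeholder):

  auto get_num_vectors = [&] () -> unsigned int
    {
      /* We might load more scalars than we need for permuting SLP loads.
         We checked in get_group_load_store_type that the extra elements
         don't leak into a new vector.  */
      unsigned int nvectors;
      if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
        return nvectors;
      gcc_unreachable ();
    };

Then both the mask code and the length code can simply call get_num_vectors ().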

> +	vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype,
> +			       scalar_mask);
> +      else
> +	gcc_unreachable ();
> +      using_partial_vectors_p = true;
> +    }
> +
> +  unsigned int factor;
> +  if (can_vec_len_load_store_p (vecmode, is_load, &factor))
> +    {
> +      vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
> +      if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
> +	vect_record_loop_len (loop_vinfo, lens, nvectors, vectype, factor);
> +      else
> +	gcc_unreachable ();
> +      using_partial_vectors_p = true;
> +    }
> +
> +  if (!using_partial_vectors_p)
> +    {
> +      if (dump_enabled_p ())
> +	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +			 "can't operate on partial vectors because the"
> +			 " target doesn't have the appropriate partial"
> +			 "vectorization load or store.\n");

missing space between “partial” and “vectorization”.

> +      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +    }
>  }
>  
>  /* Return the mask input to a masked load or store.  VEC_MASK is the vectorized
> @@ -6936,6 +6963,28 @@ vectorizable_scan_store (vec_info *vinfo,
>    return true;
>  }
>  
> +/* For the vector type VTYPE, return the same size vector type with
> +   QImode element, which is mainly for vector load/store with length
> +   in bytes.  */
> +
> +static tree
> +vect_get_same_size_vec_for_len (tree vtype)
> +{
> +  gcc_assert (VECTOR_TYPE_P (vtype));
> +  machine_mode v_mode = TYPE_MODE (vtype);
> +  gcc_assert (GET_MODE_INNER (v_mode) != QImode);
> +
> +  /* Obtain new element counts with QImode.  */
> +  poly_uint64 vsize = GET_MODE_SIZE (v_mode);
> +  poly_uint64 esize = GET_MODE_SIZE (QImode);
> +  poly_uint64 nelts = exact_div (vsize, esize);
> +
> +  /* Build element type with QImode.  */
> +  unsigned int eprec = GET_MODE_PRECISION (QImode);
> +  tree etype = build_nonstandard_integer_type (eprec, 1);
> +
> +  return build_vector_type (etype, nelts);
> +}

As mentioned above, I think we should be getting the mode of
the vector from get_len_load_store_mode.

> […]
> @@ -7911,10 +7968,16 @@ vectorizable_store (vec_info *vinfo,
>  	      unsigned HOST_WIDE_INT align;
>  
>  	      tree final_mask = NULL_TREE;
> +	      tree final_len = NULL_TREE;
>  	      if (loop_masks)
>  		final_mask = vect_get_loop_mask (gsi, loop_masks,
>  						 vec_num * ncopies,
>  						 vectype, vec_num * j + i);
> +	      else if (loop_lens)
> +		final_len = vect_get_loop_len (loop_vinfo, loop_lens,
> +					       vec_num * ncopies,
> +					       vec_num * j + i);
> +

I don't think we need this “final_len”.  Unlike for masks, we only have
a single length, and can calculate it in the “if” statement below.
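
I.e. (sketch only, to show where the calculation would move to):

	      else if (loop_lens)
		{
		  tree final_len
		    = vect_get_loop_len (loop_vinfo, loop_lens,
					 vec_num * ncopies, vec_num * j + i);
		  /* ...generate the IFN_LEN_STORE call using final_len...  */
		}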

>  	      if (vec_mask)
>  		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
>  						      vec_mask, gsi);
> @@ -7994,6 +8057,34 @@ vectorizable_store (vec_info *vinfo,
>  		  vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
>  		  new_stmt = call;
>  		}
> +	      else if (final_len)
> +		{
> +		  align = least_bit_hwi (misalign | align);
> +		  tree ptr = build_int_cst (ref_type, align);
> +		  tree vtype = TREE_TYPE (vec_oprnd);

Couldn't you just reuse “vectype”?  Worth a comment if not.

> +		  /* Need conversion if it's wrapped with VnQI.  */
> +		  if (!direct_optab_handler (len_store_optab,
> +					     TYPE_MODE (vtype)))

I think this should use get_len_load_store_mode rather than querying
the optab directly.

> +		    {
> +		      tree new_vtype = vect_get_same_size_vec_for_len (vtype);
> +		      tree var
> +			= vect_get_new_ssa_name (new_vtype, vect_simple_var);
> +		      vec_oprnd
> +			= build1 (VIEW_CONVERT_EXPR, new_vtype, vec_oprnd);
> +		      gassign *new_stmt
> +			= gimple_build_assign (var, VIEW_CONVERT_EXPR,
> +					       vec_oprnd);
> +		      vect_finish_stmt_generation (vinfo, stmt_info, new_stmt,
> +						   gsi);
> +		      vec_oprnd = var;
> +		    }
> +		  gcall *call
> +		    = gimple_build_call_internal (IFN_LEN_STORE, 4, dataref_ptr,
> +						  ptr, final_len, vec_oprnd);
> +		  gimple_call_set_nothrow (call, true);
> +		  vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
> +		  new_stmt = call;
> +		}
>  	      else
>  		{
>  		  data_ref = fold_build2 (MEM_REF, vectype,
> @@ -8531,6 +8622,7 @@ vectorizable_load (vec_info *vinfo,
>        tree dr_offset;
>  
>        gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
> +      gcc_assert (!LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));

Might as well just change the existing assert to
!LOOP_VINFO_USING_PARTIAL_VECTORS_P.

Same comments for the load code.

> […]
> @@ -9850,11 +9986,30 @@ vectorizable_condition (vec_info *vinfo,
>  	  return false;
>  	}
>  
> -      if (loop_vinfo
> -	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> -	  && reduction_type == EXTRACT_LAST_REDUCTION)
> -	vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
> -			       ncopies * vec_num, vectype, NULL);
> +      if (loop_vinfo && for_reduction
> +	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> +	{
> +	  if (reduction_type == EXTRACT_LAST_REDUCTION)
> +	    vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
> +				   ncopies * vec_num, vectype, NULL);
> +	  /* Using partial vectors can introduce inactive lanes in the last
> +	     iteration, since full vector of condition results are operated,
> +	     it's unsafe here.  But if we can AND the condition mask with
> +	     loop mask, it would be safe then.  */
> +	  else if (!loop_vinfo->scalar_cond_masked_set.is_empty ())
> +	    {
> +	      scalar_cond_masked_key cond (cond_expr, ncopies * vec_num);
> +	      if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
> +		{
> +		  bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
> +		  cond.code = invert_tree_comparison (cond.code, honor_nans);
> +		  if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
> +		    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +		}
> +	    }
> +	  else
> +	    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +	}
>  
>        STMT_VINFO_TYPE (stmt_info) = condition_vec_info_type;
>        vect_model_simple_cost (vinfo, stmt_info, ncopies, dts, ndts, slp_node,

I don't understand this part.

> @@ -11910,3 +12065,36 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
>    *nunits_vectype_out = nunits_vectype;
>    return opt_result::success ();
>  }
> +
> +/* Generate and return statement sequence that sets vector length LEN that is:
> +
> +   min_of_start_and_end = min (START_INDEX, END_INDEX);
> +   left_len = END_INDEX - min_of_start_and_end;
> +   rhs = min (left_len, LEN_LIMIT);
> +   LEN = rhs;
> +
> +   TODO: for now, rs6000 supported vector with length only cares 8-bits, which
> +   means if we have left_len in bytes larger than 255, it can't be saturated to
> +   vector limit (vector size).  One target hook can be provided if other ports
> +   don't suffer this.
> +*/

Should be no line break before the */

Personally I think it'd be better to drop the TODO.  This isn't the only
place that would need to change if we allowed out-of-range lengths,
whereas the comment might give the impression that it is.

> +
> +gimple_seq
> +vect_gen_len (tree len, tree start_index, tree end_index, tree len_limit)
> +{
> +  gimple_seq stmts = NULL;
> +  tree len_type = TREE_TYPE (len);
> +  gcc_assert (TREE_TYPE (start_index) == len_type);
> +
> +  tree min = fold_build2 (MIN_EXPR, len_type, start_index, end_index);
> +  tree left_len = fold_build2 (MINUS_EXPR, len_type, end_index, min);
> +  left_len = fold_build2 (MIN_EXPR, len_type, left_len, len_limit);
> +
> +  tree rhs = force_gimple_operand (left_len, &stmts, true, NULL_TREE);
> +  gimple *new_stmt = gimple_build_assign (len, rhs);
> +  gimple_stmt_iterator i = gsi_last (stmts);
> +  gsi_insert_after_without_update (&i, new_stmt, GSI_CONTINUE_LINKING);
> +
> +  return stmts;
> +}

It's better to build this up using gimple_build instead.
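
E.g. (untested sketch):

  gimple_seq stmts = NULL;
  tree min = gimple_build (&stmts, MIN_EXPR, len_type, start_index, end_index);
  tree left_len = gimple_build (&stmts, MINUS_EXPR, len_type, end_index, min);
  left_len = gimple_build (&stmts, MIN_EXPR, len_type, left_len, len_limit);
  gimple_seq_add_stmt (&stmts, gimple_build_assign (len, left_len));
  return stmts;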

> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 6c830ad09f4..4155ffe1d49 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -417,6 +417,16 @@ is_a_helper <_bb_vec_info *>::test (vec_info *i)
>     are compile-time constants but VF and nL can be variable (if the target
>     supports variable-length vectors).
>  
> +   Moreover, for some approach with partial vectors like being controlled
> +   by length (in bytes), it cares about the occupied bytes for each scalar.
> +   Provided that each scalar has factor bytes, the total number of scalar
> +   values becomes to factor * N, the above equation becomes to:
> +
> +       factor * N = factor * NS * VF = factor * NV * NL
> +
> +   factor * NS is the bytes of each scalar, factor * NL is the vector size
> +   in bytes.
> +
>     In classical vectorization, each iteration of the vector loop would
>     handle exactly VF iterations of the original scalar loop.  However,
>     in vector loops that are able to operate on partial vectors, a

As above, I think it'd be better to model the factor as splitting each
element into FACTOR pieces.  In that case I don't think we need to
describe it in this comment; a comment above the field should be enough.

> @@ -473,14 +483,19 @@ is_a_helper <_bb_vec_info *>::test (vec_info *i)
>     first level being indexed by nV - 1 (since nV == 0 doesn't exist) and
>     the second being indexed by the mask index 0 <= i < nV.  */
>  
> -/* The controls (like masks) needed by rgroups with nV vectors,
> +/* The controls (like masks, lengths) needed by rgroups with nV vectors,
>     according to the description above.  */

“(masks or lengths)”

>  struct rgroup_controls {
>    /* The largest nS for all rgroups that use these controls.  */
>    unsigned int max_nscalars_per_iter;
>  
> -  /* The type of control to use, based on the highest nS recorded above.
> -     For mask-based approach, it's used for mask_type.  */
> +  /* For now, it's mainly used for length-based in bytes approach, it's
> +     record the occupied bytes of each scalar.  */

Maybe:

  /* For the largest nS recorded above, the loop controls divide each scalar
     into FACTOR equal-sized pieces.  This is useful if we need to split
     element-based accesses into byte-based accesses.  */

> +  unsigned int factor;
> +
> +  /* This type is based on the highest nS recorded above.
> +     For mask-based approach, it records mask type to use.
> +     For length-based approach, it records appropriate vector type.  */

Maybe:

  /* This is a vector type with MAX_NSCALARS_PER_ITER * VF / nV elements.
     For mask-based controls, it is the type of the masks in CONTROLS.
     For length-based controls, it can be any vector type that has the
     specified number of elements; the type of the elements doesn't matter.  */

> @@ -644,6 +665,10 @@ public:
>       the vector loop can handle fewer than VF scalars.  */
>    bool using_partial_vectors_p;
>  
> +  /* True if we've decided to use partially-populated vectors for the
> +     epilogue of loop, only for length-based approach for now.  */

Don't think the bit after the comma is necessary.

> +  bool epil_using_partial_vectors_p;
> +
>    /* When we have grouped data accesses with gaps, we may introduce invalid
>       memory accesses.  We peel the last iteration of the loop to prevent
>       this.  */

Thanks,
Richard


* Re: [PATCH 5/7 v6] vect: Support vector load/store with length in vectorizer
  2020-06-30 19:53                         ` Richard Sandiford
@ 2020-07-01 13:23                           ` Kewen.Lin
  2020-07-01 15:17                             ` Richard Sandiford
  2020-07-10  9:55                           ` [PATCH 5/7 v7] " Kewen.Lin
  1 sibling, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-07-01 13:23 UTC (permalink / raw)
  To: GCC Patches, Bill Schmidt, Richard Biener, Segher Boessenkool,
	dje.gcc, richard.sandiford

Hi Richard,

Many thanks for your great review comments!

on 2020/7/1 3:53 AM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
>> index 06a04e3d7dd..284c15705ea 100644
>> --- a/gcc/doc/invoke.texi
>> +++ b/gcc/doc/invoke.texi
>> @@ -13389,6 +13389,13 @@ by the copy loop headers pass.
>>  @item vect-epilogues-nomask
>>  Enable loop epilogue vectorization using smaller vector size.
>>  
>> +@item vect-with-length-scope
> 
> In principle there's nothing length-specific about this option.
> We could do the same for masks or for any future loop control
> mechanism.  So how about vect-partial-vector-usage instead?
> 

Sounds good, will update it. 

[snip] 

I will also update according to the comments in the snipped parts (if any);
I snipped some of them to keep this reply readable.

>> +      machine_mode vmode;
>> +      /* Check whether the related VnQI vector mode exists, as well as
>> +	 optab supported.  */
>> +      if (related_vector_mode (mode, emode, nunits).exists (&vmode)
>> +	  && direct_optab_handler (op, vmode))
>> +	{
>> +	  unsigned int mul;
>> +	  scalar_mode orig_emode = GET_MODE_INNER (mode);
>> +	  poly_uint64 orig_esize = GET_MODE_SIZE (orig_emode);
>> +
>> +	  if (constant_multiple_p (orig_esize, esize, &mul))
>> +	    *factor = mul;
>> +	  else
>> +	    gcc_unreachable ();
> 
> This is just:
> 
> 	  *factor = GET_MODE_UNIT_SIZE (mode);
> 
> However, I think it would be better to return the vector mode that the
> load or store should use, instead of this factor.  That way we can reuse
> it when generating the load and store statements.
> 
> So maybe call the function get_len_load_store_mode and return an
> opt_machine_mode.
> 

Will improve it.

>> diff --git a/gcc/params.opt b/gcc/params.opt
>> index 9b564bb046c..daa6e8a2beb 100644
>> --- a/gcc/params.opt
>> +++ b/gcc/params.opt
>> @@ -968,4 +968,8 @@ Bound on number of runtime checks inserted by the vectorizer's loop versioning f
>>  Common Joined UInteger Var(param_vect_max_version_for_alignment_checks) Init(6) Param Optimization
>>  Bound on number of runtime checks inserted by the vectorizer's loop versioning for alignment check.
>>  
>> +-param=vect-with-length-scope=
>> +Common Joined UInteger Var(param_vect_with_length_scope) Init(0) IntegerRange(0, 2) Param Optimization
>> +Control the vector with length exploitation scope.
> 
> Think this should be a bit more descriptive, at least saying what the
> three values are (but in a more abbreviated form than the .texi above).
> 
> I think the default should be 2, with targets actively turning it down
> where necessary.  That way, the decision to turn it down is more likely
> to have a comment explaining why.
> 

Will update both.

>> +
>>    tree ctrl_type = rgc->type;
>> -  unsigned int nscalars_per_iter = rgc->max_nscalars_per_iter;
>> +  /* Scale up nscalars per iteration with factor.  */
>> +  unsigned int nscalars_per_iter_ft = rgc->max_nscalars_per_iter * rgc->factor;
> 
> Maybe “scaled_nscalars_per_iter”?  Not sure the comment really adds
> anything here.
> 
> Or maybe “nitems_per_iter”, to keep the names shorter?
> 

Will use the short one.

>>    poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>> +  tree length_limit = NULL_TREE;
>> +  /* For length, we need length_limit to check length in range.  */
>> +  if (!vect_for_masking)
>> +    {
>> +      poly_uint64 len_limit = nscalars_per_ctrl * rgc->factor;
>> +      length_limit = build_int_cst (compare_type, len_limit);
>> +    }
>>  
>>    /* Calculate the maximum number of scalar values that the rgroup
>>       handles in total, the number that it handles for each iteration
>> @@ -434,12 +445,12 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>    tree nscalars_total = niters;
>>    tree nscalars_step = build_int_cst (iv_type, vf);
>>    tree nscalars_skip = niters_skip;
>> -  if (nscalars_per_iter != 1)
>> +  if (nscalars_per_iter_ft != 1)
>>      {
>>        /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
>>  	 these multiplications don't overflow.  */
>> -      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
>> -      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
>> +      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter_ft);
>> +      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter_ft);
>>        nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
>>  				     nscalars_total, compare_factor);
>>        nscalars_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
>> @@ -509,7 +520,7 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>  	     NSCALARS_SKIP to that cannot overflow.  */
>>  	  tree const_limit = build_int_cst (compare_type,
>>  					    LOOP_VINFO_VECT_FACTOR (loop_vinfo)
>> -					    * nscalars_per_iter);
>> +					    * nscalars_per_iter_ft);
>>  	  first_limit = gimple_build (preheader_seq, MIN_EXPR, compare_type,
>>  				      nscalars_total, const_limit);
>>  	  first_limit = gimple_build (preheader_seq, PLUS_EXPR, compare_type,
> 
> It looks odd that we don't need to adjust the other nscalars_* values too.
> E.g. the above seems to be comparing an unscaled nscalars_total with
> a scaled nscalars_per_iter.  I think the units ought to “agree”,
> both here and in the rest of the function.
> 

Sorry, I didn't quite follow this comment.  Both nscalars_total and
nscalars_step are scaled here.  The only remaining related nscalars_*
seems to be nscalars_skip, but the length-based approach can't support skipping.

>>  	}
>>  
>> +      /* First iteration is full.  */
> 
> This comment belongs inside the “if”.
> 

Sorry, I might be missing something, but doesn't this comment apply to both branches?

>>        if (!init_ctrl)
>> -	/* First iteration is full.  */
>> -	init_ctrl = build_minus_one_cst (ctrl_type);
>> +	{
>> +	  if (vect_for_masking)
>> +	    init_ctrl = build_minus_one_cst (ctrl_type);
>> +	  else
>> +	    init_ctrl = length_limit;
>> +	}
>>  
>> […]
>> @@ -2568,7 +2608,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>>    if (vect_epilogues
>>        && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>>        && prolog_peeling >= 0
>> -      && known_eq (vf, lowest_vf))
>> +      && known_eq (vf, lowest_vf)
>> +      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (epilogue_vinfo))
>>      {
>>        unsigned HOST_WIDE_INT eiters
>>  	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
> 
> I'm still not really convinced that this check is right.  It feels
> like it's hiding a problem elsewhere.
> 

The comments above this hunk is that:

  /* If we know the number of scalar iterations for the main loop we should
     check whether after the main loop there are enough iterations left over
     for the epilogue.  */

So it's to check whether the entries in loop_vinfo->epilogue_vinfos can be removed.
And the main work in the loop is to remove epilogue_vinfo entries from epilogue_vinfos.

To keep it simple, let's assume prolog_peeling and LOOP_VINFO_PEELING_FOR_GAPS
are zero and vf == lowest_vf.

   eiters = eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)

eiters is the number of remaining iterations which can't be handled in the main
loop with full (width/lanes) vectors.

For the partial-vectors epilogue handling, loop_vinfo->vector_mode and
epilogue_vinfo->vector_mode are the same (as a special case).

      while (!(constant_multiple_p
	       (GET_MODE_SIZE (loop_vinfo->vector_mode),
		GET_MODE_SIZE (epilogue_vinfo->vector_mode), &ratio)
	       && eiters >= lowest_vf / ratio + epilogue_gaps))

It means that the ratio is 1 (as a special case), lowest_vf/ratio is still vf, and
the remaining eiters is definitely less than vf, so loop_vinfo->epilogue_vinfos[0]
gets removed.

I think the reason why the partial-vectors epilogue is special here is that the VF of
the main loop is the same as the VF of the epilogue loop.  Normally the VF of the
epilogue loop should be less than the VF of the main loop (the code here seems to
assume a multiple relationship).
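
To make that concrete with a made-up example: say niters = 20, vf = lowest_vf = 16,
prolog_peeling = 0 and no gaps.  Then eiters = 20 % 16 = 4, the ratio is 1 and
lowest_vf / ratio = 16 > 4, so epilogue_vinfos[0] would get removed here, even
though a partial-vectors epilogue could still handle those 4 remaining iterations.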

>> […]
>> @@ -1072,6 +1074,88 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
>>    return true;
>>  }
>>  
>> +/* Check whether we can use vector access with length based on precison
>> +   comparison.  So far, to keep it simple, we only allow the case that the
>> +   precision of the target supported length is larger than the precision
>> +   required by loop niters.  */
>> +
>> +static bool
>> +vect_verify_loop_lens (loop_vec_info loop_vinfo)
>> +{
>> +  vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
>> +
>> +  if (LOOP_VINFO_LENS (loop_vinfo).is_empty ())
>> +    return false;
>> +
>> +  /* The one which has the largest NV should have max bytes per iter.  */
>> +  rgroup_controls *rgl = &(*lens)[lens->length () - 1];
> 
> “lens->last ()”.  Using a reference feels more natural here.
> 

Will fix it.

>> +
>> +  /* Work out how many bits we need to represent the length limit.  */
>> +  unsigned int nscalars_per_iter_ft = rgl->max_nscalars_per_iter * rgl->factor;
> 
> I think this breaks the abstraction.  There's no guarantee that the
> factor is the same for each rgroup_control, so there's no guarantee
> that the maximum bytes per iter comes the last entry.  (Also, it'd
> be better to avoid talking about bytes if we're trying to be general.)
> I think we should take the maximum of each entry instead.
> 

Agree!  I guess the above "maximum bytes per iter" is a typo and you meant
"maximum elements per iter"?  Yes, the code assumed length in bytes, so checking
the last entry is only reasonable for that case.  Will update it to check all
entries instead.

>> +     we perfer to still use the niters type.  */
>> +  unsigned int ni_prec
>> +    = TYPE_PRECISION (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)));
>> +  /* Prefer to use Pmode and wider IV to avoid narrow conversions.  */
>> +  unsigned int pmode_prec = GET_MODE_BITSIZE (Pmode);
>> +
>> +  unsigned int required_prec = ni_prec;
>> +  if (required_prec < pmode_prec)
>> +    required_prec = pmode_prec;
>> +
>> +  tree iv_type = NULL_TREE;
>> +  if (min_ni_prec > required_prec)
>> +    {
> 
> Do we need this condition?  Looks like we could just do:
> 
>   min_ni_prec = MAX (min_ni_prec, GET_MODE_BITSIZE (Pmode));
>   min_ni_prec = MAX (min_ni_prec, ni_prec);
> 
> and then run the loop below.
> 

I think the assumption holds that Pmode and the niters type are standard integral
types?  If so, neither of them needs the loop below to build the integer type,
but min_ni_prec does.  Does it make sense to differentiate them?

>> +      /* Decide whether to use fully-masked approach.  */
>> +      if (vect_verify_full_masking (loop_vinfo))
>> +	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
>> +      /* Decide whether to use length-based approach.  */
>> +      else if (vect_verify_loop_lens (loop_vinfo))
>> +	{
>> +	  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>> +	      || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
>> +	    {
>> +	      if (dump_enabled_p ())
>> +		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> +				 "can't vectorize this loop with length-based"
>> +				 " partial vectors approach becuase peeling"
>> +				 " for alignment or gaps is required.\n");
>> +	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
>> +	    }
> 
> Why are these peeling cases necessary?  Peeling for gaps should
> just mean subtracting one scalar iteration from the iteration count
> and shouldn't otherwise affect the main loop.  Similarly, peeling for
> alignment can be handled in the normal way, with a scalar prologue loop.
> 

I was thinking of relaxing this later, to avoid handling too many cases in the
first enablement patch.  Since the Power hardware level that supports vector with
length also supports unaligned load/store, I'd need to construct some test cases
for them.  May I postpone this a bit?  Or would you prefer me to support it here?

>> +	  else if (param_vect_with_length_scope == 0)
>> +	    LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
> 
> As above, I don't think this should be length-specific.  Same for the
> == 1 handling, which we could do afterwards.
> 

OK.  Will adjust it.

[snip]
>> +
>> +      if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
>> +	dump_printf_loc (MSG_NOTE, vect_location,
>> +			 "using length-based partial"
>> +			 " vectors for loop fully.\n");
>> +      else
>> +	dump_printf_loc (MSG_NOTE, vect_location,
>> +			 "not using length-based partial"
>> +			 " vectors for loop fully.\n");
> 
> Think just one message for all three cases is better, perhaps with
> 
>   "operating only on full vectors.\n"
> 
> instead of "not using a fully-masked loop.\n".  Might need some
> testsuite updates though -- probably worth splitting the wording
> change out into a separate patch if so.
> 

OK, will check and split it if needed.

[snip]

>> +
>> +void
>> +vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
>> +		      unsigned int nvectors, tree vectype, unsigned int factor)
>> +{
>> +  gcc_assert (nvectors != 0);
>> +  if (lens->length () < nvectors)
>> +    lens->safe_grow_cleared (nvectors);
>> +  rgroup_controls *rgl = &(*lens)[nvectors - 1];
>> +
>> +  /* The number of scalars per iteration, scalar occupied bytes and
>> +     the number of vectors are both compile-time constants.  */
>> +  unsigned int nscalars_per_iter
>> +    = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
>> +		 LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
>> +
>> +  if (rgl->max_nscalars_per_iter < nscalars_per_iter)
>> +    {
>> +      rgl->max_nscalars_per_iter = nscalars_per_iter;
>> +      rgl->type = vectype;
>> +      rgl->factor = factor;
>> +    }
> 
> This is dangerous because it ignores “factor” otherwise, and ignores
> the previous factor if we overwrite it.
> 
> I think instead we should have:
> 
>   /* For now, we only support cases in which all loads and stores fall back
>      to VnQI or none do.  */
>   gcc_assert (!rgl->max_nscalars_per_iter
> 	      || (rgl->factor == 1 && factor == 1)
> 	      || (rgl->max_nscalars_per_iter * rgl->factor
> 		  == nscalars_per_iter * factor));
> 
> before changing rgl.
> 

Thanks for pointing this out!  Will guard it.

[snip]
>>  	      if (vec_mask)
>>  		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
>>  						      vec_mask, gsi);
>> @@ -7994,6 +8057,34 @@ vectorizable_store (vec_info *vinfo,
>>  		  vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
>>  		  new_stmt = call;
>>  		}
>> +	      else if (final_len)
>> +		{
>> +		  align = least_bit_hwi (misalign | align);
>> +		  tree ptr = build_int_cst (ref_type, align);
>> +		  tree vtype = TREE_TYPE (vec_oprnd);
> 
> Couldn't you just reuse “vectype”?  Worth a comment if not.
> 

Yeah, will reuse "vectype" there.

[snip]

>> @@ -9850,11 +9986,30 @@ vectorizable_condition (vec_info *vinfo,
>>  	  return false;
>>  	}
>>  
>> -      if (loop_vinfo
>> -	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
>> -	  && reduction_type == EXTRACT_LAST_REDUCTION)
>> -	vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>> -			       ncopies * vec_num, vectype, NULL);
>> +      if (loop_vinfo && for_reduction
>> +	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
>> +	{
>> +	  if (reduction_type == EXTRACT_LAST_REDUCTION)
>> +	    vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>> +				   ncopies * vec_num, vectype, NULL);
>> +	  /* Using partial vectors can introduce inactive lanes in the last
>> +	     iteration, since full vector of condition results are operated,
>> +	     it's unsafe here.  But if we can AND the condition mask with
>> +	     loop mask, it would be safe then.  */
>> +	  else if (!loop_vinfo->scalar_cond_masked_set.is_empty ())
>> +	    {
>> +	      scalar_cond_masked_key cond (cond_expr, ncopies * vec_num);
>> +	      if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>> +		{
>> +		  bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
>> +		  cond.code = invert_tree_comparison (cond.code, honor_nans);
>> +		  if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>> +		    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>> +		}
>> +	    }
>> +	  else
>> +	    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>> +	}
>>  
>>        STMT_VINFO_TYPE (stmt_info) = condition_vec_info_type;
>>        vect_model_simple_cost (vinfo, stmt_info, ncopies, dts, ndts, slp_node,
> 
> I don't understand this part.

This is for the regression case on aarch64:

PASS->FAIL: gcc.target/aarch64/sve/reduc_8.c -march=armv8.2-a+sve  scan-assembler-not \\tcmpeq\\tp[0-9]+\\.s,

As you mentioned before, we would expect to record masks for partial-vectors
reduction, otherwise the inactive lanes could be unsafe.  For this failed case, the
reduction_type is TREE_CODE_REDUCTION, so we won't record a loop mask.  But it's
still safe since the mask is further ANDed with some loop mask.  The difference
looks like:

Without mask AND loop mask optimization:

  loop_mask =...
  v1 = .MASK_LOAD (a, loop_mask)
  mask1 = v1 == {cst, ...}                // unsafe since it's generated from the full width.
  mask2 = loop_mask & mask1               // safe, since it's ANDed with the loop mask?
  v2 = .MASK_LOAD (b, mask2)
  vres = VEC_COND_EXPR < mask1, vres, v2> // unsafe because of mask1

With mask AND loop mask optimization:

  loop_mask =...
  v1 = .MASK_LOAD (a, loop_mask)
  mask1 = v1 == {cst, ...}
  mask2 = loop_mask & mask1       
  v2 = .MASK_LOAD (b, mask2)
  vres = VEC_COND_EXPR < mask2, vres, v2> // safe because of mask2?


ANDing with the loop mask can make unsafe inactive lanes safe.  So the fix here is to
further check whether that optimization is possible; if it is, we know it's safe.  Does
that make sense?

> 
>> @@ -11910,3 +12065,36 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
>>    *nunits_vectype_out = nunits_vectype;
>>    return opt_result::success ();
>>  }
>> +
>> +/* Generate and return statement sequence that sets vector length LEN that is:
>> +
>> +   min_of_start_and_end = min (START_INDEX, END_INDEX);
>> +   left_len = END_INDEX - min_of_start_and_end;
>> +   rhs = min (left_len, LEN_LIMIT);
>> +   LEN = rhs;
>> +
>> +   TODO: for now, rs6000 supported vector with length only cares 8-bits, which
>> +   means if we have left_len in bytes larger than 255, it can't be saturated to
>> +   vector limit (vector size).  One target hook can be provided if other ports
>> +   don't suffer this.
>> +*/
> 
> Should be no line break before the */
> 
> Personally I think it'd be better to drop the TODO.  This isn't the only
> place that would need to change if we allowed out-of-range lengths,
> whereas the comment might give the impression that it is.
> 

Sorry, I might be missing something, but all undetermined lengths are generated here;
are the other places you meant the docs or somewhere else?

>> +
>> +gimple_seq
>> +vect_gen_len (tree len, tree start_index, tree end_index, tree len_limit)
>> +{
>> +  gimple_seq stmts = NULL;
>> +  tree len_type = TREE_TYPE (len);
>> +  gcc_assert (TREE_TYPE (start_index) == len_type);
>> +
>> +  tree min = fold_build2 (MIN_EXPR, len_type, start_index, end_index);
>> +  tree left_len = fold_build2 (MINUS_EXPR, len_type, end_index, min);
>> +  left_len = fold_build2 (MIN_EXPR, len_type, left_len, len_limit);
>> +
>> +  tree rhs = force_gimple_operand (left_len, &stmts, true, NULL_TREE);
>> +  gimple *new_stmt = gimple_build_assign (len, rhs);
>> +  gimple_stmt_iterator i = gsi_last (stmts);
>> +  gsi_insert_after_without_update (&i, new_stmt, GSI_CONTINUE_LINKING);
>> +
>> +  return stmts;
>> +}
> 
> It's better to build this up using gimple_build instead.
> 

Will fix it.

[snip]

>> +  bool epil_using_partial_vectors_p;
>> +
>>    /* When we have grouped data accesses with gaps, we may introduce invalid
>>       memory accesses.  We peel the last iteration of the loop to prevent
>>       this.  */
> 
> Thanks,
> Richard
> 


BR,
Kewen


* [PATCH 1/7 v8] ifn/optabs: Support vector load/store with length
  2020-06-30 15:32                               ` Richard Sandiford
@ 2020-07-01 13:35                                 ` Kewen.Lin
  2020-07-07  9:24                                   ` Richard Sandiford
  0 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-07-01 13:35 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford
  Cc: Jim Wilson, Richard Biener, Bill Schmidt, David Edelsohn,
	Segher Boessenkool, Jim Wilson

[-- Attachment #1: Type: text/plain, Size: 3708 bytes --]

Hi Richard,

on 2020/6/30 11:32 PM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> Hi Richard,
>>
>> Thanks for the comments!
>>
>> on 2020/6/29 下午6:07, Richard Sandiford wrote:
>>> Thanks for the update.  I agree with the summary of the IRC discussion
>>> except for…
>>>
>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>> Hi Richard S./Richi/Jim/Segher,
>>>>
>>>> Thanks a lot for your comments to make this patch more solid.
>>>>
>>>> Based on our discussion, for the vector load/store with length
>>>> optab, the length unit would be measured in lanes by default.
>>>> For the targets which support length measured in bytes like Power,
>>>> they should only define VnQI modes to wrap the other same size
>>>> vector modes.  If the length is larger than total lane/byte count
>>>> of the given mode, it's taken to load all lanes/bytes implicitly.
>>>
>>> …this last bit.  IMO the behaviour of the optab should be undefined
>>> when the supplied length is greater than the number of lanes.
>>>
>>> I think that also makes things better for the lxvl implementation,
>>> which ignores the upper 56 bits of the length.  It sounds like the
>>> above semantics would instead require Power to saturate the value
>>> at 255 before shifting it.
>>>
>>
>> Good catch, I just realized that this part is inconsistent to what I
>> implemented in patch 5/7, where the function vect_gen_len still does
>> the min operation between the given length and length_limit.
>>
>> This patch is updated accordingly to state the behavior to be undefined.
>> The others aren't required to change.
>>
>> Could you have a further look? Thanks in advance!
>>
>> v6/v7: Updated optab descriptions.
>>
>> v5:
>>   - Updated lenload/lenstore optab to len_load/len_store and the docs.
>>   - Rename expand_mask_{load,store}_optab_fn to expand_partial_{load,store}_optab_fn
>>   - Added/updated macros for expand_mask_{load,store}_optab_fn
>>     and expand_len_{load,store}_optab_fn
>>
>> v4: Update len_load_direct/len_store_direct to align with direct optab.
>>
>> v3: Get rid of length mode hook.
> 
> Thanks, mostly looks good, just some comments about the documentation…
> 

Thanks here again!!!

V8 attached with updates according to your comments!  

Could you have another look?  Thanks!

-----

v6/v7/v8: Updated optab descriptions.

v5:
  - Updated lenload/lenstore optab to len_load/len_store and the docs.
  - Rename expand_mask_{load,store}_optab_fn to expand_partial_{load,store}_optab_fn
  - Added/updated macros for expand_mask_{load,store}_optab_fn
    and expand_len_{load,store}_optab_fn

v4: Update len_load_direct/len_store_direct to align with direct optab.

v3: Get rid of length mode hook.

BR,
Kewen
-----
gcc/ChangeLog:

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* doc/md.texi (len_load_@var{m}): Document.
	(len_store_@var{m}): Likewise.
	* internal-fn.c (len_load_direct): New macro.
	(len_store_direct): Likewise.
	(expand_len_load_optab_fn): Likewise.
	(expand_len_store_optab_fn): Likewise.
	(direct_len_load_optab_supported_p): Likewise.
	(direct_len_store_optab_supported_p): Likewise.
	(expand_mask_load_optab_fn): New macro.  Original renamed to ...
	(expand_partial_load_optab_fn): ... here.  Add handlings for
	len_load_optab.
	(expand_mask_store_optab_fn): New macro.  Original renamed to ...
	(expand_partial_store_optab_fn): ... here. Add handlings for
	len_store_optab.
	(internal_load_fn_p): Handle IFN_LEN_LOAD.
	(internal_store_fn_p): Handle IFN_LEN_STORE.
	(internal_fn_stored_value_index): Handle IFN_LEN_STORE.
	* internal-fn.def (LEN_LOAD): New internal function.
	(LEN_STORE): Likewise.
	* optabs.def (len_load_optab, len_store_optab): New optab.

[-- Attachment #2: ifn_v8.diff --]
[-- Type: text/plain, Size: 9580 bytes --]

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 2c67c818da5..2b462869437 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5167,6 +5167,32 @@ mode @var{n}.
 
 This pattern is not allowed to @code{FAIL}.
 
+@cindex @code{len_load_@var{m}} instruction pattern
+@item @samp{len_load_@var{m}}
+Load the number of vector elements specified by operand 2 from memory
+operand 1 into vector register operand 0, setting the other elements of
+operand 0 to undefined values.  Operands 0 and 1 have mode @var{m},
+which must be a vector mode.  Operand 2 has whichever integer mode the
+target prefers.  If operand 2 exceeds the number of elements in mode
+@var{m}, the behavior is undefined.  If the target prefers the length
+to be measured in bytes rather than elements, it should only implement
+this pattern for vectors of @code{QI} elements.
+
+This pattern is not allowed to @code{FAIL}.
+
+@cindex @code{len_store_@var{m}} instruction pattern
+@item @samp{len_store_@var{m}}
+Store the number of vector elements specified by operand 2 from vector
+register operand 1 into memory operand 0, leaving the other elements of
+operand 0 unchanged.  Operands 0 and 1 have mode @var{m}, which must be
+a vector mode.  Operand 2 has whichever integer mode the target prefers.
+If operand 2 exceeds the number of elements in mode @var{m}, the behavior
+is undefined.  If the target prefers the length to be measured in bytes
+rather than elements, it should only implement this pattern for vectors
+of @code{QI} elements.
+
+This pattern is not allowed to @code{FAIL}.
+
 @cindex @code{vec_perm@var{m}} instruction pattern
 @item @samp{vec_perm@var{m}}
 Output a (variable) vector permutation.  Operand 0 is the destination
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 4f088de48d5..1e53ced60eb 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -104,10 +104,12 @@ init_internal_fns ()
 #define load_lanes_direct { -1, -1, false }
 #define mask_load_lanes_direct { -1, -1, false }
 #define gather_load_direct { 3, 1, false }
+#define len_load_direct { -1, -1, false }
 #define mask_store_direct { 3, 2, false }
 #define store_lanes_direct { 0, 0, false }
 #define mask_store_lanes_direct { 0, 0, false }
 #define scatter_store_direct { 3, 1, false }
+#define len_store_direct { 3, 3, false }
 #define unary_direct { 0, 0, true }
 #define binary_direct { 0, 0, true }
 #define ternary_direct { 0, 0, true }
@@ -2478,10 +2480,10 @@ expand_call_mem_ref (tree type, gcall *stmt, int index)
   return fold_build2 (MEM_REF, type, addr, build_int_cst (alias_ptr_type, 0));
 }
 
-/* Expand MASK_LOAD{,_LANES} call STMT using optab OPTAB.  */
+/* Expand MASK_LOAD{,_LANES} or LEN_LOAD call STMT using optab OPTAB.  */
 
 static void
-expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
+expand_partial_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 {
   class expand_operand ops[3];
   tree type, lhs, rhs, maskt;
@@ -2497,6 +2499,8 @@ expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 
   if (optab == vec_mask_load_lanes_optab)
     icode = get_multi_vector_move (type, optab);
+  else if (optab == len_load_optab)
+    icode = direct_optab_handler (optab, TYPE_MODE (type));
   else
     icode = convert_optab_handler (optab, TYPE_MODE (type),
 				   TYPE_MODE (TREE_TYPE (maskt)));
@@ -2507,18 +2511,24 @@ expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
   target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   create_output_operand (&ops[0], target, TYPE_MODE (type));
   create_fixed_operand (&ops[1], mem);
-  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
+  if (optab == len_load_optab)
+    create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
+				 TYPE_UNSIGNED (TREE_TYPE (maskt)));
+  else
+    create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
   expand_insn (icode, 3, ops);
   if (!rtx_equal_p (target, ops[0].value))
     emit_move_insn (target, ops[0].value);
 }
 
+#define expand_mask_load_optab_fn expand_partial_load_optab_fn
 #define expand_mask_load_lanes_optab_fn expand_mask_load_optab_fn
+#define expand_len_load_optab_fn expand_partial_load_optab_fn
 
-/* Expand MASK_STORE{,_LANES} call STMT using optab OPTAB.  */
+/* Expand MASK_STORE{,_LANES} or LEN_STORE call STMT using optab OPTAB.  */
 
 static void
-expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
+expand_partial_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 {
   class expand_operand ops[3];
   tree type, lhs, rhs, maskt;
@@ -2532,6 +2542,8 @@ expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 
   if (optab == vec_mask_store_lanes_optab)
     icode = get_multi_vector_move (type, optab);
+  else if (optab == len_store_optab)
+    icode = direct_optab_handler (optab, TYPE_MODE (type));
   else
     icode = convert_optab_handler (optab, TYPE_MODE (type),
 				   TYPE_MODE (TREE_TYPE (maskt)));
@@ -2542,11 +2554,17 @@ expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
   reg = expand_normal (rhs);
   create_fixed_operand (&ops[0], mem);
   create_input_operand (&ops[1], reg, TYPE_MODE (type));
-  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
+  if (optab == len_store_optab)
+    create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
+				 TYPE_UNSIGNED (TREE_TYPE (maskt)));
+  else
+    create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
   expand_insn (icode, 3, ops);
 }
 
+#define expand_mask_store_optab_fn expand_partial_store_optab_fn
 #define expand_mask_store_lanes_optab_fn expand_mask_store_optab_fn
+#define expand_len_store_optab_fn expand_partial_store_optab_fn
 
 static void
 expand_ABNORMAL_DISPATCHER (internal_fn, gcall *)
@@ -3128,10 +3146,12 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
 #define direct_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_gather_load_optab_supported_p convert_optab_supported_p
+#define direct_len_load_optab_supported_p direct_optab_supported_p
 #define direct_mask_store_optab_supported_p convert_optab_supported_p
 #define direct_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_scatter_store_optab_supported_p convert_optab_supported_p
+#define direct_len_store_optab_supported_p direct_optab_supported_p
 #define direct_while_optab_supported_p convert_optab_supported_p
 #define direct_fold_extract_optab_supported_p direct_optab_supported_p
 #define direct_fold_left_optab_supported_p direct_optab_supported_p
@@ -3498,6 +3518,7 @@ internal_load_fn_p (internal_fn fn)
     case IFN_MASK_LOAD_LANES:
     case IFN_GATHER_LOAD:
     case IFN_MASK_GATHER_LOAD:
+    case IFN_LEN_LOAD:
       return true;
 
     default:
@@ -3517,6 +3538,7 @@ internal_store_fn_p (internal_fn fn)
     case IFN_MASK_STORE_LANES:
     case IFN_SCATTER_STORE:
     case IFN_MASK_SCATTER_STORE:
+    case IFN_LEN_STORE:
       return true;
 
     default:
@@ -3577,6 +3599,7 @@ internal_fn_stored_value_index (internal_fn fn)
     case IFN_MASK_STORE:
     case IFN_SCATTER_STORE:
     case IFN_MASK_SCATTER_STORE:
+    case IFN_LEN_STORE:
       return 3;
 
     default:
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 1d190d492ff..17dac128e83 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -49,11 +49,13 @@ along with GCC; see the file COPYING3.  If not see
    - load_lanes: currently just vec_load_lanes
    - mask_load_lanes: currently just vec_mask_load_lanes
    - gather_load: used for {mask_,}gather_load
+   - len_load: currently just len_load
 
    - mask_store: currently just maskstore
    - store_lanes: currently just vec_store_lanes
    - mask_store_lanes: currently just vec_mask_store_lanes
    - scatter_store: used for {mask_,}scatter_store
+   - len_store: currently just len_store
 
    - unary: a normal unary optab, such as vec_reverse_<mode>
    - binary: a normal binary optab, such as vec_interleave_lo_<mode>
@@ -127,6 +129,8 @@ DEF_INTERNAL_OPTAB_FN (GATHER_LOAD, ECF_PURE, gather_load, gather_load)
 DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
 		       mask_gather_load, gather_load)
 
+DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, len_load, len_load)
+
 DEF_INTERNAL_OPTAB_FN (SCATTER_STORE, 0, scatter_store, scatter_store)
 DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
 		       mask_scatter_store, scatter_store)
@@ -136,6 +140,8 @@ DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
 DEF_INTERNAL_OPTAB_FN (MASK_STORE_LANES, 0,
 		       vec_mask_store_lanes, mask_store_lanes)
 
+DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
+
 DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
 DEF_INTERNAL_OPTAB_FN (CHECK_RAW_PTRS, ECF_CONST | ECF_NOTHROW,
 		       check_raw_ptrs, check_ptrs)
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 0c64eb52a8d..78409aa1453 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -435,3 +435,5 @@ OPTAB_D (check_war_ptrs_optab, "check_war_ptrs$a")
 OPTAB_DC (vec_duplicate_optab, "vec_duplicate$a", VEC_DUPLICATE)
 OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
 OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
+OPTAB_D (len_load_optab, "len_load_$a")
+OPTAB_D (len_store_optab, "len_store_$a")
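
For reference, a rough C-level model of the len_load/len_store semantics
documented above; it is illustrative only (not part of the attached patch)
and assumes the byte-length case with QI elements:

  /* Model of len_load_m: elements [0, len) come from memory, the
     remaining elements of the destination are undefined (modelled
     here by leaving them untouched).  */
  void
  len_load_model (unsigned char *dest, const unsigned char *src,
                  unsigned int len)
  {
    for (unsigned int i = 0; i < len; i++)
      dest[i] = src[i];
  }

  /* Model of len_store_m: only the first LEN elements of the memory
     destination are written, the rest is left unchanged.  */
  void
  len_store_model (unsigned char *dest, const unsigned char *src,
                   unsigned int len)
  {
    for (unsigned int i = 0; i < len; i++)
      dest[i] = src[i];
  }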

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/7 v6] vect: Support vector load/store with length in vectorizer
  2020-07-01 13:23                           ` Kewen.Lin
@ 2020-07-01 15:17                             ` Richard Sandiford
  2020-07-02  5:20                               ` Kewen.Lin
  0 siblings, 1 reply; 80+ messages in thread
From: Richard Sandiford @ 2020-07-01 15:17 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Bill Schmidt, Richard Biener, Segher Boessenkool, dje.gcc

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> on 2020/7/1 3:53 AM, Richard Sandiford wrote:
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>    poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>>> +  tree length_limit = NULL_TREE;
>>> +  /* For length, we need length_limit to check length in range.  */
>>> +  if (!vect_for_masking)
>>> +    {
>>> +      poly_uint64 len_limit = nscalars_per_ctrl * rgc->factor;
>>> +      length_limit = build_int_cst (compare_type, len_limit);
>>> +    }
>>>  
>>>    /* Calculate the maximum number of scalar values that the rgroup
>>>       handles in total, the number that it handles for each iteration
>>> @@ -434,12 +445,12 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>>    tree nscalars_total = niters;
>>>    tree nscalars_step = build_int_cst (iv_type, vf);
>>>    tree nscalars_skip = niters_skip;
>>> -  if (nscalars_per_iter != 1)
>>> +  if (nscalars_per_iter_ft != 1)
>>>      {
>>>        /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
>>>  	 these multiplications don't overflow.  */
>>> -      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
>>> -      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
>>> +      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter_ft);
>>> +      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter_ft);
>>>        nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
>>>  				     nscalars_total, compare_factor);
>>>        nscalars_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
>>> @@ -509,7 +520,7 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>>  	     NSCALARS_SKIP to that cannot overflow.  */
>>>  	  tree const_limit = build_int_cst (compare_type,
>>>  					    LOOP_VINFO_VECT_FACTOR (loop_vinfo)
>>> -					    * nscalars_per_iter);
>>> +					    * nscalars_per_iter_ft);
>>>  	  first_limit = gimple_build (preheader_seq, MIN_EXPR, compare_type,
>>>  				      nscalars_total, const_limit);
>>>  	  first_limit = gimple_build (preheader_seq, PLUS_EXPR, compare_type,
>> 
>> It looks odd that we don't need to adjust the other nscalars_* values too.
>> E.g. the above seems to be comparing an unscaled nscalars_total with
>> a scaled nscalars_per_iter.  I think the units ought to “agree”,
>> both here and in the rest of the function.
>> 
>
> Sorry, I didn't quite follow this comment.  Both nscalars_total and
> nscalars_step are scaled here.  The only remaining related nscalars_*
> seems to be nscalars_skip, but length can't support skip.

Hmm, OK.  But in that case can you update the names of the variables
to match?  It's confusing to have some nscalars_* variables actually
count scalars (and thus have “nitems” equivalents) and other nscalars_*
variables count something else (and thus effectively be nitems_* variables
themselves).

>
>>>  	}
>>>  
>>> +      /* First iteration is full.  */
>> 
>> This comment belongs inside the “if”.
>> 
>
> Sorry, I might miss something, but isn't this applied for both?

I meant it should be…

>
>>>        if (!init_ctrl)
>>> -	/* First iteration is full.  */
>>> -	init_ctrl = build_minus_one_cst (ctrl_type);
>>> +	{
>>> +	  if (vect_for_masking)
>>> +	    init_ctrl = build_minus_one_cst (ctrl_type);
>>> +	  else
>>> +	    init_ctrl = length_limit;
>>> +	}

  if (!init_ctrl)
    {
      /* First iteration is full.  */
      if (vect_for_masking)
        init_ctrl = build_minus_one_cst (ctrl_type);
      else
        init_ctrl = length_limit;
    }

since the comment only applies to the “!init_ctrl” case.  The point
of a nonnull init_ctrl is to create cases in which the first vector
is not a full vector.

>>>  
>>> […]
>>> @@ -2568,7 +2608,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>>>    if (vect_epilogues
>>>        && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>>>        && prolog_peeling >= 0
>>> -      && known_eq (vf, lowest_vf))
>>> +      && known_eq (vf, lowest_vf)
>>> +      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (epilogue_vinfo))
>>>      {
>>>        unsigned HOST_WIDE_INT eiters
>>>  	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
>> 
>> I'm still not really convinced that this check is right.  It feels
>> like it's hiding a problem elsewhere.
>> 
>
> The comments above this hunk is that:
>
>   /* If we know the number of scalar iterations for the main loop we should
>      check whether after the main loop there are enough iterations left over
>      for the epilogue.  */
>
> So it's to check the ones in loop_vinfo->epilogue_vinfos whether can be removed.
> And the main work in the loop is to remove epil_info from epilogue_vinfos.

Oops, I think I misread it as checking loop_vinfo rather than
epilogue_vinfo.  It makes more sense now. :-)

>>> +
>>> +  /* Work out how many bits we need to represent the length limit.  */
>>> +  unsigned int nscalars_per_iter_ft = rgl->max_nscalars_per_iter * rgl->factor;
>> 
>> I think this breaks the abstraction.  There's no guarantee that the
>> factor is the same for each rgroup_control, so there's no guarantee
>> that the maximum bytes per iter comes the last entry.  (Also, it'd
>> be better to avoid talking about bytes if we're trying to be general.)
>> I think we should take the maximum of each entry instead.
>> 
>
> Agree!  I guess the above "maximum bytes per iter" is a typo? and you meant
> "maximum elements per iter"?  Yes, the code is for length in bytes, checking
> the last entry is only reasonable for it.  Will update it to check all entries
> instead.

I meant bytes, since that's what the code is effectively calculating
(at least for Power).  I.e. I think this breaks the abstraction even
if we assume the Power scheme to measuring length, since in principle
it's possible to fix different vector sizes in the same vector region.

>>> +     we perfer to still use the niters type.  */
>>> +  unsigned int ni_prec
>>> +    = TYPE_PRECISION (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)));
>>> +  /* Prefer to use Pmode and wider IV to avoid narrow conversions.  */
>>> +  unsigned int pmode_prec = GET_MODE_BITSIZE (Pmode);
>>> +
>>> +  unsigned int required_prec = ni_prec;
>>> +  if (required_prec < pmode_prec)
>>> +    required_prec = pmode_prec;
>>> +
>>> +  tree iv_type = NULL_TREE;
>>> +  if (min_ni_prec > required_prec)
>>> +    {
>> 
>> Do we need this condition?  Looks like we could just do:
>> 
>>   min_ni_prec = MAX (min_ni_prec, GET_MODE_BITSIZE (Pmode));
>>   min_ni_prec = MAX (min_ni_prec, ni_prec);
>> 
>> and then run the loop below.
>> 
>
> I think the assumption holds that Pmode and niters type are standard integral
> type?  If so, both of them don't need the below loop to build the integer
> type, but min_ni_prec needs.  Does it make sense to differentiate them?

IMO we should handle them the same way, i.e. always use the loop.
For example, Pmode can be a partial integer mode on some targets,
so it isn't guaranteed to give a nice power-of-2 integer type.

Maybe having a special case would be worth it if this was performance-
critical code, but since it isn't, having all cases go through the same
path seems better.  It also means that the loop will get more testing
coverage.
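
For concreteness, a rough sketch of how both cases could go through the
same mode walk (variable names follow the patch; the loop is modelled on
vect_verify_full_masking and is only an illustration, not the final code):

  min_ni_prec = MAX (min_ni_prec, GET_MODE_BITSIZE (Pmode));
  min_ni_prec = MAX (min_ni_prec, ni_prec);

  tree iv_type = NULL_TREE;
  opt_scalar_int_mode tmode_iter;
  FOR_EACH_MODE_IN_CLASS (tmode_iter, MODE_INT)
    {
      scalar_int_mode tmode = tmode_iter.require ();
      unsigned int tbits = GET_MODE_BITSIZE (tmode);
      /* Pick the first standard integer mode that is wide enough.  */
      if (tbits >= min_ni_prec
          && targetm.scalar_mode_supported_p (tmode))
        {
          iv_type = build_nonstandard_integer_type (tbits, true);
          break;
        }
    }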

>>> +      /* Decide whether to use fully-masked approach.  */
>>> +      if (vect_verify_full_masking (loop_vinfo))
>>> +	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
>>> +      /* Decide whether to use length-based approach.  */
>>> +      else if (vect_verify_loop_lens (loop_vinfo))
>>> +	{
>>> +	  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>>> +	      || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
>>> +	    {
>>> +	      if (dump_enabled_p ())
>>> +		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>> +				 "can't vectorize this loop with length-based"
>>> +				 " partial vectors approach becuase peeling"
>>> +				 " for alignment or gaps is required.\n");
>>> +	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>> +	    }
>> 
>> Why are these peeling cases necessary?  Peeling for gaps should
>> just mean subtracting one scalar iteration from the iteration count
>> and shouldn't otherwise affect the main loop.  Similarly, peeling for
>> alignment can be handled in the normal way, with a scalar prologue loop.
>> 
>
> I was thinking to relax this later and to avoid handling too many cases
> in the first enablement patch.  Since the Power hardware level that is
> able to support vector with length also supports unaligned load/store,
> we would need to construct some cases for them.  May I postpone it a
> bit?  Or do you prefer me to support it here?

I've no objection to postponing it if there are specific known
problems that make it difficult, but I think we should at least
say what they are.  On the face of it, I'm not sure why it doesn't
Just Work, since the way that we control the main loop should be
mostly orthogonal to how we handle peeled prologue iterations
and how we handle a single peeled epilogue iteration.

>>> @@ -9850,11 +9986,30 @@ vectorizable_condition (vec_info *vinfo,
>>>  	  return false;
>>>  	}
>>>  
>>> -      if (loop_vinfo
>>> -	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
>>> -	  && reduction_type == EXTRACT_LAST_REDUCTION)
>>> -	vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>>> -			       ncopies * vec_num, vectype, NULL);
>>> +      if (loop_vinfo && for_reduction
>>> +	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
>>> +	{
>>> +	  if (reduction_type == EXTRACT_LAST_REDUCTION)
>>> +	    vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>>> +				   ncopies * vec_num, vectype, NULL);
>>> +	  /* Using partial vectors can introduce inactive lanes in the last
>>> +	     iteration, since full vector of condition results are operated,
>>> +	     it's unsafe here.  But if we can AND the condition mask with
>>> +	     loop mask, it would be safe then.  */
>>> +	  else if (!loop_vinfo->scalar_cond_masked_set.is_empty ())
>>> +	    {
>>> +	      scalar_cond_masked_key cond (cond_expr, ncopies * vec_num);
>>> +	      if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>>> +		{
>>> +		  bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
>>> +		  cond.code = invert_tree_comparison (cond.code, honor_nans);
>>> +		  if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>>> +		    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>> +		}
>>> +	    }
>>> +	  else
>>> +	    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>> +	}
>>>  
>>>        STMT_VINFO_TYPE (stmt_info) = condition_vec_info_type;
>>>        vect_model_simple_cost (vinfo, stmt_info, ncopies, dts, ndts, slp_node,
>> 
>> I don't understand this part.
>
> This is for the regression case on aarch64:
>
> PASS->FAIL: gcc.target/aarch64/sve/reduc_8.c -march=armv8.2-a+sve  scan-assembler-not \\tcmpeq\\tp[0-9]+\\.s,

OK, if this is an SVE thing, it should really be a separate patch.
(And thanks for testing SVE.)

> As you mentioned before, we would expect to record masks for partial vectors reduction, 
> otherwise the inactive lanes would be possibly unsafe.  For this failed case, the
> reduction_type is TREE_CODE_REDUCTION, we won't record loop mask.  But it's still safe
> since the mask is further AND with some loop mask.  The difference looks like:
>
> Without mask AND loop mask optimization:
>
>   loop_mask =...
>   v1 = .MASK_LOAD (a, loop_mask)
>   mask1 = v1 == {cst, ...}                // unsafe since it's generated from full width.
>   mask2 = loop_mask & mask1               // safe, since it's AND with loop mask?
>   v2 = .MASK_LOAD (b, mask2)
>   vres = VEC_COND_EXPR < mask1, vres, v2> // unsafe coz of mask1
>
> With mask AND loop mask optimization:
>
>   loop_mask =...
>   v1 = .MASK_LOAD (a, loop_mask)
>   mask1 = v1 == {cst, ...}
>   mask2 = loop_mask & mask1       
>   v2 = .MASK_LOAD (b, mask2)
>   vres = VEC_COND_EXPR < mask2, vres, v2> // safe coz of mask2?
>
>
> The loop mask ANDing can make unsafe inactive lanes safe.  So the fix here is to further check
> it's possible to be optimized further, if it can, we can know it's safe.  Does it make sense?

But in this particular test, we're doing outer loop vectorisation,
and the only elements of vres that matter are the ones selected
by loop_mask (since those are the only ones that get stored out).
So applying the loop mask to the VEC_COND_EXPR is “just” an
(important) optimisation, rather than a correctness issue.

What's causing the test to start failing with the patch?  I realise
you've probably already said, sorry, but it's been a large patch series
so it's hard to keep all the details committed to memory.

>>> @@ -11910,3 +12065,36 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
>>>    *nunits_vectype_out = nunits_vectype;
>>>    return opt_result::success ();
>>>  }
>>> +
>>> +/* Generate and return statement sequence that sets vector length LEN that is:
>>> +
>>> +   min_of_start_and_end = min (START_INDEX, END_INDEX);
>>> +   left_len = END_INDEX - min_of_start_and_end;
>>> +   rhs = min (left_len, LEN_LIMIT);
>>> +   LEN = rhs;
>>> +
>>> +   TODO: for now, rs6000 supported vector with length only cares 8-bits, which
>>> +   means if we have left_len in bytes larger than 255, it can't be saturated to
>>> +   vector limit (vector size).  One target hook can be provided if other ports
>>> +   don't suffer this.
>>> +*/
>> 
>> Should be no line break before the */
>> 
>> Personally I think it'd be better to drop the TODO.  This isn't the only
>> place that would need to change if we allowed out-of-range lengths,
>> whereas the comment might give the impression that it is.
>> 
>
> Sorry I might miss something, but all undetermined lengths are generated here,
> the other places you meant is doc or elsewhere?

For example, we'd need to start querying the length operand of the optabs
to see what length precision the target uses, since it would be invalid
to do this optimisation for IVs that are wider than that precision.
The routine above doesn't seem the right place to do that.

It could also affect the semantics of the IFNs, if we ever added
folding rules for them.  So yeah, it boils down to this not being
a local decision for this routine -- it's tied to the optab and
IFN behaviour too.
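
As an illustration of the kind of query that would involve (a sketch only;
the variable names and placement are hypothetical), the vectorizer could
ask the optab which mode the target uses for the length operand and refuse
IVs wider than that:

  machine_mode vmode = TYPE_MODE (vectype);
  insn_code icode = direct_optab_handler (len_load_optab, vmode);
  if (icode != CODE_FOR_nothing)
    {
      /* Operand 2 of len_load_<mode> is the length.  */
      machine_mode len_mode = insn_data[icode].operand[2].mode;
      scalar_int_mode len_imode;
      if (is_a <scalar_int_mode> (len_mode, &len_imode)
          && GET_MODE_PRECISION (len_imode) < TYPE_PRECISION (iv_type))
        /* The IV is wider than the length operand, so the MIN-based
           length computation could wrap and must not be used.  */
        use_min_based_length = false;
    }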

Thanks,
Richard

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/7 v6] vect: Support vector load/store with length in vectorizer
  2020-07-01 15:17                             ` Richard Sandiford
@ 2020-07-02  5:20                               ` Kewen.Lin
  2020-07-07  9:26                                 ` Kewen.Lin
  2020-07-07 10:15                                 ` Richard Sandiford
  0 siblings, 2 replies; 80+ messages in thread
From: Kewen.Lin @ 2020-07-02  5:20 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford
  Cc: Bill Schmidt, Richard Biener, Segher Boessenkool, dje.gcc

Hi Richard,

on 2020/7/1 11:17 PM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> on 2020/7/1 3:53 AM, Richard Sandiford wrote:
>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>>    poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>>>> +  tree length_limit = NULL_TREE;
>>>> +  /* For length, we need length_limit to check length in range.  */
>>>> +  if (!vect_for_masking)
>>>> +    {
>>>> +      poly_uint64 len_limit = nscalars_per_ctrl * rgc->factor;
>>>> +      length_limit = build_int_cst (compare_type, len_limit);
>>>> +    }
>>>>  
>>>>    /* Calculate the maximum number of scalar values that the rgroup
>>>>       handles in total, the number that it handles for each iteration
>>>> @@ -434,12 +445,12 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>>>    tree nscalars_total = niters;
>>>>    tree nscalars_step = build_int_cst (iv_type, vf);
>>>>    tree nscalars_skip = niters_skip;
>>>> -  if (nscalars_per_iter != 1)
>>>> +  if (nscalars_per_iter_ft != 1)
>>>>      {
>>>>        /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
>>>>  	 these multiplications don't overflow.  */
>>>> -      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
>>>> -      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
>>>> +      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter_ft);
>>>> +      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter_ft);
>>>>        nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
>>>>  				     nscalars_total, compare_factor);
>>>>        nscalars_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
>>>> @@ -509,7 +520,7 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>>>  	     NSCALARS_SKIP to that cannot overflow.  */
>>>>  	  tree const_limit = build_int_cst (compare_type,
>>>>  					    LOOP_VINFO_VECT_FACTOR (loop_vinfo)
>>>> -					    * nscalars_per_iter);
>>>> +					    * nscalars_per_iter_ft);
>>>>  	  first_limit = gimple_build (preheader_seq, MIN_EXPR, compare_type,
>>>>  				      nscalars_total, const_limit);
>>>>  	  first_limit = gimple_build (preheader_seq, PLUS_EXPR, compare_type,
>>>
>>> It looks odd that we don't need to adjust the other nscalars_* values too.
>>> E.g. the above seems to be comparing an unscaled nscalars_total with
>>> a scaled nscalars_per_iter.  I think the units ought to “agree”,
>>> both here and in the rest of the function.
>>>
>>
>> Sorry, I didn't quite follow this comment.  Both nscalars_total and
>> nscalars_step are scaled here.  The only remaining related nscalars_*
>> seems to be nscalars_skip, but length can't support skip.
> 
> Hmm, OK.  But in that case can you update the names of the variables
> to match?  It's confusing to have some nscalars_* variables actually
> count scalars (and thus have “nitems” equivalents) and other nscalars_*
> variables count something else (and thus effectively be nitems_* variables
> themselves).
> 

OK.  I'll update the names like nscalars_total/nscalars_step and equivalents
to nitems_total/... (or would nunits_total be better?)

>>
>>>>  	}
>>>>  
>>>> +      /* First iteration is full.  */
>>>
>>> This comment belongs inside the “if”.
>>>
>>
>> Sorry, I might miss something, but isn't this applied for both?
> 
> I meant it should be…
> 
>>
>>>>        if (!init_ctrl)
>>>> -	/* First iteration is full.  */
>>>> -	init_ctrl = build_minus_one_cst (ctrl_type);
>>>> +	{
>>>> +	  if (vect_for_masking)
>>>> +	    init_ctrl = build_minus_one_cst (ctrl_type);
>>>> +	  else
>>>> +	    init_ctrl = length_limit;
>>>> +	}
> 
>   if (!init_ctrl)
>     {
>       /* First iteration is full.  */
>       if (vect_for_masking)
>         init_ctrl = build_minus_one_cst (ctrl_type);
>       else
>         init_ctrl = length_limit;
>     }
> 
> since the comment only applies to the “!init_ctrl” case.  The point
> of a nonnull init_ctrl is to create cases in which the first vector
> is not a full vector.
> 

Got it, will fix it.

>>>>  
>>>> […]
>>>> @@ -2568,7 +2608,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>>>>    if (vect_epilogues
>>>>        && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>>>>        && prolog_peeling >= 0
>>>> -      && known_eq (vf, lowest_vf))
>>>> +      && known_eq (vf, lowest_vf)
>>>> +      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (epilogue_vinfo))
>>>>      {
>>>>        unsigned HOST_WIDE_INT eiters
>>>>  	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
>>>
>>> I'm still not really convinced that this check is right.  It feels
>>> like it's hiding a problem elsewhere.
>>>
>>
>> The comments above this hunk is that:
>>
>>   /* If we know the number of scalar iterations for the main loop we should
>>      check whether after the main loop there are enough iterations left over
>>      for the epilogue.  */
>>
>> So it's to check the ones in loop_vinfo->epilogue_vinfos whether can be removed.
>> And the main work in the loop is to remove epil_info from epilogue_vinfos.
> 
> Oops, I think I misread it as checking loop_vinfo rather than
> epilogue_vinfo.  It makes more sense now. :-)
> 
>>>> +
>>>> +  /* Work out how many bits we need to represent the length limit.  */
>>>> +  unsigned int nscalars_per_iter_ft = rgl->max_nscalars_per_iter * rgl->factor;
>>>
>>> I think this breaks the abstraction.  There's no guarantee that the
>>> factor is the same for each rgroup_control, so there's no guarantee
>>> that the maximum bytes per iter comes the last entry.  (Also, it'd
>>> be better to avoid talking about bytes if we're trying to be general.)
>>> I think we should take the maximum of each entry instead.
>>>
>>
>> Agree!  I guess the above "maximum bytes per iter" is a typo? and you meant
>> "maximum elements per iter"?  Yes, the code is for length in bytes, checking
>> the last entry is only reasonable for it.  Will update it to check all entries
>> instead.
> 
> I meant bytes, since that's what the code is effectively calculating
> (at least for Power).  I.e. I think this breaks the abstraction even
> if we assume the Power scheme to measuring length, since in principle
> it's possible to fix different vector sizes in the same vector region.
> 

Sorry, I didn't catch the meaning of "it's possible to fix different
vector sizes in the same vector region."  I guess if we are counting
bytes, the max nunits per iteration should come from the last entry,
since the last one holds the max bytes, which is the result of
max_nscalars_per_iter * factor.  But I agree that it breaks the
abstraction here since it doesn't apply to length in lanes.

>>>> +     we perfer to still use the niters type.  */
>>>> +  unsigned int ni_prec
>>>> +    = TYPE_PRECISION (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)));
>>>> +  /* Prefer to use Pmode and wider IV to avoid narrow conversions.  */
>>>> +  unsigned int pmode_prec = GET_MODE_BITSIZE (Pmode);
>>>> +
>>>> +  unsigned int required_prec = ni_prec;
>>>> +  if (required_prec < pmode_prec)
>>>> +    required_prec = pmode_prec;
>>>> +
>>>> +  tree iv_type = NULL_TREE;
>>>> +  if (min_ni_prec > required_prec)
>>>> +    {
>>>
>>> Do we need this condition?  Looks like we could just do:
>>>
>>>   min_ni_prec = MAX (min_ni_prec, GET_MODE_BITSIZE (Pmode));
>>>   min_ni_prec = MAX (min_ni_prec, ni_prec);
>>>
>>> and then run the loop below.
>>>
>>
>> I think the assumption holds that Pmode and niters type are standard integral
>> type?  If so, both of them don't need the below loop to build the integer
>> type, but min_ni_prec needs.  Does it make sense to differentiate them?
> 
> IMO we should handle them the same way, i.e. always use the loop.
> For example, Pmode can be a partial integer mode on some targets,
> so it isn't guaranteed to give a nice power-of-2 integer type.
> 
> Maybe having a special case would be worth it if this was performance-
> critical code, but since it isn't, having all cases go through the same
> path seems better.  It also means that the loop will get more testing
> coverage.
> 

Thanks for the explanation, it makes sense.  I'll fix it.

>>>> +      /* Decide whether to use fully-masked approach.  */
>>>> +      if (vect_verify_full_masking (loop_vinfo))
>>>> +	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
>>>> +      /* Decide whether to use length-based approach.  */
>>>> +      else if (vect_verify_loop_lens (loop_vinfo))
>>>> +	{
>>>> +	  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>>>> +	      || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
>>>> +	    {
>>>> +	      if (dump_enabled_p ())
>>>> +		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>>> +				 "can't vectorize this loop with length-based"
>>>> +				 " partial vectors approach becuase peeling"
>>>> +				 " for alignment or gaps is required.\n");
>>>> +	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>>> +	    }
>>>
>>> Why are these peeling cases necessary?  Peeling for gaps should
>>> just mean subtracting one scalar iteration from the iteration count
>>> and shouldn't otherwise affect the main loop.  Similarly, peeling for
>>> alignment can be handled in the normal way, with a scalar prologue loop.
>>>
>>
>>> I was thinking to relax this later and to avoid handling too many cases
>>> in the first enablement patch.  Since the Power hardware level that is
>>> able to support vector with length also supports unaligned load/store,
>>> we would need to construct some cases for them.  May I postpone it a
>>> bit?  Or do you prefer me to support it here?
> 
> I've no objection to postponing it if there are specific known
> problems that make it difficult, but I think we should at least
> say what they are.  On the face of it, I'm not sure why it doesn't
> Just Work, since the way that we control the main loop should be
> mostly orthogonal to how we handle peeled prologue iterations
> and how we handle a single peeled epilogue iteration.
> 

OK, I will remove it to see the impact.  By the way, do you think using
partial vectors for the prologue is something worth trying in future?

>>>> @@ -9850,11 +9986,30 @@ vectorizable_condition (vec_info *vinfo,
>>>>  	  return false;
>>>>  	}
>>>>  
>>>> -      if (loop_vinfo
>>>> -	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
>>>> -	  && reduction_type == EXTRACT_LAST_REDUCTION)
>>>> -	vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>>>> -			       ncopies * vec_num, vectype, NULL);
>>>> +      if (loop_vinfo && for_reduction
>>>> +	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
>>>> +	{
>>>> +	  if (reduction_type == EXTRACT_LAST_REDUCTION)
>>>> +	    vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>>>> +				   ncopies * vec_num, vectype, NULL);
>>>> +	  /* Using partial vectors can introduce inactive lanes in the last
>>>> +	     iteration, since full vector of condition results are operated,
>>>> +	     it's unsafe here.  But if we can AND the condition mask with
>>>> +	     loop mask, it would be safe then.  */
>>>> +	  else if (!loop_vinfo->scalar_cond_masked_set.is_empty ())
>>>> +	    {
>>>> +	      scalar_cond_masked_key cond (cond_expr, ncopies * vec_num);
>>>> +	      if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>>>> +		{
>>>> +		  bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
>>>> +		  cond.code = invert_tree_comparison (cond.code, honor_nans);
>>>> +		  if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>>>> +		    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>>> +		}
>>>> +	    }
>>>> +	  else
>>>> +	    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>>> +	}
>>>>  
>>>>        STMT_VINFO_TYPE (stmt_info) = condition_vec_info_type;
>>>>        vect_model_simple_cost (vinfo, stmt_info, ncopies, dts, ndts, slp_node,
>>>
>>> I don't understand this part.
>>
>> This is for the regression case on aarch64:
>>
>> PASS->FAIL: gcc.target/aarch64/sve/reduc_8.c -march=armv8.2-a+sve  scan-assembler-not \\tcmpeq\\tp[0-9]+\\.s,
> 
> OK, if this is an SVE thing, it should really be a separate patch.
> (And thanks for testing SVE.)
> 
>> As you mentioned before, we would expect to record masks for partial vectors reduction, 
>> otherwise the inactive lanes would be possibly unsafe.  For this failed case, the
>> reduction_type is TREE_CODE_REDUCTION, we won't record loop mask.  But it's still safe
>> since the mask is further AND with some loop mask.  The difference looks like:
>>
>> Without mask AND loop mask optimization:
>>
>>   loop_mask =...
>>   v1 = .MASK_LOAD (a, loop_mask)
>>   mask1 = v1 == {cst, ...}                // unsafe since it's generated from full width.
>>   mask2 = loop_mask & mask1               // safe, since it's AND with loop mask?
>>   v2 = .MASK_LOAD (b, mask2)
>>   vres = VEC_COND_EXPR < mask1, vres, v2> // unsafe coz of mask1
>>
>> With mask AND loop mask optimization:
>>
>>   loop_mask =...
>>   v1 = .MASK_LOAD (a, loop_mask)
>>   mask1 = v1 == {cst, ...}
>>   mask2 = loop_mask & mask1       
>>   v2 = .MASK_LOAD (b, mask2)
>>   vres = VEC_COND_EXPR < mask2, vres, v2> // safe coz of mask2?
>>
>>
>> The loop mask ANDing can make unsafe inactive lanes safe.  So the fix here is to further check
>> it's possible to be optimized further, if it can, we can know it's safe.  Does it make sense?
> 
> But in this particular test, we're doing outer loop vectorisation,
> and the only elements of vres that matter are the ones selected
> by loop_mask (since those are the only ones that get stored out).
> So applying the loop mask to the VEC_COND_EXPR is “just” an
> (important) optimisation, rather than a correctness issue.
>  

Thanks for the clarification.  It looks like vres is always safe, since its
further use is guarded with the loop mask.  Then sorry that I didn't catch
why it is an optimization for this case; is there some difference in backend
support for this different mask for cond_expr?


> What's causing the test to start failing with the patch?  I realise
> you've probably already said, sorry, but it's been a large patch series
> so it's hard to keep all the details committed to memory.
> 

No problem, I appreciate your time very much!  Since length-based partial
vectors don't support any reductions so far, the function has the
responsibility to disable use_partial_vectors_p for them.  Without the above
else-if part, since the reduction_type is TREE_CODE_REDUCTION for this case,
the else part will stop this case from using mask-based partial vectors, but
the case expects the outer loop to still be able to use mask-based partial
vectors.

Given your clarification above, the else-if looks wrong.  Probably we can
change it to check whether the current vectorization is for the outer loop
and the condition stmt being handled is in the inner loop; if so, we can
allow it for partial vectors?

>>>> @@ -11910,3 +12065,36 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
>>>>    *nunits_vectype_out = nunits_vectype;
>>>>    return opt_result::success ();
>>>>  }
>>>> +
>>>> +/* Generate and return statement sequence that sets vector length LEN that is:
>>>> +
>>>> +   min_of_start_and_end = min (START_INDEX, END_INDEX);
>>>> +   left_len = END_INDEX - min_of_start_and_end;
>>>> +   rhs = min (left_len, LEN_LIMIT);
>>>> +   LEN = rhs;
>>>> +
>>>> +   TODO: for now, rs6000 supported vector with length only cares 8-bits, which
>>>> +   means if we have left_len in bytes larger than 255, it can't be saturated to
>>>> +   vector limit (vector size).  One target hook can be provided if other ports
>>>> +   don't suffer this.
>>>> +*/
>>>
>>> Should be no line break before the */
>>>
>>> Personally I think it'd be better to drop the TODO.  This isn't the only
>>> place that would need to change if we allowed out-of-range lengths,
>>> whereas the comment might give the impression that it is.
>>>
>>
>> Sorry I might miss something, but all undetermined lengths are generated here,
>> the other places you meant is doc or elsewhere?
> 
> For example, we'd need to start querying the length operand of the optabs
> to see what length precision the target uses, since it would be invalid
> to do this optimisation for IVs that are wider than that precision.
> The routine above doesn't seem the right place to do that.
> 

OK, but it seems it's acceptable if the IV is wider than the precision,
since we allow it to be out of range?

> It could also affect the semantics of the IFNs, if we ever added
> folding rules for them.  So yeah, it boils down to this not being
> a local decision for this routine -- it's tied to the optab and
> IFN behaviour too.
> 

Agree, thanks!

BR,
Kewen

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/7 v8] ifn/optabs: Support vector load/store with length
  2020-07-01 13:35                                 ` [PATCH 1/7 v8] " Kewen.Lin
@ 2020-07-07  9:24                                   ` Richard Sandiford
  0 siblings, 0 replies; 80+ messages in thread
From: Richard Sandiford @ 2020-07-07  9:24 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Jim Wilson, Richard Biener, Bill Schmidt,
	David Edelsohn, Segher Boessenkool, Jim Wilson

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> gcc/ChangeLog:
>
> 2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>
>
> 	* doc/md.texi (len_load_@var{m}): Document.
> 	(len_store_@var{m}): Likewise.
> 	* internal-fn.c (len_load_direct): New macro.
> 	(len_store_direct): Likewise.
> 	(expand_len_load_optab_fn): Likewise.
> 	(expand_len_store_optab_fn): Likewise.
> 	(direct_len_load_optab_supported_p): Likewise.
> 	(direct_len_store_optab_supported_p): Likewise.
> 	(expand_mask_load_optab_fn): New macro.  Original renamed to ...
> 	(expand_partial_load_optab_fn): ... here.  Add handlings for
> 	len_load_optab.
> 	(expand_mask_store_optab_fn): New macro.  Original renamed to ...
> 	(expand_partial_store_optab_fn): ... here. Add handlings for
> 	len_store_optab.
> 	(internal_load_fn_p): Handle IFN_LEN_LOAD.
> 	(internal_store_fn_p): Handle IFN_LEN_STORE.
> 	(internal_fn_stored_value_index): Handle IFN_LEN_STORE.
> 	* internal-fn.def (LEN_LOAD): New internal function.
> 	(LEN_STORE): Likewise.
> 	* optabs.def (len_load_optab, len_store_optab): New optab.

OK, thanks.

Richard

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/7 v6] vect: Support vector load/store with length in vectorizer
  2020-07-02  5:20                               ` Kewen.Lin
@ 2020-07-07  9:26                                 ` Kewen.Lin
  2020-07-07 10:44                                   ` Richard Sandiford
  2020-07-07 10:15                                 ` Richard Sandiford
  1 sibling, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-07-07  9:26 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford; +Cc: Bill Schmidt, dje.gcc, Segher Boessenkool

Hi Richard,

on 2020/7/2 1:20 PM, Kewen.Lin via Gcc-patches wrote:
> on 2020/7/1 11:17 PM, Richard Sandiford wrote:
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>> on 2020/7/1 3:53 AM, Richard Sandiford wrote:
>>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
[...]
>> Hmm, OK.  But in that case can you update the names of the variables
>> to match?  It's confusing to have some nscalars_* variables actually
>> count scalars (and thus have “nitems” equivalents) and other nscalars_*
>> variables count something else (and thus effectively be nitems_* variables
>> themselves).
>>
> 
> OK.  I'll update the names like nscalars_total/nscalars_step and equivalents
> to nitems_total/... (or would nunits_total be better?)
> 

Please ignore this part, I have used nitems_ for the names.  :)

>>>>> +  /* Work out how many bits we need to represent the length limit.  */
>>>>> +  unsigned int nscalars_per_iter_ft = rgl->max_nscalars_per_iter * rgl->factor;
>>>>
>>>> I think this breaks the abstraction.  There's no guarantee that the
>>>> factor is the same for each rgroup_control, so there's no guarantee
>>>> that the maximum bytes per iter comes the last entry.  (Also, it'd
>>>> be better to avoid talking about bytes if we're trying to be general.)
>>>> I think we should take the maximum of each entry instead.
>>>>
>>>
>>> Agree!  I guess the above "maximum bytes per iter" is a typo? and you meant
>>> "maximum elements per iter"?  Yes, the code is for length in bytes, checking
>>> the last entry is only reasonable for it.  Will update it to check all entries
>>> instead.
>>
>> I meant bytes, since that's what the code is effectively calculating
>> (at least for Power).  I.e. I think this breaks the abstraction even
>> if we assume the Power scheme to measuring length, since in principle
>> it's possible to fix different vector sizes in the same vector region.
>>
> 
> Sorry, I didn't catch the meaning of "it's possible to fix different
> vector sizes in the same vector region."  I guess if we are counting
> bytes, the max nunits per iteration should come from the last entry,
> since the last one holds the max bytes, which is the result of
> max_nscalars_per_iter * factor.  But I agree that it breaks the
> abstraction here since it doesn't apply to length in lanes.
> 

On further thought, I guess you meant we could have different vector
sizes for the same loop in future?  Yes, the assumption doesn't hold then.

> 
>>>>> +      /* Decide whether to use fully-masked approach.  */
>>>>> +      if (vect_verify_full_masking (loop_vinfo))
>>>>> +	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
>>>>> +      /* Decide whether to use length-based approach.  */
>>>>> +      else if (vect_verify_loop_lens (loop_vinfo))
>>>>> +	{
>>>>> +	  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>>>>> +	      || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
>>>>> +	    {
>>>>> +	      if (dump_enabled_p ())
>>>>> +		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>>>> +				 "can't vectorize this loop with length-based"
>>>>> +				 " partial vectors approach becuase peeling"
>>>>> +				 " for alignment or gaps is required.\n");
>>>>> +	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>>>> +	    }
>>>>
>>>> Why are these peeling cases necessary?  Peeling for gaps should
>>>> just mean subtracting one scalar iteration from the iteration count
>>>> and shouldn't otherwise affect the main loop.  Similarly, peeling for
>>>> alignment can be handled in the normal way, with a scalar prologue loop.
>>>>
>>>
>>> I was thinking to relax this later and to avoid handling too many cases
>>> in the first enablement patch.  Since the Power hardware level that is
>>> able to support vector with length also supports unaligned load/store,
>>> we would need to construct some cases for them.  May I postpone it a
>>> bit?  Or do you prefer me to support it here?
>>
>> I've no objection to postponing it if there are specific known
>> problems that make it difficult, but I think we should at least
>> say what they are.  On the face of it, I'm not sure why it doesn't
>> Just Work, since the way that we control the main loop should be
>> mostly orthogonal to how we handle peeled prologue iterations
>> and how we handle a single peeled epilogue iteration.
>>
> 
> OK, I will remove it to see the impact.  By the way, do you think using
> partial vectors for the prologue is something worth trying in future?
> 

I tested the updated patch with this restriction relaxed: the
LOOP_VINFO_PEELING_FOR_GAPS part looks fine, but
LOOP_VINFO_PEELING_FOR_ALIGNMENT caused one case to fail at execution with
vect-partial-vector-usage=2.  So far the patch doesn't handle any
niters_skip cases.  I think if we want to support them, we have to add some
handling like what we have for masking, such as mask_skip_niters,
vect_prepare_for_masked_peels, etc.

Do you prefer me to extend the support in this patch series?

>>> Sorry I might miss something, but all undetermined lengths are generated here,
>>> the other places you meant is doc or elsewhere?
>>
>> For example, we'd need to start querying the length operand of the optabs
>> to see what length precision the target uses, since it would be invalid
>> to do this optimisation for IVs that are wider than that precision.
>> The routine above doesn't seem the right place to do that.
>>
> 
> OK, but it seems it's acceptable if the IV is wider than the precision,
> since we allow it to be out of range?
> 

Please ignore this question, I agree that we have to avoid that case.  Sorry
that I misunderstood it before.  :)

BR,
Kewen

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/7 v6] vect: Support vector load/store with length in vectorizer
  2020-07-02  5:20                               ` Kewen.Lin
  2020-07-07  9:26                                 ` Kewen.Lin
@ 2020-07-07 10:15                                 ` Richard Sandiford
  2020-07-08  7:01                                   ` Kewen.Lin
  1 sibling, 1 reply; 80+ messages in thread
From: Richard Sandiford @ 2020-07-07 10:15 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Bill Schmidt, Richard Biener, Segher Boessenkool, dje.gcc

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi Richard,
>
> on 2020/7/1 11:17 PM, Richard Sandiford wrote:
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>> on 2020/7/1 3:53 AM, Richard Sandiford wrote:
>>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>>>    poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>>>>> +  tree length_limit = NULL_TREE;
>>>>> +  /* For length, we need length_limit to check length in range.  */
>>>>> +  if (!vect_for_masking)
>>>>> +    {
>>>>> +      poly_uint64 len_limit = nscalars_per_ctrl * rgc->factor;
>>>>> +      length_limit = build_int_cst (compare_type, len_limit);
>>>>> +    }
>>>>>  
>>>>>    /* Calculate the maximum number of scalar values that the rgroup
>>>>>       handles in total, the number that it handles for each iteration
>>>>> @@ -434,12 +445,12 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>>>>    tree nscalars_total = niters;
>>>>>    tree nscalars_step = build_int_cst (iv_type, vf);
>>>>>    tree nscalars_skip = niters_skip;
>>>>> -  if (nscalars_per_iter != 1)
>>>>> +  if (nscalars_per_iter_ft != 1)
>>>>>      {
>>>>>        /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
>>>>>  	 these multiplications don't overflow.  */
>>>>> -      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
>>>>> -      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
>>>>> +      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter_ft);
>>>>> +      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter_ft);
>>>>>        nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
>>>>>  				     nscalars_total, compare_factor);
>>>>>        nscalars_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
>>>>> @@ -509,7 +520,7 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>>>>  	     NSCALARS_SKIP to that cannot overflow.  */
>>>>>  	  tree const_limit = build_int_cst (compare_type,
>>>>>  					    LOOP_VINFO_VECT_FACTOR (loop_vinfo)
>>>>> -					    * nscalars_per_iter);
>>>>> +					    * nscalars_per_iter_ft);
>>>>>  	  first_limit = gimple_build (preheader_seq, MIN_EXPR, compare_type,
>>>>>  				      nscalars_total, const_limit);
>>>>>  	  first_limit = gimple_build (preheader_seq, PLUS_EXPR, compare_type,
>>>>
>>>> It looks odd that we don't need to adjust the other nscalars_* values too.
>>>> E.g. the above seems to be comparing an unscaled nscalars_total with
>>>> a scaled nscalars_per_iter.  I think the units ought to “agree”,
>>>> both here and in the rest of the function.
>>>>
>>>
>>> Sorry, I didn't quite follow this comment.  Both nscalars_total and
>>> nscalars_step are scaled here.  The only remaining related nscalars_*
>>> seems to be nscalars_skip, but length can't support skip.
>> 
>> Hmm, OK.  But in that case can you update the names of the variables
>> to match?  It's confusing to have some nscalars_* variables actually
>> count scalars (and thus have “nitems” equivalents) and other nscalars_*
>> variables count something else (and thus effectively be nitems_* variables
>> themselves).
>> 
>
> OK.  I'll update the names like nscalars_total/nscalars_step and equivalents
> to nitems_total/... (or would nunits_total be better?)

I agree “items” isn't great.  I was trying to avoid “units” because GCC
often uses that to mean bytes (BITS_PER_UNIT, UNITS_PER_WORD, etc.).
In this context that could be confusing, because sometimes the
“units” actually would be bytes, but not always.

>>>>> @@ -9850,11 +9986,30 @@ vectorizable_condition (vec_info *vinfo,
>>>>>  	  return false;
>>>>>  	}
>>>>>  
>>>>> -      if (loop_vinfo
>>>>> -	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
>>>>> -	  && reduction_type == EXTRACT_LAST_REDUCTION)
>>>>> -	vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>>>>> -			       ncopies * vec_num, vectype, NULL);
>>>>> +      if (loop_vinfo && for_reduction
>>>>> +	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
>>>>> +	{
>>>>> +	  if (reduction_type == EXTRACT_LAST_REDUCTION)
>>>>> +	    vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>>>>> +				   ncopies * vec_num, vectype, NULL);
>>>>> +	  /* Using partial vectors can introduce inactive lanes in the last
>>>>> +	     iteration, since full vector of condition results are operated,
>>>>> +	     it's unsafe here.  But if we can AND the condition mask with
>>>>> +	     loop mask, it would be safe then.  */
>>>>> +	  else if (!loop_vinfo->scalar_cond_masked_set.is_empty ())
>>>>> +	    {
>>>>> +	      scalar_cond_masked_key cond (cond_expr, ncopies * vec_num);
>>>>> +	      if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>>>>> +		{
>>>>> +		  bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
>>>>> +		  cond.code = invert_tree_comparison (cond.code, honor_nans);
>>>>> +		  if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>>>>> +		    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>>>> +		}
>>>>> +	    }
>>>>> +	  else
>>>>> +	    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>>>> +	}
>>>>>  
>>>>>        STMT_VINFO_TYPE (stmt_info) = condition_vec_info_type;
>>>>>        vect_model_simple_cost (vinfo, stmt_info, ncopies, dts, ndts, slp_node,
>>>>
>>>> I don't understand this part.
>>>
>>> This is for the regression case on aarch64:
>>>
>>> PASS->FAIL: gcc.target/aarch64/sve/reduc_8.c -march=armv8.2-a+sve  scan-assembler-not \\tcmpeq\\tp[0-9]+\\.s,
>> 
>> OK, if this is an SVE thing, it should really be a separate patch.
>> (And thanks for testing SVE.)
>> 
>>> As you mentioned before, we would expect to record masks for partial vectors reduction, 
>>> otherwise the inactive lanes would be possibly unsafe.  For this failed case, the
>>> reduction_type is TREE_CODE_REDUCTION, we won't record loop mask.  But it's still safe
>>> since the mask is further AND with some loop mask.  The difference looks like:
>>>
>>> Without mask AND loop mask optimization:
>>>
>>>   loop_mask =...
>>>   v1 = .MASK_LOAD (a, loop_mask)
>>>   mask1 = v1 == {cst, ...}                // unsafe since it's generated from full width.
>>>   mask2 = loop_mask & mask1               // safe, since it's AND with loop mask?
>>>   v2 = .MASK_LOAD (b, mask2)
>>>   vres = VEC_COND_EXPR < mask1, vres, v2> // unsafe coz of mask1
>>>
>>> With mask AND loop mask optimization:
>>>
>>>   loop_mask =...
>>>   v1 = .MASK_LOAD (a, loop_mask)
>>>   mask1 = v1 == {cst, ...}
>>>   mask2 = loop_mask & mask1       
>>>   v2 = .MASK_LOAD (b, mask2)
>>>   vres = VEC_COND_EXPR < mask2, vres, v2> // safe coz of mask2?
>>>
>>>
>>> The loop mask ANDing can make unsafe inactive lanes safe.  So the fix here is to further check
>>> it's possible to be optimized further, if it can, we can know it's safe.  Does it make sense?
>> 
>> But in this particular test, we're doing outer loop vectorisation,
>> and the only elements of vres that matter are the ones selected
>> by loop_mask (since those are the only ones that get stored out).
>> So applying the loop mask to the VEC_COND_EXPR is “just” an
>> (important) optimisation, rather than a correctness issue.
>>  
>
> Thanks for the clarification.  It looks like vres is always safe, since its
> further use is guarded with the loop mask.  Then sorry that I didn't catch
> why it is an optimization for this case; is there some difference in backend
> support for this different mask for cond_expr?

No, the idea of the optimisation is to avoid cases in which we have:

    cmp_res = …compare…
    cmp_res' = cmp_res & loop_mask
    IFN_MASK_LOAD (…, cmp_res')
    z = cmp_res ? x : y

The problem here is that cmp_res and cmp_res' are live at the same time,
which prevents cmp_res and cmp_res' from being combined into a single
instruction.  It's better for the final instruction to be:

    z = cmp_res' ? x : y

so that everything uses the same comparison result.

We can't leave that to later passes because nothing in the gimple IL
indicates that only the loop_mask elements of z matter.

>> What's causing the test to start failing with the patch?  I realise
>> you've probably already said, sorry, but it's been a large patch series
>> so it's hard to keep all the details committed to memory.
>> 
>
> No problem, I appreciate your time very much!  Since length-based partial
> vectors don't support any reductions so far, the function has the
> responsibility to disable use_partial_vectors_p for them.  Without the above
> else-if part, since the reduction_type is TREE_CODE_REDUCTION for this case,
> the else part will stop this case from using mask-based partial vectors, but
> the case expects the outer loop to still be able to use mask-based partial
> vectors.
>
> Given your clarification above, the else-if looks wrong.  Probably we can
> change it to check whether the current vectorization is for the outer loop
> and the condition stmt being handled is in the inner loop; if so, we can
> allow it for partial vectors?

I think it's more whether, for outer loop vectorisation, the reduction
is a double reduction or a simple nested-cycle reduction.  Both have
a COND_EXPR in the inner loop, but the extra elements only matter for
double reductions.

There again, I don't think we actually support double reductions for
COND_EXPR reductions.
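
To make the distinction concrete, two hypothetical source loops
(illustrative only):

  /* Nested-cycle reduction: each outer iteration has its own
     accumulator, and the inner-loop result is used directly.  */
  for (int i = 0; i < n; i++)
    {
      double r = 0.0;
      for (int j = 0; j < m; j++)
        r += a[i][j];
      out[i] = r;
    }

  /* Double reduction: the accumulator is carried around the outer
     loop as well, so extra (inactive) lanes of the inner loop would
     feed into the final result.  */
  double s = 0.0;
  for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
      s += a[i][j];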

>>>>> @@ -11910,3 +12065,36 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
>>>>>    *nunits_vectype_out = nunits_vectype;
>>>>>    return opt_result::success ();
>>>>>  }
>>>>> +
>>>>> +/* Generate and return statement sequence that sets vector length LEN that is:
>>>>> +
>>>>> +   min_of_start_and_end = min (START_INDEX, END_INDEX);
>>>>> +   left_len = END_INDEX - min_of_start_and_end;
>>>>> +   rhs = min (left_len, LEN_LIMIT);
>>>>> +   LEN = rhs;
>>>>> +
>>>>> +   TODO: for now, rs6000 supported vector with length only cares 8-bits, which
>>>>> +   means if we have left_len in bytes larger than 255, it can't be saturated to
>>>>> +   vector limit (vector size).  One target hook can be provided if other ports
>>>>> +   don't suffer this.
>>>>> +*/
>>>>
>>>> Should be no line break before the */
>>>>
>>>> Personally I think it'd be better to drop the TODO.  This isn't the only
>>>> place that would need to change if we allowed out-of-range lengths,
>>>> whereas the comment might give the impression that it is.
>>>>
>>>
>>> Sorry I might miss something, but all undetermined lengths are generated here,
>>> the other places you meant is doc or elsewhere?
>> 
>> For example, we'd need to start querying the length operand of the optabs
>> to see what length precision the target uses, since it would be invalid
>> to do this optimisation for IVs that are wider than that precision.
>> The routine above doesn't seem the right place to do that.
>> 
>
> OK, but it seems it's acceptable if the IV wider than the precision since
> we allows it out of range?

For example, suppose that a target handled out-of-range values but
still had a QImode length.  If the IV was wider than QI, we'd truncate
0x100 to 0 when generating the pattern, so a full-vector access would
get truncated to an empty-vector access.
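A tiny scalar illustration of that truncation (the numbers are made up;
this only shows the wrap-around, it is not code from the patch):

    #include <stdint.h>
    #include <stdio.h>

    int
    main (void)
    {
      unsigned int len = 0x100;        /* length computed in a wider IV */
      uint8_t qi_len = (uint8_t) len;  /* what a QImode operand would see */
      printf ("0x%x -> %u\n", len, qi_len);  /* prints "0x100 -> 0" */
      return 0;
    }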

Thanks,
Richard

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/7 v6] vect: Support vector load/store with length in vectorizer
  2020-07-07  9:26                                 ` Kewen.Lin
@ 2020-07-07 10:44                                   ` Richard Sandiford
  2020-07-08  6:52                                     ` Kewen.Lin
  0 siblings, 1 reply; 80+ messages in thread
From: Richard Sandiford @ 2020-07-07 10:44 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Patches, Bill Schmidt, dje.gcc, Segher Boessenkool

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> on 2020/7/2 下午1:20, Kewen.Lin via Gcc-patches wrote:
>> on 2020/7/1 下午11:17, Richard Sandiford wrote:
>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>> on 2020/7/1 上午3:53, Richard Sandiford wrote:
>>>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>>>> +      /* Decide whether to use fully-masked approach.  */
>>>>>> +      if (vect_verify_full_masking (loop_vinfo))
>>>>>> +	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
>>>>>> +      /* Decide whether to use length-based approach.  */
>>>>>> +      else if (vect_verify_loop_lens (loop_vinfo))
>>>>>> +	{
>>>>>> +	  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>>>>>> +	      || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
>>>>>> +	    {
>>>>>> +	      if (dump_enabled_p ())
>>>>>> +		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>>>>> +				 "can't vectorize this loop with length-based"
>>>>>> +				 " partial vectors approach because peeling"
>>>>>> +				 " for alignment or gaps is required.\n");
>>>>>> +	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>>>>> +	    }
>>>>>
>>>>> Why are these peeling cases necessary?  Peeling for gaps should
>>>>> just mean subtracting one scalar iteration from the iteration count
>>>>> and shouldn't otherwise affect the main loop.  Similarly, peeling for
>>>>> alignment can be handled in the normal way, with a scalar prologue loop.
>>>>>
>>>>
>>>> I was thinking to relax this later and to avoid to handle too many cases
>>>> in the first enablement patch.  Since Power hw whose level is able to support
>>>> vector with length, it supports unaligned load/store, need to construct
>>>> some cases for them.  May I postpone it a bit?  Or you prefer me to support
>>>> it here?
>>>
>>> I've no objection to postponing it if there are specific known
>>> problems that make it difficult, but I think we should at least
>>> say what they are.  On the face of it, I'm not sure why it doesn't
>>> Just Work, since the way that we control the main loop should be
>>> mostly orthogonal to how we handle peeled prologue iterations
>>> and how we handle a single peeled epilogue iteration.
>>>
>> 
>> OK, I will remove it to see the impact.  By the way, do you think to
>> use partial vectors for prologue is something worth to trying in future?
>> 
>
> I tested the updated patch with this releasing, LOOP_VINFO_PEELING_FOR_GAPS
> part looks fine, but LOOP_VINFO_PEELING_FOR_ALIGNMENT caused one case to
> fail at execution during vect-partial-vector-usage=2.  So far the patch
> doesn't handle any niters_skip cases.  I think if we want to support it, 
> we have to add some handlings in/like what we have for masking, such as: 
> mask_skip_niters, vect_prepare_for_masked_peels etc.  
>
> Do you prefer me to extend the support in this patch series?

It's not so much whether it has to be supported now, but more why
it doesn't work now.  What was the reason for the failure?

The peeling-with-masking thing is just an optimisation, so that we
can vectorise the peeled iterations rather than falling back to
scalar code for them.  It shouldn't be needed for correctness.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/7 v6] vect: Support vector load/store with length in vectorizer
  2020-07-07 10:44                                   ` Richard Sandiford
@ 2020-07-08  6:52                                     ` Kewen.Lin
  2020-07-08 12:50                                       ` Richard Sandiford
  0 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-07-08  6:52 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford; +Cc: Bill Schmidt, dje.gcc, Segher Boessenkool

Hi Richard,

on 2020/7/7 下午6:44, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> on 2020/7/2 下午1:20, Kewen.Lin via Gcc-patches wrote:
>>> on 2020/7/1 下午11:17, Richard Sandiford wrote:
>>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>>> on 2020/7/1 上午3:53, Richard Sandiford wrote:
>>>>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>>>>> +      /* Decide whether to use fully-masked approach.  */
>>>>>>> +      if (vect_verify_full_masking (loop_vinfo))
>>>>>>> +	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
>>>>>>> +      /* Decide whether to use length-based approach.  */
>>>>>>> +      else if (vect_verify_loop_lens (loop_vinfo))
>>>>>>> +	{
>>>>>>> +	  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>>>>>>> +	      || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
>>>>>>> +	    {
>>>>>>> +	      if (dump_enabled_p ())
>>>>>>> +		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>>>>>> +				 "can't vectorize this loop with length-based"
>>>>>>> +				 " partial vectors approach because peeling"
>>>>>>> +				 " for alignment or gaps is required.\n");
>>>>>>> +	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>>>>>> +	    }
>>>>>>
>>>>>> Why are these peeling cases necessary?  Peeling for gaps should
>>>>>> just mean subtracting one scalar iteration from the iteration count
>>>>>> and shouldn't otherwise affect the main loop.  Similarly, peeling for
>>>>>> alignment can be handled in the normal way, with a scalar prologue loop.
>>>>>>
>>>>>
>>>>> I was thinking to relax this later and to avoid to handle too many cases
>>>>> in the first enablement patch.  Since Power hw whose level is able to support
>>>>> vector with length, it supports unaligned load/store, need to construct
>>>>> some cases for them.  May I postpone it a bit?  Or you prefer me to support
>>>>> it here?
>>>>
>>>> I've no objection to postponing it if there are specific known
>>>> problems that make it difficult, but I think we should at least
>>>> say what they are.  On the face of it, I'm not sure why it doesn't
>>>> Just Work, since the way that we control the main loop should be
>>>> mostly orthogonal to how we handle peeled prologue iterations
>>>> and how we handle a single peeled epilogue iteration.
>>>>
>>>
>>> OK, I will remove it to see the impact.  By the way, do you think to
>>> use partial vectors for prologue is something worth to trying in future?
>>>
>>
>> I tested the updated patch with this releasing, LOOP_VINFO_PEELING_FOR_GAPS
>> part looks fine, but LOOP_VINFO_PEELING_FOR_ALIGNMENT caused one case to
>> fail at execution during vect-partial-vector-usage=2.  So far the patch
>> doesn't handle any niters_skip cases.  I think if we want to support it, 
>> we have to add some handlings in/like what we have for masking, such as: 
>> mask_skip_niters, vect_prepare_for_masked_peels etc.  
>>
>> Do you prefer me to extend the support in this patch series?
> 
> It's not so much whether it has to be supported now, but more why
> it doesn't work now.  What was the reason for the failure?
> 
> The peeling-with-masking thing is just an optimisation, so that we
> can vectorise the peeled iterations rather than falling back to
> scalar code for them.  It shouldn't be needed for correctness.
> 

Whoops, thanks for the clarification!  Nice, I just realized that this is a
way to adopt partial vectors for the prologue.  The failing case is
gcc.dg/vect/vect-ifcvt-11.c.  There the first iteration is optimized out due
to the known AND result of IV 0, so it tries to peel 3 iterations, and the
number of remaining iterations for the vectorized body is expected to be 12.
But it still uses 15 and causes an out-of-bound access.

The fix below resolves the failure.  The justification is that, for partial
vectors, the vectorized body needs to use the niters that were fixed up
after peeling the prologue.  I'm not sure why the other cases, which don't
use partial vectors, don't need the fixed-up niters, so to avoid trouble I
guarded it with LOOP_VINFO_USING_PARTIAL_VECTORS_P explicitly.

Does it make sense?

--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -8888,6 +8896,11 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
			     LOOP_VINFO_INT_NITERS (loop_vinfo) / lowest_vf);
	  step_vector = build_one_cst (TREE_TYPE (niters));
	}
+      else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+	      && !vect_use_loop_mask_for_alignment_p (loop_vinfo))
+       vect_gen_vector_loop_niters (loop_vinfo, LOOP_VINFO_NITERS (loop_vinfo),
+				    &niters_vector, &step_vector,
+				    niters_no_overflow);
       else
	vect_gen_vector_loop_niters (loop_vinfo, niters, &niters_vector,
				     &step_vector, niters_no_overflow);

BR,
Kewen

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/7 v6] vect: Support vector load/store with length in vectorizer
  2020-07-07 10:15                                 ` Richard Sandiford
@ 2020-07-08  7:01                                   ` Kewen.Lin
  0 siblings, 0 replies; 80+ messages in thread
From: Kewen.Lin @ 2020-07-08  7:01 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford
  Cc: Bill Schmidt, Richard Biener, Segher Boessenkool, dje.gcc

Hi Richard,

on 2020/7/7 下午6:15, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> Hi Richard,
>>
>> on 2020/7/1 下午11:17, Richard Sandiford wrote:
>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>> on 2020/7/1 上午3:53, Richard Sandiford wrote:
>>>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>>
>>>> Sorry, I didn't quite follow this comment.  Both nscalars_totoal and
>>>> nscalars_step are scaled here.  The remaining related nscalars_*
>>>> seems only nscalars_skip, but length can't support skip.
>>>
>>> Hmm, OK.  But in that case can you update the names of the variables
>>> to match?  It's confusing to have some nscalars_* variables actually
>>> count scalars (and thus have “nitems” equivalents) and other nscalars_*
>>> variables count something else (and thus effectively be nitems_* variables
>>> themselves).
>>>
>>
>> OK.  I'll update the names like nscalars_total/nscalars_step and equivalents
>> to nitems_total/... (or nunits_total better?)
> 
> I agree “items” isn't great.  I was trying to avoid “units” because GCC
> often uses that to mean bytes (BITS_PER_UNIT, UNITS_PER_WORD, etc.).
> In this context that could be confusing, because sometimes the
> “units” actually would be bytes, but not always.
> 

Got it!  Thanks!

[...]
>>> But in this particular test, we're doing outer loop vectorisation,
>>> and the only elements of vres that matter are the ones selected
>>> by loop_mask (since those are the only ones that get stored out).
>>> So applying the loop mask to the VEC_COND_EXPR is “just” an
>>> (important) optimisation, rather than a correctness issue.
>>>  
>>
>> Thanks for the clarification.  It looks the vres is always safe since its
>> further usage is guard with loop mask.  Then sorry that I didn't catch why
>> it is one optimization for this case, is there some difference in backend
>> supports on this different mask for cond_expr?
> 
> No, the idea of the optimisation is to avoid cases in which we have:
> 
>     cmp_res = …compare…
>     cmp_res' = cmp_res & loop_mask
>     IFN_MASK_LOAD (…, cmp_res')
>     z = cmp_res ? x : y
> 
> The problem here is that cmp_res and cmp_res' are live at the same time,
> which prevents cmp_res and cmp_res' from being combined into a single
> instruction.  It's better for the final instruction to be:
> 
>     z = cmp_res' ? x : y
> 
> so that everything uses the same comparison result.
> 
> We can't leave that to later passes because nothing in the gimple IL
> indicates that only the loop_mask elements of z matter.
> 

Nice, thanks for the explanation.


[...]
>>> What's causing the test to start failing with the patch?  I realise
>>> you've probably already said, sorry, but it's been a large patch series
>>> so it's hard to keep all the details committed to memory.
>>>
>>
>> No problem, appreciate your time much!  Since length-based partial vectors
>> doesn't support any reduction so far, the function has the responsibility
>> to disable use_partial_vectors_p for it.  Without the above else-if part,
>> since the reduction_type is TREE_CODE_REDUCTION for this case, the else part
>> will stop this case to use mask-based partial vectors, but the case expects
>> the outer loop still able to use mask-based partial vectors.
>>
>> As your clarification above, else-if looks wrong.  Probably we can change it
>> to check whether the current vectorization is for outer loop and the condition
>> stmt being handled is in the inner loop, we can allow it for partial vectors?
> 
> I think it's more whether, for outer loop vectorisation, the reduction
> is a double reduction or a simple nested-cycle reduction.  Both have
> a COND_EXPR in the inner loop, but the extra elements only matter for
> double reductions.
> 
> There again, I don't think we actually support double reductions for
> COND_EXPR reductions.
> 

OK.  I will send a separate patch with your suggestion on this.

[...]
>>
>> OK, but it seems it's acceptable if the IV wider than the precision since
>> we allows it out of range?
> 
> For example, suppose that a target handled out-of-range values but
> still had a QImode length.  If the IV was wider than QI, we'd truncate
> 0x100 to 0 when generating the pattern, so a full-vector access would
> get truncated to an empty-vector access.
> 

Yeah, it's so true.

BR,
Kewen

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/7 v6] vect: Support vector load/store with length in vectorizer
  2020-07-08  6:52                                     ` Kewen.Lin
@ 2020-07-08 12:50                                       ` Richard Sandiford
  2020-07-10  7:40                                         ` Kewen.Lin
  0 siblings, 1 reply; 80+ messages in thread
From: Richard Sandiford @ 2020-07-08 12:50 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Patches, Bill Schmidt, dje.gcc, Segher Boessenkool

"Kewen.Lin" <linkw@linux.ibm.com> writes:
>> […]
>>> I tested the updated patch with this releasing, LOOP_VINFO_PEELING_FOR_GAPS
>>> part looks fine, but LOOP_VINFO_PEELING_FOR_ALIGNMENT caused one case to
>>> fail at execution during vect-partial-vector-usage=2.  So far the patch
>>> doesn't handle any niters_skip cases.  I think if we want to support it, 
>>> we have to add some handlings in/like what we have for masking, such as: 
>>> mask_skip_niters, vect_prepare_for_masked_peels etc.  
>>>
>>> Do you prefer me to extend the support in this patch series?
>> 
>> It's not so much whether it has to be supported now, but more why
>> it doesn't work now.  What was the reason for the failure?
>> 
>> The peeling-with-masking thing is just an optimisation, so that we
>> can vectorise the peeled iterations rather than falling back to
>> scalar code for them.  It shouldn't be needed for correctness.
>> 
>
> Whoops, thanks for the clarification!  Nice, I just realized that this is a
> way to adopt partial vectors for the prologue.  The failing case is
> gcc.dg/vect/vect-ifcvt-11.c.  There the first iteration is optimized out due
> to the known AND result of IV 0, so it tries to peel 3 iterations, and the
> number of remaining iterations for the vectorized body is expected to be 12.
> But it still uses 15 and causes an out-of-bound access.
>
> The fix below resolves the failure.  The justification is that, for partial
> vectors, the vectorized body needs to use the niters that were fixed up
> after peeling the prologue.  I'm not sure why the other cases, which don't
> use partial vectors, don't need the fixed-up niters, so to avoid trouble I
> guarded it with LOOP_VINFO_USING_PARTIAL_VECTORS_P explicitly.

I think the reason is that if we're peeling prologue iterations and
the total number of iterations isn't fixed, full-vector vectorisation
will “almost always” need an epilogue loop too, and in that case
niters_vector will be nonnull.

But that's not guaranteed to be true forever.  E.g. if the start
pointers have a known misalignment that require peeling a constant
number of iterations N, and if we can prove (using enhanced range/
nonzero-bits information) that the way niters is calculated means
that niter - N is a multiple of the vector size, we could peel
the prologue and not the epilogue.  In that case, what your patch
does would be correct.
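A sketch of the kind of loop that scenario describes (hypothetical; the
alignment and niters properties are assumed, not something the vectorizer
necessarily proves today):

    /* Suppose P is known to be one int past a 16-byte boundary, so peeling
       N = 3 scalar iterations aligns it for V4SI, and suppose N_ITERS is
       known to have the form 4 * k + 3, so N_ITERS - 3 is a multiple of
       the vectorization factor: the prologue could be peeled without
       needing a vector epilogue.  */
    void
    update (int *p, int n_iters)
    {
      for (int i = 0; i < n_iters; ++i)
        p[i] += 1;
    }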

So…

> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -8888,6 +8896,11 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
> 			     LOOP_VINFO_INT_NITERS (loop_vinfo) / lowest_vf);
> 	  step_vector = build_one_cst (TREE_TYPE (niters));
> 	}
> +      else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
> +	      && !vect_use_loop_mask_for_alignment_p (loop_vinfo))
> +       vect_gen_vector_loop_niters (loop_vinfo, LOOP_VINFO_NITERS (loop_vinfo),
> +				    &niters_vector, &step_vector,
> +				    niters_no_overflow);
>        else
> 	vect_gen_vector_loop_niters (loop_vinfo, niters, &niters_vector,
> 				     &step_vector, niters_no_overflow);

…I think we should drop the LOOP_VINFO_USING_PARTIAL_VECTORS_P
condition.  Could you also add a comment above the new call saying:

   /* vect_do_peeling subtracted the number of peeled prologue
      iterations from LOOP_VINFO_NITERS.  */

It wasn't obvious to me where the update was happening when I first
looked at the code.

Very minor, but maybe also switch the last two cases round so that
“else” is the default behaviour and the “if”s are the exceptions.

OK with those changes, thanks.

Richard

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/7 v6] vect: Support vector load/store with length in vectorizer
  2020-07-08 12:50                                       ` Richard Sandiford
@ 2020-07-10  7:40                                         ` Kewen.Lin
  0 siblings, 0 replies; 80+ messages in thread
From: Kewen.Lin @ 2020-07-10  7:40 UTC (permalink / raw)
  To: richard.sandiford; +Cc: GCC Patches, Bill Schmidt, dje.gcc, Segher Boessenkool

Hi Richard,

on 2020/7/8 下午8:50, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>> […]
>>>> I tested the updated patch with this releasing, LOOP_VINFO_PEELING_FOR_GAPS
>>>> part looks fine, but LOOP_VINFO_PEELING_FOR_ALIGNMENT caused one case to
>>>> fail at execution during vect-partial-vector-usage=2.  So far the patch
>>>> doesn't handle any niters_skip cases.  I think if we want to support it, 
>>>> we have to add some handlings in/like what we have for masking, such as: 
>>>> mask_skip_niters, vect_prepare_for_masked_peels etc.  
>>>>
>>>> Do you prefer me to extend the support in this patch series?
>>>
>>> It's not so much whether it has to be supported now, but more why
>>> it doesn't work now.  What was the reason for the failure?
>>>
>>> The peeling-with-masking thing is just an optimisation, so that we
>>> can vectorise the peeled iterations rather than falling back to
>>> scalar code for them.  It shouldn't be needed for correctness.
>>>
>>
>> Whoops, thanks for the clarification!  Nice, I just realized that this is a
>> way to adopt partial vectors for the prologue.  The failing case is
>> gcc.dg/vect/vect-ifcvt-11.c.  There the first iteration is optimized out due
>> to the known AND result of IV 0, so it tries to peel 3 iterations, and the
>> number of remaining iterations for the vectorized body is expected to be 12.
>> But it still uses 15 and causes an out-of-bound access.
>>
>> The fix below resolves the failure.  The justification is that, for partial
>> vectors, the vectorized body needs to use the niters that were fixed up
>> after peeling the prologue.  I'm not sure why the other cases, which don't
>> use partial vectors, don't need the fixed-up niters, so to avoid trouble I
>> guarded it with LOOP_VINFO_USING_PARTIAL_VECTORS_P explicitly.
> 
> I think the reason is that if we're peeling prologue iterations and
> the total number of iterations isn't fixed, full-vector vectorisation
> will “almost always” need an epilogue loop too, and in that case
> niters_vector will be nonnull.
> 
> But that's not guaranteed to be true forever.  E.g. if the start
> pointers have a known misalignment that require peeling a constant
> number of iterations N, and if we can prove (using enhanced range/
> nonzero-bits information) that the way niters is calculated means
> that niter - N is a multiple of the vector size, we could peel
> the prologue and not the epilogue.  In that case, what your patch
> does would be correct.
> 

Thanks for the explanation, that makes much more sense!

> So…
> 
>> --- a/gcc/tree-vect-loop.c
>> +++ b/gcc/tree-vect-loop.c
>> @@ -8888,6 +8896,11 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
>> 			     LOOP_VINFO_INT_NITERS (loop_vinfo) / lowest_vf);
>> 	  step_vector = build_one_cst (TREE_TYPE (niters));
>> 	}
>> +      else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
>> +	      && !vect_use_loop_mask_for_alignment_p (loop_vinfo))
>> +       vect_gen_vector_loop_niters (loop_vinfo, LOOP_VINFO_NITERS (loop_vinfo),
>> +				    &niters_vector, &step_vector,
>> +				    niters_no_overflow);
>>        else
>> 	vect_gen_vector_loop_niters (loop_vinfo, niters, &niters_vector,
>> 				     &step_vector, niters_no_overflow);
> 
> …I think we should drop the LOOP_VINFO_USING_PARTIAL_VECTORS_P
> condition.  Could you also add a comment above the new call saying:
> 
>    /* vect_do_peeling subtracted the number of peeled prologue
>       iterations from LOOP_VINFO_NITERS.  */
> 
> It wasn't obvious to me where the update was happening when I first
> looked at the code.
> 
> Very minor, but maybe also switch the last two cases round so that
> “else” is the default behaviour and the “if”s are the exceptions.
> 
> OK with those changes, thanks.

Bootstrapped/regtested on aarch64-linux-gnu and powerpc64le-linux-gnu.

Committed as r11-1978 with your comments incorporated.  Thanks!

BR,
Kewen

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 5/7 v7] vect: Support vector load/store with length in vectorizer
  2020-06-30 19:53                         ` Richard Sandiford
  2020-07-01 13:23                           ` Kewen.Lin
@ 2020-07-10  9:55                           ` Kewen.Lin
  2020-07-17  9:54                             ` Richard Sandiford
  1 sibling, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-07-10  9:55 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford
  Cc: Bill Schmidt, Richard Biener, Segher Boessenkool, dje.gcc

[-- Attachment #1: Type: text/plain, Size: 2939 bytes --]

Hi Richard,

The new version v7 is attached; it addresses your review comments on v6.
Could you take another look?  Many thanks in advance!

Bootstrapped/regtested on aarch64-linux-gnu and powerpc64le-linux-gnu P9.
Even with explicit vect-partial-vector-usage settings 1/2 on the Power
target, I didn't find any notable failures (only some trivial test case
issues).

BR,
Kewen
----
gcc/ChangeLog:

	* config/rs6000/rs6000.c (rs6000_option_override_internal):
	Set param_vect_partial_vector_usage to 0 explicitly.
	* doc/invoke.texi (vect-partial-vector-usage): Document new option.
	* optabs-query.c (get_len_load_store_mode): New function.
	* optabs-query.h (get_len_load_store_mode): New declaration.
	* params.opt (vect-partial-vector-usage): New.
	* tree-vect-loop-manip.c (vect_set_loop_controls_directly): Add the
	handlings for vectorization using length-based partial vectors, call
	vect_gen_len for length generation, and rename some variables with
	items instead of scalars.
	(vect_set_loop_condition_partial_vectors): Add the handlings for
	vectorization using length-based partial vectors.
	(vect_do_peeling): Allow remaining eiters less than epilogue vf for
	LOOP_VINFO_USING_PARTIAL_VECTORS_P.
	* tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Init
	epil_using_partial_vectors_p.
	(_loop_vec_info::~_loop_vec_info): Call release_vec_loop_controls
	for lengths destruction.
	(vect_verify_loop_lens): New function.
	(vect_analyze_loop): Add handlings for epilogue of loop when it's
	marked to use vectorization using partial vectors.
	(vect_analyze_loop_2): Add the check to allow only one vectorization
	approach using partial vectorization at the same time.  Check param
	vect-partial-vector-usage for partial vectors decision.  Mark
	LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P if using partial vectors
	for the epilogue is worth considering.  Call release_vec_loop_controls
	for lengths destruction.
	(vect_estimate_min_profitable_iters): Adjust for loop vectorization
	using length-based partial vectors.
	(vect_record_loop_mask): Init factor to 1 for vectorization using
	mask-based partial vectors.
	(vect_record_loop_len): New function.
	(vect_get_loop_len): Likewise.
	* tree-vect-stmts.c (check_load_store_for_partial_vectors): Add
	checks for vectorization using length-based partial vectors.  Factor
	some code to lambda function get_valid_nvectors.
	(vectorizable_store): Add handlings when using length-based partial
	vectors.
	(vectorizable_load): Likewise.
	(vect_gen_len): New function.
	* tree-vectorizer.h (struct rgroup_controls): Add field factor
	mainly for length-based partial vectors.
	(vec_loop_lens): New typedef.
	(_loop_vec_info): Add lens and epil_using_partial_vectors_p.
	(LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P): New macro.
	(LOOP_VINFO_LENS): Likewise.
	(LOOP_VINFO_FULLY_WITH_LENGTH_P): Likewise.
	(vect_record_loop_len): New declare.
	(vect_get_loop_len): Likewise.
	(vect_gen_len): Likewise.

[-- Attachment #2: vector_v7.diff --]
[-- Type: text/plain, Size: 42656 bytes --]

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 58f5d780603..af1271ef85a 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -4554,6 +4554,11 @@ rs6000_option_override_internal (bool global_init_p)
       SET_OPTION_IF_UNSET (&global_options, &global_options_set,
 			   param_max_completely_peeled_insns, 400);
 
+      /* Temporarily disable it for now since lxvl/stxvl on the default
+	 supported hardware Power9 has unexpected performance behaviors. */
+      SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+			   param_vect_partial_vector_usage, 0);
+
       /* Use the 'model' -fsched-pressure algorithm by default.  */
       SET_OPTION_IF_UNSET (&global_options, &global_options_set,
 			   param_sched_pressure_algorithm,
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 06a04e3d7dd..719f5a1ee4d 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -13389,6 +13389,15 @@ by the copy loop headers pass.
 @item vect-epilogues-nomask
 Enable loop epilogue vectorization using smaller vector size.
 
+@item vect-partial-vector-usage
+Controls when the loop vectorizer considers using partial vector loads
+and stores as an alternative to falling back to scalar code.  0 stops
+the vectorizer from ever using partial vector loads and stores.  1 allows
+partial vector loads and stores if vectorization removes the need for the
+code to iterate.  2 allows partial vector loads and stores in all loops.
+The parameter only has an effect on targets that support partial
+vector loads and stores.
+
 @item slp-max-insns-in-bb
 Maximum number of instructions in basic block to be
 considered for SLP vectorization.
diff --git a/gcc/optabs-query.c b/gcc/optabs-query.c
index 215d68e4225..be241057e4f 100644
--- a/gcc/optabs-query.c
+++ b/gcc/optabs-query.c
@@ -606,6 +606,33 @@ can_vec_mask_load_store_p (machine_mode mode,
   return false;
 }
 
+/* If target supports vector load/store with length for vector mode MODE,
+   return the corresponding vector mode, otherwise return opt_machine_mode ().
+   There are two flavors for vector load/store with length, one is to measure
+   length with bytes, the other is to measure length with lanes.
+   As len_{load,store} optabs point out, for the flavor with bytes, we use
+   VnQI to wrap the other supportable same size vector modes.  */
+
+opt_machine_mode
+get_len_load_store_mode (machine_mode mode, bool is_load)
+{
+  optab op = is_load ? len_load_optab : len_store_optab;
+  gcc_assert (VECTOR_MODE_P (mode));
+
+  /* Check if length in lanes supported for this mode directly.  */
+  if (direct_optab_handler (op, mode))
+    return mode;
+
+  /* Check if length in bytes supported for same vector size VnQI.  */
+  machine_mode vmode;
+  poly_uint64 nunits = GET_MODE_SIZE (mode);
+  if (related_vector_mode (mode, QImode, nunits).exists (&vmode)
+      && direct_optab_handler (op, vmode))
+    return vmode;
+
+  return opt_machine_mode ();
+}
+
 /* Return true if there is a compare_and_swap pattern.  */
 
 bool
diff --git a/gcc/optabs-query.h b/gcc/optabs-query.h
index 729e1fdfc81..603ea8cac0d 100644
--- a/gcc/optabs-query.h
+++ b/gcc/optabs-query.h
@@ -188,6 +188,7 @@ enum insn_code find_widening_optab_handler_and_mode (optab, machine_mode,
 						     machine_mode *);
 int can_mult_highpart_p (machine_mode, bool);
 bool can_vec_mask_load_store_p (machine_mode, machine_mode, bool);
+opt_machine_mode get_len_load_store_mode (machine_mode, bool);
 bool can_compare_and_swap_p (machine_mode, bool);
 bool can_atomic_exchange_p (machine_mode, bool);
 bool can_atomic_load_p (machine_mode);
diff --git a/gcc/params.opt b/gcc/params.opt
index 9b564bb046c..cbd021e82f9 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -968,4 +968,8 @@ Bound on number of runtime checks inserted by the vectorizer's loop versioning f
 Common Joined UInteger Var(param_vect_max_version_for_alignment_checks) Init(6) Param Optimization
 Bound on number of runtime checks inserted by the vectorizer's loop versioning for alignment check.
 
+-param=vect-partial-vector-usage=
+Common Joined UInteger Var(param_vect_partial_vector_usage) Init(2) IntegerRange(0, 2) Param Optimization
+Controls how the loop vectorizer uses partial vectors.  0 means never, 1 means only for loops whose need to iterate can be removed, 2 means for all loops.  The default value is 2.
+
 ; This comment is to ensure we retain the blank line above.
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 458a6675c47..b273b253fe7 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -399,19 +399,20 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
 
    It is known that:
 
-     NITERS * RGC->max_nscalars_per_iter
+     NITERS * RGC->max_nscalars_per_iter * RGC->factor
 
    does not overflow.  However, MIGHT_WRAP_P says whether an induction
    variable that starts at 0 and has step:
 
-     VF * RGC->max_nscalars_per_iter
+     VF * RGC->max_nscalars_per_iter * RGC->factor
 
    might overflow before hitting a value above:
 
-     (NITERS + NITERS_SKIP) * RGC->max_nscalars_per_iter
+     (NITERS + NITERS_SKIP) * RGC->max_nscalars_per_iter * RGC->factor
 
    This means that we cannot guarantee that such an induction variable
-   would ever hit a value that produces a set of all-false masks for RGC.  */
+   would ever hit a value that produces a set of all-false masks or zero
+   lengths for RGC.  */
 
 static tree
 vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
@@ -422,40 +423,46 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 {
   tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
   tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
+  bool use_masks_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
+
   tree ctrl_type = rgc->type;
-  unsigned int nscalars_per_iter = rgc->max_nscalars_per_iter;
-  poly_uint64 nscalars_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type);
+  unsigned int nitems_per_iter = rgc->max_nscalars_per_iter * rgc->factor;
+  poly_uint64 nitems_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type) * rgc->factor;
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  tree length_limit = NULL_TREE;
+  /* For length, we need length_limit to ensure length in range.  */
+  if (!use_masks_p)
+    length_limit = build_int_cst (compare_type, nitems_per_ctrl);
 
-  /* Calculate the maximum number of scalar values that the rgroup
+  /* Calculate the maximum number of item values that the rgroup
      handles in total, the number that it handles for each iteration
      of the vector loop, and the number that it should skip during the
      first iteration of the vector loop.  */
-  tree nscalars_total = niters;
-  tree nscalars_step = build_int_cst (iv_type, vf);
-  tree nscalars_skip = niters_skip;
-  if (nscalars_per_iter != 1)
+  tree nitems_total = niters;
+  tree nitems_step = build_int_cst (iv_type, vf);
+  tree nitems_skip = niters_skip;
+  if (nitems_per_iter != 1)
     {
       /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
 	 these multiplications don't overflow.  */
-      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
-      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
-      nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
-				     nscalars_total, compare_factor);
-      nscalars_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
-				    nscalars_step, iv_factor);
-      if (nscalars_skip)
-	nscalars_skip = gimple_build (preheader_seq, MULT_EXPR, compare_type,
-				      nscalars_skip, compare_factor);
-    }
-
-  /* Create an induction variable that counts the number of scalars
+      tree compare_factor = build_int_cst (compare_type, nitems_per_iter);
+      tree iv_factor = build_int_cst (iv_type, nitems_per_iter);
+      nitems_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
+				   nitems_total, compare_factor);
+      nitems_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
+				  nitems_step, iv_factor);
+      if (nitems_skip)
+	nitems_skip = gimple_build (preheader_seq, MULT_EXPR, compare_type,
+				    nitems_skip, compare_factor);
+    }
+
+  /* Create an induction variable that counts the number of items
      processed.  */
   tree index_before_incr, index_after_incr;
   gimple_stmt_iterator incr_gsi;
   bool insert_after;
   standard_iv_increment_position (loop, &incr_gsi, &insert_after);
-  create_iv (build_int_cst (iv_type, 0), nscalars_step, NULL_TREE, loop,
+  create_iv (build_int_cst (iv_type, 0), nitems_step, NULL_TREE, loop,
 	     &incr_gsi, insert_after, &index_before_incr, &index_after_incr);
 
   tree zero_index = build_int_cst (compare_type, 0);
@@ -466,70 +473,70 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
       /* In principle the loop should stop iterating once the incremented
 	 IV reaches a value greater than or equal to:
 
-	   NSCALARS_TOTAL +[infinite-prec] NSCALARS_SKIP
+	   NITEMS_TOTAL +[infinite-prec] NITEMS_SKIP
 
 	 However, there's no guarantee that this addition doesn't overflow
 	 the comparison type, or that the IV hits a value above it before
 	 wrapping around.  We therefore adjust the limit down by one
 	 IV step:
 
-	   (NSCALARS_TOTAL +[infinite-prec] NSCALARS_SKIP)
-	   -[infinite-prec] NSCALARS_STEP
+	   (NITEMS_TOTAL +[infinite-prec] NITEMS_SKIP)
+	   -[infinite-prec] NITEMS_STEP
 
 	 and compare the IV against this limit _before_ incrementing it.
 	 Since the comparison type is unsigned, we actually want the
 	 subtraction to saturate at zero:
 
-	   (NSCALARS_TOTAL +[infinite-prec] NSCALARS_SKIP)
-	   -[sat] NSCALARS_STEP
+	   (NITEMS_TOTAL +[infinite-prec] NITEMS_SKIP)
+	   -[sat] NITEMS_STEP
 
-	 And since NSCALARS_SKIP < NSCALARS_STEP, we can reassociate this as:
+	 And since NITEMS_SKIP < NITEMS_STEP, we can reassociate this as:
 
-	   NSCALARS_TOTAL -[sat] (NSCALARS_STEP - NSCALARS_SKIP)
+	   NITEMS_TOTAL -[sat] (NITEMS_STEP - NITEMS_SKIP)
 
 	 where the rightmost subtraction can be done directly in
 	 COMPARE_TYPE.  */
       test_index = index_before_incr;
       tree adjust = gimple_convert (preheader_seq, compare_type,
-				    nscalars_step);
-      if (nscalars_skip)
+				    nitems_step);
+      if (nitems_skip)
 	adjust = gimple_build (preheader_seq, MINUS_EXPR, compare_type,
-			       adjust, nscalars_skip);
+			       adjust, nitems_skip);
       test_limit = gimple_build (preheader_seq, MAX_EXPR, compare_type,
-				 nscalars_total, adjust);
+				 nitems_total, adjust);
       test_limit = gimple_build (preheader_seq, MINUS_EXPR, compare_type,
 				 test_limit, adjust);
       test_gsi = &incr_gsi;
 
       /* Get a safe limit for the first iteration.  */
-      if (nscalars_skip)
+      if (nitems_skip)
 	{
-	  /* The first vector iteration can handle at most NSCALARS_STEP
-	     scalars.  NSCALARS_STEP <= CONST_LIMIT, and adding
-	     NSCALARS_SKIP to that cannot overflow.  */
+	  /* The first vector iteration can handle at most NITEMS_STEP
+	     items.  NITEMS_STEP <= CONST_LIMIT, and adding
+	     NITEMS_SKIP to that cannot overflow.  */
 	  tree const_limit = build_int_cst (compare_type,
 					    LOOP_VINFO_VECT_FACTOR (loop_vinfo)
-					    * nscalars_per_iter);
+					    * nitems_per_iter);
 	  first_limit = gimple_build (preheader_seq, MIN_EXPR, compare_type,
-				      nscalars_total, const_limit);
+				      nitems_total, const_limit);
 	  first_limit = gimple_build (preheader_seq, PLUS_EXPR, compare_type,
-				      first_limit, nscalars_skip);
+				      first_limit, nitems_skip);
 	}
       else
 	/* For the first iteration it doesn't matter whether the IV hits
-	   a value above NSCALARS_TOTAL.  That only matters for the latch
+	   a value above NITEMS_TOTAL.  That only matters for the latch
 	   condition.  */
-	first_limit = nscalars_total;
+	first_limit = nitems_total;
     }
   else
     {
       /* Test the incremented IV, which will always hit a value above
 	 the bound before wrapping.  */
       test_index = index_after_incr;
-      test_limit = nscalars_total;
-      if (nscalars_skip)
+      test_limit = nitems_total;
+      if (nitems_skip)
 	test_limit = gimple_build (preheader_seq, PLUS_EXPR, compare_type,
-				   test_limit, nscalars_skip);
+				   test_limit, nitems_skip);
       test_gsi = &loop_cond_gsi;
 
       first_limit = test_limit;
@@ -547,18 +554,17 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
   unsigned int i;
   FOR_EACH_VEC_ELT_REVERSE (rgc->controls, i, ctrl)
     {
-      /* Previous controls will cover BIAS scalars.  This control covers the
+      /* Previous controls will cover BIAS items.  This control covers the
 	 next batch.  */
-      poly_uint64 bias = nscalars_per_ctrl * i;
+      poly_uint64 bias = nitems_per_ctrl * i;
       tree bias_tree = build_int_cst (compare_type, bias);
-      gimple *tmp_stmt;
 
       /* See whether the first iteration of the vector loop is known
 	 to have a full control.  */
       poly_uint64 const_limit;
       bool first_iteration_full
 	= (poly_int_tree_p (first_limit, &const_limit)
-	   && known_ge (const_limit, (i + 1) * nscalars_per_ctrl));
+	   && known_ge (const_limit, (i + 1) * nitems_per_ctrl));
 
       /* Rather than have a new IV that starts at BIAS and goes up to
 	 TEST_LIMIT, prefer to use the same 0-based IV for each control
@@ -574,7 +580,7 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 					  bias_tree);
 	}
 
-      /* Create the initial control.  First include all scalars that
+      /* Create the initial control.  First include all items that
 	 are within the loop limit.  */
       tree init_ctrl = NULL_TREE;
       if (!first_iteration_full)
@@ -591,27 +597,38 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 	    }
 	  else
 	    {
-	      /* FIRST_LIMIT is the maximum number of scalars handled by the
+	      /* FIRST_LIMIT is the maximum number of items handled by the
 		 first iteration of the vector loop.  Test the portion
 		 associated with this control.  */
 	      start = bias_tree;
 	      end = first_limit;
 	    }
 
-	  init_ctrl = make_temp_ssa_name (ctrl_type, NULL, "max_mask");
-	  tmp_stmt = vect_gen_while (init_ctrl, start, end);
-	  gimple_seq_add_stmt (preheader_seq, tmp_stmt);
+	  if (use_masks_p)
+	    {
+	      init_ctrl = make_temp_ssa_name (ctrl_type, NULL, "max_mask");
+	      gimple *tmp_stmt = vect_gen_while (init_ctrl, start, end);
+	      gimple_seq_add_stmt (preheader_seq, tmp_stmt);
+	    }
+	  else
+	    {
+	      init_ctrl = make_temp_ssa_name (compare_type, NULL, "max_len");
+	      gimple_seq seq = vect_gen_len (init_ctrl, start,
+					     end, length_limit);
+	      gimple_seq_add_seq (preheader_seq, seq);
+	    }
 	}
 
       /* Now AND out the bits that are within the number of skipped
-	 scalars.  */
+	 items.  */
       poly_uint64 const_skip;
-      if (nscalars_skip
-	  && !(poly_int_tree_p (nscalars_skip, &const_skip)
+      if (nitems_skip
+	  && !(poly_int_tree_p (nitems_skip, &const_skip)
 	       && known_le (const_skip, bias)))
 	{
+	  gcc_assert (use_masks_p);
 	  tree unskipped_mask = vect_gen_while_not (preheader_seq, ctrl_type,
-						    bias_tree, nscalars_skip);
+						    bias_tree, nitems_skip);
 	  if (init_ctrl)
 	    init_ctrl = gimple_build (preheader_seq, BIT_AND_EXPR, ctrl_type,
 				      init_ctrl, unskipped_mask);
@@ -620,13 +637,28 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 	}
 
       if (!init_ctrl)
-	/* First iteration is full.  */
-	init_ctrl = build_minus_one_cst (ctrl_type);
+	{
+	  /* First iteration is full.  */
+	  if (use_masks_p)
+	    init_ctrl = build_minus_one_cst (ctrl_type);
+	  else
+	    init_ctrl = length_limit;
+	}
 
       /* Get the control value for the next iteration of the loop.  */
-      next_ctrl = make_temp_ssa_name (ctrl_type, NULL, "next_mask");
-      gcall *call = vect_gen_while (next_ctrl, test_index, this_test_limit);
-      gsi_insert_before (test_gsi, call, GSI_SAME_STMT);
+      if (use_masks_p)
+	{
+	  next_ctrl = make_temp_ssa_name (ctrl_type, NULL, "next_mask");
+	  gcall *call = vect_gen_while (next_ctrl, test_index, this_test_limit);
+	  gsi_insert_before (test_gsi, call, GSI_SAME_STMT);
+	}
+      else
+	{
+	  next_ctrl = make_temp_ssa_name (compare_type, NULL, "next_len");
+	  gimple_seq seq = vect_gen_len (next_ctrl, test_index, this_test_limit,
+					 length_limit);
+	  gsi_insert_seq_before (test_gsi, seq, GSI_SAME_STMT);
+	}
 
       vect_set_loop_control (loop, ctrl, init_ctrl, next_ctrl);
     }
@@ -652,6 +684,7 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
   gimple_seq preheader_seq = NULL;
   gimple_seq header_seq = NULL;
 
+  bool use_masks_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
   tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
   unsigned int compare_precision = TYPE_PRECISION (compare_type);
   tree orig_niters = niters;
@@ -686,28 +719,30 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
   tree test_ctrl = NULL_TREE;
   rgroup_controls *rgc;
   unsigned int i;
-  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
-  FOR_EACH_VEC_ELT (*masks, i, rgc)
+  auto_vec<rgroup_controls> *controls = use_masks_p
+					  ? &LOOP_VINFO_MASKS (loop_vinfo)
+					  : &LOOP_VINFO_LENS (loop_vinfo);
+  FOR_EACH_VEC_ELT (*controls, i, rgc)
     if (!rgc->controls.is_empty ())
       {
 	/* First try using permutes.  This adds a single vector
 	   instruction to the loop for each mask, but needs no extra
 	   loop invariants or IVs.  */
 	unsigned int nmasks = i + 1;
-	if ((nmasks & 1) == 0)
+	if (use_masks_p && (nmasks & 1) == 0)
 	  {
-	    rgroup_controls *half_rgc = &(*masks)[nmasks / 2 - 1];
+	    rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
 	    if (!half_rgc->controls.is_empty ()
 		&& vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc))
 	      continue;
 	  }
 
 	/* See whether zero-based IV would ever generate all-false masks
-	   before wrapping around.  */
+	   or zero length before wrapping around.  */
+	unsigned nitems_per_iter = rgc->max_nscalars_per_iter * rgc->factor;
 	bool might_wrap_p
 	  = (iv_limit == -1
-	     || (wi::min_precision (iv_limit * rgc->max_nscalars_per_iter,
-				    UNSIGNED)
+	     || (wi::min_precision (iv_limit * nitems_per_iter, UNSIGNED)
 		 > compare_precision));
 
 	/* Set up all controls for this group.  */
@@ -2568,7 +2603,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   if (vect_epilogues
       && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
       && prolog_peeling >= 0
-      && known_eq (vf, lowest_vf))
+      && known_eq (vf, lowest_vf)
+      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (epilogue_vinfo))
     {
       unsigned HOST_WIDE_INT eiters
 	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 5bb6f66e712..88109ac1eb0 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -816,6 +816,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     vectorizable (false),
     can_use_partial_vectors_p (true),
     using_partial_vectors_p (false),
+    epil_using_partial_vectors_p (false),
     peeling_for_gaps (false),
     peeling_for_niter (false),
     no_data_dependencies (false),
@@ -898,6 +899,7 @@ _loop_vec_info::~_loop_vec_info ()
   free (bbs);
 
   release_vec_loop_controls (&masks);
+  release_vec_loop_controls (&lens);
   delete ivexpr_map;
   delete scan_map;
   epilogue_vinfos.release ();
@@ -1072,6 +1074,81 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
   return true;
 }
 
+/* Check whether we can use vector access with length based on precision
+   comparison.  So far, to keep it simple, we only allow the case that the
+   precision of the target supported length is larger than the precision
+   required by loop niters.  */
+
+static bool
+vect_verify_loop_lens (loop_vec_info loop_vinfo)
+{
+  if (LOOP_VINFO_LENS (loop_vinfo).is_empty ())
+    return false;
+
+  unsigned int max_nitems_per_iter = 1;
+  unsigned int i;
+  rgroup_controls *rgl;
+  /* Find the maximum number of items per iteration for every rgroup.  */
+  FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), i, rgl)
+    {
+      unsigned nitems_per_iter = rgl->max_nscalars_per_iter * rgl->factor;
+      max_nitems_per_iter = MAX (max_nitems_per_iter, nitems_per_iter);
+    }
+
+  /* Work out how many bits we need to represent the length limit.  */
+  unsigned int min_ni_prec
+    = vect_min_prec_for_max_niters (loop_vinfo, max_nitems_per_iter);
+
+  /* Now use the maximum of below precisions for one suitable IV type:
+     - the IV's natural precision
+     - the precision needed to hold: the maximum number of scalar
+       iterations multiplied by the scale factor (min_ni_prec above)
+     - the Pmode precision
+
+     If min_ni_prec is less than the precision of the current niters,
+     we prefer to still use the niters type.  Prefer to use Pmode and
+     wider IV to avoid narrow conversions.  */
+
+  unsigned int ni_prec
+    = TYPE_PRECISION (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)));
+  min_ni_prec = MAX (min_ni_prec, ni_prec);
+  min_ni_prec = MAX (min_ni_prec, GET_MODE_BITSIZE (Pmode));
+
+  tree iv_type = NULL_TREE;
+  opt_scalar_int_mode tmode_iter;
+  FOR_EACH_MODE_IN_CLASS (tmode_iter, MODE_INT)
+  {
+    scalar_mode tmode = tmode_iter.require ();
+    unsigned int tbits = GET_MODE_BITSIZE (tmode);
+
+    /* ??? Do we really want to construct one IV whose precision exceeds
+       BITS_PER_WORD?  */
+    if (tbits > BITS_PER_WORD)
+      break;
+
+    /* Find the first available standard integral type.  */
+    if (tbits >= min_ni_prec && targetm.scalar_mode_supported_p (tmode))
+      {
+	iv_type = build_nonstandard_integer_type (tbits, true);
+	break;
+      }
+  }
+
+  if (!iv_type)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't vectorize with length-based partial vectors"
+			 " due to no suitable iv type.\n");
+      return false;
+    }
+
+  LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = iv_type;
+  LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
+
+  return true;
+}
+
 /* Calculate the cost of one scalar iteration of the loop.  */
 static void
 vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo)
@@ -2170,11 +2247,48 @@ start_over:
       return ok;
     }
 
-  /* Decide whether to use a fully-masked loop for this vectorization
-     factor.  */
-  LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
-    = (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
-       && vect_verify_full_masking (loop_vinfo));
+  /* For now, we don't expect to mix both masking and length approaches for one
+     loop, disable it if both are recorded.  */
+  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+      && !LOOP_VINFO_MASKS (loop_vinfo).is_empty ()
+      && !LOOP_VINFO_LENS (loop_vinfo).is_empty ())
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't vectorize a loop with partial vectors"
+			 " because we don't expect to mix different"
+			 " approaches with partial vectors for the"
+			 " same loop.\n");
+      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+    }
+
+  /* Decide whether to vectorize a loop with partial vectors for
+     this vectorization factor.  */
+  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+    {
+      if (param_vect_partial_vector_usage == 0)
+	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+      else if (vect_verify_full_masking (loop_vinfo)
+	       || vect_verify_loop_lens (loop_vinfo))
+	{
+	  /* The epilogue and other known niters less than VF
+	    cases can still use vector access with length fully.  */
+	  if (param_vect_partial_vector_usage == 1
+	      && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+	      && !vect_known_niters_smaller_than_vf (loop_vinfo))
+	    {
+	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	      LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
+	    }
+	  else
+	    LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
+	}
+      else
+	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+    }
+  else
+    LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+
   if (dump_enabled_p ())
     {
       if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
@@ -2406,6 +2520,7 @@ again:
     = init_cost (LOOP_VINFO_LOOP (loop_vinfo));
   /* Reset accumulated rgroup information.  */
   release_vec_loop_controls (&LOOP_VINFO_MASKS (loop_vinfo));
+  release_vec_loop_controls (&LOOP_VINFO_LENS (loop_vinfo));
   /* Reset assorted flags.  */
   LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
   LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) = false;
@@ -2692,7 +2807,10 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 		lowest_th = ordered_min (lowest_th, th);
 	    }
 	  else
-	    delete loop_vinfo;
+	    {
+	      delete loop_vinfo;
+	      loop_vinfo = opt_loop_vec_info::success (NULL);
+	    }
 
 	  /* Only vectorize epilogues if PARAM_VECT_EPILOGUES_NOMASK is
 	     enabled, SIMDUID is not set, it is the innermost loop and we have
@@ -2717,6 +2835,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
       else
 	{
 	  delete loop_vinfo;
+	  loop_vinfo = opt_loop_vec_info::success (NULL);
 	  if (fatal)
 	    {
 	      gcc_checking_assert (first_loop_vinfo == NULL);
@@ -2724,6 +2843,23 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 	    }
 	}
 
+      /* Handle the case that the original loop can use partial
+	 vectorization, but want to only adopt it for the epilogue.
+	 The retry should be in the same mode as original.  */
+      if (vect_epilogues
+	  && loop_vinfo
+	  && LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo))
+	{
+	  gcc_assert (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+		      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo));
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "***** Re-trying analysis with same vector mode"
+			     " %s for epilogue with partial vectors.\n",
+			     GET_MODE_NAME (loop_vinfo->vector_mode));
+	  continue;
+	}
+
       if (mode_i < vector_modes.length ()
 	  && VECTOR_MODE_P (autodetected_vector_mode)
 	  && (related_vector_mode (vector_modes[mode_i],
@@ -3564,6 +3700,11 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 			    target_cost_data, num_masks - 1, vector_stmt,
 			    NULL, NULL_TREE, 0, vect_body);
     }
+  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      peel_iters_prologue = 0;
+      peel_iters_epilogue = 0;
+    }
   else if (npeel < 0)
     {
       peel_iters_prologue = assumed_vf / 2;
@@ -8197,6 +8338,7 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
     {
       rgm->max_nscalars_per_iter = nscalars_per_iter;
       rgm->type = truth_type_for (vectype);
+      rgm->factor = 1;
     }
 }
 
@@ -8249,6 +8391,69 @@ vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
   return mask;
 }
 
+/* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS
+   lengths for controlling an operation on VECTYPE.  The operation splits
+   each element of VECTYPE into FACTOR separate subelements, measuring the
+   length as a number of these subelements.  */
+
+void
+vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
+		      unsigned int nvectors, tree vectype, unsigned int factor)
+{
+  gcc_assert (nvectors != 0);
+  if (lens->length () < nvectors)
+    lens->safe_grow_cleared (nvectors);
+  rgroup_controls *rgl = &(*lens)[nvectors - 1];
+
+  /* The number of scalars per iteration, the number of bytes occupied by
+     each scalar, and the number of vectors are all compile-time constants.  */
+  unsigned int nscalars_per_iter
+    = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
+		 LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
+
+  if (rgl->max_nscalars_per_iter < nscalars_per_iter)
+    {
+      /* For now, we only support cases in which all loads and stores fall back
+	 to VnQI or none do.  */
+      gcc_assert (!rgl->max_nscalars_per_iter
+		  || (rgl->factor == 1 && factor == 1)
+		  || (rgl->max_nscalars_per_iter * rgl->factor
+		      == nscalars_per_iter * factor));
+      rgl->max_nscalars_per_iter = nscalars_per_iter;
+      rgl->type = vectype;
+      rgl->factor = factor;
+    }
+}
+
+/* Given a complete set of length LENS, extract length number INDEX for an
+   rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.  */
+
+tree
+vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
+		   unsigned int nvectors, unsigned int index)
+{
+  rgroup_controls *rgl = &(*lens)[nvectors - 1];
+
+  /* Populate the rgroup's len array, if this is the first time we've
+     used it.  */
+  if (rgl->controls.is_empty ())
+    {
+      rgl->controls.safe_grow_cleared (nvectors);
+      for (unsigned int i = 0; i < nvectors; ++i)
+	{
+	  tree len_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
+	  gcc_assert (len_type != NULL_TREE);
+	  tree len = make_temp_ssa_name (len_type, NULL, "loop_len");
+
+	  /* Provide a dummy definition until the real one is available.  */
+	  SSA_NAME_DEF_STMT (len) = gimple_build_nop ();
+	  rgl->controls[i] = len;
+	}
+    }
+
+  return rgl->controls[index];
+}
+
 /* Scale profiling counters by estimation for LOOP which is vectorized
    by factor VF.  */
 
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 3d642a5bcca..c23520aceab 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -1742,29 +1742,57 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
       return;
     }
 
-  machine_mode mask_mode;
-  if (!VECTOR_MODE_P (vecmode)
-      || !targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
-      || !can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
+  if (!VECTOR_MODE_P (vecmode))
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "can't use a fully-masked loop because the target"
-			 " doesn't have the appropriate masked load or"
-			 " store.\n");
+			 "can't operate on partial vectors when emulating"
+			 " vector operations.\n");
       LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
       return;
     }
+
   /* We might load more scalars than we need for permuting SLP loads.
      We checked in get_group_load_store_type that the extra elements
      don't leak into a new vector.  */
+  auto get_valid_nvectors = [] (poly_uint64 size, poly_uint64 nunits) {
+    unsigned int nvectors;
+    if (can_div_away_from_zero_p (size, nunits, &nvectors))
+      return nvectors;
+    gcc_unreachable ();
+  };
+
   poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
-  unsigned int nvectors;
-  if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
-    vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype, scalar_mask);
-  else
-    gcc_unreachable ();
+  machine_mode mask_mode;
+  bool using_partial_vectors_p = false;
+  if (targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
+      && can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
+    {
+      unsigned int nvectors = get_valid_nvectors (group_size * vf, nunits);
+      vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype, scalar_mask);
+      using_partial_vectors_p = true;
+    }
+
+  machine_mode vmode;
+  if (get_len_load_store_mode (vecmode, is_load).exists (&vmode))
+    {
+      unsigned int nvectors = get_valid_nvectors (group_size * vf, nunits);
+      vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
+      unsigned factor = (vecmode == vmode) ? 1 : GET_MODE_UNIT_SIZE (vecmode);
+      vect_record_loop_len (loop_vinfo, lens, nvectors, vectype, factor);
+      using_partial_vectors_p = true;
+    }
+
+  if (!using_partial_vectors_p)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't operate on partial vectors because the"
+			 " target doesn't have the appropriate partial"
+			 " vectorization load or store.\n");
+      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+    }
 }
 
 /* Return the mask input to a masked load or store.  VEC_MASK is the vectorized
@@ -7655,6 +7683,14 @@ vectorizable_store (vec_info *vinfo,
     = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
        ? &LOOP_VINFO_MASKS (loop_vinfo)
        : NULL);
+  vec_loop_lens *loop_lens
+    = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+       ? &LOOP_VINFO_LENS (loop_vinfo)
+       : NULL);
+
+  /* Shouldn't use the length-based approach if the loop is fully masked.  */
+  gcc_assert (!loop_lens || (loop_lens && !loop_masks));
+
   /* Targets with store-lane instructions must not require explicit
      realignment.  vect_supportable_dr_alignment always returns either
      dr_aligned or dr_unaligned_supported for masked operations.  */
@@ -7994,6 +8030,42 @@ vectorizable_store (vec_info *vinfo,
 		  vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
 		  new_stmt = call;
 		}
+	      else if (loop_lens)
+		{
+		  tree final_len
+		    = vect_get_loop_len (loop_vinfo, loop_lens,
+					 vec_num * ncopies, vec_num * j + i);
+		  align = least_bit_hwi (misalign | align);
+		  tree ptr = build_int_cst (ref_type, align);
+		  machine_mode vmode = TYPE_MODE (vectype);
+		  opt_machine_mode new_ovmode
+		    = get_len_load_store_mode (vmode, false);
+		  gcc_assert (new_ovmode.exists ());
+		  machine_mode new_vmode = new_ovmode.require ();
+		  /* Need conversion if it's wrapped with VnQI.  */
+		  if (vmode != new_vmode)
+		    {
+		      tree new_vtype
+			= build_vector_type_for_mode (unsigned_intQI_type_node,
+						      new_vmode);
+		      tree var
+			= vect_get_new_ssa_name (new_vtype, vect_simple_var);
+		      vec_oprnd
+			= build1 (VIEW_CONVERT_EXPR, new_vtype, vec_oprnd);
+		      gassign *new_stmt
+			= gimple_build_assign (var, VIEW_CONVERT_EXPR,
+					       vec_oprnd);
+		      vect_finish_stmt_generation (vinfo, stmt_info, new_stmt,
+						   gsi);
+		      vec_oprnd = var;
+		    }
+		  gcall *call
+		    = gimple_build_call_internal (IFN_LEN_STORE, 4, dataref_ptr,
+						  ptr, final_len, vec_oprnd);
+		  gimple_call_set_nothrow (call, true);
+		  vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
+		  new_stmt = call;
+		}
 	      else
 		{
 		  data_ref = fold_build2 (MEM_REF, vectype,
@@ -8530,7 +8602,7 @@ vectorizable_load (vec_info *vinfo,
       unsigned HOST_WIDE_INT cst_offset = 0;
       tree dr_offset;
 
-      gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
+      gcc_assert (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo));
       gcc_assert (!nested_in_vect_loop);
 
       if (grouped_load)
@@ -8819,6 +8891,14 @@ vectorizable_load (vec_info *vinfo,
     = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
        ? &LOOP_VINFO_MASKS (loop_vinfo)
        : NULL);
+  vec_loop_lens *loop_lens
+    = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+       ? &LOOP_VINFO_LENS (loop_vinfo)
+       : NULL);
+
+  /* Shouldn't use the length-based approach if the loop is fully masked.  */
+  gcc_assert (!loop_lens || (loop_lens && !loop_masks));
+
   /* Targets with store-lane instructions must not require explicit
      realignment.  vect_supportable_dr_alignment always returns either
      dr_aligned or dr_unaligned_supported for masked operations.  */
@@ -9207,6 +9287,44 @@ vectorizable_load (vec_info *vinfo,
 			new_stmt = call;
 			data_ref = NULL_TREE;
 		      }
+		    else if (loop_lens && memory_access_type != VMAT_INVARIANT)
+		      {
+			tree final_len
+			  = vect_get_loop_len (loop_vinfo, loop_lens,
+					       vec_num * ncopies,
+					       vec_num * j + i);
+			align = least_bit_hwi (misalign | align);
+			tree ptr = build_int_cst (ref_type, align);
+			gcall *call
+			  = gimple_build_call_internal (IFN_LEN_LOAD, 3,
+							dataref_ptr, ptr,
+							final_len);
+			gimple_call_set_nothrow (call, true);
+			new_stmt = call;
+			data_ref = NULL_TREE;
+
+			/* Need conversion if it's wrapped with VnQI.  */
+			machine_mode vmode = TYPE_MODE (vectype);
+			opt_machine_mode new_ovmode
+			  = get_len_load_store_mode (vmode, true);
+			gcc_assert (new_ovmode.exists ());
+			machine_mode new_vmode = new_ovmode.require ();
+			if (vmode != new_vmode)
+			  {
+			    tree qi_type = unsigned_intQI_type_node;
+			    tree new_vtype
+			      = build_vector_type_for_mode (qi_type, new_vmode);
+			    tree var = vect_get_new_ssa_name (new_vtype,
+							      vect_simple_var);
+			    gimple_set_lhs (call, var);
+			    vect_finish_stmt_generation (vinfo, stmt_info, call,
+							 gsi);
+			    tree op = build1 (VIEW_CONVERT_EXPR, vectype, var);
+			    new_stmt
+			      = gimple_build_assign (vec_dest,
+						     VIEW_CONVERT_EXPR, op);
+			  }
+		      }
 		    else
 		      {
 			tree ltype = vectype;
@@ -11915,3 +12033,27 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
   *nunits_vectype_out = nunits_vectype;
   return opt_result::success ();
 }
+
+/* Generate and return a statement sequence that sets vector length LEN:
+
+   min_of_start_and_end = min (START_INDEX, END_INDEX);
+   left_len = END_INDEX - min_of_start_and_end;
+   rhs = min (left_len, LEN_LIMIT);
+   LEN = rhs;  */
+
+gimple_seq
+vect_gen_len (tree len, tree start_index, tree end_index, tree len_limit)
+{
+  gimple_seq stmts = NULL;
+  tree len_type = TREE_TYPE (len);
+  gcc_assert (TREE_TYPE (start_index) == len_type);
+
+  tree min = gimple_build (&stmts, MIN_EXPR, len_type, start_index, end_index);
+  tree left_len = gimple_build (&stmts, MINUS_EXPR, len_type, end_index, min);
+  tree rhs = gimple_build (&stmts, MIN_EXPR, len_type, left_len, len_limit);
+  gimple* stmt = gimple_build_assign (len, rhs);
+  gimple_seq_add_stmt (&stmts, stmt);
+
+  return stmts;
+}
+
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 6c830ad09f4..17d550d121c 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -473,14 +473,21 @@ is_a_helper <_bb_vec_info *>::test (vec_info *i)
    first level being indexed by nV - 1 (since nV == 0 doesn't exist) and
    the second being indexed by the mask index 0 <= i < nV.  */
 
-/* The controls (like masks) needed by rgroups with nV vectors,
+/* The controls (like masks or lengths) needed by rgroups with nV vectors,
    according to the description above.  */
 struct rgroup_controls {
   /* The largest nS for all rgroups that use these controls.  */
   unsigned int max_nscalars_per_iter;
 
-  /* The type of control to use, based on the highest nS recorded above.
-     For mask-based approach, it's used for mask_type.  */
+  /* For the largest nS recorded above, the loop controls divide each scalar
+     into FACTOR equal-sized pieces.  This is useful if we need to split
+     element-based accesses into byte-based accesses.  */
+  unsigned int factor;
+
+  /* This is a vector type with MAX_NSCALARS_PER_ITER * VF / nV elements.
+     For mask-based controls, it is the type of the masks in CONTROLS.
+     For length-based controls, it can be any vector type that has the
+     specified number of elements; the type of the elements doesn't matter.  */
   tree type;
 
   /* A vector of nV controls, in iteration order.  */
@@ -489,6 +496,8 @@ struct rgroup_controls {
 
 typedef auto_vec<rgroup_controls> vec_loop_masks;
 
+typedef auto_vec<rgroup_controls> vec_loop_lens;
+
 typedef auto_vec<std::pair<data_reference*, tree> > drs_init_vec;
 
 /*-----------------------------------------------------------------*/
@@ -536,6 +545,10 @@ public:
      on inactive scalars.  */
   vec_loop_masks masks;
 
+  /* The lengths that a loop with length should use to avoid operating
+     on inactive scalars.  */
+  vec_loop_lens lens;
+
   /* Set of scalar conditions that have loop mask applied.  */
   scalar_cond_masked_set_type scalar_cond_masked_set;
 
@@ -644,6 +657,10 @@ public:
      the vector loop can handle fewer than VF scalars.  */
   bool using_partial_vectors_p;
 
+  /* True if we've decided to use partially-populated vectors for the
+     epilogue of the loop.  */
+  bool epil_using_partial_vectors_p;
+
   /* When we have grouped data accesses with gaps, we may introduce invalid
      memory accesses.  We peel the last iteration of the loop to prevent
      this.  */
@@ -707,9 +724,12 @@ public:
 #define LOOP_VINFO_VECTORIZABLE_P(L)       (L)->vectorizable
 #define LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P(L) (L)->can_use_partial_vectors_p
 #define LOOP_VINFO_USING_PARTIAL_VECTORS_P(L) (L)->using_partial_vectors_p
+#define LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P(L)                             \
+  (L)->epil_using_partial_vectors_p
 #define LOOP_VINFO_VECT_FACTOR(L)          (L)->vectorization_factor
 #define LOOP_VINFO_MAX_VECT_FACTOR(L)      (L)->max_vectorization_factor
 #define LOOP_VINFO_MASKS(L)                (L)->masks
+#define LOOP_VINFO_LENS(L)                 (L)->lens
 #define LOOP_VINFO_MASK_SKIP_NITERS(L)     (L)->mask_skip_niters
 #define LOOP_VINFO_RGROUP_COMPARE_TYPE(L)  (L)->rgroup_compare_type
 #define LOOP_VINFO_RGROUP_IV_TYPE(L)       (L)->rgroup_iv_type
@@ -747,6 +767,10 @@ public:
   (LOOP_VINFO_USING_PARTIAL_VECTORS_P (L)	\
    && !LOOP_VINFO_MASKS (L).is_empty ())
 
+#define LOOP_VINFO_FULLY_WITH_LENGTH_P(L)	\
+  (LOOP_VINFO_USING_PARTIAL_VECTORS_P (L)	\
+   && !LOOP_VINFO_LENS (L).is_empty ())
+
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L)	\
   ((L)->may_misalign_stmts.length () > 0)
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIAS(L)		\
@@ -1866,6 +1890,11 @@ extern void vect_record_loop_mask (loop_vec_info, vec_loop_masks *,
 				   unsigned int, tree, tree);
 extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
 				unsigned int, tree, unsigned int);
+extern void vect_record_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
+				  tree, unsigned int);
+extern tree vect_get_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
+			       unsigned int);
+extern gimple_seq vect_gen_len (tree, tree, tree, tree);
 extern stmt_vec_info info_for_reduction (vec_info *, stmt_vec_info);
 
 /* Drive for loop transformation stage.  */
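
As a rough stand-alone illustration of what the new IFN_LEN_LOAD/IFN_LEN_STORE
calls emitted above are meant to do (touch only the first LEN bytes of a
vector, in the spirit of lxvl/stxvl), the C sketch below emulates that
behaviour with plain memcpy.  The type and helper names here are invented for
exposition only and are not part of the patch.

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Invented stand-in for a 16-byte VnQI vector.  */
typedef struct { uint8_t b[16]; } v16qi_emul;

/* Emulate a length-controlled load: only the first LEN bytes are read from
   ADDR; the remaining bytes are zeroed here just to keep the emulation
   deterministic.  */
static v16qi_emul
len_load_emul (const uint8_t *addr, unsigned int len)
{
  v16qi_emul v;
  memset (&v, 0, sizeof (v));
  memcpy (v.b, addr, len < 16 ? len : 16);
  return v;
}

/* Emulate a length-controlled store: only the first LEN bytes of V are
   written to ADDR.  */
static void
len_store_emul (uint8_t *addr, unsigned int len, v16qi_emul v)
{
  memcpy (addr, v.b, len < 16 ? len : 16);
}

int
main (void)
{
  uint8_t src[20], dst[20] = { 0 };
  for (unsigned int i = 0; i < sizeof (src); i++)
    src[i] = (uint8_t) i;

  /* Last iteration of a 20-byte copy: only 4 bytes are active, and the
     bytes past the length are neither read nor written.  */
  v16qi_emul v = len_load_emul (src + 16, 4);
  len_store_emul (dst + 16, 4, v);

  printf ("dst[16..19] = %d %d %d %d\n", dst[16], dst[17], dst[18], dst[19]);
  return 0;
}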

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 7/7 v2] rs6000/testsuite: Vector with length test cases
  2020-05-26  5:59 ` [PATCH 7/7] rs6000/testsuite: Vector with length test cases Kewen.Lin
@ 2020-07-10 10:07   ` Kewen.Lin
  2020-07-20 16:58     ` Segher Boessenkool
  0 siblings, 1 reply; 80+ messages in thread
From: Kewen.Lin @ 2020-07-10 10:07 UTC (permalink / raw)
  To: GCC Patches; +Cc: Bill Schmidt, Segher Boessenkool, dje.gcc

[-- Attachment #1: Type: text/plain, Size: 3338 bytes --]

Hi,

v2 changes:
  - Updated param from vect-with-length-scope to
    vect-partial-vector-usage
  - Added *-7*/*-8* to cover peeling for alignment and peeling for gaps.

All cases passed on powerpc64le-linux-gnu P9.
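
For reference, --param=vect-partial-vector-usage=1 makes only the epilogue use
the length-based vector accesses while the main loop keeps normal vector
load/store, and =2 applies them to the whole loop.  The length used on each
vector iteration roughly follows the vect_gen_len formula added in the
vectorizer patch; the small stand-alone sketch below (hypothetical helper
name, 16-byte vectors, lengths in bytes, 4-byte int assumed) only illustrates
how the last iteration ends up with a partial length instead of needing a
scalar epilogue.

#include <stdio.h>

/* Hypothetical helper mirroring the vect_gen_len formula:
   len = min (end - min (start, end), len_limit).  */
static unsigned int
sketch_len (unsigned int start, unsigned int end, unsigned int len_limit)
{
  unsigned int m = start < end ? start : end;
  unsigned int left = end - m;
  return left < len_limit ? left : len_limit;
}

int
main (void)
{
  /* 127 ints as in p9-vec-length-1.h, 16-byte vectors, lengths in bytes:
     every iteration gets length 16 except the last one, which gets 12.  */
  unsigned int n_bytes = 127 * sizeof (int);
  for (unsigned int i = 0; i < n_bytes; i += 16)
    printf ("byte index %u: length %u\n", i, sketch_len (i, n_bytes, 16));
  return 0;
}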

BR,
Kewen
-----
gcc/testsuite/ChangeLog:

        * gcc.target/powerpc/p9-vec-length-1.h: New test.
        * gcc.target/powerpc/p9-vec-length-2.h: New test.
        * gcc.target/powerpc/p9-vec-length-3.h: New test.
        * gcc.target/powerpc/p9-vec-length-4.h: New test.
        * gcc.target/powerpc/p9-vec-length-5.h: New test.
        * gcc.target/powerpc/p9-vec-length-6.h: New test.
        * gcc.target/powerpc/p9-vec-length-7.h: New test.
        * gcc.target/powerpc/p9-vec-length-8.h: New test.
        * gcc.target/powerpc/p9-vec-length-epil-1.c: New test.
        * gcc.target/powerpc/p9-vec-length-epil-2.c: New test.
        * gcc.target/powerpc/p9-vec-length-epil-3.c: New test.
        * gcc.target/powerpc/p9-vec-length-epil-4.c: New test.
        * gcc.target/powerpc/p9-vec-length-epil-5.c: New test.
        * gcc.target/powerpc/p9-vec-length-epil-6.c: New test.
        * gcc.target/powerpc/p9-vec-length-epil-7.c: New test.
        * gcc.target/powerpc/p9-vec-length-epil-8.c: New test.
        * gcc.target/powerpc/p9-vec-length-epil-run-1.c: New test.
        * gcc.target/powerpc/p9-vec-length-epil-run-2.c: New test.
        * gcc.target/powerpc/p9-vec-length-epil-run-3.c: New test.
        * gcc.target/powerpc/p9-vec-length-epil-run-4.c: New test.
        * gcc.target/powerpc/p9-vec-length-epil-run-5.c: New test.
        * gcc.target/powerpc/p9-vec-length-epil-run-6.c: New test.
        * gcc.target/powerpc/p9-vec-length-epil-run-7.c: New test.
        * gcc.target/powerpc/p9-vec-length-epil-run-8.c: New test.
        * gcc.target/powerpc/p9-vec-length-full-1.c: New test.
        * gcc.target/powerpc/p9-vec-length-full-2.c: New test.
        * gcc.target/powerpc/p9-vec-length-full-3.c: New test.
        * gcc.target/powerpc/p9-vec-length-full-4.c: New test.
        * gcc.target/powerpc/p9-vec-length-full-5.c: New test.
        * gcc.target/powerpc/p9-vec-length-full-6.c: New test.
        * gcc.target/powerpc/p9-vec-length-full-7.c: New test.
        * gcc.target/powerpc/p9-vec-length-full-8.c: New test.
        * gcc.target/powerpc/p9-vec-length-full-run-1.c: New test.
        * gcc.target/powerpc/p9-vec-length-full-run-2.c: New test.
        * gcc.target/powerpc/p9-vec-length-full-run-3.c: New test.
        * gcc.target/powerpc/p9-vec-length-full-run-4.c: New test.
        * gcc.target/powerpc/p9-vec-length-full-run-5.c: New test.
        * gcc.target/powerpc/p9-vec-length-full-run-6.c: New test.
        * gcc.target/powerpc/p9-vec-length-full-run-7.c: New test.
        * gcc.target/powerpc/p9-vec-length-full-run-8.c: New test.
        * gcc.target/powerpc/p9-vec-length-run-1.h: New test.
        * gcc.target/powerpc/p9-vec-length-run-2.h: New test.
        * gcc.target/powerpc/p9-vec-length-run-3.h: New test.
        * gcc.target/powerpc/p9-vec-length-run-4.h: New test.
        * gcc.target/powerpc/p9-vec-length-run-5.h: New test.
        * gcc.target/powerpc/p9-vec-length-run-6.h: New test.
        * gcc.target/powerpc/p9-vec-length-run-7.h: New test.
        * gcc.target/powerpc/p9-vec-length-run-8.h: New test.
        * gcc.target/powerpc/p9-vec-length.h: New test.

[-- Attachment #2: testcases_v2.diff --]
[-- Type: text/plain, Size: 58055 bytes --]

diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-1.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-1.h
new file mode 100644
index 00000000000..50da5817013
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-1.h
@@ -0,0 +1,18 @@
+#include "p9-vec-length.h"
+
+/* Test the case where the loop iteration count is known.  */
+
+#define N 127
+
+#define test(TYPE)                                                             \
+  extern TYPE a_##TYPE[N];                                                     \
+  extern TYPE b_##TYPE[N];                                                     \
+  extern TYPE c_##TYPE[N];                                                     \
+  void __attribute__ ((noinline, noclone)) test##TYPE ()                       \
+  {                                                                            \
+    unsigned int i = 0;                                                        \
+    for (i = 0; i < N; i++)                                                    \
+      c_##TYPE[i] = a_##TYPE[i] + b_##TYPE[i];                                 \
+  }
+
+TEST_ALL (test)
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-2.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-2.h
new file mode 100644
index 00000000000..b275dba0fde
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-2.h
@@ -0,0 +1,17 @@
+#include "p9-vec-length.h"
+
+/* Test the case where the loop iteration count is unknown.  */
+#define N 255
+
+#define test(TYPE)                                                             \
+  extern TYPE a_##TYPE[N];                                                     \
+  extern TYPE b_##TYPE[N];                                                     \
+  extern TYPE c_##TYPE[N];                                                     \
+  void __attribute__ ((noinline, noclone)) test##TYPE (unsigned int n)         \
+  {                                                                            \
+    unsigned int i = 0;                                                        \
+    for (i = 0; i < n; i++)                                                    \
+      c_##TYPE[i] = a_##TYPE[i] + b_##TYPE[i];                                 \
+  }
+
+TEST_ALL (test)
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-3.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-3.h
new file mode 100644
index 00000000000..c79b9b30910
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-3.h
@@ -0,0 +1,31 @@
+#include "p9-vec-length.h"
+
+/* Test the case where the loop iteration count is less than VF.  */
+
+/* For char.  */
+#define N_uint8_t 15
+#define N_int8_t 15
+/* For short.  */
+#define N_uint16_t 6
+#define N_int16_t 6
+/* For int/float.  */
+#define N_uint32_t 3
+#define N_int32_t 3
+#define N_float 3
+/* For long/double.  */
+#define N_uint64_t 1
+#define N_int64_t 1
+#define N_double 1
+
+#define test(TYPE)                                                             \
+  extern TYPE a_##TYPE[N_##TYPE];                                              \
+  extern TYPE b_##TYPE[N_##TYPE];                                              \
+  extern TYPE c_##TYPE[N_##TYPE];                                              \
+  void __attribute__ ((noinline, noclone)) test##TYPE ()                       \
+  {                                                                            \
+    unsigned int i = 0;                                                        \
+    for (i = 0; i < N_##TYPE; i++)                                             \
+      c_##TYPE[i] = a_##TYPE[i] + b_##TYPE[i];                                 \
+  }
+
+TEST_ALL (test)
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-4.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-4.h
new file mode 100644
index 00000000000..0ee7fc84502
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-4.h
@@ -0,0 +1,24 @@
+#include "p9-vec-length.h"
+
+/* Test the case where the loop has multiple vectors (concatenated vectors)
+   of the same vector type.  */
+
+#define test(TYPE)                                                             \
+  void __attribute__ ((noinline, noclone))                                     \
+    test_mv_##TYPE (TYPE *restrict a, TYPE *restrict b, TYPE *restrict c,      \
+		    int n)                                                     \
+  {                                                                            \
+    for (int i = 0; i < n; ++i)                                                \
+      {                                                                        \
+	a[i] += 1;                                                             \
+	b[i * 2] += 2;                                                         \
+	b[i * 2 + 1] += 3;                                                     \
+	c[i * 4] += 4;                                                         \
+	c[i * 4 + 1] += 5;                                                     \
+	c[i * 4 + 2] += 6;                                                     \
+	c[i * 4 + 3] += 7;                                                     \
+      }                                                                        \
+  }
+
+TEST_ALL (test)
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-5.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-5.h
new file mode 100644
index 00000000000..406daaa3d3e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-5.h
@@ -0,0 +1,29 @@
+#include "p9-vec-length.h"
+
+/* Test the case where the loop has multiple vectors (concatenated vectors)
+   of different types.  */
+
+#define test(TYPE1, TYPE2)                                                     \
+  void __attribute__ ((noinline, noclone))                                     \
+    test_mv_##TYPE1##TYPE2 (TYPE1 *restrict a, TYPE2 *restrict b, int n)       \
+  {                                                                            \
+    for (int i = 0; i < n; ++i)                                                \
+      {                                                                        \
+	a[i * 2] += 1;                                                         \
+	a[i * 2 + 1] += 2;                                                     \
+	b[i * 2] += 3;                                                         \
+	b[i * 2 + 1] += 4;                                                     \
+      }                                                                        \
+  }
+
+#define TEST_ALL2(T)                                                           \
+  T (int8_t, uint16_t)                                                         \
+  T (uint8_t, int16_t)                                                         \
+  T (int16_t, uint32_t)                                                        \
+  T (uint16_t, int32_t)                                                        \
+  T (int32_t, double)                                                          \
+  T (uint32_t, int64_t)                                                        \
+  T (float, uint64_t)
+
+TEST_ALL2 (test)
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-6.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-6.h
new file mode 100644
index 00000000000..58b151e18f8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-6.h
@@ -0,0 +1,32 @@
+#include "p9-vec-length.h"
+
+/* Test the case where the loop has the same concatenated vector size per
+   iteration but built from different types.  */
+
+#define test(TYPE1, TYPE2)                                                     \
+  void __attribute__ ((noinline, noclone))                                     \
+    test_mv_##TYPE1##TYPE2 (TYPE1 *restrict a, TYPE2 *restrict b, int n)       \
+  {                                                                            \
+    for (int i = 0; i < n; i++)                                                \
+      {                                                                        \
+	a[i * 2] += 1;                                                         \
+	a[i * 2 + 1] += 2;                                                     \
+	b[i * 4] += 3;                                                         \
+	b[i * 4 + 1] += 4;                                                     \
+	b[i * 4 + 2] += 5;                                                     \
+	b[i * 4 + 3] += 6;                                                     \
+      }                                                                        \
+  }
+
+#define TEST_ALL2(T)                                                           \
+  T (int16_t, uint8_t)                                                         \
+  T (uint16_t, int8_t)                                                         \
+  T (int32_t, uint16_t)                                                        \
+  T (uint32_t, int16_t)                                                        \
+  T (float, uint16_t)                                                          \
+  T (int64_t, float)                                                           \
+  T (uint64_t, int32_t)                                                        \
+  T (double, uint32_t)
+
+TEST_ALL2 (test)
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-7.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-7.h
new file mode 100644
index 00000000000..4ef8f974a04
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-7.h
@@ -0,0 +1,20 @@
+#include "p9-vec-length.h"
+
+/* Test the case where the loop requires a peeled prologue for
+   alignment.  */
+
+#define N 64
+#define START 1
+#define END 59
+
+#define test(TYPE)                                                             \
+  TYPE x_##TYPE[N] __attribute__((aligned(16)));                                \
+  void __attribute__((noinline, noclone)) test_npeel_##TYPE() {                \
+    TYPE v = 0;                                                                \
+    for (unsigned int i = START; i < END; i++) {                               \
+      x_##TYPE[i] = v;                                                         \
+      v += 1;                                                                  \
+    }                                                                          \
+  }
+
+TEST_ALL (test)
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-8.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-8.h
new file mode 100644
index 00000000000..09d0e369f11
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-8.h
@@ -0,0 +1,14 @@
+#include "p9-vec-length.h"
+
+/* Test the case where the loop requires peeling for gaps.  */
+
+#define N 200
+
+#define test(TYPE)                                                             \
+  void __attribute__((noinline, noclone))                                      \
+      test_##TYPE(TYPE *restrict dest, TYPE *restrict src) {                   \
+    for (unsigned int i = 0; i < N; ++i)                                       \
+      dest[i] += src[i * 2];                                                   \
+  }
+
+TEST_ALL(test)
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-1.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-1.c
new file mode 100644
index 00000000000..bde224560db
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-1.c
@@ -0,0 +1,15 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=1" } */
+
+/* Test that only the epilogue is vectorized with length-based vector
+   accesses; the main loop body still uses normal vector loads/stores.  */
+
+#include "p9-vec-length-1.h"
+
+/* { dg-final { scan-assembler-times {\mlxv\M|\mlxvx\M} 20 } } */
+/* { dg-final { scan-assembler-times {\mstxv\M|\mstxvx\M} 10 } } */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 20 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 10 } } */
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-2.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-2.c
new file mode 100644
index 00000000000..86cd7910f74
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-2.c
@@ -0,0 +1,15 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=1" } */
+
+/* Test that only the epilogue is vectorized with length-based vector
+   accesses; the main loop body still uses normal vector loads/stores.  */
+
+#include "p9-vec-length-2.h"
+
+/* { dg-final { scan-assembler-times {\mlxv\M|\mlxvx\M} 20 } } */
+/* { dg-final { scan-assembler-times {\mstxv\M|\mstxvx\M} 10 } } */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 20 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 10 } } */
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-3.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-3.c
new file mode 100644
index 00000000000..962e0d88971
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-3.c
@@ -0,0 +1,18 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=1" } */
+
+/* Test that only the epilogue is vectorized with length-based vector
+   accesses; the main loop body still uses normal vector loads/stores.  */
+
+#include "p9-vec-length-3.h"
+
+/* { dg-final { scan-assembler-not   {\mlxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mlxvx\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxvx\M} } } */
+/* 64-bit types get completely unrolled, so only check the others.  */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 14 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 7 } } */
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-4.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-4.c
new file mode 100644
index 00000000000..a7c6edf2f8f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-4.c
@@ -0,0 +1,15 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=1" } */
+
+/* Test that only the epilogue is vectorized with length-based vector
+   accesses; the main loop body still uses normal vector loads/stores.  */
+
+#include "p9-vec-length-4.h"
+
+/* { dg-final { scan-assembler-times {\mlxv\M|\mlxvx\M} 120 } } */
+/* { dg-final { scan-assembler-times {\mstxv\M|\mstxvx\M} 70 } } */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 70 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 70 } } */
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-5.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-5.c
new file mode 100644
index 00000000000..04622145648
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-5.c
@@ -0,0 +1,15 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=1" } */
+
+/* Test that only the epilogue is vectorized with length-based vector
+   accesses; the main loop body still uses normal vector loads/stores.  */
+
+#include "p9-vec-length-5.h"
+
+/* { dg-final { scan-assembler-times {\mlxv\M|\mlxvx\M} 49 } } */
+/* { dg-final { scan-assembler-times {\mstxv\M|\mstxvx\M} 21 } } */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 21 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 21 } } */
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-6.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-6.c
new file mode 100644
index 00000000000..1ffa98f0fde
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-6.c
@@ -0,0 +1,16 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=1" } */
+
+/* Test that only the epilogue is vectorized with length-based vector
+   accesses; the main loop body still uses normal vector loads/stores.  */
+
+#include "p9-vec-length-6.h"
+
+/* { dg-final { scan-assembler-times {\mlxv\M|\mlxvx\M} 42 } } */
+/* { dg-final { scan-assembler-times {\mstxv\M|\mstxvx\M} 16 } } */
+/* 64-bit/32-bit pairs don't have epilogues.  */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 10 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 10 } } */
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-7.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-7.c
new file mode 100644
index 00000000000..a6755ed75ef
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-7.c
@@ -0,0 +1,11 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops -ffast-math" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=1" } */
+
+/* Test that only the epilogue is vectorized with length-based vector
+   accesses; the main loop body still uses normal vector loads/stores.  */
+
+#include "p9-vec-length-7.h"
+
+/* { dg-final { scan-assembler-times {\mstxvl\M} 10 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-8.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-8.c
new file mode 100644
index 00000000000..3a60db2d7f7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-8.c
@@ -0,0 +1,12 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=1" } */
+
+/* Test that only the epilogue is vectorized with length-based vector
+   accesses; the main loop body still uses normal vector loads/stores.  */
+
+#include "p9-vec-length-8.h"
+
+/* { dg-final { scan-assembler-times {\mlxvl\M} 30 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 10 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-1.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-1.c
new file mode 100644
index 00000000000..f11eccb62f0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-1.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=1" } */
+
+/* Check whether it runs successfully if we only vectorize the epilogue
+   with vector access with length.  */
+
+#include "p9-vec-length-run-1.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-2.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-2.c
new file mode 100644
index 00000000000..f77ad31c6cc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-2.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=1" } */
+
+/* Check whether it runs successfully if we only vectorize the epilogue
+   with vector access with length.  */
+
+#include "p9-vec-length-run-2.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-3.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-3.c
new file mode 100644
index 00000000000..79551dab7e1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-3.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=1" } */
+
+/* Check whether it runs successfully if we only vectorize the epilogue
+   with vector access with length.  */
+
+#include "p9-vec-length-run-3.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-4.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-4.c
new file mode 100644
index 00000000000..c4c479b6f03
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-4.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=1" } */
+
+/* Check whether it runs successfully if we only vectorize the epilogue
+   with vector access with length.  */
+
+#include "p9-vec-length-run-4.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-5.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-5.c
new file mode 100644
index 00000000000..0239991a293
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-5.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=1" } */
+
+/* Check whether it runs successfully if we only vectorize the epilogue
+   with vector access with length.  */
+
+#include "p9-vec-length-run-5.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-6.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-6.c
new file mode 100644
index 00000000000..30e9b759767
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-6.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=1" } */
+
+/* Check whether it runs successfully if we only vectorize the epilogue
+   with vector access with length.  */
+
+#include "p9-vec-length-run-6.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-7.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-7.c
new file mode 100644
index 00000000000..50ffea15ee3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-7.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -ffast-math" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=1" } */
+
+/* Check whether it runs successfully if we only vectorize the epilogue
+   with vector access with length.  */
+
+#include "p9-vec-length-run-7.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-8.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-8.c
new file mode 100644
index 00000000000..b43610a8b34
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-8.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=1" } */
+
+/* Check whether it runs successfully if we only vectorize the epilogue
+   with vector access with length.  */
+
+#include "p9-vec-length-run-8.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-1.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-1.c
new file mode 100644
index 00000000000..67fa719ecaa
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-1.c
@@ -0,0 +1,16 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=2" } */
+
+/* Test the fully-with-length case: the loop body uses vector accesses with
+   length, and there should not be any epilogues.  */
+
+#include "p9-vec-length-1.h"
+
+/* { dg-final { scan-assembler-not   {\mlxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mlxvx\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxvx\M} } } */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 20 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 10 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-2.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-2.c
new file mode 100644
index 00000000000..97ea32cc008
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-2.c
@@ -0,0 +1,16 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=2" } */
+
+/* Test the fully-with-length case: the loop body uses vector accesses with
+   length, and there should not be any epilogues.  */
+
+#include "p9-vec-length-2.h"
+
+/* { dg-final { scan-assembler-not   {\mlxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mlxvx\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxvx\M} } } */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 20 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 10 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-3.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-3.c
new file mode 100644
index 00000000000..cd5459fa9f4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-3.c
@@ -0,0 +1,17 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=2" } */
+
+/* Test the fully-with-length case: the loop body uses vector accesses with
+   length, and there should not be any epilogues.  */
+
+#include "p9-vec-length-3.h"
+
+/* { dg-final { scan-assembler-not   {\mlxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mlxvx\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxvx\M} } } */
+/* 64-bit types get completely unrolled, so only check the others.  */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 14 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 7 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-4.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-4.c
new file mode 100644
index 00000000000..03429c1c92b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-4.c
@@ -0,0 +1,16 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=2" } */
+
+/* Test the fully-with-length case: the loop body uses vector accesses with
+   length, and there should not be any epilogues.  */
+
+#include "p9-vec-length-4.h"
+
+/* Normal vector loads can still be used for loading constant vectors.  */
+/* { dg-final { scan-assembler-not   {\mstxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mlxvx\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxvx\M} } } */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 70 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 70 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-5.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-5.c
new file mode 100644
index 00000000000..1abb28a2c2d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-5.c
@@ -0,0 +1,16 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=2" } */
+
+/* Test the fully-with-length case: the loop body uses vector accesses with
+   length, and there should not be any epilogues.  */
+
+#include "p9-vec-length-5.h"
+
+/* Normal vector loads can still be used for loading constant vectors.  */
+/* { dg-final { scan-assembler-not   {\mstxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mlxvx\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxvx\M} } } */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 21 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 21 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-6.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-6.c
new file mode 100644
index 00000000000..5c9a035c544
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-6.c
@@ -0,0 +1,16 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=2" } */
+
+/* Test the fully-with-length case: the loop body uses vector accesses with
+   length, and there should not be any epilogues.  */
+
+#include "p9-vec-length-6.h"
+
+/* Normal vector loads can still be used for loading constant vectors.  */
+/* { dg-final { scan-assembler-not   {\mstxv\M} } } */
+/* { dg-final { scan-assembler-not   {\mlxvx\M} } } */
+/* { dg-final { scan-assembler-not   {\mstxvx\M} } } */
+/* { dg-final { scan-assembler-times {\mlxvl\M} 16 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 16 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-7.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-7.c
new file mode 100644
index 00000000000..f5fe07d719f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-7.c
@@ -0,0 +1,13 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops -ffast-math" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=2" } */
+
+/* Test the fully-with-length case: the loop body uses vector accesses with
+   length, and there should not be any epilogues.  */
+
+#include "p9-vec-length-7.h"
+
+/* Each type has one stxvl except for int8 and uint8, which have two because
+   the rtl pass bbro duplicates the block containing the stxvl.  */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 12 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-8.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-8.c
new file mode 100644
index 00000000000..880d6aaec39
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-8.c
@@ -0,0 +1,12 @@
+/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=2" } */
+
+/* Test the fully-with-length case: the loop body uses vector accesses with
+   length, and there should not be any epilogues.  */
+
+#include "p9-vec-length-8.h"
+
+/* { dg-final { scan-assembler-times {\mlxvl\M} 30 } } */
+/* { dg-final { scan-assembler-times {\mstxvl\M} 10 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-1.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-1.c
new file mode 100644
index 00000000000..81c4c5d5f2a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-1.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=2" } */
+
+/* Check whether it runs successfully if we vectorize the loop fully
+   with vector access with length.  */
+
+#include "p9-vec-length-run-1.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-2.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-2.c
new file mode 100644
index 00000000000..c0eabde6b42
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-2.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=2" } */
+
+/* Check whether it runs successfully if we vectorize the loop fully
+   with vector access with length.  */
+
+#include "p9-vec-length-run-2.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-3.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-3.c
new file mode 100644
index 00000000000..1a2fd9cb5b9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-3.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=2" } */
+
+/* Check whether it runs successfully if we vectorize the loop fully
+   with vector access with length.  */
+
+#include "p9-vec-length-run-3.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-4.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-4.c
new file mode 100644
index 00000000000..0406798f958
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-4.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=2" } */
+
+/* Check whether it runs successfully if we vectorize the loop fully
+   with vector access with length.  */
+
+#include "p9-vec-length-run-4.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-5.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-5.c
new file mode 100644
index 00000000000..98c8af1d15b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-5.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=2" } */
+
+/* Check whether it runs successfully if we vectorize the loop fully
+   with vector access with length.  */
+
+#include "p9-vec-length-run-5.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-6.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-6.c
new file mode 100644
index 00000000000..a2244943187
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-6.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=2" } */
+
+/* Check whether it runs successfully if we vectorize the loop fully
+   with vector access with length.  */
+
+#include "p9-vec-length-run-6.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-7.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-7.c
new file mode 100644
index 00000000000..4a4a9ea67ff
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-7.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model -ffast-math" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=2" } */
+
+/* Check whether it runs successfully if we vectorize the loop fully
+   with vector access with length.  */
+
+#include "p9-vec-length-run-7.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-8.c b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-8.c
new file mode 100644
index 00000000000..a4f72e72248
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-8.c
@@ -0,0 +1,10 @@
+/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
+/* { dg-options "-mdejagnu-cpu=power9 -O2 -ftree-vectorize -fno-vect-cost-model" } */
+
+/* { dg-additional-options "--param=vect-partial-vector-usage=2" } */
+
+/* Check whether it runs successfully if we vectorize the loop fully
+   with vector access with length.  */
+
+#include "p9-vec-length-run-8.h"
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-1.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-1.h
new file mode 100644
index 00000000000..b397fd1ac30
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-1.h
@@ -0,0 +1,34 @@
+#include "p9-vec-length-1.h"
+
+#define decl(TYPE)                                                             \
+  TYPE a_##TYPE[N];                                                            \
+  TYPE b_##TYPE[N];                                                            \
+  TYPE c_##TYPE[N];
+
+#define run(TYPE)                                                              \
+  {                                                                            \
+    unsigned int i = 0;                                                        \
+    for (i = 0; i < N; i++)                                                    \
+      {                                                                        \
+	a_##TYPE[i] = i * 2 + 1;                                               \
+	b_##TYPE[i] = i % 2 - 2;                                               \
+      }                                                                        \
+    test##TYPE ();                                                             \
+    for (i = 0; i < N; i++)                                                    \
+      {                                                                        \
+	TYPE a1 = i * 2 + 1;                                                   \
+	TYPE b1 = i % 2 - 2;                                                   \
+	TYPE exp_c = a1 + b1;                                                  \
+	if (c_##TYPE[i] != exp_c)                                              \
+	  __builtin_abort ();                                                  \
+      }                                                                        \
+  }
+
+TEST_ALL (decl)
+
+int
+main (void)
+{
+  TEST_ALL (run)
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-2.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-2.h
new file mode 100644
index 00000000000..a0f2d6ccb23
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-2.h
@@ -0,0 +1,36 @@
+#include "p9-vec-length-2.h"
+
+#define decl(TYPE)                                                             \
+  TYPE a_##TYPE[N];                                                            \
+  TYPE b_##TYPE[N];                                                            \
+  TYPE c_##TYPE[N];
+
+#define N1 195
+
+#define run(TYPE)                                                              \
+  {                                                                            \
+    unsigned int i = 0;                                                        \
+    for (i = 0; i < N; i++)                                                    \
+      {                                                                        \
+	a_##TYPE[i] = i * 2 + 1;                                               \
+	b_##TYPE[i] = i % 2 - 2;                                               \
+      }                                                                        \
+    test##TYPE (N1);                                                           \
+    for (i = 0; i < N1; i++)                                                   \
+      {                                                                        \
+	TYPE a1 = i * 2 + 1;                                                   \
+	TYPE b1 = i % 2 - 2;                                                   \
+	TYPE exp_c = a1 + b1;                                                  \
+	if (c_##TYPE[i] != exp_c)                                              \
+	  __builtin_abort ();                                                  \
+      }                                                                        \
+  }
+
+TEST_ALL (decl)
+
+int
+main (void)
+{
+  TEST_ALL (run)
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-3.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-3.h
new file mode 100644
index 00000000000..5d2f5c34b6a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-3.h
@@ -0,0 +1,34 @@
+#include "p9-vec-length-3.h"
+
+#define decl(TYPE)                                                             \
+  TYPE a_##TYPE[N_##TYPE];                                                     \
+  TYPE b_##TYPE[N_##TYPE];                                                     \
+  TYPE c_##TYPE[N_##TYPE];
+
+#define run(TYPE)                                                              \
+  {                                                                            \
+    unsigned int i = 0;                                                        \
+    for (i = 0; i < N_##TYPE; i++)                                             \
+      {                                                                        \
+	a_##TYPE[i] = i * 2 + 1;                                               \
+	b_##TYPE[i] = i % 2 - 2;                                               \
+      }                                                                        \
+    test##TYPE ();                                                             \
+    for (i = 0; i < N_##TYPE; i++)                                             \
+      {                                                                        \
+	TYPE a1 = i * 2 + 1;                                                   \
+	TYPE b1 = i % 2 - 2;                                                   \
+	TYPE exp_c = a1 + b1;                                                  \
+	if (c_##TYPE[i] != exp_c)                                              \
+	  __builtin_abort ();                                                  \
+      }                                                                        \
+  }
+
+TEST_ALL (decl)
+
+int
+main (void)
+{
+  TEST_ALL (run)
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-4.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-4.h
new file mode 100644
index 00000000000..2f3b911d0d1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-4.h
@@ -0,0 +1,62 @@
+#include "p9-vec-length-4.h"
+
+/* Check extra elements to catch any out-of-bound vector accesses.  */
+#define N  144
+/* Number of elements actually processed by the test function.  */
+#define NF 123
+
+#define run(TYPE)                                                              \
+  {                                                                            \
+    unsigned int i = 0;                                                        \
+    TYPE a[N], b[N * 2], c[N * 4];                                             \
+    for (i = 0; i < N; i++)                                                    \
+      {                                                                        \
+	a[i] = i + i % 2;                                                      \
+	b[i * 2] = i * 2 + i % 3;                                              \
+	b[i * 2 + 1] = i * 3 + i % 4;                                          \
+	c[i * 4] = i * 4 + i % 5;                                              \
+	c[i * 4 + 1] = i * 5 + i % 6;                                          \
+	c[i * 4 + 2] = i * 6 + i % 7;                                          \
+	c[i * 4 + 3] = i * 7 + i % 8;                                          \
+      }                                                                        \
+    test_mv_##TYPE (a, b, c, NF);                                              \
+    for (i = 0; i < N; i++)                                                    \
+      {                                                                        \
+	TYPE a1 = i + i % 2;                                                   \
+	TYPE b1 = i * 2 + i % 3;                                               \
+	TYPE b2 = i * 3 + i % 4;                                               \
+	TYPE c1 = i * 4 + i % 5;                                               \
+	TYPE c2 = i * 5 + i % 6;                                               \
+	TYPE c3 = i * 6 + i % 7;                                               \
+	TYPE c4 = i * 7 + i % 8;                                               \
+                                                                               \
+	TYPE exp_a = a1;                                                       \
+	TYPE exp_b1 = b1;                                                      \
+	TYPE exp_b2 = b2;                                                      \
+	TYPE exp_c1 = c1;                                                      \
+	TYPE exp_c2 = c2;                                                      \
+	TYPE exp_c3 = c3;                                                      \
+	TYPE exp_c4 = c4;                                                      \
+	if (i < NF)                                                            \
+	  {                                                                    \
+	    exp_a += 1;                                                        \
+	    exp_b1 += 2;                                                       \
+	    exp_b2 += 3;                                                       \
+	    exp_c1 += 4;                                                       \
+	    exp_c2 += 5;                                                       \
+	    exp_c3 += 6;                                                       \
+	    exp_c4 += 7;                                                       \
+	  }                                                                    \
+	if (a[i] != exp_a || b[i * 2] != exp_b1 || b[i * 2 + 1] != exp_b2      \
+	    || c[i * 4] != exp_c1 || c[i * 4 + 1] != exp_c2                    \
+	    || c[i * 4 + 2] != exp_c3 || c[i * 4 + 3] != exp_c4)               \
+	  __builtin_abort ();                                                  \
+      }                                                                        \
+  }
+
+int
+main (void)
+{
+  TEST_ALL (run)
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-5.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-5.h
new file mode 100644
index 00000000000..ca4b3d56351
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-5.h
@@ -0,0 +1,45 @@
+#include "p9-vec-length-5.h"
+
+/* Check extra elements to catch any out-of-bound vector accesses.  */
+#define N 155
+/* Number of elements actually processed by the test function.  */
+#define NF 127
+
+#define run(TYPE1, TYPE2)                                                      \
+  {                                                                            \
+    unsigned int i = 0;                                                        \
+    TYPE1 a[N * 2];                                                            \
+    TYPE2 b[N * 2];                                                            \
+    for (i = 0; i < N; i++)                                                    \
+      {                                                                        \
+	a[i * 2] = i * 2 + i % 3;                                              \
+	a[i * 2 + 1] = i * 3 + i % 4;                                          \
+	b[i * 2] = i * 7 + i / 5;                                              \
+	b[i * 2 + 1] = i * 8 + i / 6;                                          \
+      }                                                                        \
+    test_mv_##TYPE1##TYPE2 (a, b, NF);                                         \
+    for (i = 0; i < N; i++)                                                    \
+      {                                                                        \
+	TYPE1 exp_a1 = i * 2 + i % 3;                                          \
+	TYPE1 exp_a2 = i * 3 + i % 4;                                          \
+	TYPE2 exp_b1 = i * 7 + i / 5;                                          \
+	TYPE2 exp_b2 = i * 8 + i / 6;                                          \
+	if (i < NF)                                                            \
+	  {                                                                    \
+	    exp_a1 += 1;                                                        \
+	    exp_a2 += 2;                                                       \
+	    exp_b1 += 3;                                                       \
+	    exp_b2 += 4;                                                       \
+	  }                                                                    \
+	if (a[i * 2] != exp_a1 || a[i * 2 + 1] != exp_a2 || b[i * 2] != exp_b1 \
+	    || b[i * 2 + 1] != exp_b2)                                         \
+	  __builtin_abort ();                                                  \
+      }                                                                        \
+  }
+
+int
+main (void)
+{
+  TEST_ALL2 (run)
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-6.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-6.h
new file mode 100644
index 00000000000..814e4059bdf
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-6.h
@@ -0,0 +1,52 @@
+#include "p9-vec-length-6.h"
+
+/* Check extra elements to catch any out-of-bound vector accesses.  */
+#define N 275
+/* Number of elements actually processed by the test function.  */
+#define NF 255
+
+#define run(TYPE1, TYPE2)                                                      \
+  {                                                                            \
+    unsigned int i = 0;                                                        \
+    TYPE1 a[N * 2];                                                            \
+    TYPE2 b[N * 4];                                                            \
+    for (i = 0; i < N; i++)                                                    \
+      {                                                                        \
+	a[i * 2] = i * 2 + i % 3;                                              \
+	a[i * 2 + 1] = i * 3 + i % 4;                                          \
+	b[i * 4] = i * 4 + i / 5;                                              \
+	b[i * 4 + 1] = i * 5 + i / 6;                                          \
+	b[i * 4 + 2] = i * 6 + i / 7;                                          \
+	b[i * 4 + 3] = i * 7 + i / 8;                                          \
+      }                                                                        \
+    test_mv_##TYPE1##TYPE2 (a, b, NF);                                         \
+    for (i = 0; i < N; i++)                                                    \
+      {                                                                        \
+	TYPE1 exp_a1 = i * 2 + i % 3;                                          \
+	TYPE1 exp_a2 = i * 3 + i % 4;                                          \
+	TYPE2 exp_b1 = i * 4 + i / 5;                                          \
+	TYPE2 exp_b2 = i * 5 + i / 6;                                          \
+	TYPE2 exp_b3 = i * 6 + i / 7;                                          \
+	TYPE2 exp_b4 = i * 7 + i / 8;                                          \
+	if (i < NF)                                                            \
+	  {                                                                    \
+	    exp_a1 += 1;                                                       \
+	    exp_a2 += 2;                                                       \
+	    exp_b1 += 3;                                                       \
+	    exp_b2 += 4;                                                       \
+	    exp_b3 += 5;                                                       \
+	    exp_b4 += 6;                                                       \
+	  }                                                                    \
+	if (a[i * 2] != exp_a1 || a[i * 2 + 1] != exp_a2 || b[i * 4] != exp_b1 \
+	    || b[i * 4 + 1] != exp_b2 || b[i * 4 + 2] != exp_b3                \
+	    || b[i * 4 + 3] != exp_b4)                                         \
+	  __builtin_abort ();                                                  \
+      }                                                                        \
+  }
+
+int
+main (void)
+{
+  TEST_ALL2 (run)
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-7.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-7.h
new file mode 100644
index 00000000000..31280bf8a16
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-7.h
@@ -0,0 +1,16 @@
+#include "p9-vec-length-7.h"
+
+#define run(TYPE)                                                              \
+  {                                                                            \
+    unsigned int i = 0;                                                        \
+    test_npeel_##TYPE();                                                       \
+    for (int i = 0; i < N; ++i) {                                              \
+      if (x_##TYPE[i] != (i < START || i >= END ? 0 : (i - START)))            \
+        __builtin_abort();                                                     \
+    }                                                                          \
+  }
+
+int main() {
+  TEST_ALL(run)
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-8.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-8.h
new file mode 100644
index 00000000000..aedbc3df3aa
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-8.h
@@ -0,0 +1,27 @@
+#include "p9-vec-length-8.h"
+
+#define run(TYPE)                                                              \
+  {                                                                            \
+    unsigned int i = 0;                                                        \
+                                                                               \
+    TYPE out_##TYPE[N];                                                        \
+    TYPE in_##TYPE[N * 2];                                                     \
+    for (int i = 0; i < N; ++i) {                                              \
+      out_##TYPE[i] = i * 7 / 2;                                               \
+    }                                                                          \
+    for (int i = 0; i < N * 2; ++i) {                                          \
+      in_##TYPE[i] = i * 9 / 2;                                                \
+    }                                                                          \
+                                                                               \
+    test_##TYPE(out_##TYPE, in_##TYPE);                                        \
+    for (int i = 0; i < N; ++i) {                                              \
+      TYPE expected = i * 7 / 2 + in_##TYPE[i * 2];                            \
+      if (out_##TYPE[i] != expected)                                           \
+        __builtin_abort();                                                     \
+    }                                                                          \
+  }
+
+int main(void) {
+  TEST_ALL(run)
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length.h b/gcc/testsuite/gcc.target/powerpc/p9-vec-length.h
new file mode 100644
index 00000000000..83418b0b641
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length.h
@@ -0,0 +1,14 @@
+#include <stdint.h>
+
+#define TEST_ALL(T)                                                            \
+  T (int8_t)                                                                   \
+  T (uint8_t)                                                                  \
+  T (int16_t)                                                                  \
+  T (uint16_t)                                                                 \
+  T (int32_t)                                                                  \
+  T (uint32_t)                                                                 \
+  T (int64_t)                                                                  \
+  T (uint64_t)                                                                 \
+  T (float)                                                                    \
+  T (double)
+


* Re: [PATCH 5/7 v7] vect: Support vector load/store with length in vectorizer
  2020-07-10  9:55                           ` [PATCH 5/7 v7] " Kewen.Lin
@ 2020-07-17  9:54                             ` Richard Sandiford
  2020-07-20  2:25                               ` Kewen.Lin
  0 siblings, 1 reply; 80+ messages in thread
From: Richard Sandiford @ 2020-07-17  9:54 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Bill Schmidt, Richard Biener, Segher Boessenkool, dje.gcc

Hi,

Sorry for the slow review.

> The new version v7 is attached which has addressed your review comments
> on v6.  Could you have a further look?  Many thanks in advance!
>
> Bootstrapped/regtested on aarch64-linux-gnu and powerpc64le-linux-gnu P9.
> Even with explicit vect-partial-vector-usage settings 1/2 on Power target,
> I didn't find any remarkable failures (only some trivial test case issues).

Thanks, this looks great.  OK for trunk with the minor nits below fixed.

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> @@ -968,4 +968,8 @@ Bound on number of runtime checks inserted by the vectorizer's loop versioning f
>  Common Joined UInteger Var(param_vect_max_version_for_alignment_checks) Init(6) Param Optimization
>  Bound on number of runtime checks inserted by the vectorizer's loop versioning for alignment check.
>  
> +-param=vect-partial-vector-usage=
> +Common Joined UInteger Var(param_vect_partial_vector_usage) Init(2) IntegerRange(0, 2) Param Optimization
> +Controls how loop vectorizer uses partial vectors.  0 means never, 1 means only for loops whose iterating need can be removed, 2 means for all loops.  The default value is 2.

IMO reads better as s/iterating need/need to iterate/
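
I.e. presumably:

  Controls how loop vectorizer uses partial vectors.  0 means never, 1 means
  only for loops whose need to iterate can be removed, 2 means for all loops.
  The default value is 2.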

> +  FOR_EACH_MODE_IN_CLASS (tmode_iter, MODE_INT)
> +  {
> +    scalar_mode tmode = tmode_iter.require ();
> +    unsigned int tbits = GET_MODE_BITSIZE (tmode);
> +
> +    /* ??? Do we really want to construct one IV whose precision exceeds
> +       BITS_PER_WORD?  */
> +    if (tbits > BITS_PER_WORD)
> +      break;
> +
> +    /* Find the first available standard integral type.  */
> +    if (tbits >= min_ni_prec && targetm.scalar_mode_supported_p (tmode))
> +      {
> +	iv_type = build_nonstandard_integer_type (tbits, true);
> +	break;
> +      }
> +  }

The outer {…} block should be indented by two spaces relative
to the FOR_EACH_MODE_IN_CLASS.
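
I.e. the expected GNU-style layout is presumably along these lines (a sketch
only, reusing the names from the quoted hunk):

  FOR_EACH_MODE_IN_CLASS (tmode_iter, MODE_INT)
    {
      scalar_mode tmode = tmode_iter.require ();
      unsigned int tbits = GET_MODE_BITSIZE (tmode);
      ...
    }
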
> +
> +  if (!iv_type)
> +    {
> +      if (dump_enabled_p ())
> +	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +			 "can't vectorize with length-based partial vectors"
> +			 " due to no suitable iv type.\n");

IMO reads better as s/due to/because there is/

> +  /* Shouldn't go with length-based approach if fully masked.  */
> +  gcc_assert (!loop_lens || (loop_lens && !loop_masks));

The “loop_lens &&” is redundant.

Same for vectorizable_load.
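
I.e., since !loop_lens already covers the loop_lens && case, the check can
presumably be reduced to:

  /* Shouldn't go with length-based approach if fully masked.  */
  gcc_assert (!loop_lens || !loop_masks);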

> @@ -7994,6 +8030,42 @@ vectorizable_store (vec_info *vinfo,
>  		  vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
>  		  new_stmt = call;
>  		}
> +	      else if (loop_lens)
> +		{
> +		  tree final_len
> +		    = vect_get_loop_len (loop_vinfo, loop_lens,
> +					 vec_num * ncopies, vec_num * j + i);
> +		  align = least_bit_hwi (misalign | align);
> +		  tree ptr = build_int_cst (ref_type, align);
> +		  machine_mode vmode = TYPE_MODE (vectype);
> +		  opt_machine_mode new_ovmode
> +		    = get_len_load_store_mode (vmode, false);
> +		  gcc_assert (new_ovmode.exists ());
> +		  machine_mode new_vmode = new_ovmode.require ();

The assert is redundant with the “require ()”.

Same for vectorizable_load.
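
Put differently: since opt_machine_mode::require () itself checks that the
mode exists, the last three quoted lines can presumably be collapsed to
something like:

  machine_mode vmode = TYPE_MODE (vectype);
  machine_mode new_vmode
    = get_len_load_store_mode (vmode, false).require ();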

Thanks,
Richard


* Re: [PATCH 5/7 v7] vect: Support vector load/store with length in vectorizer
  2020-07-17  9:54                             ` Richard Sandiford
@ 2020-07-20  2:25                               ` Kewen.Lin
  0 siblings, 0 replies; 80+ messages in thread
From: Kewen.Lin @ 2020-07-20  2:25 UTC (permalink / raw)
  To: richard.sandiford
  Cc: GCC Patches, Bill Schmidt, Richard Biener, Segher Boessenkool, dje.gcc

Hi Richard,

on 2020/7/17 5:54 PM, Richard Sandiford wrote:
> Hi,
> 
> Sorry for the slow review.
> 
>> The new version v7 is attached which has addressed your review comments
>> on v6.  Could you have a further look?  Many thanks in advance!
>>
>> Bootstrapped/regtested on aarch64-linux-gnu and powerpc64le-linux-gnu P9.
>> Even with explicit vect-partial-vector-usage settings 1/2 on Power target,
>> I didn't find any remarkable failures (only some trivial test case issues).
> 
> Thanks, this looks great.  OK for trunk with the minor nits below fixed.
> 

Thanks again for your time!

Those comments have been addressed, and I just committed it as r11-2221.


BR,
Kewen


* Re: [PATCH 7/7 v2] rs6000/testsuite: Vector with length test cases
  2020-07-10 10:07   ` [PATCH 7/7 v2] " Kewen.Lin
@ 2020-07-20 16:58     ` Segher Boessenkool
  2020-07-21  2:53       ` Kewen.Lin
  0 siblings, 1 reply; 80+ messages in thread
From: Segher Boessenkool @ 2020-07-20 16:58 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Patches, Bill Schmidt, dje.gcc

Hi!

On Fri, Jul 10, 2020 at 06:07:16PM +0800, Kewen.Lin wrote:
> +/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */

Everything in gcc.target/powerpc/ requires powerpc*-*-* automatically
(is never run on other targets).

> +/* { dg-final { scan-assembler-times {\mlxv\M|\mlxvx\M} 20 } } */

You can write {\mlxvx?\M} if you think that is better.  Each option has
its own downsides and upsides here ;-)

> +++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-4.c
> @@ -0,0 +1,10 @@
> +/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */

Testing for powerpc64*-*-* is always wrong (it doesn't matter what the
*default* target is: it is usual to run the tests with RUNTESTFLAGS
{-m32,-m64}, for example).

Random example from my bash history:
  make check-gcc-c RUNTESTFLAGS="--target_board=unix'{-m64,-m32}' powerpc.exp=volatile-mem.c"
but my usual is
  make -k -j60 check RUNTESTFLAGS="--target_board=unix'{-m64,-m32}'"

Other than that this looks fine.  Please make sure to test it on an older
machine as well (you cannot really test on a BE p9, but ideally you would
do that as well ;-) )

So, okay for trunk if all patches that are required for these tests have
been committed.  Thanks!


Segher


* Re: [PATCH 7/7 v2] rs6000/testsuite: Vector with length test cases
  2020-07-20 16:58     ` Segher Boessenkool
@ 2020-07-21  2:53       ` Kewen.Lin
  0 siblings, 0 replies; 80+ messages in thread
From: Kewen.Lin @ 2020-07-21  2:53 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: GCC Patches, Bill Schmidt, dje.gcc

Hi Segher,

on 2020/7/21 12:58 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Fri, Jul 10, 2020 at 06:07:16PM +0800, Kewen.Lin wrote:
>> +/* { dg-do compile { target { powerpc*-*-* } && { lp64 && powerpc_p9vector_ok } } } */
> 
> Everything in gcc.target/powerpc/ requires powerpc*-*-* automatically
> (is never run on other targets).

Done.

> 
>> +/* { dg-final { scan-assembler-times {\mlxv\M|\mlxvx\M} 20 } } */
> 
> You can write {\mlxvx?\M} if you think that is better.  Each option has
> its own downsides and upsides here ;-)

It looks shorter, done.

> 
>> +++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-4.c
>> @@ -0,0 +1,10 @@
>> +/* { dg-do run { target { powerpc64*-*-* && { lp64 && p9vector_hw } } } } */
> 
> Testing for powerpc64*-*-* is always wrong (it doesn't matter what the
> *default* target is: it is usual to run the tests with RUNTESTFLAGS
> {-m32,-m64}, for example).

Ah, thanks for the correction!  I think lp64 is already enough to ensure
it's 64-bit on Power, so powerpc64*-*-* has been removed.

> 
> Random example from my bash history:
>   make check-gcc-c RUNTESTFLAGS="--target_board=unix'{-m64,-m32}' powerpc.exp=volatile-mem.c"
> but my usual is
>   make -k -j60 check RUNTESTFLAGS="--target_board=unix'{-m64,-m32}'"
> 
> Other than that this looks fine.  Please make sure to test it on an older
> machine as well (you cannot really test on a BE p9, but ideally you would
> do that as well ;-) )

Thanks for the reminder.  I tested it on P7 BE and got some cases reported
as unsupported, as expected.  I also checked v1 on P9 BE (AIX), and the
results looked fine there too.

> 
> So, okay for trunk if all patches that are required for these tests have
> been committed.  Thanks!

Thanks!

BR,
Kewen


* Re: [PATCH 6/7] ivopts: Add handlings for vector with length IFNs
  2020-05-26  5:58 ` [PATCH 6/7] ivopts: Add handlings for vector with length IFNs Kewen.Lin
@ 2020-07-22 12:51   ` Richard Sandiford
  0 siblings, 0 replies; 80+ messages in thread
From: Richard Sandiford @ 2020-07-22 12:51 UTC (permalink / raw)
  To: Kewen.Lin via Gcc-patches
  Cc: Kewen.Lin, bin.cheng, Segher Boessenkool, Bill Schmidt,
	Richard Guenther, dje.gcc

"Kewen.Lin via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
> gcc/ChangeLog
>
> 2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>
>
> 	* tree-ssa-loop-ivopts.c (get_mem_type_for_internal_fn): Handle
> 	IFN_LEN_LOAD and IFN_LEN_STORE.
> 	(get_alias_ptr_type_for_ptr_address): Likewise.

OK, thanks.

(Sorry, hadn't realised that this was still awaiting review.)

Richard
>
>
> ---
>  gcc/tree-ssa-loop-ivopts.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
> index 1d2697ae1ba..45b31640e75 100644
> --- a/gcc/tree-ssa-loop-ivopts.c
> +++ b/gcc/tree-ssa-loop-ivopts.c
> @@ -2436,12 +2436,14 @@ get_mem_type_for_internal_fn (gcall *call, tree *op_p)
>      {
>      case IFN_MASK_LOAD:
>      case IFN_MASK_LOAD_LANES:
> +    case IFN_LEN_LOAD:
>        if (op_p == gimple_call_arg_ptr (call, 0))
>  	return TREE_TYPE (gimple_call_lhs (call));
>        return NULL_TREE;
>  
>      case IFN_MASK_STORE:
>      case IFN_MASK_STORE_LANES:
> +    case IFN_LEN_STORE:
>        if (op_p == gimple_call_arg_ptr (call, 0))
>  	return TREE_TYPE (gimple_call_arg (call, 3));
>        return NULL_TREE;
> @@ -7415,6 +7417,8 @@ get_alias_ptr_type_for_ptr_address (iv_use *use)
>      case IFN_MASK_STORE:
>      case IFN_MASK_LOAD_LANES:
>      case IFN_MASK_STORE_LANES:
> +    case IFN_LEN_LOAD:
> +    case IFN_LEN_STORE:
>        /* The second argument contains the correct alias type.  */
>        gcc_assert (use->op_p = gimple_call_arg_ptr (call, 0));
>        return TREE_TYPE (gimple_call_arg (call, 1));
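
For context, the calls handled by the new IFN_LEN_LOAD / IFN_LEN_STORE cases
above have roughly this shape (argument order inferred from the hunk itself,
so treat it as a sketch rather than the definitive IFN signatures):

  vec_dest = .LEN_LOAD (addr, align_alias_ptr, len);
  .LEN_STORE (addr, align_alias_ptr, len, vec_value);

i.e. argument 0 identifies the memory access, argument 1 carries the alias
pointer type, and for the store the value being written is argument 3.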


end of thread

Thread overview: 80+ messages
2020-05-26  5:49 [PATCH 0/7] Support vector load/store with length Kewen.Lin
2020-05-26  5:51 ` [PATCH 1/7] ifn/optabs: " Kewen.Lin
2020-06-10  6:41   ` [PATCH 1/7 V2] " Kewen.Lin
2020-06-10  9:22     ` Richard Sandiford
2020-06-10 12:36       ` [PATCH 1/7 V3] " Kewen.Lin
2020-06-22  8:51         ` [PATCH 1/7 V4] " Kewen.Lin
2020-06-22 19:59           ` Richard Sandiford
2020-06-22 22:19             ` Segher Boessenkool
2020-06-23  3:54             ` [PATCH 1/7 v5] " Kewen.Lin
2020-06-23  9:52               ` Richard Sandiford
2020-06-23 11:25                 ` Richard Biener
2020-06-23 12:20                   ` Richard Sandiford
2020-06-24  2:40                     ` Jim Wilson
2020-06-24  7:34                       ` Richard Sandiford
2020-06-29  6:32                         ` [PATCH 1/7 v6] " Kewen.Lin
2020-06-29 10:07                           ` Richard Sandiford
2020-06-29 10:39                             ` [PATCH 1/7 v7] " Kewen.Lin
2020-06-30 15:32                               ` Richard Sandiford
2020-07-01 13:35                                 ` [PATCH 1/7 v8] " Kewen.Lin
2020-07-07  9:24                                   ` Richard Sandiford
2020-06-24 23:56                     ` [PATCH 1/7 v5] " Segher Boessenkool
2020-06-23  6:47             ` [PATCH 1/7 V4] " Richard Biener
2020-05-26  5:53 ` [PATCH 2/7] rs6000: lenload/lenstore optab support Kewen.Lin
2020-06-10  6:43   ` [PATCH 2/7 V2] " Kewen.Lin
2020-06-10 12:39     ` [PATCH 2/7 V3] " Kewen.Lin
2020-06-11 22:55       ` Segher Boessenkool
2020-06-12  3:02         ` Kewen.Lin
2020-06-23  3:58       ` [PATCH 2/7 v4] " Kewen.Lin
2020-06-29  6:32         ` [PATCH 2/7 v5] " Kewen.Lin
2020-06-29 17:57           ` Segher Boessenkool
2020-05-26  5:54 ` [PATCH 3/7] vect: Factor out codes for niters smaller than vf check Kewen.Lin
2020-05-26  5:55 ` [PATCH 4/7] hook/rs6000: Add vectorize length mode for vector with length Kewen.Lin
2020-06-10  6:44   ` [PATCH 4/7 V2] " Kewen.Lin
2020-05-26  5:57 ` [PATCH 5/7] vect: Support vector load/store with length in vectorizer Kewen.Lin
2020-05-26 12:49   ` Richard Sandiford
2020-05-26 12:52     ` Richard Sandiford
2020-05-27  8:25     ` Kewen.Lin
2020-05-27 10:02       ` Richard Sandiford
2020-05-28  1:21         ` Kewen.Lin
2020-05-29  8:32           ` Richard Sandiford
2020-05-29 12:38             ` Segher Boessenkool
2020-06-02  9:03             ` [PATCH 5/7 v3] " Kewen.Lin
2020-06-02 11:50               ` Richard Sandiford
2020-06-02 17:01                 ` Segher Boessenkool
2020-06-03  6:33                 ` Kewen.Lin
2020-06-10  9:19                   ` [PATCH 5/7 v4] " Kewen.Lin
2020-06-22  8:33                     ` [PATCH 5/7 v5] " Kewen.Lin
2020-06-29  6:33                       ` [PATCH 5/7 v6] " Kewen.Lin
2020-06-30 19:53                         ` Richard Sandiford
2020-07-01 13:23                           ` Kewen.Lin
2020-07-01 15:17                             ` Richard Sandiford
2020-07-02  5:20                               ` Kewen.Lin
2020-07-07  9:26                                 ` Kewen.Lin
2020-07-07 10:44                                   ` Richard Sandiford
2020-07-08  6:52                                     ` Kewen.Lin
2020-07-08 12:50                                       ` Richard Sandiford
2020-07-10  7:40                                         ` Kewen.Lin
2020-07-07 10:15                                 ` Richard Sandiford
2020-07-08  7:01                                   ` Kewen.Lin
2020-07-10  9:55                           ` [PATCH 5/7 v7] " Kewen.Lin
2020-07-17  9:54                             ` Richard Sandiford
2020-07-20  2:25                               ` Kewen.Lin
2020-05-26  5:58 ` [PATCH 6/7] ivopts: Add handlings for vector with length IFNs Kewen.Lin
2020-07-22 12:51   ` Richard Sandiford
2020-05-26  5:59 ` [PATCH 7/7] rs6000/testsuite: Vector with length test cases Kewen.Lin
2020-07-10 10:07   ` [PATCH 7/7 v2] " Kewen.Lin
2020-07-20 16:58     ` Segher Boessenkool
2020-07-21  2:53       ` Kewen.Lin
2020-05-26  7:12 ` [PATCH 0/7] Support vector load/store with length Richard Biener
2020-05-26  8:51   ` Kewen.Lin
2020-05-26  9:44     ` Richard Biener
2020-05-26 10:10       ` Kewen.Lin
2020-05-26 12:29         ` Richard Sandiford
2020-05-27  0:09           ` Segher Boessenkool
2020-05-27  7:25             ` Richard Biener
2020-05-27  8:50               ` Kewen.Lin
2020-05-27 14:08               ` Segher Boessenkool
2020-05-26 22:34   ` Jim Wilson
2020-05-27  7:21     ` Richard Biener
2020-05-27  7:46       ` Richard Sandiford
