public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* RFA: crc builtin functions & optimizations
@ 2022-03-15  0:32 Joern Rennecke
  2022-03-15  1:04 ` Andrew Pinski
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Joern Rennecke @ 2022-03-15  0:32 UTC (permalink / raw)
  To: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 1010 bytes --]

Most microprocessors have efficient ways to perform CRC operations, be
that with lookup tables, rotates, or even special instructions.
However, because we lack a representation for CRC in the compiler, we
can't do proper instruction selection.  With this patch I seek out to
rectify this,
I've avoided using a mode name for the built-in functions because that
would tie the semantics to the size of the addressable unit.  We
generally use abbreviations like s/l/ll for type names, which is all
right when the type can be widened without changing semantics.  For
the data input, however, we also have to consider the shift count that
is tied to it.  That is why I used a number to designate the width of
the data input and shift.

For machine support, I made a start with 8 and 16 bit little-endian
CRC for RISCV using a
lookup table.  I am sure once we have the basic infrastructure in the
tree, we'll get more
contributions of suitable named patterns for various ports.

bootstrapped on x86_64-pc-linux-gnu .

[-- Attachment #2: builtin-crc-diff.txt --]
[-- Type: text/plain, Size: 50707 bytes --]

2022-03-14  Jon Beniston  <jon@beniston.com>
	    Joern Rennecke  <joern.rennecke@embecosm.com>

	* Makefile.in (OBJS): Add tree-crc.o .
	* builtin-types.def (BT_FN_UINT16_UINT16_UINT8_CONST_SIZE): Define.
	(BT_FN_UINT16_UINT16_UINT16_CONST_SIZE): Likewise.
	(BT_FN_UINT16_UINT16_UINT32_CONST_SIZE): Likewise.
	* builtins.cc (associated_internal_fn):
	Handle BUILT_IN_CRC8S, BUILT_IN_CRC16S, BUILT_IN_CRC32S.
	* builtins.def (BUILT_IN_CRC8S, BUILT_IN_CRC16S, BUILT_IN_CRC32S):
	New builtin functions.
	* cgraph.cc (cgraph_node::verify_node):
	Allow const calls without a callgraph edge.
	* common.opt (fcrc): New option.
	* doc/invoke.texi (-fcrc): Document.
	* gimple-match-head.cc: #include predict.h .
	* internal-fn.cc (crc_direct): Define.
	(expand_crc_optab_fn): New function.
	(direct_crc_optab_supported_p): Define.
	* internal-fn.def (CRC, CRC_BE): New internal optab functions.
	* match.pd: Match a pair of crc operations.
	* optabs.def (crc_optab, crc_be_optab): New optabs.
	* passes.def (pass_crc): Add new pass.
	* tree-crc.cc: New file.
	* tree-pass.h (make_pass_crc): Declare.

testsuite:
	* gcc.c-torture/compile/crc.c: New test.
	* gcc.dg/tree-ssa/crc.c: Likewise.
	* gcc.dg/tree-ssa/crc-2.c: likewise.
	* gcc.dg/tree-ssa/pr59597.c: Add flag -fno-crc .

config/riscv:
	* crc.md: New file.
	* riscv-protos.h (expand_crc_lookup, print_crc_table): Declare.
	* riscv.cc (compute_crc): New function.
	(print_crc_table, expand_crc_lookup): Likewise.
	* riscv.md: Include crc.md.
	* riscv.opt (msmall-memory): New option.
	* tree-crc-doc.txt: New file.

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 31ff95500c9..a901925511b 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1612,6 +1612,7 @@ OBJS = \
 	tree-cfgcleanup.o \
 	tree-chrec.o \
 	tree-complex.o \
+	tree-crc.o \
 	tree-data-ref.o \
 	tree-dfa.o \
 	tree-diagnostic.o \
diff --git a/gcc/builtin-types.def b/gcc/builtin-types.def
index 3a7cecdf087..aa7751a6d5a 100644
--- a/gcc/builtin-types.def
+++ b/gcc/builtin-types.def
@@ -872,3 +872,9 @@ DEF_FUNCTION_TYPE_2 (BT_FN_VOID_VPTR_LDOUBLE, BT_VOID,
 		     BT_VOLATILE_PTR, BT_LONGDOUBLE)
 DEF_FUNCTION_TYPE_2 (BT_FN_VOID_VPTR_SIZE, BT_VOID,
 		     BT_VOLATILE_PTR, BT_SIZE)
+DEF_FUNCTION_TYPE_3 (BT_FN_UINT16_UINT16_UINT8_CONST_SIZE, BT_UINT16,
+		     BT_UINT16, BT_UINT8, BT_CONST_SIZE)
+DEF_FUNCTION_TYPE_3 (BT_FN_UINT16_UINT16_UINT16_CONST_SIZE, BT_UINT16,
+		     BT_UINT16, BT_UINT16, BT_CONST_SIZE)
+DEF_FUNCTION_TYPE_3 (BT_FN_UINT16_UINT16_UINT32_CONST_SIZE, BT_UINT16,
+		     BT_UINT16, BT_UINT32, BT_CONST_SIZE)
diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index 4c6c2939053..37c28c930ac 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -2175,6 +2175,9 @@ associated_internal_fn (built_in_function fn, tree return_type)
 	return IFN_LDEXP;
       return IFN_LAST;
 
+    case BUILT_IN_CRC8S: case BUILT_IN_CRC16S: case BUILT_IN_CRC32S:
+      return IFN_CRC;
+
     default:
       return IFN_LAST;
     }
diff --git a/gcc/builtins.def b/gcc/builtins.def
index 005976f34e9..24aaca34406 100644
--- a/gcc/builtins.def
+++ b/gcc/builtins.def
@@ -850,6 +850,9 @@ DEF_GCC_BUILTIN        (BUILT_IN_CLRSB, "clrsb", BT_FN_INT_INT, ATTR_CONST_NOTHR
 DEF_GCC_BUILTIN        (BUILT_IN_CLRSBIMAX, "clrsbimax", BT_FN_INT_INTMAX, ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_GCC_BUILTIN        (BUILT_IN_CLRSBL, "clrsbl", BT_FN_INT_LONG, ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_GCC_BUILTIN        (BUILT_IN_CLRSBLL, "clrsbll", BT_FN_INT_LONGLONG, ATTR_CONST_NOTHROW_LEAF_LIST)
+DEF_GCC_BUILTIN        (BUILT_IN_CRC8S, "crc8s", BT_FN_UINT16_UINT16_UINT8_CONST_SIZE, ATTR_CONST_NOTHROW_LEAF_LIST)
+DEF_GCC_BUILTIN        (BUILT_IN_CRC16S, "crc16s", BT_FN_UINT16_UINT16_UINT16_CONST_SIZE, ATTR_CONST_NOTHROW_LEAF_LIST)
+DEF_GCC_BUILTIN        (BUILT_IN_CRC32S, "crc32s", BT_FN_UINT16_UINT16_UINT32_CONST_SIZE, ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_EXT_LIB_BUILTIN    (BUILT_IN_DCGETTEXT, "dcgettext", BT_FN_STRING_CONST_STRING_CONST_STRING_INT, ATTR_FORMAT_ARG_2)
 DEF_EXT_LIB_BUILTIN    (BUILT_IN_DGETTEXT, "dgettext", BT_FN_STRING_CONST_STRING_CONST_STRING, ATTR_FORMAT_ARG_2)
 DEF_GCC_BUILTIN        (BUILT_IN_DWARF_CFA, "dwarf_cfa", BT_FN_PTR, ATTR_NULL)
diff --git a/gcc/cgraph.cc b/gcc/cgraph.cc
index b923a59ab0c..9570f5121af 100644
--- a/gcc/cgraph.cc
+++ b/gcc/cgraph.cc
@@ -3793,7 +3793,8 @@ cgraph_node::verify_node (void)
 			    }
 			  e->aux = (void *)1;
 			}
-		      else if (decl)
+		      else if (decl
+			       && !TREE_READONLY (decl) && !DECL_PURE_P (decl))
 			{
 			  error ("missing callgraph edge for call stmt:");
 			  cgraph_debug_gimple_stmt (this_cfun, stmt);
diff --git a/gcc/common.opt b/gcc/common.opt
index 8b6513de47c..ce2dfe9c76f 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -3607,4 +3607,8 @@ fipa-ra
 Common Var(flag_ipa_ra) Optimization
 Use caller save register across calls if possible.
 
+fcrc
+Common Var(flag_crc_opt) Optimization Init(1)
+Recognize some crc calculations to make them into built-in functions.
+
 ; This comment is to ensure we retain the blank line above.
diff --git a/gcc/config/riscv/crc.md b/gcc/config/riscv/crc.md
new file mode 100644
index 00000000000..5c926c185ee
--- /dev/null
+++ b/gcc/config/riscv/crc.md
@@ -0,0 +1,157 @@
+;;Lower code size implementation, and for variable polynoms.
+;;Gets unrolled if not optimizing for size, but still compact.
+;;
+;;static
+;;u16 crcu16_1(u16 data, u16 crc, u16 xor_val)
+;;{
+;;  unsigned long tmp = data ^ crc;
+;;  if (!tmp)
+;;    return tmp;
+;;  for (int i = 0; i < 16; i++)
+;;    {
+;;#if 1 /* Using ROR.  */
+;;      tmp = (tmp >> 1) | (tmp << (sizeof (tmp) * (8 /*CHAR_BIT*/) - 1));
+;;      if ((long)tmp < 0)
+;;#else /* For base instruction set.  */
+;;      long tmp2 = tmp & 1;
+;;      tmp >>= 1;
+;;      if (tmp2)
+;;#endif
+;;        tmp ^= xor_val;
+;;    }
+;;  return tmp;
+;;}
+;;
+;;u16 crcu16 (u16 data, u16 crc)
+;;{
+;;  u16 xor_val;
+;;#if 0
+;;  xor_val = 0xa001;
+;;#else /* Avoid unwanted constant propagation.  */
+;;  asm ("" : "=r" (xor_val) : "0" (0xa001));
+;;#endif
+;;  return crcu16_1 (data, crc, xor_val);
+;;}
+
+;; First mode is for the data that is being processed, the second is for
+;; input crc, polynom and output crc.
+
+(define_expand "crcqihi4"
+  [(set (match_operand:HI 0)
+	(umod:HI ;; Actually not umod, but crc.
+	  (xor:HI (match_operand:HI 1) (zero_extend:HI (match_operand:QI 2)))
+		  (match_operand:HI 3)))] ;; op3: polynom
+  ""
+{
+  /* Use speed optimized CRC computation for constant polynom.  */
+  expand_crc_lookup (operands, QImode);
+  DONE;
+})
+
+(define_expand "index_hi"
+  [(set (match_dup 0) (match_operand 2))]
+  ""
+{
+  /* Preferrably use sh1add.  */
+  rtx doubled = gen_rtx_ASHIFT (Pmode, operands[2], const1_rtx);
+  if (!TARGET_ZBA)
+    doubled = force_reg (Pmode, doubled);
+  operands[2] = gen_rtx_PLUS (Pmode, doubled, operands[1]);
+})
+
+(define_expand "crchihi4"
+  [(set (match_operand:HI 0 "register_operand")
+	(umod:HI ;; Actually not umod, but crc.
+	  (xor:HI (match_operand:HI 1) (match_operand:HI 2))
+		  (match_operand:HI 3)))] ;; op3: polynom
+  ""
+{
+  /* Use speed optimized CRC computation for constant polynom.  */
+  if (!TARGET_SMALL_MEMORY
+      || (CONST_INT_P (operands[1]) && CONST_INT_P (operands[2])))
+    {
+      expand_crc_lookup (operands, HImode);
+      DONE;
+    }
+  rtx tab
+    = force_reg (Pmode, print_crc_table (16, 8, INTVAL (operands[3]) & 65535));
+  rtx lo = GEN_INT (TARGET_BIG_ENDIAN != 0);
+  rtx hi = GEN_INT (TARGET_BIG_ENDIAN == 0);
+  rtx op1 = simplify_gen_subreg (Pmode, operands[1], HImode, 0);
+  rtx op2 = simplify_gen_subreg (Pmode, operands[2], HImode, 0);
+  rtx crcin = force_reg (Pmode, gen_rtx_XOR (Pmode, op1, op2));
+  rtx ix1 = force_reg (Pmode, gen_rtx_AND (Pmode, crcin, GEN_INT (255)));
+  emit_insn (gen_index_hi (ix1, tab, ix1));
+  rtx ix2 = gen_rtx_MEM (QImode, gen_rtx_PLUS (Pmode, ix1, lo));
+  ix2 = force_reg (Pmode, gen_rtx_ZERO_EXTEND (Pmode, ix2));
+  crcin = force_reg (Pmode, gen_rtx_LSHIFTRT (Pmode, crcin, GEN_INT (8)));
+  emit_move_insn (crcin, gen_rtx_AND (Pmode, crcin, GEN_INT (255)));
+  rtx mem2 = gen_rtx_MEM (QImode, gen_rtx_PLUS (Pmode, ix1, hi));
+  emit_move_insn (ix1, gen_rtx_ZERO_EXTEND (Pmode, mem2));
+  mem2 = ix1;
+  emit_move_insn (ix2, gen_rtx_XOR (Pmode, ix2, crcin));
+  emit_insn (gen_index_hi (ix2, tab, ix2));
+  rtx mem3 = ix2;
+  emit_move_insn (mem3, gen_rtx_ZERO_EXTEND (Pmode, gen_rtx_MEM (HImode, ix2)));
+  rtx tgt = simplify_gen_subreg (Pmode, operands[0], HImode, 0);
+  emit_move_insn (tgt, gen_rtx_XOR (Pmode, mem2, mem3));
+  DONE;
+})
+
+(define_expand "crcsihi4"
+  [(set (match_operand:HI 0)
+	(umod:HI ;; Actually not umod, but crc.
+	  (xor:SI (match_operand:HI 1) (match_operand:SI 2))
+		  (match_operand:HI 3)))] ;; op3: polynom
+  ""
+{
+  /* Use speed optimized CRC computation for constant polynom.  */
+  if (CONST_INT_P (operands[1]) && CONST_INT_P (operands[2]))
+    {
+      expand_crc_lookup (operands, HImode);
+      DONE;
+    }
+  rtx tab
+    = force_reg (Pmode, print_crc_table (16, 8, INTVAL (operands[3]) & 65535));
+  rtx addr = gen_reg_rtx (Pmode);
+  rtx lo_ad = plus_constant (Pmode, addr, TARGET_BIG_ENDIAN != 0);
+  rtx hi_ad = plus_constant (Pmode, addr, TARGET_BIG_ENDIAN == 0);
+  operands[1] = simplify_gen_subreg (word_mode, operands[1], HImode, 0);
+  operands[2] = simplify_gen_subreg (word_mode, operands[2], SImode, 0);
+  rtx crc_in = force_reg (Pmode, gen_rtx_XOR (Pmode, operands[1], operands[2]));
+  rtx ix = force_reg (Pmode, gen_rtx_AND (Pmode, crc_in, GEN_INT (255)));
+
+  rtx lo, hi;
+  rtx old_hi = NULL_RTX;
+  for (int i = 0; i < 3; i++)
+    {
+      emit_insn (gen_index_hi (addr, tab, ix));
+      lo = force_reg (Pmode, (gen_rtx_ZERO_EXTEND
+			       (Pmode, gen_rtx_MEM (QImode, lo_ad))));
+      hi = force_reg (Pmode, (gen_rtx_ZERO_EXTEND
+			       (Pmode, gen_rtx_MEM (QImode, hi_ad))));
+      switch (i)
+	{
+	case 0: case 2:
+	  crc_in = gen_rtx_LSHIFTRT (Pmode, crc_in, GEN_INT (8));
+	  break;
+	case 1:
+	  crc_in = gen_rtx_LSHIFTRT (Pmode, operands[2], GEN_INT (16));
+	  break;
+	default:
+	  gcc_unreachable ();
+	}
+      crc_in = force_reg (Pmode, crc_in);
+      ix = force_reg (Pmode, gen_rtx_AND (Pmode, crc_in, GEN_INT (255)));
+      if (old_hi)
+	ix = force_reg (Pmode, gen_rtx_XOR (Pmode, ix, old_hi));
+      ix = force_reg (Pmode, gen_rtx_XOR (Pmode, ix, lo));
+      old_hi = hi;
+    }
+  emit_insn (gen_index_hi (addr, tab, ix));
+  lo = force_reg (Pmode, (gen_rtx_ZERO_EXTEND
+			   (Pmode, gen_rtx_MEM (HImode, addr))));
+  operands[0] = simplify_gen_subreg (word_mode, operands[0], HImode, 0);
+  emit_move_insn (operands[0], gen_rtx_XOR (Pmode, lo, old_hi));
+  DONE;
+})
diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 20c2381c21a..b80838685e2 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -109,4 +109,7 @@ struct riscv_cpu_info {
 
 extern const riscv_cpu_info *riscv_find_cpu (const char *);
 
+extern void expand_crc_lookup (rtx *operands, machine_mode);
+extern rtx print_crc_table (int, int, unsigned HOST_WIDE_INT);
+
 #endif /* ! GCC_RISCV_PROTOS_H */
diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 7da9d377120..219e6ce02b9 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -5587,6 +5587,128 @@ riscv_asan_shadow_offset (void)
   return TARGET_64BIT ? (HOST_WIDE_INT_1 << 29) : 0;
 }
 
+static unsigned HOST_WIDE_INT
+compute_crc (unsigned HOST_WIDE_INT crc, int bits,
+	     unsigned HOST_WIDE_INT polynom)
+{
+  for (int j = 0; j < bits; j++)
+    {
+      int tmp = crc & 1;
+      crc >>= 1;
+      if (tmp)
+	crc ^= polynom;
+    }
+  return crc;
+}
+
+/* Make sure no crc table indentifiers get culled by garbage collection.  */
+static GTY(()) tree print_crc_table_id_list;
+
+rtx
+print_crc_table (int crc_bits, int data_bits, unsigned HOST_WIDE_INT polynom)
+{
+  gcc_assert (data_bits < 100);
+  gcc_assert (crc_bits < 100);
+  FILE *out = asm_out_file;
+  char buf[15+32+1];
+  static const char *mode_name[] = { "nil", "qi", "hi", "si", "di" };
+  sprintf (buf, "__gcc_crc%s%s4_" HOST_WIDE_INT_PRINT_HEX,
+	   mode_name[data_bits / BITS_PER_UNIT],
+	   mode_name[crc_bits / BITS_PER_UNIT], polynom);
+  tree ident = maybe_get_identifier (buf);
+  if (ident)
+    return gen_rtx_SYMBOL_REF (Pmode, IDENTIFIER_POINTER (ident));
+  ident = get_identifier (buf);
+  print_crc_table_id_list = build_tree_list (ident, print_crc_table_id_list);
+  rtx lab = gen_rtx_SYMBOL_REF (Pmode, IDENTIFIER_POINTER (ident));
+  unsigned n_entries = 1 << data_bits;
+  const char *entry_directive;
+  if (crc_bits == 16)
+    entry_directive = ".half";
+  else
+    gcc_unreachable ();
+#if 0 /* The linkonce directive is not supported for elf; .gnu.linkonce makes the linker script try to put this into too small an area.  */
+  asm_fprintf (out, "\t.pushsection\t%s.%s, \"a\"\n", ".gnu.linkonce.r", buf);
+  asm_fprintf (out, "\t.linkonce same_contents\n");
+#else
+  asm_fprintf (out, "\t.pushsection\t%s.%s, \"a\"\n", ".rodata", buf);
+  asm_fprintf (out, "\t.weak %s\n", buf);
+#endif
+  asm_fprintf (out, "\t.type\t%s, @object\n", buf);
+  asm_fprintf (out, "\t.size\t%s, %d\n", buf,
+	       n_entries * (crc_bits / BITS_PER_UNIT));
+  asm_fprintf (out, "%s:\n", buf);
+  asm_fprintf (out, " %s ", entry_directive);
+  gcc_assert (crc_bits == 16);
+  for (unsigned i = 0; i < n_entries; i++)
+    {
+      unsigned HOST_WIDE_INT crc = compute_crc (i, data_bits, polynom);
+      if ((unsigned long) crc == crc)
+	fprintf (out, "0x%0*lx", crc_bits / 4, (unsigned long)crc);
+      else
+	fprintf (out, HOST_WIDE_INT_PRINT_HEX, crc);
+      if (i % 8 != 7)
+	asm_fprintf (out, ", ");
+      else if (i < n_entries - 1)
+	asm_fprintf (out, "\n %s ", entry_directive);
+    }
+  asm_fprintf (out, "\n\t.popsection\n");
+  return lab;
+}
+
+void
+expand_crc_lookup (rtx *operands, machine_mode data_mode)
+{
+  machine_mode crc_mode = GET_MODE (operands[0]);
+  gcc_assert (GET_MODE (operands[1]) == crc_mode || CONST_INT_P (operands[1]));
+  if (CONST_INT_P (operands[2]))
+    gcc_assert (INTVAL (operands[2])
+		== trunc_int_for_mode (INTVAL (operands[2]), data_mode));
+  else
+    gcc_assert (data_mode == GET_MODE (operands[2]));
+  gcc_assert (GET_MODE_SIZE (data_mode) <= GET_MODE_SIZE (crc_mode));
+  gcc_assert (CONST_INT_P (operands[3]));
+  gcc_assert (INTVAL (operands[3])
+	      == trunc_int_for_mode (INTVAL (operands[3]), crc_mode));
+  unsigned HOST_WIDE_INT polynom
+    = GET_MODE_MASK (crc_mode) & INTVAL (operands[3]);
+  if (CONST_INT_P (operands[1]) && CONST_INT_P (operands[2]))
+    {
+      unsigned HOST_WIDE_INT crc_i
+	= ((INTVAL (operands[1]) & GET_MODE_MASK (crc_mode))
+	   ^ (INTVAL (operands[2]) & GET_MODE_MASK (data_mode)));
+      crc_i = compute_crc (crc_i, GET_MODE_BITSIZE (data_mode), polynom);
+      emit_insn (gen_rtx_SET (operands[0], GEN_INT (crc_i)));
+      return;
+    }
+  machine_mode mode = crc_mode;
+  if (GET_MODE_SIZE (mode) < GET_MODE_SIZE (word_mode))
+    mode = word_mode;
+  gcc_assert (GET_MODE_SIZE (mode) == GET_MODE_SIZE (Pmode));
+
+  if (!CONST_INT_P (operands[1]) && crc_mode != mode)
+    operands[1] = gen_rtx_SUBREG (mode, operands[1], 0);
+  if (!CONST_INT_P (operands[2]) && data_mode != mode)
+    operands[2] = gen_rtx_SUBREG (mode, operands[2], 0);
+  rtx in = force_reg (mode, gen_rtx_XOR (mode, operands[1], operands[2]));
+  rtx ix = gen_rtx_AND (mode, in, GEN_INT (GET_MODE_MASK (data_mode)));
+  if (mode != Pmode)
+    ix = gen_rtx_SUBREG (Pmode, ix, 0);
+  ix = gen_rtx_ASHIFT (Pmode, ix,
+		       GEN_INT (exact_log2 (GET_MODE_SIZE (crc_mode))));
+  ix = force_reg (Pmode, ix);
+  rtx tab = print_crc_table (GET_MODE_BITSIZE (crc_mode),
+			     GET_MODE_BITSIZE (data_mode), polynom);
+  tab = gen_rtx_MEM (crc_mode, gen_rtx_PLUS (Pmode, ix, tab));
+  if (crc_mode != mode)
+    tab = gen_rtx_ZERO_EXTEND (mode, tab);
+  /* We assume operands[1] is passed in suitably zero-extended for its type.  */
+  rtx high = gen_rtx_LSHIFTRT (mode, operands[1],
+			       GEN_INT (GET_MODE_BITSIZE (data_mode)));
+  rtx crc = force_reg (mode, gen_rtx_XOR (mode, tab, high));
+  riscv_emit_move (operands[0], gen_rtx_SUBREG (crc_mode, crc, 0));
+}
+
 /* Initialize the GCC target structure.  */
 #undef TARGET_ASM_ALIGNED_HI_OP
 #define TARGET_ASM_ALIGNED_HI_OP "\t.half\t"
diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
index b3c5bce842a..0b1fdf6e36d 100644
--- a/gcc/config/riscv/riscv.md
+++ b/gcc/config/riscv/riscv.md
@@ -2869,3 +2869,4 @@
 (include "pic.md")
 (include "generic.md")
 (include "sifive-7.md")
+(include "crc.md")
diff --git a/gcc/config/riscv/riscv.opt b/gcc/config/riscv/riscv.opt
index 9fffc08220d..5b72f80d91a 100644
--- a/gcc/config/riscv/riscv.opt
+++ b/gcc/config/riscv/riscv.opt
@@ -225,3 +225,7 @@ Enum(isa_spec_class) String(20191213) Value(ISA_SPEC_CLASS_20191213)
 misa-spec=
 Target RejectNegative Joined Enum(isa_spec_class) Var(riscv_isa_spec) Init(TARGET_DEFAULT_ISA_SPEC)
 Set the version of RISC-V ISA spec.
+
+msmall-memory
+Target Bool Var(TARGET_SMALL_MEMORY) Init(1)
+Don't use large lookup tables.
diff --git a/gcc/config/riscv/tree-crc-doc.txt b/gcc/config/riscv/tree-crc-doc.txt
new file mode 100644
index 00000000000..59cd410f99e
--- /dev/null
+++ b/gcc/config/riscv/tree-crc-doc.txt
@@ -0,0 +1,124 @@
+We want to recognize the crc computation from a trademarked benchmark -
+which may and may not be named, the license is contradictory on this,
+although it is mostly including the Apache License 2.0.
+The code is - modified for context-free types:
+
+#include <stdint.h>
+
+uint16_t crcu8(uint8_t data, uint16_t crc)
+{
+  uint8_t i=0,x16=0,carry=0;
+
+  for (i = 0; i < 8; i++)
+    {
+      x16 = (uint8_t)((data & 1) ^ ((uint8_t)crc & 1));
+      data >>= 1;
+
+      if (x16 == 1)
+        {
+          crc ^= 0x4002;
+          carry = 1;
+        }
+      else
+        carry = 0;
+      crc >>= 1;
+      if (carry)
+        crc |= 0x8000;
+      else
+        crc &= 0x7fff;
+    }
+  return crc;
+}
+
+And this looks like this in the *.c.029t.local-fnsummary1 dump file:
+
+;; Function crcu8 (crcu8, funcdef_no=28, decl_uid=1892, cgraph_uid=29, symbol_order=37)
+
+
+Analyzing function body size: crcu8/37
+
+IPA function summary for crcu8/37 inlinable
+  global time:     16.000000
+  self size:       17
+  global size:     0
+  min size:       0
+  self stack:      0
+  global stack:    0
+    size:14.000000, time:14.000000
+    size:3.000000, time:2.000000,  executed if:(not inlined)
+  calls:
+
+ee_u16 crcu8 (ee_u8 data, ee_u16 crc)
+{
+  ee_u8 carry;
+  ee_u8 x16;
+  ee_u8 i;
+  signed char _1;
+  signed char data.63_2;
+  signed char _3;
+  unsigned char _4;
+  unsigned char i.64_5;
+  ee_u16 _18;
+
+  <bb 2> :
+  i_12 = 0;
+  x16_13 = 0;
+  carry_14 = 0;
+  i_15 = 0;
+  goto <bb 10>; [INV]
+
+  <bb 3> :
+  _1 = (signed char) crc_9;
+  data.63_2 = (signed char) data_6;
+  _3 = _1 ^ data.63_2;
+  _4 = (unsigned char) _3;
+  x16_20 = _4 & 1;
+  data_21 = data_6 >> 1;
+  if (x16_20 == 1)
+    goto <bb 4>; [INV]
+  else
+    goto <bb 5>; [INV]
+
+  <bb 4> :
+  crc_23 = crc_9 ^ 16386;
+  carry_24 = 1;
+  goto <bb 6>; [INV]
+
+  <bb 5> :
+  carry_22 = 0;
+
+  <bb 6> :
+  # crc_7 = PHI <crc_23(4), crc_9(5)>
+  # carry_11 = PHI <carry_24(4), carry_22(5)>
+  crc_25 = crc_7 >> 1;
+  if (carry_11 != 0)
+    goto <bb 7>; [INV]
+  else
+    goto <bb 8>; [INV]
+
+  <bb 7> :
+  crc_27 = crc_25 | 32768;
+  goto <bb 9>; [INV]
+
+  <bb 8> :
+  crc_26 = crc_25 & 32767;
+
+  <bb 9> :
+  # crc_8 = PHI <crc_27(7), crc_26(8)>
+  i.64_5 = i_10;
+  i_28 = i.64_5 + 1;
+
+  <bb 10> :
+  # data_6 = PHI <data_16(D)(2), data_21(9)>
+  # crc_9 = PHI <crc_17(D)(2), crc_8(9)>
+  # i_10 = PHI <i_15(2), i_28(9)>
+  if (i_10 <= 7)
+    goto <bb 3>; [INV]
+  else
+    goto <bb 11>; [INV]
+
+  <bb 11> :
+  _18 = crc_9;
+  return _18;
+
+}
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 2a14e1a9472..eb3571c7926 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -574,7 +574,7 @@ Objective-C and Objective-C++ Dialects}.
 -fstdarg-opt  -fstore-merging  -fstrict-aliasing -fipa-strict-aliasing @gol
 -fthread-jumps  -ftracer  -ftree-bit-ccp @gol
 -ftree-builtin-call-dce  -ftree-ccp  -ftree-ch @gol
--ftree-coalesce-vars  -ftree-copy-prop  -ftree-dce  -ftree-dominator-opts @gol
+-ftree-coalesce-vars  -ftree-copy-prop  -fcrc -ftree-dce  -ftree-dominator-opts @gol
 -ftree-dse  -ftree-forwprop  -ftree-fre  -fcode-hoisting @gol
 -ftree-loop-if-convert  -ftree-loop-im @gol
 -ftree-phiprop  -ftree-loop-distribution  -ftree-loop-distribute-patterns @gol
@@ -12311,6 +12311,11 @@ Perform basic block vectorization on trees. This flag is enabled by default at
 @option{-O2} and by @option{-ftree-vectorize}, @option{-fprofile-use},
 and @option{-fauto-profile}.
 
+@item -fcrc
+@opindex fcrc
+Recognize some crc calculations at the tree stage to make them into
+built-in functions.
+
 @item -ftrivial-auto-var-init=@var{choice}
 @opindex ftrivial-auto-var-init
 Initialize automatic variables with either a pattern or with zeroes to increase
diff --git a/gcc/gimple-match-head.cc b/gcc/gimple-match-head.cc
index 74d5818898f..42b732b73d8 100644
--- a/gcc/gimple-match-head.cc
+++ b/gcc/gimple-match-head.cc
@@ -45,6 +45,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "dbgcnt.h"
 #include "tm.h"
 #include "gimple-range.h"
+#include "predict.h"
 
 /* Forward declarations of the private auto-generated matchers.
    They expect valueized operands in canonical order and do not
diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index 8b1733e20c4..dd37b34d3f5 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -130,6 +130,7 @@ init_internal_fns ()
 #define fold_left_direct { 1, 1, false }
 #define mask_fold_left_direct { 1, 1, false }
 #define check_ptrs_direct { 0, 0, false }
+#define crc_direct { 1, -1, true }
 
 const direct_internal_fn_info direct_internal_fn_array[IFN_LAST + 1] = {
 #define DEF_INTERNAL_FN(CODE, FLAGS, FNSPEC) not_direct,
@@ -3657,6 +3658,44 @@ expand_while_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
     emit_move_insn (lhs_rtx, ops[0].value);
 }
 
+static void
+expand_crc_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
+{
+  tree lhs = gimple_call_lhs (stmt);
+  tree rhs1 = gimple_call_arg (stmt, 0);
+  tree rhs2 = gimple_call_arg (stmt, 1);
+  tree rhs3 = gimple_call_arg (stmt, 2);
+
+  /* rhs1 and rhs2 are commutative if the modes are the same, but if they
+     aren't, we canonicalize for rhs1 to match the lhs and polynom.
+     The optab lookup and the insn pattern mind.  */
+  if (TYPE_PRECISION (TREE_TYPE (rhs1)) != TYPE_PRECISION (TREE_TYPE (lhs))
+      && TYPE_PRECISION (TREE_TYPE (rhs2)) == TYPE_PRECISION (TREE_TYPE (lhs)))
+    std::swap (rhs1, rhs2);
+
+  tree crc_type = TREE_TYPE (lhs);
+  tree dat_type = TREE_TYPE (rhs2);
+
+  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  rtx op1 = expand_normal (rhs1);
+  rtx op2 = expand_normal (rhs2);
+  rtx op3 = expand_normal (rhs3);
+
+  if (CONST_INT_P (op3))
+    op3 = GEN_INT (trunc_int_for_mode (INTVAL (op3), TYPE_MODE (crc_type)));
+
+  class expand_operand ops[4];
+  create_output_operand (&ops[0], target, TYPE_MODE (crc_type));
+  create_input_operand (&ops[1], op1, TYPE_MODE (crc_type));
+  create_input_operand (&ops[2], op2, TYPE_MODE (dat_type));
+  create_input_operand (&ops[3], op3, TYPE_MODE (crc_type));
+  insn_code icode = convert_optab_handler (optab, TYPE_MODE (dat_type),
+					   TYPE_MODE (crc_type));
+  expand_insn (icode, 4, ops);
+  if (!rtx_equal_p (target, ops[0].value))
+    emit_move_insn (target, ops[0].value);
+}
+
 /* Expanders for optabs that can use expand_direct_optab_fn.  */
 
 #define expand_unary_optab_fn(FN, STMT, OPTAB) \
@@ -3784,6 +3823,7 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
 #define direct_mask_fold_left_optab_supported_p direct_optab_supported_p
 #define direct_check_ptrs_optab_supported_p direct_optab_supported_p
 #define direct_vec_set_optab_supported_p direct_optab_supported_p
+#define direct_crc_optab_supported_p convert_optab_supported_p
 
 /* Return the optab used by internal function FN.  */
 
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index d2d550d3586..792dd5d25e8 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -201,6 +201,10 @@ DEF_INTERNAL_OPTAB_FN (COND_SHL, ECF_CONST | ECF_NOTHROW,
 DEF_INTERNAL_SIGNED_OPTAB_FN (COND_SHR, ECF_CONST | ECF_NOTHROW, first,
 			      cond_ashr, cond_lshr, cond_binary)
 
+/* [Little Endian] CRC and Big Endian CRC.  */
+DEF_INTERNAL_OPTAB_FN (CRC, ECF_CONST | ECF_NOTHROW, crc, crc)
+DEF_INTERNAL_OPTAB_FN (CRC_BE, ECF_CONST | ECF_NOTHROW, crc_be, crc)
+
 DEF_INTERNAL_OPTAB_FN (COND_FMA, ECF_CONST, cond_fma, cond_ternary)
 DEF_INTERNAL_OPTAB_FN (COND_FMS, ECF_CONST, cond_fms, cond_ternary)
 DEF_INTERNAL_OPTAB_FN (COND_FNMA, ECF_CONST, cond_fnma, cond_ternary)
diff --git a/gcc/match.pd b/gcc/match.pd
index 8b44b5cc92c..690aa64e3b3 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -1210,6 +1210,18 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (bit_and @0 integer_all_onesp)
   (non_lvalue @0))
 
+#if GIMPLE
+(for crc (IFN_CRC BUILT_IN_CRC8S)
+ (simplify
+  (crc (crc:s @0 (convert@5 @1) @2) (convert (rshift @1 INTEGER_CST@4)) @2)
+   (if (wi::to_wide (@4) == TYPE_PRECISION (TREE_TYPE (@5))
+	&& (direct_internal_fn_supported_p
+	      (IFN_CRC, tree_pair (TREE_TYPE (@1), TREE_TYPE (@0)),
+	       function_optimization_type (cfun))
+	    != CODE_FOR_nothing))
+    (IFN_CRC @0 @1 @2))))
+#endif
+
 /* x & x -> x,  x | x -> x  */
 (for bitop (bit_and bit_ior)
  (simplify
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 801310ebaa7..096581f9c41 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -78,6 +78,8 @@ OPTAB_CD(smsub_widen_optab, "msub$b$a4")
 OPTAB_CD(umsub_widen_optab, "umsub$b$a4")
 OPTAB_CD(ssmsub_widen_optab, "ssmsub$b$a4")
 OPTAB_CD(usmsub_widen_optab, "usmsub$a$b4")
+OPTAB_CD(crc_optab, "crc$a$b4")
+OPTAB_CD(crc_be_optab, "crc_be$a$b4")
 OPTAB_CD(vec_load_lanes_optab, "vec_load_lanes$a$b")
 OPTAB_CD(vec_store_lanes_optab, "vec_store_lanes$a$b")
 OPTAB_CD(vec_mask_load_lanes_optab, "vec_mask_load_lanes$a$b")
diff --git a/gcc/passes.def b/gcc/passes.def
index f7718181038..c31206b6fa6 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -71,6 +71,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_fixup_cfg);
       NEXT_PASS (pass_rebuild_cgraph_edges);
       NEXT_PASS (pass_local_fn_summary);
+      NEXT_PASS (pass_crc);
       NEXT_PASS (pass_early_inline);
       NEXT_PASS (pass_warn_recursion);
       NEXT_PASS (pass_all_early_optimizations);
diff --git a/gcc/testsuite/gcc.c-torture/compile/crc.c b/gcc/testsuite/gcc.c-torture/compile/crc.c
new file mode 100644
index 00000000000..1b0321d1860
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/compile/crc.c
@@ -0,0 +1,5 @@
+int
+f (int crc, int i)
+{
+  return __builtin_crc8s (crc, i, 0xa001);
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/crc-2.c b/gcc/testsuite/gcc.dg/tree-ssa/crc-2.c
new file mode 100644
index 00000000000..6b045c7e856
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/crc-2.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fcrc -fdump-tree-forwprop1" } */
+
+unsigned short
+f (unsigned short crc, unsigned char i)
+{
+  return __builtin_crc8s (crc, i, 0xa001);
+}
+
+//#define f(a,b) Calc_crc8(b,a)
+unsigned short
+g (unsigned short crc, unsigned short i)
+{
+  return f (f (crc, (unsigned char)i), (unsigned char)(i >> 8));
+}
+
+/* Add targets as they implement a CRC operation.  */
+/* { dg-final { scan-tree-dump-times "\\.CRC" 1 "forwprop1"  { target riscv*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/crc.c b/gcc/testsuite/gcc.dg/tree-ssa/crc.c
new file mode 100644
index 00000000000..33877e4c384
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/crc.c
@@ -0,0 +1,35 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fdump-tree-crc" } */
+
+typedef unsigned short u16;
+typedef unsigned char u8;
+
+u16
+Calc_crc8 (u8 data, u16 crc)
+{
+  u8 i=0,x16=0,carry=0;
+  for (i = 0; i < 8; i++)
+    {
+      x16 = (u8)((data & 1) ^ ((u8)crc & 1));
+      data >>= 1;
+ 
+      if (x16 == 1)
+	{
+	  crc ^= 0x4002;
+	  carry = 1;
+	}
+      else
+	carry = 0;
+      crc >>= 1;
+      if (carry)
+	crc |= 0x8000;
+      else
+	crc &= 0x7fff;
+    }
+  return crc;
+}
+
+/* { dg-final { scan-tree-dump-times "Matched loop" 1 "crc" } } */
+
+/* Add targets as they implement a CRC operation.  */
+/* { dg-final { scan-tree-dump-times "\\.CRC" 1 "crc"  { target riscv*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr59597.c b/gcc/testsuite/gcc.dg/tree-ssa/pr59597.c
index 0f66aae87bb..be70b632c07 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr59597.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr59597.c
@@ -1,6 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-Ofast -fdisable-tree-cunrolli -fdump-tree-threadfull1-details" } */
-
+/* { dg-options "-Ofast -fdisable-tree-cunrolli -fdump-tree-threadfull1-details -fno-crc" } */
 typedef unsigned short u16;
 typedef unsigned char u8;
 typedef unsigned int u32;
diff --git a/gcc/tree-crc.cc b/gcc/tree-crc.cc
new file mode 100644
index 00000000000..73e8bcd72ba
--- /dev/null
+++ b/gcc/tree-crc.cc
@@ -0,0 +1,842 @@
+/* CRC detection.
+   Copyright (C) 2015-2022 Free Software Foundation, Inc.
+   Contributed by Jon Beniston <jon@beniston.com>
+   and Joern Rennecke <joern.rennecke@embecosm.com> .
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT
+ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+/* This pass detects CRC calculations.  */
+
+#define IN_TARGET_CODE 1
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "hash-set.h"
+#include "machmode.h"
+#include "vec.h"
+#include "double-int.h"
+#include "input.h"
+#include "alias.h"
+#include "symtab.h"
+#include "options.h"
+#include "wide-int.h"
+#include "inchash.h"
+#include "tree.h"
+#include "fold-const.h"
+#include "predict.h"
+#include "tm.h"
+#include "tm_p.h"
+#include "hard-reg-set.h"
+#include "input.h"
+#include "function.h"
+#include "dominance.h"
+#include "cfg.h"
+#include "bitmap.h"
+#include "cfganal.h"
+#include "basic-block.h"
+#include "tree-ssa-alias.h"
+#include "internal-fn.h"
+#include "gimple-expr.h"
+#include "is-a.h"
+#include "gimple.h"
+#include "gimple-iterator.h"
+#include "gimplify-me.h"
+#include "stor-layout.h"
+#include "gimple-ssa.h"
+#include "tree-cfg.h"
+#include "tree-phinodes.h"
+#include "ssa-iterators.h"
+#include "stringpool.h"
+#include "tree-vrp.h"
+#include "tree-ssanames.h"
+#include "tree-ssa-loop-manip.h"
+#include "tree-ssa-loop.h"
+#include "tree-into-ssa.h"
+#include "tree-ssa.h"
+#include "cfgloop.h"
+#include "tree-chrec.h"
+#include "tree-data-ref.h"
+#include "tree-scalar-evolution.h"
+#include "tree-pass.h"
+#include "gimple-pretty-print.h"
+#include "gimple-iterator.h"
+#include "tree-vectorizer.h"
+#include "print-tree.h"
+#include "targhooks.h"
+#include "convert.h"
+
+namespace {
+
+const pass_data pass_data_crc =
+{
+  GIMPLE_PASS, /* type */
+  "crc", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_MACH_DEP, /* tv_id */
+  ( PROP_cfg | PROP_ssa ), /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_update_ssa, /* todo_flags_finish */
+};
+
+/* The optimal way to implement CRC computations depends on the CPU, but
+   we need first recognize the computation as such before we can select a
+   suitable method.
+   For now, we just match the calculation of a 16 bit crc from 8-bit data
+   and a previous crc as performed by a trademarked benchmark that is
+   promoted as a replacement for Dhrystone.
+   You can add other recognizers as desired.
+   For new code you want to be optimized, you might also consider to use
+   the GCC built-in functions.  */
+
+class pass_crc : public gimple_opt_pass
+{
+public:
+  pass_crc (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_crc, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      /* We can at least use an efficient library function.
+	 OTOH, we might not want to run this pass when it is
+	 irrelevant, so obey that option setting.  */
+      return flag_crc_opt != 0;
+    }
+
+  virtual unsigned int execute (function *);
+
+  tree crc_result;
+  basic_block crc_result_bb;
+  tree crc_state, crc_latch, crc_joined = NULL_TREE, crc_rejoined = NULL_TREE;
+  tree crc_data, data_top, data_latch;
+  tree bit_joined = NULL_TREE;
+  unsigned HOST_WIDE_INT xor_mask;
+  HOST_WIDE_INT first_iter_num, last_iter_num;
+  basic_block bb3, bb6 = NULL, bb9 = NULL;
+
+  bool match_bb4 (basic_block bb)
+    {
+      int stmt_match = 0;
+      gimple_stmt_iterator gsi;
+      tree bit_out, crc_out;
+
+      if (!single_pred_p (bb))
+	return false;
+
+      for (gsi = gsi_start_nondebug_bb (bb); !gsi_end_p (gsi);
+	   gsi_next_nondebug (&gsi))
+	{
+	  gimple *t = gsi_stmt (gsi);
+
+	  if (stmt_match == 0)
+	    {
+	      if (is_gimple_assign (t)
+		  && gimple_assign_rhs_code(t) == BIT_XOR_EXPR
+		  && gimple_assign_rhs1 (t) == crc_latch)
+		{
+		  tree op2 = gimple_assign_rhs2 (t);
+		  if (!tree_fits_uhwi_p (op2))
+		    return false;
+		  crc_out = gimple_assign_lhs (t);
+		  xor_mask = TREE_INT_CST_LOW (op2);
+		  stmt_match++;
+		}
+	      else
+		return false;
+	    }
+	  else if (stmt_match == 1)
+	    {
+	      if (!gimple_assign_single_p (t))
+		return false;
+
+	      bit_out = gimple_assign_lhs (t);
+	      tree rhs = gimple_assign_rhs1 (t);
+	      int prec = TYPE_PRECISION (TREE_TYPE (bit_out));
+
+	      if (!expr_not_equal_to (rhs, wi::zero (prec)))
+		return false;
+
+	      stmt_match++;
+	    }
+	  else
+	    return false;
+	}
+      if (stmt_match < 2)
+	return false;
+
+      edge e = single_succ_edge (bb);
+
+      if (!e)
+	return false;
+
+      return match_bb6 (e, bit_out, crc_out);
+    }
+
+  bool match_bb5 (basic_block bb)
+    {
+      int stmt_match = 0;
+      gimple_stmt_iterator gsi;
+      tree bit_out;
+
+      if (!single_pred_p (bb))
+	return false;
+
+      for (gsi = gsi_start_nondebug_bb (bb); !gsi_end_p (gsi);
+	   gsi_next_nondebug (&gsi))
+	{
+	  gimple *t = gsi_stmt (gsi);
+
+	  if (stmt_match == 0)
+	    {
+	      if (!gimple_assign_single_p (t)
+		  || !integer_zerop (gimple_assign_rhs1 (t)))
+		return false;
+	      bit_out = gimple_assign_lhs (t);
+
+	      stmt_match++;
+	    }
+	  else
+	    return false;
+	}
+      if (stmt_match < 1)
+	return false;
+
+      edge e = single_succ_edge (bb);
+
+      if (!e)
+	return false;
+
+      return match_bb6 (e, bit_out, crc_latch);
+    }
+
+  bool match_bb6 (edge e, tree bit_in, tree crc_in)
+    {
+      basic_block bb = e->dest;
+
+      if (EDGE_COUNT (bb->preds) != 2)
+	return false;
+
+      int phi_match = 0;
+
+      for (gphi_iterator gsi = gsi_start_phis (bb); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gphi *phi = gsi.phi ();
+	  tree val = PHI_ARG_DEF_FROM_EDGE (phi, e);
+	  tree *joinp;
+
+	  if (val == bit_in)
+	    joinp = &bit_joined, phi_match++;
+	  else if (val == crc_in)
+	    joinp = &crc_joined, phi_match++;
+	  else
+	    return false;
+
+	  if (!*joinp)
+	    *joinp = PHI_RESULT (phi);
+	  else if (*joinp != PHI_RESULT (phi))
+	    return false;
+	}
+      if (phi_match < 2)
+	return false;
+
+      if (bb6 != NULL)
+	return bb == bb6;
+      bb6 = bb;
+
+      int stmt_match = 0;
+      tree crc_shifted;
+
+      for (gimple_stmt_iterator gsi = gsi_start_nondebug_bb (bb);
+	   !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
+	{
+	  gimple *t = gsi_stmt (gsi);
+
+	  if (stmt_match == 0)
+	    {
+	      if (!is_gimple_assign (t)
+		  || gimple_assign_rhs_code(t) != RSHIFT_EXPR
+		  || gimple_assign_rhs1 (t) != crc_joined
+		  || !TYPE_UNSIGNED (TREE_TYPE (crc_joined))
+		  || !integer_onep (gimple_assign_rhs2 (t)))
+		return false;
+	      stmt_match++;
+	      crc_shifted = gimple_assign_lhs (t);
+	    }
+	  else if (stmt_match == 1)
+	    {
+	      if ((gimple_code (t) == GIMPLE_COND)
+		  && (gimple_cond_code(t) == NE_EXPR)
+		  && gimple_cond_lhs (t) == bit_joined
+		  && (integer_zerop (gimple_cond_rhs (t))))
+		stmt_match++;
+	      else
+		return false;
+	    }
+	  else
+	    return false;
+	}
+      if (stmt_match < 2)
+	return false;
+
+      edge true_edge;
+      edge false_edge;
+      extract_true_false_edges_from_block (bb, &true_edge, &false_edge);
+
+      if (match_bb7 (true_edge->dest, crc_shifted)
+	  && match_bb8 (false_edge->dest, crc_shifted))
+	return true;
+      else
+	return false;
+    }
+
+  bool match_bb7 (basic_block bb, tree crc_in)
+    {
+      int stmt_match = 0;
+      gimple_stmt_iterator gsi;
+      tree crc_ored;
+
+      if (!single_pred_p (bb))
+	return false;
+
+      for (gsi = gsi_start_nondebug_bb (bb); !gsi_end_p (gsi);
+	   gsi_next_nondebug (&gsi))
+	{
+	  gimple *t = gsi_stmt (gsi);
+
+	  if (stmt_match == 0)
+	    {
+	      if (!is_gimple_assign(t)
+		  || gimple_assign_rhs_code(t) != BIT_IOR_EXPR
+		  || gimple_assign_rhs1 (t) != crc_in)
+		return false;
+
+	      tree op2 = gimple_assign_rhs2 (t);
+	      int prec = TYPE_PRECISION (TREE_TYPE (gimple_assign_rhs1 (t)));
+	      if (TREE_CODE (op2) != INTEGER_CST
+		  || prec > HOST_BITS_PER_INT
+		  || (TREE_INT_CST_LOW (op2)
+		      & ((HOST_WIDE_INT_C (1) << (prec - 1)) - 1)))
+		return false;
+	      xor_mask = TREE_INT_CST_LOW (op2) | (xor_mask >> 1);
+	      crc_ored = gimple_assign_lhs (t);
+	      stmt_match++;
+	    }
+	  else
+	    return false;
+	}
+      if (stmt_match < 1)
+	return false;
+
+      edge e = single_succ_edge (bb);
+
+      if (!e)
+	return false;
+
+      return match_bb9 (e, crc_ored);
+    }
+
+  bool match_bb8 (basic_block bb, tree crc_in)
+    {
+      int stmt_match = 0;
+      gimple_stmt_iterator gsi;
+      tree crc_anded;
+
+      if (!single_pred_p (bb))
+	return false;
+
+      for (gsi = gsi_start_nondebug_bb (bb); !gsi_end_p (gsi);
+	   gsi_next_nondebug (&gsi))
+	{
+	  gimple *t = gsi_stmt (gsi);
+
+	  if (stmt_match == 0)
+	    {
+	      if (! is_gimple_assign (t)
+		  || gimple_assign_rhs_code(t) != BIT_AND_EXPR
+		  || gimple_assign_rhs1 (t) != crc_in)
+		return false;
+
+	      tree op1 = gimple_assign_rhs1 (t);
+	      tree op2 = gimple_assign_rhs2 (t);
+	      int prec = TYPE_PRECISION (TREE_TYPE (op1));
+
+	      if (op1 != crc_in
+		  || TREE_CODE (op2) != INTEGER_CST
+		  || prec > HOST_BITS_PER_INT
+		  || (~TREE_INT_CST_LOW (op2)
+		      & ((HOST_WIDE_INT_C (1) << (prec - 1)) - 1)))
+		return false;
+
+	      if (!TYPE_UNSIGNED (TREE_TYPE (op1))
+		  && (~TREE_INT_CST_LOW (op2)
+		      & (HOST_WIDE_INT_C (1) << (prec - 1))))
+		return false;
+
+	      crc_anded = gimple_assign_lhs (t);
+	      stmt_match++;
+	    }
+	  else
+	    return false;
+	}
+      if (stmt_match < 1)
+	return false;
+
+      edge e = single_succ_edge (bb);
+
+      if (!e)
+	return false;
+
+      return match_bb9 (e, crc_anded);
+    }
+
+  bool match_bb9 (edge e, tree crc_in)
+    {
+      basic_block bb = e->dest;
+
+      if (EDGE_COUNT (bb->preds) != 2)
+	return false;
+
+      int phi_match = 0;
+
+      for (gphi_iterator gsi = gsi_start_phis (bb); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gphi *phi = gsi.phi ();
+	  tree val = PHI_ARG_DEF_FROM_EDGE (phi, e);
+
+	  if (val == crc_in)
+	    {
+	      if (!crc_rejoined)
+		crc_rejoined = PHI_RESULT (phi);
+	      else if (PHI_RESULT (phi) != crc_rejoined)
+		return false;
+	      phi_match++;
+	    }
+	}
+      if (phi_match < 1)
+	return false;
+
+      if (bb9 != NULL)
+	return bb == bb9;
+      bb9 = bb;
+
+      int stmt_match = 0;
+      tree iter_var, ix;
+
+      for (gimple_stmt_iterator gsi = gsi_start_nondebug_bb (bb);
+	   !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
+	{
+	  gimple *t = gsi_stmt (gsi);
+
+	  if (stmt_match == 0)
+	    {
+	      if (!gimple_assign_single_p (t)
+		  || TREE_CODE (gimple_assign_rhs1 (t)) != SSA_NAME)
+		return false;
+	      iter_var = gimple_assign_rhs1 (t);
+	      ix = gimple_assign_lhs (t);
+	      stmt_match++;
+	    }
+	  else if (stmt_match == 1)
+	    {
+	      if (!is_gimple_assign (t)
+		  || gimple_assign_rhs_code(t) != PLUS_EXPR
+		  || gimple_assign_rhs1 (t) != ix
+		  || !integer_onep (gimple_assign_rhs2 (t)))
+		return false;
+	      ix = gimple_assign_lhs (t);
+	      stmt_match++;
+	    }
+	  else
+	    return false;
+	}
+      if (stmt_match < 2)
+	return false;
+
+      e = single_succ_edge (bb);
+
+      if (!e)
+	return false;
+
+      return match_bb10 (e, iter_var, ix);
+    }
+
+
+  bool match_bb10 (edge e, tree iter_var, tree ix)
+    {
+      basic_block bb = e->dest, inc_bb = e->src;
+
+      if (EDGE_COUNT (bb->preds) != 2)
+	return false;
+
+      edge preheader_edge;
+      edge_iterator ei;
+
+      FOR_EACH_EDGE (preheader_edge, ei, bb->preds)
+	if (preheader_edge->src != inc_bb)
+	  break;
+
+      int stmt_match = 0;
+      gimple_stmt_iterator gsi;
+
+      for (gsi = gsi_start_nondebug_bb (bb); !gsi_end_p (gsi);
+	   gsi_next_nondebug (&gsi))
+	{
+	  gimple *t = gsi_stmt (gsi);
+
+	  if (stmt_match == 0)
+	    {
+	      if ((gimple_code (t) == GIMPLE_COND)
+		  && (gimple_cond_code(t) == LE_EXPR)
+		  && iter_var == gimple_cond_lhs (t)
+		  && tree_fits_shwi_p (gimple_cond_rhs (t)))
+		{
+		  last_iter_num = tree_to_shwi (gimple_cond_rhs (t));
+		  stmt_match++;
+		}
+	      else
+		return false;
+	    }
+	  else
+	    return false;
+	}
+      if (stmt_match < 1)
+	return false;
+
+      tree iter_init = NULL_TREE;
+
+      // Find tree's of incoming data and CRC
+      // We choose the PHI arg who's src insn't the bb at the end of the loop
+      int phi_match = 0;
+      gphi_iterator pi;
+      for (pi = gsi_start_phis (bb); !gsi_end_p (pi); gsi_next (&pi))
+	{
+	  gphi *phi = pi.phi ();
+	  if (virtual_operand_p (gimple_phi_result (phi)))
+	    ; /* Ignore.  */
+	  else if (PHI_RESULT (phi) == data_latch
+		   && PHI_ARG_DEF_FROM_EDGE (phi, e) == data_top)
+	    {
+	      crc_data = PHI_ARG_DEF_FROM_EDGE (phi, preheader_edge);
+	      phi_match++;
+	    }
+	  else if (PHI_RESULT (phi) == crc_latch
+		   && PHI_ARG_DEF_FROM_EDGE (phi, e) == crc_rejoined)
+	    {
+	      crc_state = PHI_ARG_DEF_FROM_EDGE (phi, preheader_edge);
+	      phi_match++;
+	    }
+	  else if (PHI_RESULT (phi) == iter_var
+		   && PHI_ARG_DEF_FROM_EDGE (phi, e) == ix)
+	    {
+	      iter_init = PHI_ARG_DEF_FROM_EDGE (phi, preheader_edge);
+	      phi_match++;
+	    }
+	}
+      if (phi_match < 3 || !iter_init)
+	return false;
+
+      edge true_edge;
+      edge false_edge;
+      extract_true_false_edges_from_block (bb, &true_edge, &false_edge);
+
+      /* Find what iter_init is set to.
+	 (Presumably in preheader_edge->src, not that it really matters.)  */
+      if (TREE_CODE (iter_init) != SSA_NAME)
+	return false;
+      gimple *def = SSA_NAME_DEF_STMT (iter_init);
+      if (!gimple_assign_single_p (def)
+	  || !integer_zerop (gimple_assign_rhs1 (def)))
+	return false;
+      first_iter_num = 0;
+      /* Check that the type allows safe a safe loop test in bb10.  */
+      if (!INTEGRAL_TYPE_P (TREE_TYPE (iter_init)) || last_iter_num >= 127)
+	return false;
+
+      if (true_edge->dest == bb3 && match_bb11 (false_edge->dest))
+	return true;
+      else
+	return false;
+    }
+
+  bool match_bb11 (basic_block bb)
+    {
+      int stmt_match = 0;
+      gimple_stmt_iterator gsi;
+
+      for (gsi = gsi_start_nondebug_bb (bb); !gsi_end_p (gsi);
+	   gsi_next_nondebug (&gsi))
+	{
+	  gimple *t = gsi_stmt (gsi);
+
+	  if (stmt_match == 0)
+	    {
+	      if (gimple_assign_single_p (t)
+		  && gimple_assign_rhs1 (t) == crc_latch)
+		{
+		  stmt_match++;
+
+		  // Save temp assigned to
+		  crc_result = gimple_assign_lhs(t);
+		  crc_result_bb = bb;
+
+		  // Ignore rest
+		  return true;
+		}
+	      else
+		return false;
+	    }
+	  else
+	    return false;
+	}
+
+      return false;
+    }
+
+  bool insert_crc (int bits, unsigned HOST_WIDE_INT polynom)
+    {
+      gimple_stmt_iterator rsi = gsi_start_bb(crc_result_bb);
+
+      gimple *fn_call;
+      tree fn_result;
+
+      tree crc_arg = crc_state;
+      tree data_arg = crc_data;
+
+      tree polynom_arg = build_int_cst (TREE_TYPE (crc_state), polynom);
+
+      if (GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (data_arg))) != bits)
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "Data size mismatch, %d vs. %d\n",
+		     GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (data_arg))), bits);
+	  return false;
+	}
+
+      if (!direct_internal_fn_supported_p
+	     (IFN_CRC, tree_pair (TREE_TYPE (data_arg), TREE_TYPE (crc_arg)),
+	      function_optimization_type (cfun)))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "No matching optab entry\n");
+	  return false;
+	}
+
+      fn_call = gimple_build_call_internal (IFN_CRC, 3,
+					    crc_arg, data_arg, polynom_arg);
+      fn_result = make_ssa_name (TREE_TYPE (crc_result));
+      gimple_call_set_lhs (fn_call, fn_result);
+
+      gimple_set_location(fn_call, gimple_location (gsi_stmt(rsi)));
+
+      gsi_insert_after (&rsi, fn_call, GSI_SAME_STMT);
+
+      use_operand_p imm_use_p;
+      imm_use_iterator iterator;
+      gimple *stmt;
+      FOR_EACH_IMM_USE_STMT (stmt, iterator, crc_result)
+	{
+	  FOR_EACH_IMM_USE_ON_STMT (imm_use_p, iterator)
+	    SET_USE (imm_use_p, fn_result);
+	  update_stmt (stmt);
+	}
+      return true;
+    }
+
+
+}; // class pass_crc
+
+
+unsigned int
+pass_crc::execute (function *fun)
+{
+  bool changed = false;
+  basic_block bb;
+
+  int bb_match;
+  int stmt_match;
+
+  if (dump_file)
+    fprintf(dump_file, "CRC check for function: %s\n", function_name (fun));
+
+  bb_match = 0;
+  FOR_ALL_BB_FN (bb, fun)
+    {
+      gimple_stmt_iterator gsi;
+      tree tmp1, tmp2, tmp3, tmp4, tmp5;
+
+      stmt_match = 0;
+      crc_result = NULL_TREE;
+      crc_state = NULL_TREE;
+      crc_data = NULL_TREE;
+      crc_result_bb = NULL;
+
+      for (gsi = gsi_start_nondebug_bb (bb); !gsi_end_p (gsi);
+	   gsi_next_nondebug (&gsi))
+	{
+	  gimple *t = gsi_stmt (gsi);
+
+	  if (stmt_match == 0)
+	    {
+	      if (is_gimple_assign (t)
+		  && gimple_assign_rhs_code (t) == NOP_EXPR)
+		{
+		  crc_latch = gimple_assign_rhs1 (t);
+		  tmp1 = gimple_assign_lhs (t);
+		  stmt_match++;
+		}
+	      else
+		break;
+	    }
+	  else if (stmt_match == 1)
+	    {
+	      if (is_gimple_assign (t)
+		  && gimple_assign_rhs_code (t) == NOP_EXPR)
+		{
+		  data_latch = gimple_assign_rhs1 (t);
+		  tmp2 = gimple_assign_lhs (t);
+		  stmt_match++;
+		}
+	      else
+		break;
+	    }
+	  else if (stmt_match == 2)
+	    {
+	      if (is_gimple_assign (t)
+		  && gimple_assign_rhs_code (t) == BIT_XOR_EXPR
+		  && gimple_assign_rhs1 (t) == tmp1
+		  && gimple_assign_rhs2 (t) == tmp2)
+		{
+		  tmp3 = gimple_assign_lhs (t);
+		  stmt_match++;
+		}
+	      else
+		break;
+	    }
+	  else if (stmt_match == 3)
+	    {
+	      if (is_gimple_assign (t)
+		  && gimple_assign_rhs_code(t) == NOP_EXPR
+		  && gimple_assign_rhs1 (t) == tmp3)
+		{
+		  tmp4 = gimple_assign_lhs (t);
+		  stmt_match++;
+		}
+	      else
+		break;
+	    }
+	  else if (stmt_match == 4)
+	    {
+	      if (is_gimple_assign (t)
+		  && gimple_assign_rhs_code(t) == BIT_AND_EXPR
+		  && gimple_assign_rhs1 (t) == tmp4
+		  && integer_onep (gimple_assign_rhs2 (t)))
+		{
+		  tmp5 = gimple_assign_lhs (t);
+		  stmt_match++;
+		}
+	      else
+		break;
+	    }
+	  else if (stmt_match == 5)
+	    {
+	      if (is_gimple_assign (t)
+		  && gimple_assign_rhs_code(t) == RSHIFT_EXPR
+		  && gimple_assign_rhs1 (t) == data_latch
+		  && TYPE_UNSIGNED (TREE_TYPE (data_latch))
+		  && integer_onep (gimple_assign_rhs2 (t)))
+		{
+		  data_top = gimple_assign_lhs (t);
+		  stmt_match++;
+		}
+	      else
+		break;
+	    }
+	  else if (stmt_match == 6)
+	    {
+	      if (gimple_code (t) == GIMPLE_COND
+		  && gimple_cond_code(t) == EQ_EXPR
+		  && gimple_cond_lhs (t) == tmp5
+		  && integer_onep (gimple_cond_rhs (t))
+		  && single_pred_p (bb))
+		{
+		  stmt_match++;
+		  bb_match++;
+		  if (dump_file)
+		    fprintf (dump_file, "Potential match in BB %d\n",
+			     bb->index);
+		  // Now compare basic-blocks that follow this,
+		  // as they need to match as well.
+		  edge true_edge;
+		  edge false_edge;
+		  bb3 = bb;
+		  extract_true_false_edges_from_block (bb, &true_edge,
+						       &false_edge);
+		  if (match_bb4 (true_edge->dest)
+		      && match_bb5 (false_edge->dest))
+		    {
+		      if (dump_file)
+			{
+			  fprintf (dump_file, "Matched loop, result in : ");
+			  print_generic_expr (dump_file, crc_result);
+			  fprintf (dump_file, "\ndata: ");
+			  print_generic_expr (dump_file, crc_data);
+			  fprintf (dump_file, " state: ");
+			  print_generic_expr (dump_file, crc_state);
+			  fprintf (dump_file, "\n");
+			}
+
+		    }
+
+		    if (crc_data && crc_state && crc_result_bb
+			&& insert_crc (last_iter_num + 1 - first_iter_num,
+				       xor_mask))
+		      changed = true;
+		}
+	      else
+		break;
+	    }
+	  else
+	    break;
+
+	}
+
+      if (changed)
+	{
+	  /* Cached scalar evolutions now may refer to wrong or non-existing
+	     loops.  */
+	  scev_reset_htab ();
+	  mark_virtual_operands_for_renaming (fun);
+	  rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
+	}
+    }
+
+  return 0;
+}
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_crc (gcc::context *ctxt)
+{
+  return new pass_crc (ctxt);
+}
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 606d1d60b85..37acd3a96ec 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -647,6 +647,7 @@ extern gimple_opt_pass *make_pass_gimple_isel (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_harden_compares (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_harden_conditional_branches (gcc::context
 							       *ctxt);
+extern gimple_opt_pass *make_pass_crc (gcc::context *ctxt);
 
 /* Current optimization pass.  */
 extern opt_pass *current_pass;

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RFA: crc builtin functions & optimizations
  2022-03-15  0:32 RFA: crc builtin functions & optimizations Joern Rennecke
@ 2022-03-15  1:04 ` Andrew Pinski
  2022-03-15  2:17   ` Oleg Endo
       [not found]   ` <CAMqJFCoHyuNzRkfmWSM6VkeWVhPw8tv5UAhnNfPYcUMzS5jFiQ@mail.gmail.com>
  2022-03-15 12:03 ` Martin Jambor
  2022-03-15 14:45 ` Richard Biener
  2 siblings, 2 replies; 12+ messages in thread
From: Andrew Pinski @ 2022-03-15  1:04 UTC (permalink / raw)
  To: Joern Rennecke; +Cc: GCC Patches

On Mon, Mar 14, 2022 at 5:33 PM Joern Rennecke
<joern.rennecke@embecosm.com> wrote:
>
> Most microprocessors have efficient ways to perform CRC operations, be
> that with lookup tables, rotates, or even special instructions.
> However, because we lack a representation for CRC in the compiler, we
> can't do proper instruction selection.  With this patch I seek out to
> rectify this,
> I've avoided using a mode name for the built-in functions because that
> would tie the semantics to the size of the addressable unit.  We
> generally use abbreviations like s/l/ll for type names, which is all
> right when the type can be widened without changing semantics.  For
> the data input, however, we also have to consider the shift count that
> is tied to it.  That is why I used a number to designate the width of
> the data input and shift.
>
> For machine support, I made a start with 8 and 16 bit little-endian
> CRC for RISCV using a
> lookup table.  I am sure once we have the basic infrastructure in the
> tree, we'll get more
> contributions of suitable named patterns for various ports.


A few points.
There are at least 9 different polynomials for the CRC-8 in common use today.
For CRC-32 there are 5 different polynomials used.
You don't have a patch to invoke.texi adding the descriptions of the builtins.
How is your polynom 3rd argument described? Is it similar to how it is
done on the wiki for the CRC?
Does it make sense to have to list the most common polynomials in the
documentation?

Also I am sorry but micro-optimizing coremarks is just wrong. Maybe it
is better to pick the CRC32 that is inside zip instead for a testcase
and benchmarking against?
Or even the CRC32C for iSCSI/ext4.

I see you also don't optimize the case where you have three other
variants of polynomials that are reversed, reciprocal and reversed
reciocal.

Also a huge problem, you don't check to make sure the third argument
to the crc builtin function is constant in the rsicv backend. Plus
since you expose the crc builtins as a non target specific builtin, I
assume there should be a libcall right and therefore either you need
to have a generic expansion for it or a function added to libgcc?
Also why not expand generically to the table or a loop based on a
target hook instead of in the backend? This way a target can choose
without even doing much.

Thanks,
Andrew Pinski



>
> bootstrapped on x86_64-pc-linux-gnu .

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RFA: crc builtin functions & optimizations
  2022-03-15  1:04 ` Andrew Pinski
@ 2022-03-15  2:17   ` Oleg Endo
  2022-03-15 13:00     ` Joern Rennecke
       [not found]   ` <CAMqJFCoHyuNzRkfmWSM6VkeWVhPw8tv5UAhnNfPYcUMzS5jFiQ@mail.gmail.com>
  1 sibling, 1 reply; 12+ messages in thread
From: Oleg Endo @ 2022-03-15  2:17 UTC (permalink / raw)
  To: Andrew Pinski, Joern Rennecke; +Cc: GCC Patches

On Mon, 2022-03-14 at 18:04 -0700, Andrew Pinski via Gcc-patches wrote:
> On Mon, Mar 14, 2022 at 5:33 PM Joern Rennecke
> <joern.rennecke@embecosm.com> wrote:
> > 
> > Most microprocessors have efficient ways to perform CRC operations, be
> > that with lookup tables, rotates, or even special instructions.
> > However, because we lack a representation for CRC in the compiler, we
> > can't do proper instruction selection.  With this patch I seek out to
> > rectify this,
> > I've avoided using a mode name for the built-in functions because that
> > would tie the semantics to the size of the addressable unit.  We
> > generally use abbreviations like s/l/ll for type names, which is all
> > right when the type can be widened without changing semantics.  For
> > the data input, however, we also have to consider the shift count that
> > is tied to it.  That is why I used a number to designate the width of
> > the data input and shift.
> > 
> > For machine support, I made a start with 8 and 16 bit little-endian
> > CRC for RISCV using a
> > lookup table.  I am sure once we have the basic infrastructure in the
> > tree, we'll get more
> > contributions of suitable named patterns for various ports.
> 
> 
> A few points.
> There are at least 9 different polynomials for the CRC-8 in common use today.
> For CRC-32 there are 5 different polynomials used.
> You don't have a patch to invoke.texi adding the descriptions of the builtins.
> How is your polynom 3rd argument described? Is it similar to how it is
> done on the wiki for the CRC?
> Does it make sense to have to list the most common polynomials in the
> documentation?
> 
> Also I am sorry but micro-optimizing coremarks is just wrong. Maybe it
> is better to pick the CRC32 that is inside zip instead for a testcase
> and benchmarking against?
> Or even the CRC32C for iSCSI/ext4.
> 
> I see you also don't optimize the case where you have three other
> variants of polynomials that are reversed, reciprocal and reversed
> reciocal.

In my own CRC library I've got ~30 'commonly used' CRC types, based on
the following generic definition:

template <
  // number of crc result bits (polynomial order in bits)
  unsigned int BitCount,

  // normal polynomial without the leading 1 bit.
  typename crc_impl_detail::select_int<BitCount>::type TruncPoly,

  // initial remainder
  typename crc_impl_detail::select_int<BitCount>::type InitRem = 0,

  // final xor value
  typename crc_impl_detail::select_int<BitCount>::type FinalXor = 0,

  // input data byte reflected before processing (LSB / MSB first)
  bool ReflectInput = false,

  // output CRC reflected before the xor
  bool ReflectRemainder = false >
class crc
{
...
};


and then it goes like ...

// CRC-1 (most hardware; also known as parity bit)
// x + 1
typedef crc < 1, 0x01 > crc_1;

// CRC-3
typedef crc < 3, 0x03, 0x07, 0x00, true, true> crc_3;

...

// CRC-32 (ISO 3309, ANSI X3.66, FIPS PUB 71, FED-STD-1003, ITU-T V.42, Ethernet, SATA, MPEG-2, Gzip, PKZIP, POSIX cksum, PNG, ZMODEM)
// x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
typedef crc < 32, 0x04C11DB7, 0xFFFFFFFF, 0xFFFFFFFF, true, true > crc_32;

typedef crc < 32, 0x04C11DB7, 0x7FFFFFFF, 0x7FFFFFFF, false, false > crc_32_mpeg2;
typedef crc < 32, 0x04C11db7, 0x00000000, 0xFFFFFFFF, false, false > crc_32_posix;

...


It then generates the lookup tables at compile time into .rodata for
the types that are used in the program, which is great for MCUs with
more flash/ROM than RAM.

Specific CRC types can be overridden if the system has a better way to
calculate the CRC, e.g. as hardware peripheral.

This being a library makes it relatively easy to tune and customize for
various systems.

How would that work together with your proposal?

Cheers,
Oleg


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RFA: crc builtin functions & optimizations
  2022-03-15  0:32 RFA: crc builtin functions & optimizations Joern Rennecke
  2022-03-15  1:04 ` Andrew Pinski
@ 2022-03-15 12:03 ` Martin Jambor
  2022-03-15 14:45 ` Richard Biener
  2 siblings, 0 replies; 12+ messages in thread
From: Martin Jambor @ 2022-03-15 12:03 UTC (permalink / raw)
  To: Joern Rennecke; +Cc: GCC Patches

Hello,

just one question....

On Tue, Mar 15 2022, Joern Rennecke wrote:
> Most microprocessors have efficient ways to perform CRC operations, be
> that with lookup tables, rotates, or even special instructions.
> However, because we lack a representation for CRC in the compiler, we
> can't do proper instruction selection.  With this patch I seek out to
> rectify this,
> I've avoided using a mode name for the built-in functions because that
> would tie the semantics to the size of the addressable unit.  We
> generally use abbreviations like s/l/ll for type names, which is all
> right when the type can be widened without changing semantics.  For
> the data input, however, we also have to consider the shift count that
> is tied to it.  That is why I used a number to designate the width of
> the data input and shift.
>
> For machine support, I made a start with 8 and 16 bit little-endian
> CRC for RISCV using a
> lookup table.  I am sure once we have the basic infrastructure in the
> tree, we'll get more
> contributions of suitable named patterns for various ports.
>
> bootstrapped on x86_64-pc-linux-gnu .
> 2022-03-14  Jon Beniston  <jon@beniston.com>
> 	    Joern Rennecke  <joern.rennecke@embecosm.com>
>
> 	* Makefile.in (OBJS): Add tree-crc.o .
> 	* builtin-types.def (BT_FN_UINT16_UINT16_UINT8_CONST_SIZE): Define.
> 	(BT_FN_UINT16_UINT16_UINT16_CONST_SIZE): Likewise.
> 	(BT_FN_UINT16_UINT16_UINT32_CONST_SIZE): Likewise.
> 	* builtins.cc (associated_internal_fn):
> 	Handle BUILT_IN_CRC8S, BUILT_IN_CRC16S, BUILT_IN_CRC32S.
> 	* builtins.def (BUILT_IN_CRC8S, BUILT_IN_CRC16S, BUILT_IN_CRC32S):
> 	New builtin functions.
> 	* cgraph.cc (cgraph_node::verify_node):
> 	Allow const calls without a callgraph edge.
> 	* common.opt (fcrc): New option.
> 	* doc/invoke.texi (-fcrc): Document.
> 	* gimple-match-head.cc: #include predict.h .
> 	* internal-fn.cc (crc_direct): Define.
> 	(expand_crc_optab_fn): New function.
> 	(direct_crc_optab_supported_p): Define.
> 	* internal-fn.def (CRC, CRC_BE): New internal optab functions.
> 	* match.pd: Match a pair of crc operations.
> 	* optabs.def (crc_optab, crc_be_optab): New optabs.
> 	* passes.def (pass_crc): Add new pass.
> 	* tree-crc.cc: New file.
> 	* tree-pass.h (make_pass_crc): Declare.
>
> testsuite:
> 	* gcc.c-torture/compile/crc.c: New test.
> 	* gcc.dg/tree-ssa/crc.c: Likewise.
> 	* gcc.dg/tree-ssa/crc-2.c: likewise.
> 	* gcc.dg/tree-ssa/pr59597.c: Add flag -fno-crc .
>
> config/riscv:
> 	* crc.md: New file.
> 	* riscv-protos.h (expand_crc_lookup, print_crc_table): Declare.
> 	* riscv.cc (compute_crc): New function.
> 	(print_crc_table, expand_crc_lookup): Likewise.
> 	* riscv.md: Include crc.md.
> 	* riscv.opt (msmall-memory): New option.
> 	* tree-crc-doc.txt: New file.
>
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index 31ff95500c9..a901925511b 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1612,6 +1612,7 @@ OBJS = \
>  	tree-cfgcleanup.o \
>  	tree-chrec.o \
>  	tree-complex.o \
> +	tree-crc.o \
>  	tree-data-ref.o \
>  	tree-dfa.o \
>  	tree-diagnostic.o \

[...]

> diff --git a/gcc/cgraph.cc b/gcc/cgraph.cc
> index b923a59ab0c..9570f5121af 100644
> --- a/gcc/cgraph.cc
> +++ b/gcc/cgraph.cc
> @@ -3793,7 +3793,8 @@ cgraph_node::verify_node (void)
>  			    }
>  			  e->aux = (void *)1;
>  			}
> -		      else if (decl)
> +		      else if (decl
> +			       && !TREE_READONLY (decl) && !DECL_PURE_P (decl))
>  			{
>  			  error ("missing callgraph edge for call stmt:");
>  			  cgraph_debug_gimple_stmt (this_cfun, stmt);

Why is this is necessary?  It seems that all other built-ins just have a
cgraph_node and their calls a cgraph_edge.

Thanks,

Martin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Fwd: RFA: crc builtin functions & optimizations
       [not found]   ` <CAMqJFCoHyuNzRkfmWSM6VkeWVhPw8tv5UAhnNfPYcUMzS5jFiQ@mail.gmail.com>
@ 2022-03-15 12:52     ` Joern Rennecke
  0 siblings, 0 replies; 12+ messages in thread
From: Joern Rennecke @ 2022-03-15 12:52 UTC (permalink / raw)
  To: GCC Patches

Oops, that was meant to go to the list too.


On Tue, 15 Mar 2022 at 01:04, Andrew Pinski <pinskia@gmail.com> wrote:
>
> On Mon, Mar 14, 2022 at 5:33 PM Joern Rennecke
> <joern.rennecke@embecosm.com> wrote:
> >
> > Most microprocessors have efficient ways to perform CRC operations, be
> > that with lookup tables, rotates, or even special instructions.
> > However, because we lack a representation for CRC in the compiler, we
> > can't do proper instruction selection.  With this patch I seek out to
> > rectify this,
> > I've avoided using a mode name for the built-in functions because that
> > would tie the semantics to the size of the addressable unit.  We
> > generally use abbreviations like s/l/ll for type names, which is all
> > right when the type can be widened without changing semantics.  For
> > the data input, however, we also have to consider the shift count that
> > is tied to it.  That is why I used a number to designate the width of
> > the data input and shift.
> >
> > For machine support, I made a start with 8 and 16 bit little-endian
> > CRC for RISCV using a
> > lookup table.  I am sure once we have the basic infrastructure in the
> > tree, we'll get more
> > contributions of suitable named patterns for various ports.
>
>
> A few points.
> There are at least 9 different polynomials for the CRC-8 in common use today.
> For CRC-32 there are 5 different polynomials used.
> You don't have a patch to invoke.texi adding the descriptions of the builtins.

You are correct that the documentation could use some work, but that part
would go into extend.texi .

> How is your polynom 3rd argument described? Is it similar to how it is
> done on the wiki for the CRC?

It's a constant integer.
I haven't found a CRC in https://gcc.gnu.org/wiki .
If you mean wikipedia.org, they focus mainly on big endian CRC.  I've added
a function code IFN_CRC_BE for it because for completeness it should be
there, but haven't fleshed out anything further around that.  IFN_CRC and
its associated built-in functions are little-endian.  If you look at the start
of the confg/riscv/crc.md patch, there is a comment with a simple C
implementation of crchihi4.

> Does it make sense to have to list the most common polynomials in the
> documentation?

Maybe.  You could give advice on what makes cryptographic sense for
people who want to use CRCs in their code for integrity checks.
Or once some ports with special-purpose instructons are supported,
there could be comments on which polynoms will result in faster operation
because of the specialized expansion for the respective targets.

> Also I am sorry but micro-optimizing coremarks is just wrong.

The claim for that benchmark is that it tests a set of common
operations, including CRC calculations.  Without compiler support,
what we test instead is how well this particular implementation of CRC is
compiled for the target CPU, which can be very different from the actual
CRC computation performance.  So recognizing the CRC computation
helps the benchmark archive the stated goal of gauging CRC computation
performance.

Moreover, since the benchmark is commonly used, this also makes
it a commonly used idiom, and the license allows to copy the code
into your own programs to a large extent.

> Maybe it
> is better to pick the CRC32 that is inside zip instead for a testcase
> and benchmarking against?
> Or even the CRC32C for iSCSI/ext4.

I'm not sure what's inside there, but in principle, the more the merrier.
I had a look at the bzip2 CRC computation, but that's just a table
lookup.  We can recognize table lookups that compute a CRC if the
array is a constant, but there is no point if you haven't either a faster
implementation or want further optimization to be enabled.  Going
there was beyond the scope of my work at this time.

In principle, it would be interesting to do reduction / vectorization of
block CRC computations.  But you have to start with having a
representation for the CRC computations first.

> I see you also don't optimize the case where you have three other
> variants of polynomials that are reversed, reciprocal and reversed
> reciocal.

Do you want to contribute that?

> Also a huge problem, you don't check to make sure the third argument
> to the crc builtin function is constant in the rsicv backend.
Why is that a huge problem?  I see it as a further refinement not yet
added.  Strictly speaking, there is a check, but it's an assert, OTOH
it shouldn't be triggered with the infrstructure as it is now because the
optimizer only looks for a computation with a constant polynom, and
the third argument of the builtin crc functions is BT_CONST_SIZE for
now.  Variable polynoms are interesting, but before we introduce them,
we must make sure that constants remain inside the builtin function,
lest we get severe perfromance degradation if table lookups and
special instructions are not available.

> Plus
> since you expose the crc builtins as a non target specific builtin, I
> assume there should be a libcall right and therefore either you need
> to have a generic expansion for it or a function added to libgcc?

I thought we had other builtin functions that are not supported everywhere?
If we must have a default implementation, I suppose a looping one
in libgcc is the simplest one, and it can cover variable polynoms
(even if it's not needed right away).

> Also why not expand generically to the table or a loop based on a
> target hook instead of in the backend? This way a target can choose
> without even doing much.

I thought the optabs interface was already flexible enough.  Yes,
the work of the target author is simpler if you put pre-made functions
for expanding a shft loop / rotate loop / table lookup into the backend.

At any rate, I don't know when I next work on this, so further contributions
are welcome.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RFA: crc builtin functions & optimizations
  2022-03-15  2:17   ` Oleg Endo
@ 2022-03-15 13:00     ` Joern Rennecke
  0 siblings, 0 replies; 12+ messages in thread
From: Joern Rennecke @ 2022-03-15 13:00 UTC (permalink / raw)
  To: Oleg Endo; +Cc: Andrew Pinski, GCC Patches

On Tue, 15 Mar 2022 at 02:17, Oleg Endo <oleg.endo@t-online.de> wrote:
> > In my own CRC library I've got ~30 'commonly used' CRC types, based on
> the following generic definition:
> > This being a library makes it relatively easy to tune and customize for
> various systems.

...

> How would that work together with your proposal?

With optabs, you can put in whatever you like into the
machine-specific expansion.

Or if we could put your library-using code into a default expansion that is used
if there's no optab expansion for the modes given, then the target can override
this for machine-specific methods using the optabs, and otherwise use
your library
method in the default expansion.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RFA: crc builtin functions & optimizations
  2022-03-15  0:32 RFA: crc builtin functions & optimizations Joern Rennecke
  2022-03-15  1:04 ` Andrew Pinski
  2022-03-15 12:03 ` Martin Jambor
@ 2022-03-15 14:45 ` Richard Biener
  2022-03-15 15:15   ` Joern Rennecke
  2022-03-15 19:54   ` Joern Rennecke
  2 siblings, 2 replies; 12+ messages in thread
From: Richard Biener @ 2022-03-15 14:45 UTC (permalink / raw)
  To: Joern Rennecke; +Cc: GCC Patches

On Tue, Mar 15, 2022 at 1:32 AM Joern Rennecke
<joern.rennecke@embecosm.com> wrote:
>
> Most microprocessors have efficient ways to perform CRC operations, be
> that with lookup tables, rotates, or even special instructions.
> However, because we lack a representation for CRC in the compiler, we
> can't do proper instruction selection.  With this patch I seek out to
> rectify this,
> I've avoided using a mode name for the built-in functions because that
> would tie the semantics to the size of the addressable unit.  We
> generally use abbreviations like s/l/ll for type names, which is all
> right when the type can be widened without changing semantics.  For
> the data input, however, we also have to consider the shift count that
> is tied to it.  That is why I used a number to designate the width of
> the data input and shift.
>
> For machine support, I made a start with 8 and 16 bit little-endian
> CRC for RISCV using a
> lookup table.  I am sure once we have the basic infrastructure in the
> tree, we'll get more
> contributions of suitable named patterns for various ports.

Why's this a new pass?  Every walk over all insns costs time.  The pass
lacks any comments as to what CFG / stmt structure is matched.  From
a quick look it seems like it first(?) statically matches a stmt sequence
without considering intermediate stmts, so matching should be quite
fragile.  Why not match (sub-)expressions with the help of match.pd?

Any reason why you match CRC before early inlinig and thus even when
not optimizing?  Matching at least after early FRE/DCE/DSE would help
to get rid of abstraction and/or memory temporary uses.

> bootstrapped on x86_64-pc-linux-gnu .

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RFA: crc builtin functions & optimizations
  2022-03-15 14:45 ` Richard Biener
@ 2022-03-15 15:15   ` Joern Rennecke
  2022-03-16  8:15     ` Richard Biener
  2022-03-15 19:54   ` Joern Rennecke
  1 sibling, 1 reply; 12+ messages in thread
From: Joern Rennecke @ 2022-03-15 15:15 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, Jon Beniston

On 15/03/2022, Richard Biener <richard.guenther@gmail.com> wrote:

> Why's this a new pass?  Every walk over all insns costs time.

If should typically scan considerably less than all the insns.

>  The pass
> lacks any comments as to what CFG / stmt structure is matched.

I've put a file in:
config/riscv/tree-crc-doc.txt

would this text be suitabe to put in a comment block in tree-crc.cc ?

>  From
> a quick look it seems like it first(?) statically matches a stmt sequence
> without considering intermediate stmts, so matching should be quite
> fragile.

It might be fragile inasmuch as it won't match when things change, but
the matching has remained effective for seven years and across two
architecture families with varying word sizes.
And with regards to matching only what it's supposed to match, I believe
I have checked all the data dependencies and phis so that it's definitely
calculating a CRC.

>  Why not match (sub-)expressions with the help of match.pd?

Can you match a loop with match.pd ?

> Any reason why you match CRC before early inlinig and thus even when
> not optimizing?  Matching at least after early FRE/DCE/DSE would help
> to get rid of abstraction and/or memory temporary uses.

I haven't originally placed it there, but I believe benefits include:
- Getting rid of loop without having to actively deleting it in the
crc pass (this also
  might be safer as we just have to make sure we're are computing the CRC, and
  DCE will determine if there is any ancillary result that is left,
and only delete the
  loop if it's really dead.
- The optimized function is available for inlining.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RFA: crc builtin functions & optimizations
  2022-03-15 14:45 ` Richard Biener
  2022-03-15 15:15   ` Joern Rennecke
@ 2022-03-15 19:54   ` Joern Rennecke
  1 sibling, 0 replies; 12+ messages in thread
From: Joern Rennecke @ 2022-03-15 19:54 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

On 15/03/2022, Richard Biener <richard.guenther@gmail.com> wrote:

> Why's this a new pass?  Every walk over all insns costs time.  The pass
> lacks any comments as to what CFG / stmt structure is matched.  From
> a quick look it seems like it first(?) statically matches a stmt sequence
> without considering intermediate stmts, so matching should be quite
> fragile.  Why not match (sub-)expressions with the help of match.pd?

Thinking about this a bit more, I suppose I could change the match.pd
framework to allow to set a bit or add a list element for a basic block where
an expression match is found.  That wouldn't make it any simpler - on the
contrary, much more complicated, since there need to be another check
for the same expression that makes sure all the inputs and outputs line up
with the other basic blocks constituting the loop - but it could avoid scanning
functions that don't have anything that looks like a match in a separate pass.

The proper check and actual transformation would still have to be in its own
pass, but that could return immediately if no expression match for a starting
block was found.
It'd have to be early enough, though, to happen before all inlining
and unrolling,
since both operations would hinder recognition, and we also want them applied
to outer loops / inlining functions after the transformation of the
crc computing
loop into a built-in function.
I suppose if no gimple pass is early enough, we could resort to use a
generic match.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RFA: crc builtin functions & optimizations
  2022-03-15 15:15   ` Joern Rennecke
@ 2022-03-16  8:15     ` Richard Biener
  2022-03-16 17:02       ` Joern Rennecke
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Biener @ 2022-03-16  8:15 UTC (permalink / raw)
  To: Joern Rennecke; +Cc: GCC Patches, Jon Beniston

On Tue, Mar 15, 2022 at 4:15 PM Joern Rennecke
<joern.rennecke@embecosm.com> wrote:
>
> On 15/03/2022, Richard Biener <richard.guenther@gmail.com> wrote:
>
> > Why's this a new pass?  Every walk over all insns costs time.
>
> If should typically scan considerably less than all the insns.
>
> >  The pass
> > lacks any comments as to what CFG / stmt structure is matched.
>
> I've put a file in:
> config/riscv/tree-crc-doc.txt
>
> would this text be suitabe to put in a comment block in tree-crc.cc ?

Yes, that would be a better place I think.

> >  From
> > a quick look it seems like it first(?) statically matches a stmt sequence
> > without considering intermediate stmts, so matching should be quite
> > fragile.
>
> It might be fragile inasmuch as it won't match when things change, but
> the matching has remained effective for seven years and across two
> architecture families with varying word sizes.
> And with regards to matching only what it's supposed to match, I believe
> I have checked all the data dependencies and phis so that it's definitely
> calculating a CRC.
>
> >  Why not match (sub-)expressions with the help of match.pd?
>
> Can you match a loop with match.pd ?

No, this is why I said (sub-)expression.  I'm mainly talking about

      tmp = (tmp >> 1) | (tmp << (sizeof (tmp) * (8 /*CHAR_BIT*/) - 1));
      if ((long)tmp < 0)

for example - one key bit of the CRC seems to be a comparison, so
you'd match that with sth like

(match (crc_compare @0)
 (lt (bit_ior (rshift @0 integer_onep) (lshift @0 INTEGER_CST@1)) integer_zerop)
 (if (compare_tree_int (@1, TYPE_SIZE_UNIT (TREE_TYPE (@0)) * 8 - 1))))

where you can add alternative expression forms.  You can then use
the generated match function from the pass.  See for example the
tree-ssa-forwprop.cc use of gimple_ctz_table_index.

> > Any reason why you match CRC before early inlinig and thus even when
> > not optimizing?  Matching at least after early FRE/DCE/DSE would help
> > to get rid of abstraction and/or memory temporary uses.
>
> I haven't originally placed it there, but I believe benefits include:
> - Getting rid of loop without having to actively deleting it in the
> crc pass (this also
>   might be safer as we just have to make sure we're are computing the CRC, and
>   DCE will determine if there is any ancillary result that is left,
> and only delete the
>   loop if it's really dead.
> - The optimized function is available for inlining.

The canonical place to transform loops into builtins would be loop distribution.
Another place would be final value replacement since you basically replace
the reduction result with a call to the builtin, but I think
loop-distribution is
the better overall place.  See how we match strlen() there.

Richard.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RFA: crc builtin functions & optimizations
  2022-03-16  8:15     ` Richard Biener
@ 2022-03-16 17:02       ` Joern Rennecke
  2022-03-16 20:23         ` Joern Rennecke
  0 siblings, 1 reply; 12+ messages in thread
From: Joern Rennecke @ 2022-03-16 17:02 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, Jon Beniston

[-- Attachment #1: Type: text/plain, Size: 1349 bytes --]

On Wed, 16 Mar 2022 at 08:15, Richard Biener <richard.guenther@gmail.com> wrote:

> The canonical place to transform loops into builtins would be loop distribution.
> Another place would be final value replacement since you basically replace
> the reduction result with a call to the builtin, but I think
> loop-distribution is
> the better overall place.  See how we match strlen() there.
>
> Richard.

So I suppose that would be something along the line of the below patch?
Except it'd need a lot more checks to make sure it's actually processing a
CRC computation,
and there needs to be code to actually expand the builtin using optabs.
And something needs to be done to make match.pd work on the output.
I'm seeing crcu16 being unrolled by in the *.cunroll dump and the remains
of the loops deleted in the *.dse4 dump, but the patch.pd clause to simplify
the two __builtin_crc8s calls never matches, so we end up with this in
*.optimized:

 ee_u16 crcu16 (ee_u16 newval, ee_u16 crc)
{
  unsigned char _1;
  short unsigned int _2;
  unsigned char _3;
  short unsigned int _24;
  short unsigned int _26;

  <bb 2> [local count: 1073741824]:
  _1 = (unsigned char) newval_4(D);
  _24 = __builtin_crc8s (crc_6(D), _1, 40961);
  _2 = newval_4(D) >> 8;
  _3 = (unsigned char) _2;
  _26 = __builtin_crc8s (_24, _3, 40961); [tail call]
  return _26;

}

[-- Attachment #2: ldist-crc-diff.txt --]
[-- Type: text/plain, Size: 5495 bytes --]

diff --git a/gcc/tree-loop-distribution.cc b/gcc/tree-loop-distribution.cc
index db6e9096a86..b74e8569b94 100644
--- a/gcc/tree-loop-distribution.cc
+++ b/gcc/tree-loop-distribution.cc
@@ -659,6 +659,9 @@ class loop_distribution
      replace them accordingly.  */
   bool transform_reduction_loop (loop_p loop);
 
+  /* Transform some loops which calculate a CRC.  */
+  bool transform_reduction_loop (loop_p loop, tree niters);
+
   /* Compute topological order for basic blocks.  Topological order is
      needed because data dependence is computed for data references in
      lexicographical order.  */
@@ -3432,6 +3435,26 @@ generate_strlen_builtin_using_rawmemchr (loop_p loop, tree reduction_var,
 			     start_len);
 }
 
+static void
+generate_crc_builtin (loop_p loop, tree reduction_var,
+		      tree crc_in, tree data_in, tree xor_val,
+		      location_t loc)
+{
+  gimple_seq seq = NULL;
+  tree reduction_var_new = make_ssa_name (TREE_TYPE (reduction_var));
+
+  crc_in = force_gimple_operand (crc_in, &seq, true, NULL_TREE);
+  data_in = force_gimple_operand (data_in, &seq, true, NULL_TREE);
+  tree fn = build_fold_addr_expr (builtin_decl_implicit (BUILT_IN_CRC8S));
+  gimple *fn_call = gimple_build_call (fn, 3, crc_in, data_in, xor_val);
+  gimple_call_set_lhs (fn_call, reduction_var_new);
+  gimple_set_location (fn_call, loc);
+  gimple_seq_add_stmt (&seq, fn_call);
+
+  generate_reduction_builtin_1 (loop, seq, reduction_var, reduction_var_new,
+				"generated crc%s", E_QImode);
+}
+
 /* Return true if we can count at least as many characters by taking pointer
    difference as we can count via reduction_var without an overflow.  Thus
    compute 2^n < (2^(m-1) / s) where n = TYPE_PRECISION (reduction_var_type),
@@ -3713,6 +3736,128 @@ loop_distribution::transform_reduction_loop (loop_p loop)
   return false;
 }
 
+/* Match loops like:
+
+  <bb 3> [local count: 954449105]:
+  # data_16 = PHI <data_13(7), data_9(D)(2)>
+  # crc_22 = PHI <crc_4(7), crc_10(D)(2)>
+  # i_3 = PHI <i_18(7), 0(2)>
+  # ivtmp_7 = PHI <ivtmp_6(7), 8(2)>
+  _8 = (unsigned char) crc_22;
+  _1 = _8 ^ data_16;
+  x16_12 = _1 & 1;
+  data_13 = data_16 >> 1;
+  _19 = crc_22 >> 1;
+  if (x16_12 != 0)
+    goto <bb 4>; [50.00%]
+  else
+    goto <bb 8>; [50.00%]
+
+  <bb 8> [local count: 477224553]:
+  goto <bb 5>; [100.00%]
+
+  <bb 4> [local count: 477224552]:
+  crc_17 = _19 ^ 40961;
+
+  <bb 5> [local count: 954449105]:
+  # crc_4 = PHI <crc_17(4), _19(8)>
+  i_18 = i_3 + 1;
+  ivtmp_6 = ivtmp_7 - 1;
+  if (ivtmp_6 != 0)
+    goto <bb 7>; [88.89%]
+  else
+    goto <bb 6>; [11.11%]
+
+  <bb 7> [local count: 848409806]:
+  goto <bb 3>; [100.00%] */
+
+bool
+loop_distribution::transform_reduction_loop (loop_p loop, tree niters)
+{
+  gimple *reduction_stmt;
+
+  if (!wi::eq_p (wi::to_widest (niters), 7))
+    return false;
+
+  if (loop->num_nodes != 5)
+    return false;
+
+  reduction_stmt = determine_reduction_stmt (loop);
+  if (reduction_stmt == NULL)
+    return false;
+
+  /* Reduction variables are guaranteed to be SSA names.  */
+  tree reduction_var;
+  switch (gimple_code (reduction_stmt))
+    {
+    case GIMPLE_ASSIGN:
+    case GIMPLE_PHI:
+      reduction_var = gimple_get_lhs (reduction_stmt);
+      break;
+    default:
+      /* Bail out e.g. for GIMPLE_CALL.  */
+      return false;
+    }
+
+  if (EDGE_COUNT (loop->header->preds) != 2)
+    return false;
+
+  edge e, entry_edge = NULL, backedge = NULL;
+  edge_iterator ei;
+
+  FOR_EACH_EDGE (e, ei, loop->header->preds)
+    if (e->src->loop_father != loop)
+      entry_edge = e;
+    else
+      backedge = e;
+
+  if (!entry_edge || !backedge)
+    return false;
+
+  tree crc_in = NULL_TREE, data_in = NULL_TREE;
+
+  for (gphi_iterator gsi = gsi_start_phis (loop->header); !gsi_end_p (gsi);
+       gsi_next (&gsi))
+    {
+      gphi *phi = gsi.phi ();
+      tree val = PHI_ARG_DEF_FROM_EDGE (phi, entry_edge);
+      if (PHI_ARG_DEF_FROM_EDGE (phi, backedge) == reduction_var)
+	crc_in = val;
+      else if (!data_in)
+	data_in = val;
+    }
+  if (!crc_in || !data_in)
+    return false;
+
+  basic_block xor_bb = NULL;
+  gcond *cond = safe_dyn_cast <gcond *> (last_stmt (loop->header));
+  if (!cond)
+    return false;
+
+  FOR_EACH_EDGE (e, ei, loop->header->succs)
+    if ((e->flags & EDGE_TRUE_VALUE && gimple_cond_code (cond) == NE_EXPR)
+	|| (e->flags & EDGE_FALSE_VALUE && gimple_cond_code (cond) == EQ_EXPR))
+      xor_bb = e->dest;
+
+  if (!xor_bb)
+    return false;
+
+  gimple_stmt_iterator gsi = gsi_start_nondebug_bb (xor_bb);
+  if (gsi_end_p (gsi))
+    return false;
+  gimple *xor_stmt = gsi_stmt (gsi);
+  if (!is_gimple_assign (xor_stmt) || gimple_assign_rhs_code(xor_stmt) != BIT_XOR_EXPR)
+    return false;
+  tree xor_val = gimple_assign_rhs2 (xor_stmt);
+  gsi_next_nondebug (&gsi);
+  if (!gsi_end_p (gsi))
+    return false;
+
+  generate_crc_builtin (loop, reduction_var, crc_in, data_in, xor_val,
+			gimple_location (xor_stmt));
+  return true;
+}
+
 /* Given innermost LOOP, return the outermost enclosing loop that forms a
    perfect loop nest.  */
 
@@ -3798,6 +3943,11 @@ loop_distribution::execute (function *fun)
 	  free_data_refs (datarefs_vec);
 	  continue;
 	}
+      else if (TREE_CODE (niters) == INTEGER_CST && flag_tree_loop_distribute_patterns)
+	{
+	  if (transform_reduction_loop (loop, niters))
+	    continue;
+	}
 
       /* Get the perfect loop nest for distribution.  */
       loop = prepare_perfect_loop_nest (loop);

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RFA: crc builtin functions & optimizations
  2022-03-16 17:02       ` Joern Rennecke
@ 2022-03-16 20:23         ` Joern Rennecke
  0 siblings, 0 replies; 12+ messages in thread
From: Joern Rennecke @ 2022-03-16 20:23 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, Jon Beniston

> and there needs to be code to actually expand the builtin using optabs.
> And something needs to be done to make match.pd work on the output.

Never mind that bit, that was just due to a function argument type mismatch
on the last argument of the crc built-in functions.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-03-16 20:23 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-15  0:32 RFA: crc builtin functions & optimizations Joern Rennecke
2022-03-15  1:04 ` Andrew Pinski
2022-03-15  2:17   ` Oleg Endo
2022-03-15 13:00     ` Joern Rennecke
     [not found]   ` <CAMqJFCoHyuNzRkfmWSM6VkeWVhPw8tv5UAhnNfPYcUMzS5jFiQ@mail.gmail.com>
2022-03-15 12:52     ` Fwd: " Joern Rennecke
2022-03-15 12:03 ` Martin Jambor
2022-03-15 14:45 ` Richard Biener
2022-03-15 15:15   ` Joern Rennecke
2022-03-16  8:15     ` Richard Biener
2022-03-16 17:02       ` Joern Rennecke
2022-03-16 20:23         ` Joern Rennecke
2022-03-15 19:54   ` Joern Rennecke

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).