[snip]

Please find attached a new version of the patch.  A major change compared to
the previous patch is that I created a separate pass which hopefully makes
reviewing also easier since it is almost self-contained.  After realizing that
detecting loops which mimic the behavior of rawmemchr/strlen functions does not
really fit into the topic of loop distribution, I created a separate pass.  Due
to this I was also able to play around a bit and schedule the pass at different
times.  Currently it is scheduled right before loop distribution where loop
header copying already took place which leads to the following effect.  Running
this setup over

char *t (char *p)
{
  for (; *p; ++p);
  return p;
}

the new pass transforms

char * t (char * p)
{
  char _1;
  char _7;

  <bb 2> [local count: 118111600]:
  _7 = *p_3(D);
  if (_7 != 0)
    goto <bb 5>; [89.00%]
  else
    goto <bb 7>; [11.00%]

  <bb 5> [local count: 105119324]:

  <bb 3> [local count: 955630225]:
  # p_8 = PHI <p_6(6), p_3(D)(5)>
  p_6 = p_8 + 1;
  _1 = *p_6;
  if (_1 != 0)
    goto <bb 6>; [89.00%]
  else
    goto <bb 8>; [11.00%]

  <bb 8> [local count: 105119324]:
  # p_2 = PHI <p_6(3)>
  goto <bb 4>; [100.00%]

  <bb 6> [local count: 850510901]:
  goto <bb 3>; [100.00%]

  <bb 7> [local count: 12992276]:

  <bb 4> [local count: 118111600]:
  # p_9 = PHI <p_2(8), p_3(D)(7)>
  return p_9;

}

into

char * t (char * p)
{
  char * _5;
  char _7;

  <bb 2> [local count: 118111600]:
  _7 = *p_3(D);
  if (_7 != 0)
    goto <bb 5>; [89.00%]
  else
    goto <bb 4>; [11.00%]

  <bb 5> [local count: 105119324]:
  _5 = p_3(D) + 1;
  p_10 = .RAWMEMCHR (_5, 0);

  <bb 4> [local count: 118111600]:
  # p_9 = PHI <p_10(5), p_3(D)(2)>
  return p_9;

}

which is fine so far.  However, I haven't made up my mind so far whether it is
worthwhile to spend more time in order to also eliminate the "first unrolling"
of the loop.  I gave it a shot by scheduling the pass prior pass copy header
and ended up with:

char * t (char * p)
{
  <bb 2> [local count: 118111600]:
  p_5 = .RAWMEMCHR (p_3(D), 0);
  return p_5;

}

which seems optimal to me.  The downside of this is that I have to initialize
scalar evolution analysis which might be undesired that early.

All this brings me to the question where do you see this peace of code running?
If in a separate pass when would you schedule it?  If in an existing pass,
which one would you choose?

Another topic which came up is whether there exists a more elegant solution to
my current implementation in order to deal with stores (I'm speaking of the `if
(store_dr)` statement inside of function transform_loop_1).  For example,

extern char *p;
char *t ()
{
  for (; *p; ++p);
  return p;
}

ends up as

char * t ()
{
  char * _1;
  char * _2;
  char _3;
  char * p.1_8;
  char _9;
  char * p.1_10;
  char * p.1_11;

  <bb 2> [local count: 118111600]:
  p.1_8 = p;
  _9 = *p.1_8;
  if (_9 != 0)
    goto <bb 5>; [89.00%]
  else
    goto <bb 7>; [11.00%]

  <bb 5> [local count: 105119324]:

  <bb 3> [local count: 955630225]:
  # p.1_10 = PHI <_1(6), p.1_8(5)>
  _1 = p.1_10 + 1;
  p = _1;
  _3 = *_1;
  if (_3 != 0)
    goto <bb 6>; [89.00%]
  else
    goto <bb 8>; [11.00%]

  <bb 8> [local count: 105119324]:
  # _2 = PHI <_1(3)>
  goto <bb 4>; [100.00%]

  <bb 6> [local count: 850510901]:
  goto <bb 3>; [100.00%]

  <bb 7> [local count: 12992276]:

  <bb 4> [local count: 118111600]:
  # p.1_11 = PHI <_2(8), p.1_8(7)>
  return p.1_11;

}

where inside the loop a load and store occurs.  For a rawmemchr like loop I
have to show that we never load from a memory location to which we write.
Currently I solve this by hard coding those facts which are not generic at all.
I gave compute_data_dependences_for_loop a try which failed to determine the
fact that stores only happen to p[0] and loads from p[i] where i>0.  Maybe
there are more generic solutions to express this in contrast to my current one?

Thanks again for your input so far.  Really appreciated.

Cheers,
Stefan