Hi,

I've committed the attached patch which addresses a problem encountered 
on architectures with parallel insn execution.

CGEN cpus in SID compute the total number of cycles used as 
total_insn_count + total_latency. With parallel execution, 
total_latency may actually decrease: for an insn executed in parallel, 
total_insn_count increases, but the number of cycles used does not.

The sample_gprof method I committed in my previous patch was using 
total_latency alone to determine how many samples to take, which gives 
the wrong answer when total_latency decreases. I have now changed it to 
use total_insn_count + current_step_insn_count + total_latency instead.
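
To make the arithmetic concrete, here is a minimal C++ sketch of the 
cycle computation described above. It is not the code from the patch: 
the struct and function names are hypothetical, and the signed type for 
total_latency is an assumption made so that a decrease is easy to show. 
Only the three counter names come from the description.

```cpp
#include <cstdint>

// Hypothetical stand-in for the counters a CGEN cpu keeps (not SID's class).
struct CpuCounters {
    std::uint64_t total_insn_count;        // insns executed in completed steps
    std::uint64_t current_step_insn_count; // insns executed so far in this step
    std::int64_t  total_latency;           // assumed signed: may go down
};

// Cycles used so far = insns executed + accumulated latency.
inline std::uint64_t elapsed_cycles(const CpuCounters& c) {
    const std::int64_t insns = static_cast<std::int64_t>(
        c.total_insn_count + c.current_step_insn_count);
    return static_cast<std::uint64_t>(insns + c.total_latency);
}
```

For example, starting from total_insn_count = 10 and total_latency = 5 
(15 cycles), two insns executed in parallel in one cycle add 2 to the 
insn count and -1 to the latency. The sum advances to 16, while 
total_latency on its own moves backwards from 5 to 4.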

The patch also corrects the resetting of gprof_prev_cycle so that it 
is reset only when gprof has been turned off dynamically. This allows 
the initial latency of a cpu to be counted properly.
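
One plausible reading of that reset logic, as a rough C++ sketch. This 
is an illustration and not the patch itself: the GprofSampler struct, 
its enabled flag and the cycles_to_sample method are hypothetical names, 
and treating "reset" as "move the marker to the current cycle" is an 
assumption. Only gprof_prev_cycle is taken from the description.

```cpp
#include <cstdint>

// Hypothetical sampler state; only gprof_prev_cycle is a name from the patch.
struct GprofSampler {
    bool enabled = true;                 // assumed flag: is gprof active?
    std::uint64_t gprof_prev_cycle = 0;  // cycle count at the last sample point

    // Called with the cpu's current cycle count (assumed never to decrease).
    // Returns the number of cycles that still need to be sampled.
    std::uint64_t cycles_to_sample(std::uint64_t now_cycles) {
        if (!enabled) {
            // gprof was turned off dynamically: advance the marker so that
            // cycles spent while disabled are not sampled later.
            gprof_prev_cycle = now_cycles;
            return 0;
        }
        // Otherwise the marker is left alone until here, so cycles that
        // elapsed before the first call (initial latency) are included.
        const std::uint64_t elapsed = now_cycles - gprof_prev_cycle;
        gprof_prev_cycle = now_cycles;
        return elapsed;
    }
};
```

Because the marker starts at zero and is not touched while gprof stays 
on, a first call at cycle 7 reports all 7 cycles, including any initial 
latency.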

Dave