I suggest adding one more kind of output to gprofng:

(I assume the sampling is done on wall-clock time rather than CPU time, so
that time spent blocked in I/O is visible.)

Let the user choose a small number N, like 10 or 20, then select N stack
samples at random (with source-line information) and display them, either as
a tree or in raw form. The point is that any performance problem consists of
activity that isn't necessary, and if that activity accounts for fraction F
of the time, it will show up on about N*F of the samples, on average. High
precision of measurement is not necessary, but precision of insight is.
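To make this concrete, here is a rough sketch in Python of what I have in mind. It is not gprofng code: the stack format (a list of "file:line function" strings) and the function names are made up for illustration, and the probability helper simply assumes the samples are independent.

```python
import random
from collections import Counter
from math import comb


def pick_samples(stacks, n, seed=None):
    """Choose n stack samples uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(stacks, min(n, len(stacks)))


def frame_counts(samples):
    """For each frame, count how many samples it appears on (once per sample)."""
    counts = Counter()
    for stack in samples:
        counts.update(set(stack))
    return counts


def prob_seen_at_least(k, n, f):
    """Probability that an activity taking fraction f of the time shows up
    on at least k of n independent samples (binomial tail)."""
    return sum(comb(n, i) * f**i * (1 - f) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Hypothetical profile: 30% of all samples pass through format_msg.
    wasteful = ["main.c:10 main", "log.c:88 format_msg"]
    useful = ["main.c:12 main", "work.c:40 compute"]
    all_stacks = [wasteful] * 300 + [useful] * 700

    chosen = pick_samples(all_stacks, 10, seed=1)
    for stack in chosen:
        print(" -> ".join(stack))
    for frame, count in frame_counts(chosen).most_common():
        print(f"{count:2d}/{len(chosen)}  {frame}")

    # Chance of seeing a 30% problem on two or more of 10 samples.
    print(round(prob_seen_at_least(2, 10, 0.3), 2))
```

With N = 10 and F = 0.3, the last line prints a value of about 0.85, which is why a handful of samples is usually enough: anything costly enough to be worth fixing tends to appear on more than one of them.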

If there are multiple threads, let each sample capture all running threads
at the same instant, so the user can see which threads are waiting for which
other threads at that point in time.
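Again as a sketch only: assuming the profiler records (timestamp, thread id, stack) tuples, which is a format I am inventing here and not gprofng's, the per-thread records could be grouped by sampling instant and then N instants picked at random, so each displayed sample shows every thread side by side.

```python
import random
from collections import defaultdict


def group_by_instant(records):
    """Map each sample timestamp to a {thread_id: stack} snapshot."""
    snapshots = defaultdict(dict)
    for timestamp, thread_id, stack in records:
        snapshots[timestamp][thread_id] = stack
    return dict(snapshots)


def pick_snapshots(records, n, seed=None):
    """Pick n sampling instants at random; return them in time order,
    each with the stacks of all threads captured at that instant."""
    snapshots = group_by_instant(records)
    rng = random.Random(seed)
    chosen = rng.sample(sorted(snapshots), min(n, len(snapshots)))
    return [(t, snapshots[t]) for t in sorted(chosen)]


if __name__ == "__main__":
    # Hypothetical data: thread 2 mostly waits while thread 1 works.
    records = []
    for t in range(100):
        records.append((t, 1, ["main", "work.c:40 compute"]))
        records.append((t, 2, ["worker", "queue.c:17 wait_for_item"]))

    for timestamp, threads in pick_snapshots(records, 3, seed=1):
        print(f"sample at t={timestamp}")
        for thread_id in sorted(threads):
            print(f"  thread {thread_id}: " + " -> ".join(threads[thread_id]))
```

Seeing thread 2 parked in the same wait on most snapshots while thread 1 is busy is the kind of cross-thread insight that per-thread totals tend to hide.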

Let me know if this makes sense, or if you've already done it.

Thanks,
Mike Dunlavey

P.S. I've been advocating this for years on Stack Overflow. People who've
tried it agree that it works. I've also got a YouTube video about it.