On Wed, 6 Jan 2010, Rhys Ulerich wrote:

>> I recall from my benchmarking days that -- depending on compiler --
>> there is a small dereferencing penalty for packed matrices (vectors
>> packed into dereferencing **..* pointers) compared to doing the offset
>> arithmetic via brute force inline or via a macro.
>> ......
>> I haven't run the benchmark recently and don't know how large it
>> currently is.  It was never so large that it stopped me from using
>> repacked pointers for code clarity.
>
> Mostly unscientific, but worth tossing into the mix:
>
> Using Intel 10.1 compilers on a fairly recent AMD chip, 100,000 iterations
> of doing the nested pointers approach is neck-and-neck with index arithmetic
> on a 10x10 double matrix.  For the 100x100 case it takes 1.3 times longer
> to iterate using the nested pointers.  Work in the inner loop "compute
> kernel" is *= against a constant scalar.  Optimization flags on -O3.  I've
> seen similar behavior on recent GNU compilers.

That sounds partly like a cache effect -- a 10x10 matrix of doubles (800
bytes) almost certainly stays in L1, while 100x100 (80,000 bytes) won't
fit.  My own experience is similar, although I
don't recall the multiplier being as large as 1.3 (but then, I was doing
stream and stream-like tests with very large vectors, which means that
one spends more time in a vector streaming mode and minimizes
cache-thrashing when turning corners).  And my memory could be faulty --
I'm an old guy, after all, early Alzheimer's... ;-)
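
For anyone following along, here is a minimal sketch of the two access
patterns being compared: the same "*= constant" kernel run once with
brute-force offset arithmetic and once through a vector of row pointers
repacked onto the same contiguous block.  This is NOT Rhys's test code,
and the names (scale_indexed, scale_nested, make_row_pointers) are made up
for illustration.  It assumes a square, row-major matrix of doubles and
does no timing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical kernel: scale an n x n row-major matrix held in one
 * contiguous block, doing the offset arithmetic by hand. */
void scale_indexed(double *a, size_t n, double c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            a[i * n + j] *= c;
}

/* Same kernel through a vector of row pointers: each m[i][j] costs an
 * extra load of m[i] unless the compiler hoists it out of the j loop. */
void scale_nested(double **m, size_t n, double c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            m[i][j] *= c;
}

/* "Repack" an existing contiguous block: build n row pointers into it.
 * Returns NULL if the allocation fails; the caller frees the result. */
double **make_row_pointers(double *a, size_t n)
{
    assert(a != NULL);
    double **m = malloc(n * sizeof *m);
    if (m == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        m[i] = a + i * n;   /* row i starts n*i doubles into the block */
    return m;
}
```

Because the rows are repacked onto a single block, both versions touch
memory in the same order; any timing difference between them then comes
from the extra dereference and how well the compiler optimizes it, not
from a different data layout.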

    rgb

>
> I'm happy to provide the test code if anyone's interested.
>
> - Rhys
>

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu