public inbox for gsl-discuss@sourceware.org
* correlation coefficient
@ 2007-03-15 23:06 Patrick Alken
  2007-03-15 23:18 ` Ben Klemens
  2007-03-16 15:51 ` Brian Gough
  0 siblings, 2 replies; 9+ messages in thread
From: Patrick Alken @ 2007-03-15 23:06 UTC (permalink / raw)
  To: gsl-discuss

Hi,

  Is there any interest in putting a new function in the
statistics area for calculating the Pearson correlation coefficient?
I think this can be done safely in gsl by just using

r = gsl_stats_covariance(x,y) / (gsl_stats_sd(x) * gsl_stats_sd(y))

but it would be more efficient to calculate everything in 1 pass
through the data and I believe there is a stable algorithm to do
this (similar to how the mean/variance is calculated). This is
such a common function for people who work with data so I think
it'd be nice to have it in gsl :)

Patrick Alken


* Re: correlation coefficient
  2007-03-15 23:06 correlation coefficient Patrick Alken
@ 2007-03-15 23:18 ` Ben Klemens
  2007-03-16 15:51 ` Brian Gough
  1 sibling, 0 replies; 9+ messages in thread
From: Ben Klemens @ 2007-03-15 23:18 UTC (permalink / raw)
  To: Patrick Alken, gsl-discuss; +Cc: gsl-discuss

>   Is there any interest in putting a new function in the
> statistics area for calculating the Pearson correlation coefficient?
> I think this can be done safely in gsl by just using
> 
> r = gsl_stats_covariance(x,y) / (gsl_stats_sd(x) * gsl_stats_sd(y))
> 
> but it would be more efficient to calculate everything in 1 pass
> through the data and I believe there is a stable algorithm to do
> this (similar to how the mean/variance is calculated). This is
> such a common function for people who work with data so I think
> it'd be nice to have it in gsl :)

I've been working on a library of stats functions to complement
the GSL, so it naturally includes a correlation matrix
function (apop_correlation_matrix). The library home page is at
http://apophenia.info . There's an accompanying book whose home page
(this week) is at http://avocado.econ.jhu.edu/modeling .

Responding to your request for a covariance with an entire package may
be overkill, but I assume if you're looking for one statistic, you're
probably looking for several more.

Regards,

BK


* Re: correlation coefficient
  2007-03-15 23:06 correlation coefficient Patrick Alken
  2007-03-15 23:18 ` Ben Klemens
@ 2007-03-16 15:51 ` Brian Gough
  2007-03-16 16:01   ` keith.briggs
                     ` (2 more replies)
  1 sibling, 3 replies; 9+ messages in thread
From: Brian Gough @ 2007-03-16 15:51 UTC (permalink / raw)
  To: Patrick Alken; +Cc: gsl-discuss

At Thu, 15 Mar 2007 17:06:43 -0600,
Patrick Alken wrote:
>   Is there any interest in putting a new function in the
> statistics area for calculating the Pearson correlation coefficient?
> I think this can be done safely in gsl by just using
> 
> r = gsl_stats_covariance(x,y) / (gsl_stats_sd(x) * gsl_stats_sd(y))
> 
> but it would be more efficient to calculate everything in 1 pass
> through the data and I believe there is a stable algorithm to do
> this (similar to how the mean/variance is calculated). 

Yes, sounds like a good idea to me. Go ahead and add it in
covariance_source.c if you have the 1-pass algorithm.

-- 
Brian Gough


* RE: correlation coefficient
  2007-03-16 15:51 ` Brian Gough
@ 2007-03-16 16:01   ` keith.briggs
  2007-03-16 16:05   ` James Theiler
  2007-03-16 22:15   ` Patrick Alken
  2 siblings, 0 replies; 9+ messages in thread
From: keith.briggs @ 2007-03-16 16:01 UTC (permalink / raw)
  To: gsl-discuss

>>   Is there any interest in putting a new function in the
>> statistics area for calculating the Pearson correlation coefficient?
>> ... it would be more efficient to calculate everything in 1 pass
>> through the data 
>Yes, sounds like a good idea to me. Go ahead and add it in
>covariance_source.c if you have the 1-pass algorithm.

A one-pass algorithm for this and also for polynomial least-squares can be found at http://keithbriggs.info/pipemath.html.

Keith 

	


* Re: correlation coefficient
  2007-03-16 15:51 ` Brian Gough
  2007-03-16 16:01   ` keith.briggs
@ 2007-03-16 16:05   ` James Theiler
  2007-03-16 16:36     ` Patrick Alken
  2007-03-16 22:15   ` Patrick Alken
  2 siblings, 1 reply; 9+ messages in thread
From: James Theiler @ 2007-03-16 16:05 UTC (permalink / raw)
  To: Brian Gough; +Cc: Patrick Alken, gsl-discuss

On Fri, 16 Mar 2007, Brian Gough wrote:

] At Thu, 15 Mar 2007 17:06:43 -0600,
] Patrick Alken wrote:
] >   Is there any interest in putting a new function in the
] > statistics area for calculating the Pearson correlation coefficient?
] > I think this can be done safely in gsl by just using
] > 
] > r = gsl_stats_covariance(x,y) / (gsl_stats_sd(x) * gsl_stats_sd(y))
] > 
] > but it would be more efficient to calculate everything in 1 pass
] > through the data and I believe there is a stable algorithm to do
] > this (similar to how the mean/variance is calculated). 
] 
] Yes, sounds like a good idea to me. Go ahead and add it in
] covariance_source.c if you have the 1-pass algorithm.
] 
] 

be sure to include Pearson in the name of the function, since there
are also Spearman's and Kendall's correlation statistics.  (on second
thought, contradicting myself, those two are specialized nonparametric
measures, and so maybe it's reasonable to have Pearson's be the
default.)

jt

-- 
James Theiler
MS-B244, ISR-2, LANL; Los Alamos, NM 87544
Space and Remote Sensing Sciences; Los Alamos National Laboratory
http://public.lanl.gov/jt



* Re: correlation coefficient
  2007-03-16 16:05   ` James Theiler
@ 2007-03-16 16:36     ` Patrick Alken
  0 siblings, 0 replies; 9+ messages in thread
From: Patrick Alken @ 2007-03-16 16:36 UTC (permalink / raw)
  To: gsl-discuss

> be sure to include Pearson in the name of the function, since there
> are also Spearman's and Kendall's correlation statistics.  (on second
> thought, contradicting myself, those two are specialized nonparametric
> measures, and so maybe it's reasonable to have Pearson's be the
> default.)

matlab and octave both use the name "corrcoef". I don't know about
matlab, but octave has separate functions "spearman" and "kendall"
for the other statistics, so I was thinking that
gsl_stats_correlation would be ok, and later if people want the
other functions they can name them gsl_stats_spearman etc. I'll
make sure the docs clearly state it's the Pearson coefficient
for the function.

Patrick Alken


* Re: correlation coefficient
  2007-03-16 15:51 ` Brian Gough
  2007-03-16 16:01   ` keith.briggs
  2007-03-16 16:05   ` James Theiler
@ 2007-03-16 22:15   ` Patrick Alken
  2007-03-20 14:11     ` Brian Gough
  2 siblings, 1 reply; 9+ messages in thread
From: Patrick Alken @ 2007-03-16 22:15 UTC (permalink / raw)
  To: gsl-discuss

> Yes, sounds like a good idea to me. Go ahead and add it in
> covariance_source.c if you have the 1-pass algorithm.

Ok, added under gsl_stats_correlation. I tested it successfully with
the data set:

x = 9.0e9 + i + 1
y = 9.0e9 - i - 1

for i = 1..100

which normally causes numerically unstable codes to fail (the large
9.0e9 offset provokes cancellation). Also tested it with
lots of random x/y vectors and compared against the result of:

gsl_stats_covariance(x,y) / (gsl_stats_sd(x) * gsl_stats_sd(y))

all with excellent results: the error was below GSL_DBL_EPSILON in all
cases. The test codes in statistics/ aren't really set up for
this type of exhaustive testing, so I just adapted the test that
is in there for the correlation routine.

Patrick Alken


* Re: correlation coefficient
  2007-03-16 22:15   ` Patrick Alken
@ 2007-03-20 14:11     ` Brian Gough
  2007-03-20 15:41       ` Patrick Alken
  0 siblings, 1 reply; 9+ messages in thread
From: Brian Gough @ 2007-03-20 14:11 UTC (permalink / raw)
  To: Patrick Alken; +Cc: gsl-discuss

At Fri, 16 Mar 2007 16:15:32 -0600,
Patrick Alken wrote:
> Ok, added under gsl_stats_correlation. I tested it successfully with
> the data set...

Looks good.  Incidentally if you didn't find it there is a script
scripts/mkheaders.pl which can be used to update the various headers
automatically from the 'float' version.

-- 
Brian Gough


* Re: correlation coefficient
  2007-03-20 14:11     ` Brian Gough
@ 2007-03-20 15:41       ` Patrick Alken
  0 siblings, 0 replies; 9+ messages in thread
From: Patrick Alken @ 2007-03-20 15:41 UTC (permalink / raw)
  To: gsl-discuss

> Looks good.  Incidentally if you didn't find it there is a script
> scripts/mkheaders.pl which can be used to update the various headers
> automatically from the 'float' version.

Ah..good to know :-)

