public inbox for gsl-discuss@sourceware.org
 help / color / mirror / Atom feed
* Re: [Help-gsl] Spearman rank correlation coefficient
       [not found] <CAGJVmuKKL4HCXViqxAf3b=em05hb7ox5pABm5b_LdKNMxTLMkg@mail.gmail.com>
@ 2012-02-10  0:05 ` Patrick Alken
  2012-02-11 20:53   ` Timothée Flutre
  0 siblings, 1 reply; 5+ messages in thread
From: Patrick Alken @ 2012-02-10  0:05 UTC (permalink / raw)
  To: timflutre, gsl-discuss

Hello,

   It would be best to move this discussion over to gsl-discuss. I think 
it would be very useful to have this function in GSL. Just a few 
comments on your code:

1) The code looks clean and nicely commented. One issue is that since 
you appear to have followed the apache code very closely, there may be a 
licensing issue - I don't know if the Apache license is compatible with 
the GPL. On a quick check, its possible we can use it but it seems we 
need to preserve the original copyright notice.

2) Dynamic allocation - it looks like you dynamically allocate 5 
different arrays to do the calculation. It would be better to either 
make functions like gsl_stats_spearman_alloc and 
gsl_stats_spearman_free, or to pass in a pre-allocated workspace as one 
of the function arguments. Since you're using workspace of different 
types (double,size_t), its probably better to make the alloc/free functions.

3) One of your dynamically allocated arrays is realloc()'d in a loop. Is 
this because the size of the array is unknown before the loop? Perhaps 
there is a way to avoid the realloc's.

4) We also need to think of some automated tests that can be added to 
statistics/test.c to test this function exhaustively and make sure its 
working correctly - even if that consists simply of known output values 
for a few different input cases.

Good work,
Patrick Alken

On 02/09/2012 04:26 PM, Timothée Flutre wrote:
> Hello,
>
> I noticed that only the Pearson correlation coefficient is implemented
> in the GSL (http://www.gnu.org/software/gsl/manual/html_node/Correlation.html).
> However, in quantitative genetics, several authors are using the
> Spearman coef (for instance, Stranger et al "Population genomics of
> human gene expression", Nature Genetics, 2007) as it is less
> influenced by outliers.
>
> Current high-throughput data requires to compute such coef several
> millions of times. Thus I implemented the computation of the Spearman
> coef in GSL-like code. In fact, one just need to rank the input
> vectors and then compute the Pearson coef on them. For the ranking, I
> got inspired by the code from the Apache Math module.
>
> I was thinking that it could be useful to other users to add my piece
> of code to the file "covariance_source.c" of the GSL
> (http://bzr.savannah.gnu.org/lh/gsl/trunk/annotate/head:/statistics/covariance_source.c#L77).
> So here is the code: https://gist.github.com/1784199
>
> I am not very proficient in C, so even if it is not possible to
> include the code in the GSL, don't hesitate to give me advice.
>
> Thanks,
> Tim
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Help-gsl] Spearman rank correlation coefficient
  2012-02-10  0:05 ` [Help-gsl] Spearman rank correlation coefficient Patrick Alken
@ 2012-02-11 20:53   ` Timothée Flutre
  2012-03-02 15:33     ` Timothée Flutre
  0 siblings, 1 reply; 5+ messages in thread
From: Timothée Flutre @ 2012-02-11 20:53 UTC (permalink / raw)
  To: Patrick Alken; +Cc: gsl-discuss

Thanks for your input!

1) Here is the text of the license under which the Apache code is:
http://www.apache.org/licenses/LICENSE-2.0. Indeed it seems that we
would have to indicate their copyright. Is this a problem? In a way,
there is not a lot of different algorithms to compute the Spearman
coefficient...

2) I have made the changes and now have "gsl_stats_spearman_alloc" and
"gsl_stats_spearman_free" functions for the four arrays ranks1,
ranks2, d and p. I added the code as a 2nd file to the same gist:
https://gist.github.com/1784199#file_spearman_v2.c

3) Yes, we don't know in advance how many ties there will be. That's
why I reallocate inside the loop. I don't see how I can do
differently.

4) I added a function performing tests, using the data defined in
statistics/test_float_source.
c. What do I do now? Do I need to have write access to the GSL
repository on Savannah? Or maybe someone else can do it for me?

Thanks,
Tim


On Thu, Feb 9, 2012 at 6:04 PM, Patrick Alken
<Patrick.Alken@colorado.edu> wrote:
>
> Hello,
>
>  It would be best to move this discussion over to gsl-discuss. I think it would be very useful to have this function in GSL. Just a few comments on your code:
>
> 1) The code looks clean and nicely commented. One issue is that since you appear to have followed the apache code very closely, there may be a licensing issue - I don't know if the Apache license is compatible with the GPL. On a quick check, its possible we can use it but it seems we need to preserve the original copyright notice.
>
> 2) Dynamic allocation - it looks like you dynamically allocate 5 different arrays to do the calculation. It would be better to either make functions like gsl_stats_spearman_alloc and gsl_stats_spearman_free, or to pass in a pre-allocated workspace as one of the function arguments. Since you're using workspace of different types (double,size_t), its probably better to make the alloc/free functions.
>
> 3) One of your dynamically allocated arrays is realloc()'d in a loop. Is this because the size of the array is unknown before the loop? Perhaps there is a way to avoid the realloc's.
>
> 4) We also need to think of some automated tests that can be added to statistics/test.c to test this function exhaustively and make sure its working correctly - even if that consists simply of known output values for a few different input cases.
>
> Good work,
> Patrick Alken
>
>
> On 02/09/2012 04:26 PM, Timothée Flutre wrote:
>>
>> Hello,
>>
>> I noticed that only the Pearson correlation coefficient is implemented
>> in the GSL (http://www.gnu.org/software/gsl/manual/html_node/Correlation.html).
>> However, in quantitative genetics, several authors are using the
>> Spearman coef (for instance, Stranger et al "Population genomics of
>> human gene expression", Nature Genetics, 2007) as it is less
>> influenced by outliers.
>>
>> Current high-throughput data requires to compute such coef several
>> millions of times. Thus I implemented the computation of the Spearman
>> coef in GSL-like code. In fact, one just need to rank the input
>> vectors and then compute the Pearson coef on them. For the ranking, I
>> got inspired by the code from the Apache Math module.
>>
>> I was thinking that it could be useful to other users to add my piece
>> of code to the file "covariance_source.c" of the GSL
>> (http://bzr.savannah.gnu.org/lh/gsl/trunk/annotate/head:/statistics/covariance_source.c#L77).
>> So here is the code: https://gist.github.com/1784199
>>
>> I am not very proficient in C, so even if it is not possible to
>> include the code in the GSL, don't hesitate to give me advice.
>>
>> Thanks,
>> Tim
>>
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Help-gsl] Spearman rank correlation coefficient
  2012-02-11 20:53   ` Timothée Flutre
@ 2012-03-02 15:33     ` Timothée Flutre
  2012-03-02 15:59       ` Rhys Ulerich
  2012-03-02 16:01       ` Patrick Alken
  0 siblings, 2 replies; 5+ messages in thread
From: Timothée Flutre @ 2012-03-02 15:33 UTC (permalink / raw)
  To: gsl-discuss

Hello,

a month ago I proposed an implementation of the Spearman rank
correlation coefficient as it is missing in the GSL (see emails
below). I took into account some advice and the updated code is
available here:
https://gist.github.com/1784199#file_spearman_v2.c

Since then, I didn't have any answer. I'm not an experienced C
programmer, thus my code may need further improvements, but still it
can be useful to others. Thus can I submit it to the GSL main trunk?
I've never done that before. Can someone indicate me what to do?
Should I request "developer write access" for instance?

Thanks in advance,
Tim


2012/2/11 Timothée Flutre <timflutre@gmail.com>
>
> Thanks for your input!
>
> 1) Here is the text of the license under which the Apache code is:
> http://www.apache.org/licenses/LICENSE-2.0. Indeed it seems that we
> would have to indicate their copyright. Is this a problem? In a way,
> there is not a lot of different algorithms to compute the Spearman
> coefficient...
>
> 2) I have made the changes and now have "gsl_stats_spearman_alloc" and
> "gsl_stats_spearman_free" functions for the four arrays ranks1,
> ranks2, d and p. I added the code as a 2nd file to the same gist:
> https://gist.github.com/1784199#file_spearman_v2.c
>
> 3) Yes, we don't know in advance how many ties there will be. That's
> why I reallocate inside the loop. I don't see how I can do
> differently.
>
> 4) I added a function performing tests, using the data defined in
> statistics/test_float_source.
> c. What do I do now? Do I need to have write access to the GSL
> repository on Savannah? Or maybe someone else can do it for me?
>
> Thanks,
> Tim
>
>
> On Thu, Feb 9, 2012 at 6:04 PM, Patrick Alken
> <Patrick.Alken@colorado.edu> wrote:
> >
> > Hello,
> >
> >  It would be best to move this discussion over to gsl-discuss. I think
> > it would be very useful to have this function in GSL. Just a few comments on
> > your code:
> >
> > 1) The code looks clean and nicely commented. One issue is that since
> > you appear to have followed the apache code very closely, there may be a
> > licensing issue - I don't know if the Apache license is compatible with the
> > GPL. On a quick check, its possible we can use it but it seems we need to
> > preserve the original copyright notice.
> >
> > 2) Dynamic allocation - it looks like you dynamically allocate 5
> > different arrays to do the calculation. It would be better to either make
> > functions like gsl_stats_spearman_alloc and gsl_stats_spearman_free, or to
> > pass in a pre-allocated workspace as one of the function arguments. Since
> > you're using workspace of different types (double,size_t), its probably
> > better to make the alloc/free functions.
> >
> > 3) One of your dynamically allocated arrays is realloc()'d in a loop. Is
> > this because the size of the array is unknown before the loop? Perhaps there
> > is a way to avoid the realloc's.
> >
> > 4) We also need to think of some automated tests that can be added to
> > statistics/test.c to test this function exhaustively and make sure its
> > working correctly - even if that consists simply of known output values for
> > a few different input cases.
> >
> > Good work,
> > Patrick Alken
> >
> >
> > On 02/09/2012 04:26 PM, Timothée Flutre wrote:
> >>
> >> Hello,
> >>
> >> I noticed that only the Pearson correlation coefficient is implemented
> >> in the GSL
> >> (http://www.gnu.org/software/gsl/manual/html_node/Correlation.html).
> >> However, in quantitative genetics, several authors are using the
> >> Spearman coef (for instance, Stranger et al "Population genomics of
> >> human gene expression", Nature Genetics, 2007) as it is less
> >> influenced by outliers.
> >>
> >> Current high-throughput data requires to compute such coef several
> >> millions of times. Thus I implemented the computation of the Spearman
> >> coef in GSL-like code. In fact, one just need to rank the input
> >> vectors and then compute the Pearson coef on them. For the ranking, I
> >> got inspired by the code from the Apache Math module.
> >>
> >> I was thinking that it could be useful to other users to add my piece
> >> of code to the file "covariance_source.c" of the GSL
> >>
> >> (http://bzr.savannah.gnu.org/lh/gsl/trunk/annotate/head:/statistics/covariance_source.c#L77).
> >> So here is the code: https://gist.github.com/1784199
> >>
> >> I am not very proficient in C, so even if it is not possible to
> >> include the code in the GSL, don't hesitate to give me advice.
> >>
> >> Thanks,
> >> Tim
> >>
> >

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Help-gsl] Spearman rank correlation coefficient
  2012-03-02 15:33     ` Timothée Flutre
@ 2012-03-02 15:59       ` Rhys Ulerich
  2012-03-02 16:01       ` Patrick Alken
  1 sibling, 0 replies; 5+ messages in thread
From: Rhys Ulerich @ 2012-03-02 15:59 UTC (permalink / raw)
  To: timflutre; +Cc: gsl-discuss

Hi Tim,

> a month ago I proposed an implementation of the Spearman rank
> correlation coefficient ...
> https://gist.github.com/1784199#file_spearman_v2.c
>
> ... Thus can I submit it to the GSL main trunk?
> I've never done that before. Can someone indicate me what to do?
> Should I request "developer write access" for instance?

From reviewing the code at the github...

Your gsl_stats_spearman_workspace (e.g.) should be passed into your
gsl_stats_spearman method.
You'll want to declare a struct-based workspace rather than
allocate/deallocate many separate variables in your
gsl_stats_spearman_alloc/gsl_stats_spearman_free.  As an example,
check out gsl_bspline_workspace from bspline/gsl_bspline.h which is
allocated/deallocated by gsl_bspline_alloc/gsl_bspline_free.  You'll
want to add the allocation-failure error handling as in the bspline
example.

You gsl_stats_spearman_rank implementation should always be checking
if calloc/realloc failed and handle that error appropriately

Helper functions like "resolveTies" which should not be publicly
visible should be declared static.  That way they can only be called
from within the same file.

Finally, the Apache License 2.0 origin of the logic should be made
more clear in the file.  Just attributing the logic with "Source"
seems insufficient-- I'd organize the file so everything below some
point if the Apache-derived code and everything above that point is
your GSL-ready GPL3 logic.

Give a shout when you're ready and I'll take another look.  I can help
with getting the final logic into GSL's trunk.

- Rhys

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Help-gsl] Spearman rank correlation coefficient
  2012-03-02 15:33     ` Timothée Flutre
  2012-03-02 15:59       ` Rhys Ulerich
@ 2012-03-02 16:01       ` Patrick Alken
  1 sibling, 0 replies; 5+ messages in thread
From: Patrick Alken @ 2012-03-02 16:01 UTC (permalink / raw)
  To: gsl-discuss

Yes I apologize for not responding sooner. I looked at your updated 
code, and just have a few comments:

1. I recommend making a gsl_stats_spearman_workspace struct to hold the 
allocated variables (ranks1, ranks2, d, p). This way the user doesn't 
need to know the details of everything that gets allocated and can 
simply call the alloc routine with the size n parameter and pass the 
struct pointer to subsequent routines (this is the way that most of gsl 
is designed).

2. Typically new features to GSL are made as extensions first (see: 
http://www.gnu.org/software/gsl/ under "Extensions"). This way people 
can try out the new feature and test it well before it is incorporated 
into the main source tree. There are examples there on how to make your 
code into an extension. I should be able to add your extension to the 
web page when its ready.

Patrick

On 03/02/2012 08:32 AM, Timothée Flutre wrote:
> Hello,
>
> a month ago I proposed an implementation of the Spearman rank
> correlation coefficient as it is missing in the GSL (see emails
> below). I took into account some advice and the updated code is
> available here:
> https://gist.github.com/1784199#file_spearman_v2.c
>
> Since then, I didn't have any answer. I'm not an experienced C
> programmer, thus my code may need further improvements, but still it
> can be useful to others. Thus can I submit it to the GSL main trunk?
> I've never done that before. Can someone indicate me what to do?
> Should I request "developer write access" for instance?
>
> Thanks in advance,
> Tim
>
>
> 2012/2/11 Timothée Flutre<timflutre@gmail.com>
>> Thanks for your input!
>>
>> 1) Here is the text of the license under which the Apache code is:
>> http://www.apache.org/licenses/LICENSE-2.0. Indeed it seems that we
>> would have to indicate their copyright. Is this a problem? In a way,
>> there is not a lot of different algorithms to compute the Spearman
>> coefficient...
>>
>> 2) I have made the changes and now have "gsl_stats_spearman_alloc" and
>> "gsl_stats_spearman_free" functions for the four arrays ranks1,
>> ranks2, d and p. I added the code as a 2nd file to the same gist:
>> https://gist.github.com/1784199#file_spearman_v2.c
>>
>> 3) Yes, we don't know in advance how many ties there will be. That's
>> why I reallocate inside the loop. I don't see how I can do
>> differently.
>>
>> 4) I added a function performing tests, using the data defined in
>> statistics/test_float_source.
>> c. What do I do now? Do I need to have write access to the GSL
>> repository on Savannah? Or maybe someone else can do it for me?
>>
>> Thanks,
>> Tim
>>
>>
>> On Thu, Feb 9, 2012 at 6:04 PM, Patrick Alken
>> <Patrick.Alken@colorado.edu>  wrote:
>>> Hello,
>>>
>>>   It would be best to move this discussion over to gsl-discuss. I think
>>> it would be very useful to have this function in GSL. Just a few comments on
>>> your code:
>>>
>>> 1) The code looks clean and nicely commented. One issue is that since
>>> you appear to have followed the apache code very closely, there may be a
>>> licensing issue - I don't know if the Apache license is compatible with the
>>> GPL. On a quick check, its possible we can use it but it seems we need to
>>> preserve the original copyright notice.
>>>
>>> 2) Dynamic allocation - it looks like you dynamically allocate 5
>>> different arrays to do the calculation. It would be better to either make
>>> functions like gsl_stats_spearman_alloc and gsl_stats_spearman_free, or to
>>> pass in a pre-allocated workspace as one of the function arguments. Since
>>> you're using workspace of different types (double,size_t), its probably
>>> better to make the alloc/free functions.
>>>
>>> 3) One of your dynamically allocated arrays is realloc()'d in a loop. Is
>>> this because the size of the array is unknown before the loop? Perhaps there
>>> is a way to avoid the realloc's.
>>>
>>> 4) We also need to think of some automated tests that can be added to
>>> statistics/test.c to test this function exhaustively and make sure its
>>> working correctly - even if that consists simply of known output values for
>>> a few different input cases.
>>>
>>> Good work,
>>> Patrick Alken
>>>
>>>
>>> On 02/09/2012 04:26 PM, Timothée Flutre wrote:
>>>> Hello,
>>>>
>>>> I noticed that only the Pearson correlation coefficient is implemented
>>>> in the GSL
>>>> (http://www.gnu.org/software/gsl/manual/html_node/Correlation.html).
>>>> However, in quantitative genetics, several authors are using the
>>>> Spearman coef (for instance, Stranger et al "Population genomics of
>>>> human gene expression", Nature Genetics, 2007) as it is less
>>>> influenced by outliers.
>>>>
>>>> Current high-throughput data requires to compute such coef several
>>>> millions of times. Thus I implemented the computation of the Spearman
>>>> coef in GSL-like code. In fact, one just need to rank the input
>>>> vectors and then compute the Pearson coef on them. For the ranking, I
>>>> got inspired by the code from the Apache Math module.
>>>>
>>>> I was thinking that it could be useful to other users to add my piece
>>>> of code to the file "covariance_source.c" of the GSL
>>>>
>>>> (http://bzr.savannah.gnu.org/lh/gsl/trunk/annotate/head:/statistics/covariance_source.c#L77).
>>>> So here is the code: https://gist.github.com/1784199
>>>>
>>>> I am not very proficient in C, so even if it is not possible to
>>>> include the code in the GSL, don't hesitate to give me advice.
>>>>
>>>> Thanks,
>>>> Tim
>>>>

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2012-03-02 16:01 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAGJVmuKKL4HCXViqxAf3b=em05hb7ox5pABm5b_LdKNMxTLMkg@mail.gmail.com>
2012-02-10  0:05 ` [Help-gsl] Spearman rank correlation coefficient Patrick Alken
2012-02-11 20:53   ` Timothée Flutre
2012-03-02 15:33     ` Timothée Flutre
2012-03-02 15:59       ` Rhys Ulerich
2012-03-02 16:01       ` Patrick Alken

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).