public inbox for gsl-discuss@sourceware.org
 help / color / mirror / Atom feed
From: "Timothée Flutre" <timflutre@gmail.com>
To: Patrick Alken <Patrick.Alken@colorado.edu>
Cc: "gsl-discuss@sourceware.org" <gsl-discuss@sourceware.org>
Subject: Re: [Help-gsl] Spearman rank correlation coefficient
Date: Sat, 11 Feb 2012 20:53:00 -0000	[thread overview]
Message-ID: <CAGJVmuLHJp9AGYVwikJDQy-=LdkxyRdAsGVDbcF7sHrJiu5Kdw@mail.gmail.com> (raw)
In-Reply-To: <4F345F28.3060102@colorado.edu>

Thanks for your input!

1) Here is the text of the license under which the Apache code is:
http://www.apache.org/licenses/LICENSE-2.0. Indeed it seems that we
would have to indicate their copyright. Is this a problem? In a way,
there is not a lot of different algorithms to compute the Spearman
coefficient...

2) I have made the changes and now have "gsl_stats_spearman_alloc" and
"gsl_stats_spearman_free" functions for the four arrays ranks1,
ranks2, d and p. I added the code as a 2nd file to the same gist:
https://gist.github.com/1784199#file_spearman_v2.c

3) Yes, we don't know in advance how many ties there will be. That's
why I reallocate inside the loop. I don't see how I can do
differently.

4) I added a function performing tests, using the data defined in
statistics/test_float_source.
c. What do I do now? Do I need to have write access to the GSL
repository on Savannah? Or maybe someone else can do it for me?

Thanks,
Tim


On Thu, Feb 9, 2012 at 6:04 PM, Patrick Alken
<Patrick.Alken@colorado.edu> wrote:
>
> Hello,
>
>  It would be best to move this discussion over to gsl-discuss. I think it would be very useful to have this function in GSL. Just a few comments on your code:
>
> 1) The code looks clean and nicely commented. One issue is that since you appear to have followed the apache code very closely, there may be a licensing issue - I don't know if the Apache license is compatible with the GPL. On a quick check, its possible we can use it but it seems we need to preserve the original copyright notice.
>
> 2) Dynamic allocation - it looks like you dynamically allocate 5 different arrays to do the calculation. It would be better to either make functions like gsl_stats_spearman_alloc and gsl_stats_spearman_free, or to pass in a pre-allocated workspace as one of the function arguments. Since you're using workspace of different types (double,size_t), its probably better to make the alloc/free functions.
>
> 3) One of your dynamically allocated arrays is realloc()'d in a loop. Is this because the size of the array is unknown before the loop? Perhaps there is a way to avoid the realloc's.
>
> 4) We also need to think of some automated tests that can be added to statistics/test.c to test this function exhaustively and make sure its working correctly - even if that consists simply of known output values for a few different input cases.
>
> Good work,
> Patrick Alken
>
>
> On 02/09/2012 04:26 PM, Timothée Flutre wrote:
>>
>> Hello,
>>
>> I noticed that only the Pearson correlation coefficient is implemented
>> in the GSL (http://www.gnu.org/software/gsl/manual/html_node/Correlation.html).
>> However, in quantitative genetics, several authors are using the
>> Spearman coef (for instance, Stranger et al "Population genomics of
>> human gene expression", Nature Genetics, 2007) as it is less
>> influenced by outliers.
>>
>> Current high-throughput data requires to compute such coef several
>> millions of times. Thus I implemented the computation of the Spearman
>> coef in GSL-like code. In fact, one just need to rank the input
>> vectors and then compute the Pearson coef on them. For the ranking, I
>> got inspired by the code from the Apache Math module.
>>
>> I was thinking that it could be useful to other users to add my piece
>> of code to the file "covariance_source.c" of the GSL
>> (http://bzr.savannah.gnu.org/lh/gsl/trunk/annotate/head:/statistics/covariance_source.c#L77).
>> So here is the code: https://gist.github.com/1784199
>>
>> I am not very proficient in C, so even if it is not possible to
>> include the code in the GSL, don't hesitate to give me advice.
>>
>> Thanks,
>> Tim
>>
>

  reply	other threads:[~2012-02-11 20:53 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CAGJVmuKKL4HCXViqxAf3b=em05hb7ox5pABm5b_LdKNMxTLMkg@mail.gmail.com>
2012-02-10  0:05 ` Patrick Alken
2012-02-11 20:53   ` Timothée Flutre [this message]
2012-03-02 15:33     ` Timothée Flutre
2012-03-02 15:59       ` Rhys Ulerich
2012-03-02 16:01       ` Patrick Alken

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAGJVmuLHJp9AGYVwikJDQy-=LdkxyRdAsGVDbcF7sHrJiu5Kdw@mail.gmail.com' \
    --to=timflutre@gmail.com \
    --cc=Patrick.Alken@colorado.edu \
    --cc=gsl-discuss@sourceware.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).