public inbox for gsl-discuss@sourceware.org
 help / color / mirror / Atom feed
From: Patrick Alken <patrick.alken@Colorado.EDU>
To: gsl-discuss@sourceware.org
Subject: Re: [Help-gsl] Spearman rank correlation coefficient
Date: Fri, 02 Mar 2012 16:01:00 -0000	[thread overview]
Message-ID: <4F50EEDD.8070303@colorado.edu> (raw)
In-Reply-To: <CAGJVmuKKNcAC5DkdtMSUfd7BLSYWDfQK5wW12H5Sb8nU_bGMTw@mail.gmail.com>

Yes I apologize for not responding sooner. I looked at your updated 
code, and just have a few comments:

1. I recommend making a gsl_stats_spearman_workspace struct to hold the 
allocated variables (ranks1, ranks2, d, p). This way the user doesn't 
need to know the details of everything that gets allocated and can 
simply call the alloc routine with the size n parameter and pass the 
struct pointer to subsequent routines (this is the way that most of gsl 
is designed).

2. Typically new features to GSL are made as extensions first (see: 
http://www.gnu.org/software/gsl/ under "Extensions"). This way people 
can try out the new feature and test it well before it is incorporated 
into the main source tree. There are examples there on how to make your 
code into an extension. I should be able to add your extension to the 
web page when its ready.

Patrick

On 03/02/2012 08:32 AM, Timothée Flutre wrote:
> Hello,
>
> a month ago I proposed an implementation of the Spearman rank
> correlation coefficient as it is missing in the GSL (see emails
> below). I took into account some advice and the updated code is
> available here:
> https://gist.github.com/1784199#file_spearman_v2.c
>
> Since then, I didn't have any answer. I'm not an experienced C
> programmer, thus my code may need further improvements, but still it
> can be useful to others. Thus can I submit it to the GSL main trunk?
> I've never done that before. Can someone indicate me what to do?
> Should I request "developer write access" for instance?
>
> Thanks in advance,
> Tim
>
>
> 2012/2/11 Timothée Flutre<timflutre@gmail.com>
>> Thanks for your input!
>>
>> 1) Here is the text of the license under which the Apache code is:
>> http://www.apache.org/licenses/LICENSE-2.0. Indeed it seems that we
>> would have to indicate their copyright. Is this a problem? In a way,
>> there is not a lot of different algorithms to compute the Spearman
>> coefficient...
>>
>> 2) I have made the changes and now have "gsl_stats_spearman_alloc" and
>> "gsl_stats_spearman_free" functions for the four arrays ranks1,
>> ranks2, d and p. I added the code as a 2nd file to the same gist:
>> https://gist.github.com/1784199#file_spearman_v2.c
>>
>> 3) Yes, we don't know in advance how many ties there will be. That's
>> why I reallocate inside the loop. I don't see how I can do
>> differently.
>>
>> 4) I added a function performing tests, using the data defined in
>> statistics/test_float_source.
>> c. What do I do now? Do I need to have write access to the GSL
>> repository on Savannah? Or maybe someone else can do it for me?
>>
>> Thanks,
>> Tim
>>
>>
>> On Thu, Feb 9, 2012 at 6:04 PM, Patrick Alken
>> <Patrick.Alken@colorado.edu>  wrote:
>>> Hello,
>>>
>>>   It would be best to move this discussion over to gsl-discuss. I think
>>> it would be very useful to have this function in GSL. Just a few comments on
>>> your code:
>>>
>>> 1) The code looks clean and nicely commented. One issue is that since
>>> you appear to have followed the apache code very closely, there may be a
>>> licensing issue - I don't know if the Apache license is compatible with the
>>> GPL. On a quick check, its possible we can use it but it seems we need to
>>> preserve the original copyright notice.
>>>
>>> 2) Dynamic allocation - it looks like you dynamically allocate 5
>>> different arrays to do the calculation. It would be better to either make
>>> functions like gsl_stats_spearman_alloc and gsl_stats_spearman_free, or to
>>> pass in a pre-allocated workspace as one of the function arguments. Since
>>> you're using workspace of different types (double,size_t), its probably
>>> better to make the alloc/free functions.
>>>
>>> 3) One of your dynamically allocated arrays is realloc()'d in a loop. Is
>>> this because the size of the array is unknown before the loop? Perhaps there
>>> is a way to avoid the realloc's.
>>>
>>> 4) We also need to think of some automated tests that can be added to
>>> statistics/test.c to test this function exhaustively and make sure its
>>> working correctly - even if that consists simply of known output values for
>>> a few different input cases.
>>>
>>> Good work,
>>> Patrick Alken
>>>
>>>
>>> On 02/09/2012 04:26 PM, Timothée Flutre wrote:
>>>> Hello,
>>>>
>>>> I noticed that only the Pearson correlation coefficient is implemented
>>>> in the GSL
>>>> (http://www.gnu.org/software/gsl/manual/html_node/Correlation.html).
>>>> However, in quantitative genetics, several authors are using the
>>>> Spearman coef (for instance, Stranger et al "Population genomics of
>>>> human gene expression", Nature Genetics, 2007) as it is less
>>>> influenced by outliers.
>>>>
>>>> Current high-throughput data requires to compute such coef several
>>>> millions of times. Thus I implemented the computation of the Spearman
>>>> coef in GSL-like code. In fact, one just need to rank the input
>>>> vectors and then compute the Pearson coef on them. For the ranking, I
>>>> got inspired by the code from the Apache Math module.
>>>>
>>>> I was thinking that it could be useful to other users to add my piece
>>>> of code to the file "covariance_source.c" of the GSL
>>>>
>>>> (http://bzr.savannah.gnu.org/lh/gsl/trunk/annotate/head:/statistics/covariance_source.c#L77).
>>>> So here is the code: https://gist.github.com/1784199
>>>>
>>>> I am not very proficient in C, so even if it is not possible to
>>>> include the code in the GSL, don't hesitate to give me advice.
>>>>
>>>> Thanks,
>>>> Tim
>>>>

      parent reply	other threads:[~2012-03-02 16:01 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CAGJVmuKKL4HCXViqxAf3b=em05hb7ox5pABm5b_LdKNMxTLMkg@mail.gmail.com>
2012-02-10  0:05 ` Patrick Alken
2012-02-11 20:53   ` Timothée Flutre
2012-03-02 15:33     ` Timothée Flutre
2012-03-02 15:59       ` Rhys Ulerich
2012-03-02 16:01       ` Patrick Alken [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4F50EEDD.8070303@colorado.edu \
    --to=patrick.alken@colorado.edu \
    --cc=gsl-discuss@sourceware.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).