From: Timothée Flutre
Reply-To: timflutre@gmail.com
Date: Fri, 02 Mar 2012 15:33:00 -0000
Subject: Re: [Help-gsl] Spearman rank correlation coefficient
To: gsl-discuss@sourceware.org
References: <4F345F28.3060102@colorado.edu>
X-SW-Source: 2012-q1/txt/msg00014.txt.bz2

Hello,

a month ago I proposed an
implementation of the Spearman rank correlation coefficient, as it is
missing from the GSL (see emails below). I took some advice into account
and the updated code is available here:
https://gist.github.com/1784199#file_spearman_v2.c

Since then, I haven't received any answer. I'm not an experienced C
programmer, so my code may need further improvements, but it can still
be useful to others. Can I therefore submit it to the GSL main trunk?
I've never done that before. Can someone tell me what to do? Should I
request "developer write access", for instance?

Thanks in advance,
Tim

2012/2/11 Timothée Flutre
>
> Thanks for your input!
>
> 1) Here is the text of the license under which the Apache code is
> released: http://www.apache.org/licenses/LICENSE-2.0. Indeed it seems
> that we would have to indicate their copyright. Is this a problem? In
> a way, there are not many different algorithms to compute the
> Spearman coefficient...
>
> 2) I have made the changes and now have "gsl_stats_spearman_alloc" and
> "gsl_stats_spearman_free" functions for the four arrays ranks1,
> ranks2, d and p. I added the code as a 2nd file to the same gist:
> https://gist.github.com/1784199#file_spearman_v2.c
>
> 3) Yes, we don't know in advance how many ties there will be. That's
> why I reallocate inside the loop. I don't see how I can do it
> differently.
>
> 4) I added a function performing tests, using the data defined in
> statistics/test_float_source.c. What do I do now? Do I need to have
> write access to the GSL repository on Savannah? Or maybe someone else
> can do it for me?
>
> Thanks,
> Tim
>
>
> On Thu, Feb 9, 2012 at 6:04 PM, Patrick Alken wrote:
> >
> > Hello,
> >
> > It would be best to move this discussion over to gsl-discuss. I think
> > it would be very useful to have this function in GSL. Just a few
> > comments on your code:
> >
> > 1) The code looks clean and nicely commented.
> > One issue is that since
> > you appear to have followed the Apache code very closely, there may be a
> > licensing issue - I don't know if the Apache license is compatible with
> > the GPL. On a quick check, it's possible we can use it, but it seems we
> > need to preserve the original copyright notice.
> >
> > 2) Dynamic allocation - it looks like you dynamically allocate 5
> > different arrays to do the calculation. It would be better to either
> > make functions like gsl_stats_spearman_alloc and
> > gsl_stats_spearman_free, or to pass in a pre-allocated workspace as one
> > of the function arguments. Since you're using workspace of different
> > types (double, size_t), it's probably better to make the alloc/free
> > functions.
> >
> > 3) One of your dynamically allocated arrays is realloc()'d in a loop. Is
> > this because the size of the array is unknown before the loop? Perhaps
> > there is a way to avoid the reallocs.
> >
> > 4) We also need to think of some automated tests that can be added to
> > statistics/test.c to test this function exhaustively and make sure it's
> > working correctly - even if that consists simply of known output values
> > for a few different input cases.
> >
> > Good work,
> > Patrick Alken
> >
> >
> > On 02/09/2012 04:26 PM, Timothée Flutre wrote:
> >>
> >> Hello,
> >>
> >> I noticed that only the Pearson correlation coefficient is implemented
> >> in the GSL
> >> (http://www.gnu.org/software/gsl/manual/html_node/Correlation.html).
> >> However, in quantitative genetics, several authors are using the
> >> Spearman coef (for instance, Stranger et al "Population genomics of
> >> human gene expression", Nature Genetics, 2007) as it is less
> >> influenced by outliers.
> >>
> >> Current high-throughput data requires computing such a coef several
> >> million times. Thus I implemented the computation of the Spearman
> >> coef in GSL-like code.
> >> In fact, one just needs to rank the input
> >> vectors and then compute the Pearson coef on them. For the ranking, I
> >> was inspired by the code from the Apache Math module.
> >>
> >> I was thinking that it could be useful to other users to add my piece
> >> of code to the file "covariance_source.c" of the GSL
> >>
> >> (http://bzr.savannah.gnu.org/lh/gsl/trunk/annotate/head:/statistics/covariance_source.c#L77).
> >> So here is the code: https://gist.github.com/1784199
> >>
> >> I am not very proficient in C, so even if it is not possible to
> >> include the code in the GSL, don't hesitate to give me advice.
> >>
> >> Thanks,
> >> Tim
> >>
> >