From: Patrick Alken
Date: Tue, 28 May 2013 22:44:00 -0000
To: timflutre@gmail.com, gsl-discuss@sourceware.org
Subject: Re: spearman coefficient
Message-ID: <51A53336.30801@colorado.edu>

I've added gsl_stats_spearman to the repository and have tested it on a
few sample datasets. I essentially rewrote the routine, using Octave and
Numerical Recipes as examples, though I wrote everything from scratch so
there are no copyright issues.

I added the function gsl_sort_vector2, similar to the Numerical Recipes
sort2() function, which eliminates the need to allocate a permutation
and sort vector.
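
For illustration only, here is a rough sketch of how co-sorting two
arrays can produce paired rank vectors without a permutation vector.
It is not the code committed to the repository: the names cosort,
to_ranks and paired_ranks are hypothetical, insertion sort is used
purely for brevity, and the caller-supplied scratch array of length
2*n is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: sort a[] ascending and apply the same moves to b[],
   so that each (a[i], b[i]) pair stays together. Insertion sort is used
   only to keep the sketch short; it is O(n^2). */
static void cosort(double *a, double *b, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        double ka = a[i], kb = b[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > ka) {
            a[j] = a[j - 1];
            b[j] = b[j - 1];
            j--;
        }
        a[j] = ka;
        b[j] = kb;
    }
}

/* Hypothetical helper: replace the values of an already sorted array by
   their 1-based ranks; tied values all receive the average of their ranks. */
static void to_ranks(double *v, size_t n)
{
    size_t i = 0;
    while (i < n) {
        size_t j = i + 1;
        while (j < n && v[j] == v[i])
            j++;
        /* positions i..j-1 are tied; ranks i+1..j average to (i+1+j)/2 */
        double rank = 0.5 * (double)(i + 1 + j);
        for (size_t k = i; k < j; k++)
            v[k] = rank;
        i = j;
    }
}

/* Fill work[0..n) with the ranks of x and work[n..2n) with the ranks of y,
   paired index by index. x and y are left untouched; work must hold 2*n
   doubles. On return the pairs are ordered by increasing y. */
void paired_ranks(const double *x, const double *y, size_t n, double *work)
{
    assert(x != NULL && y != NULL && work != NULL);

    double *r1 = work;
    double *r2 = work + n;

    memcpy(r1, x, n * sizeof *r1);
    memcpy(r2, y, n * sizeof *r2);

    cosort(r1, r2, n);   /* order the pairs by x ...            */
    to_ranks(r1, n);     /* ... and turn the x values into ranks */
    cosort(r2, r1, n);   /* order the pairs by y ...            */
    to_ranks(r2, n);     /* ... and turn the y values into ranks */
}
```

Spearman's coefficient is then the ordinary Pearson correlation of the
two rank arrays left in the scratch space.
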
The workspace for the rank vectors is passed directly to the function,
so there is no need to allocate a separate workspace now. It is possible
to write the function to calculate the rank vectors in-place in the data
vectors, but I opted to keep those inputs untouched to stay consistent
with the rest of the statistics routines. The user must pass in a
workspace of size 2*n.

I put the function in statistics/covariance_source.c so it will be
defined for all the different types (float, double, int, short, etc.),
and it's documented in the manual.

I'm sorry I wasn't able to directly use a lot of your code, but I do
think this implementation is much more consistent with the rest of the
library design. If you are using this function regularly in your work, I
would appreciate any feedback you can give (i.e. testing it with a wide
range of inputs).

Patrick

On 05/25/2013 03:25 PM, Timothée Flutre wrote:
> Hi Patrick,
>
> thanks for your detailed reply. (I don't know why I didn't receive
> your email; I had to check the GSL mailing list archive to see it,
> which is why I'm answering directly to you this time.)
>
> About introducing a new workspace, I did it based on your advice from last year:
> http://sourceware.org/ml/gsl-discuss/2012-q1/msg00011.html
>
> I don't have a strong opinion on which is best, but someone else
> commented on my code and also thought that it would be better to have
> a workspace:
> https://gist.github.com/timflutre/1784199#comment-82458
>
> Maybe the code could offer two functions, with or without the
> workspace? In this case, are there any guidelines for naming the
> functions?
>
> I had a look at the implementation in R. The description of the
> interface is here:
> http://stat.ethz.ch/R-manual/R-patched/library/stats/html/cor.html
>
> Even though it indicates that the argument "method" can take the value
> "spearman", I don't see it anymore in the R code and thus I am a bit
> confused by their implementation:
> https://github.com/wch/r-source/blob/trunk/src/library/stats/R/cor.R#L21
>
> Moreover, the R code calls C code:
> https://github.com/wch/r-source/blob/trunk/src/library/stats/src/cov.c#L623
>
> The file with the C code has several macros and functions to compute
> covariance or correlation, to handle missing data in different ways,
> to deal with Pearson, Spearman and Kendall coefficients, etc. All this
> makes it really hard for me to understand...
>
> Finally, I looked at the algorithm in Numerical Recipes in C; the PDF
> of the book is available here:
> www2.units.it/ipl/students_area/imm2/files/Numerical_Recipes.pdf
>
> However, the GSL web site says that we can't use algorithms from this
> book because of the non-free license.
>
> Also, it seems to me that spear() from Numerical Recipes (PDF page 641)
> uses the function srt2() (Quicksort with 2 arrays, page 334) which
> seems to require allocating another array, "istack". Therefore, in
> the end, it doesn't seem to me that it's much better than my d and
> perm vectors, which have the advantage of using other functions of the
> GSL (gsl_sort_vector and gsl_sort_vector_index).
>
> But again, I'm really not an expert programmer, in C or any other
> language. So I tried to see how I could change my code based on what
> you said, but I don't see any obvious way to do it (except copying the
> code from Numerical Recipes).
>
> If you don't want to include the code as it is in the next release
> of the GSL, I'm fine with that. Of course, if you have a better
> understanding of all this and you can explain to me what to do, I can
> try to help.
>
> Best,
>
> Timothée Flutre