Robust linear least squares

public inbox for gsl-discuss@sourceware.org
 help / color / mirror / Atom feed

* Robust linear least squares
@ 2013-05-10 22:01 Patrick Alken
  2013-05-12 17:56 ` Peter Teuben
  0 siblings, 1 reply; 6+ messages in thread
From: Patrick Alken @ 2013-05-10 22:01 UTC (permalink / raw)
  To: gsl-discuss

Hi all,

   I just committed a significant chunk of code related to robust linear 
regression into GSL and mainly wanted to update the other developers and 
any other interested parties. The main idea here is that ordinary least 
squares is very sensitive to data outliers, and the robust algorithm 
tries to identify and downweight outlier points so they don't 
drastically affect the model. I think this is something that has been 
needed in gsl for a while.

   I've been developing the code for a while and have been using it 
successfully in my own work, and also validated it pretty extensively 
against the matlab implementation. I still need to make some automated 
tests for it which I should get to next week.

   In the meantime, the code is very usable and working so feel free to 
try it out.

Patrick

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Robust linear least squares
  2013-05-10 22:01 Robust linear least squares Patrick Alken
@ 2013-05-12 17:56 ` Peter Teuben
  2013-05-12 18:13   ` Dirk Eddelbuettel
  2013-05-12 18:14   ` Patrick Alken
  0 siblings, 2 replies; 6+ messages in thread
From: Peter Teuben @ 2013-05-12 17:56 UTC (permalink / raw)
  To: gsl-discuss; +Cc: Patrick Alken

Patrick
I agree, this is a useful option!

   can you say a little more here how you define robustness. The one I
know takes the quartiles Q1 and Q3 (where Q2 would
be the median), then define D=Q3-Q1 and only uses points between
Q1-1.5*D and Q3+1.5*D to define things like  a robust mean and variance.
Why 1.5 I don't know, I guess you could keep that a variable and tinker
with it.
For OLS you can imagine applying this in an iterative way to the Y
values, since formally the errors in X are neglibable compared to those
in Y. I'm saying iterative, since in theory the 2nd iteration could have
rejected points that should have
been part or the "core points".  For non-linear fitting this could be a
lot more tricky.

peter

On 05/10/2013 06:01 PM, Patrick Alken wrote:
> Hi all,
>
>   I just committed a significant chunk of code related to robust
> linear regression into GSL and mainly wanted to update the other
> developers and any other interested parties. The main idea here is
> that ordinary least squares is very sensitive to data outliers, and
> the robust algorithm tries to identify and downweight outlier points
> so they don't drastically affect the model. I think this is something
> that has been needed in gsl for a while.
>
>   I've been developing the code for a while and have been using it
> successfully in my own work, and also validated it pretty extensively
> against the matlab implementation. I still need to make some automated
> tests for it which I should get to next week.
>
>   In the meantime, the code is very usable and working so feel free to
> try it out.
>
> Patrick

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Robust linear least squares
  2013-05-12 17:56 ` Peter Teuben
@ 2013-05-12 18:13   ` Dirk Eddelbuettel
  2013-05-12 18:14   ` Patrick Alken
  1 sibling, 0 replies; 6+ messages in thread
From: Dirk Eddelbuettel @ 2013-05-12 18:13 UTC (permalink / raw)
  To: Peter Teuben; +Cc: gsl-discuss, Patrick Alken

On 12 May 2013 at 13:56, Peter Teuben wrote:
| Patrick
| I agree, this is a useful option!
| 
|    can you say a little more here how you define robustness. The one I
| know takes the quartiles Q1 and Q3 (where Q2 would
| be the median), then define D=Q3-Q1 and only uses points between
| Q1-1.5*D and Q3+1.5*D to define things like  a robust mean and variance.
| Why 1.5 I don't know, I guess you could keep that a variable and tinker
| with it.
| For OLS you can imagine applying this in an iterative way to the Y
| values, since formally the errors in X are neglibable compared to those
| in Y. I'm saying iterative, since in theory the 2nd iteration could have
| rejected points that should have
| been part or the "core points".  For non-linear fitting this could be a
| lot more tricky.

There is an entire "task view" (ie edited survey of available packages)
available for R concerning robust methods (for model fitting and more):

   http://cran.r-project.org/web/views/Robust.html

So there is not just one generally accepted best option.  That said, having
something is clearly better than nothing. But let's properly define the
method and delineat its scope/

Dirk

-- 
Dirk Eddelbuettel | edd@debian.org | http://dirk.eddelbuettel.com

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Robust linear least squares
  2013-05-12 17:56 ` Peter Teuben
  2013-05-12 18:13   ` Dirk Eddelbuettel
@ 2013-05-12 18:14   ` Patrick Alken
  1 sibling, 0 replies; 6+ messages in thread
From: Patrick Alken @ 2013-05-12 18:14 UTC (permalink / raw)
  To: Peter Teuben; +Cc: gsl-discuss

Hi Peter,

   The most common robust least squares algorithm is called 
"M-estimation" which is what I've implemented. At each step of the 
iteration, you calculate the residuals and use a weighting function 
which is designed to assign large weights to small residuals and small 
weights to large residuals, so that the large residuals (outliers) 
contribute less and less to the model at each iteration. At each 
iteration, you need an estimate of the residual standard deviation, and 
I am using the Mean-Absolute-Deviation (MAD) of the p largest residuals 
(where p is the number of model parameters). There are alternatives to 
computing sigma but the MAD seems to be the most widely used.

   If you check out the latest repository, have a look at the manual 
since I've documented everything including a description of the 
algorithm used. Let me know if you have more questions.

Patrick

On 05/12/2013 11:56 AM, Peter Teuben wrote:
> Patrick
> I agree, this is a useful option!
>
>     can you say a little more here how you define robustness. The one I
> know takes the quartiles Q1 and Q3 (where Q2 would
> be the median), then define D=Q3-Q1 and only uses points between
> Q1-1.5*D and Q3+1.5*D to define things like  a robust mean and variance.
> Why 1.5 I don't know, I guess you could keep that a variable and tinker
> with it.
> For OLS you can imagine applying this in an iterative way to the Y
> values, since formally the errors in X are neglibable compared to those
> in Y. I'm saying iterative, since in theory the 2nd iteration could have
> rejected points that should have
> been part or the "core points".  For non-linear fitting this could be a
> lot more tricky.
>
> peter
>
>
> On 05/10/2013 06:01 PM, Patrick Alken wrote:
>> Hi all,
>>
>>    I just committed a significant chunk of code related to robust
>> linear regression into GSL and mainly wanted to update the other
>> developers and any other interested parties. The main idea here is
>> that ordinary least squares is very sensitive to data outliers, and
>> the robust algorithm tries to identify and downweight outlier points
>> so they don't drastically affect the model. I think this is something
>> that has been needed in gsl for a while.
>>
>>    I've been developing the code for a while and have been using it
>> successfully in my own work, and also validated it pretty extensively
>> against the matlab implementation. I still need to make some automated
>> tests for it which I should get to next week.
>>
>>    In the meantime, the code is very usable and working so feel free to
>> try it out.
>>
>> Patrick

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Robust linear least squares
  2013-05-24 19:23 Timothée Flutre
@ 2013-05-24 20:12 ` Patrick Alken
  0 siblings, 0 replies; 6+ messages in thread
From: Patrick Alken @ 2013-05-24 20:12 UTC (permalink / raw)
  To: gsl-discuss

Hi Tim,

   Yes I remember and I do still think it would be a very nice function 
to have in GSL. My main worry at this point is the need to call the 
_alloc and _free functions for this function. Looking through the 
statistics chapter of the manual, none of the functions there require 
alloc/free calls, so it would be really nice to implement spearman in a 
similar way.

   One issue we need to worry about, is once we introduce a new 
workspace (gsl_stats_spearman) into GSL, it will be there for a long 
time since we need to keep GSL binary compatible for future releases (so 
that for future releases users won't have to recompile their code and 
can just link to the new library).

   If there is no other way to nicely implement this function then so be 
it - we will include a spearman workspace, but I'd really like to 
exhaust other options first.

   For example, I know you've looked at the Apache implementation. Have 
you looked at the R implementation as well to get any ideas? Also, 
numerical recipes implements this function where they allocate 2 vectors 
(your ranks1/ranks2 vectors) inside the spear function. Numerical 
recipes doesn't seem to need an additional sort vector (your d) or 
permutation vector (p). Do you think there is a way to eliminate these 2 
parameters, and then perhaps the user could simply pass in a double 
variable of size 2*n which you could use as your ranks1/ranks2 vectors, 
eliminating the need for spearman_workspace.

   Alternatively, do you think there is any clever way to compute the 
ranks in-place in the data1/2 vectors, so you won't have to allocate 
additional ranks1/ranks2 vectors?

   Finally, I know I asked you before about the ties_trace realloc call 
- it looks like this variable is allocated to 'nties' on the fly. Is 
there any way to count the number of ties initially, so that this only 
needs to be allocated once?

   I don't see any realloc calls in the Numerical Recipes 
implementation, so I'd like to ask you to try to understand how they do 
it (Also look at GNU R which may be more professionally written). 
Perhaps look at octave too?

   Sorry to be a stickler about this but I do think its worth trying to 
eliminate the alloc/free calls for this function. Even with the 
alloc/free calls there is still a performance hit due to the realloc 
calls of ties_trace. I may have some time next week to look into this a 
bit more myself.

Patrick

On 05/24/2013 01:23 PM, TimothÃ©e Flutre wrote:
> Hello Patrick,
>
> about the next release, a while ago I proposed some code (+tests) to
> compute the Spearman rank correlation coefficient. I uploaded my code
> on savannah (http://savannah.gnu.org/bugs/?36199) and it is also
> available on github (https://github.com/timflutre/spearman). At least
> one person asked on the mailing list if this coef was implemented
> (http://savannah.gnu.org/bugs/?37728) so I think it would be useful to
> add it.
> I tried to follow the GSL guidelines as close as possible so that it
> should be possible to integrate the code easily into the next release.
> I would be glad to help in this matter if necessary.
>
> Best,
> Tim

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Robust linear least squares
@ 2013-05-24 19:23 Timothée Flutre
  2013-05-24 20:12 ` Patrick Alken
  0 siblings, 1 reply; 6+ messages in thread
From: Timothée Flutre @ 2013-05-24 19:23 UTC (permalink / raw)
  To: Patrick Alken; +Cc: gsl-discuss

Hello Patrick,

about the next release, a while ago I proposed some code (+tests) to
compute the Spearman rank correlation coefficient. I uploaded my code
on savannah (http://savannah.gnu.org/bugs/?36199) and it is also
available on github (https://github.com/timflutre/spearman). At least
one person asked on the mailing list if this coef was implemented
(http://savannah.gnu.org/bugs/?37728) so I think it would be useful to
add it.
I tried to follow the GSL guidelines as close as possible so that it
should be possible to integrate the code easily into the next release.
I would be glad to help in this matter if necessary.

Best,
Tim

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2013-05-24 20:12 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-10 22:01 Robust linear least squares Patrick Alken
2013-05-12 17:56 ` Peter Teuben
2013-05-12 18:13   ` Dirk Eddelbuettel
2013-05-12 18:14   ` Patrick Alken
2013-05-24 19:23 Timothée Flutre
2013-05-24 20:12 ` Patrick Alken

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).