From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 4311 invoked by alias); 7 Jun 2007 14:17:28 -0000 Received: (qmail 4297 invoked by uid 22791); 7 Jun 2007 14:17:26 -0000 X-Spam-Check-By: sourceware.org Received: from mail.phy.duke.edu (HELO mail.phy.duke.edu) (152.3.182.2) by sourceware.org (qpsmtpd/0.31) with ESMTP; Thu, 07 Jun 2007 14:17:22 +0000 Received: from localhost (localhost.localdomain [127.0.0.1]) by mail.phy.duke.edu (Postfix) with ESMTP id 2E95F1508AD; Thu, 7 Jun 2007 10:17:18 -0400 (EDT) Received: from mail.phy.duke.edu ([127.0.0.1]) by localhost (mail.phy.duke.edu [127.0.0.1]) (amavisd-new, port 10026) with LMTP id 70VVD78WMKFd; Thu, 7 Jun 2007 10:17:18 -0400 (EDT) Received: from lilith.rgb.private.net (client212-5.dsl.intrex.net [209.42.212.5]) by mail.phy.duke.edu (Postfix) with ESMTP id 50FC61500C0; Thu, 7 Jun 2007 10:17:16 -0400 (EDT) Date: Thu, 07 Jun 2007 14:17:00 -0000 From: "Robert G. Brown" To: Michael cc: help-gsl@gnu.org, gsl-discuss@sources.redhat.com Subject: Re: Will I get speedup by providing Jacobian to the GSL ODE45 solver? In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Mailing-List: contact gsl-discuss-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: gsl-discuss-owner@sourceware.org X-SW-Source: 2007-q2/txt/msg00037.txt.bz2 On Wed, 6 Jun 2007, Michael wrote: > Will I get speedup by providing Jacobian to the GSL ODE45 solver? > > Hi all, > > I am a newbie to GSL ODE solvers. I have a critical need for speed. I > am hunting for ways to speed up. My questions are: > > 1. If after a comparison of GSL ODE solvers, I conclude that ODE45 is > sufficient for my need in terms of accuracy. (And in fact I turned to > GSL ODE solver from Matlab, and I have already done a comparison > analysis using all the ODE solvers in Matlab. ) And I remember I don't > provide Jacobian to ODE45 solver. And now will I get speed-up at the > fixed accuracy level by providing the Jacobian to the GSL ODE45 > solver? > > 2. Is it possible that after providing the Jacobian, other solvers use > the Jacobian more efficiently and start to outform ODE45 in terms of > speed or accuracy? > > I am still experimenting these things, but I would like to hear your > comments too... > > Thanks for your help! I'll reply in just one place. There are lots of things to answer from your previous messages, so forgive me/remind me if I forget one. I'll have to make this (uncharacteristically, for me:-) brief as I have some other work to do today, so bear with me. ODE45 in matlab is probably rkf45 in GSL and it is an excellent general purpose ODE solver, I agree. But, as I already noted in a previous message, some of the alternatives sometimes turn out to be faster and just as accurate, so I'd still recommend that you experiment with this if speed is THE defining goal here. If it is, the it is worth the time required to prototype and run the code for a day or two to tune it up, and it is trivial to swap ODE solvers with the GSL, really. At most you have to write the Jacobian code. Don't be "scared" of doing this because you don't know what a Jacobian is (not unlikely, actually:-). GIYF: http://en.wikipedia.org/wiki/Jacobian It is just the matrix of partial derivatives that show how all of the variables in the vector being solved depend on (in the case of the ODE solver) each other (and in the more general case, on some underlying set of "spatial" coordinates). The van der pol oscillator example should make this clear. However, if you are solving (say) single variable ODEs and make your vector "system" out of N independent 1-D ODES, each with the same derivative, then the Jacobian matrix is diagonal and trivial -- basically the unit matrix. OK, now, to speed things up. Speed costs money, so you have to decide how "valuable" the results are to you (what you want to spend) and then optimize within that expenditure limit. You can also spend "nothing" and optimize on whatever hardware you have, but you can't do this AND insist that you be able to solve umpty-zillion ODE instances in finite time. Reality is real, and you have to live within the physical limitations of your hardware or spend more on better/more hardware. There are three distinct dimensions you can optimize: * Number of CPUs. You are fortunate in that your problem decomposes into embarrassingly parallel subproblems trivially. This means you can basically do your work twice as fast by using two systems at the same time, N times as fast by using N systems (less a hair for overhead). So one way to get linear speedup is to buy more systems and make a small beowulf cluster to do your work. If you are interested in this approach, check out: http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book.php to learn about parallel scaling and getting started with beowulfery. Also look at: http://www.clustermonkey.net where there are all sorts of articles on this, including a whole series by yours truly aimed at newbies. Note well that this NEED NOT BE VERY EXPENSIVE! These days $3000 will buy you 8 64-bit opteron cores with at least 512 MB per core in only two boxes, and on an embarrassingly parallel problem that is a HUGE amount of compute capacity! * Quality of CPU, memory bus, etc. Cheap, single core, small cache 32-bit CPU cores on a slow plodgy memory subsystem will not perform as well as a bleeding edge 64-bit cores on a fast memory subsystem for problems of this type. If you're currently working on a 1.6 GHz Celeron and move your code over to a 3 GHz 64-bit CPU, it will speed up by maybe four right there. If you move it to a dual core CPU that's eight. If you move it to a dual dual core system, that is as much as 16x speedup right there per SINGLE system. There are price point nonlinearities in here you should be aware of. Your problem is almost certainly CPU bound -- unlikely to depend strongly on anything but CPU clock and maybe memory bus clock and architecture when you have enough cores hitting the same memory. Bleeding edge CPUs are MUCH more expensive than ones only 10% to 20% slower that aren't quite so advanced. Since your problem decomposes, you should multiply the number of CPU cores you can afford per architecture by the system clock of those cores and optimize the resulting number, remembering tha AMD cores are typically faster at the same clock than Intel cores but YMMV and it's up to you to prototype and compare, not me. * Writing your code to take advantage of vectorization and CPU cache, etc. This latter is tricky. I know that you want me to show you an example of a vectorized solution, but I'd have to write out a full answer and test it (just like you will) and I don't have time to, so you'll have to try it on your own. I'll describe "how" to proceed, and assume/hope that you can manage the actual coding. First of all, it isn't completely clear that you are better off vectorizing the independent ODEs to be solved vs calling the ODE solver one at a time. The trade off involved is the overhead of starting up the ODE solver (a fixed cost per invocation per solution) vs the speedup advantage of executing all of the core loop in cache while working on register memory. At some point increasing the size of the problem forces the system out of cache and maybe even out of being relatively local in memory (unlikely, actually, if your problem decomposes). This nonlinearly slows it down to a new scaling regime. However, if you really only evaluate the solution for four or five points per run, the overhead of startup can be a significant fraction of the time of solution (or not -- something to check out) and you may be better off vectorizing a number of solutions into a single startup call. In other words, there is LIKELY to be an optimum that looks like the following: a) Put each independent ODE to be solved into a component of a vector y[]. Make the length of y a variable so you can adjust it at run time. b) Write code that systematically assigns your parameter space to the vector of initial conditions or parameters passed to the ODE solver or in a global parameter vector -- it doesn't matter how, but it has to be systematic and automated. c) Write code that evaluates the vector of deriviatives -- if all that changes is the parameter you can reuse the actual derivative code in a trivial loop. d) Write code that makes the jacobian a unit matrix -- no y[] depends on any other y[] and \partial y_i/\partial y_i = 1. Use the GSL example as a template, but your code will be much simpler on the right hand side of the assignments. e) Test it for (say) rkf45, why not. Make sure that it gives you the same answers you were getting in matlab, that the evolution routine works flawlessly to each of the desired times, etc. Now experiment. Try running it varying N and plot out the average time required to complete per solution. Try varying ODE solvers and repeat. Try varying architecture (if you have more than one kind of system to use to run it on). If god is good to you, you will rapidly determine that the code has one of the three generic limiting forms: * It will run fastest with N=1, so you are best off running a loop to run the ODE solver N times (on as many hosts/cores as you can manage to split up the work) for 1 problem at a time. * It will run fastest with the largest possible N that will fit the program comfortably in memory -- maybe N = 2^20, maybe N = 2^28 (if you have a lot of memory -- GB+ -- to run in). * It will run fastest somewhere in between. where the answers WILL vary per architecture tested and maybe per ODE tested. I don't even know how I would bet between these three. I suppose I'd "expect" an optimum somewhere in the ballpark of N = 32K -- small enough to keep the code in L1 and data in L2, big enough to spread the startup overhead out over a "large" number of calls, but modern memory subsystems could easily perform well enough to make the second one true or if you carefully write the startup code so it avoids reallocating the ODE structs from run to run and use register variables throughout might make the first one true. Just carefully coding can move the optimum around -- it is really another degree of freedom. I personally would advise you not to go the assembler route unless it is your last resort. C compilers are actually pretty efficient and you'd have to work long and hard to get a really significant speedup out of it compared to what you could do by just careful C coding. Indeed, I'd try rewriting the code in C first to maybe leave out some of the GSL-added overhead of testing for successful completions and the like -- at your own considerable risk -- and vectorizing the core loops in that way. But I actually wouldn't try that -- not for a parallelizable problem. Code portability and readability and reliability matter, and using the GSL is a good idea for all of these reasons. Hardware is cheaper than human time. Buy some and throw it at the problem, accepting that you can't squeeze blood from turnips and you can't get 2^25 solutions per second out of hardware capable of generating only 2^20, but that if you add 2^5 CPU cores (a mere 32 processors) you can... HTH, rgb -- -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu