From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (qmail 14770 invoked by alias); 7 Nov 2006 04:37:20 -0000
Received: (qmail 14762 invoked by uid 22791); 7 Nov 2006 04:37:19 -0000
X-Spam-Status: No, hits=-2.6 required=5.0 tests=AWL,BAYES_00,SPF_HELO_PASS,SPF_PASS
X-Spam-Check-By: sourceware.org
Received: from mx1.redhat.com (HELO mx1.redhat.com) (66.187.233.31)
    by sourceware.org (qpsmtpd/0.31) with ESMTP; Tue, 07 Nov 2006 04:37:14 +0000
Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254])
    by mx1.redhat.com (8.12.11.20060308/8.12.11) with ESMTP id kA74bBOQ017727
    for ; Mon, 6 Nov 2006 23:37:11 -0500
Received: from pobox.corp.redhat.com (pobox.corp.redhat.com [10.11.255.20])
    by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id kA74bBOX018309;
    Mon, 6 Nov 2006 23:37:11 -0500
Received: from vpn-248-46.boston.redhat.com (vpn-248-46.boston.redhat.com [10.13.248.46])
    by pobox.corp.redhat.com (8.13.1/8.12.8) with ESMTP id kA74bAhf023552;
    Mon, 6 Nov 2006 23:37:10 -0500
Subject: RE: offline elfutils processing committed
From: Martin Hunt
To: "Stone, Joshua I"
Cc: "Frank Ch. Eigler" , systemtap@sources.redhat.com
In-Reply-To:
References:
Content-Type: text/plain
Organization: Red Hat Inc.
Date: Tue, 07 Nov 2006 06:36:00 -0000
Message-Id: <1162874229.31112.7.camel@dragon>
Mime-Version: 1.0
X-Mailer: Evolution 2.6.3 (2.6.3-1.fc5.5)
Content-Transfer-Encoding: 7bit
X-IsSubscribed: yes
Mailing-List: contact systemtap-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Subscribe:
List-Post:
List-Help: ,
Sender: systemtap-owner@sourceware.org
X-SW-Source: 2006-q4/txt/msg00357.txt.bz2

On Mon, 2006-11-06 at 14:15 -0800, Stone, Joshua I wrote:
> On Monday, November 06, 2006 1:18 PM, Martin Hunt wrote:
> > The point is damage control. Systemtap allocates too much memory and
> > oom killer gets active, the first thing it will kill is staprun and
> > that should unload the module (but this seems broken at the moment).
> > So we haven't really hurt the system.
>
> The goal is fine, but I don't think this accomplishes it. My
> understanding is that __alloc_pages will keep calling OOM until it is
> able to satisfy the request -- thus the module is blocked waiting for
> memory. The process might end up something like:
>
>   stap module: allocate lots of memory
>   __alloc_pages: Not enough memory -> OOM kill something (staprun)
>   __alloc_pages: Still not enough memory -> OOM kill other stuff
>   __alloc_pages: Yay, now we have enough memory!
>   stap module: got some memory
>   stap module: Oops, staprun is gone, better exit...

There are two different but related problems. The one you describe is
easily fixed by using the __GFP_NORETRY flag on our allocs.

The second problem is the one I was trying to describe: what happens
when systemtap's allocations succeed but leave the system in a
low-memory state, so that other applications trigger the oom killer
when they try to allocate memory? In this case, we want staprun and the
systemtap module to be the first to be killed.

I haven't looked at the sources, but it seems unlikely to me that the
oom killer would be so fast that it would kill staprun and then kill
other processes before the module is also killed and frees its memory.

Martin
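
For illustration only, here is one way a userspace process can ask to be an early OOM victim on current Linux kernels: writing a high value to its /proc/<pid>/oom_score_adj file (valid range -1000 to 1000). This is a sketch under assumptions, not something proposed in this thread or taken from the staprun sources. That interface is newer than this message, and the helper name `prefer_for_oom_kill` and its `proc_root` parameter are hypothetical, added so the function can be exercised against a scratch directory.

```python
from pathlib import Path

# Bounds of the kernel's oom_score_adj knob; 1000 makes the process the
# most attractive candidate for the OOM killer.
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


def prefer_for_oom_kill(pid=None, adj=OOM_SCORE_ADJ_MAX, proc_root="/proc"):
    """Write `adj` to <proc_root>/<pid>/oom_score_adj and return that path.

    pid=None targets the calling process via the "self" entry.
    proc_root is a hypothetical parameter, present only so the helper can
    be pointed at a scratch directory instead of the real /proc.
    """
    if not OOM_SCORE_ADJ_MIN <= adj <= OOM_SCORE_ADJ_MAX:
        raise ValueError(f"adj must be within [-1000, 1000], got {adj}")
    entry = "self" if pid is None else str(pid)
    target = Path(proc_root) / entry / "oom_score_adj"
    # The kernel parses a decimal integer; the trailing newline is optional.
    target.write_text(f"{adj}\n")
    return target
```

Calling `prefer_for_oom_kill()` with no arguments from the process that should die first raises its own score; raising the value normally needs no special privilege, while lowering it does. The setting only biases which process is chosen. Memory held by a kernel module is released only when the module is unloaded, which is the step the message above relies on.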