public inbox for gcc@gcc.gnu.org
* gcc 3.3.6 - stack corruption questions
@ 2005-07-25 14:55 Louis LeBlanc
  2005-07-25 15:15 ` Giovanni Bajo
  0 siblings, 1 reply; 10+ messages in thread
From: Louis LeBlanc @ 2005-07-25 14:55 UTC
  To: gcc

Hey folks.  I'm having some trouble with a process compiled with gcc
3.3.6.  This code is pretty complex and has several features that are not
typically in use because they involve non-production test cases.

The problem is I'm getting core dumps (SEGV) that appear to come from
this code when I know it shouldn't be in the execution path.  The code
in question is switched on by a command line argument only, and the
process is managed by a parent process that monitors and manages its
execution - reporting crashes and restarting it if necessary.

Here's my environment:
gcc 3.3.6 built on SunOS 5.8 sun4u sparc SUNW,Ultra-60,
app built on the same platform and executed on SunOS 5.8 sun4u sparc
  SUNW,UltraSPARC-IIi-cEngine.

The entire codebase is written in C, and is compiled as follows:
/usr/local/gcc-3.3.6/bin/gcc -ggdb -g3 -Wall -D_REENTRANT
-Wno-multichar -Wno-unused-function -D_SOLARIS -DUSE_DEV_POLL
-mcpu=ultrasparc -O2 -DTIMING=1 -DDB_TIMING=1  -Icommon/include
-I/opt/oracle/8.1.7/include -I/opt/oracle/8.1.7/rdbms/public  -c -o
store.o store.c

These problems have popped up time and again over the last 6 years,
going as far back as gcc 2.95, but gdb has never been able to tell me
any more than where the problem came from (the Solaris pstack utility
always agrees with gdb).  These problems are only repeated under
longer execution times, and only after some thousands or even millions
of transactions.  The application is supposed to provide 99.97%
availability, so having this happen 12 times over the course of a
weekend is a bit concerning.  Sometimes a build will prove wonderfully
stable, but then a very small code change made to tweak some behavior
will completely destabilize it.

Recently, I added a handler to catch segfaults and bus errors to try
to extract more info through the ucontext interface.  I am able to get
a little explicit detail, but not much new information.  The problem
with this is that it doesn't preserve the originating stack as well.

At this point, I'm at a loss as to where to start.  This is a pretty
important codebase (to my employer, anyway) and the frequency of these
inexplicable problems is starting to cause some concern.

Any suggestions as to where to go next?  If I've forgotten any
potentially useful information please don't hesitate to request it.
Please CC me directly, as I am not on the dev list.

Thanks for your time.

Lou
-- 
Louis LeBlanc                                 leblanc@keyslapper.net
Fully Funded Hobbyist,                   KeySlapper Extrordinaire :þ
http://www.keyslapper.net                                       Ô¿Ô¬
Key fingerprint = C5E7 4762 F071 CE3B ED51  4FB8 AF85 A2FE 80C8 D9A2

Flugg's Law:
  When you need to knock on wood is when you realize
  that the world is composed of vinyl, naugahyde and aluminum.


* Re: gcc 3.3.6 - stack corruption questions
  2005-07-25 14:55 gcc 3.3.6 - stack corruption questions Louis LeBlanc
@ 2005-07-25 15:15 ` Giovanni Bajo
  2005-07-25 15:23   ` Louis LeBlanc
  2005-07-25 22:00   ` Louis LeBlanc
  0 siblings, 2 replies; 10+ messages in thread
From: Giovanni Bajo @ 2005-07-25 15:15 UTC
  To: Louis LeBlanc; +Cc: gcc

Louis LeBlanc <leblanc@keyslapper.net> wrote:

> The problem is I'm getting core dumps (SEGV) that appear to come from
> this code when I know it shouldn't be in the execution path.  The code
> in question is switched on by a command line argument only, and the
> process is managed by a parent process that monitors and manages its
> execution - reporting crashes and restarting it if necessary.

Looks like a bug hidden in your code. Several things you could try:

- valgrind
- GCC 4.0 with -fmudflap
- GCC 4.1 CVS with -fstack-protector
-- 
Giovanni Bajo


* Re: gcc 3.3.6 - stack corruption questions
  2005-07-25 15:15 ` Giovanni Bajo
@ 2005-07-25 15:23   ` Louis LeBlanc
  2005-07-25 22:00   ` Louis LeBlanc
  1 sibling, 0 replies; 10+ messages in thread
From: Louis LeBlanc @ 2005-07-25 15:23 UTC
  To: Giovanni Bajo; +Cc: Louis LeBlanc, gcc

On 07/25/05 05:15 PM, Giovanni Bajo sat at the `puter and typed:
> Louis LeBlanc <leblanc@keyslapper.net> wrote:
> 
> > The problem is I'm getting core dumps (SEGV) that appear to come from
> > this code when I know it shouldn't be in the execution path.  The code
> > in question is switched on by a command line argument only, and the
> > process is managed by a parent process that monitors and manages its
> > execution - reporting crashes and restarting it if necessary.
> 
> Looks like a bug hidden in your code. Several things you could try:
> 
> - valgrind
> - GCC 4.0 with -fmudflap
> - GCC 4.1 CVS with -fstack-protector
> -- 

Thanks for the tips.  Since I'm on Solaris, I don't think Valgrind is
an option (Linux and FreeBSD on x86/PowerPC/AMD64 only).

I will check out the gcc versions and features you mention.

Lou
-- 
Louis LeBlanc                                 leblanc@keyslapper.net
Fully Funded Hobbyist,                   KeySlapper Extrordinaire :þ
http://www.keyslapper.net                                       Ô¿Ô¬
Key fingerprint = C5E7 4762 F071 CE3B ED51  4FB8 AF85 A2FE 80C8 D9A2

Turnaucka's Law:
  The attention span of a computer is only as long as its electrical cord.


* Re: gcc 3.3.6 - stack corruption questions
  2005-07-25 15:15 ` Giovanni Bajo
  2005-07-25 15:23   ` Louis LeBlanc
@ 2005-07-25 22:00   ` Louis LeBlanc
  2005-07-25 22:28     ` Giovanni Bajo
  2005-07-25 22:50     ` Robert Dewar
  1 sibling, 2 replies; 10+ messages in thread
From: Louis LeBlanc @ 2005-07-25 22:00 UTC
  To: Giovanni Bajo; +Cc: Louis LeBlanc, gcc

On 07/25/05 05:15 PM, Giovanni Bajo sat at the `puter and typed:
> Louis LeBlanc <leblanc@keyslapper.net> wrote:
> 
> > The problem is I'm getting core dumps (SEGV) that appear to come from
> > this code when I know it shouldn't be in the execution path.  The code
> > in question is switched on by a command line argument only, and the
> > process is managed by a parent process that monitors and manages its
> > execution - reporting crashes and restarting it if necessary.
> 
> Looks like a bug hidden in your code. Several things you could try:
> 
> - valgrind
> - GCC 4.0 with -fmudflap
> - GCC 4.1 CVS with -fstack-protector

I've not gotten the chance to build with gcc4.0.1 yet (still building),
but I've tried a couple other things with 3.3.6 that you might find
interesting, maybe something will raise a flag.

I added the -fstack-check switch to my makefile and recompiled with
various optimizations.  I was pretty surprised at the file sizes that
showed up:

No Optimization:
-rwxr-xr-x  1 leblanc  daemon  1128660 Jul 25 16:25 myprocess*

Optimized with -O2
-rwxr-xr-x  1 leblanc  daemon  1058228 Jul 25 17:36 myprocess*

Optimized with -O3
-rwxr-xr-x  1 leblanc  daemon  1129792 Jul 25 17:32 myprocess*

I would have expected much different results.  Shouldn't the file
sizes be smaller (at least a little) with the -O3 switch?  Maybe
there's a loop unrolled to make it faster, resulting in a larger
codebase?

Anyway, the code that generated these files (identical between the
three, except the compilation flags) appears to be hitting something
with the optimizer that doesn't like my code.  Those with optimization
(the last 2) core on less than 50K transactions - both in calls to
pthread_mutex_unlock().  I have verified that the mutex passed in is
valid (it would have been locked and unlocked some 70K times before
failing in my last test).

The unoptimized version completed a 401,900 transaction test with no
problem.  All day, I've been playing with different things, 

BTW, all these executables were compiled with -ggdb -g3 -Wall.

Thanks to everyone who sent ideas so far.

Lou
-- 
Louis LeBlanc                                 leblanc@keyslapper.net
Fully Funded Hobbyist,                   KeySlapper Extrordinaire :þ
http://www.keyslapper.net                                       Ô¿Ô¬
Key fingerprint = C5E7 4762 F071 CE3B ED51  4FB8 AF85 A2FE 80C8 D9A2

Udall's Fourth Law:
  Any change or reform you make is going to have consequences you don't like.


* Re: gcc 3.3.6 - stack corruption questions
  2005-07-25 22:00   ` Louis LeBlanc
@ 2005-07-25 22:28     ` Giovanni Bajo
  2005-07-26 21:06       ` Louis LeBlanc
  2005-07-25 22:50     ` Robert Dewar
  1 sibling, 1 reply; 10+ messages in thread
From: Giovanni Bajo @ 2005-07-25 22:28 UTC
  To: Louis LeBlanc; +Cc: gcc

Louis LeBlanc <leblanc@keyslapper.net> wrote:

> I added the -fstack-check switch to my makefile and recompiled with
> various optimizations.  I was pretty surprised at the file sizes that
> showed up:
> 
> No Optimization:
> -rwxr-xr-x  1 leblanc  daemon  1128660 Jul 25 16:25 myprocess*
> 
> Optimized with -O2
> -rwxr-xr-x  1 leblanc  daemon  1058228 Jul 25 17:36 myprocess*
> 
> Optimized with -O3
> -rwxr-xr-x  1 leblanc  daemon  1129792 Jul 25 17:32 myprocess*
> 
> I would have expected much different results.  Shouldn't the file
> sizes be smaller (at least a little) with the -O3 switch?  Maybe
> there's a loop unrolled to make it faster, resulting in a larger
> codebase?


Or inlining, or many other things. If you care about size, use -Os.
-- 
Giovanni Bajo


* Re: gcc 3.3.6 - stack corruption questions
  2005-07-25 22:00   ` Louis LeBlanc
  2005-07-25 22:28     ` Giovanni Bajo
@ 2005-07-25 22:50     ` Robert Dewar
  2005-07-25 23:00       ` Dale Johannesen
  1 sibling, 1 reply; 10+ messages in thread
From: Robert Dewar @ 2005-07-25 22:50 UTC
  To: Louis LeBlanc; +Cc: Giovanni Bajo, gcc

Louis LeBlanc wrote:

> I would have expected much different results.  Shouldn't the file
> sizes be smaller (at least a little) with the -O3 switch?  Maybe
> there's a loop unrolled to make it faster, resulting in a larger
> codebase?

you generally expect -O3 files to be larger. inlining does not save
space. Indeed in the context of Ada, where well thought out inlining
is achieved by -O2 -gnatn (and the use of pragma Inline), we often
see -O3 executables being not only larger but slower, presumably due
to increased icache pressure.

> The unoptimized version completed a 401,900 transaction test with no
> problem.  All day, I've been playing with different things, 

there are many bugs, most notably uninitialized vars, that show
up only when you turn on optimization.




* Re: gcc 3.3.6 - stack corruption questions
  2005-07-25 22:50     ` Robert Dewar
@ 2005-07-25 23:00       ` Dale Johannesen
  0 siblings, 0 replies; 10+ messages in thread
From: Dale Johannesen @ 2005-07-25 23:00 UTC
  To: Robert Dewar; +Cc: gcc, Dale Johannesen, Louis LeBlanc, Giovanni Bajo

On Jul 25, 2005, at 3:50 PM, Robert Dewar wrote:
>> The unoptimized version completed a 401,900 transaction test with no
>> problem.  All day, I've been playing with different things,
>
> there are many bugs, most notably uninitialized vars, that show
> up only when you turn on optimization.

Also violations of strict aliasing rules are common.  -Wuninitialized
-fno-strict-aliasing [after the -O] will exercise those two.   Also,
mixed builds with some -O0 and some -O3 files should
narrow it down.
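
[Editor's sketch, not part of the original message.] The two bug
classes named above might look like this in C code of the kind
described in the thread.  The function names and the "-t" switch are
hypothetical.  The patterns to avoid are shown only in comments; the
code itself is the form that should behave the same at -O0 and -O2.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Strict aliasing (and, on SPARC, alignment):
 *
 *     uint32_t v = *(uint32_t *)p;     <- avoid when p points into a
 *                                         char buffer
 *
 * Reading character data through a uint32_t lvalue breaks the aliasing
 * rules the optimizer relies on, and a misaligned p can raise SIGBUS on
 * SPARC.  Copying the bytes out is well defined for any alignment. */
uint32_t load_u32(const void *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof v);    /* host byte order, no alignment demand */
    return v;
}

/* Uninitialized variable:
 *
 *     int enabled;                     <- avoid: only set on one path
 *     if (...) enabled = 1;
 *     return enabled;
 *
 * At -O0 the stack slot often happens to hold zero; with optimization
 * the value can live in a register holding anything.  Initializing at
 * the declaration removes the dependence on luck. */
int test_mode_requested(const char *arg)
{
    int enabled = 0;

    if (arg != NULL && strcmp(arg, "-t") == 0)   /* "-t" is hypothetical */
        enabled = 1;
    return enabled;
}
```

A call such as load_u32(buf + 1) is the case that matters here: the
pointer-cast form may appear to work unoptimized on some machines and
fail only once the optimizer or the hardware enforces the rules.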


* Re: gcc 3.3.6 - stack corruption questions
  2005-07-25 22:28     ` Giovanni Bajo
@ 2005-07-26 21:06       ` Louis LeBlanc
  2005-07-26 21:52         ` Robert Dewar
  0 siblings, 1 reply; 10+ messages in thread
From: Louis LeBlanc @ 2005-07-26 21:06 UTC
  To: Giovanni Bajo; +Cc: Louis LeBlanc, gcc

On 07/26/05 12:28 AM, Giovanni Bajo sat at the `puter and typed:
> Louis LeBlanc <leblanc@keyslapper.net> wrote:
> 
> > I added the -fstack-check switch to my makefile and recompiled with
> > various optimizations.  I was pretty surprised at the file sizes that
> > showed up:
> > 
> > No Optimization:
> > -rwxr-xr-x  1 leblanc  daemon  1128660 Jul 25 16:25 myprocess*
> > 
> > Optimized with -O2
> > -rwxr-xr-x  1 leblanc  daemon  1058228 Jul 25 17:36 myprocess*
> > 
> > Optimized with -O3
> > -rwxr-xr-x  1 leblanc  daemon  1129792 Jul 25 17:32 myprocess*
> > 
> > I would have expected much different results.  Shouldn't the file
> > sizes be smaller (at least a little) with the -O3 switch?  Maybe
> > there's a loop unrolled to make it faster, resulting in a larger
> > codebase?
> 
> 
> Or inlining, or many other things. If you care about size, use -Os.


Well, I don't care about size that much.  This isn't an embedded app,
and performance and stability usually contend with each other for the
primary concern.

I did finally get a build with 4.0.1.  I certainly was surprised at
some of the things I found, but it did help a lot.

Funny thing is that the same compiler flags provide a much bigger
executable with 4.0.1 than with 3.3.6, but it doesn't core trying to
release a thread lock.  I've also confirmed that this happens without
contention because one of the apps that uses this code actually only
runs a single thread.

I used the following flags:
-ggdb -g3 -Wall -fstack-check -fno-strict-aliasing -O2
-mcpu=ultrasparc

The -fmudflap flag requires some odd linking I haven't figured out
yet, so I'm skipping it for now.

I also found, to my delight and surprise, that the same code appears
to perform between 10% and 20% better - in a rough, fairly imprecise
environment.

I still cannot find the cause of the core I get with the 3.3.6 version
though.

What I do get with 4.0.1 is some warnings I'd never seen before:
disk.c: In function 'disk_find':
disk.c:281: warning: frame size too large for reliable stack checking
disk.c:281: warning: try reducing the number of local variables

I was initially getting this for about 20 of my functions.  I know
very well that I'm using some large stacks, so I wasn't surprised
about that, I was just surprised that the threshold here appears to be
quite low (is it around 8K? - please clarify for me if you know).
This process does some pretty heavy text management for URL transforms
and transaction logging, so several routines statically allocate some
fairly large character buffers for this.

When I first saw these warnings, I went through and moved a few buffers
to different scopes and changed a couple to dynamic allocations, since
they weren't needed in all cases.  This helped eliminate most of these
(down to 6 functions), but some still require these buffers for all
transactions, so it doesn't make much sense (to me, anyway) to
allocate them dynamically.
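
[Editor's sketch, not part of the original message.] For buffers that
every transaction needs, one middle ground between a large local
array and a malloc() per call is to allocate the buffer once per
worker and pass it in.  Everything named here (txn_scratch,
URL_BUF_SIZE, transform_url) is hypothetical, and the 16K size is an
assumption rather than a figure from the thread.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define URL_BUF_SIZE (16 * 1024)    /* assumed size */

/* Scratch space owned by one worker thread and reused per transaction. */
struct txn_scratch {
    char  *url;
    size_t url_cap;
};

/* Allocate once at worker startup; returns 0 on success, -1 on failure. */
int txn_scratch_init(struct txn_scratch *s)
{
    s->url = malloc(URL_BUF_SIZE);
    s->url_cap = (s->url != NULL) ? URL_BUF_SIZE : 0;
    return (s->url != NULL) ? 0 : -1;
}

/* Release at worker shutdown. */
void txn_scratch_free(struct txn_scratch *s)
{
    free(s->url);
    s->url = NULL;
    s->url_cap = 0;
}

/* Stand-in for a routine that used to declare `char url[URL_BUF_SIZE];`
   as a local.  The frame now holds only a pointer, and the length check
   makes an overlong input an error instead of an overrun. */
int transform_url(struct txn_scratch *s, const char *in)
{
    size_t n = strlen(in);

    if (n >= s->url_cap)
        return -1;
    memcpy(s->url, in, n + 1);      /* copy including the terminator */
    return 0;
}
```

Each worker would call txn_scratch_init() once and hand the same
struct to every transaction it processes, so the per-call cost stays
close to that of the stack array while the frame size drops well
below the threshold in the warning.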

The source file that is still generating a core with 3.3.6 did have a
couple of these warnings (in different functions, not the one that was
coring), but they were easily accommodated, and even though the
warnings are gone in the 4.0.1 build, it still cores with the 3.3.6
build unless I turn off optimization for that file.  I know it's
confusing, but I'm trying to use the behavior from the two compiler
versions to make the code itself better, rather than just picking the
compiler that "seems to work".  I was actually hoping to get more
discernible info in the 3.3.6 build core after these changes, but the
core hasn't actually changed.

Looking at the gcc docs, I noticed the -fstack-limit flags, but I'm
not real clear if or how they can help, or exactly how they're used.
I am using the gcc linker, so the method described in the docs should
be fine if it applies, but it looks like an absolute address is needed
for the lower bound.

Any pointers on where I can get clarification on these flags and the
default stack size limits?

If I should be taking this to a usenet group or some other forum
instead of here, please let me know.

Once again, many thanks to all those that have provided input so far.
It's been very helpful.

Lou
-- 
Louis LeBlanc                                 leblanc@keyslapper.net
Fully Funded Hobbyist,                   KeySlapper Extrordinaire :þ
http://www.keyslapper.net                                       Ô¿Ô¬
Key fingerprint = C5E7 4762 F071 CE3B ED51  4FB8 AF85 A2FE 80C8 D9A2

QOTD:
  "If you keep an open mind people will throw a lot of garbage in it."


* Re: gcc 3.3.6 - stack corruption questions
  2005-07-26 21:06       ` Louis LeBlanc
@ 2005-07-26 21:52         ` Robert Dewar
  2005-07-26 22:27           ` Louis LeBlanc
  0 siblings, 1 reply; 10+ messages in thread
From: Robert Dewar @ 2005-07-26 21:52 UTC
  To: Louis LeBlanc; +Cc: Giovanni Bajo, gcc

Louis LeBlanc wrote:

> I also found, to my delight and surprise, that the same code appears
> to perform between 10% and 20% better - in a rough, fairly imprecise
> environment.

why surprise?



* Re: gcc 3.3.6 - stack corruption questions
  2005-07-26 21:52         ` Robert Dewar
@ 2005-07-26 22:27           ` Louis LeBlanc
  0 siblings, 0 replies; 10+ messages in thread
From: Louis LeBlanc @ 2005-07-26 22:27 UTC
  To: Robert Dewar; +Cc: Louis LeBlanc, Giovanni Bajo, gcc

On 07/26/05 05:52 PM, Robert Dewar sat at the `puter and typed:
> Louis LeBlanc wrote:
> 
> > I also found, to my delight and surprise, that the same code appears
> > to perform between 10% and 20% better - in a rough, fairly imprecise
> > environment.
> 
> why surprise?

I wasn't aware it was possible to get such a large boost just going to
a newer compiler version and relatively minor code restructuring.

Lou
-- 
Louis LeBlanc                                 leblanc@keyslapper.net
Fully Funded Hobbyist,                   KeySlapper Extrordinaire :þ
http://www.keyslapper.net                                       Ô¿Ô¬
Key fingerprint = C5E7 4762 F071 CE3B ED51  4FB8 AF85 A2FE 80C8 D9A2

Age, n.:
   That period of life in which we compound for the vices that we still
   cherish by reviling those that we no longer have the enterprise to commit.
    -- Ambrose Bierce

