public inbox for gcc-help@gcc.gnu.org
* illegal instruction (CPU mismatch)
       [not found] <2070054246.1500449.1474012574681.ref@mail.yahoo.com>
@ 2016-09-16  7:56 ` Mahmood Naderan
  2016-09-16  8:31   ` Markus Trippelsdorf
  0 siblings, 1 reply; 27+ messages in thread
From: Mahmood Naderan @ 2016-09-16  7:56 UTC (permalink / raw)
  To: gcc-help

Hi
In a cluster, there is a frontend and some compute nodes with the following CPU specs:


Frontend:
cpu family      : 21
model           : 2
model name      : AMD Opteron(tm) Processor 6380
stepping        : 0
GCC: 4.4.7

Computes:
cpu family      : 21
model           : 1
model name      : AMD Opteron(tm) Processor 6282 SE
stepping        : 2
GCC: 4.4.6



Specifically, the 6380 has the following flags, which the 6282 lacks:


fma, f16c, tch, tce, tbm and bmi1 


The problem is that I compiled OpenMPI and another program (written in Fortran) on the frontend. When I launch the run via MPI, the compute node fails with an illegal instruction:


--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 5383 on node compute-0-1 exited on signal 4 (Illegal instruction).
--------------------------------------------------------------------------


I compiled OMPI and the application with -march=amdfam10 on the frontend. A typical compile command for the application looks like this:


/export/apps/siesta/openmpi-2.0.0/bin/mpifort -c -g -Os -march=amdfam10   `FoX/FoX-config --fcflags`  -DMPI -DFC_HAVE_FLUSH -DFC_HAVE_ABORT -DTRANSIESTA    /export/apps/siesta/siesta-4.0/Src/pspltm1.F






The question is, how can I find the name of that illegal instruction? With that I can find which flag is missing.
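
One way to double-check which CPU feature flags differ between the two machines is to compare the "flags" lines of /proc/cpuinfo. The following is only a minimal sketch: it assumes each node's /proc/cpuinfo has been saved to a file, and the function names cpu_flags and extra_flags are made up for illustration.

```shell
# Print the feature flags from a saved /proc/cpuinfo dump, one per line, sorted.
cpu_flags() {
    # Take the first "flags" line, drop everything up to the colon,
    # split on spaces/tabs, and discard empty lines.
    sed -n '/^flags/{s/^[^:]*://;p;q;}' "$1" | tr ' \t' '\n\n' | sed '/^$/d' | LC_ALL=C sort -u
}

# Print the flags present in the first dump but missing from the second.
extra_flags() {
    ef_a=$(mktemp) && ef_b=$(mktemp) || return 1
    cpu_flags "$1" > "$ef_a"
    cpu_flags "$2" > "$ef_b"
    # comm -23 keeps only the lines unique to the first sorted file.
    LC_ALL=C comm -23 "$ef_a" "$ef_b"
    rm -f "$ef_a" "$ef_b"
}
```

For example, after saving /proc/cpuinfo from the frontend as frontend.txt and from a compute node as compute.txt (both file names are placeholders), "extra_flags frontend.txt compute.txt" lists the flags that only the frontend reports.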

 Regards,
Mahmood

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: illegal instruction (CPU mismatch)
  2016-09-16  7:56 ` illegal instruction (CPU mismatch) Mahmood Naderan
@ 2016-09-16  8:31   ` Markus Trippelsdorf
  2016-09-16  9:05     ` Mahmood Naderan
  2016-09-16 11:37     ` Jeffrey Walton
  0 siblings, 2 replies; 27+ messages in thread
From: Markus Trippelsdorf @ 2016-09-16  8:31 UTC (permalink / raw)
  To: Mahmood Naderan; +Cc: gcc-help

On 2016.09.16 at 07:56 +0000, Mahmood Naderan wrote:
> Hi
> In a cluster, there is a frontend and some compute nodes with the following CPU specs:
> 
> 
> Frontend:
> cpu family      : 21
> model           : 2
> model name      : AMD Opteron(tm) Processor 6380
> stepping        : 0
> GCC: 4.4.7
> 
> Computes:
> cpu family      : 21
> model           : 1
> model name      : AMD Opteron(tm) Processor 6282 SE
> stepping        : 2
> GCC: 4.4.6
> 
> 
> 
> Specifically, the 6380 has the following flags while 6282 doesn't have them
> 
> 
> fma, f16c, tch, tce, tbm and bmi1 
> 
> 
> Problem is that, I have compiled OpenMPI and another program (which is written in Fortran) on the frontend. When issue the run via MPI, the compute node fails with an illegal instruction
> 
> 
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 5383 on node compute-0-1 exited on signal 4 (Illegal instruction).
> --------------------------------------------------------------------------
> 
> 
> I have compiled OMPI and the application with -march=amdfam10 on the frontend. A snippet of the program looks like
> 
> 
> /export/apps/siesta/openmpi-2.0.0/bin/mpifort -c -g -Os -march=amdfam10   `FoX/FoX-config --fcflags`  -DMPI -DFC_HAVE_FLUSH -DFC_HAVE_ABORT -DTRANSIESTA    /export/apps/siesta/siesta-4.0/Src/pspltm1.F
> 
> 
> 
> 
> 
> 
> The question is, how can I find the name of that illegal instruction? With that I can find which flag is missing.

Run the application under gdb and type "disass".
This will give you a disassembly of the failing function and a visual
pointer to the failing instruction (or the very near vicinity).
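
As a rough sketch of the same idea in non-interactive form: gdb's -batch mode with -ex commands can start the program and, once it stops on SIGILL, print the instruction at the program counter plus a backtrace. The wrapper name sigill_report is hypothetical, and this only helps when gdb runs on the machine where the process actually crashes.

```shell
# Run a program under gdb and, if it stops on a signal, show where.
sigill_report() {
    # -batch : exit once all commands have been processed
    # -ex    : execute a single gdb command
    # --args : everything after this is the program and its arguments
    gdb -batch \
        -ex 'run' \
        -ex 'x/i $pc' \
        -ex 'bt' \
        --args "$@"
}
```

Usage would be something like "sigill_report ./myprog arg1". If the program reads its input from stdin, an interactive session with "run < inputfile" is probably simpler than this wrapper.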

-- 
Markus


* Re: illegal instruction (CPU mismatch)
  2016-09-16  8:31   ` Markus Trippelsdorf
@ 2016-09-16  9:05     ` Mahmood Naderan
  2016-09-16  9:07       ` Jonathan Wakely
  2016-09-16  9:07       ` Markus Trippelsdorf
  2016-09-16 11:37     ` Jeffrey Walton
  1 sibling, 2 replies; 27+ messages in thread
From: Mahmood Naderan @ 2016-09-16  9:05 UTC (permalink / raw)
  To: Markus Trippelsdorf; +Cc: gcc-help

>Run the application under gdb and type "disass".

Should I recompile all applications (OMPI, my app and friends) with -g -ggdb?
Currently they are built with -g -Os -march.


 Regards,
Mahmood


* Re: illegal instruction (CPU mismatch)
  2016-09-16  9:05     ` Mahmood Naderan
@ 2016-09-16  9:07       ` Jonathan Wakely
  2016-09-16 10:55         ` Mahmood Naderan
  2016-09-16  9:07       ` Markus Trippelsdorf
  1 sibling, 1 reply; 27+ messages in thread
From: Jonathan Wakely @ 2016-09-16  9:07 UTC (permalink / raw)
  To: Mahmood Naderan; +Cc: Markus Trippelsdorf, gcc-help

On 16 September 2016 at 10:05, Mahmood Naderan wrote:
>>Run the application under gdb and type "disass".
>
> Should I recompile all applications (OMPI, my app and friends) with -g -ggdb?
> Currently they are with -g -Os -march

No, that's not necessary.


* Re: illegal instruction (CPU mismatch)
  2016-09-16  9:05     ` Mahmood Naderan
  2016-09-16  9:07       ` Jonathan Wakely
@ 2016-09-16  9:07       ` Markus Trippelsdorf
  1 sibling, 0 replies; 27+ messages in thread
From: Markus Trippelsdorf @ 2016-09-16  9:07 UTC (permalink / raw)
  To: Mahmood Naderan; +Cc: gcc-help

On 2016.09.16 at 09:05 +0000, Mahmood Naderan wrote:
> >Run the application under gdb and type "disass".
> 
> Should I recompile all applications (OMPI, my app and friends) with -g -ggdb?
> Currently they are with -g -Os -march

No, it is not necessary.

-- 
Markus


* Re: illegal instruction (CPU mismatch)
  2016-09-16  9:07       ` Jonathan Wakely
@ 2016-09-16 10:55         ` Mahmood Naderan
  2016-09-16 10:57           ` Markus Trippelsdorf
  2016-09-20  9:20           ` Richard Earnshaw (lists)
  0 siblings, 2 replies; 27+ messages in thread
From: Mahmood Naderan @ 2016-09-16 10:55 UTC (permalink / raw)
  To: Jonathan Wakely; +Cc: Markus Trippelsdorf, gcc-help

>No, that's not necessary.

So I ran the program within GDB, but the disas command says "No frame selected". Please see below:

$ cat sc.sh
#!/bin/bash

ulimit -c unlimited

exec /share/apps/siesta/siesta-4.0/tpar/transiesta < trans-cc-bt-cc-163-20.fdf

$ cat sc2.sh

#!/bin/bash

/share/apps/siesta/openmpi-2.0.0/bin/mpirun -hostfile hosts.txt -np 15 sc.sh
$ gdb --args bash sc2.sh
(gdb) r

Starting program: /bin/bash sc2.sh

Detaching after fork from child process 29640.

....
....
--------------------------------------------------------------------------

mpirun noticed that process rank 0 with PID 9443 on node compute-0-1 exited on signal 4 (Illegal instruction).

--------------------------------------------------------------------------

Program exited with code 0204.

(gdb) disas

No frame selected.

(gdb)




Any idea?

 Regards,
Mahmood


* Re: illegal instruction (CPU mismatch)
  2016-09-16 10:55         ` Mahmood Naderan
@ 2016-09-16 10:57           ` Markus Trippelsdorf
  2016-09-16 11:01             ` Markus Trippelsdorf
  2016-09-20  9:20           ` Richard Earnshaw (lists)
  1 sibling, 1 reply; 27+ messages in thread
From: Markus Trippelsdorf @ 2016-09-16 10:57 UTC (permalink / raw)
  To: Mahmood Naderan; +Cc: Jonathan Wakely, gcc-help

On 2016.09.16 at 10:54 +0000, Mahmood Naderan wrote:
> >No, that's not necessary.
> 
> So, within GDB, I ran the program, but disas command says "no frame selected". please see below
> 
> $ cat sc.sh
> #!/bin/bash
> 
> ulimit -c unlimited
> 
> exec /share/apps/siesta/siesta-4.0/tpar/transiesta < trans-cc-bt-cc-163-20.fdf
> 
> $ cat sc2.sh
> 
> #!/bin/bash
> 
> /share/apps/siesta/openmpi-2.0.0/bin/mpirun -hostfile hosts.txt -np 15 sc.sh
> $ gdb --args bash sc2.sh
> (gdb) r

gdb --args /share/apps/siesta/openmpi-2.0.0/bin/mpirun -hostfile hosts.txt -np 15 sc.sh 

-- 
Markus


* Re: illegal instruction (CPU mismatch)
  2016-09-16 10:57           ` Markus Trippelsdorf
@ 2016-09-16 11:01             ` Markus Trippelsdorf
  2016-09-16 11:05               ` Mahmood Naderan
  2016-09-16 11:07               ` Markus Trippelsdorf
  0 siblings, 2 replies; 27+ messages in thread
From: Markus Trippelsdorf @ 2016-09-16 11:01 UTC (permalink / raw)
  To: Mahmood Naderan; +Cc: Jonathan Wakely, gcc-help

On 2016.09.16 at 12:57 +0200, Markus Trippelsdorf wrote:
> On 2016.09.16 at 10:54 +0000, Mahmood Naderan wrote:
> > >No, that's not necessary.
> > 
> > So, within GDB, I ran the program, but disas command says "no frame selected". please see below
> > 
> > $ cat sc.sh
> > #!/bin/bash
> > 
> > ulimit -c unlimited
> > 
> > exec /share/apps/siesta/siesta-4.0/tpar/transiesta < trans-cc-bt-cc-163-20.fdf
> > 
> > $ cat sc2.sh
> > 
> > #!/bin/bash
> > 
> > /share/apps/siesta/openmpi-2.0.0/bin/mpirun -hostfile hosts.txt -np 15 sc.sh
> > $ gdb --args bash sc2.sh
> > (gdb) r
> 
> gdb --args /share/apps/siesta/openmpi-2.0.0/bin/mpirun -hostfile hosts.txt -np 15 sc.sh 

and perhaps:
set follow-fork-mode child

-- 
Markus


* Re: illegal instruction (CPU mismatch)
  2016-09-16 11:01             ` Markus Trippelsdorf
@ 2016-09-16 11:05               ` Mahmood Naderan
  2016-09-16 11:09                 ` Markus Trippelsdorf
  2016-09-16 11:07               ` Markus Trippelsdorf
  1 sibling, 1 reply; 27+ messages in thread
From: Mahmood Naderan @ 2016-09-16 11:05 UTC (permalink / raw)
  To: Markus Trippelsdorf; +Cc: Jonathan Wakely, gcc-help

>and perhaps:
>set follow-fork-mode child

Is this a GDB command? Should I execute it before "run" in the GDB or after the crash?

 Regards,
Mahmood


* Re: illegal instruction (CPU mismatch)
  2016-09-16 11:01             ` Markus Trippelsdorf
  2016-09-16 11:05               ` Mahmood Naderan
@ 2016-09-16 11:07               ` Markus Trippelsdorf
  1 sibling, 0 replies; 27+ messages in thread
From: Markus Trippelsdorf @ 2016-09-16 11:07 UTC (permalink / raw)
  To: Mahmood Naderan; +Cc: Jonathan Wakely, gcc-help

On 2016.09.16 at 13:00 +0200, Markus Trippelsdorf wrote:
> On 2016.09.16 at 12:57 +0200, Markus Trippelsdorf wrote:
> > On 2016.09.16 at 10:54 +0000, Mahmood Naderan wrote:
> > > >No, that's not necessary.
> > > 
> > > So, within GDB, I ran the program, but disas command says "no frame selected". please see below
> > > 
> > > $ cat sc.sh
> > > #!/bin/bash
> > > 
> > > ulimit -c unlimited
> > > 
> > > exec /share/apps/siesta/siesta-4.0/tpar/transiesta < trans-cc-bt-cc-163-20.fdf
> > > 
> > > $ cat sc2.sh
> > > 
> > > #!/bin/bash
> > > 
> > > /share/apps/siesta/openmpi-2.0.0/bin/mpirun -hostfile hosts.txt -np 15 sc.sh
> > > $ gdb --args bash sc2.sh
> > > (gdb) r
> > 
> > gdb --args /share/apps/siesta/openmpi-2.0.0/bin/mpirun -hostfile hosts.txt -np 15 sc.sh 
> 
> and perhaps:
> set follow-fork-mode child

And of course you should run this locally on the actual failing machine.
I have no idea how to debug Open MPI jobs in general.

-- 
Markus


* Re: illegal instruction (CPU mismatch)
  2016-09-16 11:05               ` Mahmood Naderan
@ 2016-09-16 11:09                 ` Markus Trippelsdorf
  0 siblings, 0 replies; 27+ messages in thread
From: Markus Trippelsdorf @ 2016-09-16 11:09 UTC (permalink / raw)
  To: Mahmood Naderan; +Cc: Jonathan Wakely, gcc-help

On 2016.09.16 at 11:05 +0000, Mahmood Naderan wrote:
> >and perhaps:
> >set follow-fork-mode child
> 
> Is this a GDB command? Should I execute it before "run" in the GDB or after the crash?

Yes, before "run". But I doubt it will work if you don't run this on the
actual failing machine.
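
To make the ordering concrete, here is a small sketch that puts the setting before "run" in a gdb command file and feeds it to gdb with -x. The helper names are invented for illustration. Note the assumption that the crashing process is a local child of the traced program: gdb cannot follow processes that a launcher starts on other nodes.

```shell
# Write a gdb command file that follows the child across fork before running.
write_follow_child_script() {
    # $1 = path of the command file to create
    cat > "$1" <<'EOF'
set follow-fork-mode child
run
x/i $pc
bt
EOF
}

# Run the given command line under gdb using that command file.
debug_following_child() {
    ffc_script=$(mktemp) || return 1
    write_follow_child_script "$ffc_script"
    gdb -batch -x "$ffc_script" --args "$@"
    ffc_status=$?
    rm -f "$ffc_script"
    return "$ffc_status"
}
```

With follow-fork-mode set to child, gdb stays with the new process at each fork and detaches from the parent, so this suits a simple wrapper-then-program chain better than a launcher that forks many helpers.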

-- 
Markus


* Re: illegal instruction (CPU mismatch)
  2016-09-16  8:31   ` Markus Trippelsdorf
  2016-09-16  9:05     ` Mahmood Naderan
@ 2016-09-16 11:37     ` Jeffrey Walton
  2016-09-16 12:12       ` Mahmood Naderan
                         ` (2 more replies)
  1 sibling, 3 replies; 27+ messages in thread
From: Jeffrey Walton @ 2016-09-16 11:37 UTC (permalink / raw)
  To: Markus Trippelsdorf; +Cc: Mahmood Naderan, gcc-help

>> Specifically, the 6380 has the following flags while 6282 doesn't have them
>>
>>
>> fma, f16c, tch, tce, tbm and bmi1
>>
>>
>> Problem is that, I have compiled OpenMPI and another program (which is written in Fortran) on the frontend. When issue the run via MPI, the compute node fails with an illegal instruction

I believe many AMD processors lack BMI, BMI2, ADX, etc. I don't know
about that particular model.

Try adding -mno-bmi to your CFLAGS and CXXFLAGS to clear the BMI/BMI2 issue.

I don't know about the other cpu flags. GCC is good about taking a cpu
feature, like ADX, and using -madx and -mno-adx. An exception is
RDRAND, which omits the A for some reason; you use -mrdrnd.

Jeff


* Re: illegal instruction (CPU mismatch)
  2016-09-16 11:37     ` Jeffrey Walton
@ 2016-09-16 12:12       ` Mahmood Naderan
  2016-09-16 12:13       ` Mahmood Naderan
  2016-09-16 12:40       ` Mahmood Naderan
  2 siblings, 0 replies; 27+ messages in thread
From: Mahmood Naderan @ 2016-09-16 12:12 UTC (permalink / raw)
  To: noloader, Markus Trippelsdorf; +Cc: gcc-help

Markus,
I ran "set follow-fork-mode child" in another terminal while connected to the compute node. Please see the gdb output:

$ gdb --args /share/apps/siesta/openmpi-2.0.0/bin/mpirun -hostfile hosts.txt -np 15 sc.sh
Reading symbols from /share/apps/siesta/openmpi-2.0.0/bin/mpirun...done.
(gdb) run
Starting program: /share/apps/siesta/openmpi-2.0.0/bin/mpirun -hostfile hosts.txt -np 15 sc.sh
[Thread debugging using libthread_db enabled]
[New Thread 0x2aaaaafcc700 (LWP 26346)]
[New Thread 0x2aaaab1cd700 (LWP 26347)]
[New Thread 0x2aaaab3ce700 (LWP 26348)]
[New Thread 0x2aaaab5cf700 (LWP 26349)]
Detaching after fork from child process 26350.

...
...
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 16401 on node compute-0-1 exited on signal 4 (Illegal instruction).
--------------------------------------------------------------------------
[Thread 0x2aaaab5cf700 (LWP 26349) exited]
[Thread 0x2aaaab3ce700 (LWP 26348) exited]
[Thread 0x2aaaab1cd700 (LWP 26347) exited]
[Thread 0x2aaaaafcc700 (LWP 26346) exited]

Program exited with code 0204.
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.el6_3.6.x86_64 libibverbs-1.1.8-4.el6.x86_64 libnl-1.1-14.el6.x86_64 libudev-147-2.42.el6.x86_64
(gdb) disas
No frame selected.
(gdb)




This time the output has some more lines, so I think "follow-fork-mode" worked. But the process is still dead!


Regards,
Mahmood


* Re: illegal instruction (CPU mismatch)
  2016-09-16 11:37     ` Jeffrey Walton
  2016-09-16 12:12       ` Mahmood Naderan
@ 2016-09-16 12:13       ` Mahmood Naderan
  2016-09-16 12:40       ` Mahmood Naderan
  2 siblings, 0 replies; 27+ messages in thread
From: Mahmood Naderan @ 2016-09-16 12:13 UTC (permalink / raw)
  To: noloader, Markus Trippelsdorf; +Cc: gcc-help

Jeff,
I will try with -mno-bmi -mno-adx -mrdrnd



 Regards,
Mahmood


* Re: illegal instruction (CPU mismatch)
  2016-09-16 11:37     ` Jeffrey Walton
  2016-09-16 12:12       ` Mahmood Naderan
  2016-09-16 12:13       ` Mahmood Naderan
@ 2016-09-16 12:40       ` Mahmood Naderan
  2016-09-16 12:47         ` Markus Trippelsdorf
  2 siblings, 1 reply; 27+ messages in thread
From: Mahmood Naderan @ 2016-09-16 12:40 UTC (permalink / raw)
  To: noloader, Markus Trippelsdorf, Jonathan Wakely; +Cc: gcc-help



>Try adding -mno-bmi to your CFLAGS and CXXFLAGS to clear the BMI/BMI2 issue.

>I don't know about the other cpu flags. GCC is good about taking a cpu
>feature, like ADX, and using -madx and -mno-adx. An exception is
>RDRAND, whits omits the A for some reason; you use -mrdrnd.

>Jeff


Using -mno-adx returns 



checking for linker flag to name executables... configure: error: Could not determine flag to name executables


Using -mrdrnd -mno-bmi returns the "illegal instruction" much sooner than before.


This is a very bad issue. I really want to know what the instruction is!


Regards,
Mahmood


* Re: illegal instruction (CPU mismatch)
  2016-09-16 12:40       ` Mahmood Naderan
@ 2016-09-16 12:47         ` Markus Trippelsdorf
  2016-09-16 12:50           ` Mahmood Naderan
  0 siblings, 1 reply; 27+ messages in thread
From: Markus Trippelsdorf @ 2016-09-16 12:47 UTC (permalink / raw)
  To: Mahmood Naderan; +Cc: noloader, Jonathan Wakely, gcc-help

On 2016.09.16 at 12:40 +0000, Mahmood Naderan wrote:
> 
> 
> >Try adding -mno-bmi to your CFLAGS and CXXFLAGS to clear the BMI/BMI2 issue.
> 
> >I don't know about the other cpu flags. GCC is good about taking a cpu
> >feature, like ADX, and using -madx and -mno-adx. An exception is
> >RDRAND, whits omits the A for some reason; you use -mrdrnd.
> 
> >Jeff
> 
> 
> Using -mno-adx returns 
> 
> 
> 
> checking for linker flag to name executables... configure: error: Could not determine flag to name executables
> 
> 
> Using -mrdrnd -mno-bmi returns the "illegal instruction" much sooner than before.
> 
> 
> This is a very bad issue. I really want to know what is the instruction?!

Well, if you simply want to avoid the issue, just compile without
-march=amdfam10 (this really assumes that all machines are of the same
type) or use -mtune=amdfam10 instead.
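
After rebuilding with different flags, one rough way to check whether a binary still contains instructions from a newer extension is to disassemble it and count matching mnemonics. This is a heuristic sketch only: it assumes GNU objdump's tab-separated "address, bytes, instruction" layout, and the function name count_insns and the regular expression you pass are your own choices.

```shell
# Count disassembled instructions whose text matches an extended regex.
count_insns() {
    # $1 = binary or object file, $2 = regex applied to the instruction column
    objdump -d "$1" | awk -F'\t' -v re="$2" '$3 ~ re {n++} END {print n+0}'
}
```

For instance, "count_insns ./myprog '^vfmadd[0-9]'" would count mnemonics such as vfmadd132pd or vfmadd231sd. A non-zero count does not prove those instructions are ever executed, since libraries often select code paths at run time, but a zero count is a useful sanity check.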

Details on how to remotely debug Open MPI jobs are off topic on this list.

-- 
Markus


* Re: illegal instruction (CPU mismatch)
  2016-09-16 12:47         ` Markus Trippelsdorf
@ 2016-09-16 12:50           ` Mahmood Naderan
  2016-09-16 13:37             ` Jonathan Wakely
  0 siblings, 1 reply; 27+ messages in thread
From: Mahmood Naderan @ 2016-09-16 12:50 UTC (permalink / raw)
  To: Markus Trippelsdorf; +Cc: noloader, Jonathan Wakely, gcc-help

> -march=amdfam10


Actually, I tried with -march and got the same result; dropping -march (the default is native) gives the same result too. I will try with -mtune.

 Regards,
Mahmood


* Re: illegal instruction (CPU mismatch)
  2016-09-16 12:50           ` Mahmood Naderan
@ 2016-09-16 13:37             ` Jonathan Wakely
  0 siblings, 0 replies; 27+ messages in thread
From: Jonathan Wakely @ 2016-09-16 13:37 UTC (permalink / raw)
  To: Mahmood Naderan; +Cc: Markus Trippelsdorf, noloader, gcc-help

On 16 September 2016 at 13:50, Mahmood Naderan wrote:
>> -march=amdfam10
>
>
> Actually, I tried with march and got the same, dropping march (the default is native) also results the same. I will try with mtune.

The default should not be native, that compiles code that can only run
on a particular machine (or identical ones). That's a very bad
default.

Using a more generic instruction set would produce binaries that run
on all hardware, e.g. -march=x86-64
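
To see what a given -march value actually switches on, GCC can be asked to print its target option state with -Q --help=target. The sketch below filters that listing for a few features; the function name target_flags is hypothetical, and both the set of option names and the exact output layout vary between GCC versions.

```shell
# Show whether selected -m<feature> options are enabled for a given -march.
target_flags() {
    # $1 = value for -march; remaining arguments = feature names without "-m"
    tf_march=$1
    shift
    for tf_feat in "$@"; do
        gcc -march="$tf_march" -Q --help=target |
            awk -v opt="-m$tf_feat" '$1 == opt {print $1, $2}'
    done
}
```

Something like "target_flags x86-64 sse3 avx" followed by "target_flags amdfam10 sse3 avx" makes the difference between the two settings visible. A feature the installed compiler does not know about simply produces no line.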


* Re: illegal instruction (CPU mismatch)
  2016-09-16 10:55         ` Mahmood Naderan
  2016-09-16 10:57           ` Markus Trippelsdorf
@ 2016-09-20  9:20           ` Richard Earnshaw (lists)
  2016-09-20  9:54             ` Mahmood Naderan
  1 sibling, 1 reply; 27+ messages in thread
From: Richard Earnshaw (lists) @ 2016-09-20  9:20 UTC (permalink / raw)
  To: Mahmood Naderan, Jonathan Wakely; +Cc: Markus Trippelsdorf, gcc-help

On 16/09/16 11:54, Mahmood Naderan wrote:
>> No, that's not necessary.
> 
> So, within GDB, I ran the program, but disas command says "no frame selected". please see below
> 
> $ cat sc.sh
> #!/bin/bash
> 
> ulimit -c unlimited
> 
> exec /share/apps/siesta/siesta-4.0/tpar/transiesta < trans-cc-bt-cc-163-20.fdf
> 
> $ cat sc2.sh
> 
> #!/bin/bash
> 
> /share/apps/siesta/openmpi-2.0.0/bin/mpirun -hostfile hosts.txt -np 15 sc.sh
> $ gdb --args bash sc2.sh
> (gdb) r
> 
> Starting program: /bin/bash sc2.sh
> 
> Detaching after fork from child process 29640.
> 
> ....
> ....
> --------------------------------------------------------------------------
> 
> mpirun noticed that process rank 0 with PID 9443 on node compute-0-1 exited on signal 4 (Illegal instruction).
> 
> --------------------------------------------------------------------------
> 
> Program exited with code 0204.
> 
> (gdb) disas
> 
> No frame selected.

In these circumstances 'x/i $pc' may be your friend, it will just try to
disassemble the exact instruction at the current PC.  If that still
doesn't work, then your program may have jumped off into the weeds and
the PC may not even be pointing at a valid location (but then I'd not
expect the signal to be SIGILL in that case).

HTH.

R.


* Re: illegal instruction (CPU mismatch)
  2016-09-20  9:20           ` Richard Earnshaw (lists)
@ 2016-09-20  9:54             ` Mahmood Naderan
  2016-09-20 10:06               ` Richard Earnshaw (lists)
  0 siblings, 1 reply; 27+ messages in thread
From: Mahmood Naderan @ 2016-09-20  9:54 UTC (permalink / raw)
  To: Richard Earnshaw (lists), Jonathan Wakely; +Cc: Markus Trippelsdorf, gcc-help

> In these circumstances 'x/i $pc' may be your friend, it will just try to

> disassemble the exact instruction at the current PC.

What does that mean? I didn't understand... Can you explain more? Which command should I use in order to catch the illegal instruction?


 
Regards,
Mahmood


* Re: illegal instruction (CPU mismatch)
  2016-09-20  9:54             ` Mahmood Naderan
@ 2016-09-20 10:06               ` Richard Earnshaw (lists)
  2016-09-20 10:20                 ` Mahmood Naderan
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Earnshaw (lists) @ 2016-09-20 10:06 UTC (permalink / raw)
  To: Mahmood Naderan, Jonathan Wakely; +Cc: Markus Trippelsdorf, gcc-help

On 20/09/16 10:54, Mahmood Naderan wrote:
>> In these circumstances 'x/i $pc' may be your friend, it will just try to
> 
> disassemble the exact instruction at the current PC.
> 
> What does that mean? I didn't understand... Can you explain more? Which command should I use in order to catch the illegal instruction
> 

It's a gdb command.  Run it in the debugger after the program has
stopped (or once you've loaded the core file).

R.


* Re: illegal instruction (CPU mismatch)
  2016-09-20 10:06               ` Richard Earnshaw (lists)
@ 2016-09-20 10:20                 ` Mahmood Naderan
  2016-09-20 13:29                   ` Richard Earnshaw (lists)
  0 siblings, 1 reply; 27+ messages in thread
From: Mahmood Naderan @ 2016-09-20 10:20 UTC (permalink / raw)
  To: Richard Earnshaw (lists), Jonathan Wakely; +Cc: Markus Trippelsdorf, gcc-help

>It's a gdb command.  Run it in the debugger after the program has
>stopped (or once you've loaded the core file).

How can I create the core file and then import it to GDB?
Is that "gdb --core"?

 Regards,
Mahmood






* Re: illegal instruction (CPU mismatch)
  2016-09-20 10:20                 ` Mahmood Naderan
@ 2016-09-20 13:29                   ` Richard Earnshaw (lists)
  2016-09-20 18:47                     ` Mahmood Naderan
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Earnshaw (lists) @ 2016-09-20 13:29 UTC (permalink / raw)
  To: Mahmood Naderan, Jonathan Wakely; +Cc: Markus Trippelsdorf, gcc-help

On 20/09/16 11:20, Mahmood Naderan wrote:
>> It's a gdb command.  Run it in the debugger after the program has
>> stopped (or once you've loaded the core file).
> 
> How can I create the core file and then import it to GDB?
> Is that "gdb --core"?
> 
>  Regards,
> Mahmood
> 
> 
> 
> 
> It's a gdb command.  Run it in the debugger after the program has
> stopped (or once you've loaded the core file).
> 

$ ulimit -c unlimited

should normally enable core dumping, but if that doesn't work it may be
disabled at the system level and changing that may be more tricky.

$ ulimit -a

will tell you what the soft limits are, and

$ ulimit -aH

will tell you the hard limits.  You can only increase limits up to the
hard limits.

If you can get a core file, then run

$ gdb <binary> <core-file>

where <binary> is the name of the program being run and <core-file> is
the ... well you can work that out!
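
Putting those steps together, a sketch might look like the following. The helper names core_limits and inspect_core are invented here, and the gdb invocation assumes you already have a core file in hand.

```shell
# Report the current soft and hard limits on core file size.
core_limits() {
    # The soft limit is the one in effect; it can be raised up to the hard limit.
    printf 'soft: %s\nhard: %s\n' "$(ulimit -S -c)" "$(ulimit -H -c)"
}

# Load a core file and print the faulting instruction and a backtrace.
inspect_core() {
    # $1 = the executable that crashed, $2 = the core file
    gdb -batch -ex 'x/i $pc' -ex 'bt' "$1" "$2"
}
```

The first argument to inspect_core should be the program that actually crashed, not a launcher or wrapper script. When in doubt, gdb names the originating command in its "Core was generated by" line.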


R.


* Re: illegal instruction (CPU mismatch)
  2016-09-20 13:29                   ` Richard Earnshaw (lists)
@ 2016-09-20 18:47                     ` Mahmood Naderan
  2016-09-21  9:48                       ` Richard Earnshaw (lists)
  0 siblings, 1 reply; 27+ messages in thread
From: Mahmood Naderan @ 2016-09-20 18:47 UTC (permalink / raw)
  To: Richard Earnshaw (lists), Jonathan Wakely; +Cc: Markus Trippelsdorf, gcc-help

I ran the command from the compute node. I also set the number of the threads to 1.

mahmood@compute-0-1:tran-bt-o-40$ ulimit -c unlimited
mahmood@compute-0-1:tran-bt-o-40$ gdb --args /share/apps/siesta/openmpi-2.0.0/bin/mpirun -hostfile hosts.txt -np  sc.sh
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)

Reading symbols from /share/apps/siesta/openmpi-2.0.0/bin/mpirun...done.
(gdb) run
...
...
[Thread 0x2aaaab447700 (LWP 32506) exited]
[Thread 0x2aaaab246700 (LWP 32505) exited]

Program exited with code 0204.
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.el6_3.6.x86_64 libibverbs-1.1.6-4.el6.x86_64 ibudev-147-2.42.el6.x86_64
(gdb) disas
No frame selected.
(gdb) x/i $pc
No registers.
(gdb) q
mahmood@compute-0-1:tran-bt-o-40$ ls -l core*
-rw------- 1 mahmood nfsnobody 2342809600 Sep 20 23:12 core.5767







>If you can get a core file, then run
>$ gdb <binary> <core-file>

So, please see the output


mahmood@compute-0-1:tran-bt-o-40$ gdb /share/apps/siesta/openmpi-2.0.0/bin/mpirun core.5767
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)
Reading symbols from /share/apps/siesta/openmpi-2.0.0/bin/mpirun...done.
warning: core file may not match specified executable file.
[New Thread 5767]
..
[New Thread 5784]
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `/share/apps/siesta/siesta-4.0/tpar/transiesta'.
Program terminated with signal 4, Illegal instruction.
#0  0x00000000008d3a58 in ?? ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.el6_3.6.x86_64
(gdb) disas
No function contains program counter for selected frame.
(gdb) mahmood@compute-0-1:tran-bt-o-40$ ls -l core*
Undefined command: "mahmood".  Try "help".
(gdb) x/i $pc
=> 0x8d3a58:    Cannot access memory at address 0x8d3a58
(gdb)





Do you have any idea? I am still not able to see the illegal instruction.


 Regards,
Mahmood


* Re: illegal instruction (CPU mismatch)
  2016-09-20 18:47                     ` Mahmood Naderan
@ 2016-09-21  9:48                       ` Richard Earnshaw (lists)
  2016-09-21 17:49                         ` Mahmood Naderan
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Earnshaw (lists) @ 2016-09-21  9:48 UTC (permalink / raw)
  To: Mahmood Naderan, Jonathan Wakely; +Cc: Markus Trippelsdorf, gcc-help

On 20/09/16 19:47, Mahmood Naderan wrote:
> I ran the command from the compute node. I also set the number of the threads to 1.
> 
> mahmood@compute-0-1:tran-bt-o-40$ ulimit -c unlimited
> mahmood@compute-0-1:tran-bt-o-40$ gdb --args /share/apps/siesta/openmpi-2.0.0/bin/mpirun -hostfile hosts.txt -np  sc.sh
> GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)
> 
> Reading symbols from /share/apps/siesta/openmpi-2.0.0/bin/mpirun...done.
> (gdb) run
> ...
> ...
> [Thread 0x2aaaab447700 (LWP 32506) exited]
> [Thread 0x2aaaab246700 (LWP 32505) exited]
> 
> Program exited with code 0204.
> Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.el6_3.6.x86_64 libibverbs-1.1.6-4.el6.x86_64 ibudev-147-2.42.el6.x86_64
> (gdb) disas
> No frame selected.
> (gdb) x/i $pc
> No registers.
> (gdb) q
> mahmood@compute-0-1:tran-bt-o-40$ ls -l core*
> -rw------- 1 mahmood nfsnobody 2342809600 Sep 20 23:12 core.5767
> 
> 
> 
> 
> 
> 
> 
>> If you can get a core file, then run
>> $ gdb <binary> <core-file>
> 
> So, please see the output
> 
> 
> mahmood@compute-0-1:tran-bt-o-40$ gdb /share/apps/siesta/openmpi-2.0.0/bin/mpirun core.5767
> GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)
> Reading symbols from /share/apps/siesta/openmpi-2.0.0/bin/mpirun...done.
> warning: core file may not match specified executable file.
> [New Thread 5767]
> ..
> [New Thread 5784]
> Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> Core was generated by `/share/apps/siesta/siesta-4.0/tpar/transiesta'.
> Program terminated with signal 4, Illegal instruction.
> #0  0x00000000008d3a58 in ?? ()
> Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.el6_3.6.x86_64
> (gdb) disas
> No function contains program counter for selected frame.
> (gdb) mahmood@compute-0-1:tran-bt-o-40$ ls -l core*
> Undefined command: "mahmood".  Try "help".
> (gdb) x/i $pc
> => 0x8d3a58:    Cannot access memory at address 0x8d3a58
> (gdb)
> 
> 
> 
> 
> 
> Do you have any idea? Still I am not able to see the illegal instruction
> #

Sounds like the program has jumped off into the weeds.  At this point I
think you're going to have to start examining what's on the stack to see
if you can find any clues (don't forget to look below the current SP as
well as above it, since it may be a return operation from a corrupted
stack).
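
As a starting point for that kind of stack inspection, the sketch below dumps the key registers and the 8-byte words on both sides of the stack pointer, formatted as addresses so that gdb annotates any that fall inside known symbols. It assumes an x86-64 core file; the function name dump_stack_words and the default word count are arbitrary.

```shell
# Dump registers and the stack words below and above SP from a core file.
dump_stack_words() {
    # $1 = executable, $2 = core file, $3 = words to show on each side (default 16)
    dsw_n=${3:-16}
    gdb -batch \
        -ex 'info registers rip rsp rbp' \
        -ex "x/${dsw_n}ga \$sp - 8*${dsw_n}" \
        -ex "x/${dsw_n}ga \$sp" \
        "$1" "$2"
}
```

Words that resolve to something like <function+offset> are candidate return addresses and can hint at where control was before the bad jump.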

I'm not sure there's much else I can add at this point.  It's all down
to detective work now.

R.

> 
>  Regards,
> Mahmood
> 


* Re: illegal instruction (CPU mismatch)
  2016-09-21  9:48                       ` Richard Earnshaw (lists)
@ 2016-09-21 17:49                         ` Mahmood Naderan
  2016-09-21 19:04                           ` Markus Trippelsdorf
  0 siblings, 1 reply; 27+ messages in thread
From: Mahmood Naderan @ 2016-09-21 17:49 UTC (permalink / raw)
  To: Richard Earnshaw (lists), Jonathan Wakely; +Cc: Markus Trippelsdorf, gcc-help

OK guys...
I built a serial version of that application, so MPI is no longer involved. The good news is that I was able to reproduce the illegal instruction.

I built the application with -g -ggdb on the frontend, then ran it on the compute node and attached GDB once the process had started.

Please see the GDB output


(gdb) c
Continuing.

Program received signal SIGILL, Illegal instruction.
0x00000000008fdb6e in zdot_kernel_8 ()
(gdb) bt
#0  0x00000000008fdb6e in zdot_kernel_8 ()
#1  0x00000000008fdd70 in zdotc_k ()
#2  0x00000000008404fe in zdotc_ ()
#3  0x000000000082e1ce in zpotf2 (uplo=<value optimized out>, n=64, a=..., lda=1847, info=0,
_uplo=<value optimized out>) at zpotf2.f:118
#4  0x000000000080a529 in zpotrf (uplo=<value optimized out>, n=1847, a=..., lda=1847, info=0,
_uplo=<value optimized out>) at zpotrf.f:129
#5  0x00000000004b9590 in cdiag (h=..., s=..., n=1847, nm=1847, nml=1847, w=..., z=..., neigvec=0, iscf=1,
ierror=0) at /export/apps/siesta/siesta-4.0/Src/cdiag.F:480
#6  0x000000000042d3c1 in diagk (nspin=1, nuo=1847, no=3694, maxspn=1, maxnh=827341, maxnd=827341, maxo=1847,
numh=..., listhptr=..., listh=..., numd=..., listdptr=..., listd=..., h=..., s=..., getd=.TRUE.,
getpsi=.FALSE., fixspin=.FALSE., qtot=554.00000000000125, qs=..., temp=0.0019824056039010785, e1=1, e2=-1,
xij=..., indxuo=..., nk=6, kpoint=..., wk=..., eo=..., qo=..., dnew=..., enew=..., ef=0, efs=...,
entropy=0, haux=..., saux=..., psi=..., dk=..., ek=..., aux=..., nuotot=1847,
occtol=9.9999999999999998e-13, iscf=1) at /export/apps/siesta/siesta-4.0/Src/diagk.F:171
#7  0x0000000000421a1e in diagon (no=3694, maxspn=1, maxuo=1847, maxnh=827341, maxnd=827341, maxo=1847,
numh=..., listhptr=..., listh=..., numd=..., listdptr=..., listd=..., h=..., s=...,
qtot=554.00000000000125, fixspin=.FALSE., qs=..., temp=0.0019824056039010785, e1=1, e2=-1, gamma=.FALSE.,
xij=..., indxuo=..., kpoint=..., eo=..., dnew=..., enew=..., ef=0, efs=..., entropy=0, iscf=1,
neigwanted=1847) at /export/apps/siesta/siesta-4.0/Src/diagon.F:289
#8  0x00000000004d2896 in m_compute_dm::compute_dm (iscf=1)
at /export/apps/siesta/siesta-4.0/Src/compute_dm.F:120
#9  0x00000000004f10e6 in m_siesta_forces::siesta_forces ()
at /export/apps/siesta/siesta-4.0/Src/siesta_forces.F:132
#10 0x00000000006af149 in siesta () at /export/apps/siesta/siesta-4.0/Src/siesta.F:30
#11 0x000000000092609a in main ()
#12 0x0000003d1721ecdd in __libc_start_main () from /lib64/libc.so.6
#13 0x0000000000403a29 in _start ()
(gdb) disas
Dump of assembler code for function zdot_kernel_8:
0x00000000008fda20 <+0>:     cmp    $0x27f,%rdi
0x00000000008fda27 <+7>:     jle    0x8fdb18 <zdot_kernel_8+248>
0x00000000008fda2d <+13>:    xor    %eax,%eax
0x00000000008fda2f <+15>:    vzeroupper
0x00000000008fda32 <+18>:    vxorpd %xmm0,%xmm0,%xmm0
0x00000000008fda36 <+22>:    vxorpd %xmm1,%xmm1,%xmm1
0x00000000008fda3a <+26>:    vxorpd %xmm2,%xmm2,%xmm2
0x00000000008fda3e <+30>:    vxorpd %xmm3,%xmm3,%xmm3
0x00000000008fda42 <+34>:    vxorpd %xmm4,%xmm4,%xmm4
0x00000000008fda46 <+38>:    vxorpd %xmm5,%xmm5,%xmm5
0x00000000008fda4a <+42>:    vxorpd %xmm6,%xmm6,%xmm6
0x00000000008fda4e <+46>:    vxorpd %xmm7,%xmm7,%xmm7
0x00000000008fda52 <+50>:    data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
0x00000000008fda60 <+64>:    prefetcht0 0x200(%rsi,%rax,8)
0x00000000008fda68 <+72>:    vmovups (%rsi,%rax,8),%xmm8
0x00000000008fda6d <+77>:    vmovups 0x10(%rsi,%rax,8),%xmm9
0x00000000008fda73 <+83>:    prefetcht0 0x200(%rdx,%rax,8)
0x00000000008fda7b <+91>:    vmovups (%rdx,%rax,8),%xmm12
0x00000000008fda80 <+96>:    vmovups 0x10(%rdx,%rax,8),%xmm13
0x00000000008fda86 <+102>:   vmovups 0x20(%rsi,%rax,8),%xmm10
0x00000000008fda8c <+108>:   vmovups 0x30(%rsi,%rax,8),%xmm11
0x00000000008fda92 <+114>:   vmovups 0x20(%rdx,%rax,8),%xmm14
0x00000000008fda98 <+120>:   vmovups 0x30(%rdx,%rax,8),%xmm15
0x00000000008fda9e <+126>:   vfmadd231pd %xmm8,%xmm12,%xmm0
0x00000000008fdaa3 <+131>:   vfmadd231pd %xmm9,%xmm13,%xmm1
0x00000000008fdaa8 <+136>:   vpermilpd $0x1,%xmm13,%xmm13
0x00000000008fdaae <+142>:   vpermilpd $0x1,%xmm12,%xmm12
0x00000000008fdab4 <+148>:   vfmadd231pd %xmm10,%xmm14,%xmm2
0x00000000008fdab9 <+153>:   vfmadd231pd %xmm11,%xmm15,%xmm3
0x00000000008fdabe <+158>:   vpermilpd $0x1,%xmm14,%xmm14
0x00000000008fdac4 <+164>:   vpermilpd $0x1,%xmm15,%xmm15
0x00000000008fdaca <+170>:   vfmadd231pd %xmm8,%xmm12,%xmm4
0x00000000008fdacf <+175>:   add    $0x8,%rax
0x00000000008fdad3 <+179>:   vfmadd231pd %xmm9,%xmm13,%xmm5
0x00000000008fdad8 <+184>:   vfmadd231pd %xmm10,%xmm14,%xmm6
---Type <return> to continue, or q <return> to quit---

Quit
(gdb) x/i $pc
=> 0x8fdb6e <zdot_kernel_8+334>:        vfmadd231pd %xmm8,%xmm12,%xmm0




So, the instruction is vfmadd231pd 
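
One way to see how widespread such instructions are in a binary is to scan its disassembly for FMA3 mnemonics and group the hits by function. The following is a minimal Python sketch, not something from the original session. It assumes the input text comes from `objdump -d` in its default tab-separated layout, and the name fma3_by_function is made up for illustration.

```python
import re
from collections import Counter

# Function header lines in objdump output look like: "00000000008fda20 <name>:"
FUNC_RE = re.compile(r"^[0-9a-f]+ <(.+)>:$")

# FMA3 mnemonics carry an operand-order suffix (132/213/231), e.g. vfmadd231pd.
# FMA4 mnemonics such as vfmaddpd have no such suffix and are not matched.
FMA3_RE = re.compile(
    r"^vfn?m(?:add|sub|addsub|subadd)(?:132|213|231)(?:ps|pd|ss|sd)$"
)


def fma3_by_function(disassembly):
    """Count FMA3 instructions per function in `objdump -d` style text."""
    counts = Counter()
    current = None
    for line in disassembly.splitlines():
        header = FUNC_RE.match(line.strip())
        if header:
            current = header.group(1)
            continue
        # Instruction lines are "address:<TAB>raw bytes<TAB>mnemonic operands".
        fields = line.split("\t")
        if len(fields) < 3 or not fields[2].strip():
            continue
        mnemonic = fields[2].split()[0]
        if FMA3_RE.match(mnemonic):
            counts[current] += 1
    return dict(counts)
```

Feeding it the output of `objdump -d` run on the executable returns a dictionary mapping each function that contains FMA3 code to the number of such instructions, which helps tell whether the problem is confined to one library or spread across the whole build.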


Any idea?



 Regards,
Mahmood

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: illegal instruction (CPU mismatch)
  2016-09-21 17:49                         ` Mahmood Naderan
@ 2016-09-21 19:04                           ` Markus Trippelsdorf
  0 siblings, 0 replies; 27+ messages in thread
From: Markus Trippelsdorf @ 2016-09-21 19:04 UTC (permalink / raw)
  To: Mahmood Naderan; +Cc: Richard Earnshaw (lists), Jonathan Wakely, gcc-help

On 2016.09.21 at 17:48 +0000, Mahmood Naderan wrote:
> OK guys...
> (gdb) x/i $pc
> => 0x8fdb6e <zdot_kernel_8+334>:        vfmadd231pd %xmm8,%xmm12,%xmm0
> 
> So, the instruction is vfmadd231pd 
> 
> Any idea?

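It is an FMA3 instruction, which corresponds to the fma CPU flag that the
Opteron 6282 SE compute nodes lack according to your first mail. See: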
https://en.wikipedia.org/wiki/FMA_instruction_set
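
To confirm the mismatch on each machine, the flags line of /proc/cpuinfo can be compared against the features the binary needs. Below is a minimal Python sketch under the assumption of a Linux /proc/cpuinfo layout, where FMA3 is reported as the flag fma and FMA4 separately as fma4. The helper names cpu_flags and missing_flags are hypothetical.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, required=("fma",)):
    """Return the required flags that the CPU does not advertise, sorted."""
    return sorted(set(required) - cpu_flags(cpuinfo_text))
```

Calling `missing_flags(open("/proc/cpuinfo").read())` on a node returns an empty list when the CPU advertises fma and `['fma']` when it does not. Other flags to check can be passed through the `required` argument.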

-- 
Markus

^ permalink raw reply	[flat|nested] 27+ messages in thread
