Snort mailing list archives

Re: Unable to kill a non-zombie process with -9 (fwd)


From: elof2 () sentor se
Date: Wed, 15 Oct 2014 14:21:44 +0200 (CEST)


Oops, this message didn't make it into the snort-devel list since I wasn't 
registered.

Here's a copy.

See question 1 below. Is the problem located in snort, in FreeBSD 10.0 or 
a combination of the two?

/Elof


---------- Forwarded message ----------
From: elof2 () sentor se
To: John-Mark Gurney <jmg () funkthat com>
Cc: freebsd-net <freebsd-net () freebsd org>,
     snort-devel mailinglist <snort-devel () lists sourceforge net>
Date: Wed, 15 Oct 2014 11:41:33 +0200 (CEST)
Subject: Re: Unable to kill a non-zombie process with -9
In-Reply-To: <20141009222926.GC1852 () funkthat com>
References: <alpine.BSF.2.00.1410081310340.39263 () farmermaggot shire sentor se>
     <20141009222926.GC1852 () funkthat com>


Hi!

Today the problem reoccurred.
I've now debugged the problem a little furter.

I'm starting snort (as root).

<<<lots of startup logs for pid 22646>>>
Oct 15 08:46:59 snort[22646]: Initializing daemon mode
Oct 15 08:46:59 snort[22648]: Daemon initialized, signaled parent pid: 22646
Oct 15 08:46:59 snort[22648]: Reload thread starting...
Oct 15 08:46:59 snort[22648]: Reload thread started, thread 0x8146e8800 (22648)
End of log.

Error! Nothing more happens with the snort process!
Normally it should continue and log these lines as well:


snort[nnn]: Decoding Ethernet
snort[nnn]: Checking PID path...
snort[nnn]: PID path stat checked out ok, PID path set to /var/run/
snort[nnn]: Writing PID "7627" to file "/var/run//snort_mon0.pid"
snort[nnn]: Chroot directory = /usr/foobar/log
snort[nnn]: Set gid to 100
snort[nnn]: Set uid to 100
snort[nnn]:
snort[nnn]:         --== Initialization Complete ==--
snort[nnn]: Commencing packet processing (pid=nnn)





When looking at this half-started snort process with 'ps', it looks like this:

ps faxulwwj 22648
USER   PID %CPU %MEM    VSZ    RSS TT  STAT STARTED     TIME COMMAND UID PPID 
CPU PRI NI MWCHAN  PGID   SID JOBC
root 22648 51.8  1.1 488552 179344  -  Rs    8:46AM 53:06.52 /usr/local/bin/s 
0    1   0  88  0 -      22648 22648    0


The process is still owned by root, so just as the missing log lines are 
saying, it has not yet performed any change of uid/gid.




So there seem to be two questions.

Q1)
What happens between "Reload thread started, thread 0x8146e8800 (22648)" and 
"Decoding Ethernet"?
Apparently something goes wrong here on FreeBSD 10.0.
(this problem does not always occur, sometimes snort start just fine)

Q2)
When the process has frozen in this half-started state, it can't be killed even 
with a -9. Why?




John-Mark asked me for some debugging info. Here it is:

I now run 'kill 22648' on the above semi-started process:

     USER   PID %CPU %MEM    VSZ    RSS TT  STAT STARTED     TIME COMMAND UID 
PPID CPU PRI NI MWCHAN  PGID   SID JOBC
old root 22648 51.8  1.1 488552 179344  -  Rs    8:46AM 53:06.52 
/usr/local/bin/s   0    1   0  88  0 -      22648 22648    0
new root 22648 52.3  1.1 488552 179344  -  Rs    8:46AM 53:36.48 
/usr/local/bin/s   0    1   0  52  0 -      22648 22648    0

No change.



kill -9 22648

     USER   PID %CPU %MEM    VSZ    RSS TT  STAT STARTED     TIME COMMAND UID 
PPID CPU PRI NI MWCHAN  PGID   SID JOBC
old root 22648 51.8  1.1 488552 179344  -  Rs    8:46AM 53:06.52 
/usr/local/bin/s   0    1   0  88  0 -      22648 22648    0
new root 22648 37.7  1.1 488552 179344  -  Ts    8:46AM 53:50.87 
/usr/local/bin/s   0    1   0  52  0 -      22648 22648    0

Less CPU-usage and STAT changed to "Ts".




kill -CONT 22648

     USER   PID %CPU %MEM    VSZ    RSS TT  STAT STARTED     TIME COMMAND UID 
PPID CPU PRI NI MWCHAN  PGID   SID JOBC
old root 22648 51.8  1.1 488552 179344  -  Rs    8:46AM 53:06.52 
/usr/local/bin/s   0    1   0  88  0 -      22648 22648    0
new root 22648  0.0  1.1 488552 179344  -  Ts    8:46AM 53:50.88 
/usr/local/bin/s   0    1   0  52  0 -      22648 22648    0

No change except cpu is down to 0.


I now start 'kgdb'
info threads
I found two threads for snort, doing a bt for both of them:
   372 Thread 100602 (PID=22648: snort)  sched_switch (td=0xfffff802c061f490, 
newtd=<value optimized out>, flags=<value optimized out>) at 
/usr/src/sys/kern/sched_ule.c:1962
   371 Thread 100598 (PID=22648: snort)  sched_switch (td=0xfffff80221857000, 
newtd=<value optimized out>, flags=<value optimized out>) at 
/usr/src/sys/kern/sched_ule.c:1962
thread 372
[Switching to thread 372 (Thread 100602)]#0  sched_switch 
(td=0xfffff802c061f490, newtd=<value optimized out>, flags=<value optimized 
out>) at /usr/src/sys/kern/sched_ule.c:1962
1962    in /usr/src/sys/kern/sched_ule.c
bt
#0  sched_switch (td=0xfffff802c061f490, newtd=<value optimized out>, 
flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1962
#1  0xffffffff808b8c1e in mi_switch (flags=266, newtd=0x0) at 
/usr/src/sys/kern/kern_synch.c:494
#2  0xffffffff808c04b0 in thread_suspend_switch (td=0xfffff802c061f490) at 
/usr/src/sys/kern/kern_thread.c:883
#3  0xffffffff808c0276 in thread_single (mode=1) at 
/usr/src/sys/kern/kern_thread.c:713
#4  0xffffffff8087c1bb in exit1 (td=0xfffff802c061f490, rv=9) at 
/usr/src/sys/kern/kern_exit.c:180
#5  0xffffffff808b2faf in sigexit (td=<value optimized out>, sig=<value 
optimized out>) at /usr/src/sys/kern/kern_sig.c:2935
#6  0xffffffff808b3669 in postsig (sig=<value optimized out>) at 
/usr/src/sys/kern/kern_sig.c:2822
#7  0xffffffff808f6f57 in ast (framep=<value optimized out>) at 
/usr/src/sys/kern/subr_trap.c:271
#8  0xffffffff80c75870 in Xfast_syscall () at 
/usr/src/sys/amd64/amd64/exception.S:416
#9  0x0000000801d6f19a in ?? ()
Previous frame inner to this frame (corrupt stack?)


thread 371
[Switching to thread 371 (Thread 100598)]#0  sched_switch 
(td=0xfffff80221857000, newtd=<value optimized out>, flags=<value optimized 
out>) at /usr/src/sys/kern/sched_ule.c:1962
1962    in /usr/src/sys/kern/sched_ule.c
bt
#0  sched_switch (td=0xfffff80221857000, newtd=<value optimized out>, 
flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1962
#1  0xffffffff808b8c1e in mi_switch (flags=260, newtd=0x0) at 
/usr/src/sys/kern/kern_synch.c:494
#2  0xffffffff808f2e3a in sleepq_wait (wchan=0x0, pri=0) at 
/usr/src/sys/kern/subr_sleepqueue.c:620
#3  0xffffffff80864aad in _cv_wait (cvp=0xffffffff8147a500, 
lock=0xffffffff8147a480) at /usr/src/sys/kern/kern_condvar.c:139
#4  0xffffffff808fb05f in vmem_xalloc (vm=0xffffffff8147a480, size0=<value 
optimized out>, align=<value optimized out>, phase=0, nocross=<value optimized 
out>, minaddr=0, maxaddr=18446735286768857088, flags=8194, addrp=<value 
optimized out>) at /usr/src/sys/kern/subr_vmem.c:1196
#5  0xffffffff808fae6b in vmem_alloc (vm=0x0, size=0, flags=<value optimized 
out>, addrp=0xfffffe0466e1d6e8) at /usr/src/sys/kern/subr_vmem.c:1082
#6  0xffffffff80b0fa58 in kmem_malloc (vmem=0xffffffff8147a480, 
size=2139729920, flags=2) at /usr/src/sys/vm/vm_kern.c:314
#7  0xffffffff80b08dfb in uma_large_malloc (size=<value optimized out>, wait=2) 
at /usr/src/sys/vm/uma_core.c:1006
#8  0xffffffff80898cf3 in malloc (size=2139729920, mtp=0xffffffff813a0450, 
flags=0) at /usr/src/sys/kern/kern_malloc.c:520
#9  0xffffffff8096307b in bpf_buffer_ioctl_sblen (d=0xfffff80159ea9000, 
i=<value optimized out>) at /usr/src/sys/net/bpf_buffer.c:183
#10 0xffffffff80960a3c in bpfioctl (dev=0x0, cmd=<value optimized out>, 
addr=0xfffff801fbd06b40 "", flags=0, td=0xfffff80221857000) at 
/usr/src/sys/net/bpf.c:408
#11 0xffffffff807ac1df in devfs_ioctl_f (fp=0xfffff8002b3d9d20, com=3221504614, 
data=0xfffff801fbd06b40, cred=<value optimized out>, td=0xfffff80221857000) at 
/usr/src/sys/fs/devfs/devfs_vnops.c:757
#12 0xffffffff808fdfae in kern_ioctl (td=0xfffff80221857000, fd=<value 
optimized out>, com=0) at file.h:319
#13 0xffffffff808fdd2f in sys_ioctl (td=0xfffff80221857000, 
uap=0xfffffe0466e1da40) at /usr/src/sys/kern/sys_generic.c:702
#14 0xffffffff80c8f117 in amd64_syscall (td=0xfffff80221857000, traced=0) at 
subr_syscall.c:134
#15 0xffffffff80c7580b in Xfast_syscall () at 
/usr/src/sys/amd64/amd64/exception.S:391
#16 0x0000000801d8f08a in ?? ()
Previous frame inner to this frame (corrupt stack?)


Let me know if I can debug this any further.

/Elof



On Thu, 9 Oct 2014, John-Mark Gurney wrote:

elof2 () sentor se wrote this message on Wed, Oct 08, 2014 at 13:30 +0200:

I guess this is a bug report for FreeBSD 10.0.



Sometimes I can't kill my snort process on FreeBSD 10.0.
It won't die, even with kill -9.

I'm not talking about a zombie process. Snort is a process that should
die normally.
I've run snort on over 100 nodes since FreeBSD v6.x and I've never seen
this behavior until now in FreeBSD 10.0.


Example:

#ps faxuw
USER      PID  %CPU %MEM    VSZ    RSS TT  STAT STARTED        TIME
COMMAND
root    49222  53.4  2.2 492648 183012  -  Rs   11:46AM     7:05.59
/usr/local/bin/snort -q -D -c snort.conf
root    47937   0.0  2.2 488552 182864  -  Ts   10:56AM    29:35.98
/usr/local/bin/snort -q -D -c snort.conf

What is the MWCHAN?  add l to the ps command...

The pid 47937 has been killed (repeatedly) with -9.
Its status is "Ts" meaning it is Stopped.

have you tried to kill -CONT <pid> to resume it?

But it won't actually die and disappear. The only way to get rid of it
seem to be to reboot the machine. :-(

(pid 49222 is the new process that was started after 47937 was killed)


The problem doesn't happen all the time and I haven't found any patterns
as to when it does. :-(
If I restart snort once every day, it fails to die approximately 2-4 times
per month.
Even though the problem doesn't happen on every kill, it is a definately a
recurring event.

Can you run kgdb on the machine? (yes, it works on a live machine), use
info threads to find the thread id, and then use thread <threadid> to
switch to it, and run bt to get a back trace...

I began to see it on a heavily loaded 10GE sensor, so I thought it could
have something to do with the ix driver, or the heavy load.
But now another FreeBSD 10.0-sensor had the exact same problem, and this
sensor don't have any 10GE NICs. In fact, this sensor has been running
just fine with both FreeBSD 9.1 and 9.3 for the past years. Snort has
always terminated correctly! After I reinstalled this machine with FreeBSD
10.0 last friday, snort has then terminated correctly every day until
today, when it failed with the above pid 47937. (this sensor use the 'em'
driver, not 'ixgbe')

I'm running snort with the same configuration, settings, version, daq,
libs, etc on 10.0 as I do on 9.3.
None of the 9.3 sensors have this problem, so it has to be something new
in FreeBSD 10.0.

--
 John-Mark Gurney                             Voice: +1 415 225 5579

    "All that I will do, has been done, All that I have, has not."



------------------------------------------------------------------------------
Comprehensive Server Monitoring with Site24x7.
Monitor 10 servers for $9/Month.
Get alerted through email, SMS, voice calls or mobile push notifications.
Take corrective actions from your mobile device.
http://p.sf.net/sfu/Zoho
_______________________________________________
Snort-devel mailing list
Snort-devel () lists sourceforge net
https://lists.sourceforge.net/lists/listinfo/snort-devel
Archive:
http://sourceforge.net/mailarchive/forum.php?forum_name=snort-devel

Please visit http://blog.snort.org for the latest news about Snort!


Current thread: