Details
Summary: Gmond UDP Receive Channel on Solaris 10 X86 Locks Up
Bug#: 244
Product: Ganglia Monitoring System
Component: gmond
Status: ASSIGNED
Resolution:
Hardware: PC
OS: Solaris
Version: 3.1.x
Priority: P2
Severity: normal

People
Reporter: Paul Sobey <buddha@the-annexe.net>
Assigned To: Carlo Marcelo Arenas Belon <carenas@sajinet.com.pe>

Description:   Opened: 2009-10-27 03:56
We have around 40 machines reporting via udp unicast to a gmond instance. All
are currently running 3.1.2, newly built against apr 1.3.8 using GCC 4.4.1.

After a random period of between 10 minutes and 1 hour, the gmond receive
channel appears to lock up. All hosts on the web interface start to show as
down, and this is confirmed by telnetting to gmond: TN > TMAX for all metrics.
Eventually the metrics disappear from the gmond XML altogether and we're
left with essentially blank <CLUSTER/> tags with no hosts.
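
The TN > TMAX check described above can also be scripted rather than read by
eye from a telnet session. The following is a minimal sketch in Python, not
part of Ganglia. It assumes gmond serves its XML on the default TCP port 8649
and that each METRIC element carries integer TN and TMAX attributes; the
function names are hypothetical.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=10.0):
    """Read the XML dump gmond writes to a TCP client, as raw bytes.

    Port 8649 is assumed to be the configured tcp_accept_channel port.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def stale_metrics(xml_data):
    """Return (host, metric, TN, TMAX) for every metric with TN > TMAX."""
    root = ET.fromstring(xml_data)
    stale = []
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            tn = int(metric.get("TN", 0))
            tmax = int(metric.get("TMAX", 0))
            if tn > tmax:
                stale.append((host.get("NAME"), metric.get("NAME"), tn, tmax))
    return stale
```

For example, stale_metrics(fetch_gmond_xml("collector.example.com")) would
list the metrics that have gone stale on the receiving gmond, where
collector.example.com is a placeholder hostname. An empty list from a
document that contains no HOST elements at all corresponds to the blank
<CLUSTER/> state described above.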

I'm currently bouncing gmond every 30 mins but it's far from ideal since the
graphs have large holes in them.

I have tried compiling against apr tool built with
--disable-nonportable-atomics and the result is the same.

Any other required information or things to test, let me know!

Cheers,
Paul
------- Comment #1 From Carlo Marcelo Arenas Belon 2009-11-01 21:44:10 -------
Just a shot in the dark, but could you build your gmond using SUNWgcc instead?

Which gmond is being restarted every 30 minutes? The collector (the one that
receives all the unicast messages and is then polled by gmetad)? If so, you
are triggering another bug, and that is what causes the holes in your graphs:
after a restart that gmond can't process any metric from any of the reporting
hosts until it gets an updated metadata packet. To minimize that, set
"send_metadata_interval" in all the other gmonds to a small number of seconds
(but greater than 0), or, as a workaround, try running the collector on some
other box running Linux (it could be the same one that runs gmetad/apache).
------- Comment #2 From Paul Sobey 2009-12-03 00:17:33 -------
Apologies for late reply. I rebuilt gmond and dependencies:

        gettext (version 0.16)
        libconfuse (version 2.5)
        apr (version 1.3.8)
        expat (version 2.0.1)
        python (version 2.6.2)
        ganglia-client (version 3.1.2)

with stock Solaris gcc 3.4.3 into the prefix /opt/ganglia-client-3.1.2, and
the result was the same: graphs were drawn for around 2 hours and then
stopped. Note that it's only the head (receiving) gmond that needs restarting.
All of the individual host gmonds which send the data carry on working as
expected, so it's definitely the UDP receive which locks up. I have it in that
state now, and a truss shows:

port_getn(4, 0x08080C48, 2, 1, 0x08046890)      = 0 [62]
sysconfig(_CONFIG_NPROC_ONLN)                   = 2
ioctl(9, KSTAT_IOC_CHAIN_ID, 0x00000000)        = 10600
p_online(0, 3)                                  = 2
ioctl(9, KSTAT_IOC_READ, "cpu_stat0")           = 10600
p_online(1, 3)                                  = 2
ioctl(9, KSTAT_IOC_READ, "cpu_stat1")           = 10600
time()                                          = 1259828137
time()                                          = 1259828137
time()                                          = 1259828137
time()                                          = 1259828137
time()                                          = 1259828137
write(7, "\0\0\084\0\0\01F l o b p".., 72)      = 72
write(8, "\0\0\084\0\0\01F l o b p".., 72)      = 72
write(7, "\0\0\086\0\0\01F l o b p".., 68)      = 68
write(8, "\0\0\086\0\0\01F l o b p".., 68)      = 68
write(7, "\0\0\086\0\0\01F l o b p".., 72)      = 72
write(8, "\0\0\086\0\0\01F l o b p".., 72)      = 72
write(7, "\0\0\086\0\0\01F l o b p".., 68)      = 68
write(8, "\0\0\086\0\0\01F l o b p".., 68)      = 68
write(7, "\0\0\086\0\0\01F l o b p".., 68)      = 68
write(8, "\0\0\086\0\0\01F l o b p".., 68)      = 68
write(7, "\0\0\086\0\0\01F l o b p".., 72)      = 72
write(8, "\0\0\086\0\0\01F l o b p".., 72)      = 72
write(7, "\0\0\086\0\0\01F l o b p".., 68)      = 68
write(8, "\0\0\086\0\0\01F l o b p".., 68)      = 68
write(7, "\0\0\086\0\0\01F l o b p".., 68)      = 68
write(8, "\0\0\086\0\0\01F l o b p".., 68)      = 68
write(7, "\0\0\086\0\0\01F l o b p".., 68)      = 68
write(8, "\0\0\086\0\0\01F l o b p".., 68)      = 68
write(7, "\0\0\086\0\0\01F l o b p".., 72)      = 72
write(8, "\0\0\086\0\0\01F l o b p".., 72)      = 72
write(7, "\0\0\086\0\0\01F l o b p".., 72)      = 72
write(8, "\0\0\086\0\0\01F l o b p".., 72)      = 72
write(7, "\0\0\086\0\0\01F l o b p".., 72)      = 72
write(8, "\0\0\086\0\0\01F l o b p".., 72)      = 72
write(7, "\0\0\086\0\0\01F l o b p".., 72)      = 72
write(8, "\0\0\086\0\0\01F l o b p".., 72)      = 72
write(7, "\0\0\086\0\0\01F l o b p".., 72)      = 72
write(8, "\0\0\086\0\0\01F l o b p".., 72)      = 72
write(7, "\0\0\086\0\0\01F l o b p".., 72)      = 72
write(8, "\0\0\086\0\0\01F l o b p".., 72)      = 72
port_getn(4, 0x08080C48, 2, 1, 0x08046890) (sleeping...)

(The machine name starts with lobp.) I know that around a hundred hosts are
all sending data to this one during this period, so the incoming data isn't
showing up in the truss output. Let me know if there's any other data I can
collect for you.
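
One way to make the observation above concrete is to tally the system calls
in a saved truss capture and check whether any receive-side call appears at
all. This is a rough sketch in Python, assuming the capture looks like the
excerpt above (one call per line, no pid or LWP prefix); the function names
and the list of receive-side calls are illustrative choices, not something
gmond or truss defines.

```python
import re
from collections import Counter

# A truss line in the excerpt starts with the syscall name followed by "(".
SYSCALL_RE = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\(")


def count_syscalls(truss_text):
    """Count how many times each system call appears in truss output."""
    counts = Counter()
    for line in truss_text.splitlines():
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def receive_calls(counts, names=("recv", "recvfrom", "recvmsg", "read")):
    """Return only the receive-side calls that actually occurred."""
    return {name: counts[name] for name in names if counts[name] > 0}
```

Run against the excerpt above, count_syscalls would report only port_getn,
sysconfig, ioctl, p_online, time and write, and receive_calls would come back
empty, which matches the report that incoming data never reaches the process.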
------- Comment #3 From Carlo Marcelo Arenas Belon 2009-12-03 00:50:27 -------
can you rebuild your apr without port_getn (Solaris event ports) support
(there are known open bugs with that in Solaris 10):

$ make distclean
$ ac_cv_func_port_create=no
$ export ac_cv_func_port_create
$ ./configure --regular-options

and see if that corrects the problem?
------- Comment #4 From Paul Sobey 2009-12-04 00:34:08 -------
You sir are a genius. It's been up for nearly 24 hours now, where previous best
was an hour. Was this:

https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

The apr bug you mentioned? If so it's supposedly fixed for 1.3.10. I'll post
back confirming that when it comes out. Thanks so much for your help!
------- Comment #5 From Carlo Marcelo Arenas Belon 2009-12-04 05:36:00 -------
(In reply to comment #4)
> You sir are a genius. It's been up for nearly 24 hours now, where previous best
> was an hour. Was this:
> 
> https://issues.apache.org/bugzilla/show_bug.cgi?id=48029
> 
> The apr bug you mentioned? If so it's supposedly fixed for 1.3.10. I'll post
> back confirming that when it comes out. Thanks so much for your help!

Yes, but the bug is actually in Solaris; what apr 1.3.10 (or most likely 1.4.0)
will have is a workaround for it.

In any case, always configuring apr to use poll() is most likely the safer
choice, and its performance is good enough for what gmond has to do with it.
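
For readers unfamiliar with the poll() model recommended above, here is a
small illustration in Python of a poll()-driven UDP receive loop. It is not
apr or gmond code, only a sketch of the readiness pattern that apr falls back
to when event ports are disabled; the function name and parameters are
hypothetical.

```python
import select
import socket


def receive_datagrams(sock, max_datagrams, timeout_ms=1000):
    """Collect up to max_datagrams from a UDP socket, waiting with poll().

    Stops early if poll() reports nothing readable within timeout_ms.
    """
    poller = select.poll()
    poller.register(sock.fileno(), select.POLLIN)
    received = []
    while len(received) < max_datagrams:
        events = poller.poll(timeout_ms)
        if not events:
            break  # timed out with nothing to read
        for _fd, mask in events:
            if mask & select.POLLIN:
                data, _addr = sock.recvfrom(65535)
                received.append(data)
    return received
```

With a single socket and a handful of senders, as in a gmond collector, the
per-call cost of poll() is small, which is consistent with the comment that
it is good enough for this workload.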
