Bugzilla – Bug 244
Gmond UDP Receive Channel on Solaris 10 X86 Locks Up
Last modified: 2009-12-04 05:36:00
We have around 40 machines reporting via UDP unicast to a gmond instance. All are currently running 3.1.2, newly built against apr 1.3.8 using GCC 4.4.1. After a random period of between 10 minutes and 1 hour, the gmond receive channel appears to lock up. All hosts on the web interface start to show as down, and this is confirmed by telnet to the gmond: TN > TMAX for all metrics. Eventually the metrics disappear from the gmond XML altogether and we're left with essentially blank <CLUSTER/> tags with no hosts.

I'm currently bouncing gmond every 30 minutes, but it's far from ideal since the graphs have large holes in them. I have tried compiling against apr built with --disable-nonportable-atomics and the result is the same.

Any other required information or things to test, let me know!

Cheers,
Paul
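The telnet check described above (TN > TMAX for all metrics) can be automated. The following is a minimal sketch, not part of the original report. It assumes gmond's XML dump uses HOST and METRIC elements carrying NAME, TN and TMAX attributes, and that the collector listens on the default TCP port 8649. The function names read_gmond_xml and stale_hosts are hypothetical.

```python
import socket
import xml.etree.ElementTree as ET


def read_gmond_xml(host="localhost", port=8649, timeout=10.0):
    """Read the XML dump gmond writes when a client connects (assumed port 8649)."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    # Return bytes so the parser honours the encoding named in the XML header.
    return b"".join(chunks)


def stale_hosts(xml_bytes):
    """Return {host: [metric names]} for hosts where every metric has TN > TMAX."""
    root = ET.fromstring(xml_bytes)
    result = {}
    for host in root.iter("HOST"):
        metrics = host.findall("METRIC")
        stale = [
            m.get("NAME")
            for m in metrics
            if int(m.get("TN", 0)) > int(m.get("TMAX", 0))
        ]
        # Only report a host when it has metrics and all of them are overdue.
        if metrics and len(stale) == len(metrics):
            result[host.get("NAME")] = stale
    return result


# Example use against a live collector (hypothetical host name):
#   print(stale_hosts(read_gmond_xml("collector.example.com")))
```

Run from cron, a check like this could restart gmond only when the receive channel has actually stalled, rather than on a fixed 30-minute schedule.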
Just a shot in the dark, but could you build your gmond using SUNWgcc instead?

Which gmond is being restarted every 30 minutes? The collector (the one that receives all the unicast messages and is then polled by gmetad)? If so, you are triggering another bug, which is what is making the holes in your graphs: after a restart, that gmond can't process any metric from any of the reporting hosts until it gets an updated metadata packet. To minimize that, set "send_metadata_interval" in all the other gmonds to a small number of seconds (but greater than 0), or as a workaround try running the collector on some other box running Linux (it could be the same one that is running gmetad/apache).
Apologies for the late reply. I rebuilt gmond and its dependencies:

  gettext (version 0.16)
  libconfuse (version 2.5)
  apr (version 1.3.8)
  expat (version 2.0.1)
  python (version 2.6.2)
  ganglia-client (version 3.1.2)

with the stock Solaris gcc 3.4.3 into the prefix /opt/ganglia-client-3.1.2, and the result was the same: graphs were drawn for around 2 hours and then stopped.

Note that it's only the head (receive) gmond that needs restarting. All of the individual host gmonds which send the data carry on working as expected, so it's definitely the UDP receive side which locks up. I have it in that state now, and a truss shows:

  port_getn(4, 0x08080C48, 2, 1, 0x08046890) = 0 [62]
  sysconfig(_CONFIG_NPROC_ONLN) = 2
  ioctl(9, KSTAT_IOC_CHAIN_ID, 0x00000000) = 10600
  p_online(0, 3) = 2
  ioctl(9, KSTAT_IOC_READ, "cpu_stat0") = 10600
  p_online(1, 3) = 2
  ioctl(9, KSTAT_IOC_READ, "cpu_stat1") = 10600
  time() = 1259828137
  time() = 1259828137
  time() = 1259828137
  time() = 1259828137
  time() = 1259828137
  write(7, "\0\0\084\0\0\01F l o b p".., 72) = 72
  write(8, "\0\0\084\0\0\01F l o b p".., 72) = 72
  write(7, "\0\0\086\0\0\01F l o b p".., 68) = 68
  write(8, "\0\0\086\0\0\01F l o b p".., 68) = 68
  write(7, "\0\0\086\0\0\01F l o b p".., 72) = 72
  write(8, "\0\0\086\0\0\01F l o b p".., 72) = 72
  write(7, "\0\0\086\0\0\01F l o b p".., 68) = 68
  write(8, "\0\0\086\0\0\01F l o b p".., 68) = 68
  write(7, "\0\0\086\0\0\01F l o b p".., 68) = 68
  write(8, "\0\0\086\0\0\01F l o b p".., 68) = 68
  write(7, "\0\0\086\0\0\01F l o b p".., 72) = 72
  write(8, "\0\0\086\0\0\01F l o b p".., 72) = 72
  write(7, "\0\0\086\0\0\01F l o b p".., 68) = 68
  write(8, "\0\0\086\0\0\01F l o b p".., 68) = 68
  write(7, "\0\0\086\0\0\01F l o b p".., 68) = 68
  write(8, "\0\0\086\0\0\01F l o b p".., 68) = 68
  write(7, "\0\0\086\0\0\01F l o b p".., 68) = 68
  write(8, "\0\0\086\0\0\01F l o b p".., 68) = 68
  write(7, "\0\0\086\0\0\01F l o b p".., 72) = 72
  write(8, "\0\0\086\0\0\01F l o b p".., 72) = 72
  write(7, "\0\0\086\0\0\01F l o b p".., 72) = 72
  write(8, "\0\0\086\0\0\01F l o b p".., 72) = 72
  write(7, "\0\0\086\0\0\01F l o b p".., 72) = 72
  write(8, "\0\0\086\0\0\01F l o b p".., 72) = 72
  write(7, "\0\0\086\0\0\01F l o b p".., 72) = 72
  write(8, "\0\0\086\0\0\01F l o b p".., 72) = 72
  write(7, "\0\0\086\0\0\01F l o b p".., 72) = 72
  write(8, "\0\0\086\0\0\01F l o b p".., 72) = 72
  write(7, "\0\0\086\0\0\01F l o b p".., 72) = 72
  write(8, "\0\0\086\0\0\01F l o b p".., 72) = 72
  write(7, "\0\0\086\0\0\01F l o b p".., 72) = 72
  write(8, "\0\0\086\0\0\01F l o b p".., 72) = 72
  port_getn(4, 0x08080C48, 2, 1, 0x08046890) (sleeping...)

(The machine name starts with lobp.) I know that around a hundred hosts are all sending data to this one during this period, so that data isn't showing up in the truss.

Let me know if there's any other data I can collect for you.
Can you rebuild your apr without port_getn support (there are known open bugs with that in Solaris 10):

  $ make distclean
  $ ac_cv_func_port_create=no
  $ export ac_cv_func_port_create
  $ ./configure --regular-options

and see if that corrects the problem?
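After re-running configure with the cache variable overridden, it may be worth confirming that the override took effect before rebuilding everything. The sketch below is an illustration only. It assumes apr's configure records the result in a generated header (commonly include/arch/unix/apr_private.h) using the usual autoconf macro name HAVE_PORT_CREATE; both the path and the macro name are assumptions, and the helper name uses_event_ports is hypothetical.

```python
import re


def uses_event_ports(header_text):
    """Return True if the generated config header defines HAVE_PORT_CREATE.

    Assumption: autoconf writes '#define HAVE_PORT_CREATE 1' when port_create
    was detected and '/* #undef HAVE_PORT_CREATE */' when it was not.
    """
    pattern = r"^\s*#\s*define\s+HAVE_PORT_CREATE\b"
    return re.search(pattern, header_text, re.MULTILINE) is not None


# Example use from the apr source tree after ./configure (assumed path):
#   with open("include/arch/unix/apr_private.h") as fh:
#       print(uses_event_ports(fh.read()))
```

If this returns True after the reconfigure, the cached value was probably not exported to configure's environment, and the resulting apr would still be built with event port support.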
You sir are a genius. It's been up for nearly 24 hours now, where previous best was an hour. Was this: https://issues.apache.org/bugzilla/show_bug.cgi?id=48029 The apr bug you mentioned? If so it's supposedly fixed for 1.3.10. I'll post back confirming that when it comes out. Thanks so much for your help!
(In reply to comment #4)
> You sir are a genius. It's been up for nearly 24 hours now, where previous best
> was an hour. Was this:
>
> https://issues.apache.org/bugzilla/show_bug.cgi?id=48029
>
> The apr bug you mentioned? If so it's supposedly fixed for 1.3.10. I'll post
> back confirming that when it comes out. Thanks so much for your help!

Yes, but the bug is actually in Solaris; what apr 1.3.10 (or most likely 1.4.0) will have is a workaround for it.

In any case, always configuring apr to use poll() is most likely the safer choice, and good enough performance-wise for what gmond has to do with it.