Bugzilla – Bug 38
Network reconfiguration silently breaks gmond's ability to pick up new multicast packets.
Last modified: 2008-08-06 11:48:30
The problem is really twofold: first, network reconfigs/restarts prevent gmond
from receiving any more multicast packets; second, gmond just silently ignores
the failure. This silent ignore part effectively places gmond in deaf mode, and
if this is the gmond that gmetad is getting data from, it looks like your whole
cluster went down. Shouldn't gmond at least be able to recognize this, or is
there no real error for it to pick up on? Occasionally, in 2.5 ganglia, I have
seen errors like "mcast_thread() error multicasting", but I think these are from
gmond having been unlucky enough to attempt to send a multicast message when the
interface was down; otherwise there is no warning.
It would be really nice if gmond could detect these errors and recover its
ability to receive the multicast data. How about some sort of sanity check, for
a gmond that is not deaf and muted, that detects whether it has received its own
multicast messages? If it has not, it could reinitialize its multicast
listening ability.
The errors for "mcast_thread() error multicasting" are actually persistent until
the interface it is bound to is brought back up (which might not be the case
if the IP the machine uses has been migrated to another interface for failover).
This doesn't affect unicast-based configurations, though.
Ah ha! I always wondered how my gmonds quit hearing the other nodes and made
the whole cluster look dead when the rest of them were hearing fine. I wonder
if something other than APR_SUCCESS is happening here:
status = apr_pollset_poll(listen_channels, timeout, &num, &descs);
if(status != APR_SUCCESS)
If so, couldn't we then call setup_listen_channels_pollset() again? Do you know
exactly how to get a gmond into this state so I can try it?
This is easy enough to duplicate with 'ifdown && ifup'. A gmond in debug mode
multicasting to itself clearly goes deaf after interface down/up. This also
exercises the network wraparound code since the counters all reset.
Unfortunately, the poll returns nothing but a timeout, but this happens when all
is well too, so the problem cannot be detected that easily. I set a variable to
know when I haven't heard anything for a while so I could try to recover. It is
not as easy as calling setup_listen_channels_pollset() again, since the tcp
socket is already in use, and that would leave the other udp sockets lying
around too. I would have to figure out how to properly shut everything down and
restart them, but I am not familiar with apr...
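The "variable to know when I haven't heard anything for a while" could look roughly like the sketch below. It is only an illustration, not code from gmond or from any attachment on this bug: the silence_tracker type and the tracker_* function names are made up for the example. The point is that a single poll timeout is normal, so only silence longer than a chosen threshold is treated as a sign of deafness.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper type; nothing here is taken from gmond's source. */
typedef struct {
    time_t last_heard;   /* when the last packet arrived */
    time_t max_silence;  /* seconds of silence tolerated before recovery */
} silence_tracker;

static void tracker_init(silence_tracker *t, time_t now, time_t max_silence)
{
    t->last_heard = now;
    t->max_silence = max_silence;
}

/* Call whenever the poll reports at least one readable descriptor. */
static void tracker_heard(silence_tracker *t, time_t now)
{
    t->last_heard = now;
}

/* Call after a poll timeout. One timeout alone proves nothing;
 * only silence strictly longer than max_silence counts as deaf. */
static bool tracker_is_deaf(const silence_tracker *t, time_t now)
{
    return (now - t->last_heard) > t->max_silence;
}
```

The listen loop would call tracker_heard() on every received packet and tracker_is_deaf() on every timeout, and attempt recovery only when the latter returns true.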
Created an attachment (id=137) [details]
PATCH: fix for bz#38, against trunk r1420
This works for me. Basically, you have to "subscribe" or "tune in" to hear
multicast packets. After a network reset, these subscriptions are lost. So what
I did is this: if we haven't heard anything for over a minute, re-subscribe
using the socket we already have. I have only tested this on 3.0, so the trunk
patch is actually untested. But the same changes work great on 3.0.
Cleaned-up version (avoiding odd renamed variables, misaligned code because of
tabs, and respecting the code style used for indentation) of the proposed patch:
Committed revision 1478
as a workaround for this bug. A fix will instead need to detect the failures
associated with trying to send an update once the multicast channel is no longer
I had the strange variable due to the conflicting function of the same name. I
see you just renamed the function instead. Much cleaner, thanks!
When I run the gmond in debug mode, it claims to still be sending after ifdown
&& ifup. We could test to see if it is really working by seeing if gmond on
another host is hearing it. I suspect there may be no failure to detect, just
as there was no failure to detect on the listen.
Created an attachment (id=150) [details]
Prevent unicast receivers from dying
I just noticed that the original change causes a unicast receiver to die when
it no longer receives data from any clients. This fixes that.
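How attachment 150 achieves this is not described here. One plausible guard, shown purely as a sketch with hypothetical names (is_multicast_ipv4() and should_rejoin() are not gmond functions), is to run the silence-based recovery only for channels whose address is in the IPv4 multicast range 224.0.0.0/4, so that a quiet unicast receiver is simply left alone.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>

/* True only for a valid dotted-quad address inside 224.0.0.0/4. */
static bool is_multicast_ipv4(const char *addr)
{
    struct in_addr a;

    if (inet_pton(AF_INET, addr, &a) != 1)
        return false;
    /* Multicast addresses have the top four bits set to 1110. */
    return (ntohl(a.s_addr) & 0xF0000000u) == 0xE0000000u;
}

/* Hypothetical policy: prolonged silence triggers a re-subscribe only on
 * multicast channels. Silence on a unicast channel may just mean that no
 * client is sending, so nothing is done there. */
static bool should_rejoin(const char *channel_addr, long silent_secs,
                          long max_silence)
{
    return is_multicast_ipv4(channel_addr) && silent_secs > max_silence;
}
```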
Committed revision 1632