Bugzilla – Bug 38
Network reconfiguration silently breaks gmond's ability to pick up new multicast packets.
Last modified: 2008-08-06 11:48:30
The problem is really twofold: first, network reconfigs/restarts prevent gmond from receiving any more multicast packets; second, gmond silently ignores the problem. The silent part effectively places gmond in deaf mode, and if this is the gmond that gmetad is getting data from, it looks like your whole cluster went down. Shouldn't gmond at least be able to recognize this, or is there no real error for it to pick up on? Occasionally, in Ganglia 2.5, I have seen errors like "mcast_thread() error multicasting", but I think these come from gmond having been unlucky enough to attempt to send a multicast message while the interface was down; otherwise there is no warning. It would be really nice if gmond could detect these errors and recover its ability to receive multicast data. How about a sanity check, for a gmond that is not configured deaf and mute, that detects whether it has received its own multicast messages? If it has not, it could reinitialize its multicast listening. ~Jason
The "mcast_thread() error multicasting" errors are actually persistent until the interface gmond is bound to is brought back up (which might never happen if the IP the machine uses has been migrated to another interface for failover purposes). This doesn't affect unicast-based configurations, though.
Ah ha! I always wondered how my gmonds quit hearing the other nodes and made the whole cluster look dead when the rest of them were hearing fine. I wonder if something other than APR_SUCCESS is happening here:

    status = apr_pollset_poll(listen_channels, timeout, &num, &descs);
    if(status != APR_SUCCESS)
      return;

If so, couldn't we then call setup_listen_channels_pollset() again? Do you know exactly how to get a gmond into this state so I can try it?
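To make the question concrete: a poll call can end in three ways (error, timeout, data), and only the first is a failure in its own right. Here is a minimal sketch of that classification using plain POSIX poll() rather than APR, so it builds without the Ganglia tree. The names poll_once and enum poll_result are made up for illustration. On the APR side I believe a timeout shows up as a status for which APR_STATUS_IS_TIMEUP() is true, but that is from memory and worth checking against the APR headers.

```c
#include <poll.h>

/* Hypothetical result codes for this sketch; gmond itself works with APR status values. */
enum poll_result { POLL_FAILED = -1, POLL_TIMED_OUT = 0, POLL_READABLE = 1 };

/* Wait up to timeout_ms for fd to become readable and classify the outcome. */
enum poll_result poll_once(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return POLL_FAILED;        /* a real error; errno is set */
    if (rc == 0)
        return POLL_TIMED_OUT;     /* nothing arrived in time; not an error by itself */

    /* Something happened on fd: readable data, or an error/hangup condition. */
    return (pfd.revents & POLLIN) ? POLL_READABLE : POLL_FAILED;
}
```

A caller that treats POLL_TIMED_OUT the same as POLL_FAILED (or just returns on both) has no way to tell a quiet network from a broken one.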
This is easy enough to duplicate with 'ifdown && ifup'. A gmond in debug mode multicasting to itself clearly goes deaf after the interface goes down and comes back up. This also exercises the network wraparound code, since the counters all reset. Unfortunately, the poll returns nothing but a timeout, and since that also happens when all is well, the failure cannot be detected that easily. I set a variable to track when I haven't heard anything for a while so I could try to recover. It is not as easy as calling setup_listen_channels_pollset() again, since the tcp socket is already in use, and that would leave the other udp sockets lying around too. I have to figure out how to properly shut everything down and restart it, but I am not familiar with apr...
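The "haven't heard anything for a while" bookkeeping could look roughly like the sketch below. This is only an illustration: struct listen_watchdog and its functions are hypothetical names, not gmond's actual variables, and the silence threshold is left as a parameter.

```c
#include <time.h>

/* Hypothetical bookkeeping for this sketch; not gmond's actual variables. */
struct listen_watchdog {
    time_t last_heard;   /* when the last packet arrived */
    time_t max_silence;  /* seconds of silence tolerated before we try to recover */
};

void watchdog_init(struct listen_watchdog *w, time_t now, time_t max_silence)
{
    w->last_heard = now;
    w->max_silence = max_silence;
}

/* Call whenever a packet is actually received. */
void watchdog_heard(struct listen_watchdog *w, time_t now)
{
    w->last_heard = now;
}

/* Call after a poll timeout: returns 1 once the silence exceeds the limit, else 0. */
int watchdog_is_deaf(const struct listen_watchdog *w, time_t now)
{
    return (now - w->last_heard) > w->max_silence;
}
```

The listen loop would call watchdog_heard() with time(NULL) on every received packet and check watchdog_is_deaf() whenever the poll times out. Passing the clock in as an argument keeps the logic easy to test without waiting.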
Created an attachment (id=137)
PATCH: fix for bz#38, against trunk r1420
This works for me. Basically, you have to "subscribe" or "tune in" to hear multicast packets. After a network reset, these subscriptions are lost. So what I did is: if we haven't heard anything for over a minute, re-subscribe using the socket we already have. I have only tested this on 3.0, so the trunk patch is actually untested. But the same changes work great on 3.0.
Cleaned-up version of the proposed patch (avoiding the oddly renamed variables and the misaligned code caused by tabs, and respecting the existing indentation style): committed revision 1478 as a workaround for this bug. A real fix will instead need to detect the failures associated with trying to send an update once the multicast channel is no longer subscribed to.
I had the strange variable because of the conflicting function of the same name. I see you just renamed the function instead. Much cleaner, thanks! When I run gmond in debug mode, it claims to still be sending after ifdown && ifup. We could test whether it is really working by checking that gmond on another host is hearing it. I suspect there may be no failure to detect, just as there was no failure to detect on the listen side.
Created an attachment (id=150)
Prevent unicast receivers from dying
I just noticed that the original change causes a unicast receiver to die when it no longer receives data from any clients. This fixes that.
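One way to express the idea, offered as a sketch and not as the contents of the attachment: only treat prolonged silence as a reason to re-subscribe when the receive channel is actually a multicast one. For a unicast channel, silence just means no client is sending. The names is_multicast_channel, on_silence and enum silence_action are hypothetical, and only IPv4 addresses are handled.

```c
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical actions for this sketch. */
enum silence_action { SILENCE_IGNORE = 0, SILENCE_REJOIN = 1 };

/* Returns 1 if addr parses as an IPv4 multicast address, 0 otherwise. */
int is_multicast_channel(const char *addr)
{
    struct in_addr in;

    if (inet_pton(AF_INET, addr, &in) != 1)
        return 0;
    return IN_MULTICAST(ntohl(in.s_addr)) ? 1 : 0;
}

/* Decide what to do after silent_secs without traffic on a receive channel. */
enum silence_action on_silence(const char *channel_addr, long silent_secs, long limit_secs)
{
    if (!is_multicast_channel(channel_addr))
        return SILENCE_IGNORE;   /* unicast: quiet clients are not a fault */
    if (silent_secs <= limit_secs)
        return SILENCE_IGNORE;   /* multicast, but not silent for long enough yet */
    return SILENCE_REJOIN;       /* multicast and too quiet: try re-subscribing */
}
```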
Committed revision 1632