Details
Summary: Network reconfiguration silently breaks gmond's ability to...
Bug#: 38
Product: Ganglia Monitoring System
Component: gmond
Status: ASSIGNED
Resolution:
Hardware: PC
OS: Linux
Version: 3.0.x
Priority: P2
Severity: normal
People
Reporter: Jason A. Smith <smithj4@bnl.gov>
Assigned To: Carlo Marcelo Arenas Belon <carenas@sajinet.com.pe>

Attachments
PATCH: fix for bz#38, against trunk r1420 (5.25 KB, patch)
2008-06-16 11:30, Timothy Witham
Prevent unicast receivers from dieing (391 bytes, patch)
2008-08-05 12:40, Timothy Witham



Description:   Opened: 2005-03-21 12:28
The problem is really twofold: first, network reconfigs/restarts prevent gmond
from receiving any more multicast packets; second, gmond just silently ignores
the failure.  The silent part effectively places gmond in deaf mode, and if this
is the gmond that gmetad is getting data from, it looks like your whole cluster
went down.  Shouldn't gmond at least be able to recognize this, or is there no
real error for it to pick up on?  Occasionally, in 2.5 ganglia, I have seen
errors like "mcast_thread() error multicasting", but I think these are from
gmond having been unlucky enough to attempt to send a multicast message when the
interface was down; otherwise there is no warning.

It would be really nice if gmond could detect these errors and recover its
ability to receive the multicast data.  How about some sort of sanity check,
for a gmond that is neither deaf nor mute, that detects whether it has received
its own multicast messages?  If it has not, it could reinitialize its multicast
listening ability.

~Jason
------- Comment #1 From Carlo Marcelo Arenas Belon 2007-01-01 23:16:33 -------
the errors for "mcast_thread() error multicasting" are actually persistent
until the interface it is bound to is brought back up (which might not be the
case if the IP the machine uses has been migrated to another interface for
failover purposes)

this doesn't affect unicast based configurations though.
------- Comment #2 From Timothy Witham 2008-06-05 11:34:21 -------
Ah ha!  I always wondered how my gmonds quit hearing the other nodes and made
the whole cluster look dead when the rest of them were hearing fine.  I wonder
if something other than APR_SUCCESS is happening here:

  status = apr_pollset_poll(listen_channels, timeout, &num, &descs);
  if(status != APR_SUCCESS)
    return;

If so, couldn't we then call setup_listen_channels_pollset() again?  Do you know
exactly how to get a gmond into this state so I can try it?
------- Comment #3 From Timothy Witham 2008-06-13 10:40:45 -------
This is easy enough to duplicate with 'ifdown && ifup'.  A gmond in debug mode
multicasting to itself clearly goes deaf after interface down/up.  This also
exercises the network wraparound code since the counters all reset.

Unfortunately, the poll returns nothing but a timeout, and that happens when all
is well too, so the failure cannot be detected that easily.  I set a variable to
know when I haven't heard anything for a while so I could try to recover.  It is
not as easy as calling setup_listen_channels_pollset() again, since the tcp
socket is already in use, and that would leave the other udp sockets lying
around too.  I have to figure out how to properly shut everything down and
restart them, but I am not familiar with apr...
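
A minimal sketch of the "haven't heard anything for a while" bookkeeping
described above.  This is not the code from gmond; the names listen_state,
note_packet, channel_looks_deaf and the SILENCE_LIMIT_SECS threshold are all
hypothetical and chosen only for illustration.  Since a poll timeout alone is
normal, the idea is to remember when the last packet arrived and only treat a
long gap as suspicious.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical threshold: how long a silence is tolerated before recovery. */
#define SILENCE_LIMIT_SECS 60

/* Hypothetical per-listener state; not a structure from gmond. */
typedef struct {
    time_t last_heard;   /* time the most recent packet arrived */
} listen_state;

/* Call whenever the poll reports at least one readable descriptor. */
static void note_packet(listen_state *s, time_t now)
{
    s->last_heard = now;
}

/* Call after every poll, including plain timeouts.
 * Returns true only when the silence has exceeded the limit. */
static bool channel_looks_deaf(const listen_state *s, time_t now)
{
    return difftime(now, s->last_heard) > SILENCE_LIMIT_SECS;
}
```

A caller would run channel_looks_deaf() after each poll and attempt recovery
only when it returns true, resetting last_heard afterwards so the recovery is
not retried on every timeout.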
------- Comment #4 From Timothy Witham 2008-06-16 11:30:17 -------
Created an attachment (id=137) [details]
PATCH: fix for bz#38, against trunk r1420

This works for me.  Basically, you have to "subscribe" or "tune in" to hear
multicast packets.  After a network reset, these subscriptions are lost.  So
what I did is this: if we haven't heard anything for over a minute, re-subscribe
using the socket we already have.  I have only tested this on 3.0, so the trunk
patch is actually untested.  But the same changes work great on 3.0.
------- Comment #5 From Carlo Marcelo Arenas Belon 2008-06-30 14:52:00 -------
cleaned-up version of the proposed patch (avoiding oddly renamed variables and
code misaligned by tabs, and respecting the existing indentation style):

  Committed revision 1478

as a workaround for this bug.  a real fix will instead need to detect the
failures associated with trying to send an update once the multicast channel is
no longer subscribed to.
------- Comment #6 From Timothy Witham 2008-07-01 12:43:49 -------
I had the strange variable due to the conflicting function of the same name.  I
see you just renamed the function instead.  Much cleaner, thanks!

When I run the gmond in debug mode, it claims to still be sending after ifdown
&& ifup.  We could test to see if it is really working by seeing if gmond on
another host is hearing it.  I suspect there may be no failure to detect, just
as there was no failure to detect on the listen.
------- Comment #7 From Timothy Witham 2008-08-05 12:40:41 -------
Created an attachment (id=150) [details]
Prevent unicast receivers from dieing

I just noticed that the original change causes a unicast receiver to die when
it no longer receives data from any clients.  This fixes that.
------- Comment #8 From Carlo Marcelo Arenas Belon 2008-08-06 11:48:30 -------
Committed revision 1632
