Details
Summary: gmetad poll() timeout failover failure
Bug#: 92
Product: Ganglia Monitoring System
Component: gmetad
Status: ASSIGNED
Resolution:
Hardware: PC
OS: Linux
Version: 3.0.x
Priority: P2
Severity: major
People
Reporter: Ian Cunningham <Ian.Cunningham@xilinx.com>
Assigned To: Carlo Marcelo Arenas Belon <carenas@sajinet.com.pe>

Attachments
gmetad patch to fix bugzilla #92 (2.00 KB, patch)
2007-03-29 09:01, Timothy Witham
proposed backport patch for 3.1 (3.88 KB, patch)
2008-08-29 22:34, Carlo Marcelo Arenas Belon
proposed backport patch for 3.0 (8.73 KB, patch)
2008-08-29 22:54, Carlo Marcelo Arenas Belon
proposed backport patch for 3.1 (4.42 KB, patch)
2008-09-01 11:33, Carlo Marcelo Arenas Belon
proposed backport patch for 3.0 (3.93 KB, patch)
2008-10-19 20:48, Carlo Marcelo Arenas Belon


Description:   Opened: 2006-04-05 17:21
If one of the head node sources in the list of data sources causes the data_thread
socket to time out during a read of gmond XML data, gmetad will not fail over to
the next source in the list.

We have a head node in a state where it accepts TCP connections but sends no
data. This state causes all connections to the bad host to time out. Normally a
bad host would just not allow the remote host to connect, but this host is in an
odd state. The current failover mechanism treats the ability to create a socket
to the source as the test of that head node's viability as a good source. Since
this bad head node still allows a socket connection, the mechanism is flawed.
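
The state described above can be reproduced on a single machine. The following is a minimal Python sketch, not gmetad code: `probe` is a hypothetical helper, and the "bad host" is simulated by a local listening socket that completes the TCP handshake but never sends anything. It shows that a connect-only health check passes while the read times out after 0 bytes.

```python
import socket

def probe(host, port, timeout=1.0):
    """Return (connected, bytes_read) for one polling attempt."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Connection refused or unreachable: the case the existing check handles.
        return False, 0
    with sock:
        try:
            data = sock.recv(4096)
        except OSError:
            # Includes socket.timeout: connected, but no data arrived in time.
            return True, 0
    return True, len(data)

# Simulated bad head node: listens, so the kernel accepts connections,
# but nothing ever writes to the connected socket.
silent = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
silent.bind(("127.0.0.1", 0))
silent.listen(1)

result = probe("127.0.0.1", silent.getsockname()[1], timeout=0.5)
silent.close()
print(result)
```

A check that only looks at the first element of the result would call this source healthy, which is the flaw reported here.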

The meta daemon should consider any failure to complete a read as a failure of
the source and move on to another source. This could be accomplished by looping
through the list of sources and trying a new source on each failure until success
or exhaustion of the list. Another solution is to reorder the list of sources,
putting the failed hosts at the end of the list.
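
As a rough illustration of both suggestions, here is a Python sketch under stated assumptions; it is not the patch attached to this bug. The names `read_all` and `poll_sources` are hypothetical, and a failed connect, a read timeout, or an empty reply are all treated as a source failure.

```python
import socket

def read_all(host, port, timeout=10.0):
    """Connect and read until EOF; raise OSError on any connect or read failure."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            chunk = sock.recv(4096)  # raises socket.timeout (an OSError) on a stalled read
            if not chunk:
                break
            chunks.append(chunk)
    data = b"".join(chunks)
    if not data:
        raise OSError("connection closed after 0 bytes")
    return data

def poll_sources(sources, fetch=read_all):
    """Try each (host, port) in order until one read completes.

    Returns (data, new_order). Sources that failed during this pass are
    moved to the end of new_order; data is None if every source failed.
    """
    failed = []
    for i, (host, port) in enumerate(sources):
        try:
            data = fetch(host, port)
        except OSError:
            failed.append((host, port))
            continue
        return data, sources[i:] + failed
    return None, list(sources)
```

Passing the returned order back in on the next poll means a stalled head node is only retried after the healthy ones, so it costs a timeout only when everything ahead of it has also failed.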
------- Comment #1 From Timothy Witham 2007-03-29 09:01:54 -------
Created an attachment (id=56)
gmetad patch to fix bugzilla #92

A quick hack is to pick a random host from the list, which is what this patch
does.  It resolves the problem, but might not be ideal.  The documentation
might need to be fixed since the sources are no longer tried in order.
------- Comment #2 From Timothy Witham 2008-06-05 10:38:23 -------
My patch still fails if we are talking to a gmond affected by Bug#38. In that
case, we receive incomplete data, but since it is some data, we keep talking to
that host every time. Maybe we should just talk to a random host every time.
Better to fix Bug#38 though...
------- Comment #3 From Carlo Marcelo Arenas Belon 2008-08-29 14:39:31 -------
Still a problem with 3.1 and trunk; gmetad reports it in its log (syslog) as:

Aug 29 14:31:19 dell /usr/sbin/gmetad[27606]: poll() timeout for [wireless] data source after 0 bytes read
Aug 29 14:31:56 dell last message repeated 2 times
Aug 29 14:33:14 dell last message repeated 6 times
Aug 29 14:34:22 dell last message repeated 4 times
...
------- Comment #4 From Carlo Marcelo Arenas Belon 2008-08-29 20:14:39 -------
Fix committed to trunk in revision 1738.

Randomization (or load balancing) of the data sources is an interesting option,
but it doesn't address the source of the problem: the broken node will still be
selected at random some of the time, which delays getting the data anyway or
could result in the other failures described in the comments.
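
To make the objection concrete, here is a small Python sketch. It is an illustration only: the host names and the `broken_pick_rate` helper are hypothetical, and it assumes uniform random selection with exactly one broken source. With N sources, the broken one is still chosen in roughly 1/N of the polls, and each of those polls waits for the full timeout.

```python
import random

def broken_pick_rate(sources, broken, polls, seed=0):
    """Fraction of polls in which uniform random selection picks the broken source."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(polls) if rng.choice(sources) == broken)
    return hits / polls

sources = ["head1", "head2", "head3"]  # hypothetical head nodes
rate = broken_pick_rate(sources, "head2", polls=3000)
print(f"broken source chosen in {rate:.0%} of polls")
```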
------- Comment #5 From Carlo Marcelo Arenas Belon 2008-08-29 22:34:45 -------
Created an attachment (id=162)
proposed backport patch for 3.1
------- Comment #6 From Carlo Marcelo Arenas Belon 2008-08-29 22:54:41 -------
Created an attachment (id=163)
proposed backport patch for 3.0

Contains several other unrelated changes from trunk, committed in r1740 to
monitor-core-3.0, that are required to synchronize the code with trunk and
avoid spurious conflicts.
------- Comment #7 From Carlo Marcelo Arenas Belon 2008-08-30 08:50:43 -------
For the "load balancing" solution previously suggested here, refer to Bug#208.
------- Comment #8 From Carlo Marcelo Arenas Belon 2008-09-01 11:33:55 -------
Created an attachment (id=165)
proposed backport patch for 3.1
------- Comment #9 From Carlo Marcelo Arenas Belon 2008-10-19 20:48:01 -------
Created an attachment (id=174)
proposed backport patch for 3.0

Resolves whitespace conflicts and removes all other unrelated whitespace fixes
from the original patch.
------- Comment #10 From Carlo Marcelo Arenas Belon 2008-10-19 20:49:29 -------
Committed to 3.1 in r1866
