Bugzilla – Bug 92
gmetad poll() timeout failover failure
Last modified: 2008-10-19 20:49:29
If one of the head node sources in the list of data sources causes the data_thread socket to time out during a read of gmond XML data, gmetad will not fail over to the next source in the list. We have a head node in a state where it accepts TCP connections, but no data is sent. This state causes all connections to the bad host to time out. Normally a bad host would simply refuse the connection, but this host is in an odd state.

The current failover mechanism treats the ability to create a socket to the source as the test of whether that head node is a good source. Since this bad head node allows a socket connection, the mechanism is flawed. The meta daemon should consider any failure to complete a read as a failure of the source and move on to another source.

This could be accomplished by looping through the list of sources and trying a new source on each failure, until success or exhaustion of the list. Another solution is to reorder the list of sources, putting the failed hosts at the end of the list.
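The two suggestions above (loop until success or exhaustion, and move failed hosts to the end of the list) can be combined. The following is only a rough Python sketch of the idea, not gmetad's actual C code: `fetch_with_failover`, `SourceReadError`, and the `fetch` callable are hypothetical names introduced here for illustration, and `fetch` stands in for whatever connects to a host and reads the gmond XML.

```python
from typing import Callable, List, Optional, Tuple


class SourceReadError(Exception):
    """Hypothetical error raised when a connect or read does not complete."""


def fetch_with_failover(
    sources: List[str],
    fetch: Callable[[str], str],
) -> Tuple[Optional[str], List[str]]:
    """Try each source in order until one returns data.

    Returns (data, new_order). Hosts that failed are moved to the end of
    new_order so the next polling cycle tries the other hosts first.
    data is None when every source failed.
    """
    failed: List[str] = []
    for index, host in enumerate(sources):
        try:
            data = fetch(host)
        except SourceReadError:
            # A timeout during the read counts as a failure of the source,
            # the same as a refused connection.
            failed.append(host)
            continue
        # Success: the working host and the untried hosts stay in front.
        return data, sources[index:] + failed
    return None, failed


def example_fetch(host: str) -> str:
    """Stand-in fetch: the host named 'bad' accepts but never sends data."""
    if host == "bad":
        raise SourceReadError("poll() timeout after 0 bytes read")
    return "<xml from %s>" % host


data, order = fetch_with_failover(["bad", "head1", "head2"], example_fetch)
print(data, order)
```

Passing `fetch` in as a parameter keeps the failover logic independent of how the socket read and its timeout are implemented.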
Created an attachment (id=56) [details] gmetad patch to fix bugzilla #92
A quick hack is to pick a random host from the list, which is what this patch does. It resolves the problem, but might not be ideal. The documentation might need to be fixed since the sources are no longer tried in order.
My patch still loses if we are talking to a gmond affected by Bug#38. In that case, we receive incomplete data, but since it is some data, we keep talking to that host every time. Maybe we should just talk to a random host every time. Better to fix Bug#38 though...
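One way to avoid treating "some data" as success would be to check that the report is complete before accepting the source. The sketch below is an illustration in Python, not code from any of the attached patches: `is_complete_report` is a hypothetical helper, and it assumes the gmond report is a single XML document whose root element is `GANGLIA_XML`.

```python
import xml.etree.ElementTree as ET


def is_complete_report(xml_text: str, root_tag: str = "GANGLIA_XML") -> bool:
    """Return True only if xml_text parses as one well-formed XML document
    with the expected root element (root_tag is an assumption)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Empty or truncated data does not parse, so the read is
        # treated as a failure of the source.
        return False
    return root.tag == root_tag


complete = '<GANGLIA_XML VERSION="3.1"><CLUSTER NAME="c"></CLUSTER></GANGLIA_XML>'
truncated = '<GANGLIA_XML VERSION="3.1"><CLUSTER NAME="c">'
print(is_complete_report(complete), is_complete_report(truncated))
```

A source that returns data failing this check could then be handled like any other failed read and skipped in favor of the next host.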
Still a problem with 3.1 and trunk, reported in gmetad's log (syslog) as:

    Aug 29 14:31:19 dell /usr/sbin/gmetad[27606]: poll() timeout for [wireless] data source after 0 bytes read
    Aug 29 14:31:56 dell last message repeated 2 times
    Aug 29 14:33:14 dell last message repeated 6 times
    Aug 29 14:34:22 dell last message repeated 4 times
    ...
Fix committed in revision 1738 for trunk. Randomization (or load balancing) of the data sources is an interesting option, but it doesn't address the source of the problem: the broken node will still be selected at random some of the time and delay getting the data anyway, or could result in other failures as described in the comments.
Created an attachment (id=162) [details] proposed backport patch for 3.1
Created an attachment (id=163) [details] proposed backport patch for 3.0
Contains several other unrelated changes from trunk, committed in r1740 to monitor-core-3.0, that are required to synchronize the code with trunk and avoid spurious conflicts.
for the "load balancing" solution that used to be suggested here, refer to BUG208
Created an attachment (id=165) [details] proposed backport patch for 3.1
Created an attachment (id=174) [details] proposed backport patch for 3.0
Resolves whitespace conflicts, with all other unrelated whitespace fixes from the original patch removed.
Committed to 3.1 in r1866