<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "/bugzilla3/bugzilla.dtd">

<bugzilla version="3.0.4.1-2+lenny2"
          urlbase="http://bugzilla.ganglia.info/cgi-bin/bugzilla/"
          maintainer="THE MAINTAINER HAS NOT YET BEEN SET"
>

    <bug>
          <bug_id>244</bug_id>
          
          <creation_ts>2009-10-27 03:56</creation_ts>
          <short_desc>Gmond UDP Receive Channel on Solaris 10 X86 Locks Up</short_desc>
          <delta_ts>2009-12-04 05:36:00</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>Ganglia Monitoring System</product>
          <component>gmond</component>
          <version>3.1.x</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Solaris</op_sys>
          <bug_status>ASSIGNED</bug_status>
          
          
          
          
          
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Paul Sobey">buddha@the-annexe.net</reporter>
          <assigned_to name="Carlo Marcelo Arenas Belon">carenas@sajinet.com.pe</assigned_to>
          

      

      
          <long_desc isprivate="0">
            <who name="Paul Sobey">buddha@the-annexe.net</who>
            <bug_when>2009-10-27 03:56:15</bug_when>
            <thetext>We have around 40 machines reporting via udp unicast to a gmond instance. All
are currently running 3.1.2, newly built against apr 1.3.8 using GCC 4.4.1.

After a random period of between 10 minutes and 1 hour, the gmond receive
channel appears to lock up. All hosts on the web interface start to show as
down, and this is confirmed by telnet to the gmond: TN &gt; TMAX for all metrics.
Eventually the metrics disappear from the gmond XML altogether and we&apos;re
left with essentially blank &lt;CLUSTER/&gt; tags with no hosts.
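
For anyone wanting to check the TN &gt; TMAX condition without reading the telnet output by eye, here is a rough Python sketch. It assumes the collector has its tcp_accept_channel on port 8649 (the stock default; adjust if yours differs) and that the dump contains the usual HOST and METRIC elements with TN and TMAX attributes. The function names are made up for illustration and the hostname in the usage note is hypothetical.

```python
import socket
import xml.etree.ElementTree as ET
from operator import gt


def read_gmond_xml(host, port=8649, timeout=10.0):
    """Read the full XML dump that gmond writes on its TCP accept channel.

    Port 8649 is an assumption (the stock default), not taken from this setup.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection once the dump is sent
                break
            chunks.append(data)
    return b"".join(chunks)


def stale_metrics(xml_bytes):
    """Return {host name: [names of metrics whose TN exceeds TMAX]}."""
    root = ET.fromstring(xml_bytes)
    stale = {}
    for host in root.iter("HOST"):
        names = [
            metric.get("NAME")
            for metric in host.iter("METRIC")
            # gt(a, b) is the greater-than test: seconds since the last
            # update (TN) against the longest expected gap (TMAX).
            if gt(int(metric.get("TN", "0")), int(metric.get("TMAX", "0")))
        ]
        if names:
            stale[host.get("NAME")] = names
    return stale
```

Calling stale_metrics(read_gmond_xml("collector.example.com")) gives a dict keyed by host name. An empty dict means nothing is overdue; in the locked-up state described above, every reporting host would be expected to appear in it.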

I&apos;m currently bouncing gmond every 30 minutes, but that&apos;s far from ideal since
the graphs have large holes in them.

I have also tried compiling against an apr built with
--disable-nonportable-atomics, and the result is the same.

Any other required information or things to test, let me know!

Cheers,
Paul</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Carlo Marcelo Arenas Belon">carenas@sajinet.com.pe</who>
            <bug_when>2009-11-01 21:44:10</bug_when>
            <thetext>Just a shot in the dark, but could you build your gmond using SUNWgcc instead?

Which gmond is being restarted every 30 minutes? The collector (the one that
receives all the unicast messages and is then polled by gmetad)? If so, you are
triggering another bug, which is what causes the holes in your graphs: that
gmond won&apos;t be able to process any metric from a reporting host until it gets
an updated metadata packet from that host. To minimize that, set
&quot;send_metadata_interval&quot; on all the other gmonds to a small number of seconds
(but greater than 0), or, as a workaround, try running the collector on some
other box running Linux (it could be the same one that runs gmetad/apache).</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Paul Sobey">buddha@the-annexe.net</who>
            <bug_when>2009-12-03 00:17:33</bug_when>
            <thetext>Apologies for the late reply. I rebuilt gmond and its dependencies:

        gettext (version 0.16)
        libconfuse (version 2.5)
        apr (version 1.3.8)
        expat (version 2.0.1)
        python (version 2.6.2)
        ganglia-client (version 3.1.2)

with the stock Solaris gcc 3.4.3 into the prefix /opt/ganglia-client-3.1.2, and the result was the same: graphs were drawn for around 2 hours and then stopped.

Note that only the head (receiving) gmond needs restarting. All of the individual host gmonds that send the data carry on working as expected, so it&apos;s definitely the UDP receive side that locks up. I have it in that state now, and a truss shows:

port_getn(4, 0x08080C48, 2, 1, 0x08046890)      = 0 [62]
sysconfig(_CONFIG_NPROC_ONLN)                   = 2
ioctl(9, KSTAT_IOC_CHAIN_ID, 0x00000000)        = 10600
p_online(0, 3)                                  = 2
ioctl(9, KSTAT_IOC_READ, &quot;cpu_stat0&quot;)           = 10600
p_online(1, 3)                                  = 2
ioctl(9, KSTAT_IOC_READ, &quot;cpu_stat1&quot;)           = 10600
time()                                          = 1259828137
time()                                          = 1259828137
time()                                          = 1259828137
time()                                          = 1259828137
time()                                          = 1259828137
write(7, &quot;\0\0\084\0\0\01F l o b p&quot;.., 72)      = 72
write(8, &quot;\0\0\084\0\0\01F l o b p&quot;.., 72)      = 72
write(7, &quot;\0\0\086\0\0\01F l o b p&quot;.., 68)      = 68
write(8, &quot;\0\0\086\0\0\01F l o b p&quot;.., 68)      = 68
write(7, &quot;\0\0\086\0\0\01F l o b p&quot;.., 72)      = 72
write(8, &quot;\0\0\086\0\0\01F l o b p&quot;.., 72)      = 72
write(7, &quot;\0\0\086\0\0\01F l o b p&quot;.., 68)      = 68
write(8, &quot;\0\0\086\0\0\01F l o b p&quot;.., 68)      = 68
write(7, &quot;\0\0\086\0\0\01F l o b p&quot;.., 68)      = 68
write(8, &quot;\0\0\086\0\0\01F l o b p&quot;.., 68)      = 68
write(7, &quot;\0\0\086\0\0\01F l o b p&quot;.., 72)      = 72
write(8, &quot;\0\0\086\0\0\01F l o b p&quot;.., 72)      = 72
write(7, &quot;\0\0\086\0\0\01F l o b p&quot;.., 68)      = 68
write(8, &quot;\0\0\086\0\0\01F l o b p&quot;.., 68)      = 68
write(7, &quot;\0\0\086\0\0\01F l o b p&quot;.., 68)      = 68
write(8, &quot;\0\0\086\0\0\01F l o b p&quot;.., 68)      = 68
write(7, &quot;\0\0\086\0\0\01F l o b p&quot;.., 68)      = 68
write(8, &quot;\0\0\086\0\0\01F l o b p&quot;.., 68)      = 68
write(7, &quot;\0\0\086\0\0\01F l o b p&quot;.., 72)      = 72
write(8, &quot;\0\0\086\0\0\01F l o b p&quot;.., 72)      = 72
write(7, &quot;\0\0\086\0\0\01F l o b p&quot;.., 72)      = 72
write(8, &quot;\0\0\086\0\0\01F l o b p&quot;.., 72)      = 72
write(7, &quot;\0\0\086\0\0\01F l o b p&quot;.., 72)      = 72
write(8, &quot;\0\0\086\0\0\01F l o b p&quot;.., 72)      = 72
write(7, &quot;\0\0\086\0\0\01F l o b p&quot;.., 72)      = 72
write(8, &quot;\0\0\086\0\0\01F l o b p&quot;.., 72)      = 72
write(7, &quot;\0\0\086\0\0\01F l o b p&quot;.., 72)      = 72
write(8, &quot;\0\0\086\0\0\01F l o b p&quot;.., 72)      = 72
write(7, &quot;\0\0\086\0\0\01F l o b p&quot;.., 72)      = 72
write(8, &quot;\0\0\086\0\0\01F l o b p&quot;.., 72)      = 72
port_getn(4, 0x08080C48, 2, 1, 0x08046890) (sleeping...)

(The machine name starts with lobp.) I know that around a hundred hosts are all sending data to this one during this period, so that incoming data isn&apos;t showing up in the truss output. Let me know if there&apos;s any other data I can collect for you.
</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Carlo Marcelo Arenas Belon">carenas@sajinet.com.pe</who>
            <bug_when>2009-12-03 00:50:27</bug_when>
            <thetext>Can you rebuild your apr without port_getn (event port) support (there are known open bugs with that in Solaris 10):

$ make distclean
$ ac_cv_func_port_create=no
$ export ac_cv_func_port_create
$ ./configure --regular-options

and see if that corrects the problem?</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Paul Sobey">buddha@the-annexe.net</who>
            <bug_when>2009-12-04 00:34:08</bug_when>
            <thetext>You sir are a genius. It&apos;s been up for nearly 24 hours now, where previous best was an hour. Was this:

https://issues.apache.org/bugzilla/show_bug.cgi?id=48029

The apr bug you mentioned? If so it&apos;s supposedly fixed for 1.3.10. I&apos;ll post back confirming that when it comes out. Thanks so much for your help!
</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Carlo Marcelo Arenas Belon">carenas@sajinet.com.pe</who>
            <bug_when>2009-12-04 05:36:00</bug_when>
            <thetext>(In reply to comment #4)
&gt; You sir are a genius. It&apos;s been up for nearly 24 hours now, where previous best
&gt; was an hour. Was this:
&gt; 
&gt; https://issues.apache.org/bugzilla/show_bug.cgi?id=48029
&gt; 
&gt; The apr bug you mentioned? If so it&apos;s supposedly fixed for 1.3.10. I&apos;ll post
&gt; back confirming that when it comes out. Thanks so much for your help!

Yes, but the bug is actually in Solaris; what apr 1.3.10 (or, more likely, 1.4.0) will have is a workaround for it.

In any case, always configuring apr to use poll() is most likely the safer choice, and it is good enough performance-wise for what gmond has to do with it.</thetext>
          </long_desc>
      
    </bug>

</bugzilla>