[nycbug-talk] network strangeness (resource starvation?)

Jim Brown jpb
Mon Aug 1 22:06:03 EDT 2005


* Charles Sprickman <spork at bway.net> [2005-08-01 17:13]:
> On Sun, 31 Jul 2005, Jim Brown wrote:
> 
> >Hi Charles,
> >
> >What does 'lots of dns work' mean?  Can you give some I/O stats?
> >What DNS software are you running?  What is the hw platform?
> >How many zones?  Avg size of a zone?
> 
> Well, during a "lull", we were at about 600 queries/second, and it spikes 
> up much higher than that.  This is djbdns, specifically dnscache for a 
> huge amount of boxes doing lots of mail deliveries.
> 

Ok, that's a good chunk of DNS.  BSD should be able to handle it OK,
but I'd try to bring it down by design into the 200-300/sec range,
or even lower.  Spikes with flash mobs can get nasty with DNS.


> Hardware is all pretty decent, SuperMicro SuperServers (pre-built, various 
> dual xeon).  The box having the network issues is the first one we bought, 
> so it is unique and may have a different revision of the fxp chips than 
> the newer boxes.  Just upgraded it to 4.11 in hopes that I'll find some 
> "silent fix".  I have never, ever seen a box drop characters in an ssh 
> session before - my understanding is that TCP should just keep 
> retransmitting and there should never be bogus/missing data in TCP.


Ok, here's something to consider.  Serial communications (and network
too) are typically full duplex.  (Telnet actually has two modes:
character-at-a-time and line-at-a-time.)  ssh is typically
character-at-a-time, and because the communication is full duplex,
what you type locally gets sent to the remote host, where it is
interpreted and echoed back (under application control).  The echo
is what you actually see.

So, consider the possibility that the characters are reaching the
remote host ok, but the echo back (what you actually see) is getting
dropped every now and then by something in the path from the remote
back to you.  (Note: you can easily check this by running a sniffer
and watching the traffic filtered for just ssh, or telnet.  You
should see two packets for every character you type: one going to
the remote and one coming from the remote.  You'll also see TCP ACK
packets, but they are part of the protocol, not part of your
application session.)
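
To make the sniffer check concrete, here is a minimal sketch of the
counting step only.  It assumes you have already reduced a capture to
(sender, payload length) pairs; that format, the host labels and the
36-byte sizes are made up for illustration.  ssh is encrypted, so only
direction and size are visible, not the characters themselves.

```python
# Hypothetical capture summary: one (sender, tcp_payload_bytes) pair per packet.
# The labels "local"/"remote" and the payload sizes are invented for this sketch.
def count_data_packets(packets, local="local", remote="remote"):
    """Count payload-carrying packets per direction, skipping bare ACKs."""
    counts = {local: 0, remote: 0}
    for sender, payload_len in packets:
        if payload_len > 0:  # zero-length payload means a pure TCP ACK
            counts[sender] += 1
    return counts

sample = [
    ("local", 36), ("remote", 36), ("local", 0),  # keystroke, echo, ACK
    ("local", 36), ("remote", 36), ("local", 0),  # keystroke, echo, ACK
    ("local", 36),                                # keystroke, no echo seen
]
counts = count_data_packets(sample)
missing = counts["local"] - counts["remote"]
print(counts, "echoes missing:", missing)
```

If the two directions stay balanced while characters still go missing
on screen, the problem is probably somewhere other than the echo path.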




> 
> 
> After less than 24 hours of being up, here's some netstat stats, does 
> anything in the tcp or udp section stand out as being very strange?

I'll assume that it wasn't running 600 queries/sec for all 24 hours.

That would be around 51.8 million queries (600/sec x 86,400 sec), and
the udp section below shows only about 1.8 million datagrams received.
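
The arithmetic behind that estimate, as a quick sketch.  The observed
figure is the "datagrams received" counter from the udp section quoted
below; the box had been up a bit less than 24 hours, so the average
rate printed here is only a rough lower bound.

```python
# Back-of-the-envelope check: queries per day at a sustained rate.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def queries_per_day(rate_per_second):
    """Total queries in 24 hours at a constant rate."""
    return rate_per_second * SECONDS_PER_DAY

sustained = queries_per_day(600)
observed_udp_in = 1819407  # "datagrams received" in the quoted udp section
average_rate = observed_udp_in / SECONDS_PER_DAY

print(f"600 q/s for 24h = {sustained:,} queries")
print(f"observed average ~ {average_rate:.1f} datagrams/sec")
```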

Nothing really jumps out at me here:

  13477 bad connection attempts  seems a little high for 24 hours
                                 but if you're directly on the net
                                 (no firewall) this is probably
                                 about right

  250228 control packets         also seems high.  Might indicate
                                 some problem with window sizes or
                                 some other TCP parameter.
                                 You have one control packet for
                                 about every 10 packets sent.
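
For reference, the ratios can be worked out directly from the quoted
counters.  This is just a sketch of the arithmetic; the helper name is
mine, and the numbers are copied from the tcp section below.

```python
# Counters copied from the quoted "tcp:" section of netstat -s.
packets_sent = 2390742
data_packets = 1388685
retransmitted = 3659
control_packets = 250228

def one_in(total, part):
    """Express part/total as 'one in every N'."""
    return total / part

control_ratio = one_in(packets_sent, control_packets)
retransmit_pct = 100 * retransmitted / data_packets

print(f"control: one in every {control_ratio:.1f} packets sent")
print(f"retransmitted: {retransmit_pct:.2f}% of data packets")
```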

> 
> tcp:
>         2390742 packets sent
>                 1388685 data packets (347633967 bytes)
>                 3659 data packets (755463 bytes) retransmitted
>                 0 resends initiated by MTU discovery
>                 726967 ack-only packets (396528 delayed)
>                 0 URG only packets
>                 0 window probe packets
>                 22022 window update packets
>                 250228 control packets
>         2342977 packets received
>                 1483244 acks (for 347618230 bytes)
>                 101958 duplicate acks
>                 3 acks for unsent data
>                 1610847 packets (441601679 bytes) received in-sequence
>                 2681 completely duplicate packets (828736 bytes)
>                 36 old duplicate packets
>                 117 packets with some dup. data (36072 bytes duped)
>                 19740 out-of-order packets (15168338 bytes)
>                 1 packet (0 bytes) of data after window
>                 0 window probes
>                 14556 window update packets
>                 26019 packets received after close
>                 63 discarded for bad checksums
>                 0 discarded for bad header offset fields
>                 0 discarded because packet too short
>         119904 connection requests
>         19459 connection accepts
>         13477 bad connection attempts
>         0 listen queue overflows
>         136653 connections established (including accepts)
>         173934 connections closed (including 594 drops)
>                 1703 connections updated cached RTT on close
>                 1703 connections updated cached RTT variance on close
>                 493 connections updated cached ssthresh on close
>         406 embryonic connections dropped
>         1475400 segments updated rtt (of 1463308 attempts)
>         15465 retransmit timeouts
>                 123 connections dropped by rexmit timeout
>         0 persist timeouts
>                 0 connections dropped by persist timeout
>         15 keepalive timeouts
>                 14 keepalive probes sent
>                 1 connection dropped by keepalive
>         2101 correct ACK header predictions
>         510881 correct data packet header predictions
>         19868 syncache entries added
>                 1262 retransmitted
>                 984 dupsyn
>                 173 dropped
>                 19459 completed
>                 0 bucket overflow
>                 0 cache overflow
>                 107 reset
>                 234 stale
>                 0 aborted
>                 0 badack
>                 68 unreach
>                 0 zone failures
>         0 cookies sent
>         0 cookies received
> udp:
>         1819407 datagrams received
>         0 with incomplete header
>         0 with bad data length field
>         26 with bad checksum
>         323 with no checksum
>         7068 dropped due to no socket
>         150 broadcast/multicast datagrams dropped due to no socket
>         0 dropped due to full socket buffers
>         0 not for hashed pcb
>         1812163 delivered
>         2250035 datagrams output
> 
> I'm hoping to get a clearer picture of what's going on before I bring this 
> up on freebsd-network.
> 



Also check netstat -s for your ICMP usage.  Might be some revealing
stats there.
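
If you want to compare counters between runs, a small parser helps.
This is only a sketch: it assumes a BSD-style netstat in your PATH that
accepts "-s -p icmp", and it only picks up lines of the form
"<count> <description>", so histogram-style lines in the ICMP output
are skipped.

```python
import re
import subprocess

# Matches lines such as "        26 with bad checksum".
COUNTER_LINE = re.compile(r"^\s*(\d+)\s+(.+?)\s*$")

def parse_counters(text):
    """Map each '<count> <description>' line of netstat -s output to its count."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1))
    return counters

def icmp_counters():
    """Run netstat for ICMP only (assumes a BSD-style netstat is installed)."""
    result = subprocess.run(["netstat", "-s", "-p", "icmp"],
                            capture_output=True, text=True, check=True)
    return parse_counters(result.stdout)

# Demo on two lines taken from the udp section quoted above.
sample = """
        26 with bad checksum
        0 dropped due to full socket buffers
"""
print(parse_counters(sample))
```

Saving the output of icmp_counters() a few minutes apart and diffing
the two dictionaries shows which counters are actually moving.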

Hope this helps,

Jim B.




