[nycbug-talk] network strangeness (resource starvation?)

Charles Sprickman spork
Mon Aug 1 17:13:46 EDT 2005

On Sun, 31 Jul 2005, Jim Brown wrote:

> Hi Charles,
> What does 'lots of dns work' mean?  Can you give some I/O stats?
> What DNS software are you running?  What is the hw platform?
> How many zones?  Avg size of a zone?

Well, during a "lull", we were at about 600 queries/second, and it spikes 
up much higher than that.  This is djbdns, specifically dnscache for a 
large number of boxes doing lots of mail deliveries.

Hardware is all pretty decent: SuperMicro SuperServers (pre-built, various 
dual Xeon).  The box having the network issues is the first one we bought, 
so it is unique and may have a different revision of the fxp chips than 
the newer boxes.  I just upgraded it to 4.11 in hopes of picking up some 
"silent fix".  I have never, ever seen a box drop characters in an ssh 
session before - my understanding is that TCP should just keep 
retransmitting and there should never be bogus/missing data in TCP.
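
For reference, that expectation rests on the TCP checksum: a segment that 
fails it is discarded and eventually retransmitted.  Below is a minimal 
sketch of the 16-bit ones' complement Internet checksum (as described in 
RFC 1071) in Python.  The function name is made up for illustration, and a 
real TCP checksum also covers a pseudo-header, which is left out here.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over data (illustrative helper)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


payload = b"\x00\x01\xf2\x03\xf4\xf5\xf6\xf7"    # example bytes used in RFC 1071
corrupted = b"\x00\x01\xf2\x03\xf4\xf5\xf6\xf6"  # last byte altered
swapped = b"\xf2\x03\x00\x01\xf4\xf5\xf6\xf7"    # first two 16-bit words exchanged

print(hex(internet_checksum(payload)))
print(internet_checksum(corrupted) != internet_checksum(payload))  # change detected
print(internet_checksum(swapped) == internet_checksum(payload))    # swap not detected
```

It is a fairly weak check: exchanging two 16-bit words leaves the sum 
unchanged, so it does not catch every kind of corruption.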

> Some general thoughts:
> - eliminate all other non-essential services

We've split this off to two other boxes, and will eventually dedicate a 
few machines just to dns, but at this point even with much less load I see 
the same symptoms.  The box also acts as a tertiary mxer (very low 
volume), nagios monitor, and a repository for internal docs, etc.

> - re-nice named
> - try redesigning your DNS service to include multiple servers
>   and load balance between them

yep. :)

> - BIND (if that's what you are using) is a notorious memory hog
>   However it's still my favorite DNS server.
>   Increase memory to the limit.

We've got dnscache in check, and we're not cpu or memory bound, so it's a 
bit of a mystery...

> I know it's not what you want to hear, but when I made the
> switch from 4.10 to 5.4 i was *impressed* with the performance
> under heavy load.  I tested using:

We are bringing up three more boxes this week, so I may steal one to do 
some testing.

>  X+KDE with multiple konsole sessions:
>    - multiple FTPs
>    - two different stress sessions, one memory, one disk
>    - a loop of 'make buildworld'
>    - a loop of 'make buildkernel'
>    - a loop of 'ls -alR /'
>    - ssh session to remote host
> The desktop was still usable, and I didn't lose any ssh characters on remote sessions.
> IBM T41 with 512 MB.

Wow, sounds good!

After less than 24 hours of being up, here are some netstat stats.  Does 
anything in the tcp or udp section stand out as being very strange?

         2390742 packets sent
                 1388685 data packets (347633967 bytes)
                 3659 data packets (755463 bytes) retransmitted
                 0 resends initiated by MTU discovery
                 726967 ack-only packets (396528 delayed)
                 0 URG only packets
                 0 window probe packets
                 22022 window update packets
                 250228 control packets
         2342977 packets received
                 1483244 acks (for 347618230 bytes)
                 101958 duplicate acks
                 3 acks for unsent data
                 1610847 packets (441601679 bytes) received in-sequence
                 2681 completely duplicate packets (828736 bytes)
                 36 old duplicate packets
                 117 packets with some dup. data (36072 bytes duped)
                 19740 out-of-order packets (15168338 bytes)
                 1 packet (0 bytes) of data after window
                 0 window probes
                 14556 window update packets
                 26019 packets received after close
                 63 discarded for bad checksums
                 0 discarded for bad header offset fields
                 0 discarded because packet too short
         119904 connection requests
         19459 connection accepts
         13477 bad connection attempts
         0 listen queue overflows
         136653 connections established (including accepts)
         173934 connections closed (including 594 drops)
                 1703 connections updated cached RTT on close
                 1703 connections updated cached RTT variance on close
                 493 connections updated cached ssthresh on close
         406 embryonic connections dropped
         1475400 segments updated rtt (of 1463308 attempts)
         15465 retransmit timeouts
                 123 connections dropped by rexmit timeout
         0 persist timeouts
                 0 connections dropped by persist timeout
         15 keepalive timeouts
                 14 keepalive probes sent
                 1 connection dropped by keepalive
         2101 correct ACK header predictions
         510881 correct data packet header predictions
         19868 syncache entries added
                 1262 retransmitted
                 984 dupsyn
                 173 dropped
                 19459 completed
                 0 bucket overflow
                 0 cache overflow
                 107 reset
                 234 stale
                 0 aborted
                 0 badack
                 68 unreach
                 0 zone failures
         0 cookies sent
         0 cookies received
         1819407 datagrams received
         0 with incomplete header
         0 with bad data length field
         26 with bad checksum
         323 with no checksum
         7068 dropped due to no socket
         150 broadcast/multicast datagrams dropped due to no socket
         0 dropped due to full socket buffers
         0 not for hashed pcb
         1812163 delivered
         2250035 datagrams output
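
To put the counters above in proportion, a few of them can be turned into 
rates.  This is only a rough sketch in Python, assuming the `netstat -s` 
text has been saved to a string; the helper names (`parse_counters`, 
`percent`) and the choice of ratios are illustrative, not anything netstat 
provides.

```python
import re

# A few lines copied from the output above; in practice this would be the
# full saved text of `netstat -s`.
SAMPLE = """\
2390742 packets sent
        1388685 data packets (347633967 bytes)
        3659 data packets (755463 bytes) retransmitted
2342977 packets received
        19740 out-of-order packets (15168338 bytes)
        63 discarded for bad checksums
1819407 datagrams received
26 with bad checksum
"""


def parse_counters(text):
    """Map each line's description (parentheticals removed) to its count.

    If a description repeats, the first occurrence is kept.
    """
    counters = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+)\s+(.+)", line)
        if not m:
            continue
        desc = re.sub(r"\s*\([^)]*\)", "", m.group(2)).strip()
        counters.setdefault(desc, int(m.group(1)))
    return counters


def percent(part, whole):
    """Return part as a percentage of whole, or 0.0 if whole is zero."""
    return 100.0 * part / whole if whole else 0.0


c = parse_counters(SAMPLE)
rates = {
    "tcp data packets retransmitted": percent(
        c["data packets retransmitted"], c["data packets"]),
    "tcp received out of order": percent(
        c["out-of-order packets"], c["packets received"]),
    "tcp received with bad checksum": percent(
        c["discarded for bad checksums"], c["packets received"]),
    "udp received with bad checksum": percent(
        c["with bad checksum"], c["datagrams received"]),
}
for name, value in rates.items():
    print(f"{name}: {value:.4f}%")
```
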

I'm hoping to get a clearer picture of what's going on before I bring this 
up on freebsd-net.



> Best Regards,
> Jim B.

