[nycbug-talk] network strangeness (resource starvation?)
Charles Sprickman
spork
Mon Aug 1 17:13:46 EDT 2005
On Sun, 31 Jul 2005, Jim Brown wrote:
> Hi Charles,
>
> What does 'lots of dns work' mean? Can you give some I/O stats?
> What DNS software are you running? What is the hw platform?
> How many zones? Avg size of a zone?
Well, during a "lull", we were at about 600 queries/second, and it spikes
up much higher than that. This is djbdns, specifically dnscache for a
huge number of boxes doing lots of mail deliveries.
Hardware is all pretty decent, SuperMicro SuperServers (pre-built, various
dual xeon). The box having the network issues is the first one we bought,
so it is unique and may have a different revision of the fxp chips than
the newer boxes. I just upgraded it to 4.11 in the hope of picking up some
"silent fix". I have never, ever seen a box drop characters in an ssh
session before. My understanding is that TCP should just keep
retransmitting, so there should never be bogus or missing data in a TCP
stream.
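For reference, the integrity check TCP relies on is a 16-bit one's-complement checksum. The following is a minimal Python sketch of that checksum in the style of RFC 1071, not the kernel's actual code; the function name and the sample payload are made up for illustration. It shows that a single flipped bit is caught, while some corruptions (here, two 16-bit words swapped) leave the checksum unchanged.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        # treat each pair of bytes as one big-endian 16-bit word
        total += (data[i] << 8) | data[i + 1]
    # fold any carries back into the low 16 bits
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Hypothetical payload, chosen only for the demonstration.
payload = b"ls -alR /\n"
good = internet_checksum(payload)

# Corruption 1: flip one bit in the first byte.
flipped = bytes([payload[0] ^ 0x01]) + payload[1:]

# Corruption 2: swap the first two 16-bit words.
swapped = payload[2:4] + payload[0:2] + payload[4:]

print("single bit flip detected:", internet_checksum(flipped) != good)
print("word swap detected:      ", internet_checksum(swapped) != good)
```

So the checksum makes corrupted segments very unlikely to be accepted, but it is not an absolute guarantee.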
> Some general thoughts:
>
> - eliminate all other non-essential services
We've split this off to two other boxes, and will eventually dedicate a
few machines just to DNS, but at this point even with much less load I see
the same symptoms. The box also acts as a tertiary MX (very low
volume), a Nagios monitor, and a repository for internal docs, etc.
> - re-nice named
> - try redesigning your DNS service to include multiple servers
> and load balance between them
yep. :)
> - BIND (if that's what you are using) is a notorious memory hog
> However it's still my favorite DNS server.
> Increase memory to the limit.
We've got dnscache in check, and we're not CPU or memory bound, so it's a
bit of a mystery...
> I know it's not what you want to hear, but when I made the
> switch from 4.10 to 5.4 i was *impressed* with the performance
> under heavy load. I tested using:
We are bringing up three more boxes this week, so I may steal one to do
some testing.
> X+KDE with multiple konsole sessions:
>
> - multiple FTPs
> - two different stress sessions, one memory, one disk
> - a loop of 'make buildworld'
> - a loop of 'make buildkernel'
> - a loop of 'ls -alR /'
> - ssh session to remote host
>
> The desktop was still usable, and I didn't lose any ssh characters on remote sessions.
>
> IBM T41 with 512 MB.
Wow, sounds good!
After less than 24 hours of uptime, here are some netstat stats. Does
anything in the tcp or udp section stand out as being very strange?
tcp:
    2390742 packets sent
        1388685 data packets (347633967 bytes)
        3659 data packets (755463 bytes) retransmitted
        0 resends initiated by MTU discovery
        726967 ack-only packets (396528 delayed)
        0 URG only packets
        0 window probe packets
        22022 window update packets
        250228 control packets
    2342977 packets received
        1483244 acks (for 347618230 bytes)
        101958 duplicate acks
        3 acks for unsent data
        1610847 packets (441601679 bytes) received in-sequence
        2681 completely duplicate packets (828736 bytes)
        36 old duplicate packets
        117 packets with some dup. data (36072 bytes duped)
        19740 out-of-order packets (15168338 bytes)
        1 packet (0 bytes) of data after window
        0 window probes
        14556 window update packets
        26019 packets received after close
        63 discarded for bad checksums
        0 discarded for bad header offset fields
        0 discarded because packet too short
    119904 connection requests
    19459 connection accepts
    13477 bad connection attempts
    0 listen queue overflows
    136653 connections established (including accepts)
    173934 connections closed (including 594 drops)
        1703 connections updated cached RTT on close
        1703 connections updated cached RTT variance on close
        493 connections updated cached ssthresh on close
    406 embryonic connections dropped
    1475400 segments updated rtt (of 1463308 attempts)
    15465 retransmit timeouts
        123 connections dropped by rexmit timeout
    0 persist timeouts
        0 connections dropped by persist timeout
    15 keepalive timeouts
        14 keepalive probes sent
        1 connection dropped by keepalive
    2101 correct ACK header predictions
    510881 correct data packet header predictions
    19868 syncache entries added
        1262 retransmitted
        984 dupsyn
        173 dropped
        19459 completed
        0 bucket overflow
        0 cache overflow
        107 reset
        234 stale
        0 aborted
        0 badack
        68 unreach
        0 zone failures
    0 cookies sent
    0 cookies received
udp:
    1819407 datagrams received
    0 with incomplete header
    0 with bad data length field
    26 with bad checksum
    323 with no checksum
    7068 dropped due to no socket
    150 broadcast/multicast datagrams dropped due to no socket
    0 dropped due to full socket buffers
    0 not for hashed pcb
    1812163 delivered
    2250035 datagrams output
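To put the raw counters in proportion, here is a rough Python sketch that turns a few of them into percentages. The numbers are copied from the netstat output above; the dictionary keys, the `pct` helper, and the choice of which ratios to compute are illustrative assumptions, not a standard diagnostic.

```python
# Counters copied by hand from the netstat output above.
tcp = {
    "data_packets_sent": 1388685,
    "data_packets_retransmitted": 3659,
    "packets_received": 2342977,
    "acks": 1483244,
    "duplicate_acks": 101958,
    "out_of_order": 19740,
    "bad_checksums": 63,
}
udp = {
    "received": 1819407,
    "bad_checksum": 26,
    "full_socket_buffers": 0,
}


def pct(part: int, whole: int) -> float:
    """Return part as a percentage of whole (0.0 if whole is zero)."""
    return 100.0 * part / whole if whole else 0.0


ratios = {
    "tcp retransmitted / data sent": pct(
        tcp["data_packets_retransmitted"], tcp["data_packets_sent"]
    ),
    "tcp duplicate acks / acks": pct(tcp["duplicate_acks"], tcp["acks"]),
    "tcp out-of-order / received": pct(
        tcp["out_of_order"], tcp["packets_received"]
    ),
    "tcp bad checksums / received": pct(
        tcp["bad_checksums"], tcp["packets_received"]
    ),
    "udp bad checksums / received": pct(udp["bad_checksum"], udp["received"]),
    "udp full socket buffers / received": pct(
        udp["full_socket_buffers"], udp["received"]
    ),
}

for name, value in ratios.items():
    print(f"{name:36s} {value:8.4f}%")
```

By this arithmetic, retransmitted data packets come to roughly 0.26% of data packets sent, and checksum failures are well under 0.01% of packets received on both the tcp and udp side.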
I'm hoping to get a clearer picture of what's going on before I bring this
up on freebsd-network.
Thanks,
Charles
> Best Regards,
> Jim B.
>
>
>
>