[nycbug-talk] overloaded webserver: nfs wait issue?

Thu Dec 1 14:39:44 EST 2005

We have a website with moderately high traffic, load balanced among 3
webservers.

During peak traffic times, however (when the volume is higher than
normal), the load average shoots up to over 100 and the site slows to
a crawl.

We set up a script to take snapshots of top every 20 seconds. Here is
what it looks like when everything is normal:

         127
    last pid: 12003;  load averages:  0.93,  1.36,  1.35  up 41+04:22:14    14:00:23
    243 processes: 12 running, 230 sleeping, 1 zombie

    Mem: 222M Active, 74M Inact, 186M Wired, 16M Cache, 111M Buf, 503M Free
    Swap: 2048M Total, 16M Used, 2032M Free

      PID USERNAME     PRI NICE  SIZE    RES STATE    TIME   WCPU    CPU COMMAND
      136 root          32   0  1208K   420K RUN     33.1H  7.28%  7.28% amd
    11918 nobody        -1   0   149M 12292K nfsrcv   0:01  3.00%  1.95% httpd
    11879 nobody         2   0   149M 12292K sbwait   0:01  2.10%  1.37% httpd
    11896 nobody         2   0   148M 11704K RUN      0:00  1.80%  1.17% httpd
    11962 nobody         2   0   147M 10072K RUN      0:00  4.33%  1.12% httpd
    11892 nobody        -1   0   145M  8804K nfsrcv   0:00  1.35%  0.88% httpd
    11935 nobody         2   0   149M 12284K sbwait   0:00  1.73%  0.78% httpd
    11925 nobody         2   0   149M 12288K sbwait   0:00  1.08%  0.68% httpd
    11894 nobody         2   0   149M 12404K sbwait   0:00  0.98%  0.63% httpd
    11937 nobody         2   0   149M 12456K RUN      0:00  1.61%  0.63% httpd
    11954 nobody         2   0   149M 12288K sbwait   0:00  1.88%  0.49% httpd
      191 root           2   0   144M  6632K select  13:23  0.34%  0.34% httpd
    11930 nobody         2   0   145M  8852K sbwait   0:00  0.62%  0.34% httpd
    11872 nobody         2   0   149M 12288K sbwait   0:00  0.45%  0.29% httpd
    11911 nobody         2   0   148M 11604K accept   0:00  0.45%  0.29% httpd
    11893 nobody         2   0   149M 12392K sbwait   0:00  0.38%  0.24% httpd
    11876 nobody         2   0   149M 12264K sbwait   0:00  0.38%  0.24% httpd
    11934 nobody         2   0   149M 12292K accept   0:00  0.41%  0.20% httpd
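
For anyone who wants to collect the same kind of data, a snapshot loop
like the one described above could look roughly like this. It is a
minimal sketch, not the actual script: the function names, the log
path, and the `top -b` invocation are all assumptions (batch-mode flags
differ between BSD and Linux top, so check top(1) on your system).

```python
import subprocess
import time
from datetime import datetime

# Assumption: a BSD-style top that prints one display in batch mode.
# On other platforms the flags will differ; see top(1).
TOP_CMD = ["top", "-b"]


def take_snapshot(cmd=TOP_CMD):
    """Run cmd once and return its output under a timestamp header."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    return f"=== {stamp} ===\n{result.stdout}"


def snapshot_loop(path, interval=20, count=None, cmd=TOP_CMD):
    """Append a snapshot to path every interval seconds.

    count=None runs until interrupted; an integer stops after that
    many snapshots.
    """
    taken = 0
    while count is None or taken < count:
        with open(path, "a") as log:
            log.write(take_snapshot(cmd))
        taken += 1
        if count is None or taken < count:
            time.sleep(interval)
```

Calling `snapshot_loop("/var/tmp/top-snapshots.log")` (a hypothetical
path) would append one timestamped snapshot every 20 seconds until
interrupted.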

When the load shoots up, the number of http clients hits Apache's
MaxClients setting. Here is what top shows at that point:

    last pid: 12407;  load averages: 87.84, 51.91, 27.52  up 41+04:40:51    14:19:00
    268 processes: 2 running, 266 sleeping

    Mem: 715M Active, 68M Inact, 187M Wired, 29M Cache, 111M Buf, 2100K Free
    Swap: 2048M Total, 272M Used, 1776M Free, 13% Inuse

      PID USERNAME     PRI NICE  SIZE    RES STATE    TIME   WCPU    CPU COMMAND
      136 root          64   0  1208K   376K RUN     33.1H  2.69%  2.69% amd
    11965 nobody        -1   0   149M  6892K nfsrcv   0:05  0.24%  0.24% httpd
    11913 nobody        -1   0   149M  8300K nfsrcv   0:05  0.20%  0.20% httpd
    11878 nobody        -1   0   149M  8572K nfsrcv   0:09  0.15%  0.15% httpd
    11948 nobody        -1   0   149M  8852K nfsrcv   0:07  0.15%  0.15% httpd
    11982 nobody        -1   0   149M  6764K nfsrcv   0:04  0.15%  0.15% httpd
    11912 nobody        -1   0   149M  4912K nfsrcv   0:06  0.10%  0.10% httpd
    12060 nobody        -1   0   149M  7356K nfsrcv   0:05  0.10%  0.10% httpd
    11999 nobody        -1   0   149M  8352K nfsrcv   0:04  0.10%  0.10% httpd
    12122 nobody        -1   0   149M  8296K nfsrcv   0:04  0.10%  0.10% httpd
    12028 nobody        -1   0   149M  8664K nfsrcv   0:04  0.10%  0.10% httpd
    12267 nobody        -1   0   149M  8452K nfsrcv   0:03  0.10%  0.10% httpd
    12270 nobody        -1   0   150M  7156K nfsrcv   0:02  0.10%  0.10% httpd
    11983 nobody        -1   0   149M  8256K nfsrcv   0:09  0.05%  0.05% httpd
    11977 nobody        -1   0   149M  5488K nfsrcv   0:06  0.05%  0.05% httpd
    11952 nobody        -1   0   149M  6704K nfsrcv   0:06  0.05%  0.05% httpd
    11895 nobody        -1   0   148M  4404K nfsrcv   0:06  0.05%  0.05% httpd
    11885 nobody        -1   0   149M  8348K nfsrcv   0:06  0.05%  0.05% httpd

All of the httpd processes are in the "nfsrcv" state. Does this mean the
bottleneck is at the NFS server that hosts the htdocs (and PHP scripts)
or just that the server is low on memory?
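
To put numbers on how many httpd processes sit in each state, the
STATE column of a snapshot can be tallied. The following is a minimal
sketch that assumes the 11-column top layout shown above;
`count_states` is a hypothetical helper, and the sample rows are copied
from the two snapshots in this message. Note that top only lists the
busiest processes, so the tally covers the visible rows, not every
httpd child.

```python
from collections import Counter


def count_states(top_output, command="httpd"):
    """Tally the STATE column for process rows whose COMMAND matches."""
    states = Counter()
    for line in top_output.splitlines():
        fields = line.split()
        # Process rows in this layout have exactly 11 columns:
        # PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
        if len(fields) == 11 and fields[0].isdigit() and fields[10] == command:
            states[fields[6]] += 1
    return states


sample = """\
  136 root          64   0  1208K   376K RUN     33.1H  2.69%  2.69% amd
11965 nobody        -1   0   149M  6892K nfsrcv   0:05  0.24%  0.24% httpd
11913 nobody        -1   0   149M  8300K nfsrcv   0:05  0.20%  0.20% httpd
11911 nobody         2   0   148M 11604K accept   0:00  0.45%  0.29% httpd
"""

print(dict(count_states(sample)))
```

Running the tally over each 20-second snapshot would show when the
share of processes in "nfsrcv" starts to climb relative to the load
average.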

Thomas

-- 
N.J. Thomas
njt at ayvali.org
Etiamsi occiderit me, in ipso sperabo