Here is a brief list of tools we use extensively to maintain our site.
Open Monitoring Distribution (OMD) - An advanced monitoring tool for services and hardware. We use this tool to monitor and inform us of problems with our infrastructure. This includes UPS, cooling, computer hardware, services, and more. OMD is a highly extensible and flexible tool that can gather information from a wide variety of hardware types and software.
Cacti - An RRD-based network graphing tool. We use it to graph everything from network throughput to disk I/O by leveraging SNMP capabilities and custom query scripts.
Cfengine - Used extensively to manage configurations on cluster and non-cluster nodes. We once used it primarily for service nodes and non-cluster nodes, but we are gradually expanding it to manage compute nodes, which gives us additional real-time flexibility that our build system cannot provide. We use the open-source community edition.
ROCKS - The system we use to automate building cluster compute nodes.
Ganglia - Used sitewide for system workload and health monitoring.
Syslog-ng - An advanced replacement for Syslog or Rsyslog. All of our hosts use syslog-ng, and we monitor them with a central loghost.
Rancid - Really Awesome New Cisco Config Differ. We use this tool to track configurations on all of our switches. It regularly checks for configuration changes and commits the current switch config to a CVS repository for us.
Subversion - Nearly every core configuration or software project at our site is tracked in Subversion. This includes ROCKS build specifications, Cfengine policy, software tools, this website, and files not managed by another config tool.
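
As an illustration of the custom query scripts mentioned in the Cacti entry, here is a minimal sketch of what one might look like. It is not one of our production scripts. It assumes a Linux host with /proc/diskstats, a hypothetical device name (sda), and Cacti's convention that a script returning several values prints them as space-separated name:value pairs. The function names are made up for this example.

```python
#!/usr/bin/env python3
"""Sketch of a Cacti-style data input script for disk I/O counters."""
import sys


def parse_diskstats_line(line):
    """Extract the device name and completed read/write counts from one
    line of /proc/diskstats (columns: major, minor, name, reads completed,
    reads merged, sectors read, ms reading, writes completed, ...)."""
    fields = line.split()
    return {
        "device": fields[2],
        "reads": int(fields[3]),
        "writes": int(fields[7]),
    }


def format_cacti_fields(values):
    """Render a dict of readings as space-separated name:value pairs."""
    parts = []
    for name, value in values.items():
        # Spaces or colons in a field name would make the output ambiguous.
        if " " in name or ":" in name:
            raise ValueError(f"invalid field name: {name!r}")
        parts.append(f"{name}:{value}")
    return " ".join(parts)


def main(device="sda"):  # "sda" is a placeholder device name
    with open("/proc/diskstats") as stats:
        for line in stats:
            entry = parse_diskstats_line(line)
            if entry["device"] == device:
                print(format_cacti_fields(
                    {"reads": entry["reads"], "writes": entry["writes"]}))
                return 0
    print(f"device {device} not found", file=sys.stderr)
    return 1


if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:2]))
```

Run with a device name as the only argument, the script prints a single line such as reads:1200 writes:560, which a Cacti data input method could then map onto two data sources. The counters are cumulative, so they would normally be graphed as rates.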