Saturday, May 06, 2006

First task of Linux system administrator

Linux System Administration: First Tasks

Linux system administration has a place of its own in the hierarchy of information technology specializations. Some people excel in special areas of free software technology but haven't needed to learn system administration. For example, you may specialize in configuring e-mail or writing applications using Apache and MySQL. You may focus only on the Domain Name System (DNS) and know esoteric ways of setting up servers on provider lines that frequently change IP addresses. But if I asked you to babysit a busy server or servers, you might lack the temperament or the broad set of skills the job requires.

The above does not mean that good system administrators do not excel in areas such as configuring Apache, maintaining DNS zone files or writing Perl scripts. It simply means that if you want to work as a system administrator in the Linux world, you need to know how to do everything from installing a server to securing the filesystem from mischievous crackers on the Internet. In between, you need to prepare your system to recover from the myriad ways a server can fail.

Consider, for example, a case in which you find that one of the Web sites you manage has gone down; the server has locked up and nothing works. How do you recover in the fastest possible way? Such an event happened to me two weeks ago. One of my articles wound up on Slashdot.org, Digg.com, NewsForge and other sites at the same time. None of my colleagues had seen that much traffic on a Linux site before. Aside from the several million hits on our server, we had a quarter of a million unique visitors concentrated in a five-hour period.

When you see that kind of traffic, you don't want the server to go down or you'll miss new readers. In our situation, a reboot allowed the system to return to service for a few minutes, but then it locked up again. Normally, we used less than ten percent of our system resources, so we thought we had prepared for the hottest day of the year.

Because we knew the server and all its running processes, we could shut some of them down and focus on allowing a massive increase in simultaneous connections to our database. Although we have several thousand subscribers, we turned off processes such as those that restricted comments to registered readers. In the end, we made it through the day with only a short period of downtime. But the surge of traffic rocked our boats.

Service outages such as the one described above can happen in the confines of a private network. Many services experience peak usage at specific times. For example, administrators know that one of the heaviest loads they'll have during the day occurs first thing in the morning, when people check their e-mail. People arrive at work about the same time, crank up their e-mail clients and read mail while drinking coffee.

The mail server might experience 75% of its use between 8 and 10 AM. Gateway traffic also increases and bandwidth on the network bogs down. Should you provide separate dedicated servers for mail, routing, proxy and gateway services? The majority of IT shops do that.

What if those computers averaged only 10% of CPU and memory capacity during the course of the day, but required 75% of resources for only a couple of hours a day, five days a week? Rather than buying individual computers, vendors have started recommending higher-capacity machines and creating virtual servers.

You might want to configure somewhat larger hardware to host virtual machines for e-mail and related applications. Then, using Xen for example, you could let each application run in its own space. In that case, you might find server capacity utilization running around 50%, which helps maximize your resources and reduces server sprawl.
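To make the consolidation arithmetic concrete, here is a rough Python sketch. The load profiles, the peak hours and the host size are all made-up numbers for illustration, and the helper names are mine; real capacity planning would start from measured data. Loads are expressed in units of one dedicated server's capacity.

```python
def hourly_profile(baseline, peak, peak_hours):
    """Build a 24-entry load profile (one value per hour of the day)."""
    return [peak if hour in peak_hours else baseline for hour in range(24)]


def consolidated_utilization(profiles, host_capacity):
    """Return (average, peak) utilization of one host running every profile."""
    # Add up the load of all services hour by hour.
    totals = [sum(hour_loads) for hour_loads in zip(*profiles)]
    utilization = [total / host_capacity for total in totals]
    return sum(utilization) / len(utilization), max(utilization)


if __name__ == "__main__":
    # Hypothetical services: 10% baseline, 75% during a two-hour peak.
    mail = hourly_profile(0.10, 0.75, peak_hours={8, 9})
    proxy = hourly_profile(0.10, 0.75, peak_hours={13, 14})
    average, peak = consolidated_utilization([mail, proxy], host_capacity=1.0)
    print(f"average {average:.0%}, peak {peak:.0%}")
```

The number to watch is the peak, not the average. If every service peaks in the same two hours, the combined peak can exceed the capacity of the host, so it pays to look at when each load actually occurs before you consolidate.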

A system administrator should know how to climb a learning curve quickly. If a new technology arrives, such as virtualization, you need to master it before it masters you. You also need to know how to apply it in your environment.

What kinds of tasks occupy a system administrator's day? That depends on the environment in which he or she works. You may find yourself managing dozens or even hundreds of Web servers. In contrast, you might find yourself running a local area network that supports knowledge workers and/or developers.

Regardless of your environment, you will find that some tasks are common to all system administration functions. For example, monitoring system services and starting and stopping them takes on a role of its own. Your Linux box might appear to be running smoothly while one or more processes have stopped. A Linux server might seem happy on the outside, for example, while the database serving Web pages has failed.
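One simple way to catch the "happy on the outside" case is to test each service directly instead of trusting that the box is up. The Python sketch below tries a TCP connection to each service port. The service names, hosts and ports in the comment are assumptions for illustration, and an open port only shows that something is listening, not that the service answers correctly.

```python
import socket
from typing import Callable, Dict, Tuple


def service_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable or unresolvable: treat as down.
        return False


def check_services(
    services: Dict[str, Tuple[str, int]],
    probe: Callable[[str, int], bool] = service_is_listening,
) -> Dict[str, bool]:
    """Probe every (host, port) pair and map the service name to up/down."""
    return {name: probe(host, port) for name, (host, port) in services.items()}


# Hypothetical usage; adjust hosts and ports to your own setup:
#   check_services({"web": ("localhost", 80), "mysql": ("localhost", 3306)})
```

Run from cron or a small loop, a check like this tells you that the database behind your Web pages has stopped answering before your readers do.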

When services to users become critical needs, you need to be prepared and stay ahead of problems. Imagine a failed printing job is locking up a queue, keeping users from getting their documents printed. Do you wait to do something until you hear from irate users, or do you have a way to stay ahead of the problem?

Most system administrators have to face the fact that something will happen at some point that causes downtime. Such events usually occur outside of our control. Perhaps your system suffers a power outage or spike. Sometimes a system bug pops up due to a combination of factors that exist only on your server; it's something that never occurred during project testing. In reality, sysadmins never know when a problem will occur; they only know that eventually one will arise.

Administrators need to monitor their systems in an efficient and effective manner. To this end, many administrators have discovered a plethora of monitoring and alert tools within the Free Software community. Some require you to log into a remote system by SSH and run command-line tools such as pstree, lsof, dstat and chkconfig.

Another useful monitoring tool is Checkservice, which provides the status of services on (remote) hosts. It provides results by way of logs, a PHP status page or output to other tools. Some administrators like tiger, which performs a thorough check of a system and reports the results to a log file. You can find a list and explanation of tools for Debian here.

When you have to monitor a larger server farm and do not want to spend all your time logging into remote servers and running command-line tests, look for free software tools you can use with a browser. I like a tool called monit. This monitoring and alert system works on a number of Linux-type systems. Monit provides a system administrator with the ability to define, manage and monitor processes, the filesystem and even devices. You also can configure monit to restart processes if they fail.
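The check-and-restart idea is easy to sketch in Python. To be clear, this is not how monit itself is implemented or configured; it is only a minimal illustration of the pattern. The function names are mine, and the commands shown in the comment (a pgrep check and an init script path) are assumptions that will differ from system to system.

```python
import subprocess
from typing import Callable, Sequence


def command_succeeds(command: Sequence[str]) -> bool:
    """Run a command quietly and report whether it exited with status 0."""
    try:
        completed = subprocess.run(
            command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        # The command is missing or could not be executed.
        return False
    return completed.returncode == 0


def ensure_running(
    is_running: Callable[[], bool],
    restart: Callable[[], bool],
    max_restarts: int = 3,
) -> str:
    """Return "ok", "restarted" or "failed" after at most max_restarts tries."""
    if is_running():
        return "ok"
    for _ in range(max_restarts):
        restart()
        if is_running():
            return "restarted"
    return "failed"


# Hypothetical usage; the process name and init script are assumptions:
#   ensure_running(
#       lambda: command_succeeds(["pgrep", "-x", "mysqld"]),
#       lambda: command_succeeds(["/etc/init.d/mysql", "start"]),
#   )
```

Capping the number of restart attempts matters. A service that dies immediately after every start needs a human to look at it, and a "failed" result is the point at which you want an alert rather than another retry.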

Stanford University keeps an updated list of network monitoring tools and sponsors a working group called the Internet End-to-End Performance Monitoring Group. Be sure to check out the latest tools at the top of the Stanford list. Cacti, for example, has become one of the more popular tools among system administrators.

Professional Linux system administration requires you to know a broad number of tasks associated with networking and providing services to users. It takes a special breed of person to work in this capacity. Obviously, many people have both the character and the interest to do the job. Over the next few months, we will explore the tasks that make up Linux system administration. I hope you'll join me for the ride.
