Nov 16, 2015
One of the curious features of Unix systems (including Linux) is the “uninterruptible sleep” state. This is a state that a process can enter when making certain system calls. In this state, the process is blocked performing a system call, and the process cannot be interrupted (or killed) until the system call completes. Most of these uninterruptible system calls are effectively instantaneous, meaning that you never observe the uninterruptible nature of the system call. In rare cases (often because of buggy kernel drivers, but possibly for other reasons) the process can get stuck in this uninterruptible state. This is very similar to the zombie process state in the sense that you cannot kill a process in this state, although it’s worth noting that the two cases happen for different reasons. Typically when a process is wedged in the uninterruptible sleep state your only recourse is to reboot the system, because there is literally no way to kill the process.
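On Linux, a process in uninterruptible sleep is reported with the state letter D, both by ps and in the third field of /proc/<pid>/stat. The following is a minimal sketch for spotting such processes, assuming a Linux-style /proc is mounted; the function names parse_state and uninterruptible_pids are made up for this illustration and are not part of any library.

```python
import os


def parse_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # Field 2 (the command name) is wrapped in parentheses and may itself
    # contain spaces or parentheses, so split after the LAST ')'.
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def uninterruptible_pids(proc_root="/proc"):
    """Yield (pid, comm) for every process currently reported in state D."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            path = os.path.join(proc_root, entry, "stat")
            with open(path, errors="replace") as f:
                line = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        if parse_state(line) == "D":
            comm = line[line.index("(") + 1:line.rindex(")")]
            yield int(entry), comm


if __name__ == "__main__":
    for pid, comm in uninterruptible_pids():
        print(pid, comm)
```

A healthy system will usually print nothing, or will show processes that appear only briefly. A process that shows up in state D run after run is the wedged case described above.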
One infamous example of this has been Linux with NFS. For historical reasons certain local I/O operations are not interruptible. For instance, the mkdir(2) system call is not interruptible, which you can verify from its man page by observing that this system call cannot return EINTR. On a normal system the worst case for mkdir would be a few disk seeks, which isn’t exactly fast but isn’t the end of the world either. On a networked filesystem like NFS this operation can involve network RPCs that can block, potentially forever. This means that if you get the right kind of horkage under NFS, a program that calls mkdir(2) can get stuck in the dreaded uninterruptible sleep state forever. When this happens there’s no way to kill the process, and the operator has to either live with this zombie-like process or reboot the system. The Linux kernel programmers could “fix” this by making the mkdir(2) system call interruptible so that mkdir(2) could return EINTR. However, historical Unix systems since the dawn of time haven’t returned EINTR for this system call, so Linux adopts the same convention.
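For contrast, here is a small sketch of what the interruptible case looks like from user space: a blocking sleep that a signal cuts short. It assumes a Unix system and must run in the main thread. The names SleepInterrupted and interrupt_sleep are hypothetical helpers invented for this example. Python 3.5 and later normally retries a system call that fails with EINTR, so the handler raises an exception to make the interruption visible.

```python
import signal
import time


class SleepInterrupted(Exception):
    """Raised by the SIGALRM handler to abort the blocking call."""


def _on_alarm(signum, frame):
    raise SleepInterrupted()


def interrupt_sleep(sleep_seconds=5.0, alarm_after=0.1):
    """Start a long sleep and interrupt it with SIGALRM.

    Returns (interrupted, elapsed_seconds).
    """
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, alarm_after)
    start = time.monotonic()
    try:
        time.sleep(sleep_seconds)  # an interruptible blocking call
        interrupted = False
    except SleepInterrupted:
        interrupted = True
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)    # cancel any pending timer
        signal.signal(signal.SIGALRM, previous)    # restore the old handler
    return interrupted, time.monotonic() - start


if __name__ == "__main__":
    print(interrupt_sleep())
```

A process blocked in uninterruptible sleep gets no such early exit: the signal stays pending until the system call finishes on its own.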
This was actually a big problem for us at Yelp, my first job out of college. At the time we had just taken the radical step of moving images out of MySQL tables, which stored the raw image data in a BLOB column, and into NFS served from cheap, unreliable NFS appliances. Under certain situations the NFS servers would lock up and processes accessing NFS would start entering uninterruptible sleep as they did various I/O operations. When this happened, very quickly (e.g. in a minute or two) every single Apache worker would end up servicing a request whose handler did one of these I/O operations, and thus 100% of the Apache workers would become stuck in the uninterruptible sleep state. This would quite literally bring down the entire site until we rebooted everything. We eventually “solved” this problem by dropping the NFS dependency and moving things to S3.
Another fun fact about the uninterruptible sleep state is that occasionally it may not be possible to strace a process in this state. The man page for the ptrace system call notes that under rare circumstances attaching to a process using the ptrace system call can cause the traced process to be interrupted. If the process is in uninterruptible sleep then the process can’t be interrupted, which will cause the strace process itself to hang forever. Remarkably, it appears that the ptrace(2) system call is itself uninterruptible, which means that if this happens you may not be able to kill the strace process!
Tonight I learned about a “new” feature in Linux: the TASK_KILLABLE state. This is sort of a compromise between processes in interruptible sleep and processes in uninterruptible sleep. A process in the TASK_KILLABLE state still cannot be interrupted in the usual sense (i.e. you can’t force the system call to return EINTR); however, processes in this state can be killed. This means that, for instance, processes doing I/O over NFS can be killed if they get into a wedged state. Not all system calls implement this state, so it’s still possible to end up with stuck, unkillable processes for some system calls, but it’s certainly an improvement over the previous situation. As usual LWN has a great article on the subject, including information about the historical semantics of uninterruptible sleep on Linux.