Real-Time Linux Kernel Scheduler


“Real-Time Linux Kernel Scheduler”, HOWTOs, by Ankita Garg, August 1, 2009

Contents

Design Goal

Overview of the -rt Patchset Scheduling Algorithm

Important -rt Patchset Scheduler Data Structures and Concepts

Root Domain

CPU Priority Management

Details of the Push Scheduling Algorithm

Details of the Pull Scheduling Algorithm

Scheduling Example

Summary

Legal Statement

Resources

Index of /pub/linux/kernel/projects/rt/

What makes a kernel/OS real-time?

Real-Time Linux Wiki


 

Many market sectors, such as financial trading, defense, industry automation and gaming, long have had a need for low latencies and deterministic response time. Traditionally, custom-built hardware and software were used to meet these real-time requirements. However, for some soft real-time requirements, where predictability of response times is advantageous and not mandatory, this is an expensive solution. With the advent of the PREEMPT_RT patchset, referred to as -rt henceforth, led by Ingo Molnar, Linux has made great progress in the world of real-time operating systems for “enterprise real-time” applications. A number of modifications were made to the general-purpose Linux kernel to make Linux a viable choice for real time, such as the scheduler, interrupt handling, locking mechanism and so on.


A real-time system is one that provides guaranteed system response times for events and transactions—that is, every operation is expected to be completed within a certain rigid time period. A system is classified as hard real-time if missed deadlines cause system failure and soft real-time if the system can tolerate some missed time constraints.


 

Design Goal


Real-time systems require that tasks be executed in a strict priority order. This necessitates that only the N highest-priority tasks be running at any given point in time, where N is the number of CPUs. A variation to this requirement could be strict priority-ordered scheduling in a given subset of CPUs or scheduling domains (explained later in this article). In both cases, when a task is runnable, the scheduler must ensure that it be put on a runqueue on which it can be run immediately—that is, the real-time scheduler has to ensure system-wide strict real-time priority scheduling (SWSRPS). Unlike non-real-time systems where the scheduler needs to look only at its runqueue of tasks to make scheduling decisions, a real-time scheduler makes global scheduling decisions, taking into account all the tasks in the system at any given point. Real-time task balancing also has to be performed frequently. Task balancing can introduce cache thrashing and contention for global data (such as runqueue locks) and can degrade throughput and scalability. A real-time task scheduler would trade off throughput in favor of correctness, but at the same time, it must ensure minimal task ping-ponging.


The standard Linux kernel provides two real-time scheduling policies, SCHED_FIFO and SCHED_RR. The main real-time policy is SCHED_FIFO. It implements a first-in, first-out scheduling algorithm. When a SCHED_FIFO task starts running, it continues to run until it voluntarily yields the processor, blocks or is preempted by a higher-priority real-time task. It has no timeslices. All other tasks of lower priority will not be scheduled until it relinquishes the CPU. Two equal-priority SCHED_FIFO tasks do not preempt each other. SCHED_RR is similar to SCHED_FIFO, except that such tasks are allotted timeslices based on their priority and run until they exhaust their timeslice. Non-real-time tasks use the SCHED_NORMAL scheduling policy (older kernels had a policy named SCHED_OTHER).

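As a concrete illustration of these policies, the minimal user-space sketch below (our own example, not from the article) puts the calling process under SCHED_FIFO at priority 50 through the standard sched_setscheduler() interface; it needs root privileges or an appropriate RLIMIT_RTPRIO to succeed.

#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* Request the SCHED_FIFO policy at real-time priority 50 for this process. */
    struct sched_param param = { .sched_priority = 50 };

    if (sched_setscheduler(0, SCHED_FIFO, &param) == -1) {
        perror("sched_setscheduler");
        return 1;
    }
    printf("now running under SCHED_FIFO at priority %d\n", param.sched_priority);
    return 0;
}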

In the standard Linux kernel, real-time priorities range from zero to (MAX_RT_PRIO-1), inclusive. By default, MAX_RT_PRIO is 100. Non-real-time tasks have priorities in the range of MAX_RT_PRIO to (MAX_RT_PRIO + 40). This constitutes the nice values of SCHED_NORMAL tasks. By default, the –20 to 19 nice range maps directly onto the priority range of 100 to 139.

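The mapping in the last sentence is simple arithmetic; the tiny sketch below makes it explicit (the helper name is ours, the kernel has a NICE_TO_PRIO() macro for the same thing).

#include <stdio.h>

#define MAX_RT_PRIO 100

/* Default nice-to-priority mapping: nice -20..19 lands on priorities 100..139. */
static int nice_to_prio(int nice)
{
    return MAX_RT_PRIO + 20 + nice;
}

int main(void)
{
    printf("nice -20 -> prio %d\n", nice_to_prio(-20)); /* 100 */
    printf("nice   0 -> prio %d\n", nice_to_prio(0));   /* 120 */
    printf("nice  19 -> prio %d\n", nice_to_prio(19));  /* 139 */
    return 0;
}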

This article assumes that readers are aware of the basics of a task scheduler. See Resources for more information about the Linux Completely Fair Scheduler (CFS).


 

Overview of the -rt Patchset Scheduling Algorithm


The real-time scheduler of the -rt patchset adopts an active push-pull strategy developed by Steven Rostedt and Gregory Haskins for balancing tasks across CPUs. The scheduler has to address several scenarios:

  1. Where to place a task optimally on wakeup (that is, pre-balance).

  2. What to do with a lower-priority task when it wakes up but is on a runqueue running a task of higher priority.

  3. What to do with a low-priority task when a higher-priority task on the same runqueue wakes up and preempts it.

  4. What to do when a task lowers its priority and thereby causes a previously lower-priority task to have the higher priority.


A push operation is initiated in cases 2 and 3 above. The push algorithm considers all the runqueues within its root domain (discussed later) to find the one that is of a lower priority than the task being pushed.

A pull operation is performed for case 4 above. Whenever a runqueue is about to schedule a task that is lower in priority than the previous one, it checks to see whether it can pull tasks of higher priority from other runqueues.

Real-time tasks are affected only by the push and pull operations. The CFS load-balancing algorithm does not handle real-time tasks at all, as it has been observed that the load balancing pulls real-time tasks away from runqueues to which they were correctly assigned, inducing unnecessary latencies.


 

Important -rt Patchset Scheduler Data Structures and Concepts


The main per-CPU runqueue data structure, struct rq, holds a structure struct rt_rq that encapsulates information about the real-time tasks placed on the per-CPU runqueue, as shown in Listing 1.


Listing 1. struct rt_rq

struct rt_rq {
    struct rt_prio_array  active;                 /* priority-indexed array of runnable RT tasks */
    ...
    unsigned long         rt_nr_running;          /* number of runnable RT tasks */
    unsigned long         rt_nr_migratory;        /* runnable RT tasks that may run elsewhere */
    unsigned long         rt_nr_uninterruptible;  /* RT tasks in the TASK_UNINTERRUPTIBLE state */
    int                   highest_prio;           /* priority of the highest-priority queued RT task */
    int                   overloaded;             /* more than one RT task queued, at least one migratable */
};

Real-time tasks have a priority in the range of 0–99. These tasks are organized on a runqueue in a priority-indexed array, active, of type struct rt_prio_array. An rt_prio_array consists of an array of subqueues, one per priority level. Each subqueue contains the runnable real-time tasks at the corresponding priority level. There is also a bitmask corresponding to the array that is used to efficiently determine the highest-priority task on the runqueue.
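A simplified, self-contained model of this array-plus-bitmask arrangement is sketched below (toy types and names of our own; the real kernel replaces the scan loop with a find-first-bit primitive). Bit p is set whenever subqueue p is non-empty, so the first set bit identifies the highest-priority runnable task, with 0 the highest real-time priority and 99 the lowest.

#define MAX_RT_PRIO   100
#define BITS_PER_WORD 32

struct toy_rt_prio_array {
    unsigned int bitmap[(MAX_RT_PRIO + BITS_PER_WORD - 1) / BITS_PER_WORD];
    /* one FIFO subqueue of runnable tasks per priority level would live here */
};

/* Return the highest priority with a queued task, or MAX_RT_PRIO if none. */
static int toy_highest_prio(const struct toy_rt_prio_array *array)
{
    for (int prio = 0; prio < MAX_RT_PRIO; prio++)
        if (array->bitmap[prio / BITS_PER_WORD] & (1u << (prio % BITS_PER_WORD)))
            return prio;        /* first set bit == highest-priority level */
    return MAX_RT_PRIO;
}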

For comparison, much newer kernels carry considerably more state in struct rt_rq; the definition below appears to be taken from kernel/sched/sched.h of a recent source tree.

/* Real-Time classes' related field in a runqueue: */
struct rt_rq {
	struct rt_prio_array	active;
	unsigned int		rt_nr_running;
	unsigned int		rr_nr_running;
#if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
	struct {
		int		curr; /* highest queued rt task prio */
#ifdef CONFIG_SMP
		int		next; /* next highest */
#endif
	} highest_prio;
#endif
#ifdef CONFIG_SMP
	unsigned long		rt_nr_migratory;
	unsigned long		rt_nr_total;
	int			overloaded;
	struct plist_head	pushable_tasks;

#endif /* CONFIG_SMP */
	int			rt_queued;

	int			rt_throttled;
	u64			rt_time;
	u64			rt_runtime;
	/* Nests inside the rq lock: */
	raw_spinlock_t		rt_runtime_lock;

#ifdef CONFIG_RT_GROUP_SCHED
	unsigned long		rt_nr_boosted;

	struct rq		*rq;
	struct task_group	*tg;
#endif
};

rt_nr_running and rt_nr_uninterruptible are counts of the number of runnable real-time tasks and the number of tasks in the TASK_UNINTERRUPTIBLE state, respectively.


rt_nr_migratory indicates the number of tasks on the runqueue that can be migrated to other runqueues. Some real-time tasks are bound to a specific CPU, such as the kernel thread softirq-timer. It is quite possible that a number of such affined threads wake up on a CPU at the same time. For example, the softirq-timer thread might cause the softirq-sched kernel thread to become active, resulting in two real-time tasks becoming runnable. This causes the runqueue to be overloaded with real-time tasks. When overloaded, the real-time scheduler normally will cause other CPUs to pull tasks. These tasks, however, cannot be pulled by another CPU because of their CPU affinity. The other CPUs cannot determine this without the overhead of locking several data structures. This can be avoided by maintaining a count of the number of tasks on the runqueue that can be migrated to other CPUs. When a task is added to a runqueue, the hamming weight of the task->cpus_allowed mask is looked at (cached in task->rt.nr_cpus_allowed). If the value is greater than one, the rt_nr_migratory field of the runqueue is incremented by one. The overloaded field is set when a runqueue contains more than one real-time task and at least one of them can be migrated to another runqueue. It is updated whenever a real-time task is enqueued on a runqueue.

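The bookkeeping just described can be condensed into the following sketch (our own simplified types and helper, not the kernel's enqueue path): count the migratable tasks and flag the runqueue as overloaded once it holds more than one real-time task of which at least one may move elsewhere.

struct toy_rt_rq {
    unsigned long rt_nr_running;     /* runnable RT tasks on this runqueue */
    unsigned long rt_nr_migratory;   /* how many of them may run elsewhere */
    int           overloaded;
};

/* nr_cpus_allowed is the hamming weight of the task's cpus_allowed mask,
 * which the kernel caches in task->rt.nr_cpus_allowed. */
static void toy_enqueue_rt_task(struct toy_rt_rq *rt_rq, int nr_cpus_allowed)
{
    rt_rq->rt_nr_running++;

    if (nr_cpus_allowed > 1)
        rt_rq->rt_nr_migratory++;

    /* Overloaded: more than one queued RT task and at least one can migrate. */
    if (rt_rq->rt_nr_running > 1 && rt_rq->rt_nr_migratory)
        rt_rq->overloaded = 1;
}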

The highest_prio field indicates the priority of the highest-priority task queued on the runqueue. This may or may not be the priority of the task currently executing on the runqueue (the highest-priority task could have just been enqueued on the runqueue and is pending a schedule). This variable is updated whenever a task is enqueued on a runqueue. The value of the highest_prio is used when scanning every runqueue to find the lowest-priority runqueue for pushing a task. If the highest_prio of the target runqueue is smaller than the task to be pushed, the task is pushed to that runqueue.


Figure 1 shows the values of the above data structures in an example scenario.


Figure 1. Example Runqueues

 

Root Domain


As mentioned before, because the real-time scheduler requires several global, or system-wide, resources for making scheduling decisions, scalability bottlenecks appear as the number of CPUs increase (due to increased contention for the locks protecting these resources). For instance, in order to find out if the system is overloaded with real-time tasks—that is, has more runnable real-time tasks than the number of CPUs—it needs to look at the state of all the runqueues. In earlier versions, a global rt_overload variable was used to track the status of all the runqueues on a system. This variable would then be used by the scheduler on every call to the schedule() routine, thus leading to huge contention.


Recently, several enhancements were made to the scheduler to reduce the contention for such variables to improve scalability. The concept of root domains was introduced by Gregory Haskins for this purpose. cpusets provide a mechanism to partition CPUs into a subset that is used by a process or a group of processes. Several cpusets could overlap. A cpuset is called exclusive if no other cpuset contains overlapping CPUs. Each exclusive cpuset defines an isolated domain (called a root domain) of CPUs partitioned from other cpusets or CPUs. Information pertaining to every root domain is stored in struct root_domain, as shown in Listing 2. These root domains are used to narrow the scope of the global variables to per-domain variables. Whenever an exclusive cpuset is created, a new root domain object is created with information from the member CPUs. By default, a single high-level root domain is created with all CPUs as members. With the rescoping of the rt_overload variable, the cache-line bouncing would affect only the members of a particular domain and not the entire system. All real-time scheduling decisions are made only within the scope of a root domain.


Listing 2. struct root_domain

struct root_domain {
    atomic_t   refcount;  /* reference count for the domain */
    cpumask_t  span;      /* span of member cpus of the domain*/
    cpumask_t  online;    /* number of online cpus in the domain*/
    cpumask_t  rto_mask;  /* mask of overloaded cpus in the domain*/
    atomic_t   rto_count; /* number of overloaded cpus */
   ....
};

 

CPU Priority Management


CPU Priority Management is an infrastructure also introduced by Gregory Haskins to make task migration decisions efficient. This code tracks the priority of every CPU in the system. Every CPU can be in any one of the following states: INVALID, IDLE, NORMAL, RT1, ... RT99.


CPUs in the INVALID state are not eligible for task routing. The system maintains this state with a two-dimensional bitmap: one dimension for the different priority levels and the second for the CPUs in that priority level (priority of a CPU is equivalent to the rq->rt.highest_prio). This is implemented using three arrays, as shown in Listing 3.


Listing 3. struct cpupri

struct cpupri {
    struct cpupri_vec  pri_to_cpu[CPUPRI_NR_PRIORITIES];
    long               pri_active[CPUPRI_NR_PRI_WORDS];
    int                cpu_to_pri[NR_CPUS];
};

The pri_active bitmap tracks those priority levels that contain one or more CPUs. For example, if there is a CPU at priority 49, pri_active[49+2]=1 (real-time task priorities are mapped to 2–102 internally in order to account for priorities INVALID and IDLE), finding the first set bit of this array would yield the lowest priority that any of the CPUs in a given cpuset is in.

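The index arithmetic in that example can be written out as a small sketch (our own helper, following the article's 49 → 51 convention; the kernel's actual cpupri code differs in detail): indices 0 and 1 are reserved for the INVALID and IDLE states, so an RT priority p occupies bit p + 2 of pri_active.

#define TOY_BITS_PER_LONG (8 * (int)sizeof(long))

/* Record in pri_active that some CPU's highest queued RT priority is rt_prio. */
static void toy_cpupri_mark(unsigned long *pri_active, int rt_prio)
{
    int idx = rt_prio + 2;      /* shift past the INVALID and IDLE levels */

    pri_active[idx / TOY_BITS_PER_LONG] |= 1UL << (idx % TOY_BITS_PER_LONG);
}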

The field cpu_to_pri indicates the priority of a CPU.

The field pri_to_cpu yields information about all the CPUs of a cpuset that are in a particular priority level. This is encapsulated in struct cpupri_vec, as shown in Listing 4.

Like rt_overload, cpupri also is scoped at the root domain level. Every exclusive cpuset that comprises a root domain consists of a cpupri data value.


Listing 4. struct cpupri_vec

struct cpupri_vec {
    raw_spinlock_t  lock;
    int             count;  /* number of cpus at a priority level */
    cpumask_t       mask;   /* mask of cpus at a priority level */
};

The CPU Priority Management infrastructure is used to find a CPU to which to push a task, as shown in Listing 5. It should be noted that no locks are taken when the search is performed.


Listing 5. Finding a CPU to Which to Push a Task

int cpupri_find(struct cpupri      *cp,
                struct task_struct *p,
                cpumask_t          *lowest_mask)
{
...
    /* Walk the occupied priority levels, lowest first. */
    for_each_cpupri_active(cp->pri_active, idx) {
        struct cpupri_vec *vec  = &cp->pri_to_cpu[idx];
        cpumask_t mask;

        if (idx >= task_pri)
            break;          /* no level lower than the task's priority exists */

        cpus_and(mask, p->cpus_allowed, vec->mask);

        if (cpus_empty(mask))
            continue;       /* task is not allowed on any CPU at this level */
        *lowest_mask = mask;
        return 1;           /* lowest-priority CPUs the task may use */
    }
    return 0;
}

If a priority level is non-empty and lower than the priority of the task being pushed, the lowest_mask is set to the mask corresponding to the priority level selected. This mask is then used by the push algorithm to compute the best CPU to which to push the task, based on affinity, topology and cache characteristics.


 

Details of the Push Scheduling Algorithm


As discussed before, in order to ensure SWSRPS, when a low-priority real-time task gets preempted by a higher one or when a task is woken up on a runqueue that already has a higher-priority task running on it, the scheduler needs to search for a suitable target runqueue for the task. This operation of searching a runqueue and transferring one of its tasks to another runqueue is called pushing a task.


The push_rt_task() algorithm looks at the highest-priority non-running runnable real-time task on the runqueue and considers all the runqueues to find a CPU where it can run. It searches for a runqueue that is of lower priority—that is, one where the currently running task can be preempted by the task that is being pushed. As explained previously, the CPU Priority Management infrastructure is used to find a mask of CPUs that have the lowest-priority runqueues. It is important to select only the best CPU from among all the candidates. The algorithm gives the highest priority to the CPU on which the task last executed, as it is likely to be cache-hot in that location. If that is not possible, the sched_domain map is considered to find a CPU that is logically closest to last_cpu. If this too fails, a CPU is selected at random from the mask.

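A minimal sketch of that preference order is given below (our own helper, not the kernel's find_lowest_rq()): stay on the CPU the task last ran on if it is in lowest_mask, since it is likely cache-hot, otherwise fall back to another CPU in the mask. The real scheduler additionally walks the sched_domain topology to prefer a logically close CPU before resorting to an arbitrary pick.

/* lowest_mask[cpu] is non-zero if cpu's runqueue is among the lowest-priority
 * candidates computed by cpupri_find(); last_cpu is where the task last ran. */
static int toy_pick_push_cpu(const unsigned char *lowest_mask, int nr_cpus, int last_cpu)
{
    if (last_cpu >= 0 && last_cpu < nr_cpus && lowest_mask[last_cpu])
        return last_cpu;            /* likely cache-hot: best choice */

    for (int cpu = 0; cpu < nr_cpus; cpu++)
        if (lowest_mask[cpu])
            return cpu;             /* simplified fallback: first candidate */

    return -1;                      /* no suitable CPU */
}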

The push operation is performed until a real-time task fails to be migrated or there are no more tasks to be pushed. Because the algorithm always selects the highest non-running task for pushing, the assumption is that, if it cannot migrate it, then most likely the lower real-time tasks cannot be migrated either and the search is aborted. No lock is taken when scanning for the lowest-priority runqueue. When the target runqueue is found, only the lock of that runqueue is taken, after which a check is made to verify whether it is still a candidate to which to push the task (as the target runqueue might have been modified by a parallel scheduling operation on another CPU). If not, the search is repeated for a maximum of three tries, after which it is aborted.

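The retry logic can be modelled with the small self-contained sketch below (toy types and names, not push_rt_task() itself), using the article's convention that a larger number means a higher priority: scan all runqueues without locks for the one whose highest queued priority is lowest and below the pushed task's priority; the real code then locks only that runqueue, re-verifies the condition, and repeats the whole search up to three times if a parallel scheduling operation on another CPU invalidated the choice.

#define TOY_NR_CPUS 4

struct toy_rq {
    int highest_prio;   /* priority of the highest queued RT task; -1 if none */
};

/* Lock-free scan: return the CPU whose runqueue is the best push target for a
 * task of priority task_prio, or -1 if every runqueue is at least as high. */
static int toy_find_push_target(const struct toy_rq rqs[], int task_prio)
{
    int best_cpu = -1;
    int best_prio = task_prio;

    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
        if (rqs[cpu].highest_prio < best_prio) {
            best_prio = rqs[cpu].highest_prio;
            best_cpu = cpu;
        }
    }
    /* The real scheduler now takes best_cpu's runqueue lock, checks again that
     * its highest_prio is still below task_prio (another CPU may have changed
     * it meanwhile), and retries the search at most three times before giving up. */
    return best_cpu;
}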

 

Details of the Pull Scheduling Algorithm


The pull_rt_task() algorithm looks at all the overloaded runqueues in a root domain and checks whether they have a real-time task that can run on the target runqueue (that is, the target CPU is in the task->cpus_allowed_mask) and is of a priority higher than the task the target runqueue is about to schedule. If so, the task is queued on the target runqueue. This search aborts only after scanning all the overloaded runqueues in the root domain. Thus, the pull operation may pull more than one task to the target runqueue. If the algorithm only selects a candidate task to be pulled in the first pass and then performs the actual pull in the second pass, there is a possibility that the selected highest-priority task is no longer a candidate (due to another parallel scheduling operation on another CPU). To avoid this race between finding the highest-priority runqueue and having that still be the highest-priority task on the runqueue when the actual pull is performed, the pull operation continues to pull tasks. In the worst case, this might lead to a number of tasks being pulled to the target runqueue, which would later get pushed off to other CPUs, leading to task bouncing. Task bouncing is known to be a rare occurrence.

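A self-contained toy model of the pull decision is sketched below (our own types, not pull_rt_task()), again with larger numbers meaning higher priority: every overloaded runqueue in the root domain is examined, and any task that may run here and beats the priority this CPU is about to schedule would be pulled; note that the scan deliberately does not stop after the first hit.

#define TOY_NR_CPUS 4

struct toy_overload_info {
    int overloaded;          /* source runqueue holds more than one RT task */
    int next_highest_prio;   /* priority of its highest non-running RT task */
    int can_run_here;        /* 1 if that task's cpus_allowed includes this CPU */
};

/* Count how many tasks this CPU would pull before scheduling a task of
 * priority about_to_run_prio; the real code migrates each such task here. */
static int toy_pull_tasks(const struct toy_overload_info src[], int about_to_run_prio)
{
    int pulled = 0;

    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
        if (!src[cpu].overloaded || !src[cpu].can_run_here)
            continue;
        if (src[cpu].next_highest_prio > about_to_run_prio)
            pulled++;        /* may exceed one: the pull does not stop early */
    }
    return pulled;
}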

 

Scheduling Example


Consider the scenario shown in Figure 2. Task T2 is being preempted by task T3 being woken on runqueue 0. Similarly, task T7 is voluntarily yielding CPU 3 to task T6 on runqueue 3. We first consider the scheduling action on CPU 0 followed by CPU 3. Also, assume all the CPUs are in the same root domain. The pri_active bitmap for this system of CPUs will look like Figure 3. The numbers in the brackets indicate the actual priority that is offset by two (as explained earlier).


Figure 2. Runqueues Showing Currently Running Tasks and the Next Tasks to Be Run Just before the Push Operation


Figure 3. Per-sched Domain cpupri.pri_active Array before the Push Operation


On CPU 0, the post-schedule algorithm would find the runqueue under real-time overload. It then would initiate a push operation. The first set bit of pri_active yields runqueue of CPU 1 as the lowest-priority runqueue suitable for task T2 to be pushed to (assuming all the tasks being considered are not affined to a subset of CPUs). Once T2 is pushed over, the push algorithm then would try to push T1, because after pushing T2, the runqueue still would be under RT overload. The pri_active after the first operation would be as shown in Figure 4. Because the lowest-priority runqueue has a priority greater than the task to be pushed (T1 of priority 85), the push aborts.


Figure 4. Per-sched Domain cpupri.pri_active Array after the Push Operation


Now, consider scheduling at CPU 3, where the current task of priority 92 is voluntarily giving up the CPU. The next task in the queue is T6. The pre-schedule routine would determine that the priority of the runqueue is being lowered, triggering the pull algorithm. Only runqueues 0 and 1 being under real-time overload would be considered by the pull routine. From runqueue 0, the next highest-priority task T1 is of priority greater than the task to be scheduled—T6, and because T1 < T3 and T6 < T3, T1 is pulled over to runqueue 3. The pull does not abort here, as runqueue 1 is still under overload, and there are chances of a higher-priority task being pulled over. The next highest task, T4 on runqueue 1, also can be pulled over, as its priority is higher than the highest priority on runqueue 3. The pull now aborts, as there are no more overloaded runqueues. The final status of all the runqueues is as shown in Figure 5, which is in accordance with scheduling requirements on real-time systems.


Figure 5. Runqueues after the Push and Pull Operations


Although strict priority scheduling has been achieved, runqueue 3 is in an overloaded state due to the pull operation. This scenario is very rare; however, the community is working on a solution.

A number of locking-related decisions have to be made by the scheduler. The state of the runqueues would vary from the above example, depending on when the scheduling operation is performed on the runqueues. The above example has been simplified for this explanation.


 

Summary


The most important goal of a real-time kernel scheduler is to ensure SWSRPS. The scheduler in the CONFIG_PREEMPT_RT kernel uses push and pull algorithms to balance and correctly distribute real-time tasks across the system. Both the push and pull operations try to ensure that a real-time task gets an opportunity to run as soon as possible. Also, in order to reduce the performance and scalability impact that might result from increased contention of global variables, the scheduler uses the concept of root domains and CPU priority management. The scope of the global variables is reduced to a subset of CPUs as opposed to the entire system, resulting in significant reduction of cache penalties and performance improvement.



 

Legal Statement


This work represents the views of the author and does not necessarily represent the view of IBM. Linux is a copyright of Linus Torvalds. Other company, product and service names may be trademarks or service marks of others.


 

Resources


Ankita Garg, a computer science graduate from the P.E.S. Institute of Technology, works as a developer at the Linux Technology Centre, IBM India. She currently is working on the Real-Time Linux Kernel Project. You are welcome to send your comments and suggestions to ankita@in.ibm.com.

 

Index of /pub/linux/kernel/projects/rt/


../
2.6.22/                                            08-Aug-2013 18:24       -
2.6.23/                                            08-Aug-2013 18:26       -
2.6.24/                                            08-Aug-2013 18:27       -
2.6.25/                                            08-Aug-2013 18:27       -
2.6.26/                                            08-Aug-2013 18:28       -
2.6.29/                                            08-Aug-2013 18:28       -
2.6.31/                                            04-Nov-2014 14:19       -
2.6.33/                                            08-Aug-2013 18:29       -
3.0/                                               19-Nov-2013 22:02       -
3.10/                                              23-Nov-2017 05:44       -
3.12/                                              08-Jun-2017 13:40       -
3.14/                                              13-Feb-2017 22:26       -
3.18/                                              23-May-2019 16:23       -
3.2/                                               23-Nov-2017 05:53       -
3.4/                                               16-Nov-2016 19:26       -
3.6/                                               19-Nov-2013 22:01       -
3.8/                                               04-Nov-2014 13:35       -
4.0/                                               13-Jul-2015 21:06       -
4.1/                                               29-Nov-2017 22:12       -
4.11/                                              17-Oct-2017 13:42       -
4.13/                                              17-Nov-2017 17:03       -
4.14/                                              22-Jan-2021 19:35       -
4.16/                                              03-Aug-2018 07:39       -
4.18/                                              29-Oct-2018 11:51       -
4.19/                                              08-Jan-2021 18:17       -
4.4/                                               05-Feb-2021 18:01       -
4.6/                                               30-Sep-2016 21:37       -
4.8/                                               23-Dec-2016 15:26       -
4.9/                                               04-Feb-2021 01:30       -
5.0/                                               10-Jul-2019 15:17       -
5.10/                                              03-Feb-2021 17:55       -
5.11/                                              29-Jan-2021 18:52       -
5.2/                                               16-Dec-2019 17:11       -
5.4/                                               02-Feb-2021 16:39       -
5.6/                                               20-Aug-2020 14:23       -
5.9/                                               28-Oct-2020 20:05       -

 

What makes a kernel/OS real-time?

https://stackoverflow.com/questions/22241264/what-makes-a-kernel-os-real-time


I have read this article, but my question is at a more general level; roughly, I am wondering:

  1. Can a kernel be called real-time just because it has a real-time scheduler? In other words, if I have a Linux kernel and change the default scheduler from O(1) or CFS to a real-time scheduler, does it become an RTOS?
  2. Is any support from the hardware required? I have generally seen embedded devices running an RTOS (such as VxWorks or QNX); do these devices have any special provisions/hardware to support it? I know that RTOS process timing is deterministic, but then one could also use longjmp/setjmp to obtain output within a deterministic time.

I would really appreciate your input/insight, and please correct me if I have something wrong.

After some research and after talking to people (Jamie Hanrahan, and Juha Aaltonen of the LinkedIn device-driver experts group), and of course with @Jim Garrison's input, I can draw the following conclusions:

In Jamie Hanrahan's words:

What makes a kernel real-time?
The essential requirements of a real-time OS:

  • The ability to guarantee a maximum latency between an external interrupt and the start of the interrupt handler.

    Note that the maximum latency need not be particularly short (e.g., microseconds); you can have a real-time OS that guarantees an absolute maximum latency of 137 milliseconds.

  • A real-time scheduler, one whose behavior with respect to "which thread runs next" is completely predictable to the developer.

    This is usually a separate issue from guaranteeing a maximum response latency (since interrupt handlers are not necessarily scheduled like ordinary threads), but it is generally necessary for implementing real-time applications. Schedulers in real-time OSes typically implement a large number of priority levels, and they almost always implement priority inheritance to avoid priority-inversion situations.

So if guaranteed interrupt latency and predictable thread scheduling are good things, why isn't every OS real-time?

  • Because an OS suited to general-purpose use (servers and/or desktops) needs characteristics that are usually at odds with real-time latency guarantees.

    For example, a real-time scheduler should have completely predictable behavior. Among other things, this means that whatever priorities the developer has assigned to the various tasks should be left alone by the OS. It may mean that some low-priority tasks end up starved for long periods, but the RT OS has to shrug and say "that's what the developer wanted". Note that to get the behavior right, the RT system developer has to worry about many things, such as task priorities and CPU affinities.

    A general-purpose OS is just the opposite. You want to be able to throw applications and services at it (almost always written by many different vendors, rather than forming one tightly integrated system as on most RT systems) and get good performance. Perhaps not the absolute best possible performance, but good.

    Note that "good performance" is not measured only by interrupt latency. In particular, you want CPU and other resource allocation that is usually described as "fair", without users, administrators or even application developers having to worry much about things like thread priorities, CPU affinity and NUMA nodes. One job may be more important than another, but on a general-purpose OS that does not mean the second job gets no resources at all.

    So a general-purpose OS will usually time-slice among threads of equal priority and may adjust thread priorities based on their past behavior: a CPU hog may have its priority lowered; an I/O-bound thread may have its priority raised so it can keep its I/O devices busy; a CPU-starved thread may get a priority boost so it receives a little CPU time now and then.

Can a kernel be called real-time just because it has a real-time scheduler?

  • No. An RT scheduler is a necessary component of an RT OS, but you also need predictable behavior in the rest of the OS.

Is any support from the hardware required?

  • In general, the simpler the hardware, the more predictable its behavior. So PCI-E is less predictable than PCI, and PCI less predictable than ISA, and so on. There are specialized I/O buses that were designed for, among other things, easily predictable interrupt latency, but many RT requirements can be met with commodity hardware these days.

A concrete characterization of real-time is that a process is guaranteed a maximum response time. This alone is often not enough for an application, and it matters less than determinism, which is especially hard to achieve with a modern, feature-rich OS. Consider:

If I want to command some hardware or a machine at precise points in time, I need to be able to generate the command signals at exactly those moments, usually with better-than-millisecond precision. Typically, if you compile, say, C code that runs a loop waiting for "half a millisecond" and then does something, the wait is not exactly half a millisecond but somewhat longer, because of the way a general-purpose OS handles it: the process is set aside at least until the requested time has passed, after which the scheduler picks it up again at some later point.

The serious problem is not that the delay is not exactly the half millisecond that was requested, but that it is impossible to know in advance how much longer it will be. The error is neither constant nor deterministic.
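A small measurement sketch (ours, not part of the original answer) makes the point: ask for a 500 µs relative sleep with clock_nanosleep() and print how much later the process actually woke up. On a stock kernel the overshoot varies from run to run; under PREEMPT_RT, with the process given a real-time priority, it is far more tightly bounded.

#include <stdio.h>
#include <time.h>

static long long ns_between(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    struct timespec start, end;
    struct timespec req = { .tv_sec = 0, .tv_nsec = 500 * 1000 };  /* 500 us */

    for (int i = 0; i < 5; i++) {
        clock_gettime(CLOCK_MONOTONIC, &start);
        clock_nanosleep(CLOCK_MONOTONIC, 0, &req, NULL);   /* relative sleep */
        clock_gettime(CLOCK_MONOTONIC, &end);
        printf("asked for 500000 ns, slept %lld ns\n", ns_between(start, end));
    }
    return 0;
}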

These unpredictable delays have surprising consequences for physical automation. For example, it is impossible to control a stepper motor accurately from any typical OS without dedicated hardware that is told, through a kernel interface, how long you actually want to wait. That is why a single AVR module can accurately control several motors, while under any typical OS a Raspberry Pi (which absolutely crushes the AVR in clock speed) cannot manage more than two.

 

 


Real-Time Linux Wiki

https://rt.wiki.kernel.org/index.php/Main_Page


Documentation

Wiki News

2012-10-20: New page: Reporting Bugs

2012-10-20: New Page: Rteval

2010-09-15: New page: Systems based on Real time preempt Linux

2009-03-19: New page: I/Otop utility

2008-10-21: New page: Schedtop utility

2008-09-13: New page: IO CPU Affinity

2008-05-23: New page: Ftrace

2008-05-07: New page: Cpuset Management Utility

2007-10-23: New page: Preemption test

2007-10-18: New page: Fixed Time Quanta Benchmark (FTQ)

Community News

2016-06-23 Real-Time Summit 2016 CFP

2016-05-13: rt-tests version 1.0

2015-10-05: The Linux Foundation Announces Project to Advance Real-Time Linux

2015-09-28: ANNOUNCE 4.1.7-rt8 gmane

2015-09-28: ANNOUNCE rt-tests-0.94

2015-06-18: RTLWS 17 (Austria October 21-22) Call for Papers (Deadline Aug 2, 2015)

2015-05-20: Announce 4.0.4-rt1

2015-01-27: Stage win for real-time Linux (in German)

2013-11-06: The future of realtime Linux, by Jake Edge

2013-07-24: RTLWS15 15th Real Time Linux Workshop, October 28 to October 31 at the Dipartimento Tecnologie Innovative, Scuola Universitaria Professionale della Svizzera Italiana in Lugano-Manno, Switzerland - Call for papers (ASCII) - Abstract Submission

2012-05-15: RTLWS14 14th Real Time Linux Workshop, October 18 to 20, 2012 at the Department of Computer Science, University of North Carolina at Chapel Hill - Call for papers (ASCII) - Abstract Submission

2011-03-13: RTLWS13 13th Real-Time Linux Workshop on October 20 to 22, 2011 in Prague, Czech Republic - Call for Papers in ASCII - Registration - Abstract Submission

2010-07-05: RTECC (July 29, 2010), Real-Time & Embedded Computing Conference in Portland, OR. Darren Hart, rt wiki admin, to give luncheon keynote: Linux, Real-Time, and Fragmentation.

2010-04-08: RTLWS12 Twelfth Real-Time Linux Workshop on October 25 to 27, 2010 in Nairobi, Kenya (LWN Article)

2009-06-17: Cpuset Version 1.5.1 released. Download packages from the build service for many popular distros here.

2009-06-15: RTLWS11 Eleventh Real-Time Linux Workshop on September 28 to 30, 2009 in Dresden, Germany

2009-03-02: OSPERT 2009 (Jul 2-4), Fifth International Workshop on Operating Systems Platforms for Embedded Real-Time Applications

2008-05-29: RTLWS10 Tenth Real-Time Linux Workshop at the University of Guadalajara, Mexico.

2008-04-27: multi-reader rwlock + adaptive locking = near mainline performance release !!

2008-04-27: Ubuntu Ships Real-Time Hardy

2008-01-23: 9th Real-Time Linux Workshop videos available from Free Electrons

2008-01-02: Real-time Linux Tests have been integrated into the LTP

2007-11-25: LinuxDevices.com about Ninth Real-Time Linux Workshop, with papers

2007-10-23: Ubuntu Ships Real-Time in Gutsy

Industry News

2008-10-22: Presentations from the Linux Foundation End User Summit 2008

2007-12-05: Red Hat Announces Messaging, Realtime, Grid product (MRG)

2007-11-27: Novell Ships Suse Linux Real Time Enterprise

Tips and Techniques

Utilities

Benchmarks and Test Cases

Further Information

 
