Gentoo Archives: gentoo-commits

From: Mike Pagano <mpagano@g.o>
To: gentoo-commits@l.g.o
Subject: [gentoo-commits] proj/linux-patches:5.15 commit in: /
Date: Thu, 04 Nov 2021 12:22:57
Message-Id: 1636028504.412bab2012d1b669b481463fd275bbb8bb6933fb.mpagano@gentoo
1 commit: 412bab2012d1b669b481463fd275bbb8bb6933fb
2 Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
3 AuthorDate: Thu Nov 4 12:21:44 2021 +0000
4 Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
5 CommitDate: Thu Nov 4 12:21:44 2021 +0000
6 URL: https://gitweb.gentoo.org/proj/linux-patches.git/commit/?id=412bab20
7
8 Add patch for the BMQ (BitMap Queue) scheduler.
9
10 A new CPU scheduler developed from PDS (included).
11 Inspired by the scheduler in Zircon.
12
13 Signed-off-by: Mike Pagano <mpagano <AT> gentoo.org>
14
15 0000_README | 8 +
16 5020_BMQ-and-PDS-io-scheduler-v5.15-r0.patch | 9787 ++++++++++++++++++++++++++
17 5021_BMQ-and-PDS-gentoo-defaults.patch | 13 +
18 3 files changed, 9808 insertions(+)
19
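As a rough sketch of how the added options fit together (assuming a 5.15 gentoo-sources tree with 5020/5021 applied), the scheduler is gated behind the Kconfig symbols introduced below, the time slice is chosen with the sched_timeslice= boot parameter, and yield behaviour with the kernel.yield_type sysctl; all values below are taken from the documentation added in this commit:

    # Sketch only -- assumes this patch set is applied to a 5.15 kernel.
    # Kernel configuration (symbols added by this patch):
    CONFIG_SCHED_ALT=y
    CONFIG_SCHED_BMQ=y
    # Boot parameter, time slice in ms (2 or 4; default 4):
    sched_timeslice=4
    # Runtime yield behaviour (0 = no yield, 1 = deboost and requeue [default], 2 = rq skip):
    sysctl -w kernel.yield_type=1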
20 diff --git a/0000_README b/0000_README
21 index efde5c7..9bc9951 100644
22 --- a/0000_README
23 +++ b/0000_README
24 @@ -71,6 +71,14 @@ Patch: 4567_distro-Gentoo-Kconfig.patch
25 From: Tom Wijsman <TomWij@g.o>
26 Desc: Add Gentoo Linux support config settings and defaults.
27
28 +Patch: 5020_BMQ-and-PDS-io-scheduler-v5.15-r0.patch
29 +From: https://gitlab.com/alfredchen/linux-prjc
30 +Desc: BMQ(BitMap Queue) Scheduler. A new CPU scheduler developed from PDS(incld). Inspired by the scheduler in zircon.
31 +
32 +Patch: 5021_BMQ-and-PDS-gentoo-defaults.patch
33 +From: https://gitweb.gentoo.org/proj/linux-patches.git/
34 +Desc: Set defaults for BMQ. Add archs as people test, default to N
35 +
36 Patch: 5010_enable-cpu-optimizations-universal.patch
37 From: https://github.com/graysky2/kernel_compiler_patch
38 Desc: Kernel >= 5.15 patch enables gcc = v11.1+ optimizations for additional CPUs.
39
40 diff --git a/5020_BMQ-and-PDS-io-scheduler-v5.15-r0.patch b/5020_BMQ-and-PDS-io-scheduler-v5.15-r0.patch
41 new file mode 100644
42 index 0000000..1d0c322
43 --- /dev/null
44 +++ b/5020_BMQ-and-PDS-io-scheduler-v5.15-r0.patch
45 @@ -0,0 +1,9787 @@
46 +diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
47 +index 43dc35fe5bc0..0873e92ca5d0 100644
48 +--- a/Documentation/admin-guide/kernel-parameters.txt
49 ++++ b/Documentation/admin-guide/kernel-parameters.txt
50 +@@ -4985,6 +4985,12 @@
51 + sa1100ir [NET]
52 + See drivers/net/irda/sa1100_ir.c.
53 +
54 ++ sched_timeslice=
55 ++ [KNL] Time slice in ms for Project C BMQ/PDS scheduler.
56 ++ Format: integer 2, 4
57 ++ Default: 4
58 ++ See Documentation/scheduler/sched-BMQ.txt
59 ++
60 + sched_verbose [KNL] Enables verbose scheduler debug messages.
61 +
62 + schedstats= [KNL,X86] Enable or disable scheduled statistics.
63 +diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
64 +index 426162009ce9..15ac2d7e47cd 100644
65 +--- a/Documentation/admin-guide/sysctl/kernel.rst
66 ++++ b/Documentation/admin-guide/sysctl/kernel.rst
67 +@@ -1542,3 +1542,13 @@ is 10 seconds.
68 +
69 + The softlockup threshold is (``2 * watchdog_thresh``). Setting this
70 + tunable to zero will disable lockup detection altogether.
71 ++
72 ++yield_type:
73 ++===========
74 ++
75 ++BMQ/PDS CPU scheduler only. This determines what type of yield a call
76 ++to sched_yield() will perform.
77 ++
78 ++ 0 - No yield.
79 ++ 1 - Deboost and requeue task. (default)
80 ++ 2 - Set run queue skip task.
81 +diff --git a/Documentation/scheduler/sched-BMQ.txt b/Documentation/scheduler/sched-BMQ.txt
82 +new file mode 100644
83 +index 000000000000..05c84eec0f31
84 +--- /dev/null
85 ++++ b/Documentation/scheduler/sched-BMQ.txt
86 +@@ -0,0 +1,110 @@
87 ++ BitMap queue CPU Scheduler
88 ++ --------------------------
89 ++
90 ++CONTENT
91 ++========
92 ++
93 ++ Background
94 ++ Design
95 ++ Overview
96 ++ Task policy
97 ++ Priority management
98 ++ BitMap Queue
99 ++ CPU Assignment and Migration
100 ++
101 ++
102 ++Background
103 ++==========
104 ++
105 ++BitMap Queue CPU scheduler, referred to as BMQ from here on, is an evolution
106 ++of previous Priority and Deadline based Skiplist multiple queue scheduler(PDS),
107 ++and inspired by Zircon scheduler. The goal of it is to keep the scheduler code
108 ++simple, while efficiency and scalable for interactive tasks, such as desktop,
109 ++movie playback and gaming etc.
110 ++
111 ++Design
112 ++======
113 ++
114 ++Overview
115 ++--------
116 ++
117 ++BMQ uses a per-CPU run queue design: each (logical) CPU has its own run
118 ++queue and is responsible for scheduling the tasks that are put into its
119 ++run queue.
120 ++
121 ++The run queue is a set of priority queues. Note that these queues are FIFO
122 ++queues for non-rt tasks and priority queues for rt tasks; see BitMap Queue
123 ++below for details. BMQ is optimized for non-rt tasks, since most
124 ++applications are non-rt tasks. Whether a queue is FIFO or priority based,
125 ++each queue is an ordered list of runnable tasks awaiting execution and the
126 ++data structures are the same. When it is time for a new task to run, the
127 ++scheduler simply looks for the lowest numbered queue that contains a task
128 ++and runs the first task from the head of that queue. The per-CPU idle task
129 ++is also in the run queue, so the scheduler can always find a task to run
130 ++from its run queue.
131 ++
132 ++Each task is assigned the same timeslice (default 4ms) when it is picked to
133 ++start running. A task is reinserted at the end of the appropriate priority
134 ++queue when it uses up its whole timeslice. When the scheduler selects a new
135 ++task from the priority queue it sets the CPU's preemption timer for the
136 ++remainder of the previous timeslice. When that timer fires the scheduler
137 ++stops execution of that task, selects another task and starts over again.
138 ++
139 ++If a task blocks waiting for a shared resource then it's taken out of its
140 ++priority queue and is placed in a wait queue for the shared resource. When it
141 ++is unblocked it will be reinserted in the appropriate priority queue of an
142 ++eligible CPU.
143 ++
144 ++Task policy
145 ++-----------
146 ++
147 ++BMQ supports the DEADLINE, FIFO, RR, NORMAL, BATCH and IDLE task policies,
148 ++like the mainline CFS scheduler, but BMQ is heavily optimized for non-rt
149 ++tasks, that is, NORMAL/BATCH/IDLE policy tasks. Below are the implementation
150 ++details of each policy.
151 ++
152 ++DEADLINE
153 ++ It is squashed into a priority 0 FIFO task.
154 ++
155 ++FIFO/RR
156 ++ All RT tasks share one single priority queue in the BMQ run queue design.
157 ++The complexity of the insert operation is O(n). BMQ is not designed for
158 ++systems that mainly run rt policy tasks.
159 ++
160 ++NORMAL/BATCH/IDLE
161 ++ BATCH and IDLE tasks are treated as the same policy. They compete for CPU
162 ++with NORMAL policy tasks, but they just don't get boosted. To control the
163 ++priority of NORMAL/BATCH/IDLE tasks, simply use the nice level.
164 ++
165 ++ISO
166 ++ ISO policy is not supported in BMQ. Please use nice level -20 NORMAL policy
167 ++task instead.
168 ++
169 ++Priority management
170 ++-------------------
171 ++
172 ++RT tasks have priorities from 0-99. For non-rt tasks, there are three
173 ++different factors used to determine the effective priority of a task; the
174 ++effective priority is what determines which queue the task will be in.
175 ++
176 ++The first factor is simply the task's static priority, which is assigned
177 ++from the task's nice level, within [-20, 19] from userland's point of view
178 ++and [0, 39] internally.
179 ++
180 ++The second factor is the priority boost. This is a value bounded between
181 ++[-MAX_PRIORITY_ADJ, MAX_PRIORITY_ADJ] used to offset the base priority; it is
182 ++modified in the following cases:
183 ++
184 ++*When a thread has used up its entire timeslice, always deboost its boost by
185 ++increasing the value by one.
186 ++*When a thread gives up cpu control (voluntarily or involuntarily) to
187 ++reschedule, and its switch-in time (the time since it last switched in and
188 ++ran) is below the threshold based on its priority boost, boost its boost by
189 ++decreasing the value by one, but it is capped at 0 (won't go negative).
190 ++
191 ++The intent in this system is to ensure that interactive threads are serviced
192 ++quickly. These are usually the threads that interact directly with the user
193 ++and cause user-perceivable latency. These threads usually do little work and
194 ++spend most of their time blocked awaiting another user event. So they get the
195 ++priority boost from unblocking while background threads that do most of the
196 ++processing receive the priority penalty for using their entire timeslice.
197 +diff --git a/fs/proc/base.c b/fs/proc/base.c
198 +index 533d5836eb9a..5756c51c9b58 100644
199 +--- a/fs/proc/base.c
200 ++++ b/fs/proc/base.c
201 +@@ -477,7 +477,7 @@ static int proc_pid_schedstat(struct seq_file *m, struct pid_namespace *ns,
202 + seq_puts(m, "0 0 0\n");
203 + else
204 + seq_printf(m, "%llu %llu %lu\n",
205 +- (unsigned long long)task->se.sum_exec_runtime,
206 ++ (unsigned long long)tsk_seruntime(task),
207 + (unsigned long long)task->sched_info.run_delay,
208 + task->sched_info.pcount);
209 +
210 +diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
211 +index 8874f681b056..59eb72bf7d5f 100644
212 +--- a/include/asm-generic/resource.h
213 ++++ b/include/asm-generic/resource.h
214 +@@ -23,7 +23,7 @@
215 + [RLIMIT_LOCKS] = { RLIM_INFINITY, RLIM_INFINITY }, \
216 + [RLIMIT_SIGPENDING] = { 0, 0 }, \
217 + [RLIMIT_MSGQUEUE] = { MQ_BYTES_MAX, MQ_BYTES_MAX }, \
218 +- [RLIMIT_NICE] = { 0, 0 }, \
219 ++ [RLIMIT_NICE] = { 30, 30 }, \
220 + [RLIMIT_RTPRIO] = { 0, 0 }, \
221 + [RLIMIT_RTTIME] = { RLIM_INFINITY, RLIM_INFINITY }, \
222 + }
223 +diff --git a/include/linux/sched.h b/include/linux/sched.h
224 +index c1a927ddec64..a7eb91d15442 100644
225 +--- a/include/linux/sched.h
226 ++++ b/include/linux/sched.h
227 +@@ -748,12 +748,18 @@ struct task_struct {
228 + unsigned int ptrace;
229 +
230 + #ifdef CONFIG_SMP
231 +- int on_cpu;
232 + struct __call_single_node wake_entry;
233 ++#endif
234 ++#if defined(CONFIG_SMP) || defined(CONFIG_SCHED_ALT)
235 ++ int on_cpu;
236 ++#endif
237 ++
238 ++#ifdef CONFIG_SMP
239 + #ifdef CONFIG_THREAD_INFO_IN_TASK
240 + /* Current CPU: */
241 + unsigned int cpu;
242 + #endif
243 ++#ifndef CONFIG_SCHED_ALT
244 + unsigned int wakee_flips;
245 + unsigned long wakee_flip_decay_ts;
246 + struct task_struct *last_wakee;
247 +@@ -767,6 +773,7 @@ struct task_struct {
248 + */
249 + int recent_used_cpu;
250 + int wake_cpu;
251 ++#endif /* !CONFIG_SCHED_ALT */
252 + #endif
253 + int on_rq;
254 +
255 +@@ -775,6 +782,20 @@ struct task_struct {
256 + int normal_prio;
257 + unsigned int rt_priority;
258 +
259 ++#ifdef CONFIG_SCHED_ALT
260 ++ u64 last_ran;
261 ++ s64 time_slice;
262 ++ int sq_idx;
263 ++ struct list_head sq_node;
264 ++#ifdef CONFIG_SCHED_BMQ
265 ++ int boost_prio;
266 ++#endif /* CONFIG_SCHED_BMQ */
267 ++#ifdef CONFIG_SCHED_PDS
268 ++ u64 deadline;
269 ++#endif /* CONFIG_SCHED_PDS */
270 ++ /* sched_clock time spent running */
271 ++ u64 sched_time;
272 ++#else /* !CONFIG_SCHED_ALT */
273 + const struct sched_class *sched_class;
274 + struct sched_entity se;
275 + struct sched_rt_entity rt;
276 +@@ -785,6 +806,7 @@ struct task_struct {
277 + unsigned long core_cookie;
278 + unsigned int core_occupation;
279 + #endif
280 ++#endif /* !CONFIG_SCHED_ALT */
281 +
282 + #ifdef CONFIG_CGROUP_SCHED
283 + struct task_group *sched_task_group;
284 +@@ -1505,6 +1527,15 @@ struct task_struct {
285 + */
286 + };
287 +
288 ++#ifdef CONFIG_SCHED_ALT
289 ++#define tsk_seruntime(t) ((t)->sched_time)
290 ++/* replace the uncertian rt_timeout with 0UL */
291 ++#define tsk_rttimeout(t) (0UL)
292 ++#else /* CFS */
293 ++#define tsk_seruntime(t) ((t)->se.sum_exec_runtime)
294 ++#define tsk_rttimeout(t) ((t)->rt.timeout)
295 ++#endif /* !CONFIG_SCHED_ALT */
296 ++
297 + static inline struct pid *task_pid(struct task_struct *task)
298 + {
299 + return task->thread_pid;
300 +diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
301 +index 1aff00b65f3c..216fdf2fe90c 100644
302 +--- a/include/linux/sched/deadline.h
303 ++++ b/include/linux/sched/deadline.h
304 +@@ -1,5 +1,24 @@
305 + /* SPDX-License-Identifier: GPL-2.0 */
306 +
307 ++#ifdef CONFIG_SCHED_ALT
308 ++
309 ++static inline int dl_task(struct task_struct *p)
310 ++{
311 ++ return 0;
312 ++}
313 ++
314 ++#ifdef CONFIG_SCHED_BMQ
315 ++#define __tsk_deadline(p) (0UL)
316 ++#endif
317 ++
318 ++#ifdef CONFIG_SCHED_PDS
319 ++#define __tsk_deadline(p) ((((u64) ((p)->prio))<<56) | (p)->deadline)
320 ++#endif
321 ++
322 ++#else
323 ++
324 ++#define __tsk_deadline(p) ((p)->dl.deadline)
325 ++
326 + /*
327 + * SCHED_DEADLINE tasks has negative priorities, reflecting
328 + * the fact that any of them has higher prio than RT and
329 +@@ -19,6 +38,7 @@ static inline int dl_task(struct task_struct *p)
330 + {
331 + return dl_prio(p->prio);
332 + }
333 ++#endif /* CONFIG_SCHED_ALT */
334 +
335 + static inline bool dl_time_before(u64 a, u64 b)
336 + {
337 +diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h
338 +index ab83d85e1183..6af9ae681116 100644
339 +--- a/include/linux/sched/prio.h
340 ++++ b/include/linux/sched/prio.h
341 +@@ -18,6 +18,32 @@
342 + #define MAX_PRIO (MAX_RT_PRIO + NICE_WIDTH)
343 + #define DEFAULT_PRIO (MAX_RT_PRIO + NICE_WIDTH / 2)
344 +
345 ++#ifdef CONFIG_SCHED_ALT
346 ++
347 ++/* Undefine MAX_PRIO and DEFAULT_PRIO */
348 ++#undef MAX_PRIO
349 ++#undef DEFAULT_PRIO
350 ++
351 ++/* +/- priority levels from the base priority */
352 ++#ifdef CONFIG_SCHED_BMQ
353 ++#define MAX_PRIORITY_ADJ (7)
354 ++
355 ++#define MIN_NORMAL_PRIO (MAX_RT_PRIO)
356 ++#define MAX_PRIO (MIN_NORMAL_PRIO + NICE_WIDTH)
357 ++#define DEFAULT_PRIO (MIN_NORMAL_PRIO + NICE_WIDTH / 2)
358 ++#endif
359 ++
360 ++#ifdef CONFIG_SCHED_PDS
361 ++#define MAX_PRIORITY_ADJ (0)
362 ++
363 ++#define MIN_NORMAL_PRIO (128)
364 ++#define NORMAL_PRIO_NUM (64)
365 ++#define MAX_PRIO (MIN_NORMAL_PRIO + NORMAL_PRIO_NUM)
366 ++#define DEFAULT_PRIO (MAX_PRIO - NICE_WIDTH / 2)
367 ++#endif
368 ++
369 ++#endif /* CONFIG_SCHED_ALT */
370 ++
371 + /*
372 + * Convert user-nice values [ -20 ... 0 ... 19 ]
373 + * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
374 +diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
375 +index e5af028c08b4..0a7565d0d3cf 100644
376 +--- a/include/linux/sched/rt.h
377 ++++ b/include/linux/sched/rt.h
378 +@@ -24,8 +24,10 @@ static inline bool task_is_realtime(struct task_struct *tsk)
379 +
380 + if (policy == SCHED_FIFO || policy == SCHED_RR)
381 + return true;
382 ++#ifndef CONFIG_SCHED_ALT
383 + if (policy == SCHED_DEADLINE)
384 + return true;
385 ++#endif
386 + return false;
387 + }
388 +
389 +diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
390 +index 8f0f778b7c91..991f2280475b 100644
391 +--- a/include/linux/sched/topology.h
392 ++++ b/include/linux/sched/topology.h
393 +@@ -225,7 +225,8 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
394 +
395 + #endif /* !CONFIG_SMP */
396 +
397 +-#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
398 ++#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) && \
399 ++ !defined(CONFIG_SCHED_ALT)
400 + extern void rebuild_sched_domains_energy(void);
401 + #else
402 + static inline void rebuild_sched_domains_energy(void)
403 +diff --git a/init/Kconfig b/init/Kconfig
404 +index 11f8a845f259..c8e82fcafb9e 100644
405 +--- a/init/Kconfig
406 ++++ b/init/Kconfig
407 +@@ -814,9 +814,39 @@ config GENERIC_SCHED_CLOCK
408 +
409 + menu "Scheduler features"
410 +
411 ++menuconfig SCHED_ALT
412 ++ bool "Alternative CPU Schedulers"
413 ++ default y
414 ++ help
415 ++ This feature enables the alternative CPU schedulers.
416 ++
417 ++if SCHED_ALT
418 ++
419 ++choice
420 ++ prompt "Alternative CPU Scheduler"
421 ++ default SCHED_BMQ
422 ++
423 ++config SCHED_BMQ
424 ++ bool "BMQ CPU scheduler"
425 ++ help
426 ++ The BitMap Queue CPU scheduler for excellent interactivity and
427 ++ responsiveness on the desktop and solid scalability on normal
428 ++ hardware and commodity servers.
429 ++
430 ++config SCHED_PDS
431 ++ bool "PDS CPU scheduler"
432 ++ help
433 ++ The Priority and Deadline based Skip list multiple queue CPU
434 ++ Scheduler.
435 ++
436 ++endchoice
437 ++
438 ++endif
439 ++
440 + config UCLAMP_TASK
441 + bool "Enable utilization clamping for RT/FAIR tasks"
442 + depends on CPU_FREQ_GOV_SCHEDUTIL
443 ++ depends on !SCHED_ALT
444 + help
445 + This feature enables the scheduler to track the clamped utilization
446 + of each CPU based on RUNNABLE tasks scheduled on that CPU.
447 +@@ -902,6 +932,7 @@ config NUMA_BALANCING
448 + depends on ARCH_SUPPORTS_NUMA_BALANCING
449 + depends on !ARCH_WANT_NUMA_VARIABLE_LOCALITY
450 + depends on SMP && NUMA && MIGRATION
451 ++ depends on !SCHED_ALT
452 + help
453 + This option adds support for automatic NUMA aware memory/task placement.
454 + The mechanism is quite primitive and is based on migrating memory when
455 +@@ -994,6 +1025,7 @@ config FAIR_GROUP_SCHED
456 + depends on CGROUP_SCHED
457 + default CGROUP_SCHED
458 +
459 ++if !SCHED_ALT
460 + config CFS_BANDWIDTH
461 + bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
462 + depends on FAIR_GROUP_SCHED
463 +@@ -1016,6 +1048,7 @@ config RT_GROUP_SCHED
464 + realtime bandwidth for them.
465 + See Documentation/scheduler/sched-rt-group.rst for more information.
466 +
467 ++endif #!SCHED_ALT
468 + endif #CGROUP_SCHED
469 +
470 + config UCLAMP_TASK_GROUP
471 +@@ -1259,6 +1292,7 @@ config CHECKPOINT_RESTORE
472 +
473 + config SCHED_AUTOGROUP
474 + bool "Automatic process group scheduling"
475 ++ depends on !SCHED_ALT
476 + select CGROUPS
477 + select CGROUP_SCHED
478 + select FAIR_GROUP_SCHED
479 +diff --git a/init/init_task.c b/init/init_task.c
480 +index 2d024066e27b..49f706df0904 100644
481 +--- a/init/init_task.c
482 ++++ b/init/init_task.c
483 +@@ -75,9 +75,15 @@ struct task_struct init_task
484 + .stack = init_stack,
485 + .usage = REFCOUNT_INIT(2),
486 + .flags = PF_KTHREAD,
487 ++#ifdef CONFIG_SCHED_ALT
488 ++ .prio = DEFAULT_PRIO + MAX_PRIORITY_ADJ,
489 ++ .static_prio = DEFAULT_PRIO,
490 ++ .normal_prio = DEFAULT_PRIO + MAX_PRIORITY_ADJ,
491 ++#else
492 + .prio = MAX_PRIO - 20,
493 + .static_prio = MAX_PRIO - 20,
494 + .normal_prio = MAX_PRIO - 20,
495 ++#endif
496 + .policy = SCHED_NORMAL,
497 + .cpus_ptr = &init_task.cpus_mask,
498 + .user_cpus_ptr = NULL,
499 +@@ -88,6 +94,17 @@ struct task_struct init_task
500 + .restart_block = {
501 + .fn = do_no_restart_syscall,
502 + },
503 ++#ifdef CONFIG_SCHED_ALT
504 ++ .sq_node = LIST_HEAD_INIT(init_task.sq_node),
505 ++#ifdef CONFIG_SCHED_BMQ
506 ++ .boost_prio = 0,
507 ++ .sq_idx = 15,
508 ++#endif
509 ++#ifdef CONFIG_SCHED_PDS
510 ++ .deadline = 0,
511 ++#endif
512 ++ .time_slice = HZ,
513 ++#else
514 + .se = {
515 + .group_node = LIST_HEAD_INIT(init_task.se.group_node),
516 + },
517 +@@ -95,6 +112,7 @@ struct task_struct init_task
518 + .run_list = LIST_HEAD_INIT(init_task.rt.run_list),
519 + .time_slice = RR_TIMESLICE,
520 + },
521 ++#endif
522 + .tasks = LIST_HEAD_INIT(init_task.tasks),
523 + #ifdef CONFIG_SMP
524 + .pushable_tasks = PLIST_NODE_INIT(init_task.pushable_tasks, MAX_PRIO),
525 +diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
526 +index 5876e30c5740..7594d0a31869 100644
527 +--- a/kernel/Kconfig.preempt
528 ++++ b/kernel/Kconfig.preempt
529 +@@ -102,7 +102,7 @@ config PREEMPT_DYNAMIC
530 +
531 + config SCHED_CORE
532 + bool "Core Scheduling for SMT"
533 +- depends on SCHED_SMT
534 ++ depends on SCHED_SMT && !SCHED_ALT
535 + help
536 + This option permits Core Scheduling, a means of coordinated task
537 + selection across SMT siblings. When enabled -- see
538 +diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
539 +index 2a9695ccb65f..292112c267b8 100644
540 +--- a/kernel/cgroup/cpuset.c
541 ++++ b/kernel/cgroup/cpuset.c
542 +@@ -664,7 +664,7 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
543 + return ret;
544 + }
545 +
546 +-#ifdef CONFIG_SMP
547 ++#if defined(CONFIG_SMP) && !defined(CONFIG_SCHED_ALT)
548 + /*
549 + * Helper routine for generate_sched_domains().
550 + * Do cpusets a, b have overlapping effective cpus_allowed masks?
551 +@@ -1060,7 +1060,7 @@ static void rebuild_sched_domains_locked(void)
552 + /* Have scheduler rebuild the domains */
553 + partition_and_rebuild_sched_domains(ndoms, doms, attr);
554 + }
555 +-#else /* !CONFIG_SMP */
556 ++#else /* !CONFIG_SMP || CONFIG_SCHED_ALT */
557 + static void rebuild_sched_domains_locked(void)
558 + {
559 + }
560 +diff --git a/kernel/delayacct.c b/kernel/delayacct.c
561 +index 51530d5b15a8..e542d71bb94b 100644
562 +--- a/kernel/delayacct.c
563 ++++ b/kernel/delayacct.c
564 +@@ -139,7 +139,7 @@ int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
565 + */
566 + t1 = tsk->sched_info.pcount;
567 + t2 = tsk->sched_info.run_delay;
568 +- t3 = tsk->se.sum_exec_runtime;
569 ++ t3 = tsk_seruntime(tsk);
570 +
571 + d->cpu_count += t1;
572 +
573 +diff --git a/kernel/exit.c b/kernel/exit.c
574 +index 91a43e57a32e..4b157befc10c 100644
575 +--- a/kernel/exit.c
576 ++++ b/kernel/exit.c
577 +@@ -122,7 +122,7 @@ static void __exit_signal(struct task_struct *tsk)
578 + sig->curr_target = next_thread(tsk);
579 + }
580 +
581 +- add_device_randomness((const void*) &tsk->se.sum_exec_runtime,
582 ++ add_device_randomness((const void*) &tsk_seruntime(tsk),
583 + sizeof(unsigned long long));
584 +
585 + /*
586 +@@ -143,7 +143,7 @@ static void __exit_signal(struct task_struct *tsk)
587 + sig->inblock += task_io_get_inblock(tsk);
588 + sig->oublock += task_io_get_oublock(tsk);
589 + task_io_accounting_add(&sig->ioac, &tsk->ioac);
590 +- sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
591 ++ sig->sum_sched_runtime += tsk_seruntime(tsk);
592 + sig->nr_threads--;
593 + __unhash_process(tsk, group_dead);
594 + write_sequnlock(&sig->stats_lock);
595 +diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
596 +index 291b857a6e20..f3480cdb7497 100644
597 +--- a/kernel/livepatch/transition.c
598 ++++ b/kernel/livepatch/transition.c
599 +@@ -307,7 +307,11 @@ static bool klp_try_switch_task(struct task_struct *task)
600 + */
601 + rq = task_rq_lock(task, &flags);
602 +
603 ++#ifdef CONFIG_SCHED_ALT
604 ++ if (task_running(task) && task != current) {
605 ++#else
606 + if (task_running(rq, task) && task != current) {
607 ++#endif
608 + snprintf(err_buf, STACK_ERR_BUF_SIZE,
609 + "%s: %s:%d is running\n", __func__, task->comm,
610 + task->pid);
611 +diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
612 +index 6bb116c559b4..d4c8168a8270 100644
613 +--- a/kernel/locking/rtmutex.c
614 ++++ b/kernel/locking/rtmutex.c
615 +@@ -298,21 +298,25 @@ static __always_inline void
616 + waiter_update_prio(struct rt_mutex_waiter *waiter, struct task_struct *task)
617 + {
618 + waiter->prio = __waiter_prio(task);
619 +- waiter->deadline = task->dl.deadline;
620 ++ waiter->deadline = __tsk_deadline(task);
621 + }
622 +
623 + /*
624 + * Only use with rt_mutex_waiter_{less,equal}()
625 + */
626 + #define task_to_waiter(p) \
627 +- &(struct rt_mutex_waiter){ .prio = __waiter_prio(p), .deadline = (p)->dl.deadline }
628 ++ &(struct rt_mutex_waiter){ .prio = __waiter_prio(p), .deadline = __tsk_deadline(p) }
629 +
630 + static __always_inline int rt_mutex_waiter_less(struct rt_mutex_waiter *left,
631 + struct rt_mutex_waiter *right)
632 + {
633 ++#ifdef CONFIG_SCHED_PDS
634 ++ return (left->deadline < right->deadline);
635 ++#else
636 + if (left->prio < right->prio)
637 + return 1;
638 +
639 ++#ifndef CONFIG_SCHED_BMQ
640 + /*
641 + * If both waiters have dl_prio(), we check the deadlines of the
642 + * associated tasks.
643 +@@ -321,16 +325,22 @@ static __always_inline int rt_mutex_waiter_less(struct rt_mutex_waiter *left,
644 + */
645 + if (dl_prio(left->prio))
646 + return dl_time_before(left->deadline, right->deadline);
647 ++#endif
648 +
649 + return 0;
650 ++#endif
651 + }
652 +
653 + static __always_inline int rt_mutex_waiter_equal(struct rt_mutex_waiter *left,
654 + struct rt_mutex_waiter *right)
655 + {
656 ++#ifdef CONFIG_SCHED_PDS
657 ++ return (left->deadline == right->deadline);
658 ++#else
659 + if (left->prio != right->prio)
660 + return 0;
661 +
662 ++#ifndef CONFIG_SCHED_BMQ
663 + /*
664 + * If both waiters have dl_prio(), we check the deadlines of the
665 + * associated tasks.
666 +@@ -339,8 +349,10 @@ static __always_inline int rt_mutex_waiter_equal(struct rt_mutex_waiter *left,
667 + */
668 + if (dl_prio(left->prio))
669 + return left->deadline == right->deadline;
670 ++#endif
671 +
672 + return 1;
673 ++#endif
674 + }
675 +
676 + static inline bool rt_mutex_steal(struct rt_mutex_waiter *waiter,
677 +diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
678 +index 978fcfca5871..0425ee149b4d 100644
679 +--- a/kernel/sched/Makefile
680 ++++ b/kernel/sched/Makefile
681 +@@ -22,14 +22,21 @@ ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
682 + CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer
683 + endif
684 +
685 +-obj-y += core.o loadavg.o clock.o cputime.o
686 +-obj-y += idle.o fair.o rt.o deadline.o
687 +-obj-y += wait.o wait_bit.o swait.o completion.o
688 +-
689 +-obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
690 ++ifdef CONFIG_SCHED_ALT
691 ++obj-y += alt_core.o
692 ++obj-$(CONFIG_SCHED_DEBUG) += alt_debug.o
693 ++else
694 ++obj-y += core.o
695 ++obj-y += fair.o rt.o deadline.o
696 ++obj-$(CONFIG_SMP) += cpudeadline.o stop_task.o
697 + obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
698 +-obj-$(CONFIG_SCHEDSTATS) += stats.o
699 ++endif
700 + obj-$(CONFIG_SCHED_DEBUG) += debug.o
701 ++obj-y += loadavg.o clock.o cputime.o
702 ++obj-y += idle.o
703 ++obj-y += wait.o wait_bit.o swait.o completion.o
704 ++obj-$(CONFIG_SMP) += cpupri.o pelt.o topology.o
705 ++obj-$(CONFIG_SCHEDSTATS) += stats.o
706 + obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
707 + obj-$(CONFIG_CPU_FREQ) += cpufreq.o
708 + obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
709 +diff --git a/kernel/sched/alt_core.c b/kernel/sched/alt_core.c
710 +new file mode 100644
711 +index 000000000000..9576c57f82da
712 +--- /dev/null
713 ++++ b/kernel/sched/alt_core.c
714 +@@ -0,0 +1,7626 @@
715 ++/*
716 ++ * kernel/sched/alt_core.c
717 ++ *
718 ++ * Core alternative kernel scheduler code and related syscalls
719 ++ *
720 ++ * Copyright (C) 1991-2002 Linus Torvalds
721 ++ *
722 ++ * 2009-08-13 Brainfuck deadline scheduling policy by Con Kolivas deletes
723 ++ * a whole lot of those previous things.
724 ++ * 2017-09-06 Priority and Deadline based Skip list multiple queue kernel
725 ++ * scheduler by Alfred Chen.
726 ++ * 2019-02-20 BMQ(BitMap Queue) kernel scheduler by Alfred Chen.
727 ++ */
728 ++#define CREATE_TRACE_POINTS
729 ++#include <trace/events/sched.h>
730 ++#undef CREATE_TRACE_POINTS
731 ++
732 ++#include "sched.h"
733 ++
734 ++#include <linux/sched/rt.h>
735 ++
736 ++#include <linux/context_tracking.h>
737 ++#include <linux/compat.h>
738 ++#include <linux/blkdev.h>
739 ++#include <linux/delayacct.h>
740 ++#include <linux/freezer.h>
741 ++#include <linux/init_task.h>
742 ++#include <linux/kprobes.h>
743 ++#include <linux/mmu_context.h>
744 ++#include <linux/nmi.h>
745 ++#include <linux/profile.h>
746 ++#include <linux/rcupdate_wait.h>
747 ++#include <linux/security.h>
748 ++#include <linux/syscalls.h>
749 ++#include <linux/wait_bit.h>
750 ++
751 ++#include <linux/kcov.h>
752 ++#include <linux/scs.h>
753 ++
754 ++#include <asm/switch_to.h>
755 ++
756 ++#include "../workqueue_internal.h"
757 ++#include "../../fs/io-wq.h"
758 ++#include "../smpboot.h"
759 ++
760 ++#include "pelt.h"
761 ++#include "smp.h"
762 ++
763 ++/*
764 ++ * Export tracepoints that act as a bare tracehook (ie: have no trace event
765 ++ * associated with them) to allow external modules to probe them.
766 ++ */
767 ++EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_irq_tp);
768 ++
769 ++#ifdef CONFIG_SCHED_DEBUG
770 ++#define sched_feat(x) (1)
771 ++/*
772 ++ * Print a warning if need_resched is set for the given duration (if
773 ++ * LATENCY_WARN is enabled).
774 ++ *
775 ++ * If sysctl_resched_latency_warn_once is set, only one warning will be shown
776 ++ * per boot.
777 ++ */
778 ++__read_mostly int sysctl_resched_latency_warn_ms = 100;
779 ++__read_mostly int sysctl_resched_latency_warn_once = 1;
780 ++#else
781 ++#define sched_feat(x) (0)
782 ++#endif /* CONFIG_SCHED_DEBUG */
783 ++
784 ++#define ALT_SCHED_VERSION "v5.15-r0"
785 ++
786 ++/* rt_prio(prio) defined in include/linux/sched/rt.h */
787 ++#define rt_task(p) rt_prio((p)->prio)
788 ++#define rt_policy(policy) ((policy) == SCHED_FIFO || (policy) == SCHED_RR)
789 ++#define task_has_rt_policy(p) (rt_policy((p)->policy))
790 ++
791 ++#define STOP_PRIO (MAX_RT_PRIO - 1)
792 ++
793 ++/* Default time slice is 4 in ms, can be set via kernel parameter "sched_timeslice" */
794 ++u64 sched_timeslice_ns __read_mostly = (4 << 20);
795 ++
796 ++static inline void requeue_task(struct task_struct *p, struct rq *rq);
797 ++
798 ++#ifdef CONFIG_SCHED_BMQ
799 ++#include "bmq.h"
800 ++#endif
801 ++#ifdef CONFIG_SCHED_PDS
802 ++#include "pds.h"
803 ++#endif
804 ++
805 ++static int __init sched_timeslice(char *str)
806 ++{
807 ++ int timeslice_ms;
808 ++
809 ++ get_option(&str, &timeslice_ms);
810 ++ if (2 != timeslice_ms)
811 ++ timeslice_ms = 4;
812 ++ sched_timeslice_ns = timeslice_ms << 20;
813 ++ sched_timeslice_imp(timeslice_ms);
814 ++
815 ++ return 0;
816 ++}
817 ++early_param("sched_timeslice", sched_timeslice);
818 ++
819 ++/* Reschedule if less than this many μs left */
820 ++#define RESCHED_NS (100 << 10)
821 ++
822 ++/**
823 ++ * sched_yield_type - Choose what sort of yield sched_yield will perform.
824 ++ * 0: No yield.
825 ++ * 1: Deboost and requeue task. (default)
826 ++ * 2: Set rq skip task.
827 ++ */
828 ++int sched_yield_type __read_mostly = 1;
829 ++
830 ++#ifdef CONFIG_SMP
831 ++static cpumask_t sched_rq_pending_mask ____cacheline_aligned_in_smp;
832 ++
833 ++DEFINE_PER_CPU(cpumask_t [NR_CPU_AFFINITY_LEVELS], sched_cpu_topo_masks);
834 ++DEFINE_PER_CPU(cpumask_t *, sched_cpu_llc_mask);
835 ++DEFINE_PER_CPU(cpumask_t *, sched_cpu_topo_end_mask);
836 ++
837 ++#ifdef CONFIG_SCHED_SMT
838 ++DEFINE_STATIC_KEY_FALSE(sched_smt_present);
839 ++EXPORT_SYMBOL_GPL(sched_smt_present);
840 ++#endif
841 ++
842 ++/*
843 ++ * Keep a unique ID per domain (we use the first CPUs number in the cpumask of
844 ++ * the domain), this allows us to quickly tell if two cpus are in the same cache
845 ++ * domain, see cpus_share_cache().
846 ++ */
847 ++DEFINE_PER_CPU(int, sd_llc_id);
848 ++#endif /* CONFIG_SMP */
849 ++
850 ++static DEFINE_MUTEX(sched_hotcpu_mutex);
851 ++
852 ++DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
853 ++
854 ++#ifndef prepare_arch_switch
855 ++# define prepare_arch_switch(next) do { } while (0)
856 ++#endif
857 ++#ifndef finish_arch_post_lock_switch
858 ++# define finish_arch_post_lock_switch() do { } while (0)
859 ++#endif
860 ++
861 ++#ifdef CONFIG_SCHED_SMT
862 ++static cpumask_t sched_sg_idle_mask ____cacheline_aligned_in_smp;
863 ++#endif
864 ++static cpumask_t sched_rq_watermark[SCHED_BITS] ____cacheline_aligned_in_smp;
865 ++
866 ++/* sched_queue related functions */
867 ++static inline void sched_queue_init(struct sched_queue *q)
868 ++{
869 ++ int i;
870 ++
871 ++ bitmap_zero(q->bitmap, SCHED_BITS);
872 ++ for(i = 0; i < SCHED_BITS; i++)
873 ++ INIT_LIST_HEAD(&q->heads[i]);
874 ++}
875 ++
876 ++/*
877 ++ * Init idle task and put into queue structure of rq
878 ++ * IMPORTANT: may be called multiple times for a single cpu
879 ++ */
880 ++static inline void sched_queue_init_idle(struct sched_queue *q,
881 ++ struct task_struct *idle)
882 ++{
883 ++ idle->sq_idx = IDLE_TASK_SCHED_PRIO;
884 ++ INIT_LIST_HEAD(&q->heads[idle->sq_idx]);
885 ++ list_add(&idle->sq_node, &q->heads[idle->sq_idx]);
886 ++}
887 ++
888 ++/* water mark related functions */
889 ++static inline void update_sched_rq_watermark(struct rq *rq)
890 ++{
891 ++ unsigned long watermark = find_first_bit(rq->queue.bitmap, SCHED_QUEUE_BITS);
892 ++ unsigned long last_wm = rq->watermark;
893 ++ unsigned long i;
894 ++ int cpu;
895 ++
896 ++ if (watermark == last_wm)
897 ++ return;
898 ++
899 ++ rq->watermark = watermark;
900 ++ cpu = cpu_of(rq);
901 ++ if (watermark < last_wm) {
902 ++ for (i = last_wm; i > watermark; i--)
903 ++ cpumask_clear_cpu(cpu, sched_rq_watermark + SCHED_BITS - 1 - i);
904 ++#ifdef CONFIG_SCHED_SMT
905 ++ if (static_branch_likely(&sched_smt_present) &&
906 ++ IDLE_TASK_SCHED_PRIO == last_wm)
907 ++ cpumask_andnot(&sched_sg_idle_mask,
908 ++ &sched_sg_idle_mask, cpu_smt_mask(cpu));
909 ++#endif
910 ++ return;
911 ++ }
912 ++ /* last_wm < watermark */
913 ++ for (i = watermark; i > last_wm; i--)
914 ++ cpumask_set_cpu(cpu, sched_rq_watermark + SCHED_BITS - 1 - i);
915 ++#ifdef CONFIG_SCHED_SMT
916 ++ if (static_branch_likely(&sched_smt_present) &&
917 ++ IDLE_TASK_SCHED_PRIO == watermark) {
918 ++ cpumask_t tmp;
919 ++
920 ++ cpumask_and(&tmp, cpu_smt_mask(cpu), sched_rq_watermark);
921 ++ if (cpumask_equal(&tmp, cpu_smt_mask(cpu)))
922 ++ cpumask_or(&sched_sg_idle_mask,
923 ++ &sched_sg_idle_mask, cpu_smt_mask(cpu));
924 ++ }
925 ++#endif
926 ++}
927 ++
928 ++/*
929 ++ * This routine assume that the idle task always in queue
930 ++ */
931 ++static inline struct task_struct *sched_rq_first_task(struct rq *rq)
932 ++{
933 ++ unsigned long idx = find_first_bit(rq->queue.bitmap, SCHED_QUEUE_BITS);
934 ++ const struct list_head *head = &rq->queue.heads[sched_prio2idx(idx, rq)];
935 ++
936 ++ return list_first_entry(head, struct task_struct, sq_node);
937 ++}
938 ++
939 ++static inline struct task_struct *
940 ++sched_rq_next_task(struct task_struct *p, struct rq *rq)
941 ++{
942 ++ unsigned long idx = p->sq_idx;
943 ++ struct list_head *head = &rq->queue.heads[idx];
944 ++
945 ++ if (list_is_last(&p->sq_node, head)) {
946 ++ idx = find_next_bit(rq->queue.bitmap, SCHED_QUEUE_BITS,
947 ++ sched_idx2prio(idx, rq) + 1);
948 ++ head = &rq->queue.heads[sched_prio2idx(idx, rq)];
949 ++
950 ++ return list_first_entry(head, struct task_struct, sq_node);
951 ++ }
952 ++
953 ++ return list_next_entry(p, sq_node);
954 ++}
955 ++
956 ++static inline struct task_struct *rq_runnable_task(struct rq *rq)
957 ++{
958 ++ struct task_struct *next = sched_rq_first_task(rq);
959 ++
960 ++ if (unlikely(next == rq->skip))
961 ++ next = sched_rq_next_task(next, rq);
962 ++
963 ++ return next;
964 ++}
965 ++
966 ++/*
967 ++ * Serialization rules:
968 ++ *
969 ++ * Lock order:
970 ++ *
971 ++ * p->pi_lock
972 ++ * rq->lock
973 ++ * hrtimer_cpu_base->lock (hrtimer_start() for bandwidth controls)
974 ++ *
975 ++ * rq1->lock
976 ++ * rq2->lock where: rq1 < rq2
977 ++ *
978 ++ * Regular state:
979 ++ *
980 ++ * Normal scheduling state is serialized by rq->lock. __schedule() takes the
981 ++ * local CPU's rq->lock, it optionally removes the task from the runqueue and
982 ++ * always looks at the local rq data structures to find the most eligible task
983 ++ * to run next.
984 ++ *
985 ++ * Task enqueue is also under rq->lock, possibly taken from another CPU.
986 ++ * Wakeups from another LLC domain might use an IPI to transfer the enqueue to
987 ++ * the local CPU to avoid bouncing the runqueue state around [ see
988 ++ * ttwu_queue_wakelist() ]
989 ++ *
990 ++ * Task wakeup, specifically wakeups that involve migration, are horribly
991 ++ * complicated to avoid having to take two rq->locks.
992 ++ *
993 ++ * Special state:
994 ++ *
995 ++ * System-calls and anything external will use task_rq_lock() which acquires
996 ++ * both p->pi_lock and rq->lock. As a consequence the state they change is
997 ++ * stable while holding either lock:
998 ++ *
999 ++ * - sched_setaffinity()/
1000 ++ * set_cpus_allowed_ptr(): p->cpus_ptr, p->nr_cpus_allowed
1001 ++ * - set_user_nice(): p->se.load, p->*prio
1002 ++ * - __sched_setscheduler(): p->sched_class, p->policy, p->*prio,
1003 ++ * p->se.load, p->rt_priority,
1004 ++ * p->dl.dl_{runtime, deadline, period, flags, bw, density}
1005 ++ * - sched_setnuma(): p->numa_preferred_nid
1006 ++ * - sched_move_task()/
1007 ++ * cpu_cgroup_fork(): p->sched_task_group
1008 ++ * - uclamp_update_active() p->uclamp*
1009 ++ *
1010 ++ * p->state <- TASK_*:
1011 ++ *
1012 ++ * is changed locklessly using set_current_state(), __set_current_state() or
1013 ++ * set_special_state(), see their respective comments, or by
1014 ++ * try_to_wake_up(). This latter uses p->pi_lock to serialize against
1015 ++ * concurrent self.
1016 ++ *
1017 ++ * p->on_rq <- { 0, 1 = TASK_ON_RQ_QUEUED, 2 = TASK_ON_RQ_MIGRATING }:
1018 ++ *
1019 ++ * is set by activate_task() and cleared by deactivate_task(), under
1020 ++ * rq->lock. Non-zero indicates the task is runnable, the special
1021 ++ * ON_RQ_MIGRATING state is used for migration without holding both
1022 ++ * rq->locks. It indicates task_cpu() is not stable, see task_rq_lock().
1023 ++ *
1024 ++ * p->on_cpu <- { 0, 1 }:
1025 ++ *
1026 ++ * is set by prepare_task() and cleared by finish_task() such that it will be
1027 ++ * set before p is scheduled-in and cleared after p is scheduled-out, both
1028 ++ * under rq->lock. Non-zero indicates the task is running on its CPU.
1029 ++ *
1030 ++ * [ The astute reader will observe that it is possible for two tasks on one
1031 ++ * CPU to have ->on_cpu = 1 at the same time. ]
1032 ++ *
1033 ++ * task_cpu(p): is changed by set_task_cpu(), the rules are:
1034 ++ *
1035 ++ * - Don't call set_task_cpu() on a blocked task:
1036 ++ *
1037 ++ * We don't care what CPU we're not running on, this simplifies hotplug,
1038 ++ * the CPU assignment of blocked tasks isn't required to be valid.
1039 ++ *
1040 ++ * - for try_to_wake_up(), called under p->pi_lock:
1041 ++ *
1042 ++ * This allows try_to_wake_up() to only take one rq->lock, see its comment.
1043 ++ *
1044 ++ * - for migration called under rq->lock:
1045 ++ * [ see task_on_rq_migrating() in task_rq_lock() ]
1046 ++ *
1047 ++ * o move_queued_task()
1048 ++ * o detach_task()
1049 ++ *
1050 ++ * - for migration called under double_rq_lock():
1051 ++ *
1052 ++ * o __migrate_swap_task()
1053 ++ * o push_rt_task() / pull_rt_task()
1054 ++ * o push_dl_task() / pull_dl_task()
1055 ++ * o dl_task_offline_migration()
1056 ++ *
1057 ++ */
1058 ++
1059 ++/*
1060 ++ * Context: p->pi_lock
1061 ++ */
1062 ++static inline struct rq
1063 ++*__task_access_lock(struct task_struct *p, raw_spinlock_t **plock)
1064 ++{
1065 ++ struct rq *rq;
1066 ++ for (;;) {
1067 ++ rq = task_rq(p);
1068 ++ if (p->on_cpu || task_on_rq_queued(p)) {
1069 ++ raw_spin_lock(&rq->lock);
1070 ++ if (likely((p->on_cpu || task_on_rq_queued(p))
1071 ++ && rq == task_rq(p))) {
1072 ++ *plock = &rq->lock;
1073 ++ return rq;
1074 ++ }
1075 ++ raw_spin_unlock(&rq->lock);
1076 ++ } else if (task_on_rq_migrating(p)) {
1077 ++ do {
1078 ++ cpu_relax();
1079 ++ } while (unlikely(task_on_rq_migrating(p)));
1080 ++ } else {
1081 ++ *plock = NULL;
1082 ++ return rq;
1083 ++ }
1084 ++ }
1085 ++}
1086 ++
1087 ++static inline void
1088 ++__task_access_unlock(struct task_struct *p, raw_spinlock_t *lock)
1089 ++{
1090 ++ if (NULL != lock)
1091 ++ raw_spin_unlock(lock);
1092 ++}
1093 ++
1094 ++static inline struct rq
1095 ++*task_access_lock_irqsave(struct task_struct *p, raw_spinlock_t **plock,
1096 ++ unsigned long *flags)
1097 ++{
1098 ++ struct rq *rq;
1099 ++ for (;;) {
1100 ++ rq = task_rq(p);
1101 ++ if (p->on_cpu || task_on_rq_queued(p)) {
1102 ++ raw_spin_lock_irqsave(&rq->lock, *flags);
1103 ++ if (likely((p->on_cpu || task_on_rq_queued(p))
1104 ++ && rq == task_rq(p))) {
1105 ++ *plock = &rq->lock;
1106 ++ return rq;
1107 ++ }
1108 ++ raw_spin_unlock_irqrestore(&rq->lock, *flags);
1109 ++ } else if (task_on_rq_migrating(p)) {
1110 ++ do {
1111 ++ cpu_relax();
1112 ++ } while (unlikely(task_on_rq_migrating(p)));
1113 ++ } else {
1114 ++ raw_spin_lock_irqsave(&p->pi_lock, *flags);
1115 ++ if (likely(!p->on_cpu && !p->on_rq &&
1116 ++ rq == task_rq(p))) {
1117 ++ *plock = &p->pi_lock;
1118 ++ return rq;
1119 ++ }
1120 ++ raw_spin_unlock_irqrestore(&p->pi_lock, *flags);
1121 ++ }
1122 ++ }
1123 ++}
1124 ++
1125 ++static inline void
1126 ++task_access_unlock_irqrestore(struct task_struct *p, raw_spinlock_t *lock,
1127 ++ unsigned long *flags)
1128 ++{
1129 ++ raw_spin_unlock_irqrestore(lock, *flags);
1130 ++}
1131 ++
1132 ++/*
1133 ++ * __task_rq_lock - lock the rq @p resides on.
1134 ++ */
1135 ++struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
1136 ++ __acquires(rq->lock)
1137 ++{
1138 ++ struct rq *rq;
1139 ++
1140 ++ lockdep_assert_held(&p->pi_lock);
1141 ++
1142 ++ for (;;) {
1143 ++ rq = task_rq(p);
1144 ++ raw_spin_lock(&rq->lock);
1145 ++ if (likely(rq == task_rq(p) && !task_on_rq_migrating(p)))
1146 ++ return rq;
1147 ++ raw_spin_unlock(&rq->lock);
1148 ++
1149 ++ while (unlikely(task_on_rq_migrating(p)))
1150 ++ cpu_relax();
1151 ++ }
1152 ++}
1153 ++
1154 ++/*
1155 ++ * task_rq_lock - lock p->pi_lock and lock the rq @p resides on.
1156 ++ */
1157 ++struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
1158 ++ __acquires(p->pi_lock)
1159 ++ __acquires(rq->lock)
1160 ++{
1161 ++ struct rq *rq;
1162 ++
1163 ++ for (;;) {
1164 ++ raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
1165 ++ rq = task_rq(p);
1166 ++ raw_spin_lock(&rq->lock);
1167 ++ /*
1168 ++ * move_queued_task() task_rq_lock()
1169 ++ *
1170 ++ * ACQUIRE (rq->lock)
1171 ++ * [S] ->on_rq = MIGRATING [L] rq = task_rq()
1172 ++ * WMB (__set_task_cpu()) ACQUIRE (rq->lock);
1173 ++ * [S] ->cpu = new_cpu [L] task_rq()
1174 ++ * [L] ->on_rq
1175 ++ * RELEASE (rq->lock)
1176 ++ *
1177 ++ * If we observe the old CPU in task_rq_lock(), the acquire of
1178 ++ * the old rq->lock will fully serialize against the stores.
1179 ++ *
1180 ++ * If we observe the new CPU in task_rq_lock(), the address
1181 ++ * dependency headed by '[L] rq = task_rq()' and the acquire
1182 ++ * will pair with the WMB to ensure we then also see migrating.
1183 ++ */
1184 ++ if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
1185 ++ return rq;
1186 ++ }
1187 ++ raw_spin_unlock(&rq->lock);
1188 ++ raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
1189 ++
1190 ++ while (unlikely(task_on_rq_migrating(p)))
1191 ++ cpu_relax();
1192 ++ }
1193 ++}
1194 ++
1195 ++static inline void
1196 ++rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
1197 ++ __acquires(rq->lock)
1198 ++{
1199 ++ raw_spin_lock_irqsave(&rq->lock, rf->flags);
1200 ++}
1201 ++
1202 ++static inline void
1203 ++rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
1204 ++ __releases(rq->lock)
1205 ++{
1206 ++ raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
1207 ++}
1208 ++
1209 ++void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
1210 ++{
1211 ++ raw_spinlock_t *lock;
1212 ++
1213 ++ /* Matches synchronize_rcu() in __sched_core_enable() */
1214 ++ preempt_disable();
1215 ++
1216 ++ for (;;) {
1217 ++ lock = __rq_lockp(rq);
1218 ++ raw_spin_lock_nested(lock, subclass);
1219 ++ if (likely(lock == __rq_lockp(rq))) {
1220 ++ /* preempt_count *MUST* be > 1 */
1221 ++ preempt_enable_no_resched();
1222 ++ return;
1223 ++ }
1224 ++ raw_spin_unlock(lock);
1225 ++ }
1226 ++}
1227 ++
1228 ++void raw_spin_rq_unlock(struct rq *rq)
1229 ++{
1230 ++ raw_spin_unlock(rq_lockp(rq));
1231 ++}
1232 ++
1233 ++/*
1234 ++ * RQ-clock updating methods:
1235 ++ */
1236 ++
1237 ++static void update_rq_clock_task(struct rq *rq, s64 delta)
1238 ++{
1239 ++/*
1240 ++ * In theory, the compile should just see 0 here, and optimize out the call
1241 ++ * to sched_rt_avg_update. But I don't trust it...
1242 ++ */
1243 ++ s64 __maybe_unused steal = 0, irq_delta = 0;
1244 ++
1245 ++#ifdef CONFIG_IRQ_TIME_ACCOUNTING
1246 ++ irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
1247 ++
1248 ++ /*
1249 ++ * Since irq_time is only updated on {soft,}irq_exit, we might run into
1250 ++ * this case when a previous update_rq_clock() happened inside a
1251 ++ * {soft,}irq region.
1252 ++ *
1253 ++ * When this happens, we stop ->clock_task and only update the
1254 ++ * prev_irq_time stamp to account for the part that fit, so that a next
1255 ++ * update will consume the rest. This ensures ->clock_task is
1256 ++ * monotonic.
1257 ++ *
1258 ++ * It does however cause some slight miss-attribution of {soft,}irq
1259 ++ * time, a more accurate solution would be to update the irq_time using
1260 ++ * the current rq->clock timestamp, except that would require using
1261 ++ * atomic ops.
1262 ++ */
1263 ++ if (irq_delta > delta)
1264 ++ irq_delta = delta;
1265 ++
1266 ++ rq->prev_irq_time += irq_delta;
1267 ++ delta -= irq_delta;
1268 ++#endif
1269 ++#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
1270 ++ if (static_key_false((&paravirt_steal_rq_enabled))) {
1271 ++ steal = paravirt_steal_clock(cpu_of(rq));
1272 ++ steal -= rq->prev_steal_time_rq;
1273 ++
1274 ++ if (unlikely(steal > delta))
1275 ++ steal = delta;
1276 ++
1277 ++ rq->prev_steal_time_rq += steal;
1278 ++ delta -= steal;
1279 ++ }
1280 ++#endif
1281 ++
1282 ++ rq->clock_task += delta;
1283 ++
1284 ++#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
1285 ++ if ((irq_delta + steal))
1286 ++ update_irq_load_avg(rq, irq_delta + steal);
1287 ++#endif
1288 ++}
1289 ++
1290 ++static inline void update_rq_clock(struct rq *rq)
1291 ++{
1292 ++ s64 delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
1293 ++
1294 ++ if (unlikely(delta <= 0))
1295 ++ return;
1296 ++ rq->clock += delta;
1297 ++ update_rq_time_edge(rq);
1298 ++ update_rq_clock_task(rq, delta);
1299 ++}
1300 ++
1301 ++/*
1302 ++ * RQ Load update routine
1303 ++ */
1304 ++#define RQ_LOAD_HISTORY_BITS (sizeof(s32) * 8ULL)
1305 ++#define RQ_UTIL_SHIFT (8)
1306 ++#define RQ_LOAD_HISTORY_TO_UTIL(l) (((l) >> (RQ_LOAD_HISTORY_BITS - 1 - RQ_UTIL_SHIFT)) & 0xff)
1307 ++
1308 ++#define LOAD_BLOCK(t) ((t) >> 17)
1309 ++#define LOAD_HALF_BLOCK(t) ((t) >> 16)
1310 ++#define BLOCK_MASK(t) ((t) & ((0x01 << 18) - 1))
1311 ++#define LOAD_BLOCK_BIT(b) (1UL << (RQ_LOAD_HISTORY_BITS - 1 - (b)))
1312 ++#define CURRENT_LOAD_BIT LOAD_BLOCK_BIT(0)
1313 ++
1314 ++static inline void rq_load_update(struct rq *rq)
1315 ++{
1316 ++ u64 time = rq->clock;
1317 ++ u64 delta = min(LOAD_BLOCK(time) - LOAD_BLOCK(rq->load_stamp),
1318 ++ RQ_LOAD_HISTORY_BITS - 1);
1319 ++ u64 prev = !!(rq->load_history & CURRENT_LOAD_BIT);
1320 ++ u64 curr = !!rq->nr_running;
1321 ++
1322 ++ if (delta) {
1323 ++ rq->load_history = rq->load_history >> delta;
1324 ++
1325 ++ if (delta < RQ_UTIL_SHIFT) {
1326 ++ rq->load_block += (~BLOCK_MASK(rq->load_stamp)) * prev;
1327 ++ if (!!LOAD_HALF_BLOCK(rq->load_block) ^ curr)
1328 ++ rq->load_history ^= LOAD_BLOCK_BIT(delta);
1329 ++ }
1330 ++
1331 ++ rq->load_block = BLOCK_MASK(time) * prev;
1332 ++ } else {
1333 ++ rq->load_block += (time - rq->load_stamp) * prev;
1334 ++ }
1335 ++ if (prev ^ curr)
1336 ++ rq->load_history ^= CURRENT_LOAD_BIT;
1337 ++ rq->load_stamp = time;
1338 ++}
1339 ++
1340 ++unsigned long rq_load_util(struct rq *rq, unsigned long max)
1341 ++{
1342 ++ return RQ_LOAD_HISTORY_TO_UTIL(rq->load_history) * (max >> RQ_UTIL_SHIFT);
1343 ++}
1344 ++
1345 ++#ifdef CONFIG_SMP
1346 ++unsigned long sched_cpu_util(int cpu, unsigned long max)
1347 ++{
1348 ++ return rq_load_util(cpu_rq(cpu), max);
1349 ++}
1350 ++#endif /* CONFIG_SMP */
1351 ++
1352 ++#ifdef CONFIG_CPU_FREQ
1353 ++/**
1354 ++ * cpufreq_update_util - Take a note about CPU utilization changes.
1355 ++ * @rq: Runqueue to carry out the update for.
1356 ++ * @flags: Update reason flags.
1357 ++ *
1358 ++ * This function is called by the scheduler on the CPU whose utilization is
1359 ++ * being updated.
1360 ++ *
1361 ++ * It can only be called from RCU-sched read-side critical sections.
1362 ++ *
1363 ++ * The way cpufreq is currently arranged requires it to evaluate the CPU
1364 ++ * performance state (frequency/voltage) on a regular basis to prevent it from
1365 ++ * being stuck in a completely inadequate performance level for too long.
1366 ++ * That is not guaranteed to happen if the updates are only triggered from CFS
1367 ++ * and DL, though, because they may not be coming in if only RT tasks are
1368 ++ * active all the time (or there are RT tasks only).
1369 ++ *
1370 ++ * As a workaround for that issue, this function is called periodically by the
1371 ++ * RT sched class to trigger extra cpufreq updates to prevent it from stalling,
1372 ++ * but that really is a band-aid. Going forward it should be replaced with
1373 ++ * solutions targeted more specifically at RT tasks.
1374 ++ */
1375 ++static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
1376 ++{
1377 ++ struct update_util_data *data;
1378 ++
1379 ++#ifdef CONFIG_SMP
1380 ++ rq_load_update(rq);
1381 ++#endif
1382 ++ data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
1383 ++ cpu_of(rq)));
1384 ++ if (data)
1385 ++ data->func(data, rq_clock(rq), flags);
1386 ++}
1387 ++#else
1388 ++static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
1389 ++{
1390 ++#ifdef CONFIG_SMP
1391 ++ rq_load_update(rq);
1392 ++#endif
1393 ++}
1394 ++#endif /* CONFIG_CPU_FREQ */
1395 ++
1396 ++#ifdef CONFIG_NO_HZ_FULL
1397 ++/*
1398 ++ * Tick may be needed by tasks in the runqueue depending on their policy and
1399 ++ * requirements. If tick is needed, lets send the target an IPI to kick it out
1400 ++ * of nohz mode if necessary.
1401 ++ */
1402 ++static inline void sched_update_tick_dependency(struct rq *rq)
1403 ++{
1404 ++ int cpu = cpu_of(rq);
1405 ++
1406 ++ if (!tick_nohz_full_cpu(cpu))
1407 ++ return;
1408 ++
1409 ++ if (rq->nr_running < 2)
1410 ++ tick_nohz_dep_clear_cpu(cpu, TICK_DEP_BIT_SCHED);
1411 ++ else
1412 ++ tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
1413 ++}
1414 ++#else /* !CONFIG_NO_HZ_FULL */
1415 ++static inline void sched_update_tick_dependency(struct rq *rq) { }
1416 ++#endif
1417 ++
1418 ++bool sched_task_on_rq(struct task_struct *p)
1419 ++{
1420 ++ return task_on_rq_queued(p);
1421 ++}
1422 ++
1423 ++/*
1424 ++ * Add/Remove/Requeue task to/from the runqueue routines
1425 ++ * Context: rq->lock
1426 ++ */
1427 ++#define __SCHED_DEQUEUE_TASK(p, rq, flags, func) \
1428 ++ psi_dequeue(p, flags & DEQUEUE_SLEEP); \
1429 ++ sched_info_dequeue(rq, p); \
1430 ++ \
1431 ++ list_del(&p->sq_node); \
1432 ++ if (list_empty(&rq->queue.heads[p->sq_idx])) { \
1433 ++ clear_bit(sched_idx2prio(p->sq_idx, rq), \
1434 ++ rq->queue.bitmap); \
1435 ++ func; \
1436 ++ }
1437 ++
1438 ++#define __SCHED_ENQUEUE_TASK(p, rq, flags) \
1439 ++ sched_info_enqueue(rq, p); \
1440 ++ psi_enqueue(p, flags); \
1441 ++ \
1442 ++ p->sq_idx = task_sched_prio_idx(p, rq); \
1443 ++ list_add_tail(&p->sq_node, &rq->queue.heads[p->sq_idx]); \
1444 ++ set_bit(sched_idx2prio(p->sq_idx, rq), rq->queue.bitmap);
1445 ++
1446 ++static inline void dequeue_task(struct task_struct *p, struct rq *rq, int flags)
1447 ++{
1448 ++ lockdep_assert_held(&rq->lock);
1449 ++
1450 ++ /*printk(KERN_INFO "sched: dequeue(%d) %px %016llx\n", cpu_of(rq), p, p->priodl);*/
1451 ++ WARN_ONCE(task_rq(p) != rq, "sched: dequeue task reside on cpu%d from cpu%d\n",
1452 ++ task_cpu(p), cpu_of(rq));
1453 ++
1454 ++ __SCHED_DEQUEUE_TASK(p, rq, flags, update_sched_rq_watermark(rq));
1455 ++ --rq->nr_running;
1456 ++#ifdef CONFIG_SMP
1457 ++ if (1 == rq->nr_running)
1458 ++ cpumask_clear_cpu(cpu_of(rq), &sched_rq_pending_mask);
1459 ++#endif
1460 ++
1461 ++ sched_update_tick_dependency(rq);
1462 ++}
1463 ++
1464 ++static inline void enqueue_task(struct task_struct *p, struct rq *rq, int flags)
1465 ++{
1466 ++ lockdep_assert_held(&rq->lock);
1467 ++
1468 ++ /*printk(KERN_INFO "sched: enqueue(%d) %px %016llx\n", cpu_of(rq), p, p->priodl);*/
1469 ++ WARN_ONCE(task_rq(p) != rq, "sched: enqueue task reside on cpu%d to cpu%d\n",
1470 ++ task_cpu(p), cpu_of(rq));
1471 ++
1472 ++ __SCHED_ENQUEUE_TASK(p, rq, flags);
1473 ++ update_sched_rq_watermark(rq);
1474 ++ ++rq->nr_running;
1475 ++#ifdef CONFIG_SMP
1476 ++ if (2 == rq->nr_running)
1477 ++ cpumask_set_cpu(cpu_of(rq), &sched_rq_pending_mask);
1478 ++#endif
1479 ++
1480 ++ sched_update_tick_dependency(rq);
1481 ++}
1482 ++
1483 ++static inline void requeue_task(struct task_struct *p, struct rq *rq)
1484 ++{
1485 ++ int idx;
1486 ++
1487 ++ lockdep_assert_held(&rq->lock);
1488 ++ /*printk(KERN_INFO "sched: requeue(%d) %px %016llx\n", cpu_of(rq), p, p->priodl);*/
1489 ++ WARN_ONCE(task_rq(p) != rq, "sched: cpu[%d] requeue task reside on cpu%d\n",
1490 ++ cpu_of(rq), task_cpu(p));
1491 ++
1492 ++ idx = task_sched_prio_idx(p, rq);
1493 ++
1494 ++ list_del(&p->sq_node);
1495 ++ list_add_tail(&p->sq_node, &rq->queue.heads[idx]);
1496 ++ if (idx != p->sq_idx) {
1497 ++ if (list_empty(&rq->queue.heads[p->sq_idx]))
1498 ++ clear_bit(sched_idx2prio(p->sq_idx, rq),
1499 ++ rq->queue.bitmap);
1500 ++ p->sq_idx = idx;
1501 ++ set_bit(sched_idx2prio(p->sq_idx, rq), rq->queue.bitmap);
1502 ++ update_sched_rq_watermark(rq);
1503 ++ }
1504 ++}
1505 ++
1506 ++/*
1507 ++ * cmpxchg based fetch_or, macro so it works for different integer types
1508 ++ */
1509 ++#define fetch_or(ptr, mask) \
1510 ++ ({ \
1511 ++ typeof(ptr) _ptr = (ptr); \
1512 ++ typeof(mask) _mask = (mask); \
1513 ++ typeof(*_ptr) _old, _val = *_ptr; \
1514 ++ \
1515 ++ for (;;) { \
1516 ++ _old = cmpxchg(_ptr, _val, _val | _mask); \
1517 ++ if (_old == _val) \
1518 ++ break; \
1519 ++ _val = _old; \
1520 ++ } \
1521 ++ _old; \
1522 ++})
1523 ++
1524 ++#if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
1525 ++/*
1526 ++ * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
1527 ++ * this avoids any races wrt polling state changes and thereby avoids
1528 ++ * spurious IPIs.
1529 ++ */
1530 ++static bool set_nr_and_not_polling(struct task_struct *p)
1531 ++{
1532 ++ struct thread_info *ti = task_thread_info(p);
1533 ++ return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
1534 ++}
1535 ++
1536 ++/*
1537 ++ * Atomically set TIF_NEED_RESCHED if TIF_POLLING_NRFLAG is set.
1538 ++ *
1539 ++ * If this returns true, then the idle task promises to call
1540 ++ * sched_ttwu_pending() and reschedule soon.
1541 ++ */
1542 ++static bool set_nr_if_polling(struct task_struct *p)
1543 ++{
1544 ++ struct thread_info *ti = task_thread_info(p);
1545 ++ typeof(ti->flags) old, val = READ_ONCE(ti->flags);
1546 ++
1547 ++ for (;;) {
1548 ++ if (!(val & _TIF_POLLING_NRFLAG))
1549 ++ return false;
1550 ++ if (val & _TIF_NEED_RESCHED)
1551 ++ return true;
1552 ++ old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
1553 ++ if (old == val)
1554 ++ break;
1555 ++ val = old;
1556 ++ }
1557 ++ return true;
1558 ++}
1559 ++
1560 ++#else
1561 ++static bool set_nr_and_not_polling(struct task_struct *p)
1562 ++{
1563 ++ set_tsk_need_resched(p);
1564 ++ return true;
1565 ++}
1566 ++
1567 ++#ifdef CONFIG_SMP
1568 ++static bool set_nr_if_polling(struct task_struct *p)
1569 ++{
1570 ++ return false;
1571 ++}
1572 ++#endif
1573 ++#endif
1574 ++
1575 ++static bool __wake_q_add(struct wake_q_head *head, struct task_struct *task)
1576 ++{
1577 ++ struct wake_q_node *node = &task->wake_q;
1578 ++
1579 ++ /*
1580 ++ * Atomically grab the task, if ->wake_q is !nil already it means
1581 ++ * it's already queued (either by us or someone else) and will get the
1582 ++ * wakeup due to that.
1583 ++ *
1584 ++ * In order to ensure that a pending wakeup will observe our pending
1585 ++ * state, even in the failed case, an explicit smp_mb() must be used.
1586 ++ */
1587 ++ smp_mb__before_atomic();
1588 ++ if (unlikely(cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL)))
1589 ++ return false;
1590 ++
1591 ++ /*
1592 ++ * The head is context local, there can be no concurrency.
1593 ++ */
1594 ++ *head->lastp = node;
1595 ++ head->lastp = &node->next;
1596 ++ return true;
1597 ++}
1598 ++
1599 ++/**
1600 ++ * wake_q_add() - queue a wakeup for 'later' waking.
1601 ++ * @head: the wake_q_head to add @task to
1602 ++ * @task: the task to queue for 'later' wakeup
1603 ++ *
1604 ++ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
1605 ++ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
1606 ++ * instantly.
1607 ++ *
1608 ++ * This function must be used as-if it were wake_up_process(); IOW the task
1609 ++ * must be ready to be woken at this location.
1610 ++ */
1611 ++void wake_q_add(struct wake_q_head *head, struct task_struct *task)
1612 ++{
1613 ++ if (__wake_q_add(head, task))
1614 ++ get_task_struct(task);
1615 ++}
1616 ++
1617 ++/**
1618 ++ * wake_q_add_safe() - safely queue a wakeup for 'later' waking.
1619 ++ * @head: the wake_q_head to add @task to
1620 ++ * @task: the task to queue for 'later' wakeup
1621 ++ *
1622 ++ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
1623 ++ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
1624 ++ * instantly.
1625 ++ *
1626 ++ * This function must be used as-if it were wake_up_process(); IOW the task
1627 ++ * must be ready to be woken at this location.
1628 ++ *
1629 ++ * This function is essentially a task-safe equivalent to wake_q_add(). Callers
1630 ++ * that already hold reference to @task can call the 'safe' version and trust
1631 ++ * wake_q to do the right thing depending whether or not the @task is already
1632 ++ * queued for wakeup.
1633 ++ */
1634 ++void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task)
1635 ++{
1636 ++ if (!__wake_q_add(head, task))
1637 ++ put_task_struct(task);
1638 ++}
1639 ++
1640 ++void wake_up_q(struct wake_q_head *head)
1641 ++{
1642 ++ struct wake_q_node *node = head->first;
1643 ++
1644 ++ while (node != WAKE_Q_TAIL) {
1645 ++ struct task_struct *task;
1646 ++
1647 ++ task = container_of(node, struct task_struct, wake_q);
1648 ++ /* task can safely be re-inserted now: */
1649 ++ node = node->next;
1650 ++ task->wake_q.next = NULL;
1651 ++
1652 ++ /*
1653 ++ * wake_up_process() executes a full barrier, which pairs with
1654 ++ * the queueing in wake_q_add() so as not to miss wakeups.
1655 ++ */
1656 ++ wake_up_process(task);
1657 ++ put_task_struct(task);
1658 ++ }
1659 ++}
1660 ++
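The wake_q API above is normally used to collect wakeups while a lock is held and to issue them only after the lock is dropped, so wake_up_process() never runs under the lock. A hedged sketch of that calling pattern (my_lock and waiter are hypothetical; DEFINE_WAKE_Q() is the standard initializer from <linux/sched/wake_q.h>):

    DEFINE_WAKE_Q(wake_q);

    spin_lock(&my_lock);
    /* decide whom to wake while the data structure is stable */
    wake_q_add(&wake_q, waiter);    /* takes a task reference */
    spin_unlock(&my_lock);

    wake_up_q(&wake_q);             /* wakes tasks, drops the references */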
1661 ++/*
1662 ++ * resched_curr - mark rq's current task 'to be rescheduled now'.
1663 ++ *
1664 ++ * On UP this means the setting of the need_resched flag, on SMP it
1665 ++ * might also involve a cross-CPU call to trigger the scheduler on
1666 ++ * the target CPU.
1667 ++ */
1668 ++void resched_curr(struct rq *rq)
1669 ++{
1670 ++ struct task_struct *curr = rq->curr;
1671 ++ int cpu;
1672 ++
1673 ++ lockdep_assert_held(&rq->lock);
1674 ++
1675 ++ if (test_tsk_need_resched(curr))
1676 ++ return;
1677 ++
1678 ++ cpu = cpu_of(rq);
1679 ++ if (cpu == smp_processor_id()) {
1680 ++ set_tsk_need_resched(curr);
1681 ++ set_preempt_need_resched();
1682 ++ return;
1683 ++ }
1684 ++
1685 ++ if (set_nr_and_not_polling(curr))
1686 ++ smp_send_reschedule(cpu);
1687 ++ else
1688 ++ trace_sched_wake_idle_without_ipi(cpu);
1689 ++}
1690 ++
1691 ++void resched_cpu(int cpu)
1692 ++{
1693 ++ struct rq *rq = cpu_rq(cpu);
1694 ++ unsigned long flags;
1695 ++
1696 ++ raw_spin_lock_irqsave(&rq->lock, flags);
1697 ++ if (cpu_online(cpu) || cpu == smp_processor_id())
1698 ++ resched_curr(cpu_rq(cpu));
1699 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
1700 ++}
1701 ++
1702 ++#ifdef CONFIG_SMP
1703 ++#ifdef CONFIG_NO_HZ_COMMON
1704 ++void nohz_balance_enter_idle(int cpu) {}
1705 ++
1706 ++void select_nohz_load_balancer(int stop_tick) {}
1707 ++
1708 ++void set_cpu_sd_state_idle(void) {}
1709 ++
1710 ++/*
1711 ++ * In the semi idle case, use the nearest busy CPU for migrating timers
1712 ++ * from an idle CPU. This is good for power-savings.
1713 ++ *
1714 ++ * We don't do a similar optimization for a completely idle system, as
1715 ++ * selecting an idle CPU will add more delays to the timers than intended
1716 ++ * (as that CPU's timer base may not be up to date wrt jiffies etc).
1717 ++ */
1718 ++int get_nohz_timer_target(void)
1719 ++{
1720 ++ int i, cpu = smp_processor_id(), default_cpu = -1;
1721 ++ struct cpumask *mask;
1722 ++ const struct cpumask *hk_mask;
1723 ++
1724 ++ if (housekeeping_cpu(cpu, HK_FLAG_TIMER)) {
1725 ++ if (!idle_cpu(cpu))
1726 ++ return cpu;
1727 ++ default_cpu = cpu;
1728 ++ }
1729 ++
1730 ++ hk_mask = housekeeping_cpumask(HK_FLAG_TIMER);
1731 ++
1732 ++ for (mask = per_cpu(sched_cpu_topo_masks, cpu) + 1;
1733 ++ mask < per_cpu(sched_cpu_topo_end_mask, cpu); mask++)
1734 ++ for_each_cpu_and(i, mask, hk_mask)
1735 ++ if (!idle_cpu(i))
1736 ++ return i;
1737 ++
1738 ++ if (default_cpu == -1)
1739 ++ default_cpu = housekeeping_any_cpu(HK_FLAG_TIMER);
1740 ++ cpu = default_cpu;
1741 ++
1742 ++ return cpu;
1743 ++}
1744 ++
1745 ++/*
1746 ++ * When add_timer_on() enqueues a timer into the timer wheel of an
1747 ++ * idle CPU then this timer might expire before the next timer event
1748 ++ * which is scheduled to wake up that CPU. In case of a completely
1749 ++ * idle system the next event might even be infinite time into the
1750 ++ * future. wake_up_idle_cpu() ensures that the CPU is woken up and
1751 ++ * leaves the inner idle loop so the newly added timer is taken into
1752 ++ * account when the CPU goes back to idle and evaluates the timer
1753 ++ * wheel for the next timer event.
1754 ++ */
1755 ++static inline void wake_up_idle_cpu(int cpu)
1756 ++{
1757 ++ struct rq *rq = cpu_rq(cpu);
1758 ++
1759 ++ if (cpu == smp_processor_id())
1760 ++ return;
1761 ++
1762 ++ if (set_nr_and_not_polling(rq->idle))
1763 ++ smp_send_reschedule(cpu);
1764 ++ else
1765 ++ trace_sched_wake_idle_without_ipi(cpu);
1766 ++}
1767 ++
1768 ++static inline bool wake_up_full_nohz_cpu(int cpu)
1769 ++{
1770 ++ /*
1771 ++ * We just need the target to call irq_exit() and re-evaluate
1772 ++ * the next tick. The nohz full kick at least implies that.
1773 ++ * If needed we can still optimize that later with an
1774 ++ * empty IRQ.
1775 ++ */
1776 ++ if (cpu_is_offline(cpu))
1777 ++ return true; /* Don't try to wake offline CPUs. */
1778 ++ if (tick_nohz_full_cpu(cpu)) {
1779 ++ if (cpu != smp_processor_id() ||
1780 ++ tick_nohz_tick_stopped())
1781 ++ tick_nohz_full_kick_cpu(cpu);
1782 ++ return true;
1783 ++ }
1784 ++
1785 ++ return false;
1786 ++}
1787 ++
1788 ++void wake_up_nohz_cpu(int cpu)
1789 ++{
1790 ++ if (!wake_up_full_nohz_cpu(cpu))
1791 ++ wake_up_idle_cpu(cpu);
1792 ++}
1793 ++
1794 ++static void nohz_csd_func(void *info)
1795 ++{
1796 ++ struct rq *rq = info;
1797 ++ int cpu = cpu_of(rq);
1798 ++ unsigned int flags;
1799 ++
1800 ++ /*
1801 ++ * Release the rq::nohz_csd.
1802 ++ */
1803 ++ flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));
1804 ++ WARN_ON(!(flags & NOHZ_KICK_MASK));
1805 ++
1806 ++ rq->idle_balance = idle_cpu(cpu);
1807 ++ if (rq->idle_balance && !need_resched()) {
1808 ++ rq->nohz_idle_balance = flags;
1809 ++ raise_softirq_irqoff(SCHED_SOFTIRQ);
1810 ++ }
1811 ++}
1812 ++
1813 ++#endif /* CONFIG_NO_HZ_COMMON */
1814 ++#endif /* CONFIG_SMP */
1815 ++
1816 ++static inline void check_preempt_curr(struct rq *rq)
1817 ++{
1818 ++ if (sched_rq_first_task(rq) != rq->curr)
1819 ++ resched_curr(rq);
1820 ++}
1821 ++
1822 ++#ifdef CONFIG_SCHED_HRTICK
1823 ++/*
1824 ++ * Use HR-timers to deliver accurate preemption points.
1825 ++ */
1826 ++
1827 ++static void hrtick_clear(struct rq *rq)
1828 ++{
1829 ++ if (hrtimer_active(&rq->hrtick_timer))
1830 ++ hrtimer_cancel(&rq->hrtick_timer);
1831 ++}
1832 ++
1833 ++/*
1834 ++ * High-resolution timer tick.
1835 ++ * Runs from hardirq context with interrupts disabled.
1836 ++ */
1837 ++static enum hrtimer_restart hrtick(struct hrtimer *timer)
1838 ++{
1839 ++ struct rq *rq = container_of(timer, struct rq, hrtick_timer);
1840 ++
1841 ++ WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
1842 ++
1843 ++ raw_spin_lock(&rq->lock);
1844 ++ resched_curr(rq);
1845 ++ raw_spin_unlock(&rq->lock);
1846 ++
1847 ++ return HRTIMER_NORESTART;
1848 ++}
1849 ++
1850 ++/*
1851 ++ * Use hrtick when:
1852 ++ * - enabled by features
1853 ++ * - hrtimer is actually high res
1854 ++ */
1855 ++static inline int hrtick_enabled(struct rq *rq)
1856 ++{
1857 ++ /**
1858 ++ * Alt schedule FW doesn't support sched_feat yet
1859 ++ if (!sched_feat(HRTICK))
1860 ++ return 0;
1861 ++ */
1862 ++ if (!cpu_active(cpu_of(rq)))
1863 ++ return 0;
1864 ++ return hrtimer_is_hres_active(&rq->hrtick_timer);
1865 ++}
1866 ++
1867 ++#ifdef CONFIG_SMP
1868 ++
1869 ++static void __hrtick_restart(struct rq *rq)
1870 ++{
1871 ++ struct hrtimer *timer = &rq->hrtick_timer;
1872 ++ ktime_t time = rq->hrtick_time;
1873 ++
1874 ++ hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
1875 ++}
1876 ++
1877 ++/*
1878 ++ * called from hardirq (IPI) context
1879 ++ */
1880 ++static void __hrtick_start(void *arg)
1881 ++{
1882 ++ struct rq *rq = arg;
1883 ++
1884 ++ raw_spin_lock(&rq->lock);
1885 ++ __hrtick_restart(rq);
1886 ++ raw_spin_unlock(&rq->lock);
1887 ++}
1888 ++
1889 ++/*
1890 ++ * Called to set the hrtick timer state.
1891 ++ *
1892 ++ * called with rq->lock held and irqs disabled
1893 ++ */
1894 ++void hrtick_start(struct rq *rq, u64 delay)
1895 ++{
1896 ++ struct hrtimer *timer = &rq->hrtick_timer;
1897 ++ s64 delta;
1898 ++
1899 ++ /*
1900 ++ * Don't schedule slices shorter than 10000ns, that just
1901 ++ * doesn't make sense and can cause timer DoS.
1902 ++ */
1903 ++ delta = max_t(s64, delay, 10000LL);
1904 ++
1905 ++ rq->hrtick_time = ktime_add_ns(timer->base->get_time(), delta);
1906 ++
1907 ++ if (rq == this_rq())
1908 ++ __hrtick_restart(rq);
1909 ++ else
1910 ++ smp_call_function_single_async(cpu_of(rq), &rq->hrtick_csd);
1911 ++}
1912 ++
1913 ++#else
1914 ++/*
1915 ++ * Called to set the hrtick timer state.
1916 ++ *
1917 ++ * called with rq->lock held and irqs disabled
1918 ++ */
1919 ++void hrtick_start(struct rq *rq, u64 delay)
1920 ++{
1921 ++ /*
1922 ++ * Don't schedule slices shorter than 10000ns, that just
1923 ++ * doesn't make sense. Rely on vruntime for fairness.
1924 ++ */
1925 ++ delay = max_t(u64, delay, 10000LL);
1926 ++ hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay),
1927 ++ HRTIMER_MODE_REL_PINNED_HARD);
1928 ++}
1929 ++#endif /* CONFIG_SMP */
1930 ++
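Both the SMP and UP variants above clamp the requested slice to a 10000 ns floor, so a very short preemption request is silently rounded up. For example (illustrative only):

    hrtick_start(rq, 3000);    /* asks for 3 us ... */
    /* ... but the timer is armed 10000 ns out because of the clamp */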
1931 ++static void hrtick_rq_init(struct rq *rq)
1932 ++{
1933 ++#ifdef CONFIG_SMP
1934 ++ INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
1935 ++#endif
1936 ++
1937 ++ hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
1938 ++ rq->hrtick_timer.function = hrtick;
1939 ++}
1940 ++#else /* CONFIG_SCHED_HRTICK */
1941 ++static inline int hrtick_enabled(struct rq *rq)
1942 ++{
1943 ++ return 0;
1944 ++}
1945 ++
1946 ++static inline void hrtick_clear(struct rq *rq)
1947 ++{
1948 ++}
1949 ++
1950 ++static inline void hrtick_rq_init(struct rq *rq)
1951 ++{
1952 ++}
1953 ++#endif /* CONFIG_SCHED_HRTICK */
1954 ++
1955 ++static inline int __normal_prio(int policy, int rt_prio, int static_prio)
1956 ++{
1957 ++ return rt_policy(policy) ? (MAX_RT_PRIO - 1 - rt_prio) :
1958 ++ static_prio + MAX_PRIORITY_ADJ;
1959 ++}
1960 ++
1961 ++/*
1962 ++ * Calculate the expected normal priority: i.e. priority
1963 ++ * without taking RT-inheritance into account. Might be
1964 ++ * boosted by interactivity modifiers. Changes upon fork,
1965 ++ * setprio syscalls, and whenever the interactivity
1966 ++ * estimator recalculates.
1967 ++ */
1968 ++static inline int normal_prio(struct task_struct *p)
1969 ++{
1970 ++ return __normal_prio(p->policy, p->rt_priority, p->static_prio);
1971 ++}
1972 ++
1973 ++/*
1974 ++ * Calculate the current priority, i.e. the priority
1975 ++ * taken into account by the scheduler. This value might
1976 ++ * be boosted by RT tasks as it will be RT if the task got
1977 ++ * RT-boosted. If not then it returns p->normal_prio.
1978 ++ */
1979 ++static int effective_prio(struct task_struct *p)
1980 ++{
1981 ++ p->normal_prio = normal_prio(p);
1982 ++ /*
1983 ++ * If we are RT tasks or we were boosted to RT priority,
1984 ++ * keep the priority unchanged. Otherwise, update priority
1985 ++ * to the normal priority:
1986 ++ */
1987 ++ if (!rt_prio(p->prio))
1988 ++ return p->normal_prio;
1989 ++ return p->prio;
1990 ++}
1991 ++
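Two worked examples of the mapping above, assuming the mainline value MAX_RT_PRIO == 100 (MAX_PRIORITY_ADJ is defined elsewhere in this patch, so it is left symbolic):

    /* SCHED_FIFO task with rt_priority = 50:
     *   __normal_prio() = MAX_RT_PRIO - 1 - 50 = 100 - 1 - 50 = 49
     *
     * SCHED_NORMAL task at nice 0 (static_prio = 120):
     *   __normal_prio() = 120 + MAX_PRIORITY_ADJ
     */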
1992 ++/*
1993 ++ * activate_task - move a task to the runqueue.
1994 ++ *
1995 ++ * Context: rq->lock
1996 ++ */
1997 ++static void activate_task(struct task_struct *p, struct rq *rq)
1998 ++{
1999 ++ enqueue_task(p, rq, ENQUEUE_WAKEUP);
2000 ++ p->on_rq = TASK_ON_RQ_QUEUED;
2001 ++
2002 ++ /*
2003 ++ * If in_iowait is set, the code below may not trigger any cpufreq
2004 ++ * utilization updates, so do it here explicitly with the IOWAIT flag
2005 ++ * passed.
2006 ++ */
2007 ++ cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT * p->in_iowait);
2008 ++}
2009 ++
2010 ++/*
2011 ++ * deactivate_task - remove a task from the runqueue.
2012 ++ *
2013 ++ * Context: rq->lock
2014 ++ */
2015 ++static inline void deactivate_task(struct task_struct *p, struct rq *rq)
2016 ++{
2017 ++ dequeue_task(p, rq, DEQUEUE_SLEEP);
2018 ++ p->on_rq = 0;
2019 ++ cpufreq_update_util(rq, 0);
2020 ++}
2021 ++
2022 ++static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
2023 ++{
2024 ++#ifdef CONFIG_SMP
2025 ++ /*
2026 ++ * After ->cpu is set up to a new value, task_access_lock(p, ...) can be
2027 ++ * successfully executed on another CPU. We must ensure that updates of
2028 ++ * per-task data have been completed by this moment.
2029 ++ */
2030 ++ smp_wmb();
2031 ++
2032 ++#ifdef CONFIG_THREAD_INFO_IN_TASK
2033 ++ WRITE_ONCE(p->cpu, cpu);
2034 ++#else
2035 ++ WRITE_ONCE(task_thread_info(p)->cpu, cpu);
2036 ++#endif
2037 ++#endif
2038 ++}
2039 ++
2040 ++static inline bool is_migration_disabled(struct task_struct *p)
2041 ++{
2042 ++#ifdef CONFIG_SMP
2043 ++ return p->migration_disabled;
2044 ++#else
2045 ++ return false;
2046 ++#endif
2047 ++}
2048 ++
2049 ++#define SCA_CHECK 0x01
2050 ++#define SCA_USER 0x08
2051 ++
2052 ++#ifdef CONFIG_SMP
2053 ++
2054 ++void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
2055 ++{
2056 ++#ifdef CONFIG_SCHED_DEBUG
2057 ++ unsigned int state = READ_ONCE(p->__state);
2058 ++
2059 ++ /*
2060 ++ * We should never call set_task_cpu() on a blocked task,
2061 ++ * ttwu() will sort out the placement.
2062 ++ */
2063 ++ WARN_ON_ONCE(state != TASK_RUNNING && state != TASK_WAKING && !p->on_rq);
2064 ++
2065 ++#ifdef CONFIG_LOCKDEP
2066 ++ /*
2067 ++ * The caller should hold either p->pi_lock or rq->lock, when changing
2068 ++ * a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.
2069 ++ *
2070 ++ * sched_move_task() holds both and thus holding either pins the cgroup,
2071 ++ * see task_group().
2072 ++ */
2073 ++ WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
2074 ++ lockdep_is_held(&task_rq(p)->lock)));
2075 ++#endif
2076 ++ /*
2077 ++ * Clearly, migrating tasks to offline CPUs is a fairly daft thing.
2078 ++ */
2079 ++ WARN_ON_ONCE(!cpu_online(new_cpu));
2080 ++
2081 ++ WARN_ON_ONCE(is_migration_disabled(p));
2082 ++#endif
2083 ++ if (task_cpu(p) == new_cpu)
2084 ++ return;
2085 ++ trace_sched_migrate_task(p, new_cpu);
2086 ++ rseq_migrate(p);
2087 ++ perf_event_task_migrate(p);
2088 ++
2089 ++ __set_task_cpu(p, new_cpu);
2090 ++}
2091 ++
2092 ++#define MDF_FORCE_ENABLED 0x80
2093 ++
2094 ++static void
2095 ++__do_set_cpus_ptr(struct task_struct *p, const struct cpumask *new_mask)
2096 ++{
2097 ++ /*
2098 ++ * This here violates the locking rules for affinity, since we're only
2099 ++ * supposed to change these variables while holding both rq->lock and
2100 ++ * p->pi_lock.
2101 ++ *
2102 ++ * HOWEVER, it magically works, because ttwu() is the only code that
2103 ++ * accesses these variables under p->pi_lock and only does so after
2104 ++ * smp_cond_load_acquire(&p->on_cpu, !VAL), and we're in __schedule()
2105 ++ * before finish_task().
2106 ++ *
2107 ++ * XXX do further audits, this smells like something putrid.
2108 ++ */
2109 ++ SCHED_WARN_ON(!p->on_cpu);
2110 ++ p->cpus_ptr = new_mask;
2111 ++}
2112 ++
2113 ++void migrate_disable(void)
2114 ++{
2115 ++ struct task_struct *p = current;
2116 ++ int cpu;
2117 ++
2118 ++ if (p->migration_disabled) {
2119 ++ p->migration_disabled++;
2120 ++ return;
2121 ++ }
2122 ++
2123 ++ preempt_disable();
2124 ++ cpu = smp_processor_id();
2125 ++ if (cpumask_test_cpu(cpu, &p->cpus_mask)) {
2126 ++ cpu_rq(cpu)->nr_pinned++;
2127 ++ p->migration_disabled = 1;
2128 ++ p->migration_flags &= ~MDF_FORCE_ENABLED;
2129 ++
2130 ++ /*
2131 ++ * Violates locking rules! see comment in __do_set_cpus_ptr().
2132 ++ */
2133 ++ if (p->cpus_ptr == &p->cpus_mask)
2134 ++ __do_set_cpus_ptr(p, cpumask_of(cpu));
2135 ++ }
2136 ++ preempt_enable();
2137 ++}
2138 ++EXPORT_SYMBOL_GPL(migrate_disable);
2139 ++
2140 ++void migrate_enable(void)
2141 ++{
2142 ++ struct task_struct *p = current;
2143 ++
2144 ++ if (0 == p->migration_disabled)
2145 ++ return;
2146 ++
2147 ++ if (p->migration_disabled > 1) {
2148 ++ p->migration_disabled--;
2149 ++ return;
2150 ++ }
2151 ++
2152 ++ /*
2153 ++ * Ensure stop_task runs either before or after this, and that
2154 ++ * __set_cpus_allowed_ptr(SCA_MIGRATE_ENABLE) doesn't schedule().
2155 ++ */
2156 ++ preempt_disable();
2157 ++ /*
2158 ++ * Assumption: current should be running on allowed cpu
2159 ++ */
2160 ++ WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &p->cpus_mask));
2161 ++ if (p->cpus_ptr != &p->cpus_mask)
2162 ++ __do_set_cpus_ptr(p, &p->cpus_mask);
2163 ++ /*
2164 ++ * Mustn't clear migration_disabled() until cpus_ptr points back at the
2165 ++ * regular cpus_mask, otherwise things that race (eg.
2166 ++ * select_fallback_rq) get confused.
2167 ++ */
2168 ++ barrier();
2169 ++ p->migration_disabled = 0;
2170 ++ this_rq()->nr_pinned--;
2171 ++ preempt_enable();
2172 ++}
2173 ++EXPORT_SYMBOL_GPL(migrate_enable);
2174 ++
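The pair above is a nestable pin-to-this-CPU mechanism: the outermost migrate_disable() bumps rq->nr_pinned and narrows cpus_ptr to the current CPU, and only the matching outermost migrate_enable() undoes that. A minimal usage sketch (illustrative only, not part of the patch):

    migrate_disable();        /* pin current task to this CPU */
    /* ... touch per-CPU state that must not move ... */
    migrate_disable();        /* nesting is just a counter bump */
    migrate_enable();
    migrate_enable();         /* outermost enable restores cpus_ptr */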
2175 ++static inline bool rq_has_pinned_tasks(struct rq *rq)
2176 ++{
2177 ++ return rq->nr_pinned;
2178 ++}
2179 ++
2180 ++/*
2181 ++ * Per-CPU kthreads are allowed to run on !active && online CPUs, see
2182 ++ * __set_cpus_allowed_ptr() and select_fallback_rq().
2183 ++ */
2184 ++static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
2185 ++{
2186 ++ /* When not in the task's cpumask, no point in looking further. */
2187 ++ if (!cpumask_test_cpu(cpu, p->cpus_ptr))
2188 ++ return false;
2189 ++
2190 ++ /* migrate_disabled() must be allowed to finish. */
2191 ++ if (is_migration_disabled(p))
2192 ++ return cpu_online(cpu);
2193 ++
2194 ++	/* Non-kernel threads are not allowed during either online or offline. */
2195 ++ if (!(p->flags & PF_KTHREAD))
2196 ++ return cpu_active(cpu) && task_cpu_possible(cpu, p);
2197 ++
2198 ++ /* KTHREAD_IS_PER_CPU is always allowed. */
2199 ++ if (kthread_is_per_cpu(p))
2200 ++ return cpu_online(cpu);
2201 ++
2202 ++ /* Regular kernel threads don't get to stay during offline. */
2203 ++ if (cpu_dying(cpu))
2204 ++ return false;
2205 ++
2206 ++ /* But are allowed during online. */
2207 ++ return cpu_online(cpu);
2208 ++}
2209 ++
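A compact way to read the checks above (an illustrative summary, not part of the patch):

    /* cpu is allowed for p only if cpu is in p->cpus_ptr, and then:
     *   - migration-disabled task:  cpu_online(cpu)
     *   - user task (!PF_KTHREAD):  cpu_active(cpu) && task_cpu_possible(cpu, p)
     *   - per-CPU kthread:          cpu_online(cpu)
     *   - other kernel thread:      cpu_online(cpu) && !cpu_dying(cpu)
     */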
2210 ++/*
2211 ++ * This is how migration works:
2212 ++ *
2213 ++ * 1) we invoke migration_cpu_stop() on the target CPU using
2214 ++ * stop_one_cpu().
2215 ++ * 2) stopper starts to run (implicitly forcing the migrated thread
2216 ++ * off the CPU)
2217 ++ * 3) it checks whether the migrated task is still in the wrong runqueue.
2218 ++ * 4) if it's in the wrong runqueue then the migration thread removes
2219 ++ * it and puts it into the right queue.
2220 ++ * 5) stopper completes and stop_one_cpu() returns and the migration
2221 ++ * is done.
2222 ++ */
2223 ++
2224 ++/*
2225 ++ * move_queued_task - move a queued task to new rq.
2226 ++ *
2227 ++ * Returns (locked) new rq. Old rq's lock is released.
2228 ++ */
2229 ++static struct rq *move_queued_task(struct rq *rq, struct task_struct *p, int
2230 ++ new_cpu)
2231 ++{
2232 ++ lockdep_assert_held(&rq->lock);
2233 ++
2234 ++ WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
2235 ++ dequeue_task(p, rq, 0);
2236 ++ set_task_cpu(p, new_cpu);
2237 ++ raw_spin_unlock(&rq->lock);
2238 ++
2239 ++ rq = cpu_rq(new_cpu);
2240 ++
2241 ++ raw_spin_lock(&rq->lock);
2242 ++ BUG_ON(task_cpu(p) != new_cpu);
2243 ++ sched_task_sanity_check(p, rq);
2244 ++ enqueue_task(p, rq, 0);
2245 ++ p->on_rq = TASK_ON_RQ_QUEUED;
2246 ++ check_preempt_curr(rq);
2247 ++
2248 ++ return rq;
2249 ++}
2250 ++
2251 ++struct migration_arg {
2252 ++ struct task_struct *task;
2253 ++ int dest_cpu;
2254 ++};
2255 ++
2256 ++/*
2257 ++ * Move (not current) task off this CPU, onto the destination CPU. We're doing
2258 ++ * this because either it can't run here any more (set_cpus_allowed()
2259 ++ * away from this CPU, or CPU going down), or because we're
2260 ++ * attempting to rebalance this task on exec (sched_exec).
2261 ++ *
2262 ++ * So we race with normal scheduler movements, but that's OK, as long
2263 ++ * as the task is no longer on this CPU.
2264 ++ */
2265 ++static struct rq *__migrate_task(struct rq *rq, struct task_struct *p, int
2266 ++ dest_cpu)
2267 ++{
2268 ++ /* Affinity changed (again). */
2269 ++ if (!is_cpu_allowed(p, dest_cpu))
2270 ++ return rq;
2271 ++
2272 ++ update_rq_clock(rq);
2273 ++ return move_queued_task(rq, p, dest_cpu);
2274 ++}
2275 ++
2276 ++/*
2277 ++ * migration_cpu_stop - this will be executed by a highprio stopper thread
2278 ++ * and performs thread migration by bumping thread off CPU then
2279 ++ * 'pushing' onto another runqueue.
2280 ++ */
2281 ++static int migration_cpu_stop(void *data)
2282 ++{
2283 ++ struct migration_arg *arg = data;
2284 ++ struct task_struct *p = arg->task;
2285 ++ struct rq *rq = this_rq();
2286 ++ unsigned long flags;
2287 ++
2288 ++ /*
2289 ++ * The original target CPU might have gone down and we might
2290 ++ * be on another CPU but it doesn't matter.
2291 ++ */
2292 ++ local_irq_save(flags);
2293 ++ /*
2294 ++ * We need to explicitly wake pending tasks before running
2295 ++ * __migrate_task() such that we will not miss enforcing cpus_ptr
2296 ++ * during wakeups, see set_cpus_allowed_ptr()'s TASK_WAKING test.
2297 ++ */
2298 ++ flush_smp_call_function_from_idle();
2299 ++
2300 ++ raw_spin_lock(&p->pi_lock);
2301 ++ raw_spin_lock(&rq->lock);
2302 ++ /*
2303 ++ * If task_rq(p) != rq, it cannot be migrated here, because we're
2304 ++ * holding rq->lock, if p->on_rq == 0 it cannot get enqueued because
2305 ++ * we're holding p->pi_lock.
2306 ++ */
2307 ++ if (task_rq(p) == rq && task_on_rq_queued(p))
2308 ++ rq = __migrate_task(rq, p, arg->dest_cpu);
2309 ++ raw_spin_unlock(&rq->lock);
2310 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
2311 ++
2312 ++ return 0;
2313 ++}
2314 ++
2315 ++static inline void
2316 ++set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_mask)
2317 ++{
2318 ++ cpumask_copy(&p->cpus_mask, new_mask);
2319 ++ p->nr_cpus_allowed = cpumask_weight(new_mask);
2320 ++}
2321 ++
2322 ++static void
2323 ++__do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
2324 ++{
2325 ++ lockdep_assert_held(&p->pi_lock);
2326 ++ set_cpus_allowed_common(p, new_mask);
2327 ++}
2328 ++
2329 ++void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
2330 ++{
2331 ++ __do_set_cpus_allowed(p, new_mask);
2332 ++}
2333 ++
2334 ++int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src,
2335 ++ int node)
2336 ++{
2337 ++ if (!src->user_cpus_ptr)
2338 ++ return 0;
2339 ++
2340 ++ dst->user_cpus_ptr = kmalloc_node(cpumask_size(), GFP_KERNEL, node);
2341 ++ if (!dst->user_cpus_ptr)
2342 ++ return -ENOMEM;
2343 ++
2344 ++ cpumask_copy(dst->user_cpus_ptr, src->user_cpus_ptr);
2345 ++ return 0;
2346 ++}
2347 ++
2348 ++static inline struct cpumask *clear_user_cpus_ptr(struct task_struct *p)
2349 ++{
2350 ++ struct cpumask *user_mask = NULL;
2351 ++
2352 ++ swap(p->user_cpus_ptr, user_mask);
2353 ++
2354 ++ return user_mask;
2355 ++}
2356 ++
2357 ++void release_user_cpus_ptr(struct task_struct *p)
2358 ++{
2359 ++ kfree(clear_user_cpus_ptr(p));
2360 ++}
2361 ++
2362 ++#endif
2363 ++
2364 ++/**
2365 ++ * task_curr - is this task currently executing on a CPU?
2366 ++ * @p: the task in question.
2367 ++ *
2368 ++ * Return: 1 if the task is currently executing. 0 otherwise.
2369 ++ */
2370 ++inline int task_curr(const struct task_struct *p)
2371 ++{
2372 ++ return cpu_curr(task_cpu(p)) == p;
2373 ++}
2374 ++
2375 ++#ifdef CONFIG_SMP
2376 ++/*
2377 ++ * wait_task_inactive - wait for a thread to unschedule.
2378 ++ *
2379 ++ * If @match_state is nonzero, it's the @p->state value just checked and
2380 ++ * not expected to change. If it changes, i.e. @p might have woken up,
2381 ++ * then return zero. When we succeed in waiting for @p to be off its CPU,
2382 ++ * we return a positive number (its total switch count). If a second call
2383 ++ * a short while later returns the same number, the caller can be sure that
2384 ++ * @p has remained unscheduled the whole time.
2385 ++ *
2386 ++ * The caller must ensure that the task *will* unschedule sometime soon,
2387 ++ * else this function might spin for a *long* time. This function can't
2388 ++ * be called with interrupts off, or it may introduce deadlock with
2389 ++ * smp_call_function() if an IPI is sent by the same process we are
2390 ++ * waiting to become inactive.
2391 ++ */
2392 ++unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state)
2393 ++{
2394 ++ unsigned long flags;
2395 ++ bool running, on_rq;
2396 ++ unsigned long ncsw;
2397 ++ struct rq *rq;
2398 ++ raw_spinlock_t *lock;
2399 ++
2400 ++ for (;;) {
2401 ++ rq = task_rq(p);
2402 ++
2403 ++ /*
2404 ++ * If the task is actively running on another CPU
2405 ++ * still, just relax and busy-wait without holding
2406 ++ * any locks.
2407 ++ *
2408 ++ * NOTE! Since we don't hold any locks, it's not
2409 ++ * even sure that "rq" stays as the right runqueue!
2410 ++ * But we don't care, since this will return false
2411 ++ * if the runqueue has changed and p is actually now
2412 ++ * running somewhere else!
2413 ++ */
2414 ++ while (task_running(p) && p == rq->curr) {
2415 ++ if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
2416 ++ return 0;
2417 ++ cpu_relax();
2418 ++ }
2419 ++
2420 ++ /*
2421 ++ * Ok, time to look more closely! We need the rq
2422 ++ * lock now, to be *sure*. If we're wrong, we'll
2423 ++ * just go back and repeat.
2424 ++ */
2425 ++ task_access_lock_irqsave(p, &lock, &flags);
2426 ++ trace_sched_wait_task(p);
2427 ++ running = task_running(p);
2428 ++ on_rq = p->on_rq;
2429 ++ ncsw = 0;
2430 ++ if (!match_state || READ_ONCE(p->__state) == match_state)
2431 ++ ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
2432 ++ task_access_unlock_irqrestore(p, lock, &flags);
2433 ++
2434 ++ /*
2435 ++ * If it changed from the expected state, bail out now.
2436 ++ */
2437 ++ if (unlikely(!ncsw))
2438 ++ break;
2439 ++
2440 ++ /*
2441 ++ * Was it really running after all now that we
2442 ++ * checked with the proper locks actually held?
2443 ++ *
2444 ++ * Oops. Go back and try again..
2445 ++ */
2446 ++ if (unlikely(running)) {
2447 ++ cpu_relax();
2448 ++ continue;
2449 ++ }
2450 ++
2451 ++ /*
2452 ++ * It's not enough that it's not actively running,
2453 ++ * it must be off the runqueue _entirely_, and not
2454 ++ * preempted!
2455 ++ *
2456 ++ * So if it was still runnable (but just not actively
2457 ++ * running right now), it's preempted, and we should
2458 ++ * yield - it could be a while.
2459 ++ */
2460 ++ if (unlikely(on_rq)) {
2461 ++ ktime_t to = NSEC_PER_SEC / HZ;
2462 ++
2463 ++ set_current_state(TASK_UNINTERRUPTIBLE);
2464 ++ schedule_hrtimeout(&to, HRTIMER_MODE_REL);
2465 ++ continue;
2466 ++ }
2467 ++
2468 ++ /*
2469 ++ * Ahh, all good. It wasn't running, and it wasn't
2470 ++ * runnable, which means that it will never become
2471 ++ * running in the future either. We're all done!
2472 ++ */
2473 ++ break;
2474 ++ }
2475 ++
2476 ++ return ncsw;
2477 ++}
2478 ++
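The "call it twice and compare" usage described in the comment above looks roughly like this in a caller (a hedged sketch; the task pointer and the sleep state are hypothetical):

    unsigned long ncsw;

    ncsw = wait_task_inactive(p, TASK_UNINTERRUPTIBLE);
    if (!ncsw)
        return;        /* state changed; p may have woken up */

    /* ... some time later ... */
    if (wait_task_inactive(p, TASK_UNINTERRUPTIBLE) != ncsw)
        pr_debug("task ran in between\n");
    /* equal counts mean p stayed off the CPU the whole time */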
2479 ++/***
2480 ++ * kick_process - kick a running thread to enter/exit the kernel
2481 ++ * @p: the to-be-kicked thread
2482 ++ *
2483 ++ * Cause a process which is running on another CPU to enter
2484 ++ * kernel-mode, without any delay. (to get signals handled.)
2485 ++ *
2486 ++ * NOTE: this function doesn't have to take the runqueue lock,
2487 ++ * because all it wants to ensure is that the remote task enters
2488 ++ * the kernel. If the IPI races and the task has been migrated
2489 ++ * to another CPU then no harm is done and the purpose has been
2490 ++ * achieved as well.
2491 ++ */
2492 ++void kick_process(struct task_struct *p)
2493 ++{
2494 ++ int cpu;
2495 ++
2496 ++ preempt_disable();
2497 ++ cpu = task_cpu(p);
2498 ++ if ((cpu != smp_processor_id()) && task_curr(p))
2499 ++ smp_send_reschedule(cpu);
2500 ++ preempt_enable();
2501 ++}
2502 ++EXPORT_SYMBOL_GPL(kick_process);
2503 ++
2504 ++/*
2505 ++ * ->cpus_ptr is protected by both rq->lock and p->pi_lock
2506 ++ *
2507 ++ * A few notes on cpu_active vs cpu_online:
2508 ++ *
2509 ++ * - cpu_active must be a subset of cpu_online
2510 ++ *
2511 ++ * - on CPU-up we allow per-CPU kthreads on the online && !active CPU,
2512 ++ * see __set_cpus_allowed_ptr(). At this point the newly online
2513 ++ * CPU isn't yet part of the sched domains, and balancing will not
2514 ++ * see it.
2515 ++ *
2516 ++ * - on cpu-down we clear cpu_active() to mask the sched domains and
2517 ++ *   prevent the load balancer from placing new tasks on the to-be-removed
2518 ++ * CPU. Existing tasks will remain running there and will be taken
2519 ++ * off.
2520 ++ *
2521 ++ * This means that fallback selection must not select !active CPUs.
2522 ++ * And can assume that any active CPU must be online. Conversely
2523 ++ * select_task_rq() below may allow selection of !active CPUs in order
2524 ++ * to satisfy the above rules.
2525 ++ */
2526 ++static int select_fallback_rq(int cpu, struct task_struct *p)
2527 ++{
2528 ++ int nid = cpu_to_node(cpu);
2529 ++ const struct cpumask *nodemask = NULL;
2530 ++ enum { cpuset, possible, fail } state = cpuset;
2531 ++ int dest_cpu;
2532 ++
2533 ++ /*
2534 ++ * If the node that the CPU is on has been offlined, cpu_to_node()
2535 ++ * will return -1. There is no CPU on the node, and we should
2536 ++	 * select a CPU on another node.
2537 ++ */
2538 ++ if (nid != -1) {
2539 ++ nodemask = cpumask_of_node(nid);
2540 ++
2541 ++ /* Look for allowed, online CPU in same node. */
2542 ++ for_each_cpu(dest_cpu, nodemask) {
2543 ++ if (is_cpu_allowed(p, dest_cpu))
2544 ++ return dest_cpu;
2545 ++ }
2546 ++ }
2547 ++
2548 ++ for (;;) {
2549 ++ /* Any allowed, online CPU? */
2550 ++ for_each_cpu(dest_cpu, p->cpus_ptr) {
2551 ++ if (!is_cpu_allowed(p, dest_cpu))
2552 ++ continue;
2553 ++ goto out;
2554 ++ }
2555 ++
2556 ++ /* No more Mr. Nice Guy. */
2557 ++ switch (state) {
2558 ++ case cpuset:
2559 ++ if (cpuset_cpus_allowed_fallback(p)) {
2560 ++ state = possible;
2561 ++ break;
2562 ++ }
2563 ++ fallthrough;
2564 ++ case possible:
2565 ++ /*
2566 ++ * XXX When called from select_task_rq() we only
2567 ++ * hold p->pi_lock and again violate locking order.
2568 ++ *
2569 ++ * More yuck to audit.
2570 ++ */
2571 ++ do_set_cpus_allowed(p, task_cpu_possible_mask(p));
2572 ++ state = fail;
2573 ++ break;
2574 ++
2575 ++ case fail:
2576 ++ BUG();
2577 ++ break;
2578 ++ }
2579 ++ }
2580 ++
2581 ++out:
2582 ++ if (state != cpuset) {
2583 ++ /*
2584 ++ * Don't tell them about moving exiting tasks or
2585 ++ * kernel threads (both mm NULL), since they never
2586 ++ * leave kernel.
2587 ++ */
2588 ++ if (p->mm && printk_ratelimit()) {
2589 ++ printk_deferred("process %d (%s) no longer affine to cpu%d\n",
2590 ++ task_pid_nr(p), p->comm, cpu);
2591 ++ }
2592 ++ }
2593 ++
2594 ++ return dest_cpu;
2595 ++}
2596 ++
2597 ++static inline int select_task_rq(struct task_struct *p)
2598 ++{
2599 ++ cpumask_t chk_mask, tmp;
2600 ++
2601 ++ if (unlikely(!cpumask_and(&chk_mask, p->cpus_ptr, cpu_active_mask)))
2602 ++ return select_fallback_rq(task_cpu(p), p);
2603 ++
2604 ++ if (
2605 ++#ifdef CONFIG_SCHED_SMT
2606 ++ cpumask_and(&tmp, &chk_mask, &sched_sg_idle_mask) ||
2607 ++#endif
2608 ++ cpumask_and(&tmp, &chk_mask, sched_rq_watermark) ||
2609 ++ cpumask_and(&tmp, &chk_mask,
2610 ++ sched_rq_watermark + SCHED_BITS - task_sched_prio(p)))
2611 ++ return best_mask_cpu(task_cpu(p), &tmp);
2612 ++
2613 ++ return best_mask_cpu(task_cpu(p), &chk_mask);
2614 ++}
2615 ++
2616 ++void sched_set_stop_task(int cpu, struct task_struct *stop)
2617 ++{
2618 ++ static struct lock_class_key stop_pi_lock;
2619 ++ struct sched_param stop_param = { .sched_priority = STOP_PRIO };
2620 ++ struct sched_param start_param = { .sched_priority = 0 };
2621 ++ struct task_struct *old_stop = cpu_rq(cpu)->stop;
2622 ++
2623 ++ if (stop) {
2624 ++ /*
2625 ++		 * Make it appear like a SCHED_FIFO task; it's something
2626 ++ * userspace knows about and won't get confused about.
2627 ++ *
2628 ++ * Also, it will make PI more or less work without too
2629 ++ * much confusion -- but then, stop work should not
2630 ++ * rely on PI working anyway.
2631 ++ */
2632 ++ sched_setscheduler_nocheck(stop, SCHED_FIFO, &stop_param);
2633 ++
2634 ++ /*
2635 ++ * The PI code calls rt_mutex_setprio() with ->pi_lock held to
2636 ++ * adjust the effective priority of a task. As a result,
2637 ++ * rt_mutex_setprio() can trigger (RT) balancing operations,
2638 ++ * which can then trigger wakeups of the stop thread to push
2639 ++ * around the current task.
2640 ++ *
2641 ++ * The stop task itself will never be part of the PI-chain, it
2642 ++ * never blocks, therefore that ->pi_lock recursion is safe.
2643 ++ * Tell lockdep about this by placing the stop->pi_lock in its
2644 ++ * own class.
2645 ++ */
2646 ++ lockdep_set_class(&stop->pi_lock, &stop_pi_lock);
2647 ++ }
2648 ++
2649 ++ cpu_rq(cpu)->stop = stop;
2650 ++
2651 ++ if (old_stop) {
2652 ++ /*
2653 ++ * Reset it back to a normal scheduling policy so that
2654 ++ * it can die in pieces.
2655 ++ */
2656 ++ sched_setscheduler_nocheck(old_stop, SCHED_NORMAL, &start_param);
2657 ++ }
2658 ++}
2659 ++
2660 ++static int affine_move_task(struct rq *rq, struct task_struct *p, int dest_cpu,
2661 ++ raw_spinlock_t *lock, unsigned long irq_flags)
2662 ++{
2663 ++ /* Can the task run on the task's current CPU? If so, we're done */
2664 ++ if (!cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) {
2665 ++ if (p->migration_disabled) {
2666 ++ if (likely(p->cpus_ptr != &p->cpus_mask))
2667 ++ __do_set_cpus_ptr(p, &p->cpus_mask);
2668 ++ p->migration_disabled = 0;
2669 ++ p->migration_flags |= MDF_FORCE_ENABLED;
2670 ++ /* When p is migrate_disabled, rq->lock should be held */
2671 ++ rq->nr_pinned--;
2672 ++ }
2673 ++
2674 ++ if (task_running(p) || READ_ONCE(p->__state) == TASK_WAKING) {
2675 ++ struct migration_arg arg = { p, dest_cpu };
2676 ++
2677 ++ /* Need help from migration thread: drop lock and wait. */
2678 ++ __task_access_unlock(p, lock);
2679 ++ raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);
2680 ++ stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
2681 ++ return 0;
2682 ++ }
2683 ++ if (task_on_rq_queued(p)) {
2684 ++ /*
2685 ++ * OK, since we're going to drop the lock immediately
2686 ++ * afterwards anyway.
2687 ++ */
2688 ++ update_rq_clock(rq);
2689 ++ rq = move_queued_task(rq, p, dest_cpu);
2690 ++ lock = &rq->lock;
2691 ++ }
2692 ++ }
2693 ++ __task_access_unlock(p, lock);
2694 ++ raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);
2695 ++ return 0;
2696 ++}
2697 ++
2698 ++static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
2699 ++ const struct cpumask *new_mask,
2700 ++ u32 flags,
2701 ++ struct rq *rq,
2702 ++ raw_spinlock_t *lock,
2703 ++ unsigned long irq_flags)
2704 ++{
2705 ++ const struct cpumask *cpu_allowed_mask = task_cpu_possible_mask(p);
2706 ++ const struct cpumask *cpu_valid_mask = cpu_active_mask;
2707 ++ bool kthread = p->flags & PF_KTHREAD;
2708 ++ struct cpumask *user_mask = NULL;
2709 ++ int dest_cpu;
2710 ++ int ret = 0;
2711 ++
2712 ++ if (kthread || is_migration_disabled(p)) {
2713 ++ /*
2714 ++ * Kernel threads are allowed on online && !active CPUs,
2715 ++ * however, during cpu-hot-unplug, even these might get pushed
2716 ++ * away if not KTHREAD_IS_PER_CPU.
2717 ++ *
2718 ++ * Specifically, migration_disabled() tasks must not fail the
2719 ++ * cpumask_any_and_distribute() pick below, esp. so on
2720 ++ * SCA_MIGRATE_ENABLE, otherwise we'll not call
2721 ++ * set_cpus_allowed_common() and actually reset p->cpus_ptr.
2722 ++ */
2723 ++ cpu_valid_mask = cpu_online_mask;
2724 ++ }
2725 ++
2726 ++ if (!kthread && !cpumask_subset(new_mask, cpu_allowed_mask)) {
2727 ++ ret = -EINVAL;
2728 ++ goto out;
2729 ++ }
2730 ++
2731 ++ /*
2732 ++ * Must re-check here, to close a race against __kthread_bind(),
2733 ++ * sched_setaffinity() is not guaranteed to observe the flag.
2734 ++ */
2735 ++ if ((flags & SCA_CHECK) && (p->flags & PF_NO_SETAFFINITY)) {
2736 ++ ret = -EINVAL;
2737 ++ goto out;
2738 ++ }
2739 ++
2740 ++ if (cpumask_equal(&p->cpus_mask, new_mask))
2741 ++ goto out;
2742 ++
2743 ++ dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
2744 ++ if (dest_cpu >= nr_cpu_ids) {
2745 ++ ret = -EINVAL;
2746 ++ goto out;
2747 ++ }
2748 ++
2749 ++ __do_set_cpus_allowed(p, new_mask);
2750 ++
2751 ++ if (flags & SCA_USER)
2752 ++ user_mask = clear_user_cpus_ptr(p);
2753 ++
2754 ++ ret = affine_move_task(rq, p, dest_cpu, lock, irq_flags);
2755 ++
2756 ++ kfree(user_mask);
2757 ++
2758 ++ return ret;
2759 ++
2760 ++out:
2761 ++ __task_access_unlock(p, lock);
2762 ++ raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);
2763 ++
2764 ++ return ret;
2765 ++}
2766 ++
2767 ++/*
2768 ++ * Change a given task's CPU affinity. Migrate the thread to a
2769 ++ * proper CPU and schedule it away if the CPU it's executing on
2770 ++ * is removed from the allowed bitmask.
2771 ++ *
2772 ++ * NOTE: the caller must have a valid reference to the task, the
2773 ++ * task must not exit() & deallocate itself prematurely. The
2774 ++ * call is not atomic; no spinlocks may be held.
2775 ++ */
2776 ++static int __set_cpus_allowed_ptr(struct task_struct *p,
2777 ++ const struct cpumask *new_mask, u32 flags)
2778 ++{
2779 ++ unsigned long irq_flags;
2780 ++ struct rq *rq;
2781 ++ raw_spinlock_t *lock;
2782 ++
2783 ++ raw_spin_lock_irqsave(&p->pi_lock, irq_flags);
2784 ++ rq = __task_access_lock(p, &lock);
2785 ++
2786 ++ return __set_cpus_allowed_ptr_locked(p, new_mask, flags, rq, lock, irq_flags);
2787 ++}
2788 ++
2789 ++int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
2790 ++{
2791 ++ return __set_cpus_allowed_ptr(p, new_mask, 0);
2792 ++}
2793 ++EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
2794 ++
2795 ++/*
2796 ++ * Change a given task's CPU affinity to the intersection of its current
2797 ++ * affinity mask and @subset_mask, writing the resulting mask to @new_mask
2798 ++ * and pointing @p->user_cpus_ptr to a copy of the old mask.
2799 ++ * If the resulting mask is empty, leave the affinity unchanged and return
2800 ++ * -EINVAL.
2801 ++ */
2802 ++static int restrict_cpus_allowed_ptr(struct task_struct *p,
2803 ++ struct cpumask *new_mask,
2804 ++ const struct cpumask *subset_mask)
2805 ++{
2806 ++ struct cpumask *user_mask = NULL;
2807 ++ unsigned long irq_flags;
2808 ++ raw_spinlock_t *lock;
2809 ++ struct rq *rq;
2810 ++ int err;
2811 ++
2812 ++ if (!p->user_cpus_ptr) {
2813 ++ user_mask = kmalloc(cpumask_size(), GFP_KERNEL);
2814 ++ if (!user_mask)
2815 ++ return -ENOMEM;
2816 ++ }
2817 ++
2818 ++ raw_spin_lock_irqsave(&p->pi_lock, irq_flags);
2819 ++ rq = __task_access_lock(p, &lock);
2820 ++
2821 ++ if (!cpumask_and(new_mask, &p->cpus_mask, subset_mask)) {
2822 ++ err = -EINVAL;
2823 ++ goto err_unlock;
2824 ++ }
2825 ++
2826 ++ /*
2827 ++ * We're about to butcher the task affinity, so keep track of what
2828 ++ * the user asked for in case we're able to restore it later on.
2829 ++ */
2830 ++ if (user_mask) {
2831 ++ cpumask_copy(user_mask, p->cpus_ptr);
2832 ++ p->user_cpus_ptr = user_mask;
2833 ++ }
2834 ++
2835 ++ /*return __set_cpus_allowed_ptr_locked(p, new_mask, 0, rq, &rf);*/
2836 ++ return __set_cpus_allowed_ptr_locked(p, new_mask, 0, rq, lock, irq_flags);
2837 ++
2838 ++err_unlock:
2839 ++ __task_access_unlock(p, lock);
2840 ++ raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);
2841 ++ kfree(user_mask);
2842 ++ return err;
2843 ++}
2844 ++
2845 ++/*
2846 ++ * Restrict the CPU affinity of task @p so that it is a subset of
2847 ++ * task_cpu_possible_mask() and point @p->user_cpus_ptr to a copy of the
2848 ++ * old affinity mask. If the resulting mask is empty, we warn and walk
2849 ++ * up the cpuset hierarchy until we find a suitable mask.
2850 ++ */
2851 ++void force_compatible_cpus_allowed_ptr(struct task_struct *p)
2852 ++{
2853 ++ cpumask_var_t new_mask;
2854 ++ const struct cpumask *override_mask = task_cpu_possible_mask(p);
2855 ++
2856 ++ alloc_cpumask_var(&new_mask, GFP_KERNEL);
2857 ++
2858 ++ /*
2859 ++ * __migrate_task() can fail silently in the face of concurrent
2860 ++ * offlining of the chosen destination CPU, so take the hotplug
2861 ++ * lock to ensure that the migration succeeds.
2862 ++ */
2863 ++ cpus_read_lock();
2864 ++ if (!cpumask_available(new_mask))
2865 ++ goto out_set_mask;
2866 ++
2867 ++ if (!restrict_cpus_allowed_ptr(p, new_mask, override_mask))
2868 ++ goto out_free_mask;
2869 ++
2870 ++ /*
2871 ++ * We failed to find a valid subset of the affinity mask for the
2872 ++ * task, so override it based on its cpuset hierarchy.
2873 ++ */
2874 ++ cpuset_cpus_allowed(p, new_mask);
2875 ++ override_mask = new_mask;
2876 ++
2877 ++out_set_mask:
2878 ++ if (printk_ratelimit()) {
2879 ++ printk_deferred("Overriding affinity for process %d (%s) to CPUs %*pbl\n",
2880 ++ task_pid_nr(p), p->comm,
2881 ++ cpumask_pr_args(override_mask));
2882 ++ }
2883 ++
2884 ++ WARN_ON(set_cpus_allowed_ptr(p, override_mask));
2885 ++out_free_mask:
2886 ++ cpus_read_unlock();
2887 ++ free_cpumask_var(new_mask);
2888 ++}
2889 ++
2890 ++static int
2891 ++__sched_setaffinity(struct task_struct *p, const struct cpumask *mask);
2892 ++
2893 ++/*
2894 ++ * Restore the affinity of a task @p which was previously restricted by a
2895 ++ * call to force_compatible_cpus_allowed_ptr(). This will clear (and free)
2896 ++ * @p->user_cpus_ptr.
2897 ++ *
2898 ++ * It is the caller's responsibility to serialise this with any calls to
2899 ++ * force_compatible_cpus_allowed_ptr(@p).
2900 ++ */
2901 ++void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
2902 ++{
2903 ++ struct cpumask *user_mask = p->user_cpus_ptr;
2904 ++ unsigned long flags;
2905 ++
2906 ++ /*
2907 ++ * Try to restore the old affinity mask. If this fails, then
2908 ++ * we free the mask explicitly to avoid it being inherited across
2909 ++ * a subsequent fork().
2910 ++ */
2911 ++ if (!user_mask || !__sched_setaffinity(p, user_mask))
2912 ++ return;
2913 ++
2914 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
2915 ++ user_mask = clear_user_cpus_ptr(p);
2916 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
2917 ++
2918 ++ kfree(user_mask);
2919 ++}
2920 ++
2921 ++#else /* CONFIG_SMP */
2922 ++
2923 ++static inline int select_task_rq(struct task_struct *p)
2924 ++{
2925 ++ return 0;
2926 ++}
2927 ++
2928 ++static inline int
2929 ++__set_cpus_allowed_ptr(struct task_struct *p,
2930 ++ const struct cpumask *new_mask, u32 flags)
2931 ++{
2932 ++ return set_cpus_allowed_ptr(p, new_mask);
2933 ++}
2934 ++
2935 ++static inline bool rq_has_pinned_tasks(struct rq *rq)
2936 ++{
2937 ++ return false;
2938 ++}
2939 ++
2940 ++#endif /* !CONFIG_SMP */
2941 ++
2942 ++static void
2943 ++ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
2944 ++{
2945 ++ struct rq *rq;
2946 ++
2947 ++ if (!schedstat_enabled())
2948 ++ return;
2949 ++
2950 ++ rq = this_rq();
2951 ++
2952 ++#ifdef CONFIG_SMP
2953 ++ if (cpu == rq->cpu)
2954 ++ __schedstat_inc(rq->ttwu_local);
2955 ++ else {
2956 ++ /** Alt schedule FW ToDo:
2957 ++ * How to do ttwu_wake_remote
2958 ++ */
2959 ++ }
2960 ++#endif /* CONFIG_SMP */
2961 ++
2962 ++ __schedstat_inc(rq->ttwu_count);
2963 ++}
2964 ++
2965 ++/*
2966 ++ * Mark the task runnable and perform wakeup-preemption.
2967 ++ */
2968 ++static inline void
2969 ++ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
2970 ++{
2971 ++ check_preempt_curr(rq);
2972 ++ WRITE_ONCE(p->__state, TASK_RUNNING);
2973 ++ trace_sched_wakeup(p);
2974 ++}
2975 ++
2976 ++static inline void
2977 ++ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)
2978 ++{
2979 ++ if (p->sched_contributes_to_load)
2980 ++ rq->nr_uninterruptible--;
2981 ++
2982 ++ if (
2983 ++#ifdef CONFIG_SMP
2984 ++ !(wake_flags & WF_MIGRATED) &&
2985 ++#endif
2986 ++ p->in_iowait) {
2987 ++ delayacct_blkio_end(p);
2988 ++ atomic_dec(&task_rq(p)->nr_iowait);
2989 ++ }
2990 ++
2991 ++ activate_task(p, rq);
2992 ++ ttwu_do_wakeup(rq, p, 0);
2993 ++}
2994 ++
2995 ++/*
2996 ++ * Consider @p being inside a wait loop:
2997 ++ *
2998 ++ * for (;;) {
2999 ++ * set_current_state(TASK_UNINTERRUPTIBLE);
3000 ++ *
3001 ++ * if (CONDITION)
3002 ++ * break;
3003 ++ *
3004 ++ * schedule();
3005 ++ * }
3006 ++ * __set_current_state(TASK_RUNNING);
3007 ++ *
3008 ++ * between set_current_state() and schedule(). In this case @p is still
3009 ++ * runnable, so all that needs doing is change p->state back to TASK_RUNNING in
3010 ++ * an atomic manner.
3011 ++ *
3012 ++ * By taking task_rq(p)->lock we serialize against schedule(), if @p->on_rq
3013 ++ * then schedule() must still happen and p->state can be changed to
3014 ++ * TASK_RUNNING. Otherwise we lost the race, schedule() has happened, and we
3015 ++ * need to do a full wakeup with enqueue.
3016 ++ *
3017 ++ * Returns: %true when the wakeup is done,
3018 ++ * %false otherwise.
3019 ++ */
3020 ++static int ttwu_runnable(struct task_struct *p, int wake_flags)
3021 ++{
3022 ++ struct rq *rq;
3023 ++ raw_spinlock_t *lock;
3024 ++ int ret = 0;
3025 ++
3026 ++ rq = __task_access_lock(p, &lock);
3027 ++ if (task_on_rq_queued(p)) {
3028 ++ /* check_preempt_curr() may use rq clock */
3029 ++ update_rq_clock(rq);
3030 ++ ttwu_do_wakeup(rq, p, wake_flags);
3031 ++ ret = 1;
3032 ++ }
3033 ++ __task_access_unlock(p, lock);
3034 ++
3035 ++ return ret;
3036 ++}
3037 ++
3038 ++#ifdef CONFIG_SMP
3039 ++void sched_ttwu_pending(void *arg)
3040 ++{
3041 ++ struct llist_node *llist = arg;
3042 ++ struct rq *rq = this_rq();
3043 ++ struct task_struct *p, *t;
3044 ++ struct rq_flags rf;
3045 ++
3046 ++ if (!llist)
3047 ++ return;
3048 ++
3049 ++ /*
3050 ++	 * rq::ttwu_pending is a racy indication of outstanding wakeups.
3051 ++	 * Races such that false-negatives are possible, since they
3052 ++	 * are shorter lived than false-positives would be.
3053 ++ */
3054 ++ WRITE_ONCE(rq->ttwu_pending, 0);
3055 ++
3056 ++ rq_lock_irqsave(rq, &rf);
3057 ++ update_rq_clock(rq);
3058 ++
3059 ++ llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
3060 ++ if (WARN_ON_ONCE(p->on_cpu))
3061 ++ smp_cond_load_acquire(&p->on_cpu, !VAL);
3062 ++
3063 ++ if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
3064 ++ set_task_cpu(p, cpu_of(rq));
3065 ++
3066 ++ ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0);
3067 ++ }
3068 ++
3069 ++ rq_unlock_irqrestore(rq, &rf);
3070 ++}
3071 ++
3072 ++void send_call_function_single_ipi(int cpu)
3073 ++{
3074 ++ struct rq *rq = cpu_rq(cpu);
3075 ++
3076 ++ if (!set_nr_if_polling(rq->idle))
3077 ++ arch_send_call_function_single_ipi(cpu);
3078 ++ else
3079 ++ trace_sched_wake_idle_without_ipi(cpu);
3080 ++}
3081 ++
3082 ++/*
3083 ++ * Queue a task on the target CPU's wake_list and wake the CPU via IPI if
3084 ++ * necessary. The wakee CPU on receipt of the IPI will queue the task
3085 ++ * via sched_ttwu_wakeup() for activation so the wakee incurs the cost
3086 ++ * of the wakeup instead of the waker.
3087 ++ */
3088 ++static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
3089 ++{
3090 ++ struct rq *rq = cpu_rq(cpu);
3091 ++
3092 ++ p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
3093 ++
3094 ++ WRITE_ONCE(rq->ttwu_pending, 1);
3095 ++ __smp_call_single_queue(cpu, &p->wake_entry.llist);
3096 ++}
3097 ++
3098 ++static inline bool ttwu_queue_cond(int cpu, int wake_flags)
3099 ++{
3100 ++ /*
3101 ++ * Do not complicate things with the async wake_list while the CPU is
3102 ++ * in hotplug state.
3103 ++ */
3104 ++ if (!cpu_active(cpu))
3105 ++ return false;
3106 ++
3107 ++ /*
3108 ++ * If the CPU does not share cache, then queue the task on the
3109 ++	 * remote rq's wakelist to avoid accessing remote data.
3110 ++ */
3111 ++ if (!cpus_share_cache(smp_processor_id(), cpu))
3112 ++ return true;
3113 ++
3114 ++ /*
3115 ++	 * If the task is descheduling and is the only running task on the
3116 ++	 * CPU, then use the wakelist to offload the task activation to
3117 ++ * the soon-to-be-idle CPU as the current CPU is likely busy.
3118 ++ * nr_running is checked to avoid unnecessary task stacking.
3119 ++ */
3120 ++ if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
3121 ++ return true;
3122 ++
3123 ++ return false;
3124 ++}
3125 ++
3126 ++static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
3127 ++{
3128 ++ if (__is_defined(ALT_SCHED_TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
3129 ++ if (WARN_ON_ONCE(cpu == smp_processor_id()))
3130 ++ return false;
3131 ++
3132 ++ sched_clock_cpu(cpu); /* Sync clocks across CPUs */
3133 ++ __ttwu_queue_wakelist(p, cpu, wake_flags);
3134 ++ return true;
3135 ++ }
3136 ++
3137 ++ return false;
3138 ++}
3139 ++
3140 ++void wake_up_if_idle(int cpu)
3141 ++{
3142 ++ struct rq *rq = cpu_rq(cpu);
3143 ++ unsigned long flags;
3144 ++
3145 ++ rcu_read_lock();
3146 ++
3147 ++ if (!is_idle_task(rcu_dereference(rq->curr)))
3148 ++ goto out;
3149 ++
3150 ++ if (set_nr_if_polling(rq->idle)) {
3151 ++ trace_sched_wake_idle_without_ipi(cpu);
3152 ++ } else {
3153 ++ raw_spin_lock_irqsave(&rq->lock, flags);
3154 ++ if (is_idle_task(rq->curr))
3155 ++ smp_send_reschedule(cpu);
3156 ++ /* Else CPU is not idle, do nothing here */
3157 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
3158 ++ }
3159 ++
3160 ++out:
3161 ++ rcu_read_unlock();
3162 ++}
3163 ++
3164 ++bool cpus_share_cache(int this_cpu, int that_cpu)
3165 ++{
3166 ++ return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
3167 ++}
3168 ++#else /* !CONFIG_SMP */
3169 ++
3170 ++static inline bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
3171 ++{
3172 ++ return false;
3173 ++}
3174 ++
3175 ++#endif /* CONFIG_SMP */
3176 ++
3177 ++static inline void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
3178 ++{
3179 ++ struct rq *rq = cpu_rq(cpu);
3180 ++
3181 ++ if (ttwu_queue_wakelist(p, cpu, wake_flags))
3182 ++ return;
3183 ++
3184 ++ raw_spin_lock(&rq->lock);
3185 ++ update_rq_clock(rq);
3186 ++ ttwu_do_activate(rq, p, wake_flags);
3187 ++ raw_spin_unlock(&rq->lock);
3188 ++}
3189 ++
3190 ++/*
3191 ++ * Invoked from try_to_wake_up() to check whether the task can be woken up.
3192 ++ *
3193 ++ * The caller holds p::pi_lock if p != current or has preemption
3194 ++ * disabled when p == current.
3195 ++ *
3196 ++ * The rules of PREEMPT_RT saved_state:
3197 ++ *
3198 ++ * The related locking code always holds p::pi_lock when updating
3199 ++ * p::saved_state, which means the code is fully serialized in both cases.
3200 ++ *
3201 ++ * The lock wait and lock wakeups happen via TASK_RTLOCK_WAIT. No other
3202 ++ * bits set. This allows to distinguish all wakeup scenarios.
3203 ++ */
3204 ++static __always_inline
3205 ++bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
3206 ++{
3207 ++ if (IS_ENABLED(CONFIG_DEBUG_PREEMPT)) {
3208 ++ WARN_ON_ONCE((state & TASK_RTLOCK_WAIT) &&
3209 ++ state != TASK_RTLOCK_WAIT);
3210 ++ }
3211 ++
3212 ++ if (READ_ONCE(p->__state) & state) {
3213 ++ *success = 1;
3214 ++ return true;
3215 ++ }
3216 ++
3217 ++#ifdef CONFIG_PREEMPT_RT
3218 ++ /*
3219 ++ * Saved state preserves the task state across blocking on
3220 ++ * an RT lock. If the state matches, set p::saved_state to
3221 ++ * TASK_RUNNING, but do not wake the task because it waits
3222 ++ * for a lock wakeup. Also indicate success because from
3223 ++ * the regular waker's point of view this has succeeded.
3224 ++ *
3225 ++ * After acquiring the lock the task will restore p::__state
3226 ++ * from p::saved_state which ensures that the regular
3227 ++ * wakeup is not lost. The restore will also set
3228 ++ * p::saved_state to TASK_RUNNING so any further tests will
3229 ++ * not result in false positives vs. @success
3230 ++ */
3231 ++ if (p->saved_state & state) {
3232 ++ p->saved_state = TASK_RUNNING;
3233 ++ *success = 1;
3234 ++ }
3235 ++#endif
3236 ++ return false;
3237 ++}
3238 ++
3239 ++/*
3240 ++ * Notes on Program-Order guarantees on SMP systems.
3241 ++ *
3242 ++ * MIGRATION
3243 ++ *
3244 ++ * The basic program-order guarantee on SMP systems is that when a task [t]
3245 ++ * migrates, all its activity on its old CPU [c0] happens-before any subsequent
3246 ++ * execution on its new CPU [c1].
3247 ++ *
3248 ++ * For migration (of runnable tasks) this is provided by the following means:
3249 ++ *
3250 ++ * A) UNLOCK of the rq(c0)->lock scheduling out task t
3251 ++ * B) migration for t is required to synchronize *both* rq(c0)->lock and
3252 ++ * rq(c1)->lock (if not at the same time, then in that order).
3253 ++ * C) LOCK of the rq(c1)->lock scheduling in task
3254 ++ *
3255 ++ * Transitivity guarantees that B happens after A and C after B.
3256 ++ * Note: we only require RCpc transitivity.
3257 ++ * Note: the CPU doing B need not be c0 or c1
3258 ++ *
3259 ++ * Example:
3260 ++ *
3261 ++ * CPU0 CPU1 CPU2
3262 ++ *
3263 ++ * LOCK rq(0)->lock
3264 ++ * sched-out X
3265 ++ * sched-in Y
3266 ++ * UNLOCK rq(0)->lock
3267 ++ *
3268 ++ * LOCK rq(0)->lock // orders against CPU0
3269 ++ * dequeue X
3270 ++ * UNLOCK rq(0)->lock
3271 ++ *
3272 ++ * LOCK rq(1)->lock
3273 ++ * enqueue X
3274 ++ * UNLOCK rq(1)->lock
3275 ++ *
3276 ++ * LOCK rq(1)->lock // orders against CPU2
3277 ++ * sched-out Z
3278 ++ * sched-in X
3279 ++ * UNLOCK rq(1)->lock
3280 ++ *
3281 ++ *
3282 ++ * BLOCKING -- aka. SLEEP + WAKEUP
3283 ++ *
3284 ++ * For blocking we (obviously) need to provide the same guarantee as for
3285 ++ * migration. However the means are completely different as there is no lock
3286 ++ * chain to provide order. Instead we do:
3287 ++ *
3288 ++ * 1) smp_store_release(X->on_cpu, 0) -- finish_task()
3289 ++ * 2) smp_cond_load_acquire(!X->on_cpu) -- try_to_wake_up()
3290 ++ *
3291 ++ * Example:
3292 ++ *
3293 ++ * CPU0 (schedule) CPU1 (try_to_wake_up) CPU2 (schedule)
3294 ++ *
3295 ++ * LOCK rq(0)->lock LOCK X->pi_lock
3296 ++ * dequeue X
3297 ++ * sched-out X
3298 ++ * smp_store_release(X->on_cpu, 0);
3299 ++ *
3300 ++ * smp_cond_load_acquire(&X->on_cpu, !VAL);
3301 ++ * X->state = WAKING
3302 ++ * set_task_cpu(X,2)
3303 ++ *
3304 ++ * LOCK rq(2)->lock
3305 ++ * enqueue X
3306 ++ * X->state = RUNNING
3307 ++ * UNLOCK rq(2)->lock
3308 ++ *
3309 ++ * LOCK rq(2)->lock // orders against CPU1
3310 ++ * sched-out Z
3311 ++ * sched-in X
3312 ++ * UNLOCK rq(2)->lock
3313 ++ *
3314 ++ * UNLOCK X->pi_lock
3315 ++ * UNLOCK rq(0)->lock
3316 ++ *
3317 ++ *
3318 ++ * However; for wakeups there is a second guarantee we must provide, namely we
3319 ++ * must observe the state that led to our wakeup. That is, not only must our
3320 ++ * task observe its own prior state, it must also observe the stores prior to
3321 ++ * its wakeup.
3322 ++ *
3323 ++ * This means that any means of doing remote wakeups must order the CPU doing
3324 ++ * the wakeup against the CPU the task is going to end up running on. This,
3325 ++ * however, is already required for the regular Program-Order guarantee above,
3326 ++ * since the waking CPU is the one issuing the ACQUIRE (smp_cond_load_acquire).
3327 ++ *
3328 ++ */
3329 ++
3330 ++/**
3331 ++ * try_to_wake_up - wake up a thread
3332 ++ * @p: the thread to be awakened
3333 ++ * @state: the mask of task states that can be woken
3334 ++ * @wake_flags: wake modifier flags (WF_*)
3335 ++ *
3336 ++ * Conceptually does:
3337 ++ *
3338 ++ * If (@state & @p->state) @p->state = TASK_RUNNING.
3339 ++ *
3340 ++ * If the task was not queued/runnable, also place it back on a runqueue.
3341 ++ *
3342 ++ * This function is atomic against schedule() which would dequeue the task.
3343 ++ *
3344 ++ * It issues a full memory barrier before accessing @p->state, see the comment
3345 ++ * with set_current_state().
3346 ++ *
3347 ++ * Uses p->pi_lock to serialize against concurrent wake-ups.
3348 ++ *
3349 ++ * Relies on p->pi_lock stabilizing:
3350 ++ * - p->sched_class
3351 ++ * - p->cpus_ptr
3352 ++ * - p->sched_task_group
3353 ++ * in order to do migration, see its use of select_task_rq()/set_task_cpu().
3354 ++ *
3355 ++ * Tries really hard to only take one task_rq(p)->lock for performance.
3356 ++ * Takes rq->lock in:
3357 ++ * - ttwu_runnable() -- old rq, unavoidable, see comment there;
3358 ++ * - ttwu_queue() -- new rq, for enqueue of the task;
3359 ++ * - psi_ttwu_dequeue() -- much sadness :-( accounting will kill us.
3360 ++ *
3361 ++ * As a consequence we race really badly with just about everything. See the
3362 ++ * many memory barriers and their comments for details.
3363 ++ *
3364 ++ * Return: %true if @p->state changes (an actual wakeup was done),
3365 ++ * %false otherwise.
3366 ++ */
3367 ++static int try_to_wake_up(struct task_struct *p, unsigned int state,
3368 ++ int wake_flags)
3369 ++{
3370 ++ unsigned long flags;
3371 ++ int cpu, success = 0;
3372 ++
3373 ++ preempt_disable();
3374 ++ if (p == current) {
3375 ++ /*
3376 ++ * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
3377 ++ * == smp_processor_id()'. Together this means we can special
3378 ++ * case the whole 'p->on_rq && ttwu_runnable()' case below
3379 ++ * without taking any locks.
3380 ++ *
3381 ++ * In particular:
3382 ++ * - we rely on Program-Order guarantees for all the ordering,
3383 ++ * - we're serialized against set_special_state() by virtue of
3384 ++ * it disabling IRQs (this allows not taking ->pi_lock).
3385 ++ */
3386 ++ if (!ttwu_state_match(p, state, &success))
3387 ++ goto out;
3388 ++
3389 ++ trace_sched_waking(p);
3390 ++ WRITE_ONCE(p->__state, TASK_RUNNING);
3391 ++ trace_sched_wakeup(p);
3392 ++ goto out;
3393 ++ }
3394 ++
3395 ++ /*
3396 ++ * If we are going to wake up a thread waiting for CONDITION we
3397 ++ * need to ensure that CONDITION=1 done by the caller can not be
3398 ++ * reordered with p->state check below. This pairs with smp_store_mb()
3399 ++ * in set_current_state() that the waiting thread does.
3400 ++ */
3401 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
3402 ++ smp_mb__after_spinlock();
3403 ++ if (!ttwu_state_match(p, state, &success))
3404 ++ goto unlock;
3405 ++
3406 ++ trace_sched_waking(p);
3407 ++
3408 ++ /*
3409 ++ * Ensure we load p->on_rq _after_ p->state, otherwise it would
3410 ++ * be possible to, falsely, observe p->on_rq == 0 and get stuck
3411 ++ * in smp_cond_load_acquire() below.
3412 ++ *
3413 ++ * sched_ttwu_pending() try_to_wake_up()
3414 ++ * STORE p->on_rq = 1 LOAD p->state
3415 ++ * UNLOCK rq->lock
3416 ++ *
3417 ++ * __schedule() (switch to task 'p')
3418 ++ * LOCK rq->lock smp_rmb();
3419 ++ * smp_mb__after_spinlock();
3420 ++ * UNLOCK rq->lock
3421 ++ *
3422 ++ * [task p]
3423 ++ * STORE p->state = UNINTERRUPTIBLE LOAD p->on_rq
3424 ++ *
3425 ++ * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
3426 ++ * __schedule(). See the comment for smp_mb__after_spinlock().
3427 ++ *
3428 ++ * A similar smp_rmb() lives in try_invoke_on_locked_down_task().
3429 ++ */
3430 ++ smp_rmb();
3431 ++ if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags))
3432 ++ goto unlock;
3433 ++
3434 ++#ifdef CONFIG_SMP
3435 ++ /*
3436 ++ * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
3437 ++ * possible to, falsely, observe p->on_cpu == 0.
3438 ++ *
3439 ++ * One must be running (->on_cpu == 1) in order to remove oneself
3440 ++ * from the runqueue.
3441 ++ *
3442 ++ * __schedule() (switch to task 'p') try_to_wake_up()
3443 ++ * STORE p->on_cpu = 1 LOAD p->on_rq
3444 ++ * UNLOCK rq->lock
3445 ++ *
3446 ++ * __schedule() (put 'p' to sleep)
3447 ++ * LOCK rq->lock smp_rmb();
3448 ++ * smp_mb__after_spinlock();
3449 ++ * STORE p->on_rq = 0 LOAD p->on_cpu
3450 ++ *
3451 ++ * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
3452 ++ * __schedule(). See the comment for smp_mb__after_spinlock().
3453 ++ *
3454 ++ * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
3455 ++ * schedule()'s deactivate_task() has 'happened' and p will no longer
3456 ++ * care about its own p->state. See the comment in __schedule().
3457 ++ */
3458 ++ smp_acquire__after_ctrl_dep();
3459 ++
3460 ++ /*
3461 ++ * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
3462 ++ * == 0), which means we need to do an enqueue, change p->state to
3463 ++ * TASK_WAKING such that we can unlock p->pi_lock before doing the
3464 ++ * enqueue, such as ttwu_queue_wakelist().
3465 ++ */
3466 ++ WRITE_ONCE(p->__state, TASK_WAKING);
3467 ++
3468 ++ /*
3469 ++ * If the owning (remote) CPU is still in the middle of schedule() with
3470 ++ * this task as prev, consider queueing p on the remote CPU's wake_list
3471 ++ * which potentially sends an IPI instead of spinning on p->on_cpu to
3472 ++ * let the waker make forward progress. This is safe because IRQs are
3473 ++ * disabled and the IPI will deliver after on_cpu is cleared.
3474 ++ *
3475 ++ * Ensure we load task_cpu(p) after p->on_cpu:
3476 ++ *
3477 ++ * set_task_cpu(p, cpu);
3478 ++ * STORE p->cpu = @cpu
3479 ++ * __schedule() (switch to task 'p')
3480 ++ * LOCK rq->lock
3481 ++ * smp_mb__after_spin_lock() smp_cond_load_acquire(&p->on_cpu)
3482 ++ * STORE p->on_cpu = 1 LOAD p->cpu
3483 ++ *
3484 ++ * to ensure we observe the correct CPU on which the task is currently
3485 ++ * scheduling.
3486 ++ */
3487 ++ if (smp_load_acquire(&p->on_cpu) &&
3488 ++ ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
3489 ++ goto unlock;
3490 ++
3491 ++ /*
3492 ++ * If the owning (remote) CPU is still in the middle of schedule() with
3493 ++ * this task as prev, wait until it's done referencing the task.
3494 ++ *
3495 ++ * Pairs with the smp_store_release() in finish_task().
3496 ++ *
3497 ++ * This ensures that tasks getting woken will be fully ordered against
3498 ++ * their previous state and preserve Program Order.
3499 ++ */
3500 ++ smp_cond_load_acquire(&p->on_cpu, !VAL);
3501 ++
3502 ++ sched_task_ttwu(p);
3503 ++
3504 ++ cpu = select_task_rq(p);
3505 ++
3506 ++ if (cpu != task_cpu(p)) {
3507 ++ if (p->in_iowait) {
3508 ++ delayacct_blkio_end(p);
3509 ++ atomic_dec(&task_rq(p)->nr_iowait);
3510 ++ }
3511 ++
3512 ++ wake_flags |= WF_MIGRATED;
3513 ++ psi_ttwu_dequeue(p);
3514 ++ set_task_cpu(p, cpu);
3515 ++ }
3516 ++#else
3517 ++ cpu = task_cpu(p);
3518 ++#endif /* CONFIG_SMP */
3519 ++
3520 ++ ttwu_queue(p, cpu, wake_flags);
3521 ++unlock:
3522 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
3523 ++out:
3524 ++ if (success)
3525 ++ ttwu_stat(p, task_cpu(p), wake_flags);
3526 ++ preempt_enable();
3527 ++
3528 ++ return success;
3529 ++}
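/*
 * [Editor's illustration, not part of the patch] A user-space sketch of why
 * try_to_wake_up() needs a full barrier between the caller's CONDITION store
 * and the p->state load: the smp_mb__after_spinlock() above pairs with
 * smp_store_mb() in set_current_state().  Each thread stores its own flag and
 * then reads the other's; with seq_cst fences on both sides, at least one
 * thread must observe the other's store, so a wakeup cannot be lost.  All
 * names here are illustrative.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic int cond_set, task_sleeping;
static int waker_saw_sleeping, sleeper_saw_cond;

static void *sleeper(void *arg)
{
	(void)arg;
	/* set_current_state(TASK_UNINTERRUPTIBLE) implies a full barrier */
	atomic_store_explicit(&task_sleeping, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	/* re-check the condition before really sleeping */
	sleeper_saw_cond = atomic_load_explicit(&cond_set, memory_order_relaxed);
	return NULL;
}

static void *waker(void *arg)
{
	(void)arg;
	/* caller sets CONDITION = 1, then try_to_wake_up() checks p->state */
	atomic_store_explicit(&cond_set, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);  /* smp_mb__after_spinlock() */
	waker_saw_sleeping = atomic_load_explicit(&task_sleeping, memory_order_relaxed);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, sleeper, NULL);
	pthread_create(&b, NULL, waker, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	/* with both fences in place this can never print 0/0 (a lost wakeup) */
	printf("waker_saw_sleeping=%d sleeper_saw_cond=%d\n",
	       waker_saw_sleeping, sleeper_saw_cond);
	return 0;
}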
3530 ++
3531 ++/**
3532 ++ * try_invoke_on_locked_down_task - Invoke a function on task in fixed state
3533 ++ * @p: Process for which the function is to be invoked, can be @current.
3534 ++ * @func: Function to invoke.
3535 ++ * @arg: Argument to function.
3536 ++ *
3537 ++ * If the specified task can be quickly locked into a definite state
3538 ++ * (either sleeping or on a given runqueue), arrange to keep it in that
3539 ++ * state while invoking @func(@arg). This function can use ->on_rq and
3540 ++ * task_curr() to work out what the state is, if required. Given that
3541 ++ * @func can be invoked with a runqueue lock held, it had better be quite
3542 ++ * lightweight.
3543 ++ *
3544 ++ * Returns:
3545 ++ * @false if the task slipped out from under the locks.
3546 ++ * @true if the task was locked onto a runqueue or is sleeping.
3547 ++ * However, @func can override this by returning @false.
3548 ++ */
3549 ++bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg)
3550 ++{
3551 ++ struct rq_flags rf;
3552 ++ bool ret = false;
3553 ++ struct rq *rq;
3554 ++
3555 ++ raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
3556 ++ if (p->on_rq) {
3557 ++ rq = __task_rq_lock(p, &rf);
3558 ++ if (task_rq(p) == rq)
3559 ++ ret = func(p, arg);
3560 ++ __task_rq_unlock(rq, &rf);
3561 ++ } else {
3562 ++ switch (READ_ONCE(p->__state)) {
3563 ++ case TASK_RUNNING:
3564 ++ case TASK_WAKING:
3565 ++ break;
3566 ++ default:
3567 ++ smp_rmb(); // See smp_rmb() comment in try_to_wake_up().
3568 ++ if (!p->on_rq)
3569 ++ ret = func(p, arg);
3570 ++ }
3571 ++ }
3572 ++ raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
3573 ++ return ret;
3574 ++}
3575 ++
3576 ++/**
3577 ++ * wake_up_process - Wake up a specific process
3578 ++ * @p: The process to be woken up.
3579 ++ *
3580 ++ * Attempt to wake up the nominated process and move it to the set of runnable
3581 ++ * processes.
3582 ++ *
3583 ++ * Return: 1 if the process was woken up, 0 if it was already running.
3584 ++ *
3585 ++ * This function executes a full memory barrier before accessing the task state.
3586 ++ */
3587 ++int wake_up_process(struct task_struct *p)
3588 ++{
3589 ++ return try_to_wake_up(p, TASK_NORMAL, 0);
3590 ++}
3591 ++EXPORT_SYMBOL(wake_up_process);
3592 ++
3593 ++int wake_up_state(struct task_struct *p, unsigned int state)
3594 ++{
3595 ++ return try_to_wake_up(p, state, 0);
3596 ++}
3597 ++
3598 ++/*
3599 ++ * Perform scheduler related setup for a newly forked process p.
3600 ++ * p is forked by current.
3601 ++ *
3602 ++ * __sched_fork() is basic setup used by init_idle() too:
3603 ++ */
3604 ++static inline void __sched_fork(unsigned long clone_flags, struct task_struct *p)
3605 ++{
3606 ++ p->on_rq = 0;
3607 ++ p->on_cpu = 0;
3608 ++ p->utime = 0;
3609 ++ p->stime = 0;
3610 ++ p->sched_time = 0;
3611 ++
3612 ++#ifdef CONFIG_PREEMPT_NOTIFIERS
3613 ++ INIT_HLIST_HEAD(&p->preempt_notifiers);
3614 ++#endif
3615 ++
3616 ++#ifdef CONFIG_COMPACTION
3617 ++ p->capture_control = NULL;
3618 ++#endif
3619 ++#ifdef CONFIG_SMP
3620 ++ p->wake_entry.u_flags = CSD_TYPE_TTWU;
3621 ++#endif
3622 ++}
3623 ++
3624 ++/*
3625 ++ * fork()/clone()-time setup:
3626 ++ */
3627 ++int sched_fork(unsigned long clone_flags, struct task_struct *p)
3628 ++{
3629 ++ unsigned long flags;
3630 ++ struct rq *rq;
3631 ++
3632 ++ __sched_fork(clone_flags, p);
3633 ++ /*
3634 ++ * We mark the process as NEW here. This guarantees that
3635 ++ * nobody will actually run it, and a signal or other external
3636 ++ * event cannot wake it up and insert it on the runqueue either.
3637 ++ */
3638 ++ p->__state = TASK_NEW;
3639 ++
3640 ++ /*
3641 ++ * Make sure we do not leak PI boosting priority to the child.
3642 ++ */
3643 ++ p->prio = current->normal_prio;
3644 ++
3645 ++ /*
3646 ++ * Revert to default priority/policy on fork if requested.
3647 ++ */
3648 ++ if (unlikely(p->sched_reset_on_fork)) {
3649 ++ if (task_has_rt_policy(p)) {
3650 ++ p->policy = SCHED_NORMAL;
3651 ++ p->static_prio = NICE_TO_PRIO(0);
3652 ++ p->rt_priority = 0;
3653 ++ } else if (PRIO_TO_NICE(p->static_prio) < 0)
3654 ++ p->static_prio = NICE_TO_PRIO(0);
3655 ++
3656 ++ p->prio = p->normal_prio = p->static_prio;
3657 ++
3658 ++ /*
3659 ++ * We don't need the reset flag anymore after the fork. It has
3660 ++ * fulfilled its duty:
3661 ++ */
3662 ++ p->sched_reset_on_fork = 0;
3663 ++ }
3664 ++
3665 ++ /*
3666 ++ * The child is not yet in the pid-hash so no cgroup attach races,
3667 ++ * and the cgroup is pinned to this child because cgroup_fork()
3668 ++ * is run before sched_fork().
3669 ++ *
3670 ++ * Silence PROVE_RCU.
3671 ++ */
3672 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
3673 ++ /*
3674 ++ * Share the timeslice between parent and child, thus the
3675 ++ * total amount of pending timeslices in the system doesn't change,
3676 ++ * resulting in more scheduling fairness.
3677 ++ */
3678 ++ rq = this_rq();
3679 ++ raw_spin_lock(&rq->lock);
3680 ++
3681 ++ rq->curr->time_slice /= 2;
3682 ++ p->time_slice = rq->curr->time_slice;
3683 ++#ifdef CONFIG_SCHED_HRTICK
3684 ++ hrtick_start(rq, rq->curr->time_slice);
3685 ++#endif
3686 ++
3687 ++ if (p->time_slice < RESCHED_NS) {
3688 ++ p->time_slice = sched_timeslice_ns;
3689 ++ resched_curr(rq);
3690 ++ }
3691 ++ sched_task_fork(p, rq);
3692 ++ raw_spin_unlock(&rq->lock);
3693 ++
3694 ++ rseq_migrate(p);
3695 ++ /*
3696 ++ * We're setting the CPU for the first time, we don't migrate,
3697 ++ * so use __set_task_cpu().
3698 ++ */
3699 ++ __set_task_cpu(p, cpu_of(rq));
3700 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
3701 ++
3702 ++#ifdef CONFIG_SCHED_INFO
3703 ++ if (unlikely(sched_info_on()))
3704 ++ memset(&p->sched_info, 0, sizeof(p->sched_info));
3705 ++#endif
3706 ++ init_task_preempt_count(p);
3707 ++
3708 ++ return 0;
3709 ++}
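/*
 * [Editor's illustration, not part of the patch] The fork-time timeslice
 * split above can be modelled with plain arithmetic: the parent keeps half of
 * its remaining slice and the child inherits the other half, so the total
 * pending timeslice in the system is unchanged.  The constants below (a 4 ms
 * default slice and a 100 us "as good as expired" cutoff) are assumptions for
 * the demo, not values taken from the patch.
 */
#include <stdio.h>

#define DEMO_SLICE_NS   (4ULL * 1000 * 1000)  /* assumed default timeslice */
#define DEMO_RESCHED_NS (100ULL * 1000)       /* assumed RESCHED_NS-like cutoff */

int main(void)
{
	unsigned long long parent = DEMO_SLICE_NS / 3;  /* a partially used slice */
	unsigned long long child;

	parent /= 2;      /* rq->curr->time_slice /= 2;          */
	child = parent;   /* p->time_slice = rq->curr->time_slice */

	if (child < DEMO_RESCHED_NS)
		child = DEMO_SLICE_NS;  /* p->time_slice = sched_timeslice_ns;
					 * resched_curr(rq) for the parent      */

	printf("parent=%llu ns, child=%llu ns\n", parent, child);
	return 0;
}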
3710 ++
3711 ++void sched_post_fork(struct task_struct *p) {}
3712 ++
3713 ++#ifdef CONFIG_SCHEDSTATS
3714 ++
3715 ++DEFINE_STATIC_KEY_FALSE(sched_schedstats);
3716 ++
3717 ++static void set_schedstats(bool enabled)
3718 ++{
3719 ++ if (enabled)
3720 ++ static_branch_enable(&sched_schedstats);
3721 ++ else
3722 ++ static_branch_disable(&sched_schedstats);
3723 ++}
3724 ++
3725 ++void force_schedstat_enabled(void)
3726 ++{
3727 ++ if (!schedstat_enabled()) {
3728 ++ pr_info("kernel profiling enabled schedstats, disable via kernel.sched_schedstats.\n");
3729 ++ static_branch_enable(&sched_schedstats);
3730 ++ }
3731 ++}
3732 ++
3733 ++static int __init setup_schedstats(char *str)
3734 ++{
3735 ++ int ret = 0;
3736 ++ if (!str)
3737 ++ goto out;
3738 ++
3739 ++ if (!strcmp(str, "enable")) {
3740 ++ set_schedstats(true);
3741 ++ ret = 1;
3742 ++ } else if (!strcmp(str, "disable")) {
3743 ++ set_schedstats(false);
3744 ++ ret = 1;
3745 ++ }
3746 ++out:
3747 ++ if (!ret)
3748 ++ pr_warn("Unable to parse schedstats=\n");
3749 ++
3750 ++ return ret;
3751 ++}
3752 ++__setup("schedstats=", setup_schedstats);
3753 ++
3754 ++#ifdef CONFIG_PROC_SYSCTL
3755 ++int sysctl_schedstats(struct ctl_table *table, int write,
3756 ++ void __user *buffer, size_t *lenp, loff_t *ppos)
3757 ++{
3758 ++ struct ctl_table t;
3759 ++ int err;
3760 ++ int state = static_branch_likely(&sched_schedstats);
3761 ++
3762 ++ if (write && !capable(CAP_SYS_ADMIN))
3763 ++ return -EPERM;
3764 ++
3765 ++ t = *table;
3766 ++ t.data = &state;
3767 ++ err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
3768 ++ if (err < 0)
3769 ++ return err;
3770 ++ if (write)
3771 ++ set_schedstats(state);
3772 ++ return err;
3773 ++}
3774 ++#endif /* CONFIG_PROC_SYSCTL */
3775 ++#endif /* CONFIG_SCHEDSTATS */
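/*
 * [Editor's illustration, not part of the patch] The schedstats switch above
 * is reachable from user space either as the schedstats= boot parameter or
 * through the kernel.sched_schedstats sysctl.  A minimal sketch that reads
 * /proc/sys/kernel/sched_schedstats; it assumes a kernel built with
 * CONFIG_SCHEDSTATS=y (writing the file additionally needs CAP_SYS_ADMIN).
 */
#include <stdio.h>

int main(void)
{
	const char *path = "/proc/sys/kernel/sched_schedstats";
	FILE *f = fopen(path, "r");
	int val;

	if (!f || fscanf(f, "%d", &val) != 1) {
		perror(path);
		return 1;
	}
	fclose(f);
	printf("schedstats currently %s\n", val ? "enabled" : "disabled");
	return 0;
}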
3776 ++
3777 ++/*
3778 ++ * wake_up_new_task - wake up a newly created task for the first time.
3779 ++ *
3780 ++ * This function will do some initial scheduler statistics housekeeping
3781 ++ * that must be done for every newly created context, then puts the task
3782 ++ * on the runqueue and wakes it.
3783 ++ */
3784 ++void wake_up_new_task(struct task_struct *p)
3785 ++{
3786 ++ unsigned long flags;
3787 ++ struct rq *rq;
3788 ++
3789 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
3790 ++ WRITE_ONCE(p->__state, TASK_RUNNING);
3791 ++ rq = cpu_rq(select_task_rq(p));
3792 ++#ifdef CONFIG_SMP
3793 ++ rseq_migrate(p);
3794 ++ /*
3795 ++ * Fork balancing, do it here and not earlier because:
3796 ++ * - cpus_ptr can change in the fork path
3797 ++ * - any previously selected CPU might disappear through hotplug
3798 ++ *
3799 ++ * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
3800 ++ * as we're not fully set-up yet.
3801 ++ */
3802 ++ __set_task_cpu(p, cpu_of(rq));
3803 ++#endif
3804 ++
3805 ++ raw_spin_lock(&rq->lock);
3806 ++ update_rq_clock(rq);
3807 ++
3808 ++ activate_task(p, rq);
3809 ++ trace_sched_wakeup_new(p);
3810 ++ check_preempt_curr(rq);
3811 ++
3812 ++ raw_spin_unlock(&rq->lock);
3813 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
3814 ++}
3815 ++
3816 ++#ifdef CONFIG_PREEMPT_NOTIFIERS
3817 ++
3818 ++static DEFINE_STATIC_KEY_FALSE(preempt_notifier_key);
3819 ++
3820 ++void preempt_notifier_inc(void)
3821 ++{
3822 ++ static_branch_inc(&preempt_notifier_key);
3823 ++}
3824 ++EXPORT_SYMBOL_GPL(preempt_notifier_inc);
3825 ++
3826 ++void preempt_notifier_dec(void)
3827 ++{
3828 ++ static_branch_dec(&preempt_notifier_key);
3829 ++}
3830 ++EXPORT_SYMBOL_GPL(preempt_notifier_dec);
3831 ++
3832 ++/**
3833 ++ * preempt_notifier_register - tell me when current is being preempted & rescheduled
3834 ++ * @notifier: notifier struct to register
3835 ++ */
3836 ++void preempt_notifier_register(struct preempt_notifier *notifier)
3837 ++{
3838 ++ if (!static_branch_unlikely(&preempt_notifier_key))
3839 ++ WARN(1, "registering preempt_notifier while notifiers disabled\n");
3840 ++
3841 ++ hlist_add_head(&notifier->link, &current->preempt_notifiers);
3842 ++}
3843 ++EXPORT_SYMBOL_GPL(preempt_notifier_register);
3844 ++
3845 ++/**
3846 ++ * preempt_notifier_unregister - no longer interested in preemption notifications
3847 ++ * @notifier: notifier struct to unregister
3848 ++ *
3849 ++ * This is *not* safe to call from within a preemption notifier.
3850 ++ */
3851 ++void preempt_notifier_unregister(struct preempt_notifier *notifier)
3852 ++{
3853 ++ hlist_del(&notifier->link);
3854 ++}
3855 ++EXPORT_SYMBOL_GPL(preempt_notifier_unregister);
3856 ++
3857 ++static void __fire_sched_in_preempt_notifiers(struct task_struct *curr)
3858 ++{
3859 ++ struct preempt_notifier *notifier;
3860 ++
3861 ++ hlist_for_each_entry(notifier, &curr->preempt_notifiers, link)
3862 ++ notifier->ops->sched_in(notifier, raw_smp_processor_id());
3863 ++}
3864 ++
3865 ++static __always_inline void fire_sched_in_preempt_notifiers(struct task_struct *curr)
3866 ++{
3867 ++ if (static_branch_unlikely(&preempt_notifier_key))
3868 ++ __fire_sched_in_preempt_notifiers(curr);
3869 ++}
3870 ++
3871 ++static void
3872 ++__fire_sched_out_preempt_notifiers(struct task_struct *curr,
3873 ++ struct task_struct *next)
3874 ++{
3875 ++ struct preempt_notifier *notifier;
3876 ++
3877 ++ hlist_for_each_entry(notifier, &curr->preempt_notifiers, link)
3878 ++ notifier->ops->sched_out(notifier, next);
3879 ++}
3880 ++
3881 ++static __always_inline void
3882 ++fire_sched_out_preempt_notifiers(struct task_struct *curr,
3883 ++ struct task_struct *next)
3884 ++{
3885 ++ if (static_branch_unlikely(&preempt_notifier_key))
3886 ++ __fire_sched_out_preempt_notifiers(curr, next);
3887 ++}
3888 ++
3889 ++#else /* !CONFIG_PREEMPT_NOTIFIERS */
3890 ++
3891 ++static inline void fire_sched_in_preempt_notifiers(struct task_struct *curr)
3892 ++{
3893 ++}
3894 ++
3895 ++static inline void
3896 ++fire_sched_out_preempt_notifiers(struct task_struct *curr,
3897 ++ struct task_struct *next)
3898 ++{
3899 ++}
3900 ++
3901 ++#endif /* CONFIG_PREEMPT_NOTIFIERS */
3902 ++
3903 ++static inline void prepare_task(struct task_struct *next)
3904 ++{
3905 ++ /*
3906 ++ * Claim the task as running, we do this before switching to it
3907 ++ * such that any running task will have this set.
3908 ++ *
3909 ++ * See the ttwu() WF_ON_CPU case and its ordering comment.
3910 ++ */
3911 ++ WRITE_ONCE(next->on_cpu, 1);
3912 ++}
3913 ++
3914 ++static inline void finish_task(struct task_struct *prev)
3915 ++{
3916 ++#ifdef CONFIG_SMP
3917 ++ /*
3918 ++ * This must be the very last reference to @prev from this CPU. After
3919 ++ * p->on_cpu is cleared, the task can be moved to a different CPU. We
3920 ++ * must ensure this doesn't happen until the switch is completely
3921 ++ * finished.
3922 ++ *
3923 ++ * In particular, the load of prev->state in finish_task_switch() must
3924 ++ * happen before this.
3925 ++ *
3926 ++ * Pairs with the smp_cond_load_acquire() in try_to_wake_up().
3927 ++ */
3928 ++ smp_store_release(&prev->on_cpu, 0);
3929 ++#else
3930 ++ prev->on_cpu = 0;
3931 ++#endif
3932 ++}
3933 ++
3934 ++#ifdef CONFIG_SMP
3935 ++
3936 ++static void do_balance_callbacks(struct rq *rq, struct callback_head *head)
3937 ++{
3938 ++ void (*func)(struct rq *rq);
3939 ++ struct callback_head *next;
3940 ++
3941 ++ lockdep_assert_held(&rq->lock);
3942 ++
3943 ++ while (head) {
3944 ++ func = (void (*)(struct rq *))head->func;
3945 ++ next = head->next;
3946 ++ head->next = NULL;
3947 ++ head = next;
3948 ++
3949 ++ func(rq);
3950 ++ }
3951 ++}
3952 ++
3953 ++static void balance_push(struct rq *rq);
3954 ++
3955 ++struct callback_head balance_push_callback = {
3956 ++ .next = NULL,
3957 ++ .func = (void (*)(struct callback_head *))balance_push,
3958 ++};
3959 ++
3960 ++static inline struct callback_head *splice_balance_callbacks(struct rq *rq)
3961 ++{
3962 ++ struct callback_head *head = rq->balance_callback;
3963 ++
3964 ++ if (head) {
3965 ++ lockdep_assert_held(&rq->lock);
3966 ++ rq->balance_callback = NULL;
3967 ++ }
3968 ++
3969 ++ return head;
3970 ++}
3971 ++
3972 ++static void __balance_callbacks(struct rq *rq)
3973 ++{
3974 ++ do_balance_callbacks(rq, splice_balance_callbacks(rq));
3975 ++}
3976 ++
3977 ++static inline void balance_callbacks(struct rq *rq, struct callback_head *head)
3978 ++{
3979 ++ unsigned long flags;
3980 ++
3981 ++ if (unlikely(head)) {
3982 ++ raw_spin_lock_irqsave(&rq->lock, flags);
3983 ++ do_balance_callbacks(rq, head);
3984 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
3985 ++ }
3986 ++}
3987 ++
3988 ++#else
3989 ++
3990 ++static inline void __balance_callbacks(struct rq *rq)
3991 ++{
3992 ++}
3993 ++
3994 ++static inline struct callback_head *splice_balance_callbacks(struct rq *rq)
3995 ++{
3996 ++ return NULL;
3997 ++}
3998 ++
3999 ++static inline void balance_callbacks(struct rq *rq, struct callback_head *head)
4000 ++{
4001 ++}
4002 ++
4003 ++#endif
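/*
 * [Editor's illustration, not part of the patch] The balance-callback helpers
 * above follow a common pattern: callbacks are pushed onto a singly linked
 * list while the lock is held, the whole list is spliced off in one step, and
 * each entry is then run exactly once.  A stand-alone sketch of that pattern
 * (without the locking) follows; the names echo the patch but the code is
 * illustrative only.
 */
#include <stddef.h>
#include <stdio.h>

struct demo_callback {
	struct demo_callback *next;
	void (*func)(struct demo_callback *cb);
};

static struct demo_callback *pending;  /* plays the role of rq->balance_callback */

static void queue_callback(struct demo_callback *cb)
{
	cb->next = pending;
	pending = cb;
}

static struct demo_callback *splice_callbacks(void)
{
	struct demo_callback *head = pending;  /* take the whole list ...   */

	pending = NULL;                        /* ... and reset the anchor  */
	return head;
}

static void run_callbacks(struct demo_callback *head)
{
	while (head) {
		struct demo_callback *next = head->next;

		head->next = NULL;             /* detach before calling */
		head->func(head);
		head = next;
	}
}

static void say_hi(struct demo_callback *cb)
{
	printf("callback %p ran\n", (void *)cb);
}

int main(void)
{
	struct demo_callback a = { .func = say_hi }, b = { .func = say_hi };

	queue_callback(&a);
	queue_callback(&b);
	run_callbacks(splice_callbacks());
	return 0;
}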
4004 ++
4005 ++static inline void
4006 ++prepare_lock_switch(struct rq *rq, struct task_struct *next)
4007 ++{
4008 ++ /*
4009 ++ * Since the runqueue lock will be released by the next
4010 ++ * task (which is an invalid locking op but in the case
4011 ++ * of the scheduler it's an obvious special-case), we
4012 ++ * do an early lockdep release here:
4013 ++ */
4014 ++ spin_release(&rq->lock.dep_map, _THIS_IP_);
4015 ++#ifdef CONFIG_DEBUG_SPINLOCK
4016 ++ /* this is a valid case when another task releases the spinlock */
4017 ++ rq->lock.owner = next;
4018 ++#endif
4019 ++}
4020 ++
4021 ++static inline void finish_lock_switch(struct rq *rq)
4022 ++{
4023 ++ /*
4024 ++ * If we are tracking spinlock dependencies then we have to
4025 ++ * fix up the runqueue lock - which gets 'carried over' from
4026 ++ * prev into current:
4027 ++ */
4028 ++ spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
4029 ++ __balance_callbacks(rq);
4030 ++ raw_spin_unlock_irq(&rq->lock);
4031 ++}
4032 ++
4033 ++/*
4034 ++ * NOP if the arch has not defined these:
4035 ++ */
4036 ++
4037 ++#ifndef prepare_arch_switch
4038 ++# define prepare_arch_switch(next) do { } while (0)
4039 ++#endif
4040 ++
4041 ++#ifndef finish_arch_post_lock_switch
4042 ++# define finish_arch_post_lock_switch() do { } while (0)
4043 ++#endif
4044 ++
4045 ++static inline void kmap_local_sched_out(void)
4046 ++{
4047 ++#ifdef CONFIG_KMAP_LOCAL
4048 ++ if (unlikely(current->kmap_ctrl.idx))
4049 ++ __kmap_local_sched_out();
4050 ++#endif
4051 ++}
4052 ++
4053 ++static inline void kmap_local_sched_in(void)
4054 ++{
4055 ++#ifdef CONFIG_KMAP_LOCAL
4056 ++ if (unlikely(current->kmap_ctrl.idx))
4057 ++ __kmap_local_sched_in();
4058 ++#endif
4059 ++}
4060 ++
4061 ++/**
4062 ++ * prepare_task_switch - prepare to switch tasks
4063 ++ * @rq: the runqueue preparing to switch
4064 ++ * @next: the task we are going to switch to.
4065 ++ *
4066 ++ * This is called with the rq lock held and interrupts off. It must
4067 ++ * be paired with a subsequent finish_task_switch after the context
4068 ++ * switch.
4069 ++ *
4070 ++ * prepare_task_switch sets up locking and calls architecture specific
4071 ++ * hooks.
4072 ++ */
4073 ++static inline void
4074 ++prepare_task_switch(struct rq *rq, struct task_struct *prev,
4075 ++ struct task_struct *next)
4076 ++{
4077 ++ kcov_prepare_switch(prev);
4078 ++ sched_info_switch(rq, prev, next);
4079 ++ perf_event_task_sched_out(prev, next);
4080 ++ rseq_preempt(prev);
4081 ++ fire_sched_out_preempt_notifiers(prev, next);
4082 ++ kmap_local_sched_out();
4083 ++ prepare_task(next);
4084 ++ prepare_arch_switch(next);
4085 ++}
4086 ++
4087 ++/**
4088 ++ * finish_task_switch - clean up after a task-switch
4089 ++ * @rq: runqueue associated with task-switch
4090 ++ * @prev: the thread we just switched away from.
4091 ++ *
4092 ++ * finish_task_switch must be called after the context switch, paired
4093 ++ * with a prepare_task_switch call before the context switch.
4094 ++ * finish_task_switch will reconcile locking set up by prepare_task_switch,
4095 ++ * and do any other architecture-specific cleanup actions.
4096 ++ *
4097 ++ * Note that we may have delayed dropping an mm in context_switch(). If
4098 ++ * so, we finish that here outside of the runqueue lock. (Doing it
4099 ++ * with the lock held can cause deadlocks; see schedule() for
4100 ++ * details.)
4101 ++ *
4102 ++ * The context switch has flipped the stack from under us and restored the
4103 ++ * local variables which were saved when this task called schedule() in the
4104 ++ * past. prev == current is still correct but we need to recalculate this_rq
4105 ++ * because prev may have moved to another CPU.
4106 ++ */
4107 ++static struct rq *finish_task_switch(struct task_struct *prev)
4108 ++ __releases(rq->lock)
4109 ++{
4110 ++ struct rq *rq = this_rq();
4111 ++ struct mm_struct *mm = rq->prev_mm;
4112 ++ long prev_state;
4113 ++
4114 ++ /*
4115 ++ * The previous task will have left us with a preempt_count of 2
4116 ++ * because it left us after:
4117 ++ *
4118 ++ * schedule()
4119 ++ * preempt_disable(); // 1
4120 ++ * __schedule()
4121 ++ * raw_spin_lock_irq(&rq->lock) // 2
4122 ++ *
4123 ++ * Also, see FORK_PREEMPT_COUNT.
4124 ++ */
4125 ++ if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
4126 ++ "corrupted preempt_count: %s/%d/0x%x\n",
4127 ++ current->comm, current->pid, preempt_count()))
4128 ++ preempt_count_set(FORK_PREEMPT_COUNT);
4129 ++
4130 ++ rq->prev_mm = NULL;
4131 ++
4132 ++ /*
4133 ++ * A task struct has one reference for the use as "current".
4134 ++ * If a task dies, then it sets TASK_DEAD in tsk->state and calls
4135 ++ * schedule one last time. The schedule call will never return, and
4136 ++ * the scheduled task must drop that reference.
4137 ++ *
4138 ++ * We must observe prev->state before clearing prev->on_cpu (in
4139 ++ * finish_task), otherwise a concurrent wakeup can get prev
4140 ++ * running on another CPU and we could race with its RUNNING -> DEAD
4141 ++ * transition, resulting in a double drop.
4142 ++ */
4143 ++ prev_state = READ_ONCE(prev->__state);
4144 ++ vtime_task_switch(prev);
4145 ++ perf_event_task_sched_in(prev, current);
4146 ++ finish_task(prev);
4147 ++ tick_nohz_task_switch();
4148 ++ finish_lock_switch(rq);
4149 ++ finish_arch_post_lock_switch();
4150 ++ kcov_finish_switch(current);
4151 ++ /*
4152 ++ * kmap_local_sched_out() is invoked with rq::lock held and
4153 ++ * interrupts disabled. There is no requirement for that, but the
4154 ++ * sched out code does not have an interrupt enabled section.
4155 ++ * Restoring the maps on sched in does not require interrupts being
4156 ++ * disabled either.
4157 ++ */
4158 ++ kmap_local_sched_in();
4159 ++
4160 ++ fire_sched_in_preempt_notifiers(current);
4161 ++ /*
4162 ++ * When switching through a kernel thread, the loop in
4163 ++ * membarrier_{private,global}_expedited() may have observed that
4164 ++ * kernel thread and not issued an IPI. It is therefore possible to
4165 ++ * schedule between user->kernel->user threads without passing through
4166 ++ * switch_mm(). Membarrier requires a barrier after storing to
4167 ++ * rq->curr, before returning to userspace, so provide them here:
4168 ++ *
4169 ++ * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
4170 ++ * provided by mmdrop(),
4171 ++ * - a sync_core for SYNC_CORE.
4172 ++ */
4173 ++ if (mm) {
4174 ++ membarrier_mm_sync_core_before_usermode(mm);
4175 ++ mmdrop(mm);
4176 ++ }
4177 ++ if (unlikely(prev_state == TASK_DEAD)) {
4178 ++ /*
4179 ++ * Remove function-return probe instances associated with this
4180 ++ * task and put them back on the free list.
4181 ++ */
4182 ++ kprobe_flush_task(prev);
4183 ++
4184 ++ /* Task is done with its stack. */
4185 ++ put_task_stack(prev);
4186 ++
4187 ++ put_task_struct_rcu_user(prev);
4188 ++ }
4189 ++
4190 ++ return rq;
4191 ++}
4192 ++
4193 ++/**
4194 ++ * schedule_tail - first thing a freshly forked thread must call.
4195 ++ * @prev: the thread we just switched away from.
4196 ++ */
4197 ++asmlinkage __visible void schedule_tail(struct task_struct *prev)
4198 ++ __releases(rq->lock)
4199 ++{
4200 ++ /*
4201 ++ * New tasks start with FORK_PREEMPT_COUNT, see there and
4202 ++ * finish_task_switch() for details.
4203 ++ *
4204 ++ * finish_task_switch() will drop rq->lock() and lower preempt_count
4205 ++ * and the preempt_enable() will end up enabling preemption (on
4206 ++ * PREEMPT_COUNT kernels).
4207 ++ */
4208 ++
4209 ++ finish_task_switch(prev);
4210 ++ preempt_enable();
4211 ++
4212 ++ if (current->set_child_tid)
4213 ++ put_user(task_pid_vnr(current), current->set_child_tid);
4214 ++
4215 ++ calculate_sigpending();
4216 ++}
4217 ++
4218 ++/*
4219 ++ * context_switch - switch to the new MM and the new thread's register state.
4220 ++ */
4221 ++static __always_inline struct rq *
4222 ++context_switch(struct rq *rq, struct task_struct *prev,
4223 ++ struct task_struct *next)
4224 ++{
4225 ++ prepare_task_switch(rq, prev, next);
4226 ++
4227 ++ /*
4228 ++ * For paravirt, this is coupled with an exit in switch_to to
4229 ++ * combine the page table reload and the switch backend into
4230 ++ * one hypercall.
4231 ++ */
4232 ++ arch_start_context_switch(prev);
4233 ++
4234 ++ /*
4235 ++ * kernel -> kernel lazy + transfer active
4236 ++ * user -> kernel lazy + mmgrab() active
4237 ++ *
4238 ++ * kernel -> user switch + mmdrop() active
4239 ++ * user -> user switch
4240 ++ */
4241 ++ if (!next->mm) { // to kernel
4242 ++ enter_lazy_tlb(prev->active_mm, next);
4243 ++
4244 ++ next->active_mm = prev->active_mm;
4245 ++ if (prev->mm) // from user
4246 ++ mmgrab(prev->active_mm);
4247 ++ else
4248 ++ prev->active_mm = NULL;
4249 ++ } else { // to user
4250 ++ membarrier_switch_mm(rq, prev->active_mm, next->mm);
4251 ++ /*
4252 ++ * sys_membarrier() requires an smp_mb() between setting
4253 ++ * rq->curr / membarrier_switch_mm() and returning to userspace.
4254 ++ *
4255 ++ * The below provides this either through switch_mm(), or in
4256 ++ * case 'prev->active_mm == next->mm' through
4257 ++ * finish_task_switch()'s mmdrop().
4258 ++ */
4259 ++ switch_mm_irqs_off(prev->active_mm, next->mm, next);
4260 ++
4261 ++ if (!prev->mm) { // from kernel
4262 ++ /* will mmdrop() in finish_task_switch(). */
4263 ++ rq->prev_mm = prev->active_mm;
4264 ++ prev->active_mm = NULL;
4265 ++ }
4266 ++ }
4267 ++
4268 ++ prepare_lock_switch(rq, next);
4269 ++
4270 ++ /* Here we just switch the register state and the stack. */
4271 ++ switch_to(prev, next, prev);
4272 ++ barrier();
4273 ++
4274 ++ return finish_task_switch(prev);
4275 ++}
4276 ++
4277 ++/*
4278 ++ * nr_running, nr_uninterruptible and nr_context_switches:
4279 ++ *
4280 ++ * externally visible scheduler statistics: current number of runnable
4281 ++ * threads, total number of context switches performed since bootup.
4282 ++ */
4283 ++unsigned int nr_running(void)
4284 ++{
4285 ++ unsigned int i, sum = 0;
4286 ++
4287 ++ for_each_online_cpu(i)
4288 ++ sum += cpu_rq(i)->nr_running;
4289 ++
4290 ++ return sum;
4291 ++}
4292 ++
4293 ++/*
4294 ++ * Check if only the current task is running on the CPU.
4295 ++ *
4296 ++ * Caution: this function does not check that the caller has disabled
4297 ++ * preemption, thus the result might have a time-of-check-to-time-of-use
4298 ++ * race. The caller is responsible to use it correctly, for example:
4299 ++ *
4300 ++ * - from a non-preemptible section (of course)
4301 ++ *
4302 ++ * - from a thread that is bound to a single CPU
4303 ++ *
4304 ++ * - in a loop with very short iterations (e.g. a polling loop)
4305 ++ */
4306 ++bool single_task_running(void)
4307 ++{
4308 ++ return raw_rq()->nr_running == 1;
4309 ++}
4310 ++EXPORT_SYMBOL(single_task_running);
4311 ++
4312 ++unsigned long long nr_context_switches(void)
4313 ++{
4314 ++ int i;
4315 ++ unsigned long long sum = 0;
4316 ++
4317 ++ for_each_possible_cpu(i)
4318 ++ sum += cpu_rq(i)->nr_switches;
4319 ++
4320 ++ return sum;
4321 ++}
4322 ++
4323 ++/*
4324 ++ * Consumers of these two interfaces, such as the cpuidle menu governor,
4325 ++ * are using nonsensical data: they prefer a shallow idle state for a CPU
4326 ++ * that has IO-wait pending, even though the waiting task might not end up
4327 ++ * running on that CPU when it does become runnable.
4328 ++ */
4329 ++
4330 ++unsigned int nr_iowait_cpu(int cpu)
4331 ++{
4332 ++ return atomic_read(&cpu_rq(cpu)->nr_iowait);
4333 ++}
4334 ++
4335 ++/*
4336 ++ * IO-wait accounting, and how it's mostly bollocks (on SMP).
4337 ++ *
4338 ++ * The idea behind IO-wait accounting is to account the idle time that we could
4339 ++ * have spent running if it were not for IO. That is, if we were to improve the
4340 ++ * storage performance, we'd have a proportional reduction in IO-wait time.
4341 ++ *
4342 ++ * This all works nicely on UP, where, when a task blocks on IO, we account
4343 ++ * idle time as IO-wait, because if the storage were faster, it could've been
4344 ++ * running and we'd not be idle.
4345 ++ *
4346 ++ * This has been extended to SMP, by doing the same for each CPU. This however
4347 ++ * is broken.
4348 ++ *
4349 ++ * Imagine for instance the case where two tasks block on one CPU, only the one
4350 ++ * CPU will have IO-wait accounted, while the other has regular idle. Even
4351 ++ * though, if the storage were faster, both could've run at the same time,
4352 ++ * utilising both CPUs.
4353 ++ *
4354 ++ * This means, that when looking globally, the current IO-wait accounting on
4355 ++ * SMP is a lower bound, due to under-accounting.
4356 ++ *
4357 ++ * Worse, since the numbers are provided per CPU, they are sometimes
4358 ++ * interpreted per CPU, and that is nonsensical. A blocked task isn't strictly
4359 ++ * associated with any one particular CPU, it can wake to another CPU than it
4360 ++ * blocked on. This means the per CPU IO-wait number is meaningless.
4361 ++ *
4362 ++ * Task CPU affinities can make all that even more 'interesting'.
4363 ++ */
4364 ++
4365 ++unsigned int nr_iowait(void)
4366 ++{
4367 ++ unsigned int i, sum = 0;
4368 ++
4369 ++ for_each_possible_cpu(i)
4370 ++ sum += nr_iowait_cpu(i);
4371 ++
4372 ++ return sum;
4373 ++}
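/*
 * [Editor's illustration, not part of the patch] The iowait column of
 * /proc/stat (the 5th value on the aggregate "cpu" line) is the usual
 * user-visible face of this accounting; as the comment above stresses, treat
 * it as a rough lower bound, not a per-CPU truth.  A minimal reader:
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/stat", "r");
	unsigned long long user, nice, system, idle, iowait;

	if (!f || fscanf(f, "cpu %llu %llu %llu %llu %llu",
			 &user, &nice, &system, &idle, &iowait) != 5) {
		perror("/proc/stat");
		return 1;
	}
	fclose(f);
	printf("idle=%llu iowait=%llu (USER_HZ ticks since boot)\n", idle, iowait);
	return 0;
}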
4374 ++
4375 ++#ifdef CONFIG_SMP
4376 ++
4377 ++/*
4378 ++ * sched_exec - execve() is a valuable balancing opportunity, because at
4379 ++ * this point the task has the smallest effective memory and cache
4380 ++ * footprint.
4381 ++ */
4382 ++void sched_exec(void)
4383 ++{
4384 ++ struct task_struct *p = current;
4385 ++ unsigned long flags;
4386 ++ int dest_cpu;
4387 ++
4388 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
4389 ++ dest_cpu = cpumask_any(p->cpus_ptr);
4390 ++ if (dest_cpu == smp_processor_id())
4391 ++ goto unlock;
4392 ++
4393 ++ if (likely(cpu_active(dest_cpu))) {
4394 ++ struct migration_arg arg = { p, dest_cpu };
4395 ++
4396 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
4397 ++ stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
4398 ++ return;
4399 ++ }
4400 ++unlock:
4401 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
4402 ++}
4403 ++
4404 ++#endif
4405 ++
4406 ++DEFINE_PER_CPU(struct kernel_stat, kstat);
4407 ++DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
4408 ++
4409 ++EXPORT_PER_CPU_SYMBOL(kstat);
4410 ++EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
4411 ++
4412 ++static inline void update_curr(struct rq *rq, struct task_struct *p)
4413 ++{
4414 ++ s64 ns = rq->clock_task - p->last_ran;
4415 ++
4416 ++ p->sched_time += ns;
4417 ++ cgroup_account_cputime(p, ns);
4418 ++ account_group_exec_runtime(p, ns);
4419 ++
4420 ++ p->time_slice -= ns;
4421 ++ p->last_ran = rq->clock_task;
4422 ++}
4423 ++
4424 ++/*
4425 ++ * Return accounted runtime for the task.
4426 ++ * Return separately the current task's pending runtime that has not been
4427 ++ * accounted yet.
4428 ++ */
4429 ++unsigned long long task_sched_runtime(struct task_struct *p)
4430 ++{
4431 ++ unsigned long flags;
4432 ++ struct rq *rq;
4433 ++ raw_spinlock_t *lock;
4434 ++ u64 ns;
4435 ++
4436 ++#if defined(CONFIG_64BIT) && defined(CONFIG_SMP)
4437 ++ /*
4438 ++ * 64-bit doesn't need locks to atomically read a 64-bit value.
4439 ++ * So we have an optimization chance when the task's delta_exec is 0.
4440 ++ * Reading ->on_cpu is racy, but this is ok.
4441 ++ *
4442 ++ * If we race with it leaving CPU, we'll take a lock. So we're correct.
4443 ++ * If we race with it entering CPU, unaccounted time is 0. This is
4444 ++ * indistinguishable from the read occurring a few cycles earlier.
4445 ++ * If we see ->on_cpu without ->on_rq, the task is leaving, and has
4446 ++ * been accounted, so we're correct here as well.
4447 ++ */
4448 ++ if (!p->on_cpu || !task_on_rq_queued(p))
4449 ++ return tsk_seruntime(p);
4450 ++#endif
4451 ++
4452 ++ rq = task_access_lock_irqsave(p, &lock, &flags);
4453 ++ /*
4454 ++ * Must be ->curr _and_ ->on_rq. If dequeued, we would
4455 ++ * project cycles that may never be accounted to this
4456 ++ * thread, breaking clock_gettime().
4457 ++ */
4458 ++ if (p == rq->curr && task_on_rq_queued(p)) {
4459 ++ update_rq_clock(rq);
4460 ++ update_curr(rq, p);
4461 ++ }
4462 ++ ns = tsk_seruntime(p);
4463 ++ task_access_unlock_irqrestore(p, lock, &flags);
4464 ++
4465 ++ return ns;
4466 ++}
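/*
 * [Editor's illustration, not part of the patch] task_sched_runtime() is used
 * to back clock_gettime() with CLOCK_THREAD_CPUTIME_ID (and the process-wide
 * variant), which is why the "must be ->curr _and_ ->on_rq" check above
 * matters.  A small user-space check that the reported thread CPU time grows
 * monotonically across a busy loop:
 */
#include <stdio.h>
#include <time.h>

static double thread_cpu_seconds(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	volatile unsigned long spin = 0;
	double before = thread_cpu_seconds();

	while (spin < 50UL * 1000 * 1000)  /* burn some CPU time */
		spin++;

	printf("thread cpu time: %.6f -> %.6f seconds\n",
	       before, thread_cpu_seconds());
	return 0;
}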
4467 ++
4468 ++/* This manages tasks that have run out of timeslice during a scheduler_tick */
4469 ++static inline void scheduler_task_tick(struct rq *rq)
4470 ++{
4471 ++ struct task_struct *p = rq->curr;
4472 ++
4473 ++ if (is_idle_task(p))
4474 ++ return;
4475 ++
4476 ++ update_curr(rq, p);
4477 ++ cpufreq_update_util(rq, 0);
4478 ++
4479 ++ /*
4480 ++ * Tasks that have less than RESCHED_NS of time slice left will be
4481 ++ * rescheduled.
4482 ++ */
4483 ++ if (p->time_slice >= RESCHED_NS)
4484 ++ return;
4485 ++ set_tsk_need_resched(p);
4486 ++ set_preempt_need_resched();
4487 ++}
4488 ++
4489 ++#ifdef CONFIG_SCHED_DEBUG
4490 ++static u64 cpu_resched_latency(struct rq *rq)
4491 ++{
4492 ++ int latency_warn_ms = READ_ONCE(sysctl_resched_latency_warn_ms);
4493 ++ u64 resched_latency, now = rq_clock(rq);
4494 ++ static bool warned_once;
4495 ++
4496 ++ if (sysctl_resched_latency_warn_once && warned_once)
4497 ++ return 0;
4498 ++
4499 ++ if (!need_resched() || !latency_warn_ms)
4500 ++ return 0;
4501 ++
4502 ++ if (system_state == SYSTEM_BOOTING)
4503 ++ return 0;
4504 ++
4505 ++ if (!rq->last_seen_need_resched_ns) {
4506 ++ rq->last_seen_need_resched_ns = now;
4507 ++ rq->ticks_without_resched = 0;
4508 ++ return 0;
4509 ++ }
4510 ++
4511 ++ rq->ticks_without_resched++;
4512 ++ resched_latency = now - rq->last_seen_need_resched_ns;
4513 ++ if (resched_latency <= latency_warn_ms * NSEC_PER_MSEC)
4514 ++ return 0;
4515 ++
4516 ++ warned_once = true;
4517 ++
4518 ++ return resched_latency;
4519 ++}
4520 ++
4521 ++static int __init setup_resched_latency_warn_ms(char *str)
4522 ++{
4523 ++ long val;
4524 ++
4525 ++ if ((kstrtol(str, 0, &val))) {
4526 ++ pr_warn("Unable to set resched_latency_warn_ms\n");
4527 ++ return 1;
4528 ++ }
4529 ++
4530 ++ sysctl_resched_latency_warn_ms = val;
4531 ++ return 1;
4532 ++}
4533 ++__setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms);
4534 ++#else
4535 ++static inline u64 cpu_resched_latency(struct rq *rq) { return 0; }
4536 ++#endif /* CONFIG_SCHED_DEBUG */
4537 ++
4538 ++/*
4539 ++ * This function gets called by the timer code, with HZ frequency.
4540 ++ * We call it with interrupts disabled.
4541 ++ */
4542 ++void scheduler_tick(void)
4543 ++{
4544 ++ int cpu __maybe_unused = smp_processor_id();
4545 ++ struct rq *rq = cpu_rq(cpu);
4546 ++ u64 resched_latency;
4547 ++
4548 ++ arch_scale_freq_tick();
4549 ++ sched_clock_tick();
4550 ++
4551 ++ raw_spin_lock(&rq->lock);
4552 ++ update_rq_clock(rq);
4553 ++
4554 ++ scheduler_task_tick(rq);
4555 ++ if (sched_feat(LATENCY_WARN))
4556 ++ resched_latency = cpu_resched_latency(rq);
4557 ++ calc_global_load_tick(rq);
4558 ++
4559 ++ rq->last_tick = rq->clock;
4560 ++ raw_spin_unlock(&rq->lock);
4561 ++
4562 ++ if (sched_feat(LATENCY_WARN) && resched_latency)
4563 ++ resched_latency_warn(cpu, resched_latency);
4564 ++
4565 ++ perf_event_task_tick();
4566 ++}
4567 ++
4568 ++#ifdef CONFIG_SCHED_SMT
4569 ++static inline int active_load_balance_cpu_stop(void *data)
4570 ++{
4571 ++ struct rq *rq = this_rq();
4572 ++ struct task_struct *p = data;
4573 ++ cpumask_t tmp;
4574 ++ unsigned long flags;
4575 ++
4576 ++ local_irq_save(flags);
4577 ++
4578 ++ raw_spin_lock(&p->pi_lock);
4579 ++ raw_spin_lock(&rq->lock);
4580 ++
4581 ++ rq->active_balance = 0;
4582 ++ /* _something_ may have changed the task, double check again */
4583 ++ if (task_on_rq_queued(p) && task_rq(p) == rq &&
4584 ++ cpumask_and(&tmp, p->cpus_ptr, &sched_sg_idle_mask) &&
4585 ++ !is_migration_disabled(p)) {
4586 ++ int cpu = cpu_of(rq);
4587 ++ int dcpu = __best_mask_cpu(&tmp, per_cpu(sched_cpu_llc_mask, cpu));
4588 ++ rq = move_queued_task(rq, p, dcpu);
4589 ++ }
4590 ++
4591 ++ raw_spin_unlock(&rq->lock);
4592 ++ raw_spin_unlock(&p->pi_lock);
4593 ++
4594 ++ local_irq_restore(flags);
4595 ++
4596 ++ return 0;
4597 ++}
4598 ++
4599 ++/* sg_balance_trigger - trigger sibling group balance for @cpu */
4600 ++static inline int sg_balance_trigger(const int cpu)
4601 ++{
4602 ++ struct rq *rq = cpu_rq(cpu);
4603 ++ unsigned long flags;
4604 ++ struct task_struct *curr;
4605 ++ int res;
4606 ++
4607 ++ if (!raw_spin_trylock_irqsave(&rq->lock, flags))
4608 ++ return 0;
4609 ++ curr = rq->curr;
4610 ++ res = (!is_idle_task(curr)) && (1 == rq->nr_running) &&\
4611 ++ cpumask_intersects(curr->cpus_ptr, &sched_sg_idle_mask) &&\
4612 ++ !is_migration_disabled(curr) && (!rq->active_balance);
4613 ++
4614 ++ if (res)
4615 ++ rq->active_balance = 1;
4616 ++
4617 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
4618 ++
4619 ++ if (res)
4620 ++ stop_one_cpu_nowait(cpu, active_load_balance_cpu_stop,
4621 ++ curr, &rq->active_balance_work);
4622 ++ return res;
4623 ++}
4624 ++
4625 ++/*
4626 ++ * sg_balance_check - sibling group balance check for run queue @rq
4627 ++ */
4628 ++static inline void sg_balance_check(struct rq *rq)
4629 ++{
4630 ++ cpumask_t chk;
4631 ++ int cpu = cpu_of(rq);
4632 ++
4633 ++ /* exit when cpu is offline */
4634 ++ if (unlikely(!rq->online))
4635 ++ return;
4636 ++
4637 ++ /*
4638 ++ * Only a cpu in the sibling idle group will do the checking and then
4639 ++ * find potential cpus which can migrate the current running task
4640 ++ */
4641 ++ if (cpumask_test_cpu(cpu, &sched_sg_idle_mask) &&
4642 ++ cpumask_andnot(&chk, cpu_online_mask, sched_rq_watermark) &&
4643 ++ cpumask_andnot(&chk, &chk, &sched_rq_pending_mask)) {
4644 ++ int i;
4645 ++
4646 ++ for_each_cpu_wrap(i, &chk, cpu) {
4647 ++ if (cpumask_subset(cpu_smt_mask(i), &chk) &&
4648 ++ sg_balance_trigger(i))
4649 ++ return;
4650 ++ }
4651 ++ }
4652 ++}
4653 ++#endif /* CONFIG_SCHED_SMT */
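/*
 * [Editor's illustration, not part of the patch] sg_balance_check() above is
 * essentially a chain of mask operations: start from the online CPUs,
 * successively mask out CPUs recorded in the watermark and pending masks, and
 * only then probe whole SMT sibling groups.  A sketch with plain 64-bit masks
 * (one bit per CPU); the sample mask values are made up for the demo.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t online    = 0x0f;  /* CPUs 0-3 online                    */
	uint64_t watermark = 0x05;  /* CPUs 0,2 set in the watermark mask */
	uint64_t pending   = 0x02;  /* CPU 1 has pending tasks            */
	uint64_t chk;

	chk = online & ~watermark;  /* cpumask_andnot(&chk, online, watermark) */
	chk &= ~pending;            /* cpumask_andnot(&chk, &chk, pending)     */

	printf("candidate CPUs for sibling-group balance: 0x%llx\n",
	       (unsigned long long)chk);
	return 0;
}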
4654 ++
4655 ++#ifdef CONFIG_NO_HZ_FULL
4656 ++
4657 ++struct tick_work {
4658 ++ int cpu;
4659 ++ atomic_t state;
4660 ++ struct delayed_work work;
4661 ++};
4662 ++/* Values for ->state, see diagram below. */
4663 ++#define TICK_SCHED_REMOTE_OFFLINE 0
4664 ++#define TICK_SCHED_REMOTE_OFFLINING 1
4665 ++#define TICK_SCHED_REMOTE_RUNNING 2
4666 ++
4667 ++/*
4668 ++ * State diagram for ->state:
4669 ++ *
4670 ++ *
4671 ++ * TICK_SCHED_REMOTE_OFFLINE
4672 ++ * | ^
4673 ++ * | |
4674 ++ * | | sched_tick_remote()
4675 ++ * | |
4676 ++ * | |
4677 ++ * +--TICK_SCHED_REMOTE_OFFLINING
4678 ++ * | ^
4679 ++ * | |
4680 ++ * sched_tick_start() | | sched_tick_stop()
4681 ++ * | |
4682 ++ * V |
4683 ++ * TICK_SCHED_REMOTE_RUNNING
4684 ++ *
4685 ++ *
4686 ++ * Other transitions get WARN_ON_ONCE(), except that sched_tick_remote()
4687 ++ * and sched_tick_start() are happy to leave the state in RUNNING.
4688 ++ */
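/*
 * [Editor's illustration, not part of the patch] The ->state machine above
 * can be modelled with a small CAS helper that mimics
 * atomic_fetch_add_unless(): add a delta unless the current value equals the
 * "unless" value.  The transitions below replay start -> remote tick ->
 * stop request -> final remote tick, matching the diagram; the code is a
 * user-space sketch, not the kernel implementation.
 */
#include <stdatomic.h>
#include <stdio.h>

enum { REMOTE_OFFLINE, REMOTE_OFFLINING, REMOTE_RUNNING };

static int fetch_add_unless(_Atomic int *v, int delta, int unless)
{
	int old = atomic_load(v);

	/* retry until the add succeeds or the value equals @unless */
	while (old != unless &&
	       !atomic_compare_exchange_weak(v, &old, old + delta))
		;
	return old;
}

int main(void)
{
	_Atomic int state = REMOTE_OFFLINE;
	int was, now;

	/* sched_tick_start(): OFFLINE -> RUNNING */
	printf("start: was %d\n", atomic_exchange(&state, REMOTE_RUNNING));

	/* remote tick while running: state stays RUNNING, work is requeued */
	was = fetch_add_unless(&state, -1, REMOTE_RUNNING);
	now = atomic_load(&state);
	printf("tick:  was %d, now %d\n", was, now);

	/* a stop request moves RUNNING -> OFFLINING ... */
	atomic_store(&state, REMOTE_OFFLINING);

	/* ... and the final remote tick steps OFFLINING -> OFFLINE */
	was = fetch_add_unless(&state, -1, REMOTE_RUNNING);
	now = atomic_load(&state);
	printf("final: was %d, now %d\n", was, now);
	return 0;
}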
4689 ++
4690 ++static struct tick_work __percpu *tick_work_cpu;
4691 ++
4692 ++static void sched_tick_remote(struct work_struct *work)
4693 ++{
4694 ++ struct delayed_work *dwork = to_delayed_work(work);
4695 ++ struct tick_work *twork = container_of(dwork, struct tick_work, work);
4696 ++ int cpu = twork->cpu;
4697 ++ struct rq *rq = cpu_rq(cpu);
4698 ++ struct task_struct *curr;
4699 ++ unsigned long flags;
4700 ++ u64 delta;
4701 ++ int os;
4702 ++
4703 ++ /*
4704 ++ * Handle the tick only if it appears the remote CPU is running in full
4705 ++ * dynticks mode. The check is racy by nature, but missing a tick or
4706 ++ * having one too much is no big deal because the scheduler tick updates
4707 ++ * statistics and checks timeslices in a time-independent way, regardless
4708 ++ * of when exactly it is running.
4709 ++ */
4710 ++ if (!tick_nohz_tick_stopped_cpu(cpu))
4711 ++ goto out_requeue;
4712 ++
4713 ++ raw_spin_lock_irqsave(&rq->lock, flags);
4714 ++ curr = rq->curr;
4715 ++ if (cpu_is_offline(cpu))
4716 ++ goto out_unlock;
4717 ++
4718 ++ update_rq_clock(rq);
4719 ++ if (!is_idle_task(curr)) {
4720 ++ /*
4721 ++ * Make sure the next tick runs within a reasonable
4722 ++ * amount of time.
4723 ++ */
4724 ++ delta = rq_clock_task(rq) - curr->last_ran;
4725 ++ WARN_ON_ONCE(delta > (u64)NSEC_PER_SEC * 3);
4726 ++ }
4727 ++ scheduler_task_tick(rq);
4728 ++
4729 ++ calc_load_nohz_remote(rq);
4730 ++out_unlock:
4731 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
4732 ++
4733 ++out_requeue:
4734 ++ /*
4735 ++ * Run the remote tick once per second (1Hz). This arbitrary
4736 ++ * frequency is large enough to avoid overload but short enough
4737 ++ * to keep scheduler internal stats reasonably up to date. But
4738 ++ * first update state to reflect hotplug activity if required.
4739 ++ */
4740 ++ os = atomic_fetch_add_unless(&twork->state, -1, TICK_SCHED_REMOTE_RUNNING);
4741 ++ WARN_ON_ONCE(os == TICK_SCHED_REMOTE_OFFLINE);
4742 ++ if (os == TICK_SCHED_REMOTE_RUNNING)
4743 ++ queue_delayed_work(system_unbound_wq, dwork, HZ);
4744 ++}
4745 ++
4746 ++static void sched_tick_start(int cpu)
4747 ++{
4748 ++ int os;
4749 ++ struct tick_work *twork;
4750 ++
4751 ++ if (housekeeping_cpu(cpu, HK_FLAG_TICK))
4752 ++ return;
4753 ++
4754 ++ WARN_ON_ONCE(!tick_work_cpu);
4755 ++
4756 ++ twork = per_cpu_ptr(tick_work_cpu, cpu);
4757 ++ os = atomic_xchg(&twork->state, TICK_SCHED_REMOTE_RUNNING);
4758 ++ WARN_ON_ONCE(os == TICK_SCHED_REMOTE_RUNNING);
4759 ++ if (os == TICK_SCHED_REMOTE_OFFLINE) {
4760 ++ twork->cpu = cpu;
4761 ++ INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
4762 ++ queue_delayed_work(system_unbound_wq, &twork->work, HZ);
4763 ++ }
4764 ++}
4765 ++
4766 ++#ifdef CONFIG_HOTPLUG_CPU
4767 ++static void sched_tick_stop(int cpu)
4768 ++{
4769 ++ struct tick_work *twork;
4770 ++
4771 ++ if (housekeeping_cpu(cpu, HK_FLAG_TICK))
4772 ++ return;
4773 ++
4774 ++ WARN_ON_ONCE(!tick_work_cpu);
4775 ++
4776 ++ twork = per_cpu_ptr(tick_work_cpu, cpu);
4777 ++ cancel_delayed_work_sync(&twork->work);
4778 ++}
4779 ++#endif /* CONFIG_HOTPLUG_CPU */
4780 ++
4781 ++int __init sched_tick_offload_init(void)
4782 ++{
4783 ++ tick_work_cpu = alloc_percpu(struct tick_work);
4784 ++ BUG_ON(!tick_work_cpu);
4785 ++ return 0;
4786 ++}
4787 ++
4788 ++#else /* !CONFIG_NO_HZ_FULL */
4789 ++static inline void sched_tick_start(int cpu) { }
4790 ++static inline void sched_tick_stop(int cpu) { }
4791 ++#endif
4792 ++
4793 ++#if defined(CONFIG_PREEMPTION) && (defined(CONFIG_DEBUG_PREEMPT) || \
4794 ++ defined(CONFIG_PREEMPT_TRACER))
4795 ++/*
4796 ++ * If the value passed in is equal to the current preempt count
4797 ++ * then we just disabled preemption. Start timing the latency.
4798 ++ */
4799 ++static inline void preempt_latency_start(int val)
4800 ++{
4801 ++ if (preempt_count() == val) {
4802 ++ unsigned long ip = get_lock_parent_ip();
4803 ++#ifdef CONFIG_DEBUG_PREEMPT
4804 ++ current->preempt_disable_ip = ip;
4805 ++#endif
4806 ++ trace_preempt_off(CALLER_ADDR0, ip);
4807 ++ }
4808 ++}
4809 ++
4810 ++void preempt_count_add(int val)
4811 ++{
4812 ++#ifdef CONFIG_DEBUG_PREEMPT
4813 ++ /*
4814 ++ * Underflow?
4815 ++ */
4816 ++ if (DEBUG_LOCKS_WARN_ON((preempt_count() < 0)))
4817 ++ return;
4818 ++#endif
4819 ++ __preempt_count_add(val);
4820 ++#ifdef CONFIG_DEBUG_PREEMPT
4821 ++ /*
4822 ++ * Spinlock count overflowing soon?
4823 ++ */
4824 ++ DEBUG_LOCKS_WARN_ON((preempt_count() & PREEMPT_MASK) >=
4825 ++ PREEMPT_MASK - 10);
4826 ++#endif
4827 ++ preempt_latency_start(val);
4828 ++}
4829 ++EXPORT_SYMBOL(preempt_count_add);
4830 ++NOKPROBE_SYMBOL(preempt_count_add);
4831 ++
4832 ++/*
4833 ++ * If the value passed in is equal to the current preempt count
4834 ++ * then we just enabled preemption. Stop timing the latency.
4835 ++ */
4836 ++static inline void preempt_latency_stop(int val)
4837 ++{
4838 ++ if (preempt_count() == val)
4839 ++ trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
4840 ++}
4841 ++
4842 ++void preempt_count_sub(int val)
4843 ++{
4844 ++#ifdef CONFIG_DEBUG_PREEMPT
4845 ++ /*
4846 ++ * Underflow?
4847 ++ */
4848 ++ if (DEBUG_LOCKS_WARN_ON(val > preempt_count()))
4849 ++ return;
4850 ++ /*
4851 ++ * Is the spinlock portion underflowing?
4852 ++ */
4853 ++ if (DEBUG_LOCKS_WARN_ON((val < PREEMPT_MASK) &&
4854 ++ !(preempt_count() & PREEMPT_MASK)))
4855 ++ return;
4856 ++#endif
4857 ++
4858 ++ preempt_latency_stop(val);
4859 ++ __preempt_count_sub(val);
4860 ++}
4861 ++EXPORT_SYMBOL(preempt_count_sub);
4862 ++NOKPROBE_SYMBOL(preempt_count_sub);
4863 ++
4864 ++#else
4865 ++static inline void preempt_latency_start(int val) { }
4866 ++static inline void preempt_latency_stop(int val) { }
4867 ++#endif
4868 ++
4869 ++static inline unsigned long get_preempt_disable_ip(struct task_struct *p)
4870 ++{
4871 ++#ifdef CONFIG_DEBUG_PREEMPT
4872 ++ return p->preempt_disable_ip;
4873 ++#else
4874 ++ return 0;
4875 ++#endif
4876 ++}
4877 ++
4878 ++/*
4879 ++ * Print scheduling while atomic bug:
4880 ++ */
4881 ++static noinline void __schedule_bug(struct task_struct *prev)
4882 ++{
4883 ++ /* Save this before calling printk(), since that will clobber it */
4884 ++ unsigned long preempt_disable_ip = get_preempt_disable_ip(current);
4885 ++
4886 ++ if (oops_in_progress)
4887 ++ return;
4888 ++
4889 ++ printk(KERN_ERR "BUG: scheduling while atomic: %s/%d/0x%08x\n",
4890 ++ prev->comm, prev->pid, preempt_count());
4891 ++
4892 ++ debug_show_held_locks(prev);
4893 ++ print_modules();
4894 ++ if (irqs_disabled())
4895 ++ print_irqtrace_events(prev);
4896 ++ if (IS_ENABLED(CONFIG_DEBUG_PREEMPT)
4897 ++ && in_atomic_preempt_off()) {
4898 ++ pr_err("Preemption disabled at:");
4899 ++ print_ip_sym(KERN_ERR, preempt_disable_ip);
4900 ++ }
4901 ++ if (panic_on_warn)
4902 ++ panic("scheduling while atomic\n");
4903 ++
4904 ++ dump_stack();
4905 ++ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
4906 ++}
4907 ++
4908 ++/*
4909 ++ * Various schedule()-time debugging checks and statistics:
4910 ++ */
4911 ++static inline void schedule_debug(struct task_struct *prev, bool preempt)
4912 ++{
4913 ++#ifdef CONFIG_SCHED_STACK_END_CHECK
4914 ++ if (task_stack_end_corrupted(prev))
4915 ++ panic("corrupted stack end detected inside scheduler\n");
4916 ++
4917 ++ if (task_scs_end_corrupted(prev))
4918 ++ panic("corrupted shadow stack detected inside scheduler\n");
4919 ++#endif
4920 ++
4921 ++#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
4922 ++ if (!preempt && READ_ONCE(prev->__state) && prev->non_block_count) {
4923 ++ printk(KERN_ERR "BUG: scheduling in a non-blocking section: %s/%d/%i\n",
4924 ++ prev->comm, prev->pid, prev->non_block_count);
4925 ++ dump_stack();
4926 ++ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
4927 ++ }
4928 ++#endif
4929 ++
4930 ++ if (unlikely(in_atomic_preempt_off())) {
4931 ++ __schedule_bug(prev);
4932 ++ preempt_count_set(PREEMPT_DISABLED);
4933 ++ }
4934 ++ rcu_sleep_check();
4935 ++ SCHED_WARN_ON(ct_state() == CONTEXT_USER);
4936 ++
4937 ++ profile_hit(SCHED_PROFILING, __builtin_return_address(0));
4938 ++
4939 ++ schedstat_inc(this_rq()->sched_count);
4940 ++}
4941 ++
4942 ++/*
4943 ++ * Compile time debug macro
4944 ++ * #define ALT_SCHED_DEBUG
4945 ++ */
4946 ++
4947 ++#ifdef ALT_SCHED_DEBUG
4948 ++void alt_sched_debug(void)
4949 ++{
4950 ++ printk(KERN_INFO "sched: pending: 0x%04lx, idle: 0x%04lx, sg_idle: 0x%04lx\n",
4951 ++ sched_rq_pending_mask.bits[0],
4952 ++ sched_rq_watermark[0].bits[0],
4953 ++ sched_sg_idle_mask.bits[0]);
4954 ++}
4955 ++#else
4956 ++inline void alt_sched_debug(void) {}
4957 ++#endif
4958 ++
4959 ++#ifdef CONFIG_SMP
4960 ++
4961 ++#define SCHED_RQ_NR_MIGRATION (32U)
4962 ++/*
4963 ++ * Migrate pending tasks in @rq to @dest_cpu
4964 ++ * Will try to migrate at most min(half of @rq's nr_running,
4965 ++ * SCHED_RQ_NR_MIGRATION) tasks to @dest_cpu
4966 ++ */
4967 ++static inline int
4968 ++migrate_pending_tasks(struct rq *rq, struct rq *dest_rq, const int dest_cpu)
4969 ++{
4970 ++ struct task_struct *p, *skip = rq->curr;
4971 ++ int nr_migrated = 0;
4972 ++ int nr_tries = min(rq->nr_running / 2, SCHED_RQ_NR_MIGRATION);
4973 ++
4974 ++ while (skip != rq->idle && nr_tries &&
4975 ++ (p = sched_rq_next_task(skip, rq)) != rq->idle) {
4976 ++ skip = sched_rq_next_task(p, rq);
4977 ++ if (cpumask_test_cpu(dest_cpu, p->cpus_ptr)) {
4978 ++ __SCHED_DEQUEUE_TASK(p, rq, 0, );
4979 ++ set_task_cpu(p, dest_cpu);
4980 ++ sched_task_sanity_check(p, dest_rq);
4981 ++ __SCHED_ENQUEUE_TASK(p, dest_rq, 0);
4982 ++ nr_migrated++;
4983 ++ }
4984 ++ nr_tries--;
4985 ++ }
4986 ++
4987 ++ return nr_migrated;
4988 ++}
4989 ++
4990 ++static inline int take_other_rq_tasks(struct rq *rq, int cpu)
4991 ++{
4992 ++ struct cpumask *topo_mask, *end_mask;
4993 ++
4994 ++ if (unlikely(!rq->online))
4995 ++ return 0;
4996 ++
4997 ++ if (cpumask_empty(&sched_rq_pending_mask))
4998 ++ return 0;
4999 ++
5000 ++ topo_mask = per_cpu(sched_cpu_topo_masks, cpu) + 1;
5001 ++ end_mask = per_cpu(sched_cpu_topo_end_mask, cpu);
5002 ++ do {
5003 ++ int i;
5004 ++ for_each_cpu_and(i, &sched_rq_pending_mask, topo_mask) {
5005 ++ int nr_migrated;
5006 ++ struct rq *src_rq;
5007 ++
5008 ++ src_rq = cpu_rq(i);
5009 ++ if (!do_raw_spin_trylock(&src_rq->lock))
5010 ++ continue;
5011 ++ spin_acquire(&src_rq->lock.dep_map,
5012 ++ SINGLE_DEPTH_NESTING, 1, _RET_IP_);
5013 ++
5014 ++ if ((nr_migrated = migrate_pending_tasks(src_rq, rq, cpu))) {
5015 ++ src_rq->nr_running -= nr_migrated;
5016 ++ if (src_rq->nr_running < 2)
5017 ++ cpumask_clear_cpu(i, &sched_rq_pending_mask);
5018 ++
5019 ++ rq->nr_running += nr_migrated;
5020 ++ if (rq->nr_running > 1)
5021 ++ cpumask_set_cpu(cpu, &sched_rq_pending_mask);
5022 ++
5023 ++ update_sched_rq_watermark(rq);
5024 ++ cpufreq_update_util(rq, 0);
5025 ++
5026 ++ spin_release(&src_rq->lock.dep_map, _RET_IP_);
5027 ++ do_raw_spin_unlock(&src_rq->lock);
5028 ++
5029 ++ return 1;
5030 ++ }
5031 ++
5032 ++ spin_release(&src_rq->lock.dep_map, _RET_IP_);
5033 ++ do_raw_spin_unlock(&src_rq->lock);
5034 ++ }
5035 ++ } while (++topo_mask < end_mask);
5036 ++
5037 ++ return 0;
5038 ++}
5039 ++#endif
5040 ++
5041 ++/*
5042 ++ * Timeslices below RESCHED_NS are considered as good as expired as there's no
5043 ++ * point rescheduling when there's so little time left.
5044 ++ */
5045 ++static inline void check_curr(struct task_struct *p, struct rq *rq)
5046 ++{
5047 ++ if (unlikely(rq->idle == p))
5048 ++ return;
5049 ++
5050 ++ update_curr(rq, p);
5051 ++
5052 ++ if (p->time_slice < RESCHED_NS)
5053 ++ time_slice_expired(p, rq);
5054 ++}
5055 ++
5056 ++static inline struct task_struct *
5057 ++choose_next_task(struct rq *rq, int cpu, struct task_struct *prev)
5058 ++{
5059 ++ struct task_struct *next;
5060 ++
5061 ++ if (unlikely(rq->skip)) {
5062 ++ next = rq_runnable_task(rq);
5063 ++ if (next == rq->idle) {
5064 ++#ifdef CONFIG_SMP
5065 ++ if (!take_other_rq_tasks(rq, cpu)) {
5066 ++#endif
5067 ++ rq->skip = NULL;
5068 ++ schedstat_inc(rq->sched_goidle);
5069 ++ return next;
5070 ++#ifdef CONFIG_SMP
5071 ++ }
5072 ++ next = rq_runnable_task(rq);
5073 ++#endif
5074 ++ }
5075 ++ rq->skip = NULL;
5076 ++#ifdef CONFIG_HIGH_RES_TIMERS
5077 ++ hrtick_start(rq, next->time_slice);
5078 ++#endif
5079 ++ return next;
5080 ++ }
5081 ++
5082 ++ next = sched_rq_first_task(rq);
5083 ++ if (next == rq->idle) {
5084 ++#ifdef CONFIG_SMP
5085 ++ if (!take_other_rq_tasks(rq, cpu)) {
5086 ++#endif
5087 ++ schedstat_inc(rq->sched_goidle);
5088 ++ /*printk(KERN_INFO "sched: choose_next_task(%d) idle %px\n", cpu, next);*/
5089 ++ return next;
5090 ++#ifdef CONFIG_SMP
5091 ++ }
5092 ++ next = sched_rq_first_task(rq);
5093 ++#endif
5094 ++ }
5095 ++#ifdef CONFIG_HIGH_RES_TIMERS
5096 ++ hrtick_start(rq, next->time_slice);
5097 ++#endif
5098 ++ /*printk(KERN_INFO "sched: choose_next_task(%d) next %px\n", cpu,
5099 ++ * next);*/
5100 ++ return next;
5101 ++}
5102 ++
5103 ++/*
5104 ++ * Constants for the sched_mode argument of __schedule().
5105 ++ *
5106 ++ * The mode argument allows RT enabled kernels to differentiate a
5107 ++ * preemption from blocking on an 'sleeping' spin/rwlock. Note that
5108 ++ * SM_MASK_PREEMPT for !RT has all bits set, which allows the compiler to
5109 ++ * optimize the AND operation out and just check for zero.
5110 ++ */
5111 ++#define SM_NONE 0x0
5112 ++#define SM_PREEMPT 0x1
5113 ++#define SM_RTLOCK_WAIT 0x2
5114 ++
5115 ++#ifndef CONFIG_PREEMPT_RT
5116 ++# define SM_MASK_PREEMPT (~0U)
5117 ++#else
5118 ++# define SM_MASK_PREEMPT SM_PREEMPT
5119 ++#endif
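
As the comment block above notes, on !PREEMPT_RT builds SM_MASK_PREEMPT is all-ones, so `sched_mode & SM_MASK_PREEMPT` collapses to a plain non-zero test. A minimal stand-alone sketch of that property (illustrative only, not part of the patch):

    #include <stdio.h>

    #define SM_NONE        0x0
    #define SM_PREEMPT     0x1
    #define SM_RTLOCK_WAIT 0x2

    /* !PREEMPT_RT case: the mask is all bits set, so the AND is optimized away. */
    #define SM_MASK_PREEMPT (~0U)

    static int counts_as_preemption(unsigned int sched_mode)
    {
            /* With SM_MASK_PREEMPT == ~0U this is just "sched_mode != 0". */
            return !!(sched_mode & SM_MASK_PREEMPT);
    }

    int main(void)
    {
            printf("SM_NONE        -> %d\n", counts_as_preemption(SM_NONE));        /* 0 */
            printf("SM_PREEMPT     -> %d\n", counts_as_preemption(SM_PREEMPT));     /* 1 */
            printf("SM_RTLOCK_WAIT -> %d\n", counts_as_preemption(SM_RTLOCK_WAIT)); /* 1 here */
            return 0;
    }

On PREEMPT_RT the mask narrows to SM_PREEMPT, so an SM_RTLOCK_WAIT switch is not counted as a preemption by that check.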
5120 ++
5121 ++/*
5122 ++ * schedule() is the main scheduler function.
5123 ++ *
5124 ++ * The main means of driving the scheduler and thus entering this function are:
5125 ++ *
5126 ++ * 1. Explicit blocking: mutex, semaphore, waitqueue, etc.
5127 ++ *
5128 ++ * 2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
5129 ++ * paths. For example, see arch/x86/entry_64.S.
5130 ++ *
5131 ++ * To drive preemption between tasks, the scheduler sets the flag in timer
5132 ++ * interrupt handler scheduler_tick().
5133 ++ *
5134 ++ * 3. Wakeups don't really cause entry into schedule(). They add a
5135 ++ * task to the run-queue and that's it.
5136 ++ *
5137 ++ * Now, if the new task added to the run-queue preempts the current
5138 ++ * task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
5139 ++ * called on the nearest possible occasion:
5140 ++ *
5141 ++ * - If the kernel is preemptible (CONFIG_PREEMPTION=y):
5142 ++ *
5143 ++ * - in syscall or exception context, at the next outmost
5144 ++ * preempt_enable(). (this might be as soon as the wake_up()'s
5145 ++ * spin_unlock()!)
5146 ++ *
5147 ++ * - in IRQ context, return from interrupt-handler to
5148 ++ * preemptible context
5149 ++ *
5150 ++ * - If the kernel is not preemptible (CONFIG_PREEMPTION is not set)
5151 ++ * then at the next:
5152 ++ *
5153 ++ * - cond_resched() call
5154 ++ * - explicit schedule() call
5155 ++ * - return from syscall or exception to user-space
5156 ++ * - return from interrupt-handler to user-space
5157 ++ *
5158 ++ * WARNING: must be called with preemption disabled!
5159 ++ */
5160 ++static void __sched notrace __schedule(unsigned int sched_mode)
5161 ++{
5162 ++ struct task_struct *prev, *next;
5163 ++ unsigned long *switch_count;
5164 ++ unsigned long prev_state;
5165 ++ struct rq *rq;
5166 ++ int cpu;
5167 ++
5168 ++ cpu = smp_processor_id();
5169 ++ rq = cpu_rq(cpu);
5170 ++ prev = rq->curr;
5171 ++
5172 ++ schedule_debug(prev, !!sched_mode);
5173 ++
5174 ++ /* bypass the sched_feat(HRTICK) check, which Alt schedule FW doesn't support */
5175 ++ hrtick_clear(rq);
5176 ++
5177 ++ local_irq_disable();
5178 ++ rcu_note_context_switch(!!sched_mode);
5179 ++
5180 ++ /*
5181 ++ * Make sure that signal_pending_state()->signal_pending() below
5182 ++ * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
5183 ++ * done by the caller to avoid the race with signal_wake_up():
5184 ++ *
5185 ++ * __set_current_state(@state) signal_wake_up()
5186 ++ * schedule() set_tsk_thread_flag(p, TIF_SIGPENDING)
5187 ++ * wake_up_state(p, state)
5188 ++ * LOCK rq->lock LOCK p->pi_state
5189 ++ * smp_mb__after_spinlock() smp_mb__after_spinlock()
5190 ++ * if (signal_pending_state()) if (p->state & @state)
5191 ++ *
5192 ++ * Also, the membarrier system call requires a full memory barrier
5193 ++ * after coming from user-space, before storing to rq->curr.
5194 ++ */
5195 ++ raw_spin_lock(&rq->lock);
5196 ++ smp_mb__after_spinlock();
5197 ++
5198 ++ update_rq_clock(rq);
5199 ++
5200 ++ switch_count = &prev->nivcsw;
5201 ++ /*
5202 ++ * We must load prev->state once (task_struct::state is volatile), such
5203 ++ * that:
5204 ++ *
5205 ++ * - we form a control dependency vs deactivate_task() below.
5206 ++ * - ptrace_{,un}freeze_traced() can change ->state underneath us.
5207 ++ */
5208 ++ prev_state = READ_ONCE(prev->__state);
5209 ++ if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
5210 ++ if (signal_pending_state(prev_state, prev)) {
5211 ++ WRITE_ONCE(prev->__state, TASK_RUNNING);
5212 ++ } else {
5213 ++ prev->sched_contributes_to_load =
5214 ++ (prev_state & TASK_UNINTERRUPTIBLE) &&
5215 ++ !(prev_state & TASK_NOLOAD) &&
5216 ++ !(prev->flags & PF_FROZEN);
5217 ++
5218 ++ if (prev->sched_contributes_to_load)
5219 ++ rq->nr_uninterruptible++;
5220 ++
5221 ++ /*
5222 ++ * __schedule() ttwu()
5223 ++ * prev_state = prev->state; if (p->on_rq && ...)
5224 ++ * if (prev_state) goto out;
5225 ++ * p->on_rq = 0; smp_acquire__after_ctrl_dep();
5226 ++ * p->state = TASK_WAKING
5227 ++ *
5228 ++ * Where __schedule() and ttwu() have matching control dependencies.
5229 ++ *
5230 ++ * After this, schedule() must not care about p->state any more.
5231 ++ */
5232 ++ sched_task_deactivate(prev, rq);
5233 ++ deactivate_task(prev, rq);
5234 ++
5235 ++ if (prev->in_iowait) {
5236 ++ atomic_inc(&rq->nr_iowait);
5237 ++ delayacct_blkio_start();
5238 ++ }
5239 ++ }
5240 ++ switch_count = &prev->nvcsw;
5241 ++ }
5242 ++
5243 ++ check_curr(prev, rq);
5244 ++
5245 ++ next = choose_next_task(rq, cpu, prev);
5246 ++ clear_tsk_need_resched(prev);
5247 ++ clear_preempt_need_resched();
5248 ++#ifdef CONFIG_SCHED_DEBUG
5249 ++ rq->last_seen_need_resched_ns = 0;
5250 ++#endif
5251 ++
5252 ++ if (likely(prev != next)) {
5253 ++ next->last_ran = rq->clock_task;
5254 ++ rq->last_ts_switch = rq->clock;
5255 ++
5256 ++ rq->nr_switches++;
5257 ++ /*
5258 ++ * RCU users of rcu_dereference(rq->curr) may not see
5259 ++ * changes to task_struct made by pick_next_task().
5260 ++ */
5261 ++ RCU_INIT_POINTER(rq->curr, next);
5262 ++ /*
5263 ++ * The membarrier system call requires each architecture
5264 ++ * to have a full memory barrier after updating
5265 ++ * rq->curr, before returning to user-space.
5266 ++ *
5267 ++ * Here are the schemes providing that barrier on the
5268 ++ * various architectures:
5269 ++ * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
5270 ++ * switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
5271 ++ * - finish_lock_switch() for weakly-ordered
5272 ++ * architectures where spin_unlock is a full barrier,
5273 ++ * - switch_to() for arm64 (weakly-ordered, spin_unlock
5274 ++ * is a RELEASE barrier),
5275 ++ */
5276 ++ ++*switch_count;
5277 ++
5278 ++ psi_sched_switch(prev, next, !task_on_rq_queued(prev));
5279 ++
5280 ++ trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next);
5281 ++
5282 ++ /* Also unlocks the rq: */
5283 ++ rq = context_switch(rq, prev, next);
5284 ++ } else {
5285 ++ __balance_callbacks(rq);
5286 ++ raw_spin_unlock_irq(&rq->lock);
5287 ++ }
5288 ++
5289 ++#ifdef CONFIG_SCHED_SMT
5290 ++ sg_balance_check(rq);
5291 ++#endif
5292 ++}
5293 ++
5294 ++void __noreturn do_task_dead(void)
5295 ++{
5296 ++ /* Causes final put_task_struct in finish_task_switch(): */
5297 ++ set_special_state(TASK_DEAD);
5298 ++
5299 ++ /* Tell freezer to ignore us: */
5300 ++ current->flags |= PF_NOFREEZE;
5301 ++
5302 ++ __schedule(SM_NONE);
5303 ++ BUG();
5304 ++
5305 ++ /* Avoid "noreturn function does return" - but don't continue if BUG() is a NOP: */
5306 ++ for (;;)
5307 ++ cpu_relax();
5308 ++}
5309 ++
5310 ++static inline void sched_submit_work(struct task_struct *tsk)
5311 ++{
5312 ++ unsigned int task_flags;
5313 ++
5314 ++ if (task_is_running(tsk))
5315 ++ return;
5316 ++
5317 ++ task_flags = tsk->flags;
5318 ++ /*
5319 ++ * If a worker went to sleep, notify and ask workqueue whether
5320 ++ * it wants to wake up a task to maintain concurrency.
5321 ++ * As this function is called inside the schedule() context,
5322 ++ * we disable preemption to avoid it calling schedule() again
5323 ++ * in the possible wakeup of a kworker and because wq_worker_sleeping()
5324 ++ * requires it.
5325 ++ */
5326 ++ if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
5327 ++ preempt_disable();
5328 ++ if (task_flags & PF_WQ_WORKER)
5329 ++ wq_worker_sleeping(tsk);
5330 ++ else
5331 ++ io_wq_worker_sleeping(tsk);
5332 ++ preempt_enable_no_resched();
5333 ++ }
5334 ++
5335 ++ if (tsk_is_pi_blocked(tsk))
5336 ++ return;
5337 ++
5338 ++ /*
5339 ++ * If we are going to sleep and we have plugged IO queued,
5340 ++ * make sure to submit it to avoid deadlocks.
5341 ++ */
5342 ++ if (blk_needs_flush_plug(tsk))
5343 ++ blk_schedule_flush_plug(tsk);
5344 ++}
5345 ++
5346 ++static void sched_update_worker(struct task_struct *tsk)
5347 ++{
5348 ++ if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
5349 ++ if (tsk->flags & PF_WQ_WORKER)
5350 ++ wq_worker_running(tsk);
5351 ++ else
5352 ++ io_wq_worker_running(tsk);
5353 ++ }
5354 ++}
5355 ++
5356 ++asmlinkage __visible void __sched schedule(void)
5357 ++{
5358 ++ struct task_struct *tsk = current;
5359 ++
5360 ++ sched_submit_work(tsk);
5361 ++ do {
5362 ++ preempt_disable();
5363 ++ __schedule(SM_NONE);
5364 ++ sched_preempt_enable_no_resched();
5365 ++ } while (need_resched());
5366 ++ sched_update_worker(tsk);
5367 ++}
5368 ++EXPORT_SYMBOL(schedule);
5369 ++
5370 ++/*
5371 ++ * synchronize_rcu_tasks() makes sure that no task is stuck in preempted
5372 ++ * state (have scheduled out non-voluntarily) by making sure that all
5373 ++ * tasks have either left the run queue or have gone into user space.
5374 ++ * As idle tasks do not do either, they must not ever be preempted
5375 ++ * (schedule out non-voluntarily).
5376 ++ *
5377 ++ * schedule_idle() is similar to schedule_preempt_disabled() except that it
5378 ++ * never enables preemption because it does not call sched_submit_work().
5379 ++ */
5380 ++void __sched schedule_idle(void)
5381 ++{
5382 ++ /*
5383 ++ * As this skips calling sched_submit_work(), which the idle task does
5384 ++ * regardless because that function is a nop when the task is in a
5385 ++ * TASK_RUNNING state, make sure this isn't used someplace that the
5386 ++ * current task can be in any other state. Note, idle is always in the
5387 ++ * TASK_RUNNING state.
5388 ++ */
5389 ++ WARN_ON_ONCE(current->__state);
5390 ++ do {
5391 ++ __schedule(SM_NONE);
5392 ++ } while (need_resched());
5393 ++}
5394 ++
5395 ++#if defined(CONFIG_CONTEXT_TRACKING) && !defined(CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK)
5396 ++asmlinkage __visible void __sched schedule_user(void)
5397 ++{
5398 ++ /*
5399 ++ * If we come here after a random call to set_need_resched(),
5400 ++ * or we have been woken up remotely but the IPI has not yet arrived,
5401 ++ * we haven't yet exited the RCU idle mode. Do it here manually until
5402 ++ * we find a better solution.
5403 ++ *
5404 ++ * NB: There are buggy callers of this function. Ideally we
5405 ++ * should warn if prev_state != CONTEXT_USER, but that will trigger
5406 ++ * too frequently to make sense yet.
5407 ++ */
5408 ++ enum ctx_state prev_state = exception_enter();
5409 ++ schedule();
5410 ++ exception_exit(prev_state);
5411 ++}
5412 ++#endif
5413 ++
5414 ++/**
5415 ++ * schedule_preempt_disabled - called with preemption disabled
5416 ++ *
5417 ++ * Returns with preemption disabled. Note: preempt_count must be 1
5418 ++ */
5419 ++void __sched schedule_preempt_disabled(void)
5420 ++{
5421 ++ sched_preempt_enable_no_resched();
5422 ++ schedule();
5423 ++ preempt_disable();
5424 ++}
5425 ++
5426 ++#ifdef CONFIG_PREEMPT_RT
5427 ++void __sched notrace schedule_rtlock(void)
5428 ++{
5429 ++ do {
5430 ++ preempt_disable();
5431 ++ __schedule(SM_RTLOCK_WAIT);
5432 ++ sched_preempt_enable_no_resched();
5433 ++ } while (need_resched());
5434 ++}
5435 ++NOKPROBE_SYMBOL(schedule_rtlock);
5436 ++#endif
5437 ++
5438 ++static void __sched notrace preempt_schedule_common(void)
5439 ++{
5440 ++ do {
5441 ++ /*
5442 ++ * Because the function tracer can trace preempt_count_sub()
5443 ++ * and it also uses preempt_enable/disable_notrace(), if
5444 ++ * NEED_RESCHED is set, the preempt_enable_notrace() called
5445 ++ * by the function tracer will call this function again and
5446 ++ * cause infinite recursion.
5447 ++ *
5448 ++ * Preemption must be disabled here before the function
5449 ++ * tracer can trace. Break up preempt_disable() into two
5450 ++ * calls. One to disable preemption without fear of being
5451 ++ * traced. The other to still record the preemption latency,
5452 ++ * which can also be traced by the function tracer.
5453 ++ */
5454 ++ preempt_disable_notrace();
5455 ++ preempt_latency_start(1);
5456 ++ __schedule(SM_PREEMPT);
5457 ++ preempt_latency_stop(1);
5458 ++ preempt_enable_no_resched_notrace();
5459 ++
5460 ++ /*
5461 ++ * Check again in case we missed a preemption opportunity
5462 ++ * between schedule and now.
5463 ++ */
5464 ++ } while (need_resched());
5465 ++}
5466 ++
5467 ++#ifdef CONFIG_PREEMPTION
5468 ++/*
5469 ++ * This is the entry point to schedule() from in-kernel preemption
5470 ++ * off of preempt_enable.
5471 ++ */
5472 ++asmlinkage __visible void __sched notrace preempt_schedule(void)
5473 ++{
5474 ++ /*
5475 ++ * If there is a non-zero preempt_count or interrupts are disabled,
5476 ++ * we do not want to preempt the current task. Just return..
5477 ++ */
5478 ++ if (likely(!preemptible()))
5479 ++ return;
5480 ++
5481 ++ preempt_schedule_common();
5482 ++}
5483 ++NOKPROBE_SYMBOL(preempt_schedule);
5484 ++EXPORT_SYMBOL(preempt_schedule);
5485 ++
5486 ++#ifdef CONFIG_PREEMPT_DYNAMIC
5487 ++DEFINE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
5488 ++EXPORT_STATIC_CALL_TRAMP(preempt_schedule);
5489 ++#endif
5490 ++
5491 ++
5492 ++/**
5493 ++ * preempt_schedule_notrace - preempt_schedule called by tracing
5494 ++ *
5495 ++ * The tracing infrastructure uses preempt_enable_notrace to prevent
5496 ++ * recursion and tracing preempt enabling caused by the tracing
5497 ++ * infrastructure itself. But as tracing can happen in areas coming
5498 ++ * from userspace or just about to enter userspace, a preempt enable
5499 ++ * can occur before user_exit() is called. This will cause the scheduler
5500 ++ * to be called when the system is still in usermode.
5501 ++ *
5502 ++ * To prevent this, the preempt_enable_notrace will use this function
5503 ++ * instead of preempt_schedule() to exit user context if needed before
5504 ++ * calling the scheduler.
5505 ++ */
5506 ++asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)
5507 ++{
5508 ++ enum ctx_state prev_ctx;
5509 ++
5510 ++ if (likely(!preemptible()))
5511 ++ return;
5512 ++
5513 ++ do {
5514 ++ /*
5515 ++ * Because the function tracer can trace preempt_count_sub()
5516 ++ * and it also uses preempt_enable/disable_notrace(), if
5517 ++ * NEED_RESCHED is set, the preempt_enable_notrace() called
5518 ++ * by the function tracer will call this function again and
5519 ++ * cause infinite recursion.
5520 ++ *
5521 ++ * Preemption must be disabled here before the function
5522 ++ * tracer can trace. Break up preempt_disable() into two
5523 ++ * calls. One to disable preemption without fear of being
5524 ++ * traced. The other to still record the preemption latency,
5525 ++ * which can also be traced by the function tracer.
5526 ++ */
5527 ++ preempt_disable_notrace();
5528 ++ preempt_latency_start(1);
5529 ++ /*
5530 ++ * Needs preempt disabled in case user_exit() is traced
5531 ++ * and the tracer calls preempt_enable_notrace() causing
5532 ++ * an infinite recursion.
5533 ++ */
5534 ++ prev_ctx = exception_enter();
5535 ++ __schedule(SM_PREEMPT);
5536 ++ exception_exit(prev_ctx);
5537 ++
5538 ++ preempt_latency_stop(1);
5539 ++ preempt_enable_no_resched_notrace();
5540 ++ } while (need_resched());
5541 ++}
5542 ++EXPORT_SYMBOL_GPL(preempt_schedule_notrace);
5543 ++
5544 ++#ifdef CONFIG_PREEMPT_DYNAMIC
5545 ++DEFINE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
5546 ++EXPORT_STATIC_CALL_TRAMP(preempt_schedule_notrace);
5547 ++#endif
5548 ++
5549 ++#endif /* CONFIG_PREEMPTION */
5550 ++
5551 ++#ifdef CONFIG_PREEMPT_DYNAMIC
5552 ++
5553 ++#include <linux/entry-common.h>
5554 ++
5555 ++/*
5556 ++ * SC:cond_resched
5557 ++ * SC:might_resched
5558 ++ * SC:preempt_schedule
5559 ++ * SC:preempt_schedule_notrace
5560 ++ * SC:irqentry_exit_cond_resched
5561 ++ *
5562 ++ *
5563 ++ * NONE:
5564 ++ * cond_resched <- __cond_resched
5565 ++ * might_resched <- RET0
5566 ++ * preempt_schedule <- NOP
5567 ++ * preempt_schedule_notrace <- NOP
5568 ++ * irqentry_exit_cond_resched <- NOP
5569 ++ *
5570 ++ * VOLUNTARY:
5571 ++ * cond_resched <- __cond_resched
5572 ++ * might_resched <- __cond_resched
5573 ++ * preempt_schedule <- NOP
5574 ++ * preempt_schedule_notrace <- NOP
5575 ++ * irqentry_exit_cond_resched <- NOP
5576 ++ *
5577 ++ * FULL:
5578 ++ * cond_resched <- RET0
5579 ++ * might_resched <- RET0
5580 ++ * preempt_schedule <- preempt_schedule
5581 ++ * preempt_schedule_notrace <- preempt_schedule_notrace
5582 ++ * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
5583 ++ */
5584 ++
5585 ++enum {
5586 ++ preempt_dynamic_none = 0,
5587 ++ preempt_dynamic_voluntary,
5588 ++ preempt_dynamic_full,
5589 ++};
5590 ++
5591 ++int preempt_dynamic_mode = preempt_dynamic_full;
5592 ++
5593 ++int sched_dynamic_mode(const char *str)
5594 ++{
5595 ++ if (!strcmp(str, "none"))
5596 ++ return preempt_dynamic_none;
5597 ++
5598 ++ if (!strcmp(str, "voluntary"))
5599 ++ return preempt_dynamic_voluntary;
5600 ++
5601 ++ if (!strcmp(str, "full"))
5602 ++ return preempt_dynamic_full;
5603 ++
5604 ++ return -EINVAL;
5605 ++}
5606 ++
5607 ++void sched_dynamic_update(int mode)
5608 ++{
5609 ++ /*
5610 ++ * Avoid {NONE,VOLUNTARY} -> FULL transitions from ever ending up in
5611 ++ * the ZERO state, which is invalid.
5612 ++ */
5613 ++ static_call_update(cond_resched, __cond_resched);
5614 ++ static_call_update(might_resched, __cond_resched);
5615 ++ static_call_update(preempt_schedule, __preempt_schedule_func);
5616 ++ static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
5617 ++ static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
5618 ++
5619 ++ switch (mode) {
5620 ++ case preempt_dynamic_none:
5621 ++ static_call_update(cond_resched, __cond_resched);
5622 ++ static_call_update(might_resched, (void *)&__static_call_return0);
5623 ++ static_call_update(preempt_schedule, NULL);
5624 ++ static_call_update(preempt_schedule_notrace, NULL);
5625 ++ static_call_update(irqentry_exit_cond_resched, NULL);
5626 ++ pr_info("Dynamic Preempt: none\n");
5627 ++ break;
5628 ++
5629 ++ case preempt_dynamic_voluntary:
5630 ++ static_call_update(cond_resched, __cond_resched);
5631 ++ static_call_update(might_resched, __cond_resched);
5632 ++ static_call_update(preempt_schedule, NULL);
5633 ++ static_call_update(preempt_schedule_notrace, NULL);
5634 ++ static_call_update(irqentry_exit_cond_resched, NULL);
5635 ++ pr_info("Dynamic Preempt: voluntary\n");
5636 ++ break;
5637 ++
5638 ++ case preempt_dynamic_full:
5639 ++ static_call_update(cond_resched, (void *)&__static_call_return0);
5640 ++ static_call_update(might_resched, (void *)&__static_call_return0);
5641 ++ static_call_update(preempt_schedule, __preempt_schedule_func);
5642 ++ static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
5643 ++ static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
5644 ++ pr_info("Dynamic Preempt: full\n");
5645 ++ break;
5646 ++ }
5647 ++
5648 ++ preempt_dynamic_mode = mode;
5649 ++}
5650 ++
5651 ++static int __init setup_preempt_mode(char *str)
5652 ++{
5653 ++ int mode = sched_dynamic_mode(str);
5654 ++ if (mode < 0) {
5655 ++ pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
5656 ++ return 1;
5657 ++ }
5658 ++
5659 ++ sched_dynamic_update(mode);
5660 ++ return 0;
5661 ++}
5662 ++__setup("preempt=", setup_preempt_mode);
5663 ++
5664 ++#endif /* CONFIG_PREEMPT_DYNAMIC */
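
For readers following the NONE/VOLUNTARY/FULL mapping documented above, the net effect of each `preempt=` boot value can be summarised in a small, purely illustrative lookup; the struct and field names below are invented for the sketch and do not exist in the kernel:

    /* Which of the patched static calls stay active per "preempt=" mode. */
    struct preempt_mode_summary {
            const char *name;
            int cond_resched_enabled;       /* cond_resched() wired to __cond_resched  */
            int might_resched_enabled;      /* might_resched() actually reschedules    */
            int full_preemption_enabled;    /* preempt_schedule*() entry points active */
    };

    static const struct preempt_mode_summary summary[] = {
            { "none",      1, 0, 0 },
            { "voluntary", 1, 1, 0 },
            { "full",      0, 0, 1 },
    };

For example, booting with preempt=voluntary keeps cond_resched()/might_resched() functional while the involuntary preemption entry points stay NOPs, matching what sched_dynamic_update() installs above.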
5665 ++
5666 ++/*
5667 ++ * This is the entry point to schedule() from kernel preemption
5668 ++ * off of irq context.
5669 ++ * Note, that this is called and return with irqs disabled. This will
5670 ++ * protect us against recursive calling from irq.
5671 ++ */
5672 ++asmlinkage __visible void __sched preempt_schedule_irq(void)
5673 ++{
5674 ++ enum ctx_state prev_state;
5675 ++
5676 ++ /* Catch callers which need to be fixed */
5677 ++ BUG_ON(preempt_count() || !irqs_disabled());
5678 ++
5679 ++ prev_state = exception_enter();
5680 ++
5681 ++ do {
5682 ++ preempt_disable();
5683 ++ local_irq_enable();
5684 ++ __schedule(SM_PREEMPT);
5685 ++ local_irq_disable();
5686 ++ sched_preempt_enable_no_resched();
5687 ++ } while (need_resched());
5688 ++
5689 ++ exception_exit(prev_state);
5690 ++}
5691 ++
5692 ++int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
5693 ++ void *key)
5694 ++{
5695 ++ WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) && wake_flags & ~WF_SYNC);
5696 ++ return try_to_wake_up(curr->private, mode, wake_flags);
5697 ++}
5698 ++EXPORT_SYMBOL(default_wake_function);
5699 ++
5700 ++static inline void check_task_changed(struct task_struct *p, struct rq *rq)
5701 ++{
5702 ++ /* Trigger resched if task sched_prio has been modified. */
5703 ++ if (task_on_rq_queued(p) && task_sched_prio_idx(p, rq) != p->sq_idx) {
5704 ++ requeue_task(p, rq);
5705 ++ check_preempt_curr(rq);
5706 ++ }
5707 ++}
5708 ++
5709 ++static void __setscheduler_prio(struct task_struct *p, int prio)
5710 ++{
5711 ++ p->prio = prio;
5712 ++}
5713 ++
5714 ++#ifdef CONFIG_RT_MUTEXES
5715 ++
5716 ++static inline int __rt_effective_prio(struct task_struct *pi_task, int prio)
5717 ++{
5718 ++ if (pi_task)
5719 ++ prio = min(prio, pi_task->prio);
5720 ++
5721 ++ return prio;
5722 ++}
5723 ++
5724 ++static inline int rt_effective_prio(struct task_struct *p, int prio)
5725 ++{
5726 ++ struct task_struct *pi_task = rt_mutex_get_top_task(p);
5727 ++
5728 ++ return __rt_effective_prio(pi_task, prio);
5729 ++}
5730 ++
5731 ++/*
5732 ++ * rt_mutex_setprio - set the current priority of a task
5733 ++ * @p: task to boost
5734 ++ * @pi_task: donor task
5735 ++ *
5736 ++ * This function changes the 'effective' priority of a task. It does
5737 ++ * not touch ->normal_prio like __setscheduler().
5738 ++ *
5739 ++ * Used by the rt_mutex code to implement priority inheritance
5740 ++ * logic. Call site only calls if the priority of the task changed.
5741 ++ */
5742 ++void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
5743 ++{
5744 ++ int prio;
5745 ++ struct rq *rq;
5746 ++ raw_spinlock_t *lock;
5747 ++
5748 ++ /* XXX used to be waiter->prio, not waiter->task->prio */
5749 ++ prio = __rt_effective_prio(pi_task, p->normal_prio);
5750 ++
5751 ++ /*
5752 ++ * If nothing changed; bail early.
5753 ++ */
5754 ++ if (p->pi_top_task == pi_task && prio == p->prio)
5755 ++ return;
5756 ++
5757 ++ rq = __task_access_lock(p, &lock);
5758 ++ /*
5759 ++ * Set under pi_lock && rq->lock, such that the value can be used under
5760 ++ * either lock.
5761 ++ *
5762 ++ * Note that there is loads of tricky to make this pointer cache work
5763 ++ * right. rt_mutex_slowunlock()+rt_mutex_postunlock() work together to
5764 ++ * ensure a task is de-boosted (pi_task is set to NULL) before the
5765 ++ * task is allowed to run again (and can exit). This ensures the pointer
5766 ++ * points to a blocked task -- which guarantees the task is present.
5767 ++ */
5768 ++ p->pi_top_task = pi_task;
5769 ++
5770 ++ /*
5771 ++ * For FIFO/RR we only need to set prio, if that matches we're done.
5772 ++ */
5773 ++ if (prio == p->prio)
5774 ++ goto out_unlock;
5775 ++
5776 ++ /*
5777 ++ * Idle task boosting is a nono in general. There is one
5778 ++ * exception, when PREEMPT_RT and NOHZ is active:
5779 ++ *
5780 ++ * The idle task calls get_next_timer_interrupt() and holds
5781 ++ * the timer wheel base->lock on the CPU and another CPU wants
5782 ++ * to access the timer (probably to cancel it). We can safely
5783 ++ * ignore the boosting request, as the idle CPU runs this code
5784 ++ * with interrupts disabled and will complete the lock
5785 ++ * protected section without being interrupted. So there is no
5786 ++ * real need to boost.
5787 ++ */
5788 ++ if (unlikely(p == rq->idle)) {
5789 ++ WARN_ON(p != rq->curr);
5790 ++ WARN_ON(p->pi_blocked_on);
5791 ++ goto out_unlock;
5792 ++ }
5793 ++
5794 ++ trace_sched_pi_setprio(p, pi_task);
5795 ++
5796 ++ __setscheduler_prio(p, prio);
5797 ++
5798 ++ check_task_changed(p, rq);
5799 ++out_unlock:
5800 ++ /* Avoid rq from going away on us: */
5801 ++ preempt_disable();
5802 ++
5803 ++ __balance_callbacks(rq);
5804 ++ __task_access_unlock(p, lock);
5805 ++
5806 ++ preempt_enable();
5807 ++}
5808 ++#else
5809 ++static inline int rt_effective_prio(struct task_struct *p, int prio)
5810 ++{
5811 ++ return prio;
5812 ++}
5813 ++#endif
5814 ++
5815 ++void set_user_nice(struct task_struct *p, long nice)
5816 ++{
5817 ++ unsigned long flags;
5818 ++ struct rq *rq;
5819 ++ raw_spinlock_t *lock;
5820 ++
5821 ++ if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE)
5822 ++ return;
5823 ++ /*
5824 ++ * We have to be careful, if called from sys_setpriority(),
5825 ++ * the task might be in the middle of scheduling on another CPU.
5826 ++ */
5827 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
5828 ++ rq = __task_access_lock(p, &lock);
5829 ++
5830 ++ p->static_prio = NICE_TO_PRIO(nice);
5831 ++ /*
5832 ++ * The RT priorities are set via sched_setscheduler(), but we still
5833 ++ * allow the 'normal' nice value to be set - but as expected
5834 ++ * it won't have any effect on scheduling while the task is
5835 ++ * not SCHED_NORMAL/SCHED_BATCH:
5836 ++ */
5837 ++ if (task_has_rt_policy(p))
5838 ++ goto out_unlock;
5839 ++
5840 ++ p->prio = effective_prio(p);
5841 ++
5842 ++ check_task_changed(p, rq);
5843 ++out_unlock:
5844 ++ __task_access_unlock(p, lock);
5845 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
5846 ++}
5847 ++EXPORT_SYMBOL(set_user_nice);
5848 ++
5849 ++/*
5850 ++ * can_nice - check if a task can reduce its nice value
5851 ++ * @p: task
5852 ++ * @nice: nice value
5853 ++ */
5854 ++int can_nice(const struct task_struct *p, const int nice)
5855 ++{
5856 ++ /* Convert nice value [19,-20] to rlimit style value [1,40] */
5857 ++ int nice_rlim = nice_to_rlimit(nice);
5858 ++
5859 ++ return (nice_rlim <= task_rlimit(p, RLIMIT_NICE) ||
5860 ++ capable(CAP_SYS_NICE));
5861 ++}
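
As a quick worked example of the rlimit-style conversion used above (a sketch restating nice_to_rlimit(), not new kernel code):

    /* nice_to_rlimit(): nice 19 -> 1, nice 0 -> 20, nice -20 -> 40. */
    static long nice_to_rlimit_example(long nice)
    {
            return 20 - nice;       /* i.e. MAX_NICE - nice + 1 */
    }

So a task whose RLIMIT_NICE soft limit is 25 may lower its nice value to 20 - 25 = -5 without CAP_SYS_NICE, since can_nice() only requires nice_to_rlimit(nice) <= RLIMIT_NICE.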
5862 ++
5863 ++#ifdef __ARCH_WANT_SYS_NICE
5864 ++
5865 ++/*
5866 ++ * sys_nice - change the priority of the current process.
5867 ++ * @increment: priority increment
5868 ++ *
5869 ++ * sys_setpriority is a more generic, but much slower function that
5870 ++ * does similar things.
5871 ++ */
5872 ++SYSCALL_DEFINE1(nice, int, increment)
5873 ++{
5874 ++ long nice, retval;
5875 ++
5876 ++ /*
5877 ++ * Setpriority might change our priority at the same moment.
5878 ++ * We don't have to worry. Conceptually one call occurs first
5879 ++ * and we have a single winner.
5880 ++ */
5881 ++
5882 ++ increment = clamp(increment, -NICE_WIDTH, NICE_WIDTH);
5883 ++ nice = task_nice(current) + increment;
5884 ++
5885 ++ nice = clamp_val(nice, MIN_NICE, MAX_NICE);
5886 ++ if (increment < 0 && !can_nice(current, nice))
5887 ++ return -EPERM;
5888 ++
5889 ++ retval = security_task_setnice(current, nice);
5890 ++ if (retval)
5891 ++ return retval;
5892 ++
5893 ++ set_user_nice(current, nice);
5894 ++ return 0;
5895 ++}
5896 ++
5897 ++#endif
5898 ++
5899 ++/**
5900 ++ * task_prio - return the priority value of a given task.
5901 ++ * @p: the task in question.
5902 ++ *
5903 ++ * Return: The priority value as seen by users in /proc.
5904 ++ *
5905 ++ * sched policy              return value    kernel prio     user prio/nice
5906 ++ *
5907 ++ * (BMQ)normal, batch, idle  [0 ... 53]      [100 ... 139]   0/[-20 ... 19]/[-7 ... 7]
5908 ++ * (PDS)normal, batch, idle  [0 ... 39]      100             0/[-20 ... 19]
5909 ++ * fifo, rr                  [-1 ... -100]   [99 ... 0]      [0 ... 99]
5910 ++ */
5911 ++int task_prio(const struct task_struct *p)
5912 ++{
5913 ++ return (p->prio < MAX_RT_PRIO) ? p->prio - MAX_RT_PRIO :
5914 ++ task_sched_prio_normal(p, task_rq(p));
5915 ++}
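
A small worked example of the FIFO/RR branch above (an illustrative helper, assuming the usual kernel prio = MAX_RT_PRIO - 1 - rt_priority mapping for RT tasks):

    #define MAX_RT_PRIO 100

    /* What task_prio() reports for a FIFO/RR task with user rt_priority R. */
    static int fifo_task_prio(int rt_priority)
    {
            int kernel_prio = MAX_RT_PRIO - 1 - rt_priority;

            return kernel_prio - MAX_RT_PRIO;   /* R = 0 -> -1, R = 99 -> -100 */
    }

which is exactly the [-1 ... -100] row in the table; the SCHED_NORMAL/BATCH/IDLE rows come from the BMQ/PDS-specific task_sched_prio_normal() instead.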
5916 ++
5917 ++/**
5918 ++ * idle_cpu - is a given CPU idle currently?
5919 ++ * @cpu: the processor in question.
5920 ++ *
5921 ++ * Return: 1 if the CPU is currently idle. 0 otherwise.
5922 ++ */
5923 ++int idle_cpu(int cpu)
5924 ++{
5925 ++ struct rq *rq = cpu_rq(cpu);
5926 ++
5927 ++ if (rq->curr != rq->idle)
5928 ++ return 0;
5929 ++
5930 ++ if (rq->nr_running)
5931 ++ return 0;
5932 ++
5933 ++#ifdef CONFIG_SMP
5934 ++ if (rq->ttwu_pending)
5935 ++ return 0;
5936 ++#endif
5937 ++
5938 ++ return 1;
5939 ++}
5940 ++
5941 ++/**
5942 ++ * idle_task - return the idle task for a given CPU.
5943 ++ * @cpu: the processor in question.
5944 ++ *
5945 ++ * Return: The idle task for the cpu @cpu.
5946 ++ */
5947 ++struct task_struct *idle_task(int cpu)
5948 ++{
5949 ++ return cpu_rq(cpu)->idle;
5950 ++}
5951 ++
5952 ++/**
5953 ++ * find_process_by_pid - find a process with a matching PID value.
5954 ++ * @pid: the pid in question.
5955 ++ *
5956 ++ * The task of @pid, if found. %NULL otherwise.
5957 ++ */
5958 ++static inline struct task_struct *find_process_by_pid(pid_t pid)
5959 ++{
5960 ++ return pid ? find_task_by_vpid(pid) : current;
5961 ++}
5962 ++
5963 ++/*
5964 ++ * sched_setparam() passes in -1 for its policy, to let the functions
5965 ++ * it calls know not to change it.
5966 ++ */
5967 ++#define SETPARAM_POLICY -1
5968 ++
5969 ++static void __setscheduler_params(struct task_struct *p,
5970 ++ const struct sched_attr *attr)
5971 ++{
5972 ++ int policy = attr->sched_policy;
5973 ++
5974 ++ if (policy == SETPARAM_POLICY)
5975 ++ policy = p->policy;
5976 ++
5977 ++ p->policy = policy;
5978 ++
5979 ++ /*
5980 ++ * Allow the normal nice value to be set, but it will have no
5981 ++ * effect on scheduling while the task is not SCHED_NORMAL/
5982 ++ * SCHED_BATCH.
5983 ++ */
5984 ++ p->static_prio = NICE_TO_PRIO(attr->sched_nice);
5985 ++
5986 ++ /*
5987 ++ * __sched_setscheduler() ensures attr->sched_priority == 0 when
5988 ++ * !rt_policy. Always setting this ensures that things like
5989 ++ * getparam()/getattr() don't report silly values for !rt tasks.
5990 ++ */
5991 ++ p->rt_priority = attr->sched_priority;
5992 ++ p->normal_prio = normal_prio(p);
5993 ++}
5994 ++
5995 ++/*
5996 ++ * check the target process has a UID that matches the current process's
5997 ++ */
5998 ++static bool check_same_owner(struct task_struct *p)
5999 ++{
6000 ++ const struct cred *cred = current_cred(), *pcred;
6001 ++ bool match;
6002 ++
6003 ++ rcu_read_lock();
6004 ++ pcred = __task_cred(p);
6005 ++ match = (uid_eq(cred->euid, pcred->euid) ||
6006 ++ uid_eq(cred->euid, pcred->uid));
6007 ++ rcu_read_unlock();
6008 ++ return match;
6009 ++}
6010 ++
6011 ++static int __sched_setscheduler(struct task_struct *p,
6012 ++ const struct sched_attr *attr,
6013 ++ bool user, bool pi)
6014 ++{
6015 ++ const struct sched_attr dl_squash_attr = {
6016 ++ .size = sizeof(struct sched_attr),
6017 ++ .sched_policy = SCHED_FIFO,
6018 ++ .sched_nice = 0,
6019 ++ .sched_priority = 99,
6020 ++ };
6021 ++ int oldpolicy = -1, policy = attr->sched_policy;
6022 ++ int retval, newprio;
6023 ++ struct callback_head *head;
6024 ++ unsigned long flags;
6025 ++ struct rq *rq;
6026 ++ int reset_on_fork;
6027 ++ raw_spinlock_t *lock;
6028 ++
6029 ++ /* The pi code expects interrupts enabled */
6030 ++ BUG_ON(pi && in_interrupt());
6031 ++
6032 ++ /*
6033 ++ * Alt schedule FW supports SCHED_DEADLINE by squashing it into prio 0 SCHED_FIFO
6034 ++ */
6035 ++ if (unlikely(SCHED_DEADLINE == policy)) {
6036 ++ attr = &dl_squash_attr;
6037 ++ policy = attr->sched_policy;
6038 ++ }
6039 ++recheck:
6040 ++ /* Double check policy once rq lock held */
6041 ++ if (policy < 0) {
6042 ++ reset_on_fork = p->sched_reset_on_fork;
6043 ++ policy = oldpolicy = p->policy;
6044 ++ } else {
6045 ++ reset_on_fork = !!(attr->sched_flags & SCHED_RESET_ON_FORK);
6046 ++
6047 ++ if (policy > SCHED_IDLE)
6048 ++ return -EINVAL;
6049 ++ }
6050 ++
6051 ++ if (attr->sched_flags & ~(SCHED_FLAG_ALL))
6052 ++ return -EINVAL;
6053 ++
6054 ++ /*
6055 ++ * Valid priorities for SCHED_FIFO and SCHED_RR are
6056 ++ * 1..MAX_RT_PRIO-1, valid priority for SCHED_NORMAL and
6057 ++ * SCHED_BATCH and SCHED_IDLE is 0.
6058 ++ */
6059 ++ if (attr->sched_priority < 0 ||
6060 ++ (p->mm && attr->sched_priority > MAX_RT_PRIO - 1) ||
6061 ++ (!p->mm && attr->sched_priority > MAX_RT_PRIO - 1))
6062 ++ return -EINVAL;
6063 ++ if ((SCHED_RR == policy || SCHED_FIFO == policy) !=
6064 ++ (attr->sched_priority != 0))
6065 ++ return -EINVAL;
6066 ++
6067 ++ /*
6068 ++ * Allow unprivileged RT tasks to decrease priority:
6069 ++ */
6070 ++ if (user && !capable(CAP_SYS_NICE)) {
6071 ++ if (SCHED_FIFO == policy || SCHED_RR == policy) {
6072 ++ unsigned long rlim_rtprio =
6073 ++ task_rlimit(p, RLIMIT_RTPRIO);
6074 ++
6075 ++ /* Can't set/change the rt policy */
6076 ++ if (policy != p->policy && !rlim_rtprio)
6077 ++ return -EPERM;
6078 ++
6079 ++ /* Can't increase priority */
6080 ++ if (attr->sched_priority > p->rt_priority &&
6081 ++ attr->sched_priority > rlim_rtprio)
6082 ++ return -EPERM;
6083 ++ }
6084 ++
6085 ++ /* Can't change other user's priorities */
6086 ++ if (!check_same_owner(p))
6087 ++ return -EPERM;
6088 ++
6089 ++ /* Normal users shall not reset the sched_reset_on_fork flag */
6090 ++ if (p->sched_reset_on_fork && !reset_on_fork)
6091 ++ return -EPERM;
6092 ++ }
6093 ++
6094 ++ if (user) {
6095 ++ retval = security_task_setscheduler(p);
6096 ++ if (retval)
6097 ++ return retval;
6098 ++ }
6099 ++
6100 ++ if (pi)
6101 ++ cpuset_read_lock();
6102 ++
6103 ++ /*
6104 ++ * Make sure no PI-waiters arrive (or leave) while we are
6105 ++ * changing the priority of the task:
6106 ++ */
6107 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
6108 ++
6109 ++ /*
6110 ++ * To be able to change p->policy safely, task_access_lock()
6111 ++ * must be called.
6112 ++ * If task_access_lock() is used here:
6113 ++ * for a task p that is not running, reading rq->stop is
6114 ++ * racy but acceptable, as ->stop doesn't change much.
6115 ++ * An enhancement could be made to read rq->stop safely.
6116 ++ */
6117 ++ rq = __task_access_lock(p, &lock);
6118 ++
6119 ++ /*
6120 ++ * Changing the policy of the stop threads is a very bad idea
6121 ++ */
6122 ++ if (p == rq->stop) {
6123 ++ retval = -EINVAL;
6124 ++ goto unlock;
6125 ++ }
6126 ++
6127 ++ /*
6128 ++ * If not changing anything there's no need to proceed further:
6129 ++ */
6130 ++ if (unlikely(policy == p->policy)) {
6131 ++ if (rt_policy(policy) && attr->sched_priority != p->rt_priority)
6132 ++ goto change;
6133 ++ if (!rt_policy(policy) &&
6134 ++ NICE_TO_PRIO(attr->sched_nice) != p->static_prio)
6135 ++ goto change;
6136 ++
6137 ++ p->sched_reset_on_fork = reset_on_fork;
6138 ++ retval = 0;
6139 ++ goto unlock;
6140 ++ }
6141 ++change:
6142 ++
6143 ++ /* Re-check policy now with rq lock held */
6144 ++ if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
6145 ++ policy = oldpolicy = -1;
6146 ++ __task_access_unlock(p, lock);
6147 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
6148 ++ if (pi)
6149 ++ cpuset_read_unlock();
6150 ++ goto recheck;
6151 ++ }
6152 ++
6153 ++ p->sched_reset_on_fork = reset_on_fork;
6154 ++
6155 ++ newprio = __normal_prio(policy, attr->sched_priority, NICE_TO_PRIO(attr->sched_nice));
6156 ++ if (pi) {
6157 ++ /*
6158 ++ * Take priority boosted tasks into account. If the new
6159 ++ * effective priority is unchanged, we just store the new
6160 ++ * normal parameters and do not touch the scheduler class and
6161 ++ * the runqueue. This will be done when the task deboosts
6162 ++ * itself.
6163 ++ */
6164 ++ newprio = rt_effective_prio(p, newprio);
6165 ++ }
6166 ++
6167 ++ if (!(attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)) {
6168 ++ __setscheduler_params(p, attr);
6169 ++ __setscheduler_prio(p, newprio);
6170 ++ }
6171 ++
6172 ++ check_task_changed(p, rq);
6173 ++
6174 ++ /* Avoid rq from going away on us: */
6175 ++ preempt_disable();
6176 ++ head = splice_balance_callbacks(rq);
6177 ++ __task_access_unlock(p, lock);
6178 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
6179 ++
6180 ++ if (pi) {
6181 ++ cpuset_read_unlock();
6182 ++ rt_mutex_adjust_pi(p);
6183 ++ }
6184 ++
6185 ++ /* Run balance callbacks after we've adjusted the PI chain: */
6186 ++ balance_callbacks(rq, head);
6187 ++ preempt_enable();
6188 ++
6189 ++ return 0;
6190 ++
6191 ++unlock:
6192 ++ __task_access_unlock(p, lock);
6193 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
6194 ++ if (pi)
6195 ++ cpuset_read_unlock();
6196 ++ return retval;
6197 ++}
6198 ++
6199 ++static int _sched_setscheduler(struct task_struct *p, int policy,
6200 ++ const struct sched_param *param, bool check)
6201 ++{
6202 ++ struct sched_attr attr = {
6203 ++ .sched_policy = policy,
6204 ++ .sched_priority = param->sched_priority,
6205 ++ .sched_nice = PRIO_TO_NICE(p->static_prio),
6206 ++ };
6207 ++
6208 ++ /* Fixup the legacy SCHED_RESET_ON_FORK hack. */
6209 ++ if ((policy != SETPARAM_POLICY) && (policy & SCHED_RESET_ON_FORK)) {
6210 ++ attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
6211 ++ policy &= ~SCHED_RESET_ON_FORK;
6212 ++ attr.sched_policy = policy;
6213 ++ }
6214 ++
6215 ++ return __sched_setscheduler(p, &attr, check, true);
6216 ++}
6217 ++
6218 ++/**
6219 ++ * sched_setscheduler - change the scheduling policy and/or RT priority of a thread.
6220 ++ * @p: the task in question.
6221 ++ * @policy: new policy.
6222 ++ * @param: structure containing the new RT priority.
6223 ++ *
6224 ++ * Use sched_set_fifo(), read its comment.
6225 ++ *
6226 ++ * Return: 0 on success. An error code otherwise.
6227 ++ *
6228 ++ * NOTE that the task may be already dead.
6229 ++ */
6230 ++int sched_setscheduler(struct task_struct *p, int policy,
6231 ++ const struct sched_param *param)
6232 ++{
6233 ++ return _sched_setscheduler(p, policy, param, true);
6234 ++}
6235 ++
6236 ++int sched_setattr(struct task_struct *p, const struct sched_attr *attr)
6237 ++{
6238 ++ return __sched_setscheduler(p, attr, true, true);
6239 ++}
6240 ++
6241 ++int sched_setattr_nocheck(struct task_struct *p, const struct sched_attr *attr)
6242 ++{
6243 ++ return __sched_setscheduler(p, attr, false, true);
6244 ++}
6245 ++EXPORT_SYMBOL_GPL(sched_setattr_nocheck);
6246 ++
6247 ++/**
6248 ++ * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
6249 ++ * @p: the task in question.
6250 ++ * @policy: new policy.
6251 ++ * @param: structure containing the new RT priority.
6252 ++ *
6253 ++ * Just like sched_setscheduler, only don't bother checking if the
6254 ++ * current context has permission. For example, this is needed in
6255 ++ * stop_machine(): we create temporary high priority worker threads,
6256 ++ * but our caller might not have that capability.
6257 ++ *
6258 ++ * Return: 0 on success. An error code otherwise.
6259 ++ */
6260 ++int sched_setscheduler_nocheck(struct task_struct *p, int policy,
6261 ++ const struct sched_param *param)
6262 ++{
6263 ++ return _sched_setscheduler(p, policy, param, false);
6264 ++}
6265 ++
6266 ++/*
6267 ++ * SCHED_FIFO is a broken scheduler model; that is, it is fundamentally
6268 ++ * incapable of resource management, which is the one thing an OS really should
6269 ++ * be doing.
6270 ++ *
6271 ++ * This is of course the reason it is limited to privileged users only.
6272 ++ *
6273 ++ * Worse still; it is fundamentally impossible to compose static priority
6274 ++ * workloads. You cannot take two correctly working static prio workloads
6275 ++ * and smash them together and still expect them to work.
6276 ++ *
6277 ++ * For this reason 'all' FIFO tasks the kernel creates are basically at:
6278 ++ *
6279 ++ * MAX_RT_PRIO / 2
6280 ++ *
6281 ++ * The administrator _MUST_ configure the system, the kernel simply doesn't
6282 ++ * know enough information to make a sensible choice.
6283 ++ */
6284 ++void sched_set_fifo(struct task_struct *p)
6285 ++{
6286 ++ struct sched_param sp = { .sched_priority = MAX_RT_PRIO / 2 };
6287 ++ WARN_ON_ONCE(sched_setscheduler_nocheck(p, SCHED_FIFO, &sp) != 0);
6288 ++}
6289 ++EXPORT_SYMBOL_GPL(sched_set_fifo);
6290 ++
6291 ++/*
6292 ++ * For when you don't much care about FIFO, but want to be above SCHED_NORMAL.
6293 ++ */
6294 ++void sched_set_fifo_low(struct task_struct *p)
6295 ++{
6296 ++ struct sched_param sp = { .sched_priority = 1 };
6297 ++ WARN_ON_ONCE(sched_setscheduler_nocheck(p, SCHED_FIFO, &sp) != 0);
6298 ++}
6299 ++EXPORT_SYMBOL_GPL(sched_set_fifo_low);
6300 ++
6301 ++void sched_set_normal(struct task_struct *p, int nice)
6302 ++{
6303 ++ struct sched_attr attr = {
6304 ++ .sched_policy = SCHED_NORMAL,
6305 ++ .sched_nice = nice,
6306 ++ };
6307 ++ WARN_ON_ONCE(sched_setattr_nocheck(p, &attr) != 0);
6308 ++}
6309 ++EXPORT_SYMBOL_GPL(sched_set_normal);
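
A minimal sketch of how in-kernel code typically consumes these helpers (the worker function and kthread name here are made up for illustration):

    #include <linux/err.h>
    #include <linux/kthread.h>
    #include <linux/sched.h>

    static int example_start_fifo_worker(int (*fn)(void *), void *data)
    {
            struct task_struct *tsk = kthread_run(fn, data, "example_worker");

            if (IS_ERR(tsk))
                    return PTR_ERR(tsk);

            /* Lands at MAX_RT_PRIO / 2, per the policy described above. */
            sched_set_fifo(tsk);
            return 0;
    }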
6310 ++
6311 ++static int
6312 ++do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
6313 ++{
6314 ++ struct sched_param lparam;
6315 ++ struct task_struct *p;
6316 ++ int retval;
6317 ++
6318 ++ if (!param || pid < 0)
6319 ++ return -EINVAL;
6320 ++ if (copy_from_user(&lparam, param, sizeof(struct sched_param)))
6321 ++ return -EFAULT;
6322 ++
6323 ++ rcu_read_lock();
6324 ++ retval = -ESRCH;
6325 ++ p = find_process_by_pid(pid);
6326 ++ if (likely(p))
6327 ++ get_task_struct(p);
6328 ++ rcu_read_unlock();
6329 ++
6330 ++ if (likely(p)) {
6331 ++ retval = sched_setscheduler(p, policy, &lparam);
6332 ++ put_task_struct(p);
6333 ++ }
6334 ++
6335 ++ return retval;
6336 ++}
6337 ++
6338 ++/*
6339 ++ * Mimics kernel/events/core.c perf_copy_attr().
6340 ++ */
6341 ++static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *attr)
6342 ++{
6343 ++ u32 size;
6344 ++ int ret;
6345 ++
6346 ++ /* Zero the full structure, so that a short copy will be nice: */
6347 ++ memset(attr, 0, sizeof(*attr));
6348 ++
6349 ++ ret = get_user(size, &uattr->size);
6350 ++ if (ret)
6351 ++ return ret;
6352 ++
6353 ++ /* ABI compatibility quirk: */
6354 ++ if (!size)
6355 ++ size = SCHED_ATTR_SIZE_VER0;
6356 ++
6357 ++ if (size < SCHED_ATTR_SIZE_VER0 || size > PAGE_SIZE)
6358 ++ goto err_size;
6359 ++
6360 ++ ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
6361 ++ if (ret) {
6362 ++ if (ret == -E2BIG)
6363 ++ goto err_size;
6364 ++ return ret;
6365 ++ }
6366 ++
6367 ++ /*
6368 ++ * XXX: Do we want to be lenient like existing syscalls; or do we want
6369 ++ * to be strict and return an error on out-of-bounds values?
6370 ++ */
6371 ++ attr->sched_nice = clamp(attr->sched_nice, -20, 19);
6372 ++
6373 ++ /* sched/core.c uses zero here but we already know ret is zero */
6374 ++ return 0;
6375 ++
6376 ++err_size:
6377 ++ put_user(sizeof(*attr), &uattr->size);
6378 ++ return -E2BIG;
6379 ++}
6380 ++
6381 ++/**
6382 ++ * sys_sched_setscheduler - set/change the scheduler policy and RT priority
6383 ++ * @pid: the pid in question.
6384 ++ * @policy: new policy.
6385 ++ * @param: structure containing the new RT priority.
6386 ++ *
6387 ++ * Return: 0 on success. An error code otherwise.
6388 ++ */
6389 ++SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy, struct sched_param __user *, param)
6390 ++{
6391 ++ if (policy < 0)
6392 ++ return -EINVAL;
6393 ++
6394 ++ return do_sched_setscheduler(pid, policy, param);
6395 ++}
6396 ++
6397 ++/**
6398 ++ * sys_sched_setparam - set/change the RT priority of a thread
6399 ++ * @pid: the pid in question.
6400 ++ * @param: structure containing the new RT priority.
6401 ++ *
6402 ++ * Return: 0 on success. An error code otherwise.
6403 ++ */
6404 ++SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
6405 ++{
6406 ++ return do_sched_setscheduler(pid, SETPARAM_POLICY, param);
6407 ++}
6408 ++
6409 ++/**
6410 ++ * sys_sched_setattr - same as above, but with extended sched_attr
6411 ++ * @pid: the pid in question.
6412 ++ * @uattr: structure containing the extended parameters.
6413 ++ */
6414 ++SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
6415 ++ unsigned int, flags)
6416 ++{
6417 ++ struct sched_attr attr;
6418 ++ struct task_struct *p;
6419 ++ int retval;
6420 ++
6421 ++ if (!uattr || pid < 0 || flags)
6422 ++ return -EINVAL;
6423 ++
6424 ++ retval = sched_copy_attr(uattr, &attr);
6425 ++ if (retval)
6426 ++ return retval;
6427 ++
6428 ++ if ((int)attr.sched_policy < 0)
6429 ++ return -EINVAL;
6430 ++
6431 ++ rcu_read_lock();
6432 ++ retval = -ESRCH;
6433 ++ p = find_process_by_pid(pid);
6434 ++ if (likely(p))
6435 ++ get_task_struct(p);
6436 ++ rcu_read_unlock();
6437 ++
6438 ++ if (likely(p)) {
6439 ++ retval = sched_setattr(p, &attr);
6440 ++ put_task_struct(p);
6441 ++ }
6442 ++
6443 ++ return retval;
6444 ++}
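
From user space the extended interface above is reached through the raw syscall, since glibc ships no wrapper. A hedged sketch, declaring the attribute struct locally in its original VER0 layout and assuming the libc headers define SYS_sched_setattr:

    #define _GNU_SOURCE
    #include <sched.h>          /* SCHED_FIFO */
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    struct sched_attr {                 /* matches SCHED_ATTR_SIZE_VER0 (48 bytes) */
            uint32_t size;
            uint32_t sched_policy;
            uint64_t sched_flags;
            int32_t  sched_nice;        /* SCHED_NORMAL, SCHED_BATCH */
            uint32_t sched_priority;    /* SCHED_FIFO, SCHED_RR */
            uint64_t sched_runtime;     /* SCHED_DEADLINE (squashed to FIFO here) */
            uint64_t sched_deadline;
            uint64_t sched_period;
    };

    int main(void)
    {
            struct sched_attr attr = {
                    .size           = sizeof(attr),
                    .sched_policy   = SCHED_FIFO,
                    .sched_priority = 10,
            };

            /* pid 0 means the calling thread; needs CAP_SYS_NICE or RLIMIT_RTPRIO. */
            if (syscall(SYS_sched_setattr, 0, &attr, 0) != 0) {
                    perror("sched_setattr");
                    return 1;
            }
            return 0;
    }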
6445 ++
6446 ++/**
6447 ++ * sys_sched_getscheduler - get the policy (scheduling class) of a thread
6448 ++ * @pid: the pid in question.
6449 ++ *
6450 ++ * Return: On success, the policy of the thread. Otherwise, a negative error
6451 ++ * code.
6452 ++ */
6453 ++SYSCALL_DEFINE1(sched_getscheduler, pid_t, pid)
6454 ++{
6455 ++ struct task_struct *p;
6456 ++ int retval = -EINVAL;
6457 ++
6458 ++ if (pid < 0)
6459 ++ goto out_nounlock;
6460 ++
6461 ++ retval = -ESRCH;
6462 ++ rcu_read_lock();
6463 ++ p = find_process_by_pid(pid);
6464 ++ if (p) {
6465 ++ retval = security_task_getscheduler(p);
6466 ++ if (!retval)
6467 ++ retval = p->policy;
6468 ++ }
6469 ++ rcu_read_unlock();
6470 ++
6471 ++out_nounlock:
6472 ++ return retval;
6473 ++}
6474 ++
6475 ++/**
6476 ++ * sys_sched_getparam - get the RT priority of a thread
6477 ++ * @pid: the pid in question.
6478 ++ * @param: structure containing the RT priority.
6479 ++ *
6480 ++ * Return: On success, 0 and the RT priority is in @param. Otherwise, an error
6481 ++ * code.
6482 ++ */
6483 ++SYSCALL_DEFINE2(sched_getparam, pid_t, pid, struct sched_param __user *, param)
6484 ++{
6485 ++ struct sched_param lp = { .sched_priority = 0 };
6486 ++ struct task_struct *p;
6487 ++ int retval = -EINVAL;
6488 ++
6489 ++ if (!param || pid < 0)
6490 ++ goto out_nounlock;
6491 ++
6492 ++ rcu_read_lock();
6493 ++ p = find_process_by_pid(pid);
6494 ++ retval = -ESRCH;
6495 ++ if (!p)
6496 ++ goto out_unlock;
6497 ++
6498 ++ retval = security_task_getscheduler(p);
6499 ++ if (retval)
6500 ++ goto out_unlock;
6501 ++
6502 ++ if (task_has_rt_policy(p))
6503 ++ lp.sched_priority = p->rt_priority;
6504 ++ rcu_read_unlock();
6505 ++
6506 ++ /*
6507 ++ * This one might sleep, we cannot do it with a spinlock held ...
6508 ++ */
6509 ++ retval = copy_to_user(param, &lp, sizeof(*param)) ? -EFAULT : 0;
6510 ++
6511 ++out_nounlock:
6512 ++ return retval;
6513 ++
6514 ++out_unlock:
6515 ++ rcu_read_unlock();
6516 ++ return retval;
6517 ++}
6518 ++
6519 ++/*
6520 ++ * Copy the kernel size attribute structure (which might be larger
6521 ++ * than what user-space knows about) to user-space.
6522 ++ *
6523 ++ * Note that all cases are valid: user-space buffer can be larger or
6524 ++ * smaller than the kernel-space buffer. The usual case is that both
6525 ++ * have the same size.
6526 ++ */
6527 ++static int
6528 ++sched_attr_copy_to_user(struct sched_attr __user *uattr,
6529 ++ struct sched_attr *kattr,
6530 ++ unsigned int usize)
6531 ++{
6532 ++ unsigned int ksize = sizeof(*kattr);
6533 ++
6534 ++ if (!access_ok(uattr, usize))
6535 ++ return -EFAULT;
6536 ++
6537 ++ /*
6538 ++ * sched_getattr() ABI forwards and backwards compatibility:
6539 ++ *
6540 ++ * If usize == ksize then we just copy everything to user-space and all is good.
6541 ++ *
6542 ++ * If usize < ksize then we only copy as much as user-space has space for,
6543 ++ * this keeps ABI compatibility as well. We skip the rest.
6544 ++ *
6545 ++ * If usize > ksize then user-space is using a newer version of the ABI,
6546 ++ * which part the kernel doesn't know about. Just ignore it - tooling can
6547 ++ * detect the kernel's knowledge of attributes from the attr->size value
6548 ++ * which is set to ksize in this case.
6549 ++ */
6550 ++ kattr->size = min(usize, ksize);
6551 ++
6552 ++ if (copy_to_user(uattr, kattr, kattr->size))
6553 ++ return -EFAULT;
6554 ++
6555 ++ return 0;
6556 ++}
6557 ++
6558 ++/**
6559 ++ * sys_sched_getattr - similar to sched_getparam, but with sched_attr
6560 ++ * @pid: the pid in question.
6561 ++ * @uattr: structure containing the extended parameters.
6562 ++ * @usize: sizeof(attr) for fwd/bwd comp.
6563 ++ * @flags: for future extension.
6564 ++ */
6565 ++SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
6566 ++ unsigned int, usize, unsigned int, flags)
6567 ++{
6568 ++ struct sched_attr kattr = { };
6569 ++ struct task_struct *p;
6570 ++ int retval;
6571 ++
6572 ++ if (!uattr || pid < 0 || usize > PAGE_SIZE ||
6573 ++ usize < SCHED_ATTR_SIZE_VER0 || flags)
6574 ++ return -EINVAL;
6575 ++
6576 ++ rcu_read_lock();
6577 ++ p = find_process_by_pid(pid);
6578 ++ retval = -ESRCH;
6579 ++ if (!p)
6580 ++ goto out_unlock;
6581 ++
6582 ++ retval = security_task_getscheduler(p);
6583 ++ if (retval)
6584 ++ goto out_unlock;
6585 ++
6586 ++ kattr.sched_policy = p->policy;
6587 ++ if (p->sched_reset_on_fork)
6588 ++ kattr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
6589 ++ if (task_has_rt_policy(p))
6590 ++ kattr.sched_priority = p->rt_priority;
6591 ++ else
6592 ++ kattr.sched_nice = task_nice(p);
6593 ++ kattr.sched_flags &= SCHED_FLAG_ALL;
6594 ++
6595 ++#ifdef CONFIG_UCLAMP_TASK
6596 ++ kattr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value;
6597 ++ kattr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value;
6598 ++#endif
6599 ++
6600 ++ rcu_read_unlock();
6601 ++
6602 ++ return sched_attr_copy_to_user(uattr, &kattr, usize);
6603 ++
6604 ++out_unlock:
6605 ++ rcu_read_unlock();
6606 ++ return retval;
6607 ++}
6608 ++
6609 ++static int
6610 ++__sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
6611 ++{
6612 ++ int retval;
6613 ++ cpumask_var_t cpus_allowed, new_mask;
6614 ++
6615 ++ if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL))
6616 ++ return -ENOMEM;
6617 ++
6618 ++ if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) {
6619 ++ retval = -ENOMEM;
6620 ++ goto out_free_cpus_allowed;
6621 ++ }
6622 ++
6623 ++ cpuset_cpus_allowed(p, cpus_allowed);
6624 ++ cpumask_and(new_mask, mask, cpus_allowed);
6625 ++again:
6626 ++ retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK | SCA_USER);
6627 ++ if (retval)
6628 ++ goto out_free_new_mask;
6629 ++
6630 ++ cpuset_cpus_allowed(p, cpus_allowed);
6631 ++ if (!cpumask_subset(new_mask, cpus_allowed)) {
6632 ++ /*
6633 ++ * We must have raced with a concurrent cpuset
6634 ++ * update. Just reset the cpus_allowed to the
6635 ++ * cpuset's cpus_allowed
6636 ++ */
6637 ++ cpumask_copy(new_mask, cpus_allowed);
6638 ++ goto again;
6639 ++ }
6640 ++
6641 ++out_free_new_mask:
6642 ++ free_cpumask_var(new_mask);
6643 ++out_free_cpus_allowed:
6644 ++ free_cpumask_var(cpus_allowed);
6645 ++ return retval;
6646 ++}
6647 ++
6648 ++long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
6649 ++{
6650 ++ struct task_struct *p;
6651 ++ int retval;
6652 ++
6653 ++ rcu_read_lock();
6654 ++
6655 ++ p = find_process_by_pid(pid);
6656 ++ if (!p) {
6657 ++ rcu_read_unlock();
6658 ++ return -ESRCH;
6659 ++ }
6660 ++
6661 ++ /* Prevent p going away */
6662 ++ get_task_struct(p);
6663 ++ rcu_read_unlock();
6664 ++
6665 ++ if (p->flags & PF_NO_SETAFFINITY) {
6666 ++ retval = -EINVAL;
6667 ++ goto out_put_task;
6668 ++ }
6669 ++
6670 ++ if (!check_same_owner(p)) {
6671 ++ rcu_read_lock();
6672 ++ if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
6673 ++ rcu_read_unlock();
6674 ++ retval = -EPERM;
6675 ++ goto out_put_task;
6676 ++ }
6677 ++ rcu_read_unlock();
6678 ++ }
6679 ++
6680 ++ retval = security_task_setscheduler(p);
6681 ++ if (retval)
6682 ++ goto out_put_task;
6683 ++
6684 ++ retval = __sched_setaffinity(p, in_mask);
6685 ++out_put_task:
6686 ++ put_task_struct(p);
6687 ++ return retval;
6688 ++}
6689 ++
6690 ++static int get_user_cpu_mask(unsigned long __user *user_mask_ptr, unsigned len,
6691 ++ struct cpumask *new_mask)
6692 ++{
6693 ++ if (len < cpumask_size())
6694 ++ cpumask_clear(new_mask);
6695 ++ else if (len > cpumask_size())
6696 ++ len = cpumask_size();
6697 ++
6698 ++ return copy_from_user(new_mask, user_mask_ptr, len) ? -EFAULT : 0;
6699 ++}
6700 ++
6701 ++/**
6702 ++ * sys_sched_setaffinity - set the CPU affinity of a process
6703 ++ * @pid: pid of the process
6704 ++ * @len: length in bytes of the bitmask pointed to by user_mask_ptr
6705 ++ * @user_mask_ptr: user-space pointer to the new CPU mask
6706 ++ *
6707 ++ * Return: 0 on success. An error code otherwise.
6708 ++ */
6709 ++SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
6710 ++ unsigned long __user *, user_mask_ptr)
6711 ++{
6712 ++ cpumask_var_t new_mask;
6713 ++ int retval;
6714 ++
6715 ++ if (!alloc_cpumask_var(&new_mask, GFP_KERNEL))
6716 ++ return -ENOMEM;
6717 ++
6718 ++ retval = get_user_cpu_mask(user_mask_ptr, len, new_mask);
6719 ++ if (retval == 0)
6720 ++ retval = sched_setaffinity(pid, new_mask);
6721 ++ free_cpumask_var(new_mask);
6722 ++ return retval;
6723 ++}
6724 ++
6725 ++long sched_getaffinity(pid_t pid, cpumask_t *mask)
6726 ++{
6727 ++ struct task_struct *p;
6728 ++ raw_spinlock_t *lock;
6729 ++ unsigned long flags;
6730 ++ int retval;
6731 ++
6732 ++ rcu_read_lock();
6733 ++
6734 ++ retval = -ESRCH;
6735 ++ p = find_process_by_pid(pid);
6736 ++ if (!p)
6737 ++ goto out_unlock;
6738 ++
6739 ++ retval = security_task_getscheduler(p);
6740 ++ if (retval)
6741 ++ goto out_unlock;
6742 ++
6743 ++ task_access_lock_irqsave(p, &lock, &flags);
6744 ++ cpumask_and(mask, &p->cpus_mask, cpu_active_mask);
6745 ++ task_access_unlock_irqrestore(p, lock, &flags);
6746 ++
6747 ++out_unlock:
6748 ++ rcu_read_unlock();
6749 ++
6750 ++ return retval;
6751 ++}
6752 ++
6753 ++/**
6754 ++ * sys_sched_getaffinity - get the CPU affinity of a process
6755 ++ * @pid: pid of the process
6756 ++ * @len: length in bytes of the bitmask pointed to by user_mask_ptr
6757 ++ * @user_mask_ptr: user-space pointer to hold the current CPU mask
6758 ++ *
6759 ++ * Return: size of CPU mask copied to user_mask_ptr on success. An
6760 ++ * error code otherwise.
6761 ++ */
6762 ++SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
6763 ++ unsigned long __user *, user_mask_ptr)
6764 ++{
6765 ++ int ret;
6766 ++ cpumask_var_t mask;
6767 ++
6768 ++ if ((len * BITS_PER_BYTE) < nr_cpu_ids)
6769 ++ return -EINVAL;
6770 ++ if (len & (sizeof(unsigned long)-1))
6771 ++ return -EINVAL;
6772 ++
6773 ++ if (!alloc_cpumask_var(&mask, GFP_KERNEL))
6774 ++ return -ENOMEM;
6775 ++
6776 ++ ret = sched_getaffinity(pid, mask);
6777 ++ if (ret == 0) {
6778 ++ unsigned int retlen = min_t(size_t, len, cpumask_size());
6779 ++
6780 ++ if (copy_to_user(user_mask_ptr, mask, retlen))
6781 ++ ret = -EFAULT;
6782 ++ else
6783 ++ ret = retlen;
6784 ++ }
6785 ++ free_cpumask_var(mask);
6786 ++
6787 ++ return ret;
6788 ++}
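
The length checks above (len must cover nr_cpu_ids bits and be a multiple of sizeof(unsigned long)) are normally hidden behind the glibc wrapper; a minimal user-space usage sketch:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
            cpu_set_t mask;

            /* glibc's wrapper passes sizeof(cpu_set_t) as len and hides the
             * "bytes copied" return value of the raw syscall. */
            if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
                    perror("sched_getaffinity");
                    return 1;
            }
            printf("CPU 0 allowed: %s\n", CPU_ISSET(0, &mask) ? "yes" : "no");
            return 0;
    }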
6789 ++
6790 ++static void do_sched_yield(void)
6791 ++{
6792 ++ struct rq *rq;
6793 ++ struct rq_flags rf;
6794 ++
6795 ++ if (!sched_yield_type)
6796 ++ return;
6797 ++
6798 ++ rq = this_rq_lock_irq(&rf);
6799 ++
6800 ++ schedstat_inc(rq->yld_count);
6801 ++
6802 ++ if (1 == sched_yield_type) {
6803 ++ if (!rt_task(current))
6804 ++ do_sched_yield_type_1(current, rq);
6805 ++ } else if (2 == sched_yield_type) {
6806 ++ if (rq->nr_running > 1)
6807 ++ rq->skip = current;
6808 ++ }
6809 ++
6810 ++ preempt_disable();
6811 ++ raw_spin_unlock_irq(&rq->lock);
6812 ++ sched_preempt_enable_no_resched();
6813 ++
6814 ++ schedule();
6815 ++}
6816 ++
6817 ++/**
6818 ++ * sys_sched_yield - yield the current processor to other threads.
6819 ++ *
6820 ++ * This function yields the current CPU to other tasks. If there are no
6821 ++ * other threads running on this CPU then this function will return.
6822 ++ *
6823 ++ * Return: 0.
6824 ++ */
6825 ++SYSCALL_DEFINE0(sched_yield)
6826 ++{
6827 ++ do_sched_yield();
6828 ++ return 0;
6829 ++}
6830 ++
6831 ++#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
6832 ++int __sched __cond_resched(void)
6833 ++{
6834 ++ if (should_resched(0)) {
6835 ++ preempt_schedule_common();
6836 ++ return 1;
6837 ++ }
6838 ++ /*
6839 ++ * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
6840 ++ * whether the current CPU is in an RCU read-side critical section,
6841 ++ * so the tick can report quiescent states even for CPUs looping
6842 ++ * in kernel context. In contrast, in non-preemptible kernels,
6843 ++ * RCU readers leave no in-memory hints, which means that CPU-bound
6844 ++ * processes executing in kernel context might never report an
6845 ++ * RCU quiescent state. Therefore, the following code causes
6846 ++ * cond_resched() to report a quiescent state, but only when RCU
6847 ++ * is in urgent need of one.
6848 ++ */
6849 ++#ifndef CONFIG_PREEMPT_RCU
6850 ++ rcu_all_qs();
6851 ++#endif
6852 ++ return 0;
6853 ++}
6854 ++EXPORT_SYMBOL(__cond_resched);
6855 ++#endif
6856 ++
6857 ++#ifdef CONFIG_PREEMPT_DYNAMIC
6858 ++DEFINE_STATIC_CALL_RET0(cond_resched, __cond_resched);
6859 ++EXPORT_STATIC_CALL_TRAMP(cond_resched);
6860 ++
6861 ++DEFINE_STATIC_CALL_RET0(might_resched, __cond_resched);
6862 ++EXPORT_STATIC_CALL_TRAMP(might_resched);
6863 ++#endif
6864 ++
6865 ++/*
6866 ++ * __cond_resched_lock() - if a reschedule is pending, drop the given lock,
6867 ++ * call schedule, and on return reacquire the lock.
6868 ++ *
6869 ++ * This works OK both with and without CONFIG_PREEMPTION. We do strange low-level
6870 ++ * operations here to prevent schedule() from being called twice (once via
6871 ++ * spin_unlock(), once by hand).
6872 ++ */
6873 ++int __cond_resched_lock(spinlock_t *lock)
6874 ++{
6875 ++ int resched = should_resched(PREEMPT_LOCK_OFFSET);
6876 ++ int ret = 0;
6877 ++
6878 ++ lockdep_assert_held(lock);
6879 ++
6880 ++ if (spin_needbreak(lock) || resched) {
6881 ++ spin_unlock(lock);
6882 ++ if (resched)
6883 ++ preempt_schedule_common();
6884 ++ else
6885 ++ cpu_relax();
6886 ++ ret = 1;
6887 ++ spin_lock(lock);
6888 ++ }
6889 ++ return ret;
6890 ++}
6891 ++EXPORT_SYMBOL(__cond_resched_lock);
6892 ++
6893 ++int __cond_resched_rwlock_read(rwlock_t *lock)
6894 ++{
6895 ++ int resched = should_resched(PREEMPT_LOCK_OFFSET);
6896 ++ int ret = 0;
6897 ++
6898 ++ lockdep_assert_held_read(lock);
6899 ++
6900 ++ if (rwlock_needbreak(lock) || resched) {
6901 ++ read_unlock(lock);
6902 ++ if (resched)
6903 ++ preempt_schedule_common();
6904 ++ else
6905 ++ cpu_relax();
6906 ++ ret = 1;
6907 ++ read_lock(lock);
6908 ++ }
6909 ++ return ret;
6910 ++}
6911 ++EXPORT_SYMBOL(__cond_resched_rwlock_read);
6912 ++
6913 ++int __cond_resched_rwlock_write(rwlock_t *lock)
6914 ++{
6915 ++ int resched = should_resched(PREEMPT_LOCK_OFFSET);
6916 ++ int ret = 0;
6917 ++
6918 ++ lockdep_assert_held_write(lock);
6919 ++
6920 ++ if (rwlock_needbreak(lock) || resched) {
6921 ++ write_unlock(lock);
6922 ++ if (resched)
6923 ++ preempt_schedule_common();
6924 ++ else
6925 ++ cpu_relax();
6926 ++ ret = 1;
6927 ++ write_lock(lock);
6928 ++ }
6929 ++ return ret;
6930 ++}
6931 ++EXPORT_SYMBOL(__cond_resched_rwlock_write);
6932 ++
6933 ++/**
6934 ++ * yield - yield the current processor to other threads.
6935 ++ *
6936 ++ * Do not ever use this function, there's a 99% chance you're doing it wrong.
6937 ++ *
6938 ++ * The scheduler is at all times free to pick the calling task as the most
6939 ++ * eligible task to run, if removing the yield() call from your code breaks
6940 ++ * it, it's already broken.
6941 ++ *
6942 ++ * Typical broken usage is:
6943 ++ *
6944 ++ * while (!event)
6945 ++ * yield();
6946 ++ *
6947 ++ * where one assumes that yield() will let 'the other' process run that will
6948 ++ * make event true. If the current task is a SCHED_FIFO task that will never
6949 ++ * happen. Never use yield() as a progress guarantee!!
6950 ++ *
6951 ++ * If you want to use yield() to wait for something, use wait_event().
6952 ++ * If you want to use yield() to be 'nice' for others, use cond_resched().
6953 ++ * If you still want to use yield(), do not!
6954 ++ */
6955 ++void __sched yield(void)
6956 ++{
6957 ++ set_current_state(TASK_RUNNING);
6958 ++ do_sched_yield();
6959 ++}
6960 ++EXPORT_SYMBOL(yield);
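
The kernel-doc above recommends wait_event() over a yield() polling loop. A hedged in-kernel sketch of that pattern; example_wq and example_done are made-up names used only for illustration:

#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(example_wq);
static bool example_done;

/* consumer: sleep until the condition becomes true, no busy yielding */
static void example_wait(void)
{
	wait_event(example_wq, example_done);
}

/* producer: make the condition true and wake the waiter */
static void example_complete(void)
{
	example_done = true;
	wake_up(&example_wq);
}
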
6961 ++
6962 ++/**
6963 ++ * yield_to - yield the current processor to another thread in
6964 ++ * your thread group, or accelerate that thread toward the
6965 ++ * processor it's on.
6966 ++ * @p: target task
6967 ++ * @preempt: whether task preemption is allowed or not
6968 ++ *
6969 ++ * It's the caller's job to ensure that the target task struct
6970 ++ * can't go away on us before we can do any checks.
6971 ++ *
6972 ++ * In Alt schedule FW, yield_to is not supported.
6973 ++ *
6974 ++ * Return:
6975 ++ * true (>0) if we indeed boosted the target task.
6976 ++ * false (0) if we failed to boost the target.
6977 ++ * -ESRCH if there's no task to yield to.
6978 ++ */
6979 ++int __sched yield_to(struct task_struct *p, bool preempt)
6980 ++{
6981 ++ return 0;
6982 ++}
6983 ++EXPORT_SYMBOL_GPL(yield_to);
6984 ++
6985 ++int io_schedule_prepare(void)
6986 ++{
6987 ++ int old_iowait = current->in_iowait;
6988 ++
6989 ++ current->in_iowait = 1;
6990 ++ blk_schedule_flush_plug(current);
6991 ++
6992 ++ return old_iowait;
6993 ++}
6994 ++
6995 ++void io_schedule_finish(int token)
6996 ++{
6997 ++ current->in_iowait = token;
6998 ++}
6999 ++
7000 ++/*
7001 ++ * This task is about to go to sleep on IO. Increment rq->nr_iowait so
7002 ++ * that process accounting knows that this is a task in IO wait state.
7003 ++ *
7004 ++ * But don't do that if it is a deliberate, throttling IO wait (this task
7005 ++ * has set its backing_dev_info: the queue against which it should throttle)
7006 ++ */
7007 ++
7008 ++long __sched io_schedule_timeout(long timeout)
7009 ++{
7010 ++ int token;
7011 ++ long ret;
7012 ++
7013 ++ token = io_schedule_prepare();
7014 ++ ret = schedule_timeout(timeout);
7015 ++ io_schedule_finish(token);
7016 ++
7017 ++ return ret;
7018 ++}
7019 ++EXPORT_SYMBOL(io_schedule_timeout);
7020 ++
7021 ++void __sched io_schedule(void)
7022 ++{
7023 ++ int token;
7024 ++
7025 ++ token = io_schedule_prepare();
7026 ++ schedule();
7027 ++ io_schedule_finish(token);
7028 ++}
7029 ++EXPORT_SYMBOL(io_schedule);
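
io_schedule_prepare() and io_schedule_finish() bracket an arbitrary blocking wait so the blocked time is accounted as I/O wait, which is exactly how io_schedule_timeout() above is built. A minimal in-kernel sketch of the same token pattern around a completion (illustrative only):

#include <linux/completion.h>
#include <linux/sched.h>

static void example_wait_for_io(struct completion *done)
{
	int token = io_schedule_prepare();	/* mark the task in_iowait, flush plugged I/O */

	wait_for_completion(done);		/* the blocked time is charged to iowait */

	io_schedule_finish(token);		/* restore the previous in_iowait value */
}
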
7030 ++
7031 ++/**
7032 ++ * sys_sched_get_priority_max - return maximum RT priority.
7033 ++ * @policy: scheduling class.
7034 ++ *
7035 ++ * Return: On success, this syscall returns the maximum
7036 ++ * rt_priority that can be used by a given scheduling class.
7037 ++ * On failure, a negative error code is returned.
7038 ++ */
7039 ++SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
7040 ++{
7041 ++ int ret = -EINVAL;
7042 ++
7043 ++ switch (policy) {
7044 ++ case SCHED_FIFO:
7045 ++ case SCHED_RR:
7046 ++ ret = MAX_RT_PRIO - 1;
7047 ++ break;
7048 ++ case SCHED_NORMAL:
7049 ++ case SCHED_BATCH:
7050 ++ case SCHED_IDLE:
7051 ++ ret = 0;
7052 ++ break;
7053 ++ }
7054 ++ return ret;
7055 ++}
7056 ++
7057 ++/**
7058 ++ * sys_sched_get_priority_min - return minimum RT priority.
7059 ++ * @policy: scheduling class.
7060 ++ *
7061 ++ * Return: On success, this syscall returns the minimum
7062 ++ * rt_priority that can be used by a given scheduling class.
7063 ++ * On failure, a negative error code is returned.
7064 ++ */
7065 ++SYSCALL_DEFINE1(sched_get_priority_min, int, policy)
7066 ++{
7067 ++ int ret = -EINVAL;
7068 ++
7069 ++ switch (policy) {
7070 ++ case SCHED_FIFO:
7071 ++ case SCHED_RR:
7072 ++ ret = 1;
7073 ++ break;
7074 ++ case SCHED_NORMAL:
7075 ++ case SCHED_BATCH:
7076 ++ case SCHED_IDLE:
7077 ++ ret = 0;
7078 ++ break;
7079 ++ }
7080 ++ return ret;
7081 ++}
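
Together, the two syscalls above report the static priority range of a policy: 1..99 for SCHED_FIFO/SCHED_RR and 0 for the normal policies. A user-space sketch that queries the range and then requests the highest FIFO priority (typically needs CAP_SYS_NICE or root):

#include <sched.h>
#include <stdio.h>

int main(void)
{
	int min = sched_get_priority_min(SCHED_FIFO);
	int max = sched_get_priority_max(SCHED_FIFO);
	struct sched_param sp = { .sched_priority = max };

	printf("SCHED_FIFO priority range: %d..%d\n", min, max);

	/* pid 0 targets the calling process */
	if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
		perror("sched_setscheduler");
	return 0;
}
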
7082 ++
7083 ++static int sched_rr_get_interval(pid_t pid, struct timespec64 *t)
7084 ++{
7085 ++ struct task_struct *p;
7086 ++ int retval;
7087 ++
7088 ++ alt_sched_debug();
7089 ++
7090 ++ if (pid < 0)
7091 ++ return -EINVAL;
7092 ++
7093 ++ retval = -ESRCH;
7094 ++ rcu_read_lock();
7095 ++ p = find_process_by_pid(pid);
7096 ++ if (!p)
7097 ++ goto out_unlock;
7098 ++
7099 ++ retval = security_task_getscheduler(p);
7100 ++ if (retval)
7101 ++ goto out_unlock;
7102 ++ rcu_read_unlock();
7103 ++
7104 ++ *t = ns_to_timespec64(sched_timeslice_ns);
7105 ++ return 0;
7106 ++
7107 ++out_unlock:
7108 ++ rcu_read_unlock();
7109 ++ return retval;
7110 ++}
7111 ++
7112 ++/**
7113 ++ * sys_sched_rr_get_interval - return the default timeslice of a process.
7114 ++ * @pid: pid of the process.
7115 ++ * @interval: userspace pointer to the timeslice value.
7116 ++ *
7117 ++ *

7118 ++ * Return: On success, 0 and the timeslice is in @interval. Otherwise,
7119 ++ * an error code.
7120 ++ */
7121 ++SYSCALL_DEFINE2(sched_rr_get_interval, pid_t, pid,
7122 ++ struct __kernel_timespec __user *, interval)
7123 ++{
7124 ++ struct timespec64 t;
7125 ++ int retval = sched_rr_get_interval(pid, &t);
7126 ++
7127 ++ if (retval == 0)
7128 ++ retval = put_timespec64(&t, interval);
7129 ++
7130 ++ return retval;
7131 ++}
7132 ++
7133 ++#ifdef CONFIG_COMPAT_32BIT_TIME
7134 ++SYSCALL_DEFINE2(sched_rr_get_interval_time32, pid_t, pid,
7135 ++ struct old_timespec32 __user *, interval)
7136 ++{
7137 ++ struct timespec64 t;
7138 ++ int retval = sched_rr_get_interval(pid, &t);
7139 ++
7140 ++ if (retval == 0)
7141 ++ retval = put_old_timespec32(&t, interval);
7142 ++ return retval;
7143 ++}
7144 ++#endif
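
Note that sched_rr_get_interval() above fills the result from the global sched_timeslice_ns, so under BMQ/PDS every task reports the same slice regardless of policy. A user-space sketch that reads it for the calling thread:

#include <sched.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec ts;

	/* pid 0 queries the calling thread */
	if (sched_rr_get_interval(0, &ts) != 0) {
		perror("sched_rr_get_interval");
		return 1;
	}
	printf("timeslice: %ld.%09ld s\n", (long)ts.tv_sec, ts.tv_nsec);
	return 0;
}
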
7145 ++
7146 ++void sched_show_task(struct task_struct *p)
7147 ++{
7148 ++ unsigned long free = 0;
7149 ++ int ppid;
7150 ++
7151 ++ if (!try_get_task_stack(p))
7152 ++ return;
7153 ++
7154 ++ pr_info("task:%-15.15s state:%c", p->comm, task_state_to_char(p));
7155 ++
7156 ++ if (task_is_running(p))
7157 ++ pr_cont(" running task ");
7158 ++#ifdef CONFIG_DEBUG_STACK_USAGE
7159 ++ free = stack_not_used(p);
7160 ++#endif
7161 ++ ppid = 0;
7162 ++ rcu_read_lock();
7163 ++ if (pid_alive(p))
7164 ++ ppid = task_pid_nr(rcu_dereference(p->real_parent));
7165 ++ rcu_read_unlock();
7166 ++ pr_cont(" stack:%5lu pid:%5d ppid:%6d flags:0x%08lx\n",
7167 ++ free, task_pid_nr(p), ppid,
7168 ++ (unsigned long)task_thread_info(p)->flags);
7169 ++
7170 ++ print_worker_info(KERN_INFO, p);
7171 ++ print_stop_info(KERN_INFO, p);
7172 ++ show_stack(p, NULL, KERN_INFO);
7173 ++ put_task_stack(p);
7174 ++}
7175 ++EXPORT_SYMBOL_GPL(sched_show_task);
7176 ++
7177 ++static inline bool
7178 ++state_filter_match(unsigned long state_filter, struct task_struct *p)
7179 ++{
7180 ++ unsigned int state = READ_ONCE(p->__state);
7181 ++
7182 ++ /* no filter, everything matches */
7183 ++ if (!state_filter)
7184 ++ return true;
7185 ++
7186 ++ /* filter, but doesn't match */
7187 ++ if (!(state & state_filter))
7188 ++ return false;
7189 ++
7190 ++ /*
7191 ++ * When looking for TASK_UNINTERRUPTIBLE skip TASK_IDLE (allows
7192 ++ * TASK_KILLABLE).
7193 ++ */
7194 ++ if (state_filter == TASK_UNINTERRUPTIBLE && state == TASK_IDLE)
7195 ++ return false;
7196 ++
7197 ++ return true;
7198 ++}
7199 ++
7200 ++
7201 ++void show_state_filter(unsigned int state_filter)
7202 ++{
7203 ++ struct task_struct *g, *p;
7204 ++
7205 ++ rcu_read_lock();
7206 ++ for_each_process_thread(g, p) {
7207 ++ /*
7208 ++ * reset the NMI-timeout, listing all tasks on a slow
7209 ++ * console might take a lot of time:
7210 ++ * Also, reset softlockup watchdogs on all CPUs, because
7211 ++ * another CPU might be blocked waiting for us to process
7212 ++ * an IPI.
7213 ++ */
7214 ++ touch_nmi_watchdog();
7215 ++ touch_all_softlockup_watchdogs();
7216 ++ if (state_filter_match(state_filter, p))
7217 ++ sched_show_task(p);
7218 ++ }
7219 ++
7220 ++#ifdef CONFIG_SCHED_DEBUG
7221 ++ /* TODO: Alt schedule FW should support this
7222 ++ if (!state_filter)
7223 ++ sysrq_sched_debug_show();
7224 ++ */
7225 ++#endif
7226 ++ rcu_read_unlock();
7227 ++ /*
7228 ++ * Only show locks if all tasks are dumped:
7229 ++ */
7230 ++ if (!state_filter)
7231 ++ debug_show_all_locks();
7232 ++}
7233 ++
7234 ++void dump_cpu_task(int cpu)
7235 ++{
7236 ++ pr_info("Task dump for CPU %d:\n", cpu);
7237 ++ sched_show_task(cpu_curr(cpu));
7238 ++}
7239 ++
7240 ++/**
7241 ++ * init_idle - set up an idle thread for a given CPU
7242 ++ * @idle: task in question
7243 ++ * @cpu: CPU the idle task belongs to
7244 ++ *
7245 ++ * NOTE: this function does not set the idle thread's NEED_RESCHED
7246 ++ * flag, to make booting more robust.
7247 ++ */
7248 ++void __init init_idle(struct task_struct *idle, int cpu)
7249 ++{
7250 ++ struct rq *rq = cpu_rq(cpu);
7251 ++ unsigned long flags;
7252 ++
7253 ++ __sched_fork(0, idle);
7254 ++
7255 ++ /*
7256 ++ * The idle task doesn't need the kthread struct to function, but it
7257 ++ * is dressed up as a per-CPU kthread and thus needs to play the part
7258 ++ * if we want to avoid special-casing it in code that deals with per-CPU
7259 ++ * kthreads.
7260 ++ */
7261 ++ set_kthread_struct(idle);
7262 ++
7263 ++ raw_spin_lock_irqsave(&idle->pi_lock, flags);
7264 ++ raw_spin_lock(&rq->lock);
7265 ++ update_rq_clock(rq);
7266 ++
7267 ++ idle->last_ran = rq->clock_task;
7268 ++ idle->__state = TASK_RUNNING;
7269 ++ /*
7270 ++ * PF_KTHREAD should already be set at this point; regardless, make it
7271 ++ * look like a proper per-CPU kthread.
7272 ++ */
7273 ++ idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY;
7274 ++ kthread_set_per_cpu(idle, cpu);
7275 ++
7276 ++ sched_queue_init_idle(&rq->queue, idle);
7277 ++
7278 ++ scs_task_reset(idle);
7279 ++ kasan_unpoison_task_stack(idle);
7280 ++
7281 ++#ifdef CONFIG_SMP
7282 ++ /*
7283 ++ * It's possible that init_idle() gets called multiple times on a task,
7284 ++ * in that case do_set_cpus_allowed() will not do the right thing.
7285 ++ *
7286 ++ * And since this is boot we can forgo the serialisation.
7287 ++ */
7288 ++ set_cpus_allowed_common(idle, cpumask_of(cpu));
7289 ++#endif
7290 ++
7291 ++ /* Silence PROVE_RCU */
7292 ++ rcu_read_lock();
7293 ++ __set_task_cpu(idle, cpu);
7294 ++ rcu_read_unlock();
7295 ++
7296 ++ rq->idle = idle;
7297 ++ rcu_assign_pointer(rq->curr, idle);
7298 ++ idle->on_cpu = 1;
7299 ++
7300 ++ raw_spin_unlock(&rq->lock);
7301 ++ raw_spin_unlock_irqrestore(&idle->pi_lock, flags);
7302 ++
7303 ++ /* Set the preempt count _outside_ the spinlocks! */
7304 ++ init_idle_preempt_count(idle, cpu);
7305 ++
7306 ++ ftrace_graph_init_idle_task(idle, cpu);
7307 ++ vtime_init_idle(idle, cpu);
7308 ++#ifdef CONFIG_SMP
7309 ++ sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
7310 ++#endif
7311 ++}
7312 ++
7313 ++#ifdef CONFIG_SMP
7314 ++
7315 ++int cpuset_cpumask_can_shrink(const struct cpumask __maybe_unused *cur,
7316 ++ const struct cpumask __maybe_unused *trial)
7317 ++{
7318 ++ return 1;
7319 ++}
7320 ++
7321 ++int task_can_attach(struct task_struct *p,
7322 ++ const struct cpumask *cs_cpus_allowed)
7323 ++{
7324 ++ int ret = 0;
7325 ++
7326 ++ /*
7327 ++ * Kthreads which disallow setaffinity shouldn't be moved
7328 ++ * to a new cpuset; we don't want to change their CPU
7329 ++ * affinity and isolating such threads by their set of
7330 ++ * allowed nodes is unnecessary. Thus, cpusets are not
7331 ++ * applicable for such threads. This prevents checking for
7332 ++ * success of set_cpus_allowed_ptr() on all attached tasks
7333 ++ * before cpus_mask may be changed.
7334 ++ */
7335 ++ if (p->flags & PF_NO_SETAFFINITY)
7336 ++ ret = -EINVAL;
7337 ++
7338 ++ return ret;
7339 ++}
7340 ++
7341 ++bool sched_smp_initialized __read_mostly;
7342 ++
7343 ++#ifdef CONFIG_HOTPLUG_CPU
7344 ++/*
7345 ++ * Ensures that the idle task is using init_mm right before its CPU goes
7346 ++ * offline.
7347 ++ */
7348 ++void idle_task_exit(void)
7349 ++{
7350 ++ struct mm_struct *mm = current->active_mm;
7351 ++
7352 ++ BUG_ON(current != this_rq()->idle);
7353 ++
7354 ++ if (mm != &init_mm) {
7355 ++ switch_mm(mm, &init_mm, current);
7356 ++ finish_arch_post_lock_switch();
7357 ++ }
7358 ++
7359 ++ scs_task_reset(current);
7360 ++ /* finish_cpu(), as ran on the BP, will clean up the active_mm state */
7361 ++}
7362 ++
7363 ++static int __balance_push_cpu_stop(void *arg)
7364 ++{
7365 ++ struct task_struct *p = arg;
7366 ++ struct rq *rq = this_rq();
7367 ++ struct rq_flags rf;
7368 ++ int cpu;
7369 ++
7370 ++ raw_spin_lock_irq(&p->pi_lock);
7371 ++ rq_lock(rq, &rf);
7372 ++
7373 ++ update_rq_clock(rq);
7374 ++
7375 ++ if (task_rq(p) == rq && task_on_rq_queued(p)) {
7376 ++ cpu = select_fallback_rq(rq->cpu, p);
7377 ++ rq = __migrate_task(rq, p, cpu);
7378 ++ }
7379 ++
7380 ++ rq_unlock(rq, &rf);
7381 ++ raw_spin_unlock_irq(&p->pi_lock);
7382 ++
7383 ++ put_task_struct(p);
7384 ++
7385 ++ return 0;
7386 ++}
7387 ++
7388 ++static DEFINE_PER_CPU(struct cpu_stop_work, push_work);
7389 ++
7390 ++/*
7391 ++ * This is enabled below SCHED_AP_ACTIVE, i.e. when !cpu_active(), but it
7392 ++ * only takes effect while the CPU is going down.
7393 ++ */
7394 ++static void balance_push(struct rq *rq)
7395 ++{
7396 ++ struct task_struct *push_task = rq->curr;
7397 ++
7398 ++ lockdep_assert_held(&rq->lock);
7399 ++
7400 ++ /*
7401 ++ * Ensure the thing is persistent until balance_push_set(.on = false);
7402 ++ */
7403 ++ rq->balance_callback = &balance_push_callback;
7404 ++
7405 ++ /*
7406 ++ * Only active while going offline and when invoked on the outgoing
7407 ++ * CPU.
7408 ++ */
7409 ++ if (!cpu_dying(rq->cpu) || rq != this_rq())
7410 ++ return;
7411 ++
7412 ++ /*
7413 ++ * Both the cpu-hotplug and stop task are in this case and are
7414 ++ * required to complete the hotplug process.
7415 ++ */
7416 ++ if (kthread_is_per_cpu(push_task) ||
7417 ++ is_migration_disabled(push_task)) {
7418 ++
7419 ++ /*
7420 ++ * If this is the idle task on the outgoing CPU try to wake
7421 ++ * up the hotplug control thread which might wait for the
7422 ++ * last task to vanish. The rcuwait_active() check is
7423 ++ * accurate here because the waiter is pinned on this CPU
7424 ++ * and can't obviously be running in parallel.
7425 ++ *
7426 ++ * On RT kernels this also has to check whether there are
7427 ++ * pinned and scheduled out tasks on the runqueue. They
7428 ++ * need to leave the migrate disabled section first.
7429 ++ */
7430 ++ if (!rq->nr_running && !rq_has_pinned_tasks(rq) &&
7431 ++ rcuwait_active(&rq->hotplug_wait)) {
7432 ++ raw_spin_unlock(&rq->lock);
7433 ++ rcuwait_wake_up(&rq->hotplug_wait);
7434 ++ raw_spin_lock(&rq->lock);
7435 ++ }
7436 ++ return;
7437 ++ }
7438 ++
7439 ++ get_task_struct(push_task);
7440 ++ /*
7441 ++ * Temporarily drop rq->lock such that we can wake-up the stop task.
7442 ++ * Both preemption and IRQs are still disabled.
7443 ++ */
7444 ++ raw_spin_unlock(&rq->lock);
7445 ++ stop_one_cpu_nowait(rq->cpu, __balance_push_cpu_stop, push_task,
7446 ++ this_cpu_ptr(&push_work));
7447 ++ /*
7448 ++ * At this point need_resched() is true and we'll take the loop in
7449 ++ * schedule(). The next pick is obviously going to be the stop task
7450 ++ * which kthread_is_per_cpu() and will push this task away.
7451 ++ */
7452 ++ raw_spin_lock(&rq->lock);
7453 ++}
7454 ++
7455 ++static void balance_push_set(int cpu, bool on)
7456 ++{
7457 ++ struct rq *rq = cpu_rq(cpu);
7458 ++ struct rq_flags rf;
7459 ++
7460 ++ rq_lock_irqsave(rq, &rf);
7461 ++ if (on) {
7462 ++ WARN_ON_ONCE(rq->balance_callback);
7463 ++ rq->balance_callback = &balance_push_callback;
7464 ++ } else if (rq->balance_callback == &balance_push_callback) {
7465 ++ rq->balance_callback = NULL;
7466 ++ }
7467 ++ rq_unlock_irqrestore(rq, &rf);
7468 ++}
7469 ++
7470 ++/*
7471 ++ * Invoked from a CPUs hotplug control thread after the CPU has been marked
7472 ++ * inactive. All tasks which are not per CPU kernel threads are either
7473 ++ * pushed off this CPU now via balance_push() or placed on a different CPU
7474 ++ * during wakeup. Wait until the CPU is quiescent.
7475 ++ */
7476 ++static void balance_hotplug_wait(void)
7477 ++{
7478 ++ struct rq *rq = this_rq();
7479 ++
7480 ++ rcuwait_wait_event(&rq->hotplug_wait,
7481 ++ rq->nr_running == 1 && !rq_has_pinned_tasks(rq),
7482 ++ TASK_UNINTERRUPTIBLE);
7483 ++}
7484 ++
7485 ++#else
7486 ++
7487 ++static void balance_push(struct rq *rq)
7488 ++{
7489 ++}
7490 ++
7491 ++static void balance_push_set(int cpu, bool on)
7492 ++{
7493 ++}
7494 ++
7495 ++static inline void balance_hotplug_wait(void)
7496 ++{
7497 ++}
7498 ++#endif /* CONFIG_HOTPLUG_CPU */
7499 ++
7500 ++static void set_rq_offline(struct rq *rq)
7501 ++{
7502 ++ if (rq->online)
7503 ++ rq->online = false;
7504 ++}
7505 ++
7506 ++static void set_rq_online(struct rq *rq)
7507 ++{
7508 ++ if (!rq->online)
7509 ++ rq->online = true;
7510 ++}
7511 ++
7512 ++/*
7513 ++ * used to mark begin/end of suspend/resume:
7514 ++ */
7515 ++static int num_cpus_frozen;
7516 ++
7517 ++/*
7518 ++ * Update cpusets according to cpu_active mask. If cpusets are
7519 ++ * disabled, cpuset_update_active_cpus() becomes a simple wrapper
7520 ++ * around partition_sched_domains().
7521 ++ *
7522 ++ * If we come here as part of a suspend/resume, don't touch cpusets because we
7523 ++ * want to restore it back to its original state upon resume anyway.
7524 ++ */
7525 ++static void cpuset_cpu_active(void)
7526 ++{
7527 ++ if (cpuhp_tasks_frozen) {
7528 ++ /*
7529 ++ * num_cpus_frozen tracks how many CPUs are involved in suspend
7530 ++ * resume sequence. As long as this is not the last online
7531 ++ * operation in the resume sequence, just build a single sched
7532 ++ * domain, ignoring cpusets.
7533 ++ */
7534 ++ partition_sched_domains(1, NULL, NULL);
7535 ++ if (--num_cpus_frozen)
7536 ++ return;
7537 ++ /*
7538 ++ * This is the last CPU online operation. So fall through and
7539 ++ * restore the original sched domains by considering the
7540 ++ * cpuset configurations.
7541 ++ */
7542 ++ cpuset_force_rebuild();
7543 ++ }
7544 ++
7545 ++ cpuset_update_active_cpus();
7546 ++}
7547 ++
7548 ++static int cpuset_cpu_inactive(unsigned int cpu)
7549 ++{
7550 ++ if (!cpuhp_tasks_frozen) {
7551 ++ cpuset_update_active_cpus();
7552 ++ } else {
7553 ++ num_cpus_frozen++;
7554 ++ partition_sched_domains(1, NULL, NULL);
7555 ++ }
7556 ++ return 0;
7557 ++}
7558 ++
7559 ++int sched_cpu_activate(unsigned int cpu)
7560 ++{
7561 ++ struct rq *rq = cpu_rq(cpu);
7562 ++ unsigned long flags;
7563 ++
7564 ++ /*
7565 ++ * Clear the balance_push callback and prepare to schedule
7566 ++ * regular tasks.
7567 ++ */
7568 ++ balance_push_set(cpu, false);
7569 ++
7570 ++#ifdef CONFIG_SCHED_SMT
7571 ++ /*
7572 ++ * When going up, increment the number of cores with SMT present.
7573 ++ */
7574 ++ if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
7575 ++ static_branch_inc_cpuslocked(&sched_smt_present);
7576 ++#endif
7577 ++ set_cpu_active(cpu, true);
7578 ++
7579 ++ if (sched_smp_initialized)
7580 ++ cpuset_cpu_active();
7581 ++
7582 ++ /*
7583 ++ * Put the rq online, if not already. This happens:
7584 ++ *
7585 ++ * 1) In the early boot process, because we build the real domains
7586 ++ * after all cpus have been brought up.
7587 ++ *
7588 ++ * 2) At runtime, if cpuset_cpu_active() fails to rebuild the
7589 ++ * domains.
7590 ++ */
7591 ++ raw_spin_lock_irqsave(&rq->lock, flags);
7592 ++ set_rq_online(rq);
7593 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
7594 ++
7595 ++ return 0;
7596 ++}
7597 ++
7598 ++int sched_cpu_deactivate(unsigned int cpu)
7599 ++{
7600 ++ struct rq *rq = cpu_rq(cpu);
7601 ++ unsigned long flags;
7602 ++ int ret;
7603 ++
7604 ++ set_cpu_active(cpu, false);
7605 ++
7606 ++ /*
7607 ++ * From this point forward, this CPU will refuse to run any task that
7608 ++ * is not: migrate_disable() or KTHREAD_IS_PER_CPU, and will actively
7609 ++ * push those tasks away until this gets cleared, see
7610 ++ * sched_cpu_dying().
7611 ++ */
7612 ++ balance_push_set(cpu, true);
7613 ++
7614 ++ /*
7615 ++ * We've cleared cpu_active_mask, wait for all preempt-disabled and RCU
7616 ++ * users of this state to go away such that all new such users will
7617 ++ * observe it.
7618 ++ *
7619 ++ * Specifically, we rely on ttwu to no longer target this CPU, see
7620 ++ * ttwu_queue_cond() and is_cpu_allowed().
7621 ++ *
7622 ++ * Do sync before park smpboot threads to take care the rcu boost case.
7623 ++ */
7624 ++ synchronize_rcu();
7625 ++
7626 ++ raw_spin_lock_irqsave(&rq->lock, flags);
7627 ++ update_rq_clock(rq);
7628 ++ set_rq_offline(rq);
7629 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
7630 ++
7631 ++#ifdef CONFIG_SCHED_SMT
7632 ++ /*
7633 ++ * When going down, decrement the number of cores with SMT present.
7634 ++ */
7635 ++ if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
7636 ++ static_branch_dec_cpuslocked(&sched_smt_present);
7637 ++ if (!static_branch_likely(&sched_smt_present))
7638 ++ cpumask_clear(&sched_sg_idle_mask);
7639 ++ }
7640 ++#endif
7641 ++
7642 ++ if (!sched_smp_initialized)
7643 ++ return 0;
7644 ++
7645 ++ ret = cpuset_cpu_inactive(cpu);
7646 ++ if (ret) {
7647 ++ balance_push_set(cpu, false);
7648 ++ set_cpu_active(cpu, true);
7649 ++ return ret;
7650 ++ }
7651 ++
7652 ++ return 0;
7653 ++}
7654 ++
7655 ++static void sched_rq_cpu_starting(unsigned int cpu)
7656 ++{
7657 ++ struct rq *rq = cpu_rq(cpu);
7658 ++
7659 ++ rq->calc_load_update = calc_load_update;
7660 ++}
7661 ++
7662 ++int sched_cpu_starting(unsigned int cpu)
7663 ++{
7664 ++ sched_rq_cpu_starting(cpu);
7665 ++ sched_tick_start(cpu);
7666 ++ return 0;
7667 ++}
7668 ++
7669 ++#ifdef CONFIG_HOTPLUG_CPU
7670 ++
7671 ++/*
7672 ++ * Invoked immediately before the stopper thread is invoked to bring the
7673 ++ * CPU down completely. At this point all per CPU kthreads except the
7674 ++ * hotplug thread (current) and the stopper thread (inactive) have been
7675 ++ * either parked or have been unbound from the outgoing CPU. Ensure that
7676 ++ * any of those which might be on the way out are gone.
7677 ++ *
7678 ++ * If after this point a bound task is being woken on this CPU then the
7679 ++ * responsible hotplug callback has failed to do its job.
7680 ++ * sched_cpu_dying() will catch it with the appropriate fireworks.
7681 ++ */
7682 ++int sched_cpu_wait_empty(unsigned int cpu)
7683 ++{
7684 ++ balance_hotplug_wait();
7685 ++ return 0;
7686 ++}
7687 ++
7688 ++/*
7689 ++ * Since this CPU is going 'away' for a while, fold any nr_active delta we
7690 ++ * might have. Called from the CPU stopper task after ensuring that the
7691 ++ * stopper is the last running task on the CPU, so nr_active count is
7692 ++ * stable. We need to take the teardown thread which is calling this into
7693 ++ * account, so we hand in adjust = 1 to the load calculation.
7694 ++ *
7695 ++ * Also see the comment "Global load-average calculations".
7696 ++ */
7697 ++static void calc_load_migrate(struct rq *rq)
7698 ++{
7699 ++ long delta = calc_load_fold_active(rq, 1);
7700 ++
7701 ++ if (delta)
7702 ++ atomic_long_add(delta, &calc_load_tasks);
7703 ++}
7704 ++
7705 ++static void dump_rq_tasks(struct rq *rq, const char *loglvl)
7706 ++{
7707 ++ struct task_struct *g, *p;
7708 ++ int cpu = cpu_of(rq);
7709 ++
7710 ++ lockdep_assert_held(&rq->lock);
7711 ++
7712 ++ printk("%sCPU%d enqueued tasks (%u total):\n", loglvl, cpu, rq->nr_running);
7713 ++ for_each_process_thread(g, p) {
7714 ++ if (task_cpu(p) != cpu)
7715 ++ continue;
7716 ++
7717 ++ if (!task_on_rq_queued(p))
7718 ++ continue;
7719 ++
7720 ++ printk("%s\tpid: %d, name: %s\n", loglvl, p->pid, p->comm);
7721 ++ }
7722 ++}
7723 ++
7724 ++int sched_cpu_dying(unsigned int cpu)
7725 ++{
7726 ++ struct rq *rq = cpu_rq(cpu);
7727 ++ unsigned long flags;
7728 ++
7729 ++ /* Handle pending wakeups and then migrate everything off */
7730 ++ sched_tick_stop(cpu);
7731 ++
7732 ++ raw_spin_lock_irqsave(&rq->lock, flags);
7733 ++ if (rq->nr_running != 1 || rq_has_pinned_tasks(rq)) {
7734 ++ WARN(true, "Dying CPU not properly vacated!");
7735 ++ dump_rq_tasks(rq, KERN_WARNING);
7736 ++ }
7737 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
7738 ++
7739 ++ calc_load_migrate(rq);
7740 ++ hrtick_clear(rq);
7741 ++ return 0;
7742 ++}
7743 ++#endif
7744 ++
7745 ++#ifdef CONFIG_SMP
7746 ++static void sched_init_topology_cpumask_early(void)
7747 ++{
7748 ++ int cpu;
7749 ++ cpumask_t *tmp;
7750 ++
7751 ++ for_each_possible_cpu(cpu) {
7752 ++ /* init topo masks */
7753 ++ tmp = per_cpu(sched_cpu_topo_masks, cpu);
7754 ++
7755 ++ cpumask_copy(tmp, cpumask_of(cpu));
7756 ++ tmp++;
7757 ++ cpumask_copy(tmp, cpu_possible_mask);
7758 ++ per_cpu(sched_cpu_llc_mask, cpu) = tmp;
7759 ++ per_cpu(sched_cpu_topo_end_mask, cpu) = ++tmp;
7760 ++ /*per_cpu(sd_llc_id, cpu) = cpu;*/
7761 ++ }
7762 ++}
7763 ++
7764 ++#define TOPOLOGY_CPUMASK(name, mask, last)\
7765 ++ if (cpumask_and(topo, topo, mask)) { \
7766 ++ cpumask_copy(topo, mask); \
7767 ++ printk(KERN_INFO "sched: cpu#%02d topo: 0x%08lx - "#name, \
7768 ++ cpu, (topo++)->bits[0]); \
7769 ++ } \
7770 ++ if (!last) \
7771 ++ cpumask_complement(topo, mask)
7772 ++
7773 ++static void sched_init_topology_cpumask(void)
7774 ++{
7775 ++ int cpu;
7776 ++ cpumask_t *topo;
7777 ++
7778 ++ for_each_online_cpu(cpu) {
7779 ++ /* take chance to reset time slice for idle tasks */
7780 ++ cpu_rq(cpu)->idle->time_slice = sched_timeslice_ns;
7781 ++
7782 ++ topo = per_cpu(sched_cpu_topo_masks, cpu) + 1;
7783 ++
7784 ++ cpumask_complement(topo, cpumask_of(cpu));
7785 ++#ifdef CONFIG_SCHED_SMT
7786 ++ TOPOLOGY_CPUMASK(smt, topology_sibling_cpumask(cpu), false);
7787 ++#endif
7788 ++ per_cpu(sd_llc_id, cpu) = cpumask_first(cpu_coregroup_mask(cpu));
7789 ++ per_cpu(sched_cpu_llc_mask, cpu) = topo;
7790 ++ TOPOLOGY_CPUMASK(coregroup, cpu_coregroup_mask(cpu), false);
7791 ++
7792 ++ TOPOLOGY_CPUMASK(core, topology_core_cpumask(cpu), false);
7793 ++
7794 ++ TOPOLOGY_CPUMASK(others, cpu_online_mask, true);
7795 ++
7796 ++ per_cpu(sched_cpu_topo_end_mask, cpu) = topo;
7797 ++ printk(KERN_INFO "sched: cpu#%02d llc_id = %d, llc_mask idx = %d\n",
7798 ++ cpu, per_cpu(sd_llc_id, cpu),
7799 ++ (int) (per_cpu(sched_cpu_llc_mask, cpu) -
7800 ++ per_cpu(sched_cpu_topo_masks, cpu)));
7801 ++ }
7802 ++}
7803 ++#endif
7804 ++
7805 ++void __init sched_init_smp(void)
7806 ++{
7807 ++ /* Move init over to a non-isolated CPU */
7808 ++ if (set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_FLAG_DOMAIN)) < 0)
7809 ++ BUG();
7810 ++ current->flags &= ~PF_NO_SETAFFINITY;
7811 ++
7812 ++ sched_init_topology_cpumask();
7813 ++
7814 ++ sched_smp_initialized = true;
7815 ++}
7816 ++#else
7817 ++void __init sched_init_smp(void)
7818 ++{
7819 ++ cpu_rq(0)->idle->time_slice = sched_timeslice_ns;
7820 ++}
7821 ++#endif /* CONFIG_SMP */
7822 ++
7823 ++int in_sched_functions(unsigned long addr)
7824 ++{
7825 ++ return in_lock_functions(addr) ||
7826 ++ (addr >= (unsigned long)__sched_text_start
7827 ++ && addr < (unsigned long)__sched_text_end);
7828 ++}
7829 ++
7830 ++#ifdef CONFIG_CGROUP_SCHED
7831 ++/* task group related information */
7832 ++struct task_group {
7833 ++ struct cgroup_subsys_state css;
7834 ++
7835 ++ struct rcu_head rcu;
7836 ++ struct list_head list;
7837 ++
7838 ++ struct task_group *parent;
7839 ++ struct list_head siblings;
7840 ++ struct list_head children;
7841 ++#ifdef CONFIG_FAIR_GROUP_SCHED
7842 ++ unsigned long shares;
7843 ++#endif
7844 ++};
7845 ++
7846 ++/*
7847 ++ * Default task group.
7848 ++ * Every task in the system belongs to this group at bootup.
7849 ++ */
7850 ++struct task_group root_task_group;
7851 ++LIST_HEAD(task_groups);
7852 ++
7853 ++/* Cacheline aligned slab cache for task_group */
7854 ++static struct kmem_cache *task_group_cache __read_mostly;
7855 ++#endif /* CONFIG_CGROUP_SCHED */
7856 ++
7857 ++void __init sched_init(void)
7858 ++{
7859 ++ int i;
7860 ++ struct rq *rq;
7861 ++
7862 ++ printk(KERN_INFO ALT_SCHED_VERSION_MSG);
7863 ++
7864 ++ wait_bit_init();
7865 ++
7866 ++#ifdef CONFIG_SMP
7867 ++ for (i = 0; i < SCHED_BITS; i++)
7868 ++ cpumask_copy(sched_rq_watermark + i, cpu_present_mask);
7869 ++#endif
7870 ++
7871 ++#ifdef CONFIG_CGROUP_SCHED
7872 ++ task_group_cache = KMEM_CACHE(task_group, 0);
7873 ++
7874 ++ list_add(&root_task_group.list, &task_groups);
7875 ++ INIT_LIST_HEAD(&root_task_group.children);
7876 ++ INIT_LIST_HEAD(&root_task_group.siblings);
7877 ++#endif /* CONFIG_CGROUP_SCHED */
7878 ++ for_each_possible_cpu(i) {
7879 ++ rq = cpu_rq(i);
7880 ++
7881 ++ sched_queue_init(&rq->queue);
7882 ++ rq->watermark = IDLE_TASK_SCHED_PRIO;
7883 ++ rq->skip = NULL;
7884 ++
7885 ++ raw_spin_lock_init(&rq->lock);
7886 ++ rq->nr_running = rq->nr_uninterruptible = 0;
7887 ++ rq->calc_load_active = 0;
7888 ++ rq->calc_load_update = jiffies + LOAD_FREQ;
7889 ++#ifdef CONFIG_SMP
7890 ++ rq->online = false;
7891 ++ rq->cpu = i;
7892 ++
7893 ++#ifdef CONFIG_SCHED_SMT
7894 ++ rq->active_balance = 0;
7895 ++#endif
7896 ++
7897 ++#ifdef CONFIG_NO_HZ_COMMON
7898 ++ INIT_CSD(&rq->nohz_csd, nohz_csd_func, rq);
7899 ++#endif
7900 ++ rq->balance_callback = &balance_push_callback;
7901 ++#ifdef CONFIG_HOTPLUG_CPU
7902 ++ rcuwait_init(&rq->hotplug_wait);
7903 ++#endif
7904 ++#endif /* CONFIG_SMP */
7905 ++ rq->nr_switches = 0;
7906 ++
7907 ++ hrtick_rq_init(rq);
7908 ++ atomic_set(&rq->nr_iowait, 0);
7909 ++ }
7910 ++#ifdef CONFIG_SMP
7911 ++ /* Set rq->online for cpu 0 */
7912 ++ cpu_rq(0)->online = true;
7913 ++#endif
7914 ++ /*
7915 ++ * The boot idle thread does lazy MMU switching as well:
7916 ++ */
7917 ++ mmgrab(&init_mm);
7918 ++ enter_lazy_tlb(&init_mm, current);
7919 ++
7920 ++ /*
7921 ++ * Make us the idle thread. Technically, schedule() should not be
7922 ++ * called from this thread, however somewhere below it might be,
7923 ++ * but because we are the idle thread, we just pick up running again
7924 ++ * when this runqueue becomes "idle".
7925 ++ */
7926 ++ init_idle(current, smp_processor_id());
7927 ++
7928 ++ calc_load_update = jiffies + LOAD_FREQ;
7929 ++
7930 ++#ifdef CONFIG_SMP
7931 ++ idle_thread_set_boot_cpu();
7932 ++ balance_push_set(smp_processor_id(), false);
7933 ++
7934 ++ sched_init_topology_cpumask_early();
7935 ++#endif /* SMP */
7936 ++
7937 ++ psi_init();
7938 ++}
7939 ++
7940 ++#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
7941 ++static inline int preempt_count_equals(int preempt_offset)
7942 ++{
7943 ++ int nested = preempt_count() + rcu_preempt_depth();
7944 ++
7945 ++ return (nested == preempt_offset);
7946 ++}
7947 ++
7948 ++void __might_sleep(const char *file, int line, int preempt_offset)
7949 ++{
7950 ++ unsigned int state = get_current_state();
7951 ++ /*
7952 ++ * Blocking primitives will set (and therefore destroy) current->state,
7953 ++ * since we will exit with TASK_RUNNING make sure we enter with it,
7954 ++ * otherwise we will destroy state.
7955 ++ */
7956 ++ WARN_ONCE(state != TASK_RUNNING && current->task_state_change,
7957 ++ "do not call blocking ops when !TASK_RUNNING; "
7958 ++ "state=%x set at [<%p>] %pS\n", state,
7959 ++ (void *)current->task_state_change,
7960 ++ (void *)current->task_state_change);
7961 ++
7962 ++ ___might_sleep(file, line, preempt_offset);
7963 ++}
7964 ++EXPORT_SYMBOL(__might_sleep);
7965 ++
7966 ++void ___might_sleep(const char *file, int line, int preempt_offset)
7967 ++{
7968 ++ /* Ratelimiting timestamp: */
7969 ++ static unsigned long prev_jiffy;
7970 ++
7971 ++ unsigned long preempt_disable_ip;
7972 ++
7973 ++ /* WARN_ON_ONCE() by default, no rate limit required: */
7974 ++ rcu_sleep_check();
7975 ++
7976 ++ if ((preempt_count_equals(preempt_offset) && !irqs_disabled() &&
7977 ++ !is_idle_task(current) && !current->non_block_count) ||
7978 ++ system_state == SYSTEM_BOOTING || system_state > SYSTEM_RUNNING ||
7979 ++ oops_in_progress)
7980 ++ return;
7981 ++ if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
7982 ++ return;
7983 ++ prev_jiffy = jiffies;
7984 ++
7985 ++ /* Save this before calling printk(), since that will clobber it: */
7986 ++ preempt_disable_ip = get_preempt_disable_ip(current);
7987 ++
7988 ++ printk(KERN_ERR
7989 ++ "BUG: sleeping function called from invalid context at %s:%d\n",
7990 ++ file, line);
7991 ++ printk(KERN_ERR
7992 ++ "in_atomic(): %d, irqs_disabled(): %d, non_block: %d, pid: %d, name: %s\n",
7993 ++ in_atomic(), irqs_disabled(), current->non_block_count,
7994 ++ current->pid, current->comm);
7995 ++
7996 ++ if (task_stack_end_corrupted(current))
7997 ++ printk(KERN_EMERG "Thread overran stack, or stack corrupted\n");
7998 ++
7999 ++ debug_show_held_locks(current);
8000 ++ if (irqs_disabled())
8001 ++ print_irqtrace_events(current);
8002 ++#ifdef CONFIG_DEBUG_PREEMPT
8003 ++ if (!preempt_count_equals(preempt_offset)) {
8004 ++ pr_err("Preemption disabled at:");
8005 ++ print_ip_sym(KERN_ERR, preempt_disable_ip);
8006 ++ }
8007 ++#endif
8008 ++ dump_stack();
8009 ++ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
8010 ++}
8011 ++EXPORT_SYMBOL(___might_sleep);
8012 ++
8013 ++void __cant_sleep(const char *file, int line, int preempt_offset)
8014 ++{
8015 ++ static unsigned long prev_jiffy;
8016 ++
8017 ++ if (irqs_disabled())
8018 ++ return;
8019 ++
8020 ++ if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
8021 ++ return;
8022 ++
8023 ++ if (preempt_count() > preempt_offset)
8024 ++ return;
8025 ++
8026 ++ if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
8027 ++ return;
8028 ++ prev_jiffy = jiffies;
8029 ++
8030 ++ printk(KERN_ERR "BUG: assuming atomic context at %s:%d\n", file, line);
8031 ++ printk(KERN_ERR "in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",
8032 ++ in_atomic(), irqs_disabled(),
8033 ++ current->pid, current->comm);
8034 ++
8035 ++ debug_show_held_locks(current);
8036 ++ dump_stack();
8037 ++ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
8038 ++}
8039 ++EXPORT_SYMBOL_GPL(__cant_sleep);
8040 ++
8041 ++#ifdef CONFIG_SMP
8042 ++void __cant_migrate(const char *file, int line)
8043 ++{
8044 ++ static unsigned long prev_jiffy;
8045 ++
8046 ++ if (irqs_disabled())
8047 ++ return;
8048 ++
8049 ++ if (is_migration_disabled(current))
8050 ++ return;
8051 ++
8052 ++ if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
8053 ++ return;
8054 ++
8055 ++ if (preempt_count() > 0)
8056 ++ return;
8057 ++
8058 ++ if (current->migration_flags & MDF_FORCE_ENABLED)
8059 ++ return;
8060 ++
8061 ++ if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
8062 ++ return;
8063 ++ prev_jiffy = jiffies;
8064 ++
8065 ++ pr_err("BUG: assuming non migratable context at %s:%d\n", file, line);
8066 ++ pr_err("in_atomic(): %d, irqs_disabled(): %d, migration_disabled() %u pid: %d, name: %s\n",
8067 ++ in_atomic(), irqs_disabled(), is_migration_disabled(current),
8068 ++ current->pid, current->comm);
8069 ++
8070 ++ debug_show_held_locks(current);
8071 ++ dump_stack();
8072 ++ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
8073 ++}
8074 ++EXPORT_SYMBOL_GPL(__cant_migrate);
8075 ++#endif
8076 ++#endif
8077 ++
8078 ++#ifdef CONFIG_MAGIC_SYSRQ
8079 ++void normalize_rt_tasks(void)
8080 ++{
8081 ++ struct task_struct *g, *p;
8082 ++ struct sched_attr attr = {
8083 ++ .sched_policy = SCHED_NORMAL,
8084 ++ };
8085 ++
8086 ++ read_lock(&tasklist_lock);
8087 ++ for_each_process_thread(g, p) {
8088 ++ /*
8089 ++ * Only normalize user tasks:
8090 ++ */
8091 ++ if (p->flags & PF_KTHREAD)
8092 ++ continue;
8093 ++
8094 ++ if (!rt_task(p)) {
8095 ++ /*
8096 ++ * Renice negative nice level userspace
8097 ++ * tasks back to 0:
8098 ++ */
8099 ++ if (task_nice(p) < 0)
8100 ++ set_user_nice(p, 0);
8101 ++ continue;
8102 ++ }
8103 ++
8104 ++ __sched_setscheduler(p, &attr, false, false);
8105 ++ }
8106 ++ read_unlock(&tasklist_lock);
8107 ++}
8108 ++#endif /* CONFIG_MAGIC_SYSRQ */
8109 ++
8110 ++#if defined(CONFIG_IA64) || defined(CONFIG_KGDB_KDB)
8111 ++/*
8112 ++ * These functions are only useful for the IA64 MCA handling, or kdb.
8113 ++ *
8114 ++ * They can only be called when the whole system has been
8115 ++ * stopped - every CPU needs to be quiescent, and no scheduling
8116 ++ * activity can take place. Using them for anything else would
8117 ++ * be a serious bug, and as a result, they aren't even visible
8118 ++ * under any other configuration.
8119 ++ */
8120 ++
8121 ++/**
8122 ++ * curr_task - return the current task for a given CPU.
8123 ++ * @cpu: the processor in question.
8124 ++ *
8125 ++ * ONLY VALID WHEN THE WHOLE SYSTEM IS STOPPED!
8126 ++ *
8127 ++ * Return: The current task for @cpu.
8128 ++ */
8129 ++struct task_struct *curr_task(int cpu)
8130 ++{
8131 ++ return cpu_curr(cpu);
8132 ++}
8133 ++
8134 ++#endif /* defined(CONFIG_IA64) || defined(CONFIG_KGDB_KDB) */
8135 ++
8136 ++#ifdef CONFIG_IA64
8137 ++/**
8138 ++ * ia64_set_curr_task - set the current task for a given CPU.
8139 ++ * @cpu: the processor in question.
8140 ++ * @p: the task pointer to set.
8141 ++ *
8142 ++ * Description: This function must only be used when non-maskable interrupts
8143 ++ * are serviced on a separate stack. It allows the architecture to switch the
8144 ++ * notion of the current task on a CPU in a non-blocking manner. This function
8145 ++ * must be called with all CPUs synchronised and interrupts disabled; the
8146 ++ * caller must save the original value of the current task (see
8147 ++ * curr_task() above) and restore that value before reenabling interrupts and
8148 ++ * re-starting the system.
8149 ++ *
8150 ++ * ONLY VALID WHEN THE WHOLE SYSTEM IS STOPPED!
8151 ++ */
8152 ++void ia64_set_curr_task(int cpu, struct task_struct *p)
8153 ++{
8154 ++ cpu_curr(cpu) = p;
8155 ++}
8156 ++
8157 ++#endif
8158 ++
8159 ++#ifdef CONFIG_CGROUP_SCHED
8160 ++static void sched_free_group(struct task_group *tg)
8161 ++{
8162 ++ kmem_cache_free(task_group_cache, tg);
8163 ++}
8164 ++
8165 ++/* allocate runqueue etc for a new task group */
8166 ++struct task_group *sched_create_group(struct task_group *parent)
8167 ++{
8168 ++ struct task_group *tg;
8169 ++
8170 ++ tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
8171 ++ if (!tg)
8172 ++ return ERR_PTR(-ENOMEM);
8173 ++
8174 ++ return tg;
8175 ++}
8176 ++
8177 ++void sched_online_group(struct task_group *tg, struct task_group *parent)
8178 ++{
8179 ++}
8180 ++
8181 ++/* rcu callback to free various structures associated with a task group */
8182 ++static void sched_free_group_rcu(struct rcu_head *rhp)
8183 ++{
8184 ++ /* Now it should be safe to free those cfs_rqs */
8185 ++ sched_free_group(container_of(rhp, struct task_group, rcu));
8186 ++}
8187 ++
8188 ++void sched_destroy_group(struct task_group *tg)
8189 ++{
8190 ++ /* Wait for possible concurrent references to cfs_rqs to complete */
8191 ++ call_rcu(&tg->rcu, sched_free_group_rcu);
8192 ++}
8193 ++
8194 ++void sched_offline_group(struct task_group *tg)
8195 ++{
8196 ++}
8197 ++
8198 ++static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
8199 ++{
8200 ++ return css ? container_of(css, struct task_group, css) : NULL;
8201 ++}
8202 ++
8203 ++static struct cgroup_subsys_state *
8204 ++cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
8205 ++{
8206 ++ struct task_group *parent = css_tg(parent_css);
8207 ++ struct task_group *tg;
8208 ++
8209 ++ if (!parent) {
8210 ++ /* This is early initialization for the top cgroup */
8211 ++ return &root_task_group.css;
8212 ++ }
8213 ++
8214 ++ tg = sched_create_group(parent);
8215 ++ if (IS_ERR(tg))
8216 ++ return ERR_PTR(-ENOMEM);
8217 ++ return &tg->css;
8218 ++}
8219 ++
8220 ++/* Expose task group only after completing cgroup initialization */
8221 ++static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
8222 ++{
8223 ++ struct task_group *tg = css_tg(css);
8224 ++ struct task_group *parent = css_tg(css->parent);
8225 ++
8226 ++ if (parent)
8227 ++ sched_online_group(tg, parent);
8228 ++ return 0;
8229 ++}
8230 ++
8231 ++static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
8232 ++{
8233 ++ struct task_group *tg = css_tg(css);
8234 ++
8235 ++ sched_offline_group(tg);
8236 ++}
8237 ++
8238 ++static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
8239 ++{
8240 ++ struct task_group *tg = css_tg(css);
8241 ++
8242 ++ /*
8243 ++ * Relies on the RCU grace period between css_released() and this.
8244 ++ */
8245 ++ sched_free_group(tg);
8246 ++}
8247 ++
8248 ++static void cpu_cgroup_fork(struct task_struct *task)
8249 ++{
8250 ++}
8251 ++
8252 ++static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
8253 ++{
8254 ++ return 0;
8255 ++}
8256 ++
8257 ++static void cpu_cgroup_attach(struct cgroup_taskset *tset)
8258 ++{
8259 ++}
8260 ++
8261 ++#ifdef CONFIG_FAIR_GROUP_SCHED
8262 ++static DEFINE_MUTEX(shares_mutex);
8263 ++
8264 ++int sched_group_set_shares(struct task_group *tg, unsigned long shares)
8265 ++{
8266 ++ /*
8267 ++ * We can't change the weight of the root cgroup.
8268 ++ */
8269 ++ if (&root_task_group == tg)
8270 ++ return -EINVAL;
8271 ++
8272 ++ shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
8273 ++
8274 ++ mutex_lock(&shares_mutex);
8275 ++ if (tg->shares == shares)
8276 ++ goto done;
8277 ++
8278 ++ tg->shares = shares;
8279 ++done:
8280 ++ mutex_unlock(&shares_mutex);
8281 ++ return 0;
8282 ++}
8283 ++
8284 ++static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
8285 ++ struct cftype *cftype, u64 shareval)
8286 ++{
8287 ++ if (shareval > scale_load_down(ULONG_MAX))
8288 ++ shareval = MAX_SHARES;
8289 ++ return sched_group_set_shares(css_tg(css), scale_load(shareval));
8290 ++}
8291 ++
8292 ++static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
8293 ++ struct cftype *cft)
8294 ++{
8295 ++ struct task_group *tg = css_tg(css);
8296 ++
8297 ++ return (u64) scale_load_down(tg->shares);
8298 ++}
8299 ++#endif
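
With CONFIG_FAIR_GROUP_SCHED, the shares interface above clamps values to the 2..2^18 range; in this hunk sched_group_set_shares() merely records the value in the task group. A user-space sketch that writes a share value, assuming a cgroup v1 cpu controller mounted at /sys/fs/cgroup/cpu and an existing group named "example" (both are assumptions):

#include <stdio.h>

int main(void)
{
	/* hypothetical path: cgroup v1 cpu controller, group "example" */
	FILE *f = fopen("/sys/fs/cgroup/cpu/example/cpu.shares", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fprintf(f, "2048\n");	/* twice the default weight of 1024 */
	fclose(f);
	return 0;
}
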
8300 ++
8301 ++static struct cftype cpu_legacy_files[] = {
8302 ++#ifdef CONFIG_FAIR_GROUP_SCHED
8303 ++ {
8304 ++ .name = "shares",
8305 ++ .read_u64 = cpu_shares_read_u64,
8306 ++ .write_u64 = cpu_shares_write_u64,
8307 ++ },
8308 ++#endif
8309 ++ { } /* Terminate */
8310 ++};
8311 ++
8312 ++
8313 ++static struct cftype cpu_files[] = {
8314 ++ { } /* terminate */
8315 ++};
8316 ++
8317 ++static int cpu_extra_stat_show(struct seq_file *sf,
8318 ++ struct cgroup_subsys_state *css)
8319 ++{
8320 ++ return 0;
8321 ++}
8322 ++
8323 ++struct cgroup_subsys cpu_cgrp_subsys = {
8324 ++ .css_alloc = cpu_cgroup_css_alloc,
8325 ++ .css_online = cpu_cgroup_css_online,
8326 ++ .css_released = cpu_cgroup_css_released,
8327 ++ .css_free = cpu_cgroup_css_free,
8328 ++ .css_extra_stat_show = cpu_extra_stat_show,
8329 ++ .fork = cpu_cgroup_fork,
8330 ++ .can_attach = cpu_cgroup_can_attach,
8331 ++ .attach = cpu_cgroup_attach,
8332 ++ .legacy_cftypes = cpu_legacy_files,
8334 ++ .dfl_cftypes = cpu_files,
8335 ++ .early_init = true,
8336 ++ .threaded = true,
8337 ++};
8338 ++#endif /* CONFIG_CGROUP_SCHED */
8339 ++
8340 ++#undef CREATE_TRACE_POINTS
8341 +diff --git a/kernel/sched/alt_debug.c b/kernel/sched/alt_debug.c
8342 +new file mode 100644
8343 +index 000000000000..1212a031700e
8344 +--- /dev/null
8345 ++++ b/kernel/sched/alt_debug.c
8346 +@@ -0,0 +1,31 @@
8347 ++/*
8348 ++ * kernel/sched/alt_debug.c
8349 ++ *
8350 ++ * Print the alt scheduler debugging details
8351 ++ *
8352 ++ * Author: Alfred Chen
8353 ++ * Date : 2020
8354 ++ */
8355 ++#include "sched.h"
8356 ++
8357 ++/*
8358 ++ * This allows printing both to /proc/sched_debug and
8359 ++ * to the console
8360 ++ */
8361 ++#define SEQ_printf(m, x...) \
8362 ++ do { \
8363 ++ if (m) \
8364 ++ seq_printf(m, x); \
8365 ++ else \
8366 ++ pr_cont(x); \
8367 ++ } while (0)
8368 ++
8369 ++void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
8370 ++ struct seq_file *m)
8371 ++{
8372 ++ SEQ_printf(m, "%s (%d, #threads: %d)\n", p->comm, task_pid_nr_ns(p, ns),
8373 ++ get_nr_threads(p));
8374 ++}
8375 ++
8376 ++void proc_sched_set_task(struct task_struct *p)
8377 ++{}
8378 +diff --git a/kernel/sched/alt_sched.h b/kernel/sched/alt_sched.h
8379 +new file mode 100644
8380 +index 000000000000..289058a09bd5
8381 +--- /dev/null
8382 ++++ b/kernel/sched/alt_sched.h
8383 +@@ -0,0 +1,666 @@
8384 ++#ifndef ALT_SCHED_H
8385 ++#define ALT_SCHED_H
8386 ++
8387 ++#include <linux/sched.h>
8388 ++
8389 ++#include <linux/sched/clock.h>
8390 ++#include <linux/sched/cpufreq.h>
8391 ++#include <linux/sched/cputime.h>
8392 ++#include <linux/sched/debug.h>
8393 ++#include <linux/sched/init.h>
8394 ++#include <linux/sched/isolation.h>
8395 ++#include <linux/sched/loadavg.h>
8396 ++#include <linux/sched/mm.h>
8397 ++#include <linux/sched/nohz.h>
8398 ++#include <linux/sched/signal.h>
8399 ++#include <linux/sched/stat.h>
8400 ++#include <linux/sched/sysctl.h>
8401 ++#include <linux/sched/task.h>
8402 ++#include <linux/sched/topology.h>
8403 ++#include <linux/sched/wake_q.h>
8404 ++
8405 ++#include <uapi/linux/sched/types.h>
8406 ++
8407 ++#include <linux/cgroup.h>
8408 ++#include <linux/cpufreq.h>
8409 ++#include <linux/cpuidle.h>
8410 ++#include <linux/cpuset.h>
8411 ++#include <linux/ctype.h>
8412 ++#include <linux/debugfs.h>
8413 ++#include <linux/kthread.h>
8414 ++#include <linux/livepatch.h>
8415 ++#include <linux/membarrier.h>
8416 ++#include <linux/proc_fs.h>
8417 ++#include <linux/psi.h>
8418 ++#include <linux/slab.h>
8419 ++#include <linux/stop_machine.h>
8420 ++#include <linux/suspend.h>
8421 ++#include <linux/swait.h>
8422 ++#include <linux/syscalls.h>
8423 ++#include <linux/tsacct_kern.h>
8424 ++
8425 ++#include <asm/tlb.h>
8426 ++
8427 ++#ifdef CONFIG_PARAVIRT
8428 ++# include <asm/paravirt.h>
8429 ++#endif
8430 ++
8431 ++#include "cpupri.h"
8432 ++
8433 ++#include <trace/events/sched.h>
8434 ++
8435 ++#ifdef CONFIG_SCHED_BMQ
8436 ++/* bits:
8437 ++ * RT(0-99), (Low prio adj range, nice width, high prio adj range) / 2, cpu idle task */
8438 ++#define SCHED_BITS (MAX_RT_PRIO + NICE_WIDTH / 2 + MAX_PRIORITY_ADJ + 1)
8439 ++#endif
8440 ++
8441 ++#ifdef CONFIG_SCHED_PDS
8442 ++/* bits: RT(0-99), reserved(100-127), NORMAL_PRIO_NUM, cpu idle task */
8443 ++#define SCHED_BITS (MIN_NORMAL_PRIO + NORMAL_PRIO_NUM + 1)
8444 ++#endif /* CONFIG_SCHED_PDS */
8445 ++
8446 ++#define IDLE_TASK_SCHED_PRIO (SCHED_BITS - 1)
8447 ++
8448 ++#ifdef CONFIG_SCHED_DEBUG
8449 ++# define SCHED_WARN_ON(x) WARN_ONCE(x, #x)
8450 ++extern void resched_latency_warn(int cpu, u64 latency);
8451 ++#else
8452 ++# define SCHED_WARN_ON(x) ({ (void)(x), 0; })
8453 ++static inline void resched_latency_warn(int cpu, u64 latency) {}
8454 ++#endif
8455 ++
8456 ++/*
8457 ++ * Increase resolution of nice-level calculations for 64-bit architectures.
8458 ++ * The extra resolution improves shares distribution and load balancing of
8459 ++ * low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
8460 ++ * hierarchies, especially on larger systems. This is not a user-visible change
8461 ++ * and does not change the user-interface for setting shares/weights.
8462 ++ *
8463 ++ * We increase resolution only if we have enough bits to allow this increased
8464 ++ * resolution (i.e. 64-bit). The costs for increasing resolution when 32-bit
8465 ++ * are pretty high and the returns do not justify the increased costs.
8466 ++ *
8467 ++ * Really only required when CONFIG_FAIR_GROUP_SCHED=y is also set, but to
8468 ++ * increase coverage and consistency always enable it on 64-bit platforms.
8469 ++ */
8470 ++#ifdef CONFIG_64BIT
8471 ++# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
8472 ++# define scale_load(w) ((w) << SCHED_FIXEDPOINT_SHIFT)
8473 ++# define scale_load_down(w) \
8474 ++({ \
8475 ++ unsigned long __w = (w); \
8476 ++ if (__w) \
8477 ++ __w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT); \
8478 ++ __w; \
8479 ++})
8480 ++#else
8481 ++# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT)
8482 ++# define scale_load(w) (w)
8483 ++# define scale_load_down(w) (w)
8484 ++#endif
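
A short worked example of the load-scaling macros above, assuming SCHED_FIXEDPOINT_SHIFT is 10 as in the mainline scheduler, so on 64-bit scale_load() multiplies by 1024 and scale_load_down() divides back while clamping non-zero weights to at least 2:

#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT	10	/* mainline value, assumed here */

static unsigned long scale_load(unsigned long w)
{
	return w << SCHED_FIXEDPOINT_SHIFT;
}

static unsigned long scale_load_down(unsigned long w)
{
	unsigned long s = w >> SCHED_FIXEDPOINT_SHIFT;

	return w ? (s > 2UL ? s : 2UL) : 0;
}

int main(void)
{
	printf("%lu\n", scale_load(1024));	   /* 1048576: nice-0 weight at high resolution */
	printf("%lu\n", scale_load_down(1048576)); /* 1024 */
	printf("%lu\n", scale_load_down(1));	   /* clamped to 2 to avoid zero/one weights */
	return 0;
}
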
8485 ++
8486 ++#ifdef CONFIG_FAIR_GROUP_SCHED
8487 ++#define ROOT_TASK_GROUP_LOAD NICE_0_LOAD
8488 ++
8489 ++/*
8490 ++ * A weight of 0 or 1 can cause arithmetic problems.
8491 ++ * The weight of a cfs_rq is the sum of the weights of the entities
8492 ++ * queued on it, so the weight of an entity should not be too large,
8493 ++ * and neither should the shares value of a task group.
8494 ++ * (The default weight is 1024 - so there's no practical
8495 ++ * limitation from this.)
8496 ++ */
8497 ++#define MIN_SHARES (1UL << 1)
8498 ++#define MAX_SHARES (1UL << 18)
8499 ++#endif
8500 ++
8501 ++/* task_struct::on_rq states: */
8502 ++#define TASK_ON_RQ_QUEUED 1
8503 ++#define TASK_ON_RQ_MIGRATING 2
8504 ++
8505 ++static inline int task_on_rq_queued(struct task_struct *p)
8506 ++{
8507 ++ return p->on_rq == TASK_ON_RQ_QUEUED;
8508 ++}
8509 ++
8510 ++static inline int task_on_rq_migrating(struct task_struct *p)
8511 ++{
8512 ++ return READ_ONCE(p->on_rq) == TASK_ON_RQ_MIGRATING;
8513 ++}
8514 ++
8515 ++/*
8516 ++ * wake flags
8517 ++ */
8518 ++#define WF_SYNC 0x01 /* waker goes to sleep after wakeup */
8519 ++#define WF_FORK 0x02 /* child wakeup after fork */
8520 ++#define WF_MIGRATED 0x04 /* internal use, task got migrated */
8521 ++#define WF_ON_CPU 0x08 /* Wakee is on_rq */
8522 ++
8523 ++#define SCHED_QUEUE_BITS (SCHED_BITS - 1)
8524 ++
8525 ++struct sched_queue {
8526 ++ DECLARE_BITMAP(bitmap, SCHED_QUEUE_BITS);
8527 ++ struct list_head heads[SCHED_BITS];
8528 ++};
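
struct sched_queue above is the structure the BMQ name refers to: one list head per priority level plus a bitmap of non-empty levels, so picking the next task presumably amounts to "find the first set bit, take the head of that list". A self-contained toy model in user-space C (made-up types and names, only to illustrate the technique; the real code uses the kernel bitmap and list helpers):

#include <stdio.h>
#include <stdint.h>

#define NR_PRIO 64			/* toy: one 64-bit word covers all levels */

struct item {
	int prio;
	struct item *next;
};

static struct item *heads[NR_PRIO];
static uint64_t bitmap;			/* bit n set <=> heads[n] is non-empty */

static void enqueue(struct item *it)
{
	it->next = heads[it->prio];	/* simplified: push to the front */
	heads[it->prio] = it;
	bitmap |= 1ULL << it->prio;
}

static struct item *pick_next(void)
{
	if (!bitmap)
		return NULL;

	int prio = __builtin_ctzll(bitmap);	/* lowest set bit = highest priority */
	struct item *it = heads[prio];

	heads[prio] = it->next;
	if (!heads[prio])
		bitmap &= ~(1ULL << prio);
	return it;
}

int main(void)
{
	struct item a = { .prio = 10 }, b = { .prio = 3 };

	enqueue(&a);
	enqueue(&b);
	printf("picked prio %d\n", pick_next()->prio);	/* 3: lower index wins */
	return 0;
}
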
8529 ++
8530 ++/*
8531 ++ * This is the main, per-CPU runqueue data structure.
8532 ++ * This data should only be modified by the local cpu.
8533 ++ */
8534 ++struct rq {
8535 ++ /* runqueue lock: */
8536 ++ raw_spinlock_t lock;
8537 ++
8538 ++ struct task_struct __rcu *curr;
8539 ++ struct task_struct *idle, *stop, *skip;
8540 ++ struct mm_struct *prev_mm;
8541 ++
8542 ++ struct sched_queue queue;
8543 ++#ifdef CONFIG_SCHED_PDS
8544 ++ u64 time_edge;
8545 ++#endif
8546 ++ unsigned long watermark;
8547 ++
8548 ++ /* switch count */
8549 ++ u64 nr_switches;
8550 ++
8551 ++ atomic_t nr_iowait;
8552 ++
8553 ++#ifdef CONFIG_SCHED_DEBUG
8554 ++ u64 last_seen_need_resched_ns;
8555 ++ int ticks_without_resched;
8556 ++#endif
8557 ++
8558 ++#ifdef CONFIG_MEMBARRIER
8559 ++ int membarrier_state;
8560 ++#endif
8561 ++
8562 ++#ifdef CONFIG_SMP
8563 ++ int cpu; /* cpu of this runqueue */
8564 ++ bool online;
8565 ++
8566 ++ unsigned int ttwu_pending;
8567 ++ unsigned char nohz_idle_balance;
8568 ++ unsigned char idle_balance;
8569 ++
8570 ++#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
8571 ++ struct sched_avg avg_irq;
8572 ++#endif
8573 ++
8574 ++#ifdef CONFIG_SCHED_SMT
8575 ++ int active_balance;
8576 ++ struct cpu_stop_work active_balance_work;
8577 ++#endif
8578 ++ struct callback_head *balance_callback;
8579 ++#ifdef CONFIG_HOTPLUG_CPU
8580 ++ struct rcuwait hotplug_wait;
8581 ++#endif
8582 ++ unsigned int nr_pinned;
8583 ++
8584 ++#endif /* CONFIG_SMP */
8585 ++#ifdef CONFIG_IRQ_TIME_ACCOUNTING
8586 ++ u64 prev_irq_time;
8587 ++#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
8588 ++#ifdef CONFIG_PARAVIRT
8589 ++ u64 prev_steal_time;
8590 ++#endif /* CONFIG_PARAVIRT */
8591 ++#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
8592 ++ u64 prev_steal_time_rq;
8593 ++#endif /* CONFIG_PARAVIRT_TIME_ACCOUNTING */
8594 ++
8595 ++ /* For general CPU load util */
8596 ++ s32 load_history;
8597 ++ u64 load_block;
8598 ++ u64 load_stamp;
8599 ++
8600 ++ /* calc_load related fields */
8601 ++ unsigned long calc_load_update;
8602 ++ long calc_load_active;
8603 ++
8604 ++ u64 clock, last_tick;
8605 ++ u64 last_ts_switch;
8606 ++ u64 clock_task;
8607 ++
8608 ++ unsigned int nr_running;
8609 ++ unsigned long nr_uninterruptible;
8610 ++
8611 ++#ifdef CONFIG_SCHED_HRTICK
8612 ++#ifdef CONFIG_SMP
8613 ++ call_single_data_t hrtick_csd;
8614 ++#endif
8615 ++ struct hrtimer hrtick_timer;
8616 ++ ktime_t hrtick_time;
8617 ++#endif
8618 ++
8619 ++#ifdef CONFIG_SCHEDSTATS
8620 ++
8621 ++ /* latency stats */
8622 ++ struct sched_info rq_sched_info;
8623 ++ unsigned long long rq_cpu_time;
8624 ++ /* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */
8625 ++
8626 ++ /* sys_sched_yield() stats */
8627 ++ unsigned int yld_count;
8628 ++
8629 ++ /* schedule() stats */
8630 ++ unsigned int sched_switch;
8631 ++ unsigned int sched_count;
8632 ++ unsigned int sched_goidle;
8633 ++
8634 ++ /* try_to_wake_up() stats */
8635 ++ unsigned int ttwu_count;
8636 ++ unsigned int ttwu_local;
8637 ++#endif /* CONFIG_SCHEDSTATS */
8638 ++
8639 ++#ifdef CONFIG_CPU_IDLE
8640 ++ /* Must be inspected within a rcu lock section */
8641 ++ struct cpuidle_state *idle_state;
8642 ++#endif
8643 ++
8644 ++#ifdef CONFIG_NO_HZ_COMMON
8645 ++#ifdef CONFIG_SMP
8646 ++ call_single_data_t nohz_csd;
8647 ++#endif
8648 ++ atomic_t nohz_flags;
8649 ++#endif /* CONFIG_NO_HZ_COMMON */
8650 ++};
8651 ++
8652 ++extern unsigned long rq_load_util(struct rq *rq, unsigned long max);
8653 ++
8654 ++extern unsigned long calc_load_update;
8655 ++extern atomic_long_t calc_load_tasks;
8656 ++
8657 ++extern void calc_global_load_tick(struct rq *this_rq);
8658 ++extern long calc_load_fold_active(struct rq *this_rq, long adjust);
8659 ++
8660 ++DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
8661 ++#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))
8662 ++#define this_rq() this_cpu_ptr(&runqueues)
8663 ++#define task_rq(p) cpu_rq(task_cpu(p))
8664 ++#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
8665 ++#define raw_rq() raw_cpu_ptr(&runqueues)
8666 ++
8667 ++#ifdef CONFIG_SMP
8668 ++#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
8669 ++void register_sched_domain_sysctl(void);
8670 ++void unregister_sched_domain_sysctl(void);
8671 ++#else
8672 ++static inline void register_sched_domain_sysctl(void)
8673 ++{
8674 ++}
8675 ++static inline void unregister_sched_domain_sysctl(void)
8676 ++{
8677 ++}
8678 ++#endif
8679 ++
8680 ++extern bool sched_smp_initialized;
8681 ++
8682 ++enum {
8683 ++ ITSELF_LEVEL_SPACE_HOLDER,
8684 ++#ifdef CONFIG_SCHED_SMT
8685 ++ SMT_LEVEL_SPACE_HOLDER,
8686 ++#endif
8687 ++ COREGROUP_LEVEL_SPACE_HOLDER,
8688 ++ CORE_LEVEL_SPACE_HOLDER,
8689 ++ OTHER_LEVEL_SPACE_HOLDER,
8690 ++ NR_CPU_AFFINITY_LEVELS
8691 ++};
8692 ++
8693 ++DECLARE_PER_CPU(cpumask_t [NR_CPU_AFFINITY_LEVELS], sched_cpu_topo_masks);
8694 ++DECLARE_PER_CPU(cpumask_t *, sched_cpu_llc_mask);
8695 ++
8696 ++static inline int
8697 ++__best_mask_cpu(const cpumask_t *cpumask, const cpumask_t *mask)
8698 ++{
8699 ++ int cpu;
8700 ++
8701 ++ while ((cpu = cpumask_any_and(cpumask, mask)) >= nr_cpu_ids)
8702 ++ mask++;
8703 ++
8704 ++ return cpu;
8705 ++}
8706 ++
8707 ++static inline int best_mask_cpu(int cpu, const cpumask_t *mask)
8708 ++{
8709 ++ return __best_mask_cpu(mask, per_cpu(sched_cpu_topo_masks, cpu));
8710 ++}
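
__best_mask_cpu() above walks the per-CPU array of affinity masks from the nearest topology level outwards and returns the first allowed CPU it finds. A self-contained toy model with plain 64-bit masks (hypothetical level masks, purely illustrative):

#include <stdio.h>
#include <stdint.h>

/* toy levels for CPU 0: itself, SMT siblings, the LLC, everything else */
static const uint64_t topo_masks[] = { 0x01, 0x03, 0x0f, 0xff };

/* return the first CPU allowed by @allowed, preferring the nearest level */
static int toy_best_mask_cpu(uint64_t allowed)
{
	for (unsigned int lvl = 0; lvl < sizeof(topo_masks) / sizeof(topo_masks[0]); lvl++) {
		uint64_t hit = allowed & topo_masks[lvl];

		if (hit)
			return __builtin_ctzll(hit);
	}
	return -1;	/* no allowed CPU at any level */
}

int main(void)
{
	printf("%d\n", toy_best_mask_cpu(0x02));	/* CPU 1: found at the SMT level */
	printf("%d\n", toy_best_mask_cpu(0x30));	/* CPU 4: only found at the outermost level */
	return 0;
}
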
8711 ++
8712 ++extern void flush_smp_call_function_from_idle(void);
8713 ++
8714 ++#else /* !CONFIG_SMP */
8715 ++static inline void flush_smp_call_function_from_idle(void) { }
8716 ++#endif
8717 ++
8718 ++#ifndef arch_scale_freq_tick
8719 ++static __always_inline
8720 ++void arch_scale_freq_tick(void)
8721 ++{
8722 ++}
8723 ++#endif
8724 ++
8725 ++#ifndef arch_scale_freq_capacity
8726 ++static __always_inline
8727 ++unsigned long arch_scale_freq_capacity(int cpu)
8728 ++{
8729 ++ return SCHED_CAPACITY_SCALE;
8730 ++}
8731 ++#endif
8732 ++
8733 ++static inline u64 __rq_clock_broken(struct rq *rq)
8734 ++{
8735 ++ return READ_ONCE(rq->clock);
8736 ++}
8737 ++
8738 ++static inline u64 rq_clock(struct rq *rq)
8739 ++{
8740 ++ /*
8741 ++ * Relax lockdep_assert_held() checking: as in VRQ, a call to
8742 ++ * sched_info_xxxx() may not hold rq->lock
8743 ++ * lockdep_assert_held(&rq->lock);
8744 ++ */
8745 ++ return rq->clock;
8746 ++}
8747 ++
8748 ++static inline u64 rq_clock_task(struct rq *rq)
8749 ++{
8750 ++ /*
8751 ++ * Relax lockdep_assert_held() checking: as in VRQ, a call to
8752 ++ * sched_info_xxxx() may not hold rq->lock
8753 ++ * lockdep_assert_held(&rq->lock);
8754 ++ */
8755 ++ return rq->clock_task;
8756 ++}
8757 ++
8758 ++/*
8759 ++ * {de,en}queue flags:
8760 ++ *
8761 ++ * DEQUEUE_SLEEP - task is no longer runnable
8762 ++ * ENQUEUE_WAKEUP - task just became runnable
8763 ++ *
8764 ++ */
8765 ++
8766 ++#define DEQUEUE_SLEEP 0x01
8767 ++
8768 ++#define ENQUEUE_WAKEUP 0x01
8769 ++
8770 ++
8771 ++/*
8772 ++ * Below are the scheduler APIs used by other kernel code.
8773 ++ * They use a dummy rq_flags.
8774 ++ * TODO: BMQ needs to support these APIs for compatibility with mainline
8775 ++ * scheduler code.
8776 ++ */
8777 ++struct rq_flags {
8778 ++ unsigned long flags;
8779 ++};
8780 ++
8781 ++struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
8782 ++ __acquires(rq->lock);
8783 ++
8784 ++struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
8785 ++ __acquires(p->pi_lock)
8786 ++ __acquires(rq->lock);
8787 ++
8788 ++static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
8789 ++ __releases(rq->lock)
8790 ++{
8791 ++ raw_spin_unlock(&rq->lock);
8792 ++}
8793 ++
8794 ++static inline void
8795 ++task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
8796 ++ __releases(rq->lock)
8797 ++ __releases(p->pi_lock)
8798 ++{
8799 ++ raw_spin_unlock(&rq->lock);
8800 ++ raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
8801 ++}
8802 ++
8803 ++static inline void
8804 ++rq_lock(struct rq *rq, struct rq_flags *rf)
8805 ++ __acquires(rq->lock)
8806 ++{
8807 ++ raw_spin_lock(&rq->lock);
8808 ++}
8809 ++
8810 ++static inline void
8811 ++rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
8812 ++ __releases(rq->lock)
8813 ++{
8814 ++ raw_spin_unlock_irq(&rq->lock);
8815 ++}
8816 ++
8817 ++static inline void
8818 ++rq_unlock(struct rq *rq, struct rq_flags *rf)
8819 ++ __releases(rq->lock)
8820 ++{
8821 ++ raw_spin_unlock(&rq->lock);
8822 ++}
8823 ++
8824 ++static inline struct rq *
8825 ++this_rq_lock_irq(struct rq_flags *rf)
8826 ++ __acquires(rq->lock)
8827 ++{
8828 ++ struct rq *rq;
8829 ++
8830 ++ local_irq_disable();
8831 ++ rq = this_rq();
8832 ++ raw_spin_lock(&rq->lock);
8833 ++
8834 ++ return rq;
8835 ++}
8836 ++
8837 ++extern void raw_spin_rq_lock_nested(struct rq *rq, int subclass);
8838 ++extern void raw_spin_rq_unlock(struct rq *rq);
8839 ++
8840 ++static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
8841 ++{
8842 ++ return &rq->lock;
8843 ++}
8844 ++
8845 ++static inline raw_spinlock_t *rq_lockp(struct rq *rq)
8846 ++{
8847 ++ return __rq_lockp(rq);
8848 ++}
8849 ++
8850 ++static inline void raw_spin_rq_lock(struct rq *rq)
8851 ++{
8852 ++ raw_spin_rq_lock_nested(rq, 0);
8853 ++}
8854 ++
8855 ++static inline void raw_spin_rq_lock_irq(struct rq *rq)
8856 ++{
8857 ++ local_irq_disable();
8858 ++ raw_spin_rq_lock(rq);
8859 ++}
8860 ++
8861 ++static inline void raw_spin_rq_unlock_irq(struct rq *rq)
8862 ++{
8863 ++ raw_spin_rq_unlock(rq);
8864 ++ local_irq_enable();
8865 ++}
8866 ++
8867 ++static inline int task_current(struct rq *rq, struct task_struct *p)
8868 ++{
8869 ++ return rq->curr == p;
8870 ++}
8871 ++
8872 ++static inline bool task_running(struct task_struct *p)
8873 ++{
8874 ++ return p->on_cpu;
8875 ++}
8876 ++
8877 ++extern int task_running_nice(struct task_struct *p);
8878 ++
8879 ++extern struct static_key_false sched_schedstats;
8880 ++
8881 ++#ifdef CONFIG_CPU_IDLE
8882 ++static inline void idle_set_state(struct rq *rq,
8883 ++ struct cpuidle_state *idle_state)
8884 ++{
8885 ++ rq->idle_state = idle_state;
8886 ++}
8887 ++
8888 ++static inline struct cpuidle_state *idle_get_state(struct rq *rq)
8889 ++{
8890 ++ WARN_ON(!rcu_read_lock_held());
8891 ++ return rq->idle_state;
8892 ++}
8893 ++#else
8894 ++static inline void idle_set_state(struct rq *rq,
8895 ++ struct cpuidle_state *idle_state)
8896 ++{
8897 ++}
8898 ++
8899 ++static inline struct cpuidle_state *idle_get_state(struct rq *rq)
8900 ++{
8901 ++ return NULL;
8902 ++}
8903 ++#endif
8904 ++
8905 ++static inline int cpu_of(const struct rq *rq)
8906 ++{
8907 ++#ifdef CONFIG_SMP
8908 ++ return rq->cpu;
8909 ++#else
8910 ++ return 0;
8911 ++#endif
8912 ++}
8913 ++
8914 ++#include "stats.h"
8915 ++
8916 ++#ifdef CONFIG_NO_HZ_COMMON
8917 ++#define NOHZ_BALANCE_KICK_BIT 0
8918 ++#define NOHZ_STATS_KICK_BIT 1
8919 ++
8920 ++#define NOHZ_BALANCE_KICK BIT(NOHZ_BALANCE_KICK_BIT)
8921 ++#define NOHZ_STATS_KICK BIT(NOHZ_STATS_KICK_BIT)
8922 ++
8923 ++#define NOHZ_KICK_MASK (NOHZ_BALANCE_KICK | NOHZ_STATS_KICK)
8924 ++
8925 ++#define nohz_flags(cpu) (&cpu_rq(cpu)->nohz_flags)
8926 ++
8927 ++/* TODO: needed?
8928 ++extern void nohz_balance_exit_idle(struct rq *rq);
8929 ++#else
8930 ++static inline void nohz_balance_exit_idle(struct rq *rq) { }
8931 ++*/
8932 ++#endif
8933 ++
8934 ++#ifdef CONFIG_IRQ_TIME_ACCOUNTING
8935 ++struct irqtime {
8936 ++ u64 total;
8937 ++ u64 tick_delta;
8938 ++ u64 irq_start_time;
8939 ++ struct u64_stats_sync sync;
8940 ++};
8941 ++
8942 ++DECLARE_PER_CPU(struct irqtime, cpu_irqtime);
8943 ++
8944 ++/*
8945 ++ * Returns the irqtime minus the softirq time computed by ksoftirqd.
8946 ++ * Otherwise ksoftirqd's sum_exec_runtime would have its own runtime
8947 ++ * subtracted and never move forward.
8948 ++ */
8949 ++static inline u64 irq_time_read(int cpu)
8950 ++{
8951 ++ struct irqtime *irqtime = &per_cpu(cpu_irqtime, cpu);
8952 ++ unsigned int seq;
8953 ++ u64 total;
8954 ++
8955 ++ do {
8956 ++ seq = __u64_stats_fetch_begin(&irqtime->sync);
8957 ++ total = irqtime->total;
8958 ++ } while (__u64_stats_fetch_retry(&irqtime->sync, seq));
8959 ++
8960 ++ return total;
8961 ++}
8962 ++#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
8963 ++
8964 ++#ifdef CONFIG_CPU_FREQ
8965 ++DECLARE_PER_CPU(struct update_util_data __rcu *, cpufreq_update_util_data);
8966 ++#endif /* CONFIG_CPU_FREQ */
8967 ++
8968 ++#ifdef CONFIG_NO_HZ_FULL
8969 ++extern int __init sched_tick_offload_init(void);
8970 ++#else
8971 ++static inline int sched_tick_offload_init(void) { return 0; }
8972 ++#endif
8973 ++
8974 ++#ifdef arch_scale_freq_capacity
8975 ++#ifndef arch_scale_freq_invariant
8976 ++#define arch_scale_freq_invariant() (true)
8977 ++#endif
8978 ++#else /* arch_scale_freq_capacity */
8979 ++#define arch_scale_freq_invariant() (false)
8980 ++#endif
8981 ++
8982 ++extern void schedule_idle(void);
8983 ++
8984 ++#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
8985 ++
8986 ++/*
8987 ++ * !! For sched_setattr_nocheck() (kernel) only !!
8988 ++ *
8989 ++ * This is actually gross. :(
8990 ++ *
8991 ++ * It is used to make schedutil kworker(s) higher priority than SCHED_DEADLINE
8992 ++ * tasks, but still be able to sleep. We need this on platforms that cannot
8993 ++ * atomically change clock frequency. Remove once fast switching is
8994 ++ * available on such platforms.
8995 ++ *
8996 ++ * SUGOV stands for SchedUtil GOVernor.
8997 ++ */
8998 ++#define SCHED_FLAG_SUGOV 0x10000000
8999 ++
9000 ++#ifdef CONFIG_MEMBARRIER
9001 ++/*
9002 ++ * The scheduler provides memory barriers required by membarrier between:
9003 ++ * - prior user-space memory accesses and store to rq->membarrier_state,
9004 ++ * - store to rq->membarrier_state and following user-space memory accesses.
9005 ++ * In the same way it provides those guarantees around store to rq->curr.
9006 ++ */
9007 ++static inline void membarrier_switch_mm(struct rq *rq,
9008 ++ struct mm_struct *prev_mm,
9009 ++ struct mm_struct *next_mm)
9010 ++{
9011 ++ int membarrier_state;
9012 ++
9013 ++ if (prev_mm == next_mm)
9014 ++ return;
9015 ++
9016 ++ membarrier_state = atomic_read(&next_mm->membarrier_state);
9017 ++ if (READ_ONCE(rq->membarrier_state) == membarrier_state)
9018 ++ return;
9019 ++
9020 ++ WRITE_ONCE(rq->membarrier_state, membarrier_state);
9021 ++}
9022 ++#else
9023 ++static inline void membarrier_switch_mm(struct rq *rq,
9024 ++ struct mm_struct *prev_mm,
9025 ++ struct mm_struct *next_mm)
9026 ++{
9027 ++}
9028 ++#endif
9029 ++
9030 ++#ifdef CONFIG_NUMA
9031 ++extern int sched_numa_find_closest(const struct cpumask *cpus, int cpu);
9032 ++#else
9033 ++static inline int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
9034 ++{
9035 ++ return nr_cpu_ids;
9036 ++}
9037 ++#endif
9038 ++
9039 ++extern void swake_up_all_locked(struct swait_queue_head *q);
9040 ++extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
9041 ++
9042 ++#ifdef CONFIG_PREEMPT_DYNAMIC
9043 ++extern int preempt_dynamic_mode;
9044 ++extern int sched_dynamic_mode(const char *str);
9045 ++extern void sched_dynamic_update(int mode);
9046 ++#endif
9047 ++
9048 ++static inline void nohz_run_idle_balance(int cpu) { }
9049 ++#endif /* ALT_SCHED_H */
9050 +diff --git a/kernel/sched/bmq.h b/kernel/sched/bmq.h
9051 +new file mode 100644
9052 +index 000000000000..be3ee4a553ca
9053 +--- /dev/null
9054 ++++ b/kernel/sched/bmq.h
9055 +@@ -0,0 +1,111 @@
9056 ++#define ALT_SCHED_VERSION_MSG "sched/bmq: BMQ CPU Scheduler "ALT_SCHED_VERSION" by Alfred Chen.\n"
9057 ++
9058 ++/*
9059 ++ * BMQ only routines
9060 ++ */
9061 ++#define rq_switch_time(rq) ((rq)->clock - (rq)->last_ts_switch)
9062 ++#define boost_threshold(p) (sched_timeslice_ns >>\
9063 ++ (15 - MAX_PRIORITY_ADJ - (p)->boost_prio))
9064 ++
9065 ++static inline void boost_task(struct task_struct *p)
9066 ++{
9067 ++ int limit;
9068 ++
9069 ++ switch (p->policy) {
9070 ++ case SCHED_NORMAL:
9071 ++ limit = -MAX_PRIORITY_ADJ;
9072 ++ break;
9073 ++ case SCHED_BATCH:
9074 ++ case SCHED_IDLE:
9075 ++ limit = 0;
9076 ++ break;
9077 ++ default:
9078 ++ return;
9079 ++ }
9080 ++
9081 ++ if (p->boost_prio > limit)
9082 ++ p->boost_prio--;
9083 ++}
9084 ++
9085 ++static inline void deboost_task(struct task_struct *p)
9086 ++{
9087 ++ if (p->boost_prio < MAX_PRIORITY_ADJ)
9088 ++ p->boost_prio++;
9089 ++}
9090 ++
9091 ++/*
9092 ++ * Common interfaces
9093 ++ */
9094 ++static inline void sched_timeslice_imp(const int timeslice_ms) {}
9095 ++
9096 ++static inline int
9097 ++task_sched_prio_normal(const struct task_struct *p, const struct rq *rq)
9098 ++{
9099 ++ return p->prio + p->boost_prio - MAX_RT_PRIO;
9100 ++}
9101 ++
9102 ++static inline int task_sched_prio(const struct task_struct *p)
9103 ++{
9104 ++ return (p->prio < MAX_RT_PRIO)? p->prio : MAX_RT_PRIO / 2 + (p->prio + p->boost_prio) / 2;
9105 ++}
9106 ++
9107 ++static inline int
9108 ++task_sched_prio_idx(const struct task_struct *p, const struct rq *rq)
9109 ++{
9110 ++ return task_sched_prio(p);
9111 ++}
9112 ++
9113 ++static inline int sched_prio2idx(int prio, struct rq *rq)
9114 ++{
9115 ++ return prio;
9116 ++}
9117 ++
9118 ++static inline int sched_idx2prio(int idx, struct rq *rq)
9119 ++{
9120 ++ return idx;
9121 ++}
9122 ++
9123 ++static inline void time_slice_expired(struct task_struct *p, struct rq *rq)
9124 ++{
9125 ++ p->time_slice = sched_timeslice_ns;
9126 ++
9127 ++ if (SCHED_FIFO != p->policy && task_on_rq_queued(p)) {
9128 ++ if (SCHED_RR != p->policy)
9129 ++ deboost_task(p);
9130 ++ requeue_task(p, rq);
9131 ++ }
9132 ++}
9133 ++
9134 ++static inline void sched_task_sanity_check(struct task_struct *p, struct rq *rq) {}
9135 ++
9136 ++inline int task_running_nice(struct task_struct *p)
9137 ++{
9138 ++ return (p->prio + p->boost_prio > DEFAULT_PRIO + MAX_PRIORITY_ADJ);
9139 ++}
9140 ++
9141 ++static void sched_task_fork(struct task_struct *p, struct rq *rq)
9142 ++{
9143 ++ p->boost_prio = (p->boost_prio < 0) ?
9144 ++ p->boost_prio + MAX_PRIORITY_ADJ : MAX_PRIORITY_ADJ;
9145 ++}
9146 ++
9147 ++static inline void do_sched_yield_type_1(struct task_struct *p, struct rq *rq)
9148 ++{
9149 ++ p->boost_prio = MAX_PRIORITY_ADJ;
9150 ++}
9151 ++
9152 ++#ifdef CONFIG_SMP
9153 ++static inline void sched_task_ttwu(struct task_struct *p)
9154 ++{
9155 ++ if(this_rq()->clock_task - p->last_ran > sched_timeslice_ns)
9156 ++ boost_task(p);
9157 ++}
9158 ++#endif
9159 ++
9160 ++static inline void sched_task_deactivate(struct task_struct *p, struct rq *rq)
9161 ++{
9162 ++ if (rq_switch_time(rq) < boost_threshold(p))
9163 ++ boost_task(p);
9164 ++}
9165 ++
9166 ++static inline void update_rq_time_edge(struct rq *rq) {}
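To make the mapping above easier to follow, here is a minimal standalone sketch, not part of the patch, of how task_sched_prio() folds normal priorities into queue indices. It assumes the mainline constants MAX_RT_PRIO = 100 and a nice-to-prio offset of 120; boost_prio is passed as a plain parameter because MAX_PRIORITY_ADJ is defined elsewhere in the patch.

/* Standalone userspace sketch of BMQ's task_sched_prio() mapping; the
 * constants below are assumed from mainline, and boost_prio is taken as
 * a plain parameter instead of the patch's MAX_PRIORITY_ADJ machinery. */
#include <stdio.h>

#define MAX_RT_PRIO 100
#define NICE_TO_PRIO(nice) (MAX_RT_PRIO + (nice) + 20) /* 120 + nice */

static int bmq_sched_prio(int prio, int boost_prio)
{
    /* RT priorities pass through; normal priorities are compressed
     * into the upper half of the index range, shifted by boost. */
    return (prio < MAX_RT_PRIO) ? prio
                                : MAX_RT_PRIO / 2 + (prio + boost_prio) / 2;
}

int main(void)
{
    int nice;

    for (nice = -20; nice <= 19; nice += 13)
        printf("nice %3d -> queue index %d\n",
               nice, bmq_sched_prio(NICE_TO_PRIO(nice), 0));
    return 0;
}

With zero boost the whole nice range lands in indices 100 through 119 in this sketch.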
9167 +diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
9168 +index e7af18857371..3e38816b736e 100644
9169 +--- a/kernel/sched/cpufreq_schedutil.c
9170 ++++ b/kernel/sched/cpufreq_schedutil.c
9171 +@@ -167,9 +167,14 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
9172 + unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
9173 +
9174 + sg_cpu->max = max;
9175 ++#ifndef CONFIG_SCHED_ALT
9176 + sg_cpu->bw_dl = cpu_bw_dl(rq);
9177 + sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(rq), max,
9178 + FREQUENCY_UTIL, NULL);
9179 ++#else
9180 ++ sg_cpu->bw_dl = 0;
9181 ++ sg_cpu->util = rq_load_util(rq, max);
9182 ++#endif /* CONFIG_SCHED_ALT */
9183 + }
9184 +
9185 + /**
9186 +@@ -312,8 +317,10 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
9187 + */
9188 + static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu)
9189 + {
9190 ++#ifndef CONFIG_SCHED_ALT
9191 + if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
9192 + sg_cpu->sg_policy->limits_changed = true;
9193 ++#endif
9194 + }
9195 +
9196 + static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
9197 +@@ -607,6 +614,7 @@ static int sugov_kthread_create(struct sugov_policy *sg_policy)
9198 + }
9199 +
9200 + ret = sched_setattr_nocheck(thread, &attr);
9201 ++
9202 + if (ret) {
9203 + kthread_stop(thread);
9204 + pr_warn("%s: failed to set SCHED_DEADLINE\n", __func__);
9205 +@@ -839,7 +847,9 @@ cpufreq_governor_init(schedutil_gov);
9206 + #ifdef CONFIG_ENERGY_MODEL
9207 + static void rebuild_sd_workfn(struct work_struct *work)
9208 + {
9209 ++#ifndef CONFIG_SCHED_ALT
9210 + rebuild_sched_domains_energy();
9211 ++#endif /* CONFIG_SCHED_ALT */
9212 + }
9213 + static DECLARE_WORK(rebuild_sd_work, rebuild_sd_workfn);
9214 +
9215 +diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
9216 +index 872e481d5098..f920c8b48ec1 100644
9217 +--- a/kernel/sched/cputime.c
9218 ++++ b/kernel/sched/cputime.c
9219 +@@ -123,7 +123,7 @@ void account_user_time(struct task_struct *p, u64 cputime)
9220 + p->utime += cputime;
9221 + account_group_user_time(p, cputime);
9222 +
9223 +- index = (task_nice(p) > 0) ? CPUTIME_NICE : CPUTIME_USER;
9224 ++ index = task_running_nice(p) ? CPUTIME_NICE : CPUTIME_USER;
9225 +
9226 + /* Add user time to cpustat. */
9227 + task_group_account_field(p, index, cputime);
9228 +@@ -147,7 +147,7 @@ void account_guest_time(struct task_struct *p, u64 cputime)
9229 + p->gtime += cputime;
9230 +
9231 + /* Add guest time to cpustat. */
9232 +- if (task_nice(p) > 0) {
9233 ++ if (task_running_nice(p)) {
9234 + cpustat[CPUTIME_NICE] += cputime;
9235 + cpustat[CPUTIME_GUEST_NICE] += cputime;
9236 + } else {
9237 +@@ -270,7 +270,7 @@ static inline u64 account_other_time(u64 max)
9238 + #ifdef CONFIG_64BIT
9239 + static inline u64 read_sum_exec_runtime(struct task_struct *t)
9240 + {
9241 +- return t->se.sum_exec_runtime;
9242 ++ return tsk_seruntime(t);
9243 + }
9244 + #else
9245 + static u64 read_sum_exec_runtime(struct task_struct *t)
9246 +@@ -280,7 +280,7 @@ static u64 read_sum_exec_runtime(struct task_struct *t)
9247 + struct rq *rq;
9248 +
9249 + rq = task_rq_lock(t, &rf);
9250 +- ns = t->se.sum_exec_runtime;
9251 ++ ns = tsk_seruntime(t);
9252 + task_rq_unlock(rq, t, &rf);
9253 +
9254 + return ns;
9255 +@@ -612,7 +612,7 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
9256 + void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
9257 + {
9258 + struct task_cputime cputime = {
9259 +- .sum_exec_runtime = p->se.sum_exec_runtime,
9260 ++ .sum_exec_runtime = tsk_seruntime(p),
9261 + };
9262 +
9263 + task_cputime(p, &cputime.utime, &cputime.stime);
9264 +diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
9265 +index 17a653b67006..17ab2fe34d7a 100644
9266 +--- a/kernel/sched/debug.c
9267 ++++ b/kernel/sched/debug.c
9268 +@@ -8,6 +8,7 @@
9269 + */
9270 + #include "sched.h"
9271 +
9272 ++#ifndef CONFIG_SCHED_ALT
9273 + /*
9274 + * This allows printing both to /proc/sched_debug and
9275 + * to the console
9276 +@@ -216,6 +217,7 @@ static const struct file_operations sched_scaling_fops = {
9277 + };
9278 +
9279 + #endif /* SMP */
9280 ++#endif /* !CONFIG_SCHED_ALT */
9281 +
9282 + #ifdef CONFIG_PREEMPT_DYNAMIC
9283 +
9284 +@@ -279,6 +281,7 @@ static const struct file_operations sched_dynamic_fops = {
9285 +
9286 + #endif /* CONFIG_PREEMPT_DYNAMIC */
9287 +
9288 ++#ifndef CONFIG_SCHED_ALT
9289 + __read_mostly bool sched_debug_verbose;
9290 +
9291 + static const struct seq_operations sched_debug_sops;
9292 +@@ -294,6 +297,7 @@ static const struct file_operations sched_debug_fops = {
9293 + .llseek = seq_lseek,
9294 + .release = seq_release,
9295 + };
9296 ++#endif /* !CONFIG_SCHED_ALT */
9297 +
9298 + static struct dentry *debugfs_sched;
9299 +
9300 +@@ -303,12 +307,15 @@ static __init int sched_init_debug(void)
9301 +
9302 + debugfs_sched = debugfs_create_dir("sched", NULL);
9303 +
9304 ++#ifndef CONFIG_SCHED_ALT
9305 + debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
9306 + debugfs_create_bool("verbose", 0644, debugfs_sched, &sched_debug_verbose);
9307 ++#endif /* !CONFIG_SCHED_ALT */
9308 + #ifdef CONFIG_PREEMPT_DYNAMIC
9309 + debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
9310 + #endif
9311 +
9312 ++#ifndef CONFIG_SCHED_ALT
9313 + debugfs_create_u32("latency_ns", 0644, debugfs_sched, &sysctl_sched_latency);
9314 + debugfs_create_u32("min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_min_granularity);
9315 + debugfs_create_u32("wakeup_granularity_ns", 0644, debugfs_sched, &sysctl_sched_wakeup_granularity);
9316 +@@ -336,11 +343,13 @@ static __init int sched_init_debug(void)
9317 + #endif
9318 +
9319 + debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
9320 ++#endif /* !CONFIG_SCHED_ALT */
9321 +
9322 + return 0;
9323 + }
9324 + late_initcall(sched_init_debug);
9325 +
9326 ++#ifndef CONFIG_SCHED_ALT
9327 + #ifdef CONFIG_SMP
9328 +
9329 + static cpumask_var_t sd_sysctl_cpus;
9330 +@@ -1063,6 +1072,7 @@ void proc_sched_set_task(struct task_struct *p)
9331 + memset(&p->se.statistics, 0, sizeof(p->se.statistics));
9332 + #endif
9333 + }
9334 ++#endif /* !CONFIG_SCHED_ALT */
9335 +
9336 + void resched_latency_warn(int cpu, u64 latency)
9337 + {
9338 +diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
9339 +index d17b0a5ce6ac..6ff77fc6b73a 100644
9340 +--- a/kernel/sched/idle.c
9341 ++++ b/kernel/sched/idle.c
9342 +@@ -403,6 +403,7 @@ void cpu_startup_entry(enum cpuhp_state state)
9343 + do_idle();
9344 + }
9345 +
9346 ++#ifndef CONFIG_SCHED_ALT
9347 + /*
9348 + * idle-task scheduling class.
9349 + */
9350 +@@ -525,3 +526,4 @@ DEFINE_SCHED_CLASS(idle) = {
9351 + .switched_to = switched_to_idle,
9352 + .update_curr = update_curr_idle,
9353 + };
9354 ++#endif
9355 +diff --git a/kernel/sched/pds.h b/kernel/sched/pds.h
9356 +new file mode 100644
9357 +index 000000000000..0f1f0d708b77
9358 +--- /dev/null
9359 ++++ b/kernel/sched/pds.h
9360 +@@ -0,0 +1,127 @@
9361 ++#define ALT_SCHED_VERSION_MSG "sched/pds: PDS CPU Scheduler "ALT_SCHED_VERSION" by Alfred Chen.\n"
9362 ++
9363 ++static int sched_timeslice_shift = 22;
9364 ++
9365 ++#define NORMAL_PRIO_MOD(x) ((x) & (NORMAL_PRIO_NUM - 1))
9366 ++
9367 ++/*
9368 ++ * Common interfaces
9369 ++ */
9370 ++static inline void sched_timeslice_imp(const int timeslice_ms)
9371 ++{
9372 ++ if (2 == timeslice_ms)
9373 ++ sched_timeslice_shift = 21;
9374 ++}
9375 ++
9376 ++static inline int
9377 ++task_sched_prio_normal(const struct task_struct *p, const struct rq *rq)
9378 ++{
9379 ++ s64 delta = p->deadline - rq->time_edge + NORMAL_PRIO_NUM - NICE_WIDTH;
9380 ++
9381 ++ if (WARN_ONCE(delta > NORMAL_PRIO_NUM - 1,
9382 ++ "pds: task_sched_prio_normal() delta %lld\n", delta))
9383 ++ return NORMAL_PRIO_NUM - 1;
9384 ++
9385 ++ return (delta < 0) ? 0 : delta;
9386 ++}
9387 ++
9388 ++static inline int task_sched_prio(const struct task_struct *p)
9389 ++{
9390 ++ return (p->prio < MAX_RT_PRIO) ? p->prio :
9391 ++ MIN_NORMAL_PRIO + task_sched_prio_normal(p, task_rq(p));
9392 ++}
9393 ++
9394 ++static inline int
9395 ++task_sched_prio_idx(const struct task_struct *p, const struct rq *rq)
9396 ++{
9397 ++ return (p->prio < MAX_RT_PRIO) ? p->prio : MIN_NORMAL_PRIO +
9398 ++ NORMAL_PRIO_MOD(task_sched_prio_normal(p, rq) + rq->time_edge);
9399 ++}
9400 ++
9401 ++static inline int sched_prio2idx(int prio, struct rq *rq)
9402 ++{
9403 ++ return (IDLE_TASK_SCHED_PRIO == prio || prio < MAX_RT_PRIO) ? prio :
9404 ++ MIN_NORMAL_PRIO + NORMAL_PRIO_MOD((prio - MIN_NORMAL_PRIO) +
9405 ++ rq->time_edge);
9406 ++}
9407 ++
9408 ++static inline int sched_idx2prio(int idx, struct rq *rq)
9409 ++{
9410 ++ return (idx < MAX_RT_PRIO) ? idx : MIN_NORMAL_PRIO +
9411 ++ NORMAL_PRIO_MOD((idx - MIN_NORMAL_PRIO) + NORMAL_PRIO_NUM -
9412 ++ NORMAL_PRIO_MOD(rq->time_edge));
9413 ++}
9414 ++
9415 ++static inline void sched_renew_deadline(struct task_struct *p, const struct rq *rq)
9416 ++{
9417 ++ if (p->prio >= MAX_RT_PRIO)
9418 ++ p->deadline = (rq->clock >> sched_timeslice_shift) +
9419 ++ p->static_prio - (MAX_PRIO - NICE_WIDTH);
9420 ++}
9421 ++
9422 ++int task_running_nice(struct task_struct *p)
9423 ++{
9424 ++ return (p->prio > DEFAULT_PRIO);
9425 ++}
9426 ++
9427 ++static inline void update_rq_time_edge(struct rq *rq)
9428 ++{
9429 ++ struct list_head head;
9430 ++ u64 old = rq->time_edge;
9431 ++ u64 now = rq->clock >> sched_timeslice_shift;
9432 ++ u64 prio, delta;
9433 ++
9434 ++ if (now == old)
9435 ++ return;
9436 ++
9437 ++ delta = min_t(u64, NORMAL_PRIO_NUM, now - old);
9438 ++ INIT_LIST_HEAD(&head);
9439 ++
9440 ++ for_each_set_bit(prio, &rq->queue.bitmap[2], delta)
9441 ++ list_splice_tail_init(rq->queue.heads + MIN_NORMAL_PRIO +
9442 ++ NORMAL_PRIO_MOD(prio + old), &head);
9443 ++
9444 ++ rq->queue.bitmap[2] = (NORMAL_PRIO_NUM == delta) ? 0UL :
9445 ++ rq->queue.bitmap[2] >> delta;
9446 ++ rq->time_edge = now;
9447 ++ if (!list_empty(&head)) {
9448 ++ u64 idx = MIN_NORMAL_PRIO + NORMAL_PRIO_MOD(now);
9449 ++ struct task_struct *p;
9450 ++
9451 ++ list_for_each_entry(p, &head, sq_node)
9452 ++ p->sq_idx = idx;
9453 ++
9454 ++ list_splice(&head, rq->queue.heads + idx);
9455 ++ rq->queue.bitmap[2] |= 1UL;
9456 ++ }
9457 ++}
9458 ++
9459 ++static inline void time_slice_expired(struct task_struct *p, struct rq *rq)
9460 ++{
9461 ++ p->time_slice = sched_timeslice_ns;
9462 ++ sched_renew_deadline(p, rq);
9463 ++ if (SCHED_FIFO != p->policy && task_on_rq_queued(p))
9464 ++ requeue_task(p, rq);
9465 ++}
9466 ++
9467 ++static inline void sched_task_sanity_check(struct task_struct *p, struct rq *rq)
9468 ++{
9469 ++ u64 max_dl = rq->time_edge + NICE_WIDTH - 1;
9470 ++ if (unlikely(p->deadline > max_dl))
9471 ++ p->deadline = max_dl;
9472 ++}
9473 ++
9474 ++static void sched_task_fork(struct task_struct *p, struct rq *rq)
9475 ++{
9476 ++ sched_renew_deadline(p, rq);
9477 ++}
9478 ++
9479 ++static inline void do_sched_yield_type_1(struct task_struct *p, struct rq *rq)
9480 ++{
9481 ++ time_slice_expired(p, rq);
9482 ++}
9483 ++
9484 ++#ifdef CONFIG_SMP
9485 ++static inline void sched_task_ttwu(struct task_struct *p) {}
9486 ++#endif
9487 ++static inline void sched_task_deactivate(struct task_struct *p, struct rq *rq) {}
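The effect of sched_timeslice_shift above is easier to see with concrete numbers: 2^22 ns is roughly 4.19 ms and 2^21 ns roughly 2.10 ms, so the deadline granularity follows the chosen timeslice. A tiny standalone sketch, not part of the patch, assuming only those two shift values:

/* Standalone sketch of the PDS deadline granularity implied by
 * sched_timeslice_shift; shift values assumed from the hunk above. */
#include <stdio.h>

int main(void)
{
    unsigned long long gran_4ms = 1ULL << 22; /* shift used for the 4 ms timeslice */
    unsigned long long gran_2ms = 1ULL << 21; /* shift used for the 2 ms timeslice */

    printf("shift 22 -> %llu ns (~%.2f ms) per deadline step\n",
           gran_4ms, gran_4ms / 1e6);
    printf("shift 21 -> %llu ns (~%.2f ms) per deadline step\n",
           gran_2ms, gran_2ms / 1e6);
    return 0;
}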
9488 +diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
9489 +index a554e3bbab2b..3e56f5e6ff5c 100644
9490 +--- a/kernel/sched/pelt.c
9491 ++++ b/kernel/sched/pelt.c
9492 +@@ -270,6 +270,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
9493 + WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
9494 + }
9495 +
9496 ++#ifndef CONFIG_SCHED_ALT
9497 + /*
9498 + * sched_entity:
9499 + *
9500 +@@ -387,8 +388,9 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
9501 +
9502 + return 0;
9503 + }
9504 ++#endif
9505 +
9506 +-#ifdef CONFIG_SCHED_THERMAL_PRESSURE
9507 ++#if defined(CONFIG_SCHED_THERMAL_PRESSURE) && !defined(CONFIG_SCHED_ALT)
9508 + /*
9509 + * thermal:
9510 + *
9511 +diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
9512 +index e06071bf3472..adf567df34d4 100644
9513 +--- a/kernel/sched/pelt.h
9514 ++++ b/kernel/sched/pelt.h
9515 +@@ -1,13 +1,15 @@
9516 + #ifdef CONFIG_SMP
9517 + #include "sched-pelt.h"
9518 +
9519 ++#ifndef CONFIG_SCHED_ALT
9520 + int __update_load_avg_blocked_se(u64 now, struct sched_entity *se);
9521 + int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se);
9522 + int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
9523 + int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
9524 + int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
9525 ++#endif
9526 +
9527 +-#ifdef CONFIG_SCHED_THERMAL_PRESSURE
9528 ++#if defined(CONFIG_SCHED_THERMAL_PRESSURE) && !defined(CONFIG_SCHED_ALT)
9529 + int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity);
9530 +
9531 + static inline u64 thermal_load_avg(struct rq *rq)
9532 +@@ -42,6 +44,7 @@ static inline u32 get_pelt_divider(struct sched_avg *avg)
9533 + return LOAD_AVG_MAX - 1024 + avg->period_contrib;
9534 + }
9535 +
9536 ++#ifndef CONFIG_SCHED_ALT
9537 + static inline void cfs_se_util_change(struct sched_avg *avg)
9538 + {
9539 + unsigned int enqueued;
9540 +@@ -153,9 +156,11 @@ static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
9541 + return rq_clock_pelt(rq_of(cfs_rq));
9542 + }
9543 + #endif
9544 ++#endif /* CONFIG_SCHED_ALT */
9545 +
9546 + #else
9547 +
9548 ++#ifndef CONFIG_SCHED_ALT
9549 + static inline int
9550 + update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
9551 + {
9552 +@@ -173,6 +178,7 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
9553 + {
9554 + return 0;
9555 + }
9556 ++#endif
9557 +
9558 + static inline int
9559 + update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
9560 +diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
9561 +index 3d3e5793e117..c1d976ef623f 100644
9562 +--- a/kernel/sched/sched.h
9563 ++++ b/kernel/sched/sched.h
9564 +@@ -2,6 +2,10 @@
9565 + /*
9566 + * Scheduler internal types and methods:
9567 + */
9568 ++#ifdef CONFIG_SCHED_ALT
9569 ++#include "alt_sched.h"
9570 ++#else
9571 ++
9572 + #include <linux/sched.h>
9573 +
9574 + #include <linux/sched/autogroup.h>
9575 +@@ -3064,3 +3068,8 @@ extern int sched_dynamic_mode(const char *str);
9576 + extern void sched_dynamic_update(int mode);
9577 + #endif
9578 +
9579 ++static inline int task_running_nice(struct task_struct *p)
9580 ++{
9581 ++ return (task_nice(p) > 0);
9582 ++}
9583 ++#endif /* !CONFIG_SCHED_ALT */
9584 +diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
9585 +index 3f93fc3b5648..528b71e144e9 100644
9586 +--- a/kernel/sched/stats.c
9587 ++++ b/kernel/sched/stats.c
9588 +@@ -22,8 +22,10 @@ static int show_schedstat(struct seq_file *seq, void *v)
9589 + } else {
9590 + struct rq *rq;
9591 + #ifdef CONFIG_SMP
9592 ++#ifndef CONFIG_SCHED_ALT
9593 + struct sched_domain *sd;
9594 + int dcount = 0;
9595 ++#endif
9596 + #endif
9597 + cpu = (unsigned long)(v - 2);
9598 + rq = cpu_rq(cpu);
9599 +@@ -40,6 +42,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
9600 + seq_printf(seq, "\n");
9601 +
9602 + #ifdef CONFIG_SMP
9603 ++#ifndef CONFIG_SCHED_ALT
9604 + /* domain-specific stats */
9605 + rcu_read_lock();
9606 + for_each_domain(cpu, sd) {
9607 +@@ -68,6 +71,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
9608 + sd->ttwu_move_balance);
9609 + }
9610 + rcu_read_unlock();
9611 ++#endif
9612 + #endif
9613 + }
9614 + return 0;
9615 +diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
9616 +index 4e8698e62f07..36c61551252e 100644
9617 +--- a/kernel/sched/topology.c
9618 ++++ b/kernel/sched/topology.c
9619 +@@ -4,6 +4,7 @@
9620 + */
9621 + #include "sched.h"
9622 +
9623 ++#ifndef CONFIG_SCHED_ALT
9624 + DEFINE_MUTEX(sched_domains_mutex);
9625 +
9626 + /* Protected by sched_domains_mutex: */
9627 +@@ -1382,8 +1383,10 @@ static void asym_cpu_capacity_scan(void)
9628 + */
9629 +
9630 + static int default_relax_domain_level = -1;
9631 ++#endif /* CONFIG_SCHED_ALT */
9632 + int sched_domain_level_max;
9633 +
9634 ++#ifndef CONFIG_SCHED_ALT
9635 + static int __init setup_relax_domain_level(char *str)
9636 + {
9637 + if (kstrtoint(str, 0, &default_relax_domain_level))
9638 +@@ -1619,6 +1622,7 @@ sd_init(struct sched_domain_topology_level *tl,
9639 +
9640 + return sd;
9641 + }
9642 ++#endif /* CONFIG_SCHED_ALT */
9643 +
9644 + /*
9645 + * Topology list, bottom-up.
9646 +@@ -1648,6 +1652,7 @@ void set_sched_topology(struct sched_domain_topology_level *tl)
9647 + sched_domain_topology = tl;
9648 + }
9649 +
9650 ++#ifndef CONFIG_SCHED_ALT
9651 + #ifdef CONFIG_NUMA
9652 +
9653 + static const struct cpumask *sd_numa_mask(int cpu)
9654 +@@ -2516,3 +2521,17 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
9655 + partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
9656 + mutex_unlock(&sched_domains_mutex);
9657 + }
9658 ++#else /* CONFIG_SCHED_ALT */
9659 ++void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
9660 ++ struct sched_domain_attr *dattr_new)
9661 ++{}
9662 ++
9663 ++#ifdef CONFIG_NUMA
9664 ++int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
9665 ++
9666 ++int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
9667 ++{
9668 ++ return best_mask_cpu(cpu, cpus);
9669 ++}
9670 ++#endif /* CONFIG_NUMA */
9671 ++#endif
9672 +diff --git a/kernel/sysctl.c b/kernel/sysctl.c
9673 +index 083be6af29d7..09fc6281d488 100644
9674 +--- a/kernel/sysctl.c
9675 ++++ b/kernel/sysctl.c
9676 +@@ -122,6 +122,10 @@ static unsigned long long_max = LONG_MAX;
9677 + static int one_hundred = 100;
9678 + static int two_hundred = 200;
9679 + static int one_thousand = 1000;
9680 ++#ifdef CONFIG_SCHED_ALT
9681 ++static int __maybe_unused zero = 0;
9682 ++extern int sched_yield_type;
9683 ++#endif
9684 + #ifdef CONFIG_PRINTK
9685 + static int ten_thousand = 10000;
9686 + #endif
9687 +@@ -1771,6 +1775,24 @@ int proc_do_static_key(struct ctl_table *table, int write,
9688 + }
9689 +
9690 + static struct ctl_table kern_table[] = {
9691 ++#ifdef CONFIG_SCHED_ALT
9692 ++/* In ALT, only "sched_schedstats" is supported */
9693 ++#ifdef CONFIG_SCHED_DEBUG
9694 ++#ifdef CONFIG_SMP
9695 ++#ifdef CONFIG_SCHEDSTATS
9696 ++ {
9697 ++ .procname = "sched_schedstats",
9698 ++ .data = NULL,
9699 ++ .maxlen = sizeof(unsigned int),
9700 ++ .mode = 0644,
9701 ++ .proc_handler = sysctl_schedstats,
9702 ++ .extra1 = SYSCTL_ZERO,
9703 ++ .extra2 = SYSCTL_ONE,
9704 ++ },
9705 ++#endif /* CONFIG_SCHEDSTATS */
9706 ++#endif /* CONFIG_SMP */
9707 ++#endif /* CONFIG_SCHED_DEBUG */
9708 ++#else /* !CONFIG_SCHED_ALT */
9709 + {
9710 + .procname = "sched_child_runs_first",
9711 + .data = &sysctl_sched_child_runs_first,
9712 +@@ -1901,6 +1923,7 @@ static struct ctl_table kern_table[] = {
9713 + .extra2 = SYSCTL_ONE,
9714 + },
9715 + #endif
9716 ++#endif /* !CONFIG_SCHED_ALT */
9717 + #ifdef CONFIG_PROVE_LOCKING
9718 + {
9719 + .procname = "prove_locking",
9720 +@@ -2477,6 +2500,17 @@ static struct ctl_table kern_table[] = {
9721 + .proc_handler = proc_dointvec,
9722 + },
9723 + #endif
9724 ++#ifdef CONFIG_SCHED_ALT
9725 ++ {
9726 ++ .procname = "yield_type",
9727 ++ .data = &sched_yield_type,
9728 ++ .maxlen = sizeof (int),
9729 ++ .mode = 0644,
9730 ++ .proc_handler = &proc_dointvec_minmax,
9731 ++ .extra1 = &zero,
9732 ++ .extra2 = &two,
9733 ++ },
9734 ++#endif
9735 + #if defined(CONFIG_S390) && defined(CONFIG_SMP)
9736 + {
9737 + .procname = "spin_retry",
9738 +diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
9739 +index 0ea8702eb516..a27a0f3a654d 100644
9740 +--- a/kernel/time/hrtimer.c
9741 ++++ b/kernel/time/hrtimer.c
9742 +@@ -2088,8 +2088,10 @@ long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode,
9743 + int ret = 0;
9744 + u64 slack;
9745 +
9746 ++#ifndef CONFIG_SCHED_ALT
9747 + slack = current->timer_slack_ns;
9748 + if (dl_task(current) || rt_task(current))
9749 ++#endif
9750 + slack = 0;
9751 +
9752 + hrtimer_init_sleeper_on_stack(&t, clockid, mode);
9753 +diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
9754 +index 643d412ac623..6bf27565242f 100644
9755 +--- a/kernel/time/posix-cpu-timers.c
9756 ++++ b/kernel/time/posix-cpu-timers.c
9757 +@@ -216,7 +216,7 @@ static void task_sample_cputime(struct task_struct *p, u64 *samples)
9758 + u64 stime, utime;
9759 +
9760 + task_cputime(p, &utime, &stime);
9761 +- store_samples(samples, stime, utime, p->se.sum_exec_runtime);
9762 ++ store_samples(samples, stime, utime, tsk_seruntime(p));
9763 + }
9764 +
9765 + static void proc_sample_cputime_atomic(struct task_cputime_atomic *at,
9766 +@@ -859,6 +859,7 @@ static void collect_posix_cputimers(struct posix_cputimers *pct, u64 *samples,
9767 + }
9768 + }
9769 +
9770 ++#ifndef CONFIG_SCHED_ALT
9771 + static inline void check_dl_overrun(struct task_struct *tsk)
9772 + {
9773 + if (tsk->dl.dl_overrun) {
9774 +@@ -866,6 +867,7 @@ static inline void check_dl_overrun(struct task_struct *tsk)
9775 + __group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
9776 + }
9777 + }
9778 ++#endif
9779 +
9780 + static bool check_rlimit(u64 time, u64 limit, int signo, bool rt, bool hard)
9781 + {
9782 +@@ -893,8 +895,10 @@ static void check_thread_timers(struct task_struct *tsk,
9783 + u64 samples[CPUCLOCK_MAX];
9784 + unsigned long soft;
9785 +
9786 ++#ifndef CONFIG_SCHED_ALT
9787 + if (dl_task(tsk))
9788 + check_dl_overrun(tsk);
9789 ++#endif
9790 +
9791 + if (expiry_cache_is_inactive(pct))
9792 + return;
9793 +@@ -908,7 +912,7 @@ static void check_thread_timers(struct task_struct *tsk,
9794 + soft = task_rlimit(tsk, RLIMIT_RTTIME);
9795 + if (soft != RLIM_INFINITY) {
9796 + /* Task RT timeout is accounted in jiffies. RTTIME is usec */
9797 +- unsigned long rttime = tsk->rt.timeout * (USEC_PER_SEC / HZ);
9798 ++ unsigned long rttime = tsk_rttimeout(tsk) * (USEC_PER_SEC / HZ);
9799 + unsigned long hard = task_rlimit_max(tsk, RLIMIT_RTTIME);
9800 +
9801 + /* At the hard limit, send SIGKILL. No further action. */
9802 +@@ -1144,8 +1148,10 @@ static inline bool fastpath_timer_check(struct task_struct *tsk)
9803 + return true;
9804 + }
9805 +
9806 ++#ifndef CONFIG_SCHED_ALT
9807 + if (dl_task(tsk) && tsk->dl.dl_overrun)
9808 + return true;
9809 ++#endif
9810 +
9811 + return false;
9812 + }
9813 +diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
9814 +index adf7ef194005..11c8f36e281b 100644
9815 +--- a/kernel/trace/trace_selftest.c
9816 ++++ b/kernel/trace/trace_selftest.c
9817 +@@ -1052,10 +1052,15 @@ static int trace_wakeup_test_thread(void *data)
9818 + {
9819 + /* Make this a -deadline thread */
9820 + static const struct sched_attr attr = {
9821 ++#ifdef CONFIG_SCHED_ALT
9822 ++ /* No deadline on BMQ/PDS, use RR */
9823 ++ .sched_policy = SCHED_RR,
9824 ++#else
9825 + .sched_policy = SCHED_DEADLINE,
9826 + .sched_runtime = 100000ULL,
9827 + .sched_deadline = 10000000ULL,
9828 + .sched_period = 10000000ULL
9829 ++#endif
9830 + };
9831 + struct wakeup_test_data *x = data;
9832 +
9833
9834 diff --git a/5021_BMQ-and-PDS-gentoo-defaults.patch b/5021_BMQ-and-PDS-gentoo-defaults.patch
9835 new file mode 100644
9836 index 0000000..d449eec
9837 --- /dev/null
9838 +++ b/5021_BMQ-and-PDS-gentoo-defaults.patch
9839 @@ -0,0 +1,13 @@
9840 +--- a/init/Kconfig 2021-04-27 07:38:30.556467045 -0400
9841 ++++ b/init/Kconfig 2021-04-27 07:39:32.956412800 -0400
9842 +@@ -780,8 +780,9 @@ config GENERIC_SCHED_CLOCK
9843 + menu "Scheduler features"
9844 +
9845 + menuconfig SCHED_ALT
9846 ++ depends on X86_64
9847 + bool "Alternative CPU Schedulers"
9848 +- default y
9849 ++ default n
9850 + help
9851 + This feature enables the alternative CPU schedulers.
9852 +
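Because this defaults patch turns SCHED_ALT off and limits it to X86_64, the new scheduler is strictly opt-in at kernel configuration time. A possible .config fragment is shown below; the SCHED_BMQ and SCHED_PDS symbol names are assumed from the upstream Project C patch and do not appear in this hunk.

CONFIG_SCHED_ALT=y
CONFIG_SCHED_BMQ=y
# CONFIG_SCHED_PDS is not set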