Gentoo Archives: gentoo-commits

From: Mike Pagano <mpagano@g.o>
To: gentoo-commits@l.g.o
Subject: [gentoo-commits] proj/linux-patches:5.14 commit in: /
Date: Sun, 21 Nov 2021 21:14:59
Message-Id: 1637529266.8077ca8990e6d4e9b0db60ec1e302f0699ba8d20.mpagano@gentoo
1 commit: 8077ca8990e6d4e9b0db60ec1e302f0699ba8d20
2 Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
3 AuthorDate: Sun Nov 21 21:14:26 2021 +0000
4 Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
5 CommitDate: Sun Nov 21 21:14:26 2021 +0000
6 URL: https://gitweb.gentoo.org/proj/linux-patches.git/commit/?id=8077ca89
7
8 Remove BMQ, will add back with fixed patch
9
10 Signed-off-by: Mike Pagano <mpagano <AT> gentoo.org>
11
12 0000_README | 8 -
13 5020_BMQ-and-PDS-io-scheduler-v5.14-r3.patch | 9523 --------------------------
14 5021_BMQ-and-PDS-gentoo-defaults.patch | 13 -
15 3 files changed, 9544 deletions(-)
16
17 diff --git a/0000_README b/0000_README
18 index e8f44666..35f55e4e 100644
19 --- a/0000_README
20 +++ b/0000_README
21 @@ -166,11 +166,3 @@ Desc: UID/GID shifting overlay filesystem for containers
22 Patch: 5010_enable-cpu-optimizations-universal.patch
23 From: https://github.com/graysky2/kernel_compiler_patch
24 Desc: Kernel >= 5.8 patch enables gcc = v9+ optimizations for additional CPUs.
25 -
26 -Patch: 5020_BMQ-and-PDS-io-scheduler-v5.14-r3.patch
27 -From: https://gitlab.com/alfredchen/linux-prjc
28 -Desc: BMQ(BitMap Queue) Scheduler. A new CPU scheduler developed from PDS(incld). Inspired by the scheduler in zircon.
29 -
30 -Patch: 5021_BMQ-and-PDS-gentoo-defaults.patch
31 -From: https://gitweb.gentoo.org/proj/linux-patches.git/
32 -Desc: Set defaults for BMQ. Add archs as people test, default to N
33
34 diff --git a/5020_BMQ-and-PDS-io-scheduler-v5.14-r3.patch b/5020_BMQ-and-PDS-io-scheduler-v5.14-r3.patch
35 deleted file mode 100644
36 index cf68d7ea..00000000
37 --- a/5020_BMQ-and-PDS-io-scheduler-v5.14-r3.patch
38 +++ /dev/null
39 @@ -1,9523 +0,0 @@
40 -diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
41 -index bdb22006f713..d755d7df632f 100644
42 ---- a/Documentation/admin-guide/kernel-parameters.txt
43 -+++ b/Documentation/admin-guide/kernel-parameters.txt
44 -@@ -4947,6 +4947,12 @@
45 -
46 - sbni= [NET] Granch SBNI12 leased line adapter
47 -
48 -+ sched_timeslice=
49 -+ [KNL] Time slice in ms for Project C BMQ/PDS scheduler.
50 -+ Format: integer 2, 4
51 -+ Default: 4
52 -+ See Documentation/scheduler/sched-BMQ.txt
53 -+
54 - sched_verbose [KNL] Enables verbose scheduler debug messages.
55 -
56 - schedstats= [KNL,X86] Enable or disable scheduled statistics.
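[ Illustration, not part of the diff: the sched_timeslice= parameter above only accepts 2 ms as an alternative to the 4 ms default. Below is a minimal, self-contained C sketch of that validation and of the ms-to-ns conversion; the function name is an assumption, while the "<< 20" approximation mirrors the patch's own sched_timeslice() early_param handler in kernel/sched/alt_core.c further down in this diff. ]

#include <stdint.h>

/* Sketch only: accept 2, otherwise fall back to the 4 ms default,
 * then store the slice in (approximate) nanoseconds. */
static uint64_t parse_sched_timeslice(int timeslice_ms)
{
	if (timeslice_ms != 2)			/* Format: integer 2, 4 */
		timeslice_ms = 4;		/* Default: 4 */
	return (uint64_t)timeslice_ms << 20;	/* ms << 20 ~= ms * 1e6 ns */
}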
57 -diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
58 -index 426162009ce9..15ac2d7e47cd 100644
59 ---- a/Documentation/admin-guide/sysctl/kernel.rst
60 -+++ b/Documentation/admin-guide/sysctl/kernel.rst
61 -@@ -1542,3 +1542,13 @@ is 10 seconds.
62 -
63 - The softlockup threshold is (``2 * watchdog_thresh``). Setting this
64 - tunable to zero will disable lockup detection altogether.
65 -+
66 -+yield_type:
67 -+===========
68 -+
69 -+BMQ/PDS CPU scheduler only. This determines what type of yield a call
70 -+to sched_yield() will perform.
71 -+
72 -+ 0 - No yield.
73 -+ 1 - Deboost and requeue task. (default)
74 -+ 2 - Set run queue skip task.
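[ Illustration, not part of the diff: a minimal, self-contained C sketch of the three yield_type behaviours documented above. All names below are assumptions made for illustration; the patch's real implementation lives in kernel/sched/alt_core.c and acts on its own rq/task structures. ]

/* 0: no yield, 1: deboost and requeue the task, 2: set the run queue's
 * skip task so pick-next passes over the caller once. */
struct yield_task { int boost; int requeued_at_tail; int rq_skip; };

static int sched_yield_type = 1;	/* default, as documented above */

static void yield_sketch(struct yield_task *p)
{
	switch (sched_yield_type) {
	case 0:				/* no yield: leave the task untouched */
		break;
	case 1:				/* deboost (raise the boost value) ... */
		p->boost += 1;
		p->requeued_at_tail = 1; /* ... and requeue at the queue tail */
		break;
	case 2:				/* mark it so pick-next skips it once */
		p->rq_skip = 1;
		break;
	}
}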
75 -diff --git a/Documentation/scheduler/sched-BMQ.txt b/Documentation/scheduler/sched-BMQ.txt
76 -new file mode 100644
77 -index 000000000000..05c84eec0f31
78 ---- /dev/null
79 -+++ b/Documentation/scheduler/sched-BMQ.txt
80 -@@ -0,0 +1,110 @@
81 -+ BitMap queue CPU Scheduler
82 -+ --------------------------
83 -+
84 -+CONTENT
85 -+========
86 -+
87 -+ Background
88 -+ Design
89 -+ Overview
90 -+ Task policy
91 -+ Priority management
92 -+ BitMap Queue
93 -+ CPU Assignment and Migration
94 -+
95 -+
96 -+Background
97 -+==========
98 -+
99 -+The BitMap Queue CPU scheduler, referred to as BMQ from here on, is an evolution
100 -+of the previous Priority and Deadline based Skiplist multiple queue scheduler (PDS),
101 -+and is inspired by the Zircon scheduler. Its goal is to keep the scheduler code
102 -+simple while remaining efficient and scalable for interactive tasks such as
103 -+desktop use, movie playback and gaming.
104 -+
105 -+Design
106 -+======
107 -+
108 -+Overview
109 -+--------
110 -+
111 -+BMQ uses a per-CPU run queue design: each (logical) CPU has its own run queue,
112 -+and each CPU is responsible for scheduling the tasks that have been placed into
113 -+its run queue.
114 -+
115 -+The run queue is a set of priority queues. In terms of data structure, these
116 -+queues are FIFO queues for non-rt tasks and priority queues for rt tasks; see
117 -+BitMap Queue below for details. BMQ is optimized for non-rt tasks, since most
118 -+applications are non-rt tasks. Whether a queue is FIFO or priority based, each
119 -+queue is an ordered list of runnable tasks awaiting execution, and the data
120 -+structures are the same. When it is time for a new task to run, the scheduler
121 -+simply looks for the lowest numbered queue that contains a task and runs the
122 -+first task from the head of that queue. The per-CPU idle task is also kept in
123 -+the run queue, so the scheduler can always find a task to run from its own
124 -+run queue.
125 -+
126 -+Each task is assigned the same timeslice (default 4 ms) when it is picked to
127 -+start running. A task is reinserted at the end of the appropriate priority
128 -+queue when it uses up its whole timeslice. When the scheduler selects a new task
129 -+from the priority queue, it sets the CPU's preemption timer for the remainder of
130 -+the previous timeslice. When that timer fires, the scheduler stops execution of
131 -+that task, selects another task and starts over again.
132 -+
133 -+If a task blocks waiting for a shared resource then it's taken out of its
134 -+priority queue and is placed in a wait queue for the shared resource. When it
135 -+is unblocked it will be reinserted in the appropriate priority queue of an
136 -+eligible CPU.
137 -+
138 -+Task policy
139 -+-----------
140 -+
141 -+BMQ supports the DEADLINE, FIFO, RR, NORMAL, BATCH and IDLE task policies, like
142 -+the mainline CFS scheduler. However, BMQ is heavily optimized for non-rt tasks,
143 -+that is, NORMAL/BATCH/IDLE policy tasks. Below are the implementation details of
144 -+each policy.
145 -+
146 -+DEADLINE
147 -+ It is squashed as priority 0 FIFO task.
148 -+
149 -+FIFO/RR
150 -+ All RT tasks share one single priority queue in the BMQ run queue design. The
151 -+complexity of the insert operation is O(n). BMQ is not designed for systems that
152 -+mostly run rt policy tasks.
153 -+
154 -+NORMAL/BATCH/IDLE
155 -+ BATCH and IDLE tasks are treated as the same policy. They compete for CPU with
156 -+NORMAL policy tasks, but they simply never get boosted. To control the priority
157 -+of NORMAL/BATCH/IDLE tasks, simply use the nice level.
158 -+
159 -+ISO
160 -+ ISO policy is not supported in BMQ. Please use nice level -20 NORMAL policy
161 -+task instead.
162 -+
163 -+Priority management
164 -+-------------------
165 -+
166 -+RT tasks have priorities from 0 to 99. For non-rt tasks, there are three different
167 -+factors used to determine the effective priority of a task; the effective
168 -+priority is what determines which queue the task will be placed in.
169 -+
170 -+The first factor is simply the task's static priority, which is assigned from
171 -+the task's nice level, within [-20, 19] from userland's point of view and
172 -+[0, 39] internally.
173 -+
174 -+The second factor is the priority boost. This is a value bounded within
175 -+[-MAX_PRIORITY_ADJ, MAX_PRIORITY_ADJ] that is used to offset the base priority;
176 -+it is modified in the following cases:
177 -+
178 -+* When a thread has used up its entire timeslice, always deboost it by
179 -+increasing its boost value by one.
180 -+* When a thread gives up cpu control (voluntarily or involuntarily) to be
181 -+rescheduled, and its switch-in time (the time since it was last switched in and
182 -+run) is below the threshold based on its priority boost, boost it by decreasing
183 -+its boost value by one, capped at 0 (it won't go negative).
184 -+
185 -+The intent in this system is to ensure that interactive threads are serviced
186 -+quickly. These are usually the threads that interact directly with the user
187 -+and cause user-perceivable latency. These threads usually do little work and
188 -+spend most of their time blocked awaiting another user event. So they get the
189 -+priority boost from unblocking while background threads that do most of the
190 -+processing receive the priority penalty for using their entire timeslice.
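[ Illustration, not part of the diff: the documentation above picks the first task from the lowest numbered non-empty priority queue. Here is a minimal, self-contained C sketch of that lookup; NPRIO, struct simple_rq and the __builtin_ctzll() helper are assumptions for illustration, while the patch itself does the equivalent with find_first_bit() over rq->queue.bitmap in sched_rq_first_task() (see kernel/sched/alt_core.c below). ]

#define NPRIO 64

struct list_node { struct list_node *next; };

struct simple_rq {
	unsigned long long bitmap;	/* bit i set => priority queue i is non-empty */
	struct list_node heads[NPRIO];	/* one FIFO list head per priority level */
};

/* The idle task is always queued, so the bitmap is never empty and
 * pick_next_sketch() always finds something to run. */
static struct list_node *pick_next_sketch(struct simple_rq *rq)
{
	int prio = __builtin_ctzll(rq->bitmap);	/* lowest set bit = best queue */
	return rq->heads[prio].next;		/* first task at its head */
}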
191 -diff --git a/fs/proc/base.c b/fs/proc/base.c
192 -index e5b5f7709d48..284b3c4b7d90 100644
193 ---- a/fs/proc/base.c
194 -+++ b/fs/proc/base.c
195 -@@ -476,7 +476,7 @@ static int proc_pid_schedstat(struct seq_file *m, struct pid_namespace *ns,
196 - seq_puts(m, "0 0 0\n");
197 - else
198 - seq_printf(m, "%llu %llu %lu\n",
199 -- (unsigned long long)task->se.sum_exec_runtime,
200 -+ (unsigned long long)tsk_seruntime(task),
201 - (unsigned long long)task->sched_info.run_delay,
202 - task->sched_info.pcount);
203 -
204 -diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
205 -index 8874f681b056..59eb72bf7d5f 100644
206 ---- a/include/asm-generic/resource.h
207 -+++ b/include/asm-generic/resource.h
208 -@@ -23,7 +23,7 @@
209 - [RLIMIT_LOCKS] = { RLIM_INFINITY, RLIM_INFINITY }, \
210 - [RLIMIT_SIGPENDING] = { 0, 0 }, \
211 - [RLIMIT_MSGQUEUE] = { MQ_BYTES_MAX, MQ_BYTES_MAX }, \
212 -- [RLIMIT_NICE] = { 0, 0 }, \
213 -+ [RLIMIT_NICE] = { 30, 30 }, \
214 - [RLIMIT_RTPRIO] = { 0, 0 }, \
215 - [RLIMIT_RTTIME] = { RLIM_INFINITY, RLIM_INFINITY }, \
216 - }
217 -diff --git a/include/linux/sched.h b/include/linux/sched.h
218 -index ec8d07d88641..b12f660404fd 100644
219 ---- a/include/linux/sched.h
220 -+++ b/include/linux/sched.h
221 -@@ -681,12 +681,18 @@ struct task_struct {
222 - unsigned int ptrace;
223 -
224 - #ifdef CONFIG_SMP
225 -- int on_cpu;
226 - struct __call_single_node wake_entry;
227 -+#endif
228 -+#if defined(CONFIG_SMP) || defined(CONFIG_SCHED_ALT)
229 -+ int on_cpu;
230 -+#endif
231 -+
232 -+#ifdef CONFIG_SMP
233 - #ifdef CONFIG_THREAD_INFO_IN_TASK
234 - /* Current CPU: */
235 - unsigned int cpu;
236 - #endif
237 -+#ifndef CONFIG_SCHED_ALT
238 - unsigned int wakee_flips;
239 - unsigned long wakee_flip_decay_ts;
240 - struct task_struct *last_wakee;
241 -@@ -700,6 +706,7 @@ struct task_struct {
242 - */
243 - int recent_used_cpu;
244 - int wake_cpu;
245 -+#endif /* !CONFIG_SCHED_ALT */
246 - #endif
247 - int on_rq;
248 -
249 -@@ -708,6 +715,20 @@ struct task_struct {
250 - int normal_prio;
251 - unsigned int rt_priority;
252 -
253 -+#ifdef CONFIG_SCHED_ALT
254 -+ u64 last_ran;
255 -+ s64 time_slice;
256 -+ int sq_idx;
257 -+ struct list_head sq_node;
258 -+#ifdef CONFIG_SCHED_BMQ
259 -+ int boost_prio;
260 -+#endif /* CONFIG_SCHED_BMQ */
261 -+#ifdef CONFIG_SCHED_PDS
262 -+ u64 deadline;
263 -+#endif /* CONFIG_SCHED_PDS */
264 -+ /* sched_clock time spent running */
265 -+ u64 sched_time;
266 -+#else /* !CONFIG_SCHED_ALT */
267 - const struct sched_class *sched_class;
268 - struct sched_entity se;
269 - struct sched_rt_entity rt;
270 -@@ -718,6 +739,7 @@ struct task_struct {
271 - unsigned long core_cookie;
272 - unsigned int core_occupation;
273 - #endif
274 -+#endif /* !CONFIG_SCHED_ALT */
275 -
276 - #ifdef CONFIG_CGROUP_SCHED
277 - struct task_group *sched_task_group;
278 -@@ -1417,6 +1439,15 @@ struct task_struct {
279 - */
280 - };
281 -
282 -+#ifdef CONFIG_SCHED_ALT
283 -+#define tsk_seruntime(t) ((t)->sched_time)
284 -+/* replace the uncertian rt_timeout with 0UL */
285 -+#define tsk_rttimeout(t) (0UL)
286 -+#else /* CFS */
287 -+#define tsk_seruntime(t) ((t)->se.sum_exec_runtime)
288 -+#define tsk_rttimeout(t) ((t)->rt.timeout)
289 -+#endif /* !CONFIG_SCHED_ALT */
290 -+
291 - static inline struct pid *task_pid(struct task_struct *task)
292 - {
293 - return task->thread_pid;
294 -diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
295 -index 1aff00b65f3c..216fdf2fe90c 100644
296 ---- a/include/linux/sched/deadline.h
297 -+++ b/include/linux/sched/deadline.h
298 -@@ -1,5 +1,24 @@
299 - /* SPDX-License-Identifier: GPL-2.0 */
300 -
301 -+#ifdef CONFIG_SCHED_ALT
302 -+
303 -+static inline int dl_task(struct task_struct *p)
304 -+{
305 -+ return 0;
306 -+}
307 -+
308 -+#ifdef CONFIG_SCHED_BMQ
309 -+#define __tsk_deadline(p) (0UL)
310 -+#endif
311 -+
312 -+#ifdef CONFIG_SCHED_PDS
313 -+#define __tsk_deadline(p) ((((u64) ((p)->prio))<<56) | (p)->deadline)
314 -+#endif
315 -+
316 -+#else
317 -+
318 -+#define __tsk_deadline(p) ((p)->dl.deadline)
319 -+
320 - /*
321 - * SCHED_DEADLINE tasks has negative priorities, reflecting
322 - * the fact that any of them has higher prio than RT and
323 -@@ -19,6 +38,7 @@ static inline int dl_task(struct task_struct *p)
324 - {
325 - return dl_prio(p->prio);
326 - }
327 -+#endif /* CONFIG_SCHED_ALT */
328 -
329 - static inline bool dl_time_before(u64 a, u64 b)
330 - {
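[ Illustration, not part of the diff: under CONFIG_SCHED_PDS the __tsk_deadline() macro above packs the 8-bit prio into the top byte of a u64, so one integer comparison orders tasks first by prio and then by deadline; that is what rt_mutex_waiter_less() relies on later in this patch. A small stand-alone C restatement, assuming deadline < 2^56: ]

#include <stdint.h>

static inline uint64_t pack_prio_deadline(uint64_t prio, uint64_t deadline)
{
	return (prio << 56) | deadline;	/* prio dominates, deadline breaks ties */
}

/* Example: prio 120 sorts before prio 128 regardless of the deadlines:
 * pack_prio_deadline(120, 9000) < pack_prio_deadline(128, 1000) is true. */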
331 -diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h
332 -index ab83d85e1183..6af9ae681116 100644
333 ---- a/include/linux/sched/prio.h
334 -+++ b/include/linux/sched/prio.h
335 -@@ -18,6 +18,32 @@
336 - #define MAX_PRIO (MAX_RT_PRIO + NICE_WIDTH)
337 - #define DEFAULT_PRIO (MAX_RT_PRIO + NICE_WIDTH / 2)
338 -
339 -+#ifdef CONFIG_SCHED_ALT
340 -+
341 -+/* Undefine MAX_PRIO and DEFAULT_PRIO */
342 -+#undef MAX_PRIO
343 -+#undef DEFAULT_PRIO
344 -+
345 -+/* +/- priority levels from the base priority */
346 -+#ifdef CONFIG_SCHED_BMQ
347 -+#define MAX_PRIORITY_ADJ (7)
348 -+
349 -+#define MIN_NORMAL_PRIO (MAX_RT_PRIO)
350 -+#define MAX_PRIO (MIN_NORMAL_PRIO + NICE_WIDTH)
351 -+#define DEFAULT_PRIO (MIN_NORMAL_PRIO + NICE_WIDTH / 2)
352 -+#endif
353 -+
354 -+#ifdef CONFIG_SCHED_PDS
355 -+#define MAX_PRIORITY_ADJ (0)
356 -+
357 -+#define MIN_NORMAL_PRIO (128)
358 -+#define NORMAL_PRIO_NUM (64)
359 -+#define MAX_PRIO (MIN_NORMAL_PRIO + NORMAL_PRIO_NUM)
360 -+#define DEFAULT_PRIO (MAX_PRIO - NICE_WIDTH / 2)
361 -+#endif
362 -+
363 -+#endif /* CONFIG_SCHED_ALT */
364 -+
365 - /*
366 - * Convert user-nice values [ -20 ... 0 ... 19 ]
367 - * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
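[ Illustration, not part of the diff: combining the BMQ constants above with the removed documentation's description of the priority boost, a minimal C sketch of deriving an effective priority from a static priority plus a clamped boost. The helper names are assumptions; the patch itself keeps the boost in p->boost_prio. ]

#define MAX_PRIORITY_ADJ 7	/* BMQ value from the hunk above */

static int clamp_int(int v, int lo, int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Static priority comes from the nice level ([0, 39] internally); the boost
 * is clamped to [-MAX_PRIORITY_ADJ, MAX_PRIORITY_ADJ]. A lower result means
 * the task is picked sooner. */
static int effective_prio_sketch(int static_prio, int boost)
{
	return static_prio + clamp_int(boost, -MAX_PRIORITY_ADJ, MAX_PRIORITY_ADJ);
}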
368 -diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
369 -index e5af028c08b4..0a7565d0d3cf 100644
370 ---- a/include/linux/sched/rt.h
371 -+++ b/include/linux/sched/rt.h
372 -@@ -24,8 +24,10 @@ static inline bool task_is_realtime(struct task_struct *tsk)
373 -
374 - if (policy == SCHED_FIFO || policy == SCHED_RR)
375 - return true;
376 -+#ifndef CONFIG_SCHED_ALT
377 - if (policy == SCHED_DEADLINE)
378 - return true;
379 -+#endif
380 - return false;
381 - }
382 -
383 -diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
384 -index 8f0f778b7c91..991f2280475b 100644
385 ---- a/include/linux/sched/topology.h
386 -+++ b/include/linux/sched/topology.h
387 -@@ -225,7 +225,8 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
388 -
389 - #endif /* !CONFIG_SMP */
390 -
391 --#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
392 -+#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) && \
393 -+ !defined(CONFIG_SCHED_ALT)
394 - extern void rebuild_sched_domains_energy(void);
395 - #else
396 - static inline void rebuild_sched_domains_energy(void)
397 -diff --git a/init/Kconfig b/init/Kconfig
398 -index 55f9f7738ebb..9a9b244d3ca3 100644
399 ---- a/init/Kconfig
400 -+++ b/init/Kconfig
401 -@@ -786,9 +786,39 @@ config GENERIC_SCHED_CLOCK
402 -
403 - menu "Scheduler features"
404 -
405 -+menuconfig SCHED_ALT
406 -+ bool "Alternative CPU Schedulers"
407 -+ default y
408 -+ help
409 -+ This feature enables the alternative CPU schedulers.
410 -+
411 -+if SCHED_ALT
412 -+
413 -+choice
414 -+ prompt "Alternative CPU Scheduler"
415 -+ default SCHED_BMQ
416 -+
417 -+config SCHED_BMQ
418 -+ bool "BMQ CPU scheduler"
419 -+ help
420 -+ The BitMap Queue CPU scheduler for excellent interactivity and
421 -+ responsiveness on the desktop and solid scalability on normal
422 -+ hardware and commodity servers.
423 -+
424 -+config SCHED_PDS
425 -+ bool "PDS CPU scheduler"
426 -+ help
427 -+ The Priority and Deadline based Skip list multiple queue CPU
428 -+ Scheduler.
429 -+
430 -+endchoice
431 -+
432 -+endif
433 -+
434 - config UCLAMP_TASK
435 - bool "Enable utilization clamping for RT/FAIR tasks"
436 - depends on CPU_FREQ_GOV_SCHEDUTIL
437 -+ depends on !SCHED_ALT
438 - help
439 - This feature enables the scheduler to track the clamped utilization
440 - of each CPU based on RUNNABLE tasks scheduled on that CPU.
441 -@@ -874,6 +904,7 @@ config NUMA_BALANCING
442 - depends on ARCH_SUPPORTS_NUMA_BALANCING
443 - depends on !ARCH_WANT_NUMA_VARIABLE_LOCALITY
444 - depends on SMP && NUMA && MIGRATION
445 -+ depends on !SCHED_ALT
446 - help
447 - This option adds support for automatic NUMA aware memory/task placement.
448 - The mechanism is quite primitive and is based on migrating memory when
449 -@@ -966,6 +997,7 @@ config FAIR_GROUP_SCHED
450 - depends on CGROUP_SCHED
451 - default CGROUP_SCHED
452 -
453 -+if !SCHED_ALT
454 - config CFS_BANDWIDTH
455 - bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
456 - depends on FAIR_GROUP_SCHED
457 -@@ -988,6 +1020,7 @@ config RT_GROUP_SCHED
458 - realtime bandwidth for them.
459 - See Documentation/scheduler/sched-rt-group.rst for more information.
460 -
461 -+endif #!SCHED_ALT
462 - endif #CGROUP_SCHED
463 -
464 - config UCLAMP_TASK_GROUP
465 -@@ -1231,6 +1264,7 @@ config CHECKPOINT_RESTORE
466 -
467 - config SCHED_AUTOGROUP
468 - bool "Automatic process group scheduling"
469 -+ depends on !SCHED_ALT
470 - select CGROUPS
471 - select CGROUP_SCHED
472 - select FAIR_GROUP_SCHED
473 -diff --git a/init/init_task.c b/init/init_task.c
474 -index 562f2ef8d157..177b63db4ce0 100644
475 ---- a/init/init_task.c
476 -+++ b/init/init_task.c
477 -@@ -75,9 +75,15 @@ struct task_struct init_task
478 - .stack = init_stack,
479 - .usage = REFCOUNT_INIT(2),
480 - .flags = PF_KTHREAD,
481 -+#ifdef CONFIG_SCHED_ALT
482 -+ .prio = DEFAULT_PRIO + MAX_PRIORITY_ADJ,
483 -+ .static_prio = DEFAULT_PRIO,
484 -+ .normal_prio = DEFAULT_PRIO + MAX_PRIORITY_ADJ,
485 -+#else
486 - .prio = MAX_PRIO - 20,
487 - .static_prio = MAX_PRIO - 20,
488 - .normal_prio = MAX_PRIO - 20,
489 -+#endif
490 - .policy = SCHED_NORMAL,
491 - .cpus_ptr = &init_task.cpus_mask,
492 - .cpus_mask = CPU_MASK_ALL,
493 -@@ -87,6 +93,17 @@ struct task_struct init_task
494 - .restart_block = {
495 - .fn = do_no_restart_syscall,
496 - },
497 -+#ifdef CONFIG_SCHED_ALT
498 -+ .sq_node = LIST_HEAD_INIT(init_task.sq_node),
499 -+#ifdef CONFIG_SCHED_BMQ
500 -+ .boost_prio = 0,
501 -+ .sq_idx = 15,
502 -+#endif
503 -+#ifdef CONFIG_SCHED_PDS
504 -+ .deadline = 0,
505 -+#endif
506 -+ .time_slice = HZ,
507 -+#else
508 - .se = {
509 - .group_node = LIST_HEAD_INIT(init_task.se.group_node),
510 - },
511 -@@ -94,6 +111,7 @@ struct task_struct init_task
512 - .run_list = LIST_HEAD_INIT(init_task.rt.run_list),
513 - .time_slice = RR_TIMESLICE,
514 - },
515 -+#endif
516 - .tasks = LIST_HEAD_INIT(init_task.tasks),
517 - #ifdef CONFIG_SMP
518 - .pushable_tasks = PLIST_NODE_INIT(init_task.pushable_tasks, MAX_PRIO),
519 -diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
520 -index 5876e30c5740..7594d0a31869 100644
521 ---- a/kernel/Kconfig.preempt
522 -+++ b/kernel/Kconfig.preempt
523 -@@ -102,7 +102,7 @@ config PREEMPT_DYNAMIC
524 -
525 - config SCHED_CORE
526 - bool "Core Scheduling for SMT"
527 -- depends on SCHED_SMT
528 -+ depends on SCHED_SMT && !SCHED_ALT
529 - help
530 - This option permits Core Scheduling, a means of coordinated task
531 - selection across SMT siblings. When enabled -- see
532 -diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
533 -index adb5190c4429..8c02bce63146 100644
534 ---- a/kernel/cgroup/cpuset.c
535 -+++ b/kernel/cgroup/cpuset.c
536 -@@ -636,7 +636,7 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
537 - return ret;
538 - }
539 -
540 --#ifdef CONFIG_SMP
541 -+#if defined(CONFIG_SMP) && !defined(CONFIG_SCHED_ALT)
542 - /*
543 - * Helper routine for generate_sched_domains().
544 - * Do cpusets a, b have overlapping effective cpus_allowed masks?
545 -@@ -1032,7 +1032,7 @@ static void rebuild_sched_domains_locked(void)
546 - /* Have scheduler rebuild the domains */
547 - partition_and_rebuild_sched_domains(ndoms, doms, attr);
548 - }
549 --#else /* !CONFIG_SMP */
550 -+#else /* !CONFIG_SMP || CONFIG_SCHED_ALT */
551 - static void rebuild_sched_domains_locked(void)
552 - {
553 - }
554 -diff --git a/kernel/delayacct.c b/kernel/delayacct.c
555 -index 51530d5b15a8..e542d71bb94b 100644
556 ---- a/kernel/delayacct.c
557 -+++ b/kernel/delayacct.c
558 -@@ -139,7 +139,7 @@ int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
559 - */
560 - t1 = tsk->sched_info.pcount;
561 - t2 = tsk->sched_info.run_delay;
562 -- t3 = tsk->se.sum_exec_runtime;
563 -+ t3 = tsk_seruntime(tsk);
564 -
565 - d->cpu_count += t1;
566 -
567 -diff --git a/kernel/exit.c b/kernel/exit.c
568 -index 9a89e7f36acb..7fe34c56bd08 100644
569 ---- a/kernel/exit.c
570 -+++ b/kernel/exit.c
571 -@@ -122,7 +122,7 @@ static void __exit_signal(struct task_struct *tsk)
572 - sig->curr_target = next_thread(tsk);
573 - }
574 -
575 -- add_device_randomness((const void*) &tsk->se.sum_exec_runtime,
576 -+ add_device_randomness((const void*) &tsk_seruntime(tsk),
577 - sizeof(unsigned long long));
578 -
579 - /*
580 -@@ -143,7 +143,7 @@ static void __exit_signal(struct task_struct *tsk)
581 - sig->inblock += task_io_get_inblock(tsk);
582 - sig->oublock += task_io_get_oublock(tsk);
583 - task_io_accounting_add(&sig->ioac, &tsk->ioac);
584 -- sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
585 -+ sig->sum_sched_runtime += tsk_seruntime(tsk);
586 - sig->nr_threads--;
587 - __unhash_process(tsk, group_dead);
588 - write_sequnlock(&sig->stats_lock);
589 -diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
590 -index 3a4beb9395c4..98a709628cb3 100644
591 ---- a/kernel/livepatch/transition.c
592 -+++ b/kernel/livepatch/transition.c
593 -@@ -307,7 +307,11 @@ static bool klp_try_switch_task(struct task_struct *task)
594 - */
595 - rq = task_rq_lock(task, &flags);
596 -
597 -+#ifdef CONFIG_SCHED_ALT
598 -+ if (task_running(task) && task != current) {
599 -+#else
600 - if (task_running(rq, task) && task != current) {
601 -+#endif
602 - snprintf(err_buf, STACK_ERR_BUF_SIZE,
603 - "%s: %s:%d is running\n", __func__, task->comm,
604 - task->pid);
605 -diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
606 -index ad0db322ed3b..350b0e506c17 100644
607 ---- a/kernel/locking/rtmutex.c
608 -+++ b/kernel/locking/rtmutex.c
609 -@@ -227,14 +227,18 @@ static __always_inline bool unlock_rt_mutex_safe(struct rt_mutex *lock,
610 - * Only use with rt_mutex_waiter_{less,equal}()
611 - */
612 - #define task_to_waiter(p) \
613 -- &(struct rt_mutex_waiter){ .prio = (p)->prio, .deadline = (p)->dl.deadline }
614 -+ &(struct rt_mutex_waiter){ .prio = (p)->prio, .deadline = __tsk_deadline(p) }
615 -
616 - static __always_inline int rt_mutex_waiter_less(struct rt_mutex_waiter *left,
617 - struct rt_mutex_waiter *right)
618 - {
619 -+#ifdef CONFIG_SCHED_PDS
620 -+ return (left->deadline < right->deadline);
621 -+#else
622 - if (left->prio < right->prio)
623 - return 1;
624 -
625 -+#ifndef CONFIG_SCHED_BMQ
626 - /*
627 - * If both waiters have dl_prio(), we check the deadlines of the
628 - * associated tasks.
629 -@@ -243,16 +247,22 @@ static __always_inline int rt_mutex_waiter_less(struct rt_mutex_waiter *left,
630 - */
631 - if (dl_prio(left->prio))
632 - return dl_time_before(left->deadline, right->deadline);
633 -+#endif
634 -
635 - return 0;
636 -+#endif
637 - }
638 -
639 - static __always_inline int rt_mutex_waiter_equal(struct rt_mutex_waiter *left,
640 - struct rt_mutex_waiter *right)
641 - {
642 -+#ifdef CONFIG_SCHED_PDS
643 -+ return (left->deadline == right->deadline);
644 -+#else
645 - if (left->prio != right->prio)
646 - return 0;
647 -
648 -+#ifndef CONFIG_SCHED_BMQ
649 - /*
650 - * If both waiters have dl_prio(), we check the deadlines of the
651 - * associated tasks.
652 -@@ -261,8 +271,10 @@ static __always_inline int rt_mutex_waiter_equal(struct rt_mutex_waiter *left,
653 - */
654 - if (dl_prio(left->prio))
655 - return left->deadline == right->deadline;
656 -+#endif
657 -
658 - return 1;
659 -+#endif
660 - }
661 -
662 - #define __node_2_waiter(node) \
663 -@@ -654,7 +666,7 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
664 - * the values of the node being removed.
665 - */
666 - waiter->prio = task->prio;
667 -- waiter->deadline = task->dl.deadline;
668 -+ waiter->deadline = __tsk_deadline(task);
669 -
670 - rt_mutex_enqueue(lock, waiter);
671 -
672 -@@ -925,7 +937,7 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex *lock,
673 - waiter->task = task;
674 - waiter->lock = lock;
675 - waiter->prio = task->prio;
676 -- waiter->deadline = task->dl.deadline;
677 -+ waiter->deadline = __tsk_deadline(task);
678 -
679 - /* Get the top priority waiter on the lock */
680 - if (rt_mutex_has_waiters(lock))
681 -diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
682 -index 978fcfca5871..0425ee149b4d 100644
683 ---- a/kernel/sched/Makefile
684 -+++ b/kernel/sched/Makefile
685 -@@ -22,14 +22,21 @@ ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
686 - CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer
687 - endif
688 -
689 --obj-y += core.o loadavg.o clock.o cputime.o
690 --obj-y += idle.o fair.o rt.o deadline.o
691 --obj-y += wait.o wait_bit.o swait.o completion.o
692 --
693 --obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
694 -+ifdef CONFIG_SCHED_ALT
695 -+obj-y += alt_core.o
696 -+obj-$(CONFIG_SCHED_DEBUG) += alt_debug.o
697 -+else
698 -+obj-y += core.o
699 -+obj-y += fair.o rt.o deadline.o
700 -+obj-$(CONFIG_SMP) += cpudeadline.o stop_task.o
701 - obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
702 --obj-$(CONFIG_SCHEDSTATS) += stats.o
703 -+endif
704 - obj-$(CONFIG_SCHED_DEBUG) += debug.o
705 -+obj-y += loadavg.o clock.o cputime.o
706 -+obj-y += idle.o
707 -+obj-y += wait.o wait_bit.o swait.o completion.o
708 -+obj-$(CONFIG_SMP) += cpupri.o pelt.o topology.o
709 -+obj-$(CONFIG_SCHEDSTATS) += stats.o
710 - obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
711 - obj-$(CONFIG_CPU_FREQ) += cpufreq.o
712 - obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
713 -diff --git a/kernel/sched/alt_core.c b/kernel/sched/alt_core.c
714 -new file mode 100644
715 -index 000000000000..56aed2b1e42c
716 ---- /dev/null
717 -+++ b/kernel/sched/alt_core.c
718 -@@ -0,0 +1,7341 @@
719 -+/*
720 -+ * kernel/sched/alt_core.c
721 -+ *
722 -+ * Core alternative kernel scheduler code and related syscalls
723 -+ *
724 -+ * Copyright (C) 1991-2002 Linus Torvalds
725 -+ *
726 -+ * 2009-08-13 Brainfuck deadline scheduling policy by Con Kolivas deletes
727 -+ * a whole lot of those previous things.
728 -+ * 2017-09-06 Priority and Deadline based Skip list multiple queue kernel
729 -+ * scheduler by Alfred Chen.
730 -+ * 2019-02-20 BMQ(BitMap Queue) kernel scheduler by Alfred Chen.
731 -+ */
732 -+#define CREATE_TRACE_POINTS
733 -+#include <trace/events/sched.h>
734 -+#undef CREATE_TRACE_POINTS
735 -+
736 -+#include "sched.h"
737 -+
738 -+#include <linux/sched/rt.h>
739 -+
740 -+#include <linux/context_tracking.h>
741 -+#include <linux/compat.h>
742 -+#include <linux/blkdev.h>
743 -+#include <linux/delayacct.h>
744 -+#include <linux/freezer.h>
745 -+#include <linux/init_task.h>
746 -+#include <linux/kprobes.h>
747 -+#include <linux/mmu_context.h>
748 -+#include <linux/nmi.h>
749 -+#include <linux/profile.h>
750 -+#include <linux/rcupdate_wait.h>
751 -+#include <linux/security.h>
752 -+#include <linux/syscalls.h>
753 -+#include <linux/wait_bit.h>
754 -+
755 -+#include <linux/kcov.h>
756 -+#include <linux/scs.h>
757 -+
758 -+#include <asm/switch_to.h>
759 -+
760 -+#include "../workqueue_internal.h"
761 -+#include "../../fs/io-wq.h"
762 -+#include "../smpboot.h"
763 -+
764 -+#include "pelt.h"
765 -+#include "smp.h"
766 -+
767 -+/*
768 -+ * Export tracepoints that act as a bare tracehook (ie: have no trace event
769 -+ * associated with them) to allow external modules to probe them.
770 -+ */
771 -+EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_irq_tp);
772 -+
773 -+#ifdef CONFIG_SCHED_DEBUG
774 -+#define sched_feat(x) (1)
775 -+/*
776 -+ * Print a warning if need_resched is set for the given duration (if
777 -+ * LATENCY_WARN is enabled).
778 -+ *
779 -+ * If sysctl_resched_latency_warn_once is set, only one warning will be shown
780 -+ * per boot.
781 -+ */
782 -+__read_mostly int sysctl_resched_latency_warn_ms = 100;
783 -+__read_mostly int sysctl_resched_latency_warn_once = 1;
784 -+#else
785 -+#define sched_feat(x) (0)
786 -+#endif /* CONFIG_SCHED_DEBUG */
787 -+
788 -+#define ALT_SCHED_VERSION "v5.14-r3"
789 -+
790 -+/* rt_prio(prio) defined in include/linux/sched/rt.h */
791 -+#define rt_task(p) rt_prio((p)->prio)
792 -+#define rt_policy(policy) ((policy) == SCHED_FIFO || (policy) == SCHED_RR)
793 -+#define task_has_rt_policy(p) (rt_policy((p)->policy))
794 -+
795 -+#define STOP_PRIO (MAX_RT_PRIO - 1)
796 -+
797 -+/* Default time slice is 4 in ms, can be set via kernel parameter "sched_timeslice" */
798 -+u64 sched_timeslice_ns __read_mostly = (4 << 20);
799 -+
800 -+static inline void requeue_task(struct task_struct *p, struct rq *rq);
801 -+
802 -+#ifdef CONFIG_SCHED_BMQ
803 -+#include "bmq.h"
804 -+#endif
805 -+#ifdef CONFIG_SCHED_PDS
806 -+#include "pds.h"
807 -+#endif
808 -+
809 -+static int __init sched_timeslice(char *str)
810 -+{
811 -+ int timeslice_ms;
812 -+
813 -+ get_option(&str, &timeslice_ms);
814 -+ if (2 != timeslice_ms)
815 -+ timeslice_ms = 4;
816 -+ sched_timeslice_ns = timeslice_ms << 20;
817 -+ sched_timeslice_imp(timeslice_ms);
818 -+
819 -+ return 0;
820 -+}
821 -+early_param("sched_timeslice", sched_timeslice);
822 -+
823 -+/* Reschedule if less than this many μs left */
824 -+#define RESCHED_NS (100 << 10)
825 -+
826 -+/**
827 -+ * sched_yield_type - Choose what sort of yield sched_yield will perform.
828 -+ * 0: No yield.
829 -+ * 1: Deboost and requeue task. (default)
830 -+ * 2: Set rq skip task.
831 -+ */
832 -+int sched_yield_type __read_mostly = 1;
833 -+
834 -+#ifdef CONFIG_SMP
835 -+static cpumask_t sched_rq_pending_mask ____cacheline_aligned_in_smp;
836 -+
837 -+DEFINE_PER_CPU(cpumask_t [NR_CPU_AFFINITY_LEVELS], sched_cpu_topo_masks);
838 -+DEFINE_PER_CPU(cpumask_t *, sched_cpu_llc_mask);
839 -+DEFINE_PER_CPU(cpumask_t *, sched_cpu_topo_end_mask);
840 -+
841 -+#ifdef CONFIG_SCHED_SMT
842 -+DEFINE_STATIC_KEY_FALSE(sched_smt_present);
843 -+EXPORT_SYMBOL_GPL(sched_smt_present);
844 -+#endif
845 -+
846 -+/*
847 -+ * Keep a unique ID per domain (we use the first CPUs number in the cpumask of
848 -+ * the domain), this allows us to quickly tell if two cpus are in the same cache
849 -+ * domain, see cpus_share_cache().
850 -+ */
851 -+DEFINE_PER_CPU(int, sd_llc_id);
852 -+#endif /* CONFIG_SMP */
853 -+
854 -+static DEFINE_MUTEX(sched_hotcpu_mutex);
855 -+
856 -+DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
857 -+
858 -+#ifndef prepare_arch_switch
859 -+# define prepare_arch_switch(next) do { } while (0)
860 -+#endif
861 -+#ifndef finish_arch_post_lock_switch
862 -+# define finish_arch_post_lock_switch() do { } while (0)
863 -+#endif
864 -+
865 -+#ifdef CONFIG_SCHED_SMT
866 -+static cpumask_t sched_sg_idle_mask ____cacheline_aligned_in_smp;
867 -+#endif
868 -+static cpumask_t sched_rq_watermark[SCHED_BITS] ____cacheline_aligned_in_smp;
869 -+
870 -+/* sched_queue related functions */
871 -+static inline void sched_queue_init(struct sched_queue *q)
872 -+{
873 -+ int i;
874 -+
875 -+ bitmap_zero(q->bitmap, SCHED_BITS);
876 -+ for(i = 0; i < SCHED_BITS; i++)
877 -+ INIT_LIST_HEAD(&q->heads[i]);
878 -+}
879 -+
880 -+/*
881 -+ * Init idle task and put into queue structure of rq
882 -+ * IMPORTANT: may be called multiple times for a single cpu
883 -+ */
884 -+static inline void sched_queue_init_idle(struct sched_queue *q,
885 -+ struct task_struct *idle)
886 -+{
887 -+ idle->sq_idx = IDLE_TASK_SCHED_PRIO;
888 -+ INIT_LIST_HEAD(&q->heads[idle->sq_idx]);
889 -+ list_add(&idle->sq_node, &q->heads[idle->sq_idx]);
890 -+}
891 -+
892 -+/* water mark related functions */
893 -+static inline void update_sched_rq_watermark(struct rq *rq)
894 -+{
895 -+ unsigned long watermark = find_first_bit(rq->queue.bitmap, SCHED_QUEUE_BITS);
896 -+ unsigned long last_wm = rq->watermark;
897 -+ unsigned long i;
898 -+ int cpu;
899 -+
900 -+ if (watermark == last_wm)
901 -+ return;
902 -+
903 -+ rq->watermark = watermark;
904 -+ cpu = cpu_of(rq);
905 -+ if (watermark < last_wm) {
906 -+ for (i = last_wm; i > watermark; i--)
907 -+ cpumask_clear_cpu(cpu, sched_rq_watermark + SCHED_BITS - 1 - i);
908 -+#ifdef CONFIG_SCHED_SMT
909 -+ if (static_branch_likely(&sched_smt_present) &&
910 -+ IDLE_TASK_SCHED_PRIO == last_wm)
911 -+ cpumask_andnot(&sched_sg_idle_mask,
912 -+ &sched_sg_idle_mask, cpu_smt_mask(cpu));
913 -+#endif
914 -+ return;
915 -+ }
916 -+ /* last_wm < watermark */
917 -+ for (i = watermark; i > last_wm; i--)
918 -+ cpumask_set_cpu(cpu, sched_rq_watermark + SCHED_BITS - 1 - i);
919 -+#ifdef CONFIG_SCHED_SMT
920 -+ if (static_branch_likely(&sched_smt_present) &&
921 -+ IDLE_TASK_SCHED_PRIO == watermark) {
922 -+ cpumask_t tmp;
923 -+
924 -+ cpumask_and(&tmp, cpu_smt_mask(cpu), sched_rq_watermark);
925 -+ if (cpumask_equal(&tmp, cpu_smt_mask(cpu)))
926 -+ cpumask_or(&sched_sg_idle_mask,
927 -+ &sched_sg_idle_mask, cpu_smt_mask(cpu));
928 -+ }
929 -+#endif
930 -+}
931 -+
932 -+/*
933 -+ * This routine assume that the idle task always in queue
934 -+ */
935 -+static inline struct task_struct *sched_rq_first_task(struct rq *rq)
936 -+{
937 -+ unsigned long idx = find_first_bit(rq->queue.bitmap, SCHED_QUEUE_BITS);
938 -+ const struct list_head *head = &rq->queue.heads[sched_prio2idx(idx, rq)];
939 -+
940 -+ return list_first_entry(head, struct task_struct, sq_node);
941 -+}
942 -+
943 -+static inline struct task_struct *
944 -+sched_rq_next_task(struct task_struct *p, struct rq *rq)
945 -+{
946 -+ unsigned long idx = p->sq_idx;
947 -+ struct list_head *head = &rq->queue.heads[idx];
948 -+
949 -+ if (list_is_last(&p->sq_node, head)) {
950 -+ idx = find_next_bit(rq->queue.bitmap, SCHED_QUEUE_BITS,
951 -+ sched_idx2prio(idx, rq) + 1);
952 -+ head = &rq->queue.heads[sched_prio2idx(idx, rq)];
953 -+
954 -+ return list_first_entry(head, struct task_struct, sq_node);
955 -+ }
956 -+
957 -+ return list_next_entry(p, sq_node);
958 -+}
959 -+
960 -+static inline struct task_struct *rq_runnable_task(struct rq *rq)
961 -+{
962 -+ struct task_struct *next = sched_rq_first_task(rq);
963 -+
964 -+ if (unlikely(next == rq->skip))
965 -+ next = sched_rq_next_task(next, rq);
966 -+
967 -+ return next;
968 -+}
969 -+
970 -+/*
971 -+ * Serialization rules:
972 -+ *
973 -+ * Lock order:
974 -+ *
975 -+ * p->pi_lock
976 -+ * rq->lock
977 -+ * hrtimer_cpu_base->lock (hrtimer_start() for bandwidth controls)
978 -+ *
979 -+ * rq1->lock
980 -+ * rq2->lock where: rq1 < rq2
981 -+ *
982 -+ * Regular state:
983 -+ *
984 -+ * Normal scheduling state is serialized by rq->lock. __schedule() takes the
985 -+ * local CPU's rq->lock, it optionally removes the task from the runqueue and
986 -+ * always looks at the local rq data structures to find the most eligible task
987 -+ * to run next.
988 -+ *
989 -+ * Task enqueue is also under rq->lock, possibly taken from another CPU.
990 -+ * Wakeups from another LLC domain might use an IPI to transfer the enqueue to
991 -+ * the local CPU to avoid bouncing the runqueue state around [ see
992 -+ * ttwu_queue_wakelist() ]
993 -+ *
994 -+ * Task wakeup, specifically wakeups that involve migration, are horribly
995 -+ * complicated to avoid having to take two rq->locks.
996 -+ *
997 -+ * Special state:
998 -+ *
999 -+ * System-calls and anything external will use task_rq_lock() which acquires
1000 -+ * both p->pi_lock and rq->lock. As a consequence the state they change is
1001 -+ * stable while holding either lock:
1002 -+ *
1003 -+ * - sched_setaffinity()/
1004 -+ * set_cpus_allowed_ptr(): p->cpus_ptr, p->nr_cpus_allowed
1005 -+ * - set_user_nice(): p->se.load, p->*prio
1006 -+ * - __sched_setscheduler(): p->sched_class, p->policy, p->*prio,
1007 -+ * p->se.load, p->rt_priority,
1008 -+ * p->dl.dl_{runtime, deadline, period, flags, bw, density}
1009 -+ * - sched_setnuma(): p->numa_preferred_nid
1010 -+ * - sched_move_task()/
1011 -+ * cpu_cgroup_fork(): p->sched_task_group
1012 -+ * - uclamp_update_active() p->uclamp*
1013 -+ *
1014 -+ * p->state <- TASK_*:
1015 -+ *
1016 -+ * is changed locklessly using set_current_state(), __set_current_state() or
1017 -+ * set_special_state(), see their respective comments, or by
1018 -+ * try_to_wake_up(). This latter uses p->pi_lock to serialize against
1019 -+ * concurrent self.
1020 -+ *
1021 -+ * p->on_rq <- { 0, 1 = TASK_ON_RQ_QUEUED, 2 = TASK_ON_RQ_MIGRATING }:
1022 -+ *
1023 -+ * is set by activate_task() and cleared by deactivate_task(), under
1024 -+ * rq->lock. Non-zero indicates the task is runnable, the special
1025 -+ * ON_RQ_MIGRATING state is used for migration without holding both
1026 -+ * rq->locks. It indicates task_cpu() is not stable, see task_rq_lock().
1027 -+ *
1028 -+ * p->on_cpu <- { 0, 1 }:
1029 -+ *
1030 -+ * is set by prepare_task() and cleared by finish_task() such that it will be
1031 -+ * set before p is scheduled-in and cleared after p is scheduled-out, both
1032 -+ * under rq->lock. Non-zero indicates the task is running on its CPU.
1033 -+ *
1034 -+ * [ The astute reader will observe that it is possible for two tasks on one
1035 -+ * CPU to have ->on_cpu = 1 at the same time. ]
1036 -+ *
1037 -+ * task_cpu(p): is changed by set_task_cpu(), the rules are:
1038 -+ *
1039 -+ * - Don't call set_task_cpu() on a blocked task:
1040 -+ *
1041 -+ * We don't care what CPU we're not running on, this simplifies hotplug,
1042 -+ * the CPU assignment of blocked tasks isn't required to be valid.
1043 -+ *
1044 -+ * - for try_to_wake_up(), called under p->pi_lock:
1045 -+ *
1046 -+ * This allows try_to_wake_up() to only take one rq->lock, see its comment.
1047 -+ *
1048 -+ * - for migration called under rq->lock:
1049 -+ * [ see task_on_rq_migrating() in task_rq_lock() ]
1050 -+ *
1051 -+ * o move_queued_task()
1052 -+ * o detach_task()
1053 -+ *
1054 -+ * - for migration called under double_rq_lock():
1055 -+ *
1056 -+ * o __migrate_swap_task()
1057 -+ * o push_rt_task() / pull_rt_task()
1058 -+ * o push_dl_task() / pull_dl_task()
1059 -+ * o dl_task_offline_migration()
1060 -+ *
1061 -+ */
1062 -+
1063 -+/*
1064 -+ * Context: p->pi_lock
1065 -+ */
1066 -+static inline struct rq
1067 -+*__task_access_lock(struct task_struct *p, raw_spinlock_t **plock)
1068 -+{
1069 -+ struct rq *rq;
1070 -+ for (;;) {
1071 -+ rq = task_rq(p);
1072 -+ if (p->on_cpu || task_on_rq_queued(p)) {
1073 -+ raw_spin_lock(&rq->lock);
1074 -+ if (likely((p->on_cpu || task_on_rq_queued(p))
1075 -+ && rq == task_rq(p))) {
1076 -+ *plock = &rq->lock;
1077 -+ return rq;
1078 -+ }
1079 -+ raw_spin_unlock(&rq->lock);
1080 -+ } else if (task_on_rq_migrating(p)) {
1081 -+ do {
1082 -+ cpu_relax();
1083 -+ } while (unlikely(task_on_rq_migrating(p)));
1084 -+ } else {
1085 -+ *plock = NULL;
1086 -+ return rq;
1087 -+ }
1088 -+ }
1089 -+}
1090 -+
1091 -+static inline void
1092 -+__task_access_unlock(struct task_struct *p, raw_spinlock_t *lock)
1093 -+{
1094 -+ if (NULL != lock)
1095 -+ raw_spin_unlock(lock);
1096 -+}
1097 -+
1098 -+static inline struct rq
1099 -+*task_access_lock_irqsave(struct task_struct *p, raw_spinlock_t **plock,
1100 -+ unsigned long *flags)
1101 -+{
1102 -+ struct rq *rq;
1103 -+ for (;;) {
1104 -+ rq = task_rq(p);
1105 -+ if (p->on_cpu || task_on_rq_queued(p)) {
1106 -+ raw_spin_lock_irqsave(&rq->lock, *flags);
1107 -+ if (likely((p->on_cpu || task_on_rq_queued(p))
1108 -+ && rq == task_rq(p))) {
1109 -+ *plock = &rq->lock;
1110 -+ return rq;
1111 -+ }
1112 -+ raw_spin_unlock_irqrestore(&rq->lock, *flags);
1113 -+ } else if (task_on_rq_migrating(p)) {
1114 -+ do {
1115 -+ cpu_relax();
1116 -+ } while (unlikely(task_on_rq_migrating(p)));
1117 -+ } else {
1118 -+ raw_spin_lock_irqsave(&p->pi_lock, *flags);
1119 -+ if (likely(!p->on_cpu && !p->on_rq &&
1120 -+ rq == task_rq(p))) {
1121 -+ *plock = &p->pi_lock;
1122 -+ return rq;
1123 -+ }
1124 -+ raw_spin_unlock_irqrestore(&p->pi_lock, *flags);
1125 -+ }
1126 -+ }
1127 -+}
1128 -+
1129 -+static inline void
1130 -+task_access_unlock_irqrestore(struct task_struct *p, raw_spinlock_t *lock,
1131 -+ unsigned long *flags)
1132 -+{
1133 -+ raw_spin_unlock_irqrestore(lock, *flags);
1134 -+}
1135 -+
1136 -+/*
1137 -+ * __task_rq_lock - lock the rq @p resides on.
1138 -+ */
1139 -+struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
1140 -+ __acquires(rq->lock)
1141 -+{
1142 -+ struct rq *rq;
1143 -+
1144 -+ lockdep_assert_held(&p->pi_lock);
1145 -+
1146 -+ for (;;) {
1147 -+ rq = task_rq(p);
1148 -+ raw_spin_lock(&rq->lock);
1149 -+ if (likely(rq == task_rq(p) && !task_on_rq_migrating(p)))
1150 -+ return rq;
1151 -+ raw_spin_unlock(&rq->lock);
1152 -+
1153 -+ while (unlikely(task_on_rq_migrating(p)))
1154 -+ cpu_relax();
1155 -+ }
1156 -+}
1157 -+
1158 -+/*
1159 -+ * task_rq_lock - lock p->pi_lock and lock the rq @p resides on.
1160 -+ */
1161 -+struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
1162 -+ __acquires(p->pi_lock)
1163 -+ __acquires(rq->lock)
1164 -+{
1165 -+ struct rq *rq;
1166 -+
1167 -+ for (;;) {
1168 -+ raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
1169 -+ rq = task_rq(p);
1170 -+ raw_spin_lock(&rq->lock);
1171 -+ /*
1172 -+ * move_queued_task() task_rq_lock()
1173 -+ *
1174 -+ * ACQUIRE (rq->lock)
1175 -+ * [S] ->on_rq = MIGRATING [L] rq = task_rq()
1176 -+ * WMB (__set_task_cpu()) ACQUIRE (rq->lock);
1177 -+ * [S] ->cpu = new_cpu [L] task_rq()
1178 -+ * [L] ->on_rq
1179 -+ * RELEASE (rq->lock)
1180 -+ *
1181 -+ * If we observe the old CPU in task_rq_lock(), the acquire of
1182 -+ * the old rq->lock will fully serialize against the stores.
1183 -+ *
1184 -+ * If we observe the new CPU in task_rq_lock(), the address
1185 -+ * dependency headed by '[L] rq = task_rq()' and the acquire
1186 -+ * will pair with the WMB to ensure we then also see migrating.
1187 -+ */
1188 -+ if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
1189 -+ return rq;
1190 -+ }
1191 -+ raw_spin_unlock(&rq->lock);
1192 -+ raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
1193 -+
1194 -+ while (unlikely(task_on_rq_migrating(p)))
1195 -+ cpu_relax();
1196 -+ }
1197 -+}
1198 -+
1199 -+static inline void
1200 -+rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
1201 -+ __acquires(rq->lock)
1202 -+{
1203 -+ raw_spin_lock_irqsave(&rq->lock, rf->flags);
1204 -+}
1205 -+
1206 -+static inline void
1207 -+rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
1208 -+ __releases(rq->lock)
1209 -+{
1210 -+ raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
1211 -+}
1212 -+
1213 -+void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
1214 -+{
1215 -+ raw_spinlock_t *lock;
1216 -+
1217 -+ /* Matches synchronize_rcu() in __sched_core_enable() */
1218 -+ preempt_disable();
1219 -+
1220 -+ for (;;) {
1221 -+ lock = __rq_lockp(rq);
1222 -+ raw_spin_lock_nested(lock, subclass);
1223 -+ if (likely(lock == __rq_lockp(rq))) {
1224 -+ /* preempt_count *MUST* be > 1 */
1225 -+ preempt_enable_no_resched();
1226 -+ return;
1227 -+ }
1228 -+ raw_spin_unlock(lock);
1229 -+ }
1230 -+}
1231 -+
1232 -+void raw_spin_rq_unlock(struct rq *rq)
1233 -+{
1234 -+ raw_spin_unlock(rq_lockp(rq));
1235 -+}
1236 -+
1237 -+/*
1238 -+ * RQ-clock updating methods:
1239 -+ */
1240 -+
1241 -+static void update_rq_clock_task(struct rq *rq, s64 delta)
1242 -+{
1243 -+/*
1244 -+ * In theory, the compile should just see 0 here, and optimize out the call
1245 -+ * to sched_rt_avg_update. But I don't trust it...
1246 -+ */
1247 -+ s64 __maybe_unused steal = 0, irq_delta = 0;
1248 -+
1249 -+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
1250 -+ irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
1251 -+
1252 -+ /*
1253 -+ * Since irq_time is only updated on {soft,}irq_exit, we might run into
1254 -+ * this case when a previous update_rq_clock() happened inside a
1255 -+ * {soft,}irq region.
1256 -+ *
1257 -+ * When this happens, we stop ->clock_task and only update the
1258 -+ * prev_irq_time stamp to account for the part that fit, so that a next
1259 -+ * update will consume the rest. This ensures ->clock_task is
1260 -+ * monotonic.
1261 -+ *
1262 -+ * It does however cause some slight miss-attribution of {soft,}irq
1263 -+ * time, a more accurate solution would be to update the irq_time using
1264 -+ * the current rq->clock timestamp, except that would require using
1265 -+ * atomic ops.
1266 -+ */
1267 -+ if (irq_delta > delta)
1268 -+ irq_delta = delta;
1269 -+
1270 -+ rq->prev_irq_time += irq_delta;
1271 -+ delta -= irq_delta;
1272 -+#endif
1273 -+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
1274 -+ if (static_key_false((&paravirt_steal_rq_enabled))) {
1275 -+ steal = paravirt_steal_clock(cpu_of(rq));
1276 -+ steal -= rq->prev_steal_time_rq;
1277 -+
1278 -+ if (unlikely(steal > delta))
1279 -+ steal = delta;
1280 -+
1281 -+ rq->prev_steal_time_rq += steal;
1282 -+ delta -= steal;
1283 -+ }
1284 -+#endif
1285 -+
1286 -+ rq->clock_task += delta;
1287 -+
1288 -+#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
1289 -+ if ((irq_delta + steal))
1290 -+ update_irq_load_avg(rq, irq_delta + steal);
1291 -+#endif
1292 -+}
1293 -+
1294 -+static inline void update_rq_clock(struct rq *rq)
1295 -+{
1296 -+ s64 delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
1297 -+
1298 -+ if (unlikely(delta <= 0))
1299 -+ return;
1300 -+ rq->clock += delta;
1301 -+ update_rq_time_edge(rq);
1302 -+ update_rq_clock_task(rq, delta);
1303 -+}
1304 -+
1305 -+/*
1306 -+ * RQ Load update routine
1307 -+ */
1308 -+#define RQ_LOAD_HISTORY_BITS (sizeof(s32) * 8ULL)
1309 -+#define RQ_UTIL_SHIFT (8)
1310 -+#define RQ_LOAD_HISTORY_TO_UTIL(l) (((l) >> (RQ_LOAD_HISTORY_BITS - 1 - RQ_UTIL_SHIFT)) & 0xff)
1311 -+
1312 -+#define LOAD_BLOCK(t) ((t) >> 17)
1313 -+#define LOAD_HALF_BLOCK(t) ((t) >> 16)
1314 -+#define BLOCK_MASK(t) ((t) & ((0x01 << 18) - 1))
1315 -+#define LOAD_BLOCK_BIT(b) (1UL << (RQ_LOAD_HISTORY_BITS - 1 - (b)))
1316 -+#define CURRENT_LOAD_BIT LOAD_BLOCK_BIT(0)
1317 -+
1318 -+static inline void rq_load_update(struct rq *rq)
1319 -+{
1320 -+ u64 time = rq->clock;
1321 -+ u64 delta = min(LOAD_BLOCK(time) - LOAD_BLOCK(rq->load_stamp),
1322 -+ RQ_LOAD_HISTORY_BITS - 1);
1323 -+ u64 prev = !!(rq->load_history & CURRENT_LOAD_BIT);
1324 -+ u64 curr = !!cpu_rq(rq->cpu)->nr_running;
1325 -+
1326 -+ if (delta) {
1327 -+ rq->load_history = rq->load_history >> delta;
1328 -+
1329 -+ if (delta < RQ_UTIL_SHIFT) {
1330 -+ rq->load_block += (~BLOCK_MASK(rq->load_stamp)) * prev;
1331 -+ if (!!LOAD_HALF_BLOCK(rq->load_block) ^ curr)
1332 -+ rq->load_history ^= LOAD_BLOCK_BIT(delta);
1333 -+ }
1334 -+
1335 -+ rq->load_block = BLOCK_MASK(time) * prev;
1336 -+ } else {
1337 -+ rq->load_block += (time - rq->load_stamp) * prev;
1338 -+ }
1339 -+ if (prev ^ curr)
1340 -+ rq->load_history ^= CURRENT_LOAD_BIT;
1341 -+ rq->load_stamp = time;
1342 -+}
1343 -+
1344 -+unsigned long rq_load_util(struct rq *rq, unsigned long max)
1345 -+{
1346 -+ return RQ_LOAD_HISTORY_TO_UTIL(rq->load_history) * (max >> RQ_UTIL_SHIFT);
1347 -+}
1348 -+
1349 -+#ifdef CONFIG_SMP
1350 -+unsigned long sched_cpu_util(int cpu, unsigned long max)
1351 -+{
1352 -+ return rq_load_util(cpu_rq(cpu), max);
1353 -+}
1354 -+#endif /* CONFIG_SMP */
1355 -+
1356 -+#ifdef CONFIG_CPU_FREQ
1357 -+/**
1358 -+ * cpufreq_update_util - Take a note about CPU utilization changes.
1359 -+ * @rq: Runqueue to carry out the update for.
1360 -+ * @flags: Update reason flags.
1361 -+ *
1362 -+ * This function is called by the scheduler on the CPU whose utilization is
1363 -+ * being updated.
1364 -+ *
1365 -+ * It can only be called from RCU-sched read-side critical sections.
1366 -+ *
1367 -+ * The way cpufreq is currently arranged requires it to evaluate the CPU
1368 -+ * performance state (frequency/voltage) on a regular basis to prevent it from
1369 -+ * being stuck in a completely inadequate performance level for too long.
1370 -+ * That is not guaranteed to happen if the updates are only triggered from CFS
1371 -+ * and DL, though, because they may not be coming in if only RT tasks are
1372 -+ * active all the time (or there are RT tasks only).
1373 -+ *
1374 -+ * As a workaround for that issue, this function is called periodically by the
1375 -+ * RT sched class to trigger extra cpufreq updates to prevent it from stalling,
1376 -+ * but that really is a band-aid. Going forward it should be replaced with
1377 -+ * solutions targeted more specifically at RT tasks.
1378 -+ */
1379 -+static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
1380 -+{
1381 -+ struct update_util_data *data;
1382 -+
1383 -+#ifdef CONFIG_SMP
1384 -+ rq_load_update(rq);
1385 -+#endif
1386 -+ data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
1387 -+ cpu_of(rq)));
1388 -+ if (data)
1389 -+ data->func(data, rq_clock(rq), flags);
1390 -+}
1391 -+#else
1392 -+static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
1393 -+{
1394 -+#ifdef CONFIG_SMP
1395 -+ rq_load_update(rq);
1396 -+#endif
1397 -+}
1398 -+#endif /* CONFIG_CPU_FREQ */
1399 -+
1400 -+#ifdef CONFIG_NO_HZ_FULL
1401 -+/*
1402 -+ * Tick may be needed by tasks in the runqueue depending on their policy and
1403 -+ * requirements. If tick is needed, lets send the target an IPI to kick it out
1404 -+ * of nohz mode if necessary.
1405 -+ */
1406 -+static inline void sched_update_tick_dependency(struct rq *rq)
1407 -+{
1408 -+ int cpu = cpu_of(rq);
1409 -+
1410 -+ if (!tick_nohz_full_cpu(cpu))
1411 -+ return;
1412 -+
1413 -+ if (rq->nr_running < 2)
1414 -+ tick_nohz_dep_clear_cpu(cpu, TICK_DEP_BIT_SCHED);
1415 -+ else
1416 -+ tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
1417 -+}
1418 -+#else /* !CONFIG_NO_HZ_FULL */
1419 -+static inline void sched_update_tick_dependency(struct rq *rq) { }
1420 -+#endif
1421 -+
1422 -+bool sched_task_on_rq(struct task_struct *p)
1423 -+{
1424 -+ return task_on_rq_queued(p);
1425 -+}
1426 -+
1427 -+/*
1428 -+ * Add/Remove/Requeue task to/from the runqueue routines
1429 -+ * Context: rq->lock
1430 -+ */
1431 -+#define __SCHED_DEQUEUE_TASK(p, rq, flags, func) \
1432 -+ psi_dequeue(p, flags & DEQUEUE_SLEEP); \
1433 -+ sched_info_dequeue(rq, p); \
1434 -+ \
1435 -+ list_del(&p->sq_node); \
1436 -+ if (list_empty(&rq->queue.heads[p->sq_idx])) { \
1437 -+ clear_bit(sched_idx2prio(p->sq_idx, rq), \
1438 -+ rq->queue.bitmap); \
1439 -+ func; \
1440 -+ }
1441 -+
1442 -+#define __SCHED_ENQUEUE_TASK(p, rq, flags) \
1443 -+ sched_info_enqueue(rq, p); \
1444 -+ psi_enqueue(p, flags); \
1445 -+ \
1446 -+ p->sq_idx = task_sched_prio_idx(p, rq); \
1447 -+ list_add_tail(&p->sq_node, &rq->queue.heads[p->sq_idx]); \
1448 -+ set_bit(sched_idx2prio(p->sq_idx, rq), rq->queue.bitmap);
1449 -+
1450 -+static inline void dequeue_task(struct task_struct *p, struct rq *rq, int flags)
1451 -+{
1452 -+ lockdep_assert_held(&rq->lock);
1453 -+
1454 -+ /*printk(KERN_INFO "sched: dequeue(%d) %px %016llx\n", cpu_of(rq), p, p->priodl);*/
1455 -+ WARN_ONCE(task_rq(p) != rq, "sched: dequeue task reside on cpu%d from cpu%d\n",
1456 -+ task_cpu(p), cpu_of(rq));
1457 -+
1458 -+ __SCHED_DEQUEUE_TASK(p, rq, flags, update_sched_rq_watermark(rq));
1459 -+ --rq->nr_running;
1460 -+#ifdef CONFIG_SMP
1461 -+ if (1 == rq->nr_running)
1462 -+ cpumask_clear_cpu(cpu_of(rq), &sched_rq_pending_mask);
1463 -+#endif
1464 -+
1465 -+ sched_update_tick_dependency(rq);
1466 -+}
1467 -+
1468 -+static inline void enqueue_task(struct task_struct *p, struct rq *rq, int flags)
1469 -+{
1470 -+ lockdep_assert_held(&rq->lock);
1471 -+
1472 -+ /*printk(KERN_INFO "sched: enqueue(%d) %px %016llx\n", cpu_of(rq), p, p->priodl);*/
1473 -+ WARN_ONCE(task_rq(p) != rq, "sched: enqueue task reside on cpu%d to cpu%d\n",
1474 -+ task_cpu(p), cpu_of(rq));
1475 -+
1476 -+ __SCHED_ENQUEUE_TASK(p, rq, flags);
1477 -+ update_sched_rq_watermark(rq);
1478 -+ ++rq->nr_running;
1479 -+#ifdef CONFIG_SMP
1480 -+ if (2 == rq->nr_running)
1481 -+ cpumask_set_cpu(cpu_of(rq), &sched_rq_pending_mask);
1482 -+#endif
1483 -+
1484 -+ sched_update_tick_dependency(rq);
1485 -+}
1486 -+
1487 -+static inline void requeue_task(struct task_struct *p, struct rq *rq)
1488 -+{
1489 -+ int idx;
1490 -+
1491 -+ lockdep_assert_held(&rq->lock);
1492 -+ /*printk(KERN_INFO "sched: requeue(%d) %px %016llx\n", cpu_of(rq), p, p->priodl);*/
1493 -+ WARN_ONCE(task_rq(p) != rq, "sched: cpu[%d] requeue task reside on cpu%d\n",
1494 -+ cpu_of(rq), task_cpu(p));
1495 -+
1496 -+ idx = task_sched_prio_idx(p, rq);
1497 -+
1498 -+ list_del(&p->sq_node);
1499 -+ list_add_tail(&p->sq_node, &rq->queue.heads[idx]);
1500 -+ if (idx != p->sq_idx) {
1501 -+ if (list_empty(&rq->queue.heads[p->sq_idx]))
1502 -+ clear_bit(sched_idx2prio(p->sq_idx, rq),
1503 -+ rq->queue.bitmap);
1504 -+ p->sq_idx = idx;
1505 -+ set_bit(sched_idx2prio(p->sq_idx, rq), rq->queue.bitmap);
1506 -+ update_sched_rq_watermark(rq);
1507 -+ }
1508 -+}
1509 -+
1510 -+/*
1511 -+ * cmpxchg based fetch_or, macro so it works for different integer types
1512 -+ */
1513 -+#define fetch_or(ptr, mask) \
1514 -+ ({ \
1515 -+ typeof(ptr) _ptr = (ptr); \
1516 -+ typeof(mask) _mask = (mask); \
1517 -+ typeof(*_ptr) _old, _val = *_ptr; \
1518 -+ \
1519 -+ for (;;) { \
1520 -+ _old = cmpxchg(_ptr, _val, _val | _mask); \
1521 -+ if (_old == _val) \
1522 -+ break; \
1523 -+ _val = _old; \
1524 -+ } \
1525 -+ _old; \
1526 -+})
1527 -+
1528 -+#if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
1529 -+/*
1530 -+ * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
1531 -+ * this avoids any races wrt polling state changes and thereby avoids
1532 -+ * spurious IPIs.
1533 -+ */
1534 -+static bool set_nr_and_not_polling(struct task_struct *p)
1535 -+{
1536 -+ struct thread_info *ti = task_thread_info(p);
1537 -+ return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
1538 -+}
1539 -+
1540 -+/*
1541 -+ * Atomically set TIF_NEED_RESCHED if TIF_POLLING_NRFLAG is set.
1542 -+ *
1543 -+ * If this returns true, then the idle task promises to call
1544 -+ * sched_ttwu_pending() and reschedule soon.
1545 -+ */
1546 -+static bool set_nr_if_polling(struct task_struct *p)
1547 -+{
1548 -+ struct thread_info *ti = task_thread_info(p);
1549 -+ typeof(ti->flags) old, val = READ_ONCE(ti->flags);
1550 -+
1551 -+ for (;;) {
1552 -+ if (!(val & _TIF_POLLING_NRFLAG))
1553 -+ return false;
1554 -+ if (val & _TIF_NEED_RESCHED)
1555 -+ return true;
1556 -+ old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
1557 -+ if (old == val)
1558 -+ break;
1559 -+ val = old;
1560 -+ }
1561 -+ return true;
1562 -+}
1563 -+
1564 -+#else
1565 -+static bool set_nr_and_not_polling(struct task_struct *p)
1566 -+{
1567 -+ set_tsk_need_resched(p);
1568 -+ return true;
1569 -+}
1570 -+
1571 -+#ifdef CONFIG_SMP
1572 -+static bool set_nr_if_polling(struct task_struct *p)
1573 -+{
1574 -+ return false;
1575 -+}
1576 -+#endif
1577 -+#endif
1578 -+
1579 -+static bool __wake_q_add(struct wake_q_head *head, struct task_struct *task)
1580 -+{
1581 -+ struct wake_q_node *node = &task->wake_q;
1582 -+
1583 -+ /*
1584 -+ * Atomically grab the task, if ->wake_q is !nil already it means
1585 -+ * it's already queued (either by us or someone else) and will get the
1586 -+ * wakeup due to that.
1587 -+ *
1588 -+ * In order to ensure that a pending wakeup will observe our pending
1589 -+ * state, even in the failed case, an explicit smp_mb() must be used.
1590 -+ */
1591 -+ smp_mb__before_atomic();
1592 -+ if (unlikely(cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL)))
1593 -+ return false;
1594 -+
1595 -+ /*
1596 -+ * The head is context local, there can be no concurrency.
1597 -+ */
1598 -+ *head->lastp = node;
1599 -+ head->lastp = &node->next;
1600 -+ return true;
1601 -+}
1602 -+
1603 -+/**
1604 -+ * wake_q_add() - queue a wakeup for 'later' waking.
1605 -+ * @head: the wake_q_head to add @task to
1606 -+ * @task: the task to queue for 'later' wakeup
1607 -+ *
1608 -+ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
1609 -+ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
1610 -+ * instantly.
1611 -+ *
1612 -+ * This function must be used as-if it were wake_up_process(); IOW the task
1613 -+ * must be ready to be woken at this location.
1614 -+ */
1615 -+void wake_q_add(struct wake_q_head *head, struct task_struct *task)
1616 -+{
1617 -+ if (__wake_q_add(head, task))
1618 -+ get_task_struct(task);
1619 -+}
1620 -+
1621 -+/**
1622 -+ * wake_q_add_safe() - safely queue a wakeup for 'later' waking.
1623 -+ * @head: the wake_q_head to add @task to
1624 -+ * @task: the task to queue for 'later' wakeup
1625 -+ *
1626 -+ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
1627 -+ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
1628 -+ * instantly.
1629 -+ *
1630 -+ * This function must be used as-if it were wake_up_process(); IOW the task
1631 -+ * must be ready to be woken at this location.
1632 -+ *
1633 -+ * This function is essentially a task-safe equivalent to wake_q_add(). Callers
1634 -+ * that already hold reference to @task can call the 'safe' version and trust
1635 -+ * wake_q to do the right thing depending whether or not the @task is already
1636 -+ * queued for wakeup.
1637 -+ */
1638 -+void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task)
1639 -+{
1640 -+ if (!__wake_q_add(head, task))
1641 -+ put_task_struct(task);
1642 -+}
1643 -+
1644 -+void wake_up_q(struct wake_q_head *head)
1645 -+{
1646 -+ struct wake_q_node *node = head->first;
1647 -+
1648 -+ while (node != WAKE_Q_TAIL) {
1649 -+ struct task_struct *task;
1650 -+
1651 -+ task = container_of(node, struct task_struct, wake_q);
1652 -+ /* task can safely be re-inserted now: */
1653 -+ node = node->next;
1654 -+ task->wake_q.next = NULL;
1655 -+
1656 -+ /*
1657 -+ * wake_up_process() executes a full barrier, which pairs with
1658 -+ * the queueing in wake_q_add() so as not to miss wakeups.
1659 -+ */
1660 -+ wake_up_process(task);
1661 -+ put_task_struct(task);
1662 -+ }
1663 -+}
1664 -+
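wake_q_add()/wake_up_q() above implement deferred wakeups: tasks are chained onto a context-local list while locks are held, and the real wake_up_process() calls happen later, after the locks are dropped. A rough userspace analogue of that pattern is sketched below; the structure and function names are invented for illustration, and the task reference counting of the real code is left out.

#include <stddef.h>
#include <stdio.h>

#define WAKE_Q_TAIL ((struct wake_node *)0x1)   /* sentinel: "queued, end of list" */

struct wake_node {
        struct wake_node *next;
        const char *name;               /* stands in for the task */
};

struct wake_head {
        struct wake_node *first;
        struct wake_node **lastp;
};

static void wake_head_init(struct wake_head *h)
{
        h->first = WAKE_Q_TAIL;
        h->lastp = &h->first;
}

/* Claim the node with a CAS on ->next; if it is already non-NULL the
 * entry is queued somewhere and must not be added twice. */
static int wake_head_add(struct wake_head *h, struct wake_node *n)
{
        struct wake_node *expected = NULL;

        if (!__atomic_compare_exchange_n(&n->next, &expected, WAKE_Q_TAIL, 0,
                                         __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
                return 0;       /* already queued */
        *h->lastp = n;          /* the head is context-local: no concurrency here */
        h->lastp = &n->next;
        return 1;
}

/* Later, outside any locks, walk the list and "wake" each entry. */
static void wake_head_run(struct wake_head *h)
{
        struct wake_node *n = h->first;

        while (n != WAKE_Q_TAIL) {
                struct wake_node *next = n->next;

                n->next = NULL;                 /* node may be reused now */
                printf("waking %s\n", n->name); /* stands in for wake_up_process() */
                n = next;
        }
}

int main(void)
{
        struct wake_node a = { NULL, "task-a" }, b = { NULL, "task-b" };
        struct wake_head h;

        wake_head_init(&h);
        wake_head_add(&h, &a);
        wake_head_add(&h, &b);
        wake_head_add(&h, &a);  /* duplicate add is rejected */
        wake_head_run(&h);
        return 0;
}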
1665 -+/*
1666 -+ * resched_curr - mark rq's current task 'to be rescheduled now'.
1667 -+ *
1668 -+ * On UP this means the setting of the need_resched flag, on SMP it
1669 -+ * might also involve a cross-CPU call to trigger the scheduler on
1670 -+ * the target CPU.
1671 -+ */
1672 -+void resched_curr(struct rq *rq)
1673 -+{
1674 -+ struct task_struct *curr = rq->curr;
1675 -+ int cpu;
1676 -+
1677 -+ lockdep_assert_held(&rq->lock);
1678 -+
1679 -+ if (test_tsk_need_resched(curr))
1680 -+ return;
1681 -+
1682 -+ cpu = cpu_of(rq);
1683 -+ if (cpu == smp_processor_id()) {
1684 -+ set_tsk_need_resched(curr);
1685 -+ set_preempt_need_resched();
1686 -+ return;
1687 -+ }
1688 -+
1689 -+ if (set_nr_and_not_polling(curr))
1690 -+ smp_send_reschedule(cpu);
1691 -+ else
1692 -+ trace_sched_wake_idle_without_ipi(cpu);
1693 -+}
1694 -+
1695 -+void resched_cpu(int cpu)
1696 -+{
1697 -+ struct rq *rq = cpu_rq(cpu);
1698 -+ unsigned long flags;
1699 -+
1700 -+ raw_spin_lock_irqsave(&rq->lock, flags);
1701 -+ if (cpu_online(cpu) || cpu == smp_processor_id())
1702 -+ resched_curr(cpu_rq(cpu));
1703 -+ raw_spin_unlock_irqrestore(&rq->lock, flags);
1704 -+}
1705 -+
1706 -+#ifdef CONFIG_SMP
1707 -+#ifdef CONFIG_NO_HZ_COMMON
1708 -+void nohz_balance_enter_idle(int cpu) {}
1709 -+
1710 -+void select_nohz_load_balancer(int stop_tick) {}
1711 -+
1712 -+void set_cpu_sd_state_idle(void) {}
1713 -+
1714 -+/*
1715 -+ * In the semi idle case, use the nearest busy CPU for migrating timers
1716 -+ * from an idle CPU. This is good for power-savings.
1717 -+ *
1718 -+ * We don't do a similar optimization for a completely idle system, as
1719 -+ * selecting an idle CPU will add more delays to the timers than intended
1720 -+ * (as that CPU's timer base may not be up to date wrt jiffies etc).
1721 -+ */
1722 -+int get_nohz_timer_target(void)
1723 -+{
1724 -+ int i, cpu = smp_processor_id(), default_cpu = -1;
1725 -+ struct cpumask *mask;
1726 -+
1727 -+ if (housekeeping_cpu(cpu, HK_FLAG_TIMER)) {
1728 -+ if (!idle_cpu(cpu))
1729 -+ return cpu;
1730 -+ default_cpu = cpu;
1731 -+ }
1732 -+
1733 -+ for (mask = per_cpu(sched_cpu_topo_masks, cpu) + 1;
1734 -+ mask < per_cpu(sched_cpu_topo_end_mask, cpu); mask++)
1735 -+ for_each_cpu_and(i, mask, housekeeping_cpumask(HK_FLAG_TIMER))
1736 -+ if (!idle_cpu(i))
1737 -+ return i;
1738 -+
1739 -+ if (default_cpu == -1)
1740 -+ default_cpu = housekeeping_any_cpu(HK_FLAG_TIMER);
1741 -+ cpu = default_cpu;
1742 -+
1743 -+ return cpu;
1744 -+}
1745 -+
1746 -+/*
1747 -+ * When add_timer_on() enqueues a timer into the timer wheel of an
1748 -+ * idle CPU then this timer might expire before the next timer event
1749 -+ * which is scheduled to wake up that CPU. In case of a completely
1750 -+ * idle system the next event might even be infinite time into the
1751 -+ * future. wake_up_idle_cpu() ensures that the CPU is woken up and
1752 -+ * leaves the inner idle loop so the newly added timer is taken into
1753 -+ * account when the CPU goes back to idle and evaluates the timer
1754 -+ * wheel for the next timer event.
1755 -+ */
1756 -+static inline void wake_up_idle_cpu(int cpu)
1757 -+{
1758 -+ struct rq *rq = cpu_rq(cpu);
1759 -+
1760 -+ if (cpu == smp_processor_id())
1761 -+ return;
1762 -+
1763 -+ if (set_nr_and_not_polling(rq->idle))
1764 -+ smp_send_reschedule(cpu);
1765 -+ else
1766 -+ trace_sched_wake_idle_without_ipi(cpu);
1767 -+}
1768 -+
1769 -+static inline bool wake_up_full_nohz_cpu(int cpu)
1770 -+{
1771 -+ /*
1772 -+ * We just need the target to call irq_exit() and re-evaluate
1773 -+ * the next tick. The nohz full kick at least implies that.
1774 -+ * If needed we can still optimize that later with an
1775 -+ * empty IRQ.
1776 -+ */
1777 -+ if (cpu_is_offline(cpu))
1778 -+ return true; /* Don't try to wake offline CPUs. */
1779 -+ if (tick_nohz_full_cpu(cpu)) {
1780 -+ if (cpu != smp_processor_id() ||
1781 -+ tick_nohz_tick_stopped())
1782 -+ tick_nohz_full_kick_cpu(cpu);
1783 -+ return true;
1784 -+ }
1785 -+
1786 -+ return false;
1787 -+}
1788 -+
1789 -+void wake_up_nohz_cpu(int cpu)
1790 -+{
1791 -+ if (!wake_up_full_nohz_cpu(cpu))
1792 -+ wake_up_idle_cpu(cpu);
1793 -+}
1794 -+
1795 -+static void nohz_csd_func(void *info)
1796 -+{
1797 -+ struct rq *rq = info;
1798 -+ int cpu = cpu_of(rq);
1799 -+ unsigned int flags;
1800 -+
1801 -+ /*
1802 -+ * Release the rq::nohz_csd.
1803 -+ */
1804 -+ flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));
1805 -+ WARN_ON(!(flags & NOHZ_KICK_MASK));
1806 -+
1807 -+ rq->idle_balance = idle_cpu(cpu);
1808 -+ if (rq->idle_balance && !need_resched()) {
1809 -+ rq->nohz_idle_balance = flags;
1810 -+ raise_softirq_irqoff(SCHED_SOFTIRQ);
1811 -+ }
1812 -+}
1813 -+
1814 -+#endif /* CONFIG_NO_HZ_COMMON */
1815 -+#endif /* CONFIG_SMP */
1816 -+
1817 -+static inline void check_preempt_curr(struct rq *rq)
1818 -+{
1819 -+ if (sched_rq_first_task(rq) != rq->curr)
1820 -+ resched_curr(rq);
1821 -+}
1822 -+
1823 -+#ifdef CONFIG_SCHED_HRTICK
1824 -+/*
1825 -+ * Use HR-timers to deliver accurate preemption points.
1826 -+ */
1827 -+
1828 -+static void hrtick_clear(struct rq *rq)
1829 -+{
1830 -+ if (hrtimer_active(&rq->hrtick_timer))
1831 -+ hrtimer_cancel(&rq->hrtick_timer);
1832 -+}
1833 -+
1834 -+/*
1835 -+ * High-resolution timer tick.
1836 -+ * Runs from hardirq context with interrupts disabled.
1837 -+ */
1838 -+static enum hrtimer_restart hrtick(struct hrtimer *timer)
1839 -+{
1840 -+ struct rq *rq = container_of(timer, struct rq, hrtick_timer);
1841 -+
1842 -+ WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
1843 -+
1844 -+ raw_spin_lock(&rq->lock);
1845 -+ resched_curr(rq);
1846 -+ raw_spin_unlock(&rq->lock);
1847 -+
1848 -+ return HRTIMER_NORESTART;
1849 -+}
1850 -+
1851 -+/*
1852 -+ * Use hrtick when:
1853 -+ * - enabled by features
1854 -+ * - hrtimer is actually high res
1855 -+ */
1856 -+static inline int hrtick_enabled(struct rq *rq)
1857 -+{
1858 -+ /**
1859 -+ * Alt schedule FW doesn't support sched_feat yet
1860 -+ if (!sched_feat(HRTICK))
1861 -+ return 0;
1862 -+ */
1863 -+ if (!cpu_active(cpu_of(rq)))
1864 -+ return 0;
1865 -+ return hrtimer_is_hres_active(&rq->hrtick_timer);
1866 -+}
1867 -+
1868 -+#ifdef CONFIG_SMP
1869 -+
1870 -+static void __hrtick_restart(struct rq *rq)
1871 -+{
1872 -+ struct hrtimer *timer = &rq->hrtick_timer;
1873 -+ ktime_t time = rq->hrtick_time;
1874 -+
1875 -+ hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
1876 -+}
1877 -+
1878 -+/*
1879 -+ * called from hardirq (IPI) context
1880 -+ */
1881 -+static void __hrtick_start(void *arg)
1882 -+{
1883 -+ struct rq *rq = arg;
1884 -+
1885 -+ raw_spin_lock(&rq->lock);
1886 -+ __hrtick_restart(rq);
1887 -+ raw_spin_unlock(&rq->lock);
1888 -+}
1889 -+
1890 -+/*
1891 -+ * Called to set the hrtick timer state.
1892 -+ *
1893 -+ * called with rq->lock held and irqs disabled
1894 -+ */
1895 -+void hrtick_start(struct rq *rq, u64 delay)
1896 -+{
1897 -+ struct hrtimer *timer = &rq->hrtick_timer;
1898 -+ s64 delta;
1899 -+
1900 -+ /*
1901 -+ * Don't schedule slices shorter than 10000ns, that just
1902 -+ * doesn't make sense and can cause timer DoS.
1903 -+ */
1904 -+ delta = max_t(s64, delay, 10000LL);
1905 -+
1906 -+ rq->hrtick_time = ktime_add_ns(timer->base->get_time(), delta);
1907 -+
1908 -+ if (rq == this_rq())
1909 -+ __hrtick_restart(rq);
1910 -+ else
1911 -+ smp_call_function_single_async(cpu_of(rq), &rq->hrtick_csd);
1912 -+}
1913 -+
1914 -+#else
1915 -+/*
1916 -+ * Called to set the hrtick timer state.
1917 -+ *
1918 -+ * called with rq->lock held and irqs disabled
1919 -+ */
1920 -+void hrtick_start(struct rq *rq, u64 delay)
1921 -+{
1922 -+ /*
1923 -+ * Don't schedule slices shorter than 10000ns, that just
1924 -+ * doesn't make sense. Rely on vruntime for fairness.
1925 -+ */
1926 -+ delay = max_t(u64, delay, 10000LL);
1927 -+ hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay),
1928 -+ HRTIMER_MODE_REL_PINNED_HARD);
1929 -+}
1930 -+#endif /* CONFIG_SMP */
1931 -+
1932 -+static void hrtick_rq_init(struct rq *rq)
1933 -+{
1934 -+#ifdef CONFIG_SMP
1935 -+ INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
1936 -+#endif
1937 -+
1938 -+ hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
1939 -+ rq->hrtick_timer.function = hrtick;
1940 -+}
1941 -+#else /* CONFIG_SCHED_HRTICK */
1942 -+static inline int hrtick_enabled(struct rq *rq)
1943 -+{
1944 -+ return 0;
1945 -+}
1946 -+
1947 -+static inline void hrtick_clear(struct rq *rq)
1948 -+{
1949 -+}
1950 -+
1951 -+static inline void hrtick_rq_init(struct rq *rq)
1952 -+{
1953 -+}
1954 -+#endif /* CONFIG_SCHED_HRTICK */
1955 -+
1956 -+static inline int __normal_prio(int policy, int rt_prio, int static_prio)
1957 -+{
1958 -+ return rt_policy(policy) ? (MAX_RT_PRIO - 1 - rt_prio) :
1959 -+ static_prio + MAX_PRIORITY_ADJ;
1960 -+}
1961 -+
1962 -+/*
1963 -+ * Calculate the expected normal priority: i.e. priority
1964 -+ * without taking RT-inheritance into account. Might be
1965 -+ * boosted by interactivity modifiers. Changes upon fork,
1966 -+ * setprio syscalls, and whenever the interactivity
1967 -+ * estimator recalculates.
1968 -+ */
1969 -+static inline int normal_prio(struct task_struct *p)
1970 -+{
1971 -+ return __normal_prio(p->policy, p->rt_priority, p->static_prio);
1972 -+}
1973 -+
1974 -+/*
1975 -+ * Calculate the current priority, i.e. the priority
1976 -+ * taken into account by the scheduler. This value might
1977 -+ * be boosted by RT tasks as it will be RT if the task got
1978 -+ * RT-boosted. If not then it returns p->normal_prio.
1979 -+ */
1980 -+static int effective_prio(struct task_struct *p)
1981 -+{
1982 -+ p->normal_prio = normal_prio(p);
1983 -+ /*
1984 -+ * If we are RT tasks or we were boosted to RT priority,
1985 -+ * keep the priority unchanged. Otherwise, update priority
1986 -+ * to the normal priority:
1987 -+ */
1988 -+ if (!rt_prio(p->prio))
1989 -+ return p->normal_prio;
1990 -+ return p->prio;
1991 -+}
1992 -+
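__normal_prio() above collapses policy and priority into one number: real-time tasks map to MAX_RT_PRIO - 1 - rt_priority, everything else to static_prio + MAX_PRIORITY_ADJ, and effective_prio() then keeps an RT-boosted value if one is in place. The standalone sketch below shows just that arithmetic; the constant values are assumptions picked to make the numbers concrete, not necessarily those used by the patch.

#include <stdio.h>

/* Illustrative constants only; MAX_RT_PRIO matches mainline (100),
 * MAX_PRIORITY_ADJ is a placeholder for the BMQ boost range. */
#define MAX_RT_PRIO             100
#define MAX_PRIORITY_ADJ        4
#define NICE_TO_PRIO(nice)      ((nice) + 120)  /* static_prio for a nice value */

enum policy { POLICY_NORMAL, POLICY_FIFO, POLICY_RR };

static int is_rt_policy(enum policy p)
{
        return p == POLICY_FIFO || p == POLICY_RR;
}

/* Mirror of __normal_prio(): RT tasks land below MAX_RT_PRIO (a lower value
 * means a higher priority), normal tasks start from their static priority. */
static int normal_prio(enum policy policy, int rt_prio, int static_prio)
{
        return is_rt_policy(policy) ? (MAX_RT_PRIO - 1 - rt_prio)
                                    : static_prio + MAX_PRIORITY_ADJ;
}

int main(void)
{
        printf("SCHED_FIFO rtprio 50 -> %d\n",
               normal_prio(POLICY_FIFO, 50, 0));                /* 49 */
        printf("SCHED_NORMAL nice 0  -> %d\n",
               normal_prio(POLICY_NORMAL, 0, NICE_TO_PRIO(0))); /* 124 */
        return 0;
}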
1993 -+/*
1994 -+ * activate_task - move a task to the runqueue.
1995 -+ *
1996 -+ * Context: rq->lock
1997 -+ */
1998 -+static void activate_task(struct task_struct *p, struct rq *rq)
1999 -+{
2000 -+ enqueue_task(p, rq, ENQUEUE_WAKEUP);
2001 -+ p->on_rq = TASK_ON_RQ_QUEUED;
2002 -+
2003 -+ /*
2004 -+ * If in_iowait is set, the code below may not trigger any cpufreq
2005 -+ * utilization updates, so do it here explicitly with the IOWAIT flag
2006 -+ * passed.
2007 -+ */
2008 -+ cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT * p->in_iowait);
2009 -+}
2010 -+
2011 -+/*
2012 -+ * deactivate_task - remove a task from the runqueue.
2013 -+ *
2014 -+ * Context: rq->lock
2015 -+ */
2016 -+static inline void deactivate_task(struct task_struct *p, struct rq *rq)
2017 -+{
2018 -+ dequeue_task(p, rq, DEQUEUE_SLEEP);
2019 -+ p->on_rq = 0;
2020 -+ cpufreq_update_util(rq, 0);
2021 -+}
2022 -+
2023 -+static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
2024 -+{
2025 -+#ifdef CONFIG_SMP
2026 -+ /*
2027 -+ * After ->cpu is set up to a new value, task_access_lock(p, ...) can be
2028 -+ * successfully executed on another CPU. We must ensure that updates of
2029 -+ * per-task data have been completed by this moment.
2030 -+ */
2031 -+ smp_wmb();
2032 -+
2033 -+#ifdef CONFIG_THREAD_INFO_IN_TASK
2034 -+ WRITE_ONCE(p->cpu, cpu);
2035 -+#else
2036 -+ WRITE_ONCE(task_thread_info(p)->cpu, cpu);
2037 -+#endif
2038 -+#endif
2039 -+}
2040 -+
2041 -+static inline bool is_migration_disabled(struct task_struct *p)
2042 -+{
2043 -+#ifdef CONFIG_SMP
2044 -+ return p->migration_disabled;
2045 -+#else
2046 -+ return false;
2047 -+#endif
2048 -+}
2049 -+
2050 -+#define SCA_CHECK 0x01
2051 -+
2052 -+#ifdef CONFIG_SMP
2053 -+
2054 -+void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
2055 -+{
2056 -+#ifdef CONFIG_SCHED_DEBUG
2057 -+ unsigned int state = READ_ONCE(p->__state);
2058 -+
2059 -+ /*
2060 -+ * We should never call set_task_cpu() on a blocked task,
2061 -+ * ttwu() will sort out the placement.
2062 -+ */
2063 -+ WARN_ON_ONCE(state != TASK_RUNNING && state != TASK_WAKING && !p->on_rq);
2064 -+
2065 -+#ifdef CONFIG_LOCKDEP
2066 -+ /*
2067 -+ * The caller should hold either p->pi_lock or rq->lock, when changing
2068 -+ * a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.
2069 -+ *
2070 -+ * sched_move_task() holds both and thus holding either pins the cgroup,
2071 -+ * see task_group().
2072 -+ */
2073 -+ WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
2074 -+ lockdep_is_held(&task_rq(p)->lock)));
2075 -+#endif
2076 -+ /*
2077 -+ * Clearly, migrating tasks to offline CPUs is a fairly daft thing.
2078 -+ */
2079 -+ WARN_ON_ONCE(!cpu_online(new_cpu));
2080 -+
2081 -+ WARN_ON_ONCE(is_migration_disabled(p));
2082 -+#endif
2083 -+ if (task_cpu(p) == new_cpu)
2084 -+ return;
2085 -+ trace_sched_migrate_task(p, new_cpu);
2086 -+ rseq_migrate(p);
2087 -+ perf_event_task_migrate(p);
2088 -+
2089 -+ __set_task_cpu(p, new_cpu);
2090 -+}
2091 -+
2092 -+#define MDF_FORCE_ENABLED 0x80
2093 -+
2094 -+static void
2095 -+__do_set_cpus_ptr(struct task_struct *p, const struct cpumask *new_mask)
2096 -+{
2097 -+ /*
2098 -+ * This here violates the locking rules for affinity, since we're only
2099 -+ * supposed to change these variables while holding both rq->lock and
2100 -+ * p->pi_lock.
2101 -+ *
2102 -+ * HOWEVER, it magically works, because ttwu() is the only code that
2103 -+ * accesses these variables under p->pi_lock and only does so after
2104 -+ * smp_cond_load_acquire(&p->on_cpu, !VAL), and we're in __schedule()
2105 -+ * before finish_task().
2106 -+ *
2107 -+ * XXX do further audits, this smells like something putrid.
2108 -+ */
2109 -+ SCHED_WARN_ON(!p->on_cpu);
2110 -+ p->cpus_ptr = new_mask;
2111 -+}
2112 -+
2113 -+void migrate_disable(void)
2114 -+{
2115 -+ struct task_struct *p = current;
2116 -+ int cpu;
2117 -+
2118 -+ if (p->migration_disabled) {
2119 -+ p->migration_disabled++;
2120 -+ return;
2121 -+ }
2122 -+
2123 -+ preempt_disable();
2124 -+ cpu = smp_processor_id();
2125 -+ if (cpumask_test_cpu(cpu, &p->cpus_mask)) {
2126 -+ cpu_rq(cpu)->nr_pinned++;
2127 -+ p->migration_disabled = 1;
2128 -+ p->migration_flags &= ~MDF_FORCE_ENABLED;
2129 -+
2130 -+ /*
2131 -+ * Violates locking rules! see comment in __do_set_cpus_ptr().
2132 -+ */
2133 -+ if (p->cpus_ptr == &p->cpus_mask)
2134 -+ __do_set_cpus_ptr(p, cpumask_of(cpu));
2135 -+ }
2136 -+ preempt_enable();
2137 -+}
2138 -+EXPORT_SYMBOL_GPL(migrate_disable);
2139 -+
2140 -+void migrate_enable(void)
2141 -+{
2142 -+ struct task_struct *p = current;
2143 -+
2144 -+ if (0 == p->migration_disabled)
2145 -+ return;
2146 -+
2147 -+ if (p->migration_disabled > 1) {
2148 -+ p->migration_disabled--;
2149 -+ return;
2150 -+ }
2151 -+
2152 -+ /*
2153 -+ * Ensure stop_task runs either before or after this, and that
2154 -+ * __set_cpus_allowed_ptr(SCA_MIGRATE_ENABLE) doesn't schedule().
2155 -+ */
2156 -+ preempt_disable();
2157 -+ /*
2158 -+ * Assumption: current should be running on allowed cpu
2159 -+ */
2160 -+ WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &p->cpus_mask));
2161 -+ if (p->cpus_ptr != &p->cpus_mask)
2162 -+ __do_set_cpus_ptr(p, &p->cpus_mask);
2163 -+ /*
2164 -+ * Mustn't clear migration_disabled() until cpus_ptr points back at the
2165 -+ * regular cpus_mask, otherwise things that race (eg.
2166 -+ * select_fallback_rq) get confused.
2167 -+ */
2168 -+ barrier();
2169 -+ p->migration_disabled = 0;
2170 -+ this_rq()->nr_pinned--;
2171 -+ preempt_enable();
2172 -+}
2173 -+EXPORT_SYMBOL_GPL(migrate_enable);
2174 -+
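migrate_disable()/migrate_enable() above behave like a nesting counter: only the first disable pins the task to the current CPU (bumping rq->nr_pinned and pointing cpus_ptr at a one-CPU mask), and only the matching outermost enable releases the pin. The toy model below illustrates just that bookkeeping, with invented names and no locking:

#include <assert.h>
#include <stdio.h>

/* Toy model of the nesting rules only; names and fields are illustrative. */
struct toy_task {
        int migration_disabled; /* nesting depth */
        int pinned_cpu;         /* -1 = free to migrate */
};

static int toy_nr_pinned;       /* stands in for rq->nr_pinned */

static void toy_migrate_disable(struct toy_task *t, int this_cpu)
{
        if (t->migration_disabled++)
                return;                 /* already pinned, just nest deeper */
        toy_nr_pinned++;
        t->pinned_cpu = this_cpu;       /* cpus_ptr := cpumask_of(this_cpu) */
}

static void toy_migrate_enable(struct toy_task *t)
{
        assert(t->migration_disabled > 0);
        if (--t->migration_disabled)
                return;                 /* inner enable: still pinned */
        t->pinned_cpu = -1;             /* cpus_ptr := &cpus_mask again */
        toy_nr_pinned--;
}

int main(void)
{
        struct toy_task t = { 0, -1 };

        toy_migrate_disable(&t, 2);
        toy_migrate_disable(&t, 2);     /* nested: no extra pinning */
        toy_migrate_enable(&t);         /* still pinned here */
        printf("pinned_cpu=%d nr_pinned=%d\n", t.pinned_cpu, toy_nr_pinned);
        toy_migrate_enable(&t);         /* outermost enable releases the pin */
        printf("pinned_cpu=%d nr_pinned=%d\n", t.pinned_cpu, toy_nr_pinned);
        return 0;
}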
2175 -+static inline bool rq_has_pinned_tasks(struct rq *rq)
2176 -+{
2177 -+ return rq->nr_pinned;
2178 -+}
2179 -+
2180 -+/*
2181 -+ * Per-CPU kthreads are allowed to run on !active && online CPUs, see
2182 -+ * __set_cpus_allowed_ptr() and select_fallback_rq().
2183 -+ */
2184 -+static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
2185 -+{
2186 -+ /* When not in the task's cpumask, no point in looking further. */
2187 -+ if (!cpumask_test_cpu(cpu, p->cpus_ptr))
2188 -+ return false;
2189 -+
2190 -+ /* migrate_disabled() must be allowed to finish. */
2191 -+ if (is_migration_disabled(p))
2192 -+ return cpu_online(cpu);
2193 -+
2194 -+ /* Non kernel threads are not allowed during either online or offline. */
2195 -+ if (!(p->flags & PF_KTHREAD))
2196 -+ return cpu_active(cpu);
2197 -+
2198 -+ /* KTHREAD_IS_PER_CPU is always allowed. */
2199 -+ if (kthread_is_per_cpu(p))
2200 -+ return cpu_online(cpu);
2201 -+
2202 -+ /* Regular kernel threads don't get to stay during offline. */
2203 -+ if (cpu_dying(cpu))
2204 -+ return false;
2205 -+
2206 -+ /* But are allowed during online. */
2207 -+ return cpu_online(cpu);
2208 -+}
2209 -+
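is_cpu_allowed() above is a short decision ladder: the affinity mask is checked first, then the answer depends on whether the task is migration-disabled, a user task, a per-CPU kthread or a regular kthread, and on whether the CPU is online, active or dying. A pure-function restatement of that precedence (illustrative fields only, not taken from the patch) follows:

#include <stdbool.h>
#include <stdio.h>

/* Illustrative inputs; in the kernel these come from the task and cpumasks. */
struct cpu_state { bool online, active, dying; };
struct task_state {
        bool in_affinity_mask;  /* cpumask_test_cpu(cpu, p->cpus_ptr) */
        bool migration_disabled;
        bool is_kthread;        /* p->flags & PF_KTHREAD */
        bool is_per_cpu_kthread;
};

static bool allowed_on(const struct task_state *t, const struct cpu_state *c)
{
        if (!t->in_affinity_mask)
                return false;           /* not in the task's cpumask */
        if (t->migration_disabled)
                return c->online;       /* must be allowed to finish */
        if (!t->is_kthread)
                return c->active;       /* user tasks need an active CPU */
        if (t->is_per_cpu_kthread)
                return c->online;       /* always allowed while online */
        if (c->dying)
                return false;           /* regular kthreads leave early */
        return c->online;
}

int main(void)
{
        struct cpu_state hotplug_out = { .online = true, .active = false, .dying = true };
        struct task_state user = { true, false, false, false };
        struct task_state percpu_kthread = { true, false, true, true };

        printf("user task on dying CPU:       %d\n", allowed_on(&user, &hotplug_out));
        printf("per-CPU kthread on dying CPU: %d\n", allowed_on(&percpu_kthread, &hotplug_out));
        return 0;
}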
2210 -+/*
2211 -+ * This is how migration works:
2212 -+ *
2213 -+ * 1) we invoke migration_cpu_stop() on the target CPU using
2214 -+ * stop_one_cpu().
2215 -+ * 2) stopper starts to run (implicitly forcing the migrated thread
2216 -+ * off the CPU)
2217 -+ * 3) it checks whether the migrated task is still in the wrong runqueue.
2218 -+ * 4) if it's in the wrong runqueue then the migration thread removes
2219 -+ * it and puts it into the right queue.
2220 -+ * 5) stopper completes and stop_one_cpu() returns and the migration
2221 -+ * is done.
2222 -+ */
2223 -+
2224 -+/*
2225 -+ * move_queued_task - move a queued task to new rq.
2226 -+ *
2227 -+ * Returns (locked) new rq. Old rq's lock is released.
2228 -+ */
2229 -+static struct rq *move_queued_task(struct rq *rq, struct task_struct *p, int
2230 -+ new_cpu)
2231 -+{
2232 -+ lockdep_assert_held(&rq->lock);
2233 -+
2234 -+ WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
2235 -+ dequeue_task(p, rq, 0);
2236 -+ set_task_cpu(p, new_cpu);
2237 -+ raw_spin_unlock(&rq->lock);
2238 -+
2239 -+ rq = cpu_rq(new_cpu);
2240 -+
2241 -+ raw_spin_lock(&rq->lock);
2242 -+ BUG_ON(task_cpu(p) != new_cpu);
2243 -+ sched_task_sanity_check(p, rq);
2244 -+ enqueue_task(p, rq, 0);
2245 -+ p->on_rq = TASK_ON_RQ_QUEUED;
2246 -+ check_preempt_curr(rq);
2247 -+
2248 -+ return rq;
2249 -+}
2250 -+
2251 -+struct migration_arg {
2252 -+ struct task_struct *task;
2253 -+ int dest_cpu;
2254 -+};
2255 -+
2256 -+/*
2257 -+ * Move (not current) task off this CPU, onto the destination CPU. We're doing
2258 -+ * this because either it can't run here any more (set_cpus_allowed()
2259 -+ * away from this CPU, or CPU going down), or because we're
2260 -+ * attempting to rebalance this task on exec (sched_exec).
2261 -+ *
2262 -+ * So we race with normal scheduler movements, but that's OK, as long
2263 -+ * as the task is no longer on this CPU.
2264 -+ */
2265 -+static struct rq *__migrate_task(struct rq *rq, struct task_struct *p, int
2266 -+ dest_cpu)
2267 -+{
2268 -+ /* Affinity changed (again). */
2269 -+ if (!is_cpu_allowed(p, dest_cpu))
2270 -+ return rq;
2271 -+
2272 -+ update_rq_clock(rq);
2273 -+ return move_queued_task(rq, p, dest_cpu);
2274 -+}
2275 -+
2276 -+/*
2277 -+ * migration_cpu_stop - this will be executed by a highprio stopper thread
2278 -+ * and performs thread migration by bumping thread off CPU then
2279 -+ * 'pushing' onto another runqueue.
2280 -+ */
2281 -+static int migration_cpu_stop(void *data)
2282 -+{
2283 -+ struct migration_arg *arg = data;
2284 -+ struct task_struct *p = arg->task;
2285 -+ struct rq *rq = this_rq();
2286 -+ unsigned long flags;
2287 -+
2288 -+ /*
2289 -+ * The original target CPU might have gone down and we might
2290 -+ * be on another CPU but it doesn't matter.
2291 -+ */
2292 -+ local_irq_save(flags);
2293 -+ /*
2294 -+ * We need to explicitly wake pending tasks before running
2295 -+ * __migrate_task() such that we will not miss enforcing cpus_ptr
2296 -+ * during wakeups, see set_cpus_allowed_ptr()'s TASK_WAKING test.
2297 -+ */
2298 -+ flush_smp_call_function_from_idle();
2299 -+
2300 -+ raw_spin_lock(&p->pi_lock);
2301 -+ raw_spin_lock(&rq->lock);
2302 -+ /*
2303 -+ * If task_rq(p) != rq, it cannot be migrated here, because we're
2304 -+ * holding rq->lock, if p->on_rq == 0 it cannot get enqueued because
2305 -+ * we're holding p->pi_lock.
2306 -+ */
2307 -+ if (task_rq(p) == rq && task_on_rq_queued(p))
2308 -+ rq = __migrate_task(rq, p, arg->dest_cpu);
2309 -+ raw_spin_unlock(&rq->lock);
2310 -+ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
2311 -+
2312 -+ return 0;
2313 -+}
2314 -+
2315 -+static inline void
2316 -+set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_mask)
2317 -+{
2318 -+ cpumask_copy(&p->cpus_mask, new_mask);
2319 -+ p->nr_cpus_allowed = cpumask_weight(new_mask);
2320 -+}
2321 -+
2322 -+static void
2323 -+__do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
2324 -+{
2325 -+ lockdep_assert_held(&p->pi_lock);
2326 -+ set_cpus_allowed_common(p, new_mask);
2327 -+}
2328 -+
2329 -+void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
2330 -+{
2331 -+ __do_set_cpus_allowed(p, new_mask);
2332 -+}
2333 -+
2334 -+#endif
2335 -+
2336 -+/**
2337 -+ * task_curr - is this task currently executing on a CPU?
2338 -+ * @p: the task in question.
2339 -+ *
2340 -+ * Return: 1 if the task is currently executing. 0 otherwise.
2341 -+ */
2342 -+inline int task_curr(const struct task_struct *p)
2343 -+{
2344 -+ return cpu_curr(task_cpu(p)) == p;
2345 -+}
2346 -+
2347 -+#ifdef CONFIG_SMP
2348 -+/*
2349 -+ * wait_task_inactive - wait for a thread to unschedule.
2350 -+ *
2351 -+ * If @match_state is nonzero, it's the @p->state value just checked and
2352 -+ * not expected to change. If it changes, i.e. @p might have woken up,
2353 -+ * then return zero. When we succeed in waiting for @p to be off its CPU,
2354 -+ * we return a positive number (its total switch count). If a second call
2355 -+ * a short while later returns the same number, the caller can be sure that
2356 -+ * @p has remained unscheduled the whole time.
2357 -+ *
2358 -+ * The caller must ensure that the task *will* unschedule sometime soon,
2359 -+ * else this function might spin for a *long* time. This function can't
2360 -+ * be called with interrupts off, or it may introduce deadlock with
2361 -+ * smp_call_function() if an IPI is sent by the same process we are
2362 -+ * waiting to become inactive.
2363 -+ */
2364 -+unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state)
2365 -+{
2366 -+ unsigned long flags;
2367 -+ bool running, on_rq;
2368 -+ unsigned long ncsw;
2369 -+ struct rq *rq;
2370 -+ raw_spinlock_t *lock;
2371 -+
2372 -+ for (;;) {
2373 -+ rq = task_rq(p);
2374 -+
2375 -+ /*
2376 -+ * If the task is actively running on another CPU
2377 -+ * still, just relax and busy-wait without holding
2378 -+ * any locks.
2379 -+ *
2380 -+ * NOTE! Since we don't hold any locks, it's not
2381 -+ * even sure that "rq" stays as the right runqueue!
2382 -+ * But we don't care, since this will return false
2383 -+ * if the runqueue has changed and p is actually now
2384 -+ * running somewhere else!
2385 -+ */
2386 -+ while (task_running(p) && p == rq->curr) {
2387 -+ if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
2388 -+ return 0;
2389 -+ cpu_relax();
2390 -+ }
2391 -+
2392 -+ /*
2393 -+ * Ok, time to look more closely! We need the rq
2394 -+ * lock now, to be *sure*. If we're wrong, we'll
2395 -+ * just go back and repeat.
2396 -+ */
2397 -+ task_access_lock_irqsave(p, &lock, &flags);
2398 -+ trace_sched_wait_task(p);
2399 -+ running = task_running(p);
2400 -+ on_rq = p->on_rq;
2401 -+ ncsw = 0;
2402 -+ if (!match_state || READ_ONCE(p->__state) == match_state)
2403 -+ ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
2404 -+ task_access_unlock_irqrestore(p, lock, &flags);
2405 -+
2406 -+ /*
2407 -+ * If it changed from the expected state, bail out now.
2408 -+ */
2409 -+ if (unlikely(!ncsw))
2410 -+ break;
2411 -+
2412 -+ /*
2413 -+ * Was it really running after all now that we
2414 -+ * checked with the proper locks actually held?
2415 -+ *
2416 -+ * Oops. Go back and try again..
2417 -+ */
2418 -+ if (unlikely(running)) {
2419 -+ cpu_relax();
2420 -+ continue;
2421 -+ }
2422 -+
2423 -+ /*
2424 -+ * It's not enough that it's not actively running,
2425 -+ * it must be off the runqueue _entirely_, and not
2426 -+ * preempted!
2427 -+ *
2428 -+ * So if it was still runnable (but just not actively
2429 -+ * running right now), it's preempted, and we should
2430 -+ * yield - it could be a while.
2431 -+ */
2432 -+ if (unlikely(on_rq)) {
2433 -+ ktime_t to = NSEC_PER_SEC / HZ;
2434 -+
2435 -+ set_current_state(TASK_UNINTERRUPTIBLE);
2436 -+ schedule_hrtimeout(&to, HRTIMER_MODE_REL);
2437 -+ continue;
2438 -+ }
2439 -+
2440 -+ /*
2441 -+ * Ahh, all good. It wasn't running, and it wasn't
2442 -+ * runnable, which means that it will never become
2443 -+ * running in the future either. We're all done!
2444 -+ */
2445 -+ break;
2446 -+ }
2447 -+
2448 -+ return ncsw;
2449 -+}
2450 -+
2451 -+/***
2452 -+ * kick_process - kick a running thread to enter/exit the kernel
2453 -+ * @p: the to-be-kicked thread
2454 -+ *
2455 -+ * Cause a process which is running on another CPU to enter
2456 -+ * kernel-mode, without any delay. (to get signals handled.)
2457 -+ *
2458 -+ * NOTE: this function doesn't have to take the runqueue lock,
2459 -+ * because all it wants to ensure is that the remote task enters
2460 -+ * the kernel. If the IPI races and the task has been migrated
2461 -+ * to another CPU then no harm is done and the purpose has been
2462 -+ * achieved as well.
2463 -+ */
2464 -+void kick_process(struct task_struct *p)
2465 -+{
2466 -+ int cpu;
2467 -+
2468 -+ preempt_disable();
2469 -+ cpu = task_cpu(p);
2470 -+ if ((cpu != smp_processor_id()) && task_curr(p))
2471 -+ smp_send_reschedule(cpu);
2472 -+ preempt_enable();
2473 -+}
2474 -+EXPORT_SYMBOL_GPL(kick_process);
2475 -+
2476 -+/*
2477 -+ * ->cpus_ptr is protected by both rq->lock and p->pi_lock
2478 -+ *
2479 -+ * A few notes on cpu_active vs cpu_online:
2480 -+ *
2481 -+ * - cpu_active must be a subset of cpu_online
2482 -+ *
2483 -+ * - on CPU-up we allow per-CPU kthreads on the online && !active CPU,
2484 -+ * see __set_cpus_allowed_ptr(). At this point the newly online
2485 -+ * CPU isn't yet part of the sched domains, and balancing will not
2486 -+ * see it.
2487 -+ *
2488 -+ * - on cpu-down we clear cpu_active() to mask the sched domains and
2489 -+ * avoid the load balancer to place new tasks on the to be removed
2490 -+ * CPU. Existing tasks will remain running there and will be taken
2491 -+ * off.
2492 -+ *
2493 -+ * This means that fallback selection must not select !active CPUs.
2494 -+ * And can assume that any active CPU must be online. Conversely
2495 -+ * select_task_rq() below may allow selection of !active CPUs in order
2496 -+ * to satisfy the above rules.
2497 -+ */
2498 -+static int select_fallback_rq(int cpu, struct task_struct *p)
2499 -+{
2500 -+ int nid = cpu_to_node(cpu);
2501 -+ const struct cpumask *nodemask = NULL;
2502 -+ enum { cpuset, possible, fail } state = cpuset;
2503 -+ int dest_cpu;
2504 -+
2505 -+ /*
2506 -+ * If the node that the CPU is on has been offlined, cpu_to_node()
2507 -+ * will return -1. There is no CPU on the node, and we should
2508 -+ * select the CPU on the other node.
2509 -+ */
2510 -+ if (nid != -1) {
2511 -+ nodemask = cpumask_of_node(nid);
2512 -+
2513 -+ /* Look for allowed, online CPU in same node. */
2514 -+ for_each_cpu(dest_cpu, nodemask) {
2515 -+ if (!cpu_active(dest_cpu))
2516 -+ continue;
2517 -+ if (cpumask_test_cpu(dest_cpu, p->cpus_ptr))
2518 -+ return dest_cpu;
2519 -+ }
2520 -+ }
2521 -+
2522 -+ for (;;) {
2523 -+ /* Any allowed, online CPU? */
2524 -+ for_each_cpu(dest_cpu, p->cpus_ptr) {
2525 -+ if (!is_cpu_allowed(p, dest_cpu))
2526 -+ continue;
2527 -+ goto out;
2528 -+ }
2529 -+
2530 -+ /* No more Mr. Nice Guy. */
2531 -+ switch (state) {
2532 -+ case cpuset:
2533 -+ if (IS_ENABLED(CONFIG_CPUSETS)) {
2534 -+ cpuset_cpus_allowed_fallback(p);
2535 -+ state = possible;
2536 -+ break;
2537 -+ }
2538 -+ fallthrough;
2539 -+ case possible:
2540 -+ /*
2541 -+ * XXX When called from select_task_rq() we only
2542 -+ * hold p->pi_lock and again violate locking order.
2543 -+ *
2544 -+ * More yuck to audit.
2545 -+ */
2546 -+ do_set_cpus_allowed(p, cpu_possible_mask);
2547 -+ state = fail;
2548 -+ break;
2549 -+
2550 -+ case fail:
2551 -+ BUG();
2552 -+ break;
2553 -+ }
2554 -+ }
2555 -+
2556 -+out:
2557 -+ if (state != cpuset) {
2558 -+ /*
2559 -+ * Don't tell them about moving exiting tasks or
2560 -+ * kernel threads (both mm NULL), since they never
2561 -+ * leave the kernel.
2562 -+ */
2563 -+ if (p->mm && printk_ratelimit()) {
2564 -+ printk_deferred("process %d (%s) no longer affine to cpu%d\n",
2565 -+ task_pid_nr(p), p->comm, cpu);
2566 -+ }
2567 -+ }
2568 -+
2569 -+ return dest_cpu;
2570 -+}
2571 -+
2572 -+static inline int select_task_rq(struct task_struct *p)
2573 -+{
2574 -+ cpumask_t chk_mask, tmp;
2575 -+
2576 -+ if (unlikely(!cpumask_and(&chk_mask, p->cpus_ptr, cpu_active_mask)))
2577 -+ return select_fallback_rq(task_cpu(p), p);
2578 -+
2579 -+ if (
2580 -+#ifdef CONFIG_SCHED_SMT
2581 -+ cpumask_and(&tmp, &chk_mask, &sched_sg_idle_mask) ||
2582 -+#endif
2583 -+ cpumask_and(&tmp, &chk_mask, sched_rq_watermark) ||
2584 -+ cpumask_and(&tmp, &chk_mask,
2585 -+ sched_rq_watermark + SCHED_BITS - task_sched_prio(p)))
2586 -+ return best_mask_cpu(task_cpu(p), &tmp);
2587 -+
2588 -+ return best_mask_cpu(task_cpu(p), &chk_mask);
2589 -+}
2590 -+
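select_fallback_rq() above searches in widening circles: an allowed, active CPU on the same node as the task's last CPU, then any allowed CPU, and only then does it relax the affinity (cpuset fallback, then cpu_possible_mask). The sketch below restates that ordering over plain arrays; the topology and helper names are invented, and the final stage is simplified to "any active CPU":

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 8

/* Invented topology: CPUs 0-3 on node 0, CPUs 4-7 on node 1. */
static int cpu_node(int cpu) { return cpu < 4 ? 0 : 1; }

/*
 * Widening search in the spirit of select_fallback_rq():
 *   1) allowed && active on the same node as @prev_cpu
 *   2) any allowed && active CPU
 *   3) give up on affinity and take any active CPU at all
 */
static int fallback_cpu(int prev_cpu, const bool allowed[NR_CPUS],
                        const bool active[NR_CPUS])
{
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                if (allowed[cpu] && active[cpu] && cpu_node(cpu) == cpu_node(prev_cpu))
                        return cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                if (allowed[cpu] && active[cpu])
                        return cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)     /* "no more Mr. Nice Guy" */
                if (active[cpu])
                        return cpu;

        return -1;      /* the kernel would BUG() here */
}

int main(void)
{
        bool active[NR_CPUS]  = { 0, 0, 1, 1, 1, 1, 1, 1 };     /* CPUs 0-1 offline */
        bool allowed[NR_CPUS] = { 1, 1, 0, 0, 0, 1, 0, 0 };     /* affinity: 0, 1, 5 */

        /* Task last ran on CPU 1 (node 0): nothing allowed+active on node 0,
         * so the search widens and settles on CPU 5. */
        printf("fallback: CPU %d\n", fallback_cpu(1, allowed, active));
        return 0;
}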
2591 -+void sched_set_stop_task(int cpu, struct task_struct *stop)
2592 -+{
2593 -+ static struct lock_class_key stop_pi_lock;
2594 -+ struct sched_param stop_param = { .sched_priority = STOP_PRIO };
2595 -+ struct sched_param start_param = { .sched_priority = 0 };
2596 -+ struct task_struct *old_stop = cpu_rq(cpu)->stop;
2597 -+
2598 -+ if (stop) {
2599 -+ /*
2600 -+ * Make it appear like a SCHED_FIFO task, it's something
2601 -+ * userspace knows about and won't get confused about.
2602 -+ *
2603 -+ * Also, it will make PI more or less work without too
2604 -+ * much confusion -- but then, stop work should not
2605 -+ * rely on PI working anyway.
2606 -+ */
2607 -+ sched_setscheduler_nocheck(stop, SCHED_FIFO, &stop_param);
2608 -+
2609 -+ /*
2610 -+ * The PI code calls rt_mutex_setprio() with ->pi_lock held to
2611 -+ * adjust the effective priority of a task. As a result,
2612 -+ * rt_mutex_setprio() can trigger (RT) balancing operations,
2613 -+ * which can then trigger wakeups of the stop thread to push
2614 -+ * around the current task.
2615 -+ *
2616 -+ * The stop task itself will never be part of the PI-chain, it
2617 -+ * never blocks, therefore that ->pi_lock recursion is safe.
2618 -+ * Tell lockdep about this by placing the stop->pi_lock in its
2619 -+ * own class.
2620 -+ */
2621 -+ lockdep_set_class(&stop->pi_lock, &stop_pi_lock);
2622 -+ }
2623 -+
2624 -+ cpu_rq(cpu)->stop = stop;
2625 -+
2626 -+ if (old_stop) {
2627 -+ /*
2628 -+ * Reset it back to a normal scheduling policy so that
2629 -+ * it can die in pieces.
2630 -+ */
2631 -+ sched_setscheduler_nocheck(old_stop, SCHED_NORMAL, &start_param);
2632 -+ }
2633 -+}
2634 -+
2635 -+/*
2636 -+ * Change a given task's CPU affinity. Migrate the thread to a
2637 -+ * proper CPU and schedule it away if the CPU it's executing on
2638 -+ * is removed from the allowed bitmask.
2639 -+ *
2640 -+ * NOTE: the caller must have a valid reference to the task, the
2641 -+ * task must not exit() & deallocate itself prematurely. The
2642 -+ * call is not atomic; no spinlocks may be held.
2643 -+ */
2644 -+static int __set_cpus_allowed_ptr(struct task_struct *p,
2645 -+ const struct cpumask *new_mask,
2646 -+ u32 flags)
2647 -+{
2648 -+ const struct cpumask *cpu_valid_mask = cpu_active_mask;
2649 -+ int dest_cpu;
2650 -+ unsigned long irq_flags;
2651 -+ struct rq *rq;
2652 -+ raw_spinlock_t *lock;
2653 -+ int ret = 0;
2654 -+
2655 -+ raw_spin_lock_irqsave(&p->pi_lock, irq_flags);
2656 -+ rq = __task_access_lock(p, &lock);
2657 -+
2658 -+ if (p->flags & PF_KTHREAD || is_migration_disabled(p)) {
2659 -+ /*
2660 -+ * Kernel threads are allowed on online && !active CPUs,
2661 -+ * however, during cpu-hot-unplug, even these might get pushed
2662 -+ * away if not KTHREAD_IS_PER_CPU.
2663 -+ *
2664 -+ * Specifically, migration_disabled() tasks must not fail the
2665 -+ * cpumask_any_and_distribute() pick below, esp. so on
2666 -+ * SCA_MIGRATE_ENABLE, otherwise we'll not call
2667 -+ * set_cpus_allowed_common() and actually reset p->cpus_ptr.
2668 -+ */
2669 -+ cpu_valid_mask = cpu_online_mask;
2670 -+ }
2671 -+
2672 -+ /*
2673 -+ * Must re-check here, to close a race against __kthread_bind(),
2674 -+ * sched_setaffinity() is not guaranteed to observe the flag.
2675 -+ */
2676 -+ if ((flags & SCA_CHECK) && (p->flags & PF_NO_SETAFFINITY)) {
2677 -+ ret = -EINVAL;
2678 -+ goto out;
2679 -+ }
2680 -+
2681 -+ if (cpumask_equal(&p->cpus_mask, new_mask))
2682 -+ goto out;
2683 -+
2684 -+ dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
2685 -+ if (dest_cpu >= nr_cpu_ids) {
2686 -+ ret = -EINVAL;
2687 -+ goto out;
2688 -+ }
2689 -+
2690 -+ __do_set_cpus_allowed(p, new_mask);
2691 -+
2692 -+ /* Can the task run on the task's current CPU? If so, we're done */
2693 -+ if (cpumask_test_cpu(task_cpu(p), new_mask))
2694 -+ goto out;
2695 -+
2696 -+ if (p->migration_disabled) {
2697 -+ if (likely(p->cpus_ptr != &p->cpus_mask))
2698 -+ __do_set_cpus_ptr(p, &p->cpus_mask);
2699 -+ p->migration_disabled = 0;
2700 -+ p->migration_flags |= MDF_FORCE_ENABLED;
2701 -+ /* When p is migrate_disabled, rq->lock should be held */
2702 -+ rq->nr_pinned--;
2703 -+ }
2704 -+
2705 -+ if (task_running(p) || READ_ONCE(p->__state) == TASK_WAKING) {
2706 -+ struct migration_arg arg = { p, dest_cpu };
2707 -+
2708 -+ /* Need help from migration thread: drop lock and wait. */
2709 -+ __task_access_unlock(p, lock);
2710 -+ raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);
2711 -+ stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
2712 -+ return 0;
2713 -+ }
2714 -+ if (task_on_rq_queued(p)) {
2715 -+ /*
2716 -+ * OK, since we're going to drop the lock immediately
2717 -+ * afterwards anyway.
2718 -+ */
2719 -+ update_rq_clock(rq);
2720 -+ rq = move_queued_task(rq, p, dest_cpu);
2721 -+ lock = &rq->lock;
2722 -+ }
2723 -+
2724 -+out:
2725 -+ __task_access_unlock(p, lock);
2726 -+ raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);
2727 -+
2728 -+ return ret;
2729 -+}
2730 -+
2731 -+int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
2732 -+{
2733 -+ return __set_cpus_allowed_ptr(p, new_mask, 0);
2734 -+}
2735 -+EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
2736 -+
2737 -+#else /* CONFIG_SMP */
2738 -+
2739 -+static inline int select_task_rq(struct task_struct *p)
2740 -+{
2741 -+ return 0;
2742 -+}
2743 -+
2744 -+static inline int
2745 -+__set_cpus_allowed_ptr(struct task_struct *p,
2746 -+ const struct cpumask *new_mask,
2747 -+ u32 flags)
2748 -+{
2749 -+ return set_cpus_allowed_ptr(p, new_mask);
2750 -+}
2751 -+
2752 -+static inline bool rq_has_pinned_tasks(struct rq *rq)
2753 -+{
2754 -+ return false;
2755 -+}
2756 -+
2757 -+#endif /* !CONFIG_SMP */
2758 -+
2759 -+static void
2760 -+ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
2761 -+{
2762 -+ struct rq *rq;
2763 -+
2764 -+ if (!schedstat_enabled())
2765 -+ return;
2766 -+
2767 -+ rq = this_rq();
2768 -+
2769 -+#ifdef CONFIG_SMP
2770 -+ if (cpu == rq->cpu)
2771 -+ __schedstat_inc(rq->ttwu_local);
2772 -+ else {
2773 -+ /** Alt schedule FW ToDo:
2774 -+ * How to do ttwu_wake_remote
2775 -+ */
2776 -+ }
2777 -+#endif /* CONFIG_SMP */
2778 -+
2779 -+ __schedstat_inc(rq->ttwu_count);
2780 -+}
2781 -+
2782 -+/*
2783 -+ * Mark the task runnable and perform wakeup-preemption.
2784 -+ */
2785 -+static inline void
2786 -+ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
2787 -+{
2788 -+ check_preempt_curr(rq);
2789 -+ WRITE_ONCE(p->__state, TASK_RUNNING);
2790 -+ trace_sched_wakeup(p);
2791 -+}
2792 -+
2793 -+static inline void
2794 -+ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)
2795 -+{
2796 -+ if (p->sched_contributes_to_load)
2797 -+ rq->nr_uninterruptible--;
2798 -+
2799 -+ if (
2800 -+#ifdef CONFIG_SMP
2801 -+ !(wake_flags & WF_MIGRATED) &&
2802 -+#endif
2803 -+ p->in_iowait) {
2804 -+ delayacct_blkio_end(p);
2805 -+ atomic_dec(&task_rq(p)->nr_iowait);
2806 -+ }
2807 -+
2808 -+ activate_task(p, rq);
2809 -+ ttwu_do_wakeup(rq, p, 0);
2810 -+}
2811 -+
2812 -+/*
2813 -+ * Consider @p being inside a wait loop:
2814 -+ *
2815 -+ * for (;;) {
2816 -+ * set_current_state(TASK_UNINTERRUPTIBLE);
2817 -+ *
2818 -+ * if (CONDITION)
2819 -+ * break;
2820 -+ *
2821 -+ * schedule();
2822 -+ * }
2823 -+ * __set_current_state(TASK_RUNNING);
2824 -+ *
2825 -+ * between set_current_state() and schedule(). In this case @p is still
2826 -+ * runnable, so all that needs doing is change p->state back to TASK_RUNNING in
2827 -+ * an atomic manner.
2828 -+ *
2829 -+ * By taking task_rq(p)->lock we serialize against schedule(), if @p->on_rq
2830 -+ * then schedule() must still happen and p->state can be changed to
2831 -+ * TASK_RUNNING. Otherwise we lost the race, schedule() has happened, and we
2832 -+ * need to do a full wakeup with enqueue.
2833 -+ *
2834 -+ * Returns: %true when the wakeup is done,
2835 -+ * %false otherwise.
2836 -+ */
2837 -+static int ttwu_runnable(struct task_struct *p, int wake_flags)
2838 -+{
2839 -+ struct rq *rq;
2840 -+ raw_spinlock_t *lock;
2841 -+ int ret = 0;
2842 -+
2843 -+ rq = __task_access_lock(p, &lock);
2844 -+ if (task_on_rq_queued(p)) {
2845 -+ /* check_preempt_curr() may use rq clock */
2846 -+ update_rq_clock(rq);
2847 -+ ttwu_do_wakeup(rq, p, wake_flags);
2848 -+ ret = 1;
2849 -+ }
2850 -+ __task_access_unlock(p, lock);
2851 -+
2852 -+ return ret;
2853 -+}
2854 -+
2855 -+#ifdef CONFIG_SMP
2856 -+void sched_ttwu_pending(void *arg)
2857 -+{
2858 -+ struct llist_node *llist = arg;
2859 -+ struct rq *rq = this_rq();
2860 -+ struct task_struct *p, *t;
2861 -+ struct rq_flags rf;
2862 -+
2863 -+ if (!llist)
2864 -+ return;
2865 -+
2866 -+ /*
2867 -+ * rq::ttwu_pending is a racy indication of outstanding wakeups.
2868 -+ * Races such that false-negatives are possible, since they
2869 -+ * are shorter lived that false-positives would be.
2870 -+ */
2871 -+ WRITE_ONCE(rq->ttwu_pending, 0);
2872 -+
2873 -+ rq_lock_irqsave(rq, &rf);
2874 -+ update_rq_clock(rq);
2875 -+
2876 -+ llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
2877 -+ if (WARN_ON_ONCE(p->on_cpu))
2878 -+ smp_cond_load_acquire(&p->on_cpu, !VAL);
2879 -+
2880 -+ if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
2881 -+ set_task_cpu(p, cpu_of(rq));
2882 -+
2883 -+ ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0);
2884 -+ }
2885 -+
2886 -+ rq_unlock_irqrestore(rq, &rf);
2887 -+}
2888 -+
2889 -+void send_call_function_single_ipi(int cpu)
2890 -+{
2891 -+ struct rq *rq = cpu_rq(cpu);
2892 -+
2893 -+ if (!set_nr_if_polling(rq->idle))
2894 -+ arch_send_call_function_single_ipi(cpu);
2895 -+ else
2896 -+ trace_sched_wake_idle_without_ipi(cpu);
2897 -+}
2898 -+
2899 -+/*
2900 -+ * Queue a task on the target CPU's wake_list and wake the CPU via IPI if
2901 -+ * necessary. The wakee CPU on receipt of the IPI will queue the task
2902 -+ * via sched_ttwu_wakeup() for activation so the wakee incurs the cost
2903 -+ * of the wakeup instead of the waker.
2904 -+ */
2905 -+static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
2906 -+{
2907 -+ struct rq *rq = cpu_rq(cpu);
2908 -+
2909 -+ p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
2910 -+
2911 -+ WRITE_ONCE(rq->ttwu_pending, 1);
2912 -+ __smp_call_single_queue(cpu, &p->wake_entry.llist);
2913 -+}
2914 -+
2915 -+static inline bool ttwu_queue_cond(int cpu, int wake_flags)
2916 -+{
2917 -+ /*
2918 -+ * Do not complicate things with the async wake_list while the CPU is
2919 -+ * in hotplug state.
2920 -+ */
2921 -+ if (!cpu_active(cpu))
2922 -+ return false;
2923 -+
2924 -+ /*
2925 -+ * If the CPU does not share cache, then queue the task on the
2926 -+ * remote rqs wakelist to avoid accessing remote data.
2927 -+ */
2928 -+ if (!cpus_share_cache(smp_processor_id(), cpu))
2929 -+ return true;
2930 -+
2931 -+ /*
2932 -+ * If the task is descheduling and the only running task on the
2933 -+ * CPU then use the wakelist to offload the task activation to
2934 -+ * the soon-to-be-idle CPU as the current CPU is likely busy.
2935 -+ * nr_running is checked to avoid unnecessary task stacking.
2936 -+ */
2937 -+ if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
2938 -+ return true;
2939 -+
2940 -+ return false;
2941 -+}
2942 -+
2943 -+static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
2944 -+{
2945 -+ if (__is_defined(ALT_SCHED_TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
2946 -+ if (WARN_ON_ONCE(cpu == smp_processor_id()))
2947 -+ return false;
2948 -+
2949 -+ sched_clock_cpu(cpu); /* Sync clocks across CPUs */
2950 -+ __ttwu_queue_wakelist(p, cpu, wake_flags);
2951 -+ return true;
2952 -+ }
2953 -+
2954 -+ return false;
2955 -+}
2956 -+
2957 -+void wake_up_if_idle(int cpu)
2958 -+{
2959 -+ struct rq *rq = cpu_rq(cpu);
2960 -+ unsigned long flags;
2961 -+
2962 -+ rcu_read_lock();
2963 -+
2964 -+ if (!is_idle_task(rcu_dereference(rq->curr)))
2965 -+ goto out;
2966 -+
2967 -+ if (set_nr_if_polling(rq->idle)) {
2968 -+ trace_sched_wake_idle_without_ipi(cpu);
2969 -+ } else {
2970 -+ raw_spin_lock_irqsave(&rq->lock, flags);
2971 -+ if (is_idle_task(rq->curr))
2972 -+ smp_send_reschedule(cpu);
2973 -+ /* Else CPU is not idle, do nothing here */
2974 -+ raw_spin_unlock_irqrestore(&rq->lock, flags);
2975 -+ }
2976 -+
2977 -+out:
2978 -+ rcu_read_unlock();
2979 -+}
2980 -+
2981 -+bool cpus_share_cache(int this_cpu, int that_cpu)
2982 -+{
2983 -+ return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
2984 -+}
2985 -+#else /* !CONFIG_SMP */
2986 -+
2987 -+static inline bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
2988 -+{
2989 -+ return false;
2990 -+}
2991 -+
2992 -+#endif /* CONFIG_SMP */
2993 -+
2994 -+static inline void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
2995 -+{
2996 -+ struct rq *rq = cpu_rq(cpu);
2997 -+
2998 -+ if (ttwu_queue_wakelist(p, cpu, wake_flags))
2999 -+ return;
3000 -+
3001 -+ raw_spin_lock(&rq->lock);
3002 -+ update_rq_clock(rq);
3003 -+ ttwu_do_activate(rq, p, wake_flags);
3004 -+ raw_spin_unlock(&rq->lock);
3005 -+}
3006 -+
3007 -+/*
3008 -+ * Notes on Program-Order guarantees on SMP systems.
3009 -+ *
3010 -+ * MIGRATION
3011 -+ *
3012 -+ * The basic program-order guarantee on SMP systems is that when a task [t]
3013 -+ * migrates, all its activity on its old CPU [c0] happens-before any subsequent
3014 -+ * execution on its new CPU [c1].
3015 -+ *
3016 -+ * For migration (of runnable tasks) this is provided by the following means:
3017 -+ *
3018 -+ * A) UNLOCK of the rq(c0)->lock scheduling out task t
3019 -+ * B) migration for t is required to synchronize *both* rq(c0)->lock and
3020 -+ * rq(c1)->lock (if not at the same time, then in that order).
3021 -+ * C) LOCK of the rq(c1)->lock scheduling in task
3022 -+ *
3023 -+ * Transitivity guarantees that B happens after A and C after B.
3024 -+ * Note: we only require RCpc transitivity.
3025 -+ * Note: the CPU doing B need not be c0 or c1
3026 -+ *
3027 -+ * Example:
3028 -+ *
3029 -+ * CPU0 CPU1 CPU2
3030 -+ *
3031 -+ * LOCK rq(0)->lock
3032 -+ * sched-out X
3033 -+ * sched-in Y
3034 -+ * UNLOCK rq(0)->lock
3035 -+ *
3036 -+ * LOCK rq(0)->lock // orders against CPU0
3037 -+ * dequeue X
3038 -+ * UNLOCK rq(0)->lock
3039 -+ *
3040 -+ * LOCK rq(1)->lock
3041 -+ * enqueue X
3042 -+ * UNLOCK rq(1)->lock
3043 -+ *
3044 -+ * LOCK rq(1)->lock // orders against CPU2
3045 -+ * sched-out Z
3046 -+ * sched-in X
3047 -+ * UNLOCK rq(1)->lock
3048 -+ *
3049 -+ *
3050 -+ * BLOCKING -- aka. SLEEP + WAKEUP
3051 -+ *
3052 -+ * For blocking we (obviously) need to provide the same guarantee as for
3053 -+ * migration. However the means are completely different as there is no lock
3054 -+ * chain to provide order. Instead we do:
3055 -+ *
3056 -+ * 1) smp_store_release(X->on_cpu, 0) -- finish_task()
3057 -+ * 2) smp_cond_load_acquire(!X->on_cpu) -- try_to_wake_up()
3058 -+ *
3059 -+ * Example:
3060 -+ *
3061 -+ * CPU0 (schedule) CPU1 (try_to_wake_up) CPU2 (schedule)
3062 -+ *
3063 -+ * LOCK rq(0)->lock LOCK X->pi_lock
3064 -+ * dequeue X
3065 -+ * sched-out X
3066 -+ * smp_store_release(X->on_cpu, 0);
3067 -+ *
3068 -+ * smp_cond_load_acquire(&X->on_cpu, !VAL);
3069 -+ * X->state = WAKING
3070 -+ * set_task_cpu(X,2)
3071 -+ *
3072 -+ * LOCK rq(2)->lock
3073 -+ * enqueue X
3074 -+ * X->state = RUNNING
3075 -+ * UNLOCK rq(2)->lock
3076 -+ *
3077 -+ * LOCK rq(2)->lock // orders against CPU1
3078 -+ * sched-out Z
3079 -+ * sched-in X
3080 -+ * UNLOCK rq(2)->lock
3081 -+ *
3082 -+ * UNLOCK X->pi_lock
3083 -+ * UNLOCK rq(0)->lock
3084 -+ *
3085 -+ *
3086 -+ * However, for wakeups there is a second guarantee we must provide, namely we
3087 -+ * must observe the state that led to our wakeup. That is, not only must our
3088 -+ * task observe its own prior state, it must also observe the stores prior to
3089 -+ * its wakeup.
3090 -+ *
3091 -+ * This means that any means of doing remote wakeups must order the CPU doing
3092 -+ * the wakeup against the CPU the task is going to end up running on. This,
3093 -+ * however, is already required for the regular Program-Order guarantee above,
3094 -+ * since the waking CPU is the one issuing the ACQUIRE (smp_cond_load_acquire).
3095 -+ *
3096 -+ */
3097 -+
3098 -+/**
3099 -+ * try_to_wake_up - wake up a thread
3100 -+ * @p: the thread to be awakened
3101 -+ * @state: the mask of task states that can be woken
3102 -+ * @wake_flags: wake modifier flags (WF_*)
3103 -+ *
3104 -+ * Conceptually does:
3105 -+ *
3106 -+ * If (@state & @p->state) @p->state = TASK_RUNNING.
3107 -+ *
3108 -+ * If the task was not queued/runnable, also place it back on a runqueue.
3109 -+ *
3110 -+ * This function is atomic against schedule() which would dequeue the task.
3111 -+ *
3112 -+ * It issues a full memory barrier before accessing @p->state, see the comment
3113 -+ * with set_current_state().
3114 -+ *
3115 -+ * Uses p->pi_lock to serialize against concurrent wake-ups.
3116 -+ *
3117 -+ * Relies on p->pi_lock stabilizing:
3118 -+ * - p->sched_class
3119 -+ * - p->cpus_ptr
3120 -+ * - p->sched_task_group
3121 -+ * in order to do migration, see its use of select_task_rq()/set_task_cpu().
3122 -+ *
3123 -+ * Tries really hard to only take one task_rq(p)->lock for performance.
3124 -+ * Takes rq->lock in:
3125 -+ * - ttwu_runnable() -- old rq, unavoidable, see comment there;
3126 -+ * - ttwu_queue() -- new rq, for enqueue of the task;
3127 -+ * - psi_ttwu_dequeue() -- much sadness :-( accounting will kill us.
3128 -+ *
3129 -+ * As a consequence we race really badly with just about everything. See the
3130 -+ * many memory barriers and their comments for details.
3131 -+ *
3132 -+ * Return: %true if @p->state changes (an actual wakeup was done),
3133 -+ * %false otherwise.
3134 -+ */
3135 -+static int try_to_wake_up(struct task_struct *p, unsigned int state,
3136 -+ int wake_flags)
3137 -+{
3138 -+ unsigned long flags;
3139 -+ int cpu, success = 0;
3140 -+
3141 -+ preempt_disable();
3142 -+ if (p == current) {
3143 -+ /*
3144 -+ * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
3145 -+ * == smp_processor_id()'. Together this means we can special
3146 -+ * case the whole 'p->on_rq && ttwu_runnable()' case below
3147 -+ * without taking any locks.
3148 -+ *
3149 -+ * In particular:
3150 -+ * - we rely on Program-Order guarantees for all the ordering,
3151 -+ * - we're serialized against set_special_state() by virtue of
3152 -+ * it disabling IRQs (this allows not taking ->pi_lock).
3153 -+ */
3154 -+ if (!(READ_ONCE(p->__state) & state))
3155 -+ goto out;
3156 -+
3157 -+ success = 1;
3158 -+ trace_sched_waking(p);
3159 -+ WRITE_ONCE(p->__state, TASK_RUNNING);
3160 -+ trace_sched_wakeup(p);
3161 -+ goto out;
3162 -+ }
3163 -+
3164 -+ /*
3165 -+ * If we are going to wake up a thread waiting for CONDITION we
3166 -+ * need to ensure that CONDITION=1 done by the caller can not be
3167 -+ * reordered with p->state check below. This pairs with smp_store_mb()
3168 -+ * in set_current_state() that the waiting thread does.
3169 -+ */
3170 -+ raw_spin_lock_irqsave(&p->pi_lock, flags);
3171 -+ smp_mb__after_spinlock();
3172 -+ if (!(READ_ONCE(p->__state) & state))
3173 -+ goto unlock;
3174 -+
3175 -+ trace_sched_waking(p);
3176 -+
3177 -+ /* We're going to change ->state: */
3178 -+ success = 1;
3179 -+
3180 -+ /*
3181 -+ * Ensure we load p->on_rq _after_ p->state, otherwise it would
3182 -+ * be possible to, falsely, observe p->on_rq == 0 and get stuck
3183 -+ * in smp_cond_load_acquire() below.
3184 -+ *
3185 -+ * sched_ttwu_pending() try_to_wake_up()
3186 -+ * STORE p->on_rq = 1 LOAD p->state
3187 -+ * UNLOCK rq->lock
3188 -+ *
3189 -+ * __schedule() (switch to task 'p')
3190 -+ * LOCK rq->lock smp_rmb();
3191 -+ * smp_mb__after_spinlock();
3192 -+ * UNLOCK rq->lock
3193 -+ *
3194 -+ * [task p]
3195 -+ * STORE p->state = UNINTERRUPTIBLE LOAD p->on_rq
3196 -+ *
3197 -+ * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
3198 -+ * __schedule(). See the comment for smp_mb__after_spinlock().
3199 -+ *
3200 -+ * A similar smp_rmb() lives in try_invoke_on_locked_down_task().
3201 -+ */
3202 -+ smp_rmb();
3203 -+ if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags))
3204 -+ goto unlock;
3205 -+
3206 -+#ifdef CONFIG_SMP
3207 -+ /*
3208 -+ * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
3209 -+ * possible to, falsely, observe p->on_cpu == 0.
3210 -+ *
3211 -+ * One must be running (->on_cpu == 1) in order to remove oneself
3212 -+ * from the runqueue.
3213 -+ *
3214 -+ * __schedule() (switch to task 'p') try_to_wake_up()
3215 -+ * STORE p->on_cpu = 1 LOAD p->on_rq
3216 -+ * UNLOCK rq->lock
3217 -+ *
3218 -+ * __schedule() (put 'p' to sleep)
3219 -+ * LOCK rq->lock smp_rmb();
3220 -+ * smp_mb__after_spinlock();
3221 -+ * STORE p->on_rq = 0 LOAD p->on_cpu
3222 -+ *
3223 -+ * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
3224 -+ * __schedule(). See the comment for smp_mb__after_spinlock().
3225 -+ *
3226 -+ * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
3227 -+ * schedule()'s deactivate_task() has 'happened' and p will no longer
3228 -+ * care about its own p->state. See the comment in __schedule().
3229 -+ */
3230 -+ smp_acquire__after_ctrl_dep();
3231 -+
3232 -+ /*
3233 -+ * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
3234 -+ * == 0), which means we need to do an enqueue, change p->state to
3235 -+ * TASK_WAKING such that we can unlock p->pi_lock before doing the
3236 -+ * enqueue, such as ttwu_queue_wakelist().
3237 -+ */
3238 -+ WRITE_ONCE(p->__state, TASK_WAKING);
3239 -+
3240 -+ /*
3241 -+ * If the owning (remote) CPU is still in the middle of schedule() with
3242 -+ * this task as prev, consider queueing p on the remote CPU's wake_list
3243 -+ * which potentially sends an IPI instead of spinning on p->on_cpu to
3244 -+ * let the waker make forward progress. This is safe because IRQs are
3245 -+ * disabled and the IPI will deliver after on_cpu is cleared.
3246 -+ *
3247 -+ * Ensure we load task_cpu(p) after p->on_cpu:
3248 -+ *
3249 -+ * set_task_cpu(p, cpu);
3250 -+ * STORE p->cpu = @cpu
3251 -+ * __schedule() (switch to task 'p')
3252 -+ * LOCK rq->lock
3253 -+ * smp_mb__after_spin_lock() smp_cond_load_acquire(&p->on_cpu)
3254 -+ * STORE p->on_cpu = 1 LOAD p->cpu
3255 -+ *
3256 -+ * to ensure we observe the correct CPU on which the task is currently
3257 -+ * scheduling.
3258 -+ */
3259 -+ if (smp_load_acquire(&p->on_cpu) &&
3260 -+ ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
3261 -+ goto unlock;
3262 -+
3263 -+ /*
3264 -+ * If the owning (remote) CPU is still in the middle of schedule() with
3265 -+ * this task as prev, wait until it's done referencing the task.
3266 -+ *
3267 -+ * Pairs with the smp_store_release() in finish_task().
3268 -+ *
3269 -+ * This ensures that tasks getting woken will be fully ordered against
3270 -+ * their previous state and preserve Program Order.
3271 -+ */
3272 -+ smp_cond_load_acquire(&p->on_cpu, !VAL);
3273 -+
3274 -+ sched_task_ttwu(p);
3275 -+
3276 -+ cpu = select_task_rq(p);
3277 -+
3278 -+ if (cpu != task_cpu(p)) {
3279 -+ if (p->in_iowait) {
3280 -+ delayacct_blkio_end(p);
3281 -+ atomic_dec(&task_rq(p)->nr_iowait);
3282 -+ }
3283 -+
3284 -+ wake_flags |= WF_MIGRATED;
3285 -+ psi_ttwu_dequeue(p);
3286 -+ set_task_cpu(p, cpu);
3287 -+ }
3288 -+#else
3289 -+ cpu = task_cpu(p);
3290 -+#endif /* CONFIG_SMP */
3291 -+
3292 -+ ttwu_queue(p, cpu, wake_flags);
3293 -+unlock:
3294 -+ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
3295 -+out:
3296 -+ if (success)
3297 -+ ttwu_stat(p, task_cpu(p), wake_flags);
3298 -+ preempt_enable();
3299 -+
3300 -+ return success;
3301 -+}
3302 -+
3303 -+/**
3304 -+ * try_invoke_on_locked_down_task - Invoke a function on task in fixed state
3305 -+ * @p: Process for which the function is to be invoked, can be @current.
3306 -+ * @func: Function to invoke.
3307 -+ * @arg: Argument to function.
3308 -+ *
3309 -+ * If the specified task can be quickly locked into a definite state
3310 -+ * (either sleeping or on a given runqueue), arrange to keep it in that
3311 -+ * state while invoking @func(@arg). This function can use ->on_rq and
3312 -+ * task_curr() to work out what the state is, if required. Given that
3313 -+ * @func can be invoked with a runqueue lock held, it had better be quite
3314 -+ * lightweight.
3315 -+ *
3316 -+ * Returns:
3317 -+ * @false if the task slipped out from under the locks.
3318 -+ * @true if the task was locked onto a runqueue or is sleeping.
3319 -+ * However, @func can override this by returning @false.
3320 -+ */
3321 -+bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg)
3322 -+{
3323 -+ struct rq_flags rf;
3324 -+ bool ret = false;
3325 -+ struct rq *rq;
3326 -+
3327 -+ raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
3328 -+ if (p->on_rq) {
3329 -+ rq = __task_rq_lock(p, &rf);
3330 -+ if (task_rq(p) == rq)
3331 -+ ret = func(p, arg);
3332 -+ __task_rq_unlock(rq, &rf);
3333 -+ } else {
3334 -+ switch (READ_ONCE(p->__state)) {
3335 -+ case TASK_RUNNING:
3336 -+ case TASK_WAKING:
3337 -+ break;
3338 -+ default:
3339 -+ smp_rmb(); // See smp_rmb() comment in try_to_wake_up().
3340 -+ if (!p->on_rq)
3341 -+ ret = func(p, arg);
3342 -+ }
3343 -+ }
3344 -+ raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
3345 -+ return ret;
3346 -+}
3347 -+
3348 -+/**
3349 -+ * wake_up_process - Wake up a specific process
3350 -+ * @p: The process to be woken up.
3351 -+ *
3352 -+ * Attempt to wake up the nominated process and move it to the set of runnable
3353 -+ * processes.
3354 -+ *
3355 -+ * Return: 1 if the process was woken up, 0 if it was already running.
3356 -+ *
3357 -+ * This function executes a full memory barrier before accessing the task state.
3358 -+ */
3359 -+int wake_up_process(struct task_struct *p)
3360 -+{
3361 -+ return try_to_wake_up(p, TASK_NORMAL, 0);
3362 -+}
3363 -+EXPORT_SYMBOL(wake_up_process);
3364 -+
3365 -+int wake_up_state(struct task_struct *p, unsigned int state)
3366 -+{
3367 -+ return try_to_wake_up(p, state, 0);
3368 -+}
3369 -+
3370 -+/*
3371 -+ * Perform scheduler related setup for a newly forked process p.
3372 -+ * p is forked by current.
3373 -+ *
3374 -+ * __sched_fork() is basic setup used by init_idle() too:
3375 -+ */
3376 -+static inline void __sched_fork(unsigned long clone_flags, struct task_struct *p)
3377 -+{
3378 -+ p->on_rq = 0;
3379 -+ p->on_cpu = 0;
3380 -+ p->utime = 0;
3381 -+ p->stime = 0;
3382 -+ p->sched_time = 0;
3383 -+
3384 -+#ifdef CONFIG_PREEMPT_NOTIFIERS
3385 -+ INIT_HLIST_HEAD(&p->preempt_notifiers);
3386 -+#endif
3387 -+
3388 -+#ifdef CONFIG_COMPACTION
3389 -+ p->capture_control = NULL;
3390 -+#endif
3391 -+#ifdef CONFIG_SMP
3392 -+ p->wake_entry.u_flags = CSD_TYPE_TTWU;
3393 -+#endif
3394 -+}
3395 -+
3396 -+/*
3397 -+ * fork()/clone()-time setup:
3398 -+ */
3399 -+int sched_fork(unsigned long clone_flags, struct task_struct *p)
3400 -+{
3401 -+ unsigned long flags;
3402 -+ struct rq *rq;
3403 -+
3404 -+ __sched_fork(clone_flags, p);
3405 -+ /*
3406 -+ * We mark the process as NEW here. This guarantees that
3407 -+ * nobody will actually run it, and a signal or other external
3408 -+ * event cannot wake it up and insert it on the runqueue either.
3409 -+ */
3410 -+ p->__state = TASK_NEW;
3411 -+
3412 -+ /*
3413 -+ * Make sure we do not leak PI boosting priority to the child.
3414 -+ */
3415 -+ p->prio = current->normal_prio;
3416 -+
3417 -+ /*
3418 -+ * Revert to default priority/policy on fork if requested.
3419 -+ */
3420 -+ if (unlikely(p->sched_reset_on_fork)) {
3421 -+ if (task_has_rt_policy(p)) {
3422 -+ p->policy = SCHED_NORMAL;
3423 -+ p->static_prio = NICE_TO_PRIO(0);
3424 -+ p->rt_priority = 0;
3425 -+ } else if (PRIO_TO_NICE(p->static_prio) < 0)
3426 -+ p->static_prio = NICE_TO_PRIO(0);
3427 -+
3428 -+ p->prio = p->normal_prio = p->static_prio;
3429 -+
3430 -+ /*
3431 -+ * We don't need the reset flag anymore after the fork. It has
3432 -+ * fulfilled its duty:
3433 -+ */
3434 -+ p->sched_reset_on_fork = 0;
3435 -+ }
3436 -+
3437 -+ /*
3438 -+ * The child is not yet in the pid-hash so no cgroup attach races,
3439 -+ * and the cgroup is pinned to this child because cgroup_fork()
3440 -+ * is run before sched_fork().
3441 -+ *
3442 -+ * Silence PROVE_RCU.
3443 -+ */
3444 -+ raw_spin_lock_irqsave(&p->pi_lock, flags);
3445 -+ /*
3446 -+ * Share the timeslice between parent and child, thus the
3447 -+ * total amount of pending timeslices in the system doesn't change,
3448 -+ * resulting in more scheduling fairness.
3449 -+ */
3450 -+ rq = this_rq();
3451 -+ raw_spin_lock(&rq->lock);
3452 -+
3453 -+ rq->curr->time_slice /= 2;
3454 -+ p->time_slice = rq->curr->time_slice;
3455 -+#ifdef CONFIG_SCHED_HRTICK
3456 -+ hrtick_start(rq, rq->curr->time_slice);
3457 -+#endif
3458 -+
3459 -+ if (p->time_slice < RESCHED_NS) {
3460 -+ p->time_slice = sched_timeslice_ns;
3461 -+ resched_curr(rq);
3462 -+ }
3463 -+ sched_task_fork(p, rq);
3464 -+ raw_spin_unlock(&rq->lock);
3465 -+
3466 -+ rseq_migrate(p);
3467 -+ /*
3468 -+ * We're setting the CPU for the first time, we don't migrate,
3469 -+ * so use __set_task_cpu().
3470 -+ */
3471 -+ __set_task_cpu(p, cpu_of(rq));
3472 -+ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
3473 -+
3474 -+#ifdef CONFIG_SCHED_INFO
3475 -+ if (unlikely(sched_info_on()))
3476 -+ memset(&p->sched_info, 0, sizeof(p->sched_info));
3477 -+#endif
3478 -+ init_task_preempt_count(p);
3479 -+
3480 -+ return 0;
3481 -+}
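A worked example of the fork-time timeslice split above (illustrative numbers; RESCHED_NS and sched_timeslice_ns are the patch's constants but their values are assumed here):

    /*
     * Parent has 3,000,000 ns of slice left at fork:
     *   rq->curr->time_slice /= 2   -> parent keeps 1,500,000 ns
     *   p->time_slice = 1,500,000   -> child inherits the other half
     * Assuming RESCHED_NS is below 1.5 ms, the "p->time_slice < RESCHED_NS"
     * check does not fire; had the child's share fallen below RESCHED_NS it
     * would be refilled to sched_timeslice_ns and the parent rescheduled
     * via resched_curr().
     */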
3482 -+
3483 -+void sched_post_fork(struct task_struct *p) {}
3484 -+
3485 -+#ifdef CONFIG_SCHEDSTATS
3486 -+
3487 -+DEFINE_STATIC_KEY_FALSE(sched_schedstats);
3488 -+
3489 -+static void set_schedstats(bool enabled)
3490 -+{
3491 -+ if (enabled)
3492 -+ static_branch_enable(&sched_schedstats);
3493 -+ else
3494 -+ static_branch_disable(&sched_schedstats);
3495 -+}
3496 -+
3497 -+void force_schedstat_enabled(void)
3498 -+{
3499 -+ if (!schedstat_enabled()) {
3500 -+ pr_info("kernel profiling enabled schedstats, disable via kernel.sched_schedstats.\n");
3501 -+ static_branch_enable(&sched_schedstats);
3502 -+ }
3503 -+}
3504 -+
3505 -+static int __init setup_schedstats(char *str)
3506 -+{
3507 -+ int ret = 0;
3508 -+ if (!str)
3509 -+ goto out;
3510 -+
3511 -+ if (!strcmp(str, "enable")) {
3512 -+ set_schedstats(true);
3513 -+ ret = 1;
3514 -+ } else if (!strcmp(str, "disable")) {
3515 -+ set_schedstats(false);
3516 -+ ret = 1;
3517 -+ }
3518 -+out:
3519 -+ if (!ret)
3520 -+ pr_warn("Unable to parse schedstats=\n");
3521 -+
3522 -+ return ret;
3523 -+}
3524 -+__setup("schedstats=", setup_schedstats);
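Usage note (illustrative, not part of the patch); the sysctl name is the one mentioned in force_schedstat_enabled() above:

    /*
     * Boot with "schedstats=enable" on the kernel command line, or toggle
     * at runtime with
     *     sysctl kernel.sched_schedstats=1
     * (handled by the CONFIG_PROC_SYSCTL block below).
     */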
3525 -+
3526 -+#ifdef CONFIG_PROC_SYSCTL
3527 -+int sysctl_schedstats(struct ctl_table *table, int write,
3528 -+ void __user *buffer, size_t *lenp, loff_t *ppos)
3529 -+{
3530 -+ struct ctl_table t;
3531 -+ int err;
3532 -+ int state = static_branch_likely(&sched_schedstats);
3533 -+
3534 -+ if (write && !capable(CAP_SYS_ADMIN))
3535 -+ return -EPERM;
3536 -+
3537 -+ t = *table;
3538 -+ t.data = &state;
3539 -+ err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
3540 -+ if (err < 0)
3541 -+ return err;
3542 -+ if (write)
3543 -+ set_schedstats(state);
3544 -+ return err;
3545 -+}
3546 -+#endif /* CONFIG_PROC_SYSCTL */
3547 -+#endif /* CONFIG_SCHEDSTATS */
3548 -+
3549 -+/*
3550 -+ * wake_up_new_task - wake up a newly created task for the first time.
3551 -+ *
3552 -+ * This function will do some initial scheduler statistics housekeeping
3553 -+ * that must be done for every newly created context, then puts the task
3554 -+ * on the runqueue and wakes it.
3555 -+ */
3556 -+void wake_up_new_task(struct task_struct *p)
3557 -+{
3558 -+ unsigned long flags;
3559 -+ struct rq *rq;
3560 -+
3561 -+ raw_spin_lock_irqsave(&p->pi_lock, flags);
3562 -+ WRITE_ONCE(p->__state, TASK_RUNNING);
3563 -+ rq = cpu_rq(select_task_rq(p));
3564 -+#ifdef CONFIG_SMP
3565 -+ rseq_migrate(p);
3566 -+ /*
3567 -+ * Fork balancing, do it here and not earlier because:
3568 -+ * - cpus_ptr can change in the fork path
3569 -+ * - any previously selected CPU might disappear through hotplug
3570 -+ *
3571 -+ * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
3572 -+ * as we're not fully set-up yet.
3573 -+ */
3574 -+ __set_task_cpu(p, cpu_of(rq));
3575 -+#endif
3576 -+
3577 -+ raw_spin_lock(&rq->lock);
3578 -+ update_rq_clock(rq);
3579 -+
3580 -+ activate_task(p, rq);
3581 -+ trace_sched_wakeup_new(p);
3582 -+ check_preempt_curr(rq);
3583 -+
3584 -+ raw_spin_unlock(&rq->lock);
3585 -+ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
3586 -+}
3587 -+
3588 -+#ifdef CONFIG_PREEMPT_NOTIFIERS
3589 -+
3590 -+static DEFINE_STATIC_KEY_FALSE(preempt_notifier_key);
3591 -+
3592 -+void preempt_notifier_inc(void)
3593 -+{
3594 -+ static_branch_inc(&preempt_notifier_key);
3595 -+}
3596 -+EXPORT_SYMBOL_GPL(preempt_notifier_inc);
3597 -+
3598 -+void preempt_notifier_dec(void)
3599 -+{
3600 -+ static_branch_dec(&preempt_notifier_key);
3601 -+}
3602 -+EXPORT_SYMBOL_GPL(preempt_notifier_dec);
3603 -+
3604 -+/**
3605 -+ * preempt_notifier_register - tell me when current is being preempted & rescheduled
3606 -+ * @notifier: notifier struct to register
3607 -+ */
3608 -+void preempt_notifier_register(struct preempt_notifier *notifier)
3609 -+{
3610 -+ if (!static_branch_unlikely(&preempt_notifier_key))
3611 -+ WARN(1, "registering preempt_notifier while notifiers disabled\n");
3612 -+
3613 -+ hlist_add_head(&notifier->link, &current->preempt_notifiers);
3614 -+}
3615 -+EXPORT_SYMBOL_GPL(preempt_notifier_register);
3616 -+
3617 -+/**
3618 -+ * preempt_notifier_unregister - no longer interested in preemption notifications
3619 -+ * @notifier: notifier struct to unregister
3620 -+ *
3621 -+ * This is *not* safe to call from within a preemption notifier.
3622 -+ */
3623 -+void preempt_notifier_unregister(struct preempt_notifier *notifier)
3624 -+{
3625 -+ hlist_del(&notifier->link);
3626 -+}
3627 -+EXPORT_SYMBOL_GPL(preempt_notifier_unregister);
3628 -+
3629 -+static void __fire_sched_in_preempt_notifiers(struct task_struct *curr)
3630 -+{
3631 -+ struct preempt_notifier *notifier;
3632 -+
3633 -+ hlist_for_each_entry(notifier, &curr->preempt_notifiers, link)
3634 -+ notifier->ops->sched_in(notifier, raw_smp_processor_id());
3635 -+}
3636 -+
3637 -+static __always_inline void fire_sched_in_preempt_notifiers(struct task_struct *curr)
3638 -+{
3639 -+ if (static_branch_unlikely(&preempt_notifier_key))
3640 -+ __fire_sched_in_preempt_notifiers(curr);
3641 -+}
3642 -+
3643 -+static void
3644 -+__fire_sched_out_preempt_notifiers(struct task_struct *curr,
3645 -+ struct task_struct *next)
3646 -+{
3647 -+ struct preempt_notifier *notifier;
3648 -+
3649 -+ hlist_for_each_entry(notifier, &curr->preempt_notifiers, link)
3650 -+ notifier->ops->sched_out(notifier, next);
3651 -+}
3652 -+
3653 -+static __always_inline void
3654 -+fire_sched_out_preempt_notifiers(struct task_struct *curr,
3655 -+ struct task_struct *next)
3656 -+{
3657 -+ if (static_branch_unlikely(&preempt_notifier_key))
3658 -+ __fire_sched_out_preempt_notifiers(curr, next);
3659 -+}
3660 -+
3661 -+#else /* !CONFIG_PREEMPT_NOTIFIERS */
3662 -+
3663 -+static inline void fire_sched_in_preempt_notifiers(struct task_struct *curr)
3664 -+{
3665 -+}
3666 -+
3667 -+static inline void
3668 -+fire_sched_out_preempt_notifiers(struct task_struct *curr,
3669 -+ struct task_struct *next)
3670 -+{
3671 -+}
3672 -+
3673 -+#endif /* CONFIG_PREEMPT_NOTIFIERS */
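A minimal registration sketch for the preempt-notifier API above (illustrative; preempt_notifier_init() is assumed to be the usual helper from <linux/preempt.h> and is not part of this patch):

    static void my_sched_in(struct preempt_notifier *pn, int cpu)
    {
            /* current was just scheduled in on @cpu */
    }

    static void my_sched_out(struct preempt_notifier *pn, struct task_struct *next)
    {
            /* current is being scheduled out in favour of @next */
    }

    static struct preempt_notifier_ops my_ops = {
            .sched_in  = my_sched_in,
            .sched_out = my_sched_out,
    };

    static void watch_current(struct preempt_notifier *pn)
    {
            preempt_notifier_inc();              /* enable the static key */
            preempt_notifier_init(pn, &my_ops);  /* assumed helper */
            preempt_notifier_register(pn);       /* attach to current */
    }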
3674 -+
3675 -+static inline void prepare_task(struct task_struct *next)
3676 -+{
3677 -+ /*
3678 -+ * Claim the task as running, we do this before switching to it
3679 -+ * such that any running task will have this set.
3680 -+ *
3681 -+ * See the ttwu() WF_ON_CPU case and its ordering comment.
3682 -+ */
3683 -+ WRITE_ONCE(next->on_cpu, 1);
3684 -+}
3685 -+
3686 -+static inline void finish_task(struct task_struct *prev)
3687 -+{
3688 -+#ifdef CONFIG_SMP
3689 -+ /*
3690 -+ * This must be the very last reference to @prev from this CPU. After
3691 -+ * p->on_cpu is cleared, the task can be moved to a different CPU. We
3692 -+ * must ensure this doesn't happen until the switch is completely
3693 -+ * finished.
3694 -+ *
3695 -+ * In particular, the load of prev->state in finish_task_switch() must
3696 -+ * happen before this.
3697 -+ *
3698 -+ * Pairs with the smp_cond_load_acquire() in try_to_wake_up().
3699 -+ */
3700 -+ smp_store_release(&prev->on_cpu, 0);
3701 -+#else
3702 -+ prev->on_cpu = 0;
3703 -+#endif
3704 -+}
3705 -+
3706 -+#ifdef CONFIG_SMP
3707 -+
3708 -+static void do_balance_callbacks(struct rq *rq, struct callback_head *head)
3709 -+{
3710 -+ void (*func)(struct rq *rq);
3711 -+ struct callback_head *next;
3712 -+
3713 -+ lockdep_assert_held(&rq->lock);
3714 -+
3715 -+ while (head) {
3716 -+ func = (void (*)(struct rq *))head->func;
3717 -+ next = head->next;
3718 -+ head->next = NULL;
3719 -+ head = next;
3720 -+
3721 -+ func(rq);
3722 -+ }
3723 -+}
3724 -+
3725 -+static void balance_push(struct rq *rq);
3726 -+
3727 -+struct callback_head balance_push_callback = {
3728 -+ .next = NULL,
3729 -+ .func = (void (*)(struct callback_head *))balance_push,
3730 -+};
3731 -+
3732 -+static inline struct callback_head *splice_balance_callbacks(struct rq *rq)
3733 -+{
3734 -+ struct callback_head *head = rq->balance_callback;
3735 -+
3736 -+ if (head) {
3737 -+ lockdep_assert_held(&rq->lock);
3738 -+ rq->balance_callback = NULL;
3739 -+ }
3740 -+
3741 -+ return head;
3742 -+}
3743 -+
3744 -+static void __balance_callbacks(struct rq *rq)
3745 -+{
3746 -+ do_balance_callbacks(rq, splice_balance_callbacks(rq));
3747 -+}
3748 -+
3749 -+static inline void balance_callbacks(struct rq *rq, struct callback_head *head)
3750 -+{
3751 -+ unsigned long flags;
3752 -+
3753 -+ if (unlikely(head)) {
3754 -+ raw_spin_lock_irqsave(&rq->lock, flags);
3755 -+ do_balance_callbacks(rq, head);
3756 -+ raw_spin_unlock_irqrestore(&rq->lock, flags);
3757 -+ }
3758 -+}
3759 -+
3760 -+#else
3761 -+
3762 -+static inline void __balance_callbacks(struct rq *rq)
3763 -+{
3764 -+}
3765 -+
3766 -+static inline struct callback_head *splice_balance_callbacks(struct rq *rq)
3767 -+{
3768 -+ return NULL;
3769 -+}
3770 -+
3771 -+static inline void balance_callbacks(struct rq *rq, struct callback_head *head)
3772 -+{
3773 -+}
3774 -+
3775 -+#endif
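Illustrative sketch (not part of the patch) of how an entry gets onto rq->balance_callback; it is later spliced off and run by __balance_callbacks() above. The callback function here is hypothetical:

    static void my_balance_fn(struct rq *rq)
    {
            /* runs with rq->lock held, after the schedule path */
    }

    static struct callback_head my_balance_cb = {
            .func = (void (*)(struct callback_head *))my_balance_fn,
    };

    static void queue_my_balance_cb(struct rq *rq)
    {
            lockdep_assert_held(&rq->lock);
            my_balance_cb.next = rq->balance_callback;  /* push onto the list */
            rq->balance_callback = &my_balance_cb;
    }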
3776 -+
3777 -+static inline void
3778 -+prepare_lock_switch(struct rq *rq, struct task_struct *next)
3779 -+{
3780 -+ /*
3781 -+ * Since the runqueue lock will be released by the next
3782 -+ * task (which is an invalid locking op but in the case
3783 -+ * of the scheduler it's an obvious special-case), we
3784 -+ * do an early lockdep release here:
3785 -+ */
3786 -+ spin_release(&rq->lock.dep_map, _THIS_IP_);
3787 -+#ifdef CONFIG_DEBUG_SPINLOCK
3788 -+ /* this is a valid case when another task releases the spinlock */
3789 -+ rq->lock.owner = next;
3790 -+#endif
3791 -+}
3792 -+
3793 -+static inline void finish_lock_switch(struct rq *rq)
3794 -+{
3795 -+ /*
3796 -+ * If we are tracking spinlock dependencies then we have to
3797 -+ * fix up the runqueue lock - which gets 'carried over' from
3798 -+ * prev into current:
3799 -+ */
3800 -+ spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
3801 -+ __balance_callbacks(rq);
3802 -+ raw_spin_unlock_irq(&rq->lock);
3803 -+}
3804 -+
3805 -+/*
3806 -+ * NOP if the arch has not defined these:
3807 -+ */
3808 -+
3809 -+#ifndef prepare_arch_switch
3810 -+# define prepare_arch_switch(next) do { } while (0)
3811 -+#endif
3812 -+
3813 -+#ifndef finish_arch_post_lock_switch
3814 -+# define finish_arch_post_lock_switch() do { } while (0)
3815 -+#endif
3816 -+
3817 -+static inline void kmap_local_sched_out(void)
3818 -+{
3819 -+#ifdef CONFIG_KMAP_LOCAL
3820 -+ if (unlikely(current->kmap_ctrl.idx))
3821 -+ __kmap_local_sched_out();
3822 -+#endif
3823 -+}
3824 -+
3825 -+static inline void kmap_local_sched_in(void)
3826 -+{
3827 -+#ifdef CONFIG_KMAP_LOCAL
3828 -+ if (unlikely(current->kmap_ctrl.idx))
3829 -+ __kmap_local_sched_in();
3830 -+#endif
3831 -+}
3832 -+
3833 -+/**
3834 -+ * prepare_task_switch - prepare to switch tasks
3835 -+ * @rq: the runqueue preparing to switch
3836 -+ * @next: the task we are going to switch to.
3837 -+ *
3838 -+ * This is called with the rq lock held and interrupts off. It must
3839 -+ * be paired with a subsequent finish_task_switch after the context
3840 -+ * switch.
3841 -+ *
3842 -+ * prepare_task_switch sets up locking and calls architecture specific
3843 -+ * hooks.
3844 -+ */
3845 -+static inline void
3846 -+prepare_task_switch(struct rq *rq, struct task_struct *prev,
3847 -+ struct task_struct *next)
3848 -+{
3849 -+ kcov_prepare_switch(prev);
3850 -+ sched_info_switch(rq, prev, next);
3851 -+ perf_event_task_sched_out(prev, next);
3852 -+ rseq_preempt(prev);
3853 -+ fire_sched_out_preempt_notifiers(prev, next);
3854 -+ kmap_local_sched_out();
3855 -+ prepare_task(next);
3856 -+ prepare_arch_switch(next);
3857 -+}
3858 -+
3859 -+/**
3860 -+ * finish_task_switch - clean up after a task-switch
3861 -+ * @rq: runqueue associated with task-switch
3862 -+ * @prev: the thread we just switched away from.
3863 -+ *
3864 -+ * finish_task_switch must be called after the context switch, paired
3865 -+ * with a prepare_task_switch call before the context switch.
3866 -+ * finish_task_switch will reconcile locking set up by prepare_task_switch,
3867 -+ * and do any other architecture-specific cleanup actions.
3868 -+ *
3869 -+ * Note that we may have delayed dropping an mm in context_switch(). If
3870 -+ * so, we finish that here outside of the runqueue lock. (Doing it
3871 -+ * with the lock held can cause deadlocks; see schedule() for
3872 -+ * details.)
3873 -+ *
3874 -+ * The context switch has flipped the stack from under us and restored the
3875 -+ * local variables which were saved when this task called schedule() in the
3876 -+ * past. prev == current is still correct but we need to recalculate this_rq
3877 -+ * because prev may have moved to another CPU.
3878 -+ */
3879 -+static struct rq *finish_task_switch(struct task_struct *prev)
3880 -+ __releases(rq->lock)
3881 -+{
3882 -+ struct rq *rq = this_rq();
3883 -+ struct mm_struct *mm = rq->prev_mm;
3884 -+ long prev_state;
3885 -+
3886 -+ /*
3887 -+ * The previous task will have left us with a preempt_count of 2
3888 -+ * because it left us after:
3889 -+ *
3890 -+ * schedule()
3891 -+ * preempt_disable(); // 1
3892 -+ * __schedule()
3893 -+ * raw_spin_lock_irq(&rq->lock) // 2
3894 -+ *
3895 -+ * Also, see FORK_PREEMPT_COUNT.
3896 -+ */
3897 -+ if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
3898 -+ "corrupted preempt_count: %s/%d/0x%x\n",
3899 -+ current->comm, current->pid, preempt_count()))
3900 -+ preempt_count_set(FORK_PREEMPT_COUNT);
3901 -+
3902 -+ rq->prev_mm = NULL;
3903 -+
3904 -+ /*
3905 -+ * A task struct has one reference for the use as "current".
3906 -+ * If a task dies, then it sets TASK_DEAD in tsk->state and calls
3907 -+ * schedule one last time. The schedule call will never return, and
3908 -+ * the scheduled task must drop that reference.
3909 -+ *
3910 -+ * We must observe prev->state before clearing prev->on_cpu (in
3911 -+ * finish_task), otherwise a concurrent wakeup can get prev
3912 -+ * running on another CPU and we could race with its RUNNING -> DEAD
3913 -+ * transition, resulting in a double drop.
3914 -+ */
3915 -+ prev_state = READ_ONCE(prev->__state);
3916 -+ vtime_task_switch(prev);
3917 -+ perf_event_task_sched_in(prev, current);
3918 -+ finish_task(prev);
3919 -+ tick_nohz_task_switch();
3920 -+ finish_lock_switch(rq);
3921 -+ finish_arch_post_lock_switch();
3922 -+ kcov_finish_switch(current);
3923 -+ /*
3924 -+ * kmap_local_sched_out() is invoked with rq::lock held and
3925 -+ * interrupts disabled. There is no requirement for that, but the
3926 -+ * sched out code does not have an interrupt enabled section.
3927 -+ * Restoring the maps on sched in does not require interrupts being
3928 -+ * disabled either.
3929 -+ */
3930 -+ kmap_local_sched_in();
3931 -+
3932 -+ fire_sched_in_preempt_notifiers(current);
3933 -+ /*
3934 -+ * When switching through a kernel thread, the loop in
3935 -+ * membarrier_{private,global}_expedited() may have observed that
3936 -+ * kernel thread and not issued an IPI. It is therefore possible to
3937 -+ * schedule between user->kernel->user threads without passing though
3938 -+ * switch_mm(). Membarrier requires a barrier after storing to
3939 -+ * rq->curr, before returning to userspace, so provide them here:
3940 -+ *
3941 -+ * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
3942 -+ * provided by mmdrop(),
3943 -+ * - a sync_core for SYNC_CORE.
3944 -+ */
3945 -+ if (mm) {
3946 -+ membarrier_mm_sync_core_before_usermode(mm);
3947 -+ mmdrop(mm);
3948 -+ }
3949 -+ if (unlikely(prev_state == TASK_DEAD)) {
3950 -+ /*
3951 -+ * Remove function-return probe instances associated with this
3952 -+ * task and put them back on the free list.
3953 -+ */
3954 -+ kprobe_flush_task(prev);
3955 -+
3956 -+ /* Task is done with its stack. */
3957 -+ put_task_stack(prev);
3958 -+
3959 -+ put_task_struct_rcu_user(prev);
3960 -+ }
3961 -+
3962 -+ return rq;
3963 -+}
3964 -+
3965 -+/**
3966 -+ * schedule_tail - first thing a freshly forked thread must call.
3967 -+ * @prev: the thread we just switched away from.
3968 -+ */
3969 -+asmlinkage __visible void schedule_tail(struct task_struct *prev)
3970 -+ __releases(rq->lock)
3971 -+{
3972 -+ /*
3973 -+ * New tasks start with FORK_PREEMPT_COUNT, see there and
3974 -+ * finish_task_switch() for details.
3975 -+ *
3976 -+ * finish_task_switch() will drop rq->lock() and lower preempt_count
3977 -+ * and the preempt_enable() will end up enabling preemption (on
3978 -+ * PREEMPT_COUNT kernels).
3979 -+ */
3980 -+
3981 -+ finish_task_switch(prev);
3982 -+ preempt_enable();
3983 -+
3984 -+ if (current->set_child_tid)
3985 -+ put_user(task_pid_vnr(current), current->set_child_tid);
3986 -+
3987 -+ calculate_sigpending();
3988 -+}
3989 -+
3990 -+/*
3991 -+ * context_switch - switch to the new MM and the new thread's register state.
3992 -+ */
3993 -+static __always_inline struct rq *
3994 -+context_switch(struct rq *rq, struct task_struct *prev,
3995 -+ struct task_struct *next)
3996 -+{
3997 -+ prepare_task_switch(rq, prev, next);
3998 -+
3999 -+ /*
4000 -+ * For paravirt, this is coupled with an exit in switch_to to
4001 -+ * combine the page table reload and the switch backend into
4002 -+ * one hypercall.
4003 -+ */
4004 -+ arch_start_context_switch(prev);
4005 -+
4006 -+ /*
4007 -+ * kernel -> kernel lazy + transfer active
4008 -+ * user -> kernel lazy + mmgrab() active
4009 -+ *
4010 -+ * kernel -> user switch + mmdrop() active
4011 -+ * user -> user switch
4012 -+ */
4013 -+ if (!next->mm) { // to kernel
4014 -+ enter_lazy_tlb(prev->active_mm, next);
4015 -+
4016 -+ next->active_mm = prev->active_mm;
4017 -+ if (prev->mm) // from user
4018 -+ mmgrab(prev->active_mm);
4019 -+ else
4020 -+ prev->active_mm = NULL;
4021 -+ } else { // to user
4022 -+ membarrier_switch_mm(rq, prev->active_mm, next->mm);
4023 -+ /*
4024 -+ * sys_membarrier() requires an smp_mb() between setting
4025 -+ * rq->curr / membarrier_switch_mm() and returning to userspace.
4026 -+ *
4027 -+ * The below provides this either through switch_mm(), or in
4028 -+ * case 'prev->active_mm == next->mm' through
4029 -+ * finish_task_switch()'s mmdrop().
4030 -+ */
4031 -+ switch_mm_irqs_off(prev->active_mm, next->mm, next);
4032 -+
4033 -+ if (!prev->mm) { // from kernel
4034 -+ /* will mmdrop() in finish_task_switch(). */
4035 -+ rq->prev_mm = prev->active_mm;
4036 -+ prev->active_mm = NULL;
4037 -+ }
4038 -+ }
4039 -+
4040 -+ prepare_lock_switch(rq, next);
4041 -+
4042 -+ /* Here we just switch the register state and the stack. */
4043 -+ switch_to(prev, next, prev);
4044 -+ barrier();
4045 -+
4046 -+ return finish_task_switch(prev);
4047 -+}
4048 -+
4049 -+/*
4050 -+ * nr_running, nr_uninterruptible and nr_context_switches:
4051 -+ *
4052 -+ * externally visible scheduler statistics: current number of runnable
4053 -+ * threads, total number of context switches performed since bootup.
4054 -+ */
4055 -+unsigned int nr_running(void)
4056 -+{
4057 -+ unsigned int i, sum = 0;
4058 -+
4059 -+ for_each_online_cpu(i)
4060 -+ sum += cpu_rq(i)->nr_running;
4061 -+
4062 -+ return sum;
4063 -+}
4064 -+
4065 -+/*
4066 -+ * Check if only the current task is running on the CPU.
4067 -+ *
4068 -+ * Caution: this function does not check that the caller has disabled
4069 -+ * preemption, thus the result might have a time-of-check-to-time-of-use
4070 -+ * race. The caller is responsible to use it correctly, for example:
4071 -+ *
4072 -+ * - from a non-preemptible section (of course)
4073 -+ *
4074 -+ * - from a thread that is bound to a single CPU
4075 -+ *
4076 -+ * - in a loop with very short iterations (e.g. a polling loop)
4077 -+ */
4078 -+bool single_task_running(void)
4079 -+{
4080 -+ return raw_rq()->nr_running == 1;
4081 -+}
4082 -+EXPORT_SYMBOL(single_task_running);
4083 -+
4084 -+unsigned long long nr_context_switches(void)
4085 -+{
4086 -+ int i;
4087 -+ unsigned long long sum = 0;
4088 -+
4089 -+ for_each_possible_cpu(i)
4090 -+ sum += cpu_rq(i)->nr_switches;
4091 -+
4092 -+ return sum;
4093 -+}
4094 -+
4095 -+/*
4096 -+ * Consumers of these two interfaces, like for example the cpuidle menu
4097 -+ * governor, are using nonsensical data. Preferring shallow idle state selection
4098 -+ * for a CPU that has IO-wait which might not even end up running the task when
4099 -+ * it does become runnable.
4100 -+ */
4101 -+
4102 -+unsigned int nr_iowait_cpu(int cpu)
4103 -+{
4104 -+ return atomic_read(&cpu_rq(cpu)->nr_iowait);
4105 -+}
4106 -+
4107 -+/*
4108 -+ * IO-wait accounting, and how it's mostly bollocks (on SMP).
4109 -+ *
4110 -+ * The idea behind IO-wait accounting is to account the idle time that we could
4111 -+ * have spent running if it were not for IO. That is, if we were to improve the
4112 -+ * storage performance, we'd have a proportional reduction in IO-wait time.
4113 -+ *
4114 -+ * This all works nicely on UP, where, when a task blocks on IO, we account
4115 -+ * idle time as IO-wait, because if the storage were faster, it could've been
4116 -+ * running and we'd not be idle.
4117 -+ *
4118 -+ * This has been extended to SMP, by doing the same for each CPU. This however
4119 -+ * is broken.
4120 -+ *
4121 -+ * Imagine for instance the case where two tasks block on one CPU, only the one
4122 -+ * CPU will have IO-wait accounted, while the other has regular idle. Even
4123 -+ * though, if the storage were faster, both could've run at the same time,
4124 -+ * utilising both CPUs.
4125 -+ *
4126 -+ * This means, that when looking globally, the current IO-wait accounting on
4127 -+ * SMP is a lower bound, by reason of under accounting.
4128 -+ *
4129 -+ * Worse, since the numbers are provided per CPU, they are sometimes
4130 -+ * interpreted per CPU, and that is nonsensical. A blocked task isn't strictly
4131 -+ * associated with any one particular CPU, it can wake to another CPU than it
4132 -+ * blocked on. This means the per CPU IO-wait number is meaningless.
4133 -+ *
4134 -+ * Task CPU affinities can make all that even more 'interesting'.
4135 -+ */
4136 -+
4137 -+unsigned int nr_iowait(void)
4138 -+{
4139 -+ unsigned int i, sum = 0;
4140 -+
4141 -+ for_each_possible_cpu(i)
4142 -+ sum += nr_iowait_cpu(i);
4143 -+
4144 -+ return sum;
4145 -+}
4146 -+
4147 -+#ifdef CONFIG_SMP
4148 -+
4149 -+/*
4150 -+ * sched_exec - execve() is a valuable balancing opportunity, because at
4151 -+ * this point the task has the smallest effective memory and cache
4152 -+ * footprint.
4153 -+ */
4154 -+void sched_exec(void)
4155 -+{
4156 -+ struct task_struct *p = current;
4157 -+ unsigned long flags;
4158 -+ int dest_cpu;
4159 -+
4160 -+ raw_spin_lock_irqsave(&p->pi_lock, flags);
4161 -+ dest_cpu = cpumask_any(p->cpus_ptr);
4162 -+ if (dest_cpu == smp_processor_id())
4163 -+ goto unlock;
4164 -+
4165 -+ if (likely(cpu_active(dest_cpu))) {
4166 -+ struct migration_arg arg = { p, dest_cpu };
4167 -+
4168 -+ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
4169 -+ stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
4170 -+ return;
4171 -+ }
4172 -+unlock:
4173 -+ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
4174 -+}
4175 -+
4176 -+#endif
4177 -+
4178 -+DEFINE_PER_CPU(struct kernel_stat, kstat);
4179 -+DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
4180 -+
4181 -+EXPORT_PER_CPU_SYMBOL(kstat);
4182 -+EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
4183 -+
4184 -+static inline void update_curr(struct rq *rq, struct task_struct *p)
4185 -+{
4186 -+ s64 ns = rq->clock_task - p->last_ran;
4187 -+
4188 -+ p->sched_time += ns;
4189 -+ cgroup_account_cputime(p, ns);
4190 -+ account_group_exec_runtime(p, ns);
4191 -+
4192 -+ p->time_slice -= ns;
4193 -+ p->last_ran = rq->clock_task;
4194 -+}
4195 -+
4196 -+/*
4197 -+ * Return accounted runtime for the task.
4198 -+ * Return separately the current's pending runtime that has not been
4199 -+ * accounted yet.
4200 -+ */
4201 -+unsigned long long task_sched_runtime(struct task_struct *p)
4202 -+{
4203 -+ unsigned long flags;
4204 -+ struct rq *rq;
4205 -+ raw_spinlock_t *lock;
4206 -+ u64 ns;
4207 -+
4208 -+#if defined(CONFIG_64BIT) && defined(CONFIG_SMP)
4209 -+ /*
4210 -+ * 64-bit doesn't need locks to atomically read a 64-bit value.
4211 -+ * So we have an optimization chance when the task's delta_exec is 0.
4212 -+ * Reading ->on_cpu is racy, but this is ok.
4213 -+ *
4214 -+ * If we race with it leaving CPU, we'll take a lock. So we're correct.
4215 -+ * If we race with it entering CPU, unaccounted time is 0. This is
4216 -+ * indistinguishable from the read occurring a few cycles earlier.
4217 -+ * If we see ->on_cpu without ->on_rq, the task is leaving, and has
4218 -+ * been accounted, so we're correct here as well.
4219 -+ */
4220 -+ if (!p->on_cpu || !task_on_rq_queued(p))
4221 -+ return tsk_seruntime(p);
4222 -+#endif
4223 -+
4224 -+ rq = task_access_lock_irqsave(p, &lock, &flags);
4225 -+ /*
4226 -+ * Must be ->curr _and_ ->on_rq. If dequeued, we would
4227 -+ * project cycles that may never be accounted to this
4228 -+ * thread, breaking clock_gettime().
4229 -+ */
4230 -+ if (p == rq->curr && task_on_rq_queued(p)) {
4231 -+ update_rq_clock(rq);
4232 -+ update_curr(rq, p);
4233 -+ }
4234 -+ ns = tsk_seruntime(p);
4235 -+ task_access_unlock_irqrestore(p, lock, &flags);
4236 -+
4237 -+ return ns;
4238 -+}
4239 -+
4240 -+/* This manages tasks that have run out of timeslice during a scheduler_tick */
4241 -+static inline void scheduler_task_tick(struct rq *rq)
4242 -+{
4243 -+ struct task_struct *p = rq->curr;
4244 -+
4245 -+ if (is_idle_task(p))
4246 -+ return;
4247 -+
4248 -+ update_curr(rq, p);
4249 -+ cpufreq_update_util(rq, 0);
4250 -+
4251 -+ /*
4252 -+ * Tasks that have less than RESCHED_NS of time slice left will be
4253 -+ * rescheduled.
4254 -+ */
4255 -+ if (p->time_slice >= RESCHED_NS)
4256 -+ return;
4257 -+ set_tsk_need_resched(p);
4258 -+ set_preempt_need_resched();
4259 -+}
4260 -+
4261 -+#ifdef CONFIG_SCHED_DEBUG
4262 -+static u64 cpu_resched_latency(struct rq *rq)
4263 -+{
4264 -+ int latency_warn_ms = READ_ONCE(sysctl_resched_latency_warn_ms);
4265 -+ u64 resched_latency, now = rq_clock(rq);
4266 -+ static bool warned_once;
4267 -+
4268 -+ if (sysctl_resched_latency_warn_once && warned_once)
4269 -+ return 0;
4270 -+
4271 -+ if (!need_resched() || !latency_warn_ms)
4272 -+ return 0;
4273 -+
4274 -+ if (system_state == SYSTEM_BOOTING)
4275 -+ return 0;
4276 -+
4277 -+ if (!rq->last_seen_need_resched_ns) {
4278 -+ rq->last_seen_need_resched_ns = now;
4279 -+ rq->ticks_without_resched = 0;
4280 -+ return 0;
4281 -+ }
4282 -+
4283 -+ rq->ticks_without_resched++;
4284 -+ resched_latency = now - rq->last_seen_need_resched_ns;
4285 -+ if (resched_latency <= latency_warn_ms * NSEC_PER_MSEC)
4286 -+ return 0;
4287 -+
4288 -+ warned_once = true;
4289 -+
4290 -+ return resched_latency;
4291 -+}
4292 -+
4293 -+static int __init setup_resched_latency_warn_ms(char *str)
4294 -+{
4295 -+ long val;
4296 -+
4297 -+ if ((kstrtol(str, 0, &val))) {
4298 -+ pr_warn("Unable to set resched_latency_warn_ms\n");
4299 -+ return 1;
4300 -+ }
4301 -+
4302 -+ sysctl_resched_latency_warn_ms = val;
4303 -+ return 1;
4304 -+}
4305 -+__setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms);
4306 -+#else
4307 -+static inline u64 cpu_resched_latency(struct rq *rq) { return 0; }
4308 -+#endif /* CONFIG_SCHED_DEBUG */
4309 -+
4310 -+/*
4311 -+ * This function gets called by the timer code, with HZ frequency.
4312 -+ * We call it with interrupts disabled.
4313 -+ */
4314 -+void scheduler_tick(void)
4315 -+{
4316 -+ int cpu __maybe_unused = smp_processor_id();
4317 -+ struct rq *rq = cpu_rq(cpu);
4318 -+ u64 resched_latency;
4319 -+
4320 -+ arch_scale_freq_tick();
4321 -+ sched_clock_tick();
4322 -+
4323 -+ raw_spin_lock(&rq->lock);
4324 -+ update_rq_clock(rq);
4325 -+
4326 -+ scheduler_task_tick(rq);
4327 -+ if (sched_feat(LATENCY_WARN))
4328 -+ resched_latency = cpu_resched_latency(rq);
4329 -+ calc_global_load_tick(rq);
4330 -+
4331 -+ rq->last_tick = rq->clock;
4332 -+ raw_spin_unlock(&rq->lock);
4333 -+
4334 -+ if (sched_feat(LATENCY_WARN) && resched_latency)
4335 -+ resched_latency_warn(cpu, resched_latency);
4336 -+
4337 -+ perf_event_task_tick();
4338 -+}
4339 -+
4340 -+#ifdef CONFIG_SCHED_SMT
4341 -+static inline int active_load_balance_cpu_stop(void *data)
4342 -+{
4343 -+ struct rq *rq = this_rq();
4344 -+ struct task_struct *p = data;
4345 -+ cpumask_t tmp;
4346 -+ unsigned long flags;
4347 -+
4348 -+ local_irq_save(flags);
4349 -+
4350 -+ raw_spin_lock(&p->pi_lock);
4351 -+ raw_spin_lock(&rq->lock);
4352 -+
4353 -+ rq->active_balance = 0;
4354 -+ /* _something_ may have changed the task, double check again */
4355 -+ if (task_on_rq_queued(p) && task_rq(p) == rq &&
4356 -+ cpumask_and(&tmp, p->cpus_ptr, &sched_sg_idle_mask) &&
4357 -+ !is_migration_disabled(p)) {
4358 -+ int cpu = cpu_of(rq);
4359 -+ int dcpu = __best_mask_cpu(&tmp, per_cpu(sched_cpu_llc_mask, cpu));
4360 -+ rq = move_queued_task(rq, p, dcpu);
4361 -+ }
4362 -+
4363 -+ raw_spin_unlock(&rq->lock);
4364 -+ raw_spin_unlock(&p->pi_lock);
4365 -+
4366 -+ local_irq_restore(flags);
4367 -+
4368 -+ return 0;
4369 -+}
4370 -+
4371 -+/* sg_balance_trigger - trigger sibling group balance for @cpu */
4372 -+static inline int sg_balance_trigger(const int cpu)
4373 -+{
4374 -+ struct rq *rq = cpu_rq(cpu);
4375 -+ unsigned long flags;
4376 -+ struct task_struct *curr;
4377 -+ int res;
4378 -+
4379 -+ if (!raw_spin_trylock_irqsave(&rq->lock, flags))
4380 -+ return 0;
4381 -+ curr = rq->curr;
4382 -+ res = (!is_idle_task(curr)) && (1 == rq->nr_running) &&\
4383 -+ cpumask_intersects(curr->cpus_ptr, &sched_sg_idle_mask) &&\
4384 -+ !is_migration_disabled(curr) && (!rq->active_balance);
4385 -+
4386 -+ if (res)
4387 -+ rq->active_balance = 1;
4388 -+
4389 -+ raw_spin_unlock_irqrestore(&rq->lock, flags);
4390 -+
4391 -+ if (res)
4392 -+ stop_one_cpu_nowait(cpu, active_load_balance_cpu_stop,
4393 -+ curr, &rq->active_balance_work);
4394 -+ return res;
4395 -+}
4396 -+
4397 -+/*
4398 -+ * sg_balance_check - sibling group balance check for run queue @rq
4399 -+ */
4400 -+static inline void sg_balance_check(struct rq *rq)
4401 -+{
4402 -+ cpumask_t chk;
4403 -+ int cpu = cpu_of(rq);
4404 -+
4405 -+ /* exit when cpu is offline */
4406 -+ if (unlikely(!rq->online))
4407 -+ return;
4408 -+
4409 -+ /*
4410 -+ * Only a cpu in the sibling idle group will do the checking and then
4411 -+ * find potential cpus to which the currently running task can migrate
4412 -+ */
4413 -+ if (cpumask_test_cpu(cpu, &sched_sg_idle_mask) &&
4414 -+ cpumask_andnot(&chk, cpu_online_mask, sched_rq_watermark) &&
4415 -+ cpumask_andnot(&chk, &chk, &sched_rq_pending_mask)) {
4416 -+ int i;
4417 -+
4418 -+ for_each_cpu_wrap(i, &chk, cpu) {
4419 -+ if (cpumask_subset(cpu_smt_mask(i), &chk) &&
4420 -+ sg_balance_trigger(i))
4421 -+ return;
4422 -+ }
4423 -+ }
4424 -+}
4425 -+#endif /* CONFIG_SCHED_SMT */
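A short flow summary for the SMT sibling-group balance path above (descriptive only, not part of the patch):

    /*
     *   sg_balance_check(rq)          - called at the end of __schedule();
     *                                   filters online CPUs against the
     *                                   watermark and pending masks
     *   sg_balance_trigger(cpu)       - sets rq->active_balance and kicks
     *                                   the cpu-stop machinery
     *   active_load_balance_cpu_stop  - runs in stopper context and moves
     *                                   the chosen task towards a CPU in
     *                                   sched_sg_idle_mask
     */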
4426 -+
4427 -+#ifdef CONFIG_NO_HZ_FULL
4428 -+
4429 -+struct tick_work {
4430 -+ int cpu;
4431 -+ atomic_t state;
4432 -+ struct delayed_work work;
4433 -+};
4434 -+/* Values for ->state, see diagram below. */
4435 -+#define TICK_SCHED_REMOTE_OFFLINE 0
4436 -+#define TICK_SCHED_REMOTE_OFFLINING 1
4437 -+#define TICK_SCHED_REMOTE_RUNNING 2
4438 -+
4439 -+/*
4440 -+ * State diagram for ->state:
4441 -+ *
4442 -+ *
4443 -+ * TICK_SCHED_REMOTE_OFFLINE
4444 -+ * | ^
4445 -+ * | |
4446 -+ * | | sched_tick_remote()
4447 -+ * | |
4448 -+ * | |
4449 -+ * +--TICK_SCHED_REMOTE_OFFLINING
4450 -+ * | ^
4451 -+ * | |
4452 -+ * sched_tick_start() | | sched_tick_stop()
4453 -+ * | |
4454 -+ * V |
4455 -+ * TICK_SCHED_REMOTE_RUNNING
4456 -+ *
4457 -+ *
4458 -+ * Other transitions get WARN_ON_ONCE(), except that sched_tick_remote()
4459 -+ * and sched_tick_start() are happy to leave the state in RUNNING.
4460 -+ */
4461 -+
4462 -+static struct tick_work __percpu *tick_work_cpu;
4463 -+
4464 -+static void sched_tick_remote(struct work_struct *work)
4465 -+{
4466 -+ struct delayed_work *dwork = to_delayed_work(work);
4467 -+ struct tick_work *twork = container_of(dwork, struct tick_work, work);
4468 -+ int cpu = twork->cpu;
4469 -+ struct rq *rq = cpu_rq(cpu);
4470 -+ struct task_struct *curr;
4471 -+ unsigned long flags;
4472 -+ u64 delta;
4473 -+ int os;
4474 -+
4475 -+ /*
4476 -+ * Handle the tick only if it appears the remote CPU is running in full
4477 -+ * dynticks mode. The check is racy by nature, but missing a tick or
4478 -+ * having one too many is no big deal because the scheduler tick updates
4479 -+ * statistics and checks timeslices in a time-independent way, regardless
4480 -+ * of when exactly it is running.
4481 -+ */
4482 -+ if (!tick_nohz_tick_stopped_cpu(cpu))
4483 -+ goto out_requeue;
4484 -+
4485 -+ raw_spin_lock_irqsave(&rq->lock, flags);
4486 -+ curr = rq->curr;
4487 -+ if (cpu_is_offline(cpu))
4488 -+ goto out_unlock;
4489 -+
4490 -+ update_rq_clock(rq);
4491 -+ if (!is_idle_task(curr)) {
4492 -+ /*
4493 -+ * Make sure the next tick runs within a reasonable
4494 -+ * amount of time.
4495 -+ */
4496 -+ delta = rq_clock_task(rq) - curr->last_ran;
4497 -+ WARN_ON_ONCE(delta > (u64)NSEC_PER_SEC * 3);
4498 -+ }
4499 -+ scheduler_task_tick(rq);
4500 -+
4501 -+ calc_load_nohz_remote(rq);
4502 -+out_unlock:
4503 -+ raw_spin_unlock_irqrestore(&rq->lock, flags);
4504 -+
4505 -+out_requeue:
4506 -+ /*
4507 -+ * Run the remote tick once per second (1Hz). This arbitrary
4508 -+ * frequency is large enough to avoid overload but short enough
4509 -+ * to keep scheduler internal stats reasonably up to date. But
4510 -+ * first update state to reflect hotplug activity if required.
4511 -+ */
4512 -+ os = atomic_fetch_add_unless(&twork->state, -1, TICK_SCHED_REMOTE_RUNNING);
4513 -+ WARN_ON_ONCE(os == TICK_SCHED_REMOTE_OFFLINE);
4514 -+ if (os == TICK_SCHED_REMOTE_RUNNING)
4515 -+ queue_delayed_work(system_unbound_wq, dwork, HZ);
4516 -+}
4517 -+
4518 -+static void sched_tick_start(int cpu)
4519 -+{
4520 -+ int os;
4521 -+ struct tick_work *twork;
4522 -+
4523 -+ if (housekeeping_cpu(cpu, HK_FLAG_TICK))
4524 -+ return;
4525 -+
4526 -+ WARN_ON_ONCE(!tick_work_cpu);
4527 -+
4528 -+ twork = per_cpu_ptr(tick_work_cpu, cpu);
4529 -+ os = atomic_xchg(&twork->state, TICK_SCHED_REMOTE_RUNNING);
4530 -+ WARN_ON_ONCE(os == TICK_SCHED_REMOTE_RUNNING);
4531 -+ if (os == TICK_SCHED_REMOTE_OFFLINE) {
4532 -+ twork->cpu = cpu;
4533 -+ INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
4534 -+ queue_delayed_work(system_unbound_wq, &twork->work, HZ);
4535 -+ }
4536 -+}
4537 -+
4538 -+#ifdef CONFIG_HOTPLUG_CPU
4539 -+static void sched_tick_stop(int cpu)
4540 -+{
4541 -+ struct tick_work *twork;
4542 -+
4543 -+ if (housekeeping_cpu(cpu, HK_FLAG_TICK))
4544 -+ return;
4545 -+
4546 -+ WARN_ON_ONCE(!tick_work_cpu);
4547 -+
4548 -+ twork = per_cpu_ptr(tick_work_cpu, cpu);
4549 -+ cancel_delayed_work_sync(&twork->work);
4550 -+}
4551 -+#endif /* CONFIG_HOTPLUG_CPU */
4552 -+
4553 -+int __init sched_tick_offload_init(void)
4554 -+{
4555 -+ tick_work_cpu = alloc_percpu(struct tick_work);
4556 -+ BUG_ON(!tick_work_cpu);
4557 -+ return 0;
4558 -+}
4559 -+
4560 -+#else /* !CONFIG_NO_HZ_FULL */
4561 -+static inline void sched_tick_start(int cpu) { }
4562 -+static inline void sched_tick_stop(int cpu) { }
4563 -+#endif
4564 -+
4565 -+#if defined(CONFIG_PREEMPTION) && (defined(CONFIG_DEBUG_PREEMPT) || \
4566 -+ defined(CONFIG_PREEMPT_TRACER))
4567 -+/*
4568 -+ * If the value passed in is equal to the current preempt count
4569 -+ * then we just disabled preemption. Start timing the latency.
4570 -+ */
4571 -+static inline void preempt_latency_start(int val)
4572 -+{
4573 -+ if (preempt_count() == val) {
4574 -+ unsigned long ip = get_lock_parent_ip();
4575 -+#ifdef CONFIG_DEBUG_PREEMPT
4576 -+ current->preempt_disable_ip = ip;
4577 -+#endif
4578 -+ trace_preempt_off(CALLER_ADDR0, ip);
4579 -+ }
4580 -+}
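A worked example of the check above (illustrative): preempt_disable() ends up in preempt_count_add() below, which calls preempt_latency_start() after __preempt_count_add(); the count equals the value just added only when it was zero beforehand, i.e. when this call is the one that actually disabled preemption.

    /*
     *   count 0 -> preempt_count_add(1) -> count 1 == val  => start timing
     *   count 1 -> preempt_count_add(1) -> count 2 != val  => nested, no restart
     */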
4581 -+
4582 -+void preempt_count_add(int val)
4583 -+{
4584 -+#ifdef CONFIG_DEBUG_PREEMPT
4585 -+ /*
4586 -+ * Underflow?
4587 -+ */
4588 -+ if (DEBUG_LOCKS_WARN_ON((preempt_count() < 0)))
4589 -+ return;
4590 -+#endif
4591 -+ __preempt_count_add(val);
4592 -+#ifdef CONFIG_DEBUG_PREEMPT
4593 -+ /*
4594 -+ * Spinlock count overflowing soon?
4595 -+ */
4596 -+ DEBUG_LOCKS_WARN_ON((preempt_count() & PREEMPT_MASK) >=
4597 -+ PREEMPT_MASK - 10);
4598 -+#endif
4599 -+ preempt_latency_start(val);
4600 -+}
4601 -+EXPORT_SYMBOL(preempt_count_add);
4602 -+NOKPROBE_SYMBOL(preempt_count_add);
4603 -+
4604 -+/*
4605 -+ * If the value passed in equals the current preempt count
4606 -+ * then we just enabled preemption. Stop timing the latency.
4607 -+ */
4608 -+static inline void preempt_latency_stop(int val)
4609 -+{
4610 -+ if (preempt_count() == val)
4611 -+ trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
4612 -+}
4613 -+
4614 -+void preempt_count_sub(int val)
4615 -+{
4616 -+#ifdef CONFIG_DEBUG_PREEMPT
4617 -+ /*
4618 -+ * Underflow?
4619 -+ */
4620 -+ if (DEBUG_LOCKS_WARN_ON(val > preempt_count()))
4621 -+ return;
4622 -+ /*
4623 -+ * Is the spinlock portion underflowing?
4624 -+ */
4625 -+ if (DEBUG_LOCKS_WARN_ON((val < PREEMPT_MASK) &&
4626 -+ !(preempt_count() & PREEMPT_MASK)))
4627 -+ return;
4628 -+#endif
4629 -+
4630 -+ preempt_latency_stop(val);
4631 -+ __preempt_count_sub(val);
4632 -+}
4633 -+EXPORT_SYMBOL(preempt_count_sub);
4634 -+NOKPROBE_SYMBOL(preempt_count_sub);
4635 -+
4636 -+#else
4637 -+static inline void preempt_latency_start(int val) { }
4638 -+static inline void preempt_latency_stop(int val) { }
4639 -+#endif
4640 -+
4641 -+static inline unsigned long get_preempt_disable_ip(struct task_struct *p)
4642 -+{
4643 -+#ifdef CONFIG_DEBUG_PREEMPT
4644 -+ return p->preempt_disable_ip;
4645 -+#else
4646 -+ return 0;
4647 -+#endif
4648 -+}
4649 -+
4650 -+/*
4651 -+ * Print scheduling while atomic bug:
4652 -+ */
4653 -+static noinline void __schedule_bug(struct task_struct *prev)
4654 -+{
4655 -+ /* Save this before calling printk(), since that will clobber it */
4656 -+ unsigned long preempt_disable_ip = get_preempt_disable_ip(current);
4657 -+
4658 -+ if (oops_in_progress)
4659 -+ return;
4660 -+
4661 -+ printk(KERN_ERR "BUG: scheduling while atomic: %s/%d/0x%08x\n",
4662 -+ prev->comm, prev->pid, preempt_count());
4663 -+
4664 -+ debug_show_held_locks(prev);
4665 -+ print_modules();
4666 -+ if (irqs_disabled())
4667 -+ print_irqtrace_events(prev);
4668 -+ if (IS_ENABLED(CONFIG_DEBUG_PREEMPT)
4669 -+ && in_atomic_preempt_off()) {
4670 -+ pr_err("Preemption disabled at:");
4671 -+ print_ip_sym(KERN_ERR, preempt_disable_ip);
4672 -+ }
4673 -+ if (panic_on_warn)
4674 -+ panic("scheduling while atomic\n");
4675 -+
4676 -+ dump_stack();
4677 -+ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
4678 -+}
4679 -+
4680 -+/*
4681 -+ * Various schedule()-time debugging checks and statistics:
4682 -+ */
4683 -+static inline void schedule_debug(struct task_struct *prev, bool preempt)
4684 -+{
4685 -+#ifdef CONFIG_SCHED_STACK_END_CHECK
4686 -+ if (task_stack_end_corrupted(prev))
4687 -+ panic("corrupted stack end detected inside scheduler\n");
4688 -+
4689 -+ if (task_scs_end_corrupted(prev))
4690 -+ panic("corrupted shadow stack detected inside scheduler\n");
4691 -+#endif
4692 -+
4693 -+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
4694 -+ if (!preempt && READ_ONCE(prev->__state) && prev->non_block_count) {
4695 -+ printk(KERN_ERR "BUG: scheduling in a non-blocking section: %s/%d/%i\n",
4696 -+ prev->comm, prev->pid, prev->non_block_count);
4697 -+ dump_stack();
4698 -+ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
4699 -+ }
4700 -+#endif
4701 -+
4702 -+ if (unlikely(in_atomic_preempt_off())) {
4703 -+ __schedule_bug(prev);
4704 -+ preempt_count_set(PREEMPT_DISABLED);
4705 -+ }
4706 -+ rcu_sleep_check();
4707 -+ SCHED_WARN_ON(ct_state() == CONTEXT_USER);
4708 -+
4709 -+ profile_hit(SCHED_PROFILING, __builtin_return_address(0));
4710 -+
4711 -+ schedstat_inc(this_rq()->sched_count);
4712 -+}
4713 -+
4714 -+/*
4715 -+ * Compile time debug macro
4716 -+ * #define ALT_SCHED_DEBUG
4717 -+ */
4718 -+
4719 -+#ifdef ALT_SCHED_DEBUG
4720 -+void alt_sched_debug(void)
4721 -+{
4722 -+ printk(KERN_INFO "sched: pending: 0x%04lx, idle: 0x%04lx, sg_idle: 0x%04lx\n",
4723 -+ sched_rq_pending_mask.bits[0],
4724 -+ sched_rq_watermark[0].bits[0],
4725 -+ sched_sg_idle_mask.bits[0]);
4726 -+}
4727 -+#else
4728 -+inline void alt_sched_debug(void) {}
4729 -+#endif
4730 -+
4731 -+#ifdef CONFIG_SMP
4732 -+
4733 -+#define SCHED_RQ_NR_MIGRATION (32U)
4734 -+/*
4735 -+ * Migrate pending tasks in @rq to @dest_cpu
4736 -+ * Will try to migrate the minimum of half of @rq's nr_running tasks and
4737 -+ * SCHED_RQ_NR_MIGRATION to @dest_cpu
4738 -+ */
4739 -+static inline int
4740 -+migrate_pending_tasks(struct rq *rq, struct rq *dest_rq, const int dest_cpu)
4741 -+{
4742 -+ struct task_struct *p, *skip = rq->curr;
4743 -+ int nr_migrated = 0;
4744 -+ int nr_tries = min(rq->nr_running / 2, SCHED_RQ_NR_MIGRATION);
4745 -+
4746 -+ while (skip != rq->idle && nr_tries &&
4747 -+ (p = sched_rq_next_task(skip, rq)) != rq->idle) {
4748 -+ skip = sched_rq_next_task(p, rq);
4749 -+ if (cpumask_test_cpu(dest_cpu, p->cpus_ptr)) {
4750 -+ __SCHED_DEQUEUE_TASK(p, rq, 0, );
4751 -+ set_task_cpu(p, dest_cpu);
4752 -+ sched_task_sanity_check(p, dest_rq);
4753 -+ __SCHED_ENQUEUE_TASK(p, dest_rq, 0);
4754 -+ nr_migrated++;
4755 -+ }
4756 -+ nr_tries--;
4757 -+ }
4758 -+
4759 -+ return nr_migrated;
4760 -+}
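A quick worked example of the nr_tries cap in migrate_pending_tasks() (numbers illustrative):

    /*
     *   rq->nr_running = 10  -> nr_tries = min(5, 32)  = 5
     *   rq->nr_running = 100 -> nr_tries = min(50, 32) = 32
     * so at most SCHED_RQ_NR_MIGRATION (32) pending tasks are pulled
     * towards @dest_cpu per attempt, and only those whose cpus_ptr
     * allows @dest_cpu.
     */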
4761 -+
4762 -+static inline int take_other_rq_tasks(struct rq *rq, int cpu)
4763 -+{
4764 -+ struct cpumask *topo_mask, *end_mask;
4765 -+
4766 -+ if (unlikely(!rq->online))
4767 -+ return 0;
4768 -+
4769 -+ if (cpumask_empty(&sched_rq_pending_mask))
4770 -+ return 0;
4771 -+
4772 -+ topo_mask = per_cpu(sched_cpu_topo_masks, cpu) + 1;
4773 -+ end_mask = per_cpu(sched_cpu_topo_end_mask, cpu);
4774 -+ do {
4775 -+ int i;
4776 -+ for_each_cpu_and(i, &sched_rq_pending_mask, topo_mask) {
4777 -+ int nr_migrated;
4778 -+ struct rq *src_rq;
4779 -+
4780 -+ src_rq = cpu_rq(i);
4781 -+ if (!do_raw_spin_trylock(&src_rq->lock))
4782 -+ continue;
4783 -+ spin_acquire(&src_rq->lock.dep_map,
4784 -+ SINGLE_DEPTH_NESTING, 1, _RET_IP_);
4785 -+
4786 -+ if ((nr_migrated = migrate_pending_tasks(src_rq, rq, cpu))) {
4787 -+ src_rq->nr_running -= nr_migrated;
4788 -+ if (src_rq->nr_running < 2)
4789 -+ cpumask_clear_cpu(i, &sched_rq_pending_mask);
4790 -+
4791 -+ rq->nr_running += nr_migrated;
4792 -+ if (rq->nr_running > 1)
4793 -+ cpumask_set_cpu(cpu, &sched_rq_pending_mask);
4794 -+
4795 -+ update_sched_rq_watermark(rq);
4796 -+ cpufreq_update_util(rq, 0);
4797 -+
4798 -+ spin_release(&src_rq->lock.dep_map, _RET_IP_);
4799 -+ do_raw_spin_unlock(&src_rq->lock);
4800 -+
4801 -+ return 1;
4802 -+ }
4803 -+
4804 -+ spin_release(&src_rq->lock.dep_map, _RET_IP_);
4805 -+ do_raw_spin_unlock(&src_rq->lock);
4806 -+ }
4807 -+ } while (++topo_mask < end_mask);
4808 -+
4809 -+ return 0;
4810 -+}
4811 -+#endif
4812 -+
4813 -+/*
4814 -+ * Timeslices below RESCHED_NS are considered as good as expired as there's no
4815 -+ * point rescheduling when there's so little time left.
4816 -+ */
4817 -+static inline void check_curr(struct task_struct *p, struct rq *rq)
4818 -+{
4819 -+ if (unlikely(rq->idle == p))
4820 -+ return;
4821 -+
4822 -+ update_curr(rq, p);
4823 -+
4824 -+ if (p->time_slice < RESCHED_NS)
4825 -+ time_slice_expired(p, rq);
4826 -+}
4827 -+
4828 -+static inline struct task_struct *
4829 -+choose_next_task(struct rq *rq, int cpu, struct task_struct *prev)
4830 -+{
4831 -+ struct task_struct *next;
4832 -+
4833 -+ if (unlikely(rq->skip)) {
4834 -+ next = rq_runnable_task(rq);
4835 -+ if (next == rq->idle) {
4836 -+#ifdef CONFIG_SMP
4837 -+ if (!take_other_rq_tasks(rq, cpu)) {
4838 -+#endif
4839 -+ rq->skip = NULL;
4840 -+ schedstat_inc(rq->sched_goidle);
4841 -+ return next;
4842 -+#ifdef CONFIG_SMP
4843 -+ }
4844 -+ next = rq_runnable_task(rq);
4845 -+#endif
4846 -+ }
4847 -+ rq->skip = NULL;
4848 -+#ifdef CONFIG_HIGH_RES_TIMERS
4849 -+ hrtick_start(rq, next->time_slice);
4850 -+#endif
4851 -+ return next;
4852 -+ }
4853 -+
4854 -+ next = sched_rq_first_task(rq);
4855 -+ if (next == rq->idle) {
4856 -+#ifdef CONFIG_SMP
4857 -+ if (!take_other_rq_tasks(rq, cpu)) {
4858 -+#endif
4859 -+ schedstat_inc(rq->sched_goidle);
4860 -+ /*printk(KERN_INFO "sched: choose_next_task(%d) idle %px\n", cpu, next);*/
4861 -+ return next;
4862 -+#ifdef CONFIG_SMP
4863 -+ }
4864 -+ next = sched_rq_first_task(rq);
4865 -+#endif
4866 -+ }
4867 -+#ifdef CONFIG_HIGH_RES_TIMERS
4868 -+ hrtick_start(rq, next->time_slice);
4869 -+#endif
4870 -+ /*printk(KERN_INFO "sched: choose_next_task(%d) next %px\n", cpu,
4871 -+ * next);*/
4872 -+ return next;
4873 -+}
4874 -+
4875 -+/*
4876 -+ * schedule() is the main scheduler function.
4877 -+ *
4878 -+ * The main means of driving the scheduler and thus entering this function are:
4879 -+ *
4880 -+ * 1. Explicit blocking: mutex, semaphore, waitqueue, etc.
4881 -+ *
4882 -+ * 2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
4883 -+ * paths. For example, see arch/x86/entry_64.S.
4884 -+ *
4885 -+ * To drive preemption between tasks, the scheduler sets the flag in timer
4886 -+ * interrupt handler scheduler_tick().
4887 -+ *
4888 -+ * 3. Wakeups don't really cause entry into schedule(). They add a
4889 -+ * task to the run-queue and that's it.
4890 -+ *
4891 -+ * Now, if the new task added to the run-queue preempts the current
4892 -+ * task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
4893 -+ * called on the nearest possible occasion:
4894 -+ *
4895 -+ * - If the kernel is preemptible (CONFIG_PREEMPTION=y):
4896 -+ *
4897 -+ * - in syscall or exception context, at the next outermost
4898 -+ * preempt_enable(). (this might be as soon as the wake_up()'s
4899 -+ * spin_unlock()!)
4900 -+ *
4901 -+ * - in IRQ context, return from interrupt-handler to
4902 -+ * preemptible context
4903 -+ *
4904 -+ * - If the kernel is not preemptible (CONFIG_PREEMPTION is not set)
4905 -+ * then at the next:
4906 -+ *
4907 -+ * - cond_resched() call
4908 -+ * - explicit schedule() call
4909 -+ * - return from syscall or exception to user-space
4910 -+ * - return from interrupt-handler to user-space
4911 -+ *
4912 -+ * WARNING: must be called with preemption disabled!
4913 -+ */
4914 -+static void __sched notrace __schedule(bool preempt)
4915 -+{
4916 -+ struct task_struct *prev, *next;
4917 -+ unsigned long *switch_count;
4918 -+ unsigned long prev_state;
4919 -+ struct rq *rq;
4920 -+ int cpu;
4921 -+
4922 -+ cpu = smp_processor_id();
4923 -+ rq = cpu_rq(cpu);
4924 -+ prev = rq->curr;
4925 -+
4926 -+ schedule_debug(prev, preempt);
4927 -+
4928 -+ /* bypassing the sched_feat(HRTICK) check, which Alt schedule FW doesn't support */
4929 -+ hrtick_clear(rq);
4930 -+
4931 -+ local_irq_disable();
4932 -+ rcu_note_context_switch(preempt);
4933 -+
4934 -+ /*
4935 -+ * Make sure that signal_pending_state()->signal_pending() below
4936 -+ * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
4937 -+ * done by the caller to avoid the race with signal_wake_up():
4938 -+ *
4939 -+ * __set_current_state(@state) signal_wake_up()
4940 -+ * schedule() set_tsk_thread_flag(p, TIF_SIGPENDING)
4941 -+ * wake_up_state(p, state)
4942 -+ * LOCK rq->lock LOCK p->pi_state
4943 -+ * smp_mb__after_spinlock() smp_mb__after_spinlock()
4944 -+ * if (signal_pending_state()) if (p->state & @state)
4945 -+ *
4946 -+ * Also, the membarrier system call requires a full memory barrier
4947 -+ * after coming from user-space, before storing to rq->curr.
4948 -+ */
4949 -+ raw_spin_lock(&rq->lock);
4950 -+ smp_mb__after_spinlock();
4951 -+
4952 -+ update_rq_clock(rq);
4953 -+
4954 -+ switch_count = &prev->nivcsw;
4955 -+ /*
4956 -+ * We must load prev->state once (task_struct::state is volatile), such
4957 -+ * that:
4958 -+ *
4959 -+ * - we form a control dependency vs deactivate_task() below.
4960 -+ * - ptrace_{,un}freeze_traced() can change ->state underneath us.
4961 -+ */
4962 -+ prev_state = READ_ONCE(prev->__state);
4963 -+ if (!preempt && prev_state) {
4964 -+ if (signal_pending_state(prev_state, prev)) {
4965 -+ WRITE_ONCE(prev->__state, TASK_RUNNING);
4966 -+ } else {
4967 -+ prev->sched_contributes_to_load =
4968 -+ (prev_state & TASK_UNINTERRUPTIBLE) &&
4969 -+ !(prev_state & TASK_NOLOAD) &&
4970 -+ !(prev->flags & PF_FROZEN);
4971 -+
4972 -+ if (prev->sched_contributes_to_load)
4973 -+ rq->nr_uninterruptible++;
4974 -+
4975 -+ /*
4976 -+ * __schedule() ttwu()
4977 -+ * prev_state = prev->state; if (p->on_rq && ...)
4978 -+ * if (prev_state) goto out;
4979 -+ * p->on_rq = 0; smp_acquire__after_ctrl_dep();
4980 -+ * p->state = TASK_WAKING
4981 -+ *
4982 -+ * Where __schedule() and ttwu() have matching control dependencies.
4983 -+ *
4984 -+ * After this, schedule() must not care about p->state any more.
4985 -+ */
4986 -+ sched_task_deactivate(prev, rq);
4987 -+ deactivate_task(prev, rq);
4988 -+
4989 -+ if (prev->in_iowait) {
4990 -+ atomic_inc(&rq->nr_iowait);
4991 -+ delayacct_blkio_start();
4992 -+ }
4993 -+ }
4994 -+ switch_count = &prev->nvcsw;
4995 -+ }
4996 -+
4997 -+ check_curr(prev, rq);
4998 -+
4999 -+ next = choose_next_task(rq, cpu, prev);
5000 -+ clear_tsk_need_resched(prev);
5001 -+ clear_preempt_need_resched();
5002 -+#ifdef CONFIG_SCHED_DEBUG
5003 -+ rq->last_seen_need_resched_ns = 0;
5004 -+#endif
5005 -+
5006 -+ if (likely(prev != next)) {
5007 -+ next->last_ran = rq->clock_task;
5008 -+ rq->last_ts_switch = rq->clock;
5009 -+
5010 -+ rq->nr_switches++;
5011 -+ /*
5012 -+ * RCU users of rcu_dereference(rq->curr) may not see
5013 -+ * changes to task_struct made by pick_next_task().
5014 -+ */
5015 -+ RCU_INIT_POINTER(rq->curr, next);
5016 -+ /*
5017 -+ * The membarrier system call requires each architecture
5018 -+ * to have a full memory barrier after updating
5019 -+ * rq->curr, before returning to user-space.
5020 -+ *
5021 -+ * Here are the schemes providing that barrier on the
5022 -+ * various architectures:
5023 -+ * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
5024 -+ * switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
5025 -+ * - finish_lock_switch() for weakly-ordered
5026 -+ * architectures where spin_unlock is a full barrier,
5027 -+ * - switch_to() for arm64 (weakly-ordered, spin_unlock
5028 -+ * is a RELEASE barrier),
5029 -+ */
5030 -+ ++*switch_count;
5031 -+
5032 -+ psi_sched_switch(prev, next, !task_on_rq_queued(prev));
5033 -+
5034 -+ trace_sched_switch(preempt, prev, next);
5035 -+
5036 -+ /* Also unlocks the rq: */
5037 -+ rq = context_switch(rq, prev, next);
5038 -+ } else {
5039 -+ __balance_callbacks(rq);
5040 -+ raw_spin_unlock_irq(&rq->lock);
5041 -+ }
5042 -+
5043 -+#ifdef CONFIG_SCHED_SMT
5044 -+ sg_balance_check(rq);
5045 -+#endif
5046 -+}
5047 -+
5048 -+void __noreturn do_task_dead(void)
5049 -+{
5050 -+ /* Causes final put_task_struct in finish_task_switch(): */
5051 -+ set_special_state(TASK_DEAD);
5052 -+
5053 -+ /* Tell freezer to ignore us: */
5054 -+ current->flags |= PF_NOFREEZE;
5055 -+
5056 -+ __schedule(false);
5057 -+ BUG();
5058 -+
5059 -+ /* Avoid "noreturn function does return" - but don't continue if BUG() is a NOP: */
5060 -+ for (;;)
5061 -+ cpu_relax();
5062 -+}
5063 -+
5064 -+static inline void sched_submit_work(struct task_struct *tsk)
5065 -+{
5066 -+ unsigned int task_flags;
5067 -+
5068 -+ if (task_is_running(tsk))
5069 -+ return;
5070 -+
5071 -+ task_flags = tsk->flags;
5072 -+ /*
5073 -+ * If a worker went to sleep, notify and ask workqueue whether
5074 -+ * it wants to wake up a task to maintain concurrency.
5075 -+ * As this function is called inside the schedule() context,
5076 -+ * we disable preemption to avoid it calling schedule() again
5077 -+ * in the possible wakeup of a kworker and because wq_worker_sleeping()
5078 -+ * requires it.
5079 -+ */
5080 -+ if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
5081 -+ preempt_disable();
5082 -+ if (task_flags & PF_WQ_WORKER)
5083 -+ wq_worker_sleeping(tsk);
5084 -+ else
5085 -+ io_wq_worker_sleeping(tsk);
5086 -+ preempt_enable_no_resched();
5087 -+ }
5088 -+
5089 -+ if (tsk_is_pi_blocked(tsk))
5090 -+ return;
5091 -+
5092 -+ /*
5093 -+ * If we are going to sleep and we have plugged IO queued,
5094 -+ * make sure to submit it to avoid deadlocks.
5095 -+ */
5096 -+ if (blk_needs_flush_plug(tsk))
5097 -+ blk_schedule_flush_plug(tsk);
5098 -+}
5099 -+
5100 -+static void sched_update_worker(struct task_struct *tsk)
5101 -+{
5102 -+ if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
5103 -+ if (tsk->flags & PF_WQ_WORKER)
5104 -+ wq_worker_running(tsk);
5105 -+ else
5106 -+ io_wq_worker_running(tsk);
5107 -+ }
5108 -+}
5109 -+
5110 -+asmlinkage __visible void __sched schedule(void)
5111 -+{
5112 -+ struct task_struct *tsk = current;
5113 -+
5114 -+ sched_submit_work(tsk);
5115 -+ do {
5116 -+ preempt_disable();
5117 -+ __schedule(false);
5118 -+ sched_preempt_enable_no_resched();
5119 -+ } while (need_resched());
5120 -+ sched_update_worker(tsk);
5121 -+}
5122 -+EXPORT_SYMBOL(schedule);
5123 -+
5124 -+/*
5125 -+ * synchronize_rcu_tasks() makes sure that no task is stuck in preempted
5126 -+ * state (have scheduled out non-voluntarily) by making sure that all
5127 -+ * tasks have either left the run queue or have gone into user space.
5128 -+ * As idle tasks do not do either, they must not ever be preempted
5129 -+ * (schedule out non-voluntarily).
5130 -+ *
5131 -+ * schedule_idle() is similar to schedule_preempt_disabled() except that it
5132 -+ * never enables preemption because it does not call sched_submit_work().
5133 -+ */
5134 -+void __sched schedule_idle(void)
5135 -+{
5136 -+ /*
5137 -+ * As this skips calling sched_submit_work(), which the idle task does
5138 -+ * regardless because that function is a nop when the task is in a
5139 -+ * TASK_RUNNING state, make sure this isn't used someplace that the
5140 -+ * current task can be in any other state. Note, idle is always in the
5141 -+ * TASK_RUNNING state.
5142 -+ */
5143 -+ WARN_ON_ONCE(current->__state);
5144 -+ do {
5145 -+ __schedule(false);
5146 -+ } while (need_resched());
5147 -+}
5148 -+
5149 -+#if defined(CONFIG_CONTEXT_TRACKING) && !defined(CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK)
5150 -+asmlinkage __visible void __sched schedule_user(void)
5151 -+{
5152 -+ /*
5153 -+ * If we come here after a random call to set_need_resched(),
5154 -+ * or we have been woken up remotely but the IPI has not yet arrived,
5155 -+ * we haven't yet exited the RCU idle mode. Do it here manually until
5156 -+ * we find a better solution.
5157 -+ *
5158 -+ * NB: There are buggy callers of this function. Ideally we
5159 -+ * should warn if prev_state != CONTEXT_USER, but that will trigger
5160 -+ * too frequently to make sense yet.
5161 -+ */
5162 -+ enum ctx_state prev_state = exception_enter();
5163 -+ schedule();
5164 -+ exception_exit(prev_state);
5165 -+}
5166 -+#endif
5167 -+
5168 -+/**
5169 -+ * schedule_preempt_disabled - called with preemption disabled
5170 -+ *
5171 -+ * Returns with preemption disabled. Note: preempt_count must be 1
5172 -+ */
5173 -+void __sched schedule_preempt_disabled(void)
5174 -+{
5175 -+ sched_preempt_enable_no_resched();
5176 -+ schedule();
5177 -+ preempt_disable();
5178 -+}
5179 -+
5180 -+static void __sched notrace preempt_schedule_common(void)
5181 -+{
5182 -+ do {
5183 -+ /*
5184 -+ * Because the function tracer can trace preempt_count_sub()
5185 -+ * and it also uses preempt_enable/disable_notrace(), if
5186 -+ * NEED_RESCHED is set, the preempt_enable_notrace() called
5187 -+ * by the function tracer will call this function again and
5188 -+ * cause infinite recursion.
5189 -+ *
5190 -+ * Preemption must be disabled here before the function
5191 -+ * tracer can trace. Break up preempt_disable() into two
5192 -+ * calls. One to disable preemption without fear of being
5193 -+ * traced. The other to still record the preemption latency,
5194 -+ * which can also be traced by the function tracer.
5195 -+ */
5196 -+ preempt_disable_notrace();
5197 -+ preempt_latency_start(1);
5198 -+ __schedule(true);
5199 -+ preempt_latency_stop(1);
5200 -+ preempt_enable_no_resched_notrace();
5201 -+
5202 -+ /*
5203 -+ * Check again in case we missed a preemption opportunity
5204 -+ * between schedule and now.
5205 -+ */
5206 -+ } while (need_resched());
5207 -+}
5208 -+
5209 -+#ifdef CONFIG_PREEMPTION
5210 -+/*
5211 -+ * This is the entry point to schedule() from in-kernel preemption
5212 -+ * off of preempt_enable.
5213 -+ */
5214 -+asmlinkage __visible void __sched notrace preempt_schedule(void)
5215 -+{
5216 -+ /*
5217 -+ * If there is a non-zero preempt_count or interrupts are disabled,
5218 -+ * we do not want to preempt the current task. Just return..
5219 -+ */
5220 -+ if (likely(!preemptible()))
5221 -+ return;
5222 -+
5223 -+ preempt_schedule_common();
5224 -+}
5225 -+NOKPROBE_SYMBOL(preempt_schedule);
5226 -+EXPORT_SYMBOL(preempt_schedule);
5227 -+
5228 -+#ifdef CONFIG_PREEMPT_DYNAMIC
5229 -+DEFINE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
5230 -+EXPORT_STATIC_CALL_TRAMP(preempt_schedule);
5231 -+#endif
5232 -+
5233 -+
5234 -+/**
5235 -+ * preempt_schedule_notrace - preempt_schedule called by tracing
5236 -+ *
5237 -+ * The tracing infrastructure uses preempt_enable_notrace to prevent
5238 -+ * recursion and tracing preempt enabling caused by the tracing
5239 -+ * infrastructure itself. But as tracing can happen in areas coming
5240 -+ * from userspace or just about to enter userspace, a preempt enable
5241 -+ * can occur before user_exit() is called. This will cause the scheduler
5242 -+ * to be called when the system is still in usermode.
5243 -+ *
5244 -+ * To prevent this, the preempt_enable_notrace will use this function
5245 -+ * instead of preempt_schedule() to exit user context if needed before
5246 -+ * calling the scheduler.
5247 -+ */
5248 -+asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)
5249 -+{
5250 -+ enum ctx_state prev_ctx;
5251 -+
5252 -+ if (likely(!preemptible()))
5253 -+ return;
5254 -+
5255 -+ do {
5256 -+ /*
5257 -+ * Because the function tracer can trace preempt_count_sub()
5258 -+ * and it also uses preempt_enable/disable_notrace(), if
5259 -+ * NEED_RESCHED is set, the preempt_enable_notrace() called
5260 -+ * by the function tracer will call this function again and
5261 -+ * cause infinite recursion.
5262 -+ *
5263 -+ * Preemption must be disabled here before the function
5264 -+ * tracer can trace. Break up preempt_disable() into two
5265 -+ * calls. One to disable preemption without fear of being
5266 -+ * traced. The other to still record the preemption latency,
5267 -+ * which can also be traced by the function tracer.
5268 -+ */
5269 -+ preempt_disable_notrace();
5270 -+ preempt_latency_start(1);
5271 -+ /*
5272 -+ * Needs preempt disabled in case user_exit() is traced
5273 -+ * and the tracer calls preempt_enable_notrace() causing
5274 -+ * an infinite recursion.
5275 -+ */
5276 -+ prev_ctx = exception_enter();
5277 -+ __schedule(true);
5278 -+ exception_exit(prev_ctx);
5279 -+
5280 -+ preempt_latency_stop(1);
5281 -+ preempt_enable_no_resched_notrace();
5282 -+ } while (need_resched());
5283 -+}
5284 -+EXPORT_SYMBOL_GPL(preempt_schedule_notrace);
5285 -+
5286 -+#ifdef CONFIG_PREEMPT_DYNAMIC
5287 -+DEFINE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
5288 -+EXPORT_STATIC_CALL_TRAMP(preempt_schedule_notrace);
5289 -+#endif
5290 -+
5291 -+#endif /* CONFIG_PREEMPTION */
5292 -+
5293 -+#ifdef CONFIG_PREEMPT_DYNAMIC
5294 -+
5295 -+#include <linux/entry-common.h>
5296 -+
5297 -+/*
5298 -+ * SC:cond_resched
5299 -+ * SC:might_resched
5300 -+ * SC:preempt_schedule
5301 -+ * SC:preempt_schedule_notrace
5302 -+ * SC:irqentry_exit_cond_resched
5303 -+ *
5304 -+ *
5305 -+ * NONE:
5306 -+ * cond_resched <- __cond_resched
5307 -+ * might_resched <- RET0
5308 -+ * preempt_schedule <- NOP
5309 -+ * preempt_schedule_notrace <- NOP
5310 -+ * irqentry_exit_cond_resched <- NOP
5311 -+ *
5312 -+ * VOLUNTARY:
5313 -+ * cond_resched <- __cond_resched
5314 -+ * might_resched <- __cond_resched
5315 -+ * preempt_schedule <- NOP
5316 -+ * preempt_schedule_notrace <- NOP
5317 -+ * irqentry_exit_cond_resched <- NOP
5318 -+ *
5319 -+ * FULL:
5320 -+ * cond_resched <- RET0
5321 -+ * might_resched <- RET0
5322 -+ * preempt_schedule <- preempt_schedule
5323 -+ * preempt_schedule_notrace <- preempt_schedule_notrace
5324 -+ * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
5325 -+ */
5326 -+
5327 -+enum {
5328 -+ preempt_dynamic_none = 0,
5329 -+ preempt_dynamic_voluntary,
5330 -+ preempt_dynamic_full,
5331 -+};
5332 -+
5333 -+int preempt_dynamic_mode = preempt_dynamic_full;
5334 -+
5335 -+int sched_dynamic_mode(const char *str)
5336 -+{
5337 -+ if (!strcmp(str, "none"))
5338 -+ return preempt_dynamic_none;
5339 -+
5340 -+ if (!strcmp(str, "voluntary"))
5341 -+ return preempt_dynamic_voluntary;
5342 -+
5343 -+ if (!strcmp(str, "full"))
5344 -+ return preempt_dynamic_full;
5345 -+
5346 -+ return -EINVAL;
5347 -+}
5348 -+
5349 -+void sched_dynamic_update(int mode)
5350 -+{
5351 -+ /*
5352 -+ * Avoid {NONE,VOLUNTARY} -> FULL transitions from ever ending up in
5353 -+ * the ZERO state, which is invalid.
5354 -+ */
5355 -+ static_call_update(cond_resched, __cond_resched);
5356 -+ static_call_update(might_resched, __cond_resched);
5357 -+ static_call_update(preempt_schedule, __preempt_schedule_func);
5358 -+ static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
5359 -+ static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
5360 -+
5361 -+ switch (mode) {
5362 -+ case preempt_dynamic_none:
5363 -+ static_call_update(cond_resched, __cond_resched);
5364 -+ static_call_update(might_resched, (void *)&__static_call_return0);
5365 -+ static_call_update(preempt_schedule, NULL);
5366 -+ static_call_update(preempt_schedule_notrace, NULL);
5367 -+ static_call_update(irqentry_exit_cond_resched, NULL);
5368 -+ pr_info("Dynamic Preempt: none\n");
5369 -+ break;
5370 -+
5371 -+ case preempt_dynamic_voluntary:
5372 -+ static_call_update(cond_resched, __cond_resched);
5373 -+ static_call_update(might_resched, __cond_resched);
5374 -+ static_call_update(preempt_schedule, NULL);
5375 -+ static_call_update(preempt_schedule_notrace, NULL);
5376 -+ static_call_update(irqentry_exit_cond_resched, NULL);
5377 -+ pr_info("Dynamic Preempt: voluntary\n");
5378 -+ break;
5379 -+
5380 -+ case preempt_dynamic_full:
5381 -+ static_call_update(cond_resched, (void *)&__static_call_return0);
5382 -+ static_call_update(might_resched, (void *)&__static_call_return0);
5383 -+ static_call_update(preempt_schedule, __preempt_schedule_func);
5384 -+ static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
5385 -+ static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
5386 -+ pr_info("Dynamic Preempt: full\n");
5387 -+ break;
5388 -+ }
5389 -+
5390 -+ preempt_dynamic_mode = mode;
5391 -+}
5392 -+
5393 -+static int __init setup_preempt_mode(char *str)
5394 -+{
5395 -+ int mode = sched_dynamic_mode(str);
5396 -+ if (mode < 0) {
5397 -+ pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
5398 -+ return 1;
5399 -+ }
5400 -+
5401 -+ sched_dynamic_update(mode);
5402 -+ return 0;
5403 -+}
5404 -+__setup("preempt=", setup_preempt_mode);
5405 -+
5406 -+#endif /* CONFIG_PREEMPT_DYNAMIC */
5407 -+
5408 -+/*
5409 -+ * This is the entry point to schedule() from kernel preemption
5410 -+ * off of irq context.
5411 -+ * Note that this is called and returns with irqs disabled. This
5412 -+ * protects us against recursive calls from irq context.
5413 -+ */
5414 -+asmlinkage __visible void __sched preempt_schedule_irq(void)
5415 -+{
5416 -+ enum ctx_state prev_state;
5417 -+
5418 -+ /* Catch callers which need to be fixed */
5419 -+ BUG_ON(preempt_count() || !irqs_disabled());
5420 -+
5421 -+ prev_state = exception_enter();
5422 -+
5423 -+ do {
5424 -+ preempt_disable();
5425 -+ local_irq_enable();
5426 -+ __schedule(true);
5427 -+ local_irq_disable();
5428 -+ sched_preempt_enable_no_resched();
5429 -+ } while (need_resched());
5430 -+
5431 -+ exception_exit(prev_state);
5432 -+}
5433 -+
5434 -+int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
5435 -+ void *key)
5436 -+{
5437 -+ WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) && wake_flags & ~WF_SYNC);
5438 -+ return try_to_wake_up(curr->private, mode, wake_flags);
5439 -+}
5440 -+EXPORT_SYMBOL(default_wake_function);
5441 -+
5442 -+static inline void check_task_changed(struct task_struct *p, struct rq *rq)
5443 -+{
5444 -+ /* Trigger resched if task sched_prio has been modified. */
5445 -+ if (task_on_rq_queued(p) && task_sched_prio_idx(p, rq) != p->sq_idx) {
5446 -+ requeue_task(p, rq);
5447 -+ check_preempt_curr(rq);
5448 -+ }
5449 -+}
5450 -+
5451 -+static void __setscheduler_prio(struct task_struct *p, int prio)
5452 -+{
5453 -+ p->prio = prio;
5454 -+}
5455 -+
5456 -+#ifdef CONFIG_RT_MUTEXES
5457 -+
5458 -+static inline int __rt_effective_prio(struct task_struct *pi_task, int prio)
5459 -+{
5460 -+ if (pi_task)
5461 -+ prio = min(prio, pi_task->prio);
5462 -+
5463 -+ return prio;
5464 -+}
5465 -+
5466 -+static inline int rt_effective_prio(struct task_struct *p, int prio)
5467 -+{
5468 -+ struct task_struct *pi_task = rt_mutex_get_top_task(p);
5469 -+
5470 -+ return __rt_effective_prio(pi_task, prio);
5471 -+}
5472 -+
5473 -+/*
5474 -+ * rt_mutex_setprio - set the current priority of a task
5475 -+ * @p: task to boost
5476 -+ * @pi_task: donor task
5477 -+ *
5478 -+ * This function changes the 'effective' priority of a task. It does
5479 -+ * not touch ->normal_prio like __setscheduler().
5480 -+ *
5481 -+ * Used by the rt_mutex code to implement priority inheritance
5482 -+ * logic. Call site only calls if the priority of the task changed.
5483 -+ */
5484 -+void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
5485 -+{
5486 -+ int prio;
5487 -+ struct rq *rq;
5488 -+ raw_spinlock_t *lock;
5489 -+
5490 -+ /* XXX used to be waiter->prio, not waiter->task->prio */
5491 -+ prio = __rt_effective_prio(pi_task, p->normal_prio);
5492 -+
5493 -+ /*
5494 -+ * If nothing changed; bail early.
5495 -+ */
5496 -+ if (p->pi_top_task == pi_task && prio == p->prio)
5497 -+ return;
5498 -+
5499 -+ rq = __task_access_lock(p, &lock);
5500 -+ /*
5501 -+ * Set under pi_lock && rq->lock, such that the value can be used under
5502 -+ * either lock.
5503 -+ *
5504 -+ * Note that there are loads of tricks needed to make this pointer cache work
5505 -+ * right. rt_mutex_slowunlock()+rt_mutex_postunlock() work together to
5506 -+ * ensure a task is de-boosted (pi_task is set to NULL) before the
5507 -+ * task is allowed to run again (and can exit). This ensures the pointer
5508 -+ * points to a blocked task -- which guarantees the task is present.
5509 -+ */
5510 -+ p->pi_top_task = pi_task;
5511 -+
5512 -+ /*
5513 -+ * For FIFO/RR we only need to set prio, if that matches we're done.
5514 -+ */
5515 -+ if (prio == p->prio)
5516 -+ goto out_unlock;
5517 -+
5518 -+ /*
5519 -+ * Idle task boosting is a no-no in general. There is one
5520 -+ * exception, when PREEMPT_RT and NOHZ is active:
5521 -+ *
5522 -+ * The idle task calls get_next_timer_interrupt() and holds
5523 -+ * the timer wheel base->lock on the CPU and another CPU wants
5524 -+ * to access the timer (probably to cancel it). We can safely
5525 -+ * ignore the boosting request, as the idle CPU runs this code
5526 -+ * with interrupts disabled and will complete the lock
5527 -+ * protected section without being interrupted. So there is no
5528 -+ * real need to boost.
5529 -+ */
5530 -+ if (unlikely(p == rq->idle)) {
5531 -+ WARN_ON(p != rq->curr);
5532 -+ WARN_ON(p->pi_blocked_on);
5533 -+ goto out_unlock;
5534 -+ }
5535 -+
5536 -+ trace_sched_pi_setprio(p, pi_task);
5537 -+
5538 -+ __setscheduler_prio(p, prio);
5539 -+
5540 -+ check_task_changed(p, rq);
5541 -+out_unlock:
5542 -+ /* Avoid rq from going away on us: */
5543 -+ preempt_disable();
5544 -+
5545 -+ __balance_callbacks(rq);
5546 -+ __task_access_unlock(p, lock);
5547 -+
5548 -+ preempt_enable();
5549 -+}
5550 -+#else
5551 -+static inline int rt_effective_prio(struct task_struct *p, int prio)
5552 -+{
5553 -+ return prio;
5554 -+}
5555 -+#endif
5556 -+
5557 -+void set_user_nice(struct task_struct *p, long nice)
5558 -+{
5559 -+ unsigned long flags;
5560 -+ struct rq *rq;
5561 -+ raw_spinlock_t *lock;
5562 -+
5563 -+ if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE)
5564 -+ return;
5565 -+ /*
5566 -+ * We have to be careful, if called from sys_setpriority(),
5567 -+ * the task might be in the middle of scheduling on another CPU.
5568 -+ */
5569 -+ raw_spin_lock_irqsave(&p->pi_lock, flags);
5570 -+ rq = __task_access_lock(p, &lock);
5571 -+
5572 -+ p->static_prio = NICE_TO_PRIO(nice);
5573 -+ /*
5574 -+ * The RT priorities are set via sched_setscheduler(), but we still
5575 -+ * allow the 'normal' nice value to be set - but as expected
5576 -+ * it won't have any effect on scheduling until the task returns
5577 -+ * to SCHED_NORMAL/SCHED_BATCH:
5578 -+ */
5579 -+ if (task_has_rt_policy(p))
5580 -+ goto out_unlock;
5581 -+
5582 -+ p->prio = effective_prio(p);
5583 -+
5584 -+ check_task_changed(p, rq);
5585 -+out_unlock:
5586 -+ __task_access_unlock(p, lock);
5587 -+ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
5588 -+}
5589 -+EXPORT_SYMBOL(set_user_nice);
5590 -+
5591 -+/*
5592 -+ * can_nice - check if a task can reduce its nice value
5593 -+ * @p: task
5594 -+ * @nice: nice value
5595 -+ */
5596 -+int can_nice(const struct task_struct *p, const int nice)
5597 -+{
5598 -+ /* Convert nice value [19,-20] to rlimit style value [1,40] */
5599 -+ int nice_rlim = nice_to_rlimit(nice);
5600 -+
5601 -+ return (nice_rlim <= task_rlimit(p, RLIMIT_NICE) ||
5602 -+ capable(CAP_SYS_NICE));
5603 -+}
5604 -+
5605 -+#ifdef __ARCH_WANT_SYS_NICE
5606 -+
5607 -+/*
5608 -+ * sys_nice - change the priority of the current process.
5609 -+ * @increment: priority increment
5610 -+ *
5611 -+ * sys_setpriority is a more generic, but much slower function that
5612 -+ * does similar things.
5613 -+ */
5614 -+SYSCALL_DEFINE1(nice, int, increment)
5615 -+{
5616 -+ long nice, retval;
5617 -+
5618 -+ /*
5619 -+ * Setpriority might change our priority at the same moment.
5620 -+ * We don't have to worry. Conceptually one call occurs first
5621 -+ * and we have a single winner.
5622 -+ */
5623 -+
5624 -+ increment = clamp(increment, -NICE_WIDTH, NICE_WIDTH);
5625 -+ nice = task_nice(current) + increment;
5626 -+
5627 -+ nice = clamp_val(nice, MIN_NICE, MAX_NICE);
5628 -+ if (increment < 0 && !can_nice(current, nice))
5629 -+ return -EPERM;
5630 -+
5631 -+ retval = security_task_setnice(current, nice);
5632 -+ if (retval)
5633 -+ return retval;
5634 -+
5635 -+ set_user_nice(current, nice);
5636 -+ return 0;
5637 -+}
5638 -+
5639 -+#endif
5640 -+
5641 -+/**
5642 -+ * task_prio - return the priority value of a given task.
5643 -+ * @p: the task in question.
5644 -+ *
5645 -+ * Return: The priority value as seen by users in /proc.
5646 -+ *
5647 -+ * sched policy               return value   kernel prio    user prio/nice
5648 -+ *
5649 -+ * (BMQ)normal, batch, idle   [0 ... 53]     [100 ... 139]  0/[-20 ... 19]/[-7 ... 7]
5650 -+ * (PDS)normal, batch, idle   [0 ... 39]     100            0/[-20 ... 19]
5651 -+ * fifo, rr                   [-1 ... -100]  [99 ... 0]     [0 ... 99]
5652 -+ */
5653 -+int task_prio(const struct task_struct *p)
5654 -+{
5655 -+ return (p->prio < MAX_RT_PRIO) ? p->prio - MAX_RT_PRIO :
5656 -+ task_sched_prio_normal(p, task_rq(p));
5657 -+}
5658 -+
5659 -+/**
5660 -+ * idle_cpu - is a given CPU idle currently?
5661 -+ * @cpu: the processor in question.
5662 -+ *
5663 -+ * Return: 1 if the CPU is currently idle. 0 otherwise.
5664 -+ */
5665 -+int idle_cpu(int cpu)
5666 -+{
5667 -+ struct rq *rq = cpu_rq(cpu);
5668 -+
5669 -+ if (rq->curr != rq->idle)
5670 -+ return 0;
5671 -+
5672 -+ if (rq->nr_running)
5673 -+ return 0;
5674 -+
5675 -+#ifdef CONFIG_SMP
5676 -+ if (rq->ttwu_pending)
5677 -+ return 0;
5678 -+#endif
5679 -+
5680 -+ return 1;
5681 -+}
5682 -+
5683 -+/**
5684 -+ * idle_task - return the idle task for a given CPU.
5685 -+ * @cpu: the processor in question.
5686 -+ *
5687 -+ * Return: The idle task for the cpu @cpu.
5688 -+ */
5689 -+struct task_struct *idle_task(int cpu)
5690 -+{
5691 -+ return cpu_rq(cpu)->idle;
5692 -+}
5693 -+
5694 -+/**
5695 -+ * find_process_by_pid - find a process with a matching PID value.
5696 -+ * @pid: the pid in question.
5697 -+ *
5698 -+ * The task of @pid, if found. %NULL otherwise.
5699 -+ */
5700 -+static inline struct task_struct *find_process_by_pid(pid_t pid)
5701 -+{
5702 -+ return pid ? find_task_by_vpid(pid) : current;
5703 -+}
5704 -+
5705 -+/*
5706 -+ * sched_setparam() passes in -1 for its policy, to let the functions
5707 -+ * it calls know not to change it.
5708 -+ */
5709 -+#define SETPARAM_POLICY -1
5710 -+
5711 -+static void __setscheduler_params(struct task_struct *p,
5712 -+ const struct sched_attr *attr)
5713 -+{
5714 -+ int policy = attr->sched_policy;
5715 -+
5716 -+ if (policy == SETPARAM_POLICY)
5717 -+ policy = p->policy;
5718 -+
5719 -+ p->policy = policy;
5720 -+
5721 -+ /*
5722 -+ * Allow the normal nice value to be set, but it will not have any
5723 -+ * effect on scheduling until the task runs as SCHED_NORMAL/
5724 -+ * SCHED_BATCH.
5725 -+ */
5726 -+ p->static_prio = NICE_TO_PRIO(attr->sched_nice);
5727 -+
5728 -+ /*
5729 -+ * __sched_setscheduler() ensures attr->sched_priority == 0 when
5730 -+ * !rt_policy. Always setting this ensures that things like
5731 -+ * getparam()/getattr() don't report silly values for !rt tasks.
5732 -+ */
5733 -+ p->rt_priority = attr->sched_priority;
5734 -+ p->normal_prio = normal_prio(p);
5735 -+}
5736 -+
5737 -+/*
5738 -+ * check the target process has a UID that matches the current process's
5739 -+ */
5740 -+static bool check_same_owner(struct task_struct *p)
5741 -+{
5742 -+ const struct cred *cred = current_cred(), *pcred;
5743 -+ bool match;
5744 -+
5745 -+ rcu_read_lock();
5746 -+ pcred = __task_cred(p);
5747 -+ match = (uid_eq(cred->euid, pcred->euid) ||
5748 -+ uid_eq(cred->euid, pcred->uid));
5749 -+ rcu_read_unlock();
5750 -+ return match;
5751 -+}
5752 -+
5753 -+static int __sched_setscheduler(struct task_struct *p,
5754 -+ const struct sched_attr *attr,
5755 -+ bool user, bool pi)
5756 -+{
5757 -+ const struct sched_attr dl_squash_attr = {
5758 -+ .size = sizeof(struct sched_attr),
5759 -+ .sched_policy = SCHED_FIFO,
5760 -+ .sched_nice = 0,
5761 -+ .sched_priority = 99,
5762 -+ };
5763 -+ int oldpolicy = -1, policy = attr->sched_policy;
5764 -+ int retval, newprio;
5765 -+ struct callback_head *head;
5766 -+ unsigned long flags;
5767 -+ struct rq *rq;
5768 -+ int reset_on_fork;
5769 -+ raw_spinlock_t *lock;
5770 -+
5771 -+ /* The pi code expects interrupts enabled */
5772 -+ BUG_ON(pi && in_interrupt());
5773 -+
5774 -+ /*
5775 -+ * Alt schedule FW supports SCHED_DEADLINE by squashing it into prio-0 SCHED_FIFO
5776 -+ */
5777 -+ if (unlikely(SCHED_DEADLINE == policy)) {
5778 -+ attr = &dl_squash_attr;
5779 -+ policy = attr->sched_policy;
5780 -+ }
5781 -+recheck:
5782 -+ /* Double check policy once rq lock held */
5783 -+ if (policy < 0) {
5784 -+ reset_on_fork = p->sched_reset_on_fork;
5785 -+ policy = oldpolicy = p->policy;
5786 -+ } else {
5787 -+ reset_on_fork = !!(attr->sched_flags & SCHED_RESET_ON_FORK);
5788 -+
5789 -+ if (policy > SCHED_IDLE)
5790 -+ return -EINVAL;
5791 -+ }
5792 -+
5793 -+ if (attr->sched_flags & ~(SCHED_FLAG_ALL))
5794 -+ return -EINVAL;
5795 -+
5796 -+ /*
5797 -+ * Valid priorities for SCHED_FIFO and SCHED_RR are
5798 -+ * 1..MAX_RT_PRIO-1, valid priority for SCHED_NORMAL and
5799 -+ * SCHED_BATCH and SCHED_IDLE is 0.
5800 -+ */
5801 -+ if (attr->sched_priority < 0 ||
5802 -+ (p->mm && attr->sched_priority > MAX_RT_PRIO - 1) ||
5803 -+ (!p->mm && attr->sched_priority > MAX_RT_PRIO - 1))
5804 -+ return -EINVAL;
5805 -+ if ((SCHED_RR == policy || SCHED_FIFO == policy) !=
5806 -+ (attr->sched_priority != 0))
5807 -+ return -EINVAL;
5808 -+
5809 -+ /*
5810 -+ * Allow unprivileged RT tasks to decrease priority:
5811 -+ */
5812 -+ if (user && !capable(CAP_SYS_NICE)) {
5813 -+ if (SCHED_FIFO == policy || SCHED_RR == policy) {
5814 -+ unsigned long rlim_rtprio =
5815 -+ task_rlimit(p, RLIMIT_RTPRIO);
5816 -+
5817 -+ /* Can't set/change the rt policy */
5818 -+ if (policy != p->policy && !rlim_rtprio)
5819 -+ return -EPERM;
5820 -+
5821 -+ /* Can't increase priority */
5822 -+ if (attr->sched_priority > p->rt_priority &&
5823 -+ attr->sched_priority > rlim_rtprio)
5824 -+ return -EPERM;
5825 -+ }
5826 -+
5827 -+ /* Can't change other user's priorities */
5828 -+ if (!check_same_owner(p))
5829 -+ return -EPERM;
5830 -+
5831 -+ /* Normal users shall not reset the sched_reset_on_fork flag */
5832 -+ if (p->sched_reset_on_fork && !reset_on_fork)
5833 -+ return -EPERM;
5834 -+ }
5835 -+
5836 -+ if (user) {
5837 -+ retval = security_task_setscheduler(p);
5838 -+ if (retval)
5839 -+ return retval;
5840 -+ }
5841 -+
5842 -+ if (pi)
5843 -+ cpuset_read_lock();
5844 -+
5845 -+ /*
5846 -+ * Make sure no PI-waiters arrive (or leave) while we are
5847 -+ * changing the priority of the task:
5848 -+ */
5849 -+ raw_spin_lock_irqsave(&p->pi_lock, flags);
5850 -+
5851 -+ /*
5852 -+ * To be able to change p->policy safely, task_access_lock()
5853 -+ * must be called.
5854 -+ * If task_access_lock() is used here:
5855 -+ * for a task p which is not running, reading rq->stop is
5856 -+ * racy but acceptable, as ->stop doesn't change much.
5857 -+ * An enhancement could be made to read rq->stop safely.
5858 -+ */
5859 -+ rq = __task_access_lock(p, &lock);
5860 -+
5861 -+ /*
5862 -+ * Changing the policy of the stop threads is a very bad idea
5863 -+ */
5864 -+ if (p == rq->stop) {
5865 -+ retval = -EINVAL;
5866 -+ goto unlock;
5867 -+ }
5868 -+
5869 -+ /*
5870 -+ * If not changing anything there's no need to proceed further:
5871 -+ */
5872 -+ if (unlikely(policy == p->policy)) {
5873 -+ if (rt_policy(policy) && attr->sched_priority != p->rt_priority)
5874 -+ goto change;
5875 -+ if (!rt_policy(policy) &&
5876 -+ NICE_TO_PRIO(attr->sched_nice) != p->static_prio)
5877 -+ goto change;
5878 -+
5879 -+ p->sched_reset_on_fork = reset_on_fork;
5880 -+ retval = 0;
5881 -+ goto unlock;
5882 -+ }
5883 -+change:
5884 -+
5885 -+ /* Re-check policy now with rq lock held */
5886 -+ if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
5887 -+ policy = oldpolicy = -1;
5888 -+ __task_access_unlock(p, lock);
5889 -+ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
5890 -+ if (pi)
5891 -+ cpuset_read_unlock();
5892 -+ goto recheck;
5893 -+ }
5894 -+
5895 -+ p->sched_reset_on_fork = reset_on_fork;
5896 -+
5897 -+ newprio = __normal_prio(policy, attr->sched_priority, NICE_TO_PRIO(attr->sched_nice));
5898 -+ if (pi) {
5899 -+ /*
5900 -+ * Take priority boosted tasks into account. If the new
5901 -+ * effective priority is unchanged, we just store the new
5902 -+ * normal parameters and do not touch the scheduler class and
5903 -+ * the runqueue. This will be done when the task deboosts
5904 -+ * itself.
5905 -+ */
5906 -+ newprio = rt_effective_prio(p, newprio);
5907 -+ }
5908 -+
5909 -+ if (!(attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)) {
5910 -+ __setscheduler_params(p, attr);
5911 -+ __setscheduler_prio(p, newprio);
5912 -+ }
5913 -+
5914 -+ check_task_changed(p, rq);
5915 -+
5916 -+ /* Avoid rq from going away on us: */
5917 -+ preempt_disable();
5918 -+ head = splice_balance_callbacks(rq);
5919 -+ __task_access_unlock(p, lock);
5920 -+ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
5921 -+
5922 -+ if (pi) {
5923 -+ cpuset_read_unlock();
5924 -+ rt_mutex_adjust_pi(p);
5925 -+ }
5926 -+
5927 -+ /* Run balance callbacks after we've adjusted the PI chain: */
5928 -+ balance_callbacks(rq, head);
5929 -+ preempt_enable();
5930 -+
5931 -+ return 0;
5932 -+
5933 -+unlock:
5934 -+ __task_access_unlock(p, lock);
5935 -+ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
5936 -+ if (pi)
5937 -+ cpuset_read_unlock();
5938 -+ return retval;
5939 -+}
5940 -+
5941 -+static int _sched_setscheduler(struct task_struct *p, int policy,
5942 -+ const struct sched_param *param, bool check)
5943 -+{
5944 -+ struct sched_attr attr = {
5945 -+ .sched_policy = policy,
5946 -+ .sched_priority = param->sched_priority,
5947 -+ .sched_nice = PRIO_TO_NICE(p->static_prio),
5948 -+ };
5949 -+
5950 -+ /* Fixup the legacy SCHED_RESET_ON_FORK hack. */
5951 -+ if ((policy != SETPARAM_POLICY) && (policy & SCHED_RESET_ON_FORK)) {
5952 -+ attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
5953 -+ policy &= ~SCHED_RESET_ON_FORK;
5954 -+ attr.sched_policy = policy;
5955 -+ }
5956 -+
5957 -+ return __sched_setscheduler(p, &attr, check, true);
5958 -+}
5959 -+
5960 -+/**
5961 -+ * sched_setscheduler - change the scheduling policy and/or RT priority of a thread.
5962 -+ * @p: the task in question.
5963 -+ * @policy: new policy.
5964 -+ * @param: structure containing the new RT priority.
5965 -+ *
5966 -+ * Use sched_set_fifo(), read its comment.
5967 -+ *
5968 -+ * Return: 0 on success. An error code otherwise.
5969 -+ *
5970 -+ * NOTE that the task may be already dead.
5971 -+ */
5972 -+int sched_setscheduler(struct task_struct *p, int policy,
5973 -+ const struct sched_param *param)
5974 -+{
5975 -+ return _sched_setscheduler(p, policy, param, true);
5976 -+}
5977 -+
5978 -+int sched_setattr(struct task_struct *p, const struct sched_attr *attr)
5979 -+{
5980 -+ return __sched_setscheduler(p, attr, true, true);
5981 -+}
5982 -+
5983 -+int sched_setattr_nocheck(struct task_struct *p, const struct sched_attr *attr)
5984 -+{
5985 -+ return __sched_setscheduler(p, attr, false, true);
5986 -+}
5987 -+EXPORT_SYMBOL_GPL(sched_setattr_nocheck);
5988 -+
5989 -+/**
5990 -+ * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
5991 -+ * @p: the task in question.
5992 -+ * @policy: new policy.
5993 -+ * @param: structure containing the new RT priority.
5994 -+ *
5995 -+ * Just like sched_setscheduler, only don't bother checking if the
5996 -+ * current context has permission. For example, this is needed in
5997 -+ * stop_machine(): we create temporary high priority worker threads,
5998 -+ * but our caller might not have that capability.
5999 -+ *
6000 -+ * Return: 0 on success. An error code otherwise.
6001 -+ */
6002 -+int sched_setscheduler_nocheck(struct task_struct *p, int policy,
6003 -+ const struct sched_param *param)
6004 -+{
6005 -+ return _sched_setscheduler(p, policy, param, false);
6006 -+}
6007 -+
6008 -+/*
6009 -+ * SCHED_FIFO is a broken scheduler model; that is, it is fundamentally
6010 -+ * incapable of resource management, which is the one thing an OS really should
6011 -+ * be doing.
6012 -+ *
6013 -+ * This is of course the reason it is limited to privileged users only.
6014 -+ *
6015 -+ * Worse still; it is fundamentally impossible to compose static priority
6016 -+ * workloads. You cannot take two correctly working static prio workloads
6017 -+ * and smash them together and still expect them to work.
6018 -+ *
6019 -+ * For this reason 'all' FIFO tasks the kernel creates are basically at:
6020 -+ *
6021 -+ * MAX_RT_PRIO / 2
6022 -+ *
6023 -+ * The administrator _MUST_ configure the system, the kernel simply doesn't
6024 -+ * know enough information to make a sensible choice.
6025 -+ */
6026 -+void sched_set_fifo(struct task_struct *p)
6027 -+{
6028 -+ struct sched_param sp = { .sched_priority = MAX_RT_PRIO / 2 };
6029 -+ WARN_ON_ONCE(sched_setscheduler_nocheck(p, SCHED_FIFO, &sp) != 0);
6030 -+}
6031 -+EXPORT_SYMBOL_GPL(sched_set_fifo);
6032 -+
6033 -+/*
6034 -+ * For when you don't much care about FIFO, but want to be above SCHED_NORMAL.
6035 -+ */
6036 -+void sched_set_fifo_low(struct task_struct *p)
6037 -+{
6038 -+ struct sched_param sp = { .sched_priority = 1 };
6039 -+ WARN_ON_ONCE(sched_setscheduler_nocheck(p, SCHED_FIFO, &sp) != 0);
6040 -+}
6041 -+EXPORT_SYMBOL_GPL(sched_set_fifo_low);
6042 -+
6043 -+void sched_set_normal(struct task_struct *p, int nice)
6044 -+{
6045 -+ struct sched_attr attr = {
6046 -+ .sched_policy = SCHED_NORMAL,
6047 -+ .sched_nice = nice,
6048 -+ };
6049 -+ WARN_ON_ONCE(sched_setattr_nocheck(p, &attr) != 0);
6050 -+}
6051 -+EXPORT_SYMBOL_GPL(sched_set_normal);
6052 -+
6053 -+static int
6054 -+do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
6055 -+{
6056 -+ struct sched_param lparam;
6057 -+ struct task_struct *p;
6058 -+ int retval;
6059 -+
6060 -+ if (!param || pid < 0)
6061 -+ return -EINVAL;
6062 -+ if (copy_from_user(&lparam, param, sizeof(struct sched_param)))
6063 -+ return -EFAULT;
6064 -+
6065 -+ rcu_read_lock();
6066 -+ retval = -ESRCH;
6067 -+ p = find_process_by_pid(pid);
6068 -+ if (likely(p))
6069 -+ get_task_struct(p);
6070 -+ rcu_read_unlock();
6071 -+
6072 -+ if (likely(p)) {
6073 -+ retval = sched_setscheduler(p, policy, &lparam);
6074 -+ put_task_struct(p);
6075 -+ }
6076 -+
6077 -+ return retval;
6078 -+}
6079 -+
6080 -+/*
6081 -+ * Mimics kernel/events/core.c perf_copy_attr().
6082 -+ */
6083 -+static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *attr)
6084 -+{
6085 -+ u32 size;
6086 -+ int ret;
6087 -+
6088 -+ /* Zero the full structure, so that a short copy will be nice: */
6089 -+ memset(attr, 0, sizeof(*attr));
6090 -+
6091 -+ ret = get_user(size, &uattr->size);
6092 -+ if (ret)
6093 -+ return ret;
6094 -+
6095 -+ /* ABI compatibility quirk: */
6096 -+ if (!size)
6097 -+ size = SCHED_ATTR_SIZE_VER0;
6098 -+
6099 -+ if (size < SCHED_ATTR_SIZE_VER0 || size > PAGE_SIZE)
6100 -+ goto err_size;
6101 -+
6102 -+ ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
6103 -+ if (ret) {
6104 -+ if (ret == -E2BIG)
6105 -+ goto err_size;
6106 -+ return ret;
6107 -+ }
6108 -+
6109 -+ /*
6110 -+ * XXX: Do we want to be lenient like existing syscalls; or do we want
6111 -+ * to be strict and return an error on out-of-bounds values?
6112 -+ */
6113 -+ attr->sched_nice = clamp(attr->sched_nice, -20, 19);
6114 -+
6115 -+ /* sched/core.c uses zero here but we already know ret is zero */
6116 -+ return 0;
6117 -+
6118 -+err_size:
6119 -+ put_user(sizeof(*attr), &uattr->size);
6120 -+ return -E2BIG;
6121 -+}
6122 -+
6123 -+/**
6124 -+ * sys_sched_setscheduler - set/change the scheduler policy and RT priority
6125 -+ * @pid: the pid in question.
6126 -+ * @policy: new policy.
6127 -+ * @param: structure containing the new RT priority.
6128 -+ *
6129 -+ * Return: 0 on success. An error code otherwise.
6130 -+ */
6131 -+SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy, struct sched_param __user *, param)
6132 -+{
6133 -+ if (policy < 0)
6134 -+ return -EINVAL;
6135 -+
6136 -+ return do_sched_setscheduler(pid, policy, param);
6137 -+}
6138 -+
6139 -+/**
6140 -+ * sys_sched_setparam - set/change the RT priority of a thread
6141 -+ * @pid: the pid in question.
6142 -+ * @param: structure containing the new RT priority.
6143 -+ *
6144 -+ * Return: 0 on success. An error code otherwise.
6145 -+ */
6146 -+SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
6147 -+{
6148 -+ return do_sched_setscheduler(pid, SETPARAM_POLICY, param);
6149 -+}
6150 -+
6151 -+/**
6152 -+ * sys_sched_setattr - same as above, but with extended sched_attr
6153 -+ * @pid: the pid in question.
6154 -+ * @uattr: structure containing the extended parameters.
6155 -+ */
6156 -+SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
6157 -+ unsigned int, flags)
6158 -+{
6159 -+ struct sched_attr attr;
6160 -+ struct task_struct *p;
6161 -+ int retval;
6162 -+
6163 -+ if (!uattr || pid < 0 || flags)
6164 -+ return -EINVAL;
6165 -+
6166 -+ retval = sched_copy_attr(uattr, &attr);
6167 -+ if (retval)
6168 -+ return retval;
6169 -+
6170 -+ if ((int)attr.sched_policy < 0)
6171 -+ return -EINVAL;
6172 -+
6173 -+ rcu_read_lock();
6174 -+ retval = -ESRCH;
6175 -+ p = find_process_by_pid(pid);
6176 -+ if (likely(p))
6177 -+ get_task_struct(p);
6178 -+ rcu_read_unlock();
6179 -+
6180 -+ if (likely(p)) {
6181 -+ retval = sched_setattr(p, &attr);
6182 -+ put_task_struct(p);
6183 -+ }
6184 -+
6185 -+ return retval;
6186 -+}
6187 -+
6188 -+/**
6189 -+ * sys_sched_getscheduler - get the policy (scheduling class) of a thread
6190 -+ * @pid: the pid in question.
6191 -+ *
6192 -+ * Return: On success, the policy of the thread. Otherwise, a negative error
6193 -+ * code.
6194 -+ */
6195 -+SYSCALL_DEFINE1(sched_getscheduler, pid_t, pid)
6196 -+{
6197 -+ struct task_struct *p;
6198 -+ int retval = -EINVAL;
6199 -+
6200 -+ if (pid < 0)
6201 -+ goto out_nounlock;
6202 -+
6203 -+ retval = -ESRCH;
6204 -+ rcu_read_lock();
6205 -+ p = find_process_by_pid(pid);
6206 -+ if (p) {
6207 -+ retval = security_task_getscheduler(p);
6208 -+ if (!retval)
6209 -+ retval = p->policy;
6210 -+ }
6211 -+ rcu_read_unlock();
6212 -+
6213 -+out_nounlock:
6214 -+ return retval;
6215 -+}
6216 -+
6217 -+/**
6218 -+ * sys_sched_getparam - get the RT priority of a thread
6219 -+ * @pid: the pid in question.
6220 -+ * @param: structure containing the RT priority.
6221 -+ *
6222 -+ * Return: On success, 0 and the RT priority is in @param. Otherwise, an error
6223 -+ * code.
6224 -+ */
6225 -+SYSCALL_DEFINE2(sched_getparam, pid_t, pid, struct sched_param __user *, param)
6226 -+{
6227 -+ struct sched_param lp = { .sched_priority = 0 };
6228 -+ struct task_struct *p;
6229 -+ int retval = -EINVAL;
6230 -+
6231 -+ if (!param || pid < 0)
6232 -+ goto out_nounlock;
6233 -+
6234 -+ rcu_read_lock();
6235 -+ p = find_process_by_pid(pid);
6236 -+ retval = -ESRCH;
6237 -+ if (!p)
6238 -+ goto out_unlock;
6239 -+
6240 -+ retval = security_task_getscheduler(p);
6241 -+ if (retval)
6242 -+ goto out_unlock;
6243 -+
6244 -+ if (task_has_rt_policy(p))
6245 -+ lp.sched_priority = p->rt_priority;
6246 -+ rcu_read_unlock();
6247 -+
6248 -+ /*
6249 -+ * This one might sleep, we cannot do it with a spinlock held ...
6250 -+ */
6251 -+ retval = copy_to_user(param, &lp, sizeof(*param)) ? -EFAULT : 0;
6252 -+
6253 -+out_nounlock:
6254 -+ return retval;
6255 -+
6256 -+out_unlock:
6257 -+ rcu_read_unlock();
6258 -+ return retval;
6259 -+}
6260 -+
6261 -+/*
6262 -+ * Copy the kernel-sized attribute structure (which might be larger
6263 -+ * than what user-space knows about) to user-space.
6264 -+ *
6265 -+ * Note that all cases are valid: user-space buffer can be larger or
6266 -+ * smaller than the kernel-space buffer. The usual case is that both
6267 -+ * have the same size.
6268 -+ */
6269 -+static int
6270 -+sched_attr_copy_to_user(struct sched_attr __user *uattr,
6271 -+ struct sched_attr *kattr,
6272 -+ unsigned int usize)
6273 -+{
6274 -+ unsigned int ksize = sizeof(*kattr);
6275 -+
6276 -+ if (!access_ok(uattr, usize))
6277 -+ return -EFAULT;
6278 -+
6279 -+ /*
6280 -+ * sched_getattr() ABI forwards and backwards compatibility:
6281 -+ *
6282 -+ * If usize == ksize then we just copy everything to user-space and all is good.
6283 -+ *
6284 -+ * If usize < ksize then we only copy as much as user-space has space for,
6285 -+ * this keeps ABI compatibility as well. We skip the rest.
6286 -+ *
6287 -+ * If usize > ksize then user-space is using a newer version of the ABI,
6288 -+ * which part the kernel doesn't know about. Just ignore it - tooling can
6289 -+ * detect the kernel's knowledge of attributes from the attr->size value
6290 -+ * which is set to ksize in this case.
6291 -+ */
6292 -+ kattr->size = min(usize, ksize);
6293 -+
6294 -+ if (copy_to_user(uattr, kattr, kattr->size))
6295 -+ return -EFAULT;
6296 -+
6297 -+ return 0;
6298 -+}
6299 -+
6300 -+/**
6301 -+ * sys_sched_getattr - similar to sched_getparam, but with sched_attr
6302 -+ * @pid: the pid in question.
6303 -+ * @uattr: structure containing the extended parameters.
6304 -+ * @usize: sizeof(attr) for fwd/bwd comp.
6305 -+ * @flags: for future extension.
6306 -+ */
6307 -+SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
6308 -+ unsigned int, usize, unsigned int, flags)
6309 -+{
6310 -+ struct sched_attr kattr = { };
6311 -+ struct task_struct *p;
6312 -+ int retval;
6313 -+
6314 -+ if (!uattr || pid < 0 || usize > PAGE_SIZE ||
6315 -+ usize < SCHED_ATTR_SIZE_VER0 || flags)
6316 -+ return -EINVAL;
6317 -+
6318 -+ rcu_read_lock();
6319 -+ p = find_process_by_pid(pid);
6320 -+ retval = -ESRCH;
6321 -+ if (!p)
6322 -+ goto out_unlock;
6323 -+
6324 -+ retval = security_task_getscheduler(p);
6325 -+ if (retval)
6326 -+ goto out_unlock;
6327 -+
6328 -+ kattr.sched_policy = p->policy;
6329 -+ if (p->sched_reset_on_fork)
6330 -+ kattr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
6331 -+ if (task_has_rt_policy(p))
6332 -+ kattr.sched_priority = p->rt_priority;
6333 -+ else
6334 -+ kattr.sched_nice = task_nice(p);
6335 -+
6336 -+#ifdef CONFIG_UCLAMP_TASK
6337 -+ kattr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value;
6338 -+ kattr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value;
6339 -+#endif
6340 -+
6341 -+ rcu_read_unlock();
6342 -+
6343 -+ return sched_attr_copy_to_user(uattr, &kattr, usize);
6344 -+
6345 -+out_unlock:
6346 -+ rcu_read_unlock();
6347 -+ return retval;
6348 -+}
6349 -+
6350 -+long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
6351 -+{
6352 -+ cpumask_var_t cpus_allowed, new_mask;
6353 -+ struct task_struct *p;
6354 -+ int retval;
6355 -+
6356 -+ rcu_read_lock();
6357 -+
6358 -+ p = find_process_by_pid(pid);
6359 -+ if (!p) {
6360 -+ rcu_read_unlock();
6361 -+ return -ESRCH;
6362 -+ }
6363 -+
6364 -+ /* Prevent p going away */
6365 -+ get_task_struct(p);
6366 -+ rcu_read_unlock();
6367 -+
6368 -+ if (p->flags & PF_NO_SETAFFINITY) {
6369 -+ retval = -EINVAL;
6370 -+ goto out_put_task;
6371 -+ }
6372 -+ if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL)) {
6373 -+ retval = -ENOMEM;
6374 -+ goto out_put_task;
6375 -+ }
6376 -+ if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) {
6377 -+ retval = -ENOMEM;
6378 -+ goto out_free_cpus_allowed;
6379 -+ }
6380 -+ retval = -EPERM;
6381 -+ if (!check_same_owner(p)) {
6382 -+ rcu_read_lock();
6383 -+ if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
6384 -+ rcu_read_unlock();
6385 -+ goto out_free_new_mask;
6386 -+ }
6387 -+ rcu_read_unlock();
6388 -+ }
6389 -+
6390 -+ retval = security_task_setscheduler(p);
6391 -+ if (retval)
6392 -+ goto out_free_new_mask;
6393 -+
6394 -+ cpuset_cpus_allowed(p, cpus_allowed);
6395 -+ cpumask_and(new_mask, in_mask, cpus_allowed);
6396 -+
6397 -+again:
6398 -+ retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK);
6399 -+
6400 -+ if (!retval) {
6401 -+ cpuset_cpus_allowed(p, cpus_allowed);
6402 -+ if (!cpumask_subset(new_mask, cpus_allowed)) {
6403 -+ /*
6404 -+ * We must have raced with a concurrent cpuset
6405 -+ * update. Just reset the cpus_allowed to the
6406 -+ * cpuset's cpus_allowed
6407 -+ */
6408 -+ cpumask_copy(new_mask, cpus_allowed);
6409 -+ goto again;
6410 -+ }
6411 -+ }
6412 -+out_free_new_mask:
6413 -+ free_cpumask_var(new_mask);
6414 -+out_free_cpus_allowed:
6415 -+ free_cpumask_var(cpus_allowed);
6416 -+out_put_task:
6417 -+ put_task_struct(p);
6418 -+ return retval;
6419 -+}
6420 -+
6421 -+static int get_user_cpu_mask(unsigned long __user *user_mask_ptr, unsigned len,
6422 -+ struct cpumask *new_mask)
6423 -+{
6424 -+ if (len < cpumask_size())
6425 -+ cpumask_clear(new_mask);
6426 -+ else if (len > cpumask_size())
6427 -+ len = cpumask_size();
6428 -+
6429 -+ return copy_from_user(new_mask, user_mask_ptr, len) ? -EFAULT : 0;
6430 -+}
6431 -+
6432 -+/**
6433 -+ * sys_sched_setaffinity - set the CPU affinity of a process
6434 -+ * @pid: pid of the process
6435 -+ * @len: length in bytes of the bitmask pointed to by user_mask_ptr
6436 -+ * @user_mask_ptr: user-space pointer to the new CPU mask
6437 -+ *
6438 -+ * Return: 0 on success. An error code otherwise.
6439 -+ */
6440 -+SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
6441 -+ unsigned long __user *, user_mask_ptr)
6442 -+{
6443 -+ cpumask_var_t new_mask;
6444 -+ int retval;
6445 -+
6446 -+ if (!alloc_cpumask_var(&new_mask, GFP_KERNEL))
6447 -+ return -ENOMEM;
6448 -+
6449 -+ retval = get_user_cpu_mask(user_mask_ptr, len, new_mask);
6450 -+ if (retval == 0)
6451 -+ retval = sched_setaffinity(pid, new_mask);
6452 -+ free_cpumask_var(new_mask);
6453 -+ return retval;
6454 -+}
6455 -+
6456 -+long sched_getaffinity(pid_t pid, cpumask_t *mask)
6457 -+{
6458 -+ struct task_struct *p;
6459 -+ raw_spinlock_t *lock;
6460 -+ unsigned long flags;
6461 -+ int retval;
6462 -+
6463 -+ rcu_read_lock();
6464 -+
6465 -+ retval = -ESRCH;
6466 -+ p = find_process_by_pid(pid);
6467 -+ if (!p)
6468 -+ goto out_unlock;
6469 -+
6470 -+ retval = security_task_getscheduler(p);
6471 -+ if (retval)
6472 -+ goto out_unlock;
6473 -+
6474 -+ task_access_lock_irqsave(p, &lock, &flags);
6475 -+ cpumask_and(mask, &p->cpus_mask, cpu_active_mask);
6476 -+ task_access_unlock_irqrestore(p, lock, &flags);
6477 -+
6478 -+out_unlock:
6479 -+ rcu_read_unlock();
6480 -+
6481 -+ return retval;
6482 -+}
6483 -+
6484 -+/**
6485 -+ * sys_sched_getaffinity - get the CPU affinity of a process
6486 -+ * @pid: pid of the process
6487 -+ * @len: length in bytes of the bitmask pointed to by user_mask_ptr
6488 -+ * @user_mask_ptr: user-space pointer to hold the current CPU mask
6489 -+ *
6490 -+ * Return: size of CPU mask copied to user_mask_ptr on success. An
6491 -+ * error code otherwise.
6492 -+ */
6493 -+SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
6494 -+ unsigned long __user *, user_mask_ptr)
6495 -+{
6496 -+ int ret;
6497 -+ cpumask_var_t mask;
6498 -+
6499 -+ if ((len * BITS_PER_BYTE) < nr_cpu_ids)
6500 -+ return -EINVAL;
6501 -+ if (len & (sizeof(unsigned long)-1))
6502 -+ return -EINVAL;
6503 -+
6504 -+ if (!alloc_cpumask_var(&mask, GFP_KERNEL))
6505 -+ return -ENOMEM;
6506 -+
6507 -+ ret = sched_getaffinity(pid, mask);
6508 -+ if (ret == 0) {
6509 -+ unsigned int retlen = min_t(size_t, len, cpumask_size());
6510 -+
6511 -+ if (copy_to_user(user_mask_ptr, mask, retlen))
6512 -+ ret = -EFAULT;
6513 -+ else
6514 -+ ret = retlen;
6515 -+ }
6516 -+ free_cpumask_var(mask);
6517 -+
6518 -+ return ret;
6519 -+}
6520 -+
6521 -+static void do_sched_yield(void)
6522 -+{
6523 -+ struct rq *rq;
6524 -+ struct rq_flags rf;
6525 -+
6526 -+ if (!sched_yield_type)
6527 -+ return;
6528 -+
6529 -+ rq = this_rq_lock_irq(&rf);
6530 -+
6531 -+ schedstat_inc(rq->yld_count);
6532 -+
6533 -+ if (1 == sched_yield_type) {
6534 -+ if (!rt_task(current))
6535 -+ do_sched_yield_type_1(current, rq);
6536 -+ } else if (2 == sched_yield_type) {
6537 -+ if (rq->nr_running > 1)
6538 -+ rq->skip = current;
6539 -+ }
6540 -+
6541 -+ preempt_disable();
6542 -+ raw_spin_unlock_irq(&rq->lock);
6543 -+ sched_preempt_enable_no_resched();
6544 -+
6545 -+ schedule();
6546 -+}
6547 -+
6548 -+/**
6549 -+ * sys_sched_yield - yield the current processor to other threads.
6550 -+ *
6551 -+ * This function yields the current CPU to other tasks. If there are no
6552 -+ * other threads running on this CPU then this function will return.
6553 -+ *
6554 -+ * Return: 0.
6555 -+ */
6556 -+SYSCALL_DEFINE0(sched_yield)
6557 -+{
6558 -+ do_sched_yield();
6559 -+ return 0;
6560 -+}
6561 -+
6562 -+#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
6563 -+int __sched __cond_resched(void)
6564 -+{
6565 -+ if (should_resched(0)) {
6566 -+ preempt_schedule_common();
6567 -+ return 1;
6568 -+ }
6569 -+#ifndef CONFIG_PREEMPT_RCU
6570 -+ rcu_all_qs();
6571 -+#endif
6572 -+ return 0;
6573 -+}
6574 -+EXPORT_SYMBOL(__cond_resched);
6575 -+#endif
6576 -+
6577 -+#ifdef CONFIG_PREEMPT_DYNAMIC
6578 -+DEFINE_STATIC_CALL_RET0(cond_resched, __cond_resched);
6579 -+EXPORT_STATIC_CALL_TRAMP(cond_resched);
6580 -+
6581 -+DEFINE_STATIC_CALL_RET0(might_resched, __cond_resched);
6582 -+EXPORT_STATIC_CALL_TRAMP(might_resched);
6583 -+#endif
6584 -+
6585 -+/*
6586 -+ * __cond_resched_lock() - if a reschedule is pending, drop the given lock,
6587 -+ * call schedule, and on return reacquire the lock.
6588 -+ *
6589 -+ * This works OK both with and without CONFIG_PREEMPTION. We do strange low-level
6590 -+ * operations here to prevent schedule() from being called twice (once via
6591 -+ * spin_unlock(), once by hand).
6592 -+ */
6593 -+int __cond_resched_lock(spinlock_t *lock)
6594 -+{
6595 -+ int resched = should_resched(PREEMPT_LOCK_OFFSET);
6596 -+ int ret = 0;
6597 -+
6598 -+ lockdep_assert_held(lock);
6599 -+
6600 -+ if (spin_needbreak(lock) || resched) {
6601 -+ spin_unlock(lock);
6602 -+ if (resched)
6603 -+ preempt_schedule_common();
6604 -+ else
6605 -+ cpu_relax();
6606 -+ ret = 1;
6607 -+ spin_lock(lock);
6608 -+ }
6609 -+ return ret;
6610 -+}
6611 -+EXPORT_SYMBOL(__cond_resched_lock);
6612 -+
6613 -+int __cond_resched_rwlock_read(rwlock_t *lock)
6614 -+{
6615 -+ int resched = should_resched(PREEMPT_LOCK_OFFSET);
6616 -+ int ret = 0;
6617 -+
6618 -+ lockdep_assert_held_read(lock);
6619 -+
6620 -+ if (rwlock_needbreak(lock) || resched) {
6621 -+ read_unlock(lock);
6622 -+ if (resched)
6623 -+ preempt_schedule_common();
6624 -+ else
6625 -+ cpu_relax();
6626 -+ ret = 1;
6627 -+ read_lock(lock);
6628 -+ }
6629 -+ return ret;
6630 -+}
6631 -+EXPORT_SYMBOL(__cond_resched_rwlock_read);
6632 -+
6633 -+int __cond_resched_rwlock_write(rwlock_t *lock)
6634 -+{
6635 -+ int resched = should_resched(PREEMPT_LOCK_OFFSET);
6636 -+ int ret = 0;
6637 -+
6638 -+ lockdep_assert_held_write(lock);
6639 -+
6640 -+ if (rwlock_needbreak(lock) || resched) {
6641 -+ write_unlock(lock);
6642 -+ if (resched)
6643 -+ preempt_schedule_common();
6644 -+ else
6645 -+ cpu_relax();
6646 -+ ret = 1;
6647 -+ write_lock(lock);
6648 -+ }
6649 -+ return ret;
6650 -+}
6651 -+EXPORT_SYMBOL(__cond_resched_rwlock_write);
6652 -+
6653 -+/**
6654 -+ * yield - yield the current processor to other threads.
6655 -+ *
6656 -+ * Do not ever use this function, there's a 99% chance you're doing it wrong.
6657 -+ *
6658 -+ * The scheduler is at all times free to pick the calling task as the most
6659 -+ * eligible task to run, if removing the yield() call from your code breaks
6660 -+ * it, it's already broken.
6661 -+ *
6662 -+ * Typical broken usage is:
6663 -+ *
6664 -+ * while (!event)
6665 -+ * yield();
6666 -+ *
6667 -+ * where one assumes that yield() will let 'the other' process run that will
6668 -+ * make event true. If the current task is a SCHED_FIFO task that will never
6669 -+ * happen. Never use yield() as a progress guarantee!!
6670 -+ *
6671 -+ * If you want to use yield() to wait for something, use wait_event().
6672 -+ * If you want to use yield() to be 'nice' for others, use cond_resched().
6673 -+ * If you still want to use yield(), do not!
6674 -+ */
6675 -+void __sched yield(void)
6676 -+{
6677 -+ set_current_state(TASK_RUNNING);
6678 -+ do_sched_yield();
6679 -+}
6680 -+EXPORT_SYMBOL(yield);
6681 -+
6682 -+/**
6683 -+ * yield_to - yield the current processor to another thread in
6684 -+ * your thread group, or accelerate that thread toward the
6685 -+ * processor it's on.
6686 -+ * @p: target task
6687 -+ * @preempt: whether task preemption is allowed or not
6688 -+ *
6689 -+ * It's the caller's job to ensure that the target task struct
6690 -+ * can't go away on us before we can do any checks.
6691 -+ *
6692 -+ * In Alt schedule FW, yield_to is not supported.
6693 -+ *
6694 -+ * Return:
6695 -+ * true (>0) if we indeed boosted the target task.
6696 -+ * false (0) if we failed to boost the target.
6697 -+ * -ESRCH if there's no task to yield to.
6698 -+ */
6699 -+int __sched yield_to(struct task_struct *p, bool preempt)
6700 -+{
6701 -+ return 0;
6702 -+}
6703 -+EXPORT_SYMBOL_GPL(yield_to);
6704 -+
6705 -+int io_schedule_prepare(void)
6706 -+{
6707 -+ int old_iowait = current->in_iowait;
6708 -+
6709 -+ current->in_iowait = 1;
6710 -+ blk_schedule_flush_plug(current);
6711 -+
6712 -+ return old_iowait;
6713 -+}
6714 -+
6715 -+void io_schedule_finish(int token)
6716 -+{
6717 -+ current->in_iowait = token;
6718 -+}
6719 -+
6720 -+/*
6721 -+ * This task is about to go to sleep on IO. Increment rq->nr_iowait so
6722 -+ * that process accounting knows that this is a task in IO wait state.
6723 -+ *
6724 -+ * But don't do that if it is a deliberate, throttling IO wait (this task
6725 -+ * has set its backing_dev_info: the queue against which it should throttle)
6726 -+ */
6727 -+
6728 -+long __sched io_schedule_timeout(long timeout)
6729 -+{
6730 -+ int token;
6731 -+ long ret;
6732 -+
6733 -+ token = io_schedule_prepare();
6734 -+ ret = schedule_timeout(timeout);
6735 -+ io_schedule_finish(token);
6736 -+
6737 -+ return ret;
6738 -+}
6739 -+EXPORT_SYMBOL(io_schedule_timeout);
6740 -+
6741 -+void __sched io_schedule(void)
6742 -+{
6743 -+ int token;
6744 -+
6745 -+ token = io_schedule_prepare();
6746 -+ schedule();
6747 -+ io_schedule_finish(token);
6748 -+}
6749 -+EXPORT_SYMBOL(io_schedule);
6750 -+
6751 -+/**
6752 -+ * sys_sched_get_priority_max - return maximum RT priority.
6753 -+ * @policy: scheduling class.
6754 -+ *
6755 -+ * Return: On success, this syscall returns the maximum
6756 -+ * rt_priority that can be used by a given scheduling class.
6757 -+ * On failure, a negative error code is returned.
6758 -+ */
6759 -+SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
6760 -+{
6761 -+ int ret = -EINVAL;
6762 -+
6763 -+ switch (policy) {
6764 -+ case SCHED_FIFO:
6765 -+ case SCHED_RR:
6766 -+ ret = MAX_RT_PRIO - 1;
6767 -+ break;
6768 -+ case SCHED_NORMAL:
6769 -+ case SCHED_BATCH:
6770 -+ case SCHED_IDLE:
6771 -+ ret = 0;
6772 -+ break;
6773 -+ }
6774 -+ return ret;
6775 -+}
6776 -+
6777 -+/**
6778 -+ * sys_sched_get_priority_min - return minimum RT priority.
6779 -+ * @policy: scheduling class.
6780 -+ *
6781 -+ * Return: On success, this syscall returns the minimum
6782 -+ * rt_priority that can be used by a given scheduling class.
6783 -+ * On failure, a negative error code is returned.
6784 -+ */
6785 -+SYSCALL_DEFINE1(sched_get_priority_min, int, policy)
6786 -+{
6787 -+ int ret = -EINVAL;
6788 -+
6789 -+ switch (policy) {
6790 -+ case SCHED_FIFO:
6791 -+ case SCHED_RR:
6792 -+ ret = 1;
6793 -+ break;
6794 -+ case SCHED_NORMAL:
6795 -+ case SCHED_BATCH:
6796 -+ case SCHED_IDLE:
6797 -+ ret = 0;
6798 -+ break;
6799 -+ }
6800 -+ return ret;
6801 -+}
6802 -+
6803 -+static int sched_rr_get_interval(pid_t pid, struct timespec64 *t)
6804 -+{
6805 -+ struct task_struct *p;
6806 -+ int retval;
6807 -+
6808 -+ alt_sched_debug();
6809 -+
6810 -+ if (pid < 0)
6811 -+ return -EINVAL;
6812 -+
6813 -+ retval = -ESRCH;
6814 -+ rcu_read_lock();
6815 -+ p = find_process_by_pid(pid);
6816 -+ if (!p)
6817 -+ goto out_unlock;
6818 -+
6819 -+ retval = security_task_getscheduler(p);
6820 -+ if (retval)
6821 -+ goto out_unlock;
6822 -+ rcu_read_unlock();
6823 -+
6824 -+ *t = ns_to_timespec64(sched_timeslice_ns);
6825 -+ return 0;
6826 -+
6827 -+out_unlock:
6828 -+ rcu_read_unlock();
6829 -+ return retval;
6830 -+}
6831 -+
6832 -+/**
6833 -+ * sys_sched_rr_get_interval - return the default timeslice of a process.
6834 -+ * @pid: pid of the process.
6835 -+ * @interval: userspace pointer to the timeslice value.
6836 -+ *
6837 -+ *
6838 -+ * Return: On success, 0 and the timeslice is in @interval. Otherwise,
6839 -+ * an error code.
6840 -+ */
6841 -+SYSCALL_DEFINE2(sched_rr_get_interval, pid_t, pid,
6842 -+ struct __kernel_timespec __user *, interval)
6843 -+{
6844 -+ struct timespec64 t;
6845 -+ int retval = sched_rr_get_interval(pid, &t);
6846 -+
6847 -+ if (retval == 0)
6848 -+ retval = put_timespec64(&t, interval);
6849 -+
6850 -+ return retval;
6851 -+}
6852 -+
6853 -+#ifdef CONFIG_COMPAT_32BIT_TIME
6854 -+SYSCALL_DEFINE2(sched_rr_get_interval_time32, pid_t, pid,
6855 -+ struct old_timespec32 __user *, interval)
6856 -+{
6857 -+ struct timespec64 t;
6858 -+ int retval = sched_rr_get_interval(pid, &t);
6859 -+
6860 -+ if (retval == 0)
6861 -+ retval = put_old_timespec32(&t, interval);
6862 -+ return retval;
6863 -+}
6864 -+#endif
6865 -+
6866 -+void sched_show_task(struct task_struct *p)
6867 -+{
6868 -+ unsigned long free = 0;
6869 -+ int ppid;
6870 -+
6871 -+ if (!try_get_task_stack(p))
6872 -+ return;
6873 -+
6874 -+ pr_info("task:%-15.15s state:%c", p->comm, task_state_to_char(p));
6875 -+
6876 -+ if (task_is_running(p))
6877 -+ pr_cont(" running task ");
6878 -+#ifdef CONFIG_DEBUG_STACK_USAGE
6879 -+ free = stack_not_used(p);
6880 -+#endif
6881 -+ ppid = 0;
6882 -+ rcu_read_lock();
6883 -+ if (pid_alive(p))
6884 -+ ppid = task_pid_nr(rcu_dereference(p->real_parent));
6885 -+ rcu_read_unlock();
6886 -+ pr_cont(" stack:%5lu pid:%5d ppid:%6d flags:0x%08lx\n",
6887 -+ free, task_pid_nr(p), ppid,
6888 -+ (unsigned long)task_thread_info(p)->flags);
6889 -+
6890 -+ print_worker_info(KERN_INFO, p);
6891 -+ print_stop_info(KERN_INFO, p);
6892 -+ show_stack(p, NULL, KERN_INFO);
6893 -+ put_task_stack(p);
6894 -+}
6895 -+EXPORT_SYMBOL_GPL(sched_show_task);
6896 -+
6897 -+static inline bool
6898 -+state_filter_match(unsigned long state_filter, struct task_struct *p)
6899 -+{
6900 -+ unsigned int state = READ_ONCE(p->__state);
6901 -+
6902 -+ /* no filter, everything matches */
6903 -+ if (!state_filter)
6904 -+ return true;
6905 -+
6906 -+ /* filter, but doesn't match */
6907 -+ if (!(state & state_filter))
6908 -+ return false;
6909 -+
6910 -+ /*
6911 -+ * When looking for TASK_UNINTERRUPTIBLE skip TASK_IDLE (allows
6912 -+ * TASK_KILLABLE).
6913 -+ */
6914 -+ if (state_filter == TASK_UNINTERRUPTIBLE && state == TASK_IDLE)
6915 -+ return false;
6916 -+
6917 -+ return true;
6918 -+}
6919 -+
6920 -+
6921 -+void show_state_filter(unsigned int state_filter)
6922 -+{
6923 -+ struct task_struct *g, *p;
6924 -+
6925 -+ rcu_read_lock();
6926 -+ for_each_process_thread(g, p) {
6927 -+ /*
6928 -+ * reset the NMI-timeout, listing all files on a slow
6929 -+ * console might take a lot of time:
6930 -+ * Also, reset softlockup watchdogs on all CPUs, because
6931 -+ * another CPU might be blocked waiting for us to process
6932 -+ * an IPI.
6933 -+ */
6934 -+ touch_nmi_watchdog();
6935 -+ touch_all_softlockup_watchdogs();
6936 -+ if (state_filter_match(state_filter, p))
6937 -+ sched_show_task(p);
6938 -+ }
6939 -+
6940 -+#ifdef CONFIG_SCHED_DEBUG
6941 -+ /* TODO: Alt schedule FW should support this
6942 -+ if (!state_filter)
6943 -+ sysrq_sched_debug_show();
6944 -+ */
6945 -+#endif
6946 -+ rcu_read_unlock();
6947 -+ /*
6948 -+ * Only show locks if all tasks are dumped:
6949 -+ */
6950 -+ if (!state_filter)
6951 -+ debug_show_all_locks();
6952 -+}
6953 -+
6954 -+void dump_cpu_task(int cpu)
6955 -+{
6956 -+ pr_info("Task dump for CPU %d:\n", cpu);
6957 -+ sched_show_task(cpu_curr(cpu));
6958 -+}
6959 -+
6960 -+/**
6961 -+ * init_idle - set up an idle thread for a given CPU
6962 -+ * @idle: task in question
6963 -+ * @cpu: CPU the idle task belongs to
6964 -+ *
6965 -+ * NOTE: this function does not set the idle thread's NEED_RESCHED
6966 -+ * flag, to make booting more robust.
6967 -+ */
6968 -+void __init init_idle(struct task_struct *idle, int cpu)
6969 -+{
6970 -+ struct rq *rq = cpu_rq(cpu);
6971 -+ unsigned long flags;
6972 -+
6973 -+ __sched_fork(0, idle);
6974 -+
6975 -+ /*
6976 -+ * The idle task doesn't need the kthread struct to function, but it
6977 -+ * is dressed up as a per-CPU kthread and thus needs to play the part
6978 -+ * if we want to avoid special-casing it in code that deals with per-CPU
6979 -+ * kthreads.
6980 -+ */
6981 -+ set_kthread_struct(idle);
6982 -+
6983 -+ raw_spin_lock_irqsave(&idle->pi_lock, flags);
6984 -+ raw_spin_lock(&rq->lock);
6985 -+ update_rq_clock(rq);
6986 -+
6987 -+ idle->last_ran = rq->clock_task;
6988 -+ idle->__state = TASK_RUNNING;
6989 -+ /*
6990 -+ * PF_KTHREAD should already be set at this point; regardless, make it
6991 -+ * look like a proper per-CPU kthread.
6992 -+ */
6993 -+ idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY;
6994 -+ kthread_set_per_cpu(idle, cpu);
6995 -+
6996 -+ sched_queue_init_idle(&rq->queue, idle);
6997 -+
6998 -+ scs_task_reset(idle);
6999 -+ kasan_unpoison_task_stack(idle);
7000 -+
7001 -+#ifdef CONFIG_SMP
7002 -+ /*
7003 -+ * It's possible that init_idle() gets called multiple times on a task,
7004 -+ * in that case do_set_cpus_allowed() will not do the right thing.
7005 -+ *
7006 -+ * And since this is boot we can forgo the serialisation.
7007 -+ */
7008 -+ set_cpus_allowed_common(idle, cpumask_of(cpu));
7009 -+#endif
7010 -+
7011 -+ /* Silence PROVE_RCU */
7012 -+ rcu_read_lock();
7013 -+ __set_task_cpu(idle, cpu);
7014 -+ rcu_read_unlock();
7015 -+
7016 -+ rq->idle = idle;
7017 -+ rcu_assign_pointer(rq->curr, idle);
7018 -+ idle->on_cpu = 1;
7019 -+
7020 -+ raw_spin_unlock(&rq->lock);
7021 -+ raw_spin_unlock_irqrestore(&idle->pi_lock, flags);
7022 -+
7023 -+ /* Set the preempt count _outside_ the spinlocks! */
7024 -+ init_idle_preempt_count(idle, cpu);
7025 -+
7026 -+ ftrace_graph_init_idle_task(idle, cpu);
7027 -+ vtime_init_idle(idle, cpu);
7028 -+#ifdef CONFIG_SMP
7029 -+ sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
7030 -+#endif
7031 -+}
7032 -+
7033 -+#ifdef CONFIG_SMP
7034 -+
7035 -+int cpuset_cpumask_can_shrink(const struct cpumask __maybe_unused *cur,
7036 -+ const struct cpumask __maybe_unused *trial)
7037 -+{
7038 -+ return 1;
7039 -+}
7040 -+
7041 -+int task_can_attach(struct task_struct *p,
7042 -+ const struct cpumask *cs_cpus_allowed)
7043 -+{
7044 -+ int ret = 0;
7045 -+
7046 -+ /*
7047 -+ * Kthreads which disallow setaffinity shouldn't be moved
7048 -+ * to a new cpuset; we don't want to change their CPU
7049 -+ * affinity and isolating such threads by their set of
7050 -+ * allowed nodes is unnecessary. Thus, cpusets are not
7051 -+ * applicable for such threads. This prevents checking for
7052 -+ * success of set_cpus_allowed_ptr() on all attached tasks
7053 -+ * before cpus_mask may be changed.
7054 -+ */
7055 -+ if (p->flags & PF_NO_SETAFFINITY)
7056 -+ ret = -EINVAL;
7057 -+
7058 -+ return ret;
7059 -+}
7060 -+
7061 -+bool sched_smp_initialized __read_mostly;
7062 -+
7063 -+#ifdef CONFIG_HOTPLUG_CPU
7064 -+/*
7065 -+ * Ensures that the idle task is using init_mm right before its CPU goes
7066 -+ * offline.
7067 -+ */
7068 -+void idle_task_exit(void)
7069 -+{
7070 -+ struct mm_struct *mm = current->active_mm;
7071 -+
7072 -+ BUG_ON(current != this_rq()->idle);
7073 -+
7074 -+ if (mm != &init_mm) {
7075 -+ switch_mm(mm, &init_mm, current);
7076 -+ finish_arch_post_lock_switch();
7077 -+ }
7078 -+
7079 -+ /* finish_cpu(), as ran on the BP, will clean up the active_mm state */
7080 -+}
7081 -+
7082 -+static int __balance_push_cpu_stop(void *arg)
7083 -+{
7084 -+ struct task_struct *p = arg;
7085 -+ struct rq *rq = this_rq();
7086 -+ struct rq_flags rf;
7087 -+ int cpu;
7088 -+
7089 -+ raw_spin_lock_irq(&p->pi_lock);
7090 -+ rq_lock(rq, &rf);
7091 -+
7092 -+ update_rq_clock(rq);
7093 -+
7094 -+ if (task_rq(p) == rq && task_on_rq_queued(p)) {
7095 -+ cpu = select_fallback_rq(rq->cpu, p);
7096 -+ rq = __migrate_task(rq, p, cpu);
7097 -+ }
7098 -+
7099 -+ rq_unlock(rq, &rf);
7100 -+ raw_spin_unlock_irq(&p->pi_lock);
7101 -+
7102 -+ put_task_struct(p);
7103 -+
7104 -+ return 0;
7105 -+}
7106 -+
7107 -+static DEFINE_PER_CPU(struct cpu_stop_work, push_work);
7108 -+
7109 -+/*
7110 -+ * This is enabled below SCHED_AP_ACTIVE; when !cpu_active(), but only
7111 -+ * effective when the hotplug motion is down.
7112 -+ */
7113 -+static void balance_push(struct rq *rq)
7114 -+{
7115 -+ struct task_struct *push_task = rq->curr;
7116 -+
7117 -+ lockdep_assert_held(&rq->lock);
7118 -+
7119 -+ /*
7120 -+ * Ensure the thing is persistent until balance_push_set(.on = false);
7121 -+ */
7122 -+ rq->balance_callback = &balance_push_callback;
7123 -+
7124 -+ /*
7125 -+ * Only active while going offline and when invoked on the outgoing
7126 -+ * CPU.
7127 -+ */
7128 -+ if (!cpu_dying(rq->cpu) || rq != this_rq())
7129 -+ return;
7130 -+
7131 -+ /*
7132 -+ * Both the cpu-hotplug and stop task are in this case and are
7133 -+ * required to complete the hotplug process.
7134 -+ */
7135 -+ if (kthread_is_per_cpu(push_task) ||
7136 -+ is_migration_disabled(push_task)) {
7137 -+
7138 -+ /*
7139 -+ * If this is the idle task on the outgoing CPU try to wake
7140 -+ * up the hotplug control thread which might wait for the
7141 -+ * last task to vanish. The rcuwait_active() check is
7142 -+ * accurate here because the waiter is pinned on this CPU
7143 -+ * and can't obviously be running in parallel.
7144 -+ *
7145 -+ * On RT kernels this also has to check whether there are
7146 -+ * pinned and scheduled out tasks on the runqueue. They
7147 -+ * need to leave the migrate disabled section first.
7148 -+ */
7149 -+ if (!rq->nr_running && !rq_has_pinned_tasks(rq) &&
7150 -+ rcuwait_active(&rq->hotplug_wait)) {
7151 -+ raw_spin_unlock(&rq->lock);
7152 -+ rcuwait_wake_up(&rq->hotplug_wait);
7153 -+ raw_spin_lock(&rq->lock);
7154 -+ }
7155 -+ return;
7156 -+ }
7157 -+
7158 -+ get_task_struct(push_task);
7159 -+ /*
7160 -+ * Temporarily drop rq->lock such that we can wake-up the stop task.
7161 -+ * Both preemption and IRQs are still disabled.
7162 -+ */
7163 -+ raw_spin_unlock(&rq->lock);
7164 -+ stop_one_cpu_nowait(rq->cpu, __balance_push_cpu_stop, push_task,
7165 -+ this_cpu_ptr(&push_work));
7166 -+ /*
7167 -+ * At this point need_resched() is true and we'll take the loop in
7168 -+ * schedule(). The next pick is obviously going to be the stop task
7169 -+ * which kthread_is_per_cpu() and will push this task away.
7170 -+ */
7171 -+ raw_spin_lock(&rq->lock);
7172 -+}
7173 -+
7174 -+static void balance_push_set(int cpu, bool on)
7175 -+{
7176 -+ struct rq *rq = cpu_rq(cpu);
7177 -+ struct rq_flags rf;
7178 -+
7179 -+ rq_lock_irqsave(rq, &rf);
7180 -+ if (on) {
7181 -+ WARN_ON_ONCE(rq->balance_callback);
7182 -+ rq->balance_callback = &balance_push_callback;
7183 -+ } else if (rq->balance_callback == &balance_push_callback) {
7184 -+ rq->balance_callback = NULL;
7185 -+ }
7186 -+ rq_unlock_irqrestore(rq, &rf);
7187 -+}
7188 -+
7189 -+/*
7190 -+ * Invoked from a CPU's hotplug control thread after the CPU has been marked
7191 -+ * inactive. All tasks which are not per CPU kernel threads are either
7192 -+ * pushed off this CPU now via balance_push() or placed on a different CPU
7193 -+ * during wakeup. Wait until the CPU is quiescent.
7194 -+ */
7195 -+static void balance_hotplug_wait(void)
7196 -+{
7197 -+ struct rq *rq = this_rq();
7198 -+
7199 -+ rcuwait_wait_event(&rq->hotplug_wait,
7200 -+ rq->nr_running == 1 && !rq_has_pinned_tasks(rq),
7201 -+ TASK_UNINTERRUPTIBLE);
7202 -+}
7203 -+
7204 -+#else
7205 -+
7206 -+static void balance_push(struct rq *rq)
7207 -+{
7208 -+}
7209 -+
7210 -+static void balance_push_set(int cpu, bool on)
7211 -+{
7212 -+}
7213 -+
7214 -+static inline void balance_hotplug_wait(void)
7215 -+{
7216 -+}
7217 -+#endif /* CONFIG_HOTPLUG_CPU */
7218 -+
7219 -+static void set_rq_offline(struct rq *rq)
7220 -+{
7221 -+ if (rq->online)
7222 -+ rq->online = false;
7223 -+}
7224 -+
7225 -+static void set_rq_online(struct rq *rq)
7226 -+{
7227 -+ if (!rq->online)
7228 -+ rq->online = true;
7229 -+}
7230 -+
7231 -+/*
7232 -+ * used to mark begin/end of suspend/resume:
7233 -+ */
7234 -+static int num_cpus_frozen;
7235 -+
7236 -+/*
7237 -+ * Update cpusets according to cpu_active mask. If cpusets are
7238 -+ * disabled, cpuset_update_active_cpus() becomes a simple wrapper
7239 -+ * around partition_sched_domains().
7240 -+ *
7241 -+ * If we come here as part of a suspend/resume, don't touch cpusets because we
7242 -+ * want to restore it back to its original state upon resume anyway.
7243 -+ */
7244 -+static void cpuset_cpu_active(void)
7245 -+{
7246 -+ if (cpuhp_tasks_frozen) {
7247 -+ /*
7248 -+ * num_cpus_frozen tracks how many CPUs are involved in suspend
7249 -+ * resume sequence. As long as this is not the last online
7250 -+ * operation in the resume sequence, just build a single sched
7251 -+ * domain, ignoring cpusets.
7252 -+ */
7253 -+ partition_sched_domains(1, NULL, NULL);
7254 -+ if (--num_cpus_frozen)
7255 -+ return;
7256 -+ /*
7257 -+ * This is the last CPU online operation. So fall through and
7258 -+ * restore the original sched domains by considering the
7259 -+ * cpuset configurations.
7260 -+ */
7261 -+ cpuset_force_rebuild();
7262 -+ }
7263 -+
7264 -+ cpuset_update_active_cpus();
7265 -+}
7266 -+
7267 -+static int cpuset_cpu_inactive(unsigned int cpu)
7268 -+{
7269 -+ if (!cpuhp_tasks_frozen) {
7270 -+ cpuset_update_active_cpus();
7271 -+ } else {
7272 -+ num_cpus_frozen++;
7273 -+ partition_sched_domains(1, NULL, NULL);
7274 -+ }
7275 -+ return 0;
7276 -+}
7277 -+
7278 -+int sched_cpu_activate(unsigned int cpu)
7279 -+{
7280 -+ struct rq *rq = cpu_rq(cpu);
7281 -+ unsigned long flags;
7282 -+
7283 -+ /*
7284 -+ * Clear the balance_push callback and prepare to schedule
7285 -+ * regular tasks.
7286 -+ */
7287 -+ balance_push_set(cpu, false);
7288 -+
7289 -+#ifdef CONFIG_SCHED_SMT
7290 -+ /*
7291 -+ * When going up, increment the number of cores with SMT present.
7292 -+ */
7293 -+ if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
7294 -+ static_branch_inc_cpuslocked(&sched_smt_present);
7295 -+#endif
7296 -+ set_cpu_active(cpu, true);
7297 -+
7298 -+ if (sched_smp_initialized)
7299 -+ cpuset_cpu_active();
7300 -+
7301 -+ /*
7302 -+ * Put the rq online, if not already. This happens:
7303 -+ *
7304 -+ * 1) In the early boot process, because we build the real domains
7305 -+ * after all cpus have been brought up.
7306 -+ *
7307 -+ * 2) At runtime, if cpuset_cpu_active() fails to rebuild the
7308 -+ * domains.
7309 -+ */
7310 -+ raw_spin_lock_irqsave(&rq->lock, flags);
7311 -+ set_rq_online(rq);
7312 -+ raw_spin_unlock_irqrestore(&rq->lock, flags);
7313 -+
7314 -+ return 0;
7315 -+}
7316 -+
7317 -+int sched_cpu_deactivate(unsigned int cpu)
7318 -+{
7319 -+ struct rq *rq = cpu_rq(cpu);
7320 -+ unsigned long flags;
7321 -+ int ret;
7322 -+
7323 -+ set_cpu_active(cpu, false);
7324 -+
7325 -+ /*
7326 -+ * From this point forward, this CPU will refuse to run any task that
7327 -+ * is not: migrate_disable() or KTHREAD_IS_PER_CPU, and will actively
7328 -+ * push those tasks away until this gets cleared, see
7329 -+ * sched_cpu_dying().
7330 -+ */
7331 -+ balance_push_set(cpu, true);
7332 -+
7333 -+ /*
7334 -+ * We've cleared cpu_active_mask, wait for all preempt-disabled and RCU
7335 -+ * users of this state to go away such that all new such users will
7336 -+ * observe it.
7337 -+ *
7338 -+ * Specifically, we rely on ttwu to no longer target this CPU, see
7339 -+ * ttwu_queue_cond() and is_cpu_allowed().
7340 -+ *
7341 -+ * Do sync before park smpboot threads to take care the rcu boost case.
7342 -+ */
7343 -+ synchronize_rcu();
7344 -+
7345 -+ raw_spin_lock_irqsave(&rq->lock, flags);
7346 -+ update_rq_clock(rq);
7347 -+ set_rq_offline(rq);
7348 -+ raw_spin_unlock_irqrestore(&rq->lock, flags);
7349 -+
7350 -+#ifdef CONFIG_SCHED_SMT
7351 -+ /*
7352 -+ * When going down, decrement the number of cores with SMT present.
7353 -+ */
7354 -+ if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
7355 -+ static_branch_dec_cpuslocked(&sched_smt_present);
7356 -+ if (!static_branch_likely(&sched_smt_present))
7357 -+ cpumask_clear(&sched_sg_idle_mask);
7358 -+ }
7359 -+#endif
7360 -+
7361 -+ if (!sched_smp_initialized)
7362 -+ return 0;
7363 -+
7364 -+ ret = cpuset_cpu_inactive(cpu);
7365 -+ if (ret) {
7366 -+ balance_push_set(cpu, false);
7367 -+ set_cpu_active(cpu, true);
7368 -+ return ret;
7369 -+ }
7370 -+
7371 -+ return 0;
7372 -+}
7373 -+
7374 -+static void sched_rq_cpu_starting(unsigned int cpu)
7375 -+{
7376 -+ struct rq *rq = cpu_rq(cpu);
7377 -+
7378 -+ rq->calc_load_update = calc_load_update;
7379 -+}
7380 -+
7381 -+int sched_cpu_starting(unsigned int cpu)
7382 -+{
7383 -+ sched_rq_cpu_starting(cpu);
7384 -+ sched_tick_start(cpu);
7385 -+ return 0;
7386 -+}
7387 -+
7388 -+#ifdef CONFIG_HOTPLUG_CPU
7389 -+
7390 -+/*
7391 -+ * Invoked immediately before the stopper thread is invoked to bring the
7392 -+ * CPU down completely. At this point all per CPU kthreads except the
7393 -+ * hotplug thread (current) and the stopper thread (inactive) have been
7394 -+ * either parked or have been unbound from the outgoing CPU. Ensure that
7395 -+ * any of those which might be on the way out are gone.
7396 -+ *
7397 -+ * If after this point a bound task is being woken on this CPU then the
7398 -+ * responsible hotplug callback has failed to do it's job.
7399 -+ * sched_cpu_dying() will catch it with the appropriate fireworks.
7400 -+ */
7401 -+int sched_cpu_wait_empty(unsigned int cpu)
7402 -+{
7403 -+ balance_hotplug_wait();
7404 -+ return 0;
7405 -+}
7406 -+
7407 -+/*
7408 -+ * Since this CPU is going 'away' for a while, fold any nr_active delta we
7409 -+ * might have. Called from the CPU stopper task after ensuring that the
7410 -+ * stopper is the last running task on the CPU, so nr_active count is
7411 -+ * stable. We need to take the teardown thread which is calling this into
7412 -+ * account, so we hand in adjust = 1 to the load calculation.
7413 -+ *
7414 -+ * Also see the comment "Global load-average calculations".
7415 -+ */
7416 -+static void calc_load_migrate(struct rq *rq)
7417 -+{
7418 -+ long delta = calc_load_fold_active(rq, 1);
7419 -+
7420 -+ if (delta)
7421 -+ atomic_long_add(delta, &calc_load_tasks);
7422 -+}
7423 -+
7424 -+static void dump_rq_tasks(struct rq *rq, const char *loglvl)
7425 -+{
7426 -+ struct task_struct *g, *p;
7427 -+ int cpu = cpu_of(rq);
7428 -+
7429 -+ lockdep_assert_held(&rq->lock);
7430 -+
7431 -+ printk("%sCPU%d enqueued tasks (%u total):\n", loglvl, cpu, rq->nr_running);
7432 -+ for_each_process_thread(g, p) {
7433 -+ if (task_cpu(p) != cpu)
7434 -+ continue;
7435 -+
7436 -+ if (!task_on_rq_queued(p))
7437 -+ continue;
7438 -+
7439 -+ printk("%s\tpid: %d, name: %s\n", loglvl, p->pid, p->comm);
7440 -+ }
7441 -+}
7442 -+
7443 -+int sched_cpu_dying(unsigned int cpu)
7444 -+{
7445 -+ struct rq *rq = cpu_rq(cpu);
7446 -+ unsigned long flags;
7447 -+
7448 -+ /* Handle pending wakeups and then migrate everything off */
7449 -+ sched_tick_stop(cpu);
7450 -+
7451 -+ raw_spin_lock_irqsave(&rq->lock, flags);
7452 -+ if (rq->nr_running != 1 || rq_has_pinned_tasks(rq)) {
7453 -+ WARN(true, "Dying CPU not properly vacated!");
7454 -+ dump_rq_tasks(rq, KERN_WARNING);
7455 -+ }
7456 -+ raw_spin_unlock_irqrestore(&rq->lock, flags);
7457 -+
7458 -+ calc_load_migrate(rq);
7459 -+ hrtick_clear(rq);
7460 -+ return 0;
7461 -+}
7462 -+#endif
7463 -+
7464 -+#ifdef CONFIG_SMP
7465 -+static void sched_init_topology_cpumask_early(void)
7466 -+{
7467 -+ int cpu;
7468 -+ cpumask_t *tmp;
7469 -+
7470 -+ for_each_possible_cpu(cpu) {
7471 -+ /* init topo masks */
7472 -+ tmp = per_cpu(sched_cpu_topo_masks, cpu);
7473 -+
7474 -+ cpumask_copy(tmp, cpumask_of(cpu));
7475 -+ tmp++;
7476 -+ cpumask_copy(tmp, cpu_possible_mask);
7477 -+ per_cpu(sched_cpu_llc_mask, cpu) = tmp;
7478 -+ per_cpu(sched_cpu_topo_end_mask, cpu) = ++tmp;
7479 -+ /*per_cpu(sd_llc_id, cpu) = cpu;*/
7480 -+ }
7481 -+}
7482 -+
7483 -+#define TOPOLOGY_CPUMASK(name, mask, last)\
7484 -+ if (cpumask_and(topo, topo, mask)) { \
7485 -+ cpumask_copy(topo, mask); \
7486 -+ printk(KERN_INFO "sched: cpu#%02d topo: 0x%08lx - "#name, \
7487 -+ cpu, (topo++)->bits[0]); \
7488 -+ } \
7489 -+ if (!last) \
7490 -+ cpumask_complement(topo, mask)
7491 -+
7492 -+static void sched_init_topology_cpumask(void)
7493 -+{
7494 -+ int cpu;
7495 -+ cpumask_t *topo;
7496 -+
7497 -+ for_each_online_cpu(cpu) {
7498 -+ /* take chance to reset time slice for idle tasks */
7499 -+ cpu_rq(cpu)->idle->time_slice = sched_timeslice_ns;
7500 -+
7501 -+ topo = per_cpu(sched_cpu_topo_masks, cpu) + 1;
7502 -+
7503 -+ cpumask_complement(topo, cpumask_of(cpu));
7504 -+#ifdef CONFIG_SCHED_SMT
7505 -+ TOPOLOGY_CPUMASK(smt, topology_sibling_cpumask(cpu), false);
7506 -+#endif
7507 -+ per_cpu(sd_llc_id, cpu) = cpumask_first(cpu_coregroup_mask(cpu));
7508 -+ per_cpu(sched_cpu_llc_mask, cpu) = topo;
7509 -+ TOPOLOGY_CPUMASK(coregroup, cpu_coregroup_mask(cpu), false);
7510 -+
7511 -+ TOPOLOGY_CPUMASK(core, topology_core_cpumask(cpu), false);
7512 -+
7513 -+ TOPOLOGY_CPUMASK(others, cpu_online_mask, true);
7514 -+
7515 -+ per_cpu(sched_cpu_topo_end_mask, cpu) = topo;
7516 -+ printk(KERN_INFO "sched: cpu#%02d llc_id = %d, llc_mask idx = %d\n",
7517 -+ cpu, per_cpu(sd_llc_id, cpu),
7518 -+ (int) (per_cpu(sched_cpu_llc_mask, cpu) -
7519 -+ per_cpu(sched_cpu_topo_masks, cpu)));
7520 -+ }
7521 -+}
7522 -+#endif
7523 -+
7524 -+void __init sched_init_smp(void)
7525 -+{
7526 -+ /* Move init over to a non-isolated CPU */
7527 -+ if (set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_FLAG_DOMAIN)) < 0)
7528 -+ BUG();
7529 -+ current->flags &= ~PF_NO_SETAFFINITY;
7530 -+
7531 -+ sched_init_topology_cpumask();
7532 -+
7533 -+ sched_smp_initialized = true;
7534 -+}
7535 -+#else
7536 -+void __init sched_init_smp(void)
7537 -+{
7538 -+ cpu_rq(0)->idle->time_slice = sched_timeslice_ns;
7539 -+}
7540 -+#endif /* CONFIG_SMP */
7541 -+
7542 -+int in_sched_functions(unsigned long addr)
7543 -+{
7544 -+ return in_lock_functions(addr) ||
7545 -+ (addr >= (unsigned long)__sched_text_start
7546 -+ && addr < (unsigned long)__sched_text_end);
7547 -+}
7548 -+
7549 -+#ifdef CONFIG_CGROUP_SCHED
7550 -+/* task group related information */
7551 -+struct task_group {
7552 -+ struct cgroup_subsys_state css;
7553 -+
7554 -+ struct rcu_head rcu;
7555 -+ struct list_head list;
7556 -+
7557 -+ struct task_group *parent;
7558 -+ struct list_head siblings;
7559 -+ struct list_head children;
7560 -+#ifdef CONFIG_FAIR_GROUP_SCHED
7561 -+ unsigned long shares;
7562 -+#endif
7563 -+};
7564 -+
7565 -+/*
7566 -+ * Default task group.
7567 -+ * Every task in system belongs to this group at bootup.
7568 -+ */
7569 -+struct task_group root_task_group;
7570 -+LIST_HEAD(task_groups);
7571 -+
7572 -+/* Cacheline aligned slab cache for task_group */
7573 -+static struct kmem_cache *task_group_cache __read_mostly;
7574 -+#endif /* CONFIG_CGROUP_SCHED */
7575 -+
7576 -+void __init sched_init(void)
7577 -+{
7578 -+ int i;
7579 -+ struct rq *rq;
7580 -+
7581 -+ printk(KERN_INFO ALT_SCHED_VERSION_MSG);
7582 -+
7583 -+ wait_bit_init();
7584 -+
7585 -+#ifdef CONFIG_SMP
7586 -+ for (i = 0; i < SCHED_BITS; i++)
7587 -+ cpumask_copy(sched_rq_watermark + i, cpu_present_mask);
7588 -+#endif
7589 -+
7590 -+#ifdef CONFIG_CGROUP_SCHED
7591 -+ task_group_cache = KMEM_CACHE(task_group, 0);
7592 -+
7593 -+ list_add(&root_task_group.list, &task_groups);
7594 -+ INIT_LIST_HEAD(&root_task_group.children);
7595 -+ INIT_LIST_HEAD(&root_task_group.siblings);
7596 -+#endif /* CONFIG_CGROUP_SCHED */
7597 -+ for_each_possible_cpu(i) {
7598 -+ rq = cpu_rq(i);
7599 -+
7600 -+ sched_queue_init(&rq->queue);
7601 -+ rq->watermark = IDLE_TASK_SCHED_PRIO;
7602 -+ rq->skip = NULL;
7603 -+
7604 -+ raw_spin_lock_init(&rq->lock);
7605 -+ rq->nr_running = rq->nr_uninterruptible = 0;
7606 -+ rq->calc_load_active = 0;
7607 -+ rq->calc_load_update = jiffies + LOAD_FREQ;
7608 -+#ifdef CONFIG_SMP
7609 -+ rq->online = false;
7610 -+ rq->cpu = i;
7611 -+
7612 -+#ifdef CONFIG_SCHED_SMT
7613 -+ rq->active_balance = 0;
7614 -+#endif
7615 -+
7616 -+#ifdef CONFIG_NO_HZ_COMMON
7617 -+ INIT_CSD(&rq->nohz_csd, nohz_csd_func, rq);
7618 -+#endif
7619 -+ rq->balance_callback = &balance_push_callback;
7620 -+#ifdef CONFIG_HOTPLUG_CPU
7621 -+ rcuwait_init(&rq->hotplug_wait);
7622 -+#endif
7623 -+#endif /* CONFIG_SMP */
7624 -+ rq->nr_switches = 0;
7625 -+
7626 -+ hrtick_rq_init(rq);
7627 -+ atomic_set(&rq->nr_iowait, 0);
7628 -+ }
7629 -+#ifdef CONFIG_SMP
7630 -+ /* Set rq->online for cpu 0 */
7631 -+ cpu_rq(0)->online = true;
7632 -+#endif
7633 -+ /*
7634 -+ * The boot idle thread does lazy MMU switching as well:
7635 -+ */
7636 -+ mmgrab(&init_mm);
7637 -+ enter_lazy_tlb(&init_mm, current);
7638 -+
7639 -+ /*
7640 -+ * Make us the idle thread. Technically, schedule() should not be
7641 -+ * called from this thread, however somewhere below it might be,
7642 -+ * but because we are the idle thread, we just pick up running again
7643 -+ * when this runqueue becomes "idle".
7644 -+ */
7645 -+ init_idle(current, smp_processor_id());
7646 -+
7647 -+ calc_load_update = jiffies + LOAD_FREQ;
7648 -+
7649 -+#ifdef CONFIG_SMP
7650 -+ idle_thread_set_boot_cpu();
7651 -+ balance_push_set(smp_processor_id(), false);
7652 -+
7653 -+ sched_init_topology_cpumask_early();
7654 -+#endif /* SMP */
7655 -+
7656 -+ psi_init();
7657 -+}
7658 -+
7659 -+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
7660 -+static inline int preempt_count_equals(int preempt_offset)
7661 -+{
7662 -+ int nested = preempt_count() + rcu_preempt_depth();
7663 -+
7664 -+ return (nested == preempt_offset);
7665 -+}
7666 -+
7667 -+void __might_sleep(const char *file, int line, int preempt_offset)
7668 -+{
7669 -+ unsigned int state = get_current_state();
7670 -+ /*
7671 -+ * Blocking primitives will set (and therefore destroy) current->state,
7672 -+ * since we will exit with TASK_RUNNING make sure we enter with it,
7673 -+ * otherwise we will destroy state.
7674 -+ */
7675 -+ WARN_ONCE(state != TASK_RUNNING && current->task_state_change,
7676 -+ "do not call blocking ops when !TASK_RUNNING; "
7677 -+ "state=%x set at [<%p>] %pS\n", state,
7678 -+ (void *)current->task_state_change,
7679 -+ (void *)current->task_state_change);
7680 -+
7681 -+ ___might_sleep(file, line, preempt_offset);
7682 -+}
7683 -+EXPORT_SYMBOL(__might_sleep);
7684 -+
7685 -+void ___might_sleep(const char *file, int line, int preempt_offset)
7686 -+{
7687 -+ /* Ratelimiting timestamp: */
7688 -+ static unsigned long prev_jiffy;
7689 -+
7690 -+ unsigned long preempt_disable_ip;
7691 -+
7692 -+ /* WARN_ON_ONCE() by default, no rate limit required: */
7693 -+ rcu_sleep_check();
7694 -+
7695 -+ if ((preempt_count_equals(preempt_offset) && !irqs_disabled() &&
7696 -+ !is_idle_task(current) && !current->non_block_count) ||
7697 -+ system_state == SYSTEM_BOOTING || system_state > SYSTEM_RUNNING ||
7698 -+ oops_in_progress)
7699 -+ return;
7700 -+ if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
7701 -+ return;
7702 -+ prev_jiffy = jiffies;
7703 -+
7704 -+ /* Save this before calling printk(), since that will clobber it: */
7705 -+ preempt_disable_ip = get_preempt_disable_ip(current);
7706 -+
7707 -+ printk(KERN_ERR
7708 -+ "BUG: sleeping function called from invalid context at %s:%d\n",
7709 -+ file, line);
7710 -+ printk(KERN_ERR
7711 -+ "in_atomic(): %d, irqs_disabled(): %d, non_block: %d, pid: %d, name: %s\n",
7712 -+ in_atomic(), irqs_disabled(), current->non_block_count,
7713 -+ current->pid, current->comm);
7714 -+
7715 -+ if (task_stack_end_corrupted(current))
7716 -+ printk(KERN_EMERG "Thread overran stack, or stack corrupted\n");
7717 -+
7718 -+ debug_show_held_locks(current);
7719 -+ if (irqs_disabled())
7720 -+ print_irqtrace_events(current);
7721 -+#ifdef CONFIG_DEBUG_PREEMPT
7722 -+ if (!preempt_count_equals(preempt_offset)) {
7723 -+ pr_err("Preemption disabled at:");
7724 -+ print_ip_sym(KERN_ERR, preempt_disable_ip);
7725 -+ }
7726 -+#endif
7727 -+ dump_stack();
7728 -+ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
7729 -+}
7730 -+EXPORT_SYMBOL(___might_sleep);
7731 -+
7732 -+void __cant_sleep(const char *file, int line, int preempt_offset)
7733 -+{
7734 -+ static unsigned long prev_jiffy;
7735 -+
7736 -+ if (irqs_disabled())
7737 -+ return;
7738 -+
7739 -+ if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
7740 -+ return;
7741 -+
7742 -+ if (preempt_count() > preempt_offset)
7743 -+ return;
7744 -+
7745 -+ if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
7746 -+ return;
7747 -+ prev_jiffy = jiffies;
7748 -+
7749 -+ printk(KERN_ERR "BUG: assuming atomic context at %s:%d\n", file, line);
7750 -+ printk(KERN_ERR "in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",
7751 -+ in_atomic(), irqs_disabled(),
7752 -+ current->pid, current->comm);
7753 -+
7754 -+ debug_show_held_locks(current);
7755 -+ dump_stack();
7756 -+ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
7757 -+}
7758 -+EXPORT_SYMBOL_GPL(__cant_sleep);
7759 -+
7760 -+#ifdef CONFIG_SMP
7761 -+void __cant_migrate(const char *file, int line)
7762 -+{
7763 -+ static unsigned long prev_jiffy;
7764 -+
7765 -+ if (irqs_disabled())
7766 -+ return;
7767 -+
7768 -+ if (is_migration_disabled(current))
7769 -+ return;
7770 -+
7771 -+ if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
7772 -+ return;
7773 -+
7774 -+ if (preempt_count() > 0)
7775 -+ return;
7776 -+
7777 -+ if (current->migration_flags & MDF_FORCE_ENABLED)
7778 -+ return;
7779 -+
7780 -+ if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
7781 -+ return;
7782 -+ prev_jiffy = jiffies;
7783 -+
7784 -+ pr_err("BUG: assuming non migratable context at %s:%d\n", file, line);
7785 -+ pr_err("in_atomic(): %d, irqs_disabled(): %d, migration_disabled() %u pid: %d, name: %s\n",
7786 -+ in_atomic(), irqs_disabled(), is_migration_disabled(current),
7787 -+ current->pid, current->comm);
7788 -+
7789 -+ debug_show_held_locks(current);
7790 -+ dump_stack();
7791 -+ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
7792 -+}
7793 -+EXPORT_SYMBOL_GPL(__cant_migrate);
7794 -+#endif
7795 -+#endif
7796 -+
7797 -+#ifdef CONFIG_MAGIC_SYSRQ
7798 -+void normalize_rt_tasks(void)
7799 -+{
7800 -+ struct task_struct *g, *p;
7801 -+ struct sched_attr attr = {
7802 -+ .sched_policy = SCHED_NORMAL,
7803 -+ };
7804 -+
7805 -+ read_lock(&tasklist_lock);
7806 -+ for_each_process_thread(g, p) {
7807 -+ /*
7808 -+ * Only normalize user tasks:
7809 -+ */
7810 -+ if (p->flags & PF_KTHREAD)
7811 -+ continue;
7812 -+
7813 -+ if (!rt_task(p)) {
7814 -+ /*
7815 -+ * Renice negative nice level userspace
7816 -+ * tasks back to 0:
7817 -+ */
7818 -+ if (task_nice(p) < 0)
7819 -+ set_user_nice(p, 0);
7820 -+ continue;
7821 -+ }
7822 -+
7823 -+ __sched_setscheduler(p, &attr, false, false);
7824 -+ }
7825 -+ read_unlock(&tasklist_lock);
7826 -+}
7827 -+#endif /* CONFIG_MAGIC_SYSRQ */
7828 -+
7829 -+#if defined(CONFIG_IA64) || defined(CONFIG_KGDB_KDB)
7830 -+/*
7831 -+ * These functions are only useful for the IA64 MCA handling, or kdb.
7832 -+ *
7833 -+ * They can only be called when the whole system has been
7834 -+ * stopped - every CPU needs to be quiescent, and no scheduling
7835 -+ * activity can take place. Using them for anything else would
7836 -+ * be a serious bug, and as a result, they aren't even visible
7837 -+ * under any other configuration.
7838 -+ */
7839 -+
7840 -+/**
7841 -+ * curr_task - return the current task for a given CPU.
7842 -+ * @cpu: the processor in question.
7843 -+ *
7844 -+ * ONLY VALID WHEN THE WHOLE SYSTEM IS STOPPED!
7845 -+ *
7846 -+ * Return: The current task for @cpu.
7847 -+ */
7848 -+struct task_struct *curr_task(int cpu)
7849 -+{
7850 -+ return cpu_curr(cpu);
7851 -+}
7852 -+
7853 -+#endif /* defined(CONFIG_IA64) || defined(CONFIG_KGDB_KDB) */
7854 -+
7855 -+#ifdef CONFIG_IA64
7856 -+/**
7857 -+ * ia64_set_curr_task - set the current task for a given CPU.
7858 -+ * @cpu: the processor in question.
7859 -+ * @p: the task pointer to set.
7860 -+ *
7861 -+ * Description: This function must only be used when non-maskable interrupts
7862 -+ * are serviced on a separate stack. It allows the architecture to switch the
7863 -+ * notion of the current task on a CPU in a non-blocking manner. This function
7864 -+ * must be called with all CPUs synchronised and interrupts disabled, and the
7865 -+ * caller must save the original value of the current task (see
7866 -+ * curr_task() above) and restore that value before reenabling interrupts and
7867 -+ * re-starting the system.
7868 -+ *
7869 -+ * ONLY VALID WHEN THE WHOLE SYSTEM IS STOPPED!
7870 -+ */
7871 -+void ia64_set_curr_task(int cpu, struct task_struct *p)
7872 -+{
7873 -+ cpu_curr(cpu) = p;
7874 -+}
7875 -+
7876 -+#endif
7877 -+
7878 -+#ifdef CONFIG_CGROUP_SCHED
7879 -+static void sched_free_group(struct task_group *tg)
7880 -+{
7881 -+ kmem_cache_free(task_group_cache, tg);
7882 -+}
7883 -+
7884 -+/* allocate runqueue etc for a new task group */
7885 -+struct task_group *sched_create_group(struct task_group *parent)
7886 -+{
7887 -+ struct task_group *tg;
7888 -+
7889 -+ tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
7890 -+ if (!tg)
7891 -+ return ERR_PTR(-ENOMEM);
7892 -+
7893 -+ return tg;
7894 -+}
7895 -+
7896 -+void sched_online_group(struct task_group *tg, struct task_group *parent)
7897 -+{
7898 -+}
7899 -+
7900 -+/* rcu callback to free various structures associated with a task group */
7901 -+static void sched_free_group_rcu(struct rcu_head *rhp)
7902 -+{
7903 -+ /* Now it should be safe to free those cfs_rqs */
7904 -+ sched_free_group(container_of(rhp, struct task_group, rcu));
7905 -+}
7906 -+
7907 -+void sched_destroy_group(struct task_group *tg)
7908 -+{
7909 -+ /* Wait for possible concurrent references to cfs_rqs complete */
7910 -+ call_rcu(&tg->rcu, sched_free_group_rcu);
7911 -+}
7912 -+
7913 -+void sched_offline_group(struct task_group *tg)
7914 -+{
7915 -+}
7916 -+
7917 -+static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
7918 -+{
7919 -+ return css ? container_of(css, struct task_group, css) : NULL;
7920 -+}
7921 -+
7922 -+static struct cgroup_subsys_state *
7923 -+cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
7924 -+{
7925 -+ struct task_group *parent = css_tg(parent_css);
7926 -+ struct task_group *tg;
7927 -+
7928 -+ if (!parent) {
7929 -+ /* This is early initialization for the top cgroup */
7930 -+ return &root_task_group.css;
7931 -+ }
7932 -+
7933 -+ tg = sched_create_group(parent);
7934 -+ if (IS_ERR(tg))
7935 -+ return ERR_PTR(-ENOMEM);
7936 -+ return &tg->css;
7937 -+}
7938 -+
7939 -+/* Expose task group only after completing cgroup initialization */
7940 -+static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
7941 -+{
7942 -+ struct task_group *tg = css_tg(css);
7943 -+ struct task_group *parent = css_tg(css->parent);
7944 -+
7945 -+ if (parent)
7946 -+ sched_online_group(tg, parent);
7947 -+ return 0;
7948 -+}
7949 -+
7950 -+static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
7951 -+{
7952 -+ struct task_group *tg = css_tg(css);
7953 -+
7954 -+ sched_offline_group(tg);
7955 -+}
7956 -+
7957 -+static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
7958 -+{
7959 -+ struct task_group *tg = css_tg(css);
7960 -+
7961 -+ /*
7962 -+ * Relies on the RCU grace period between css_released() and this.
7963 -+ */
7964 -+ sched_free_group(tg);
7965 -+}
7966 -+
7967 -+static void cpu_cgroup_fork(struct task_struct *task)
7968 -+{
7969 -+}
7970 -+
7971 -+static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
7972 -+{
7973 -+ return 0;
7974 -+}
7975 -+
7976 -+static void cpu_cgroup_attach(struct cgroup_taskset *tset)
7977 -+{
7978 -+}
7979 -+
7980 -+#ifdef CONFIG_FAIR_GROUP_SCHED
7981 -+static DEFINE_MUTEX(shares_mutex);
7982 -+
7983 -+int sched_group_set_shares(struct task_group *tg, unsigned long shares)
7984 -+{
7985 -+ /*
7986 -+ * We can't change the weight of the root cgroup.
7987 -+ */
7988 -+ if (&root_task_group == tg)
7989 -+ return -EINVAL;
7990 -+
7991 -+ shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
7992 -+
7993 -+ mutex_lock(&shares_mutex);
7994 -+ if (tg->shares == shares)
7995 -+ goto done;
7996 -+
7997 -+ tg->shares = shares;
7998 -+done:
7999 -+ mutex_unlock(&shares_mutex);
8000 -+ return 0;
8001 -+}
8002 -+
8003 -+static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
8004 -+ struct cftype *cftype, u64 shareval)
8005 -+{
8006 -+ if (shareval > scale_load_down(ULONG_MAX))
8007 -+ shareval = MAX_SHARES;
8008 -+ return sched_group_set_shares(css_tg(css), scale_load(shareval));
8009 -+}
8010 -+
8011 -+static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
8012 -+ struct cftype *cft)
8013 -+{
8014 -+ struct task_group *tg = css_tg(css);
8015 -+
8016 -+ return (u64) scale_load_down(tg->shares);
8017 -+}
8018 -+#endif
8019 -+
8020 -+static struct cftype cpu_legacy_files[] = {
8021 -+#ifdef CONFIG_FAIR_GROUP_SCHED
8022 -+ {
8023 -+ .name = "shares",
8024 -+ .read_u64 = cpu_shares_read_u64,
8025 -+ .write_u64 = cpu_shares_write_u64,
8026 -+ },
8027 -+#endif
8028 -+ { } /* Terminate */
8029 -+};
8030 -+
8031 -+
8032 -+static struct cftype cpu_files[] = {
8033 -+ { } /* terminate */
8034 -+};
8035 -+
8036 -+static int cpu_extra_stat_show(struct seq_file *sf,
8037 -+ struct cgroup_subsys_state *css)
8038 -+{
8039 -+ return 0;
8040 -+}
8041 -+
8042 -+struct cgroup_subsys cpu_cgrp_subsys = {
8043 -+ .css_alloc = cpu_cgroup_css_alloc,
8044 -+ .css_online = cpu_cgroup_css_online,
8045 -+ .css_released = cpu_cgroup_css_released,
8046 -+ .css_free = cpu_cgroup_css_free,
8047 -+ .css_extra_stat_show = cpu_extra_stat_show,
8048 -+ .fork = cpu_cgroup_fork,
8049 -+ .can_attach = cpu_cgroup_can_attach,
8050 -+ .attach = cpu_cgroup_attach,
8051 -+ .legacy_cftypes = cpu_files,
8052 -+ .legacy_cftypes = cpu_legacy_files,
8053 -+ .dfl_cftypes = cpu_files,
8054 -+ .early_init = true,
8055 -+ .threaded = true,
8056 -+};
8057 -+#endif /* CONFIG_CGROUP_SCHED */
8058 -+
8059 -+#undef CREATE_TRACE_POINTS
8060 -diff --git a/kernel/sched/alt_debug.c b/kernel/sched/alt_debug.c
8061 -new file mode 100644
8062 -index 000000000000..1212a031700e
8063 ---- /dev/null
8064 -+++ b/kernel/sched/alt_debug.c
8065 -@@ -0,0 +1,31 @@
8066 -+/*
8067 -+ * kernel/sched/alt_debug.c
8068 -+ *
8069 -+ * Print the alt scheduler debugging details
8070 -+ *
8071 -+ * Author: Alfred Chen
8072 -+ * Date : 2020
8073 -+ */
8074 -+#include "sched.h"
8075 -+
8076 -+/*
8077 -+ * This allows printing both to /proc/sched_debug and
8078 -+ * to the console
8079 -+ */
8080 -+#define SEQ_printf(m, x...) \
8081 -+ do { \
8082 -+ if (m) \
8083 -+ seq_printf(m, x); \
8084 -+ else \
8085 -+ pr_cont(x); \
8086 -+ } while (0)
8087 -+
8088 -+void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
8089 -+ struct seq_file *m)
8090 -+{
8091 -+ SEQ_printf(m, "%s (%d, #threads: %d)\n", p->comm, task_pid_nr_ns(p, ns),
8092 -+ get_nr_threads(p));
8093 -+}
8094 -+
8095 -+void proc_sched_set_task(struct task_struct *p)
8096 -+{}
8097 -diff --git a/kernel/sched/alt_sched.h b/kernel/sched/alt_sched.h
8098 -new file mode 100644
8099 -index 000000000000..289058a09bd5
8100 ---- /dev/null
8101 -+++ b/kernel/sched/alt_sched.h
8102 -@@ -0,0 +1,666 @@
8103 -+#ifndef ALT_SCHED_H
8104 -+#define ALT_SCHED_H
8105 -+
8106 -+#include <linux/sched.h>
8107 -+
8108 -+#include <linux/sched/clock.h>
8109 -+#include <linux/sched/cpufreq.h>
8110 -+#include <linux/sched/cputime.h>
8111 -+#include <linux/sched/debug.h>
8112 -+#include <linux/sched/init.h>
8113 -+#include <linux/sched/isolation.h>
8114 -+#include <linux/sched/loadavg.h>
8115 -+#include <linux/sched/mm.h>
8116 -+#include <linux/sched/nohz.h>
8117 -+#include <linux/sched/signal.h>
8118 -+#include <linux/sched/stat.h>
8119 -+#include <linux/sched/sysctl.h>
8120 -+#include <linux/sched/task.h>
8121 -+#include <linux/sched/topology.h>
8122 -+#include <linux/sched/wake_q.h>
8123 -+
8124 -+#include <uapi/linux/sched/types.h>
8125 -+
8126 -+#include <linux/cgroup.h>
8127 -+#include <linux/cpufreq.h>
8128 -+#include <linux/cpuidle.h>
8129 -+#include <linux/cpuset.h>
8130 -+#include <linux/ctype.h>
8131 -+#include <linux/debugfs.h>
8132 -+#include <linux/kthread.h>
8133 -+#include <linux/livepatch.h>
8134 -+#include <linux/membarrier.h>
8135 -+#include <linux/proc_fs.h>
8136 -+#include <linux/psi.h>
8137 -+#include <linux/slab.h>
8138 -+#include <linux/stop_machine.h>
8139 -+#include <linux/suspend.h>
8140 -+#include <linux/swait.h>
8141 -+#include <linux/syscalls.h>
8142 -+#include <linux/tsacct_kern.h>
8143 -+
8144 -+#include <asm/tlb.h>
8145 -+
8146 -+#ifdef CONFIG_PARAVIRT
8147 -+# include <asm/paravirt.h>
8148 -+#endif
8149 -+
8150 -+#include "cpupri.h"
8151 -+
8152 -+#include <trace/events/sched.h>
8153 -+
8154 -+#ifdef CONFIG_SCHED_BMQ
8155 -+/* bits:
8156 -+ * RT(0-99), (Low prio adj range, nice width, high prio adj range) / 2, cpu idle task */
8157 -+#define SCHED_BITS (MAX_RT_PRIO + NICE_WIDTH / 2 + MAX_PRIORITY_ADJ + 1)
8158 -+#endif
8159 -+
8160 -+#ifdef CONFIG_SCHED_PDS
8161 -+/* bits: RT(0-99), reserved(100-127), NORMAL_PRIO_NUM, cpu idle task */
8162 -+#define SCHED_BITS (MIN_NORMAL_PRIO + NORMAL_PRIO_NUM + 1)
8163 -+#endif /* CONFIG_SCHED_PDS */
8164 -+
8165 -+#define IDLE_TASK_SCHED_PRIO (SCHED_BITS - 1)
8166 -+
8167 -+#ifdef CONFIG_SCHED_DEBUG
8168 -+# define SCHED_WARN_ON(x) WARN_ONCE(x, #x)
8169 -+extern void resched_latency_warn(int cpu, u64 latency);
8170 -+#else
8171 -+# define SCHED_WARN_ON(x) ({ (void)(x), 0; })
8172 -+static inline void resched_latency_warn(int cpu, u64 latency) {}
8173 -+#endif
8174 -+
8175 -+/*
8176 -+ * Increase resolution of nice-level calculations for 64-bit architectures.
8177 -+ * The extra resolution improves shares distribution and load balancing of
8178 -+ * low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
8179 -+ * hierarchies, especially on larger systems. This is not a user-visible change
8180 -+ * and does not change the user-interface for setting shares/weights.
8181 -+ *
8182 -+ * We increase resolution only if we have enough bits to allow this increased
8183 -+ * resolution (i.e. 64-bit). The costs for increasing resolution when 32-bit
8184 -+ * are pretty high and the returns do not justify the increased costs.
8185 -+ *
8186 -+ * Really only required when CONFIG_FAIR_GROUP_SCHED=y is also set, but to
8187 -+ * increase coverage and consistency always enable it on 64-bit platforms.
8188 -+ */
8189 -+#ifdef CONFIG_64BIT
8190 -+# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
8191 -+# define scale_load(w) ((w) << SCHED_FIXEDPOINT_SHIFT)
8192 -+# define scale_load_down(w) \
8193 -+({ \
8194 -+ unsigned long __w = (w); \
8195 -+ if (__w) \
8196 -+ __w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT); \
8197 -+ __w; \
8198 -+})
8199 -+#else
8200 -+# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT)
8201 -+# define scale_load(w) (w)
8202 -+# define scale_load_down(w) (w)
8203 -+#endif
8204 -+
8205 -+#ifdef CONFIG_FAIR_GROUP_SCHED
8206 -+#define ROOT_TASK_GROUP_LOAD NICE_0_LOAD
8207 -+
8208 -+/*
8209 -+ * A weight of 0 or 1 can cause arithmetic problems.
8210 -+ * A weight of a cfs_rq is the sum of the weights of the entities
8211 -+ * queued on this cfs_rq, so the weight of an entity should not be
8212 -+ * too large, and neither should the shares value of a task group.
8213 -+ * (The default weight is 1024 - so there's no practical
8214 -+ * limitation from this.)
8215 -+ */
8216 -+#define MIN_SHARES (1UL << 1)
8217 -+#define MAX_SHARES (1UL << 18)
8218 -+#endif
8219 -+
8220 -+/* task_struct::on_rq states: */
8221 -+#define TASK_ON_RQ_QUEUED 1
8222 -+#define TASK_ON_RQ_MIGRATING 2
8223 -+
8224 -+static inline int task_on_rq_queued(struct task_struct *p)
8225 -+{
8226 -+ return p->on_rq == TASK_ON_RQ_QUEUED;
8227 -+}
8228 -+
8229 -+static inline int task_on_rq_migrating(struct task_struct *p)
8230 -+{
8231 -+ return READ_ONCE(p->on_rq) == TASK_ON_RQ_MIGRATING;
8232 -+}
8233 -+
8234 -+/*
8235 -+ * wake flags
8236 -+ */
8237 -+#define WF_SYNC 0x01 /* waker goes to sleep after wakeup */
8238 -+#define WF_FORK 0x02 /* child wakeup after fork */
8239 -+#define WF_MIGRATED 0x04 /* internal use, task got migrated */
8240 -+#define WF_ON_CPU 0x08 /* Wakee is on_rq */
8241 -+
8242 -+#define SCHED_QUEUE_BITS (SCHED_BITS - 1)
8243 -+
8244 -+struct sched_queue {
8245 -+ DECLARE_BITMAP(bitmap, SCHED_QUEUE_BITS);
8246 -+ struct list_head heads[SCHED_BITS];
8247 -+};
8248 -+
8249 -+/*
8250 -+ * This is the main, per-CPU runqueue data structure.
8251 -+ * This data should only be modified by the local cpu.
8252 -+ */
8253 -+struct rq {
8254 -+ /* runqueue lock: */
8255 -+ raw_spinlock_t lock;
8256 -+
8257 -+ struct task_struct __rcu *curr;
8258 -+ struct task_struct *idle, *stop, *skip;
8259 -+ struct mm_struct *prev_mm;
8260 -+
8261 -+ struct sched_queue queue;
8262 -+#ifdef CONFIG_SCHED_PDS
8263 -+ u64 time_edge;
8264 -+#endif
8265 -+ unsigned long watermark;
8266 -+
8267 -+ /* switch count */
8268 -+ u64 nr_switches;
8269 -+
8270 -+ atomic_t nr_iowait;
8271 -+
8272 -+#ifdef CONFIG_SCHED_DEBUG
8273 -+ u64 last_seen_need_resched_ns;
8274 -+ int ticks_without_resched;
8275 -+#endif
8276 -+
8277 -+#ifdef CONFIG_MEMBARRIER
8278 -+ int membarrier_state;
8279 -+#endif
8280 -+
8281 -+#ifdef CONFIG_SMP
8282 -+ int cpu; /* cpu of this runqueue */
8283 -+ bool online;
8284 -+
8285 -+ unsigned int ttwu_pending;
8286 -+ unsigned char nohz_idle_balance;
8287 -+ unsigned char idle_balance;
8288 -+
8289 -+#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
8290 -+ struct sched_avg avg_irq;
8291 -+#endif
8292 -+
8293 -+#ifdef CONFIG_SCHED_SMT
8294 -+ int active_balance;
8295 -+ struct cpu_stop_work active_balance_work;
8296 -+#endif
8297 -+ struct callback_head *balance_callback;
8298 -+#ifdef CONFIG_HOTPLUG_CPU
8299 -+ struct rcuwait hotplug_wait;
8300 -+#endif
8301 -+ unsigned int nr_pinned;
8302 -+
8303 -+#endif /* CONFIG_SMP */
8304 -+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
8305 -+ u64 prev_irq_time;
8306 -+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
8307 -+#ifdef CONFIG_PARAVIRT
8308 -+ u64 prev_steal_time;
8309 -+#endif /* CONFIG_PARAVIRT */
8310 -+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
8311 -+ u64 prev_steal_time_rq;
8312 -+#endif /* CONFIG_PARAVIRT_TIME_ACCOUNTING */
8313 -+
8314 -+ /* For general cpu load util */
8315 -+ s32 load_history;
8316 -+ u64 load_block;
8317 -+ u64 load_stamp;
8318 -+
8319 -+ /* calc_load related fields */
8320 -+ unsigned long calc_load_update;
8321 -+ long calc_load_active;
8322 -+
8323 -+ u64 clock, last_tick;
8324 -+ u64 last_ts_switch;
8325 -+ u64 clock_task;
8326 -+
8327 -+ unsigned int nr_running;
8328 -+ unsigned long nr_uninterruptible;
8329 -+
8330 -+#ifdef CONFIG_SCHED_HRTICK
8331 -+#ifdef CONFIG_SMP
8332 -+ call_single_data_t hrtick_csd;
8333 -+#endif
8334 -+ struct hrtimer hrtick_timer;
8335 -+ ktime_t hrtick_time;
8336 -+#endif
8337 -+
8338 -+#ifdef CONFIG_SCHEDSTATS
8339 -+
8340 -+ /* latency stats */
8341 -+ struct sched_info rq_sched_info;
8342 -+ unsigned long long rq_cpu_time;
8343 -+ /* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */
8344 -+
8345 -+ /* sys_sched_yield() stats */
8346 -+ unsigned int yld_count;
8347 -+
8348 -+ /* schedule() stats */
8349 -+ unsigned int sched_switch;
8350 -+ unsigned int sched_count;
8351 -+ unsigned int sched_goidle;
8352 -+
8353 -+ /* try_to_wake_up() stats */
8354 -+ unsigned int ttwu_count;
8355 -+ unsigned int ttwu_local;
8356 -+#endif /* CONFIG_SCHEDSTATS */
8357 -+
8358 -+#ifdef CONFIG_CPU_IDLE
8359 -+ /* Must be inspected within a rcu lock section */
8360 -+ struct cpuidle_state *idle_state;
8361 -+#endif
8362 -+
8363 -+#ifdef CONFIG_NO_HZ_COMMON
8364 -+#ifdef CONFIG_SMP
8365 -+ call_single_data_t nohz_csd;
8366 -+#endif
8367 -+ atomic_t nohz_flags;
8368 -+#endif /* CONFIG_NO_HZ_COMMON */
8369 -+};
8370 -+
8371 -+extern unsigned long rq_load_util(struct rq *rq, unsigned long max);
8372 -+
8373 -+extern unsigned long calc_load_update;
8374 -+extern atomic_long_t calc_load_tasks;
8375 -+
8376 -+extern void calc_global_load_tick(struct rq *this_rq);
8377 -+extern long calc_load_fold_active(struct rq *this_rq, long adjust);
8378 -+
8379 -+DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
8380 -+#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))
8381 -+#define this_rq() this_cpu_ptr(&runqueues)
8382 -+#define task_rq(p) cpu_rq(task_cpu(p))
8383 -+#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
8384 -+#define raw_rq() raw_cpu_ptr(&runqueues)
8385 -+
8386 -+#ifdef CONFIG_SMP
8387 -+#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
8388 -+void register_sched_domain_sysctl(void);
8389 -+void unregister_sched_domain_sysctl(void);
8390 -+#else
8391 -+static inline void register_sched_domain_sysctl(void)
8392 -+{
8393 -+}
8394 -+static inline void unregister_sched_domain_sysctl(void)
8395 -+{
8396 -+}
8397 -+#endif
8398 -+
8399 -+extern bool sched_smp_initialized;
8400 -+
8401 -+enum {
8402 -+ ITSELF_LEVEL_SPACE_HOLDER,
8403 -+#ifdef CONFIG_SCHED_SMT
8404 -+ SMT_LEVEL_SPACE_HOLDER,
8405 -+#endif
8406 -+ COREGROUP_LEVEL_SPACE_HOLDER,
8407 -+ CORE_LEVEL_SPACE_HOLDER,
8408 -+ OTHER_LEVEL_SPACE_HOLDER,
8409 -+ NR_CPU_AFFINITY_LEVELS
8410 -+};
8411 -+
8412 -+DECLARE_PER_CPU(cpumask_t [NR_CPU_AFFINITY_LEVELS], sched_cpu_topo_masks);
8413 -+DECLARE_PER_CPU(cpumask_t *, sched_cpu_llc_mask);
8414 -+
8415 -+static inline int
8416 -+__best_mask_cpu(const cpumask_t *cpumask, const cpumask_t *mask)
8417 -+{
8418 -+ int cpu;
8419 -+
8420 -+ while ((cpu = cpumask_any_and(cpumask, mask)) >= nr_cpu_ids)
8421 -+ mask++;
8422 -+
8423 -+ return cpu;
8424 -+}
8425 -+
8426 -+static inline int best_mask_cpu(int cpu, const cpumask_t *mask)
8427 -+{
8428 -+ return __best_mask_cpu(mask, per_cpu(sched_cpu_topo_masks, cpu));
8429 -+}
8430 -+
8431 -+extern void flush_smp_call_function_from_idle(void);
8432 -+
8433 -+#else /* !CONFIG_SMP */
8434 -+static inline void flush_smp_call_function_from_idle(void) { }
8435 -+#endif
8436 -+
8437 -+#ifndef arch_scale_freq_tick
8438 -+static __always_inline
8439 -+void arch_scale_freq_tick(void)
8440 -+{
8441 -+}
8442 -+#endif
8443 -+
8444 -+#ifndef arch_scale_freq_capacity
8445 -+static __always_inline
8446 -+unsigned long arch_scale_freq_capacity(int cpu)
8447 -+{
8448 -+ return SCHED_CAPACITY_SCALE;
8449 -+}
8450 -+#endif
8451 -+
8452 -+static inline u64 __rq_clock_broken(struct rq *rq)
8453 -+{
8454 -+ return READ_ONCE(rq->clock);
8455 -+}
8456 -+
8457 -+static inline u64 rq_clock(struct rq *rq)
8458 -+{
8459 -+ /*
8460 -+ * Relax lockdep_assert_held() checking as in VRQ, callers of
8461 -+ * sched_info_xxxx() may not hold rq->lock
8462 -+ * lockdep_assert_held(&rq->lock);
8463 -+ */
8464 -+ return rq->clock;
8465 -+}
8466 -+
8467 -+static inline u64 rq_clock_task(struct rq *rq)
8468 -+{
8469 -+ /*
8470 -+ * Relax lockdep_assert_held() checking as in VRQ, callers of
8471 -+ * sched_info_xxxx() may not hold rq->lock
8472 -+ * lockdep_assert_held(&rq->lock);
8473 -+ */
8474 -+ return rq->clock_task;
8475 -+}
8476 -+
8477 -+/*
8478 -+ * {de,en}queue flags:
8479 -+ *
8480 -+ * DEQUEUE_SLEEP - task is no longer runnable
8481 -+ * ENQUEUE_WAKEUP - task just became runnable
8482 -+ *
8483 -+ */
8484 -+
8485 -+#define DEQUEUE_SLEEP 0x01
8486 -+
8487 -+#define ENQUEUE_WAKEUP 0x01
8488 -+
8489 -+
8490 -+/*
8491 -+ * Below are scheduler APIs which are used in other kernel code.
8492 -+ * They use the dummy rq_flags.
8493 -+ * ToDo : BMQ needs to support these APIs for compatibility with mainline
8494 -+ * scheduler code.
8495 -+ */
8496 -+struct rq_flags {
8497 -+ unsigned long flags;
8498 -+};
8499 -+
8500 -+struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
8501 -+ __acquires(rq->lock);
8502 -+
8503 -+struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
8504 -+ __acquires(p->pi_lock)
8505 -+ __acquires(rq->lock);
8506 -+
8507 -+static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
8508 -+ __releases(rq->lock)
8509 -+{
8510 -+ raw_spin_unlock(&rq->lock);
8511 -+}
8512 -+
8513 -+static inline void
8514 -+task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
8515 -+ __releases(rq->lock)
8516 -+ __releases(p->pi_lock)
8517 -+{
8518 -+ raw_spin_unlock(&rq->lock);
8519 -+ raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
8520 -+}
8521 -+
8522 -+static inline void
8523 -+rq_lock(struct rq *rq, struct rq_flags *rf)
8524 -+ __acquires(rq->lock)
8525 -+{
8526 -+ raw_spin_lock(&rq->lock);
8527 -+}
8528 -+
8529 -+static inline void
8530 -+rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
8531 -+ __releases(rq->lock)
8532 -+{
8533 -+ raw_spin_unlock_irq(&rq->lock);
8534 -+}
8535 -+
8536 -+static inline void
8537 -+rq_unlock(struct rq *rq, struct rq_flags *rf)
8538 -+ __releases(rq->lock)
8539 -+{
8540 -+ raw_spin_unlock(&rq->lock);
8541 -+}
8542 -+
8543 -+static inline struct rq *
8544 -+this_rq_lock_irq(struct rq_flags *rf)
8545 -+ __acquires(rq->lock)
8546 -+{
8547 -+ struct rq *rq;
8548 -+
8549 -+ local_irq_disable();
8550 -+ rq = this_rq();
8551 -+ raw_spin_lock(&rq->lock);
8552 -+
8553 -+ return rq;
8554 -+}
8555 -+
8556 -+extern void raw_spin_rq_lock_nested(struct rq *rq, int subclass);
8557 -+extern void raw_spin_rq_unlock(struct rq *rq);
8558 -+
8559 -+static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
8560 -+{
8561 -+ return &rq->lock;
8562 -+}
8563 -+
8564 -+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
8565 -+{
8566 -+ return __rq_lockp(rq);
8567 -+}
8568 -+
8569 -+static inline void raw_spin_rq_lock(struct rq *rq)
8570 -+{
8571 -+ raw_spin_rq_lock_nested(rq, 0);
8572 -+}
8573 -+
8574 -+static inline void raw_spin_rq_lock_irq(struct rq *rq)
8575 -+{
8576 -+ local_irq_disable();
8577 -+ raw_spin_rq_lock(rq);
8578 -+}
8579 -+
8580 -+static inline void raw_spin_rq_unlock_irq(struct rq *rq)
8581 -+{
8582 -+ raw_spin_rq_unlock(rq);
8583 -+ local_irq_enable();
8584 -+}
8585 -+
8586 -+static inline int task_current(struct rq *rq, struct task_struct *p)
8587 -+{
8588 -+ return rq->curr == p;
8589 -+}
8590 -+
8591 -+static inline bool task_running(struct task_struct *p)
8592 -+{
8593 -+ return p->on_cpu;
8594 -+}
8595 -+
8596 -+extern int task_running_nice(struct task_struct *p);
8597 -+
8598 -+extern struct static_key_false sched_schedstats;
8599 -+
8600 -+#ifdef CONFIG_CPU_IDLE
8601 -+static inline void idle_set_state(struct rq *rq,
8602 -+ struct cpuidle_state *idle_state)
8603 -+{
8604 -+ rq->idle_state = idle_state;
8605 -+}
8606 -+
8607 -+static inline struct cpuidle_state *idle_get_state(struct rq *rq)
8608 -+{
8609 -+ WARN_ON(!rcu_read_lock_held());
8610 -+ return rq->idle_state;
8611 -+}
8612 -+#else
8613 -+static inline void idle_set_state(struct rq *rq,
8614 -+ struct cpuidle_state *idle_state)
8615 -+{
8616 -+}
8617 -+
8618 -+static inline struct cpuidle_state *idle_get_state(struct rq *rq)
8619 -+{
8620 -+ return NULL;
8621 -+}
8622 -+#endif
8623 -+
8624 -+static inline int cpu_of(const struct rq *rq)
8625 -+{
8626 -+#ifdef CONFIG_SMP
8627 -+ return rq->cpu;
8628 -+#else
8629 -+ return 0;
8630 -+#endif
8631 -+}
8632 -+
8633 -+#include "stats.h"
8634 -+
8635 -+#ifdef CONFIG_NO_HZ_COMMON
8636 -+#define NOHZ_BALANCE_KICK_BIT 0
8637 -+#define NOHZ_STATS_KICK_BIT 1
8638 -+
8639 -+#define NOHZ_BALANCE_KICK BIT(NOHZ_BALANCE_KICK_BIT)
8640 -+#define NOHZ_STATS_KICK BIT(NOHZ_STATS_KICK_BIT)
8641 -+
8642 -+#define NOHZ_KICK_MASK (NOHZ_BALANCE_KICK | NOHZ_STATS_KICK)
8643 -+
8644 -+#define nohz_flags(cpu) (&cpu_rq(cpu)->nohz_flags)
8645 -+
8646 -+/* TODO: needed?
8647 -+extern void nohz_balance_exit_idle(struct rq *rq);
8648 -+#else
8649 -+static inline void nohz_balance_exit_idle(struct rq *rq) { }
8650 -+*/
8651 -+#endif
8652 -+
8653 -+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
8654 -+struct irqtime {
8655 -+ u64 total;
8656 -+ u64 tick_delta;
8657 -+ u64 irq_start_time;
8658 -+ struct u64_stats_sync sync;
8659 -+};
8660 -+
8661 -+DECLARE_PER_CPU(struct irqtime, cpu_irqtime);
8662 -+
8663 -+/*
8664 -+ * Returns the irqtime minus the softirq time computed by ksoftirqd.
8665 -+ * Otherwise ksoftirqd's sum_exec_runtime is subtracted its own runtime
8666 -+ * and never moves forward.
8667 -+ */
8668 -+static inline u64 irq_time_read(int cpu)
8669 -+{
8670 -+ struct irqtime *irqtime = &per_cpu(cpu_irqtime, cpu);
8671 -+ unsigned int seq;
8672 -+ u64 total;
8673 -+
8674 -+ do {
8675 -+ seq = __u64_stats_fetch_begin(&irqtime->sync);
8676 -+ total = irqtime->total;
8677 -+ } while (__u64_stats_fetch_retry(&irqtime->sync, seq));
8678 -+
8679 -+ return total;
8680 -+}
8681 -+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
8682 -+
8683 -+#ifdef CONFIG_CPU_FREQ
8684 -+DECLARE_PER_CPU(struct update_util_data __rcu *, cpufreq_update_util_data);
8685 -+#endif /* CONFIG_CPU_FREQ */
8686 -+
8687 -+#ifdef CONFIG_NO_HZ_FULL
8688 -+extern int __init sched_tick_offload_init(void);
8689 -+#else
8690 -+static inline int sched_tick_offload_init(void) { return 0; }
8691 -+#endif
8692 -+
8693 -+#ifdef arch_scale_freq_capacity
8694 -+#ifndef arch_scale_freq_invariant
8695 -+#define arch_scale_freq_invariant() (true)
8696 -+#endif
8697 -+#else /* arch_scale_freq_capacity */
8698 -+#define arch_scale_freq_invariant() (false)
8699 -+#endif
8700 -+
8701 -+extern void schedule_idle(void);
8702 -+
8703 -+#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
8704 -+
8705 -+/*
8706 -+ * !! For sched_setattr_nocheck() (kernel) only !!
8707 -+ *
8708 -+ * This is actually gross. :(
8709 -+ *
8710 -+ * It is used to make schedutil kworker(s) higher priority than SCHED_DEADLINE
8711 -+ * tasks, but still be able to sleep. We need this on platforms that cannot
8712 -+ * atomically change clock frequency. Remove once fast switching will be
8713 -+ * available on such platforms.
8714 -+ *
8715 -+ * SUGOV stands for SchedUtil GOVernor.
8716 -+ */
8717 -+#define SCHED_FLAG_SUGOV 0x10000000
8718 -+
8719 -+#ifdef CONFIG_MEMBARRIER
8720 -+/*
8721 -+ * The scheduler provides memory barriers required by membarrier between:
8722 -+ * - prior user-space memory accesses and store to rq->membarrier_state,
8723 -+ * - store to rq->membarrier_state and following user-space memory accesses.
8724 -+ * In the same way it provides those guarantees around store to rq->curr.
8725 -+ */
8726 -+static inline void membarrier_switch_mm(struct rq *rq,
8727 -+ struct mm_struct *prev_mm,
8728 -+ struct mm_struct *next_mm)
8729 -+{
8730 -+ int membarrier_state;
8731 -+
8732 -+ if (prev_mm == next_mm)
8733 -+ return;
8734 -+
8735 -+ membarrier_state = atomic_read(&next_mm->membarrier_state);
8736 -+ if (READ_ONCE(rq->membarrier_state) == membarrier_state)
8737 -+ return;
8738 -+
8739 -+ WRITE_ONCE(rq->membarrier_state, membarrier_state);
8740 -+}
8741 -+#else
8742 -+static inline void membarrier_switch_mm(struct rq *rq,
8743 -+ struct mm_struct *prev_mm,
8744 -+ struct mm_struct *next_mm)
8745 -+{
8746 -+}
8747 -+#endif
8748 -+
8749 -+#ifdef CONFIG_NUMA
8750 -+extern int sched_numa_find_closest(const struct cpumask *cpus, int cpu);
8751 -+#else
8752 -+static inline int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
8753 -+{
8754 -+ return nr_cpu_ids;
8755 -+}
8756 -+#endif
8757 -+
8758 -+extern void swake_up_all_locked(struct swait_queue_head *q);
8759 -+extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
8760 -+
8761 -+#ifdef CONFIG_PREEMPT_DYNAMIC
8762 -+extern int preempt_dynamic_mode;
8763 -+extern int sched_dynamic_mode(const char *str);
8764 -+extern void sched_dynamic_update(int mode);
8765 -+#endif
8766 -+
8767 -+static inline void nohz_run_idle_balance(int cpu) { }
8768 -+#endif /* ALT_SCHED_H */
8769 -diff --git a/kernel/sched/bmq.h b/kernel/sched/bmq.h
8770 -new file mode 100644
8771 -index 000000000000..be3ee4a553ca
8772 ---- /dev/null
8773 -+++ b/kernel/sched/bmq.h
8774 -@@ -0,0 +1,111 @@
8775 -+#define ALT_SCHED_VERSION_MSG "sched/bmq: BMQ CPU Scheduler "ALT_SCHED_VERSION" by Alfred Chen.\n"
8776 -+
8777 -+/*
8778 -+ * BMQ only routines
8779 -+ */
8780 -+#define rq_switch_time(rq) ((rq)->clock - (rq)->last_ts_switch)
8781 -+#define boost_threshold(p) (sched_timeslice_ns >>\
8782 -+ (15 - MAX_PRIORITY_ADJ - (p)->boost_prio))
8783 -+
8784 -+static inline void boost_task(struct task_struct *p)
8785 -+{
8786 -+ int limit;
8787 -+
8788 -+ switch (p->policy) {
8789 -+ case SCHED_NORMAL:
8790 -+ limit = -MAX_PRIORITY_ADJ;
8791 -+ break;
8792 -+ case SCHED_BATCH:
8793 -+ case SCHED_IDLE:
8794 -+ limit = 0;
8795 -+ break;
8796 -+ default:
8797 -+ return;
8798 -+ }
8799 -+
8800 -+ if (p->boost_prio > limit)
8801 -+ p->boost_prio--;
8802 -+}
8803 -+
8804 -+static inline void deboost_task(struct task_struct *p)
8805 -+{
8806 -+ if (p->boost_prio < MAX_PRIORITY_ADJ)
8807 -+ p->boost_prio++;
8808 -+}
8809 -+
8810 -+/*
8811 -+ * Common interfaces
8812 -+ */
8813 -+static inline void sched_timeslice_imp(const int timeslice_ms) {}
8814 -+
8815 -+static inline int
8816 -+task_sched_prio_normal(const struct task_struct *p, const struct rq *rq)
8817 -+{
8818 -+ return p->prio + p->boost_prio - MAX_RT_PRIO;
8819 -+}
8820 -+
8821 -+static inline int task_sched_prio(const struct task_struct *p)
8822 -+{
8823 -+ return (p->prio < MAX_RT_PRIO)? p->prio : MAX_RT_PRIO / 2 + (p->prio + p->boost_prio) / 2;
8824 -+}
8825 -+
8826 -+static inline int
8827 -+task_sched_prio_idx(const struct task_struct *p, const struct rq *rq)
8828 -+{
8829 -+ return task_sched_prio(p);
8830 -+}
8831 -+
8832 -+static inline int sched_prio2idx(int prio, struct rq *rq)
8833 -+{
8834 -+ return prio;
8835 -+}
8836 -+
8837 -+static inline int sched_idx2prio(int idx, struct rq *rq)
8838 -+{
8839 -+ return idx;
8840 -+}
8841 -+
8842 -+static inline void time_slice_expired(struct task_struct *p, struct rq *rq)
8843 -+{
8844 -+ p->time_slice = sched_timeslice_ns;
8845 -+
8846 -+ if (SCHED_FIFO != p->policy && task_on_rq_queued(p)) {
8847 -+ if (SCHED_RR != p->policy)
8848 -+ deboost_task(p);
8849 -+ requeue_task(p, rq);
8850 -+ }
8851 -+}
8852 -+
8853 -+static inline void sched_task_sanity_check(struct task_struct *p, struct rq *rq) {}
8854 -+
8855 -+inline int task_running_nice(struct task_struct *p)
8856 -+{
8857 -+ return (p->prio + p->boost_prio > DEFAULT_PRIO + MAX_PRIORITY_ADJ);
8858 -+}
8859 -+
8860 -+static void sched_task_fork(struct task_struct *p, struct rq *rq)
8861 -+{
8862 -+ p->boost_prio = (p->boost_prio < 0) ?
8863 -+ p->boost_prio + MAX_PRIORITY_ADJ : MAX_PRIORITY_ADJ;
8864 -+}
8865 -+
8866 -+static inline void do_sched_yield_type_1(struct task_struct *p, struct rq *rq)
8867 -+{
8868 -+ p->boost_prio = MAX_PRIORITY_ADJ;
8869 -+}
8870 -+
8871 -+#ifdef CONFIG_SMP
8872 -+static inline void sched_task_ttwu(struct task_struct *p)
8873 -+{
8874 -+ if(this_rq()->clock_task - p->last_ran > sched_timeslice_ns)
8875 -+ boost_task(p);
8876 -+}
8877 -+#endif
8878 -+
8879 -+static inline void sched_task_deactivate(struct task_struct *p, struct rq *rq)
8880 -+{
8881 -+ if (rq_switch_time(rq) < boost_threshold(p))
8882 -+ boost_task(p);
8883 -+}
8884 -+
8885 -+static inline void update_rq_time_edge(struct rq *rq) {}
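For reference, the mapping above folds prio and boost_prio into a single queue
index: realtime tasks keep their priority, everything else lands in the upper
half of the range shifted by the boost. A standalone sketch of
task_sched_prio() with assumed constants (MAX_RT_PRIO = 100 matches the
mainline default; MAX_PRIORITY_ADJ = 4 is illustrative, not taken from this
patch):

#include <stdio.h>

#define MAX_RT_PRIO		100	/* prio 0..99 are realtime */
#define MAX_PRIORITY_ADJ	4	/* assumed boost range */

/* Mirrors task_sched_prio() in bmq.h above. */
static int task_sched_prio(int prio, int boost_prio)
{
	return (prio < MAX_RT_PRIO) ? prio
				    : MAX_RT_PRIO / 2 + (prio + boost_prio) / 2;
}

int main(void)
{
	/* nice 0 corresponds to prio 120; boosting sorts a task earlier. */
	printf("nice 0, boost -4 -> %d\n", task_sched_prio(120, -MAX_PRIORITY_ADJ));
	printf("nice 0, boost  0 -> %d\n", task_sched_prio(120, 0));
	printf("nice 0, boost +4 -> %d\n", task_sched_prio(120, MAX_PRIORITY_ADJ));
	printf("RT prio 10       -> %d\n", task_sched_prio(10, 0));
	return 0;
}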
8886 -diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
8887 -index 57124614363d..f0e9c7543542 100644
8888 ---- a/kernel/sched/cpufreq_schedutil.c
8889 -+++ b/kernel/sched/cpufreq_schedutil.c
8890 -@@ -167,9 +167,14 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
8891 - unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
8892 -
8893 - sg_cpu->max = max;
8894 -+#ifndef CONFIG_SCHED_ALT
8895 - sg_cpu->bw_dl = cpu_bw_dl(rq);
8896 - sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(rq), max,
8897 - FREQUENCY_UTIL, NULL);
8898 -+#else
8899 -+ sg_cpu->bw_dl = 0;
8900 -+ sg_cpu->util = rq_load_util(rq, max);
8901 -+#endif /* CONFIG_SCHED_ALT */
8902 - }
8903 -
8904 - /**
8905 -@@ -312,8 +317,10 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
8906 - */
8907 - static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu)
8908 - {
8909 -+#ifndef CONFIG_SCHED_ALT
8910 - if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
8911 - sg_cpu->sg_policy->limits_changed = true;
8912 -+#endif
8913 - }
8914 -
8915 - static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
8916 -@@ -599,6 +606,7 @@ static int sugov_kthread_create(struct sugov_policy *sg_policy)
8917 - }
8918 -
8919 - ret = sched_setattr_nocheck(thread, &attr);
8920 -+
8921 - if (ret) {
8922 - kthread_stop(thread);
8923 - pr_warn("%s: failed to set SCHED_DEADLINE\n", __func__);
8924 -@@ -833,7 +841,9 @@ cpufreq_governor_init(schedutil_gov);
8925 - #ifdef CONFIG_ENERGY_MODEL
8926 - static void rebuild_sd_workfn(struct work_struct *work)
8927 - {
8928 -+#ifndef CONFIG_SCHED_ALT
8929 - rebuild_sched_domains_energy();
8930 -+#endif /* CONFIG_SCHED_ALT */
8931 - }
8932 - static DECLARE_WORK(rebuild_sd_work, rebuild_sd_workfn);
8933 -
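With CONFIG_SCHED_ALT the deadline bandwidth term is dropped and sg_cpu->util
comes from rq_load_util() instead of effective_cpu_util(); either way
schedutil still converts utilization into a frequency request with the usual
25% headroom. A simplified sketch of that conversion, mirroring upstream's
get_next_freq()/map_util_freq() (which is not part of this hunk), with
made-up capacity and clock numbers:

#include <stdio.h>

/* next_freq = 1.25 * max_freq * util / max_capacity,
 * written in integer math as freq + (freq >> 2). */
static unsigned long next_freq(unsigned long util, unsigned long max_cap,
			       unsigned long max_freq_khz)
{
	unsigned long freq = max_freq_khz + (max_freq_khz >> 2);

	return freq * util / max_cap;
}

int main(void)
{
	/* A CPU with capacity 1024 and a 3.0 GHz max clock. */
	printf("util  512/1024 -> %lu kHz\n", next_freq(512, 1024, 3000000));
	printf("util 1024/1024 -> %lu kHz\n", next_freq(1024, 1024, 3000000));
	return 0;
}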
8934 -diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
8935 -index 872e481d5098..f920c8b48ec1 100644
8936 ---- a/kernel/sched/cputime.c
8937 -+++ b/kernel/sched/cputime.c
8938 -@@ -123,7 +123,7 @@ void account_user_time(struct task_struct *p, u64 cputime)
8939 - p->utime += cputime;
8940 - account_group_user_time(p, cputime);
8941 -
8942 -- index = (task_nice(p) > 0) ? CPUTIME_NICE : CPUTIME_USER;
8943 -+ index = task_running_nice(p) ? CPUTIME_NICE : CPUTIME_USER;
8944 -
8945 - /* Add user time to cpustat. */
8946 - task_group_account_field(p, index, cputime);
8947 -@@ -147,7 +147,7 @@ void account_guest_time(struct task_struct *p, u64 cputime)
8948 - p->gtime += cputime;
8949 -
8950 - /* Add guest time to cpustat. */
8951 -- if (task_nice(p) > 0) {
8952 -+ if (task_running_nice(p)) {
8953 - cpustat[CPUTIME_NICE] += cputime;
8954 - cpustat[CPUTIME_GUEST_NICE] += cputime;
8955 - } else {
8956 -@@ -270,7 +270,7 @@ static inline u64 account_other_time(u64 max)
8957 - #ifdef CONFIG_64BIT
8958 - static inline u64 read_sum_exec_runtime(struct task_struct *t)
8959 - {
8960 -- return t->se.sum_exec_runtime;
8961 -+ return tsk_seruntime(t);
8962 - }
8963 - #else
8964 - static u64 read_sum_exec_runtime(struct task_struct *t)
8965 -@@ -280,7 +280,7 @@ static u64 read_sum_exec_runtime(struct task_struct *t)
8966 - struct rq *rq;
8967 -
8968 - rq = task_rq_lock(t, &rf);
8969 -- ns = t->se.sum_exec_runtime;
8970 -+ ns = tsk_seruntime(t);
8971 - task_rq_unlock(rq, t, &rf);
8972 -
8973 - return ns;
8974 -@@ -612,7 +612,7 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
8975 - void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
8976 - {
8977 - struct task_cputime cputime = {
8978 -- .sum_exec_runtime = p->se.sum_exec_runtime,
8979 -+ .sum_exec_runtime = tsk_seruntime(p),
8980 - };
8981 -
8982 - task_cputime(p, &cputime.utime, &cputime.stime);
8983 -diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
8984 -index 0c5ec2776ddf..e3f4fe3f6e2c 100644
8985 ---- a/kernel/sched/debug.c
8986 -+++ b/kernel/sched/debug.c
8987 -@@ -8,6 +8,7 @@
8988 - */
8989 - #include "sched.h"
8990 -
8991 -+#ifndef CONFIG_SCHED_ALT
8992 - /*
8993 - * This allows printing both to /proc/sched_debug and
8994 - * to the console
8995 -@@ -210,6 +211,7 @@ static const struct file_operations sched_scaling_fops = {
8996 - };
8997 -
8998 - #endif /* SMP */
8999 -+#endif /* !CONFIG_SCHED_ALT */
9000 -
9001 - #ifdef CONFIG_PREEMPT_DYNAMIC
9002 -
9003 -@@ -273,6 +275,7 @@ static const struct file_operations sched_dynamic_fops = {
9004 -
9005 - #endif /* CONFIG_PREEMPT_DYNAMIC */
9006 -
9007 -+#ifndef CONFIG_SCHED_ALT
9008 - __read_mostly bool sched_debug_verbose;
9009 -
9010 - static const struct seq_operations sched_debug_sops;
9011 -@@ -288,6 +291,7 @@ static const struct file_operations sched_debug_fops = {
9012 - .llseek = seq_lseek,
9013 - .release = seq_release,
9014 - };
9015 -+#endif /* !CONFIG_SCHED_ALT */
9016 -
9017 - static struct dentry *debugfs_sched;
9018 -
9019 -@@ -297,12 +301,15 @@ static __init int sched_init_debug(void)
9020 -
9021 - debugfs_sched = debugfs_create_dir("sched", NULL);
9022 -
9023 -+#ifndef CONFIG_SCHED_ALT
9024 - debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
9025 - debugfs_create_bool("verbose", 0644, debugfs_sched, &sched_debug_verbose);
9026 -+#endif /* !CONFIG_SCHED_ALT */
9027 - #ifdef CONFIG_PREEMPT_DYNAMIC
9028 - debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
9029 - #endif
9030 -
9031 -+#ifndef CONFIG_SCHED_ALT
9032 - debugfs_create_u32("latency_ns", 0644, debugfs_sched, &sysctl_sched_latency);
9033 - debugfs_create_u32("min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_min_granularity);
9034 - debugfs_create_u32("wakeup_granularity_ns", 0644, debugfs_sched, &sysctl_sched_wakeup_granularity);
9035 -@@ -330,11 +337,13 @@ static __init int sched_init_debug(void)
9036 - #endif
9037 -
9038 - debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
9039 -+#endif /* !CONFIG_SCHED_ALT */
9040 -
9041 - return 0;
9042 - }
9043 - late_initcall(sched_init_debug);
9044 -
9045 -+#ifndef CONFIG_SCHED_ALT
9046 - #ifdef CONFIG_SMP
9047 -
9048 - static cpumask_var_t sd_sysctl_cpus;
9049 -@@ -1047,6 +1056,7 @@ void proc_sched_set_task(struct task_struct *p)
9050 - memset(&p->se.statistics, 0, sizeof(p->se.statistics));
9051 - #endif
9052 - }
9053 -+#endif /* !CONFIG_SCHED_ALT */
9054 -
9055 - void resched_latency_warn(int cpu, u64 latency)
9056 - {
9057 -diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
9058 -index 912b47aa99d8..7f6b13883c2a 100644
9059 ---- a/kernel/sched/idle.c
9060 -+++ b/kernel/sched/idle.c
9061 -@@ -403,6 +403,7 @@ void cpu_startup_entry(enum cpuhp_state state)
9062 - do_idle();
9063 - }
9064 -
9065 -+#ifndef CONFIG_SCHED_ALT
9066 - /*
9067 - * idle-task scheduling class.
9068 - */
9069 -@@ -525,3 +526,4 @@ DEFINE_SCHED_CLASS(idle) = {
9070 - .switched_to = switched_to_idle,
9071 - .update_curr = update_curr_idle,
9072 - };
9073 -+#endif
9074 -diff --git a/kernel/sched/pds.h b/kernel/sched/pds.h
9075 -new file mode 100644
9076 -index 000000000000..0f1f0d708b77
9077 ---- /dev/null
9078 -+++ b/kernel/sched/pds.h
9079 -@@ -0,0 +1,127 @@
9080 -+#define ALT_SCHED_VERSION_MSG "sched/pds: PDS CPU Scheduler "ALT_SCHED_VERSION" by Alfred Chen.\n"
9081 -+
9082 -+static int sched_timeslice_shift = 22;
9083 -+
9084 -+#define NORMAL_PRIO_MOD(x) ((x) & (NORMAL_PRIO_NUM - 1))
9085 -+
9086 -+/*
9087 -+ * Common interfaces
9088 -+ */
9089 -+static inline void sched_timeslice_imp(const int timeslice_ms)
9090 -+{
9091 -+ if (2 == timeslice_ms)
9092 -+ sched_timeslice_shift = 21;
9093 -+}
9094 -+
9095 -+static inline int
9096 -+task_sched_prio_normal(const struct task_struct *p, const struct rq *rq)
9097 -+{
9098 -+ s64 delta = p->deadline - rq->time_edge + NORMAL_PRIO_NUM - NICE_WIDTH;
9099 -+
9100 -+ if (WARN_ONCE(delta > NORMAL_PRIO_NUM - 1,
9101 -+ "pds: task_sched_prio_normal() delta %lld\n", delta))
9102 -+ return NORMAL_PRIO_NUM - 1;
9103 -+
9104 -+ return (delta < 0) ? 0 : delta;
9105 -+}
9106 -+
9107 -+static inline int task_sched_prio(const struct task_struct *p)
9108 -+{
9109 -+ return (p->prio < MAX_RT_PRIO) ? p->prio :
9110 -+ MIN_NORMAL_PRIO + task_sched_prio_normal(p, task_rq(p));
9111 -+}
9112 -+
9113 -+static inline int
9114 -+task_sched_prio_idx(const struct task_struct *p, const struct rq *rq)
9115 -+{
9116 -+ return (p->prio < MAX_RT_PRIO) ? p->prio : MIN_NORMAL_PRIO +
9117 -+ NORMAL_PRIO_MOD(task_sched_prio_normal(p, rq) + rq->time_edge);
9118 -+}
9119 -+
9120 -+static inline int sched_prio2idx(int prio, struct rq *rq)
9121 -+{
9122 -+ return (IDLE_TASK_SCHED_PRIO == prio || prio < MAX_RT_PRIO) ? prio :
9123 -+ MIN_NORMAL_PRIO + NORMAL_PRIO_MOD((prio - MIN_NORMAL_PRIO) +
9124 -+ rq->time_edge);
9125 -+}
9126 -+
9127 -+static inline int sched_idx2prio(int idx, struct rq *rq)
9128 -+{
9129 -+ return (idx < MAX_RT_PRIO) ? idx : MIN_NORMAL_PRIO +
9130 -+ NORMAL_PRIO_MOD((idx - MIN_NORMAL_PRIO) + NORMAL_PRIO_NUM -
9131 -+ NORMAL_PRIO_MOD(rq->time_edge));
9132 -+}
9133 -+
9134 -+static inline void sched_renew_deadline(struct task_struct *p, const struct rq *rq)
9135 -+{
9136 -+ if (p->prio >= MAX_RT_PRIO)
9137 -+ p->deadline = (rq->clock >> sched_timeslice_shift) +
9138 -+ p->static_prio - (MAX_PRIO - NICE_WIDTH);
9139 -+}
9140 -+
9141 -+int task_running_nice(struct task_struct *p)
9142 -+{
9143 -+ return (p->prio > DEFAULT_PRIO);
9144 -+}
9145 -+
9146 -+static inline void update_rq_time_edge(struct rq *rq)
9147 -+{
9148 -+ struct list_head head;
9149 -+ u64 old = rq->time_edge;
9150 -+ u64 now = rq->clock >> sched_timeslice_shift;
9151 -+ u64 prio, delta;
9152 -+
9153 -+ if (now == old)
9154 -+ return;
9155 -+
9156 -+ delta = min_t(u64, NORMAL_PRIO_NUM, now - old);
9157 -+ INIT_LIST_HEAD(&head);
9158 -+
9159 -+ for_each_set_bit(prio, &rq->queue.bitmap[2], delta)
9160 -+ list_splice_tail_init(rq->queue.heads + MIN_NORMAL_PRIO +
9161 -+ NORMAL_PRIO_MOD(prio + old), &head);
9162 -+
9163 -+ rq->queue.bitmap[2] = (NORMAL_PRIO_NUM == delta) ? 0UL :
9164 -+ rq->queue.bitmap[2] >> delta;
9165 -+ rq->time_edge = now;
9166 -+ if (!list_empty(&head)) {
9167 -+ u64 idx = MIN_NORMAL_PRIO + NORMAL_PRIO_MOD(now);
9168 -+ struct task_struct *p;
9169 -+
9170 -+ list_for_each_entry(p, &head, sq_node)
9171 -+ p->sq_idx = idx;
9172 -+
9173 -+ list_splice(&head, rq->queue.heads + idx);
9174 -+ rq->queue.bitmap[2] |= 1UL;
9175 -+ }
9176 -+}
9177 -+
9178 -+static inline void time_slice_expired(struct task_struct *p, struct rq *rq)
9179 -+{
9180 -+ p->time_slice = sched_timeslice_ns;
9181 -+ sched_renew_deadline(p, rq);
9182 -+ if (SCHED_FIFO != p->policy && task_on_rq_queued(p))
9183 -+ requeue_task(p, rq);
9184 -+}
9185 -+
9186 -+static inline void sched_task_sanity_check(struct task_struct *p, struct rq *rq)
9187 -+{
9188 -+ u64 max_dl = rq->time_edge + NICE_WIDTH - 1;
9189 -+ if (unlikely(p->deadline > max_dl))
9190 -+ p->deadline = max_dl;
9191 -+}
9192 -+
9193 -+static void sched_task_fork(struct task_struct *p, struct rq *rq)
9194 -+{
9195 -+ sched_renew_deadline(p, rq);
9196 -+}
9197 -+
9198 -+static inline void do_sched_yield_type_1(struct task_struct *p, struct rq *rq)
9199 -+{
9200 -+ time_slice_expired(p, rq);
9201 -+}
9202 -+
9203 -+#ifdef CONFIG_SMP
9204 -+static inline void sched_task_ttwu(struct task_struct *p) {}
9205 -+#endif
9206 -+static inline void sched_task_deactivate(struct task_struct *p, struct rq *rq) {}
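PDS derives a task's queue level from how far its virtual deadline sits past
rq->time_edge, clamped to the normal-priority range. A standalone sketch of
task_sched_prio_normal() with assumed constants (NORMAL_PRIO_NUM = 64 and
NICE_WIDTH = 40 are plausible values, not verified against the full patch):

#include <stdio.h>

#define NORMAL_PRIO_NUM	64	/* assumed number of normal levels */
#define NICE_WIDTH	40	/* nice -20..19 */

/* Mirrors task_sched_prio_normal() in pds.h above; the WARN_ONCE is
 * replaced by a silent clamp for this demo. */
static int sched_prio_normal(long long deadline, long long time_edge)
{
	long long delta = deadline - time_edge + NORMAL_PRIO_NUM - NICE_WIDTH;

	if (delta > NORMAL_PRIO_NUM - 1)
		return NORMAL_PRIO_NUM - 1;
	return (delta < 0) ? 0 : (int)delta;
}

int main(void)
{
	/* Later deadlines map to higher (i.e. lower-priority) levels. */
	printf("deadline == edge      -> level %d\n", sched_prio_normal(100, 100));
	printf("deadline == edge + 20 -> level %d\n", sched_prio_normal(120, 100));
	printf("deadline == edge + 39 -> level %d\n", sched_prio_normal(139, 100));
	return 0;
}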
9207 -diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
9208 -index a554e3bbab2b..3e56f5e6ff5c 100644
9209 ---- a/kernel/sched/pelt.c
9210 -+++ b/kernel/sched/pelt.c
9211 -@@ -270,6 +270,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
9212 - WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
9213 - }
9214 -
9215 -+#ifndef CONFIG_SCHED_ALT
9216 - /*
9217 - * sched_entity:
9218 - *
9219 -@@ -387,8 +388,9 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
9220 -
9221 - return 0;
9222 - }
9223 -+#endif
9224 -
9225 --#ifdef CONFIG_SCHED_THERMAL_PRESSURE
9226 -+#if defined(CONFIG_SCHED_THERMAL_PRESSURE) && !defined(CONFIG_SCHED_ALT)
9227 - /*
9228 - * thermal:
9229 - *
9230 -diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
9231 -index e06071bf3472..adf567df34d4 100644
9232 ---- a/kernel/sched/pelt.h
9233 -+++ b/kernel/sched/pelt.h
9234 -@@ -1,13 +1,15 @@
9235 - #ifdef CONFIG_SMP
9236 - #include "sched-pelt.h"
9237 -
9238 -+#ifndef CONFIG_SCHED_ALT
9239 - int __update_load_avg_blocked_se(u64 now, struct sched_entity *se);
9240 - int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se);
9241 - int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
9242 - int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
9243 - int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
9244 -+#endif
9245 -
9246 --#ifdef CONFIG_SCHED_THERMAL_PRESSURE
9247 -+#if defined(CONFIG_SCHED_THERMAL_PRESSURE) && !defined(CONFIG_SCHED_ALT)
9248 - int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity);
9249 -
9250 - static inline u64 thermal_load_avg(struct rq *rq)
9251 -@@ -42,6 +44,7 @@ static inline u32 get_pelt_divider(struct sched_avg *avg)
9252 - return LOAD_AVG_MAX - 1024 + avg->period_contrib;
9253 - }
9254 -
9255 -+#ifndef CONFIG_SCHED_ALT
9256 - static inline void cfs_se_util_change(struct sched_avg *avg)
9257 - {
9258 - unsigned int enqueued;
9259 -@@ -153,9 +156,11 @@ static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
9260 - return rq_clock_pelt(rq_of(cfs_rq));
9261 - }
9262 - #endif
9263 -+#endif /* CONFIG_SCHED_ALT */
9264 -
9265 - #else
9266 -
9267 -+#ifndef CONFIG_SCHED_ALT
9268 - static inline int
9269 - update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
9270 - {
9271 -@@ -173,6 +178,7 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
9272 - {
9273 - return 0;
9274 - }
9275 -+#endif
9276 -
9277 - static inline int
9278 - update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
9279 -diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
9280 -index ddefb0419d7a..658c41b15d3c 100644
9281 ---- a/kernel/sched/sched.h
9282 -+++ b/kernel/sched/sched.h
9283 -@@ -2,6 +2,10 @@
9284 - /*
9285 - * Scheduler internal types and methods:
9286 - */
9287 -+#ifdef CONFIG_SCHED_ALT
9288 -+#include "alt_sched.h"
9289 -+#else
9290 -+
9291 - #include <linux/sched.h>
9292 -
9293 - #include <linux/sched/autogroup.h>
9294 -@@ -3038,3 +3042,8 @@ extern int sched_dynamic_mode(const char *str);
9295 - extern void sched_dynamic_update(int mode);
9296 - #endif
9297 -
9298 -+static inline int task_running_nice(struct task_struct *p)
9299 -+{
9300 -+ return (task_nice(p) > 0);
9301 -+}
9302 -+#endif /* !CONFIG_SCHED_ALT */
9303 -diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
9304 -index 3f93fc3b5648..528b71e144e9 100644
9305 ---- a/kernel/sched/stats.c
9306 -+++ b/kernel/sched/stats.c
9307 -@@ -22,8 +22,10 @@ static int show_schedstat(struct seq_file *seq, void *v)
9308 - } else {
9309 - struct rq *rq;
9310 - #ifdef CONFIG_SMP
9311 -+#ifndef CONFIG_SCHED_ALT
9312 - struct sched_domain *sd;
9313 - int dcount = 0;
9314 -+#endif
9315 - #endif
9316 - cpu = (unsigned long)(v - 2);
9317 - rq = cpu_rq(cpu);
9318 -@@ -40,6 +42,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
9319 - seq_printf(seq, "\n");
9320 -
9321 - #ifdef CONFIG_SMP
9322 -+#ifndef CONFIG_SCHED_ALT
9323 - /* domain-specific stats */
9324 - rcu_read_lock();
9325 - for_each_domain(cpu, sd) {
9326 -@@ -68,6 +71,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
9327 - sd->ttwu_move_balance);
9328 - }
9329 - rcu_read_unlock();
9330 -+#endif
9331 - #endif
9332 - }
9333 - return 0;
9334 -diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
9335 -index b77ad49dc14f..be9edf086412 100644
9336 ---- a/kernel/sched/topology.c
9337 -+++ b/kernel/sched/topology.c
9338 -@@ -4,6 +4,7 @@
9339 - */
9340 - #include "sched.h"
9341 -
9342 -+#ifndef CONFIG_SCHED_ALT
9343 - DEFINE_MUTEX(sched_domains_mutex);
9344 -
9345 - /* Protected by sched_domains_mutex: */
9346 -@@ -1382,8 +1383,10 @@ static void asym_cpu_capacity_scan(void)
9347 - */
9348 -
9349 - static int default_relax_domain_level = -1;
9350 -+#endif /* CONFIG_SCHED_ALT */
9351 - int sched_domain_level_max;
9352 -
9353 -+#ifndef CONFIG_SCHED_ALT
9354 - static int __init setup_relax_domain_level(char *str)
9355 - {
9356 - if (kstrtoint(str, 0, &default_relax_domain_level))
9357 -@@ -1617,6 +1620,7 @@ sd_init(struct sched_domain_topology_level *tl,
9358 -
9359 - return sd;
9360 - }
9361 -+#endif /* CONFIG_SCHED_ALT */
9362 -
9363 - /*
9364 - * Topology list, bottom-up.
9365 -@@ -1646,6 +1650,7 @@ void set_sched_topology(struct sched_domain_topology_level *tl)
9366 - sched_domain_topology = tl;
9367 - }
9368 -
9369 -+#ifndef CONFIG_SCHED_ALT
9370 - #ifdef CONFIG_NUMA
9371 -
9372 - static const struct cpumask *sd_numa_mask(int cpu)
9373 -@@ -2451,3 +2456,17 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
9374 - partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
9375 - mutex_unlock(&sched_domains_mutex);
9376 - }
9377 -+#else /* CONFIG_SCHED_ALT */
9378 -+void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
9379 -+ struct sched_domain_attr *dattr_new)
9380 -+{}
9381 -+
9382 -+#ifdef CONFIG_NUMA
9383 -+int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
9384 -+
9385 -+int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
9386 -+{
9387 -+ return best_mask_cpu(cpu, cpus);
9388 -+}
9389 -+#endif /* CONFIG_NUMA */
9390 -+#endif
9391 -diff --git a/kernel/sysctl.c b/kernel/sysctl.c
9392 -index 272f4a272f8c..1c9455c8ecf6 100644
9393 ---- a/kernel/sysctl.c
9394 -+++ b/kernel/sysctl.c
9395 -@@ -122,6 +122,10 @@ static unsigned long long_max = LONG_MAX;
9396 - static int one_hundred = 100;
9397 - static int two_hundred = 200;
9398 - static int one_thousand = 1000;
9399 -+#ifdef CONFIG_SCHED_ALT
9400 -+static int __maybe_unused zero = 0;
9401 -+extern int sched_yield_type;
9402 -+#endif
9403 - #ifdef CONFIG_PRINTK
9404 - static int ten_thousand = 10000;
9405 - #endif
9406 -@@ -1730,6 +1734,24 @@ int proc_do_static_key(struct ctl_table *table, int write,
9407 - }
9408 -
9409 - static struct ctl_table kern_table[] = {
9410 -+#ifdef CONFIG_SCHED_ALT
9411 -+/* In ALT, only "sched_schedstats" is supported */
9412 -+#ifdef CONFIG_SCHED_DEBUG
9413 -+#ifdef CONFIG_SMP
9414 -+#ifdef CONFIG_SCHEDSTATS
9415 -+ {
9416 -+ .procname = "sched_schedstats",
9417 -+ .data = NULL,
9418 -+ .maxlen = sizeof(unsigned int),
9419 -+ .mode = 0644,
9420 -+ .proc_handler = sysctl_schedstats,
9421 -+ .extra1 = SYSCTL_ZERO,
9422 -+ .extra2 = SYSCTL_ONE,
9423 -+ },
9424 -+#endif /* CONFIG_SCHEDSTATS */
9425 -+#endif /* CONFIG_SMP */
9426 -+#endif /* CONFIG_SCHED_DEBUG */
9427 -+#else /* !CONFIG_SCHED_ALT */
9428 - {
9429 - .procname = "sched_child_runs_first",
9430 - .data = &sysctl_sched_child_runs_first,
9431 -@@ -1860,6 +1882,7 @@ static struct ctl_table kern_table[] = {
9432 - .extra2 = SYSCTL_ONE,
9433 - },
9434 - #endif
9435 -+#endif /* !CONFIG_SCHED_ALT */
9436 - #ifdef CONFIG_PROVE_LOCKING
9437 - {
9438 - .procname = "prove_locking",
9439 -@@ -2436,6 +2459,17 @@ static struct ctl_table kern_table[] = {
9440 - .proc_handler = proc_dointvec,
9441 - },
9442 - #endif
9443 -+#ifdef CONFIG_SCHED_ALT
9444 -+ {
9445 -+ .procname = "yield_type",
9446 -+ .data = &sched_yield_type,
9447 -+ .maxlen = sizeof (int),
9448 -+ .mode = 0644,
9449 -+ .proc_handler = &proc_dointvec_minmax,
9450 -+ .extra1 = &zero,
9451 -+ .extra2 = &two,
9452 -+ },
9453 -+#endif
9454 - #if defined(CONFIG_S390) && defined(CONFIG_SMP)
9455 - {
9456 - .procname = "spin_retry",
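The yield_type entry above is bounded to 0..2 by proc_dointvec_minmax and
shows up as /proc/sys/kernel/yield_type on a patched kernel. A small sketch
that reads it back from userspace (the path follows from the .procname; it
only exists with CONFIG_SCHED_ALT enabled):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/yield_type", "r");
	int val;

	if (!f) {
		perror("yield_type not available (CONFIG_SCHED_ALT off?)");
		return 1;
	}
	if (fscanf(f, "%d", &val) == 1)
		printf("yield_type = %d\n", val);
	fclose(f);
	return 0;
}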
9457 -diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
9458 -index 4a66725b1d4a..cb80ed5c1f5c 100644
9459 ---- a/kernel/time/hrtimer.c
9460 -+++ b/kernel/time/hrtimer.c
9461 -@@ -1940,8 +1940,10 @@ long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode,
9462 - int ret = 0;
9463 - u64 slack;
9464 -
9465 -+#ifndef CONFIG_SCHED_ALT
9466 - slack = current->timer_slack_ns;
9467 - if (dl_task(current) || rt_task(current))
9468 -+#endif
9469 - slack = 0;
9470 -
9471 - hrtimer_init_sleeper_on_stack(&t, clockid, mode);
9472 -diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
9473 -index 517be7fd175e..de3afe8e0800 100644
9474 ---- a/kernel/time/posix-cpu-timers.c
9475 -+++ b/kernel/time/posix-cpu-timers.c
9476 -@@ -216,7 +216,7 @@ static void task_sample_cputime(struct task_struct *p, u64 *samples)
9477 - u64 stime, utime;
9478 -
9479 - task_cputime(p, &utime, &stime);
9480 -- store_samples(samples, stime, utime, p->se.sum_exec_runtime);
9481 -+ store_samples(samples, stime, utime, tsk_seruntime(p));
9482 - }
9483 -
9484 - static void proc_sample_cputime_atomic(struct task_cputime_atomic *at,
9485 -@@ -801,6 +801,7 @@ static void collect_posix_cputimers(struct posix_cputimers *pct, u64 *samples,
9486 - }
9487 - }
9488 -
9489 -+#ifndef CONFIG_SCHED_ALT
9490 - static inline void check_dl_overrun(struct task_struct *tsk)
9491 - {
9492 - if (tsk->dl.dl_overrun) {
9493 -@@ -808,6 +809,7 @@ static inline void check_dl_overrun(struct task_struct *tsk)
9494 - __group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
9495 - }
9496 - }
9497 -+#endif
9498 -
9499 - static bool check_rlimit(u64 time, u64 limit, int signo, bool rt, bool hard)
9500 - {
9501 -@@ -835,8 +837,10 @@ static void check_thread_timers(struct task_struct *tsk,
9502 - u64 samples[CPUCLOCK_MAX];
9503 - unsigned long soft;
9504 -
9505 -+#ifndef CONFIG_SCHED_ALT
9506 - if (dl_task(tsk))
9507 - check_dl_overrun(tsk);
9508 -+#endif
9509 -
9510 - if (expiry_cache_is_inactive(pct))
9511 - return;
9512 -@@ -850,7 +854,7 @@ static void check_thread_timers(struct task_struct *tsk,
9513 - soft = task_rlimit(tsk, RLIMIT_RTTIME);
9514 - if (soft != RLIM_INFINITY) {
9515 - /* Task RT timeout is accounted in jiffies. RTTIME is usec */
9516 -- unsigned long rttime = tsk->rt.timeout * (USEC_PER_SEC / HZ);
9517 -+ unsigned long rttime = tsk_rttimeout(tsk) * (USEC_PER_SEC / HZ);
9518 - unsigned long hard = task_rlimit_max(tsk, RLIMIT_RTTIME);
9519 -
9520 - /* At the hard limit, send SIGKILL. No further action. */
9521 -@@ -1086,8 +1090,10 @@ static inline bool fastpath_timer_check(struct task_struct *tsk)
9522 - return true;
9523 - }
9524 -
9525 -+#ifndef CONFIG_SCHED_ALT
9526 - if (dl_task(tsk) && tsk->dl.dl_overrun)
9527 - return true;
9528 -+#endif
9529 -
9530 - return false;
9531 - }
9532 -diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
9533 -index adf7ef194005..11c8f36e281b 100644
9534 ---- a/kernel/trace/trace_selftest.c
9535 -+++ b/kernel/trace/trace_selftest.c
9536 -@@ -1052,10 +1052,15 @@ static int trace_wakeup_test_thread(void *data)
9537 - {
9538 - /* Make this a -deadline thread */
9539 - static const struct sched_attr attr = {
9540 -+#ifdef CONFIG_SCHED_ALT
9541 -+ /* No deadline on BMQ/PDS, use RR */
9542 -+ .sched_policy = SCHED_RR,
9543 -+#else
9544 - .sched_policy = SCHED_DEADLINE,
9545 - .sched_runtime = 100000ULL,
9546 - .sched_deadline = 10000000ULL,
9547 - .sched_period = 10000000ULL
9548 -+#endif
9549 - };
9550 - struct wakeup_test_data *x = data;
9551 -
9552 ---- a/kernel/sched/alt_core.c 2021-11-18 18:58:14.290182408 -0500
9553 -+++ b/kernel/sched/alt_core.c 2021-11-18 18:58:54.870593883 -0500
9554 -@@ -2762,7 +2762,7 @@ int sched_fork(unsigned long clone_flags
9555 - return 0;
9556 - }
9557 -
9558 --void sched_post_fork(struct task_struct *p) {}
9559 -+void sched_post_fork(struct task_struct *p, struct kernel_clone_args *kargs) {}
9560 -
9561 - #ifdef CONFIG_SCHEDSTATS
9562 -
9563
9564 diff --git a/5021_BMQ-and-PDS-gentoo-defaults.patch b/5021_BMQ-and-PDS-gentoo-defaults.patch
9565 deleted file mode 100644
9566 index d449eec4..00000000
9567 --- a/5021_BMQ-and-PDS-gentoo-defaults.patch
9568 +++ /dev/null
9569 @@ -1,13 +0,0 @@
9570 ---- a/init/Kconfig 2021-04-27 07:38:30.556467045 -0400
9571 -+++ b/init/Kconfig 2021-04-27 07:39:32.956412800 -0400
9572 -@@ -780,8 +780,9 @@ config GENERIC_SCHED_CLOCK
9573 - menu "Scheduler features"
9574 -
9575 - menuconfig SCHED_ALT
9576 -+ depends on X86_64
9577 - bool "Alternative CPU Schedulers"
9578 -- default y
9579 -+ default n
9580 - help
9581 - This feature enables the alternative CPU schedulers.
9582 -