Gentoo Archives: gentoo-commits

From: Mike Pagano <mpagano@g.o>
To: gentoo-commits@l.g.o
Subject: [gentoo-commits] proj/linux-patches:5.14 commit in: /
Date: Tue, 14 Sep 2021 15:37:53
Message-Id: 1631633821.bac991f4736e0a8f6712313af04b8b4cd873d3b5.mpagano@gentoo
1 commit: bac991f4736e0a8f6712313af04b8b4cd873d3b5
2 Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
3 AuthorDate: Tue Sep 14 15:37:01 2021 +0000
4 Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
5 CommitDate: Tue Sep 14 15:37:01 2021 +0000
6 URL: https://gitweb.gentoo.org/proj/linux-patches.git/commit/?id=bac991f4
7
8 Add BMQ Scheduler Patch 5.14-r1
9
10 BMQ(BitMap Queue) Scheduler.
11 A new CPU scheduler developed from PDS(incld).
12 Inspired by the scheduler in zircon.
13
14 Set defaults for BMQ. Add archs as people test, default to N
15
16 Signed-off-by: Mike Pagano <mpagano <AT> gentoo.org>
17
18 0000_README | 7 +
19 5020_BMQ-and-PDS-io-scheduler-v5.14-r1.patch | 9514 ++++++++++++++++++++++++++
20 5021_BMQ-and-PDS-gentoo-defaults.patch | 13 +
21 3 files changed, 9534 insertions(+)
22
23 diff --git a/0000_README b/0000_README
24 index 4ad6164..f4fbe66 100644
25 --- a/0000_README
26 +++ b/0000_README
27 @@ -87,3 +87,10 @@ Patch: 5010_enable-cpu-optimizations-universal.patch
28 From: https://github.com/graysky2/kernel_compiler_patch
29 Desc: Kernel >= 5.8 patch enables gcc = v9+ optimizations for additional CPUs.
30
31 +Patch: 5020_BMQ-and-PDS-io-scheduler-v5.14-r1.patch
32 +From: https://gitlab.com/alfredchen/linux-prjc
33 +Desc: BMQ(BitMap Queue) Scheduler. A new CPU scheduler developed from PDS(incld). Inspired by the scheduler in zircon.
34 +
35 +Patch: 5021_BMQ-and-PDS-gentoo-defaults.patch
36 +From: https://gitweb.gentoo.org/proj/linux-patches.git/
37 +Desc: Set defaults for BMQ. Add archs as people test, default to N
38
39 diff --git a/5020_BMQ-and-PDS-io-scheduler-v5.14-r1.patch b/5020_BMQ-and-PDS-io-scheduler-v5.14-r1.patch
40 new file mode 100644
41 index 0000000..4c6f75c
42 --- /dev/null
43 +++ b/5020_BMQ-and-PDS-io-scheduler-v5.14-r1.patch
44 @@ -0,0 +1,9514 @@
45 +diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
46 +index bdb22006f713..d755d7df632f 100644
47 +--- a/Documentation/admin-guide/kernel-parameters.txt
48 ++++ b/Documentation/admin-guide/kernel-parameters.txt
49 +@@ -4947,6 +4947,12 @@
50 +
51 + sbni= [NET] Granch SBNI12 leased line adapter
52 +
53 ++ sched_timeslice=
54 ++ [KNL] Time slice in ms for Project C BMQ/PDS scheduler.
55 ++ Format: integer 2, 4
56 ++ Default: 4
57 ++ See Documentation/scheduler/sched-BMQ.txt
58 ++
59 + sched_verbose [KNL] Enables verbose scheduler debug messages.
60 +
61 + schedstats= [KNL,X86] Enable or disable scheduled statistics.
62 +diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
63 +index 426162009ce9..15ac2d7e47cd 100644
64 +--- a/Documentation/admin-guide/sysctl/kernel.rst
65 ++++ b/Documentation/admin-guide/sysctl/kernel.rst
66 +@@ -1542,3 +1542,13 @@ is 10 seconds.
67 +
68 + The softlockup threshold is (``2 * watchdog_thresh``). Setting this
69 + tunable to zero will disable lockup detection altogether.
70 ++
71 ++yield_type:
72 ++===========
73 ++
74 ++BMQ/PDS CPU scheduler only. This determines what type of yield a call to
75 ++sched_yield() will perform.
76 ++
77 ++ 0 - No yield.
78 ++ 1 - Deboost and requeue task. (default)
79 ++ 2 - Set run queue skip task.
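The values above correspond to /proc/sys/kernel/yield_type. A minimal user-space sketch (not part of the patch; the path is assumed from the sysctl documentation above) that switches sched_yield() handling to "no yield":

/* Write a new value to the yield_type sysctl added by this patch.
 * Assumed path: /proc/sys/kernel/yield_type. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/yield_type", "w");

	if (!f) {
		perror("yield_type");
		return 1;
	}
	fprintf(f, "0\n");	/* 0 = no yield, 1 = deboost and requeue (default), 2 = skip */
	fclose(f);
	return 0;
}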
80 +diff --git a/Documentation/scheduler/sched-BMQ.txt b/Documentation/scheduler/sched-BMQ.txt
81 +new file mode 100644
82 +index 000000000000..05c84eec0f31
83 +--- /dev/null
84 ++++ b/Documentation/scheduler/sched-BMQ.txt
85 +@@ -0,0 +1,110 @@
86 ++ BitMap queue CPU Scheduler
87 ++ --------------------------
88 ++
89 ++CONTENT
90 ++========
91 ++
92 ++ Background
93 ++ Design
94 ++ Overview
95 ++ Task policy
96 ++ Priority management
97 ++ BitMap Queue
98 ++ CPU Assignment and Migration
99 ++
100 ++
101 ++Background
102 ++==========
103 ++
104 ++The BitMap Queue CPU scheduler, referred to as BMQ from here on, is an
105 ++evolution of the previous Priority and Deadline based Skiplist multiple queue
106 ++scheduler (PDS), and is inspired by the Zircon scheduler. Its goal is to keep
107 ++the scheduler code simple, while staying efficient and scalable for interactive
108 ++tasks such as desktop use, movie playback and gaming.
109 ++
110 ++Design
111 ++======
112 ++
113 ++Overview
114 ++--------
115 ++
116 ++BMQ uses a per-CPU run queue design: each (logical) CPU has its own run
117 ++queue, and each CPU is responsible for scheduling the tasks that are put
118 ++into its run queue.
119 ++
120 ++The run queue is a set of priority queues. In terms of data structure, these
121 ++queues are FIFO queues for non-rt tasks and priority queues for rt tasks; see
122 ++BitMap Queue below for details. BMQ is optimized for non-rt tasks, since most
123 ++applications are non-rt tasks. Whether a queue is FIFO or priority, each queue
124 ++is an ordered list of runnable tasks awaiting execution, and the data
125 ++structures are the same. When it is time for a new task to run, the scheduler
126 ++simply looks up the lowest numbered queue that contains a task and runs the
127 ++first task from the head of that queue. The per-CPU idle task is also kept in
128 ++the run queue, so the scheduler can always find a task to run from its run
129 ++queue.
130 ++
131 ++Each task is assigned the same timeslice (default 4ms) when it is picked to
132 ++start running. A task is reinserted at the end of the appropriate priority
133 ++queue when it uses up its whole timeslice. When the scheduler selects a new
134 ++task from the priority queue, it sets the CPU's preemption timer for the
135 ++remainder of the previous timeslice. When that timer fires, the scheduler
136 ++stops execution of that task, selects another task and starts over again.
137 ++
138 ++If a task blocks waiting for a shared resource then it's taken out of its
139 ++priority queue and is placed in a wait queue for the shared resource. When it
140 ++is unblocked it will be reinserted in the appropriate priority queue of an
141 ++eligible CPU.
142 ++
143 ++Task policy
144 ++-----------
145 ++
146 ++BMQ supports the DEADLINE, FIFO, RR, NORMAL, BATCH and IDLE task policies,
147 ++like the mainline CFS scheduler, but it is heavily optimized for non-rt tasks,
148 ++that is, NORMAL/BATCH/IDLE policy tasks. Below are the implementation details
149 ++of each policy.
150 ++
151 ++DEADLINE
152 ++ It is squashed into a priority 0 FIFO task.
153 ++
154 ++FIFO/RR
155 ++ All RT tasks share a single priority queue in the BMQ run queue design. The
156 ++complexity of the insert operation is O(n). BMQ is not designed for systems
157 ++that mostly run rt policy tasks.
158 ++
159 ++NORMAL/BATCH/IDLE
160 ++ BATCH and IDLE tasks are treated as the same policy. They compete for CPU
161 ++with NORMAL policy tasks, but they simply don't get boosted. To control the
162 ++priority of NORMAL/BATCH/IDLE tasks, use the nice level.
163 ++
164 ++ISO
165 ++ ISO policy is not supported in BMQ. Please use a nice level -20 NORMAL
166 ++policy task instead.
167 ++
168 ++Priority management
169 ++-------------------
170 ++
171 ++RT tasks have priorities from 0-99. For non-rt tasks, there are three
172 ++different factors used to determine the effective priority of a task; the
173 ++effective priority is what determines which queue the task will be in.
174 ++
175 ++The first factor is simply the task's static priority, which is assigned from
176 ++the task's nice level: [-20, 19] from userland's point of view and [0, 39]
177 ++internally.
178 ++
179 ++The second factor is the priority boost. This is a value bounded between
180 ++[-MAX_PRIORITY_ADJ, MAX_PRIORITY_ADJ] that is used to offset the base
181 ++priority; it is modified in the following cases:
182 ++
183 ++*When a thread has used up its entire timeslice, always deboost its boost by
184 ++increasing it by one.
185 ++*When a thread gives up cpu control (voluntarily or involuntarily) to
186 ++reschedule, and its switch-in time (the time since it last switched in and
187 ++ran) is below the threshold based on its priority boost, boost its boost by
188 ++decreasing it by one, but it is capped at 0 (won't go negative).
189 ++
190 ++The intent in this system is to ensure that interactive threads are serviced
191 ++quickly. These are usually the threads that interact directly with the user
192 ++and cause user-perceivable latency. These threads usually do little work and
193 ++spend most of their time blocked awaiting another user event. So they get the
194 ++priority boost from unblocking while background threads that do most of the
195 ++processing receive the priority penalty for using their entire timeslice.
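A minimal C sketch of the priority model described above, assuming the internal [0, 39] static priority range and a boost bounded by [-MAX_PRIORITY_ADJ, MAX_PRIORITY_ADJ]; the helper names are illustrative only and the per-policy capping rules are simplified (the real logic lives in bmq.h and alt_core.c later in this patch):

/* Illustrative model of BMQ effective priority; not patch code. */
#define MAX_PRIORITY_ADJ	7	/* BMQ value, see the include/linux/sched/prio.h hunk */

struct bmq_task {
	int static_prio;	/* nice + 20, i.e. 0..39 */
	int boost_prio;		/* -MAX_PRIORITY_ADJ .. MAX_PRIORITY_ADJ */
};

static int clamp_int(int v, int lo, int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Queue the task lands in: a lower value is picked first by the bitmap scan. */
static int effective_prio(const struct bmq_task *p)
{
	return clamp_int(p->static_prio + p->boost_prio, 0, 39);
}

/* Used up the whole timeslice: deboost by increasing the boost value. */
static void deboost(struct bmq_task *p)
{
	p->boost_prio = clamp_int(p->boost_prio + 1,
				  -MAX_PRIORITY_ADJ, MAX_PRIORITY_ADJ);
}

/* Gave up the CPU quickly: boost by decreasing the boost value. */
static void boost(struct bmq_task *p)
{
	p->boost_prio = clamp_int(p->boost_prio - 1,
				  -MAX_PRIORITY_ADJ, MAX_PRIORITY_ADJ);
}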
196 +diff --git a/fs/proc/base.c b/fs/proc/base.c
197 +index e5b5f7709d48..284b3c4b7d90 100644
198 +--- a/fs/proc/base.c
199 ++++ b/fs/proc/base.c
200 +@@ -476,7 +476,7 @@ static int proc_pid_schedstat(struct seq_file *m, struct pid_namespace *ns,
201 + seq_puts(m, "0 0 0\n");
202 + else
203 + seq_printf(m, "%llu %llu %lu\n",
204 +- (unsigned long long)task->se.sum_exec_runtime,
205 ++ (unsigned long long)tsk_seruntime(task),
206 + (unsigned long long)task->sched_info.run_delay,
207 + task->sched_info.pcount);
208 +
209 +diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
210 +index 8874f681b056..59eb72bf7d5f 100644
211 +--- a/include/asm-generic/resource.h
212 ++++ b/include/asm-generic/resource.h
213 +@@ -23,7 +23,7 @@
214 + [RLIMIT_LOCKS] = { RLIM_INFINITY, RLIM_INFINITY }, \
215 + [RLIMIT_SIGPENDING] = { 0, 0 }, \
216 + [RLIMIT_MSGQUEUE] = { MQ_BYTES_MAX, MQ_BYTES_MAX }, \
217 +- [RLIMIT_NICE] = { 0, 0 }, \
218 ++ [RLIMIT_NICE] = { 30, 30 }, \
219 + [RLIMIT_RTPRIO] = { 0, 0 }, \
220 + [RLIMIT_RTTIME] = { RLIM_INFINITY, RLIM_INFINITY }, \
221 + }
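The hunk above raises the default RLIMIT_NICE from {0, 0} to {30, 30}. Per setrlimit(2), the nice ceiling is 20 - rlim_cur, so with a default of 30 an unprivileged task should be able to renice itself down to about -10. A small user-space sketch (not part of the patch):

/* Try to raise our own priority; expected to succeed on a kernel carrying
 * this patch (ceiling 20 - 30 = -10) and to fail with EACCES otherwise. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

int main(void)
{
	if (setpriority(PRIO_PROCESS, 0, -10) == -1) {
		printf("renice to -10 failed: %s\n", strerror(errno));
		return 1;
	}
	printf("now running at nice %d\n", getpriority(PRIO_PROCESS, 0));
	return 0;
}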
222 +diff --git a/include/linux/sched.h b/include/linux/sched.h
223 +index ec8d07d88641..b12f660404fd 100644
224 +--- a/include/linux/sched.h
225 ++++ b/include/linux/sched.h
226 +@@ -681,12 +681,18 @@ struct task_struct {
227 + unsigned int ptrace;
228 +
229 + #ifdef CONFIG_SMP
230 +- int on_cpu;
231 + struct __call_single_node wake_entry;
232 ++#endif
233 ++#if defined(CONFIG_SMP) || defined(CONFIG_SCHED_ALT)
234 ++ int on_cpu;
235 ++#endif
236 ++
237 ++#ifdef CONFIG_SMP
238 + #ifdef CONFIG_THREAD_INFO_IN_TASK
239 + /* Current CPU: */
240 + unsigned int cpu;
241 + #endif
242 ++#ifndef CONFIG_SCHED_ALT
243 + unsigned int wakee_flips;
244 + unsigned long wakee_flip_decay_ts;
245 + struct task_struct *last_wakee;
246 +@@ -700,6 +706,7 @@ struct task_struct {
247 + */
248 + int recent_used_cpu;
249 + int wake_cpu;
250 ++#endif /* !CONFIG_SCHED_ALT */
251 + #endif
252 + int on_rq;
253 +
254 +@@ -708,6 +715,20 @@ struct task_struct {
255 + int normal_prio;
256 + unsigned int rt_priority;
257 +
258 ++#ifdef CONFIG_SCHED_ALT
259 ++ u64 last_ran;
260 ++ s64 time_slice;
261 ++ int sq_idx;
262 ++ struct list_head sq_node;
263 ++#ifdef CONFIG_SCHED_BMQ
264 ++ int boost_prio;
265 ++#endif /* CONFIG_SCHED_BMQ */
266 ++#ifdef CONFIG_SCHED_PDS
267 ++ u64 deadline;
268 ++#endif /* CONFIG_SCHED_PDS */
269 ++ /* sched_clock time spent running */
270 ++ u64 sched_time;
271 ++#else /* !CONFIG_SCHED_ALT */
272 + const struct sched_class *sched_class;
273 + struct sched_entity se;
274 + struct sched_rt_entity rt;
275 +@@ -718,6 +739,7 @@ struct task_struct {
276 + unsigned long core_cookie;
277 + unsigned int core_occupation;
278 + #endif
279 ++#endif /* !CONFIG_SCHED_ALT */
280 +
281 + #ifdef CONFIG_CGROUP_SCHED
282 + struct task_group *sched_task_group;
283 +@@ -1417,6 +1439,15 @@ struct task_struct {
284 + */
285 + };
286 +
287 ++#ifdef CONFIG_SCHED_ALT
288 ++#define tsk_seruntime(t) ((t)->sched_time)
289 ++/* replace the uncertain rt_timeout with 0UL */
290 ++#define tsk_rttimeout(t) (0UL)
291 ++#else /* CFS */
292 ++#define tsk_seruntime(t) ((t)->se.sum_exec_runtime)
293 ++#define tsk_rttimeout(t) ((t)->rt.timeout)
294 ++#endif /* !CONFIG_SCHED_ALT */
295 ++
296 + static inline struct pid *task_pid(struct task_struct *task)
297 + {
298 + return task->thread_pid;
299 +diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
300 +index 1aff00b65f3c..216fdf2fe90c 100644
301 +--- a/include/linux/sched/deadline.h
302 ++++ b/include/linux/sched/deadline.h
303 +@@ -1,5 +1,24 @@
304 + /* SPDX-License-Identifier: GPL-2.0 */
305 +
306 ++#ifdef CONFIG_SCHED_ALT
307 ++
308 ++static inline int dl_task(struct task_struct *p)
309 ++{
310 ++ return 0;
311 ++}
312 ++
313 ++#ifdef CONFIG_SCHED_BMQ
314 ++#define __tsk_deadline(p) (0UL)
315 ++#endif
316 ++
317 ++#ifdef CONFIG_SCHED_PDS
318 ++#define __tsk_deadline(p) ((((u64) ((p)->prio))<<56) | (p)->deadline)
319 ++#endif
320 ++
321 ++#else
322 ++
323 ++#define __tsk_deadline(p) ((p)->dl.deadline)
324 ++
325 + /*
326 + * SCHED_DEADLINE tasks has negative priorities, reflecting
327 + * the fact that any of them has higher prio than RT and
328 +@@ -19,6 +38,7 @@ static inline int dl_task(struct task_struct *p)
329 + {
330 + return dl_prio(p->prio);
331 + }
332 ++#endif /* CONFIG_SCHED_ALT */
333 +
334 + static inline bool dl_time_before(u64 a, u64 b)
335 + {
336 +diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h
337 +index ab83d85e1183..6af9ae681116 100644
338 +--- a/include/linux/sched/prio.h
339 ++++ b/include/linux/sched/prio.h
340 +@@ -18,6 +18,32 @@
341 + #define MAX_PRIO (MAX_RT_PRIO + NICE_WIDTH)
342 + #define DEFAULT_PRIO (MAX_RT_PRIO + NICE_WIDTH / 2)
343 +
344 ++#ifdef CONFIG_SCHED_ALT
345 ++
346 ++/* Undefine MAX_PRIO and DEFAULT_PRIO */
347 ++#undef MAX_PRIO
348 ++#undef DEFAULT_PRIO
349 ++
350 ++/* +/- priority levels from the base priority */
351 ++#ifdef CONFIG_SCHED_BMQ
352 ++#define MAX_PRIORITY_ADJ (7)
353 ++
354 ++#define MIN_NORMAL_PRIO (MAX_RT_PRIO)
355 ++#define MAX_PRIO (MIN_NORMAL_PRIO + NICE_WIDTH)
356 ++#define DEFAULT_PRIO (MIN_NORMAL_PRIO + NICE_WIDTH / 2)
357 ++#endif
358 ++
359 ++#ifdef CONFIG_SCHED_PDS
360 ++#define MAX_PRIORITY_ADJ (0)
361 ++
362 ++#define MIN_NORMAL_PRIO (128)
363 ++#define NORMAL_PRIO_NUM (64)
364 ++#define MAX_PRIO (MIN_NORMAL_PRIO + NORMAL_PRIO_NUM)
365 ++#define DEFAULT_PRIO (MAX_PRIO - NICE_WIDTH / 2)
366 ++#endif
367 ++
368 ++#endif /* CONFIG_SCHED_ALT */
369 ++
370 + /*
371 + * Convert user-nice values [ -20 ... 0 ... 19 ]
372 + * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
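For concreteness, assuming the mainline values MAX_RT_PRIO == 100 and NICE_WIDTH == 40, the redefinitions above work out to the ranges below (worked arithmetic only, not patch content):

/* BMQ: MIN_NORMAL_PRIO = 100, so MAX_PRIO = 140 and DEFAULT_PRIO = 120 (nice 0).
 * PDS: MIN_NORMAL_PRIO = 128 and NORMAL_PRIO_NUM = 64, so MAX_PRIO = 192 and
 *      DEFAULT_PRIO = 192 - 20 = 172 (nice 0). */
_Static_assert(100 + 40 == 140, "BMQ MAX_PRIO");
_Static_assert(100 + 40 / 2 == 120, "BMQ DEFAULT_PRIO");
_Static_assert(128 + 64 == 192, "PDS MAX_PRIO");
_Static_assert(128 + 64 - 40 / 2 == 172, "PDS DEFAULT_PRIO");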
373 +diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
374 +index e5af028c08b4..0a7565d0d3cf 100644
375 +--- a/include/linux/sched/rt.h
376 ++++ b/include/linux/sched/rt.h
377 +@@ -24,8 +24,10 @@ static inline bool task_is_realtime(struct task_struct *tsk)
378 +
379 + if (policy == SCHED_FIFO || policy == SCHED_RR)
380 + return true;
381 ++#ifndef CONFIG_SCHED_ALT
382 + if (policy == SCHED_DEADLINE)
383 + return true;
384 ++#endif
385 + return false;
386 + }
387 +
388 +diff --git a/init/Kconfig b/init/Kconfig
389 +index 55f9f7738ebb..9a9b244d3ca3 100644
390 +--- a/init/Kconfig
391 ++++ b/init/Kconfig
392 +@@ -786,9 +786,39 @@ config GENERIC_SCHED_CLOCK
393 +
394 + menu "Scheduler features"
395 +
396 ++menuconfig SCHED_ALT
397 ++ bool "Alternative CPU Schedulers"
398 ++ default y
399 ++ help
400 ++ This feature enables the alternative CPU schedulers.
401 ++
402 ++if SCHED_ALT
403 ++
404 ++choice
405 ++ prompt "Alternative CPU Scheduler"
406 ++ default SCHED_BMQ
407 ++
408 ++config SCHED_BMQ
409 ++ bool "BMQ CPU scheduler"
410 ++ help
411 ++ The BitMap Queue CPU scheduler for excellent interactivity and
412 ++ responsiveness on the desktop and solid scalability on normal
413 ++ hardware and commodity servers.
414 ++
415 ++config SCHED_PDS
416 ++ bool "PDS CPU scheduler"
417 ++ help
418 ++ The Priority and Deadline based Skip list multiple queue CPU
419 ++ Scheduler.
420 ++
421 ++endchoice
422 ++
423 ++endif
424 ++
425 + config UCLAMP_TASK
426 + bool "Enable utilization clamping for RT/FAIR tasks"
427 + depends on CPU_FREQ_GOV_SCHEDUTIL
428 ++ depends on !SCHED_ALT
429 + help
430 + This feature enables the scheduler to track the clamped utilization
431 + of each CPU based on RUNNABLE tasks scheduled on that CPU.
432 +@@ -874,6 +904,7 @@ config NUMA_BALANCING
433 + depends on ARCH_SUPPORTS_NUMA_BALANCING
434 + depends on !ARCH_WANT_NUMA_VARIABLE_LOCALITY
435 + depends on SMP && NUMA && MIGRATION
436 ++ depends on !SCHED_ALT
437 + help
438 + This option adds support for automatic NUMA aware memory/task placement.
439 + The mechanism is quite primitive and is based on migrating memory when
440 +@@ -966,6 +997,7 @@ config FAIR_GROUP_SCHED
441 + depends on CGROUP_SCHED
442 + default CGROUP_SCHED
443 +
444 ++if !SCHED_ALT
445 + config CFS_BANDWIDTH
446 + bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
447 + depends on FAIR_GROUP_SCHED
448 +@@ -988,6 +1020,7 @@ config RT_GROUP_SCHED
449 + realtime bandwidth for them.
450 + See Documentation/scheduler/sched-rt-group.rst for more information.
451 +
452 ++endif #!SCHED_ALT
453 + endif #CGROUP_SCHED
454 +
455 + config UCLAMP_TASK_GROUP
456 +@@ -1231,6 +1264,7 @@ config CHECKPOINT_RESTORE
457 +
458 + config SCHED_AUTOGROUP
459 + bool "Automatic process group scheduling"
460 ++ depends on !SCHED_ALT
461 + select CGROUPS
462 + select CGROUP_SCHED
463 + select FAIR_GROUP_SCHED
464 +diff --git a/init/init_task.c b/init/init_task.c
465 +index 562f2ef8d157..177b63db4ce0 100644
466 +--- a/init/init_task.c
467 ++++ b/init/init_task.c
468 +@@ -75,9 +75,15 @@ struct task_struct init_task
469 + .stack = init_stack,
470 + .usage = REFCOUNT_INIT(2),
471 + .flags = PF_KTHREAD,
472 ++#ifdef CONFIG_SCHED_ALT
473 ++ .prio = DEFAULT_PRIO + MAX_PRIORITY_ADJ,
474 ++ .static_prio = DEFAULT_PRIO,
475 ++ .normal_prio = DEFAULT_PRIO + MAX_PRIORITY_ADJ,
476 ++#else
477 + .prio = MAX_PRIO - 20,
478 + .static_prio = MAX_PRIO - 20,
479 + .normal_prio = MAX_PRIO - 20,
480 ++#endif
481 + .policy = SCHED_NORMAL,
482 + .cpus_ptr = &init_task.cpus_mask,
483 + .cpus_mask = CPU_MASK_ALL,
484 +@@ -87,6 +93,17 @@ struct task_struct init_task
485 + .restart_block = {
486 + .fn = do_no_restart_syscall,
487 + },
488 ++#ifdef CONFIG_SCHED_ALT
489 ++ .sq_node = LIST_HEAD_INIT(init_task.sq_node),
490 ++#ifdef CONFIG_SCHED_BMQ
491 ++ .boost_prio = 0,
492 ++ .sq_idx = 15,
493 ++#endif
494 ++#ifdef CONFIG_SCHED_PDS
495 ++ .deadline = 0,
496 ++#endif
497 ++ .time_slice = HZ,
498 ++#else
499 + .se = {
500 + .group_node = LIST_HEAD_INIT(init_task.se.group_node),
501 + },
502 +@@ -94,6 +111,7 @@ struct task_struct init_task
503 + .run_list = LIST_HEAD_INIT(init_task.rt.run_list),
504 + .time_slice = RR_TIMESLICE,
505 + },
506 ++#endif
507 + .tasks = LIST_HEAD_INIT(init_task.tasks),
508 + #ifdef CONFIG_SMP
509 + .pushable_tasks = PLIST_NODE_INIT(init_task.pushable_tasks, MAX_PRIO),
510 +diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
511 +index 5876e30c5740..7594d0a31869 100644
512 +--- a/kernel/Kconfig.preempt
513 ++++ b/kernel/Kconfig.preempt
514 +@@ -102,7 +102,7 @@ config PREEMPT_DYNAMIC
515 +
516 + config SCHED_CORE
517 + bool "Core Scheduling for SMT"
518 +- depends on SCHED_SMT
519 ++ depends on SCHED_SMT && !SCHED_ALT
520 + help
521 + This option permits Core Scheduling, a means of coordinated task
522 + selection across SMT siblings. When enabled -- see
523 +diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
524 +index adb5190c4429..8c02bce63146 100644
525 +--- a/kernel/cgroup/cpuset.c
526 ++++ b/kernel/cgroup/cpuset.c
527 +@@ -636,7 +636,7 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
528 + return ret;
529 + }
530 +
531 +-#ifdef CONFIG_SMP
532 ++#if defined(CONFIG_SMP) && !defined(CONFIG_SCHED_ALT)
533 + /*
534 + * Helper routine for generate_sched_domains().
535 + * Do cpusets a, b have overlapping effective cpus_allowed masks?
536 +@@ -1032,7 +1032,7 @@ static void rebuild_sched_domains_locked(void)
537 + /* Have scheduler rebuild the domains */
538 + partition_and_rebuild_sched_domains(ndoms, doms, attr);
539 + }
540 +-#else /* !CONFIG_SMP */
541 ++#else /* !CONFIG_SMP || CONFIG_SCHED_ALT */
542 + static void rebuild_sched_domains_locked(void)
543 + {
544 + }
545 +diff --git a/kernel/delayacct.c b/kernel/delayacct.c
546 +index 51530d5b15a8..e542d71bb94b 100644
547 +--- a/kernel/delayacct.c
548 ++++ b/kernel/delayacct.c
549 +@@ -139,7 +139,7 @@ int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
550 + */
551 + t1 = tsk->sched_info.pcount;
552 + t2 = tsk->sched_info.run_delay;
553 +- t3 = tsk->se.sum_exec_runtime;
554 ++ t3 = tsk_seruntime(tsk);
555 +
556 + d->cpu_count += t1;
557 +
558 +diff --git a/kernel/exit.c b/kernel/exit.c
559 +index 9a89e7f36acb..7fe34c56bd08 100644
560 +--- a/kernel/exit.c
561 ++++ b/kernel/exit.c
562 +@@ -122,7 +122,7 @@ static void __exit_signal(struct task_struct *tsk)
563 + sig->curr_target = next_thread(tsk);
564 + }
565 +
566 +- add_device_randomness((const void*) &tsk->se.sum_exec_runtime,
567 ++ add_device_randomness((const void*) &tsk_seruntime(tsk),
568 + sizeof(unsigned long long));
569 +
570 + /*
571 +@@ -143,7 +143,7 @@ static void __exit_signal(struct task_struct *tsk)
572 + sig->inblock += task_io_get_inblock(tsk);
573 + sig->oublock += task_io_get_oublock(tsk);
574 + task_io_accounting_add(&sig->ioac, &tsk->ioac);
575 +- sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
576 ++ sig->sum_sched_runtime += tsk_seruntime(tsk);
577 + sig->nr_threads--;
578 + __unhash_process(tsk, group_dead);
579 + write_sequnlock(&sig->stats_lock);
580 +diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
581 +index 3a4beb9395c4..98a709628cb3 100644
582 +--- a/kernel/livepatch/transition.c
583 ++++ b/kernel/livepatch/transition.c
584 +@@ -307,7 +307,11 @@ static bool klp_try_switch_task(struct task_struct *task)
585 + */
586 + rq = task_rq_lock(task, &flags);
587 +
588 ++#ifdef CONFIG_SCHED_ALT
589 ++ if (task_running(task) && task != current) {
590 ++#else
591 + if (task_running(rq, task) && task != current) {
592 ++#endif
593 + snprintf(err_buf, STACK_ERR_BUF_SIZE,
594 + "%s: %s:%d is running\n", __func__, task->comm,
595 + task->pid);
596 +diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
597 +index ad0db322ed3b..350b0e506c17 100644
598 +--- a/kernel/locking/rtmutex.c
599 ++++ b/kernel/locking/rtmutex.c
600 +@@ -227,14 +227,18 @@ static __always_inline bool unlock_rt_mutex_safe(struct rt_mutex *lock,
601 + * Only use with rt_mutex_waiter_{less,equal}()
602 + */
603 + #define task_to_waiter(p) \
604 +- &(struct rt_mutex_waiter){ .prio = (p)->prio, .deadline = (p)->dl.deadline }
605 ++ &(struct rt_mutex_waiter){ .prio = (p)->prio, .deadline = __tsk_deadline(p) }
606 +
607 + static __always_inline int rt_mutex_waiter_less(struct rt_mutex_waiter *left,
608 + struct rt_mutex_waiter *right)
609 + {
610 ++#ifdef CONFIG_SCHED_PDS
611 ++ return (left->deadline < right->deadline);
612 ++#else
613 + if (left->prio < right->prio)
614 + return 1;
615 +
616 ++#ifndef CONFIG_SCHED_BMQ
617 + /*
618 + * If both waiters have dl_prio(), we check the deadlines of the
619 + * associated tasks.
620 +@@ -243,16 +247,22 @@ static __always_inline int rt_mutex_waiter_less(struct rt_mutex_waiter *left,
621 + */
622 + if (dl_prio(left->prio))
623 + return dl_time_before(left->deadline, right->deadline);
624 ++#endif
625 +
626 + return 0;
627 ++#endif
628 + }
629 +
630 + static __always_inline int rt_mutex_waiter_equal(struct rt_mutex_waiter *left,
631 + struct rt_mutex_waiter *right)
632 + {
633 ++#ifdef CONFIG_SCHED_PDS
634 ++ return (left->deadline == right->deadline);
635 ++#else
636 + if (left->prio != right->prio)
637 + return 0;
638 +
639 ++#ifndef CONFIG_SCHED_BMQ
640 + /*
641 + * If both waiters have dl_prio(), we check the deadlines of the
642 + * associated tasks.
643 +@@ -261,8 +271,10 @@ static __always_inline int rt_mutex_waiter_equal(struct rt_mutex_waiter *left,
644 + */
645 + if (dl_prio(left->prio))
646 + return left->deadline == right->deadline;
647 ++#endif
648 +
649 + return 1;
650 ++#endif
651 + }
652 +
653 + #define __node_2_waiter(node) \
654 +@@ -654,7 +666,7 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
655 + * the values of the node being removed.
656 + */
657 + waiter->prio = task->prio;
658 +- waiter->deadline = task->dl.deadline;
659 ++ waiter->deadline = __tsk_deadline(task);
660 +
661 + rt_mutex_enqueue(lock, waiter);
662 +
663 +@@ -925,7 +937,7 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex *lock,
664 + waiter->task = task;
665 + waiter->lock = lock;
666 + waiter->prio = task->prio;
667 +- waiter->deadline = task->dl.deadline;
668 ++ waiter->deadline = __tsk_deadline(task);
669 +
670 + /* Get the top priority waiter on the lock */
671 + if (rt_mutex_has_waiters(lock))
672 +diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
673 +index 978fcfca5871..0425ee149b4d 100644
674 +--- a/kernel/sched/Makefile
675 ++++ b/kernel/sched/Makefile
676 +@@ -22,14 +22,21 @@ ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
677 + CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer
678 + endif
679 +
680 +-obj-y += core.o loadavg.o clock.o cputime.o
681 +-obj-y += idle.o fair.o rt.o deadline.o
682 +-obj-y += wait.o wait_bit.o swait.o completion.o
683 +-
684 +-obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
685 ++ifdef CONFIG_SCHED_ALT
686 ++obj-y += alt_core.o
687 ++obj-$(CONFIG_SCHED_DEBUG) += alt_debug.o
688 ++else
689 ++obj-y += core.o
690 ++obj-y += fair.o rt.o deadline.o
691 ++obj-$(CONFIG_SMP) += cpudeadline.o stop_task.o
692 + obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
693 +-obj-$(CONFIG_SCHEDSTATS) += stats.o
694 ++endif
695 + obj-$(CONFIG_SCHED_DEBUG) += debug.o
696 ++obj-y += loadavg.o clock.o cputime.o
697 ++obj-y += idle.o
698 ++obj-y += wait.o wait_bit.o swait.o completion.o
699 ++obj-$(CONFIG_SMP) += cpupri.o pelt.o topology.o
700 ++obj-$(CONFIG_SCHEDSTATS) += stats.o
701 + obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
702 + obj-$(CONFIG_CPU_FREQ) += cpufreq.o
703 + obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
704 +diff --git a/kernel/sched/alt_core.c b/kernel/sched/alt_core.c
705 +new file mode 100644
706 +index 000000000000..900889c838ea
707 +--- /dev/null
708 ++++ b/kernel/sched/alt_core.c
709 +@@ -0,0 +1,7248 @@
710 ++/*
711 ++ * kernel/sched/alt_core.c
712 ++ *
713 ++ * Core alternative kernel scheduler code and related syscalls
714 ++ *
715 ++ * Copyright (C) 1991-2002 Linus Torvalds
716 ++ *
717 ++ * 2009-08-13 Brainfuck deadline scheduling policy by Con Kolivas deletes
718 ++ * a whole lot of those previous things.
719 ++ * 2017-09-06 Priority and Deadline based Skip list multiple queue kernel
720 ++ * scheduler by Alfred Chen.
721 ++ * 2019-02-20 BMQ(BitMap Queue) kernel scheduler by Alfred Chen.
722 ++ */
723 ++#define CREATE_TRACE_POINTS
724 ++#include <trace/events/sched.h>
725 ++#undef CREATE_TRACE_POINTS
726 ++
727 ++#include "sched.h"
728 ++
729 ++#include <linux/sched/rt.h>
730 ++
731 ++#include <linux/context_tracking.h>
732 ++#include <linux/compat.h>
733 ++#include <linux/blkdev.h>
734 ++#include <linux/delayacct.h>
735 ++#include <linux/freezer.h>
736 ++#include <linux/init_task.h>
737 ++#include <linux/kprobes.h>
738 ++#include <linux/mmu_context.h>
739 ++#include <linux/nmi.h>
740 ++#include <linux/profile.h>
741 ++#include <linux/rcupdate_wait.h>
742 ++#include <linux/security.h>
743 ++#include <linux/syscalls.h>
744 ++#include <linux/wait_bit.h>
745 ++
746 ++#include <linux/kcov.h>
747 ++#include <linux/scs.h>
748 ++
749 ++#include <asm/switch_to.h>
750 ++
751 ++#include "../workqueue_internal.h"
752 ++#include "../../fs/io-wq.h"
753 ++#include "../smpboot.h"
754 ++
755 ++#include "pelt.h"
756 ++#include "smp.h"
757 ++
758 ++/*
759 ++ * Export tracepoints that act as a bare tracehook (ie: have no trace event
760 ++ * associated with them) to allow external modules to probe them.
761 ++ */
762 ++EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_irq_tp);
763 ++
764 ++#ifdef CONFIG_SCHED_DEBUG
765 ++#define sched_feat(x) (1)
766 ++/*
767 ++ * Print a warning if need_resched is set for the given duration (if
768 ++ * LATENCY_WARN is enabled).
769 ++ *
770 ++ * If sysctl_resched_latency_warn_once is set, only one warning will be shown
771 ++ * per boot.
772 ++ */
773 ++__read_mostly int sysctl_resched_latency_warn_ms = 100;
774 ++__read_mostly int sysctl_resched_latency_warn_once = 1;
775 ++#else
776 ++#define sched_feat(x) (0)
777 ++#endif /* CONFIG_SCHED_DEBUG */
778 ++
779 ++#define ALT_SCHED_VERSION "v5.14-r1"
780 ++
781 ++/* rt_prio(prio) defined in include/linux/sched/rt.h */
782 ++#define rt_task(p) rt_prio((p)->prio)
783 ++#define rt_policy(policy) ((policy) == SCHED_FIFO || (policy) == SCHED_RR)
784 ++#define task_has_rt_policy(p) (rt_policy((p)->policy))
785 ++
786 ++#define STOP_PRIO (MAX_RT_PRIO - 1)
787 ++
788 ++/* Default time slice is 4 ms; it can be set via the kernel parameter "sched_timeslice" */
789 ++u64 sched_timeslice_ns __read_mostly = (4 << 20);
790 ++
791 ++static inline void requeue_task(struct task_struct *p, struct rq *rq);
792 ++
793 ++#ifdef CONFIG_SCHED_BMQ
794 ++#include "bmq.h"
795 ++#endif
796 ++#ifdef CONFIG_SCHED_PDS
797 ++#include "pds.h"
798 ++#endif
799 ++
800 ++static int __init sched_timeslice(char *str)
801 ++{
802 ++ int timeslice_ms;
803 ++
804 ++ get_option(&str, &timeslice_ms);
805 ++ if (2 != timeslice_ms)
806 ++ timeslice_ms = 4;
807 ++ sched_timeslice_ns = timeslice_ms << 20;
808 ++ sched_timeslice_imp(timeslice_ms);
809 ++
810 ++ return 0;
811 ++}
812 ++early_param("sched_timeslice", sched_timeslice);
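Note that the conversion above uses a shift rather than NSEC_PER_MSEC, so one "ms" of timeslice is really 2^20 ns (about 1.05 ms); a quick check of the resulting values (not part of the patch):

/* sched_timeslice=2 -> 2 << 20 = 2097152 ns (~2.10 ms)
 * sched_timeslice=4 -> 4 << 20 = 4194304 ns (~4.19 ms, the default) */
_Static_assert((2 << 20) == 2097152, "2 'ms' timeslice in ns");
_Static_assert((4 << 20) == 4194304, "4 'ms' timeslice in ns");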
813 ++
814 ++/* Reschedule if less than this many μs left */
815 ++#define RESCHED_NS (100 << 10)
816 ++
817 ++/**
818 ++ * sched_yield_type - Choose what sort of yield sched_yield will perform.
819 ++ * 0: No yield.
820 ++ * 1: Deboost and requeue task. (default)
821 ++ * 2: Set rq skip task.
822 ++ */
823 ++int sched_yield_type __read_mostly = 1;
824 ++
825 ++#ifdef CONFIG_SMP
826 ++static cpumask_t sched_rq_pending_mask ____cacheline_aligned_in_smp;
827 ++
828 ++DEFINE_PER_CPU(cpumask_t [NR_CPU_AFFINITY_LEVELS], sched_cpu_topo_masks);
829 ++DEFINE_PER_CPU(cpumask_t *, sched_cpu_llc_mask);
830 ++DEFINE_PER_CPU(cpumask_t *, sched_cpu_topo_end_mask);
831 ++
832 ++#ifdef CONFIG_SCHED_SMT
833 ++DEFINE_STATIC_KEY_FALSE(sched_smt_present);
834 ++EXPORT_SYMBOL_GPL(sched_smt_present);
835 ++#endif
836 ++
837 ++/*
838 ++ * Keep a unique ID per domain (we use the first CPUs number in the cpumask of
839 ++ * the domain), this allows us to quickly tell if two cpus are in the same cache
840 ++ * domain, see cpus_share_cache().
841 ++ */
842 ++DEFINE_PER_CPU(int, sd_llc_id);
843 ++#endif /* CONFIG_SMP */
844 ++
845 ++static DEFINE_MUTEX(sched_hotcpu_mutex);
846 ++
847 ++DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
848 ++
849 ++#ifndef prepare_arch_switch
850 ++# define prepare_arch_switch(next) do { } while (0)
851 ++#endif
852 ++#ifndef finish_arch_post_lock_switch
853 ++# define finish_arch_post_lock_switch() do { } while (0)
854 ++#endif
855 ++
856 ++#ifdef CONFIG_SCHED_SMT
857 ++static cpumask_t sched_sg_idle_mask ____cacheline_aligned_in_smp;
858 ++#endif
859 ++static cpumask_t sched_rq_watermark[SCHED_BITS] ____cacheline_aligned_in_smp;
860 ++
861 ++/* sched_queue related functions */
862 ++static inline void sched_queue_init(struct sched_queue *q)
863 ++{
864 ++ int i;
865 ++
866 ++ bitmap_zero(q->bitmap, SCHED_BITS);
867 ++ for(i = 0; i < SCHED_BITS; i++)
868 ++ INIT_LIST_HEAD(&q->heads[i]);
869 ++}
870 ++
871 ++/*
872 ++ * Init idle task and put into queue structure of rq
873 ++ * IMPORTANT: may be called multiple times for a single cpu
874 ++ */
875 ++static inline void sched_queue_init_idle(struct sched_queue *q,
876 ++ struct task_struct *idle)
877 ++{
878 ++ idle->sq_idx = IDLE_TASK_SCHED_PRIO;
879 ++ INIT_LIST_HEAD(&q->heads[idle->sq_idx]);
880 ++ list_add(&idle->sq_node, &q->heads[idle->sq_idx]);
881 ++}
882 ++
883 ++/* water mark related functions */
884 ++static inline void update_sched_rq_watermark(struct rq *rq)
885 ++{
886 ++ unsigned long watermark = find_first_bit(rq->queue.bitmap, SCHED_QUEUE_BITS);
887 ++ unsigned long last_wm = rq->watermark;
888 ++ unsigned long i;
889 ++ int cpu;
890 ++
891 ++ if (watermark == last_wm)
892 ++ return;
893 ++
894 ++ rq->watermark = watermark;
895 ++ cpu = cpu_of(rq);
896 ++ if (watermark < last_wm) {
897 ++ for (i = last_wm; i > watermark; i--)
898 ++ cpumask_clear_cpu(cpu, sched_rq_watermark + SCHED_BITS - 1 - i);
899 ++#ifdef CONFIG_SCHED_SMT
900 ++ if (static_branch_likely(&sched_smt_present) &&
901 ++ IDLE_TASK_SCHED_PRIO == last_wm)
902 ++ cpumask_andnot(&sched_sg_idle_mask,
903 ++ &sched_sg_idle_mask, cpu_smt_mask(cpu));
904 ++#endif
905 ++ return;
906 ++ }
907 ++ /* last_wm < watermark */
908 ++ for (i = watermark; i > last_wm; i--)
909 ++ cpumask_set_cpu(cpu, sched_rq_watermark + SCHED_BITS - 1 - i);
910 ++#ifdef CONFIG_SCHED_SMT
911 ++ if (static_branch_likely(&sched_smt_present) &&
912 ++ IDLE_TASK_SCHED_PRIO == watermark) {
913 ++ cpumask_t tmp;
914 ++
915 ++ cpumask_and(&tmp, cpu_smt_mask(cpu), sched_rq_watermark);
916 ++ if (cpumask_equal(&tmp, cpu_smt_mask(cpu)))
917 ++ cpumask_or(&sched_sg_idle_mask,
918 ++ &sched_sg_idle_mask, cpu_smt_mask(cpu));
919 ++ }
920 ++#endif
921 ++}
922 ++
923 ++/*
924 ++ * This routine assumes that the idle task is always in the queue
925 ++ */
926 ++static inline struct task_struct *sched_rq_first_task(struct rq *rq)
927 ++{
928 ++ unsigned long idx = find_first_bit(rq->queue.bitmap, SCHED_QUEUE_BITS);
929 ++ const struct list_head *head = &rq->queue.heads[sched_prio2idx(idx, rq)];
930 ++
931 ++ return list_first_entry(head, struct task_struct, sq_node);
932 ++}
933 ++
934 ++static inline struct task_struct *
935 ++sched_rq_next_task(struct task_struct *p, struct rq *rq)
936 ++{
937 ++ unsigned long idx = p->sq_idx;
938 ++ struct list_head *head = &rq->queue.heads[idx];
939 ++
940 ++ if (list_is_last(&p->sq_node, head)) {
941 ++ idx = find_next_bit(rq->queue.bitmap, SCHED_QUEUE_BITS,
942 ++ sched_idx2prio(idx, rq) + 1);
943 ++ head = &rq->queue.heads[sched_prio2idx(idx, rq)];
944 ++
945 ++ return list_first_entry(head, struct task_struct, sq_node);
946 ++ }
947 ++
948 ++ return list_next_entry(p, sq_node);
949 ++}
950 ++
951 ++static inline struct task_struct *rq_runnable_task(struct rq *rq)
952 ++{
953 ++ struct task_struct *next = sched_rq_first_task(rq);
954 ++
955 ++ if (unlikely(next == rq->skip))
956 ++ next = sched_rq_next_task(next, rq);
957 ++
958 ++ return next;
959 ++}
960 ++
961 ++/*
962 ++ * Serialization rules:
963 ++ *
964 ++ * Lock order:
965 ++ *
966 ++ * p->pi_lock
967 ++ * rq->lock
968 ++ * hrtimer_cpu_base->lock (hrtimer_start() for bandwidth controls)
969 ++ *
970 ++ * rq1->lock
971 ++ * rq2->lock where: rq1 < rq2
972 ++ *
973 ++ * Regular state:
974 ++ *
975 ++ * Normal scheduling state is serialized by rq->lock. __schedule() takes the
976 ++ * local CPU's rq->lock, it optionally removes the task from the runqueue and
977 ++ * always looks at the local rq data structures to find the most eligible task
978 ++ * to run next.
979 ++ *
980 ++ * Task enqueue is also under rq->lock, possibly taken from another CPU.
981 ++ * Wakeups from another LLC domain might use an IPI to transfer the enqueue to
982 ++ * the local CPU to avoid bouncing the runqueue state around [ see
983 ++ * ttwu_queue_wakelist() ]
984 ++ *
985 ++ * Task wakeup, specifically wakeups that involve migration, are horribly
986 ++ * complicated to avoid having to take two rq->locks.
987 ++ *
988 ++ * Special state:
989 ++ *
990 ++ * System-calls and anything external will use task_rq_lock() which acquires
991 ++ * both p->pi_lock and rq->lock. As a consequence the state they change is
992 ++ * stable while holding either lock:
993 ++ *
994 ++ * - sched_setaffinity()/
995 ++ * set_cpus_allowed_ptr(): p->cpus_ptr, p->nr_cpus_allowed
996 ++ * - set_user_nice(): p->se.load, p->*prio
997 ++ * - __sched_setscheduler(): p->sched_class, p->policy, p->*prio,
998 ++ * p->se.load, p->rt_priority,
999 ++ * p->dl.dl_{runtime, deadline, period, flags, bw, density}
1000 ++ * - sched_setnuma(): p->numa_preferred_nid
1001 ++ * - sched_move_task()/
1002 ++ * cpu_cgroup_fork(): p->sched_task_group
1003 ++ * - uclamp_update_active() p->uclamp*
1004 ++ *
1005 ++ * p->state <- TASK_*:
1006 ++ *
1007 ++ * is changed locklessly using set_current_state(), __set_current_state() or
1008 ++ * set_special_state(), see their respective comments, or by
1009 ++ * try_to_wake_up(). This latter uses p->pi_lock to serialize against
1010 ++ * concurrent self.
1011 ++ *
1012 ++ * p->on_rq <- { 0, 1 = TASK_ON_RQ_QUEUED, 2 = TASK_ON_RQ_MIGRATING }:
1013 ++ *
1014 ++ * is set by activate_task() and cleared by deactivate_task(), under
1015 ++ * rq->lock. Non-zero indicates the task is runnable, the special
1016 ++ * ON_RQ_MIGRATING state is used for migration without holding both
1017 ++ * rq->locks. It indicates task_cpu() is not stable, see task_rq_lock().
1018 ++ *
1019 ++ * p->on_cpu <- { 0, 1 }:
1020 ++ *
1021 ++ * is set by prepare_task() and cleared by finish_task() such that it will be
1022 ++ * set before p is scheduled-in and cleared after p is scheduled-out, both
1023 ++ * under rq->lock. Non-zero indicates the task is running on its CPU.
1024 ++ *
1025 ++ * [ The astute reader will observe that it is possible for two tasks on one
1026 ++ * CPU to have ->on_cpu = 1 at the same time. ]
1027 ++ *
1028 ++ * task_cpu(p): is changed by set_task_cpu(), the rules are:
1029 ++ *
1030 ++ * - Don't call set_task_cpu() on a blocked task:
1031 ++ *
1032 ++ * We don't care what CPU we're not running on, this simplifies hotplug,
1033 ++ * the CPU assignment of blocked tasks isn't required to be valid.
1034 ++ *
1035 ++ * - for try_to_wake_up(), called under p->pi_lock:
1036 ++ *
1037 ++ * This allows try_to_wake_up() to only take one rq->lock, see its comment.
1038 ++ *
1039 ++ * - for migration called under rq->lock:
1040 ++ * [ see task_on_rq_migrating() in task_rq_lock() ]
1041 ++ *
1042 ++ * o move_queued_task()
1043 ++ * o detach_task()
1044 ++ *
1045 ++ * - for migration called under double_rq_lock():
1046 ++ *
1047 ++ * o __migrate_swap_task()
1048 ++ * o push_rt_task() / pull_rt_task()
1049 ++ * o push_dl_task() / pull_dl_task()
1050 ++ * o dl_task_offline_migration()
1051 ++ *
1052 ++ */
1053 ++
1054 ++/*
1055 ++ * Context: p->pi_lock
1056 ++ */
1057 ++static inline struct rq
1058 ++*__task_access_lock(struct task_struct *p, raw_spinlock_t **plock)
1059 ++{
1060 ++ struct rq *rq;
1061 ++ for (;;) {
1062 ++ rq = task_rq(p);
1063 ++ if (p->on_cpu || task_on_rq_queued(p)) {
1064 ++ raw_spin_lock(&rq->lock);
1065 ++ if (likely((p->on_cpu || task_on_rq_queued(p))
1066 ++ && rq == task_rq(p))) {
1067 ++ *plock = &rq->lock;
1068 ++ return rq;
1069 ++ }
1070 ++ raw_spin_unlock(&rq->lock);
1071 ++ } else if (task_on_rq_migrating(p)) {
1072 ++ do {
1073 ++ cpu_relax();
1074 ++ } while (unlikely(task_on_rq_migrating(p)));
1075 ++ } else {
1076 ++ *plock = NULL;
1077 ++ return rq;
1078 ++ }
1079 ++ }
1080 ++}
1081 ++
1082 ++static inline void
1083 ++__task_access_unlock(struct task_struct *p, raw_spinlock_t *lock)
1084 ++{
1085 ++ if (NULL != lock)
1086 ++ raw_spin_unlock(lock);
1087 ++}
1088 ++
1089 ++static inline struct rq
1090 ++*task_access_lock_irqsave(struct task_struct *p, raw_spinlock_t **plock,
1091 ++ unsigned long *flags)
1092 ++{
1093 ++ struct rq *rq;
1094 ++ for (;;) {
1095 ++ rq = task_rq(p);
1096 ++ if (p->on_cpu || task_on_rq_queued(p)) {
1097 ++ raw_spin_lock_irqsave(&rq->lock, *flags);
1098 ++ if (likely((p->on_cpu || task_on_rq_queued(p))
1099 ++ && rq == task_rq(p))) {
1100 ++ *plock = &rq->lock;
1101 ++ return rq;
1102 ++ }
1103 ++ raw_spin_unlock_irqrestore(&rq->lock, *flags);
1104 ++ } else if (task_on_rq_migrating(p)) {
1105 ++ do {
1106 ++ cpu_relax();
1107 ++ } while (unlikely(task_on_rq_migrating(p)));
1108 ++ } else {
1109 ++ raw_spin_lock_irqsave(&p->pi_lock, *flags);
1110 ++ if (likely(!p->on_cpu && !p->on_rq &&
1111 ++ rq == task_rq(p))) {
1112 ++ *plock = &p->pi_lock;
1113 ++ return rq;
1114 ++ }
1115 ++ raw_spin_unlock_irqrestore(&p->pi_lock, *flags);
1116 ++ }
1117 ++ }
1118 ++}
1119 ++
1120 ++static inline void
1121 ++task_access_unlock_irqrestore(struct task_struct *p, raw_spinlock_t *lock,
1122 ++ unsigned long *flags)
1123 ++{
1124 ++ raw_spin_unlock_irqrestore(lock, *flags);
1125 ++}
1126 ++
1127 ++/*
1128 ++ * __task_rq_lock - lock the rq @p resides on.
1129 ++ */
1130 ++struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
1131 ++ __acquires(rq->lock)
1132 ++{
1133 ++ struct rq *rq;
1134 ++
1135 ++ lockdep_assert_held(&p->pi_lock);
1136 ++
1137 ++ for (;;) {
1138 ++ rq = task_rq(p);
1139 ++ raw_spin_lock(&rq->lock);
1140 ++ if (likely(rq == task_rq(p) && !task_on_rq_migrating(p)))
1141 ++ return rq;
1142 ++ raw_spin_unlock(&rq->lock);
1143 ++
1144 ++ while (unlikely(task_on_rq_migrating(p)))
1145 ++ cpu_relax();
1146 ++ }
1147 ++}
1148 ++
1149 ++/*
1150 ++ * task_rq_lock - lock p->pi_lock and lock the rq @p resides on.
1151 ++ */
1152 ++struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
1153 ++ __acquires(p->pi_lock)
1154 ++ __acquires(rq->lock)
1155 ++{
1156 ++ struct rq *rq;
1157 ++
1158 ++ for (;;) {
1159 ++ raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
1160 ++ rq = task_rq(p);
1161 ++ raw_spin_lock(&rq->lock);
1162 ++ /*
1163 ++ * move_queued_task() task_rq_lock()
1164 ++ *
1165 ++ * ACQUIRE (rq->lock)
1166 ++ * [S] ->on_rq = MIGRATING [L] rq = task_rq()
1167 ++ * WMB (__set_task_cpu()) ACQUIRE (rq->lock);
1168 ++ * [S] ->cpu = new_cpu [L] task_rq()
1169 ++ * [L] ->on_rq
1170 ++ * RELEASE (rq->lock)
1171 ++ *
1172 ++ * If we observe the old CPU in task_rq_lock(), the acquire of
1173 ++ * the old rq->lock will fully serialize against the stores.
1174 ++ *
1175 ++ * If we observe the new CPU in task_rq_lock(), the address
1176 ++ * dependency headed by '[L] rq = task_rq()' and the acquire
1177 ++ * will pair with the WMB to ensure we then also see migrating.
1178 ++ */
1179 ++ if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
1180 ++ return rq;
1181 ++ }
1182 ++ raw_spin_unlock(&rq->lock);
1183 ++ raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
1184 ++
1185 ++ while (unlikely(task_on_rq_migrating(p)))
1186 ++ cpu_relax();
1187 ++ }
1188 ++}
1189 ++
1190 ++static inline void
1191 ++rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
1192 ++ __acquires(rq->lock)
1193 ++{
1194 ++ raw_spin_lock_irqsave(&rq->lock, rf->flags);
1195 ++}
1196 ++
1197 ++static inline void
1198 ++rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
1199 ++ __releases(rq->lock)
1200 ++{
1201 ++ raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
1202 ++}
1203 ++
1204 ++void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
1205 ++{
1206 ++ raw_spinlock_t *lock;
1207 ++
1208 ++ /* Matches synchronize_rcu() in __sched_core_enable() */
1209 ++ preempt_disable();
1210 ++
1211 ++ for (;;) {
1212 ++ lock = __rq_lockp(rq);
1213 ++ raw_spin_lock_nested(lock, subclass);
1214 ++ if (likely(lock == __rq_lockp(rq))) {
1215 ++ /* preempt_count *MUST* be > 1 */
1216 ++ preempt_enable_no_resched();
1217 ++ return;
1218 ++ }
1219 ++ raw_spin_unlock(lock);
1220 ++ }
1221 ++}
1222 ++
1223 ++void raw_spin_rq_unlock(struct rq *rq)
1224 ++{
1225 ++ raw_spin_unlock(rq_lockp(rq));
1226 ++}
1227 ++
1228 ++/*
1229 ++ * RQ-clock updating methods:
1230 ++ */
1231 ++
1232 ++static void update_rq_clock_task(struct rq *rq, s64 delta)
1233 ++{
1234 ++/*
1235 ++ * In theory, the compile should just see 0 here, and optimize out the call
1236 ++ * to sched_rt_avg_update. But I don't trust it...
1237 ++ */
1238 ++ s64 __maybe_unused steal = 0, irq_delta = 0;
1239 ++
1240 ++#ifdef CONFIG_IRQ_TIME_ACCOUNTING
1241 ++ irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
1242 ++
1243 ++ /*
1244 ++ * Since irq_time is only updated on {soft,}irq_exit, we might run into
1245 ++ * this case when a previous update_rq_clock() happened inside a
1246 ++ * {soft,}irq region.
1247 ++ *
1248 ++ * When this happens, we stop ->clock_task and only update the
1249 ++ * prev_irq_time stamp to account for the part that fit, so that a next
1250 ++ * update will consume the rest. This ensures ->clock_task is
1251 ++ * monotonic.
1252 ++ *
1253 ++ * It does however cause some slight miss-attribution of {soft,}irq
1254 ++ * time, a more accurate solution would be to update the irq_time using
1255 ++ * the current rq->clock timestamp, except that would require using
1256 ++ * atomic ops.
1257 ++ */
1258 ++ if (irq_delta > delta)
1259 ++ irq_delta = delta;
1260 ++
1261 ++ rq->prev_irq_time += irq_delta;
1262 ++ delta -= irq_delta;
1263 ++#endif
1264 ++#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
1265 ++ if (static_key_false((&paravirt_steal_rq_enabled))) {
1266 ++ steal = paravirt_steal_clock(cpu_of(rq));
1267 ++ steal -= rq->prev_steal_time_rq;
1268 ++
1269 ++ if (unlikely(steal > delta))
1270 ++ steal = delta;
1271 ++
1272 ++ rq->prev_steal_time_rq += steal;
1273 ++ delta -= steal;
1274 ++ }
1275 ++#endif
1276 ++
1277 ++ rq->clock_task += delta;
1278 ++
1279 ++#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
1280 ++ if ((irq_delta + steal))
1281 ++ update_irq_load_avg(rq, irq_delta + steal);
1282 ++#endif
1283 ++}
1284 ++
1285 ++static inline void update_rq_clock(struct rq *rq)
1286 ++{
1287 ++ s64 delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
1288 ++
1289 ++ if (unlikely(delta <= 0))
1290 ++ return;
1291 ++ rq->clock += delta;
1292 ++ update_rq_time_edge(rq);
1293 ++ update_rq_clock_task(rq, delta);
1294 ++}
1295 ++
1296 ++#ifdef CONFIG_NO_HZ_FULL
1297 ++/*
1298 ++ * Tick may be needed by tasks in the runqueue depending on their policy and
1299 ++ * requirements. If tick is needed, lets send the target an IPI to kick it out
1300 ++ * of nohz mode if necessary.
1301 ++ */
1302 ++static inline void sched_update_tick_dependency(struct rq *rq)
1303 ++{
1304 ++ int cpu = cpu_of(rq);
1305 ++
1306 ++ if (!tick_nohz_full_cpu(cpu))
1307 ++ return;
1308 ++
1309 ++ if (rq->nr_running < 2)
1310 ++ tick_nohz_dep_clear_cpu(cpu, TICK_DEP_BIT_SCHED);
1311 ++ else
1312 ++ tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
1313 ++}
1314 ++#else /* !CONFIG_NO_HZ_FULL */
1315 ++static inline void sched_update_tick_dependency(struct rq *rq) { }
1316 ++#endif
1317 ++
1318 ++bool sched_task_on_rq(struct task_struct *p)
1319 ++{
1320 ++ return task_on_rq_queued(p);
1321 ++}
1322 ++
1323 ++/*
1324 ++ * Add/Remove/Requeue task to/from the runqueue routines
1325 ++ * Context: rq->lock
1326 ++ */
1327 ++#define __SCHED_DEQUEUE_TASK(p, rq, flags, func) \
1328 ++ psi_dequeue(p, flags & DEQUEUE_SLEEP); \
1329 ++ sched_info_dequeue(rq, p); \
1330 ++ \
1331 ++ list_del(&p->sq_node); \
1332 ++ if (list_empty(&rq->queue.heads[p->sq_idx])) { \
1333 ++ clear_bit(sched_idx2prio(p->sq_idx, rq), \
1334 ++ rq->queue.bitmap); \
1335 ++ func; \
1336 ++ }
1337 ++
1338 ++#define __SCHED_ENQUEUE_TASK(p, rq, flags) \
1339 ++ sched_info_enqueue(rq, p); \
1340 ++ psi_enqueue(p, flags); \
1341 ++ \
1342 ++ p->sq_idx = task_sched_prio_idx(p, rq); \
1343 ++ list_add_tail(&p->sq_node, &rq->queue.heads[p->sq_idx]); \
1344 ++ set_bit(sched_idx2prio(p->sq_idx, rq), rq->queue.bitmap);
1345 ++
1346 ++static inline void dequeue_task(struct task_struct *p, struct rq *rq, int flags)
1347 ++{
1348 ++ lockdep_assert_held(&rq->lock);
1349 ++
1350 ++ /*printk(KERN_INFO "sched: dequeue(%d) %px %016llx\n", cpu_of(rq), p, p->priodl);*/
1351 ++ WARN_ONCE(task_rq(p) != rq, "sched: dequeue task reside on cpu%d from cpu%d\n",
1352 ++ task_cpu(p), cpu_of(rq));
1353 ++
1354 ++ __SCHED_DEQUEUE_TASK(p, rq, flags, update_sched_rq_watermark(rq));
1355 ++ --rq->nr_running;
1356 ++#ifdef CONFIG_SMP
1357 ++ if (1 == rq->nr_running)
1358 ++ cpumask_clear_cpu(cpu_of(rq), &sched_rq_pending_mask);
1359 ++#endif
1360 ++
1361 ++ sched_update_tick_dependency(rq);
1362 ++}
1363 ++
1364 ++static inline void enqueue_task(struct task_struct *p, struct rq *rq, int flags)
1365 ++{
1366 ++ lockdep_assert_held(&rq->lock);
1367 ++
1368 ++ /*printk(KERN_INFO "sched: enqueue(%d) %px %016llx\n", cpu_of(rq), p, p->priodl);*/
1369 ++ WARN_ONCE(task_rq(p) != rq, "sched: enqueue task reside on cpu%d to cpu%d\n",
1370 ++ task_cpu(p), cpu_of(rq));
1371 ++
1372 ++ __SCHED_ENQUEUE_TASK(p, rq, flags);
1373 ++ update_sched_rq_watermark(rq);
1374 ++ ++rq->nr_running;
1375 ++#ifdef CONFIG_SMP
1376 ++ if (2 == rq->nr_running)
1377 ++ cpumask_set_cpu(cpu_of(rq), &sched_rq_pending_mask);
1378 ++#endif
1379 ++
1380 ++ sched_update_tick_dependency(rq);
1381 ++}
1382 ++
1383 ++static inline void requeue_task(struct task_struct *p, struct rq *rq)
1384 ++{
1385 ++ int idx;
1386 ++
1387 ++ lockdep_assert_held(&rq->lock);
1388 ++ /*printk(KERN_INFO "sched: requeue(%d) %px %016llx\n", cpu_of(rq), p, p->priodl);*/
1389 ++ WARN_ONCE(task_rq(p) != rq, "sched: cpu[%d] requeue task reside on cpu%d\n",
1390 ++ cpu_of(rq), task_cpu(p));
1391 ++
1392 ++ idx = task_sched_prio_idx(p, rq);
1393 ++
1394 ++ list_del(&p->sq_node);
1395 ++ list_add_tail(&p->sq_node, &rq->queue.heads[idx]);
1396 ++ if (idx != p->sq_idx) {
1397 ++ if (list_empty(&rq->queue.heads[p->sq_idx]))
1398 ++ clear_bit(sched_idx2prio(p->sq_idx, rq),
1399 ++ rq->queue.bitmap);
1400 ++ p->sq_idx = idx;
1401 ++ set_bit(sched_idx2prio(p->sq_idx, rq), rq->queue.bitmap);
1402 ++ update_sched_rq_watermark(rq);
1403 ++ }
1404 ++}
1405 ++
1406 ++/*
1407 ++ * cmpxchg based fetch_or, macro so it works for different integer types
1408 ++ */
1409 ++#define fetch_or(ptr, mask) \
1410 ++ ({ \
1411 ++ typeof(ptr) _ptr = (ptr); \
1412 ++ typeof(mask) _mask = (mask); \
1413 ++ typeof(*_ptr) _old, _val = *_ptr; \
1414 ++ \
1415 ++ for (;;) { \
1416 ++ _old = cmpxchg(_ptr, _val, _val | _mask); \
1417 ++ if (_old == _val) \
1418 ++ break; \
1419 ++ _val = _old; \
1420 ++ } \
1421 ++ _old; \
1422 ++})
1423 ++
1424 ++#if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
1425 ++/*
1426 ++ * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
1427 ++ * this avoids any races wrt polling state changes and thereby avoids
1428 ++ * spurious IPIs.
1429 ++ */
1430 ++static bool set_nr_and_not_polling(struct task_struct *p)
1431 ++{
1432 ++ struct thread_info *ti = task_thread_info(p);
1433 ++ return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
1434 ++}
1435 ++
1436 ++/*
1437 ++ * Atomically set TIF_NEED_RESCHED if TIF_POLLING_NRFLAG is set.
1438 ++ *
1439 ++ * If this returns true, then the idle task promises to call
1440 ++ * sched_ttwu_pending() and reschedule soon.
1441 ++ */
1442 ++static bool set_nr_if_polling(struct task_struct *p)
1443 ++{
1444 ++ struct thread_info *ti = task_thread_info(p);
1445 ++ typeof(ti->flags) old, val = READ_ONCE(ti->flags);
1446 ++
1447 ++ for (;;) {
1448 ++ if (!(val & _TIF_POLLING_NRFLAG))
1449 ++ return false;
1450 ++ if (val & _TIF_NEED_RESCHED)
1451 ++ return true;
1452 ++ old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
1453 ++ if (old == val)
1454 ++ break;
1455 ++ val = old;
1456 ++ }
1457 ++ return true;
1458 ++}
1459 ++
1460 ++#else
1461 ++static bool set_nr_and_not_polling(struct task_struct *p)
1462 ++{
1463 ++ set_tsk_need_resched(p);
1464 ++ return true;
1465 ++}
1466 ++
1467 ++#ifdef CONFIG_SMP
1468 ++static bool set_nr_if_polling(struct task_struct *p)
1469 ++{
1470 ++ return false;
1471 ++}
1472 ++#endif
1473 ++#endif
1474 ++
1475 ++static bool __wake_q_add(struct wake_q_head *head, struct task_struct *task)
1476 ++{
1477 ++ struct wake_q_node *node = &task->wake_q;
1478 ++
1479 ++ /*
1480 ++ * Atomically grab the task, if ->wake_q is !nil already it means
1481 ++ * it's already queued (either by us or someone else) and will get the
1482 ++ * wakeup due to that.
1483 ++ *
1484 ++ * In order to ensure that a pending wakeup will observe our pending
1485 ++ * state, even in the failed case, an explicit smp_mb() must be used.
1486 ++ */
1487 ++ smp_mb__before_atomic();
1488 ++ if (unlikely(cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL)))
1489 ++ return false;
1490 ++
1491 ++ /*
1492 ++ * The head is context local, there can be no concurrency.
1493 ++ */
1494 ++ *head->lastp = node;
1495 ++ head->lastp = &node->next;
1496 ++ return true;
1497 ++}
1498 ++
1499 ++/**
1500 ++ * wake_q_add() - queue a wakeup for 'later' waking.
1501 ++ * @head: the wake_q_head to add @task to
1502 ++ * @task: the task to queue for 'later' wakeup
1503 ++ *
1504 ++ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
1505 ++ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
1506 ++ * instantly.
1507 ++ *
1508 ++ * This function must be used as-if it were wake_up_process(); IOW the task
1509 ++ * must be ready to be woken at this location.
1510 ++ */
1511 ++void wake_q_add(struct wake_q_head *head, struct task_struct *task)
1512 ++{
1513 ++ if (__wake_q_add(head, task))
1514 ++ get_task_struct(task);
1515 ++}
1516 ++
1517 ++/**
1518 ++ * wake_q_add_safe() - safely queue a wakeup for 'later' waking.
1519 ++ * @head: the wake_q_head to add @task to
1520 ++ * @task: the task to queue for 'later' wakeup
1521 ++ *
1522 ++ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
1523 ++ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
1524 ++ * instantly.
1525 ++ *
1526 ++ * This function must be used as-if it were wake_up_process(); IOW the task
1527 ++ * must be ready to be woken at this location.
1528 ++ *
1529 ++ * This function is essentially a task-safe equivalent to wake_q_add(). Callers
1530 ++ * that already hold reference to @task can call the 'safe' version and trust
1531 ++ * wake_q to do the right thing depending whether or not the @task is already
1532 ++ * queued for wakeup.
1533 ++ */
1534 ++void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task)
1535 ++{
1536 ++ if (!__wake_q_add(head, task))
1537 ++ put_task_struct(task);
1538 ++}
1539 ++
1540 ++void wake_up_q(struct wake_q_head *head)
1541 ++{
1542 ++ struct wake_q_node *node = head->first;
1543 ++
1544 ++ while (node != WAKE_Q_TAIL) {
1545 ++ struct task_struct *task;
1546 ++
1547 ++ task = container_of(node, struct task_struct, wake_q);
1548 ++ /* task can safely be re-inserted now: */
1549 ++ node = node->next;
1550 ++ task->wake_q.next = NULL;
1551 ++
1552 ++ /*
1553 ++ * wake_up_process() executes a full barrier, which pairs with
1554 ++ * the queueing in wake_q_add() so as not to miss wakeups.
1555 ++ */
1556 ++ wake_up_process(task);
1557 ++ put_task_struct(task);
1558 ++ }
1559 ++}
1560 ++
1561 ++/*
1562 ++ * resched_curr - mark rq's current task 'to be rescheduled now'.
1563 ++ *
1564 ++ * On UP this means the setting of the need_resched flag, on SMP it
1565 ++ * might also involve a cross-CPU call to trigger the scheduler on
1566 ++ * the target CPU.
1567 ++ */
1568 ++void resched_curr(struct rq *rq)
1569 ++{
1570 ++ struct task_struct *curr = rq->curr;
1571 ++ int cpu;
1572 ++
1573 ++ lockdep_assert_held(&rq->lock);
1574 ++
1575 ++ if (test_tsk_need_resched(curr))
1576 ++ return;
1577 ++
1578 ++ cpu = cpu_of(rq);
1579 ++ if (cpu == smp_processor_id()) {
1580 ++ set_tsk_need_resched(curr);
1581 ++ set_preempt_need_resched();
1582 ++ return;
1583 ++ }
1584 ++
1585 ++ if (set_nr_and_not_polling(curr))
1586 ++ smp_send_reschedule(cpu);
1587 ++ else
1588 ++ trace_sched_wake_idle_without_ipi(cpu);
1589 ++}
1590 ++
1591 ++void resched_cpu(int cpu)
1592 ++{
1593 ++ struct rq *rq = cpu_rq(cpu);
1594 ++ unsigned long flags;
1595 ++
1596 ++ raw_spin_lock_irqsave(&rq->lock, flags);
1597 ++ if (cpu_online(cpu) || cpu == smp_processor_id())
1598 ++ resched_curr(cpu_rq(cpu));
1599 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
1600 ++}
1601 ++
1602 ++#ifdef CONFIG_SMP
1603 ++#ifdef CONFIG_NO_HZ_COMMON
1604 ++void nohz_balance_enter_idle(int cpu) {}
1605 ++
1606 ++void select_nohz_load_balancer(int stop_tick) {}
1607 ++
1608 ++void set_cpu_sd_state_idle(void) {}
1609 ++
1610 ++/*
1611 ++ * In the semi idle case, use the nearest busy CPU for migrating timers
1612 ++ * from an idle CPU. This is good for power-savings.
1613 ++ *
1614 ++ * We don't do similar optimization for completely idle system, as
1615 ++ * selecting an idle CPU will add more delays to the timers than intended
1616 ++ * (as that CPU's timer base may not be uptodate wrt jiffies etc).
1617 ++ */
1618 ++int get_nohz_timer_target(void)
1619 ++{
1620 ++ int i, cpu = smp_processor_id(), default_cpu = -1;
1621 ++ struct cpumask *mask;
1622 ++
1623 ++ if (housekeeping_cpu(cpu, HK_FLAG_TIMER)) {
1624 ++ if (!idle_cpu(cpu))
1625 ++ return cpu;
1626 ++ default_cpu = cpu;
1627 ++ }
1628 ++
1629 ++ for (mask = per_cpu(sched_cpu_topo_masks, cpu) + 1;
1630 ++ mask < per_cpu(sched_cpu_topo_end_mask, cpu); mask++)
1631 ++ for_each_cpu_and(i, mask, housekeeping_cpumask(HK_FLAG_TIMER))
1632 ++ if (!idle_cpu(i))
1633 ++ return i;
1634 ++
1635 ++ if (default_cpu == -1)
1636 ++ default_cpu = housekeeping_any_cpu(HK_FLAG_TIMER);
1637 ++ cpu = default_cpu;
1638 ++
1639 ++ return cpu;
1640 ++}
1641 ++
1642 ++/*
1643 ++ * When add_timer_on() enqueues a timer into the timer wheel of an
1644 ++ * idle CPU then this timer might expire before the next timer event
1645 ++ * which is scheduled to wake up that CPU. In case of a completely
1646 ++ * idle system the next event might even be infinite time into the
1647 ++ * future. wake_up_idle_cpu() ensures that the CPU is woken up and
1648 ++ * leaves the inner idle loop so the newly added timer is taken into
1649 ++ * account when the CPU goes back to idle and evaluates the timer
1650 ++ * wheel for the next timer event.
1651 ++ */
1652 ++static inline void wake_up_idle_cpu(int cpu)
1653 ++{
1654 ++ struct rq *rq = cpu_rq(cpu);
1655 ++
1656 ++ if (cpu == smp_processor_id())
1657 ++ return;
1658 ++
1659 ++ if (set_nr_and_not_polling(rq->idle))
1660 ++ smp_send_reschedule(cpu);
1661 ++ else
1662 ++ trace_sched_wake_idle_without_ipi(cpu);
1663 ++}
1664 ++
1665 ++static inline bool wake_up_full_nohz_cpu(int cpu)
1666 ++{
1667 ++ /*
1668 ++ * We just need the target to call irq_exit() and re-evaluate
1669 ++ * the next tick. The nohz full kick at least implies that.
1670 ++ * If needed we can still optimize that later with an
1671 ++ * empty IRQ.
1672 ++ */
1673 ++ if (cpu_is_offline(cpu))
1674 ++ return true; /* Don't try to wake offline CPUs. */
1675 ++ if (tick_nohz_full_cpu(cpu)) {
1676 ++ if (cpu != smp_processor_id() ||
1677 ++ tick_nohz_tick_stopped())
1678 ++ tick_nohz_full_kick_cpu(cpu);
1679 ++ return true;
1680 ++ }
1681 ++
1682 ++ return false;
1683 ++}
1684 ++
1685 ++void wake_up_nohz_cpu(int cpu)
1686 ++{
1687 ++ if (!wake_up_full_nohz_cpu(cpu))
1688 ++ wake_up_idle_cpu(cpu);
1689 ++}
1690 ++
1691 ++static void nohz_csd_func(void *info)
1692 ++{
1693 ++ struct rq *rq = info;
1694 ++ int cpu = cpu_of(rq);
1695 ++ unsigned int flags;
1696 ++
1697 ++ /*
1698 ++ * Release the rq::nohz_csd.
1699 ++ */
1700 ++ flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));
1701 ++ WARN_ON(!(flags & NOHZ_KICK_MASK));
1702 ++
1703 ++ rq->idle_balance = idle_cpu(cpu);
1704 ++ if (rq->idle_balance && !need_resched()) {
1705 ++ rq->nohz_idle_balance = flags;
1706 ++ raise_softirq_irqoff(SCHED_SOFTIRQ);
1707 ++ }
1708 ++}
1709 ++
1710 ++#endif /* CONFIG_NO_HZ_COMMON */
1711 ++#endif /* CONFIG_SMP */
1712 ++
1713 ++static inline void check_preempt_curr(struct rq *rq)
1714 ++{
1715 ++ if (sched_rq_first_task(rq) != rq->curr)
1716 ++ resched_curr(rq);
1717 ++}
1718 ++
1719 ++#ifdef CONFIG_SCHED_HRTICK
1720 ++/*
1721 ++ * Use HR-timers to deliver accurate preemption points.
1722 ++ */
1723 ++
1724 ++static void hrtick_clear(struct rq *rq)
1725 ++{
1726 ++ if (hrtimer_active(&rq->hrtick_timer))
1727 ++ hrtimer_cancel(&rq->hrtick_timer);
1728 ++}
1729 ++
1730 ++/*
1731 ++ * High-resolution timer tick.
1732 ++ * Runs from hardirq context with interrupts disabled.
1733 ++ */
1734 ++static enum hrtimer_restart hrtick(struct hrtimer *timer)
1735 ++{
1736 ++ struct rq *rq = container_of(timer, struct rq, hrtick_timer);
1737 ++
1738 ++ WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
1739 ++
1740 ++ raw_spin_lock(&rq->lock);
1741 ++ resched_curr(rq);
1742 ++ raw_spin_unlock(&rq->lock);
1743 ++
1744 ++ return HRTIMER_NORESTART;
1745 ++}
1746 ++
1747 ++/*
1748 ++ * Use hrtick when:
1749 ++ * - enabled by features
1750 ++ * - hrtimer is actually high res
1751 ++ */
1752 ++static inline int hrtick_enabled(struct rq *rq)
1753 ++{
1754 ++ /**
1755 ++ * Alt schedule FW doesn't support sched_feat yet
1756 ++ if (!sched_feat(HRTICK))
1757 ++ return 0;
1758 ++ */
1759 ++ if (!cpu_active(cpu_of(rq)))
1760 ++ return 0;
1761 ++ return hrtimer_is_hres_active(&rq->hrtick_timer);
1762 ++}
1763 ++
1764 ++#ifdef CONFIG_SMP
1765 ++
1766 ++static void __hrtick_restart(struct rq *rq)
1767 ++{
1768 ++ struct hrtimer *timer = &rq->hrtick_timer;
1769 ++ ktime_t time = rq->hrtick_time;
1770 ++
1771 ++ hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
1772 ++}
1773 ++
1774 ++/*
1775 ++ * called from hardirq (IPI) context
1776 ++ */
1777 ++static void __hrtick_start(void *arg)
1778 ++{
1779 ++ struct rq *rq = arg;
1780 ++
1781 ++ raw_spin_lock(&rq->lock);
1782 ++ __hrtick_restart(rq);
1783 ++ raw_spin_unlock(&rq->lock);
1784 ++}
1785 ++
1786 ++/*
1787 ++ * Called to set the hrtick timer state.
1788 ++ *
1789 ++ * called with rq->lock held and irqs disabled
1790 ++ */
1791 ++void hrtick_start(struct rq *rq, u64 delay)
1792 ++{
1793 ++ struct hrtimer *timer = &rq->hrtick_timer;
1794 ++ s64 delta;
1795 ++
1796 ++ /*
1797 ++	 * Don't schedule slices shorter than 10000ns; that just
1798 ++	 * doesn't make sense and can cause a timer DoS.
1799 ++ */
1800 ++ delta = max_t(s64, delay, 10000LL);
1801 ++
1802 ++ rq->hrtick_time = ktime_add_ns(timer->base->get_time(), delta);
1803 ++
1804 ++ if (rq == this_rq())
1805 ++ __hrtick_restart(rq);
1806 ++ else
1807 ++ smp_call_function_single_async(cpu_of(rq), &rq->hrtick_csd);
1808 ++}
1809 ++
1810 ++#else
1811 ++/*
1812 ++ * Called to set the hrtick timer state.
1813 ++ *
1814 ++ * called with rq->lock held and irqs disabled
1815 ++ */
1816 ++void hrtick_start(struct rq *rq, u64 delay)
1817 ++{
1818 ++ /*
1819 ++	 * Don't schedule slices shorter than 10000ns; that just
1820 ++ * doesn't make sense. Rely on vruntime for fairness.
1821 ++ */
1822 ++ delay = max_t(u64, delay, 10000LL);
1823 ++ hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay),
1824 ++ HRTIMER_MODE_REL_PINNED_HARD);
1825 ++}
1826 ++#endif /* CONFIG_SMP */
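++
++/*
++ * Usage note (a sketch based on this file, not upstream documentation):
++ * callers arm the tick with a delay, in nanoseconds, after which the current
++ * task should be preempted.  For example, the fork path later in this file
++ * re-arms it with the parent's remaining slice:
++ *
++ *	hrtick_start(rq, rq->curr->time_slice);
++ */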
1827 ++
1828 ++static void hrtick_rq_init(struct rq *rq)
1829 ++{
1830 ++#ifdef CONFIG_SMP
1831 ++ INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
1832 ++#endif
1833 ++
1834 ++ hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
1835 ++ rq->hrtick_timer.function = hrtick;
1836 ++}
1837 ++#else /* CONFIG_SCHED_HRTICK */
1838 ++static inline int hrtick_enabled(struct rq *rq)
1839 ++{
1840 ++ return 0;
1841 ++}
1842 ++
1843 ++static inline void hrtick_clear(struct rq *rq)
1844 ++{
1845 ++}
1846 ++
1847 ++static inline void hrtick_rq_init(struct rq *rq)
1848 ++{
1849 ++}
1850 ++#endif /* CONFIG_SCHED_HRTICK */
1851 ++
1852 ++static inline int __normal_prio(int policy, int rt_prio, int static_prio)
1853 ++{
1854 ++ return rt_policy(policy) ? (MAX_RT_PRIO - 1 - rt_prio) :
1855 ++ static_prio + MAX_PRIORITY_ADJ;
1856 ++}
1857 ++
1858 ++/*
1859 ++ * Calculate the expected normal priority: i.e. priority
1860 ++ * without taking RT-inheritance into account. Might be
1861 ++ * boosted by interactivity modifiers. Changes upon fork,
1862 ++ * setprio syscalls, and whenever the interactivity
1863 ++ * estimator recalculates.
1864 ++ */
1865 ++static inline int normal_prio(struct task_struct *p)
1866 ++{
1867 ++ return __normal_prio(p->policy, p->rt_priority, p->static_prio);
1868 ++}
1869 ++
1870 ++/*
1871 ++ * Calculate the current priority, i.e. the priority
1872 ++ * taken into account by the scheduler. This value might
1873 ++ * be boosted by RT inheritance: it will be RT if the task got
1874 ++ * RT-boosted; if not, it returns p->normal_prio.
1875 ++ */
1876 ++static int effective_prio(struct task_struct *p)
1877 ++{
1878 ++ p->normal_prio = normal_prio(p);
1879 ++ /*
1880 ++ * If we are RT tasks or we were boosted to RT priority,
1881 ++ * keep the priority unchanged. Otherwise, update priority
1882 ++ * to the normal priority:
1883 ++ */
1884 ++ if (!rt_prio(p->prio))
1885 ++ return p->normal_prio;
1886 ++ return p->prio;
1887 ++}
1888 ++
1889 ++/*
1890 ++ * activate_task - move a task to the runqueue.
1891 ++ *
1892 ++ * Context: rq->lock
1893 ++ */
1894 ++static void activate_task(struct task_struct *p, struct rq *rq)
1895 ++{
1896 ++ enqueue_task(p, rq, ENQUEUE_WAKEUP);
1897 ++ p->on_rq = TASK_ON_RQ_QUEUED;
1898 ++
1899 ++ /*
1900 ++ * If in_iowait is set, the code below may not trigger any cpufreq
1901 ++ * utilization updates, so do it here explicitly with the IOWAIT flag
1902 ++ * passed.
1903 ++ */
1904 ++ cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT * p->in_iowait);
1905 ++}
1906 ++
1907 ++/*
1908 ++ * deactivate_task - remove a task from the runqueue.
1909 ++ *
1910 ++ * Context: rq->lock
1911 ++ */
1912 ++static inline void deactivate_task(struct task_struct *p, struct rq *rq)
1913 ++{
1914 ++ dequeue_task(p, rq, DEQUEUE_SLEEP);
1915 ++ p->on_rq = 0;
1916 ++ cpufreq_update_util(rq, 0);
1917 ++}
1918 ++
1919 ++static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
1920 ++{
1921 ++#ifdef CONFIG_SMP
1922 ++ /*
1923 ++ * After ->cpu is set up to a new value, task_access_lock(p, ...) can be
1924 ++ * successfully executed on another CPU. We must ensure that updates of
1925 ++ * per-task data have been completed by this moment.
1926 ++ */
1927 ++ smp_wmb();
1928 ++
1929 ++#ifdef CONFIG_THREAD_INFO_IN_TASK
1930 ++ WRITE_ONCE(p->cpu, cpu);
1931 ++#else
1932 ++ WRITE_ONCE(task_thread_info(p)->cpu, cpu);
1933 ++#endif
1934 ++#endif
1935 ++}
1936 ++
1937 ++static inline bool is_migration_disabled(struct task_struct *p)
1938 ++{
1939 ++#ifdef CONFIG_SMP
1940 ++ return p->migration_disabled;
1941 ++#else
1942 ++ return false;
1943 ++#endif
1944 ++}
1945 ++
1946 ++#define SCA_CHECK 0x01
1947 ++
1948 ++#ifdef CONFIG_SMP
1949 ++
1950 ++void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
1951 ++{
1952 ++#ifdef CONFIG_SCHED_DEBUG
1953 ++ unsigned int state = READ_ONCE(p->__state);
1954 ++
1955 ++ /*
1956 ++ * We should never call set_task_cpu() on a blocked task,
1957 ++ * ttwu() will sort out the placement.
1958 ++ */
1959 ++ WARN_ON_ONCE(state != TASK_RUNNING && state != TASK_WAKING && !p->on_rq);
1960 ++
1961 ++#ifdef CONFIG_LOCKDEP
1962 ++ /*
1963 ++ * The caller should hold either p->pi_lock or rq->lock, when changing
1964 ++ * a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.
1965 ++ *
1966 ++ * sched_move_task() holds both and thus holding either pins the cgroup,
1967 ++ * see task_group().
1968 ++ */
1969 ++ WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
1970 ++ lockdep_is_held(&task_rq(p)->lock)));
1971 ++#endif
1972 ++ /*
1973 ++ * Clearly, migrating tasks to offline CPUs is a fairly daft thing.
1974 ++ */
1975 ++ WARN_ON_ONCE(!cpu_online(new_cpu));
1976 ++
1977 ++ WARN_ON_ONCE(is_migration_disabled(p));
1978 ++#endif
1979 ++ if (task_cpu(p) == new_cpu)
1980 ++ return;
1981 ++ trace_sched_migrate_task(p, new_cpu);
1982 ++ rseq_migrate(p);
1983 ++ perf_event_task_migrate(p);
1984 ++
1985 ++ __set_task_cpu(p, new_cpu);
1986 ++}
1987 ++
1988 ++#define MDF_FORCE_ENABLED 0x80
1989 ++
1990 ++static void
1991 ++__do_set_cpus_ptr(struct task_struct *p, const struct cpumask *new_mask)
1992 ++{
1993 ++ /*
1994 ++ * This here violates the locking rules for affinity, since we're only
1995 ++ * supposed to change these variables while holding both rq->lock and
1996 ++ * p->pi_lock.
1997 ++ *
1998 ++ * HOWEVER, it magically works, because ttwu() is the only code that
1999 ++ * accesses these variables under p->pi_lock and only does so after
2000 ++ * smp_cond_load_acquire(&p->on_cpu, !VAL), and we're in __schedule()
2001 ++ * before finish_task().
2002 ++ *
2003 ++ * XXX do further audits, this smells like something putrid.
2004 ++ */
2005 ++ SCHED_WARN_ON(!p->on_cpu);
2006 ++ p->cpus_ptr = new_mask;
2007 ++}
2008 ++
2009 ++void migrate_disable(void)
2010 ++{
2011 ++ struct task_struct *p = current;
2012 ++ int cpu;
2013 ++
2014 ++ if (p->migration_disabled) {
2015 ++ p->migration_disabled++;
2016 ++ return;
2017 ++ }
2018 ++
2019 ++ preempt_disable();
2020 ++ cpu = smp_processor_id();
2021 ++ if (cpumask_test_cpu(cpu, &p->cpus_mask)) {
2022 ++ cpu_rq(cpu)->nr_pinned++;
2023 ++ p->migration_disabled = 1;
2024 ++ p->migration_flags &= ~MDF_FORCE_ENABLED;
2025 ++
2026 ++ /*
2027 ++ * Violates locking rules! see comment in __do_set_cpus_ptr().
2028 ++ */
2029 ++ if (p->cpus_ptr == &p->cpus_mask)
2030 ++ __do_set_cpus_ptr(p, cpumask_of(cpu));
2031 ++ }
2032 ++ preempt_enable();
2033 ++}
2034 ++EXPORT_SYMBOL_GPL(migrate_disable);
2035 ++
2036 ++void migrate_enable(void)
2037 ++{
2038 ++ struct task_struct *p = current;
2039 ++
2040 ++ if (0 == p->migration_disabled)
2041 ++ return;
2042 ++
2043 ++ if (p->migration_disabled > 1) {
2044 ++ p->migration_disabled--;
2045 ++ return;
2046 ++ }
2047 ++
2048 ++ /*
2049 ++ * Ensure stop_task runs either before or after this, and that
2050 ++ * __set_cpus_allowed_ptr(SCA_MIGRATE_ENABLE) doesn't schedule().
2051 ++ */
2052 ++ preempt_disable();
2053 ++ /*
2054 ++	 * Assumption: current should be running on an allowed CPU
2055 ++ */
2056 ++ WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &p->cpus_mask));
2057 ++ if (p->cpus_ptr != &p->cpus_mask)
2058 ++ __do_set_cpus_ptr(p, &p->cpus_mask);
2059 ++ /*
2060 ++ * Mustn't clear migration_disabled() until cpus_ptr points back at the
2061 ++ * regular cpus_mask, otherwise things that race (eg.
2062 ++ * select_fallback_rq) get confused.
2063 ++ */
2064 ++ barrier();
2065 ++ p->migration_disabled = 0;
2066 ++ this_rq()->nr_pinned--;
2067 ++ preempt_enable();
2068 ++}
2069 ++EXPORT_SYMBOL_GPL(migrate_enable);
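++
++/*
++ * Illustrative sketch (not code from this patch): a migrate_disable()
++ * section keeps the task on its current CPU without disabling preemption,
++ * and the calls nest, e.g.:
++ *
++ *	migrate_disable();
++ *	cpu = smp_processor_id();	// stable until migrate_enable()
++ *	do_per_cpu_work(cpu);		// hypothetical helper
++ *	migrate_enable();
++ */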
2070 ++
2071 ++static inline bool rq_has_pinned_tasks(struct rq *rq)
2072 ++{
2073 ++ return rq->nr_pinned;
2074 ++}
2075 ++
2076 ++/*
2077 ++ * Per-CPU kthreads are allowed to run on !active && online CPUs, see
2078 ++ * __set_cpus_allowed_ptr() and select_fallback_rq().
2079 ++ */
2080 ++static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
2081 ++{
2082 ++ /* When not in the task's cpumask, no point in looking further. */
2083 ++ if (!cpumask_test_cpu(cpu, p->cpus_ptr))
2084 ++ return false;
2085 ++
2086 ++ /* migrate_disabled() must be allowed to finish. */
2087 ++ if (is_migration_disabled(p))
2088 ++ return cpu_online(cpu);
2089 ++
2090 ++	/* Non-kernel threads are not allowed during either online or offline transitions. */
2091 ++ if (!(p->flags & PF_KTHREAD))
2092 ++ return cpu_active(cpu);
2093 ++
2094 ++ /* KTHREAD_IS_PER_CPU is always allowed. */
2095 ++ if (kthread_is_per_cpu(p))
2096 ++ return cpu_online(cpu);
2097 ++
2098 ++ /* Regular kernel threads don't get to stay during offline. */
2099 ++ if (cpu_dying(cpu))
2100 ++ return false;
2101 ++
2102 ++ /* But are allowed during online. */
2103 ++ return cpu_online(cpu);
2104 ++}
2105 ++
2106 ++/*
2107 ++ * This is how migration works:
2108 ++ *
2109 ++ * 1) we invoke migration_cpu_stop() on the target CPU using
2110 ++ * stop_one_cpu().
2111 ++ * 2) stopper starts to run (implicitly forcing the migrated thread
2112 ++ * off the CPU)
2113 ++ * 3) it checks whether the migrated task is still in the wrong runqueue.
2114 ++ * 4) if it's in the wrong runqueue then the migration thread removes
2115 ++ * it and puts it into the right queue.
2116 ++ * 5) stopper completes and stop_one_cpu() returns and the migration
2117 ++ * is done.
2118 ++ */
2119 ++
2120 ++/*
2121 ++ * move_queued_task - move a queued task to new rq.
2122 ++ *
2123 ++ * Returns (locked) new rq. Old rq's lock is released.
2124 ++ */
2125 ++static struct rq *move_queued_task(struct rq *rq, struct task_struct *p, int
2126 ++ new_cpu)
2127 ++{
2128 ++ lockdep_assert_held(&rq->lock);
2129 ++
2130 ++ WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
2131 ++ dequeue_task(p, rq, 0);
2132 ++ set_task_cpu(p, new_cpu);
2133 ++ raw_spin_unlock(&rq->lock);
2134 ++
2135 ++ rq = cpu_rq(new_cpu);
2136 ++
2137 ++ raw_spin_lock(&rq->lock);
2138 ++ BUG_ON(task_cpu(p) != new_cpu);
2139 ++ sched_task_sanity_check(p, rq);
2140 ++ enqueue_task(p, rq, 0);
2141 ++ p->on_rq = TASK_ON_RQ_QUEUED;
2142 ++ check_preempt_curr(rq);
2143 ++
2144 ++ return rq;
2145 ++}
2146 ++
2147 ++struct migration_arg {
2148 ++ struct task_struct *task;
2149 ++ int dest_cpu;
2150 ++};
2151 ++
2152 ++/*
2153 ++ * Move a (non-current) task off this CPU, onto the destination CPU. We're doing
2154 ++ * this because either it can't run here any more (set_cpus_allowed() moved it
2155 ++ * away from this CPU, or the CPU is going down), or because we're
2156 ++ * attempting to rebalance this task on exec (sched_exec).
2157 ++ *
2158 ++ * So we race with normal scheduler movements, but that's OK, as long
2159 ++ * as the task is no longer on this CPU.
2160 ++ */
2161 ++static struct rq *__migrate_task(struct rq *rq, struct task_struct *p, int
2162 ++ dest_cpu)
2163 ++{
2164 ++ /* Affinity changed (again). */
2165 ++ if (!is_cpu_allowed(p, dest_cpu))
2166 ++ return rq;
2167 ++
2168 ++ update_rq_clock(rq);
2169 ++ return move_queued_task(rq, p, dest_cpu);
2170 ++}
2171 ++
2172 ++/*
2173 ++ * migration_cpu_stop - this will be executed by a highprio stopper thread
2174 ++ * and performs thread migration by bumping thread off CPU then
2175 ++ * 'pushing' onto another runqueue.
2176 ++ */
2177 ++static int migration_cpu_stop(void *data)
2178 ++{
2179 ++ struct migration_arg *arg = data;
2180 ++ struct task_struct *p = arg->task;
2181 ++ struct rq *rq = this_rq();
2182 ++ unsigned long flags;
2183 ++
2184 ++ /*
2185 ++ * The original target CPU might have gone down and we might
2186 ++ * be on another CPU but it doesn't matter.
2187 ++ */
2188 ++ local_irq_save(flags);
2189 ++ /*
2190 ++ * We need to explicitly wake pending tasks before running
2191 ++ * __migrate_task() such that we will not miss enforcing cpus_ptr
2192 ++ * during wakeups, see set_cpus_allowed_ptr()'s TASK_WAKING test.
2193 ++ */
2194 ++ flush_smp_call_function_from_idle();
2195 ++
2196 ++ raw_spin_lock(&p->pi_lock);
2197 ++ raw_spin_lock(&rq->lock);
2198 ++ /*
2199 ++ * If task_rq(p) != rq, it cannot be migrated here, because we're
2200 ++	 * holding rq->lock; if p->on_rq == 0 it cannot get enqueued because
2201 ++ * we're holding p->pi_lock.
2202 ++ */
2203 ++ if (task_rq(p) == rq && task_on_rq_queued(p))
2204 ++ rq = __migrate_task(rq, p, arg->dest_cpu);
2205 ++ raw_spin_unlock(&rq->lock);
2206 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
2207 ++
2208 ++ return 0;
2209 ++}
2210 ++
2211 ++static inline void
2212 ++set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_mask)
2213 ++{
2214 ++ cpumask_copy(&p->cpus_mask, new_mask);
2215 ++ p->nr_cpus_allowed = cpumask_weight(new_mask);
2216 ++}
2217 ++
2218 ++static void
2219 ++__do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
2220 ++{
2221 ++ lockdep_assert_held(&p->pi_lock);
2222 ++ set_cpus_allowed_common(p, new_mask);
2223 ++}
2224 ++
2225 ++void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
2226 ++{
2227 ++ __do_set_cpus_allowed(p, new_mask);
2228 ++}
2229 ++
2230 ++#endif
2231 ++
2232 ++/**
2233 ++ * task_curr - is this task currently executing on a CPU?
2234 ++ * @p: the task in question.
2235 ++ *
2236 ++ * Return: 1 if the task is currently executing. 0 otherwise.
2237 ++ */
2238 ++inline int task_curr(const struct task_struct *p)
2239 ++{
2240 ++ return cpu_curr(task_cpu(p)) == p;
2241 ++}
2242 ++
2243 ++#ifdef CONFIG_SMP
2244 ++/*
2245 ++ * wait_task_inactive - wait for a thread to unschedule.
2246 ++ *
2247 ++ * If @match_state is nonzero, it's the @p->state value just checked and
2248 ++ * not expected to change. If it changes, i.e. @p might have woken up,
2249 ++ * then return zero. When we succeed in waiting for @p to be off its CPU,
2250 ++ * we return a positive number (its total switch count). If a second call
2251 ++ * a short while later returns the same number, the caller can be sure that
2252 ++ * @p has remained unscheduled the whole time.
2253 ++ *
2254 ++ * The caller must ensure that the task *will* unschedule sometime soon,
2255 ++ * else this function might spin for a *long* time. This function can't
2256 ++ * be called with interrupts off, or it may introduce deadlock with
2257 ++ * smp_call_function() if an IPI is sent by the same process we are
2258 ++ * waiting to become inactive.
2259 ++ */
2260 ++unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state)
2261 ++{
2262 ++ unsigned long flags;
2263 ++ bool running, on_rq;
2264 ++ unsigned long ncsw;
2265 ++ struct rq *rq;
2266 ++ raw_spinlock_t *lock;
2267 ++
2268 ++ for (;;) {
2269 ++ rq = task_rq(p);
2270 ++
2271 ++ /*
2272 ++ * If the task is actively running on another CPU
2273 ++ * still, just relax and busy-wait without holding
2274 ++ * any locks.
2275 ++ *
2276 ++ * NOTE! Since we don't hold any locks, it's not
2277 ++ * even sure that "rq" stays as the right runqueue!
2278 ++ * But we don't care, since this will return false
2279 ++ * if the runqueue has changed and p is actually now
2280 ++ * running somewhere else!
2281 ++ */
2282 ++ while (task_running(p) && p == rq->curr) {
2283 ++ if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
2284 ++ return 0;
2285 ++ cpu_relax();
2286 ++ }
2287 ++
2288 ++ /*
2289 ++ * Ok, time to look more closely! We need the rq
2290 ++ * lock now, to be *sure*. If we're wrong, we'll
2291 ++ * just go back and repeat.
2292 ++ */
2293 ++ task_access_lock_irqsave(p, &lock, &flags);
2294 ++ trace_sched_wait_task(p);
2295 ++ running = task_running(p);
2296 ++ on_rq = p->on_rq;
2297 ++ ncsw = 0;
2298 ++ if (!match_state || READ_ONCE(p->__state) == match_state)
2299 ++ ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
2300 ++ task_access_unlock_irqrestore(p, lock, &flags);
2301 ++
2302 ++ /*
2303 ++ * If it changed from the expected state, bail out now.
2304 ++ */
2305 ++ if (unlikely(!ncsw))
2306 ++ break;
2307 ++
2308 ++ /*
2309 ++ * Was it really running after all now that we
2310 ++ * checked with the proper locks actually held?
2311 ++ *
2312 ++ * Oops. Go back and try again..
2313 ++ */
2314 ++ if (unlikely(running)) {
2315 ++ cpu_relax();
2316 ++ continue;
2317 ++ }
2318 ++
2319 ++ /*
2320 ++ * It's not enough that it's not actively running,
2321 ++ * it must be off the runqueue _entirely_, and not
2322 ++ * preempted!
2323 ++ *
2324 ++ * So if it was still runnable (but just not actively
2325 ++ * running right now), it's preempted, and we should
2326 ++ * yield - it could be a while.
2327 ++ */
2328 ++ if (unlikely(on_rq)) {
2329 ++ ktime_t to = NSEC_PER_SEC / HZ;
2330 ++
2331 ++ set_current_state(TASK_UNINTERRUPTIBLE);
2332 ++ schedule_hrtimeout(&to, HRTIMER_MODE_REL);
2333 ++ continue;
2334 ++ }
2335 ++
2336 ++ /*
2337 ++ * Ahh, all good. It wasn't running, and it wasn't
2338 ++ * runnable, which means that it will never become
2339 ++ * running in the future either. We're all done!
2340 ++ */
2341 ++ break;
2342 ++ }
2343 ++
2344 ++ return ncsw;
2345 ++}
2346 ++
2347 ++/***
2348 ++ * kick_process - kick a running thread to enter/exit the kernel
2349 ++ * @p: the to-be-kicked thread
2350 ++ *
2351 ++ * Cause a process which is running on another CPU to enter
2352 ++ * kernel-mode, without any delay. (to get signals handled.)
2353 ++ *
2354 ++ * NOTE: this function doesn't have to take the runqueue lock,
2355 ++ * because all it wants to ensure is that the remote task enters
2356 ++ * the kernel. If the IPI races and the task has been migrated
2357 ++ * to another CPU then no harm is done and the purpose has been
2358 ++ * achieved as well.
2359 ++ */
2360 ++void kick_process(struct task_struct *p)
2361 ++{
2362 ++ int cpu;
2363 ++
2364 ++ preempt_disable();
2365 ++ cpu = task_cpu(p);
2366 ++ if ((cpu != smp_processor_id()) && task_curr(p))
2367 ++ smp_send_reschedule(cpu);
2368 ++ preempt_enable();
2369 ++}
2370 ++EXPORT_SYMBOL_GPL(kick_process);
2371 ++
2372 ++/*
2373 ++ * ->cpus_ptr is protected by both rq->lock and p->pi_lock
2374 ++ *
2375 ++ * A few notes on cpu_active vs cpu_online:
2376 ++ *
2377 ++ * - cpu_active must be a subset of cpu_online
2378 ++ *
2379 ++ * - on CPU-up we allow per-CPU kthreads on the online && !active CPU,
2380 ++ * see __set_cpus_allowed_ptr(). At this point the newly online
2381 ++ * CPU isn't yet part of the sched domains, and balancing will not
2382 ++ * see it.
2383 ++ *
2384 ++ * - on cpu-down we clear cpu_active() to mask the sched domains and
2385 ++ *   keep the load balancer from placing new tasks on the to-be-removed
2386 ++ * CPU. Existing tasks will remain running there and will be taken
2387 ++ * off.
2388 ++ *
2389 ++ * This means that fallback selection must not select !active CPUs,
2390 ++ * and can assume that any active CPU must be online. Conversely,
2391 ++ * select_task_rq() below may allow selection of !active CPUs in order
2392 ++ * to satisfy the above rules.
2393 ++ */
2394 ++static int select_fallback_rq(int cpu, struct task_struct *p)
2395 ++{
2396 ++ int nid = cpu_to_node(cpu);
2397 ++ const struct cpumask *nodemask = NULL;
2398 ++ enum { cpuset, possible, fail } state = cpuset;
2399 ++ int dest_cpu;
2400 ++
2401 ++ /*
2402 ++ * If the node that the CPU is on has been offlined, cpu_to_node()
2403 ++ * will return -1. There is no CPU on the node, and we should
2404 ++	 * select a CPU on another node.
2405 ++ */
2406 ++ if (nid != -1) {
2407 ++ nodemask = cpumask_of_node(nid);
2408 ++
2409 ++ /* Look for allowed, online CPU in same node. */
2410 ++ for_each_cpu(dest_cpu, nodemask) {
2411 ++ if (!cpu_active(dest_cpu))
2412 ++ continue;
2413 ++ if (cpumask_test_cpu(dest_cpu, p->cpus_ptr))
2414 ++ return dest_cpu;
2415 ++ }
2416 ++ }
2417 ++
2418 ++ for (;;) {
2419 ++ /* Any allowed, online CPU? */
2420 ++ for_each_cpu(dest_cpu, p->cpus_ptr) {
2421 ++ if (!is_cpu_allowed(p, dest_cpu))
2422 ++ continue;
2423 ++ goto out;
2424 ++ }
2425 ++
2426 ++ /* No more Mr. Nice Guy. */
2427 ++ switch (state) {
2428 ++ case cpuset:
2429 ++ if (IS_ENABLED(CONFIG_CPUSETS)) {
2430 ++ cpuset_cpus_allowed_fallback(p);
2431 ++ state = possible;
2432 ++ break;
2433 ++ }
2434 ++ fallthrough;
2435 ++ case possible:
2436 ++ /*
2437 ++ * XXX When called from select_task_rq() we only
2438 ++ * hold p->pi_lock and again violate locking order.
2439 ++ *
2440 ++ * More yuck to audit.
2441 ++ */
2442 ++ do_set_cpus_allowed(p, cpu_possible_mask);
2443 ++ state = fail;
2444 ++ break;
2445 ++
2446 ++ case fail:
2447 ++ BUG();
2448 ++ break;
2449 ++ }
2450 ++ }
2451 ++
2452 ++out:
2453 ++ if (state != cpuset) {
2454 ++ /*
2455 ++ * Don't tell them about moving exiting tasks or
2456 ++ * kernel threads (both mm NULL), since they never
2457 ++		 * leave the kernel.
2458 ++ */
2459 ++ if (p->mm && printk_ratelimit()) {
2460 ++ printk_deferred("process %d (%s) no longer affine to cpu%d\n",
2461 ++ task_pid_nr(p), p->comm, cpu);
2462 ++ }
2463 ++ }
2464 ++
2465 ++ return dest_cpu;
2466 ++}
2467 ++
2468 ++static inline int select_task_rq(struct task_struct *p)
2469 ++{
2470 ++ cpumask_t chk_mask, tmp;
2471 ++
2472 ++ if (unlikely(!cpumask_and(&chk_mask, p->cpus_ptr, cpu_active_mask)))
2473 ++ return select_fallback_rq(task_cpu(p), p);
2474 ++
2475 ++ if (
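++	/*
++	 * Selection order below (as this series appears to arrange it; an
++	 * editorial reading, not text from the patch): prefer a fully idle
++	 * SMT group, then any idle CPU, then a CPU whose run queue watermark
++	 * is below this task's priority; otherwise fall back to the nearest
++	 * allowed CPU by topology distance.
++	 */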
2476 ++#ifdef CONFIG_SCHED_SMT
2477 ++ cpumask_and(&tmp, &chk_mask, &sched_sg_idle_mask) ||
2478 ++#endif
2479 ++ cpumask_and(&tmp, &chk_mask, sched_rq_watermark) ||
2480 ++ cpumask_and(&tmp, &chk_mask,
2481 ++ sched_rq_watermark + SCHED_BITS - task_sched_prio(p)))
2482 ++ return best_mask_cpu(task_cpu(p), &tmp);
2483 ++
2484 ++ return best_mask_cpu(task_cpu(p), &chk_mask);
2485 ++}
2486 ++
2487 ++void sched_set_stop_task(int cpu, struct task_struct *stop)
2488 ++{
2489 ++ static struct lock_class_key stop_pi_lock;
2490 ++ struct sched_param stop_param = { .sched_priority = STOP_PRIO };
2491 ++ struct sched_param start_param = { .sched_priority = 0 };
2492 ++ struct task_struct *old_stop = cpu_rq(cpu)->stop;
2493 ++
2494 ++ if (stop) {
2495 ++ /*
2496 ++		 * Make it appear like a SCHED_FIFO task; it's something
2497 ++ * userspace knows about and won't get confused about.
2498 ++ *
2499 ++ * Also, it will make PI more or less work without too
2500 ++ * much confusion -- but then, stop work should not
2501 ++ * rely on PI working anyway.
2502 ++ */
2503 ++ sched_setscheduler_nocheck(stop, SCHED_FIFO, &stop_param);
2504 ++
2505 ++ /*
2506 ++ * The PI code calls rt_mutex_setprio() with ->pi_lock held to
2507 ++ * adjust the effective priority of a task. As a result,
2508 ++ * rt_mutex_setprio() can trigger (RT) balancing operations,
2509 ++ * which can then trigger wakeups of the stop thread to push
2510 ++ * around the current task.
2511 ++ *
2512 ++ * The stop task itself will never be part of the PI-chain, it
2513 ++ * never blocks, therefore that ->pi_lock recursion is safe.
2514 ++ * Tell lockdep about this by placing the stop->pi_lock in its
2515 ++ * own class.
2516 ++ */
2517 ++ lockdep_set_class(&stop->pi_lock, &stop_pi_lock);
2518 ++ }
2519 ++
2520 ++ cpu_rq(cpu)->stop = stop;
2521 ++
2522 ++ if (old_stop) {
2523 ++ /*
2524 ++ * Reset it back to a normal scheduling policy so that
2525 ++ * it can die in pieces.
2526 ++ */
2527 ++ sched_setscheduler_nocheck(old_stop, SCHED_NORMAL, &start_param);
2528 ++ }
2529 ++}
2530 ++
2531 ++/*
2532 ++ * Change a given task's CPU affinity. Migrate the thread to a
2533 ++ * proper CPU and schedule it away if the CPU it's executing on
2534 ++ * is removed from the allowed bitmask.
2535 ++ *
2536 ++ * NOTE: the caller must have a valid reference to the task, the
2537 ++ * task must not exit() & deallocate itself prematurely. The
2538 ++ * call is not atomic; no spinlocks may be held.
2539 ++ */
2540 ++static int __set_cpus_allowed_ptr(struct task_struct *p,
2541 ++ const struct cpumask *new_mask,
2542 ++ u32 flags)
2543 ++{
2544 ++ const struct cpumask *cpu_valid_mask = cpu_active_mask;
2545 ++ int dest_cpu;
2546 ++ unsigned long irq_flags;
2547 ++ struct rq *rq;
2548 ++ raw_spinlock_t *lock;
2549 ++ int ret = 0;
2550 ++
2551 ++ raw_spin_lock_irqsave(&p->pi_lock, irq_flags);
2552 ++ rq = __task_access_lock(p, &lock);
2553 ++
2554 ++ if (p->flags & PF_KTHREAD || is_migration_disabled(p)) {
2555 ++ /*
2556 ++ * Kernel threads are allowed on online && !active CPUs,
2557 ++ * however, during cpu-hot-unplug, even these might get pushed
2558 ++ * away if not KTHREAD_IS_PER_CPU.
2559 ++ *
2560 ++ * Specifically, migration_disabled() tasks must not fail the
2561 ++ * cpumask_any_and_distribute() pick below, esp. so on
2562 ++ * SCA_MIGRATE_ENABLE, otherwise we'll not call
2563 ++ * set_cpus_allowed_common() and actually reset p->cpus_ptr.
2564 ++ */
2565 ++ cpu_valid_mask = cpu_online_mask;
2566 ++ }
2567 ++
2568 ++ /*
2569 ++ * Must re-check here, to close a race against __kthread_bind(),
2570 ++ * sched_setaffinity() is not guaranteed to observe the flag.
2571 ++ */
2572 ++ if ((flags & SCA_CHECK) && (p->flags & PF_NO_SETAFFINITY)) {
2573 ++ ret = -EINVAL;
2574 ++ goto out;
2575 ++ }
2576 ++
2577 ++ if (cpumask_equal(&p->cpus_mask, new_mask))
2578 ++ goto out;
2579 ++
2580 ++ dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
2581 ++ if (dest_cpu >= nr_cpu_ids) {
2582 ++ ret = -EINVAL;
2583 ++ goto out;
2584 ++ }
2585 ++
2586 ++ __do_set_cpus_allowed(p, new_mask);
2587 ++
2588 ++ /* Can the task run on the task's current CPU? If so, we're done */
2589 ++ if (cpumask_test_cpu(task_cpu(p), new_mask))
2590 ++ goto out;
2591 ++
2592 ++ if (p->migration_disabled) {
2593 ++ if (likely(p->cpus_ptr != &p->cpus_mask))
2594 ++ __do_set_cpus_ptr(p, &p->cpus_mask);
2595 ++ p->migration_disabled = 0;
2596 ++ p->migration_flags |= MDF_FORCE_ENABLED;
2597 ++ /* When p is migrate_disabled, rq->lock should be held */
2598 ++ rq->nr_pinned--;
2599 ++ }
2600 ++
2601 ++ if (task_running(p) || READ_ONCE(p->__state) == TASK_WAKING) {
2602 ++ struct migration_arg arg = { p, dest_cpu };
2603 ++
2604 ++ /* Need help from migration thread: drop lock and wait. */
2605 ++ __task_access_unlock(p, lock);
2606 ++ raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);
2607 ++ stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
2608 ++ return 0;
2609 ++ }
2610 ++ if (task_on_rq_queued(p)) {
2611 ++ /*
2612 ++ * OK, since we're going to drop the lock immediately
2613 ++ * afterwards anyway.
2614 ++ */
2615 ++ update_rq_clock(rq);
2616 ++ rq = move_queued_task(rq, p, dest_cpu);
2617 ++ lock = &rq->lock;
2618 ++ }
2619 ++
2620 ++out:
2621 ++ __task_access_unlock(p, lock);
2622 ++ raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);
2623 ++
2624 ++ return ret;
2625 ++}
2626 ++
2627 ++int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
2628 ++{
2629 ++ return __set_cpus_allowed_ptr(p, new_mask, 0);
2630 ++}
2631 ++EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
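++
++/*
++ * Illustrative sketch (not code from this patch): a caller that owns a
++ * kthread might pin it to one CPU and later restore the full mask:
++ *
++ *	set_cpus_allowed_ptr(worker, cpumask_of(2));
++ *	...
++ *	set_cpus_allowed_ptr(worker, cpu_possible_mask);
++ *
++ * 'worker' is a hypothetical task pointer held by the caller.
++ */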
2632 ++
2633 ++#else /* CONFIG_SMP */
2634 ++
2635 ++static inline int select_task_rq(struct task_struct *p)
2636 ++{
2637 ++ return 0;
2638 ++}
2639 ++
2640 ++static inline int
2641 ++__set_cpus_allowed_ptr(struct task_struct *p,
2642 ++ const struct cpumask *new_mask,
2643 ++ u32 flags)
2644 ++{
2645 ++ return set_cpus_allowed_ptr(p, new_mask);
2646 ++}
2647 ++
2648 ++static inline bool rq_has_pinned_tasks(struct rq *rq)
2649 ++{
2650 ++ return false;
2651 ++}
2652 ++
2653 ++#endif /* !CONFIG_SMP */
2654 ++
2655 ++static void
2656 ++ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
2657 ++{
2658 ++ struct rq *rq;
2659 ++
2660 ++ if (!schedstat_enabled())
2661 ++ return;
2662 ++
2663 ++ rq = this_rq();
2664 ++
2665 ++#ifdef CONFIG_SMP
2666 ++ if (cpu == rq->cpu)
2667 ++ __schedstat_inc(rq->ttwu_local);
2668 ++ else {
2669 ++ /** Alt schedule FW ToDo:
2670 ++ * How to do ttwu_wake_remote
2671 ++ */
2672 ++ }
2673 ++#endif /* CONFIG_SMP */
2674 ++
2675 ++ __schedstat_inc(rq->ttwu_count);
2676 ++}
2677 ++
2678 ++/*
2679 ++ * Mark the task runnable and perform wakeup-preemption.
2680 ++ */
2681 ++static inline void
2682 ++ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
2683 ++{
2684 ++ check_preempt_curr(rq);
2685 ++ WRITE_ONCE(p->__state, TASK_RUNNING);
2686 ++ trace_sched_wakeup(p);
2687 ++}
2688 ++
2689 ++static inline void
2690 ++ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)
2691 ++{
2692 ++ if (p->sched_contributes_to_load)
2693 ++ rq->nr_uninterruptible--;
2694 ++
2695 ++ if (
2696 ++#ifdef CONFIG_SMP
2697 ++ !(wake_flags & WF_MIGRATED) &&
2698 ++#endif
2699 ++ p->in_iowait) {
2700 ++ delayacct_blkio_end(p);
2701 ++ atomic_dec(&task_rq(p)->nr_iowait);
2702 ++ }
2703 ++
2704 ++ activate_task(p, rq);
2705 ++ ttwu_do_wakeup(rq, p, 0);
2706 ++}
2707 ++
2708 ++/*
2709 ++ * Consider @p being inside a wait loop:
2710 ++ *
2711 ++ * for (;;) {
2712 ++ * set_current_state(TASK_UNINTERRUPTIBLE);
2713 ++ *
2714 ++ * if (CONDITION)
2715 ++ * break;
2716 ++ *
2717 ++ * schedule();
2718 ++ * }
2719 ++ * __set_current_state(TASK_RUNNING);
2720 ++ *
2721 ++ * between set_current_state() and schedule(). In this case @p is still
2722 ++ * runnable, so all that needs doing is change p->state back to TASK_RUNNING in
2723 ++ * an atomic manner.
2724 ++ *
2725 ++ * By taking task_rq(p)->lock we serialize against schedule(), if @p->on_rq
2726 ++ * then schedule() must still happen and p->state can be changed to
2727 ++ * TASK_RUNNING. Otherwise we lost the race, schedule() has happened, and we
2728 ++ * need to do a full wakeup with enqueue.
2729 ++ *
2730 ++ * Returns: %true when the wakeup is done,
2731 ++ * %false otherwise.
2732 ++ */
2733 ++static int ttwu_runnable(struct task_struct *p, int wake_flags)
2734 ++{
2735 ++ struct rq *rq;
2736 ++ raw_spinlock_t *lock;
2737 ++ int ret = 0;
2738 ++
2739 ++ rq = __task_access_lock(p, &lock);
2740 ++ if (task_on_rq_queued(p)) {
2741 ++ /* check_preempt_curr() may use rq clock */
2742 ++ update_rq_clock(rq);
2743 ++ ttwu_do_wakeup(rq, p, wake_flags);
2744 ++ ret = 1;
2745 ++ }
2746 ++ __task_access_unlock(p, lock);
2747 ++
2748 ++ return ret;
2749 ++}
2750 ++
2751 ++#ifdef CONFIG_SMP
2752 ++void sched_ttwu_pending(void *arg)
2753 ++{
2754 ++ struct llist_node *llist = arg;
2755 ++ struct rq *rq = this_rq();
2756 ++ struct task_struct *p, *t;
2757 ++ struct rq_flags rf;
2758 ++
2759 ++ if (!llist)
2760 ++ return;
2761 ++
2762 ++ /*
2763 ++	 * rq::ttwu_pending is a racy indication of outstanding wakeups.
2764 ++	 * Races are such that false negatives are possible, since they
2765 ++	 * are shorter-lived than false positives would be.
2766 ++ */
2767 ++ WRITE_ONCE(rq->ttwu_pending, 0);
2768 ++
2769 ++ rq_lock_irqsave(rq, &rf);
2770 ++ update_rq_clock(rq);
2771 ++
2772 ++ llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
2773 ++ if (WARN_ON_ONCE(p->on_cpu))
2774 ++ smp_cond_load_acquire(&p->on_cpu, !VAL);
2775 ++
2776 ++ if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
2777 ++ set_task_cpu(p, cpu_of(rq));
2778 ++
2779 ++ ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0);
2780 ++ }
2781 ++
2782 ++ rq_unlock_irqrestore(rq, &rf);
2783 ++}
2784 ++
2785 ++void send_call_function_single_ipi(int cpu)
2786 ++{
2787 ++ struct rq *rq = cpu_rq(cpu);
2788 ++
2789 ++ if (!set_nr_if_polling(rq->idle))
2790 ++ arch_send_call_function_single_ipi(cpu);
2791 ++ else
2792 ++ trace_sched_wake_idle_without_ipi(cpu);
2793 ++}
2794 ++
2795 ++/*
2796 ++ * Queue a task on the target CPU's wake_list and wake the CPU via IPI if
2797 ++ * necessary. The wakee CPU, on receipt of the IPI, will queue the task
2798 ++ * via sched_ttwu_pending() for activation, so the wakee incurs the cost
2799 ++ * of the wakeup instead of the waker.
2800 ++ */
2801 ++static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
2802 ++{
2803 ++ struct rq *rq = cpu_rq(cpu);
2804 ++
2805 ++ p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
2806 ++
2807 ++ WRITE_ONCE(rq->ttwu_pending, 1);
2808 ++ __smp_call_single_queue(cpu, &p->wake_entry.llist);
2809 ++}
2810 ++
2811 ++static inline bool ttwu_queue_cond(int cpu, int wake_flags)
2812 ++{
2813 ++ /*
2814 ++ * Do not complicate things with the async wake_list while the CPU is
2815 ++ * in hotplug state.
2816 ++ */
2817 ++ if (!cpu_active(cpu))
2818 ++ return false;
2819 ++
2820 ++ /*
2821 ++ * If the CPU does not share cache, then queue the task on the
2822 ++	 * remote rq's wakelist to avoid accessing remote data.
2823 ++ */
2824 ++ if (!cpus_share_cache(smp_processor_id(), cpu))
2825 ++ return true;
2826 ++
2827 ++ /*
2828 ++ * If the task is descheduling and the only running task on the
2829 ++ * CPU then use the wakelist to offload the task activation to
2830 ++ * the soon-to-be-idle CPU as the current CPU is likely busy.
2831 ++ * nr_running is checked to avoid unnecessary task stacking.
2832 ++ */
2833 ++ if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
2834 ++ return true;
2835 ++
2836 ++ return false;
2837 ++}
2838 ++
2839 ++static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
2840 ++{
2841 ++ if (__is_defined(ALT_SCHED_TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
2842 ++ if (WARN_ON_ONCE(cpu == smp_processor_id()))
2843 ++ return false;
2844 ++
2845 ++ sched_clock_cpu(cpu); /* Sync clocks across CPUs */
2846 ++ __ttwu_queue_wakelist(p, cpu, wake_flags);
2847 ++ return true;
2848 ++ }
2849 ++
2850 ++ return false;
2851 ++}
2852 ++
2853 ++void wake_up_if_idle(int cpu)
2854 ++{
2855 ++ struct rq *rq = cpu_rq(cpu);
2856 ++ unsigned long flags;
2857 ++
2858 ++ rcu_read_lock();
2859 ++
2860 ++ if (!is_idle_task(rcu_dereference(rq->curr)))
2861 ++ goto out;
2862 ++
2863 ++ if (set_nr_if_polling(rq->idle)) {
2864 ++ trace_sched_wake_idle_without_ipi(cpu);
2865 ++ } else {
2866 ++ raw_spin_lock_irqsave(&rq->lock, flags);
2867 ++ if (is_idle_task(rq->curr))
2868 ++ smp_send_reschedule(cpu);
2869 ++ /* Else CPU is not idle, do nothing here */
2870 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
2871 ++ }
2872 ++
2873 ++out:
2874 ++ rcu_read_unlock();
2875 ++}
2876 ++
2877 ++bool cpus_share_cache(int this_cpu, int that_cpu)
2878 ++{
2879 ++ return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
2880 ++}
2881 ++#else /* !CONFIG_SMP */
2882 ++
2883 ++static inline bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
2884 ++{
2885 ++ return false;
2886 ++}
2887 ++
2888 ++#endif /* CONFIG_SMP */
2889 ++
2890 ++static inline void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
2891 ++{
2892 ++ struct rq *rq = cpu_rq(cpu);
2893 ++
2894 ++ if (ttwu_queue_wakelist(p, cpu, wake_flags))
2895 ++ return;
2896 ++
2897 ++ raw_spin_lock(&rq->lock);
2898 ++ update_rq_clock(rq);
2899 ++ ttwu_do_activate(rq, p, wake_flags);
2900 ++ raw_spin_unlock(&rq->lock);
2901 ++}
2902 ++
2903 ++/*
2904 ++ * Notes on Program-Order guarantees on SMP systems.
2905 ++ *
2906 ++ * MIGRATION
2907 ++ *
2908 ++ * The basic program-order guarantee on SMP systems is that when a task [t]
2909 ++ * migrates, all its activity on its old CPU [c0] happens-before any subsequent
2910 ++ * execution on its new CPU [c1].
2911 ++ *
2912 ++ * For migration (of runnable tasks) this is provided by the following means:
2913 ++ *
2914 ++ * A) UNLOCK of the rq(c0)->lock scheduling out task t
2915 ++ * B) migration for t is required to synchronize *both* rq(c0)->lock and
2916 ++ * rq(c1)->lock (if not at the same time, then in that order).
2917 ++ * C) LOCK of the rq(c1)->lock scheduling in task
2918 ++ *
2919 ++ * Transitivity guarantees that B happens after A and C after B.
2920 ++ * Note: we only require RCpc transitivity.
2921 ++ * Note: the CPU doing B need not be c0 or c1
2922 ++ *
2923 ++ * Example:
2924 ++ *
2925 ++ * CPU0 CPU1 CPU2
2926 ++ *
2927 ++ * LOCK rq(0)->lock
2928 ++ * sched-out X
2929 ++ * sched-in Y
2930 ++ * UNLOCK rq(0)->lock
2931 ++ *
2932 ++ * LOCK rq(0)->lock // orders against CPU0
2933 ++ * dequeue X
2934 ++ * UNLOCK rq(0)->lock
2935 ++ *
2936 ++ * LOCK rq(1)->lock
2937 ++ * enqueue X
2938 ++ * UNLOCK rq(1)->lock
2939 ++ *
2940 ++ * LOCK rq(1)->lock // orders against CPU2
2941 ++ * sched-out Z
2942 ++ * sched-in X
2943 ++ * UNLOCK rq(1)->lock
2944 ++ *
2945 ++ *
2946 ++ * BLOCKING -- aka. SLEEP + WAKEUP
2947 ++ *
2948 ++ * For blocking we (obviously) need to provide the same guarantee as for
2949 ++ * migration. However the means are completely different as there is no lock
2950 ++ * chain to provide order. Instead we do:
2951 ++ *
2952 ++ * 1) smp_store_release(X->on_cpu, 0) -- finish_task()
2953 ++ * 2) smp_cond_load_acquire(!X->on_cpu) -- try_to_wake_up()
2954 ++ *
2955 ++ * Example:
2956 ++ *
2957 ++ * CPU0 (schedule) CPU1 (try_to_wake_up) CPU2 (schedule)
2958 ++ *
2959 ++ * LOCK rq(0)->lock LOCK X->pi_lock
2960 ++ * dequeue X
2961 ++ * sched-out X
2962 ++ * smp_store_release(X->on_cpu, 0);
2963 ++ *
2964 ++ * smp_cond_load_acquire(&X->on_cpu, !VAL);
2965 ++ * X->state = WAKING
2966 ++ * set_task_cpu(X,2)
2967 ++ *
2968 ++ * LOCK rq(2)->lock
2969 ++ * enqueue X
2970 ++ * X->state = RUNNING
2971 ++ * UNLOCK rq(2)->lock
2972 ++ *
2973 ++ * LOCK rq(2)->lock // orders against CPU1
2974 ++ * sched-out Z
2975 ++ * sched-in X
2976 ++ * UNLOCK rq(2)->lock
2977 ++ *
2978 ++ * UNLOCK X->pi_lock
2979 ++ * UNLOCK rq(0)->lock
2980 ++ *
2981 ++ *
2982 ++ * However; for wakeups there is a second guarantee we must provide, namely we
2983 ++ * must observe the state that led to our wakeup. That is, not only must our
2984 ++ * task observe its own prior state, it must also observe the stores prior to
2985 ++ * its wakeup.
2986 ++ *
2987 ++ * This means that any means of doing remote wakeups must order the CPU doing
2988 ++ * the wakeup against the CPU the task is going to end up running on. This,
2989 ++ * however, is already required for the regular Program-Order guarantee above,
2990 ++ * since the waking CPU is the one issuing the ACQUIRE (smp_cond_load_acquire).
2991 ++ *
2992 ++ */
2993 ++
2994 ++/**
2995 ++ * try_to_wake_up - wake up a thread
2996 ++ * @p: the thread to be awakened
2997 ++ * @state: the mask of task states that can be woken
2998 ++ * @wake_flags: wake modifier flags (WF_*)
2999 ++ *
3000 ++ * Conceptually does:
3001 ++ *
3002 ++ * If (@state & @p->state) @p->state = TASK_RUNNING.
3003 ++ *
3004 ++ * If the task was not queued/runnable, also place it back on a runqueue.
3005 ++ *
3006 ++ * This function is atomic against schedule() which would dequeue the task.
3007 ++ *
3008 ++ * It issues a full memory barrier before accessing @p->state, see the comment
3009 ++ * with set_current_state().
3010 ++ *
3011 ++ * Uses p->pi_lock to serialize against concurrent wake-ups.
3012 ++ *
3013 ++ * Relies on p->pi_lock stabilizing:
3014 ++ * - p->sched_class
3015 ++ * - p->cpus_ptr
3016 ++ * - p->sched_task_group
3017 ++ * in order to do migration, see its use of select_task_rq()/set_task_cpu().
3018 ++ *
3019 ++ * Tries really hard to only take one task_rq(p)->lock for performance.
3020 ++ * Takes rq->lock in:
3021 ++ * - ttwu_runnable() -- old rq, unavoidable, see comment there;
3022 ++ * - ttwu_queue() -- new rq, for enqueue of the task;
3023 ++ * - psi_ttwu_dequeue() -- much sadness :-( accounting will kill us.
3024 ++ *
3025 ++ * As a consequence we race really badly with just about everything. See the
3026 ++ * many memory barriers and their comments for details.
3027 ++ *
3028 ++ * Return: %true if @p->state changes (an actual wakeup was done),
3029 ++ * %false otherwise.
3030 ++ */
3031 ++static int try_to_wake_up(struct task_struct *p, unsigned int state,
3032 ++ int wake_flags)
3033 ++{
3034 ++ unsigned long flags;
3035 ++ int cpu, success = 0;
3036 ++
3037 ++ preempt_disable();
3038 ++ if (p == current) {
3039 ++ /*
3040 ++ * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
3041 ++ * == smp_processor_id()'. Together this means we can special
3042 ++ * case the whole 'p->on_rq && ttwu_runnable()' case below
3043 ++ * without taking any locks.
3044 ++ *
3045 ++ * In particular:
3046 ++ * - we rely on Program-Order guarantees for all the ordering,
3047 ++ * - we're serialized against set_special_state() by virtue of
3048 ++ * it disabling IRQs (this allows not taking ->pi_lock).
3049 ++ */
3050 ++ if (!(READ_ONCE(p->__state) & state))
3051 ++ goto out;
3052 ++
3053 ++ success = 1;
3054 ++ trace_sched_waking(p);
3055 ++ WRITE_ONCE(p->__state, TASK_RUNNING);
3056 ++ trace_sched_wakeup(p);
3057 ++ goto out;
3058 ++ }
3059 ++
3060 ++ /*
3061 ++ * If we are going to wake up a thread waiting for CONDITION we
3062 ++ * need to ensure that CONDITION=1 done by the caller can not be
3063 ++ * reordered with p->state check below. This pairs with smp_store_mb()
3064 ++ * in set_current_state() that the waiting thread does.
3065 ++ */
3066 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
3067 ++ smp_mb__after_spinlock();
3068 ++ if (!(READ_ONCE(p->__state) & state))
3069 ++ goto unlock;
3070 ++
3071 ++ trace_sched_waking(p);
3072 ++
3073 ++ /* We're going to change ->state: */
3074 ++ success = 1;
3075 ++
3076 ++ /*
3077 ++ * Ensure we load p->on_rq _after_ p->state, otherwise it would
3078 ++ * be possible to, falsely, observe p->on_rq == 0 and get stuck
3079 ++ * in smp_cond_load_acquire() below.
3080 ++ *
3081 ++ * sched_ttwu_pending() try_to_wake_up()
3082 ++ * STORE p->on_rq = 1 LOAD p->state
3083 ++ * UNLOCK rq->lock
3084 ++ *
3085 ++ * __schedule() (switch to task 'p')
3086 ++ * LOCK rq->lock smp_rmb();
3087 ++ * smp_mb__after_spinlock();
3088 ++ * UNLOCK rq->lock
3089 ++ *
3090 ++ * [task p]
3091 ++ * STORE p->state = UNINTERRUPTIBLE LOAD p->on_rq
3092 ++ *
3093 ++ * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
3094 ++ * __schedule(). See the comment for smp_mb__after_spinlock().
3095 ++ *
3096 ++	 * A similar smp_rmb() lives in try_invoke_on_locked_down_task().
3097 ++ */
3098 ++ smp_rmb();
3099 ++ if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags))
3100 ++ goto unlock;
3101 ++
3102 ++#ifdef CONFIG_SMP
3103 ++ /*
3104 ++ * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
3105 ++ * possible to, falsely, observe p->on_cpu == 0.
3106 ++ *
3107 ++ * One must be running (->on_cpu == 1) in order to remove oneself
3108 ++ * from the runqueue.
3109 ++ *
3110 ++ * __schedule() (switch to task 'p') try_to_wake_up()
3111 ++ * STORE p->on_cpu = 1 LOAD p->on_rq
3112 ++ * UNLOCK rq->lock
3113 ++ *
3114 ++ * __schedule() (put 'p' to sleep)
3115 ++ * LOCK rq->lock smp_rmb();
3116 ++ * smp_mb__after_spinlock();
3117 ++ * STORE p->on_rq = 0 LOAD p->on_cpu
3118 ++ *
3119 ++ * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
3120 ++ * __schedule(). See the comment for smp_mb__after_spinlock().
3121 ++ *
3122 ++ * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
3123 ++ * schedule()'s deactivate_task() has 'happened' and p will no longer
3124 ++	 * care about its own p->state. See the comment in __schedule().
3125 ++ */
3126 ++ smp_acquire__after_ctrl_dep();
3127 ++
3128 ++ /*
3129 ++ * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
3130 ++ * == 0), which means we need to do an enqueue, change p->state to
3131 ++ * TASK_WAKING such that we can unlock p->pi_lock before doing the
3132 ++ * enqueue, such as ttwu_queue_wakelist().
3133 ++ */
3134 ++ WRITE_ONCE(p->__state, TASK_WAKING);
3135 ++
3136 ++ /*
3137 ++ * If the owning (remote) CPU is still in the middle of schedule() with
3138 ++	 * this task as prev, consider queueing p on the remote CPU's wake_list,
3139 ++ * which potentially sends an IPI instead of spinning on p->on_cpu to
3140 ++ * let the waker make forward progress. This is safe because IRQs are
3141 ++ * disabled and the IPI will deliver after on_cpu is cleared.
3142 ++ *
3143 ++ * Ensure we load task_cpu(p) after p->on_cpu:
3144 ++ *
3145 ++ * set_task_cpu(p, cpu);
3146 ++ * STORE p->cpu = @cpu
3147 ++ * __schedule() (switch to task 'p')
3148 ++ * LOCK rq->lock
3149 ++ * smp_mb__after_spin_lock() smp_cond_load_acquire(&p->on_cpu)
3150 ++ * STORE p->on_cpu = 1 LOAD p->cpu
3151 ++ *
3152 ++ * to ensure we observe the correct CPU on which the task is currently
3153 ++ * scheduling.
3154 ++ */
3155 ++ if (smp_load_acquire(&p->on_cpu) &&
3156 ++ ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
3157 ++ goto unlock;
3158 ++
3159 ++ /*
3160 ++ * If the owning (remote) CPU is still in the middle of schedule() with
3161 ++ * this task as prev, wait until it's done referencing the task.
3162 ++ *
3163 ++ * Pairs with the smp_store_release() in finish_task().
3164 ++ *
3165 ++ * This ensures that tasks getting woken will be fully ordered against
3166 ++ * their previous state and preserve Program Order.
3167 ++ */
3168 ++ smp_cond_load_acquire(&p->on_cpu, !VAL);
3169 ++
3170 ++ sched_task_ttwu(p);
3171 ++
3172 ++ cpu = select_task_rq(p);
3173 ++
3174 ++ if (cpu != task_cpu(p)) {
3175 ++ if (p->in_iowait) {
3176 ++ delayacct_blkio_end(p);
3177 ++ atomic_dec(&task_rq(p)->nr_iowait);
3178 ++ }
3179 ++
3180 ++ wake_flags |= WF_MIGRATED;
3181 ++ psi_ttwu_dequeue(p);
3182 ++ set_task_cpu(p, cpu);
3183 ++ }
3184 ++#else
3185 ++ cpu = task_cpu(p);
3186 ++#endif /* CONFIG_SMP */
3187 ++
3188 ++ ttwu_queue(p, cpu, wake_flags);
3189 ++unlock:
3190 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
3191 ++out:
3192 ++ if (success)
3193 ++ ttwu_stat(p, task_cpu(p), wake_flags);
3194 ++ preempt_enable();
3195 ++
3196 ++ return success;
3197 ++}
3198 ++
3199 ++/**
3200 ++ * try_invoke_on_locked_down_task - Invoke a function on task in fixed state
3201 ++ * @p: Process for which the function is to be invoked, can be @current.
3202 ++ * @func: Function to invoke.
3203 ++ * @arg: Argument to function.
3204 ++ *
3205 ++ * If the specified task can be quickly locked into a definite state
3206 ++ * (either sleeping or on a given runqueue), arrange to keep it in that
3207 ++ * state while invoking @func(@arg). This function can use ->on_rq and
3208 ++ * task_curr() to work out what the state is, if required. Given that
3209 ++ * @func can be invoked with a runqueue lock held, it had better be quite
3210 ++ * lightweight.
3211 ++ *
3212 ++ * Returns:
3213 ++ * @false if the task slipped out from under the locks.
3214 ++ * @true if the task was locked onto a runqueue or is sleeping.
3215 ++ * However, @func can override this by returning @false.
3216 ++ */
3217 ++bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg)
3218 ++{
3219 ++ struct rq_flags rf;
3220 ++ bool ret = false;
3221 ++ struct rq *rq;
3222 ++
3223 ++ raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
3224 ++ if (p->on_rq) {
3225 ++ rq = __task_rq_lock(p, &rf);
3226 ++ if (task_rq(p) == rq)
3227 ++ ret = func(p, arg);
3228 ++ __task_rq_unlock(rq, &rf);
3229 ++ } else {
3230 ++ switch (READ_ONCE(p->__state)) {
3231 ++ case TASK_RUNNING:
3232 ++ case TASK_WAKING:
3233 ++ break;
3234 ++ default:
3235 ++ smp_rmb(); // See smp_rmb() comment in try_to_wake_up().
3236 ++ if (!p->on_rq)
3237 ++ ret = func(p, arg);
3238 ++ }
3239 ++ }
3240 ++ raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
3241 ++ return ret;
3242 ++}
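++
++/*
++ * Illustrative sketch (hypothetical callback, not code from this patch):
++ *
++ *	static bool func_example(struct task_struct *t, void *arg)
++ *	{
++ *		return !task_curr(t);	// t is held in a stable state here
++ *	}
++ *
++ *	bool stable = try_invoke_on_locked_down_task(p, func_example, NULL);
++ */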
3243 ++
3244 ++/**
3245 ++ * wake_up_process - Wake up a specific process
3246 ++ * @p: The process to be woken up.
3247 ++ *
3248 ++ * Attempt to wake up the nominated process and move it to the set of runnable
3249 ++ * processes.
3250 ++ *
3251 ++ * Return: 1 if the process was woken up, 0 if it was already running.
3252 ++ *
3253 ++ * This function executes a full memory barrier before accessing the task state.
3254 ++ */
3255 ++int wake_up_process(struct task_struct *p)
3256 ++{
3257 ++ return try_to_wake_up(p, TASK_NORMAL, 0);
3258 ++}
3259 ++EXPORT_SYMBOL(wake_up_process);
3260 ++
3261 ++int wake_up_state(struct task_struct *p, unsigned int state)
3262 ++{
3263 ++ return try_to_wake_up(p, state, 0);
3264 ++}
3265 ++
3266 ++/*
3267 ++ * Perform scheduler related setup for a newly forked process p.
3268 ++ * p is forked by current.
3269 ++ *
3270 ++ * __sched_fork() is basic setup used by init_idle() too:
3271 ++ */
3272 ++static inline void __sched_fork(unsigned long clone_flags, struct task_struct *p)
3273 ++{
3274 ++ p->on_rq = 0;
3275 ++ p->on_cpu = 0;
3276 ++ p->utime = 0;
3277 ++ p->stime = 0;
3278 ++ p->sched_time = 0;
3279 ++
3280 ++#ifdef CONFIG_PREEMPT_NOTIFIERS
3281 ++ INIT_HLIST_HEAD(&p->preempt_notifiers);
3282 ++#endif
3283 ++
3284 ++#ifdef CONFIG_COMPACTION
3285 ++ p->capture_control = NULL;
3286 ++#endif
3287 ++#ifdef CONFIG_SMP
3288 ++ p->wake_entry.u_flags = CSD_TYPE_TTWU;
3289 ++#endif
3290 ++}
3291 ++
3292 ++/*
3293 ++ * fork()/clone()-time setup:
3294 ++ */
3295 ++int sched_fork(unsigned long clone_flags, struct task_struct *p)
3296 ++{
3297 ++ unsigned long flags;
3298 ++ struct rq *rq;
3299 ++
3300 ++ __sched_fork(clone_flags, p);
3301 ++ /*
3302 ++ * We mark the process as NEW here. This guarantees that
3303 ++ * nobody will actually run it, and a signal or other external
3304 ++ * event cannot wake it up and insert it on the runqueue either.
3305 ++ */
3306 ++ p->__state = TASK_NEW;
3307 ++
3308 ++ /*
3309 ++ * Make sure we do not leak PI boosting priority to the child.
3310 ++ */
3311 ++ p->prio = current->normal_prio;
3312 ++
3313 ++ /*
3314 ++ * Revert to default priority/policy on fork if requested.
3315 ++ */
3316 ++ if (unlikely(p->sched_reset_on_fork)) {
3317 ++ if (task_has_rt_policy(p)) {
3318 ++ p->policy = SCHED_NORMAL;
3319 ++ p->static_prio = NICE_TO_PRIO(0);
3320 ++ p->rt_priority = 0;
3321 ++ } else if (PRIO_TO_NICE(p->static_prio) < 0)
3322 ++ p->static_prio = NICE_TO_PRIO(0);
3323 ++
3324 ++ p->prio = p->normal_prio = p->static_prio;
3325 ++
3326 ++ /*
3327 ++ * We don't need the reset flag anymore after the fork. It has
3328 ++ * fulfilled its duty:
3329 ++ */
3330 ++ p->sched_reset_on_fork = 0;
3331 ++ }
3332 ++
3333 ++ /*
3334 ++ * The child is not yet in the pid-hash so no cgroup attach races,
3335 ++	 * and the cgroup is pinned to this child because cgroup_fork()
3336 ++	 * runs before sched_fork().
3337 ++ *
3338 ++ * Silence PROVE_RCU.
3339 ++ */
3340 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
3341 ++ /*
3342 ++	 * Share the timeslice between parent and child so that the
3343 ++	 * total amount of pending timeslices in the system doesn't change,
3344 ++	 * resulting in better scheduling fairness.
3345 ++ */
3346 ++ rq = this_rq();
3347 ++ raw_spin_lock(&rq->lock);
3348 ++
3349 ++ rq->curr->time_slice /= 2;
3350 ++ p->time_slice = rq->curr->time_slice;
3351 ++#ifdef CONFIG_SCHED_HRTICK
3352 ++ hrtick_start(rq, rq->curr->time_slice);
3353 ++#endif
3354 ++
3355 ++ if (p->time_slice < RESCHED_NS) {
3356 ++ p->time_slice = sched_timeslice_ns;
3357 ++ resched_curr(rq);
3358 ++ }
3359 ++ sched_task_fork(p, rq);
3360 ++ raw_spin_unlock(&rq->lock);
3361 ++
3362 ++ rseq_migrate(p);
3363 ++ /*
3364 ++ * We're setting the CPU for the first time, we don't migrate,
3365 ++ * so use __set_task_cpu().
3366 ++ */
3367 ++ __set_task_cpu(p, cpu_of(rq));
3368 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
3369 ++
3370 ++#ifdef CONFIG_SCHED_INFO
3371 ++ if (unlikely(sched_info_on()))
3372 ++ memset(&p->sched_info, 0, sizeof(p->sched_info));
3373 ++#endif
3374 ++ init_task_preempt_count(p);
3375 ++
3376 ++ return 0;
3377 ++}
3378 ++
3379 ++void sched_post_fork(struct task_struct *p) {}
3380 ++
3381 ++#ifdef CONFIG_SCHEDSTATS
3382 ++
3383 ++DEFINE_STATIC_KEY_FALSE(sched_schedstats);
3384 ++
3385 ++static void set_schedstats(bool enabled)
3386 ++{
3387 ++ if (enabled)
3388 ++ static_branch_enable(&sched_schedstats);
3389 ++ else
3390 ++ static_branch_disable(&sched_schedstats);
3391 ++}
3392 ++
3393 ++void force_schedstat_enabled(void)
3394 ++{
3395 ++ if (!schedstat_enabled()) {
3396 ++ pr_info("kernel profiling enabled schedstats, disable via kernel.sched_schedstats.\n");
3397 ++ static_branch_enable(&sched_schedstats);
3398 ++ }
3399 ++}
3400 ++
3401 ++static int __init setup_schedstats(char *str)
3402 ++{
3403 ++ int ret = 0;
3404 ++ if (!str)
3405 ++ goto out;
3406 ++
3407 ++ if (!strcmp(str, "enable")) {
3408 ++ set_schedstats(true);
3409 ++ ret = 1;
3410 ++ } else if (!strcmp(str, "disable")) {
3411 ++ set_schedstats(false);
3412 ++ ret = 1;
3413 ++ }
3414 ++out:
3415 ++ if (!ret)
3416 ++ pr_warn("Unable to parse schedstats=\n");
3417 ++
3418 ++ return ret;
3419 ++}
3420 ++__setup("schedstats=", setup_schedstats);
3421 ++
3422 ++#ifdef CONFIG_PROC_SYSCTL
3423 ++int sysctl_schedstats(struct ctl_table *table, int write,
3424 ++ void __user *buffer, size_t *lenp, loff_t *ppos)
3425 ++{
3426 ++ struct ctl_table t;
3427 ++ int err;
3428 ++ int state = static_branch_likely(&sched_schedstats);
3429 ++
3430 ++ if (write && !capable(CAP_SYS_ADMIN))
3431 ++ return -EPERM;
3432 ++
3433 ++ t = *table;
3434 ++ t.data = &state;
3435 ++ err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
3436 ++ if (err < 0)
3437 ++ return err;
3438 ++ if (write)
3439 ++ set_schedstats(state);
3440 ++ return err;
3441 ++}
3442 ++#endif /* CONFIG_PROC_SYSCTL */
3443 ++#endif /* CONFIG_SCHEDSTATS */
3444 ++
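
As a usage note, the schedstats switch wired up above is reachable both from the boot command line (schedstats=enable) and, with CONFIG_PROC_SYSCTL, at run time through the kernel.sched_schedstats sysctl. The snippet below is a minimal userspace sketch of the run-time path; the /proc/sys location is the conventional one for that sysctl and is assumed here, and the write requires root.

/* Minimal sketch: enable schedstats at run time via the sysctl interface. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/sched_schedstats", "w");

	if (!f) {
		perror("sched_schedstats");
		return 1;
	}
	fputs("1\n", f);	/* "0" would disable it again */
	fclose(f);
	return 0;
}
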
3445 ++/*
3446 ++ * wake_up_new_task - wake up a newly created task for the first time.
3447 ++ *
3448 ++ * This function will do some initial scheduler statistics housekeeping
3449 ++ * that must be done for every newly created context, then puts the task
3450 ++ * on the runqueue and wakes it.
3451 ++ */
3452 ++void wake_up_new_task(struct task_struct *p)
3453 ++{
3454 ++ unsigned long flags;
3455 ++ struct rq *rq;
3456 ++
3457 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
3458 ++ WRITE_ONCE(p->__state, TASK_RUNNING);
3459 ++ rq = cpu_rq(select_task_rq(p));
3460 ++#ifdef CONFIG_SMP
3461 ++ rseq_migrate(p);
3462 ++ /*
3463 ++ * Fork balancing, do it here and not earlier because:
3464 ++ * - cpus_ptr can change in the fork path
3465 ++ * - any previously selected CPU might disappear through hotplug
3466 ++ *
3467 ++ * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
3468 ++ * as we're not fully set-up yet.
3469 ++ */
3470 ++ __set_task_cpu(p, cpu_of(rq));
3471 ++#endif
3472 ++
3473 ++ raw_spin_lock(&rq->lock);
3474 ++ update_rq_clock(rq);
3475 ++
3476 ++ activate_task(p, rq);
3477 ++ trace_sched_wakeup_new(p);
3478 ++ check_preempt_curr(rq);
3479 ++
3480 ++ raw_spin_unlock(&rq->lock);
3481 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
3482 ++}
3483 ++
3484 ++#ifdef CONFIG_PREEMPT_NOTIFIERS
3485 ++
3486 ++static DEFINE_STATIC_KEY_FALSE(preempt_notifier_key);
3487 ++
3488 ++void preempt_notifier_inc(void)
3489 ++{
3490 ++ static_branch_inc(&preempt_notifier_key);
3491 ++}
3492 ++EXPORT_SYMBOL_GPL(preempt_notifier_inc);
3493 ++
3494 ++void preempt_notifier_dec(void)
3495 ++{
3496 ++ static_branch_dec(&preempt_notifier_key);
3497 ++}
3498 ++EXPORT_SYMBOL_GPL(preempt_notifier_dec);
3499 ++
3500 ++/**
3501 ++ * preempt_notifier_register - tell me when current is being preempted & rescheduled
3502 ++ * @notifier: notifier struct to register
3503 ++ */
3504 ++void preempt_notifier_register(struct preempt_notifier *notifier)
3505 ++{
3506 ++ if (!static_branch_unlikely(&preempt_notifier_key))
3507 ++ WARN(1, "registering preempt_notifier while notifiers disabled\n");
3508 ++
3509 ++ hlist_add_head(&notifier->link, &current->preempt_notifiers);
3510 ++}
3511 ++EXPORT_SYMBOL_GPL(preempt_notifier_register);
3512 ++
3513 ++/**
3514 ++ * preempt_notifier_unregister - no longer interested in preemption notifications
3515 ++ * @notifier: notifier struct to unregister
3516 ++ *
3517 ++ * This is *not* safe to call from within a preemption notifier.
3518 ++ */
3519 ++void preempt_notifier_unregister(struct preempt_notifier *notifier)
3520 ++{
3521 ++ hlist_del(&notifier->link);
3522 ++}
3523 ++EXPORT_SYMBOL_GPL(preempt_notifier_unregister);
3524 ++
3525 ++static void __fire_sched_in_preempt_notifiers(struct task_struct *curr)
3526 ++{
3527 ++ struct preempt_notifier *notifier;
3528 ++
3529 ++ hlist_for_each_entry(notifier, &curr->preempt_notifiers, link)
3530 ++ notifier->ops->sched_in(notifier, raw_smp_processor_id());
3531 ++}
3532 ++
3533 ++static __always_inline void fire_sched_in_preempt_notifiers(struct task_struct *curr)
3534 ++{
3535 ++ if (static_branch_unlikely(&preempt_notifier_key))
3536 ++ __fire_sched_in_preempt_notifiers(curr);
3537 ++}
3538 ++
3539 ++static void
3540 ++__fire_sched_out_preempt_notifiers(struct task_struct *curr,
3541 ++ struct task_struct *next)
3542 ++{
3543 ++ struct preempt_notifier *notifier;
3544 ++
3545 ++ hlist_for_each_entry(notifier, &curr->preempt_notifiers, link)
3546 ++ notifier->ops->sched_out(notifier, next);
3547 ++}
3548 ++
3549 ++static __always_inline void
3550 ++fire_sched_out_preempt_notifiers(struct task_struct *curr,
3551 ++ struct task_struct *next)
3552 ++{
3553 ++ if (static_branch_unlikely(&preempt_notifier_key))
3554 ++ __fire_sched_out_preempt_notifiers(curr, next);
3555 ++}
3556 ++
3557 ++#else /* !CONFIG_PREEMPT_NOTIFIERS */
3558 ++
3559 ++static inline void fire_sched_in_preempt_notifiers(struct task_struct *curr)
3560 ++{
3561 ++}
3562 ++
3563 ++static inline void
3564 ++fire_sched_out_preempt_notifiers(struct task_struct *curr,
3565 ++ struct task_struct *next)
3566 ++{
3567 ++}
3568 ++
3569 ++#endif /* CONFIG_PREEMPT_NOTIFIERS */
3570 ++
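
The notifier machinery above boils down to a static-key-gated walk over a per-task list of callbacks. The following is a small userspace analogue (illustration only, not the kernel API): toy_key stands in for the static key bumped by preempt_notifier_inc(), and toy_fire_sched_in() plays the role of fire_sched_in_preempt_notifiers().

/*
 * Userspace analogue (illustration only) of the preempt-notifier pattern:
 * a counted on/off gate plus a list of callbacks fired on "sched in".
 */
#include <stdio.h>

struct toy_notifier {
	void (*sched_in)(struct toy_notifier *self, int cpu);
	struct toy_notifier *next;
};

static struct toy_notifier *toy_list;
static int toy_key;			/* stands in for the static key */

static void toy_register(struct toy_notifier *n)
{
	n->next = toy_list;
	toy_list = n;
}

static void toy_fire_sched_in(int cpu)
{
	struct toy_notifier *n;

	if (!toy_key)			/* cheap "static branch" check */
		return;
	for (n = toy_list; n; n = n->next)
		n->sched_in(n, cpu);
}

static void hello(struct toy_notifier *self, int cpu)
{
	printf("notifier %p scheduled in on cpu %d\n", (void *)self, cpu);
}

int main(void)
{
	struct toy_notifier n = { .sched_in = hello };

	toy_key++;			/* preempt_notifier_inc() analogue */
	toy_register(&n);
	toy_fire_sched_in(0);
	return 0;
}
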
3571 ++static inline void prepare_task(struct task_struct *next)
3572 ++{
3573 ++ /*
3574 ++ * Claim the task as running, we do this before switching to it
3575 ++ * such that any running task will have this set.
3576 ++ *
3577 ++ * See the ttwu() WF_ON_CPU case and its ordering comment.
3578 ++ */
3579 ++ WRITE_ONCE(next->on_cpu, 1);
3580 ++}
3581 ++
3582 ++static inline void finish_task(struct task_struct *prev)
3583 ++{
3584 ++#ifdef CONFIG_SMP
3585 ++ /*
3586 ++ * This must be the very last reference to @prev from this CPU. After
3587 ++ * p->on_cpu is cleared, the task can be moved to a different CPU. We
3588 ++ * must ensure this doesn't happen until the switch is completely
3589 ++ * finished.
3590 ++ *
3591 ++ * In particular, the load of prev->state in finish_task_switch() must
3592 ++ * happen before this.
3593 ++ *
3594 ++ * Pairs with the smp_cond_load_acquire() in try_to_wake_up().
3595 ++ */
3596 ++ smp_store_release(&prev->on_cpu, 0);
3597 ++#else
3598 ++ prev->on_cpu = 0;
3599 ++#endif
3600 ++}
3601 ++
3602 ++#ifdef CONFIG_SMP
3603 ++
3604 ++static void do_balance_callbacks(struct rq *rq, struct callback_head *head)
3605 ++{
3606 ++ void (*func)(struct rq *rq);
3607 ++ struct callback_head *next;
3608 ++
3609 ++ lockdep_assert_held(&rq->lock);
3610 ++
3611 ++ while (head) {
3612 ++ func = (void (*)(struct rq *))head->func;
3613 ++ next = head->next;
3614 ++ head->next = NULL;
3615 ++ head = next;
3616 ++
3617 ++ func(rq);
3618 ++ }
3619 ++}
3620 ++
3621 ++static void balance_push(struct rq *rq);
3622 ++
3623 ++struct callback_head balance_push_callback = {
3624 ++ .next = NULL,
3625 ++ .func = (void (*)(struct callback_head *))balance_push,
3626 ++};
3627 ++
3628 ++static inline struct callback_head *splice_balance_callbacks(struct rq *rq)
3629 ++{
3630 ++ struct callback_head *head = rq->balance_callback;
3631 ++
3632 ++ if (head) {
3633 ++ lockdep_assert_held(&rq->lock);
3634 ++ rq->balance_callback = NULL;
3635 ++ }
3636 ++
3637 ++ return head;
3638 ++}
3639 ++
3640 ++static void __balance_callbacks(struct rq *rq)
3641 ++{
3642 ++ do_balance_callbacks(rq, splice_balance_callbacks(rq));
3643 ++}
3644 ++
3645 ++static inline void balance_callbacks(struct rq *rq, struct callback_head *head)
3646 ++{
3647 ++ unsigned long flags;
3648 ++
3649 ++ if (unlikely(head)) {
3650 ++ raw_spin_lock_irqsave(&rq->lock, flags);
3651 ++ do_balance_callbacks(rq, head);
3652 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
3653 ++ }
3654 ++}
3655 ++
3656 ++#else
3657 ++
3658 ++static inline void __balance_callbacks(struct rq *rq)
3659 ++{
3660 ++}
3661 ++
3662 ++static inline struct callback_head *splice_balance_callbacks(struct rq *rq)
3663 ++{
3664 ++ return NULL;
3665 ++}
3666 ++
3667 ++static inline void balance_callbacks(struct rq *rq, struct callback_head *head)
3668 ++{
3669 ++}
3670 ++
3671 ++#endif
3672 ++
3673 ++static inline void
3674 ++prepare_lock_switch(struct rq *rq, struct task_struct *next)
3675 ++{
3676 ++ /*
3677 ++	 * The runqueue lock will be released by the next
3678 ++ * task (which is an invalid locking op but in the case
3679 ++ * of the scheduler it's an obvious special-case), so we
3680 ++ * do an early lockdep release here:
3681 ++ */
3682 ++ spin_release(&rq->lock.dep_map, _THIS_IP_);
3683 ++#ifdef CONFIG_DEBUG_SPINLOCK
3684 ++ /* this is a valid case when another task releases the spinlock */
3685 ++ rq->lock.owner = next;
3686 ++#endif
3687 ++}
3688 ++
3689 ++static inline void finish_lock_switch(struct rq *rq)
3690 ++{
3691 ++ /*
3692 ++ * If we are tracking spinlock dependencies then we have to
3693 ++ * fix up the runqueue lock - which gets 'carried over' from
3694 ++ * prev into current:
3695 ++ */
3696 ++ spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
3697 ++ __balance_callbacks(rq);
3698 ++ raw_spin_unlock_irq(&rq->lock);
3699 ++}
3700 ++
3701 ++/*
3702 ++ * NOP if the arch has not defined these:
3703 ++ */
3704 ++
3705 ++#ifndef prepare_arch_switch
3706 ++# define prepare_arch_switch(next) do { } while (0)
3707 ++#endif
3708 ++
3709 ++#ifndef finish_arch_post_lock_switch
3710 ++# define finish_arch_post_lock_switch() do { } while (0)
3711 ++#endif
3712 ++
3713 ++static inline void kmap_local_sched_out(void)
3714 ++{
3715 ++#ifdef CONFIG_KMAP_LOCAL
3716 ++ if (unlikely(current->kmap_ctrl.idx))
3717 ++ __kmap_local_sched_out();
3718 ++#endif
3719 ++}
3720 ++
3721 ++static inline void kmap_local_sched_in(void)
3722 ++{
3723 ++#ifdef CONFIG_KMAP_LOCAL
3724 ++ if (unlikely(current->kmap_ctrl.idx))
3725 ++ __kmap_local_sched_in();
3726 ++#endif
3727 ++}
3728 ++
3729 ++/**
3730 ++ * prepare_task_switch - prepare to switch tasks
3731 ++ * @rq: the runqueue preparing to switch
3732 ++ * @next: the task we are going to switch to.
3733 ++ *
3734 ++ * This is called with the rq lock held and interrupts off. It must
3735 ++ * be paired with a subsequent finish_task_switch after the context
3736 ++ * switch.
3737 ++ *
3738 ++ * prepare_task_switch sets up locking and calls architecture specific
3739 ++ * hooks.
3740 ++ */
3741 ++static inline void
3742 ++prepare_task_switch(struct rq *rq, struct task_struct *prev,
3743 ++ struct task_struct *next)
3744 ++{
3745 ++ kcov_prepare_switch(prev);
3746 ++ sched_info_switch(rq, prev, next);
3747 ++ perf_event_task_sched_out(prev, next);
3748 ++ rseq_preempt(prev);
3749 ++ fire_sched_out_preempt_notifiers(prev, next);
3750 ++ kmap_local_sched_out();
3751 ++ prepare_task(next);
3752 ++ prepare_arch_switch(next);
3753 ++}
3754 ++
3755 ++/**
3756 ++ * finish_task_switch - clean up after a task-switch
3757 ++ * @rq: runqueue associated with task-switch
3758 ++ * @prev: the thread we just switched away from.
3759 ++ *
3760 ++ * finish_task_switch must be called after the context switch, paired
3761 ++ * with a prepare_task_switch call before the context switch.
3762 ++ * finish_task_switch will reconcile locking set up by prepare_task_switch,
3763 ++ * and do any other architecture-specific cleanup actions.
3764 ++ *
3765 ++ * Note that we may have delayed dropping an mm in context_switch(). If
3766 ++ * so, we finish that here outside of the runqueue lock. (Doing it
3767 ++ * with the lock held can cause deadlocks; see schedule() for
3768 ++ * details.)
3769 ++ *
3770 ++ * The context switch has flipped the stack from under us and restored the
3771 ++ * local variables which were saved when this task called schedule() in the
3772 ++ * past. prev == current is still correct but we need to recalculate this_rq
3773 ++ * because prev may have moved to another CPU.
3774 ++ */
3775 ++static struct rq *finish_task_switch(struct task_struct *prev)
3776 ++ __releases(rq->lock)
3777 ++{
3778 ++ struct rq *rq = this_rq();
3779 ++ struct mm_struct *mm = rq->prev_mm;
3780 ++ long prev_state;
3781 ++
3782 ++ /*
3783 ++ * The previous task will have left us with a preempt_count of 2
3784 ++ * because it left us after:
3785 ++ *
3786 ++ * schedule()
3787 ++ * preempt_disable(); // 1
3788 ++ * __schedule()
3789 ++ * raw_spin_lock_irq(&rq->lock) // 2
3790 ++ *
3791 ++ * Also, see FORK_PREEMPT_COUNT.
3792 ++ */
3793 ++ if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
3794 ++ "corrupted preempt_count: %s/%d/0x%x\n",
3795 ++ current->comm, current->pid, preempt_count()))
3796 ++ preempt_count_set(FORK_PREEMPT_COUNT);
3797 ++
3798 ++ rq->prev_mm = NULL;
3799 ++
3800 ++ /*
3801 ++ * A task struct has one reference for the use as "current".
3802 ++ * If a task dies, then it sets TASK_DEAD in tsk->state and calls
3803 ++ * schedule one last time. The schedule call will never return, and
3804 ++ * the scheduled task must drop that reference.
3805 ++ *
3806 ++ * We must observe prev->state before clearing prev->on_cpu (in
3807 ++ * finish_task), otherwise a concurrent wakeup can get prev
3808 ++ * running on another CPU and we could race with its RUNNING -> DEAD
3809 ++ * transition, resulting in a double drop.
3810 ++ */
3811 ++ prev_state = READ_ONCE(prev->__state);
3812 ++ vtime_task_switch(prev);
3813 ++ perf_event_task_sched_in(prev, current);
3814 ++ finish_task(prev);
3815 ++ tick_nohz_task_switch();
3816 ++ finish_lock_switch(rq);
3817 ++ finish_arch_post_lock_switch();
3818 ++ kcov_finish_switch(current);
3819 ++ /*
3820 ++ * kmap_local_sched_out() is invoked with rq::lock held and
3821 ++ * interrupts disabled. There is no requirement for that, but the
3822 ++ * sched out code does not have an interrupt enabled section.
3823 ++ * Restoring the maps on sched in does not require interrupts being
3824 ++ * disabled either.
3825 ++ */
3826 ++ kmap_local_sched_in();
3827 ++
3828 ++ fire_sched_in_preempt_notifiers(current);
3829 ++ /*
3830 ++ * When switching through a kernel thread, the loop in
3831 ++ * membarrier_{private,global}_expedited() may have observed that
3832 ++ * kernel thread and not issued an IPI. It is therefore possible to
3833 ++ * schedule between user->kernel->user threads without passing through
3834 ++ * switch_mm(). Membarrier requires a barrier after storing to
3835 ++ * rq->curr, before returning to userspace, so provide them here:
3836 ++ *
3837 ++ * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
3838 ++ * provided by mmdrop(),
3839 ++ * - a sync_core for SYNC_CORE.
3840 ++ */
3841 ++ if (mm) {
3842 ++ membarrier_mm_sync_core_before_usermode(mm);
3843 ++ mmdrop(mm);
3844 ++ }
3845 ++ if (unlikely(prev_state == TASK_DEAD)) {
3846 ++ /*
3847 ++ * Remove function-return probe instances associated with this
3848 ++ * task and put them back on the free list.
3849 ++ */
3850 ++ kprobe_flush_task(prev);
3851 ++
3852 ++ /* Task is done with its stack. */
3853 ++ put_task_stack(prev);
3854 ++
3855 ++ put_task_struct_rcu_user(prev);
3856 ++ }
3857 ++
3858 ++ return rq;
3859 ++}
3860 ++
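
The membarrier comments in finish_task_switch() describe barriers the kernel issues on behalf of the membarrier(2) system call. A minimal userspace probe of that interface is sketched below; MEMBARRIER_CMD_QUERY and MEMBARRIER_CMD_GLOBAL come from <linux/membarrier.h>, and error handling is deliberately minimal.

/* Minimal sketch: query and issue a global membarrier from userspace. */
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

static int sys_membarrier(int cmd, unsigned int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
	int supported = sys_membarrier(MEMBARRIER_CMD_QUERY, 0);

	if (supported < 0) {
		perror("membarrier");
		return 1;
	}
	printf("supported command mask: 0x%x\n", supported);

	if (supported & MEMBARRIER_CMD_GLOBAL)
		sys_membarrier(MEMBARRIER_CMD_GLOBAL, 0);	/* full barrier on all cpus */
	return 0;
}
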
3861 ++/**
3862 ++ * schedule_tail - first thing a freshly forked thread must call.
3863 ++ * @prev: the thread we just switched away from.
3864 ++ */
3865 ++asmlinkage __visible void schedule_tail(struct task_struct *prev)
3866 ++ __releases(rq->lock)
3867 ++{
3868 ++ /*
3869 ++ * New tasks start with FORK_PREEMPT_COUNT, see there and
3870 ++ * finish_task_switch() for details.
3871 ++ *
3872 ++ * finish_task_switch() will drop rq->lock() and lower preempt_count
3873 ++ * and the preempt_enable() will end up enabling preemption (on
3874 ++ * PREEMPT_COUNT kernels).
3875 ++ */
3876 ++
3877 ++ finish_task_switch(prev);
3878 ++ preempt_enable();
3879 ++
3880 ++ if (current->set_child_tid)
3881 ++ put_user(task_pid_vnr(current), current->set_child_tid);
3882 ++
3883 ++ calculate_sigpending();
3884 ++}
3885 ++
3886 ++/*
3887 ++ * context_switch - switch to the new MM and the new thread's register state.
3888 ++ */
3889 ++static __always_inline struct rq *
3890 ++context_switch(struct rq *rq, struct task_struct *prev,
3891 ++ struct task_struct *next)
3892 ++{
3893 ++ prepare_task_switch(rq, prev, next);
3894 ++
3895 ++ /*
3896 ++ * For paravirt, this is coupled with an exit in switch_to to
3897 ++ * combine the page table reload and the switch backend into
3898 ++ * one hypercall.
3899 ++ */
3900 ++ arch_start_context_switch(prev);
3901 ++
3902 ++ /*
3903 ++ * kernel -> kernel lazy + transfer active
3904 ++ * user -> kernel lazy + mmgrab() active
3905 ++ *
3906 ++ * kernel -> user switch + mmdrop() active
3907 ++ * user -> user switch
3908 ++ */
3909 ++ if (!next->mm) { // to kernel
3910 ++ enter_lazy_tlb(prev->active_mm, next);
3911 ++
3912 ++ next->active_mm = prev->active_mm;
3913 ++ if (prev->mm) // from user
3914 ++ mmgrab(prev->active_mm);
3915 ++ else
3916 ++ prev->active_mm = NULL;
3917 ++ } else { // to user
3918 ++ membarrier_switch_mm(rq, prev->active_mm, next->mm);
3919 ++ /*
3920 ++ * sys_membarrier() requires an smp_mb() between setting
3921 ++ * rq->curr / membarrier_switch_mm() and returning to userspace.
3922 ++ *
3923 ++ * The below provides this either through switch_mm(), or in
3924 ++ * case 'prev->active_mm == next->mm' through
3925 ++ * finish_task_switch()'s mmdrop().
3926 ++ */
3927 ++ switch_mm_irqs_off(prev->active_mm, next->mm, next);
3928 ++
3929 ++ if (!prev->mm) { // from kernel
3930 ++ /* will mmdrop() in finish_task_switch(). */
3931 ++ rq->prev_mm = prev->active_mm;
3932 ++ prev->active_mm = NULL;
3933 ++ }
3934 ++ }
3935 ++
3936 ++ prepare_lock_switch(rq, next);
3937 ++
3938 ++ /* Here we just switch the register state and the stack. */
3939 ++ switch_to(prev, next, prev);
3940 ++ barrier();
3941 ++
3942 ++ return finish_task_switch(prev);
3943 ++}
3944 ++
3945 ++/*
3946 ++ * nr_running, nr_uninterruptible and nr_context_switches:
3947 ++ *
3948 ++ * externally visible scheduler statistics: current number of runnable
3949 ++ * threads, total number of context switches performed since bootup.
3950 ++ */
3951 ++unsigned int nr_running(void)
3952 ++{
3953 ++ unsigned int i, sum = 0;
3954 ++
3955 ++ for_each_online_cpu(i)
3956 ++ sum += cpu_rq(i)->nr_running;
3957 ++
3958 ++ return sum;
3959 ++}
3960 ++
3961 ++/*
3962 ++ * Check if only the current task is running on the CPU.
3963 ++ *
3964 ++ * Caution: this function does not check that the caller has disabled
3965 ++ * preemption, thus the result might have a time-of-check-to-time-of-use
3966 ++ * race. The caller is responsible for using it correctly, for example:
3967 ++ *
3968 ++ * - from a non-preemptible section (of course)
3969 ++ *
3970 ++ * - from a thread that is bound to a single CPU
3971 ++ *
3972 ++ * - in a loop with very short iterations (e.g. a polling loop)
3973 ++ */
3974 ++bool single_task_running(void)
3975 ++{
3976 ++ return raw_rq()->nr_running == 1;
3977 ++}
3978 ++EXPORT_SYMBOL(single_task_running);
3979 ++
3980 ++unsigned long long nr_context_switches(void)
3981 ++{
3982 ++ int i;
3983 ++ unsigned long long sum = 0;
3984 ++
3985 ++ for_each_possible_cpu(i)
3986 ++ sum += cpu_rq(i)->nr_switches;
3987 ++
3988 ++ return sum;
3989 ++}
3990 ++
3991 ++/*
3992 ++ * Consumers of these two interfaces, like for example the cpuidle menu
3993 ++ * governor, are using nonsensical data: they prefer shallow idle state selection
3994 ++ * for a CPU that has IO-wait pending, even though the waiting task might not
3995 ++ * end up running on that CPU when it does become runnable.
3996 ++ */
3997 ++
3998 ++unsigned int nr_iowait_cpu(int cpu)
3999 ++{
4000 ++ return atomic_read(&cpu_rq(cpu)->nr_iowait);
4001 ++}
4002 ++
4003 ++/*
4004 ++ * IO-wait accounting, and how it's mostly bollocks (on SMP).
4005 ++ *
4006 ++ * The idea behind IO-wait accounting is to account the idle time that we could
4007 ++ * have spent running if it were not for IO. That is, if we were to improve the
4008 ++ * storage performance, we'd have a proportional reduction in IO-wait time.
4009 ++ *
4010 ++ * This all works nicely on UP, where, when a task blocks on IO, we account
4011 ++ * idle time as IO-wait, because if the storage were faster, it could've been
4012 ++ * running and we'd not be idle.
4013 ++ *
4014 ++ * This has been extended to SMP, by doing the same for each CPU. This however
4015 ++ * is broken.
4016 ++ *
4017 ++ * Imagine for instance the case where two tasks block on one CPU: only that one
4018 ++ * CPU will have IO-wait accounted, while the other has regular idle. Even
4019 ++ * though, if the storage were faster, both could've run at the same time,
4020 ++ * utilising both CPUs.
4021 ++ *
4022 ++ * This means, that when looking globally, the current IO-wait accounting on
4023 ++ * SMP is a lower bound, due to under-accounting.
4024 ++ *
4025 ++ * Worse, since the numbers are provided per CPU, they are sometimes
4026 ++ * interpreted per CPU, and that is nonsensical. A blocked task isn't strictly
4027 ++ * associated with any one particular CPU; it can wake up on a different CPU than it
4028 ++ * blocked on. This means the per CPU IO-wait number is meaningless.
4029 ++ *
4030 ++ * Task CPU affinities can make all that even more 'interesting'.
4031 ++ */
4032 ++
4033 ++unsigned int nr_iowait(void)
4034 ++{
4035 ++ unsigned int i, sum = 0;
4036 ++
4037 ++ for_each_possible_cpu(i)
4038 ++ sum += nr_iowait_cpu(i);
4039 ++
4040 ++ return sum;
4041 ++}
4042 ++
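
For what these counters look like from userspace: the per-CPU iowait accounting discussed above surfaces, among other places, as the fifth numeric column of the cpu lines in /proc/stat. The sketch below just prints that column; as the comment stresses, treat the per-CPU values as a rough lower bound rather than per-CPU truth.

/* Print the iowait column (5th field) of each cpu line in /proc/stat. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[512];
	FILE *f = fopen("/proc/stat", "r");

	if (!f) {
		perror("/proc/stat");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		char cpu[16];
		unsigned long long user, nice, system, idle, iowait;

		if (sscanf(line, "%15s %llu %llu %llu %llu %llu",
			   cpu, &user, &nice, &system, &idle, &iowait) == 6 &&
		    !strncmp(cpu, "cpu", 3))
			printf("%s iowait=%llu\n", cpu, iowait);
	}
	fclose(f);
	return 0;
}
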
4043 ++#ifdef CONFIG_SMP
4044 ++
4045 ++/*
4046 ++ * sched_exec - execve() is a valuable balancing opportunity, because at
4047 ++ * this point the task has the smallest effective memory and cache
4048 ++ * footprint.
4049 ++ */
4050 ++void sched_exec(void)
4051 ++{
4052 ++ struct task_struct *p = current;
4053 ++ unsigned long flags;
4054 ++ int dest_cpu;
4055 ++
4056 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
4057 ++ dest_cpu = cpumask_any(p->cpus_ptr);
4058 ++ if (dest_cpu == smp_processor_id())
4059 ++ goto unlock;
4060 ++
4061 ++ if (likely(cpu_active(dest_cpu))) {
4062 ++ struct migration_arg arg = { p, dest_cpu };
4063 ++
4064 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
4065 ++ stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
4066 ++ return;
4067 ++ }
4068 ++unlock:
4069 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
4070 ++}
4071 ++
4072 ++#endif
4073 ++
4074 ++DEFINE_PER_CPU(struct kernel_stat, kstat);
4075 ++DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
4076 ++
4077 ++EXPORT_PER_CPU_SYMBOL(kstat);
4078 ++EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
4079 ++
4080 ++static inline void update_curr(struct rq *rq, struct task_struct *p)
4081 ++{
4082 ++ s64 ns = rq->clock_task - p->last_ran;
4083 ++
4084 ++ p->sched_time += ns;
4085 ++ account_group_exec_runtime(p, ns);
4086 ++
4087 ++ p->time_slice -= ns;
4088 ++ p->last_ran = rq->clock_task;
4089 ++}
4090 ++
4091 ++/*
4092 ++ * Return accounted runtime for the task.
4093 ++ * If the task is currently running, the result includes pending runtime
4094 ++ * that has not been accounted yet.
4095 ++ */
4096 ++unsigned long long task_sched_runtime(struct task_struct *p)
4097 ++{
4098 ++ unsigned long flags;
4099 ++ struct rq *rq;
4100 ++ raw_spinlock_t *lock;
4101 ++ u64 ns;
4102 ++
4103 ++#if defined(CONFIG_64BIT) && defined(CONFIG_SMP)
4104 ++ /*
4105 ++ * 64-bit doesn't need locks to atomically read a 64-bit value.
4106 ++	 * So we have an optimization chance when the task's delta_exec is 0.
4107 ++ * Reading ->on_cpu is racy, but this is ok.
4108 ++ *
4109 ++ * If we race with it leaving CPU, we'll take a lock. So we're correct.
4110 ++ * If we race with it entering CPU, unaccounted time is 0. This is
4111 ++ * indistinguishable from the read occurring a few cycles earlier.
4112 ++ * If we see ->on_cpu without ->on_rq, the task is leaving, and has
4113 ++ * been accounted, so we're correct here as well.
4114 ++ */
4115 ++ if (!p->on_cpu || !task_on_rq_queued(p))
4116 ++ return tsk_seruntime(p);
4117 ++#endif
4118 ++
4119 ++ rq = task_access_lock_irqsave(p, &lock, &flags);
4120 ++ /*
4121 ++ * Must be ->curr _and_ ->on_rq. If dequeued, we would
4122 ++ * project cycles that may never be accounted to this
4123 ++ * thread, breaking clock_gettime().
4124 ++ */
4125 ++ if (p == rq->curr && task_on_rq_queued(p)) {
4126 ++ update_rq_clock(rq);
4127 ++ update_curr(rq, p);
4128 ++ }
4129 ++ ns = tsk_seruntime(p);
4130 ++ task_access_unlock_irqrestore(p, lock, &flags);
4131 ++
4132 ++ return ns;
4133 ++}
4134 ++
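
task_sched_runtime() feeds the POSIX per-thread CPU-time clock, so its effect can be observed from userspace with clock_gettime(CLOCK_THREAD_CPUTIME_ID). The sketch below burns a little CPU and reads that clock before and after; it is an observation aid, not part of the patch.

/* Observe per-thread scheduled runtime via CLOCK_THREAD_CPUTIME_ID. */
#include <stdio.h>
#include <time.h>

static double ns(const struct timespec *ts)
{
	return ts->tv_sec * 1e9 + ts->tv_nsec;
}

int main(void)
{
	struct timespec before, after;
	volatile unsigned long spin = 0;

	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &before);
	for (unsigned long i = 0; i < 50000000UL; i++)
		spin += i;
	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &after);

	printf("consumed about %.3f ms of CPU time\n",
	       (ns(&after) - ns(&before)) / 1e6);
	return 0;
}
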
4135 ++/* This manages tasks that have run out of timeslice during a scheduler_tick */
4136 ++static inline void scheduler_task_tick(struct rq *rq)
4137 ++{
4138 ++ struct task_struct *p = rq->curr;
4139 ++
4140 ++ if (is_idle_task(p))
4141 ++ return;
4142 ++
4143 ++ update_curr(rq, p);
4144 ++ cpufreq_update_util(rq, 0);
4145 ++
4146 ++ /*
4147 ++	 * Tasks that have less than RESCHED_NS of time slice left will be
4148 ++	 * rescheduled.
4149 ++ */
4150 ++ if (p->time_slice >= RESCHED_NS)
4151 ++ return;
4152 ++ set_tsk_need_resched(p);
4153 ++ set_preempt_need_resched();
4154 ++}
4155 ++
4156 ++#ifdef CONFIG_SCHED_DEBUG
4157 ++static u64 cpu_resched_latency(struct rq *rq)
4158 ++{
4159 ++ int latency_warn_ms = READ_ONCE(sysctl_resched_latency_warn_ms);
4160 ++ u64 resched_latency, now = rq_clock(rq);
4161 ++ static bool warned_once;
4162 ++
4163 ++ if (sysctl_resched_latency_warn_once && warned_once)
4164 ++ return 0;
4165 ++
4166 ++ if (!need_resched() || !latency_warn_ms)
4167 ++ return 0;
4168 ++
4169 ++ if (system_state == SYSTEM_BOOTING)
4170 ++ return 0;
4171 ++
4172 ++ if (!rq->last_seen_need_resched_ns) {
4173 ++ rq->last_seen_need_resched_ns = now;
4174 ++ rq->ticks_without_resched = 0;
4175 ++ return 0;
4176 ++ }
4177 ++
4178 ++ rq->ticks_without_resched++;
4179 ++ resched_latency = now - rq->last_seen_need_resched_ns;
4180 ++ if (resched_latency <= latency_warn_ms * NSEC_PER_MSEC)
4181 ++ return 0;
4182 ++
4183 ++ warned_once = true;
4184 ++
4185 ++ return resched_latency;
4186 ++}
4187 ++
4188 ++static int __init setup_resched_latency_warn_ms(char *str)
4189 ++{
4190 ++ long val;
4191 ++
4192 ++ if ((kstrtol(str, 0, &val))) {
4193 ++ pr_warn("Unable to set resched_latency_warn_ms\n");
4194 ++ return 1;
4195 ++ }
4196 ++
4197 ++ sysctl_resched_latency_warn_ms = val;
4198 ++ return 1;
4199 ++}
4200 ++__setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms);
4201 ++#else
4202 ++static inline u64 cpu_resched_latency(struct rq *rq) { return 0; }
4203 ++#endif /* CONFIG_SCHED_DEBUG */
4204 ++
4205 ++/*
4206 ++ * This function gets called by the timer code, with HZ frequency.
4207 ++ * We call it with interrupts disabled.
4208 ++ */
4209 ++void scheduler_tick(void)
4210 ++{
4211 ++ int cpu __maybe_unused = smp_processor_id();
4212 ++ struct rq *rq = cpu_rq(cpu);
4213 ++ u64 resched_latency;
4214 ++
4215 ++ arch_scale_freq_tick();
4216 ++ sched_clock_tick();
4217 ++
4218 ++ raw_spin_lock(&rq->lock);
4219 ++ update_rq_clock(rq);
4220 ++
4221 ++ scheduler_task_tick(rq);
4222 ++ if (sched_feat(LATENCY_WARN))
4223 ++ resched_latency = cpu_resched_latency(rq);
4224 ++ calc_global_load_tick(rq);
4225 ++
4226 ++ rq->last_tick = rq->clock;
4227 ++ raw_spin_unlock(&rq->lock);
4228 ++
4229 ++ if (sched_feat(LATENCY_WARN) && resched_latency)
4230 ++ resched_latency_warn(cpu, resched_latency);
4231 ++
4232 ++ perf_event_task_tick();
4233 ++}
4234 ++
4235 ++#ifdef CONFIG_SCHED_SMT
4236 ++static inline int active_load_balance_cpu_stop(void *data)
4237 ++{
4238 ++ struct rq *rq = this_rq();
4239 ++ struct task_struct *p = data;
4240 ++ cpumask_t tmp;
4241 ++ unsigned long flags;
4242 ++
4243 ++ local_irq_save(flags);
4244 ++
4245 ++ raw_spin_lock(&p->pi_lock);
4246 ++ raw_spin_lock(&rq->lock);
4247 ++
4248 ++ rq->active_balance = 0;
4249 ++ /* _something_ may have changed the task, double check again */
4250 ++ if (task_on_rq_queued(p) && task_rq(p) == rq &&
4251 ++ cpumask_and(&tmp, p->cpus_ptr, &sched_sg_idle_mask) &&
4252 ++ !is_migration_disabled(p)) {
4253 ++ int cpu = cpu_of(rq);
4254 ++ int dcpu = __best_mask_cpu(&tmp, per_cpu(sched_cpu_llc_mask, cpu));
4255 ++ rq = move_queued_task(rq, p, dcpu);
4256 ++ }
4257 ++
4258 ++ raw_spin_unlock(&rq->lock);
4259 ++ raw_spin_unlock(&p->pi_lock);
4260 ++
4261 ++ local_irq_restore(flags);
4262 ++
4263 ++ return 0;
4264 ++}
4265 ++
4266 ++/* sg_balance_trigger - trigger sibling group balance for @cpu */
4267 ++static inline int sg_balance_trigger(const int cpu)
4268 ++{
4269 ++ struct rq *rq = cpu_rq(cpu);
4270 ++ unsigned long flags;
4271 ++ struct task_struct *curr;
4272 ++ int res;
4273 ++
4274 ++ if (!raw_spin_trylock_irqsave(&rq->lock, flags))
4275 ++ return 0;
4276 ++ curr = rq->curr;
4277 ++ res = (!is_idle_task(curr)) && (1 == rq->nr_running) &&
4278 ++ cpumask_intersects(curr->cpus_ptr, &sched_sg_idle_mask) &&
4279 ++ !is_migration_disabled(curr) && (!rq->active_balance);
4280 ++
4281 ++ if (res)
4282 ++ rq->active_balance = 1;
4283 ++
4284 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
4285 ++
4286 ++ if (res)
4287 ++ stop_one_cpu_nowait(cpu, active_load_balance_cpu_stop,
4288 ++ curr, &rq->active_balance_work);
4289 ++ return res;
4290 ++}
4291 ++
4292 ++/*
4293 ++ * sg_balance_check - sibling group balance check for run queue @rq
4294 ++ */
4295 ++static inline void sg_balance_check(struct rq *rq)
4296 ++{
4297 ++ cpumask_t chk;
4298 ++ int cpu = cpu_of(rq);
4299 ++
4300 ++ /* exit when cpu is offline */
4301 ++ if (unlikely(!rq->online))
4302 ++ return;
4303 ++
4304 ++ /*
4305 ++	 * Only a cpu in the sibling idle group will do the checking, and then
4306 ++	 * find potential cpus to which the currently running task can migrate
4307 ++ */
4308 ++ if (cpumask_test_cpu(cpu, &sched_sg_idle_mask) &&
4309 ++ cpumask_andnot(&chk, cpu_online_mask, sched_rq_watermark) &&
4310 ++ cpumask_andnot(&chk, &chk, &sched_rq_pending_mask)) {
4311 ++ int i;
4312 ++
4313 ++ for_each_cpu_wrap(i, &chk, cpu) {
4314 ++ if (cpumask_subset(cpu_smt_mask(i), &chk) &&
4315 ++ sg_balance_trigger(i))
4316 ++ return;
4317 ++ }
4318 ++ }
4319 ++}
4320 ++#endif /* CONFIG_SCHED_SMT */
4321 ++
4322 ++#ifdef CONFIG_NO_HZ_FULL
4323 ++
4324 ++struct tick_work {
4325 ++ int cpu;
4326 ++ atomic_t state;
4327 ++ struct delayed_work work;
4328 ++};
4329 ++/* Values for ->state, see diagram below. */
4330 ++#define TICK_SCHED_REMOTE_OFFLINE 0
4331 ++#define TICK_SCHED_REMOTE_OFFLINING 1
4332 ++#define TICK_SCHED_REMOTE_RUNNING 2
4333 ++
4334 ++/*
4335 ++ * State diagram for ->state:
4336 ++ *
4337 ++ *
4338 ++ * TICK_SCHED_REMOTE_OFFLINE
4339 ++ * | ^
4340 ++ * | |
4341 ++ * | | sched_tick_remote()
4342 ++ * | |
4343 ++ * | |
4344 ++ * +--TICK_SCHED_REMOTE_OFFLINING
4345 ++ * | ^
4346 ++ * | |
4347 ++ * sched_tick_start() | | sched_tick_stop()
4348 ++ * | |
4349 ++ * V |
4350 ++ * TICK_SCHED_REMOTE_RUNNING
4351 ++ *
4352 ++ *
4353 ++ * Other transitions get WARN_ON_ONCE(), except that sched_tick_remote()
4354 ++ * and sched_tick_start() are happy to leave the state in RUNNING.
4355 ++ */
4356 ++
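
A standalone sketch of the state machine drawn above, using C11 atomics: fetch_add_unless() is a hand-rolled stand-in for the kernel's atomic_fetch_add_unless(), and the three helpers mirror the transitions in the diagram rather than the exact code that follows.

/* Standalone model of the remote-tick state machine (not kernel code). */
#include <stdatomic.h>
#include <stdio.h>

#define TICK_OFFLINE	0
#define TICK_OFFLINING	1
#define TICK_RUNNING	2

/* Add @a to *@v unless it currently equals @u; return the old value. */
static int fetch_add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	while (old != u &&
	       !atomic_compare_exchange_weak(v, &old, old + a))
		;
	return old;
}

static atomic_int state = TICK_OFFLINE;

static void tick_start(void)
{
	/* Unconditionally claim RUNNING, remember what we displaced. */
	int os = atomic_exchange(&state, TICK_RUNNING);

	if (os == TICK_OFFLINE)
		printf("start: queue the delayed work\n");
	else	/* was OFFLINING: revive it, the work is still queued */
		printf("start: revived a stopping tick\n");
}

static void tick_remote(void)
{
	/* Step OFFLINING back to OFFLINE, but leave RUNNING alone. */
	int os = fetch_add_unless(&state, -1, TICK_RUNNING);

	if (os == TICK_RUNNING)
		printf("remote: still running, requeue\n");
	else
		printf("remote: offlined, stop requeueing\n");
}

static void tick_stop(void)
{
	/* RUNNING -> OFFLINING; the remote tick finishes the job. */
	atomic_store(&state, TICK_OFFLINING);
}

int main(void)
{
	tick_start();	/* OFFLINE   -> RUNNING, work queued  */
	tick_remote();	/* RUNNING   -> requeue               */
	tick_stop();	/* RUNNING   -> OFFLINING             */
	tick_remote();	/* OFFLINING -> OFFLINE, no requeue   */
	return 0;
}
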
4357 ++static struct tick_work __percpu *tick_work_cpu;
4358 ++
4359 ++static void sched_tick_remote(struct work_struct *work)
4360 ++{
4361 ++ struct delayed_work *dwork = to_delayed_work(work);
4362 ++ struct tick_work *twork = container_of(dwork, struct tick_work, work);
4363 ++ int cpu = twork->cpu;
4364 ++ struct rq *rq = cpu_rq(cpu);
4365 ++ struct task_struct *curr;
4366 ++ unsigned long flags;
4367 ++ u64 delta;
4368 ++ int os;
4369 ++
4370 ++ /*
4371 ++ * Handle the tick only if it appears the remote CPU is running in full
4372 ++ * dynticks mode. The check is racy by nature, but missing a tick or
4373 ++ * having one too much is no big deal because the scheduler tick updates
4374 ++ * statistics and checks timeslices in a time-independent way, regardless
4375 ++ * of when exactly it is running.
4376 ++ */
4377 ++ if (!tick_nohz_tick_stopped_cpu(cpu))
4378 ++ goto out_requeue;
4379 ++
4380 ++ raw_spin_lock_irqsave(&rq->lock, flags);
4381 ++ curr = rq->curr;
4382 ++ if (cpu_is_offline(cpu))
4383 ++ goto out_unlock;
4384 ++
4385 ++ update_rq_clock(rq);
4386 ++ if (!is_idle_task(curr)) {
4387 ++ /*
4388 ++ * Make sure the next tick runs within a reasonable
4389 ++ * amount of time.
4390 ++ */
4391 ++ delta = rq_clock_task(rq) - curr->last_ran;
4392 ++ WARN_ON_ONCE(delta > (u64)NSEC_PER_SEC * 3);
4393 ++ }
4394 ++ scheduler_task_tick(rq);
4395 ++
4396 ++ calc_load_nohz_remote(rq);
4397 ++out_unlock:
4398 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
4399 ++
4400 ++out_requeue:
4401 ++ /*
4402 ++ * Run the remote tick once per second (1Hz). This arbitrary
4403 ++ * frequency is large enough to avoid overload but short enough
4404 ++ * to keep scheduler internal stats reasonably up to date. But
4405 ++ * first update state to reflect hotplug activity if required.
4406 ++ */
4407 ++ os = atomic_fetch_add_unless(&twork->state, -1, TICK_SCHED_REMOTE_RUNNING);
4408 ++ WARN_ON_ONCE(os == TICK_SCHED_REMOTE_OFFLINE);
4409 ++ if (os == TICK_SCHED_REMOTE_RUNNING)
4410 ++ queue_delayed_work(system_unbound_wq, dwork, HZ);
4411 ++}
4412 ++
4413 ++static void sched_tick_start(int cpu)
4414 ++{
4415 ++ int os;
4416 ++ struct tick_work *twork;
4417 ++
4418 ++ if (housekeeping_cpu(cpu, HK_FLAG_TICK))
4419 ++ return;
4420 ++
4421 ++ WARN_ON_ONCE(!tick_work_cpu);
4422 ++
4423 ++ twork = per_cpu_ptr(tick_work_cpu, cpu);
4424 ++ os = atomic_xchg(&twork->state, TICK_SCHED_REMOTE_RUNNING);
4425 ++ WARN_ON_ONCE(os == TICK_SCHED_REMOTE_RUNNING);
4426 ++ if (os == TICK_SCHED_REMOTE_OFFLINE) {
4427 ++ twork->cpu = cpu;
4428 ++ INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
4429 ++ queue_delayed_work(system_unbound_wq, &twork->work, HZ);
4430 ++ }
4431 ++}
4432 ++
4433 ++#ifdef CONFIG_HOTPLUG_CPU
4434 ++static void sched_tick_stop(int cpu)
4435 ++{
4436 ++ struct tick_work *twork;
4437 ++
4438 ++ if (housekeeping_cpu(cpu, HK_FLAG_TICK))
4439 ++ return;
4440 ++
4441 ++ WARN_ON_ONCE(!tick_work_cpu);
4442 ++
4443 ++ twork = per_cpu_ptr(tick_work_cpu, cpu);
4444 ++ cancel_delayed_work_sync(&twork->work);
4445 ++}
4446 ++#endif /* CONFIG_HOTPLUG_CPU */
4447 ++
4448 ++int __init sched_tick_offload_init(void)
4449 ++{
4450 ++ tick_work_cpu = alloc_percpu(struct tick_work);
4451 ++ BUG_ON(!tick_work_cpu);
4452 ++ return 0;
4453 ++}
4454 ++
4455 ++#else /* !CONFIG_NO_HZ_FULL */
4456 ++static inline void sched_tick_start(int cpu) { }
4457 ++static inline void sched_tick_stop(int cpu) { }
4458 ++#endif
4459 ++
4460 ++#if defined(CONFIG_PREEMPTION) && (defined(CONFIG_DEBUG_PREEMPT) || \
4461 ++ defined(CONFIG_PREEMPT_TRACER))
4462 ++/*
4463 ++ * If the value passed in is equal to the current preempt count
4464 ++ * then we just disabled preemption. Start timing the latency.
4465 ++ */
4466 ++static inline void preempt_latency_start(int val)
4467 ++{
4468 ++ if (preempt_count() == val) {
4469 ++ unsigned long ip = get_lock_parent_ip();
4470 ++#ifdef CONFIG_DEBUG_PREEMPT
4471 ++ current->preempt_disable_ip = ip;
4472 ++#endif
4473 ++ trace_preempt_off(CALLER_ADDR0, ip);
4474 ++ }
4475 ++}
4476 ++
4477 ++void preempt_count_add(int val)
4478 ++{
4479 ++#ifdef CONFIG_DEBUG_PREEMPT
4480 ++ /*
4481 ++ * Underflow?
4482 ++ */
4483 ++ if (DEBUG_LOCKS_WARN_ON((preempt_count() < 0)))
4484 ++ return;
4485 ++#endif
4486 ++ __preempt_count_add(val);
4487 ++#ifdef CONFIG_DEBUG_PREEMPT
4488 ++ /*
4489 ++ * Spinlock count overflowing soon?
4490 ++ */
4491 ++ DEBUG_LOCKS_WARN_ON((preempt_count() & PREEMPT_MASK) >=
4492 ++ PREEMPT_MASK - 10);
4493 ++#endif
4494 ++ preempt_latency_start(val);
4495 ++}
4496 ++EXPORT_SYMBOL(preempt_count_add);
4497 ++NOKPROBE_SYMBOL(preempt_count_add);
4498 ++
4499 ++/*
4500 ++ * If the value passed in is equal to the current preempt count
4501 ++ * then we just enabled preemption. Stop timing the latency.
4502 ++ */
4503 ++static inline void preempt_latency_stop(int val)
4504 ++{
4505 ++ if (preempt_count() == val)
4506 ++ trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
4507 ++}
4508 ++
4509 ++void preempt_count_sub(int val)
4510 ++{
4511 ++#ifdef CONFIG_DEBUG_PREEMPT
4512 ++ /*
4513 ++ * Underflow?
4514 ++ */
4515 ++ if (DEBUG_LOCKS_WARN_ON(val > preempt_count()))
4516 ++ return;
4517 ++ /*
4518 ++ * Is the spinlock portion underflowing?
4519 ++ */
4520 ++ if (DEBUG_LOCKS_WARN_ON((val < PREEMPT_MASK) &&
4521 ++ !(preempt_count() & PREEMPT_MASK)))
4522 ++ return;
4523 ++#endif
4524 ++
4525 ++ preempt_latency_stop(val);
4526 ++ __preempt_count_sub(val);
4527 ++}
4528 ++EXPORT_SYMBOL(preempt_count_sub);
4529 ++NOKPROBE_SYMBOL(preempt_count_sub);
4530 ++
4531 ++#else
4532 ++static inline void preempt_latency_start(int val) { }
4533 ++static inline void preempt_latency_stop(int val) { }
4534 ++#endif
4535 ++
4536 ++static inline unsigned long get_preempt_disable_ip(struct task_struct *p)
4537 ++{
4538 ++#ifdef CONFIG_DEBUG_PREEMPT
4539 ++ return p->preempt_disable_ip;
4540 ++#else
4541 ++ return 0;
4542 ++#endif
4543 ++}
4544 ++
4545 ++/*
4546 ++ * Print scheduling while atomic bug:
4547 ++ */
4548 ++static noinline void __schedule_bug(struct task_struct *prev)
4549 ++{
4550 ++ /* Save this before calling printk(), since that will clobber it */
4551 ++ unsigned long preempt_disable_ip = get_preempt_disable_ip(current);
4552 ++
4553 ++ if (oops_in_progress)
4554 ++ return;
4555 ++
4556 ++ printk(KERN_ERR "BUG: scheduling while atomic: %s/%d/0x%08x\n",
4557 ++ prev->comm, prev->pid, preempt_count());
4558 ++
4559 ++ debug_show_held_locks(prev);
4560 ++ print_modules();
4561 ++ if (irqs_disabled())
4562 ++ print_irqtrace_events(prev);
4563 ++ if (IS_ENABLED(CONFIG_DEBUG_PREEMPT)
4564 ++ && in_atomic_preempt_off()) {
4565 ++ pr_err("Preemption disabled at:");
4566 ++ print_ip_sym(KERN_ERR, preempt_disable_ip);
4567 ++ }
4568 ++ if (panic_on_warn)
4569 ++ panic("scheduling while atomic\n");
4570 ++
4571 ++ dump_stack();
4572 ++ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
4573 ++}
4574 ++
4575 ++/*
4576 ++ * Various schedule()-time debugging checks and statistics:
4577 ++ */
4578 ++static inline void schedule_debug(struct task_struct *prev, bool preempt)
4579 ++{
4580 ++#ifdef CONFIG_SCHED_STACK_END_CHECK
4581 ++ if (task_stack_end_corrupted(prev))
4582 ++ panic("corrupted stack end detected inside scheduler\n");
4583 ++
4584 ++ if (task_scs_end_corrupted(prev))
4585 ++ panic("corrupted shadow stack detected inside scheduler\n");
4586 ++#endif
4587 ++
4588 ++#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
4589 ++ if (!preempt && READ_ONCE(prev->__state) && prev->non_block_count) {
4590 ++ printk(KERN_ERR "BUG: scheduling in a non-blocking section: %s/%d/%i\n",
4591 ++ prev->comm, prev->pid, prev->non_block_count);
4592 ++ dump_stack();
4593 ++ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
4594 ++ }
4595 ++#endif
4596 ++
4597 ++ if (unlikely(in_atomic_preempt_off())) {
4598 ++ __schedule_bug(prev);
4599 ++ preempt_count_set(PREEMPT_DISABLED);
4600 ++ }
4601 ++ rcu_sleep_check();
4602 ++ SCHED_WARN_ON(ct_state() == CONTEXT_USER);
4603 ++
4604 ++ profile_hit(SCHED_PROFILING, __builtin_return_address(0));
4605 ++
4606 ++ schedstat_inc(this_rq()->sched_count);
4607 ++}
4608 ++
4609 ++/*
4610 ++ * Compile time debug macro
4611 ++ * #define ALT_SCHED_DEBUG
4612 ++ */
4613 ++
4614 ++#ifdef ALT_SCHED_DEBUG
4615 ++void alt_sched_debug(void)
4616 ++{
4617 ++ printk(KERN_INFO "sched: pending: 0x%04lx, idle: 0x%04lx, sg_idle: 0x%04lx\n",
4618 ++ sched_rq_pending_mask.bits[0],
4619 ++ sched_rq_watermark[0].bits[0],
4620 ++ sched_sg_idle_mask.bits[0]);
4621 ++}
4622 ++#else
4623 ++inline void alt_sched_debug(void) {}
4624 ++#endif
4625 ++
4626 ++#ifdef CONFIG_SMP
4627 ++
4628 ++#define SCHED_RQ_NR_MIGRATION (32U)
4629 ++/*
4630 ++ * Migrate pending tasks in @rq to @dest_cpu
4631 ++ * Will try to migrate at most the smaller of half of @rq's nr_running tasks
4632 ++ * and SCHED_RQ_NR_MIGRATION tasks to @dest_cpu
4633 ++ */
4634 ++static inline int
4635 ++migrate_pending_tasks(struct rq *rq, struct rq *dest_rq, const int dest_cpu)
4636 ++{
4637 ++ struct task_struct *p, *skip = rq->curr;
4638 ++ int nr_migrated = 0;
4639 ++ int nr_tries = min(rq->nr_running / 2, SCHED_RQ_NR_MIGRATION);
4640 ++
4641 ++ while (skip != rq->idle && nr_tries &&
4642 ++ (p = sched_rq_next_task(skip, rq)) != rq->idle) {
4643 ++ skip = sched_rq_next_task(p, rq);
4644 ++ if (cpumask_test_cpu(dest_cpu, p->cpus_ptr)) {
4645 ++ __SCHED_DEQUEUE_TASK(p, rq, 0, );
4646 ++ set_task_cpu(p, dest_cpu);
4647 ++ __SCHED_ENQUEUE_TASK(p, dest_rq, 0);
4648 ++ nr_migrated++;
4649 ++ }
4650 ++ nr_tries--;
4651 ++ }
4652 ++
4653 ++ return nr_migrated;
4654 ++}
4655 ++
4656 ++static inline int take_other_rq_tasks(struct rq *rq, int cpu)
4657 ++{
4658 ++ struct cpumask *topo_mask, *end_mask;
4659 ++
4660 ++ if (unlikely(!rq->online))
4661 ++ return 0;
4662 ++
4663 ++ if (cpumask_empty(&sched_rq_pending_mask))
4664 ++ return 0;
4665 ++
4666 ++ topo_mask = per_cpu(sched_cpu_topo_masks, cpu) + 1;
4667 ++ end_mask = per_cpu(sched_cpu_topo_end_mask, cpu);
4668 ++ do {
4669 ++ int i;
4670 ++ for_each_cpu_and(i, &sched_rq_pending_mask, topo_mask) {
4671 ++ int nr_migrated;
4672 ++ struct rq *src_rq;
4673 ++
4674 ++ src_rq = cpu_rq(i);
4675 ++ if (!do_raw_spin_trylock(&src_rq->lock))
4676 ++ continue;
4677 ++ spin_acquire(&src_rq->lock.dep_map,
4678 ++ SINGLE_DEPTH_NESTING, 1, _RET_IP_);
4679 ++
4680 ++ if ((nr_migrated = migrate_pending_tasks(src_rq, rq, cpu))) {
4681 ++ src_rq->nr_running -= nr_migrated;
4682 ++ if (src_rq->nr_running < 2)
4683 ++ cpumask_clear_cpu(i, &sched_rq_pending_mask);
4684 ++
4685 ++ rq->nr_running += nr_migrated;
4686 ++ if (rq->nr_running > 1)
4687 ++ cpumask_set_cpu(cpu, &sched_rq_pending_mask);
4688 ++
4689 ++ update_sched_rq_watermark(rq);
4690 ++ cpufreq_update_util(rq, 0);
4691 ++
4692 ++ spin_release(&src_rq->lock.dep_map, _RET_IP_);
4693 ++ do_raw_spin_unlock(&src_rq->lock);
4694 ++
4695 ++ return 1;
4696 ++ }
4697 ++
4698 ++ spin_release(&src_rq->lock.dep_map, _RET_IP_);
4699 ++ do_raw_spin_unlock(&src_rq->lock);
4700 ++ }
4701 ++ } while (++topo_mask < end_mask);
4702 ++
4703 ++ return 0;
4704 ++}
4705 ++#endif
4706 ++
4707 ++/*
4708 ++ * Timeslices below RESCHED_NS are considered as good as expired, as there's no
4709 ++ * point rescheduling when there's so little time left.
4710 ++ */
4711 ++static inline void check_curr(struct task_struct *p, struct rq *rq)
4712 ++{
4713 ++ if (unlikely(rq->idle == p))
4714 ++ return;
4715 ++
4716 ++ update_curr(rq, p);
4717 ++
4718 ++ if (p->time_slice < RESCHED_NS)
4719 ++ time_slice_expired(p, rq);
4720 ++}
4721 ++
4722 ++static inline struct task_struct *
4723 ++choose_next_task(struct rq *rq, int cpu, struct task_struct *prev)
4724 ++{
4725 ++ struct task_struct *next;
4726 ++
4727 ++ if (unlikely(rq->skip)) {
4728 ++ next = rq_runnable_task(rq);
4729 ++ if (next == rq->idle) {
4730 ++#ifdef CONFIG_SMP
4731 ++ if (!take_other_rq_tasks(rq, cpu)) {
4732 ++#endif
4733 ++ rq->skip = NULL;
4734 ++ schedstat_inc(rq->sched_goidle);
4735 ++ return next;
4736 ++#ifdef CONFIG_SMP
4737 ++ }
4738 ++ next = rq_runnable_task(rq);
4739 ++#endif
4740 ++ }
4741 ++ rq->skip = NULL;
4742 ++#ifdef CONFIG_HIGH_RES_TIMERS
4743 ++ hrtick_start(rq, next->time_slice);
4744 ++#endif
4745 ++ return next;
4746 ++ }
4747 ++
4748 ++ next = sched_rq_first_task(rq);
4749 ++ if (next == rq->idle) {
4750 ++#ifdef CONFIG_SMP
4751 ++ if (!take_other_rq_tasks(rq, cpu)) {
4752 ++#endif
4753 ++ schedstat_inc(rq->sched_goidle);
4754 ++ /*printk(KERN_INFO "sched: choose_next_task(%d) idle %px\n", cpu, next);*/
4755 ++ return next;
4756 ++#ifdef CONFIG_SMP
4757 ++ }
4758 ++ next = sched_rq_first_task(rq);
4759 ++#endif
4760 ++ }
4761 ++#ifdef CONFIG_HIGH_RES_TIMERS
4762 ++ hrtick_start(rq, next->time_slice);
4763 ++#endif
4764 ++ /*printk(KERN_INFO "sched: choose_next_task(%d) next %px\n", cpu,
4765 ++ * next);*/
4766 ++ return next;
4767 ++}
4768 ++
4769 ++/*
4770 ++ * schedule() is the main scheduler function.
4771 ++ *
4772 ++ * The main means of driving the scheduler and thus entering this function are:
4773 ++ *
4774 ++ * 1. Explicit blocking: mutex, semaphore, waitqueue, etc.
4775 ++ *
4776 ++ * 2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
4777 ++ * paths. For example, see arch/x86/entry_64.S.
4778 ++ *
4779 ++ * To drive preemption between tasks, the scheduler sets the flag in timer
4780 ++ * interrupt handler scheduler_tick().
4781 ++ *
4782 ++ * 3. Wakeups don't really cause entry into schedule(). They add a
4783 ++ * task to the run-queue and that's it.
4784 ++ *
4785 ++ * Now, if the new task added to the run-queue preempts the current
4786 ++ * task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
4787 ++ * called on the nearest possible occasion:
4788 ++ *
4789 ++ * - If the kernel is preemptible (CONFIG_PREEMPTION=y):
4790 ++ *
4791 ++ * - in syscall or exception context, at the next outermost
4792 ++ * preempt_enable(). (this might be as soon as the wake_up()'s
4793 ++ * spin_unlock()!)
4794 ++ *
4795 ++ * - in IRQ context, return from interrupt-handler to
4796 ++ * preemptible context
4797 ++ *
4798 ++ * - If the kernel is not preemptible (CONFIG_PREEMPTION is not set)
4799 ++ * then at the next:
4800 ++ *
4801 ++ * - cond_resched() call
4802 ++ * - explicit schedule() call
4803 ++ * - return from syscall or exception to user-space
4804 ++ * - return from interrupt-handler to user-space
4805 ++ *
4806 ++ * WARNING: must be called with preemption disabled!
4807 ++ */
4808 ++static void __sched notrace __schedule(bool preempt)
4809 ++{
4810 ++ struct task_struct *prev, *next;
4811 ++ unsigned long *switch_count;
4812 ++ unsigned long prev_state;
4813 ++ struct rq *rq;
4814 ++ int cpu;
4815 ++
4816 ++ cpu = smp_processor_id();
4817 ++ rq = cpu_rq(cpu);
4818 ++ prev = rq->curr;
4819 ++
4820 ++ schedule_debug(prev, preempt);
4821 ++
4822 ++ /* bypass sched_feat(HRTICK) checking, which the Alt schedule FW doesn't support */
4823 ++ hrtick_clear(rq);
4824 ++
4825 ++ local_irq_disable();
4826 ++ rcu_note_context_switch(preempt);
4827 ++
4828 ++ /*
4829 ++ * Make sure that signal_pending_state()->signal_pending() below
4830 ++ * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
4831 ++ * done by the caller to avoid the race with signal_wake_up():
4832 ++ *
4833 ++ * __set_current_state(@state) signal_wake_up()
4834 ++ * schedule() set_tsk_thread_flag(p, TIF_SIGPENDING)
4835 ++ * wake_up_state(p, state)
4836 ++ * LOCK rq->lock LOCK p->pi_state
4837 ++ * smp_mb__after_spinlock() smp_mb__after_spinlock()
4838 ++ * if (signal_pending_state()) if (p->state & @state)
4839 ++ *
4840 ++ * Also, the membarrier system call requires a full memory barrier
4841 ++ * after coming from user-space, before storing to rq->curr.
4842 ++ */
4843 ++ raw_spin_lock(&rq->lock);
4844 ++ smp_mb__after_spinlock();
4845 ++
4846 ++ update_rq_clock(rq);
4847 ++
4848 ++ switch_count = &prev->nivcsw;
4849 ++ /*
4850 ++ * We must load prev->state once (task_struct::state is volatile), such
4851 ++ * that:
4852 ++ *
4853 ++ * - we form a control dependency vs deactivate_task() below.
4854 ++ * - ptrace_{,un}freeze_traced() can change ->state underneath us.
4855 ++ */
4856 ++ prev_state = READ_ONCE(prev->__state);
4857 ++ if (!preempt && prev_state) {
4858 ++ if (signal_pending_state(prev_state, prev)) {
4859 ++ WRITE_ONCE(prev->__state, TASK_RUNNING);
4860 ++ } else {
4861 ++ prev->sched_contributes_to_load =
4862 ++ (prev_state & TASK_UNINTERRUPTIBLE) &&
4863 ++ !(prev_state & TASK_NOLOAD) &&
4864 ++ !(prev->flags & PF_FROZEN);
4865 ++
4866 ++ if (prev->sched_contributes_to_load)
4867 ++ rq->nr_uninterruptible++;
4868 ++
4869 ++ /*
4870 ++ * __schedule() ttwu()
4871 ++ * prev_state = prev->state; if (p->on_rq && ...)
4872 ++ * if (prev_state) goto out;
4873 ++ * p->on_rq = 0; smp_acquire__after_ctrl_dep();
4874 ++ * p->state = TASK_WAKING
4875 ++ *
4876 ++ * Where __schedule() and ttwu() have matching control dependencies.
4877 ++ *
4878 ++ * After this, schedule() must not care about p->state any more.
4879 ++ */
4880 ++ sched_task_deactivate(prev, rq);
4881 ++ deactivate_task(prev, rq);
4882 ++
4883 ++ if (prev->in_iowait) {
4884 ++ atomic_inc(&rq->nr_iowait);
4885 ++ delayacct_blkio_start();
4886 ++ }
4887 ++ }
4888 ++ switch_count = &prev->nvcsw;
4889 ++ }
4890 ++
4891 ++ check_curr(prev, rq);
4892 ++
4893 ++ next = choose_next_task(rq, cpu, prev);
4894 ++ clear_tsk_need_resched(prev);
4895 ++ clear_preempt_need_resched();
4896 ++#ifdef CONFIG_SCHED_DEBUG
4897 ++ rq->last_seen_need_resched_ns = 0;
4898 ++#endif
4899 ++
4900 ++ if (likely(prev != next)) {
4901 ++ next->last_ran = rq->clock_task;
4902 ++ rq->last_ts_switch = rq->clock;
4903 ++
4904 ++ rq->nr_switches++;
4905 ++ /*
4906 ++ * RCU users of rcu_dereference(rq->curr) may not see
4907 ++ * changes to task_struct made by pick_next_task().
4908 ++ */
4909 ++ RCU_INIT_POINTER(rq->curr, next);
4910 ++ /*
4911 ++ * The membarrier system call requires each architecture
4912 ++ * to have a full memory barrier after updating
4913 ++ * rq->curr, before returning to user-space.
4914 ++ *
4915 ++ * Here are the schemes providing that barrier on the
4916 ++ * various architectures:
4917 ++ * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
4918 ++ * switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
4919 ++ * - finish_lock_switch() for weakly-ordered
4920 ++ * architectures where spin_unlock is a full barrier,
4921 ++ * - switch_to() for arm64 (weakly-ordered, spin_unlock
4922 ++ * is a RELEASE barrier),
4923 ++ */
4924 ++ ++*switch_count;
4925 ++
4926 ++ psi_sched_switch(prev, next, !task_on_rq_queued(prev));
4927 ++
4928 ++ trace_sched_switch(preempt, prev, next);
4929 ++
4930 ++ /* Also unlocks the rq: */
4931 ++ rq = context_switch(rq, prev, next);
4932 ++ } else {
4933 ++ __balance_callbacks(rq);
4934 ++ raw_spin_unlock_irq(&rq->lock);
4935 ++ }
4936 ++
4937 ++#ifdef CONFIG_SCHED_SMT
4938 ++ sg_balance_check(rq);
4939 ++#endif
4940 ++}
4941 ++
4942 ++void __noreturn do_task_dead(void)
4943 ++{
4944 ++ /* Causes final put_task_struct in finish_task_switch(): */
4945 ++ set_special_state(TASK_DEAD);
4946 ++
4947 ++ /* Tell freezer to ignore us: */
4948 ++ current->flags |= PF_NOFREEZE;
4949 ++
4950 ++ __schedule(false);
4951 ++ BUG();
4952 ++
4953 ++ /* Avoid "noreturn function does return" - but don't continue if BUG() is a NOP: */
4954 ++ for (;;)
4955 ++ cpu_relax();
4956 ++}
4957 ++
4958 ++static inline void sched_submit_work(struct task_struct *tsk)
4959 ++{
4960 ++ unsigned int task_flags;
4961 ++
4962 ++ if (task_is_running(tsk))
4963 ++ return;
4964 ++
4965 ++ task_flags = tsk->flags;
4966 ++ /*
4967 ++ * If a worker went to sleep, notify and ask workqueue whether
4968 ++ * it wants to wake up a task to maintain concurrency.
4969 ++ * As this function is called inside the schedule() context,
4970 ++ * we disable preemption to avoid it calling schedule() again
4971 ++ * in the possible wakeup of a kworker and because wq_worker_sleeping()
4972 ++ * requires it.
4973 ++ */
4974 ++ if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
4975 ++ preempt_disable();
4976 ++ if (task_flags & PF_WQ_WORKER)
4977 ++ wq_worker_sleeping(tsk);
4978 ++ else
4979 ++ io_wq_worker_sleeping(tsk);
4980 ++ preempt_enable_no_resched();
4981 ++ }
4982 ++
4983 ++ if (tsk_is_pi_blocked(tsk))
4984 ++ return;
4985 ++
4986 ++ /*
4987 ++ * If we are going to sleep and we have plugged IO queued,
4988 ++ * make sure to submit it to avoid deadlocks.
4989 ++ */
4990 ++ if (blk_needs_flush_plug(tsk))
4991 ++ blk_schedule_flush_plug(tsk);
4992 ++}
4993 ++
4994 ++static void sched_update_worker(struct task_struct *tsk)
4995 ++{
4996 ++ if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
4997 ++ if (tsk->flags & PF_WQ_WORKER)
4998 ++ wq_worker_running(tsk);
4999 ++ else
5000 ++ io_wq_worker_running(tsk);
5001 ++ }
5002 ++}
5003 ++
5004 ++asmlinkage __visible void __sched schedule(void)
5005 ++{
5006 ++ struct task_struct *tsk = current;
5007 ++
5008 ++ sched_submit_work(tsk);
5009 ++ do {
5010 ++ preempt_disable();
5011 ++ __schedule(false);
5012 ++ sched_preempt_enable_no_resched();
5013 ++ } while (need_resched());
5014 ++ sched_update_worker(tsk);
5015 ++}
5016 ++EXPORT_SYMBOL(schedule);
5017 ++
5018 ++/*
5019 ++ * synchronize_rcu_tasks() makes sure that no task is stuck in a preempted
5020 ++ * state (i.e., has been scheduled out non-voluntarily) by making sure that all
5021 ++ * tasks have either left the run queue or have gone into user space.
5022 ++ * As idle tasks do not do either, they must not ever be preempted
5023 ++ * (schedule out non-voluntarily).
5024 ++ *
5025 ++ * schedule_idle() is similar to schedule_preempt_disabled() except that it
5026 ++ * never enables preemption because it does not call sched_submit_work().
5027 ++ */
5028 ++void __sched schedule_idle(void)
5029 ++{
5030 ++ /*
5031 ++ * As this skips calling sched_submit_work(), which the idle task does
5032 ++ * regardless because that function is a nop when the task is in a
5033 ++ * TASK_RUNNING state, make sure this isn't used someplace that the
5034 ++ * current task can be in any other state. Note, idle is always in the
5035 ++ * TASK_RUNNING state.
5036 ++ */
5037 ++ WARN_ON_ONCE(current->__state);
5038 ++ do {
5039 ++ __schedule(false);
5040 ++ } while (need_resched());
5041 ++}
5042 ++
5043 ++#if defined(CONFIG_CONTEXT_TRACKING) && !defined(CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK)
5044 ++asmlinkage __visible void __sched schedule_user(void)
5045 ++{
5046 ++ /*
5047 ++ * If we come here after a random call to set_need_resched(),
5048 ++ * or we have been woken up remotely but the IPI has not yet arrived,
5049 ++ * we haven't yet exited the RCU idle mode. Do it here manually until
5050 ++ * we find a better solution.
5051 ++ *
5052 ++ * NB: There are buggy callers of this function. Ideally we
5053 ++ * should warn if prev_state != CONTEXT_USER, but that will trigger
5054 ++ * too frequently to make sense yet.
5055 ++ */
5056 ++ enum ctx_state prev_state = exception_enter();
5057 ++ schedule();
5058 ++ exception_exit(prev_state);
5059 ++}
5060 ++#endif
5061 ++
5062 ++/**
5063 ++ * schedule_preempt_disabled - called with preemption disabled
5064 ++ *
5065 ++ * Returns with preemption disabled. Note: preempt_count must be 1
5066 ++ */
5067 ++void __sched schedule_preempt_disabled(void)
5068 ++{
5069 ++ sched_preempt_enable_no_resched();
5070 ++ schedule();
5071 ++ preempt_disable();
5072 ++}
5073 ++
5074 ++static void __sched notrace preempt_schedule_common(void)
5075 ++{
5076 ++ do {
5077 ++ /*
5078 ++ * Because the function tracer can trace preempt_count_sub()
5079 ++ * and it also uses preempt_enable/disable_notrace(), if
5080 ++ * NEED_RESCHED is set, the preempt_enable_notrace() called
5081 ++ * by the function tracer will call this function again and
5082 ++ * cause infinite recursion.
5083 ++ *
5084 ++ * Preemption must be disabled here before the function
5085 ++ * tracer can trace. Break up preempt_disable() into two
5086 ++ * calls. One to disable preemption without fear of being
5087 ++ * traced. The other to still record the preemption latency,
5088 ++ * which can also be traced by the function tracer.
5089 ++ */
5090 ++ preempt_disable_notrace();
5091 ++ preempt_latency_start(1);
5092 ++ __schedule(true);
5093 ++ preempt_latency_stop(1);
5094 ++ preempt_enable_no_resched_notrace();
5095 ++
5096 ++ /*
5097 ++ * Check again in case we missed a preemption opportunity
5098 ++ * between schedule and now.
5099 ++ */
5100 ++ } while (need_resched());
5101 ++}
5102 ++
5103 ++#ifdef CONFIG_PREEMPTION
5104 ++/*
5105 ++ * This is the entry point to schedule() from in-kernel preemption
5106 ++ * off of preempt_enable.
5107 ++ */
5108 ++asmlinkage __visible void __sched notrace preempt_schedule(void)
5109 ++{
5110 ++ /*
5111 ++ * If there is a non-zero preempt_count or interrupts are disabled,
5112 ++ * we do not want to preempt the current task. Just return..
5113 ++ */
5114 ++ if (likely(!preemptible()))
5115 ++ return;
5116 ++
5117 ++ preempt_schedule_common();
5118 ++}
5119 ++NOKPROBE_SYMBOL(preempt_schedule);
5120 ++EXPORT_SYMBOL(preempt_schedule);
5121 ++
5122 ++#ifdef CONFIG_PREEMPT_DYNAMIC
5123 ++DEFINE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
5124 ++EXPORT_STATIC_CALL_TRAMP(preempt_schedule);
5125 ++#endif
5126 ++
5127 ++
5128 ++/**
5129 ++ * preempt_schedule_notrace - preempt_schedule called by tracing
5130 ++ *
5131 ++ * The tracing infrastructure uses preempt_enable_notrace to prevent
5132 ++ * recursion and tracing preempt enabling caused by the tracing
5133 ++ * infrastructure itself. But as tracing can happen in areas coming
5134 ++ * from userspace or just about to enter userspace, a preempt enable
5135 ++ * can occur before user_exit() is called. This will cause the scheduler
5136 ++ * to be called when the system is still in usermode.
5137 ++ *
5138 ++ * To prevent this, the preempt_enable_notrace will use this function
5139 ++ * instead of preempt_schedule() to exit user context if needed before
5140 ++ * calling the scheduler.
5141 ++ */
5142 ++asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)
5143 ++{
5144 ++ enum ctx_state prev_ctx;
5145 ++
5146 ++ if (likely(!preemptible()))
5147 ++ return;
5148 ++
5149 ++ do {
5150 ++ /*
5151 ++ * Because the function tracer can trace preempt_count_sub()
5152 ++ * and it also uses preempt_enable/disable_notrace(), if
5153 ++ * NEED_RESCHED is set, the preempt_enable_notrace() called
5154 ++ * by the function tracer will call this function again and
5155 ++ * cause infinite recursion.
5156 ++ *
5157 ++ * Preemption must be disabled here before the function
5158 ++ * tracer can trace. Break up preempt_disable() into two
5159 ++ * calls. One to disable preemption without fear of being
5160 ++ * traced. The other to still record the preemption latency,
5161 ++ * which can also be traced by the function tracer.
5162 ++ */
5163 ++ preempt_disable_notrace();
5164 ++ preempt_latency_start(1);
5165 ++ /*
5166 ++ * Needs preempt disabled in case user_exit() is traced
5167 ++ * and the tracer calls preempt_enable_notrace() causing
5168 ++ * an infinite recursion.
5169 ++ */
5170 ++ prev_ctx = exception_enter();
5171 ++ __schedule(true);
5172 ++ exception_exit(prev_ctx);
5173 ++
5174 ++ preempt_latency_stop(1);
5175 ++ preempt_enable_no_resched_notrace();
5176 ++ } while (need_resched());
5177 ++}
5178 ++EXPORT_SYMBOL_GPL(preempt_schedule_notrace);
5179 ++
5180 ++#ifdef CONFIG_PREEMPT_DYNAMIC
5181 ++DEFINE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
5182 ++EXPORT_STATIC_CALL_TRAMP(preempt_schedule_notrace);
5183 ++#endif
5184 ++
5185 ++#endif /* CONFIG_PREEMPTION */
5186 ++
5187 ++#ifdef CONFIG_PREEMPT_DYNAMIC
5188 ++
5189 ++#include <linux/entry-common.h>
5190 ++
5191 ++/*
5192 ++ * SC:cond_resched
5193 ++ * SC:might_resched
5194 ++ * SC:preempt_schedule
5195 ++ * SC:preempt_schedule_notrace
5196 ++ * SC:irqentry_exit_cond_resched
5197 ++ *
5198 ++ *
5199 ++ * NONE:
5200 ++ * cond_resched <- __cond_resched
5201 ++ * might_resched <- RET0
5202 ++ * preempt_schedule <- NOP
5203 ++ * preempt_schedule_notrace <- NOP
5204 ++ * irqentry_exit_cond_resched <- NOP
5205 ++ *
5206 ++ * VOLUNTARY:
5207 ++ * cond_resched <- __cond_resched
5208 ++ * might_resched <- __cond_resched
5209 ++ * preempt_schedule <- NOP
5210 ++ * preempt_schedule_notrace <- NOP
5211 ++ * irqentry_exit_cond_resched <- NOP
5212 ++ *
5213 ++ * FULL:
5214 ++ * cond_resched <- RET0
5215 ++ * might_resched <- RET0
5216 ++ * preempt_schedule <- preempt_schedule
5217 ++ * preempt_schedule_notrace <- preempt_schedule_notrace
5218 ++ * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
5219 ++ */
5220 ++
5221 ++enum {
5222 ++ preempt_dynamic_none = 0,
5223 ++ preempt_dynamic_voluntary,
5224 ++ preempt_dynamic_full,
5225 ++};
5226 ++
5227 ++int preempt_dynamic_mode = preempt_dynamic_full;
5228 ++
5229 ++int sched_dynamic_mode(const char *str)
5230 ++{
5231 ++ if (!strcmp(str, "none"))
5232 ++ return preempt_dynamic_none;
5233 ++
5234 ++ if (!strcmp(str, "voluntary"))
5235 ++ return preempt_dynamic_voluntary;
5236 ++
5237 ++ if (!strcmp(str, "full"))
5238 ++ return preempt_dynamic_full;
5239 ++
5240 ++ return -EINVAL;
5241 ++}
5242 ++
5243 ++void sched_dynamic_update(int mode)
5244 ++{
5245 ++ /*
5246 ++ * Avoid {NONE,VOLUNTARY} -> FULL transitions from ever ending up in
5247 ++ * the ZERO state, which is invalid.
5248 ++ */
5249 ++ static_call_update(cond_resched, __cond_resched);
5250 ++ static_call_update(might_resched, __cond_resched);
5251 ++ static_call_update(preempt_schedule, __preempt_schedule_func);
5252 ++ static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
5253 ++ static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
5254 ++
5255 ++ switch (mode) {
5256 ++ case preempt_dynamic_none:
5257 ++ static_call_update(cond_resched, __cond_resched);
5258 ++ static_call_update(might_resched, (void *)&__static_call_return0);
5259 ++ static_call_update(preempt_schedule, NULL);
5260 ++ static_call_update(preempt_schedule_notrace, NULL);
5261 ++ static_call_update(irqentry_exit_cond_resched, NULL);
5262 ++ pr_info("Dynamic Preempt: none\n");
5263 ++ break;
5264 ++
5265 ++ case preempt_dynamic_voluntary:
5266 ++ static_call_update(cond_resched, __cond_resched);
5267 ++ static_call_update(might_resched, __cond_resched);
5268 ++ static_call_update(preempt_schedule, NULL);
5269 ++ static_call_update(preempt_schedule_notrace, NULL);
5270 ++ static_call_update(irqentry_exit_cond_resched, NULL);
5271 ++ pr_info("Dynamic Preempt: voluntary\n");
5272 ++ break;
5273 ++
5274 ++ case preempt_dynamic_full:
5275 ++ static_call_update(cond_resched, (void *)&__static_call_return0);
5276 ++ static_call_update(might_resched, (void *)&__static_call_return0);
5277 ++ static_call_update(preempt_schedule, __preempt_schedule_func);
5278 ++ static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
5279 ++ static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
5280 ++ pr_info("Dynamic Preempt: full\n");
5281 ++ break;
5282 ++ }
5283 ++
5284 ++ preempt_dynamic_mode = mode;
5285 ++}
5286 ++
5287 ++static int __init setup_preempt_mode(char *str)
5288 ++{
5289 ++ int mode = sched_dynamic_mode(str);
5290 ++ if (mode < 0) {
5291 ++ pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
5292 ++ return 1;
5293 ++ }
5294 ++
5295 ++ sched_dynamic_update(mode);
5296 ++ return 0;
5297 ++}
5298 ++__setup("preempt=", setup_preempt_mode);
5299 ++
5300 ++#endif /* CONFIG_PREEMPT_DYNAMIC */
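The table above is the contract that sched_dynamic_update() enforces: each preempt= mode selects a target for every scheduler static call. As a rough, hedged illustration of the same dispatch idea, the sketch below uses ordinary C function pointers in place of static calls (real static calls patch the call sites themselves); the function names and return values are invented for the example and only mirror the NONE / VOLUNTARY / FULL rows above.

    /*
     * Minimal userspace sketch of the mode table above.  Ordinary C
     * function pointers stand in for the kernel's static calls (which
     * are really patched call sites); names and return values are
     * invented for illustration.
     */
    #include <stdio.h>
    #include <string.h>

    static int real_cond_resched(void) { return 1; }   /* plays __cond_resched */
    static int ret0(void)              { return 0; }   /* plays RET0 */

    static int (*cond_resched_fp)(void)  = real_cond_resched;
    static int (*might_resched_fp)(void) = ret0;

    static void dynamic_update(const char *mode)
    {
        if (!strcmp(mode, "none")) {
            cond_resched_fp  = real_cond_resched;
            might_resched_fp = ret0;
        } else if (!strcmp(mode, "voluntary")) {
            cond_resched_fp  = real_cond_resched;
            might_resched_fp = real_cond_resched;
        } else if (!strcmp(mode, "full")) {
            cond_resched_fp  = ret0;
            might_resched_fp = ret0;
        }
        printf("%-9s cond_resched()=%d might_resched()=%d\n",
               mode, cond_resched_fp(), might_resched_fp());
    }

    int main(void)
    {
        dynamic_update("none");
        dynamic_update("voluntary");
        dynamic_update("full");
        return 0;
    }

In the kernel the mode comes from the preempt= boot parameter handled by setup_preempt_mode(), with preempt_dynamic_full as the compiled-in default (see preempt_dynamic_mode above).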
5301 ++
5302 ++/*
5303 ++ * This is the entry point to schedule() from kernel preemption
5304 ++ * off of irq context.
5305 ++ * Note, that this is called and return with irqs disabled. This will
5306 ++ * protect us against recursive calling from irq.
5307 ++ */
5308 ++asmlinkage __visible void __sched preempt_schedule_irq(void)
5309 ++{
5310 ++ enum ctx_state prev_state;
5311 ++
5312 ++ /* Catch callers which need to be fixed */
5313 ++ BUG_ON(preempt_count() || !irqs_disabled());
5314 ++
5315 ++ prev_state = exception_enter();
5316 ++
5317 ++ do {
5318 ++ preempt_disable();
5319 ++ local_irq_enable();
5320 ++ __schedule(true);
5321 ++ local_irq_disable();
5322 ++ sched_preempt_enable_no_resched();
5323 ++ } while (need_resched());
5324 ++
5325 ++ exception_exit(prev_state);
5326 ++}
5327 ++
5328 ++int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
5329 ++ void *key)
5330 ++{
5331 ++ WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) && wake_flags & ~WF_SYNC);
5332 ++ return try_to_wake_up(curr->private, mode, wake_flags);
5333 ++}
5334 ++EXPORT_SYMBOL(default_wake_function);
5335 ++
5336 ++static inline void check_task_changed(struct task_struct *p, struct rq *rq)
5337 ++{
5338 ++ /* Trigger resched if task sched_prio has been modified. */
5339 ++ if (task_on_rq_queued(p) && task_sched_prio_idx(p, rq) != p->sq_idx) {
5340 ++ requeue_task(p, rq);
5341 ++ check_preempt_curr(rq);
5342 ++ }
5343 ++}
5344 ++
5345 ++static void __setscheduler_prio(struct task_struct *p, int prio)
5346 ++{
5347 ++ p->prio = prio;
5348 ++}
5349 ++
5350 ++#ifdef CONFIG_RT_MUTEXES
5351 ++
5352 ++static inline int __rt_effective_prio(struct task_struct *pi_task, int prio)
5353 ++{
5354 ++ if (pi_task)
5355 ++ prio = min(prio, pi_task->prio);
5356 ++
5357 ++ return prio;
5358 ++}
5359 ++
5360 ++static inline int rt_effective_prio(struct task_struct *p, int prio)
5361 ++{
5362 ++ struct task_struct *pi_task = rt_mutex_get_top_task(p);
5363 ++
5364 ++ return __rt_effective_prio(pi_task, prio);
5365 ++}
5366 ++
5367 ++/*
5368 ++ * rt_mutex_setprio - set the current priority of a task
5369 ++ * @p: task to boost
5370 ++ * @pi_task: donor task
5371 ++ *
5372 ++ * This function changes the 'effective' priority of a task. It does
5373 ++ * not touch ->normal_prio like __setscheduler().
5374 ++ *
5375 ++ * Used by the rt_mutex code to implement priority inheritance
5376 ++ * logic. Call site only calls if the priority of the task changed.
5377 ++ */
5378 ++void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
5379 ++{
5380 ++ int prio;
5381 ++ struct rq *rq;
5382 ++ raw_spinlock_t *lock;
5383 ++
5384 ++ /* XXX used to be waiter->prio, not waiter->task->prio */
5385 ++ prio = __rt_effective_prio(pi_task, p->normal_prio);
5386 ++
5387 ++ /*
5388 ++ * If nothing changed; bail early.
5389 ++ */
5390 ++ if (p->pi_top_task == pi_task && prio == p->prio)
5391 ++ return;
5392 ++
5393 ++ rq = __task_access_lock(p, &lock);
5394 ++ /*
5395 ++ * Set under pi_lock && rq->lock, such that the value can be used under
5396 ++ * either lock.
5397 ++ *
5398 ++ * Note that there is a lot of trickery involved in making this pointer cache work
5399 ++ * right. rt_mutex_slowunlock()+rt_mutex_postunlock() work together to
5400 ++ * ensure a task is de-boosted (pi_task is set to NULL) before the
5401 ++ * task is allowed to run again (and can exit). This ensures the pointer
5402 ++ * points to a blocked task -- which guarantees the task is present.
5403 ++ */
5404 ++ p->pi_top_task = pi_task;
5405 ++
5406 ++ /*
5407 ++ * For FIFO/RR we only need to set prio, if that matches we're done.
5408 ++ */
5409 ++ if (prio == p->prio)
5410 ++ goto out_unlock;
5411 ++
5412 ++ /*
5413 ++ * Idle task boosting is a no-no in general. There is one
5414 ++ * exception, when PREEMPT_RT and NOHZ is active:
5415 ++ *
5416 ++ * The idle task calls get_next_timer_interrupt() and holds
5417 ++ * the timer wheel base->lock on the CPU and another CPU wants
5418 ++ * to access the timer (probably to cancel it). We can safely
5419 ++ * ignore the boosting request, as the idle CPU runs this code
5420 ++ * with interrupts disabled and will complete the lock
5421 ++ * protected section without being interrupted. So there is no
5422 ++ * real need to boost.
5423 ++ */
5424 ++ if (unlikely(p == rq->idle)) {
5425 ++ WARN_ON(p != rq->curr);
5426 ++ WARN_ON(p->pi_blocked_on);
5427 ++ goto out_unlock;
5428 ++ }
5429 ++
5430 ++ trace_sched_pi_setprio(p, pi_task);
5431 ++
5432 ++ __setscheduler_prio(p, prio);
5433 ++
5434 ++ check_task_changed(p, rq);
5435 ++out_unlock:
5436 ++ /* Avoid rq from going away on us: */
5437 ++ preempt_disable();
5438 ++
5439 ++ __balance_callbacks(rq);
5440 ++ __task_access_unlock(p, lock);
5441 ++
5442 ++ preempt_enable();
5443 ++}
5444 ++#else
5445 ++static inline int rt_effective_prio(struct task_struct *p, int prio)
5446 ++{
5447 ++ return prio;
5448 ++}
5449 ++#endif
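rt_mutex_setprio() above is the in-kernel half of priority inheritance: the boosted prio is computed from the top PI waiter and applied through check_task_changed(). From userspace the same machinery is typically reached through PI futexes, for example a pthread mutex created with the PTHREAD_PRIO_INHERIT protocol. A minimal, hedged sketch (error handling omitted; the boost is only observable when a real-time waiter actually blocks on the lock):

    /*
     * Userspace view of priority inheritance: a pthread mutex created
     * with PTHREAD_PRIO_INHERIT maps to a PI futex, so a low-priority
     * holder is boosted by the rt_mutex code while a real-time thread
     * is blocked on the lock.  Sketch only; build with -pthread.
     */
    #include <pthread.h>

    int main(void)
    {
        pthread_mutexattr_t attr;
        pthread_mutex_t lock;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        pthread_mutex_init(&lock, &attr);

        pthread_mutex_lock(&lock);
        /* critical section: a blocked RT waiter would boost us here */
        pthread_mutex_unlock(&lock);

        pthread_mutex_destroy(&lock);
        pthread_mutexattr_destroy(&attr);
        return 0;
    }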
5450 ++
5451 ++void set_user_nice(struct task_struct *p, long nice)
5452 ++{
5453 ++ unsigned long flags;
5454 ++ struct rq *rq;
5455 ++ raw_spinlock_t *lock;
5456 ++
5457 ++ if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE)
5458 ++ return;
5459 ++ /*
5460 ++ * We have to be careful, if called from sys_setpriority(),
5461 ++ * the task might be in the middle of scheduling on another CPU.
5462 ++ */
5463 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
5464 ++ rq = __task_access_lock(p, &lock);
5465 ++
5466 ++ p->static_prio = NICE_TO_PRIO(nice);
5467 ++ /*
5468 ++ * The RT priorities are set via sched_setscheduler(), but we still
5469 ++ * allow the 'normal' nice value to be set - but as expected
5470 ++ * it won't have any effect on scheduling until the task is
5471 ++ * SCHED_NORMAL/SCHED_BATCH:
5472 ++ */
5473 ++ if (task_has_rt_policy(p))
5474 ++ goto out_unlock;
5475 ++
5476 ++ p->prio = effective_prio(p);
5477 ++
5478 ++ check_task_changed(p, rq);
5479 ++out_unlock:
5480 ++ __task_access_unlock(p, lock);
5481 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
5482 ++}
5483 ++EXPORT_SYMBOL(set_user_nice);
5484 ++
5485 ++/*
5486 ++ * can_nice - check if a task can reduce its nice value
5487 ++ * @p: task
5488 ++ * @nice: nice value
5489 ++ */
5490 ++int can_nice(const struct task_struct *p, const int nice)
5491 ++{
5492 ++ /* Convert nice value [19,-20] to rlimit style value [1,40] */
5493 ++ int nice_rlim = nice_to_rlimit(nice);
5494 ++
5495 ++ return (nice_rlim <= task_rlimit(p, RLIMIT_NICE) ||
5496 ++ capable(CAP_SYS_NICE));
5497 ++}
5498 ++
5499 ++#ifdef __ARCH_WANT_SYS_NICE
5500 ++
5501 ++/*
5502 ++ * sys_nice - change the priority of the current process.
5503 ++ * @increment: priority increment
5504 ++ *
5505 ++ * sys_setpriority is a more generic, but much slower function that
5506 ++ * does similar things.
5507 ++ */
5508 ++SYSCALL_DEFINE1(nice, int, increment)
5509 ++{
5510 ++ long nice, retval;
5511 ++
5512 ++ /*
5513 ++ * Setpriority might change our priority at the same moment.
5514 ++ * We don't have to worry. Conceptually one call occurs first
5515 ++ * and we have a single winner.
5516 ++ */
5517 ++
5518 ++ increment = clamp(increment, -NICE_WIDTH, NICE_WIDTH);
5519 ++ nice = task_nice(current) + increment;
5520 ++
5521 ++ nice = clamp_val(nice, MIN_NICE, MAX_NICE);
5522 ++ if (increment < 0 && !can_nice(current, nice))
5523 ++ return -EPERM;
5524 ++
5525 ++ retval = security_task_setnice(current, nice);
5526 ++ if (retval)
5527 ++ return retval;
5528 ++
5529 ++ set_user_nice(current, nice);
5530 ++ return 0;
5531 ++}
5532 ++
5533 ++#endif
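set_user_nice(), can_nice() and sys_nice() above are the kernel side of the ordinary nice machinery; userspace normally reaches them through nice(2) or setpriority(2). A small hedged example:

    /*
     * Drop the calling process by five nice levels and read the value
     * back.  Illustrative only; uses the standard nice(2) and
     * getpriority(2) wrappers.
     */
    #include <errno.h>
    #include <stdio.h>
    #include <sys/resource.h>
    #include <unistd.h>

    int main(void)
    {
        errno = 0;
        if (nice(5) == -1 && errno)     /* -1 is also a valid nice value */
            perror("nice");

        printf("nice value is now %d\n", getpriority(PRIO_PROCESS, 0));
        return 0;
    }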
5534 ++
5535 ++/**
5536 ++ * task_prio - return the priority value of a given task.
5537 ++ * @p: the task in question.
5538 ++ *
5539 ++ * Return: The priority value as seen by users in /proc.
5540 ++ *
5541 ++ * sched policy                return value   kernel prio    user prio/nice
5542 ++ *
5543 ++ * (BMQ)normal, batch, idle    [0 ... 53]     [100 ... 139]  0/[-20 ... 19]/[-7 ... 7]
5544 ++ * (PDS)normal, batch, idle    [0 ... 39]     100            0/[-20 ... 19]
5545 ++ * fifo, rr                    [-1 ... -100]  [99 ... 0]     [0 ... 99]
5546 ++ */
5547 ++int task_prio(const struct task_struct *p)
5548 ++{
5549 ++ return (p->prio < MAX_RT_PRIO) ? p->prio - MAX_RT_PRIO :
5550 ++ task_sched_prio_normal(p, task_rq(p));
5551 ++}
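The "fifo, rr" row of the table can be reproduced from the return statement above: for RT tasks task_prio() returns p->prio - MAX_RT_PRIO. The standalone check below assumes MAX_RT_PRIO == 100 and the usual mapping of a user RT priority n to kernel prio MAX_RT_PRIO - 1 - n; both are assumptions stated for the example.

    /*
     * Recompute the "fifo, rr" row of the table from the formula in
     * task_prio(): return value = kernel prio - MAX_RT_PRIO for RT
     * tasks.  Assumes MAX_RT_PRIO == 100 and kernel prio =
     * MAX_RT_PRIO - 1 - user rt_priority.
     */
    #include <stdio.h>

    #define MAX_RT_PRIO 100

    int main(void)
    {
        for (int rt_priority = 1; rt_priority <= 99; rt_priority += 49) {
            int kernel_prio = MAX_RT_PRIO - 1 - rt_priority;
            int prio_value  = kernel_prio - MAX_RT_PRIO;
            printf("user rt_priority %2d -> kernel prio %2d -> task_prio %4d\n",
                   rt_priority, kernel_prio, prio_value);
        }
        return 0;
    }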
5552 ++
5553 ++/**
5554 ++ * idle_cpu - is a given CPU idle currently?
5555 ++ * @cpu: the processor in question.
5556 ++ *
5557 ++ * Return: 1 if the CPU is currently idle. 0 otherwise.
5558 ++ */
5559 ++int idle_cpu(int cpu)
5560 ++{
5561 ++ struct rq *rq = cpu_rq(cpu);
5562 ++
5563 ++ if (rq->curr != rq->idle)
5564 ++ return 0;
5565 ++
5566 ++ if (rq->nr_running)
5567 ++ return 0;
5568 ++
5569 ++#ifdef CONFIG_SMP
5570 ++ if (rq->ttwu_pending)
5571 ++ return 0;
5572 ++#endif
5573 ++
5574 ++ return 1;
5575 ++}
5576 ++
5577 ++/**
5578 ++ * idle_task - return the idle task for a given CPU.
5579 ++ * @cpu: the processor in question.
5580 ++ *
5581 ++ * Return: The idle task for the cpu @cpu.
5582 ++ */
5583 ++struct task_struct *idle_task(int cpu)
5584 ++{
5585 ++ return cpu_rq(cpu)->idle;
5586 ++}
5587 ++
5588 ++/**
5589 ++ * find_process_by_pid - find a process with a matching PID value.
5590 ++ * @pid: the pid in question.
5591 ++ *
5592 ++ * The task of @pid, if found. %NULL otherwise.
5593 ++ */
5594 ++static inline struct task_struct *find_process_by_pid(pid_t pid)
5595 ++{
5596 ++ return pid ? find_task_by_vpid(pid) : current;
5597 ++}
5598 ++
5599 ++/*
5600 ++ * sched_setparam() passes in -1 for its policy, to let the functions
5601 ++ * it calls know not to change it.
5602 ++ */
5603 ++#define SETPARAM_POLICY -1
5604 ++
5605 ++static void __setscheduler_params(struct task_struct *p,
5606 ++ const struct sched_attr *attr)
5607 ++{
5608 ++ int policy = attr->sched_policy;
5609 ++
5610 ++ if (policy == SETPARAM_POLICY)
5611 ++ policy = p->policy;
5612 ++
5613 ++ p->policy = policy;
5614 ++
5615 ++ /*
5616 ++ * Allow the normal nice value to be set, but it will not have any
5617 ++ * effect on scheduling until the task is SCHED_NORMAL/
5618 ++ * SCHED_BATCH.
5619 ++ */
5620 ++ p->static_prio = NICE_TO_PRIO(attr->sched_nice);
5621 ++
5622 ++ /*
5623 ++ * __sched_setscheduler() ensures attr->sched_priority == 0 when
5624 ++ * !rt_policy. Always setting this ensures that things like
5625 ++ * getparam()/getattr() don't report silly values for !rt tasks.
5626 ++ */
5627 ++ p->rt_priority = attr->sched_priority;
5628 ++ p->normal_prio = normal_prio(p);
5629 ++}
5630 ++
5631 ++/*
5632 ++ * check the target process has a UID that matches the current process's
5633 ++ */
5634 ++static bool check_same_owner(struct task_struct *p)
5635 ++{
5636 ++ const struct cred *cred = current_cred(), *pcred;
5637 ++ bool match;
5638 ++
5639 ++ rcu_read_lock();
5640 ++ pcred = __task_cred(p);
5641 ++ match = (uid_eq(cred->euid, pcred->euid) ||
5642 ++ uid_eq(cred->euid, pcred->uid));
5643 ++ rcu_read_unlock();
5644 ++ return match;
5645 ++}
5646 ++
5647 ++static int __sched_setscheduler(struct task_struct *p,
5648 ++ const struct sched_attr *attr,
5649 ++ bool user, bool pi)
5650 ++{
5651 ++ const struct sched_attr dl_squash_attr = {
5652 ++ .size = sizeof(struct sched_attr),
5653 ++ .sched_policy = SCHED_FIFO,
5654 ++ .sched_nice = 0,
5655 ++ .sched_priority = 99,
5656 ++ };
5657 ++ int oldpolicy = -1, policy = attr->sched_policy;
5658 ++ int retval, newprio;
5659 ++ struct callback_head *head;
5660 ++ unsigned long flags;
5661 ++ struct rq *rq;
5662 ++ int reset_on_fork;
5663 ++ raw_spinlock_t *lock;
5664 ++
5665 ++ /* The pi code expects interrupts enabled */
5666 ++ BUG_ON(pi && in_interrupt());
5667 ++
5668 ++ /*
5669 ++ * Alt schedule FW supports SCHED_DEADLINE by squashing it into prio 0 SCHED_FIFO
5670 ++ */
5671 ++ if (unlikely(SCHED_DEADLINE == policy)) {
5672 ++ attr = &dl_squash_attr;
5673 ++ policy = attr->sched_policy;
5674 ++ }
5675 ++recheck:
5676 ++ /* Double check policy once rq lock held */
5677 ++ if (policy < 0) {
5678 ++ reset_on_fork = p->sched_reset_on_fork;
5679 ++ policy = oldpolicy = p->policy;
5680 ++ } else {
5681 ++ reset_on_fork = !!(attr->sched_flags & SCHED_RESET_ON_FORK);
5682 ++
5683 ++ if (policy > SCHED_IDLE)
5684 ++ return -EINVAL;
5685 ++ }
5686 ++
5687 ++ if (attr->sched_flags & ~(SCHED_FLAG_ALL))
5688 ++ return -EINVAL;
5689 ++
5690 ++ /*
5691 ++ * Valid priorities for SCHED_FIFO and SCHED_RR are
5692 ++ * 1..MAX_RT_PRIO-1, valid priority for SCHED_NORMAL and
5693 ++ * SCHED_BATCH and SCHED_IDLE is 0.
5694 ++ */
5695 ++ if (attr->sched_priority < 0 ||
5696 ++ (p->mm && attr->sched_priority > MAX_RT_PRIO - 1) ||
5697 ++ (!p->mm && attr->sched_priority > MAX_RT_PRIO - 1))
5698 ++ return -EINVAL;
5699 ++ if ((SCHED_RR == policy || SCHED_FIFO == policy) !=
5700 ++ (attr->sched_priority != 0))
5701 ++ return -EINVAL;
5702 ++
5703 ++ /*
5704 ++ * Allow unprivileged RT tasks to decrease priority:
5705 ++ */
5706 ++ if (user && !capable(CAP_SYS_NICE)) {
5707 ++ if (SCHED_FIFO == policy || SCHED_RR == policy) {
5708 ++ unsigned long rlim_rtprio =
5709 ++ task_rlimit(p, RLIMIT_RTPRIO);
5710 ++
5711 ++ /* Can't set/change the rt policy */
5712 ++ if (policy != p->policy && !rlim_rtprio)
5713 ++ return -EPERM;
5714 ++
5715 ++ /* Can't increase priority */
5716 ++ if (attr->sched_priority > p->rt_priority &&
5717 ++ attr->sched_priority > rlim_rtprio)
5718 ++ return -EPERM;
5719 ++ }
5720 ++
5721 ++ /* Can't change other user's priorities */
5722 ++ if (!check_same_owner(p))
5723 ++ return -EPERM;
5724 ++
5725 ++ /* Normal users shall not reset the sched_reset_on_fork flag */
5726 ++ if (p->sched_reset_on_fork && !reset_on_fork)
5727 ++ return -EPERM;
5728 ++ }
5729 ++
5730 ++ if (user) {
5731 ++ retval = security_task_setscheduler(p);
5732 ++ if (retval)
5733 ++ return retval;
5734 ++ }
5735 ++
5736 ++ if (pi)
5737 ++ cpuset_read_lock();
5738 ++
5739 ++ /*
5740 ++ * Make sure no PI-waiters arrive (or leave) while we are
5741 ++ * changing the priority of the task:
5742 ++ */
5743 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
5744 ++
5745 ++ /*
5746 ++ * To be able to change p->policy safely, task_access_lock()
5747 ++ * must be called.
5748 ++ * IF use task_access_lock() here:
5749 ++ * If task_access_lock() is used here:
5750 ++ * For the task p which is not running, reading rq->stop is
5751 ++ * racy but acceptable as ->stop doesn't change much.
5752 ++ * An enhancement could be made to read rq->stop safely.
5753 ++ rq = __task_access_lock(p, &lock);
5754 ++
5755 ++ /*
5756 ++ * Changing the policy of the stop threads its a very bad idea
5757 ++ * Changing the policy of the stop threads is a very bad idea
5758 ++ if (p == rq->stop) {
5759 ++ retval = -EINVAL;
5760 ++ goto unlock;
5761 ++ }
5762 ++
5763 ++ /*
5764 ++ * If not changing anything there's no need to proceed further:
5765 ++ */
5766 ++ if (unlikely(policy == p->policy)) {
5767 ++ if (rt_policy(policy) && attr->sched_priority != p->rt_priority)
5768 ++ goto change;
5769 ++ if (!rt_policy(policy) &&
5770 ++ NICE_TO_PRIO(attr->sched_nice) != p->static_prio)
5771 ++ goto change;
5772 ++
5773 ++ p->sched_reset_on_fork = reset_on_fork;
5774 ++ retval = 0;
5775 ++ goto unlock;
5776 ++ }
5777 ++change:
5778 ++
5779 ++ /* Re-check policy now with rq lock held */
5780 ++ if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
5781 ++ policy = oldpolicy = -1;
5782 ++ __task_access_unlock(p, lock);
5783 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
5784 ++ if (pi)
5785 ++ cpuset_read_unlock();
5786 ++ goto recheck;
5787 ++ }
5788 ++
5789 ++ p->sched_reset_on_fork = reset_on_fork;
5790 ++
5791 ++ newprio = __normal_prio(policy, attr->sched_priority, NICE_TO_PRIO(attr->sched_nice));
5792 ++ if (pi) {
5793 ++ /*
5794 ++ * Take priority boosted tasks into account. If the new
5795 ++ * effective priority is unchanged, we just store the new
5796 ++ * normal parameters and do not touch the scheduler class and
5797 ++ * the runqueue. This will be done when the task deboost
5798 ++ * itself.
5799 ++ */
5800 ++ if (rt_effective_prio(p, newprio) == p->prio) {
5801 ++ __setscheduler_params(p, attr);
5802 ++ retval = 0;
5803 ++ goto unlock;
5804 ++ }
5805 ++ }
5806 ++
5807 ++ if (!(attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)) {
5808 ++ __setscheduler_params(p, attr);
5809 ++ __setscheduler_prio(p, newprio);
5810 ++ }
5811 ++
5812 ++ check_task_changed(p, rq);
5813 ++
5814 ++ /* Avoid rq from going away on us: */
5815 ++ preempt_disable();
5816 ++ head = splice_balance_callbacks(rq);
5817 ++ __task_access_unlock(p, lock);
5818 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
5819 ++
5820 ++ if (pi) {
5821 ++ cpuset_read_unlock();
5822 ++ rt_mutex_adjust_pi(p);
5823 ++ }
5824 ++
5825 ++ /* Run balance callbacks after we've adjusted the PI chain: */
5826 ++ balance_callbacks(rq, head);
5827 ++ preempt_enable();
5828 ++
5829 ++ return 0;
5830 ++
5831 ++unlock:
5832 ++ __task_access_unlock(p, lock);
5833 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
5834 ++ if (pi)
5835 ++ cpuset_read_unlock();
5836 ++ return retval;
5837 ++}
5838 ++
5839 ++static int _sched_setscheduler(struct task_struct *p, int policy,
5840 ++ const struct sched_param *param, bool check)
5841 ++{
5842 ++ struct sched_attr attr = {
5843 ++ .sched_policy = policy,
5844 ++ .sched_priority = param->sched_priority,
5845 ++ .sched_nice = PRIO_TO_NICE(p->static_prio),
5846 ++ };
5847 ++
5848 ++ /* Fixup the legacy SCHED_RESET_ON_FORK hack. */
5849 ++ if ((policy != SETPARAM_POLICY) && (policy & SCHED_RESET_ON_FORK)) {
5850 ++ attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
5851 ++ policy &= ~SCHED_RESET_ON_FORK;
5852 ++ attr.sched_policy = policy;
5853 ++ }
5854 ++
5855 ++ return __sched_setscheduler(p, &attr, check, true);
5856 ++}
5857 ++
5858 ++/**
5859 ++ * sched_setscheduler - change the scheduling policy and/or RT priority of a thread.
5860 ++ * @p: the task in question.
5861 ++ * @policy: new policy.
5862 ++ * @param: structure containing the new RT priority.
5863 ++ *
5864 ++ * Use sched_set_fifo(), read its comment.
5865 ++ *
5866 ++ * Return: 0 on success. An error code otherwise.
5867 ++ *
5868 ++ * NOTE that the task may be already dead.
5869 ++ */
5870 ++int sched_setscheduler(struct task_struct *p, int policy,
5871 ++ const struct sched_param *param)
5872 ++{
5873 ++ return _sched_setscheduler(p, policy, param, true);
5874 ++}
5875 ++
5876 ++int sched_setattr(struct task_struct *p, const struct sched_attr *attr)
5877 ++{
5878 ++ return __sched_setscheduler(p, attr, true, true);
5879 ++}
5880 ++
5881 ++int sched_setattr_nocheck(struct task_struct *p, const struct sched_attr *attr)
5882 ++{
5883 ++ return __sched_setscheduler(p, attr, false, true);
5884 ++}
5885 ++EXPORT_SYMBOL_GPL(sched_setattr_nocheck);
5886 ++
5887 ++/**
5888 ++ * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
5889 ++ * @p: the task in question.
5890 ++ * @policy: new policy.
5891 ++ * @param: structure containing the new RT priority.
5892 ++ *
5893 ++ * Just like sched_setscheduler, only don't bother checking if the
5894 ++ * current context has permission. For example, this is needed in
5895 ++ * stop_machine(): we create temporary high priority worker threads,
5896 ++ * but our caller might not have that capability.
5897 ++ *
5898 ++ * Return: 0 on success. An error code otherwise.
5899 ++ */
5900 ++int sched_setscheduler_nocheck(struct task_struct *p, int policy,
5901 ++ const struct sched_param *param)
5902 ++{
5903 ++ return _sched_setscheduler(p, policy, param, false);
5904 ++}
5905 ++
5906 ++/*
5907 ++ * SCHED_FIFO is a broken scheduler model; that is, it is fundamentally
5908 ++ * incapable of resource management, which is the one thing an OS really should
5909 ++ * be doing.
5910 ++ *
5911 ++ * This is of course the reason it is limited to privileged users only.
5912 ++ *
5913 ++ * Worse still; it is fundamentally impossible to compose static priority
5914 ++ * workloads. You cannot take two correctly working static prio workloads
5915 ++ * and smash them together and still expect them to work.
5916 ++ *
5917 ++ * For this reason 'all' FIFO tasks the kernel creates are basically at:
5918 ++ *
5919 ++ * MAX_RT_PRIO / 2
5920 ++ *
5921 ++ * The administrator _MUST_ configure the system, the kernel simply doesn't
5922 ++ * know enough information to make a sensible choice.
5923 ++ */
5924 ++void sched_set_fifo(struct task_struct *p)
5925 ++{
5926 ++ struct sched_param sp = { .sched_priority = MAX_RT_PRIO / 2 };
5927 ++ WARN_ON_ONCE(sched_setscheduler_nocheck(p, SCHED_FIFO, &sp) != 0);
5928 ++}
5929 ++EXPORT_SYMBOL_GPL(sched_set_fifo);
5930 ++
5931 ++/*
5932 ++ * For when you don't much care about FIFO, but want to be above SCHED_NORMAL.
5933 ++ */
5934 ++void sched_set_fifo_low(struct task_struct *p)
5935 ++{
5936 ++ struct sched_param sp = { .sched_priority = 1 };
5937 ++ WARN_ON_ONCE(sched_setscheduler_nocheck(p, SCHED_FIFO, &sp) != 0);
5938 ++}
5939 ++EXPORT_SYMBOL_GPL(sched_set_fifo_low);
5940 ++
5941 ++void sched_set_normal(struct task_struct *p, int nice)
5942 ++{
5943 ++ struct sched_attr attr = {
5944 ++ .sched_policy = SCHED_NORMAL,
5945 ++ .sched_nice = nice,
5946 ++ };
5947 ++ WARN_ON_ONCE(sched_setattr_nocheck(p, &attr) != 0);
5948 ++}
5949 ++EXPORT_SYMBOL_GPL(sched_set_normal);
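sched_set_fifo() and friends are the in-kernel helpers; userspace makes the equivalent request through sched_setscheduler(2), which lands in __sched_setscheduler() above and is therefore subject to the CAP_SYS_NICE / RLIMIT_RTPRIO checks there. A hedged example that asks for SCHED_FIFO priority 50:

    /*
     * Ask for SCHED_FIFO priority 50 on the calling process and report
     * the resulting policy.  Illustrative only; needs CAP_SYS_NICE or
     * a suitable RLIMIT_RTPRIO.
     */
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_param sp = { .sched_priority = 50 };

        if (sched_setscheduler(0, SCHED_FIFO, &sp)) {
            perror("sched_setscheduler");
            return 1;
        }
        printf("policy is now %s\n",
               sched_getscheduler(0) == SCHED_FIFO ? "SCHED_FIFO" : "other");
        return 0;
    }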
5950 ++
5951 ++static int
5952 ++do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
5953 ++{
5954 ++ struct sched_param lparam;
5955 ++ struct task_struct *p;
5956 ++ int retval;
5957 ++
5958 ++ if (!param || pid < 0)
5959 ++ return -EINVAL;
5960 ++ if (copy_from_user(&lparam, param, sizeof(struct sched_param)))
5961 ++ return -EFAULT;
5962 ++
5963 ++ rcu_read_lock();
5964 ++ retval = -ESRCH;
5965 ++ p = find_process_by_pid(pid);
5966 ++ if (likely(p))
5967 ++ get_task_struct(p);
5968 ++ rcu_read_unlock();
5969 ++
5970 ++ if (likely(p)) {
5971 ++ retval = sched_setscheduler(p, policy, &lparam);
5972 ++ put_task_struct(p);
5973 ++ }
5974 ++
5975 ++ return retval;
5976 ++}
5977 ++
5978 ++/*
5979 ++ * Mimics kernel/events/core.c perf_copy_attr().
5980 ++ */
5981 ++static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *attr)
5982 ++{
5983 ++ u32 size;
5984 ++ int ret;
5985 ++
5986 ++ /* Zero the full structure, so that a short copy will be nice: */
5987 ++ memset(attr, 0, sizeof(*attr));
5988 ++
5989 ++ ret = get_user(size, &uattr->size);
5990 ++ if (ret)
5991 ++ return ret;
5992 ++
5993 ++ /* ABI compatibility quirk: */
5994 ++ if (!size)
5995 ++ size = SCHED_ATTR_SIZE_VER0;
5996 ++
5997 ++ if (size < SCHED_ATTR_SIZE_VER0 || size > PAGE_SIZE)
5998 ++ goto err_size;
5999 ++
6000 ++ ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
6001 ++ if (ret) {
6002 ++ if (ret == -E2BIG)
6003 ++ goto err_size;
6004 ++ return ret;
6005 ++ }
6006 ++
6007 ++ /*
6008 ++ * XXX: Do we want to be lenient like existing syscalls; or do we want
6009 ++ * to be strict and return an error on out-of-bounds values?
6010 ++ */
6011 ++ attr->sched_nice = clamp(attr->sched_nice, -20, 19);
6012 ++
6013 ++ /* sched/core.c uses zero here but we already know ret is zero */
6014 ++ return 0;
6015 ++
6016 ++err_size:
6017 ++ put_user(sizeof(*attr), &uattr->size);
6018 ++ return -E2BIG;
6019 ++}
6020 ++
6021 ++/**
6022 ++ * sys_sched_setscheduler - set/change the scheduler policy and RT priority
6023 ++ * @pid: the pid in question.
6024 ++ * @policy: new policy.
6025 ++ * @param: structure containing the new RT priority.
6026 ++ *
6027 ++ * Return: 0 on success. An error code otherwise.
6028 ++ */
6029 ++SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy, struct sched_param __user *, param)
6030 ++{
6031 ++ if (policy < 0)
6032 ++ return -EINVAL;
6033 ++
6034 ++ return do_sched_setscheduler(pid, policy, param);
6035 ++}
6036 ++
6037 ++/**
6038 ++ * sys_sched_setparam - set/change the RT priority of a thread
6039 ++ * @pid: the pid in question.
6040 ++ * @param: structure containing the new RT priority.
6041 ++ *
6042 ++ * Return: 0 on success. An error code otherwise.
6043 ++ */
6044 ++SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
6045 ++{
6046 ++ return do_sched_setscheduler(pid, SETPARAM_POLICY, param);
6047 ++}
6048 ++
6049 ++/**
6050 ++ * sys_sched_setattr - same as above, but with extended sched_attr
6051 ++ * @pid: the pid in question.
6052 ++ * @uattr: structure containing the extended parameters.
6053 ++ */
6054 ++SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
6055 ++ unsigned int, flags)
6056 ++{
6057 ++ struct sched_attr attr;
6058 ++ struct task_struct *p;
6059 ++ int retval;
6060 ++
6061 ++ if (!uattr || pid < 0 || flags)
6062 ++ return -EINVAL;
6063 ++
6064 ++ retval = sched_copy_attr(uattr, &attr);
6065 ++ if (retval)
6066 ++ return retval;
6067 ++
6068 ++ if ((int)attr.sched_policy < 0)
6069 ++ return -EINVAL;
6070 ++
6071 ++ rcu_read_lock();
6072 ++ retval = -ESRCH;
6073 ++ p = find_process_by_pid(pid);
6074 ++ if (likely(p))
6075 ++ get_task_struct(p);
6076 ++ rcu_read_unlock();
6077 ++
6078 ++ if (likely(p)) {
6079 ++ retval = sched_setattr(p, &attr);
6080 ++ put_task_struct(p);
6081 ++ }
6082 ++
6083 ++ return retval;
6084 ++}
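glibc traditionally does not wrap sched_setattr(), so sys_sched_setattr() is reached through syscall(2) with a caller-supplied structure matching the sched_attr ABI that sched_copy_attr() parses above. A hedged sketch; the struct below mirrors the VER0 layout and is renamed so it cannot clash with a libc that already defines struct sched_attr.

    /*
     * Switch the calling thread to SCHED_BATCH, nice 5, through the raw
     * sched_setattr syscall.  The struct mirrors the VER0 sched_attr
     * ABI; sketch only.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    struct sched_attr_v0 {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
    };

    int main(void)
    {
        struct sched_attr_v0 attr = {
            .size         = sizeof(attr),   /* SCHED_ATTR_SIZE_VER0 */
            .sched_policy = SCHED_BATCH,
            .sched_nice   = 5,
        };

        if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
            perror("sched_setattr");
            return 1;
        }
        printf("now SCHED_BATCH, nice 5\n");
        return 0;
    }

For a plain policy and nice change such as this one, sched_setscheduler(2)/nice(2) would do as well; the raw syscall is only needed for the extended attributes.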
6085 ++
6086 ++/**
6087 ++ * sys_sched_getscheduler - get the policy (scheduling class) of a thread
6088 ++ * @pid: the pid in question.
6089 ++ *
6090 ++ * Return: On success, the policy of the thread. Otherwise, a negative error
6091 ++ * code.
6092 ++ */
6093 ++SYSCALL_DEFINE1(sched_getscheduler, pid_t, pid)
6094 ++{
6095 ++ struct task_struct *p;
6096 ++ int retval = -EINVAL;
6097 ++
6098 ++ if (pid < 0)
6099 ++ goto out_nounlock;
6100 ++
6101 ++ retval = -ESRCH;
6102 ++ rcu_read_lock();
6103 ++ p = find_process_by_pid(pid);
6104 ++ if (p) {
6105 ++ retval = security_task_getscheduler(p);
6106 ++ if (!retval)
6107 ++ retval = p->policy;
6108 ++ }
6109 ++ rcu_read_unlock();
6110 ++
6111 ++out_nounlock:
6112 ++ return retval;
6113 ++}
6114 ++
6115 ++/**
6116 ++ * sys_sched_getparam - get the RT priority of a thread
6117 ++ * @pid: the pid in question.
6118 ++ * @param: structure containing the RT priority.
6119 ++ *
6120 ++ * Return: On success, 0 and the RT priority is in @param. Otherwise, an error
6121 ++ * code.
6122 ++ */
6123 ++SYSCALL_DEFINE2(sched_getparam, pid_t, pid, struct sched_param __user *, param)
6124 ++{
6125 ++ struct sched_param lp = { .sched_priority = 0 };
6126 ++ struct task_struct *p;
6127 ++ int retval = -EINVAL;
6128 ++
6129 ++ if (!param || pid < 0)
6130 ++ goto out_nounlock;
6131 ++
6132 ++ rcu_read_lock();
6133 ++ p = find_process_by_pid(pid);
6134 ++ retval = -ESRCH;
6135 ++ if (!p)
6136 ++ goto out_unlock;
6137 ++
6138 ++ retval = security_task_getscheduler(p);
6139 ++ if (retval)
6140 ++ goto out_unlock;
6141 ++
6142 ++ if (task_has_rt_policy(p))
6143 ++ lp.sched_priority = p->rt_priority;
6144 ++ rcu_read_unlock();
6145 ++
6146 ++ /*
6147 ++ * This one might sleep, we cannot do it with a spinlock held ...
6148 ++ */
6149 ++ retval = copy_to_user(param, &lp, sizeof(*param)) ? -EFAULT : 0;
6150 ++
6151 ++out_nounlock:
6152 ++ return retval;
6153 ++
6154 ++out_unlock:
6155 ++ rcu_read_unlock();
6156 ++ return retval;
6157 ++}
6158 ++
6159 ++/*
6160 ++ * Copy the kernel size attribute structure (which might be larger
6161 ++ * than what user-space knows about) to user-space.
6162 ++ *
6163 ++ * Note that all cases are valid: user-space buffer can be larger or
6164 ++ * smaller than the kernel-space buffer. The usual case is that both
6165 ++ * have the same size.
6166 ++ */
6167 ++static int
6168 ++sched_attr_copy_to_user(struct sched_attr __user *uattr,
6169 ++ struct sched_attr *kattr,
6170 ++ unsigned int usize)
6171 ++{
6172 ++ unsigned int ksize = sizeof(*kattr);
6173 ++
6174 ++ if (!access_ok(uattr, usize))
6175 ++ return -EFAULT;
6176 ++
6177 ++ /*
6178 ++ * sched_getattr() ABI forwards and backwards compatibility:
6179 ++ *
6180 ++ * If usize == ksize then we just copy everything to user-space and all is good.
6181 ++ *
6182 ++ * If usize < ksize then we only copy as much as user-space has space for,
6183 ++ * this keeps ABI compatibility as well. We skip the rest.
6184 ++ *
6185 ++ * If usize > ksize then user-space is using a newer version of the ABI,
6186 ++ * which part the kernel doesn't know about. Just ignore it - tooling can
6187 ++ * detect the kernel's knowledge of attributes from the attr->size value
6188 ++ * which is set to ksize in this case.
6189 ++ */
6190 ++ kattr->size = min(usize, ksize);
6191 ++
6192 ++ if (copy_to_user(uattr, kattr, kattr->size))
6193 ++ return -EFAULT;
6194 ++
6195 ++ return 0;
6196 ++}
6197 ++
6198 ++/**
6199 ++ * sys_sched_getattr - similar to sched_getparam, but with sched_attr
6200 ++ * @pid: the pid in question.
6201 ++ * @uattr: structure containing the extended parameters.
6202 ++ * @usize: sizeof(attr) for fwd/bwd comp.
6203 ++ * @flags: for future extension.
6204 ++ */
6205 ++SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
6206 ++ unsigned int, usize, unsigned int, flags)
6207 ++{
6208 ++ struct sched_attr kattr = { };
6209 ++ struct task_struct *p;
6210 ++ int retval;
6211 ++
6212 ++ if (!uattr || pid < 0 || usize > PAGE_SIZE ||
6213 ++ usize < SCHED_ATTR_SIZE_VER0 || flags)
6214 ++ return -EINVAL;
6215 ++
6216 ++ rcu_read_lock();
6217 ++ p = find_process_by_pid(pid);
6218 ++ retval = -ESRCH;
6219 ++ if (!p)
6220 ++ goto out_unlock;
6221 ++
6222 ++ retval = security_task_getscheduler(p);
6223 ++ if (retval)
6224 ++ goto out_unlock;
6225 ++
6226 ++ kattr.sched_policy = p->policy;
6227 ++ if (p->sched_reset_on_fork)
6228 ++ kattr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
6229 ++ if (task_has_rt_policy(p))
6230 ++ kattr.sched_priority = p->rt_priority;
6231 ++ else
6232 ++ kattr.sched_nice = task_nice(p);
6233 ++
6234 ++#ifdef CONFIG_UCLAMP_TASK
6235 ++ kattr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value;
6236 ++ kattr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value;
6237 ++#endif
6238 ++
6239 ++ rcu_read_unlock();
6240 ++
6241 ++ return sched_attr_copy_to_user(uattr, &kattr, usize);
6242 ++
6243 ++out_unlock:
6244 ++ rcu_read_unlock();
6245 ++ return retval;
6246 ++}
6247 ++
6248 ++long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
6249 ++{
6250 ++ cpumask_var_t cpus_allowed, new_mask;
6251 ++ struct task_struct *p;
6252 ++ int retval;
6253 ++
6254 ++ rcu_read_lock();
6255 ++
6256 ++ p = find_process_by_pid(pid);
6257 ++ if (!p) {
6258 ++ rcu_read_unlock();
6259 ++ return -ESRCH;
6260 ++ }
6261 ++
6262 ++ /* Prevent p going away */
6263 ++ get_task_struct(p);
6264 ++ rcu_read_unlock();
6265 ++
6266 ++ if (p->flags & PF_NO_SETAFFINITY) {
6267 ++ retval = -EINVAL;
6268 ++ goto out_put_task;
6269 ++ }
6270 ++ if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL)) {
6271 ++ retval = -ENOMEM;
6272 ++ goto out_put_task;
6273 ++ }
6274 ++ if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) {
6275 ++ retval = -ENOMEM;
6276 ++ goto out_free_cpus_allowed;
6277 ++ }
6278 ++ retval = -EPERM;
6279 ++ if (!check_same_owner(p)) {
6280 ++ rcu_read_lock();
6281 ++ if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
6282 ++ rcu_read_unlock();
6283 ++ goto out_free_new_mask;
6284 ++ }
6285 ++ rcu_read_unlock();
6286 ++ }
6287 ++
6288 ++ retval = security_task_setscheduler(p);
6289 ++ if (retval)
6290 ++ goto out_free_new_mask;
6291 ++
6292 ++ cpuset_cpus_allowed(p, cpus_allowed);
6293 ++ cpumask_and(new_mask, in_mask, cpus_allowed);
6294 ++
6295 ++again:
6296 ++ retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK);
6297 ++
6298 ++ if (!retval) {
6299 ++ cpuset_cpus_allowed(p, cpus_allowed);
6300 ++ if (!cpumask_subset(new_mask, cpus_allowed)) {
6301 ++ /*
6302 ++ * We must have raced with a concurrent cpuset
6303 ++ * update. Just reset the cpus_allowed to the
6304 ++ * cpuset's cpus_allowed
6305 ++ */
6306 ++ cpumask_copy(new_mask, cpus_allowed);
6307 ++ goto again;
6308 ++ }
6309 ++ }
6310 ++out_free_new_mask:
6311 ++ free_cpumask_var(new_mask);
6312 ++out_free_cpus_allowed:
6313 ++ free_cpumask_var(cpus_allowed);
6314 ++out_put_task:
6315 ++ put_task_struct(p);
6316 ++ return retval;
6317 ++}
6318 ++
6319 ++static int get_user_cpu_mask(unsigned long __user *user_mask_ptr, unsigned len,
6320 ++ struct cpumask *new_mask)
6321 ++{
6322 ++ if (len < cpumask_size())
6323 ++ cpumask_clear(new_mask);
6324 ++ else if (len > cpumask_size())
6325 ++ len = cpumask_size();
6326 ++
6327 ++ return copy_from_user(new_mask, user_mask_ptr, len) ? -EFAULT : 0;
6328 ++}
6329 ++
6330 ++/**
6331 ++ * sys_sched_setaffinity - set the CPU affinity of a process
6332 ++ * @pid: pid of the process
6333 ++ * @len: length in bytes of the bitmask pointed to by user_mask_ptr
6334 ++ * @user_mask_ptr: user-space pointer to the new CPU mask
6335 ++ *
6336 ++ * Return: 0 on success. An error code otherwise.
6337 ++ */
6338 ++SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
6339 ++ unsigned long __user *, user_mask_ptr)
6340 ++{
6341 ++ cpumask_var_t new_mask;
6342 ++ int retval;
6343 ++
6344 ++ if (!alloc_cpumask_var(&new_mask, GFP_KERNEL))
6345 ++ return -ENOMEM;
6346 ++
6347 ++ retval = get_user_cpu_mask(user_mask_ptr, len, new_mask);
6348 ++ if (retval == 0)
6349 ++ retval = sched_setaffinity(pid, new_mask);
6350 ++ free_cpumask_var(new_mask);
6351 ++ return retval;
6352 ++}
6353 ++
6354 ++long sched_getaffinity(pid_t pid, cpumask_t *mask)
6355 ++{
6356 ++ struct task_struct *p;
6357 ++ raw_spinlock_t *lock;
6358 ++ unsigned long flags;
6359 ++ int retval;
6360 ++
6361 ++ rcu_read_lock();
6362 ++
6363 ++ retval = -ESRCH;
6364 ++ p = find_process_by_pid(pid);
6365 ++ if (!p)
6366 ++ goto out_unlock;
6367 ++
6368 ++ retval = security_task_getscheduler(p);
6369 ++ if (retval)
6370 ++ goto out_unlock;
6371 ++
6372 ++ task_access_lock_irqsave(p, &lock, &flags);
6373 ++ cpumask_and(mask, &p->cpus_mask, cpu_active_mask);
6374 ++ task_access_unlock_irqrestore(p, lock, &flags);
6375 ++
6376 ++out_unlock:
6377 ++ rcu_read_unlock();
6378 ++
6379 ++ return retval;
6380 ++}
6381 ++
6382 ++/**
6383 ++ * sys_sched_getaffinity - get the CPU affinity of a process
6384 ++ * @pid: pid of the process
6385 ++ * @len: length in bytes of the bitmask pointed to by user_mask_ptr
6386 ++ * @user_mask_ptr: user-space pointer to hold the current CPU mask
6387 ++ *
6388 ++ * Return: size of CPU mask copied to user_mask_ptr on success. An
6389 ++ * error code otherwise.
6390 ++ */
6391 ++SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
6392 ++ unsigned long __user *, user_mask_ptr)
6393 ++{
6394 ++ int ret;
6395 ++ cpumask_var_t mask;
6396 ++
6397 ++ if ((len * BITS_PER_BYTE) < nr_cpu_ids)
6398 ++ return -EINVAL;
6399 ++ if (len & (sizeof(unsigned long)-1))
6400 ++ return -EINVAL;
6401 ++
6402 ++ if (!alloc_cpumask_var(&mask, GFP_KERNEL))
6403 ++ return -ENOMEM;
6404 ++
6405 ++ ret = sched_getaffinity(pid, mask);
6406 ++ if (ret == 0) {
6407 ++ unsigned int retlen = min_t(size_t, len, cpumask_size());
6408 ++
6409 ++ if (copy_to_user(user_mask_ptr, mask, retlen))
6410 ++ ret = -EFAULT;
6411 ++ else
6412 ++ ret = retlen;
6413 ++ }
6414 ++ free_cpumask_var(mask);
6415 ++
6416 ++ return ret;
6417 ++}
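The affinity syscalls above are wrapped by glibc as sched_setaffinity()/sched_getaffinity() operating on a cpu_set_t. A hedged example that pins the caller to CPU 0 and reads the mask back:

    /*
     * Pin the calling process to CPU 0 and read the mask back.
     * Illustrative only.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (sched_setaffinity(0, sizeof(set), &set)) {
            perror("sched_setaffinity");
            return 1;
        }

        CPU_ZERO(&set);
        if (sched_getaffinity(0, sizeof(set), &set)) {
            perror("sched_getaffinity");
            return 1;
        }
        printf("bound to CPU 0: %s\n", CPU_ISSET(0, &set) ? "yes" : "no");
        return 0;
    }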
6418 ++
6419 ++static void do_sched_yield(void)
6420 ++{
6421 ++ struct rq *rq;
6422 ++ struct rq_flags rf;
6423 ++
6424 ++ if (!sched_yield_type)
6425 ++ return;
6426 ++
6427 ++ rq = this_rq_lock_irq(&rf);
6428 ++
6429 ++ schedstat_inc(rq->yld_count);
6430 ++
6431 ++ if (1 == sched_yield_type) {
6432 ++ if (!rt_task(current))
6433 ++ do_sched_yield_type_1(current, rq);
6434 ++ } else if (2 == sched_yield_type) {
6435 ++ if (rq->nr_running > 1)
6436 ++ rq->skip = current;
6437 ++ }
6438 ++
6439 ++ preempt_disable();
6440 ++ raw_spin_unlock_irq(&rq->lock);
6441 ++ sched_preempt_enable_no_resched();
6442 ++
6443 ++ schedule();
6444 ++}
6445 ++
6446 ++/**
6447 ++ * sys_sched_yield - yield the current processor to other threads.
6448 ++ *
6449 ++ * This function yields the current CPU to other tasks. If there are no
6450 ++ * other threads running on this CPU then this function will return.
6451 ++ *
6452 ++ * Return: 0.
6453 ++ */
6454 ++SYSCALL_DEFINE0(sched_yield)
6455 ++{
6456 ++ do_sched_yield();
6457 ++ return 0;
6458 ++}
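From userspace the entry point above is simply sched_yield(2); what the call actually does here is governed by do_sched_yield() and the sched_yield_type setting (no-op, deboost-and-requeue for non-RT tasks, or marking the run queue skip task). A hedged example:

    /*
     * Busy worker that yields the CPU on every iteration.  What the
     * yield actually does is decided by do_sched_yield() above via
     * sched_yield_type; sketch only.
     */
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        for (int i = 0; i < 5; i++) {
            printf("iteration %d\n", i);
            sched_yield();      /* enters SYSCALL_DEFINE0(sched_yield) */
        }
        return 0;
    }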
6459 ++
6460 ++#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
6461 ++int __sched __cond_resched(void)
6462 ++{
6463 ++ if (should_resched(0)) {
6464 ++ preempt_schedule_common();
6465 ++ return 1;
6466 ++ }
6467 ++#ifndef CONFIG_PREEMPT_RCU
6468 ++ rcu_all_qs();
6469 ++#endif
6470 ++ return 0;
6471 ++}
6472 ++EXPORT_SYMBOL(__cond_resched);
6473 ++#endif
6474 ++
6475 ++#ifdef CONFIG_PREEMPT_DYNAMIC
6476 ++DEFINE_STATIC_CALL_RET0(cond_resched, __cond_resched);
6477 ++EXPORT_STATIC_CALL_TRAMP(cond_resched);
6478 ++
6479 ++DEFINE_STATIC_CALL_RET0(might_resched, __cond_resched);
6480 ++EXPORT_STATIC_CALL_TRAMP(might_resched);
6481 ++#endif
6482 ++
6483 ++/*
6484 ++ * __cond_resched_lock() - if a reschedule is pending, drop the given lock,
6485 ++ * call schedule, and on return reacquire the lock.
6486 ++ *
6487 ++ * This works OK both with and without CONFIG_PREEMPTION. We do strange low-level
6488 ++ * operations here to prevent schedule() from being called twice (once via
6489 ++ * spin_unlock(), once by hand).
6490 ++ */
6491 ++int __cond_resched_lock(spinlock_t *lock)
6492 ++{
6493 ++ int resched = should_resched(PREEMPT_LOCK_OFFSET);
6494 ++ int ret = 0;
6495 ++
6496 ++ lockdep_assert_held(lock);
6497 ++
6498 ++ if (spin_needbreak(lock) || resched) {
6499 ++ spin_unlock(lock);
6500 ++ if (resched)
6501 ++ preempt_schedule_common();
6502 ++ else
6503 ++ cpu_relax();
6504 ++ ret = 1;
6505 ++ spin_lock(lock);
6506 ++ }
6507 ++ return ret;
6508 ++}
6509 ++EXPORT_SYMBOL(__cond_resched_lock);
6510 ++
6511 ++int __cond_resched_rwlock_read(rwlock_t *lock)
6512 ++{
6513 ++ int resched = should_resched(PREEMPT_LOCK_OFFSET);
6514 ++ int ret = 0;
6515 ++
6516 ++ lockdep_assert_held_read(lock);
6517 ++
6518 ++ if (rwlock_needbreak(lock) || resched) {
6519 ++ read_unlock(lock);
6520 ++ if (resched)
6521 ++ preempt_schedule_common();
6522 ++ else
6523 ++ cpu_relax();
6524 ++ ret = 1;
6525 ++ read_lock(lock);
6526 ++ }
6527 ++ return ret;
6528 ++}
6529 ++EXPORT_SYMBOL(__cond_resched_rwlock_read);
6530 ++
6531 ++int __cond_resched_rwlock_write(rwlock_t *lock)
6532 ++{
6533 ++ int resched = should_resched(PREEMPT_LOCK_OFFSET);
6534 ++ int ret = 0;
6535 ++
6536 ++ lockdep_assert_held_write(lock);
6537 ++
6538 ++ if (rwlock_needbreak(lock) || resched) {
6539 ++ write_unlock(lock);
6540 ++ if (resched)
6541 ++ preempt_schedule_common();
6542 ++ else
6543 ++ cpu_relax();
6544 ++ ret = 1;
6545 ++ write_lock(lock);
6546 ++ }
6547 ++ return ret;
6548 ++}
6549 ++EXPORT_SYMBOL(__cond_resched_rwlock_write);
6550 ++
6551 ++/**
6552 ++ * yield - yield the current processor to other threads.
6553 ++ *
6554 ++ * Do not ever use this function, there's a 99% chance you're doing it wrong.
6555 ++ *
6556 ++ * The scheduler is at all times free to pick the calling task as the most
6557 ++ * eligible task to run, if removing the yield() call from your code breaks
6558 ++ * it, it's already broken.
6559 ++ *
6560 ++ * Typical broken usage is:
6561 ++ *
6562 ++ * while (!event)
6563 ++ * yield();
6564 ++ *
6565 ++ * where one assumes that yield() will let 'the other' process run that will
6566 ++ * make event true. If the current task is a SCHED_FIFO task that will never
6567 ++ * happen. Never use yield() as a progress guarantee!!
6568 ++ *
6569 ++ * If you want to use yield() to wait for something, use wait_event().
6570 ++ * If you want to use yield() to be 'nice' for others, use cond_resched().
6571 ++ * If you still want to use yield(), do not!
6572 ++ */
6573 ++void __sched yield(void)
6574 ++{
6575 ++ set_current_state(TASK_RUNNING);
6576 ++ do_sched_yield();
6577 ++}
6578 ++EXPORT_SYMBOL(yield);
6579 ++
6580 ++/**
6581 ++ * yield_to - yield the current processor to another thread in
6582 ++ * your thread group, or accelerate that thread toward the
6583 ++ * processor it's on.
6584 ++ * @p: target task
6585 ++ * @preempt: whether task preemption is allowed or not
6586 ++ *
6587 ++ * It's the caller's job to ensure that the target task struct
6588 ++ * can't go away on us before we can do any checks.
6589 ++ *
6590 ++ * In Alt schedule FW, yield_to is not supported.
6591 ++ *
6592 ++ * Return:
6593 ++ * true (>0) if we indeed boosted the target task.
6594 ++ * false (0) if we failed to boost the target.
6595 ++ * -ESRCH if there's no task to yield to.
6596 ++ */
6597 ++int __sched yield_to(struct task_struct *p, bool preempt)
6598 ++{
6599 ++ return 0;
6600 ++}
6601 ++EXPORT_SYMBOL_GPL(yield_to);
6602 ++
6603 ++int io_schedule_prepare(void)
6604 ++{
6605 ++ int old_iowait = current->in_iowait;
6606 ++
6607 ++ current->in_iowait = 1;
6608 ++ blk_schedule_flush_plug(current);
6609 ++
6610 ++ return old_iowait;
6611 ++}
6612 ++
6613 ++void io_schedule_finish(int token)
6614 ++{
6615 ++ current->in_iowait = token;
6616 ++}
6617 ++
6618 ++/*
6619 ++ * This task is about to go to sleep on IO. Increment rq->nr_iowait so
6620 ++ * that process accounting knows that this is a task in IO wait state.
6621 ++ *
6622 ++ * But don't do that if it is a deliberate, throttling IO wait (this task
6623 ++ * has set its backing_dev_info: the queue against which it should throttle)
6624 ++ */
6625 ++
6626 ++long __sched io_schedule_timeout(long timeout)
6627 ++{
6628 ++ int token;
6629 ++ long ret;
6630 ++
6631 ++ token = io_schedule_prepare();
6632 ++ ret = schedule_timeout(timeout);
6633 ++ io_schedule_finish(token);
6634 ++
6635 ++ return ret;
6636 ++}
6637 ++EXPORT_SYMBOL(io_schedule_timeout);
6638 ++
6639 ++void __sched io_schedule(void)
6640 ++{
6641 ++ int token;
6642 ++
6643 ++ token = io_schedule_prepare();
6644 ++ schedule();
6645 ++ io_schedule_finish(token);
6646 ++}
6647 ++EXPORT_SYMBOL(io_schedule);
6648 ++
6649 ++/**
6650 ++ * sys_sched_get_priority_max - return maximum RT priority.
6651 ++ * @policy: scheduling class.
6652 ++ *
6653 ++ * Return: On success, this syscall returns the maximum
6654 ++ * rt_priority that can be used by a given scheduling class.
6655 ++ * On failure, a negative error code is returned.
6656 ++ */
6657 ++SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
6658 ++{
6659 ++ int ret = -EINVAL;
6660 ++
6661 ++ switch (policy) {
6662 ++ case SCHED_FIFO:
6663 ++ case SCHED_RR:
6664 ++ ret = MAX_RT_PRIO - 1;
6665 ++ break;
6666 ++ case SCHED_NORMAL:
6667 ++ case SCHED_BATCH:
6668 ++ case SCHED_IDLE:
6669 ++ ret = 0;
6670 ++ break;
6671 ++ }
6672 ++ return ret;
6673 ++}
6674 ++
6675 ++/**
6676 ++ * sys_sched_get_priority_min - return minimum RT priority.
6677 ++ * @policy: scheduling class.
6678 ++ *
6679 ++ * Return: On success, this syscall returns the minimum
6680 ++ * rt_priority that can be used by a given scheduling class.
6681 ++ * On failure, a negative error code is returned.
6682 ++ */
6683 ++SYSCALL_DEFINE1(sched_get_priority_min, int, policy)
6684 ++{
6685 ++ int ret = -EINVAL;
6686 ++
6687 ++ switch (policy) {
6688 ++ case SCHED_FIFO:
6689 ++ case SCHED_RR:
6690 ++ ret = 1;
6691 ++ break;
6692 ++ case SCHED_NORMAL:
6693 ++ case SCHED_BATCH:
6694 ++ case SCHED_IDLE:
6695 ++ ret = 0;
6696 ++ break;
6697 ++ }
6698 ++ return ret;
6699 ++}
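The two syscalls above expose the static priority range per policy, mirrored in userspace by sched_get_priority_max()/sched_get_priority_min(). A hedged example that prints the ranges the switch statements return:

    /*
     * Print the static priority range for a few policies, matching the
     * switch statements above (1..99 for FIFO/RR, 0 for the normal
     * classes).
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        const struct { const char *name; int policy; } p[] = {
            { "SCHED_FIFO",  SCHED_FIFO  },
            { "SCHED_RR",    SCHED_RR    },
            { "SCHED_OTHER", SCHED_OTHER },
            { "SCHED_BATCH", SCHED_BATCH },
        };

        for (size_t i = 0; i < sizeof(p) / sizeof(p[0]); i++)
            printf("%-12s min %d max %d\n", p[i].name,
                   sched_get_priority_min(p[i].policy),
                   sched_get_priority_max(p[i].policy));
        return 0;
    }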
6700 ++
6701 ++static int sched_rr_get_interval(pid_t pid, struct timespec64 *t)
6702 ++{
6703 ++ struct task_struct *p;
6704 ++ int retval;
6705 ++
6706 ++ alt_sched_debug();
6707 ++
6708 ++ if (pid < 0)
6709 ++ return -EINVAL;
6710 ++
6711 ++ retval = -ESRCH;
6712 ++ rcu_read_lock();
6713 ++ p = find_process_by_pid(pid);
6714 ++ if (!p)
6715 ++ goto out_unlock;
6716 ++
6717 ++ retval = security_task_getscheduler(p);
6718 ++ if (retval)
6719 ++ goto out_unlock;
6720 ++ rcu_read_unlock();
6721 ++
6722 ++ *t = ns_to_timespec64(sched_timeslice_ns);
6723 ++ return 0;
6724 ++
6725 ++out_unlock:
6726 ++ rcu_read_unlock();
6727 ++ return retval;
6728 ++}
6729 ++
6730 ++/**
6731 ++ * sys_sched_rr_get_interval - return the default timeslice of a process.
6732 ++ * @pid: pid of the process.
6733 ++ * @interval: userspace pointer to the timeslice value.
6734 ++ *
6735 ++ *
6736 ++ * Return: On success, 0 and the timeslice is in @interval. Otherwise,
6737 ++ * an error code.
6738 ++ */
6739 ++SYSCALL_DEFINE2(sched_rr_get_interval, pid_t, pid,
6740 ++ struct __kernel_timespec __user *, interval)
6741 ++{
6742 ++ struct timespec64 t;
6743 ++ int retval = sched_rr_get_interval(pid, &t);
6744 ++
6745 ++ if (retval == 0)
6746 ++ retval = put_timespec64(&t, interval);
6747 ++
6748 ++ return retval;
6749 ++}
6750 ++
6751 ++#ifdef CONFIG_COMPAT_32BIT_TIME
6752 ++SYSCALL_DEFINE2(sched_rr_get_interval_time32, pid_t, pid,
6753 ++ struct old_timespec32 __user *, interval)
6754 ++{
6755 ++ struct timespec64 t;
6756 ++ int retval = sched_rr_get_interval(pid, &t);
6757 ++
6758 ++ if (retval == 0)
6759 ++ retval = put_old_timespec32(&t, interval);
6760 ++ return retval;
6761 ++}
6762 ++#endif
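Under this scheduler sched_rr_get_interval() reports the global sched_timeslice_ns rather than a per-task slice, but the userspace interface is unchanged. A hedged example:

    /*
     * Query the round-robin interval for the calling process.  Under
     * this patch the value reported is the global sched_timeslice_ns;
     * sketch only.
     */
    #include <sched.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec ts;

        if (sched_rr_get_interval(0, &ts)) {
            perror("sched_rr_get_interval");
            return 1;
        }
        printf("timeslice: %ld.%09ld s\n", (long)ts.tv_sec, (long)ts.tv_nsec);
        return 0;
    }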
6763 ++
6764 ++void sched_show_task(struct task_struct *p)
6765 ++{
6766 ++ unsigned long free = 0;
6767 ++ int ppid;
6768 ++
6769 ++ if (!try_get_task_stack(p))
6770 ++ return;
6771 ++
6772 ++ pr_info("task:%-15.15s state:%c", p->comm, task_state_to_char(p));
6773 ++
6774 ++ if (task_is_running(p))
6775 ++ pr_cont(" running task ");
6776 ++#ifdef CONFIG_DEBUG_STACK_USAGE
6777 ++ free = stack_not_used(p);
6778 ++#endif
6779 ++ ppid = 0;
6780 ++ rcu_read_lock();
6781 ++ if (pid_alive(p))
6782 ++ ppid = task_pid_nr(rcu_dereference(p->real_parent));
6783 ++ rcu_read_unlock();
6784 ++ pr_cont(" stack:%5lu pid:%5d ppid:%6d flags:0x%08lx\n",
6785 ++ free, task_pid_nr(p), ppid,
6786 ++ (unsigned long)task_thread_info(p)->flags);
6787 ++
6788 ++ print_worker_info(KERN_INFO, p);
6789 ++ print_stop_info(KERN_INFO, p);
6790 ++ show_stack(p, NULL, KERN_INFO);
6791 ++ put_task_stack(p);
6792 ++}
6793 ++EXPORT_SYMBOL_GPL(sched_show_task);
6794 ++
6795 ++static inline bool
6796 ++state_filter_match(unsigned long state_filter, struct task_struct *p)
6797 ++{
6798 ++ unsigned int state = READ_ONCE(p->__state);
6799 ++
6800 ++ /* no filter, everything matches */
6801 ++ if (!state_filter)
6802 ++ return true;
6803 ++
6804 ++ /* filter, but doesn't match */
6805 ++ if (!(state & state_filter))
6806 ++ return false;
6807 ++
6808 ++ /*
6809 ++ * When looking for TASK_UNINTERRUPTIBLE skip TASK_IDLE (allows
6810 ++ * TASK_KILLABLE).
6811 ++ */
6812 ++ if (state_filter == TASK_UNINTERRUPTIBLE && state == TASK_IDLE)
6813 ++ return false;
6814 ++
6815 ++ return true;
6816 ++}
6817 ++
6818 ++
6819 ++void show_state_filter(unsigned int state_filter)
6820 ++{
6821 ++ struct task_struct *g, *p;
6822 ++
6823 ++ rcu_read_lock();
6824 ++ for_each_process_thread(g, p) {
6825 ++ /*
6826 ++ * reset the NMI-timeout, listing all files on a slow
6827 ++ * console might take a lot of time:
6828 ++ * Also, reset softlockup watchdogs on all CPUs, because
6829 ++ * another CPU might be blocked waiting for us to process
6830 ++ * an IPI.
6831 ++ */
6832 ++ touch_nmi_watchdog();
6833 ++ touch_all_softlockup_watchdogs();
6834 ++ if (state_filter_match(state_filter, p))
6835 ++ sched_show_task(p);
6836 ++ }
6837 ++
6838 ++#ifdef CONFIG_SCHED_DEBUG
6839 ++ /* TODO: Alt schedule FW should support this
6840 ++ if (!state_filter)
6841 ++ sysrq_sched_debug_show();
6842 ++ */
6843 ++#endif
6844 ++ rcu_read_unlock();
6845 ++ /*
6846 ++ * Only show locks if all tasks are dumped:
6847 ++ */
6848 ++ if (!state_filter)
6849 ++ debug_show_all_locks();
6850 ++}
6851 ++
6852 ++void dump_cpu_task(int cpu)
6853 ++{
6854 ++ pr_info("Task dump for CPU %d:\n", cpu);
6855 ++ sched_show_task(cpu_curr(cpu));
6856 ++}
6857 ++
6858 ++/**
6859 ++ * init_idle - set up an idle thread for a given CPU
6860 ++ * @idle: task in question
6861 ++ * @cpu: CPU the idle task belongs to
6862 ++ *
6863 ++ * NOTE: this function does not set the idle thread's NEED_RESCHED
6864 ++ * flag, to make booting more robust.
6865 ++ */
6866 ++void __init init_idle(struct task_struct *idle, int cpu)
6867 ++{
6868 ++ struct rq *rq = cpu_rq(cpu);
6869 ++ unsigned long flags;
6870 ++
6871 ++ __sched_fork(0, idle);
6872 ++
6873 ++ /*
6874 ++ * The idle task doesn't need the kthread struct to function, but it
6875 ++ * is dressed up as a per-CPU kthread and thus needs to play the part
6876 ++ * if we want to avoid special-casing it in code that deals with per-CPU
6877 ++ * kthreads.
6878 ++ */
6879 ++ set_kthread_struct(idle);
6880 ++
6881 ++ raw_spin_lock_irqsave(&idle->pi_lock, flags);
6882 ++ raw_spin_lock(&rq->lock);
6883 ++ update_rq_clock(rq);
6884 ++
6885 ++ idle->last_ran = rq->clock_task;
6886 ++ idle->__state = TASK_RUNNING;
6887 ++ /*
6888 ++ * PF_KTHREAD should already be set at this point; regardless, make it
6889 ++ * look like a proper per-CPU kthread.
6890 ++ */
6891 ++ idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY;
6892 ++ kthread_set_per_cpu(idle, cpu);
6893 ++
6894 ++ sched_queue_init_idle(&rq->queue, idle);
6895 ++
6896 ++ scs_task_reset(idle);
6897 ++ kasan_unpoison_task_stack(idle);
6898 ++
6899 ++#ifdef CONFIG_SMP
6900 ++ /*
6901 ++ * It's possible that init_idle() gets called multiple times on a task,
6902 ++ * in that case do_set_cpus_allowed() will not do the right thing.
6903 ++ *
6904 ++ * And since this is boot we can forgo the serialisation.
6905 ++ */
6906 ++ set_cpus_allowed_common(idle, cpumask_of(cpu));
6907 ++#endif
6908 ++
6909 ++ /* Silence PROVE_RCU */
6910 ++ rcu_read_lock();
6911 ++ __set_task_cpu(idle, cpu);
6912 ++ rcu_read_unlock();
6913 ++
6914 ++ rq->idle = idle;
6915 ++ rcu_assign_pointer(rq->curr, idle);
6916 ++ idle->on_cpu = 1;
6917 ++
6918 ++ raw_spin_unlock(&rq->lock);
6919 ++ raw_spin_unlock_irqrestore(&idle->pi_lock, flags);
6920 ++
6921 ++ /* Set the preempt count _outside_ the spinlocks! */
6922 ++ init_idle_preempt_count(idle, cpu);
6923 ++
6924 ++ ftrace_graph_init_idle_task(idle, cpu);
6925 ++ vtime_init_idle(idle, cpu);
6926 ++#ifdef CONFIG_SMP
6927 ++ sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
6928 ++#endif
6929 ++}
6930 ++
6931 ++#ifdef CONFIG_SMP
6932 ++
6933 ++int cpuset_cpumask_can_shrink(const struct cpumask __maybe_unused *cur,
6934 ++ const struct cpumask __maybe_unused *trial)
6935 ++{
6936 ++ return 1;
6937 ++}
6938 ++
6939 ++int task_can_attach(struct task_struct *p,
6940 ++ const struct cpumask *cs_cpus_allowed)
6941 ++{
6942 ++ int ret = 0;
6943 ++
6944 ++ /*
6945 ++ * Kthreads which disallow setaffinity shouldn't be moved
6946 ++ * to a new cpuset; we don't want to change their CPU
6947 ++ * affinity and isolating such threads by their set of
6948 ++ * allowed nodes is unnecessary. Thus, cpusets are not
6949 ++ * applicable for such threads. This prevents checking for
6950 ++ * success of set_cpus_allowed_ptr() on all attached tasks
6951 ++ * before cpus_mask may be changed.
6952 ++ */
6953 ++ if (p->flags & PF_NO_SETAFFINITY)
6954 ++ ret = -EINVAL;
6955 ++
6956 ++ return ret;
6957 ++}
6958 ++
6959 ++bool sched_smp_initialized __read_mostly;
6960 ++
6961 ++#ifdef CONFIG_HOTPLUG_CPU
6962 ++/*
6963 ++ * Ensures that the idle task is using init_mm right before its CPU goes
6964 ++ * offline.
6965 ++ */
6966 ++void idle_task_exit(void)
6967 ++{
6968 ++ struct mm_struct *mm = current->active_mm;
6969 ++
6970 ++ BUG_ON(current != this_rq()->idle);
6971 ++
6972 ++ if (mm != &init_mm) {
6973 ++ switch_mm(mm, &init_mm, current);
6974 ++ finish_arch_post_lock_switch();
6975 ++ }
6976 ++
6977 ++ /* finish_cpu(), as ran on the BP, will clean up the active_mm state */
6978 ++}
6979 ++
6980 ++static int __balance_push_cpu_stop(void *arg)
6981 ++{
6982 ++ struct task_struct *p = arg;
6983 ++ struct rq *rq = this_rq();
6984 ++ struct rq_flags rf;
6985 ++ int cpu;
6986 ++
6987 ++ raw_spin_lock_irq(&p->pi_lock);
6988 ++ rq_lock(rq, &rf);
6989 ++
6990 ++ update_rq_clock(rq);
6991 ++
6992 ++ if (task_rq(p) == rq && task_on_rq_queued(p)) {
6993 ++ cpu = select_fallback_rq(rq->cpu, p);
6994 ++ rq = __migrate_task(rq, p, cpu);
6995 ++ }
6996 ++
6997 ++ rq_unlock(rq, &rf);
6998 ++ raw_spin_unlock_irq(&p->pi_lock);
6999 ++
7000 ++ put_task_struct(p);
7001 ++
7002 ++ return 0;
7003 ++}
7004 ++
7005 ++static DEFINE_PER_CPU(struct cpu_stop_work, push_work);
7006 ++
7007 ++/*
7008 ++ * This is enabled below SCHED_AP_ACTIVE, i.e. when !cpu_active(), but it is
7009 ++ * only effective while the hotplug motion is down (the CPU is going offline).
7010 ++ */
7011 ++static void balance_push(struct rq *rq)
7012 ++{
7013 ++ struct task_struct *push_task = rq->curr;
7014 ++
7015 ++ lockdep_assert_held(&rq->lock);
7016 ++ SCHED_WARN_ON(rq->cpu != smp_processor_id());
7017 ++
7018 ++ /*
7019 ++ * Ensure the thing is persistent until balance_push_set(.on = false);
7020 ++ */
7021 ++ rq->balance_callback = &balance_push_callback;
7022 ++
7023 ++ /*
7024 ++ * Only active while going offline.
7025 ++ */
7026 ++ if (!cpu_dying(rq->cpu))
7027 ++ return;
7028 ++
7029 ++ /*
7030 ++ * Both the cpu-hotplug and stop task are in this case and are
7031 ++ * required to complete the hotplug process.
7032 ++ */
7033 ++ if (kthread_is_per_cpu(push_task) ||
7034 ++ is_migration_disabled(push_task)) {
7035 ++
7036 ++ /*
7037 ++ * If this is the idle task on the outgoing CPU try to wake
7038 ++ * up the hotplug control thread which might wait for the
7039 ++ * last task to vanish. The rcuwait_active() check is
7040 ++ * accurate here because the waiter is pinned on this CPU
7041 ++ * and can't obviously be running in parallel.
7042 ++ *
7043 ++ * On RT kernels this also has to check whether there are
7044 ++ * pinned and scheduled out tasks on the runqueue. They
7045 ++ * need to leave the migrate disabled section first.
7046 ++ */
7047 ++ if (!rq->nr_running && !rq_has_pinned_tasks(rq) &&
7048 ++ rcuwait_active(&rq->hotplug_wait)) {
7049 ++ raw_spin_unlock(&rq->lock);
7050 ++ rcuwait_wake_up(&rq->hotplug_wait);
7051 ++ raw_spin_lock(&rq->lock);
7052 ++ }
7053 ++ return;
7054 ++ }
7055 ++
7056 ++ get_task_struct(push_task);
7057 ++ /*
7058 ++ * Temporarily drop rq->lock such that we can wake-up the stop task.
7059 ++ * Both preemption and IRQs are still disabled.
7060 ++ */
7061 ++ raw_spin_unlock(&rq->lock);
7062 ++ stop_one_cpu_nowait(rq->cpu, __balance_push_cpu_stop, push_task,
7063 ++ this_cpu_ptr(&push_work));
7064 ++ /*
7065 ++ * At this point need_resched() is true and we'll take the loop in
7066 ++ * schedule(). The next pick is obviously going to be the stop task
7067 ++ * which kthread_is_per_cpu() and will push this task away.
7068 ++ */
7069 ++ raw_spin_lock(&rq->lock);
7070 ++}
7071 ++
7072 ++static void balance_push_set(int cpu, bool on)
7073 ++{
7074 ++ struct rq *rq = cpu_rq(cpu);
7075 ++ struct rq_flags rf;
7076 ++
7077 ++ rq_lock_irqsave(rq, &rf);
7078 ++ if (on) {
7079 ++ WARN_ON_ONCE(rq->balance_callback);
7080 ++ rq->balance_callback = &balance_push_callback;
7081 ++ } else if (rq->balance_callback == &balance_push_callback) {
7082 ++ rq->balance_callback = NULL;
7083 ++ }
7084 ++ rq_unlock_irqrestore(rq, &rf);
7085 ++}
7086 ++
7087 ++/*
7088 ++ * Invoked from a CPU's hotplug control thread after the CPU has been marked
7089 ++ * inactive. All tasks which are not per CPU kernel threads are either
7090 ++ * pushed off this CPU now via balance_push() or placed on a different CPU
7091 ++ * during wakeup. Wait until the CPU is quiescent.
7092 ++ */
7093 ++static void balance_hotplug_wait(void)
7094 ++{
7095 ++ struct rq *rq = this_rq();
7096 ++
7097 ++ rcuwait_wait_event(&rq->hotplug_wait,
7098 ++ rq->nr_running == 1 && !rq_has_pinned_tasks(rq),
7099 ++ TASK_UNINTERRUPTIBLE);
7100 ++}
7101 ++
7102 ++#else
7103 ++
7104 ++static void balance_push(struct rq *rq)
7105 ++{
7106 ++}
7107 ++
7108 ++static void balance_push_set(int cpu, bool on)
7109 ++{
7110 ++}
7111 ++
7112 ++static inline void balance_hotplug_wait(void)
7113 ++{
7114 ++}
7115 ++#endif /* CONFIG_HOTPLUG_CPU */
7116 ++
7117 ++static void set_rq_offline(struct rq *rq)
7118 ++{
7119 ++ if (rq->online)
7120 ++ rq->online = false;
7121 ++}
7122 ++
7123 ++static void set_rq_online(struct rq *rq)
7124 ++{
7125 ++ if (!rq->online)
7126 ++ rq->online = true;
7127 ++}
7128 ++
7129 ++/*
7130 ++ * used to mark begin/end of suspend/resume:
7131 ++ */
7132 ++static int num_cpus_frozen;
7133 ++
7134 ++/*
7135 ++ * Update cpusets according to cpu_active mask. If cpusets are
7136 ++ * disabled, cpuset_update_active_cpus() becomes a simple wrapper
7137 ++ * around partition_sched_domains().
7138 ++ *
7139 ++ * If we come here as part of a suspend/resume, don't touch cpusets because we
7140 ++ * want to restore it back to its original state upon resume anyway.
7141 ++ */
7142 ++static void cpuset_cpu_active(void)
7143 ++{
7144 ++ if (cpuhp_tasks_frozen) {
7145 ++ /*
7146 ++ * num_cpus_frozen tracks how many CPUs are involved in suspend
7147 ++ * resume sequence. As long as this is not the last online
7148 ++ * operation in the resume sequence, just build a single sched
7149 ++ * domain, ignoring cpusets.
7150 ++ */
7151 ++ partition_sched_domains(1, NULL, NULL);
7152 ++ if (--num_cpus_frozen)
7153 ++ return;
7154 ++ /*
7155 ++ * This is the last CPU online operation. So fall through and
7156 ++ * restore the original sched domains by considering the
7157 ++ * cpuset configurations.
7158 ++ */
7159 ++ cpuset_force_rebuild();
7160 ++ }
7161 ++
7162 ++ cpuset_update_active_cpus();
7163 ++}
7164 ++
7165 ++static int cpuset_cpu_inactive(unsigned int cpu)
7166 ++{
7167 ++ if (!cpuhp_tasks_frozen) {
7168 ++ cpuset_update_active_cpus();
7169 ++ } else {
7170 ++ num_cpus_frozen++;
7171 ++ partition_sched_domains(1, NULL, NULL);
7172 ++ }
7173 ++ return 0;
7174 ++}
7175 ++
7176 ++int sched_cpu_activate(unsigned int cpu)
7177 ++{
7178 ++ struct rq *rq = cpu_rq(cpu);
7179 ++ unsigned long flags;
7180 ++
7181 ++ /*
7182 ++ * Clear the balance_push callback and prepare to schedule
7183 ++ * regular tasks.
7184 ++ */
7185 ++ balance_push_set(cpu, false);
7186 ++
7187 ++#ifdef CONFIG_SCHED_SMT
7188 ++ /*
7189 ++ * When going up, increment the number of cores with SMT present.
7190 ++ */
7191 ++ if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
7192 ++ static_branch_inc_cpuslocked(&sched_smt_present);
7193 ++#endif
7194 ++ set_cpu_active(cpu, true);
7195 ++
7196 ++ if (sched_smp_initialized)
7197 ++ cpuset_cpu_active();
7198 ++
7199 ++ /*
7200 ++ * Put the rq online, if not already. This happens:
7201 ++ *
7202 ++ * 1) In the early boot process, because we build the real domains
7203 ++ * after all cpus have been brought up.
7204 ++ *
7205 ++ * 2) At runtime, if cpuset_cpu_active() fails to rebuild the
7206 ++ * domains.
7207 ++ */
7208 ++ raw_spin_lock_irqsave(&rq->lock, flags);
7209 ++ set_rq_online(rq);
7210 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
7211 ++
7212 ++ return 0;
7213 ++}
7214 ++
7215 ++int sched_cpu_deactivate(unsigned int cpu)
7216 ++{
7217 ++ struct rq *rq = cpu_rq(cpu);
7218 ++ unsigned long flags;
7219 ++ int ret;
7220 ++
7221 ++ set_cpu_active(cpu, false);
7222 ++
7223 ++ /*
7224 ++ * From this point forward, this CPU will refuse to run any task that
7225 ++ * is not: migrate_disable() or KTHREAD_IS_PER_CPU, and will actively
7226 ++ * push those tasks away until this gets cleared, see
7227 ++ * sched_cpu_dying().
7228 ++ */
7229 ++ balance_push_set(cpu, true);
7230 ++
7231 ++ /*
7232 ++ * We've cleared cpu_active_mask, wait for all preempt-disabled and RCU
7233 ++ * users of this state to go away such that all new such users will
7234 ++ * observe it.
7235 ++ *
7236 ++ * Specifically, we rely on ttwu to no longer target this CPU, see
7237 ++ * ttwu_queue_cond() and is_cpu_allowed().
7238 ++ *
7239 ++ * Do the sync before parking smpboot threads to take care of the rcu boost case.
7240 ++ */
7241 ++ synchronize_rcu();
7242 ++
7243 ++ raw_spin_lock_irqsave(&rq->lock, flags);
7244 ++ update_rq_clock(rq);
7245 ++ set_rq_offline(rq);
7246 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
7247 ++
7248 ++#ifdef CONFIG_SCHED_SMT
7249 ++ /*
7250 ++ * When going down, decrement the number of cores with SMT present.
7251 ++ */
7252 ++ if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
7253 ++ static_branch_dec_cpuslocked(&sched_smt_present);
7254 ++ if (!static_branch_likely(&sched_smt_present))
7255 ++ cpumask_clear(&sched_sg_idle_mask);
7256 ++ }
7257 ++#endif
7258 ++
7259 ++ if (!sched_smp_initialized)
7260 ++ return 0;
7261 ++
7262 ++ ret = cpuset_cpu_inactive(cpu);
7263 ++ if (ret) {
7264 ++ balance_push_set(cpu, false);
7265 ++ set_cpu_active(cpu, true);
7266 ++ return ret;
7267 ++ }
7268 ++
7269 ++ return 0;
7270 ++}
7271 ++
7272 ++static void sched_rq_cpu_starting(unsigned int cpu)
7273 ++{
7274 ++ struct rq *rq = cpu_rq(cpu);
7275 ++
7276 ++ rq->calc_load_update = calc_load_update;
7277 ++}
7278 ++
7279 ++int sched_cpu_starting(unsigned int cpu)
7280 ++{
7281 ++ sched_rq_cpu_starting(cpu);
7282 ++ sched_tick_start(cpu);
7283 ++ return 0;
7284 ++}
7285 ++
7286 ++#ifdef CONFIG_HOTPLUG_CPU
7287 ++
7288 ++/*
7289 ++ * Invoked immediately before the stopper thread is invoked to bring the
7290 ++ * CPU down completely. At this point all per CPU kthreads except the
7291 ++ * hotplug thread (current) and the stopper thread (inactive) have been
7292 ++ * either parked or have been unbound from the outgoing CPU. Ensure that
7293 ++ * any of those which might be on the way out are gone.
7294 ++ *
7295 ++ * If after this point a bound task is being woken on this CPU then the
7296 ++ * responsible hotplug callback has failed to do its job.
7297 ++ * sched_cpu_dying() will catch it with the appropriate fireworks.
7298 ++ */
7299 ++int sched_cpu_wait_empty(unsigned int cpu)
7300 ++{
7301 ++ balance_hotplug_wait();
7302 ++ return 0;
7303 ++}
7304 ++
7305 ++/*
7306 ++ * Since this CPU is going 'away' for a while, fold any nr_active delta we
7307 ++ * might have. Called from the CPU stopper task after ensuring that the
7308 ++ * stopper is the last running task on the CPU, so nr_active count is
7309 ++ * stable. We need to take the teardown thread which is calling this into
7310 ++ * account, so we hand in adjust = 1 to the load calculation.
7311 ++ *
7312 ++ * Also see the comment "Global load-average calculations".
7313 ++ */
7314 ++static void calc_load_migrate(struct rq *rq)
7315 ++{
7316 ++ long delta = calc_load_fold_active(rq, 1);
7317 ++
7318 ++ if (delta)
7319 ++ atomic_long_add(delta, &calc_load_tasks);
7320 ++}
7321 ++
7322 ++static void dump_rq_tasks(struct rq *rq, const char *loglvl)
7323 ++{
7324 ++ struct task_struct *g, *p;
7325 ++ int cpu = cpu_of(rq);
7326 ++
7327 ++ lockdep_assert_held(&rq->lock);
7328 ++
7329 ++ printk("%sCPU%d enqueued tasks (%u total):\n", loglvl, cpu, rq->nr_running);
7330 ++ for_each_process_thread(g, p) {
7331 ++ if (task_cpu(p) != cpu)
7332 ++ continue;
7333 ++
7334 ++ if (!task_on_rq_queued(p))
7335 ++ continue;
7336 ++
7337 ++ printk("%s\tpid: %d, name: %s\n", loglvl, p->pid, p->comm);
7338 ++ }
7339 ++}
7340 ++
7341 ++int sched_cpu_dying(unsigned int cpu)
7342 ++{
7343 ++ struct rq *rq = cpu_rq(cpu);
7344 ++ unsigned long flags;
7345 ++
7346 ++ /* Handle pending wakeups and then migrate everything off */
7347 ++ sched_tick_stop(cpu);
7348 ++
7349 ++ raw_spin_lock_irqsave(&rq->lock, flags);
7350 ++ if (rq->nr_running != 1 || rq_has_pinned_tasks(rq)) {
7351 ++ WARN(true, "Dying CPU not properly vacated!");
7352 ++ dump_rq_tasks(rq, KERN_WARNING);
7353 ++ }
7354 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
7355 ++
7356 ++ calc_load_migrate(rq);
7357 ++ hrtick_clear(rq);
7358 ++ return 0;
7359 ++}
7360 ++#endif
7361 ++
7362 ++#ifdef CONFIG_SMP
7363 ++static void sched_init_topology_cpumask_early(void)
7364 ++{
7365 ++ int cpu;
7366 ++ cpumask_t *tmp;
7367 ++
7368 ++ for_each_possible_cpu(cpu) {
7369 ++ /* init topo masks */
7370 ++ tmp = per_cpu(sched_cpu_topo_masks, cpu);
7371 ++
7372 ++ cpumask_copy(tmp, cpumask_of(cpu));
7373 ++ tmp++;
7374 ++ cpumask_copy(tmp, cpu_possible_mask);
7375 ++ per_cpu(sched_cpu_llc_mask, cpu) = tmp;
7376 ++ per_cpu(sched_cpu_topo_end_mask, cpu) = ++tmp;
7377 ++ /*per_cpu(sd_llc_id, cpu) = cpu;*/
7378 ++ }
7379 ++}
7380 ++
7381 ++#define TOPOLOGY_CPUMASK(name, mask, last)\
7382 ++ if (cpumask_and(topo, topo, mask)) { \
7383 ++ cpumask_copy(topo, mask); \
7384 ++ printk(KERN_INFO "sched: cpu#%02d topo: 0x%08lx - "#name, \
7385 ++ cpu, (topo++)->bits[0]); \
7386 ++ } \
7387 ++ if (!last) \
7388 ++ cpumask_complement(topo, mask)
7389 ++
7390 ++static void sched_init_topology_cpumask(void)
7391 ++{
7392 ++ int cpu;
7393 ++ cpumask_t *topo;
7394 ++
7395 ++ for_each_online_cpu(cpu) {
7396 ++ /* take chance to reset time slice for idle tasks */
7397 ++ cpu_rq(cpu)->idle->time_slice = sched_timeslice_ns;
7398 ++
7399 ++ topo = per_cpu(sched_cpu_topo_masks, cpu) + 1;
7400 ++
7401 ++ cpumask_complement(topo, cpumask_of(cpu));
7402 ++#ifdef CONFIG_SCHED_SMT
7403 ++ TOPOLOGY_CPUMASK(smt, topology_sibling_cpumask(cpu), false);
7404 ++#endif
7405 ++ per_cpu(sd_llc_id, cpu) = cpumask_first(cpu_coregroup_mask(cpu));
7406 ++ per_cpu(sched_cpu_llc_mask, cpu) = topo;
7407 ++ TOPOLOGY_CPUMASK(coregroup, cpu_coregroup_mask(cpu), false);
7408 ++
7409 ++ TOPOLOGY_CPUMASK(core, topology_core_cpumask(cpu), false);
7410 ++
7411 ++ TOPOLOGY_CPUMASK(others, cpu_online_mask, true);
7412 ++
7413 ++ per_cpu(sched_cpu_topo_end_mask, cpu) = topo;
7414 ++ printk(KERN_INFO "sched: cpu#%02d llc_id = %d, llc_mask idx = %d\n",
7415 ++ cpu, per_cpu(sd_llc_id, cpu),
7416 ++ (int) (per_cpu(sched_cpu_llc_mask, cpu) -
7417 ++ per_cpu(sched_cpu_topo_masks, cpu)));
7418 ++ }
7419 ++}
7420 ++#endif
7421 ++
7422 ++void __init sched_init_smp(void)
7423 ++{
7424 ++ /* Move init over to a non-isolated CPU */
7425 ++ if (set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_FLAG_DOMAIN)) < 0)
7426 ++ BUG();
7427 ++ current->flags &= ~PF_NO_SETAFFINITY;
7428 ++
7429 ++ sched_init_topology_cpumask();
7430 ++
7431 ++ sched_smp_initialized = true;
7432 ++}
7433 ++#else
7434 ++void __init sched_init_smp(void)
7435 ++{
7436 ++ cpu_rq(0)->idle->time_slice = sched_timeslice_ns;
7437 ++}
7438 ++#endif /* CONFIG_SMP */
7439 ++
7440 ++int in_sched_functions(unsigned long addr)
7441 ++{
7442 ++ return in_lock_functions(addr) ||
7443 ++ (addr >= (unsigned long)__sched_text_start
7444 ++ && addr < (unsigned long)__sched_text_end);
7445 ++}
7446 ++
7447 ++#ifdef CONFIG_CGROUP_SCHED
7448 ++/* task group related information */
7449 ++struct task_group {
7450 ++ struct cgroup_subsys_state css;
7451 ++
7452 ++ struct rcu_head rcu;
7453 ++ struct list_head list;
7454 ++
7455 ++ struct task_group *parent;
7456 ++ struct list_head siblings;
7457 ++ struct list_head children;
7458 ++#ifdef CONFIG_FAIR_GROUP_SCHED
7459 ++ unsigned long shares;
7460 ++#endif
7461 ++};
7462 ++
7463 ++/*
7464 ++ * Default task group.
7465 ++ * Every task in system belongs to this group at bootup.
7466 ++ */
7467 ++struct task_group root_task_group;
7468 ++LIST_HEAD(task_groups);
7469 ++
7470 ++/* Cacheline aligned slab cache for task_group */
7471 ++static struct kmem_cache *task_group_cache __read_mostly;
7472 ++#endif /* CONFIG_CGROUP_SCHED */
7473 ++
7474 ++void __init sched_init(void)
7475 ++{
7476 ++ int i;
7477 ++ struct rq *rq;
7478 ++
7479 ++ printk(KERN_INFO ALT_SCHED_VERSION_MSG);
7480 ++
7481 ++ wait_bit_init();
7482 ++
7483 ++#ifdef CONFIG_SMP
7484 ++ for (i = 0; i < SCHED_BITS; i++)
7485 ++ cpumask_copy(sched_rq_watermark + i, cpu_present_mask);
7486 ++#endif
7487 ++
7488 ++#ifdef CONFIG_CGROUP_SCHED
7489 ++ task_group_cache = KMEM_CACHE(task_group, 0);
7490 ++
7491 ++ list_add(&root_task_group.list, &task_groups);
7492 ++ INIT_LIST_HEAD(&root_task_group.children);
7493 ++ INIT_LIST_HEAD(&root_task_group.siblings);
7494 ++#endif /* CONFIG_CGROUP_SCHED */
7495 ++ for_each_possible_cpu(i) {
7496 ++ rq = cpu_rq(i);
7497 ++
7498 ++ sched_queue_init(&rq->queue);
7499 ++ rq->watermark = IDLE_TASK_SCHED_PRIO;
7500 ++ rq->skip = NULL;
7501 ++
7502 ++ raw_spin_lock_init(&rq->lock);
7503 ++ rq->nr_running = rq->nr_uninterruptible = 0;
7504 ++ rq->calc_load_active = 0;
7505 ++ rq->calc_load_update = jiffies + LOAD_FREQ;
7506 ++#ifdef CONFIG_SMP
7507 ++ rq->online = false;
7508 ++ rq->cpu = i;
7509 ++
7510 ++#ifdef CONFIG_SCHED_SMT
7511 ++ rq->active_balance = 0;
7512 ++#endif
7513 ++
7514 ++#ifdef CONFIG_NO_HZ_COMMON
7515 ++ INIT_CSD(&rq->nohz_csd, nohz_csd_func, rq);
7516 ++#endif
7517 ++ rq->balance_callback = &balance_push_callback;
7518 ++#ifdef CONFIG_HOTPLUG_CPU
7519 ++ rcuwait_init(&rq->hotplug_wait);
7520 ++#endif
7521 ++#endif /* CONFIG_SMP */
7522 ++ rq->nr_switches = 0;
7523 ++
7524 ++ hrtick_rq_init(rq);
7525 ++ atomic_set(&rq->nr_iowait, 0);
7526 ++ }
7527 ++#ifdef CONFIG_SMP
7528 ++ /* Set rq->online for cpu 0 */
7529 ++ cpu_rq(0)->online = true;
7530 ++#endif
7531 ++ /*
7532 ++ * The boot idle thread does lazy MMU switching as well:
7533 ++ */
7534 ++ mmgrab(&init_mm);
7535 ++ enter_lazy_tlb(&init_mm, current);
7536 ++
7537 ++ /*
7538 ++ * Make us the idle thread. Technically, schedule() should not be
7539 ++ * called from this thread, however somewhere below it might be,
7540 ++ * but because we are the idle thread, we just pick up running again
7541 ++ * when this runqueue becomes "idle".
7542 ++ */
7543 ++ init_idle(current, smp_processor_id());
7544 ++
7545 ++ calc_load_update = jiffies + LOAD_FREQ;
7546 ++
7547 ++#ifdef CONFIG_SMP
7548 ++ idle_thread_set_boot_cpu();
7549 ++ balance_push_set(smp_processor_id(), false);
7550 ++
7551 ++ sched_init_topology_cpumask_early();
7552 ++#endif /* SMP */
7553 ++
7554 ++ psi_init();
7555 ++}
7556 ++
7557 ++#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
7558 ++static inline int preempt_count_equals(int preempt_offset)
7559 ++{
7560 ++ int nested = preempt_count() + rcu_preempt_depth();
7561 ++
7562 ++ return (nested == preempt_offset);
7563 ++}
7564 ++
7565 ++void __might_sleep(const char *file, int line, int preempt_offset)
7566 ++{
7567 ++ unsigned int state = get_current_state();
7568 ++ /*
7569 ++ * Blocking primitives will set (and therefore destroy) current->state,
7570 ++ * since we will exit with TASK_RUNNING make sure we enter with it,
7571 ++ * otherwise we will destroy state.
7572 ++ */
7573 ++ WARN_ONCE(state != TASK_RUNNING && current->task_state_change,
7574 ++ "do not call blocking ops when !TASK_RUNNING; "
7575 ++ "state=%x set at [<%p>] %pS\n", state,
7576 ++ (void *)current->task_state_change,
7577 ++ (void *)current->task_state_change);
7578 ++
7579 ++ ___might_sleep(file, line, preempt_offset);
7580 ++}
7581 ++EXPORT_SYMBOL(__might_sleep);
7582 ++
7583 ++void ___might_sleep(const char *file, int line, int preempt_offset)
7584 ++{
7585 ++ /* Ratelimiting timestamp: */
7586 ++ static unsigned long prev_jiffy;
7587 ++
7588 ++ unsigned long preempt_disable_ip;
7589 ++
7590 ++ /* WARN_ON_ONCE() by default, no rate limit required: */
7591 ++ rcu_sleep_check();
7592 ++
7593 ++ if ((preempt_count_equals(preempt_offset) && !irqs_disabled() &&
7594 ++ !is_idle_task(current) && !current->non_block_count) ||
7595 ++ system_state == SYSTEM_BOOTING || system_state > SYSTEM_RUNNING ||
7596 ++ oops_in_progress)
7597 ++ return;
7598 ++ if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
7599 ++ return;
7600 ++ prev_jiffy = jiffies;
7601 ++
7602 ++ /* Save this before calling printk(), since that will clobber it: */
7603 ++ preempt_disable_ip = get_preempt_disable_ip(current);
7604 ++
7605 ++ printk(KERN_ERR
7606 ++ "BUG: sleeping function called from invalid context at %s:%d\n",
7607 ++ file, line);
7608 ++ printk(KERN_ERR
7609 ++ "in_atomic(): %d, irqs_disabled(): %d, non_block: %d, pid: %d, name: %s\n",
7610 ++ in_atomic(), irqs_disabled(), current->non_block_count,
7611 ++ current->pid, current->comm);
7612 ++
7613 ++ if (task_stack_end_corrupted(current))
7614 ++ printk(KERN_EMERG "Thread overran stack, or stack corrupted\n");
7615 ++
7616 ++ debug_show_held_locks(current);
7617 ++ if (irqs_disabled())
7618 ++ print_irqtrace_events(current);
7619 ++#ifdef CONFIG_DEBUG_PREEMPT
7620 ++ if (!preempt_count_equals(preempt_offset)) {
7621 ++ pr_err("Preemption disabled at:");
7622 ++ print_ip_sym(KERN_ERR, preempt_disable_ip);
7623 ++ }
7624 ++#endif
7625 ++ dump_stack();
7626 ++ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
7627 ++}
7628 ++EXPORT_SYMBOL(___might_sleep);
7629 ++
7630 ++void __cant_sleep(const char *file, int line, int preempt_offset)
7631 ++{
7632 ++ static unsigned long prev_jiffy;
7633 ++
7634 ++ if (irqs_disabled())
7635 ++ return;
7636 ++
7637 ++ if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
7638 ++ return;
7639 ++
7640 ++ if (preempt_count() > preempt_offset)
7641 ++ return;
7642 ++
7643 ++ if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
7644 ++ return;
7645 ++ prev_jiffy = jiffies;
7646 ++
7647 ++ printk(KERN_ERR "BUG: assuming atomic context at %s:%d\n", file, line);
7648 ++ printk(KERN_ERR "in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",
7649 ++ in_atomic(), irqs_disabled(),
7650 ++ current->pid, current->comm);
7651 ++
7652 ++ debug_show_held_locks(current);
7653 ++ dump_stack();
7654 ++ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
7655 ++}
7656 ++EXPORT_SYMBOL_GPL(__cant_sleep);
7657 ++
7658 ++#ifdef CONFIG_SMP
7659 ++void __cant_migrate(const char *file, int line)
7660 ++{
7661 ++ static unsigned long prev_jiffy;
7662 ++
7663 ++ if (irqs_disabled())
7664 ++ return;
7665 ++
7666 ++ if (is_migration_disabled(current))
7667 ++ return;
7668 ++
7669 ++ if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
7670 ++ return;
7671 ++
7672 ++ if (preempt_count() > 0)
7673 ++ return;
7674 ++
7675 ++ if (current->migration_flags & MDF_FORCE_ENABLED)
7676 ++ return;
7677 ++
7678 ++ if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
7679 ++ return;
7680 ++ prev_jiffy = jiffies;
7681 ++
7682 ++ pr_err("BUG: assuming non migratable context at %s:%d\n", file, line);
7683 ++ pr_err("in_atomic(): %d, irqs_disabled(): %d, migration_disabled() %u pid: %d, name: %s\n",
7684 ++ in_atomic(), irqs_disabled(), is_migration_disabled(current),
7685 ++ current->pid, current->comm);
7686 ++
7687 ++ debug_show_held_locks(current);
7688 ++ dump_stack();
7689 ++ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
7690 ++}
7691 ++EXPORT_SYMBOL_GPL(__cant_migrate);
7692 ++#endif
7693 ++#endif
7694 ++
7695 ++#ifdef CONFIG_MAGIC_SYSRQ
7696 ++void normalize_rt_tasks(void)
7697 ++{
7698 ++ struct task_struct *g, *p;
7699 ++ struct sched_attr attr = {
7700 ++ .sched_policy = SCHED_NORMAL,
7701 ++ };
7702 ++
7703 ++ read_lock(&tasklist_lock);
7704 ++ for_each_process_thread(g, p) {
7705 ++ /*
7706 ++ * Only normalize user tasks:
7707 ++ */
7708 ++ if (p->flags & PF_KTHREAD)
7709 ++ continue;
7710 ++
7711 ++ if (!rt_task(p)) {
7712 ++ /*
7713 ++ * Renice negative nice level userspace
7714 ++ * tasks back to 0:
7715 ++ */
7716 ++ if (task_nice(p) < 0)
7717 ++ set_user_nice(p, 0);
7718 ++ continue;
7719 ++ }
7720 ++
7721 ++ __sched_setscheduler(p, &attr, false, false);
7722 ++ }
7723 ++ read_unlock(&tasklist_lock);
7724 ++}
7725 ++#endif /* CONFIG_MAGIC_SYSRQ */
7726 ++
7727 ++#if defined(CONFIG_IA64) || defined(CONFIG_KGDB_KDB)
7728 ++/*
7729 ++ * These functions are only useful for the IA64 MCA handling, or kdb.
7730 ++ *
7731 ++ * They can only be called when the whole system has been
7732 ++ * stopped - every CPU needs to be quiescent, and no scheduling
7733 ++ * activity can take place. Using them for anything else would
7734 ++ * be a serious bug, and as a result, they aren't even visible
7735 ++ * under any other configuration.
7736 ++ */
7737 ++
7738 ++/**
7739 ++ * curr_task - return the current task for a given CPU.
7740 ++ * @cpu: the processor in question.
7741 ++ *
7742 ++ * ONLY VALID WHEN THE WHOLE SYSTEM IS STOPPED!
7743 ++ *
7744 ++ * Return: The current task for @cpu.
7745 ++ */
7746 ++struct task_struct *curr_task(int cpu)
7747 ++{
7748 ++ return cpu_curr(cpu);
7749 ++}
7750 ++
7751 ++#endif /* defined(CONFIG_IA64) || defined(CONFIG_KGDB_KDB) */
7752 ++
7753 ++#ifdef CONFIG_IA64
7754 ++/**
7755 ++ * ia64_set_curr_task - set the current task for a given CPU.
7756 ++ * @cpu: the processor in question.
7757 ++ * @p: the task pointer to set.
7758 ++ *
7759 ++ * Description: This function must only be used when non-maskable interrupts
7760 ++ * are serviced on a separate stack. It allows the architecture to switch the
7761 ++ * notion of the current task on a CPU in a non-blocking manner. This function
7762 ++ * must be called with all CPUs synchronised and interrupts disabled; the
7763 ++ * caller must save the original value of the current task (see
7764 ++ * curr_task() above) and restore that value before reenabling interrupts and
7765 ++ * re-starting the system.
7766 ++ *
7767 ++ * ONLY VALID WHEN THE WHOLE SYSTEM IS STOPPED!
7768 ++ */
7769 ++void ia64_set_curr_task(int cpu, struct task_struct *p)
7770 ++{
7771 ++ cpu_curr(cpu) = p;
7772 ++}
7773 ++
7774 ++#endif
7775 ++
7776 ++#ifdef CONFIG_CGROUP_SCHED
7777 ++static void sched_free_group(struct task_group *tg)
7778 ++{
7779 ++ kmem_cache_free(task_group_cache, tg);
7780 ++}
7781 ++
7782 ++/* allocate runqueue etc for a new task group */
7783 ++struct task_group *sched_create_group(struct task_group *parent)
7784 ++{
7785 ++ struct task_group *tg;
7786 ++
7787 ++ tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
7788 ++ if (!tg)
7789 ++ return ERR_PTR(-ENOMEM);
7790 ++
7791 ++ return tg;
7792 ++}
7793 ++
7794 ++void sched_online_group(struct task_group *tg, struct task_group *parent)
7795 ++{
7796 ++}
7797 ++
7798 ++/* rcu callback to free various structures associated with a task group */
7799 ++static void sched_free_group_rcu(struct rcu_head *rhp)
7800 ++{
7801 ++ /* Now it should be safe to free those cfs_rqs */
7802 ++ sched_free_group(container_of(rhp, struct task_group, rcu));
7803 ++}
7804 ++
7805 ++void sched_destroy_group(struct task_group *tg)
7806 ++{
7807 ++ /* Wait for possible concurrent references to cfs_rqs to complete */
7808 ++ call_rcu(&tg->rcu, sched_free_group_rcu);
7809 ++}
7810 ++
7811 ++void sched_offline_group(struct task_group *tg)
7812 ++{
7813 ++}
7814 ++
7815 ++static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
7816 ++{
7817 ++ return css ? container_of(css, struct task_group, css) : NULL;
7818 ++}
7819 ++
7820 ++static struct cgroup_subsys_state *
7821 ++cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
7822 ++{
7823 ++ struct task_group *parent = css_tg(parent_css);
7824 ++ struct task_group *tg;
7825 ++
7826 ++ if (!parent) {
7827 ++ /* This is early initialization for the top cgroup */
7828 ++ return &root_task_group.css;
7829 ++ }
7830 ++
7831 ++ tg = sched_create_group(parent);
7832 ++ if (IS_ERR(tg))
7833 ++ return ERR_PTR(-ENOMEM);
7834 ++ return &tg->css;
7835 ++}
7836 ++
7837 ++/* Expose task group only after completing cgroup initialization */
7838 ++static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
7839 ++{
7840 ++ struct task_group *tg = css_tg(css);
7841 ++ struct task_group *parent = css_tg(css->parent);
7842 ++
7843 ++ if (parent)
7844 ++ sched_online_group(tg, parent);
7845 ++ return 0;
7846 ++}
7847 ++
7848 ++static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
7849 ++{
7850 ++ struct task_group *tg = css_tg(css);
7851 ++
7852 ++ sched_offline_group(tg);
7853 ++}
7854 ++
7855 ++static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
7856 ++{
7857 ++ struct task_group *tg = css_tg(css);
7858 ++
7859 ++ /*
7860 ++ * Relies on the RCU grace period between css_released() and this.
7861 ++ */
7862 ++ sched_free_group(tg);
7863 ++}
7864 ++
7865 ++static void cpu_cgroup_fork(struct task_struct *task)
7866 ++{
7867 ++}
7868 ++
7869 ++static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
7870 ++{
7871 ++ return 0;
7872 ++}
7873 ++
7874 ++static void cpu_cgroup_attach(struct cgroup_taskset *tset)
7875 ++{
7876 ++}
7877 ++
7878 ++#ifdef CONFIG_FAIR_GROUP_SCHED
7879 ++static DEFINE_MUTEX(shares_mutex);
7880 ++
7881 ++int sched_group_set_shares(struct task_group *tg, unsigned long shares)
7882 ++{
7883 ++ /*
7884 ++ * We can't change the weight of the root cgroup.
7885 ++ */
7886 ++ if (&root_task_group == tg)
7887 ++ return -EINVAL;
7888 ++
7889 ++ shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
7890 ++
7891 ++ mutex_lock(&shares_mutex);
7892 ++ if (tg->shares == shares)
7893 ++ goto done;
7894 ++
7895 ++ tg->shares = shares;
7896 ++done:
7897 ++ mutex_unlock(&shares_mutex);
7898 ++ return 0;
7899 ++}
7900 ++
7901 ++static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
7902 ++ struct cftype *cftype, u64 shareval)
7903 ++{
7904 ++ if (shareval > scale_load_down(ULONG_MAX))
7905 ++ shareval = MAX_SHARES;
7906 ++ return sched_group_set_shares(css_tg(css), scale_load(shareval));
7907 ++}
7908 ++
7909 ++static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
7910 ++ struct cftype *cft)
7911 ++{
7912 ++ struct task_group *tg = css_tg(css);
7913 ++
7914 ++ return (u64) scale_load_down(tg->shares);
7915 ++}
7916 ++#endif
7917 ++
7918 ++static struct cftype cpu_legacy_files[] = {
7919 ++#ifdef CONFIG_FAIR_GROUP_SCHED
7920 ++ {
7921 ++ .name = "shares",
7922 ++ .read_u64 = cpu_shares_read_u64,
7923 ++ .write_u64 = cpu_shares_write_u64,
7924 ++ },
7925 ++#endif
7926 ++ { } /* Terminate */
7927 ++};
7928 ++
7929 ++
7930 ++static struct cftype cpu_files[] = {
7931 ++ { } /* terminate */
7932 ++};
7933 ++
7934 ++static int cpu_extra_stat_show(struct seq_file *sf,
7935 ++ struct cgroup_subsys_state *css)
7936 ++{
7937 ++ return 0;
7938 ++}
7939 ++
7940 ++struct cgroup_subsys cpu_cgrp_subsys = {
7941 ++ .css_alloc = cpu_cgroup_css_alloc,
7942 ++ .css_online = cpu_cgroup_css_online,
7943 ++ .css_released = cpu_cgroup_css_released,
7944 ++ .css_free = cpu_cgroup_css_free,
7945 ++ .css_extra_stat_show = cpu_extra_stat_show,
7946 ++ .fork = cpu_cgroup_fork,
7947 ++ .can_attach = cpu_cgroup_can_attach,
7948 ++ .attach = cpu_cgroup_attach,
7949 ++ .legacy_cftypes = cpu_legacy_files,
7951 ++ .dfl_cftypes = cpu_files,
7952 ++ .early_init = true,
7953 ++ .threaded = true,
7954 ++};
7955 ++#endif /* CONFIG_CGROUP_SCHED */
7956 ++
7957 ++#undef CREATE_TRACE_POINTS
7958 +diff --git a/kernel/sched/alt_debug.c b/kernel/sched/alt_debug.c
7959 +new file mode 100644
7960 +index 000000000000..1212a031700e
7961 +--- /dev/null
7962 ++++ b/kernel/sched/alt_debug.c
7963 +@@ -0,0 +1,31 @@
7964 ++/*
7965 ++ * kernel/sched/alt_debug.c
7966 ++ *
7967 ++ * Print the alt scheduler debugging details
7968 ++ *
7969 ++ * Author: Alfred Chen
7970 ++ * Date : 2020
7971 ++ */
7972 ++#include "sched.h"
7973 ++
7974 ++/*
7975 ++ * This allows printing both to /proc/sched_debug and
7976 ++ * to the console
7977 ++ */
7978 ++#define SEQ_printf(m, x...) \
7979 ++ do { \
7980 ++ if (m) \
7981 ++ seq_printf(m, x); \
7982 ++ else \
7983 ++ pr_cont(x); \
7984 ++ } while (0)
7985 ++
7986 ++void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
7987 ++ struct seq_file *m)
7988 ++{
7989 ++ SEQ_printf(m, "%s (%d, #threads: %d)\n", p->comm, task_pid_nr_ns(p, ns),
7990 ++ get_nr_threads(p));
7991 ++}
7992 ++
7993 ++void proc_sched_set_task(struct task_struct *p)
7994 ++{}
7995 +diff --git a/kernel/sched/alt_sched.h b/kernel/sched/alt_sched.h
7996 +new file mode 100644
7997 +index 000000000000..f03af9ab9123
7998 +--- /dev/null
7999 ++++ b/kernel/sched/alt_sched.h
8000 +@@ -0,0 +1,692 @@
8001 ++#ifndef ALT_SCHED_H
8002 ++#define ALT_SCHED_H
8003 ++
8004 ++#include <linux/sched.h>
8005 ++
8006 ++#include <linux/sched/clock.h>
8007 ++#include <linux/sched/cpufreq.h>
8008 ++#include <linux/sched/cputime.h>
8009 ++#include <linux/sched/debug.h>
8010 ++#include <linux/sched/init.h>
8011 ++#include <linux/sched/isolation.h>
8012 ++#include <linux/sched/loadavg.h>
8013 ++#include <linux/sched/mm.h>
8014 ++#include <linux/sched/nohz.h>
8015 ++#include <linux/sched/signal.h>
8016 ++#include <linux/sched/stat.h>
8017 ++#include <linux/sched/sysctl.h>
8018 ++#include <linux/sched/task.h>
8019 ++#include <linux/sched/topology.h>
8020 ++#include <linux/sched/wake_q.h>
8021 ++
8022 ++#include <uapi/linux/sched/types.h>
8023 ++
8024 ++#include <linux/cgroup.h>
8025 ++#include <linux/cpufreq.h>
8026 ++#include <linux/cpuidle.h>
8027 ++#include <linux/cpuset.h>
8028 ++#include <linux/ctype.h>
8029 ++#include <linux/debugfs.h>
8030 ++#include <linux/kthread.h>
8031 ++#include <linux/livepatch.h>
8032 ++#include <linux/membarrier.h>
8033 ++#include <linux/proc_fs.h>
8034 ++#include <linux/psi.h>
8035 ++#include <linux/slab.h>
8036 ++#include <linux/stop_machine.h>
8037 ++#include <linux/suspend.h>
8038 ++#include <linux/swait.h>
8039 ++#include <linux/syscalls.h>
8040 ++#include <linux/tsacct_kern.h>
8041 ++
8042 ++#include <asm/tlb.h>
8043 ++
8044 ++#ifdef CONFIG_PARAVIRT
8045 ++# include <asm/paravirt.h>
8046 ++#endif
8047 ++
8048 ++#include "cpupri.h"
8049 ++
8050 ++#include <trace/events/sched.h>
8051 ++
8052 ++#ifdef CONFIG_SCHED_BMQ
8053 ++/* bits:
8054 ++ * RT(0-99), (Low prio adj range, nice width, high prio adj range) / 2, cpu idle task */
8055 ++#define SCHED_BITS (MAX_RT_PRIO + NICE_WIDTH / 2 + MAX_PRIORITY_ADJ + 1)
8056 ++#endif
8057 ++
8058 ++#ifdef CONFIG_SCHED_PDS
8059 ++/* bits: RT(0-99), reserved(100-127), NORMAL_PRIO_NUM, cpu idle task */
8060 ++#define SCHED_BITS (MIN_NORMAL_PRIO + NORMAL_PRIO_NUM + 1)
8061 ++#endif /* CONFIG_SCHED_PDS */
8062 ++
8063 ++#define IDLE_TASK_SCHED_PRIO (SCHED_BITS - 1)
8064 ++
8065 ++#ifdef CONFIG_SCHED_DEBUG
8066 ++# define SCHED_WARN_ON(x) WARN_ONCE(x, #x)
8067 ++extern void resched_latency_warn(int cpu, u64 latency);
8068 ++#else
8069 ++# define SCHED_WARN_ON(x) ({ (void)(x), 0; })
8070 ++static inline void resched_latency_warn(int cpu, u64 latency) {}
8071 ++#endif
8072 ++
8073 ++/*
8074 ++ * Increase resolution of nice-level calculations for 64-bit architectures.
8075 ++ * The extra resolution improves shares distribution and load balancing of
8076 ++ * low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
8077 ++ * hierarchies, especially on larger systems. This is not a user-visible change
8078 ++ * and does not change the user-interface for setting shares/weights.
8079 ++ *
8080 ++ * We increase resolution only if we have enough bits to allow this increased
8081 ++ * resolution (i.e. 64-bit). The costs for increasing resolution when 32-bit
8082 ++ * are pretty high and the returns do not justify the increased costs.
8083 ++ *
8084 ++ * Really only required when CONFIG_FAIR_GROUP_SCHED=y is also set, but to
8085 ++ * increase coverage and consistency always enable it on 64-bit platforms.
8086 ++ */
8087 ++#ifdef CONFIG_64BIT
8088 ++# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
8089 ++# define scale_load(w) ((w) << SCHED_FIXEDPOINT_SHIFT)
8090 ++# define scale_load_down(w) \
8091 ++({ \
8092 ++ unsigned long __w = (w); \
8093 ++ if (__w) \
8094 ++ __w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT); \
8095 ++ __w; \
8096 ++})
8097 ++#else
8098 ++# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT)
8099 ++# define scale_load(w) (w)
8100 ++# define scale_load_down(w) (w)
8101 ++#endif
8102 ++
8103 ++#ifdef CONFIG_FAIR_GROUP_SCHED
8104 ++#define ROOT_TASK_GROUP_LOAD NICE_0_LOAD
8105 ++
8106 ++/*
8107 ++ * A weight of 0 or 1 can cause arithmetic problems.
8108 ++ * The weight of a cfs_rq is the sum of the weights of the entities
8109 ++ * queued on it, so the weight of an entity should not be too large,
8110 ++ * and neither should the shares value of a task group.
8111 ++ * (The default weight is 1024 - so there's no practical
8112 ++ * limitation from this.)
8113 ++ */
8114 ++#define MIN_SHARES (1UL << 1)
8115 ++#define MAX_SHARES (1UL << 18)
8116 ++#endif
8117 ++
8118 ++/* task_struct::on_rq states: */
8119 ++#define TASK_ON_RQ_QUEUED 1
8120 ++#define TASK_ON_RQ_MIGRATING 2
8121 ++
8122 ++static inline int task_on_rq_queued(struct task_struct *p)
8123 ++{
8124 ++ return p->on_rq == TASK_ON_RQ_QUEUED;
8125 ++}
8126 ++
8127 ++static inline int task_on_rq_migrating(struct task_struct *p)
8128 ++{
8129 ++ return READ_ONCE(p->on_rq) == TASK_ON_RQ_MIGRATING;
8130 ++}
8131 ++
8132 ++/*
8133 ++ * wake flags
8134 ++ */
8135 ++#define WF_SYNC 0x01 /* waker goes to sleep after wakeup */
8136 ++#define WF_FORK 0x02 /* child wakeup after fork */
8137 ++#define WF_MIGRATED 0x04 /* internal use, task got migrated */
8138 ++#define WF_ON_CPU 0x08 /* Wakee is on_rq */
8139 ++
8140 ++#define SCHED_QUEUE_BITS (SCHED_BITS - 1)
8141 ++
8142 ++struct sched_queue {
8143 ++ DECLARE_BITMAP(bitmap, SCHED_QUEUE_BITS);
8144 ++ struct list_head heads[SCHED_BITS];
8145 ++};
8146 ++
8147 ++/*
8148 ++ * This is the main, per-CPU runqueue data structure.
8149 ++ * This data should only be modified by the local cpu.
8150 ++ */
8151 ++struct rq {
8152 ++ /* runqueue lock: */
8153 ++ raw_spinlock_t lock;
8154 ++
8155 ++ struct task_struct __rcu *curr;
8156 ++ struct task_struct *idle, *stop, *skip;
8157 ++ struct mm_struct *prev_mm;
8158 ++
8159 ++ struct sched_queue queue;
8160 ++#ifdef CONFIG_SCHED_PDS
8161 ++ u64 time_edge;
8162 ++#endif
8163 ++ unsigned long watermark;
8164 ++
8165 ++ /* switch count */
8166 ++ u64 nr_switches;
8167 ++
8168 ++ atomic_t nr_iowait;
8169 ++
8170 ++#ifdef CONFIG_SCHED_DEBUG
8171 ++ u64 last_seen_need_resched_ns;
8172 ++ int ticks_without_resched;
8173 ++#endif
8174 ++
8175 ++#ifdef CONFIG_MEMBARRIER
8176 ++ int membarrier_state;
8177 ++#endif
8178 ++
8179 ++#ifdef CONFIG_SMP
8180 ++ int cpu; /* cpu of this runqueue */
8181 ++ bool online;
8182 ++
8183 ++ unsigned int ttwu_pending;
8184 ++ unsigned char nohz_idle_balance;
8185 ++ unsigned char idle_balance;
8186 ++
8187 ++#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
8188 ++ struct sched_avg avg_irq;
8189 ++#endif
8190 ++
8191 ++#ifdef CONFIG_SCHED_SMT
8192 ++ int active_balance;
8193 ++ struct cpu_stop_work active_balance_work;
8194 ++#endif
8195 ++ struct callback_head *balance_callback;
8196 ++#ifdef CONFIG_HOTPLUG_CPU
8197 ++ struct rcuwait hotplug_wait;
8198 ++#endif
8199 ++ unsigned int nr_pinned;
8200 ++#endif /* CONFIG_SMP */
8201 ++#ifdef CONFIG_IRQ_TIME_ACCOUNTING
8202 ++ u64 prev_irq_time;
8203 ++#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
8204 ++#ifdef CONFIG_PARAVIRT
8205 ++ u64 prev_steal_time;
8206 ++#endif /* CONFIG_PARAVIRT */
8207 ++#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
8208 ++ u64 prev_steal_time_rq;
8209 ++#endif /* CONFIG_PARAVIRT_TIME_ACCOUNTING */
8210 ++
8211 ++ /* calc_load related fields */
8212 ++ unsigned long calc_load_update;
8213 ++ long calc_load_active;
8214 ++
8215 ++ u64 clock, last_tick;
8216 ++ u64 last_ts_switch;
8217 ++ u64 clock_task;
8218 ++
8219 ++ unsigned int nr_running;
8220 ++ unsigned long nr_uninterruptible;
8221 ++
8222 ++#ifdef CONFIG_SCHED_HRTICK
8223 ++#ifdef CONFIG_SMP
8224 ++ call_single_data_t hrtick_csd;
8225 ++#endif
8226 ++ struct hrtimer hrtick_timer;
8227 ++ ktime_t hrtick_time;
8228 ++#endif
8229 ++
8230 ++#ifdef CONFIG_SCHEDSTATS
8231 ++
8232 ++ /* latency stats */
8233 ++ struct sched_info rq_sched_info;
8234 ++ unsigned long long rq_cpu_time;
8235 ++ /* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */
8236 ++
8237 ++ /* sys_sched_yield() stats */
8238 ++ unsigned int yld_count;
8239 ++
8240 ++ /* schedule() stats */
8241 ++ unsigned int sched_switch;
8242 ++ unsigned int sched_count;
8243 ++ unsigned int sched_goidle;
8244 ++
8245 ++ /* try_to_wake_up() stats */
8246 ++ unsigned int ttwu_count;
8247 ++ unsigned int ttwu_local;
8248 ++#endif /* CONFIG_SCHEDSTATS */
8249 ++
8250 ++#ifdef CONFIG_CPU_IDLE
8251 ++ /* Must be inspected within a rcu lock section */
8252 ++ struct cpuidle_state *idle_state;
8253 ++#endif
8254 ++
8255 ++#ifdef CONFIG_NO_HZ_COMMON
8256 ++#ifdef CONFIG_SMP
8257 ++ call_single_data_t nohz_csd;
8258 ++#endif
8259 ++ atomic_t nohz_flags;
8260 ++#endif /* CONFIG_NO_HZ_COMMON */
8261 ++};
8262 ++
8263 ++extern unsigned long calc_load_update;
8264 ++extern atomic_long_t calc_load_tasks;
8265 ++
8266 ++extern void calc_global_load_tick(struct rq *this_rq);
8267 ++extern long calc_load_fold_active(struct rq *this_rq, long adjust);
8268 ++
8269 ++DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
8270 ++#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))
8271 ++#define this_rq() this_cpu_ptr(&runqueues)
8272 ++#define task_rq(p) cpu_rq(task_cpu(p))
8273 ++#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
8274 ++#define raw_rq() raw_cpu_ptr(&runqueues)
8275 ++
8276 ++#ifdef CONFIG_SMP
8277 ++#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
8278 ++void register_sched_domain_sysctl(void);
8279 ++void unregister_sched_domain_sysctl(void);
8280 ++#else
8281 ++static inline void register_sched_domain_sysctl(void)
8282 ++{
8283 ++}
8284 ++static inline void unregister_sched_domain_sysctl(void)
8285 ++{
8286 ++}
8287 ++#endif
8288 ++
8289 ++extern bool sched_smp_initialized;
8290 ++
8291 ++enum {
8292 ++ ITSELF_LEVEL_SPACE_HOLDER,
8293 ++#ifdef CONFIG_SCHED_SMT
8294 ++ SMT_LEVEL_SPACE_HOLDER,
8295 ++#endif
8296 ++ COREGROUP_LEVEL_SPACE_HOLDER,
8297 ++ CORE_LEVEL_SPACE_HOLDER,
8298 ++ OTHER_LEVEL_SPACE_HOLDER,
8299 ++ NR_CPU_AFFINITY_LEVELS
8300 ++};
8301 ++
8302 ++DECLARE_PER_CPU(cpumask_t [NR_CPU_AFFINITY_LEVELS], sched_cpu_topo_masks);
8303 ++DECLARE_PER_CPU(cpumask_t *, sched_cpu_llc_mask);
8304 ++
8305 ++static inline int
8306 ++__best_mask_cpu(const cpumask_t *cpumask, const cpumask_t *mask)
8307 ++{
8308 ++ int cpu;
8309 ++
8310 ++ while ((cpu = cpumask_any_and(cpumask, mask)) >= nr_cpu_ids)
8311 ++ mask++;
8312 ++
8313 ++ return cpu;
8314 ++}
8315 ++
8316 ++static inline int best_mask_cpu(int cpu, const cpumask_t *mask)
8317 ++{
8318 ++ return __best_mask_cpu(mask, per_cpu(sched_cpu_topo_masks, cpu));
8319 ++}
8320 ++
8321 ++extern void flush_smp_call_function_from_idle(void);
8322 ++
8323 ++#else /* !CONFIG_SMP */
8324 ++static inline void flush_smp_call_function_from_idle(void) { }
8325 ++#endif
8326 ++
8327 ++#ifndef arch_scale_freq_tick
8328 ++static __always_inline
8329 ++void arch_scale_freq_tick(void)
8330 ++{
8331 ++}
8332 ++#endif
8333 ++
8334 ++#ifndef arch_scale_freq_capacity
8335 ++static __always_inline
8336 ++unsigned long arch_scale_freq_capacity(int cpu)
8337 ++{
8338 ++ return SCHED_CAPACITY_SCALE;
8339 ++}
8340 ++#endif
8341 ++
8342 ++static inline u64 __rq_clock_broken(struct rq *rq)
8343 ++{
8344 ++ return READ_ONCE(rq->clock);
8345 ++}
8346 ++
8347 ++static inline u64 rq_clock(struct rq *rq)
8348 ++{
8349 ++ /*
8350 ++ * Relax lockdep_assert_held() checking as in VRQ; calls to
8351 ++ * sched_info_xxxx() may not hold rq->lock:
8352 ++ * lockdep_assert_held(&rq->lock);
8353 ++ */
8354 ++ return rq->clock;
8355 ++}
8356 ++
8357 ++static inline u64 rq_clock_task(struct rq *rq)
8358 ++{
8359 ++ /*
8360 ++ * Relax lockdep_assert_held() checking as in VRQ; calls to
8361 ++ * sched_info_xxxx() may not hold rq->lock:
8362 ++ * lockdep_assert_held(&rq->lock);
8363 ++ */
8364 ++ return rq->clock_task;
8365 ++}
8366 ++
8367 ++/*
8368 ++ * {de,en}queue flags:
8369 ++ *
8370 ++ * DEQUEUE_SLEEP - task is no longer runnable
8371 ++ * ENQUEUE_WAKEUP - task just became runnable
8372 ++ *
8373 ++ */
8374 ++
8375 ++#define DEQUEUE_SLEEP 0x01
8376 ++
8377 ++#define ENQUEUE_WAKEUP 0x01
8378 ++
8379 ++
8380 ++/*
8381 ++ * Below are scheduler APIs which are used in other kernel code.
8382 ++ * They use the dummy rq_flags.
8383 ++ * ToDo : BMQ needs to support these APIs for compatibility with mainline
8384 ++ * scheduler code.
8385 ++ */
8386 ++struct rq_flags {
8387 ++ unsigned long flags;
8388 ++};
8389 ++
8390 ++struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
8391 ++ __acquires(rq->lock);
8392 ++
8393 ++struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
8394 ++ __acquires(p->pi_lock)
8395 ++ __acquires(rq->lock);
8396 ++
8397 ++static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
8398 ++ __releases(rq->lock)
8399 ++{
8400 ++ raw_spin_unlock(&rq->lock);
8401 ++}
8402 ++
8403 ++static inline void
8404 ++task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
8405 ++ __releases(rq->lock)
8406 ++ __releases(p->pi_lock)
8407 ++{
8408 ++ raw_spin_unlock(&rq->lock);
8409 ++ raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
8410 ++}
8411 ++
8412 ++static inline void
8413 ++rq_lock(struct rq *rq, struct rq_flags *rf)
8414 ++ __acquires(rq->lock)
8415 ++{
8416 ++ raw_spin_lock(&rq->lock);
8417 ++}
8418 ++
8419 ++static inline void
8420 ++rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
8421 ++ __releases(rq->lock)
8422 ++{
8423 ++ raw_spin_unlock_irq(&rq->lock);
8424 ++}
8425 ++
8426 ++static inline void
8427 ++rq_unlock(struct rq *rq, struct rq_flags *rf)
8428 ++ __releases(rq->lock)
8429 ++{
8430 ++ raw_spin_unlock(&rq->lock);
8431 ++}
8432 ++
8433 ++static inline struct rq *
8434 ++this_rq_lock_irq(struct rq_flags *rf)
8435 ++ __acquires(rq->lock)
8436 ++{
8437 ++ struct rq *rq;
8438 ++
8439 ++ local_irq_disable();
8440 ++ rq = this_rq();
8441 ++ raw_spin_lock(&rq->lock);
8442 ++
8443 ++ return rq;
8444 ++}
8445 ++
8446 ++extern void raw_spin_rq_lock_nested(struct rq *rq, int subclass);
8447 ++extern void raw_spin_rq_unlock(struct rq *rq);
8448 ++
8449 ++static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
8450 ++{
8451 ++ return &rq->lock;
8452 ++}
8453 ++
8454 ++static inline raw_spinlock_t *rq_lockp(struct rq *rq)
8455 ++{
8456 ++ return __rq_lockp(rq);
8457 ++}
8458 ++
8459 ++static inline void raw_spin_rq_lock(struct rq *rq)
8460 ++{
8461 ++ raw_spin_rq_lock_nested(rq, 0);
8462 ++}
8463 ++
8464 ++static inline void raw_spin_rq_lock_irq(struct rq *rq)
8465 ++{
8466 ++ local_irq_disable();
8467 ++ raw_spin_rq_lock(rq);
8468 ++}
8469 ++
8470 ++static inline void raw_spin_rq_unlock_irq(struct rq *rq)
8471 ++{
8472 ++ raw_spin_rq_unlock(rq);
8473 ++ local_irq_enable();
8474 ++}
8475 ++
8476 ++static inline int task_current(struct rq *rq, struct task_struct *p)
8477 ++{
8478 ++ return rq->curr == p;
8479 ++}
8480 ++
8481 ++static inline bool task_running(struct task_struct *p)
8482 ++{
8483 ++ return p->on_cpu;
8484 ++}
8485 ++
8486 ++extern int task_running_nice(struct task_struct *p);
8487 ++
8488 ++extern struct static_key_false sched_schedstats;
8489 ++
8490 ++#ifdef CONFIG_CPU_IDLE
8491 ++static inline void idle_set_state(struct rq *rq,
8492 ++ struct cpuidle_state *idle_state)
8493 ++{
8494 ++ rq->idle_state = idle_state;
8495 ++}
8496 ++
8497 ++static inline struct cpuidle_state *idle_get_state(struct rq *rq)
8498 ++{
8499 ++ WARN_ON(!rcu_read_lock_held());
8500 ++ return rq->idle_state;
8501 ++}
8502 ++#else
8503 ++static inline void idle_set_state(struct rq *rq,
8504 ++ struct cpuidle_state *idle_state)
8505 ++{
8506 ++}
8507 ++
8508 ++static inline struct cpuidle_state *idle_get_state(struct rq *rq)
8509 ++{
8510 ++ return NULL;
8511 ++}
8512 ++#endif
8513 ++
8514 ++static inline int cpu_of(const struct rq *rq)
8515 ++{
8516 ++#ifdef CONFIG_SMP
8517 ++ return rq->cpu;
8518 ++#else
8519 ++ return 0;
8520 ++#endif
8521 ++}
8522 ++
8523 ++#include "stats.h"
8524 ++
8525 ++#ifdef CONFIG_NO_HZ_COMMON
8526 ++#define NOHZ_BALANCE_KICK_BIT 0
8527 ++#define NOHZ_STATS_KICK_BIT 1
8528 ++
8529 ++#define NOHZ_BALANCE_KICK BIT(NOHZ_BALANCE_KICK_BIT)
8530 ++#define NOHZ_STATS_KICK BIT(NOHZ_STATS_KICK_BIT)
8531 ++
8532 ++#define NOHZ_KICK_MASK (NOHZ_BALANCE_KICK | NOHZ_STATS_KICK)
8533 ++
8534 ++#define nohz_flags(cpu) (&cpu_rq(cpu)->nohz_flags)
8535 ++
8536 ++/* TODO: needed?
8537 ++extern void nohz_balance_exit_idle(struct rq *rq);
8538 ++#else
8539 ++static inline void nohz_balance_exit_idle(struct rq *rq) { }
8540 ++*/
8541 ++#endif
8542 ++
8543 ++#ifdef CONFIG_IRQ_TIME_ACCOUNTING
8544 ++struct irqtime {
8545 ++ u64 total;
8546 ++ u64 tick_delta;
8547 ++ u64 irq_start_time;
8548 ++ struct u64_stats_sync sync;
8549 ++};
8550 ++
8551 ++DECLARE_PER_CPU(struct irqtime, cpu_irqtime);
8552 ++
8553 ++/*
8554 ++ * Returns the irqtime minus the softirq time computed by ksoftirqd.
8555 ++ * Otherwise ksoftirqd's sum_exec_runtime would have its own runtime
8556 ++ * subtracted and would never move forward.
8557 ++ */
8558 ++static inline u64 irq_time_read(int cpu)
8559 ++{
8560 ++ struct irqtime *irqtime = &per_cpu(cpu_irqtime, cpu);
8561 ++ unsigned int seq;
8562 ++ u64 total;
8563 ++
8564 ++ do {
8565 ++ seq = __u64_stats_fetch_begin(&irqtime->sync);
8566 ++ total = irqtime->total;
8567 ++ } while (__u64_stats_fetch_retry(&irqtime->sync, seq));
8568 ++
8569 ++ return total;
8570 ++}
8571 ++#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
8572 ++
8573 ++#ifdef CONFIG_CPU_FREQ
8574 ++DECLARE_PER_CPU(struct update_util_data __rcu *, cpufreq_update_util_data);
8575 ++
8576 ++/**
8577 ++ * cpufreq_update_util - Take a note about CPU utilization changes.
8578 ++ * @rq: Runqueue to carry out the update for.
8579 ++ * @flags: Update reason flags.
8580 ++ *
8581 ++ * This function is called by the scheduler on the CPU whose utilization is
8582 ++ * being updated.
8583 ++ *
8584 ++ * It can only be called from RCU-sched read-side critical sections.
8585 ++ *
8586 ++ * The way cpufreq is currently arranged requires it to evaluate the CPU
8587 ++ * performance state (frequency/voltage) on a regular basis to prevent it from
8588 ++ * being stuck in a completely inadequate performance level for too long.
8589 ++ * That is not guaranteed to happen if the updates are only triggered from CFS
8590 ++ * and DL, though, because they may not be coming in if only RT tasks are
8591 ++ * active all the time (or there are RT tasks only).
8592 ++ *
8593 ++ * As a workaround for that issue, this function is called periodically by the
8594 ++ * RT sched class to trigger extra cpufreq updates to prevent it from stalling,
8595 ++ * but that really is a band-aid. Going forward it should be replaced with
8596 ++ * solutions targeted more specifically at RT tasks.
8597 ++ */
8598 ++static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
8599 ++{
8600 ++ struct update_util_data *data;
8601 ++
8602 ++ data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
8603 ++ cpu_of(rq)));
8604 ++ if (data)
8605 ++ data->func(data, rq_clock(rq), flags);
8606 ++}
8607 ++#else
8608 ++static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
8609 ++#endif /* CONFIG_CPU_FREQ */
8610 ++
8611 ++#ifdef CONFIG_NO_HZ_FULL
8612 ++extern int __init sched_tick_offload_init(void);
8613 ++#else
8614 ++static inline int sched_tick_offload_init(void) { return 0; }
8615 ++#endif
8616 ++
8617 ++#ifdef arch_scale_freq_capacity
8618 ++#ifndef arch_scale_freq_invariant
8619 ++#define arch_scale_freq_invariant() (true)
8620 ++#endif
8621 ++#else /* arch_scale_freq_capacity */
8622 ++#define arch_scale_freq_invariant() (false)
8623 ++#endif
8624 ++
8625 ++extern void schedule_idle(void);
8626 ++
8627 ++#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
8628 ++
8629 ++/*
8630 ++ * !! For sched_setattr_nocheck() (kernel) only !!
8631 ++ *
8632 ++ * This is actually gross. :(
8633 ++ *
8634 ++ * It is used to make schedutil kworker(s) higher priority than SCHED_DEADLINE
8635 ++ * tasks, but still be able to sleep. We need this on platforms that cannot
8636 ++ * atomically change clock frequency. Remove once fast switching will be
8637 ++ * available on such platforms.
8638 ++ *
8639 ++ * SUGOV stands for SchedUtil GOVernor.
8640 ++ */
8641 ++#define SCHED_FLAG_SUGOV 0x10000000
8642 ++
8643 ++#ifdef CONFIG_MEMBARRIER
8644 ++/*
8645 ++ * The scheduler provides memory barriers required by membarrier between:
8646 ++ * - prior user-space memory accesses and store to rq->membarrier_state,
8647 ++ * - store to rq->membarrier_state and following user-space memory accesses.
8648 ++ * In the same way it provides those guarantees around store to rq->curr.
8649 ++ */
8650 ++static inline void membarrier_switch_mm(struct rq *rq,
8651 ++ struct mm_struct *prev_mm,
8652 ++ struct mm_struct *next_mm)
8653 ++{
8654 ++ int membarrier_state;
8655 ++
8656 ++ if (prev_mm == next_mm)
8657 ++ return;
8658 ++
8659 ++ membarrier_state = atomic_read(&next_mm->membarrier_state);
8660 ++ if (READ_ONCE(rq->membarrier_state) == membarrier_state)
8661 ++ return;
8662 ++
8663 ++ WRITE_ONCE(rq->membarrier_state, membarrier_state);
8664 ++}
8665 ++#else
8666 ++static inline void membarrier_switch_mm(struct rq *rq,
8667 ++ struct mm_struct *prev_mm,
8668 ++ struct mm_struct *next_mm)
8669 ++{
8670 ++}
8671 ++#endif
8672 ++
8673 ++#ifdef CONFIG_NUMA
8674 ++extern int sched_numa_find_closest(const struct cpumask *cpus, int cpu);
8675 ++#else
8676 ++static inline int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
8677 ++{
8678 ++ return nr_cpu_ids;
8679 ++}
8680 ++#endif
8681 ++
8682 ++extern void swake_up_all_locked(struct swait_queue_head *q);
8683 ++extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
8684 ++
8685 ++#ifdef CONFIG_PREEMPT_DYNAMIC
8686 ++extern int preempt_dynamic_mode;
8687 ++extern int sched_dynamic_mode(const char *str);
8688 ++extern void sched_dynamic_update(int mode);
8689 ++#endif
8690 ++
8691 ++static inline void nohz_run_idle_balance(int cpu) { }
8692 ++#endif /* ALT_SCHED_H */
8693 +diff --git a/kernel/sched/bmq.h b/kernel/sched/bmq.h
8694 +new file mode 100644
8695 +index 000000000000..be3ee4a553ca
8696 +--- /dev/null
8697 ++++ b/kernel/sched/bmq.h
8698 +@@ -0,0 +1,111 @@
8699 ++#define ALT_SCHED_VERSION_MSG "sched/bmq: BMQ CPU Scheduler "ALT_SCHED_VERSION" by Alfred Chen.\n"
8700 ++
8701 ++/*
8702 ++ * BMQ only routines
8703 ++ */
8704 ++#define rq_switch_time(rq) ((rq)->clock - (rq)->last_ts_switch)
8705 ++#define boost_threshold(p) (sched_timeslice_ns >>\
8706 ++ (15 - MAX_PRIORITY_ADJ - (p)->boost_prio))
8707 ++
8708 ++static inline void boost_task(struct task_struct *p)
8709 ++{
8710 ++ int limit;
8711 ++
8712 ++ switch (p->policy) {
8713 ++ case SCHED_NORMAL:
8714 ++ limit = -MAX_PRIORITY_ADJ;
8715 ++ break;
8716 ++ case SCHED_BATCH:
8717 ++ case SCHED_IDLE:
8718 ++ limit = 0;
8719 ++ break;
8720 ++ default:
8721 ++ return;
8722 ++ }
8723 ++
8724 ++ if (p->boost_prio > limit)
8725 ++ p->boost_prio--;
8726 ++}
8727 ++
8728 ++static inline void deboost_task(struct task_struct *p)
8729 ++{
8730 ++ if (p->boost_prio < MAX_PRIORITY_ADJ)
8731 ++ p->boost_prio++;
8732 ++}
8733 ++
8734 ++/*
8735 ++ * Common interfaces
8736 ++ */
8737 ++static inline void sched_timeslice_imp(const int timeslice_ms) {}
8738 ++
8739 ++static inline int
8740 ++task_sched_prio_normal(const struct task_struct *p, const struct rq *rq)
8741 ++{
8742 ++ return p->prio + p->boost_prio - MAX_RT_PRIO;
8743 ++}
8744 ++
8745 ++static inline int task_sched_prio(const struct task_struct *p)
8746 ++{
8747 ++ return (p->prio < MAX_RT_PRIO) ? p->prio : MAX_RT_PRIO / 2 + (p->prio + p->boost_prio) / 2;
8748 ++}
8749 ++
8750 ++static inline int
8751 ++task_sched_prio_idx(const struct task_struct *p, const struct rq *rq)
8752 ++{
8753 ++ return task_sched_prio(p);
8754 ++}
8755 ++
8756 ++static inline int sched_prio2idx(int prio, struct rq *rq)
8757 ++{
8758 ++ return prio;
8759 ++}
8760 ++
8761 ++static inline int sched_idx2prio(int idx, struct rq *rq)
8762 ++{
8763 ++ return idx;
8764 ++}
8765 ++
8766 ++static inline void time_slice_expired(struct task_struct *p, struct rq *rq)
8767 ++{
8768 ++ p->time_slice = sched_timeslice_ns;
8769 ++
8770 ++ if (SCHED_FIFO != p->policy && task_on_rq_queued(p)) {
8771 ++ if (SCHED_RR != p->policy)
8772 ++ deboost_task(p);
8773 ++ requeue_task(p, rq);
8774 ++ }
8775 ++}
8776 ++
8777 ++static inline void sched_task_sanity_check(struct task_struct *p, struct rq *rq) {}
8778 ++
8779 ++inline int task_running_nice(struct task_struct *p)
8780 ++{
8781 ++ return (p->prio + p->boost_prio > DEFAULT_PRIO + MAX_PRIORITY_ADJ);
8782 ++}
8783 ++
8784 ++static void sched_task_fork(struct task_struct *p, struct rq *rq)
8785 ++{
8786 ++ p->boost_prio = (p->boost_prio < 0) ?
8787 ++ p->boost_prio + MAX_PRIORITY_ADJ : MAX_PRIORITY_ADJ;
8788 ++}
8789 ++
8790 ++static inline void do_sched_yield_type_1(struct task_struct *p, struct rq *rq)
8791 ++{
8792 ++ p->boost_prio = MAX_PRIORITY_ADJ;
8793 ++}
8794 ++
8795 ++#ifdef CONFIG_SMP
8796 ++static inline void sched_task_ttwu(struct task_struct *p)
8797 ++{
8798 ++ if(this_rq()->clock_task - p->last_ran > sched_timeslice_ns)
8799 ++ boost_task(p);
8800 ++}
8801 ++#endif
8802 ++
8803 ++static inline void sched_task_deactivate(struct task_struct *p, struct rq *rq)
8804 ++{
8805 ++ if (rq_switch_time(rq) < boost_threshold(p))
8806 ++ boost_task(p);
8807 ++}
8808 ++
8809 ++static inline void update_rq_time_edge(struct rq *rq) {}
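
[Editorial note, not part of the patch] The index math in bmq.h above can be exercised in isolation. The standalone userspace sketch below mirrors task_sched_prio(): realtime tasks keep their priority, while normal tasks are folded into the upper half of the bitmap queue together with their boost_prio offset. MAX_RT_PRIO is the mainline value; MAX_PRIORITY_ADJ here is a placeholder chosen only for illustration.

/*
 * Standalone sketch of the BMQ priority-to-queue-index mapping.
 * MAX_PRIORITY_ADJ below is a placeholder, not taken from the patch.
 */
#include <stdio.h>

#define MAX_RT_PRIO       100   /* mainline value */
#define MAX_PRIORITY_ADJ    4   /* placeholder boost range, illustration only */

/* Mirrors task_sched_prio() above. */
static int task_sched_prio(int prio, int boost_prio)
{
        return (prio < MAX_RT_PRIO) ? prio
                                    : MAX_RT_PRIO / 2 + (prio + boost_prio) / 2;
}

int main(void)
{
        /* A nice-0 task (prio 120) at every boost level. */
        for (int boost = -MAX_PRIORITY_ADJ; boost <= MAX_PRIORITY_ADJ; boost++)
                printf("prio=120 boost=%+d -> queue index %d\n",
                       boost, task_sched_prio(120, boost));
        return 0;
}

Because boost_task() and deboost_task() move boost_prio one step at a time, tasks that sleep often drift toward lower indices and get picked sooner, while tasks that burn whole time slices drift the other way.
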
8810 +diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
8811 +index 57124614363d..4057e51cef45 100644
8812 +--- a/kernel/sched/cpufreq_schedutil.c
8813 ++++ b/kernel/sched/cpufreq_schedutil.c
8814 +@@ -57,6 +57,13 @@ struct sugov_cpu {
8815 + unsigned long bw_dl;
8816 + unsigned long max;
8817 +
8818 ++#ifdef CONFIG_SCHED_ALT
8819 ++ /* For general cpu load util */
8820 ++ s32 load_history;
8821 ++ u64 load_block;
8822 ++ u64 load_stamp;
8823 ++#endif
8824 ++
8825 + /* The field below is for single-CPU policies only: */
8826 + #ifdef CONFIG_NO_HZ_COMMON
8827 + unsigned long saved_idle_calls;
8828 +@@ -161,6 +168,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
8829 + return cpufreq_driver_resolve_freq(policy, freq);
8830 + }
8831 +
8832 ++#ifndef CONFIG_SCHED_ALT
8833 + static void sugov_get_util(struct sugov_cpu *sg_cpu)
8834 + {
8835 + struct rq *rq = cpu_rq(sg_cpu->cpu);
8836 +@@ -172,6 +180,55 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
8837 + FREQUENCY_UTIL, NULL);
8838 + }
8839 +
8840 ++#else /* CONFIG_SCHED_ALT */
8841 ++
8842 ++#define SG_CPU_LOAD_HISTORY_BITS (sizeof(s32) * 8ULL)
8843 ++#define SG_CPU_UTIL_SHIFT (8)
8844 ++#define SG_CPU_LOAD_HISTORY_SHIFT (SG_CPU_LOAD_HISTORY_BITS - 1 - SG_CPU_UTIL_SHIFT)
8845 ++#define SG_CPU_LOAD_HISTORY_TO_UTIL(l) (((l) >> SG_CPU_LOAD_HISTORY_SHIFT) & 0xff)
8846 ++
8847 ++#define LOAD_BLOCK(t) ((t) >> 17)
8848 ++#define LOAD_HALF_BLOCK(t) ((t) >> 16)
8849 ++#define BLOCK_MASK(t) ((t) & ((0x01 << 18) - 1))
8850 ++#define LOAD_BLOCK_BIT(b) (1UL << (SG_CPU_LOAD_HISTORY_BITS - 1 - (b)))
8851 ++#define CURRENT_LOAD_BIT LOAD_BLOCK_BIT(0)
8852 ++
8853 ++static void sugov_get_util(struct sugov_cpu *sg_cpu)
8854 ++{
8855 ++ unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
8856 ++
8857 ++ sg_cpu->max = max;
8858 ++ sg_cpu->bw_dl = 0;
8859 ++ sg_cpu->util = SG_CPU_LOAD_HISTORY_TO_UTIL(sg_cpu->load_history) *
8860 ++ (max >> SG_CPU_UTIL_SHIFT);
8861 ++}
8862 ++
8863 ++static inline void sugov_cpu_load_update(struct sugov_cpu *sg_cpu, u64 time)
8864 ++{
8865 ++ u64 delta = min(LOAD_BLOCK(time) - LOAD_BLOCK(sg_cpu->load_stamp),
8866 ++ SG_CPU_LOAD_HISTORY_BITS - 1);
8867 ++ u64 prev = !!(sg_cpu->load_history & CURRENT_LOAD_BIT);
8868 ++ u64 curr = !!cpu_rq(sg_cpu->cpu)->nr_running;
8869 ++
8870 ++ if (delta) {
8871 ++ sg_cpu->load_history = sg_cpu->load_history >> delta;
8872 ++
8873 ++ if (delta <= SG_CPU_UTIL_SHIFT) {
8874 ++ sg_cpu->load_block += (~BLOCK_MASK(sg_cpu->load_stamp)) * prev;
8875 ++ if (!!LOAD_HALF_BLOCK(sg_cpu->load_block) ^ curr)
8876 ++ sg_cpu->load_history ^= LOAD_BLOCK_BIT(delta);
8877 ++ }
8878 ++
8879 ++ sg_cpu->load_block = BLOCK_MASK(time) * prev;
8880 ++ } else {
8881 ++ sg_cpu->load_block += (time - sg_cpu->load_stamp) * prev;
8882 ++ }
8883 ++ if (prev ^ curr)
8884 ++ sg_cpu->load_history ^= CURRENT_LOAD_BIT;
8885 ++ sg_cpu->load_stamp = time;
8886 ++}
8887 ++#endif /* CONFIG_SCHED_ALT */
8888 ++
8889 + /**
8890 + * sugov_iowait_reset() - Reset the IO boost status of a CPU.
8891 + * @sg_cpu: the sugov data for the CPU to boost
8892 +@@ -312,13 +369,19 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
8893 + */
8894 + static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu)
8895 + {
8896 ++#ifndef CONFIG_SCHED_ALT
8897 + if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
8898 + sg_cpu->sg_policy->limits_changed = true;
8899 ++#endif
8900 + }
8901 +
8902 + static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
8903 + u64 time, unsigned int flags)
8904 + {
8905 ++#ifdef CONFIG_SCHED_ALT
8906 ++ sugov_cpu_load_update(sg_cpu, time);
8907 ++#endif /* CONFIG_SCHED_ALT */
8908 ++
8909 + sugov_iowait_boost(sg_cpu, time, flags);
8910 + sg_cpu->last_update = time;
8911 +
8912 +@@ -439,6 +502,10 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
8913 +
8914 + raw_spin_lock(&sg_policy->update_lock);
8915 +
8916 ++#ifdef CONFIG_SCHED_ALT
8917 ++ sugov_cpu_load_update(sg_cpu, time);
8918 ++#endif /* CONFIG_SCHED_ALT */
8919 ++
8920 + sugov_iowait_boost(sg_cpu, time, flags);
8921 + sg_cpu->last_update = time;
8922 +
8923 +@@ -599,6 +666,7 @@ static int sugov_kthread_create(struct sugov_policy *sg_policy)
8924 + }
8925 +
8926 + ret = sched_setattr_nocheck(thread, &attr);
8927 ++
8928 + if (ret) {
8929 + kthread_stop(thread);
8930 + pr_warn("%s: failed to set SCHED_DEADLINE\n", __func__);
8931 +@@ -833,7 +901,9 @@ cpufreq_governor_init(schedutil_gov);
8932 + #ifdef CONFIG_ENERGY_MODEL
8933 + static void rebuild_sd_workfn(struct work_struct *work)
8934 + {
8935 ++#ifndef CONFIG_SCHED_ALT
8936 + rebuild_sched_domains_energy();
8937 ++#endif /* CONFIG_SCHED_ALT */
8938 + }
8939 + static DECLARE_WORK(rebuild_sd_work, rebuild_sd_workfn);
8940 +
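
[Editorial note, not part of the patch] The CONFIG_SCHED_ALT path above replaces PELT-driven utilization with a 32-bit busy/idle history: each bit covers one LOAD_BLOCK (time >> 17, roughly 131 us with the nanosecond runqueue clock), and sugov_get_util() turns the top eight bits of that word into a 0..255 factor applied to CPU capacity. A standalone sketch of that last step; the capacity value is an assumption (1024 is the usual SCHED_CAPACITY_SCALE):

#include <stdio.h>
#include <stdint.h>

#define SG_CPU_LOAD_HISTORY_BITS  32    /* 32-bit history word, as in the patch */
#define SG_CPU_UTIL_SHIFT          8
#define SG_CPU_LOAD_HISTORY_SHIFT (SG_CPU_LOAD_HISTORY_BITS - 1 - SG_CPU_UTIL_SHIFT)
#define SG_CPU_LOAD_HISTORY_TO_UTIL(l) (((l) >> SG_CPU_LOAD_HISTORY_SHIFT) & 0xff)

int main(void)
{
        /* Top ten history bits set: the CPU was busy in the ten newest blocks. */
        uint32_t history = 0xffc00000u;
        unsigned long max = 1024;       /* assumed arch_scale_cpu_capacity() */
        unsigned long util;

        util = SG_CPU_LOAD_HISTORY_TO_UTIL(history) * (max >> SG_CPU_UTIL_SHIFT);
        printf("history=0x%08x -> util=%lu of %lu\n", (unsigned)history, util, max);
        return 0;
}

sugov_cpu_load_update() maintains the word by, roughly, shifting it right once per elapsed block and recording whether the runqueue was busy, so no per-entity load tracking is needed under SCHED_ALT.
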
8941 +diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
8942 +index 872e481d5098..f920c8b48ec1 100644
8943 +--- a/kernel/sched/cputime.c
8944 ++++ b/kernel/sched/cputime.c
8945 +@@ -123,7 +123,7 @@ void account_user_time(struct task_struct *p, u64 cputime)
8946 + p->utime += cputime;
8947 + account_group_user_time(p, cputime);
8948 +
8949 +- index = (task_nice(p) > 0) ? CPUTIME_NICE : CPUTIME_USER;
8950 ++ index = task_running_nice(p) ? CPUTIME_NICE : CPUTIME_USER;
8951 +
8952 + /* Add user time to cpustat. */
8953 + task_group_account_field(p, index, cputime);
8954 +@@ -147,7 +147,7 @@ void account_guest_time(struct task_struct *p, u64 cputime)
8955 + p->gtime += cputime;
8956 +
8957 + /* Add guest time to cpustat. */
8958 +- if (task_nice(p) > 0) {
8959 ++ if (task_running_nice(p)) {
8960 + cpustat[CPUTIME_NICE] += cputime;
8961 + cpustat[CPUTIME_GUEST_NICE] += cputime;
8962 + } else {
8963 +@@ -270,7 +270,7 @@ static inline u64 account_other_time(u64 max)
8964 + #ifdef CONFIG_64BIT
8965 + static inline u64 read_sum_exec_runtime(struct task_struct *t)
8966 + {
8967 +- return t->se.sum_exec_runtime;
8968 ++ return tsk_seruntime(t);
8969 + }
8970 + #else
8971 + static u64 read_sum_exec_runtime(struct task_struct *t)
8972 +@@ -280,7 +280,7 @@ static u64 read_sum_exec_runtime(struct task_struct *t)
8973 + struct rq *rq;
8974 +
8975 + rq = task_rq_lock(t, &rf);
8976 +- ns = t->se.sum_exec_runtime;
8977 ++ ns = tsk_seruntime(t);
8978 + task_rq_unlock(rq, t, &rf);
8979 +
8980 + return ns;
8981 +@@ -612,7 +612,7 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
8982 + void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
8983 + {
8984 + struct task_cputime cputime = {
8985 +- .sum_exec_runtime = p->se.sum_exec_runtime,
8986 ++ .sum_exec_runtime = tsk_seruntime(p),
8987 + };
8988 +
8989 + task_cputime(p, &cputime.utime, &cputime.stime);
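
[Editorial note, not part of the patch] The account_user_time()/account_guest_time() hunks above change how time is binned: mainline uses the nice value alone, while with CONFIG_SCHED_ALT the task_running_nice() helpers defined earlier decide between CPUTIME_USER and CPUTIME_NICE. A small standalone comparison for the BMQ variant; MAX_PRIORITY_ADJ is again a placeholder:

#include <stdio.h>

#define DEFAULT_PRIO       120  /* mainline: nice 0 */
#define MAX_PRIORITY_ADJ     4  /* placeholder, illustration only */

static int mainline_nice_bucket(int nice)
{
        return nice > 0;                        /* task_nice(p) > 0 */
}

static int bmq_nice_bucket(int prio, int boost)
{
        return prio + boost > DEFAULT_PRIO + MAX_PRIORITY_ADJ;
}

int main(void)
{
        /* A nice +1 task (prio 121). */
        printf("mainline:             %s\n",
               mainline_nice_bucket(1) ? "CPUTIME_NICE" : "CPUTIME_USER");
        printf("BMQ, fully boosted:   %s\n",
               bmq_nice_bucket(121, -MAX_PRIORITY_ADJ) ? "CPUTIME_NICE" : "CPUTIME_USER");
        printf("BMQ, fully deboosted: %s\n",
               bmq_nice_bucket(121, +MAX_PRIORITY_ADJ) ? "CPUTIME_NICE" : "CPUTIME_USER");
        return 0;
}

With these placeholder numbers a nice +1 task only lands in the nice bucket once fully deboosted; the exact crossover depends on the real MAX_PRIORITY_ADJ.
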
8990 +diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
8991 +index 0c5ec2776ddf..e3f4fe3f6e2c 100644
8992 +--- a/kernel/sched/debug.c
8993 ++++ b/kernel/sched/debug.c
8994 +@@ -8,6 +8,7 @@
8995 + */
8996 + #include "sched.h"
8997 +
8998 ++#ifndef CONFIG_SCHED_ALT
8999 + /*
9000 + * This allows printing both to /proc/sched_debug and
9001 + * to the console
9002 +@@ -210,6 +211,7 @@ static const struct file_operations sched_scaling_fops = {
9003 + };
9004 +
9005 + #endif /* SMP */
9006 ++#endif /* !CONFIG_SCHED_ALT */
9007 +
9008 + #ifdef CONFIG_PREEMPT_DYNAMIC
9009 +
9010 +@@ -273,6 +275,7 @@ static const struct file_operations sched_dynamic_fops = {
9011 +
9012 + #endif /* CONFIG_PREEMPT_DYNAMIC */
9013 +
9014 ++#ifndef CONFIG_SCHED_ALT
9015 + __read_mostly bool sched_debug_verbose;
9016 +
9017 + static const struct seq_operations sched_debug_sops;
9018 +@@ -288,6 +291,7 @@ static const struct file_operations sched_debug_fops = {
9019 + .llseek = seq_lseek,
9020 + .release = seq_release,
9021 + };
9022 ++#endif /* !CONFIG_SCHED_ALT */
9023 +
9024 + static struct dentry *debugfs_sched;
9025 +
9026 +@@ -297,12 +301,15 @@ static __init int sched_init_debug(void)
9027 +
9028 + debugfs_sched = debugfs_create_dir("sched", NULL);
9029 +
9030 ++#ifndef CONFIG_SCHED_ALT
9031 + debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
9032 + debugfs_create_bool("verbose", 0644, debugfs_sched, &sched_debug_verbose);
9033 ++#endif /* !CONFIG_SCHED_ALT */
9034 + #ifdef CONFIG_PREEMPT_DYNAMIC
9035 + debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
9036 + #endif
9037 +
9038 ++#ifndef CONFIG_SCHED_ALT
9039 + debugfs_create_u32("latency_ns", 0644, debugfs_sched, &sysctl_sched_latency);
9040 + debugfs_create_u32("min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_min_granularity);
9041 + debugfs_create_u32("wakeup_granularity_ns", 0644, debugfs_sched, &sysctl_sched_wakeup_granularity);
9042 +@@ -330,11 +337,13 @@ static __init int sched_init_debug(void)
9043 + #endif
9044 +
9045 + debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
9046 ++#endif /* !CONFIG_SCHED_ALT */
9047 +
9048 + return 0;
9049 + }
9050 + late_initcall(sched_init_debug);
9051 +
9052 ++#ifndef CONFIG_SCHED_ALT
9053 + #ifdef CONFIG_SMP
9054 +
9055 + static cpumask_var_t sd_sysctl_cpus;
9056 +@@ -1047,6 +1056,7 @@ void proc_sched_set_task(struct task_struct *p)
9057 + memset(&p->se.statistics, 0, sizeof(p->se.statistics));
9058 + #endif
9059 + }
9060 ++#endif /* !CONFIG_SCHED_ALT */
9061 +
9062 + void resched_latency_warn(int cpu, u64 latency)
9063 + {
9064 +diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
9065 +index 912b47aa99d8..7f6b13883c2a 100644
9066 +--- a/kernel/sched/idle.c
9067 ++++ b/kernel/sched/idle.c
9068 +@@ -403,6 +403,7 @@ void cpu_startup_entry(enum cpuhp_state state)
9069 + do_idle();
9070 + }
9071 +
9072 ++#ifndef CONFIG_SCHED_ALT
9073 + /*
9074 + * idle-task scheduling class.
9075 + */
9076 +@@ -525,3 +526,4 @@ DEFINE_SCHED_CLASS(idle) = {
9077 + .switched_to = switched_to_idle,
9078 + .update_curr = update_curr_idle,
9079 + };
9080 ++#endif
9081 +diff --git a/kernel/sched/pds.h b/kernel/sched/pds.h
9082 +new file mode 100644
9083 +index 000000000000..0f1f0d708b77
9084 +--- /dev/null
9085 ++++ b/kernel/sched/pds.h
9086 +@@ -0,0 +1,127 @@
9087 ++#define ALT_SCHED_VERSION_MSG "sched/pds: PDS CPU Scheduler "ALT_SCHED_VERSION" by Alfred Chen.\n"
9088 ++
9089 ++static int sched_timeslice_shift = 22;
9090 ++
9091 ++#define NORMAL_PRIO_MOD(x) ((x) & (NORMAL_PRIO_NUM - 1))
9092 ++
9093 ++/*
9094 ++ * Common interfaces
9095 ++ */
9096 ++static inline void sched_timeslice_imp(const int timeslice_ms)
9097 ++{
9098 ++ if (2 == timeslice_ms)
9099 ++ sched_timeslice_shift = 21;
9100 ++}
9101 ++
9102 ++static inline int
9103 ++task_sched_prio_normal(const struct task_struct *p, const struct rq *rq)
9104 ++{
9105 ++ s64 delta = p->deadline - rq->time_edge + NORMAL_PRIO_NUM - NICE_WIDTH;
9106 ++
9107 ++ if (WARN_ONCE(delta > NORMAL_PRIO_NUM - 1,
9108 ++ "pds: task_sched_prio_normal() delta %lld\n", delta))
9109 ++ return NORMAL_PRIO_NUM - 1;
9110 ++
9111 ++ return (delta < 0) ? 0 : delta;
9112 ++}
9113 ++
9114 ++static inline int task_sched_prio(const struct task_struct *p)
9115 ++{
9116 ++ return (p->prio < MAX_RT_PRIO) ? p->prio :
9117 ++ MIN_NORMAL_PRIO + task_sched_prio_normal(p, task_rq(p));
9118 ++}
9119 ++
9120 ++static inline int
9121 ++task_sched_prio_idx(const struct task_struct *p, const struct rq *rq)
9122 ++{
9123 ++ return (p->prio < MAX_RT_PRIO) ? p->prio : MIN_NORMAL_PRIO +
9124 ++ NORMAL_PRIO_MOD(task_sched_prio_normal(p, rq) + rq->time_edge);
9125 ++}
9126 ++
9127 ++static inline int sched_prio2idx(int prio, struct rq *rq)
9128 ++{
9129 ++ return (IDLE_TASK_SCHED_PRIO == prio || prio < MAX_RT_PRIO) ? prio :
9130 ++ MIN_NORMAL_PRIO + NORMAL_PRIO_MOD((prio - MIN_NORMAL_PRIO) +
9131 ++ rq->time_edge);
9132 ++}
9133 ++
9134 ++static inline int sched_idx2prio(int idx, struct rq *rq)
9135 ++{
9136 ++ return (idx < MAX_RT_PRIO) ? idx : MIN_NORMAL_PRIO +
9137 ++ NORMAL_PRIO_MOD((idx - MIN_NORMAL_PRIO) + NORMAL_PRIO_NUM -
9138 ++ NORMAL_PRIO_MOD(rq->time_edge));
9139 ++}
9140 ++
9141 ++static inline void sched_renew_deadline(struct task_struct *p, const struct rq *rq)
9142 ++{
9143 ++ if (p->prio >= MAX_RT_PRIO)
9144 ++ p->deadline = (rq->clock >> sched_timeslice_shift) +
9145 ++ p->static_prio - (MAX_PRIO - NICE_WIDTH);
9146 ++}
9147 ++
9148 ++int task_running_nice(struct task_struct *p)
9149 ++{
9150 ++ return (p->prio > DEFAULT_PRIO);
9151 ++}
9152 ++
9153 ++static inline void update_rq_time_edge(struct rq *rq)
9154 ++{
9155 ++ struct list_head head;
9156 ++ u64 old = rq->time_edge;
9157 ++ u64 now = rq->clock >> sched_timeslice_shift;
9158 ++ u64 prio, delta;
9159 ++
9160 ++ if (now == old)
9161 ++ return;
9162 ++
9163 ++ delta = min_t(u64, NORMAL_PRIO_NUM, now - old);
9164 ++ INIT_LIST_HEAD(&head);
9165 ++
9166 ++ for_each_set_bit(prio, &rq->queue.bitmap[2], delta)
9167 ++ list_splice_tail_init(rq->queue.heads + MIN_NORMAL_PRIO +
9168 ++ NORMAL_PRIO_MOD(prio + old), &head);
9169 ++
9170 ++ rq->queue.bitmap[2] = (NORMAL_PRIO_NUM == delta) ? 0UL :
9171 ++ rq->queue.bitmap[2] >> delta;
9172 ++ rq->time_edge = now;
9173 ++ if (!list_empty(&head)) {
9174 ++ u64 idx = MIN_NORMAL_PRIO + NORMAL_PRIO_MOD(now);
9175 ++ struct task_struct *p;
9176 ++
9177 ++ list_for_each_entry(p, &head, sq_node)
9178 ++ p->sq_idx = idx;
9179 ++
9180 ++ list_splice(&head, rq->queue.heads + idx);
9181 ++ rq->queue.bitmap[2] |= 1UL;
9182 ++ }
9183 ++}
9184 ++
9185 ++static inline void time_slice_expired(struct task_struct *p, struct rq *rq)
9186 ++{
9187 ++ p->time_slice = sched_timeslice_ns;
9188 ++ sched_renew_deadline(p, rq);
9189 ++ if (SCHED_FIFO != p->policy && task_on_rq_queued(p))
9190 ++ requeue_task(p, rq);
9191 ++}
9192 ++
9193 ++static inline void sched_task_sanity_check(struct task_struct *p, struct rq *rq)
9194 ++{
9195 ++ u64 max_dl = rq->time_edge + NICE_WIDTH - 1;
9196 ++ if (unlikely(p->deadline > max_dl))
9197 ++ p->deadline = max_dl;
9198 ++}
9199 ++
9200 ++static void sched_task_fork(struct task_struct *p, struct rq *rq)
9201 ++{
9202 ++ sched_renew_deadline(p, rq);
9203 ++}
9204 ++
9205 ++static inline void do_sched_yield_type_1(struct task_struct *p, struct rq *rq)
9206 ++{
9207 ++ time_slice_expired(p, rq);
9208 ++}
9209 ++
9210 ++#ifdef CONFIG_SMP
9211 ++static inline void sched_task_ttwu(struct task_struct *p) {}
9212 ++#endif
9213 ++static inline void sched_task_deactivate(struct task_struct *p, struct rq *rq) {}
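
[Editorial note, not part of the patch] PDS replaces BMQ's boost with a virtual deadline: sched_renew_deadline() above stamps each normal task with (rq clock >> sched_timeslice_shift) plus a nice-derived offset, and task_sched_prio_normal() turns the distance from rq->time_edge into a queue slot. A standalone sketch of that arithmetic; MAX_PRIO and NICE_WIDTH are the mainline constants, NORMAL_PRIO_NUM is a placeholder value:

#include <stdio.h>
#include <stdint.h>

#define MAX_PRIO        140     /* mainline */
#define NICE_WIDTH       40     /* mainline */
#define NORMAL_PRIO_NUM  64     /* placeholder, illustration only */

static int sched_timeslice_shift = 22; /* 2^22 ns ~= 4.2 ms, see sched_timeslice_imp() */

static uint64_t renew_deadline(uint64_t clock_ns, int static_prio)
{
        return (clock_ns >> sched_timeslice_shift) + static_prio - (MAX_PRIO - NICE_WIDTH);
}

static int prio_normal(uint64_t deadline, uint64_t time_edge)
{
        int64_t delta = (int64_t)deadline - (int64_t)time_edge
                        + NORMAL_PRIO_NUM - NICE_WIDTH;

        if (delta < 0)
                return 0;
        if (delta > NORMAL_PRIO_NUM - 1)
                return NORMAL_PRIO_NUM - 1;     /* the WARN_ONCE() case above */
        return (int)delta;
}

int main(void)
{
        uint64_t clock = 1ULL << 30;                     /* ~1.07 s of rq clock */
        uint64_t edge  = clock >> sched_timeslice_shift; /* rq->time_edge */

        printf("nice   0 (static_prio 120): slot %d\n",
               prio_normal(renew_deadline(clock, 120), edge));
        printf("nice +19 (static_prio 139): slot %d\n",
               prio_normal(renew_deadline(clock, 139), edge));
        return 0;
}

update_rq_time_edge() then rotates the per-slot lists as time_edge advances, so a task's slot number shrinks on its own as its deadline approaches.
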
9214 +diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
9215 +index a554e3bbab2b..3e56f5e6ff5c 100644
9216 +--- a/kernel/sched/pelt.c
9217 ++++ b/kernel/sched/pelt.c
9218 +@@ -270,6 +270,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
9219 + WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
9220 + }
9221 +
9222 ++#ifndef CONFIG_SCHED_ALT
9223 + /*
9224 + * sched_entity:
9225 + *
9226 +@@ -387,8 +388,9 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
9227 +
9228 + return 0;
9229 + }
9230 ++#endif
9231 +
9232 +-#ifdef CONFIG_SCHED_THERMAL_PRESSURE
9233 ++#if defined(CONFIG_SCHED_THERMAL_PRESSURE) && !defined(CONFIG_SCHED_ALT)
9234 + /*
9235 + * thermal:
9236 + *
9237 +diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
9238 +index e06071bf3472..adf567df34d4 100644
9239 +--- a/kernel/sched/pelt.h
9240 ++++ b/kernel/sched/pelt.h
9241 +@@ -1,13 +1,15 @@
9242 + #ifdef CONFIG_SMP
9243 + #include "sched-pelt.h"
9244 +
9245 ++#ifndef CONFIG_SCHED_ALT
9246 + int __update_load_avg_blocked_se(u64 now, struct sched_entity *se);
9247 + int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se);
9248 + int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
9249 + int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
9250 + int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
9251 ++#endif
9252 +
9253 +-#ifdef CONFIG_SCHED_THERMAL_PRESSURE
9254 ++#if defined(CONFIG_SCHED_THERMAL_PRESSURE) && !defined(CONFIG_SCHED_ALT)
9255 + int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity);
9256 +
9257 + static inline u64 thermal_load_avg(struct rq *rq)
9258 +@@ -42,6 +44,7 @@ static inline u32 get_pelt_divider(struct sched_avg *avg)
9259 + return LOAD_AVG_MAX - 1024 + avg->period_contrib;
9260 + }
9261 +
9262 ++#ifndef CONFIG_SCHED_ALT
9263 + static inline void cfs_se_util_change(struct sched_avg *avg)
9264 + {
9265 + unsigned int enqueued;
9266 +@@ -153,9 +156,11 @@ static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
9267 + return rq_clock_pelt(rq_of(cfs_rq));
9268 + }
9269 + #endif
9270 ++#endif /* CONFIG_SCHED_ALT */
9271 +
9272 + #else
9273 +
9274 ++#ifndef CONFIG_SCHED_ALT
9275 + static inline int
9276 + update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
9277 + {
9278 +@@ -173,6 +178,7 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
9279 + {
9280 + return 0;
9281 + }
9282 ++#endif
9283 +
9284 + static inline int
9285 + update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
9286 +diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
9287 +index ddefb0419d7a..658c41b15d3c 100644
9288 +--- a/kernel/sched/sched.h
9289 ++++ b/kernel/sched/sched.h
9290 +@@ -2,6 +2,10 @@
9291 + /*
9292 + * Scheduler internal types and methods:
9293 + */
9294 ++#ifdef CONFIG_SCHED_ALT
9295 ++#include "alt_sched.h"
9296 ++#else
9297 ++
9298 + #include <linux/sched.h>
9299 +
9300 + #include <linux/sched/autogroup.h>
9301 +@@ -3038,3 +3042,8 @@ extern int sched_dynamic_mode(const char *str);
9302 + extern void sched_dynamic_update(int mode);
9303 + #endif
9304 +
9305 ++static inline int task_running_nice(struct task_struct *p)
9306 ++{
9307 ++ return (task_nice(p) > 0);
9308 ++}
9309 ++#endif /* !CONFIG_SCHED_ALT */
9310 +diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
9311 +index 3f93fc3b5648..528b71e144e9 100644
9312 +--- a/kernel/sched/stats.c
9313 ++++ b/kernel/sched/stats.c
9314 +@@ -22,8 +22,10 @@ static int show_schedstat(struct seq_file *seq, void *v)
9315 + } else {
9316 + struct rq *rq;
9317 + #ifdef CONFIG_SMP
9318 ++#ifndef CONFIG_SCHED_ALT
9319 + struct sched_domain *sd;
9320 + int dcount = 0;
9321 ++#endif
9322 + #endif
9323 + cpu = (unsigned long)(v - 2);
9324 + rq = cpu_rq(cpu);
9325 +@@ -40,6 +42,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
9326 + seq_printf(seq, "\n");
9327 +
9328 + #ifdef CONFIG_SMP
9329 ++#ifndef CONFIG_SCHED_ALT
9330 + /* domain-specific stats */
9331 + rcu_read_lock();
9332 + for_each_domain(cpu, sd) {
9333 +@@ -68,6 +71,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
9334 + sd->ttwu_move_balance);
9335 + }
9336 + rcu_read_unlock();
9337 ++#endif
9338 + #endif
9339 + }
9340 + return 0;
9341 +diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
9342 +index b77ad49dc14f..be9edf086412 100644
9343 +--- a/kernel/sched/topology.c
9344 ++++ b/kernel/sched/topology.c
9345 +@@ -4,6 +4,7 @@
9346 + */
9347 + #include "sched.h"
9348 +
9349 ++#ifndef CONFIG_SCHED_ALT
9350 + DEFINE_MUTEX(sched_domains_mutex);
9351 +
9352 + /* Protected by sched_domains_mutex: */
9353 +@@ -1382,8 +1383,10 @@ static void asym_cpu_capacity_scan(void)
9354 + */
9355 +
9356 + static int default_relax_domain_level = -1;
9357 ++#endif /* CONFIG_SCHED_ALT */
9358 + int sched_domain_level_max;
9359 +
9360 ++#ifndef CONFIG_SCHED_ALT
9361 + static int __init setup_relax_domain_level(char *str)
9362 + {
9363 + if (kstrtoint(str, 0, &default_relax_domain_level))
9364 +@@ -1617,6 +1620,7 @@ sd_init(struct sched_domain_topology_level *tl,
9365 +
9366 + return sd;
9367 + }
9368 ++#endif /* CONFIG_SCHED_ALT */
9369 +
9370 + /*
9371 + * Topology list, bottom-up.
9372 +@@ -1646,6 +1650,7 @@ void set_sched_topology(struct sched_domain_topology_level *tl)
9373 + sched_domain_topology = tl;
9374 + }
9375 +
9376 ++#ifndef CONFIG_SCHED_ALT
9377 + #ifdef CONFIG_NUMA
9378 +
9379 + static const struct cpumask *sd_numa_mask(int cpu)
9380 +@@ -2451,3 +2456,17 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
9381 + partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
9382 + mutex_unlock(&sched_domains_mutex);
9383 + }
9384 ++#else /* CONFIG_SCHED_ALT */
9385 ++void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
9386 ++ struct sched_domain_attr *dattr_new)
9387 ++{}
9388 ++
9389 ++#ifdef CONFIG_NUMA
9390 ++int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
9391 ++
9392 ++int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
9393 ++{
9394 ++ return best_mask_cpu(cpu, cpus);
9395 ++}
9396 ++#endif /* CONFIG_NUMA */
9397 ++#endif
9398 +diff --git a/kernel/sysctl.c b/kernel/sysctl.c
9399 +index 272f4a272f8c..1c9455c8ecf6 100644
9400 +--- a/kernel/sysctl.c
9401 ++++ b/kernel/sysctl.c
9402 +@@ -122,6 +122,10 @@ static unsigned long long_max = LONG_MAX;
9403 + static int one_hundred = 100;
9404 + static int two_hundred = 200;
9405 + static int one_thousand = 1000;
9406 ++#ifdef CONFIG_SCHED_ALT
9407 ++static int __maybe_unused zero = 0;
9408 ++extern int sched_yield_type;
9409 ++#endif
9410 + #ifdef CONFIG_PRINTK
9411 + static int ten_thousand = 10000;
9412 + #endif
9413 +@@ -1730,6 +1734,24 @@ int proc_do_static_key(struct ctl_table *table, int write,
9414 + }
9415 +
9416 + static struct ctl_table kern_table[] = {
9417 ++#ifdef CONFIG_SCHED_ALT
9418 ++/* In ALT, only "sched_schedstats" is supported */
9419 ++#ifdef CONFIG_SCHED_DEBUG
9420 ++#ifdef CONFIG_SMP
9421 ++#ifdef CONFIG_SCHEDSTATS
9422 ++ {
9423 ++ .procname = "sched_schedstats",
9424 ++ .data = NULL,
9425 ++ .maxlen = sizeof(unsigned int),
9426 ++ .mode = 0644,
9427 ++ .proc_handler = sysctl_schedstats,
9428 ++ .extra1 = SYSCTL_ZERO,
9429 ++ .extra2 = SYSCTL_ONE,
9430 ++ },
9431 ++#endif /* CONFIG_SCHEDSTATS */
9432 ++#endif /* CONFIG_SMP */
9433 ++#endif /* CONFIG_SCHED_DEBUG */
9434 ++#else /* !CONFIG_SCHED_ALT */
9435 + {
9436 + .procname = "sched_child_runs_first",
9437 + .data = &sysctl_sched_child_runs_first,
9438 +@@ -1860,6 +1882,7 @@ static struct ctl_table kern_table[] = {
9439 + .extra2 = SYSCTL_ONE,
9440 + },
9441 + #endif
9442 ++#endif /* !CONFIG_SCHED_ALT */
9443 + #ifdef CONFIG_PROVE_LOCKING
9444 + {
9445 + .procname = "prove_locking",
9446 +@@ -2436,6 +2459,17 @@ static struct ctl_table kern_table[] = {
9447 + .proc_handler = proc_dointvec,
9448 + },
9449 + #endif
9450 ++#ifdef CONFIG_SCHED_ALT
9451 ++ {
9452 ++ .procname = "yield_type",
9453 ++ .data = &sched_yield_type,
9454 ++ .maxlen = sizeof (int),
9455 ++ .mode = 0644,
9456 ++ .proc_handler = &proc_dointvec_minmax,
9457 ++ .extra1 = &zero,
9458 ++ .extra2 = &two,
9459 ++ },
9460 ++#endif
9461 + #if defined(CONFIG_S390) && defined(CONFIG_SMP)
9462 + {
9463 + .procname = "spin_retry",
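
[Editorial note, not part of the patch] The table entry above exposes sched_yield_type as /proc/sys/kernel/yield_type, clamped to 0..2 by the zero/two bounds. A minimal userspace sketch for reading it; the file only exists on kernels built with CONFIG_SCHED_ALT:

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/sys/kernel/yield_type", "r");
        int type;

        if (!f) {
                perror("yield_type (CONFIG_SCHED_ALT kernels only)");
                return 1;
        }
        if (fscanf(f, "%d", &type) == 1)
                printf("sched_yield() behaviour: %d\n", type);
        fclose(f);
        return 0;
}

Writing works the same way thanks to the 0644 mode, e.g. echo as root; proc_dointvec_minmax rejects values outside 0..2.
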
9464 +diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
9465 +index 4a66725b1d4a..cb80ed5c1f5c 100644
9466 +--- a/kernel/time/hrtimer.c
9467 ++++ b/kernel/time/hrtimer.c
9468 +@@ -1940,8 +1940,10 @@ long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode,
9469 + int ret = 0;
9470 + u64 slack;
9471 +
9472 ++#ifndef CONFIG_SCHED_ALT
9473 + slack = current->timer_slack_ns;
9474 + if (dl_task(current) || rt_task(current))
9475 ++#endif
9476 + slack = 0;
9477 +
9478 + hrtimer_init_sleeper_on_stack(&t, clockid, mode);
9479 +diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
9480 +index 517be7fd175e..de3afe8e0800 100644
9481 +--- a/kernel/time/posix-cpu-timers.c
9482 ++++ b/kernel/time/posix-cpu-timers.c
9483 +@@ -216,7 +216,7 @@ static void task_sample_cputime(struct task_struct *p, u64 *samples)
9484 + u64 stime, utime;
9485 +
9486 + task_cputime(p, &utime, &stime);
9487 +- store_samples(samples, stime, utime, p->se.sum_exec_runtime);
9488 ++ store_samples(samples, stime, utime, tsk_seruntime(p));
9489 + }
9490 +
9491 + static void proc_sample_cputime_atomic(struct task_cputime_atomic *at,
9492 +@@ -801,6 +801,7 @@ static void collect_posix_cputimers(struct posix_cputimers *pct, u64 *samples,
9493 + }
9494 + }
9495 +
9496 ++#ifndef CONFIG_SCHED_ALT
9497 + static inline void check_dl_overrun(struct task_struct *tsk)
9498 + {
9499 + if (tsk->dl.dl_overrun) {
9500 +@@ -808,6 +809,7 @@ static inline void check_dl_overrun(struct task_struct *tsk)
9501 + __group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
9502 + }
9503 + }
9504 ++#endif
9505 +
9506 + static bool check_rlimit(u64 time, u64 limit, int signo, bool rt, bool hard)
9507 + {
9508 +@@ -835,8 +837,10 @@ static void check_thread_timers(struct task_struct *tsk,
9509 + u64 samples[CPUCLOCK_MAX];
9510 + unsigned long soft;
9511 +
9512 ++#ifndef CONFIG_SCHED_ALT
9513 + if (dl_task(tsk))
9514 + check_dl_overrun(tsk);
9515 ++#endif
9516 +
9517 + if (expiry_cache_is_inactive(pct))
9518 + return;
9519 +@@ -850,7 +854,7 @@ static void check_thread_timers(struct task_struct *tsk,
9520 + soft = task_rlimit(tsk, RLIMIT_RTTIME);
9521 + if (soft != RLIM_INFINITY) {
9522 + /* Task RT timeout is accounted in jiffies. RTTIME is usec */
9523 +- unsigned long rttime = tsk->rt.timeout * (USEC_PER_SEC / HZ);
9524 ++ unsigned long rttime = tsk_rttimeout(tsk) * (USEC_PER_SEC / HZ);
9525 + unsigned long hard = task_rlimit_max(tsk, RLIMIT_RTTIME);
9526 +
9527 + /* At the hard limit, send SIGKILL. No further action. */
9528 +@@ -1086,8 +1090,10 @@ static inline bool fastpath_timer_check(struct task_struct *tsk)
9529 + return true;
9530 + }
9531 +
9532 ++#ifndef CONFIG_SCHED_ALT
9533 + if (dl_task(tsk) && tsk->dl.dl_overrun)
9534 + return true;
9535 ++#endif
9536 +
9537 + return false;
9538 + }
9539 +diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
9540 +index adf7ef194005..11c8f36e281b 100644
9541 +--- a/kernel/trace/trace_selftest.c
9542 ++++ b/kernel/trace/trace_selftest.c
9543 +@@ -1052,10 +1052,15 @@ static int trace_wakeup_test_thread(void *data)
9544 + {
9545 + /* Make this a -deadline thread */
9546 + static const struct sched_attr attr = {
9547 ++#ifdef CONFIG_SCHED_ALT
9548 ++ /* No deadline on BMQ/PDS, use RR */
9549 ++ .sched_policy = SCHED_RR,
9550 ++#else
9551 + .sched_policy = SCHED_DEADLINE,
9552 + .sched_runtime = 100000ULL,
9553 + .sched_deadline = 10000000ULL,
9554 + .sched_period = 10000000ULL
9555 ++#endif
9556 + };
9557 + struct wakeup_test_data *x = data;
9558 +
9559
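
[Editorial note, not part of the patch] The selftest hunk above falls back from SCHED_DEADLINE to SCHED_RR because BMQ/PDS provide no deadline class. For completeness, a userspace sketch of making the same request through the sched_setattr() syscall; the struct is declared locally because glibc ships no wrapper, and setting an RT policy typically needs CAP_SYS_NICE:

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <sched.h>              /* SCHED_RR */
#include <unistd.h>
#include <sys/syscall.h>

struct sched_attr_min {         /* leading fields of the UAPI struct sched_attr */
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
};

int main(void)
{
        struct sched_attr_min attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.sched_policy = SCHED_RR;   /* the BMQ/PDS-friendly fallback */
        attr.sched_priority = 1;

        if (syscall(SYS_sched_setattr, 0, &attr, 0) != 0) {
                perror("sched_setattr");
                return 1;
        }
        printf("now SCHED_RR, rt priority 1\n");
        return 0;
}
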
9560 diff --git a/5021_BMQ-and-PDS-gentoo-defaults.patch b/5021_BMQ-and-PDS-gentoo-defaults.patch
9561 new file mode 100644
9562 index 0000000..d449eec
9563 --- /dev/null
9564 +++ b/5021_BMQ-and-PDS-gentoo-defaults.patch
9565 @@ -0,0 +1,13 @@
9566 +--- a/init/Kconfig 2021-04-27 07:38:30.556467045 -0400
9567 ++++ b/init/Kconfig 2021-04-27 07:39:32.956412800 -0400
9568 +@@ -780,8 +780,9 @@ config GENERIC_SCHED_CLOCK
9569 + menu "Scheduler features"
9570 +
9571 + menuconfig SCHED_ALT
9572 ++ depends on X86_64
9573 + bool "Alternative CPU Schedulers"
9574 +- default y
9575 ++ default n
9576 + help
9577 + This feature enables alternative CPU schedulers
9578 +