Gentoo Archives: gentoo-commits

From: Mike Pagano <mpagano@g.o>
To: gentoo-commits@l.g.o
Subject: [gentoo-commits] proj/linux-patches:5.14 commit in: /
Date: Tue, 14 Sep 2021 15:37:53
Message-Id: 1631633821.bac991f4736e0a8f6712313af04b8b4cd873d3b5.mpagano@gentoo
1 commit: bac991f4736e0a8f6712313af04b8b4cd873d3b5
2 Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
3 AuthorDate: Tue Sep 14 15:37:01 2021 +0000
4 Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
5 CommitDate: Tue Sep 14 15:37:01 2021 +0000
6 URL: https://gitweb.gentoo.org/proj/linux-patches.git/commit/?id=bac991f4
7
8 Add BMQ Scheduler Patch 5.14-r1
9
10 BMQ(BitMap Queue) Scheduler.
11 A new CPU scheduler developed from PDS(incld).
12 Inspired by the scheduler in zircon.
13
14 Set defaults for BMQ. Add archs as people test, default to N
15
16 Signed-off-by: Mike Pagano <mpagano <AT> gentoo.org>
17
18 0000_README | 7 +
19 5020_BMQ-and-PDS-io-scheduler-v5.14-r1.patch | 9514 ++++++++++++++++++++++++++
20 5021_BMQ-and-PDS-gentoo-defaults.patch | 13 +
21 3 files changed, 9534 insertions(+)
22
23 diff --git a/0000_README b/0000_README
24 index 4ad6164..f4fbe66 100644
25 --- a/0000_README
26 +++ b/0000_README
27 @@ -87,3 +87,10 @@ Patch: 5010_enable-cpu-optimizations-universal.patch
28 From: https://github.com/graysky2/kernel_compiler_patch
29 Desc: Kernel >= 5.8 patch enables gcc = v9+ optimizations for additional CPUs.
30
31 +Patch: 5020_BMQ-and-PDS-io-scheduler-v5.14-r1.patch
32 +From: https://gitlab.com/alfredchen/linux-prjc
33 +Desc: BMQ(BitMap Queue) Scheduler. A new CPU scheduler developed from PDS(incld). Inspired by the scheduler in zircon.
34 +
35 +Patch: 5021_BMQ-and-PDS-gentoo-defaults.patch
36 +From: https://gitweb.gentoo.org/proj/linux-patches.git/
37 +Desc: Set defaults for BMQ. Add archs as people test, default to N
38
39 diff --git a/5020_BMQ-and-PDS-io-scheduler-v5.14-r1.patch b/5020_BMQ-and-PDS-io-scheduler-v5.14-r1.patch
40 new file mode 100644
41 index 0000000..4c6f75c
42 --- /dev/null
43 +++ b/5020_BMQ-and-PDS-io-scheduler-v5.14-r1.patch
44 @@ -0,0 +1,9514 @@
45 +diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
46 +index bdb22006f713..d755d7df632f 100644
47 +--- a/Documentation/admin-guide/kernel-parameters.txt
48 ++++ b/Documentation/admin-guide/kernel-parameters.txt
49 +@@ -4947,6 +4947,12 @@
50 +
51 + sbni= [NET] Granch SBNI12 leased line adapter
52 +
53 ++ sched_timeslice=
54 ++ [KNL] Time slice in ms for Project C BMQ/PDS scheduler.
55 ++ Format: integer 2, 4
56 ++ Default: 4
57 ++ See Documentation/scheduler/sched-BMQ.txt
58 ++
59 + sched_verbose [KNL] Enables verbose scheduler debug messages.
60 +
61 + schedstats= [KNL,X86] Enable or disable scheduled statistics.
62 +diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
63 +index 426162009ce9..15ac2d7e47cd 100644
64 +--- a/Documentation/admin-guide/sysctl/kernel.rst
65 ++++ b/Documentation/admin-guide/sysctl/kernel.rst
66 +@@ -1542,3 +1542,13 @@ is 10 seconds.
67 +
68 + The softlockup threshold is (``2 * watchdog_thresh``). Setting this
69 + tunable to zero will disable lockup detection altogether.
70 ++
71 ++yield_type:
72 ++===========
73 ++
74 ++BMQ/PDS CPU scheduler only. This determines what type of yield a call to
75 ++sched_yield() will perform.
76 ++
77 ++ 0 - No yield.
78 ++ 1 - Deboost and requeue task. (default)
79 ++ 2 - Set run queue skip task.
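The values above correspond to /proc/sys/kernel/yield_type. A minimal user-space sketch (not part of the patch; the path is assumed from the sysctl documentation above) that switches sched_yield() handling to "no yield":

/* Write a new value to the yield_type sysctl added by this patch.
 * Assumed path: /proc/sys/kernel/yield_type. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/yield_type", "w");

	if (!f) {
		perror("yield_type");
		return 1;
	}
	fprintf(f, "0\n");	/* 0 = no yield, 1 = deboost and requeue (default), 2 = skip */
	fclose(f);
	return 0;
}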
80 +diff --git a/Documentation/scheduler/sched-BMQ.txt b/Documentation/scheduler/sched-BMQ.txt
81 +new file mode 100644
82 +index 000000000000..05c84eec0f31
83 +--- /dev/null
84 ++++ b/Documentation/scheduler/sched-BMQ.txt
85 +@@ -0,0 +1,110 @@
86 ++ BitMap queue CPU Scheduler
87 ++ --------------------------
88 ++
89 ++CONTENT
90 ++========
91 ++
92 ++ Background
93 ++ Design
94 ++ Overview
95 ++ Task policy
96 ++ Priority management
97 ++ BitMap Queue
98 ++ CPU Assignment and Migration
99 ++
100 ++
101 ++Background
102 ++==========
103 ++
104 ++The BitMap Queue CPU scheduler, referred to as BMQ from here on, is an
105 ++evolution of the previous Priority and Deadline based Skiplist multiple queue
106 ++scheduler (PDS), and is inspired by the Zircon scheduler. Its goal is to keep
107 ++the scheduler code simple, while staying efficient and scalable for interactive
108 ++tasks such as desktop use, movie playback and gaming.
109 ++
110 ++Design
111 ++======
112 ++
113 ++Overview
114 ++--------
115 ++
116 ++BMQ uses a per-CPU run queue design: each (logical) CPU has its own run
117 ++queue, and each CPU is responsible for scheduling the tasks that are put
118 ++into its run queue.
119 ++
120 ++The run queue is a set of priority queues. In terms of data structure, these
121 ++queues are FIFO queues for non-rt tasks and priority queues for rt tasks; see
122 ++BitMap Queue below for details. BMQ is optimized for non-rt tasks, since most
123 ++applications are non-rt tasks. Whether a queue is FIFO or priority, each queue
124 ++is an ordered list of runnable tasks awaiting execution, and the data
125 ++structures are the same. When it is time for a new task to run, the scheduler
126 ++simply looks up the lowest numbered queue that contains a task and runs the
127 ++first task from the head of that queue. The per-CPU idle task is also kept in
128 ++the run queue, so the scheduler can always find a task to run from its run
129 ++queue.
130 ++
131 ++Each task is assigned the same timeslice (default 4ms) when it is picked to
132 ++start running. A task is reinserted at the end of the appropriate priority
133 ++queue when it uses up its whole timeslice. When the scheduler selects a new
134 ++task from the priority queue, it sets the CPU's preemption timer for the
135 ++remainder of the previous timeslice. When that timer fires, the scheduler
136 ++stops execution of that task, selects another task and starts over again.
137 ++
138 ++If a task blocks waiting for a shared resource then it's taken out of its
139 ++priority queue and is placed in a wait queue for the shared resource. When it
140 ++is unblocked it will be reinserted in the appropriate priority queue of an
141 ++eligible CPU.
142 ++
143 ++Task policy
144 ++-----------
145 ++
146 ++BMQ supports the DEADLINE, FIFO, RR, NORMAL, BATCH and IDLE task policies,
147 ++like the mainline CFS scheduler, but it is heavily optimized for non-rt tasks,
148 ++that is, NORMAL/BATCH/IDLE policy tasks. Below are the implementation details
149 ++of each policy.
150 ++
151 ++DEADLINE
152 ++ It is squashed into a priority 0 FIFO task.
153 ++
154 ++FIFO/RR
155 ++ All RT tasks share a single priority queue in the BMQ run queue design. The
156 ++complexity of the insert operation is O(n). BMQ is not designed for systems
157 ++that mostly run rt policy tasks.
158 ++
159 ++NORMAL/BATCH/IDLE
160 ++ BATCH and IDLE tasks are treated as the same policy. They compete for CPU
161 ++with NORMAL policy tasks, but they simply don't get boosted. To control the
162 ++priority of NORMAL/BATCH/IDLE tasks, use the nice level.
163 ++
164 ++ISO
165 ++ ISO policy is not supported in BMQ. Please use a nice level -20 NORMAL
166 ++policy task instead.
167 ++
168 ++Priority management
169 ++-------------------
170 ++
171 ++RT tasks have priorities from 0-99. For non-rt tasks, there are three
172 ++different factors used to determine the effective priority of a task; the
173 ++effective priority is what determines which queue the task will be in.
174 ++
175 ++The first factor is simply the task's static priority, which is assigned from
176 ++the task's nice level: [-20, 19] from userland's point of view and [0, 39]
177 ++internally.
178 ++
179 ++The second factor is the priority boost. This is a value bounded between
180 ++[-MAX_PRIORITY_ADJ, MAX_PRIORITY_ADJ] that is used to offset the base
181 ++priority; it is modified in the following cases:
182 ++
183 ++*When a thread has used up its entire timeslice, always deboost its boost by
184 ++increasing it by one.
185 ++*When a thread gives up cpu control (voluntarily or involuntarily) to
186 ++reschedule, and its switch-in time (the time since it last switched in and
187 ++ran) is below the threshold based on its priority boost, boost its boost by
188 ++decreasing it by one, but it is capped at 0 (won't go negative).
189 ++
190 ++The intent in this system is to ensure that interactive threads are serviced
191 ++quickly. These are usually the threads that interact directly with the user
192 ++and cause user-perceivable latency. These threads usually do little work and
193 ++spend most of their time blocked awaiting another user event. So they get the
194 ++priority boost from unblocking while background threads that do most of the
195 ++processing receive the priority penalty for using their entire timeslice.
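A minimal C sketch of the priority model described above, assuming the internal [0, 39] static priority range and a boost bounded by [-MAX_PRIORITY_ADJ, MAX_PRIORITY_ADJ]; the helper names are illustrative only and the per-policy capping rules are simplified (the real logic lives in bmq.h and alt_core.c later in this patch):

/* Illustrative model of BMQ effective priority; not patch code. */
#define MAX_PRIORITY_ADJ	7	/* BMQ value, see the include/linux/sched/prio.h hunk */

struct bmq_task {
	int static_prio;	/* nice + 20, i.e. 0..39 */
	int boost_prio;		/* -MAX_PRIORITY_ADJ .. MAX_PRIORITY_ADJ */
};

static int clamp_int(int v, int lo, int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Queue the task lands in: a lower value is picked first by the bitmap scan. */
static int effective_prio(const struct bmq_task *p)
{
	return clamp_int(p->static_prio + p->boost_prio, 0, 39);
}

/* Used up the whole timeslice: deboost by increasing the boost value. */
static void deboost(struct bmq_task *p)
{
	p->boost_prio = clamp_int(p->boost_prio + 1,
				  -MAX_PRIORITY_ADJ, MAX_PRIORITY_ADJ);
}

/* Gave up the CPU quickly: boost by decreasing the boost value. */
static void boost(struct bmq_task *p)
{
	p->boost_prio = clamp_int(p->boost_prio - 1,
				  -MAX_PRIORITY_ADJ, MAX_PRIORITY_ADJ);
}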
196 +diff --git a/fs/proc/base.c b/fs/proc/base.c
197 +index e5b5f7709d48..284b3c4b7d90 100644
198 +--- a/fs/proc/base.c
199 ++++ b/fs/proc/base.c
200 +@@ -476,7 +476,7 @@ static int proc_pid_schedstat(struct seq_file *m, struct pid_namespace *ns,
201 + seq_puts(m, "0 0 0\n");
202 + else
203 + seq_printf(m, "%llu %llu %lu\n",
204 +- (unsigned long long)task->se.sum_exec_runtime,
205 ++ (unsigned long long)tsk_seruntime(task),
206 + (unsigned long long)task->sched_info.run_delay,
207 + task->sched_info.pcount);
208 +
209 +diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
210 +index 8874f681b056..59eb72bf7d5f 100644
211 +--- a/include/asm-generic/resource.h
212 ++++ b/include/asm-generic/resource.h
213 +@@ -23,7 +23,7 @@
214 + [RLIMIT_LOCKS] = { RLIM_INFINITY, RLIM_INFINITY }, \
215 + [RLIMIT_SIGPENDING] = { 0, 0 }, \
216 + [RLIMIT_MSGQUEUE] = { MQ_BYTES_MAX, MQ_BYTES_MAX }, \
217 +- [RLIMIT_NICE] = { 0, 0 }, \
218 ++ [RLIMIT_NICE] = { 30, 30 }, \
219 + [RLIMIT_RTPRIO] = { 0, 0 }, \
220 + [RLIMIT_RTTIME] = { RLIM_INFINITY, RLIM_INFINITY }, \
221 + }
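The hunk above raises the default RLIMIT_NICE from {0, 0} to {30, 30}. Per setrlimit(2), the nice ceiling is 20 - rlim_cur, so with a default of 30 an unprivileged task should be able to renice itself down to about -10. A small user-space sketch (not part of the patch):

/* Try to raise our own priority; expected to succeed on a kernel carrying
 * this patch (ceiling 20 - 30 = -10) and to fail with EACCES otherwise. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

int main(void)
{
	if (setpriority(PRIO_PROCESS, 0, -10) == -1) {
		printf("renice to -10 failed: %s\n", strerror(errno));
		return 1;
	}
	printf("now running at nice %d\n", getpriority(PRIO_PROCESS, 0));
	return 0;
}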
222 +diff --git a/include/linux/sched.h b/include/linux/sched.h
223 +index ec8d07d88641..b12f660404fd 100644
224 +--- a/include/linux/sched.h
225 ++++ b/include/linux/sched.h
226 +@@ -681,12 +681,18 @@ struct task_struct {
227 + unsigned int ptrace;
228 +
229 + #ifdef CONFIG_SMP
230 +- int on_cpu;
231 + struct __call_single_node wake_entry;
232 ++#endif
233 ++#if defined(CONFIG_SMP) || defined(CONFIG_SCHED_ALT)
234 ++ int on_cpu;
235 ++#endif
236 ++
237 ++#ifdef CONFIG_SMP
238 + #ifdef CONFIG_THREAD_INFO_IN_TASK
239 + /* Current CPU: */
240 + unsigned int cpu;
241 + #endif
242 ++#ifndef CONFIG_SCHED_ALT
243 + unsigned int wakee_flips;
244 + unsigned long wakee_flip_decay_ts;
245 + struct task_struct *last_wakee;
246 +@@ -700,6 +706,7 @@ struct task_struct {
247 + */
248 + int recent_used_cpu;
249 + int wake_cpu;
250 ++#endif /* !CONFIG_SCHED_ALT */
251 + #endif
252 + int on_rq;
253 +
254 +@@ -708,6 +715,20 @@ struct task_struct {
255 + int normal_prio;
256 + unsigned int rt_priority;
257 +
258 ++#ifdef CONFIG_SCHED_ALT
259 ++ u64 last_ran;
260 ++ s64 time_slice;
261 ++ int sq_idx;
262 ++ struct list_head sq_node;
263 ++#ifdef CONFIG_SCHED_BMQ
264 ++ int boost_prio;
265 ++#endif /* CONFIG_SCHED_BMQ */
266 ++#ifdef CONFIG_SCHED_PDS
267 ++ u64 deadline;
268 ++#endif /* CONFIG_SCHED_PDS */
269 ++ /* sched_clock time spent running */
270 ++ u64 sched_time;
271 ++#else /* !CONFIG_SCHED_ALT */
272 + const struct sched_class *sched_class;
273 + struct sched_entity se;
274 + struct sched_rt_entity rt;
275 +@@ -718,6 +739,7 @@ struct task_struct {
276 + unsigned long core_cookie;
277 + unsigned int core_occupation;
278 + #endif
279 ++#endif /* !CONFIG_SCHED_ALT */
280 +
281 + #ifdef CONFIG_CGROUP_SCHED
282 + struct task_group *sched_task_group;
283 +@@ -1417,6 +1439,15 @@ struct task_struct {
284 + */
285 + };
286 +
287 ++#ifdef CONFIG_SCHED_ALT
288 ++#define tsk_seruntime(t) ((t)->sched_time)
289 ++/* replace the uncertain rt_timeout with 0UL */
290 ++#define tsk_rttimeout(t) (0UL)
291 ++#else /* CFS */
292 ++#define tsk_seruntime(t) ((t)->se.sum_exec_runtime)
293 ++#define tsk_rttimeout(t) ((t)->rt.timeout)
294 ++#endif /* !CONFIG_SCHED_ALT */
295 ++
296 + static inline struct pid *task_pid(struct task_struct *task)
297 + {
298 + return task->thread_pid;
299 +diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
300 +index 1aff00b65f3c..216fdf2fe90c 100644
301 +--- a/include/linux/sched/deadline.h
302 ++++ b/include/linux/sched/deadline.h
303 +@@ -1,5 +1,24 @@
304 + /* SPDX-License-Identifier: GPL-2.0 */
305 +
306 ++#ifdef CONFIG_SCHED_ALT
307 ++
308 ++static inline int dl_task(struct task_struct *p)
309 ++{
310 ++ return 0;
311 ++}
312 ++
313 ++#ifdef CONFIG_SCHED_BMQ
314 ++#define __tsk_deadline(p) (0UL)
315 ++#endif
316 ++
317 ++#ifdef CONFIG_SCHED_PDS
318 ++#define __tsk_deadline(p) ((((u64) ((p)->prio))<<56) | (p)->deadline)
319 ++#endif
320 ++
321 ++#else
322 ++
323 ++#define __tsk_deadline(p) ((p)->dl.deadline)
324 ++
325 + /*
326 + * SCHED_DEADLINE tasks has negative priorities, reflecting
327 + * the fact that any of them has higher prio than RT and
328 +@@ -19,6 +38,7 @@ static inline int dl_task(struct task_struct *p)
329 + {
330 + return dl_prio(p->prio);
331 + }
332 ++#endif /* CONFIG_SCHED_ALT */
333 +
334 + static inline bool dl_time_before(u64 a, u64 b)
335 + {
336 +diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h
337 +index ab83d85e1183..6af9ae681116 100644
338 +--- a/include/linux/sched/prio.h
339 ++++ b/include/linux/sched/prio.h
340 +@@ -18,6 +18,32 @@
341 + #define MAX_PRIO (MAX_RT_PRIO + NICE_WIDTH)
342 + #define DEFAULT_PRIO (MAX_RT_PRIO + NICE_WIDTH / 2)
343 +
344 ++#ifdef CONFIG_SCHED_ALT
345 ++
346 ++/* Undefine MAX_PRIO and DEFAULT_PRIO */
347 ++#undef MAX_PRIO
348 ++#undef DEFAULT_PRIO
349 ++
350 ++/* +/- priority levels from the base priority */
351 ++#ifdef CONFIG_SCHED_BMQ
352 ++#define MAX_PRIORITY_ADJ (7)
353 ++
354 ++#define MIN_NORMAL_PRIO (MAX_RT_PRIO)
355 ++#define MAX_PRIO (MIN_NORMAL_PRIO + NICE_WIDTH)
356 ++#define DEFAULT_PRIO (MIN_NORMAL_PRIO + NICE_WIDTH / 2)
357 ++#endif
358 ++
359 ++#ifdef CONFIG_SCHED_PDS
360 ++#define MAX_PRIORITY_ADJ (0)
361 ++
362 ++#define MIN_NORMAL_PRIO (128)
363 ++#define NORMAL_PRIO_NUM (64)
364 ++#define MAX_PRIO (MIN_NORMAL_PRIO + NORMAL_PRIO_NUM)
365 ++#define DEFAULT_PRIO (MAX_PRIO - NICE_WIDTH / 2)
366 ++#endif
367 ++
368 ++#endif /* CONFIG_SCHED_ALT */
369 ++
370 + /*
371 + * Convert user-nice values [ -20 ... 0 ... 19 ]
372 + * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
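For concreteness, assuming the mainline values MAX_RT_PRIO == 100 and NICE_WIDTH == 40, the redefinitions above work out to the ranges below (worked arithmetic only, not patch content):

/* BMQ: MIN_NORMAL_PRIO = 100, so MAX_PRIO = 140 and DEFAULT_PRIO = 120 (nice 0).
 * PDS: MIN_NORMAL_PRIO = 128 and NORMAL_PRIO_NUM = 64, so MAX_PRIO = 192 and
 *      DEFAULT_PRIO = 192 - 20 = 172 (nice 0). */
_Static_assert(100 + 40 == 140, "BMQ MAX_PRIO");
_Static_assert(100 + 40 / 2 == 120, "BMQ DEFAULT_PRIO");
_Static_assert(128 + 64 == 192, "PDS MAX_PRIO");
_Static_assert(128 + 64 - 40 / 2 == 172, "PDS DEFAULT_PRIO");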
373 +diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
374 +index e5af028c08b4..0a7565d0d3cf 100644
375 +--- a/include/linux/sched/rt.h
376 ++++ b/include/linux/sched/rt.h
377 +@@ -24,8 +24,10 @@ static inline bool task_is_realtime(struct task_struct *tsk)
378 +
379 + if (policy == SCHED_FIFO || policy == SCHED_RR)
380 + return true;
381 ++#ifndef CONFIG_SCHED_ALT
382 + if (policy == SCHED_DEADLINE)
383 + return true;
384 ++#endif
385 + return false;
386 + }
387 +
388 +diff --git a/init/Kconfig b/init/Kconfig
389 +index 55f9f7738ebb..9a9b244d3ca3 100644
390 +--- a/init/Kconfig
391 ++++ b/init/Kconfig
392 +@@ -786,9 +786,39 @@ config GENERIC_SCHED_CLOCK
393 +
394 + menu "Scheduler features"
395 +
396 ++menuconfig SCHED_ALT
397 ++ bool "Alternative CPU Schedulers"
398 ++ default y
399 ++ help
400 ++ This feature enables the alternative CPU schedulers.
401 ++
402 ++if SCHED_ALT
403 ++
404 ++choice
405 ++ prompt "Alternative CPU Scheduler"
406 ++ default SCHED_BMQ
407 ++
408 ++config SCHED_BMQ
409 ++ bool "BMQ CPU scheduler"
410 ++ help
411 ++ The BitMap Queue CPU scheduler for excellent interactivity and
412 ++ responsiveness on the desktop and solid scalability on normal
413 ++ hardware and commodity servers.
414 ++
415 ++config SCHED_PDS
416 ++ bool "PDS CPU scheduler"
417 ++ help
418 ++ The Priority and Deadline based Skip list multiple queue CPU
419 ++ Scheduler.
420 ++
421 ++endchoice
422 ++
423 ++endif
424 ++
425 + config UCLAMP_TASK
426 + bool "Enable utilization clamping for RT/FAIR tasks"
427 + depends on CPU_FREQ_GOV_SCHEDUTIL
428 ++ depends on !SCHED_ALT
429 + help
430 + This feature enables the scheduler to track the clamped utilization
431 + of each CPU based on RUNNABLE tasks scheduled on that CPU.
432 +@@ -874,6 +904,7 @@ config NUMA_BALANCING
433 + depends on ARCH_SUPPORTS_NUMA_BALANCING
434 + depends on !ARCH_WANT_NUMA_VARIABLE_LOCALITY
435 + depends on SMP && NUMA && MIGRATION
436 ++ depends on !SCHED_ALT
437 + help
438 + This option adds support for automatic NUMA aware memory/task placement.
439 + The mechanism is quite primitive and is based on migrating memory when
440 +@@ -966,6 +997,7 @@ config FAIR_GROUP_SCHED
441 + depends on CGROUP_SCHED
442 + default CGROUP_SCHED
443 +
444 ++if !SCHED_ALT
445 + config CFS_BANDWIDTH
446 + bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
447 + depends on FAIR_GROUP_SCHED
448 +@@ -988,6 +1020,7 @@ config RT_GROUP_SCHED
449 + realtime bandwidth for them.
450 + See Documentation/scheduler/sched-rt-group.rst for more information.
451 +
452 ++endif #!SCHED_ALT
453 + endif #CGROUP_SCHED
454 +
455 + config UCLAMP_TASK_GROUP
456 +@@ -1231,6 +1264,7 @@ config CHECKPOINT_RESTORE
457 +
458 + config SCHED_AUTOGROUP
459 + bool "Automatic process group scheduling"
460 ++ depends on !SCHED_ALT
461 + select CGROUPS
462 + select CGROUP_SCHED
463 + select FAIR_GROUP_SCHED
464 +diff --git a/init/init_task.c b/init/init_task.c
465 +index 562f2ef8d157..177b63db4ce0 100644
466 +--- a/init/init_task.c
467 ++++ b/init/init_task.c
468 +@@ -75,9 +75,15 @@ struct task_struct init_task
469 + .stack = init_stack,
470 + .usage = REFCOUNT_INIT(2),
471 + .flags = PF_KTHREAD,
472 ++#ifdef CONFIG_SCHED_ALT
473 ++ .prio = DEFAULT_PRIO + MAX_PRIORITY_ADJ,
474 ++ .static_prio = DEFAULT_PRIO,
475 ++ .normal_prio = DEFAULT_PRIO + MAX_PRIORITY_ADJ,
476 ++#else
477 + .prio = MAX_PRIO - 20,
478 + .static_prio = MAX_PRIO - 20,
479 + .normal_prio = MAX_PRIO - 20,
480 ++#endif
481 + .policy = SCHED_NORMAL,
482 + .cpus_ptr = &init_task.cpus_mask,
483 + .cpus_mask = CPU_MASK_ALL,
484 +@@ -87,6 +93,17 @@ struct task_struct init_task
485 + .restart_block = {
486 + .fn = do_no_restart_syscall,
487 + },
488 ++#ifdef CONFIG_SCHED_ALT
489 ++ .sq_node = LIST_HEAD_INIT(init_task.sq_node),
490 ++#ifdef CONFIG_SCHED_BMQ
491 ++ .boost_prio = 0,
492 ++ .sq_idx = 15,
493 ++#endif
494 ++#ifdef CONFIG_SCHED_PDS
495 ++ .deadline = 0,
496 ++#endif
497 ++ .time_slice = HZ,
498 ++#else
499 + .se = {
500 + .group_node = LIST_HEAD_INIT(init_task.se.group_node),
501 + },
502 +@@ -94,6 +111,7 @@ struct task_struct init_task
503 + .run_list = LIST_HEAD_INIT(init_task.rt.run_list),
504 + .time_slice = RR_TIMESLICE,
505 + },
506 ++#endif
507 + .tasks = LIST_HEAD_INIT(init_task.tasks),
508 + #ifdef CONFIG_SMP
509 + .pushable_tasks = PLIST_NODE_INIT(init_task.pushable_tasks, MAX_PRIO),
510 +diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
511 +index 5876e30c5740..7594d0a31869 100644
512 +--- a/kernel/Kconfig.preempt
513 ++++ b/kernel/Kconfig.preempt
514 +@@ -102,7 +102,7 @@ config PREEMPT_DYNAMIC
515 +
516 + config SCHED_CORE
517 + bool "Core Scheduling for SMT"
518 +- depends on SCHED_SMT
519 ++ depends on SCHED_SMT && !SCHED_ALT
520 + help
521 + This option permits Core Scheduling, a means of coordinated task
522 + selection across SMT siblings. When enabled -- see
523 +diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
524 +index adb5190c4429..8c02bce63146 100644
525 +--- a/kernel/cgroup/cpuset.c
526 ++++ b/kernel/cgroup/cpuset.c
527 +@@ -636,7 +636,7 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
528 + return ret;
529 + }
530 +
531 +-#ifdef CONFIG_SMP
532 ++#if defined(CONFIG_SMP) && !defined(CONFIG_SCHED_ALT)
533 + /*
534 + * Helper routine for generate_sched_domains().
535 + * Do cpusets a, b have overlapping effective cpus_allowed masks?
536 +@@ -1032,7 +1032,7 @@ static void rebuild_sched_domains_locked(void)
537 + /* Have scheduler rebuild the domains */
538 + partition_and_rebuild_sched_domains(ndoms, doms, attr);
539 + }
540 +-#else /* !CONFIG_SMP */
541 ++#else /* !CONFIG_SMP || CONFIG_SCHED_ALT */
542 + static void rebuild_sched_domains_locked(void)
543 + {
544 + }
545 +diff --git a/kernel/delayacct.c b/kernel/delayacct.c
546 +index 51530d5b15a8..e542d71bb94b 100644
547 +--- a/kernel/delayacct.c
548 ++++ b/kernel/delayacct.c
549 +@@ -139,7 +139,7 @@ int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
550 + */
551 + t1 = tsk->sched_info.pcount;
552 + t2 = tsk->sched_info.run_delay;
553 +- t3 = tsk->se.sum_exec_runtime;
554 ++ t3 = tsk_seruntime(tsk);
555 +
556 + d->cpu_count += t1;
557 +
558 +diff --git a/kernel/exit.c b/kernel/exit.c
559 +index 9a89e7f36acb..7fe34c56bd08 100644
560 +--- a/kernel/exit.c
561 ++++ b/kernel/exit.c
562 +@@ -122,7 +122,7 @@ static void __exit_signal(struct task_struct *tsk)
563 + sig->curr_target = next_thread(tsk);
564 + }
565 +
566 +- add_device_randomness((const void*) &tsk->se.sum_exec_runtime,
567 ++ add_device_randomness((const void*) &tsk_seruntime(tsk),
568 + sizeof(unsigned long long));
569 +
570 + /*
571 +@@ -143,7 +143,7 @@ static void __exit_signal(struct task_struct *tsk)
572 + sig->inblock += task_io_get_inblock(tsk);
573 + sig->oublock += task_io_get_oublock(tsk);
574 + task_io_accounting_add(&sig->ioac, &tsk->ioac);
575 +- sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
576 ++ sig->sum_sched_runtime += tsk_seruntime(tsk);
577 + sig->nr_threads--;
578 + __unhash_process(tsk, group_dead);
579 + write_sequnlock(&sig->stats_lock);
580 +diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
581 +index 3a4beb9395c4..98a709628cb3 100644
582 +--- a/kernel/livepatch/transition.c
583 ++++ b/kernel/livepatch/transition.c
584 +@@ -307,7 +307,11 @@ static bool klp_try_switch_task(struct task_struct *task)
585 + */
586 + rq = task_rq_lock(task, &flags);
587 +
588 ++#ifdef CONFIG_SCHED_ALT
589 ++ if (task_running(task) && task != current) {
590 ++#else
591 + if (task_running(rq, task) && task != current) {
592 ++#endif
593 + snprintf(err_buf, STACK_ERR_BUF_SIZE,
594 + "%s: %s:%d is running\n", __func__, task->comm,
595 + task->pid);
596 +diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
597 +index ad0db322ed3b..350b0e506c17 100644
598 +--- a/kernel/locking/rtmutex.c
599 ++++ b/kernel/locking/rtmutex.c
600 +@@ -227,14 +227,18 @@ static __always_inline bool unlock_rt_mutex_safe(struct rt_mutex *lock,
601 + * Only use with rt_mutex_waiter_{less,equal}()
602 + */
603 + #define task_to_waiter(p) \
604 +- &(struct rt_mutex_waiter){ .prio = (p)->prio, .deadline = (p)->dl.deadline }
605 ++ &(struct rt_mutex_waiter){ .prio = (p)->prio, .deadline = __tsk_deadline(p) }
606 +
607 + static __always_inline int rt_mutex_waiter_less(struct rt_mutex_waiter *left,
608 + struct rt_mutex_waiter *right)
609 + {
610 ++#ifdef CONFIG_SCHED_PDS
611 ++ return (left->deadline < right->deadline);
612 ++#else
613 + if (left->prio < right->prio)
614 + return 1;
615 +
616 ++#ifndef CONFIG_SCHED_BMQ
617 + /*
618 + * If both waiters have dl_prio(), we check the deadlines of the
619 + * associated tasks.
620 +@@ -243,16 +247,22 @@ static __always_inline int rt_mutex_waiter_less(struct rt_mutex_waiter *left,
621 + */
622 + if (dl_prio(left->prio))
623 + return dl_time_before(left->deadline, right->deadline);
624 ++#endif
625 +
626 + return 0;
627 ++#endif
628 + }
629 +
630 + static __always_inline int rt_mutex_waiter_equal(struct rt_mutex_waiter *left,
631 + struct rt_mutex_waiter *right)
632 + {
633 ++#ifdef CONFIG_SCHED_PDS
634 ++ return (left->deadline == right->deadline);
635 ++#else
636 + if (left->prio != right->prio)
637 + return 0;
638 +
639 ++#ifndef CONFIG_SCHED_BMQ
640 + /*
641 + * If both waiters have dl_prio(), we check the deadlines of the
642 + * associated tasks.
643 +@@ -261,8 +271,10 @@ static __always_inline int rt_mutex_waiter_equal(struct rt_mutex_waiter *left,
644 + */
645 + if (dl_prio(left->prio))
646 + return left->deadline == right->deadline;
647 ++#endif
648 +
649 + return 1;
650 ++#endif
651 + }
652 +
653 + #define __node_2_waiter(node) \
654 +@@ -654,7 +666,7 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
655 + * the values of the node being removed.
656 + */
657 + waiter->prio = task->prio;
658 +- waiter->deadline = task->dl.deadline;
659 ++ waiter->deadline = __tsk_deadline(task);
660 +
661 + rt_mutex_enqueue(lock, waiter);
662 +
663 +@@ -925,7 +937,7 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex *lock,
664 + waiter->task = task;
665 + waiter->lock = lock;
666 + waiter->prio = task->prio;
667 +- waiter->deadline = task->dl.deadline;
668 ++ waiter->deadline = __tsk_deadline(task);
669 +
670 + /* Get the top priority waiter on the lock */
671 + if (rt_mutex_has_waiters(lock))
672 +diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
673 +index 978fcfca5871..0425ee149b4d 100644
674 +--- a/kernel/sched/Makefile
675 ++++ b/kernel/sched/Makefile
676 +@@ -22,14 +22,21 @@ ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
677 + CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer
678 + endif
679 +
680 +-obj-y += core.o loadavg.o clock.o cputime.o
681 +-obj-y += idle.o fair.o rt.o deadline.o
682 +-obj-y += wait.o wait_bit.o swait.o completion.o
683 +-
684 +-obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
685 ++ifdef CONFIG_SCHED_ALT
686 ++obj-y += alt_core.o
687 ++obj-$(CONFIG_SCHED_DEBUG) += alt_debug.o
688 ++else
689 ++obj-y += core.o
690 ++obj-y += fair.o rt.o deadline.o
691 ++obj-$(CONFIG_SMP) += cpudeadline.o stop_task.o
692 + obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
693 +-obj-$(CONFIG_SCHEDSTATS) += stats.o
694 ++endif
695 + obj-$(CONFIG_SCHED_DEBUG) += debug.o
696 ++obj-y += loadavg.o clock.o cputime.o
697 ++obj-y += idle.o
698 ++obj-y += wait.o wait_bit.o swait.o completion.o
699 ++obj-$(CONFIG_SMP) += cpupri.o pelt.o topology.o
700 ++obj-$(CONFIG_SCHEDSTATS) += stats.o
701 + obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
702 + obj-$(CONFIG_CPU_FREQ) += cpufreq.o
703 + obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
704 +diff --git a/kernel/sched/alt_core.c b/kernel/sched/alt_core.c
705 +new file mode 100644
706 +index 000000000000..900889c838ea
707 +--- /dev/null
708 ++++ b/kernel/sched/alt_core.c
709 +@@ -0,0 +1,7248 @@
710 ++/*
711 ++ * kernel/sched/alt_core.c
712 ++ *
713 ++ * Core alternative kernel scheduler code and related syscalls
714 ++ *
715 ++ * Copyright (C) 1991-2002 Linus Torvalds
716 ++ *
717 ++ * 2009-08-13 Brainfuck deadline scheduling policy by Con Kolivas deletes
718 ++ * a whole lot of those previous things.
719 ++ * 2017-09-06 Priority and Deadline based Skip list multiple queue kernel
720 ++ * scheduler by Alfred Chen.
721 ++ * 2019-02-20 BMQ(BitMap Queue) kernel scheduler by Alfred Chen.
722 ++ */
723 ++#define CREATE_TRACE_POINTS
724 ++#include <trace/events/sched.h>
725 ++#undef CREATE_TRACE_POINTS
726 ++
727 ++#include "sched.h"
728 ++
729 ++#include <linux/sched/rt.h>
730 ++
731 ++#include <linux/context_tracking.h>
732 ++#include <linux/compat.h>
733 ++#include <linux/blkdev.h>
734 ++#include <linux/delayacct.h>
735 ++#include <linux/freezer.h>
736 ++#include <linux/init_task.h>
737 ++#include <linux/kprobes.h>
738 ++#include <linux/mmu_context.h>
739 ++#include <linux/nmi.h>
740 ++#include <linux/profile.h>
741 ++#include <linux/rcupdate_wait.h>
742 ++#include <linux/security.h>
743 ++#include <linux/syscalls.h>
744 ++#include <linux/wait_bit.h>
745 ++
746 ++#include <linux/kcov.h>
747 ++#include <linux/scs.h>
748 ++
749 ++#include <asm/switch_to.h>
750 ++
751 ++#include "../workqueue_internal.h"
752 ++#include "../../fs/io-wq.h"
753 ++#include "../smpboot.h"
754 ++
755 ++#include "pelt.h"
756 ++#include "smp.h"
757 ++
758 ++/*
759 ++ * Export tracepoints that act as a bare tracehook (ie: have no trace event
760 ++ * associated with them) to allow external modules to probe them.
761 ++ */
762 ++EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_irq_tp);
763 ++
764 ++#ifdef CONFIG_SCHED_DEBUG
765 ++#define sched_feat(x) (1)
766 ++/*
767 ++ * Print a warning if need_resched is set for the given duration (if
768 ++ * LATENCY_WARN is enabled).
769 ++ *
770 ++ * If sysctl_resched_latency_warn_once is set, only one warning will be shown
771 ++ * per boot.
772 ++ */
773 ++__read_mostly int sysctl_resched_latency_warn_ms = 100;
774 ++__read_mostly int sysctl_resched_latency_warn_once = 1;
775 ++#else
776 ++#define sched_feat(x) (0)
777 ++#endif /* CONFIG_SCHED_DEBUG */
778 ++
779 ++#define ALT_SCHED_VERSION "v5.14-r1"
780 ++
781 ++/* rt_prio(prio) defined in include/linux/sched/rt.h */
782 ++#define rt_task(p) rt_prio((p)->prio)
783 ++#define rt_policy(policy) ((policy) == SCHED_FIFO || (policy) == SCHED_RR)
784 ++#define task_has_rt_policy(p) (rt_policy((p)->policy))
785 ++
786 ++#define STOP_PRIO (MAX_RT_PRIO - 1)
787 ++
788 ++/* Default time slice is 4 ms; it can be set via the kernel parameter "sched_timeslice" */
789 ++u64 sched_timeslice_ns __read_mostly = (4 << 20);
790 ++
791 ++static inline void requeue_task(struct task_struct *p, struct rq *rq);
792 ++
793 ++#ifdef CONFIG_SCHED_BMQ
794 ++#include "bmq.h"
795 ++#endif
796 ++#ifdef CONFIG_SCHED_PDS
797 ++#include "pds.h"
798 ++#endif
799 ++
800 ++static int __init sched_timeslice(char *str)
801 ++{
802 ++ int timeslice_ms;
803 ++
804 ++ get_option(&str, &timeslice_ms);
805 ++ if (2 != timeslice_ms)
806 ++ timeslice_ms = 4;
807 ++ sched_timeslice_ns = timeslice_ms << 20;
808 ++ sched_timeslice_imp(timeslice_ms);
809 ++
810 ++ return 0;
811 ++}
812 ++early_param("sched_timeslice", sched_timeslice);
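Note that the conversion above uses a shift rather than NSEC_PER_MSEC, so one "ms" of timeslice is really 2^20 ns (about 1.05 ms); a quick check of the resulting values (not part of the patch):

/* sched_timeslice=2 -> 2 << 20 = 2097152 ns (~2.10 ms)
 * sched_timeslice=4 -> 4 << 20 = 4194304 ns (~4.19 ms, the default) */
_Static_assert((2 << 20) == 2097152, "2 'ms' timeslice in ns");
_Static_assert((4 << 20) == 4194304, "4 'ms' timeslice in ns");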
813 ++
814 ++/* Reschedule if less than this many μs left */
815 ++#define RESCHED_NS (100 << 10)
816 ++
817 ++/**
818 ++ * sched_yield_type - Choose what sort of yield sched_yield will perform.
819 ++ * 0: No yield.
820 ++ * 1: Deboost and requeue task. (default)
821 ++ * 2: Set rq skip task.
822 ++ */
823 ++int sched_yield_type __read_mostly = 1;
824 ++
825 ++#ifdef CONFIG_SMP
826 ++static cpumask_t sched_rq_pending_mask ____cacheline_aligned_in_smp;
827 ++
828 ++DEFINE_PER_CPU(cpumask_t [NR_CPU_AFFINITY_LEVELS], sched_cpu_topo_masks);
829 ++DEFINE_PER_CPU(cpumask_t *, sched_cpu_llc_mask);
830 ++DEFINE_PER_CPU(cpumask_t *, sched_cpu_topo_end_mask);
831 ++
832 ++#ifdef CONFIG_SCHED_SMT
833 ++DEFINE_STATIC_KEY_FALSE(sched_smt_present);
834 ++EXPORT_SYMBOL_GPL(sched_smt_present);
835 ++#endif
836 ++
837 ++/*
838 ++ * Keep a unique ID per domain (we use the first CPUs number in the cpumask of
839 ++ * the domain), this allows us to quickly tell if two cpus are in the same cache
840 ++ * domain, see cpus_share_cache().
841 ++ */
842 ++DEFINE_PER_CPU(int, sd_llc_id);
843 ++#endif /* CONFIG_SMP */
844 ++
845 ++static DEFINE_MUTEX(sched_hotcpu_mutex);
846 ++
847 ++DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
848 ++
849 ++#ifndef prepare_arch_switch
850 ++# define prepare_arch_switch(next) do { } while (0)
851 ++#endif
852 ++#ifndef finish_arch_post_lock_switch
853 ++# define finish_arch_post_lock_switch() do { } while (0)
854 ++#endif
855 ++
856 ++#ifdef CONFIG_SCHED_SMT
857 ++static cpumask_t sched_sg_idle_mask ____cacheline_aligned_in_smp;
858 ++#endif
859 ++static cpumask_t sched_rq_watermark[SCHED_BITS] ____cacheline_aligned_in_smp;
860 ++
861 ++/* sched_queue related functions */
862 ++static inline void sched_queue_init(struct sched_queue *q)
863 ++{
864 ++ int i;
865 ++
866 ++ bitmap_zero(q->bitmap, SCHED_BITS);
867 ++ for(i = 0; i < SCHED_BITS; i++)
868 ++ INIT_LIST_HEAD(&q->heads[i]);
869 ++}
870 ++
871 ++/*
872 ++ * Init idle task and put into queue structure of rq
873 ++ * IMPORTANT: may be called multiple times for a single cpu
874 ++ */
875 ++static inline void sched_queue_init_idle(struct sched_queue *q,
876 ++ struct task_struct *idle)
877 ++{
878 ++ idle->sq_idx = IDLE_TASK_SCHED_PRIO;
879 ++ INIT_LIST_HEAD(&q->heads[idle->sq_idx]);
880 ++ list_add(&idle->sq_node, &q->heads[idle->sq_idx]);
881 ++}
882 ++
883 ++/* water mark related functions */
884 ++static inline void update_sched_rq_watermark(struct rq *rq)
885 ++{
886 ++ unsigned long watermark = find_first_bit(rq->queue.bitmap, SCHED_QUEUE_BITS);
887 ++ unsigned long last_wm = rq->watermark;
888 ++ unsigned long i;
889 ++ int cpu;
890 ++
891 ++ if (watermark == last_wm)
892 ++ return;
893 ++
894 ++ rq->watermark = watermark;
895 ++ cpu = cpu_of(rq);
896 ++ if (watermark < last_wm) {
897 ++ for (i = last_wm; i > watermark; i--)
898 ++ cpumask_clear_cpu(cpu, sched_rq_watermark + SCHED_BITS - 1 - i);
899 ++#ifdef CONFIG_SCHED_SMT
900 ++ if (static_branch_likely(&sched_smt_present) &&
901 ++ IDLE_TASK_SCHED_PRIO == last_wm)
902 ++ cpumask_andnot(&sched_sg_idle_mask,
903 ++ &sched_sg_idle_mask, cpu_smt_mask(cpu));
904 ++#endif
905 ++ return;
906 ++ }
907 ++ /* last_wm < watermark */
908 ++ for (i = watermark; i > last_wm; i--)
909 ++ cpumask_set_cpu(cpu, sched_rq_watermark + SCHED_BITS - 1 - i);
910 ++#ifdef CONFIG_SCHED_SMT
911 ++ if (static_branch_likely(&sched_smt_present) &&
912 ++ IDLE_TASK_SCHED_PRIO == watermark) {
913 ++ cpumask_t tmp;
914 ++
915 ++ cpumask_and(&tmp, cpu_smt_mask(cpu), sched_rq_watermark);
916 ++ if (cpumask_equal(&tmp, cpu_smt_mask(cpu)))
917 ++ cpumask_or(&sched_sg_idle_mask,
918 ++ &sched_sg_idle_mask, cpu_smt_mask(cpu));
919 ++ }
920 ++#endif
921 ++}
922 ++
923 ++/*
924 ++ * This routine assumes that the idle task is always in the queue
925 ++ */
926 ++static inline struct task_struct *sched_rq_first_task(struct rq *rq)
927 ++{
928 ++ unsigned long idx = find_first_bit(rq->queue.bitmap, SCHED_QUEUE_BITS);
929 ++ const struct list_head *head = &rq->queue.heads[sched_prio2idx(idx, rq)];
930 ++
931 ++ return list_first_entry(head, struct task_struct, sq_node);
932 ++}
933 ++
934 ++static inline struct task_struct *
935 ++sched_rq_next_task(struct task_struct *p, struct rq *rq)
936 ++{
937 ++ unsigned long idx = p->sq_idx;
938 ++ struct list_head *head = &rq->queue.heads[idx];
939 ++
940 ++ if (list_is_last(&p->sq_node, head)) {
941 ++ idx = find_next_bit(rq->queue.bitmap, SCHED_QUEUE_BITS,
942 ++ sched_idx2prio(idx, rq) + 1);
943 ++ head = &rq->queue.heads[sched_prio2idx(idx, rq)];
944 ++
945 ++ return list_first_entry(head, struct task_struct, sq_node);
946 ++ }
947 ++
948 ++ return list_next_entry(p, sq_node);
949 ++}
950 ++
951 ++static inline struct task_struct *rq_runnable_task(struct rq *rq)
952 ++{
953 ++ struct task_struct *next = sched_rq_first_task(rq);
954 ++
955 ++ if (unlikely(next == rq->skip))
956 ++ next = sched_rq_next_task(next, rq);
957 ++
958 ++ return next;
959 ++}
960 ++
961 ++/*
962 ++ * Serialization rules:
963 ++ *
964 ++ * Lock order:
965 ++ *
966 ++ * p->pi_lock
967 ++ * rq->lock
968 ++ * hrtimer_cpu_base->lock (hrtimer_start() for bandwidth controls)
969 ++ *
970 ++ * rq1->lock
971 ++ * rq2->lock where: rq1 < rq2
972 ++ *
973 ++ * Regular state:
974 ++ *
975 ++ * Normal scheduling state is serialized by rq->lock. __schedule() takes the
976 ++ * local CPU's rq->lock, it optionally removes the task from the runqueue and
977 ++ * always looks at the local rq data structures to find the most eligible task
978 ++ * to run next.
979 ++ *
980 ++ * Task enqueue is also under rq->lock, possibly taken from another CPU.
981 ++ * Wakeups from another LLC domain might use an IPI to transfer the enqueue to
982 ++ * the local CPU to avoid bouncing the runqueue state around [ see
983 ++ * ttwu_queue_wakelist() ]
984 ++ *
985 ++ * Task wakeup, specifically wakeups that involve migration, are horribly
986 ++ * complicated to avoid having to take two rq->locks.
987 ++ *
988 ++ * Special state:
989 ++ *
990 ++ * System-calls and anything external will use task_rq_lock() which acquires
991 ++ * both p->pi_lock and rq->lock. As a consequence the state they change is
992 ++ * stable while holding either lock:
993 ++ *
994 ++ * - sched_setaffinity()/
995 ++ * set_cpus_allowed_ptr(): p->cpus_ptr, p->nr_cpus_allowed
996 ++ * - set_user_nice(): p->se.load, p->*prio
997 ++ * - __sched_setscheduler(): p->sched_class, p->policy, p->*prio,
998 ++ * p->se.load, p->rt_priority,
999 ++ * p->dl.dl_{runtime, deadline, period, flags, bw, density}
1000 ++ * - sched_setnuma(): p->numa_preferred_nid
1001 ++ * - sched_move_task()/
1002 ++ * cpu_cgroup_fork(): p->sched_task_group
1003 ++ * - uclamp_update_active() p->uclamp*
1004 ++ *
1005 ++ * p->state <- TASK_*:
1006 ++ *
1007 ++ * is changed locklessly using set_current_state(), __set_current_state() or
1008 ++ * set_special_state(), see their respective comments, or by
1009 ++ * try_to_wake_up(). This latter uses p->pi_lock to serialize against
1010 ++ * concurrent self.
1011 ++ *
1012 ++ * p->on_rq <- { 0, 1 = TASK_ON_RQ_QUEUED, 2 = TASK_ON_RQ_MIGRATING }:
1013 ++ *
1014 ++ * is set by activate_task() and cleared by deactivate_task(), under
1015 ++ * rq->lock. Non-zero indicates the task is runnable, the special
1016 ++ * ON_RQ_MIGRATING state is used for migration without holding both
1017 ++ * rq->locks. It indicates task_cpu() is not stable, see task_rq_lock().
1018 ++ *
1019 ++ * p->on_cpu <- { 0, 1 }:
1020 ++ *
1021 ++ * is set by prepare_task() and cleared by finish_task() such that it will be
1022 ++ * set before p is scheduled-in and cleared after p is scheduled-out, both
1023 ++ * under rq->lock. Non-zero indicates the task is running on its CPU.
1024 ++ *
1025 ++ * [ The astute reader will observe that it is possible for two tasks on one
1026 ++ * CPU to have ->on_cpu = 1 at the same time. ]
1027 ++ *
1028 ++ * task_cpu(p): is changed by set_task_cpu(), the rules are:
1029 ++ *
1030 ++ * - Don't call set_task_cpu() on a blocked task:
1031 ++ *
1032 ++ * We don't care what CPU we're not running on, this simplifies hotplug,
1033 ++ * the CPU assignment of blocked tasks isn't required to be valid.
1034 ++ *
1035 ++ * - for try_to_wake_up(), called under p->pi_lock:
1036 ++ *
1037 ++ * This allows try_to_wake_up() to only take one rq->lock, see its comment.
1038 ++ *
1039 ++ * - for migration called under rq->lock:
1040 ++ * [ see task_on_rq_migrating() in task_rq_lock() ]
1041 ++ *
1042 ++ * o move_queued_task()
1043 ++ * o detach_task()
1044 ++ *
1045 ++ * - for migration called under double_rq_lock():
1046 ++ *
1047 ++ * o __migrate_swap_task()
1048 ++ * o push_rt_task() / pull_rt_task()
1049 ++ * o push_dl_task() / pull_dl_task()
1050 ++ * o dl_task_offline_migration()
1051 ++ *
1052 ++ */
1053 ++
1054 ++/*
1055 ++ * Context: p->pi_lock
1056 ++ */
1057 ++static inline struct rq
1058 ++*__task_access_lock(struct task_struct *p, raw_spinlock_t **plock)
1059 ++{
1060 ++ struct rq *rq;
1061 ++ for (;;) {
1062 ++ rq = task_rq(p);
1063 ++ if (p->on_cpu || task_on_rq_queued(p)) {
1064 ++ raw_spin_lock(&rq->lock);
1065 ++ if (likely((p->on_cpu || task_on_rq_queued(p))
1066 ++ && rq == task_rq(p))) {
1067 ++ *plock = &rq->lock;
1068 ++ return rq;
1069 ++ }
1070 ++ raw_spin_unlock(&rq->lock);
1071 ++ } else if (task_on_rq_migrating(p)) {
1072 ++ do {
1073 ++ cpu_relax();
1074 ++ } while (unlikely(task_on_rq_migrating(p)));
1075 ++ } else {
1076 ++ *plock = NULL;
1077 ++ return rq;
1078 ++ }
1079 ++ }
1080 ++}
1081 ++
1082 ++static inline void
1083 ++__task_access_unlock(struct task_struct *p, raw_spinlock_t *lock)
1084 ++{
1085 ++ if (NULL != lock)
1086 ++ raw_spin_unlock(lock);
1087 ++}
1088 ++
1089 ++static inline struct rq
1090 ++*task_access_lock_irqsave(struct task_struct *p, raw_spinlock_t **plock,
1091 ++ unsigned long *flags)
1092 ++{
1093 ++ struct rq *rq;
1094 ++ for (;;) {
1095 ++ rq = task_rq(p);
1096 ++ if (p->on_cpu || task_on_rq_queued(p)) {
1097 ++ raw_spin_lock_irqsave(&rq->lock, *flags);
1098 ++ if (likely((p->on_cpu || task_on_rq_queued(p))
1099 ++ && rq == task_rq(p))) {
1100 ++ *plock = &rq->lock;
1101 ++ return rq;
1102 ++ }
1103 ++ raw_spin_unlock_irqrestore(&rq->lock, *flags);
1104 ++ } else if (task_on_rq_migrating(p)) {
1105 ++ do {
1106 ++ cpu_relax();
1107 ++ } while (unlikely(task_on_rq_migrating(p)));
1108 ++ } else {
1109 ++ raw_spin_lock_irqsave(&p->pi_lock, *flags);
1110 ++ if (likely(!p->on_cpu && !p->on_rq &&
1111 ++ rq == task_rq(p))) {
1112 ++ *plock = &p->pi_lock;
1113 ++ return rq;
1114 ++ }
1115 ++ raw_spin_unlock_irqrestore(&p->pi_lock, *flags);
1116 ++ }
1117 ++ }
1118 ++}
1119 ++
1120 ++static inline void
1121 ++task_access_unlock_irqrestore(struct task_struct *p, raw_spinlock_t *lock,
1122 ++ unsigned long *flags)
1123 ++{
1124 ++ raw_spin_unlock_irqrestore(lock, *flags);
1125 ++}
1126 ++
1127 ++/*
1128 ++ * __task_rq_lock - lock the rq @p resides on.
1129 ++ */
1130 ++struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
1131 ++ __acquires(rq->lock)
1132 ++{
1133 ++ struct rq *rq;
1134 ++
1135 ++ lockdep_assert_held(&p->pi_lock);
1136 ++
1137 ++ for (;;) {
1138 ++ rq = task_rq(p);
1139 ++ raw_spin_lock(&rq->lock);
1140 ++ if (likely(rq == task_rq(p) && !task_on_rq_migrating(p)))
1141 ++ return rq;
1142 ++ raw_spin_unlock(&rq->lock);
1143 ++
1144 ++ while (unlikely(task_on_rq_migrating(p)))
1145 ++ cpu_relax();
1146 ++ }
1147 ++}
1148 ++
1149 ++/*
1150 ++ * task_rq_lock - lock p->pi_lock and lock the rq @p resides on.
1151 ++ */
1152 ++struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
1153 ++ __acquires(p->pi_lock)
1154 ++ __acquires(rq->lock)
1155 ++{
1156 ++ struct rq *rq;
1157 ++
1158 ++ for (;;) {
1159 ++ raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
1160 ++ rq = task_rq(p);
1161 ++ raw_spin_lock(&rq->lock);
1162 ++ /*
1163 ++ * move_queued_task() task_rq_lock()
1164 ++ *
1165 ++ * ACQUIRE (rq->lock)
1166 ++ * [S] ->on_rq = MIGRATING [L] rq = task_rq()
1167 ++ * WMB (__set_task_cpu()) ACQUIRE (rq->lock);
1168 ++ * [S] ->cpu = new_cpu [L] task_rq()
1169 ++ * [L] ->on_rq
1170 ++ * RELEASE (rq->lock)
1171 ++ *
1172 ++ * If we observe the old CPU in task_rq_lock(), the acquire of
1173 ++ * the old rq->lock will fully serialize against the stores.
1174 ++ *
1175 ++ * If we observe the new CPU in task_rq_lock(), the address
1176 ++ * dependency headed by '[L] rq = task_rq()' and the acquire
1177 ++ * will pair with the WMB to ensure we then also see migrating.
1178 ++ */
1179 ++ if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
1180 ++ return rq;
1181 ++ }
1182 ++ raw_spin_unlock(&rq->lock);
1183 ++ raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
1184 ++
1185 ++ while (unlikely(task_on_rq_migrating(p)))
1186 ++ cpu_relax();
1187 ++ }
1188 ++}
1189 ++
1190 ++static inline void
1191 ++rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
1192 ++ __acquires(rq->lock)
1193 ++{
1194 ++ raw_spin_lock_irqsave(&rq->lock, rf->flags);
1195 ++}
1196 ++
1197 ++static inline void
1198 ++rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
1199 ++ __releases(rq->lock)
1200 ++{
1201 ++ raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
1202 ++}
1203 ++
1204 ++void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
1205 ++{
1206 ++ raw_spinlock_t *lock;
1207 ++
1208 ++ /* Matches synchronize_rcu() in __sched_core_enable() */
1209 ++ preempt_disable();
1210 ++
1211 ++ for (;;) {
1212 ++ lock = __rq_lockp(rq);
1213 ++ raw_spin_lock_nested(lock, subclass);
1214 ++ if (likely(lock == __rq_lockp(rq))) {
1215 ++ /* preempt_count *MUST* be > 1 */
1216 ++ preempt_enable_no_resched();
1217 ++ return;
1218 ++ }
1219 ++ raw_spin_unlock(lock);
1220 ++ }
1221 ++}
1222 ++
1223 ++void raw_spin_rq_unlock(struct rq *rq)
1224 ++{
1225 ++ raw_spin_unlock(rq_lockp(rq));
1226 ++}
1227 ++
1228 ++/*
1229 ++ * RQ-clock updating methods:
1230 ++ */
1231 ++
1232 ++static void update_rq_clock_task(struct rq *rq, s64 delta)
1233 ++{
1234 ++/*
1235 ++ * In theory, the compile should just see 0 here, and optimize out the call
1236 ++ * to sched_rt_avg_update. But I don't trust it...
1237 ++ */
1238 ++ s64 __maybe_unused steal = 0, irq_delta = 0;
1239 ++
1240 ++#ifdef CONFIG_IRQ_TIME_ACCOUNTING
1241 ++ irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
1242 ++
1243 ++ /*
1244 ++ * Since irq_time is only updated on {soft,}irq_exit, we might run into
1245 ++ * this case when a previous update_rq_clock() happened inside a
1246 ++ * {soft,}irq region.
1247 ++ *
1248 ++ * When this happens, we stop ->clock_task and only update the
1249 ++ * prev_irq_time stamp to account for the part that fit, so that a next
1250 ++ * update will consume the rest. This ensures ->clock_task is
1251 ++ * monotonic.
1252 ++ *
1253 ++ * It does however cause some slight miss-attribution of {soft,}irq
1254 ++ * time, a more accurate solution would be to update the irq_time using
1255 ++ * the current rq->clock timestamp, except that would require using
1256 ++ * atomic ops.
1257 ++ */
1258 ++ if (irq_delta > delta)
1259 ++ irq_delta = delta;
1260 ++
1261 ++ rq->prev_irq_time += irq_delta;
1262 ++ delta -= irq_delta;
1263 ++#endif
1264 ++#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
1265 ++ if (static_key_false((&paravirt_steal_rq_enabled))) {
1266 ++ steal = paravirt_steal_clock(cpu_of(rq));
1267 ++ steal -= rq->prev_steal_time_rq;
1268 ++
1269 ++ if (unlikely(steal > delta))
1270 ++ steal = delta;
1271 ++
1272 ++ rq->prev_steal_time_rq += steal;
1273 ++ delta -= steal;
1274 ++ }
1275 ++#endif
1276 ++
1277 ++ rq->clock_task += delta;
1278 ++
1279 ++#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
1280 ++ if ((irq_delta + steal))
1281 ++ update_irq_load_avg(rq, irq_delta + steal);
1282 ++#endif
1283 ++}
1284 ++
1285 ++static inline void update_rq_clock(struct rq *rq)
1286 ++{
1287 ++ s64 delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
1288 ++
1289 ++ if (unlikely(delta <= 0))
1290 ++ return;
1291 ++ rq->clock += delta;
1292 ++ update_rq_time_edge(rq);
1293 ++ update_rq_clock_task(rq, delta);
1294 ++}
1295 ++
1296 ++#ifdef CONFIG_NO_HZ_FULL
1297 ++/*
1298 ++ * Tick may be needed by tasks in the runqueue depending on their policy and
1299 ++ * requirements. If tick is needed, lets send the target an IPI to kick it out
1300 ++ * of nohz mode if necessary.
1301 ++ */
1302 ++static inline void sched_update_tick_dependency(struct rq *rq)
1303 ++{
1304 ++ int cpu = cpu_of(rq);
1305 ++
1306 ++ if (!tick_nohz_full_cpu(cpu))
1307 ++ return;
1308 ++
1309 ++ if (rq->nr_running < 2)
1310 ++ tick_nohz_dep_clear_cpu(cpu, TICK_DEP_BIT_SCHED);
1311 ++ else
1312 ++ tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
1313 ++}
1314 ++#else /* !CONFIG_NO_HZ_FULL */
1315 ++static inline void sched_update_tick_dependency(struct rq *rq) { }
1316 ++#endif
1317 ++
1318 ++bool sched_task_on_rq(struct task_struct *p)
1319 ++{
1320 ++ return task_on_rq_queued(p);
1321 ++}
1322 ++
1323 ++/*
1324 ++ * Add/Remove/Requeue task to/from the runqueue routines
1325 ++ * Context: rq->lock
1326 ++ */
1327 ++#define __SCHED_DEQUEUE_TASK(p, rq, flags, func) \
1328 ++ psi_dequeue(p, flags & DEQUEUE_SLEEP); \
1329 ++ sched_info_dequeue(rq, p); \
1330 ++ \
1331 ++ list_del(&p->sq_node); \
1332 ++ if (list_empty(&rq->queue.heads[p->sq_idx])) { \
1333 ++ clear_bit(sched_idx2prio(p->sq_idx, rq), \
1334 ++ rq->queue.bitmap); \
1335 ++ func; \
1336 ++ }
1337 ++
1338 ++#define __SCHED_ENQUEUE_TASK(p, rq, flags) \
1339 ++ sched_info_enqueue(rq, p); \
1340 ++ psi_enqueue(p, flags); \
1341 ++ \
1342 ++ p->sq_idx = task_sched_prio_idx(p, rq); \
1343 ++ list_add_tail(&p->sq_node, &rq->queue.heads[p->sq_idx]); \
1344 ++ set_bit(sched_idx2prio(p->sq_idx, rq), rq->queue.bitmap);
1345 ++
1346 ++static inline void dequeue_task(struct task_struct *p, struct rq *rq, int flags)
1347 ++{
1348 ++ lockdep_assert_held(&rq->lock);
1349 ++
1350 ++ /*printk(KERN_INFO "sched: dequeue(%d) %px %016llx\n", cpu_of(rq), p, p->priodl);*/
1351 ++ WARN_ONCE(task_rq(p) != rq, "sched: dequeue task reside on cpu%d from cpu%d\n",
1352 ++ task_cpu(p), cpu_of(rq));
1353 ++
1354 ++ __SCHED_DEQUEUE_TASK(p, rq, flags, update_sched_rq_watermark(rq));
1355 ++ --rq->nr_running;
1356 ++#ifdef CONFIG_SMP
1357 ++ if (1 == rq->nr_running)
1358 ++ cpumask_clear_cpu(cpu_of(rq), &sched_rq_pending_mask);
1359 ++#endif
1360 ++
1361 ++ sched_update_tick_dependency(rq);
1362 ++}
1363 ++
1364 ++static inline void enqueue_task(struct task_struct *p, struct rq *rq, int flags)
1365 ++{
1366 ++ lockdep_assert_held(&rq->lock);
1367 ++
1368 ++ /*printk(KERN_INFO "sched: enqueue(%d) %px %016llx\n", cpu_of(rq), p, p->priodl);*/
1369 ++ WARN_ONCE(task_rq(p) != rq, "sched: enqueue task reside on cpu%d to cpu%d\n",
1370 ++ task_cpu(p), cpu_of(rq));
1371 ++
1372 ++ __SCHED_ENQUEUE_TASK(p, rq, flags);
1373 ++ update_sched_rq_watermark(rq);
1374 ++ ++rq->nr_running;
1375 ++#ifdef CONFIG_SMP
1376 ++ if (2 == rq->nr_running)
1377 ++ cpumask_set_cpu(cpu_of(rq), &sched_rq_pending_mask);
1378 ++#endif
1379 ++
1380 ++ sched_update_tick_dependency(rq);
1381 ++}
1382 ++
1383 ++static inline void requeue_task(struct task_struct *p, struct rq *rq)
1384 ++{
1385 ++ int idx;
1386 ++
1387 ++ lockdep_assert_held(&rq->lock);
1388 ++ /*printk(KERN_INFO "sched: requeue(%d) %px %016llx\n", cpu_of(rq), p, p->priodl);*/
1389 ++ WARN_ONCE(task_rq(p) != rq, "sched: cpu[%d] requeue task reside on cpu%d\n",
1390 ++ cpu_of(rq), task_cpu(p));
1391 ++
1392 ++ idx = task_sched_prio_idx(p, rq);
1393 ++
1394 ++ list_del(&p->sq_node);
1395 ++ list_add_tail(&p->sq_node, &rq->queue.heads[idx]);
1396 ++ if (idx != p->sq_idx) {
1397 ++ if (list_empty(&rq->queue.heads[p->sq_idx]))
1398 ++ clear_bit(sched_idx2prio(p->sq_idx, rq),
1399 ++ rq->queue.bitmap);
1400 ++ p->sq_idx = idx;
1401 ++ set_bit(sched_idx2prio(p->sq_idx, rq), rq->queue.bitmap);
1402 ++ update_sched_rq_watermark(rq);
1403 ++ }
1404 ++}
1405 ++
1406 ++/*
1407 ++ * cmpxchg based fetch_or, macro so it works for different integer types
1408 ++ */
1409 ++#define fetch_or(ptr, mask) \
1410 ++ ({ \
1411 ++ typeof(ptr) _ptr = (ptr); \
1412 ++ typeof(mask) _mask = (mask); \
1413 ++ typeof(*_ptr) _old, _val = *_ptr; \
1414 ++ \
1415 ++ for (;;) { \
1416 ++ _old = cmpxchg(_ptr, _val, _val | _mask); \
1417 ++ if (_old == _val) \
1418 ++ break; \
1419 ++ _val = _old; \
1420 ++ } \
1421 ++ _old; \
1422 ++})
1423 ++
1424 ++#if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
1425 ++/*
1426 ++ * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
1427 ++ * this avoids any races wrt polling state changes and thereby avoids
1428 ++ * spurious IPIs.
1429 ++ */
1430 ++static bool set_nr_and_not_polling(struct task_struct *p)
1431 ++{
1432 ++ struct thread_info *ti = task_thread_info(p);
1433 ++ return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
1434 ++}
1435 ++
1436 ++/*
1437 ++ * Atomically set TIF_NEED_RESCHED if TIF_POLLING_NRFLAG is set.
1438 ++ *
1439 ++ * If this returns true, then the idle task promises to call
1440 ++ * sched_ttwu_pending() and reschedule soon.
1441 ++ */
1442 ++static bool set_nr_if_polling(struct task_struct *p)
1443 ++{
1444 ++ struct thread_info *ti = task_thread_info(p);
1445 ++ typeof(ti->flags) old, val = READ_ONCE(ti->flags);
1446 ++
1447 ++ for (;;) {
1448 ++ if (!(val & _TIF_POLLING_NRFLAG))
1449 ++ return false;
1450 ++ if (val & _TIF_NEED_RESCHED)
1451 ++ return true;
1452 ++ old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
1453 ++ if (old == val)
1454 ++ break;
1455 ++ val = old;
1456 ++ }
1457 ++ return true;
1458 ++}
1459 ++
1460 ++#else
1461 ++static bool set_nr_and_not_polling(struct task_struct *p)
1462 ++{
1463 ++ set_tsk_need_resched(p);
1464 ++ return true;
1465 ++}
1466 ++
1467 ++#ifdef CONFIG_SMP
1468 ++static bool set_nr_if_polling(struct task_struct *p)
1469 ++{
1470 ++ return false;
1471 ++}
1472 ++#endif
1473 ++#endif
1474 ++
1475 ++static bool __wake_q_add(struct wake_q_head *head, struct task_struct *task)
1476 ++{
1477 ++ struct wake_q_node *node = &task->wake_q;
1478 ++
1479 ++ /*
1480 ++ * Atomically grab the task, if ->wake_q is !nil already it means
1481 ++ * it's already queued (either by us or someone else) and will get the
1482 ++ * wakeup due to that.
1483 ++ *
1484 ++ * In order to ensure that a pending wakeup will observe our pending
1485 ++ * state, even in the failed case, an explicit smp_mb() must be used.
1486 ++ */
1487 ++ smp_mb__before_atomic();
1488 ++ if (unlikely(cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL)))
1489 ++ return false;
1490 ++
1491 ++ /*
1492 ++ * The head is context local, there can be no concurrency.
1493 ++ */
1494 ++ *head->lastp = node;
1495 ++ head->lastp = &node->next;
1496 ++ return true;
1497 ++}
1498 ++
1499 ++/**
1500 ++ * wake_q_add() - queue a wakeup for 'later' waking.
1501 ++ * @head: the wake_q_head to add @task to
1502 ++ * @task: the task to queue for 'later' wakeup
1503 ++ *
1504 ++ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
1505 ++ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
1506 ++ * instantly.
1507 ++ *
1508 ++ * This function must be used as-if it were wake_up_process(); IOW the task
1509 ++ * must be ready to be woken at this location.
1510 ++ */
1511 ++void wake_q_add(struct wake_q_head *head, struct task_struct *task)
1512 ++{
1513 ++ if (__wake_q_add(head, task))
1514 ++ get_task_struct(task);
1515 ++}
1516 ++
1517 ++/**
1518 ++ * wake_q_add_safe() - safely queue a wakeup for 'later' waking.
1519 ++ * @head: the wake_q_head to add @task to
1520 ++ * @task: the task to queue for 'later' wakeup
1521 ++ *
1522 ++ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
1523 ++ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
1524 ++ * instantly.
1525 ++ *
1526 ++ * This function must be used as-if it were wake_up_process(); IOW the task
1527 ++ * must be ready to be woken at this location.
1528 ++ *
1529 ++ * This function is essentially a task-safe equivalent to wake_q_add(). Callers
1530 ++ * that already hold reference to @task can call the 'safe' version and trust
1531 ++ * wake_q to do the right thing depending whether or not the @task is already
1532 ++ * queued for wakeup.
1533 ++ */
1534 ++void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task)
1535 ++{
1536 ++ if (!__wake_q_add(head, task))
1537 ++ put_task_struct(task);
1538 ++}
1539 ++
1540 ++void wake_up_q(struct wake_q_head *head)
1541 ++{
1542 ++ struct wake_q_node *node = head->first;
1543 ++
1544 ++ while (node != WAKE_Q_TAIL) {
1545 ++ struct task_struct *task;
1546 ++
1547 ++ task = container_of(node, struct task_struct, wake_q);
1548 ++ /* task can safely be re-inserted now: */
1549 ++ node = node->next;
1550 ++ task->wake_q.next = NULL;
1551 ++
1552 ++ /*
1553 ++ * wake_up_process() executes a full barrier, which pairs with
1554 ++ * the queueing in wake_q_add() so as not to miss wakeups.
1555 ++ */
1556 ++ wake_up_process(task);
1557 ++ put_task_struct(task);
1558 ++ }
1559 ++}
1560 ++
1561 ++/*
1562 ++ * resched_curr - mark rq's current task 'to be rescheduled now'.
1563 ++ *
1564 ++ * On UP this means the setting of the need_resched flag, on SMP it
1565 ++ * might also involve a cross-CPU call to trigger the scheduler on
1566 ++ * the target CPU.
1567 ++ */
1568 ++void resched_curr(struct rq *rq)
1569 ++{
1570 ++ struct task_struct *curr = rq->curr;
1571 ++ int cpu;
1572 ++
1573 ++ lockdep_assert_held(&rq->lock);
1574 ++
1575 ++ if (test_tsk_need_resched(curr))
1576 ++ return;
1577 ++
1578 ++ cpu = cpu_of(rq);
1579 ++ if (cpu == smp_processor_id()) {
1580 ++ set_tsk_need_resched(curr);
1581 ++ set_preempt_need_resched();
1582 ++ return;
1583 ++ }
1584 ++
1585 ++ if (set_nr_and_not_polling(curr))
1586 ++ smp_send_reschedule(cpu);
1587 ++ else
1588 ++ trace_sched_wake_idle_without_ipi(cpu);
1589 ++}
1590 ++
1591 ++void resched_cpu(int cpu)
1592 ++{
1593 ++ struct rq *rq = cpu_rq(cpu);
1594 ++ unsigned long flags;
1595 ++
1596 ++ raw_spin_lock_irqsave(&rq->lock, flags);
1597 ++ if (cpu_online(cpu) || cpu == smp_processor_id())
1598 ++ resched_curr(cpu_rq(cpu));
1599 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
1600 ++}
1601 ++
1602 ++#ifdef CONFIG_SMP
1603 ++#ifdef CONFIG_NO_HZ_COMMON
1604 ++void nohz_balance_enter_idle(int cpu) {}
1605 ++
1606 ++void select_nohz_load_balancer(int stop_tick) {}
1607 ++
1608 ++void set_cpu_sd_state_idle(void) {}
1609 ++
1610 ++/*
1611 ++ * In the semi idle case, use the nearest busy CPU for migrating timers
1612 ++ * from an idle CPU. This is good for power-savings.
1613 ++ *
1614 ++ * We don't do similar optimization for completely idle system, as
1615 ++ * selecting an idle CPU will add more delays to the timers than intended
1616 ++ * (as that CPU's timer base may not be uptodate wrt jiffies etc).
1617 ++ */
1618 ++int get_nohz_timer_target(void)
1619 ++{
1620 ++ int i, cpu = smp_processor_id(), default_cpu = -1;
1621 ++ struct cpumask *mask;
1622 ++
1623 ++ if (housekeeping_cpu(cpu, HK_FLAG_TIMER)) {
1624 ++ if (!idle_cpu(cpu))
1625 ++ return cpu;
1626 ++ default_cpu = cpu;
1627 ++ }
1628 ++
1629 ++ for (mask = per_cpu(sched_cpu_topo_masks, cpu) + 1;
1630 ++ mask < per_cpu(sched_cpu_topo_end_mask, cpu); mask++)
1631 ++ for_each_cpu_and(i, mask, housekeeping_cpumask(HK_FLAG_TIMER))
1632 ++ if (!idle_cpu(i))
1633 ++ return i;
1634 ++
1635 ++ if (default_cpu == -1)
1636 ++ default_cpu = housekeeping_any_cpu(HK_FLAG_TIMER);
1637 ++ cpu = default_cpu;
1638 ++
1639 ++ return cpu;
1640 ++}
1641 ++
1642 ++/*
1643 ++ * When add_timer_on() enqueues a timer into the timer wheel of an
1644 ++ * idle CPU then this timer might expire before the next timer event
1645 ++ * which is scheduled to wake up that CPU. In case of a completely
1646 ++ * idle system the next event might even be infinite time into the
1647 ++ * future. wake_up_idle_cpu() ensures that the CPU is woken up and
1648 ++ * leaves the inner idle loop so the newly added timer is taken into
1649 ++ * account when the CPU goes back to idle and evaluates the timer
1650 ++ * wheel for the next timer event.
1651 ++ */
1652 ++static inline void wake_up_idle_cpu(int cpu)
1653 ++{
1654 ++ struct rq *rq = cpu_rq(cpu);
1655 ++
1656 ++ if (cpu == smp_processor_id())
1657 ++ return;
1658 ++
1659 ++ if (set_nr_and_not_polling(rq->idle))
1660 ++ smp_send_reschedule(cpu);
1661 ++ else
1662 ++ trace_sched_wake_idle_without_ipi(cpu);
1663 ++}
1664 ++
1665 ++static inline bool wake_up_full_nohz_cpu(int cpu)
1666 ++{
1667 ++ /*
1668 ++ * We just need the target to call irq_exit() and re-evaluate
1669 ++ * the next tick. The nohz full kick at least implies that.
1670 ++ * If needed we can still optimize that later with an
1671 ++ * empty IRQ.
1672 ++ */
1673 ++ if (cpu_is_offline(cpu))
1674 ++ return true; /* Don't try to wake offline CPUs. */
1675 ++ if (tick_nohz_full_cpu(cpu)) {
1676 ++ if (cpu != smp_processor_id() ||
1677 ++ tick_nohz_tick_stopped())
1678 ++ tick_nohz_full_kick_cpu(cpu);
1679 ++ return true;
1680 ++ }
1681 ++
1682 ++ return false;
1683 ++}
1684 ++
1685 ++void wake_up_nohz_cpu(int cpu)
1686 ++{
1687 ++ if (!wake_up_full_nohz_cpu(cpu))
1688 ++ wake_up_idle_cpu(cpu);
1689 ++}
1690 ++
1691 ++static void nohz_csd_func(void *info)
1692 ++{
1693 ++ struct rq *rq = info;
1694 ++ int cpu = cpu_of(rq);
1695 ++ unsigned int flags;
1696 ++
1697 ++ /*
1698 ++ * Release the rq::nohz_csd.
1699 ++ */
1700 ++ flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));
1701 ++ WARN_ON(!(flags & NOHZ_KICK_MASK));
1702 ++
1703 ++ rq->idle_balance = idle_cpu(cpu);
1704 ++ if (rq->idle_balance && !need_resched()) {
1705 ++ rq->nohz_idle_balance = flags;
1706 ++ raise_softirq_irqoff(SCHED_SOFTIRQ);
1707 ++ }
1708 ++}
1709 ++
1710 ++#endif /* CONFIG_NO_HZ_COMMON */
1711 ++#endif /* CONFIG_SMP */
1712 ++
1713 ++static inline void check_preempt_curr(struct rq *rq)
1714 ++{
1715 ++ if (sched_rq_first_task(rq) != rq->curr)
1716 ++ resched_curr(rq);
1717 ++}
1718 ++
1719 ++#ifdef CONFIG_SCHED_HRTICK
1720 ++/*
1721 ++ * Use HR-timers to deliver accurate preemption points.
1722 ++ */
1723 ++
1724 ++static void hrtick_clear(struct rq *rq)
1725 ++{
1726 ++ if (hrtimer_active(&rq->hrtick_timer))
1727 ++ hrtimer_cancel(&rq->hrtick_timer);
1728 ++}
1729 ++
1730 ++/*
1731 ++ * High-resolution timer tick.
1732 ++ * Runs from hardirq context with interrupts disabled.
1733 ++ */
1734 ++static enum hrtimer_restart hrtick(struct hrtimer *timer)
1735 ++{
1736 ++ struct rq *rq = container_of(timer, struct rq, hrtick_timer);
1737 ++
1738 ++ WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
1739 ++
1740 ++ raw_spin_lock(&rq->lock);
1741 ++ resched_curr(rq);
1742 ++ raw_spin_unlock(&rq->lock);
1743 ++
1744 ++ return HRTIMER_NORESTART;
1745 ++}
1746 ++
1747 ++/*
1748 ++ * Use hrtick when:
1749 ++ * - enabled by features
1750 ++ * - hrtimer is actually high res
1751 ++ */
1752 ++static inline int hrtick_enabled(struct rq *rq)
1753 ++{
1754 ++ /**
1755 ++ * Alt schedule FW doesn't support sched_feat yet
1756 ++ if (!sched_feat(HRTICK))
1757 ++ return 0;
1758 ++ */
1759 ++ if (!cpu_active(cpu_of(rq)))
1760 ++ return 0;
1761 ++ return hrtimer_is_hres_active(&rq->hrtick_timer);
1762 ++}
1763 ++
1764 ++#ifdef CONFIG_SMP
1765 ++
1766 ++static void __hrtick_restart(struct rq *rq)
1767 ++{
1768 ++ struct hrtimer *timer = &rq->hrtick_timer;
1769 ++ ktime_t time = rq->hrtick_time;
1770 ++
1771 ++ hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
1772 ++}
1773 ++
1774 ++/*
1775 ++ * called from hardirq (IPI) context
1776 ++ */
1777 ++static void __hrtick_start(void *arg)
1778 ++{
1779 ++ struct rq *rq = arg;
1780 ++
1781 ++ raw_spin_lock(&rq->lock);
1782 ++ __hrtick_restart(rq);
1783 ++ raw_spin_unlock(&rq->lock);
1784 ++}
1785 ++
1786 ++/*
1787 ++ * Called to set the hrtick timer state.
1788 ++ *
1789 ++ * called with rq->lock held and irqs disabled
1790 ++ */
1791 ++void hrtick_start(struct rq *rq, u64 delay)
1792 ++{
1793 ++ struct hrtimer *timer = &rq->hrtick_timer;
1794 ++ s64 delta;
1795 ++
1796 ++ /*
1797 ++	 * Don't schedule slices shorter than 10000ns; that just
1798 ++	 * doesn't make sense and can cause a timer DoS.
1799 ++ */
1800 ++ delta = max_t(s64, delay, 10000LL);
1801 ++
1802 ++ rq->hrtick_time = ktime_add_ns(timer->base->get_time(), delta);
1803 ++
1804 ++ if (rq == this_rq())
1805 ++ __hrtick_restart(rq);
1806 ++ else
1807 ++ smp_call_function_single_async(cpu_of(rq), &rq->hrtick_csd);
1808 ++}
1809 ++
1810 ++#else
1811 ++/*
1812 ++ * Called to set the hrtick timer state.
1813 ++ *
1814 ++ * called with rq->lock held and irqs disabled
1815 ++ */
1816 ++void hrtick_start(struct rq *rq, u64 delay)
1817 ++{
1818 ++ /*
1819 ++	 * Don't schedule slices shorter than 10000ns; that just
1820 ++ * doesn't make sense. Rely on vruntime for fairness.
1821 ++ */
1822 ++ delay = max_t(u64, delay, 10000LL);
1823 ++ hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay),
1824 ++ HRTIMER_MODE_REL_PINNED_HARD);
1825 ++}
1826 ++#endif /* CONFIG_SMP */
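++
++/*
++ * Usage note (a sketch based on this file, not upstream documentation):
++ * callers arm the tick with a delay, in nanoseconds, after which the current
++ * task should be preempted.  For example, the fork path later in this file
++ * re-arms it with the parent's remaining slice:
++ *
++ *	hrtick_start(rq, rq->curr->time_slice);
++ */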
1827 ++
1828 ++static void hrtick_rq_init(struct rq *rq)
1829 ++{
1830 ++#ifdef CONFIG_SMP
1831 ++ INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
1832 ++#endif
1833 ++
1834 ++ hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
1835 ++ rq->hrtick_timer.function = hrtick;
1836 ++}
1837 ++#else /* CONFIG_SCHED_HRTICK */
1838 ++static inline int hrtick_enabled(struct rq *rq)
1839 ++{
1840 ++ return 0;
1841 ++}
1842 ++
1843 ++static inline void hrtick_clear(struct rq *rq)
1844 ++{
1845 ++}
1846 ++
1847 ++static inline void hrtick_rq_init(struct rq *rq)
1848 ++{
1849 ++}
1850 ++#endif /* CONFIG_SCHED_HRTICK */
1851 ++
1852 ++static inline int __normal_prio(int policy, int rt_prio, int static_prio)
1853 ++{
1854 ++ return rt_policy(policy) ? (MAX_RT_PRIO - 1 - rt_prio) :
1855 ++ static_prio + MAX_PRIORITY_ADJ;
1856 ++}
1857 ++
1858 ++/*
1859 ++ * Calculate the expected normal priority: i.e. priority
1860 ++ * without taking RT-inheritance into account. Might be
1861 ++ * boosted by interactivity modifiers. Changes upon fork,
1862 ++ * setprio syscalls, and whenever the interactivity
1863 ++ * estimator recalculates.
1864 ++ */
1865 ++static inline int normal_prio(struct task_struct *p)
1866 ++{
1867 ++ return __normal_prio(p->policy, p->rt_priority, p->static_prio);
1868 ++}
1869 ++
1870 ++/*
1871 ++ * Calculate the current priority, i.e. the priority
1872 ++ * taken into account by the scheduler. This value might
1873 ++ * be boosted by RT inheritance: it will be RT if the task got
1874 ++ * RT-boosted; if not, it returns p->normal_prio.
1875 ++ */
1876 ++static int effective_prio(struct task_struct *p)
1877 ++{
1878 ++ p->normal_prio = normal_prio(p);
1879 ++ /*
1880 ++ * If we are RT tasks or we were boosted to RT priority,
1881 ++ * keep the priority unchanged. Otherwise, update priority
1882 ++ * to the normal priority:
1883 ++ */
1884 ++ if (!rt_prio(p->prio))
1885 ++ return p->normal_prio;
1886 ++ return p->prio;
1887 ++}
1888 ++
1889 ++/*
1890 ++ * activate_task - move a task to the runqueue.
1891 ++ *
1892 ++ * Context: rq->lock
1893 ++ */
1894 ++static void activate_task(struct task_struct *p, struct rq *rq)
1895 ++{
1896 ++ enqueue_task(p, rq, ENQUEUE_WAKEUP);
1897 ++ p->on_rq = TASK_ON_RQ_QUEUED;
1898 ++
1899 ++ /*
1900 ++ * If in_iowait is set, the code below may not trigger any cpufreq
1901 ++ * utilization updates, so do it here explicitly with the IOWAIT flag
1902 ++ * passed.
1903 ++ */
1904 ++ cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT * p->in_iowait);
1905 ++}
1906 ++
1907 ++/*
1908 ++ * deactivate_task - remove a task from the runqueue.
1909 ++ *
1910 ++ * Context: rq->lock
1911 ++ */
1912 ++static inline void deactivate_task(struct task_struct *p, struct rq *rq)
1913 ++{
1914 ++ dequeue_task(p, rq, DEQUEUE_SLEEP);
1915 ++ p->on_rq = 0;
1916 ++ cpufreq_update_util(rq, 0);
1917 ++}
1918 ++
1919 ++static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
1920 ++{
1921 ++#ifdef CONFIG_SMP
1922 ++ /*
1923 ++ * After ->cpu is set up to a new value, task_access_lock(p, ...) can be
1924 ++ * successfully executed on another CPU. We must ensure that updates of
1925 ++ * per-task data have been completed by this moment.
1926 ++ */
1927 ++ smp_wmb();
1928 ++
1929 ++#ifdef CONFIG_THREAD_INFO_IN_TASK
1930 ++ WRITE_ONCE(p->cpu, cpu);
1931 ++#else
1932 ++ WRITE_ONCE(task_thread_info(p)->cpu, cpu);
1933 ++#endif
1934 ++#endif
1935 ++}
1936 ++
1937 ++static inline bool is_migration_disabled(struct task_struct *p)
1938 ++{
1939 ++#ifdef CONFIG_SMP
1940 ++ return p->migration_disabled;
1941 ++#else
1942 ++ return false;
1943 ++#endif
1944 ++}
1945 ++
1946 ++#define SCA_CHECK 0x01
1947 ++
1948 ++#ifdef CONFIG_SMP
1949 ++
1950 ++void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
1951 ++{
1952 ++#ifdef CONFIG_SCHED_DEBUG
1953 ++ unsigned int state = READ_ONCE(p->__state);
1954 ++
1955 ++ /*
1956 ++ * We should never call set_task_cpu() on a blocked task,
1957 ++ * ttwu() will sort out the placement.
1958 ++ */
1959 ++ WARN_ON_ONCE(state != TASK_RUNNING && state != TASK_WAKING && !p->on_rq);
1960 ++
1961 ++#ifdef CONFIG_LOCKDEP
1962 ++ /*
1963 ++ * The caller should hold either p->pi_lock or rq->lock, when changing
1964 ++ * a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.
1965 ++ *
1966 ++ * sched_move_task() holds both and thus holding either pins the cgroup,
1967 ++ * see task_group().
1968 ++ */
1969 ++ WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
1970 ++ lockdep_is_held(&task_rq(p)->lock)));
1971 ++#endif
1972 ++ /*
1973 ++ * Clearly, migrating tasks to offline CPUs is a fairly daft thing.
1974 ++ */
1975 ++ WARN_ON_ONCE(!cpu_online(new_cpu));
1976 ++
1977 ++ WARN_ON_ONCE(is_migration_disabled(p));
1978 ++#endif
1979 ++ if (task_cpu(p) == new_cpu)
1980 ++ return;
1981 ++ trace_sched_migrate_task(p, new_cpu);
1982 ++ rseq_migrate(p);
1983 ++ perf_event_task_migrate(p);
1984 ++
1985 ++ __set_task_cpu(p, new_cpu);
1986 ++}
1987 ++
1988 ++#define MDF_FORCE_ENABLED 0x80
1989 ++
1990 ++static void
1991 ++__do_set_cpus_ptr(struct task_struct *p, const struct cpumask *new_mask)
1992 ++{
1993 ++ /*
1994 ++ * This here violates the locking rules for affinity, since we're only
1995 ++ * supposed to change these variables while holding both rq->lock and
1996 ++ * p->pi_lock.
1997 ++ *
1998 ++ * HOWEVER, it magically works, because ttwu() is the only code that
1999 ++ * accesses these variables under p->pi_lock and only does so after
2000 ++ * smp_cond_load_acquire(&p->on_cpu, !VAL), and we're in __schedule()
2001 ++ * before finish_task().
2002 ++ *
2003 ++ * XXX do further audits, this smells like something putrid.
2004 ++ */
2005 ++ SCHED_WARN_ON(!p->on_cpu);
2006 ++ p->cpus_ptr = new_mask;
2007 ++}
2008 ++
2009 ++void migrate_disable(void)
2010 ++{
2011 ++ struct task_struct *p = current;
2012 ++ int cpu;
2013 ++
2014 ++ if (p->migration_disabled) {
2015 ++ p->migration_disabled++;
2016 ++ return;
2017 ++ }
2018 ++
2019 ++ preempt_disable();
2020 ++ cpu = smp_processor_id();
2021 ++ if (cpumask_test_cpu(cpu, &p->cpus_mask)) {
2022 ++ cpu_rq(cpu)->nr_pinned++;
2023 ++ p->migration_disabled = 1;
2024 ++ p->migration_flags &= ~MDF_FORCE_ENABLED;
2025 ++
2026 ++ /*
2027 ++ * Violates locking rules! see comment in __do_set_cpus_ptr().
2028 ++ */
2029 ++ if (p->cpus_ptr == &p->cpus_mask)
2030 ++ __do_set_cpus_ptr(p, cpumask_of(cpu));
2031 ++ }
2032 ++ preempt_enable();
2033 ++}
2034 ++EXPORT_SYMBOL_GPL(migrate_disable);
2035 ++
2036 ++void migrate_enable(void)
2037 ++{
2038 ++ struct task_struct *p = current;
2039 ++
2040 ++ if (0 == p->migration_disabled)
2041 ++ return;
2042 ++
2043 ++ if (p->migration_disabled > 1) {
2044 ++ p->migration_disabled--;
2045 ++ return;
2046 ++ }
2047 ++
2048 ++ /*
2049 ++ * Ensure stop_task runs either before or after this, and that
2050 ++ * __set_cpus_allowed_ptr(SCA_MIGRATE_ENABLE) doesn't schedule().
2051 ++ */
2052 ++ preempt_disable();
2053 ++ /*
2054 ++	 * Assumption: current should be running on an allowed CPU
2055 ++ */
2056 ++ WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &p->cpus_mask));
2057 ++ if (p->cpus_ptr != &p->cpus_mask)
2058 ++ __do_set_cpus_ptr(p, &p->cpus_mask);
2059 ++ /*
2060 ++ * Mustn't clear migration_disabled() until cpus_ptr points back at the
2061 ++ * regular cpus_mask, otherwise things that race (eg.
2062 ++ * select_fallback_rq) get confused.
2063 ++ */
2064 ++ barrier();
2065 ++ p->migration_disabled = 0;
2066 ++ this_rq()->nr_pinned--;
2067 ++ preempt_enable();
2068 ++}
2069 ++EXPORT_SYMBOL_GPL(migrate_enable);
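++
++/*
++ * Illustrative sketch (not code from this patch): a migrate_disable()
++ * section keeps the task on its current CPU without disabling preemption,
++ * and the calls nest, e.g.:
++ *
++ *	migrate_disable();
++ *	cpu = smp_processor_id();	// stable until migrate_enable()
++ *	do_per_cpu_work(cpu);		// hypothetical helper
++ *	migrate_enable();
++ */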
2070 ++
2071 ++static inline bool rq_has_pinned_tasks(struct rq *rq)
2072 ++{
2073 ++ return rq->nr_pinned;
2074 ++}
2075 ++
2076 ++/*
2077 ++ * Per-CPU kthreads are allowed to run on !active && online CPUs, see
2078 ++ * __set_cpus_allowed_ptr() and select_fallback_rq().
2079 ++ */
2080 ++static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
2081 ++{
2082 ++ /* When not in the task's cpumask, no point in looking further. */
2083 ++ if (!cpumask_test_cpu(cpu, p->cpus_ptr))
2084 ++ return false;
2085 ++
2086 ++ /* migrate_disabled() must be allowed to finish. */
2087 ++ if (is_migration_disabled(p))
2088 ++ return cpu_online(cpu);
2089 ++
2090 ++	/* Non-kernel threads are not allowed during either online or offline transitions. */
2091 ++ if (!(p->flags & PF_KTHREAD))
2092 ++ return cpu_active(cpu);
2093 ++
2094 ++ /* KTHREAD_IS_PER_CPU is always allowed. */
2095 ++ if (kthread_is_per_cpu(p))
2096 ++ return cpu_online(cpu);
2097 ++
2098 ++ /* Regular kernel threads don't get to stay during offline. */
2099 ++ if (cpu_dying(cpu))
2100 ++ return false;
2101 ++
2102 ++ /* But are allowed during online. */
2103 ++ return cpu_online(cpu);
2104 ++}
2105 ++
2106 ++/*
2107 ++ * This is how migration works:
2108 ++ *
2109 ++ * 1) we invoke migration_cpu_stop() on the target CPU using
2110 ++ * stop_one_cpu().
2111 ++ * 2) stopper starts to run (implicitly forcing the migrated thread
2112 ++ * off the CPU)
2113 ++ * 3) it checks whether the migrated task is still in the wrong runqueue.
2114 ++ * 4) if it's in the wrong runqueue then the migration thread removes
2115 ++ * it and puts it into the right queue.
2116 ++ * 5) stopper completes and stop_one_cpu() returns and the migration
2117 ++ * is done.
2118 ++ */
2119 ++
2120 ++/*
2121 ++ * move_queued_task - move a queued task to new rq.
2122 ++ *
2123 ++ * Returns (locked) new rq. Old rq's lock is released.
2124 ++ */
2125 ++static struct rq *move_queued_task(struct rq *rq, struct task_struct *p, int
2126 ++ new_cpu)
2127 ++{
2128 ++ lockdep_assert_held(&rq->lock);
2129 ++
2130 ++ WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
2131 ++ dequeue_task(p, rq, 0);
2132 ++ set_task_cpu(p, new_cpu);
2133 ++ raw_spin_unlock(&rq->lock);
2134 ++
2135 ++ rq = cpu_rq(new_cpu);
2136 ++
2137 ++ raw_spin_lock(&rq->lock);
2138 ++ BUG_ON(task_cpu(p) != new_cpu);
2139 ++ sched_task_sanity_check(p, rq);
2140 ++ enqueue_task(p, rq, 0);
2141 ++ p->on_rq = TASK_ON_RQ_QUEUED;
2142 ++ check_preempt_curr(rq);
2143 ++
2144 ++ return rq;
2145 ++}
2146 ++
2147 ++struct migration_arg {
2148 ++ struct task_struct *task;
2149 ++ int dest_cpu;
2150 ++};
2151 ++
2152 ++/*
2153 ++ * Move a (non-current) task off this CPU, onto the destination CPU. We're doing
2154 ++ * this because either it can't run here any more (set_cpus_allowed() moved it
2155 ++ * away from this CPU, or the CPU is going down), or because we're
2156 ++ * attempting to rebalance this task on exec (sched_exec).
2157 ++ *
2158 ++ * So we race with normal scheduler movements, but that's OK, as long
2159 ++ * as the task is no longer on this CPU.
2160 ++ */
2161 ++static struct rq *__migrate_task(struct rq *rq, struct task_struct *p, int
2162 ++ dest_cpu)
2163 ++{
2164 ++ /* Affinity changed (again). */
2165 ++ if (!is_cpu_allowed(p, dest_cpu))
2166 ++ return rq;
2167 ++
2168 ++ update_rq_clock(rq);
2169 ++ return move_queued_task(rq, p, dest_cpu);
2170 ++}
2171 ++
2172 ++/*
2173 ++ * migration_cpu_stop - this will be executed by a highprio stopper thread
2174 ++ * and performs thread migration by bumping thread off CPU then
2175 ++ * 'pushing' onto another runqueue.
2176 ++ */
2177 ++static int migration_cpu_stop(void *data)
2178 ++{
2179 ++ struct migration_arg *arg = data;
2180 ++ struct task_struct *p = arg->task;
2181 ++ struct rq *rq = this_rq();
2182 ++ unsigned long flags;
2183 ++
2184 ++ /*
2185 ++ * The original target CPU might have gone down and we might
2186 ++ * be on another CPU but it doesn't matter.
2187 ++ */
2188 ++ local_irq_save(flags);
2189 ++ /*
2190 ++ * We need to explicitly wake pending tasks before running
2191 ++ * __migrate_task() such that we will not miss enforcing cpus_ptr
2192 ++ * during wakeups, see set_cpus_allowed_ptr()'s TASK_WAKING test.
2193 ++ */
2194 ++ flush_smp_call_function_from_idle();
2195 ++
2196 ++ raw_spin_lock(&p->pi_lock);
2197 ++ raw_spin_lock(&rq->lock);
2198 ++ /*
2199 ++ * If task_rq(p) != rq, it cannot be migrated here, because we're
2200 ++	 * holding rq->lock; if p->on_rq == 0 it cannot get enqueued because
2201 ++ * we're holding p->pi_lock.
2202 ++ */
2203 ++ if (task_rq(p) == rq && task_on_rq_queued(p))
2204 ++ rq = __migrate_task(rq, p, arg->dest_cpu);
2205 ++ raw_spin_unlock(&rq->lock);
2206 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
2207 ++
2208 ++ return 0;
2209 ++}
2210 ++
2211 ++static inline void
2212 ++set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_mask)
2213 ++{
2214 ++ cpumask_copy(&p->cpus_mask, new_mask);
2215 ++ p->nr_cpus_allowed = cpumask_weight(new_mask);
2216 ++}
2217 ++
2218 ++static void
2219 ++__do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
2220 ++{
2221 ++ lockdep_assert_held(&p->pi_lock);
2222 ++ set_cpus_allowed_common(p, new_mask);
2223 ++}
2224 ++
2225 ++void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
2226 ++{
2227 ++ __do_set_cpus_allowed(p, new_mask);
2228 ++}
2229 ++
2230 ++#endif
2231 ++
2232 ++/**
2233 ++ * task_curr - is this task currently executing on a CPU?
2234 ++ * @p: the task in question.
2235 ++ *
2236 ++ * Return: 1 if the task is currently executing. 0 otherwise.
2237 ++ */
2238 ++inline int task_curr(const struct task_struct *p)
2239 ++{
2240 ++ return cpu_curr(task_cpu(p)) == p;
2241 ++}
2242 ++
2243 ++#ifdef CONFIG_SMP
2244 ++/*
2245 ++ * wait_task_inactive - wait for a thread to unschedule.
2246 ++ *
2247 ++ * If @match_state is nonzero, it's the @p->state value just checked and
2248 ++ * not expected to change. If it changes, i.e. @p might have woken up,
2249 ++ * then return zero. When we succeed in waiting for @p to be off its CPU,
2250 ++ * we return a positive number (its total switch count). If a second call
2251 ++ * a short while later returns the same number, the caller can be sure that
2252 ++ * @p has remained unscheduled the whole time.
2253 ++ *
2254 ++ * The caller must ensure that the task *will* unschedule sometime soon,
2255 ++ * else this function might spin for a *long* time. This function can't
2256 ++ * be called with interrupts off, or it may introduce deadlock with
2257 ++ * smp_call_function() if an IPI is sent by the same process we are
2258 ++ * waiting to become inactive.
2259 ++ */
2260 ++unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state)
2261 ++{
2262 ++ unsigned long flags;
2263 ++ bool running, on_rq;
2264 ++ unsigned long ncsw;
2265 ++ struct rq *rq;
2266 ++ raw_spinlock_t *lock;
2267 ++
2268 ++ for (;;) {
2269 ++ rq = task_rq(p);
2270 ++
2271 ++ /*
2272 ++ * If the task is actively running on another CPU
2273 ++ * still, just relax and busy-wait without holding
2274 ++ * any locks.
2275 ++ *
2276 ++ * NOTE! Since we don't hold any locks, it's not
2277 ++ * even sure that "rq" stays as the right runqueue!
2278 ++ * But we don't care, since this will return false
2279 ++ * if the runqueue has changed and p is actually now
2280 ++ * running somewhere else!
2281 ++ */
2282 ++ while (task_running(p) && p == rq->curr) {
2283 ++ if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
2284 ++ return 0;
2285 ++ cpu_relax();
2286 ++ }
2287 ++
2288 ++ /*
2289 ++ * Ok, time to look more closely! We need the rq
2290 ++ * lock now, to be *sure*. If we're wrong, we'll
2291 ++ * just go back and repeat.
2292 ++ */
2293 ++ task_access_lock_irqsave(p, &lock, &flags);
2294 ++ trace_sched_wait_task(p);
2295 ++ running = task_running(p);
2296 ++ on_rq = p->on_rq;
2297 ++ ncsw = 0;
2298 ++ if (!match_state || READ_ONCE(p->__state) == match_state)
2299 ++ ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
2300 ++ task_access_unlock_irqrestore(p, lock, &flags);
2301 ++
2302 ++ /*
2303 ++ * If it changed from the expected state, bail out now.
2304 ++ */
2305 ++ if (unlikely(!ncsw))
2306 ++ break;
2307 ++
2308 ++ /*
2309 ++ * Was it really running after all now that we
2310 ++ * checked with the proper locks actually held?
2311 ++ *
2312 ++ * Oops. Go back and try again..
2313 ++ */
2314 ++ if (unlikely(running)) {
2315 ++ cpu_relax();
2316 ++ continue;
2317 ++ }
2318 ++
2319 ++ /*
2320 ++ * It's not enough that it's not actively running,
2321 ++ * it must be off the runqueue _entirely_, and not
2322 ++ * preempted!
2323 ++ *
2324 ++ * So if it was still runnable (but just not actively
2325 ++ * running right now), it's preempted, and we should
2326 ++ * yield - it could be a while.
2327 ++ */
2328 ++ if (unlikely(on_rq)) {
2329 ++ ktime_t to = NSEC_PER_SEC / HZ;
2330 ++
2331 ++ set_current_state(TASK_UNINTERRUPTIBLE);
2332 ++ schedule_hrtimeout(&to, HRTIMER_MODE_REL);
2333 ++ continue;
2334 ++ }
2335 ++
2336 ++ /*
2337 ++ * Ahh, all good. It wasn't running, and it wasn't
2338 ++ * runnable, which means that it will never become
2339 ++ * running in the future either. We're all done!
2340 ++ */
2341 ++ break;
2342 ++ }
2343 ++
2344 ++ return ncsw;
2345 ++}
2346 ++
2347 ++/***
2348 ++ * kick_process - kick a running thread to enter/exit the kernel
2349 ++ * @p: the to-be-kicked thread
2350 ++ *
2351 ++ * Cause a process which is running on another CPU to enter
2352 ++ * kernel-mode, without any delay. (to get signals handled.)
2353 ++ *
2354 ++ * NOTE: this function doesn't have to take the runqueue lock,
2355 ++ * because all it wants to ensure is that the remote task enters
2356 ++ * the kernel. If the IPI races and the task has been migrated
2357 ++ * to another CPU then no harm is done and the purpose has been
2358 ++ * achieved as well.
2359 ++ */
2360 ++void kick_process(struct task_struct *p)
2361 ++{
2362 ++ int cpu;
2363 ++
2364 ++ preempt_disable();
2365 ++ cpu = task_cpu(p);
2366 ++ if ((cpu != smp_processor_id()) && task_curr(p))
2367 ++ smp_send_reschedule(cpu);
2368 ++ preempt_enable();
2369 ++}
2370 ++EXPORT_SYMBOL_GPL(kick_process);
2371 ++
2372 ++/*
2373 ++ * ->cpus_ptr is protected by both rq->lock and p->pi_lock
2374 ++ *
2375 ++ * A few notes on cpu_active vs cpu_online:
2376 ++ *
2377 ++ * - cpu_active must be a subset of cpu_online
2378 ++ *
2379 ++ * - on CPU-up we allow per-CPU kthreads on the online && !active CPU,
2380 ++ * see __set_cpus_allowed_ptr(). At this point the newly online
2381 ++ * CPU isn't yet part of the sched domains, and balancing will not
2382 ++ * see it.
2383 ++ *
2384 ++ * - on cpu-down we clear cpu_active() to mask the sched domains and
2385 ++ *   keep the load balancer from placing new tasks on the to-be-removed
2386 ++ * CPU. Existing tasks will remain running there and will be taken
2387 ++ * off.
2388 ++ *
2389 ++ * This means that fallback selection must not select !active CPUs,
2390 ++ * and can assume that any active CPU must be online. Conversely,
2391 ++ * select_task_rq() below may allow selection of !active CPUs in order
2392 ++ * to satisfy the above rules.
2393 ++ */
2394 ++static int select_fallback_rq(int cpu, struct task_struct *p)
2395 ++{
2396 ++ int nid = cpu_to_node(cpu);
2397 ++ const struct cpumask *nodemask = NULL;
2398 ++ enum { cpuset, possible, fail } state = cpuset;
2399 ++ int dest_cpu;
2400 ++
2401 ++ /*
2402 ++ * If the node that the CPU is on has been offlined, cpu_to_node()
2403 ++ * will return -1. There is no CPU on the node, and we should
2404 ++	 * select a CPU on another node.
2405 ++ */
2406 ++ if (nid != -1) {
2407 ++ nodemask = cpumask_of_node(nid);
2408 ++
2409 ++ /* Look for allowed, online CPU in same node. */
2410 ++ for_each_cpu(dest_cpu, nodemask) {
2411 ++ if (!cpu_active(dest_cpu))
2412 ++ continue;
2413 ++ if (cpumask_test_cpu(dest_cpu, p->cpus_ptr))
2414 ++ return dest_cpu;
2415 ++ }
2416 ++ }
2417 ++
2418 ++ for (;;) {
2419 ++ /* Any allowed, online CPU? */
2420 ++ for_each_cpu(dest_cpu, p->cpus_ptr) {
2421 ++ if (!is_cpu_allowed(p, dest_cpu))
2422 ++ continue;
2423 ++ goto out;
2424 ++ }
2425 ++
2426 ++ /* No more Mr. Nice Guy. */
2427 ++ switch (state) {
2428 ++ case cpuset:
2429 ++ if (IS_ENABLED(CONFIG_CPUSETS)) {
2430 ++ cpuset_cpus_allowed_fallback(p);
2431 ++ state = possible;
2432 ++ break;
2433 ++ }
2434 ++ fallthrough;
2435 ++ case possible:
2436 ++ /*
2437 ++ * XXX When called from select_task_rq() we only
2438 ++ * hold p->pi_lock and again violate locking order.
2439 ++ *
2440 ++ * More yuck to audit.
2441 ++ */
2442 ++ do_set_cpus_allowed(p, cpu_possible_mask);
2443 ++ state = fail;
2444 ++ break;
2445 ++
2446 ++ case fail:
2447 ++ BUG();
2448 ++ break;
2449 ++ }
2450 ++ }
2451 ++
2452 ++out:
2453 ++ if (state != cpuset) {
2454 ++ /*
2455 ++ * Don't tell them about moving exiting tasks or
2456 ++ * kernel threads (both mm NULL), since they never
2457 ++		 * leave the kernel.
2458 ++ */
2459 ++ if (p->mm && printk_ratelimit()) {
2460 ++ printk_deferred("process %d (%s) no longer affine to cpu%d\n",
2461 ++ task_pid_nr(p), p->comm, cpu);
2462 ++ }
2463 ++ }
2464 ++
2465 ++ return dest_cpu;
2466 ++}
2467 ++
2468 ++static inline int select_task_rq(struct task_struct *p)
2469 ++{
2470 ++ cpumask_t chk_mask, tmp;
2471 ++
2472 ++ if (unlikely(!cpumask_and(&chk_mask, p->cpus_ptr, cpu_active_mask)))
2473 ++ return select_fallback_rq(task_cpu(p), p);
2474 ++
2475 ++ if (
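++	/*
++	 * Selection order below (as this series appears to arrange it; an
++	 * editorial reading, not text from the patch): prefer a fully idle
++	 * SMT group, then any idle CPU, then a CPU whose run queue watermark
++	 * is below this task's priority; otherwise fall back to the nearest
++	 * allowed CPU by topology distance.
++	 */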
2476 ++#ifdef CONFIG_SCHED_SMT
2477 ++ cpumask_and(&tmp, &chk_mask, &sched_sg_idle_mask) ||
2478 ++#endif
2479 ++ cpumask_and(&tmp, &chk_mask, sched_rq_watermark) ||
2480 ++ cpumask_and(&tmp, &chk_mask,
2481 ++ sched_rq_watermark + SCHED_BITS - task_sched_prio(p)))
2482 ++ return best_mask_cpu(task_cpu(p), &tmp);
2483 ++
2484 ++ return best_mask_cpu(task_cpu(p), &chk_mask);
2485 ++}
2486 ++
2487 ++void sched_set_stop_task(int cpu, struct task_struct *stop)
2488 ++{
2489 ++ static struct lock_class_key stop_pi_lock;
2490 ++ struct sched_param stop_param = { .sched_priority = STOP_PRIO };
2491 ++ struct sched_param start_param = { .sched_priority = 0 };
2492 ++ struct task_struct *old_stop = cpu_rq(cpu)->stop;
2493 ++
2494 ++ if (stop) {
2495 ++ /*
2496 ++		 * Make it appear like a SCHED_FIFO task; it's something
2497 ++ * userspace knows about and won't get confused about.
2498 ++ *
2499 ++ * Also, it will make PI more or less work without too
2500 ++ * much confusion -- but then, stop work should not
2501 ++ * rely on PI working anyway.
2502 ++ */
2503 ++ sched_setscheduler_nocheck(stop, SCHED_FIFO, &stop_param);
2504 ++
2505 ++ /*
2506 ++ * The PI code calls rt_mutex_setprio() with ->pi_lock held to
2507 ++ * adjust the effective priority of a task. As a result,
2508 ++ * rt_mutex_setprio() can trigger (RT) balancing operations,
2509 ++ * which can then trigger wakeups of the stop thread to push
2510 ++ * around the current task.
2511 ++ *
2512 ++ * The stop task itself will never be part of the PI-chain, it
2513 ++ * never blocks, therefore that ->pi_lock recursion is safe.
2514 ++ * Tell lockdep about this by placing the stop->pi_lock in its
2515 ++ * own class.
2516 ++ */
2517 ++ lockdep_set_class(&stop->pi_lock, &stop_pi_lock);
2518 ++ }
2519 ++
2520 ++ cpu_rq(cpu)->stop = stop;
2521 ++
2522 ++ if (old_stop) {
2523 ++ /*
2524 ++ * Reset it back to a normal scheduling policy so that
2525 ++ * it can die in pieces.
2526 ++ */
2527 ++ sched_setscheduler_nocheck(old_stop, SCHED_NORMAL, &start_param);
2528 ++ }
2529 ++}
2530 ++
2531 ++/*
2532 ++ * Change a given task's CPU affinity. Migrate the thread to a
2533 ++ * proper CPU and schedule it away if the CPU it's executing on
2534 ++ * is removed from the allowed bitmask.
2535 ++ *
2536 ++ * NOTE: the caller must have a valid reference to the task, the
2537 ++ * task must not exit() & deallocate itself prematurely. The
2538 ++ * call is not atomic; no spinlocks may be held.
2539 ++ */
2540 ++static int __set_cpus_allowed_ptr(struct task_struct *p,
2541 ++ const struct cpumask *new_mask,
2542 ++ u32 flags)
2543 ++{
2544 ++ const struct cpumask *cpu_valid_mask = cpu_active_mask;
2545 ++ int dest_cpu;
2546 ++ unsigned long irq_flags;
2547 ++ struct rq *rq;
2548 ++ raw_spinlock_t *lock;
2549 ++ int ret = 0;
2550 ++
2551 ++ raw_spin_lock_irqsave(&p->pi_lock, irq_flags);
2552 ++ rq = __task_access_lock(p, &lock);
2553 ++
2554 ++ if (p->flags & PF_KTHREAD || is_migration_disabled(p)) {
2555 ++ /*
2556 ++ * Kernel threads are allowed on online && !active CPUs,
2557 ++ * however, during cpu-hot-unplug, even these might get pushed
2558 ++ * away if not KTHREAD_IS_PER_CPU.
2559 ++ *
2560 ++ * Specifically, migration_disabled() tasks must not fail the
2561 ++ * cpumask_any_and_distribute() pick below, esp. so on
2562 ++ * SCA_MIGRATE_ENABLE, otherwise we'll not call
2563 ++ * set_cpus_allowed_common() and actually reset p->cpus_ptr.
2564 ++ */
2565 ++ cpu_valid_mask = cpu_online_mask;
2566 ++ }
2567 ++
2568 ++ /*
2569 ++ * Must re-check here, to close a race against __kthread_bind(),
2570 ++ * sched_setaffinity() is not guaranteed to observe the flag.
2571 ++ */
2572 ++ if ((flags & SCA_CHECK) && (p->flags & PF_NO_SETAFFINITY)) {
2573 ++ ret = -EINVAL;
2574 ++ goto out;
2575 ++ }
2576 ++
2577 ++ if (cpumask_equal(&p->cpus_mask, new_mask))
2578 ++ goto out;
2579 ++
2580 ++ dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
2581 ++ if (dest_cpu >= nr_cpu_ids) {
2582 ++ ret = -EINVAL;
2583 ++ goto out;
2584 ++ }
2585 ++
2586 ++ __do_set_cpus_allowed(p, new_mask);
2587 ++
2588 ++ /* Can the task run on the task's current CPU? If so, we're done */
2589 ++ if (cpumask_test_cpu(task_cpu(p), new_mask))
2590 ++ goto out;
2591 ++
2592 ++ if (p->migration_disabled) {
2593 ++ if (likely(p->cpus_ptr != &p->cpus_mask))
2594 ++ __do_set_cpus_ptr(p, &p->cpus_mask);
2595 ++ p->migration_disabled = 0;
2596 ++ p->migration_flags |= MDF_FORCE_ENABLED;
2597 ++ /* When p is migrate_disabled, rq->lock should be held */
2598 ++ rq->nr_pinned--;
2599 ++ }
2600 ++
2601 ++ if (task_running(p) || READ_ONCE(p->__state) == TASK_WAKING) {
2602 ++ struct migration_arg arg = { p, dest_cpu };
2603 ++
2604 ++ /* Need help from migration thread: drop lock and wait. */
2605 ++ __task_access_unlock(p, lock);
2606 ++ raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);
2607 ++ stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
2608 ++ return 0;
2609 ++ }
2610 ++ if (task_on_rq_queued(p)) {
2611 ++ /*
2612 ++ * OK, since we're going to drop the lock immediately
2613 ++ * afterwards anyway.
2614 ++ */
2615 ++ update_rq_clock(rq);
2616 ++ rq = move_queued_task(rq, p, dest_cpu);
2617 ++ lock = &rq->lock;
2618 ++ }
2619 ++
2620 ++out:
2621 ++ __task_access_unlock(p, lock);
2622 ++ raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);
2623 ++
2624 ++ return ret;
2625 ++}
2626 ++
2627 ++int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
2628 ++{
2629 ++ return __set_cpus_allowed_ptr(p, new_mask, 0);
2630 ++}
2631 ++EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
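++
++/*
++ * Illustrative sketch (not code from this patch): a caller that owns a
++ * kthread might pin it to one CPU and later restore the full mask:
++ *
++ *	set_cpus_allowed_ptr(worker, cpumask_of(2));
++ *	...
++ *	set_cpus_allowed_ptr(worker, cpu_possible_mask);
++ *
++ * 'worker' is a hypothetical task pointer held by the caller.
++ */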
2632 ++
2633 ++#else /* CONFIG_SMP */
2634 ++
2635 ++static inline int select_task_rq(struct task_struct *p)
2636 ++{
2637 ++ return 0;
2638 ++}
2639 ++
2640 ++static inline int
2641 ++__set_cpus_allowed_ptr(struct task_struct *p,
2642 ++ const struct cpumask *new_mask,
2643 ++ u32 flags)
2644 ++{
2645 ++ return set_cpus_allowed_ptr(p, new_mask);
2646 ++}
2647 ++
2648 ++static inline bool rq_has_pinned_tasks(struct rq *rq)
2649 ++{
2650 ++ return false;
2651 ++}
2652 ++
2653 ++#endif /* !CONFIG_SMP */
2654 ++
2655 ++static void
2656 ++ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
2657 ++{
2658 ++ struct rq *rq;
2659 ++
2660 ++ if (!schedstat_enabled())
2661 ++ return;
2662 ++
2663 ++ rq = this_rq();
2664 ++
2665 ++#ifdef CONFIG_SMP
2666 ++ if (cpu == rq->cpu)
2667 ++ __schedstat_inc(rq->ttwu_local);
2668 ++ else {
2669 ++ /** Alt schedule FW ToDo:
2670 ++ * How to do ttwu_wake_remote
2671 ++ */
2672 ++ }
2673 ++#endif /* CONFIG_SMP */
2674 ++
2675 ++ __schedstat_inc(rq->ttwu_count);
2676 ++}
2677 ++
2678 ++/*
2679 ++ * Mark the task runnable and perform wakeup-preemption.
2680 ++ */
2681 ++static inline void
2682 ++ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
2683 ++{
2684 ++ check_preempt_curr(rq);
2685 ++ WRITE_ONCE(p->__state, TASK_RUNNING);
2686 ++ trace_sched_wakeup(p);
2687 ++}
2688 ++
2689 ++static inline void
2690 ++ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)
2691 ++{
2692 ++ if (p->sched_contributes_to_load)
2693 ++ rq->nr_uninterruptible--;
2694 ++
2695 ++ if (
2696 ++#ifdef CONFIG_SMP
2697 ++ !(wake_flags & WF_MIGRATED) &&
2698 ++#endif
2699 ++ p->in_iowait) {
2700 ++ delayacct_blkio_end(p);
2701 ++ atomic_dec(&task_rq(p)->nr_iowait);
2702 ++ }
2703 ++
2704 ++ activate_task(p, rq);
2705 ++ ttwu_do_wakeup(rq, p, 0);
2706 ++}
2707 ++
2708 ++/*
2709 ++ * Consider @p being inside a wait loop:
2710 ++ *
2711 ++ * for (;;) {
2712 ++ * set_current_state(TASK_UNINTERRUPTIBLE);
2713 ++ *
2714 ++ * if (CONDITION)
2715 ++ * break;
2716 ++ *
2717 ++ * schedule();
2718 ++ * }
2719 ++ * __set_current_state(TASK_RUNNING);
2720 ++ *
2721 ++ * between set_current_state() and schedule(). In this case @p is still
2722 ++ * runnable, so all that needs doing is change p->state back to TASK_RUNNING in
2723 ++ * an atomic manner.
2724 ++ *
2725 ++ * By taking task_rq(p)->lock we serialize against schedule(), if @p->on_rq
2726 ++ * then schedule() must still happen and p->state can be changed to
2727 ++ * TASK_RUNNING. Otherwise we lost the race, schedule() has happened, and we
2728 ++ * need to do a full wakeup with enqueue.
2729 ++ *
2730 ++ * Returns: %true when the wakeup is done,
2731 ++ * %false otherwise.
2732 ++ */
2733 ++static int ttwu_runnable(struct task_struct *p, int wake_flags)
2734 ++{
2735 ++ struct rq *rq;
2736 ++ raw_spinlock_t *lock;
2737 ++ int ret = 0;
2738 ++
2739 ++ rq = __task_access_lock(p, &lock);
2740 ++ if (task_on_rq_queued(p)) {
2741 ++ /* check_preempt_curr() may use rq clock */
2742 ++ update_rq_clock(rq);
2743 ++ ttwu_do_wakeup(rq, p, wake_flags);
2744 ++ ret = 1;
2745 ++ }
2746 ++ __task_access_unlock(p, lock);
2747 ++
2748 ++ return ret;
2749 ++}
2750 ++
2751 ++#ifdef CONFIG_SMP
2752 ++void sched_ttwu_pending(void *arg)
2753 ++{
2754 ++ struct llist_node *llist = arg;
2755 ++ struct rq *rq = this_rq();
2756 ++ struct task_struct *p, *t;
2757 ++ struct rq_flags rf;
2758 ++
2759 ++ if (!llist)
2760 ++ return;
2761 ++
2762 ++ /*
2763 ++	 * rq::ttwu_pending is a racy indication of outstanding wakeups.
2764 ++	 * Races are such that false negatives are possible, since they
2765 ++	 * are shorter-lived than false positives would be.
2766 ++ */
2767 ++ WRITE_ONCE(rq->ttwu_pending, 0);
2768 ++
2769 ++ rq_lock_irqsave(rq, &rf);
2770 ++ update_rq_clock(rq);
2771 ++
2772 ++ llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
2773 ++ if (WARN_ON_ONCE(p->on_cpu))
2774 ++ smp_cond_load_acquire(&p->on_cpu, !VAL);
2775 ++
2776 ++ if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
2777 ++ set_task_cpu(p, cpu_of(rq));
2778 ++
2779 ++ ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0);
2780 ++ }
2781 ++
2782 ++ rq_unlock_irqrestore(rq, &rf);
2783 ++}
2784 ++
2785 ++void send_call_function_single_ipi(int cpu)
2786 ++{
2787 ++ struct rq *rq = cpu_rq(cpu);
2788 ++
2789 ++ if (!set_nr_if_polling(rq->idle))
2790 ++ arch_send_call_function_single_ipi(cpu);
2791 ++ else
2792 ++ trace_sched_wake_idle_without_ipi(cpu);
2793 ++}
2794 ++
2795 ++/*
2796 ++ * Queue a task on the target CPU's wake_list and wake the CPU via IPI if
2797 ++ * necessary. The wakee CPU, on receipt of the IPI, will queue the task
2798 ++ * via sched_ttwu_pending() for activation, so the wakee incurs the cost
2799 ++ * of the wakeup instead of the waker.
2800 ++ */
2801 ++static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
2802 ++{
2803 ++ struct rq *rq = cpu_rq(cpu);
2804 ++
2805 ++ p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
2806 ++
2807 ++ WRITE_ONCE(rq->ttwu_pending, 1);
2808 ++ __smp_call_single_queue(cpu, &p->wake_entry.llist);
2809 ++}
2810 ++
2811 ++static inline bool ttwu_queue_cond(int cpu, int wake_flags)
2812 ++{
2813 ++ /*
2814 ++ * Do not complicate things with the async wake_list while the CPU is
2815 ++ * in hotplug state.
2816 ++ */
2817 ++ if (!cpu_active(cpu))
2818 ++ return false;
2819 ++
2820 ++ /*
2821 ++ * If the CPU does not share cache, then queue the task on the
2822 ++	 * remote rq's wakelist to avoid accessing remote data.
2823 ++ */
2824 ++ if (!cpus_share_cache(smp_processor_id(), cpu))
2825 ++ return true;
2826 ++
2827 ++ /*
2828 ++ * If the task is descheduling and the only running task on the
2829 ++ * CPU then use the wakelist to offload the task activation to
2830 ++ * the soon-to-be-idle CPU as the current CPU is likely busy.
2831 ++ * nr_running is checked to avoid unnecessary task stacking.
2832 ++ */
2833 ++ if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
2834 ++ return true;
2835 ++
2836 ++ return false;
2837 ++}
2838 ++
2839 ++static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
2840 ++{
2841 ++ if (__is_defined(ALT_SCHED_TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
2842 ++ if (WARN_ON_ONCE(cpu == smp_processor_id()))
2843 ++ return false;
2844 ++
2845 ++ sched_clock_cpu(cpu); /* Sync clocks across CPUs */
2846 ++ __ttwu_queue_wakelist(p, cpu, wake_flags);
2847 ++ return true;
2848 ++ }
2849 ++
2850 ++ return false;
2851 ++}
2852 ++
2853 ++void wake_up_if_idle(int cpu)
2854 ++{
2855 ++ struct rq *rq = cpu_rq(cpu);
2856 ++ unsigned long flags;
2857 ++
2858 ++ rcu_read_lock();
2859 ++
2860 ++ if (!is_idle_task(rcu_dereference(rq->curr)))
2861 ++ goto out;
2862 ++
2863 ++ if (set_nr_if_polling(rq->idle)) {
2864 ++ trace_sched_wake_idle_without_ipi(cpu);
2865 ++ } else {
2866 ++ raw_spin_lock_irqsave(&rq->lock, flags);
2867 ++ if (is_idle_task(rq->curr))
2868 ++ smp_send_reschedule(cpu);
2869 ++ /* Else CPU is not idle, do nothing here */
2870 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
2871 ++ }
2872 ++
2873 ++out:
2874 ++ rcu_read_unlock();
2875 ++}
2876 ++
2877 ++bool cpus_share_cache(int this_cpu, int that_cpu)
2878 ++{
2879 ++ return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
2880 ++}
2881 ++#else /* !CONFIG_SMP */
2882 ++
2883 ++static inline bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
2884 ++{
2885 ++ return false;
2886 ++}
2887 ++
2888 ++#endif /* CONFIG_SMP */
2889 ++
2890 ++static inline void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
2891 ++{
2892 ++ struct rq *rq = cpu_rq(cpu);
2893 ++
2894 ++ if (ttwu_queue_wakelist(p, cpu, wake_flags))
2895 ++ return;
2896 ++
2897 ++ raw_spin_lock(&rq->lock);
2898 ++ update_rq_clock(rq);
2899 ++ ttwu_do_activate(rq, p, wake_flags);
2900 ++ raw_spin_unlock(&rq->lock);
2901 ++}
2902 ++
2903 ++/*
2904 ++ * Notes on Program-Order guarantees on SMP systems.
2905 ++ *
2906 ++ * MIGRATION
2907 ++ *
2908 ++ * The basic program-order guarantee on SMP systems is that when a task [t]
2909 ++ * migrates, all its activity on its old CPU [c0] happens-before any subsequent
2910 ++ * execution on its new CPU [c1].
2911 ++ *
2912 ++ * For migration (of runnable tasks) this is provided by the following means:
2913 ++ *
2914 ++ * A) UNLOCK of the rq(c0)->lock scheduling out task t
2915 ++ * B) migration for t is required to synchronize *both* rq(c0)->lock and
2916 ++ * rq(c1)->lock (if not at the same time, then in that order).
2917 ++ * C) LOCK of the rq(c1)->lock scheduling in task
2918 ++ *
2919 ++ * Transitivity guarantees that B happens after A and C after B.
2920 ++ * Note: we only require RCpc transitivity.
2921 ++ * Note: the CPU doing B need not be c0 or c1
2922 ++ *
2923 ++ * Example:
2924 ++ *
2925 ++ * CPU0 CPU1 CPU2
2926 ++ *
2927 ++ * LOCK rq(0)->lock
2928 ++ * sched-out X
2929 ++ * sched-in Y
2930 ++ * UNLOCK rq(0)->lock
2931 ++ *
2932 ++ * LOCK rq(0)->lock // orders against CPU0
2933 ++ * dequeue X
2934 ++ * UNLOCK rq(0)->lock
2935 ++ *
2936 ++ * LOCK rq(1)->lock
2937 ++ * enqueue X
2938 ++ * UNLOCK rq(1)->lock
2939 ++ *
2940 ++ * LOCK rq(1)->lock // orders against CPU2
2941 ++ * sched-out Z
2942 ++ * sched-in X
2943 ++ * UNLOCK rq(1)->lock
2944 ++ *
2945 ++ *
2946 ++ * BLOCKING -- aka. SLEEP + WAKEUP
2947 ++ *
2948 ++ * For blocking we (obviously) need to provide the same guarantee as for
2949 ++ * migration. However the means are completely different as there is no lock
2950 ++ * chain to provide order. Instead we do:
2951 ++ *
2952 ++ * 1) smp_store_release(X->on_cpu, 0) -- finish_task()
2953 ++ * 2) smp_cond_load_acquire(!X->on_cpu) -- try_to_wake_up()
2954 ++ *
2955 ++ * Example:
2956 ++ *
2957 ++ * CPU0 (schedule) CPU1 (try_to_wake_up) CPU2 (schedule)
2958 ++ *
2959 ++ * LOCK rq(0)->lock LOCK X->pi_lock
2960 ++ * dequeue X
2961 ++ * sched-out X
2962 ++ * smp_store_release(X->on_cpu, 0);
2963 ++ *
2964 ++ * smp_cond_load_acquire(&X->on_cpu, !VAL);
2965 ++ * X->state = WAKING
2966 ++ * set_task_cpu(X,2)
2967 ++ *
2968 ++ * LOCK rq(2)->lock
2969 ++ * enqueue X
2970 ++ * X->state = RUNNING
2971 ++ * UNLOCK rq(2)->lock
2972 ++ *
2973 ++ * LOCK rq(2)->lock // orders against CPU1
2974 ++ * sched-out Z
2975 ++ * sched-in X
2976 ++ * UNLOCK rq(2)->lock
2977 ++ *
2978 ++ * UNLOCK X->pi_lock
2979 ++ * UNLOCK rq(0)->lock
2980 ++ *
2981 ++ *
2982 ++ * However; for wakeups there is a second guarantee we must provide, namely we
2983 ++ * must observe the state that led to our wakeup. That is, not only must our
2984 ++ * task observe its own prior state, it must also observe the stores prior to
2985 ++ * its wakeup.
2986 ++ *
2987 ++ * This means that any means of doing remote wakeups must order the CPU doing
2988 ++ * the wakeup against the CPU the task is going to end up running on. This,
2989 ++ * however, is already required for the regular Program-Order guarantee above,
2990 ++ * since the waking CPU is the one issuing the ACQUIRE (smp_cond_load_acquire).
2991 ++ *
2992 ++ */
2993 ++
2994 ++/**
2995 ++ * try_to_wake_up - wake up a thread
2996 ++ * @p: the thread to be awakened
2997 ++ * @state: the mask of task states that can be woken
2998 ++ * @wake_flags: wake modifier flags (WF_*)
2999 ++ *
3000 ++ * Conceptually does:
3001 ++ *
3002 ++ * If (@state & @p->state) @p->state = TASK_RUNNING.
3003 ++ *
3004 ++ * If the task was not queued/runnable, also place it back on a runqueue.
3005 ++ *
3006 ++ * This function is atomic against schedule() which would dequeue the task.
3007 ++ *
3008 ++ * It issues a full memory barrier before accessing @p->state, see the comment
3009 ++ * with set_current_state().
3010 ++ *
3011 ++ * Uses p->pi_lock to serialize against concurrent wake-ups.
3012 ++ *
3013 ++ * Relies on p->pi_lock stabilizing:
3014 ++ * - p->sched_class
3015 ++ * - p->cpus_ptr
3016 ++ * - p->sched_task_group
3017 ++ * in order to do migration, see its use of select_task_rq()/set_task_cpu().
3018 ++ *
3019 ++ * Tries really hard to only take one task_rq(p)->lock for performance.
3020 ++ * Takes rq->lock in:
3021 ++ * - ttwu_runnable() -- old rq, unavoidable, see comment there;
3022 ++ * - ttwu_queue() -- new rq, for enqueue of the task;
3023 ++ * - psi_ttwu_dequeue() -- much sadness :-( accounting will kill us.
3024 ++ *
3025 ++ * As a consequence we race really badly with just about everything. See the
3026 ++ * many memory barriers and their comments for details.
3027 ++ *
3028 ++ * Return: %true if @p->state changes (an actual wakeup was done),
3029 ++ * %false otherwise.
3030 ++ */
3031 ++static int try_to_wake_up(struct task_struct *p, unsigned int state,
3032 ++ int wake_flags)
3033 ++{
3034 ++ unsigned long flags;
3035 ++ int cpu, success = 0;
3036 ++
3037 ++ preempt_disable();
3038 ++ if (p == current) {
3039 ++ /*
3040 ++ * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
3041 ++ * == smp_processor_id()'. Together this means we can special
3042 ++ * case the whole 'p->on_rq && ttwu_runnable()' case below
3043 ++ * without taking any locks.
3044 ++ *
3045 ++ * In particular:
3046 ++ * - we rely on Program-Order guarantees for all the ordering,
3047 ++ * - we're serialized against set_special_state() by virtue of
3048 ++ * it disabling IRQs (this allows not taking ->pi_lock).
3049 ++ */
3050 ++ if (!(READ_ONCE(p->__state) & state))
3051 ++ goto out;
3052 ++
3053 ++ success = 1;
3054 ++ trace_sched_waking(p);
3055 ++ WRITE_ONCE(p->__state, TASK_RUNNING);
3056 ++ trace_sched_wakeup(p);
3057 ++ goto out;
3058 ++ }
3059 ++
3060 ++ /*
3061 ++ * If we are going to wake up a thread waiting for CONDITION we
3062 ++ * need to ensure that CONDITION=1 done by the caller can not be
3063 ++ * reordered with p->state check below. This pairs with smp_store_mb()
3064 ++ * in set_current_state() that the waiting thread does.
3065 ++ */
3066 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
3067 ++ smp_mb__after_spinlock();
3068 ++ if (!(READ_ONCE(p->__state) & state))
3069 ++ goto unlock;
3070 ++
3071 ++ trace_sched_waking(p);
3072 ++
3073 ++ /* We're going to change ->state: */
3074 ++ success = 1;
3075 ++
3076 ++ /*
3077 ++ * Ensure we load p->on_rq _after_ p->state, otherwise it would
3078 ++ * be possible to, falsely, observe p->on_rq == 0 and get stuck
3079 ++ * in smp_cond_load_acquire() below.
3080 ++ *
3081 ++ * sched_ttwu_pending() try_to_wake_up()
3082 ++ * STORE p->on_rq = 1 LOAD p->state
3083 ++ * UNLOCK rq->lock
3084 ++ *
3085 ++ * __schedule() (switch to task 'p')
3086 ++ * LOCK rq->lock smp_rmb();
3087 ++ * smp_mb__after_spinlock();
3088 ++ * UNLOCK rq->lock
3089 ++ *
3090 ++ * [task p]
3091 ++ * STORE p->state = UNINTERRUPTIBLE LOAD p->on_rq
3092 ++ *
3093 ++ * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
3094 ++ * __schedule(). See the comment for smp_mb__after_spinlock().
3095 ++ *
3096 ++	 * A similar smp_rmb() lives in try_invoke_on_locked_down_task().
3097 ++ */
3098 ++ smp_rmb();
3099 ++ if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags))
3100 ++ goto unlock;
3101 ++
3102 ++#ifdef CONFIG_SMP
3103 ++ /*
3104 ++ * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
3105 ++ * possible to, falsely, observe p->on_cpu == 0.
3106 ++ *
3107 ++ * One must be running (->on_cpu == 1) in order to remove oneself
3108 ++ * from the runqueue.
3109 ++ *
3110 ++ * __schedule() (switch to task 'p') try_to_wake_up()
3111 ++ * STORE p->on_cpu = 1 LOAD p->on_rq
3112 ++ * UNLOCK rq->lock
3113 ++ *
3114 ++ * __schedule() (put 'p' to sleep)
3115 ++ * LOCK rq->lock smp_rmb();
3116 ++ * smp_mb__after_spinlock();
3117 ++ * STORE p->on_rq = 0 LOAD p->on_cpu
3118 ++ *
3119 ++ * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
3120 ++ * __schedule(). See the comment for smp_mb__after_spinlock().
3121 ++ *
3122 ++ * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
3123 ++ * schedule()'s deactivate_task() has 'happened' and p will no longer
3124 ++	 * care about its own p->state. See the comment in __schedule().
3125 ++ */
3126 ++ smp_acquire__after_ctrl_dep();
3127 ++
3128 ++ /*
3129 ++ * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
3130 ++ * == 0), which means we need to do an enqueue, change p->state to
3131 ++ * TASK_WAKING such that we can unlock p->pi_lock before doing the
3132 ++ * enqueue, such as ttwu_queue_wakelist().
3133 ++ */
3134 ++ WRITE_ONCE(p->__state, TASK_WAKING);
3135 ++
3136 ++ /*
3137 ++ * If the owning (remote) CPU is still in the middle of schedule() with
3138 ++	 * this task as prev, consider queueing p on the remote CPU's wake_list,
3139 ++ * which potentially sends an IPI instead of spinning on p->on_cpu to
3140 ++ * let the waker make forward progress. This is safe because IRQs are
3141 ++ * disabled and the IPI will deliver after on_cpu is cleared.
3142 ++ *
3143 ++ * Ensure we load task_cpu(p) after p->on_cpu:
3144 ++ *
3145 ++ * set_task_cpu(p, cpu);
3146 ++ * STORE p->cpu = @cpu
3147 ++ * __schedule() (switch to task 'p')
3148 ++ * LOCK rq->lock
3149 ++ * smp_mb__after_spin_lock() smp_cond_load_acquire(&p->on_cpu)
3150 ++ * STORE p->on_cpu = 1 LOAD p->cpu
3151 ++ *
3152 ++ * to ensure we observe the correct CPU on which the task is currently
3153 ++ * scheduling.
3154 ++ */
3155 ++ if (smp_load_acquire(&p->on_cpu) &&
3156 ++ ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
3157 ++ goto unlock;
3158 ++
3159 ++ /*
3160 ++ * If the owning (remote) CPU is still in the middle of schedule() with
3161 ++ * this task as prev, wait until it's done referencing the task.
3162 ++ *
3163 ++ * Pairs with the smp_store_release() in finish_task().
3164 ++ *
3165 ++ * This ensures that tasks getting woken will be fully ordered against
3166 ++ * their previous state and preserve Program Order.
3167 ++ */
3168 ++ smp_cond_load_acquire(&p->on_cpu, !VAL);
3169 ++
3170 ++ sched_task_ttwu(p);
3171 ++
3172 ++ cpu = select_task_rq(p);
3173 ++
3174 ++ if (cpu != task_cpu(p)) {
3175 ++ if (p->in_iowait) {
3176 ++ delayacct_blkio_end(p);
3177 ++ atomic_dec(&task_rq(p)->nr_iowait);
3178 ++ }
3179 ++
3180 ++ wake_flags |= WF_MIGRATED;
3181 ++ psi_ttwu_dequeue(p);
3182 ++ set_task_cpu(p, cpu);
3183 ++ }
3184 ++#else
3185 ++ cpu = task_cpu(p);
3186 ++#endif /* CONFIG_SMP */
3187 ++
3188 ++ ttwu_queue(p, cpu, wake_flags);
3189 ++unlock:
3190 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
3191 ++out:
3192 ++ if (success)
3193 ++ ttwu_stat(p, task_cpu(p), wake_flags);
3194 ++ preempt_enable();
3195 ++
3196 ++ return success;
3197 ++}
3198 ++
3199 ++/**
3200 ++ * try_invoke_on_locked_down_task - Invoke a function on task in fixed state
3201 ++ * @p: Process for which the function is to be invoked, can be @current.
3202 ++ * @func: Function to invoke.
3203 ++ * @arg: Argument to function.
3204 ++ *
3205 ++ * If the specified task can be quickly locked into a definite state
3206 ++ * (either sleeping or on a given runqueue), arrange to keep it in that
3207 ++ * state while invoking @func(@arg). This function can use ->on_rq and
3208 ++ * task_curr() to work out what the state is, if required. Given that
3209 ++ * @func can be invoked with a runqueue lock held, it had better be quite
3210 ++ * lightweight.
3211 ++ *
3212 ++ * Returns:
3213 ++ * @false if the task slipped out from under the locks.
3214 ++ * @true if the task was locked onto a runqueue or is sleeping.
3215 ++ * However, @func can override this by returning @false.
3216 ++ */
3217 ++bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg)
3218 ++{
3219 ++ struct rq_flags rf;
3220 ++ bool ret = false;
3221 ++ struct rq *rq;
3222 ++
3223 ++ raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
3224 ++ if (p->on_rq) {
3225 ++ rq = __task_rq_lock(p, &rf);
3226 ++ if (task_rq(p) == rq)
3227 ++ ret = func(p, arg);
3228 ++ __task_rq_unlock(rq, &rf);
3229 ++ } else {
3230 ++ switch (READ_ONCE(p->__state)) {
3231 ++ case TASK_RUNNING:
3232 ++ case TASK_WAKING:
3233 ++ break;
3234 ++ default:
3235 ++ smp_rmb(); // See smp_rmb() comment in try_to_wake_up().
3236 ++ if (!p->on_rq)
3237 ++ ret = func(p, arg);
3238 ++ }
3239 ++ }
3240 ++ raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
3241 ++ return ret;
3242 ++}
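++
++/*
++ * Illustrative sketch (hypothetical callback, not code from this patch):
++ *
++ *	static bool func_example(struct task_struct *t, void *arg)
++ *	{
++ *		return !task_curr(t);	// t is held in a stable state here
++ *	}
++ *
++ *	bool stable = try_invoke_on_locked_down_task(p, func_example, NULL);
++ */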
3243 ++
3244 ++/**
3245 ++ * wake_up_process - Wake up a specific process
3246 ++ * @p: The process to be woken up.
3247 ++ *
3248 ++ * Attempt to wake up the nominated process and move it to the set of runnable
3249 ++ * processes.
3250 ++ *
3251 ++ * Return: 1 if the process was woken up, 0 if it was already running.
3252 ++ *
3253 ++ * This function executes a full memory barrier before accessing the task state.
3254 ++ */
3255 ++int wake_up_process(struct task_struct *p)
3256 ++{
3257 ++ return try_to_wake_up(p, TASK_NORMAL, 0);
3258 ++}
3259 ++EXPORT_SYMBOL(wake_up_process);
3260 ++
3261 ++int wake_up_state(struct task_struct *p, unsigned int state)
3262 ++{
3263 ++ return try_to_wake_up(p, state, 0);
3264 ++}
3265 ++
3266 ++/*
3267 ++ * Perform scheduler related setup for a newly forked process p.
3268 ++ * p is forked by current.
3269 ++ *
3270 ++ * __sched_fork() is basic setup used by init_idle() too:
3271 ++ */
3272 ++static inline void __sched_fork(unsigned long clone_flags, struct task_struct *p)
3273 ++{
3274 ++ p->on_rq = 0;
3275 ++ p->on_cpu = 0;
3276 ++ p->utime = 0;
3277 ++ p->stime = 0;
3278 ++ p->sched_time = 0;
3279 ++
3280 ++#ifdef CONFIG_PREEMPT_NOTIFIERS
3281 ++ INIT_HLIST_HEAD(&p->preempt_notifiers);
3282 ++#endif
3283 ++
3284 ++#ifdef CONFIG_COMPACTION
3285 ++ p->capture_control = NULL;
3286 ++#endif
3287 ++#ifdef CONFIG_SMP
3288 ++ p->wake_entry.u_flags = CSD_TYPE_TTWU;
3289 ++#endif
3290 ++}
3291 ++
3292 ++/*
3293 ++ * fork()/clone()-time setup:
3294 ++ */
3295 ++int sched_fork(unsigned long clone_flags, struct task_struct *p)
3296 ++{
3297 ++ unsigned long flags;
3298 ++ struct rq *rq;
3299 ++
3300 ++ __sched_fork(clone_flags, p);
3301 ++ /*
3302 ++ * We mark the process as NEW here. This guarantees that
3303 ++ * nobody will actually run it, and a signal or other external
3304 ++ * event cannot wake it up and insert it on the runqueue either.
3305 ++ */
3306 ++ p->__state = TASK_NEW;
3307 ++
3308 ++ /*
3309 ++ * Make sure we do not leak PI boosting priority to the child.
3310 ++ */
3311 ++ p->prio = current->normal_prio;
3312 ++
3313 ++ /*
3314 ++ * Revert to default priority/policy on fork if requested.
3315 ++ */
3316 ++ if (unlikely(p->sched_reset_on_fork)) {
3317 ++ if (task_has_rt_policy(p)) {
3318 ++ p->policy = SCHED_NORMAL;
3319 ++ p->static_prio = NICE_TO_PRIO(0);
3320 ++ p->rt_priority = 0;
3321 ++ } else if (PRIO_TO_NICE(p->static_prio) < 0)
3322 ++ p->static_prio = NICE_TO_PRIO(0);
3323 ++
3324 ++ p->prio = p->normal_prio = p->static_prio;
3325 ++
3326 ++ /*
3327 ++ * We don't need the reset flag anymore after the fork. It has
3328 ++ * fulfilled its duty:
3329 ++ */
3330 ++ p->sched_reset_on_fork = 0;
3331 ++ }
3332 ++
3333 ++ /*
3334 ++ * The child is not yet in the pid-hash so no cgroup attach races,
3335 ++	 * and the cgroup is pinned to this child because cgroup_fork()
3336 ++	 * runs before sched_fork().
3337 ++ *
3338 ++ * Silence PROVE_RCU.
3339 ++ */
3340 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
3341 ++ /*
3342 ++	 * Share the timeslice between parent and child so that the
3343 ++	 * total amount of pending timeslices in the system doesn't change,
3344 ++	 * resulting in better scheduling fairness.
3345 ++ */
3346 ++ rq = this_rq();
3347 ++ raw_spin_lock(&rq->lock);
3348 ++
3349 ++ rq->curr->time_slice /= 2;
3350 ++ p->time_slice = rq->curr->time_slice;
3351 ++#ifdef CONFIG_SCHED_HRTICK
3352 ++ hrtick_start(rq, rq->curr->time_slice);
3353 ++#endif
3354 ++
3355 ++ if (p->time_slice < RESCHED_NS) {
3356 ++ p->time_slice = sched_timeslice_ns;
3357 ++ resched_curr(rq);
3358 ++ }
3359 ++ sched_task_fork(p, rq);
3360 ++ raw_spin_unlock(&rq->lock);
3361 ++
3362 ++ rseq_migrate(p);
3363 ++ /*
3364 ++ * We're setting the CPU for the first time, we don't migrate,
3365 ++ * so use __set_task_cpu().
3366 ++ */
3367 ++ __set_task_cpu(p, cpu_of(rq));
3368 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
3369 ++
3370 ++#ifdef CONFIG_SCHED_INFO
3371 ++ if (unlikely(sched_info_on()))
3372 ++ memset(&p->sched_info, 0, sizeof(p->sched_info));
3373 ++#endif
3374 ++ init_task_preempt_count(p);
3375 ++
3376 ++ return 0;
3377 ++}
3378 ++
3379 ++void sched_post_fork(struct task_struct *p) {}
3380 ++
3381 ++#ifdef CONFIG_SCHEDSTATS
3382 ++
3383 ++DEFINE_STATIC_KEY_FALSE(sched_schedstats);
3384 ++
3385 ++static void set_schedstats(bool enabled)
3386 ++{
3387 ++ if (enabled)
3388 ++ static_branch_enable(&sched_schedstats);
3389 ++ else
3390 ++ static_branch_disable(&sched_schedstats);
3391 ++}
3392 ++
3393 ++void force_schedstat_enabled(void)
3394 ++{
3395 ++ if (!schedstat_enabled()) {
3396 ++ pr_info("kernel profiling enabled schedstats, disable via kernel.sched_schedstats.\n");
3397 ++ static_branch_enable(&sched_schedstats);
3398 ++ }
3399 ++}
3400 ++
3401 ++static int __init setup_schedstats(char *str)
3402 ++{
3403 ++ int ret = 0;
3404 ++ if (!str)
3405 ++ goto out;
3406 ++
3407 ++ if (!strcmp(str, "enable")) {
3408 ++ set_schedstats(true);
3409 ++ ret = 1;
3410 ++ } else if (!strcmp(str, "disable")) {
3411 ++ set_schedstats(false);
3412 ++ ret = 1;
3413 ++ }
3414 ++out:
3415 ++ if (!ret)
3416 ++ pr_warn("Unable to parse schedstats=\n");
3417 ++
3418 ++ return ret;
3419 ++}
3420 ++__setup("schedstats=", setup_schedstats);
3421 ++
3422 ++#ifdef CONFIG_PROC_SYSCTL
3423 ++int sysctl_schedstats(struct ctl_table *table, int write,
3424 ++ void __user *buffer, size_t *lenp, loff_t *ppos)
3425 ++{
3426 ++ struct ctl_table t;
3427 ++ int err;
3428 ++ int state = static_branch_likely(&sched_schedstats);
3429 ++
3430 ++ if (write && !capable(CAP_SYS_ADMIN))
3431 ++ return -EPERM;
3432 ++
3433 ++ t = *table;
3434 ++ t.data = &state;
3435 ++ err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
3436 ++ if (err < 0)
3437 ++ return err;
3438 ++ if (write)
3439 ++ set_schedstats(state);
3440 ++ return err;
3441 ++}
3442 ++#endif /* CONFIG_PROC_SYSCTL */
3443 ++#endif /* CONFIG_SCHEDSTATS */
3444 ++
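
As a usage note, the schedstats switch wired up above is reachable both from the boot command line (schedstats=enable) and, with CONFIG_PROC_SYSCTL, at run time through the kernel.sched_schedstats sysctl. The snippet below is a minimal userspace sketch of the run-time path; the /proc/sys location is the conventional one for that sysctl and is assumed here, and the write requires root.

/* Minimal sketch: enable schedstats at run time via the sysctl interface. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/sched_schedstats", "w");

	if (!f) {
		perror("sched_schedstats");
		return 1;
	}
	fputs("1\n", f);	/* "0" would disable it again */
	fclose(f);
	return 0;
}
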
3445 ++/*
3446 ++ * wake_up_new_task - wake up a newly created task for the first time.
3447 ++ *
3448 ++ * This function will do some initial scheduler statistics housekeeping
3449 ++ * that must be done for every newly created context, then puts the task
3450 ++ * on the runqueue and wakes it.
3451 ++ */
3452 ++void wake_up_new_task(struct task_struct *p)
3453 ++{
3454 ++ unsigned long flags;
3455 ++ struct rq *rq;
3456 ++
3457 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
3458 ++ WRITE_ONCE(p->__state, TASK_RUNNING);
3459 ++ rq = cpu_rq(select_task_rq(p));
3460 ++#ifdef CONFIG_SMP
3461 ++ rseq_migrate(p);
3462 ++ /*
3463 ++ * Fork balancing, do it here and not earlier because:
3464 ++ * - cpus_ptr can change in the fork path
3465 ++ * - any previously selected CPU might disappear through hotplug
3466 ++ *
3467 ++ * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
3468 ++ * as we're not fully set-up yet.
3469 ++ */
3470 ++ __set_task_cpu(p, cpu_of(rq));
3471 ++#endif
3472 ++
3473 ++ raw_spin_lock(&rq->lock);
3474 ++ update_rq_clock(rq);
3475 ++
3476 ++ activate_task(p, rq);
3477 ++ trace_sched_wakeup_new(p);
3478 ++ check_preempt_curr(rq);
3479 ++
3480 ++ raw_spin_unlock(&rq->lock);
3481 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
3482 ++}
3483 ++
3484 ++#ifdef CONFIG_PREEMPT_NOTIFIERS
3485 ++
3486 ++static DEFINE_STATIC_KEY_FALSE(preempt_notifier_key);
3487 ++
3488 ++void preempt_notifier_inc(void)
3489 ++{
3490 ++ static_branch_inc(&preempt_notifier_key);
3491 ++}
3492 ++EXPORT_SYMBOL_GPL(preempt_notifier_inc);
3493 ++
3494 ++void preempt_notifier_dec(void)
3495 ++{
3496 ++ static_branch_dec(&preempt_notifier_key);
3497 ++}
3498 ++EXPORT_SYMBOL_GPL(preempt_notifier_dec);
3499 ++
3500 ++/**
3501 ++ * preempt_notifier_register - tell me when current is being preempted & rescheduled
3502 ++ * @notifier: notifier struct to register
3503 ++ */
3504 ++void preempt_notifier_register(struct preempt_notifier *notifier)
3505 ++{
3506 ++ if (!static_branch_unlikely(&preempt_notifier_key))
3507 ++ WARN(1, "registering preempt_notifier while notifiers disabled\n");
3508 ++
3509 ++ hlist_add_head(&notifier->link, &current->preempt_notifiers);
3510 ++}
3511 ++EXPORT_SYMBOL_GPL(preempt_notifier_register);
3512 ++
3513 ++/**
3514 ++ * preempt_notifier_unregister - no longer interested in preemption notifications
3515 ++ * @notifier: notifier struct to unregister
3516 ++ *
3517 ++ * This is *not* safe to call from within a preemption notifier.
3518 ++ */
3519 ++void preempt_notifier_unregister(struct preempt_notifier *notifier)
3520 ++{
3521 ++ hlist_del(&notifier->link);
3522 ++}
3523 ++EXPORT_SYMBOL_GPL(preempt_notifier_unregister);
3524 ++
3525 ++static void __fire_sched_in_preempt_notifiers(struct task_struct *curr)
3526 ++{
3527 ++ struct preempt_notifier *notifier;
3528 ++
3529 ++ hlist_for_each_entry(notifier, &curr->preempt_notifiers, link)
3530 ++ notifier->ops->sched_in(notifier, raw_smp_processor_id());
3531 ++}
3532 ++
3533 ++static __always_inline void fire_sched_in_preempt_notifiers(struct task_struct *curr)
3534 ++{
3535 ++ if (static_branch_unlikely(&preempt_notifier_key))
3536 ++ __fire_sched_in_preempt_notifiers(curr);
3537 ++}
3538 ++
3539 ++static void
3540 ++__fire_sched_out_preempt_notifiers(struct task_struct *curr,
3541 ++ struct task_struct *next)
3542 ++{
3543 ++ struct preempt_notifier *notifier;
3544 ++
3545 ++ hlist_for_each_entry(notifier, &curr->preempt_notifiers, link)
3546 ++ notifier->ops->sched_out(notifier, next);
3547 ++}
3548 ++
3549 ++static __always_inline void
3550 ++fire_sched_out_preempt_notifiers(struct task_struct *curr,
3551 ++ struct task_struct *next)
3552 ++{
3553 ++ if (static_branch_unlikely(&preempt_notifier_key))
3554 ++ __fire_sched_out_preempt_notifiers(curr, next);
3555 ++}
3556 ++
3557 ++#else /* !CONFIG_PREEMPT_NOTIFIERS */
3558 ++
3559 ++static inline void fire_sched_in_preempt_notifiers(struct task_struct *curr)
3560 ++{
3561 ++}
3562 ++
3563 ++static inline void
3564 ++fire_sched_out_preempt_notifiers(struct task_struct *curr,
3565 ++ struct task_struct *next)
3566 ++{
3567 ++}
3568 ++
3569 ++#endif /* CONFIG_PREEMPT_NOTIFIERS */
3570 ++
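
The notifier machinery above boils down to a static-key-gated walk over a per-task list of callbacks. The following is a small userspace analogue (illustration only, not the kernel API): toy_key stands in for the static key bumped by preempt_notifier_inc(), and toy_fire_sched_in() plays the role of fire_sched_in_preempt_notifiers().

/*
 * Userspace analogue (illustration only) of the preempt-notifier pattern:
 * a counted on/off gate plus a list of callbacks fired on "sched in".
 */
#include <stdio.h>

struct toy_notifier {
	void (*sched_in)(struct toy_notifier *self, int cpu);
	struct toy_notifier *next;
};

static struct toy_notifier *toy_list;
static int toy_key;			/* stands in for the static key */

static void toy_register(struct toy_notifier *n)
{
	n->next = toy_list;
	toy_list = n;
}

static void toy_fire_sched_in(int cpu)
{
	struct toy_notifier *n;

	if (!toy_key)			/* cheap "static branch" check */
		return;
	for (n = toy_list; n; n = n->next)
		n->sched_in(n, cpu);
}

static void hello(struct toy_notifier *self, int cpu)
{
	printf("notifier %p scheduled in on cpu %d\n", (void *)self, cpu);
}

int main(void)
{
	struct toy_notifier n = { .sched_in = hello };

	toy_key++;			/* preempt_notifier_inc() analogue */
	toy_register(&n);
	toy_fire_sched_in(0);
	return 0;
}
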
3571 ++static inline void prepare_task(struct task_struct *next)
3572 ++{
3573 ++ /*
3574 ++ * Claim the task as running, we do this before switching to it
3575 ++ * such that any running task will have this set.
3576 ++ *
3577 ++ * See the ttwu() WF_ON_CPU case and its ordering comment.
3578 ++ */
3579 ++ WRITE_ONCE(next->on_cpu, 1);
3580 ++}
3581 ++
3582 ++static inline void finish_task(struct task_struct *prev)
3583 ++{
3584 ++#ifdef CONFIG_SMP
3585 ++ /*
3586 ++ * This must be the very last reference to @prev from this CPU. After
3587 ++ * p->on_cpu is cleared, the task can be moved to a different CPU. We
3588 ++ * must ensure this doesn't happen until the switch is completely
3589 ++ * finished.
3590 ++ *
3591 ++ * In particular, the load of prev->state in finish_task_switch() must
3592 ++ * happen before this.
3593 ++ *
3594 ++ * Pairs with the smp_cond_load_acquire() in try_to_wake_up().
3595 ++ */
3596 ++ smp_store_release(&prev->on_cpu, 0);
3597 ++#else
3598 ++ prev->on_cpu = 0;
3599 ++#endif
3600 ++}
3601 ++
3602 ++#ifdef CONFIG_SMP
3603 ++
3604 ++static void do_balance_callbacks(struct rq *rq, struct callback_head *head)
3605 ++{
3606 ++ void (*func)(struct rq *rq);
3607 ++ struct callback_head *next;
3608 ++
3609 ++ lockdep_assert_held(&rq->lock);
3610 ++
3611 ++ while (head) {
3612 ++ func = (void (*)(struct rq *))head->func;
3613 ++ next = head->next;
3614 ++ head->next = NULL;
3615 ++ head = next;
3616 ++
3617 ++ func(rq);
3618 ++ }
3619 ++}
3620 ++
3621 ++static void balance_push(struct rq *rq);
3622 ++
3623 ++struct callback_head balance_push_callback = {
3624 ++ .next = NULL,
3625 ++ .func = (void (*)(struct callback_head *))balance_push,
3626 ++};
3627 ++
3628 ++static inline struct callback_head *splice_balance_callbacks(struct rq *rq)
3629 ++{
3630 ++ struct callback_head *head = rq->balance_callback;
3631 ++
3632 ++ if (head) {
3633 ++ lockdep_assert_held(&rq->lock);
3634 ++ rq->balance_callback = NULL;
3635 ++ }
3636 ++
3637 ++ return head;
3638 ++}
3639 ++
3640 ++static void __balance_callbacks(struct rq *rq)
3641 ++{
3642 ++ do_balance_callbacks(rq, splice_balance_callbacks(rq));
3643 ++}
3644 ++
3645 ++static inline void balance_callbacks(struct rq *rq, struct callback_head *head)
3646 ++{
3647 ++ unsigned long flags;
3648 ++
3649 ++ if (unlikely(head)) {
3650 ++ raw_spin_lock_irqsave(&rq->lock, flags);
3651 ++ do_balance_callbacks(rq, head);
3652 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
3653 ++ }
3654 ++}
3655 ++
3656 ++#else
3657 ++
3658 ++static inline void __balance_callbacks(struct rq *rq)
3659 ++{
3660 ++}
3661 ++
3662 ++static inline struct callback_head *splice_balance_callbacks(struct rq *rq)
3663 ++{
3664 ++ return NULL;
3665 ++}
3666 ++
3667 ++static inline void balance_callbacks(struct rq *rq, struct callback_head *head)
3668 ++{
3669 ++}
3670 ++
3671 ++#endif
3672 ++
3673 ++static inline void
3674 ++prepare_lock_switch(struct rq *rq, struct task_struct *next)
3675 ++{
3676 ++ /*
3677 ++	 * The runqueue lock will be released by the next
3678 ++ * task (which is an invalid locking op but in the case
3679 ++ * of the scheduler it's an obvious special-case), so we
3680 ++ * do an early lockdep release here:
3681 ++ */
3682 ++ spin_release(&rq->lock.dep_map, _THIS_IP_);
3683 ++#ifdef CONFIG_DEBUG_SPINLOCK
3684 ++ /* this is a valid case when another task releases the spinlock */
3685 ++ rq->lock.owner = next;
3686 ++#endif
3687 ++}
3688 ++
3689 ++static inline void finish_lock_switch(struct rq *rq)
3690 ++{
3691 ++ /*
3692 ++ * If we are tracking spinlock dependencies then we have to
3693 ++ * fix up the runqueue lock - which gets 'carried over' from
3694 ++ * prev into current:
3695 ++ */
3696 ++ spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
3697 ++ __balance_callbacks(rq);
3698 ++ raw_spin_unlock_irq(&rq->lock);
3699 ++}
3700 ++
3701 ++/*
3702 ++ * NOP if the arch has not defined these:
3703 ++ */
3704 ++
3705 ++#ifndef prepare_arch_switch
3706 ++# define prepare_arch_switch(next) do { } while (0)
3707 ++#endif
3708 ++
3709 ++#ifndef finish_arch_post_lock_switch
3710 ++# define finish_arch_post_lock_switch() do { } while (0)
3711 ++#endif
3712 ++
3713 ++static inline void kmap_local_sched_out(void)
3714 ++{
3715 ++#ifdef CONFIG_KMAP_LOCAL
3716 ++ if (unlikely(current->kmap_ctrl.idx))
3717 ++ __kmap_local_sched_out();
3718 ++#endif
3719 ++}
3720 ++
3721 ++static inline void kmap_local_sched_in(void)
3722 ++{
3723 ++#ifdef CONFIG_KMAP_LOCAL
3724 ++ if (unlikely(current->kmap_ctrl.idx))
3725 ++ __kmap_local_sched_in();
3726 ++#endif
3727 ++}
3728 ++
3729 ++/**
3730 ++ * prepare_task_switch - prepare to switch tasks
3731 ++ * @rq: the runqueue preparing to switch
3732 ++ * @next: the task we are going to switch to.
3733 ++ *
3734 ++ * This is called with the rq lock held and interrupts off. It must
3735 ++ * be paired with a subsequent finish_task_switch after the context
3736 ++ * switch.
3737 ++ *
3738 ++ * prepare_task_switch sets up locking and calls architecture specific
3739 ++ * hooks.
3740 ++ */
3741 ++static inline void
3742 ++prepare_task_switch(struct rq *rq, struct task_struct *prev,
3743 ++ struct task_struct *next)
3744 ++{
3745 ++ kcov_prepare_switch(prev);
3746 ++ sched_info_switch(rq, prev, next);
3747 ++ perf_event_task_sched_out(prev, next);
3748 ++ rseq_preempt(prev);
3749 ++ fire_sched_out_preempt_notifiers(prev, next);
3750 ++ kmap_local_sched_out();
3751 ++ prepare_task(next);
3752 ++ prepare_arch_switch(next);
3753 ++}
3754 ++
3755 ++/**
3756 ++ * finish_task_switch - clean up after a task-switch
3757 ++ * @rq: runqueue associated with task-switch
3758 ++ * @prev: the thread we just switched away from.
3759 ++ *
3760 ++ * finish_task_switch must be called after the context switch, paired
3761 ++ * with a prepare_task_switch call before the context switch.
3762 ++ * finish_task_switch will reconcile locking set up by prepare_task_switch,
3763 ++ * and do any other architecture-specific cleanup actions.
3764 ++ *
3765 ++ * Note that we may have delayed dropping an mm in context_switch(). If
3766 ++ * so, we finish that here outside of the runqueue lock. (Doing it
3767 ++ * with the lock held can cause deadlocks; see schedule() for
3768 ++ * details.)
3769 ++ *
3770 ++ * The context switch has flipped the stack from under us and restored the
3771 ++ * local variables which were saved when this task called schedule() in the
3772 ++ * past. prev == current is still correct but we need to recalculate this_rq
3773 ++ * because prev may have moved to another CPU.
3774 ++ */
3775 ++static struct rq *finish_task_switch(struct task_struct *prev)
3776 ++ __releases(rq->lock)
3777 ++{
3778 ++ struct rq *rq = this_rq();
3779 ++ struct mm_struct *mm = rq->prev_mm;
3780 ++ long prev_state;
3781 ++
3782 ++ /*
3783 ++ * The previous task will have left us with a preempt_count of 2
3784 ++ * because it left us after:
3785 ++ *
3786 ++ * schedule()
3787 ++ * preempt_disable(); // 1
3788 ++ * __schedule()
3789 ++ * raw_spin_lock_irq(&rq->lock) // 2
3790 ++ *
3791 ++ * Also, see FORK_PREEMPT_COUNT.
3792 ++ */
3793 ++ if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
3794 ++ "corrupted preempt_count: %s/%d/0x%x\n",
3795 ++ current->comm, current->pid, preempt_count()))
3796 ++ preempt_count_set(FORK_PREEMPT_COUNT);
3797 ++
3798 ++ rq->prev_mm = NULL;
3799 ++
3800 ++ /*
3801 ++ * A task struct has one reference for the use as "current".
3802 ++ * If a task dies, then it sets TASK_DEAD in tsk->state and calls
3803 ++ * schedule one last time. The schedule call will never return, and
3804 ++ * the scheduled task must drop that reference.
3805 ++ *
3806 ++ * We must observe prev->state before clearing prev->on_cpu (in
3807 ++ * finish_task), otherwise a concurrent wakeup can get prev
3808 ++ * running on another CPU and we could race with its RUNNING -> DEAD
3809 ++ * transition, resulting in a double drop.
3810 ++ */
3811 ++ prev_state = READ_ONCE(prev->__state);
3812 ++ vtime_task_switch(prev);
3813 ++ perf_event_task_sched_in(prev, current);
3814 ++ finish_task(prev);
3815 ++ tick_nohz_task_switch();
3816 ++ finish_lock_switch(rq);
3817 ++ finish_arch_post_lock_switch();
3818 ++ kcov_finish_switch(current);
3819 ++ /*
3820 ++ * kmap_local_sched_out() is invoked with rq::lock held and
3821 ++ * interrupts disabled. There is no requirement for that, but the
3822 ++ * sched out code does not have an interrupt enabled section.
3823 ++ * Restoring the maps on sched in does not require interrupts being
3824 ++ * disabled either.
3825 ++ */
3826 ++ kmap_local_sched_in();
3827 ++
3828 ++ fire_sched_in_preempt_notifiers(current);
3829 ++ /*
3830 ++ * When switching through a kernel thread, the loop in
3831 ++ * membarrier_{private,global}_expedited() may have observed that
3832 ++ * kernel thread and not issued an IPI. It is therefore possible to
3833 ++ * schedule between user->kernel->user threads without passing through
3834 ++ * switch_mm(). Membarrier requires a barrier after storing to
3835 ++ * rq->curr, before returning to userspace, so provide them here:
3836 ++ *
3837 ++ * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
3838 ++ * provided by mmdrop(),
3839 ++ * - a sync_core for SYNC_CORE.
3840 ++ */
3841 ++ if (mm) {
3842 ++ membarrier_mm_sync_core_before_usermode(mm);
3843 ++ mmdrop(mm);
3844 ++ }
3845 ++ if (unlikely(prev_state == TASK_DEAD)) {
3846 ++ /*
3847 ++ * Remove function-return probe instances associated with this
3848 ++ * task and put them back on the free list.
3849 ++ */
3850 ++ kprobe_flush_task(prev);
3851 ++
3852 ++ /* Task is done with its stack. */
3853 ++ put_task_stack(prev);
3854 ++
3855 ++ put_task_struct_rcu_user(prev);
3856 ++ }
3857 ++
3858 ++ return rq;
3859 ++}
3860 ++
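
The membarrier comments in finish_task_switch() describe barriers the kernel issues on behalf of the membarrier(2) system call. A minimal userspace probe of that interface is sketched below; MEMBARRIER_CMD_QUERY and MEMBARRIER_CMD_GLOBAL come from <linux/membarrier.h>, and error handling is deliberately minimal.

/* Minimal sketch: query and issue a global membarrier from userspace. */
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

static int sys_membarrier(int cmd, unsigned int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
	int supported = sys_membarrier(MEMBARRIER_CMD_QUERY, 0);

	if (supported < 0) {
		perror("membarrier");
		return 1;
	}
	printf("supported command mask: 0x%x\n", supported);

	if (supported & MEMBARRIER_CMD_GLOBAL)
		sys_membarrier(MEMBARRIER_CMD_GLOBAL, 0);	/* full barrier on all cpus */
	return 0;
}
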
3861 ++/**
3862 ++ * schedule_tail - first thing a freshly forked thread must call.
3863 ++ * @prev: the thread we just switched away from.
3864 ++ */
3865 ++asmlinkage __visible void schedule_tail(struct task_struct *prev)
3866 ++ __releases(rq->lock)
3867 ++{
3868 ++ /*
3869 ++ * New tasks start with FORK_PREEMPT_COUNT, see there and
3870 ++ * finish_task_switch() for details.
3871 ++ *
3872 ++ * finish_task_switch() will drop rq->lock() and lower preempt_count
3873 ++ * and the preempt_enable() will end up enabling preemption (on
3874 ++ * PREEMPT_COUNT kernels).
3875 ++ */
3876 ++
3877 ++ finish_task_switch(prev);
3878 ++ preempt_enable();
3879 ++
3880 ++ if (current->set_child_tid)
3881 ++ put_user(task_pid_vnr(current), current->set_child_tid);
3882 ++
3883 ++ calculate_sigpending();
3884 ++}
3885 ++
3886 ++/*
3887 ++ * context_switch - switch to the new MM and the new thread's register state.
3888 ++ */
3889 ++static __always_inline struct rq *
3890 ++context_switch(struct rq *rq, struct task_struct *prev,
3891 ++ struct task_struct *next)
3892 ++{
3893 ++ prepare_task_switch(rq, prev, next);
3894 ++
3895 ++ /*
3896 ++ * For paravirt, this is coupled with an exit in switch_to to
3897 ++ * combine the page table reload and the switch backend into
3898 ++ * one hypercall.
3899 ++ */
3900 ++ arch_start_context_switch(prev);
3901 ++
3902 ++ /*
3903 ++ * kernel -> kernel lazy + transfer active
3904 ++ * user -> kernel lazy + mmgrab() active
3905 ++ *
3906 ++ * kernel -> user switch + mmdrop() active
3907 ++ * user -> user switch
3908 ++ */
3909 ++ if (!next->mm) { // to kernel
3910 ++ enter_lazy_tlb(prev->active_mm, next);
3911 ++
3912 ++ next->active_mm = prev->active_mm;
3913 ++ if (prev->mm) // from user
3914 ++ mmgrab(prev->active_mm);
3915 ++ else
3916 ++ prev->active_mm = NULL;
3917 ++ } else { // to user
3918 ++ membarrier_switch_mm(rq, prev->active_mm, next->mm);
3919 ++ /*
3920 ++ * sys_membarrier() requires an smp_mb() between setting
3921 ++ * rq->curr / membarrier_switch_mm() and returning to userspace.
3922 ++ *
3923 ++ * The below provides this either through switch_mm(), or in
3924 ++ * case 'prev->active_mm == next->mm' through
3925 ++ * finish_task_switch()'s mmdrop().
3926 ++ */
3927 ++ switch_mm_irqs_off(prev->active_mm, next->mm, next);
3928 ++
3929 ++ if (!prev->mm) { // from kernel
3930 ++ /* will mmdrop() in finish_task_switch(). */
3931 ++ rq->prev_mm = prev->active_mm;
3932 ++ prev->active_mm = NULL;
3933 ++ }
3934 ++ }
3935 ++
3936 ++ prepare_lock_switch(rq, next);
3937 ++
3938 ++ /* Here we just switch the register state and the stack. */
3939 ++ switch_to(prev, next, prev);
3940 ++ barrier();
3941 ++
3942 ++ return finish_task_switch(prev);
3943 ++}
3944 ++
3945 ++/*
3946 ++ * nr_running, nr_uninterruptible and nr_context_switches:
3947 ++ *
3948 ++ * externally visible scheduler statistics: current number of runnable
3949 ++ * threads, total number of context switches performed since bootup.
3950 ++ */
3951 ++unsigned int nr_running(void)
3952 ++{
3953 ++ unsigned int i, sum = 0;
3954 ++
3955 ++ for_each_online_cpu(i)
3956 ++ sum += cpu_rq(i)->nr_running;
3957 ++
3958 ++ return sum;
3959 ++}
3960 ++
3961 ++/*
3962 ++ * Check if only the current task is running on the CPU.
3963 ++ *
3964 ++ * Caution: this function does not check that the caller has disabled
3965 ++ * preemption, thus the result might have a time-of-check-to-time-of-use
3966 ++ * race. The caller is responsible for using it correctly, for example:
3967 ++ *
3968 ++ * - from a non-preemptible section (of course)
3969 ++ *
3970 ++ * - from a thread that is bound to a single CPU
3971 ++ *
3972 ++ * - in a loop with very short iterations (e.g. a polling loop)
3973 ++ */
3974 ++bool single_task_running(void)
3975 ++{
3976 ++ return raw_rq()->nr_running == 1;
3977 ++}
3978 ++EXPORT_SYMBOL(single_task_running);
3979 ++
3980 ++unsigned long long nr_context_switches(void)
3981 ++{
3982 ++ int i;
3983 ++ unsigned long long sum = 0;
3984 ++
3985 ++ for_each_possible_cpu(i)
3986 ++ sum += cpu_rq(i)->nr_switches;
3987 ++
3988 ++ return sum;
3989 ++}
3990 ++
3991 ++/*
3992 ++ * Consumers of these two interfaces, like for example the cpuidle menu
3993 ++ * governor, are using nonsensical data: they prefer shallow idle state selection
3994 ++ * for a CPU that has IO-wait pending, even though the waiting task might not
3995 ++ * end up running on that CPU when it does become runnable.
3996 ++ */
3997 ++
3998 ++unsigned int nr_iowait_cpu(int cpu)
3999 ++{
4000 ++ return atomic_read(&cpu_rq(cpu)->nr_iowait);
4001 ++}
4002 ++
4003 ++/*
4004 ++ * IO-wait accounting, and how it's mostly bollocks (on SMP).
4005 ++ *
4006 ++ * The idea behind IO-wait accounting is to account the idle time that we could
4007 ++ * have spent running if it were not for IO. That is, if we were to improve the
4008 ++ * storage performance, we'd have a proportional reduction in IO-wait time.
4009 ++ *
4010 ++ * This all works nicely on UP, where, when a task blocks on IO, we account
4011 ++ * idle time as IO-wait, because if the storage were faster, it could've been
4012 ++ * running and we'd not be idle.
4013 ++ *
4014 ++ * This has been extended to SMP, by doing the same for each CPU. This however
4015 ++ * is broken.
4016 ++ *
4017 ++ * Imagine for instance the case where two tasks block on one CPU: only that one
4018 ++ * CPU will have IO-wait accounted, while the other has regular idle. Even
4019 ++ * though, if the storage were faster, both could've run at the same time,
4020 ++ * utilising both CPUs.
4021 ++ *
4022 ++ * This means, that when looking globally, the current IO-wait accounting on
4023 ++ * SMP is a lower bound, due to under-accounting.
4024 ++ *
4025 ++ * Worse, since the numbers are provided per CPU, they are sometimes
4026 ++ * interpreted per CPU, and that is nonsensical. A blocked task isn't strictly
4027 ++ * associated with any one particular CPU; it can wake up on a different CPU than it
4028 ++ * blocked on. This means the per CPU IO-wait number is meaningless.
4029 ++ *
4030 ++ * Task CPU affinities can make all that even more 'interesting'.
4031 ++ */
4032 ++
4033 ++unsigned int nr_iowait(void)
4034 ++{
4035 ++ unsigned int i, sum = 0;
4036 ++
4037 ++ for_each_possible_cpu(i)
4038 ++ sum += nr_iowait_cpu(i);
4039 ++
4040 ++ return sum;
4041 ++}
4042 ++
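
For what these counters look like from userspace: the per-CPU iowait accounting discussed above surfaces, among other places, as the fifth numeric column of the cpu lines in /proc/stat. The sketch below just prints that column; as the comment stresses, treat the per-CPU values as a rough lower bound rather than per-CPU truth.

/* Print the iowait column (5th field) of each cpu line in /proc/stat. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[512];
	FILE *f = fopen("/proc/stat", "r");

	if (!f) {
		perror("/proc/stat");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		char cpu[16];
		unsigned long long user, nice, system, idle, iowait;

		if (sscanf(line, "%15s %llu %llu %llu %llu %llu",
			   cpu, &user, &nice, &system, &idle, &iowait) == 6 &&
		    !strncmp(cpu, "cpu", 3))
			printf("%s iowait=%llu\n", cpu, iowait);
	}
	fclose(f);
	return 0;
}
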
4043 ++#ifdef CONFIG_SMP
4044 ++
4045 ++/*
4046 ++ * sched_exec - execve() is a valuable balancing opportunity, because at
4047 ++ * this point the task has the smallest effective memory and cache
4048 ++ * footprint.
4049 ++ */
4050 ++void sched_exec(void)
4051 ++{
4052 ++ struct task_struct *p = current;
4053 ++ unsigned long flags;
4054 ++ int dest_cpu;
4055 ++
4056 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
4057 ++ dest_cpu = cpumask_any(p->cpus_ptr);
4058 ++ if (dest_cpu == smp_processor_id())
4059 ++ goto unlock;
4060 ++
4061 ++ if (likely(cpu_active(dest_cpu))) {
4062 ++ struct migration_arg arg = { p, dest_cpu };
4063 ++
4064 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
4065 ++ stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
4066 ++ return;
4067 ++ }
4068 ++unlock:
4069 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
4070 ++}
4071 ++
4072 ++#endif
4073 ++
4074 ++DEFINE_PER_CPU(struct kernel_stat, kstat);
4075 ++DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
4076 ++
4077 ++EXPORT_PER_CPU_SYMBOL(kstat);
4078 ++EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
4079 ++
4080 ++static inline void update_curr(struct rq *rq, struct task_struct *p)
4081 ++{
4082 ++ s64 ns = rq->clock_task - p->last_ran;
4083 ++
4084 ++ p->sched_time += ns;
4085 ++ account_group_exec_runtime(p, ns);
4086 ++
4087 ++ p->time_slice -= ns;
4088 ++ p->last_ran = rq->clock_task;
4089 ++}
4090 ++
4091 ++/*
4092 ++ * Return accounted runtime for the task.
4093 ++ * If the task is currently running, the result includes pending runtime
4094 ++ * that has not been accounted yet.
4095 ++ */
4096 ++unsigned long long task_sched_runtime(struct task_struct *p)
4097 ++{
4098 ++ unsigned long flags;
4099 ++ struct rq *rq;
4100 ++ raw_spinlock_t *lock;
4101 ++ u64 ns;
4102 ++
4103 ++#if defined(CONFIG_64BIT) && defined(CONFIG_SMP)
4104 ++ /*
4105 ++ * 64-bit doesn't need locks to atomically read a 64-bit value.
4106 ++	 * So we have an optimization chance when the task's delta_exec is 0.
4107 ++ * Reading ->on_cpu is racy, but this is ok.
4108 ++ *
4109 ++ * If we race with it leaving CPU, we'll take a lock. So we're correct.
4110 ++ * If we race with it entering CPU, unaccounted time is 0. This is
4111 ++ * indistinguishable from the read occurring a few cycles earlier.
4112 ++ * If we see ->on_cpu without ->on_rq, the task is leaving, and has
4113 ++ * been accounted, so we're correct here as well.
4114 ++ */
4115 ++ if (!p->on_cpu || !task_on_rq_queued(p))
4116 ++ return tsk_seruntime(p);
4117 ++#endif
4118 ++
4119 ++ rq = task_access_lock_irqsave(p, &lock, &flags);
4120 ++ /*
4121 ++ * Must be ->curr _and_ ->on_rq. If dequeued, we would
4122 ++ * project cycles that may never be accounted to this
4123 ++ * thread, breaking clock_gettime().
4124 ++ */
4125 ++ if (p == rq->curr && task_on_rq_queued(p)) {
4126 ++ update_rq_clock(rq);
4127 ++ update_curr(rq, p);
4128 ++ }
4129 ++ ns = tsk_seruntime(p);
4130 ++ task_access_unlock_irqrestore(p, lock, &flags);
4131 ++
4132 ++ return ns;
4133 ++}
4134 ++
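
task_sched_runtime() feeds the POSIX per-thread CPU-time clock, so its effect can be observed from userspace with clock_gettime(CLOCK_THREAD_CPUTIME_ID). The sketch below burns a little CPU and reads that clock before and after; it is an observation aid, not part of the patch.

/* Observe per-thread scheduled runtime via CLOCK_THREAD_CPUTIME_ID. */
#include <stdio.h>
#include <time.h>

static double ns(const struct timespec *ts)
{
	return ts->tv_sec * 1e9 + ts->tv_nsec;
}

int main(void)
{
	struct timespec before, after;
	volatile unsigned long spin = 0;

	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &before);
	for (unsigned long i = 0; i < 50000000UL; i++)
		spin += i;
	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &after);

	printf("consumed about %.3f ms of CPU time\n",
	       (ns(&after) - ns(&before)) / 1e6);
	return 0;
}
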
4135 ++/* This manages tasks that have run out of timeslice during a scheduler_tick */
4136 ++static inline void scheduler_task_tick(struct rq *rq)
4137 ++{
4138 ++ struct task_struct *p = rq->curr;
4139 ++
4140 ++ if (is_idle_task(p))
4141 ++ return;
4142 ++
4143 ++ update_curr(rq, p);
4144 ++ cpufreq_update_util(rq, 0);
4145 ++
4146 ++ /*
4147 ++	 * Tasks that have less than RESCHED_NS of time slice left will be
4148 ++	 * rescheduled.
4149 ++ */
4150 ++ if (p->time_slice >= RESCHED_NS)
4151 ++ return;
4152 ++ set_tsk_need_resched(p);
4153 ++ set_preempt_need_resched();
4154 ++}
4155 ++
4156 ++#ifdef CONFIG_SCHED_DEBUG
4157 ++static u64 cpu_resched_latency(struct rq *rq)
4158 ++{
4159 ++ int latency_warn_ms = READ_ONCE(sysctl_resched_latency_warn_ms);
4160 ++ u64 resched_latency, now = rq_clock(rq);
4161 ++ static bool warned_once;
4162 ++
4163 ++ if (sysctl_resched_latency_warn_once && warned_once)
4164 ++ return 0;
4165 ++
4166 ++ if (!need_resched() || !latency_warn_ms)
4167 ++ return 0;
4168 ++
4169 ++ if (system_state == SYSTEM_BOOTING)
4170 ++ return 0;
4171 ++
4172 ++ if (!rq->last_seen_need_resched_ns) {
4173 ++ rq->last_seen_need_resched_ns = now;
4174 ++ rq->ticks_without_resched = 0;
4175 ++ return 0;
4176 ++ }
4177 ++
4178 ++ rq->ticks_without_resched++;
4179 ++ resched_latency = now - rq->last_seen_need_resched_ns;
4180 ++ if (resched_latency <= latency_warn_ms * NSEC_PER_MSEC)
4181 ++ return 0;
4182 ++
4183 ++ warned_once = true;
4184 ++
4185 ++ return resched_latency;
4186 ++}
4187 ++
4188 ++static int __init setup_resched_latency_warn_ms(char *str)
4189 ++{
4190 ++ long val;
4191 ++
4192 ++ if ((kstrtol(str, 0, &val))) {
4193 ++ pr_warn("Unable to set resched_latency_warn_ms\n");
4194 ++ return 1;
4195 ++ }
4196 ++
4197 ++ sysctl_resched_latency_warn_ms = val;
4198 ++ return 1;
4199 ++}
4200 ++__setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms);
4201 ++#else
4202 ++static inline u64 cpu_resched_latency(struct rq *rq) { return 0; }
4203 ++#endif /* CONFIG_SCHED_DEBUG */
4204 ++
4205 ++/*
4206 ++ * This function gets called by the timer code, with HZ frequency.
4207 ++ * We call it with interrupts disabled.
4208 ++ */
4209 ++void scheduler_tick(void)
4210 ++{
4211 ++ int cpu __maybe_unused = smp_processor_id();
4212 ++ struct rq *rq = cpu_rq(cpu);
4213 ++ u64 resched_latency;
4214 ++
4215 ++ arch_scale_freq_tick();
4216 ++ sched_clock_tick();
4217 ++
4218 ++ raw_spin_lock(&rq->lock);
4219 ++ update_rq_clock(rq);
4220 ++
4221 ++ scheduler_task_tick(rq);
4222 ++ if (sched_feat(LATENCY_WARN))
4223 ++ resched_latency = cpu_resched_latency(rq);
4224 ++ calc_global_load_tick(rq);
4225 ++
4226 ++ rq->last_tick = rq->clock;
4227 ++ raw_spin_unlock(&rq->lock);
4228 ++
4229 ++ if (sched_feat(LATENCY_WARN) && resched_latency)
4230 ++ resched_latency_warn(cpu, resched_latency);
4231 ++
4232 ++ perf_event_task_tick();
4233 ++}
4234 ++
4235 ++#ifdef CONFIG_SCHED_SMT
4236 ++static inline int active_load_balance_cpu_stop(void *data)
4237 ++{
4238 ++ struct rq *rq = this_rq();
4239 ++ struct task_struct *p = data;
4240 ++ cpumask_t tmp;
4241 ++ unsigned long flags;
4242 ++
4243 ++ local_irq_save(flags);
4244 ++
4245 ++ raw_spin_lock(&p->pi_lock);
4246 ++ raw_spin_lock(&rq->lock);
4247 ++
4248 ++ rq->active_balance = 0;
4249 ++ /* _something_ may have changed the task, double check again */
4250 ++ if (task_on_rq_queued(p) && task_rq(p) == rq &&
4251 ++ cpumask_and(&tmp, p->cpus_ptr, &sched_sg_idle_mask) &&
4252 ++ !is_migration_disabled(p)) {
4253 ++ int cpu = cpu_of(rq);
4254 ++ int dcpu = __best_mask_cpu(&tmp, per_cpu(sched_cpu_llc_mask, cpu));
4255 ++ rq = move_queued_task(rq, p, dcpu);
4256 ++ }
4257 ++
4258 ++ raw_spin_unlock(&rq->lock);
4259 ++ raw_spin_unlock(&p->pi_lock);
4260 ++
4261 ++ local_irq_restore(flags);
4262 ++
4263 ++ return 0;
4264 ++}
4265 ++
4266 ++/* sg_balance_trigger - trigger sibling group balance for @cpu */
4267 ++static inline int sg_balance_trigger(const int cpu)
4268 ++{
4269 ++ struct rq *rq = cpu_rq(cpu);
4270 ++ unsigned long flags;
4271 ++ struct task_struct *curr;
4272 ++ int res;
4273 ++
4274 ++ if (!raw_spin_trylock_irqsave(&rq->lock, flags))
4275 ++ return 0;
4276 ++ curr = rq->curr;
4277 ++ res = (!is_idle_task(curr)) && (1 == rq->nr_running) &&
4278 ++ cpumask_intersects(curr->cpus_ptr, &sched_sg_idle_mask) &&
4279 ++ !is_migration_disabled(curr) && (!rq->active_balance);
4280 ++
4281 ++ if (res)
4282 ++ rq->active_balance = 1;
4283 ++
4284 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
4285 ++
4286 ++ if (res)
4287 ++ stop_one_cpu_nowait(cpu, active_load_balance_cpu_stop,
4288 ++ curr, &rq->active_balance_work);
4289 ++ return res;
4290 ++}
4291 ++
4292 ++/*
4293 ++ * sg_balance_check - sibling group balance check for run queue @rq
4294 ++ */
4295 ++static inline void sg_balance_check(struct rq *rq)
4296 ++{
4297 ++ cpumask_t chk;
4298 ++ int cpu = cpu_of(rq);
4299 ++
4300 ++ /* exit when cpu is offline */
4301 ++ if (unlikely(!rq->online))
4302 ++ return;
4303 ++
4304 ++ /*
4305 ++	 * Only a cpu in the sibling idle group will do the checking, and then
4306 ++	 * find potential cpus to which the currently running task can migrate
4307 ++ */
4308 ++ if (cpumask_test_cpu(cpu, &sched_sg_idle_mask) &&
4309 ++ cpumask_andnot(&chk, cpu_online_mask, sched_rq_watermark) &&
4310 ++ cpumask_andnot(&chk, &chk, &sched_rq_pending_mask)) {
4311 ++ int i;
4312 ++
4313 ++ for_each_cpu_wrap(i, &chk, cpu) {
4314 ++ if (cpumask_subset(cpu_smt_mask(i), &chk) &&
4315 ++ sg_balance_trigger(i))
4316 ++ return;
4317 ++ }
4318 ++ }
4319 ++}
4320 ++#endif /* CONFIG_SCHED_SMT */
4321 ++
4322 ++#ifdef CONFIG_NO_HZ_FULL
4323 ++
4324 ++struct tick_work {
4325 ++ int cpu;
4326 ++ atomic_t state;
4327 ++ struct delayed_work work;
4328 ++};
4329 ++/* Values for ->state, see diagram below. */
4330 ++#define TICK_SCHED_REMOTE_OFFLINE 0
4331 ++#define TICK_SCHED_REMOTE_OFFLINING 1
4332 ++#define TICK_SCHED_REMOTE_RUNNING 2
4333 ++
4334 ++/*
4335 ++ * State diagram for ->state:
4336 ++ *
4337 ++ *
4338 ++ * TICK_SCHED_REMOTE_OFFLINE
4339 ++ * | ^
4340 ++ * | |
4341 ++ * | | sched_tick_remote()
4342 ++ * | |
4343 ++ * | |
4344 ++ * +--TICK_SCHED_REMOTE_OFFLINING
4345 ++ * | ^
4346 ++ * | |
4347 ++ * sched_tick_start() | | sched_tick_stop()
4348 ++ * | |
4349 ++ * V |
4350 ++ * TICK_SCHED_REMOTE_RUNNING
4351 ++ *
4352 ++ *
4353 ++ * Other transitions get WARN_ON_ONCE(), except that sched_tick_remote()
4354 ++ * and sched_tick_start() are happy to leave the state in RUNNING.
4355 ++ */
4356 ++
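
A standalone sketch of the state machine drawn above, using C11 atomics: fetch_add_unless() is a hand-rolled stand-in for the kernel's atomic_fetch_add_unless(), and the three helpers mirror the transitions in the diagram rather than the exact code that follows.

/* Standalone model of the remote-tick state machine (not kernel code). */
#include <stdatomic.h>
#include <stdio.h>

#define TICK_OFFLINE	0
#define TICK_OFFLINING	1
#define TICK_RUNNING	2

/* Add @a to *@v unless it currently equals @u; return the old value. */
static int fetch_add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	while (old != u &&
	       !atomic_compare_exchange_weak(v, &old, old + a))
		;
	return old;
}

static atomic_int state = TICK_OFFLINE;

static void tick_start(void)
{
	/* Unconditionally claim RUNNING, remember what we displaced. */
	int os = atomic_exchange(&state, TICK_RUNNING);

	if (os == TICK_OFFLINE)
		printf("start: queue the delayed work\n");
	else	/* was OFFLINING: revive it, the work is still queued */
		printf("start: revived a stopping tick\n");
}

static void tick_remote(void)
{
	/* Step OFFLINING back to OFFLINE, but leave RUNNING alone. */
	int os = fetch_add_unless(&state, -1, TICK_RUNNING);

	if (os == TICK_RUNNING)
		printf("remote: still running, requeue\n");
	else
		printf("remote: offlined, stop requeueing\n");
}

static void tick_stop(void)
{
	/* RUNNING -> OFFLINING; the remote tick finishes the job. */
	atomic_store(&state, TICK_OFFLINING);
}

int main(void)
{
	tick_start();	/* OFFLINE   -> RUNNING, work queued  */
	tick_remote();	/* RUNNING   -> requeue               */
	tick_stop();	/* RUNNING   -> OFFLINING             */
	tick_remote();	/* OFFLINING -> OFFLINE, no requeue   */
	return 0;
}
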
4357 ++static struct tick_work __percpu *tick_work_cpu;
4358 ++
4359 ++static void sched_tick_remote(struct work_struct *work)
4360 ++{
4361 ++ struct delayed_work *dwork = to_delayed_work(work);
4362 ++ struct tick_work *twork = container_of(dwork, struct tick_work, work);
4363 ++ int cpu = twork->cpu;
4364 ++ struct rq *rq = cpu_rq(cpu);
4365 ++ struct task_struct *curr;
4366 ++ unsigned long flags;
4367 ++ u64 delta;
4368 ++ int os;
4369 ++
4370 ++ /*
4371 ++ * Handle the tick only if it appears the remote CPU is running in full
4372 ++ * dynticks mode. The check is racy by nature, but missing a tick or
4373 ++ * having one too much is no big deal because the scheduler tick updates
4374 ++ * statistics and checks timeslices in a time-independent way, regardless
4375 ++ * of when exactly it is running.
4376 ++ */
4377 ++ if (!tick_nohz_tick_stopped_cpu(cpu))
4378 ++ goto out_requeue;
4379 ++
4380 ++ raw_spin_lock_irqsave(&rq->lock, flags);
4381 ++ curr = rq->curr;
4382 ++ if (cpu_is_offline(cpu))
4383 ++ goto out_unlock;
4384 ++
4385 ++ update_rq_clock(rq);
4386 ++ if (!is_idle_task(curr)) {
4387 ++ /*
4388 ++ * Make sure the next tick runs within a reasonable
4389 ++ * amount of time.
4390 ++ */
4391 ++ delta = rq_clock_task(rq) - curr->last_ran;
4392 ++ WARN_ON_ONCE(delta > (u64)NSEC_PER_SEC * 3);
4393 ++ }
4394 ++ scheduler_task_tick(rq);
4395 ++
4396 ++ calc_load_nohz_remote(rq);
4397 ++out_unlock:
4398 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
4399 ++
4400 ++out_requeue:
4401 ++ /*
4402 ++ * Run the remote tick once per second (1Hz). This arbitrary
4403 ++ * frequency is large enough to avoid overload but short enough
4404 ++ * to keep scheduler internal stats reasonably up to date. But
4405 ++ * first update state to reflect hotplug activity if required.
4406 ++ */
4407 ++ os = atomic_fetch_add_unless(&twork->state, -1, TICK_SCHED_REMOTE_RUNNING);
4408 ++ WARN_ON_ONCE(os == TICK_SCHED_REMOTE_OFFLINE);
4409 ++ if (os == TICK_SCHED_REMOTE_RUNNING)
4410 ++ queue_delayed_work(system_unbound_wq, dwork, HZ);
4411 ++}
4412 ++
4413 ++static void sched_tick_start(int cpu)
4414 ++{
4415 ++ int os;
4416 ++ struct tick_work *twork;
4417 ++
4418 ++ if (housekeeping_cpu(cpu, HK_FLAG_TICK))
4419 ++ return;
4420 ++
4421 ++ WARN_ON_ONCE(!tick_work_cpu);
4422 ++
4423 ++ twork = per_cpu_ptr(tick_work_cpu, cpu);
4424 ++ os = atomic_xchg(&twork->state, TICK_SCHED_REMOTE_RUNNING);
4425 ++ WARN_ON_ONCE(os == TICK_SCHED_REMOTE_RUNNING);
4426 ++ if (os == TICK_SCHED_REMOTE_OFFLINE) {
4427 ++ twork->cpu = cpu;
4428 ++ INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
4429 ++ queue_delayed_work(system_unbound_wq, &twork->work, HZ);
4430 ++ }
4431 ++}
4432 ++
4433 ++#ifdef CONFIG_HOTPLUG_CPU
4434 ++static void sched_tick_stop(int cpu)
4435 ++{
4436 ++ struct tick_work *twork;
4437 ++
4438 ++ if (housekeeping_cpu(cpu, HK_FLAG_TICK))
4439 ++ return;
4440 ++
4441 ++ WARN_ON_ONCE(!tick_work_cpu);
4442 ++
4443 ++ twork = per_cpu_ptr(tick_work_cpu, cpu);
4444 ++ cancel_delayed_work_sync(&twork->work);
4445 ++}
4446 ++#endif /* CONFIG_HOTPLUG_CPU */
4447 ++
4448 ++int __init sched_tick_offload_init(void)
4449 ++{
4450 ++ tick_work_cpu = alloc_percpu(struct tick_work);
4451 ++ BUG_ON(!tick_work_cpu);
4452 ++ return 0;
4453 ++}
4454 ++
4455 ++#else /* !CONFIG_NO_HZ_FULL */
4456 ++static inline void sched_tick_start(int cpu) { }
4457 ++static inline void sched_tick_stop(int cpu) { }
4458 ++#endif
4459 ++
4460 ++#if defined(CONFIG_PREEMPTION) && (defined(CONFIG_DEBUG_PREEMPT) || \
4461 ++ defined(CONFIG_PREEMPT_TRACER))
4462 ++/*
4463 ++ * If the value passed in is equal to the current preempt count
4464 ++ * then we just disabled preemption. Start timing the latency.
4465 ++ */
4466 ++static inline void preempt_latency_start(int val)
4467 ++{
4468 ++ if (preempt_count() == val) {
4469 ++ unsigned long ip = get_lock_parent_ip();
4470 ++#ifdef CONFIG_DEBUG_PREEMPT
4471 ++ current->preempt_disable_ip = ip;
4472 ++#endif
4473 ++ trace_preempt_off(CALLER_ADDR0, ip);
4474 ++ }
4475 ++}
4476 ++
4477 ++void preempt_count_add(int val)
4478 ++{
4479 ++#ifdef CONFIG_DEBUG_PREEMPT
4480 ++ /*
4481 ++ * Underflow?
4482 ++ */
4483 ++ if (DEBUG_LOCKS_WARN_ON((preempt_count() < 0)))
4484 ++ return;
4485 ++#endif
4486 ++ __preempt_count_add(val);
4487 ++#ifdef CONFIG_DEBUG_PREEMPT
4488 ++ /*
4489 ++ * Spinlock count overflowing soon?
4490 ++ */
4491 ++ DEBUG_LOCKS_WARN_ON((preempt_count() & PREEMPT_MASK) >=
4492 ++ PREEMPT_MASK - 10);
4493 ++#endif
4494 ++ preempt_latency_start(val);
4495 ++}
4496 ++EXPORT_SYMBOL(preempt_count_add);
4497 ++NOKPROBE_SYMBOL(preempt_count_add);
4498 ++
4499 ++/*
4500 ++ * If the value passed in is equal to the current preempt count
4501 ++ * then we just enabled preemption. Stop timing the latency.
4502 ++ */
4503 ++static inline void preempt_latency_stop(int val)
4504 ++{
4505 ++ if (preempt_count() == val)
4506 ++ trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
4507 ++}
4508 ++
4509 ++void preempt_count_sub(int val)
4510 ++{
4511 ++#ifdef CONFIG_DEBUG_PREEMPT
4512 ++ /*
4513 ++ * Underflow?
4514 ++ */
4515 ++ if (DEBUG_LOCKS_WARN_ON(val > preempt_count()))
4516 ++ return;
4517 ++ /*
4518 ++ * Is the spinlock portion underflowing?
4519 ++ */
4520 ++ if (DEBUG_LOCKS_WARN_ON((val < PREEMPT_MASK) &&
4521 ++ !(preempt_count() & PREEMPT_MASK)))
4522 ++ return;
4523 ++#endif
4524 ++
4525 ++ preempt_latency_stop(val);
4526 ++ __preempt_count_sub(val);
4527 ++}
4528 ++EXPORT_SYMBOL(preempt_count_sub);
4529 ++NOKPROBE_SYMBOL(preempt_count_sub);
4530 ++
4531 ++#else
4532 ++static inline void preempt_latency_start(int val) { }
4533 ++static inline void preempt_latency_stop(int val) { }
4534 ++#endif
4535 ++
4536 ++static inline unsigned long get_preempt_disable_ip(struct task_struct *p)
4537 ++{
4538 ++#ifdef CONFIG_DEBUG_PREEMPT
4539 ++ return p->preempt_disable_ip;
4540 ++#else
4541 ++ return 0;
4542 ++#endif
4543 ++}
4544 ++
4545 ++/*
4546 ++ * Print scheduling while atomic bug:
4547 ++ */
4548 ++static noinline void __schedule_bug(struct task_struct *prev)
4549 ++{
4550 ++ /* Save this before calling printk(), since that will clobber it */
4551 ++ unsigned long preempt_disable_ip = get_preempt_disable_ip(current);
4552 ++
4553 ++ if (oops_in_progress)
4554 ++ return;
4555 ++
4556 ++ printk(KERN_ERR "BUG: scheduling while atomic: %s/%d/0x%08x\n",
4557 ++ prev->comm, prev->pid, preempt_count());
4558 ++
4559 ++ debug_show_held_locks(prev);
4560 ++ print_modules();
4561 ++ if (irqs_disabled())
4562 ++ print_irqtrace_events(prev);
4563 ++ if (IS_ENABLED(CONFIG_DEBUG_PREEMPT)
4564 ++ && in_atomic_preempt_off()) {
4565 ++ pr_err("Preemption disabled at:");
4566 ++ print_ip_sym(KERN_ERR, preempt_disable_ip);
4567 ++ }
4568 ++ if (panic_on_warn)
4569 ++ panic("scheduling while atomic\n");
4570 ++
4571 ++ dump_stack();
4572 ++ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
4573 ++}
4574 ++
4575 ++/*
4576 ++ * Various schedule()-time debugging checks and statistics:
4577 ++ */
4578 ++static inline void schedule_debug(struct task_struct *prev, bool preempt)
4579 ++{
4580 ++#ifdef CONFIG_SCHED_STACK_END_CHECK
4581 ++ if (task_stack_end_corrupted(prev))
4582 ++ panic("corrupted stack end detected inside scheduler\n");
4583 ++
4584 ++ if (task_scs_end_corrupted(prev))
4585 ++ panic("corrupted shadow stack detected inside scheduler\n");
4586 ++#endif
4587 ++
4588 ++#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
4589 ++ if (!preempt && READ_ONCE(prev->__state) && prev->non_block_count) {
4590 ++ printk(KERN_ERR "BUG: scheduling in a non-blocking section: %s/%d/%i\n",
4591 ++ prev->comm, prev->pid, prev->non_block_count);
4592 ++ dump_stack();
4593 ++ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
4594 ++ }
4595 ++#endif
4596 ++
4597 ++ if (unlikely(in_atomic_preempt_off())) {
4598 ++ __schedule_bug(prev);
4599 ++ preempt_count_set(PREEMPT_DISABLED);
4600 ++ }
4601 ++ rcu_sleep_check();
4602 ++ SCHED_WARN_ON(ct_state() == CONTEXT_USER);
4603 ++
4604 ++ profile_hit(SCHED_PROFILING, __builtin_return_address(0));
4605 ++
4606 ++ schedstat_inc(this_rq()->sched_count);
4607 ++}
4608 ++
4609 ++/*
4610 ++ * Compile time debug macro
4611 ++ * #define ALT_SCHED_DEBUG
4612 ++ */
4613 ++
4614 ++#ifdef ALT_SCHED_DEBUG
4615 ++void alt_sched_debug(void)
4616 ++{
4617 ++ printk(KERN_INFO "sched: pending: 0x%04lx, idle: 0x%04lx, sg_idle: 0x%04lx\n",
4618 ++ sched_rq_pending_mask.bits[0],
4619 ++ sched_rq_watermark[0].bits[0],
4620 ++ sched_sg_idle_mask.bits[0]);
4621 ++}
4622 ++#else
4623 ++inline void alt_sched_debug(void) {}
4624 ++#endif
4625 ++
4626 ++#ifdef CONFIG_SMP
4627 ++
4628 ++#define SCHED_RQ_NR_MIGRATION (32U)
4629 ++/*
4630 ++ * Migrate pending tasks in @rq to @dest_cpu
4631 ++ * Will try to migrate at most the smaller of half of @rq's nr_running tasks
4632 ++ * and SCHED_RQ_NR_MIGRATION tasks to @dest_cpu
4633 ++ */
4634 ++static inline int
4635 ++migrate_pending_tasks(struct rq *rq, struct rq *dest_rq, const int dest_cpu)
4636 ++{
4637 ++ struct task_struct *p, *skip = rq->curr;
4638 ++ int nr_migrated = 0;
4639 ++ int nr_tries = min(rq->nr_running / 2, SCHED_RQ_NR_MIGRATION);
4640 ++
4641 ++ while (skip != rq->idle && nr_tries &&
4642 ++ (p = sched_rq_next_task(skip, rq)) != rq->idle) {
4643 ++ skip = sched_rq_next_task(p, rq);
4644 ++ if (cpumask_test_cpu(dest_cpu, p->cpus_ptr)) {
4645 ++ __SCHED_DEQUEUE_TASK(p, rq, 0, );
4646 ++ set_task_cpu(p, dest_cpu);
4647 ++ __SCHED_ENQUEUE_TASK(p, dest_rq, 0);
4648 ++ nr_migrated++;
4649 ++ }
4650 ++ nr_tries--;
4651 ++ }
4652 ++
4653 ++ return nr_migrated;
4654 ++}
4655 ++
4656 ++static inline int take_other_rq_tasks(struct rq *rq, int cpu)
4657 ++{
4658 ++ struct cpumask *topo_mask, *end_mask;
4659 ++
4660 ++ if (unlikely(!rq->online))
4661 ++ return 0;
4662 ++
4663 ++ if (cpumask_empty(&sched_rq_pending_mask))
4664 ++ return 0;
4665 ++
4666 ++ topo_mask = per_cpu(sched_cpu_topo_masks, cpu) + 1;
4667 ++ end_mask = per_cpu(sched_cpu_topo_end_mask, cpu);
4668 ++ do {
4669 ++ int i;
4670 ++ for_each_cpu_and(i, &sched_rq_pending_mask, topo_mask) {
4671 ++ int nr_migrated;
4672 ++ struct rq *src_rq;
4673 ++
4674 ++ src_rq = cpu_rq(i);
4675 ++ if (!do_raw_spin_trylock(&src_rq->lock))
4676 ++ continue;
4677 ++ spin_acquire(&src_rq->lock.dep_map,
4678 ++ SINGLE_DEPTH_NESTING, 1, _RET_IP_);
4679 ++
4680 ++ if ((nr_migrated = migrate_pending_tasks(src_rq, rq, cpu))) {
4681 ++ src_rq->nr_running -= nr_migrated;
4682 ++ if (src_rq->nr_running < 2)
4683 ++ cpumask_clear_cpu(i, &sched_rq_pending_mask);
4684 ++
4685 ++ rq->nr_running += nr_migrated;
4686 ++ if (rq->nr_running > 1)
4687 ++ cpumask_set_cpu(cpu, &sched_rq_pending_mask);
4688 ++
4689 ++ update_sched_rq_watermark(rq);
4690 ++ cpufreq_update_util(rq, 0);
4691 ++
4692 ++ spin_release(&src_rq->lock.dep_map, _RET_IP_);
4693 ++ do_raw_spin_unlock(&src_rq->lock);
4694 ++
4695 ++ return 1;
4696 ++ }
4697 ++
4698 ++ spin_release(&src_rq->lock.dep_map, _RET_IP_);
4699 ++ do_raw_spin_unlock(&src_rq->lock);
4700 ++ }
4701 ++ } while (++topo_mask < end_mask);
4702 ++
4703 ++ return 0;
4704 ++}
4705 ++#endif
4706 ++
4707 ++/*
4708 ++ * Timeslices below RESCHED_NS are considered as good as expired, as there's no
4709 ++ * point rescheduling when there's so little time left.
4710 ++ */
4711 ++static inline void check_curr(struct task_struct *p, struct rq *rq)
4712 ++{
4713 ++ if (unlikely(rq->idle == p))
4714 ++ return;
4715 ++
4716 ++ update_curr(rq, p);
4717 ++
4718 ++ if (p->time_slice < RESCHED_NS)
4719 ++ time_slice_expired(p, rq);
4720 ++}
4721 ++
4722 ++static inline struct task_struct *
4723 ++choose_next_task(struct rq *rq, int cpu, struct task_struct *prev)
4724 ++{
4725 ++ struct task_struct *next;
4726 ++
4727 ++ if (unlikely(rq->skip)) {
4728 ++ next = rq_runnable_task(rq);
4729 ++ if (next == rq->idle) {
4730 ++#ifdef CONFIG_SMP
4731 ++ if (!take_other_rq_tasks(rq, cpu)) {
4732 ++#endif
4733 ++ rq->skip = NULL;
4734 ++ schedstat_inc(rq->sched_goidle);
4735 ++ return next;
4736 ++#ifdef CONFIG_SMP
4737 ++ }
4738 ++ next = rq_runnable_task(rq);
4739 ++#endif
4740 ++ }
4741 ++ rq->skip = NULL;
4742 ++#ifdef CONFIG_HIGH_RES_TIMERS
4743 ++ hrtick_start(rq, next->time_slice);
4744 ++#endif
4745 ++ return next;
4746 ++ }
4747 ++
4748 ++ next = sched_rq_first_task(rq);
4749 ++ if (next == rq->idle) {
4750 ++#ifdef CONFIG_SMP
4751 ++ if (!take_other_rq_tasks(rq, cpu)) {
4752 ++#endif
4753 ++ schedstat_inc(rq->sched_goidle);
4754 ++ /*printk(KERN_INFO "sched: choose_next_task(%d) idle %px\n", cpu, next);*/
4755 ++ return next;
4756 ++#ifdef CONFIG_SMP
4757 ++ }
4758 ++ next = sched_rq_first_task(rq);
4759 ++#endif
4760 ++ }
4761 ++#ifdef CONFIG_HIGH_RES_TIMERS
4762 ++ hrtick_start(rq, next->time_slice);
4763 ++#endif
4764 ++ /*printk(KERN_INFO "sched: choose_next_task(%d) next %px\n", cpu,
4765 ++ * next);*/
4766 ++ return next;
4767 ++}
4768 ++
4769 ++/*
4770 ++ * schedule() is the main scheduler function.
4771 ++ *
4772 ++ * The main means of driving the scheduler and thus entering this function are:
4773 ++ *
4774 ++ * 1. Explicit blocking: mutex, semaphore, waitqueue, etc.
4775 ++ *
4776 ++ * 2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
4777 ++ * paths. For example, see arch/x86/entry_64.S.
4778 ++ *
4779 ++ * To drive preemption between tasks, the scheduler sets the flag in timer
4780 ++ * interrupt handler scheduler_tick().
4781 ++ *
4782 ++ * 3. Wakeups don't really cause entry into schedule(). They add a
4783 ++ * task to the run-queue and that's it.
4784 ++ *
4785 ++ * Now, if the new task added to the run-queue preempts the current
4786 ++ * task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
4787 ++ * called on the nearest possible occasion:
4788 ++ *
4789 ++ * - If the kernel is preemptible (CONFIG_PREEMPTION=y):
4790 ++ *
4791 ++ * - in syscall or exception context, at the next outermost
4792 ++ * preempt_enable(). (this might be as soon as the wake_up()'s
4793 ++ * spin_unlock()!)
4794 ++ *
4795 ++ * - in IRQ context, return from interrupt-handler to
4796 ++ * preemptible context
4797 ++ *
4798 ++ * - If the kernel is not preemptible (CONFIG_PREEMPTION is not set)
4799 ++ * then at the next:
4800 ++ *
4801 ++ * - cond_resched() call
4802 ++ * - explicit schedule() call
4803 ++ * - return from syscall or exception to user-space
4804 ++ * - return from interrupt-handler to user-space
4805 ++ *
4806 ++ * WARNING: must be called with preemption disabled!
4807 ++ */
4808 ++static void __sched notrace __schedule(bool preempt)
4809 ++{
4810 ++ struct task_struct *prev, *next;
4811 ++ unsigned long *switch_count;
4812 ++ unsigned long prev_state;
4813 ++ struct rq *rq;
4814 ++ int cpu;
4815 ++
4816 ++ cpu = smp_processor_id();
4817 ++ rq = cpu_rq(cpu);
4818 ++ prev = rq->curr;
4819 ++
4820 ++ schedule_debug(prev, preempt);
4821 ++
4822 ++ /* bypass sched_feat(HRTICK) checking, which the Alt schedule FW doesn't support */
4823 ++ hrtick_clear(rq);
4824 ++
4825 ++ local_irq_disable();
4826 ++ rcu_note_context_switch(preempt);
4827 ++
4828 ++ /*
4829 ++ * Make sure that signal_pending_state()->signal_pending() below
4830 ++ * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
4831 ++ * done by the caller to avoid the race with signal_wake_up():
4832 ++ *
4833 ++ * __set_current_state(@state) signal_wake_up()
4834 ++ * schedule() set_tsk_thread_flag(p, TIF_SIGPENDING)
4835 ++ * wake_up_state(p, state)
4836 ++ * LOCK rq->lock LOCK p->pi_state
4837 ++ * smp_mb__after_spinlock() smp_mb__after_spinlock()
4838 ++ * if (signal_pending_state()) if (p->state & @state)
4839 ++ *
4840 ++ * Also, the membarrier system call requires a full memory barrier
4841 ++ * after coming from user-space, before storing to rq->curr.
4842 ++ */
4843 ++ raw_spin_lock(&rq->lock);
4844 ++ smp_mb__after_spinlock();
4845 ++
4846 ++ update_rq_clock(rq);
4847 ++
4848 ++ switch_count = &prev->nivcsw;
4849 ++ /*
4850 ++ * We must load prev->state once (task_struct::state is volatile), such
4851 ++ * that:
4852 ++ *
4853 ++ * - we form a control dependency vs deactivate_task() below.
4854 ++ * - ptrace_{,un}freeze_traced() can change ->state underneath us.
4855 ++ */
4856 ++ prev_state = READ_ONCE(prev->__state);
4857 ++ if (!preempt && prev_state) {
4858 ++ if (signal_pending_state(prev_state, prev)) {
4859 ++ WRITE_ONCE(prev->__state, TASK_RUNNING);
4860 ++ } else {
4861 ++ prev->sched_contributes_to_load =
4862 ++ (prev_state & TASK_UNINTERRUPTIBLE) &&
4863 ++ !(prev_state & TASK_NOLOAD) &&
4864 ++ !(prev->flags & PF_FROZEN);
4865 ++
4866 ++ if (prev->sched_contributes_to_load)
4867 ++ rq->nr_uninterruptible++;
4868 ++
4869 ++ /*
4870 ++ * __schedule() ttwu()
4871 ++ * prev_state = prev->state; if (p->on_rq && ...)
4872 ++ * if (prev_state) goto out;
4873 ++ * p->on_rq = 0; smp_acquire__after_ctrl_dep();
4874 ++ * p->state = TASK_WAKING
4875 ++ *
4876 ++ * Where __schedule() and ttwu() have matching control dependencies.
4877 ++ *
4878 ++ * After this, schedule() must not care about p->state any more.
4879 ++ */
4880 ++ sched_task_deactivate(prev, rq);
4881 ++ deactivate_task(prev, rq);
4882 ++
4883 ++ if (prev->in_iowait) {
4884 ++ atomic_inc(&rq->nr_iowait);
4885 ++ delayacct_blkio_start();
4886 ++ }
4887 ++ }
4888 ++ switch_count = &prev->nvcsw;
4889 ++ }
4890 ++
4891 ++ check_curr(prev, rq);
4892 ++
4893 ++ next = choose_next_task(rq, cpu, prev);
4894 ++ clear_tsk_need_resched(prev);
4895 ++ clear_preempt_need_resched();
4896 ++#ifdef CONFIG_SCHED_DEBUG
4897 ++ rq->last_seen_need_resched_ns = 0;
4898 ++#endif
4899 ++
4900 ++ if (likely(prev != next)) {
4901 ++ next->last_ran = rq->clock_task;
4902 ++ rq->last_ts_switch = rq->clock;
4903 ++
4904 ++ rq->nr_switches++;
4905 ++ /*
4906 ++ * RCU users of rcu_dereference(rq->curr) may not see
4907 ++ * changes to task_struct made by pick_next_task().
4908 ++ */
4909 ++ RCU_INIT_POINTER(rq->curr, next);
4910 ++ /*
4911 ++ * The membarrier system call requires each architecture
4912 ++ * to have a full memory barrier after updating
4913 ++ * rq->curr, before returning to user-space.
4914 ++ *
4915 ++ * Here are the schemes providing that barrier on the
4916 ++ * various architectures:
4917 ++ * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
4918 ++ * switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
4919 ++ * - finish_lock_switch() for weakly-ordered
4920 ++ * architectures where spin_unlock is a full barrier,
4921 ++ * - switch_to() for arm64 (weakly-ordered, spin_unlock
4922 ++ * is a RELEASE barrier),
4923 ++ */
4924 ++ ++*switch_count;
4925 ++
4926 ++ psi_sched_switch(prev, next, !task_on_rq_queued(prev));
4927 ++
4928 ++ trace_sched_switch(preempt, prev, next);
4929 ++
4930 ++ /* Also unlocks the rq: */
4931 ++ rq = context_switch(rq, prev, next);
4932 ++ } else {
4933 ++ __balance_callbacks(rq);
4934 ++ raw_spin_unlock_irq(&rq->lock);
4935 ++ }
4936 ++
4937 ++#ifdef CONFIG_SCHED_SMT
4938 ++ sg_balance_check(rq);
4939 ++#endif
4940 ++}
4941 ++
4942 ++void __noreturn do_task_dead(void)
4943 ++{
4944 ++ /* Causes final put_task_struct in finish_task_switch(): */
4945 ++ set_special_state(TASK_DEAD);
4946 ++
4947 ++ /* Tell freezer to ignore us: */
4948 ++ current->flags |= PF_NOFREEZE;
4949 ++
4950 ++ __schedule(false);
4951 ++ BUG();
4952 ++
4953 ++ /* Avoid "noreturn function does return" - but don't continue if BUG() is a NOP: */
4954 ++ for (;;)
4955 ++ cpu_relax();
4956 ++}
4957 ++
4958 ++static inline void sched_submit_work(struct task_struct *tsk)
4959 ++{
4960 ++ unsigned int task_flags;
4961 ++
4962 ++ if (task_is_running(tsk))
4963 ++ return;
4964 ++
4965 ++ task_flags = tsk->flags;
4966 ++ /*
4967 ++ * If a worker went to sleep, notify and ask workqueue whether
4968 ++ * it wants to wake up a task to maintain concurrency.
4969 ++ * As this function is called inside the schedule() context,
4970 ++ * we disable preemption to avoid it calling schedule() again
4971 ++ * in the possible wakeup of a kworker and because wq_worker_sleeping()
4972 ++ * requires it.
4973 ++ */
4974 ++ if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
4975 ++ preempt_disable();
4976 ++ if (task_flags & PF_WQ_WORKER)
4977 ++ wq_worker_sleeping(tsk);
4978 ++ else
4979 ++ io_wq_worker_sleeping(tsk);
4980 ++ preempt_enable_no_resched();
4981 ++ }
4982 ++
4983 ++ if (tsk_is_pi_blocked(tsk))
4984 ++ return;
4985 ++
4986 ++ /*
4987 ++ * If we are going to sleep and we have plugged IO queued,
4988 ++ * make sure to submit it to avoid deadlocks.
4989 ++ */
4990 ++ if (blk_needs_flush_plug(tsk))
4991 ++ blk_schedule_flush_plug(tsk);
4992 ++}
4993 ++
4994 ++static void sched_update_worker(struct task_struct *tsk)
4995 ++{
4996 ++ if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
4997 ++ if (tsk->flags & PF_WQ_WORKER)
4998 ++ wq_worker_running(tsk);
4999 ++ else
5000 ++ io_wq_worker_running(tsk);
5001 ++ }
5002 ++}
5003 ++
5004 ++asmlinkage __visible void __sched schedule(void)
5005 ++{
5006 ++ struct task_struct *tsk = current;
5007 ++
5008 ++ sched_submit_work(tsk);
5009 ++ do {
5010 ++ preempt_disable();
5011 ++ __schedule(false);
5012 ++ sched_preempt_enable_no_resched();
5013 ++ } while (need_resched());
5014 ++ sched_update_worker(tsk);
5015 ++}
5016 ++EXPORT_SYMBOL(schedule);
5017 ++
5018 ++/*
5019 ++ * synchronize_rcu_tasks() makes sure that no task is stuck in a preempted
5020 ++ * state (i.e., has been scheduled out non-voluntarily) by making sure that all
5021 ++ * tasks have either left the run queue or have gone into user space.
5022 ++ * As idle tasks do not do either, they must not ever be preempted
5023 ++ * (schedule out non-voluntarily).
5024 ++ *
5025 ++ * schedule_idle() is similar to schedule_preempt_disabled() except that it
5026 ++ * never enables preemption because it does not call sched_submit_work().
5027 ++ */
5028 ++void __sched schedule_idle(void)
5029 ++{
5030 ++ /*
5031 ++ * As this skips calling sched_submit_work(), which the idle task does
5032 ++ * regardless because that function is a nop when the task is in a
5033 ++ * TASK_RUNNING state, make sure this isn't used someplace that the
5034 ++ * current task can be in any other state. Note, idle is always in the
5035 ++ * TASK_RUNNING state.
5036 ++ */
5037 ++ WARN_ON_ONCE(current->__state);
5038 ++ do {
5039 ++ __schedule(false);
5040 ++ } while (need_resched());
5041 ++}
5042 ++
5043 ++#if defined(CONFIG_CONTEXT_TRACKING) && !defined(CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK)
5044 ++asmlinkage __visible void __sched schedule_user(void)
5045 ++{
5046 ++ /*
5047 ++ * If we come here after a random call to set_need_resched(),
5048 ++ * or we have been woken up remotely but the IPI has not yet arrived,
5049 ++ * we haven't yet exited the RCU idle mode. Do it here manually until
5050 ++ * we find a better solution.
5051 ++ *
5052 ++ * NB: There are buggy callers of this function. Ideally we
5053 ++ * should warn if prev_state != CONTEXT_USER, but that will trigger
5054 ++ * too frequently to make sense yet.
5055 ++ */
5056 ++ enum ctx_state prev_state = exception_enter();
5057 ++ schedule();
5058 ++ exception_exit(prev_state);
5059 ++}
5060 ++#endif
5061 ++
5062 ++/**
5063 ++ * schedule_preempt_disabled - called with preemption disabled
5064 ++ *
5065 ++ * Returns with preemption disabled. Note: preempt_count must be 1
5066 ++ */
5067 ++void __sched schedule_preempt_disabled(void)
5068 ++{
5069 ++ sched_preempt_enable_no_resched();
5070 ++ schedule();
5071 ++ preempt_disable();
5072 ++}
5073 ++
5074 ++static void __sched notrace preempt_schedule_common(void)
5075 ++{
5076 ++ do {
5077 ++ /*
5078 ++ * Because the function tracer can trace preempt_count_sub()
5079 ++ * and it also uses preempt_enable/disable_notrace(), if
5080 ++ * NEED_RESCHED is set, the preempt_enable_notrace() called
5081 ++ * by the function tracer will call this function again and
5082 ++ * cause infinite recursion.
5083 ++ *
5084 ++ * Preemption must be disabled here before the function
5085 ++ * tracer can trace. Break up preempt_disable() into two
5086 ++ * calls. One to disable preemption without fear of being
5087 ++ * traced. The other to still record the preemption latency,
5088 ++ * which can also be traced by the function tracer.
5089 ++ */
5090 ++ preempt_disable_notrace();
5091 ++ preempt_latency_start(1);
5092 ++ __schedule(true);
5093 ++ preempt_latency_stop(1);
5094 ++ preempt_enable_no_resched_notrace();
5095 ++
5096 ++ /*
5097 ++ * Check again in case we missed a preemption opportunity
5098 ++ * between schedule and now.
5099 ++ */
5100 ++ } while (need_resched());
5101 ++}
5102 ++
5103 ++#ifdef CONFIG_PREEMPTION
5104 ++/*
5105 ++ * This is the entry point to schedule() from in-kernel preemption
5106 ++ * off of preempt_enable.
5107 ++ */
5108 ++asmlinkage __visible void __sched notrace preempt_schedule(void)
5109 ++{
5110 ++ /*
5111 ++ * If there is a non-zero preempt_count or interrupts are disabled,
5112 ++ * we do not want to preempt the current task. Just return..
5113 ++ */
5114 ++ if (likely(!preemptible()))
5115 ++ return;
5116 ++
5117 ++ preempt_schedule_common();
5118 ++}
5119 ++NOKPROBE_SYMBOL(preempt_schedule);
5120 ++EXPORT_SYMBOL(preempt_schedule);
5121 ++
5122 ++#ifdef CONFIG_PREEMPT_DYNAMIC
5123 ++DEFINE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
5124 ++EXPORT_STATIC_CALL_TRAMP(preempt_schedule);
5125 ++#endif
5126 ++
5127 ++
5128 ++/**
5129 ++ * preempt_schedule_notrace - preempt_schedule called by tracing
5130 ++ *
5131 ++ * The tracing infrastructure uses preempt_enable_notrace to prevent
5132 ++ * recursion and tracing preempt enabling caused by the tracing
5133 ++ * infrastructure itself. But as tracing can happen in areas coming
5134 ++ * from userspace or just about to enter userspace, a preempt enable
5135 ++ * can occur before user_exit() is called. This will cause the scheduler
5136 ++ * to be called when the system is still in usermode.
5137 ++ *
5138 ++ * To prevent this, the preempt_enable_notrace will use this function
5139 ++ * instead of preempt_schedule() to exit user context if needed before
5140 ++ * calling the scheduler.
5141 ++ */
5142 ++asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)
5143 ++{
5144 ++ enum ctx_state prev_ctx;
5145 ++
5146 ++ if (likely(!preemptible()))
5147 ++ return;
5148 ++
5149 ++ do {
5150 ++ /*
5151 ++ * Because the function tracer can trace preempt_count_sub()
5152 ++ * and it also uses preempt_enable/disable_notrace(), if
5153 ++ * NEED_RESCHED is set, the preempt_enable_notrace() called
5154 ++ * by the function tracer will call this function again and
5155 ++ * cause infinite recursion.
5156 ++ *
5157 ++ * Preemption must be disabled here before the function
5158 ++ * tracer can trace. Break up preempt_disable() into two
5159 ++ * calls. One to disable preemption without fear of being
5160 ++ * traced. The other to still record the preemption latency,
5161 ++ * which can also be traced by the function tracer.
5162 ++ */
5163 ++ preempt_disable_notrace();
5164 ++ preempt_latency_start(1);
5165 ++ /*
5166 ++ * Needs preempt disabled in case user_exit() is traced
5167 ++ * and the tracer calls preempt_enable_notrace() causing
5168 ++ * an infinite recursion.
5169 ++ */
5170 ++ prev_ctx = exception_enter();
5171 ++ __schedule(true);
5172 ++ exception_exit(prev_ctx);
5173 ++
5174 ++ preempt_latency_stop(1);
5175 ++ preempt_enable_no_resched_notrace();
5176 ++ } while (need_resched());
5177 ++}
5178 ++EXPORT_SYMBOL_GPL(preempt_schedule_notrace);
5179 ++
5180 ++#ifdef CONFIG_PREEMPT_DYNAMIC
5181 ++DEFINE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
5182 ++EXPORT_STATIC_CALL_TRAMP(preempt_schedule_notrace);
5183 ++#endif
5184 ++
5185 ++#endif /* CONFIG_PREEMPTION */
5186 ++
5187 ++#ifdef CONFIG_PREEMPT_DYNAMIC
5188 ++
5189 ++#include <linux/entry-common.h>
5190 ++
5191 ++/*
5192 ++ * SC:cond_resched
5193 ++ * SC:might_resched
5194 ++ * SC:preempt_schedule
5195 ++ * SC:preempt_schedule_notrace
5196 ++ * SC:irqentry_exit_cond_resched
5197 ++ *
5198 ++ *
5199 ++ * NONE:
5200 ++ * cond_resched <- __cond_resched
5201 ++ * might_resched <- RET0
5202 ++ * preempt_schedule <- NOP
5203 ++ * preempt_schedule_notrace <- NOP
5204 ++ * irqentry_exit_cond_resched <- NOP
5205 ++ *
5206 ++ * VOLUNTARY:
5207 ++ * cond_resched <- __cond_resched
5208 ++ * might_resched <- __cond_resched
5209 ++ * preempt_schedule <- NOP
5210 ++ * preempt_schedule_notrace <- NOP
5211 ++ * irqentry_exit_cond_resched <- NOP
5212 ++ *
5213 ++ * FULL:
5214 ++ * cond_resched <- RET0
5215 ++ * might_resched <- RET0
5216 ++ * preempt_schedule <- preempt_schedule
5217 ++ * preempt_schedule_notrace <- preempt_schedule_notrace
5218 ++ * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
5219 ++ */
5220 ++
5221 ++enum {
5222 ++ preempt_dynamic_none = 0,
5223 ++ preempt_dynamic_voluntary,
5224 ++ preempt_dynamic_full,
5225 ++};
5226 ++
5227 ++int preempt_dynamic_mode = preempt_dynamic_full;
5228 ++
5229 ++int sched_dynamic_mode(const char *str)
5230 ++{
5231 ++ if (!strcmp(str, "none"))
5232 ++ return preempt_dynamic_none;
5233 ++
5234 ++ if (!strcmp(str, "voluntary"))
5235 ++ return preempt_dynamic_voluntary;
5236 ++
5237 ++ if (!strcmp(str, "full"))
5238 ++ return preempt_dynamic_full;
5239 ++
5240 ++ return -EINVAL;
5241 ++}
5242 ++
5243 ++void sched_dynamic_update(int mode)
5244 ++{
5245 ++ /*
5246 ++ * Avoid {NONE,VOLUNTARY} -> FULL transitions from ever ending up in
5247 ++ * the ZERO state, which is invalid.
5248 ++ */
5249 ++ static_call_update(cond_resched, __cond_resched);
5250 ++ static_call_update(might_resched, __cond_resched);
5251 ++ static_call_update(preempt_schedule, __preempt_schedule_func);
5252 ++ static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
5253 ++ static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
5254 ++
5255 ++ switch (mode) {
5256 ++ case preempt_dynamic_none:
5257 ++ static_call_update(cond_resched, __cond_resched);
5258 ++ static_call_update(might_resched, (void *)&__static_call_return0);
5259 ++ static_call_update(preempt_schedule, NULL);
5260 ++ static_call_update(preempt_schedule_notrace, NULL);
5261 ++ static_call_update(irqentry_exit_cond_resched, NULL);
5262 ++ pr_info("Dynamic Preempt: none\n");
5263 ++ break;
5264 ++
5265 ++ case preempt_dynamic_voluntary:
5266 ++ static_call_update(cond_resched, __cond_resched);
5267 ++ static_call_update(might_resched, __cond_resched);
5268 ++ static_call_update(preempt_schedule, NULL);
5269 ++ static_call_update(preempt_schedule_notrace, NULL);
5270 ++ static_call_update(irqentry_exit_cond_resched, NULL);
5271 ++ pr_info("Dynamic Preempt: voluntary\n");
5272 ++ break;
5273 ++
5274 ++ case preempt_dynamic_full:
5275 ++ static_call_update(cond_resched, (void *)&__static_call_return0);
5276 ++ static_call_update(might_resched, (void *)&__static_call_return0);
5277 ++ static_call_update(preempt_schedule, __preempt_schedule_func);
5278 ++ static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
5279 ++ static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
5280 ++ pr_info("Dynamic Preempt: full\n");
5281 ++ break;
5282 ++ }
5283 ++
5284 ++ preempt_dynamic_mode = mode;
5285 ++}
5286 ++
5287 ++static int __init setup_preempt_mode(char *str)
5288 ++{
5289 ++ int mode = sched_dynamic_mode(str);
5290 ++ if (mode < 0) {
5291 ++ pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
5292 ++ return 1;
5293 ++ }
5294 ++
5295 ++ sched_dynamic_update(mode);
5296 ++ return 0;
5297 ++}
5298 ++__setup("preempt=", setup_preempt_mode);
5299 ++
5300 ++#endif /* CONFIG_PREEMPT_DYNAMIC */
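The table above is the contract that sched_dynamic_update() enforces: each preempt= mode selects a target for every scheduler static call. As a rough, hedged illustration of the same dispatch idea, the sketch below uses ordinary C function pointers in place of static calls (real static calls patch the call sites themselves); the function names and return values are invented for the example and only mirror the NONE / VOLUNTARY / FULL rows above.

    /*
     * Minimal userspace sketch of the mode table above.  Ordinary C
     * function pointers stand in for the kernel's static calls (which
     * are really patched call sites); names and return values are
     * invented for illustration.
     */
    #include <stdio.h>
    #include <string.h>

    static int real_cond_resched(void) { return 1; }   /* plays __cond_resched */
    static int ret0(void)              { return 0; }   /* plays RET0 */

    static int (*cond_resched_fp)(void)  = real_cond_resched;
    static int (*might_resched_fp)(void) = ret0;

    static void dynamic_update(const char *mode)
    {
        if (!strcmp(mode, "none")) {
            cond_resched_fp  = real_cond_resched;
            might_resched_fp = ret0;
        } else if (!strcmp(mode, "voluntary")) {
            cond_resched_fp  = real_cond_resched;
            might_resched_fp = real_cond_resched;
        } else if (!strcmp(mode, "full")) {
            cond_resched_fp  = ret0;
            might_resched_fp = ret0;
        }
        printf("%-9s cond_resched()=%d might_resched()=%d\n",
               mode, cond_resched_fp(), might_resched_fp());
    }

    int main(void)
    {
        dynamic_update("none");
        dynamic_update("voluntary");
        dynamic_update("full");
        return 0;
    }

In the kernel the mode comes from the preempt= boot parameter handled by setup_preempt_mode(), with preempt_dynamic_full as the compiled-in default (see preempt_dynamic_mode above).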
5301 ++
5302 ++/*
5303 ++ * This is the entry point to schedule() from kernel preemption
5304 ++ * off of irq context.
5305 ++ * Note, that this is called and return with irqs disabled. This will
5306 ++ * protect us against recursive calling from irq.
5307 ++ */
5308 ++asmlinkage __visible void __sched preempt_schedule_irq(void)
5309 ++{
5310 ++ enum ctx_state prev_state;
5311 ++
5312 ++ /* Catch callers which need to be fixed */
5313 ++ BUG_ON(preempt_count() || !irqs_disabled());
5314 ++
5315 ++ prev_state = exception_enter();
5316 ++
5317 ++ do {
5318 ++ preempt_disable();
5319 ++ local_irq_enable();
5320 ++ __schedule(true);
5321 ++ local_irq_disable();
5322 ++ sched_preempt_enable_no_resched();
5323 ++ } while (need_resched());
5324 ++
5325 ++ exception_exit(prev_state);
5326 ++}
5327 ++
5328 ++int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
5329 ++ void *key)
5330 ++{
5331 ++ WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) && wake_flags & ~WF_SYNC);
5332 ++ return try_to_wake_up(curr->private, mode, wake_flags);
5333 ++}
5334 ++EXPORT_SYMBOL(default_wake_function);
5335 ++
5336 ++static inline void check_task_changed(struct task_struct *p, struct rq *rq)
5337 ++{
5338 ++ /* Trigger resched if task sched_prio has been modified. */
5339 ++ if (task_on_rq_queued(p) && task_sched_prio_idx(p, rq) != p->sq_idx) {
5340 ++ requeue_task(p, rq);
5341 ++ check_preempt_curr(rq);
5342 ++ }
5343 ++}
5344 ++
5345 ++static void __setscheduler_prio(struct task_struct *p, int prio)
5346 ++{
5347 ++ p->prio = prio;
5348 ++}
5349 ++
5350 ++#ifdef CONFIG_RT_MUTEXES
5351 ++
5352 ++static inline int __rt_effective_prio(struct task_struct *pi_task, int prio)
5353 ++{
5354 ++ if (pi_task)
5355 ++ prio = min(prio, pi_task->prio);
5356 ++
5357 ++ return prio;
5358 ++}
5359 ++
5360 ++static inline int rt_effective_prio(struct task_struct *p, int prio)
5361 ++{
5362 ++ struct task_struct *pi_task = rt_mutex_get_top_task(p);
5363 ++
5364 ++ return __rt_effective_prio(pi_task, prio);
5365 ++}
5366 ++
5367 ++/*
5368 ++ * rt_mutex_setprio - set the current priority of a task
5369 ++ * @p: task to boost
5370 ++ * @pi_task: donor task
5371 ++ *
5372 ++ * This function changes the 'effective' priority of a task. It does
5373 ++ * not touch ->normal_prio like __setscheduler().
5374 ++ *
5375 ++ * Used by the rt_mutex code to implement priority inheritance
5376 ++ * logic. Call site only calls if the priority of the task changed.
5377 ++ */
5378 ++void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
5379 ++{
5380 ++ int prio;
5381 ++ struct rq *rq;
5382 ++ raw_spinlock_t *lock;
5383 ++
5384 ++ /* XXX used to be waiter->prio, not waiter->task->prio */
5385 ++ prio = __rt_effective_prio(pi_task, p->normal_prio);
5386 ++
5387 ++ /*
5388 ++ * If nothing changed; bail early.
5389 ++ */
5390 ++ if (p->pi_top_task == pi_task && prio == p->prio)
5391 ++ return;
5392 ++
5393 ++ rq = __task_access_lock(p, &lock);
5394 ++ /*
5395 ++ * Set under pi_lock && rq->lock, such that the value can be used under
5396 ++ * either lock.
5397 ++ *
5398 ++ * Note that there is a lot of trickery involved in making this pointer cache work
5399 ++ * right. rt_mutex_slowunlock()+rt_mutex_postunlock() work together to
5400 ++ * ensure a task is de-boosted (pi_task is set to NULL) before the
5401 ++ * task is allowed to run again (and can exit). This ensures the pointer
5402 ++ * points to a blocked task -- which guarantees the task is present.
5403 ++ */
5404 ++ p->pi_top_task = pi_task;
5405 ++
5406 ++ /*
5407 ++ * For FIFO/RR we only need to set prio, if that matches we're done.
5408 ++ */
5409 ++ if (prio == p->prio)
5410 ++ goto out_unlock;
5411 ++
5412 ++ /*
5413 ++ * Idle task boosting is a no-no in general. There is one
5414 ++ * exception, when PREEMPT_RT and NOHZ is active:
5415 ++ *
5416 ++ * The idle task calls get_next_timer_interrupt() and holds
5417 ++ * the timer wheel base->lock on the CPU and another CPU wants
5418 ++ * to access the timer (probably to cancel it). We can safely
5419 ++ * ignore the boosting request, as the idle CPU runs this code
5420 ++ * with interrupts disabled and will complete the lock
5421 ++ * protected section without being interrupted. So there is no
5422 ++ * real need to boost.
5423 ++ */
5424 ++ if (unlikely(p == rq->idle)) {
5425 ++ WARN_ON(p != rq->curr);
5426 ++ WARN_ON(p->pi_blocked_on);
5427 ++ goto out_unlock;
5428 ++ }
5429 ++
5430 ++ trace_sched_pi_setprio(p, pi_task);
5431 ++
5432 ++ __setscheduler_prio(p, prio);
5433 ++
5434 ++ check_task_changed(p, rq);
5435 ++out_unlock:
5436 ++ /* Avoid rq from going away on us: */
5437 ++ preempt_disable();
5438 ++
5439 ++ __balance_callbacks(rq);
5440 ++ __task_access_unlock(p, lock);
5441 ++
5442 ++ preempt_enable();
5443 ++}
5444 ++#else
5445 ++static inline int rt_effective_prio(struct task_struct *p, int prio)
5446 ++{
5447 ++ return prio;
5448 ++}
5449 ++#endif
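rt_mutex_setprio() above is the in-kernel half of priority inheritance: the boosted prio is computed from the top PI waiter and applied through check_task_changed(). From userspace the same machinery is typically reached through PI futexes, for example a pthread mutex created with the PTHREAD_PRIO_INHERIT protocol. A minimal, hedged sketch (error handling omitted; the boost is only observable when a real-time waiter actually blocks on the lock):

    /*
     * Userspace view of priority inheritance: a pthread mutex created
     * with PTHREAD_PRIO_INHERIT maps to a PI futex, so a low-priority
     * holder is boosted by the rt_mutex code while a real-time thread
     * is blocked on the lock.  Sketch only; build with -pthread.
     */
    #include <pthread.h>

    int main(void)
    {
        pthread_mutexattr_t attr;
        pthread_mutex_t lock;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        pthread_mutex_init(&lock, &attr);

        pthread_mutex_lock(&lock);
        /* critical section: a blocked RT waiter would boost us here */
        pthread_mutex_unlock(&lock);

        pthread_mutex_destroy(&lock);
        pthread_mutexattr_destroy(&attr);
        return 0;
    }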
5450 ++
5451 ++void set_user_nice(struct task_struct *p, long nice)
5452 ++{
5453 ++ unsigned long flags;
5454 ++ struct rq *rq;
5455 ++ raw_spinlock_t *lock;
5456 ++
5457 ++ if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE)
5458 ++ return;
5459 ++ /*
5460 ++ * We have to be careful, if called from sys_setpriority(),
5461 ++ * the task might be in the middle of scheduling on another CPU.
5462 ++ */
5463 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
5464 ++ rq = __task_access_lock(p, &lock);
5465 ++
5466 ++ p->static_prio = NICE_TO_PRIO(nice);
5467 ++ /*
5468 ++ * The RT priorities are set via sched_setscheduler(), but we still
5469 ++ * allow the 'normal' nice value to be set - but as expected
5470 ++ * it won't have any effect on scheduling until the task is
5471 ++ * SCHED_NORMAL/SCHED_BATCH:
5472 ++ */
5473 ++ if (task_has_rt_policy(p))
5474 ++ goto out_unlock;
5475 ++
5476 ++ p->prio = effective_prio(p);
5477 ++
5478 ++ check_task_changed(p, rq);
5479 ++out_unlock:
5480 ++ __task_access_unlock(p, lock);
5481 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
5482 ++}
5483 ++EXPORT_SYMBOL(set_user_nice);
5484 ++
5485 ++/*
5486 ++ * can_nice - check if a task can reduce its nice value
5487 ++ * @p: task
5488 ++ * @nice: nice value
5489 ++ */
5490 ++int can_nice(const struct task_struct *p, const int nice)
5491 ++{
5492 ++ /* Convert nice value [19,-20] to rlimit style value [1,40] */
5493 ++ int nice_rlim = nice_to_rlimit(nice);
5494 ++
5495 ++ return (nice_rlim <= task_rlimit(p, RLIMIT_NICE) ||
5496 ++ capable(CAP_SYS_NICE));
5497 ++}
5498 ++
5499 ++#ifdef __ARCH_WANT_SYS_NICE
5500 ++
5501 ++/*
5502 ++ * sys_nice - change the priority of the current process.
5503 ++ * @increment: priority increment
5504 ++ *
5505 ++ * sys_setpriority is a more generic, but much slower function that
5506 ++ * does similar things.
5507 ++ */
5508 ++SYSCALL_DEFINE1(nice, int, increment)
5509 ++{
5510 ++ long nice, retval;
5511 ++
5512 ++ /*
5513 ++ * Setpriority might change our priority at the same moment.
5514 ++ * We don't have to worry. Conceptually one call occurs first
5515 ++ * and we have a single winner.
5516 ++ */
5517 ++
5518 ++ increment = clamp(increment, -NICE_WIDTH, NICE_WIDTH);
5519 ++ nice = task_nice(current) + increment;
5520 ++
5521 ++ nice = clamp_val(nice, MIN_NICE, MAX_NICE);
5522 ++ if (increment < 0 && !can_nice(current, nice))
5523 ++ return -EPERM;
5524 ++
5525 ++ retval = security_task_setnice(current, nice);
5526 ++ if (retval)
5527 ++ return retval;
5528 ++
5529 ++ set_user_nice(current, nice);
5530 ++ return 0;
5531 ++}
5532 ++
5533 ++#endif
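set_user_nice(), can_nice() and sys_nice() above are the kernel side of the ordinary nice machinery; userspace normally reaches them through nice(2) or setpriority(2). A small hedged example:

    /*
     * Drop the calling process by five nice levels and read the value
     * back.  Illustrative only; uses the standard nice(2) and
     * getpriority(2) wrappers.
     */
    #include <errno.h>
    #include <stdio.h>
    #include <sys/resource.h>
    #include <unistd.h>

    int main(void)
    {
        errno = 0;
        if (nice(5) == -1 && errno)     /* -1 is also a valid nice value */
            perror("nice");

        printf("nice value is now %d\n", getpriority(PRIO_PROCESS, 0));
        return 0;
    }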
5534 ++
5535 ++/**
5536 ++ * task_prio - return the priority value of a given task.
5537 ++ * @p: the task in question.
5538 ++ *
5539 ++ * Return: The priority value as seen by users in /proc.
5540 ++ *
5541 ++ * sched policy                return value   kernel prio    user prio/nice
5542 ++ *
5543 ++ * (BMQ)normal, batch, idle    [0 ... 53]     [100 ... 139]  0/[-20 ... 19]/[-7 ... 7]
5544 ++ * (PDS)normal, batch, idle    [0 ... 39]     100            0/[-20 ... 19]
5545 ++ * fifo, rr                    [-1 ... -100]  [99 ... 0]     [0 ... 99]
5546 ++ */
5547 ++int task_prio(const struct task_struct *p)
5548 ++{
5549 ++ return (p->prio < MAX_RT_PRIO) ? p->prio - MAX_RT_PRIO :
5550 ++ task_sched_prio_normal(p, task_rq(p));
5551 ++}
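The "fifo, rr" row of the table can be reproduced from the return statement above: for RT tasks task_prio() returns p->prio - MAX_RT_PRIO. The standalone check below assumes MAX_RT_PRIO == 100 and the usual mapping of a user RT priority n to kernel prio MAX_RT_PRIO - 1 - n; both are assumptions stated for the example.

    /*
     * Recompute the "fifo, rr" row of the table from the formula in
     * task_prio(): return value = kernel prio - MAX_RT_PRIO for RT
     * tasks.  Assumes MAX_RT_PRIO == 100 and kernel prio =
     * MAX_RT_PRIO - 1 - user rt_priority.
     */
    #include <stdio.h>

    #define MAX_RT_PRIO 100

    int main(void)
    {
        for (int rt_priority = 1; rt_priority <= 99; rt_priority += 49) {
            int kernel_prio = MAX_RT_PRIO - 1 - rt_priority;
            int prio_value  = kernel_prio - MAX_RT_PRIO;
            printf("user rt_priority %2d -> kernel prio %2d -> task_prio %4d\n",
                   rt_priority, kernel_prio, prio_value);
        }
        return 0;
    }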
5552 ++
5553 ++/**
5554 ++ * idle_cpu - is a given CPU idle currently?
5555 ++ * @cpu: the processor in question.
5556 ++ *
5557 ++ * Return: 1 if the CPU is currently idle. 0 otherwise.
5558 ++ */
5559 ++int idle_cpu(int cpu)
5560 ++{
5561 ++ struct rq *rq = cpu_rq(cpu);
5562 ++
5563 ++ if (rq->curr != rq->idle)
5564 ++ return 0;
5565 ++
5566 ++ if (rq->nr_running)
5567 ++ return 0;
5568 ++
5569 ++#ifdef CONFIG_SMP
5570 ++ if (rq->ttwu_pending)
5571 ++ return 0;
5572 ++#endif
5573 ++
5574 ++ return 1;
5575 ++}
5576 ++
5577 ++/**
5578 ++ * idle_task - return the idle task for a given CPU.
5579 ++ * @cpu: the processor in question.
5580 ++ *
5581 ++ * Return: The idle task for the cpu @cpu.
5582 ++ */
5583 ++struct task_struct *idle_task(int cpu)
5584 ++{
5585 ++ return cpu_rq(cpu)->idle;
5586 ++}
5587 ++
5588 ++/**
5589 ++ * find_process_by_pid - find a process with a matching PID value.
5590 ++ * @pid: the pid in question.
5591 ++ *
5592 ++ * The task of @pid, if found. %NULL otherwise.
5593 ++ */
5594 ++static inline struct task_struct *find_process_by_pid(pid_t pid)
5595 ++{
5596 ++ return pid ? find_task_by_vpid(pid) : current;
5597 ++}
5598 ++
5599 ++/*
5600 ++ * sched_setparam() passes in -1 for its policy, to let the functions
5601 ++ * it calls know not to change it.
5602 ++ */
5603 ++#define SETPARAM_POLICY -1
5604 ++
5605 ++static void __setscheduler_params(struct task_struct *p,
5606 ++ const struct sched_attr *attr)
5607 ++{
5608 ++ int policy = attr->sched_policy;
5609 ++
5610 ++ if (policy == SETPARAM_POLICY)
5611 ++ policy = p->policy;
5612 ++
5613 ++ p->policy = policy;
5614 ++
5615 ++ /*
5616 ++ * Allow the normal nice value to be set, but it will not have any
5617 ++ * effect on scheduling until the task is SCHED_NORMAL/
5618 ++ * SCHED_BATCH.
5619 ++ */
5620 ++ p->static_prio = NICE_TO_PRIO(attr->sched_nice);
5621 ++
5622 ++ /*
5623 ++ * __sched_setscheduler() ensures attr->sched_priority == 0 when
5624 ++ * !rt_policy. Always setting this ensures that things like
5625 ++ * getparam()/getattr() don't report silly values for !rt tasks.
5626 ++ */
5627 ++ p->rt_priority = attr->sched_priority;
5628 ++ p->normal_prio = normal_prio(p);
5629 ++}
5630 ++
5631 ++/*
5632 ++ * check the target process has a UID that matches the current process's
5633 ++ */
5634 ++static bool check_same_owner(struct task_struct *p)
5635 ++{
5636 ++ const struct cred *cred = current_cred(), *pcred;
5637 ++ bool match;
5638 ++
5639 ++ rcu_read_lock();
5640 ++ pcred = __task_cred(p);
5641 ++ match = (uid_eq(cred->euid, pcred->euid) ||
5642 ++ uid_eq(cred->euid, pcred->uid));
5643 ++ rcu_read_unlock();
5644 ++ return match;
5645 ++}
5646 ++
5647 ++static int __sched_setscheduler(struct task_struct *p,
5648 ++ const struct sched_attr *attr,
5649 ++ bool user, bool pi)
5650 ++{
5651 ++ const struct sched_attr dl_squash_attr = {
5652 ++ .size = sizeof(struct sched_attr),
5653 ++ .sched_policy = SCHED_FIFO,
5654 ++ .sched_nice = 0,
5655 ++ .sched_priority = 99,
5656 ++ };
5657 ++ int oldpolicy = -1, policy = attr->sched_policy;
5658 ++ int retval, newprio;
5659 ++ struct callback_head *head;
5660 ++ unsigned long flags;
5661 ++ struct rq *rq;
5662 ++ int reset_on_fork;
5663 ++ raw_spinlock_t *lock;
5664 ++
5665 ++ /* The pi code expects interrupts enabled */
5666 ++ BUG_ON(pi && in_interrupt());
5667 ++
5668 ++ /*
5669 ++ * Alt schedule FW supports SCHED_DEADLINE by squashing it into prio 0 SCHED_FIFO
5670 ++ */
5671 ++ if (unlikely(SCHED_DEADLINE == policy)) {
5672 ++ attr = &dl_squash_attr;
5673 ++ policy = attr->sched_policy;
5674 ++ }
5675 ++recheck:
5676 ++ /* Double check policy once rq lock held */
5677 ++ if (policy < 0) {
5678 ++ reset_on_fork = p->sched_reset_on_fork;
5679 ++ policy = oldpolicy = p->policy;
5680 ++ } else {
5681 ++ reset_on_fork = !!(attr->sched_flags & SCHED_RESET_ON_FORK);
5682 ++
5683 ++ if (policy > SCHED_IDLE)
5684 ++ return -EINVAL;
5685 ++ }
5686 ++
5687 ++ if (attr->sched_flags & ~(SCHED_FLAG_ALL))
5688 ++ return -EINVAL;
5689 ++
5690 ++ /*
5691 ++ * Valid priorities for SCHED_FIFO and SCHED_RR are
5692 ++ * 1..MAX_RT_PRIO-1, valid priority for SCHED_NORMAL and
5693 ++ * SCHED_BATCH and SCHED_IDLE is 0.
5694 ++ */
5695 ++ if (attr->sched_priority < 0 ||
5696 ++ (p->mm && attr->sched_priority > MAX_RT_PRIO - 1) ||
5697 ++ (!p->mm && attr->sched_priority > MAX_RT_PRIO - 1))
5698 ++ return -EINVAL;
5699 ++ if ((SCHED_RR == policy || SCHED_FIFO == policy) !=
5700 ++ (attr->sched_priority != 0))
5701 ++ return -EINVAL;
5702 ++
5703 ++ /*
5704 ++ * Allow unprivileged RT tasks to decrease priority:
5705 ++ */
5706 ++ if (user && !capable(CAP_SYS_NICE)) {
5707 ++ if (SCHED_FIFO == policy || SCHED_RR == policy) {
5708 ++ unsigned long rlim_rtprio =
5709 ++ task_rlimit(p, RLIMIT_RTPRIO);
5710 ++
5711 ++ /* Can't set/change the rt policy */
5712 ++ if (policy != p->policy && !rlim_rtprio)
5713 ++ return -EPERM;
5714 ++
5715 ++ /* Can't increase priority */
5716 ++ if (attr->sched_priority > p->rt_priority &&
5717 ++ attr->sched_priority > rlim_rtprio)
5718 ++ return -EPERM;
5719 ++ }
5720 ++
5721 ++ /* Can't change other user's priorities */
5722 ++ if (!check_same_owner(p))
5723 ++ return -EPERM;
5724 ++
5725 ++ /* Normal users shall not reset the sched_reset_on_fork flag */
5726 ++ if (p->sched_reset_on_fork && !reset_on_fork)
5727 ++ return -EPERM;
5728 ++ }
5729 ++
5730 ++ if (user) {
5731 ++ retval = security_task_setscheduler(p);
5732 ++ if (retval)
5733 ++ return retval;
5734 ++ }
5735 ++
5736 ++ if (pi)
5737 ++ cpuset_read_lock();
5738 ++
5739 ++ /*
5740 ++ * Make sure no PI-waiters arrive (or leave) while we are
5741 ++ * changing the priority of the task:
5742 ++ */
5743 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
5744 ++
5745 ++ /*
5746 ++ * To be able to change p->policy safely, task_access_lock()
5747 ++ * must be called.
5748 ++ * IF use task_access_lock() here:
5749 ++ * If task_access_lock() is used here:
5750 ++ * For the task p which is not running, reading rq->stop is
5751 ++ * racy but acceptable as ->stop doesn't change much.
5752 ++ * An enhancement could be made to read rq->stop safely.
5753 ++ rq = __task_access_lock(p, &lock);
5754 ++
5755 ++ /*
5756 ++ * Changing the policy of the stop threads its a very bad idea
5757 ++ * Changing the policy of the stop threads is a very bad idea
5758 ++ if (p == rq->stop) {
5759 ++ retval = -EINVAL;
5760 ++ goto unlock;
5761 ++ }
5762 ++
5763 ++ /*
5764 ++ * If not changing anything there's no need to proceed further:
5765 ++ */
5766 ++ if (unlikely(policy == p->policy)) {
5767 ++ if (rt_policy(policy) && attr->sched_priority != p->rt_priority)
5768 ++ goto change;
5769 ++ if (!rt_policy(policy) &&
5770 ++ NICE_TO_PRIO(attr->sched_nice) != p->static_prio)
5771 ++ goto change;
5772 ++
5773 ++ p->sched_reset_on_fork = reset_on_fork;
5774 ++ retval = 0;
5775 ++ goto unlock;
5776 ++ }
5777 ++change:
5778 ++
5779 ++ /* Re-check policy now with rq lock held */
5780 ++ if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
5781 ++ policy = oldpolicy = -1;
5782 ++ __task_access_unlock(p, lock);
5783 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
5784 ++ if (pi)
5785 ++ cpuset_read_unlock();
5786 ++ goto recheck;
5787 ++ }
5788 ++
5789 ++ p->sched_reset_on_fork = reset_on_fork;
5790 ++
5791 ++ newprio = __normal_prio(policy, attr->sched_priority, NICE_TO_PRIO(attr->sched_nice));
5792 ++ if (pi) {
5793 ++ /*
5794 ++ * Take priority boosted tasks into account. If the new
5795 ++ * effective priority is unchanged, we just store the new
5796 ++ * normal parameters and do not touch the scheduler class and
5797 ++ * the runqueue. This will be done when the task deboost
5798 ++ * itself.
5799 ++ */
5800 ++ if (rt_effective_prio(p, newprio) == p->prio) {
5801 ++ __setscheduler_params(p, attr);
5802 ++ retval = 0;
5803 ++ goto unlock;
5804 ++ }
5805 ++ }
5806 ++
5807 ++ if (!(attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)) {
5808 ++ __setscheduler_params(p, attr);
5809 ++ __setscheduler_prio(p, newprio);
5810 ++ }
5811 ++
5812 ++ check_task_changed(p, rq);
5813 ++
5814 ++ /* Avoid rq from going away on us: */
5815 ++ preempt_disable();
5816 ++ head = splice_balance_callbacks(rq);
5817 ++ __task_access_unlock(p, lock);
5818 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
5819 ++
5820 ++ if (pi) {
5821 ++ cpuset_read_unlock();
5822 ++ rt_mutex_adjust_pi(p);
5823 ++ }
5824 ++
5825 ++ /* Run balance callbacks after we've adjusted the PI chain: */
5826 ++ balance_callbacks(rq, head);
5827 ++ preempt_enable();
5828 ++
5829 ++ return 0;
5830 ++
5831 ++unlock:
5832 ++ __task_access_unlock(p, lock);
5833 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
5834 ++ if (pi)
5835 ++ cpuset_read_unlock();
5836 ++ return retval;
5837 ++}
5838 ++
5839 ++static int _sched_setscheduler(struct task_struct *p, int policy,
5840 ++ const struct sched_param *param, bool check)
5841 ++{
5842 ++ struct sched_attr attr = {
5843 ++ .sched_policy = policy,
5844 ++ .sched_priority = param->sched_priority,
5845 ++ .sched_nice = PRIO_TO_NICE(p->static_prio),
5846 ++ };
5847 ++
5848 ++ /* Fixup the legacy SCHED_RESET_ON_FORK hack. */
5849 ++ if ((policy != SETPARAM_POLICY) && (policy & SCHED_RESET_ON_FORK)) {
5850 ++ attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
5851 ++ policy &= ~SCHED_RESET_ON_FORK;
5852 ++ attr.sched_policy = policy;
5853 ++ }
5854 ++
5855 ++ return __sched_setscheduler(p, &attr, check, true);
5856 ++}
5857 ++
5858 ++/**
5859 ++ * sched_setscheduler - change the scheduling policy and/or RT priority of a thread.
5860 ++ * @p: the task in question.
5861 ++ * @policy: new policy.
5862 ++ * @param: structure containing the new RT priority.
5863 ++ *
5864 ++ * Use sched_set_fifo(), read its comment.
5865 ++ *
5866 ++ * Return: 0 on success. An error code otherwise.
5867 ++ *
5868 ++ * NOTE that the task may be already dead.
5869 ++ */
5870 ++int sched_setscheduler(struct task_struct *p, int policy,
5871 ++ const struct sched_param *param)
5872 ++{
5873 ++ return _sched_setscheduler(p, policy, param, true);
5874 ++}
5875 ++
5876 ++int sched_setattr(struct task_struct *p, const struct sched_attr *attr)
5877 ++{
5878 ++ return __sched_setscheduler(p, attr, true, true);
5879 ++}
5880 ++
5881 ++int sched_setattr_nocheck(struct task_struct *p, const struct sched_attr *attr)
5882 ++{
5883 ++ return __sched_setscheduler(p, attr, false, true);
5884 ++}
5885 ++EXPORT_SYMBOL_GPL(sched_setattr_nocheck);
5886 ++
5887 ++/**
5888 ++ * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
5889 ++ * @p: the task in question.
5890 ++ * @policy: new policy.
5891 ++ * @param: structure containing the new RT priority.
5892 ++ *
5893 ++ * Just like sched_setscheduler, only don't bother checking if the
5894 ++ * current context has permission. For example, this is needed in
5895 ++ * stop_machine(): we create temporary high priority worker threads,
5896 ++ * but our caller might not have that capability.
5897 ++ *
5898 ++ * Return: 0 on success. An error code otherwise.
5899 ++ */
5900 ++int sched_setscheduler_nocheck(struct task_struct *p, int policy,
5901 ++ const struct sched_param *param)
5902 ++{
5903 ++ return _sched_setscheduler(p, policy, param, false);
5904 ++}
5905 ++
5906 ++/*
5907 ++ * SCHED_FIFO is a broken scheduler model; that is, it is fundamentally
5908 ++ * incapable of resource management, which is the one thing an OS really should
5909 ++ * be doing.
5910 ++ *
5911 ++ * This is of course the reason it is limited to privileged users only.
5912 ++ *
5913 ++ * Worse still; it is fundamentally impossible to compose static priority
5914 ++ * workloads. You cannot take two correctly working static prio workloads
5915 ++ * and smash them together and still expect them to work.
5916 ++ *
5917 ++ * For this reason 'all' FIFO tasks the kernel creates are basically at:
5918 ++ *
5919 ++ * MAX_RT_PRIO / 2
5920 ++ *
5921 ++ * The administrator _MUST_ configure the system, the kernel simply doesn't
5922 ++ * know enough information to make a sensible choice.
5923 ++ */
5924 ++void sched_set_fifo(struct task_struct *p)
5925 ++{
5926 ++ struct sched_param sp = { .sched_priority = MAX_RT_PRIO / 2 };
5927 ++ WARN_ON_ONCE(sched_setscheduler_nocheck(p, SCHED_FIFO, &sp) != 0);
5928 ++}
5929 ++EXPORT_SYMBOL_GPL(sched_set_fifo);
5930 ++
5931 ++/*
5932 ++ * For when you don't much care about FIFO, but want to be above SCHED_NORMAL.
5933 ++ */
5934 ++void sched_set_fifo_low(struct task_struct *p)
5935 ++{
5936 ++ struct sched_param sp = { .sched_priority = 1 };
5937 ++ WARN_ON_ONCE(sched_setscheduler_nocheck(p, SCHED_FIFO, &sp) != 0);
5938 ++}
5939 ++EXPORT_SYMBOL_GPL(sched_set_fifo_low);
5940 ++
5941 ++void sched_set_normal(struct task_struct *p, int nice)
5942 ++{
5943 ++ struct sched_attr attr = {
5944 ++ .sched_policy = SCHED_NORMAL,
5945 ++ .sched_nice = nice,
5946 ++ };
5947 ++ WARN_ON_ONCE(sched_setattr_nocheck(p, &attr) != 0);
5948 ++}
5949 ++EXPORT_SYMBOL_GPL(sched_set_normal);
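sched_set_fifo() and friends are the in-kernel helpers; userspace makes the equivalent request through sched_setscheduler(2), which lands in __sched_setscheduler() above and is therefore subject to the CAP_SYS_NICE / RLIMIT_RTPRIO checks there. A hedged example that asks for SCHED_FIFO priority 50:

    /*
     * Ask for SCHED_FIFO priority 50 on the calling process and report
     * the resulting policy.  Illustrative only; needs CAP_SYS_NICE or
     * a suitable RLIMIT_RTPRIO.
     */
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_param sp = { .sched_priority = 50 };

        if (sched_setscheduler(0, SCHED_FIFO, &sp)) {
            perror("sched_setscheduler");
            return 1;
        }
        printf("policy is now %s\n",
               sched_getscheduler(0) == SCHED_FIFO ? "SCHED_FIFO" : "other");
        return 0;
    }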
5950 ++
5951 ++static int
5952 ++do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
5953 ++{
5954 ++ struct sched_param lparam;
5955 ++ struct task_struct *p;
5956 ++ int retval;
5957 ++
5958 ++ if (!param || pid < 0)
5959 ++ return -EINVAL;
5960 ++ if (copy_from_user(&lparam, param, sizeof(struct sched_param)))
5961 ++ return -EFAULT;
5962 ++
5963 ++ rcu_read_lock();
5964 ++ retval = -ESRCH;
5965 ++ p = find_process_by_pid(pid);
5966 ++ if (likely(p))
5967 ++ get_task_struct(p);
5968 ++ rcu_read_unlock();
5969 ++
5970 ++ if (likely(p)) {
5971 ++ retval = sched_setscheduler(p, policy, &lparam);
5972 ++ put_task_struct(p);
5973 ++ }
5974 ++
5975 ++ return retval;
5976 ++}
5977 ++
5978 ++/*
5979 ++ * Mimics kernel/events/core.c perf_copy_attr().
5980 ++ */
5981 ++static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *attr)
5982 ++{
5983 ++ u32 size;
5984 ++ int ret;
5985 ++
5986 ++ /* Zero the full structure, so that a short copy will be nice: */
5987 ++ memset(attr, 0, sizeof(*attr));
5988 ++
5989 ++ ret = get_user(size, &uattr->size);
5990 ++ if (ret)
5991 ++ return ret;
5992 ++
5993 ++ /* ABI compatibility quirk: */
5994 ++ if (!size)
5995 ++ size = SCHED_ATTR_SIZE_VER0;
5996 ++
5997 ++ if (size < SCHED_ATTR_SIZE_VER0 || size > PAGE_SIZE)
5998 ++ goto err_size;
5999 ++
6000 ++ ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
6001 ++ if (ret) {
6002 ++ if (ret == -E2BIG)
6003 ++ goto err_size;
6004 ++ return ret;
6005 ++ }
6006 ++
6007 ++ /*
6008 ++ * XXX: Do we want to be lenient like existing syscalls; or do we want
6009 ++ * to be strict and return an error on out-of-bounds values?
6010 ++ */
6011 ++ attr->sched_nice = clamp(attr->sched_nice, -20, 19);
6012 ++
6013 ++ /* sched/core.c uses zero here but we already know ret is zero */
6014 ++ return 0;
6015 ++
6016 ++err_size:
6017 ++ put_user(sizeof(*attr), &uattr->size);
6018 ++ return -E2BIG;
6019 ++}
6020 ++
6021 ++/**
6022 ++ * sys_sched_setscheduler - set/change the scheduler policy and RT priority
6023 ++ * @pid: the pid in question.
6024 ++ * @policy: new policy.
6025 ++ * @param: structure containing the new RT priority.
6026 ++ *
6027 ++ * Return: 0 on success. An error code otherwise.
6028 ++ */
6029 ++SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy, struct sched_param __user *, param)
6030 ++{
6031 ++ if (policy < 0)
6032 ++ return -EINVAL;
6033 ++
6034 ++ return do_sched_setscheduler(pid, policy, param);
6035 ++}
6036 ++
6037 ++/**
6038 ++ * sys_sched_setparam - set/change the RT priority of a thread
6039 ++ * @pid: the pid in question.
6040 ++ * @param: structure containing the new RT priority.
6041 ++ *
6042 ++ * Return: 0 on success. An error code otherwise.
6043 ++ */
6044 ++SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
6045 ++{
6046 ++ return do_sched_setscheduler(pid, SETPARAM_POLICY, param);
6047 ++}
6048 ++
6049 ++/**
6050 ++ * sys_sched_setattr - same as above, but with extended sched_attr
6051 ++ * @pid: the pid in question.
6052 ++ * @uattr: structure containing the extended parameters.
6053 ++ */
6054 ++SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
6055 ++ unsigned int, flags)
6056 ++{
6057 ++ struct sched_attr attr;
6058 ++ struct task_struct *p;
6059 ++ int retval;
6060 ++
6061 ++ if (!uattr || pid < 0 || flags)
6062 ++ return -EINVAL;
6063 ++
6064 ++ retval = sched_copy_attr(uattr, &attr);
6065 ++ if (retval)
6066 ++ return retval;
6067 ++
6068 ++ if ((int)attr.sched_policy < 0)
6069 ++ return -EINVAL;
6070 ++
6071 ++ rcu_read_lock();
6072 ++ retval = -ESRCH;
6073 ++ p = find_process_by_pid(pid);
6074 ++ if (likely(p))
6075 ++ get_task_struct(p);
6076 ++ rcu_read_unlock();
6077 ++
6078 ++ if (likely(p)) {
6079 ++ retval = sched_setattr(p, &attr);
6080 ++ put_task_struct(p);
6081 ++ }
6082 ++
6083 ++ return retval;
6084 ++}
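glibc traditionally does not wrap sched_setattr(), so sys_sched_setattr() is reached through syscall(2) with a caller-supplied structure matching the sched_attr ABI that sched_copy_attr() parses above. A hedged sketch; the struct below mirrors the VER0 layout and is renamed so it cannot clash with a libc that already defines struct sched_attr.

    /*
     * Switch the calling thread to SCHED_BATCH, nice 5, through the raw
     * sched_setattr syscall.  The struct mirrors the VER0 sched_attr
     * ABI; sketch only.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    struct sched_attr_v0 {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
    };

    int main(void)
    {
        struct sched_attr_v0 attr = {
            .size         = sizeof(attr),   /* SCHED_ATTR_SIZE_VER0 */
            .sched_policy = SCHED_BATCH,
            .sched_nice   = 5,
        };

        if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
            perror("sched_setattr");
            return 1;
        }
        printf("now SCHED_BATCH, nice 5\n");
        return 0;
    }

For a plain policy and nice change such as this one, sched_setscheduler(2)/nice(2) would do as well; the raw syscall is only needed for the extended attributes.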
6085 ++
6086 ++/**
6087 ++ * sys_sched_getscheduler - get the policy (scheduling class) of a thread
6088 ++ * @pid: the pid in question.
6089 ++ *
6090 ++ * Return: On success, the policy of the thread. Otherwise, a negative error
6091 ++ * code.
6092 ++ */
6093 ++SYSCALL_DEFINE1(sched_getscheduler, pid_t, pid)
6094 ++{
6095 ++ struct task_struct *p;
6096 ++ int retval = -EINVAL;
6097 ++
6098 ++ if (pid < 0)
6099 ++ goto out_nounlock;
6100 ++
6101 ++ retval = -ESRCH;
6102 ++ rcu_read_lock();
6103 ++ p = find_process_by_pid(pid);
6104 ++ if (p) {
6105 ++ retval = security_task_getscheduler(p);
6106 ++ if (!retval)
6107 ++ retval = p->policy;
6108 ++ }
6109 ++ rcu_read_unlock();
6110 ++
6111 ++out_nounlock:
6112 ++ return retval;
6113 ++}
6114 ++
6115 ++/**
6116 ++ * sys_sched_getparam - get the RT priority of a thread
6117 ++ * @pid: the pid in question.
6118 ++ * @param: structure containing the RT priority.
6119 ++ *
6120 ++ * Return: On success, 0 and the RT priority is in @param. Otherwise, an error
6121 ++ * code.
6122 ++ */
6123 ++SYSCALL_DEFINE2(sched_getparam, pid_t, pid, struct sched_param __user *, param)
6124 ++{
6125 ++ struct sched_param lp = { .sched_priority = 0 };
6126 ++ struct task_struct *p;
6127 ++ int retval = -EINVAL;
6128 ++
6129 ++ if (!param || pid < 0)
6130 ++ goto out_nounlock;
6131 ++
6132 ++ rcu_read_lock();
6133 ++ p = find_process_by_pid(pid);
6134 ++ retval = -ESRCH;
6135 ++ if (!p)
6136 ++ goto out_unlock;
6137 ++
6138 ++ retval = security_task_getscheduler(p);
6139 ++ if (retval)
6140 ++ goto out_unlock;
6141 ++
6142 ++ if (task_has_rt_policy(p))
6143 ++ lp.sched_priority = p->rt_priority;
6144 ++ rcu_read_unlock();
6145 ++
6146 ++ /*
6147 ++ * This one might sleep, we cannot do it with a spinlock held ...
6148 ++ */
6149 ++ retval = copy_to_user(param, &lp, sizeof(*param)) ? -EFAULT : 0;
6150 ++
6151 ++out_nounlock:
6152 ++ return retval;
6153 ++
6154 ++out_unlock:
6155 ++ rcu_read_unlock();
6156 ++ return retval;
6157 ++}
6158 ++
6159 ++/*
6160 ++ * Copy the kernel size attribute structure (which might be larger
6161 ++ * than what user-space knows about) to user-space.
6162 ++ *
6163 ++ * Note that all cases are valid: user-space buffer can be larger or
6164 ++ * smaller than the kernel-space buffer. The usual case is that both
6165 ++ * have the same size.
6166 ++ */
6167 ++static int
6168 ++sched_attr_copy_to_user(struct sched_attr __user *uattr,
6169 ++ struct sched_attr *kattr,
6170 ++ unsigned int usize)
6171 ++{
6172 ++ unsigned int ksize = sizeof(*kattr);
6173 ++
6174 ++ if (!access_ok(uattr, usize))
6175 ++ return -EFAULT;
6176 ++
6177 ++ /*
6178 ++ * sched_getattr() ABI forwards and backwards compatibility:
6179 ++ *
6180 ++ * If usize == ksize then we just copy everything to user-space and all is good.
6181 ++ *
6182 ++ * If usize < ksize then we only copy as much as user-space has space for,
6183 ++ * this keeps ABI compatibility as well. We skip the rest.
6184 ++ *
6185 ++ * If usize > ksize then user-space is using a newer version of the ABI,
6186 ++ * which part the kernel doesn't know about. Just ignore it - tooling can
6187 ++ * detect the kernel's knowledge of attributes from the attr->size value
6188 ++ * which is set to ksize in this case.
6189 ++ */
6190 ++ kattr->size = min(usize, ksize);
6191 ++
6192 ++ if (copy_to_user(uattr, kattr, kattr->size))
6193 ++ return -EFAULT;
6194 ++
6195 ++ return 0;
6196 ++}
6197 ++
6198 ++/**
6199 ++ * sys_sched_getattr - similar to sched_getparam, but with sched_attr
6200 ++ * @pid: the pid in question.
6201 ++ * @uattr: structure containing the extended parameters.
6202 ++ * @usize: sizeof(attr) for fwd/bwd comp.
6203 ++ * @flags: for future extension.
6204 ++ */
6205 ++SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
6206 ++ unsigned int, usize, unsigned int, flags)
6207 ++{
6208 ++ struct sched_attr kattr = { };
6209 ++ struct task_struct *p;
6210 ++ int retval;
6211 ++
6212 ++ if (!uattr || pid < 0 || usize > PAGE_SIZE ||
6213 ++ usize < SCHED_ATTR_SIZE_VER0 || flags)
6214 ++ return -EINVAL;
6215 ++
6216 ++ rcu_read_lock();
6217 ++ p = find_process_by_pid(pid);
6218 ++ retval = -ESRCH;
6219 ++ if (!p)
6220 ++ goto out_unlock;
6221 ++
6222 ++ retval = security_task_getscheduler(p);
6223 ++ if (retval)
6224 ++ goto out_unlock;
6225 ++
6226 ++ kattr.sched_policy = p->policy;
6227 ++ if (p->sched_reset_on_fork)
6228 ++ kattr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
6229 ++ if (task_has_rt_policy(p))
6230 ++ kattr.sched_priority = p->rt_priority;
6231 ++ else
6232 ++ kattr.sched_nice = task_nice(p);
6233 ++
6234 ++#ifdef CONFIG_UCLAMP_TASK
6235 ++ kattr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value;
6236 ++ kattr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value;
6237 ++#endif
6238 ++
6239 ++ rcu_read_unlock();
6240 ++
6241 ++ return sched_attr_copy_to_user(uattr, &kattr, usize);
6242 ++
6243 ++out_unlock:
6244 ++ rcu_read_unlock();
6245 ++ return retval;
6246 ++}
6247 ++
6248 ++long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
6249 ++{
6250 ++ cpumask_var_t cpus_allowed, new_mask;
6251 ++ struct task_struct *p;
6252 ++ int retval;
6253 ++
6254 ++ rcu_read_lock();
6255 ++
6256 ++ p = find_process_by_pid(pid);
6257 ++ if (!p) {
6258 ++ rcu_read_unlock();
6259 ++ return -ESRCH;
6260 ++ }
6261 ++
6262 ++ /* Prevent p going away */
6263 ++ get_task_struct(p);
6264 ++ rcu_read_unlock();
6265 ++
6266 ++ if (p->flags & PF_NO_SETAFFINITY) {
6267 ++ retval = -EINVAL;
6268 ++ goto out_put_task;
6269 ++ }
6270 ++ if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL)) {
6271 ++ retval = -ENOMEM;
6272 ++ goto out_put_task;
6273 ++ }
6274 ++ if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) {
6275 ++ retval = -ENOMEM;
6276 ++ goto out_free_cpus_allowed;
6277 ++ }
6278 ++ retval = -EPERM;
6279 ++ if (!check_same_owner(p)) {
6280 ++ rcu_read_lock();
6281 ++ if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
6282 ++ rcu_read_unlock();
6283 ++ goto out_free_new_mask;
6284 ++ }
6285 ++ rcu_read_unlock();
6286 ++ }
6287 ++
6288 ++ retval = security_task_setscheduler(p);
6289 ++ if (retval)
6290 ++ goto out_free_new_mask;
6291 ++
6292 ++ cpuset_cpus_allowed(p, cpus_allowed);
6293 ++ cpumask_and(new_mask, in_mask, cpus_allowed);
6294 ++
6295 ++again:
6296 ++ retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK);
6297 ++
6298 ++ if (!retval) {
6299 ++ cpuset_cpus_allowed(p, cpus_allowed);
6300 ++ if (!cpumask_subset(new_mask, cpus_allowed)) {
6301 ++ /*
6302 ++ * We must have raced with a concurrent cpuset
6303 ++ * update. Just reset the cpus_allowed to the
6304 ++ * cpuset's cpus_allowed
6305 ++ */
6306 ++ cpumask_copy(new_mask, cpus_allowed);
6307 ++ goto again;
6308 ++ }
6309 ++ }
6310 ++out_free_new_mask:
6311 ++ free_cpumask_var(new_mask);
6312 ++out_free_cpus_allowed:
6313 ++ free_cpumask_var(cpus_allowed);
6314 ++out_put_task:
6315 ++ put_task_struct(p);
6316 ++ return retval;
6317 ++}
6318 ++
6319 ++static int get_user_cpu_mask(unsigned long __user *user_mask_ptr, unsigned len,
6320 ++ struct cpumask *new_mask)
6321 ++{
6322 ++ if (len < cpumask_size())
6323 ++ cpumask_clear(new_mask);
6324 ++ else if (len > cpumask_size())
6325 ++ len = cpumask_size();
6326 ++
6327 ++ return copy_from_user(new_mask, user_mask_ptr, len) ? -EFAULT : 0;
6328 ++}
6329 ++
6330 ++/**
6331 ++ * sys_sched_setaffinity - set the CPU affinity of a process
6332 ++ * @pid: pid of the process
6333 ++ * @len: length in bytes of the bitmask pointed to by user_mask_ptr
6334 ++ * @user_mask_ptr: user-space pointer to the new CPU mask
6335 ++ *
6336 ++ * Return: 0 on success. An error code otherwise.
6337 ++ */
6338 ++SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
6339 ++ unsigned long __user *, user_mask_ptr)
6340 ++{
6341 ++ cpumask_var_t new_mask;
6342 ++ int retval;
6343 ++
6344 ++ if (!alloc_cpumask_var(&new_mask, GFP_KERNEL))
6345 ++ return -ENOMEM;
6346 ++
6347 ++ retval = get_user_cpu_mask(user_mask_ptr, len, new_mask);
6348 ++ if (retval == 0)
6349 ++ retval = sched_setaffinity(pid, new_mask);
6350 ++ free_cpumask_var(new_mask);
6351 ++ return retval;
6352 ++}
6353 ++
6354 ++long sched_getaffinity(pid_t pid, cpumask_t *mask)
6355 ++{
6356 ++ struct task_struct *p;
6357 ++ raw_spinlock_t *lock;
6358 ++ unsigned long flags;
6359 ++ int retval;
6360 ++
6361 ++ rcu_read_lock();
6362 ++
6363 ++ retval = -ESRCH;
6364 ++ p = find_process_by_pid(pid);
6365 ++ if (!p)
6366 ++ goto out_unlock;
6367 ++
6368 ++ retval = security_task_getscheduler(p);
6369 ++ if (retval)
6370 ++ goto out_unlock;
6371 ++
6372 ++ task_access_lock_irqsave(p, &lock, &flags);
6373 ++ cpumask_and(mask, &p->cpus_mask, cpu_active_mask);
6374 ++ task_access_unlock_irqrestore(p, lock, &flags);
6375 ++
6376 ++out_unlock:
6377 ++ rcu_read_unlock();
6378 ++
6379 ++ return retval;
6380 ++}
6381 ++
6382 ++/**
6383 ++ * sys_sched_getaffinity - get the CPU affinity of a process
6384 ++ * @pid: pid of the process
6385 ++ * @len: length in bytes of the bitmask pointed to by user_mask_ptr
6386 ++ * @user_mask_ptr: user-space pointer to hold the current CPU mask
6387 ++ *
6388 ++ * Return: size of CPU mask copied to user_mask_ptr on success. An
6389 ++ * error code otherwise.
6390 ++ */
6391 ++SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
6392 ++ unsigned long __user *, user_mask_ptr)
6393 ++{
6394 ++ int ret;
6395 ++ cpumask_var_t mask;
6396 ++
6397 ++ if ((len * BITS_PER_BYTE) < nr_cpu_ids)
6398 ++ return -EINVAL;
6399 ++ if (len & (sizeof(unsigned long)-1))
6400 ++ return -EINVAL;
6401 ++
6402 ++ if (!alloc_cpumask_var(&mask, GFP_KERNEL))
6403 ++ return -ENOMEM;
6404 ++
6405 ++ ret = sched_getaffinity(pid, mask);
6406 ++ if (ret == 0) {
6407 ++ unsigned int retlen = min_t(size_t, len, cpumask_size());
6408 ++
6409 ++ if (copy_to_user(user_mask_ptr, mask, retlen))
6410 ++ ret = -EFAULT;
6411 ++ else
6412 ++ ret = retlen;
6413 ++ }
6414 ++ free_cpumask_var(mask);
6415 ++
6416 ++ return ret;
6417 ++}
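The affinity syscalls above are wrapped by glibc as sched_setaffinity()/sched_getaffinity() operating on a cpu_set_t. A hedged example that pins the caller to CPU 0 and reads the mask back:

    /*
     * Pin the calling process to CPU 0 and read the mask back.
     * Illustrative only.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (sched_setaffinity(0, sizeof(set), &set)) {
            perror("sched_setaffinity");
            return 1;
        }

        CPU_ZERO(&set);
        if (sched_getaffinity(0, sizeof(set), &set)) {
            perror("sched_getaffinity");
            return 1;
        }
        printf("bound to CPU 0: %s\n", CPU_ISSET(0, &set) ? "yes" : "no");
        return 0;
    }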
6418 ++
6419 ++static void do_sched_yield(void)
6420 ++{
6421 ++ struct rq *rq;
6422 ++ struct rq_flags rf;
6423 ++
6424 ++ if (!sched_yield_type)
6425 ++ return;
6426 ++
6427 ++ rq = this_rq_lock_irq(&rf);
6428 ++
6429 ++ schedstat_inc(rq->yld_count);
6430 ++
6431 ++ if (1 == sched_yield_type) {
6432 ++ if (!rt_task(current))
6433 ++ do_sched_yield_type_1(current, rq);
6434 ++ } else if (2 == sched_yield_type) {
6435 ++ if (rq->nr_running > 1)
6436 ++ rq->skip = current;
6437 ++ }
6438 ++
6439 ++ preempt_disable();
6440 ++ raw_spin_unlock_irq(&rq->lock);
6441 ++ sched_preempt_enable_no_resched();
6442 ++
6443 ++ schedule();
6444 ++}
6445 ++
6446 ++/**
6447 ++ * sys_sched_yield - yield the current processor to other threads.
6448 ++ *
6449 ++ * This function yields the current CPU to other tasks. If there are no
6450 ++ * other threads running on this CPU then this function will return.
6451 ++ *
6452 ++ * Return: 0.
6453 ++ */
6454 ++SYSCALL_DEFINE0(sched_yield)
6455 ++{
6456 ++ do_sched_yield();
6457 ++ return 0;
6458 ++}
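From userspace the entry point above is simply sched_yield(2); what the call actually does here is governed by do_sched_yield() and the sched_yield_type setting (no-op, deboost-and-requeue for non-RT tasks, or marking the run queue skip task). A hedged example:

    /*
     * Busy worker that yields the CPU on every iteration.  What the
     * yield actually does is decided by do_sched_yield() above via
     * sched_yield_type; sketch only.
     */
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        for (int i = 0; i < 5; i++) {
            printf("iteration %d\n", i);
            sched_yield();      /* enters SYSCALL_DEFINE0(sched_yield) */
        }
        return 0;
    }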
6459 ++
6460 ++#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
6461 ++int __sched __cond_resched(void)
6462 ++{
6463 ++ if (should_resched(0)) {
6464 ++ preempt_schedule_common();
6465 ++ return 1;
6466 ++ }
6467 ++#ifndef CONFIG_PREEMPT_RCU
6468 ++ rcu_all_qs();
6469 ++#endif
6470 ++ return 0;
6471 ++}
6472 ++EXPORT_SYMBOL(__cond_resched);
6473 ++#endif
6474 ++
6475 ++#ifdef CONFIG_PREEMPT_DYNAMIC
6476 ++DEFINE_STATIC_CALL_RET0(cond_resched, __cond_resched);
6477 ++EXPORT_STATIC_CALL_TRAMP(cond_resched);
6478 ++
6479 ++DEFINE_STATIC_CALL_RET0(might_resched, __cond_resched);
6480 ++EXPORT_STATIC_CALL_TRAMP(might_resched);
6481 ++#endif
6482 ++
6483 ++/*
6484 ++ * __cond_resched_lock() - if a reschedule is pending, drop the given lock,
6485 ++ * call schedule, and on return reacquire the lock.
6486 ++ *
6487 ++ * This works OK both with and without CONFIG_PREEMPTION. We do strange low-level
6488 ++ * operations here to prevent schedule() from being called twice (once via
6489 ++ * spin_unlock(), once by hand).
6490 ++ */
6491 ++int __cond_resched_lock(spinlock_t *lock)
6492 ++{
6493 ++ int resched = should_resched(PREEMPT_LOCK_OFFSET);
6494 ++ int ret = 0;
6495 ++
6496 ++ lockdep_assert_held(lock);
6497 ++
6498 ++ if (spin_needbreak(lock) || resched) {
6499 ++ spin_unlock(lock);
6500 ++ if (resched)
6501 ++ preempt_schedule_common();
6502 ++ else
6503 ++ cpu_relax();
6504 ++ ret = 1;
6505 ++ spin_lock(lock);
6506 ++ }
6507 ++ return ret;
6508 ++}
6509 ++EXPORT_SYMBOL(__cond_resched_lock);
6510 ++
6511 ++int __cond_resched_rwlock_read(rwlock_t *lock)
6512 ++{
6513 ++ int resched = should_resched(PREEMPT_LOCK_OFFSET);
6514 ++ int ret = 0;
6515 ++
6516 ++ lockdep_assert_held_read(lock);
6517 ++
6518 ++ if (rwlock_needbreak(lock) || resched) {
6519 ++ read_unlock(lock);
6520 ++ if (resched)
6521 ++ preempt_schedule_common();
6522 ++ else
6523 ++ cpu_relax();
6524 ++ ret = 1;
6525 ++ read_lock(lock);
6526 ++ }
6527 ++ return ret;
6528 ++}
6529 ++EXPORT_SYMBOL(__cond_resched_rwlock_read);
6530 ++
6531 ++int __cond_resched_rwlock_write(rwlock_t *lock)
6532 ++{
6533 ++ int resched = should_resched(PREEMPT_LOCK_OFFSET);
6534 ++ int ret = 0;
6535 ++
6536 ++ lockdep_assert_held_write(lock);
6537 ++
6538 ++ if (rwlock_needbreak(lock) || resched) {
6539 ++ write_unlock(lock);
6540 ++ if (resched)
6541 ++ preempt_schedule_common();
6542 ++ else
6543 ++ cpu_relax();
6544 ++ ret = 1;
6545 ++ write_lock(lock);
6546 ++ }
6547 ++ return ret;
6548 ++}
6549 ++EXPORT_SYMBOL(__cond_resched_rwlock_write);
6550 ++
6551 ++/**
6552 ++ * yield - yield the current processor to other threads.
6553 ++ *
6554 ++ * Do not ever use this function, there's a 99% chance you're doing it wrong.
6555 ++ *
6556 ++ * The scheduler is at all times free to pick the calling task as the most
6557 ++ * eligible task to run, if removing the yield() call from your code breaks
6558 ++ * it, it's already broken.
6559 ++ *
6560 ++ * Typical broken usage is:
6561 ++ *
6562 ++ * while (!event)
6563 ++ * yield();
6564 ++ *
6565 ++ * where one assumes that yield() will let 'the other' process run that will
6566 ++ * make event true. If the current task is a SCHED_FIFO task that will never
6567 ++ * happen. Never use yield() as a progress guarantee!!
6568 ++ *
6569 ++ * If you want to use yield() to wait for something, use wait_event().
6570 ++ * If you want to use yield() to be 'nice' for others, use cond_resched().
6571 ++ * If you still want to use yield(), do not!
6572 ++ */
6573 ++void __sched yield(void)
6574 ++{
6575 ++ set_current_state(TASK_RUNNING);
6576 ++ do_sched_yield();
6577 ++}
6578 ++EXPORT_SYMBOL(yield);
6579 ++
6580 ++/**
6581 ++ * yield_to - yield the current processor to another thread in
6582 ++ * your thread group, or accelerate that thread toward the
6583 ++ * processor it's on.
6584 ++ * @p: target task
6585 ++ * @preempt: whether task preemption is allowed or not
6586 ++ *
6587 ++ * It's the caller's job to ensure that the target task struct
6588 ++ * can't go away on us before we can do any checks.
6589 ++ *
6590 ++ * In Alt schedule FW, yield_to is not supported.
6591 ++ *
6592 ++ * Return:
6593 ++ * true (>0) if we indeed boosted the target task.
6594 ++ * false (0) if we failed to boost the target.
6595 ++ * -ESRCH if there's no task to yield to.
6596 ++ */
6597 ++int __sched yield_to(struct task_struct *p, bool preempt)
6598 ++{
6599 ++ return 0;
6600 ++}
6601 ++EXPORT_SYMBOL_GPL(yield_to);
6602 ++
6603 ++int io_schedule_prepare(void)
6604 ++{
6605 ++ int old_iowait = current->in_iowait;
6606 ++
6607 ++ current->in_iowait = 1;
6608 ++ blk_schedule_flush_plug(current);
6609 ++
6610 ++ return old_iowait;
6611 ++}
6612 ++
6613 ++void io_schedule_finish(int token)
6614 ++{
6615 ++ current->in_iowait = token;
6616 ++}
6617 ++
6618 ++/*
6619 ++ * This task is about to go to sleep on IO. Increment rq->nr_iowait so
6620 ++ * that process accounting knows that this is a task in IO wait state.
6621 ++ *
6622 ++ * But don't do that if it is a deliberate, throttling IO wait (this task
6623 ++ * has set its backing_dev_info: the queue against which it should throttle)
6624 ++ */
6625 ++
6626 ++long __sched io_schedule_timeout(long timeout)
6627 ++{
6628 ++ int token;
6629 ++ long ret;
6630 ++
6631 ++ token = io_schedule_prepare();
6632 ++ ret = schedule_timeout(timeout);
6633 ++ io_schedule_finish(token);
6634 ++
6635 ++ return ret;
6636 ++}
6637 ++EXPORT_SYMBOL(io_schedule_timeout);
6638 ++
6639 ++void __sched io_schedule(void)
6640 ++{
6641 ++ int token;
6642 ++
6643 ++ token = io_schedule_prepare();
6644 ++ schedule();
6645 ++ io_schedule_finish(token);
6646 ++}
6647 ++EXPORT_SYMBOL(io_schedule);
6648 ++
6649 ++/**
6650 ++ * sys_sched_get_priority_max - return maximum RT priority.
6651 ++ * @policy: scheduling class.
6652 ++ *
6653 ++ * Return: On success, this syscall returns the maximum
6654 ++ * rt_priority that can be used by a given scheduling class.
6655 ++ * On failure, a negative error code is returned.
6656 ++ */
6657 ++SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
6658 ++{
6659 ++ int ret = -EINVAL;
6660 ++
6661 ++ switch (policy) {
6662 ++ case SCHED_FIFO:
6663 ++ case SCHED_RR:
6664 ++ ret = MAX_RT_PRIO - 1;
6665 ++ break;
6666 ++ case SCHED_NORMAL:
6667 ++ case SCHED_BATCH:
6668 ++ case SCHED_IDLE:
6669 ++ ret = 0;
6670 ++ break;
6671 ++ }
6672 ++ return ret;
6673 ++}
6674 ++
6675 ++/**
6676 ++ * sys_sched_get_priority_min - return minimum RT priority.
6677 ++ * @policy: scheduling class.
6678 ++ *
6679 ++ * Return: On success, this syscall returns the minimum
6680 ++ * rt_priority that can be used by a given scheduling class.
6681 ++ * On failure, a negative error code is returned.
6682 ++ */
6683 ++SYSCALL_DEFINE1(sched_get_priority_min, int, policy)
6684 ++{
6685 ++ int ret = -EINVAL;
6686 ++
6687 ++ switch (policy) {
6688 ++ case SCHED_FIFO:
6689 ++ case SCHED_RR:
6690 ++ ret = 1;
6691 ++ break;
6692 ++ case SCHED_NORMAL:
6693 ++ case SCHED_BATCH:
6694 ++ case SCHED_IDLE:
6695 ++ ret = 0;
6696 ++ break;
6697 ++ }
6698 ++ return ret;
6699 ++}
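The two syscalls above expose the static priority range per policy, mirrored in userspace by sched_get_priority_max()/sched_get_priority_min(). A hedged example that prints the ranges the switch statements return:

    /*
     * Print the static priority range for a few policies, matching the
     * switch statements above (1..99 for FIFO/RR, 0 for the normal
     * classes).
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        const struct { const char *name; int policy; } p[] = {
            { "SCHED_FIFO",  SCHED_FIFO  },
            { "SCHED_RR",    SCHED_RR    },
            { "SCHED_OTHER", SCHED_OTHER },
            { "SCHED_BATCH", SCHED_BATCH },
        };

        for (size_t i = 0; i < sizeof(p) / sizeof(p[0]); i++)
            printf("%-12s min %d max %d\n", p[i].name,
                   sched_get_priority_min(p[i].policy),
                   sched_get_priority_max(p[i].policy));
        return 0;
    }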
6700 ++
6701 ++static int sched_rr_get_interval(pid_t pid, struct timespec64 *t)
6702 ++{
6703 ++ struct task_struct *p;
6704 ++ int retval;
6705 ++
6706 ++ alt_sched_debug();
6707 ++
6708 ++ if (pid < 0)
6709 ++ return -EINVAL;
6710 ++
6711 ++ retval = -ESRCH;
6712 ++ rcu_read_lock();
6713 ++ p = find_process_by_pid(pid);
6714 ++ if (!p)
6715 ++ goto out_unlock;
6716 ++
6717 ++ retval = security_task_getscheduler(p);
6718 ++ if (retval)
6719 ++ goto out_unlock;
6720 ++ rcu_read_unlock();
6721 ++
6722 ++ *t = ns_to_timespec64(sched_timeslice_ns);
6723 ++ return 0;
6724 ++
6725 ++out_unlock:
6726 ++ rcu_read_unlock();
6727 ++ return retval;
6728 ++}
6729 ++
6730 ++/**
6731 ++ * sys_sched_rr_get_interval - return the default timeslice of a process.
6732 ++ * @pid: pid of the process.
6733 ++ * @interval: userspace pointer to the timeslice value.
6734 ++ *
6735 ++ *
6736 ++ * Return: On success, 0 and the timeslice is in @interval. Otherwise,
6737 ++ * an error code.
6738 ++ */
6739 ++SYSCALL_DEFINE2(sched_rr_get_interval, pid_t, pid,
6740 ++ struct __kernel_timespec __user *, interval)
6741 ++{
6742 ++ struct timespec64 t;
6743 ++ int retval = sched_rr_get_interval(pid, &t);
6744 ++
6745 ++ if (retval == 0)
6746 ++ retval = put_timespec64(&t, interval);
6747 ++
6748 ++ return retval;
6749 ++}
6750 ++
6751 ++#ifdef CONFIG_COMPAT_32BIT_TIME
6752 ++SYSCALL_DEFINE2(sched_rr_get_interval_time32, pid_t, pid,
6753 ++ struct old_timespec32 __user *, interval)
6754 ++{
6755 ++ struct timespec64 t;
6756 ++ int retval = sched_rr_get_interval(pid, &t);
6757 ++
6758 ++ if (retval == 0)
6759 ++ retval = put_old_timespec32(&t, interval);
6760 ++ return retval;
6761 ++}
6762 ++#endif
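Under this scheduler sched_rr_get_interval() reports the global sched_timeslice_ns rather than a per-task slice, but the userspace interface is unchanged. A hedged example:

    /*
     * Query the round-robin interval for the calling process.  Under
     * this patch the value reported is the global sched_timeslice_ns;
     * sketch only.
     */
    #include <sched.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec ts;

        if (sched_rr_get_interval(0, &ts)) {
            perror("sched_rr_get_interval");
            return 1;
        }
        printf("timeslice: %ld.%09ld s\n", (long)ts.tv_sec, (long)ts.tv_nsec);
        return 0;
    }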
6763 ++
6764 ++void sched_show_task(struct task_struct *p)
6765 ++{
6766 ++ unsigned long free = 0;
6767 ++ int ppid;
6768 ++
6769 ++ if (!try_get_task_stack(p))
6770 ++ return;
6771 ++
6772 ++ pr_info("task:%-15.15s state:%c", p->comm, task_state_to_char(p));
6773 ++
6774 ++ if (task_is_running(p))
6775 ++ pr_cont(" running task ");
6776 ++#ifdef CONFIG_DEBUG_STACK_USAGE
6777 ++ free = stack_not_used(p);
6778 ++#endif
6779 ++ ppid = 0;
6780 ++ rcu_read_lock();
6781 ++ if (pid_alive(p))
6782 ++ ppid = task_pid_nr(rcu_dereference(p->real_parent));
6783 ++ rcu_read_unlock();
6784 ++ pr_cont(" stack:%5lu pid:%5d ppid:%6d flags:0x%08lx\n",
6785 ++ free, task_pid_nr(p), ppid,
6786 ++ (unsigned long)task_thread_info(p)->flags);
6787 ++
6788 ++ print_worker_info(KERN_INFO, p);
6789 ++ print_stop_info(KERN_INFO, p);
6790 ++ show_stack(p, NULL, KERN_INFO);
6791 ++ put_task_stack(p);
6792 ++}
6793 ++EXPORT_SYMBOL_GPL(sched_show_task);
6794 ++
6795 ++static inline bool
6796 ++state_filter_match(unsigned long state_filter, struct task_struct *p)
6797 ++{
6798 ++ unsigned int state = READ_ONCE(p->__state);
6799 ++
6800 ++ /* no filter, everything matches */
6801 ++ if (!state_filter)
6802 ++ return true;
6803 ++
6804 ++ /* filter, but doesn't match */
6805 ++ if (!(state & state_filter))
6806 ++ return false;
6807 ++
6808 ++ /*
6809 ++ * When looking for TASK_UNINTERRUPTIBLE skip TASK_IDLE (allows
6810 ++ * TASK_KILLABLE).
6811 ++ */
6812 ++ if (state_filter == TASK_UNINTERRUPTIBLE && state == TASK_IDLE)
6813 ++ return false;
6814 ++
6815 ++ return true;
6816 ++}
6817 ++
6818 ++
6819 ++void show_state_filter(unsigned int state_filter)
6820 ++{
6821 ++ struct task_struct *g, *p;
6822 ++
6823 ++ rcu_read_lock();
6824 ++ for_each_process_thread(g, p) {
6825 ++ /*
6826 ++ * reset the NMI-timeout, listing all files on a slow
6827 ++ * console might take a lot of time:
6828 ++ * Also, reset softlockup watchdogs on all CPUs, because
6829 ++ * another CPU might be blocked waiting for us to process
6830 ++ * an IPI.
6831 ++ */
6832 ++ touch_nmi_watchdog();
6833 ++ touch_all_softlockup_watchdogs();
6834 ++ if (state_filter_match(state_filter, p))
6835 ++ sched_show_task(p);
6836 ++ }
6837 ++
6838 ++#ifdef CONFIG_SCHED_DEBUG
6839 ++ /* TODO: Alt schedule FW should support this
6840 ++ if (!state_filter)
6841 ++ sysrq_sched_debug_show();
6842 ++ */
6843 ++#endif
6844 ++ rcu_read_unlock();
6845 ++ /*
6846 ++ * Only show locks if all tasks are dumped:
6847 ++ */
6848 ++ if (!state_filter)
6849 ++ debug_show_all_locks();
6850 ++}
6851 ++
6852 ++void dump_cpu_task(int cpu)
6853 ++{
6854 ++ pr_info("Task dump for CPU %d:\n", cpu);
6855 ++ sched_show_task(cpu_curr(cpu));
6856 ++}
6857 ++
6858 ++/**
6859 ++ * init_idle - set up an idle thread for a given CPU
6860 ++ * @idle: task in question
6861 ++ * @cpu: CPU the idle task belongs to
6862 ++ *
6863 ++ * NOTE: this function does not set the idle thread's NEED_RESCHED
6864 ++ * flag, to make booting more robust.
6865 ++ */
6866 ++void __init init_idle(struct task_struct *idle, int cpu)
6867 ++{
6868 ++ struct rq *rq = cpu_rq(cpu);
6869 ++ unsigned long flags;
6870 ++
6871 ++ __sched_fork(0, idle);
6872 ++
6873 ++ /*
6874 ++ * The idle task doesn't need the kthread struct to function, but it
6875 ++ * is dressed up as a per-CPU kthread and thus needs to play the part
6876 ++ * if we want to avoid special-casing it in code that deals with per-CPU
6877 ++ * kthreads.
6878 ++ */
6879 ++ set_kthread_struct(idle);
6880 ++
6881 ++ raw_spin_lock_irqsave(&idle->pi_lock, flags);
6882 ++ raw_spin_lock(&rq->lock);
6883 ++ update_rq_clock(rq);
6884 ++
6885 ++ idle->last_ran = rq->clock_task;
6886 ++ idle->__state = TASK_RUNNING;
6887 ++ /*
6888 ++ * PF_KTHREAD should already be set at this point; regardless, make it
6889 ++ * look like a proper per-CPU kthread.
6890 ++ */
6891 ++ idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY;
6892 ++ kthread_set_per_cpu(idle, cpu);
6893 ++
6894 ++ sched_queue_init_idle(&rq->queue, idle);
6895 ++
6896 ++ scs_task_reset(idle);
6897 ++ kasan_unpoison_task_stack(idle);
6898 ++
6899 ++#ifdef CONFIG_SMP
6900 ++ /*
6901 ++ * It's possible that init_idle() gets called multiple times on a task,
6902 ++ * in that case do_set_cpus_allowed() will not do the right thing.
6903 ++ *
6904 ++ * And since this is boot we can forgo the serialisation.
6905 ++ */
6906 ++ set_cpus_allowed_common(idle, cpumask_of(cpu));
6907 ++#endif
6908 ++
6909 ++ /* Silence PROVE_RCU */
6910 ++ rcu_read_lock();
6911 ++ __set_task_cpu(idle, cpu);
6912 ++ rcu_read_unlock();
6913 ++
6914 ++ rq->idle = idle;
6915 ++ rcu_assign_pointer(rq->curr, idle);
6916 ++ idle->on_cpu = 1;
6917 ++
6918 ++ raw_spin_unlock(&rq->lock);
6919 ++ raw_spin_unlock_irqrestore(&idle->pi_lock, flags);
6920 ++
6921 ++ /* Set the preempt count _outside_ the spinlocks! */
6922 ++ init_idle_preempt_count(idle, cpu);
6923 ++
6924 ++ ftrace_graph_init_idle_task(idle, cpu);
6925 ++ vtime_init_idle(idle, cpu);
6926 ++#ifdef CONFIG_SMP
6927 ++ sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
6928 ++#endif
6929 ++}
6930 ++
6931 ++#ifdef CONFIG_SMP
6932 ++
6933 ++int cpuset_cpumask_can_shrink(const struct cpumask __maybe_unused *cur,
6934 ++ const struct cpumask __maybe_unused *trial)
6935 ++{
6936 ++ return 1;
6937 ++}
6938 ++
6939 ++int task_can_attach(struct task_struct *p,
6940 ++ const struct cpumask *cs_cpus_allowed)
6941 ++{
6942 ++ int ret = 0;
6943 ++
6944 ++ /*
6945 ++ * Kthreads which disallow setaffinity shouldn't be moved
6946 ++ * to a new cpuset; we don't want to change their CPU
6947 ++ * affinity and isolating such threads by their set of
6948 ++ * allowed nodes is unnecessary. Thus, cpusets are not
6949 ++ * applicable for such threads. This prevents checking for
6950 ++ * success of set_cpus_allowed_ptr() on all attached tasks
6951 ++ * before cpus_mask may be changed.
6952 ++ */
6953 ++ if (p->flags & PF_NO_SETAFFINITY)
6954 ++ ret = -EINVAL;
6955 ++
6956 ++ return ret;
6957 ++}
6958 ++
6959 ++bool sched_smp_initialized __read_mostly;
6960 ++
6961 ++#ifdef CONFIG_HOTPLUG_CPU
6962 ++/*
6963 ++ * Ensures that the idle task is using init_mm right before its CPU goes
6964 ++ * offline.
6965 ++ */
6966 ++void idle_task_exit(void)
6967 ++{
6968 ++ struct mm_struct *mm = current->active_mm;
6969 ++
6970 ++ BUG_ON(current != this_rq()->idle);
6971 ++
6972 ++ if (mm != &init_mm) {
6973 ++ switch_mm(mm, &init_mm, current);
6974 ++ finish_arch_post_lock_switch();
6975 ++ }
6976 ++
6977 ++ /* finish_cpu(), as ran on the BP, will clean up the active_mm state */
6978 ++}
6979 ++
6980 ++static int __balance_push_cpu_stop(void *arg)
6981 ++{
6982 ++ struct task_struct *p = arg;
6983 ++ struct rq *rq = this_rq();
6984 ++ struct rq_flags rf;
6985 ++ int cpu;
6986 ++
6987 ++ raw_spin_lock_irq(&p->pi_lock);
6988 ++ rq_lock(rq, &rf);
6989 ++
6990 ++ update_rq_clock(rq);
6991 ++
6992 ++ if (task_rq(p) == rq && task_on_rq_queued(p)) {
6993 ++ cpu = select_fallback_rq(rq->cpu, p);
6994 ++ rq = __migrate_task(rq, p, cpu);
6995 ++ }
6996 ++
6997 ++ rq_unlock(rq, &rf);
6998 ++ raw_spin_unlock_irq(&p->pi_lock);
6999 ++
7000 ++ put_task_struct(p);
7001 ++
7002 ++ return 0;
7003 ++}
7004 ++
7005 ++static DEFINE_PER_CPU(struct cpu_stop_work, push_work);
7006 ++
7007 ++/*
7008 ++ * This is enabled below SCHED_AP_ACTIVE, i.e. when !cpu_active(), but it is
7009 ++ * only effective while the hotplug motion is down (the CPU is going offline).
7010 ++ */
7011 ++static void balance_push(struct rq *rq)
7012 ++{
7013 ++ struct task_struct *push_task = rq->curr;
7014 ++
7015 ++ lockdep_assert_held(&rq->lock);
7016 ++ SCHED_WARN_ON(rq->cpu != smp_processor_id());
7017 ++
7018 ++ /*
7019 ++ * Ensure the thing is persistent until balance_push_set(.on = false);
7020 ++ */
7021 ++ rq->balance_callback = &balance_push_callback;
7022 ++
7023 ++ /*
7024 ++ * Only active while going offline.
7025 ++ */
7026 ++ if (!cpu_dying(rq->cpu))
7027 ++ return;
7028 ++
7029 ++ /*
7030 ++ * Both the cpu-hotplug and stop task are in this case and are
7031 ++ * required to complete the hotplug process.
7032 ++ */
7033 ++ if (kthread_is_per_cpu(push_task) ||
7034 ++ is_migration_disabled(push_task)) {
7035 ++
7036 ++ /*
7037 ++ * If this is the idle task on the outgoing CPU try to wake
7038 ++ * up the hotplug control thread which might wait for the
7039 ++ * last task to vanish. The rcuwait_active() check is
7040 ++ * accurate here because the waiter is pinned on this CPU
7041 ++ * and can't obviously be running in parallel.
7042 ++ *
7043 ++ * On RT kernels this also has to check whether there are
7044 ++ * pinned and scheduled out tasks on the runqueue. They
7045 ++ * need to leave the migrate disabled section first.
7046 ++ */
7047 ++ if (!rq->nr_running && !rq_has_pinned_tasks(rq) &&
7048 ++ rcuwait_active(&rq->hotplug_wait)) {
7049 ++ raw_spin_unlock(&rq->lock);
7050 ++ rcuwait_wake_up(&rq->hotplug_wait);
7051 ++ raw_spin_lock(&rq->lock);
7052 ++ }
7053 ++ return;
7054 ++ }
7055 ++
7056 ++ get_task_struct(push_task);
7057 ++ /*
7058 ++ * Temporarily drop rq->lock such that we can wake-up the stop task.
7059 ++ * Both preemption and IRQs are still disabled.
7060 ++ */
7061 ++ raw_spin_unlock(&rq->lock);
7062 ++ stop_one_cpu_nowait(rq->cpu, __balance_push_cpu_stop, push_task,
7063 ++ this_cpu_ptr(&push_work));
7064 ++ /*
7065 ++ * At this point need_resched() is true and we'll take the loop in
7066 ++ * schedule(). The next pick is obviously going to be the stop task
7067 ++ * which kthread_is_per_cpu() and will push this task away.
7068 ++ */
7069 ++ raw_spin_lock(&rq->lock);
7070 ++}
7071 ++
7072 ++static void balance_push_set(int cpu, bool on)
7073 ++{
7074 ++ struct rq *rq = cpu_rq(cpu);
7075 ++ struct rq_flags rf;
7076 ++
7077 ++ rq_lock_irqsave(rq, &rf);
7078 ++ if (on) {
7079 ++ WARN_ON_ONCE(rq->balance_callback);
7080 ++ rq->balance_callback = &balance_push_callback;
7081 ++ } else if (rq->balance_callback == &balance_push_callback) {
7082 ++ rq->balance_callback = NULL;
7083 ++ }
7084 ++ rq_unlock_irqrestore(rq, &rf);
7085 ++}
7086 ++
7087 ++/*
7088 ++ * Invoked from a CPU's hotplug control thread after the CPU has been marked
7089 ++ * inactive. All tasks which are not per CPU kernel threads are either
7090 ++ * pushed off this CPU now via balance_push() or placed on a different CPU
7091 ++ * during wakeup. Wait until the CPU is quiescent.
7092 ++ */
7093 ++static void balance_hotplug_wait(void)
7094 ++{
7095 ++ struct rq *rq = this_rq();
7096 ++
7097 ++ rcuwait_wait_event(&rq->hotplug_wait,
7098 ++ rq->nr_running == 1 && !rq_has_pinned_tasks(rq),
7099 ++ TASK_UNINTERRUPTIBLE);
7100 ++}
7101 ++
7102 ++#else
7103 ++
7104 ++static void balance_push(struct rq *rq)
7105 ++{
7106 ++}
7107 ++
7108 ++static void balance_push_set(int cpu, bool on)
7109 ++{
7110 ++}
7111 ++
7112 ++static inline void balance_hotplug_wait(void)
7113 ++{
7114 ++}
7115 ++#endif /* CONFIG_HOTPLUG_CPU */
7116 ++
7117 ++static void set_rq_offline(struct rq *rq)
7118 ++{
7119 ++ if (rq->online)
7120 ++ rq->online = false;
7121 ++}
7122 ++
7123 ++static void set_rq_online(struct rq *rq)
7124 ++{
7125 ++ if (!rq->online)
7126 ++ rq->online = true;
7127 ++}
7128 ++
7129 ++/*
7130 ++ * used to mark begin/end of suspend/resume:
7131 ++ */
7132 ++static int num_cpus_frozen;
7133 ++
7134 ++/*
7135 ++ * Update cpusets according to cpu_active mask. If cpusets are
7136 ++ * disabled, cpuset_update_active_cpus() becomes a simple wrapper
7137 ++ * around partition_sched_domains().
7138 ++ *
7139 ++ * If we come here as part of a suspend/resume, don't touch cpusets because we
7140 ++ * want to restore it back to its original state upon resume anyway.
7141 ++ */
7142 ++static void cpuset_cpu_active(void)
7143 ++{
7144 ++ if (cpuhp_tasks_frozen) {
7145 ++ /*
7146 ++ * num_cpus_frozen tracks how many CPUs are involved in suspend
7147 ++ * resume sequence. As long as this is not the last online
7148 ++ * operation in the resume sequence, just build a single sched
7149 ++ * domain, ignoring cpusets.
7150 ++ */
7151 ++ partition_sched_domains(1, NULL, NULL);
7152 ++ if (--num_cpus_frozen)
7153 ++ return;
7154 ++ /*
7155 ++ * This is the last CPU online operation. So fall through and
7156 ++ * restore the original sched domains by considering the
7157 ++ * cpuset configurations.
7158 ++ */
7159 ++ cpuset_force_rebuild();
7160 ++ }
7161 ++
7162 ++ cpuset_update_active_cpus();
7163 ++}
7164 ++
7165 ++static int cpuset_cpu_inactive(unsigned int cpu)
7166 ++{
7167 ++ if (!cpuhp_tasks_frozen) {
7168 ++ cpuset_update_active_cpus();
7169 ++ } else {
7170 ++ num_cpus_frozen++;
7171 ++ partition_sched_domains(1, NULL, NULL);
7172 ++ }
7173 ++ return 0;
7174 ++}
7175 ++
7176 ++int sched_cpu_activate(unsigned int cpu)
7177 ++{
7178 ++ struct rq *rq = cpu_rq(cpu);
7179 ++ unsigned long flags;
7180 ++
7181 ++ /*
7182 ++ * Clear the balance_push callback and prepare to schedule
7183 ++ * regular tasks.
7184 ++ */
7185 ++ balance_push_set(cpu, false);
7186 ++
7187 ++#ifdef CONFIG_SCHED_SMT
7188 ++ /*
7189 ++ * When going up, increment the number of cores with SMT present.
7190 ++ */
7191 ++ if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
7192 ++ static_branch_inc_cpuslocked(&sched_smt_present);
7193 ++#endif
7194 ++ set_cpu_active(cpu, true);
7195 ++
7196 ++ if (sched_smp_initialized)
7197 ++ cpuset_cpu_active();
7198 ++
7199 ++ /*
7200 ++ * Put the rq online, if not already. This happens:
7201 ++ *
7202 ++ * 1) In the early boot process, because we build the real domains
7203 ++ * after all cpus have been brought up.
7204 ++ *
7205 ++ * 2) At runtime, if cpuset_cpu_active() fails to rebuild the
7206 ++ * domains.
7207 ++ */
7208 ++ raw_spin_lock_irqsave(&rq->lock, flags);
7209 ++ set_rq_online(rq);
7210 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
7211 ++
7212 ++ return 0;
7213 ++}
7214 ++
7215 ++int sched_cpu_deactivate(unsigned int cpu)
7216 ++{
7217 ++ struct rq *rq = cpu_rq(cpu);
7218 ++ unsigned long flags;
7219 ++ int ret;
7220 ++
7221 ++ set_cpu_active(cpu, false);
7222 ++
7223 ++ /*
7224 ++ * From this point forward, this CPU will refuse to run any task that
7225 ++ * is not: migrate_disable() or KTHREAD_IS_PER_CPU, and will actively
7226 ++ * push those tasks away until this gets cleared, see
7227 ++ * sched_cpu_dying().
7228 ++ */
7229 ++ balance_push_set(cpu, true);
7230 ++
7231 ++ /*
7232 ++ * We've cleared cpu_active_mask, wait for all preempt-disabled and RCU
7233 ++ * users of this state to go away such that all new such users will
7234 ++ * observe it.
7235 ++ *
7236 ++ * Specifically, we rely on ttwu to no longer target this CPU, see
7237 ++ * ttwu_queue_cond() and is_cpu_allowed().
7238 ++ *
7239 ++ * Do the sync before parking smpboot threads to take care of the rcu boost case.
7240 ++ */
7241 ++ synchronize_rcu();
7242 ++
7243 ++ raw_spin_lock_irqsave(&rq->lock, flags);
7244 ++ update_rq_clock(rq);
7245 ++ set_rq_offline(rq);
7246 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
7247 ++
7248 ++#ifdef CONFIG_SCHED_SMT
7249 ++ /*
7250 ++ * When going down, decrement the number of cores with SMT present.
7251 ++ */
7252 ++ if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
7253 ++ static_branch_dec_cpuslocked(&sched_smt_present);
7254 ++ if (!static_branch_likely(&sched_smt_present))
7255 ++ cpumask_clear(&sched_sg_idle_mask);
7256 ++ }
7257 ++#endif
7258 ++
7259 ++ if (!sched_smp_initialized)
7260 ++ return 0;
7261 ++
7262 ++ ret = cpuset_cpu_inactive(cpu);
7263 ++ if (ret) {
7264 ++ balance_push_set(cpu, false);
7265 ++ set_cpu_active(cpu, true);
7266 ++ return ret;
7267 ++ }
7268 ++
7269 ++ return 0;
7270 ++}
7271 ++
7272 ++static void sched_rq_cpu_starting(unsigned int cpu)
7273 ++{
7274 ++ struct rq *rq = cpu_rq(cpu);
7275 ++
7276 ++ rq->calc_load_update = calc_load_update;
7277 ++}
7278 ++
7279 ++int sched_cpu_starting(unsigned int cpu)
7280 ++{
7281 ++ sched_rq_cpu_starting(cpu);
7282 ++ sched_tick_start(cpu);
7283 ++ return 0;
7284 ++}
7285 ++
7286 ++#ifdef CONFIG_HOTPLUG_CPU
7287 ++
7288 ++/*
7289 ++ * Invoked immediately before the stopper thread is invoked to bring the
7290 ++ * CPU down completely. At this point all per CPU kthreads except the
7291 ++ * hotplug thread (current) and the stopper thread (inactive) have been
7292 ++ * either parked or have been unbound from the outgoing CPU. Ensure that
7293 ++ * any of those which might be on the way out are gone.
7294 ++ *
7295 ++ * If after this point a bound task is being woken on this CPU then the
7296 ++ * responsible hotplug callback has failed to do its job.
7297 ++ * sched_cpu_dying() will catch it with the appropriate fireworks.
7298 ++ */
7299 ++int sched_cpu_wait_empty(unsigned int cpu)
7300 ++{
7301 ++ balance_hotplug_wait();
7302 ++ return 0;
7303 ++}
7304 ++
7305 ++/*
7306 ++ * Since this CPU is going 'away' for a while, fold any nr_active delta we
7307 ++ * might have. Called from the CPU stopper task after ensuring that the
7308 ++ * stopper is the last running task on the CPU, so nr_active count is
7309 ++ * stable. We need to take the teardown thread which is calling this into
7310 ++ * account, so we hand in adjust = 1 to the load calculation.
7311 ++ *
7312 ++ * Also see the comment "Global load-average calculations".
7313 ++ */
7314 ++static void calc_load_migrate(struct rq *rq)
7315 ++{
7316 ++ long delta = calc_load_fold_active(rq, 1);
7317 ++
7318 ++ if (delta)
7319 ++ atomic_long_add(delta, &calc_load_tasks);
7320 ++}
7321 ++
7322 ++static void dump_rq_tasks(struct rq *rq, const char *loglvl)
7323 ++{
7324 ++ struct task_struct *g, *p;
7325 ++ int cpu = cpu_of(rq);
7326 ++
7327 ++ lockdep_assert_held(&rq->lock);
7328 ++
7329 ++ printk("%sCPU%d enqueued tasks (%u total):\n", loglvl, cpu, rq->nr_running);
7330 ++ for_each_process_thread(g, p) {
7331 ++ if (task_cpu(p) != cpu)
7332 ++ continue;
7333 ++
7334 ++ if (!task_on_rq_queued(p))
7335 ++ continue;
7336 ++
7337 ++ printk("%s\tpid: %d, name: %s\n", loglvl, p->pid, p->comm);
7338 ++ }
7339 ++}
7340 ++
7341 ++int sched_cpu_dying(unsigned int cpu)
7342 ++{
7343 ++ struct rq *rq = cpu_rq(cpu);
7344 ++ unsigned long flags;
7345 ++
7346 ++ /* Handle pending wakeups and then migrate everything off */
7347 ++ sched_tick_stop(cpu);
7348 ++
7349 ++ raw_spin_lock_irqsave(&rq->lock, flags);
7350 ++ if (rq->nr_running != 1 || rq_has_pinned_tasks(rq)) {
7351 ++ WARN(true, "Dying CPU not properly vacated!");
7352 ++ dump_rq_tasks(rq, KERN_WARNING);
7353 ++ }
7354 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
7355 ++
7356 ++ calc_load_migrate(rq);
7357 ++ hrtick_clear(rq);
7358 ++ return 0;
7359 ++}
7360 ++#endif
7361 ++
7362 ++#ifdef CONFIG_SMP
7363 ++static void sched_init_topology_cpumask_early(void)
7364 ++{
7365 ++ int cpu;
7366 ++ cpumask_t *tmp;
7367 ++
7368 ++ for_each_possible_cpu(cpu) {
7369 ++ /* init topo masks */
7370 ++ tmp = per_cpu(sched_cpu_topo_masks, cpu);
7371 ++
7372 ++ cpumask_copy(tmp, cpumask_of(cpu));
7373 ++ tmp++;
7374 ++ cpumask_copy(tmp, cpu_possible_mask);
7375 ++ per_cpu(sched_cpu_llc_mask, cpu) = tmp;
7376 ++ per_cpu(sched_cpu_topo_end_mask, cpu) = ++tmp;
7377 ++ /*per_cpu(sd_llc_id, cpu) = cpu;*/
7378 ++ }
7379 ++}
7380 ++
7381 ++#define TOPOLOGY_CPUMASK(name, mask, last)\
7382 ++ if (cpumask_and(topo, topo, mask)) { \
7383 ++ cpumask_copy(topo, mask); \
7384 ++ printk(KERN_INFO "sched: cpu#%02d topo: 0x%08lx - "#name, \
7385 ++ cpu, (topo++)->bits[0]); \
7386 ++ } \
7387 ++ if (!last) \
7388 ++ cpumask_complement(topo, mask)
7389 ++
7390 ++static void sched_init_topology_cpumask(void)
7391 ++{
7392 ++ int cpu;
7393 ++ cpumask_t *topo;
7394 ++
7395 ++ for_each_online_cpu(cpu) {
7396 ++ /* take chance to reset time slice for idle tasks */
7397 ++ cpu_rq(cpu)->idle->time_slice = sched_timeslice_ns;
7398 ++
7399 ++ topo = per_cpu(sched_cpu_topo_masks, cpu) + 1;
7400 ++
7401 ++ cpumask_complement(topo, cpumask_of(cpu));
7402 ++#ifdef CONFIG_SCHED_SMT
7403 ++ TOPOLOGY_CPUMASK(smt, topology_sibling_cpumask(cpu), false);
7404 ++#endif
7405 ++ per_cpu(sd_llc_id, cpu) = cpumask_first(cpu_coregroup_mask(cpu));
7406 ++ per_cpu(sched_cpu_llc_mask, cpu) = topo;
7407 ++ TOPOLOGY_CPUMASK(coregroup, cpu_coregroup_mask(cpu), false);
7408 ++
7409 ++ TOPOLOGY_CPUMASK(core, topology_core_cpumask(cpu), false);
7410 ++
7411 ++ TOPOLOGY_CPUMASK(others, cpu_online_mask, true);
7412 ++
7413 ++ per_cpu(sched_cpu_topo_end_mask, cpu) = topo;
7414 ++ printk(KERN_INFO "sched: cpu#%02d llc_id = %d, llc_mask idx = %d\n",
7415 ++ cpu, per_cpu(sd_llc_id, cpu),
7416 ++ (int) (per_cpu(sched_cpu_llc_mask, cpu) -
7417 ++ per_cpu(sched_cpu_topo_masks, cpu)));
7418 ++ }
7419 ++}
7420 ++#endif
7421 ++
7422 ++void __init sched_init_smp(void)
7423 ++{
7424 ++ /* Move init over to a non-isolated CPU */
7425 ++ if (set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_FLAG_DOMAIN)) < 0)
7426 ++ BUG();
7427 ++ current->flags &= ~PF_NO_SETAFFINITY;
7428 ++
7429 ++ sched_init_topology_cpumask();
7430 ++
7431 ++ sched_smp_initialized = true;
7432 ++}
7433 ++#else
7434 ++void __init sched_init_smp(void)
7435 ++{
7436 ++ cpu_rq(0)->idle->time_slice = sched_timeslice_ns;
7437 ++}
7438 ++#endif /* CONFIG_SMP */
7439 ++
7440 ++int in_sched_functions(unsigned long addr)
7441 ++{
7442 ++ return in_lock_functions(addr) ||
7443 ++ (addr >= (unsigned long)__sched_text_start
7444 ++ && addr < (unsigned long)__sched_text_end);
7445 ++}
7446 ++
7447 ++#ifdef CONFIG_CGROUP_SCHED
7448 ++/* task group related information */
7449 ++struct task_group {
7450 ++ struct cgroup_subsys_state css;
7451 ++
7452 ++ struct rcu_head rcu;
7453 ++ struct list_head list;
7454 ++
7455 ++ struct task_group *parent;
7456 ++ struct list_head siblings;
7457 ++ struct list_head children;
7458 ++#ifdef CONFIG_FAIR_GROUP_SCHED
7459 ++ unsigned long shares;
7460 ++#endif
7461 ++};
7462 ++
7463 ++/*
7464 ++ * Default task group.
7465 ++ * Every task in system belongs to this group at bootup.
7466 ++ */
7467 ++struct task_group root_task_group;
7468 ++LIST_HEAD(task_groups);
7469 ++
7470 ++/* Cacheline aligned slab cache for task_group */
7471 ++static struct kmem_cache *task_group_cache __read_mostly;
7472 ++#endif /* CONFIG_CGROUP_SCHED */
7473 ++
7474 ++void __init sched_init(void)
7475 ++{
7476 ++ int i;
7477 ++ struct rq *rq;
7478 ++
7479 ++ printk(KERN_INFO ALT_SCHED_VERSION_MSG);
7480 ++
7481 ++ wait_bit_init();
7482 ++
7483 ++#ifdef CONFIG_SMP
7484 ++ for (i = 0; i < SCHED_BITS; i++)
7485 ++ cpumask_copy(sched_rq_watermark + i, cpu_present_mask);
7486 ++#endif
7487 ++
7488 ++#ifdef CONFIG_CGROUP_SCHED
7489 ++ task_group_cache = KMEM_CACHE(task_group, 0);
7490 ++
7491 ++ list_add(&root_task_group.list, &task_groups);
7492 ++ INIT_LIST_HEAD(&root_task_group.children);
7493 ++ INIT_LIST_HEAD(&root_task_group.siblings);
7494 ++#endif /* CONFIG_CGROUP_SCHED */
7495 ++ for_each_possible_cpu(i) {
7496 ++ rq = cpu_rq(i);
7497 ++
7498 ++ sched_queue_init(&rq->queue);
7499 ++ rq->watermark = IDLE_TASK_SCHED_PRIO;
7500 ++ rq->skip = NULL;
7501 ++
7502 ++ raw_spin_lock_init(&rq->lock);
7503 ++ rq->nr_running = rq->nr_uninterruptible = 0;
7504 ++ rq->calc_load_active = 0;
7505 ++ rq->calc_load_update = jiffies + LOAD_FREQ;
7506 ++#ifdef CONFIG_SMP
7507 ++ rq->online = false;
7508 ++ rq->cpu = i;
7509 ++
7510 ++#ifdef CONFIG_SCHED_SMT
7511 ++ rq->active_balance = 0;
7512 ++#endif
7513 ++
7514 ++#ifdef CONFIG_NO_HZ_COMMON
7515 ++ INIT_CSD(&rq->nohz_csd, nohz_csd_func, rq);
7516 ++#endif
7517 ++ rq->balance_callback = &balance_push_callback;
7518 ++#ifdef CONFIG_HOTPLUG_CPU
7519 ++ rcuwait_init(&rq->hotplug_wait);
7520 ++#endif
7521 ++#endif /* CONFIG_SMP */
7522 ++ rq->nr_switches = 0;
7523 ++
7524 ++ hrtick_rq_init(rq);
7525 ++ atomic_set(&rq->nr_iowait, 0);
7526 ++ }
7527 ++#ifdef CONFIG_SMP
7528 ++ /* Set rq->online for cpu 0 */
7529 ++ cpu_rq(0)->online = true;
7530 ++#endif
7531 ++ /*
7532 ++ * The boot idle thread does lazy MMU switching as well:
7533 ++ */
7534 ++ mmgrab(&init_mm);
7535 ++ enter_lazy_tlb(&init_mm, current);
7536 ++
7537 ++ /*
7538 ++ * Make us the idle thread. Technically, schedule() should not be
7539 ++ * called from this thread, however somewhere below it might be,
7540 ++ * but because we are the idle thread, we just pick up running again
7541 ++ * when this runqueue becomes "idle".
7542 ++ */
7543 ++ init_idle(current, smp_processor_id());
7544 ++
7545 ++ calc_load_update = jiffies + LOAD_FREQ;
7546 ++
7547 ++#ifdef CONFIG_SMP
7548 ++ idle_thread_set_boot_cpu();
7549 ++ balance_push_set(smp_processor_id(), false);
7550 ++
7551 ++ sched_init_topology_cpumask_early();
7552 ++#endif /* SMP */
7553 ++
7554 ++ psi_init();
7555 ++}
7556 ++
7557 ++#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
7558 ++static inline int preempt_count_equals(int preempt_offset)
7559 ++{
7560 ++ int nested = preempt_count() + rcu_preempt_depth();
7561 ++
7562 ++ return (nested == preempt_offset);
7563 ++}
7564 ++
7565 ++void __might_sleep(const char *file, int line, int preempt_offset)
7566 ++{
7567 ++ unsigned int state = get_current_state();
7568 ++ /*
7569 ++ * Blocking primitives will set (and therefore destroy) current->state,
7570 ++ * since we will exit with TASK_RUNNING make sure we enter with it,
7571 ++ * otherwise we will destroy state.
7572 ++ */
7573 ++ WARN_ONCE(state != TASK_RUNNING && current->task_state_change,
7574 ++ "do not call blocking ops when !TASK_RUNNING; "
7575 ++ "state=%x set at [<%p>] %pS\n", state,
7576 ++ (void *)current->task_state_change,
7577 ++ (void *)current->task_state_change);
7578 ++
7579 ++ ___might_sleep(file, line, preempt_offset);
7580 ++}
7581 ++EXPORT_SYMBOL(__might_sleep);
7582 ++
7583 ++void ___might_sleep(const char *file, int line, int preempt_offset)
7584 ++{
7585 ++ /* Ratelimiting timestamp: */
7586 ++ static unsigned long prev_jiffy;
7587 ++
7588 ++ unsigned long preempt_disable_ip;
7589 ++
7590 ++ /* WARN_ON_ONCE() by default, no rate limit required: */
7591 ++ rcu_sleep_check();
7592 ++
7593 ++ if ((preempt_count_equals(preempt_offset) && !irqs_disabled() &&
7594 ++ !is_idle_task(current) && !current->non_block_count) ||
7595 ++ system_state == SYSTEM_BOOTING || system_state > SYSTEM_RUNNING ||
7596 ++ oops_in_progress)
7597 ++ return;
7598 ++ if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
7599 ++ return;
7600 ++ prev_jiffy = jiffies;
7601 ++
7602 ++ /* Save this before calling printk(), since that will clobber it: */
7603 ++ preempt_disable_ip = get_preempt_disable_ip(current);
7604 ++
7605 ++ printk(KERN_ERR
7606 ++ "BUG: sleeping function called from invalid context at %s:%d\n",
7607 ++ file, line);
7608 ++ printk(KERN_ERR
7609 ++ "in_atomic(): %d, irqs_disabled(): %d, non_block: %d, pid: %d, name: %s\n",
7610 ++ in_atomic(), irqs_disabled(), current->non_block_count,
7611 ++ current->pid, current->comm);
7612 ++
7613 ++ if (task_stack_end_corrupted(current))
7614 ++ printk(KERN_EMERG "Thread overran stack, or stack corrupted\n");
7615 ++
7616 ++ debug_show_held_locks(current);
7617 ++ if (irqs_disabled())
7618 ++ print_irqtrace_events(current);
7619 ++#ifdef CONFIG_DEBUG_PREEMPT
7620 ++ if (!preempt_count_equals(preempt_offset)) {
7621 ++ pr_err("Preemption disabled at:");
7622 ++ print_ip_sym(KERN_ERR, preempt_disable_ip);
7623 ++ }
7624 ++#endif
7625 ++ dump_stack();
7626 ++ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
7627 ++}
7628 ++EXPORT_SYMBOL(___might_sleep);
7629 ++
7630 ++void __cant_sleep(const char *file, int line, int preempt_offset)
7631 ++{
7632 ++ static unsigned long prev_jiffy;
7633 ++
7634 ++ if (irqs_disabled())
7635 ++ return;
7636 ++
7637 ++ if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
7638 ++ return;
7639 ++
7640 ++ if (preempt_count() > preempt_offset)
7641 ++ return;
7642 ++
7643 ++ if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
7644 ++ return;
7645 ++ prev_jiffy = jiffies;
7646 ++
7647 ++ printk(KERN_ERR "BUG: assuming atomic context at %s:%d\n", file, line);
7648 ++ printk(KERN_ERR "in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",
7649 ++ in_atomic(), irqs_disabled(),
7650 ++ current->pid, current->comm);
7651 ++
7652 ++ debug_show_held_locks(current);
7653 ++ dump_stack();
7654 ++ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
7655 ++}
7656 ++EXPORT_SYMBOL_GPL(__cant_sleep);
7657 ++
7658 ++#ifdef CONFIG_SMP
7659 ++void __cant_migrate(const char *file, int line)
7660 ++{
7661 ++ static unsigned long prev_jiffy;
7662 ++
7663 ++ if (irqs_disabled())
7664 ++ return;
7665 ++
7666 ++ if (is_migration_disabled(current))
7667 ++ return;
7668 ++
7669 ++ if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
7670 ++ return;
7671 ++
7672 ++ if (preempt_count() > 0)
7673 ++ return;
7674 ++
7675 ++ if (current->migration_flags & MDF_FORCE_ENABLED)
7676 ++ return;
7677 ++
7678 ++ if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
7679 ++ return;
7680 ++ prev_jiffy = jiffies;
7681 ++
7682 ++ pr_err("BUG: assuming non migratable context at %s:%d\n", file, line);
7683 ++ pr_err("in_atomic(): %d, irqs_disabled(): %d, migration_disabled() %u pid: %d, name: %s\n",
7684 ++ in_atomic(), irqs_disabled(), is_migration_disabled(current),
7685 ++ current->pid, current->comm);
7686 ++
7687 ++ debug_show_held_locks(current);
7688 ++ dump_stack();
7689 ++ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
7690 ++}
7691 ++EXPORT_SYMBOL_GPL(__cant_migrate);
7692 ++#endif
7693 ++#endif
7694 ++
7695 ++#ifdef CONFIG_MAGIC_SYSRQ
7696 ++void normalize_rt_tasks(void)
7697 ++{
7698 ++ struct task_struct *g, *p;
7699 ++ struct sched_attr attr = {
7700 ++ .sched_policy = SCHED_NORMAL,
7701 ++ };
7702 ++
7703 ++ read_lock(&tasklist_lock);
7704 ++ for_each_process_thread(g, p) {
7705 ++ /*
7706 ++ * Only normalize user tasks:
7707 ++ */
7708 ++ if (p->flags & PF_KTHREAD)
7709 ++ continue;
7710 ++
7711 ++ if (!rt_task(p)) {
7712 ++ /*
7713 ++ * Renice negative nice level userspace
7714 ++ * tasks back to 0:
7715 ++ */
7716 ++ if (task_nice(p) < 0)
7717 ++ set_user_nice(p, 0);
7718 ++ continue;
7719 ++ }
7720 ++
7721 ++ __sched_setscheduler(p, &attr, false, false);
7722 ++ }
7723 ++ read_unlock(&tasklist_lock);
7724 ++}
7725 ++#endif /* CONFIG_MAGIC_SYSRQ */
7726 ++
7727 ++#if defined(CONFIG_IA64) || defined(CONFIG_KGDB_KDB)
7728 ++/*
7729 ++ * These functions are only useful for the IA64 MCA handling, or kdb.
7730 ++ *
7731 ++ * They can only be called when the whole system has been
7732 ++ * stopped - every CPU needs to be quiescent, and no scheduling
7733 ++ * activity can take place. Using them for anything else would
7734 ++ * be a serious bug, and as a result, they aren't even visible
7735 ++ * under any other configuration.
7736 ++ */
7737 ++
7738 ++/**
7739 ++ * curr_task - return the current task for a given CPU.
7740 ++ * @cpu: the processor in question.
7741 ++ *
7742 ++ * ONLY VALID WHEN THE WHOLE SYSTEM IS STOPPED!
7743 ++ *
7744 ++ * Return: The current task for @cpu.
7745 ++ */
7746 ++struct task_struct *curr_task(int cpu)
7747 ++{
7748 ++ return cpu_curr(cpu);
7749 ++}
7750 ++
7751 ++#endif /* defined(CONFIG_IA64) || defined(CONFIG_KGDB_KDB) */
7752 ++
7753 ++#ifdef CONFIG_IA64
7754 ++/**
7755 ++ * ia64_set_curr_task - set the current task for a given CPU.
7756 ++ * @cpu: the processor in question.
7757 ++ * @p: the task pointer to set.
7758 ++ *
7759 ++ * Description: This function must only be used when non-maskable interrupts
7760 ++ * are serviced on a separate stack. It allows the architecture to switch the
7761 ++ * notion of the current task on a CPU in a non-blocking manner. This function
7762 ++ * must be called with all CPUs synchronised and interrupts disabled; the
7763 ++ * caller must save the original value of the current task (see
7764 ++ * curr_task() above) and restore that value before reenabling interrupts and
7765 ++ * re-starting the system.
7766 ++ *
7767 ++ * ONLY VALID WHEN THE WHOLE SYSTEM IS STOPPED!
7768 ++ */
7769 ++void ia64_set_curr_task(int cpu, struct task_struct *p)
7770 ++{
7771 ++ cpu_curr(cpu) = p;
7772 ++}
7773 ++
7774 ++#endif
7775 ++
7776 ++#ifdef CONFIG_CGROUP_SCHED
7777 ++static void sched_free_group(struct task_group *tg)
7778 ++{
7779 ++ kmem_cache_free(task_group_cache, tg);
7780 ++}
7781 ++
7782 ++/* allocate runqueue etc for a new task group */
7783 ++struct task_group *sched_create_group(struct task_group *parent)
7784 ++{
7785 ++ struct task_group *tg;
7786 ++
7787 ++ tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
7788 ++ if (!tg)
7789 ++ return ERR_PTR(-ENOMEM);
7790 ++
7791 ++ return tg;
7792 ++}
7793 ++
7794 ++void sched_online_group(struct task_group *tg, struct task_group *parent)
7795 ++{
7796 ++}
7797 ++
7798 ++/* rcu callback to free various structures associated with a task group */
7799 ++static void sched_free_group_rcu(struct rcu_head *rhp)
7800 ++{
7801 ++ /* Now it should be safe to free those cfs_rqs */
7802 ++ sched_free_group(container_of(rhp, struct task_group, rcu));
7803 ++}
7804 ++
7805 ++void sched_destroy_group(struct task_group *tg)
7806 ++{
7807 ++ /* Wait for possible concurrent references to cfs_rqs to complete */
7808 ++ call_rcu(&tg->rcu, sched_free_group_rcu);
7809 ++}
7810 ++
7811 ++void sched_offline_group(struct task_group *tg)
7812 ++{
7813 ++}
7814 ++
7815 ++static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
7816 ++{
7817 ++ return css ? container_of(css, struct task_group, css) : NULL;
7818 ++}
7819 ++
7820 ++static struct cgroup_subsys_state *
7821 ++cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
7822 ++{
7823 ++ struct task_group *parent = css_tg(parent_css);
7824 ++ struct task_group *tg;
7825 ++
7826 ++ if (!parent) {
7827 ++ /* This is early initialization for the top cgroup */
7828 ++ return &root_task_group.css;
7829 ++ }
7830 ++
7831 ++ tg = sched_create_group(parent);
7832 ++ if (IS_ERR(tg))
7833 ++ return ERR_PTR(-ENOMEM);
7834 ++ return &tg->css;
7835 ++}
7836 ++
7837 ++/* Expose task group only after completing cgroup initialization */
7838 ++static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
7839 ++{
7840 ++ struct task_group *tg = css_tg(css);
7841 ++ struct task_group *parent = css_tg(css->parent);
7842 ++
7843 ++ if (parent)
7844 ++ sched_online_group(tg, parent);
7845 ++ return 0;
7846 ++}
7847 ++
7848 ++static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
7849 ++{
7850 ++ struct task_group *tg = css_tg(css);
7851 ++
7852 ++ sched_offline_group(tg);
7853 ++}
7854 ++
7855 ++static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
7856 ++{
7857 ++ struct task_group *tg = css_tg(css);
7858 ++
7859 ++ /*
7860 ++ * Relies on the RCU grace period between css_released() and this.
7861 ++ */
7862 ++ sched_free_group(tg);
7863 ++}
7864 ++
7865 ++static void cpu_cgroup_fork(struct task_struct *task)
7866 ++{
7867 ++}
7868 ++
7869 ++static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
7870 ++{
7871 ++ return 0;
7872 ++}
7873 ++
7874 ++static void cpu_cgroup_attach(struct cgroup_taskset *tset)
7875 ++{
7876 ++}
7877 ++
7878 ++#ifdef CONFIG_FAIR_GROUP_SCHED
7879 ++static DEFINE_MUTEX(shares_mutex);
7880 ++
7881 ++int sched_group_set_shares(struct task_group *tg, unsigned long shares)
7882 ++{
7883 ++ /*
7884 ++ * We can't change the weight of the root cgroup.
7885 ++ */
7886 ++ if (&root_task_group == tg)
7887 ++ return -EINVAL;
7888 ++
7889 ++ shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
7890 ++
7891 ++ mutex_lock(&shares_mutex);
7892 ++ if (tg->shares == shares)
7893 ++ goto done;
7894 ++
7895 ++ tg->shares = shares;
7896 ++done:
7897 ++ mutex_unlock(&shares_mutex);
7898 ++ return 0;
7899 ++}
7900 ++
7901 ++static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
7902 ++ struct cftype *cftype, u64 shareval)
7903 ++{
7904 ++ if (shareval > scale_load_down(ULONG_MAX))
7905 ++ shareval = MAX_SHARES;
7906 ++ return sched_group_set_shares(css_tg(css), scale_load(shareval));
7907 ++}
7908 ++
7909 ++static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
7910 ++ struct cftype *cft)
7911 ++{
7912 ++ struct task_group *tg = css_tg(css);
7913 ++
7914 ++ return (u64) scale_load_down(tg->shares);
7915 ++}
7916 ++#endif
7917 ++
7918 ++static struct cftype cpu_legacy_files[] = {
7919 ++#ifdef CONFIG_FAIR_GROUP_SCHED
7920 ++ {
7921 ++ .name = "shares",
7922 ++ .read_u64 = cpu_shares_read_u64,
7923 ++ .write_u64 = cpu_shares_write_u64,
7924 ++ },
7925 ++#endif
7926 ++ { } /* Terminate */
7927 ++};
7928 ++
7929 ++
7930 ++static struct cftype cpu_files[] = {
7931 ++ { } /* terminate */
7932 ++};
7933 ++
7934 ++static int cpu_extra_stat_show(struct seq_file *sf,
7935 ++ struct cgroup_subsys_state *css)
7936 ++{
7937 ++ return 0;
7938 ++}
7939 ++
7940 ++struct cgroup_subsys cpu_cgrp_subsys = {
7941 ++ .css_alloc = cpu_cgroup_css_alloc,
7942 ++ .css_online = cpu_cgroup_css_online,
7943 ++ .css_released = cpu_cgroup_css_released,
7944 ++ .css_free = cpu_cgroup_css_free,
7945 ++ .css_extra_stat_show = cpu_extra_stat_show,
7946 ++ .fork = cpu_cgroup_fork,
7947 ++ .can_attach = cpu_cgroup_can_attach,
7948 ++ .attach = cpu_cgroup_attach,
7949 ++ .legacy_cftypes = cpu_legacy_files,
7951 ++ .dfl_cftypes = cpu_files,
7952 ++ .early_init = true,
7953 ++ .threaded = true,
7954 ++};
7955 ++#endif /* CONFIG_CGROUP_SCHED */
7956 ++
7957 ++#undef CREATE_TRACE_POINTS
7958 +diff --git a/kernel/sched/alt_debug.c b/kernel/sched/alt_debug.c
7959 +new file mode 100644
7960 +index 000000000000..1212a031700e
7961 +--- /dev/null
7962 ++++ b/kernel/sched/alt_debug.c
7963 +@@ -0,0 +1,31 @@
7964 ++/*
7965 ++ * kernel/sched/alt_debug.c
7966 ++ *
7967 ++ * Print the alt scheduler debugging details
7968 ++ *
7969 ++ * Author: Alfred Chen
7970 ++ * Date : 2020
7971 ++ */
7972 ++#include "sched.h"
7973 ++
7974 ++/*
7975 ++ * This allows printing both to /proc/sched_debug and
7976 ++ * to the console
7977 ++ */
7978 ++#define SEQ_printf(m, x...) \
7979 ++ do { \
7980 ++ if (m) \
7981 ++ seq_printf(m, x); \
7982 ++ else \
7983 ++ pr_cont(x); \
7984 ++ } while (0)
7985 ++
7986 ++void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
7987 ++ struct seq_file *m)
7988 ++{
7989 ++ SEQ_printf(m, "%s (%d, #threads: %d)\n", p->comm, task_pid_nr_ns(p, ns),
7990 ++ get_nr_threads(p));
7991 ++}
7992 ++
7993 ++void proc_sched_set_task(struct task_struct *p)
7994 ++{}
7995 +diff --git a/kernel/sched/alt_sched.h b/kernel/sched/alt_sched.h
7996 +new file mode 100644
7997 +index 000000000000..f03af9ab9123
7998 +--- /dev/null
7999 ++++ b/kernel/sched/alt_sched.h
8000 +@@ -0,0 +1,692 @@
8001 ++#ifndef ALT_SCHED_H
8002 ++#define ALT_SCHED_H
8003 ++
8004 ++#include <linux/sched.h>
8005 ++
8006 ++#include <linux/sched/clock.h>
8007 ++#include <linux/sched/cpufreq.h>
8008 ++#include <linux/sched/cputime.h>
8009 ++#include <linux/sched/debug.h>
8010 ++#include <linux/sched/init.h>
8011 ++#include <linux/sched/isolation.h>
8012 ++#include <linux/sched/loadavg.h>
8013 ++#include <linux/sched/mm.h>
8014 ++#include <linux/sched/nohz.h>
8015 ++#include <linux/sched/signal.h>
8016 ++#include <linux/sched/stat.h>
8017 ++#include <linux/sched/sysctl.h>
8018 ++#include <linux/sched/task.h>
8019 ++#include <linux/sched/topology.h>
8020 ++#include <linux/sched/wake_q.h>
8021 ++
8022 ++#include <uapi/linux/sched/types.h>
8023 ++
8024 ++#include <linux/cgroup.h>
8025 ++#include <linux/cpufreq.h>
8026 ++#include <linux/cpuidle.h>
8027 ++#include <linux/cpuset.h>
8028 ++#include <linux/ctype.h>
8029 ++#include <linux/debugfs.h>
8030 ++#include <linux/kthread.h>
8031 ++#include <linux/livepatch.h>
8032 ++#include <linux/membarrier.h>
8033 ++#include <linux/proc_fs.h>
8034 ++#include <linux/psi.h>
8035 ++#include <linux/slab.h>
8036 ++#include <linux/stop_machine.h>
8037 ++#include <linux/suspend.h>
8038 ++#include <linux/swait.h>
8039 ++#include <linux/syscalls.h>
8040 ++#include <linux/tsacct_kern.h>
8041 ++
8042 ++#include <asm/tlb.h>
8043 ++
8044 ++#ifdef CONFIG_PARAVIRT
8045 ++# include <asm/paravirt.h>
8046 ++#endif
8047 ++
8048 ++#include "cpupri.h"
8049 ++
8050 ++#include <trace/events/sched.h>
8051 ++
8052 ++#ifdef CONFIG_SCHED_BMQ
8053 ++/* bits:
8054 ++ * RT(0-99), (Low prio adj range, nice width, high prio adj range) / 2, cpu idle task */
8055 ++#define SCHED_BITS (MAX_RT_PRIO + NICE_WIDTH / 2 + MAX_PRIORITY_ADJ + 1)
8056 ++#endif
8057 ++
8058 ++#ifdef CONFIG_SCHED_PDS
8059 ++/* bits: RT(0-99), reserved(100-127), NORMAL_PRIO_NUM, cpu idle task */
8060 ++#define SCHED_BITS (MIN_NORMAL_PRIO + NORMAL_PRIO_NUM + 1)
8061 ++#endif /* CONFIG_SCHED_PDS */
8062 ++
8063 ++#define IDLE_TASK_SCHED_PRIO (SCHED_BITS - 1)
8064 ++
8065 ++#ifdef CONFIG_SCHED_DEBUG
8066 ++# define SCHED_WARN_ON(x) WARN_ONCE(x, #x)
8067 ++extern void resched_latency_warn(int cpu, u64 latency);
8068 ++#else
8069 ++# define SCHED_WARN_ON(x) ({ (void)(x), 0; })
8070 ++static inline void resched_latency_warn(int cpu, u64 latency) {}
8071 ++#endif
8072 ++
8073 ++/*
8074 ++ * Increase resolution of nice-level calculations for 64-bit architectures.
8075 ++ * The extra resolution improves shares distribution and load balancing of
8076 ++ * low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
8077 ++ * hierarchies, especially on larger systems. This is not a user-visible change
8078 ++ * and does not change the user-interface for setting shares/weights.
8079 ++ *
8080 ++ * We increase resolution only if we have enough bits to allow this increased
8081 ++ * resolution (i.e. 64-bit). The costs for increasing resolution when 32-bit
8082 ++ * are pretty high and the returns do not justify the increased costs.
8083 ++ *
8084 ++ * Really only required when CONFIG_FAIR_GROUP_SCHED=y is also set, but to
8085 ++ * increase coverage and consistency always enable it on 64-bit platforms.
8086 ++ */
8087 ++#ifdef CONFIG_64BIT
8088 ++# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
8089 ++# define scale_load(w) ((w) << SCHED_FIXEDPOINT_SHIFT)
8090 ++# define scale_load_down(w) \
8091 ++({ \
8092 ++ unsigned long __w = (w); \
8093 ++ if (__w) \
8094 ++ __w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT); \
8095 ++ __w; \
8096 ++})
8097 ++#else
8098 ++# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT)
8099 ++# define scale_load(w) (w)
8100 ++# define scale_load_down(w) (w)
8101 ++#endif
8102 ++
8103 ++#ifdef CONFIG_FAIR_GROUP_SCHED
8104 ++#define ROOT_TASK_GROUP_LOAD NICE_0_LOAD
8105 ++
8106 ++/*
8107 ++ * A weight of 0 or 1 can cause arithmetic problems.
8108 ++ * The weight of a cfs_rq is the sum of the weights of the entities
8109 ++ * queued on it, so the weight of an entity should not be too large,
8110 ++ * and neither should the shares value of a task group.
8111 ++ * (The default weight is 1024 - so there's no practical
8112 ++ * limitation from this.)
8113 ++ */
8114 ++#define MIN_SHARES (1UL << 1)
8115 ++#define MAX_SHARES (1UL << 18)
8116 ++#endif
8117 ++
8118 ++/* task_struct::on_rq states: */
8119 ++#define TASK_ON_RQ_QUEUED 1
8120 ++#define TASK_ON_RQ_MIGRATING 2
8121 ++
8122 ++static inline int task_on_rq_queued(struct task_struct *p)
8123 ++{
8124 ++ return p->on_rq == TASK_ON_RQ_QUEUED;
8125 ++}
8126 ++
8127 ++static inline int task_on_rq_migrating(struct task_struct *p)
8128 ++{
8129 ++ return READ_ONCE(p->on_rq) == TASK_ON_RQ_MIGRATING;
8130 ++}
8131 ++
8132 ++/*
8133 ++ * wake flags
8134 ++ */
8135 ++#define WF_SYNC 0x01 /* waker goes to sleep after wakeup */
8136 ++#define WF_FORK 0x02 /* child wakeup after fork */
8137 ++#define WF_MIGRATED 0x04 /* internal use, task got migrated */
8138 ++#define WF_ON_CPU 0x08 /* Wakee is on_rq */
8139 ++
8140 ++#define SCHED_QUEUE_BITS (SCHED_BITS - 1)
8141 ++
8142 ++struct sched_queue {
8143 ++ DECLARE_BITMAP(bitmap, SCHED_QUEUE_BITS);
8144 ++ struct list_head heads[SCHED_BITS];
8145 ++};
8146 ++
8147 ++/*
8148 ++ * This is the main, per-CPU runqueue data structure.
8149 ++ * This data should only be modified by the local cpu.
8150 ++ */
8151 ++struct rq {
8152 ++ /* runqueue lock: */
8153 ++ raw_spinlock_t lock;
8154 ++
8155 ++ struct task_struct __rcu *curr;
8156 ++ struct task_struct *idle, *stop, *skip;
8157 ++ struct mm_struct *prev_mm;
8158 ++
8159 ++ struct sched_queue queue;
8160 ++#ifdef CONFIG_SCHED_PDS
8161 ++ u64 time_edge;
8162 ++#endif
8163 ++ unsigned long watermark;
8164 ++
8165 ++ /* switch count */
8166 ++ u64 nr_switches;
8167 ++
8168 ++ atomic_t nr_iowait;
8169 ++
8170 ++#ifdef CONFIG_SCHED_DEBUG
8171 ++ u64 last_seen_need_resched_ns;
8172 ++ int ticks_without_resched;
8173 ++#endif
8174 ++
8175 ++#ifdef CONFIG_MEMBARRIER
8176 ++ int membarrier_state;
8177 ++#endif
8178 ++
8179 ++#ifdef CONFIG_SMP
8180 ++ int cpu; /* cpu of this runqueue */
8181 ++ bool online;
8182 ++
8183 ++ unsigned int ttwu_pending;
8184 ++ unsigned char nohz_idle_balance;
8185 ++ unsigned char idle_balance;
8186 ++
8187 ++#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
8188 ++ struct sched_avg avg_irq;
8189 ++#endif
8190 ++
8191 ++#ifdef CONFIG_SCHED_SMT
8192 ++ int active_balance;
8193 ++ struct cpu_stop_work active_balance_work;
8194 ++#endif
8195 ++ struct callback_head *balance_callback;
8196 ++#ifdef CONFIG_HOTPLUG_CPU
8197 ++ struct rcuwait hotplug_wait;
8198 ++#endif
8199 ++ unsigned int nr_pinned;
8200 ++#endif /* CONFIG_SMP */
8201 ++#ifdef CONFIG_IRQ_TIME_ACCOUNTING
8202 ++ u64 prev_irq_time;
8203 ++#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
8204 ++#ifdef CONFIG_PARAVIRT
8205 ++ u64 prev_steal_time;
8206 ++#endif /* CONFIG_PARAVIRT */
8207 ++#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
8208 ++ u64 prev_steal_time_rq;
8209 ++#endif /* CONFIG_PARAVIRT_TIME_ACCOUNTING */
8210 ++
8211 ++ /* calc_load related fields */
8212 ++ unsigned long calc_load_update;
8213 ++ long calc_load_active;
8214 ++
8215 ++ u64 clock, last_tick;
8216 ++ u64 last_ts_switch;
8217 ++ u64 clock_task;
8218 ++
8219 ++ unsigned int nr_running;
8220 ++ unsigned long nr_uninterruptible;
8221 ++
8222 ++#ifdef CONFIG_SCHED_HRTICK
8223 ++#ifdef CONFIG_SMP
8224 ++ call_single_data_t hrtick_csd;
8225 ++#endif
8226 ++ struct hrtimer hrtick_timer;
8227 ++ ktime_t hrtick_time;
8228 ++#endif
8229 ++
8230 ++#ifdef CONFIG_SCHEDSTATS
8231 ++
8232 ++ /* latency stats */
8233 ++ struct sched_info rq_sched_info;
8234 ++ unsigned long long rq_cpu_time;
8235 ++ /* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */
8236 ++
8237 ++ /* sys_sched_yield() stats */
8238 ++ unsigned int yld_count;
8239 ++
8240 ++ /* schedule() stats */
8241 ++ unsigned int sched_switch;
8242 ++ unsigned int sched_count;
8243 ++ unsigned int sched_goidle;
8244 ++
8245 ++ /* try_to_wake_up() stats */
8246 ++ unsigned int ttwu_count;
8247 ++ unsigned int ttwu_local;
8248 ++#endif /* CONFIG_SCHEDSTATS */
8249 ++
8250 ++#ifdef CONFIG_CPU_IDLE
8251 ++ /* Must be inspected within a rcu lock section */
8252 ++ struct cpuidle_state *idle_state;
8253 ++#endif
8254 ++
8255 ++#ifdef CONFIG_NO_HZ_COMMON
8256 ++#ifdef CONFIG_SMP
8257 ++ call_single_data_t nohz_csd;
8258 ++#endif
8259 ++ atomic_t nohz_flags;
8260 ++#endif /* CONFIG_NO_HZ_COMMON */
8261 ++};
8262 ++
8263 ++extern unsigned long calc_load_update;
8264 ++extern atomic_long_t calc_load_tasks;
8265 ++
8266 ++extern void calc_global_load_tick(struct rq *this_rq);
8267 ++extern long calc_load_fold_active(struct rq *this_rq, long adjust);
8268 ++
8269 ++DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
8270 ++#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))
8271 ++#define this_rq() this_cpu_ptr(&runqueues)
8272 ++#define task_rq(p) cpu_rq(task_cpu(p))
8273 ++#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
8274 ++#define raw_rq() raw_cpu_ptr(&runqueues)
8275 ++
8276 ++#ifdef CONFIG_SMP
8277 ++#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
8278 ++void register_sched_domain_sysctl(void);
8279 ++void unregister_sched_domain_sysctl(void);
8280 ++#else
8281 ++static inline void register_sched_domain_sysctl(void)
8282 ++{
8283 ++}
8284 ++static inline void unregister_sched_domain_sysctl(void)
8285 ++{
8286 ++}
8287 ++#endif
8288 ++
8289 ++extern bool sched_smp_initialized;
8290 ++
8291 ++enum {
8292 ++ ITSELF_LEVEL_SPACE_HOLDER,
8293 ++#ifdef CONFIG_SCHED_SMT
8294 ++ SMT_LEVEL_SPACE_HOLDER,
8295 ++#endif
8296 ++ COREGROUP_LEVEL_SPACE_HOLDER,
8297 ++ CORE_LEVEL_SPACE_HOLDER,
8298 ++ OTHER_LEVEL_SPACE_HOLDER,
8299 ++ NR_CPU_AFFINITY_LEVELS
8300 ++};
8301 ++
8302 ++DECLARE_PER_CPU(cpumask_t [NR_CPU_AFFINITY_LEVELS], sched_cpu_topo_masks);
8303 ++DECLARE_PER_CPU(cpumask_t *, sched_cpu_llc_mask);
8304 ++
8305 ++static inline int
8306 ++__best_mask_cpu(const cpumask_t *cpumask, const cpumask_t *mask)
8307 ++{
8308 ++ int cpu;
8309 ++
8310 ++ while ((cpu = cpumask_any_and(cpumask, mask)) >= nr_cpu_ids)
8311 ++ mask++;
8312 ++
8313 ++ return cpu;
8314 ++}
8315 ++
8316 ++static inline int best_mask_cpu(int cpu, const cpumask_t *mask)
8317 ++{
8318 ++ return __best_mask_cpu(mask, per_cpu(sched_cpu_topo_masks, cpu));
8319 ++}
8320 ++
8321 ++extern void flush_smp_call_function_from_idle(void);
8322 ++
8323 ++#else /* !CONFIG_SMP */
8324 ++static inline void flush_smp_call_function_from_idle(void) { }
8325 ++#endif
8326 ++
8327 ++#ifndef arch_scale_freq_tick
8328 ++static __always_inline
8329 ++void arch_scale_freq_tick(void)
8330 ++{
8331 ++}
8332 ++#endif
8333 ++
8334 ++#ifndef arch_scale_freq_capacity
8335 ++static __always_inline
8336 ++unsigned long arch_scale_freq_capacity(int cpu)
8337 ++{
8338 ++ return SCHED_CAPACITY_SCALE;
8339 ++}
8340 ++#endif
8341 ++
8342 ++static inline u64 __rq_clock_broken(struct rq *rq)
8343 ++{
8344 ++ return READ_ONCE(rq->clock);
8345 ++}
8346 ++
8347 ++static inline u64 rq_clock(struct rq *rq)
8348 ++{
8349 ++ /*
8350 ++ * Relax lockdep_assert_held() checking as in VRQ; calls to
8351 ++ * sched_info_xxxx() may not hold rq->lock:
8352 ++ * lockdep_assert_held(&rq->lock);
8353 ++ */
8354 ++ return rq->clock;
8355 ++}
8356 ++
8357 ++static inline u64 rq_clock_task(struct rq *rq)
8358 ++{
8359 ++ /*
8360 ++ * Relax lockdep_assert_held() checking as in VRQ; calls to
8361 ++ * sched_info_xxxx() may not hold rq->lock:
8362 ++ * lockdep_assert_held(&rq->lock);
8363 ++ */
8364 ++ return rq->clock_task;
8365 ++}
8366 ++
8367 ++/*
8368 ++ * {de,en}queue flags:
8369 ++ *
8370 ++ * DEQUEUE_SLEEP - task is no longer runnable
8371 ++ * ENQUEUE_WAKEUP - task just became runnable
8372 ++ *
8373 ++ */
8374 ++
8375 ++#define DEQUEUE_SLEEP 0x01
8376 ++
8377 ++#define ENQUEUE_WAKEUP 0x01
8378 ++
8379 ++
8380 ++/*
8381 ++ * Below are scheduler APIs which are used in other kernel code.
8382 ++ * They use the dummy rq_flags.
8383 ++ * ToDo : BMQ needs to support these APIs for compatibility with mainline
8384 ++ * scheduler code.
8385 ++ */
8386 ++struct rq_flags {
8387 ++ unsigned long flags;
8388 ++};
8389 ++
8390 ++struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
8391 ++ __acquires(rq->lock);
8392 ++
8393 ++struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
8394 ++ __acquires(p->pi_lock)
8395 ++ __acquires(rq->lock);
8396 ++
8397 ++static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
8398 ++ __releases(rq->lock)
8399 ++{
8400 ++ raw_spin_unlock(&rq->lock);
8401 ++}
8402 ++
8403 ++static inline void
8404 ++task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
8405 ++ __releases(rq->lock)
8406 ++ __releases(p->pi_lock)
8407 ++{
8408 ++ raw_spin_unlock(&rq->lock);
8409 ++ raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
8410 ++}
8411 ++
8412 ++static inline void
8413 ++rq_lock(struct rq *rq, struct rq_flags *rf)
8414 ++ __acquires(rq->lock)
8415 ++{
8416 ++ raw_spin_lock(&rq->lock);
8417 ++}
8418 ++
8419 ++static inline void
8420 ++rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
8421 ++ __releases(rq->lock)
8422 ++{
8423 ++ raw_spin_unlock_irq(&rq->lock);
8424 ++}
8425 ++
8426 ++static inline void
8427 ++rq_unlock(struct rq *rq, struct rq_flags *rf)
8428 ++ __releases(rq->lock)
8429 ++{
8430 ++ raw_spin_unlock(&rq->lock);
8431 ++}
8432 ++
8433 ++static inline struct rq *
8434 ++this_rq_lock_irq(struct rq_flags *rf)
8435 ++ __acquires(rq->lock)
8436 ++{
8437 ++ struct rq *rq;
8438 ++
8439 ++ local_irq_disable();
8440 ++ rq = this_rq();
8441 ++ raw_spin_lock(&rq->lock);
8442 ++
8443 ++ return rq;
8444 ++}
8445 ++
8446 ++extern void raw_spin_rq_lock_nested(struct rq *rq, int subclass);
8447 ++extern void raw_spin_rq_unlock(struct rq *rq);
8448 ++
8449 ++static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
8450 ++{
8451 ++ return &rq->lock;
8452 ++}
8453 ++
8454 ++static inline raw_spinlock_t *rq_lockp(struct rq *rq)
8455 ++{
8456 ++ return __rq_lockp(rq);
8457 ++}
8458 ++
8459 ++static inline void raw_spin_rq_lock(struct rq *rq)
8460 ++{
8461 ++ raw_spin_rq_lock_nested(rq, 0);
8462 ++}
8463 ++
8464 ++static inline void raw_spin_rq_lock_irq(struct rq *rq)
8465 ++{
8466 ++ local_irq_disable();
8467 ++ raw_spin_rq_lock(rq);
8468 ++}
8469 ++
8470 ++static inline void raw_spin_rq_unlock_irq(struct rq *rq)
8471 ++{
8472 ++ raw_spin_rq_unlock(rq);
8473 ++ local_irq_enable();
8474 ++}
8475 ++
8476 ++static inline int task_current(struct rq *rq, struct task_struct *p)
8477 ++{
8478 ++ return rq->curr == p;
8479 ++}
8480 ++
8481 ++static inline bool task_running(struct task_struct *p)
8482 ++{
8483 ++ return p->on_cpu;
8484 ++}
8485 ++
8486 ++extern int task_running_nice(struct task_struct *p);
8487 ++
8488 ++extern struct static_key_false sched_schedstats;
8489 ++
8490 ++#ifdef CONFIG_CPU_IDLE
8491 ++static inline void idle_set_state(struct rq *rq,
8492 ++ struct cpuidle_state *idle_state)
8493 ++{
8494 ++ rq->idle_state = idle_state;
8495 ++}
8496 ++
8497 ++static inline struct cpuidle_state *idle_get_state(struct rq *rq)
8498 ++{
8499 ++ WARN_ON(!rcu_read_lock_held());
8500 ++ return rq->idle_state;
8501 ++}
8502 ++#else
8503 ++static inline void idle_set_state(struct rq *rq,
8504 ++ struct cpuidle_state *idle_state)
8505 ++{
8506 ++}
8507 ++
8508 ++static inline struct cpuidle_state *idle_get_state(struct rq *rq)
8509 ++{
8510 ++ return NULL;
8511 ++}
8512 ++#endif
8513 ++
8514 ++static inline int cpu_of(const struct rq *rq)
8515 ++{
8516 ++#ifdef CONFIG_SMP
8517 ++ return rq->cpu;
8518 ++#else
8519 ++ return 0;
8520 ++#endif
8521 ++}
8522 ++
8523 ++#include "stats.h"
8524 ++
8525 ++#ifdef CONFIG_NO_HZ_COMMON
8526 ++#define NOHZ_BALANCE_KICK_BIT 0
8527 ++#define NOHZ_STATS_KICK_BIT 1
8528 ++
8529 ++#define NOHZ_BALANCE_KICK BIT(NOHZ_BALANCE_KICK_BIT)
8530 ++#define NOHZ_STATS_KICK BIT(NOHZ_STATS_KICK_BIT)
8531 ++
8532 ++#define NOHZ_KICK_MASK (NOHZ_BALANCE_KICK | NOHZ_STATS_KICK)
8533 ++
8534 ++#define nohz_flags(cpu) (&cpu_rq(cpu)->nohz_flags)
8535 ++
8536 ++/* TODO: needed?
8537 ++extern void nohz_balance_exit_idle(struct rq *rq);
8538 ++#else
8539 ++static inline void nohz_balance_exit_idle(struct rq *rq) { }
8540 ++*/
8541 ++#endif
8542 ++
8543 ++#ifdef CONFIG_IRQ_TIME_ACCOUNTING
8544 ++struct irqtime {
8545 ++ u64 total;
8546 ++ u64 tick_delta;
8547 ++ u64 irq_start_time;
8548 ++ struct u64_stats_sync sync;
8549 ++};
8550 ++
8551 ++DECLARE_PER_CPU(struct irqtime, cpu_irqtime);
8552 ++
8553 ++/*
8554 ++ * Returns the irqtime minus the softirq time computed by ksoftirqd.
8555 ++ * Otherwise ksoftirqd's sum_exec_runtime would have its own runtime
8556 ++ * subtracted and would never move forward.
8557 ++ */
8558 ++static inline u64 irq_time_read(int cpu)
8559 ++{
8560 ++ struct irqtime *irqtime = &per_cpu(cpu_irqtime, cpu);
8561 ++ unsigned int seq;
8562 ++ u64 total;
8563 ++
8564 ++ do {
8565 ++ seq = __u64_stats_fetch_begin(&irqtime->sync);
8566 ++ total = irqtime->total;
8567 ++ } while (__u64_stats_fetch_retry(&irqtime->sync, seq));
8568 ++
8569 ++ return total;
8570 ++}
8571 ++#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
8572 ++
8573 ++#ifdef CONFIG_CPU_FREQ
8574 ++DECLARE_PER_CPU(struct update_util_data __rcu *, cpufreq_update_util_data);
8575 ++
8576 ++/**
8577 ++ * cpufreq_update_util - Take a note about CPU utilization changes.
8578 ++ * @rq: Runqueue to carry out the update for.
8579 ++ * @flags: Update reason flags.
8580 ++ *
8581 ++ * This function is called by the scheduler on the CPU whose utilization is
8582 ++ * being updated.
8583 ++ *
8584 ++ * It can only be called from RCU-sched read-side critical sections.
8585 ++ *
8586 ++ * The way cpufreq is currently arranged requires it to evaluate the CPU
8587 ++ * performance state (frequency/voltage) on a regular basis to prevent it from
8588 ++ * being stuck in a completely inadequate performance level for too long.
8589 ++ * That is not guaranteed to happen if the updates are only triggered from CFS
8590 ++ * and DL, though, because they may not be coming in if only RT tasks are
8591 ++ * active all the time (or there are RT tasks only).
8592 ++ *
8593 ++ * As a workaround for that issue, this function is called periodically by the
8594 ++ * RT sched class to trigger extra cpufreq updates to prevent it from stalling,
8595 ++ * but that really is a band-aid. Going forward it should be replaced with
8596 ++ * solutions targeted more specifically at RT tasks.
8597 ++ */
8598 ++static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
8599 ++{
8600 ++ struct update_util_data *data;
8601 ++
8602 ++ data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
8603 ++ cpu_of(rq)));
8604 ++ if (data)
8605 ++ data->func(data, rq_clock(rq), flags);
8606 ++}
8607 ++#else
8608 ++static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
8609 ++#endif /* CONFIG_CPU_FREQ */
8610 ++
8611 ++#ifdef CONFIG_NO_HZ_FULL
8612 ++extern int __init sched_tick_offload_init(void);
8613 ++#else
8614 ++static inline int sched_tick_offload_init(void) { return 0; }
8615 ++#endif
8616 ++
8617 ++#ifdef arch_scale_freq_capacity
8618 ++#ifndef arch_scale_freq_invariant
8619 ++#define arch_scale_freq_invariant() (true)
8620 ++#endif
8621 ++#else /* arch_scale_freq_capacity */
8622 ++#define arch_scale_freq_invariant() (false)
8623 ++#endif
8624 ++
8625 ++extern void schedule_idle(void);
8626 ++
8627 ++#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
8628 ++
8629 ++/*
8630 ++ * !! For sched_setattr_nocheck() (kernel) only !!
8631 ++ *
8632 ++ * This is actually gross. :(
8633 ++ *
8634 ++ * It is used to make schedutil kworker(s) higher priority than SCHED_DEADLINE
8635 ++ * tasks, but still be able to sleep. We need this on platforms that cannot
8636 ++ * atomically change clock frequency. Remove once fast switching will be
8637 ++ * available on such platforms.
8638 ++ *
8639 ++ * SUGOV stands for SchedUtil GOVernor.
8640 ++ */
8641 ++#define SCHED_FLAG_SUGOV 0x10000000
8642 ++
8643 ++#ifdef CONFIG_MEMBARRIER
8644 ++/*
8645 ++ * The scheduler provides memory barriers required by membarrier between:
8646 ++ * - prior user-space memory accesses and store to rq->membarrier_state,
8647 ++ * - store to rq->membarrier_state and following user-space memory accesses.
8648 ++ * In the same way it provides those guarantees around store to rq->curr.
8649 ++ */
8650 ++static inline void membarrier_switch_mm(struct rq *rq,
8651 ++ struct mm_struct *prev_mm,
8652 ++ struct mm_struct *next_mm)
8653 ++{
8654 ++ int membarrier_state;
8655 ++
8656 ++ if (prev_mm == next_mm)
8657 ++ return;
8658 ++
8659 ++ membarrier_state = atomic_read(&next_mm->membarrier_state);
8660 ++ if (READ_ONCE(rq->membarrier_state) == membarrier_state)
8661 ++ return;
8662 ++
8663 ++ WRITE_ONCE(rq->membarrier_state, membarrier_state);
8664 ++}
8665 ++#else
8666 ++static inline void membarrier_switch_mm(struct rq *rq,
8667 ++ struct mm_struct *prev_mm,
8668 ++ struct mm_struct *next_mm)
8669 ++{
8670 ++}
8671 ++#endif
8672 ++
8673 ++#ifdef CONFIG_NUMA
8674 ++extern int sched_numa_find_closest(const struct cpumask *cpus, int cpu);
8675 ++#else
8676 ++static inline int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
8677 ++{
8678 ++ return nr_cpu_ids;
8679 ++}
8680 ++#endif
8681 ++
8682 ++extern void swake_up_all_locked(struct swait_queue_head *q);
8683 ++extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
8684 ++
8685 ++#ifdef CONFIG_PREEMPT_DYNAMIC
8686 ++extern int preempt_dynamic_mode;
8687 ++extern int sched_dynamic_mode(const char *str);
8688 ++extern void sched_dynamic_update(int mode);
8689 ++#endif
8690 ++
8691 ++static inline void nohz_run_idle_balance(int cpu) { }
8692 ++#endif /* ALT_SCHED_H */
8693 +diff --git a/kernel/sched/bmq.h b/kernel/sched/bmq.h
8694 +new file mode 100644
8695 +index 000000000000..be3ee4a553ca
8696 +--- /dev/null
8697 ++++ b/kernel/sched/bmq.h
8698 +@@ -0,0 +1,111 @@
8699 ++#define ALT_SCHED_VERSION_MSG "sched/bmq: BMQ CPU Scheduler "ALT_SCHED_VERSION" by Alfred Chen.\n"
8700 ++
8701 ++/*
8702 ++ * BMQ only routines
8703 ++ */
8704 ++#define rq_switch_time(rq) ((rq)->clock - (rq)->last_ts_switch)
8705 ++#define boost_threshold(p) (sched_timeslice_ns >>\
8706 ++ (15 - MAX_PRIORITY_ADJ - (p)->boost_prio))
8707 ++
8708 ++static inline void boost_task(struct task_struct *p)
8709 ++{
8710 ++ int limit;
8711 ++
8712 ++ switch (p->policy) {
8713 ++ case SCHED_NORMAL:
8714 ++ limit = -MAX_PRIORITY_ADJ;
8715 ++ break;
8716 ++ case SCHED_BATCH:
8717 ++ case SCHED_IDLE:
8718 ++ limit = 0;
8719 ++ break;
8720 ++ default:
8721 ++ return;
8722 ++ }
8723 ++
8724 ++ if (p->boost_prio > limit)
8725 ++ p->boost_prio--;
8726 ++}
8727 ++
8728 ++static inline void deboost_task(struct task_struct *p)
8729 ++{
8730 ++ if (p->boost_prio < MAX_PRIORITY_ADJ)
8731 ++ p->boost_prio++;
8732 ++}
8733 ++
8734 ++/*
8735 ++ * Common interfaces
8736 ++ */
8737 ++static inline void sched_timeslice_imp(const int timeslice_ms) {}
8738 ++
8739 ++static inline int
8740 ++task_sched_prio_normal(const struct task_struct *p, const struct rq *rq)
8741 ++{
8742 ++ return p->prio + p->boost_prio - MAX_RT_PRIO;
8743 ++}
8744 ++
8745 ++static inline int task_sched_prio(const struct task_struct *p)
8746 ++{
8747 ++ return (p->prio < MAX_RT_PRIO) ? p->prio : MAX_RT_PRIO / 2 + (p->prio + p->boost_prio) / 2;
8748 ++}
8749 ++
8750 ++static inline int
8751 ++task_sched_prio_idx(const struct task_struct *p, const struct rq *rq)
8752 ++{
8753 ++ return task_sched_prio(p);
8754 ++}
8755 ++
8756 ++static inline int sched_prio2idx(int prio, struct rq *rq)
8757 ++{
8758 ++ return prio;
8759 ++}
8760 ++
8761 ++static inline int sched_idx2prio(int idx, struct rq *rq)
8762 ++{
8763 ++ return idx;
8764 ++}
8765 ++
8766 ++static inline void time_slice_expired(struct task_struct *p, struct rq *rq)
8767 ++{
8768 ++ p->time_slice = sched_timeslice_ns;
8769 ++
8770 ++ if (SCHED_FIFO != p->policy && task_on_rq_queued(p)) {
8771 ++ if (SCHED_RR != p->policy)
8772 ++ deboost_task(p);
8773 ++ requeue_task(p, rq);
8774 ++ }
8775 ++}
8776 ++
8777 ++static inline void sched_task_sanity_check(struct task_struct *p, struct rq *rq) {}
8778 ++
8779 ++inline int task_running_nice(struct task_struct *p)
8780 ++{
8781 ++ return (p->prio + p->boost_prio > DEFAULT_PRIO + MAX_PRIORITY_ADJ);
8782 ++}
8783 ++
8784 ++static void sched_task_fork(struct task_struct *p, struct rq *rq)
8785 ++{
8786 ++ p->boost_prio = (p->boost_prio < 0) ?
8787 ++ p->boost_prio + MAX_PRIORITY_ADJ : MAX_PRIORITY_ADJ;
8788 ++}
8789 ++
8790 ++static inline void do_sched_yield_type_1(struct task_struct *p, struct rq *rq)
8791 ++{
8792 ++ p->boost_prio = MAX_PRIORITY_ADJ;
8793 ++}
8794 ++
8795 ++#ifdef CONFIG_SMP
8796 ++static inline void sched_task_ttwu(struct task_struct *p)
8797 ++{
8798 ++ if(this_rq()->clock_task - p->last_ran > sched_timeslice_ns)
8799 ++ boost_task(p);
8800 ++}
8801 ++#endif
8802 ++
8803 ++static inline void sched_task_deactivate(struct task_struct *p, struct rq *rq)
8804 ++{
8805 ++ if (rq_switch_time(rq) < boost_threshold(p))
8806 ++ boost_task(p);
8807 ++}
8808 ++
8809 ++static inline void update_rq_time_edge(struct rq *rq) {}
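
[Editorial note, not part of the patch] The index math in bmq.h above can be exercised in isolation. The standalone userspace sketch below mirrors task_sched_prio(): realtime tasks keep their priority, while normal tasks are folded into the upper half of the bitmap queue together with their boost_prio offset. MAX_RT_PRIO is the mainline value; MAX_PRIORITY_ADJ here is a placeholder chosen only for illustration.

/*
 * Standalone sketch of the BMQ priority-to-queue-index mapping.
 * MAX_PRIORITY_ADJ below is a placeholder, not taken from the patch.
 */
#include <stdio.h>

#define MAX_RT_PRIO       100   /* mainline value */
#define MAX_PRIORITY_ADJ    4   /* placeholder boost range, illustration only */

/* Mirrors task_sched_prio() above. */
static int task_sched_prio(int prio, int boost_prio)
{
        return (prio < MAX_RT_PRIO) ? prio
                                    : MAX_RT_PRIO / 2 + (prio + boost_prio) / 2;
}

int main(void)
{
        /* A nice-0 task (prio 120) at every boost level. */
        for (int boost = -MAX_PRIORITY_ADJ; boost <= MAX_PRIORITY_ADJ; boost++)
                printf("prio=120 boost=%+d -> queue index %d\n",
                       boost, task_sched_prio(120, boost));
        return 0;
}

Because boost_task() and deboost_task() move boost_prio one step at a time, tasks that sleep often drift toward lower indices and get picked sooner, while tasks that burn whole time slices drift the other way.
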
8810 +diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
8811 +index 57124614363d..4057e51cef45 100644
8812 +--- a/kernel/sched/cpufreq_schedutil.c
8813 ++++ b/kernel/sched/cpufreq_schedutil.c
8814 +@@ -57,6 +57,13 @@ struct sugov_cpu {
8815 + unsigned long bw_dl;
8816 + unsigned long max;
8817 +
8818 ++#ifdef CONFIG_SCHED_ALT
8819 ++ /* For general cpu load util */
8820 ++ s32 load_history;
8821 ++ u64 load_block;
8822 ++ u64 load_stamp;
8823 ++#endif
8824 ++
8825 + /* The field below is for single-CPU policies only: */
8826 + #ifdef CONFIG_NO_HZ_COMMON
8827 + unsigned long saved_idle_calls;
8828 +@@ -161,6 +168,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
8829 + return cpufreq_driver_resolve_freq(policy, freq);
8830 + }
8831 +
8832 ++#ifndef CONFIG_SCHED_ALT
8833 + static void sugov_get_util(struct sugov_cpu *sg_cpu)
8834 + {
8835 + struct rq *rq = cpu_rq(sg_cpu->cpu);
8836 +@@ -172,6 +180,55 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
8837 + FREQUENCY_UTIL, NULL);
8838 + }
8839 +
8840 ++#else /* CONFIG_SCHED_ALT */
8841 ++
8842 ++#define SG_CPU_LOAD_HISTORY_BITS (sizeof(s32) * 8ULL)
8843 ++#define SG_CPU_UTIL_SHIFT (8)
8844 ++#define SG_CPU_LOAD_HISTORY_SHIFT (SG_CPU_LOAD_HISTORY_BITS - 1 - SG_CPU_UTIL_SHIFT)
8845 ++#define SG_CPU_LOAD_HISTORY_TO_UTIL(l) (((l) >> SG_CPU_LOAD_HISTORY_SHIFT) & 0xff)
8846 ++
8847 ++#define LOAD_BLOCK(t) ((t) >> 17)
8848 ++#define LOAD_HALF_BLOCK(t) ((t) >> 16)
8849 ++#define BLOCK_MASK(t) ((t) & ((0x01 << 18) - 1))
8850 ++#define LOAD_BLOCK_BIT(b) (1UL << (SG_CPU_LOAD_HISTORY_BITS - 1 - (b)))
8851 ++#define CURRENT_LOAD_BIT LOAD_BLOCK_BIT(0)
8852 ++
8853 ++static void sugov_get_util(struct sugov_cpu *sg_cpu)
8854 ++{
8855 ++ unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
8856 ++
8857 ++ sg_cpu->max = max;
8858 ++ sg_cpu->bw_dl = 0;
8859 ++ sg_cpu->util = SG_CPU_LOAD_HISTORY_TO_UTIL(sg_cpu->load_history) *
8860 ++ (max >> SG_CPU_UTIL_SHIFT);
8861 ++}
8862 ++
8863 ++static inline void sugov_cpu_load_update(struct sugov_cpu *sg_cpu, u64 time)
8864 ++{
8865 ++ u64 delta = min(LOAD_BLOCK(time) - LOAD_BLOCK(sg_cpu->load_stamp),
8866 ++ SG_CPU_LOAD_HISTORY_BITS - 1);
8867 ++ u64 prev = !!(sg_cpu->load_history & CURRENT_LOAD_BIT);
8868 ++ u64 curr = !!cpu_rq(sg_cpu->cpu)->nr_running;
8869 ++
8870 ++ if (delta) {
8871 ++ sg_cpu->load_history = sg_cpu->load_history >> delta;
8872 ++
8873 ++ if (delta <= SG_CPU_UTIL_SHIFT) {
8874 ++ sg_cpu->load_block += (~BLOCK_MASK(sg_cpu->load_stamp)) * prev;
8875 ++ if (!!LOAD_HALF_BLOCK(sg_cpu->load_block) ^ curr)
8876 ++ sg_cpu->load_history ^= LOAD_BLOCK_BIT(delta);
8877 ++ }
8878 ++
8879 ++ sg_cpu->load_block = BLOCK_MASK(time) * prev;
8880 ++ } else {
8881 ++ sg_cpu->load_block += (time - sg_cpu->load_stamp) * prev;
8882 ++ }
8883 ++ if (prev ^ curr)
8884 ++ sg_cpu->load_history ^= CURRENT_LOAD_BIT;
8885 ++ sg_cpu->load_stamp = time;
8886 ++}
8887 ++#endif /* CONFIG_SCHED_ALT */
8888 ++
8889 + /**
8890 + * sugov_iowait_reset() - Reset the IO boost status of a CPU.
8891 + * @sg_cpu: the sugov data for the CPU to boost
8892 +@@ -312,13 +369,19 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
8893 + */
8894 + static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu)
8895 + {
8896 ++#ifndef CONFIG_SCHED_ALT
8897 + if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
8898 + sg_cpu->sg_policy->limits_changed = true;
8899 ++#endif
8900 + }
8901 +
8902 + static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
8903 + u64 time, unsigned int flags)
8904 + {
8905 ++#ifdef CONFIG_SCHED_ALT
8906 ++ sugov_cpu_load_update(sg_cpu, time);
8907 ++#endif /* CONFIG_SCHED_ALT */
8908 ++
8909 + sugov_iowait_boost(sg_cpu, time, flags);
8910 + sg_cpu->last_update = time;
8911 +
8912 +@@ -439,6 +502,10 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
8913 +
8914 + raw_spin_lock(&sg_policy->update_lock);
8915 +
8916 ++#ifdef CONFIG_SCHED_ALT
8917 ++ sugov_cpu_load_update(sg_cpu, time);
8918 ++#endif /* CONFIG_SCHED_ALT */
8919 ++
8920 + sugov_iowait_boost(sg_cpu, time, flags);
8921 + sg_cpu->last_update = time;
8922 +
8923 +@@ -599,6 +666,7 @@ static int sugov_kthread_create(struct sugov_policy *sg_policy)
8924 + }
8925 +
8926 + ret = sched_setattr_nocheck(thread, &attr);
8927 ++
8928 + if (ret) {
8929 + kthread_stop(thread);
8930 + pr_warn("%s: failed to set SCHED_DEADLINE\n", __func__);
8931 +@@ -833,7 +901,9 @@ cpufreq_governor_init(schedutil_gov);
8932 + #ifdef CONFIG_ENERGY_MODEL
8933 + static void rebuild_sd_workfn(struct work_struct *work)
8934 + {
8935 ++#ifndef CONFIG_SCHED_ALT
8936 + rebuild_sched_domains_energy();
8937 ++#endif /* CONFIG_SCHED_ALT */
8938 + }
8939 + static DECLARE_WORK(rebuild_sd_work, rebuild_sd_workfn);
8940 +
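
[Editorial note, not part of the patch] The CONFIG_SCHED_ALT path above replaces PELT-driven utilization with a 32-bit busy/idle history: each bit covers one LOAD_BLOCK (time >> 17, roughly 131 us with the nanosecond runqueue clock), and sugov_get_util() turns the top eight bits of that word into a 0..255 factor applied to CPU capacity. A standalone sketch of that last step; the capacity value is an assumption (1024 is the usual SCHED_CAPACITY_SCALE):

#include <stdio.h>
#include <stdint.h>

#define SG_CPU_LOAD_HISTORY_BITS  32    /* 32-bit history word, as in the patch */
#define SG_CPU_UTIL_SHIFT          8
#define SG_CPU_LOAD_HISTORY_SHIFT (SG_CPU_LOAD_HISTORY_BITS - 1 - SG_CPU_UTIL_SHIFT)
#define SG_CPU_LOAD_HISTORY_TO_UTIL(l) (((l) >> SG_CPU_LOAD_HISTORY_SHIFT) & 0xff)

int main(void)
{
        /* Top ten history bits set: the CPU was busy in the ten newest blocks. */
        uint32_t history = 0xffc00000u;
        unsigned long max = 1024;       /* assumed arch_scale_cpu_capacity() */
        unsigned long util;

        util = SG_CPU_LOAD_HISTORY_TO_UTIL(history) * (max >> SG_CPU_UTIL_SHIFT);
        printf("history=0x%08x -> util=%lu of %lu\n", (unsigned)history, util, max);
        return 0;
}

sugov_cpu_load_update() maintains the word by, roughly, shifting it right once per elapsed block and recording whether the runqueue was busy, so no per-entity load tracking is needed under SCHED_ALT.
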
8941 +diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
8942 +index 872e481d5098..f920c8b48ec1 100644
8943 +--- a/kernel/sched/cputime.c
8944 ++++ b/kernel/sched/cputime.c
8945 +@@ -123,7 +123,7 @@ void account_user_time(struct task_struct *p, u64 cputime)
8946 + p->utime += cputime;
8947 + account_group_user_time(p, cputime);
8948 +
8949 +- index = (task_nice(p) > 0) ? CPUTIME_NICE : CPUTIME_USER;
8950 ++ index = task_running_nice(p) ? CPUTIME_NICE : CPUTIME_USER;
8951 +
8952 + /* Add user time to cpustat. */
8953 + task_group_account_field(p, index, cputime);
8954 +@@ -147,7 +147,7 @@ void account_guest_time(struct task_struct *p, u64 cputime)
8955 + p->gtime += cputime;
8956 +
8957 + /* Add guest time to cpustat. */
8958 +- if (task_nice(p) > 0) {
8959 ++ if (task_running_nice(p)) {
8960 + cpustat[CPUTIME_NICE] += cputime;
8961 + cpustat[CPUTIME_GUEST_NICE] += cputime;
8962 + } else {
8963 +@@ -270,7 +270,7 @@ static inline u64 account_other_time(u64 max)
8964 + #ifdef CONFIG_64BIT
8965 + static inline u64 read_sum_exec_runtime(struct task_struct *t)
8966 + {
8967 +- return t->se.sum_exec_runtime;
8968 ++ return tsk_seruntime(t);
8969 + }
8970 + #else
8971 + static u64 read_sum_exec_runtime(struct task_struct *t)
8972 +@@ -280,7 +280,7 @@ static u64 read_sum_exec_runtime(struct task_struct *t)
8973 + struct rq *rq;
8974 +
8975 + rq = task_rq_lock(t, &rf);
8976 +- ns = t->se.sum_exec_runtime;
8977 ++ ns = tsk_seruntime(t);
8978 + task_rq_unlock(rq, t, &rf);
8979 +
8980 + return ns;
8981 +@@ -612,7 +612,7 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
8982 + void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
8983 + {
8984 + struct task_cputime cputime = {
8985 +- .sum_exec_runtime = p->se.sum_exec_runtime,
8986 ++ .sum_exec_runtime = tsk_seruntime(p),
8987 + };
8988 +
8989 + task_cputime(p, &cputime.utime, &cputime.stime);
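
[Editorial note, not part of the patch] The account_user_time()/account_guest_time() hunks above change how time is binned: mainline uses the nice value alone, while with CONFIG_SCHED_ALT the task_running_nice() helpers defined earlier decide between CPUTIME_USER and CPUTIME_NICE. A small standalone comparison for the BMQ variant; MAX_PRIORITY_ADJ is again a placeholder:

#include <stdio.h>

#define DEFAULT_PRIO       120  /* mainline: nice 0 */
#define MAX_PRIORITY_ADJ     4  /* placeholder, illustration only */

static int mainline_nice_bucket(int nice)
{
        return nice > 0;                        /* task_nice(p) > 0 */
}

static int bmq_nice_bucket(int prio, int boost)
{
        return prio + boost > DEFAULT_PRIO + MAX_PRIORITY_ADJ;
}

int main(void)
{
        /* A nice +1 task (prio 121). */
        printf("mainline:             %s\n",
               mainline_nice_bucket(1) ? "CPUTIME_NICE" : "CPUTIME_USER");
        printf("BMQ, fully boosted:   %s\n",
               bmq_nice_bucket(121, -MAX_PRIORITY_ADJ) ? "CPUTIME_NICE" : "CPUTIME_USER");
        printf("BMQ, fully deboosted: %s\n",
               bmq_nice_bucket(121, +MAX_PRIORITY_ADJ) ? "CPUTIME_NICE" : "CPUTIME_USER");
        return 0;
}

With these placeholder numbers a nice +1 task only lands in the nice bucket once fully deboosted; the exact crossover depends on the real MAX_PRIORITY_ADJ.
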
8990 +diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
8991 +index 0c5ec2776ddf..e3f4fe3f6e2c 100644
8992 +--- a/kernel/sched/debug.c
8993 ++++ b/kernel/sched/debug.c
8994 +@@ -8,6 +8,7 @@
8995 + */
8996 + #include "sched.h"
8997 +
8998 ++#ifndef CONFIG_SCHED_ALT
8999 + /*
9000 + * This allows printing both to /proc/sched_debug and
9001 + * to the console
9002 +@@ -210,6 +211,7 @@ static const struct file_operations sched_scaling_fops = {
9003 + };
9004 +
9005 + #endif /* SMP */
9006 ++#endif /* !CONFIG_SCHED_ALT */
9007 +
9008 + #ifdef CONFIG_PREEMPT_DYNAMIC
9009 +
9010 +@@ -273,6 +275,7 @@ static const struct file_operations sched_dynamic_fops = {
9011 +
9012 + #endif /* CONFIG_PREEMPT_DYNAMIC */
9013 +
9014 ++#ifndef CONFIG_SCHED_ALT
9015 + __read_mostly bool sched_debug_verbose;
9016 +
9017 + static const struct seq_operations sched_debug_sops;
9018 +@@ -288,6 +291,7 @@ static const struct file_operations sched_debug_fops = {
9019 + .llseek = seq_lseek,
9020 + .release = seq_release,
9021 + };
9022 ++#endif /* !CONFIG_SCHED_ALT */
9023 +
9024 + static struct dentry *debugfs_sched;
9025 +
9026 +@@ -297,12 +301,15 @@ static __init int sched_init_debug(void)
9027 +
9028 + debugfs_sched = debugfs_create_dir("sched", NULL);
9029 +
9030 ++#ifndef CONFIG_SCHED_ALT
9031 + debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
9032 + debugfs_create_bool("verbose", 0644, debugfs_sched, &sched_debug_verbose);
9033 ++#endif /* !CONFIG_SCHED_ALT */
9034 + #ifdef CONFIG_PREEMPT_DYNAMIC
9035 + debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
9036 + #endif
9037 +
9038 ++#ifndef CONFIG_SCHED_ALT
9039 + debugfs_create_u32("latency_ns", 0644, debugfs_sched, &sysctl_sched_latency);
9040 + debugfs_create_u32("min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_min_granularity);
9041 + debugfs_create_u32("wakeup_granularity_ns", 0644, debugfs_sched, &sysctl_sched_wakeup_granularity);
9042 +@@ -330,11 +337,13 @@ static __init int sched_init_debug(void)
9043 + #endif
9044 +
9045 + debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
9046 ++#endif /* !CONFIG_SCHED_ALT */
9047 +
9048 + return 0;
9049 + }
9050 + late_initcall(sched_init_debug);
9051 +
9052 ++#ifndef CONFIG_SCHED_ALT
9053 + #ifdef CONFIG_SMP
9054 +
9055 + static cpumask_var_t sd_sysctl_cpus;
9056 +@@ -1047,6 +1056,7 @@ void proc_sched_set_task(struct task_struct *p)
9057 + memset(&p->se.statistics, 0, sizeof(p->se.statistics));
9058 + #endif
9059 + }
9060 ++#endif /* !CONFIG_SCHED_ALT */
9061 +
9062 + void resched_latency_warn(int cpu, u64 latency)
9063 + {
9064 +diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
9065 +index 912b47aa99d8..7f6b13883c2a 100644
9066 +--- a/kernel/sched/idle.c
9067 ++++ b/kernel/sched/idle.c
9068 +@@ -403,6 +403,7 @@ void cpu_startup_entry(enum cpuhp_state state)
9069 + do_idle();
9070 + }
9071 +
9072 ++#ifndef CONFIG_SCHED_ALT
9073 + /*
9074 + * idle-task scheduling class.
9075 + */
9076 +@@ -525,3 +526,4 @@ DEFINE_SCHED_CLASS(idle) = {
9077 + .switched_to = switched_to_idle,
9078 + .update_curr = update_curr_idle,
9079 + };
9080 ++#endif
9081 +diff --git a/kernel/sched/pds.h b/kernel/sched/pds.h
9082 +new file mode 100644
9083 +index 000000000000..0f1f0d708b77
9084 +--- /dev/null
9085 ++++ b/kernel/sched/pds.h
9086 +@@ -0,0 +1,127 @@
9087 ++#define ALT_SCHED_VERSION_MSG "sched/pds: PDS CPU Scheduler "ALT_SCHED_VERSION" by Alfred Chen.\n"
9088 ++
9089 ++static int sched_timeslice_shift = 22;
9090 ++
9091 ++#define NORMAL_PRIO_MOD(x) ((x) & (NORMAL_PRIO_NUM - 1))
9092 ++
9093 ++/*
9094 ++ * Common interfaces
9095 ++ */
9096 ++static inline void sched_timeslice_imp(const int timeslice_ms)
9097 ++{
9098 ++ if (2 == timeslice_ms)
9099 ++ sched_timeslice_shift = 21;
9100 ++}
9101 ++
9102 ++static inline int
9103 ++task_sched_prio_normal(const struct task_struct *p, const struct rq *rq)
9104 ++{
9105 ++ s64 delta = p->deadline - rq->time_edge + NORMAL_PRIO_NUM - NICE_WIDTH;
9106 ++
9107 ++ if (WARN_ONCE(delta > NORMAL_PRIO_NUM - 1,
9108 ++ "pds: task_sched_prio_normal() delta %lld\n", delta))
9109 ++ return NORMAL_PRIO_NUM - 1;
9110 ++
9111 ++ return (delta < 0) ? 0 : delta;
9112 ++}
9113 ++
9114 ++static inline int task_sched_prio(const struct task_struct *p)
9115 ++{
9116 ++ return (p->prio < MAX_RT_PRIO) ? p->prio :
9117 ++ MIN_NORMAL_PRIO + task_sched_prio_normal(p, task_rq(p));
9118 ++}
9119 ++
9120 ++static inline int
9121 ++task_sched_prio_idx(const struct task_struct *p, const struct rq *rq)
9122 ++{
9123 ++ return (p->prio < MAX_RT_PRIO) ? p->prio : MIN_NORMAL_PRIO +
9124 ++ NORMAL_PRIO_MOD(task_sched_prio_normal(p, rq) + rq->time_edge);
9125 ++}
9126 ++
9127 ++static inline int sched_prio2idx(int prio, struct rq *rq)
9128 ++{
9129 ++ return (IDLE_TASK_SCHED_PRIO == prio || prio < MAX_RT_PRIO) ? prio :
9130 ++ MIN_NORMAL_PRIO + NORMAL_PRIO_MOD((prio - MIN_NORMAL_PRIO) +
9131 ++ rq->time_edge);
9132 ++}
9133 ++
9134 ++static inline int sched_idx2prio(int idx, struct rq *rq)
9135 ++{
9136 ++ return (idx < MAX_RT_PRIO) ? idx : MIN_NORMAL_PRIO +
9137 ++ NORMAL_PRIO_MOD((idx - MIN_NORMAL_PRIO) + NORMAL_PRIO_NUM -
9138 ++ NORMAL_PRIO_MOD(rq->time_edge));
9139 ++}
9140 ++
9141 ++static inline void sched_renew_deadline(struct task_struct *p, const struct rq *rq)
9142 ++{
9143 ++ if (p->prio >= MAX_RT_PRIO)
9144 ++ p->deadline = (rq->clock >> sched_timeslice_shift) +
9145 ++ p->static_prio - (MAX_PRIO - NICE_WIDTH);
9146 ++}
9147 ++
9148 ++int task_running_nice(struct task_struct *p)
9149 ++{
9150 ++ return (p->prio > DEFAULT_PRIO);
9151 ++}
9152 ++
9153 ++static inline void update_rq_time_edge(struct rq *rq)
9154 ++{
9155 ++ struct list_head head;
9156 ++ u64 old = rq->time_edge;
9157 ++ u64 now = rq->clock >> sched_timeslice_shift;
9158 ++ u64 prio, delta;
9159 ++
9160 ++ if (now == old)
9161 ++ return;
9162 ++
9163 ++ delta = min_t(u64, NORMAL_PRIO_NUM, now - old);
9164 ++ INIT_LIST_HEAD(&head);
9165 ++
9166 ++ for_each_set_bit(prio, &rq->queue.bitmap[2], delta)
9167 ++ list_splice_tail_init(rq->queue.heads + MIN_NORMAL_PRIO +
9168 ++ NORMAL_PRIO_MOD(prio + old), &head);
9169 ++
9170 ++ rq->queue.bitmap[2] = (NORMAL_PRIO_NUM == delta) ? 0UL :
9171 ++ rq->queue.bitmap[2] >> delta;
9172 ++ rq->time_edge = now;
9173 ++ if (!list_empty(&head)) {
9174 ++ u64 idx = MIN_NORMAL_PRIO + NORMAL_PRIO_MOD(now);
9175 ++ struct task_struct *p;
9176 ++
9177 ++ list_for_each_entry(p, &head, sq_node)
9178 ++ p->sq_idx = idx;
9179 ++
9180 ++ list_splice(&head, rq->queue.heads + idx);
9181 ++ rq->queue.bitmap[2] |= 1UL;
9182 ++ }
9183 ++}
9184 ++
9185 ++static inline void time_slice_expired(struct task_struct *p, struct rq *rq)
9186 ++{
9187 ++ p->time_slice = sched_timeslice_ns;
9188 ++ sched_renew_deadline(p, rq);
9189 ++ if (SCHED_FIFO != p->policy && task_on_rq_queued(p))
9190 ++ requeue_task(p, rq);
9191 ++}
9192 ++
9193 ++static inline void sched_task_sanity_check(struct task_struct *p, struct rq *rq)
9194 ++{
9195 ++ u64 max_dl = rq->time_edge + NICE_WIDTH - 1;
9196 ++ if (unlikely(p->deadline > max_dl))
9197 ++ p->deadline = max_dl;
9198 ++}
9199 ++
9200 ++static void sched_task_fork(struct task_struct *p, struct rq *rq)
9201 ++{
9202 ++ sched_renew_deadline(p, rq);
9203 ++}
9204 ++
9205 ++static inline void do_sched_yield_type_1(struct task_struct *p, struct rq *rq)
9206 ++{
9207 ++ time_slice_expired(p, rq);
9208 ++}
9209 ++
9210 ++#ifdef CONFIG_SMP
9211 ++static inline void sched_task_ttwu(struct task_struct *p) {}
9212 ++#endif
9213 ++static inline void sched_task_deactivate(struct task_struct *p, struct rq *rq) {}
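
[Editorial note, not part of the patch] PDS replaces BMQ's boost with a virtual deadline: sched_renew_deadline() above stamps each normal task with (rq clock >> sched_timeslice_shift) plus a nice-derived offset, and task_sched_prio_normal() turns the distance from rq->time_edge into a queue slot. A standalone sketch of that arithmetic; MAX_PRIO and NICE_WIDTH are the mainline constants, NORMAL_PRIO_NUM is a placeholder value:

#include <stdio.h>
#include <stdint.h>

#define MAX_PRIO        140     /* mainline */
#define NICE_WIDTH       40     /* mainline */
#define NORMAL_PRIO_NUM  64     /* placeholder, illustration only */

static int sched_timeslice_shift = 22; /* 2^22 ns ~= 4.2 ms, see sched_timeslice_imp() */

static uint64_t renew_deadline(uint64_t clock_ns, int static_prio)
{
        return (clock_ns >> sched_timeslice_shift) + static_prio - (MAX_PRIO - NICE_WIDTH);
}

static int prio_normal(uint64_t deadline, uint64_t time_edge)
{
        int64_t delta = (int64_t)deadline - (int64_t)time_edge
                        + NORMAL_PRIO_NUM - NICE_WIDTH;

        if (delta < 0)
                return 0;
        if (delta > NORMAL_PRIO_NUM - 1)
                return NORMAL_PRIO_NUM - 1;     /* the WARN_ONCE() case above */
        return (int)delta;
}

int main(void)
{
        uint64_t clock = 1ULL << 30;                     /* ~1.07 s of rq clock */
        uint64_t edge  = clock >> sched_timeslice_shift; /* rq->time_edge */

        printf("nice   0 (static_prio 120): slot %d\n",
               prio_normal(renew_deadline(clock, 120), edge));
        printf("nice +19 (static_prio 139): slot %d\n",
               prio_normal(renew_deadline(clock, 139), edge));
        return 0;
}

update_rq_time_edge() then rotates the per-slot lists as time_edge advances, so a task's slot number shrinks on its own as its deadline approaches.
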
9214 +diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
9215 +index a554e3bbab2b..3e56f5e6ff5c 100644
9216 +--- a/kernel/sched/pelt.c
9217 ++++ b/kernel/sched/pelt.c
9218 +@@ -270,6 +270,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
9219 + WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
9220 + }
9221 +
9222 ++#ifndef CONFIG_SCHED_ALT
9223 + /*
9224 + * sched_entity:
9225 + *
9226 +@@ -387,8 +388,9 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
9227 +
9228 + return 0;
9229 + }
9230 ++#endif
9231 +
9232 +-#ifdef CONFIG_SCHED_THERMAL_PRESSURE
9233 ++#if defined(CONFIG_SCHED_THERMAL_PRESSURE) && !defined(CONFIG_SCHED_ALT)
9234 + /*
9235 + * thermal:
9236 + *
9237 +diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
9238 +index e06071bf3472..adf567df34d4 100644
9239 +--- a/kernel/sched/pelt.h
9240 ++++ b/kernel/sched/pelt.h
9241 +@@ -1,13 +1,15 @@
9242 + #ifdef CONFIG_SMP
9243 + #include "sched-pelt.h"
9244 +
9245 ++#ifndef CONFIG_SCHED_ALT
9246 + int __update_load_avg_blocked_se(u64 now, struct sched_entity *se);
9247 + int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se);
9248 + int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
9249 + int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
9250 + int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
9251 ++#endif
9252 +
9253 +-#ifdef CONFIG_SCHED_THERMAL_PRESSURE
9254 ++#if defined(CONFIG_SCHED_THERMAL_PRESSURE) && !defined(CONFIG_SCHED_ALT)
9255 + int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity);
9256 +
9257 + static inline u64 thermal_load_avg(struct rq *rq)
9258 +@@ -42,6 +44,7 @@ static inline u32 get_pelt_divider(struct sched_avg *avg)
9259 + return LOAD_AVG_MAX - 1024 + avg->period_contrib;
9260 + }
9261 +
9262 ++#ifndef CONFIG_SCHED_ALT
9263 + static inline void cfs_se_util_change(struct sched_avg *avg)
9264 + {
9265 + unsigned int enqueued;
9266 +@@ -153,9 +156,11 @@ static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
9267 + return rq_clock_pelt(rq_of(cfs_rq));
9268 + }
9269 + #endif
9270 ++#endif /* CONFIG_SCHED_ALT */
9271 +
9272 + #else
9273 +
9274 ++#ifndef CONFIG_SCHED_ALT
9275 + static inline int
9276 + update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
9277 + {
9278 +@@ -173,6 +178,7 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
9279 + {
9280 + return 0;
9281 + }
9282 ++#endif
9283 +
9284 + static inline int
9285 + update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
9286 +diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
9287 +index ddefb0419d7a..658c41b15d3c 100644
9288 +--- a/kernel/sched/sched.h
9289 ++++ b/kernel/sched/sched.h
9290 +@@ -2,6 +2,10 @@
9291 + /*
9292 + * Scheduler internal types and methods:
9293 + */
9294 ++#ifdef CONFIG_SCHED_ALT
9295 ++#include "alt_sched.h"
9296 ++#else
9297 ++
9298 + #include <linux/sched.h>
9299 +
9300 + #include <linux/sched/autogroup.h>
9301 +@@ -3038,3 +3042,8 @@ extern int sched_dynamic_mode(const char *str);
9302 + extern void sched_dynamic_update(int mode);
9303 + #endif
9304 +
9305 ++static inline int task_running_nice(struct task_struct *p)
9306 ++{
9307 ++ return (task_nice(p) > 0);
9308 ++}
9309 ++#endif /* !CONFIG_SCHED_ALT */
9310 +diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
9311 +index 3f93fc3b5648..528b71e144e9 100644
9312 +--- a/kernel/sched/stats.c
9313 ++++ b/kernel/sched/stats.c
9314 +@@ -22,8 +22,10 @@ static int show_schedstat(struct seq_file *seq, void *v)
9315 + } else {
9316 + struct rq *rq;
9317 + #ifdef CONFIG_SMP
9318 ++#ifndef CONFIG_SCHED_ALT
9319 + struct sched_domain *sd;
9320 + int dcount = 0;
9321 ++#endif
9322 + #endif
9323 + cpu = (unsigned long)(v - 2);
9324 + rq = cpu_rq(cpu);
9325 +@@ -40,6 +42,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
9326 + seq_printf(seq, "\n");
9327 +
9328 + #ifdef CONFIG_SMP
9329 ++#ifndef CONFIG_SCHED_ALT
9330 + /* domain-specific stats */
9331 + rcu_read_lock();
9332 + for_each_domain(cpu, sd) {
9333 +@@ -68,6 +71,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
9334 + sd->ttwu_move_balance);
9335 + }
9336 + rcu_read_unlock();
9337 ++#endif
9338 + #endif
9339 + }
9340 + return 0;
9341 +diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
9342 +index b77ad49dc14f..be9edf086412 100644
9343 +--- a/kernel/sched/topology.c
9344 ++++ b/kernel/sched/topology.c
9345 +@@ -4,6 +4,7 @@
9346 + */
9347 + #include "sched.h"
9348 +
9349 ++#ifndef CONFIG_SCHED_ALT
9350 + DEFINE_MUTEX(sched_domains_mutex);
9351 +
9352 + /* Protected by sched_domains_mutex: */
9353 +@@ -1382,8 +1383,10 @@ static void asym_cpu_capacity_scan(void)
9354 + */
9355 +
9356 + static int default_relax_domain_level = -1;
9357 ++#endif /* CONFIG_SCHED_ALT */
9358 + int sched_domain_level_max;
9359 +
9360 ++#ifndef CONFIG_SCHED_ALT
9361 + static int __init setup_relax_domain_level(char *str)
9362 + {
9363 + if (kstrtoint(str, 0, &default_relax_domain_level))
9364 +@@ -1617,6 +1620,7 @@ sd_init(struct sched_domain_topology_level *tl,
9365 +
9366 + return sd;
9367 + }
9368 ++#endif /* CONFIG_SCHED_ALT */
9369 +
9370 + /*
9371 + * Topology list, bottom-up.
9372 +@@ -1646,6 +1650,7 @@ void set_sched_topology(struct sched_domain_topology_level *tl)
9373 + sched_domain_topology = tl;
9374 + }
9375 +
9376 ++#ifndef CONFIG_SCHED_ALT
9377 + #ifdef CONFIG_NUMA
9378 +
9379 + static const struct cpumask *sd_numa_mask(int cpu)
9380 +@@ -2451,3 +2456,17 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
9381 + partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
9382 + mutex_unlock(&sched_domains_mutex);
9383 + }
9384 ++#else /* CONFIG_SCHED_ALT */
9385 ++void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
9386 ++ struct sched_domain_attr *dattr_new)
9387 ++{}
9388 ++
9389 ++#ifdef CONFIG_NUMA
9390 ++int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
9391 ++
9392 ++int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
9393 ++{
9394 ++ return best_mask_cpu(cpu, cpus);
9395 ++}
9396 ++#endif /* CONFIG_NUMA */
9397 ++#endif
9398 +diff --git a/kernel/sysctl.c b/kernel/sysctl.c
9399 +index 272f4a272f8c..1c9455c8ecf6 100644
9400 +--- a/kernel/sysctl.c
9401 ++++ b/kernel/sysctl.c
9402 +@@ -122,6 +122,10 @@ static unsigned long long_max = LONG_MAX;
9403 + static int one_hundred = 100;
9404 + static int two_hundred = 200;
9405 + static int one_thousand = 1000;
9406 ++#ifdef CONFIG_SCHED_ALT
9407 ++static int __maybe_unused zero = 0;
9408 ++extern int sched_yield_type;
9409 ++#endif
9410 + #ifdef CONFIG_PRINTK
9411 + static int ten_thousand = 10000;
9412 + #endif
9413 +@@ -1730,6 +1734,24 @@ int proc_do_static_key(struct ctl_table *table, int write,
9414 + }
9415 +
9416 + static struct ctl_table kern_table[] = {
9417 ++#ifdef CONFIG_SCHED_ALT
9418 ++/* In ALT, only "sched_schedstats" is supported */
9419 ++#ifdef CONFIG_SCHED_DEBUG
9420 ++#ifdef CONFIG_SMP
9421 ++#ifdef CONFIG_SCHEDSTATS
9422 ++ {
9423 ++ .procname = "sched_schedstats",
9424 ++ .data = NULL,
9425 ++ .maxlen = sizeof(unsigned int),
9426 ++ .mode = 0644,
9427 ++ .proc_handler = sysctl_schedstats,
9428 ++ .extra1 = SYSCTL_ZERO,
9429 ++ .extra2 = SYSCTL_ONE,
9430 ++ },
9431 ++#endif /* CONFIG_SCHEDSTATS */
9432 ++#endif /* CONFIG_SMP */
9433 ++#endif /* CONFIG_SCHED_DEBUG */
9434 ++#else /* !CONFIG_SCHED_ALT */
9435 + {
9436 + .procname = "sched_child_runs_first",
9437 + .data = &sysctl_sched_child_runs_first,
9438 +@@ -1860,6 +1882,7 @@ static struct ctl_table kern_table[] = {
9439 + .extra2 = SYSCTL_ONE,
9440 + },
9441 + #endif
9442 ++#endif /* !CONFIG_SCHED_ALT */
9443 + #ifdef CONFIG_PROVE_LOCKING
9444 + {
9445 + .procname = "prove_locking",
9446 +@@ -2436,6 +2459,17 @@ static struct ctl_table kern_table[] = {
9447 + .proc_handler = proc_dointvec,
9448 + },
9449 + #endif
9450 ++#ifdef CONFIG_SCHED_ALT
9451 ++ {
9452 ++ .procname = "yield_type",
9453 ++ .data = &sched_yield_type,
9454 ++ .maxlen = sizeof (int),
9455 ++ .mode = 0644,
9456 ++ .proc_handler = &proc_dointvec_minmax,
9457 ++ .extra1 = &zero,
9458 ++ .extra2 = &two,
9459 ++ },
9460 ++#endif
9461 + #if defined(CONFIG_S390) && defined(CONFIG_SMP)
9462 + {
9463 + .procname = "spin_retry",
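
[Editorial note, not part of the patch] The table entry above exposes sched_yield_type as /proc/sys/kernel/yield_type, clamped to 0..2 by the zero/two bounds. A minimal userspace sketch for reading it; the file only exists on kernels built with CONFIG_SCHED_ALT:

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/sys/kernel/yield_type", "r");
        int type;

        if (!f) {
                perror("yield_type (CONFIG_SCHED_ALT kernels only)");
                return 1;
        }
        if (fscanf(f, "%d", &type) == 1)
                printf("sched_yield() behaviour: %d\n", type);
        fclose(f);
        return 0;
}

Writing works the same way thanks to the 0644 mode, e.g. echo as root; proc_dointvec_minmax rejects values outside 0..2.
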
9464 +diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
9465 +index 4a66725b1d4a..cb80ed5c1f5c 100644
9466 +--- a/kernel/time/hrtimer.c
9467 ++++ b/kernel/time/hrtimer.c
9468 +@@ -1940,8 +1940,10 @@ long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode,
9469 + int ret = 0;
9470 + u64 slack;
9471 +
9472 ++#ifndef CONFIG_SCHED_ALT
9473 + slack = current->timer_slack_ns;
9474 + if (dl_task(current) || rt_task(current))
9475 ++#endif
9476 + slack = 0;
9477 +
9478 + hrtimer_init_sleeper_on_stack(&t, clockid, mode);
9479 +diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
9480 +index 517be7fd175e..de3afe8e0800 100644
9481 +--- a/kernel/time/posix-cpu-timers.c
9482 ++++ b/kernel/time/posix-cpu-timers.c
9483 +@@ -216,7 +216,7 @@ static void task_sample_cputime(struct task_struct *p, u64 *samples)
9484 + u64 stime, utime;
9485 +
9486 + task_cputime(p, &utime, &stime);
9487 +- store_samples(samples, stime, utime, p->se.sum_exec_runtime);
9488 ++ store_samples(samples, stime, utime, tsk_seruntime(p));
9489 + }
9490 +
9491 + static void proc_sample_cputime_atomic(struct task_cputime_atomic *at,
9492 +@@ -801,6 +801,7 @@ static void collect_posix_cputimers(struct posix_cputimers *pct, u64 *samples,
9493 + }
9494 + }
9495 +
9496 ++#ifndef CONFIG_SCHED_ALT
9497 + static inline void check_dl_overrun(struct task_struct *tsk)
9498 + {
9499 + if (tsk->dl.dl_overrun) {
9500 +@@ -808,6 +809,7 @@ static inline void check_dl_overrun(struct task_struct *tsk)
9501 + __group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
9502 + }
9503 + }
9504 ++#endif
9505 +
9506 + static bool check_rlimit(u64 time, u64 limit, int signo, bool rt, bool hard)
9507 + {
9508 +@@ -835,8 +837,10 @@ static void check_thread_timers(struct task_struct *tsk,
9509 + u64 samples[CPUCLOCK_MAX];
9510 + unsigned long soft;
9511 +
9512 ++#ifndef CONFIG_SCHED_ALT
9513 + if (dl_task(tsk))
9514 + check_dl_overrun(tsk);
9515 ++#endif
9516 +
9517 + if (expiry_cache_is_inactive(pct))
9518 + return;
9519 +@@ -850,7 +854,7 @@ static void check_thread_timers(struct task_struct *tsk,
9520 + soft = task_rlimit(tsk, RLIMIT_RTTIME);
9521 + if (soft != RLIM_INFINITY) {
9522 + /* Task RT timeout is accounted in jiffies. RTTIME is usec */
9523 +- unsigned long rttime = tsk->rt.timeout * (USEC_PER_SEC / HZ);
9524 ++ unsigned long rttime = tsk_rttimeout(tsk) * (USEC_PER_SEC / HZ);
9525 + unsigned long hard = task_rlimit_max(tsk, RLIMIT_RTTIME);
9526 +
9527 + /* At the hard limit, send SIGKILL. No further action. */
9528 +@@ -1086,8 +1090,10 @@ static inline bool fastpath_timer_check(struct task_struct *tsk)
9529 + return true;
9530 + }
9531 +
9532 ++#ifndef CONFIG_SCHED_ALT
9533 + if (dl_task(tsk) && tsk->dl.dl_overrun)
9534 + return true;
9535 ++#endif
9536 +
9537 + return false;
9538 + }
9539 +diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
9540 +index adf7ef194005..11c8f36e281b 100644
9541 +--- a/kernel/trace/trace_selftest.c
9542 ++++ b/kernel/trace/trace_selftest.c
9543 +@@ -1052,10 +1052,15 @@ static int trace_wakeup_test_thread(void *data)
9544 + {
9545 + /* Make this a -deadline thread */
9546 + static const struct sched_attr attr = {
9547 ++#ifdef CONFIG_SCHED_ALT
9548 ++ /* No deadline on BMQ/PDS, use RR */
9549 ++ .sched_policy = SCHED_RR,
9550 ++#else
9551 + .sched_policy = SCHED_DEADLINE,
9552 + .sched_runtime = 100000ULL,
9553 + .sched_deadline = 10000000ULL,
9554 + .sched_period = 10000000ULL
9555 ++#endif
9556 + };
9557 + struct wakeup_test_data *x = data;
9558 +
9559
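
[Editorial note, not part of the patch] The selftest hunk above falls back from SCHED_DEADLINE to SCHED_RR because BMQ/PDS provide no deadline class. For completeness, a userspace sketch of making the same request through the sched_setattr() syscall; the struct is declared locally because glibc ships no wrapper, and setting an RT policy typically needs CAP_SYS_NICE:

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <sched.h>              /* SCHED_RR */
#include <unistd.h>
#include <sys/syscall.h>

struct sched_attr_min {         /* leading fields of the UAPI struct sched_attr */
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
};

int main(void)
{
        struct sched_attr_min attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.sched_policy = SCHED_RR;   /* the BMQ/PDS-friendly fallback */
        attr.sched_priority = 1;

        if (syscall(SYS_sched_setattr, 0, &attr, 0) != 0) {
                perror("sched_setattr");
                return 1;
        }
        printf("now SCHED_RR, rt priority 1\n");
        return 0;
}
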
9560 diff --git a/5021_BMQ-and-PDS-gentoo-defaults.patch b/5021_BMQ-and-PDS-gentoo-defaults.patch
9561 new file mode 100644
9562 index 0000000..d449eec
9563 --- /dev/null
9564 +++ b/5021_BMQ-and-PDS-gentoo-defaults.patch
9565 @@ -0,0 +1,13 @@
9566 +--- a/init/Kconfig 2021-04-27 07:38:30.556467045 -0400
9567 ++++ b/init/Kconfig 2021-04-27 07:39:32.956412800 -0400
9568 +@@ -780,8 +780,9 @@ config GENERIC_SCHED_CLOCK
9569 + menu "Scheduler features"
9570 +
9571 + menuconfig SCHED_ALT
9572 ++ depends on X86_64
9573 + bool "Alternative CPU Schedulers"
9574 +- default y
9575 ++ default n
9576 + help
9577 + This feature enables alternative CPU schedulers
9578 +