Gentoo Archives: gentoo-commits

From: Mike Pagano <mpagano@g.o>
To: gentoo-commits@l.g.o
Subject: [gentoo-commits] proj/linux-patches:5.15 commit in: /
Date: Thu, 04 Nov 2021 12:22:57
Message-Id: 1636028504.412bab2012d1b669b481463fd275bbb8bb6933fb.mpagano@gentoo
1 commit: 412bab2012d1b669b481463fd275bbb8bb6933fb
2 Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
3 AuthorDate: Thu Nov 4 12:21:44 2021 +0000
4 Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
5 CommitDate: Thu Nov 4 12:21:44 2021 +0000
6 URL: https://gitweb.gentoo.org/proj/linux-patches.git/commit/?id=412bab20
7
8 Add patch for the BMQ (BitMap Queue) scheduler.
9
10 A new CPU scheduler developed from PDS (included).
11 Inspired by the scheduler in Zircon.
12
13 Signed-off-by: Mike Pagano <mpagano <AT> gentoo.org>
14
15 0000_README | 8 +
16 5020_BMQ-and-PDS-io-scheduler-v5.15-r0.patch | 9787 ++++++++++++++++++++++++++
17 5021_BMQ-and-PDS-gentoo-defaults.patch | 13 +
18 3 files changed, 9808 insertions(+)
19
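As a rough sketch of how the added options fit together (assuming a 5.15 gentoo-sources tree with 5020/5021 applied), the scheduler is gated behind the Kconfig symbols introduced below, the time slice is chosen with the sched_timeslice= boot parameter, and yield behaviour with the kernel.yield_type sysctl; all values below are taken from the documentation added in this commit:

    # Sketch only -- assumes this patch set is applied to a 5.15 kernel.
    # Kernel configuration (symbols added by this patch):
    CONFIG_SCHED_ALT=y
    CONFIG_SCHED_BMQ=y
    # Boot parameter, time slice in ms (2 or 4; default 4):
    sched_timeslice=4
    # Runtime yield behaviour (0 = no yield, 1 = deboost and requeue [default], 2 = rq skip):
    sysctl -w kernel.yield_type=1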
20 diff --git a/0000_README b/0000_README
21 index efde5c7..9bc9951 100644
22 --- a/0000_README
23 +++ b/0000_README
24 @@ -71,6 +71,14 @@ Patch: 4567_distro-Gentoo-Kconfig.patch
25 From: Tom Wijsman <TomWij@g.o>
26 Desc: Add Gentoo Linux support config settings and defaults.
27
28 +Patch: 5020_BMQ-and-PDS-io-scheduler-v5.15-r0.patch
29 +From: https://gitlab.com/alfredchen/linux-prjc
30 +Desc: BMQ(BitMap Queue) Scheduler. A new CPU scheduler developed from PDS(incld). Inspired by the scheduler in zircon.
31 +
32 +Patch: 5021_BMQ-and-PDS-gentoo-defaults.patch
33 +From: https://gitweb.gentoo.org/proj/linux-patches.git/
34 +Desc: Set defaults for BMQ. Add archs as people test, default to N
35 +
36 Patch: 5010_enable-cpu-optimizations-universal.patch
37 From: https://github.com/graysky2/kernel_compiler_patch
38 Desc: Kernel >= 5.15 patch enables gcc = v11.1+ optimizations for additional CPUs.
39
40 diff --git a/5020_BMQ-and-PDS-io-scheduler-v5.15-r0.patch b/5020_BMQ-and-PDS-io-scheduler-v5.15-r0.patch
41 new file mode 100644
42 index 0000000..1d0c322
43 --- /dev/null
44 +++ b/5020_BMQ-and-PDS-io-scheduler-v5.15-r0.patch
45 @@ -0,0 +1,9787 @@
46 +diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
47 +index 43dc35fe5bc0..0873e92ca5d0 100644
48 +--- a/Documentation/admin-guide/kernel-parameters.txt
49 ++++ b/Documentation/admin-guide/kernel-parameters.txt
50 +@@ -4985,6 +4985,12 @@
51 + sa1100ir [NET]
52 + See drivers/net/irda/sa1100_ir.c.
53 +
54 ++ sched_timeslice=
55 ++ [KNL] Time slice in ms for Project C BMQ/PDS scheduler.
56 ++ Format: integer 2, 4
57 ++ Default: 4
58 ++ See Documentation/scheduler/sched-BMQ.txt
59 ++
60 + sched_verbose [KNL] Enables verbose scheduler debug messages.
61 +
62 + schedstats= [KNL,X86] Enable or disable scheduled statistics.
63 +diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
64 +index 426162009ce9..15ac2d7e47cd 100644
65 +--- a/Documentation/admin-guide/sysctl/kernel.rst
66 ++++ b/Documentation/admin-guide/sysctl/kernel.rst
67 +@@ -1542,3 +1542,13 @@ is 10 seconds.
68 +
69 + The softlockup threshold is (``2 * watchdog_thresh``). Setting this
70 + tunable to zero will disable lockup detection altogether.
71 ++
72 ++yield_type:
73 ++===========
74 ++
75 ++BMQ/PDS CPU scheduler only. This determines what type of yield a call
76 ++to sched_yield() will perform.
77 ++
78 ++ 0 - No yield.
79 ++ 1 - Deboost and requeue task. (default)
80 ++ 2 - Set run queue skip task.
81 +diff --git a/Documentation/scheduler/sched-BMQ.txt b/Documentation/scheduler/sched-BMQ.txt
82 +new file mode 100644
83 +index 000000000000..05c84eec0f31
84 +--- /dev/null
85 ++++ b/Documentation/scheduler/sched-BMQ.txt
86 +@@ -0,0 +1,110 @@
87 ++ BitMap queue CPU Scheduler
88 ++ --------------------------
89 ++
90 ++CONTENT
91 ++========
92 ++
93 ++ Background
94 ++ Design
95 ++ Overview
96 ++ Task policy
97 ++ Priority management
98 ++ BitMap Queue
99 ++ CPU Assignment and Migration
100 ++
101 ++
102 ++Background
103 ++==========
104 ++
105 ++BitMap Queue CPU scheduler, referred to as BMQ from here on, is an evolution
106 ++of previous Priority and Deadline based Skiplist multiple queue scheduler(PDS),
107 ++and inspired by Zircon scheduler. The goal of it is to keep the scheduler code
108 ++simple, while efficiency and scalable for interactive tasks, such as desktop,
109 ++movie playback and gaming etc.
110 ++
111 ++Design
112 ++======
113 ++
114 ++Overview
115 ++--------
116 ++
117 ++BMQ uses a per-CPU run queue design: each (logical) CPU has its own run
118 ++queue and is responsible for scheduling the tasks that are put into its
119 ++run queue.
120 ++
121 ++The run queue is a set of priority queues. Note that these queues are FIFO
122 ++queues for non-rt tasks and priority queues for rt tasks; see BitMap Queue
123 ++below for details. BMQ is optimized for non-rt tasks, since most
124 ++applications are non-rt tasks. Whether a queue is FIFO or priority based,
125 ++each queue is an ordered list of runnable tasks awaiting execution and the
126 ++data structures are the same. When it is time for a new task to run, the
127 ++scheduler simply looks for the lowest numbered queue that contains a task
128 ++and runs the first task from the head of that queue. The per-CPU idle task
129 ++is also in the run queue, so the scheduler can always find a task to run
130 ++from its run queue.
131 ++
132 ++Each task is assigned the same timeslice (default 4ms) when it is picked to
133 ++start running. A task is reinserted at the end of the appropriate priority
134 ++queue when it uses up its whole timeslice. When the scheduler selects a new
135 ++task from the priority queue it sets the CPU's preemption timer for the
136 ++remainder of the previous timeslice. When that timer fires the scheduler
137 ++stops execution of that task, selects another task and starts over again.
138 ++
139 ++If a task blocks waiting for a shared resource then it's taken out of its
140 ++priority queue and is placed in a wait queue for the shared resource. When it
141 ++is unblocked it will be reinserted in the appropriate priority queue of an
142 ++eligible CPU.
143 ++
144 ++Task policy
145 ++-----------
146 ++
147 ++BMQ supports the DEADLINE, FIFO, RR, NORMAL, BATCH and IDLE task policies,
148 ++like the mainline CFS scheduler, but BMQ is heavily optimized for non-rt
149 ++tasks, that is, NORMAL/BATCH/IDLE policy tasks. Below are the implementation
150 ++details of each policy.
151 ++
152 ++DEADLINE
153 ++ It is squashed into a priority 0 FIFO task.
154 ++
155 ++FIFO/RR
156 ++ All RT tasks share one single priority queue in the BMQ run queue design.
157 ++The complexity of the insert operation is O(n). BMQ is not designed for
158 ++systems that mainly run rt policy tasks.
159 ++
160 ++NORMAL/BATCH/IDLE
161 ++ BATCH and IDLE tasks are treated as the same policy. They compete for CPU
162 ++with NORMAL policy tasks, but they just don't get boosted. To control the
163 ++priority of NORMAL/BATCH/IDLE tasks, simply use the nice level.
164 ++
165 ++ISO
166 ++ ISO policy is not supported in BMQ. Please use nice level -20 NORMAL policy
167 ++task instead.
168 ++
169 ++Priority management
170 ++-------------------
171 ++
172 ++RT tasks have priorities from 0-99. For non-rt tasks, there are three
173 ++different factors used to determine the effective priority of a task; the
174 ++effective priority is what determines which queue the task will be in.
175 ++
176 ++The first factor is simply the task's static priority, which is assigned
177 ++from the task's nice level, within [-20, 19] from userland's point of view
178 ++and [0, 39] internally.
179 ++
180 ++The second factor is the priority boost. This is a value bounded between
181 ++[-MAX_PRIORITY_ADJ, MAX_PRIORITY_ADJ] used to offset the base priority; it is
182 ++modified in the following cases:
183 ++
184 ++*When a thread has used up its entire timeslice, always deboost its boost by
185 ++increasing the value by one.
186 ++*When a thread gives up cpu control (voluntarily or involuntarily) to
187 ++reschedule, and its switch-in time (the time since it last switched in and
188 ++ran) is below the threshold based on its priority boost, boost its boost by
189 ++decreasing the value by one, but it is capped at 0 (won't go negative).
190 ++
191 ++The intent in this system is to ensure that interactive threads are serviced
192 ++quickly. These are usually the threads that interact directly with the user
193 ++and cause user-perceivable latency. These threads usually do little work and
194 ++spend most of their time blocked awaiting another user event. So they get the
195 ++priority boost from unblocking while background threads that do most of the
196 ++processing receive the priority penalty for using their entire timeslice.
197 +diff --git a/fs/proc/base.c b/fs/proc/base.c
198 +index 533d5836eb9a..5756c51c9b58 100644
199 +--- a/fs/proc/base.c
200 ++++ b/fs/proc/base.c
201 +@@ -477,7 +477,7 @@ static int proc_pid_schedstat(struct seq_file *m, struct pid_namespace *ns,
202 + seq_puts(m, "0 0 0\n");
203 + else
204 + seq_printf(m, "%llu %llu %lu\n",
205 +- (unsigned long long)task->se.sum_exec_runtime,
206 ++ (unsigned long long)tsk_seruntime(task),
207 + (unsigned long long)task->sched_info.run_delay,
208 + task->sched_info.pcount);
209 +
210 +diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
211 +index 8874f681b056..59eb72bf7d5f 100644
212 +--- a/include/asm-generic/resource.h
213 ++++ b/include/asm-generic/resource.h
214 +@@ -23,7 +23,7 @@
215 + [RLIMIT_LOCKS] = { RLIM_INFINITY, RLIM_INFINITY }, \
216 + [RLIMIT_SIGPENDING] = { 0, 0 }, \
217 + [RLIMIT_MSGQUEUE] = { MQ_BYTES_MAX, MQ_BYTES_MAX }, \
218 +- [RLIMIT_NICE] = { 0, 0 }, \
219 ++ [RLIMIT_NICE] = { 30, 30 }, \
220 + [RLIMIT_RTPRIO] = { 0, 0 }, \
221 + [RLIMIT_RTTIME] = { RLIM_INFINITY, RLIM_INFINITY }, \
222 + }
223 +diff --git a/include/linux/sched.h b/include/linux/sched.h
224 +index c1a927ddec64..a7eb91d15442 100644
225 +--- a/include/linux/sched.h
226 ++++ b/include/linux/sched.h
227 +@@ -748,12 +748,18 @@ struct task_struct {
228 + unsigned int ptrace;
229 +
230 + #ifdef CONFIG_SMP
231 +- int on_cpu;
232 + struct __call_single_node wake_entry;
233 ++#endif
234 ++#if defined(CONFIG_SMP) || defined(CONFIG_SCHED_ALT)
235 ++ int on_cpu;
236 ++#endif
237 ++
238 ++#ifdef CONFIG_SMP
239 + #ifdef CONFIG_THREAD_INFO_IN_TASK
240 + /* Current CPU: */
241 + unsigned int cpu;
242 + #endif
243 ++#ifndef CONFIG_SCHED_ALT
244 + unsigned int wakee_flips;
245 + unsigned long wakee_flip_decay_ts;
246 + struct task_struct *last_wakee;
247 +@@ -767,6 +773,7 @@ struct task_struct {
248 + */
249 + int recent_used_cpu;
250 + int wake_cpu;
251 ++#endif /* !CONFIG_SCHED_ALT */
252 + #endif
253 + int on_rq;
254 +
255 +@@ -775,6 +782,20 @@ struct task_struct {
256 + int normal_prio;
257 + unsigned int rt_priority;
258 +
259 ++#ifdef CONFIG_SCHED_ALT
260 ++ u64 last_ran;
261 ++ s64 time_slice;
262 ++ int sq_idx;
263 ++ struct list_head sq_node;
264 ++#ifdef CONFIG_SCHED_BMQ
265 ++ int boost_prio;
266 ++#endif /* CONFIG_SCHED_BMQ */
267 ++#ifdef CONFIG_SCHED_PDS
268 ++ u64 deadline;
269 ++#endif /* CONFIG_SCHED_PDS */
270 ++ /* sched_clock time spent running */
271 ++ u64 sched_time;
272 ++#else /* !CONFIG_SCHED_ALT */
273 + const struct sched_class *sched_class;
274 + struct sched_entity se;
275 + struct sched_rt_entity rt;
276 +@@ -785,6 +806,7 @@ struct task_struct {
277 + unsigned long core_cookie;
278 + unsigned int core_occupation;
279 + #endif
280 ++#endif /* !CONFIG_SCHED_ALT */
281 +
282 + #ifdef CONFIG_CGROUP_SCHED
283 + struct task_group *sched_task_group;
284 +@@ -1505,6 +1527,15 @@ struct task_struct {
285 + */
286 + };
287 +
288 ++#ifdef CONFIG_SCHED_ALT
289 ++#define tsk_seruntime(t) ((t)->sched_time)
290 ++/* replace the uncertian rt_timeout with 0UL */
291 ++#define tsk_rttimeout(t) (0UL)
292 ++#else /* CFS */
293 ++#define tsk_seruntime(t) ((t)->se.sum_exec_runtime)
294 ++#define tsk_rttimeout(t) ((t)->rt.timeout)
295 ++#endif /* !CONFIG_SCHED_ALT */
296 ++
297 + static inline struct pid *task_pid(struct task_struct *task)
298 + {
299 + return task->thread_pid;
300 +diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
301 +index 1aff00b65f3c..216fdf2fe90c 100644
302 +--- a/include/linux/sched/deadline.h
303 ++++ b/include/linux/sched/deadline.h
304 +@@ -1,5 +1,24 @@
305 + /* SPDX-License-Identifier: GPL-2.0 */
306 +
307 ++#ifdef CONFIG_SCHED_ALT
308 ++
309 ++static inline int dl_task(struct task_struct *p)
310 ++{
311 ++ return 0;
312 ++}
313 ++
314 ++#ifdef CONFIG_SCHED_BMQ
315 ++#define __tsk_deadline(p) (0UL)
316 ++#endif
317 ++
318 ++#ifdef CONFIG_SCHED_PDS
319 ++#define __tsk_deadline(p) ((((u64) ((p)->prio))<<56) | (p)->deadline)
320 ++#endif
321 ++
322 ++#else
323 ++
324 ++#define __tsk_deadline(p) ((p)->dl.deadline)
325 ++
326 + /*
327 + * SCHED_DEADLINE tasks has negative priorities, reflecting
328 + * the fact that any of them has higher prio than RT and
329 +@@ -19,6 +38,7 @@ static inline int dl_task(struct task_struct *p)
330 + {
331 + return dl_prio(p->prio);
332 + }
333 ++#endif /* CONFIG_SCHED_ALT */
334 +
335 + static inline bool dl_time_before(u64 a, u64 b)
336 + {
337 +diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h
338 +index ab83d85e1183..6af9ae681116 100644
339 +--- a/include/linux/sched/prio.h
340 ++++ b/include/linux/sched/prio.h
341 +@@ -18,6 +18,32 @@
342 + #define MAX_PRIO (MAX_RT_PRIO + NICE_WIDTH)
343 + #define DEFAULT_PRIO (MAX_RT_PRIO + NICE_WIDTH / 2)
344 +
345 ++#ifdef CONFIG_SCHED_ALT
346 ++
347 ++/* Undefine MAX_PRIO and DEFAULT_PRIO */
348 ++#undef MAX_PRIO
349 ++#undef DEFAULT_PRIO
350 ++
351 ++/* +/- priority levels from the base priority */
352 ++#ifdef CONFIG_SCHED_BMQ
353 ++#define MAX_PRIORITY_ADJ (7)
354 ++
355 ++#define MIN_NORMAL_PRIO (MAX_RT_PRIO)
356 ++#define MAX_PRIO (MIN_NORMAL_PRIO + NICE_WIDTH)
357 ++#define DEFAULT_PRIO (MIN_NORMAL_PRIO + NICE_WIDTH / 2)
358 ++#endif
359 ++
360 ++#ifdef CONFIG_SCHED_PDS
361 ++#define MAX_PRIORITY_ADJ (0)
362 ++
363 ++#define MIN_NORMAL_PRIO (128)
364 ++#define NORMAL_PRIO_NUM (64)
365 ++#define MAX_PRIO (MIN_NORMAL_PRIO + NORMAL_PRIO_NUM)
366 ++#define DEFAULT_PRIO (MAX_PRIO - NICE_WIDTH / 2)
367 ++#endif
368 ++
369 ++#endif /* CONFIG_SCHED_ALT */
370 ++
371 + /*
372 + * Convert user-nice values [ -20 ... 0 ... 19 ]
373 + * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
374 +diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
375 +index e5af028c08b4..0a7565d0d3cf 100644
376 +--- a/include/linux/sched/rt.h
377 ++++ b/include/linux/sched/rt.h
378 +@@ -24,8 +24,10 @@ static inline bool task_is_realtime(struct task_struct *tsk)
379 +
380 + if (policy == SCHED_FIFO || policy == SCHED_RR)
381 + return true;
382 ++#ifndef CONFIG_SCHED_ALT
383 + if (policy == SCHED_DEADLINE)
384 + return true;
385 ++#endif
386 + return false;
387 + }
388 +
389 +diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
390 +index 8f0f778b7c91..991f2280475b 100644
391 +--- a/include/linux/sched/topology.h
392 ++++ b/include/linux/sched/topology.h
393 +@@ -225,7 +225,8 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
394 +
395 + #endif /* !CONFIG_SMP */
396 +
397 +-#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
398 ++#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) && \
399 ++ !defined(CONFIG_SCHED_ALT)
400 + extern void rebuild_sched_domains_energy(void);
401 + #else
402 + static inline void rebuild_sched_domains_energy(void)
403 +diff --git a/init/Kconfig b/init/Kconfig
404 +index 11f8a845f259..c8e82fcafb9e 100644
405 +--- a/init/Kconfig
406 ++++ b/init/Kconfig
407 +@@ -814,9 +814,39 @@ config GENERIC_SCHED_CLOCK
408 +
409 + menu "Scheduler features"
410 +
411 ++menuconfig SCHED_ALT
412 ++ bool "Alternative CPU Schedulers"
413 ++ default y
414 ++ help
415 ++ This feature enables the alternative CPU schedulers.
416 ++
417 ++if SCHED_ALT
418 ++
419 ++choice
420 ++ prompt "Alternative CPU Scheduler"
421 ++ default SCHED_BMQ
422 ++
423 ++config SCHED_BMQ
424 ++ bool "BMQ CPU scheduler"
425 ++ help
426 ++ The BitMap Queue CPU scheduler for excellent interactivity and
427 ++ responsiveness on the desktop and solid scalability on normal
428 ++ hardware and commodity servers.
429 ++
430 ++config SCHED_PDS
431 ++ bool "PDS CPU scheduler"
432 ++ help
433 ++ The Priority and Deadline based Skip list multiple queue CPU
434 ++ Scheduler.
435 ++
436 ++endchoice
437 ++
438 ++endif
439 ++
440 + config UCLAMP_TASK
441 + bool "Enable utilization clamping for RT/FAIR tasks"
442 + depends on CPU_FREQ_GOV_SCHEDUTIL
443 ++ depends on !SCHED_ALT
444 + help
445 + This feature enables the scheduler to track the clamped utilization
446 + of each CPU based on RUNNABLE tasks scheduled on that CPU.
447 +@@ -902,6 +932,7 @@ config NUMA_BALANCING
448 + depends on ARCH_SUPPORTS_NUMA_BALANCING
449 + depends on !ARCH_WANT_NUMA_VARIABLE_LOCALITY
450 + depends on SMP && NUMA && MIGRATION
451 ++ depends on !SCHED_ALT
452 + help
453 + This option adds support for automatic NUMA aware memory/task placement.
454 + The mechanism is quite primitive and is based on migrating memory when
455 +@@ -994,6 +1025,7 @@ config FAIR_GROUP_SCHED
456 + depends on CGROUP_SCHED
457 + default CGROUP_SCHED
458 +
459 ++if !SCHED_ALT
460 + config CFS_BANDWIDTH
461 + bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
462 + depends on FAIR_GROUP_SCHED
463 +@@ -1016,6 +1048,7 @@ config RT_GROUP_SCHED
464 + realtime bandwidth for them.
465 + See Documentation/scheduler/sched-rt-group.rst for more information.
466 +
467 ++endif #!SCHED_ALT
468 + endif #CGROUP_SCHED
469 +
470 + config UCLAMP_TASK_GROUP
471 +@@ -1259,6 +1292,7 @@ config CHECKPOINT_RESTORE
472 +
473 + config SCHED_AUTOGROUP
474 + bool "Automatic process group scheduling"
475 ++ depends on !SCHED_ALT
476 + select CGROUPS
477 + select CGROUP_SCHED
478 + select FAIR_GROUP_SCHED
479 +diff --git a/init/init_task.c b/init/init_task.c
480 +index 2d024066e27b..49f706df0904 100644
481 +--- a/init/init_task.c
482 ++++ b/init/init_task.c
483 +@@ -75,9 +75,15 @@ struct task_struct init_task
484 + .stack = init_stack,
485 + .usage = REFCOUNT_INIT(2),
486 + .flags = PF_KTHREAD,
487 ++#ifdef CONFIG_SCHED_ALT
488 ++ .prio = DEFAULT_PRIO + MAX_PRIORITY_ADJ,
489 ++ .static_prio = DEFAULT_PRIO,
490 ++ .normal_prio = DEFAULT_PRIO + MAX_PRIORITY_ADJ,
491 ++#else
492 + .prio = MAX_PRIO - 20,
493 + .static_prio = MAX_PRIO - 20,
494 + .normal_prio = MAX_PRIO - 20,
495 ++#endif
496 + .policy = SCHED_NORMAL,
497 + .cpus_ptr = &init_task.cpus_mask,
498 + .user_cpus_ptr = NULL,
499 +@@ -88,6 +94,17 @@ struct task_struct init_task
500 + .restart_block = {
501 + .fn = do_no_restart_syscall,
502 + },
503 ++#ifdef CONFIG_SCHED_ALT
504 ++ .sq_node = LIST_HEAD_INIT(init_task.sq_node),
505 ++#ifdef CONFIG_SCHED_BMQ
506 ++ .boost_prio = 0,
507 ++ .sq_idx = 15,
508 ++#endif
509 ++#ifdef CONFIG_SCHED_PDS
510 ++ .deadline = 0,
511 ++#endif
512 ++ .time_slice = HZ,
513 ++#else
514 + .se = {
515 + .group_node = LIST_HEAD_INIT(init_task.se.group_node),
516 + },
517 +@@ -95,6 +112,7 @@ struct task_struct init_task
518 + .run_list = LIST_HEAD_INIT(init_task.rt.run_list),
519 + .time_slice = RR_TIMESLICE,
520 + },
521 ++#endif
522 + .tasks = LIST_HEAD_INIT(init_task.tasks),
523 + #ifdef CONFIG_SMP
524 + .pushable_tasks = PLIST_NODE_INIT(init_task.pushable_tasks, MAX_PRIO),
525 +diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
526 +index 5876e30c5740..7594d0a31869 100644
527 +--- a/kernel/Kconfig.preempt
528 ++++ b/kernel/Kconfig.preempt
529 +@@ -102,7 +102,7 @@ config PREEMPT_DYNAMIC
530 +
531 + config SCHED_CORE
532 + bool "Core Scheduling for SMT"
533 +- depends on SCHED_SMT
534 ++ depends on SCHED_SMT && !SCHED_ALT
535 + help
536 + This option permits Core Scheduling, a means of coordinated task
537 + selection across SMT siblings. When enabled -- see
538 +diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
539 +index 2a9695ccb65f..292112c267b8 100644
540 +--- a/kernel/cgroup/cpuset.c
541 ++++ b/kernel/cgroup/cpuset.c
542 +@@ -664,7 +664,7 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
543 + return ret;
544 + }
545 +
546 +-#ifdef CONFIG_SMP
547 ++#if defined(CONFIG_SMP) && !defined(CONFIG_SCHED_ALT)
548 + /*
549 + * Helper routine for generate_sched_domains().
550 + * Do cpusets a, b have overlapping effective cpus_allowed masks?
551 +@@ -1060,7 +1060,7 @@ static void rebuild_sched_domains_locked(void)
552 + /* Have scheduler rebuild the domains */
553 + partition_and_rebuild_sched_domains(ndoms, doms, attr);
554 + }
555 +-#else /* !CONFIG_SMP */
556 ++#else /* !CONFIG_SMP || CONFIG_SCHED_ALT */
557 + static void rebuild_sched_domains_locked(void)
558 + {
559 + }
560 +diff --git a/kernel/delayacct.c b/kernel/delayacct.c
561 +index 51530d5b15a8..e542d71bb94b 100644
562 +--- a/kernel/delayacct.c
563 ++++ b/kernel/delayacct.c
564 +@@ -139,7 +139,7 @@ int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
565 + */
566 + t1 = tsk->sched_info.pcount;
567 + t2 = tsk->sched_info.run_delay;
568 +- t3 = tsk->se.sum_exec_runtime;
569 ++ t3 = tsk_seruntime(tsk);
570 +
571 + d->cpu_count += t1;
572 +
573 +diff --git a/kernel/exit.c b/kernel/exit.c
574 +index 91a43e57a32e..4b157befc10c 100644
575 +--- a/kernel/exit.c
576 ++++ b/kernel/exit.c
577 +@@ -122,7 +122,7 @@ static void __exit_signal(struct task_struct *tsk)
578 + sig->curr_target = next_thread(tsk);
579 + }
580 +
581 +- add_device_randomness((const void*) &tsk->se.sum_exec_runtime,
582 ++ add_device_randomness((const void*) &tsk_seruntime(tsk),
583 + sizeof(unsigned long long));
584 +
585 + /*
586 +@@ -143,7 +143,7 @@ static void __exit_signal(struct task_struct *tsk)
587 + sig->inblock += task_io_get_inblock(tsk);
588 + sig->oublock += task_io_get_oublock(tsk);
589 + task_io_accounting_add(&sig->ioac, &tsk->ioac);
590 +- sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
591 ++ sig->sum_sched_runtime += tsk_seruntime(tsk);
592 + sig->nr_threads--;
593 + __unhash_process(tsk, group_dead);
594 + write_sequnlock(&sig->stats_lock);
595 +diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
596 +index 291b857a6e20..f3480cdb7497 100644
597 +--- a/kernel/livepatch/transition.c
598 ++++ b/kernel/livepatch/transition.c
599 +@@ -307,7 +307,11 @@ static bool klp_try_switch_task(struct task_struct *task)
600 + */
601 + rq = task_rq_lock(task, &flags);
602 +
603 ++#ifdef CONFIG_SCHED_ALT
604 ++ if (task_running(task) && task != current) {
605 ++#else
606 + if (task_running(rq, task) && task != current) {
607 ++#endif
608 + snprintf(err_buf, STACK_ERR_BUF_SIZE,
609 + "%s: %s:%d is running\n", __func__, task->comm,
610 + task->pid);
611 +diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
612 +index 6bb116c559b4..d4c8168a8270 100644
613 +--- a/kernel/locking/rtmutex.c
614 ++++ b/kernel/locking/rtmutex.c
615 +@@ -298,21 +298,25 @@ static __always_inline void
616 + waiter_update_prio(struct rt_mutex_waiter *waiter, struct task_struct *task)
617 + {
618 + waiter->prio = __waiter_prio(task);
619 +- waiter->deadline = task->dl.deadline;
620 ++ waiter->deadline = __tsk_deadline(task);
621 + }
622 +
623 + /*
624 + * Only use with rt_mutex_waiter_{less,equal}()
625 + */
626 + #define task_to_waiter(p) \
627 +- &(struct rt_mutex_waiter){ .prio = __waiter_prio(p), .deadline = (p)->dl.deadline }
628 ++ &(struct rt_mutex_waiter){ .prio = __waiter_prio(p), .deadline = __tsk_deadline(p) }
629 +
630 + static __always_inline int rt_mutex_waiter_less(struct rt_mutex_waiter *left,
631 + struct rt_mutex_waiter *right)
632 + {
633 ++#ifdef CONFIG_SCHED_PDS
634 ++ return (left->deadline < right->deadline);
635 ++#else
636 + if (left->prio < right->prio)
637 + return 1;
638 +
639 ++#ifndef CONFIG_SCHED_BMQ
640 + /*
641 + * If both waiters have dl_prio(), we check the deadlines of the
642 + * associated tasks.
643 +@@ -321,16 +325,22 @@ static __always_inline int rt_mutex_waiter_less(struct rt_mutex_waiter *left,
644 + */
645 + if (dl_prio(left->prio))
646 + return dl_time_before(left->deadline, right->deadline);
647 ++#endif
648 +
649 + return 0;
650 ++#endif
651 + }
652 +
653 + static __always_inline int rt_mutex_waiter_equal(struct rt_mutex_waiter *left,
654 + struct rt_mutex_waiter *right)
655 + {
656 ++#ifdef CONFIG_SCHED_PDS
657 ++ return (left->deadline == right->deadline);
658 ++#else
659 + if (left->prio != right->prio)
660 + return 0;
661 +
662 ++#ifndef CONFIG_SCHED_BMQ
663 + /*
664 + * If both waiters have dl_prio(), we check the deadlines of the
665 + * associated tasks.
666 +@@ -339,8 +349,10 @@ static __always_inline int rt_mutex_waiter_equal(struct rt_mutex_waiter *left,
667 + */
668 + if (dl_prio(left->prio))
669 + return left->deadline == right->deadline;
670 ++#endif
671 +
672 + return 1;
673 ++#endif
674 + }
675 +
676 + static inline bool rt_mutex_steal(struct rt_mutex_waiter *waiter,
677 +diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
678 +index 978fcfca5871..0425ee149b4d 100644
679 +--- a/kernel/sched/Makefile
680 ++++ b/kernel/sched/Makefile
681 +@@ -22,14 +22,21 @@ ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
682 + CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer
683 + endif
684 +
685 +-obj-y += core.o loadavg.o clock.o cputime.o
686 +-obj-y += idle.o fair.o rt.o deadline.o
687 +-obj-y += wait.o wait_bit.o swait.o completion.o
688 +-
689 +-obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
690 ++ifdef CONFIG_SCHED_ALT
691 ++obj-y += alt_core.o
692 ++obj-$(CONFIG_SCHED_DEBUG) += alt_debug.o
693 ++else
694 ++obj-y += core.o
695 ++obj-y += fair.o rt.o deadline.o
696 ++obj-$(CONFIG_SMP) += cpudeadline.o stop_task.o
697 + obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
698 +-obj-$(CONFIG_SCHEDSTATS) += stats.o
699 ++endif
700 + obj-$(CONFIG_SCHED_DEBUG) += debug.o
701 ++obj-y += loadavg.o clock.o cputime.o
702 ++obj-y += idle.o
703 ++obj-y += wait.o wait_bit.o swait.o completion.o
704 ++obj-$(CONFIG_SMP) += cpupri.o pelt.o topology.o
705 ++obj-$(CONFIG_SCHEDSTATS) += stats.o
706 + obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
707 + obj-$(CONFIG_CPU_FREQ) += cpufreq.o
708 + obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
709 +diff --git a/kernel/sched/alt_core.c b/kernel/sched/alt_core.c
710 +new file mode 100644
711 +index 000000000000..9576c57f82da
712 +--- /dev/null
713 ++++ b/kernel/sched/alt_core.c
714 +@@ -0,0 +1,7626 @@
715 ++/*
716 ++ * kernel/sched/alt_core.c
717 ++ *
718 ++ * Core alternative kernel scheduler code and related syscalls
719 ++ *
720 ++ * Copyright (C) 1991-2002 Linus Torvalds
721 ++ *
722 ++ * 2009-08-13 Brainfuck deadline scheduling policy by Con Kolivas deletes
723 ++ * a whole lot of those previous things.
724 ++ * 2017-09-06 Priority and Deadline based Skip list multiple queue kernel
725 ++ * scheduler by Alfred Chen.
726 ++ * 2019-02-20 BMQ(BitMap Queue) kernel scheduler by Alfred Chen.
727 ++ */
728 ++#define CREATE_TRACE_POINTS
729 ++#include <trace/events/sched.h>
730 ++#undef CREATE_TRACE_POINTS
731 ++
732 ++#include "sched.h"
733 ++
734 ++#include <linux/sched/rt.h>
735 ++
736 ++#include <linux/context_tracking.h>
737 ++#include <linux/compat.h>
738 ++#include <linux/blkdev.h>
739 ++#include <linux/delayacct.h>
740 ++#include <linux/freezer.h>
741 ++#include <linux/init_task.h>
742 ++#include <linux/kprobes.h>
743 ++#include <linux/mmu_context.h>
744 ++#include <linux/nmi.h>
745 ++#include <linux/profile.h>
746 ++#include <linux/rcupdate_wait.h>
747 ++#include <linux/security.h>
748 ++#include <linux/syscalls.h>
749 ++#include <linux/wait_bit.h>
750 ++
751 ++#include <linux/kcov.h>
752 ++#include <linux/scs.h>
753 ++
754 ++#include <asm/switch_to.h>
755 ++
756 ++#include "../workqueue_internal.h"
757 ++#include "../../fs/io-wq.h"
758 ++#include "../smpboot.h"
759 ++
760 ++#include "pelt.h"
761 ++#include "smp.h"
762 ++
763 ++/*
764 ++ * Export tracepoints that act as a bare tracehook (ie: have no trace event
765 ++ * associated with them) to allow external modules to probe them.
766 ++ */
767 ++EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_irq_tp);
768 ++
769 ++#ifdef CONFIG_SCHED_DEBUG
770 ++#define sched_feat(x) (1)
771 ++/*
772 ++ * Print a warning if need_resched is set for the given duration (if
773 ++ * LATENCY_WARN is enabled).
774 ++ *
775 ++ * If sysctl_resched_latency_warn_once is set, only one warning will be shown
776 ++ * per boot.
777 ++ */
778 ++__read_mostly int sysctl_resched_latency_warn_ms = 100;
779 ++__read_mostly int sysctl_resched_latency_warn_once = 1;
780 ++#else
781 ++#define sched_feat(x) (0)
782 ++#endif /* CONFIG_SCHED_DEBUG */
783 ++
784 ++#define ALT_SCHED_VERSION "v5.15-r0"
785 ++
786 ++/* rt_prio(prio) defined in include/linux/sched/rt.h */
787 ++#define rt_task(p) rt_prio((p)->prio)
788 ++#define rt_policy(policy) ((policy) == SCHED_FIFO || (policy) == SCHED_RR)
789 ++#define task_has_rt_policy(p) (rt_policy((p)->policy))
790 ++
791 ++#define STOP_PRIO (MAX_RT_PRIO - 1)
792 ++
793 ++/* Default time slice is 4 in ms, can be set via kernel parameter "sched_timeslice" */
794 ++u64 sched_timeslice_ns __read_mostly = (4 << 20);
795 ++
796 ++static inline void requeue_task(struct task_struct *p, struct rq *rq);
797 ++
798 ++#ifdef CONFIG_SCHED_BMQ
799 ++#include "bmq.h"
800 ++#endif
801 ++#ifdef CONFIG_SCHED_PDS
802 ++#include "pds.h"
803 ++#endif
804 ++
805 ++static int __init sched_timeslice(char *str)
806 ++{
807 ++ int timeslice_ms;
808 ++
809 ++ get_option(&str, &timeslice_ms);
810 ++ if (2 != timeslice_ms)
811 ++ timeslice_ms = 4;
812 ++ sched_timeslice_ns = timeslice_ms << 20;
813 ++ sched_timeslice_imp(timeslice_ms);
814 ++
815 ++ return 0;
816 ++}
817 ++early_param("sched_timeslice", sched_timeslice);
818 ++
819 ++/* Reschedule if less than this many μs left */
820 ++#define RESCHED_NS (100 << 10)
821 ++
822 ++/**
823 ++ * sched_yield_type - Choose what sort of yield sched_yield will perform.
824 ++ * 0: No yield.
825 ++ * 1: Deboost and requeue task. (default)
826 ++ * 2: Set rq skip task.
827 ++ */
828 ++int sched_yield_type __read_mostly = 1;
829 ++
830 ++#ifdef CONFIG_SMP
831 ++static cpumask_t sched_rq_pending_mask ____cacheline_aligned_in_smp;
832 ++
833 ++DEFINE_PER_CPU(cpumask_t [NR_CPU_AFFINITY_LEVELS], sched_cpu_topo_masks);
834 ++DEFINE_PER_CPU(cpumask_t *, sched_cpu_llc_mask);
835 ++DEFINE_PER_CPU(cpumask_t *, sched_cpu_topo_end_mask);
836 ++
837 ++#ifdef CONFIG_SCHED_SMT
838 ++DEFINE_STATIC_KEY_FALSE(sched_smt_present);
839 ++EXPORT_SYMBOL_GPL(sched_smt_present);
840 ++#endif
841 ++
842 ++/*
843 ++ * Keep a unique ID per domain (we use the first CPUs number in the cpumask of
844 ++ * the domain), this allows us to quickly tell if two cpus are in the same cache
845 ++ * domain, see cpus_share_cache().
846 ++ */
847 ++DEFINE_PER_CPU(int, sd_llc_id);
848 ++#endif /* CONFIG_SMP */
849 ++
850 ++static DEFINE_MUTEX(sched_hotcpu_mutex);
851 ++
852 ++DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
853 ++
854 ++#ifndef prepare_arch_switch
855 ++# define prepare_arch_switch(next) do { } while (0)
856 ++#endif
857 ++#ifndef finish_arch_post_lock_switch
858 ++# define finish_arch_post_lock_switch() do { } while (0)
859 ++#endif
860 ++
861 ++#ifdef CONFIG_SCHED_SMT
862 ++static cpumask_t sched_sg_idle_mask ____cacheline_aligned_in_smp;
863 ++#endif
864 ++static cpumask_t sched_rq_watermark[SCHED_BITS] ____cacheline_aligned_in_smp;
865 ++
866 ++/* sched_queue related functions */
867 ++static inline void sched_queue_init(struct sched_queue *q)
868 ++{
869 ++ int i;
870 ++
871 ++ bitmap_zero(q->bitmap, SCHED_BITS);
872 ++ for(i = 0; i < SCHED_BITS; i++)
873 ++ INIT_LIST_HEAD(&q->heads[i]);
874 ++}
875 ++
876 ++/*
877 ++ * Init idle task and put into queue structure of rq
878 ++ * IMPORTANT: may be called multiple times for a single cpu
879 ++ */
880 ++static inline void sched_queue_init_idle(struct sched_queue *q,
881 ++ struct task_struct *idle)
882 ++{
883 ++ idle->sq_idx = IDLE_TASK_SCHED_PRIO;
884 ++ INIT_LIST_HEAD(&q->heads[idle->sq_idx]);
885 ++ list_add(&idle->sq_node, &q->heads[idle->sq_idx]);
886 ++}
887 ++
888 ++/* water mark related functions */
889 ++static inline void update_sched_rq_watermark(struct rq *rq)
890 ++{
891 ++ unsigned long watermark = find_first_bit(rq->queue.bitmap, SCHED_QUEUE_BITS);
892 ++ unsigned long last_wm = rq->watermark;
893 ++ unsigned long i;
894 ++ int cpu;
895 ++
896 ++ if (watermark == last_wm)
897 ++ return;
898 ++
899 ++ rq->watermark = watermark;
900 ++ cpu = cpu_of(rq);
901 ++ if (watermark < last_wm) {
902 ++ for (i = last_wm; i > watermark; i--)
903 ++ cpumask_clear_cpu(cpu, sched_rq_watermark + SCHED_BITS - 1 - i);
904 ++#ifdef CONFIG_SCHED_SMT
905 ++ if (static_branch_likely(&sched_smt_present) &&
906 ++ IDLE_TASK_SCHED_PRIO == last_wm)
907 ++ cpumask_andnot(&sched_sg_idle_mask,
908 ++ &sched_sg_idle_mask, cpu_smt_mask(cpu));
909 ++#endif
910 ++ return;
911 ++ }
912 ++ /* last_wm < watermark */
913 ++ for (i = watermark; i > last_wm; i--)
914 ++ cpumask_set_cpu(cpu, sched_rq_watermark + SCHED_BITS - 1 - i);
915 ++#ifdef CONFIG_SCHED_SMT
916 ++ if (static_branch_likely(&sched_smt_present) &&
917 ++ IDLE_TASK_SCHED_PRIO == watermark) {
918 ++ cpumask_t tmp;
919 ++
920 ++ cpumask_and(&tmp, cpu_smt_mask(cpu), sched_rq_watermark);
921 ++ if (cpumask_equal(&tmp, cpu_smt_mask(cpu)))
922 ++ cpumask_or(&sched_sg_idle_mask,
923 ++ &sched_sg_idle_mask, cpu_smt_mask(cpu));
924 ++ }
925 ++#endif
926 ++}
927 ++
928 ++/*
929 ++ * This routine assume that the idle task always in queue
930 ++ */
931 ++static inline struct task_struct *sched_rq_first_task(struct rq *rq)
932 ++{
933 ++ unsigned long idx = find_first_bit(rq->queue.bitmap, SCHED_QUEUE_BITS);
934 ++ const struct list_head *head = &rq->queue.heads[sched_prio2idx(idx, rq)];
935 ++
936 ++ return list_first_entry(head, struct task_struct, sq_node);
937 ++}
938 ++
939 ++static inline struct task_struct *
940 ++sched_rq_next_task(struct task_struct *p, struct rq *rq)
941 ++{
942 ++ unsigned long idx = p->sq_idx;
943 ++ struct list_head *head = &rq->queue.heads[idx];
944 ++
945 ++ if (list_is_last(&p->sq_node, head)) {
946 ++ idx = find_next_bit(rq->queue.bitmap, SCHED_QUEUE_BITS,
947 ++ sched_idx2prio(idx, rq) + 1);
948 ++ head = &rq->queue.heads[sched_prio2idx(idx, rq)];
949 ++
950 ++ return list_first_entry(head, struct task_struct, sq_node);
951 ++ }
952 ++
953 ++ return list_next_entry(p, sq_node);
954 ++}
955 ++
956 ++static inline struct task_struct *rq_runnable_task(struct rq *rq)
957 ++{
958 ++ struct task_struct *next = sched_rq_first_task(rq);
959 ++
960 ++ if (unlikely(next == rq->skip))
961 ++ next = sched_rq_next_task(next, rq);
962 ++
963 ++ return next;
964 ++}
965 ++
966 ++/*
967 ++ * Serialization rules:
968 ++ *
969 ++ * Lock order:
970 ++ *
971 ++ * p->pi_lock
972 ++ * rq->lock
973 ++ * hrtimer_cpu_base->lock (hrtimer_start() for bandwidth controls)
974 ++ *
975 ++ * rq1->lock
976 ++ * rq2->lock where: rq1 < rq2
977 ++ *
978 ++ * Regular state:
979 ++ *
980 ++ * Normal scheduling state is serialized by rq->lock. __schedule() takes the
981 ++ * local CPU's rq->lock, it optionally removes the task from the runqueue and
982 ++ * always looks at the local rq data structures to find the most eligible task
983 ++ * to run next.
984 ++ *
985 ++ * Task enqueue is also under rq->lock, possibly taken from another CPU.
986 ++ * Wakeups from another LLC domain might use an IPI to transfer the enqueue to
987 ++ * the local CPU to avoid bouncing the runqueue state around [ see
988 ++ * ttwu_queue_wakelist() ]
989 ++ *
990 ++ * Task wakeup, specifically wakeups that involve migration, are horribly
991 ++ * complicated to avoid having to take two rq->locks.
992 ++ *
993 ++ * Special state:
994 ++ *
995 ++ * System-calls and anything external will use task_rq_lock() which acquires
996 ++ * both p->pi_lock and rq->lock. As a consequence the state they change is
997 ++ * stable while holding either lock:
998 ++ *
999 ++ * - sched_setaffinity()/
1000 ++ * set_cpus_allowed_ptr(): p->cpus_ptr, p->nr_cpus_allowed
1001 ++ * - set_user_nice(): p->se.load, p->*prio
1002 ++ * - __sched_setscheduler(): p->sched_class, p->policy, p->*prio,
1003 ++ * p->se.load, p->rt_priority,
1004 ++ * p->dl.dl_{runtime, deadline, period, flags, bw, density}
1005 ++ * - sched_setnuma(): p->numa_preferred_nid
1006 ++ * - sched_move_task()/
1007 ++ * cpu_cgroup_fork(): p->sched_task_group
1008 ++ * - uclamp_update_active() p->uclamp*
1009 ++ *
1010 ++ * p->state <- TASK_*:
1011 ++ *
1012 ++ * is changed locklessly using set_current_state(), __set_current_state() or
1013 ++ * set_special_state(), see their respective comments, or by
1014 ++ * try_to_wake_up(). This latter uses p->pi_lock to serialize against
1015 ++ * concurrent self.
1016 ++ *
1017 ++ * p->on_rq <- { 0, 1 = TASK_ON_RQ_QUEUED, 2 = TASK_ON_RQ_MIGRATING }:
1018 ++ *
1019 ++ * is set by activate_task() and cleared by deactivate_task(), under
1020 ++ * rq->lock. Non-zero indicates the task is runnable, the special
1021 ++ * ON_RQ_MIGRATING state is used for migration without holding both
1022 ++ * rq->locks. It indicates task_cpu() is not stable, see task_rq_lock().
1023 ++ *
1024 ++ * p->on_cpu <- { 0, 1 }:
1025 ++ *
1026 ++ * is set by prepare_task() and cleared by finish_task() such that it will be
1027 ++ * set before p is scheduled-in and cleared after p is scheduled-out, both
1028 ++ * under rq->lock. Non-zero indicates the task is running on its CPU.
1029 ++ *
1030 ++ * [ The astute reader will observe that it is possible for two tasks on one
1031 ++ * CPU to have ->on_cpu = 1 at the same time. ]
1032 ++ *
1033 ++ * task_cpu(p): is changed by set_task_cpu(), the rules are:
1034 ++ *
1035 ++ * - Don't call set_task_cpu() on a blocked task:
1036 ++ *
1037 ++ * We don't care what CPU we're not running on, this simplifies hotplug,
1038 ++ * the CPU assignment of blocked tasks isn't required to be valid.
1039 ++ *
1040 ++ * - for try_to_wake_up(), called under p->pi_lock:
1041 ++ *
1042 ++ * This allows try_to_wake_up() to only take one rq->lock, see its comment.
1043 ++ *
1044 ++ * - for migration called under rq->lock:
1045 ++ * [ see task_on_rq_migrating() in task_rq_lock() ]
1046 ++ *
1047 ++ * o move_queued_task()
1048 ++ * o detach_task()
1049 ++ *
1050 ++ * - for migration called under double_rq_lock():
1051 ++ *
1052 ++ * o __migrate_swap_task()
1053 ++ * o push_rt_task() / pull_rt_task()
1054 ++ * o push_dl_task() / pull_dl_task()
1055 ++ * o dl_task_offline_migration()
1056 ++ *
1057 ++ */
1058 ++
1059 ++/*
1060 ++ * Context: p->pi_lock
1061 ++ */
1062 ++static inline struct rq
1063 ++*__task_access_lock(struct task_struct *p, raw_spinlock_t **plock)
1064 ++{
1065 ++ struct rq *rq;
1066 ++ for (;;) {
1067 ++ rq = task_rq(p);
1068 ++ if (p->on_cpu || task_on_rq_queued(p)) {
1069 ++ raw_spin_lock(&rq->lock);
1070 ++ if (likely((p->on_cpu || task_on_rq_queued(p))
1071 ++ && rq == task_rq(p))) {
1072 ++ *plock = &rq->lock;
1073 ++ return rq;
1074 ++ }
1075 ++ raw_spin_unlock(&rq->lock);
1076 ++ } else if (task_on_rq_migrating(p)) {
1077 ++ do {
1078 ++ cpu_relax();
1079 ++ } while (unlikely(task_on_rq_migrating(p)));
1080 ++ } else {
1081 ++ *plock = NULL;
1082 ++ return rq;
1083 ++ }
1084 ++ }
1085 ++}
1086 ++
1087 ++static inline void
1088 ++__task_access_unlock(struct task_struct *p, raw_spinlock_t *lock)
1089 ++{
1090 ++ if (NULL != lock)
1091 ++ raw_spin_unlock(lock);
1092 ++}
1093 ++
1094 ++static inline struct rq
1095 ++*task_access_lock_irqsave(struct task_struct *p, raw_spinlock_t **plock,
1096 ++ unsigned long *flags)
1097 ++{
1098 ++ struct rq *rq;
1099 ++ for (;;) {
1100 ++ rq = task_rq(p);
1101 ++ if (p->on_cpu || task_on_rq_queued(p)) {
1102 ++ raw_spin_lock_irqsave(&rq->lock, *flags);
1103 ++ if (likely((p->on_cpu || task_on_rq_queued(p))
1104 ++ && rq == task_rq(p))) {
1105 ++ *plock = &rq->lock;
1106 ++ return rq;
1107 ++ }
1108 ++ raw_spin_unlock_irqrestore(&rq->lock, *flags);
1109 ++ } else if (task_on_rq_migrating(p)) {
1110 ++ do {
1111 ++ cpu_relax();
1112 ++ } while (unlikely(task_on_rq_migrating(p)));
1113 ++ } else {
1114 ++ raw_spin_lock_irqsave(&p->pi_lock, *flags);
1115 ++ if (likely(!p->on_cpu && !p->on_rq &&
1116 ++ rq == task_rq(p))) {
1117 ++ *plock = &p->pi_lock;
1118 ++ return rq;
1119 ++ }
1120 ++ raw_spin_unlock_irqrestore(&p->pi_lock, *flags);
1121 ++ }
1122 ++ }
1123 ++}
1124 ++
1125 ++static inline void
1126 ++task_access_unlock_irqrestore(struct task_struct *p, raw_spinlock_t *lock,
1127 ++ unsigned long *flags)
1128 ++{
1129 ++ raw_spin_unlock_irqrestore(lock, *flags);
1130 ++}
1131 ++
1132 ++/*
1133 ++ * __task_rq_lock - lock the rq @p resides on.
1134 ++ */
1135 ++struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
1136 ++ __acquires(rq->lock)
1137 ++{
1138 ++ struct rq *rq;
1139 ++
1140 ++ lockdep_assert_held(&p->pi_lock);
1141 ++
1142 ++ for (;;) {
1143 ++ rq = task_rq(p);
1144 ++ raw_spin_lock(&rq->lock);
1145 ++ if (likely(rq == task_rq(p) && !task_on_rq_migrating(p)))
1146 ++ return rq;
1147 ++ raw_spin_unlock(&rq->lock);
1148 ++
1149 ++ while (unlikely(task_on_rq_migrating(p)))
1150 ++ cpu_relax();
1151 ++ }
1152 ++}
1153 ++
1154 ++/*
1155 ++ * task_rq_lock - lock p->pi_lock and lock the rq @p resides on.
1156 ++ */
1157 ++struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
1158 ++ __acquires(p->pi_lock)
1159 ++ __acquires(rq->lock)
1160 ++{
1161 ++ struct rq *rq;
1162 ++
1163 ++ for (;;) {
1164 ++ raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
1165 ++ rq = task_rq(p);
1166 ++ raw_spin_lock(&rq->lock);
1167 ++ /*
1168 ++ * move_queued_task() task_rq_lock()
1169 ++ *
1170 ++ * ACQUIRE (rq->lock)
1171 ++ * [S] ->on_rq = MIGRATING [L] rq = task_rq()
1172 ++ * WMB (__set_task_cpu()) ACQUIRE (rq->lock);
1173 ++ * [S] ->cpu = new_cpu [L] task_rq()
1174 ++ * [L] ->on_rq
1175 ++ * RELEASE (rq->lock)
1176 ++ *
1177 ++ * If we observe the old CPU in task_rq_lock(), the acquire of
1178 ++ * the old rq->lock will fully serialize against the stores.
1179 ++ *
1180 ++ * If we observe the new CPU in task_rq_lock(), the address
1181 ++ * dependency headed by '[L] rq = task_rq()' and the acquire
1182 ++ * will pair with the WMB to ensure we then also see migrating.
1183 ++ */
1184 ++ if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
1185 ++ return rq;
1186 ++ }
1187 ++ raw_spin_unlock(&rq->lock);
1188 ++ raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
1189 ++
1190 ++ while (unlikely(task_on_rq_migrating(p)))
1191 ++ cpu_relax();
1192 ++ }
1193 ++}
1194 ++
1195 ++static inline void
1196 ++rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
1197 ++ __acquires(rq->lock)
1198 ++{
1199 ++ raw_spin_lock_irqsave(&rq->lock, rf->flags);
1200 ++}
1201 ++
1202 ++static inline void
1203 ++rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
1204 ++ __releases(rq->lock)
1205 ++{
1206 ++ raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
1207 ++}
1208 ++
1209 ++void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
1210 ++{
1211 ++ raw_spinlock_t *lock;
1212 ++
1213 ++ /* Matches synchronize_rcu() in __sched_core_enable() */
1214 ++ preempt_disable();
1215 ++
1216 ++ for (;;) {
1217 ++ lock = __rq_lockp(rq);
1218 ++ raw_spin_lock_nested(lock, subclass);
1219 ++ if (likely(lock == __rq_lockp(rq))) {
1220 ++ /* preempt_count *MUST* be > 1 */
1221 ++ preempt_enable_no_resched();
1222 ++ return;
1223 ++ }
1224 ++ raw_spin_unlock(lock);
1225 ++ }
1226 ++}
1227 ++
1228 ++void raw_spin_rq_unlock(struct rq *rq)
1229 ++{
1230 ++ raw_spin_unlock(rq_lockp(rq));
1231 ++}
1232 ++
1233 ++/*
1234 ++ * RQ-clock updating methods:
1235 ++ */
1236 ++
1237 ++static void update_rq_clock_task(struct rq *rq, s64 delta)
1238 ++{
1239 ++/*
1240 ++ * In theory, the compile should just see 0 here, and optimize out the call
1241 ++ * to sched_rt_avg_update. But I don't trust it...
1242 ++ */
1243 ++ s64 __maybe_unused steal = 0, irq_delta = 0;
1244 ++
1245 ++#ifdef CONFIG_IRQ_TIME_ACCOUNTING
1246 ++ irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
1247 ++
1248 ++ /*
1249 ++ * Since irq_time is only updated on {soft,}irq_exit, we might run into
1250 ++ * this case when a previous update_rq_clock() happened inside a
1251 ++ * {soft,}irq region.
1252 ++ *
1253 ++ * When this happens, we stop ->clock_task and only update the
1254 ++ * prev_irq_time stamp to account for the part that fit, so that a next
1255 ++ * update will consume the rest. This ensures ->clock_task is
1256 ++ * monotonic.
1257 ++ *
1258 ++ * It does however cause some slight miss-attribution of {soft,}irq
1259 ++ * time, a more accurate solution would be to update the irq_time using
1260 ++ * the current rq->clock timestamp, except that would require using
1261 ++ * atomic ops.
1262 ++ */
1263 ++ if (irq_delta > delta)
1264 ++ irq_delta = delta;
1265 ++
1266 ++ rq->prev_irq_time += irq_delta;
1267 ++ delta -= irq_delta;
1268 ++#endif
1269 ++#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
1270 ++ if (static_key_false((&paravirt_steal_rq_enabled))) {
1271 ++ steal = paravirt_steal_clock(cpu_of(rq));
1272 ++ steal -= rq->prev_steal_time_rq;
1273 ++
1274 ++ if (unlikely(steal > delta))
1275 ++ steal = delta;
1276 ++
1277 ++ rq->prev_steal_time_rq += steal;
1278 ++ delta -= steal;
1279 ++ }
1280 ++#endif
1281 ++
1282 ++ rq->clock_task += delta;
1283 ++
1284 ++#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
1285 ++ if ((irq_delta + steal))
1286 ++ update_irq_load_avg(rq, irq_delta + steal);
1287 ++#endif
1288 ++}
1289 ++
1290 ++static inline void update_rq_clock(struct rq *rq)
1291 ++{
1292 ++ s64 delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
1293 ++
1294 ++ if (unlikely(delta <= 0))
1295 ++ return;
1296 ++ rq->clock += delta;
1297 ++ update_rq_time_edge(rq);
1298 ++ update_rq_clock_task(rq, delta);
1299 ++}
1300 ++
1301 ++/*
1302 ++ * RQ Load update routine
1303 ++ */
1304 ++#define RQ_LOAD_HISTORY_BITS (sizeof(s32) * 8ULL)
1305 ++#define RQ_UTIL_SHIFT (8)
1306 ++#define RQ_LOAD_HISTORY_TO_UTIL(l) (((l) >> (RQ_LOAD_HISTORY_BITS - 1 - RQ_UTIL_SHIFT)) & 0xff)
1307 ++
1308 ++#define LOAD_BLOCK(t) ((t) >> 17)
1309 ++#define LOAD_HALF_BLOCK(t) ((t) >> 16)
1310 ++#define BLOCK_MASK(t) ((t) & ((0x01 << 18) - 1))
1311 ++#define LOAD_BLOCK_BIT(b) (1UL << (RQ_LOAD_HISTORY_BITS - 1 - (b)))
1312 ++#define CURRENT_LOAD_BIT LOAD_BLOCK_BIT(0)
1313 ++
1314 ++static inline void rq_load_update(struct rq *rq)
1315 ++{
1316 ++ u64 time = rq->clock;
1317 ++ u64 delta = min(LOAD_BLOCK(time) - LOAD_BLOCK(rq->load_stamp),
1318 ++ RQ_LOAD_HISTORY_BITS - 1);
1319 ++ u64 prev = !!(rq->load_history & CURRENT_LOAD_BIT);
1320 ++ u64 curr = !!rq->nr_running;
1321 ++
1322 ++ if (delta) {
1323 ++ rq->load_history = rq->load_history >> delta;
1324 ++
1325 ++ if (delta < RQ_UTIL_SHIFT) {
1326 ++ rq->load_block += (~BLOCK_MASK(rq->load_stamp)) * prev;
1327 ++ if (!!LOAD_HALF_BLOCK(rq->load_block) ^ curr)
1328 ++ rq->load_history ^= LOAD_BLOCK_BIT(delta);
1329 ++ }
1330 ++
1331 ++ rq->load_block = BLOCK_MASK(time) * prev;
1332 ++ } else {
1333 ++ rq->load_block += (time - rq->load_stamp) * prev;
1334 ++ }
1335 ++ if (prev ^ curr)
1336 ++ rq->load_history ^= CURRENT_LOAD_BIT;
1337 ++ rq->load_stamp = time;
1338 ++}
1339 ++
1340 ++unsigned long rq_load_util(struct rq *rq, unsigned long max)
1341 ++{
1342 ++ return RQ_LOAD_HISTORY_TO_UTIL(rq->load_history) * (max >> RQ_UTIL_SHIFT);
1343 ++}
1344 ++
1345 ++#ifdef CONFIG_SMP
1346 ++unsigned long sched_cpu_util(int cpu, unsigned long max)
1347 ++{
1348 ++ return rq_load_util(cpu_rq(cpu), max);
1349 ++}
1350 ++#endif /* CONFIG_SMP */
1351 ++
1352 ++#ifdef CONFIG_CPU_FREQ
1353 ++/**
1354 ++ * cpufreq_update_util - Take a note about CPU utilization changes.
1355 ++ * @rq: Runqueue to carry out the update for.
1356 ++ * @flags: Update reason flags.
1357 ++ *
1358 ++ * This function is called by the scheduler on the CPU whose utilization is
1359 ++ * being updated.
1360 ++ *
1361 ++ * It can only be called from RCU-sched read-side critical sections.
1362 ++ *
1363 ++ * The way cpufreq is currently arranged requires it to evaluate the CPU
1364 ++ * performance state (frequency/voltage) on a regular basis to prevent it from
1365 ++ * being stuck in a completely inadequate performance level for too long.
1366 ++ * That is not guaranteed to happen if the updates are only triggered from CFS
1367 ++ * and DL, though, because they may not be coming in if only RT tasks are
1368 ++ * active all the time (or there are RT tasks only).
1369 ++ *
1370 ++ * As a workaround for that issue, this function is called periodically by the
1371 ++ * RT sched class to trigger extra cpufreq updates to prevent it from stalling,
1372 ++ * but that really is a band-aid. Going forward it should be replaced with
1373 ++ * solutions targeted more specifically at RT tasks.
1374 ++ */
1375 ++static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
1376 ++{
1377 ++ struct update_util_data *data;
1378 ++
1379 ++#ifdef CONFIG_SMP
1380 ++ rq_load_update(rq);
1381 ++#endif
1382 ++ data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
1383 ++ cpu_of(rq)));
1384 ++ if (data)
1385 ++ data->func(data, rq_clock(rq), flags);
1386 ++}
1387 ++#else
1388 ++static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
1389 ++{
1390 ++#ifdef CONFIG_SMP
1391 ++ rq_load_update(rq);
1392 ++#endif
1393 ++}
1394 ++#endif /* CONFIG_CPU_FREQ */
1395 ++
1396 ++#ifdef CONFIG_NO_HZ_FULL
1397 ++/*
1398 ++ * Tick may be needed by tasks in the runqueue depending on their policy and
1399 ++ * requirements. If tick is needed, lets send the target an IPI to kick it out
1400 ++ * of nohz mode if necessary.
1401 ++ */
1402 ++static inline void sched_update_tick_dependency(struct rq *rq)
1403 ++{
1404 ++ int cpu = cpu_of(rq);
1405 ++
1406 ++ if (!tick_nohz_full_cpu(cpu))
1407 ++ return;
1408 ++
1409 ++ if (rq->nr_running < 2)
1410 ++ tick_nohz_dep_clear_cpu(cpu, TICK_DEP_BIT_SCHED);
1411 ++ else
1412 ++ tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
1413 ++}
1414 ++#else /* !CONFIG_NO_HZ_FULL */
1415 ++static inline void sched_update_tick_dependency(struct rq *rq) { }
1416 ++#endif
1417 ++
1418 ++bool sched_task_on_rq(struct task_struct *p)
1419 ++{
1420 ++ return task_on_rq_queued(p);
1421 ++}
1422 ++
1423 ++/*
1424 ++ * Add/Remove/Requeue task to/from the runqueue routines
1425 ++ * Context: rq->lock
1426 ++ */
1427 ++#define __SCHED_DEQUEUE_TASK(p, rq, flags, func) \
1428 ++ psi_dequeue(p, flags & DEQUEUE_SLEEP); \
1429 ++ sched_info_dequeue(rq, p); \
1430 ++ \
1431 ++ list_del(&p->sq_node); \
1432 ++ if (list_empty(&rq->queue.heads[p->sq_idx])) { \
1433 ++ clear_bit(sched_idx2prio(p->sq_idx, rq), \
1434 ++ rq->queue.bitmap); \
1435 ++ func; \
1436 ++ }
1437 ++
1438 ++#define __SCHED_ENQUEUE_TASK(p, rq, flags) \
1439 ++ sched_info_enqueue(rq, p); \
1440 ++ psi_enqueue(p, flags); \
1441 ++ \
1442 ++ p->sq_idx = task_sched_prio_idx(p, rq); \
1443 ++ list_add_tail(&p->sq_node, &rq->queue.heads[p->sq_idx]); \
1444 ++ set_bit(sched_idx2prio(p->sq_idx, rq), rq->queue.bitmap);
1445 ++
1446 ++static inline void dequeue_task(struct task_struct *p, struct rq *rq, int flags)
1447 ++{
1448 ++ lockdep_assert_held(&rq->lock);
1449 ++
1450 ++ /*printk(KERN_INFO "sched: dequeue(%d) %px %016llx\n", cpu_of(rq), p, p->priodl);*/
1451 ++ WARN_ONCE(task_rq(p) != rq, "sched: dequeue task reside on cpu%d from cpu%d\n",
1452 ++ task_cpu(p), cpu_of(rq));
1453 ++
1454 ++ __SCHED_DEQUEUE_TASK(p, rq, flags, update_sched_rq_watermark(rq));
1455 ++ --rq->nr_running;
1456 ++#ifdef CONFIG_SMP
1457 ++ if (1 == rq->nr_running)
1458 ++ cpumask_clear_cpu(cpu_of(rq), &sched_rq_pending_mask);
1459 ++#endif
1460 ++
1461 ++ sched_update_tick_dependency(rq);
1462 ++}
1463 ++
1464 ++static inline void enqueue_task(struct task_struct *p, struct rq *rq, int flags)
1465 ++{
1466 ++ lockdep_assert_held(&rq->lock);
1467 ++
1468 ++ /*printk(KERN_INFO "sched: enqueue(%d) %px %016llx\n", cpu_of(rq), p, p->priodl);*/
1469 ++ WARN_ONCE(task_rq(p) != rq, "sched: enqueue task reside on cpu%d to cpu%d\n",
1470 ++ task_cpu(p), cpu_of(rq));
1471 ++
1472 ++ __SCHED_ENQUEUE_TASK(p, rq, flags);
1473 ++ update_sched_rq_watermark(rq);
1474 ++ ++rq->nr_running;
1475 ++#ifdef CONFIG_SMP
1476 ++ if (2 == rq->nr_running)
1477 ++ cpumask_set_cpu(cpu_of(rq), &sched_rq_pending_mask);
1478 ++#endif
1479 ++
1480 ++ sched_update_tick_dependency(rq);
1481 ++}
1482 ++
1483 ++static inline void requeue_task(struct task_struct *p, struct rq *rq)
1484 ++{
1485 ++ int idx;
1486 ++
1487 ++ lockdep_assert_held(&rq->lock);
1488 ++ /*printk(KERN_INFO "sched: requeue(%d) %px %016llx\n", cpu_of(rq), p, p->priodl);*/
1489 ++ WARN_ONCE(task_rq(p) != rq, "sched: cpu[%d] requeue task reside on cpu%d\n",
1490 ++ cpu_of(rq), task_cpu(p));
1491 ++
1492 ++ idx = task_sched_prio_idx(p, rq);
1493 ++
1494 ++ list_del(&p->sq_node);
1495 ++ list_add_tail(&p->sq_node, &rq->queue.heads[idx]);
1496 ++ if (idx != p->sq_idx) {
1497 ++ if (list_empty(&rq->queue.heads[p->sq_idx]))
1498 ++ clear_bit(sched_idx2prio(p->sq_idx, rq),
1499 ++ rq->queue.bitmap);
1500 ++ p->sq_idx = idx;
1501 ++ set_bit(sched_idx2prio(p->sq_idx, rq), rq->queue.bitmap);
1502 ++ update_sched_rq_watermark(rq);
1503 ++ }
1504 ++}
1505 ++
1506 ++/*
1507 ++ * cmpxchg based fetch_or, macro so it works for different integer types
1508 ++ */
1509 ++#define fetch_or(ptr, mask) \
1510 ++ ({ \
1511 ++ typeof(ptr) _ptr = (ptr); \
1512 ++ typeof(mask) _mask = (mask); \
1513 ++ typeof(*_ptr) _old, _val = *_ptr; \
1514 ++ \
1515 ++ for (;;) { \
1516 ++ _old = cmpxchg(_ptr, _val, _val | _mask); \
1517 ++ if (_old == _val) \
1518 ++ break; \
1519 ++ _val = _old; \
1520 ++ } \
1521 ++ _old; \
1522 ++})
1523 ++
1524 ++#if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
1525 ++/*
1526 ++ * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
1527 ++ * this avoids any races wrt polling state changes and thereby avoids
1528 ++ * spurious IPIs.
1529 ++ */
1530 ++static bool set_nr_and_not_polling(struct task_struct *p)
1531 ++{
1532 ++ struct thread_info *ti = task_thread_info(p);
1533 ++ return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
1534 ++}
1535 ++
1536 ++/*
1537 ++ * Atomically set TIF_NEED_RESCHED if TIF_POLLING_NRFLAG is set.
1538 ++ *
1539 ++ * If this returns true, then the idle task promises to call
1540 ++ * sched_ttwu_pending() and reschedule soon.
1541 ++ */
1542 ++static bool set_nr_if_polling(struct task_struct *p)
1543 ++{
1544 ++ struct thread_info *ti = task_thread_info(p);
1545 ++ typeof(ti->flags) old, val = READ_ONCE(ti->flags);
1546 ++
1547 ++ for (;;) {
1548 ++ if (!(val & _TIF_POLLING_NRFLAG))
1549 ++ return false;
1550 ++ if (val & _TIF_NEED_RESCHED)
1551 ++ return true;
1552 ++ old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
1553 ++ if (old == val)
1554 ++ break;
1555 ++ val = old;
1556 ++ }
1557 ++ return true;
1558 ++}
1559 ++
1560 ++#else
1561 ++static bool set_nr_and_not_polling(struct task_struct *p)
1562 ++{
1563 ++ set_tsk_need_resched(p);
1564 ++ return true;
1565 ++}
1566 ++
1567 ++#ifdef CONFIG_SMP
1568 ++static bool set_nr_if_polling(struct task_struct *p)
1569 ++{
1570 ++ return false;
1571 ++}
1572 ++#endif
1573 ++#endif
1574 ++
1575 ++static bool __wake_q_add(struct wake_q_head *head, struct task_struct *task)
1576 ++{
1577 ++ struct wake_q_node *node = &task->wake_q;
1578 ++
1579 ++ /*
1580 ++ * Atomically grab the task, if ->wake_q is !nil already it means
1581 ++ * it's already queued (either by us or someone else) and will get the
1582 ++ * wakeup due to that.
1583 ++ *
1584 ++ * In order to ensure that a pending wakeup will observe our pending
1585 ++ * state, even in the failed case, an explicit smp_mb() must be used.
1586 ++ */
1587 ++ smp_mb__before_atomic();
1588 ++ if (unlikely(cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL)))
1589 ++ return false;
1590 ++
1591 ++ /*
1592 ++ * The head is context local, there can be no concurrency.
1593 ++ */
1594 ++ *head->lastp = node;
1595 ++ head->lastp = &node->next;
1596 ++ return true;
1597 ++}
1598 ++
1599 ++/**
1600 ++ * wake_q_add() - queue a wakeup for 'later' waking.
1601 ++ * @head: the wake_q_head to add @task to
1602 ++ * @task: the task to queue for 'later' wakeup
1603 ++ *
1604 ++ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
1605 ++ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
1606 ++ * instantly.
1607 ++ *
1608 ++ * This function must be used as-if it were wake_up_process(); IOW the task
1609 ++ * must be ready to be woken at this location.
1610 ++ */
1611 ++void wake_q_add(struct wake_q_head *head, struct task_struct *task)
1612 ++{
1613 ++ if (__wake_q_add(head, task))
1614 ++ get_task_struct(task);
1615 ++}
1616 ++
1617 ++/**
1618 ++ * wake_q_add_safe() - safely queue a wakeup for 'later' waking.
1619 ++ * @head: the wake_q_head to add @task to
1620 ++ * @task: the task to queue for 'later' wakeup
1621 ++ *
1622 ++ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
1623 ++ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
1624 ++ * instantly.
1625 ++ *
1626 ++ * This function must be used as-if it were wake_up_process(); IOW the task
1627 ++ * must be ready to be woken at this location.
1628 ++ *
1629 ++ * This function is essentially a task-safe equivalent to wake_q_add(). Callers
1630 ++ * that already hold reference to @task can call the 'safe' version and trust
1631 ++ * wake_q to do the right thing depending whether or not the @task is already
1632 ++ * queued for wakeup.
1633 ++ */
1634 ++void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task)
1635 ++{
1636 ++ if (!__wake_q_add(head, task))
1637 ++ put_task_struct(task);
1638 ++}
1639 ++
1640 ++void wake_up_q(struct wake_q_head *head)
1641 ++{
1642 ++ struct wake_q_node *node = head->first;
1643 ++
1644 ++ while (node != WAKE_Q_TAIL) {
1645 ++ struct task_struct *task;
1646 ++
1647 ++ task = container_of(node, struct task_struct, wake_q);
1648 ++ /* task can safely be re-inserted now: */
1649 ++ node = node->next;
1650 ++ task->wake_q.next = NULL;
1651 ++
1652 ++ /*
1653 ++ * wake_up_process() executes a full barrier, which pairs with
1654 ++ * the queueing in wake_q_add() so as not to miss wakeups.
1655 ++ */
1656 ++ wake_up_process(task);
1657 ++ put_task_struct(task);
1658 ++ }
1659 ++}
1660 ++
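The wake_q API above is normally used to collect wakeups while a lock is held and to issue them only after the lock is dropped, so wake_up_process() never runs under the lock. A hedged sketch of that calling pattern (my_lock and waiter are hypothetical; DEFINE_WAKE_Q() is the standard initializer from <linux/sched/wake_q.h>):

    DEFINE_WAKE_Q(wake_q);

    spin_lock(&my_lock);
    /* decide whom to wake while the data structure is stable */
    wake_q_add(&wake_q, waiter);    /* takes a task reference */
    spin_unlock(&my_lock);

    wake_up_q(&wake_q);             /* wakes tasks, drops the references */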
1661 ++/*
1662 ++ * resched_curr - mark rq's current task 'to be rescheduled now'.
1663 ++ *
1664 ++ * On UP this means the setting of the need_resched flag, on SMP it
1665 ++ * might also involve a cross-CPU call to trigger the scheduler on
1666 ++ * the target CPU.
1667 ++ */
1668 ++void resched_curr(struct rq *rq)
1669 ++{
1670 ++ struct task_struct *curr = rq->curr;
1671 ++ int cpu;
1672 ++
1673 ++ lockdep_assert_held(&rq->lock);
1674 ++
1675 ++ if (test_tsk_need_resched(curr))
1676 ++ return;
1677 ++
1678 ++ cpu = cpu_of(rq);
1679 ++ if (cpu == smp_processor_id()) {
1680 ++ set_tsk_need_resched(curr);
1681 ++ set_preempt_need_resched();
1682 ++ return;
1683 ++ }
1684 ++
1685 ++ if (set_nr_and_not_polling(curr))
1686 ++ smp_send_reschedule(cpu);
1687 ++ else
1688 ++ trace_sched_wake_idle_without_ipi(cpu);
1689 ++}
1690 ++
1691 ++void resched_cpu(int cpu)
1692 ++{
1693 ++ struct rq *rq = cpu_rq(cpu);
1694 ++ unsigned long flags;
1695 ++
1696 ++ raw_spin_lock_irqsave(&rq->lock, flags);
1697 ++ if (cpu_online(cpu) || cpu == smp_processor_id())
1698 ++ resched_curr(cpu_rq(cpu));
1699 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
1700 ++}
1701 ++
1702 ++#ifdef CONFIG_SMP
1703 ++#ifdef CONFIG_NO_HZ_COMMON
1704 ++void nohz_balance_enter_idle(int cpu) {}
1705 ++
1706 ++void select_nohz_load_balancer(int stop_tick) {}
1707 ++
1708 ++void set_cpu_sd_state_idle(void) {}
1709 ++
1710 ++/*
1711 ++ * In the semi idle case, use the nearest busy CPU for migrating timers
1712 ++ * from an idle CPU. This is good for power-savings.
1713 ++ *
1714 ++ * We don't do a similar optimization for a completely idle system, as
1715 ++ * selecting an idle CPU will add more delays to the timers than intended
1716 ++ * (as that CPU's timer base may not be up to date wrt jiffies etc).
1717 ++ */
1718 ++int get_nohz_timer_target(void)
1719 ++{
1720 ++ int i, cpu = smp_processor_id(), default_cpu = -1;
1721 ++ struct cpumask *mask;
1722 ++ const struct cpumask *hk_mask;
1723 ++
1724 ++ if (housekeeping_cpu(cpu, HK_FLAG_TIMER)) {
1725 ++ if (!idle_cpu(cpu))
1726 ++ return cpu;
1727 ++ default_cpu = cpu;
1728 ++ }
1729 ++
1730 ++ hk_mask = housekeeping_cpumask(HK_FLAG_TIMER);
1731 ++
1732 ++ for (mask = per_cpu(sched_cpu_topo_masks, cpu) + 1;
1733 ++ mask < per_cpu(sched_cpu_topo_end_mask, cpu); mask++)
1734 ++ for_each_cpu_and(i, mask, hk_mask)
1735 ++ if (!idle_cpu(i))
1736 ++ return i;
1737 ++
1738 ++ if (default_cpu == -1)
1739 ++ default_cpu = housekeeping_any_cpu(HK_FLAG_TIMER);
1740 ++ cpu = default_cpu;
1741 ++
1742 ++ return cpu;
1743 ++}
1744 ++
1745 ++/*
1746 ++ * When add_timer_on() enqueues a timer into the timer wheel of an
1747 ++ * idle CPU then this timer might expire before the next timer event
1748 ++ * which is scheduled to wake up that CPU. In case of a completely
1749 ++ * idle system the next event might even be infinite time into the
1750 ++ * future. wake_up_idle_cpu() ensures that the CPU is woken up and
1751 ++ * leaves the inner idle loop so the newly added timer is taken into
1752 ++ * account when the CPU goes back to idle and evaluates the timer
1753 ++ * wheel for the next timer event.
1754 ++ */
1755 ++static inline void wake_up_idle_cpu(int cpu)
1756 ++{
1757 ++ struct rq *rq = cpu_rq(cpu);
1758 ++
1759 ++ if (cpu == smp_processor_id())
1760 ++ return;
1761 ++
1762 ++ if (set_nr_and_not_polling(rq->idle))
1763 ++ smp_send_reschedule(cpu);
1764 ++ else
1765 ++ trace_sched_wake_idle_without_ipi(cpu);
1766 ++}
1767 ++
1768 ++static inline bool wake_up_full_nohz_cpu(int cpu)
1769 ++{
1770 ++ /*
1771 ++ * We just need the target to call irq_exit() and re-evaluate
1772 ++ * the next tick. The nohz full kick at least implies that.
1773 ++ * If needed we can still optimize that later with an
1774 ++ * empty IRQ.
1775 ++ */
1776 ++ if (cpu_is_offline(cpu))
1777 ++ return true; /* Don't try to wake offline CPUs. */
1778 ++ if (tick_nohz_full_cpu(cpu)) {
1779 ++ if (cpu != smp_processor_id() ||
1780 ++ tick_nohz_tick_stopped())
1781 ++ tick_nohz_full_kick_cpu(cpu);
1782 ++ return true;
1783 ++ }
1784 ++
1785 ++ return false;
1786 ++}
1787 ++
1788 ++void wake_up_nohz_cpu(int cpu)
1789 ++{
1790 ++ if (!wake_up_full_nohz_cpu(cpu))
1791 ++ wake_up_idle_cpu(cpu);
1792 ++}
1793 ++
1794 ++static void nohz_csd_func(void *info)
1795 ++{
1796 ++ struct rq *rq = info;
1797 ++ int cpu = cpu_of(rq);
1798 ++ unsigned int flags;
1799 ++
1800 ++ /*
1801 ++ * Release the rq::nohz_csd.
1802 ++ */
1803 ++ flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));
1804 ++ WARN_ON(!(flags & NOHZ_KICK_MASK));
1805 ++
1806 ++ rq->idle_balance = idle_cpu(cpu);
1807 ++ if (rq->idle_balance && !need_resched()) {
1808 ++ rq->nohz_idle_balance = flags;
1809 ++ raise_softirq_irqoff(SCHED_SOFTIRQ);
1810 ++ }
1811 ++}
1812 ++
1813 ++#endif /* CONFIG_NO_HZ_COMMON */
1814 ++#endif /* CONFIG_SMP */
1815 ++
1816 ++static inline void check_preempt_curr(struct rq *rq)
1817 ++{
1818 ++ if (sched_rq_first_task(rq) != rq->curr)
1819 ++ resched_curr(rq);
1820 ++}
1821 ++
1822 ++#ifdef CONFIG_SCHED_HRTICK
1823 ++/*
1824 ++ * Use HR-timers to deliver accurate preemption points.
1825 ++ */
1826 ++
1827 ++static void hrtick_clear(struct rq *rq)
1828 ++{
1829 ++ if (hrtimer_active(&rq->hrtick_timer))
1830 ++ hrtimer_cancel(&rq->hrtick_timer);
1831 ++}
1832 ++
1833 ++/*
1834 ++ * High-resolution timer tick.
1835 ++ * Runs from hardirq context with interrupts disabled.
1836 ++ */
1837 ++static enum hrtimer_restart hrtick(struct hrtimer *timer)
1838 ++{
1839 ++ struct rq *rq = container_of(timer, struct rq, hrtick_timer);
1840 ++
1841 ++ WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
1842 ++
1843 ++ raw_spin_lock(&rq->lock);
1844 ++ resched_curr(rq);
1845 ++ raw_spin_unlock(&rq->lock);
1846 ++
1847 ++ return HRTIMER_NORESTART;
1848 ++}
1849 ++
1850 ++/*
1851 ++ * Use hrtick when:
1852 ++ * - enabled by features
1853 ++ * - hrtimer is actually high res
1854 ++ */
1855 ++static inline int hrtick_enabled(struct rq *rq)
1856 ++{
1857 ++ /**
1858 ++ * Alt schedule FW doesn't support sched_feat yet
1859 ++ if (!sched_feat(HRTICK))
1860 ++ return 0;
1861 ++ */
1862 ++ if (!cpu_active(cpu_of(rq)))
1863 ++ return 0;
1864 ++ return hrtimer_is_hres_active(&rq->hrtick_timer);
1865 ++}
1866 ++
1867 ++#ifdef CONFIG_SMP
1868 ++
1869 ++static void __hrtick_restart(struct rq *rq)
1870 ++{
1871 ++ struct hrtimer *timer = &rq->hrtick_timer;
1872 ++ ktime_t time = rq->hrtick_time;
1873 ++
1874 ++ hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
1875 ++}
1876 ++
1877 ++/*
1878 ++ * called from hardirq (IPI) context
1879 ++ */
1880 ++static void __hrtick_start(void *arg)
1881 ++{
1882 ++ struct rq *rq = arg;
1883 ++
1884 ++ raw_spin_lock(&rq->lock);
1885 ++ __hrtick_restart(rq);
1886 ++ raw_spin_unlock(&rq->lock);
1887 ++}
1888 ++
1889 ++/*
1890 ++ * Called to set the hrtick timer state.
1891 ++ *
1892 ++ * called with rq->lock held and irqs disabled
1893 ++ */
1894 ++void hrtick_start(struct rq *rq, u64 delay)
1895 ++{
1896 ++ struct hrtimer *timer = &rq->hrtick_timer;
1897 ++ s64 delta;
1898 ++
1899 ++ /*
1900 ++ * Don't schedule slices shorter than 10000ns, that just
1901 ++ * doesn't make sense and can cause timer DoS.
1902 ++ */
1903 ++ delta = max_t(s64, delay, 10000LL);
1904 ++
1905 ++ rq->hrtick_time = ktime_add_ns(timer->base->get_time(), delta);
1906 ++
1907 ++ if (rq == this_rq())
1908 ++ __hrtick_restart(rq);
1909 ++ else
1910 ++ smp_call_function_single_async(cpu_of(rq), &rq->hrtick_csd);
1911 ++}
1912 ++
1913 ++#else
1914 ++/*
1915 ++ * Called to set the hrtick timer state.
1916 ++ *
1917 ++ * called with rq->lock held and irqs disabled
1918 ++ */
1919 ++void hrtick_start(struct rq *rq, u64 delay)
1920 ++{
1921 ++ /*
1922 ++ * Don't schedule slices shorter than 10000ns, that just
1923 ++ * doesn't make sense. Rely on vruntime for fairness.
1924 ++ */
1925 ++ delay = max_t(u64, delay, 10000LL);
1926 ++ hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay),
1927 ++ HRTIMER_MODE_REL_PINNED_HARD);
1928 ++}
1929 ++#endif /* CONFIG_SMP */
1930 ++
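Both the SMP and UP variants above clamp the requested slice to a 10000 ns floor, so a very short preemption request is silently rounded up. For example (illustrative only):

    hrtick_start(rq, 3000);    /* asks for 3 us ... */
    /* ... but the timer is armed 10000 ns out because of the clamp */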
1931 ++static void hrtick_rq_init(struct rq *rq)
1932 ++{
1933 ++#ifdef CONFIG_SMP
1934 ++ INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
1935 ++#endif
1936 ++
1937 ++ hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
1938 ++ rq->hrtick_timer.function = hrtick;
1939 ++}
1940 ++#else /* CONFIG_SCHED_HRTICK */
1941 ++static inline int hrtick_enabled(struct rq *rq)
1942 ++{
1943 ++ return 0;
1944 ++}
1945 ++
1946 ++static inline void hrtick_clear(struct rq *rq)
1947 ++{
1948 ++}
1949 ++
1950 ++static inline void hrtick_rq_init(struct rq *rq)
1951 ++{
1952 ++}
1953 ++#endif /* CONFIG_SCHED_HRTICK */
1954 ++
1955 ++static inline int __normal_prio(int policy, int rt_prio, int static_prio)
1956 ++{
1957 ++ return rt_policy(policy) ? (MAX_RT_PRIO - 1 - rt_prio) :
1958 ++ static_prio + MAX_PRIORITY_ADJ;
1959 ++}
1960 ++
1961 ++/*
1962 ++ * Calculate the expected normal priority: i.e. priority
1963 ++ * without taking RT-inheritance into account. Might be
1964 ++ * boosted by interactivity modifiers. Changes upon fork,
1965 ++ * setprio syscalls, and whenever the interactivity
1966 ++ * estimator recalculates.
1967 ++ */
1968 ++static inline int normal_prio(struct task_struct *p)
1969 ++{
1970 ++ return __normal_prio(p->policy, p->rt_priority, p->static_prio);
1971 ++}
1972 ++
1973 ++/*
1974 ++ * Calculate the current priority, i.e. the priority
1975 ++ * taken into account by the scheduler. This value might
1976 ++ * be boosted by RT tasks as it will be RT if the task got
1977 ++ * RT-boosted. If not then it returns p->normal_prio.
1978 ++ */
1979 ++static int effective_prio(struct task_struct *p)
1980 ++{
1981 ++ p->normal_prio = normal_prio(p);
1982 ++ /*
1983 ++ * If we are RT tasks or we were boosted to RT priority,
1984 ++ * keep the priority unchanged. Otherwise, update priority
1985 ++ * to the normal priority:
1986 ++ */
1987 ++ if (!rt_prio(p->prio))
1988 ++ return p->normal_prio;
1989 ++ return p->prio;
1990 ++}
1991 ++
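Two worked examples of the mapping above, assuming the mainline value MAX_RT_PRIO == 100 (MAX_PRIORITY_ADJ is defined elsewhere in this patch, so it is left symbolic):

    /* SCHED_FIFO task with rt_priority = 50:
     *   __normal_prio() = MAX_RT_PRIO - 1 - 50 = 100 - 1 - 50 = 49
     *
     * SCHED_NORMAL task at nice 0 (static_prio = 120):
     *   __normal_prio() = 120 + MAX_PRIORITY_ADJ
     */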
1992 ++/*
1993 ++ * activate_task - move a task to the runqueue.
1994 ++ *
1995 ++ * Context: rq->lock
1996 ++ */
1997 ++static void activate_task(struct task_struct *p, struct rq *rq)
1998 ++{
1999 ++ enqueue_task(p, rq, ENQUEUE_WAKEUP);
2000 ++ p->on_rq = TASK_ON_RQ_QUEUED;
2001 ++
2002 ++ /*
2003 ++ * If in_iowait is set, the code below may not trigger any cpufreq
2004 ++ * utilization updates, so do it here explicitly with the IOWAIT flag
2005 ++ * passed.
2006 ++ */
2007 ++ cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT * p->in_iowait);
2008 ++}
2009 ++
2010 ++/*
2011 ++ * deactivate_task - remove a task from the runqueue.
2012 ++ *
2013 ++ * Context: rq->lock
2014 ++ */
2015 ++static inline void deactivate_task(struct task_struct *p, struct rq *rq)
2016 ++{
2017 ++ dequeue_task(p, rq, DEQUEUE_SLEEP);
2018 ++ p->on_rq = 0;
2019 ++ cpufreq_update_util(rq, 0);
2020 ++}
2021 ++
2022 ++static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
2023 ++{
2024 ++#ifdef CONFIG_SMP
2025 ++ /*
2026 ++ * After ->cpu is set up to a new value, task_access_lock(p, ...) can be
2027 ++ * successfully executed on another CPU. We must ensure that updates of
2028 ++ * per-task data have been completed by this moment.
2029 ++ */
2030 ++ smp_wmb();
2031 ++
2032 ++#ifdef CONFIG_THREAD_INFO_IN_TASK
2033 ++ WRITE_ONCE(p->cpu, cpu);
2034 ++#else
2035 ++ WRITE_ONCE(task_thread_info(p)->cpu, cpu);
2036 ++#endif
2037 ++#endif
2038 ++}
2039 ++
2040 ++static inline bool is_migration_disabled(struct task_struct *p)
2041 ++{
2042 ++#ifdef CONFIG_SMP
2043 ++ return p->migration_disabled;
2044 ++#else
2045 ++ return false;
2046 ++#endif
2047 ++}
2048 ++
2049 ++#define SCA_CHECK 0x01
2050 ++#define SCA_USER 0x08
2051 ++
2052 ++#ifdef CONFIG_SMP
2053 ++
2054 ++void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
2055 ++{
2056 ++#ifdef CONFIG_SCHED_DEBUG
2057 ++ unsigned int state = READ_ONCE(p->__state);
2058 ++
2059 ++ /*
2060 ++ * We should never call set_task_cpu() on a blocked task,
2061 ++ * ttwu() will sort out the placement.
2062 ++ */
2063 ++ WARN_ON_ONCE(state != TASK_RUNNING && state != TASK_WAKING && !p->on_rq);
2064 ++
2065 ++#ifdef CONFIG_LOCKDEP
2066 ++ /*
2067 ++ * The caller should hold either p->pi_lock or rq->lock, when changing
2068 ++ * a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.
2069 ++ *
2070 ++ * sched_move_task() holds both and thus holding either pins the cgroup,
2071 ++ * see task_group().
2072 ++ */
2073 ++ WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
2074 ++ lockdep_is_held(&task_rq(p)->lock)));
2075 ++#endif
2076 ++ /*
2077 ++ * Clearly, migrating tasks to offline CPUs is a fairly daft thing.
2078 ++ */
2079 ++ WARN_ON_ONCE(!cpu_online(new_cpu));
2080 ++
2081 ++ WARN_ON_ONCE(is_migration_disabled(p));
2082 ++#endif
2083 ++ if (task_cpu(p) == new_cpu)
2084 ++ return;
2085 ++ trace_sched_migrate_task(p, new_cpu);
2086 ++ rseq_migrate(p);
2087 ++ perf_event_task_migrate(p);
2088 ++
2089 ++ __set_task_cpu(p, new_cpu);
2090 ++}
2091 ++
2092 ++#define MDF_FORCE_ENABLED 0x80
2093 ++
2094 ++static void
2095 ++__do_set_cpus_ptr(struct task_struct *p, const struct cpumask *new_mask)
2096 ++{
2097 ++ /*
2098 ++ * This here violates the locking rules for affinity, since we're only
2099 ++ * supposed to change these variables while holding both rq->lock and
2100 ++ * p->pi_lock.
2101 ++ *
2102 ++ * HOWEVER, it magically works, because ttwu() is the only code that
2103 ++ * accesses these variables under p->pi_lock and only does so after
2104 ++ * smp_cond_load_acquire(&p->on_cpu, !VAL), and we're in __schedule()
2105 ++ * before finish_task().
2106 ++ *
2107 ++ * XXX do further audits, this smells like something putrid.
2108 ++ */
2109 ++ SCHED_WARN_ON(!p->on_cpu);
2110 ++ p->cpus_ptr = new_mask;
2111 ++}
2112 ++
2113 ++void migrate_disable(void)
2114 ++{
2115 ++ struct task_struct *p = current;
2116 ++ int cpu;
2117 ++
2118 ++ if (p->migration_disabled) {
2119 ++ p->migration_disabled++;
2120 ++ return;
2121 ++ }
2122 ++
2123 ++ preempt_disable();
2124 ++ cpu = smp_processor_id();
2125 ++ if (cpumask_test_cpu(cpu, &p->cpus_mask)) {
2126 ++ cpu_rq(cpu)->nr_pinned++;
2127 ++ p->migration_disabled = 1;
2128 ++ p->migration_flags &= ~MDF_FORCE_ENABLED;
2129 ++
2130 ++ /*
2131 ++ * Violates locking rules! see comment in __do_set_cpus_ptr().
2132 ++ */
2133 ++ if (p->cpus_ptr == &p->cpus_mask)
2134 ++ __do_set_cpus_ptr(p, cpumask_of(cpu));
2135 ++ }
2136 ++ preempt_enable();
2137 ++}
2138 ++EXPORT_SYMBOL_GPL(migrate_disable);
2139 ++
2140 ++void migrate_enable(void)
2141 ++{
2142 ++ struct task_struct *p = current;
2143 ++
2144 ++ if (0 == p->migration_disabled)
2145 ++ return;
2146 ++
2147 ++ if (p->migration_disabled > 1) {
2148 ++ p->migration_disabled--;
2149 ++ return;
2150 ++ }
2151 ++
2152 ++ /*
2153 ++ * Ensure stop_task runs either before or after this, and that
2154 ++ * __set_cpus_allowed_ptr(SCA_MIGRATE_ENABLE) doesn't schedule().
2155 ++ */
2156 ++ preempt_disable();
2157 ++ /*
2158 ++ * Assumption: current should be running on allowed cpu
2159 ++ */
2160 ++ WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &p->cpus_mask));
2161 ++ if (p->cpus_ptr != &p->cpus_mask)
2162 ++ __do_set_cpus_ptr(p, &p->cpus_mask);
2163 ++ /*
2164 ++ * Mustn't clear migration_disabled() until cpus_ptr points back at the
2165 ++ * regular cpus_mask, otherwise things that race (eg.
2166 ++ * select_fallback_rq) get confused.
2167 ++ */
2168 ++ barrier();
2169 ++ p->migration_disabled = 0;
2170 ++ this_rq()->nr_pinned--;
2171 ++ preempt_enable();
2172 ++}
2173 ++EXPORT_SYMBOL_GPL(migrate_enable);
2174 ++
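The pair above is a nestable pin-to-this-CPU mechanism: the outermost migrate_disable() bumps rq->nr_pinned and narrows cpus_ptr to the current CPU, and only the matching outermost migrate_enable() undoes that. A minimal usage sketch (illustrative only, not part of the patch):

    migrate_disable();        /* pin current task to this CPU */
    /* ... touch per-CPU state that must not move ... */
    migrate_disable();        /* nesting is just a counter bump */
    migrate_enable();
    migrate_enable();         /* outermost enable restores cpus_ptr */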
2175 ++static inline bool rq_has_pinned_tasks(struct rq *rq)
2176 ++{
2177 ++ return rq->nr_pinned;
2178 ++}
2179 ++
2180 ++/*
2181 ++ * Per-CPU kthreads are allowed to run on !active && online CPUs, see
2182 ++ * __set_cpus_allowed_ptr() and select_fallback_rq().
2183 ++ */
2184 ++static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
2185 ++{
2186 ++ /* When not in the task's cpumask, no point in looking further. */
2187 ++ if (!cpumask_test_cpu(cpu, p->cpus_ptr))
2188 ++ return false;
2189 ++
2190 ++ /* migrate_disabled() must be allowed to finish. */
2191 ++ if (is_migration_disabled(p))
2192 ++ return cpu_online(cpu);
2193 ++
2194 ++	/* Non-kernel threads are not allowed during either online or offline. */
2195 ++ if (!(p->flags & PF_KTHREAD))
2196 ++ return cpu_active(cpu) && task_cpu_possible(cpu, p);
2197 ++
2198 ++ /* KTHREAD_IS_PER_CPU is always allowed. */
2199 ++ if (kthread_is_per_cpu(p))
2200 ++ return cpu_online(cpu);
2201 ++
2202 ++ /* Regular kernel threads don't get to stay during offline. */
2203 ++ if (cpu_dying(cpu))
2204 ++ return false;
2205 ++
2206 ++ /* But are allowed during online. */
2207 ++ return cpu_online(cpu);
2208 ++}
2209 ++
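A compact way to read the checks above (an illustrative summary, not part of the patch):

    /* cpu is allowed for p only if cpu is in p->cpus_ptr, and then:
     *   - migration-disabled task:  cpu_online(cpu)
     *   - user task (!PF_KTHREAD):  cpu_active(cpu) && task_cpu_possible(cpu, p)
     *   - per-CPU kthread:          cpu_online(cpu)
     *   - other kernel thread:      cpu_online(cpu) && !cpu_dying(cpu)
     */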
2210 ++/*
2211 ++ * This is how migration works:
2212 ++ *
2213 ++ * 1) we invoke migration_cpu_stop() on the target CPU using
2214 ++ * stop_one_cpu().
2215 ++ * 2) stopper starts to run (implicitly forcing the migrated thread
2216 ++ * off the CPU)
2217 ++ * 3) it checks whether the migrated task is still in the wrong runqueue.
2218 ++ * 4) if it's in the wrong runqueue then the migration thread removes
2219 ++ * it and puts it into the right queue.
2220 ++ * 5) stopper completes and stop_one_cpu() returns and the migration
2221 ++ * is done.
2222 ++ */
2223 ++
2224 ++/*
2225 ++ * move_queued_task - move a queued task to new rq.
2226 ++ *
2227 ++ * Returns (locked) new rq. Old rq's lock is released.
2228 ++ */
2229 ++static struct rq *move_queued_task(struct rq *rq, struct task_struct *p, int
2230 ++ new_cpu)
2231 ++{
2232 ++ lockdep_assert_held(&rq->lock);
2233 ++
2234 ++ WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
2235 ++ dequeue_task(p, rq, 0);
2236 ++ set_task_cpu(p, new_cpu);
2237 ++ raw_spin_unlock(&rq->lock);
2238 ++
2239 ++ rq = cpu_rq(new_cpu);
2240 ++
2241 ++ raw_spin_lock(&rq->lock);
2242 ++ BUG_ON(task_cpu(p) != new_cpu);
2243 ++ sched_task_sanity_check(p, rq);
2244 ++ enqueue_task(p, rq, 0);
2245 ++ p->on_rq = TASK_ON_RQ_QUEUED;
2246 ++ check_preempt_curr(rq);
2247 ++
2248 ++ return rq;
2249 ++}
2250 ++
2251 ++struct migration_arg {
2252 ++ struct task_struct *task;
2253 ++ int dest_cpu;
2254 ++};
2255 ++
2256 ++/*
2257 ++ * Move (not current) task off this CPU, onto the destination CPU. We're doing
2258 ++ * this because either it can't run here any more (set_cpus_allowed()
2259 ++ * away from this CPU, or CPU going down), or because we're
2260 ++ * attempting to rebalance this task on exec (sched_exec).
2261 ++ *
2262 ++ * So we race with normal scheduler movements, but that's OK, as long
2263 ++ * as the task is no longer on this CPU.
2264 ++ */
2265 ++static struct rq *__migrate_task(struct rq *rq, struct task_struct *p, int
2266 ++ dest_cpu)
2267 ++{
2268 ++ /* Affinity changed (again). */
2269 ++ if (!is_cpu_allowed(p, dest_cpu))
2270 ++ return rq;
2271 ++
2272 ++ update_rq_clock(rq);
2273 ++ return move_queued_task(rq, p, dest_cpu);
2274 ++}
2275 ++
2276 ++/*
2277 ++ * migration_cpu_stop - this will be executed by a highprio stopper thread
2278 ++ * and performs thread migration by bumping thread off CPU then
2279 ++ * 'pushing' onto another runqueue.
2280 ++ */
2281 ++static int migration_cpu_stop(void *data)
2282 ++{
2283 ++ struct migration_arg *arg = data;
2284 ++ struct task_struct *p = arg->task;
2285 ++ struct rq *rq = this_rq();
2286 ++ unsigned long flags;
2287 ++
2288 ++ /*
2289 ++ * The original target CPU might have gone down and we might
2290 ++ * be on another CPU but it doesn't matter.
2291 ++ */
2292 ++ local_irq_save(flags);
2293 ++ /*
2294 ++ * We need to explicitly wake pending tasks before running
2295 ++ * __migrate_task() such that we will not miss enforcing cpus_ptr
2296 ++ * during wakeups, see set_cpus_allowed_ptr()'s TASK_WAKING test.
2297 ++ */
2298 ++ flush_smp_call_function_from_idle();
2299 ++
2300 ++ raw_spin_lock(&p->pi_lock);
2301 ++ raw_spin_lock(&rq->lock);
2302 ++ /*
2303 ++ * If task_rq(p) != rq, it cannot be migrated here, because we're
2304 ++ * holding rq->lock, if p->on_rq == 0 it cannot get enqueued because
2305 ++ * we're holding p->pi_lock.
2306 ++ */
2307 ++ if (task_rq(p) == rq && task_on_rq_queued(p))
2308 ++ rq = __migrate_task(rq, p, arg->dest_cpu);
2309 ++ raw_spin_unlock(&rq->lock);
2310 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
2311 ++
2312 ++ return 0;
2313 ++}
2314 ++
2315 ++static inline void
2316 ++set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_mask)
2317 ++{
2318 ++ cpumask_copy(&p->cpus_mask, new_mask);
2319 ++ p->nr_cpus_allowed = cpumask_weight(new_mask);
2320 ++}
2321 ++
2322 ++static void
2323 ++__do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
2324 ++{
2325 ++ lockdep_assert_held(&p->pi_lock);
2326 ++ set_cpus_allowed_common(p, new_mask);
2327 ++}
2328 ++
2329 ++void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
2330 ++{
2331 ++ __do_set_cpus_allowed(p, new_mask);
2332 ++}
2333 ++
2334 ++int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src,
2335 ++ int node)
2336 ++{
2337 ++ if (!src->user_cpus_ptr)
2338 ++ return 0;
2339 ++
2340 ++ dst->user_cpus_ptr = kmalloc_node(cpumask_size(), GFP_KERNEL, node);
2341 ++ if (!dst->user_cpus_ptr)
2342 ++ return -ENOMEM;
2343 ++
2344 ++ cpumask_copy(dst->user_cpus_ptr, src->user_cpus_ptr);
2345 ++ return 0;
2346 ++}
2347 ++
2348 ++static inline struct cpumask *clear_user_cpus_ptr(struct task_struct *p)
2349 ++{
2350 ++ struct cpumask *user_mask = NULL;
2351 ++
2352 ++ swap(p->user_cpus_ptr, user_mask);
2353 ++
2354 ++ return user_mask;
2355 ++}
2356 ++
2357 ++void release_user_cpus_ptr(struct task_struct *p)
2358 ++{
2359 ++ kfree(clear_user_cpus_ptr(p));
2360 ++}
2361 ++
2362 ++#endif
2363 ++
2364 ++/**
2365 ++ * task_curr - is this task currently executing on a CPU?
2366 ++ * @p: the task in question.
2367 ++ *
2368 ++ * Return: 1 if the task is currently executing. 0 otherwise.
2369 ++ */
2370 ++inline int task_curr(const struct task_struct *p)
2371 ++{
2372 ++ return cpu_curr(task_cpu(p)) == p;
2373 ++}
2374 ++
2375 ++#ifdef CONFIG_SMP
2376 ++/*
2377 ++ * wait_task_inactive - wait for a thread to unschedule.
2378 ++ *
2379 ++ * If @match_state is nonzero, it's the @p->state value just checked and
2380 ++ * not expected to change. If it changes, i.e. @p might have woken up,
2381 ++ * then return zero. When we succeed in waiting for @p to be off its CPU,
2382 ++ * we return a positive number (its total switch count). If a second call
2383 ++ * a short while later returns the same number, the caller can be sure that
2384 ++ * @p has remained unscheduled the whole time.
2385 ++ *
2386 ++ * The caller must ensure that the task *will* unschedule sometime soon,
2387 ++ * else this function might spin for a *long* time. This function can't
2388 ++ * be called with interrupts off, or it may introduce deadlock with
2389 ++ * smp_call_function() if an IPI is sent by the same process we are
2390 ++ * waiting to become inactive.
2391 ++ */
2392 ++unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state)
2393 ++{
2394 ++ unsigned long flags;
2395 ++ bool running, on_rq;
2396 ++ unsigned long ncsw;
2397 ++ struct rq *rq;
2398 ++ raw_spinlock_t *lock;
2399 ++
2400 ++ for (;;) {
2401 ++ rq = task_rq(p);
2402 ++
2403 ++ /*
2404 ++ * If the task is actively running on another CPU
2405 ++ * still, just relax and busy-wait without holding
2406 ++ * any locks.
2407 ++ *
2408 ++ * NOTE! Since we don't hold any locks, it's not
2409 ++ * even sure that "rq" stays as the right runqueue!
2410 ++ * But we don't care, since this will return false
2411 ++ * if the runqueue has changed and p is actually now
2412 ++ * running somewhere else!
2413 ++ */
2414 ++ while (task_running(p) && p == rq->curr) {
2415 ++ if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
2416 ++ return 0;
2417 ++ cpu_relax();
2418 ++ }
2419 ++
2420 ++ /*
2421 ++ * Ok, time to look more closely! We need the rq
2422 ++ * lock now, to be *sure*. If we're wrong, we'll
2423 ++ * just go back and repeat.
2424 ++ */
2425 ++ task_access_lock_irqsave(p, &lock, &flags);
2426 ++ trace_sched_wait_task(p);
2427 ++ running = task_running(p);
2428 ++ on_rq = p->on_rq;
2429 ++ ncsw = 0;
2430 ++ if (!match_state || READ_ONCE(p->__state) == match_state)
2431 ++ ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
2432 ++ task_access_unlock_irqrestore(p, lock, &flags);
2433 ++
2434 ++ /*
2435 ++ * If it changed from the expected state, bail out now.
2436 ++ */
2437 ++ if (unlikely(!ncsw))
2438 ++ break;
2439 ++
2440 ++ /*
2441 ++ * Was it really running after all now that we
2442 ++ * checked with the proper locks actually held?
2443 ++ *
2444 ++ * Oops. Go back and try again..
2445 ++ */
2446 ++ if (unlikely(running)) {
2447 ++ cpu_relax();
2448 ++ continue;
2449 ++ }
2450 ++
2451 ++ /*
2452 ++ * It's not enough that it's not actively running,
2453 ++ * it must be off the runqueue _entirely_, and not
2454 ++ * preempted!
2455 ++ *
2456 ++ * So if it was still runnable (but just not actively
2457 ++ * running right now), it's preempted, and we should
2458 ++ * yield - it could be a while.
2459 ++ */
2460 ++ if (unlikely(on_rq)) {
2461 ++ ktime_t to = NSEC_PER_SEC / HZ;
2462 ++
2463 ++ set_current_state(TASK_UNINTERRUPTIBLE);
2464 ++ schedule_hrtimeout(&to, HRTIMER_MODE_REL);
2465 ++ continue;
2466 ++ }
2467 ++
2468 ++ /*
2469 ++ * Ahh, all good. It wasn't running, and it wasn't
2470 ++ * runnable, which means that it will never become
2471 ++ * running in the future either. We're all done!
2472 ++ */
2473 ++ break;
2474 ++ }
2475 ++
2476 ++ return ncsw;
2477 ++}
2478 ++
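The "call it twice and compare" usage described in the comment above looks roughly like this in a caller (a hedged sketch; the task pointer and the sleep state are hypothetical):

    unsigned long ncsw;

    ncsw = wait_task_inactive(p, TASK_UNINTERRUPTIBLE);
    if (!ncsw)
        return;        /* state changed; p may have woken up */

    /* ... some time later ... */
    if (wait_task_inactive(p, TASK_UNINTERRUPTIBLE) != ncsw)
        pr_debug("task ran in between\n");
    /* equal counts mean p stayed off the CPU the whole time */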
2479 ++/***
2480 ++ * kick_process - kick a running thread to enter/exit the kernel
2481 ++ * @p: the to-be-kicked thread
2482 ++ *
2483 ++ * Cause a process which is running on another CPU to enter
2484 ++ * kernel-mode, without any delay. (to get signals handled.)
2485 ++ *
2486 ++ * NOTE: this function doesn't have to take the runqueue lock,
2487 ++ * because all it wants to ensure is that the remote task enters
2488 ++ * the kernel. If the IPI races and the task has been migrated
2489 ++ * to another CPU then no harm is done and the purpose has been
2490 ++ * achieved as well.
2491 ++ */
2492 ++void kick_process(struct task_struct *p)
2493 ++{
2494 ++ int cpu;
2495 ++
2496 ++ preempt_disable();
2497 ++ cpu = task_cpu(p);
2498 ++ if ((cpu != smp_processor_id()) && task_curr(p))
2499 ++ smp_send_reschedule(cpu);
2500 ++ preempt_enable();
2501 ++}
2502 ++EXPORT_SYMBOL_GPL(kick_process);
2503 ++
2504 ++/*
2505 ++ * ->cpus_ptr is protected by both rq->lock and p->pi_lock
2506 ++ *
2507 ++ * A few notes on cpu_active vs cpu_online:
2508 ++ *
2509 ++ * - cpu_active must be a subset of cpu_online
2510 ++ *
2511 ++ * - on CPU-up we allow per-CPU kthreads on the online && !active CPU,
2512 ++ * see __set_cpus_allowed_ptr(). At this point the newly online
2513 ++ * CPU isn't yet part of the sched domains, and balancing will not
2514 ++ * see it.
2515 ++ *
2516 ++ * - on cpu-down we clear cpu_active() to mask the sched domains and
2517 ++ *   prevent the load balancer from placing new tasks on the to-be-removed
2518 ++ * CPU. Existing tasks will remain running there and will be taken
2519 ++ * off.
2520 ++ *
2521 ++ * This means that fallback selection must not select !active CPUs.
2522 ++ * And can assume that any active CPU must be online. Conversely
2523 ++ * select_task_rq() below may allow selection of !active CPUs in order
2524 ++ * to satisfy the above rules.
2525 ++ */
2526 ++static int select_fallback_rq(int cpu, struct task_struct *p)
2527 ++{
2528 ++ int nid = cpu_to_node(cpu);
2529 ++ const struct cpumask *nodemask = NULL;
2530 ++ enum { cpuset, possible, fail } state = cpuset;
2531 ++ int dest_cpu;
2532 ++
2533 ++ /*
2534 ++ * If the node that the CPU is on has been offlined, cpu_to_node()
2535 ++ * will return -1. There is no CPU on the node, and we should
2536 ++	 * select a CPU on another node.
2537 ++ */
2538 ++ if (nid != -1) {
2539 ++ nodemask = cpumask_of_node(nid);
2540 ++
2541 ++ /* Look for allowed, online CPU in same node. */
2542 ++ for_each_cpu(dest_cpu, nodemask) {
2543 ++ if (is_cpu_allowed(p, dest_cpu))
2544 ++ return dest_cpu;
2545 ++ }
2546 ++ }
2547 ++
2548 ++ for (;;) {
2549 ++ /* Any allowed, online CPU? */
2550 ++ for_each_cpu(dest_cpu, p->cpus_ptr) {
2551 ++ if (!is_cpu_allowed(p, dest_cpu))
2552 ++ continue;
2553 ++ goto out;
2554 ++ }
2555 ++
2556 ++ /* No more Mr. Nice Guy. */
2557 ++ switch (state) {
2558 ++ case cpuset:
2559 ++ if (cpuset_cpus_allowed_fallback(p)) {
2560 ++ state = possible;
2561 ++ break;
2562 ++ }
2563 ++ fallthrough;
2564 ++ case possible:
2565 ++ /*
2566 ++ * XXX When called from select_task_rq() we only
2567 ++ * hold p->pi_lock and again violate locking order.
2568 ++ *
2569 ++ * More yuck to audit.
2570 ++ */
2571 ++ do_set_cpus_allowed(p, task_cpu_possible_mask(p));
2572 ++ state = fail;
2573 ++ break;
2574 ++
2575 ++ case fail:
2576 ++ BUG();
2577 ++ break;
2578 ++ }
2579 ++ }
2580 ++
2581 ++out:
2582 ++ if (state != cpuset) {
2583 ++ /*
2584 ++ * Don't tell them about moving exiting tasks or
2585 ++ * kernel threads (both mm NULL), since they never
2586 ++ * leave kernel.
2587 ++ */
2588 ++ if (p->mm && printk_ratelimit()) {
2589 ++ printk_deferred("process %d (%s) no longer affine to cpu%d\n",
2590 ++ task_pid_nr(p), p->comm, cpu);
2591 ++ }
2592 ++ }
2593 ++
2594 ++ return dest_cpu;
2595 ++}
2596 ++
2597 ++static inline int select_task_rq(struct task_struct *p)
2598 ++{
2599 ++ cpumask_t chk_mask, tmp;
2600 ++
2601 ++ if (unlikely(!cpumask_and(&chk_mask, p->cpus_ptr, cpu_active_mask)))
2602 ++ return select_fallback_rq(task_cpu(p), p);
2603 ++
2604 ++ if (
2605 ++#ifdef CONFIG_SCHED_SMT
2606 ++ cpumask_and(&tmp, &chk_mask, &sched_sg_idle_mask) ||
2607 ++#endif
2608 ++ cpumask_and(&tmp, &chk_mask, sched_rq_watermark) ||
2609 ++ cpumask_and(&tmp, &chk_mask,
2610 ++ sched_rq_watermark + SCHED_BITS - task_sched_prio(p)))
2611 ++ return best_mask_cpu(task_cpu(p), &tmp);
2612 ++
2613 ++ return best_mask_cpu(task_cpu(p), &chk_mask);
2614 ++}
2615 ++
2616 ++void sched_set_stop_task(int cpu, struct task_struct *stop)
2617 ++{
2618 ++ static struct lock_class_key stop_pi_lock;
2619 ++ struct sched_param stop_param = { .sched_priority = STOP_PRIO };
2620 ++ struct sched_param start_param = { .sched_priority = 0 };
2621 ++ struct task_struct *old_stop = cpu_rq(cpu)->stop;
2622 ++
2623 ++ if (stop) {
2624 ++ /*
2625 ++		 * Make it appear like a SCHED_FIFO task; it's something
2626 ++ * userspace knows about and won't get confused about.
2627 ++ *
2628 ++ * Also, it will make PI more or less work without too
2629 ++ * much confusion -- but then, stop work should not
2630 ++ * rely on PI working anyway.
2631 ++ */
2632 ++ sched_setscheduler_nocheck(stop, SCHED_FIFO, &stop_param);
2633 ++
2634 ++ /*
2635 ++ * The PI code calls rt_mutex_setprio() with ->pi_lock held to
2636 ++ * adjust the effective priority of a task. As a result,
2637 ++ * rt_mutex_setprio() can trigger (RT) balancing operations,
2638 ++ * which can then trigger wakeups of the stop thread to push
2639 ++ * around the current task.
2640 ++ *
2641 ++ * The stop task itself will never be part of the PI-chain, it
2642 ++ * never blocks, therefore that ->pi_lock recursion is safe.
2643 ++ * Tell lockdep about this by placing the stop->pi_lock in its
2644 ++ * own class.
2645 ++ */
2646 ++ lockdep_set_class(&stop->pi_lock, &stop_pi_lock);
2647 ++ }
2648 ++
2649 ++ cpu_rq(cpu)->stop = stop;
2650 ++
2651 ++ if (old_stop) {
2652 ++ /*
2653 ++ * Reset it back to a normal scheduling policy so that
2654 ++ * it can die in pieces.
2655 ++ */
2656 ++ sched_setscheduler_nocheck(old_stop, SCHED_NORMAL, &start_param);
2657 ++ }
2658 ++}
2659 ++
2660 ++static int affine_move_task(struct rq *rq, struct task_struct *p, int dest_cpu,
2661 ++ raw_spinlock_t *lock, unsigned long irq_flags)
2662 ++{
2663 ++ /* Can the task run on the task's current CPU? If so, we're done */
2664 ++ if (!cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) {
2665 ++ if (p->migration_disabled) {
2666 ++ if (likely(p->cpus_ptr != &p->cpus_mask))
2667 ++ __do_set_cpus_ptr(p, &p->cpus_mask);
2668 ++ p->migration_disabled = 0;
2669 ++ p->migration_flags |= MDF_FORCE_ENABLED;
2670 ++ /* When p is migrate_disabled, rq->lock should be held */
2671 ++ rq->nr_pinned--;
2672 ++ }
2673 ++
2674 ++ if (task_running(p) || READ_ONCE(p->__state) == TASK_WAKING) {
2675 ++ struct migration_arg arg = { p, dest_cpu };
2676 ++
2677 ++ /* Need help from migration thread: drop lock and wait. */
2678 ++ __task_access_unlock(p, lock);
2679 ++ raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);
2680 ++ stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
2681 ++ return 0;
2682 ++ }
2683 ++ if (task_on_rq_queued(p)) {
2684 ++ /*
2685 ++ * OK, since we're going to drop the lock immediately
2686 ++ * afterwards anyway.
2687 ++ */
2688 ++ update_rq_clock(rq);
2689 ++ rq = move_queued_task(rq, p, dest_cpu);
2690 ++ lock = &rq->lock;
2691 ++ }
2692 ++ }
2693 ++ __task_access_unlock(p, lock);
2694 ++ raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);
2695 ++ return 0;
2696 ++}
2697 ++
2698 ++static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
2699 ++ const struct cpumask *new_mask,
2700 ++ u32 flags,
2701 ++ struct rq *rq,
2702 ++ raw_spinlock_t *lock,
2703 ++ unsigned long irq_flags)
2704 ++{
2705 ++ const struct cpumask *cpu_allowed_mask = task_cpu_possible_mask(p);
2706 ++ const struct cpumask *cpu_valid_mask = cpu_active_mask;
2707 ++ bool kthread = p->flags & PF_KTHREAD;
2708 ++ struct cpumask *user_mask = NULL;
2709 ++ int dest_cpu;
2710 ++ int ret = 0;
2711 ++
2712 ++ if (kthread || is_migration_disabled(p)) {
2713 ++ /*
2714 ++ * Kernel threads are allowed on online && !active CPUs,
2715 ++ * however, during cpu-hot-unplug, even these might get pushed
2716 ++ * away if not KTHREAD_IS_PER_CPU.
2717 ++ *
2718 ++ * Specifically, migration_disabled() tasks must not fail the
2719 ++ * cpumask_any_and_distribute() pick below, esp. so on
2720 ++ * SCA_MIGRATE_ENABLE, otherwise we'll not call
2721 ++ * set_cpus_allowed_common() and actually reset p->cpus_ptr.
2722 ++ */
2723 ++ cpu_valid_mask = cpu_online_mask;
2724 ++ }
2725 ++
2726 ++ if (!kthread && !cpumask_subset(new_mask, cpu_allowed_mask)) {
2727 ++ ret = -EINVAL;
2728 ++ goto out;
2729 ++ }
2730 ++
2731 ++ /*
2732 ++ * Must re-check here, to close a race against __kthread_bind(),
2733 ++ * sched_setaffinity() is not guaranteed to observe the flag.
2734 ++ */
2735 ++ if ((flags & SCA_CHECK) && (p->flags & PF_NO_SETAFFINITY)) {
2736 ++ ret = -EINVAL;
2737 ++ goto out;
2738 ++ }
2739 ++
2740 ++ if (cpumask_equal(&p->cpus_mask, new_mask))
2741 ++ goto out;
2742 ++
2743 ++ dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
2744 ++ if (dest_cpu >= nr_cpu_ids) {
2745 ++ ret = -EINVAL;
2746 ++ goto out;
2747 ++ }
2748 ++
2749 ++ __do_set_cpus_allowed(p, new_mask);
2750 ++
2751 ++ if (flags & SCA_USER)
2752 ++ user_mask = clear_user_cpus_ptr(p);
2753 ++
2754 ++ ret = affine_move_task(rq, p, dest_cpu, lock, irq_flags);
2755 ++
2756 ++ kfree(user_mask);
2757 ++
2758 ++ return ret;
2759 ++
2760 ++out:
2761 ++ __task_access_unlock(p, lock);
2762 ++ raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);
2763 ++
2764 ++ return ret;
2765 ++}
2766 ++
2767 ++/*
2768 ++ * Change a given task's CPU affinity. Migrate the thread to a
2769 ++ * proper CPU and schedule it away if the CPU it's executing on
2770 ++ * is removed from the allowed bitmask.
2771 ++ *
2772 ++ * NOTE: the caller must have a valid reference to the task, the
2773 ++ * task must not exit() & deallocate itself prematurely. The
2774 ++ * call is not atomic; no spinlocks may be held.
2775 ++ */
2776 ++static int __set_cpus_allowed_ptr(struct task_struct *p,
2777 ++ const struct cpumask *new_mask, u32 flags)
2778 ++{
2779 ++ unsigned long irq_flags;
2780 ++ struct rq *rq;
2781 ++ raw_spinlock_t *lock;
2782 ++
2783 ++ raw_spin_lock_irqsave(&p->pi_lock, irq_flags);
2784 ++ rq = __task_access_lock(p, &lock);
2785 ++
2786 ++ return __set_cpus_allowed_ptr_locked(p, new_mask, flags, rq, lock, irq_flags);
2787 ++}
2788 ++
2789 ++int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
2790 ++{
2791 ++ return __set_cpus_allowed_ptr(p, new_mask, 0);
2792 ++}
2793 ++EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
2794 ++
2795 ++/*
2796 ++ * Change a given task's CPU affinity to the intersection of its current
2797 ++ * affinity mask and @subset_mask, writing the resulting mask to @new_mask
2798 ++ * and pointing @p->user_cpus_ptr to a copy of the old mask.
2799 ++ * If the resulting mask is empty, leave the affinity unchanged and return
2800 ++ * -EINVAL.
2801 ++ */
2802 ++static int restrict_cpus_allowed_ptr(struct task_struct *p,
2803 ++ struct cpumask *new_mask,
2804 ++ const struct cpumask *subset_mask)
2805 ++{
2806 ++ struct cpumask *user_mask = NULL;
2807 ++ unsigned long irq_flags;
2808 ++ raw_spinlock_t *lock;
2809 ++ struct rq *rq;
2810 ++ int err;
2811 ++
2812 ++ if (!p->user_cpus_ptr) {
2813 ++ user_mask = kmalloc(cpumask_size(), GFP_KERNEL);
2814 ++ if (!user_mask)
2815 ++ return -ENOMEM;
2816 ++ }
2817 ++
2818 ++ raw_spin_lock_irqsave(&p->pi_lock, irq_flags);
2819 ++ rq = __task_access_lock(p, &lock);
2820 ++
2821 ++ if (!cpumask_and(new_mask, &p->cpus_mask, subset_mask)) {
2822 ++ err = -EINVAL;
2823 ++ goto err_unlock;
2824 ++ }
2825 ++
2826 ++ /*
2827 ++ * We're about to butcher the task affinity, so keep track of what
2828 ++ * the user asked for in case we're able to restore it later on.
2829 ++ */
2830 ++ if (user_mask) {
2831 ++ cpumask_copy(user_mask, p->cpus_ptr);
2832 ++ p->user_cpus_ptr = user_mask;
2833 ++ }
2834 ++
2835 ++ /*return __set_cpus_allowed_ptr_locked(p, new_mask, 0, rq, &rf);*/
2836 ++ return __set_cpus_allowed_ptr_locked(p, new_mask, 0, rq, lock, irq_flags);
2837 ++
2838 ++err_unlock:
2839 ++ __task_access_unlock(p, lock);
2840 ++ raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);
2841 ++ kfree(user_mask);
2842 ++ return err;
2843 ++}
2844 ++
2845 ++/*
2846 ++ * Restrict the CPU affinity of task @p so that it is a subset of
2847 ++ * task_cpu_possible_mask() and point @p->user_cpus_ptr to a copy of the
2848 ++ * old affinity mask. If the resulting mask is empty, we warn and walk
2849 ++ * up the cpuset hierarchy until we find a suitable mask.
2850 ++ */
2851 ++void force_compatible_cpus_allowed_ptr(struct task_struct *p)
2852 ++{
2853 ++ cpumask_var_t new_mask;
2854 ++ const struct cpumask *override_mask = task_cpu_possible_mask(p);
2855 ++
2856 ++ alloc_cpumask_var(&new_mask, GFP_KERNEL);
2857 ++
2858 ++ /*
2859 ++ * __migrate_task() can fail silently in the face of concurrent
2860 ++ * offlining of the chosen destination CPU, so take the hotplug
2861 ++ * lock to ensure that the migration succeeds.
2862 ++ */
2863 ++ cpus_read_lock();
2864 ++ if (!cpumask_available(new_mask))
2865 ++ goto out_set_mask;
2866 ++
2867 ++ if (!restrict_cpus_allowed_ptr(p, new_mask, override_mask))
2868 ++ goto out_free_mask;
2869 ++
2870 ++ /*
2871 ++ * We failed to find a valid subset of the affinity mask for the
2872 ++ * task, so override it based on its cpuset hierarchy.
2873 ++ */
2874 ++ cpuset_cpus_allowed(p, new_mask);
2875 ++ override_mask = new_mask;
2876 ++
2877 ++out_set_mask:
2878 ++ if (printk_ratelimit()) {
2879 ++ printk_deferred("Overriding affinity for process %d (%s) to CPUs %*pbl\n",
2880 ++ task_pid_nr(p), p->comm,
2881 ++ cpumask_pr_args(override_mask));
2882 ++ }
2883 ++
2884 ++ WARN_ON(set_cpus_allowed_ptr(p, override_mask));
2885 ++out_free_mask:
2886 ++ cpus_read_unlock();
2887 ++ free_cpumask_var(new_mask);
2888 ++}
2889 ++
2890 ++static int
2891 ++__sched_setaffinity(struct task_struct *p, const struct cpumask *mask);
2892 ++
2893 ++/*
2894 ++ * Restore the affinity of a task @p which was previously restricted by a
2895 ++ * call to force_compatible_cpus_allowed_ptr(). This will clear (and free)
2896 ++ * @p->user_cpus_ptr.
2897 ++ *
2898 ++ * It is the caller's responsibility to serialise this with any calls to
2899 ++ * force_compatible_cpus_allowed_ptr(@p).
2900 ++ */
2901 ++void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
2902 ++{
2903 ++ struct cpumask *user_mask = p->user_cpus_ptr;
2904 ++ unsigned long flags;
2905 ++
2906 ++ /*
2907 ++ * Try to restore the old affinity mask. If this fails, then
2908 ++ * we free the mask explicitly to avoid it being inherited across
2909 ++ * a subsequent fork().
2910 ++ */
2911 ++ if (!user_mask || !__sched_setaffinity(p, user_mask))
2912 ++ return;
2913 ++
2914 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
2915 ++ user_mask = clear_user_cpus_ptr(p);
2916 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
2917 ++
2918 ++ kfree(user_mask);
2919 ++}
2920 ++
2921 ++#else /* CONFIG_SMP */
2922 ++
2923 ++static inline int select_task_rq(struct task_struct *p)
2924 ++{
2925 ++ return 0;
2926 ++}
2927 ++
2928 ++static inline int
2929 ++__set_cpus_allowed_ptr(struct task_struct *p,
2930 ++ const struct cpumask *new_mask, u32 flags)
2931 ++{
2932 ++ return set_cpus_allowed_ptr(p, new_mask);
2933 ++}
2934 ++
2935 ++static inline bool rq_has_pinned_tasks(struct rq *rq)
2936 ++{
2937 ++ return false;
2938 ++}
2939 ++
2940 ++#endif /* !CONFIG_SMP */
2941 ++
2942 ++static void
2943 ++ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
2944 ++{
2945 ++ struct rq *rq;
2946 ++
2947 ++ if (!schedstat_enabled())
2948 ++ return;
2949 ++
2950 ++ rq = this_rq();
2951 ++
2952 ++#ifdef CONFIG_SMP
2953 ++ if (cpu == rq->cpu)
2954 ++ __schedstat_inc(rq->ttwu_local);
2955 ++ else {
2956 ++ /** Alt schedule FW ToDo:
2957 ++ * How to do ttwu_wake_remote
2958 ++ */
2959 ++ }
2960 ++#endif /* CONFIG_SMP */
2961 ++
2962 ++ __schedstat_inc(rq->ttwu_count);
2963 ++}
2964 ++
2965 ++/*
2966 ++ * Mark the task runnable and perform wakeup-preemption.
2967 ++ */
2968 ++static inline void
2969 ++ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
2970 ++{
2971 ++ check_preempt_curr(rq);
2972 ++ WRITE_ONCE(p->__state, TASK_RUNNING);
2973 ++ trace_sched_wakeup(p);
2974 ++}
2975 ++
2976 ++static inline void
2977 ++ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)
2978 ++{
2979 ++ if (p->sched_contributes_to_load)
2980 ++ rq->nr_uninterruptible--;
2981 ++
2982 ++ if (
2983 ++#ifdef CONFIG_SMP
2984 ++ !(wake_flags & WF_MIGRATED) &&
2985 ++#endif
2986 ++ p->in_iowait) {
2987 ++ delayacct_blkio_end(p);
2988 ++ atomic_dec(&task_rq(p)->nr_iowait);
2989 ++ }
2990 ++
2991 ++ activate_task(p, rq);
2992 ++ ttwu_do_wakeup(rq, p, 0);
2993 ++}
2994 ++
2995 ++/*
2996 ++ * Consider @p being inside a wait loop:
2997 ++ *
2998 ++ * for (;;) {
2999 ++ * set_current_state(TASK_UNINTERRUPTIBLE);
3000 ++ *
3001 ++ * if (CONDITION)
3002 ++ * break;
3003 ++ *
3004 ++ * schedule();
3005 ++ * }
3006 ++ * __set_current_state(TASK_RUNNING);
3007 ++ *
3008 ++ * between set_current_state() and schedule(). In this case @p is still
3009 ++ * runnable, so all that needs doing is change p->state back to TASK_RUNNING in
3010 ++ * an atomic manner.
3011 ++ *
3012 ++ * By taking task_rq(p)->lock we serialize against schedule(), if @p->on_rq
3013 ++ * then schedule() must still happen and p->state can be changed to
3014 ++ * TASK_RUNNING. Otherwise we lost the race, schedule() has happened, and we
3015 ++ * need to do a full wakeup with enqueue.
3016 ++ *
3017 ++ * Returns: %true when the wakeup is done,
3018 ++ * %false otherwise.
3019 ++ */
3020 ++static int ttwu_runnable(struct task_struct *p, int wake_flags)
3021 ++{
3022 ++ struct rq *rq;
3023 ++ raw_spinlock_t *lock;
3024 ++ int ret = 0;
3025 ++
3026 ++ rq = __task_access_lock(p, &lock);
3027 ++ if (task_on_rq_queued(p)) {
3028 ++ /* check_preempt_curr() may use rq clock */
3029 ++ update_rq_clock(rq);
3030 ++ ttwu_do_wakeup(rq, p, wake_flags);
3031 ++ ret = 1;
3032 ++ }
3033 ++ __task_access_unlock(p, lock);
3034 ++
3035 ++ return ret;
3036 ++}
3037 ++
3038 ++#ifdef CONFIG_SMP
3039 ++void sched_ttwu_pending(void *arg)
3040 ++{
3041 ++ struct llist_node *llist = arg;
3042 ++ struct rq *rq = this_rq();
3043 ++ struct task_struct *p, *t;
3044 ++ struct rq_flags rf;
3045 ++
3046 ++ if (!llist)
3047 ++ return;
3048 ++
3049 ++ /*
3050 ++	 * rq::ttwu_pending is a racy indication of outstanding wakeups.
3051 ++	 * Races such that false-negatives are possible, since they
3052 ++	 * are shorter lived than false-positives would be.
3053 ++ */
3054 ++ WRITE_ONCE(rq->ttwu_pending, 0);
3055 ++
3056 ++ rq_lock_irqsave(rq, &rf);
3057 ++ update_rq_clock(rq);
3058 ++
3059 ++ llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
3060 ++ if (WARN_ON_ONCE(p->on_cpu))
3061 ++ smp_cond_load_acquire(&p->on_cpu, !VAL);
3062 ++
3063 ++ if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
3064 ++ set_task_cpu(p, cpu_of(rq));
3065 ++
3066 ++ ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0);
3067 ++ }
3068 ++
3069 ++ rq_unlock_irqrestore(rq, &rf);
3070 ++}
3071 ++
3072 ++void send_call_function_single_ipi(int cpu)
3073 ++{
3074 ++ struct rq *rq = cpu_rq(cpu);
3075 ++
3076 ++ if (!set_nr_if_polling(rq->idle))
3077 ++ arch_send_call_function_single_ipi(cpu);
3078 ++ else
3079 ++ trace_sched_wake_idle_without_ipi(cpu);
3080 ++}
3081 ++
3082 ++/*
3083 ++ * Queue a task on the target CPU's wake_list and wake the CPU via IPI if
3084 ++ * necessary. The wakee CPU on receipt of the IPI will queue the task
3085 ++ * via sched_ttwu_wakeup() for activation so the wakee incurs the cost
3086 ++ * of the wakeup instead of the waker.
3087 ++ */
3088 ++static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
3089 ++{
3090 ++ struct rq *rq = cpu_rq(cpu);
3091 ++
3092 ++ p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
3093 ++
3094 ++ WRITE_ONCE(rq->ttwu_pending, 1);
3095 ++ __smp_call_single_queue(cpu, &p->wake_entry.llist);
3096 ++}
3097 ++
3098 ++static inline bool ttwu_queue_cond(int cpu, int wake_flags)
3099 ++{
3100 ++ /*
3101 ++ * Do not complicate things with the async wake_list while the CPU is
3102 ++ * in hotplug state.
3103 ++ */
3104 ++ if (!cpu_active(cpu))
3105 ++ return false;
3106 ++
3107 ++ /*
3108 ++ * If the CPU does not share cache, then queue the task on the
3109 ++	 * remote rq's wakelist to avoid accessing remote data.
3110 ++ */
3111 ++ if (!cpus_share_cache(smp_processor_id(), cpu))
3112 ++ return true;
3113 ++
3114 ++ /*
3115 ++	 * If the task is descheduling and is the only running task on the
3116 ++	 * CPU, then use the wakelist to offload the task activation to
3117 ++ * the soon-to-be-idle CPU as the current CPU is likely busy.
3118 ++ * nr_running is checked to avoid unnecessary task stacking.
3119 ++ */
3120 ++ if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
3121 ++ return true;
3122 ++
3123 ++ return false;
3124 ++}
3125 ++
3126 ++static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
3127 ++{
3128 ++ if (__is_defined(ALT_SCHED_TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
3129 ++ if (WARN_ON_ONCE(cpu == smp_processor_id()))
3130 ++ return false;
3131 ++
3132 ++ sched_clock_cpu(cpu); /* Sync clocks across CPUs */
3133 ++ __ttwu_queue_wakelist(p, cpu, wake_flags);
3134 ++ return true;
3135 ++ }
3136 ++
3137 ++ return false;
3138 ++}
3139 ++
3140 ++void wake_up_if_idle(int cpu)
3141 ++{
3142 ++ struct rq *rq = cpu_rq(cpu);
3143 ++ unsigned long flags;
3144 ++
3145 ++ rcu_read_lock();
3146 ++
3147 ++ if (!is_idle_task(rcu_dereference(rq->curr)))
3148 ++ goto out;
3149 ++
3150 ++ if (set_nr_if_polling(rq->idle)) {
3151 ++ trace_sched_wake_idle_without_ipi(cpu);
3152 ++ } else {
3153 ++ raw_spin_lock_irqsave(&rq->lock, flags);
3154 ++ if (is_idle_task(rq->curr))
3155 ++ smp_send_reschedule(cpu);
3156 ++ /* Else CPU is not idle, do nothing here */
3157 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
3158 ++ }
3159 ++
3160 ++out:
3161 ++ rcu_read_unlock();
3162 ++}
3163 ++
3164 ++bool cpus_share_cache(int this_cpu, int that_cpu)
3165 ++{
3166 ++ return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
3167 ++}
3168 ++#else /* !CONFIG_SMP */
3169 ++
3170 ++static inline bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
3171 ++{
3172 ++ return false;
3173 ++}
3174 ++
3175 ++#endif /* CONFIG_SMP */
3176 ++
3177 ++static inline void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
3178 ++{
3179 ++ struct rq *rq = cpu_rq(cpu);
3180 ++
3181 ++ if (ttwu_queue_wakelist(p, cpu, wake_flags))
3182 ++ return;
3183 ++
3184 ++ raw_spin_lock(&rq->lock);
3185 ++ update_rq_clock(rq);
3186 ++ ttwu_do_activate(rq, p, wake_flags);
3187 ++ raw_spin_unlock(&rq->lock);
3188 ++}
3189 ++
3190 ++/*
3191 ++ * Invoked from try_to_wake_up() to check whether the task can be woken up.
3192 ++ *
3193 ++ * The caller holds p::pi_lock if p != current or has preemption
3194 ++ * disabled when p == current.
3195 ++ *
3196 ++ * The rules of PREEMPT_RT saved_state:
3197 ++ *
3198 ++ * The related locking code always holds p::pi_lock when updating
3199 ++ * p::saved_state, which means the code is fully serialized in both cases.
3200 ++ *
3201 ++ * The lock wait and lock wakeups happen via TASK_RTLOCK_WAIT. No other
3202 ++ * bits set. This allows to distinguish all wakeup scenarios.
3203 ++ */
3204 ++static __always_inline
3205 ++bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
3206 ++{
3207 ++ if (IS_ENABLED(CONFIG_DEBUG_PREEMPT)) {
3208 ++ WARN_ON_ONCE((state & TASK_RTLOCK_WAIT) &&
3209 ++ state != TASK_RTLOCK_WAIT);
3210 ++ }
3211 ++
3212 ++ if (READ_ONCE(p->__state) & state) {
3213 ++ *success = 1;
3214 ++ return true;
3215 ++ }
3216 ++
3217 ++#ifdef CONFIG_PREEMPT_RT
3218 ++ /*
3219 ++ * Saved state preserves the task state across blocking on
3220 ++ * an RT lock. If the state matches, set p::saved_state to
3221 ++ * TASK_RUNNING, but do not wake the task because it waits
3222 ++ * for a lock wakeup. Also indicate success because from
3223 ++ * the regular waker's point of view this has succeeded.
3224 ++ *
3225 ++ * After acquiring the lock the task will restore p::__state
3226 ++ * from p::saved_state which ensures that the regular
3227 ++ * wakeup is not lost. The restore will also set
3228 ++ * p::saved_state to TASK_RUNNING so any further tests will
3229 ++ * not result in false positives vs. @success
3230 ++ */
3231 ++ if (p->saved_state & state) {
3232 ++ p->saved_state = TASK_RUNNING;
3233 ++ *success = 1;
3234 ++ }
3235 ++#endif
3236 ++ return false;
3237 ++}
3238 ++
3239 ++/*
3240 ++ * Notes on Program-Order guarantees on SMP systems.
3241 ++ *
3242 ++ * MIGRATION
3243 ++ *
3244 ++ * The basic program-order guarantee on SMP systems is that when a task [t]
3245 ++ * migrates, all its activity on its old CPU [c0] happens-before any subsequent
3246 ++ * execution on its new CPU [c1].
3247 ++ *
3248 ++ * For migration (of runnable tasks) this is provided by the following means:
3249 ++ *
3250 ++ * A) UNLOCK of the rq(c0)->lock scheduling out task t
3251 ++ * B) migration for t is required to synchronize *both* rq(c0)->lock and
3252 ++ * rq(c1)->lock (if not at the same time, then in that order).
3253 ++ * C) LOCK of the rq(c1)->lock scheduling in task
3254 ++ *
3255 ++ * Transitivity guarantees that B happens after A and C after B.
3256 ++ * Note: we only require RCpc transitivity.
3257 ++ * Note: the CPU doing B need not be c0 or c1
3258 ++ *
3259 ++ * Example:
3260 ++ *
3261 ++ * CPU0 CPU1 CPU2
3262 ++ *
3263 ++ * LOCK rq(0)->lock
3264 ++ * sched-out X
3265 ++ * sched-in Y
3266 ++ * UNLOCK rq(0)->lock
3267 ++ *
3268 ++ * LOCK rq(0)->lock // orders against CPU0
3269 ++ * dequeue X
3270 ++ * UNLOCK rq(0)->lock
3271 ++ *
3272 ++ * LOCK rq(1)->lock
3273 ++ * enqueue X
3274 ++ * UNLOCK rq(1)->lock
3275 ++ *
3276 ++ * LOCK rq(1)->lock // orders against CPU2
3277 ++ * sched-out Z
3278 ++ * sched-in X
3279 ++ * UNLOCK rq(1)->lock
3280 ++ *
3281 ++ *
3282 ++ * BLOCKING -- aka. SLEEP + WAKEUP
3283 ++ *
3284 ++ * For blocking we (obviously) need to provide the same guarantee as for
3285 ++ * migration. However the means are completely different as there is no lock
3286 ++ * chain to provide order. Instead we do:
3287 ++ *
3288 ++ * 1) smp_store_release(X->on_cpu, 0) -- finish_task()
3289 ++ * 2) smp_cond_load_acquire(!X->on_cpu) -- try_to_wake_up()
3290 ++ *
3291 ++ * Example:
3292 ++ *
3293 ++ * CPU0 (schedule) CPU1 (try_to_wake_up) CPU2 (schedule)
3294 ++ *
3295 ++ * LOCK rq(0)->lock LOCK X->pi_lock
3296 ++ * dequeue X
3297 ++ * sched-out X
3298 ++ * smp_store_release(X->on_cpu, 0);
3299 ++ *
3300 ++ * smp_cond_load_acquire(&X->on_cpu, !VAL);
3301 ++ * X->state = WAKING
3302 ++ * set_task_cpu(X,2)
3303 ++ *
3304 ++ * LOCK rq(2)->lock
3305 ++ * enqueue X
3306 ++ * X->state = RUNNING
3307 ++ * UNLOCK rq(2)->lock
3308 ++ *
3309 ++ * LOCK rq(2)->lock // orders against CPU1
3310 ++ * sched-out Z
3311 ++ * sched-in X
3312 ++ * UNLOCK rq(2)->lock
3313 ++ *
3314 ++ * UNLOCK X->pi_lock
3315 ++ * UNLOCK rq(0)->lock
3316 ++ *
3317 ++ *
3318 ++ * However; for wakeups there is a second guarantee we must provide, namely we
3319 ++ * must observe the state that led to our wakeup. That is, not only must our
3320 ++ * task observe its own prior state, it must also observe the stores prior to
3321 ++ * its wakeup.
3322 ++ *
3323 ++ * This means that any means of doing remote wakeups must order the CPU doing
3324 ++ * the wakeup against the CPU the task is going to end up running on. This,
3325 ++ * however, is already required for the regular Program-Order guarantee above,
3326 ++ * since the waking CPU is the one issuing the ACQUIRE (smp_cond_load_acquire).
3327 ++ *
3328 ++ */
3329 ++
3330 ++/**
3331 ++ * try_to_wake_up - wake up a thread
3332 ++ * @p: the thread to be awakened
3333 ++ * @state: the mask of task states that can be woken
3334 ++ * @wake_flags: wake modifier flags (WF_*)
3335 ++ *
3336 ++ * Conceptually does:
3337 ++ *
3338 ++ * If (@state & @p->state) @p->state = TASK_RUNNING.
3339 ++ *
3340 ++ * If the task was not queued/runnable, also place it back on a runqueue.
3341 ++ *
3342 ++ * This function is atomic against schedule() which would dequeue the task.
3343 ++ *
3344 ++ * It issues a full memory barrier before accessing @p->state, see the comment
3345 ++ * with set_current_state().
3346 ++ *
3347 ++ * Uses p->pi_lock to serialize against concurrent wake-ups.
3348 ++ *
3349 ++ * Relies on p->pi_lock stabilizing:
3350 ++ * - p->sched_class
3351 ++ * - p->cpus_ptr
3352 ++ * - p->sched_task_group
3353 ++ * in order to do migration, see its use of select_task_rq()/set_task_cpu().
3354 ++ *
3355 ++ * Tries really hard to only take one task_rq(p)->lock for performance.
3356 ++ * Takes rq->lock in:
3357 ++ * - ttwu_runnable() -- old rq, unavoidable, see comment there;
3358 ++ * - ttwu_queue() -- new rq, for enqueue of the task;
3359 ++ * - psi_ttwu_dequeue() -- much sadness :-( accounting will kill us.
3360 ++ *
3361 ++ * As a consequence we race really badly with just about everything. See the
3362 ++ * many memory barriers and their comments for details.
3363 ++ *
3364 ++ * Return: %true if @p->state changes (an actual wakeup was done),
3365 ++ * %false otherwise.
3366 ++ */
3367 ++static int try_to_wake_up(struct task_struct *p, unsigned int state,
3368 ++ int wake_flags)
3369 ++{
3370 ++ unsigned long flags;
3371 ++ int cpu, success = 0;
3372 ++
3373 ++ preempt_disable();
3374 ++ if (p == current) {
3375 ++ /*
3376 ++ * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
3377 ++ * == smp_processor_id()'. Together this means we can special
3378 ++ * case the whole 'p->on_rq && ttwu_runnable()' case below
3379 ++ * without taking any locks.
3380 ++ *
3381 ++ * In particular:
3382 ++ * - we rely on Program-Order guarantees for all the ordering,
3383 ++ * - we're serialized against set_special_state() by virtue of
3384 ++ * it disabling IRQs (this allows not taking ->pi_lock).
3385 ++ */
3386 ++ if (!ttwu_state_match(p, state, &success))
3387 ++ goto out;
3388 ++
3389 ++ trace_sched_waking(p);
3390 ++ WRITE_ONCE(p->__state, TASK_RUNNING);
3391 ++ trace_sched_wakeup(p);
3392 ++ goto out;
3393 ++ }
3394 ++
3395 ++ /*
3396 ++ * If we are going to wake up a thread waiting for CONDITION we
3397 ++ * need to ensure that CONDITION=1 done by the caller can not be
3398 ++ * reordered with p->state check below. This pairs with smp_store_mb()
3399 ++ * in set_current_state() that the waiting thread does.
3400 ++ */
3401 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
3402 ++ smp_mb__after_spinlock();
3403 ++ if (!ttwu_state_match(p, state, &success))
3404 ++ goto unlock;
3405 ++
3406 ++ trace_sched_waking(p);
3407 ++
3408 ++ /*
3409 ++ * Ensure we load p->on_rq _after_ p->state, otherwise it would
3410 ++ * be possible to, falsely, observe p->on_rq == 0 and get stuck
3411 ++ * in smp_cond_load_acquire() below.
3412 ++ *
3413 ++ * sched_ttwu_pending() try_to_wake_up()
3414 ++ * STORE p->on_rq = 1 LOAD p->state
3415 ++ * UNLOCK rq->lock
3416 ++ *
3417 ++ * __schedule() (switch to task 'p')
3418 ++ * LOCK rq->lock smp_rmb();
3419 ++ * smp_mb__after_spinlock();
3420 ++ * UNLOCK rq->lock
3421 ++ *
3422 ++ * [task p]
3423 ++ * STORE p->state = UNINTERRUPTIBLE LOAD p->on_rq
3424 ++ *
3425 ++ * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
3426 ++ * __schedule(). See the comment for smp_mb__after_spinlock().
3427 ++ *
3428 ++ * A similar smp_rmb() lives in try_invoke_on_locked_down_task().
3429 ++ */
3430 ++ smp_rmb();
3431 ++ if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags))
3432 ++ goto unlock;
3433 ++
3434 ++#ifdef CONFIG_SMP
3435 ++ /*
3436 ++ * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
3437 ++ * possible to, falsely, observe p->on_cpu == 0.
3438 ++ *
3439 ++ * One must be running (->on_cpu == 1) in order to remove oneself
3440 ++ * from the runqueue.
3441 ++ *
3442 ++ * __schedule() (switch to task 'p') try_to_wake_up()
3443 ++ * STORE p->on_cpu = 1 LOAD p->on_rq
3444 ++ * UNLOCK rq->lock
3445 ++ *
3446 ++ * __schedule() (put 'p' to sleep)
3447 ++ * LOCK rq->lock smp_rmb();
3448 ++ * smp_mb__after_spinlock();
3449 ++ * STORE p->on_rq = 0 LOAD p->on_cpu
3450 ++ *
3451 ++ * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
3452 ++ * __schedule(). See the comment for smp_mb__after_spinlock().
3453 ++ *
3454 ++ * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
3455 ++ * schedule()'s deactivate_task() has 'happened' and p will no longer
3456 ++ * care about its own p->state. See the comment in __schedule().
3457 ++ */
3458 ++ smp_acquire__after_ctrl_dep();
3459 ++
3460 ++ /*
3461 ++ * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
3462 ++ * == 0), which means we need to do an enqueue, change p->state to
3463 ++ * TASK_WAKING such that we can unlock p->pi_lock before doing the
3464 ++ * enqueue, such as ttwu_queue_wakelist().
3465 ++ */
3466 ++ WRITE_ONCE(p->__state, TASK_WAKING);
3467 ++
3468 ++ /*
3469 ++ * If the owning (remote) CPU is still in the middle of schedule() with
3470 ++ * this task as prev, consider queueing p on the remote CPU's wake_list
3471 ++ * which potentially sends an IPI instead of spinning on p->on_cpu to
3472 ++ * let the waker make forward progress. This is safe because IRQs are
3473 ++ * disabled and the IPI will deliver after on_cpu is cleared.
3474 ++ *
3475 ++ * Ensure we load task_cpu(p) after p->on_cpu:
3476 ++ *
3477 ++ * set_task_cpu(p, cpu);
3478 ++ * STORE p->cpu = @cpu
3479 ++ * __schedule() (switch to task 'p')
3480 ++ * LOCK rq->lock
3481 ++ * smp_mb__after_spin_lock() smp_cond_load_acquire(&p->on_cpu)
3482 ++ * STORE p->on_cpu = 1 LOAD p->cpu
3483 ++ *
3484 ++ * to ensure we observe the correct CPU on which the task is currently
3485 ++ * scheduling.
3486 ++ */
3487 ++ if (smp_load_acquire(&p->on_cpu) &&
3488 ++ ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
3489 ++ goto unlock;
3490 ++
3491 ++ /*
3492 ++ * If the owning (remote) CPU is still in the middle of schedule() with
3493 ++ * this task as prev, wait until it's done referencing the task.
3494 ++ *
3495 ++ * Pairs with the smp_store_release() in finish_task().
3496 ++ *
3497 ++ * This ensures that tasks getting woken will be fully ordered against
3498 ++ * their previous state and preserve Program Order.
3499 ++ */
3500 ++ smp_cond_load_acquire(&p->on_cpu, !VAL);
3501 ++
3502 ++ sched_task_ttwu(p);
3503 ++
3504 ++ cpu = select_task_rq(p);
3505 ++
3506 ++ if (cpu != task_cpu(p)) {
3507 ++ if (p->in_iowait) {
3508 ++ delayacct_blkio_end(p);
3509 ++ atomic_dec(&task_rq(p)->nr_iowait);
3510 ++ }
3511 ++
3512 ++ wake_flags |= WF_MIGRATED;
3513 ++ psi_ttwu_dequeue(p);
3514 ++ set_task_cpu(p, cpu);
3515 ++ }
3516 ++#else
3517 ++ cpu = task_cpu(p);
3518 ++#endif /* CONFIG_SMP */
3519 ++
3520 ++ ttwu_queue(p, cpu, wake_flags);
3521 ++unlock:
3522 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
3523 ++out:
3524 ++ if (success)
3525 ++ ttwu_stat(p, task_cpu(p), wake_flags);
3526 ++ preempt_enable();
3527 ++
3528 ++ return success;
3529 ++}
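/*
 * [Editor's illustration, not part of the patch] A user-space sketch of why
 * try_to_wake_up() needs a full barrier between the caller's CONDITION store
 * and the p->state load: the smp_mb__after_spinlock() above pairs with
 * smp_store_mb() in set_current_state().  Each thread stores its own flag and
 * then reads the other's; with seq_cst fences on both sides, at least one
 * thread must observe the other's store, so a wakeup cannot be lost.  All
 * names here are illustrative.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic int cond_set, task_sleeping;
static int waker_saw_sleeping, sleeper_saw_cond;

static void *sleeper(void *arg)
{
	(void)arg;
	/* set_current_state(TASK_UNINTERRUPTIBLE) implies a full barrier */
	atomic_store_explicit(&task_sleeping, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	/* re-check the condition before really sleeping */
	sleeper_saw_cond = atomic_load_explicit(&cond_set, memory_order_relaxed);
	return NULL;
}

static void *waker(void *arg)
{
	(void)arg;
	/* caller sets CONDITION = 1, then try_to_wake_up() checks p->state */
	atomic_store_explicit(&cond_set, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);  /* smp_mb__after_spinlock() */
	waker_saw_sleeping = atomic_load_explicit(&task_sleeping, memory_order_relaxed);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, sleeper, NULL);
	pthread_create(&b, NULL, waker, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	/* with both fences in place this can never print 0/0 (a lost wakeup) */
	printf("waker_saw_sleeping=%d sleeper_saw_cond=%d\n",
	       waker_saw_sleeping, sleeper_saw_cond);
	return 0;
}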
3530 ++
3531 ++/**
3532 ++ * try_invoke_on_locked_down_task - Invoke a function on task in fixed state
3533 ++ * @p: Process for which the function is to be invoked, can be @current.
3534 ++ * @func: Function to invoke.
3535 ++ * @arg: Argument to function.
3536 ++ *
3537 ++ * If the specified task can be quickly locked into a definite state
3538 ++ * (either sleeping or on a given runqueue), arrange to keep it in that
3539 ++ * state while invoking @func(@arg). This function can use ->on_rq and
3540 ++ * task_curr() to work out what the state is, if required. Given that
3541 ++ * @func can be invoked with a runqueue lock held, it had better be quite
3542 ++ * lightweight.
3543 ++ *
3544 ++ * Returns:
3545 ++ * @false if the task slipped out from under the locks.
3546 ++ * @true if the task was locked onto a runqueue or is sleeping.
3547 ++ * However, @func can override this by returning @false.
3548 ++ */
3549 ++bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg)
3550 ++{
3551 ++ struct rq_flags rf;
3552 ++ bool ret = false;
3553 ++ struct rq *rq;
3554 ++
3555 ++ raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
3556 ++ if (p->on_rq) {
3557 ++ rq = __task_rq_lock(p, &rf);
3558 ++ if (task_rq(p) == rq)
3559 ++ ret = func(p, arg);
3560 ++ __task_rq_unlock(rq, &rf);
3561 ++ } else {
3562 ++ switch (READ_ONCE(p->__state)) {
3563 ++ case TASK_RUNNING:
3564 ++ case TASK_WAKING:
3565 ++ break;
3566 ++ default:
3567 ++ smp_rmb(); // See smp_rmb() comment in try_to_wake_up().
3568 ++ if (!p->on_rq)
3569 ++ ret = func(p, arg);
3570 ++ }
3571 ++ }
3572 ++ raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
3573 ++ return ret;
3574 ++}
3575 ++
3576 ++/**
3577 ++ * wake_up_process - Wake up a specific process
3578 ++ * @p: The process to be woken up.
3579 ++ *
3580 ++ * Attempt to wake up the nominated process and move it to the set of runnable
3581 ++ * processes.
3582 ++ *
3583 ++ * Return: 1 if the process was woken up, 0 if it was already running.
3584 ++ *
3585 ++ * This function executes a full memory barrier before accessing the task state.
3586 ++ */
3587 ++int wake_up_process(struct task_struct *p)
3588 ++{
3589 ++ return try_to_wake_up(p, TASK_NORMAL, 0);
3590 ++}
3591 ++EXPORT_SYMBOL(wake_up_process);
3592 ++
3593 ++int wake_up_state(struct task_struct *p, unsigned int state)
3594 ++{
3595 ++ return try_to_wake_up(p, state, 0);
3596 ++}
3597 ++
3598 ++/*
3599 ++ * Perform scheduler related setup for a newly forked process p.
3600 ++ * p is forked by current.
3601 ++ *
3602 ++ * __sched_fork() is basic setup used by init_idle() too:
3603 ++ */
3604 ++static inline void __sched_fork(unsigned long clone_flags, struct task_struct *p)
3605 ++{
3606 ++ p->on_rq = 0;
3607 ++ p->on_cpu = 0;
3608 ++ p->utime = 0;
3609 ++ p->stime = 0;
3610 ++ p->sched_time = 0;
3611 ++
3612 ++#ifdef CONFIG_PREEMPT_NOTIFIERS
3613 ++ INIT_HLIST_HEAD(&p->preempt_notifiers);
3614 ++#endif
3615 ++
3616 ++#ifdef CONFIG_COMPACTION
3617 ++ p->capture_control = NULL;
3618 ++#endif
3619 ++#ifdef CONFIG_SMP
3620 ++ p->wake_entry.u_flags = CSD_TYPE_TTWU;
3621 ++#endif
3622 ++}
3623 ++
3624 ++/*
3625 ++ * fork()/clone()-time setup:
3626 ++ */
3627 ++int sched_fork(unsigned long clone_flags, struct task_struct *p)
3628 ++{
3629 ++ unsigned long flags;
3630 ++ struct rq *rq;
3631 ++
3632 ++ __sched_fork(clone_flags, p);
3633 ++ /*
3634 ++ * We mark the process as NEW here. This guarantees that
3635 ++ * nobody will actually run it, and a signal or other external
3636 ++ * event cannot wake it up and insert it on the runqueue either.
3637 ++ */
3638 ++ p->__state = TASK_NEW;
3639 ++
3640 ++ /*
3641 ++ * Make sure we do not leak PI boosting priority to the child.
3642 ++ */
3643 ++ p->prio = current->normal_prio;
3644 ++
3645 ++ /*
3646 ++ * Revert to default priority/policy on fork if requested.
3647 ++ */
3648 ++ if (unlikely(p->sched_reset_on_fork)) {
3649 ++ if (task_has_rt_policy(p)) {
3650 ++ p->policy = SCHED_NORMAL;
3651 ++ p->static_prio = NICE_TO_PRIO(0);
3652 ++ p->rt_priority = 0;
3653 ++ } else if (PRIO_TO_NICE(p->static_prio) < 0)
3654 ++ p->static_prio = NICE_TO_PRIO(0);
3655 ++
3656 ++ p->prio = p->normal_prio = p->static_prio;
3657 ++
3658 ++ /*
3659 ++ * We don't need the reset flag anymore after the fork. It has
3660 ++ * fulfilled its duty:
3661 ++ */
3662 ++ p->sched_reset_on_fork = 0;
3663 ++ }
3664 ++
3665 ++ /*
3666 ++ * The child is not yet in the pid-hash so no cgroup attach races,
3667 ++ * and the cgroup is pinned to this child because cgroup_fork()
3668 ++ * is run before sched_fork().
3669 ++ *
3670 ++ * Silence PROVE_RCU.
3671 ++ */
3672 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
3673 ++ /*
3674 ++ * Share the timeslice between parent and child, thus the
3675 ++ * total amount of pending timeslices in the system doesn't change,
3676 ++ * resulting in more scheduling fairness.
3677 ++ */
3678 ++ rq = this_rq();
3679 ++ raw_spin_lock(&rq->lock);
3680 ++
3681 ++ rq->curr->time_slice /= 2;
3682 ++ p->time_slice = rq->curr->time_slice;
3683 ++#ifdef CONFIG_SCHED_HRTICK
3684 ++ hrtick_start(rq, rq->curr->time_slice);
3685 ++#endif
3686 ++
3687 ++ if (p->time_slice < RESCHED_NS) {
3688 ++ p->time_slice = sched_timeslice_ns;
3689 ++ resched_curr(rq);
3690 ++ }
3691 ++ sched_task_fork(p, rq);
3692 ++ raw_spin_unlock(&rq->lock);
3693 ++
3694 ++ rseq_migrate(p);
3695 ++ /*
3696 ++ * We're setting the CPU for the first time, we don't migrate,
3697 ++ * so use __set_task_cpu().
3698 ++ */
3699 ++ __set_task_cpu(p, cpu_of(rq));
3700 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
3701 ++
3702 ++#ifdef CONFIG_SCHED_INFO
3703 ++ if (unlikely(sched_info_on()))
3704 ++ memset(&p->sched_info, 0, sizeof(p->sched_info));
3705 ++#endif
3706 ++ init_task_preempt_count(p);
3707 ++
3708 ++ return 0;
3709 ++}
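/*
 * [Editor's illustration, not part of the patch] The fork-time timeslice
 * split above can be modelled with plain arithmetic: the parent keeps half of
 * its remaining slice and the child inherits the other half, so the total
 * pending timeslice in the system is unchanged.  The constants below (a 4 ms
 * default slice and a 100 us "as good as expired" cutoff) are assumptions for
 * the demo, not values taken from the patch.
 */
#include <stdio.h>

#define DEMO_SLICE_NS   (4ULL * 1000 * 1000)  /* assumed default timeslice */
#define DEMO_RESCHED_NS (100ULL * 1000)       /* assumed RESCHED_NS-like cutoff */

int main(void)
{
	unsigned long long parent = DEMO_SLICE_NS / 3;  /* a partially used slice */
	unsigned long long child;

	parent /= 2;      /* rq->curr->time_slice /= 2;          */
	child = parent;   /* p->time_slice = rq->curr->time_slice */

	if (child < DEMO_RESCHED_NS)
		child = DEMO_SLICE_NS;  /* p->time_slice = sched_timeslice_ns;
					 * resched_curr(rq) for the parent      */

	printf("parent=%llu ns, child=%llu ns\n", parent, child);
	return 0;
}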
3710 ++
3711 ++void sched_post_fork(struct task_struct *p) {}
3712 ++
3713 ++#ifdef CONFIG_SCHEDSTATS
3714 ++
3715 ++DEFINE_STATIC_KEY_FALSE(sched_schedstats);
3716 ++
3717 ++static void set_schedstats(bool enabled)
3718 ++{
3719 ++ if (enabled)
3720 ++ static_branch_enable(&sched_schedstats);
3721 ++ else
3722 ++ static_branch_disable(&sched_schedstats);
3723 ++}
3724 ++
3725 ++void force_schedstat_enabled(void)
3726 ++{
3727 ++ if (!schedstat_enabled()) {
3728 ++ pr_info("kernel profiling enabled schedstats, disable via kernel.sched_schedstats.\n");
3729 ++ static_branch_enable(&sched_schedstats);
3730 ++ }
3731 ++}
3732 ++
3733 ++static int __init setup_schedstats(char *str)
3734 ++{
3735 ++ int ret = 0;
3736 ++ if (!str)
3737 ++ goto out;
3738 ++
3739 ++ if (!strcmp(str, "enable")) {
3740 ++ set_schedstats(true);
3741 ++ ret = 1;
3742 ++ } else if (!strcmp(str, "disable")) {
3743 ++ set_schedstats(false);
3744 ++ ret = 1;
3745 ++ }
3746 ++out:
3747 ++ if (!ret)
3748 ++ pr_warn("Unable to parse schedstats=\n");
3749 ++
3750 ++ return ret;
3751 ++}
3752 ++__setup("schedstats=", setup_schedstats);
3753 ++
3754 ++#ifdef CONFIG_PROC_SYSCTL
3755 ++int sysctl_schedstats(struct ctl_table *table, int write,
3756 ++ void __user *buffer, size_t *lenp, loff_t *ppos)
3757 ++{
3758 ++ struct ctl_table t;
3759 ++ int err;
3760 ++ int state = static_branch_likely(&sched_schedstats);
3761 ++
3762 ++ if (write && !capable(CAP_SYS_ADMIN))
3763 ++ return -EPERM;
3764 ++
3765 ++ t = *table;
3766 ++ t.data = &state;
3767 ++ err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
3768 ++ if (err < 0)
3769 ++ return err;
3770 ++ if (write)
3771 ++ set_schedstats(state);
3772 ++ return err;
3773 ++}
3774 ++#endif /* CONFIG_PROC_SYSCTL */
3775 ++#endif /* CONFIG_SCHEDSTATS */
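/*
 * [Editor's illustration, not part of the patch] The schedstats switch above
 * is reachable from user space either as the schedstats= boot parameter or
 * through the kernel.sched_schedstats sysctl.  A minimal sketch that reads
 * /proc/sys/kernel/sched_schedstats; it assumes a kernel built with
 * CONFIG_SCHEDSTATS=y (writing the file additionally needs CAP_SYS_ADMIN).
 */
#include <stdio.h>

int main(void)
{
	const char *path = "/proc/sys/kernel/sched_schedstats";
	FILE *f = fopen(path, "r");
	int val;

	if (!f || fscanf(f, "%d", &val) != 1) {
		perror(path);
		return 1;
	}
	fclose(f);
	printf("schedstats currently %s\n", val ? "enabled" : "disabled");
	return 0;
}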
3776 ++
3777 ++/*
3778 ++ * wake_up_new_task - wake up a newly created task for the first time.
3779 ++ *
3780 ++ * This function will do some initial scheduler statistics housekeeping
3781 ++ * that must be done for every newly created context, then puts the task
3782 ++ * on the runqueue and wakes it.
3783 ++ */
3784 ++void wake_up_new_task(struct task_struct *p)
3785 ++{
3786 ++ unsigned long flags;
3787 ++ struct rq *rq;
3788 ++
3789 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
3790 ++ WRITE_ONCE(p->__state, TASK_RUNNING);
3791 ++ rq = cpu_rq(select_task_rq(p));
3792 ++#ifdef CONFIG_SMP
3793 ++ rseq_migrate(p);
3794 ++ /*
3795 ++ * Fork balancing, do it here and not earlier because:
3796 ++ * - cpus_ptr can change in the fork path
3797 ++ * - any previously selected CPU might disappear through hotplug
3798 ++ *
3799 ++ * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
3800 ++ * as we're not fully set-up yet.
3801 ++ */
3802 ++ __set_task_cpu(p, cpu_of(rq));
3803 ++#endif
3804 ++
3805 ++ raw_spin_lock(&rq->lock);
3806 ++ update_rq_clock(rq);
3807 ++
3808 ++ activate_task(p, rq);
3809 ++ trace_sched_wakeup_new(p);
3810 ++ check_preempt_curr(rq);
3811 ++
3812 ++ raw_spin_unlock(&rq->lock);
3813 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
3814 ++}
3815 ++
3816 ++#ifdef CONFIG_PREEMPT_NOTIFIERS
3817 ++
3818 ++static DEFINE_STATIC_KEY_FALSE(preempt_notifier_key);
3819 ++
3820 ++void preempt_notifier_inc(void)
3821 ++{
3822 ++ static_branch_inc(&preempt_notifier_key);
3823 ++}
3824 ++EXPORT_SYMBOL_GPL(preempt_notifier_inc);
3825 ++
3826 ++void preempt_notifier_dec(void)
3827 ++{
3828 ++ static_branch_dec(&preempt_notifier_key);
3829 ++}
3830 ++EXPORT_SYMBOL_GPL(preempt_notifier_dec);
3831 ++
3832 ++/**
3833 ++ * preempt_notifier_register - tell me when current is being preempted & rescheduled
3834 ++ * @notifier: notifier struct to register
3835 ++ */
3836 ++void preempt_notifier_register(struct preempt_notifier *notifier)
3837 ++{
3838 ++ if (!static_branch_unlikely(&preempt_notifier_key))
3839 ++ WARN(1, "registering preempt_notifier while notifiers disabled\n");
3840 ++
3841 ++ hlist_add_head(&notifier->link, &current->preempt_notifiers);
3842 ++}
3843 ++EXPORT_SYMBOL_GPL(preempt_notifier_register);
3844 ++
3845 ++/**
3846 ++ * preempt_notifier_unregister - no longer interested in preemption notifications
3847 ++ * @notifier: notifier struct to unregister
3848 ++ *
3849 ++ * This is *not* safe to call from within a preemption notifier.
3850 ++ */
3851 ++void preempt_notifier_unregister(struct preempt_notifier *notifier)
3852 ++{
3853 ++ hlist_del(&notifier->link);
3854 ++}
3855 ++EXPORT_SYMBOL_GPL(preempt_notifier_unregister);
3856 ++
3857 ++static void __fire_sched_in_preempt_notifiers(struct task_struct *curr)
3858 ++{
3859 ++ struct preempt_notifier *notifier;
3860 ++
3861 ++ hlist_for_each_entry(notifier, &curr->preempt_notifiers, link)
3862 ++ notifier->ops->sched_in(notifier, raw_smp_processor_id());
3863 ++}
3864 ++
3865 ++static __always_inline void fire_sched_in_preempt_notifiers(struct task_struct *curr)
3866 ++{
3867 ++ if (static_branch_unlikely(&preempt_notifier_key))
3868 ++ __fire_sched_in_preempt_notifiers(curr);
3869 ++}
3870 ++
3871 ++static void
3872 ++__fire_sched_out_preempt_notifiers(struct task_struct *curr,
3873 ++ struct task_struct *next)
3874 ++{
3875 ++ struct preempt_notifier *notifier;
3876 ++
3877 ++ hlist_for_each_entry(notifier, &curr->preempt_notifiers, link)
3878 ++ notifier->ops->sched_out(notifier, next);
3879 ++}
3880 ++
3881 ++static __always_inline void
3882 ++fire_sched_out_preempt_notifiers(struct task_struct *curr,
3883 ++ struct task_struct *next)
3884 ++{
3885 ++ if (static_branch_unlikely(&preempt_notifier_key))
3886 ++ __fire_sched_out_preempt_notifiers(curr, next);
3887 ++}
3888 ++
3889 ++#else /* !CONFIG_PREEMPT_NOTIFIERS */
3890 ++
3891 ++static inline void fire_sched_in_preempt_notifiers(struct task_struct *curr)
3892 ++{
3893 ++}
3894 ++
3895 ++static inline void
3896 ++fire_sched_out_preempt_notifiers(struct task_struct *curr,
3897 ++ struct task_struct *next)
3898 ++{
3899 ++}
3900 ++
3901 ++#endif /* CONFIG_PREEMPT_NOTIFIERS */
3902 ++
3903 ++static inline void prepare_task(struct task_struct *next)
3904 ++{
3905 ++ /*
3906 ++ * Claim the task as running, we do this before switching to it
3907 ++ * such that any running task will have this set.
3908 ++ *
3909 ++ * See the ttwu() WF_ON_CPU case and its ordering comment.
3910 ++ */
3911 ++ WRITE_ONCE(next->on_cpu, 1);
3912 ++}
3913 ++
3914 ++static inline void finish_task(struct task_struct *prev)
3915 ++{
3916 ++#ifdef CONFIG_SMP
3917 ++ /*
3918 ++ * This must be the very last reference to @prev from this CPU. After
3919 ++ * p->on_cpu is cleared, the task can be moved to a different CPU. We
3920 ++ * must ensure this doesn't happen until the switch is completely
3921 ++ * finished.
3922 ++ *
3923 ++ * In particular, the load of prev->state in finish_task_switch() must
3924 ++ * happen before this.
3925 ++ *
3926 ++ * Pairs with the smp_cond_load_acquire() in try_to_wake_up().
3927 ++ */
3928 ++ smp_store_release(&prev->on_cpu, 0);
3929 ++#else
3930 ++ prev->on_cpu = 0;
3931 ++#endif
3932 ++}
3933 ++
3934 ++#ifdef CONFIG_SMP
3935 ++
3936 ++static void do_balance_callbacks(struct rq *rq, struct callback_head *head)
3937 ++{
3938 ++ void (*func)(struct rq *rq);
3939 ++ struct callback_head *next;
3940 ++
3941 ++ lockdep_assert_held(&rq->lock);
3942 ++
3943 ++ while (head) {
3944 ++ func = (void (*)(struct rq *))head->func;
3945 ++ next = head->next;
3946 ++ head->next = NULL;
3947 ++ head = next;
3948 ++
3949 ++ func(rq);
3950 ++ }
3951 ++}
3952 ++
3953 ++static void balance_push(struct rq *rq);
3954 ++
3955 ++struct callback_head balance_push_callback = {
3956 ++ .next = NULL,
3957 ++ .func = (void (*)(struct callback_head *))balance_push,
3958 ++};
3959 ++
3960 ++static inline struct callback_head *splice_balance_callbacks(struct rq *rq)
3961 ++{
3962 ++ struct callback_head *head = rq->balance_callback;
3963 ++
3964 ++ if (head) {
3965 ++ lockdep_assert_held(&rq->lock);
3966 ++ rq->balance_callback = NULL;
3967 ++ }
3968 ++
3969 ++ return head;
3970 ++}
3971 ++
3972 ++static void __balance_callbacks(struct rq *rq)
3973 ++{
3974 ++ do_balance_callbacks(rq, splice_balance_callbacks(rq));
3975 ++}
3976 ++
3977 ++static inline void balance_callbacks(struct rq *rq, struct callback_head *head)
3978 ++{
3979 ++ unsigned long flags;
3980 ++
3981 ++ if (unlikely(head)) {
3982 ++ raw_spin_lock_irqsave(&rq->lock, flags);
3983 ++ do_balance_callbacks(rq, head);
3984 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
3985 ++ }
3986 ++}
3987 ++
3988 ++#else
3989 ++
3990 ++static inline void __balance_callbacks(struct rq *rq)
3991 ++{
3992 ++}
3993 ++
3994 ++static inline struct callback_head *splice_balance_callbacks(struct rq *rq)
3995 ++{
3996 ++ return NULL;
3997 ++}
3998 ++
3999 ++static inline void balance_callbacks(struct rq *rq, struct callback_head *head)
4000 ++{
4001 ++}
4002 ++
4003 ++#endif
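/*
 * [Editor's illustration, not part of the patch] The balance-callback helpers
 * above follow a common pattern: callbacks are pushed onto a singly linked
 * list while the lock is held, the whole list is spliced off in one step, and
 * each entry is then run exactly once.  A stand-alone sketch of that pattern
 * (without the locking) follows; the names echo the patch but the code is
 * illustrative only.
 */
#include <stddef.h>
#include <stdio.h>

struct demo_callback {
	struct demo_callback *next;
	void (*func)(struct demo_callback *cb);
};

static struct demo_callback *pending;  /* plays the role of rq->balance_callback */

static void queue_callback(struct demo_callback *cb)
{
	cb->next = pending;
	pending = cb;
}

static struct demo_callback *splice_callbacks(void)
{
	struct demo_callback *head = pending;  /* take the whole list ...   */

	pending = NULL;                        /* ... and reset the anchor  */
	return head;
}

static void run_callbacks(struct demo_callback *head)
{
	while (head) {
		struct demo_callback *next = head->next;

		head->next = NULL;             /* detach before calling */
		head->func(head);
		head = next;
	}
}

static void say_hi(struct demo_callback *cb)
{
	printf("callback %p ran\n", (void *)cb);
}

int main(void)
{
	struct demo_callback a = { .func = say_hi }, b = { .func = say_hi };

	queue_callback(&a);
	queue_callback(&b);
	run_callbacks(splice_callbacks());
	return 0;
}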
4004 ++
4005 ++static inline void
4006 ++prepare_lock_switch(struct rq *rq, struct task_struct *next)
4007 ++{
4008 ++ /*
4009 ++ * Since the runqueue lock will be released by the next
4010 ++ * task (which is an invalid locking op but in the case
4011 ++ * of the scheduler it's an obvious special-case), we
4012 ++ * do an early lockdep release here:
4013 ++ */
4014 ++ spin_release(&rq->lock.dep_map, _THIS_IP_);
4015 ++#ifdef CONFIG_DEBUG_SPINLOCK
4016 ++ /* this is a valid case when another task releases the spinlock */
4017 ++ rq->lock.owner = next;
4018 ++#endif
4019 ++}
4020 ++
4021 ++static inline void finish_lock_switch(struct rq *rq)
4022 ++{
4023 ++ /*
4024 ++ * If we are tracking spinlock dependencies then we have to
4025 ++ * fix up the runqueue lock - which gets 'carried over' from
4026 ++ * prev into current:
4027 ++ */
4028 ++ spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
4029 ++ __balance_callbacks(rq);
4030 ++ raw_spin_unlock_irq(&rq->lock);
4031 ++}
4032 ++
4033 ++/*
4034 ++ * NOP if the arch has not defined these:
4035 ++ */
4036 ++
4037 ++#ifndef prepare_arch_switch
4038 ++# define prepare_arch_switch(next) do { } while (0)
4039 ++#endif
4040 ++
4041 ++#ifndef finish_arch_post_lock_switch
4042 ++# define finish_arch_post_lock_switch() do { } while (0)
4043 ++#endif
4044 ++
4045 ++static inline void kmap_local_sched_out(void)
4046 ++{
4047 ++#ifdef CONFIG_KMAP_LOCAL
4048 ++ if (unlikely(current->kmap_ctrl.idx))
4049 ++ __kmap_local_sched_out();
4050 ++#endif
4051 ++}
4052 ++
4053 ++static inline void kmap_local_sched_in(void)
4054 ++{
4055 ++#ifdef CONFIG_KMAP_LOCAL
4056 ++ if (unlikely(current->kmap_ctrl.idx))
4057 ++ __kmap_local_sched_in();
4058 ++#endif
4059 ++}
4060 ++
4061 ++/**
4062 ++ * prepare_task_switch - prepare to switch tasks
4063 ++ * @rq: the runqueue preparing to switch
4064 ++ * @next: the task we are going to switch to.
4065 ++ *
4066 ++ * This is called with the rq lock held and interrupts off. It must
4067 ++ * be paired with a subsequent finish_task_switch after the context
4068 ++ * switch.
4069 ++ *
4070 ++ * prepare_task_switch sets up locking and calls architecture specific
4071 ++ * hooks.
4072 ++ */
4073 ++static inline void
4074 ++prepare_task_switch(struct rq *rq, struct task_struct *prev,
4075 ++ struct task_struct *next)
4076 ++{
4077 ++ kcov_prepare_switch(prev);
4078 ++ sched_info_switch(rq, prev, next);
4079 ++ perf_event_task_sched_out(prev, next);
4080 ++ rseq_preempt(prev);
4081 ++ fire_sched_out_preempt_notifiers(prev, next);
4082 ++ kmap_local_sched_out();
4083 ++ prepare_task(next);
4084 ++ prepare_arch_switch(next);
4085 ++}
4086 ++
4087 ++/**
4088 ++ * finish_task_switch - clean up after a task-switch
4089 ++ * @rq: runqueue associated with task-switch
4090 ++ * @prev: the thread we just switched away from.
4091 ++ *
4092 ++ * finish_task_switch must be called after the context switch, paired
4093 ++ * with a prepare_task_switch call before the context switch.
4094 ++ * finish_task_switch will reconcile locking set up by prepare_task_switch,
4095 ++ * and do any other architecture-specific cleanup actions.
4096 ++ *
4097 ++ * Note that we may have delayed dropping an mm in context_switch(). If
4098 ++ * so, we finish that here outside of the runqueue lock. (Doing it
4099 ++ * with the lock held can cause deadlocks; see schedule() for
4100 ++ * details.)
4101 ++ *
4102 ++ * The context switch has flipped the stack from under us and restored the
4103 ++ * local variables which were saved when this task called schedule() in the
4104 ++ * past. prev == current is still correct but we need to recalculate this_rq
4105 ++ * because prev may have moved to another CPU.
4106 ++ */
4107 ++static struct rq *finish_task_switch(struct task_struct *prev)
4108 ++ __releases(rq->lock)
4109 ++{
4110 ++ struct rq *rq = this_rq();
4111 ++ struct mm_struct *mm = rq->prev_mm;
4112 ++ long prev_state;
4113 ++
4114 ++ /*
4115 ++ * The previous task will have left us with a preempt_count of 2
4116 ++ * because it left us after:
4117 ++ *
4118 ++ * schedule()
4119 ++ * preempt_disable(); // 1
4120 ++ * __schedule()
4121 ++ * raw_spin_lock_irq(&rq->lock) // 2
4122 ++ *
4123 ++ * Also, see FORK_PREEMPT_COUNT.
4124 ++ */
4125 ++ if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
4126 ++ "corrupted preempt_count: %s/%d/0x%x\n",
4127 ++ current->comm, current->pid, preempt_count()))
4128 ++ preempt_count_set(FORK_PREEMPT_COUNT);
4129 ++
4130 ++ rq->prev_mm = NULL;
4131 ++
4132 ++ /*
4133 ++ * A task struct has one reference for the use as "current".
4134 ++ * If a task dies, then it sets TASK_DEAD in tsk->state and calls
4135 ++ * schedule one last time. The schedule call will never return, and
4136 ++ * the scheduled task must drop that reference.
4137 ++ *
4138 ++ * We must observe prev->state before clearing prev->on_cpu (in
4139 ++ * finish_task), otherwise a concurrent wakeup can get prev
4140 ++ * running on another CPU and we could race with its RUNNING -> DEAD
4141 ++ * transition, resulting in a double drop.
4142 ++ */
4143 ++ prev_state = READ_ONCE(prev->__state);
4144 ++ vtime_task_switch(prev);
4145 ++ perf_event_task_sched_in(prev, current);
4146 ++ finish_task(prev);
4147 ++ tick_nohz_task_switch();
4148 ++ finish_lock_switch(rq);
4149 ++ finish_arch_post_lock_switch();
4150 ++ kcov_finish_switch(current);
4151 ++ /*
4152 ++ * kmap_local_sched_out() is invoked with rq::lock held and
4153 ++ * interrupts disabled. There is no requirement for that, but the
4154 ++ * sched out code does not have an interrupt enabled section.
4155 ++ * Restoring the maps on sched in does not require interrupts being
4156 ++ * disabled either.
4157 ++ */
4158 ++ kmap_local_sched_in();
4159 ++
4160 ++ fire_sched_in_preempt_notifiers(current);
4161 ++ /*
4162 ++ * When switching through a kernel thread, the loop in
4163 ++ * membarrier_{private,global}_expedited() may have observed that
4164 ++ * kernel thread and not issued an IPI. It is therefore possible to
4165 ++ * schedule between user->kernel->user threads without passing through
4166 ++ * switch_mm(). Membarrier requires a barrier after storing to
4167 ++ * rq->curr, before returning to userspace, so provide them here:
4168 ++ *
4169 ++ * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
4170 ++ * provided by mmdrop(),
4171 ++ * - a sync_core for SYNC_CORE.
4172 ++ */
4173 ++ if (mm) {
4174 ++ membarrier_mm_sync_core_before_usermode(mm);
4175 ++ mmdrop(mm);
4176 ++ }
4177 ++ if (unlikely(prev_state == TASK_DEAD)) {
4178 ++ /*
4179 ++ * Remove function-return probe instances associated with this
4180 ++ * task and put them back on the free list.
4181 ++ */
4182 ++ kprobe_flush_task(prev);
4183 ++
4184 ++ /* Task is done with its stack. */
4185 ++ put_task_stack(prev);
4186 ++
4187 ++ put_task_struct_rcu_user(prev);
4188 ++ }
4189 ++
4190 ++ return rq;
4191 ++}
4192 ++
4193 ++/**
4194 ++ * schedule_tail - first thing a freshly forked thread must call.
4195 ++ * @prev: the thread we just switched away from.
4196 ++ */
4197 ++asmlinkage __visible void schedule_tail(struct task_struct *prev)
4198 ++ __releases(rq->lock)
4199 ++{
4200 ++ /*
4201 ++ * New tasks start with FORK_PREEMPT_COUNT, see there and
4202 ++ * finish_task_switch() for details.
4203 ++ *
4204 ++ * finish_task_switch() will drop rq->lock() and lower preempt_count
4205 ++ * and the preempt_enable() will end up enabling preemption (on
4206 ++ * PREEMPT_COUNT kernels).
4207 ++ */
4208 ++
4209 ++ finish_task_switch(prev);
4210 ++ preempt_enable();
4211 ++
4212 ++ if (current->set_child_tid)
4213 ++ put_user(task_pid_vnr(current), current->set_child_tid);
4214 ++
4215 ++ calculate_sigpending();
4216 ++}
4217 ++
4218 ++/*
4219 ++ * context_switch - switch to the new MM and the new thread's register state.
4220 ++ */
4221 ++static __always_inline struct rq *
4222 ++context_switch(struct rq *rq, struct task_struct *prev,
4223 ++ struct task_struct *next)
4224 ++{
4225 ++ prepare_task_switch(rq, prev, next);
4226 ++
4227 ++ /*
4228 ++ * For paravirt, this is coupled with an exit in switch_to to
4229 ++ * combine the page table reload and the switch backend into
4230 ++ * one hypercall.
4231 ++ */
4232 ++ arch_start_context_switch(prev);
4233 ++
4234 ++ /*
4235 ++ * kernel -> kernel lazy + transfer active
4236 ++ * user -> kernel lazy + mmgrab() active
4237 ++ *
4238 ++ * kernel -> user switch + mmdrop() active
4239 ++ * user -> user switch
4240 ++ */
4241 ++ if (!next->mm) { // to kernel
4242 ++ enter_lazy_tlb(prev->active_mm, next);
4243 ++
4244 ++ next->active_mm = prev->active_mm;
4245 ++ if (prev->mm) // from user
4246 ++ mmgrab(prev->active_mm);
4247 ++ else
4248 ++ prev->active_mm = NULL;
4249 ++ } else { // to user
4250 ++ membarrier_switch_mm(rq, prev->active_mm, next->mm);
4251 ++ /*
4252 ++ * sys_membarrier() requires an smp_mb() between setting
4253 ++ * rq->curr / membarrier_switch_mm() and returning to userspace.
4254 ++ *
4255 ++ * The below provides this either through switch_mm(), or in
4256 ++ * case 'prev->active_mm == next->mm' through
4257 ++ * finish_task_switch()'s mmdrop().
4258 ++ */
4259 ++ switch_mm_irqs_off(prev->active_mm, next->mm, next);
4260 ++
4261 ++ if (!prev->mm) { // from kernel
4262 ++ /* will mmdrop() in finish_task_switch(). */
4263 ++ rq->prev_mm = prev->active_mm;
4264 ++ prev->active_mm = NULL;
4265 ++ }
4266 ++ }
4267 ++
4268 ++ prepare_lock_switch(rq, next);
4269 ++
4270 ++ /* Here we just switch the register state and the stack. */
4271 ++ switch_to(prev, next, prev);
4272 ++ barrier();
4273 ++
4274 ++ return finish_task_switch(prev);
4275 ++}
4276 ++
4277 ++/*
4278 ++ * nr_running, nr_uninterruptible and nr_context_switches:
4279 ++ *
4280 ++ * externally visible scheduler statistics: current number of runnable
4281 ++ * threads, total number of context switches performed since bootup.
4282 ++ */
4283 ++unsigned int nr_running(void)
4284 ++{
4285 ++ unsigned int i, sum = 0;
4286 ++
4287 ++ for_each_online_cpu(i)
4288 ++ sum += cpu_rq(i)->nr_running;
4289 ++
4290 ++ return sum;
4291 ++}
4292 ++
4293 ++/*
4294 ++ * Check if only the current task is running on the CPU.
4295 ++ *
4296 ++ * Caution: this function does not check that the caller has disabled
4297 ++ * preemption, thus the result might have a time-of-check-to-time-of-use
4298 ++ * race. The caller is responsible to use it correctly, for example:
4299 ++ *
4300 ++ * - from a non-preemptible section (of course)
4301 ++ *
4302 ++ * - from a thread that is bound to a single CPU
4303 ++ *
4304 ++ * - in a loop with very short iterations (e.g. a polling loop)
4305 ++ */
4306 ++bool single_task_running(void)
4307 ++{
4308 ++ return raw_rq()->nr_running == 1;
4309 ++}
4310 ++EXPORT_SYMBOL(single_task_running);
4311 ++
4312 ++unsigned long long nr_context_switches(void)
4313 ++{
4314 ++ int i;
4315 ++ unsigned long long sum = 0;
4316 ++
4317 ++ for_each_possible_cpu(i)
4318 ++ sum += cpu_rq(i)->nr_switches;
4319 ++
4320 ++ return sum;
4321 ++}
4322 ++
4323 ++/*
4324 ++ * Consumers of these two interfaces, such as the cpuidle menu governor,
4325 ++ * are using nonsensical data: they prefer a shallow idle state for a CPU
4326 ++ * that has IO-wait pending, even though the waiting task might not end up
4327 ++ * running on that CPU when it does become runnable.
4328 ++ */
4329 ++
4330 ++unsigned int nr_iowait_cpu(int cpu)
4331 ++{
4332 ++ return atomic_read(&cpu_rq(cpu)->nr_iowait);
4333 ++}
4334 ++
4335 ++/*
4336 ++ * IO-wait accounting, and how it's mostly bollocks (on SMP).
4337 ++ *
4338 ++ * The idea behind IO-wait accounting is to account the idle time that we could
4339 ++ * have spent running if it were not for IO. That is, if we were to improve the
4340 ++ * storage performance, we'd have a proportional reduction in IO-wait time.
4341 ++ *
4342 ++ * This all works nicely on UP, where, when a task blocks on IO, we account
4343 ++ * idle time as IO-wait, because if the storage were faster, it could've been
4344 ++ * running and we'd not be idle.
4345 ++ *
4346 ++ * This has been extended to SMP, by doing the same for each CPU. This however
4347 ++ * is broken.
4348 ++ *
4349 ++ * Imagine for instance the case where two tasks block on one CPU, only the one
4350 ++ * CPU will have IO-wait accounted, while the other has regular idle. Even
4351 ++ * though, if the storage were faster, both could've run at the same time,
4352 ++ * utilising both CPUs.
4353 ++ *
4354 ++ * This means, that when looking globally, the current IO-wait accounting on
4355 ++ * SMP is a lower bound, due to under-accounting.
4356 ++ *
4357 ++ * Worse, since the numbers are provided per CPU, they are sometimes
4358 ++ * interpreted per CPU, and that is nonsensical. A blocked task isn't strictly
4359 ++ * associated with any one particular CPU, it can wake to another CPU than it
4360 ++ * blocked on. This means the per CPU IO-wait number is meaningless.
4361 ++ *
4362 ++ * Task CPU affinities can make all that even more 'interesting'.
4363 ++ */
4364 ++
4365 ++unsigned int nr_iowait(void)
4366 ++{
4367 ++ unsigned int i, sum = 0;
4368 ++
4369 ++ for_each_possible_cpu(i)
4370 ++ sum += nr_iowait_cpu(i);
4371 ++
4372 ++ return sum;
4373 ++}
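/*
 * [Editor's illustration, not part of the patch] The iowait column of
 * /proc/stat (the 5th value on the aggregate "cpu" line) is the usual
 * user-visible face of this accounting; as the comment above stresses, treat
 * it as a rough lower bound, not a per-CPU truth.  A minimal reader:
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/stat", "r");
	unsigned long long user, nice, system, idle, iowait;

	if (!f || fscanf(f, "cpu %llu %llu %llu %llu %llu",
			 &user, &nice, &system, &idle, &iowait) != 5) {
		perror("/proc/stat");
		return 1;
	}
	fclose(f);
	printf("idle=%llu iowait=%llu (USER_HZ ticks since boot)\n", idle, iowait);
	return 0;
}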
4374 ++
4375 ++#ifdef CONFIG_SMP
4376 ++
4377 ++/*
4378 ++ * sched_exec - execve() is a valuable balancing opportunity, because at
4379 ++ * this point the task has the smallest effective memory and cache
4380 ++ * footprint.
4381 ++ */
4382 ++void sched_exec(void)
4383 ++{
4384 ++ struct task_struct *p = current;
4385 ++ unsigned long flags;
4386 ++ int dest_cpu;
4387 ++
4388 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
4389 ++ dest_cpu = cpumask_any(p->cpus_ptr);
4390 ++ if (dest_cpu == smp_processor_id())
4391 ++ goto unlock;
4392 ++
4393 ++ if (likely(cpu_active(dest_cpu))) {
4394 ++ struct migration_arg arg = { p, dest_cpu };
4395 ++
4396 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
4397 ++ stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
4398 ++ return;
4399 ++ }
4400 ++unlock:
4401 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
4402 ++}
4403 ++
4404 ++#endif
4405 ++
4406 ++DEFINE_PER_CPU(struct kernel_stat, kstat);
4407 ++DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
4408 ++
4409 ++EXPORT_PER_CPU_SYMBOL(kstat);
4410 ++EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
4411 ++
4412 ++static inline void update_curr(struct rq *rq, struct task_struct *p)
4413 ++{
4414 ++ s64 ns = rq->clock_task - p->last_ran;
4415 ++
4416 ++ p->sched_time += ns;
4417 ++ cgroup_account_cputime(p, ns);
4418 ++ account_group_exec_runtime(p, ns);
4419 ++
4420 ++ p->time_slice -= ns;
4421 ++ p->last_ran = rq->clock_task;
4422 ++}
4423 ++
4424 ++/*
4425 ++ * Return accounted runtime for the task.
4426 ++ * Return separately the current task's pending runtime that has not been
4427 ++ * accounted yet.
4428 ++ */
4429 ++unsigned long long task_sched_runtime(struct task_struct *p)
4430 ++{
4431 ++ unsigned long flags;
4432 ++ struct rq *rq;
4433 ++ raw_spinlock_t *lock;
4434 ++ u64 ns;
4435 ++
4436 ++#if defined(CONFIG_64BIT) && defined(CONFIG_SMP)
4437 ++ /*
4438 ++ * 64-bit doesn't need locks to atomically read a 64-bit value.
4439 ++ * So we have an optimization chance when the task's delta_exec is 0.
4440 ++ * Reading ->on_cpu is racy, but this is ok.
4441 ++ *
4442 ++ * If we race with it leaving CPU, we'll take a lock. So we're correct.
4443 ++ * If we race with it entering CPU, unaccounted time is 0. This is
4444 ++ * indistinguishable from the read occurring a few cycles earlier.
4445 ++ * If we see ->on_cpu without ->on_rq, the task is leaving, and has
4446 ++ * been accounted, so we're correct here as well.
4447 ++ */
4448 ++ if (!p->on_cpu || !task_on_rq_queued(p))
4449 ++ return tsk_seruntime(p);
4450 ++#endif
4451 ++
4452 ++ rq = task_access_lock_irqsave(p, &lock, &flags);
4453 ++ /*
4454 ++ * Must be ->curr _and_ ->on_rq. If dequeued, we would
4455 ++ * project cycles that may never be accounted to this
4456 ++ * thread, breaking clock_gettime().
4457 ++ */
4458 ++ if (p == rq->curr && task_on_rq_queued(p)) {
4459 ++ update_rq_clock(rq);
4460 ++ update_curr(rq, p);
4461 ++ }
4462 ++ ns = tsk_seruntime(p);
4463 ++ task_access_unlock_irqrestore(p, lock, &flags);
4464 ++
4465 ++ return ns;
4466 ++}
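/*
 * [Editor's illustration, not part of the patch] task_sched_runtime() is used
 * to back clock_gettime() with CLOCK_THREAD_CPUTIME_ID (and the process-wide
 * variant), which is why the "must be ->curr _and_ ->on_rq" check above
 * matters.  A small user-space check that the reported thread CPU time grows
 * monotonically across a busy loop:
 */
#include <stdio.h>
#include <time.h>

static double thread_cpu_seconds(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	volatile unsigned long spin = 0;
	double before = thread_cpu_seconds();

	while (spin < 50UL * 1000 * 1000)  /* burn some CPU time */
		spin++;

	printf("thread cpu time: %.6f -> %.6f seconds\n",
	       before, thread_cpu_seconds());
	return 0;
}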
4467 ++
4468 ++/* This manages tasks that have run out of timeslice during a scheduler_tick */
4469 ++static inline void scheduler_task_tick(struct rq *rq)
4470 ++{
4471 ++ struct task_struct *p = rq->curr;
4472 ++
4473 ++ if (is_idle_task(p))
4474 ++ return;
4475 ++
4476 ++ update_curr(rq, p);
4477 ++ cpufreq_update_util(rq, 0);
4478 ++
4479 ++ /*
4480 ++ * Tasks that have less than RESCHED_NS of time slice left will be
4481 ++ * rescheduled.
4482 ++ */
4483 ++ if (p->time_slice >= RESCHED_NS)
4484 ++ return;
4485 ++ set_tsk_need_resched(p);
4486 ++ set_preempt_need_resched();
4487 ++}
4488 ++
4489 ++#ifdef CONFIG_SCHED_DEBUG
4490 ++static u64 cpu_resched_latency(struct rq *rq)
4491 ++{
4492 ++ int latency_warn_ms = READ_ONCE(sysctl_resched_latency_warn_ms);
4493 ++ u64 resched_latency, now = rq_clock(rq);
4494 ++ static bool warned_once;
4495 ++
4496 ++ if (sysctl_resched_latency_warn_once && warned_once)
4497 ++ return 0;
4498 ++
4499 ++ if (!need_resched() || !latency_warn_ms)
4500 ++ return 0;
4501 ++
4502 ++ if (system_state == SYSTEM_BOOTING)
4503 ++ return 0;
4504 ++
4505 ++ if (!rq->last_seen_need_resched_ns) {
4506 ++ rq->last_seen_need_resched_ns = now;
4507 ++ rq->ticks_without_resched = 0;
4508 ++ return 0;
4509 ++ }
4510 ++
4511 ++ rq->ticks_without_resched++;
4512 ++ resched_latency = now - rq->last_seen_need_resched_ns;
4513 ++ if (resched_latency <= latency_warn_ms * NSEC_PER_MSEC)
4514 ++ return 0;
4515 ++
4516 ++ warned_once = true;
4517 ++
4518 ++ return resched_latency;
4519 ++}
4520 ++
4521 ++static int __init setup_resched_latency_warn_ms(char *str)
4522 ++{
4523 ++ long val;
4524 ++
4525 ++ if ((kstrtol(str, 0, &val))) {
4526 ++ pr_warn("Unable to set resched_latency_warn_ms\n");
4527 ++ return 1;
4528 ++ }
4529 ++
4530 ++ sysctl_resched_latency_warn_ms = val;
4531 ++ return 1;
4532 ++}
4533 ++__setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms);
4534 ++#else
4535 ++static inline u64 cpu_resched_latency(struct rq *rq) { return 0; }
4536 ++#endif /* CONFIG_SCHED_DEBUG */
4537 ++
4538 ++/*
4539 ++ * This function gets called by the timer code, with HZ frequency.
4540 ++ * We call it with interrupts disabled.
4541 ++ */
4542 ++void scheduler_tick(void)
4543 ++{
4544 ++ int cpu __maybe_unused = smp_processor_id();
4545 ++ struct rq *rq = cpu_rq(cpu);
4546 ++ u64 resched_latency;
4547 ++
4548 ++ arch_scale_freq_tick();
4549 ++ sched_clock_tick();
4550 ++
4551 ++ raw_spin_lock(&rq->lock);
4552 ++ update_rq_clock(rq);
4553 ++
4554 ++ scheduler_task_tick(rq);
4555 ++ if (sched_feat(LATENCY_WARN))
4556 ++ resched_latency = cpu_resched_latency(rq);
4557 ++ calc_global_load_tick(rq);
4558 ++
4559 ++ rq->last_tick = rq->clock;
4560 ++ raw_spin_unlock(&rq->lock);
4561 ++
4562 ++ if (sched_feat(LATENCY_WARN) && resched_latency)
4563 ++ resched_latency_warn(cpu, resched_latency);
4564 ++
4565 ++ perf_event_task_tick();
4566 ++}
4567 ++
4568 ++#ifdef CONFIG_SCHED_SMT
4569 ++static inline int active_load_balance_cpu_stop(void *data)
4570 ++{
4571 ++ struct rq *rq = this_rq();
4572 ++ struct task_struct *p = data;
4573 ++ cpumask_t tmp;
4574 ++ unsigned long flags;
4575 ++
4576 ++ local_irq_save(flags);
4577 ++
4578 ++ raw_spin_lock(&p->pi_lock);
4579 ++ raw_spin_lock(&rq->lock);
4580 ++
4581 ++ rq->active_balance = 0;
4582 ++ /* _something_ may have changed the task, double check again */
4583 ++ if (task_on_rq_queued(p) && task_rq(p) == rq &&
4584 ++ cpumask_and(&tmp, p->cpus_ptr, &sched_sg_idle_mask) &&
4585 ++ !is_migration_disabled(p)) {
4586 ++ int cpu = cpu_of(rq);
4587 ++ int dcpu = __best_mask_cpu(&tmp, per_cpu(sched_cpu_llc_mask, cpu));
4588 ++ rq = move_queued_task(rq, p, dcpu);
4589 ++ }
4590 ++
4591 ++ raw_spin_unlock(&rq->lock);
4592 ++ raw_spin_unlock(&p->pi_lock);
4593 ++
4594 ++ local_irq_restore(flags);
4595 ++
4596 ++ return 0;
4597 ++}
4598 ++
4599 ++/* sg_balance_trigger - trigger sibling group balance for @cpu */
4600 ++static inline int sg_balance_trigger(const int cpu)
4601 ++{
4602 ++ struct rq *rq = cpu_rq(cpu);
4603 ++ unsigned long flags;
4604 ++ struct task_struct *curr;
4605 ++ int res;
4606 ++
4607 ++ if (!raw_spin_trylock_irqsave(&rq->lock, flags))
4608 ++ return 0;
4609 ++ curr = rq->curr;
4610 ++ res = (!is_idle_task(curr)) && (1 == rq->nr_running) &&\
4611 ++ cpumask_intersects(curr->cpus_ptr, &sched_sg_idle_mask) &&\
4612 ++ !is_migration_disabled(curr) && (!rq->active_balance);
4613 ++
4614 ++ if (res)
4615 ++ rq->active_balance = 1;
4616 ++
4617 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
4618 ++
4619 ++ if (res)
4620 ++ stop_one_cpu_nowait(cpu, active_load_balance_cpu_stop,
4621 ++ curr, &rq->active_balance_work);
4622 ++ return res;
4623 ++}
4624 ++
4625 ++/*
4626 ++ * sg_balance_check - sibling group balance check for run queue @rq
4627 ++ */
4628 ++static inline void sg_balance_check(struct rq *rq)
4629 ++{
4630 ++ cpumask_t chk;
4631 ++ int cpu = cpu_of(rq);
4632 ++
4633 ++ /* exit when cpu is offline */
4634 ++ if (unlikely(!rq->online))
4635 ++ return;
4636 ++
4637 ++ /*
4638 ++ * Only a cpu in the sibling idle group will do the checking and then
4639 ++ * find potential cpus which can migrate the current running task
4640 ++ */
4641 ++ if (cpumask_test_cpu(cpu, &sched_sg_idle_mask) &&
4642 ++ cpumask_andnot(&chk, cpu_online_mask, sched_rq_watermark) &&
4643 ++ cpumask_andnot(&chk, &chk, &sched_rq_pending_mask)) {
4644 ++ int i;
4645 ++
4646 ++ for_each_cpu_wrap(i, &chk, cpu) {
4647 ++ if (cpumask_subset(cpu_smt_mask(i), &chk) &&
4648 ++ sg_balance_trigger(i))
4649 ++ return;
4650 ++ }
4651 ++ }
4652 ++}
4653 ++#endif /* CONFIG_SCHED_SMT */
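/*
 * [Editor's illustration, not part of the patch] sg_balance_check() above is
 * essentially a chain of mask operations: start from the online CPUs,
 * successively mask out CPUs recorded in the watermark and pending masks, and
 * only then probe whole SMT sibling groups.  A sketch with plain 64-bit masks
 * (one bit per CPU); the sample mask values are made up for the demo.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t online    = 0x0f;  /* CPUs 0-3 online                    */
	uint64_t watermark = 0x05;  /* CPUs 0,2 set in the watermark mask */
	uint64_t pending   = 0x02;  /* CPU 1 has pending tasks            */
	uint64_t chk;

	chk = online & ~watermark;  /* cpumask_andnot(&chk, online, watermark) */
	chk &= ~pending;            /* cpumask_andnot(&chk, &chk, pending)     */

	printf("candidate CPUs for sibling-group balance: 0x%llx\n",
	       (unsigned long long)chk);
	return 0;
}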
4654 ++
4655 ++#ifdef CONFIG_NO_HZ_FULL
4656 ++
4657 ++struct tick_work {
4658 ++ int cpu;
4659 ++ atomic_t state;
4660 ++ struct delayed_work work;
4661 ++};
4662 ++/* Values for ->state, see diagram below. */
4663 ++#define TICK_SCHED_REMOTE_OFFLINE 0
4664 ++#define TICK_SCHED_REMOTE_OFFLINING 1
4665 ++#define TICK_SCHED_REMOTE_RUNNING 2
4666 ++
4667 ++/*
4668 ++ * State diagram for ->state:
4669 ++ *
4670 ++ *
4671 ++ * TICK_SCHED_REMOTE_OFFLINE
4672 ++ * | ^
4673 ++ * | |
4674 ++ * | | sched_tick_remote()
4675 ++ * | |
4676 ++ * | |
4677 ++ * +--TICK_SCHED_REMOTE_OFFLINING
4678 ++ * | ^
4679 ++ * | |
4680 ++ * sched_tick_start() | | sched_tick_stop()
4681 ++ * | |
4682 ++ * V |
4683 ++ * TICK_SCHED_REMOTE_RUNNING
4684 ++ *
4685 ++ *
4686 ++ * Other transitions get WARN_ON_ONCE(), except that sched_tick_remote()
4687 ++ * and sched_tick_start() are happy to leave the state in RUNNING.
4688 ++ */
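/*
 * [Editor's illustration, not part of the patch] The ->state machine above
 * can be modelled with a small CAS helper that mimics
 * atomic_fetch_add_unless(): add a delta unless the current value equals the
 * "unless" value.  The transitions below replay start -> remote tick ->
 * stop request -> final remote tick, matching the diagram; the code is a
 * user-space sketch, not the kernel implementation.
 */
#include <stdatomic.h>
#include <stdio.h>

enum { REMOTE_OFFLINE, REMOTE_OFFLINING, REMOTE_RUNNING };

static int fetch_add_unless(_Atomic int *v, int delta, int unless)
{
	int old = atomic_load(v);

	/* retry until the add succeeds or the value equals @unless */
	while (old != unless &&
	       !atomic_compare_exchange_weak(v, &old, old + delta))
		;
	return old;
}

int main(void)
{
	_Atomic int state = REMOTE_OFFLINE;
	int was, now;

	/* sched_tick_start(): OFFLINE -> RUNNING */
	printf("start: was %d\n", atomic_exchange(&state, REMOTE_RUNNING));

	/* remote tick while running: state stays RUNNING, work is requeued */
	was = fetch_add_unless(&state, -1, REMOTE_RUNNING);
	now = atomic_load(&state);
	printf("tick:  was %d, now %d\n", was, now);

	/* a stop request moves RUNNING -> OFFLINING ... */
	atomic_store(&state, REMOTE_OFFLINING);

	/* ... and the final remote tick steps OFFLINING -> OFFLINE */
	was = fetch_add_unless(&state, -1, REMOTE_RUNNING);
	now = atomic_load(&state);
	printf("final: was %d, now %d\n", was, now);
	return 0;
}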
4689 ++
4690 ++static struct tick_work __percpu *tick_work_cpu;
4691 ++
4692 ++static void sched_tick_remote(struct work_struct *work)
4693 ++{
4694 ++ struct delayed_work *dwork = to_delayed_work(work);
4695 ++ struct tick_work *twork = container_of(dwork, struct tick_work, work);
4696 ++ int cpu = twork->cpu;
4697 ++ struct rq *rq = cpu_rq(cpu);
4698 ++ struct task_struct *curr;
4699 ++ unsigned long flags;
4700 ++ u64 delta;
4701 ++ int os;
4702 ++
4703 ++ /*
4704 ++ * Handle the tick only if it appears the remote CPU is running in full
4705 ++ * dynticks mode. The check is racy by nature, but missing a tick or
4706 ++ * having one too much is no big deal because the scheduler tick updates
4707 ++ * statistics and checks timeslices in a time-independent way, regardless
4708 ++ * of when exactly it is running.
4709 ++ */
4710 ++ if (!tick_nohz_tick_stopped_cpu(cpu))
4711 ++ goto out_requeue;
4712 ++
4713 ++ raw_spin_lock_irqsave(&rq->lock, flags);
4714 ++ curr = rq->curr;
4715 ++ if (cpu_is_offline(cpu))
4716 ++ goto out_unlock;
4717 ++
4718 ++ update_rq_clock(rq);
4719 ++ if (!is_idle_task(curr)) {
4720 ++ /*
4721 ++ * Make sure the next tick runs within a reasonable
4722 ++ * amount of time.
4723 ++ */
4724 ++ delta = rq_clock_task(rq) - curr->last_ran;
4725 ++ WARN_ON_ONCE(delta > (u64)NSEC_PER_SEC * 3);
4726 ++ }
4727 ++ scheduler_task_tick(rq);
4728 ++
4729 ++ calc_load_nohz_remote(rq);
4730 ++out_unlock:
4731 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
4732 ++
4733 ++out_requeue:
4734 ++ /*
4735 ++ * Run the remote tick once per second (1Hz). This arbitrary
4736 ++ * frequency is large enough to avoid overload but short enough
4737 ++ * to keep scheduler internal stats reasonably up to date. But
4738 ++ * first update state to reflect hotplug activity if required.
4739 ++ */
4740 ++ os = atomic_fetch_add_unless(&twork->state, -1, TICK_SCHED_REMOTE_RUNNING);
4741 ++ WARN_ON_ONCE(os == TICK_SCHED_REMOTE_OFFLINE);
4742 ++ if (os == TICK_SCHED_REMOTE_RUNNING)
4743 ++ queue_delayed_work(system_unbound_wq, dwork, HZ);
4744 ++}
4745 ++
4746 ++static void sched_tick_start(int cpu)
4747 ++{
4748 ++ int os;
4749 ++ struct tick_work *twork;
4750 ++
4751 ++ if (housekeeping_cpu(cpu, HK_FLAG_TICK))
4752 ++ return;
4753 ++
4754 ++ WARN_ON_ONCE(!tick_work_cpu);
4755 ++
4756 ++ twork = per_cpu_ptr(tick_work_cpu, cpu);
4757 ++ os = atomic_xchg(&twork->state, TICK_SCHED_REMOTE_RUNNING);
4758 ++ WARN_ON_ONCE(os == TICK_SCHED_REMOTE_RUNNING);
4759 ++ if (os == TICK_SCHED_REMOTE_OFFLINE) {
4760 ++ twork->cpu = cpu;
4761 ++ INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
4762 ++ queue_delayed_work(system_unbound_wq, &twork->work, HZ);
4763 ++ }
4764 ++}
4765 ++
4766 ++#ifdef CONFIG_HOTPLUG_CPU
4767 ++static void sched_tick_stop(int cpu)
4768 ++{
4769 ++ struct tick_work *twork;
4770 ++
4771 ++ if (housekeeping_cpu(cpu, HK_FLAG_TICK))
4772 ++ return;
4773 ++
4774 ++ WARN_ON_ONCE(!tick_work_cpu);
4775 ++
4776 ++ twork = per_cpu_ptr(tick_work_cpu, cpu);
4777 ++ cancel_delayed_work_sync(&twork->work);
4778 ++}
4779 ++#endif /* CONFIG_HOTPLUG_CPU */
4780 ++
4781 ++int __init sched_tick_offload_init(void)
4782 ++{
4783 ++ tick_work_cpu = alloc_percpu(struct tick_work);
4784 ++ BUG_ON(!tick_work_cpu);
4785 ++ return 0;
4786 ++}
4787 ++
4788 ++#else /* !CONFIG_NO_HZ_FULL */
4789 ++static inline void sched_tick_start(int cpu) { }
4790 ++static inline void sched_tick_stop(int cpu) { }
4791 ++#endif
4792 ++
4793 ++#if defined(CONFIG_PREEMPTION) && (defined(CONFIG_DEBUG_PREEMPT) || \
4794 ++ defined(CONFIG_PREEMPT_TRACER))
4795 ++/*
4796 ++ * If the value passed in is equal to the current preempt count
4797 ++ * then we just disabled preemption. Start timing the latency.
4798 ++ */
4799 ++static inline void preempt_latency_start(int val)
4800 ++{
4801 ++ if (preempt_count() == val) {
4802 ++ unsigned long ip = get_lock_parent_ip();
4803 ++#ifdef CONFIG_DEBUG_PREEMPT
4804 ++ current->preempt_disable_ip = ip;
4805 ++#endif
4806 ++ trace_preempt_off(CALLER_ADDR0, ip);
4807 ++ }
4808 ++}
4809 ++
4810 ++void preempt_count_add(int val)
4811 ++{
4812 ++#ifdef CONFIG_DEBUG_PREEMPT
4813 ++ /*
4814 ++ * Underflow?
4815 ++ */
4816 ++ if (DEBUG_LOCKS_WARN_ON((preempt_count() < 0)))
4817 ++ return;
4818 ++#endif
4819 ++ __preempt_count_add(val);
4820 ++#ifdef CONFIG_DEBUG_PREEMPT
4821 ++ /*
4822 ++ * Spinlock count overflowing soon?
4823 ++ */
4824 ++ DEBUG_LOCKS_WARN_ON((preempt_count() & PREEMPT_MASK) >=
4825 ++ PREEMPT_MASK - 10);
4826 ++#endif
4827 ++ preempt_latency_start(val);
4828 ++}
4829 ++EXPORT_SYMBOL(preempt_count_add);
4830 ++NOKPROBE_SYMBOL(preempt_count_add);
4831 ++
4832 ++/*
4833 ++ * If the value passed in is equal to the current preempt count
4834 ++ * then we just enabled preemption. Stop timing the latency.
4835 ++ */
4836 ++static inline void preempt_latency_stop(int val)
4837 ++{
4838 ++ if (preempt_count() == val)
4839 ++ trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
4840 ++}
4841 ++
4842 ++void preempt_count_sub(int val)
4843 ++{
4844 ++#ifdef CONFIG_DEBUG_PREEMPT
4845 ++ /*
4846 ++ * Underflow?
4847 ++ */
4848 ++ if (DEBUG_LOCKS_WARN_ON(val > preempt_count()))
4849 ++ return;
4850 ++ /*
4851 ++ * Is the spinlock portion underflowing?
4852 ++ */
4853 ++ if (DEBUG_LOCKS_WARN_ON((val < PREEMPT_MASK) &&
4854 ++ !(preempt_count() & PREEMPT_MASK)))
4855 ++ return;
4856 ++#endif
4857 ++
4858 ++ preempt_latency_stop(val);
4859 ++ __preempt_count_sub(val);
4860 ++}
4861 ++EXPORT_SYMBOL(preempt_count_sub);
4862 ++NOKPROBE_SYMBOL(preempt_count_sub);
4863 ++
4864 ++#else
4865 ++static inline void preempt_latency_start(int val) { }
4866 ++static inline void preempt_latency_stop(int val) { }
4867 ++#endif
4868 ++
4869 ++static inline unsigned long get_preempt_disable_ip(struct task_struct *p)
4870 ++{
4871 ++#ifdef CONFIG_DEBUG_PREEMPT
4872 ++ return p->preempt_disable_ip;
4873 ++#else
4874 ++ return 0;
4875 ++#endif
4876 ++}
4877 ++
4878 ++/*
4879 ++ * Print scheduling while atomic bug:
4880 ++ */
4881 ++static noinline void __schedule_bug(struct task_struct *prev)
4882 ++{
4883 ++ /* Save this before calling printk(), since that will clobber it */
4884 ++ unsigned long preempt_disable_ip = get_preempt_disable_ip(current);
4885 ++
4886 ++ if (oops_in_progress)
4887 ++ return;
4888 ++
4889 ++ printk(KERN_ERR "BUG: scheduling while atomic: %s/%d/0x%08x\n",
4890 ++ prev->comm, prev->pid, preempt_count());
4891 ++
4892 ++ debug_show_held_locks(prev);
4893 ++ print_modules();
4894 ++ if (irqs_disabled())
4895 ++ print_irqtrace_events(prev);
4896 ++ if (IS_ENABLED(CONFIG_DEBUG_PREEMPT)
4897 ++ && in_atomic_preempt_off()) {
4898 ++ pr_err("Preemption disabled at:");
4899 ++ print_ip_sym(KERN_ERR, preempt_disable_ip);
4900 ++ }
4901 ++ if (panic_on_warn)
4902 ++ panic("scheduling while atomic\n");
4903 ++
4904 ++ dump_stack();
4905 ++ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
4906 ++}
4907 ++
4908 ++/*
4909 ++ * Various schedule()-time debugging checks and statistics:
4910 ++ */
4911 ++static inline void schedule_debug(struct task_struct *prev, bool preempt)
4912 ++{
4913 ++#ifdef CONFIG_SCHED_STACK_END_CHECK
4914 ++ if (task_stack_end_corrupted(prev))
4915 ++ panic("corrupted stack end detected inside scheduler\n");
4916 ++
4917 ++ if (task_scs_end_corrupted(prev))
4918 ++ panic("corrupted shadow stack detected inside scheduler\n");
4919 ++#endif
4920 ++
4921 ++#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
4922 ++ if (!preempt && READ_ONCE(prev->__state) && prev->non_block_count) {
4923 ++ printk(KERN_ERR "BUG: scheduling in a non-blocking section: %s/%d/%i\n",
4924 ++ prev->comm, prev->pid, prev->non_block_count);
4925 ++ dump_stack();
4926 ++ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
4927 ++ }
4928 ++#endif
4929 ++
4930 ++ if (unlikely(in_atomic_preempt_off())) {
4931 ++ __schedule_bug(prev);
4932 ++ preempt_count_set(PREEMPT_DISABLED);
4933 ++ }
4934 ++ rcu_sleep_check();
4935 ++ SCHED_WARN_ON(ct_state() == CONTEXT_USER);
4936 ++
4937 ++ profile_hit(SCHED_PROFILING, __builtin_return_address(0));
4938 ++
4939 ++ schedstat_inc(this_rq()->sched_count);
4940 ++}
4941 ++
4942 ++/*
4943 ++ * Compile time debug macro
4944 ++ * #define ALT_SCHED_DEBUG
4945 ++ */
4946 ++
4947 ++#ifdef ALT_SCHED_DEBUG
4948 ++void alt_sched_debug(void)
4949 ++{
4950 ++ printk(KERN_INFO "sched: pending: 0x%04lx, idle: 0x%04lx, sg_idle: 0x%04lx\n",
4951 ++ sched_rq_pending_mask.bits[0],
4952 ++ sched_rq_watermark[0].bits[0],
4953 ++ sched_sg_idle_mask.bits[0]);
4954 ++}
4955 ++#else
4956 ++inline void alt_sched_debug(void) {}
4957 ++#endif
4958 ++
4959 ++#ifdef CONFIG_SMP
4960 ++
4961 ++#define SCHED_RQ_NR_MIGRATION (32U)
4962 ++/*
4963 ++ * Migrate pending tasks in @rq to @dest_cpu
4964 ++ * Will try to migrate at most min(half of @rq's nr_running,
4965 ++ * SCHED_RQ_NR_MIGRATION) tasks to @dest_cpu
4966 ++ */
4967 ++static inline int
4968 ++migrate_pending_tasks(struct rq *rq, struct rq *dest_rq, const int dest_cpu)
4969 ++{
4970 ++ struct task_struct *p, *skip = rq->curr;
4971 ++ int nr_migrated = 0;
4972 ++ int nr_tries = min(rq->nr_running / 2, SCHED_RQ_NR_MIGRATION);
4973 ++
4974 ++ while (skip != rq->idle && nr_tries &&
4975 ++ (p = sched_rq_next_task(skip, rq)) != rq->idle) {
4976 ++ skip = sched_rq_next_task(p, rq);
4977 ++ if (cpumask_test_cpu(dest_cpu, p->cpus_ptr)) {
4978 ++ __SCHED_DEQUEUE_TASK(p, rq, 0, );
4979 ++ set_task_cpu(p, dest_cpu);
4980 ++ sched_task_sanity_check(p, dest_rq);
4981 ++ __SCHED_ENQUEUE_TASK(p, dest_rq, 0);
4982 ++ nr_migrated++;
4983 ++ }
4984 ++ nr_tries--;
4985 ++ }
4986 ++
4987 ++ return nr_migrated;
4988 ++}
4989 ++
4990 ++static inline int take_other_rq_tasks(struct rq *rq, int cpu)
4991 ++{
4992 ++ struct cpumask *topo_mask, *end_mask;
4993 ++
4994 ++ if (unlikely(!rq->online))
4995 ++ return 0;
4996 ++
4997 ++ if (cpumask_empty(&sched_rq_pending_mask))
4998 ++ return 0;
4999 ++
5000 ++ topo_mask = per_cpu(sched_cpu_topo_masks, cpu) + 1;
5001 ++ end_mask = per_cpu(sched_cpu_topo_end_mask, cpu);
5002 ++ do {
5003 ++ int i;
5004 ++ for_each_cpu_and(i, &sched_rq_pending_mask, topo_mask) {
5005 ++ int nr_migrated;
5006 ++ struct rq *src_rq;
5007 ++
5008 ++ src_rq = cpu_rq(i);
5009 ++ if (!do_raw_spin_trylock(&src_rq->lock))
5010 ++ continue;
5011 ++ spin_acquire(&src_rq->lock.dep_map,
5012 ++ SINGLE_DEPTH_NESTING, 1, _RET_IP_);
5013 ++
5014 ++ if ((nr_migrated = migrate_pending_tasks(src_rq, rq, cpu))) {
5015 ++ src_rq->nr_running -= nr_migrated;
5016 ++ if (src_rq->nr_running < 2)
5017 ++ cpumask_clear_cpu(i, &sched_rq_pending_mask);
5018 ++
5019 ++ rq->nr_running += nr_migrated;
5020 ++ if (rq->nr_running > 1)
5021 ++ cpumask_set_cpu(cpu, &sched_rq_pending_mask);
5022 ++
5023 ++ update_sched_rq_watermark(rq);
5024 ++ cpufreq_update_util(rq, 0);
5025 ++
5026 ++ spin_release(&src_rq->lock.dep_map, _RET_IP_);
5027 ++ do_raw_spin_unlock(&src_rq->lock);
5028 ++
5029 ++ return 1;
5030 ++ }
5031 ++
5032 ++ spin_release(&src_rq->lock.dep_map, _RET_IP_);
5033 ++ do_raw_spin_unlock(&src_rq->lock);
5034 ++ }
5035 ++ } while (++topo_mask < end_mask);
5036 ++
5037 ++ return 0;
5038 ++}
5039 ++#endif
5040 ++
5041 ++/*
5042 ++ * Timeslices below RESCHED_NS are considered as good as expired as there's no
5043 ++ * point rescheduling when there's so little time left.
5044 ++ */
5045 ++static inline void check_curr(struct task_struct *p, struct rq *rq)
5046 ++{
5047 ++ if (unlikely(rq->idle == p))
5048 ++ return;
5049 ++
5050 ++ update_curr(rq, p);
5051 ++
5052 ++ if (p->time_slice < RESCHED_NS)
5053 ++ time_slice_expired(p, rq);
5054 ++}
5055 ++
5056 ++static inline struct task_struct *
5057 ++choose_next_task(struct rq *rq, int cpu, struct task_struct *prev)
5058 ++{
5059 ++ struct task_struct *next;
5060 ++
5061 ++ if (unlikely(rq->skip)) {
5062 ++ next = rq_runnable_task(rq);
5063 ++ if (next == rq->idle) {
5064 ++#ifdef CONFIG_SMP
5065 ++ if (!take_other_rq_tasks(rq, cpu)) {
5066 ++#endif
5067 ++ rq->skip = NULL;
5068 ++ schedstat_inc(rq->sched_goidle);
5069 ++ return next;
5070 ++#ifdef CONFIG_SMP
5071 ++ }
5072 ++ next = rq_runnable_task(rq);
5073 ++#endif
5074 ++ }
5075 ++ rq->skip = NULL;
5076 ++#ifdef CONFIG_HIGH_RES_TIMERS
5077 ++ hrtick_start(rq, next->time_slice);
5078 ++#endif
5079 ++ return next;
5080 ++ }
5081 ++
5082 ++ next = sched_rq_first_task(rq);
5083 ++ if (next == rq->idle) {
5084 ++#ifdef CONFIG_SMP
5085 ++ if (!take_other_rq_tasks(rq, cpu)) {
5086 ++#endif
5087 ++ schedstat_inc(rq->sched_goidle);
5088 ++ /*printk(KERN_INFO "sched: choose_next_task(%d) idle %px\n", cpu, next);*/
5089 ++ return next;
5090 ++#ifdef CONFIG_SMP
5091 ++ }
5092 ++ next = sched_rq_first_task(rq);
5093 ++#endif
5094 ++ }
5095 ++#ifdef CONFIG_HIGH_RES_TIMERS
5096 ++ hrtick_start(rq, next->time_slice);
5097 ++#endif
5098 ++ /*printk(KERN_INFO "sched: choose_next_task(%d) next %px\n", cpu,
5099 ++ * next);*/
5100 ++ return next;
5101 ++}
5102 ++
5103 ++/*
5104 ++ * Constants for the sched_mode argument of __schedule().
5105 ++ *
5106 ++ * The mode argument allows RT enabled kernels to differentiate a
5107 ++ * preemption from blocking on an 'sleeping' spin/rwlock. Note that
5108 ++ * SM_MASK_PREEMPT for !RT has all bits set, which allows the compiler to
5109 ++ * optimize the AND operation out and just check for zero.
5110 ++ */
5111 ++#define SM_NONE 0x0
5112 ++#define SM_PREEMPT 0x1
5113 ++#define SM_RTLOCK_WAIT 0x2
5114 ++
5115 ++#ifndef CONFIG_PREEMPT_RT
5116 ++# define SM_MASK_PREEMPT (~0U)
5117 ++#else
5118 ++# define SM_MASK_PREEMPT SM_PREEMPT
5119 ++#endif
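
As the comment block above notes, on !PREEMPT_RT builds SM_MASK_PREEMPT is all-ones, so `sched_mode & SM_MASK_PREEMPT` collapses to a plain non-zero test. A minimal stand-alone sketch of that property (illustrative only, not part of the patch):

    #include <stdio.h>

    #define SM_NONE        0x0
    #define SM_PREEMPT     0x1
    #define SM_RTLOCK_WAIT 0x2

    /* !PREEMPT_RT case: the mask is all bits set, so the AND is optimized away. */
    #define SM_MASK_PREEMPT (~0U)

    static int counts_as_preemption(unsigned int sched_mode)
    {
            /* With SM_MASK_PREEMPT == ~0U this is just "sched_mode != 0". */
            return !!(sched_mode & SM_MASK_PREEMPT);
    }

    int main(void)
    {
            printf("SM_NONE        -> %d\n", counts_as_preemption(SM_NONE));        /* 0 */
            printf("SM_PREEMPT     -> %d\n", counts_as_preemption(SM_PREEMPT));     /* 1 */
            printf("SM_RTLOCK_WAIT -> %d\n", counts_as_preemption(SM_RTLOCK_WAIT)); /* 1 here */
            return 0;
    }

On PREEMPT_RT the mask narrows to SM_PREEMPT, so an SM_RTLOCK_WAIT switch is not counted as a preemption by that check.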
5120 ++
5121 ++/*
5122 ++ * schedule() is the main scheduler function.
5123 ++ *
5124 ++ * The main means of driving the scheduler and thus entering this function are:
5125 ++ *
5126 ++ * 1. Explicit blocking: mutex, semaphore, waitqueue, etc.
5127 ++ *
5128 ++ * 2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
5129 ++ * paths. For example, see arch/x86/entry_64.S.
5130 ++ *
5131 ++ * To drive preemption between tasks, the scheduler sets the flag in timer
5132 ++ * interrupt handler scheduler_tick().
5133 ++ *
5134 ++ * 3. Wakeups don't really cause entry into schedule(). They add a
5135 ++ * task to the run-queue and that's it.
5136 ++ *
5137 ++ * Now, if the new task added to the run-queue preempts the current
5138 ++ * task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
5139 ++ * called on the nearest possible occasion:
5140 ++ *
5141 ++ * - If the kernel is preemptible (CONFIG_PREEMPTION=y):
5142 ++ *
5143 ++ * - in syscall or exception context, at the next outmost
5144 ++ * preempt_enable(). (this might be as soon as the wake_up()'s
5145 ++ * spin_unlock()!)
5146 ++ *
5147 ++ * - in IRQ context, return from interrupt-handler to
5148 ++ * preemptible context
5149 ++ *
5150 ++ * - If the kernel is not preemptible (CONFIG_PREEMPTION is not set)
5151 ++ * then at the next:
5152 ++ *
5153 ++ * - cond_resched() call
5154 ++ * - explicit schedule() call
5155 ++ * - return from syscall or exception to user-space
5156 ++ * - return from interrupt-handler to user-space
5157 ++ *
5158 ++ * WARNING: must be called with preemption disabled!
5159 ++ */
5160 ++static void __sched notrace __schedule(unsigned int sched_mode)
5161 ++{
5162 ++ struct task_struct *prev, *next;
5163 ++ unsigned long *switch_count;
5164 ++ unsigned long prev_state;
5165 ++ struct rq *rq;
5166 ++ int cpu;
5167 ++
5168 ++ cpu = smp_processor_id();
5169 ++ rq = cpu_rq(cpu);
5170 ++ prev = rq->curr;
5171 ++
5172 ++ schedule_debug(prev, !!sched_mode);
5173 ++
5174 ++ /* bypass the sched_feat(HRTICK) check, which Alt schedule FW doesn't support */
5175 ++ hrtick_clear(rq);
5176 ++
5177 ++ local_irq_disable();
5178 ++ rcu_note_context_switch(!!sched_mode);
5179 ++
5180 ++ /*
5181 ++ * Make sure that signal_pending_state()->signal_pending() below
5182 ++ * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
5183 ++ * done by the caller to avoid the race with signal_wake_up():
5184 ++ *
5185 ++ * __set_current_state(@state) signal_wake_up()
5186 ++ * schedule() set_tsk_thread_flag(p, TIF_SIGPENDING)
5187 ++ * wake_up_state(p, state)
5188 ++ * LOCK rq->lock LOCK p->pi_state
5189 ++ * smp_mb__after_spinlock() smp_mb__after_spinlock()
5190 ++ * if (signal_pending_state()) if (p->state & @state)
5191 ++ *
5192 ++ * Also, the membarrier system call requires a full memory barrier
5193 ++ * after coming from user-space, before storing to rq->curr.
5194 ++ */
5195 ++ raw_spin_lock(&rq->lock);
5196 ++ smp_mb__after_spinlock();
5197 ++
5198 ++ update_rq_clock(rq);
5199 ++
5200 ++ switch_count = &prev->nivcsw;
5201 ++ /*
5202 ++ * We must load prev->state once (task_struct::state is volatile), such
5203 ++ * that:
5204 ++ *
5205 ++ * - we form a control dependency vs deactivate_task() below.
5206 ++ * - ptrace_{,un}freeze_traced() can change ->state underneath us.
5207 ++ */
5208 ++ prev_state = READ_ONCE(prev->__state);
5209 ++ if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
5210 ++ if (signal_pending_state(prev_state, prev)) {
5211 ++ WRITE_ONCE(prev->__state, TASK_RUNNING);
5212 ++ } else {
5213 ++ prev->sched_contributes_to_load =
5214 ++ (prev_state & TASK_UNINTERRUPTIBLE) &&
5215 ++ !(prev_state & TASK_NOLOAD) &&
5216 ++ !(prev->flags & PF_FROZEN);
5217 ++
5218 ++ if (prev->sched_contributes_to_load)
5219 ++ rq->nr_uninterruptible++;
5220 ++
5221 ++ /*
5222 ++ * __schedule() ttwu()
5223 ++ * prev_state = prev->state; if (p->on_rq && ...)
5224 ++ * if (prev_state) goto out;
5225 ++ * p->on_rq = 0; smp_acquire__after_ctrl_dep();
5226 ++ * p->state = TASK_WAKING
5227 ++ *
5228 ++ * Where __schedule() and ttwu() have matching control dependencies.
5229 ++ *
5230 ++ * After this, schedule() must not care about p->state any more.
5231 ++ */
5232 ++ sched_task_deactivate(prev, rq);
5233 ++ deactivate_task(prev, rq);
5234 ++
5235 ++ if (prev->in_iowait) {
5236 ++ atomic_inc(&rq->nr_iowait);
5237 ++ delayacct_blkio_start();
5238 ++ }
5239 ++ }
5240 ++ switch_count = &prev->nvcsw;
5241 ++ }
5242 ++
5243 ++ check_curr(prev, rq);
5244 ++
5245 ++ next = choose_next_task(rq, cpu, prev);
5246 ++ clear_tsk_need_resched(prev);
5247 ++ clear_preempt_need_resched();
5248 ++#ifdef CONFIG_SCHED_DEBUG
5249 ++ rq->last_seen_need_resched_ns = 0;
5250 ++#endif
5251 ++
5252 ++ if (likely(prev != next)) {
5253 ++ next->last_ran = rq->clock_task;
5254 ++ rq->last_ts_switch = rq->clock;
5255 ++
5256 ++ rq->nr_switches++;
5257 ++ /*
5258 ++ * RCU users of rcu_dereference(rq->curr) may not see
5259 ++ * changes to task_struct made by pick_next_task().
5260 ++ */
5261 ++ RCU_INIT_POINTER(rq->curr, next);
5262 ++ /*
5263 ++ * The membarrier system call requires each architecture
5264 ++ * to have a full memory barrier after updating
5265 ++ * rq->curr, before returning to user-space.
5266 ++ *
5267 ++ * Here are the schemes providing that barrier on the
5268 ++ * various architectures:
5269 ++ * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
5270 ++ * switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
5271 ++ * - finish_lock_switch() for weakly-ordered
5272 ++ * architectures where spin_unlock is a full barrier,
5273 ++ * - switch_to() for arm64 (weakly-ordered, spin_unlock
5274 ++ * is a RELEASE barrier),
5275 ++ */
5276 ++ ++*switch_count;
5277 ++
5278 ++ psi_sched_switch(prev, next, !task_on_rq_queued(prev));
5279 ++
5280 ++ trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next);
5281 ++
5282 ++ /* Also unlocks the rq: */
5283 ++ rq = context_switch(rq, prev, next);
5284 ++ } else {
5285 ++ __balance_callbacks(rq);
5286 ++ raw_spin_unlock_irq(&rq->lock);
5287 ++ }
5288 ++
5289 ++#ifdef CONFIG_SCHED_SMT
5290 ++ sg_balance_check(rq);
5291 ++#endif
5292 ++}
5293 ++
5294 ++void __noreturn do_task_dead(void)
5295 ++{
5296 ++ /* Causes final put_task_struct in finish_task_switch(): */
5297 ++ set_special_state(TASK_DEAD);
5298 ++
5299 ++ /* Tell freezer to ignore us: */
5300 ++ current->flags |= PF_NOFREEZE;
5301 ++
5302 ++ __schedule(SM_NONE);
5303 ++ BUG();
5304 ++
5305 ++ /* Avoid "noreturn function does return" - but don't continue if BUG() is a NOP: */
5306 ++ for (;;)
5307 ++ cpu_relax();
5308 ++}
5309 ++
5310 ++static inline void sched_submit_work(struct task_struct *tsk)
5311 ++{
5312 ++ unsigned int task_flags;
5313 ++
5314 ++ if (task_is_running(tsk))
5315 ++ return;
5316 ++
5317 ++ task_flags = tsk->flags;
5318 ++ /*
5319 ++ * If a worker went to sleep, notify and ask workqueue whether
5320 ++ * it wants to wake up a task to maintain concurrency.
5321 ++ * As this function is called inside the schedule() context,
5322 ++ * we disable preemption to avoid it calling schedule() again
5323 ++ * in the possible wakeup of a kworker and because wq_worker_sleeping()
5324 ++ * requires it.
5325 ++ */
5326 ++ if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
5327 ++ preempt_disable();
5328 ++ if (task_flags & PF_WQ_WORKER)
5329 ++ wq_worker_sleeping(tsk);
5330 ++ else
5331 ++ io_wq_worker_sleeping(tsk);
5332 ++ preempt_enable_no_resched();
5333 ++ }
5334 ++
5335 ++ if (tsk_is_pi_blocked(tsk))
5336 ++ return;
5337 ++
5338 ++ /*
5339 ++ * If we are going to sleep and we have plugged IO queued,
5340 ++ * make sure to submit it to avoid deadlocks.
5341 ++ */
5342 ++ if (blk_needs_flush_plug(tsk))
5343 ++ blk_schedule_flush_plug(tsk);
5344 ++}
5345 ++
5346 ++static void sched_update_worker(struct task_struct *tsk)
5347 ++{
5348 ++ if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
5349 ++ if (tsk->flags & PF_WQ_WORKER)
5350 ++ wq_worker_running(tsk);
5351 ++ else
5352 ++ io_wq_worker_running(tsk);
5353 ++ }
5354 ++}
5355 ++
5356 ++asmlinkage __visible void __sched schedule(void)
5357 ++{
5358 ++ struct task_struct *tsk = current;
5359 ++
5360 ++ sched_submit_work(tsk);
5361 ++ do {
5362 ++ preempt_disable();
5363 ++ __schedule(SM_NONE);
5364 ++ sched_preempt_enable_no_resched();
5365 ++ } while (need_resched());
5366 ++ sched_update_worker(tsk);
5367 ++}
5368 ++EXPORT_SYMBOL(schedule);
5369 ++
5370 ++/*
5371 ++ * synchronize_rcu_tasks() makes sure that no task is stuck in preempted
5372 ++ * state (have scheduled out non-voluntarily) by making sure that all
5373 ++ * tasks have either left the run queue or have gone into user space.
5374 ++ * As idle tasks do not do either, they must not ever be preempted
5375 ++ * (schedule out non-voluntarily).
5376 ++ *
5377 ++ * schedule_idle() is similar to schedule_preempt_disabled() except that it
5378 ++ * never enables preemption because it does not call sched_submit_work().
5379 ++ */
5380 ++void __sched schedule_idle(void)
5381 ++{
5382 ++ /*
5383 ++ * As this skips calling sched_submit_work(), which the idle task does
5384 ++ * regardless because that function is a nop when the task is in a
5385 ++ * TASK_RUNNING state, make sure this isn't used someplace that the
5386 ++ * current task can be in any other state. Note, idle is always in the
5387 ++ * TASK_RUNNING state.
5388 ++ */
5389 ++ WARN_ON_ONCE(current->__state);
5390 ++ do {
5391 ++ __schedule(SM_NONE);
5392 ++ } while (need_resched());
5393 ++}
5394 ++
5395 ++#if defined(CONFIG_CONTEXT_TRACKING) && !defined(CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK)
5396 ++asmlinkage __visible void __sched schedule_user(void)
5397 ++{
5398 ++ /*
5399 ++ * If we come here after a random call to set_need_resched(),
5400 ++ * or we have been woken up remotely but the IPI has not yet arrived,
5401 ++ * we haven't yet exited the RCU idle mode. Do it here manually until
5402 ++ * we find a better solution.
5403 ++ *
5404 ++ * NB: There are buggy callers of this function. Ideally we
5405 ++ * should warn if prev_state != CONTEXT_USER, but that will trigger
5406 ++ * too frequently to make sense yet.
5407 ++ */
5408 ++ enum ctx_state prev_state = exception_enter();
5409 ++ schedule();
5410 ++ exception_exit(prev_state);
5411 ++}
5412 ++#endif
5413 ++
5414 ++/**
5415 ++ * schedule_preempt_disabled - called with preemption disabled
5416 ++ *
5417 ++ * Returns with preemption disabled. Note: preempt_count must be 1
5418 ++ */
5419 ++void __sched schedule_preempt_disabled(void)
5420 ++{
5421 ++ sched_preempt_enable_no_resched();
5422 ++ schedule();
5423 ++ preempt_disable();
5424 ++}
5425 ++
5426 ++#ifdef CONFIG_PREEMPT_RT
5427 ++void __sched notrace schedule_rtlock(void)
5428 ++{
5429 ++ do {
5430 ++ preempt_disable();
5431 ++ __schedule(SM_RTLOCK_WAIT);
5432 ++ sched_preempt_enable_no_resched();
5433 ++ } while (need_resched());
5434 ++}
5435 ++NOKPROBE_SYMBOL(schedule_rtlock);
5436 ++#endif
5437 ++
5438 ++static void __sched notrace preempt_schedule_common(void)
5439 ++{
5440 ++ do {
5441 ++ /*
5442 ++ * Because the function tracer can trace preempt_count_sub()
5443 ++ * and it also uses preempt_enable/disable_notrace(), if
5444 ++ * NEED_RESCHED is set, the preempt_enable_notrace() called
5445 ++ * by the function tracer will call this function again and
5446 ++ * cause infinite recursion.
5447 ++ *
5448 ++ * Preemption must be disabled here before the function
5449 ++ * tracer can trace. Break up preempt_disable() into two
5450 ++ * calls. One to disable preemption without fear of being
5451 ++ * traced. The other to still record the preemption latency,
5452 ++ * which can also be traced by the function tracer.
5453 ++ */
5454 ++ preempt_disable_notrace();
5455 ++ preempt_latency_start(1);
5456 ++ __schedule(SM_PREEMPT);
5457 ++ preempt_latency_stop(1);
5458 ++ preempt_enable_no_resched_notrace();
5459 ++
5460 ++ /*
5461 ++ * Check again in case we missed a preemption opportunity
5462 ++ * between schedule and now.
5463 ++ */
5464 ++ } while (need_resched());
5465 ++}
5466 ++
5467 ++#ifdef CONFIG_PREEMPTION
5468 ++/*
5469 ++ * This is the entry point to schedule() from in-kernel preemption
5470 ++ * off of preempt_enable.
5471 ++ */
5472 ++asmlinkage __visible void __sched notrace preempt_schedule(void)
5473 ++{
5474 ++ /*
5475 ++ * If there is a non-zero preempt_count or interrupts are disabled,
5476 ++ * we do not want to preempt the current task. Just return..
5477 ++ */
5478 ++ if (likely(!preemptible()))
5479 ++ return;
5480 ++
5481 ++ preempt_schedule_common();
5482 ++}
5483 ++NOKPROBE_SYMBOL(preempt_schedule);
5484 ++EXPORT_SYMBOL(preempt_schedule);
5485 ++
5486 ++#ifdef CONFIG_PREEMPT_DYNAMIC
5487 ++DEFINE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
5488 ++EXPORT_STATIC_CALL_TRAMP(preempt_schedule);
5489 ++#endif
5490 ++
5491 ++
5492 ++/**
5493 ++ * preempt_schedule_notrace - preempt_schedule called by tracing
5494 ++ *
5495 ++ * The tracing infrastructure uses preempt_enable_notrace to prevent
5496 ++ * recursion and tracing preempt enabling caused by the tracing
5497 ++ * infrastructure itself. But as tracing can happen in areas coming
5498 ++ * from userspace or just about to enter userspace, a preempt enable
5499 ++ * can occur before user_exit() is called. This will cause the scheduler
5500 ++ * to be called when the system is still in usermode.
5501 ++ *
5502 ++ * To prevent this, the preempt_enable_notrace will use this function
5503 ++ * instead of preempt_schedule() to exit user context if needed before
5504 ++ * calling the scheduler.
5505 ++ */
5506 ++asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)
5507 ++{
5508 ++ enum ctx_state prev_ctx;
5509 ++
5510 ++ if (likely(!preemptible()))
5511 ++ return;
5512 ++
5513 ++ do {
5514 ++ /*
5515 ++ * Because the function tracer can trace preempt_count_sub()
5516 ++ * and it also uses preempt_enable/disable_notrace(), if
5517 ++ * NEED_RESCHED is set, the preempt_enable_notrace() called
5518 ++ * by the function tracer will call this function again and
5519 ++ * cause infinite recursion.
5520 ++ *
5521 ++ * Preemption must be disabled here before the function
5522 ++ * tracer can trace. Break up preempt_disable() into two
5523 ++ * calls. One to disable preemption without fear of being
5524 ++ * traced. The other to still record the preemption latency,
5525 ++ * which can also be traced by the function tracer.
5526 ++ */
5527 ++ preempt_disable_notrace();
5528 ++ preempt_latency_start(1);
5529 ++ /*
5530 ++ * Needs preempt disabled in case user_exit() is traced
5531 ++ * and the tracer calls preempt_enable_notrace() causing
5532 ++ * an infinite recursion.
5533 ++ */
5534 ++ prev_ctx = exception_enter();
5535 ++ __schedule(SM_PREEMPT);
5536 ++ exception_exit(prev_ctx);
5537 ++
5538 ++ preempt_latency_stop(1);
5539 ++ preempt_enable_no_resched_notrace();
5540 ++ } while (need_resched());
5541 ++}
5542 ++EXPORT_SYMBOL_GPL(preempt_schedule_notrace);
5543 ++
5544 ++#ifdef CONFIG_PREEMPT_DYNAMIC
5545 ++DEFINE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
5546 ++EXPORT_STATIC_CALL_TRAMP(preempt_schedule_notrace);
5547 ++#endif
5548 ++
5549 ++#endif /* CONFIG_PREEMPTION */
5550 ++
5551 ++#ifdef CONFIG_PREEMPT_DYNAMIC
5552 ++
5553 ++#include <linux/entry-common.h>
5554 ++
5555 ++/*
5556 ++ * SC:cond_resched
5557 ++ * SC:might_resched
5558 ++ * SC:preempt_schedule
5559 ++ * SC:preempt_schedule_notrace
5560 ++ * SC:irqentry_exit_cond_resched
5561 ++ *
5562 ++ *
5563 ++ * NONE:
5564 ++ * cond_resched <- __cond_resched
5565 ++ * might_resched <- RET0
5566 ++ * preempt_schedule <- NOP
5567 ++ * preempt_schedule_notrace <- NOP
5568 ++ * irqentry_exit_cond_resched <- NOP
5569 ++ *
5570 ++ * VOLUNTARY:
5571 ++ * cond_resched <- __cond_resched
5572 ++ * might_resched <- __cond_resched
5573 ++ * preempt_schedule <- NOP
5574 ++ * preempt_schedule_notrace <- NOP
5575 ++ * irqentry_exit_cond_resched <- NOP
5576 ++ *
5577 ++ * FULL:
5578 ++ * cond_resched <- RET0
5579 ++ * might_resched <- RET0
5580 ++ * preempt_schedule <- preempt_schedule
5581 ++ * preempt_schedule_notrace <- preempt_schedule_notrace
5582 ++ * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
5583 ++ */
5584 ++
5585 ++enum {
5586 ++ preempt_dynamic_none = 0,
5587 ++ preempt_dynamic_voluntary,
5588 ++ preempt_dynamic_full,
5589 ++};
5590 ++
5591 ++int preempt_dynamic_mode = preempt_dynamic_full;
5592 ++
5593 ++int sched_dynamic_mode(const char *str)
5594 ++{
5595 ++ if (!strcmp(str, "none"))
5596 ++ return preempt_dynamic_none;
5597 ++
5598 ++ if (!strcmp(str, "voluntary"))
5599 ++ return preempt_dynamic_voluntary;
5600 ++
5601 ++ if (!strcmp(str, "full"))
5602 ++ return preempt_dynamic_full;
5603 ++
5604 ++ return -EINVAL;
5605 ++}
5606 ++
5607 ++void sched_dynamic_update(int mode)
5608 ++{
5609 ++ /*
5610 ++ * Avoid {NONE,VOLUNTARY} -> FULL transitions from ever ending up in
5611 ++ * the ZERO state, which is invalid.
5612 ++ */
5613 ++ static_call_update(cond_resched, __cond_resched);
5614 ++ static_call_update(might_resched, __cond_resched);
5615 ++ static_call_update(preempt_schedule, __preempt_schedule_func);
5616 ++ static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
5617 ++ static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
5618 ++
5619 ++ switch (mode) {
5620 ++ case preempt_dynamic_none:
5621 ++ static_call_update(cond_resched, __cond_resched);
5622 ++ static_call_update(might_resched, (void *)&__static_call_return0);
5623 ++ static_call_update(preempt_schedule, NULL);
5624 ++ static_call_update(preempt_schedule_notrace, NULL);
5625 ++ static_call_update(irqentry_exit_cond_resched, NULL);
5626 ++ pr_info("Dynamic Preempt: none\n");
5627 ++ break;
5628 ++
5629 ++ case preempt_dynamic_voluntary:
5630 ++ static_call_update(cond_resched, __cond_resched);
5631 ++ static_call_update(might_resched, __cond_resched);
5632 ++ static_call_update(preempt_schedule, NULL);
5633 ++ static_call_update(preempt_schedule_notrace, NULL);
5634 ++ static_call_update(irqentry_exit_cond_resched, NULL);
5635 ++ pr_info("Dynamic Preempt: voluntary\n");
5636 ++ break;
5637 ++
5638 ++ case preempt_dynamic_full:
5639 ++ static_call_update(cond_resched, (void *)&__static_call_return0);
5640 ++ static_call_update(might_resched, (void *)&__static_call_return0);
5641 ++ static_call_update(preempt_schedule, __preempt_schedule_func);
5642 ++ static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
5643 ++ static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
5644 ++ pr_info("Dynamic Preempt: full\n");
5645 ++ break;
5646 ++ }
5647 ++
5648 ++ preempt_dynamic_mode = mode;
5649 ++}
5650 ++
5651 ++static int __init setup_preempt_mode(char *str)
5652 ++{
5653 ++ int mode = sched_dynamic_mode(str);
5654 ++ if (mode < 0) {
5655 ++ pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
5656 ++ return 1;
5657 ++ }
5658 ++
5659 ++ sched_dynamic_update(mode);
5660 ++ return 0;
5661 ++}
5662 ++__setup("preempt=", setup_preempt_mode);
5663 ++
5664 ++#endif /* CONFIG_PREEMPT_DYNAMIC */
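
For readers following the NONE/VOLUNTARY/FULL mapping documented above, the net effect of each `preempt=` boot value can be summarised in a small, purely illustrative lookup; the struct and field names below are invented for the sketch and do not exist in the kernel:

    /* Which of the patched static calls stay active per "preempt=" mode. */
    struct preempt_mode_summary {
            const char *name;
            int cond_resched_enabled;       /* cond_resched() wired to __cond_resched  */
            int might_resched_enabled;      /* might_resched() actually reschedules    */
            int full_preemption_enabled;    /* preempt_schedule*() entry points active */
    };

    static const struct preempt_mode_summary summary[] = {
            { "none",      1, 0, 0 },
            { "voluntary", 1, 1, 0 },
            { "full",      0, 0, 1 },
    };

For example, booting with preempt=voluntary keeps cond_resched()/might_resched() functional while the involuntary preemption entry points stay NOPs, matching what sched_dynamic_update() installs above.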
5665 ++
5666 ++/*
5667 ++ * This is the entry point to schedule() from kernel preemption
5668 ++ * off of irq context.
5669 ++ * Note, that this is called and return with irqs disabled. This will
5670 ++ * protect us against recursive calling from irq.
5671 ++ */
5672 ++asmlinkage __visible void __sched preempt_schedule_irq(void)
5673 ++{
5674 ++ enum ctx_state prev_state;
5675 ++
5676 ++ /* Catch callers which need to be fixed */
5677 ++ BUG_ON(preempt_count() || !irqs_disabled());
5678 ++
5679 ++ prev_state = exception_enter();
5680 ++
5681 ++ do {
5682 ++ preempt_disable();
5683 ++ local_irq_enable();
5684 ++ __schedule(SM_PREEMPT);
5685 ++ local_irq_disable();
5686 ++ sched_preempt_enable_no_resched();
5687 ++ } while (need_resched());
5688 ++
5689 ++ exception_exit(prev_state);
5690 ++}
5691 ++
5692 ++int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
5693 ++ void *key)
5694 ++{
5695 ++ WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) && wake_flags & ~WF_SYNC);
5696 ++ return try_to_wake_up(curr->private, mode, wake_flags);
5697 ++}
5698 ++EXPORT_SYMBOL(default_wake_function);
5699 ++
5700 ++static inline void check_task_changed(struct task_struct *p, struct rq *rq)
5701 ++{
5702 ++ /* Trigger resched if task sched_prio has been modified. */
5703 ++ if (task_on_rq_queued(p) && task_sched_prio_idx(p, rq) != p->sq_idx) {
5704 ++ requeue_task(p, rq);
5705 ++ check_preempt_curr(rq);
5706 ++ }
5707 ++}
5708 ++
5709 ++static void __setscheduler_prio(struct task_struct *p, int prio)
5710 ++{
5711 ++ p->prio = prio;
5712 ++}
5713 ++
5714 ++#ifdef CONFIG_RT_MUTEXES
5715 ++
5716 ++static inline int __rt_effective_prio(struct task_struct *pi_task, int prio)
5717 ++{
5718 ++ if (pi_task)
5719 ++ prio = min(prio, pi_task->prio);
5720 ++
5721 ++ return prio;
5722 ++}
5723 ++
5724 ++static inline int rt_effective_prio(struct task_struct *p, int prio)
5725 ++{
5726 ++ struct task_struct *pi_task = rt_mutex_get_top_task(p);
5727 ++
5728 ++ return __rt_effective_prio(pi_task, prio);
5729 ++}
5730 ++
5731 ++/*
5732 ++ * rt_mutex_setprio - set the current priority of a task
5733 ++ * @p: task to boost
5734 ++ * @pi_task: donor task
5735 ++ *
5736 ++ * This function changes the 'effective' priority of a task. It does
5737 ++ * not touch ->normal_prio like __setscheduler().
5738 ++ *
5739 ++ * Used by the rt_mutex code to implement priority inheritance
5740 ++ * logic. Call site only calls if the priority of the task changed.
5741 ++ */
5742 ++void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
5743 ++{
5744 ++ int prio;
5745 ++ struct rq *rq;
5746 ++ raw_spinlock_t *lock;
5747 ++
5748 ++ /* XXX used to be waiter->prio, not waiter->task->prio */
5749 ++ prio = __rt_effective_prio(pi_task, p->normal_prio);
5750 ++
5751 ++ /*
5752 ++ * If nothing changed; bail early.
5753 ++ */
5754 ++ if (p->pi_top_task == pi_task && prio == p->prio)
5755 ++ return;
5756 ++
5757 ++ rq = __task_access_lock(p, &lock);
5758 ++ /*
5759 ++ * Set under pi_lock && rq->lock, such that the value can be used under
5760 ++ * either lock.
5761 ++ *
5762 ++ * Note that there is loads of tricky to make this pointer cache work
5763 ++ * right. rt_mutex_slowunlock()+rt_mutex_postunlock() work together to
5764 ++ * ensure a task is de-boosted (pi_task is set to NULL) before the
5765 ++ * task is allowed to run again (and can exit). This ensures the pointer
5766 ++ * points to a blocked task -- which guarantees the task is present.
5767 ++ */
5768 ++ p->pi_top_task = pi_task;
5769 ++
5770 ++ /*
5771 ++ * For FIFO/RR we only need to set prio, if that matches we're done.
5772 ++ */
5773 ++ if (prio == p->prio)
5774 ++ goto out_unlock;
5775 ++
5776 ++ /*
5777 ++ * Idle task boosting is a nono in general. There is one
5778 ++ * exception, when PREEMPT_RT and NOHZ is active:
5779 ++ *
5780 ++ * The idle task calls get_next_timer_interrupt() and holds
5781 ++ * the timer wheel base->lock on the CPU and another CPU wants
5782 ++ * to access the timer (probably to cancel it). We can safely
5783 ++ * ignore the boosting request, as the idle CPU runs this code
5784 ++ * with interrupts disabled and will complete the lock
5785 ++ * protected section without being interrupted. So there is no
5786 ++ * real need to boost.
5787 ++ */
5788 ++ if (unlikely(p == rq->idle)) {
5789 ++ WARN_ON(p != rq->curr);
5790 ++ WARN_ON(p->pi_blocked_on);
5791 ++ goto out_unlock;
5792 ++ }
5793 ++
5794 ++ trace_sched_pi_setprio(p, pi_task);
5795 ++
5796 ++ __setscheduler_prio(p, prio);
5797 ++
5798 ++ check_task_changed(p, rq);
5799 ++out_unlock:
5800 ++ /* Avoid rq from going away on us: */
5801 ++ preempt_disable();
5802 ++
5803 ++ __balance_callbacks(rq);
5804 ++ __task_access_unlock(p, lock);
5805 ++
5806 ++ preempt_enable();
5807 ++}
5808 ++#else
5809 ++static inline int rt_effective_prio(struct task_struct *p, int prio)
5810 ++{
5811 ++ return prio;
5812 ++}
5813 ++#endif
5814 ++
5815 ++void set_user_nice(struct task_struct *p, long nice)
5816 ++{
5817 ++ unsigned long flags;
5818 ++ struct rq *rq;
5819 ++ raw_spinlock_t *lock;
5820 ++
5821 ++ if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE)
5822 ++ return;
5823 ++ /*
5824 ++ * We have to be careful, if called from sys_setpriority(),
5825 ++ * the task might be in the middle of scheduling on another CPU.
5826 ++ */
5827 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
5828 ++ rq = __task_access_lock(p, &lock);
5829 ++
5830 ++ p->static_prio = NICE_TO_PRIO(nice);
5831 ++ /*
5832 ++ * The RT priorities are set via sched_setscheduler(), but we still
5833 ++ * allow the 'normal' nice value to be set - but as expected
5834 ++ * it won't have any effect on scheduling while the task is
5835 ++ * not SCHED_NORMAL/SCHED_BATCH:
5836 ++ */
5837 ++ if (task_has_rt_policy(p))
5838 ++ goto out_unlock;
5839 ++
5840 ++ p->prio = effective_prio(p);
5841 ++
5842 ++ check_task_changed(p, rq);
5843 ++out_unlock:
5844 ++ __task_access_unlock(p, lock);
5845 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
5846 ++}
5847 ++EXPORT_SYMBOL(set_user_nice);
5848 ++
5849 ++/*
5850 ++ * can_nice - check if a task can reduce its nice value
5851 ++ * @p: task
5852 ++ * @nice: nice value
5853 ++ */
5854 ++int can_nice(const struct task_struct *p, const int nice)
5855 ++{
5856 ++ /* Convert nice value [19,-20] to rlimit style value [1,40] */
5857 ++ int nice_rlim = nice_to_rlimit(nice);
5858 ++
5859 ++ return (nice_rlim <= task_rlimit(p, RLIMIT_NICE) ||
5860 ++ capable(CAP_SYS_NICE));
5861 ++}
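
As a quick worked example of the rlimit-style conversion used above (a sketch restating nice_to_rlimit(), not new kernel code):

    /* nice_to_rlimit(): nice 19 -> 1, nice 0 -> 20, nice -20 -> 40. */
    static long nice_to_rlimit_example(long nice)
    {
            return 20 - nice;       /* i.e. MAX_NICE - nice + 1 */
    }

So a task whose RLIMIT_NICE soft limit is 25 may lower its nice value to 20 - 25 = -5 without CAP_SYS_NICE, since can_nice() only requires nice_to_rlimit(nice) <= RLIMIT_NICE.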
5862 ++
5863 ++#ifdef __ARCH_WANT_SYS_NICE
5864 ++
5865 ++/*
5866 ++ * sys_nice - change the priority of the current process.
5867 ++ * @increment: priority increment
5868 ++ *
5869 ++ * sys_setpriority is a more generic, but much slower function that
5870 ++ * does similar things.
5871 ++ */
5872 ++SYSCALL_DEFINE1(nice, int, increment)
5873 ++{
5874 ++ long nice, retval;
5875 ++
5876 ++ /*
5877 ++ * Setpriority might change our priority at the same moment.
5878 ++ * We don't have to worry. Conceptually one call occurs first
5879 ++ * and we have a single winner.
5880 ++ */
5881 ++
5882 ++ increment = clamp(increment, -NICE_WIDTH, NICE_WIDTH);
5883 ++ nice = task_nice(current) + increment;
5884 ++
5885 ++ nice = clamp_val(nice, MIN_NICE, MAX_NICE);
5886 ++ if (increment < 0 && !can_nice(current, nice))
5887 ++ return -EPERM;
5888 ++
5889 ++ retval = security_task_setnice(current, nice);
5890 ++ if (retval)
5891 ++ return retval;
5892 ++
5893 ++ set_user_nice(current, nice);
5894 ++ return 0;
5895 ++}
5896 ++
5897 ++#endif
5898 ++
5899 ++/**
5900 ++ * task_prio - return the priority value of a given task.
5901 ++ * @p: the task in question.
5902 ++ *
5903 ++ * Return: The priority value as seen by users in /proc.
5904 ++ *
5905 ++ * sched policy              return value    kernel prio     user prio/nice
5906 ++ *
5907 ++ * (BMQ)normal, batch, idle  [0 ... 53]      [100 ... 139]   0/[-20 ... 19]/[-7 ... 7]
5908 ++ * (PDS)normal, batch, idle  [0 ... 39]      100             0/[-20 ... 19]
5909 ++ * fifo, rr                  [-1 ... -100]   [99 ... 0]      [0 ... 99]
5910 ++ */
5911 ++int task_prio(const struct task_struct *p)
5912 ++{
5913 ++ return (p->prio < MAX_RT_PRIO) ? p->prio - MAX_RT_PRIO :
5914 ++ task_sched_prio_normal(p, task_rq(p));
5915 ++}
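
A small worked example of the FIFO/RR branch above (an illustrative helper, assuming the usual kernel prio = MAX_RT_PRIO - 1 - rt_priority mapping for RT tasks):

    #define MAX_RT_PRIO 100

    /* What task_prio() reports for a FIFO/RR task with user rt_priority R. */
    static int fifo_task_prio(int rt_priority)
    {
            int kernel_prio = MAX_RT_PRIO - 1 - rt_priority;

            return kernel_prio - MAX_RT_PRIO;   /* R = 0 -> -1, R = 99 -> -100 */
    }

which is exactly the [-1 ... -100] row in the table; the SCHED_NORMAL/BATCH/IDLE rows come from the BMQ/PDS-specific task_sched_prio_normal() instead.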
5916 ++
5917 ++/**
5918 ++ * idle_cpu - is a given CPU idle currently?
5919 ++ * @cpu: the processor in question.
5920 ++ *
5921 ++ * Return: 1 if the CPU is currently idle. 0 otherwise.
5922 ++ */
5923 ++int idle_cpu(int cpu)
5924 ++{
5925 ++ struct rq *rq = cpu_rq(cpu);
5926 ++
5927 ++ if (rq->curr != rq->idle)
5928 ++ return 0;
5929 ++
5930 ++ if (rq->nr_running)
5931 ++ return 0;
5932 ++
5933 ++#ifdef CONFIG_SMP
5934 ++ if (rq->ttwu_pending)
5935 ++ return 0;
5936 ++#endif
5937 ++
5938 ++ return 1;
5939 ++}
5940 ++
5941 ++/**
5942 ++ * idle_task - return the idle task for a given CPU.
5943 ++ * @cpu: the processor in question.
5944 ++ *
5945 ++ * Return: The idle task for the cpu @cpu.
5946 ++ */
5947 ++struct task_struct *idle_task(int cpu)
5948 ++{
5949 ++ return cpu_rq(cpu)->idle;
5950 ++}
5951 ++
5952 ++/**
5953 ++ * find_process_by_pid - find a process with a matching PID value.
5954 ++ * @pid: the pid in question.
5955 ++ *
5956 ++ * The task of @pid, if found. %NULL otherwise.
5957 ++ */
5958 ++static inline struct task_struct *find_process_by_pid(pid_t pid)
5959 ++{
5960 ++ return pid ? find_task_by_vpid(pid) : current;
5961 ++}
5962 ++
5963 ++/*
5964 ++ * sched_setparam() passes in -1 for its policy, to let the functions
5965 ++ * it calls know not to change it.
5966 ++ */
5967 ++#define SETPARAM_POLICY -1
5968 ++
5969 ++static void __setscheduler_params(struct task_struct *p,
5970 ++ const struct sched_attr *attr)
5971 ++{
5972 ++ int policy = attr->sched_policy;
5973 ++
5974 ++ if (policy == SETPARAM_POLICY)
5975 ++ policy = p->policy;
5976 ++
5977 ++ p->policy = policy;
5978 ++
5979 ++ /*
5980 ++ * Allow the normal nice value to be set, but it will have no
5981 ++ * effect on scheduling while the task is not SCHED_NORMAL/
5982 ++ * SCHED_BATCH.
5983 ++ */
5984 ++ p->static_prio = NICE_TO_PRIO(attr->sched_nice);
5985 ++
5986 ++ /*
5987 ++ * __sched_setscheduler() ensures attr->sched_priority == 0 when
5988 ++ * !rt_policy. Always setting this ensures that things like
5989 ++ * getparam()/getattr() don't report silly values for !rt tasks.
5990 ++ */
5991 ++ p->rt_priority = attr->sched_priority;
5992 ++ p->normal_prio = normal_prio(p);
5993 ++}
5994 ++
5995 ++/*
5996 ++ * check the target process has a UID that matches the current process's
5997 ++ */
5998 ++static bool check_same_owner(struct task_struct *p)
5999 ++{
6000 ++ const struct cred *cred = current_cred(), *pcred;
6001 ++ bool match;
6002 ++
6003 ++ rcu_read_lock();
6004 ++ pcred = __task_cred(p);
6005 ++ match = (uid_eq(cred->euid, pcred->euid) ||
6006 ++ uid_eq(cred->euid, pcred->uid));
6007 ++ rcu_read_unlock();
6008 ++ return match;
6009 ++}
6010 ++
6011 ++static int __sched_setscheduler(struct task_struct *p,
6012 ++ const struct sched_attr *attr,
6013 ++ bool user, bool pi)
6014 ++{
6015 ++ const struct sched_attr dl_squash_attr = {
6016 ++ .size = sizeof(struct sched_attr),
6017 ++ .sched_policy = SCHED_FIFO,
6018 ++ .sched_nice = 0,
6019 ++ .sched_priority = 99,
6020 ++ };
6021 ++ int oldpolicy = -1, policy = attr->sched_policy;
6022 ++ int retval, newprio;
6023 ++ struct callback_head *head;
6024 ++ unsigned long flags;
6025 ++ struct rq *rq;
6026 ++ int reset_on_fork;
6027 ++ raw_spinlock_t *lock;
6028 ++
6029 ++ /* The pi code expects interrupts enabled */
6030 ++ BUG_ON(pi && in_interrupt());
6031 ++
6032 ++ /*
6033 ++ * Alt schedule FW supports SCHED_DEADLINE by squashing it into prio 0 SCHED_FIFO
6034 ++ */
6035 ++ if (unlikely(SCHED_DEADLINE == policy)) {
6036 ++ attr = &dl_squash_attr;
6037 ++ policy = attr->sched_policy;
6038 ++ }
6039 ++recheck:
6040 ++ /* Double check policy once rq lock held */
6041 ++ if (policy < 0) {
6042 ++ reset_on_fork = p->sched_reset_on_fork;
6043 ++ policy = oldpolicy = p->policy;
6044 ++ } else {
6045 ++ reset_on_fork = !!(attr->sched_flags & SCHED_RESET_ON_FORK);
6046 ++
6047 ++ if (policy > SCHED_IDLE)
6048 ++ return -EINVAL;
6049 ++ }
6050 ++
6051 ++ if (attr->sched_flags & ~(SCHED_FLAG_ALL))
6052 ++ return -EINVAL;
6053 ++
6054 ++ /*
6055 ++ * Valid priorities for SCHED_FIFO and SCHED_RR are
6056 ++ * 1..MAX_RT_PRIO-1, valid priority for SCHED_NORMAL and
6057 ++ * SCHED_BATCH and SCHED_IDLE is 0.
6058 ++ */
6059 ++ if (attr->sched_priority < 0 ||
6060 ++ (p->mm && attr->sched_priority > MAX_RT_PRIO - 1) ||
6061 ++ (!p->mm && attr->sched_priority > MAX_RT_PRIO - 1))
6062 ++ return -EINVAL;
6063 ++ if ((SCHED_RR == policy || SCHED_FIFO == policy) !=
6064 ++ (attr->sched_priority != 0))
6065 ++ return -EINVAL;
6066 ++
6067 ++ /*
6068 ++ * Allow unprivileged RT tasks to decrease priority:
6069 ++ */
6070 ++ if (user && !capable(CAP_SYS_NICE)) {
6071 ++ if (SCHED_FIFO == policy || SCHED_RR == policy) {
6072 ++ unsigned long rlim_rtprio =
6073 ++ task_rlimit(p, RLIMIT_RTPRIO);
6074 ++
6075 ++ /* Can't set/change the rt policy */
6076 ++ if (policy != p->policy && !rlim_rtprio)
6077 ++ return -EPERM;
6078 ++
6079 ++ /* Can't increase priority */
6080 ++ if (attr->sched_priority > p->rt_priority &&
6081 ++ attr->sched_priority > rlim_rtprio)
6082 ++ return -EPERM;
6083 ++ }
6084 ++
6085 ++ /* Can't change other user's priorities */
6086 ++ if (!check_same_owner(p))
6087 ++ return -EPERM;
6088 ++
6089 ++ /* Normal users shall not reset the sched_reset_on_fork flag */
6090 ++ if (p->sched_reset_on_fork && !reset_on_fork)
6091 ++ return -EPERM;
6092 ++ }
6093 ++
6094 ++ if (user) {
6095 ++ retval = security_task_setscheduler(p);
6096 ++ if (retval)
6097 ++ return retval;
6098 ++ }
6099 ++
6100 ++ if (pi)
6101 ++ cpuset_read_lock();
6102 ++
6103 ++ /*
6104 ++ * Make sure no PI-waiters arrive (or leave) while we are
6105 ++ * changing the priority of the task:
6106 ++ */
6107 ++ raw_spin_lock_irqsave(&p->pi_lock, flags);
6108 ++
6109 ++ /*
6110 ++ * To be able to change p->policy safely, task_access_lock()
6111 ++ * must be called.
6112 ++ * If task_access_lock() is used here:
6113 ++ * for a task p that is not running, reading rq->stop is
6114 ++ * racy but acceptable, as ->stop doesn't change much.
6115 ++ * An enhancement could be made to read rq->stop safely.
6116 ++ */
6117 ++ rq = __task_access_lock(p, &lock);
6118 ++
6119 ++ /*
6120 ++ * Changing the policy of the stop threads is a very bad idea
6121 ++ */
6122 ++ if (p == rq->stop) {
6123 ++ retval = -EINVAL;
6124 ++ goto unlock;
6125 ++ }
6126 ++
6127 ++ /*
6128 ++ * If not changing anything there's no need to proceed further:
6129 ++ */
6130 ++ if (unlikely(policy == p->policy)) {
6131 ++ if (rt_policy(policy) && attr->sched_priority != p->rt_priority)
6132 ++ goto change;
6133 ++ if (!rt_policy(policy) &&
6134 ++ NICE_TO_PRIO(attr->sched_nice) != p->static_prio)
6135 ++ goto change;
6136 ++
6137 ++ p->sched_reset_on_fork = reset_on_fork;
6138 ++ retval = 0;
6139 ++ goto unlock;
6140 ++ }
6141 ++change:
6142 ++
6143 ++ /* Re-check policy now with rq lock held */
6144 ++ if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
6145 ++ policy = oldpolicy = -1;
6146 ++ __task_access_unlock(p, lock);
6147 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
6148 ++ if (pi)
6149 ++ cpuset_read_unlock();
6150 ++ goto recheck;
6151 ++ }
6152 ++
6153 ++ p->sched_reset_on_fork = reset_on_fork;
6154 ++
6155 ++ newprio = __normal_prio(policy, attr->sched_priority, NICE_TO_PRIO(attr->sched_nice));
6156 ++ if (pi) {
6157 ++ /*
6158 ++ * Take priority boosted tasks into account. If the new
6159 ++ * effective priority is unchanged, we just store the new
6160 ++ * normal parameters and do not touch the scheduler class and
6161 ++ * the runqueue. This will be done when the task deboosts
6162 ++ * itself.
6163 ++ */
6164 ++ newprio = rt_effective_prio(p, newprio);
6165 ++ }
6166 ++
6167 ++ if (!(attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)) {
6168 ++ __setscheduler_params(p, attr);
6169 ++ __setscheduler_prio(p, newprio);
6170 ++ }
6171 ++
6172 ++ check_task_changed(p, rq);
6173 ++
6174 ++ /* Avoid rq from going away on us: */
6175 ++ preempt_disable();
6176 ++ head = splice_balance_callbacks(rq);
6177 ++ __task_access_unlock(p, lock);
6178 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
6179 ++
6180 ++ if (pi) {
6181 ++ cpuset_read_unlock();
6182 ++ rt_mutex_adjust_pi(p);
6183 ++ }
6184 ++
6185 ++ /* Run balance callbacks after we've adjusted the PI chain: */
6186 ++ balance_callbacks(rq, head);
6187 ++ preempt_enable();
6188 ++
6189 ++ return 0;
6190 ++
6191 ++unlock:
6192 ++ __task_access_unlock(p, lock);
6193 ++ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
6194 ++ if (pi)
6195 ++ cpuset_read_unlock();
6196 ++ return retval;
6197 ++}
6198 ++
6199 ++static int _sched_setscheduler(struct task_struct *p, int policy,
6200 ++ const struct sched_param *param, bool check)
6201 ++{
6202 ++ struct sched_attr attr = {
6203 ++ .sched_policy = policy,
6204 ++ .sched_priority = param->sched_priority,
6205 ++ .sched_nice = PRIO_TO_NICE(p->static_prio),
6206 ++ };
6207 ++
6208 ++ /* Fixup the legacy SCHED_RESET_ON_FORK hack. */
6209 ++ if ((policy != SETPARAM_POLICY) && (policy & SCHED_RESET_ON_FORK)) {
6210 ++ attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
6211 ++ policy &= ~SCHED_RESET_ON_FORK;
6212 ++ attr.sched_policy = policy;
6213 ++ }
6214 ++
6215 ++ return __sched_setscheduler(p, &attr, check, true);
6216 ++}
6217 ++
6218 ++/**
6219 ++ * sched_setscheduler - change the scheduling policy and/or RT priority of a thread.
6220 ++ * @p: the task in question.
6221 ++ * @policy: new policy.
6222 ++ * @param: structure containing the new RT priority.
6223 ++ *
6224 ++ * Use sched_set_fifo(), read its comment.
6225 ++ *
6226 ++ * Return: 0 on success. An error code otherwise.
6227 ++ *
6228 ++ * NOTE that the task may be already dead.
6229 ++ */
6230 ++int sched_setscheduler(struct task_struct *p, int policy,
6231 ++ const struct sched_param *param)
6232 ++{
6233 ++ return _sched_setscheduler(p, policy, param, true);
6234 ++}
6235 ++
6236 ++int sched_setattr(struct task_struct *p, const struct sched_attr *attr)
6237 ++{
6238 ++ return __sched_setscheduler(p, attr, true, true);
6239 ++}
6240 ++
6241 ++int sched_setattr_nocheck(struct task_struct *p, const struct sched_attr *attr)
6242 ++{
6243 ++ return __sched_setscheduler(p, attr, false, true);
6244 ++}
6245 ++EXPORT_SYMBOL_GPL(sched_setattr_nocheck);
6246 ++
6247 ++/**
6248 ++ * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
6249 ++ * @p: the task in question.
6250 ++ * @policy: new policy.
6251 ++ * @param: structure containing the new RT priority.
6252 ++ *
6253 ++ * Just like sched_setscheduler, only don't bother checking if the
6254 ++ * current context has permission. For example, this is needed in
6255 ++ * stop_machine(): we create temporary high priority worker threads,
6256 ++ * but our caller might not have that capability.
6257 ++ *
6258 ++ * Return: 0 on success. An error code otherwise.
6259 ++ */
6260 ++int sched_setscheduler_nocheck(struct task_struct *p, int policy,
6261 ++ const struct sched_param *param)
6262 ++{
6263 ++ return _sched_setscheduler(p, policy, param, false);
6264 ++}
6265 ++
6266 ++/*
6267 ++ * SCHED_FIFO is a broken scheduler model; that is, it is fundamentally
6268 ++ * incapable of resource management, which is the one thing an OS really should
6269 ++ * be doing.
6270 ++ *
6271 ++ * This is of course the reason it is limited to privileged users only.
6272 ++ *
6273 ++ * Worse still; it is fundamentally impossible to compose static priority
6274 ++ * workloads. You cannot take two correctly working static prio workloads
6275 ++ * and smash them together and still expect them to work.
6276 ++ *
6277 ++ * For this reason 'all' FIFO tasks the kernel creates are basically at:
6278 ++ *
6279 ++ * MAX_RT_PRIO / 2
6280 ++ *
6281 ++ * The administrator _MUST_ configure the system, the kernel simply doesn't
6282 ++ * know enough information to make a sensible choice.
6283 ++ */
6284 ++void sched_set_fifo(struct task_struct *p)
6285 ++{
6286 ++ struct sched_param sp = { .sched_priority = MAX_RT_PRIO / 2 };
6287 ++ WARN_ON_ONCE(sched_setscheduler_nocheck(p, SCHED_FIFO, &sp) != 0);
6288 ++}
6289 ++EXPORT_SYMBOL_GPL(sched_set_fifo);
6290 ++
6291 ++/*
6292 ++ * For when you don't much care about FIFO, but want to be above SCHED_NORMAL.
6293 ++ */
6294 ++void sched_set_fifo_low(struct task_struct *p)
6295 ++{
6296 ++ struct sched_param sp = { .sched_priority = 1 };
6297 ++ WARN_ON_ONCE(sched_setscheduler_nocheck(p, SCHED_FIFO, &sp) != 0);
6298 ++}
6299 ++EXPORT_SYMBOL_GPL(sched_set_fifo_low);
6300 ++
6301 ++void sched_set_normal(struct task_struct *p, int nice)
6302 ++{
6303 ++ struct sched_attr attr = {
6304 ++ .sched_policy = SCHED_NORMAL,
6305 ++ .sched_nice = nice,
6306 ++ };
6307 ++ WARN_ON_ONCE(sched_setattr_nocheck(p, &attr) != 0);
6308 ++}
6309 ++EXPORT_SYMBOL_GPL(sched_set_normal);
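
A minimal sketch of how in-kernel code typically consumes these helpers (the worker function and kthread name here are made up for illustration):

    #include <linux/err.h>
    #include <linux/kthread.h>
    #include <linux/sched.h>

    static int example_start_fifo_worker(int (*fn)(void *), void *data)
    {
            struct task_struct *tsk = kthread_run(fn, data, "example_worker");

            if (IS_ERR(tsk))
                    return PTR_ERR(tsk);

            /* Lands at MAX_RT_PRIO / 2, per the policy described above. */
            sched_set_fifo(tsk);
            return 0;
    }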
6310 ++
6311 ++static int
6312 ++do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
6313 ++{
6314 ++ struct sched_param lparam;
6315 ++ struct task_struct *p;
6316 ++ int retval;
6317 ++
6318 ++ if (!param || pid < 0)
6319 ++ return -EINVAL;
6320 ++ if (copy_from_user(&lparam, param, sizeof(struct sched_param)))
6321 ++ return -EFAULT;
6322 ++
6323 ++ rcu_read_lock();
6324 ++ retval = -ESRCH;
6325 ++ p = find_process_by_pid(pid);
6326 ++ if (likely(p))
6327 ++ get_task_struct(p);
6328 ++ rcu_read_unlock();
6329 ++
6330 ++ if (likely(p)) {
6331 ++ retval = sched_setscheduler(p, policy, &lparam);
6332 ++ put_task_struct(p);
6333 ++ }
6334 ++
6335 ++ return retval;
6336 ++}
6337 ++
6338 ++/*
6339 ++ * Mimics kernel/events/core.c perf_copy_attr().
6340 ++ */
6341 ++static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *attr)
6342 ++{
6343 ++ u32 size;
6344 ++ int ret;
6345 ++
6346 ++ /* Zero the full structure, so that a short copy will be nice: */
6347 ++ memset(attr, 0, sizeof(*attr));
6348 ++
6349 ++ ret = get_user(size, &uattr->size);
6350 ++ if (ret)
6351 ++ return ret;
6352 ++
6353 ++ /* ABI compatibility quirk: */
6354 ++ if (!size)
6355 ++ size = SCHED_ATTR_SIZE_VER0;
6356 ++
6357 ++ if (size < SCHED_ATTR_SIZE_VER0 || size > PAGE_SIZE)
6358 ++ goto err_size;
6359 ++
6360 ++ ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
6361 ++ if (ret) {
6362 ++ if (ret == -E2BIG)
6363 ++ goto err_size;
6364 ++ return ret;
6365 ++ }
6366 ++
6367 ++ /*
6368 ++ * XXX: Do we want to be lenient like existing syscalls; or do we want
6369 ++ * to be strict and return an error on out-of-bounds values?
6370 ++ */
6371 ++ attr->sched_nice = clamp(attr->sched_nice, -20, 19);
6372 ++
6373 ++ /* sched/core.c uses zero here but we already know ret is zero */
6374 ++ return 0;
6375 ++
6376 ++err_size:
6377 ++ put_user(sizeof(*attr), &uattr->size);
6378 ++ return -E2BIG;
6379 ++}
6380 ++
6381 ++/**
6382 ++ * sys_sched_setscheduler - set/change the scheduler policy and RT priority
6383 ++ * @pid: the pid in question.
6384 ++ * @policy: new policy.
6385 ++ * @param: structure containing the new RT priority.
6386 ++ *
6387 ++ * Return: 0 on success. An error code otherwise.
6388 ++ */
6389 ++SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy, struct sched_param __user *, param)
6390 ++{
6391 ++ if (policy < 0)
6392 ++ return -EINVAL;
6393 ++
6394 ++ return do_sched_setscheduler(pid, policy, param);
6395 ++}
6396 ++
6397 ++/**
6398 ++ * sys_sched_setparam - set/change the RT priority of a thread
6399 ++ * @pid: the pid in question.
6400 ++ * @param: structure containing the new RT priority.
6401 ++ *
6402 ++ * Return: 0 on success. An error code otherwise.
6403 ++ */
6404 ++SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
6405 ++{
6406 ++ return do_sched_setscheduler(pid, SETPARAM_POLICY, param);
6407 ++}
6408 ++
6409 ++/**
6410 ++ * sys_sched_setattr - same as above, but with extended sched_attr
6411 ++ * @pid: the pid in question.
6412 ++ * @uattr: structure containing the extended parameters.
6413 ++ */
6414 ++SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
6415 ++ unsigned int, flags)
6416 ++{
6417 ++ struct sched_attr attr;
6418 ++ struct task_struct *p;
6419 ++ int retval;
6420 ++
6421 ++ if (!uattr || pid < 0 || flags)
6422 ++ return -EINVAL;
6423 ++
6424 ++ retval = sched_copy_attr(uattr, &attr);
6425 ++ if (retval)
6426 ++ return retval;
6427 ++
6428 ++ if ((int)attr.sched_policy < 0)
6429 ++ return -EINVAL;
6430 ++
6431 ++ rcu_read_lock();
6432 ++ retval = -ESRCH;
6433 ++ p = find_process_by_pid(pid);
6434 ++ if (likely(p))
6435 ++ get_task_struct(p);
6436 ++ rcu_read_unlock();
6437 ++
6438 ++ if (likely(p)) {
6439 ++ retval = sched_setattr(p, &attr);
6440 ++ put_task_struct(p);
6441 ++ }
6442 ++
6443 ++ return retval;
6444 ++}
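
From user space the extended interface above is reached through the raw syscall, since glibc ships no wrapper. A hedged sketch, declaring the attribute struct locally in its original VER0 layout and assuming the libc headers define SYS_sched_setattr:

    #define _GNU_SOURCE
    #include <sched.h>          /* SCHED_FIFO */
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    struct sched_attr {                 /* matches SCHED_ATTR_SIZE_VER0 (48 bytes) */
            uint32_t size;
            uint32_t sched_policy;
            uint64_t sched_flags;
            int32_t  sched_nice;        /* SCHED_NORMAL, SCHED_BATCH */
            uint32_t sched_priority;    /* SCHED_FIFO, SCHED_RR */
            uint64_t sched_runtime;     /* SCHED_DEADLINE (squashed to FIFO here) */
            uint64_t sched_deadline;
            uint64_t sched_period;
    };

    int main(void)
    {
            struct sched_attr attr = {
                    .size           = sizeof(attr),
                    .sched_policy   = SCHED_FIFO,
                    .sched_priority = 10,
            };

            /* pid 0 means the calling thread; needs CAP_SYS_NICE or RLIMIT_RTPRIO. */
            if (syscall(SYS_sched_setattr, 0, &attr, 0) != 0) {
                    perror("sched_setattr");
                    return 1;
            }
            return 0;
    }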
6445 ++
6446 ++/**
6447 ++ * sys_sched_getscheduler - get the policy (scheduling class) of a thread
6448 ++ * @pid: the pid in question.
6449 ++ *
6450 ++ * Return: On success, the policy of the thread. Otherwise, a negative error
6451 ++ * code.
6452 ++ */
6453 ++SYSCALL_DEFINE1(sched_getscheduler, pid_t, pid)
6454 ++{
6455 ++ struct task_struct *p;
6456 ++ int retval = -EINVAL;
6457 ++
6458 ++ if (pid < 0)
6459 ++ goto out_nounlock;
6460 ++
6461 ++ retval = -ESRCH;
6462 ++ rcu_read_lock();
6463 ++ p = find_process_by_pid(pid);
6464 ++ if (p) {
6465 ++ retval = security_task_getscheduler(p);
6466 ++ if (!retval)
6467 ++ retval = p->policy;
6468 ++ }
6469 ++ rcu_read_unlock();
6470 ++
6471 ++out_nounlock:
6472 ++ return retval;
6473 ++}
6474 ++
6475 ++/**
6476 ++ * sys_sched_getparam - get the RT priority of a thread
6477 ++ * @pid: the pid in question.
6478 ++ * @param: structure containing the RT priority.
6479 ++ *
6480 ++ * Return: On success, 0 and the RT priority is in @param. Otherwise, an error
6481 ++ * code.
6482 ++ */
6483 ++SYSCALL_DEFINE2(sched_getparam, pid_t, pid, struct sched_param __user *, param)
6484 ++{
6485 ++ struct sched_param lp = { .sched_priority = 0 };
6486 ++ struct task_struct *p;
6487 ++ int retval = -EINVAL;
6488 ++
6489 ++ if (!param || pid < 0)
6490 ++ goto out_nounlock;
6491 ++
6492 ++ rcu_read_lock();
6493 ++ p = find_process_by_pid(pid);
6494 ++ retval = -ESRCH;
6495 ++ if (!p)
6496 ++ goto out_unlock;
6497 ++
6498 ++ retval = security_task_getscheduler(p);
6499 ++ if (retval)
6500 ++ goto out_unlock;
6501 ++
6502 ++ if (task_has_rt_policy(p))
6503 ++ lp.sched_priority = p->rt_priority;
6504 ++ rcu_read_unlock();
6505 ++
6506 ++ /*
6507 ++ * This one might sleep, we cannot do it with a spinlock held ...
6508 ++ */
6509 ++ retval = copy_to_user(param, &lp, sizeof(*param)) ? -EFAULT : 0;
6510 ++
6511 ++out_nounlock:
6512 ++ return retval;
6513 ++
6514 ++out_unlock:
6515 ++ rcu_read_unlock();
6516 ++ return retval;
6517 ++}
6518 ++
6519 ++/*
6520 ++ * Copy the kernel size attribute structure (which might be larger
6521 ++ * than what user-space knows about) to user-space.
6522 ++ *
6523 ++ * Note that all cases are valid: user-space buffer can be larger or
6524 ++ * smaller than the kernel-space buffer. The usual case is that both
6525 ++ * have the same size.
6526 ++ */
6527 ++static int
6528 ++sched_attr_copy_to_user(struct sched_attr __user *uattr,
6529 ++ struct sched_attr *kattr,
6530 ++ unsigned int usize)
6531 ++{
6532 ++ unsigned int ksize = sizeof(*kattr);
6533 ++
6534 ++ if (!access_ok(uattr, usize))
6535 ++ return -EFAULT;
6536 ++
6537 ++ /*
6538 ++ * sched_getattr() ABI forwards and backwards compatibility:
6539 ++ *
6540 ++ * If usize == ksize then we just copy everything to user-space and all is good.
6541 ++ *
6542 ++ * If usize < ksize then we only copy as much as user-space has space for,
6543 ++ * this keeps ABI compatibility as well. We skip the rest.
6544 ++ *
6545 ++ * If usize > ksize then user-space is using a newer version of the ABI,
6546 ++ * which part the kernel doesn't know about. Just ignore it - tooling can
6547 ++ * detect the kernel's knowledge of attributes from the attr->size value
6548 ++ * which is set to ksize in this case.
6549 ++ */
6550 ++ kattr->size = min(usize, ksize);
6551 ++
6552 ++ if (copy_to_user(uattr, kattr, kattr->size))
6553 ++ return -EFAULT;
6554 ++
6555 ++ return 0;
6556 ++}
6557 ++
6558 ++/**
6559 ++ * sys_sched_getattr - similar to sched_getparam, but with sched_attr
6560 ++ * @pid: the pid in question.
6561 ++ * @uattr: structure containing the extended parameters.
6562 ++ * @usize: sizeof(attr) for fwd/bwd comp.
6563 ++ * @flags: for future extension.
6564 ++ */
6565 ++SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
6566 ++ unsigned int, usize, unsigned int, flags)
6567 ++{
6568 ++ struct sched_attr kattr = { };
6569 ++ struct task_struct *p;
6570 ++ int retval;
6571 ++
6572 ++ if (!uattr || pid < 0 || usize > PAGE_SIZE ||
6573 ++ usize < SCHED_ATTR_SIZE_VER0 || flags)
6574 ++ return -EINVAL;
6575 ++
6576 ++ rcu_read_lock();
6577 ++ p = find_process_by_pid(pid);
6578 ++ retval = -ESRCH;
6579 ++ if (!p)
6580 ++ goto out_unlock;
6581 ++
6582 ++ retval = security_task_getscheduler(p);
6583 ++ if (retval)
6584 ++ goto out_unlock;
6585 ++
6586 ++ kattr.sched_policy = p->policy;
6587 ++ if (p->sched_reset_on_fork)
6588 ++ kattr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
6589 ++ if (task_has_rt_policy(p))
6590 ++ kattr.sched_priority = p->rt_priority;
6591 ++ else
6592 ++ kattr.sched_nice = task_nice(p);
6593 ++ kattr.sched_flags &= SCHED_FLAG_ALL;
6594 ++
6595 ++#ifdef CONFIG_UCLAMP_TASK
6596 ++ kattr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value;
6597 ++ kattr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value;
6598 ++#endif
6599 ++
6600 ++ rcu_read_unlock();
6601 ++
6602 ++ return sched_attr_copy_to_user(uattr, &kattr, usize);
6603 ++
6604 ++out_unlock:
6605 ++ rcu_read_unlock();
6606 ++ return retval;
6607 ++}
6608 ++
6609 ++static int
6610 ++__sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
6611 ++{
6612 ++ int retval;
6613 ++ cpumask_var_t cpus_allowed, new_mask;
6614 ++
6615 ++ if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL))
6616 ++ return -ENOMEM;
6617 ++
6618 ++ if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) {
6619 ++ retval = -ENOMEM;
6620 ++ goto out_free_cpus_allowed;
6621 ++ }
6622 ++
6623 ++ cpuset_cpus_allowed(p, cpus_allowed);
6624 ++ cpumask_and(new_mask, mask, cpus_allowed);
6625 ++again:
6626 ++ retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK | SCA_USER);
6627 ++ if (retval)
6628 ++ goto out_free_new_mask;
6629 ++
6630 ++ cpuset_cpus_allowed(p, cpus_allowed);
6631 ++ if (!cpumask_subset(new_mask, cpus_allowed)) {
6632 ++ /*
6633 ++ * We must have raced with a concurrent cpuset
6634 ++ * update. Just reset the cpus_allowed to the
6635 ++ * cpuset's cpus_allowed
6636 ++ */
6637 ++ cpumask_copy(new_mask, cpus_allowed);
6638 ++ goto again;
6639 ++ }
6640 ++
6641 ++out_free_new_mask:
6642 ++ free_cpumask_var(new_mask);
6643 ++out_free_cpus_allowed:
6644 ++ free_cpumask_var(cpus_allowed);
6645 ++ return retval;
6646 ++}
6647 ++
6648 ++long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
6649 ++{
6650 ++ struct task_struct *p;
6651 ++ int retval;
6652 ++
6653 ++ rcu_read_lock();
6654 ++
6655 ++ p = find_process_by_pid(pid);
6656 ++ if (!p) {
6657 ++ rcu_read_unlock();
6658 ++ return -ESRCH;
6659 ++ }
6660 ++
6661 ++ /* Prevent p going away */
6662 ++ get_task_struct(p);
6663 ++ rcu_read_unlock();
6664 ++
6665 ++ if (p->flags & PF_NO_SETAFFINITY) {
6666 ++ retval = -EINVAL;
6667 ++ goto out_put_task;
6668 ++ }
6669 ++
6670 ++ if (!check_same_owner(p)) {
6671 ++ rcu_read_lock();
6672 ++ if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
6673 ++ rcu_read_unlock();
6674 ++ retval = -EPERM;
6675 ++ goto out_put_task;
6676 ++ }
6677 ++ rcu_read_unlock();
6678 ++ }
6679 ++
6680 ++ retval = security_task_setscheduler(p);
6681 ++ if (retval)
6682 ++ goto out_put_task;
6683 ++
6684 ++ retval = __sched_setaffinity(p, in_mask);
6685 ++out_put_task:
6686 ++ put_task_struct(p);
6687 ++ return retval;
6688 ++}
6689 ++
6690 ++static int get_user_cpu_mask(unsigned long __user *user_mask_ptr, unsigned len,
6691 ++ struct cpumask *new_mask)
6692 ++{
6693 ++ if (len < cpumask_size())
6694 ++ cpumask_clear(new_mask);
6695 ++ else if (len > cpumask_size())
6696 ++ len = cpumask_size();
6697 ++
6698 ++ return copy_from_user(new_mask, user_mask_ptr, len) ? -EFAULT : 0;
6699 ++}
6700 ++
6701 ++/**
6702 ++ * sys_sched_setaffinity - set the CPU affinity of a process
6703 ++ * @pid: pid of the process
6704 ++ * @len: length in bytes of the bitmask pointed to by user_mask_ptr
6705 ++ * @user_mask_ptr: user-space pointer to the new CPU mask
6706 ++ *
6707 ++ * Return: 0 on success. An error code otherwise.
6708 ++ */
6709 ++SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
6710 ++ unsigned long __user *, user_mask_ptr)
6711 ++{
6712 ++ cpumask_var_t new_mask;
6713 ++ int retval;
6714 ++
6715 ++ if (!alloc_cpumask_var(&new_mask, GFP_KERNEL))
6716 ++ return -ENOMEM;
6717 ++
6718 ++ retval = get_user_cpu_mask(user_mask_ptr, len, new_mask);
6719 ++ if (retval == 0)
6720 ++ retval = sched_setaffinity(pid, new_mask);
6721 ++ free_cpumask_var(new_mask);
6722 ++ return retval;
6723 ++}
6724 ++
6725 ++long sched_getaffinity(pid_t pid, cpumask_t *mask)
6726 ++{
6727 ++ struct task_struct *p;
6728 ++ raw_spinlock_t *lock;
6729 ++ unsigned long flags;
6730 ++ int retval;
6731 ++
6732 ++ rcu_read_lock();
6733 ++
6734 ++ retval = -ESRCH;
6735 ++ p = find_process_by_pid(pid);
6736 ++ if (!p)
6737 ++ goto out_unlock;
6738 ++
6739 ++ retval = security_task_getscheduler(p);
6740 ++ if (retval)
6741 ++ goto out_unlock;
6742 ++
6743 ++ task_access_lock_irqsave(p, &lock, &flags);
6744 ++ cpumask_and(mask, &p->cpus_mask, cpu_active_mask);
6745 ++ task_access_unlock_irqrestore(p, lock, &flags);
6746 ++
6747 ++out_unlock:
6748 ++ rcu_read_unlock();
6749 ++
6750 ++ return retval;
6751 ++}
6752 ++
6753 ++/**
6754 ++ * sys_sched_getaffinity - get the CPU affinity of a process
6755 ++ * @pid: pid of the process
6756 ++ * @len: length in bytes of the bitmask pointed to by user_mask_ptr
6757 ++ * @user_mask_ptr: user-space pointer to hold the current CPU mask
6758 ++ *
6759 ++ * Return: size of CPU mask copied to user_mask_ptr on success. An
6760 ++ * error code otherwise.
6761 ++ */
6762 ++SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
6763 ++ unsigned long __user *, user_mask_ptr)
6764 ++{
6765 ++ int ret;
6766 ++ cpumask_var_t mask;
6767 ++
6768 ++ if ((len * BITS_PER_BYTE) < nr_cpu_ids)
6769 ++ return -EINVAL;
6770 ++ if (len & (sizeof(unsigned long)-1))
6771 ++ return -EINVAL;
6772 ++
6773 ++ if (!alloc_cpumask_var(&mask, GFP_KERNEL))
6774 ++ return -ENOMEM;
6775 ++
6776 ++ ret = sched_getaffinity(pid, mask);
6777 ++ if (ret == 0) {
6778 ++ unsigned int retlen = min_t(size_t, len, cpumask_size());
6779 ++
6780 ++ if (copy_to_user(user_mask_ptr, mask, retlen))
6781 ++ ret = -EFAULT;
6782 ++ else
6783 ++ ret = retlen;
6784 ++ }
6785 ++ free_cpumask_var(mask);
6786 ++
6787 ++ return ret;
6788 ++}
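
The length checks above (len must cover nr_cpu_ids bits and be a multiple of sizeof(unsigned long)) are normally hidden behind the glibc wrapper; a minimal user-space usage sketch:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
            cpu_set_t mask;

            /* glibc's wrapper passes sizeof(cpu_set_t) as len and hides the
             * "bytes copied" return value of the raw syscall. */
            if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
                    perror("sched_getaffinity");
                    return 1;
            }
            printf("CPU 0 allowed: %s\n", CPU_ISSET(0, &mask) ? "yes" : "no");
            return 0;
    }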
6789 ++
6790 ++static void do_sched_yield(void)
6791 ++{
6792 ++ struct rq *rq;
6793 ++ struct rq_flags rf;
6794 ++
6795 ++ if (!sched_yield_type)
6796 ++ return;
6797 ++
6798 ++ rq = this_rq_lock_irq(&rf);
6799 ++
6800 ++ schedstat_inc(rq->yld_count);
6801 ++
6802 ++ if (1 == sched_yield_type) {
6803 ++ if (!rt_task(current))
6804 ++ do_sched_yield_type_1(current, rq);
6805 ++ } else if (2 == sched_yield_type) {
6806 ++ if (rq->nr_running > 1)
6807 ++ rq->skip = current;
6808 ++ }
6809 ++
6810 ++ preempt_disable();
6811 ++ raw_spin_unlock_irq(&rq->lock);
6812 ++ sched_preempt_enable_no_resched();
6813 ++
6814 ++ schedule();
6815 ++}
6816 ++
6817 ++/**
6818 ++ * sys_sched_yield - yield the current processor to other threads.
6819 ++ *
6820 ++ * This function yields the current CPU to other tasks. If there are no
6821 ++ * other threads running on this CPU then this function will return.
6822 ++ *
6823 ++ * Return: 0.
6824 ++ */
6825 ++SYSCALL_DEFINE0(sched_yield)
6826 ++{
6827 ++ do_sched_yield();
6828 ++ return 0;
6829 ++}
6830 ++
6831 ++#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
6832 ++int __sched __cond_resched(void)
6833 ++{
6834 ++ if (should_resched(0)) {
6835 ++ preempt_schedule_common();
6836 ++ return 1;
6837 ++ }
6838 ++ /*
6839 ++ * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
6840 ++ * whether the current CPU is in an RCU read-side critical section,
6841 ++ * so the tick can report quiescent states even for CPUs looping
6842 ++ * in kernel context. In contrast, in non-preemptible kernels,
6843 ++ * RCU readers leave no in-memory hints, which means that CPU-bound
6844 ++ * processes executing in kernel context might never report an
6845 ++ * RCU quiescent state. Therefore, the following code causes
6846 ++ * cond_resched() to report a quiescent state, but only when RCU
6847 ++ * is in urgent need of one.
6848 ++ */
6849 ++#ifndef CONFIG_PREEMPT_RCU
6850 ++ rcu_all_qs();
6851 ++#endif
6852 ++ return 0;
6853 ++}
6854 ++EXPORT_SYMBOL(__cond_resched);
6855 ++#endif
6856 ++
6857 ++#ifdef CONFIG_PREEMPT_DYNAMIC
6858 ++DEFINE_STATIC_CALL_RET0(cond_resched, __cond_resched);
6859 ++EXPORT_STATIC_CALL_TRAMP(cond_resched);
6860 ++
6861 ++DEFINE_STATIC_CALL_RET0(might_resched, __cond_resched);
6862 ++EXPORT_STATIC_CALL_TRAMP(might_resched);
6863 ++#endif
6864 ++
6865 ++/*
6866 ++ * __cond_resched_lock() - if a reschedule is pending, drop the given lock,
6867 ++ * call schedule, and on return reacquire the lock.
6868 ++ *
6869 ++ * This works OK both with and without CONFIG_PREEMPTION. We do strange low-level
6870 ++ * operations here to prevent schedule() from being called twice (once via
6871 ++ * spin_unlock(), once by hand).
6872 ++ */
6873 ++int __cond_resched_lock(spinlock_t *lock)
6874 ++{
6875 ++ int resched = should_resched(PREEMPT_LOCK_OFFSET);
6876 ++ int ret = 0;
6877 ++
6878 ++ lockdep_assert_held(lock);
6879 ++
6880 ++ if (spin_needbreak(lock) || resched) {
6881 ++ spin_unlock(lock);
6882 ++ if (resched)
6883 ++ preempt_schedule_common();
6884 ++ else
6885 ++ cpu_relax();
6886 ++ ret = 1;
6887 ++ spin_lock(lock);
6888 ++ }
6889 ++ return ret;
6890 ++}
6891 ++EXPORT_SYMBOL(__cond_resched_lock);
6892 ++
6893 ++int __cond_resched_rwlock_read(rwlock_t *lock)
6894 ++{
6895 ++ int resched = should_resched(PREEMPT_LOCK_OFFSET);
6896 ++ int ret = 0;
6897 ++
6898 ++ lockdep_assert_held_read(lock);
6899 ++
6900 ++ if (rwlock_needbreak(lock) || resched) {
6901 ++ read_unlock(lock);
6902 ++ if (resched)
6903 ++ preempt_schedule_common();
6904 ++ else
6905 ++ cpu_relax();
6906 ++ ret = 1;
6907 ++ read_lock(lock);
6908 ++ }
6909 ++ return ret;
6910 ++}
6911 ++EXPORT_SYMBOL(__cond_resched_rwlock_read);
6912 ++
6913 ++int __cond_resched_rwlock_write(rwlock_t *lock)
6914 ++{
6915 ++ int resched = should_resched(PREEMPT_LOCK_OFFSET);
6916 ++ int ret = 0;
6917 ++
6918 ++ lockdep_assert_held_write(lock);
6919 ++
6920 ++ if (rwlock_needbreak(lock) || resched) {
6921 ++ write_unlock(lock);
6922 ++ if (resched)
6923 ++ preempt_schedule_common();
6924 ++ else
6925 ++ cpu_relax();
6926 ++ ret = 1;
6927 ++ write_lock(lock);
6928 ++ }
6929 ++ return ret;
6930 ++}
6931 ++EXPORT_SYMBOL(__cond_resched_rwlock_write);
6932 ++
6933 ++/**
6934 ++ * yield - yield the current processor to other threads.
6935 ++ *
6936 ++ * Do not ever use this function, there's a 99% chance you're doing it wrong.
6937 ++ *
6938 ++ * The scheduler is at all times free to pick the calling task as the most
6939 ++ * eligible task to run, if removing the yield() call from your code breaks
6940 ++ * it, it's already broken.
6941 ++ *
6942 ++ * Typical broken usage is:
6943 ++ *
6944 ++ * while (!event)
6945 ++ * yield();
6946 ++ *
6947 ++ * where one assumes that yield() will let 'the other' process run that will
6948 ++ * make event true. If the current task is a SCHED_FIFO task that will never
6949 ++ * happen. Never use yield() as a progress guarantee!!
6950 ++ *
6951 ++ * If you want to use yield() to wait for something, use wait_event().
6952 ++ * If you want to use yield() to be 'nice' for others, use cond_resched().
6953 ++ * If you still want to use yield(), do not!
6954 ++ */
6955 ++void __sched yield(void)
6956 ++{
6957 ++ set_current_state(TASK_RUNNING);
6958 ++ do_sched_yield();
6959 ++}
6960 ++EXPORT_SYMBOL(yield);
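
The kernel-doc above recommends wait_event() over a yield() polling loop. A hedged in-kernel sketch of that pattern; example_wq and example_done are made-up names used only for illustration:

#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(example_wq);
static bool example_done;

/* consumer: sleep until the condition becomes true, no busy yielding */
static void example_wait(void)
{
	wait_event(example_wq, example_done);
}

/* producer: make the condition true and wake the waiter */
static void example_complete(void)
{
	example_done = true;
	wake_up(&example_wq);
}
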
6961 ++
6962 ++/**
6963 ++ * yield_to - yield the current processor to another thread in
6964 ++ * your thread group, or accelerate that thread toward the
6965 ++ * processor it's on.
6966 ++ * @p: target task
6967 ++ * @preempt: whether task preemption is allowed or not
6968 ++ *
6969 ++ * It's the caller's job to ensure that the target task struct
6970 ++ * can't go away on us before we can do any checks.
6971 ++ *
6972 ++ * In Alt schedule FW, yield_to is not supported.
6973 ++ *
6974 ++ * Return:
6975 ++ * true (>0) if we indeed boosted the target task.
6976 ++ * false (0) if we failed to boost the target.
6977 ++ * -ESRCH if there's no task to yield to.
6978 ++ */
6979 ++int __sched yield_to(struct task_struct *p, bool preempt)
6980 ++{
6981 ++ return 0;
6982 ++}
6983 ++EXPORT_SYMBOL_GPL(yield_to);
6984 ++
6985 ++int io_schedule_prepare(void)
6986 ++{
6987 ++ int old_iowait = current->in_iowait;
6988 ++
6989 ++ current->in_iowait = 1;
6990 ++ blk_schedule_flush_plug(current);
6991 ++
6992 ++ return old_iowait;
6993 ++}
6994 ++
6995 ++void io_schedule_finish(int token)
6996 ++{
6997 ++ current->in_iowait = token;
6998 ++}
6999 ++
7000 ++/*
7001 ++ * This task is about to go to sleep on IO. Increment rq->nr_iowait so
7002 ++ * that process accounting knows that this is a task in IO wait state.
7003 ++ *
7004 ++ * But don't do that if it is a deliberate, throttling IO wait (this task
7005 ++ * has set its backing_dev_info: the queue against which it should throttle)
7006 ++ */
7007 ++
7008 ++long __sched io_schedule_timeout(long timeout)
7009 ++{
7010 ++ int token;
7011 ++ long ret;
7012 ++
7013 ++ token = io_schedule_prepare();
7014 ++ ret = schedule_timeout(timeout);
7015 ++ io_schedule_finish(token);
7016 ++
7017 ++ return ret;
7018 ++}
7019 ++EXPORT_SYMBOL(io_schedule_timeout);
7020 ++
7021 ++void __sched io_schedule(void)
7022 ++{
7023 ++ int token;
7024 ++
7025 ++ token = io_schedule_prepare();
7026 ++ schedule();
7027 ++ io_schedule_finish(token);
7028 ++}
7029 ++EXPORT_SYMBOL(io_schedule);
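
io_schedule_prepare() and io_schedule_finish() bracket an arbitrary blocking wait so the blocked time is accounted as I/O wait, which is exactly how io_schedule_timeout() above is built. A minimal in-kernel sketch of the same token pattern around a completion (illustrative only):

#include <linux/completion.h>
#include <linux/sched.h>

static void example_wait_for_io(struct completion *done)
{
	int token = io_schedule_prepare();	/* mark the task in_iowait, flush plugged I/O */

	wait_for_completion(done);		/* the blocked time is charged to iowait */

	io_schedule_finish(token);		/* restore the previous in_iowait value */
}
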
7030 ++
7031 ++/**
7032 ++ * sys_sched_get_priority_max - return maximum RT priority.
7033 ++ * @policy: scheduling class.
7034 ++ *
7035 ++ * Return: On success, this syscall returns the maximum
7036 ++ * rt_priority that can be used by a given scheduling class.
7037 ++ * On failure, a negative error code is returned.
7038 ++ */
7039 ++SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
7040 ++{
7041 ++ int ret = -EINVAL;
7042 ++
7043 ++ switch (policy) {
7044 ++ case SCHED_FIFO:
7045 ++ case SCHED_RR:
7046 ++ ret = MAX_RT_PRIO - 1;
7047 ++ break;
7048 ++ case SCHED_NORMAL:
7049 ++ case SCHED_BATCH:
7050 ++ case SCHED_IDLE:
7051 ++ ret = 0;
7052 ++ break;
7053 ++ }
7054 ++ return ret;
7055 ++}
7056 ++
7057 ++/**
7058 ++ * sys_sched_get_priority_min - return minimum RT priority.
7059 ++ * @policy: scheduling class.
7060 ++ *
7061 ++ * Return: On success, this syscall returns the minimum
7062 ++ * rt_priority that can be used by a given scheduling class.
7063 ++ * On failure, a negative error code is returned.
7064 ++ */
7065 ++SYSCALL_DEFINE1(sched_get_priority_min, int, policy)
7066 ++{
7067 ++ int ret = -EINVAL;
7068 ++
7069 ++ switch (policy) {
7070 ++ case SCHED_FIFO:
7071 ++ case SCHED_RR:
7072 ++ ret = 1;
7073 ++ break;
7074 ++ case SCHED_NORMAL:
7075 ++ case SCHED_BATCH:
7076 ++ case SCHED_IDLE:
7077 ++ ret = 0;
7078 ++ break;
7079 ++ }
7080 ++ return ret;
7081 ++}
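
Together, the two syscalls above report the static priority range of a policy: 1..99 for SCHED_FIFO/SCHED_RR and 0 for the normal policies. A user-space sketch that queries the range and then requests the highest FIFO priority (typically needs CAP_SYS_NICE or root):

#include <sched.h>
#include <stdio.h>

int main(void)
{
	int min = sched_get_priority_min(SCHED_FIFO);
	int max = sched_get_priority_max(SCHED_FIFO);
	struct sched_param sp = { .sched_priority = max };

	printf("SCHED_FIFO priority range: %d..%d\n", min, max);

	/* pid 0 targets the calling process */
	if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
		perror("sched_setscheduler");
	return 0;
}
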
7082 ++
7083 ++static int sched_rr_get_interval(pid_t pid, struct timespec64 *t)
7084 ++{
7085 ++ struct task_struct *p;
7086 ++ int retval;
7087 ++
7088 ++ alt_sched_debug();
7089 ++
7090 ++ if (pid < 0)
7091 ++ return -EINVAL;
7092 ++
7093 ++ retval = -ESRCH;
7094 ++ rcu_read_lock();
7095 ++ p = find_process_by_pid(pid);
7096 ++ if (!p)
7097 ++ goto out_unlock;
7098 ++
7099 ++ retval = security_task_getscheduler(p);
7100 ++ if (retval)
7101 ++ goto out_unlock;
7102 ++ rcu_read_unlock();
7103 ++
7104 ++ *t = ns_to_timespec64(sched_timeslice_ns);
7105 ++ return 0;
7106 ++
7107 ++out_unlock:
7108 ++ rcu_read_unlock();
7109 ++ return retval;
7110 ++}
7111 ++
7112 ++/**
7113 ++ * sys_sched_rr_get_interval - return the default timeslice of a process.
7114 ++ * @pid: pid of the process.
7115 ++ * @interval: userspace pointer to the timeslice value.
7116 ++ *
7117 ++ *

7118 ++ * Return: On success, 0 and the timeslice is in @interval. Otherwise,
7119 ++ * an error code.
7120 ++ */
7121 ++SYSCALL_DEFINE2(sched_rr_get_interval, pid_t, pid,
7122 ++ struct __kernel_timespec __user *, interval)
7123 ++{
7124 ++ struct timespec64 t;
7125 ++ int retval = sched_rr_get_interval(pid, &t);
7126 ++
7127 ++ if (retval == 0)
7128 ++ retval = put_timespec64(&t, interval);
7129 ++
7130 ++ return retval;
7131 ++}
7132 ++
7133 ++#ifdef CONFIG_COMPAT_32BIT_TIME
7134 ++SYSCALL_DEFINE2(sched_rr_get_interval_time32, pid_t, pid,
7135 ++ struct old_timespec32 __user *, interval)
7136 ++{
7137 ++ struct timespec64 t;
7138 ++ int retval = sched_rr_get_interval(pid, &t);
7139 ++
7140 ++ if (retval == 0)
7141 ++ retval = put_old_timespec32(&t, interval);
7142 ++ return retval;
7143 ++}
7144 ++#endif
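
Note that sched_rr_get_interval() above fills the result from the global sched_timeslice_ns, so under BMQ/PDS every task reports the same slice regardless of policy. A user-space sketch that reads it for the calling thread:

#include <sched.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec ts;

	/* pid 0 queries the calling thread */
	if (sched_rr_get_interval(0, &ts) != 0) {
		perror("sched_rr_get_interval");
		return 1;
	}
	printf("timeslice: %ld.%09ld s\n", (long)ts.tv_sec, ts.tv_nsec);
	return 0;
}
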
7145 ++
7146 ++void sched_show_task(struct task_struct *p)
7147 ++{
7148 ++ unsigned long free = 0;
7149 ++ int ppid;
7150 ++
7151 ++ if (!try_get_task_stack(p))
7152 ++ return;
7153 ++
7154 ++ pr_info("task:%-15.15s state:%c", p->comm, task_state_to_char(p));
7155 ++
7156 ++ if (task_is_running(p))
7157 ++ pr_cont(" running task ");
7158 ++#ifdef CONFIG_DEBUG_STACK_USAGE
7159 ++ free = stack_not_used(p);
7160 ++#endif
7161 ++ ppid = 0;
7162 ++ rcu_read_lock();
7163 ++ if (pid_alive(p))
7164 ++ ppid = task_pid_nr(rcu_dereference(p->real_parent));
7165 ++ rcu_read_unlock();
7166 ++ pr_cont(" stack:%5lu pid:%5d ppid:%6d flags:0x%08lx\n",
7167 ++ free, task_pid_nr(p), ppid,
7168 ++ (unsigned long)task_thread_info(p)->flags);
7169 ++
7170 ++ print_worker_info(KERN_INFO, p);
7171 ++ print_stop_info(KERN_INFO, p);
7172 ++ show_stack(p, NULL, KERN_INFO);
7173 ++ put_task_stack(p);
7174 ++}
7175 ++EXPORT_SYMBOL_GPL(sched_show_task);
7176 ++
7177 ++static inline bool
7178 ++state_filter_match(unsigned long state_filter, struct task_struct *p)
7179 ++{
7180 ++ unsigned int state = READ_ONCE(p->__state);
7181 ++
7182 ++ /* no filter, everything matches */
7183 ++ if (!state_filter)
7184 ++ return true;
7185 ++
7186 ++ /* filter, but doesn't match */
7187 ++ if (!(state & state_filter))
7188 ++ return false;
7189 ++
7190 ++ /*
7191 ++ * When looking for TASK_UNINTERRUPTIBLE skip TASK_IDLE (allows
7192 ++ * TASK_KILLABLE).
7193 ++ */
7194 ++ if (state_filter == TASK_UNINTERRUPTIBLE && state == TASK_IDLE)
7195 ++ return false;
7196 ++
7197 ++ return true;
7198 ++}
7199 ++
7200 ++
7201 ++void show_state_filter(unsigned int state_filter)
7202 ++{
7203 ++ struct task_struct *g, *p;
7204 ++
7205 ++ rcu_read_lock();
7206 ++ for_each_process_thread(g, p) {
7207 ++ /*
7208 ++ * reset the NMI-timeout, listing all tasks on a slow
7209 ++ * console might take a lot of time:
7210 ++ * Also, reset softlockup watchdogs on all CPUs, because
7211 ++ * another CPU might be blocked waiting for us to process
7212 ++ * an IPI.
7213 ++ */
7214 ++ touch_nmi_watchdog();
7215 ++ touch_all_softlockup_watchdogs();
7216 ++ if (state_filter_match(state_filter, p))
7217 ++ sched_show_task(p);
7218 ++ }
7219 ++
7220 ++#ifdef CONFIG_SCHED_DEBUG
7221 ++ /* TODO: Alt schedule FW should support this
7222 ++ if (!state_filter)
7223 ++ sysrq_sched_debug_show();
7224 ++ */
7225 ++#endif
7226 ++ rcu_read_unlock();
7227 ++ /*
7228 ++ * Only show locks if all tasks are dumped:
7229 ++ */
7230 ++ if (!state_filter)
7231 ++ debug_show_all_locks();
7232 ++}
7233 ++
7234 ++void dump_cpu_task(int cpu)
7235 ++{
7236 ++ pr_info("Task dump for CPU %d:\n", cpu);
7237 ++ sched_show_task(cpu_curr(cpu));
7238 ++}
7239 ++
7240 ++/**
7241 ++ * init_idle - set up an idle thread for a given CPU
7242 ++ * @idle: task in question
7243 ++ * @cpu: CPU the idle task belongs to
7244 ++ *
7245 ++ * NOTE: this function does not set the idle thread's NEED_RESCHED
7246 ++ * flag, to make booting more robust.
7247 ++ */
7248 ++void __init init_idle(struct task_struct *idle, int cpu)
7249 ++{
7250 ++ struct rq *rq = cpu_rq(cpu);
7251 ++ unsigned long flags;
7252 ++
7253 ++ __sched_fork(0, idle);
7254 ++
7255 ++ /*
7256 ++ * The idle task doesn't need the kthread struct to function, but it
7257 ++ * is dressed up as a per-CPU kthread and thus needs to play the part
7258 ++ * if we want to avoid special-casing it in code that deals with per-CPU
7259 ++ * kthreads.
7260 ++ */
7261 ++ set_kthread_struct(idle);
7262 ++
7263 ++ raw_spin_lock_irqsave(&idle->pi_lock, flags);
7264 ++ raw_spin_lock(&rq->lock);
7265 ++ update_rq_clock(rq);
7266 ++
7267 ++ idle->last_ran = rq->clock_task;
7268 ++ idle->__state = TASK_RUNNING;
7269 ++ /*
7270 ++ * PF_KTHREAD should already be set at this point; regardless, make it
7271 ++ * look like a proper per-CPU kthread.
7272 ++ */
7273 ++ idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY;
7274 ++ kthread_set_per_cpu(idle, cpu);
7275 ++
7276 ++ sched_queue_init_idle(&rq->queue, idle);
7277 ++
7278 ++ scs_task_reset(idle);
7279 ++ kasan_unpoison_task_stack(idle);
7280 ++
7281 ++#ifdef CONFIG_SMP
7282 ++ /*
7283 ++ * It's possible that init_idle() gets called multiple times on a task,
7284 ++ * in that case do_set_cpus_allowed() will not do the right thing.
7285 ++ *
7286 ++ * And since this is boot we can forgo the serialisation.
7287 ++ */
7288 ++ set_cpus_allowed_common(idle, cpumask_of(cpu));
7289 ++#endif
7290 ++
7291 ++ /* Silence PROVE_RCU */
7292 ++ rcu_read_lock();
7293 ++ __set_task_cpu(idle, cpu);
7294 ++ rcu_read_unlock();
7295 ++
7296 ++ rq->idle = idle;
7297 ++ rcu_assign_pointer(rq->curr, idle);
7298 ++ idle->on_cpu = 1;
7299 ++
7300 ++ raw_spin_unlock(&rq->lock);
7301 ++ raw_spin_unlock_irqrestore(&idle->pi_lock, flags);
7302 ++
7303 ++ /* Set the preempt count _outside_ the spinlocks! */
7304 ++ init_idle_preempt_count(idle, cpu);
7305 ++
7306 ++ ftrace_graph_init_idle_task(idle, cpu);
7307 ++ vtime_init_idle(idle, cpu);
7308 ++#ifdef CONFIG_SMP
7309 ++ sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
7310 ++#endif
7311 ++}
7312 ++
7313 ++#ifdef CONFIG_SMP
7314 ++
7315 ++int cpuset_cpumask_can_shrink(const struct cpumask __maybe_unused *cur,
7316 ++ const struct cpumask __maybe_unused *trial)
7317 ++{
7318 ++ return 1;
7319 ++}
7320 ++
7321 ++int task_can_attach(struct task_struct *p,
7322 ++ const struct cpumask *cs_cpus_allowed)
7323 ++{
7324 ++ int ret = 0;
7325 ++
7326 ++ /*
7327 ++ * Kthreads which disallow setaffinity shouldn't be moved
7328 ++ * to a new cpuset; we don't want to change their CPU
7329 ++ * affinity and isolating such threads by their set of
7330 ++ * allowed nodes is unnecessary. Thus, cpusets are not
7331 ++ * applicable for such threads. This prevents checking for
7332 ++ * success of set_cpus_allowed_ptr() on all attached tasks
7333 ++ * before cpus_mask may be changed.
7334 ++ */
7335 ++ if (p->flags & PF_NO_SETAFFINITY)
7336 ++ ret = -EINVAL;
7337 ++
7338 ++ return ret;
7339 ++}
7340 ++
7341 ++bool sched_smp_initialized __read_mostly;
7342 ++
7343 ++#ifdef CONFIG_HOTPLUG_CPU
7344 ++/*
7345 ++ * Ensures that the idle task is using init_mm right before its CPU goes
7346 ++ * offline.
7347 ++ */
7348 ++void idle_task_exit(void)
7349 ++{
7350 ++ struct mm_struct *mm = current->active_mm;
7351 ++
7352 ++ BUG_ON(current != this_rq()->idle);
7353 ++
7354 ++ if (mm != &init_mm) {
7355 ++ switch_mm(mm, &init_mm, current);
7356 ++ finish_arch_post_lock_switch();
7357 ++ }
7358 ++
7359 ++ scs_task_reset(current);
7360 ++ /* finish_cpu(), as ran on the BP, will clean up the active_mm state */
7361 ++}
7362 ++
7363 ++static int __balance_push_cpu_stop(void *arg)
7364 ++{
7365 ++ struct task_struct *p = arg;
7366 ++ struct rq *rq = this_rq();
7367 ++ struct rq_flags rf;
7368 ++ int cpu;
7369 ++
7370 ++ raw_spin_lock_irq(&p->pi_lock);
7371 ++ rq_lock(rq, &rf);
7372 ++
7373 ++ update_rq_clock(rq);
7374 ++
7375 ++ if (task_rq(p) == rq && task_on_rq_queued(p)) {
7376 ++ cpu = select_fallback_rq(rq->cpu, p);
7377 ++ rq = __migrate_task(rq, p, cpu);
7378 ++ }
7379 ++
7380 ++ rq_unlock(rq, &rf);
7381 ++ raw_spin_unlock_irq(&p->pi_lock);
7382 ++
7383 ++ put_task_struct(p);
7384 ++
7385 ++ return 0;
7386 ++}
7387 ++
7388 ++static DEFINE_PER_CPU(struct cpu_stop_work, push_work);
7389 ++
7390 ++/*
7391 ++ * This is enabled below SCHED_AP_ACTIVE, i.e. when !cpu_active(), but it
7392 ++ * only takes effect while the CPU is going down.
7393 ++ */
7394 ++static void balance_push(struct rq *rq)
7395 ++{
7396 ++ struct task_struct *push_task = rq->curr;
7397 ++
7398 ++ lockdep_assert_held(&rq->lock);
7399 ++
7400 ++ /*
7401 ++ * Ensure the thing is persistent until balance_push_set(.on = false);
7402 ++ */
7403 ++ rq->balance_callback = &balance_push_callback;
7404 ++
7405 ++ /*
7406 ++ * Only active while going offline and when invoked on the outgoing
7407 ++ * CPU.
7408 ++ */
7409 ++ if (!cpu_dying(rq->cpu) || rq != this_rq())
7410 ++ return;
7411 ++
7412 ++ /*
7413 ++ * Both the cpu-hotplug and stop task are in this case and are
7414 ++ * required to complete the hotplug process.
7415 ++ */
7416 ++ if (kthread_is_per_cpu(push_task) ||
7417 ++ is_migration_disabled(push_task)) {
7418 ++
7419 ++ /*
7420 ++ * If this is the idle task on the outgoing CPU try to wake
7421 ++ * up the hotplug control thread which might wait for the
7422 ++ * last task to vanish. The rcuwait_active() check is
7423 ++ * accurate here because the waiter is pinned on this CPU
7424 ++ * and can't obviously be running in parallel.
7425 ++ *
7426 ++ * On RT kernels this also has to check whether there are
7427 ++ * pinned and scheduled out tasks on the runqueue. They
7428 ++ * need to leave the migrate disabled section first.
7429 ++ */
7430 ++ if (!rq->nr_running && !rq_has_pinned_tasks(rq) &&
7431 ++ rcuwait_active(&rq->hotplug_wait)) {
7432 ++ raw_spin_unlock(&rq->lock);
7433 ++ rcuwait_wake_up(&rq->hotplug_wait);
7434 ++ raw_spin_lock(&rq->lock);
7435 ++ }
7436 ++ return;
7437 ++ }
7438 ++
7439 ++ get_task_struct(push_task);
7440 ++ /*
7441 ++ * Temporarily drop rq->lock such that we can wake-up the stop task.
7442 ++ * Both preemption and IRQs are still disabled.
7443 ++ */
7444 ++ raw_spin_unlock(&rq->lock);
7445 ++ stop_one_cpu_nowait(rq->cpu, __balance_push_cpu_stop, push_task,
7446 ++ this_cpu_ptr(&push_work));
7447 ++ /*
7448 ++ * At this point need_resched() is true and we'll take the loop in
7449 ++ * schedule(). The next pick is obviously going to be the stop task
7450 ++ * which kthread_is_per_cpu() and will push this task away.
7451 ++ */
7452 ++ raw_spin_lock(&rq->lock);
7453 ++}
7454 ++
7455 ++static void balance_push_set(int cpu, bool on)
7456 ++{
7457 ++ struct rq *rq = cpu_rq(cpu);
7458 ++ struct rq_flags rf;
7459 ++
7460 ++ rq_lock_irqsave(rq, &rf);
7461 ++ if (on) {
7462 ++ WARN_ON_ONCE(rq->balance_callback);
7463 ++ rq->balance_callback = &balance_push_callback;
7464 ++ } else if (rq->balance_callback == &balance_push_callback) {
7465 ++ rq->balance_callback = NULL;
7466 ++ }
7467 ++ rq_unlock_irqrestore(rq, &rf);
7468 ++}
7469 ++
7470 ++/*
7471 ++ * Invoked from a CPUs hotplug control thread after the CPU has been marked
7472 ++ * inactive. All tasks which are not per CPU kernel threads are either
7473 ++ * pushed off this CPU now via balance_push() or placed on a different CPU
7474 ++ * during wakeup. Wait until the CPU is quiescent.
7475 ++ */
7476 ++static void balance_hotplug_wait(void)
7477 ++{
7478 ++ struct rq *rq = this_rq();
7479 ++
7480 ++ rcuwait_wait_event(&rq->hotplug_wait,
7481 ++ rq->nr_running == 1 && !rq_has_pinned_tasks(rq),
7482 ++ TASK_UNINTERRUPTIBLE);
7483 ++}
7484 ++
7485 ++#else
7486 ++
7487 ++static void balance_push(struct rq *rq)
7488 ++{
7489 ++}
7490 ++
7491 ++static void balance_push_set(int cpu, bool on)
7492 ++{
7493 ++}
7494 ++
7495 ++static inline void balance_hotplug_wait(void)
7496 ++{
7497 ++}
7498 ++#endif /* CONFIG_HOTPLUG_CPU */
7499 ++
7500 ++static void set_rq_offline(struct rq *rq)
7501 ++{
7502 ++ if (rq->online)
7503 ++ rq->online = false;
7504 ++}
7505 ++
7506 ++static void set_rq_online(struct rq *rq)
7507 ++{
7508 ++ if (!rq->online)
7509 ++ rq->online = true;
7510 ++}
7511 ++
7512 ++/*
7513 ++ * used to mark begin/end of suspend/resume:
7514 ++ */
7515 ++static int num_cpus_frozen;
7516 ++
7517 ++/*
7518 ++ * Update cpusets according to cpu_active mask. If cpusets are
7519 ++ * disabled, cpuset_update_active_cpus() becomes a simple wrapper
7520 ++ * around partition_sched_domains().
7521 ++ *
7522 ++ * If we come here as part of a suspend/resume, don't touch cpusets because we
7523 ++ * want to restore it back to its original state upon resume anyway.
7524 ++ */
7525 ++static void cpuset_cpu_active(void)
7526 ++{
7527 ++ if (cpuhp_tasks_frozen) {
7528 ++ /*
7529 ++ * num_cpus_frozen tracks how many CPUs are involved in suspend
7530 ++ * resume sequence. As long as this is not the last online
7531 ++ * operation in the resume sequence, just build a single sched
7532 ++ * domain, ignoring cpusets.
7533 ++ */
7534 ++ partition_sched_domains(1, NULL, NULL);
7535 ++ if (--num_cpus_frozen)
7536 ++ return;
7537 ++ /*
7538 ++ * This is the last CPU online operation. So fall through and
7539 ++ * restore the original sched domains by considering the
7540 ++ * cpuset configurations.
7541 ++ */
7542 ++ cpuset_force_rebuild();
7543 ++ }
7544 ++
7545 ++ cpuset_update_active_cpus();
7546 ++}
7547 ++
7548 ++static int cpuset_cpu_inactive(unsigned int cpu)
7549 ++{
7550 ++ if (!cpuhp_tasks_frozen) {
7551 ++ cpuset_update_active_cpus();
7552 ++ } else {
7553 ++ num_cpus_frozen++;
7554 ++ partition_sched_domains(1, NULL, NULL);
7555 ++ }
7556 ++ return 0;
7557 ++}
7558 ++
7559 ++int sched_cpu_activate(unsigned int cpu)
7560 ++{
7561 ++ struct rq *rq = cpu_rq(cpu);
7562 ++ unsigned long flags;
7563 ++
7564 ++ /*
7565 ++ * Clear the balance_push callback and prepare to schedule
7566 ++ * regular tasks.
7567 ++ */
7568 ++ balance_push_set(cpu, false);
7569 ++
7570 ++#ifdef CONFIG_SCHED_SMT
7571 ++ /*
7572 ++ * When going up, increment the number of cores with SMT present.
7573 ++ */
7574 ++ if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
7575 ++ static_branch_inc_cpuslocked(&sched_smt_present);
7576 ++#endif
7577 ++ set_cpu_active(cpu, true);
7578 ++
7579 ++ if (sched_smp_initialized)
7580 ++ cpuset_cpu_active();
7581 ++
7582 ++ /*
7583 ++ * Put the rq online, if not already. This happens:
7584 ++ *
7585 ++ * 1) In the early boot process, because we build the real domains
7586 ++ * after all cpus have been brought up.
7587 ++ *
7588 ++ * 2) At runtime, if cpuset_cpu_active() fails to rebuild the
7589 ++ * domains.
7590 ++ */
7591 ++ raw_spin_lock_irqsave(&rq->lock, flags);
7592 ++ set_rq_online(rq);
7593 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
7594 ++
7595 ++ return 0;
7596 ++}
7597 ++
7598 ++int sched_cpu_deactivate(unsigned int cpu)
7599 ++{
7600 ++ struct rq *rq = cpu_rq(cpu);
7601 ++ unsigned long flags;
7602 ++ int ret;
7603 ++
7604 ++ set_cpu_active(cpu, false);
7605 ++
7606 ++ /*
7607 ++ * From this point forward, this CPU will refuse to run any task that
7608 ++ * is not: migrate_disable() or KTHREAD_IS_PER_CPU, and will actively
7609 ++ * push those tasks away until this gets cleared, see
7610 ++ * sched_cpu_dying().
7611 ++ */
7612 ++ balance_push_set(cpu, true);
7613 ++
7614 ++ /*
7615 ++ * We've cleared cpu_active_mask, wait for all preempt-disabled and RCU
7616 ++ * users of this state to go away such that all new such users will
7617 ++ * observe it.
7618 ++ *
7619 ++ * Specifically, we rely on ttwu to no longer target this CPU, see
7620 ++ * ttwu_queue_cond() and is_cpu_allowed().
7621 ++ *
7622 ++ * Do sync before park smpboot threads to take care the rcu boost case.
7623 ++ */
7624 ++ synchronize_rcu();
7625 ++
7626 ++ raw_spin_lock_irqsave(&rq->lock, flags);
7627 ++ update_rq_clock(rq);
7628 ++ set_rq_offline(rq);
7629 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
7630 ++
7631 ++#ifdef CONFIG_SCHED_SMT
7632 ++ /*
7633 ++ * When going down, decrement the number of cores with SMT present.
7634 ++ */
7635 ++ if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
7636 ++ static_branch_dec_cpuslocked(&sched_smt_present);
7637 ++ if (!static_branch_likely(&sched_smt_present))
7638 ++ cpumask_clear(&sched_sg_idle_mask);
7639 ++ }
7640 ++#endif
7641 ++
7642 ++ if (!sched_smp_initialized)
7643 ++ return 0;
7644 ++
7645 ++ ret = cpuset_cpu_inactive(cpu);
7646 ++ if (ret) {
7647 ++ balance_push_set(cpu, false);
7648 ++ set_cpu_active(cpu, true);
7649 ++ return ret;
7650 ++ }
7651 ++
7652 ++ return 0;
7653 ++}
7654 ++
7655 ++static void sched_rq_cpu_starting(unsigned int cpu)
7656 ++{
7657 ++ struct rq *rq = cpu_rq(cpu);
7658 ++
7659 ++ rq->calc_load_update = calc_load_update;
7660 ++}
7661 ++
7662 ++int sched_cpu_starting(unsigned int cpu)
7663 ++{
7664 ++ sched_rq_cpu_starting(cpu);
7665 ++ sched_tick_start(cpu);
7666 ++ return 0;
7667 ++}
7668 ++
7669 ++#ifdef CONFIG_HOTPLUG_CPU
7670 ++
7671 ++/*
7672 ++ * Invoked immediately before the stopper thread is invoked to bring the
7673 ++ * CPU down completely. At this point all per CPU kthreads except the
7674 ++ * hotplug thread (current) and the stopper thread (inactive) have been
7675 ++ * either parked or have been unbound from the outgoing CPU. Ensure that
7676 ++ * any of those which might be on the way out are gone.
7677 ++ *
7678 ++ * If after this point a bound task is being woken on this CPU then the
7679 ++ * responsible hotplug callback has failed to do its job.
7680 ++ * sched_cpu_dying() will catch it with the appropriate fireworks.
7681 ++ */
7682 ++int sched_cpu_wait_empty(unsigned int cpu)
7683 ++{
7684 ++ balance_hotplug_wait();
7685 ++ return 0;
7686 ++}
7687 ++
7688 ++/*
7689 ++ * Since this CPU is going 'away' for a while, fold any nr_active delta we
7690 ++ * might have. Called from the CPU stopper task after ensuring that the
7691 ++ * stopper is the last running task on the CPU, so nr_active count is
7692 ++ * stable. We need to take the teardown thread which is calling this into
7693 ++ * account, so we hand in adjust = 1 to the load calculation.
7694 ++ *
7695 ++ * Also see the comment "Global load-average calculations".
7696 ++ */
7697 ++static void calc_load_migrate(struct rq *rq)
7698 ++{
7699 ++ long delta = calc_load_fold_active(rq, 1);
7700 ++
7701 ++ if (delta)
7702 ++ atomic_long_add(delta, &calc_load_tasks);
7703 ++}
7704 ++
7705 ++static void dump_rq_tasks(struct rq *rq, const char *loglvl)
7706 ++{
7707 ++ struct task_struct *g, *p;
7708 ++ int cpu = cpu_of(rq);
7709 ++
7710 ++ lockdep_assert_held(&rq->lock);
7711 ++
7712 ++ printk("%sCPU%d enqueued tasks (%u total):\n", loglvl, cpu, rq->nr_running);
7713 ++ for_each_process_thread(g, p) {
7714 ++ if (task_cpu(p) != cpu)
7715 ++ continue;
7716 ++
7717 ++ if (!task_on_rq_queued(p))
7718 ++ continue;
7719 ++
7720 ++ printk("%s\tpid: %d, name: %s\n", loglvl, p->pid, p->comm);
7721 ++ }
7722 ++}
7723 ++
7724 ++int sched_cpu_dying(unsigned int cpu)
7725 ++{
7726 ++ struct rq *rq = cpu_rq(cpu);
7727 ++ unsigned long flags;
7728 ++
7729 ++ /* Handle pending wakeups and then migrate everything off */
7730 ++ sched_tick_stop(cpu);
7731 ++
7732 ++ raw_spin_lock_irqsave(&rq->lock, flags);
7733 ++ if (rq->nr_running != 1 || rq_has_pinned_tasks(rq)) {
7734 ++ WARN(true, "Dying CPU not properly vacated!");
7735 ++ dump_rq_tasks(rq, KERN_WARNING);
7736 ++ }
7737 ++ raw_spin_unlock_irqrestore(&rq->lock, flags);
7738 ++
7739 ++ calc_load_migrate(rq);
7740 ++ hrtick_clear(rq);
7741 ++ return 0;
7742 ++}
7743 ++#endif
7744 ++
7745 ++#ifdef CONFIG_SMP
7746 ++static void sched_init_topology_cpumask_early(void)
7747 ++{
7748 ++ int cpu;
7749 ++ cpumask_t *tmp;
7750 ++
7751 ++ for_each_possible_cpu(cpu) {
7752 ++ /* init topo masks */
7753 ++ tmp = per_cpu(sched_cpu_topo_masks, cpu);
7754 ++
7755 ++ cpumask_copy(tmp, cpumask_of(cpu));
7756 ++ tmp++;
7757 ++ cpumask_copy(tmp, cpu_possible_mask);
7758 ++ per_cpu(sched_cpu_llc_mask, cpu) = tmp;
7759 ++ per_cpu(sched_cpu_topo_end_mask, cpu) = ++tmp;
7760 ++ /*per_cpu(sd_llc_id, cpu) = cpu;*/
7761 ++ }
7762 ++}
7763 ++
7764 ++#define TOPOLOGY_CPUMASK(name, mask, last)\
7765 ++ if (cpumask_and(topo, topo, mask)) { \
7766 ++ cpumask_copy(topo, mask); \
7767 ++ printk(KERN_INFO "sched: cpu#%02d topo: 0x%08lx - "#name, \
7768 ++ cpu, (topo++)->bits[0]); \
7769 ++ } \
7770 ++ if (!last) \
7771 ++ cpumask_complement(topo, mask)
7772 ++
7773 ++static void sched_init_topology_cpumask(void)
7774 ++{
7775 ++ int cpu;
7776 ++ cpumask_t *topo;
7777 ++
7778 ++ for_each_online_cpu(cpu) {
7779 ++ /* take chance to reset time slice for idle tasks */
7780 ++ cpu_rq(cpu)->idle->time_slice = sched_timeslice_ns;
7781 ++
7782 ++ topo = per_cpu(sched_cpu_topo_masks, cpu) + 1;
7783 ++
7784 ++ cpumask_complement(topo, cpumask_of(cpu));
7785 ++#ifdef CONFIG_SCHED_SMT
7786 ++ TOPOLOGY_CPUMASK(smt, topology_sibling_cpumask(cpu), false);
7787 ++#endif
7788 ++ per_cpu(sd_llc_id, cpu) = cpumask_first(cpu_coregroup_mask(cpu));
7789 ++ per_cpu(sched_cpu_llc_mask, cpu) = topo;
7790 ++ TOPOLOGY_CPUMASK(coregroup, cpu_coregroup_mask(cpu), false);
7791 ++
7792 ++ TOPOLOGY_CPUMASK(core, topology_core_cpumask(cpu), false);
7793 ++
7794 ++ TOPOLOGY_CPUMASK(others, cpu_online_mask, true);
7795 ++
7796 ++ per_cpu(sched_cpu_topo_end_mask, cpu) = topo;
7797 ++ printk(KERN_INFO "sched: cpu#%02d llc_id = %d, llc_mask idx = %d\n",
7798 ++ cpu, per_cpu(sd_llc_id, cpu),
7799 ++ (int) (per_cpu(sched_cpu_llc_mask, cpu) -
7800 ++ per_cpu(sched_cpu_topo_masks, cpu)));
7801 ++ }
7802 ++}
7803 ++#endif
7804 ++
7805 ++void __init sched_init_smp(void)
7806 ++{
7807 ++ /* Move init over to a non-isolated CPU */
7808 ++ if (set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_FLAG_DOMAIN)) < 0)
7809 ++ BUG();
7810 ++ current->flags &= ~PF_NO_SETAFFINITY;
7811 ++
7812 ++ sched_init_topology_cpumask();
7813 ++
7814 ++ sched_smp_initialized = true;
7815 ++}
7816 ++#else
7817 ++void __init sched_init_smp(void)
7818 ++{
7819 ++ cpu_rq(0)->idle->time_slice = sched_timeslice_ns;
7820 ++}
7821 ++#endif /* CONFIG_SMP */
7822 ++
7823 ++int in_sched_functions(unsigned long addr)
7824 ++{
7825 ++ return in_lock_functions(addr) ||
7826 ++ (addr >= (unsigned long)__sched_text_start
7827 ++ && addr < (unsigned long)__sched_text_end);
7828 ++}
7829 ++
7830 ++#ifdef CONFIG_CGROUP_SCHED
7831 ++/* task group related information */
7832 ++struct task_group {
7833 ++ struct cgroup_subsys_state css;
7834 ++
7835 ++ struct rcu_head rcu;
7836 ++ struct list_head list;
7837 ++
7838 ++ struct task_group *parent;
7839 ++ struct list_head siblings;
7840 ++ struct list_head children;
7841 ++#ifdef CONFIG_FAIR_GROUP_SCHED
7842 ++ unsigned long shares;
7843 ++#endif
7844 ++};
7845 ++
7846 ++/*
7847 ++ * Default task group.
7848 ++ * Every task in the system belongs to this group at bootup.
7849 ++ */
7850 ++struct task_group root_task_group;
7851 ++LIST_HEAD(task_groups);
7852 ++
7853 ++/* Cacheline aligned slab cache for task_group */
7854 ++static struct kmem_cache *task_group_cache __read_mostly;
7855 ++#endif /* CONFIG_CGROUP_SCHED */
7856 ++
7857 ++void __init sched_init(void)
7858 ++{
7859 ++ int i;
7860 ++ struct rq *rq;
7861 ++
7862 ++ printk(KERN_INFO ALT_SCHED_VERSION_MSG);
7863 ++
7864 ++ wait_bit_init();
7865 ++
7866 ++#ifdef CONFIG_SMP
7867 ++ for (i = 0; i < SCHED_BITS; i++)
7868 ++ cpumask_copy(sched_rq_watermark + i, cpu_present_mask);
7869 ++#endif
7870 ++
7871 ++#ifdef CONFIG_CGROUP_SCHED
7872 ++ task_group_cache = KMEM_CACHE(task_group, 0);
7873 ++
7874 ++ list_add(&root_task_group.list, &task_groups);
7875 ++ INIT_LIST_HEAD(&root_task_group.children);
7876 ++ INIT_LIST_HEAD(&root_task_group.siblings);
7877 ++#endif /* CONFIG_CGROUP_SCHED */
7878 ++ for_each_possible_cpu(i) {
7879 ++ rq = cpu_rq(i);
7880 ++
7881 ++ sched_queue_init(&rq->queue);
7882 ++ rq->watermark = IDLE_TASK_SCHED_PRIO;
7883 ++ rq->skip = NULL;
7884 ++
7885 ++ raw_spin_lock_init(&rq->lock);
7886 ++ rq->nr_running = rq->nr_uninterruptible = 0;
7887 ++ rq->calc_load_active = 0;
7888 ++ rq->calc_load_update = jiffies + LOAD_FREQ;
7889 ++#ifdef CONFIG_SMP
7890 ++ rq->online = false;
7891 ++ rq->cpu = i;
7892 ++
7893 ++#ifdef CONFIG_SCHED_SMT
7894 ++ rq->active_balance = 0;
7895 ++#endif
7896 ++
7897 ++#ifdef CONFIG_NO_HZ_COMMON
7898 ++ INIT_CSD(&rq->nohz_csd, nohz_csd_func, rq);
7899 ++#endif
7900 ++ rq->balance_callback = &balance_push_callback;
7901 ++#ifdef CONFIG_HOTPLUG_CPU
7902 ++ rcuwait_init(&rq->hotplug_wait);
7903 ++#endif
7904 ++#endif /* CONFIG_SMP */
7905 ++ rq->nr_switches = 0;
7906 ++
7907 ++ hrtick_rq_init(rq);
7908 ++ atomic_set(&rq->nr_iowait, 0);
7909 ++ }
7910 ++#ifdef CONFIG_SMP
7911 ++ /* Set rq->online for cpu 0 */
7912 ++ cpu_rq(0)->online = true;
7913 ++#endif
7914 ++ /*
7915 ++ * The boot idle thread does lazy MMU switching as well:
7916 ++ */
7917 ++ mmgrab(&init_mm);
7918 ++ enter_lazy_tlb(&init_mm, current);
7919 ++
7920 ++ /*
7921 ++ * Make us the idle thread. Technically, schedule() should not be
7922 ++ * called from this thread, however somewhere below it might be,
7923 ++ * but because we are the idle thread, we just pick up running again
7924 ++ * when this runqueue becomes "idle".
7925 ++ */
7926 ++ init_idle(current, smp_processor_id());
7927 ++
7928 ++ calc_load_update = jiffies + LOAD_FREQ;
7929 ++
7930 ++#ifdef CONFIG_SMP
7931 ++ idle_thread_set_boot_cpu();
7932 ++ balance_push_set(smp_processor_id(), false);
7933 ++
7934 ++ sched_init_topology_cpumask_early();
7935 ++#endif /* SMP */
7936 ++
7937 ++ psi_init();
7938 ++}
7939 ++
7940 ++#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
7941 ++static inline int preempt_count_equals(int preempt_offset)
7942 ++{
7943 ++ int nested = preempt_count() + rcu_preempt_depth();
7944 ++
7945 ++ return (nested == preempt_offset);
7946 ++}
7947 ++
7948 ++void __might_sleep(const char *file, int line, int preempt_offset)
7949 ++{
7950 ++ unsigned int state = get_current_state();
7951 ++ /*
7952 ++ * Blocking primitives will set (and therefore destroy) current->state,
7953 ++ * since we will exit with TASK_RUNNING make sure we enter with it,
7954 ++ * otherwise we will destroy state.
7955 ++ */
7956 ++ WARN_ONCE(state != TASK_RUNNING && current->task_state_change,
7957 ++ "do not call blocking ops when !TASK_RUNNING; "
7958 ++ "state=%x set at [<%p>] %pS\n", state,
7959 ++ (void *)current->task_state_change,
7960 ++ (void *)current->task_state_change);
7961 ++
7962 ++ ___might_sleep(file, line, preempt_offset);
7963 ++}
7964 ++EXPORT_SYMBOL(__might_sleep);
7965 ++
7966 ++void ___might_sleep(const char *file, int line, int preempt_offset)
7967 ++{
7968 ++ /* Ratelimiting timestamp: */
7969 ++ static unsigned long prev_jiffy;
7970 ++
7971 ++ unsigned long preempt_disable_ip;
7972 ++
7973 ++ /* WARN_ON_ONCE() by default, no rate limit required: */
7974 ++ rcu_sleep_check();
7975 ++
7976 ++ if ((preempt_count_equals(preempt_offset) && !irqs_disabled() &&
7977 ++ !is_idle_task(current) && !current->non_block_count) ||
7978 ++ system_state == SYSTEM_BOOTING || system_state > SYSTEM_RUNNING ||
7979 ++ oops_in_progress)
7980 ++ return;
7981 ++ if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
7982 ++ return;
7983 ++ prev_jiffy = jiffies;
7984 ++
7985 ++ /* Save this before calling printk(), since that will clobber it: */
7986 ++ preempt_disable_ip = get_preempt_disable_ip(current);
7987 ++
7988 ++ printk(KERN_ERR
7989 ++ "BUG: sleeping function called from invalid context at %s:%d\n",
7990 ++ file, line);
7991 ++ printk(KERN_ERR
7992 ++ "in_atomic(): %d, irqs_disabled(): %d, non_block: %d, pid: %d, name: %s\n",
7993 ++ in_atomic(), irqs_disabled(), current->non_block_count,
7994 ++ current->pid, current->comm);
7995 ++
7996 ++ if (task_stack_end_corrupted(current))
7997 ++ printk(KERN_EMERG "Thread overran stack, or stack corrupted\n");
7998 ++
7999 ++ debug_show_held_locks(current);
8000 ++ if (irqs_disabled())
8001 ++ print_irqtrace_events(current);
8002 ++#ifdef CONFIG_DEBUG_PREEMPT
8003 ++ if (!preempt_count_equals(preempt_offset)) {
8004 ++ pr_err("Preemption disabled at:");
8005 ++ print_ip_sym(KERN_ERR, preempt_disable_ip);
8006 ++ }
8007 ++#endif
8008 ++ dump_stack();
8009 ++ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
8010 ++}
8011 ++EXPORT_SYMBOL(___might_sleep);
8012 ++
8013 ++void __cant_sleep(const char *file, int line, int preempt_offset)
8014 ++{
8015 ++ static unsigned long prev_jiffy;
8016 ++
8017 ++ if (irqs_disabled())
8018 ++ return;
8019 ++
8020 ++ if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
8021 ++ return;
8022 ++
8023 ++ if (preempt_count() > preempt_offset)
8024 ++ return;
8025 ++
8026 ++ if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
8027 ++ return;
8028 ++ prev_jiffy = jiffies;
8029 ++
8030 ++ printk(KERN_ERR "BUG: assuming atomic context at %s:%d\n", file, line);
8031 ++ printk(KERN_ERR "in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",
8032 ++ in_atomic(), irqs_disabled(),
8033 ++ current->pid, current->comm);
8034 ++
8035 ++ debug_show_held_locks(current);
8036 ++ dump_stack();
8037 ++ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
8038 ++}
8039 ++EXPORT_SYMBOL_GPL(__cant_sleep);
8040 ++
8041 ++#ifdef CONFIG_SMP
8042 ++void __cant_migrate(const char *file, int line)
8043 ++{
8044 ++ static unsigned long prev_jiffy;
8045 ++
8046 ++ if (irqs_disabled())
8047 ++ return;
8048 ++
8049 ++ if (is_migration_disabled(current))
8050 ++ return;
8051 ++
8052 ++ if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
8053 ++ return;
8054 ++
8055 ++ if (preempt_count() > 0)
8056 ++ return;
8057 ++
8058 ++ if (current->migration_flags & MDF_FORCE_ENABLED)
8059 ++ return;
8060 ++
8061 ++ if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
8062 ++ return;
8063 ++ prev_jiffy = jiffies;
8064 ++
8065 ++ pr_err("BUG: assuming non migratable context at %s:%d\n", file, line);
8066 ++ pr_err("in_atomic(): %d, irqs_disabled(): %d, migration_disabled() %u pid: %d, name: %s\n",
8067 ++ in_atomic(), irqs_disabled(), is_migration_disabled(current),
8068 ++ current->pid, current->comm);
8069 ++
8070 ++ debug_show_held_locks(current);
8071 ++ dump_stack();
8072 ++ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
8073 ++}
8074 ++EXPORT_SYMBOL_GPL(__cant_migrate);
8075 ++#endif
8076 ++#endif
8077 ++
8078 ++#ifdef CONFIG_MAGIC_SYSRQ
8079 ++void normalize_rt_tasks(void)
8080 ++{
8081 ++ struct task_struct *g, *p;
8082 ++ struct sched_attr attr = {
8083 ++ .sched_policy = SCHED_NORMAL,
8084 ++ };
8085 ++
8086 ++ read_lock(&tasklist_lock);
8087 ++ for_each_process_thread(g, p) {
8088 ++ /*
8089 ++ * Only normalize user tasks:
8090 ++ */
8091 ++ if (p->flags & PF_KTHREAD)
8092 ++ continue;
8093 ++
8094 ++ if (!rt_task(p)) {
8095 ++ /*
8096 ++ * Renice negative nice level userspace
8097 ++ * tasks back to 0:
8098 ++ */
8099 ++ if (task_nice(p) < 0)
8100 ++ set_user_nice(p, 0);
8101 ++ continue;
8102 ++ }
8103 ++
8104 ++ __sched_setscheduler(p, &attr, false, false);
8105 ++ }
8106 ++ read_unlock(&tasklist_lock);
8107 ++}
8108 ++#endif /* CONFIG_MAGIC_SYSRQ */
8109 ++
8110 ++#if defined(CONFIG_IA64) || defined(CONFIG_KGDB_KDB)
8111 ++/*
8112 ++ * These functions are only useful for the IA64 MCA handling, or kdb.
8113 ++ *
8114 ++ * They can only be called when the whole system has been
8115 ++ * stopped - every CPU needs to be quiescent, and no scheduling
8116 ++ * activity can take place. Using them for anything else would
8117 ++ * be a serious bug, and as a result, they aren't even visible
8118 ++ * under any other configuration.
8119 ++ */
8120 ++
8121 ++/**
8122 ++ * curr_task - return the current task for a given CPU.
8123 ++ * @cpu: the processor in question.
8124 ++ *
8125 ++ * ONLY VALID WHEN THE WHOLE SYSTEM IS STOPPED!
8126 ++ *
8127 ++ * Return: The current task for @cpu.
8128 ++ */
8129 ++struct task_struct *curr_task(int cpu)
8130 ++{
8131 ++ return cpu_curr(cpu);
8132 ++}
8133 ++
8134 ++#endif /* defined(CONFIG_IA64) || defined(CONFIG_KGDB_KDB) */
8135 ++
8136 ++#ifdef CONFIG_IA64
8137 ++/**
8138 ++ * ia64_set_curr_task - set the current task for a given CPU.
8139 ++ * @cpu: the processor in question.
8140 ++ * @p: the task pointer to set.
8141 ++ *
8142 ++ * Description: This function must only be used when non-maskable interrupts
8143 ++ * are serviced on a separate stack. It allows the architecture to switch the
8144 ++ * notion of the current task on a CPU in a non-blocking manner. This function
8145 ++ * must be called with all CPUs synchronised and interrupts disabled; the
8146 ++ * caller must save the original value of the current task (see
8147 ++ * curr_task() above) and restore that value before reenabling interrupts and
8148 ++ * re-starting the system.
8149 ++ *
8150 ++ * ONLY VALID WHEN THE WHOLE SYSTEM IS STOPPED!
8151 ++ */
8152 ++void ia64_set_curr_task(int cpu, struct task_struct *p)
8153 ++{
8154 ++ cpu_curr(cpu) = p;
8155 ++}
8156 ++
8157 ++#endif
8158 ++
8159 ++#ifdef CONFIG_CGROUP_SCHED
8160 ++static void sched_free_group(struct task_group *tg)
8161 ++{
8162 ++ kmem_cache_free(task_group_cache, tg);
8163 ++}
8164 ++
8165 ++/* allocate runqueue etc for a new task group */
8166 ++struct task_group *sched_create_group(struct task_group *parent)
8167 ++{
8168 ++ struct task_group *tg;
8169 ++
8170 ++ tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
8171 ++ if (!tg)
8172 ++ return ERR_PTR(-ENOMEM);
8173 ++
8174 ++ return tg;
8175 ++}
8176 ++
8177 ++void sched_online_group(struct task_group *tg, struct task_group *parent)
8178 ++{
8179 ++}
8180 ++
8181 ++/* rcu callback to free various structures associated with a task group */
8182 ++static void sched_free_group_rcu(struct rcu_head *rhp)
8183 ++{
8184 ++ /* Now it should be safe to free those cfs_rqs */
8185 ++ sched_free_group(container_of(rhp, struct task_group, rcu));
8186 ++}
8187 ++
8188 ++void sched_destroy_group(struct task_group *tg)
8189 ++{
8190 ++ /* Wait for possible concurrent references to cfs_rqs to complete */
8191 ++ call_rcu(&tg->rcu, sched_free_group_rcu);
8192 ++}
8193 ++
8194 ++void sched_offline_group(struct task_group *tg)
8195 ++{
8196 ++}
8197 ++
8198 ++static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
8199 ++{
8200 ++ return css ? container_of(css, struct task_group, css) : NULL;
8201 ++}
8202 ++
8203 ++static struct cgroup_subsys_state *
8204 ++cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
8205 ++{
8206 ++ struct task_group *parent = css_tg(parent_css);
8207 ++ struct task_group *tg;
8208 ++
8209 ++ if (!parent) {
8210 ++ /* This is early initialization for the top cgroup */
8211 ++ return &root_task_group.css;
8212 ++ }
8213 ++
8214 ++ tg = sched_create_group(parent);
8215 ++ if (IS_ERR(tg))
8216 ++ return ERR_PTR(-ENOMEM);
8217 ++ return &tg->css;
8218 ++}
8219 ++
8220 ++/* Expose task group only after completing cgroup initialization */
8221 ++static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
8222 ++{
8223 ++ struct task_group *tg = css_tg(css);
8224 ++ struct task_group *parent = css_tg(css->parent);
8225 ++
8226 ++ if (parent)
8227 ++ sched_online_group(tg, parent);
8228 ++ return 0;
8229 ++}
8230 ++
8231 ++static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
8232 ++{
8233 ++ struct task_group *tg = css_tg(css);
8234 ++
8235 ++ sched_offline_group(tg);
8236 ++}
8237 ++
8238 ++static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
8239 ++{
8240 ++ struct task_group *tg = css_tg(css);
8241 ++
8242 ++ /*
8243 ++ * Relies on the RCU grace period between css_released() and this.
8244 ++ */
8245 ++ sched_free_group(tg);
8246 ++}
8247 ++
8248 ++static void cpu_cgroup_fork(struct task_struct *task)
8249 ++{
8250 ++}
8251 ++
8252 ++static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
8253 ++{
8254 ++ return 0;
8255 ++}
8256 ++
8257 ++static void cpu_cgroup_attach(struct cgroup_taskset *tset)
8258 ++{
8259 ++}
8260 ++
8261 ++#ifdef CONFIG_FAIR_GROUP_SCHED
8262 ++static DEFINE_MUTEX(shares_mutex);
8263 ++
8264 ++int sched_group_set_shares(struct task_group *tg, unsigned long shares)
8265 ++{
8266 ++ /*
8267 ++ * We can't change the weight of the root cgroup.
8268 ++ */
8269 ++ if (&root_task_group == tg)
8270 ++ return -EINVAL;
8271 ++
8272 ++ shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
8273 ++
8274 ++ mutex_lock(&shares_mutex);
8275 ++ if (tg->shares == shares)
8276 ++ goto done;
8277 ++
8278 ++ tg->shares = shares;
8279 ++done:
8280 ++ mutex_unlock(&shares_mutex);
8281 ++ return 0;
8282 ++}
8283 ++
8284 ++static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
8285 ++ struct cftype *cftype, u64 shareval)
8286 ++{
8287 ++ if (shareval > scale_load_down(ULONG_MAX))
8288 ++ shareval = MAX_SHARES;
8289 ++ return sched_group_set_shares(css_tg(css), scale_load(shareval));
8290 ++}
8291 ++
8292 ++static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
8293 ++ struct cftype *cft)
8294 ++{
8295 ++ struct task_group *tg = css_tg(css);
8296 ++
8297 ++ return (u64) scale_load_down(tg->shares);
8298 ++}
8299 ++#endif
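
With CONFIG_FAIR_GROUP_SCHED, the shares interface above clamps values to the 2..2^18 range; in this hunk sched_group_set_shares() merely records the value in the task group. A user-space sketch that writes a share value, assuming a cgroup v1 cpu controller mounted at /sys/fs/cgroup/cpu and an existing group named "example" (both are assumptions):

#include <stdio.h>

int main(void)
{
	/* hypothetical path: cgroup v1 cpu controller, group "example" */
	FILE *f = fopen("/sys/fs/cgroup/cpu/example/cpu.shares", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fprintf(f, "2048\n");	/* twice the default weight of 1024 */
	fclose(f);
	return 0;
}
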
8300 ++
8301 ++static struct cftype cpu_legacy_files[] = {
8302 ++#ifdef CONFIG_FAIR_GROUP_SCHED
8303 ++ {
8304 ++ .name = "shares",
8305 ++ .read_u64 = cpu_shares_read_u64,
8306 ++ .write_u64 = cpu_shares_write_u64,
8307 ++ },
8308 ++#endif
8309 ++ { } /* Terminate */
8310 ++};
8311 ++
8312 ++
8313 ++static struct cftype cpu_files[] = {
8314 ++ { } /* terminate */
8315 ++};
8316 ++
8317 ++static int cpu_extra_stat_show(struct seq_file *sf,
8318 ++ struct cgroup_subsys_state *css)
8319 ++{
8320 ++ return 0;
8321 ++}
8322 ++
8323 ++struct cgroup_subsys cpu_cgrp_subsys = {
8324 ++ .css_alloc = cpu_cgroup_css_alloc,
8325 ++ .css_online = cpu_cgroup_css_online,
8326 ++ .css_released = cpu_cgroup_css_released,
8327 ++ .css_free = cpu_cgroup_css_free,
8328 ++ .css_extra_stat_show = cpu_extra_stat_show,
8329 ++ .fork = cpu_cgroup_fork,
8330 ++ .can_attach = cpu_cgroup_can_attach,
8331 ++ .attach = cpu_cgroup_attach,
8332 ++ .legacy_cftypes = cpu_legacy_files,
8334 ++ .dfl_cftypes = cpu_files,
8335 ++ .early_init = true,
8336 ++ .threaded = true,
8337 ++};
8338 ++#endif /* CONFIG_CGROUP_SCHED */
8339 ++
8340 ++#undef CREATE_TRACE_POINTS
8341 +diff --git a/kernel/sched/alt_debug.c b/kernel/sched/alt_debug.c
8342 +new file mode 100644
8343 +index 000000000000..1212a031700e
8344 +--- /dev/null
8345 ++++ b/kernel/sched/alt_debug.c
8346 +@@ -0,0 +1,31 @@
8347 ++/*
8348 ++ * kernel/sched/alt_debug.c
8349 ++ *
8350 ++ * Print the alt scheduler debugging details
8351 ++ *
8352 ++ * Author: Alfred Chen
8353 ++ * Date : 2020
8354 ++ */
8355 ++#include "sched.h"
8356 ++
8357 ++/*
8358 ++ * This allows printing both to /proc/sched_debug and
8359 ++ * to the console
8360 ++ */
8361 ++#define SEQ_printf(m, x...) \
8362 ++ do { \
8363 ++ if (m) \
8364 ++ seq_printf(m, x); \
8365 ++ else \
8366 ++ pr_cont(x); \
8367 ++ } while (0)
8368 ++
8369 ++void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
8370 ++ struct seq_file *m)
8371 ++{
8372 ++ SEQ_printf(m, "%s (%d, #threads: %d)\n", p->comm, task_pid_nr_ns(p, ns),
8373 ++ get_nr_threads(p));
8374 ++}
8375 ++
8376 ++void proc_sched_set_task(struct task_struct *p)
8377 ++{}
8378 +diff --git a/kernel/sched/alt_sched.h b/kernel/sched/alt_sched.h
8379 +new file mode 100644
8380 +index 000000000000..289058a09bd5
8381 +--- /dev/null
8382 ++++ b/kernel/sched/alt_sched.h
8383 +@@ -0,0 +1,666 @@
8384 ++#ifndef ALT_SCHED_H
8385 ++#define ALT_SCHED_H
8386 ++
8387 ++#include <linux/sched.h>
8388 ++
8389 ++#include <linux/sched/clock.h>
8390 ++#include <linux/sched/cpufreq.h>
8391 ++#include <linux/sched/cputime.h>
8392 ++#include <linux/sched/debug.h>
8393 ++#include <linux/sched/init.h>
8394 ++#include <linux/sched/isolation.h>
8395 ++#include <linux/sched/loadavg.h>
8396 ++#include <linux/sched/mm.h>
8397 ++#include <linux/sched/nohz.h>
8398 ++#include <linux/sched/signal.h>
8399 ++#include <linux/sched/stat.h>
8400 ++#include <linux/sched/sysctl.h>
8401 ++#include <linux/sched/task.h>
8402 ++#include <linux/sched/topology.h>
8403 ++#include <linux/sched/wake_q.h>
8404 ++
8405 ++#include <uapi/linux/sched/types.h>
8406 ++
8407 ++#include <linux/cgroup.h>
8408 ++#include <linux/cpufreq.h>
8409 ++#include <linux/cpuidle.h>
8410 ++#include <linux/cpuset.h>
8411 ++#include <linux/ctype.h>
8412 ++#include <linux/debugfs.h>
8413 ++#include <linux/kthread.h>
8414 ++#include <linux/livepatch.h>
8415 ++#include <linux/membarrier.h>
8416 ++#include <linux/proc_fs.h>
8417 ++#include <linux/psi.h>
8418 ++#include <linux/slab.h>
8419 ++#include <linux/stop_machine.h>
8420 ++#include <linux/suspend.h>
8421 ++#include <linux/swait.h>
8422 ++#include <linux/syscalls.h>
8423 ++#include <linux/tsacct_kern.h>
8424 ++
8425 ++#include <asm/tlb.h>
8426 ++
8427 ++#ifdef CONFIG_PARAVIRT
8428 ++# include <asm/paravirt.h>
8429 ++#endif
8430 ++
8431 ++#include "cpupri.h"
8432 ++
8433 ++#include <trace/events/sched.h>
8434 ++
8435 ++#ifdef CONFIG_SCHED_BMQ
8436 ++/* bits:
8437 ++ * RT(0-99), (Low prio adj range, nice width, high prio adj range) / 2, cpu idle task */
8438 ++#define SCHED_BITS (MAX_RT_PRIO + NICE_WIDTH / 2 + MAX_PRIORITY_ADJ + 1)
8439 ++#endif
8440 ++
8441 ++#ifdef CONFIG_SCHED_PDS
8442 ++/* bits: RT(0-99), reserved(100-127), NORMAL_PRIO_NUM, cpu idle task */
8443 ++#define SCHED_BITS (MIN_NORMAL_PRIO + NORMAL_PRIO_NUM + 1)
8444 ++#endif /* CONFIG_SCHED_PDS */
8445 ++
8446 ++#define IDLE_TASK_SCHED_PRIO (SCHED_BITS - 1)
8447 ++
8448 ++#ifdef CONFIG_SCHED_DEBUG
8449 ++# define SCHED_WARN_ON(x) WARN_ONCE(x, #x)
8450 ++extern void resched_latency_warn(int cpu, u64 latency);
8451 ++#else
8452 ++# define SCHED_WARN_ON(x) ({ (void)(x), 0; })
8453 ++static inline void resched_latency_warn(int cpu, u64 latency) {}
8454 ++#endif
8455 ++
8456 ++/*
8457 ++ * Increase resolution of nice-level calculations for 64-bit architectures.
8458 ++ * The extra resolution improves shares distribution and load balancing of
8459 ++ * low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
8460 ++ * hierarchies, especially on larger systems. This is not a user-visible change
8461 ++ * and does not change the user-interface for setting shares/weights.
8462 ++ *
8463 ++ * We increase resolution only if we have enough bits to allow this increased
8464 ++ * resolution (i.e. 64-bit). The costs for increasing resolution when 32-bit
8465 ++ * are pretty high and the returns do not justify the increased costs.
8466 ++ *
8467 ++ * Really only required when CONFIG_FAIR_GROUP_SCHED=y is also set, but to
8468 ++ * increase coverage and consistency always enable it on 64-bit platforms.
8469 ++ */
8470 ++#ifdef CONFIG_64BIT
8471 ++# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
8472 ++# define scale_load(w) ((w) << SCHED_FIXEDPOINT_SHIFT)
8473 ++# define scale_load_down(w) \
8474 ++({ \
8475 ++ unsigned long __w = (w); \
8476 ++ if (__w) \
8477 ++ __w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT); \
8478 ++ __w; \
8479 ++})
8480 ++#else
8481 ++# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT)
8482 ++# define scale_load(w) (w)
8483 ++# define scale_load_down(w) (w)
8484 ++#endif
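
A short worked example of the load-scaling macros above, assuming SCHED_FIXEDPOINT_SHIFT is 10 as in the mainline scheduler, so on 64-bit scale_load() multiplies by 1024 and scale_load_down() divides back while clamping non-zero weights to at least 2:

#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT	10	/* mainline value, assumed here */

static unsigned long scale_load(unsigned long w)
{
	return w << SCHED_FIXEDPOINT_SHIFT;
}

static unsigned long scale_load_down(unsigned long w)
{
	unsigned long s = w >> SCHED_FIXEDPOINT_SHIFT;

	return w ? (s > 2UL ? s : 2UL) : 0;
}

int main(void)
{
	printf("%lu\n", scale_load(1024));	   /* 1048576: nice-0 weight at high resolution */
	printf("%lu\n", scale_load_down(1048576)); /* 1024 */
	printf("%lu\n", scale_load_down(1));	   /* clamped to 2 to avoid zero/one weights */
	return 0;
}
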
8485 ++
8486 ++#ifdef CONFIG_FAIR_GROUP_SCHED
8487 ++#define ROOT_TASK_GROUP_LOAD NICE_0_LOAD
8488 ++
8489 ++/*
8490 ++ * A weight of 0 or 1 can cause arithmetic problems.
8491 ++ * The weight of a cfs_rq is the sum of the weights of the entities
8492 ++ * queued on it, so the weight of an entity should not be too large,
8493 ++ * and neither should the shares value of a task group.
8494 ++ * (The default weight is 1024 - so there's no practical
8495 ++ * limitation from this.)
8496 ++ */
8497 ++#define MIN_SHARES (1UL << 1)
8498 ++#define MAX_SHARES (1UL << 18)
8499 ++#endif
8500 ++
8501 ++/* task_struct::on_rq states: */
8502 ++#define TASK_ON_RQ_QUEUED 1
8503 ++#define TASK_ON_RQ_MIGRATING 2
8504 ++
8505 ++static inline int task_on_rq_queued(struct task_struct *p)
8506 ++{
8507 ++ return p->on_rq == TASK_ON_RQ_QUEUED;
8508 ++}
8509 ++
8510 ++static inline int task_on_rq_migrating(struct task_struct *p)
8511 ++{
8512 ++ return READ_ONCE(p->on_rq) == TASK_ON_RQ_MIGRATING;
8513 ++}
8514 ++
8515 ++/*
8516 ++ * wake flags
8517 ++ */
8518 ++#define WF_SYNC 0x01 /* waker goes to sleep after wakeup */
8519 ++#define WF_FORK 0x02 /* child wakeup after fork */
8520 ++#define WF_MIGRATED 0x04 /* internal use, task got migrated */
8521 ++#define WF_ON_CPU 0x08 /* Wakee is on_rq */
8522 ++
8523 ++#define SCHED_QUEUE_BITS (SCHED_BITS - 1)
8524 ++
8525 ++struct sched_queue {
8526 ++ DECLARE_BITMAP(bitmap, SCHED_QUEUE_BITS);
8527 ++ struct list_head heads[SCHED_BITS];
8528 ++};
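
struct sched_queue above is the structure the BMQ name refers to: one list head per priority level plus a bitmap of non-empty levels, so picking the next task presumably amounts to "find the first set bit, take the head of that list". A self-contained toy model in user-space C (made-up types and names, only to illustrate the technique; the real code uses the kernel bitmap and list helpers):

#include <stdio.h>
#include <stdint.h>

#define NR_PRIO 64			/* toy: one 64-bit word covers all levels */

struct item {
	int prio;
	struct item *next;
};

static struct item *heads[NR_PRIO];
static uint64_t bitmap;			/* bit n set <=> heads[n] is non-empty */

static void enqueue(struct item *it)
{
	it->next = heads[it->prio];	/* simplified: push to the front */
	heads[it->prio] = it;
	bitmap |= 1ULL << it->prio;
}

static struct item *pick_next(void)
{
	if (!bitmap)
		return NULL;

	int prio = __builtin_ctzll(bitmap);	/* lowest set bit = highest priority */
	struct item *it = heads[prio];

	heads[prio] = it->next;
	if (!heads[prio])
		bitmap &= ~(1ULL << prio);
	return it;
}

int main(void)
{
	struct item a = { .prio = 10 }, b = { .prio = 3 };

	enqueue(&a);
	enqueue(&b);
	printf("picked prio %d\n", pick_next()->prio);	/* 3: lower index wins */
	return 0;
}
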
8529 ++
8530 ++/*
8531 ++ * This is the main, per-CPU runqueue data structure.
8532 ++ * This data should only be modified by the local cpu.
8533 ++ */
8534 ++struct rq {
8535 ++ /* runqueue lock: */
8536 ++ raw_spinlock_t lock;
8537 ++
8538 ++ struct task_struct __rcu *curr;
8539 ++ struct task_struct *idle, *stop, *skip;
8540 ++ struct mm_struct *prev_mm;
8541 ++
8542 ++ struct sched_queue queue;
8543 ++#ifdef CONFIG_SCHED_PDS
8544 ++ u64 time_edge;
8545 ++#endif
8546 ++ unsigned long watermark;
8547 ++
8548 ++ /* switch count */
8549 ++ u64 nr_switches;
8550 ++
8551 ++ atomic_t nr_iowait;
8552 ++
8553 ++#ifdef CONFIG_SCHED_DEBUG
8554 ++ u64 last_seen_need_resched_ns;
8555 ++ int ticks_without_resched;
8556 ++#endif
8557 ++
8558 ++#ifdef CONFIG_MEMBARRIER
8559 ++ int membarrier_state;
8560 ++#endif
8561 ++
8562 ++#ifdef CONFIG_SMP
8563 ++ int cpu; /* cpu of this runqueue */
8564 ++ bool online;
8565 ++
8566 ++ unsigned int ttwu_pending;
8567 ++ unsigned char nohz_idle_balance;
8568 ++ unsigned char idle_balance;
8569 ++
8570 ++#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
8571 ++ struct sched_avg avg_irq;
8572 ++#endif
8573 ++
8574 ++#ifdef CONFIG_SCHED_SMT
8575 ++ int active_balance;
8576 ++ struct cpu_stop_work active_balance_work;
8577 ++#endif
8578 ++ struct callback_head *balance_callback;
8579 ++#ifdef CONFIG_HOTPLUG_CPU
8580 ++ struct rcuwait hotplug_wait;
8581 ++#endif
8582 ++ unsigned int nr_pinned;
8583 ++
8584 ++#endif /* CONFIG_SMP */
8585 ++#ifdef CONFIG_IRQ_TIME_ACCOUNTING
8586 ++ u64 prev_irq_time;
8587 ++#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
8588 ++#ifdef CONFIG_PARAVIRT
8589 ++ u64 prev_steal_time;
8590 ++#endif /* CONFIG_PARAVIRT */
8591 ++#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
8592 ++ u64 prev_steal_time_rq;
8593 ++#endif /* CONFIG_PARAVIRT_TIME_ACCOUNTING */
8594 ++
8595 ++ /* For general CPU load util */
8596 ++ s32 load_history;
8597 ++ u64 load_block;
8598 ++ u64 load_stamp;
8599 ++
8600 ++ /* calc_load related fields */
8601 ++ unsigned long calc_load_update;
8602 ++ long calc_load_active;
8603 ++
8604 ++ u64 clock, last_tick;
8605 ++ u64 last_ts_switch;
8606 ++ u64 clock_task;
8607 ++
8608 ++ unsigned int nr_running;
8609 ++ unsigned long nr_uninterruptible;
8610 ++
8611 ++#ifdef CONFIG_SCHED_HRTICK
8612 ++#ifdef CONFIG_SMP
8613 ++ call_single_data_t hrtick_csd;
8614 ++#endif
8615 ++ struct hrtimer hrtick_timer;
8616 ++ ktime_t hrtick_time;
8617 ++#endif
8618 ++
8619 ++#ifdef CONFIG_SCHEDSTATS
8620 ++
8621 ++ /* latency stats */
8622 ++ struct sched_info rq_sched_info;
8623 ++ unsigned long long rq_cpu_time;
8624 ++ /* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */
8625 ++
8626 ++ /* sys_sched_yield() stats */
8627 ++ unsigned int yld_count;
8628 ++
8629 ++ /* schedule() stats */
8630 ++ unsigned int sched_switch;
8631 ++ unsigned int sched_count;
8632 ++ unsigned int sched_goidle;
8633 ++
8634 ++ /* try_to_wake_up() stats */
8635 ++ unsigned int ttwu_count;
8636 ++ unsigned int ttwu_local;
8637 ++#endif /* CONFIG_SCHEDSTATS */
8638 ++
8639 ++#ifdef CONFIG_CPU_IDLE
8640 ++ /* Must be inspected within a rcu lock section */
8641 ++ struct cpuidle_state *idle_state;
8642 ++#endif
8643 ++
8644 ++#ifdef CONFIG_NO_HZ_COMMON
8645 ++#ifdef CONFIG_SMP
8646 ++ call_single_data_t nohz_csd;
8647 ++#endif
8648 ++ atomic_t nohz_flags;
8649 ++#endif /* CONFIG_NO_HZ_COMMON */
8650 ++};
8651 ++
8652 ++extern unsigned long rq_load_util(struct rq *rq, unsigned long max);
8653 ++
8654 ++extern unsigned long calc_load_update;
8655 ++extern atomic_long_t calc_load_tasks;
8656 ++
8657 ++extern void calc_global_load_tick(struct rq *this_rq);
8658 ++extern long calc_load_fold_active(struct rq *this_rq, long adjust);
8659 ++
8660 ++DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
8661 ++#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))
8662 ++#define this_rq() this_cpu_ptr(&runqueues)
8663 ++#define task_rq(p) cpu_rq(task_cpu(p))
8664 ++#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
8665 ++#define raw_rq() raw_cpu_ptr(&runqueues)
8666 ++
8667 ++#ifdef CONFIG_SMP
8668 ++#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
8669 ++void register_sched_domain_sysctl(void);
8670 ++void unregister_sched_domain_sysctl(void);
8671 ++#else
8672 ++static inline void register_sched_domain_sysctl(void)
8673 ++{
8674 ++}
8675 ++static inline void unregister_sched_domain_sysctl(void)
8676 ++{
8677 ++}
8678 ++#endif
8679 ++
8680 ++extern bool sched_smp_initialized;
8681 ++
8682 ++enum {
8683 ++ ITSELF_LEVEL_SPACE_HOLDER,
8684 ++#ifdef CONFIG_SCHED_SMT
8685 ++ SMT_LEVEL_SPACE_HOLDER,
8686 ++#endif
8687 ++ COREGROUP_LEVEL_SPACE_HOLDER,
8688 ++ CORE_LEVEL_SPACE_HOLDER,
8689 ++ OTHER_LEVEL_SPACE_HOLDER,
8690 ++ NR_CPU_AFFINITY_LEVELS
8691 ++};
8692 ++
8693 ++DECLARE_PER_CPU(cpumask_t [NR_CPU_AFFINITY_LEVELS], sched_cpu_topo_masks);
8694 ++DECLARE_PER_CPU(cpumask_t *, sched_cpu_llc_mask);
8695 ++
8696 ++static inline int
8697 ++__best_mask_cpu(const cpumask_t *cpumask, const cpumask_t *mask)
8698 ++{
8699 ++ int cpu;
8700 ++
8701 ++ while ((cpu = cpumask_any_and(cpumask, mask)) >= nr_cpu_ids)
8702 ++ mask++;
8703 ++
8704 ++ return cpu;
8705 ++}
8706 ++
8707 ++static inline int best_mask_cpu(int cpu, const cpumask_t *mask)
8708 ++{
8709 ++ return __best_mask_cpu(mask, per_cpu(sched_cpu_topo_masks, cpu));
8710 ++}
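
__best_mask_cpu() above walks the per-CPU array of affinity masks from the nearest topology level outwards and returns the first allowed CPU it finds. A self-contained toy model with plain 64-bit masks (hypothetical level masks, purely illustrative):

#include <stdio.h>
#include <stdint.h>

/* toy levels for CPU 0: itself, SMT siblings, the LLC, everything else */
static const uint64_t topo_masks[] = { 0x01, 0x03, 0x0f, 0xff };

/* return the first CPU allowed by @allowed, preferring the nearest level */
static int toy_best_mask_cpu(uint64_t allowed)
{
	for (unsigned int lvl = 0; lvl < sizeof(topo_masks) / sizeof(topo_masks[0]); lvl++) {
		uint64_t hit = allowed & topo_masks[lvl];

		if (hit)
			return __builtin_ctzll(hit);
	}
	return -1;	/* no allowed CPU at any level */
}

int main(void)
{
	printf("%d\n", toy_best_mask_cpu(0x02));	/* CPU 1: found at the SMT level */
	printf("%d\n", toy_best_mask_cpu(0x30));	/* CPU 4: only found at the outermost level */
	return 0;
}
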
8711 ++
8712 ++extern void flush_smp_call_function_from_idle(void);
8713 ++
8714 ++#else /* !CONFIG_SMP */
8715 ++static inline void flush_smp_call_function_from_idle(void) { }
8716 ++#endif
8717 ++
8718 ++#ifndef arch_scale_freq_tick
8719 ++static __always_inline
8720 ++void arch_scale_freq_tick(void)
8721 ++{
8722 ++}
8723 ++#endif
8724 ++
8725 ++#ifndef arch_scale_freq_capacity
8726 ++static __always_inline
8727 ++unsigned long arch_scale_freq_capacity(int cpu)
8728 ++{
8729 ++ return SCHED_CAPACITY_SCALE;
8730 ++}
8731 ++#endif
8732 ++
8733 ++static inline u64 __rq_clock_broken(struct rq *rq)
8734 ++{
8735 ++ return READ_ONCE(rq->clock);
8736 ++}
8737 ++
8738 ++static inline u64 rq_clock(struct rq *rq)
8739 ++{
8740 ++ /*
8741 ++ * Relax lockdep_assert_held() checking: as in VRQ, a call to
8742 ++ * sched_info_xxxx() may not hold rq->lock
8743 ++ * lockdep_assert_held(&rq->lock);
8744 ++ */
8745 ++ return rq->clock;
8746 ++}
8747 ++
8748 ++static inline u64 rq_clock_task(struct rq *rq)
8749 ++{
8750 ++ /*
8751 ++ * Relax lockdep_assert_held() checking: as in VRQ, a call to
8752 ++ * sched_info_xxxx() may not hold rq->lock
8753 ++ * lockdep_assert_held(&rq->lock);
8754 ++ */
8755 ++ return rq->clock_task;
8756 ++}
8757 ++
8758 ++/*
8759 ++ * {de,en}queue flags:
8760 ++ *
8761 ++ * DEQUEUE_SLEEP - task is no longer runnable
8762 ++ * ENQUEUE_WAKEUP - task just became runnable
8763 ++ *
8764 ++ */
8765 ++
8766 ++#define DEQUEUE_SLEEP 0x01
8767 ++
8768 ++#define ENQUEUE_WAKEUP 0x01
8769 ++
8770 ++
8771 ++/*
8772 ++ * Below are the scheduler APIs used by other kernel code.
8773 ++ * They use a dummy rq_flags.
8774 ++ * TODO: BMQ needs to support these APIs for compatibility with mainline
8775 ++ * scheduler code.
8776 ++ */
8777 ++struct rq_flags {
8778 ++ unsigned long flags;
8779 ++};
8780 ++
8781 ++struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
8782 ++ __acquires(rq->lock);
8783 ++
8784 ++struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
8785 ++ __acquires(p->pi_lock)
8786 ++ __acquires(rq->lock);
8787 ++
8788 ++static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
8789 ++ __releases(rq->lock)
8790 ++{
8791 ++ raw_spin_unlock(&rq->lock);
8792 ++}
8793 ++
8794 ++static inline void
8795 ++task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
8796 ++ __releases(rq->lock)
8797 ++ __releases(p->pi_lock)
8798 ++{
8799 ++ raw_spin_unlock(&rq->lock);
8800 ++ raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
8801 ++}
8802 ++
8803 ++static inline void
8804 ++rq_lock(struct rq *rq, struct rq_flags *rf)
8805 ++ __acquires(rq->lock)
8806 ++{
8807 ++ raw_spin_lock(&rq->lock);
8808 ++}
8809 ++
8810 ++static inline void
8811 ++rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
8812 ++ __releases(rq->lock)
8813 ++{
8814 ++ raw_spin_unlock_irq(&rq->lock);
8815 ++}
8816 ++
8817 ++static inline void
8818 ++rq_unlock(struct rq *rq, struct rq_flags *rf)
8819 ++ __releases(rq->lock)
8820 ++{
8821 ++ raw_spin_unlock(&rq->lock);
8822 ++}
8823 ++
8824 ++static inline struct rq *
8825 ++this_rq_lock_irq(struct rq_flags *rf)
8826 ++ __acquires(rq->lock)
8827 ++{
8828 ++ struct rq *rq;
8829 ++
8830 ++ local_irq_disable();
8831 ++ rq = this_rq();
8832 ++ raw_spin_lock(&rq->lock);
8833 ++
8834 ++ return rq;
8835 ++}
8836 ++
8837 ++extern void raw_spin_rq_lock_nested(struct rq *rq, int subclass);
8838 ++extern void raw_spin_rq_unlock(struct rq *rq);
8839 ++
8840 ++static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
8841 ++{
8842 ++ return &rq->lock;
8843 ++}
8844 ++
8845 ++static inline raw_spinlock_t *rq_lockp(struct rq *rq)
8846 ++{
8847 ++ return __rq_lockp(rq);
8848 ++}
8849 ++
8850 ++static inline void raw_spin_rq_lock(struct rq *rq)
8851 ++{
8852 ++ raw_spin_rq_lock_nested(rq, 0);
8853 ++}
8854 ++
8855 ++static inline void raw_spin_rq_lock_irq(struct rq *rq)
8856 ++{
8857 ++ local_irq_disable();
8858 ++ raw_spin_rq_lock(rq);
8859 ++}
8860 ++
8861 ++static inline void raw_spin_rq_unlock_irq(struct rq *rq)
8862 ++{
8863 ++ raw_spin_rq_unlock(rq);
8864 ++ local_irq_enable();
8865 ++}
8866 ++
8867 ++static inline int task_current(struct rq *rq, struct task_struct *p)
8868 ++{
8869 ++ return rq->curr == p;
8870 ++}
8871 ++
8872 ++static inline bool task_running(struct task_struct *p)
8873 ++{
8874 ++ return p->on_cpu;
8875 ++}
8876 ++
8877 ++extern int task_running_nice(struct task_struct *p);
8878 ++
8879 ++extern struct static_key_false sched_schedstats;
8880 ++
8881 ++#ifdef CONFIG_CPU_IDLE
8882 ++static inline void idle_set_state(struct rq *rq,
8883 ++ struct cpuidle_state *idle_state)
8884 ++{
8885 ++ rq->idle_state = idle_state;
8886 ++}
8887 ++
8888 ++static inline struct cpuidle_state *idle_get_state(struct rq *rq)
8889 ++{
8890 ++ WARN_ON(!rcu_read_lock_held());
8891 ++ return rq->idle_state;
8892 ++}
8893 ++#else
8894 ++static inline void idle_set_state(struct rq *rq,
8895 ++ struct cpuidle_state *idle_state)
8896 ++{
8897 ++}
8898 ++
8899 ++static inline struct cpuidle_state *idle_get_state(struct rq *rq)
8900 ++{
8901 ++ return NULL;
8902 ++}
8903 ++#endif
8904 ++
8905 ++static inline int cpu_of(const struct rq *rq)
8906 ++{
8907 ++#ifdef CONFIG_SMP
8908 ++ return rq->cpu;
8909 ++#else
8910 ++ return 0;
8911 ++#endif
8912 ++}
8913 ++
8914 ++#include "stats.h"
8915 ++
8916 ++#ifdef CONFIG_NO_HZ_COMMON
8917 ++#define NOHZ_BALANCE_KICK_BIT 0
8918 ++#define NOHZ_STATS_KICK_BIT 1
8919 ++
8920 ++#define NOHZ_BALANCE_KICK BIT(NOHZ_BALANCE_KICK_BIT)
8921 ++#define NOHZ_STATS_KICK BIT(NOHZ_STATS_KICK_BIT)
8922 ++
8923 ++#define NOHZ_KICK_MASK (NOHZ_BALANCE_KICK | NOHZ_STATS_KICK)
8924 ++
8925 ++#define nohz_flags(cpu) (&cpu_rq(cpu)->nohz_flags)
8926 ++
8927 ++/* TODO: needed?
8928 ++extern void nohz_balance_exit_idle(struct rq *rq);
8929 ++#else
8930 ++static inline void nohz_balance_exit_idle(struct rq *rq) { }
8931 ++*/
8932 ++#endif
8933 ++
8934 ++#ifdef CONFIG_IRQ_TIME_ACCOUNTING
8935 ++struct irqtime {
8936 ++ u64 total;
8937 ++ u64 tick_delta;
8938 ++ u64 irq_start_time;
8939 ++ struct u64_stats_sync sync;
8940 ++};
8941 ++
8942 ++DECLARE_PER_CPU(struct irqtime, cpu_irqtime);
8943 ++
8944 ++/*
8945 ++ * Returns the irqtime minus the softirq time computed by ksoftirqd.
8946 ++ * Otherwise ksoftirqd's sum_exec_runtime would have its own runtime
8947 ++ * subtracted and never move forward.
8948 ++ */
8949 ++static inline u64 irq_time_read(int cpu)
8950 ++{
8951 ++ struct irqtime *irqtime = &per_cpu(cpu_irqtime, cpu);
8952 ++ unsigned int seq;
8953 ++ u64 total;
8954 ++
8955 ++ do {
8956 ++ seq = __u64_stats_fetch_begin(&irqtime->sync);
8957 ++ total = irqtime->total;
8958 ++ } while (__u64_stats_fetch_retry(&irqtime->sync, seq));
8959 ++
8960 ++ return total;
8961 ++}
8962 ++#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
8963 ++
8964 ++#ifdef CONFIG_CPU_FREQ
8965 ++DECLARE_PER_CPU(struct update_util_data __rcu *, cpufreq_update_util_data);
8966 ++#endif /* CONFIG_CPU_FREQ */
8967 ++
8968 ++#ifdef CONFIG_NO_HZ_FULL
8969 ++extern int __init sched_tick_offload_init(void);
8970 ++#else
8971 ++static inline int sched_tick_offload_init(void) { return 0; }
8972 ++#endif
8973 ++
8974 ++#ifdef arch_scale_freq_capacity
8975 ++#ifndef arch_scale_freq_invariant
8976 ++#define arch_scale_freq_invariant() (true)
8977 ++#endif
8978 ++#else /* arch_scale_freq_capacity */
8979 ++#define arch_scale_freq_invariant() (false)
8980 ++#endif
8981 ++
8982 ++extern void schedule_idle(void);
8983 ++
8984 ++#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
8985 ++
8986 ++/*
8987 ++ * !! For sched_setattr_nocheck() (kernel) only !!
8988 ++ *
8989 ++ * This is actually gross. :(
8990 ++ *
8991 ++ * It is used to make schedutil kworker(s) higher priority than SCHED_DEADLINE
8992 ++ * tasks, but still be able to sleep. We need this on platforms that cannot
8993 ++ * atomically change clock frequency. Remove once fast switching is
8994 ++ * available on such platforms.
8995 ++ *
8996 ++ * SUGOV stands for SchedUtil GOVernor.
8997 ++ */
8998 ++#define SCHED_FLAG_SUGOV 0x10000000
8999 ++
9000 ++#ifdef CONFIG_MEMBARRIER
9001 ++/*
9002 ++ * The scheduler provides memory barriers required by membarrier between:
9003 ++ * - prior user-space memory accesses and store to rq->membarrier_state,
9004 ++ * - store to rq->membarrier_state and following user-space memory accesses.
9005 ++ * In the same way it provides those guarantees around store to rq->curr.
9006 ++ */
9007 ++static inline void membarrier_switch_mm(struct rq *rq,
9008 ++ struct mm_struct *prev_mm,
9009 ++ struct mm_struct *next_mm)
9010 ++{
9011 ++ int membarrier_state;
9012 ++
9013 ++ if (prev_mm == next_mm)
9014 ++ return;
9015 ++
9016 ++ membarrier_state = atomic_read(&next_mm->membarrier_state);
9017 ++ if (READ_ONCE(rq->membarrier_state) == membarrier_state)
9018 ++ return;
9019 ++
9020 ++ WRITE_ONCE(rq->membarrier_state, membarrier_state);
9021 ++}
9022 ++#else
9023 ++static inline void membarrier_switch_mm(struct rq *rq,
9024 ++ struct mm_struct *prev_mm,
9025 ++ struct mm_struct *next_mm)
9026 ++{
9027 ++}
9028 ++#endif
9029 ++
9030 ++#ifdef CONFIG_NUMA
9031 ++extern int sched_numa_find_closest(const struct cpumask *cpus, int cpu);
9032 ++#else
9033 ++static inline int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
9034 ++{
9035 ++ return nr_cpu_ids;
9036 ++}
9037 ++#endif
9038 ++
9039 ++extern void swake_up_all_locked(struct swait_queue_head *q);
9040 ++extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
9041 ++
9042 ++#ifdef CONFIG_PREEMPT_DYNAMIC
9043 ++extern int preempt_dynamic_mode;
9044 ++extern int sched_dynamic_mode(const char *str);
9045 ++extern void sched_dynamic_update(int mode);
9046 ++#endif
9047 ++
9048 ++static inline void nohz_run_idle_balance(int cpu) { }
9049 ++#endif /* ALT_SCHED_H */
9050 +diff --git a/kernel/sched/bmq.h b/kernel/sched/bmq.h
9051 +new file mode 100644
9052 +index 000000000000..be3ee4a553ca
9053 +--- /dev/null
9054 ++++ b/kernel/sched/bmq.h
9055 +@@ -0,0 +1,111 @@
9056 ++#define ALT_SCHED_VERSION_MSG "sched/bmq: BMQ CPU Scheduler "ALT_SCHED_VERSION" by Alfred Chen.\n"
9057 ++
9058 ++/*
9059 ++ * BMQ only routines
9060 ++ */
9061 ++#define rq_switch_time(rq) ((rq)->clock - (rq)->last_ts_switch)
9062 ++#define boost_threshold(p) (sched_timeslice_ns >>\
9063 ++ (15 - MAX_PRIORITY_ADJ - (p)->boost_prio))
9064 ++
9065 ++static inline void boost_task(struct task_struct *p)
9066 ++{
9067 ++ int limit;
9068 ++
9069 ++ switch (p->policy) {
9070 ++ case SCHED_NORMAL:
9071 ++ limit = -MAX_PRIORITY_ADJ;
9072 ++ break;
9073 ++ case SCHED_BATCH:
9074 ++ case SCHED_IDLE:
9075 ++ limit = 0;
9076 ++ break;
9077 ++ default:
9078 ++ return;
9079 ++ }
9080 ++
9081 ++ if (p->boost_prio > limit)
9082 ++ p->boost_prio--;
9083 ++}
9084 ++
9085 ++static inline void deboost_task(struct task_struct *p)
9086 ++{
9087 ++ if (p->boost_prio < MAX_PRIORITY_ADJ)
9088 ++ p->boost_prio++;
9089 ++}
9090 ++
9091 ++/*
9092 ++ * Common interfaces
9093 ++ */
9094 ++static inline void sched_timeslice_imp(const int timeslice_ms) {}
9095 ++
9096 ++static inline int
9097 ++task_sched_prio_normal(const struct task_struct *p, const struct rq *rq)
9098 ++{
9099 ++ return p->prio + p->boost_prio - MAX_RT_PRIO;
9100 ++}
9101 ++
9102 ++static inline int task_sched_prio(const struct task_struct *p)
9103 ++{
9104 ++ return (p->prio < MAX_RT_PRIO)? p->prio : MAX_RT_PRIO / 2 + (p->prio + p->boost_prio) / 2;
9105 ++}
9106 ++
9107 ++static inline int
9108 ++task_sched_prio_idx(const struct task_struct *p, const struct rq *rq)
9109 ++{
9110 ++ return task_sched_prio(p);
9111 ++}
9112 ++
9113 ++static inline int sched_prio2idx(int prio, struct rq *rq)
9114 ++{
9115 ++ return prio;
9116 ++}
9117 ++
9118 ++static inline int sched_idx2prio(int idx, struct rq *rq)
9119 ++{
9120 ++ return idx;
9121 ++}
9122 ++
9123 ++static inline void time_slice_expired(struct task_struct *p, struct rq *rq)
9124 ++{
9125 ++ p->time_slice = sched_timeslice_ns;
9126 ++
9127 ++ if (SCHED_FIFO != p->policy && task_on_rq_queued(p)) {
9128 ++ if (SCHED_RR != p->policy)
9129 ++ deboost_task(p);
9130 ++ requeue_task(p, rq);
9131 ++ }
9132 ++}
9133 ++
9134 ++static inline void sched_task_sanity_check(struct task_struct *p, struct rq *rq) {}
9135 ++
9136 ++inline int task_running_nice(struct task_struct *p)
9137 ++{
9138 ++ return (p->prio + p->boost_prio > DEFAULT_PRIO + MAX_PRIORITY_ADJ);
9139 ++}
9140 ++
9141 ++static void sched_task_fork(struct task_struct *p, struct rq *rq)
9142 ++{
9143 ++ p->boost_prio = (p->boost_prio < 0) ?
9144 ++ p->boost_prio + MAX_PRIORITY_ADJ : MAX_PRIORITY_ADJ;
9145 ++}
9146 ++
9147 ++static inline void do_sched_yield_type_1(struct task_struct *p, struct rq *rq)
9148 ++{
9149 ++ p->boost_prio = MAX_PRIORITY_ADJ;
9150 ++}
9151 ++
9152 ++#ifdef CONFIG_SMP
9153 ++static inline void sched_task_ttwu(struct task_struct *p)
9154 ++{
9155 ++ if(this_rq()->clock_task - p->last_ran > sched_timeslice_ns)
9156 ++ boost_task(p);
9157 ++}
9158 ++#endif
9159 ++
9160 ++static inline void sched_task_deactivate(struct task_struct *p, struct rq *rq)
9161 ++{
9162 ++ if (rq_switch_time(rq) < boost_threshold(p))
9163 ++ boost_task(p);
9164 ++}
9165 ++
9166 ++static inline void update_rq_time_edge(struct rq *rq) {}
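To make the mapping above easier to follow, here is a minimal standalone sketch, not part of the patch, of how task_sched_prio() folds normal priorities into queue indices. It assumes the mainline constants MAX_RT_PRIO = 100 and a nice-to-prio offset of 120; boost_prio is passed as a plain parameter because MAX_PRIORITY_ADJ is defined elsewhere in the patch.

/* Standalone userspace sketch of BMQ's task_sched_prio() mapping; the
 * constants below are assumed from mainline, and boost_prio is taken as
 * a plain parameter instead of the patch's MAX_PRIORITY_ADJ machinery. */
#include <stdio.h>

#define MAX_RT_PRIO 100
#define NICE_TO_PRIO(nice) (MAX_RT_PRIO + (nice) + 20) /* 120 + nice */

static int bmq_sched_prio(int prio, int boost_prio)
{
    /* RT priorities pass through; normal priorities are compressed
     * into the upper half of the index range, shifted by boost. */
    return (prio < MAX_RT_PRIO) ? prio
                                : MAX_RT_PRIO / 2 + (prio + boost_prio) / 2;
}

int main(void)
{
    int nice;

    for (nice = -20; nice <= 19; nice += 13)
        printf("nice %3d -> queue index %d\n",
               nice, bmq_sched_prio(NICE_TO_PRIO(nice), 0));
    return 0;
}

With zero boost the whole nice range lands in indices 100 through 119 in this sketch.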
9167 +diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
9168 +index e7af18857371..3e38816b736e 100644
9169 +--- a/kernel/sched/cpufreq_schedutil.c
9170 ++++ b/kernel/sched/cpufreq_schedutil.c
9171 +@@ -167,9 +167,14 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
9172 + unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
9173 +
9174 + sg_cpu->max = max;
9175 ++#ifndef CONFIG_SCHED_ALT
9176 + sg_cpu->bw_dl = cpu_bw_dl(rq);
9177 + sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(rq), max,
9178 + FREQUENCY_UTIL, NULL);
9179 ++#else
9180 ++ sg_cpu->bw_dl = 0;
9181 ++ sg_cpu->util = rq_load_util(rq, max);
9182 ++#endif /* CONFIG_SCHED_ALT */
9183 + }
9184 +
9185 + /**
9186 +@@ -312,8 +317,10 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
9187 + */
9188 + static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu)
9189 + {
9190 ++#ifndef CONFIG_SCHED_ALT
9191 + if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
9192 + sg_cpu->sg_policy->limits_changed = true;
9193 ++#endif
9194 + }
9195 +
9196 + static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
9197 +@@ -607,6 +614,7 @@ static int sugov_kthread_create(struct sugov_policy *sg_policy)
9198 + }
9199 +
9200 + ret = sched_setattr_nocheck(thread, &attr);
9201 ++
9202 + if (ret) {
9203 + kthread_stop(thread);
9204 + pr_warn("%s: failed to set SCHED_DEADLINE\n", __func__);
9205 +@@ -839,7 +847,9 @@ cpufreq_governor_init(schedutil_gov);
9206 + #ifdef CONFIG_ENERGY_MODEL
9207 + static void rebuild_sd_workfn(struct work_struct *work)
9208 + {
9209 ++#ifndef CONFIG_SCHED_ALT
9210 + rebuild_sched_domains_energy();
9211 ++#endif /* CONFIG_SCHED_ALT */
9212 + }
9213 + static DECLARE_WORK(rebuild_sd_work, rebuild_sd_workfn);
9214 +
9215 +diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
9216 +index 872e481d5098..f920c8b48ec1 100644
9217 +--- a/kernel/sched/cputime.c
9218 ++++ b/kernel/sched/cputime.c
9219 +@@ -123,7 +123,7 @@ void account_user_time(struct task_struct *p, u64 cputime)
9220 + p->utime += cputime;
9221 + account_group_user_time(p, cputime);
9222 +
9223 +- index = (task_nice(p) > 0) ? CPUTIME_NICE : CPUTIME_USER;
9224 ++ index = task_running_nice(p) ? CPUTIME_NICE : CPUTIME_USER;
9225 +
9226 + /* Add user time to cpustat. */
9227 + task_group_account_field(p, index, cputime);
9228 +@@ -147,7 +147,7 @@ void account_guest_time(struct task_struct *p, u64 cputime)
9229 + p->gtime += cputime;
9230 +
9231 + /* Add guest time to cpustat. */
9232 +- if (task_nice(p) > 0) {
9233 ++ if (task_running_nice(p)) {
9234 + cpustat[CPUTIME_NICE] += cputime;
9235 + cpustat[CPUTIME_GUEST_NICE] += cputime;
9236 + } else {
9237 +@@ -270,7 +270,7 @@ static inline u64 account_other_time(u64 max)
9238 + #ifdef CONFIG_64BIT
9239 + static inline u64 read_sum_exec_runtime(struct task_struct *t)
9240 + {
9241 +- return t->se.sum_exec_runtime;
9242 ++ return tsk_seruntime(t);
9243 + }
9244 + #else
9245 + static u64 read_sum_exec_runtime(struct task_struct *t)
9246 +@@ -280,7 +280,7 @@ static u64 read_sum_exec_runtime(struct task_struct *t)
9247 + struct rq *rq;
9248 +
9249 + rq = task_rq_lock(t, &rf);
9250 +- ns = t->se.sum_exec_runtime;
9251 ++ ns = tsk_seruntime(t);
9252 + task_rq_unlock(rq, t, &rf);
9253 +
9254 + return ns;
9255 +@@ -612,7 +612,7 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
9256 + void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
9257 + {
9258 + struct task_cputime cputime = {
9259 +- .sum_exec_runtime = p->se.sum_exec_runtime,
9260 ++ .sum_exec_runtime = tsk_seruntime(p),
9261 + };
9262 +
9263 + task_cputime(p, &cputime.utime, &cputime.stime);
9264 +diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
9265 +index 17a653b67006..17ab2fe34d7a 100644
9266 +--- a/kernel/sched/debug.c
9267 ++++ b/kernel/sched/debug.c
9268 +@@ -8,6 +8,7 @@
9269 + */
9270 + #include "sched.h"
9271 +
9272 ++#ifndef CONFIG_SCHED_ALT
9273 + /*
9274 + * This allows printing both to /proc/sched_debug and
9275 + * to the console
9276 +@@ -216,6 +217,7 @@ static const struct file_operations sched_scaling_fops = {
9277 + };
9278 +
9279 + #endif /* SMP */
9280 ++#endif /* !CONFIG_SCHED_ALT */
9281 +
9282 + #ifdef CONFIG_PREEMPT_DYNAMIC
9283 +
9284 +@@ -279,6 +281,7 @@ static const struct file_operations sched_dynamic_fops = {
9285 +
9286 + #endif /* CONFIG_PREEMPT_DYNAMIC */
9287 +
9288 ++#ifndef CONFIG_SCHED_ALT
9289 + __read_mostly bool sched_debug_verbose;
9290 +
9291 + static const struct seq_operations sched_debug_sops;
9292 +@@ -294,6 +297,7 @@ static const struct file_operations sched_debug_fops = {
9293 + .llseek = seq_lseek,
9294 + .release = seq_release,
9295 + };
9296 ++#endif /* !CONFIG_SCHED_ALT */
9297 +
9298 + static struct dentry *debugfs_sched;
9299 +
9300 +@@ -303,12 +307,15 @@ static __init int sched_init_debug(void)
9301 +
9302 + debugfs_sched = debugfs_create_dir("sched", NULL);
9303 +
9304 ++#ifndef CONFIG_SCHED_ALT
9305 + debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
9306 + debugfs_create_bool("verbose", 0644, debugfs_sched, &sched_debug_verbose);
9307 ++#endif /* !CONFIG_SCHED_ALT */
9308 + #ifdef CONFIG_PREEMPT_DYNAMIC
9309 + debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
9310 + #endif
9311 +
9312 ++#ifndef CONFIG_SCHED_ALT
9313 + debugfs_create_u32("latency_ns", 0644, debugfs_sched, &sysctl_sched_latency);
9314 + debugfs_create_u32("min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_min_granularity);
9315 + debugfs_create_u32("wakeup_granularity_ns", 0644, debugfs_sched, &sysctl_sched_wakeup_granularity);
9316 +@@ -336,11 +343,13 @@ static __init int sched_init_debug(void)
9317 + #endif
9318 +
9319 + debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
9320 ++#endif /* !CONFIG_SCHED_ALT */
9321 +
9322 + return 0;
9323 + }
9324 + late_initcall(sched_init_debug);
9325 +
9326 ++#ifndef CONFIG_SCHED_ALT
9327 + #ifdef CONFIG_SMP
9328 +
9329 + static cpumask_var_t sd_sysctl_cpus;
9330 +@@ -1063,6 +1072,7 @@ void proc_sched_set_task(struct task_struct *p)
9331 + memset(&p->se.statistics, 0, sizeof(p->se.statistics));
9332 + #endif
9333 + }
9334 ++#endif /* !CONFIG_SCHED_ALT */
9335 +
9336 + void resched_latency_warn(int cpu, u64 latency)
9337 + {
9338 +diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
9339 +index d17b0a5ce6ac..6ff77fc6b73a 100644
9340 +--- a/kernel/sched/idle.c
9341 ++++ b/kernel/sched/idle.c
9342 +@@ -403,6 +403,7 @@ void cpu_startup_entry(enum cpuhp_state state)
9343 + do_idle();
9344 + }
9345 +
9346 ++#ifndef CONFIG_SCHED_ALT
9347 + /*
9348 + * idle-task scheduling class.
9349 + */
9350 +@@ -525,3 +526,4 @@ DEFINE_SCHED_CLASS(idle) = {
9351 + .switched_to = switched_to_idle,
9352 + .update_curr = update_curr_idle,
9353 + };
9354 ++#endif
9355 +diff --git a/kernel/sched/pds.h b/kernel/sched/pds.h
9356 +new file mode 100644
9357 +index 000000000000..0f1f0d708b77
9358 +--- /dev/null
9359 ++++ b/kernel/sched/pds.h
9360 +@@ -0,0 +1,127 @@
9361 ++#define ALT_SCHED_VERSION_MSG "sched/pds: PDS CPU Scheduler "ALT_SCHED_VERSION" by Alfred Chen.\n"
9362 ++
9363 ++static int sched_timeslice_shift = 22;
9364 ++
9365 ++#define NORMAL_PRIO_MOD(x) ((x) & (NORMAL_PRIO_NUM - 1))
9366 ++
9367 ++/*
9368 ++ * Common interfaces
9369 ++ */
9370 ++static inline void sched_timeslice_imp(const int timeslice_ms)
9371 ++{
9372 ++ if (2 == timeslice_ms)
9373 ++ sched_timeslice_shift = 21;
9374 ++}
9375 ++
9376 ++static inline int
9377 ++task_sched_prio_normal(const struct task_struct *p, const struct rq *rq)
9378 ++{
9379 ++ s64 delta = p->deadline - rq->time_edge + NORMAL_PRIO_NUM - NICE_WIDTH;
9380 ++
9381 ++ if (WARN_ONCE(delta > NORMAL_PRIO_NUM - 1,
9382 ++ "pds: task_sched_prio_normal() delta %lld\n", delta))
9383 ++ return NORMAL_PRIO_NUM - 1;
9384 ++
9385 ++ return (delta < 0) ? 0 : delta;
9386 ++}
9387 ++
9388 ++static inline int task_sched_prio(const struct task_struct *p)
9389 ++{
9390 ++ return (p->prio < MAX_RT_PRIO) ? p->prio :
9391 ++ MIN_NORMAL_PRIO + task_sched_prio_normal(p, task_rq(p));
9392 ++}
9393 ++
9394 ++static inline int
9395 ++task_sched_prio_idx(const struct task_struct *p, const struct rq *rq)
9396 ++{
9397 ++ return (p->prio < MAX_RT_PRIO) ? p->prio : MIN_NORMAL_PRIO +
9398 ++ NORMAL_PRIO_MOD(task_sched_prio_normal(p, rq) + rq->time_edge);
9399 ++}
9400 ++
9401 ++static inline int sched_prio2idx(int prio, struct rq *rq)
9402 ++{
9403 ++ return (IDLE_TASK_SCHED_PRIO == prio || prio < MAX_RT_PRIO) ? prio :
9404 ++ MIN_NORMAL_PRIO + NORMAL_PRIO_MOD((prio - MIN_NORMAL_PRIO) +
9405 ++ rq->time_edge);
9406 ++}
9407 ++
9408 ++static inline int sched_idx2prio(int idx, struct rq *rq)
9409 ++{
9410 ++ return (idx < MAX_RT_PRIO) ? idx : MIN_NORMAL_PRIO +
9411 ++ NORMAL_PRIO_MOD((idx - MIN_NORMAL_PRIO) + NORMAL_PRIO_NUM -
9412 ++ NORMAL_PRIO_MOD(rq->time_edge));
9413 ++}
9414 ++
9415 ++static inline void sched_renew_deadline(struct task_struct *p, const struct rq *rq)
9416 ++{
9417 ++ if (p->prio >= MAX_RT_PRIO)
9418 ++ p->deadline = (rq->clock >> sched_timeslice_shift) +
9419 ++ p->static_prio - (MAX_PRIO - NICE_WIDTH);
9420 ++}
9421 ++
9422 ++int task_running_nice(struct task_struct *p)
9423 ++{
9424 ++ return (p->prio > DEFAULT_PRIO);
9425 ++}
9426 ++
9427 ++static inline void update_rq_time_edge(struct rq *rq)
9428 ++{
9429 ++ struct list_head head;
9430 ++ u64 old = rq->time_edge;
9431 ++ u64 now = rq->clock >> sched_timeslice_shift;
9432 ++ u64 prio, delta;
9433 ++
9434 ++ if (now == old)
9435 ++ return;
9436 ++
9437 ++ delta = min_t(u64, NORMAL_PRIO_NUM, now - old);
9438 ++ INIT_LIST_HEAD(&head);
9439 ++
9440 ++ for_each_set_bit(prio, &rq->queue.bitmap[2], delta)
9441 ++ list_splice_tail_init(rq->queue.heads + MIN_NORMAL_PRIO +
9442 ++ NORMAL_PRIO_MOD(prio + old), &head);
9443 ++
9444 ++ rq->queue.bitmap[2] = (NORMAL_PRIO_NUM == delta) ? 0UL :
9445 ++ rq->queue.bitmap[2] >> delta;
9446 ++ rq->time_edge = now;
9447 ++ if (!list_empty(&head)) {
9448 ++ u64 idx = MIN_NORMAL_PRIO + NORMAL_PRIO_MOD(now);
9449 ++ struct task_struct *p;
9450 ++
9451 ++ list_for_each_entry(p, &head, sq_node)
9452 ++ p->sq_idx = idx;
9453 ++
9454 ++ list_splice(&head, rq->queue.heads + idx);
9455 ++ rq->queue.bitmap[2] |= 1UL;
9456 ++ }
9457 ++}
9458 ++
9459 ++static inline void time_slice_expired(struct task_struct *p, struct rq *rq)
9460 ++{
9461 ++ p->time_slice = sched_timeslice_ns;
9462 ++ sched_renew_deadline(p, rq);
9463 ++ if (SCHED_FIFO != p->policy && task_on_rq_queued(p))
9464 ++ requeue_task(p, rq);
9465 ++}
9466 ++
9467 ++static inline void sched_task_sanity_check(struct task_struct *p, struct rq *rq)
9468 ++{
9469 ++ u64 max_dl = rq->time_edge + NICE_WIDTH - 1;
9470 ++ if (unlikely(p->deadline > max_dl))
9471 ++ p->deadline = max_dl;
9472 ++}
9473 ++
9474 ++static void sched_task_fork(struct task_struct *p, struct rq *rq)
9475 ++{
9476 ++ sched_renew_deadline(p, rq);
9477 ++}
9478 ++
9479 ++static inline void do_sched_yield_type_1(struct task_struct *p, struct rq *rq)
9480 ++{
9481 ++ time_slice_expired(p, rq);
9482 ++}
9483 ++
9484 ++#ifdef CONFIG_SMP
9485 ++static inline void sched_task_ttwu(struct task_struct *p) {}
9486 ++#endif
9487 ++static inline void sched_task_deactivate(struct task_struct *p, struct rq *rq) {}
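The effect of sched_timeslice_shift above is easier to see with concrete numbers: 2^22 ns is roughly 4.19 ms and 2^21 ns roughly 2.10 ms, so the deadline granularity follows the chosen timeslice. A tiny standalone sketch, not part of the patch, assuming only those two shift values:

/* Standalone sketch of the PDS deadline granularity implied by
 * sched_timeslice_shift; shift values assumed from the hunk above. */
#include <stdio.h>

int main(void)
{
    unsigned long long gran_4ms = 1ULL << 22; /* shift used for the 4 ms timeslice */
    unsigned long long gran_2ms = 1ULL << 21; /* shift used for the 2 ms timeslice */

    printf("shift 22 -> %llu ns (~%.2f ms) per deadline step\n",
           gran_4ms, gran_4ms / 1e6);
    printf("shift 21 -> %llu ns (~%.2f ms) per deadline step\n",
           gran_2ms, gran_2ms / 1e6);
    return 0;
}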
9488 +diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
9489 +index a554e3bbab2b..3e56f5e6ff5c 100644
9490 +--- a/kernel/sched/pelt.c
9491 ++++ b/kernel/sched/pelt.c
9492 +@@ -270,6 +270,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
9493 + WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
9494 + }
9495 +
9496 ++#ifndef CONFIG_SCHED_ALT
9497 + /*
9498 + * sched_entity:
9499 + *
9500 +@@ -387,8 +388,9 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
9501 +
9502 + return 0;
9503 + }
9504 ++#endif
9505 +
9506 +-#ifdef CONFIG_SCHED_THERMAL_PRESSURE
9507 ++#if defined(CONFIG_SCHED_THERMAL_PRESSURE) && !defined(CONFIG_SCHED_ALT)
9508 + /*
9509 + * thermal:
9510 + *
9511 +diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
9512 +index e06071bf3472..adf567df34d4 100644
9513 +--- a/kernel/sched/pelt.h
9514 ++++ b/kernel/sched/pelt.h
9515 +@@ -1,13 +1,15 @@
9516 + #ifdef CONFIG_SMP
9517 + #include "sched-pelt.h"
9518 +
9519 ++#ifndef CONFIG_SCHED_ALT
9520 + int __update_load_avg_blocked_se(u64 now, struct sched_entity *se);
9521 + int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se);
9522 + int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
9523 + int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
9524 + int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
9525 ++#endif
9526 +
9527 +-#ifdef CONFIG_SCHED_THERMAL_PRESSURE
9528 ++#if defined(CONFIG_SCHED_THERMAL_PRESSURE) && !defined(CONFIG_SCHED_ALT)
9529 + int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity);
9530 +
9531 + static inline u64 thermal_load_avg(struct rq *rq)
9532 +@@ -42,6 +44,7 @@ static inline u32 get_pelt_divider(struct sched_avg *avg)
9533 + return LOAD_AVG_MAX - 1024 + avg->period_contrib;
9534 + }
9535 +
9536 ++#ifndef CONFIG_SCHED_ALT
9537 + static inline void cfs_se_util_change(struct sched_avg *avg)
9538 + {
9539 + unsigned int enqueued;
9540 +@@ -153,9 +156,11 @@ static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
9541 + return rq_clock_pelt(rq_of(cfs_rq));
9542 + }
9543 + #endif
9544 ++#endif /* CONFIG_SCHED_ALT */
9545 +
9546 + #else
9547 +
9548 ++#ifndef CONFIG_SCHED_ALT
9549 + static inline int
9550 + update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
9551 + {
9552 +@@ -173,6 +178,7 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
9553 + {
9554 + return 0;
9555 + }
9556 ++#endif
9557 +
9558 + static inline int
9559 + update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
9560 +diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
9561 +index 3d3e5793e117..c1d976ef623f 100644
9562 +--- a/kernel/sched/sched.h
9563 ++++ b/kernel/sched/sched.h
9564 +@@ -2,6 +2,10 @@
9565 + /*
9566 + * Scheduler internal types and methods:
9567 + */
9568 ++#ifdef CONFIG_SCHED_ALT
9569 ++#include "alt_sched.h"
9570 ++#else
9571 ++
9572 + #include <linux/sched.h>
9573 +
9574 + #include <linux/sched/autogroup.h>
9575 +@@ -3064,3 +3068,8 @@ extern int sched_dynamic_mode(const char *str);
9576 + extern void sched_dynamic_update(int mode);
9577 + #endif
9578 +
9579 ++static inline int task_running_nice(struct task_struct *p)
9580 ++{
9581 ++ return (task_nice(p) > 0);
9582 ++}
9583 ++#endif /* !CONFIG_SCHED_ALT */
9584 +diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
9585 +index 3f93fc3b5648..528b71e144e9 100644
9586 +--- a/kernel/sched/stats.c
9587 ++++ b/kernel/sched/stats.c
9588 +@@ -22,8 +22,10 @@ static int show_schedstat(struct seq_file *seq, void *v)
9589 + } else {
9590 + struct rq *rq;
9591 + #ifdef CONFIG_SMP
9592 ++#ifndef CONFIG_SCHED_ALT
9593 + struct sched_domain *sd;
9594 + int dcount = 0;
9595 ++#endif
9596 + #endif
9597 + cpu = (unsigned long)(v - 2);
9598 + rq = cpu_rq(cpu);
9599 +@@ -40,6 +42,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
9600 + seq_printf(seq, "\n");
9601 +
9602 + #ifdef CONFIG_SMP
9603 ++#ifndef CONFIG_SCHED_ALT
9604 + /* domain-specific stats */
9605 + rcu_read_lock();
9606 + for_each_domain(cpu, sd) {
9607 +@@ -68,6 +71,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
9608 + sd->ttwu_move_balance);
9609 + }
9610 + rcu_read_unlock();
9611 ++#endif
9612 + #endif
9613 + }
9614 + return 0;
9615 +diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
9616 +index 4e8698e62f07..36c61551252e 100644
9617 +--- a/kernel/sched/topology.c
9618 ++++ b/kernel/sched/topology.c
9619 +@@ -4,6 +4,7 @@
9620 + */
9621 + #include "sched.h"
9622 +
9623 ++#ifndef CONFIG_SCHED_ALT
9624 + DEFINE_MUTEX(sched_domains_mutex);
9625 +
9626 + /* Protected by sched_domains_mutex: */
9627 +@@ -1382,8 +1383,10 @@ static void asym_cpu_capacity_scan(void)
9628 + */
9629 +
9630 + static int default_relax_domain_level = -1;
9631 ++#endif /* CONFIG_SCHED_ALT */
9632 + int sched_domain_level_max;
9633 +
9634 ++#ifndef CONFIG_SCHED_ALT
9635 + static int __init setup_relax_domain_level(char *str)
9636 + {
9637 + if (kstrtoint(str, 0, &default_relax_domain_level))
9638 +@@ -1619,6 +1622,7 @@ sd_init(struct sched_domain_topology_level *tl,
9639 +
9640 + return sd;
9641 + }
9642 ++#endif /* CONFIG_SCHED_ALT */
9643 +
9644 + /*
9645 + * Topology list, bottom-up.
9646 +@@ -1648,6 +1652,7 @@ void set_sched_topology(struct sched_domain_topology_level *tl)
9647 + sched_domain_topology = tl;
9648 + }
9649 +
9650 ++#ifndef CONFIG_SCHED_ALT
9651 + #ifdef CONFIG_NUMA
9652 +
9653 + static const struct cpumask *sd_numa_mask(int cpu)
9654 +@@ -2516,3 +2521,17 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
9655 + partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
9656 + mutex_unlock(&sched_domains_mutex);
9657 + }
9658 ++#else /* CONFIG_SCHED_ALT */
9659 ++void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
9660 ++ struct sched_domain_attr *dattr_new)
9661 ++{}
9662 ++
9663 ++#ifdef CONFIG_NUMA
9664 ++int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
9665 ++
9666 ++int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
9667 ++{
9668 ++ return best_mask_cpu(cpu, cpus);
9669 ++}
9670 ++#endif /* CONFIG_NUMA */
9671 ++#endif
9672 +diff --git a/kernel/sysctl.c b/kernel/sysctl.c
9673 +index 083be6af29d7..09fc6281d488 100644
9674 +--- a/kernel/sysctl.c
9675 ++++ b/kernel/sysctl.c
9676 +@@ -122,6 +122,10 @@ static unsigned long long_max = LONG_MAX;
9677 + static int one_hundred = 100;
9678 + static int two_hundred = 200;
9679 + static int one_thousand = 1000;
9680 ++#ifdef CONFIG_SCHED_ALT
9681 ++static int __maybe_unused zero = 0;
9682 ++extern int sched_yield_type;
9683 ++#endif
9684 + #ifdef CONFIG_PRINTK
9685 + static int ten_thousand = 10000;
9686 + #endif
9687 +@@ -1771,6 +1775,24 @@ int proc_do_static_key(struct ctl_table *table, int write,
9688 + }
9689 +
9690 + static struct ctl_table kern_table[] = {
9691 ++#ifdef CONFIG_SCHED_ALT
9692 ++/* In ALT, only "sched_schedstats" is supported */
9693 ++#ifdef CONFIG_SCHED_DEBUG
9694 ++#ifdef CONFIG_SMP
9695 ++#ifdef CONFIG_SCHEDSTATS
9696 ++ {
9697 ++ .procname = "sched_schedstats",
9698 ++ .data = NULL,
9699 ++ .maxlen = sizeof(unsigned int),
9700 ++ .mode = 0644,
9701 ++ .proc_handler = sysctl_schedstats,
9702 ++ .extra1 = SYSCTL_ZERO,
9703 ++ .extra2 = SYSCTL_ONE,
9704 ++ },
9705 ++#endif /* CONFIG_SCHEDSTATS */
9706 ++#endif /* CONFIG_SMP */
9707 ++#endif /* CONFIG_SCHED_DEBUG */
9708 ++#else /* !CONFIG_SCHED_ALT */
9709 + {
9710 + .procname = "sched_child_runs_first",
9711 + .data = &sysctl_sched_child_runs_first,
9712 +@@ -1901,6 +1923,7 @@ static struct ctl_table kern_table[] = {
9713 + .extra2 = SYSCTL_ONE,
9714 + },
9715 + #endif
9716 ++#endif /* !CONFIG_SCHED_ALT */
9717 + #ifdef CONFIG_PROVE_LOCKING
9718 + {
9719 + .procname = "prove_locking",
9720 +@@ -2477,6 +2500,17 @@ static struct ctl_table kern_table[] = {
9721 + .proc_handler = proc_dointvec,
9722 + },
9723 + #endif
9724 ++#ifdef CONFIG_SCHED_ALT
9725 ++ {
9726 ++ .procname = "yield_type",
9727 ++ .data = &sched_yield_type,
9728 ++ .maxlen = sizeof (int),
9729 ++ .mode = 0644,
9730 ++ .proc_handler = &proc_dointvec_minmax,
9731 ++ .extra1 = &zero,
9732 ++ .extra2 = &two,
9733 ++ },
9734 ++#endif
9735 + #if defined(CONFIG_S390) && defined(CONFIG_SMP)
9736 + {
9737 + .procname = "spin_retry",
9738 +diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
9739 +index 0ea8702eb516..a27a0f3a654d 100644
9740 +--- a/kernel/time/hrtimer.c
9741 ++++ b/kernel/time/hrtimer.c
9742 +@@ -2088,8 +2088,10 @@ long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode,
9743 + int ret = 0;
9744 + u64 slack;
9745 +
9746 ++#ifndef CONFIG_SCHED_ALT
9747 + slack = current->timer_slack_ns;
9748 + if (dl_task(current) || rt_task(current))
9749 ++#endif
9750 + slack = 0;
9751 +
9752 + hrtimer_init_sleeper_on_stack(&t, clockid, mode);
9753 +diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
9754 +index 643d412ac623..6bf27565242f 100644
9755 +--- a/kernel/time/posix-cpu-timers.c
9756 ++++ b/kernel/time/posix-cpu-timers.c
9757 +@@ -216,7 +216,7 @@ static void task_sample_cputime(struct task_struct *p, u64 *samples)
9758 + u64 stime, utime;
9759 +
9760 + task_cputime(p, &utime, &stime);
9761 +- store_samples(samples, stime, utime, p->se.sum_exec_runtime);
9762 ++ store_samples(samples, stime, utime, tsk_seruntime(p));
9763 + }
9764 +
9765 + static void proc_sample_cputime_atomic(struct task_cputime_atomic *at,
9766 +@@ -859,6 +859,7 @@ static void collect_posix_cputimers(struct posix_cputimers *pct, u64 *samples,
9767 + }
9768 + }
9769 +
9770 ++#ifndef CONFIG_SCHED_ALT
9771 + static inline void check_dl_overrun(struct task_struct *tsk)
9772 + {
9773 + if (tsk->dl.dl_overrun) {
9774 +@@ -866,6 +867,7 @@ static inline void check_dl_overrun(struct task_struct *tsk)
9775 + __group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
9776 + }
9777 + }
9778 ++#endif
9779 +
9780 + static bool check_rlimit(u64 time, u64 limit, int signo, bool rt, bool hard)
9781 + {
9782 +@@ -893,8 +895,10 @@ static void check_thread_timers(struct task_struct *tsk,
9783 + u64 samples[CPUCLOCK_MAX];
9784 + unsigned long soft;
9785 +
9786 ++#ifndef CONFIG_SCHED_ALT
9787 + if (dl_task(tsk))
9788 + check_dl_overrun(tsk);
9789 ++#endif
9790 +
9791 + if (expiry_cache_is_inactive(pct))
9792 + return;
9793 +@@ -908,7 +912,7 @@ static void check_thread_timers(struct task_struct *tsk,
9794 + soft = task_rlimit(tsk, RLIMIT_RTTIME);
9795 + if (soft != RLIM_INFINITY) {
9796 + /* Task RT timeout is accounted in jiffies. RTTIME is usec */
9797 +- unsigned long rttime = tsk->rt.timeout * (USEC_PER_SEC / HZ);
9798 ++ unsigned long rttime = tsk_rttimeout(tsk) * (USEC_PER_SEC / HZ);
9799 + unsigned long hard = task_rlimit_max(tsk, RLIMIT_RTTIME);
9800 +
9801 + /* At the hard limit, send SIGKILL. No further action. */
9802 +@@ -1144,8 +1148,10 @@ static inline bool fastpath_timer_check(struct task_struct *tsk)
9803 + return true;
9804 + }
9805 +
9806 ++#ifndef CONFIG_SCHED_ALT
9807 + if (dl_task(tsk) && tsk->dl.dl_overrun)
9808 + return true;
9809 ++#endif
9810 +
9811 + return false;
9812 + }
9813 +diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
9814 +index adf7ef194005..11c8f36e281b 100644
9815 +--- a/kernel/trace/trace_selftest.c
9816 ++++ b/kernel/trace/trace_selftest.c
9817 +@@ -1052,10 +1052,15 @@ static int trace_wakeup_test_thread(void *data)
9818 + {
9819 + /* Make this a -deadline thread */
9820 + static const struct sched_attr attr = {
9821 ++#ifdef CONFIG_SCHED_ALT
9822 ++ /* No deadline on BMQ/PDS, use RR */
9823 ++ .sched_policy = SCHED_RR,
9824 ++#else
9825 + .sched_policy = SCHED_DEADLINE,
9826 + .sched_runtime = 100000ULL,
9827 + .sched_deadline = 10000000ULL,
9828 + .sched_period = 10000000ULL
9829 ++#endif
9830 + };
9831 + struct wakeup_test_data *x = data;
9832 +
9833
9834 diff --git a/5021_BMQ-and-PDS-gentoo-defaults.patch b/5021_BMQ-and-PDS-gentoo-defaults.patch
9835 new file mode 100644
9836 index 0000000..d449eec
9837 --- /dev/null
9838 +++ b/5021_BMQ-and-PDS-gentoo-defaults.patch
9839 @@ -0,0 +1,13 @@
9840 +--- a/init/Kconfig 2021-04-27 07:38:30.556467045 -0400
9841 ++++ b/init/Kconfig 2021-04-27 07:39:32.956412800 -0400
9842 +@@ -780,8 +780,9 @@ config GENERIC_SCHED_CLOCK
9843 + menu "Scheduler features"
9844 +
9845 + menuconfig SCHED_ALT
9846 ++ depends on X86_64
9847 + bool "Alternative CPU Schedulers"
9848 +- default y
9849 ++ default n
9850 + help
9851 + This feature enables the alternative CPU schedulers.
9852 +
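Because this defaults patch turns SCHED_ALT off and limits it to X86_64, the new scheduler is strictly opt-in at kernel configuration time. A possible .config fragment is shown below; the SCHED_BMQ and SCHED_PDS symbol names are assumed from the upstream Project C patch and do not appear in this hunk.

CONFIG_SCHED_ALT=y
CONFIG_SCHED_BMQ=y
# CONFIG_SCHED_PDS is not set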