[gentoo-commits] proj/linux-patches:5.19 commit in: / - gentoo-commits

From:	Mike Pagano <mpagano@g.o>
To:	gentoo-commits@l.g.o
Subject:	[gentoo-commits] proj/linux-patches:5.19 commit in: /
Date:	Tue, 02 Aug 2022 18:20:39
Message-Id:	`1659464335.5937201b16fe180982c20df4fc8ec78f7e8886f0.mpagano@gentoo`

commit:     5937201b16fe180982c20df4fc8ec78f7e8886f0
Author:     Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Tue Aug  2 18:18:55 2022 +0000
Commit:     Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Tue Aug  2 18:18:55 2022 +0000
URL:        https://gitweb.gentoo.org/proj/linux-patches.git/commit/?id=5937201b

Add the BMQ(BitMap Queue) Scheduler.

See: https://gitlab.com/alfredchen/projectc

Signed-off-by: Mike Pagano <mpagano <AT> gentoo.org>

 0000_README                                  |    8 +
 5020_BMQ-and-PDS-io-scheduler-v5.19-r0.patch | 9956 ++++++++++++++++++++++++++
 5021_BMQ-and-PDS-gentoo-defaults.patch       |   13 +
 3 files changed, 9977 insertions(+)

diff --git a/0000_README b/0000_README
index 639f7346..3d9202d9 100644
--- a/0000_README
+++ b/0000_README
@@ -78,3 +78,11 @@ Desc:   Add Gentoo Linux support config settings and defaults.
 Patch:  5010_enable-cpu-optimizations-universal.patch
 From:   https://github.com/graysky2/kernel_compiler_patch
 Desc:   Kernel >= 5.15 patch enables gcc = v11.1+ optimizations for additional CPUs.
+
+Patch:  5020_BMQ-and-PDS-io-scheduler-v5.19-r0.patch
+From:   https://gitlab.com/alfredchen/linux-prjc
+Desc:   BMQ(BitMap Queue) Scheduler. A new CPU scheduler developed from PDS(incld). Inspired by the scheduler in zircon.
+
+Patch:  5021_BMQ-and-PDS-gentoo-defaults.patch
+From:   https://gitweb.gentoo.org/proj/linux-patches.git/
+Desc:   Set defaults for BMQ. Add archs as people test, default to N

diff --git a/5020_BMQ-and-PDS-io-scheduler-v5.19-r0.patch b/5020_BMQ-and-PDS-io-scheduler-v5.19-r0.patch
new file mode 100644
index 00000000..610cfe83
--- /dev/null
+++ b/5020_BMQ-and-PDS-io-scheduler-v5.19-r0.patch
@@ -0,0 +1,9956 @@
+diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
+index cc3ea8febc62..ab4c5a35b999 100644
+--- a/Documentation/admin-guide/kernel-parameters.txt
++++ b/Documentation/admin-guide/kernel-parameters.txt
+@@ -5299,6 +5299,12 @@
+ 	sa1100ir	[NET]
+ 			See drivers/net/irda/sa1100_ir.c.
+
++	sched_timeslice=
++			[KNL] Time slice in ms for Project C BMQ/PDS scheduler.
++			Format: integer 2, 4
++			Default: 4
++			See Documentation/scheduler/sched-BMQ.txt
++
+ 	sched_verbose	[KNL] Enables verbose scheduler debug messages.
+
+ 	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
+diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
+index ddccd1077462..e24781970a3d 100644
+--- a/Documentation/admin-guide/sysctl/kernel.rst
++++ b/Documentation/admin-guide/sysctl/kernel.rst
+@@ -1524,3 +1524,13 @@ is 10 seconds.
+
+ The softlockup threshold is (``2 * watchdog_thresh``). Setting this
+ tunable to zero will disable lockup detection altogether.
++
++yield_type:
++===========
++
++BMQ/PDS CPU scheduler only. This determines what type of yield calls
++to sched_yield will perform.
++
++  0 - No yield.
++  1 - Deboost and requeue task. (default)
++  2 - Set run queue skip task.
+diff --git a/Documentation/scheduler/sched-BMQ.txt b/Documentation/scheduler/sched-BMQ.txt
+new file mode 100644
+index 000000000000..05c84eec0f31
+--- /dev/null
++++ b/Documentation/scheduler/sched-BMQ.txt
+@@ -0,0 +1,110 @@
++                         BitMap queue CPU Scheduler
++                         --------------------------
++
++CONTENT
++========
++
++ Background
++ Design
++   Overview
++   Task policy
++   Priority management
++   BitMap Queue
++   CPU Assignment and Migration
++
++
++Background
++==========
++
++The BitMap Queue CPU scheduler, referred to as BMQ from here on, is an
++evolution of the previous Priority and Deadline based Skiplist multiple queue
++scheduler (PDS), and is inspired by the Zircon scheduler. Its goal is to keep
++the scheduler code simple while staying efficient and scalable for interactive
++workloads such as desktop use, movie playback and gaming.
++
++Design
++======
++
++Overview
++--------
++
++BMQ uses a per-CPU run queue design: each (logical) CPU has its own run queue
++and is responsible for scheduling the tasks that are put into that run
++queue.
++
++The run queue is a set of priority queues. As data structures, these are FIFO
++queues for non-rt tasks and priority queues for rt tasks. See BitMap Queue
++below for details. BMQ is optimized for non-rt tasks, since most applications
++are non-rt tasks. Whether a queue is FIFO or priority, it holds an ordered
++list of runnable tasks awaiting execution, and the data structures are the
++same. When it is time for a new task to run, the scheduler simply looks for
++the lowest numbered queue that contains a task and runs the first task from
++the head of that queue. The per-CPU idle task is also in the run queue, so
++the scheduler can always find a task to run from its own run
++queue.
++
++Each task is assigned the same timeslice (default 4 ms) when it is picked to
++start running. A task is reinserted at the end of the appropriate priority
++queue when it uses its whole timeslice. When the scheduler selects a new task
++from the priority queue it sets the CPU's preemption timer for the remainder of
++the previous timeslice. When that timer fires the scheduler will stop execution
++on that task, select another task and start over again.
++
++If a task blocks waiting for a shared resource then it's taken out of its
++priority queue and is placed in a wait queue for the shared resource. When it
++is unblocked it will be reinserted in the appropriate priority queue of an
++eligible CPU.
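Reader's note, not part of the patch: the Overview above describes picking the head of the lowest numbered non-empty queue. A minimal user-space sketch of that idea follows. The names `runqueue`, `rq_enqueue`, `rq_pick_next` and the 64-level limit are hypothetical; the kernel code in this patch uses `find_first_bit()` on `rq->queue.bitmap` and `list_head` queues instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NQUEUES 64	/* hypothetical number of priority levels */

struct task {
	int id;
	struct task *next;
};

struct runqueue {
	uint64_t bitmap;		/* bit i set <=> queue i is non-empty */
	struct task *head[NQUEUES];
	struct task *tail[NQUEUES];
};

static void rq_init(struct runqueue *rq)
{
	rq->bitmap = 0;
	for (int i = 0; i < NQUEUES; i++)
		rq->head[i] = rq->tail[i] = NULL;
}

/* Append a task to the tail of queue `prio` (FIFO order within a level). */
static void rq_enqueue(struct runqueue *rq, struct task *t, int prio)
{
	t->next = NULL;
	if (rq->tail[prio])
		rq->tail[prio]->next = t;
	else
		rq->head[prio] = t;
	rq->tail[prio] = t;
	rq->bitmap |= (uint64_t)1 << prio;
}

/* Remove and return the head of the lowest numbered non-empty queue. */
static struct task *rq_pick_next(struct runqueue *rq)
{
	for (int prio = 0; prio < NQUEUES; prio++) {
		if (!(rq->bitmap & ((uint64_t)1 << prio)))
			continue;
		struct task *t = rq->head[prio];
		rq->head[prio] = t->next;
		if (!rq->head[prio]) {
			/* Queue drained: clear its bit so later scans skip it. */
			rq->tail[prio] = NULL;
			rq->bitmap &= ~((uint64_t)1 << prio);
		}
		t->next = NULL;
		return t;
	}
	return NULL;	/* cannot happen in BMQ: the idle task is always queued */
}
```

In this sketch an empty run queue returns NULL; per the text above, BMQ avoids that case by keeping the per-CPU idle task in the run queue.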

++
++Task policy
++-----------
++
++BMQ supports the DEADLINE, FIFO, RR, NORMAL, BATCH and IDLE task policies, like
++the mainline CFS scheduler. But BMQ is heavily optimized for non-rt tasks, that
++is, NORMAL/BATCH/IDLE policy tasks. Below are the implementation details of
++each policy.
++
++DEADLINE
++	It is squashed into a priority 0 FIFO task.
++
++FIFO/RR
++	All RT tasks share one single priority queue in the BMQ run queue design. The
++complexity of the insert operation is O(n). BMQ is not designed for systems
++that run mostly rt policy tasks.
++
++NORMAL/BATCH/IDLE
++	BATCH and IDLE tasks are treated as the same policy. They compete for CPU with
++NORMAL policy tasks, but they just don't boost. To control the priority of
++NORMAL/BATCH/IDLE tasks, simply use the nice level.
++
++ISO
++	ISO policy is not supported in BMQ. Please use a nice level -20 NORMAL policy
++task instead.
++
++Priority management
++-------------------
++
++RT tasks have priority from 0-99. For non-rt tasks, there are three different
++factors used to determine the effective priority of a task. The effective
++priority is what determines which queue the task will be in.
++
++The first factor is simply the task’s static priority, which is assigned from
++the task's nice level: within [-20, 19] from userland's point of view and
++[0, 39] internally.
++
++The second factor is the priority boost. This is a value bounded between
++[-MAX_PRIORITY_ADJ, MAX_PRIORITY_ADJ] used to offset the base priority. It is
++modified in the following cases:
++
++* When a thread has used up its entire timeslice, always deboost it by
++increasing its boost value by one.
++* When a thread gives up CPU control (voluntarily or involuntarily) to
++reschedule, and its switch-in time (time after last switch and run) is below
++the threshold based on its priority boost, boost it by decreasing its boost
++value by one, but it is capped at 0 (won’t go negative).
++
++The intent in this system is to ensure that interactive threads are serviced
++quickly. These are usually the threads that interact directly with the user
++and cause user-perceivable latency. These threads usually do little work and
++spend most of their time blocked awaiting another user event. So they get the
++priority boost from unblocking while background threads that do most of the
++processing receive the priority penalty for using their entire timeslice.
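Reader's note, not part of the patch: the boost rules above can be sketched in user-space C as follows. This is only an illustration under stated assumptions. `MAX_PRIORITY_ADJ` is 7 as defined for BMQ later in this patch. The struct, the function names, the `ran_briefly` flag (standing in for the unspecified switch-in-time threshold) and the `lower_bound` parameter are hypothetical. The text mentions both a [-MAX_PRIORITY_ADJ, MAX_PRIORITY_ADJ] range and a cap at 0, so the lower bound is left to the caller. Adding the boost to the static priority is likewise an assumption drawn from the phrase "used to offset the base priority".

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PRIORITY_ADJ 7	/* value the patch defines for CONFIG_SCHED_BMQ */

struct bmq_task {
	int static_prio;	/* derived from the nice level */
	int boost_prio;		/* lower value = better effective priority */
};

/* Task used its whole timeslice: deboost by increasing the value by one. */
static void deboost(struct bmq_task *t)
{
	if (t->boost_prio < MAX_PRIORITY_ADJ)
		t->boost_prio++;
}

/*
 * Task gave up the CPU after a short run: boost by decreasing the value by
 * one, never going below lower_bound (hypothetical parameter).
 */
static void boost(struct bmq_task *t, bool ran_briefly, int lower_bound)
{
	if (ran_briefly && t->boost_prio > lower_bound)
		t->boost_prio--;
}

/* Assumed combination: the boost offsets the static priority. */
static int effective_prio(const struct bmq_task *t)
{
	return t->static_prio + t->boost_prio;
}
```

A CPU-bound task that keeps exhausting its timeslice drifts up to `static_prio + 7`, while a task that repeatedly blocks early drifts back toward the lower bound.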

+diff --git a/fs/proc/base.c b/fs/proc/base.c
+index 8dfa36a99c74..46397c606e01 100644
+--- a/fs/proc/base.c
++++ b/fs/proc/base.c
+@@ -479,7 +479,7 @@ static int proc_pid_schedstat(struct seq_file *m, struct pid_namespace *ns,
+ 		seq_puts(m, "0 0 0\n");
+ 	else
+ 		seq_printf(m, "%llu %llu %lu\n",
+-		   (unsigned long long)task->se.sum_exec_runtime,
++		   (unsigned long long)tsk_seruntime(task),
+ 		   (unsigned long long)task->sched_info.run_delay,
+ 		   task->sched_info.pcount);
+
+diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
+index 8874f681b056..59eb72bf7d5f 100644
+--- a/include/asm-generic/resource.h
++++ b/include/asm-generic/resource.h
+@@ -23,7 +23,7 @@
+ 	[RLIMIT_LOCKS]		= {  RLIM_INFINITY,  RLIM_INFINITY },	\
+ 	[RLIMIT_SIGPENDING]	= { 		0,	       0 },	\
+ 	[RLIMIT_MSGQUEUE]	= {   MQ_BYTES_MAX,   MQ_BYTES_MAX },	\
+-	[RLIMIT_NICE]		= { 0, 0 },				\
++	[RLIMIT_NICE]		= { 30, 30 },				\
+ 	[RLIMIT_RTPRIO]		= { 0, 0 },				\
+ 	[RLIMIT_RTTIME]		= {  RLIM_INFINITY,  RLIM_INFINITY },	\
+ }
+diff --git a/include/linux/sched.h b/include/linux/sched.h
+index c46f3a63b758..7c65e6317d97 100644
+--- a/include/linux/sched.h
++++ b/include/linux/sched.h
+@@ -751,8 +751,14 @@ struct task_struct {
+ 	unsigned int			ptrace;
+
+ #ifdef CONFIG_SMP
+-	int				on_cpu;
+ 	struct __call_single_node	wake_entry;
++#endif
++#if defined(CONFIG_SMP) || defined(CONFIG_SCHED_ALT)
++	int				on_cpu;
++#endif
++
++#ifdef CONFIG_SMP
++#ifndef CONFIG_SCHED_ALT
+ 	unsigned int			wakee_flips;
+ 	unsigned long			wakee_flip_decay_ts;
+ 	struct task_struct		*last_wakee;
+@@ -766,6 +772,7 @@ struct task_struct {
+ 	 */
+ 	int				recent_used_cpu;
+ 	int				wake_cpu;
++#endif /* !CONFIG_SCHED_ALT */
+ #endif
+ 	int				on_rq;
+
+@@ -774,6 +781,20 @@ struct task_struct {
+ 	int				normal_prio;
+ 	unsigned int			rt_priority;
+
++#ifdef CONFIG_SCHED_ALT
++	u64				last_ran;
++	s64				time_slice;
++	int				sq_idx;
++	struct list_head		sq_node;
++#ifdef CONFIG_SCHED_BMQ
++	int				boost_prio;
++#endif /* CONFIG_SCHED_BMQ */
++#ifdef CONFIG_SCHED_PDS
++	u64				deadline;
++#endif /* CONFIG_SCHED_PDS */
++	/* sched_clock time spent running */
++	u64				sched_time;
++#else /* !CONFIG_SCHED_ALT */
+ 	struct sched_entity		se;
+ 	struct sched_rt_entity		rt;
+ 	struct sched_dl_entity		dl;
+@@ -784,6 +805,7 @@ struct task_struct {
+ 	unsigned long			core_cookie;
+ 	unsigned int			core_occupation;
+ #endif
++#endif /* !CONFIG_SCHED_ALT */
+
+ #ifdef CONFIG_CGROUP_SCHED
+ 	struct task_group		*sched_task_group;
+@@ -1517,6 +1539,15 @@ struct task_struct {
+ 	 */
+ };
+
++#ifdef CONFIG_SCHED_ALT
++#define tsk_seruntime(t)		((t)->sched_time)
++/* replace the uncertain rt_timeout with 0UL */
++#define tsk_rttimeout(t)		(0UL)
++#else /* CFS */
++#define tsk_seruntime(t)	((t)->se.sum_exec_runtime)
++#define tsk_rttimeout(t)	((t)->rt.timeout)
++#endif /* !CONFIG_SCHED_ALT */
++
+ static inline struct pid *task_pid(struct task_struct *task)
+ {
+ 	return task->thread_pid;
+diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
+index 7c83d4d5a971..fa30f98cb2be 100644
+--- a/include/linux/sched/deadline.h
++++ b/include/linux/sched/deadline.h
+@@ -1,5 +1,24 @@
+ /* SPDX-License-Identifier: GPL-2.0 */
+
++#ifdef CONFIG_SCHED_ALT
++
++static inline int dl_task(struct task_struct *p)
++{
++	return 0;
++}
++
++#ifdef CONFIG_SCHED_BMQ
++#define __tsk_deadline(p)	(0UL)
++#endif
++
++#ifdef CONFIG_SCHED_PDS
++#define __tsk_deadline(p)	((((u64) ((p)->prio))<<56) | (p)->deadline)
++#endif
++
++#else
++
++#define __tsk_deadline(p)	((p)->dl.deadline)
++
+ /*
+  * SCHED_DEADLINE tasks has negative priorities, reflecting
+  * the fact that any of them has higher prio than RT and
+@@ -21,6 +40,7 @@ static inline int dl_task(struct task_struct *p)
+ {
+ 	return dl_prio(p->prio);
+ }
++#endif /* CONFIG_SCHED_ALT */
+
+ static inline bool dl_time_before(u64 a, u64 b)
+ {
+diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h
+index ab83d85e1183..6af9ae681116 100644
+--- a/include/linux/sched/prio.h
++++ b/include/linux/sched/prio.h
+@@ -18,6 +18,32 @@
+ #define MAX_PRIO		(MAX_RT_PRIO + NICE_WIDTH)
+ #define DEFAULT_PRIO		(MAX_RT_PRIO + NICE_WIDTH / 2)
+
++#ifdef CONFIG_SCHED_ALT
++
++/* Undefine MAX_PRIO and DEFAULT_PRIO */
++#undef MAX_PRIO
++#undef DEFAULT_PRIO
++
++/* +/- priority levels from the base priority */
++#ifdef CONFIG_SCHED_BMQ
++#define MAX_PRIORITY_ADJ	(7)
++
++#define MIN_NORMAL_PRIO		(MAX_RT_PRIO)
++#define MAX_PRIO		(MIN_NORMAL_PRIO + NICE_WIDTH)
++#define DEFAULT_PRIO		(MIN_NORMAL_PRIO + NICE_WIDTH / 2)
++#endif
++
++#ifdef CONFIG_SCHED_PDS
++#define MAX_PRIORITY_ADJ	(0)
++
++#define MIN_NORMAL_PRIO		(128)
++#define NORMAL_PRIO_NUM		(64)
++#define MAX_PRIO		(MIN_NORMAL_PRIO + NORMAL_PRIO_NUM)
++#define DEFAULT_PRIO		(MAX_PRIO - NICE_WIDTH / 2)
++#endif
++
++#endif /* CONFIG_SCHED_ALT */
++
+ /*
+  * Convert user-nice values [ -20 ... 0 ... 19 ]
+  * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
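Reader's note, not part of the patch: the prio.h hunk above redefines the priority range differently for BMQ and PDS. The sketch below works the arithmetic through in user-space C. It assumes the mainline values `MAX_RT_PRIO = 100` and `NICE_WIDTH = 40`, which the hunk does not appear to change. The `BMQ_` and `PDS_` prefixed names and `bmq_init_task_prio()` are hypothetical, introduced only so both variants can sit side by side.

```c
#include <assert.h>

/* Mainline values from include/linux/sched/prio.h (assumed unchanged). */
#define MAX_RT_PRIO	100
#define NICE_WIDTH	40

/* CONFIG_SCHED_BMQ variant of the macros in the hunk above. */
enum {
	BMQ_MAX_PRIORITY_ADJ	= 7,
	BMQ_MIN_NORMAL_PRIO	= MAX_RT_PRIO,
	BMQ_MAX_PRIO		= BMQ_MIN_NORMAL_PRIO + NICE_WIDTH,
	BMQ_DEFAULT_PRIO	= BMQ_MIN_NORMAL_PRIO + NICE_WIDTH / 2,
};

/* CONFIG_SCHED_PDS variant of the macros in the hunk above. */
enum {
	PDS_MAX_PRIORITY_ADJ	= 0,
	PDS_MIN_NORMAL_PRIO	= 128,
	PDS_NORMAL_PRIO_NUM	= 64,
	PDS_MAX_PRIO		= PDS_MIN_NORMAL_PRIO + PDS_NORMAL_PRIO_NUM,
	PDS_DEFAULT_PRIO	= PDS_MAX_PRIO - NICE_WIDTH / 2,
};

/* init_task.c in this patch sets .prio to DEFAULT_PRIO + MAX_PRIORITY_ADJ. */
static int bmq_init_task_prio(void)
{
	return BMQ_DEFAULT_PRIO + BMQ_MAX_PRIORITY_ADJ;
}
```

Under these assumptions BMQ keeps the mainline layout (`MAX_PRIO` 140, `DEFAULT_PRIO` 120), while PDS moves normal tasks into the range starting at 128 (`MAX_PRIO` 192, `DEFAULT_PRIO` 172).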

+diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
+index e5af028c08b4..0a7565d0d3cf 100644
+--- a/include/linux/sched/rt.h
++++ b/include/linux/sched/rt.h
+@@ -24,8 +24,10 @@ static inline bool task_is_realtime(struct task_struct *tsk)
+
+ 	if (policy == SCHED_FIFO || policy == SCHED_RR)
+ 		return true;
++#ifndef CONFIG_SCHED_ALT
+ 	if (policy == SCHED_DEADLINE)
+ 		return true;
++#endif
+ 	return false;
+ }
+
+diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
+index 56cffe42abbc..e020fc572b22 100644
+--- a/include/linux/sched/topology.h
++++ b/include/linux/sched/topology.h
+@@ -233,7 +233,8 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
+
+ #endif	/* !CONFIG_SMP */
+
+-#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
++#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) && \
++	!defined(CONFIG_SCHED_ALT)
+ extern void rebuild_sched_domains_energy(void);
+ #else
+ static inline void rebuild_sched_domains_energy(void)
+diff --git a/init/Kconfig b/init/Kconfig
+index c7900e8975f1..d2b593e3807d 100644
+--- a/init/Kconfig
++++ b/init/Kconfig
+@@ -812,6 +812,7 @@ menu "Scheduler features"
+ config UCLAMP_TASK
+ 	bool "Enable utilization clamping for RT/FAIR tasks"
+ 	depends on CPU_FREQ_GOV_SCHEDUTIL
++	depends on !SCHED_ALT
+ 	help
+ 	  This feature enables the scheduler to track the clamped utilization
+ 	  of each CPU based on RUNNABLE tasks scheduled on that CPU.
+@@ -858,6 +859,35 @@ config UCLAMP_BUCKETS_COUNT
+
+ 	  If in doubt, use the default value.
+
++menuconfig SCHED_ALT
++	bool "Alternative CPU Schedulers"
++	default y
++	help
++	  This feature enables an alternative CPU scheduler.
++
++if SCHED_ALT
++
++choice
++	prompt "Alternative CPU Scheduler"
++	default SCHED_BMQ
++
++config SCHED_BMQ
++	bool "BMQ CPU scheduler"
++	help
++	  The BitMap Queue CPU scheduler for excellent interactivity and
++	  responsiveness on the desktop and solid scalability on normal
++	  hardware and commodity servers.
++
++config SCHED_PDS
++	bool "PDS CPU scheduler"
++	help
++	  The Priority and Deadline based Skip list multiple queue CPU
++	  Scheduler.
++
++endchoice
++
++endif
++
+ endmenu
+
+ #
+@@ -911,6 +941,7 @@ config NUMA_BALANCING
+ 	depends on ARCH_SUPPORTS_NUMA_BALANCING
+ 	depends on !ARCH_WANT_NUMA_VARIABLE_LOCALITY
+ 	depends on SMP && NUMA && MIGRATION && !PREEMPT_RT
++	depends on !SCHED_ALT
+ 	help
+ 	  This option adds support for automatic NUMA aware memory/task placement.
+ 	  The mechanism is quite primitive and is based on migrating memory when
+@@ -1003,6 +1034,7 @@ config FAIR_GROUP_SCHED
+ 	depends on CGROUP_SCHED
+ 	default CGROUP_SCHED
+
++if !SCHED_ALT
+ config CFS_BANDWIDTH
+ 	bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
+ 	depends on FAIR_GROUP_SCHED
+@@ -1025,6 +1057,7 @@ config RT_GROUP_SCHED
+ 	  realtime bandwidth for them.
+ 	  See Documentation/scheduler/sched-rt-group.rst for more information.
+
++endif #!SCHED_ALT
+ endif #CGROUP_SCHED
+
+ config UCLAMP_TASK_GROUP
+@@ -1268,6 +1301,7 @@ config CHECKPOINT_RESTORE
+
+ config SCHED_AUTOGROUP
+ 	bool "Automatic process group scheduling"
++	depends on !SCHED_ALT
+ 	select CGROUPS
+ 	select CGROUP_SCHED
+ 	select FAIR_GROUP_SCHED
+diff --git a/init/init_task.c b/init/init_task.c
+index 73cc8f03511a..2d0bad762895 100644
+--- a/init/init_task.c
++++ b/init/init_task.c
+@@ -75,9 +75,15 @@ struct task_struct init_task
+ 	.stack		= init_stack,
+ 	.usage		= REFCOUNT_INIT(2),
+ 	.flags		= PF_KTHREAD,
++#ifdef CONFIG_SCHED_ALT
++	.prio		= DEFAULT_PRIO + MAX_PRIORITY_ADJ,
++	.static_prio	= DEFAULT_PRIO,
++	.normal_prio	= DEFAULT_PRIO + MAX_PRIORITY_ADJ,
++#else
+ 	.prio		= MAX_PRIO - 20,
+ 	.static_prio	= MAX_PRIO - 20,
+ 	.normal_prio	= MAX_PRIO - 20,
++#endif
+ 	.policy		= SCHED_NORMAL,
+ 	.cpus_ptr	= &init_task.cpus_mask,
+ 	.user_cpus_ptr	= NULL,
+@@ -88,6 +94,17 @@ struct task_struct init_task
+ 	.restart_block	= {
+ 		.fn = do_no_restart_syscall,
+ 	},
++#ifdef CONFIG_SCHED_ALT
++	.sq_node	= LIST_HEAD_INIT(init_task.sq_node),
++#ifdef CONFIG_SCHED_BMQ
++	.boost_prio	= 0,
++	.sq_idx		= 15,
++#endif
++#ifdef CONFIG_SCHED_PDS
++	.deadline	= 0,
++#endif
++	.time_slice	= HZ,
++#else
+ 	.se		= {
+ 		.group_node 	= LIST_HEAD_INIT(init_task.se.group_node),
+ 	},
+@@ -95,6 +112,7 @@ struct task_struct init_task
+ 		.run_list	= LIST_HEAD_INIT(init_task.rt.run_list),
+ 		.time_slice	= RR_TIMESLICE,
+ 	},
++#endif
+ 	.tasks		= LIST_HEAD_INIT(init_task.tasks),
+ #ifdef CONFIG_SMP
+ 	.pushable_tasks	= PLIST_NODE_INIT(init_task.pushable_tasks, MAX_PRIO),
+diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
+index c2f1fd95a821..41654679b1b2 100644
+--- a/kernel/Kconfig.preempt
++++ b/kernel/Kconfig.preempt
+@@ -117,7 +117,7 @@ config PREEMPT_DYNAMIC
+
+ config SCHED_CORE
+ 	bool "Core Scheduling for SMT"
+-	depends on SCHED_SMT
++	depends on SCHED_SMT && !SCHED_ALT
+ 	help
+ 	  This option permits Core Scheduling, a means of coordinated task
+ 	  selection across SMT siblings. When enabled -- see
+diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
+index 71a418858a5e..7e3016873db1 100644
+--- a/kernel/cgroup/cpuset.c
++++ b/kernel/cgroup/cpuset.c
+@@ -704,7 +704,7 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
+ 	return ret;
+ }
+
+-#ifdef CONFIG_SMP
++#if defined(CONFIG_SMP) && !defined(CONFIG_SCHED_ALT)
+ /*
+  * Helper routine for generate_sched_domains().
+  * Do cpusets a, b have overlapping effective cpus_allowed masks?
+@@ -1100,7 +1100,7 @@ static void rebuild_sched_domains_locked(void)
+ 	/* Have scheduler rebuild the domains */
+ 	partition_and_rebuild_sched_domains(ndoms, doms, attr);
+ }
+-#else /* !CONFIG_SMP */
++#else /* !CONFIG_SMP || CONFIG_SCHED_ALT */
+ static void rebuild_sched_domains_locked(void)
+ {
+ }
+diff --git a/kernel/delayacct.c b/kernel/delayacct.c
+index 164ed9ef77a3..c974a84b056f 100644
+--- a/kernel/delayacct.c
++++ b/kernel/delayacct.c
+@@ -150,7 +150,7 @@ int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
+ 	 */
+ 	t1 = tsk->sched_info.pcount;
+ 	t2 = tsk->sched_info.run_delay;
+-	t3 = tsk->se.sum_exec_runtime;
++	t3 = tsk_seruntime(tsk);
+
+ 	d->cpu_count += t1;
+
+diff --git a/kernel/exit.c b/kernel/exit.c
+index 64c938ce36fe..a353f7ef5392 100644
+--- a/kernel/exit.c
++++ b/kernel/exit.c
+@@ -124,7 +124,7 @@ static void __exit_signal(struct task_struct *tsk)
+ 			sig->curr_target = next_thread(tsk);
+ 	}
+
+-	add_device_randomness((const void*) &tsk->se.sum_exec_runtime,
++	add_device_randomness((const void*) &tsk_seruntime(tsk),
+ 			      sizeof(unsigned long long));
+
+ 	/*
+@@ -145,7 +145,7 @@ static void __exit_signal(struct task_struct *tsk)
+ 	sig->inblock += task_io_get_inblock(tsk);
+ 	sig->oublock += task_io_get_oublock(tsk);
+ 	task_io_accounting_add(&sig->ioac, &tsk->ioac);
+-	sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
++	sig->sum_sched_runtime += tsk_seruntime(tsk);
+ 	sig->nr_threads--;
+ 	__unhash_process(tsk, group_dead);
+ 	write_sequnlock(&sig->stats_lock);
+diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
+index 7779ee8abc2a..5b9893cdfb1b 100644
+--- a/kernel/locking/rtmutex.c
++++ b/kernel/locking/rtmutex.c
+@@ -300,21 +300,25 @@ static __always_inline void
+ waiter_update_prio(struct rt_mutex_waiter *waiter, struct task_struct *task)
+ {
+ 	waiter->prio = __waiter_prio(task);
+-	waiter->deadline = task->dl.deadline;
++	waiter->deadline = __tsk_deadline(task);
+ }
+
+ /*
+  * Only use with rt_mutex_waiter_{less,equal}()
+  */
+ #define task_to_waiter(p)	\
+-	&(struct rt_mutex_waiter){ .prio = __waiter_prio(p), .deadline = (p)->dl.deadline }
++	&(struct rt_mutex_waiter){ .prio = __waiter_prio(p), .deadline = __tsk_deadline(p) }
+
+ static __always_inline int rt_mutex_waiter_less(struct rt_mutex_waiter *left,
+ 						struct rt_mutex_waiter *right)
+ {
++#ifdef CONFIG_SCHED_PDS
++	return (left->deadline < right->deadline);
++#else
+ 	if (left->prio < right->prio)
+ 		return 1;
+
++#ifndef CONFIG_SCHED_BMQ
+ 	/*
+ 	 * If both waiters have dl_prio(), we check the deadlines of the
+ 	 * associated tasks.
+@@ -323,16 +327,22 @@ static __always_inline int rt_mutex_waiter_less(struct rt_mutex_waiter *left,
+ 	 */
+ 	if (dl_prio(left->prio))
+ 		return dl_time_before(left->deadline, right->deadline);
++#endif
+
+ 	return 0;
++#endif
+ }
+
+ static __always_inline int rt_mutex_waiter_equal(struct rt_mutex_waiter *left,
+ 						 struct rt_mutex_waiter *right)
+ {
++#ifdef CONFIG_SCHED_PDS
++	return (left->deadline == right->deadline);
++#else
+ 	if (left->prio != right->prio)
+ 		return 0;
+
++#ifndef CONFIG_SCHED_BMQ
+ 	/*
+ 	 * If both waiters have dl_prio(), we check the deadlines of the
+ 	 * associated tasks.
+@@ -341,8 +351,10 @@ static __always_inline int rt_mutex_waiter_equal(struct rt_mutex_waiter *left,
+ 	 */
+ 	if (dl_prio(left->prio))
+ 		return left->deadline == right->deadline;
++#endif
+
+ 	return 1;
++#endif
+ }
+
+ static inline bool rt_mutex_steal(struct rt_mutex_waiter *waiter,
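Reader's note, not part of the patch: under CONFIG_SCHED_PDS the rtmutex hunks above compare waiters by a single `deadline` field, which works because `__tsk_deadline()` in deadline.h packs the priority into the top 8 bits of that value. A hedged user-space sketch of the packing follows. `pds_key()` and `pds_waiter_less()` are hypothetical names, and the sketch assumes the task deadline fits in the low 56 bits so that it cannot spill into the priority bits.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the PDS __tsk_deadline() macro: priority in the top 8 bits. */
static uint64_t pds_key(int prio, uint64_t deadline)
{
	return ((uint64_t)prio << 56) | deadline;
}

/* One unsigned compare orders by priority first, then by deadline. */
static int pds_waiter_less(uint64_t left_key, uint64_t right_key)
{
	return left_key < right_key;
}
```

With this packing, a numerically lower priority always wins regardless of deadline, and the deadline only breaks ties between equal priorities.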

+diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
+index 976092b7bd45..31d587c16ec1 100644
+--- a/kernel/sched/Makefile
++++ b/kernel/sched/Makefile
+@@ -28,7 +28,12 @@ endif
+ # These compilation units have roughly the same size and complexity - so their
+ # build parallelizes well and finishes roughly at once:
+ #
++ifdef CONFIG_SCHED_ALT
++obj-y += alt_core.o
++obj-$(CONFIG_SCHED_DEBUG) += alt_debug.o
++else
+ obj-y += core.o
+ obj-y += fair.o
++endif
+ obj-y += build_policy.o
+ obj-y += build_utility.o
+diff --git a/kernel/sched/alt_core.c b/kernel/sched/alt_core.c
+new file mode 100644
+index 000000000000..d0ab41c4d9ad
+--- /dev/null
++++ b/kernel/sched/alt_core.c
+@@ -0,0 +1,7807 @@
++/*
++ *  kernel/sched/alt_core.c
++ *
++ *  Core alternative kernel scheduler code and related syscalls
++ *
++ *  Copyright (C) 1991-2002  Linus Torvalds
++ *
++ *  2009-08-13	Brainfuck deadline scheduling policy by Con Kolivas deletes
++ *		a whole lot of those previous things.
++ *  2017-09-06	Priority and Deadline based Skip list multiple queue kernel
++ *		scheduler by Alfred Chen.
++ *  2019-02-20	BMQ(BitMap Queue) kernel scheduler by Alfred Chen.
++ */
++#include <linux/sched/cputime.h>
++#include <linux/sched/debug.h>
++#include <linux/sched/isolation.h>
++#include <linux/sched/loadavg.h>
++#include <linux/sched/mm.h>
++#include <linux/sched/nohz.h>
++#include <linux/sched/stat.h>
++#include <linux/sched/wake_q.h>
++
++#include <linux/blkdev.h>
++#include <linux/context_tracking.h>
++#include <linux/cpuset.h>
++#include <linux/delayacct.h>
++#include <linux/init_task.h>
++#include <linux/kcov.h>
++#include <linux/kprobes.h>
++#include <linux/profile.h>
++#include <linux/nmi.h>
++#include <linux/scs.h>
++
++#include <uapi/linux/sched/types.h>
++
++#include <asm/switch_to.h>
++
++#define CREATE_TRACE_POINTS
++#include <trace/events/sched.h>
++#undef CREATE_TRACE_POINTS
++
++#include "sched.h"
++
++#include "pelt.h"
++
++#include "../../fs/io-wq.h"
++#include "../smpboot.h"
++
++/*
++ * Export tracepoints that act as a bare tracehook (ie: have no trace event
++ * associated with them) to allow external modules to probe them.
++ */
++EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_irq_tp);
++
++#ifdef CONFIG_SCHED_DEBUG
++#define sched_feat(x)	(1)
++/*
++ * Print a warning if need_resched is set for the given duration (if
++ * LATENCY_WARN is enabled).
++ *
++ * If sysctl_resched_latency_warn_once is set, only one warning will be shown
++ * per boot.
++ */
++__read_mostly int sysctl_resched_latency_warn_ms = 100;
++__read_mostly int sysctl_resched_latency_warn_once = 1;
++#else
++#define sched_feat(x)	(0)
++#endif /* CONFIG_SCHED_DEBUG */
++
++#define ALT_SCHED_VERSION "v5.19-r0"
++
++/* rt_prio(prio) defined in include/linux/sched/rt.h */
++#define rt_task(p)		rt_prio((p)->prio)
++#define rt_policy(policy)	((policy) == SCHED_FIFO || (policy) == SCHED_RR)
++#define task_has_rt_policy(p)	(rt_policy((p)->policy))
++
++#define STOP_PRIO		(MAX_RT_PRIO - 1)
++
++/* Default time slice is 4 ms; can be set via the kernel parameter "sched_timeslice" */
++u64 sched_timeslice_ns __read_mostly = (4 << 20);
++
++static inline void requeue_task(struct task_struct *p, struct rq *rq, int idx);
++
++#ifdef CONFIG_SCHED_BMQ
++#include "bmq.h"
++#endif
++#ifdef CONFIG_SCHED_PDS
++#include "pds.h"
++#endif
++
++static int __init sched_timeslice(char *str)
++{
++	int timeslice_ms;
++
++	get_option(&str, &timeslice_ms);
++	if (2 != timeslice_ms)
++		timeslice_ms = 4;
++	sched_timeslice_ns = timeslice_ms << 20;
++	sched_timeslice_imp(timeslice_ms);
++
++	return 0;
++}
++early_param("sched_timeslice", sched_timeslice);
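Reader's note, not part of the patch: `sched_timeslice()` above accepts only 2 as an alternative to the default 4, and converts milliseconds to nanoseconds with a shift rather than a multiply. The user-space sketch below isolates those two steps. The function names are hypothetical, and `sched_timeslice_imp()` is not modelled here.

```c
#include <assert.h>
#include <stdint.h>

/* Only 2 is accepted as an alternative; any other value falls back to 4. */
static int normalize_timeslice_ms(int requested_ms)
{
	return requested_ms == 2 ? 2 : 4;
}

/*
 * ms << 20 multiplies by 1048576, a power-of-two stand-in for the
 * 1000000 ns in a millisecond, so the slice is about 4.9% longer than
 * the nominal value.
 */
static uint64_t timeslice_ns(int timeslice_ms)
{
	return (uint64_t)timeslice_ms << 20;
}
```

For the default of 4 this gives 4194304 ns, roughly 4.19 ms, and for `sched_timeslice=2` it gives 2097152 ns.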

783

++

784

++/* Reschedule if less than this many μs left */

785

++#define RESCHED_NS		(100 << 10)

786

++

787

++/**

788

++ * sched_yield_type - Choose what sort of yield sched_yield will perform.

789

++ * 0: No yield.

790

++ * 1: Deboost and requeue task. (default)

791

++ * 2: Set rq skip task.

792

++ */

793

++int sched_yield_type __read_mostly = 1;

++
++#ifdef CONFIG_SMP
++static cpumask_t sched_rq_pending_mask ____cacheline_aligned_in_smp;
++
++DEFINE_PER_CPU(cpumask_t [NR_CPU_AFFINITY_LEVELS], sched_cpu_topo_masks);
++DEFINE_PER_CPU(cpumask_t *, sched_cpu_llc_mask);
++DEFINE_PER_CPU(cpumask_t *, sched_cpu_topo_end_mask);
++
++#ifdef CONFIG_SCHED_SMT
++DEFINE_STATIC_KEY_FALSE(sched_smt_present);
++EXPORT_SYMBOL_GPL(sched_smt_present);
++#endif
++
++/*
++ * Keep a unique ID per domain (we use the first CPU's number in the cpumask of
++ * the domain), this allows us to quickly tell if two cpus are in the same cache
++ * domain, see cpus_share_cache().
++ */
++DEFINE_PER_CPU(int, sd_llc_id);
++#endif /* CONFIG_SMP */
++
++static DEFINE_MUTEX(sched_hotcpu_mutex);
++
++DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
++
++#ifndef prepare_arch_switch
++# define prepare_arch_switch(next)	do { } while (0)
++#endif
++#ifndef finish_arch_post_lock_switch
++# define finish_arch_post_lock_switch()	do { } while (0)
++#endif
++
++#ifdef CONFIG_SCHED_SMT
++static cpumask_t sched_sg_idle_mask ____cacheline_aligned_in_smp;
++#endif
++static cpumask_t sched_rq_watermark[SCHED_QUEUE_BITS] ____cacheline_aligned_in_smp;

++
++/* sched_queue related functions */
++static inline void sched_queue_init(struct sched_queue *q)
++{
++	int i;
++
++	bitmap_zero(q->bitmap, SCHED_QUEUE_BITS);
++	for(i = 0; i < SCHED_BITS; i++)
++		INIT_LIST_HEAD(&q->heads[i]);
++}
++
++/*
++ * Init idle task and put into queue structure of rq
++ * IMPORTANT: may be called multiple times for a single cpu
++ */
++static inline void sched_queue_init_idle(struct sched_queue *q,
++					 struct task_struct *idle)
++{
++	idle->sq_idx = IDLE_TASK_SCHED_PRIO;
++	INIT_LIST_HEAD(&q->heads[idle->sq_idx]);
++	list_add(&idle->sq_node, &q->heads[idle->sq_idx]);
++}

++
++/* water mark related functions */
++static inline void update_sched_rq_watermark(struct rq *rq)
++{
++	unsigned long watermark = find_first_bit(rq->queue.bitmap, SCHED_QUEUE_BITS);
++	unsigned long last_wm = rq->watermark;
++	unsigned long i;
++	int cpu;
++
++	if (watermark == last_wm)
++		return;
++
++	rq->watermark = watermark;
++	cpu = cpu_of(rq);
++	if (watermark < last_wm) {
++		for (i = last_wm; i > watermark; i--)
++			cpumask_clear_cpu(cpu, sched_rq_watermark + SCHED_QUEUE_BITS - i);
++#ifdef CONFIG_SCHED_SMT
++		if (static_branch_likely(&sched_smt_present) &&
++		    IDLE_TASK_SCHED_PRIO == last_wm)
++			cpumask_andnot(&sched_sg_idle_mask,
++				       &sched_sg_idle_mask, cpu_smt_mask(cpu));
++#endif
++		return;
++	}
++	/* last_wm < watermark */
++	for (i = watermark; i > last_wm; i--)
++		cpumask_set_cpu(cpu, sched_rq_watermark + SCHED_QUEUE_BITS - i);
++#ifdef CONFIG_SCHED_SMT
++	if (static_branch_likely(&sched_smt_present) &&
++	    IDLE_TASK_SCHED_PRIO == watermark) {
++		cpumask_t tmp;
++
++		cpumask_and(&tmp, cpu_smt_mask(cpu), sched_rq_watermark);
++		if (cpumask_equal(&tmp, cpu_smt_mask(cpu)))
++			cpumask_or(&sched_sg_idle_mask,
++				   &sched_sg_idle_mask, cpu_smt_mask(cpu));
++	}
++#endif
++}

++
++/*
++ * This routine assumes that the idle task is always in the queue
++ */
++static inline struct task_struct *sched_rq_first_task(struct rq *rq)
++{
++	unsigned long idx = find_first_bit(rq->queue.bitmap, SCHED_QUEUE_BITS);
++	const struct list_head *head = &rq->queue.heads[sched_prio2idx(idx, rq)];
++
++	return list_first_entry(head, struct task_struct, sq_node);
++}
++
++static inline struct task_struct *
++sched_rq_next_task(struct task_struct *p, struct rq *rq)
++{
++	unsigned long idx = p->sq_idx;
++	struct list_head *head = &rq->queue.heads[idx];
++
++	if (list_is_last(&p->sq_node, head)) {
++		idx = find_next_bit(rq->queue.bitmap, SCHED_QUEUE_BITS,
++				    sched_idx2prio(idx, rq) + 1);
++		head = &rq->queue.heads[sched_prio2idx(idx, rq)];
++
++		return list_first_entry(head, struct task_struct, sq_node);
++	}
++
++	return list_next_entry(p, sq_node);
++}
++
++static inline struct task_struct *rq_runnable_task(struct rq *rq)
++{
++	struct task_struct *next = sched_rq_first_task(rq);
++
++	if (unlikely(next == rq->skip))
++		next = sched_rq_next_task(next, rq);
++
++	return next;
++}
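
Note on the selection helpers above: they pick the next task by finding the lowest set bit in the runqueue bitmap and taking the head of the matching list. The userspace sketch below illustrates only that idea. All `demo_*` names and the 64-level size are assumptions for illustration; the per-level lists are reduced to counters, and the sched_prio2idx()/sched_idx2prio() mapping and the rq->skip handling are left out.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_QUEUE_BITS 64	/* hypothetical size, not SCHED_QUEUE_BITS */

struct demo_queue {
	uint64_t bitmap;			/* bit i set: level i is non-empty */
	unsigned int count[DEMO_QUEUE_BITS];	/* tasks queued at each level */
};

/* Queue one task at level prio and mark the level as non-empty. */
static void demo_enqueue(struct demo_queue *q, unsigned int prio)
{
	assert(prio < DEMO_QUEUE_BITS);
	q->count[prio]++;
	q->bitmap |= UINT64_C(1) << prio;
}

/* Remove one task from level prio; clear the bit when the level drains. */
static void demo_dequeue(struct demo_queue *q, unsigned int prio)
{
	assert(prio < DEMO_QUEUE_BITS && q->count[prio] > 0);
	if (--q->count[prio] == 0)
		q->bitmap &= ~(UINT64_C(1) << prio);
}

/* Lowest set bit is the best level; returns DEMO_QUEUE_BITS when empty. */
static unsigned int demo_first_prio(const struct demo_queue *q)
{
	unsigned int i;

	for (i = 0; i < DEMO_QUEUE_BITS; i++)
		if (q->bitmap & (UINT64_C(1) << i))
			return i;
	return DEMO_QUEUE_BITS;
}
```

In the patch the idle task is always queued, so the real lookup never sees an empty bitmap; the sketch returns DEMO_QUEUE_BITS for that case only because it has no idle entry.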

++
++/*
++ * Serialization rules:
++ *
++ * Lock order:
++ *
++ *   p->pi_lock
++ *     rq->lock
++ *       hrtimer_cpu_base->lock (hrtimer_start() for bandwidth controls)
++ *
++ *  rq1->lock
++ *    rq2->lock  where: rq1 < rq2
++ *
++ * Regular state:
++ *
++ * Normal scheduling state is serialized by rq->lock. __schedule() takes the
++ * local CPU's rq->lock, it optionally removes the task from the runqueue and
++ * always looks at the local rq data structures to find the most eligible task
++ * to run next.
++ *
++ * Task enqueue is also under rq->lock, possibly taken from another CPU.
++ * Wakeups from another LLC domain might use an IPI to transfer the enqueue to
++ * the local CPU to avoid bouncing the runqueue state around [ see
++ * ttwu_queue_wakelist() ]
++ *
++ * Task wakeup, specifically wakeups that involve migration, are horribly
++ * complicated to avoid having to take two rq->locks.
++ *
++ * Special state:
++ *
++ * System-calls and anything external will use task_rq_lock() which acquires
++ * both p->pi_lock and rq->lock. As a consequence the state they change is
++ * stable while holding either lock:
++ *
++ *  - sched_setaffinity()/
++ *    set_cpus_allowed_ptr():	p->cpus_ptr, p->nr_cpus_allowed
++ *  - set_user_nice():		p->se.load, p->*prio
++ *  - __sched_setscheduler():	p->sched_class, p->policy, p->*prio,
++ *				p->se.load, p->rt_priority,
++ *				p->dl.dl_{runtime, deadline, period, flags, bw, density}
++ *  - sched_setnuma():		p->numa_preferred_nid
++ *  - sched_move_task()/
++ *    cpu_cgroup_fork():	p->sched_task_group
++ *  - uclamp_update_active()	p->uclamp*
++ *
++ * p->state <- TASK_*:
++ *
++ *   is changed locklessly using set_current_state(), __set_current_state() or
++ *   set_special_state(), see their respective comments, or by
++ *   try_to_wake_up(). This latter uses p->pi_lock to serialize against
++ *   concurrent self.
++ *
++ * p->on_rq <- { 0, 1 = TASK_ON_RQ_QUEUED, 2 = TASK_ON_RQ_MIGRATING }:
++ *
++ *   is set by activate_task() and cleared by deactivate_task(), under
++ *   rq->lock. Non-zero indicates the task is runnable, the special
++ *   ON_RQ_MIGRATING state is used for migration without holding both
++ *   rq->locks. It indicates task_cpu() is not stable, see task_rq_lock().
++ *
++ * p->on_cpu <- { 0, 1 }:
++ *
++ *   is set by prepare_task() and cleared by finish_task() such that it will be
++ *   set before p is scheduled-in and cleared after p is scheduled-out, both
++ *   under rq->lock. Non-zero indicates the task is running on its CPU.
++ *
++ *   [ The astute reader will observe that it is possible for two tasks on one
++ *     CPU to have ->on_cpu = 1 at the same time. ]
++ *
++ * task_cpu(p): is changed by set_task_cpu(), the rules are:
++ *
++ *  - Don't call set_task_cpu() on a blocked task:
++ *
++ *    We don't care what CPU we're not running on, this simplifies hotplug,
++ *    the CPU assignment of blocked tasks isn't required to be valid.
++ *
++ *  - for try_to_wake_up(), called under p->pi_lock:
++ *
++ *    This allows try_to_wake_up() to only take one rq->lock, see its comment.
++ *
++ *  - for migration called under rq->lock:
++ *    [ see task_on_rq_migrating() in task_rq_lock() ]
++ *
++ *    o move_queued_task()
++ *    o detach_task()
++ *
++ *  - for migration called under double_rq_lock():
++ *
++ *    o __migrate_swap_task()
++ *    o push_rt_task() / pull_rt_task()
++ *    o push_dl_task() / pull_dl_task()
++ *    o dl_task_offline_migration()
++ *
++ */

++
++/*
++ * Context: p->pi_lock
++ */
++static inline struct rq
++*__task_access_lock(struct task_struct *p, raw_spinlock_t **plock)
++{
++	struct rq *rq;
++	for (;;) {
++		rq = task_rq(p);
++		if (p->on_cpu || task_on_rq_queued(p)) {
++			raw_spin_lock(&rq->lock);
++			if (likely((p->on_cpu || task_on_rq_queued(p))
++				   && rq == task_rq(p))) {
++				*plock = &rq->lock;
++				return rq;
++			}
++			raw_spin_unlock(&rq->lock);
++		} else if (task_on_rq_migrating(p)) {
++			do {
++				cpu_relax();
++			} while (unlikely(task_on_rq_migrating(p)));
++		} else {
++			*plock = NULL;
++			return rq;
++		}
++	}
++}
++
++static inline void
++__task_access_unlock(struct task_struct *p, raw_spinlock_t *lock)
++{
++	if (NULL != lock)
++		raw_spin_unlock(lock);
++}

++
++static inline struct rq
++*task_access_lock_irqsave(struct task_struct *p, raw_spinlock_t **plock,
++			  unsigned long *flags)
++{
++	struct rq *rq;
++	for (;;) {
++		rq = task_rq(p);
++		if (p->on_cpu || task_on_rq_queued(p)) {
++			raw_spin_lock_irqsave(&rq->lock, *flags);
++			if (likely((p->on_cpu || task_on_rq_queued(p))
++				   && rq == task_rq(p))) {
++				*plock = &rq->lock;
++				return rq;
++			}
++			raw_spin_unlock_irqrestore(&rq->lock, *flags);
++		} else if (task_on_rq_migrating(p)) {
++			do {
++				cpu_relax();
++			} while (unlikely(task_on_rq_migrating(p)));
++		} else {
++			raw_spin_lock_irqsave(&p->pi_lock, *flags);
++			if (likely(!p->on_cpu && !p->on_rq &&
++				   rq == task_rq(p))) {
++				*plock = &p->pi_lock;
++				return rq;
++			}
++			raw_spin_unlock_irqrestore(&p->pi_lock, *flags);
++		}
++	}
++}
++
++static inline void
++task_access_unlock_irqrestore(struct task_struct *p, raw_spinlock_t *lock,
++			      unsigned long *flags)
++{
++	raw_spin_unlock_irqrestore(lock, *flags);
++}

++
++/*
++ * __task_rq_lock - lock the rq @p resides on.
++ */
++struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
++	__acquires(rq->lock)
++{
++	struct rq *rq;
++
++	lockdep_assert_held(&p->pi_lock);
++
++	for (;;) {
++		rq = task_rq(p);
++		raw_spin_lock(&rq->lock);
++		if (likely(rq == task_rq(p) && !task_on_rq_migrating(p)))
++			return rq;
++		raw_spin_unlock(&rq->lock);
++
++		while (unlikely(task_on_rq_migrating(p)))
++			cpu_relax();
++	}
++}

++
++/*
++ * task_rq_lock - lock p->pi_lock and lock the rq @p resides on.
++ */
++struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
++	__acquires(p->pi_lock)
++	__acquires(rq->lock)
++{
++	struct rq *rq;
++
++	for (;;) {
++		raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
++		rq = task_rq(p);
++		raw_spin_lock(&rq->lock);
++		/*
++		 *	move_queued_task()		task_rq_lock()
++		 *
++		 *	ACQUIRE (rq->lock)
++		 *	[S] ->on_rq = MIGRATING		[L] rq = task_rq()
++		 *	WMB (__set_task_cpu())		ACQUIRE (rq->lock);
++		 *	[S] ->cpu = new_cpu		[L] task_rq()
++		 *					[L] ->on_rq
++		 *	RELEASE (rq->lock)
++		 *
++		 * If we observe the old CPU in task_rq_lock(), the acquire of
++		 * the old rq->lock will fully serialize against the stores.
++		 *
++		 * If we observe the new CPU in task_rq_lock(), the address
++		 * dependency headed by '[L] rq = task_rq()' and the acquire
++		 * will pair with the WMB to ensure we then also see migrating.
++		 */
++		if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
++			return rq;
++		}
++		raw_spin_unlock(&rq->lock);
++		raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
++
++		while (unlikely(task_on_rq_migrating(p)))
++			cpu_relax();
++	}
++}

++
++static inline void
++rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
++	__acquires(rq->lock)
++{
++	raw_spin_lock_irqsave(&rq->lock, rf->flags);
++}
++
++static inline void
++rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
++	__releases(rq->lock)
++{
++	raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
++}
++
++void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
++{
++	raw_spinlock_t *lock;
++
++	/* Matches synchronize_rcu() in __sched_core_enable() */
++	preempt_disable();
++
++	for (;;) {
++		lock = __rq_lockp(rq);
++		raw_spin_lock_nested(lock, subclass);
++		if (likely(lock == __rq_lockp(rq))) {
++			/* preempt_count *MUST* be > 1 */
++			preempt_enable_no_resched();
++			return;
++		}
++		raw_spin_unlock(lock);
++	}
++}
++
++void raw_spin_rq_unlock(struct rq *rq)
++{
++	raw_spin_unlock(rq_lockp(rq));
++}

++
++/*
++ * RQ-clock updating methods:
++ */
++
++static void update_rq_clock_task(struct rq *rq, s64 delta)
++{
++/*
++ * In theory, the compiler should just see 0 here, and optimize out the call
++ * to sched_rt_avg_update. But I don't trust it...
++ */
++	s64 __maybe_unused steal = 0, irq_delta = 0;
++
++#ifdef CONFIG_IRQ_TIME_ACCOUNTING
++	irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
++
++	/*
++	 * Since irq_time is only updated on {soft,}irq_exit, we might run into
++	 * this case when a previous update_rq_clock() happened inside a
++	 * {soft,}irq region.
++	 *
++	 * When this happens, we stop ->clock_task and only update the
++	 * prev_irq_time stamp to account for the part that fit, so that a next
++	 * update will consume the rest. This ensures ->clock_task is
++	 * monotonic.
++	 *
++	 * It does however cause some slight misattribution of {soft,}irq
++	 * time, a more accurate solution would be to update the irq_time using
++	 * the current rq->clock timestamp, except that would require using
++	 * atomic ops.
++	 */
++	if (irq_delta > delta)
++		irq_delta = delta;
++
++	rq->prev_irq_time += irq_delta;
++	delta -= irq_delta;
++#endif
++#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
++	if (static_key_false((&paravirt_steal_rq_enabled))) {
++		steal = paravirt_steal_clock(cpu_of(rq));
++		steal -= rq->prev_steal_time_rq;
++
++		if (unlikely(steal > delta))
++			steal = delta;
++
++		rq->prev_steal_time_rq += steal;
++		delta -= steal;
++	}
++#endif
++
++	rq->clock_task += delta;
++
++#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
++	if ((irq_delta + steal))
++		update_irq_load_avg(rq, irq_delta + steal);
++#endif
++}
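
Note on the irq-time clamp in update_rq_clock_task(): when the accumulated irq time exceeds the elapsed delta, only the part that fits is consumed and the remainder is carried to the next update, so the task clock never runs backwards. The sketch below reproduces just that bookkeeping in userspace. `struct demo_clock` and `demo_update` are hypothetical names, and the steal-time branch is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two rq fields involved. */
struct demo_clock {
	int64_t clock_task;	/* time credited to tasks */
	int64_t prev_irq_time;	/* irq time already accounted for */
};

/*
 * delta: wall time elapsed since the last update.
 * irq_time_now: cumulative irq time reported for this CPU.
 */
static void demo_update(struct demo_clock *c, int64_t delta, int64_t irq_time_now)
{
	int64_t irq_delta = irq_time_now - c->prev_irq_time;

	assert(delta >= 0);
	/* Consume at most delta; the rest is left for a later update. */
	if (irq_delta > delta)
		irq_delta = delta;

	c->prev_irq_time += irq_delta;
	c->clock_task += delta - irq_delta;
}
```

For example, with 90 units of pending irq time and a delta of 50, the task clock stays put and the remaining 40 units are subtracted on the following update.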

++
++static inline void update_rq_clock(struct rq *rq)
++{
++	s64 delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
++
++	if (unlikely(delta <= 0))
++		return;
++	rq->clock += delta;
++	update_rq_time_edge(rq);
++	update_rq_clock_task(rq, delta);
++}

++
++/*
++ * RQ Load update routine
++ */
++#define RQ_LOAD_HISTORY_BITS		(sizeof(s32) * 8ULL)
++#define RQ_UTIL_SHIFT			(8)
++#define RQ_LOAD_HISTORY_TO_UTIL(l)	(((l) >> (RQ_LOAD_HISTORY_BITS - 1 - RQ_UTIL_SHIFT)) & 0xff)
++
++#define LOAD_BLOCK(t)		((t) >> 17)
++#define LOAD_HALF_BLOCK(t)	((t) >> 16)
++#define BLOCK_MASK(t)		((t) & ((0x01 << 18) - 1))
++#define LOAD_BLOCK_BIT(b)	(1UL << (RQ_LOAD_HISTORY_BITS - 1 - (b)))
++#define CURRENT_LOAD_BIT	LOAD_BLOCK_BIT(0)
++
++static inline void rq_load_update(struct rq *rq)
++{
++	u64 time = rq->clock;
++	u64 delta = min(LOAD_BLOCK(time) - LOAD_BLOCK(rq->load_stamp),
++			RQ_LOAD_HISTORY_BITS - 1);
++	u64 prev = !!(rq->load_history & CURRENT_LOAD_BIT);
++	u64 curr = !!rq->nr_running;
++
++	if (delta) {
++		rq->load_history = rq->load_history >> delta;
++
++		if (delta < RQ_UTIL_SHIFT) {
++			rq->load_block += (~BLOCK_MASK(rq->load_stamp)) * prev;
++			if (!!LOAD_HALF_BLOCK(rq->load_block) ^ curr)
++				rq->load_history ^= LOAD_BLOCK_BIT(delta);
++		}
++
++		rq->load_block = BLOCK_MASK(time) * prev;
++	} else {
++		rq->load_block += (time - rq->load_stamp) * prev;
++	}
++	if (prev ^ curr)
++		rq->load_history ^= CURRENT_LOAD_BIT;
++	rq->load_stamp = time;
++}
++
++unsigned long rq_load_util(struct rq *rq, unsigned long max)
++{
++	return RQ_LOAD_HISTORY_TO_UTIL(rq->load_history) * (max >> RQ_UTIL_SHIFT);
++}
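
Note on rq_load_util(): with a 32-bit history and RQ_UTIL_SHIFT of 8, RQ_LOAD_HISTORY_TO_UTIL() shifts right by 23 and masks with 0xff, so it reads bits 23..30 and skips bit 31, which is CURRENT_LOAD_BIT. The result is then scaled by max >> 8. The sketch below restates that arithmetic in userspace; the `demo_*` and `DEMO_*` names are hypothetical and the fixed 32-bit width is taken from the macros above.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_HISTORY_BITS 32	/* mirrors sizeof(s32) * 8 */
#define DEMO_UTIL_SHIFT   8	/* mirrors RQ_UTIL_SHIFT */

/* Hypothetical userspace restatement of the history-to-utilization step. */
static unsigned long demo_load_util(uint32_t history, unsigned long max)
{
	/* Bits 23..30: the eight most recent completed blocks. */
	unsigned long recent =
		(history >> (DEMO_HISTORY_BITS - 1 - DEMO_UTIL_SHIFT)) & 0xff;

	assert(recent <= 0xff);
	return recent * (max >> DEMO_UTIL_SHIFT);
}
```

With max = 1024, a history of 0x7f800000 (all eight recent blocks busy) gives 255 * 4 = 1020, while a history with only bit 31 set gives 0 because the in-progress block is not counted.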

++
++#ifdef CONFIG_SMP
++unsigned long sched_cpu_util(int cpu, unsigned long max)
++{
++	return rq_load_util(cpu_rq(cpu), max);
++}
++#endif /* CONFIG_SMP */

++
++#ifdef CONFIG_CPU_FREQ
++/**
++ * cpufreq_update_util - Take a note about CPU utilization changes.
++ * @rq: Runqueue to carry out the update for.
++ * @flags: Update reason flags.
++ *
++ * This function is called by the scheduler on the CPU whose utilization is
++ * being updated.
++ *
++ * It can only be called from RCU-sched read-side critical sections.
++ *
++ * The way cpufreq is currently arranged requires it to evaluate the CPU
++ * performance state (frequency/voltage) on a regular basis to prevent it from
++ * being stuck in a completely inadequate performance level for too long.
++ * That is not guaranteed to happen if the updates are only triggered from CFS
++ * and DL, though, because they may not be coming in if only RT tasks are
++ * active all the time (or there are RT tasks only).
++ *
++ * As a workaround for that issue, this function is called periodically by the
++ * RT sched class to trigger extra cpufreq updates to prevent it from stalling,
++ * but that really is a band-aid.  Going forward it should be replaced with
++ * solutions targeted more specifically at RT tasks.
++ */
++static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
++{
++	struct update_util_data *data;
++
++#ifdef CONFIG_SMP
++	rq_load_update(rq);
++#endif
++	data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
++						  cpu_of(rq)));
++	if (data)
++		data->func(data, rq_clock(rq), flags);
++}
++#else
++static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
++{
++#ifdef CONFIG_SMP
++	rq_load_update(rq);
++#endif
++}
++#endif /* CONFIG_CPU_FREQ */

++
++#ifdef CONFIG_NO_HZ_FULL
++/*
++ * Tick may be needed by tasks in the runqueue depending on their policy and
++ * requirements. If tick is needed, let's send the target an IPI to kick it out
++ * of nohz mode if necessary.
++ */
++static inline void sched_update_tick_dependency(struct rq *rq)
++{
++	int cpu = cpu_of(rq);
++
++	if (!tick_nohz_full_cpu(cpu))
++		return;
++
++	if (rq->nr_running < 2)
++		tick_nohz_dep_clear_cpu(cpu, TICK_DEP_BIT_SCHED);
++	else
++		tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
++}
++#else /* !CONFIG_NO_HZ_FULL */
++static inline void sched_update_tick_dependency(struct rq *rq) { }
++#endif

++
++bool sched_task_on_rq(struct task_struct *p)
++{
++	return task_on_rq_queued(p);
++}
++
++unsigned long get_wchan(struct task_struct *p)
++{
++	unsigned long ip = 0;
++	unsigned int state;
++
++	if (!p || p == current)
++		return 0;
++
++	/* Only get wchan if task is blocked and we can keep it that way. */
++	raw_spin_lock_irq(&p->pi_lock);
++	state = READ_ONCE(p->__state);
++	smp_rmb(); /* see try_to_wake_up() */
++	if (state != TASK_RUNNING && state != TASK_WAKING && !p->on_rq)
++		ip = __get_wchan(p);
++	raw_spin_unlock_irq(&p->pi_lock);
++
++	return ip;
++}

++
++/*
++ * Add/Remove/Requeue task to/from the runqueue routines
++ * Context: rq->lock
++ */
++#define __SCHED_DEQUEUE_TASK(p, rq, flags)					\
++	psi_dequeue(p, flags & DEQUEUE_SLEEP);					\
++	sched_info_dequeue(rq, p);						\
++										\
++	list_del(&p->sq_node);							\
++	if (list_empty(&rq->queue.heads[p->sq_idx])) 				\
++		clear_bit(sched_idx2prio(p->sq_idx, rq), rq->queue.bitmap);
++
++#define __SCHED_ENQUEUE_TASK(p, rq, flags)				\
++	sched_info_enqueue(rq, p);					\
++	psi_enqueue(p, flags);						\
++									\
++	p->sq_idx = task_sched_prio_idx(p, rq);				\
++	list_add_tail(&p->sq_node, &rq->queue.heads[p->sq_idx]);	\
++	set_bit(sched_idx2prio(p->sq_idx, rq), rq->queue.bitmap);

++
++static inline void dequeue_task(struct task_struct *p, struct rq *rq, int flags)
++{
++	lockdep_assert_held(&rq->lock);
++
++	/*printk(KERN_INFO "sched: dequeue(%d) %px %016llx\n", cpu_of(rq), p, p->priodl);*/
++	WARN_ONCE(task_rq(p) != rq, "sched: dequeue task reside on cpu%d from cpu%d\n",
++		  task_cpu(p), cpu_of(rq));
++
++	__SCHED_DEQUEUE_TASK(p, rq, flags);
++	--rq->nr_running;
++#ifdef CONFIG_SMP
++	if (1 == rq->nr_running)
++		cpumask_clear_cpu(cpu_of(rq), &sched_rq_pending_mask);
++#endif
++
++	sched_update_tick_dependency(rq);
++}
++
++static inline void enqueue_task(struct task_struct *p, struct rq *rq, int flags)
++{
++	lockdep_assert_held(&rq->lock);
++
++	/*printk(KERN_INFO "sched: enqueue(%d) %px %016llx\n", cpu_of(rq), p, p->priodl);*/
++	WARN_ONCE(task_rq(p) != rq, "sched: enqueue task reside on cpu%d to cpu%d\n",
++		  task_cpu(p), cpu_of(rq));
++
++	__SCHED_ENQUEUE_TASK(p, rq, flags);
++	update_sched_rq_watermark(rq);
++	++rq->nr_running;
++#ifdef CONFIG_SMP
++	if (2 == rq->nr_running)
++		cpumask_set_cpu(cpu_of(rq), &sched_rq_pending_mask);
++#endif
++
++	sched_update_tick_dependency(rq);
++}

++
++static inline void requeue_task(struct task_struct *p, struct rq *rq, int idx)
++{
++	lockdep_assert_held(&rq->lock);
++	/*printk(KERN_INFO "sched: requeue(%d) %px %016llx\n", cpu_of(rq), p, p->priodl);*/
++	WARN_ONCE(task_rq(p) != rq, "sched: cpu[%d] requeue task reside on cpu%d\n",
++		  cpu_of(rq), task_cpu(p));
++
++	list_del(&p->sq_node);
++	list_add_tail(&p->sq_node, &rq->queue.heads[idx]);
++	if (idx != p->sq_idx) {
++		if (list_empty(&rq->queue.heads[p->sq_idx]))
++			clear_bit(sched_idx2prio(p->sq_idx, rq),
++				  rq->queue.bitmap);
++		p->sq_idx = idx;
++		set_bit(sched_idx2prio(p->sq_idx, rq), rq->queue.bitmap);
++		update_sched_rq_watermark(rq);
++	}
++}

++
++/*
++ * cmpxchg based fetch_or, macro so it works for different integer types
++ */
++#define fetch_or(ptr, mask)						\
++	({								\
++		typeof(ptr) _ptr = (ptr);				\
++		typeof(mask) _mask = (mask);				\
++		typeof(*_ptr) _old, _val = *_ptr;			\
++									\
++		for (;;) {						\
++			_old = cmpxchg(_ptr, _val, _val | _mask);	\
++			if (_old == _val)				\
++				break;					\
++			_val = _old;					\
++		}							\
++	_old;								\
++})
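
Note on fetch_or(): the macro retries a compare-and-exchange until the OR lands on an unchanged value, then returns the value seen before the update. The same loop can be sketched in userspace with C11 atomics. This is an analogy under stated assumptions and not kernel code: `demo_fetch_or` is a hypothetical name, it is fixed to unsigned long, and it uses atomic_compare_exchange_weak() from <stdatomic.h> in place of the kernel's cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical C11 analogue of a cmpxchg-based fetch_or. */
static unsigned long demo_fetch_or(_Atomic unsigned long *ptr, unsigned long mask)
{
	unsigned long val;

	assert(ptr);
	val = atomic_load(ptr);

	/*
	 * On failure the call stores the current contents into val,
	 * so each retry works from a fresh snapshot.
	 */
	while (!atomic_compare_exchange_weak(ptr, &val, val | mask))
		;

	return val;	/* value before the OR was applied */
}
```

Returning the old value is what lets a caller such as set_nr_and_not_polling() set one flag and test another from the same snapshot.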

++
++#if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
++/*
++ * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
++ * this avoids any races wrt polling state changes and thereby avoids
++ * spurious IPIs.
++ */
++static bool set_nr_and_not_polling(struct task_struct *p)
++{
++	struct thread_info *ti = task_thread_info(p);
++	return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
++}
++
++/*
++ * Atomically set TIF_NEED_RESCHED if TIF_POLLING_NRFLAG is set.
++ *
++ * If this returns true, then the idle task promises to call
++ * sched_ttwu_pending() and reschedule soon.
++ */
++static bool set_nr_if_polling(struct task_struct *p)
++{
++	struct thread_info *ti = task_thread_info(p);
++	typeof(ti->flags) old, val = READ_ONCE(ti->flags);
++
++	for (;;) {
++		if (!(val & _TIF_POLLING_NRFLAG))
++			return false;
++		if (val & _TIF_NEED_RESCHED)
++			return true;
++		old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
++		if (old == val)
++			break;
++		val = old;
++	}
++	return true;
++}
++
++#else
++static bool set_nr_and_not_polling(struct task_struct *p)
++{
++	set_tsk_need_resched(p);
++	return true;
++}
++
++#ifdef CONFIG_SMP
++static bool set_nr_if_polling(struct task_struct *p)
++{
++	return false;
++}
++#endif
++#endif

++
++static bool __wake_q_add(struct wake_q_head *head, struct task_struct *task)
++{
++	struct wake_q_node *node = &task->wake_q;
++
++	/*
++	 * Atomically grab the task, if ->wake_q is !nil already it means
++	 * it's already queued (either by us or someone else) and will get the
++	 * wakeup due to that.
++	 *
++	 * In order to ensure that a pending wakeup will observe our pending
++	 * state, even in the failed case, an explicit smp_mb() must be used.
++	 */
++	smp_mb__before_atomic();
++	if (unlikely(cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL)))
++		return false;
++
++	/*
++	 * The head is context local, there can be no concurrency.
++	 */
++	*head->lastp = node;
++	head->lastp = &node->next;
++	return true;
++}

++
++/**
++ * wake_q_add() - queue a wakeup for 'later' waking.
++ * @head: the wake_q_head to add @task to
++ * @task: the task to queue for 'later' wakeup
++ *
++ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
++ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
++ * instantly.
++ *
++ * This function must be used as-if it were wake_up_process(); IOW the task
++ * must be ready to be woken at this location.
++ */
++void wake_q_add(struct wake_q_head *head, struct task_struct *task)
++{
++	if (__wake_q_add(head, task))
++		get_task_struct(task);
++}
++
++/**
++ * wake_q_add_safe() - safely queue a wakeup for 'later' waking.
++ * @head: the wake_q_head to add @task to
++ * @task: the task to queue for 'later' wakeup
++ *
++ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
++ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
++ * instantly.
++ *
++ * This function must be used as-if it were wake_up_process(); IOW the task
++ * must be ready to be woken at this location.
++ *
++ * This function is essentially a task-safe equivalent to wake_q_add(). Callers
++ * that already hold a reference to @task can call the 'safe' version and trust
++ * wake_q to do the right thing depending on whether or not the @task is already
++ * queued for wakeup.
++ */
++void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task)
++{
++	if (!__wake_q_add(head, task))
++		put_task_struct(task);
++}

++
++void wake_up_q(struct wake_q_head *head)
++{
++	struct wake_q_node *node = head->first;
++
++	while (node != WAKE_Q_TAIL) {
++		struct task_struct *task;
++
++		task = container_of(node, struct task_struct, wake_q);
++		/* task can safely be re-inserted now: */
++		node = node->next;
++		task->wake_q.next = NULL;
++
++		/*
++		 * wake_up_process() executes a full barrier, which pairs with
++		 * the queueing in wake_q_add() so as not to miss wakeups.
++		 */
++		wake_up_process(task);
++		put_task_struct(task);
++	}
++}
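
Note on the wake queue above: a node whose next pointer is NULL is not queued, a tail sentinel terminates the list, and the head keeps a pointer to the last next field so appends take constant time. The single-threaded sketch below shows only that list discipline. Every `demo_*` name is hypothetical, the atomic claim done by cmpxchg_relaxed() is replaced by a plain check, and the task reference counting is left out.

```c
#include <assert.h>
#include <stddef.h>

struct demo_node {
	struct demo_node *next;	/* NULL means "not on any queue" */
	int woken;		/* how many times this node was woken */
};

static struct demo_node demo_tail;	/* sentinel that ends every list */
#define DEMO_TAIL (&demo_tail)

struct demo_head {
	struct demo_node *first;
	struct demo_node **lastp;	/* where the next append is stored */
};

static void demo_head_init(struct demo_head *h)
{
	h->first = DEMO_TAIL;
	h->lastp = &h->first;
}

/* Returns 1 if the node was queued, 0 if it was already on a queue. */
static int demo_add(struct demo_head *h, struct demo_node *n)
{
	assert(n != DEMO_TAIL);
	if (n->next != NULL)
		return 0;
	n->next = DEMO_TAIL;
	*h->lastp = n;
	h->lastp = &n->next;
	return 1;
}

/* Walks the list, detaching and "waking" each node; returns the count. */
static int demo_wake_all(struct demo_head *h)
{
	struct demo_node *n = h->first;
	int count = 0;

	while (n != DEMO_TAIL) {
		struct demo_node *cur = n;

		n = n->next;
		cur->next = NULL;	/* node may be queued again from here on */
		cur->woken++;
		count++;
	}
	return count;
}
```

Clearing next before the wakeup mirrors the ordering in wake_up_q(), where the task becomes eligible for re-insertion before wake_up_process() runs.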

++
++/*
++ * resched_curr - mark rq's current task 'to be rescheduled now'.
++ *
++ * On UP this means the setting of the need_resched flag, on SMP it
++ * might also involve a cross-CPU call to trigger the scheduler on
++ * the target CPU.
++ */
++void resched_curr(struct rq *rq)
++{
++	struct task_struct *curr = rq->curr;
++	int cpu;
++
++	lockdep_assert_held(&rq->lock);
++
++	if (test_tsk_need_resched(curr))
++		return;
++
++	cpu = cpu_of(rq);
++	if (cpu == smp_processor_id()) {
++		set_tsk_need_resched(curr);
++		set_preempt_need_resched();
++		return;
++	}
++
++	if (set_nr_and_not_polling(curr))
++		smp_send_reschedule(cpu);
++	else
++		trace_sched_wake_idle_without_ipi(cpu);
++}
++
++void resched_cpu(int cpu)
++{
++	struct rq *rq = cpu_rq(cpu);
++	unsigned long flags;
++
++	raw_spin_lock_irqsave(&rq->lock, flags);
++	if (cpu_online(cpu) || cpu == smp_processor_id())
++		resched_curr(cpu_rq(cpu));
++	raw_spin_unlock_irqrestore(&rq->lock, flags);
++}

++
++#ifdef CONFIG_SMP
++#ifdef CONFIG_NO_HZ_COMMON
++void nohz_balance_enter_idle(int cpu) {}
++
++void select_nohz_load_balancer(int stop_tick) {}
++
++void set_cpu_sd_state_idle(void) {}
++
++/*
++ * In the semi idle case, use the nearest busy CPU for migrating timers
++ * from an idle CPU.  This is good for power-savings.
++ *
++ * We don't do a similar optimization for a completely idle system, as
++ * selecting an idle CPU will add more delays to the timers than intended
++ * (as that CPU's timer base may not be up to date wrt jiffies etc).
++ */
++int get_nohz_timer_target(void)
++{
++	int i, cpu = smp_processor_id(), default_cpu = -1;
++	struct cpumask *mask;
++	const struct cpumask *hk_mask;
++
++	if (housekeeping_cpu(cpu, HK_TYPE_TIMER)) {
++		if (!idle_cpu(cpu))
++			return cpu;
++		default_cpu = cpu;
++	}
++
++	hk_mask = housekeeping_cpumask(HK_TYPE_TIMER);
++
++	for (mask = per_cpu(sched_cpu_topo_masks, cpu) + 1;
++	     mask < per_cpu(sched_cpu_topo_end_mask, cpu); mask++)
++		for_each_cpu_and(i, mask, hk_mask)
++			if (!idle_cpu(i))
++				return i;
++
++	if (default_cpu == -1)
++		default_cpu = housekeeping_any_cpu(HK_TYPE_TIMER);
++	cpu = default_cpu;
++
++	return cpu;
++}

1721

++

1722

++/*
++ * When add_timer_on() enqueues a timer into the timer wheel of an
++ * idle CPU then this timer might expire before the next timer event
++ * which is scheduled to wake up that CPU. In case of a completely
++ * idle system the next event might even be infinite time into the
++ * future. wake_up_idle_cpu() ensures that the CPU is woken up and
++ * leaves the inner idle loop so the newly added timer is taken into
++ * account when the CPU goes back to idle and evaluates the timer
++ * wheel for the next timer event.
++ */
++static inline void wake_up_idle_cpu(int cpu)
++{
++	struct rq *rq = cpu_rq(cpu);
++
++	if (cpu == smp_processor_id())
++		return;
++
++	if (set_nr_and_not_polling(rq->idle))
++		smp_send_reschedule(cpu);
++	else
++		trace_sched_wake_idle_without_ipi(cpu);
++}
++
++static inline bool wake_up_full_nohz_cpu(int cpu)
++{
++	/*
++	 * We just need the target to call irq_exit() and re-evaluate
++	 * the next tick. The nohz full kick at least implies that.
++	 * If needed we can still optimize that later with an
++	 * empty IRQ.
++	 */
++	if (cpu_is_offline(cpu))
++		return true;  /* Don't try to wake offline CPUs. */
++	if (tick_nohz_full_cpu(cpu)) {
++		if (cpu != smp_processor_id() ||
++		    tick_nohz_tick_stopped())
++			tick_nohz_full_kick_cpu(cpu);
++		return true;
++	}
++
++	return false;
++}
++
++void wake_up_nohz_cpu(int cpu)
++{
++	if (!wake_up_full_nohz_cpu(cpu))
++		wake_up_idle_cpu(cpu);
++}
++
++static void nohz_csd_func(void *info)
++{
++	struct rq *rq = info;
++	int cpu = cpu_of(rq);
++	unsigned int flags;
++
++	/*
++	 * Release the rq::nohz_csd.
++	 */
++	flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));
++	WARN_ON(!(flags & NOHZ_KICK_MASK));
++
++	rq->idle_balance = idle_cpu(cpu);
++	if (rq->idle_balance && !need_resched()) {
++		rq->nohz_idle_balance = flags;
++		raise_softirq_irqoff(SCHED_SOFTIRQ);
++	}
++}
++
++#endif /* CONFIG_NO_HZ_COMMON */
++#endif /* CONFIG_SMP */
++
++static inline void check_preempt_curr(struct rq *rq)
++{
++	if (sched_rq_first_task(rq) != rq->curr)
++		resched_curr(rq);
++}
++
++#ifdef CONFIG_SCHED_HRTICK
++/*
++ * Use HR-timers to deliver accurate preemption points.
++ */
++
++static void hrtick_clear(struct rq *rq)
++{
++	if (hrtimer_active(&rq->hrtick_timer))
++		hrtimer_cancel(&rq->hrtick_timer);
++}
++
++/*
++ * High-resolution timer tick.
++ * Runs from hardirq context with interrupts disabled.
++ */
++static enum hrtimer_restart hrtick(struct hrtimer *timer)
++{
++	struct rq *rq = container_of(timer, struct rq, hrtick_timer);
++
++	WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
++
++	raw_spin_lock(&rq->lock);
++	resched_curr(rq);
++	raw_spin_unlock(&rq->lock);
++
++	return HRTIMER_NORESTART;
++}
++
++/*
++ * Use hrtick when:
++ *  - enabled by features
++ *  - hrtimer is actually high res
++ */
++static inline int hrtick_enabled(struct rq *rq)
++{
++	/**
++	 * Alt schedule FW doesn't support sched_feat yet
++	if (!sched_feat(HRTICK))
++		return 0;
++	*/
++	if (!cpu_active(cpu_of(rq)))
++		return 0;
++	return hrtimer_is_hres_active(&rq->hrtick_timer);
++}
++
++#ifdef CONFIG_SMP
++
++static void __hrtick_restart(struct rq *rq)
++{
++	struct hrtimer *timer = &rq->hrtick_timer;
++	ktime_t time = rq->hrtick_time;
++
++	hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
++}
++
++/*
++ * called from hardirq (IPI) context
++ */
++static void __hrtick_start(void *arg)
++{
++	struct rq *rq = arg;
++
++	raw_spin_lock(&rq->lock);
++	__hrtick_restart(rq);
++	raw_spin_unlock(&rq->lock);
++}
++
++/*
++ * Called to set the hrtick timer state.
++ *
++ * called with rq->lock held and irqs disabled
++ */
++void hrtick_start(struct rq *rq, u64 delay)
++{
++	struct hrtimer *timer = &rq->hrtick_timer;
++	s64 delta;
++
++	/*
++	 * Don't schedule slices shorter than 10000ns, that just
++	 * doesn't make sense and can cause timer DoS.
++	 */
++	delta = max_t(s64, delay, 10000LL);
++
++	rq->hrtick_time = ktime_add_ns(timer->base->get_time(), delta);
++
++	if (rq == this_rq())
++		__hrtick_restart(rq);
++	else
++		smp_call_function_single_async(cpu_of(rq), &rq->hrtick_csd);
++}
++
++#else
++/*
++ * Called to set the hrtick timer state.
++ *
++ * called with rq->lock held and irqs disabled
++ */
++void hrtick_start(struct rq *rq, u64 delay)
++{
++	/*
++	 * Don't schedule slices shorter than 10000ns, that just
++	 * doesn't make sense. Rely on vruntime for fairness.
++	 */
++	delay = max_t(u64, delay, 10000LL);
++	hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay),
++		      HRTIMER_MODE_REL_PINNED_HARD);
++}
++#endif /* CONFIG_SMP */
++
++static void hrtick_rq_init(struct rq *rq)
++{
++#ifdef CONFIG_SMP
++	INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
++#endif
++
++	hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
++	rq->hrtick_timer.function = hrtick;
++}
++#else	/* CONFIG_SCHED_HRTICK */
++static inline int hrtick_enabled(struct rq *rq)
++{
++	return 0;
++}
++
++static inline void hrtick_clear(struct rq *rq)
++{
++}
++
++static inline void hrtick_rq_init(struct rq *rq)
++{
++}
++#endif	/* CONFIG_SCHED_HRTICK */
++
++static inline int __normal_prio(int policy, int rt_prio, int static_prio)
++{
++	return rt_policy(policy) ? (MAX_RT_PRIO - 1 - rt_prio) :
++		static_prio + MAX_PRIORITY_ADJ;
++}
++
++/*
++ * Calculate the expected normal priority: i.e. priority
++ * without taking RT-inheritance into account. Might be
++ * boosted by interactivity modifiers. Changes upon fork,
++ * setprio syscalls, and whenever the interactivity
++ * estimator recalculates.
++ */
++static inline int normal_prio(struct task_struct *p)
++{
++	return __normal_prio(p->policy, p->rt_priority, p->static_prio);
++}
++
++/*
++ * Calculate the current priority, i.e. the priority
++ * taken into account by the scheduler. This value might
++ * be boosted by RT tasks as it will be RT if the task got
++ * RT-boosted. If not then it returns p->normal_prio.
++ */
++static int effective_prio(struct task_struct *p)
++{
++	p->normal_prio = normal_prio(p);
++	/*
++	 * If we are RT tasks or we were boosted to RT priority,
++	 * keep the priority unchanged. Otherwise, update priority
++	 * to the normal priority:
++	 */
++	if (!rt_prio(p->prio))
++		return p->normal_prio;
++	return p->prio;
++}
++
++/*
++ * activate_task - move a task to the runqueue.
++ *
++ * Context: rq->lock
++ */
++static void activate_task(struct task_struct *p, struct rq *rq)
++{
++	enqueue_task(p, rq, ENQUEUE_WAKEUP);
++	p->on_rq = TASK_ON_RQ_QUEUED;
++
++	/*
++	 * If in_iowait is set, the code below may not trigger any cpufreq
++	 * utilization updates, so do it here explicitly with the IOWAIT flag
++	 * passed.
++	 */
++	cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT * p->in_iowait);
++}
++
++/*
++ * deactivate_task - remove a task from the runqueue.
++ *
++ * Context: rq->lock
++ */
++static inline void deactivate_task(struct task_struct *p, struct rq *rq)
++{
++	dequeue_task(p, rq, DEQUEUE_SLEEP);
++	p->on_rq = 0;
++	cpufreq_update_util(rq, 0);
++}
++
++static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
++{
++#ifdef CONFIG_SMP
++	/*
++	 * After ->cpu is set up to a new value, task_access_lock(p, ...) can be
++	 * successfully executed on another CPU. We must ensure that updates of
++	 * per-task data have been completed by this moment.
++	 */
++	smp_wmb();
++
++	WRITE_ONCE(task_thread_info(p)->cpu, cpu);
++#endif
++}
++
++static inline bool is_migration_disabled(struct task_struct *p)
++{
++#ifdef CONFIG_SMP
++	return p->migration_disabled;
++#else
++	return false;
++#endif
++}
++
++#define SCA_CHECK		0x01
++#define SCA_USER		0x08
++
++#ifdef CONFIG_SMP
++
++void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
++{
++#ifdef CONFIG_SCHED_DEBUG
++	unsigned int state = READ_ONCE(p->__state);
++
++	/*
++	 * We should never call set_task_cpu() on a blocked task,
++	 * ttwu() will sort out the placement.
++	 */
++	WARN_ON_ONCE(state != TASK_RUNNING && state != TASK_WAKING && !p->on_rq);
++
++#ifdef CONFIG_LOCKDEP
++	/*
++	 * The caller should hold either p->pi_lock or rq->lock, when changing
++	 * a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.
++	 *
++	 * sched_move_task() holds both and thus holding either pins the cgroup,
++	 * see task_group().
++	 */
++	WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
++				      lockdep_is_held(&task_rq(p)->lock)));
++#endif
++	/*
++	 * Clearly, migrating tasks to offline CPUs is a fairly daft thing.
++	 */
++	WARN_ON_ONCE(!cpu_online(new_cpu));
++
++	WARN_ON_ONCE(is_migration_disabled(p));
++#endif
++	if (task_cpu(p) == new_cpu)
++		return;
++	trace_sched_migrate_task(p, new_cpu);
++	rseq_migrate(p);
++	perf_event_task_migrate(p);
++
++	__set_task_cpu(p, new_cpu);
++}
++
++#define MDF_FORCE_ENABLED	0x80
++
++static void
++__do_set_cpus_ptr(struct task_struct *p, const struct cpumask *new_mask)
++{
++	/*
++	 * This here violates the locking rules for affinity, since we're only
++	 * supposed to change these variables while holding both rq->lock and
++	 * p->pi_lock.
++	 *
++	 * HOWEVER, it magically works, because ttwu() is the only code that
++	 * accesses these variables under p->pi_lock and only does so after
++	 * smp_cond_load_acquire(&p->on_cpu, !VAL), and we're in __schedule()
++	 * before finish_task().
++	 *
++	 * XXX do further audits, this smells like something putrid.
++	 */
++	SCHED_WARN_ON(!p->on_cpu);
++	p->cpus_ptr = new_mask;
++}
++
++void migrate_disable(void)
++{
++	struct task_struct *p = current;
++	int cpu;
++
++	if (p->migration_disabled) {
++		p->migration_disabled++;
++		return;
++	}
++
++	preempt_disable();
++	cpu = smp_processor_id();
++	if (cpumask_test_cpu(cpu, &p->cpus_mask)) {
++		cpu_rq(cpu)->nr_pinned++;
++		p->migration_disabled = 1;
++		p->migration_flags &= ~MDF_FORCE_ENABLED;
++
++		/*
++		 * Violates locking rules! see comment in __do_set_cpus_ptr().
++		 */
++		if (p->cpus_ptr == &p->cpus_mask)
++			__do_set_cpus_ptr(p, cpumask_of(cpu));
++	}
++	preempt_enable();
++}
++EXPORT_SYMBOL_GPL(migrate_disable);
++
++void migrate_enable(void)
++{
++	struct task_struct *p = current;
++
++	if (0 == p->migration_disabled)
++		return;
++
++	if (p->migration_disabled > 1) {
++		p->migration_disabled--;
++		return;
++	}
++
++	if (WARN_ON_ONCE(!p->migration_disabled))
++		return;
++
++	/*
++	 * Ensure stop_task runs either before or after this, and that
++	 * __set_cpus_allowed_ptr(SCA_MIGRATE_ENABLE) doesn't schedule().
++	 */
++	preempt_disable();
++	/*
++	 * Assumption: current should be running on allowed cpu
++	 */
++	WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &p->cpus_mask));
++	if (p->cpus_ptr != &p->cpus_mask)
++		__do_set_cpus_ptr(p, &p->cpus_mask);
++	/*
++	 * Mustn't clear migration_disabled() until cpus_ptr points back at the
++	 * regular cpus_mask, otherwise things that race (eg.
++	 * select_fallback_rq) get confused.
++	 */
++	barrier();
++	p->migration_disabled = 0;
++	this_rq()->nr_pinned--;
++	preempt_enable();
++}
++EXPORT_SYMBOL_GPL(migrate_enable);
++
++static inline bool rq_has_pinned_tasks(struct rq *rq)
++{
++	return rq->nr_pinned;
++}
++
++/*
++ * Per-CPU kthreads are allowed to run on !active && online CPUs, see
++ * __set_cpus_allowed_ptr() and select_fallback_rq().
++ */
++static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
++{
++	/* When not in the task's cpumask, no point in looking further. */
++	if (!cpumask_test_cpu(cpu, p->cpus_ptr))
++		return false;
++
++	/* migrate_disabled() must be allowed to finish. */
++	if (is_migration_disabled(p))
++		return cpu_online(cpu);
++
++	/* Non kernel threads are not allowed during either online or offline. */
++	if (!(p->flags & PF_KTHREAD))
++		return cpu_active(cpu) && task_cpu_possible(cpu, p);
++
++	/* KTHREAD_IS_PER_CPU is always allowed. */
++	if (kthread_is_per_cpu(p))
++		return cpu_online(cpu);
++
++	/* Regular kernel threads don't get to stay during offline. */
++	if (cpu_dying(cpu))
++		return false;
++
++	/* But are allowed during online. */
++	return cpu_online(cpu);
++}
++
++/*
++ * This is how migration works:
++ *
++ * 1) we invoke migration_cpu_stop() on the target CPU using
++ *    stop_one_cpu().
++ * 2) stopper starts to run (implicitly forcing the migrated thread
++ *    off the CPU)
++ * 3) it checks whether the migrated task is still in the wrong runqueue.
++ * 4) if it's in the wrong runqueue then the migration thread removes
++ *    it and puts it into the right queue.
++ * 5) stopper completes and stop_one_cpu() returns and the migration
++ *    is done.
++ */
++
++/*
++ * move_queued_task - move a queued task to new rq.
++ *
++ * Returns (locked) new rq. Old rq's lock is released.
++ */
++static struct rq *move_queued_task(struct rq *rq, struct task_struct *p, int
++				   new_cpu)
++{
++	lockdep_assert_held(&rq->lock);
++
++	WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
++	dequeue_task(p, rq, 0);
++	update_sched_rq_watermark(rq);
++	set_task_cpu(p, new_cpu);
++	raw_spin_unlock(&rq->lock);
++
++	rq = cpu_rq(new_cpu);
++
++	raw_spin_lock(&rq->lock);
++	BUG_ON(task_cpu(p) != new_cpu);
++	sched_task_sanity_check(p, rq);
++	enqueue_task(p, rq, 0);
++	p->on_rq = TASK_ON_RQ_QUEUED;
++	check_preempt_curr(rq);
++
++	return rq;
++}
++
++struct migration_arg {
++	struct task_struct *task;
++	int dest_cpu;
++};
++
++/*
++ * Move (not current) task off this CPU, onto the destination CPU. We're doing
++ * this because either it can't run here any more (set_cpus_allowed()
++ * away from this CPU, or CPU going down), or because we're
++ * attempting to rebalance this task on exec (sched_exec).
++ *
++ * So we race with normal scheduler movements, but that's OK, as long
++ * as the task is no longer on this CPU.
++ */
++static struct rq *__migrate_task(struct rq *rq, struct task_struct *p, int
++				 dest_cpu)
++{
++	/* Affinity changed (again). */
++	if (!is_cpu_allowed(p, dest_cpu))
++		return rq;
++
++	update_rq_clock(rq);
++	return move_queued_task(rq, p, dest_cpu);
++}
++
++/*
++ * migration_cpu_stop - this will be executed by a highprio stopper thread
++ * and performs thread migration by bumping thread off CPU then
++ * 'pushing' onto another runqueue.
++ */
++static int migration_cpu_stop(void *data)
++{
++	struct migration_arg *arg = data;
++	struct task_struct *p = arg->task;
++	struct rq *rq = this_rq();
++	unsigned long flags;
++
++	/*
++	 * The original target CPU might have gone down and we might
++	 * be on another CPU but it doesn't matter.
++	 */
++	local_irq_save(flags);
++	/*
++	 * We need to explicitly wake pending tasks before running
++	 * __migrate_task() such that we will not miss enforcing cpus_ptr
++	 * during wakeups, see set_cpus_allowed_ptr()'s TASK_WAKING test.
++	 */
++	flush_smp_call_function_queue();
++
++	raw_spin_lock(&p->pi_lock);
++	raw_spin_lock(&rq->lock);
++	/*
++	 * If task_rq(p) != rq, it cannot be migrated here, because we're
++	 * holding rq->lock, if p->on_rq == 0 it cannot get enqueued because
++	 * we're holding p->pi_lock.
++	 */
++	if (task_rq(p) == rq && task_on_rq_queued(p))
++		rq = __migrate_task(rq, p, arg->dest_cpu);
++	raw_spin_unlock(&rq->lock);
++	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
++
++	return 0;
++}
++
++static inline void
++set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_mask)
++{
++	cpumask_copy(&p->cpus_mask, new_mask);
++	p->nr_cpus_allowed = cpumask_weight(new_mask);
++}
++
++static void
++__do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
++{
++	lockdep_assert_held(&p->pi_lock);
++	set_cpus_allowed_common(p, new_mask);
++}
++
++void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
++{
++	__do_set_cpus_allowed(p, new_mask);
++}
++
++int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src,
++		      int node)
++{
++	if (!src->user_cpus_ptr)
++		return 0;
++
++	dst->user_cpus_ptr = kmalloc_node(cpumask_size(), GFP_KERNEL, node);
++	if (!dst->user_cpus_ptr)
++		return -ENOMEM;
++
++	cpumask_copy(dst->user_cpus_ptr, src->user_cpus_ptr);
++	return 0;
++}
++
++static inline struct cpumask *clear_user_cpus_ptr(struct task_struct *p)
++{
++	struct cpumask *user_mask = NULL;
++
++	swap(p->user_cpus_ptr, user_mask);
++
++	return user_mask;
++}
++
++void release_user_cpus_ptr(struct task_struct *p)
++{
++	kfree(clear_user_cpus_ptr(p));
++}
++
++#endif
++
++/**
++ * task_curr - is this task currently executing on a CPU?
++ * @p: the task in question.
++ *
++ * Return: 1 if the task is currently executing. 0 otherwise.
++ */
++inline int task_curr(const struct task_struct *p)
++{
++	return cpu_curr(task_cpu(p)) == p;
++}
++
++#ifdef CONFIG_SMP
++/*
++ * wait_task_inactive - wait for a thread to unschedule.
++ *
++ * If @match_state is nonzero, it's the @p->state value just checked and
++ * not expected to change.  If it changes, i.e. @p might have woken up,
++ * then return zero.  When we succeed in waiting for @p to be off its CPU,
++ * we return a positive number (its total switch count).  If a second call
++ * a short while later returns the same number, the caller can be sure that
++ * @p has remained unscheduled the whole time.
++ *
++ * The caller must ensure that the task *will* unschedule sometime soon,
++ * else this function might spin for a *long* time. This function can't
++ * be called with interrupts off, or it may introduce deadlock with
++ * smp_call_function() if an IPI is sent by the same process we are
++ * waiting to become inactive.
++ */
++unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state)
++{
++	unsigned long flags;
++	bool running, on_rq;
++	unsigned long ncsw;
++	struct rq *rq;
++	raw_spinlock_t *lock;
++
++	for (;;) {
++		rq = task_rq(p);
++
++		/*
++		 * If the task is actively running on another CPU
++		 * still, just relax and busy-wait without holding
++		 * any locks.
++		 *
++		 * NOTE! Since we don't hold any locks, it's not
++		 * even sure that "rq" stays as the right runqueue!
++		 * But we don't care, since this will return false
++		 * if the runqueue has changed and p is actually now
++		 * running somewhere else!
++		 */
++		while (task_running(p) && p == rq->curr) {
++			if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
++				return 0;
++			cpu_relax();
++		}
++
++		/*
++		 * Ok, time to look more closely! We need the rq
++		 * lock now, to be *sure*. If we're wrong, we'll
++		 * just go back and repeat.
++		 */
++		task_access_lock_irqsave(p, &lock, &flags);
++		trace_sched_wait_task(p);
++		running = task_running(p);
++		on_rq = p->on_rq;
++		ncsw = 0;
++		if (!match_state || READ_ONCE(p->__state) == match_state)
++			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
++		task_access_unlock_irqrestore(p, lock, &flags);
++
++		/*
++		 * If it changed from the expected state, bail out now.
++		 */
++		if (unlikely(!ncsw))
++			break;
++
++		/*
++		 * Was it really running after all now that we
++		 * checked with the proper locks actually held?
++		 *
++		 * Oops. Go back and try again..
++		 */
++		if (unlikely(running)) {
++			cpu_relax();
++			continue;
++		}
++
++		/*
++		 * It's not enough that it's not actively running,
++		 * it must be off the runqueue _entirely_, and not
++		 * preempted!
++		 *
++		 * So if it was still runnable (but just not actively
++		 * running right now), it's preempted, and we should
++		 * yield - it could be a while.
++		 */
++		if (unlikely(on_rq)) {
++			ktime_t to = NSEC_PER_SEC / HZ;
++
++			set_current_state(TASK_UNINTERRUPTIBLE);
++			schedule_hrtimeout(&to, HRTIMER_MODE_REL_HARD);
++			continue;
++		}
++
++		/*
++		 * Ahh, all good. It wasn't running, and it wasn't
++		 * runnable, which means that it will never become
++		 * running in the future either. We're all done!
++		 */
++		break;
++	}
++
++	return ncsw;
++}
++
++/***
++ * kick_process - kick a running thread to enter/exit the kernel
++ * @p: the to-be-kicked thread
++ *
++ * Cause a process which is running on another CPU to enter
++ * kernel-mode, without any delay. (to get signals handled.)
++ *
++ * NOTE: this function doesn't have to take the runqueue lock,
++ * because all it wants to ensure is that the remote task enters
++ * the kernel. If the IPI races and the task has been migrated
++ * to another CPU then no harm is done and the purpose has been
++ * achieved as well.
++ */
++void kick_process(struct task_struct *p)
++{
++	int cpu;
++
++	preempt_disable();
++	cpu = task_cpu(p);
++	if ((cpu != smp_processor_id()) && task_curr(p))
++		smp_send_reschedule(cpu);
++	preempt_enable();
++}
++EXPORT_SYMBOL_GPL(kick_process);
++
++/*
++ * ->cpus_ptr is protected by both rq->lock and p->pi_lock
++ *
++ * A few notes on cpu_active vs cpu_online:
++ *
++ *  - cpu_active must be a subset of cpu_online
++ *
++ *  - on CPU-up we allow per-CPU kthreads on the online && !active CPU,
++ *    see __set_cpus_allowed_ptr(). At this point the newly online
++ *    CPU isn't yet part of the sched domains, and balancing will not
++ *    see it.
++ *
++ *  - on cpu-down we clear cpu_active() to mask the sched domains and
++ *    avoid the load balancer to place new tasks on the to be removed
++ *    CPU. Existing tasks will remain running there and will be taken
++ *    off.
++ *
++ * This means that fallback selection must not select !active CPUs.
++ * And can assume that any active CPU must be online. Conversely
++ * select_task_rq() below may allow selection of !active CPUs in order
++ * to satisfy the above rules.
++ */
++static int select_fallback_rq(int cpu, struct task_struct *p)
++{
++	int nid = cpu_to_node(cpu);
++	const struct cpumask *nodemask = NULL;
++	enum { cpuset, possible, fail } state = cpuset;
++	int dest_cpu;
++
++	/*
++	 * If the node that the CPU is on has been offlined, cpu_to_node()
++	 * will return -1. There is no CPU on the node, and we should
++	 * select the CPU on the other node.
++	 */
++	if (nid != -1) {
++		nodemask = cpumask_of_node(nid);
++
++		/* Look for allowed, online CPU in same node. */
++		for_each_cpu(dest_cpu, nodemask) {
++			if (is_cpu_allowed(p, dest_cpu))
++				return dest_cpu;
++		}
++	}
++
++	for (;;) {
++		/* Any allowed, online CPU? */
++		for_each_cpu(dest_cpu, p->cpus_ptr) {
++			if (!is_cpu_allowed(p, dest_cpu))
++				continue;
++			goto out;
++		}
++
++		/* No more Mr. Nice Guy. */
++		switch (state) {
++		case cpuset:
++			if (cpuset_cpus_allowed_fallback(p)) {
++				state = possible;
++				break;
++			}
++			fallthrough;
++		case possible:
++			/*
++			 * XXX When called from select_task_rq() we only
++			 * hold p->pi_lock and again violate locking order.
++			 *
++			 * More yuck to audit.
++			 */
++			do_set_cpus_allowed(p, task_cpu_possible_mask(p));
++			state = fail;
++			break;
++
++		case fail:
++			BUG();
++			break;
++		}
++	}
++
++out:
++	if (state != cpuset) {
++		/*
++		 * Don't tell them about moving exiting tasks or
++		 * kernel threads (both mm NULL), since they never
++		 * leave kernel.
++		 */
++		if (p->mm && printk_ratelimit()) {
++			printk_deferred("process %d (%s) no longer affine to cpu%d\n",
++					task_pid_nr(p), p->comm, cpu);
++		}
++	}
++
++	return dest_cpu;
++}
++
++static inline int select_task_rq(struct task_struct *p)
++{
++	cpumask_t chk_mask, tmp;
++
++	if (unlikely(!cpumask_and(&chk_mask, p->cpus_ptr, cpu_active_mask)))
++		return select_fallback_rq(task_cpu(p), p);
++
++	if (
++#ifdef CONFIG_SCHED_SMT
++	    cpumask_and(&tmp, &chk_mask, &sched_sg_idle_mask) ||
++#endif
++	    cpumask_and(&tmp, &chk_mask, sched_rq_watermark) ||
++	    cpumask_and(&tmp, &chk_mask,
++			sched_rq_watermark + SCHED_QUEUE_BITS - 1 - task_sched_prio(p)))
++		return best_mask_cpu(task_cpu(p), &tmp);
++
++	return best_mask_cpu(task_cpu(p), &chk_mask);
++}
++
++void sched_set_stop_task(int cpu, struct task_struct *stop)
++{
++	static struct lock_class_key stop_pi_lock;
++	struct sched_param stop_param = { .sched_priority = STOP_PRIO };
++	struct sched_param start_param = { .sched_priority = 0 };
++	struct task_struct *old_stop = cpu_rq(cpu)->stop;
++
++	if (stop) {
++		/*
++		 * Make it appear like a SCHED_FIFO task, its something
++		 * userspace knows about and won't get confused about.
++		 *
++		 * Also, it will make PI more or less work without too
++		 * much confusion -- but then, stop work should not
++		 * rely on PI working anyway.
++		 */
++		sched_setscheduler_nocheck(stop, SCHED_FIFO, &stop_param);
++
++		/*
++		 * The PI code calls rt_mutex_setprio() with ->pi_lock held to
++		 * adjust the effective priority of a task. As a result,
++		 * rt_mutex_setprio() can trigger (RT) balancing operations,
++		 * which can then trigger wakeups of the stop thread to push
++		 * around the current task.
++		 *
++		 * The stop task itself will never be part of the PI-chain, it
++		 * never blocks, therefore that ->pi_lock recursion is safe.
++		 * Tell lockdep about this by placing the stop->pi_lock in its
++		 * own class.
++		 */
++		lockdep_set_class(&stop->pi_lock, &stop_pi_lock);
++	}
++
++	cpu_rq(cpu)->stop = stop;
++
++	if (old_stop) {
++		/*
++		 * Reset it back to a normal scheduling policy so that
++		 * it can die in pieces.
++		 */
++		sched_setscheduler_nocheck(old_stop, SCHED_NORMAL, &start_param);
++	}
++}
++
++static int affine_move_task(struct rq *rq, struct task_struct *p, int dest_cpu,

2638

++			    raw_spinlock_t *lock, unsigned long irq_flags)

2639

++{

2640

++	/* Can the task run on the task's current CPU? If so, we're done */

2641

++	if (!cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) {

2642

++		if (p->migration_disabled) {

2643

++			if (likely(p->cpus_ptr != &p->cpus_mask))

2644

++				__do_set_cpus_ptr(p, &p->cpus_mask);

2645

++			p->migration_disabled = 0;

2646

++			p->migration_flags |= MDF_FORCE_ENABLED;

2647

++			/* When p is migrate_disabled, rq->lock should be held */

2648

++			rq->nr_pinned--;

2649

++		}

2650

++

2651

++		if (task_running(p) || READ_ONCE(p->__state) == TASK_WAKING) {

2652

++			struct migration_arg arg = { p, dest_cpu };

2653

++

2654

++			/* Need help from migration thread: drop lock and wait. */

2655

++			__task_access_unlock(p, lock);

2656

++			raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);

2657

++			stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);

2658

++			return 0;

2659

++		}

2660

++		if (task_on_rq_queued(p)) {

2661

++			/*

2662

++			 * OK, since we're going to drop the lock immediately

2663

++			 * afterwards anyway.

2664

++			 */

2665

++			update_rq_clock(rq);

2666

++			rq = move_queued_task(rq, p, dest_cpu);

2667

++			lock = &rq->lock;

2668

++		}

2669

++	}

2670

++	__task_access_unlock(p, lock);

2671

++	raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);

2672

++	return 0;

2673

++}

2674

++

2675

++static int __set_cpus_allowed_ptr_locked(struct task_struct *p,

2676

++					 const struct cpumask *new_mask,

2677

++					 u32 flags,

2678

++					 struct rq *rq,

2679

++					 raw_spinlock_t *lock,

2680

++					 unsigned long irq_flags)

2681

++{

2682

++	const struct cpumask *cpu_allowed_mask = task_cpu_possible_mask(p);

2683

++	const struct cpumask *cpu_valid_mask = cpu_active_mask;

2684

++	bool kthread = p->flags & PF_KTHREAD;

2685

++	struct cpumask *user_mask = NULL;

2686

++	int dest_cpu;

2687

++	int ret = 0;

2688

++

2689

++	if (kthread || is_migration_disabled(p)) {

2690

++		/*

2691

++		 * Kernel threads are allowed on online && !active CPUs,

2692

++		 * however, during cpu-hot-unplug, even these might get pushed

2693

++		 * away if not KTHREAD_IS_PER_CPU.

2694

++		 *

2695

++		 * Specifically, migration_disabled() tasks must not fail the

2696

++		 * cpumask_any_and_distribute() pick below, esp. so on

2697

++		 * SCA_MIGRATE_ENABLE, otherwise we'll not call

2698

++		 * set_cpus_allowed_common() and actually reset p->cpus_ptr.

2699

++		 */

2700

++		cpu_valid_mask = cpu_online_mask;

2701

++	}

2702

++

2703

++	if (!kthread && !cpumask_subset(new_mask, cpu_allowed_mask)) {

2704

++		ret = -EINVAL;

2705

++		goto out;

2706

++	}

2707

++

2708

++	/*

2709

++	 * Must re-check here, to close a race against __kthread_bind(),

2710

++	 * sched_setaffinity() is not guaranteed to observe the flag.

2711

++	 */

2712

++	if ((flags & SCA_CHECK) && (p->flags & PF_NO_SETAFFINITY)) {

2713

++		ret = -EINVAL;

2714

++		goto out;

2715

++	}

2716

++

2717

++	if (cpumask_equal(&p->cpus_mask, new_mask))

2718

++		goto out;

2719

++

2720

++	dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);

2721

++	if (dest_cpu >= nr_cpu_ids) {

2722

++		ret = -EINVAL;

2723

++		goto out;

2724

++	}

2725

++

2726

++	__do_set_cpus_allowed(p, new_mask);

2727

++

2728

++	if (flags & SCA_USER)

2729

++		user_mask = clear_user_cpus_ptr(p);

2730

++

2731

++	ret = affine_move_task(rq, p, dest_cpu, lock, irq_flags);

2732

++

2733

++	kfree(user_mask);

2734

++

2735

++	return ret;

2736

++

2737

++out:

2738

++	__task_access_unlock(p, lock);

2739

++	raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);

2740

++

2741

++	return ret;

2742

++}

2743

++

2744

++/*

2745

++ * Change a given task's CPU affinity. Migrate the thread to a

2746

++ * proper CPU and schedule it away if the CPU it's executing on

2747

++ * is removed from the allowed bitmask.

2748

++ *

2749

++ * NOTE: the caller must have a valid reference to the task, the

2750

++ * task must not exit() & deallocate itself prematurely. The

2751

++ * call is not atomic; no spinlocks may be held.

2752

++ */

2753

++static int __set_cpus_allowed_ptr(struct task_struct *p,

2754

++				  const struct cpumask *new_mask, u32 flags)

2755

++{

2756

++	unsigned long irq_flags;

2757

++	struct rq *rq;

2758

++	raw_spinlock_t *lock;

2759

++

2760

++	raw_spin_lock_irqsave(&p->pi_lock, irq_flags);

2761

++	rq = __task_access_lock(p, &lock);

2762

++

2763

++	return __set_cpus_allowed_ptr_locked(p, new_mask, flags, rq, lock, irq_flags);

2764

++}

2765

++

2766

++int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)

2767

++{

2768

++	return __set_cpus_allowed_ptr(p, new_mask, 0);

2769

++}

2770

++EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);

2771

++

2772

++/*

2773

++ * Change a given task's CPU affinity to the intersection of its current

2774

++ * affinity mask and @subset_mask, writing the resulting mask to @new_mask

2775

++ * and pointing @p->user_cpus_ptr to a copy of the old mask.

2776

++ * If the resulting mask is empty, leave the affinity unchanged and return

2777

++ * -EINVAL.

2778

++ */

2779

++static int restrict_cpus_allowed_ptr(struct task_struct *p,

2780

++				     struct cpumask *new_mask,

2781

++				     const struct cpumask *subset_mask)

2782

++{

2783

++	struct cpumask *user_mask = NULL;

2784

++	unsigned long irq_flags;

2785

++	raw_spinlock_t *lock;

2786

++	struct rq *rq;

2787

++	int err;

2788

++

2789

++	if (!p->user_cpus_ptr) {

2790

++		user_mask = kmalloc(cpumask_size(), GFP_KERNEL);

2791

++		if (!user_mask)

2792

++			return -ENOMEM;

2793

++	}

2794

++

2795

++	raw_spin_lock_irqsave(&p->pi_lock, irq_flags);

2796

++	rq = __task_access_lock(p, &lock);

2797

++

2798

++	if (!cpumask_and(new_mask, &p->cpus_mask, subset_mask)) {

2799

++		err = -EINVAL;

2800

++		goto err_unlock;

2801

++	}

2802

++

2803

++	/*

2804

++	 * We're about to butcher the task affinity, so keep track of what

2805

++	 * the user asked for in case we're able to restore it later on.

2806

++	 */

2807

++	if (user_mask) {

2808

++		cpumask_copy(user_mask, p->cpus_ptr);

2809

++		p->user_cpus_ptr = user_mask;

2810

++	}

2811

++

2812

++	/*return __set_cpus_allowed_ptr_locked(p, new_mask, 0, rq, &rf);*/

2813

++	return __set_cpus_allowed_ptr_locked(p, new_mask, 0, rq, lock, irq_flags);

2814

++

2815

++err_unlock:

2816

++	__task_access_unlock(p, lock);

2817

++	raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);

2818

++	kfree(user_mask);

2819

++	return err;

2820

++}

2821

++

2822

++/*

2823

++ * Restrict the CPU affinity of task @p so that it is a subset of

2824

++ * task_cpu_possible_mask() and point @p->user_cpus_ptr to a copy of the

2825

++ * old affinity mask. If the resulting mask is empty, we warn and walk

2826

++ * up the cpuset hierarchy until we find a suitable mask.

2827

++ */

2828

++void force_compatible_cpus_allowed_ptr(struct task_struct *p)

2829

++{

2830

++	cpumask_var_t new_mask;

2831

++	const struct cpumask *override_mask = task_cpu_possible_mask(p);

2832

++

2833

++	alloc_cpumask_var(&new_mask, GFP_KERNEL);

2834

++

2835

++	/*

2836

++	 * __migrate_task() can fail silently in the face of concurrent

2837

++	 * offlining of the chosen destination CPU, so take the hotplug

2838

++	 * lock to ensure that the migration succeeds.

2839

++	 */

2840

++	cpus_read_lock();

2841

++	if (!cpumask_available(new_mask))

2842

++		goto out_set_mask;

2843

++

2844

++	if (!restrict_cpus_allowed_ptr(p, new_mask, override_mask))

2845

++		goto out_free_mask;

2846

++

2847

++	/*

2848

++	 * We failed to find a valid subset of the affinity mask for the

2849

++	 * task, so override it based on its cpuset hierarchy.

2850

++	 */

2851

++	cpuset_cpus_allowed(p, new_mask);

2852

++	override_mask = new_mask;

2853

++

2854

++out_set_mask:

2855

++	if (printk_ratelimit()) {

2856

++		printk_deferred("Overriding affinity for process %d (%s) to CPUs %*pbl\n",

2857

++				task_pid_nr(p), p->comm,

2858

++				cpumask_pr_args(override_mask));

2859

++	}

2860

++

2861

++	WARN_ON(set_cpus_allowed_ptr(p, override_mask));

2862

++out_free_mask:

2863

++	cpus_read_unlock();

2864

++	free_cpumask_var(new_mask);

2865

++}

2866

++

2867

++static int

2868

++__sched_setaffinity(struct task_struct *p, const struct cpumask *mask);

2869

++

2870

++/*

2871

++ * Restore the affinity of a task @p which was previously restricted by a

2872

++ * call to force_compatible_cpus_allowed_ptr(). This will clear (and free)

2873

++ * @p->user_cpus_ptr.

2874

++ *

2875

++ * It is the caller's responsibility to serialise this with any calls to

2876

++ * force_compatible_cpus_allowed_ptr(@p).

2877

++ */

2878

++void relax_compatible_cpus_allowed_ptr(struct task_struct *p)

2879

++{

2880

++	struct cpumask *user_mask = p->user_cpus_ptr;

2881

++	unsigned long flags;

2882

++

2883

++	/*

2884

++	 * Try to restore the old affinity mask. If this fails, then

2885

++	 * we free the mask explicitly to avoid it being inherited across

2886

++	 * a subsequent fork().

2887

++	 */

2888

++	if (!user_mask || !__sched_setaffinity(p, user_mask))

2889

++		return;

2890

++

2891

++	raw_spin_lock_irqsave(&p->pi_lock, flags);

2892

++	user_mask = clear_user_cpus_ptr(p);

2893

++	raw_spin_unlock_irqrestore(&p->pi_lock, flags);

2894

++

2895

++	kfree(user_mask);

2896

++}

2897

++

2898

++#else /* CONFIG_SMP */

2899

++

2900

++static inline int select_task_rq(struct task_struct *p)

2901

++{

2902

++	return 0;

2903

++}

2904

++

2905

++static inline int

2906

++__set_cpus_allowed_ptr(struct task_struct *p,

2907

++		       const struct cpumask *new_mask, u32 flags)

2908

++{

2909

++	return set_cpus_allowed_ptr(p, new_mask);

2910

++}

2911

++

2912

++static inline bool rq_has_pinned_tasks(struct rq *rq)

2913

++{

2914

++	return false;

2915

++}

2916

++

2917

++#endif /* !CONFIG_SMP */

2918

++

2919

++static void

2920

++ttwu_stat(struct task_struct *p, int cpu, int wake_flags)

2921

++{

2922

++	struct rq *rq;

2923

++

2924

++	if (!schedstat_enabled())

2925

++		return;

2926

++

2927

++	rq = this_rq();

2928

++

2929

++#ifdef CONFIG_SMP

2930

++	if (cpu == rq->cpu) {

2931

++		__schedstat_inc(rq->ttwu_local);

2932

++		__schedstat_inc(p->stats.nr_wakeups_local);

2933

++	} else {

2934

++		/** Alt schedule FW ToDo:

2935

++		 * How to do ttwu_wake_remote

2936

++		 */

2937

++	}

2938

++#endif /* CONFIG_SMP */

2939

++

2940

++	__schedstat_inc(rq->ttwu_count);

2941

++	__schedstat_inc(p->stats.nr_wakeups);

2942

++}

2943

++

2944

++/*

2945

++ * Mark the task runnable and perform wakeup-preemption.

2946

++ */

2947

++static inline void

2948

++ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)

2949

++{

2950

++	check_preempt_curr(rq);

2951

++	WRITE_ONCE(p->__state, TASK_RUNNING);

2952

++	trace_sched_wakeup(p);

2953

++}

2954

++

2955

++static inline void

2956

++ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)

2957

++{

2958

++	if (p->sched_contributes_to_load)

2959

++		rq->nr_uninterruptible--;

2960

++

2961

++	if (

2962

++#ifdef CONFIG_SMP

2963

++	    !(wake_flags & WF_MIGRATED) &&

2964

++#endif

2965

++	    p->in_iowait) {

2966

++		delayacct_blkio_end(p);

2967

++		atomic_dec(&task_rq(p)->nr_iowait);

2968

++	}

2969

++

2970

++	activate_task(p, rq);

2971

++	ttwu_do_wakeup(rq, p, 0);

2972

++}

2973

++

2974

++/*

2975

++ * Consider @p being inside a wait loop:

2976

++ *

2977

++ *   for (;;) {

2978

++ *      set_current_state(TASK_UNINTERRUPTIBLE);

2979

++ *

2980

++ *      if (CONDITION)

2981

++ *         break;

2982

++ *

2983

++ *      schedule();

2984

++ *   }

2985

++ *   __set_current_state(TASK_RUNNING);

2986

++ *

2987

++ * between set_current_state() and schedule(). In this case @p is still

2988

++ * runnable, so all that needs doing is change p->state back to TASK_RUNNING in

2989

++ * an atomic manner.

2990

++ *

2991

++ * By taking task_rq(p)->lock we serialize against schedule(), if @p->on_rq

2992

++ * then schedule() must still happen and p->state can be changed to

2993

++ * TASK_RUNNING. Otherwise we lost the race, schedule() has happened, and we

2994

++ * need to do a full wakeup with enqueue.

2995

++ *

2996

++ * Returns: %true when the wakeup is done,

2997

++ *          %false otherwise.

2998

++ */

2999

++static int ttwu_runnable(struct task_struct *p, int wake_flags)

3000

++{

3001

++	struct rq *rq;

3002

++	raw_spinlock_t *lock;

3003

++	int ret = 0;

3004

++

3005

++	rq = __task_access_lock(p, &lock);

3006

++	if (task_on_rq_queued(p)) {

3007

++		/* check_preempt_curr() may use rq clock */

3008

++		update_rq_clock(rq);

3009

++		ttwu_do_wakeup(rq, p, wake_flags);

3010

++		ret = 1;

3011

++	}

3012

++	__task_access_unlock(p, lock);

3013

++

3014

++	return ret;

3015

++}

3016

++

3017

++#ifdef CONFIG_SMP

3018

++void sched_ttwu_pending(void *arg)

3019

++{

3020

++	struct llist_node *llist = arg;

3021

++	struct rq *rq = this_rq();

3022

++	struct task_struct *p, *t;

3023

++	struct rq_flags rf;

3024

++

3025

++	if (!llist)

3026

++		return;

3027

++

3028

++	/*

3029

++	 * rq::ttwu_pending is a racy indication of outstanding wakeups.

3030

++	 * Races such that false-negatives are possible, since they

3031

++	 * are shorter lived than false-positives would be.

3032

++	 */

3033

++	WRITE_ONCE(rq->ttwu_pending, 0);

3034

++

3035

++	rq_lock_irqsave(rq, &rf);

3036

++	update_rq_clock(rq);

3037

++

3038

++	llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {

3039

++		if (WARN_ON_ONCE(p->on_cpu))

3040

++			smp_cond_load_acquire(&p->on_cpu, !VAL);

3041

++

3042

++		if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))

3043

++			set_task_cpu(p, cpu_of(rq));

3044

++

3045

++		ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0);

3046

++	}

3047

++

3048

++	rq_unlock_irqrestore(rq, &rf);

3049

++}

3050

++

3051

++void send_call_function_single_ipi(int cpu)

3052

++{

3053

++	struct rq *rq = cpu_rq(cpu);

3054

++

3055

++	if (!set_nr_if_polling(rq->idle))

3056

++		arch_send_call_function_single_ipi(cpu);

3057

++	else

3058

++		trace_sched_wake_idle_without_ipi(cpu);

3059

++}

3060

++

3061

++/*

3062

++ * Queue a task on the target CPU's wake_list and wake the CPU via IPI if

3063

++ * necessary. The wakee CPU on receipt of the IPI will queue the task

3064

++ * via sched_ttwu_pending() for activation so the wakee incurs the cost

3065

++ * of the wakeup instead of the waker.

3066

++ */

3067

++static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)

3068

++{

3069

++	struct rq *rq = cpu_rq(cpu);

3070

++

3071

++	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);

3072

++

3073

++	WRITE_ONCE(rq->ttwu_pending, 1);

3074

++	__smp_call_single_queue(cpu, &p->wake_entry.llist);

3075

++}

3076

++

3077

++static inline bool ttwu_queue_cond(int cpu, int wake_flags)

3078

++{

3079

++	/*

3080

++	 * Do not complicate things with the async wake_list while the CPU is

3081

++	 * in hotplug state.

3082

++	 */

3083

++	if (!cpu_active(cpu))

3084

++		return false;

3085

++

3086

++	/*

3087

++	 * If the CPU does not share cache, then queue the task on the

3088

++	 * remote rq's wakelist to avoid accessing remote data.

3089

++	 */

3090

++	if (!cpus_share_cache(smp_processor_id(), cpu))

3091

++		return true;

3092

++

3093

++	/*

3094

++	 * If the task is descheduling and the only running task on the

3095

++	 * CPU then use the wakelist to offload the task activation to

3096

++	 * the soon-to-be-idle CPU as the current CPU is likely busy.

3097

++	 * nr_running is checked to avoid unnecessary task stacking.

3098

++	 */

3099

++	if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)

3100

++		return true;

3101

++

3102

++	return false;

3103

++}

3104

++

3105

++static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)

3106

++{

3107

++	if (__is_defined(ALT_SCHED_TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {

3108

++		if (WARN_ON_ONCE(cpu == smp_processor_id()))

3109

++			return false;

3110

++

3111

++		sched_clock_cpu(cpu); /* Sync clocks across CPUs */

3112

++		__ttwu_queue_wakelist(p, cpu, wake_flags);

3113

++		return true;

3114

++	}

3115

++

3116

++	return false;

3117

++}

3118

++

3119

++void wake_up_if_idle(int cpu)

3120

++{

3121

++	struct rq *rq = cpu_rq(cpu);

3122

++	unsigned long flags;

3123

++

3124

++	rcu_read_lock();

3125

++

3126

++	if (!is_idle_task(rcu_dereference(rq->curr)))

3127

++		goto out;

3128

++

3129

++	raw_spin_lock_irqsave(&rq->lock, flags);

3130

++	if (is_idle_task(rq->curr))

3131

++		resched_curr(rq);

3132

++	/* Else CPU is not idle, do nothing here */

3133

++	raw_spin_unlock_irqrestore(&rq->lock, flags);

3134

++

3135

++out:

3136

++	rcu_read_unlock();

3137

++}

3138

++

3139

++bool cpus_share_cache(int this_cpu, int that_cpu)

3140

++{

3141

++	if (this_cpu == that_cpu)

3142

++		return true;

3143

++

3144

++	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);

3145

++}

3146

++#else /* !CONFIG_SMP */

3147

++

3148

++static inline bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)

3149

++{

3150

++	return false;

3151

++}

3152

++

3153

++#endif /* CONFIG_SMP */

3154

++

3155

++static inline void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)

3156

++{

3157

++	struct rq *rq = cpu_rq(cpu);

3158

++

3159

++	if (ttwu_queue_wakelist(p, cpu, wake_flags))

3160

++		return;

3161

++

3162

++	raw_spin_lock(&rq->lock);

3163

++	update_rq_clock(rq);

3164

++	ttwu_do_activate(rq, p, wake_flags);

3165

++	raw_spin_unlock(&rq->lock);

3166

++}

3167

++

3168

++/*

3169

++ * Invoked from try_to_wake_up() to check whether the task can be woken up.

3170

++ *

3171

++ * The caller holds p::pi_lock if p != current or has preemption

3172

++ * disabled when p == current.

3173

++ *

3174

++ * The rules of PREEMPT_RT saved_state:

3175

++ *

3176

++ *   The related locking code always holds p::pi_lock when updating

3177

++ *   p::saved_state, which means the code is fully serialized in both cases.

3178

++ *

3179

++ *   The lock wait and lock wakeups happen via TASK_RTLOCK_WAIT. No other

3180

++ *   bits set. This allows distinguishing all wakeup scenarios.

3181

++ */

3182

++static __always_inline

3183

++bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)

3184

++{

3185

++	if (IS_ENABLED(CONFIG_DEBUG_PREEMPT)) {

3186

++		WARN_ON_ONCE((state & TASK_RTLOCK_WAIT) &&

3187

++			     state != TASK_RTLOCK_WAIT);

3188

++	}

3189

++

3190

++	if (READ_ONCE(p->__state) & state) {

3191

++		*success = 1;

3192

++		return true;

3193

++	}

3194

++

3195

++#ifdef CONFIG_PREEMPT_RT

3196

++	/*

3197

++	 * Saved state preserves the task state across blocking on

3198

++	 * an RT lock.  If the state matches, set p::saved_state to

3199

++	 * TASK_RUNNING, but do not wake the task because it waits

3200

++	 * for a lock wakeup. Also indicate success because from

3201

++	 * the regular waker's point of view this has succeeded.

3202

++	 *

3203

++	 * After acquiring the lock the task will restore p::__state

3204

++	 * from p::saved_state which ensures that the regular

3205

++	 * wakeup is not lost. The restore will also set

3206

++	 * p::saved_state to TASK_RUNNING so any further tests will

3207

++	 * not result in false positives vs. @success

3208

++	 */

3209

++	if (p->saved_state & state) {

3210

++		p->saved_state = TASK_RUNNING;

3211

++		*success = 1;

3212

++	}

3213

++#endif

3214

++	return false;

3215

++}

3216

++

3217

++/*

3218

++ * Notes on Program-Order guarantees on SMP systems.

3219

++ *

3220

++ *  MIGRATION

3221

++ *

3222

++ * The basic program-order guarantee on SMP systems is that when a task [t]

3223

++ * migrates, all its activity on its old CPU [c0] happens-before any subsequent

3224

++ * execution on its new CPU [c1].

3225

++ *

3226

++ * For migration (of runnable tasks) this is provided by the following means:

3227

++ *

3228

++ *  A) UNLOCK of the rq(c0)->lock scheduling out task t

3229

++ *  B) migration for t is required to synchronize *both* rq(c0)->lock and

3230

++ *     rq(c1)->lock (if not at the same time, then in that order).

3231

++ *  C) LOCK of the rq(c1)->lock scheduling in task

3232

++ *

3233

++ * Transitivity guarantees that B happens after A and C after B.

3234

++ * Note: we only require RCpc transitivity.

3235

++ * Note: the CPU doing B need not be c0 or c1

3236

++ *

3237

++ * Example:

3238

++ *

3239

++ *   CPU0            CPU1            CPU2

3240

++ *

3241

++ *   LOCK rq(0)->lock

3242

++ *   sched-out X

3243

++ *   sched-in Y

3244

++ *   UNLOCK rq(0)->lock

3245

++ *

3246

++ *                                   LOCK rq(0)->lock // orders against CPU0

3247

++ *                                   dequeue X

3248

++ *                                   UNLOCK rq(0)->lock

3249

++ *

3250

++ *                                   LOCK rq(1)->lock

3251

++ *                                   enqueue X

3252

++ *                                   UNLOCK rq(1)->lock

3253

++ *

3254

++ *                   LOCK rq(1)->lock // orders against CPU2

3255

++ *                   sched-out Z

3256

++ *                   sched-in X

3257

++ *                   UNLOCK rq(1)->lock

3258

++ *

3259

++ *

3260

++ *  BLOCKING -- aka. SLEEP + WAKEUP

3261

++ *

3262

++ * For blocking we (obviously) need to provide the same guarantee as for

3263

++ * migration. However the means are completely different as there is no lock

3264

++ * chain to provide order. Instead we do:

3265

++ *

3266

++ *   1) smp_store_release(X->on_cpu, 0)   -- finish_task()

3267

++ *   2) smp_cond_load_acquire(!X->on_cpu) -- try_to_wake_up()

3268

++ *

3269

++ * Example:

3270

++ *

3271

++ *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)

3272

++ *

3273

++ *   LOCK rq(0)->lock LOCK X->pi_lock

3274

++ *   dequeue X

3275

++ *   sched-out X

3276

++ *   smp_store_release(X->on_cpu, 0);

3277

++ *

3278

++ *                    smp_cond_load_acquire(&X->on_cpu, !VAL);

3279

++ *                    X->state = WAKING

3280

++ *                    set_task_cpu(X,2)

3281

++ *

3282

++ *                    LOCK rq(2)->lock

3283

++ *                    enqueue X

3284

++ *                    X->state = RUNNING

3285

++ *                    UNLOCK rq(2)->lock

3286

++ *

3287

++ *                                          LOCK rq(2)->lock // orders against CPU1

3288

++ *                                          sched-out Z

3289

++ *                                          sched-in X

3290

++ *                                          UNLOCK rq(2)->lock

3291

++ *

3292

++ *                    UNLOCK X->pi_lock

3293

++ *   UNLOCK rq(0)->lock

3294

++ *

3295

++ *

3296

++ * However; for wakeups there is a second guarantee we must provide, namely we

3297

++ * must observe the state that led to our wakeup. That is, not only must our

3298

++ * task observe its own prior state, it must also observe the stores prior to

3299

++ * its wakeup.

3300

++ *

3301

++ * This means that any means of doing remote wakeups must order the CPU doing

3302

++ * the wakeup against the CPU the task is going to end up running on. This,

3303

++ * however, is already required for the regular Program-Order guarantee above,

3304

++ * since the waking CPU is the one issuing the ACQUIRE (smp_cond_load_acquire).

3305

++ *

3306

++ */

3307

++

3308

++/**

3309

++ * try_to_wake_up - wake up a thread

3310

++ * @p: the thread to be awakened

3311

++ * @state: the mask of task states that can be woken

3312

++ * @wake_flags: wake modifier flags (WF_*)

3313

++ *

3314

++ * Conceptually does:

3315

++ *

3316

++ *   If (@state & @p->state) @p->state = TASK_RUNNING.

3317

++ *

3318

++ * If the task was not queued/runnable, also place it back on a runqueue.

3319

++ *

3320

++ * This function is atomic against schedule() which would dequeue the task.

3321

++ *

3322

++ * It issues a full memory barrier before accessing @p->state, see the comment

3323

++ * with set_current_state().

3324

++ *

3325

++ * Uses p->pi_lock to serialize against concurrent wake-ups.

3326

++ *

3327

++ * Relies on p->pi_lock stabilizing:

3328

++ *  - p->sched_class

3329

++ *  - p->cpus_ptr

3330

++ *  - p->sched_task_group

3331

++ * in order to do migration, see its use of select_task_rq()/set_task_cpu().

3332

++ *

3333

++ * Tries really hard to only take one task_rq(p)->lock for performance.

3334

++ * Takes rq->lock in:

3335

++ *  - ttwu_runnable()    -- old rq, unavoidable, see comment there;

3336

++ *  - ttwu_queue()       -- new rq, for enqueue of the task;

3337

++ *  - psi_ttwu_dequeue() -- much sadness :-( accounting will kill us.

3338

++ *

3339

++ * As a consequence we race really badly with just about everything. See the

3340

++ * many memory barriers and their comments for details.

3341

++ *

3342

++ * Return: %true if @p->state changes (an actual wakeup was done),

3343

++ *	   %false otherwise.

3344

++ */

3345

++static int try_to_wake_up(struct task_struct *p, unsigned int state,

3346

++			  int wake_flags)

3347

++{

3348

++	unsigned long flags;

3349

++	int cpu, success = 0;

3350

++

3351

++	preempt_disable();

3352

++	if (p == current) {

3353

++		/*

3354

++		 * We're waking current, this means 'p->on_rq' and 'task_cpu(p)

3355

++		 * == smp_processor_id()'. Together this means we can special

3356

++		 * case the whole 'p->on_rq && ttwu_runnable()' case below

3357

++		 * without taking any locks.

3358

++		 *

3359

++		 * In particular:

3360

++		 *  - we rely on Program-Order guarantees for all the ordering,

3361

++		 *  - we're serialized against set_special_state() by virtue of

3362

++		 *    it disabling IRQs (this allows not taking ->pi_lock).

3363

++		 */

3364

++		if (!ttwu_state_match(p, state, &success))

3365

++			goto out;

3366

++

3367

++		trace_sched_waking(p);

3368

++		WRITE_ONCE(p->__state, TASK_RUNNING);

3369

++		trace_sched_wakeup(p);

3370

++		goto out;

3371

++	}

3372

++

3373

++	/*

3374

++	 * If we are going to wake up a thread waiting for CONDITION we

3375

++	 * need to ensure that CONDITION=1 done by the caller can not be

3376

++	 * reordered with p->state check below. This pairs with smp_store_mb()

3377

++	 * in set_current_state() that the waiting thread does.

3378

++	 */

3379

++	raw_spin_lock_irqsave(&p->pi_lock, flags);

3380

++	smp_mb__after_spinlock();

3381

++	if (!ttwu_state_match(p, state, &success))

3382

++		goto unlock;

3383

++

3384

++	trace_sched_waking(p);

3385

++

3386

++	/*

3387

++	 * Ensure we load p->on_rq _after_ p->state, otherwise it would

3388

++	 * be possible to, falsely, observe p->on_rq == 0 and get stuck

3389

++	 * in smp_cond_load_acquire() below.

3390

++	 *

3391

++	 * sched_ttwu_pending()			try_to_wake_up()

3392

++	 *   STORE p->on_rq = 1			  LOAD p->state

3393

++	 *   UNLOCK rq->lock

3394

++	 *

3395

++	 * __schedule() (switch to task 'p')

3396

++	 *   LOCK rq->lock			  smp_rmb();

3397

++	 *   smp_mb__after_spinlock();

3398

++	 *   UNLOCK rq->lock

3399

++	 *

3400

++	 * [task p]

3401

++	 *   STORE p->state = UNINTERRUPTIBLE	  LOAD p->on_rq

3402

++	 *

3403

++	 * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in

3404

++	 * __schedule().  See the comment for smp_mb__after_spinlock().

3405

++	 *

3406

++	 * A similar smp_rmb() lives in task_call_func().

3407

++	 */

3408

++	smp_rmb();

3409

++	if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags))

3410

++		goto unlock;

3411

++

3412

++#ifdef CONFIG_SMP

3413

++	/*

3414

++	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be

3415

++	 * possible to, falsely, observe p->on_cpu == 0.

3416

++	 *

3417

++	 * One must be running (->on_cpu == 1) in order to remove oneself

3418

++	 * from the runqueue.

3419

++	 *

3420

++	 * __schedule() (switch to task 'p')	try_to_wake_up()

3421

++	 *   STORE p->on_cpu = 1		  LOAD p->on_rq

3422

++	 *   UNLOCK rq->lock

3423

++	 *

3424

++	 * __schedule() (put 'p' to sleep)

3425

++	 *   LOCK rq->lock			  smp_rmb();

3426

++	 *   smp_mb__after_spinlock();

3427

++	 *   STORE p->on_rq = 0			  LOAD p->on_cpu

3428

++	 *

3429

++	 * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in

3430

++	 * __schedule().  See the comment for smp_mb__after_spinlock().

3431

++	 *

3432

++	 * Form a control-dep-acquire with p->on_rq == 0 above, to ensure

3433

++	 * schedule()'s deactivate_task() has 'happened' and p will no longer

3434

++	 * care about its own p->state. See the comment in __schedule().

3435

++	 */

3436

++	smp_acquire__after_ctrl_dep();

3437

++

3438

++	/*

3439

++	 * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq

3440

++	 * == 0), which means we need to do an enqueue, change p->state to

3441

++	 * TASK_WAKING such that we can unlock p->pi_lock before doing the

3442

++	 * enqueue, such as ttwu_queue_wakelist().

3443

++	 */

3444

++	WRITE_ONCE(p->__state, TASK_WAKING);

3445

++

3446

++	/*

3447

++	 * If the owning (remote) CPU is still in the middle of schedule() with

3448

++	 * this task as prev, consider queueing p on the remote CPU's wake_list

3449

++	 * which potentially sends an IPI instead of spinning on p->on_cpu to

3450

++	 * let the waker make forward progress. This is safe because IRQs are

3451

++	 * disabled and the IPI will deliver after on_cpu is cleared.

3452

++	 *

3453

++	 * Ensure we load task_cpu(p) after p->on_cpu:

3454

++	 *

3455

++	 * set_task_cpu(p, cpu);

3456

++	 *   STORE p->cpu = @cpu

3457

++	 * __schedule() (switch to task 'p')

3458

++	 *   LOCK rq->lock

3459

++	 *   smp_mb__after_spinlock()           smp_cond_load_acquire(&p->on_cpu)

3460

++	 *   STORE p->on_cpu = 1                LOAD p->cpu

3461

++	 *

3462

++	 * to ensure we observe the correct CPU on which the task is currently

3463

++	 * scheduling.

3464

++	 */

3465

++	if (smp_load_acquire(&p->on_cpu) &&

3466

++	    ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))

3467

++		goto unlock;

3468

++

3469

++	/*

3470

++	 * If the owning (remote) CPU is still in the middle of schedule() with

3471

++	 * this task as prev, wait until it's done referencing the task.

3472

++	 *

3473

++	 * Pairs with the smp_store_release() in finish_task().

3474

++	 *

3475

++	 * This ensures that tasks getting woken will be fully ordered against

3476

++	 * their previous state and preserve Program Order.

3477

++	 */

3478

++	smp_cond_load_acquire(&p->on_cpu, !VAL);

3479

++

3480

++	sched_task_ttwu(p);

3481

++

3482

++	cpu = select_task_rq(p);

3483

++

3484

++	if (cpu != task_cpu(p)) {

3485

++		if (p->in_iowait) {

3486

++			delayacct_blkio_end(p);

3487

++			atomic_dec(&task_rq(p)->nr_iowait);

3488

++		}

3489

++

3490

++		wake_flags |= WF_MIGRATED;

3491

++		psi_ttwu_dequeue(p);

3492

++		set_task_cpu(p, cpu);

3493

++	}

3494

++#else

3495

++	cpu = task_cpu(p);

3496

++#endif /* CONFIG_SMP */

3497

++

3498

++	ttwu_queue(p, cpu, wake_flags);

3499

++unlock:

3500

++	raw_spin_unlock_irqrestore(&p->pi_lock, flags);

3501

++out:

3502

++	if (success)

3503

++		ttwu_stat(p, task_cpu(p), wake_flags);

3504

++	preempt_enable();

3505

++

3506

++	return success;

3507

++}

++
++/**
++ * task_call_func - Invoke a function on task in fixed state
++ * @p: Process for which the function is to be invoked, can be @current.
++ * @func: Function to invoke.
++ * @arg: Argument to function.
++ *
++ * Fix the task in its current state by avoiding wakeups and/or rq operations
++ * and call @func(@arg) on it.  This function can use ->on_rq and task_curr()
++ * to work out what the state is, if required.  Given that @func can be invoked
++ * with a runqueue lock held, it had better be quite lightweight.
++ *
++ * Returns:
++ *   Whatever @func returns
++ */
++int task_call_func(struct task_struct *p, task_call_f func, void *arg)
++{
++	struct rq *rq = NULL;
++	unsigned int state;
++	struct rq_flags rf;
++	int ret;
++
++	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
++
++	state = READ_ONCE(p->__state);
++
++	/*
++	 * Ensure we load p->on_rq after p->__state, otherwise it would be
++	 * possible to, falsely, observe p->on_rq == 0.
++	 *
++	 * See try_to_wake_up() for a longer comment.
++	 */
++	smp_rmb();
++
++	/*
++	 * Since pi->lock blocks try_to_wake_up(), we don't need rq->lock when
++	 * the task is blocked. Make sure to check @state since ttwu() can drop
++	 * locks at the end, see ttwu_queue_wakelist().
++	 */
++	if (state == TASK_RUNNING || state == TASK_WAKING || p->on_rq)
++		rq = __task_rq_lock(p, &rf);
++
++	/*
++	 * At this point the task is pinned; either:
++	 *  - blocked and we're holding off wakeups      (pi->lock)
++	 *  - woken, and we're holding off enqueue       (rq->lock)
++	 *  - queued, and we're holding off schedule     (rq->lock)
++	 *  - running, and we're holding off de-schedule (rq->lock)
++	 *
++	 * The called function (@func) can use: task_curr(), p->on_rq and
++	 * p->__state to differentiate between these states.
++	 */
++	ret = func(p, arg);
++
++	if (rq)
++		__task_rq_unlock(rq, &rf);
++
++	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
++	return ret;
++}

++
++/**
++ * wake_up_process - Wake up a specific process
++ * @p: The process to be woken up.
++ *
++ * Attempt to wake up the nominated process and move it to the set of runnable
++ * processes.
++ *
++ * Return: 1 if the process was woken up, 0 if it was already running.
++ *
++ * This function executes a full memory barrier before accessing the task state.
++ */
++int wake_up_process(struct task_struct *p)
++{
++	return try_to_wake_up(p, TASK_NORMAL, 0);
++}
++EXPORT_SYMBOL(wake_up_process);
++
++int wake_up_state(struct task_struct *p, unsigned int state)
++{
++	return try_to_wake_up(p, state, 0);
++}

++
++/*
++ * Perform scheduler related setup for a newly forked process p.
++ * p is forked by current.
++ *
++ * __sched_fork() is basic setup used by init_idle() too:
++ */
++static inline void __sched_fork(unsigned long clone_flags, struct task_struct *p)
++{
++	p->on_rq			= 0;
++	p->on_cpu			= 0;
++	p->utime			= 0;
++	p->stime			= 0;
++	p->sched_time			= 0;
++
++#ifdef CONFIG_SCHEDSTATS
++	/* Even if schedstat is disabled, there should not be garbage */
++	memset(&p->stats, 0, sizeof(p->stats));
++#endif
++
++#ifdef CONFIG_PREEMPT_NOTIFIERS
++	INIT_HLIST_HEAD(&p->preempt_notifiers);
++#endif
++
++#ifdef CONFIG_COMPACTION
++	p->capture_control = NULL;
++#endif
++#ifdef CONFIG_SMP
++	p->wake_entry.u_flags = CSD_TYPE_TTWU;
++#endif
++}

++
++/*
++ * fork()/clone()-time setup:
++ */
++int sched_fork(unsigned long clone_flags, struct task_struct *p)
++{
++	__sched_fork(clone_flags, p);
++	/*
++	 * We mark the process as NEW here. This guarantees that
++	 * nobody will actually run it, and a signal or other external
++	 * event cannot wake it up and insert it on the runqueue either.
++	 */
++	p->__state = TASK_NEW;
++
++	/*
++	 * Make sure we do not leak PI boosting priority to the child.
++	 */
++	p->prio = current->normal_prio;
++
++	/*
++	 * Revert to default priority/policy on fork if requested.
++	 */
++	if (unlikely(p->sched_reset_on_fork)) {
++		if (task_has_rt_policy(p)) {
++			p->policy = SCHED_NORMAL;
++			p->static_prio = NICE_TO_PRIO(0);
++			p->rt_priority = 0;
++		} else if (PRIO_TO_NICE(p->static_prio) < 0)
++			p->static_prio = NICE_TO_PRIO(0);
++
++		p->prio = p->normal_prio = p->static_prio;
++
++		/*
++		 * We don't need the reset flag anymore after the fork. It has
++		 * fulfilled its duty:
++		 */
++		p->sched_reset_on_fork = 0;
++	}
++
++#ifdef CONFIG_SCHED_INFO
++	if (unlikely(sched_info_on()))
++		memset(&p->sched_info, 0, sizeof(p->sched_info));
++#endif
++	init_task_preempt_count(p);
++
++	return 0;
++}

++
++void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
++{
++	unsigned long flags;
++	struct rq *rq;
++
++	/*
++	 * Because we're not yet on the pid-hash, p->pi_lock isn't strictly
++	 * required yet, but lockdep gets upset if rules are violated.
++	 */
++	raw_spin_lock_irqsave(&p->pi_lock, flags);
++	/*
++	 * Share the timeslice between parent and child, thus the
++	 * total amount of pending timeslices in the system doesn't change,
++	 * resulting in more scheduling fairness.
++	 */
++	rq = this_rq();
++	raw_spin_lock(&rq->lock);
++
++	rq->curr->time_slice /= 2;
++	p->time_slice = rq->curr->time_slice;
++#ifdef CONFIG_SCHED_HRTICK
++	hrtick_start(rq, rq->curr->time_slice);
++#endif
++
++	if (p->time_slice < RESCHED_NS) {
++		p->time_slice = sched_timeslice_ns;
++		resched_curr(rq);
++	}
++	sched_task_fork(p, rq);
++	raw_spin_unlock(&rq->lock);
++
++	rseq_migrate(p);
++	/*
++	 * We're setting the CPU for the first time, we don't migrate,
++	 * so use __set_task_cpu().
++	 */
++	__set_task_cpu(p, smp_processor_id());
++	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
++}
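
Reader's note (not part of the patch): the timeslice sharing done in sched_cgroup_fork() can be sketched in user-space C as below. This is only an illustration under stated assumptions. The struct and function names are hypothetical, and the two constants merely stand in for RESCHED_NS and sched_timeslice_ns, whose real values are defined elsewhere in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed stand-ins for RESCHED_NS and sched_timeslice_ns. */
#define SKETCH_RESCHED_NS   (100 * 1000LL)
#define SKETCH_TIMESLICE_NS (4 * 1000 * 1000LL)

struct sketch_slice_split {
	int64_t parent_ns;    /* what the forking task keeps */
	int64_t child_ns;     /* what the new task starts with */
	bool resched_parent;  /* true when the parent should be rescheduled */
};

/* Halve the parent's remaining slice and hand the same amount to the child. */
static struct sketch_slice_split sketch_split_timeslice(int64_t parent_ns)
{
	struct sketch_slice_split s;

	assert(parent_ns >= 0);

	s.parent_ns = parent_ns / 2;
	s.child_ns = s.parent_ns;
	s.resched_parent = false;

	/* A child slice that is too short is replaced by a full one. */
	if (s.child_ns < SKETCH_RESCHED_NS) {
		s.child_ns = SKETCH_TIMESLICE_NS;
		s.resched_parent = true;
	}
	return s;
}
```

In the sketch, as in the function above, the parent keeps its halved slice even when the child is topped up; only the child receives a fresh slice, and the parent is flagged for rescheduling.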

++
++void sched_post_fork(struct task_struct *p)
++{
++}
++
++#ifdef CONFIG_SCHEDSTATS
++
++DEFINE_STATIC_KEY_FALSE(sched_schedstats);
++
++static void set_schedstats(bool enabled)
++{
++	if (enabled)
++		static_branch_enable(&sched_schedstats);
++	else
++		static_branch_disable(&sched_schedstats);
++}
++
++void force_schedstat_enabled(void)
++{
++	if (!schedstat_enabled()) {
++		pr_info("kernel profiling enabled schedstats, disable via kernel.sched_schedstats.\n");
++		static_branch_enable(&sched_schedstats);
++	}
++}
++
++static int __init setup_schedstats(char *str)
++{
++	int ret = 0;
++	if (!str)
++		goto out;
++
++	if (!strcmp(str, "enable")) {
++		set_schedstats(true);
++		ret = 1;
++	} else if (!strcmp(str, "disable")) {
++		set_schedstats(false);
++		ret = 1;
++	}
++out:
++	if (!ret)
++		pr_warn("Unable to parse schedstats=\n");
++
++	return ret;
++}
++__setup("schedstats=", setup_schedstats);
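
Reader's note (not part of the patch): the parsing rule in setup_schedstats() accepts exactly the strings "enable" and "disable" and rejects everything else, including a missing value. A minimal user-space sketch of that rule follows. The helper name and its out-parameter are hypothetical, and the static key is replaced by a plain bool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 and stores the requested setting in
 * *enabled for "enable" or "disable"; returns 0 and leaves *enabled
 * untouched for NULL or any other string.
 */
static int sketch_parse_schedstats(const char *str, bool *enabled)
{
	assert(enabled != NULL);

	if (!str)
		return 0;

	if (!strcmp(str, "enable")) {
		*enabled = true;
		return 1;
	}
	if (!strcmp(str, "disable")) {
		*enabled = false;
		return 1;
	}
	return 0;
}
```

Note that values such as "1" or "on" are rejected by this rule, which matches the warning path in the function above.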

++
++#ifdef CONFIG_PROC_SYSCTL
++static int sysctl_schedstats(struct ctl_table *table, int write, void *buffer,
++		size_t *lenp, loff_t *ppos)
++{
++	struct ctl_table t;
++	int err;
++	int state = static_branch_likely(&sched_schedstats);
++
++	if (write && !capable(CAP_SYS_ADMIN))
++		return -EPERM;
++
++	t = *table;
++	t.data = &state;
++	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
++	if (err < 0)
++		return err;
++	if (write)
++		set_schedstats(state);
++	return err;
++}
++
++static struct ctl_table sched_core_sysctls[] = {
++	{
++		.procname       = "sched_schedstats",
++		.data           = NULL,
++		.maxlen         = sizeof(unsigned int),
++		.mode           = 0644,
++		.proc_handler   = sysctl_schedstats,
++		.extra1         = SYSCTL_ZERO,
++		.extra2         = SYSCTL_ONE,
++	},
++	{}
++};
++static int __init sched_core_sysctl_init(void)
++{
++	register_sysctl_init("kernel", sched_core_sysctls);
++	return 0;
++}
++late_initcall(sched_core_sysctl_init);
++#endif /* CONFIG_PROC_SYSCTL */
++#endif /* CONFIG_SCHEDSTATS */

++
++/*
++ * wake_up_new_task - wake up a newly created task for the first time.
++ *
++ * This function will do some initial scheduler statistics housekeeping
++ * that must be done for every newly created context, then puts the task
++ * on the runqueue and wakes it.
++ */
++void wake_up_new_task(struct task_struct *p)
++{
++	unsigned long flags;
++	struct rq *rq;
++
++	raw_spin_lock_irqsave(&p->pi_lock, flags);
++	WRITE_ONCE(p->__state, TASK_RUNNING);
++	rq = cpu_rq(select_task_rq(p));
++#ifdef CONFIG_SMP
++	rseq_migrate(p);
++	/*
++	 * Fork balancing, do it here and not earlier because:
++	 * - cpus_ptr can change in the fork path
++	 * - any previously selected CPU might disappear through hotplug
++	 *
++	 * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
++	 * as we're not fully set-up yet.
++	 */
++	__set_task_cpu(p, cpu_of(rq));
++#endif
++
++	raw_spin_lock(&rq->lock);
++	update_rq_clock(rq);
++
++	activate_task(p, rq);
++	trace_sched_wakeup_new(p);
++	check_preempt_curr(rq);
++
++	raw_spin_unlock(&rq->lock);
++	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
++}

++
++#ifdef CONFIG_PREEMPT_NOTIFIERS
++
++static DEFINE_STATIC_KEY_FALSE(preempt_notifier_key);
++
++void preempt_notifier_inc(void)
++{
++	static_branch_inc(&preempt_notifier_key);
++}
++EXPORT_SYMBOL_GPL(preempt_notifier_inc);
++
++void preempt_notifier_dec(void)
++{
++	static_branch_dec(&preempt_notifier_key);
++}
++EXPORT_SYMBOL_GPL(preempt_notifier_dec);
++
++/**
++ * preempt_notifier_register - tell me when current is being preempted & rescheduled
++ * @notifier: notifier struct to register
++ */
++void preempt_notifier_register(struct preempt_notifier *notifier)
++{
++	if (!static_branch_unlikely(&preempt_notifier_key))
++		WARN(1, "registering preempt_notifier while notifiers disabled\n");
++
++	hlist_add_head(&notifier->link, &current->preempt_notifiers);
++}
++EXPORT_SYMBOL_GPL(preempt_notifier_register);
++
++/**
++ * preempt_notifier_unregister - no longer interested in preemption notifications
++ * @notifier: notifier struct to unregister
++ *
++ * This is *not* safe to call from within a preemption notifier.
++ */
++void preempt_notifier_unregister(struct preempt_notifier *notifier)
++{
++	hlist_del(&notifier->link);
++}
++EXPORT_SYMBOL_GPL(preempt_notifier_unregister);

++
++static void __fire_sched_in_preempt_notifiers(struct task_struct *curr)
++{
++	struct preempt_notifier *notifier;
++
++	hlist_for_each_entry(notifier, &curr->preempt_notifiers, link)
++		notifier->ops->sched_in(notifier, raw_smp_processor_id());
++}
++
++static __always_inline void fire_sched_in_preempt_notifiers(struct task_struct *curr)
++{
++	if (static_branch_unlikely(&preempt_notifier_key))
++		__fire_sched_in_preempt_notifiers(curr);
++}
++
++static void
++__fire_sched_out_preempt_notifiers(struct task_struct *curr,
++				   struct task_struct *next)
++{
++	struct preempt_notifier *notifier;
++
++	hlist_for_each_entry(notifier, &curr->preempt_notifiers, link)
++		notifier->ops->sched_out(notifier, next);
++}
++
++static __always_inline void
++fire_sched_out_preempt_notifiers(struct task_struct *curr,
++				 struct task_struct *next)
++{
++	if (static_branch_unlikely(&preempt_notifier_key))
++		__fire_sched_out_preempt_notifiers(curr, next);
++}
++
++#else /* !CONFIG_PREEMPT_NOTIFIERS */
++
++static inline void fire_sched_in_preempt_notifiers(struct task_struct *curr)
++{
++}
++
++static inline void
++fire_sched_out_preempt_notifiers(struct task_struct *curr,
++				 struct task_struct *next)
++{
++}
++
++#endif /* CONFIG_PREEMPT_NOTIFIERS */

++
++static inline void prepare_task(struct task_struct *next)
++{
++	/*
++	 * Claim the task as running, we do this before switching to it
++	 * such that any running task will have this set.
++	 *
++	 * See the ttwu() WF_ON_CPU case and its ordering comment.
++	 */
++	WRITE_ONCE(next->on_cpu, 1);
++}
++
++static inline void finish_task(struct task_struct *prev)
++{
++#ifdef CONFIG_SMP
++	/*
++	 * This must be the very last reference to @prev from this CPU. After
++	 * p->on_cpu is cleared, the task can be moved to a different CPU. We
++	 * must ensure this doesn't happen until the switch is completely
++	 * finished.
++	 *
++	 * In particular, the load of prev->state in finish_task_switch() must
++	 * happen before this.
++	 *
++	 * Pairs with the smp_cond_load_acquire() in try_to_wake_up().
++	 */
++	smp_store_release(&prev->on_cpu, 0);
++#else
++	prev->on_cpu = 0;
++#endif
++}
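
Reader's note (not part of the patch): the release/acquire pairing between finish_task() and the waker can be approximated in portable C11. This is a rough sketch under stated assumptions, not the kernel's implementation. C11 atomics stand in for smp_store_release() and smp_cond_load_acquire(), and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_task {
	int saved_context;  /* plain field touched by the CPU switching away */
	atomic_int on_cpu;  /* 1 while some CPU still references the task */
};

/* Old CPU: finish all accesses to the task, then publish on_cpu = 0. */
static void sketch_finish_task(struct sketch_task *prev, int context)
{
	assert(prev != NULL);

	prev->saved_context = context;
	/* Release: the access above cannot be reordered after this store. */
	atomic_store_explicit(&prev->on_cpu, 0, memory_order_release);
}

/* Waker: spin until on_cpu reads 0, then it is safe to use the task. */
static int sketch_wait_on_cpu(struct sketch_task *p)
{
	assert(p != NULL);

	/* Acquire: pairs with the release store in sketch_finish_task(). */
	while (atomic_load_explicit(&p->on_cpu, memory_order_acquire))
		; /* spin */

	return p->saved_context;
}
```

Once the acquire load observes 0, everything the old CPU did to the task before its release store is visible to the waker, which is the property the comments in finish_task() rely on.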

++
++#ifdef CONFIG_SMP
++
++static void do_balance_callbacks(struct rq *rq, struct callback_head *head)
++{
++	void (*func)(struct rq *rq);
++	struct callback_head *next;
++
++	lockdep_assert_held(&rq->lock);
++
++	while (head) {
++		func = (void (*)(struct rq *))head->func;
++		next = head->next;
++		head->next = NULL;
++		head = next;
++
++		func(rq);
++	}
++}
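
Reader's note (not part of the patch): do_balance_callbacks() walks a singly linked list and unlinks each node before invoking its function. A minimal user-space sketch of that pattern follows. The types and names are hypothetical, and the runqueue argument is replaced by an opaque context pointer.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_cb {
	struct sketch_cb *next;
	void (*func)(void *ctx);
};

/* Example callback: counts how many times it has been invoked. */
static void sketch_count_call(void *ctx)
{
	++*(int *)ctx;
}

/* Run every callback on the list, detaching each node before the call. */
static void sketch_run_callbacks(struct sketch_cb *head, void *ctx)
{
	while (head) {
		struct sketch_cb *next = head->next;
		void (*func)(void *ctx) = head->func;

		assert(func != NULL);

		/* Unlink first, so the node is no longer chained when func runs. */
		head->next = NULL;
		head = next;

		func(ctx);
	}
}
```

Saving the next pointer and clearing the link before the call means the loop never reads a node again after its callback has run.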

++
++static void balance_push(struct rq *rq);
++
++/*
++ * balance_push_callback is a right abuse of the callback interface and plays
++ * by significantly different rules.
++ *
++ * Where the normal balance_callback's purpose is to be run in the same context
++ * that queued it (only later, when it's safe to drop rq->lock again),
++ * balance_push_callback is specifically targeted at __schedule().
++ *
++ * This abuse is tolerated because it places all the unlikely/odd cases behind
++ * a single test, namely: rq->balance_callback == NULL.
++ */
++struct callback_head balance_push_callback = {
++	.next = NULL,
++	.func = (void (*)(struct callback_head *))balance_push,
++};
++
++static inline struct callback_head *
++__splice_balance_callbacks(struct rq *rq, bool split)
++{
++	struct callback_head *head = rq->balance_callback;
++
++	if (likely(!head))
++		return NULL;
++
++	lockdep_assert_rq_held(rq);
++	/*
++	 * Must not take balance_push_callback off the list when
++	 * splice_balance_callbacks() and balance_callbacks() are not
++	 * in the same rq->lock section.
++	 *
++	 * In that case it would be possible for __schedule() to interleave
++	 * and observe the list empty.
++	 */
++	if (split && head == &balance_push_callback)
++		head = NULL;
++	else
++		rq->balance_callback = NULL;
++
++	return head;
++}
++
++static inline struct callback_head *splice_balance_callbacks(struct rq *rq)
++{
++	return __splice_balance_callbacks(rq, true);
++}
++
++static void __balance_callbacks(struct rq *rq)
++{
++	do_balance_callbacks(rq, __splice_balance_callbacks(rq, false));
++}
++
++static inline void balance_callbacks(struct rq *rq, struct callback_head *head)
++{
++	unsigned long flags;
++
++	if (unlikely(head)) {
++		raw_spin_lock_irqsave(&rq->lock, flags);
++		do_balance_callbacks(rq, head);
++		raw_spin_unlock_irqrestore(&rq->lock, flags);
++	}
++}

++
++#else
++
++static inline void __balance_callbacks(struct rq *rq)
++{
++}
++
++static inline struct callback_head *splice_balance_callbacks(struct rq *rq)
++{
++	return NULL;
++}
++
++static inline void balance_callbacks(struct rq *rq, struct callback_head *head)
++{
++}
++
++#endif

++
++static inline void
++prepare_lock_switch(struct rq *rq, struct task_struct *next)
++{
++	/*
++	 * Since the runqueue lock will be released by the next
++	 * task (which is an invalid locking op but in the case
++	 * of the scheduler it's an obvious special-case), we
++	 * do an early lockdep release here:
++	 */
++	spin_release(&rq->lock.dep_map, _THIS_IP_);
++#ifdef CONFIG_DEBUG_SPINLOCK
++	/* this is a valid case when another task releases the spinlock */
++	rq->lock.owner = next;
++#endif
++}
++
++static inline void finish_lock_switch(struct rq *rq)
++{
++	/*
++	 * If we are tracking spinlock dependencies then we have to
++	 * fix up the runqueue lock - which gets 'carried over' from
++	 * prev into current:
++	 */
++	spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
++	__balance_callbacks(rq);
++	raw_spin_unlock_irq(&rq->lock);
++}
++
++/*
++ * NOP if the arch has not defined these:
++ */
++
++#ifndef prepare_arch_switch
++# define prepare_arch_switch(next)	do { } while (0)
++#endif
++
++#ifndef finish_arch_post_lock_switch
++# define finish_arch_post_lock_switch()	do { } while (0)
++#endif
++
++static inline void kmap_local_sched_out(void)
++{
++#ifdef CONFIG_KMAP_LOCAL
++	if (unlikely(current->kmap_ctrl.idx))
++		__kmap_local_sched_out();
++#endif
++}
++
++static inline void kmap_local_sched_in(void)
++{
++#ifdef CONFIG_KMAP_LOCAL
++	if (unlikely(current->kmap_ctrl.idx))
++		__kmap_local_sched_in();
++#endif
++}

++
++/**
++ * prepare_task_switch - prepare to switch tasks
++ * @rq: the runqueue preparing to switch
++ * @next: the task we are going to switch to.
++ *
++ * This is called with the rq lock held and interrupts off. It must
++ * be paired with a subsequent finish_task_switch after the context
++ * switch.
++ *
++ * prepare_task_switch sets up locking and calls architecture specific
++ * hooks.
++ */
++static inline void
++prepare_task_switch(struct rq *rq, struct task_struct *prev,
++		    struct task_struct *next)
++{
++	kcov_prepare_switch(prev);
++	sched_info_switch(rq, prev, next);
++	perf_event_task_sched_out(prev, next);
++	rseq_preempt(prev);
++	fire_sched_out_preempt_notifiers(prev, next);
++	kmap_local_sched_out();
++	prepare_task(next);
++	prepare_arch_switch(next);
++}

++
++/**
++ * finish_task_switch - clean up after a task-switch
++ * @rq: runqueue associated with task-switch
++ * @prev: the thread we just switched away from.
++ *
++ * finish_task_switch must be called after the context switch, paired
++ * with a prepare_task_switch call before the context switch.
++ * finish_task_switch will reconcile locking set up by prepare_task_switch,
++ * and do any other architecture-specific cleanup actions.
++ *
++ * Note that we may have delayed dropping an mm in context_switch(). If
++ * so, we finish that here outside of the runqueue lock.  (Doing it
++ * with the lock held can cause deadlocks; see schedule() for
++ * details.)
++ *
++ * The context switch has flipped the stack from under us and restored the
++ * local variables which were saved when this task called schedule() in the
++ * past. prev == current is still correct but we need to recalculate this_rq
++ * because prev may have moved to another CPU.
++ */
++static struct rq *finish_task_switch(struct task_struct *prev)
++	__releases(rq->lock)
++{
++	struct rq *rq = this_rq();
++	struct mm_struct *mm = rq->prev_mm;
++	unsigned int prev_state;
++
++	/*
++	 * The previous task will have left us with a preempt_count of 2
++	 * because it left us after:
++	 *
++	 *	schedule()
++	 *	  preempt_disable();			// 1
++	 *	  __schedule()
++	 *	    raw_spin_lock_irq(&rq->lock)	// 2
++	 *
++	 * Also, see FORK_PREEMPT_COUNT.
++	 */
++	if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
++		      "corrupted preempt_count: %s/%d/0x%x\n",
++		      current->comm, current->pid, preempt_count()))
++		preempt_count_set(FORK_PREEMPT_COUNT);

++
++	rq->prev_mm = NULL;
++
++	/*
++	 * A task struct has one reference for the use as "current".
++	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
++	 * schedule one last time. The schedule call will never return, and
++	 * the scheduled task must drop that reference.
++	 *
++	 * We must observe prev->state before clearing prev->on_cpu (in
++	 * finish_task), otherwise a concurrent wakeup can get prev
++	 * running on another CPU and we could race with its RUNNING -> DEAD
++	 * transition, resulting in a double drop.
++	 */
++	prev_state = READ_ONCE(prev->__state);
++	vtime_task_switch(prev);
++	perf_event_task_sched_in(prev, current);
++	finish_task(prev);
++	tick_nohz_task_switch();
++	finish_lock_switch(rq);
++	finish_arch_post_lock_switch();
++	kcov_finish_switch(current);
++	/*
++	 * kmap_local_sched_out() is invoked with rq::lock held and
++	 * interrupts disabled. There is no requirement for that, but the
++	 * sched out code does not have an interrupt enabled section.
++	 * Restoring the maps on sched in does not require interrupts being
++	 * disabled either.
++	 */
++	kmap_local_sched_in();
++
++	fire_sched_in_preempt_notifiers(current);
++	/*
++	 * When switching through a kernel thread, the loop in
++	 * membarrier_{private,global}_expedited() may have observed that
++	 * kernel thread and not issued an IPI. It is therefore possible to
++	 * schedule between user->kernel->user threads without passing through
++	 * switch_mm(). Membarrier requires a barrier after storing to
++	 * rq->curr, before returning to userspace, so provide them here:
++	 *
++	 * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
++	 *   provided by mmdrop(),
++	 * - a sync_core for SYNC_CORE.
++	 */
++	if (mm) {
++		membarrier_mm_sync_core_before_usermode(mm);
++		mmdrop_sched(mm);
++	}
++	if (unlikely(prev_state == TASK_DEAD)) {
++		/* Task is done with its stack. */
++		put_task_stack(prev);
++
++		put_task_struct_rcu_user(prev);
++	}
++
++	return rq;
++}

++
++/**
++ * schedule_tail - first thing a freshly forked thread must call.
++ * @prev: the thread we just switched away from.
++ */
++asmlinkage __visible void schedule_tail(struct task_struct *prev)
++	__releases(rq->lock)
++{
++	/*
++	 * New tasks start with FORK_PREEMPT_COUNT, see there and
++	 * finish_task_switch() for details.
++	 *
++	 * finish_task_switch() will drop rq->lock() and lower preempt_count
++	 * and the preempt_enable() will end up enabling preemption (on
++	 * PREEMPT_COUNT kernels).
++	 */
++
++	finish_task_switch(prev);
++	preempt_enable();
++
++	if (current->set_child_tid)
++		put_user(task_pid_vnr(current), current->set_child_tid);
++
++	calculate_sigpending();
++}

++
++/*
++ * context_switch - switch to the new MM and the new thread's register state.
++ */
++static __always_inline struct rq *
++context_switch(struct rq *rq, struct task_struct *prev,
++	       struct task_struct *next)
++{
++	prepare_task_switch(rq, prev, next);
++
++	/*
++	 * For paravirt, this is coupled with an exit in switch_to to
++	 * combine the page table reload and the switch backend into
++	 * one hypercall.
++	 */
++	arch_start_context_switch(prev);
++
++	/*
++	 * kernel -> kernel   lazy + transfer active
++	 *   user -> kernel   lazy + mmgrab() active
++	 *
++	 * kernel ->   user   switch + mmdrop() active
++	 *   user ->   user   switch
++	 */
++	if (!next->mm) {                                // to kernel
++		enter_lazy_tlb(prev->active_mm, next);
++
++		next->active_mm = prev->active_mm;
++		if (prev->mm)                           // from user
++			mmgrab(prev->active_mm);
++		else
++			prev->active_mm = NULL;
++	} else {                                        // to user
++		membarrier_switch_mm(rq, prev->active_mm, next->mm);
++		/*
++		 * sys_membarrier() requires an smp_mb() between setting
++		 * rq->curr / membarrier_switch_mm() and returning to userspace.
++		 *
++		 * The below provides this either through switch_mm(), or in
++		 * case 'prev->active_mm == next->mm' through
++		 * finish_task_switch()'s mmdrop().
++		 */
++		switch_mm_irqs_off(prev->active_mm, next->mm, next);
++
++		if (!prev->mm) {                        // from kernel
++			/* will mmdrop() in finish_task_switch(). */
++			rq->prev_mm = prev->active_mm;
++			prev->active_mm = NULL;
++		}
++	}
++
++	prepare_lock_switch(rq, next);
++
++	/* Here we just switch the register state and the stack. */
++	switch_to(prev, next, prev);
++	barrier();
++
++	return finish_task_switch(prev);
++}
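
Reader's note (not part of the patch): the four-row table in the context_switch() comment can be read as a decision function over two facts, whether prev has its own mm and whether next has its own mm. The user-space sketch below encodes only that decision. The struct, its field names, and the function name are hypothetical; no memory-management call is made.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_mm_plan {
	bool lazy_tlb;     /* next borrows prev's active_mm, no switch_mm() */
	bool switch_mm;    /* address space is actually switched */
	bool grab_active;  /* user -> kernel: extra reference on active_mm */
	bool defer_drop;   /* kernel -> user: borrowed mm dropped after switch */
};

static struct sketch_mm_plan sketch_plan_mm(bool prev_has_mm, bool next_has_mm)
{
	struct sketch_mm_plan plan = { false, false, false, false };

	if (!next_has_mm) {
		/* Switching to a kernel thread: stay lazy on prev's active_mm. */
		plan.lazy_tlb = true;
		plan.grab_active = prev_has_mm;
	} else {
		/* Switching to a user task: a real mm switch is required. */
		plan.switch_mm = true;
		plan.defer_drop = !prev_has_mm;
	}

	/* A reference is never both taken and dropped in the same switch. */
	assert(!(plan.grab_active && plan.defer_drop));
	assert(plan.lazy_tlb != plan.switch_mm);
	return plan;
}
```

In the kernel-to-kernel row neither flag is set, which corresponds to "transfer active" in the comment: the borrowed reference simply moves from prev to next.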

++
++/*
++ * nr_running, nr_uninterruptible and nr_context_switches:
++ *
++ * externally visible scheduler statistics: current number of runnable
++ * threads, total number of context switches performed since bootup.
++ */
++unsigned int nr_running(void)
++{
++	unsigned int i, sum = 0;
++
++	for_each_online_cpu(i)
++		sum += cpu_rq(i)->nr_running;
++
++	return sum;
++}
++
++/*
++ * Check if only the current task is running on the CPU.
++ *
++ * Caution: this function does not check that the caller has disabled
++ * preemption, thus the result might have a time-of-check-to-time-of-use
++ * race.  The caller is responsible to use it correctly, for example:
++ *
++ * - from a non-preemptible section (of course)
++ *
++ * - from a thread that is bound to a single CPU
++ *
++ * - in a loop with very short iterations (e.g. a polling loop)
++ */
++bool single_task_running(void)
++{
++	return raw_rq()->nr_running == 1;
++}
++EXPORT_SYMBOL(single_task_running);
++
++unsigned long long nr_context_switches(void)
++{
++	int i;
++	unsigned long long sum = 0;
++
++	for_each_possible_cpu(i)
++		sum += cpu_rq(i)->nr_switches;
++
++	return sum;
++}

4364

++

4365

++/*

4366

++ * Consumers of these two interfaces, like for example the cpuidle menu

4367

++ * governor, are using nonsensical data. Preferring shallow idle state selection

4368

++ * for a CPU that has IO-wait which might not even end up running the task when

4369

++ * it does become runnable.

4370

++ */

4371

++

4372

++unsigned int nr_iowait_cpu(int cpu)

4373

++{

4374

++	return atomic_read(&cpu_rq(cpu)->nr_iowait);

4375

++}

4376

++

4377

++/*

4378

++ * IO-wait accounting, and how it's mostly bollocks (on SMP).

4379

++ *

4380

++ * The idea behind IO-wait account is to account the idle time that we could

4381

++ * have spend running if it were not for IO. That is, if we were to improve the

4382

++ * storage performance, we'd have a proportional reduction in IO-wait time.

4383

++ *

4384

++ * This all works nicely on UP, where, when a task blocks on IO, we account

4385

++ * idle time as IO-wait, because if the storage were faster, it could've been

4386

++ * running and we'd not be idle.

4387

++ *

4388

++ * This has been extended to SMP, by doing the same for each CPU. This however

4389

++ * is broken.

4390

++ *

4391

++ * Imagine for instance the case where two tasks block on one CPU, only the one

4392

++ * CPU will have IO-wait accounted, while the other has regular idle. Even

4393

++ * though, if the storage were faster, both could've run at the same time,

4394

++ * utilising both CPUs.

4395

++ *

4396

++ * This means, that when looking globally, the current IO-wait accounting on

4397

++ * SMP is a lower bound, by reason of under accounting.

4398

++ *

4399

++ * Worse, since the numbers are provided per CPU, they are sometimes

4400

++ * interpreted per CPU, and that is nonsensical. A blocked task isn't strictly

4401

++ * associated with any one particular CPU, it can wake to another CPU than it

4402

++ * blocked on. This means the per CPU IO-wait number is meaningless.

4403

++ *

4404

++ * Task CPU affinities can make all that even more 'interesting'.

4405

++ */

4406

++

4407

++unsigned int nr_iowait(void)

4408

++{

4409

++	unsigned int i, sum = 0;

4410

++

4411

++	for_each_possible_cpu(i)

4412

++		sum += nr_iowait_cpu(i);

4413

++

4414

++	return sum;

4415

++}

4416

++

4417

++#ifdef CONFIG_SMP

4418

++

4419

++/*

4420

++ * sched_exec - execve() is a valuable balancing opportunity, because at

4421

++ * this point the task has the smallest effective memory and cache

4422

++ * footprint.

4423

++ */

4424

++void sched_exec(void)

4425

++{

4426

++}

4427

++

4428

++#endif

4429

++

4430

++DEFINE_PER_CPU(struct kernel_stat, kstat);

4431

++DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat);

4432

++

4433

++EXPORT_PER_CPU_SYMBOL(kstat);

4434

++EXPORT_PER_CPU_SYMBOL(kernel_cpustat);

4435

++

4436

++static inline void update_curr(struct rq *rq, struct task_struct *p)

4437

++{

4438

++	s64 ns = rq->clock_task - p->last_ran;

4439

++

4440

++	p->sched_time += ns;

4441

++	cgroup_account_cputime(p, ns);

4442

++	account_group_exec_runtime(p, ns);

4443

++

4444

++	p->time_slice -= ns;

4445

++	p->last_ran = rq->clock_task;

4446

++}

4447

++

4448

++/*

4449

++ * Return accounted runtime for the task.

4450

++ * Return separately the current's pending runtime that has not been

4451

++ * accounted yet.

4452

++ */

4453

++unsigned long long task_sched_runtime(struct task_struct *p)

4454

++{

4455

++	unsigned long flags;

4456

++	struct rq *rq;

4457

++	raw_spinlock_t *lock;

4458

++	u64 ns;

4459

++

4460

++#if defined(CONFIG_64BIT) && defined(CONFIG_SMP)

4461

++	/*

4462

++	 * 64-bit doesn't need locks to atomically read a 64-bit value.

4463

++	 * So we have an optimization chance when the task's delta_exec is 0.

4464

++	 * Reading ->on_cpu is racy, but this is ok.

4465

++	 *

4466

++	 * If we race with it leaving CPU, we'll take a lock. So we're correct.

4467

++	 * If we race with it entering CPU, unaccounted time is 0. This is

4468

++	 * indistinguishable from the read occurring a few cycles earlier.

4469

++	 * If we see ->on_cpu without ->on_rq, the task is leaving, and has

4470

++	 * been accounted, so we're correct here as well.

4471

++	 */

4472

++	if (!p->on_cpu || !task_on_rq_queued(p))

4473

++		return tsk_seruntime(p);

4474

++#endif

4475

++

4476

++	rq = task_access_lock_irqsave(p, &lock, &flags);

4477

++	/*

4478

++	 * Must be ->curr _and_ ->on_rq.  If dequeued, we would

4479

++	 * project cycles that may never be accounted to this

4480

++	 * thread, breaking clock_gettime().

4481

++	 */

4482

++	if (p == rq->curr && task_on_rq_queued(p)) {

4483

++		update_rq_clock(rq);

4484

++		update_curr(rq, p);

4485

++	}

4486

++	ns = tsk_seruntime(p);

4487

++	task_access_unlock_irqrestore(p, lock, &flags);

4488

++

4489

++	return ns;

4490

++}

4491

++

4492

++/* This manages tasks that have run out of timeslice during a scheduler_tick */

4493

++static inline void scheduler_task_tick(struct rq *rq)

4494

++{

4495

++	struct task_struct *p = rq->curr;

4496

++

4497

++	if (is_idle_task(p))

4498

++		return;

4499

++

4500

++	update_curr(rq, p);

4501

++	cpufreq_update_util(rq, 0);

4502

++

4503

++	/*

4504

++	 * Tasks that have less than RESCHED_NS of time slice left will be

4505

++	 * rescheduled.

4506

++	 */

4507

++	if (p->time_slice >= RESCHED_NS)

4508

++		return;

4509

++	set_tsk_need_resched(p);

4510

++	set_preempt_need_resched();

4511

++}

4512

++

4513

++#ifdef CONFIG_SCHED_DEBUG

4514

++static u64 cpu_resched_latency(struct rq *rq)

4515

++{

4516

++	int latency_warn_ms = READ_ONCE(sysctl_resched_latency_warn_ms);

4517

++	u64 resched_latency, now = rq_clock(rq);

4518

++	static bool warned_once;

4519

++

4520

++	if (sysctl_resched_latency_warn_once && warned_once)

4521

++		return 0;

4522

++

4523

++	if (!need_resched() || !latency_warn_ms)

4524

++		return 0;

4525

++

4526

++	if (system_state == SYSTEM_BOOTING)

4527

++		return 0;

4528

++

4529

++	if (!rq->last_seen_need_resched_ns) {

4530

++		rq->last_seen_need_resched_ns = now;

4531

++		rq->ticks_without_resched = 0;

4532

++		return 0;

4533

++	}

4534

++

4535

++	rq->ticks_without_resched++;

4536

++	resched_latency = now - rq->last_seen_need_resched_ns;

4537

++	if (resched_latency <= latency_warn_ms * NSEC_PER_MSEC)

4538

++		return 0;

4539

++

4540

++	warned_once = true;

4541

++

4542

++	return resched_latency;

4543

++}

4544

++

4545

++static int __init setup_resched_latency_warn_ms(char *str)

4546

++{

4547

++	long val;

4548

++

4549

++	if ((kstrtol(str, 0, &val))) {

4550

++		pr_warn("Unable to set resched_latency_warn_ms\n");

4551

++		return 1;

4552

++	}

4553

++

4554

++	sysctl_resched_latency_warn_ms = val;

4555

++	return 1;

4556

++}

4557

++__setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms);

4558

++#else

4559

++static inline u64 cpu_resched_latency(struct rq *rq) { return 0; }

4560

++#endif /* CONFIG_SCHED_DEBUG */

4561

++

4562

++/*

4563

++ * This function gets called by the timer code, with HZ frequency.

4564

++ * We call it with interrupts disabled.

4565

++ */

4566

++void scheduler_tick(void)

4567

++{

4568

++	int cpu __maybe_unused = smp_processor_id();

4569

++	struct rq *rq = cpu_rq(cpu);

4570

++	u64 resched_latency;

4571

++

4572

++	arch_scale_freq_tick();

4573

++	sched_clock_tick();

4574

++

4575

++	raw_spin_lock(&rq->lock);

4576

++	update_rq_clock(rq);

4577

++

4578

++	scheduler_task_tick(rq);

4579

++	if (sched_feat(LATENCY_WARN))

4580

++		resched_latency = cpu_resched_latency(rq);

4581

++	calc_global_load_tick(rq);

4582

++

4583

++	rq->last_tick = rq->clock;

4584

++	raw_spin_unlock(&rq->lock);

4585

++

4586

++	if (sched_feat(LATENCY_WARN) && resched_latency)

4587

++		resched_latency_warn(cpu, resched_latency);

4588

++

4589

++	perf_event_task_tick();

4590

++}

4591

++

4592

++#ifdef CONFIG_SCHED_SMT

4593

++static inline int sg_balance_cpu_stop(void *data)

4594

++{

4595

++	struct rq *rq = this_rq();

4596

++	struct task_struct *p = data;

4597

++	cpumask_t tmp;

4598

++	unsigned long flags;

4599

++

4600

++	local_irq_save(flags);

4601

++

4602

++	raw_spin_lock(&p->pi_lock);

4603

++	raw_spin_lock(&rq->lock);

4604

++

4605

++	rq->active_balance = 0;

4606

++	/* _something_ may have changed the task, double check again */

4607

++	if (task_on_rq_queued(p) && task_rq(p) == rq &&

4608

++	    cpumask_and(&tmp, p->cpus_ptr, &sched_sg_idle_mask) &&

4609

++	    !is_migration_disabled(p)) {

4610

++		int cpu = cpu_of(rq);

4611

++		int dcpu = __best_mask_cpu(&tmp, per_cpu(sched_cpu_llc_mask, cpu));

4612

++		rq = move_queued_task(rq, p, dcpu);

4613

++	}

4614

++

4615

++	raw_spin_unlock(&rq->lock);

4616

++	raw_spin_unlock(&p->pi_lock);

4617

++

4618

++	local_irq_restore(flags);

4619

++

4620

++	return 0;

4621

++}

4622

++

4623

++/* sg_balance_trigger - trigger sibling group balance for @cpu */

4624

++static inline int sg_balance_trigger(const int cpu)

4625

++{

4626

++	struct rq *rq = cpu_rq(cpu);

4627

++	unsigned long flags;

4628

++	struct task_struct *curr;

4629

++	int res;

4630

++

4631

++	if (!raw_spin_trylock_irqsave(&rq->lock, flags))

4632

++		return 0;

4633

++	curr = rq->curr;

4634

++	res = (!is_idle_task(curr)) && (1 == rq->nr_running) &&\

4635

++	      cpumask_intersects(curr->cpus_ptr, &sched_sg_idle_mask) &&\

4636

++	      !is_migration_disabled(curr) && (!rq->active_balance);

4637

++

4638

++	if (res)

4639

++		rq->active_balance = 1;

4640

++

4641

++	raw_spin_unlock_irqrestore(&rq->lock, flags);

4642

++

4643

++	if (res)

4644

++		stop_one_cpu_nowait(cpu, sg_balance_cpu_stop, curr,

4645

++				    &rq->active_balance_work);

4646

++	return res;

4647

++}

4648

++

4649

++/*

4650

++ * sg_balance - sibling group balance check for run queue @rq

4651

++ */

4652

++static inline void sg_balance(struct rq *rq)

4653

++{

4654

++	cpumask_t chk;

4655

++	int cpu = cpu_of(rq);

4656

++

4657

++	/* exit when cpu is offline */

4658

++	if (unlikely(!rq->online))

4659

++		return;

4660

++

4661

++	/*

4662

++	 * Only a cpu in the sibling idle group will do the checking and then

4663

++	 * find potential cpus which can migrate the current running task

4664

++	 */

4665

++	if (cpumask_test_cpu(cpu, &sched_sg_idle_mask) &&

4666

++	    cpumask_andnot(&chk, cpu_online_mask, sched_rq_watermark) &&

4667

++	    cpumask_andnot(&chk, &chk, &sched_rq_pending_mask)) {

4668

++		int i;

4669

++

4670

++		for_each_cpu_wrap(i, &chk, cpu) {

4671

++			if (cpumask_subset(cpu_smt_mask(i), &chk) &&

4672

++			    sg_balance_trigger(i))

4673

++				return;

4674

++		}

4675

++	}

4676

++}

4677

++#endif /* CONFIG_SCHED_SMT */

4678

++

4679

++#ifdef CONFIG_NO_HZ_FULL

4680

++

4681

++struct tick_work {

4682

++	int			cpu;

4683

++	atomic_t		state;

4684

++	struct delayed_work	work;

4685

++};

4686

++/* Values for ->state, see diagram below. */

4687

++#define TICK_SCHED_REMOTE_OFFLINE	0

4688

++#define TICK_SCHED_REMOTE_OFFLINING	1

4689

++#define TICK_SCHED_REMOTE_RUNNING	2

4690

++

4691

++/*

4692

++ * State diagram for ->state:

4693

++ *

4694

++ *

4695

++ *          TICK_SCHED_REMOTE_OFFLINE

4696

++ *                    |   ^

4697

++ *                    |   |

4698

++ *                    |   | sched_tick_remote()

4699

++ *                    |   |

4700

++ *                    |   |

4701

++ *                    +--TICK_SCHED_REMOTE_OFFLINING

4702

++ *                    |   ^

4703

++ *                    |   |

4704

++ * sched_tick_start() |   | sched_tick_stop()

4705

++ *                    |   |

4706

++ *                    V   |

4707

++ *          TICK_SCHED_REMOTE_RUNNING

4708

++ *

4709

++ *

4710

++ * Other transitions get WARN_ON_ONCE(), except that sched_tick_remote()

4711

++ * and sched_tick_start() are happy to leave the state in RUNNING.

4712

++ */

4713

++

4714

++static struct tick_work __percpu *tick_work_cpu;

4715

++

4716

++static void sched_tick_remote(struct work_struct *work)

4717

++{

4718

++	struct delayed_work *dwork = to_delayed_work(work);

4719

++	struct tick_work *twork = container_of(dwork, struct tick_work, work);

4720

++	int cpu = twork->cpu;

4721

++	struct rq *rq = cpu_rq(cpu);

4722

++	struct task_struct *curr;

4723

++	unsigned long flags;

4724

++	u64 delta;

4725

++	int os;

4726

++

4727

++	/*

4728

++	 * Handle the tick only if it appears the remote CPU is running in full

4729

++	 * dynticks mode. The check is racy by nature, but missing a tick or

4730

++	 * having one too much is no big deal because the scheduler tick updates

4731

++	 * statistics and checks timeslices in a time-independent way, regardless

4732

++	 * of when exactly it is running.

4733

++	 */

4734

++	if (!tick_nohz_tick_stopped_cpu(cpu))

4735

++		goto out_requeue;

4736

++

4737

++	raw_spin_lock_irqsave(&rq->lock, flags);

4738

++	curr = rq->curr;

4739

++	if (cpu_is_offline(cpu))

4740

++		goto out_unlock;

4741

++

4742

++	update_rq_clock(rq);

4743

++	if (!is_idle_task(curr)) {

4744

++		/*

4745

++		 * Make sure the next tick runs within a reasonable

4746

++		 * amount of time.

4747

++		 */

4748

++		delta = rq_clock_task(rq) - curr->last_ran;

4749

++		WARN_ON_ONCE(delta > (u64)NSEC_PER_SEC * 3);

4750

++	}

4751

++	scheduler_task_tick(rq);

4752

++

4753

++	calc_load_nohz_remote(rq);

4754

++out_unlock:

4755

++	raw_spin_unlock_irqrestore(&rq->lock, flags);

4756

++

4757

++out_requeue:

4758

++	/*

4759

++	 * Run the remote tick once per second (1Hz). This arbitrary

4760

++	 * frequency is large enough to avoid overload but short enough

4761

++	 * to keep scheduler internal stats reasonably up to date.  But

4762

++	 * first update state to reflect hotplug activity if required.

4763

++	 */

4764

++	os = atomic_fetch_add_unless(&twork->state, -1, TICK_SCHED_REMOTE_RUNNING);

4765

++	WARN_ON_ONCE(os == TICK_SCHED_REMOTE_OFFLINE);

4766

++	if (os == TICK_SCHED_REMOTE_RUNNING)

4767

++		queue_delayed_work(system_unbound_wq, dwork, HZ);

4768

++}

4769

++

4770

++static void sched_tick_start(int cpu)

4771

++{

4772

++	int os;

4773

++	struct tick_work *twork;

4774

++

4775

++	if (housekeeping_cpu(cpu, HK_TYPE_TICK))

4776

++		return;

4777

++

4778

++	WARN_ON_ONCE(!tick_work_cpu);

4779

++

4780

++	twork = per_cpu_ptr(tick_work_cpu, cpu);

4781

++	os = atomic_xchg(&twork->state, TICK_SCHED_REMOTE_RUNNING);

4782

++	WARN_ON_ONCE(os == TICK_SCHED_REMOTE_RUNNING);

4783

++	if (os == TICK_SCHED_REMOTE_OFFLINE) {

4784

++		twork->cpu = cpu;

4785

++		INIT_DELAYED_WORK(&twork->work, sched_tick_remote);

4786

++		queue_delayed_work(system_unbound_wq, &twork->work, HZ);

4787

++	}

4788

++}

4789

++

4790

++#ifdef CONFIG_HOTPLUG_CPU

4791

++static void sched_tick_stop(int cpu)

4792

++{

4793

++	struct tick_work *twork;

4794

++

4795

++	if (housekeeping_cpu(cpu, HK_TYPE_TICK))

4796

++		return;

4797

++

4798

++	WARN_ON_ONCE(!tick_work_cpu);

4799

++

4800

++	twork = per_cpu_ptr(tick_work_cpu, cpu);

4801

++	cancel_delayed_work_sync(&twork->work);

4802

++}

4803

++#endif /* CONFIG_HOTPLUG_CPU */

4804

++

4805

++int __init sched_tick_offload_init(void)

4806

++{

4807

++	tick_work_cpu = alloc_percpu(struct tick_work);

4808

++	BUG_ON(!tick_work_cpu);

4809

++	return 0;

4810

++}

4811

++

4812

++#else /* !CONFIG_NO_HZ_FULL */

4813

++static inline void sched_tick_start(int cpu) { }

4814

++static inline void sched_tick_stop(int cpu) { }

4815

++#endif

4816

++

4817

++#if defined(CONFIG_PREEMPTION) && (defined(CONFIG_DEBUG_PREEMPT) || \

4818

++				defined(CONFIG_PREEMPT_TRACER))

4819

++/*

4820

++ * If the value passed in is equal to the current preempt count

4821

++ * then we just disabled preemption. Start timing the latency.

4822

++ */

4823

++static inline void preempt_latency_start(int val)

4824

++{

4825

++	if (preempt_count() == val) {

4826

++		unsigned long ip = get_lock_parent_ip();

4827

++#ifdef CONFIG_DEBUG_PREEMPT

4828

++		current->preempt_disable_ip = ip;

4829

++#endif

4830

++		trace_preempt_off(CALLER_ADDR0, ip);

4831

++	}

4832

++}

4833

++

4834

++void preempt_count_add(int val)

4835

++{

4836

++#ifdef CONFIG_DEBUG_PREEMPT

4837

++	/*

4838

++	 * Underflow?

4839

++	 */

4840

++	if (DEBUG_LOCKS_WARN_ON((preempt_count() < 0)))

4841

++		return;

4842

++#endif

4843

++	__preempt_count_add(val);

4844

++#ifdef CONFIG_DEBUG_PREEMPT

4845

++	/*

4846

++	 * Spinlock count overflowing soon?

4847

++	 */

4848

++	DEBUG_LOCKS_WARN_ON((preempt_count() & PREEMPT_MASK) >=

4849

++				PREEMPT_MASK - 10);

4850

++#endif

4851

++	preempt_latency_start(val);

4852

++}

4853

++EXPORT_SYMBOL(preempt_count_add);

4854

++NOKPROBE_SYMBOL(preempt_count_add);

4855

++

4856

++/*

4857

++ * If the value passed in is equal to the current preempt count

4858

++ * then we just enabled preemption. Stop timing the latency.

4859

++ */

4860

++static inline void preempt_latency_stop(int val)

4861

++{

4862

++	if (preempt_count() == val)

4863

++		trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());

4864

++}

4865

++

4866

++void preempt_count_sub(int val)

4867

++{

4868

++#ifdef CONFIG_DEBUG_PREEMPT

4869

++	/*

4870

++	 * Underflow?

4871

++	 */

4872

++	if (DEBUG_LOCKS_WARN_ON(val > preempt_count()))

4873

++		return;

4874

++	/*

4875

++	 * Is the spinlock portion underflowing?

4876

++	 */

4877

++	if (DEBUG_LOCKS_WARN_ON((val < PREEMPT_MASK) &&

4878

++			!(preempt_count() & PREEMPT_MASK)))

4879

++		return;

4880

++#endif

4881

++

4882

++	preempt_latency_stop(val);

4883

++	__preempt_count_sub(val);

4884

++}

4885

++EXPORT_SYMBOL(preempt_count_sub);

4886

++NOKPROBE_SYMBOL(preempt_count_sub);

4887

++

4888

++#else

4889

++static inline void preempt_latency_start(int val) { }

4890

++static inline void preempt_latency_stop(int val) { }

4891

++#endif

4892

++

4893

++static inline unsigned long get_preempt_disable_ip(struct task_struct *p)

4894

++{

4895

++#ifdef CONFIG_DEBUG_PREEMPT

4896

++	return p->preempt_disable_ip;

4897

++#else

4898

++	return 0;

4899

++#endif

4900

++}

4901

++

4902

++/*

4903

++ * Print scheduling while atomic bug:

4904

++ */

4905

++static noinline void __schedule_bug(struct task_struct *prev)

4906

++{

4907

++	/* Save this before calling printk(), since that will clobber it */

4908

++	unsigned long preempt_disable_ip = get_preempt_disable_ip(current);

4909

++

4910

++	if (oops_in_progress)

4911

++		return;

4912

++

4913

++	printk(KERN_ERR "BUG: scheduling while atomic: %s/%d/0x%08x\n",

4914

++		prev->comm, prev->pid, preempt_count());

4915

++

4916

++	debug_show_held_locks(prev);

4917

++	print_modules();

4918

++	if (irqs_disabled())

4919

++		print_irqtrace_events(prev);

4920

++	if (IS_ENABLED(CONFIG_DEBUG_PREEMPT)

4921

++	    && in_atomic_preempt_off()) {

4922

++		pr_err("Preemption disabled at:");

4923

++		print_ip_sym(KERN_ERR, preempt_disable_ip);

4924

++	}

4925

++	if (panic_on_warn)

4926

++		panic("scheduling while atomic\n");

4927

++

4928

++	dump_stack();

4929

++	add_taint(TAINT_WARN, LOCKDEP_STILL_OK);

4930

++}

4931

++

4932

++/*

4933

++ * Various schedule()-time debugging checks and statistics:

4934

++ */

4935

++static inline void schedule_debug(struct task_struct *prev, bool preempt)

4936

++{

4937

++#ifdef CONFIG_SCHED_STACK_END_CHECK

4938

++	if (task_stack_end_corrupted(prev))

4939

++		panic("corrupted stack end detected inside scheduler\n");

4940

++

4941

++	if (task_scs_end_corrupted(prev))

4942

++		panic("corrupted shadow stack detected inside scheduler\n");

4943

++#endif

4944

++

4945

++#ifdef CONFIG_DEBUG_ATOMIC_SLEEP

4946

++	if (!preempt && READ_ONCE(prev->__state) && prev->non_block_count) {

4947

++		printk(KERN_ERR "BUG: scheduling in a non-blocking section: %s/%d/%i\n",

4948

++			prev->comm, prev->pid, prev->non_block_count);

4949

++		dump_stack();

4950

++		add_taint(TAINT_WARN, LOCKDEP_STILL_OK);

4951

++	}

4952

++#endif

4953

++

4954

++	if (unlikely(in_atomic_preempt_off())) {

4955

++		__schedule_bug(prev);

4956

++		preempt_count_set(PREEMPT_DISABLED);

4957

++	}

4958

++	rcu_sleep_check();

4959

++	SCHED_WARN_ON(ct_state() == CONTEXT_USER);

4960

++

4961

++	profile_hit(SCHED_PROFILING, __builtin_return_address(0));

4962

++

4963

++	schedstat_inc(this_rq()->sched_count);

4964

++}

4965

++

4966

++/*

4967

++ * Compile time debug macro

4968

++ * #define ALT_SCHED_DEBUG

4969

++ */

4970

++

4971

++#ifdef ALT_SCHED_DEBUG

4972

++void alt_sched_debug(void)

4973

++{

4974

++	printk(KERN_INFO "sched: pending: 0x%04lx, idle: 0x%04lx, sg_idle: 0x%04lx\n",

4975

++	       sched_rq_pending_mask.bits[0],

4976

++	       sched_rq_watermark[0].bits[0],

4977

++	       sched_sg_idle_mask.bits[0]);

4978

++}

4979

++#else

4980

++inline void alt_sched_debug(void) {}

4981

++#endif

4982

++

4983

++#ifdef	CONFIG_SMP

4984

++

4985

++#define SCHED_RQ_NR_MIGRATION (32U)

4986

++/*

4987

++ * Migrate pending tasks in @rq to @dest_cpu

4988

++ * Will try to migrate the minimum of half of @rq's nr_running tasks and

4989

++ * SCHED_RQ_NR_MIGRATION to @dest_cpu

4990

++ */

4991

++static inline int

4992

++migrate_pending_tasks(struct rq *rq, struct rq *dest_rq, const int dest_cpu)

4993

++{

4994

++	struct task_struct *p, *skip = rq->curr;

4995

++	int nr_migrated = 0;

4996

++	int nr_tries = min(rq->nr_running / 2, SCHED_RQ_NR_MIGRATION);

4997

++

4998

++	while (skip != rq->idle && nr_tries &&

4999

++	       (p = sched_rq_next_task(skip, rq)) != rq->idle) {

5000

++		skip = sched_rq_next_task(p, rq);

5001

++		if (cpumask_test_cpu(dest_cpu, p->cpus_ptr)) {

5002

++			__SCHED_DEQUEUE_TASK(p, rq, 0);

5003

++			set_task_cpu(p, dest_cpu);

5004

++			sched_task_sanity_check(p, dest_rq);

5005

++			__SCHED_ENQUEUE_TASK(p, dest_rq, 0);

5006

++			nr_migrated++;

5007

++		}

5008

++		nr_tries--;

5009

++	}

5010

++

5011

++	return nr_migrated;

5012

++}

5013

++

5014

++static inline int take_other_rq_tasks(struct rq *rq, int cpu)

5015

++{

5016

++	struct cpumask *topo_mask, *end_mask;

5017

++

5018

++	if (unlikely(!rq->online))

5019

++		return 0;

5020

++

5021

++	if (cpumask_empty(&sched_rq_pending_mask))

5022

++		return 0;

5023

++

5024

++	topo_mask = per_cpu(sched_cpu_topo_masks, cpu) + 1;

5025

++	end_mask = per_cpu(sched_cpu_topo_end_mask, cpu);

5026

++	do {

5027

++		int i;

5028

++		for_each_cpu_and(i, &sched_rq_pending_mask, topo_mask) {

5029

++			int nr_migrated;

5030

++			struct rq *src_rq;

5031

++

5032

++			src_rq = cpu_rq(i);

5033

++			if (!do_raw_spin_trylock(&src_rq->lock))

5034

++				continue;

5035

++			spin_acquire(&src_rq->lock.dep_map,

5036

++				     SINGLE_DEPTH_NESTING, 1, _RET_IP_);

5037

++

5038

++			if ((nr_migrated = migrate_pending_tasks(src_rq, rq, cpu))) {

5039

++				src_rq->nr_running -= nr_migrated;

5040

++				if (src_rq->nr_running < 2)

5041

++					cpumask_clear_cpu(i, &sched_rq_pending_mask);

5042

++

5043

++				rq->nr_running += nr_migrated;

5044

++				if (rq->nr_running > 1)

5045

++					cpumask_set_cpu(cpu, &sched_rq_pending_mask);

5046

++

5047

++				cpufreq_update_util(rq, 0);

5048

++

5049

++				spin_release(&src_rq->lock.dep_map, _RET_IP_);

5050

++				do_raw_spin_unlock(&src_rq->lock);

5051

++

5052

++				return 1;

5053

++			}

5054

++

5055

++			spin_release(&src_rq->lock.dep_map, _RET_IP_);

5056

++			do_raw_spin_unlock(&src_rq->lock);

5057

++		}

5058

++	} while (++topo_mask < end_mask);

5059

++

5060

++	return 0;

5061

++}

5062

++#endif

5063

++

5064

++/*

5065

++ * Timeslices below RESCHED_NS are considered as good as expired as there's no

5066

++ * point rescheduling when there's so little time left.

5067

++ */

5068

++static inline void check_curr(struct task_struct *p, struct rq *rq)

5069

++{

5070

++	if (unlikely(rq->idle == p))

5071

++		return;

5072

++

5073

++	update_curr(rq, p);

5074

++

5075

++	if (p->time_slice < RESCHED_NS)

5076

++		time_slice_expired(p, rq);

5077

++}

5078

++

5079

++static inline struct task_struct *

5080

++choose_next_task(struct rq *rq, int cpu, struct task_struct *prev)

5081

++{

5082

++	struct task_struct *next;

5083

++

5084

++	if (unlikely(rq->skip)) {

5085

++		next = rq_runnable_task(rq);

5086

++		if (next == rq->idle) {

5087

++#ifdef	CONFIG_SMP

5088

++			if (!take_other_rq_tasks(rq, cpu)) {

5089

++#endif

5090

++				rq->skip = NULL;

5091

++				schedstat_inc(rq->sched_goidle);

5092

++				return next;

5093

++#ifdef	CONFIG_SMP

5094

++			}

5095

++			next = rq_runnable_task(rq);

5096

++#endif

5097

++		}

5098

++		rq->skip = NULL;

5099

++#ifdef CONFIG_HIGH_RES_TIMERS

5100

++		hrtick_start(rq, next->time_slice);

5101

++#endif

5102

++		return next;

5103

++	}

5104

++

5105

++	next = sched_rq_first_task(rq);

5106

++	if (next == rq->idle) {

5107

++#ifdef	CONFIG_SMP

5108

++		if (!take_other_rq_tasks(rq, cpu)) {

5109

++#endif

5110

++			schedstat_inc(rq->sched_goidle);

5111

++			/*printk(KERN_INFO "sched: choose_next_task(%d) idle %px\n", cpu, next);*/

5112

++			return next;

5113

++#ifdef	CONFIG_SMP

5114

++		}

5115

++		next = sched_rq_first_task(rq);

5116

++#endif

5117

++	}

5118

++#ifdef CONFIG_HIGH_RES_TIMERS

5119

++	hrtick_start(rq, next->time_slice);

5120

++#endif

5121

++	/*printk(KERN_INFO "sched: choose_next_task(%d) next %px\n", cpu,

5122

++	 * next);*/

5123

++	return next;

5124

++}

5125

++

5126

++/*

5127

++ * Constants for the sched_mode argument of __schedule().

5128

++ *

5129

++ * The mode argument allows RT enabled kernels to differentiate a

5130

++ * preemption from blocking on a 'sleeping' spin/rwlock. Note that

5131

++ * SM_MASK_PREEMPT for !RT has all bits set, which allows the compiler to

5132

++ * optimize the AND operation out and just check for zero.

5133

++ */

5134

++#define SM_NONE			0x0

5135

++#define SM_PREEMPT		0x1

5136

++#define SM_RTLOCK_WAIT		0x2

5137

++

5138

++#ifndef CONFIG_PREEMPT_RT

5139

++# define SM_MASK_PREEMPT	(~0U)

5140

++#else

5141

++# define SM_MASK_PREEMPT	SM_PREEMPT

5142

++#endif

5143

++

5144

++/*

5145

++ * schedule() is the main scheduler function.

5146

++ *

5147

++ * The main means of driving the scheduler and thus entering this function are:

5148

++ *

5149

++ *   1. Explicit blocking: mutex, semaphore, waitqueue, etc.

5150

++ *

5151

++ *   2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return

5152

++ *      paths. For example, see arch/x86/entry_64.S.

5153

++ *

5154

++ *      To drive preemption between tasks, the scheduler sets the flag in timer

5155

++ *      interrupt handler scheduler_tick().

5156

++ *

5157

++ *   3. Wakeups don't really cause entry into schedule(). They add a

5158

++ *      task to the run-queue and that's it.

5159

++ *

5160

++ *      Now, if the new task added to the run-queue preempts the current

5161

++ *      task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets

5162

++ *      called on the nearest possible occasion:

5163

++ *

5164

++ *       - If the kernel is preemptible (CONFIG_PREEMPTION=y):

5165

++ *

5166

++ *         - in syscall or exception context, at the next outmost

5167

++ *           preempt_enable(). (this might be as soon as the wake_up()'s

5168

++ *           spin_unlock()!)

5169

++ *

5170

++ *         - in IRQ context, return from interrupt-handler to

5171

++ *           preemptible context

5172

++ *

5173

++ *       - If the kernel is not preemptible (CONFIG_PREEMPTION is not set)

5174

++ *         then at the next:

5175

++ *

5176

++ *          - cond_resched() call

5177

++ *          - explicit schedule() call

5178

++ *          - return from syscall or exception to user-space

5179

++ *          - return from interrupt-handler to user-space

5180

++ *

5181

++ * WARNING: must be called with preemption disabled!

5182

++ */

5183

++static void __sched notrace __schedule(unsigned int sched_mode)

5184

++{

5185

++	struct task_struct *prev, *next;

5186

++	unsigned long *switch_count;

5187

++	unsigned long prev_state;

5188

++	struct rq *rq;

5189

++	int cpu;

5190

++	int deactivated = 0;

5191

++

5192

++	cpu = smp_processor_id();

5193

++	rq = cpu_rq(cpu);

5194

++	prev = rq->curr;

5195

++

5196

++	schedule_debug(prev, !!sched_mode);

5197

++

5198

++	/* Bypass the sched_feat(HRTICK) check, which the Alt schedule FW doesn't support */

5199

++	hrtick_clear(rq);

5200

++

5201

++	local_irq_disable();

5202

++	rcu_note_context_switch(!!sched_mode);

5203

++

5204

++	/*

5205

++	 * Make sure that signal_pending_state()->signal_pending() below

5206

++	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)

5207

++	 * done by the caller to avoid the race with signal_wake_up():

5208

++	 *

5209

++	 * __set_current_state(@state)		signal_wake_up()

5210

++	 * schedule()				  set_tsk_thread_flag(p, TIF_SIGPENDING)

5211

++	 *					  wake_up_state(p, state)

5212

++	 *   LOCK rq->lock			    LOCK p->pi_state

5213

++	 *   smp_mb__after_spinlock()		    smp_mb__after_spinlock()

5214

++	 *     if (signal_pending_state())	    if (p->state & @state)

5215

++	 *

5216

++	 * Also, the membarrier system call requires a full memory barrier

5217

++	 * after coming from user-space, before storing to rq->curr.

5218

++	 */

5219

++	raw_spin_lock(&rq->lock);

5220

++	smp_mb__after_spinlock();

5221

++

5222

++	update_rq_clock(rq);

5223

++

5224

++	switch_count = &prev->nivcsw;

5225

++	/*

5226

++	 * We must load prev->state once (task_struct::state is volatile), such

5227

++	 * that we form a control dependency vs deactivate_task() below.

5228

++	 */

5229

++	prev_state = READ_ONCE(prev->__state);

5230

++	if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {

5231

++		if (signal_pending_state(prev_state, prev)) {

5232

++			WRITE_ONCE(prev->__state, TASK_RUNNING);

5233

++		} else {

5234

++			prev->sched_contributes_to_load =

5235

++				(prev_state & TASK_UNINTERRUPTIBLE) &&

5236

++				!(prev_state & TASK_NOLOAD) &&

5237

++				!(prev->flags & PF_FROZEN);

5238

++

5239

++			if (prev->sched_contributes_to_load)

5240

++				rq->nr_uninterruptible++;

5241

++

5242

++			/*

5243

++			 * __schedule()			ttwu()

5244

++			 *   prev_state = prev->state;    if (p->on_rq && ...)

5245

++			 *   if (prev_state)		    goto out;

5246

++			 *     p->on_rq = 0;		  smp_acquire__after_ctrl_dep();

5247

++			 *				  p->state = TASK_WAKING

5248

++			 *

5249

++			 * Where __schedule() and ttwu() have matching control dependencies.

5250

++			 *

5251

++			 * After this, schedule() must not care about p->state any more.

5252

++			 */

5253

++			sched_task_deactivate(prev, rq);

5254

++			deactivate_task(prev, rq);

5255

++			deactivated = 1;

5256

++

5257

++			if (prev->in_iowait) {

5258

++				atomic_inc(&rq->nr_iowait);

5259

++				delayacct_blkio_start();

5260

++			}

5261

++		}

5262

++		switch_count = &prev->nvcsw;

5263

++	}

5264

++

5265

++	check_curr(prev, rq);

5266

++

5267

++	next = choose_next_task(rq, cpu, prev);

5268

++	clear_tsk_need_resched(prev);

5269

++	clear_preempt_need_resched();

5270

++#ifdef CONFIG_SCHED_DEBUG

5271

++	rq->last_seen_need_resched_ns = 0;

5272

++#endif

5273

++

5274

++	if (likely(prev != next)) {

5275

++		if (deactivated)

5276

++			update_sched_rq_watermark(rq);

5277

++		next->last_ran = rq->clock_task;

5278

++		rq->last_ts_switch = rq->clock;

5279

++

5280

++		rq->nr_switches++;

5281

++		/*

5282

++		 * RCU users of rcu_dereference(rq->curr) may not see

5283

++		 * changes to task_struct made by pick_next_task().

5284

++		 */

5285

++		RCU_INIT_POINTER(rq->curr, next);

5286

++		/*

5287

++		 * The membarrier system call requires each architecture

5288

++		 * to have a full memory barrier after updating

5289

++		 * rq->curr, before returning to user-space.

5290

++		 *

5291

++		 * Here are the schemes providing that barrier on the

5292

++		 * various architectures:

5293

++		 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.

5294

++		 *   switch_mm() relies on membarrier_arch_switch_mm() on PowerPC.

5295

++		 * - finish_lock_switch() for weakly-ordered

5296

++		 *   architectures where spin_unlock is a full barrier,

5297

++		 * - switch_to() for arm64 (weakly-ordered, spin_unlock

5298

++		 *   is a RELEASE barrier),

5299

++		 */

5300

++		++*switch_count;

5301

++

5302

++		psi_sched_switch(prev, next, !task_on_rq_queued(prev));

5303

++

5304

++		trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next, prev_state);

5305

++

5306

++		/* Also unlocks the rq: */

5307

++		rq = context_switch(rq, prev, next);

5308

++	} else {

5309

++		__balance_callbacks(rq);

5310

++		raw_spin_unlock_irq(&rq->lock);

5311

++	}

5312

++

5313

++#ifdef CONFIG_SCHED_SMT

5314

++	sg_balance(rq);

5315

++#endif

5316

++}

5317

++

5318

++void __noreturn do_task_dead(void)

5319

++{

5320

++	/* Causes final put_task_struct in finish_task_switch(): */

5321

++	set_special_state(TASK_DEAD);

5322

++

5323

++	/* Tell freezer to ignore us: */

5324

++	current->flags |= PF_NOFREEZE;

5325

++

5326

++	__schedule(SM_NONE);

5327

++	BUG();

5328

++

5329

++	/* Avoid "noreturn function does return" - but don't continue if BUG() is a NOP: */

5330

++	for (;;)

5331

++		cpu_relax();

5332

++}

5333

++

5334

++static inline void sched_submit_work(struct task_struct *tsk)

5335

++{

5336

++	unsigned int task_flags;

5337

++

5338

++	if (task_is_running(tsk))

5339

++		return;

5340

++

5341

++	task_flags = tsk->flags;

5342

++	/*

5343

++	 * If a worker goes to sleep, notify and ask workqueue whether it

5344

++	 * wants to wake up a task to maintain concurrency.

5345

++	 */

5346

++	if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {

5347

++		if (task_flags & PF_WQ_WORKER)

5348

++			wq_worker_sleeping(tsk);

5349

++		else

5350

++			io_wq_worker_sleeping(tsk);

5351

++	}

5352

++

5353

++	if (tsk_is_pi_blocked(tsk))

5354

++		return;

5355

++

5356

++	/*

5357

++	 * If we are going to sleep and we have plugged IO queued,

5358

++	 * make sure to submit it to avoid deadlocks.

5359

++	 */

5360

++	blk_flush_plug(tsk->plug, true);

5361

++}

5362

++

5363

++static void sched_update_worker(struct task_struct *tsk)

5364

++{

5365

++	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {

5366

++		if (tsk->flags & PF_WQ_WORKER)

5367

++			wq_worker_running(tsk);

5368

++		else

5369

++			io_wq_worker_running(tsk);

5370

++	}

5371

++}

5372

++

5373

++asmlinkage __visible void __sched schedule(void)

5374

++{

5375

++	struct task_struct *tsk = current;

5376

++

5377

++	sched_submit_work(tsk);

5378

++	do {

5379

++		preempt_disable();

5380

++		__schedule(SM_NONE);

5381

++		sched_preempt_enable_no_resched();

5382

++	} while (need_resched());

5383

++	sched_update_worker(tsk);

5384

++}

5385

++EXPORT_SYMBOL(schedule);

5386

++

5387

++/*

5388

++ * synchronize_rcu_tasks() makes sure that no task is stuck in preempted

5389

++ * state (have scheduled out non-voluntarily) by making sure that all

5390

++ * tasks have either left the run queue or have gone into user space.

5391

++ * As idle tasks do not do either, they must not ever be preempted

5392

++ * (schedule out non-voluntarily).

5393

++ *

5394

++ * schedule_idle() is similar to schedule_preempt_disabled() except that it

5395

++ * never enables preemption because it does not call sched_submit_work().

5396

++ */

5397

++void __sched schedule_idle(void)

5398

++{

5399

++	/*

5400

++	 * As this skips calling sched_submit_work(), which the idle task does

5401

++	 * regardless because that function is a nop when the task is in a

5402

++	 * TASK_RUNNING state, make sure this isn't used someplace that the

5403

++	 * current task can be in any other state. Note, idle is always in the

5404

++	 * TASK_RUNNING state.

5405

++	 */

5406

++	WARN_ON_ONCE(current->__state);

5407

++	do {

5408

++		__schedule(SM_NONE);

5409

++	} while (need_resched());

5410

++}

5411

++

5412

++#if defined(CONFIG_CONTEXT_TRACKING) && !defined(CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK)

5413

++asmlinkage __visible void __sched schedule_user(void)

5414

++{

5415

++	/*

5416

++	 * If we come here after a random call to set_need_resched(),

5417

++	 * or we have been woken up remotely but the IPI has not yet arrived,

5418

++	 * we haven't yet exited the RCU idle mode. Do it here manually until

5419

++	 * we find a better solution.

5420

++	 *

5421

++	 * NB: There are buggy callers of this function.  Ideally we

5422

++	 * should warn if prev_state != CONTEXT_USER, but that will trigger

5423

++	 * too frequently to make sense yet.

5424

++	 */

5425

++	enum ctx_state prev_state = exception_enter();

5426

++	schedule();

5427

++	exception_exit(prev_state);

5428

++}

5429

++#endif

5430

++

5431

++/**

5432

++ * schedule_preempt_disabled - called with preemption disabled

5433

++ *

5434

++ * Returns with preemption disabled. Note: preempt_count must be 1

5435

++ */

5436

++void __sched schedule_preempt_disabled(void)

5437

++{

5438

++	sched_preempt_enable_no_resched();

5439

++	schedule();

5440

++	preempt_disable();

5441

++}

5442

++

5443

++#ifdef CONFIG_PREEMPT_RT

5444

++void __sched notrace schedule_rtlock(void)

5445

++{

5446

++	do {

5447

++		preempt_disable();

5448

++		__schedule(SM_RTLOCK_WAIT);

5449

++		sched_preempt_enable_no_resched();

5450

++	} while (need_resched());

5451

++}

5452

++NOKPROBE_SYMBOL(schedule_rtlock);

5453

++#endif

5454

++

5455

++static void __sched notrace preempt_schedule_common(void)

5456

++{

5457

++	do {

5458

++		/*

5459

++		 * Because the function tracer can trace preempt_count_sub()

5460

++		 * and it also uses preempt_enable/disable_notrace(), if

5461

++		 * NEED_RESCHED is set, the preempt_enable_notrace() called

5462

++		 * by the function tracer will call this function again and

5463

++		 * cause infinite recursion.

5464

++		 *

5465

++		 * Preemption must be disabled here before the function

5466

++		 * tracer can trace. Break up preempt_disable() into two

5467

++		 * calls. One to disable preemption without fear of being

5468

++		 * traced. The other to still record the preemption latency,

5469

++		 * which can also be traced by the function tracer.

5470

++		 */

5471

++		preempt_disable_notrace();

5472

++		preempt_latency_start(1);

5473

++		__schedule(SM_PREEMPT);

5474

++		preempt_latency_stop(1);

5475

++		preempt_enable_no_resched_notrace();

5476

++

5477

++		/*

5478

++		 * Check again in case we missed a preemption opportunity

5479

++		 * between schedule and now.

5480

++		 */

5481

++	} while (need_resched());

5482

++}

5483

++

5484

++#ifdef CONFIG_PREEMPTION

5485

++/*

5486

++ * This is the entry point to schedule() from in-kernel preemption

5487

++ * off of preempt_enable.

5488

++ */

5489

++asmlinkage __visible void __sched notrace preempt_schedule(void)

5490

++{

5491

++	/*

5492

++	 * If there is a non-zero preempt_count or interrupts are disabled,

5493

++	 * we do not want to preempt the current task. Just return.

5494

++	 */

5495

++	if (likely(!preemptible()))

5496

++		return;

5497

++

5498

++	preempt_schedule_common();

5499

++}

5500

++NOKPROBE_SYMBOL(preempt_schedule);

5501

++EXPORT_SYMBOL(preempt_schedule);

5502

++

5503

++#ifdef CONFIG_PREEMPT_DYNAMIC

5504

++#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)

5505

++#ifndef preempt_schedule_dynamic_enabled

5506

++#define preempt_schedule_dynamic_enabled	preempt_schedule

5507

++#define preempt_schedule_dynamic_disabled	NULL

5508

++#endif

5509

++DEFINE_STATIC_CALL(preempt_schedule, preempt_schedule_dynamic_enabled);

5510

++EXPORT_STATIC_CALL_TRAMP(preempt_schedule);

5511

++#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)

5512

++static DEFINE_STATIC_KEY_TRUE(sk_dynamic_preempt_schedule);

5513

++void __sched notrace dynamic_preempt_schedule(void)

5514

++{

5515

++	if (!static_branch_unlikely(&sk_dynamic_preempt_schedule))

5516

++		return;

5517

++	preempt_schedule();

5518

++}

5519

++NOKPROBE_SYMBOL(dynamic_preempt_schedule);

5520

++EXPORT_SYMBOL(dynamic_preempt_schedule);

5521

++#endif

5522

++#endif

5523

++

5524

++/**

5525

++ * preempt_schedule_notrace - preempt_schedule called by tracing

5526

++ *

5527

++ * The tracing infrastructure uses preempt_enable_notrace to prevent

5528

++ * recursion and tracing preempt enabling caused by the tracing

5529

++ * infrastructure itself. But as tracing can happen in areas coming

5530

++ * from userspace or just about to enter userspace, a preempt enable

5531

++ * can occur before user_exit() is called. This will cause the scheduler

5532

++ * to be called when the system is still in usermode.

5533

++ *

5534

++ * To prevent this, the preempt_enable_notrace will use this function

5535

++ * instead of preempt_schedule() to exit user context if needed before

5536

++ * calling the scheduler.

5537

++ */

5538

++asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)

5539

++{

5540

++	enum ctx_state prev_ctx;

5541

++

5542

++	if (likely(!preemptible()))

5543

++		return;

5544

++

5545

++	do {

5546

++		/*

5547

++		 * Because the function tracer can trace preempt_count_sub()

5548

++		 * and it also uses preempt_enable/disable_notrace(), if

5549

++		 * NEED_RESCHED is set, the preempt_enable_notrace() called

5550

++		 * by the function tracer will call this function again and

5551

++		 * cause infinite recursion.

5552

++		 *

5553

++		 * Preemption must be disabled here before the function

5554

++		 * tracer can trace. Break up preempt_disable() into two

5555

++		 * calls. One to disable preemption without fear of being

5556

++		 * traced. The other to still record the preemption latency,

5557

++		 * which can also be traced by the function tracer.

5558

++		 */

5559

++		preempt_disable_notrace();

5560

++		preempt_latency_start(1);

5561

++		/*

5562

++		 * Needs preempt disabled in case user_exit() is traced

5563

++		 * and the tracer calls preempt_enable_notrace() causing

5564

++		 * an infinite recursion.

5565

++		 */

5566

++		prev_ctx = exception_enter();

5567

++		__schedule(SM_PREEMPT);

5568

++		exception_exit(prev_ctx);

5569

++

5570

++		preempt_latency_stop(1);

5571

++		preempt_enable_no_resched_notrace();

5572

++	} while (need_resched());

5573

++}

5574

++EXPORT_SYMBOL_GPL(preempt_schedule_notrace);

5575

++

5576

++#ifdef CONFIG_PREEMPT_DYNAMIC

5577

++#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)

5578

++#ifndef preempt_schedule_notrace_dynamic_enabled

5579

++#define preempt_schedule_notrace_dynamic_enabled	preempt_schedule_notrace

5580

++#define preempt_schedule_notrace_dynamic_disabled	NULL

5581

++#endif

5582

++DEFINE_STATIC_CALL(preempt_schedule_notrace, preempt_schedule_notrace_dynamic_enabled);

5583

++EXPORT_STATIC_CALL_TRAMP(preempt_schedule_notrace);

5584

++#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)

5585

++static DEFINE_STATIC_KEY_TRUE(sk_dynamic_preempt_schedule_notrace);

5586

++void __sched notrace dynamic_preempt_schedule_notrace(void)

5587

++{

5588

++	if (!static_branch_unlikely(&sk_dynamic_preempt_schedule_notrace))

5589

++		return;

5590

++	preempt_schedule_notrace();

5591

++}

5592

++NOKPROBE_SYMBOL(dynamic_preempt_schedule_notrace);

5593

++EXPORT_SYMBOL(dynamic_preempt_schedule_notrace);

5594

++#endif

5595

++#endif

5596

++

5597

++#endif /* CONFIG_PREEMPTION */

5598

++

5599

++/*

5600

++ * This is the entry point to schedule() from kernel preemption

5601

++ * off of irq context.

5602

++ * Note that this is called and returns with irqs disabled. This will

5603

++ * protect us against recursive calling from irq.

5604

++ */

5605

++asmlinkage __visible void __sched preempt_schedule_irq(void)

5606

++{

5607

++	enum ctx_state prev_state;

5608

++

5609

++	/* Catch callers which need to be fixed */

5610

++	BUG_ON(preempt_count() || !irqs_disabled());

5611

++

5612

++	prev_state = exception_enter();

5613

++

5614

++	do {

5615

++		preempt_disable();

5616

++		local_irq_enable();

5617

++		__schedule(SM_PREEMPT);

5618

++		local_irq_disable();

5619

++		sched_preempt_enable_no_resched();

5620

++	} while (need_resched());

5621

++

5622

++	exception_exit(prev_state);

5623

++}

5624

++

5625

++int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,

5626

++			  void *key)

5627

++{

5628

++	WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) && wake_flags & ~WF_SYNC);

5629

++	return try_to_wake_up(curr->private, mode, wake_flags);

5630

++}

5631

++EXPORT_SYMBOL(default_wake_function);

5632

++

5633

++static inline void check_task_changed(struct task_struct *p, struct rq *rq)

5634

++{

5635

++	int idx;

5636

++

5637

++	/* Trigger resched if task sched_prio has been modified. */

5638

++	if (task_on_rq_queued(p) && (idx = task_sched_prio_idx(p, rq)) != p->sq_idx) {

5639

++		requeue_task(p, rq, idx);

5640

++		check_preempt_curr(rq);

5641

++	}

5642

++}

5643

++

5644

++static void __setscheduler_prio(struct task_struct *p, int prio)

5645

++{

5646

++	p->prio = prio;

5647

++}

5648

++

5649

++#ifdef CONFIG_RT_MUTEXES

5650

++

5651

++static inline int __rt_effective_prio(struct task_struct *pi_task, int prio)

5652

++{

5653

++	if (pi_task)

5654

++		prio = min(prio, pi_task->prio);

5655

++

5656

++	return prio;

5657

++}

5658

++

5659

++static inline int rt_effective_prio(struct task_struct *p, int prio)

5660

++{

5661

++	struct task_struct *pi_task = rt_mutex_get_top_task(p);

5662

++

5663

++	return __rt_effective_prio(pi_task, prio);

5664

++}

5665

++

5666

++/*

5667

++ * rt_mutex_setprio - set the current priority of a task

5668

++ * @p: task to boost

5669

++ * @pi_task: donor task

5670

++ *

5671

++ * This function changes the 'effective' priority of a task. It does

5672

++ * not touch ->normal_prio like __setscheduler().

5673

++ *

5674

++ * Used by the rt_mutex code to implement priority inheritance

5675

++ * logic. Call site only calls if the priority of the task changed.

5676

++ */

5677

++void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)

5678

++{

5679

++	int prio;

5680

++	struct rq *rq;

5681

++	raw_spinlock_t *lock;

5682

++

5683

++	/* XXX used to be waiter->prio, not waiter->task->prio */

5684

++	prio = __rt_effective_prio(pi_task, p->normal_prio);

5685

++

5686

++	/*

5687

++	 * If nothing changed; bail early.

5688

++	 */

5689

++	if (p->pi_top_task == pi_task && prio == p->prio)

5690

++		return;

5691

++

5692

++	rq = __task_access_lock(p, &lock);

5693

++	/*

5694

++	 * Set under pi_lock && rq->lock, such that the value can be used under

5695

++	 * either lock.

5696

++	 *

5697

++	 * Note that there is loads of trickery to make this pointer cache work

5698

++	 * right. rt_mutex_slowunlock()+rt_mutex_postunlock() work together to

5699

++	 * ensure a task is de-boosted (pi_task is set to NULL) before the

5700

++	 * task is allowed to run again (and can exit). This ensures the pointer

5701

++	 * points to a blocked task -- which guarantees the task is present.

5702

++	 */

5703

++	p->pi_top_task = pi_task;

5704

++

5705

++	/*

5706

++	 * For FIFO/RR we only need to set prio, if that matches we're done.

5707

++	 */

5708

++	if (prio == p->prio)

5709

++		goto out_unlock;

5710

++

5711

++	/*

5712

++	 * Idle task boosting is a no-no in general. There is one

5713

++	 * exception, when PREEMPT_RT and NOHZ are active:

5714

++	 *

5715

++	 * The idle task calls get_next_timer_interrupt() and holds

5716

++	 * the timer wheel base->lock on the CPU and another CPU wants

5717

++	 * to access the timer (probably to cancel it). We can safely

5718

++	 * ignore the boosting request, as the idle CPU runs this code

5719

++	 * with interrupts disabled and will complete the lock

5720

++	 * protected section without being interrupted. So there is no

5721

++	 * real need to boost.

5722

++	 */

5723

++	if (unlikely(p == rq->idle)) {

5724

++		WARN_ON(p != rq->curr);

5725

++		WARN_ON(p->pi_blocked_on);

5726

++		goto out_unlock;

5727

++	}

5728

++

5729

++	trace_sched_pi_setprio(p, pi_task);

5730

++

5731

++	__setscheduler_prio(p, prio);

5732

++

5733

++	check_task_changed(p, rq);

5734

++out_unlock:

5735

++	/* Prevent rq from going away on us: */

5736

++	preempt_disable();

5737

++

5738

++	__balance_callbacks(rq);

5739

++	__task_access_unlock(p, lock);

5740

++

5741

++	preempt_enable();

5742

++}

5743

++#else

5744

++static inline int rt_effective_prio(struct task_struct *p, int prio)

5745

++{

5746

++	return prio;

5747

++}

5748

++#endif

5749

++

5750

++void set_user_nice(struct task_struct *p, long nice)

5751

++{

5752

++	unsigned long flags;

5753

++	struct rq *rq;

5754

++	raw_spinlock_t *lock;

5755

++

5756

++	if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE)

5757

++		return;

5758

++	/*

5759

++	 * We have to be careful, if called from sys_setpriority(),

5760

++	 * the task might be in the middle of scheduling on another CPU.

5761

++	 */

5762

++	raw_spin_lock_irqsave(&p->pi_lock, flags);

5763

++	rq = __task_access_lock(p, &lock);

5764

++

5765

++	p->static_prio = NICE_TO_PRIO(nice);

5766

++	/*

5767

++	 * The RT priorities are set via sched_setscheduler(), but we still

5768

++	 * allow the 'normal' nice value to be set - but as expected

5769

++	 * it won't have any effect on scheduling until the task is

5770

++	 * back to SCHED_NORMAL/SCHED_BATCH:

5771

++	 */

5772

++	if (task_has_rt_policy(p))

5773

++		goto out_unlock;

5774

++

5775

++	p->prio = effective_prio(p);

5776

++

5777

++	check_task_changed(p, rq);

5778

++out_unlock:

5779

++	__task_access_unlock(p, lock);

5780

++	raw_spin_unlock_irqrestore(&p->pi_lock, flags);

5781

++}

5782

++EXPORT_SYMBOL(set_user_nice);

5783

++

5784

++/*

5785

++ * can_nice - check if a task can reduce its nice value

5786

++ * @p: task

5787

++ * @nice: nice value

5788

++ */

5789

++int can_nice(const struct task_struct *p, const int nice)

5790

++{

5791

++	/* Convert nice value [19,-20] to rlimit style value [1,40] */

5792

++	int nice_rlim = nice_to_rlimit(nice);

5793

++

5794

++	return (nice_rlim <= task_rlimit(p, RLIMIT_NICE) ||

5795

++		capable(CAP_SYS_NICE));

5796

++}

5797

++

5798

++#ifdef __ARCH_WANT_SYS_NICE

5799

++

5800

++/*

5801

++ * sys_nice - change the priority of the current process.

5802

++ * @increment: priority increment

5803

++ *

5804

++ * sys_setpriority is a more generic, but much slower function that

5805

++ * does similar things.

5806

++ */

5807

++SYSCALL_DEFINE1(nice, int, increment)

5808

++{

5809

++	long nice, retval;

5810

++

5811

++	/*

5812

++	 * Setpriority might change our priority at the same moment.

5813

++	 * We don't have to worry. Conceptually one call occurs first

5814

++	 * and we have a single winner.

5815

++	 */

5816

++

5817

++	increment = clamp(increment, -NICE_WIDTH, NICE_WIDTH);

5818

++	nice = task_nice(current) + increment;

5819

++

5820

++	nice = clamp_val(nice, MIN_NICE, MAX_NICE);

5821

++	if (increment < 0 && !can_nice(current, nice))

5822

++		return -EPERM;

5823

++

5824

++	retval = security_task_setnice(current, nice);

5825

++	if (retval)

5826

++		return retval;

5827

++

5828

++	set_user_nice(current, nice);

5829

++	return 0;

5830

++}

5831

++

5832

++#endif

5833

++

5834

++/**

5835

++ * task_prio - return the priority value of a given task.

5836

++ * @p: the task in question.

5837

++ *

5838

++ * Return: The priority value as seen by users in /proc.

5839

++ *

5840

++ * sched policy         return value   kernel prio    user prio/nice

5841

++ *

5842

++ * (BMQ)normal, batch, idle[0 ... 53]  [100 ... 139]          0/[-20 ... 19]/[-7 ... 7]

5843

++ * (PDS)normal, batch, idle[0 ... 39]            100          0/[-20 ... 19]

5844

++ * fifo, rr             [-1 ... -100]     [99 ... 0]  [0 ... 99]

5845

++ */

5846

++int task_prio(const struct task_struct *p)

5847

++{

5848

++	return (p->prio < MAX_RT_PRIO) ? p->prio - MAX_RT_PRIO :

5849

++		task_sched_prio_normal(p, task_rq(p));

5850

++}

5851

++

5852

++/**

5853

++ * idle_cpu - is a given CPU idle currently?

5854

++ * @cpu: the processor in question.

5855

++ *

5856

++ * Return: 1 if the CPU is currently idle. 0 otherwise.

5857

++ */

5858

++int idle_cpu(int cpu)

5859

++{

5860

++	struct rq *rq = cpu_rq(cpu);

5861

++

5862

++	if (rq->curr != rq->idle)

5863

++		return 0;

5864

++

5865

++	if (rq->nr_running)

5866

++		return 0;

5867

++

5868

++#ifdef CONFIG_SMP

5869

++	if (rq->ttwu_pending)

5870

++		return 0;

5871

++#endif

5872

++

5873

++	return 1;

5874

++}

5875

++

5876

++/**

5877

++ * idle_task - return the idle task for a given CPU.

5878

++ * @cpu: the processor in question.

5879

++ *

5880

++ * Return: The idle task for the cpu @cpu.

5881

++ */

5882

++struct task_struct *idle_task(int cpu)

5883

++{

5884

++	return cpu_rq(cpu)->idle;

5885

++}

5886

++

5887

++/**

5888

++ * find_process_by_pid - find a process with a matching PID value.

5889

++ * @pid: the pid in question.

5890

++ *

5891

++ * The task of @pid, if found. %NULL otherwise.

5892

++ */

5893

++static inline struct task_struct *find_process_by_pid(pid_t pid)

5894

++{

5895

++	return pid ? find_task_by_vpid(pid) : current;

5896

++}

5897

++

5898

++/*

5899

++ * sched_setparam() passes in -1 for its policy, to let the functions

5900

++ * it calls know not to change it.

5901

++ */

5902

++#define SETPARAM_POLICY -1

5903

++

5904

++static void __setscheduler_params(struct task_struct *p,

5905

++		const struct sched_attr *attr)

5906

++{

5907

++	int policy = attr->sched_policy;

5908

++

5909

++	if (policy == SETPARAM_POLICY)

5910

++		policy = p->policy;

5911

++

5912

++	p->policy = policy;

5913

++

5914

++	/*

5915

++	 * allow normal nice value to be set, but will not have any

5916

++	 * effect on scheduling until the task is SCHED_NORMAL/

5917

++	 * SCHED_BATCH

5918

++	 */

5919

++	p->static_prio = NICE_TO_PRIO(attr->sched_nice);

5920

++

5921

++	/*

5922

++	 * __sched_setscheduler() ensures attr->sched_priority == 0 when

5923

++	 * !rt_policy. Always setting this ensures that things like

5924

++	 * getparam()/getattr() don't report silly values for !rt tasks.

5925

++	 */

5926

++	p->rt_priority = attr->sched_priority;

5927

++	p->normal_prio = normal_prio(p);

5928

++}

5929

++

5930

++/*

5931

++ * check the target process has a UID that matches the current process's

5932

++ */

5933

++static bool check_same_owner(struct task_struct *p)

5934

++{

5935

++	const struct cred *cred = current_cred(), *pcred;

5936

++	bool match;

5937

++

5938

++	rcu_read_lock();

5939

++	pcred = __task_cred(p);

5940

++	match = (uid_eq(cred->euid, pcred->euid) ||

5941

++		 uid_eq(cred->euid, pcred->uid));

5942

++	rcu_read_unlock();

5943

++	return match;

5944

++}

5945

++

5946

++static int __sched_setscheduler(struct task_struct *p,

5947

++				const struct sched_attr *attr,

5948

++				bool user, bool pi)

5949

++{

5950

++	const struct sched_attr dl_squash_attr = {

5951

++		.size		= sizeof(struct sched_attr),

5952

++		.sched_policy	= SCHED_FIFO,

5953

++		.sched_nice	= 0,

5954

++		.sched_priority = 99,

5955

++	};

5956

++	int oldpolicy = -1, policy = attr->sched_policy;

5957

++	int retval, newprio;

5958

++	struct callback_head *head;

5959

++	unsigned long flags;

5960

++	struct rq *rq;

5961

++	int reset_on_fork;

5962

++	raw_spinlock_t *lock;

5963

++

5964

++	/* The pi code expects interrupts enabled */

5965

++	BUG_ON(pi && in_interrupt());

5966

++

5967

++	/*

5968

++	 * Alt schedule FW supports SCHED_DEADLINE by squashing it into prio 0 SCHED_FIFO

5969

++	 */

5970

++	if (unlikely(SCHED_DEADLINE == policy)) {

5971

++		attr = &dl_squash_attr;

5972

++		policy = attr->sched_policy;

5973

++	}

5974

++recheck:

5975

++	/* Double check policy once rq lock held */

5976

++	if (policy < 0) {

5977

++		reset_on_fork = p->sched_reset_on_fork;

5978

++		policy = oldpolicy = p->policy;

5979

++	} else {

5980

++		reset_on_fork = !!(attr->sched_flags & SCHED_FLAG_RESET_ON_FORK);

5981

++

5982

++		if (policy > SCHED_IDLE)

5983

++			return -EINVAL;

5984

++	}

5985

++

5986

++	if (attr->sched_flags & ~(SCHED_FLAG_ALL))

5987

++		return -EINVAL;

5988

++

5989

++	/*

5990

++	 * Valid priorities for SCHED_FIFO and SCHED_RR are

5991

++	 * 1..MAX_RT_PRIO-1, valid priority for SCHED_NORMAL and

5992

++	 * SCHED_BATCH and SCHED_IDLE is 0.

5993

++	 */

5994

++	if (attr->sched_priority < 0 ||

5995

++	    (p->mm && attr->sched_priority > MAX_RT_PRIO - 1) ||

5996

++	    (!p->mm && attr->sched_priority > MAX_RT_PRIO - 1))

5997

++		return -EINVAL;

5998

++	if ((SCHED_RR == policy || SCHED_FIFO == policy) !=

5999

++	    (attr->sched_priority != 0))

6000

++		return -EINVAL;

6001

++

6002

++	/*

6003

++	 * Allow unprivileged RT tasks to decrease priority:

6004

++	 */

6005

++	if (user && !capable(CAP_SYS_NICE)) {

6006

++		if (SCHED_FIFO == policy || SCHED_RR == policy) {

6007

++			unsigned long rlim_rtprio =

6008

++					task_rlimit(p, RLIMIT_RTPRIO);

6009

++

6010

++			/* Can't set/change the rt policy */

6011

++			if (policy != p->policy && !rlim_rtprio)

6012

++				return -EPERM;

6013

++

6014

++			/* Can't increase priority */

6015

++			if (attr->sched_priority > p->rt_priority &&

6016

++			    attr->sched_priority > rlim_rtprio)

6017

++				return -EPERM;

6018

++		}

6019

++

6020

++		/* Can't change other user's priorities */

6021

++		if (!check_same_owner(p))

6022

++			return -EPERM;

6023

++

6024

++		/* Normal users shall not reset the sched_reset_on_fork flag */

6025

++		if (p->sched_reset_on_fork && !reset_on_fork)

6026

++			return -EPERM;

6027

++	}

6028

++

6029

++	if (user) {

6030

++		retval = security_task_setscheduler(p);

6031

++		if (retval)

6032

++			return retval;

6033

++	}

6034

++

6035

++	if (pi)

6036

++		cpuset_read_lock();

6037

++

6038

++	/*

6039

++	 * Make sure no PI-waiters arrive (or leave) while we are

6040

++	 * changing the priority of the task:

6041

++	 */

6042

++	raw_spin_lock_irqsave(&p->pi_lock, flags);

6043

++

6044

++	/*

6045

++	 * To be able to change p->policy safely, task_access_lock()

6046

++	 * must be called.

6047

++	 * If task_access_lock() is used here:

6048

++	 * For the task p which is not running, reading rq->stop is

6049

++	 * racy but acceptable as ->stop doesn't change much.

6050

++	 * An enhancement can be made to read rq->stop safely.

6051

++	 */

6052

++	rq = __task_access_lock(p, &lock);

6053

++

6054

++	/*

6055

++	 * Changing the policy of the stop threads is a very bad idea

6056

++	 */

6057

++	if (p == rq->stop) {

6058

++		retval = -EINVAL;

6059

++		goto unlock;

6060

++	}

6061

++

6062

++	/*

6063

++	 * If not changing anything there's no need to proceed further:

6064

++	 */

6065

++	if (unlikely(policy == p->policy)) {

6066

++		if (rt_policy(policy) && attr->sched_priority != p->rt_priority)

6067

++			goto change;

6068

++		if (!rt_policy(policy) &&

6069

++		    NICE_TO_PRIO(attr->sched_nice) != p->static_prio)

6070

++			goto change;

6071

++

6072

++		p->sched_reset_on_fork = reset_on_fork;

6073

++		retval = 0;

6074

++		goto unlock;

6075

++	}

6076

++change:

6077

++

6078

++	/* Re-check policy now with rq lock held */

6079

++	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {

6080

++		policy = oldpolicy = -1;

6081

++		__task_access_unlock(p, lock);

6082

++		raw_spin_unlock_irqrestore(&p->pi_lock, flags);

6083

++		if (pi)

6084

++			cpuset_read_unlock();

6085

++		goto recheck;

6086

++	}

6087

++

6088

++	p->sched_reset_on_fork = reset_on_fork;

6089

++

6090

++	newprio = __normal_prio(policy, attr->sched_priority, NICE_TO_PRIO(attr->sched_nice));

6091

++	if (pi) {

6092

++		/*

6093

++		 * Take priority boosted tasks into account. If the new

6094

++		 * effective priority is unchanged, we just store the new

6095

++		 * normal parameters and do not touch the scheduler class and

6096

++		 * the runqueue. This will be done when the task deboosts

6097

++		 * itself.

6098

++		 */

6099

++		newprio = rt_effective_prio(p, newprio);

6100

++	}

6101

++

6102

++	if (!(attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)) {

6103

++		__setscheduler_params(p, attr);

6104

++		__setscheduler_prio(p, newprio);

6105

++	}

6106

++

6107

++	check_task_changed(p, rq);

6108

++

6109

++	/* Prevent rq from going away on us: */

6110

++	preempt_disable();

6111

++	head = splice_balance_callbacks(rq);

6112

++	__task_access_unlock(p, lock);

6113

++	raw_spin_unlock_irqrestore(&p->pi_lock, flags);

6114

++

6115

++	if (pi) {

6116

++		cpuset_read_unlock();

6117

++		rt_mutex_adjust_pi(p);

6118

++	}

6119

++

6120

++	/* Run balance callbacks after we've adjusted the PI chain: */

6121

++	balance_callbacks(rq, head);

6122

++	preempt_enable();

6123

++

6124

++	return 0;

6125

++

6126

++unlock:

6127

++	__task_access_unlock(p, lock);

6128

++	raw_spin_unlock_irqrestore(&p->pi_lock, flags);

6129

++	if (pi)

6130

++		cpuset_read_unlock();

6131

++	return retval;

6132

++}

6133

++

6134

++static int _sched_setscheduler(struct task_struct *p, int policy,

6135

++			       const struct sched_param *param, bool check)

6136

++{

6137

++	struct sched_attr attr = {

6138

++		.sched_policy   = policy,

6139

++		.sched_priority = param->sched_priority,

6140

++		.sched_nice     = PRIO_TO_NICE(p->static_prio),

6141

++	};

6142

++

6143

++	/* Fixup the legacy SCHED_RESET_ON_FORK hack. */

6144

++	if ((policy != SETPARAM_POLICY) && (policy & SCHED_RESET_ON_FORK)) {

6145

++		attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;

6146

++		policy &= ~SCHED_RESET_ON_FORK;

6147

++		attr.sched_policy = policy;

6148

++	}

6149

++

6150

++	return __sched_setscheduler(p, &attr, check, true);

6151

++}

6152

++

6153

++/**

6154

++ * sched_setscheduler - change the scheduling policy and/or RT priority of a thread.

6155

++ * @p: the task in question.

6156

++ * @policy: new policy.

6157

++ * @param: structure containing the new RT priority.

6158

++ *

6159

++ * Use sched_set_fifo(), read its comment.

6160

++ *

6161

++ * Return: 0 on success. An error code otherwise.

6162

++ *

6163

++ * NOTE that the task may be already dead.

6164

++ */

6165

++int sched_setscheduler(struct task_struct *p, int policy,

6166

++		       const struct sched_param *param)

6167

++{

6168

++	return _sched_setscheduler(p, policy, param, true);

6169

++}

6170

++

6171

++int sched_setattr(struct task_struct *p, const struct sched_attr *attr)

6172

++{

6173

++	return __sched_setscheduler(p, attr, true, true);

6174

++}

6175

++

6176

++int sched_setattr_nocheck(struct task_struct *p, const struct sched_attr *attr)

6177

++{

6178

++	return __sched_setscheduler(p, attr, false, true);

6179

++}

6180

++EXPORT_SYMBOL_GPL(sched_setattr_nocheck);

6181

++

6182

++/**

6183

++ * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.

6184

++ * @p: the task in question.

6185

++ * @policy: new policy.

6186

++ * @param: structure containing the new RT priority.

6187

++ *

6188

++ * Just like sched_setscheduler, only don't bother checking if the

6189

++ * current context has permission.  For example, this is needed in

6190

++ * stop_machine(): we create temporary high priority worker threads,

6191

++ * but our caller might not have that capability.

6192

++ *

6193

++ * Return: 0 on success. An error code otherwise.

6194

++ */

6195

++int sched_setscheduler_nocheck(struct task_struct *p, int policy,

6196

++			       const struct sched_param *param)

6197

++{

6198

++	return _sched_setscheduler(p, policy, param, false);

6199

++}

6200

++

6201

++/*

6202

++ * SCHED_FIFO is a broken scheduler model; that is, it is fundamentally

6203

++ * incapable of resource management, which is the one thing an OS really should

6204

++ * be doing.

6205

++ *

6206

++ * This is of course the reason it is limited to privileged users only.

6207

++ *

6208

++ * Worse still; it is fundamentally impossible to compose static priority

6209

++ * workloads. You cannot take two correctly working static prio workloads

6210

++ * and smash them together and still expect them to work.

6211

++ *

6212

++ * For this reason 'all' FIFO tasks the kernel creates are basically at:

6213

++ *

6214

++ *   MAX_RT_PRIO / 2

6215

++ *

6216

++ * The administrator _MUST_ configure the system, the kernel simply doesn't

6217

++ * know enough information to make a sensible choice.

6218

++ */

6219

++void sched_set_fifo(struct task_struct *p)

6220

++{

6221

++	struct sched_param sp = { .sched_priority = MAX_RT_PRIO / 2 };

6222

++	WARN_ON_ONCE(sched_setscheduler_nocheck(p, SCHED_FIFO, &sp) != 0);

6223

++}

6224

++EXPORT_SYMBOL_GPL(sched_set_fifo);

6225

++

6226

++/*

6227

++ * For when you don't much care about FIFO, but want to be above SCHED_NORMAL.

6228

++ */

6229

++void sched_set_fifo_low(struct task_struct *p)

6230

++{

6231

++	struct sched_param sp = { .sched_priority = 1 };

6232

++	WARN_ON_ONCE(sched_setscheduler_nocheck(p, SCHED_FIFO, &sp) != 0);

6233

++}

6234

++EXPORT_SYMBOL_GPL(sched_set_fifo_low);

6235

++

6236

++void sched_set_normal(struct task_struct *p, int nice)

6237

++{

6238

++	struct sched_attr attr = {

6239

++		.sched_policy = SCHED_NORMAL,

6240

++		.sched_nice = nice,

6241

++	};

6242

++	WARN_ON_ONCE(sched_setattr_nocheck(p, &attr) != 0);

6243

++}

6244

++EXPORT_SYMBOL_GPL(sched_set_normal);

6245

++

6246

++static int

6247

++do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)

6248

++{

6249

++	struct sched_param lparam;

6250

++	struct task_struct *p;

6251

++	int retval;

6252

++

6253

++	if (!param || pid < 0)

6254

++		return -EINVAL;

6255

++	if (copy_from_user(&lparam, param, sizeof(struct sched_param)))

6256

++		return -EFAULT;

6257

++

6258

++	rcu_read_lock();

6259

++	retval = -ESRCH;

6260

++	p = find_process_by_pid(pid);

6261

++	if (likely(p))

6262

++		get_task_struct(p);

6263

++	rcu_read_unlock();

6264

++

6265

++	if (likely(p)) {

6266

++		retval = sched_setscheduler(p, policy, &lparam);

6267

++		put_task_struct(p);

6268

++	}

6269

++

6270

++	return retval;

6271

++}

6272

++

6273

++/*

6274

++ * Mimics kernel/events/core.c perf_copy_attr().

6275

++ */

6276

++static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *attr)

6277

++{

6278

++	u32 size;

6279

++	int ret;

6280

++

6281

++	/* Zero the full structure, so that a short copy will be nice: */

6282

++	memset(attr, 0, sizeof(*attr));

6283

++

6284

++	ret = get_user(size, &uattr->size);

6285

++	if (ret)

6286

++		return ret;

6287

++

6288

++	/* ABI compatibility quirk: */

6289

++	if (!size)

6290

++		size = SCHED_ATTR_SIZE_VER0;

6291

++

6292

++	if (size < SCHED_ATTR_SIZE_VER0 || size > PAGE_SIZE)

6293

++		goto err_size;

6294

++

6295

++	ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);

6296

++	if (ret) {

6297

++		if (ret == -E2BIG)

6298

++			goto err_size;

6299

++		return ret;

6300

++	}

6301

++

6302

++	/*

6303

++	 * XXX: Do we want to be lenient like existing syscalls; or do we want

6304

++	 * to be strict and return an error on out-of-bounds values?

6305

++	 */

6306

++	attr->sched_nice = clamp(attr->sched_nice, -20, 19);

6307

++

6308

++	/* sched/core.c uses zero here but we already know ret is zero */

6309

++	return 0;

6310

++

6311

++err_size:

6312

++	put_user(sizeof(*attr), &uattr->size);

6313

++	return -E2BIG;

6314

++}

6315

++

6316

++/**

6317

++ * sys_sched_setscheduler - set/change the scheduler policy and RT priority

6318

++ * @pid: the pid in question.

6319

++ * @policy: new policy.

6320

++ * @param: structure containing the new RT priority.

6321

++ *

6322

++ * Return: 0 on success. An error code otherwise.

6323

++ */

6324

++SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy, struct sched_param __user *, param)

6325

++{

6326

++	if (policy < 0)

6327

++		return -EINVAL;

6328

++

6329

++	return do_sched_setscheduler(pid, policy, param);

6330

++}

6331

++

6332

++/**

6333

++ * sys_sched_setparam - set/change the RT priority of a thread

6334

++ * @pid: the pid in question.

6335

++ * @param: structure containing the new RT priority.

6336

++ *

6337

++ * Return: 0 on success. An error code otherwise.

6338

++ */

6339

++SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)

6340

++{

6341

++	return do_sched_setscheduler(pid, SETPARAM_POLICY, param);

6342

++}

6343

++

6344

++/**

6345

++ * sys_sched_setattr - same as above, but with extended sched_attr

6346

++ * @pid: the pid in question.

6347

++ * @uattr: structure containing the extended parameters.

6348

++ */

6349

++SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,

6350

++			       unsigned int, flags)

6351

++{

6352

++	struct sched_attr attr;

6353

++	struct task_struct *p;

6354

++	int retval;

6355

++

6356

++	if (!uattr || pid < 0 || flags)

6357

++		return -EINVAL;

6358

++

6359

++	retval = sched_copy_attr(uattr, &attr);

6360

++	if (retval)

6361

++		return retval;

6362

++

6363

++	if ((int)attr.sched_policy < 0)

6364

++		return -EINVAL;

6365

++

6366

++	rcu_read_lock();

6367

++	retval = -ESRCH;

6368

++	p = find_process_by_pid(pid);

6369

++	if (likely(p))

6370

++		get_task_struct(p);

6371

++	rcu_read_unlock();

6372

++

6373

++	if (likely(p)) {

6374

++		retval = sched_setattr(p, &attr);

6375

++		put_task_struct(p);

6376

++	}

6377

++

6378

++	return retval;

6379

++}

6380

++

6381

++/**

6382

++ * sys_sched_getscheduler - get the policy (scheduling class) of a thread

6383

++ * @pid: the pid in question.

6384

++ *

6385

++ * Return: On success, the policy of the thread. Otherwise, a negative error

6386

++ * code.

6387

++ */

6388

++SYSCALL_DEFINE1(sched_getscheduler, pid_t, pid)

6389

++{

6390

++	struct task_struct *p;

6391

++	int retval = -EINVAL;

6392

++

6393

++	if (pid < 0)

6394

++		goto out_nounlock;

6395

++

6396

++	retval = -ESRCH;

6397

++	rcu_read_lock();

6398

++	p = find_process_by_pid(pid);

6399

++	if (p) {

6400

++		retval = security_task_getscheduler(p);

6401

++		if (!retval)

6402

++			retval = p->policy;

6403

++	}

6404

++	rcu_read_unlock();

6405

++

6406

++out_nounlock:

6407

++	return retval;

6408

++}

6409

++

6410

++/**

6411

++ * sys_sched_getparam - get the RT priority of a thread

6412

++ * @pid: the pid in question.

6413

++ * @param: structure containing the RT priority.

6414

++ *

6415

++ * Return: On success, 0 and the RT priority is in @param. Otherwise, an error

6416

++ * code.

6417

++ */

6418

++SYSCALL_DEFINE2(sched_getparam, pid_t, pid, struct sched_param __user *, param)

6419

++{

6420

++	struct sched_param lp = { .sched_priority = 0 };

6421

++	struct task_struct *p;

6422

++	int retval = -EINVAL;

6423

++

6424

++	if (!param || pid < 0)

6425

++		goto out_nounlock;

6426

++

6427

++	rcu_read_lock();

6428

++	p = find_process_by_pid(pid);

6429

++	retval = -ESRCH;

6430

++	if (!p)

6431

++		goto out_unlock;

6432

++

6433

++	retval = security_task_getscheduler(p);

6434

++	if (retval)

6435

++		goto out_unlock;

6436

++

6437

++	if (task_has_rt_policy(p))

6438

++		lp.sched_priority = p->rt_priority;

6439

++	rcu_read_unlock();

6440

++

6441

++	/*

6442

++	 * This one might sleep, we cannot do it with a spinlock held ...

6443

++	 */

6444

++	retval = copy_to_user(param, &lp, sizeof(*param)) ? -EFAULT : 0;

6445

++

6446

++out_nounlock:

6447

++	return retval;

6448

++

6449

++out_unlock:

6450

++	rcu_read_unlock();

6451

++	return retval;

6452

++}

6453

++

6454

++/*

6455

++ * Copy the kernel size attribute structure (which might be larger

6456

++ * than what user-space knows about) to user-space.

6457

++ *

6458

++ * Note that all cases are valid: user-space buffer can be larger or

6459

++ * smaller than the kernel-space buffer. The usual case is that both

6460

++ * have the same size.

6461

++ */

6462

++static int

6463

++sched_attr_copy_to_user(struct sched_attr __user *uattr,

6464

++			struct sched_attr *kattr,

6465

++			unsigned int usize)

6466

++{

6467

++	unsigned int ksize = sizeof(*kattr);

6468

++

6469

++	if (!access_ok(uattr, usize))

6470

++		return -EFAULT;

6471

++

6472

++	/*

6473

++	 * sched_getattr() ABI forwards and backwards compatibility:

6474

++	 *

6475

++	 * If usize == ksize then we just copy everything to user-space and all is good.

6476

++	 *

6477

++	 * If usize < ksize then we only copy as much as user-space has space for,

6478

++	 * this keeps ABI compatibility as well. We skip the rest.

6479

++	 *

6480

++	 * If usize > ksize then user-space is using a newer version of the ABI,

6481

++	 * part of which the kernel doesn't know about. Just ignore it - tooling can

6482

++	 * detect the kernel's knowledge of attributes from the attr->size value

6483

++	 * which is set to ksize in this case.

6484

++	 */

6485

++	kattr->size = min(usize, ksize);

6486

++

6487

++	if (copy_to_user(uattr, kattr, kattr->size))

6488

++		return -EFAULT;

6489

++

6490

++	return 0;

6491

++}

6492

++

6493

++/**

6494

++ * sys_sched_getattr - similar to sched_getparam, but with sched_attr

6495

++ * @pid: the pid in question.

6496

++ * @uattr: structure containing the extended parameters.

6497

++ * @usize: sizeof(attr) for fwd/bwd comp.

6498

++ * @flags: for future extension.

6499

++ */

6500

++SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,

6501

++		unsigned int, usize, unsigned int, flags)

6502

++{

6503

++	struct sched_attr kattr = { };

6504

++	struct task_struct *p;

6505

++	int retval;

6506

++

6507

++	if (!uattr || pid < 0 || usize > PAGE_SIZE ||

6508

++	    usize < SCHED_ATTR_SIZE_VER0 || flags)

6509

++		return -EINVAL;

6510

++

6511

++	rcu_read_lock();

6512

++	p = find_process_by_pid(pid);

6513

++	retval = -ESRCH;

6514

++	if (!p)

6515

++		goto out_unlock;

6516

++

6517

++	retval = security_task_getscheduler(p);

6518

++	if (retval)

6519

++		goto out_unlock;

6520

++

6521

++	kattr.sched_policy = p->policy;

6522

++	if (p->sched_reset_on_fork)

6523

++		kattr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;

6524

++	if (task_has_rt_policy(p))

6525

++		kattr.sched_priority = p->rt_priority;

6526

++	else

6527

++		kattr.sched_nice = task_nice(p);

6528

++	kattr.sched_flags &= SCHED_FLAG_ALL;

6529

++

6530

++#ifdef CONFIG_UCLAMP_TASK

6531

++	kattr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value;

6532

++	kattr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value;

6533

++#endif

6534

++

6535

++	rcu_read_unlock();

6536

++

6537

++	return sched_attr_copy_to_user(uattr, &kattr, usize);

6538

++

6539

++out_unlock:

6540

++	rcu_read_unlock();

6541

++	return retval;

6542

++}

6543

++

6544

++static int

6545

++__sched_setaffinity(struct task_struct *p, const struct cpumask *mask)

6546

++{

6547

++	int retval;

6548

++	cpumask_var_t cpus_allowed, new_mask;

6549

++

6550

++	if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL))

6551

++		return -ENOMEM;

6552

++

6553

++	if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) {

6554

++		retval = -ENOMEM;

6555

++		goto out_free_cpus_allowed;

6556

++	}

6557

++

6558

++	cpuset_cpus_allowed(p, cpus_allowed);

6559

++	cpumask_and(new_mask, mask, cpus_allowed);

6560

++again:

6561

++	retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK | SCA_USER);

6562

++	if (retval)

6563

++		goto out_free_new_mask;

6564

++

6565

++	cpuset_cpus_allowed(p, cpus_allowed);

6566

++	if (!cpumask_subset(new_mask, cpus_allowed)) {

6567

++		/*

6568

++		 * We must have raced with a concurrent cpuset

6569

++		 * update. Just reset the cpus_allowed to the

6570

++		 * cpuset's cpus_allowed

6571

++		 */

6572

++		cpumask_copy(new_mask, cpus_allowed);

6573

++		goto again;

6574

++	}

6575

++

6576

++out_free_new_mask:

6577

++	free_cpumask_var(new_mask);

6578

++out_free_cpus_allowed:

6579

++	free_cpumask_var(cpus_allowed);

6580

++	return retval;

6581

++}

6582

++

6583

++long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)

6584

++{

6585

++	struct task_struct *p;

6586

++	int retval;

6587

++

6588

++	rcu_read_lock();

6589

++

6590

++	p = find_process_by_pid(pid);

6591

++	if (!p) {

6592

++		rcu_read_unlock();

6593

++		return -ESRCH;

6594

++	}

6595

++

6596

++	/* Prevent p going away */

6597

++	get_task_struct(p);

6598

++	rcu_read_unlock();

6599

++

6600

++	if (p->flags & PF_NO_SETAFFINITY) {

6601

++		retval = -EINVAL;

6602

++		goto out_put_task;

6603

++	}

6604

++

6605

++	if (!check_same_owner(p)) {

6606

++		rcu_read_lock();

6607

++		if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {

6608

++			rcu_read_unlock();

6609

++			retval = -EPERM;

6610

++			goto out_put_task;

6611

++		}

6612

++		rcu_read_unlock();

6613

++	}

6614

++

6615

++	retval = security_task_setscheduler(p);

6616

++	if (retval)

6617

++		goto out_put_task;

6618

++

6619

++	retval = __sched_setaffinity(p, in_mask);

6620

++out_put_task:

6621

++	put_task_struct(p);

6622

++	return retval;

6623

++}

6624

++

6625

++static int get_user_cpu_mask(unsigned long __user *user_mask_ptr, unsigned len,

6626

++			     struct cpumask *new_mask)

6627

++{

6628

++	if (len < cpumask_size())

6629

++		cpumask_clear(new_mask);

6630

++	else if (len > cpumask_size())

6631

++		len = cpumask_size();

6632

++

6633

++	return copy_from_user(new_mask, user_mask_ptr, len) ? -EFAULT : 0;

6634

++}

6635

++

6636

++/**

6637

++ * sys_sched_setaffinity - set the CPU affinity of a process

6638

++ * @pid: pid of the process

6639

++ * @len: length in bytes of the bitmask pointed to by user_mask_ptr

6640

++ * @user_mask_ptr: user-space pointer to the new CPU mask

6641

++ *

6642

++ * Return: 0 on success. An error code otherwise.

6643

++ */

6644

++SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,

6645

++		unsigned long __user *, user_mask_ptr)

6646

++{

6647

++	cpumask_var_t new_mask;

6648

++	int retval;

6649

++

6650

++	if (!alloc_cpumask_var(&new_mask, GFP_KERNEL))

6651

++		return -ENOMEM;

6652

++

6653

++	retval = get_user_cpu_mask(user_mask_ptr, len, new_mask);

6654

++	if (retval == 0)

6655

++		retval = sched_setaffinity(pid, new_mask);

6656

++	free_cpumask_var(new_mask);

6657

++	return retval;

6658

++}

6659

++

6660

++long sched_getaffinity(pid_t pid, cpumask_t *mask)

6661

++{

6662

++	struct task_struct *p;

6663

++	raw_spinlock_t *lock;

6664

++	unsigned long flags;

6665

++	int retval;

6666

++

6667

++	rcu_read_lock();

6668

++

6669

++	retval = -ESRCH;

6670

++	p = find_process_by_pid(pid);

6671

++	if (!p)

6672

++		goto out_unlock;

6673

++

6674

++	retval = security_task_getscheduler(p);

6675

++	if (retval)

6676

++		goto out_unlock;

6677

++

6678

++	task_access_lock_irqsave(p, &lock, &flags);

6679

++	cpumask_and(mask, &p->cpus_mask, cpu_active_mask);

6680

++	task_access_unlock_irqrestore(p, lock, &flags);

6681

++

6682

++out_unlock:

6683

++	rcu_read_unlock();

6684

++

6685

++	return retval;

6686

++}

6687

++

6688

++/**

6689

++ * sys_sched_getaffinity - get the CPU affinity of a process

6690

++ * @pid: pid of the process

6691

++ * @len: length in bytes of the bitmask pointed to by user_mask_ptr

6692

++ * @user_mask_ptr: user-space pointer to hold the current CPU mask

6693

++ *

6694

++ * Return: size of CPU mask copied to user_mask_ptr on success. An

6695

++ * error code otherwise.

6696

++ */

6697

++SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,

6698

++		unsigned long __user *, user_mask_ptr)

6699

++{

6700

++	int ret;

6701

++	cpumask_var_t mask;

6702

++

6703

++	if ((len * BITS_PER_BYTE) < nr_cpu_ids)

6704

++		return -EINVAL;

6705

++	if (len & (sizeof(unsigned long)-1))

6706

++		return -EINVAL;

6707

++

6708

++	if (!alloc_cpumask_var(&mask, GFP_KERNEL))

6709

++		return -ENOMEM;

6710

++

6711

++	ret = sched_getaffinity(pid, mask);

6712

++	if (ret == 0) {

6713

++		unsigned int retlen = min_t(size_t, len, cpumask_size());

6714

++

6715

++		if (copy_to_user(user_mask_ptr, mask, retlen))

6716

++			ret = -EFAULT;

6717

++		else

6718

++			ret = retlen;

6719

++	}

6720

++	free_cpumask_var(mask);

6721

++

6722

++	return ret;

6723

++}

6724

++

6725

++static void do_sched_yield(void)

6726

++{

6727

++	struct rq *rq;

6728

++	struct rq_flags rf;

6729

++

6730

++	if (!sched_yield_type)

6731

++		return;

6732

++

6733

++	rq = this_rq_lock_irq(&rf);

6734

++

6735

++	schedstat_inc(rq->yld_count);

6736

++

6737

++	if (1 == sched_yield_type) {

6738

++		if (!rt_task(current))

6739

++			do_sched_yield_type_1(current, rq);

6740

++	} else if (2 == sched_yield_type) {

6741

++		if (rq->nr_running > 1)

6742

++			rq->skip = current;

6743

++	}

6744

++

6745

++	preempt_disable();

6746

++	raw_spin_unlock_irq(&rq->lock);

6747

++	sched_preempt_enable_no_resched();

6748

++

6749

++	schedule();

6750

++}

6751

++

6752

++/**

6753

++ * sys_sched_yield - yield the current processor to other threads.

6754

++ *

6755

++ * This function yields the current CPU to other tasks. If there are no

6756

++ * other threads running on this CPU then this function will return.

6757

++ *

6758

++ * Return: 0.

6759

++ */

6760

++SYSCALL_DEFINE0(sched_yield)

6761

++{

6762

++	do_sched_yield();

6763

++	return 0;

6764

++}

6765

++

6766

++#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)

6767

++int __sched __cond_resched(void)

6768

++{

6769

++	if (should_resched(0)) {

6770

++		preempt_schedule_common();

6771

++		return 1;

6772

++	}

6773

++	/*

6774

++	 * In preemptible kernels, ->rcu_read_lock_nesting tells the tick

6775

++	 * whether the current CPU is in an RCU read-side critical section,

6776

++	 * so the tick can report quiescent states even for CPUs looping

6777

++	 * in kernel context.  In contrast, in non-preemptible kernels,

6778

++	 * RCU readers leave no in-memory hints, which means that CPU-bound

6779

++	 * processes executing in kernel context might never report an

6780

++	 * RCU quiescent state.  Therefore, the following code causes

6781

++	 * cond_resched() to report a quiescent state, but only when RCU

6782

++	 * is in urgent need of one.

6783

++	 */

6784

++#ifndef CONFIG_PREEMPT_RCU

6785

++	rcu_all_qs();

6786

++#endif

6787

++	return 0;

6788

++}

6789

++EXPORT_SYMBOL(__cond_resched);

6790

++#endif

6791

++

6792

++#ifdef CONFIG_PREEMPT_DYNAMIC

6793

++#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)

6794

++#define cond_resched_dynamic_enabled	__cond_resched

6795

++#define cond_resched_dynamic_disabled	((void *)&__static_call_return0)

6796

++DEFINE_STATIC_CALL_RET0(cond_resched, __cond_resched);

6797

++EXPORT_STATIC_CALL_TRAMP(cond_resched);

6798

++

6799

++#define might_resched_dynamic_enabled	__cond_resched

6800

++#define might_resched_dynamic_disabled	((void *)&__static_call_return0)

6801

++DEFINE_STATIC_CALL_RET0(might_resched, __cond_resched);

6802

++EXPORT_STATIC_CALL_TRAMP(might_resched);

6803

++#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)

6804

++static DEFINE_STATIC_KEY_FALSE(sk_dynamic_cond_resched);

6805

++int __sched dynamic_cond_resched(void)

6806

++{

6807

++	if (!static_branch_unlikely(&sk_dynamic_cond_resched))

6808

++		return 0;

6809

++	return __cond_resched();

6810

++}

6811

++EXPORT_SYMBOL(dynamic_cond_resched);

6812

++

6813

++static DEFINE_STATIC_KEY_FALSE(sk_dynamic_might_resched);

6814

++int __sched dynamic_might_resched(void)

6815

++{

6816

++	if (!static_branch_unlikely(&sk_dynamic_might_resched))

6817

++		return 0;

6818

++	return __cond_resched();

6819

++}

6820

++EXPORT_SYMBOL(dynamic_might_resched);

6821

++#endif

6822

++#endif

6823

++

6824

++/*

6825

++ * __cond_resched_lock() - if a reschedule is pending, drop the given lock,

6826

++ * call schedule, and on return reacquire the lock.

6827

++ *

6828

++ * This works OK both with and without CONFIG_PREEMPTION.  We do strange low-level

6829

++ * operations here to prevent schedule() from being called twice (once via

6830

++ * spin_unlock(), once by hand).

6831

++ */

6832

++int __cond_resched_lock(spinlock_t *lock)

6833

++{

6834

++	int resched = should_resched(PREEMPT_LOCK_OFFSET);

6835

++	int ret = 0;

6836

++

6837

++	lockdep_assert_held(lock);

6838

++

6839

++	if (spin_needbreak(lock) || resched) {

6840

++		spin_unlock(lock);

6841

++		if (!_cond_resched())

6842

++			cpu_relax();

6843

++		ret = 1;

6844

++		spin_lock(lock);

6845

++	}

6846

++	return ret;

6847

++}

6848

++EXPORT_SYMBOL(__cond_resched_lock);

6849

++

6850

++int __cond_resched_rwlock_read(rwlock_t *lock)

6851

++{

6852

++	int resched = should_resched(PREEMPT_LOCK_OFFSET);

6853

++	int ret = 0;

6854

++

6855

++	lockdep_assert_held_read(lock);

6856

++

6857

++	if (rwlock_needbreak(lock) || resched) {

6858

++		read_unlock(lock);

6859

++		if (!_cond_resched())

6860

++			cpu_relax();

6861

++		ret = 1;

6862

++		read_lock(lock);

6863

++	}

6864

++	return ret;

6865

++}

6866

++EXPORT_SYMBOL(__cond_resched_rwlock_read);

6867

++

6868

++int __cond_resched_rwlock_write(rwlock_t *lock)

6869

++{

6870

++	int resched = should_resched(PREEMPT_LOCK_OFFSET);

6871

++	int ret = 0;

6872

++

6873

++	lockdep_assert_held_write(lock);

6874

++

6875

++	if (rwlock_needbreak(lock) || resched) {

6876

++		write_unlock(lock);

6877

++		if (!_cond_resched())

6878

++			cpu_relax();

6879

++		ret = 1;

6880

++		write_lock(lock);

6881

++	}

6882

++	return ret;

6883

++}

6884

++EXPORT_SYMBOL(__cond_resched_rwlock_write);

6885

++

6886

++#ifdef CONFIG_PREEMPT_DYNAMIC

6887

++

6888

++#ifdef CONFIG_GENERIC_ENTRY

6889

++#include <linux/entry-common.h>

6890

++#endif

6891

++

6892

++/*

6893

++ * SC:cond_resched

6894

++ * SC:might_resched

6895

++ * SC:preempt_schedule

6896

++ * SC:preempt_schedule_notrace

6897

++ * SC:irqentry_exit_cond_resched

6898

++ *

6899

++ *

6900

++ * NONE:

6901

++ *   cond_resched               <- __cond_resched

6902

++ *   might_resched              <- RET0

6903

++ *   preempt_schedule           <- NOP

6904

++ *   preempt_schedule_notrace   <- NOP

6905

++ *   irqentry_exit_cond_resched <- NOP

6906

++ *

6907

++ * VOLUNTARY:

6908

++ *   cond_resched               <- __cond_resched

6909

++ *   might_resched              <- __cond_resched

6910

++ *   preempt_schedule           <- NOP

6911

++ *   preempt_schedule_notrace   <- NOP

6912

++ *   irqentry_exit_cond_resched <- NOP

6913

++ *

6914

++ * FULL:

6915

++ *   cond_resched               <- RET0

6916

++ *   might_resched              <- RET0

6917

++ *   preempt_schedule           <- preempt_schedule

6918

++ *   preempt_schedule_notrace   <- preempt_schedule_notrace

6919

++ *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched

6920

++ */

6921

++

6922

++enum {

6923

++	preempt_dynamic_undefined = -1,

6924

++	preempt_dynamic_none,

6925

++	preempt_dynamic_voluntary,

6926

++	preempt_dynamic_full,

6927

++};

6928

++

6929

++int preempt_dynamic_mode = preempt_dynamic_undefined;

6930

++

6931

++int sched_dynamic_mode(const char *str)

6932

++{

6933

++	if (!strcmp(str, "none"))

6934

++		return preempt_dynamic_none;

6935

++

6936

++	if (!strcmp(str, "voluntary"))

6937

++		return preempt_dynamic_voluntary;

6938

++

6939

++	if (!strcmp(str, "full"))

6940

++		return preempt_dynamic_full;

6941

++

6942

++	return -EINVAL;

6943

++}

6944

++

6945

++#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)

6946

++#define preempt_dynamic_enable(f)	static_call_update(f, f##_dynamic_enabled)

6947

++#define preempt_dynamic_disable(f)	static_call_update(f, f##_dynamic_disabled)

6948

++#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)

6949

++#define preempt_dynamic_enable(f)	static_key_enable(&sk_dynamic_##f.key)

6950

++#define preempt_dynamic_disable(f)	static_key_disable(&sk_dynamic_##f.key)

6951

++#else

6952

++#error "Unsupported PREEMPT_DYNAMIC mechanism"

6953

++#endif

6954

++

6955

++void sched_dynamic_update(int mode)

6956

++{

6957

++	/*

6958

++	 * Avoid {NONE,VOLUNTARY} -> FULL transitions from ever ending up in

6959

++	 * the ZERO state, which is invalid.

6960

++	 */

6961

++	preempt_dynamic_enable(cond_resched);

6962

++	preempt_dynamic_enable(might_resched);

6963

++	preempt_dynamic_enable(preempt_schedule);

6964

++	preempt_dynamic_enable(preempt_schedule_notrace);

6965

++	preempt_dynamic_enable(irqentry_exit_cond_resched);

6966

++

6967

++	switch (mode) {

6968

++	case preempt_dynamic_none:

6969

++		preempt_dynamic_enable(cond_resched);

6970

++		preempt_dynamic_disable(might_resched);

6971

++		preempt_dynamic_disable(preempt_schedule);

6972

++		preempt_dynamic_disable(preempt_schedule_notrace);

6973

++		preempt_dynamic_disable(irqentry_exit_cond_resched);

6974

++		pr_info("Dynamic Preempt: none\n");

6975

++		break;

6976

++

6977

++	case preempt_dynamic_voluntary:

6978

++		preempt_dynamic_enable(cond_resched);

6979

++		preempt_dynamic_enable(might_resched);

6980

++		preempt_dynamic_disable(preempt_schedule);

6981

++		preempt_dynamic_disable(preempt_schedule_notrace);

6982

++		preempt_dynamic_disable(irqentry_exit_cond_resched);

6983

++		pr_info("Dynamic Preempt: voluntary\n");

6984

++		break;

6985

++

6986

++	case preempt_dynamic_full:

6987

++		preempt_dynamic_disable(cond_resched);

6988

++		preempt_dynamic_disable(might_resched);

6989

++		preempt_dynamic_enable(preempt_schedule);

6990

++		preempt_dynamic_enable(preempt_schedule_notrace);

6991

++		preempt_dynamic_enable(irqentry_exit_cond_resched);

6992

++		pr_info("Dynamic Preempt: full\n");

6993

++		break;

6994

++	}

6995

++

6996

++	preempt_dynamic_mode = mode;

6997

++}

6998

++

6999

++static int __init setup_preempt_mode(char *str)

7000

++{

7001

++	int mode = sched_dynamic_mode(str);

7002

++	if (mode < 0) {

7003

++		pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);

7004

++		return 0;

7005

++	}

7006

++

7007

++	sched_dynamic_update(mode);

7008

++	return 1;

7009

++}

7010

++__setup("preempt=", setup_preempt_mode);

7011

++

7012

++static void __init preempt_dynamic_init(void)

7013

++{

7014

++	if (preempt_dynamic_mode == preempt_dynamic_undefined) {

7015

++		if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {

7016

++			sched_dynamic_update(preempt_dynamic_none);

7017

++		} else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {

7018

++			sched_dynamic_update(preempt_dynamic_voluntary);

7019

++		} else {

7020

++			/* Default static call setting, nothing to do */

7021

++			WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));

7022

++			preempt_dynamic_mode = preempt_dynamic_full;

7023

++			pr_info("Dynamic Preempt: full\n");

7024

++		}

7025

++	}

7026

++}

7027

++

7028

++#define PREEMPT_MODEL_ACCESSOR(mode) \

7029

++	bool preempt_model_##mode(void)						 \

7030

++	{									 \

7031

++		WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \

7032

++		return preempt_dynamic_mode == preempt_dynamic_##mode;		 \

7033

++	}									 \

7034

++	EXPORT_SYMBOL_GPL(preempt_model_##mode)

7035

++

7036

++PREEMPT_MODEL_ACCESSOR(none);

7037

++PREEMPT_MODEL_ACCESSOR(voluntary);

7038

++PREEMPT_MODEL_ACCESSOR(full);

7039

++

7040

++#else /* !CONFIG_PREEMPT_DYNAMIC */

7041

++

7042

++static inline void preempt_dynamic_init(void) { }

7043

++

7044

++#endif /* #ifdef CONFIG_PREEMPT_DYNAMIC */

7045

++

7046

++/**

7047

++ * yield - yield the current processor to other threads.

7048

++ *

7049

++ * Do not ever use this function, there's a 99% chance you're doing it wrong.

7050

++ *

7051

++ * The scheduler is at all times free to pick the calling task as the most

7052

++ * eligible task to run, if removing the yield() call from your code breaks

7053

++ * it, it's already broken.

7054

++ *

7055

++ * Typical broken usage is:

7056

++ *

7057

++ * while (!event)

7058

++ * 	yield();

7059

++ *

7060

++ * where one assumes that yield() will let 'the other' process run that will

7061

++ * make event true. If the current task is a SCHED_FIFO task that will never

7062

++ * happen. Never use yield() as a progress guarantee!!

7063

++ *

7064

++ * If you want to use yield() to wait for something, use wait_event().

7065

++ * If you want to use yield() to be 'nice' for others, use cond_resched().

7066

++ * If you still want to use yield(), do not!

7067

++ */

7068

++void __sched yield(void)

7069

++{

7070

++	set_current_state(TASK_RUNNING);

7071

++	do_sched_yield();

7072

++}

7073

++EXPORT_SYMBOL(yield);

7074

++

7075

++/**

7076

++ * yield_to - yield the current processor to another thread in

7077

++ * your thread group, or accelerate that thread toward the

7078

++ * processor it's on.

7079

++ * @p: target task

7080

++ * @preempt: whether task preemption is allowed or not

7081

++ *

7082

++ * It's the caller's job to ensure that the target task struct

7083

++ * can't go away on us before we can do any checks.

7084

++ *

7085

++ * In Alt schedule FW, yield_to is not supported; this stub always returns 0.

7086

++ *

7087

++ * Return:

7088

++ *	true (>0) if we indeed boosted the target task.

7089

++ *	false (0) if we failed to boost the target.

7090

++ *	-ESRCH if there's no task to yield to.

7091

++ */

7092

++int __sched yield_to(struct task_struct *p, bool preempt)

7093

++{

7094

++	return 0;

7095

++}

7096

++EXPORT_SYMBOL_GPL(yield_to);

7097

++

7098

++int io_schedule_prepare(void)

7099

++{

7100

++	int old_iowait = current->in_iowait;

7101

++

7102

++	current->in_iowait = 1;

7103

++	blk_flush_plug(current->plug, true);

7104

++	return old_iowait;

7105

++}

7106

++

7107

++void io_schedule_finish(int token)

7108

++{

7109

++	current->in_iowait = token;

7110

++}

7111

++

7112

++/*

7113

++ * This task is about to go to sleep on IO.  Increment rq->nr_iowait so

7114

++ * that process accounting knows that this is a task in IO wait state.

7115

++ *

7116

++ * But don't do that if it is a deliberate, throttling IO wait (this task

7117

++ * has set its backing_dev_info: the queue against which it should throttle)

7118

++ */

7119

++

7120

++long __sched io_schedule_timeout(long timeout)

7121

++{

7122

++	int token;

7123

++	long ret;

7124

++

7125

++	token = io_schedule_prepare();

7126

++	ret = schedule_timeout(timeout);

7127

++	io_schedule_finish(token);

7128

++

7129

++	return ret;

7130

++}

7131

++EXPORT_SYMBOL(io_schedule_timeout);

7132

++

7133

++void __sched io_schedule(void)

7134

++{

7135

++	int token;

7136

++

7137

++	token = io_schedule_prepare();

7138

++	schedule();

7139

++	io_schedule_finish(token);

7140

++}

7141

++EXPORT_SYMBOL(io_schedule);

7142

++

7143

++/**
++ * sys_sched_get_priority_max - return maximum RT priority.
++ * @policy: scheduling class.
++ *
++ * Return: On success, this syscall returns the maximum
++ * rt_priority that can be used by a given scheduling class.
++ * On failure, a negative error code is returned.
++ */
++SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
++{
++	int ret = -EINVAL;
++
++	switch (policy) {
++	case SCHED_FIFO:
++	case SCHED_RR:
++		ret = MAX_RT_PRIO - 1;
++		break;
++	case SCHED_NORMAL:
++	case SCHED_BATCH:
++	case SCHED_IDLE:
++		ret = 0;
++		break;
++	}
++	return ret;
++}
++
++/**
++ * sys_sched_get_priority_min - return minimum RT priority.
++ * @policy: scheduling class.
++ *
++ * Return: On success, this syscall returns the minimum
++ * rt_priority that can be used by a given scheduling class.
++ * On failure, a negative error code is returned.
++ */
++SYSCALL_DEFINE1(sched_get_priority_min, int, policy)
++{
++	int ret = -EINVAL;
++
++	switch (policy) {
++	case SCHED_FIFO:
++	case SCHED_RR:
++		ret = 1;
++		break;
++	case SCHED_NORMAL:
++	case SCHED_BATCH:
++	case SCHED_IDLE:
++		ret = 0;
++		break;
++	}
++	return ret;
++}
++
++static int sched_rr_get_interval(pid_t pid, struct timespec64 *t)
++{
++	struct task_struct *p;
++	int retval;
++
++	alt_sched_debug();
++
++	if (pid < 0)
++		return -EINVAL;
++
++	retval = -ESRCH;
++	rcu_read_lock();
++	p = find_process_by_pid(pid);
++	if (!p)
++		goto out_unlock;
++
++	retval = security_task_getscheduler(p);
++	if (retval)
++		goto out_unlock;
++	rcu_read_unlock();
++
++	*t = ns_to_timespec64(sched_timeslice_ns);
++	return 0;
++
++out_unlock:
++	rcu_read_unlock();
++	return retval;
++}
++
++/**
++ * sys_sched_rr_get_interval - return the default timeslice of a process.
++ * @pid: pid of the process.
++ * @interval: userspace pointer to the timeslice value.
++ *
++ *
++ * Return: On success, 0 and the timeslice is in @interval. Otherwise,
++ * an error code.
++ */
++SYSCALL_DEFINE2(sched_rr_get_interval, pid_t, pid,
++		struct __kernel_timespec __user *, interval)
++{
++	struct timespec64 t;
++	int retval = sched_rr_get_interval(pid, &t);
++
++	if (retval == 0)
++		retval = put_timespec64(&t, interval);
++
++	return retval;
++}
++
++#ifdef CONFIG_COMPAT_32BIT_TIME
++SYSCALL_DEFINE2(sched_rr_get_interval_time32, pid_t, pid,
++		struct old_timespec32 __user *, interval)
++{
++	struct timespec64 t;
++	int retval = sched_rr_get_interval(pid, &t);
++
++	if (retval == 0)
++		retval = put_old_timespec32(&t, interval);
++	return retval;
++}
++#endif
++
++void sched_show_task(struct task_struct *p)
++{
++	unsigned long free = 0;
++	int ppid;
++
++	if (!try_get_task_stack(p))
++		return;
++
++	pr_info("task:%-15.15s state:%c", p->comm, task_state_to_char(p));
++
++	if (task_is_running(p))
++		pr_cont("  running task    ");
++#ifdef CONFIG_DEBUG_STACK_USAGE
++	free = stack_not_used(p);
++#endif
++	ppid = 0;
++	rcu_read_lock();
++	if (pid_alive(p))
++		ppid = task_pid_nr(rcu_dereference(p->real_parent));
++	rcu_read_unlock();
++	pr_cont(" stack:%5lu pid:%5d ppid:%6d flags:0x%08lx\n",
++		free, task_pid_nr(p), ppid,
++		read_task_thread_flags(p));
++
++	print_worker_info(KERN_INFO, p);
++	print_stop_info(KERN_INFO, p);
++	show_stack(p, NULL, KERN_INFO);
++	put_task_stack(p);
++}
++EXPORT_SYMBOL_GPL(sched_show_task);
++
++static inline bool
++state_filter_match(unsigned long state_filter, struct task_struct *p)
++{
++	unsigned int state = READ_ONCE(p->__state);
++
++	/* no filter, everything matches */
++	if (!state_filter)
++		return true;
++
++	/* filter, but doesn't match */
++	if (!(state & state_filter))
++		return false;
++
++	/*
++	 * When looking for TASK_UNINTERRUPTIBLE skip TASK_IDLE (allows
++	 * TASK_KILLABLE).
++	 */
++	if (state_filter == TASK_UNINTERRUPTIBLE && state == TASK_IDLE)
++		return false;
++
++	return true;
++}
++
++
++void show_state_filter(unsigned int state_filter)
++{
++	struct task_struct *g, *p;
++
++	rcu_read_lock();
++	for_each_process_thread(g, p) {
++		/*
++		 * reset the NMI-timeout, listing all files on a slow
++		 * console might take a lot of time:
++		 * Also, reset softlockup watchdogs on all CPUs, because
++		 * another CPU might be blocked waiting for us to process
++		 * an IPI.
++		 */
++		touch_nmi_watchdog();
++		touch_all_softlockup_watchdogs();
++		if (state_filter_match(state_filter, p))
++			sched_show_task(p);
++	}
++
++#ifdef CONFIG_SCHED_DEBUG
++	/* TODO: Alt schedule FW should support this
++	if (!state_filter)
++		sysrq_sched_debug_show();
++	*/
++#endif
++	rcu_read_unlock();
++	/*
++	 * Only show locks if all tasks are dumped:
++	 */
++	if (!state_filter)
++		debug_show_all_locks();
++}
++
++void dump_cpu_task(int cpu)
++{
++	pr_info("Task dump for CPU %d:\n", cpu);
++	sched_show_task(cpu_curr(cpu));
++}
++
++/**
++ * init_idle - set up an idle thread for a given CPU
++ * @idle: task in question
++ * @cpu: CPU the idle task belongs to
++ *
++ * NOTE: this function does not set the idle thread's NEED_RESCHED
++ * flag, to make booting more robust.
++ */
++void __init init_idle(struct task_struct *idle, int cpu)
++{
++	struct rq *rq = cpu_rq(cpu);
++	unsigned long flags;
++
++	__sched_fork(0, idle);
++
++	raw_spin_lock_irqsave(&idle->pi_lock, flags);
++	raw_spin_lock(&rq->lock);
++	update_rq_clock(rq);
++
++	idle->last_ran = rq->clock_task;
++	idle->__state = TASK_RUNNING;
++	/*
++	 * PF_KTHREAD should already be set at this point; regardless, make it
++	 * look like a proper per-CPU kthread.
++	 */
++	idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY;
++	kthread_set_per_cpu(idle, cpu);
++
++	sched_queue_init_idle(&rq->queue, idle);
++
++#ifdef CONFIG_SMP
++	/*
++	 * It's possible that init_idle() gets called multiple times on a task,
++	 * in that case do_set_cpus_allowed() will not do the right thing.
++	 *
++	 * And since this is boot we can forgo the serialisation.
++	 */
++	set_cpus_allowed_common(idle, cpumask_of(cpu));
++#endif
++
++	/* Silence PROVE_RCU */
++	rcu_read_lock();
++	__set_task_cpu(idle, cpu);
++	rcu_read_unlock();
++
++	rq->idle = idle;
++	rcu_assign_pointer(rq->curr, idle);
++	idle->on_cpu = 1;
++
++	raw_spin_unlock(&rq->lock);
++	raw_spin_unlock_irqrestore(&idle->pi_lock, flags);
++
++	/* Set the preempt count _outside_ the spinlocks! */
++	init_idle_preempt_count(idle, cpu);
++
++	ftrace_graph_init_idle_task(idle, cpu);
++	vtime_init_idle(idle, cpu);
++#ifdef CONFIG_SMP
++	sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
++#endif
++}
++
++#ifdef CONFIG_SMP
++
++int cpuset_cpumask_can_shrink(const struct cpumask __maybe_unused *cur,
++			      const struct cpumask __maybe_unused *trial)
++{
++	return 1;
++}
++
++int task_can_attach(struct task_struct *p,
++		    const struct cpumask *cs_cpus_allowed)
++{
++	int ret = 0;
++
++	/*
++	 * Kthreads which disallow setaffinity shouldn't be moved
++	 * to a new cpuset; we don't want to change their CPU
++	 * affinity and isolating such threads by their set of
++	 * allowed nodes is unnecessary.  Thus, cpusets are not
++	 * applicable for such threads.  This prevents checking for
++	 * success of set_cpus_allowed_ptr() on all attached tasks
++	 * before cpus_mask may be changed.
++	 */
++	if (p->flags & PF_NO_SETAFFINITY)
++		ret = -EINVAL;
++
++	return ret;
++}
++
++bool sched_smp_initialized __read_mostly;
++
++#ifdef CONFIG_HOTPLUG_CPU
++/*
++ * Ensures that the idle task is using init_mm right before its CPU goes
++ * offline.
++ */
++void idle_task_exit(void)
++{
++	struct mm_struct *mm = current->active_mm;
++
++	BUG_ON(current != this_rq()->idle);
++
++	if (mm != &init_mm) {
++		switch_mm(mm, &init_mm, current);
++		finish_arch_post_lock_switch();
++	}
++
++	/* finish_cpu(), as ran on the BP, will clean up the active_mm state */
++}
++
++static int __balance_push_cpu_stop(void *arg)
++{
++	struct task_struct *p = arg;
++	struct rq *rq = this_rq();
++	struct rq_flags rf;
++	int cpu;
++
++	raw_spin_lock_irq(&p->pi_lock);
++	rq_lock(rq, &rf);
++
++	update_rq_clock(rq);
++
++	if (task_rq(p) == rq && task_on_rq_queued(p)) {
++		cpu = select_fallback_rq(rq->cpu, p);
++		rq = __migrate_task(rq, p, cpu);
++	}
++
++	rq_unlock(rq, &rf);
++	raw_spin_unlock_irq(&p->pi_lock);
++
++	put_task_struct(p);
++
++	return 0;
++}
++
++static DEFINE_PER_CPU(struct cpu_stop_work, push_work);
++
++/*
++ * This is enabled below SCHED_AP_ACTIVE; when !cpu_active(), but only
++ * effective when the hotplug motion is down.
++ */
++static void balance_push(struct rq *rq)
++{
++	struct task_struct *push_task = rq->curr;
++
++	lockdep_assert_held(&rq->lock);
++
++	/*
++	 * Ensure the thing is persistent until balance_push_set(.on = false);
++	 */
++	rq->balance_callback = &balance_push_callback;
++
++	/*
++	 * Only active while going offline and when invoked on the outgoing
++	 * CPU.
++	 */
++	if (!cpu_dying(rq->cpu) || rq != this_rq())
++		return;
++
++	/*
++	 * Both the cpu-hotplug and stop task are in this case and are
++	 * required to complete the hotplug process.
++	 */
++	if (kthread_is_per_cpu(push_task) ||
++	    is_migration_disabled(push_task)) {
++
++		/*
++		 * If this is the idle task on the outgoing CPU try to wake
++		 * up the hotplug control thread which might wait for the
++		 * last task to vanish. The rcuwait_active() check is
++		 * accurate here because the waiter is pinned on this CPU
++		 * and can't obviously be running in parallel.
++		 *
++		 * On RT kernels this also has to check whether there are
++		 * pinned and scheduled out tasks on the runqueue. They
++		 * need to leave the migrate disabled section first.
++		 */
++		if (!rq->nr_running && !rq_has_pinned_tasks(rq) &&
++		    rcuwait_active(&rq->hotplug_wait)) {
++			raw_spin_unlock(&rq->lock);
++			rcuwait_wake_up(&rq->hotplug_wait);
++			raw_spin_lock(&rq->lock);
++		}
++		return;
++	}
++
++	get_task_struct(push_task);
++	/*
++	 * Temporarily drop rq->lock such that we can wake-up the stop task.
++	 * Both preemption and IRQs are still disabled.
++	 */
++	raw_spin_unlock(&rq->lock);
++	stop_one_cpu_nowait(rq->cpu, __balance_push_cpu_stop, push_task,
++			    this_cpu_ptr(&push_work));
++	/*
++	 * At this point need_resched() is true and we'll take the loop in
++	 * schedule(). The next pick is obviously going to be the stop task
++	 * which kthread_is_per_cpu() and will push this task away.
++	 */
++	raw_spin_lock(&rq->lock);
++}
++
++static void balance_push_set(int cpu, bool on)
++{
++	struct rq *rq = cpu_rq(cpu);
++	struct rq_flags rf;
++
++	rq_lock_irqsave(rq, &rf);
++	if (on) {
++		WARN_ON_ONCE(rq->balance_callback);
++		rq->balance_callback = &balance_push_callback;
++	} else if (rq->balance_callback == &balance_push_callback) {
++		rq->balance_callback = NULL;
++	}
++	rq_unlock_irqrestore(rq, &rf);
++}
++
++/*
++ * Invoked from a CPUs hotplug control thread after the CPU has been marked
++ * inactive. All tasks which are not per CPU kernel threads are either
++ * pushed off this CPU now via balance_push() or placed on a different CPU
++ * during wakeup. Wait until the CPU is quiescent.
++ */
++static void balance_hotplug_wait(void)
++{
++	struct rq *rq = this_rq();
++
++	rcuwait_wait_event(&rq->hotplug_wait,
++			   rq->nr_running == 1 && !rq_has_pinned_tasks(rq),
++			   TASK_UNINTERRUPTIBLE);
++}
++
++#else
++
++static void balance_push(struct rq *rq)
++{
++}
++
++static void balance_push_set(int cpu, bool on)
++{
++}
++
++static inline void balance_hotplug_wait(void)
++{
++}
++#endif /* CONFIG_HOTPLUG_CPU */
++
++static void set_rq_offline(struct rq *rq)
++{
++	if (rq->online)
++		rq->online = false;
++}
++
++static void set_rq_online(struct rq *rq)
++{
++	if (!rq->online)
++		rq->online = true;
++}
++
++/*
++ * used to mark begin/end of suspend/resume:
++ */
++static int num_cpus_frozen;
++
++/*
++ * Update cpusets according to cpu_active mask.  If cpusets are
++ * disabled, cpuset_update_active_cpus() becomes a simple wrapper
++ * around partition_sched_domains().
++ *
++ * If we come here as part of a suspend/resume, don't touch cpusets because we
++ * want to restore it back to its original state upon resume anyway.
++ */
++static void cpuset_cpu_active(void)
++{
++	if (cpuhp_tasks_frozen) {
++		/*
++		 * num_cpus_frozen tracks how many CPUs are involved in suspend
++		 * resume sequence. As long as this is not the last online
++		 * operation in the resume sequence, just build a single sched
++		 * domain, ignoring cpusets.
++		 */
++		partition_sched_domains(1, NULL, NULL);
++		if (--num_cpus_frozen)
++			return;
++		/*
++		 * This is the last CPU online operation. So fall through and
++		 * restore the original sched domains by considering the
++		 * cpuset configurations.
++		 */
++		cpuset_force_rebuild();
++	}
++
++	cpuset_update_active_cpus();
++}
++
++static int cpuset_cpu_inactive(unsigned int cpu)
++{
++	if (!cpuhp_tasks_frozen) {
++		cpuset_update_active_cpus();
++	} else {
++		num_cpus_frozen++;
++		partition_sched_domains(1, NULL, NULL);
++	}
++	return 0;
++}
++
++int sched_cpu_activate(unsigned int cpu)
++{
++	struct rq *rq = cpu_rq(cpu);
++	unsigned long flags;
++
++	/*
++	 * Clear the balance_push callback and prepare to schedule
++	 * regular tasks.
++	 */
++	balance_push_set(cpu, false);
++
++#ifdef CONFIG_SCHED_SMT
++	/*
++	 * When going up, increment the number of cores with SMT present.
++	 */
++	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
++		static_branch_inc_cpuslocked(&sched_smt_present);
++#endif
++	set_cpu_active(cpu, true);
++
++	if (sched_smp_initialized)
++		cpuset_cpu_active();
++
++	/*
++	 * Put the rq online, if not already. This happens:
++	 *
++	 * 1) In the early boot process, because we build the real domains
++	 *    after all cpus have been brought up.
++	 *
++	 * 2) At runtime, if cpuset_cpu_active() fails to rebuild the
++	 *    domains.
++	 */
++	raw_spin_lock_irqsave(&rq->lock, flags);
++	set_rq_online(rq);
++	raw_spin_unlock_irqrestore(&rq->lock, flags);
++
++	return 0;
++}
++
++int sched_cpu_deactivate(unsigned int cpu)
++{
++	struct rq *rq = cpu_rq(cpu);
++	unsigned long flags;
++	int ret;
++
++	set_cpu_active(cpu, false);
++
++	/*
++	 * From this point forward, this CPU will refuse to run any task that
++	 * is not: migrate_disable() or KTHREAD_IS_PER_CPU, and will actively
++	 * push those tasks away until this gets cleared, see
++	 * sched_cpu_dying().
++	 */
++	balance_push_set(cpu, true);
++
++	/*
++	 * We've cleared cpu_active_mask, wait for all preempt-disabled and RCU
++	 * users of this state to go away such that all new such users will
++	 * observe it.
++	 *
++	 * Specifically, we rely on ttwu to no longer target this CPU, see
++	 * ttwu_queue_cond() and is_cpu_allowed().
++	 *
++	 * Do sync before park smpboot threads to take care the rcu boost case.
++	 */
++	synchronize_rcu();
++
++	raw_spin_lock_irqsave(&rq->lock, flags);
++	update_rq_clock(rq);
++	set_rq_offline(rq);
++	raw_spin_unlock_irqrestore(&rq->lock, flags);
++
++#ifdef CONFIG_SCHED_SMT
++	/*
++	 * When going down, decrement the number of cores with SMT present.
++	 */
++	if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
++		static_branch_dec_cpuslocked(&sched_smt_present);
++		if (!static_branch_likely(&sched_smt_present))
++			cpumask_clear(&sched_sg_idle_mask);
++	}
++#endif
++
++	if (!sched_smp_initialized)
++		return 0;
++
++	ret = cpuset_cpu_inactive(cpu);
++	if (ret) {
++		balance_push_set(cpu, false);
++		set_cpu_active(cpu, true);
++		return ret;
++	}
++
++	return 0;
++}
++
++static void sched_rq_cpu_starting(unsigned int cpu)
++{
++	struct rq *rq = cpu_rq(cpu);
++
++	rq->calc_load_update = calc_load_update;
++}
++
++int sched_cpu_starting(unsigned int cpu)
++{
++	sched_rq_cpu_starting(cpu);
++	sched_tick_start(cpu);
++	return 0;
++}
++
++#ifdef CONFIG_HOTPLUG_CPU
++
++/*
++ * Invoked immediately before the stopper thread is invoked to bring the
++ * CPU down completely. At this point all per CPU kthreads except the
++ * hotplug thread (current) and the stopper thread (inactive) have been
++ * either parked or have been unbound from the outgoing CPU. Ensure that
++ * any of those which might be on the way out are gone.
++ *
++ * If after this point a bound task is being woken on this CPU then the
++ * responsible hotplug callback has failed to do it's job.
++ * sched_cpu_dying() will catch it with the appropriate fireworks.
++ */
++int sched_cpu_wait_empty(unsigned int cpu)
++{
++	balance_hotplug_wait();
++	return 0;
++}
++
++/*
++ * Since this CPU is going 'away' for a while, fold any nr_active delta we
++ * might have. Called from the CPU stopper task after ensuring that the
++ * stopper is the last running task on the CPU, so nr_active count is
++ * stable. We need to take the teardown thread which is calling this into
++ * account, so we hand in adjust = 1 to the load calculation.
++ *
++ * Also see the comment "Global load-average calculations".
++ */
++static void calc_load_migrate(struct rq *rq)
++{
++	long delta = calc_load_fold_active(rq, 1);
++
++	if (delta)
++		atomic_long_add(delta, &calc_load_tasks);
++}
++
++static void dump_rq_tasks(struct rq *rq, const char *loglvl)
++{
++	struct task_struct *g, *p;
++	int cpu = cpu_of(rq);
++
++	lockdep_assert_held(&rq->lock);
++
++	printk("%sCPU%d enqueued tasks (%u total):\n", loglvl, cpu, rq->nr_running);
++	for_each_process_thread(g, p) {
++		if (task_cpu(p) != cpu)
++			continue;
++
++		if (!task_on_rq_queued(p))
++			continue;
++
++		printk("%s\tpid: %d, name: %s\n", loglvl, p->pid, p->comm);
++	}
++}
++
++int sched_cpu_dying(unsigned int cpu)
++{
++	struct rq *rq = cpu_rq(cpu);
++	unsigned long flags;
++
++	/* Handle pending wakeups and then migrate everything off */
++	sched_tick_stop(cpu);
++
++	raw_spin_lock_irqsave(&rq->lock, flags);
++	if (rq->nr_running != 1 || rq_has_pinned_tasks(rq)) {
++		WARN(true, "Dying CPU not properly vacated!");
++		dump_rq_tasks(rq, KERN_WARNING);
++	}
++	raw_spin_unlock_irqrestore(&rq->lock, flags);
++
++	calc_load_migrate(rq);
++	hrtick_clear(rq);
++	return 0;
++}
++#endif
++
++#ifdef CONFIG_SMP
++static void sched_init_topology_cpumask_early(void)
++{
++	int cpu;
++	cpumask_t *tmp;
++
++	for_each_possible_cpu(cpu) {
++		/* init topo masks */
++		tmp = per_cpu(sched_cpu_topo_masks, cpu);
++
++		cpumask_copy(tmp, cpumask_of(cpu));
++		tmp++;
++		cpumask_copy(tmp, cpu_possible_mask);
++		per_cpu(sched_cpu_llc_mask, cpu) = tmp;
++		per_cpu(sched_cpu_topo_end_mask, cpu) = ++tmp;
++		/*per_cpu(sd_llc_id, cpu) = cpu;*/
++	}
++}
++
++#define TOPOLOGY_CPUMASK(name, mask, last)\
++	if (cpumask_and(topo, topo, mask)) {					\
++		cpumask_copy(topo, mask);					\
++		printk(KERN_INFO "sched: cpu#%02d topo: 0x%08lx - "#name,	\
++		       cpu, (topo++)->bits[0]);					\
++	}									\
++	if (!last)								\
++		cpumask_complement(topo, mask)
++
++static void sched_init_topology_cpumask(void)
++{
++	int cpu;
++	cpumask_t *topo;
++
++	for_each_online_cpu(cpu) {
++		/* take chance to reset time slice for idle tasks */
++		cpu_rq(cpu)->idle->time_slice = sched_timeslice_ns;
++
++		topo = per_cpu(sched_cpu_topo_masks, cpu) + 1;
++
++		cpumask_complement(topo, cpumask_of(cpu));
++#ifdef CONFIG_SCHED_SMT
++		TOPOLOGY_CPUMASK(smt, topology_sibling_cpumask(cpu), false);
++#endif
++		per_cpu(sd_llc_id, cpu) = cpumask_first(cpu_coregroup_mask(cpu));
++		per_cpu(sched_cpu_llc_mask, cpu) = topo;
++		TOPOLOGY_CPUMASK(coregroup, cpu_coregroup_mask(cpu), false);
++
++		TOPOLOGY_CPUMASK(core, topology_core_cpumask(cpu), false);
++
++		TOPOLOGY_CPUMASK(others, cpu_online_mask, true);
++
++		per_cpu(sched_cpu_topo_end_mask, cpu) = topo;
++		printk(KERN_INFO "sched: cpu#%02d llc_id = %d, llc_mask idx = %d\n",
++		       cpu, per_cpu(sd_llc_id, cpu),
++		       (int) (per_cpu(sched_cpu_llc_mask, cpu) -
++			      per_cpu(sched_cpu_topo_masks, cpu)));
++	}
++}
++#endif
++
++void __init sched_init_smp(void)
++{
++	/* Move init over to a non-isolated CPU */
++	if (set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_DOMAIN)) < 0)
++		BUG();
++	current->flags &= ~PF_NO_SETAFFINITY;
++
++	sched_init_topology_cpumask();
++
++	sched_smp_initialized = true;
++}
++#else
++void __init sched_init_smp(void)
++{
++	cpu_rq(0)->idle->time_slice = sched_timeslice_ns;
++}
++#endif /* CONFIG_SMP */
++
++int in_sched_functions(unsigned long addr)
++{
++	return in_lock_functions(addr) ||
++		(addr >= (unsigned long)__sched_text_start
++		&& addr < (unsigned long)__sched_text_end);
++}
++
++#ifdef CONFIG_CGROUP_SCHED
++/* task group related information */
++struct task_group {
++	struct cgroup_subsys_state css;
++
++	struct rcu_head rcu;
++	struct list_head list;
++
++	struct task_group *parent;
++	struct list_head siblings;
++	struct list_head children;
++#ifdef CONFIG_FAIR_GROUP_SCHED
++	unsigned long		shares;
++#endif
++};
++
++/*
++ * Default task group.
++ * Every task in system belongs to this group at bootup.
++ */
++struct task_group root_task_group;
++LIST_HEAD(task_groups);
++
++/* Cacheline aligned slab cache for task_group */
++static struct kmem_cache *task_group_cache __read_mostly;
++#endif /* CONFIG_CGROUP_SCHED */
++
++void __init sched_init(void)
++{
++	int i;
++	struct rq *rq;
++
++	printk(KERN_INFO ALT_SCHED_VERSION_MSG);
++
++	wait_bit_init();
++
++#ifdef CONFIG_SMP
++	for (i = 0; i < SCHED_QUEUE_BITS; i++)
++		cpumask_copy(sched_rq_watermark + i, cpu_present_mask);
++#endif
++
++#ifdef CONFIG_CGROUP_SCHED
++	task_group_cache = KMEM_CACHE(task_group, 0);
++
++	list_add(&root_task_group.list, &task_groups);
++	INIT_LIST_HEAD(&root_task_group.children);
++	INIT_LIST_HEAD(&root_task_group.siblings);
++#endif /* CONFIG_CGROUP_SCHED */
++	for_each_possible_cpu(i) {
++		rq = cpu_rq(i);
++
++		sched_queue_init(&rq->queue);
++		rq->watermark = IDLE_TASK_SCHED_PRIO;
++		rq->skip = NULL;
++
++		raw_spin_lock_init(&rq->lock);
++		rq->nr_running = rq->nr_uninterruptible = 0;
++		rq->calc_load_active = 0;
++		rq->calc_load_update = jiffies + LOAD_FREQ;
++#ifdef CONFIG_SMP
++		rq->online = false;
++		rq->cpu = i;
++
++#ifdef CONFIG_SCHED_SMT
++		rq->active_balance = 0;
++#endif
++
++#ifdef CONFIG_NO_HZ_COMMON
++		INIT_CSD(&rq->nohz_csd, nohz_csd_func, rq);
++#endif
++		rq->balance_callback = &balance_push_callback;
++#ifdef CONFIG_HOTPLUG_CPU
++		rcuwait_init(&rq->hotplug_wait);
++#endif
++#endif /* CONFIG_SMP */
++		rq->nr_switches = 0;
++
++		hrtick_rq_init(rq);
++		atomic_set(&rq->nr_iowait, 0);
++	}
++#ifdef CONFIG_SMP
++	/* Set rq->online for cpu 0 */
++	cpu_rq(0)->online = true;
++#endif
++	/*
++	 * The boot idle thread does lazy MMU switching as well:
++	 */
++	mmgrab(&init_mm);
++	enter_lazy_tlb(&init_mm, current);
++
++	/*
++	 * The idle task doesn't need the kthread struct to function, but it
++	 * is dressed up as a per-CPU kthread and thus needs to play the part
++	 * if we want to avoid special-casing it in code that deals with per-CPU
++	 * kthreads.
++	 */
++	WARN_ON(!set_kthread_struct(current));
++
++	/*
++	 * Make us the idle thread. Technically, schedule() should not be
++	 * called from this thread, however somewhere below it might be,
++	 * but because we are the idle thread, we just pick up running again
++	 * when this runqueue becomes "idle".
++	 */
++	init_idle(current, smp_processor_id());
++
++	calc_load_update = jiffies + LOAD_FREQ;
++
++#ifdef CONFIG_SMP
++	idle_thread_set_boot_cpu();
++	balance_push_set(smp_processor_id(), false);
++
++	sched_init_topology_cpumask_early();
++#endif /* SMP */
++
++	psi_init();
++
++	preempt_dynamic_init();
++}
++
++#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
++
++void __might_sleep(const char *file, int line)
++{
++	unsigned int state = get_current_state();
++	/*
++	 * Blocking primitives will set (and therefore destroy) current->state,
++	 * since we will exit with TASK_RUNNING make sure we enter with it,
++	 * otherwise we will destroy state.
++	 */
++	WARN_ONCE(state != TASK_RUNNING && current->task_state_change,
++			"do not call blocking ops when !TASK_RUNNING; "
++			"state=%x set at [<%p>] %pS\n", state,
++			(void *)current->task_state_change,
++			(void *)current->task_state_change);
++
++	__might_resched(file, line, 0);
++}
++EXPORT_SYMBOL(__might_sleep);
++
++static void print_preempt_disable_ip(int preempt_offset, unsigned long ip)
++{
++	if (!IS_ENABLED(CONFIG_DEBUG_PREEMPT))
++		return;
++
++	if (preempt_count() == preempt_offset)
++		return;
++
++	pr_err("Preemption disabled at:");
++	print_ip_sym(KERN_ERR, ip);
++}
++
++static inline bool resched_offsets_ok(unsigned int offsets)
++{
++	unsigned int nested = preempt_count();
++
++	nested += rcu_preempt_depth() << MIGHT_RESCHED_RCU_SHIFT;
++
++	return nested == offsets;
++}
++
++void __might_resched(const char *file, int line, unsigned int offsets)

8092

++{

8093

++	/* Ratelimiting timestamp: */

8094

++	static unsigned long prev_jiffy;

8095

++

8096

++	unsigned long preempt_disable_ip;

8097

++

8098

++	/* WARN_ON_ONCE() by default, no rate limit required: */

8099

++	rcu_sleep_check();

8100

++

8101

++	if ((resched_offsets_ok(offsets) && !irqs_disabled() &&

8102

++	     !is_idle_task(current) && !current->non_block_count) ||

8103

++	    system_state == SYSTEM_BOOTING || system_state > SYSTEM_RUNNING ||

8104

++	    oops_in_progress)

8105

++		return;

8106

++	if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)

8107

++		return;

8108

++	prev_jiffy = jiffies;

8109

++

8110

++	/* Save this before calling printk(), since that will clobber it: */

8111

++	preempt_disable_ip = get_preempt_disable_ip(current);

8112

++

8113

++	pr_err("BUG: sleeping function called from invalid context at %s:%d\n",

8114

++	       file, line);

8115

++	pr_err("in_atomic(): %d, irqs_disabled(): %d, non_block: %d, pid: %d, name: %s\n",

8116

++	       in_atomic(), irqs_disabled(), current->non_block_count,

8117

++	       current->pid, current->comm);

8118

++	pr_err("preempt_count: %x, expected: %x\n", preempt_count(),

8119

++	       offsets & MIGHT_RESCHED_PREEMPT_MASK);

8120

++

8121

++	if (IS_ENABLED(CONFIG_PREEMPT_RCU)) {

8122

++		pr_err("RCU nest depth: %d, expected: %u\n",

8123

++		       rcu_preempt_depth(), offsets >> MIGHT_RESCHED_RCU_SHIFT);

8124

++	}

8125

++

8126

++	if (task_stack_end_corrupted(current))

8127

++		pr_emerg("Thread overran stack, or stack corrupted\n");

8128

++

8129

++	debug_show_held_locks(current);

8130

++	if (irqs_disabled())

8131

++		print_irqtrace_events(current);

8132

++

8133

++	print_preempt_disable_ip(offsets & MIGHT_RESCHED_PREEMPT_MASK,

8134

++				 preempt_disable_ip);

8135

++

8136

++	dump_stack();

8137

++	add_taint(TAINT_WARN, LOCKDEP_STILL_OK);

8138

++}

8139

++EXPORT_SYMBOL(__might_resched);
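Editorial note on the rate limit in __might_resched(): the "time_before(jiffies, prev_jiffy + HZ) && prev_jiffy" test lets the first report through (prev_jiffy is still 0) and then suppresses further reports for one second. A minimal user-space sketch of that pattern follows. DEMO_HZ and the demo_* helpers are hypothetical stand-ins, not kernel APIs.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's HZ (ticks per second). */
#define DEMO_HZ 100UL

/* Wraparound-safe "a is before b", the idea behind time_before(). */
static bool demo_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * Returns true when a report should be printed at tick "now".
 * *prev stays 0 until the first report, so the first call always passes.
 */
static bool demo_should_report(unsigned long now, unsigned long *prev)
{
	if (demo_time_before(now, *prev + DEMO_HZ) && *prev)
		return false;
	*prev = now;
	return true;
}
```

The same three-line pattern recurs in __cant_sleep() and __cant_migrate(), each with its own static timestamp.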

8140

++

8141

++void __cant_sleep(const char *file, int line, int preempt_offset)

8142

++{

8143

++	static unsigned long prev_jiffy;

8144

++

8145

++	if (irqs_disabled())

8146

++		return;

8147

++

8148

++	if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))

8149

++		return;

8150

++

8151

++	if (preempt_count() > preempt_offset)

8152

++		return;

8153

++

8154

++	if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)

8155

++		return;

8156

++	prev_jiffy = jiffies;

8157

++

8158

++	printk(KERN_ERR "BUG: assuming atomic context at %s:%d\n", file, line);

8159

++	printk(KERN_ERR "in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",

8160

++			in_atomic(), irqs_disabled(),

8161

++			current->pid, current->comm);

8162

++

8163

++	debug_show_held_locks(current);

8164

++	dump_stack();

8165

++	add_taint(TAINT_WARN, LOCKDEP_STILL_OK);

8166

++}

8167

++EXPORT_SYMBOL_GPL(__cant_sleep);

8168

++

8169

++#ifdef CONFIG_SMP

8170

++void __cant_migrate(const char *file, int line)

8171

++{

8172

++	static unsigned long prev_jiffy;

8173

++

8174

++	if (irqs_disabled())

8175

++		return;

8176

++

8177

++	if (is_migration_disabled(current))

8178

++		return;

8179

++

8180

++	if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))

8181

++		return;

8182

++

8183

++	if (preempt_count() > 0)

8184

++		return;

8185

++

8186

++	if (current->migration_flags & MDF_FORCE_ENABLED)

8187

++		return;

8188

++

8189

++	if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)

8190

++		return;

8191

++	prev_jiffy = jiffies;

8192

++

8193

++	pr_err("BUG: assuming non migratable context at %s:%d\n", file, line);

8194

++	pr_err("in_atomic(): %d, irqs_disabled(): %d, migration_disabled() %u pid: %d, name: %s\n",

8195

++	       in_atomic(), irqs_disabled(), is_migration_disabled(current),

8196

++	       current->pid, current->comm);

8197

++

8198

++	debug_show_held_locks(current);

8199

++	dump_stack();

8200

++	add_taint(TAINT_WARN, LOCKDEP_STILL_OK);

8201

++}

8202

++EXPORT_SYMBOL_GPL(__cant_migrate);

8203

++#endif

8204

++#endif

8205

++

8206

++#ifdef CONFIG_MAGIC_SYSRQ

8207

++void normalize_rt_tasks(void)

8208

++{

8209

++	struct task_struct *g, *p;

8210

++	struct sched_attr attr = {

8211

++		.sched_policy = SCHED_NORMAL,

8212

++	};

8213

++

8214

++	read_lock(&tasklist_lock);

8215

++	for_each_process_thread(g, p) {

8216

++		/*

8217

++		 * Only normalize user tasks:

8218

++		 */

8219

++		if (p->flags & PF_KTHREAD)

8220

++			continue;

8221

++

8222

++		schedstat_set(p->stats.wait_start,  0);

8223

++		schedstat_set(p->stats.sleep_start, 0);

8224

++		schedstat_set(p->stats.block_start, 0);

8225

++

8226

++		if (!rt_task(p)) {

8227

++			/*

8228

++			 * Renice negative nice level userspace

8229

++			 * tasks back to 0:

8230

++			 */

8231

++			if (task_nice(p) < 0)

8232

++				set_user_nice(p, 0);

8233

++			continue;

8234

++		}

8235

++

8236

++		__sched_setscheduler(p, &attr, false, false);

8237

++	}

8238

++	read_unlock(&tasklist_lock);

8239

++}

8240

++#endif /* CONFIG_MAGIC_SYSRQ */

8241

++

8242

++#if defined(CONFIG_IA64) || defined(CONFIG_KGDB_KDB)

8243

++/*

8244

++ * These functions are only useful for the IA64 MCA handling, or kdb.

8245

++ *

8246

++ * They can only be called when the whole system has been

8247

++ * stopped - every CPU needs to be quiescent, and no scheduling

8248

++ * activity can take place. Using them for anything else would

8249

++ * be a serious bug, and as a result, they aren't even visible

8250

++ * under any other configuration.

8251

++ */

8252

++

8253

++/**

8254

++ * curr_task - return the current task for a given CPU.

8255

++ * @cpu: the processor in question.

8256

++ *

8257

++ * ONLY VALID WHEN THE WHOLE SYSTEM IS STOPPED!

8258

++ *

8259

++ * Return: The current task for @cpu.

8260

++ */

8261

++struct task_struct *curr_task(int cpu)

8262

++{

8263

++	return cpu_curr(cpu);

8264

++}

8265

++

8266

++#endif /* defined(CONFIG_IA64) || defined(CONFIG_KGDB_KDB) */

8267

++

8268

++#ifdef CONFIG_IA64

8269

++/**

8270

++ * ia64_set_curr_task - set the current task for a given CPU.

8271

++ * @cpu: the processor in question.

8272

++ * @p: the task pointer to set.

8273

++ *

8274

++ * Description: This function must only be used when non-maskable interrupts

8275

++ * are serviced on a separate stack.  It allows the architecture to switch the

8276

++ * notion of the current task on a CPU in a non-blocking manner.  This function

8277

++ * must be called with all CPUs synchronised and interrupts disabled, and

8278

++ * the caller must save the original value of the current task (see

8279

++ * curr_task() above) and restore that value before reenabling interrupts and

8280

++ * re-starting the system.

8281

++ *

8282

++ * ONLY VALID WHEN THE WHOLE SYSTEM IS STOPPED!

8283

++ */

8284

++void ia64_set_curr_task(int cpu, struct task_struct *p)

8285

++{

8286

++	cpu_curr(cpu) = p;

8287

++}

8288

++

8289

++#endif

8290

++

8291

++#ifdef CONFIG_CGROUP_SCHED

8292

++static void sched_free_group(struct task_group *tg)

8293

++{

8294

++	kmem_cache_free(task_group_cache, tg);

8295

++}

8296

++

8297

++static void sched_free_group_rcu(struct rcu_head *rhp)

8298

++{

8299

++	sched_free_group(container_of(rhp, struct task_group, rcu));

8300

++}

8301

++

8302

++static void sched_unregister_group(struct task_group *tg)

8303

++{

8304

++	/*

8305

++	 * We have to wait for yet another RCU grace period to expire, as

8306

++	 * print_cfs_stats() might run concurrently.

8307

++	 */

8308

++	call_rcu(&tg->rcu, sched_free_group_rcu);

8309

++}

8310

++

8311

++/* allocate runqueue etc for a new task group */

8312

++struct task_group *sched_create_group(struct task_group *parent)

8313

++{

8314

++	struct task_group *tg;

8315

++

8316

++	tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);

8317

++	if (!tg)

8318

++		return ERR_PTR(-ENOMEM);

8319

++

8320

++	return tg;

8321

++}

8322

++

8323

++void sched_online_group(struct task_group *tg, struct task_group *parent)

8324

++{

8325

++}

8326

++

8327

++/* rcu callback to free various structures associated with a task group */

8328

++static void sched_unregister_group_rcu(struct rcu_head *rhp)

8329

++{

8330

++	/* Now it should be safe to free those cfs_rqs: */

8331

++	sched_unregister_group(container_of(rhp, struct task_group, rcu));

8332

++}

8333

++

8334

++void sched_destroy_group(struct task_group *tg)

8335

++{

8336

++	/* Wait for possible concurrent references to cfs_rqs to complete: */

8337

++	call_rcu(&tg->rcu, sched_unregister_group_rcu);

8338

++}

8339

++

8340

++void sched_release_group(struct task_group *tg)

8341

++{

8342

++}

8343

++

8344

++static inline struct task_group *css_tg(struct cgroup_subsys_state *css)

8345

++{

8346

++	return css ? container_of(css, struct task_group, css) : NULL;

8347

++}

8348

++

8349

++static struct cgroup_subsys_state *

8350

++cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)

8351

++{

8352

++	struct task_group *parent = css_tg(parent_css);

8353

++	struct task_group *tg;

8354

++

8355

++	if (!parent) {

8356

++		/* This is early initialization for the top cgroup */

8357

++		return &root_task_group.css;

8358

++	}

8359

++

8360

++	tg = sched_create_group(parent);

8361

++	if (IS_ERR(tg))

8362

++		return ERR_PTR(-ENOMEM);

8363

++	return &tg->css;

8364

++}

8365

++

8366

++/* Expose task group only after completing cgroup initialization */

8367

++static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)

8368

++{

8369

++	struct task_group *tg = css_tg(css);

8370

++	struct task_group *parent = css_tg(css->parent);

8371

++

8372

++	if (parent)

8373

++		sched_online_group(tg, parent);

8374

++	return 0;

8375

++}

8376

++

8377

++static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)

8378

++{

8379

++	struct task_group *tg = css_tg(css);

8380

++

8381

++	sched_release_group(tg);

8382

++}

8383

++

8384

++static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)

8385

++{

8386

++	struct task_group *tg = css_tg(css);

8387

++

8388

++	/*

8389

++	 * Relies on the RCU grace period between css_released() and this.

8390

++	 */

8391

++	sched_unregister_group(tg);

8392

++}

8393

++

8394

++static void cpu_cgroup_fork(struct task_struct *task)

8395

++{

8396

++}

8397

++

8398

++static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)

8399

++{

8400

++	return 0;

8401

++}

8402

++

8403

++static void cpu_cgroup_attach(struct cgroup_taskset *tset)

8404

++{

8405

++}

8406

++

8407

++#ifdef CONFIG_FAIR_GROUP_SCHED

8408

++static DEFINE_MUTEX(shares_mutex);

8409

++

8410

++int sched_group_set_shares(struct task_group *tg, unsigned long shares)

8411

++{

8412

++	/*

8413

++	 * We can't change the weight of the root cgroup.

8414

++	 */

8415

++	if (&root_task_group == tg)

8416

++		return -EINVAL;

8417

++

8418

++	shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));

8419

++

8420

++	mutex_lock(&shares_mutex);

8421

++	if (tg->shares == shares)

8422

++		goto done;

8423

++

8424

++	tg->shares = shares;

8425

++done:

8426

++	mutex_unlock(&shares_mutex);

8427

++	return 0;

8428

++}

8429

++

8430

++static int cpu_shares_write_u64(struct cgroup_subsys_state *css,

8431

++				struct cftype *cftype, u64 shareval)

8432

++{

8433

++	if (shareval > scale_load_down(ULONG_MAX))

8434

++		shareval = MAX_SHARES;

8435

++	return sched_group_set_shares(css_tg(css), scale_load(shareval));

8436

++}

8437

++

8438

++static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,

8439

++			       struct cftype *cft)

8440

++{

8441

++	struct task_group *tg = css_tg(css);

8442

++

8443

++	return (u64) scale_load_down(tg->shares);

8444

++}

8445

++#endif

8446

++

8447

++static struct cftype cpu_legacy_files[] = {

8448

++#ifdef CONFIG_FAIR_GROUP_SCHED

8449

++	{

8450

++		.name = "shares",

8451

++		.read_u64 = cpu_shares_read_u64,

8452

++		.write_u64 = cpu_shares_write_u64,

8453

++	},

8454

++#endif

8455

++	{ }	/* Terminate */

8456

++};

8457

++

8458

++

8459

++static struct cftype cpu_files[] = {

8460

++	{ }	/* terminate */

8461

++};

8462

++

8463

++static int cpu_extra_stat_show(struct seq_file *sf,

8464

++			       struct cgroup_subsys_state *css)

8465

++{

8466

++	return 0;

8467

++}

8468

++

8469

++struct cgroup_subsys cpu_cgrp_subsys = {

8470

++	.css_alloc	= cpu_cgroup_css_alloc,

8471

++	.css_online	= cpu_cgroup_css_online,

8472

++	.css_released	= cpu_cgroup_css_released,

8473

++	.css_free	= cpu_cgroup_css_free,

8474

++	.css_extra_stat_show = cpu_extra_stat_show,

8475

++	.fork		= cpu_cgroup_fork,

8476

++	.can_attach	= cpu_cgroup_can_attach,

8477

++	.attach		= cpu_cgroup_attach,

8478

++	.legacy_cftypes	= cpu_files,

8479

++	.legacy_cftypes	= cpu_legacy_files,

8480

++	.dfl_cftypes	= cpu_files,

8481

++	.early_init	= true,

8482

++	.threaded	= true,

8483

++};

8484

++#endif	/* CONFIG_CGROUP_SCHED */

8485

++

8486

++#undef CREATE_TRACE_POINTS

8487

+diff --git a/kernel/sched/alt_debug.c b/kernel/sched/alt_debug.c

8488

+new file mode 100644

8489

+index 000000000000..1212a031700e

8490

+--- /dev/null

8491

++++ b/kernel/sched/alt_debug.c

8492

+@@ -0,0 +1,31 @@

8493

++/*

8494

++ * kernel/sched/alt_debug.c

8495

++ *

8496

++ * Print the alt scheduler debugging details

8497

++ *

8498

++ * Author: Alfred Chen

8499

++ * Date  : 2020

8500

++ */

8501

++#include "sched.h"

8502

++

8503

++/*

8504

++ * This allows printing both to /proc/sched_debug and

8505

++ * to the console

8506

++ */

8507

++#define SEQ_printf(m, x...)			\

8508

++ do {						\

8509

++	if (m)					\

8510

++		seq_printf(m, x);		\

8511

++	else					\

8512

++		pr_cont(x);			\

8513

++ } while (0)

8514

++

8515

++void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,

8516

++			  struct seq_file *m)

8517

++{

8518

++	SEQ_printf(m, "%s (%d, #threads: %d)\n", p->comm, task_pid_nr_ns(p, ns),

8519

++						get_nr_threads(p));

8520

++}

8521

++

8522

++void proc_sched_set_task(struct task_struct *p)

8523

++{}

8524

+diff --git a/kernel/sched/alt_sched.h b/kernel/sched/alt_sched.h

8525

+new file mode 100644

8526

+index 000000000000..a181bf9ce57d

8527

+--- /dev/null

8528

++++ b/kernel/sched/alt_sched.h

8529

+@@ -0,0 +1,645 @@

8530

++#ifndef ALT_SCHED_H

8531

++#define ALT_SCHED_H

8532

++

8533

++#include <linux/psi.h>

8534

++#include <linux/stop_machine.h>

8535

++#include <linux/syscalls.h>

8536

++#include <linux/tick.h>

8537

++

8538

++#include <trace/events/power.h>

8539

++#include <trace/events/sched.h>

8540

++

8541

++#include "../workqueue_internal.h"

8542

++

8543

++#include "cpupri.h"

8544

++

8545

++#ifdef CONFIG_SCHED_BMQ

8546

++/* bits:

8547

++ * RT(0-99), (Low prio adj range, nice width, high prio adj range) / 2, cpu idle task */

8548

++#define SCHED_BITS	(MAX_RT_PRIO + NICE_WIDTH / 2 + MAX_PRIORITY_ADJ + 1)

8549

++#endif

8550

++

8551

++#ifdef CONFIG_SCHED_PDS

8552

++/* bits: RT(0-99), reserved(100-127), NORMAL_PRIO_NUM, cpu idle task */

8553

++#define SCHED_BITS	(MIN_NORMAL_PRIO + NORMAL_PRIO_NUM + 1)

8554

++#endif /* CONFIG_SCHED_PDS */

8555

++

8556

++#define IDLE_TASK_SCHED_PRIO	(SCHED_BITS - 1)

8557

++

8558

++#ifdef CONFIG_SCHED_DEBUG

8559

++# define SCHED_WARN_ON(x)	WARN_ONCE(x, #x)

8560

++extern void resched_latency_warn(int cpu, u64 latency);

8561

++#else

8562

++# define SCHED_WARN_ON(x)	({ (void)(x), 0; })

8563

++static inline void resched_latency_warn(int cpu, u64 latency) {}

8564

++#endif

8565

++

8566

++/*

8567

++ * Increase resolution of nice-level calculations for 64-bit architectures.

8568

++ * The extra resolution improves shares distribution and load balancing of

8569

++ * low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup

8570

++ * hierarchies, especially on larger systems. This is not a user-visible change

8571

++ * and does not change the user-interface for setting shares/weights.

8572

++ *

8573

++ * We increase resolution only if we have enough bits to allow this increased

8574

++ * resolution (i.e. 64-bit). The costs for increasing resolution when 32-bit

8575

++ * are pretty high and the returns do not justify the increased costs.

8576

++ *

8577

++ * Really only required when CONFIG_FAIR_GROUP_SCHED=y is also set, but to

8578

++ * increase coverage and consistency always enable it on 64-bit platforms.

8579

++ */

8580

++#ifdef CONFIG_64BIT

8581

++# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)

8582

++# define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)

8583

++# define scale_load_down(w) \

8584

++({ \

8585

++	unsigned long __w = (w); \

8586

++	if (__w) \

8587

++		__w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT); \

8588

++	__w; \

8589

++})

8590

++#else

8591

++# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)

8592

++# define scale_load(w)		(w)

8593

++# define scale_load_down(w)	(w)

8594

++#endif
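Editorial note on scale_load()/scale_load_down(): on 64-bit builds the weights gain extra fixed-point bits, and the downscale clamps any non-zero result to at least 2. The user-space sketch below restates the macros as plain functions. The shift of 10 is an assumption for SCHED_FIXEDPOINT_SHIFT, which is not shown in this hunk, and the demo_* names are hypothetical.

```c
/* Assumed value of SCHED_FIXEDPOINT_SHIFT; not shown in this hunk. */
#define DEMO_FIXEDPOINT_SHIFT 10

/* Add the extra resolution bits (64-bit variant of scale_load()). */
static unsigned long demo_scale_load(unsigned long w)
{
	return w << DEMO_FIXEDPOINT_SHIFT;
}

/* Drop the extra bits again; non-zero weights never go below 2. */
static unsigned long demo_scale_load_down(unsigned long w)
{
	if (w) {
		w >>= DEMO_FIXEDPOINT_SHIFT;
		if (w < 2UL)
			w = 2UL;
	}
	return w;
}
```

Under that assumption a weight of 1024 round-trips unchanged, while a scaled weight of 1 comes back as 2, which fits the comment below that weights of 0 or 1 are avoided.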

8595

++

8596

++#ifdef CONFIG_FAIR_GROUP_SCHED

8597

++#define ROOT_TASK_GROUP_LOAD	NICE_0_LOAD

8598

++

8599

++/*

8600

++ * A weight of 0 or 1 can cause arithmetic problems.

8601

++ * The weight of a cfs_rq is the sum of the weights of the entities

8602

++ * queued on it, so the weight of an entity should not be

8603

++ * too large, and neither should the shares value of a task group.

8604

++ * (The default weight is 1024 - so there's no practical

8605

++ *  limitation from this.)

8606

++ */

8607

++#define MIN_SHARES		(1UL <<  1)

8608

++#define MAX_SHARES		(1UL << 18)

8609

++#endif

8610

++

8611

++/* task_struct::on_rq states: */

8612

++#define TASK_ON_RQ_QUEUED	1

8613

++#define TASK_ON_RQ_MIGRATING	2

8614

++

8615

++static inline int task_on_rq_queued(struct task_struct *p)

8616

++{

8617

++	return p->on_rq == TASK_ON_RQ_QUEUED;

8618

++}

8619

++

8620

++static inline int task_on_rq_migrating(struct task_struct *p)

8621

++{

8622

++	return READ_ONCE(p->on_rq) == TASK_ON_RQ_MIGRATING;

8623

++}

8624

++

8625

++/*

8626

++ * wake flags

8627

++ */

8628

++#define WF_SYNC		0x01		/* waker goes to sleep after wakeup */

8629

++#define WF_FORK		0x02		/* child wakeup after fork */

8630

++#define WF_MIGRATED	0x04		/* internal use, task got migrated */

8631

++#define WF_ON_CPU	0x08		/* Wakee is on_cpu */

8632

++

8633

++#define SCHED_QUEUE_BITS	(SCHED_BITS - 1)

8634

++

8635

++struct sched_queue {

8636

++	DECLARE_BITMAP(bitmap, SCHED_QUEUE_BITS);

8637

++	struct list_head heads[SCHED_BITS];

8638

++};
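Editorial note on struct sched_queue: the bitmap has one bit fewer than there are list heads, which suggests that the last level (the idle task) needs no bit because it is the fallback when nothing else is queued. The sketch below is a simplified user-space model of that idea, not the patch's implementation. It uses a 64-level queue, counters instead of list heads, and a linear scan instead of a find-first-bit primitive, and every demo_* name is hypothetical.

```c
#include <stdint.h>

#define DEMO_BITS       64                 /* hypothetical number of levels */
#define DEMO_IDLE_PRIO  (DEMO_BITS - 1)    /* last level: the idle task */

struct demo_queue {
	uint64_t bitmap;            /* bit n set => level n has queued tasks */
	int      count[DEMO_BITS];  /* stand-in for the per-level list heads */
};

/* Queue one task at "prio"; set the bit when the level becomes non-empty. */
static void demo_enqueue(struct demo_queue *q, int prio)
{
	if (q->count[prio]++ == 0)
		q->bitmap |= UINT64_C(1) << prio;
}

/* Remove one task from "prio"; clear the bit when the level drains. */
static void demo_dequeue(struct demo_queue *q, int prio)
{
	if (--q->count[prio] == 0)
		q->bitmap &= ~(UINT64_C(1) << prio);
}

/* Lowest set bit is the numerically smallest, i.e. most urgent, level. */
static int demo_first_prio(const struct demo_queue *q)
{
	int prio;

	for (prio = 0; prio < DEMO_IDLE_PRIO; prio++)
		if (q->bitmap & (UINT64_C(1) << prio))
			return prio;
	return DEMO_IDLE_PRIO;      /* nothing queued: fall back to idle */
}
```

Picking the next level is then a bit search rather than a walk over per-level lists, which is presumably what the "BitMap Queue" name refers to.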

8639

++

8640

++/*

8641

++ * This is the main, per-CPU runqueue data structure.

8642

++ * This data should only be modified by the local cpu.

8643

++ */

8644

++struct rq {

8645

++	/* runqueue lock: */

8646

++	raw_spinlock_t lock;

8647

++

8648

++	struct task_struct __rcu *curr;

8649

++	struct task_struct *idle, *stop, *skip;

8650

++	struct mm_struct *prev_mm;

8651

++

8652

++	struct sched_queue	queue;

8653

++#ifdef CONFIG_SCHED_PDS

8654

++	u64			time_edge;

8655

++#endif

8656

++	unsigned long watermark;

8657

++

8658

++	/* switch count */

8659

++	u64 nr_switches;

8660

++

8661

++	atomic_t nr_iowait;

8662

++

8663

++#ifdef CONFIG_SCHED_DEBUG

8664

++	u64 last_seen_need_resched_ns;

8665

++	int ticks_without_resched;

8666

++#endif

8667

++

8668

++#ifdef CONFIG_MEMBARRIER

8669

++	int membarrier_state;

8670

++#endif

8671

++

8672

++#ifdef CONFIG_SMP

8673

++	int cpu;		/* cpu of this runqueue */

8674

++	bool online;

8675

++

8676

++	unsigned int		ttwu_pending;

8677

++	unsigned char		nohz_idle_balance;

8678

++	unsigned char		idle_balance;

8679

++

8680

++#ifdef CONFIG_HAVE_SCHED_AVG_IRQ

8681

++	struct sched_avg	avg_irq;

8682

++#endif

8683

++

8684

++#ifdef CONFIG_SCHED_SMT

8685

++	int active_balance;

8686

++	struct cpu_stop_work	active_balance_work;

8687

++#endif

8688

++	struct callback_head	*balance_callback;

8689

++#ifdef CONFIG_HOTPLUG_CPU

8690

++	struct rcuwait		hotplug_wait;

8691

++#endif

8692

++	unsigned int		nr_pinned;

8693

++

8694

++#endif /* CONFIG_SMP */

8695

++#ifdef CONFIG_IRQ_TIME_ACCOUNTING

8696

++	u64 prev_irq_time;

8697

++#endif /* CONFIG_IRQ_TIME_ACCOUNTING */

8698

++#ifdef CONFIG_PARAVIRT

8699

++	u64 prev_steal_time;

8700

++#endif /* CONFIG_PARAVIRT */

8701

++#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING

8702

++	u64 prev_steal_time_rq;

8703

++#endif /* CONFIG_PARAVIRT_TIME_ACCOUNTING */

8704

++

8705

++	/* For general cpu load util */

8706

++	s32 load_history;

8707

++	u64 load_block;

8708

++	u64 load_stamp;

8709

++

8710

++	/* calc_load related fields */

8711

++	unsigned long calc_load_update;

8712

++	long calc_load_active;

8713

++

8714

++	u64 clock, last_tick;

8715

++	u64 last_ts_switch;

8716

++	u64 clock_task;

8717

++

8718

++	unsigned int  nr_running;

8719

++	unsigned long nr_uninterruptible;

8720

++

8721

++#ifdef CONFIG_SCHED_HRTICK

8722

++#ifdef CONFIG_SMP

8723

++	call_single_data_t hrtick_csd;

8724

++#endif

8725

++	struct hrtimer		hrtick_timer;

8726

++	ktime_t			hrtick_time;

8727

++#endif

8728

++

8729

++#ifdef CONFIG_SCHEDSTATS

8730

++

8731

++	/* latency stats */

8732

++	struct sched_info rq_sched_info;

8733

++	unsigned long long rq_cpu_time;

8734

++	/* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */

8735

++

8736

++	/* sys_sched_yield() stats */

8737

++	unsigned int yld_count;

8738

++

8739

++	/* schedule() stats */

8740

++	unsigned int sched_switch;

8741

++	unsigned int sched_count;

8742

++	unsigned int sched_goidle;

8743

++

8744

++	/* try_to_wake_up() stats */

8745

++	unsigned int ttwu_count;

8746

++	unsigned int ttwu_local;

8747

++#endif /* CONFIG_SCHEDSTATS */

8748

++

8749

++#ifdef CONFIG_CPU_IDLE

8750

++	/* Must be inspected within a rcu lock section */

8751

++	struct cpuidle_state *idle_state;

8752

++#endif

8753

++

8754

++#ifdef CONFIG_NO_HZ_COMMON

8755

++#ifdef CONFIG_SMP

8756

++	call_single_data_t	nohz_csd;

8757

++#endif

8758

++	atomic_t		nohz_flags;

8759

++#endif /* CONFIG_NO_HZ_COMMON */

8760

++};

8761

++

8762

++extern unsigned long rq_load_util(struct rq *rq, unsigned long max);

8763

++

8764

++extern unsigned long calc_load_update;

8765

++extern atomic_long_t calc_load_tasks;

8766

++

8767

++extern void calc_global_load_tick(struct rq *this_rq);

8768

++extern long calc_load_fold_active(struct rq *this_rq, long adjust);

8769

++

8770

++DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

8771

++#define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))

8772

++#define this_rq()		this_cpu_ptr(&runqueues)

8773

++#define task_rq(p)		cpu_rq(task_cpu(p))

8774

++#define cpu_curr(cpu)		(cpu_rq(cpu)->curr)

8775

++#define raw_rq()		raw_cpu_ptr(&runqueues)

8776

++

8777

++#ifdef CONFIG_SMP

8778

++#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)

8779

++void register_sched_domain_sysctl(void);

8780

++void unregister_sched_domain_sysctl(void);

8781

++#else

8782

++static inline void register_sched_domain_sysctl(void)

8783

++{

8784

++}

8785

++static inline void unregister_sched_domain_sysctl(void)

8786

++{

8787

++}

8788

++#endif

8789

++

8790

++extern bool sched_smp_initialized;

8791

++

8792

++enum {

8793

++	ITSELF_LEVEL_SPACE_HOLDER,

8794

++#ifdef CONFIG_SCHED_SMT

8795

++	SMT_LEVEL_SPACE_HOLDER,

8796

++#endif

8797

++	COREGROUP_LEVEL_SPACE_HOLDER,

8798

++	CORE_LEVEL_SPACE_HOLDER,

8799

++	OTHER_LEVEL_SPACE_HOLDER,

8800

++	NR_CPU_AFFINITY_LEVELS

8801

++};

8802

++

8803

++DECLARE_PER_CPU(cpumask_t [NR_CPU_AFFINITY_LEVELS], sched_cpu_topo_masks);

8804

++DECLARE_PER_CPU(cpumask_t *, sched_cpu_llc_mask);

8805

++

8806

++static inline int

8807

++__best_mask_cpu(const cpumask_t *cpumask, const cpumask_t *mask)

8808

++{

8809

++	int cpu;

8810

++

8811

++	while ((cpu = cpumask_any_and(cpumask, mask)) >= nr_cpu_ids)

8812

++		mask++;

8813

++

8814

++	return cpu;

8815

++}

8816

++

8817

++static inline int best_mask_cpu(int cpu, const cpumask_t *mask)

8818

++{

8819

++	return __best_mask_cpu(mask, per_cpu(sched_cpu_topo_masks, cpu));

8820

++}
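Editorial note on __best_mask_cpu(): the loop steps through the per-CPU topology masks, presumably ordered from nearest to farthest, until one of them intersects the candidate mask. The sketch below models this in user space with 64-bit masks. The demo_* names and the 64-CPU limit are assumptions for the example, and the helper returns the lowest matching CPU, whereas the kernel helper only promises "any".

```c
#include <stdint.h>

#define DEMO_NR_CPUS 64   /* hypothetical; one bit per CPU in a uint64_t */

/* Return the lowest CPU set in both masks, or DEMO_NR_CPUS if none. */
static int demo_any_and(uint64_t a, uint64_t b)
{
	uint64_t both = a & b;
	int cpu;

	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		if (both & (UINT64_C(1) << cpu))
			return cpu;
	return DEMO_NR_CPUS;
}

/*
 * Walk the topology levels in order. Like the original loop, this relies
 * on some level intersecting "allowed"; otherwise it runs off the array.
 */
static int demo_best_cpu(uint64_t allowed, const uint64_t *levels)
{
	int cpu;

	while ((cpu = demo_any_and(allowed, *levels)) >= DEMO_NR_CPUS)
		levels++;
	return cpu;
}
```

The missing bounds check is deliberate in the sketch, since it mirrors the original: the caller has to guarantee that the widest level covers every allowed CPU.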

8821

++

8822

++extern void flush_smp_call_function_queue(void);

8823

++

8824

++#else  /* !CONFIG_SMP */

8825

++static inline void flush_smp_call_function_queue(void) { }

8826

++#endif

8827

++

8828

++#ifndef arch_scale_freq_tick

8829

++static __always_inline

8830

++void arch_scale_freq_tick(void)

8831

++{

8832

++}

8833

++#endif

8834

++

8835

++#ifndef arch_scale_freq_capacity

8836

++static __always_inline

8837

++unsigned long arch_scale_freq_capacity(int cpu)

8838

++{

8839

++	return SCHED_CAPACITY_SCALE;

8840

++}

8841

++#endif

8842

++

8843

++static inline u64 __rq_clock_broken(struct rq *rq)

8844

++{

8845

++	return READ_ONCE(rq->clock);

8846

++}

8847

++

8848

++static inline u64 rq_clock(struct rq *rq)

8849

++{

8850

++	/*

8851

++	 * Relax lockdep_assert_held() checking as in VRQ, call to

8852

++	 * sched_info_xxxx() may not hold rq->lock

8853

++	 * lockdep_assert_held(&rq->lock);

8854

++	 */

8855

++	return rq->clock;

8856

++}

8857

++

8858

++static inline u64 rq_clock_task(struct rq *rq)

8859

++{

8860

++	/*

8861

++	 * Relax lockdep_assert_held() checking as in VRQ, call to

8862

++	 * sched_info_xxxx() may not hold rq->lock

8863

++	 * lockdep_assert_held(&rq->lock);

8864

++	 */

8865

++	return rq->clock_task;

8866

++}

8867

++

8868

++/*

8869

++ * {de,en}queue flags:

8870

++ *

8871

++ * DEQUEUE_SLEEP  - task is no longer runnable

8872

++ * ENQUEUE_WAKEUP - task just became runnable

8873

++ *

8874

++ */

8875

++

8876

++#define DEQUEUE_SLEEP		0x01

8877

++

8878

++#define ENQUEUE_WAKEUP		0x01

8879

++

8880

++

8881

++/*

8882

++ * Below are the scheduler APIs used by other kernel code.

8883

++ * They use a dummy rq_flags.

8884

++ * TODO: BMQ needs to support these APIs for compatibility with mainline

8885

++ * scheduler code.

8886

++ */

8887

++struct rq_flags {

8888

++	unsigned long flags;

8889

++};

8890

++

8891

++struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)

8892

++	__acquires(rq->lock);

8893

++

8894

++struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)

8895

++	__acquires(p->pi_lock)

8896

++	__acquires(rq->lock);

8897

++

8898

++static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)

8899

++	__releases(rq->lock)

8900

++{

8901

++	raw_spin_unlock(&rq->lock);

8902

++}

8903

++

8904

++static inline void

8905

++task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)

8906

++	__releases(rq->lock)

8907

++	__releases(p->pi_lock)

8908

++{

8909

++	raw_spin_unlock(&rq->lock);

8910

++	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);

8911

++}

8912

++

8913

++static inline void

8914

++rq_lock(struct rq *rq, struct rq_flags *rf)

8915

++	__acquires(rq->lock)

8916

++{

8917

++	raw_spin_lock(&rq->lock);

8918

++}

8919

++

8920

++static inline void

8921

++rq_unlock_irq(struct rq *rq, struct rq_flags *rf)

8922

++	__releases(rq->lock)

8923

++{

8924

++	raw_spin_unlock_irq(&rq->lock);

8925

++}

8926

++

8927

++static inline void

8928

++rq_unlock(struct rq *rq, struct rq_flags *rf)

8929

++	__releases(rq->lock)

8930

++{

8931

++	raw_spin_unlock(&rq->lock);

8932

++}

8933

++

8934

++static inline struct rq *

8935

++this_rq_lock_irq(struct rq_flags *rf)

8936

++	__acquires(rq->lock)

8937

++{

8938

++	struct rq *rq;

8939

++

8940

++	local_irq_disable();

8941

++	rq = this_rq();

8942

++	raw_spin_lock(&rq->lock);

8943

++

8944

++	return rq;

8945

++}

8946

++

8947

++static inline raw_spinlock_t *__rq_lockp(struct rq *rq)

8948

++{

8949

++	return &rq->lock;

8950

++}

8951

++

8952

++static inline raw_spinlock_t *rq_lockp(struct rq *rq)

8953

++{

8954

++	return __rq_lockp(rq);

8955

++}

8956

++

8957

++static inline void lockdep_assert_rq_held(struct rq *rq)

8958

++{

8959

++	lockdep_assert_held(__rq_lockp(rq));

8960

++}

8961

++

8962

++extern void raw_spin_rq_lock_nested(struct rq *rq, int subclass);

8963

++extern void raw_spin_rq_unlock(struct rq *rq);

8964

++

8965

++static inline void raw_spin_rq_lock(struct rq *rq)

8966

++{

8967

++	raw_spin_rq_lock_nested(rq, 0);

8968

++}

8969

++

8970

++static inline void raw_spin_rq_lock_irq(struct rq *rq)

8971

++{

8972

++	local_irq_disable();

8973

++	raw_spin_rq_lock(rq);

8974

++}

8975

++

8976

++static inline void raw_spin_rq_unlock_irq(struct rq *rq)

8977

++{

8978

++	raw_spin_rq_unlock(rq);

8979

++	local_irq_enable();

8980

++}

8981

++

8982

++static inline int task_current(struct rq *rq, struct task_struct *p)

8983

++{

8984

++	return rq->curr == p;

8985

++}

8986

++

8987

++static inline bool task_running(struct task_struct *p)

8988

++{

8989

++	return p->on_cpu;

8990

++}

8991

++

8992

++extern int task_running_nice(struct task_struct *p);

8993

++

8994

++extern struct static_key_false sched_schedstats;

8995

++

8996

++#ifdef CONFIG_CPU_IDLE

8997

++static inline void idle_set_state(struct rq *rq,

8998

++				  struct cpuidle_state *idle_state)

8999

++{

9000

++	rq->idle_state = idle_state;

9001

++}

9002

++

9003

++static inline struct cpuidle_state *idle_get_state(struct rq *rq)

9004

++{

9005

++	WARN_ON(!rcu_read_lock_held());

9006

++	return rq->idle_state;

9007

++}

9008

++#else

9009

++static inline void idle_set_state(struct rq *rq,

9010

++				  struct cpuidle_state *idle_state)

9011

++{

9012

++}

9013

++

9014

++static inline struct cpuidle_state *idle_get_state(struct rq *rq)

9015

++{

9016

++	return NULL;

9017

++}

9018

++#endif

9019

++

9020

++static inline int cpu_of(const struct rq *rq)

9021

++{

9022

++#ifdef CONFIG_SMP

9023

++	return rq->cpu;

9024

++#else

9025

++	return 0;

9026

++#endif

9027

++}

9028

++

9029

++#include "stats.h"

9030

++

9031

++#ifdef CONFIG_NO_HZ_COMMON

9032

++#define NOHZ_BALANCE_KICK_BIT	0

9033

++#define NOHZ_STATS_KICK_BIT	1

9034

++

9035

++#define NOHZ_BALANCE_KICK	BIT(NOHZ_BALANCE_KICK_BIT)

9036

++#define NOHZ_STATS_KICK		BIT(NOHZ_STATS_KICK_BIT)

9037

++

9038

++#define NOHZ_KICK_MASK	(NOHZ_BALANCE_KICK | NOHZ_STATS_KICK)

9039

++

9040

++#define nohz_flags(cpu)	(&cpu_rq(cpu)->nohz_flags)

9041

++

9042

++/* TODO: needed?

9043

++extern void nohz_balance_exit_idle(struct rq *rq);

9044

++#else

9045

++static inline void nohz_balance_exit_idle(struct rq *rq) { }

9046

++*/

9047

++#endif

9048

++

9049

++#ifdef CONFIG_IRQ_TIME_ACCOUNTING

9050

++struct irqtime {

9051

++	u64			total;

9052

++	u64			tick_delta;

9053

++	u64			irq_start_time;

9054

++	struct u64_stats_sync	sync;

9055

++};

9056

++

9057

++DECLARE_PER_CPU(struct irqtime, cpu_irqtime);

9058

++

9059

++/*

9060

++ * Returns the irqtime minus the softirq time computed by ksoftirqd.

9061

++ * Otherwise ksoftirqd's sum_exec_runtime has its own runtime subtracted

9062

++ * and never moves forward.

9063

++ */

9064

++static inline u64 irq_time_read(int cpu)

9065

++{

9066

++	struct irqtime *irqtime = &per_cpu(cpu_irqtime, cpu);

9067

++	unsigned int seq;

9068

++	u64 total;

9069

++

9070

++	do {

9071

++		seq = __u64_stats_fetch_begin(&irqtime->sync);

9072

++		total = irqtime->total;

9073

++	} while (__u64_stats_fetch_retry(&irqtime->sync, seq));

9074

++

9075

++	return total;

9076

++}

9077

++#endif /* CONFIG_IRQ_TIME_ACCOUNTING */

9078

++

9079

++#ifdef CONFIG_CPU_FREQ

9080

++DECLARE_PER_CPU(struct update_util_data __rcu *, cpufreq_update_util_data);

9081

++#endif /* CONFIG_CPU_FREQ */

9082

++

9083

++#ifdef CONFIG_NO_HZ_FULL

9084

++extern int __init sched_tick_offload_init(void);

9085

++#else

9086

++static inline int sched_tick_offload_init(void) { return 0; }

9087

++#endif

9088

++

9089

++#ifdef arch_scale_freq_capacity

9090

++#ifndef arch_scale_freq_invariant

9091

++#define arch_scale_freq_invariant()	(true)

9092

++#endif

9093

++#else /* arch_scale_freq_capacity */

9094

++#define arch_scale_freq_invariant()	(false)

9095

++#endif

9096

++

9097

++extern void schedule_idle(void);

9098

++

9099

++#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
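Editorial note on cap_scale(): the macro multiplies a value by a fixed-point fraction whose denominator is 1 << SCHED_CAPACITY_SHIFT. The user-space sketch below assumes a shift of 10 (full capacity = 1024), which is not stated in this hunk, and the demo_* names are hypothetical.

```c
/* Assumed value of SCHED_CAPACITY_SHIFT, i.e. a full-capacity scale of 1024. */
#define DEMO_CAPACITY_SHIFT 10
#define DEMO_CAPACITY_SCALE (1UL << DEMO_CAPACITY_SHIFT)

/* Scale v by the fixed-point fraction s / DEMO_CAPACITY_SCALE. */
static unsigned long demo_cap_scale(unsigned long v, unsigned long s)
{
	return (v * s) >> DEMO_CAPACITY_SHIFT;
}
```

Under that assumption, scaling by 1024 leaves a value unchanged and scaling by 512 halves it, with the result truncated toward zero.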

9100

++

9101

++/*

9102

++ * !! For sched_setattr_nocheck() (kernel) only !!

9103

++ *

9104

++ * This is actually gross. :(

9105

++ *

9106

++ * It is used to make schedutil kworker(s) higher priority than SCHED_DEADLINE

9107

++ * tasks, but still be able to sleep. We need this on platforms that cannot

9108

++ * atomically change clock frequency. Remove once fast switching will be

9109

++ * available on such platforms.

9110

++ *

9111

++ * SUGOV stands for SchedUtil GOVernor.

9112

++ */

9113

++#define SCHED_FLAG_SUGOV	0x10000000

9114

++

9115

++#ifdef CONFIG_MEMBARRIER

9116

++/*

9117

++ * The scheduler provides memory barriers required by membarrier between:

9118

++ * - prior user-space memory accesses and store to rq->membarrier_state,

9119

++ * - store to rq->membarrier_state and following user-space memory accesses.

9120

++ * In the same way it provides those guarantees around store to rq->curr.

9121

++ */

9122

++static inline void membarrier_switch_mm(struct rq *rq,

9123

++					struct mm_struct *prev_mm,

9124

++					struct mm_struct *next_mm)

9125

++{

9126

++	int membarrier_state;

9127

++

9128

++	if (prev_mm == next_mm)

9129

++		return;

9130

++

9131

++	membarrier_state = atomic_read(&next_mm->membarrier_state);

9132

++	if (READ_ONCE(rq->membarrier_state) == membarrier_state)

9133

++		return;

9134

++

9135

++	WRITE_ONCE(rq->membarrier_state, membarrier_state);

9136

++}

9137

++#else

9138

++static inline void membarrier_switch_mm(struct rq *rq,

9139

++					struct mm_struct *prev_mm,

9140

++					struct mm_struct *next_mm)

9141

++{

9142

++}

9143

++#endif

9144

++

9145

++#ifdef CONFIG_NUMA

9146

++extern int sched_numa_find_closest(const struct cpumask *cpus, int cpu);

9147

++#else

9148

++static inline int sched_numa_find_closest(const struct cpumask *cpus, int cpu)

9149

++{

9150

++	return nr_cpu_ids;

9151

++}

9152

++#endif

9153

++

9154

++extern void swake_up_all_locked(struct swait_queue_head *q);

9155

++extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);

9156

++

9157

++#ifdef CONFIG_PREEMPT_DYNAMIC

9158

++extern int preempt_dynamic_mode;

9159

++extern int sched_dynamic_mode(const char *str);

9160

++extern void sched_dynamic_update(int mode);

9161

++#endif

9162

++

9163

++static inline void nohz_run_idle_balance(int cpu) { }

9164

++

9165

++static inline

9166

++unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,

9167

++				  struct task_struct *p)

9168

++{

9169

++	return util;

9170

++}

9171

++

9172

++static inline bool uclamp_rq_is_capped(struct rq *rq) { return false; }

9173

++

9174

++#endif /* ALT_SCHED_H */

+diff --git a/kernel/sched/bmq.h b/kernel/sched/bmq.h
+new file mode 100644
+index 000000000000..66b77291b9d0
+--- /dev/null
++++ b/kernel/sched/bmq.h
+@@ -0,0 +1,110 @@
++#define ALT_SCHED_VERSION_MSG "sched/bmq: BMQ CPU Scheduler "ALT_SCHED_VERSION" by Alfred Chen.\n"
++
++/*
++ * BMQ only routines
++ */
++#define rq_switch_time(rq)	((rq)->clock - (rq)->last_ts_switch)
++#define boost_threshold(p)	(sched_timeslice_ns >>\
++				 (15 - MAX_PRIORITY_ADJ -  (p)->boost_prio))
++
++static inline void boost_task(struct task_struct *p)
++{
++	int limit;
++
++	switch (p->policy) {
++	case SCHED_NORMAL:
++		limit = -MAX_PRIORITY_ADJ;
++		break;
++	case SCHED_BATCH:
++	case SCHED_IDLE:
++		limit = 0;
++		break;
++	default:
++		return;
++	}
++
++	if (p->boost_prio > limit)
++		p->boost_prio--;
++}
++
++static inline void deboost_task(struct task_struct *p)
++{
++	if (p->boost_prio < MAX_PRIORITY_ADJ)
++		p->boost_prio++;
++}
++
++/*
++ * Common interfaces
++ */
++static inline void sched_timeslice_imp(const int timeslice_ms) {}
++
++static inline int
++task_sched_prio_normal(const struct task_struct *p, const struct rq *rq)
++{
++	return p->prio + p->boost_prio - MAX_RT_PRIO;
++}
++
++static inline int task_sched_prio(const struct task_struct *p)
++{
++	return (p->prio < MAX_RT_PRIO)? p->prio : MAX_RT_PRIO / 2 + (p->prio + p->boost_prio) / 2;
++}
++
++static inline int
++task_sched_prio_idx(const struct task_struct *p, const struct rq *rq)
++{
++	return task_sched_prio(p);
++}
++
++static inline int sched_prio2idx(int prio, struct rq *rq)
++{
++	return prio;
++}
++
++static inline int sched_idx2prio(int idx, struct rq *rq)
++{
++	return idx;
++}
++
++static inline void time_slice_expired(struct task_struct *p, struct rq *rq)
++{
++	p->time_slice = sched_timeslice_ns;
++
++	if (SCHED_FIFO != p->policy && task_on_rq_queued(p)) {
++		if (SCHED_RR != p->policy)
++			deboost_task(p);
++		requeue_task(p, rq, task_sched_prio_idx(p, rq));
++	}
++}
++
++static inline void sched_task_sanity_check(struct task_struct *p, struct rq *rq) {}
++
++inline int task_running_nice(struct task_struct *p)
++{
++	return (p->prio + p->boost_prio > DEFAULT_PRIO + MAX_PRIORITY_ADJ);
++}
++
++static void sched_task_fork(struct task_struct *p, struct rq *rq)
++{
++	p->boost_prio = MAX_PRIORITY_ADJ;
++}
++
++static inline void do_sched_yield_type_1(struct task_struct *p, struct rq *rq)
++{
++	p->boost_prio = MAX_PRIORITY_ADJ;
++}
++
++#ifdef CONFIG_SMP
++static inline void sched_task_ttwu(struct task_struct *p)
++{
++	if(this_rq()->clock_task - p->last_ran > sched_timeslice_ns)
++		boost_task(p);
++}
++#endif
++
++static inline void sched_task_deactivate(struct task_struct *p, struct rq *rq)
++{
++	if (rq_switch_time(rq) < boost_threshold(p))
++		boost_task(p);
++}
++
++static inline void update_rq_time_edge(struct rq *rq) {}
+diff --git a/kernel/sched/build_policy.c b/kernel/sched/build_policy.c
+index d9dc9ab3773f..71a25540d65e 100644
+--- a/kernel/sched/build_policy.c
++++ b/kernel/sched/build_policy.c
+@@ -42,13 +42,19 @@
+
+ #include "idle.c"
+
++#ifndef CONFIG_SCHED_ALT
+ #include "rt.c"
++#endif
+
+ #ifdef CONFIG_SMP
++#ifndef CONFIG_SCHED_ALT
+ # include "cpudeadline.c"
++#endif
+ # include "pelt.c"
+ #endif
+
+ #include "cputime.c"
+-#include "deadline.c"
+
++#ifndef CONFIG_SCHED_ALT
++#include "deadline.c"
++#endif
+diff --git a/kernel/sched/build_utility.c b/kernel/sched/build_utility.c
+index 99bdd96f454f..23f80a86d2d7 100644
+--- a/kernel/sched/build_utility.c
++++ b/kernel/sched/build_utility.c
+@@ -85,7 +85,9 @@
+
+ #ifdef CONFIG_SMP
+ # include "cpupri.c"
++#ifndef CONFIG_SCHED_ALT
+ # include "stop_task.c"
++#endif
+ # include "topology.c"
+ #endif
+
+diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
+index 3dbf351d12d5..b2590f961139 100644
+--- a/kernel/sched/cpufreq_schedutil.c
++++ b/kernel/sched/cpufreq_schedutil.c
+@@ -160,9 +160,14 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
+ 	unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
+
+ 	sg_cpu->max = max;
++#ifndef CONFIG_SCHED_ALT
+ 	sg_cpu->bw_dl = cpu_bw_dl(rq);
+ 	sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu), max,
+ 					  FREQUENCY_UTIL, NULL);
++#else
++	sg_cpu->bw_dl = 0;
++	sg_cpu->util = rq_load_util(rq, max);
++#endif /* CONFIG_SCHED_ALT */
+ }
+
+ /**
+@@ -306,8 +311,10 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
+  */
+ static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu)
+ {
++#ifndef CONFIG_SCHED_ALT
+ 	if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
+ 		sg_cpu->sg_policy->limits_changed = true;
++#endif
+ }
+
+ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
+@@ -607,6 +614,7 @@ static int sugov_kthread_create(struct sugov_policy *sg_policy)
+ 	}
+
+ 	ret = sched_setattr_nocheck(thread, &attr);
++
+ 	if (ret) {
+ 		kthread_stop(thread);
+ 		pr_warn("%s: failed to set SCHED_DEADLINE\n", __func__);
+@@ -839,7 +847,9 @@ cpufreq_governor_init(schedutil_gov);
+ #ifdef CONFIG_ENERGY_MODEL
+ static void rebuild_sd_workfn(struct work_struct *work)
+ {
++#ifndef CONFIG_SCHED_ALT
+ 	rebuild_sched_domains_energy();
++#endif /* CONFIG_SCHED_ALT */
+ }
+ static DECLARE_WORK(rebuild_sd_work, rebuild_sd_workfn);
+
+diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
+index 78a233d43757..b3bbc87d4352 100644
+--- a/kernel/sched/cputime.c
++++ b/kernel/sched/cputime.c
+@@ -122,7 +122,7 @@ void account_user_time(struct task_struct *p, u64 cputime)
+ 	p->utime += cputime;
+ 	account_group_user_time(p, cputime);
+
+-	index = (task_nice(p) > 0) ? CPUTIME_NICE : CPUTIME_USER;
++	index = task_running_nice(p) ? CPUTIME_NICE : CPUTIME_USER;
+
+ 	/* Add user time to cpustat. */
+ 	task_group_account_field(p, index, cputime);
+@@ -146,7 +146,7 @@ void account_guest_time(struct task_struct *p, u64 cputime)
+ 	p->gtime += cputime;
+
+ 	/* Add guest time to cpustat. */
+-	if (task_nice(p) > 0) {
++	if (task_running_nice(p)) {
+ 		task_group_account_field(p, CPUTIME_NICE, cputime);
+ 		cpustat[CPUTIME_GUEST_NICE] += cputime;
+ 	} else {
+@@ -269,7 +269,7 @@ static inline u64 account_other_time(u64 max)
+ #ifdef CONFIG_64BIT
+ static inline u64 read_sum_exec_runtime(struct task_struct *t)
+ {
+-	return t->se.sum_exec_runtime;
++	return tsk_seruntime(t);
+ }
+ #else
+ static u64 read_sum_exec_runtime(struct task_struct *t)
+@@ -279,7 +279,7 @@ static u64 read_sum_exec_runtime(struct task_struct *t)
+ 	struct rq *rq;
+
+ 	rq = task_rq_lock(t, &rf);
+-	ns = t->se.sum_exec_runtime;
++	ns = tsk_seruntime(t);
+ 	task_rq_unlock(rq, t, &rf);
+
+ 	return ns;
+@@ -611,7 +611,7 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
+ void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
+ {
+ 	struct task_cputime cputime = {
+-		.sum_exec_runtime = p->se.sum_exec_runtime,
++		.sum_exec_runtime = tsk_seruntime(p),
+ 	};
+
+ 	if (task_cputime(p, &cputime.utime, &cputime.stime))
+diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
+index bb3d63bdf4ae..4e1680785704 100644
+--- a/kernel/sched/debug.c
++++ b/kernel/sched/debug.c
+@@ -7,6 +7,7 @@
+  * Copyright(C) 2007, Red Hat, Inc., Ingo Molnar
+  */
+
++#ifndef CONFIG_SCHED_ALT
+ /*
+  * This allows printing both to /proc/sched_debug and
+  * to the console
+@@ -215,6 +216,7 @@ static const struct file_operations sched_scaling_fops = {
+ };
+
+ #endif /* SMP */
++#endif /* !CONFIG_SCHED_ALT */
+
+ #ifdef CONFIG_PREEMPT_DYNAMIC
+
+@@ -278,6 +280,7 @@ static const struct file_operations sched_dynamic_fops = {
+
+ #endif /* CONFIG_PREEMPT_DYNAMIC */
+
++#ifndef CONFIG_SCHED_ALT
+ __read_mostly bool sched_debug_verbose;
+
+ static const struct seq_operations sched_debug_sops;
+@@ -293,6 +296,7 @@ static const struct file_operations sched_debug_fops = {
+ 	.llseek		= seq_lseek,
+ 	.release	= seq_release,
+ };
++#endif /* !CONFIG_SCHED_ALT */
+
+ static struct dentry *debugfs_sched;
+
+@@ -302,12 +306,15 @@ static __init int sched_init_debug(void)
+
+ 	debugfs_sched = debugfs_create_dir("sched", NULL);
+
++#ifndef CONFIG_SCHED_ALT
+ 	debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
+ 	debugfs_create_bool("verbose", 0644, debugfs_sched, &sched_debug_verbose);
++#endif /* !CONFIG_SCHED_ALT */
+ #ifdef CONFIG_PREEMPT_DYNAMIC
+ 	debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
+ #endif
+
++#ifndef CONFIG_SCHED_ALT
+ 	debugfs_create_u32("latency_ns", 0644, debugfs_sched, &sysctl_sched_latency);
+ 	debugfs_create_u32("min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_min_granularity);
+ 	debugfs_create_u32("idle_min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_idle_min_granularity);
+@@ -336,11 +343,13 @@ static __init int sched_init_debug(void)
+ #endif
+
+ 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
++#endif /* !CONFIG_SCHED_ALT */
+
+ 	return 0;
+ }
+ late_initcall(sched_init_debug);
+
++#ifndef CONFIG_SCHED_ALT
+ #ifdef CONFIG_SMP
+
+ static cpumask_var_t		sd_sysctl_cpus;
+@@ -1067,6 +1076,7 @@ void proc_sched_set_task(struct task_struct *p)
+ 	memset(&p->stats, 0, sizeof(p->stats));
+ #endif
+ }
++#endif /* !CONFIG_SCHED_ALT */
+
+ void resched_latency_warn(int cpu, u64 latency)
+ {
+diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
+index 328cccbee444..aef991facc79 100644
+--- a/kernel/sched/idle.c
++++ b/kernel/sched/idle.c
+@@ -400,6 +400,7 @@ void cpu_startup_entry(enum cpuhp_state state)
+ 		do_idle();
+ }
+
++#ifndef CONFIG_SCHED_ALT
+ /*
+  * idle-task scheduling class.
+  */
+@@ -521,3 +522,4 @@ DEFINE_SCHED_CLASS(idle) = {
+ 	.switched_to		= switched_to_idle,
+ 	.update_curr		= update_curr_idle,
+ };
++#endif
+diff --git a/kernel/sched/pds.h b/kernel/sched/pds.h
+new file mode 100644
+index 000000000000..56a649d02e49
+--- /dev/null
++++ b/kernel/sched/pds.h
+@@ -0,0 +1,127 @@
++#define ALT_SCHED_VERSION_MSG "sched/pds: PDS CPU Scheduler "ALT_SCHED_VERSION" by Alfred Chen.\n"
++
++static int sched_timeslice_shift = 22;
++
++#define NORMAL_PRIO_MOD(x)	((x) & (NORMAL_PRIO_NUM - 1))
++
++/*
++ * Common interfaces
++ */
++static inline void sched_timeslice_imp(const int timeslice_ms)
++{
++	if (2 == timeslice_ms)
++		sched_timeslice_shift = 21;
++}
++
++static inline int
++task_sched_prio_normal(const struct task_struct *p, const struct rq *rq)
++{
++	s64 delta = p->deadline - rq->time_edge + NORMAL_PRIO_NUM - NICE_WIDTH;
++
++	if (WARN_ONCE(delta > NORMAL_PRIO_NUM - 1,
++		      "pds: task_sched_prio_normal() delta %lld\n", delta))
++		return NORMAL_PRIO_NUM - 1;
++
++	return (delta < 0) ? 0 : delta;
++}
++
++static inline int task_sched_prio(const struct task_struct *p)
++{
++	return (p->prio < MAX_RT_PRIO) ? p->prio :
++		MIN_NORMAL_PRIO + task_sched_prio_normal(p, task_rq(p));
++}
++
++static inline int
++task_sched_prio_idx(const struct task_struct *p, const struct rq *rq)
++{
++	return (p->prio < MAX_RT_PRIO) ? p->prio : MIN_NORMAL_PRIO +
++		NORMAL_PRIO_MOD(task_sched_prio_normal(p, rq) + rq->time_edge);
++}
++
++static inline int sched_prio2idx(int prio, struct rq *rq)
++{
++	return (IDLE_TASK_SCHED_PRIO == prio || prio < MAX_RT_PRIO) ? prio :
++		MIN_NORMAL_PRIO + NORMAL_PRIO_MOD((prio - MIN_NORMAL_PRIO) +
++						  rq->time_edge);
++}
++
++static inline int sched_idx2prio(int idx, struct rq *rq)
++{
++	return (idx < MAX_RT_PRIO) ? idx : MIN_NORMAL_PRIO +
++		NORMAL_PRIO_MOD((idx - MIN_NORMAL_PRIO) + NORMAL_PRIO_NUM -
++				NORMAL_PRIO_MOD(rq->time_edge));
++}
++
++static inline void sched_renew_deadline(struct task_struct *p, const struct rq *rq)
++{
++	if (p->prio >= MAX_RT_PRIO)
++		p->deadline = (rq->clock >> sched_timeslice_shift) +
++			p->static_prio - (MAX_PRIO - NICE_WIDTH);
++}
++
++int task_running_nice(struct task_struct *p)
++{
++	return (p->prio > DEFAULT_PRIO);
++}
++
++static inline void update_rq_time_edge(struct rq *rq)
++{
++	struct list_head head;
++	u64 old = rq->time_edge;
++	u64 now = rq->clock >> sched_timeslice_shift;
++	u64 prio, delta;
++
++	if (now == old)
++		return;
++
++	delta = min_t(u64, NORMAL_PRIO_NUM, now - old);
++	INIT_LIST_HEAD(&head);
++
++	for_each_set_bit(prio, &rq->queue.bitmap[2], delta)
++		list_splice_tail_init(rq->queue.heads + MIN_NORMAL_PRIO +
++				      NORMAL_PRIO_MOD(prio + old), &head);
++
++	rq->queue.bitmap[2] = (NORMAL_PRIO_NUM == delta) ? 0UL :
++		rq->queue.bitmap[2] >> delta;
++	rq->time_edge = now;
++	if (!list_empty(&head)) {
++		u64 idx = MIN_NORMAL_PRIO + NORMAL_PRIO_MOD(now);
++		struct task_struct *p;
++
++		list_for_each_entry(p, &head, sq_node)
++			p->sq_idx = idx;
++
++		list_splice(&head, rq->queue.heads + idx);
++		rq->queue.bitmap[2] |= 1UL;
++	}
++}
++
++static inline void time_slice_expired(struct task_struct *p, struct rq *rq)
++{
++	p->time_slice = sched_timeslice_ns;
++	sched_renew_deadline(p, rq);
++	if (SCHED_FIFO != p->policy && task_on_rq_queued(p))
++		requeue_task(p, rq, task_sched_prio_idx(p, rq));
++}
++
++static inline void sched_task_sanity_check(struct task_struct *p, struct rq *rq)
++{
++	u64 max_dl = rq->time_edge + NICE_WIDTH - 1;
++	if (unlikely(p->deadline > max_dl))
++		p->deadline = max_dl;
++}
++
++static void sched_task_fork(struct task_struct *p, struct rq *rq)
++{
++	sched_renew_deadline(p, rq);
++}
++
++static inline void do_sched_yield_type_1(struct task_struct *p, struct rq *rq)
++{
++	time_slice_expired(p, rq);
++}
++
++#ifdef CONFIG_SMP
++static inline void sched_task_ttwu(struct task_struct *p) {}
++#endif
++static inline void sched_task_deactivate(struct task_struct *p, struct rq *rq) {}
+diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
+index 0f310768260c..bd38bf738fe9 100644
+--- a/kernel/sched/pelt.c
++++ b/kernel/sched/pelt.c
+@@ -266,6 +266,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
+ 	WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
+ }
+
++#ifndef CONFIG_SCHED_ALT
+ /*
+  * sched_entity:
+  *
+@@ -383,8 +384,9 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
+
+ 	return 0;
+ }
++#endif
+
+-#ifdef CONFIG_SCHED_THERMAL_PRESSURE
++#if defined(CONFIG_SCHED_THERMAL_PRESSURE) && !defined(CONFIG_SCHED_ALT)
+ /*
+  * thermal:
+  *
+diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
+index 4ff2ed4f8fa1..226eeed61318 100644
+--- a/kernel/sched/pelt.h
++++ b/kernel/sched/pelt.h
+@@ -1,13 +1,15 @@
+ #ifdef CONFIG_SMP
+ #include "sched-pelt.h"
+
++#ifndef CONFIG_SCHED_ALT
+ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se);
+ int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se);
+ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
+ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
+ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
++#endif
+
+-#ifdef CONFIG_SCHED_THERMAL_PRESSURE
++#if defined(CONFIG_SCHED_THERMAL_PRESSURE) && !defined(CONFIG_SCHED_ALT)
+ int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity);
+
+ static inline u64 thermal_load_avg(struct rq *rq)
+@@ -44,6 +46,7 @@ static inline u32 get_pelt_divider(struct sched_avg *avg)
+ 	return PELT_MIN_DIVIDER + avg->period_contrib;
+ }
+
++#ifndef CONFIG_SCHED_ALT
+ static inline void cfs_se_util_change(struct sched_avg *avg)
+ {
+ 	unsigned int enqueued;
+@@ -155,9 +158,11 @@ static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
+ 	return rq_clock_pelt(rq_of(cfs_rq));
+ }
+ #endif
++#endif /* CONFIG_SCHED_ALT */
+
+ #else
+
++#ifndef CONFIG_SCHED_ALT
+ static inline int
+ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
+ {
+@@ -175,6 +180,7 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
+ {
+ 	return 0;
+ }
++#endif
+
+ static inline int
+ update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
+diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
+index 47b89a0fc6e5..de2641a32c22 100644
+--- a/kernel/sched/sched.h
++++ b/kernel/sched/sched.h
+@@ -5,6 +5,10 @@
+ #ifndef _KERNEL_SCHED_SCHED_H
+ #define _KERNEL_SCHED_SCHED_H
+
++#ifdef CONFIG_SCHED_ALT
++#include "alt_sched.h"
++#else
++
+ #include <linux/sched/affinity.h>
+ #include <linux/sched/autogroup.h>
+ #include <linux/sched/cpufreq.h>
+@@ -3116,4 +3120,9 @@ extern int sched_dynamic_mode(const char *str);
+ extern void sched_dynamic_update(int mode);
+ #endif
+
++static inline int task_running_nice(struct task_struct *p)
++{
++	return (task_nice(p) > 0);
++}
++#endif /* !CONFIG_SCHED_ALT */
+ #endif /* _KERNEL_SCHED_SCHED_H */
+diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
+index 857f837f52cb..5486c63e4790 100644
+--- a/kernel/sched/stats.c
++++ b/kernel/sched/stats.c
+@@ -125,8 +125,10 @@ static int show_schedstat(struct seq_file *seq, void *v)
+ 	} else {
+ 		struct rq *rq;
+ #ifdef CONFIG_SMP
++#ifndef CONFIG_SCHED_ALT
+ 		struct sched_domain *sd;
+ 		int dcount = 0;
++#endif
+ #endif
+ 		cpu = (unsigned long)(v - 2);
+ 		rq = cpu_rq(cpu);
+@@ -143,6 +145,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
+ 		seq_printf(seq, "\n");
+
+ #ifdef CONFIG_SMP
++#ifndef CONFIG_SCHED_ALT
+ 		/* domain-specific stats */
+ 		rcu_read_lock();
+ 		for_each_domain(cpu, sd) {
+@@ -171,6 +174,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
+ 			    sd->ttwu_move_balance);
+ 		}
+ 		rcu_read_unlock();
++#endif
+ #endif
+ 	}
+ 	return 0;
+diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
+index baa839c1ba96..15238be0581b 100644
+--- a/kernel/sched/stats.h
++++ b/kernel/sched/stats.h
+@@ -89,6 +89,7 @@ static inline void rq_sched_info_depart  (struct rq *rq, unsigned long long delt
+
+ #endif /* CONFIG_SCHEDSTATS */
+
++#ifndef CONFIG_SCHED_ALT
+ #ifdef CONFIG_FAIR_GROUP_SCHED
+ struct sched_entity_stats {
+ 	struct sched_entity     se;
+@@ -105,6 +106,7 @@ __schedstats_from_se(struct sched_entity *se)
+ #endif
+ 	return &task_of(se)->stats;
+ }
++#endif /* CONFIG_SCHED_ALT */
+
+ #ifdef CONFIG_PSI
+ /*
+diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
+index 05b6c2ad90b9..480ef393b3c9 100644
+--- a/kernel/sched/topology.c
++++ b/kernel/sched/topology.c
+@@ -3,6 +3,7 @@
+  * Scheduler topology setup/handling methods
+  */
+
++#ifndef CONFIG_SCHED_ALT
+ DEFINE_MUTEX(sched_domains_mutex);
+
+ /* Protected by sched_domains_mutex: */
+@@ -1413,8 +1414,10 @@ static void asym_cpu_capacity_scan(void)
+  */
+
+ static int default_relax_domain_level = -1;
++#endif /* CONFIG_SCHED_ALT */
+ int sched_domain_level_max;
+
++#ifndef CONFIG_SCHED_ALT
+ static int __init setup_relax_domain_level(char *str)
+ {
+ 	if (kstrtoint(str, 0, &default_relax_domain_level))
+@@ -1647,6 +1650,7 @@ sd_init(struct sched_domain_topology_level *tl,
+
+ 	return sd;
+ }
++#endif /* CONFIG_SCHED_ALT */
+
+ /*
+  * Topology list, bottom-up.
+@@ -1683,6 +1687,7 @@ void set_sched_topology(struct sched_domain_topology_level *tl)
+ 	sched_domain_topology_saved = NULL;
+ }
+
++#ifndef CONFIG_SCHED_ALT
+ #ifdef CONFIG_NUMA
+
+ static const struct cpumask *sd_numa_mask(int cpu)
+@@ -2638,3 +2643,15 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+ 	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
+ 	mutex_unlock(&sched_domains_mutex);
+ }
++#else /* CONFIG_SCHED_ALT */
++void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
++			     struct sched_domain_attr *dattr_new)
++{}
++
++#ifdef CONFIG_NUMA
++int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
++{
++	return best_mask_cpu(cpu, cpus);
++}
++#endif /* CONFIG_NUMA */
++#endif
+diff --git a/kernel/sysctl.c b/kernel/sysctl.c
+index 35d034219513..23719c728677 100644
+--- a/kernel/sysctl.c
++++ b/kernel/sysctl.c
+@@ -86,6 +86,10 @@
+
+ /* Constants used for minimum and  maximum */
+
++#ifdef CONFIG_SCHED_ALT
++extern int sched_yield_type;
++#endif
++
+ #ifdef CONFIG_PERF_EVENTS
+ static const int six_hundred_forty_kb = 640 * 1024;
+ #endif
+@@ -1590,6 +1594,7 @@ int proc_do_static_key(struct ctl_table *table, int write,
+ }
+
+ static struct ctl_table kern_table[] = {
++#ifndef CONFIG_SCHED_ALT
+ #ifdef CONFIG_NUMA_BALANCING
+ 	{
+ 		.procname	= "numa_balancing",
+@@ -1601,6 +1606,7 @@ static struct ctl_table kern_table[] = {
+ 		.extra2		= SYSCTL_FOUR,
+ 	},
+ #endif /* CONFIG_NUMA_BALANCING */
++#endif /* !CONFIG_SCHED_ALT */
+ 	{
+ 		.procname	= "panic",
+ 		.data		= &panic_timeout,
+@@ -1902,6 +1908,17 @@ static struct ctl_table kern_table[] = {
+ 		.proc_handler	= proc_dointvec,
+ 	},
+ #endif
++#ifdef CONFIG_SCHED_ALT
++	{
++		.procname	= "yield_type",
++		.data		= &sched_yield_type,
++		.maxlen		= sizeof (int),
++		.mode		= 0644,
++		.proc_handler	= &proc_dointvec_minmax,
++		.extra1		= SYSCTL_ZERO,
++		.extra2		= SYSCTL_TWO,
++	},
++#endif
+ #if defined(CONFIG_S390) && defined(CONFIG_SMP)
+ 	{
+ 		.procname	= "spin_retry",
+diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
+index 0ea8702eb516..a27a0f3a654d 100644
+--- a/kernel/time/hrtimer.c
++++ b/kernel/time/hrtimer.c
+@@ -2088,8 +2088,10 @@ long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode,
+ 	int ret = 0;
+ 	u64 slack;
+
++#ifndef CONFIG_SCHED_ALT
+ 	slack = current->timer_slack_ns;
+ 	if (dl_task(current) || rt_task(current))
++#endif
+ 		slack = 0;
+
+ 	hrtimer_init_sleeper_on_stack(&t, clockid, mode);
+diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
+index cb925e8ef9a8..67d823510f5c 100644
+--- a/kernel/time/posix-cpu-timers.c
++++ b/kernel/time/posix-cpu-timers.c
+@@ -223,7 +223,7 @@ static void task_sample_cputime(struct task_struct *p, u64 *samples)
+ 	u64 stime, utime;
+
+ 	task_cputime(p, &utime, &stime);
+-	store_samples(samples, stime, utime, p->se.sum_exec_runtime);
++	store_samples(samples, stime, utime, tsk_seruntime(p));
+ }
+
+ static void proc_sample_cputime_atomic(struct task_cputime_atomic *at,
+@@ -866,6 +866,7 @@ static void collect_posix_cputimers(struct posix_cputimers *pct, u64 *samples,
+ 	}
+ }
+
++#ifndef CONFIG_SCHED_ALT
+ static inline void check_dl_overrun(struct task_struct *tsk)
+ {
+ 	if (tsk->dl.dl_overrun) {
+@@ -873,6 +874,7 @@ static inline void check_dl_overrun(struct task_struct *tsk)
+ 		send_signal_locked(SIGXCPU, SEND_SIG_PRIV, tsk, PIDTYPE_TGID);
+ 	}
+ }
++#endif
+
+ static bool check_rlimit(u64 time, u64 limit, int signo, bool rt, bool hard)
+ {
+@@ -900,8 +902,10 @@ static void check_thread_timers(struct task_struct *tsk,
+ 	u64 samples[CPUCLOCK_MAX];
+ 	unsigned long soft;
+
++#ifndef CONFIG_SCHED_ALT
+ 	if (dl_task(tsk))
+ 		check_dl_overrun(tsk);
++#endif
+
+ 	if (expiry_cache_is_inactive(pct))
+ 		return;
+@@ -915,7 +919,7 @@ static void check_thread_timers(struct task_struct *tsk,
+ 	soft = task_rlimit(tsk, RLIMIT_RTTIME);
+ 	if (soft != RLIM_INFINITY) {
+ 		/* Task RT timeout is accounted in jiffies. RTTIME is usec */
+-		unsigned long rttime = tsk->rt.timeout * (USEC_PER_SEC / HZ);
++		unsigned long rttime = tsk_rttimeout(tsk) * (USEC_PER_SEC / HZ);
+ 		unsigned long hard = task_rlimit_max(tsk, RLIMIT_RTTIME);
+
+ 		/* At the hard limit, send SIGKILL. No further action. */
+@@ -1151,8 +1155,10 @@ static inline bool fastpath_timer_check(struct task_struct *tsk)
+ 			return true;
+ 	}
+
++#ifndef CONFIG_SCHED_ALT
+ 	if (dl_task(tsk) && tsk->dl.dl_overrun)
+ 		return true;
++#endif
+
+ 	return false;
+ }
+diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
+index a2d301f58ced..2ccdede8585c 100644
+--- a/kernel/trace/trace_selftest.c
++++ b/kernel/trace/trace_selftest.c
+@@ -1143,10 +1143,15 @@ static int trace_wakeup_test_thread(void *data)
+ {
+ 	/* Make this a -deadline thread */
+ 	static const struct sched_attr attr = {
++#ifdef CONFIG_SCHED_ALT
++		/* No deadline on BMQ/PDS, use RR */
++		.sched_policy = SCHED_RR,
++#else
+ 		.sched_policy = SCHED_DEADLINE,
+ 		.sched_runtime = 100000ULL,
+ 		.sched_deadline = 10000000ULL,
+ 		.sched_period = 10000000ULL
++#endif
+ 	};
+ 	struct wakeup_test_data *x = data;
+

diff --git a/5021_BMQ-and-PDS-gentoo-defaults.patch b/5021_BMQ-and-PDS-gentoo-defaults.patch
new file mode 100644
index 00000000..6b2049da
--- /dev/null
+++ b/5021_BMQ-and-PDS-gentoo-defaults.patch
@@ -0,0 +1,13 @@
+--- a/init/Kconfig	2022-07-07 13:22:00.698439887 -0400
++++ b/init/Kconfig	2022-07-07 13:23:45.152333576 -0400
+@@ -874,8 +874,9 @@ config UCLAMP_BUCKETS_COUNT
+ 	  If in doubt, use the default value.
+
+ menuconfig SCHED_ALT
++	depends on X86_64
+ 	bool "Alternative CPU Schedulers"
+-	default y
++	default n
+ 	help
+ 	  This feature enable alternative CPU scheduler"
+
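
Note on the PDS index helpers in kernel/sched/pds.h: sched_prio2idx() and sched_idx2prio() rotate the normal-priority slots by rq->time_edge, so the two should undo each other for any normal priority. The following is only a user-space sketch of that arithmetic, not kernel code. The constant values are assumptions (MIN_NORMAL_PRIO, NORMAL_PRIO_NUM and IDLE_TASK_SCHED_PRIO are defined outside the part of the patch shown here), and the pds_* function names are hypothetical stand-ins that take time_edge directly instead of a struct rq.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values; their definitions are not in this part of the patch. */
#define MAX_RT_PRIO		100	/* mainline kernel value */
#define MIN_NORMAL_PRIO		128	/* assumption */
#define NORMAL_PRIO_NUM		64	/* assumption; must be a power of two */
#define IDLE_TASK_SCHED_PRIO	(MIN_NORMAL_PRIO + NORMAL_PRIO_NUM)	/* assumption */

#define NORMAL_PRIO_MOD(x)	((x) & (NORMAL_PRIO_NUM - 1))

/* Hypothetical stand-in for sched_prio2idx(): rotate forward by time_edge. */
static int pds_prio2idx(int prio, uint64_t time_edge)
{
	if (prio == IDLE_TASK_SCHED_PRIO || prio < MAX_RT_PRIO)
		return prio;	/* idle and RT slots are never rotated */
	return MIN_NORMAL_PRIO +
		(int)NORMAL_PRIO_MOD((uint64_t)(prio - MIN_NORMAL_PRIO) + time_edge);
}

/* Hypothetical stand-in for sched_idx2prio(): rotate back by time_edge. */
static int pds_idx2prio(int idx, uint64_t time_edge)
{
	if (idx < MAX_RT_PRIO)
		return idx;
	return MIN_NORMAL_PRIO +
		(int)NORMAL_PRIO_MOD((uint64_t)(idx - MIN_NORMAL_PRIO) +
				     NORMAL_PRIO_NUM - NORMAL_PRIO_MOD(time_edge));
}
```

Under these assumptions, a normal priority in the range MIN_NORMAL_PRIO to MIN_NORMAL_PRIO + NORMAL_PRIO_NUM - 1 maps to a queue index and back to itself for any time_edge, while real-time priorities pass through unchanged.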
