[gentoo-commits] proj/linux-patches:4.14 commit in: / - gentoo-commits

From:	Mike Pagano <mpagano@g.o>
To:	gentoo-commits@l.g.o
Subject:	[gentoo-commits] proj/linux-patches:4.14 commit in: /
Date:	Wed, 15 Aug 2018 16:48:17
Message-Id:	`1534351683.0918561a63056d9d05a92bd2e60d6ccaf84fff91.mpagano@gentoo`

commit:     0918561a63056d9d05a92bd2e60d6ccaf84fff91
Author:     Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Wed Aug 15 16:48:03 2018 +0000
Commit:     Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Wed Aug 15 16:48:03 2018 +0000
URL:        https://gitweb.gentoo.org/proj/linux-patches.git/commit/?id=0918561a

Linux patch 4.14.63

 0000_README              |    4 +
 1062_linux-4.14.63.patch | 5609 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 5613 insertions(+)

diff --git a/0000_README b/0000_README
index b530931..4c5f97e 100644
--- a/0000_README
+++ b/0000_README
@@ -291,6 +291,10 @@ Patch:  1061_linux-4.14.62.patch
 From:   http://www.kernel.org
 Desc:   Linux 4.14.62

+Patch:  1062_linux-4.14.63.patch
+From:   http://www.kernel.org
+Desc:   Linux 4.14.63
+
 Patch:  1500_XATTR_USER_PREFIX.patch
 From:   https://bugs.gentoo.org/show_bug.cgi?id=470644
 Desc:   Support for namespace user.pax.* on tmpfs.

diff --git a/1062_linux-4.14.63.patch b/1062_linux-4.14.63.patch
new file mode 100644
index 0000000..cff73c5
--- /dev/null
+++ b/1062_linux-4.14.63.patch
@@ -0,0 +1,5609 @@
+diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
+index 8355e79350b7..6cae60929cb6 100644
+--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
++++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
+@@ -379,6 +379,7 @@ What:		/sys/devices/system/cpu/vulnerabilities
+ 		/sys/devices/system/cpu/vulnerabilities/spectre_v1
+ 		/sys/devices/system/cpu/vulnerabilities/spectre_v2
+ 		/sys/devices/system/cpu/vulnerabilities/spec_store_bypass
++		/sys/devices/system/cpu/vulnerabilities/l1tf
+ Date:		January 2018
+ Contact:	Linux kernel mailing list <linux-kernel@×××××××××××.org>
+ Description:	Information about CPU vulnerabilities
+@@ -390,3 +391,26 @@ Description:	Information about CPU vulnerabilities
+ 		"Not affected"	  CPU is not affected by the vulnerability
+ 		"Vulnerable"	  CPU is affected and no mitigation in effect
+ 		"Mitigation: $M"  CPU is affected and mitigation $M is in effect
++
++		Details about the l1tf file can be found in
++		Documentation/admin-guide/l1tf.rst
++
++What:		/sys/devices/system/cpu/smt
++		/sys/devices/system/cpu/smt/active
++		/sys/devices/system/cpu/smt/control
++Date:		June 2018
++Contact:	Linux kernel mailing list <linux-kernel@×××××××××××.org>
++Description:	Control Symetric Multi Threading (SMT)
++
++		active:  Tells whether SMT is active (enabled and siblings online)
++
++		control: Read/write interface to control SMT. Possible
++			 values:
++
++			 "on"		SMT is enabled
++			 "off"		SMT is disabled
++			 "forceoff"	SMT is force disabled. Cannot be changed.
++			 "notsupported" SMT is not supported by the CPU
++
++			 If control status is "forceoff" or "notsupported" writes
++			 are rejected.
+diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
+index 5bb9161dbe6a..78f8f00c369f 100644
+--- a/Documentation/admin-guide/index.rst
++++ b/Documentation/admin-guide/index.rst
+@@ -17,6 +17,15 @@ etc.
+    kernel-parameters
+    devices
+
++This section describes CPU vulnerabilities and provides an overview of the
++possible mitigations along with guidance for selecting mitigations if they
++are configurable at compile, boot or run time.
++
++.. toctree::
++   :maxdepth: 1
++
++   l1tf
++
+ Here is a set of documents aimed at users who are trying to track down
+ problems and bugs in particular.
+
+diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
+index d6d7669e667f..9841bad6f271 100644
+--- a/Documentation/admin-guide/kernel-parameters.txt
++++ b/Documentation/admin-guide/kernel-parameters.txt
+@@ -1888,10 +1888,84 @@
+ 			(virtualized real and unpaged mode) on capable
+ 			Intel chips. Default is 1 (enabled)
+
++	kvm-intel.vmentry_l1d_flush=[KVM,Intel] Mitigation for L1 Terminal Fault
++			CVE-2018-3620.
++
++			Valid arguments: never, cond, always
++
++			always: L1D cache flush on every VMENTER.
++			cond:	Flush L1D on VMENTER only when the code between
++				VMEXIT and VMENTER can leak host memory.
++			never:	Disables the mitigation
++
++			Default is cond (do L1 cache flush in specific instances)
++
+ 	kvm-intel.vpid=	[KVM,Intel] Disable Virtual Processor Identification
+ 			feature (tagged TLBs) on capable Intel chips.
+ 			Default is 1 (enabled)
+
++	l1tf=           [X86] Control mitigation of the L1TF vulnerability on
++			      affected CPUs
++
++			The kernel PTE inversion protection is unconditionally
++			enabled and cannot be disabled.
++
++			full
++				Provides all available mitigations for the
++				L1TF vulnerability. Disables SMT and
++				enables all mitigations in the
++				hypervisors, i.e. unconditional L1D flush.
++
++				SMT control and L1D flush control via the
++				sysfs interface is still possible after
++				boot.  Hypervisors will issue a warning
++				when the first VM is started in a
++				potentially insecure configuration,
++				i.e. SMT enabled or L1D flush disabled.
++
++			full,force
++				Same as 'full', but disables SMT and L1D
++				flush runtime control. Implies the
++				'nosmt=force' command line option.
++				(i.e. sysfs control of SMT is disabled.)
++
++			flush
++				Leaves SMT enabled and enables the default
++				hypervisor mitigation, i.e. conditional
++				L1D flush.
++
++				SMT control and L1D flush control via the
++				sysfs interface is still possible after
++				boot.  Hypervisors will issue a warning
++				when the first VM is started in a
++				potentially insecure configuration,
++				i.e. SMT enabled or L1D flush disabled.
++
++			flush,nosmt
++
++				Disables SMT and enables the default
++				hypervisor mitigation.
++
++				SMT control and L1D flush control via the
++				sysfs interface is still possible after
++				boot.  Hypervisors will issue a warning
++				when the first VM is started in a
++				potentially insecure configuration,
++				i.e. SMT enabled or L1D flush disabled.
++
++			flush,nowarn
++				Same as 'flush', but hypervisors will not
++				warn when a VM is started in a potentially
++				insecure configuration.
++
++			off
++				Disables hypervisor mitigations and doesn't
++				emit any warnings.
++
++			Default is 'flush'.
++
++			For details see: Documentation/admin-guide/l1tf.rst
++
+ 	l2cr=		[PPC]
+
+ 	l3cr=		[PPC]
+@@ -2595,6 +2669,10 @@
+ 	nosmt		[KNL,S390] Disable symmetric multithreading (SMT).
+ 			Equivalent to smt=1.
+
++			[KNL,x86] Disable symmetric multithreading (SMT).
++			nosmt=force: Force disable SMT, cannot be undone
++				     via the sysfs control file.
++
+ 	nospectre_v2	[X86] Disable all mitigations for the Spectre variant 2
+ 			(indirect branch prediction) vulnerability. System may
+ 			allow data leaks with this option, which is equivalent
+diff --git a/Documentation/admin-guide/l1tf.rst b/Documentation/admin-guide/l1tf.rst
+new file mode 100644
+index 000000000000..bae52b845de0
+--- /dev/null
++++ b/Documentation/admin-guide/l1tf.rst
+@@ -0,0 +1,610 @@
++L1TF - L1 Terminal Fault
++========================
++
++L1 Terminal Fault is a hardware vulnerability which allows unprivileged
++speculative access to data which is available in the Level 1 Data Cache
++when the page table entry controlling the virtual address, which is used
++for the access, has the Present bit cleared or other reserved bits set.
++
++Affected processors
++-------------------
++
++This vulnerability affects a wide range of Intel processors. The
++vulnerability is not present on:
++
++   - Processors from AMD, Centaur and other non Intel vendors
++
++   - Older processor models, where the CPU family is < 6
++
++   - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
++     Penwell, Pineview, Silvermont, Airmont, Merrifield)
++
++   - The Intel XEON PHI family
++
++   - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
++     IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
++     by the Meltdown vulnerability either. These CPUs should become
++     available by end of 2018.
++
++Whether a processor is affected or not can be read out from the L1TF
++vulnerability file in sysfs. See :ref:`l1tf_sys_info`.
++
++Related CVEs
++------------
++
++The following CVE entries are related to the L1TF vulnerability:
++
++   =============  =================  ==============================
++   CVE-2018-3615  L1 Terminal Fault  SGX related aspects
++   CVE-2018-3620  L1 Terminal Fault  OS, SMM related aspects
++   CVE-2018-3646  L1 Terminal Fault  Virtualization related aspects
++   =============  =================  ==============================
++
++Problem
++-------
++
++If an instruction accesses a virtual address for which the relevant page
++table entry (PTE) has the Present bit cleared or other reserved bits set,
++then speculative execution ignores the invalid PTE and loads the referenced
++data if it is present in the Level 1 Data Cache, as if the page referenced
++by the address bits in the PTE was still present and accessible.
++
++While this is a purely speculative mechanism and the instruction will raise
++a page fault when it is retired eventually, the pure act of loading the
++data and making it available to other speculative instructions opens up the
++opportunity for side channel attacks to unprivileged malicious code,
++similar to the Meltdown attack.
++
++While Meltdown breaks the user space to kernel space protection, L1TF
++allows to attack any physical memory address in the system and the attack
++works across all protection domains. It allows an attack of SGX and also
++works from inside virtual machines because the speculation bypasses the
++extended page table (EPT) protection mechanism.
++
++
++Attack scenarios
++----------------
++
++1. Malicious user space
++^^^^^^^^^^^^^^^^^^^^^^^
++
++   Operating Systems store arbitrary information in the address bits of a
++   PTE which is marked non present. This allows a malicious user space
++   application to attack the physical memory to which these PTEs resolve.
++   In some cases user-space can maliciously influence the information
++   encoded in the address bits of the PTE, thus making attacks more
++   deterministic and more practical.
++
++   The Linux kernel contains a mitigation for this attack vector, PTE
++   inversion, which is permanently enabled and has no performance
++   impact. The kernel ensures that the address bits of PTEs, which are not
++   marked present, never point to cacheable physical memory space.
++
++   A system with an up to date kernel is protected against attacks from
++   malicious user space applications.
++
++2. Malicious guest in a virtual machine
++^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
++
++   The fact that L1TF breaks all domain protections allows malicious guest
++   OSes, which can control the PTEs directly, and malicious guest user
++   space applications, which run on an unprotected guest kernel lacking the
++   PTE inversion mitigation for L1TF, to attack physical host memory.
++
++   A special aspect of L1TF in the context of virtualization is symmetric
++   multi threading (SMT). The Intel implementation of SMT is called
++   HyperThreading. The fact that Hyperthreads on the affected processors
++   share the L1 Data Cache (L1D) is important for this. As the flaw allows
++   only to attack data which is present in L1D, a malicious guest running
++   on one Hyperthread can attack the data which is brought into the L1D by
++   the context which runs on the sibling Hyperthread of the same physical
++   core. This context can be host OS, host user space or a different guest.
++
++   If the processor does not support Extended Page Tables, the attack is
++   only possible, when the hypervisor does not sanitize the content of the
++   effective (shadow) page tables.
++
++   While solutions exist to mitigate these attack vectors fully, these
++   mitigations are not enabled by default in the Linux kernel because they
++   can affect performance significantly. The kernel provides several
++   mechanisms which can be utilized to address the problem depending on the
++   deployment scenario. The mitigations, their protection scope and impact
++   are described in the next sections.
++
++   The default mitigations and the rationale for choosing them are explained
++   at the end of this document. See :ref:`default_mitigations`.
++
++.. _l1tf_sys_info:
++
++L1TF system information
++-----------------------
++
++The Linux kernel provides a sysfs interface to enumerate the current L1TF
++status of the system: whether the system is vulnerable, and which
++mitigations are active. The relevant sysfs file is:
++
++/sys/devices/system/cpu/vulnerabilities/l1tf
++
++The possible values in this file are:
++
++  ===========================   ===============================
++  'Not affected'		The processor is not vulnerable
++  'Mitigation: PTE Inversion'	The host protection is active
++  ===========================   ===============================
++
++If KVM/VMX is enabled and the processor is vulnerable then the following
++information is appended to the 'Mitigation: PTE Inversion' part:
++
++  - SMT status:
++
++    =====================  ================
++    'VMX: SMT vulnerable'  SMT is enabled
++    'VMX: SMT disabled'    SMT is disabled
++    =====================  ================
++
++  - L1D Flush mode:
++
++    ================================  ====================================
++    'L1D vulnerable'		      L1D flushing is disabled
++
++    'L1D conditional cache flushes'   L1D flush is conditionally enabled
++
++    'L1D cache flushes'		      L1D flush is unconditionally enabled
++    ================================  ====================================
++
++The resulting grade of protection is discussed in the following sections.
++
++
++Host mitigation mechanism
++-------------------------
++
++The kernel is unconditionally protected against L1TF attacks from malicious
++user space running on the host.
++
++
++Guest mitigation mechanisms
++---------------------------
++
++.. _l1d_flush:
++
++1. L1D flush on VMENTER
++^^^^^^^^^^^^^^^^^^^^^^^
++
++   To make sure that a guest cannot attack data which is present in the L1D
++   the hypervisor flushes the L1D before entering the guest.
++
++   Flushing the L1D evicts not only the data which should not be accessed
++   by a potentially malicious guest, it also flushes the guest
++   data. Flushing the L1D has a performance impact as the processor has to
++   bring the flushed guest data back into the L1D. Depending on the
++   frequency of VMEXIT/VMENTER and the type of computations in the guest
++   performance degradation in the range of 1% to 50% has been observed. For
++   scenarios where guest VMEXIT/VMENTER are rare the performance impact is
++   minimal. Virtio and mechanisms like posted interrupts are designed to
++   confine the VMEXITs to a bare minimum, but specific configurations and
++   application scenarios might still suffer from a high VMEXIT rate.
++
++   The kernel provides two L1D flush modes:
++    - conditional ('cond')
++    - unconditional ('always')
++
++   The conditional mode avoids L1D flushing after VMEXITs which execute
++   only audited code paths before the corresponding VMENTER. These code
++   paths have been verified that they cannot expose secrets or other
++   interesting data to an attacker, but they can leak information about the
++   address space layout of the hypervisor.
++
++   Unconditional mode flushes L1D on all VMENTER invocations and provides
++   maximum protection. It has a higher overhead than the conditional
++   mode. The overhead cannot be quantified correctly as it depends on the
++   workload scenario and the resulting number of VMEXITs.
++
++   The general recommendation is to enable L1D flush on VMENTER. The kernel
++   defaults to conditional mode on affected processors.
++
++   **Note**, that L1D flush does not prevent the SMT problem because the
++   sibling thread will also bring back its data into the L1D which makes it
++   attackable again.
++
++   L1D flush can be controlled by the administrator via the kernel command
++   line and sysfs control files. See :ref:`mitigation_control_command_line`
++   and :ref:`mitigation_control_kvm`.
++
++.. _guest_confinement:
++
++2. Guest VCPU confinement to dedicated physical cores
++^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
++
++   To address the SMT problem, it is possible to make a guest or a group of
++   guests affine to one or more physical cores. The proper mechanism for
++   that is to utilize exclusive cpusets to ensure that no other guest or
++   host tasks can run on these cores.
++
++   If only a single guest or related guests run on sibling SMT threads on
++   the same physical core then they can only attack their own memory and
++   restricted parts of the host memory.
++
++   Host memory is attackable, when one of the sibling SMT threads runs in
++   host OS (hypervisor) context and the other in guest context. The amount
++   of valuable information from the host OS context depends on the context
++   which the host OS executes, i.e. interrupts, soft interrupts and kernel
++   threads. The amount of valuable data from these contexts cannot be
++   declared as non-interesting for an attacker without deep inspection of
++   the code.
++
++   **Note**, that assigning guests to a fixed set of physical cores affects
++   the ability of the scheduler to do load balancing and might have
++   negative effects on CPU utilization depending on the hosting
++   scenario. Disabling SMT might be a viable alternative for particular
++   scenarios.
++
++   For further information about confining guests to a single or to a group
++   of cores consult the cpusets documentation:
++
++   https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
++
++.. _interrupt_isolation:
++
++3. Interrupt affinity
++^^^^^^^^^^^^^^^^^^^^^
++
++   Interrupts can be made affine to logical CPUs. This is not universally
++   true because there are types of interrupts which are truly per CPU
++   interrupts, e.g. the local timer interrupt. Aside of that multi queue
++   devices affine their interrupts to single CPUs or groups of CPUs per
++   queue without allowing the administrator to control the affinities.
++
++   Moving the interrupts, which can be affinity controlled, away from CPUs
++   which run untrusted guests, reduces the attack vector space.
++
++   Whether the interrupts with are affine to CPUs, which run untrusted
++   guests, provide interesting data for an attacker depends on the system
++   configuration and the scenarios which run on the system. While for some
++   of the interrupts it can be assumed that they won't expose interesting
++   information beyond exposing hints about the host OS memory layout, there
++   is no way to make general assumptions.
++
++   Interrupt affinity can be controlled by the administrator via the
++   /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
++   available at:
++
++   https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
++
++.. _smt_control:
++
++4. SMT control
++^^^^^^^^^^^^^^
++
++   To prevent the SMT issues of L1TF it might be necessary to disable SMT
++   completely. Disabling SMT can have a significant performance impact, but
++   the impact depends on the hosting scenario and the type of workloads.
++   The impact of disabling SMT needs also to be weighted against the impact
++   of other mitigation solutions like confining guests to dedicated cores.
++
++   The kernel provides a sysfs interface to retrieve the status of SMT and
++   to control it. It also provides a kernel command line interface to
++   control SMT.
++
++   The kernel command line interface consists of the following options:
++
++     =========== ==========================================================
++     nosmt	 Affects the bring up of the secondary CPUs during boot. The
++		 kernel tries to bring all present CPUs online during the
++		 boot process. "nosmt" makes sure that from each physical
++		 core only one - the so called primary (hyper) thread is
++		 activated. Due to a design flaw of Intel processors related
++		 to Machine Check Exceptions the non primary siblings have
++		 to be brought up at least partially and are then shut down
++		 again.  "nosmt" can be undone via the sysfs interface.
++
++     nosmt=force Has the same effect as "nosmt" but it does not allow to
++		 undo the SMT disable via the sysfs interface.
++     =========== ==========================================================
++
++   The sysfs interface provides two files:
++
++   - /sys/devices/system/cpu/smt/control
++   - /sys/devices/system/cpu/smt/active
++
++   /sys/devices/system/cpu/smt/control:
++
++     This file allows to read out the SMT control state and provides the
++     ability to disable or (re)enable SMT. The possible states are:
++
++	==============  ===================================================
++	on		SMT is supported by the CPU and enabled. All
++			logical CPUs can be onlined and offlined without
++			restrictions.
++
++	off		SMT is supported by the CPU and disabled. Only
++			the so called primary SMT threads can be onlined
++			and offlined without restrictions. An attempt to
++			online a non-primary sibling is rejected
++
++	forceoff	Same as 'off' but the state cannot be controlled.
++			Attempts to write to the control file are rejected.
++
++	notsupported	The processor does not support SMT. It's therefore
++			not affected by the SMT implications of L1TF.
++			Attempts to write to the control file are rejected.
++	==============  ===================================================
++
++     The possible states which can be written into this file to control SMT
++     state are:
++
++     - on
++     - off
++     - forceoff
++
++   /sys/devices/system/cpu/smt/active:
++
++     This file reports whether SMT is enabled and active, i.e. if on any
++     physical core two or more sibling threads are online.
++
++   SMT control is also possible at boot time via the l1tf kernel command
++   line parameter in combination with L1D flush control. See
++   :ref:`mitigation_control_command_line`.
++
++5. Disabling EPT
++^^^^^^^^^^^^^^^^
++
++  Disabling EPT for virtual machines provides full mitigation for L1TF even
++  with SMT enabled, because the effective page tables for guests are
++  managed and sanitized by the hypervisor. Though disabling EPT has a
++  significant performance impact especially when the Meltdown mitigation
++  KPTI is enabled.
++
++  EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
++
++There is ongoing research and development for new mitigation mechanisms to
++address the performance impact of disabling SMT or EPT.
++
++.. _mitigation_control_command_line:
++
++Mitigation control on the kernel command line
++---------------------------------------------
++
++The kernel command line allows to control the L1TF mitigations at boot
++time with the option "l1tf=". The valid arguments for this option are:
++
++  ============  =============================================================
++  full		Provides all available mitigations for the L1TF
++		vulnerability. Disables SMT and enables all mitigations in
++		the hypervisors, i.e. unconditional L1D flushing
++
++		SMT control and L1D flush control via the sysfs interface
++		is still possible after boot.  Hypervisors will issue a
++		warning when the first VM is started in a potentially
++		insecure configuration, i.e. SMT enabled or L1D flush
++		disabled.
++
++  full,force	Same as 'full', but disables SMT and L1D flush runtime
++		control. Implies the 'nosmt=force' command line option.
++		(i.e. sysfs control of SMT is disabled.)
++
++  flush		Leaves SMT enabled and enables the default hypervisor
++		mitigation, i.e. conditional L1D flushing
++
++		SMT control and L1D flush control via the sysfs interface
++		is still possible after boot.  Hypervisors will issue a
++		warning when the first VM is started in a potentially
++		insecure configuration, i.e. SMT enabled or L1D flush
++		disabled.
++
++  flush,nosmt	Disables SMT and enables the default hypervisor mitigation,
++		i.e. conditional L1D flushing.
++
++		SMT control and L1D flush control via the sysfs interface
++		is still possible after boot.  Hypervisors will issue a
++		warning when the first VM is started in a potentially
++		insecure configuration, i.e. SMT enabled or L1D flush
++		disabled.
++
++  flush,nowarn	Same as 'flush', but hypervisors will not warn when a VM is
++		started in a potentially insecure configuration.
++
++  off		Disables hypervisor mitigations and doesn't emit any
++		warnings.
++  ============  =============================================================
++
++The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
++
++
++.. _mitigation_control_kvm:
++
++Mitigation control for KVM - module parameter
++-------------------------------------------------------------
++
++The KVM hypervisor mitigation mechanism, flushing the L1D cache when
++entering a guest, can be controlled with a module parameter.
++
++The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
++following arguments:
++
++  ============  ==============================================================
++  always	L1D cache flush on every VMENTER.
++
++  cond		Flush L1D on VMENTER only when the code between VMEXIT and
++		VMENTER can leak host memory which is considered
++		interesting for an attacker. This still can leak host memory
++		which allows e.g. to determine the hosts address space layout.
++
++  never		Disables the mitigation
++  ============  ==============================================================
++
++The parameter can be provided on the kernel command line, as a module
++parameter when loading the modules and at runtime modified via the sysfs
++file:
++
++/sys/module/kvm_intel/parameters/vmentry_l1d_flush
++
++The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
++line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
++module parameter is ignored and writes to the sysfs file are rejected.
++
++
++Mitigation selection guide
++--------------------------
++
++1. No virtualization in use
++^^^^^^^^^^^^^^^^^^^^^^^^^^^
++
++   The system is protected by the kernel unconditionally and no further
++   action is required.
++
++2. Virtualization with trusted guests
++^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
++
++   If the guest comes from a trusted source and the guest OS kernel is
++   guaranteed to have the L1TF mitigations in place the system is fully
++   protected against L1TF and no further action is required.
++
++   To avoid the overhead of the default L1D flushing on VMENTER the
++   administrator can disable the flushing via the kernel command line and
++   sysfs control files. See :ref:`mitigation_control_command_line` and
++   :ref:`mitigation_control_kvm`.
++
++
++3. Virtualization with untrusted guests
++^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
++
++3.1. SMT not supported or disabled
++""""""""""""""""""""""""""""""""""
++
++  If SMT is not supported by the processor or disabled in the BIOS or by
++  the kernel, it's only required to enforce L1D flushing on VMENTER.
++
++  Conditional L1D flushing is the default behaviour and can be tuned. See
++  :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
++
++3.2. EPT not supported or disabled
++""""""""""""""""""""""""""""""""""
++
++  If EPT is not supported by the processor or disabled in the hypervisor,
++  the system is fully protected. SMT can stay enabled and L1D flushing on
++  VMENTER is not required.
++
++  EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
++
++3.3. SMT and EPT supported and active
++"""""""""""""""""""""""""""""""""""""
++
++  If SMT and EPT are supported and active then various degrees of
++  mitigations can be employed:
++
++  - L1D flushing on VMENTER:
++
++    L1D flushing on VMENTER is the minimal protection requirement, but it
++    is only potent in combination with other mitigation methods.
++
++    Conditional L1D flushing is the default behaviour and can be tuned. See
++    :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
++
++  - Guest confinement:
++
++    Confinement of guests to a single or a group of physical cores which
++    are not running any other processes, can reduce the attack surface
++    significantly, but interrupts, soft interrupts and kernel threads can
++    still expose valuable data to a potential attacker. See
++    :ref:`guest_confinement`.
++
++  - Interrupt isolation:
++
++    Isolating the guest CPUs from interrupts can reduce the attack surface
++    further, but still allows a malicious guest to explore a limited amount
++    of host physical memory. This can at least be used to gain knowledge
++    about the host address space layout. The interrupts which have a fixed
++    affinity to the CPUs which run the untrusted guests can depending on
++    the scenario still trigger soft interrupts and schedule kernel threads
++    which might expose valuable information. See
++    :ref:`interrupt_isolation`.
++
++The above three mitigation methods combined can provide protection to a
++certain degree, but the risk of the remaining attack surface has to be

724

++carefully analyzed. For full protection the following methods are

725

++available:

726

++

727

++  - Disabling SMT:

728

++

729

++    Disabling SMT and enforcing the L1D flushing provides the maximum

730

++    amount of protection. This mitigation does not depend on any of the

731

++    above mitigation methods.

732

++

733

++    SMT control and L1D flushing can be tuned by the command line

734

++    parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run

735

++    time with the matching sysfs control files. See :ref:`smt_control`,

736

++    :ref:`mitigation_control_command_line` and

737

++    :ref:`mitigation_control_kvm`.

738

++

739

++  - Disabling EPT:

740

++

741

++    Disabling EPT provides the maximum amount of protection as well. It does

741

++    not depend on any of the above mitigation methods. SMT can stay

743

++    enabled and L1D flushing is not required, but the performance impact is

744

++    significant.

745

++

746

++    EPT can be disabled in the hypervisor via the 'kvm-intel.ept'

747

++    parameter.

748

++

749

++3.4. Nested virtual machines

750

++""""""""""""""""""""""""""""

751

++

752

++When nested virtualization is in use, three operating systems are involved:

753

++the bare metal hypervisor, the nested hypervisor and the nested virtual

754

++machine.  VMENTER operations from the nested hypervisor into the nested

755

++guest will always be processed by the bare metal hypervisor. If KVM is the

756

++bare metal hypervisor it will:

757

++

758

++ - Flush the L1D cache on every switch from the nested hypervisor to the

759

++   nested virtual machine, so that the nested hypervisor's secrets are not

760

++   exposed to the nested virtual machine;

761

++

762

++ - Flush the L1D cache on every switch from the nested virtual machine to

763

++   the nested hypervisor; this is a complex operation, and flushing the L1D

764

++   cache avoids that the bare metal hypervisor's secrets are exposed to the

765

++   nested virtual machine;

766

++

767

++ - Instruct the nested hypervisor to not perform any L1D cache flush. This

768

++   is an optimization to avoid double L1D flushing.

769

++

770

++

771

++.. _default_mitigations:

772

++

773

++Default mitigations

774

++-------------------

775

++

776

++  The kernel default mitigations for vulnerable processors are:

777

++

778

++  - PTE inversion to protect against malicious user space. This is done

779

++    unconditionally and cannot be controlled.

780

++

781

++  - L1D conditional flushing on VMENTER when EPT is enabled for

782

++    a guest.

783

++

784

++  The kernel does not by default enforce the disabling of SMT, which leaves

785

++  SMT systems vulnerable when running untrusted guests with EPT enabled.

786

++

787

++  The rationale for this choice is:

788

++

789

++  - Force disabling SMT can break existing setups, especially with

790

++    unattended updates.

791

++

792

++  - If regular users run untrusted guests on their machine, then L1TF is

793

++    just an add on to other malware which might be embedded in an untrusted

794

++    guest, e.g. spam-bots or attacks on the local network.

795

++

796

++    There is no technical way to prevent a user from running untrusted code

797

++    on their machines blindly.

798

++

799

++  - It's technically extremely unlikely and from today's knowledge even

800

++    impossible that L1TF can be exploited via the most popular attack

801

++    mechanisms like JavaScript because these mechanisms have no way to

802

++    control PTEs. If this would be possible and no other mitigation would

803

++    be possible, then the default might be different.

804

++

805

++  - The administrators of cloud and hosting setups have to carefully

806

++    analyze the risk for their scenarios and make the appropriate

807

++    mitigation choices, which might even vary across their deployed

808

++    machines and also result in other changes of their overall setup.

809

++    There is no way for the kernel to provide a sensible default for this

810

++    kind of scenarios.
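The mitigation selection guide in the documentation above can be condensed into a small decision helper. This is only a sketch: the function name `l1tf_recommendation` and its parameters are hypothetical, the returned strings paraphrase the guide, and `guests_trusted` is assumed to mean both that the guest comes from a trusted source and that its kernel carries the L1TF mitigations.

```python
def l1tf_recommendation(virtualization, guests_trusted=True,
                        smt_active=False, ept_active=False):
    """Map a host scenario to (guide section, paraphrased advice)."""
    if not virtualization:
        return ("1", "no further action required")
    if guests_trusted:
        return ("2", "no further action required; "
                     "L1D flushing on VMENTER may be disabled")
    # EPT is checked before SMT: section 3.2 states that without EPT the
    # system is fully protected and SMT can stay enabled.
    if not ept_active:
        return ("3.2", "fully protected; SMT can stay enabled, "
                       "no L1D flushing required")
    if not smt_active:
        return ("3.1", "enforce L1D flushing on VMENTER")
    return ("3.3", "for full protection, disable SMT and enforce "
                   "L1D flushing, or disable EPT")
```

For example, `l1tf_recommendation(True, guests_trusted=False, smt_active=True, ept_active=True)` lands in section 3.3, the only case where the administrator has to choose between partial and full mitigation.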

811

+diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt

812

+index 88ad78c6f605..5d12166bd66b 100644

813

+--- a/Documentation/virtual/kvm/api.txt

814

++++ b/Documentation/virtual/kvm/api.txt

815

+@@ -123,14 +123,15 @@ memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the

816

+ flag KVM_VM_MIPS_VZ.

817

+

818

+

819

+-4.3 KVM_GET_MSR_INDEX_LIST

820

++4.3 KVM_GET_MSR_INDEX_LIST, KVM_GET_MSR_FEATURE_INDEX_LIST

821

+

822

+-Capability: basic

823

++Capability: basic, KVM_CAP_GET_MSR_FEATURES for KVM_GET_MSR_FEATURE_INDEX_LIST

824

+ Architectures: x86

825

+-Type: system

826

++Type: system ioctl

827

+ Parameters: struct kvm_msr_list (in/out)

828

+ Returns: 0 on success; -1 on error

829

+ Errors:

830

++  EFAULT:    the msr index list cannot be read from or written to

831

+   E2BIG:     the msr index list is too big to fit in the array specified by

832

+              the user.

833

+

834

+@@ -139,16 +140,23 @@ struct kvm_msr_list {

835

+ 	__u32 indices[0];

836

+ };

837

+

838

+-This ioctl returns the guest msrs that are supported.  The list varies

839

+-by kvm version and host processor, but does not change otherwise.  The

840

+-user fills in the size of the indices array in nmsrs, and in return

841

+-kvm adjusts nmsrs to reflect the actual number of msrs and fills in

842

+-the indices array with their numbers.

843

++The user fills in the size of the indices array in nmsrs, and in return

844

++kvm adjusts nmsrs to reflect the actual number of msrs and fills in the

845

++indices array with their numbers.

846

++

847

++KVM_GET_MSR_INDEX_LIST returns the guest msrs that are supported.  The list

848

++varies by kvm version and host processor, but does not change otherwise.

849

+

850

+ Note: if kvm indicates support for MCE (KVM_CAP_MCE), then the MCE bank MSRs are

851

+ not returned in the MSR list, as different vcpus can have a different number

852

+ of banks, as set via the KVM_X86_SETUP_MCE ioctl.

853

+

854

++KVM_GET_MSR_FEATURE_INDEX_LIST returns the list of MSRs that can be passed

855

++to the KVM_GET_MSRS system ioctl.  This lets userspace probe host capabilities

856

++and processor features that are exposed via MSRs (e.g., VMX capabilities).

857

++This list also varies by kvm version and host processor, but does not change

858

++otherwise.

859

++
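The in/out buffer protocol described above (the caller sets nmsrs to the array capacity, KVM writes back the real count and the indices) can be sketched in Python. The helper names `pack_msr_list` and `unpack_msr_list` are invented for this example. The ioctl numbers 0x02 and 0x0a and the KVMIO type 0xAE are quoted from memory of the uapi header and should be checked against `<linux/kvm.h>` before use.

```python
import struct

KVMIO = 0xAE  # assumption: ioctl type used by KVM in <linux/kvm.h>


def _iowr(ioc_type, nr, size):
    """Linux _IOWR() encoding on x86: dir (2 bits), size (14), type (8), nr (8)."""
    return (3 << 30) | (size << 16) | (ioc_type << 8) | nr


# struct kvm_msr_list { __u32 nmsrs; __u32 indices[0]; } has sizeof == 4.
KVM_GET_MSR_INDEX_LIST = _iowr(KVMIO, 0x02, 4)
KVM_GET_MSR_FEATURE_INDEX_LIST = _iowr(KVMIO, 0x0A, 4)  # nr assumed


def pack_msr_list(capacity):
    """Build the in/out buffer: nmsrs = capacity, then zeroed index slots."""
    return bytearray(struct.pack("=I", capacity) + bytes(4 * capacity))


def unpack_msr_list(buf):
    """Return (nmsrs, indices); only indices that fit in buf are decoded."""
    (nmsrs,) = struct.unpack_from("=I", buf, 0)
    count = min(nmsrs, (len(buf) - 4) // 4)
    return nmsrs, list(struct.unpack_from("=%dI" % count, buf, 4))
```

On a host with `/dev/kvm`, such a buffer would be handed to `fcntl.ioctl(fd, KVM_GET_MSR_INDEX_LIST, buf)`. A buffer that is too small is expected to fail with E2BIG, after which the caller can retry with a larger capacity.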

860

+

861

+ 4.4 KVM_CHECK_EXTENSION

862

+

863

+@@ -475,14 +483,22 @@ Support for this has been removed.  Use KVM_SET_GUEST_DEBUG instead.

864

+

865

+ 4.18 KVM_GET_MSRS

866

+

867

+-Capability: basic

868

++Capability: basic (vcpu), KVM_CAP_GET_MSR_FEATURES (system)

869

+ Architectures: x86

870

+-Type: vcpu ioctl

871

++Type: system ioctl, vcpu ioctl

872

+ Parameters: struct kvm_msrs (in/out)

873

+-Returns: 0 on success, -1 on error

874

++Returns: number of msrs successfully returned;

875

++        -1 on error

876

++

877

++When used as a system ioctl:

878

++Reads the values of MSR-based features that are available for the VM.  This

879

++is similar to KVM_GET_SUPPORTED_CPUID, but it returns MSR indices and values.

880

++The list of msr-based features can be obtained using KVM_GET_MSR_FEATURE_INDEX_LIST

881

++in a system ioctl.

882

+

883

++When used as a vcpu ioctl:

884

+ Reads model-specific registers from the vcpu.  Supported msr indices can

885

+-be obtained using KVM_GET_MSR_INDEX_LIST.

886

++be obtained using KVM_GET_MSR_INDEX_LIST in a system ioctl.

887

+

888

+ struct kvm_msrs {

889

+ 	__u32 nmsrs; /* number of msrs in entries */

890

+diff --git a/Makefile b/Makefile

891

+index d407ecfdee0b..f3bb9428b3dc 100644

892

+--- a/Makefile

893

++++ b/Makefile

894

+@@ -1,7 +1,7 @@

895

+ # SPDX-License-Identifier: GPL-2.0

896

+ VERSION = 4

897

+ PATCHLEVEL = 14

898

+-SUBLEVEL = 62

899

++SUBLEVEL = 63

900

+ EXTRAVERSION =

901

+ NAME = Petit Gorille

902

+

903

+diff --git a/arch/Kconfig b/arch/Kconfig

904

+index 400b9e1b2f27..4e01862f58e4 100644

905

+--- a/arch/Kconfig

906

++++ b/arch/Kconfig

907

+@@ -13,6 +13,9 @@ config KEXEC_CORE

908

+ config HAVE_IMA_KEXEC

909

+ 	bool

910

+

911

++config HOTPLUG_SMT

912

++	bool

913

++

914

+ config OPROFILE

915

+ 	tristate "OProfile system profiling"

916

+ 	depends on PROFILING

917

+diff --git a/arch/arm/boot/dts/imx6sx.dtsi b/arch/arm/boot/dts/imx6sx.dtsi

918

+index 6c7eb54be9e2..d64438bfa68b 100644

919

+--- a/arch/arm/boot/dts/imx6sx.dtsi

920

++++ b/arch/arm/boot/dts/imx6sx.dtsi

921

+@@ -1305,7 +1305,7 @@

922

+ 				  0x82000000 0 0x08000000 0x08000000 0 0x00f00000>;

923

+ 			bus-range = <0x00 0xff>;

924

+ 			num-lanes = <1>;

925

+-			interrupts = <GIC_SPI 123 IRQ_TYPE_LEVEL_HIGH>;

926

++			interrupts = <GIC_SPI 120 IRQ_TYPE_LEVEL_HIGH>;

927

+ 			clocks = <&clks IMX6SX_CLK_PCIE_REF_125M>,

928

+ 				 <&clks IMX6SX_CLK_PCIE_AXI>,

929

+ 				 <&clks IMX6SX_CLK_LVDS1_OUT>,

930

+diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig

931

+index 1fd3eb5b66c6..89e684fd795f 100644

932

+--- a/arch/parisc/Kconfig

933

++++ b/arch/parisc/Kconfig

934

+@@ -201,7 +201,7 @@ config PREFETCH

935

+

936

+ config MLONGCALLS

937

+ 	bool "Enable the -mlong-calls compiler option for big kernels"

938

+-	def_bool y if (!MODULES)

939

++	default y

940

+ 	depends on PA8X00

941

+ 	help

942

+ 	  If you configure the kernel to include many drivers built-in instead

943

+diff --git a/arch/parisc/include/asm/barrier.h b/arch/parisc/include/asm/barrier.h

944

+new file mode 100644

945

+index 000000000000..dbaaca84f27f

946

+--- /dev/null

947

++++ b/arch/parisc/include/asm/barrier.h

948

+@@ -0,0 +1,32 @@

949

++/* SPDX-License-Identifier: GPL-2.0 */

950

++#ifndef __ASM_BARRIER_H

951

++#define __ASM_BARRIER_H

952

++

953

++#ifndef __ASSEMBLY__

954

++

955

++/* The synchronize caches instruction executes as a nop on systems in

956

++   which all memory references are performed in order. */

957

++#define synchronize_caches() __asm__ __volatile__ ("sync" : : : "memory")

958

++

959

++#if defined(CONFIG_SMP)

960

++#define mb()		do { synchronize_caches(); } while (0)

961

++#define rmb()		mb()

962

++#define wmb()		mb()

963

++#define dma_rmb()	mb()

964

++#define dma_wmb()	mb()

965

++#else

966

++#define mb()		barrier()

967

++#define rmb()		barrier()

968

++#define wmb()		barrier()

969

++#define dma_rmb()	barrier()

970

++#define dma_wmb()	barrier()

971

++#endif

972

++

973

++#define __smp_mb()	mb()

974

++#define __smp_rmb()	mb()

975

++#define __smp_wmb()	mb()

976

++

977

++#include <asm-generic/barrier.h>

978

++

979

++#endif /* !__ASSEMBLY__ */

980

++#endif /* __ASM_BARRIER_H */

981

+diff --git a/arch/parisc/kernel/entry.S b/arch/parisc/kernel/entry.S

982

+index e95207c0565e..1b4732e20137 100644

983

+--- a/arch/parisc/kernel/entry.S

984

++++ b/arch/parisc/kernel/entry.S

985

+@@ -481,6 +481,8 @@

986

+ 	/* Release pa_tlb_lock lock without reloading lock address. */

987

+ 	.macro		tlb_unlock0	spc,tmp

988

+ #ifdef CONFIG_SMP

989

++	or,COND(=)	%r0,\spc,%r0

990

++	sync

991

+ 	or,COND(=)	%r0,\spc,%r0

992

+ 	stw             \spc,0(\tmp)

993

+ #endif

994

+diff --git a/arch/parisc/kernel/pacache.S b/arch/parisc/kernel/pacache.S

995

+index 67b0f7532e83..3e163df49cf3 100644

996

+--- a/arch/parisc/kernel/pacache.S

997

++++ b/arch/parisc/kernel/pacache.S

998

+@@ -354,6 +354,7 @@ ENDPROC_CFI(flush_data_cache_local)

999

+ 	.macro	tlb_unlock	la,flags,tmp

1000

+ #ifdef CONFIG_SMP

1001

+ 	ldi		1,\tmp

1002

++	sync

1003

+ 	stw		\tmp,0(\la)

1004

+ 	mtsm		\flags

1005

+ #endif

1006

+diff --git a/arch/parisc/kernel/syscall.S b/arch/parisc/kernel/syscall.S

1007

+index e775f80ae28c..4886a6db42e9 100644

1008

+--- a/arch/parisc/kernel/syscall.S

1009

++++ b/arch/parisc/kernel/syscall.S

1010

+@@ -633,6 +633,7 @@ cas_action:

1011

+ 	sub,<>	%r28, %r25, %r0

1012

+ 2:	stw,ma	%r24, 0(%r26)

1013

+ 	/* Free lock */

1014

++	sync

1015

+ 	stw,ma	%r20, 0(%sr2,%r20)

1016

+ #if ENABLE_LWS_DEBUG

1017

+ 	/* Clear thread register indicator */

1018

+@@ -647,6 +648,7 @@ cas_action:

1019

+ 3:		

1020

+ 	/* Error occurred on load or store */

1021

+ 	/* Free lock */

1022

++	sync

1023

+ 	stw	%r20, 0(%sr2,%r20)

1024

+ #if ENABLE_LWS_DEBUG

1025

+ 	stw	%r0, 4(%sr2,%r20)

1026

+@@ -848,6 +850,7 @@ cas2_action:

1027

+

1028

+ cas2_end:

1029

+ 	/* Free lock */

1030

++	sync

1031

+ 	stw,ma	%r20, 0(%sr2,%r20)

1032

+ 	/* Enable interrupts */

1033

+ 	ssm	PSW_SM_I, %r0

1034

+@@ -858,6 +861,7 @@ cas2_end:

1035

+ 22:

1036

+ 	/* Error occurred on load or store */

1037

+ 	/* Free lock */

1038

++	sync

1039

+ 	stw	%r20, 0(%sr2,%r20)

1040

+ 	ssm	PSW_SM_I, %r0

1041

+ 	ldo	1(%r0),%r28

1042

+diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig

1043

+index 7483cd514c32..1c63a4b5320d 100644

1044

+--- a/arch/x86/Kconfig

1045

++++ b/arch/x86/Kconfig

1046

+@@ -176,6 +176,7 @@ config X86

1047

+ 	select HAVE_SYSCALL_TRACEPOINTS

1048

+ 	select HAVE_UNSTABLE_SCHED_CLOCK

1049

+ 	select HAVE_USER_RETURN_NOTIFIER

1050

++	select HOTPLUG_SMT			if SMP

1051

+ 	select IRQ_FORCED_THREADING

1052

+ 	select PCI_LOCKLESS_CONFIG

1053

+ 	select PERF_EVENTS

1054

+diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h

1055

+index 5f01671c68f2..a1ed92aae12a 100644

1056

+--- a/arch/x86/include/asm/apic.h

1057

++++ b/arch/x86/include/asm/apic.h

1058

+@@ -10,6 +10,7 @@

1059

+ #include <asm/fixmap.h>

1060

+ #include <asm/mpspec.h>

1061

+ #include <asm/msr.h>

1062

++#include <asm/hardirq.h>

1063

+

1064

+ #define ARCH_APICTIMER_STOPS_ON_C3	1

1065

+

1066

+@@ -613,12 +614,20 @@ extern int default_check_phys_apicid_present(int phys_apicid);

1067

+ #endif

1068

+

1069

+ #endif /* CONFIG_X86_LOCAL_APIC */

1070

++

1071

++#ifdef CONFIG_SMP

1072

++bool apic_id_is_primary_thread(unsigned int id);

1073

++#else

1074

++static inline bool apic_id_is_primary_thread(unsigned int id) { return false; }

1075

++#endif

1076

++

1077

+ extern void irq_enter(void);

1078

+ extern void irq_exit(void);

1079

+

1080

+ static inline void entering_irq(void)

1081

+ {

1082

+ 	irq_enter();

1083

++	kvm_set_cpu_l1tf_flush_l1d();

1084

+ }

1085

+

1086

+ static inline void entering_ack_irq(void)

1087

+@@ -631,6 +640,7 @@ static inline void ipi_entering_ack_irq(void)

1088

+ {

1089

+ 	irq_enter();

1090

+ 	ack_APIC_irq();

1091

++	kvm_set_cpu_l1tf_flush_l1d();

1092

+ }

1093

+

1094

+ static inline void exiting_irq(void)

1095

+diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h

1096

+index 403e97d5e243..8418462298e7 100644

1097

+--- a/arch/x86/include/asm/cpufeatures.h

1098

++++ b/arch/x86/include/asm/cpufeatures.h

1099

+@@ -219,6 +219,7 @@

1100

+ #define X86_FEATURE_IBPB		( 7*32+26) /* Indirect Branch Prediction Barrier */

1101

+ #define X86_FEATURE_STIBP		( 7*32+27) /* Single Thread Indirect Branch Predictors */

1102

+ #define X86_FEATURE_ZEN			( 7*32+28) /* "" CPU is AMD family 0x17 (Zen) */

1103

++#define X86_FEATURE_L1TF_PTEINV		( 7*32+29) /* "" L1TF workaround PTE inversion */

1104

+

1105

+ /* Virtualization flags: Linux defined, word 8 */

1106

+ #define X86_FEATURE_TPR_SHADOW		( 8*32+ 0) /* Intel TPR Shadow */

1107

+@@ -338,6 +339,7 @@

1108

+ #define X86_FEATURE_PCONFIG		(18*32+18) /* Intel PCONFIG */

1109

+ #define X86_FEATURE_SPEC_CTRL		(18*32+26) /* "" Speculation Control (IBRS + IBPB) */

1110

+ #define X86_FEATURE_INTEL_STIBP		(18*32+27) /* "" Single Thread Indirect Branch Predictors */

1111

++#define X86_FEATURE_FLUSH_L1D		(18*32+28) /* Flush L1D cache */

1112

+ #define X86_FEATURE_ARCH_CAPABILITIES	(18*32+29) /* IA32_ARCH_CAPABILITIES MSR (Intel) */

1113

+ #define X86_FEATURE_SPEC_CTRL_SSBD	(18*32+31) /* "" Speculative Store Bypass Disable */

1114

+

1115

+@@ -370,5 +372,6 @@

1116

+ #define X86_BUG_SPECTRE_V1		X86_BUG(15) /* CPU is affected by Spectre variant 1 attack with conditional branches */

1117

+ #define X86_BUG_SPECTRE_V2		X86_BUG(16) /* CPU is affected by Spectre variant 2 attack with indirect branches */

1118

+ #define X86_BUG_SPEC_STORE_BYPASS	X86_BUG(17) /* CPU is affected by speculative store bypass attack */

1119

++#define X86_BUG_L1TF			X86_BUG(18) /* CPU is affected by L1 Terminal Fault */

1120

+

1121

+ #endif /* _ASM_X86_CPUFEATURES_H */

1122

+diff --git a/arch/x86/include/asm/dmi.h b/arch/x86/include/asm/dmi.h

1123

+index 0ab2ab27ad1f..b825cb201251 100644

1124

+--- a/arch/x86/include/asm/dmi.h

1125

++++ b/arch/x86/include/asm/dmi.h

1126

+@@ -4,8 +4,8 @@

1127

+

1128

+ #include <linux/compiler.h>

1129

+ #include <linux/init.h>

1130

++#include <linux/io.h>

1131

+

1132

+-#include <asm/io.h>

1133

+ #include <asm/setup.h>

1134

+

1135

+ static __always_inline __init void *dmi_alloc(unsigned len)

1136

+diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h

1137

+index 51cc979dd364..486c843273c4 100644

1138

+--- a/arch/x86/include/asm/hardirq.h

1139

++++ b/arch/x86/include/asm/hardirq.h

1140

+@@ -3,10 +3,12 @@

1141

+ #define _ASM_X86_HARDIRQ_H

1142

+

1143

+ #include <linux/threads.h>

1144

+-#include <linux/irq.h>

1145

+

1146

+ typedef struct {

1147

+-	unsigned int __softirq_pending;

1148

++	u16	     __softirq_pending;

1149

++#if IS_ENABLED(CONFIG_KVM_INTEL)

1150

++	u8	     kvm_cpu_l1tf_flush_l1d;

1151

++#endif

1152

+ 	unsigned int __nmi_count;	/* arch dependent */

1153

+ #ifdef CONFIG_X86_LOCAL_APIC

1154

+ 	unsigned int apic_timer_irqs;	/* arch dependent */

1155

+@@ -62,4 +64,24 @@ extern u64 arch_irq_stat_cpu(unsigned int cpu);

1156

+ extern u64 arch_irq_stat(void);

1157

+ #define arch_irq_stat		arch_irq_stat

1158

+

1159

++

1160

++#if IS_ENABLED(CONFIG_KVM_INTEL)

1161

++static inline void kvm_set_cpu_l1tf_flush_l1d(void)

1162

++{

1163

++	__this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 1);

1164

++}

1165

++

1166

++static inline void kvm_clear_cpu_l1tf_flush_l1d(void)

1167

++{

1168

++	__this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 0);

1169

++}

1170

++

1171

++static inline bool kvm_get_cpu_l1tf_flush_l1d(void)

1172

++{

1173

++	return __this_cpu_read(irq_stat.kvm_cpu_l1tf_flush_l1d);

1174

++}

1175

++#else /* !IS_ENABLED(CONFIG_KVM_INTEL) */

1176

++static inline void kvm_set_cpu_l1tf_flush_l1d(void) { }

1177

++#endif /* IS_ENABLED(CONFIG_KVM_INTEL) */

1178

++

1179

+ #endif /* _ASM_X86_HARDIRQ_H */

1180

+diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h

1181

+index c4fc17220df9..c14f2a74b2be 100644

1182

+--- a/arch/x86/include/asm/irqflags.h

1183

++++ b/arch/x86/include/asm/irqflags.h

1184

+@@ -13,6 +13,8 @@

1185

+  * Interrupt control:

1186

+  */

1187

+

1188

++/* Declaration required for gcc < 4.9 to prevent -Werror=missing-prototypes */

1189

++extern inline unsigned long native_save_fl(void);

1190

+ extern inline unsigned long native_save_fl(void)

1191

+ {

1192

+ 	unsigned long flags;

1193

+diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h

1194

+index 174b9c41efce..4015b88383ce 100644

1195

+--- a/arch/x86/include/asm/kvm_host.h

1196

++++ b/arch/x86/include/asm/kvm_host.h

1197

+@@ -17,6 +17,7 @@

1198

+ #include <linux/tracepoint.h>

1199

+ #include <linux/cpumask.h>

1200

+ #include <linux/irq_work.h>

1201

++#include <linux/irq.h>

1202

+

1203

+ #include <linux/kvm.h>

1204

+ #include <linux/kvm_para.h>

1205

+@@ -506,6 +507,7 @@ struct kvm_vcpu_arch {

1206

+ 	u64 smbase;

1207

+ 	bool tpr_access_reporting;

1208

+ 	u64 ia32_xss;

1209

++	u64 microcode_version;

1210

+

1211

+ 	/*

1212

+ 	 * Paging state of the vcpu

1213

+@@ -693,6 +695,9 @@ struct kvm_vcpu_arch {

1214

+

1215

+ 	/* be preempted when it's in kernel-mode(cpl=0) */

1216

+ 	bool preempted_in_kernel;

1217

++

1218

++	/* Flush the L1 Data cache for L1TF mitigation on VMENTER */

1219

++	bool l1tf_flush_l1d;

1220

+ };

1221

+

1222

+ struct kvm_lpage_info {

1223

+@@ -862,6 +867,7 @@ struct kvm_vcpu_stat {

1224

+ 	u64 signal_exits;

1225

+ 	u64 irq_window_exits;

1226

+ 	u64 nmi_window_exits;

1227

++	u64 l1d_flush;

1228

+ 	u64 halt_exits;

1229

+ 	u64 halt_successful_poll;

1230

+ 	u64 halt_attempted_poll;

1231

+@@ -1061,6 +1067,8 @@ struct kvm_x86_ops {

1232

+ 	void (*cancel_hv_timer)(struct kvm_vcpu *vcpu);

1233

+

1234

+ 	void (*setup_mce)(struct kvm_vcpu *vcpu);

1235

++

1236

++	int (*get_msr_feature)(struct kvm_msr_entry *entry);

1237

+ };

1238

+

1239

+ struct kvm_arch_async_pf {

1240

+@@ -1366,6 +1374,7 @@ int kvm_cpu_get_interrupt(struct kvm_vcpu *v);

1241

+ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);

1242

+ void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu);

1243

+

1244

++u64 kvm_get_arch_capabilities(void);

1245

+ void kvm_define_shared_msr(unsigned index, u32 msr);

1246

+ int kvm_set_shared_msr(unsigned index, u64 val, u64 mask);

1247

+

1248

+diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h

1249

+index 504b21692d32..ef7eec669a1b 100644

1250

+--- a/arch/x86/include/asm/msr-index.h

1251

++++ b/arch/x86/include/asm/msr-index.h

1252

+@@ -70,12 +70,19 @@

1253

+ #define MSR_IA32_ARCH_CAPABILITIES	0x0000010a

1254

+ #define ARCH_CAP_RDCL_NO		(1 << 0)   /* Not susceptible to Meltdown */

1255

+ #define ARCH_CAP_IBRS_ALL		(1 << 1)   /* Enhanced IBRS support */

1256

++#define ARCH_CAP_SKIP_VMENTRY_L1DFLUSH	(1 << 3)   /* Skip L1D flush on vmentry */

1257

+ #define ARCH_CAP_SSB_NO			(1 << 4)   /*

1258

+ 						    * Not susceptible to Speculative Store Bypass

1259

+ 						    * attack, so no Speculative Store Bypass

1260

+ 						    * control required.

1261

+ 						    */

1262

+

1263

++#define MSR_IA32_FLUSH_CMD		0x0000010b

1264

++#define L1D_FLUSH			(1 << 0)   /*

1265

++						    * Writeback and invalidate the

1266

++						    * L1 data cache.

1267

++						    */

1268

++

1269

+ #define MSR_IA32_BBL_CR_CTL		0x00000119

1270

+ #define MSR_IA32_BBL_CR_CTL3		0x0000011e

1271

+
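To make the new msr-index.h definitions above concrete, here is a minimal Python sketch that decodes an IA32_ARCH_CAPABILITIES value using the bit positions from the hunk. The function names are hypothetical, and the reading of bit 3 follows only the one-line comment in the header ("Skip L1D flush on vmentry").

```python
# Values copied from the msr-index.h hunk above.
MSR_IA32_ARCH_CAPABILITIES = 0x10A
MSR_IA32_FLUSH_CMD = 0x10B
L1D_FLUSH = 1 << 0  # written to MSR_IA32_FLUSH_CMD to flush the L1D cache

ARCH_CAP_BITS = {
    "RDCL_NO": 1 << 0,
    "IBRS_ALL": 1 << 1,
    "SKIP_VMENTRY_L1DFLUSH": 1 << 3,
    "SSB_NO": 1 << 4,
}


def decode_arch_capabilities(value):
    """Return the names of the known capability bits set in value."""
    return [name for name, bit in ARCH_CAP_BITS.items() if value & bit]


def may_skip_vmentry_l1d_flush(value):
    """True if bit 3 (ARCH_CAP_SKIP_VMENTRY_L1DFLUSH) is set."""
    return bool(value & ARCH_CAP_BITS["SKIP_VMENTRY_L1DFLUSH"])
```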

1272

+diff --git a/arch/x86/include/asm/page_32_types.h b/arch/x86/include/asm/page_32_types.h

1273

+index aa30c3241ea7..0d5c739eebd7 100644

1274

+--- a/arch/x86/include/asm/page_32_types.h

1275

++++ b/arch/x86/include/asm/page_32_types.h

1276

+@@ -29,8 +29,13 @@

1277

+ #define N_EXCEPTION_STACKS 1

1278

+

1279

+ #ifdef CONFIG_X86_PAE

1280

+-/* 44=32+12, the limit we can fit into an unsigned long pfn */

1281

+-#define __PHYSICAL_MASK_SHIFT	44

1282

++/*

1283

++ * This is beyond the 44 bit limit imposed by the 32bit long pfns,

1284

++ * but we need the full mask to make sure inverted PROT_NONE

1285

++ * entries have all the host bits set in a guest.

1286

++ * The real limit is still 44 bits.

1287

++ */

1288

++#define __PHYSICAL_MASK_SHIFT	52

1289

+ #define __VIRTUAL_MASK_SHIFT	32

1290

+

1291

+ #else  /* !CONFIG_X86_PAE */

1292

+diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h

1293

+index 685ffe8a0eaf..60d0f9015317 100644

1294

+--- a/arch/x86/include/asm/pgtable-2level.h

1295

++++ b/arch/x86/include/asm/pgtable-2level.h

1296

+@@ -95,4 +95,21 @@ static inline unsigned long pte_bitop(unsigned long value, unsigned int rightshi

1297

+ #define __pte_to_swp_entry(pte)		((swp_entry_t) { (pte).pte_low })

1298

+ #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })

1299

+

1300

++/* No inverted PFNs on 2 level page tables */

1301

++

1302

++static inline u64 protnone_mask(u64 val)

1303

++{

1304

++	return 0;

1305

++}

1306

++

1307

++static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask)

1308

++{

1309

++	return val;

1310

++}

1311

++

1312

++static inline bool __pte_needs_invert(u64 val)

1313

++{

1314

++	return false;

1315

++}

1316

++

1317

+ #endif /* _ASM_X86_PGTABLE_2LEVEL_H */

1318

+diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h

1319

+index bc4af5453802..9dc19b4a2a87 100644

1320

+--- a/arch/x86/include/asm/pgtable-3level.h

1321

++++ b/arch/x86/include/asm/pgtable-3level.h

1322

+@@ -206,12 +206,43 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp)

1323

+ #endif

1324

+

1325

+ /* Encode and de-code a swap entry */

1326

++#define SWP_TYPE_BITS		5

1327

++

1328

++#define SWP_OFFSET_FIRST_BIT	(_PAGE_BIT_PROTNONE + 1)

1329

++

1330

++/* We always extract/encode the offset by shifting it all the way up, and then down again */

1331

++#define SWP_OFFSET_SHIFT	(SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)

1332

++

1333

+ #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > 5)

1334

+ #define __swp_type(x)			(((x).val) & 0x1f)

1335

+ #define __swp_offset(x)			((x).val >> 5)

1336

+ #define __swp_entry(type, offset)	((swp_entry_t){(type) | (offset) << 5})

1337

+-#define __pte_to_swp_entry(pte)		((swp_entry_t){ (pte).pte_high })

1338

+-#define __swp_entry_to_pte(x)		((pte_t){ { .pte_high = (x).val } })

1339

++

1340

++/*

1341

++ * Normally, __swp_entry() converts from arch-independent swp_entry_t to

1342

++ * arch-dependent swp_entry_t, and __swp_entry_to_pte() just stores the result

1343

++ * to pte. But here we have 32bit swp_entry_t and 64bit pte, and need to use the

1344

++ * whole 64 bits. Thus, we shift the "real" arch-dependent conversion to

1345

++ * __swp_entry_to_pte() through the following helper macro based on 64bit

1346

++ * __swp_entry().

1347

++ */

1348

++#define __swp_pteval_entry(type, offset) ((pteval_t) { \

1349

++	(~(pteval_t)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \

1350

++	| ((pteval_t)(type) << (64 - SWP_TYPE_BITS)) })

1351

++

1352

++#define __swp_entry_to_pte(x)	((pte_t){ .pte = \

1353

++		__swp_pteval_entry(__swp_type(x), __swp_offset(x)) })

1354

++/*

1355

++ * Analogically, __pte_to_swp_entry() doesn't just extract the arch-dependent

1356

++ * swp_entry_t, but also has to convert it from 64bit to the 32bit

1357

++ * intermediate representation, using the following macros based on 64bit

1358

++ * __swp_type() and __swp_offset().

1359

++ */

1360

++#define __pteval_swp_type(x) ((unsigned long)((x).pte >> (64 - SWP_TYPE_BITS)))

1361

++#define __pteval_swp_offset(x) ((unsigned long)(~((x).pte) << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT))

1362

++

1363

++#define __pte_to_swp_entry(pte)	(__swp_entry(__pteval_swp_type(pte), \

1364

++					     __pteval_swp_offset(pte)))

1365

+

1366

+ #define gup_get_pte gup_get_pte

1367

+ /*

1368

+@@ -260,4 +291,6 @@ static inline pte_t gup_get_pte(pte_t *ptep)

1369

+ 	return pte;

1370

+ }

1371

+

1372

++#include <asm/pgtable-invert.h>

1373

++

1374

+ #endif /* _ASM_X86_PGTABLE_3LEVEL_H */
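The PAE swap-entry macros added to pgtable-3level.h above store the swap offset inverted in the upper PTE bits and the type in the top five bits. The Python model below mirrors that arithmetic with explicit 64-bit masking. It is a sketch under two assumptions not visible in this hunk: `_PAGE_BIT_PROTNONE` is 8, and `unsigned long` is 32 bits wide on a PAE kernel.

```python
MASK64 = (1 << 64) - 1
_PAGE_BIT_PROTNONE = 8  # assumption: x86 value, not shown in the hunk
SWP_TYPE_BITS = 5
SWP_OFFSET_FIRST_BIT = _PAGE_BIT_PROTNONE + 1
SWP_OFFSET_SHIFT = SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS


def swp_pteval_entry(swp_type, offset):
    """Model of __swp_pteval_entry(): inverted offset plus type in the top bits."""
    inverted = ~offset & MASK64
    off_bits = ((inverted << SWP_OFFSET_SHIFT) & MASK64) >> SWP_TYPE_BITS
    return off_bits | ((swp_type << (64 - SWP_TYPE_BITS)) & MASK64)


def pteval_swp_type(pte):
    """Model of __pteval_swp_type(): the top SWP_TYPE_BITS bits."""
    return pte >> (64 - SWP_TYPE_BITS)


def pteval_swp_offset(pte):
    """Model of __pteval_swp_offset(): undo the inversion, drop type and low bits."""
    inverted = ~pte & MASK64
    offset = ((inverted << SWP_TYPE_BITS) & MASK64) >> SWP_OFFSET_SHIFT
    return offset & 0xFFFFFFFF  # cast to a 32-bit unsigned long
```

Shifting all the way up and then down again, as the source comment puts it, clears bits 0 to 8 of the encoded value, so the present and PROT_NONE bits of a swap PTE stay zero.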

1375

+diff --git a/arch/x86/include/asm/pgtable-invert.h b/arch/x86/include/asm/pgtable-invert.h

1376

+new file mode 100644

1377

+index 000000000000..44b1203ece12

1378

+--- /dev/null

1379

++++ b/arch/x86/include/asm/pgtable-invert.h

1380

+@@ -0,0 +1,32 @@

1381

++/* SPDX-License-Identifier: GPL-2.0 */

1382

++#ifndef _ASM_PGTABLE_INVERT_H

1383

++#define _ASM_PGTABLE_INVERT_H 1

1384

++

1385

++#ifndef __ASSEMBLY__

1386

++

1387

++static inline bool __pte_needs_invert(u64 val)

1388

++{

1389

++	return !(val & _PAGE_PRESENT);

1390

++}

1391

++

1392

++/* Get a mask to xor with the page table entry to get the correct pfn. */

1393

++static inline u64 protnone_mask(u64 val)

1394

++{

1395

++	return __pte_needs_invert(val) ?  ~0ull : 0;

1396

++}

1397

++

1398

++static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask)

1399

++{

1400

++	/*

1401

++	 * When a PTE transitions from NONE to !NONE or vice-versa

1402

++	 * invert the PFN part to stop speculation.

1403

++	 * pte_pfn undoes this when needed.

1404

++	 */

1405

++	if (__pte_needs_invert(oldval) != __pte_needs_invert(val))

1406

++		val = (val & ~mask) | (~val & mask);

1407

++	return val;

1408

++}

1409

++

1410

++#endif /* __ASSEMBLY__ */

1411

++

1412

++#endif
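The helpers in pgtable-invert.h above can be modelled in a few lines of Python to show that the masked inversion in `flip_protnone_guard()` is an XOR with the mask, applied only when the present state changes. The constants `_PAGE_PRESENT` (bit 0) and `PTE_PFN_MASK` (bits 12 to 51) are assumptions for this sketch, since neither is defined in this file.

```python
MASK64 = (1 << 64) - 1
_PAGE_PRESENT = 1 << 0                    # assumption: x86 present bit
PTE_PFN_MASK = ((1 << 52) - 1) & ~0xFFF   # assumption: PFN field, bits 12..51


def pte_needs_invert(val):
    """Model of __pte_needs_invert(): true for non-present entries."""
    return not (val & _PAGE_PRESENT)


def protnone_mask(val):
    """All ones for non-present entries, zero otherwise."""
    return MASK64 if pte_needs_invert(val) else 0


def flip_protnone_guard(oldval, val, mask):
    """Invert the bits under mask when the present state changes."""
    if pte_needs_invert(oldval) != pte_needs_invert(val):
        val = (val & ~mask & MASK64) | (~val & mask)
    return val
```

Because the inversion is an XOR, applying it on the way in and again on the way out restores the original PFN, which is what `pte_pfn()` relies on according to the source comment.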

1413

+diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h

1414

+index 5c790e93657d..6a4b1a54ff47 100644

1415

+--- a/arch/x86/include/asm/pgtable.h

1416

++++ b/arch/x86/include/asm/pgtable.h

1417

+@@ -185,19 +185,29 @@ static inline int pte_special(pte_t pte)

1418

+ 	return pte_flags(pte) & _PAGE_SPECIAL;

1419

+ }

1420

+

1421

++/* Entries that were set to PROT_NONE are inverted */

1422

++

1423

++static inline u64 protnone_mask(u64 val);

1424

++

1425

+ static inline unsigned long pte_pfn(pte_t pte)

1426

+ {

1427

+-	return (pte_val(pte) & PTE_PFN_MASK) >> PAGE_SHIFT;

1428

++	phys_addr_t pfn = pte_val(pte);

1429

++	pfn ^= protnone_mask(pfn);

1430

++	return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT;

1431

+ }

1432

+

1433

+ static inline unsigned long pmd_pfn(pmd_t pmd)

1434

+ {

1435

+-	return (pmd_val(pmd) & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;

1436

++	phys_addr_t pfn = pmd_val(pmd);

1437

++	pfn ^= protnone_mask(pfn);

1438

++	return (pfn & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;

1439

+ }

1440

+

1441

+ static inline unsigned long pud_pfn(pud_t pud)

1442

+ {

1443

+-	return (pud_val(pud) & pud_pfn_mask(pud)) >> PAGE_SHIFT;

1444

++	phys_addr_t pfn = pud_val(pud);

1445

++	pfn ^= protnone_mask(pfn);

1446

++	return (pfn & pud_pfn_mask(pud)) >> PAGE_SHIFT;

1447

+ }

1448

+

1449

+ static inline unsigned long p4d_pfn(p4d_t p4d)

1450

+@@ -400,11 +410,6 @@ static inline pmd_t pmd_mkwrite(pmd_t pmd)

1451

+ 	return pmd_set_flags(pmd, _PAGE_RW);

1452

+ }

1453

+

1454

+-static inline pmd_t pmd_mknotpresent(pmd_t pmd)

1455

+-{

1456

+-	return pmd_clear_flags(pmd, _PAGE_PRESENT | _PAGE_PROTNONE);

1457

+-}

1458

+-

1459

+ static inline pud_t pud_set_flags(pud_t pud, pudval_t set)

1460

+ {

1461

+ 	pudval_t v = native_pud_val(pud);

1462

+@@ -459,11 +464,6 @@ static inline pud_t pud_mkwrite(pud_t pud)

1463

+ 	return pud_set_flags(pud, _PAGE_RW);

1464

+ }

1465

+

1466

+-static inline pud_t pud_mknotpresent(pud_t pud)

1467

+-{

1468

+-	return pud_clear_flags(pud, _PAGE_PRESENT | _PAGE_PROTNONE);

1469

+-}

1470

+-

1471

+ #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY

1472

+ static inline int pte_soft_dirty(pte_t pte)

1473

+ {

1474

+@@ -528,25 +528,45 @@ static inline pgprotval_t massage_pgprot(pgprot_t pgprot)

1475

+

1476

+ static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)

1477

+ {

1478

+-	return __pte(((phys_addr_t)page_nr << PAGE_SHIFT) |

1479

+-		     massage_pgprot(pgprot));

1480

++	phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;

1481

++	pfn ^= protnone_mask(pgprot_val(pgprot));

1482

++	pfn &= PTE_PFN_MASK;

1483

++	return __pte(pfn | massage_pgprot(pgprot));

1484

+ }

1485

+

1486

+ static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)

1487

+ {

1488

+-	return __pmd(((phys_addr_t)page_nr << PAGE_SHIFT) |

1489

+-		     massage_pgprot(pgprot));

1490

++	phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;

1491

++	pfn ^= protnone_mask(pgprot_val(pgprot));

1492

++	pfn &= PHYSICAL_PMD_PAGE_MASK;

1493

++	return __pmd(pfn | massage_pgprot(pgprot));

1494

+ }

1495

+

1496

+ static inline pud_t pfn_pud(unsigned long page_nr, pgprot_t pgprot)

1497

+ {

1498

+-	return __pud(((phys_addr_t)page_nr << PAGE_SHIFT) |

1499

+-		     massage_pgprot(pgprot));

1500

++	phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;

1501

++	pfn ^= protnone_mask(pgprot_val(pgprot));

1502

++	pfn &= PHYSICAL_PUD_PAGE_MASK;

1503

++	return __pud(pfn | massage_pgprot(pgprot));

1504

+ }

1505

+

1506

++static inline pmd_t pmd_mknotpresent(pmd_t pmd)

1507

++{

1508

++	return pfn_pmd(pmd_pfn(pmd),

1509

++		      __pgprot(pmd_flags(pmd) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));

1510

++}

1511

++

1512

++static inline pud_t pud_mknotpresent(pud_t pud)

1513

++{

1514

++	return pfn_pud(pud_pfn(pud),

1515

++	      __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));

1516

++}

1517

++

1518

++static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);

1519

++

1520

+ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)

1521

+ {

1522

+-	pteval_t val = pte_val(pte);

1523

++	pteval_t val = pte_val(pte), oldval = val;

1524

+

1525

+ 	/*

1526

+ 	 * Chop off the NX bit (if present), and add the NX portion of

1527

+@@ -554,17 +574,17 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)

1528

+ 	 */

1529

+ 	val &= _PAGE_CHG_MASK;

1530

+ 	val |= massage_pgprot(newprot) & ~_PAGE_CHG_MASK;

1531

+-

1532

++	val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);

1533

+ 	return __pte(val);

1534

+ }

1535

+

1536

+ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)

1537

+ {

1538

+-	pmdval_t val = pmd_val(pmd);

1539

++	pmdval_t val = pmd_val(pmd), oldval = val;

1540

+

1541

+ 	val &= _HPAGE_CHG_MASK;

1542

+ 	val |= massage_pgprot(newprot) & ~_HPAGE_CHG_MASK;

1543

+-

1544

++	val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);

1545

+ 	return __pmd(val);

1546

+ }

1547

+

1548

+@@ -1274,6 +1294,14 @@ static inline bool pud_access_permitted(pud_t pud, bool write)

1549

+ 	return __pte_access_permitted(pud_val(pud), write);

1550

+ }

1551

+

1552

++#define __HAVE_ARCH_PFN_MODIFY_ALLOWED 1

1553

++extern bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot);

1554

++

1555

++static inline bool arch_has_pfn_modify_check(void)

1556

++{

1557

++	return boot_cpu_has_bug(X86_BUG_L1TF);

1558

++}

1559

++

1560

+ #include <asm-generic/pgtable.h>

1561

+ #endif	/* __ASSEMBLY__ */

1562

+
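
The pfn_pte() / pte_pfn() pair above is meant to round-trip a page frame number even when the stored bits are inverted. The sketch below models that in user space under stated assumptions: `massage_pgprot()` is treated as the identity, the bit layout is a typical x86-64 one, and `sk_needs_invert()` is a hypothetical stand-in for the kernel's `__pte_needs_invert()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_PAGE_SHIFT    12
#define SK_PAGE_PRESENT  (1ULL << 0)
#define SK_PAGE_PROTNONE (1ULL << 8)
#define SK_PFN_MASK      0x000ffffffffff000ULL /* bits 12..51 */

/* Hypothetical predicate: entry is not present and marked PROT_NONE. */
static inline bool sk_needs_invert(uint64_t val)
{
	return (val & (SK_PAGE_PRESENT | SK_PAGE_PROTNONE)) == SK_PAGE_PROTNONE;
}

/* Mirrors protnone_mask(): all ones when inversion applies, else zero. */
static inline uint64_t sk_protnone_mask(uint64_t val)
{
	return sk_needs_invert(val) ? ~0ULL : 0;
}

/* Mirrors pfn_pte(): XOR with the mask, then clamp to the PFN field. */
static inline uint64_t sk_pfn_pte(uint64_t page_nr, uint64_t prot)
{
	uint64_t pfn = page_nr << SK_PAGE_SHIFT;

	pfn ^= sk_protnone_mask(prot);
	pfn &= SK_PFN_MASK;
	return pfn | prot; /* assumption: massage_pgprot() is the identity */
}

/* Mirrors pte_pfn(): undo the inversion before extracting the PFN. */
static inline uint64_t sk_pte_pfn(uint64_t pte)
{
	uint64_t pfn = pte;

	pfn ^= sk_protnone_mask(pfn);
	return (pfn & SK_PFN_MASK) >> SK_PAGE_SHIFT;
}

/* Usage: the PROT_NONE entry stores inverted bits but reads back 0x1234. */
static inline void sk_roundtrip_demo(void)
{
	uint64_t none = sk_pfn_pte(0x1234, SK_PAGE_PROTNONE);

	assert((none & SK_PFN_MASK) != (0x1234ULL << SK_PAGE_SHIFT));
	assert(sk_pte_pfn(none) == 0x1234);
}
```

XOR with an all-ones or all-zero mask keeps both helpers branch-free apart from the predicate, which matches the shape of the patched functions.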

1563

+diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h

1564

+index 1149d2112b2e..4ecb72831938 100644

1565

+--- a/arch/x86/include/asm/pgtable_64.h

1566

++++ b/arch/x86/include/asm/pgtable_64.h

1567

+@@ -276,7 +276,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }

1568

+  *

1569

+  * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number

1570

+  * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names

1571

+- * | OFFSET (14->63) | TYPE (9-13)  |0|0|X|X| X| X|X|SD|0| <- swp entry

1572

++ * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| X|X|SD|0| <- swp entry

1573

+  *

1574

+  * G (8) is aliased and used as a PROT_NONE indicator for

1575

+  * !present ptes.  We need to start storing swap entries above

1576

+@@ -289,20 +289,34 @@ static inline int pgd_large(pgd_t pgd) { return 0; }

1577

+  *

1578

+  * Bit 7 in swp entry should be 0 because pmd_present checks not only P,

1579

+  * but also L and G.

1580

++ *

1581

++ * The offset is inverted by a binary not operation to make the high

1582

++ * physical bits set.

1583

+  */

1584

+-#define SWP_TYPE_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)

1585

+-#define SWP_TYPE_BITS 5

1586

+-/* Place the offset above the type: */

1587

+-#define SWP_OFFSET_FIRST_BIT (SWP_TYPE_FIRST_BIT + SWP_TYPE_BITS)

1588

++#define SWP_TYPE_BITS		5

1589

++

1590

++#define SWP_OFFSET_FIRST_BIT	(_PAGE_BIT_PROTNONE + 1)

1591

++

1592

++/* We always extract/encode the offset by shifting it all the way up, and then down again */

1593

++#define SWP_OFFSET_SHIFT	(SWP_OFFSET_FIRST_BIT+SWP_TYPE_BITS)

1594

+

1595

+ #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)

1596

+

1597

+-#define __swp_type(x)			(((x).val >> (SWP_TYPE_FIRST_BIT)) \

1598

+-					 & ((1U << SWP_TYPE_BITS) - 1))

1599

+-#define __swp_offset(x)			((x).val >> SWP_OFFSET_FIRST_BIT)

1600

+-#define __swp_entry(type, offset)	((swp_entry_t) { \

1601

+-					 ((type) << (SWP_TYPE_FIRST_BIT)) \

1602

+-					 | ((offset) << SWP_OFFSET_FIRST_BIT) })

1603

++/* Extract the high bits for type */

1604

++#define __swp_type(x) ((x).val >> (64 - SWP_TYPE_BITS))

1605

++

1606

++/* Shift up (to get rid of type), then down to get value */

1607

++#define __swp_offset(x) (~(x).val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT)

1608

++

1609

++/*

1610

++ * Shift the offset up "too far" by TYPE bits, then down again

1611

++ * The offset is inverted by a binary not operation to make the high

1612

++ * physical bits set.

1613

++ */

1614

++#define __swp_entry(type, offset) ((swp_entry_t) { \

1615

++	(~(unsigned long)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \

1616

++	| ((unsigned long)(type) << (64-SWP_TYPE_BITS)) })

1617

++

1618

+ #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })

1619

+ #define __pmd_to_swp_entry(pmd)		((swp_entry_t) { pmd_val((pmd)) })

1620

+ #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })

1621

+@@ -346,5 +360,7 @@ static inline bool gup_fast_permitted(unsigned long start, int nr_pages,

1622

+ 	return true;

1623

+ }

1624

+

1625

++#include <asm/pgtable-invert.h>

1626

++

1627

+ #endif /* !__ASSEMBLY__ */

1628

+ #endif /* _ASM_X86_PGTABLE_64_H */
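
The new swap entry layout (type in bits 59-63, inverted offset in bits 9-58) can be checked with a small user-space model. This is a sketch only: the `sk_` names are invented here, and `SK_SWP_OFFSET_FIRST_BIT` is hard-coded to 9 on the assumption that `_PAGE_BIT_PROTNONE` is bit 8, as the comment block in the patch describes.

```c
#include <assert.h>
#include <stdint.h>

#define SK_SWP_TYPE_BITS        5
#define SK_SWP_OFFSET_FIRST_BIT 9 /* assumed: _PAGE_BIT_PROTNONE (8) + 1 */
#define SK_SWP_OFFSET_SHIFT     (SK_SWP_OFFSET_FIRST_BIT + SK_SWP_TYPE_BITS)

/* Mirrors __swp_entry(): inverted offset in bits 9..58, type in bits 59..63. */
static inline uint64_t sk_swp_entry(uint64_t type, uint64_t offset)
{
	return (~offset << SK_SWP_OFFSET_SHIFT >> SK_SWP_TYPE_BITS) |
	       (type << (64 - SK_SWP_TYPE_BITS));
}

/* Mirrors __swp_type(): the type is simply the top five bits. */
static inline uint64_t sk_swp_type(uint64_t val)
{
	return val >> (64 - SK_SWP_TYPE_BITS);
}

/* Mirrors __swp_offset(): invert, shift the type out, shift back down. */
static inline uint64_t sk_swp_offset(uint64_t val)
{
	return ~val << SK_SWP_TYPE_BITS >> SK_SWP_OFFSET_SHIFT;
}

/* Usage: encode and decode, and confirm the low nine bits stay clear. */
static inline void sk_swp_demo(void)
{
	uint64_t e = sk_swp_entry(3, 0x1234);

	assert(sk_swp_type(e) == 3);
	assert(sk_swp_offset(e) == 0x1234);
	assert((e & 0x1ffULL) == 0);
}
```

Shifting "too far" up and then back down clears the unwanted bits on both ends without a separate mask constant. With this layout a small offset leaves most of bits 9..58 set, which is the "high physical bits set" property the patch comment refers to.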

1629

+diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h

1630

+index 3222c7746cb1..0e856c0628b3 100644

1631

+--- a/arch/x86/include/asm/processor.h

1632

++++ b/arch/x86/include/asm/processor.h

1633

+@@ -180,6 +180,11 @@ extern const struct seq_operations cpuinfo_op;

1634

+

1635

+ extern void cpu_detect(struct cpuinfo_x86 *c);

1636

+

1637

++static inline unsigned long l1tf_pfn_limit(void)

1638

++{

1639

++	return BIT(boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT) - 1;

1640

++}

1641

++

1642

+ extern void early_cpu_init(void);

1643

+ extern void identify_boot_cpu(void);

1644

+ extern void identify_secondary_cpu(struct cpuinfo_x86 *);

1645

+@@ -969,4 +974,16 @@ bool xen_set_default_idle(void);

1646

+ void stop_this_cpu(void *dummy);

1647

+ void df_debug(struct pt_regs *regs, long error_code);

1648

+ void microcode_check(void);

1649

++

1650

++enum l1tf_mitigations {

1651

++	L1TF_MITIGATION_OFF,

1652

++	L1TF_MITIGATION_FLUSH_NOWARN,

1653

++	L1TF_MITIGATION_FLUSH,

1654

++	L1TF_MITIGATION_FLUSH_NOSMT,

1655

++	L1TF_MITIGATION_FULL,

1656

++	L1TF_MITIGATION_FULL_FORCE

1657

++};

1658

++

1659

++extern enum l1tf_mitigations l1tf_mitigation;

1660

++

1661

+ #endif /* _ASM_X86_PROCESSOR_H */
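
As a worked example of l1tf_pfn_limit(), the sketch below evaluates the same expression for an explicit number of physical address bits. Passing `phys_bits` as a parameter (instead of reading `boot_cpu_data.x86_phys_bits`), the `sk_` names, and a page shift of 12 are assumptions made so the arithmetic can run outside the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT 12 /* assumed 4 KiB pages */

/* Same expression as l1tf_pfn_limit(), with phys_bits passed in explicitly. */
static inline uint64_t sk_l1tf_pfn_limit(unsigned int phys_bits)
{
	return (1ULL << (phys_bits - 1 - SK_PAGE_SHIFT)) - 1;
}

/* Byte address of the highest page frame below MAX_PA/2. */
static inline uint64_t sk_half_pa(unsigned int phys_bits)
{
	return sk_l1tf_pfn_limit(phys_bits) << SK_PAGE_SHIFT;
}

/* Usage: with 46 physical address bits the limit sits one page below 32 TiB. */
static inline void sk_limit_demo(void)
{
	assert(sk_l1tf_pfn_limit(46) == 0x1ffffffffULL);
	assert(sk_half_pa(46) == (1ULL << 45) - 4096);
}
```

The "- 1" in the exponent halves the addressable range, so any RAM mapped above the resulting address would fall in the half that inverted entries can point into.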

1662

+diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h

1663

+index 461f53d27708..fe2ee61880a8 100644

1664

+--- a/arch/x86/include/asm/smp.h

1665

++++ b/arch/x86/include/asm/smp.h

1666

+@@ -170,7 +170,6 @@ static inline int wbinvd_on_all_cpus(void)

1667

+ 	wbinvd();

1668

+ 	return 0;

1669

+ }

1670

+-#define smp_num_siblings	1

1671

+ #endif /* CONFIG_SMP */

1672

+

1673

+ extern unsigned disabled_cpus;

1674

+diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h

1675

+index c1d2a9892352..453cf38a1c33 100644

1676

+--- a/arch/x86/include/asm/topology.h

1677

++++ b/arch/x86/include/asm/topology.h

1678

+@@ -123,13 +123,17 @@ static inline int topology_max_smt_threads(void)

1679

+ }

1680

+

1681

+ int topology_update_package_map(unsigned int apicid, unsigned int cpu);

1682

+-extern int topology_phys_to_logical_pkg(unsigned int pkg);

1683

++int topology_phys_to_logical_pkg(unsigned int pkg);

1684

++bool topology_is_primary_thread(unsigned int cpu);

1685

++bool topology_smt_supported(void);

1686

+ #else

1687

+ #define topology_max_packages()			(1)

1688

+ static inline int

1689

+ topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; }

1690

+ static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }

1691

+ static inline int topology_max_smt_threads(void) { return 1; }

1692

++static inline bool topology_is_primary_thread(unsigned int cpu) { return true; }

1693

++static inline bool topology_smt_supported(void) { return false; }

1694

+ #endif

1695

+

1696

+ static inline void arch_fix_phys_package_id(int num, u32 slot)

1697

+diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h

1698

+index 7c300299e12e..08c14aec26ac 100644

1699

+--- a/arch/x86/include/asm/vmx.h

1700

++++ b/arch/x86/include/asm/vmx.h

1701

+@@ -571,4 +571,15 @@ enum vm_instruction_error_number {

1702

+ 	VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID = 28,

1703

+ };

1704

+

1705

++enum vmx_l1d_flush_state {

1706

++	VMENTER_L1D_FLUSH_AUTO,

1707

++	VMENTER_L1D_FLUSH_NEVER,

1708

++	VMENTER_L1D_FLUSH_COND,

1709

++	VMENTER_L1D_FLUSH_ALWAYS,

1710

++	VMENTER_L1D_FLUSH_EPT_DISABLED,

1711

++	VMENTER_L1D_FLUSH_NOT_REQUIRED,

1712

++};

1713

++

1714

++extern enum vmx_l1d_flush_state l1tf_vmx_mitigation;

1715

++

1716

+ #endif

1717

+diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c

1718

+index f48a51335538..2e64178f284d 100644

1719

+--- a/arch/x86/kernel/apic/apic.c

1720

++++ b/arch/x86/kernel/apic/apic.c

1721

+@@ -34,6 +34,7 @@

1722

+ #include <linux/dmi.h>

1723

+ #include <linux/smp.h>

1724

+ #include <linux/mm.h>

1725

++#include <linux/irq.h>

1726

+

1727

+ #include <asm/trace/irq_vectors.h>

1728

+ #include <asm/irq_remapping.h>

1729

+@@ -56,6 +57,7 @@

1730

+ #include <asm/hypervisor.h>

1731

+ #include <asm/cpu_device_id.h>

1732

+ #include <asm/intel-family.h>

1733

++#include <asm/irq_regs.h>

1734

+

1735

+ unsigned int num_processors;

1736

+

1737

+@@ -2092,6 +2094,23 @@ static int cpuid_to_apicid[] = {

1738

+ 	[0 ... NR_CPUS - 1] = -1,

1739

+ };

1740

+

1741

++#ifdef CONFIG_SMP

1742

++/**

1743

++ * apic_id_is_primary_thread - Check whether APIC ID belongs to a primary thread

1744

++ * @id:	APIC ID to check

1745

++ */

1746

++bool apic_id_is_primary_thread(unsigned int apicid)

1747

++{

1748

++	u32 mask;

1749

++

1750

++	if (smp_num_siblings == 1)

1751

++		return true;

1752

++	/* Isolate the SMT bit(s) in the APICID and check for 0 */

1753

++	mask = (1U << (fls(smp_num_siblings) - 1)) - 1;

1754

++	return !(apicid & mask);

1755

++}

1756

++#endif

1757

++

1758

+ /*

1759

+  * Should use this API to allocate logical CPU IDs to keep nr_logical_cpuids

1760

+  * and cpuid_to_apicid[] synchronized.
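
The mask computation in apic_id_is_primary_thread() can be exercised in isolation. The sketch below is a user-space model with assumptions: `sk_fls()` is a simple loop standing in for the kernel's `fls()`, and the sibling count is passed as a parameter rather than read from the global `smp_num_siblings`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for fls(): 1-based index of the highest set bit, 0 for x == 0. */
static inline int sk_fls(unsigned int x)
{
	int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/* Same logic as apic_id_is_primary_thread(); siblings must be >= 1. */
static inline bool sk_is_primary_thread(unsigned int apicid, unsigned int siblings)
{
	uint32_t mask;

	if (siblings == 1)
		return true;
	/* Isolate the SMT bit(s) of the APIC ID; primary threads have them clear. */
	mask = (1U << (sk_fls(siblings) - 1)) - 1;
	return !(apicid & mask);
}

/* Usage: with two siblings per core, even APIC IDs are primary threads. */
static inline void sk_primary_demo(void)
{
	assert(sk_is_primary_thread(0, 2));
	assert(!sk_is_primary_thread(1, 2));
	assert(sk_is_primary_thread(2, 2));
}
```

For a sibling count of 4 the mask becomes 3, so only APIC IDs that are multiples of four count as primary.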

1761

+diff --git a/arch/x86/kernel/apic/htirq.c b/arch/x86/kernel/apic/htirq.c

1762

+index 56ccf9346b08..741de281ed5d 100644

1763

+--- a/arch/x86/kernel/apic/htirq.c

1764

++++ b/arch/x86/kernel/apic/htirq.c

1765

+@@ -16,6 +16,8 @@

1766

+ #include <linux/device.h>

1767

+ #include <linux/pci.h>

1768

+ #include <linux/htirq.h>

1769

++#include <linux/irq.h>

1770

++

1771

+ #include <asm/irqdomain.h>

1772

+ #include <asm/hw_irq.h>

1773

+ #include <asm/apic.h>

1774

+diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c

1775

+index 3b89b27945ff..96a8a68f9c79 100644

1776

+--- a/arch/x86/kernel/apic/io_apic.c

1777

++++ b/arch/x86/kernel/apic/io_apic.c

1778

+@@ -33,6 +33,7 @@

1779

+

1780

+ #include <linux/mm.h>

1781

+ #include <linux/interrupt.h>

1782

++#include <linux/irq.h>

1783

+ #include <linux/init.h>

1784

+ #include <linux/delay.h>

1785

+ #include <linux/sched.h>

1786

+diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c

1787

+index 9b18be764422..f10e7f93b0e2 100644

1788

+--- a/arch/x86/kernel/apic/msi.c

1789

++++ b/arch/x86/kernel/apic/msi.c

1790

+@@ -12,6 +12,7 @@

1791

+  */

1792

+ #include <linux/mm.h>

1793

+ #include <linux/interrupt.h>

1794

++#include <linux/irq.h>

1795

+ #include <linux/pci.h>

1796

+ #include <linux/dmar.h>

1797

+ #include <linux/hpet.h>

1798

+diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c

1799

+index 2ce1c708b8ee..b958082c74a7 100644

1800

+--- a/arch/x86/kernel/apic/vector.c

1801

++++ b/arch/x86/kernel/apic/vector.c

1802

+@@ -11,6 +11,7 @@

1803

+  * published by the Free Software Foundation.

1804

+  */

1805

+ #include <linux/interrupt.h>

1806

++#include <linux/irq.h>

1807

+ #include <linux/init.h>

1808

+ #include <linux/compiler.h>

1809

+ #include <linux/slab.h>

1810

+diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c

1811

+index 90574f731c05..dda741bd5789 100644

1812

+--- a/arch/x86/kernel/cpu/amd.c

1813

++++ b/arch/x86/kernel/cpu/amd.c

1814

+@@ -298,7 +298,6 @@ static int nearby_node(int apicid)

1815

+ }

1816

+ #endif

1817

+

1818

+-#ifdef CONFIG_SMP

1819

+ /*

1820

+  * Fix up cpu_core_id for pre-F17h systems to be in the

1821

+  * [0 .. cores_per_node - 1] range. Not really needed but

1822

+@@ -315,6 +314,13 @@ static void legacy_fixup_core_id(struct cpuinfo_x86 *c)

1823

+ 	c->cpu_core_id %= cus_per_node;

1824

+ }

1825

+

1826

++

1827

++static void amd_get_topology_early(struct cpuinfo_x86 *c)

1828

++{

1829

++	if (cpu_has(c, X86_FEATURE_TOPOEXT))

1830

++		smp_num_siblings = ((cpuid_ebx(0x8000001e) >> 8) & 0xff) + 1;

1831

++}

1832

++

1833

+ /*

1834

+  * Fixup core topology information for

1835

+  * (1) AMD multi-node processors

1836

+@@ -333,7 +339,6 @@ static void amd_get_topology(struct cpuinfo_x86 *c)

1837

+ 		cpuid(0x8000001e, &eax, &ebx, &ecx, &edx);

1838

+

1839

+ 		node_id  = ecx & 0xff;

1840

+-		smp_num_siblings = ((ebx >> 8) & 0xff) + 1;

1841

+

1842

+ 		if (c->x86 == 0x15)

1843

+ 			c->cu_id = ebx & 0xff;

1844

+@@ -376,7 +381,6 @@ static void amd_get_topology(struct cpuinfo_x86 *c)

1845

+ 		legacy_fixup_core_id(c);

1846

+ 	}

1847

+ }

1848

+-#endif

1849

+

1850

+ /*

1851

+  * On a AMD dual core setup the lower bits of the APIC id distinguish the cores.

1852

+@@ -384,7 +388,6 @@ static void amd_get_topology(struct cpuinfo_x86 *c)

1853

+  */

1854

+ static void amd_detect_cmp(struct cpuinfo_x86 *c)

1855

+ {

1856

+-#ifdef CONFIG_SMP

1857

+ 	unsigned bits;

1858

+ 	int cpu = smp_processor_id();

1859

+

1860

+@@ -396,16 +399,11 @@ static void amd_detect_cmp(struct cpuinfo_x86 *c)

1861

+ 	/* use socket ID also for last level cache */

1862

+ 	per_cpu(cpu_llc_id, cpu) = c->phys_proc_id;

1863

+ 	amd_get_topology(c);

1864

+-#endif

1865

+ }

1866

+

1867

+ u16 amd_get_nb_id(int cpu)

1868

+ {

1869

+-	u16 id = 0;

1870

+-#ifdef CONFIG_SMP

1871

+-	id = per_cpu(cpu_llc_id, cpu);

1872

+-#endif

1873

+-	return id;

1874

++	return per_cpu(cpu_llc_id, cpu);

1875

+ }

1876

+ EXPORT_SYMBOL_GPL(amd_get_nb_id);

1877

+

1878

+@@ -579,6 +577,7 @@ static void bsp_init_amd(struct cpuinfo_x86 *c)

1879

+

1880

+ static void early_init_amd(struct cpuinfo_x86 *c)

1881

+ {

1882

++	u64 value;

1883

+ 	u32 dummy;

1884

+

1885

+ 	early_init_amd_mc(c);

1886

+@@ -668,6 +667,22 @@ static void early_init_amd(struct cpuinfo_x86 *c)

1887

+ 			clear_cpu_cap(c, X86_FEATURE_SME);

1888

+ 		}

1889

+ 	}

1890

++

1891

++	/* Re-enable TopologyExtensions if switched off by BIOS */

1892

++	if (c->x86 == 0x15 &&

1893

++	    (c->x86_model >= 0x10 && c->x86_model <= 0x6f) &&

1894

++	    !cpu_has(c, X86_FEATURE_TOPOEXT)) {

1895

++

1896

++		if (msr_set_bit(0xc0011005, 54) > 0) {

1897

++			rdmsrl(0xc0011005, value);

1898

++			if (value & BIT_64(54)) {

1899

++				set_cpu_cap(c, X86_FEATURE_TOPOEXT);

1900

++				pr_info_once(FW_INFO "CPU: Re-enabling disabled Topology Extensions Support.\n");

1901

++			}

1902

++		}

1903

++	}

1904

++

1905

++	amd_get_topology_early(c);

1906

+ }

1907

+

1908

+ static void init_amd_k8(struct cpuinfo_x86 *c)

1909

+@@ -759,19 +774,6 @@ static void init_amd_bd(struct cpuinfo_x86 *c)

1910

+ {

1911

+ 	u64 value;

1912

+

1913

+-	/* re-enable TopologyExtensions if switched off by BIOS */

1914

+-	if ((c->x86_model >= 0x10) && (c->x86_model <= 0x6f) &&

1915

+-	    !cpu_has(c, X86_FEATURE_TOPOEXT)) {

1916

+-

1917

+-		if (msr_set_bit(0xc0011005, 54) > 0) {

1918

+-			rdmsrl(0xc0011005, value);

1919

+-			if (value & BIT_64(54)) {

1920

+-				set_cpu_cap(c, X86_FEATURE_TOPOEXT);

1921

+-				pr_info_once(FW_INFO "CPU: Re-enabling disabled Topology Extensions Support.\n");

1922

+-			}

1923

+-		}

1924

+-	}

1925

+-

1926

+ 	/*

1927

+ 	 * The way access filter has a performance penalty on some workloads.

1928

+ 	 * Disable it on the affected CPUs.

1929

+@@ -835,15 +837,8 @@ static void init_amd(struct cpuinfo_x86 *c)

1930

+

1931

+ 	cpu_detect_cache_sizes(c);

1932

+

1933

+-	/* Multi core CPU? */

1934

+-	if (c->extended_cpuid_level >= 0x80000008) {

1935

+-		amd_detect_cmp(c);

1936

+-		srat_detect_node(c);

1937

+-	}

1938

+-

1939

+-#ifdef CONFIG_X86_32

1940

+-	detect_ht(c);

1941

+-#endif

1942

++	amd_detect_cmp(c);

1943

++	srat_detect_node(c);

1944

+

1945

+ 	init_amd_cacheinfo(c);

1946

+

1947

+diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c

1948

+index 7416fc206b4a..edfc64a8a154 100644

1949

+--- a/arch/x86/kernel/cpu/bugs.c

1950

++++ b/arch/x86/kernel/cpu/bugs.c

1951

+@@ -22,14 +22,17 @@

1952

+ #include <asm/processor-flags.h>

1953

+ #include <asm/fpu/internal.h>

1954

+ #include <asm/msr.h>

1955

++#include <asm/vmx.h>

1956

+ #include <asm/paravirt.h>

1957

+ #include <asm/alternative.h>

1958

+ #include <asm/pgtable.h>

1959

+ #include <asm/set_memory.h>

1960

+ #include <asm/intel-family.h>

1961

++#include <asm/e820/api.h>

1962

+

1963

+ static void __init spectre_v2_select_mitigation(void);

1964

+ static void __init ssb_select_mitigation(void);

1965

++static void __init l1tf_select_mitigation(void);

1966

+

1967

+ /*

1968

+  * Our boot-time value of the SPEC_CTRL MSR. We read it once so that any

1969

+@@ -55,6 +58,12 @@ void __init check_bugs(void)

1970

+ {

1971

+ 	identify_boot_cpu();

1972

+

1973

++	/*

1974

++	 * identify_boot_cpu() initialized SMT support information, let the

1975

++	 * core code know.

1976

++	 */

1977

++	cpu_smt_check_topology_early();

1978

++

1979

+ 	if (!IS_ENABLED(CONFIG_SMP)) {

1980

+ 		pr_info("CPU: ");

1981

+ 		print_cpu_info(&boot_cpu_data);

1982

+@@ -81,6 +90,8 @@ void __init check_bugs(void)

1983

+ 	 */

1984

+ 	ssb_select_mitigation();

1985

+

1986

++	l1tf_select_mitigation();

1987

++

1988

+ #ifdef CONFIG_X86_32

1989

+ 	/*

1990

+ 	 * Check whether we are able to run this kernel safely on SMP.

1991

+@@ -311,23 +322,6 @@ static enum spectre_v2_mitigation_cmd __init spectre_v2_parse_cmdline(void)

1992

+ 	return cmd;

1993

+ }

1994

+

1995

+-/* Check for Skylake-like CPUs (for RSB handling) */

1996

+-static bool __init is_skylake_era(void)

1997

+-{

1998

+-	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&

1999

+-	    boot_cpu_data.x86 == 6) {

2000

+-		switch (boot_cpu_data.x86_model) {

2001

+-		case INTEL_FAM6_SKYLAKE_MOBILE:

2002

+-		case INTEL_FAM6_SKYLAKE_DESKTOP:

2003

+-		case INTEL_FAM6_SKYLAKE_X:

2004

+-		case INTEL_FAM6_KABYLAKE_MOBILE:

2005

+-		case INTEL_FAM6_KABYLAKE_DESKTOP:

2006

+-			return true;

2007

+-		}

2008

+-	}

2009

+-	return false;

2010

+-}

2011

+-

2012

+ static void __init spectre_v2_select_mitigation(void)

2013

+ {

2014

+ 	enum spectre_v2_mitigation_cmd cmd = spectre_v2_parse_cmdline();

2015

+@@ -388,22 +382,15 @@ retpoline_auto:

2016

+ 	pr_info("%s\n", spectre_v2_strings[mode]);

2017

+

2018

+ 	/*

2019

+-	 * If neither SMEP nor PTI are available, there is a risk of

2020

+-	 * hitting userspace addresses in the RSB after a context switch

2021

+-	 * from a shallow call stack to a deeper one. To prevent this fill

2022

+-	 * the entire RSB, even when using IBRS.

2023

++	 * If spectre v2 protection has been enabled, unconditionally fill

2024

++	 * RSB during a context switch; this protects against two independent

2025

++	 * issues:

2026

+ 	 *

2027

+-	 * Skylake era CPUs have a separate issue with *underflow* of the

2028

+-	 * RSB, when they will predict 'ret' targets from the generic BTB.

2029

+-	 * The proper mitigation for this is IBRS. If IBRS is not supported

2030

+-	 * or deactivated in favour of retpolines the RSB fill on context

2031

+-	 * switch is required.

2032

++	 *	- RSB underflow (and switch to BTB) on Skylake+

2033

++	 *	- SpectreRSB variant of spectre v2 on X86_BUG_SPECTRE_V2 CPUs

2034

+ 	 */

2035

+-	if ((!boot_cpu_has(X86_FEATURE_PTI) &&

2036

+-	     !boot_cpu_has(X86_FEATURE_SMEP)) || is_skylake_era()) {

2037

+-		setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);

2038

+-		pr_info("Spectre v2 mitigation: Filling RSB on context switch\n");

2039

+-	}

2040

++	setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);

2041

++	pr_info("Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch\n");

2042

+

2043

+ 	/* Initialize Indirect Branch Prediction Barrier if supported */

2044

+ 	if (boot_cpu_has(X86_FEATURE_IBPB)) {

2045

+@@ -654,8 +641,121 @@ void x86_spec_ctrl_setup_ap(void)

2046

+ 		x86_amd_ssb_disable();

2047

+ }

2048

+

2049

++#undef pr_fmt

2050

++#define pr_fmt(fmt)	"L1TF: " fmt

2051

++

2052

++/* Default mitigation for L1TF-affected CPUs */

2053

++enum l1tf_mitigations l1tf_mitigation __ro_after_init = L1TF_MITIGATION_FLUSH;

2054

++#if IS_ENABLED(CONFIG_KVM_INTEL)

2055

++EXPORT_SYMBOL_GPL(l1tf_mitigation);

2056

++

2057

++enum vmx_l1d_flush_state l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;

2058

++EXPORT_SYMBOL_GPL(l1tf_vmx_mitigation);

2059

++#endif

2060

++

2061

++static void __init l1tf_select_mitigation(void)

2062

++{

2063

++	u64 half_pa;

2064

++

2065

++	if (!boot_cpu_has_bug(X86_BUG_L1TF))

2066

++		return;

2067

++

2068

++	switch (l1tf_mitigation) {

2069

++	case L1TF_MITIGATION_OFF:

2070

++	case L1TF_MITIGATION_FLUSH_NOWARN:

2071

++	case L1TF_MITIGATION_FLUSH:

2072

++		break;

2073

++	case L1TF_MITIGATION_FLUSH_NOSMT:

2074

++	case L1TF_MITIGATION_FULL:

2075

++		cpu_smt_disable(false);

2076

++		break;

2077

++	case L1TF_MITIGATION_FULL_FORCE:

2078

++		cpu_smt_disable(true);

2079

++		break;

2080

++	}

2081

++

2082

++#if CONFIG_PGTABLE_LEVELS == 2

2083

++	pr_warn("Kernel not compiled for PAE. No mitigation for L1TF\n");

2084

++	return;

2085

++#endif

2086

++

2087

++	/*

2088

++	 * This is extremely unlikely to happen because almost all

2089

++	 * systems have far more MAX_PA/2 than RAM can be fit into

2090

++	 * DIMM slots.

2091

++	 */

2092

++	half_pa = (u64)l1tf_pfn_limit() << PAGE_SHIFT;

2093

++	if (e820__mapped_any(half_pa, ULLONG_MAX - half_pa, E820_TYPE_RAM)) {

2094

++		pr_warn("System has more than MAX_PA/2 memory. L1TF mitigation not effective.\n");

2095

++		return;

2096

++	}

2097

++

2098

++	setup_force_cpu_cap(X86_FEATURE_L1TF_PTEINV);

2099

++}

2100

++

2101

++static int __init l1tf_cmdline(char *str)

2102

++{

2103

++	if (!boot_cpu_has_bug(X86_BUG_L1TF))

2104

++		return 0;

2105

++

2106

++	if (!str)

2107

++		return -EINVAL;

2108

++

2109

++	if (!strcmp(str, "off"))

2110

++		l1tf_mitigation = L1TF_MITIGATION_OFF;

2111

++	else if (!strcmp(str, "flush,nowarn"))

2112

++		l1tf_mitigation = L1TF_MITIGATION_FLUSH_NOWARN;

2113

++	else if (!strcmp(str, "flush"))

2114

++		l1tf_mitigation = L1TF_MITIGATION_FLUSH;

2115

++	else if (!strcmp(str, "flush,nosmt"))

2116

++		l1tf_mitigation = L1TF_MITIGATION_FLUSH_NOSMT;

2117

++	else if (!strcmp(str, "full"))

2118

++		l1tf_mitigation = L1TF_MITIGATION_FULL;

2119

++	else if (!strcmp(str, "full,force"))

2120

++		l1tf_mitigation = L1TF_MITIGATION_FULL_FORCE;

2121

++

2122

++	return 0;

2123

++}

2124

++early_param("l1tf", l1tf_cmdline);

2125

++

2126

++#undef pr_fmt

2127

++

2128

+ #ifdef CONFIG_SYSFS

2129

+

2130

++#define L1TF_DEFAULT_MSG "Mitigation: PTE Inversion"

2131

++

2132

++#if IS_ENABLED(CONFIG_KVM_INTEL)

2133

++static const char *l1tf_vmx_states[] = {

2134

++	[VMENTER_L1D_FLUSH_AUTO]		= "auto",

2135

++	[VMENTER_L1D_FLUSH_NEVER]		= "vulnerable",

2136

++	[VMENTER_L1D_FLUSH_COND]		= "conditional cache flushes",

2137

++	[VMENTER_L1D_FLUSH_ALWAYS]		= "cache flushes",

2138

++	[VMENTER_L1D_FLUSH_EPT_DISABLED]	= "EPT disabled",

2139

++	[VMENTER_L1D_FLUSH_NOT_REQUIRED]	= "flush not necessary"

2140

++};

2141

++

2142

++static ssize_t l1tf_show_state(char *buf)

2143

++{

2144

++	if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_AUTO)

2145

++		return sprintf(buf, "%s\n", L1TF_DEFAULT_MSG);

2146

++

2147

++	if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_EPT_DISABLED ||

2148

++	    (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_NEVER &&

2149

++	     cpu_smt_control == CPU_SMT_ENABLED))

2150

++		return sprintf(buf, "%s; VMX: %s\n", L1TF_DEFAULT_MSG,

2151

++			       l1tf_vmx_states[l1tf_vmx_mitigation]);

2152

++

2153

++	return sprintf(buf, "%s; VMX: %s, SMT %s\n", L1TF_DEFAULT_MSG,

2154

++		       l1tf_vmx_states[l1tf_vmx_mitigation],

2155

++		       cpu_smt_control == CPU_SMT_ENABLED ? "vulnerable" : "disabled");

2156

++}

2157

++#else

2158

++static ssize_t l1tf_show_state(char *buf)

2159

++{

2160

++	return sprintf(buf, "%s\n", L1TF_DEFAULT_MSG);

2161

++}

2162

++#endif

2163

++

2164

+ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr,

2165

+ 			       char *buf, unsigned int bug)

2166

+ {

2167

+@@ -681,6 +781,10 @@ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr

2168

+ 	case X86_BUG_SPEC_STORE_BYPASS:

2169

+ 		return sprintf(buf, "%s\n", ssb_strings[ssb_mode]);

2170

+

2171

++	case X86_BUG_L1TF:

2172

++		if (boot_cpu_has(X86_FEATURE_L1TF_PTEINV))

2173

++			return l1tf_show_state(buf);

2174

++		break;

2175

+ 	default:

2176

+ 		break;

2177

+ 	}

2178

+@@ -707,4 +811,9 @@ ssize_t cpu_show_spec_store_bypass(struct device *dev, struct device_attribute *

2179

+ {

2180

+ 	return cpu_show_common(dev, attr, buf, X86_BUG_SPEC_STORE_BYPASS);

2181

+ }

2182

++

2183

++ssize_t cpu_show_l1tf(struct device *dev, struct device_attribute *attr, char *buf)

2184

++{

2185

++	return cpu_show_common(dev, attr, buf, X86_BUG_L1TF);

2186

++}

2187

+ #endif
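
The option strings accepted by the "l1tf" early parameter map one-to-one onto the enum values. The sketch below expresses that mapping as a lookup table in user space. It is an illustration, not the kernel's implementation: the patch uses an if/else chain, and the `sk_` names and the -1 return value (in place of `-EINVAL`) are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum sk_l1tf_mitigation {
	SK_L1TF_OFF,
	SK_L1TF_FLUSH_NOWARN,
	SK_L1TF_FLUSH,
	SK_L1TF_FLUSH_NOSMT,
	SK_L1TF_FULL,
	SK_L1TF_FULL_FORCE
};

/*
 * Table-driven variant of l1tf_cmdline(). Unknown strings leave *mode
 * untouched and return 0, matching the patch; NULL returns -1.
 */
static inline int sk_parse_l1tf(const char *str, enum sk_l1tf_mitigation *mode)
{
	static const struct {
		const char *name;
		enum sk_l1tf_mitigation mode;
	} table[] = {
		{ "off",          SK_L1TF_OFF },
		{ "flush,nowarn", SK_L1TF_FLUSH_NOWARN },
		{ "flush",        SK_L1TF_FLUSH },
		{ "flush,nosmt",  SK_L1TF_FLUSH_NOSMT },
		{ "full",         SK_L1TF_FULL },
		{ "full,force",   SK_L1TF_FULL_FORCE },
	};
	size_t i;

	if (!str)
		return -1;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (!strcmp(str, table[i].name)) {
			*mode = table[i].mode;
			break;
		}
	}
	return 0;
}

/* Usage: start from the default ("flush") and apply a command-line value. */
static inline void sk_parse_demo(void)
{
	enum sk_l1tf_mitigation mode = SK_L1TF_FLUSH;

	assert(sk_parse_l1tf("full,force", &mode) == 0);
	assert(mode == SK_L1TF_FULL_FORCE);
}
```

Since matching uses exact string comparison, "flush" and "flush,nosmt" cannot be confused regardless of table order.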

2188

+diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c

2189

+index 48e98964ecad..dd02ee4fa8cd 100644

2190

+--- a/arch/x86/kernel/cpu/common.c

2191

++++ b/arch/x86/kernel/cpu/common.c

2192

+@@ -66,6 +66,13 @@ cpumask_var_t cpu_callin_mask;

2193

+ /* representing cpus for which sibling maps can be computed */

2194

+ cpumask_var_t cpu_sibling_setup_mask;

2195

+

2196

++/* Number of siblings per CPU package */

2197

++int smp_num_siblings = 1;

2198

++EXPORT_SYMBOL(smp_num_siblings);

2199

++

2200

++/* Last level cache ID of each logical CPU */

2201

++DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id) = BAD_APICID;

2202

++

2203

+ /* correctly size the local cpu masks */

2204

+ void __init setup_cpu_local_masks(void)

2205

+ {

2206

+@@ -614,33 +621,36 @@ static void cpu_detect_tlb(struct cpuinfo_x86 *c)

2207

+ 		tlb_lld_4m[ENTRIES], tlb_lld_1g[ENTRIES]);

2208

+ }

2209

+

2210

+-void detect_ht(struct cpuinfo_x86 *c)

2211

++int detect_ht_early(struct cpuinfo_x86 *c)

2212

+ {

2213

+ #ifdef CONFIG_SMP

2214

+ 	u32 eax, ebx, ecx, edx;

2215

+-	int index_msb, core_bits;

2216

+-	static bool printed;

2217

+

2218

+ 	if (!cpu_has(c, X86_FEATURE_HT))

2219

+-		return;

2220

++		return -1;

2221

+

2222

+ 	if (cpu_has(c, X86_FEATURE_CMP_LEGACY))

2223

+-		goto out;

2224

++		return -1;

2225

+

2226

+ 	if (cpu_has(c, X86_FEATURE_XTOPOLOGY))

2227

+-		return;

2228

++		return -1;

2229

+

2230

+ 	cpuid(1, &eax, &ebx, &ecx, &edx);

2231

+

2232

+ 	smp_num_siblings = (ebx & 0xff0000) >> 16;

2233

+-

2234

+-	if (smp_num_siblings == 1) {

2235

++	if (smp_num_siblings == 1)

2236

+ 		pr_info_once("CPU0: Hyper-Threading is disabled\n");

2237

+-		goto out;

2238

+-	}

2239

++#endif

2240

++	return 0;

2241

++}

2242

+

2243

+-	if (smp_num_siblings <= 1)

2244

+-		goto out;

2245

++void detect_ht(struct cpuinfo_x86 *c)

2246

++{

2247

++#ifdef CONFIG_SMP

2248

++	int index_msb, core_bits;

2249

++

2250

++	if (detect_ht_early(c) < 0)

2251

++		return;

2252

+

2253

+ 	index_msb = get_count_order(smp_num_siblings);

2254

+ 	c->phys_proc_id = apic->phys_pkg_id(c->initial_apicid, index_msb);

2255

+@@ -653,15 +663,6 @@ void detect_ht(struct cpuinfo_x86 *c)

2256

+

2257

+ 	c->cpu_core_id = apic->phys_pkg_id(c->initial_apicid, index_msb) &

2258

+ 				       ((1 << core_bits) - 1);

2259

+-

2260

+-out:

2261

+-	if (!printed && (c->x86_max_cores * smp_num_siblings) > 1) {

2262

+-		pr_info("CPU: Physical Processor ID: %d\n",

2263

+-			c->phys_proc_id);

2264

+-		pr_info("CPU: Processor Core ID: %d\n",

2265

+-			c->cpu_core_id);

2266

+-		printed = 1;

2267

+-	}

2268

+ #endif

2269

+ }

2270

+

2271

+@@ -933,6 +934,21 @@ static const __initconst struct x86_cpu_id cpu_no_spec_store_bypass[] = {

2272

+ 	{}

2273

+ };

2274

+

2275

++static const __initconst struct x86_cpu_id cpu_no_l1tf[] = {

2276

++	/* in addition to cpu_no_speculation */

2277

++	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_SILVERMONT1	},

2278

++	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_SILVERMONT2	},

2279

++	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_AIRMONT		},

2280

++	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_MERRIFIELD	},

2281

++	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_MOOREFIELD	},

2282

++	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_GOLDMONT	},

2283

++	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_DENVERTON	},

2284

++	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_GEMINI_LAKE	},

2285

++	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_XEON_PHI_KNL		},

2286

++	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_XEON_PHI_KNM		},

2287

++	{}

2288

++};

2289

++

2290

+ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)

2291

+ {

2292

+ 	u64 ia32_cap = 0;

2293

+@@ -958,6 +974,11 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)

2294

+ 		return;

2295

+

2296

+ 	setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN);

2297

++

2298

++	if (x86_match_cpu(cpu_no_l1tf))

2299

++		return;

2300

++

2301

++	setup_force_cpu_bug(X86_BUG_L1TF);

2302

+ }

2303

+

2304

+ /*

2305

+diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h

2306

+index 37672d299e35..cca588407dca 100644

2307

+--- a/arch/x86/kernel/cpu/cpu.h

2308

++++ b/arch/x86/kernel/cpu/cpu.h

2309

+@@ -47,6 +47,8 @@ extern const struct cpu_dev *const __x86_cpu_dev_start[],

2310

+

2311

+ extern void get_cpu_cap(struct cpuinfo_x86 *c);

2312

+ extern void cpu_detect_cache_sizes(struct cpuinfo_x86 *c);

2313

++extern int detect_extended_topology_early(struct cpuinfo_x86 *c);

2314

++extern int detect_ht_early(struct cpuinfo_x86 *c);

2315

+

2316

+ unsigned int aperfmperf_get_khz(int cpu);

2317

+

2318

+diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c

2319

+index 0b2330e19169..278be092b300 100644

2320

+--- a/arch/x86/kernel/cpu/intel.c

2321

++++ b/arch/x86/kernel/cpu/intel.c

2322

+@@ -301,6 +301,13 @@ static void early_init_intel(struct cpuinfo_x86 *c)

2323

+ 	}

2324

+

2325

+ 	check_mpx_erratum(c);

2326

++

2327

++	/*

2328

++	 * Get the number of SMT siblings early from the extended topology

2329

++	 * leaf, if available. Otherwise try the legacy SMT detection.

2330

++	 */

2331

++	if (detect_extended_topology_early(c) < 0)

2332

++		detect_ht_early(c);

2333

+ }

2334

+

2335

+ #ifdef CONFIG_X86_32

2336

+diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c

2337

+index 4fc0e08a30b9..387a8f44fba1 100644

2338

+--- a/arch/x86/kernel/cpu/microcode/core.c

2339

++++ b/arch/x86/kernel/cpu/microcode/core.c

2340

+@@ -509,12 +509,20 @@ static struct platform_device	*microcode_pdev;

2341

+

2342

+ static int check_online_cpus(void)

2343

+ {

2344

+-	if (num_online_cpus() == num_present_cpus())

2345

+-		return 0;

2346

++	unsigned int cpu;

2347

+

2348

+-	pr_err("Not all CPUs online, aborting microcode update.\n");

2349

++	/*

2350

++	 * Make sure all CPUs are online.  It's fine for SMT to be disabled if

2351

++	 * all the primary threads are still online.

2352

++	 */

2353

++	for_each_present_cpu(cpu) {

2354

++		if (topology_is_primary_thread(cpu) && !cpu_online(cpu)) {

2355

++			pr_err("Not all CPUs online, aborting microcode update.\n");

2356

++			return -EINVAL;

2357

++		}

2358

++	}

2359

+

2360

+-	return -EINVAL;

2361

++	return 0;

2362

+ }

2363

+

2364

+ static atomic_t late_cpus_in;

2365

+diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c

2366

+index b099024d339c..19c6e800e816 100644

2367

+--- a/arch/x86/kernel/cpu/topology.c

2368

++++ b/arch/x86/kernel/cpu/topology.c

2369

+@@ -27,16 +27,13 @@

2370

+  * exists, use it for populating initial_apicid and cpu topology

2371

+  * detection.

2372

+  */

2373

+-void detect_extended_topology(struct cpuinfo_x86 *c)

2374

++int detect_extended_topology_early(struct cpuinfo_x86 *c)

2375

+ {

2376

+ #ifdef CONFIG_SMP

2377

+-	unsigned int eax, ebx, ecx, edx, sub_index;

2378

+-	unsigned int ht_mask_width, core_plus_mask_width;

2379

+-	unsigned int core_select_mask, core_level_siblings;

2380

+-	static bool printed;

2381

++	unsigned int eax, ebx, ecx, edx;

2382

+

2383

+ 	if (c->cpuid_level < 0xb)

2384

+-		return;

2385

++		return -1;

2386

+

2387

+ 	cpuid_count(0xb, SMT_LEVEL, &eax, &ebx, &ecx, &edx);

2388

+

2389

+@@ -44,7 +41,7 @@ void detect_extended_topology(struct cpuinfo_x86 *c)

2390

+ 	 * check if the cpuid leaf 0xb is actually implemented.

2391

+ 	 */

2392

+ 	if (ebx == 0 || (LEAFB_SUBTYPE(ecx) != SMT_TYPE))

2393

+-		return;

2394

++		return -1;

2395

+

2396

+ 	set_cpu_cap(c, X86_FEATURE_XTOPOLOGY);

2397

+

2398

+@@ -52,10 +49,30 @@ void detect_extended_topology(struct cpuinfo_x86 *c)

2399

+ 	 * initial apic id, which also represents 32-bit extended x2apic id.

2400

+ 	 */

2401

+ 	c->initial_apicid = edx;

2402

++	smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);

2403

++#endif

2404

++	return 0;

2405

++}

2406

++

2407

++/*

2408

++ * Check for extended topology enumeration cpuid leaf 0xb and if it

2409

++ * exists, use it for populating initial_apicid and cpu topology

2410

++ * detection.

2411

++ */

2412

++void detect_extended_topology(struct cpuinfo_x86 *c)

2413

++{

2414

++#ifdef CONFIG_SMP

2415

++	unsigned int eax, ebx, ecx, edx, sub_index;

2416

++	unsigned int ht_mask_width, core_plus_mask_width;

2417

++	unsigned int core_select_mask, core_level_siblings;

2418

++

2419

++	if (detect_extended_topology_early(c) < 0)

2420

++		return;

2421

+

2422

+ 	/*

2423

+ 	 * Populate HT related information from sub-leaf level 0.

2424

+ 	 */

2425

++	cpuid_count(0xb, SMT_LEVEL, &eax, &ebx, &ecx, &edx);

2426

+ 	core_level_siblings = smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);

2427

+ 	core_plus_mask_width = ht_mask_width = BITS_SHIFT_NEXT_LEVEL(eax);

2428

+

2429

+@@ -86,15 +103,5 @@ void detect_extended_topology(struct cpuinfo_x86 *c)

2430

+ 	c->apicid = apic->phys_pkg_id(c->initial_apicid, 0);

2431

+

2432

+ 	c->x86_max_cores = (core_level_siblings / smp_num_siblings);

2433

+-

2434

+-	if (!printed) {

2435

+-		pr_info("CPU: Physical Processor ID: %d\n",

2436

+-		       c->phys_proc_id);

2437

+-		if (c->x86_max_cores > 1)

2438

+-			pr_info("CPU: Processor Core ID: %d\n",

2439

+-			       c->cpu_core_id);

2440

+-		printed = 1;

2441

+-	}

2442

+-	return;

2443

+ #endif

2444

+ }

2445

+diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c

2446

+index f92a6593de1e..2ea85b32421a 100644

2447

+--- a/arch/x86/kernel/fpu/core.c

2448

++++ b/arch/x86/kernel/fpu/core.c

2449

+@@ -10,6 +10,7 @@

2450

+ #include <asm/fpu/signal.h>

2451

+ #include <asm/fpu/types.h>

2452

+ #include <asm/traps.h>

2453

++#include <asm/irq_regs.h>

2454

+

2455

+ #include <linux/hardirq.h>

2456

+ #include <linux/pkeys.h>

2457

+diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c

2458

+index 01ebcb6f263e..7acb87cb2da8 100644

2459

+--- a/arch/x86/kernel/ftrace.c

2460

++++ b/arch/x86/kernel/ftrace.c

2461

+@@ -27,6 +27,7 @@

2462

+

2463

+ #include <asm/set_memory.h>

2464

+ #include <asm/kprobes.h>

2465

++#include <asm/sections.h>

2466

+ #include <asm/ftrace.h>

2467

+ #include <asm/nops.h>

2468

+

2469

+diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c

2470

+index 8ce4212e2b8d..afa1a204bc6d 100644

2471

+--- a/arch/x86/kernel/hpet.c

2472

++++ b/arch/x86/kernel/hpet.c

2473

+@@ -1,6 +1,7 @@

2474

+ #include <linux/clocksource.h>

2475

+ #include <linux/clockchips.h>

2476

+ #include <linux/interrupt.h>

2477

++#include <linux/irq.h>

2478

+ #include <linux/export.h>

2479

+ #include <linux/delay.h>

2480

+ #include <linux/errno.h>

2481

+diff --git a/arch/x86/kernel/i8259.c b/arch/x86/kernel/i8259.c

2482

+index 8f5cb2c7060c..02abc134367f 100644

2483

+--- a/arch/x86/kernel/i8259.c

2484

++++ b/arch/x86/kernel/i8259.c

2485

+@@ -5,6 +5,7 @@

2486

+ #include <linux/sched.h>

2487

+ #include <linux/ioport.h>

2488

+ #include <linux/interrupt.h>

2489

++#include <linux/irq.h>

2490

+ #include <linux/timex.h>

2491

+ #include <linux/random.h>

2492

+ #include <linux/init.h>

2493

+diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c

2494

+index 0c5256653d6c..38c3d5790970 100644

2495

+--- a/arch/x86/kernel/idt.c

2496

++++ b/arch/x86/kernel/idt.c

2497

+@@ -8,6 +8,7 @@

2498

+ #include <asm/traps.h>

2499

+ #include <asm/proto.h>

2500

+ #include <asm/desc.h>

2501

++#include <asm/hw_irq.h>

2502

+

2503

+ struct idt_data {

2504

+ 	unsigned int	vector;

2505

+diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c

2506

+index aa9d51eea9d0..3c2326b59820 100644

2507

+--- a/arch/x86/kernel/irq.c

2508

++++ b/arch/x86/kernel/irq.c

2509

+@@ -10,6 +10,7 @@

2510

+ #include <linux/ftrace.h>

2511

+ #include <linux/delay.h>

2512

+ #include <linux/export.h>

2513

++#include <linux/irq.h>

2514

+

2515

+ #include <asm/apic.h>

2516

+ #include <asm/io_apic.h>

2517

+diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c

2518

+index c1bdbd3d3232..95600a99ae93 100644

2519

+--- a/arch/x86/kernel/irq_32.c

2520

++++ b/arch/x86/kernel/irq_32.c

2521

+@@ -11,6 +11,7 @@

2522

+

2523

+ #include <linux/seq_file.h>

2524

+ #include <linux/interrupt.h>

2525

++#include <linux/irq.h>

2526

+ #include <linux/kernel_stat.h>

2527

+ #include <linux/notifier.h>

2528

+ #include <linux/cpu.h>

2529

+diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c

2530

+index d86e344f5b3d..0469cd078db1 100644

2531

+--- a/arch/x86/kernel/irq_64.c

2532

++++ b/arch/x86/kernel/irq_64.c

2533

+@@ -11,6 +11,7 @@

2534

+

2535

+ #include <linux/kernel_stat.h>

2536

+ #include <linux/interrupt.h>

2537

++#include <linux/irq.h>

2538

+ #include <linux/seq_file.h>

2539

+ #include <linux/delay.h>

2540

+ #include <linux/ftrace.h>

2541

+diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c

2542

+index 1e4094eba15e..40f83d0d7b8a 100644

2543

+--- a/arch/x86/kernel/irqinit.c

2544

++++ b/arch/x86/kernel/irqinit.c

2545

+@@ -5,6 +5,7 @@

2546

+ #include <linux/sched.h>

2547

+ #include <linux/ioport.h>

2548

+ #include <linux/interrupt.h>

2549

++#include <linux/irq.h>

2550

+ #include <linux/timex.h>

2551

+ #include <linux/random.h>

2552

+ #include <linux/kprobes.h>

2553

+diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c

2554

+index f1030c522e06..65452d555f05 100644

2555

+--- a/arch/x86/kernel/kprobes/core.c

2556

++++ b/arch/x86/kernel/kprobes/core.c

2557

+@@ -63,6 +63,7 @@

2558

+ #include <asm/insn.h>

2559

+ #include <asm/debugreg.h>

2560

+ #include <asm/set_memory.h>

2561

++#include <asm/sections.h>

2562

+

2563

+ #include "common.h"

2564

+

2565

+@@ -394,8 +395,6 @@ int __copy_instruction(u8 *dest, u8 *src, struct insn *insn)

2566

+ 			  - (u8 *) dest;

2567

+ 		if ((s64) (s32) newdisp != newdisp) {

2568

+ 			pr_err("Kprobes error: new displacement does not fit into s32 (%llx)\n", newdisp);

2569

+-			pr_err("\tSrc: %p, Dest: %p, old disp: %x\n",

2570

+-				src, dest, insn->displacement.value);

2571

+ 			return 0;

2572

+ 		}

2573

+ 		disp = (u8 *) dest + insn_offset_displacement(insn);

2574

+@@ -621,8 +620,7 @@ static int reenter_kprobe(struct kprobe *p, struct pt_regs *regs,

2575

+ 		 * Raise a BUG or we'll continue in an endless reentering loop

2576

+ 		 * and eventually a stack overflow.

2577

+ 		 */

2578

+-		printk(KERN_WARNING "Unrecoverable kprobe detected at %p.\n",

2579

+-		       p->addr);

2580

++		pr_err("Unrecoverable kprobe detected.\n");

2581

+ 		dump_kprobe(p);

2582

+ 		BUG();

2583

+ 	default:

2584

+diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c

2585

+index e1df9ef5d78c..f3559b84cd75 100644

2586

+--- a/arch/x86/kernel/paravirt.c

2587

++++ b/arch/x86/kernel/paravirt.c

2588

+@@ -88,10 +88,12 @@ unsigned paravirt_patch_call(void *insnbuf,

2589

+ 	struct branch *b = insnbuf;

2590

+ 	unsigned long delta = (unsigned long)target - (addr+5);

2591

+

2592

+-	if (tgt_clobbers & ~site_clobbers)

2593

+-		return len;	/* target would clobber too much for this site */

2594

+-	if (len < 5)

2595

++	if (len < 5) {

2596

++#ifdef CONFIG_RETPOLINE

2597

++		WARN_ONCE("Failing to patch indirect CALL in %ps\n", (void *)addr);

2598

++#endif

2599

+ 		return len;	/* call too long for patch site */

2600

++	}

2601

+

2602

+ 	b->opcode = 0xe8; /* call */

2603

+ 	b->delta = delta;

2604

+@@ -106,8 +108,12 @@ unsigned paravirt_patch_jmp(void *insnbuf, const void *target,

2605

+ 	struct branch *b = insnbuf;

2606

+ 	unsigned long delta = (unsigned long)target - (addr+5);

2607

+

2608

+-	if (len < 5)

2609

++	if (len < 5) {

2610

++#ifdef CONFIG_RETPOLINE

2611

++		WARN_ONCE("Failing to patch indirect JMP in %ps\n", (void *)addr);

2612

++#endif

2613

+ 		return len;	/* call too long for patch site */

2614

++	}

2615

+

2616

+ 	b->opcode = 0xe9;	/* jmp */

2617

+ 	b->delta = delta;

2618

+diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c

2619

+index efbcf5283520..dcb00acb6583 100644

2620

+--- a/arch/x86/kernel/setup.c

2621

++++ b/arch/x86/kernel/setup.c

2622

+@@ -852,6 +852,12 @@ void __init setup_arch(char **cmdline_p)

2623

+ 	memblock_reserve(__pa_symbol(_text),

2624

+ 			 (unsigned long)__bss_stop - (unsigned long)_text);

2625

+

2626

++	/*

2627

++	 * Make sure page 0 is always reserved because on systems with

2628

++	 * L1TF its contents can be leaked to user processes.

2629

++	 */

2630

++	memblock_reserve(0, PAGE_SIZE);

2631

++

2632

+ 	early_reserve_initrd();

2633

+

2634

+ 	/*

2635

+diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c

2636

+index 5c574dff4c1a..04adc8d60aed 100644

2637

+--- a/arch/x86/kernel/smp.c

2638

++++ b/arch/x86/kernel/smp.c

2639

+@@ -261,6 +261,7 @@ __visible void __irq_entry smp_reschedule_interrupt(struct pt_regs *regs)

2640

+ {

2641

+ 	ack_APIC_irq();

2642

+ 	inc_irq_stat(irq_resched_count);

2643

++	kvm_set_cpu_l1tf_flush_l1d();

2644

+

2645

+ 	if (trace_resched_ipi_enabled()) {

2646

+ 		/*

2647

+diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c

2648

+index 344d3c160f8d..5ebb0dbcf4f7 100644

2649

+--- a/arch/x86/kernel/smpboot.c

2650

++++ b/arch/x86/kernel/smpboot.c

2651

+@@ -78,13 +78,7 @@

2652

+ #include <asm/realmode.h>

2653

+ #include <asm/misc.h>

2654

+ #include <asm/spec-ctrl.h>

2655

+-

2656

+-/* Number of siblings per CPU package */

2657

+-int smp_num_siblings = 1;

2658

+-EXPORT_SYMBOL(smp_num_siblings);

2659

+-

2660

+-/* Last level cache ID of each logical CPU */

2661

+-DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id) = BAD_APICID;

2662

++#include <asm/hw_irq.h>

2663

+

2664

+ /* representing HT siblings of each logical CPU */

2665

+ DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);

2666

+@@ -311,6 +305,23 @@ found:

2667

+ 	return 0;

2668

+ }

2669

+

2670

++/**

2671

++ * topology_is_primary_thread - Check whether CPU is the primary SMT thread

2672

++ * @cpu:	CPU to check

2673

++ */

2674

++bool topology_is_primary_thread(unsigned int cpu)

2675

++{

2676

++	return apic_id_is_primary_thread(per_cpu(x86_cpu_to_apicid, cpu));

2677

++}

2678

++

2679

++/**

2680

++ * topology_smt_supported - Check whether SMT is supported by the CPUs

2681

++ */

2682

++bool topology_smt_supported(void)

2683

++{

2684

++	return smp_num_siblings > 1;

2685

++}

2686

++

2687

+ /**

2688

+  * topology_phys_to_logical_pkg - Map a physical package id to a logical

2689

+  *

2690

+diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c

2691

+index 879af864d99a..49a5c394f3ed 100644

2692

+--- a/arch/x86/kernel/time.c

2693

++++ b/arch/x86/kernel/time.c

2694

+@@ -12,6 +12,7 @@

2695

+

2696

+ #include <linux/clockchips.h>

2697

+ #include <linux/interrupt.h>

2698

++#include <linux/irq.h>

2699

+ #include <linux/i8253.h>

2700

+ #include <linux/time.h>

2701

+ #include <linux/export.h>

2702

+diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c

2703

+index 2ef2f1fe875b..00e2ae033a0f 100644

2704

+--- a/arch/x86/kvm/mmu.c

2705

++++ b/arch/x86/kvm/mmu.c

2706

+@@ -3825,6 +3825,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,

2707

+ {

2708

+ 	int r = 1;

2709

+

2710

++	vcpu->arch.l1tf_flush_l1d = true;

2711

+ 	switch (vcpu->arch.apf.host_apf_reason) {

2712

+ 	default:

2713

+ 		trace_kvm_page_fault(fault_address, error_code);

2714

+diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c

2715

+index cfa155078ebb..282bbcbf3b6a 100644

2716

+--- a/arch/x86/kvm/svm.c

2717

++++ b/arch/x86/kvm/svm.c

2718

+@@ -175,6 +175,8 @@ struct vcpu_svm {

2719

+ 	uint64_t sysenter_eip;

2720

+ 	uint64_t tsc_aux;

2721

+

2722

++	u64 msr_decfg;

2723

++

2724

+ 	u64 next_rip;

2725

+

2726

+ 	u64 host_user_msrs[NR_HOST_SAVE_USER_MSRS];

2727

+@@ -1616,6 +1618,7 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)

2728

+ 	u32 dummy;

2729

+ 	u32 eax = 1;

2730

+

2731

++	vcpu->arch.microcode_version = 0x01000065;

2732

+ 	svm->spec_ctrl = 0;

2733

+ 	svm->virt_spec_ctrl = 0;

2734

+

2735

+@@ -3555,6 +3558,22 @@ static int cr8_write_interception(struct vcpu_svm *svm)

2736

+ 	return 0;

2737

+ }

2738

+

2739

++static int svm_get_msr_feature(struct kvm_msr_entry *msr)

2740

++{

2741

++	msr->data = 0;

2742

++

2743

++	switch (msr->index) {

2744

++	case MSR_F10H_DECFG:

2745

++		if (boot_cpu_has(X86_FEATURE_LFENCE_RDTSC))

2746

++			msr->data |= MSR_F10H_DECFG_LFENCE_SERIALIZE;

2747

++		break;

2748

++	default:

2749

++		return 1;

2750

++	}

2751

++

2752

++	return 0;

2753

++}

2754

++

2755

+ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)

2756

+ {

2757

+ 	struct vcpu_svm *svm = to_svm(vcpu);

2758

+@@ -3637,9 +3656,6 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)

2759

+

2760

+ 		msr_info->data = svm->virt_spec_ctrl;

2761

+ 		break;

2762

+-	case MSR_IA32_UCODE_REV:

2763

+-		msr_info->data = 0x01000065;

2764

+-		break;

2765

+ 	case MSR_F15H_IC_CFG: {

2766

+

2767

+ 		int family, model;

2768

+@@ -3657,6 +3673,9 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)

2769

+ 			msr_info->data = 0x1E;

2770

+ 		}

2771

+ 		break;

2772

++	case MSR_F10H_DECFG:

2773

++		msr_info->data = svm->msr_decfg;

2774

++		break;

2775

+ 	default:

2776

+ 		return kvm_get_msr_common(vcpu, msr_info);

2777

+ 	}

2778

+@@ -3845,6 +3864,24 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)

2779

+ 	case MSR_VM_IGNNE:

2780

+ 		vcpu_unimpl(vcpu, "unimplemented wrmsr: 0x%x data 0x%llx\n", ecx, data);

2781

+ 		break;

2782

++	case MSR_F10H_DECFG: {

2783

++		struct kvm_msr_entry msr_entry;

2784

++

2785

++		msr_entry.index = msr->index;

2786

++		if (svm_get_msr_feature(&msr_entry))

2787

++			return 1;

2788

++

2789

++		/* Check the supported bits */

2790

++		if (data & ~msr_entry.data)

2791

++			return 1;

2792

++

2793

++		/* Don't allow the guest to change a bit, #GP */

2794

++		if (!msr->host_initiated && (data ^ msr_entry.data))

2795

++			return 1;

2796

++

2797

++		svm->msr_decfg = data;

2798

++		break;

2799

++	}

2800

+ 	case MSR_IA32_APICBASE:

2801

+ 		if (kvm_vcpu_apicv_active(vcpu))

2802

+ 			avic_update_vapic_bar(to_svm(vcpu), data);

2803

+@@ -5588,6 +5625,7 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {

2804

+ 	.vcpu_unblocking = svm_vcpu_unblocking,

2805

+

2806

+ 	.update_bp_intercept = update_bp_intercept,

2807

++	.get_msr_feature = svm_get_msr_feature,

2808

+ 	.get_msr = svm_get_msr,

2809

+ 	.set_msr = svm_set_msr,

2810

+ 	.get_segment_base = svm_get_segment_base,

2811

+diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c

2812

+index 8d000fde1414..f015ca3997d9 100644

2813

+--- a/arch/x86/kvm/vmx.c

2814

++++ b/arch/x86/kvm/vmx.c

2815

+@@ -191,6 +191,150 @@ module_param(ple_window_max, int, S_IRUGO);

2816

+

2817

+ extern const ulong vmx_return;

2818

+

2819

++static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush);

2820

++static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_cond);

2821

++static DEFINE_MUTEX(vmx_l1d_flush_mutex);

2822

++

2823

++/* Storage for pre module init parameter parsing */

2824

++static enum vmx_l1d_flush_state __read_mostly vmentry_l1d_flush_param = VMENTER_L1D_FLUSH_AUTO;

2825

++

2826

++static const struct {

2827

++	const char *option;

2828

++	enum vmx_l1d_flush_state cmd;

2829

++} vmentry_l1d_param[] = {

2830

++	{"auto",	VMENTER_L1D_FLUSH_AUTO},

2831

++	{"never",	VMENTER_L1D_FLUSH_NEVER},

2832

++	{"cond",	VMENTER_L1D_FLUSH_COND},

2833

++	{"always",	VMENTER_L1D_FLUSH_ALWAYS},

2834

++};

2835

++

2836

++#define L1D_CACHE_ORDER 4

2837

++static void *vmx_l1d_flush_pages;

2838

++

2839

++static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)

2840

++{

2841

++	struct page *page;

2842

++	unsigned int i;

2843

++

2844

++	if (!enable_ept) {

2845

++		l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_EPT_DISABLED;

2846

++		return 0;

2847

++	}

2848

++

2849

++       if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES)) {

2850

++	       u64 msr;

2851

++

2852

++	       rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr);

2853

++	       if (msr & ARCH_CAP_SKIP_VMENTRY_L1DFLUSH) {

2854

++		       l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_NOT_REQUIRED;

2855

++		       return 0;

2856

++	       }

2857

++       }

2858

++

2859

++	/* If set to auto use the default l1tf mitigation method */

2860

++	if (l1tf == VMENTER_L1D_FLUSH_AUTO) {

2861

++		switch (l1tf_mitigation) {

2862

++		case L1TF_MITIGATION_OFF:

2863

++			l1tf = VMENTER_L1D_FLUSH_NEVER;

2864

++			break;

2865

++		case L1TF_MITIGATION_FLUSH_NOWARN:

2866

++		case L1TF_MITIGATION_FLUSH:

2867

++		case L1TF_MITIGATION_FLUSH_NOSMT:

2868

++			l1tf = VMENTER_L1D_FLUSH_COND;

2869

++			break;

2870

++		case L1TF_MITIGATION_FULL:

2871

++		case L1TF_MITIGATION_FULL_FORCE:

2872

++			l1tf = VMENTER_L1D_FLUSH_ALWAYS;

2873

++			break;

2874

++		}

2875

++	} else if (l1tf_mitigation == L1TF_MITIGATION_FULL_FORCE) {

2876

++		l1tf = VMENTER_L1D_FLUSH_ALWAYS;

2877

++	}

2878

++

2879

++	if (l1tf != VMENTER_L1D_FLUSH_NEVER && !vmx_l1d_flush_pages &&

2880

++	    !boot_cpu_has(X86_FEATURE_FLUSH_L1D)) {

2881

++		page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER);

2882

++		if (!page)

2883

++			return -ENOMEM;

2884

++		vmx_l1d_flush_pages = page_address(page);

2885

++

2886

++		/*

2887

++		 * Initialize each page with a different pattern in

2888

++		 * order to protect against KSM in the nested

2889

++		 * virtualization case.

2890

++		 */

2891

++		for (i = 0; i < 1u << L1D_CACHE_ORDER; ++i) {

2892

++			memset(vmx_l1d_flush_pages + i * PAGE_SIZE, i + 1,

2893

++			       PAGE_SIZE);

2894

++		}

2895

++	}

2896

++

2897

++	l1tf_vmx_mitigation = l1tf;

2898

++

2899

++	if (l1tf != VMENTER_L1D_FLUSH_NEVER)

2900

++		static_branch_enable(&vmx_l1d_should_flush);

2901

++	else

2902

++		static_branch_disable(&vmx_l1d_should_flush);

2903

++

2904

++	if (l1tf == VMENTER_L1D_FLUSH_COND)

2905

++		static_branch_enable(&vmx_l1d_flush_cond);

2906

++	else

2907

++		static_branch_disable(&vmx_l1d_flush_cond);

2908

++	return 0;

2909

++}

2910

++

2911

++static int vmentry_l1d_flush_parse(const char *s)

2912

++{

2913

++	unsigned int i;

2914

++

2915

++	if (s) {

2916

++		for (i = 0; i < ARRAY_SIZE(vmentry_l1d_param); i++) {

2917

++			if (sysfs_streq(s, vmentry_l1d_param[i].option))

2918

++				return vmentry_l1d_param[i].cmd;

2919

++		}

2920

++	}

2921

++	return -EINVAL;

2922

++}

2923

++

2924

++static int vmentry_l1d_flush_set(const char *s, const struct kernel_param *kp)

2925

++{

2926

++	int l1tf, ret;

2927

++

2928

++	if (!boot_cpu_has(X86_BUG_L1TF))

2929

++		return 0;

2930

++

2931

++	l1tf = vmentry_l1d_flush_parse(s);

2932

++	if (l1tf < 0)

2933

++		return l1tf;

2934

++

2935

++	/*

2936

++	 * Has vmx_init() run already? If not then this is the pre init

2937

++	 * parameter parsing. In that case just store the value and let

2938

++	 * vmx_init() do the proper setup after enable_ept has been

2939

++	 * established.

2940

++	 */

2941

++	if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_AUTO) {

2942

++		vmentry_l1d_flush_param = l1tf;

2943

++		return 0;

2944

++	}

2945

++

2946

++	mutex_lock(&vmx_l1d_flush_mutex);

2947

++	ret = vmx_setup_l1d_flush(l1tf);

2948

++	mutex_unlock(&vmx_l1d_flush_mutex);

2949

++	return ret;

2950

++}

2951

++

2952

++static int vmentry_l1d_flush_get(char *s, const struct kernel_param *kp)

2953

++{

2954

++	return sprintf(s, "%s\n", vmentry_l1d_param[l1tf_vmx_mitigation].option);

2955

++}

2956

++

2957

++static const struct kernel_param_ops vmentry_l1d_flush_ops = {

2958

++	.set = vmentry_l1d_flush_set,

2959

++	.get = vmentry_l1d_flush_get,

2960

++};

2961

++module_param_cb(vmentry_l1d_flush, &vmentry_l1d_flush_ops, NULL, 0644);

2962

++

2963

+ #define NR_AUTOLOAD_MSRS 8

2964

+

2965

+ struct vmcs {

2966

+@@ -567,6 +711,11 @@ static inline int pi_test_sn(struct pi_desc *pi_desc)

2967

+ 			(unsigned long *)&pi_desc->control);

2968

+ }

2969

+

2970

++struct vmx_msrs {

2971

++	unsigned int		nr;

2972

++	struct vmx_msr_entry	val[NR_AUTOLOAD_MSRS];

2973

++};

2974

++

2975

+ struct vcpu_vmx {

2976

+ 	struct kvm_vcpu       vcpu;

2977

+ 	unsigned long         host_rsp;

2978

+@@ -600,9 +749,8 @@ struct vcpu_vmx {

2979

+ 	struct loaded_vmcs   *loaded_vmcs;

2980

+ 	bool                  __launched; /* temporary, used in vmx_vcpu_run */

2981

+ 	struct msr_autoload {

2982

+-		unsigned nr;

2983

+-		struct vmx_msr_entry guest[NR_AUTOLOAD_MSRS];

2984

+-		struct vmx_msr_entry host[NR_AUTOLOAD_MSRS];

2985

++		struct vmx_msrs guest;

2986

++		struct vmx_msrs host;

2987

+ 	} msr_autoload;

2988

+ 	struct {

2989

+ 		int           loaded;

2990

+@@ -1967,9 +2115,20 @@ static void clear_atomic_switch_msr_special(struct vcpu_vmx *vmx,

2991

+ 	vm_exit_controls_clearbit(vmx, exit);

2992

+ }

2993

+

2994

++static int find_msr(struct vmx_msrs *m, unsigned int msr)

2995

++{

2996

++	unsigned int i;

2997

++

2998

++	for (i = 0; i < m->nr; ++i) {

2999

++		if (m->val[i].index == msr)

3000

++			return i;

3001

++	}

3002

++	return -ENOENT;

3003

++}

3004

++

3005

+ static void clear_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr)

3006

+ {

3007

+-	unsigned i;

3008

++	int i;

3009

+ 	struct msr_autoload *m = &vmx->msr_autoload;

3010

+

3011

+ 	switch (msr) {

3012

+@@ -1990,18 +2149,21 @@ static void clear_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr)

3013

+ 		}

3014

+ 		break;

3015

+ 	}

3016

++	i = find_msr(&m->guest, msr);

3017

++	if (i < 0)

3018

++		goto skip_guest;

3019

++	--m->guest.nr;

3020

++	m->guest.val[i] = m->guest.val[m->guest.nr];

3021

++	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->guest.nr);

3022

+

3023

+-	for (i = 0; i < m->nr; ++i)

3024

+-		if (m->guest[i].index == msr)

3025

+-			break;

3026

+-

3027

+-	if (i == m->nr)

3028

++skip_guest:

3029

++	i = find_msr(&m->host, msr);

3030

++	if (i < 0)

3031

+ 		return;

3032

+-	--m->nr;

3033

+-	m->guest[i] = m->guest[m->nr];

3034

+-	m->host[i] = m->host[m->nr];

3035

+-	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);

3036

+-	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);

3037

++

3038

++	--m->host.nr;

3039

++	m->host.val[i] = m->host.val[m->host.nr];

3040

++	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->host.nr);

3041

+ }

3042

+

3043

+ static void add_atomic_switch_msr_special(struct vcpu_vmx *vmx,

3044

+@@ -2016,9 +2178,9 @@ static void add_atomic_switch_msr_special(struct vcpu_vmx *vmx,

3045

+ }

3046

+

3047

+ static void add_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr,

3048

+-				  u64 guest_val, u64 host_val)

3049

++				  u64 guest_val, u64 host_val, bool entry_only)

3050

+ {

3051

+-	unsigned i;

3052

++	int i, j = 0;

3053

+ 	struct msr_autoload *m = &vmx->msr_autoload;

3054

+

3055

+ 	switch (msr) {

3056

+@@ -2053,24 +2215,31 @@ static void add_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr,

3057

+ 		wrmsrl(MSR_IA32_PEBS_ENABLE, 0);

3058

+ 	}

3059

+

3060

+-	for (i = 0; i < m->nr; ++i)

3061

+-		if (m->guest[i].index == msr)

3062

+-			break;

3063

++	i = find_msr(&m->guest, msr);

3064

++	if (!entry_only)

3065

++		j = find_msr(&m->host, msr);

3066

+

3067

+-	if (i == NR_AUTOLOAD_MSRS) {

3068

++	if (i == NR_AUTOLOAD_MSRS || j == NR_AUTOLOAD_MSRS) {

3069

+ 		printk_once(KERN_WARNING "Not enough msr switch entries. "

3070

+ 				"Can't add msr %x\n", msr);

3071

+ 		return;

3072

+-	} else if (i == m->nr) {

3073

+-		++m->nr;

3074

+-		vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);

3075

+-		vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);

3076

+ 	}

3077

++	if (i < 0) {

3078

++		i = m->guest.nr++;

3079

++		vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->guest.nr);

3080

++	}

3081

++	m->guest.val[i].index = msr;

3082

++	m->guest.val[i].value = guest_val;

3083

+

3084

+-	m->guest[i].index = msr;

3085

+-	m->guest[i].value = guest_val;

3086

+-	m->host[i].index = msr;

3087

+-	m->host[i].value = host_val;

3088

++	if (entry_only)

3089

++		return;

3090

++

3091

++	if (j < 0) {

3092

++		j = m->host.nr++;

3093

++		vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->host.nr);

3094

++	}

3095

++	m->host.val[j].index = msr;

3096

++	m->host.val[j].value = host_val;

3097

+ }

3098

+

3099

+ static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)

3100

+@@ -2114,7 +2283,7 @@ static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)

3101

+ 			guest_efer &= ~EFER_LME;

3102

+ 		if (guest_efer != host_efer)

3103

+ 			add_atomic_switch_msr(vmx, MSR_EFER,

3104

+-					      guest_efer, host_efer);

3105

++					      guest_efer, host_efer, false);

3106

+ 		return false;

3107

+ 	} else {

3108

+ 		guest_efer &= ~ignore_bits;

3109

+@@ -3266,6 +3435,11 @@ static inline bool vmx_feature_control_msr_valid(struct kvm_vcpu *vcpu,

3110

+ 	return !(val & ~valid_bits);

3111

+ }

3112

+

3113

++static int vmx_get_msr_feature(struct kvm_msr_entry *msr)

3114

++{

3115

++	return 1;

3116

++}

3117

++

3118

+ /*

3119

+  * Reads an msr value (of 'msr_index') into 'pdata'.

3120

+  * Returns 0 on success, non-0 otherwise.

3121

+@@ -3523,7 +3697,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)

3122

+ 		vcpu->arch.ia32_xss = data;

3123

+ 		if (vcpu->arch.ia32_xss != host_xss)

3124

+ 			add_atomic_switch_msr(vmx, MSR_IA32_XSS,

3125

+-				vcpu->arch.ia32_xss, host_xss);

3126

++				vcpu->arch.ia32_xss, host_xss, false);

3127

+ 		else

3128

+ 			clear_atomic_switch_msr(vmx, MSR_IA32_XSS);

3129

+ 		break;

3130

+@@ -5714,9 +5888,9 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)

3131

+

3132

+ 	vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);

3133

+ 	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);

3134

+-	vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));

3135

++	vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host.val));

3136

+ 	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);

3137

+-	vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest));

3138

++	vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest.val));

3139

+

3140

+ 	if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)

3141

+ 		vmcs_write64(GUEST_IA32_PAT, vmx->vcpu.arch.pat);

3142

+@@ -5736,8 +5910,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)

3143

+ 		++vmx->nmsrs;

3144

+ 	}

3145

+

3146

+-	if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))

3147

+-		rdmsrl(MSR_IA32_ARCH_CAPABILITIES, vmx->arch_capabilities);

3148

++	vmx->arch_capabilities = kvm_get_arch_capabilities();

3149

+

3150

+ 	vm_exit_controls_init(vmx, vmcs_config.vmexit_ctrl);

3151

+

3152

+@@ -5770,6 +5943,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+ 	vmx->rmode.vm86_active = 0;
+ 	vmx->spec_ctrl = 0;
+
++	vcpu->arch.microcode_version = 0x100000000ULL;
+ 	vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
+ 	kvm_set_cr8(vcpu, 0);
+
+@@ -8987,6 +9161,79 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
+ 	}
+ }
+
++/*
++ * Software based L1D cache flush which is used when microcode providing
++ * the cache control MSR is not loaded.
++ *
++ * The L1D cache is 32 KiB on Nehalem and later microarchitectures, but to
++ * flush it is required to read in 64 KiB because the replacement algorithm
++ * is not exactly LRU. This could be sized at runtime via topology
++ * information but as all relevant affected CPUs have 32KiB L1D cache size
++ * there is no point in doing so.
++ */
++#define L1D_CACHE_ORDER 4
++static void *vmx_l1d_flush_pages;
++
++static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
++{
++	int size = PAGE_SIZE << L1D_CACHE_ORDER;
++
++	/*
++	 * This code is only executed when the the flush mode is 'cond' or
++	 * 'always'
++	 */
++	if (static_branch_likely(&vmx_l1d_flush_cond)) {
++		bool flush_l1d;
++
++		/*
++		 * Clear the per-vcpu flush bit, it gets set again
++		 * either from vcpu_run() or from one of the unsafe
++		 * VMEXIT handlers.
++		 */
++		flush_l1d = vcpu->arch.l1tf_flush_l1d;
++		vcpu->arch.l1tf_flush_l1d = false;
++
++		/*
++		 * Clear the per-cpu flush bit, it gets set again from
++		 * the interrupt handlers.
++		 */
++		flush_l1d |= kvm_get_cpu_l1tf_flush_l1d();
++		kvm_clear_cpu_l1tf_flush_l1d();
++
++		if (!flush_l1d)
++			return;
++	}
++
++	vcpu->stat.l1d_flush++;
++
++	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
++		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
++		return;
++	}
++
++	asm volatile(
++		/* First ensure the pages are in the TLB */
++		"xorl	%%eax, %%eax\n"
++		".Lpopulate_tlb:\n\t"
++		"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
++		"addl	$4096, %%eax\n\t"
++		"cmpl	%%eax, %[size]\n\t"
++		"jne	.Lpopulate_tlb\n\t"
++		"xorl	%%eax, %%eax\n\t"
++		"cpuid\n\t"
++		/* Now fill the cache */
++		"xorl	%%eax, %%eax\n"
++		".Lfill_cache:\n"
++		"movzbl	(%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
++		"addl	$64, %%eax\n\t"
++		"cmpl	%%eax, %[size]\n\t"
++		"jne	.Lfill_cache\n\t"
++		"lfence\n"
++		:: [flush_pages] "r" (vmx_l1d_flush_pages),
++		    [size] "r" (size)
++		: "eax", "ebx", "ecx", "edx");
++}
++
+ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
+ {
+ 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+@@ -9390,7 +9637,7 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
+ 			clear_atomic_switch_msr(vmx, msrs[i].msr);
+ 		else
+ 			add_atomic_switch_msr(vmx, msrs[i].msr, msrs[i].guest,
+-					msrs[i].host);
++					msrs[i].host, false);
+ }
+
+ static void vmx_arm_hv_timer(struct kvm_vcpu *vcpu)
+@@ -9483,6 +9730,9 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
+
+ 	vmx->__launched = vmx->loaded_vmcs->launched;
+
++	if (static_branch_unlikely(&vmx_l1d_should_flush))
++		vmx_l1d_flush(vcpu);
++
+ 	asm(
+ 		/* Store host registers */
+ 		"push %%" _ASM_DX "; push %%" _ASM_BP ";"
+@@ -9835,6 +10085,37 @@ free_vcpu:
+ 	return ERR_PTR(err);
+ }
+
++#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html for details.\n"
++#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html for details.\n"
++
++static int vmx_vm_init(struct kvm *kvm)
++{
++	if (boot_cpu_has(X86_BUG_L1TF) && enable_ept) {
++		switch (l1tf_mitigation) {
++		case L1TF_MITIGATION_OFF:
++		case L1TF_MITIGATION_FLUSH_NOWARN:
++			/* 'I explicitly don't care' is set */
++			break;
++		case L1TF_MITIGATION_FLUSH:
++		case L1TF_MITIGATION_FLUSH_NOSMT:
++		case L1TF_MITIGATION_FULL:
++			/*
++			 * Warn upon starting the first VM in a potentially
++			 * insecure environment.
++			 */
++			if (cpu_smt_control == CPU_SMT_ENABLED)
++				pr_warn_once(L1TF_MSG_SMT);
++			if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_NEVER)
++				pr_warn_once(L1TF_MSG_L1D);
++			break;
++		case L1TF_MITIGATION_FULL_FORCE:
++			/* Flush is enforced */
++			break;
++		}
++	}
++	return 0;
++}
++
+ static void __init vmx_check_processor_compat(void *rtn)
+ {
+ 	struct vmcs_config vmcs_conf;
+@@ -10774,10 +11055,10 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
+ 	 * Set the MSR load/store lists to match L0's settings.
+ 	 */
+ 	vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
+-	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
+-	vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
+-	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
+-	vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest));
++	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr);
++	vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host.val));
++	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr);
++	vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest.val));
+
+ 	/*
+ 	 * HOST_RSP is normally set correctly in vmx_vcpu_run() just before
+@@ -11202,6 +11483,9 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
+ 	if (ret)
+ 		return ret;
+
++	/* Hide L1D cache contents from the nested guest.  */
++	vmx->vcpu.arch.l1tf_flush_l1d = true;
++
+ 	/*
+ 	 * If we're entering a halted L2 vcpu and the L2 vcpu won't be woken
+ 	 * by event injection, halt vcpu.
+@@ -11712,8 +11996,8 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
+ 	vmx_segment_cache_clear(vmx);
+
+ 	/* Update any VMCS fields that might have changed while L2 ran */
+-	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
+-	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
++	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr);
++	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr);
+ 	vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset);
+ 	if (vmx->hv_deadline_tsc == -1)
+ 		vmcs_clear_bits(PIN_BASED_VM_EXEC_CONTROL,
+@@ -12225,6 +12509,8 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
+ 	.cpu_has_accelerated_tpr = report_flexpriority,
+ 	.has_emulated_msr = vmx_has_emulated_msr,
+
++	.vm_init = vmx_vm_init,
++
+ 	.vcpu_create = vmx_create_vcpu,
+ 	.vcpu_free = vmx_free_vcpu,
+ 	.vcpu_reset = vmx_vcpu_reset,
+@@ -12234,6 +12520,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
+ 	.vcpu_put = vmx_vcpu_put,
+
+ 	.update_bp_intercept = update_exception_bitmap,
++	.get_msr_feature = vmx_get_msr_feature,
+ 	.get_msr = vmx_get_msr,
+ 	.set_msr = vmx_set_msr,
+ 	.get_segment_base = vmx_get_segment_base,
+@@ -12341,22 +12628,18 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
+ 	.setup_mce = vmx_setup_mce,
+ };
+
+-static int __init vmx_init(void)
++static void vmx_cleanup_l1d_flush(void)
+ {
+-	int r = kvm_init(&vmx_x86_ops, sizeof(struct vcpu_vmx),
+-                     __alignof__(struct vcpu_vmx), THIS_MODULE);
+-	if (r)
+-		return r;
+-
+-#ifdef CONFIG_KEXEC_CORE
+-	rcu_assign_pointer(crash_vmclear_loaded_vmcss,
+-			   crash_vmclear_local_loaded_vmcss);
+-#endif
+-
+-	return 0;
++	if (vmx_l1d_flush_pages) {
++		free_pages((unsigned long)vmx_l1d_flush_pages, L1D_CACHE_ORDER);
++		vmx_l1d_flush_pages = NULL;
++	}
++	/* Restore state so sysfs ignores VMX */
++	l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
+ }
+
+-static void __exit vmx_exit(void)
++
++static void vmx_exit(void)
+ {
+ #ifdef CONFIG_KEXEC_CORE
+ 	RCU_INIT_POINTER(crash_vmclear_loaded_vmcss, NULL);
+@@ -12364,7 +12647,40 @@ static void __exit vmx_exit(void)
+ #endif
+
+ 	kvm_exit();
++
++	vmx_cleanup_l1d_flush();
+ }
++module_exit(vmx_exit)
+
++static int __init vmx_init(void)
++{
++	int r;
++
++	r = kvm_init(&vmx_x86_ops, sizeof(struct vcpu_vmx),
++		     __alignof__(struct vcpu_vmx), THIS_MODULE);
++	if (r)
++		return r;
++
++	/*
++	 * Must be called after kvm_init() so enable_ept is properly set
++	 * up. Hand the parameter mitigation value in which was stored in
++	 * the pre module init parser. If no parameter was given, it will
++	 * contain 'auto' which will be turned into the default 'cond'
++	 * mitigation mode.
++	 */
++	if (boot_cpu_has(X86_BUG_L1TF)) {
++		r = vmx_setup_l1d_flush(vmentry_l1d_flush_param);
++		if (r) {
++			vmx_exit();
++			return r;
++		}
++	}
++
++#ifdef CONFIG_KEXEC_CORE
++	rcu_assign_pointer(crash_vmclear_loaded_vmcss,
++			   crash_vmclear_local_loaded_vmcss);
++#endif
++
++	return 0;
++}
+ module_init(vmx_init)
+-module_exit(vmx_exit)
+diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
+index 2f3fe25639b3..5c2c09f6c1c3 100644
+--- a/arch/x86/kvm/x86.c
++++ b/arch/x86/kvm/x86.c
+@@ -181,6 +181,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
+ 	{ "irq_injections", VCPU_STAT(irq_injections) },
+ 	{ "nmi_injections", VCPU_STAT(nmi_injections) },
+ 	{ "req_event", VCPU_STAT(req_event) },
++	{ "l1d_flush", VCPU_STAT(l1d_flush) },
+ 	{ "mmu_shadow_zapped", VM_STAT(mmu_shadow_zapped) },
+ 	{ "mmu_pte_write", VM_STAT(mmu_pte_write) },
+ 	{ "mmu_pte_updated", VM_STAT(mmu_pte_updated) },
+@@ -1041,6 +1042,71 @@ static u32 emulated_msrs[] = {

3437

+

3438

+ static unsigned num_emulated_msrs;

3439

+

3440

++/*

3441

++ * List of msr numbers which are used to expose MSR-based features that

3442

++ * can be used by a hypervisor to validate requested CPU features.

3443

++ */

3444

++static u32 msr_based_features[] = {

3445

++	MSR_F10H_DECFG,

3446

++	MSR_IA32_UCODE_REV,

3447

++	MSR_IA32_ARCH_CAPABILITIES,

3448

++};

3449

++

3450

++static unsigned int num_msr_based_features;

3451

++

3452

++u64 kvm_get_arch_capabilities(void)

3453

++{

3454

++	u64 data;

3455

++

3456

++	rdmsrl_safe(MSR_IA32_ARCH_CAPABILITIES, &data);

3457

++

3458

++	/*

3459

++	 * If we're doing cache flushes (either "always" or "cond")

3460

++	 * we will do one whenever the guest does a vmlaunch/vmresume.

3461

++	 * If an outer hypervisor is doing the cache flush for us

3462

++	 * (VMENTER_L1D_FLUSH_NESTED_VM), we can safely pass that

3463

++	 * capability to the guest too, and if EPT is disabled we're not

3464

++	 * vulnerable.  Overall, only VMENTER_L1D_FLUSH_NEVER will

3465

++	 * require a nested hypervisor to do a flush of its own.

3466

++	 */

3467

++	if (l1tf_vmx_mitigation != VMENTER_L1D_FLUSH_NEVER)

3468

++		data |= ARCH_CAP_SKIP_VMENTRY_L1DFLUSH;

3469

++

3470

++	return data;

3471

++}

3472

++EXPORT_SYMBOL_GPL(kvm_get_arch_capabilities);

3473

++

3474

++static int kvm_get_msr_feature(struct kvm_msr_entry *msr)

3475

++{

3476

++	switch (msr->index) {

3477

++	case MSR_IA32_ARCH_CAPABILITIES:

3478

++		msr->data = kvm_get_arch_capabilities();

3479

++		break;

3480

++	case MSR_IA32_UCODE_REV:

3481

++		rdmsrl_safe(msr->index, &msr->data);

3482

++		break;

3483

++	default:

3484

++		if (kvm_x86_ops->get_msr_feature(msr))

3485

++			return 1;

3486

++	}

3487

++	return 0;

3488

++}

3489

++

3490

++static int do_get_msr_feature(struct kvm_vcpu *vcpu, unsigned index, u64 *data)

3491

++{

3492

++	struct kvm_msr_entry msr;

3493

++	int r;

3494

++

3495

++	msr.index = index;

3496

++	r = kvm_get_msr_feature(&msr);

3497

++	if (r)

3498

++		return r;

3499

++

3500

++	*data = msr.data;

3501

++

3502

++	return 0;

3503

++}

3504

++

3505

+ bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer)

3506

+ {

3507

+ 	if (efer & efer_reserved_bits)

3508

+@@ -2156,7 +2222,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+
+ 	switch (msr) {
+ 	case MSR_AMD64_NB_CFG:
+-	case MSR_IA32_UCODE_REV:
+ 	case MSR_IA32_UCODE_WRITE:
+ 	case MSR_VM_HSAVE_PA:
+ 	case MSR_AMD64_PATCH_LOADER:
+@@ -2164,6 +2229,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+ 	case MSR_AMD64_DC_CFG:
+ 		break;
+
++	case MSR_IA32_UCODE_REV:
++		if (msr_info->host_initiated)
++			vcpu->arch.microcode_version = data;
++		break;
+ 	case MSR_EFER:
+ 		return set_efer(vcpu, data);
+ 	case MSR_K7_HWCR:
+@@ -2450,7 +2519,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+ 		msr_info->data = 0;
+ 		break;
+ 	case MSR_IA32_UCODE_REV:
+-		msr_info->data = 0x100000000ULL;
++		msr_info->data = vcpu->arch.microcode_version;
+ 		break;
+ 	case MSR_MTRRcap:
+ 	case 0x200 ... 0x2ff:
+@@ -2600,13 +2669,11 @@ static int __msr_io(struct kvm_vcpu *vcpu, struct kvm_msrs *msrs,
+ 		    int (*do_msr)(struct kvm_vcpu *vcpu,
+ 				  unsigned index, u64 *data))
+ {
+-	int i, idx;
++	int i;
+
+-	idx = srcu_read_lock(&vcpu->kvm->srcu);
+ 	for (i = 0; i < msrs->nmsrs; ++i)
+ 		if (do_msr(vcpu, entries[i].index, &entries[i].data))
+ 			break;
+-	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
+ 	return i;
+ }
+@@ -2705,6 +2772,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)

3552

+ 	case KVM_CAP_SET_BOOT_CPU_ID:

3553

+  	case KVM_CAP_SPLIT_IRQCHIP:

3554

+ 	case KVM_CAP_IMMEDIATE_EXIT:

3555

++	case KVM_CAP_GET_MSR_FEATURES:

3556

+ 		r = 1;

3557

+ 		break;

3558

+ 	case KVM_CAP_ADJUST_CLOCK:

3559

+@@ -2819,6 +2887,31 @@ long kvm_arch_dev_ioctl(struct file *filp,

3560

+ 			goto out;

3561

+ 		r = 0;

3562

+ 		break;

3563

++	case KVM_GET_MSR_FEATURE_INDEX_LIST: {

3564

++		struct kvm_msr_list __user *user_msr_list = argp;

3565

++		struct kvm_msr_list msr_list;

3566

++		unsigned int n;

3567

++

3568

++		r = -EFAULT;

3569

++		if (copy_from_user(&msr_list, user_msr_list, sizeof(msr_list)))

3570

++			goto out;

3571

++		n = msr_list.nmsrs;

3572

++		msr_list.nmsrs = num_msr_based_features;

3573

++		if (copy_to_user(user_msr_list, &msr_list, sizeof(msr_list)))

3574

++			goto out;

3575

++		r = -E2BIG;

3576

++		if (n < msr_list.nmsrs)

3577

++			goto out;

3578

++		r = -EFAULT;

3579

++		if (copy_to_user(user_msr_list->indices, &msr_based_features,

3580

++				 num_msr_based_features * sizeof(u32)))

3581

++			goto out;

3582

++		r = 0;

3583

++		break;

3584

++	}

3585

++	case KVM_GET_MSRS:

3586

++		r = msr_io(NULL, argp, do_get_msr_feature, 1);

3587

++		break;

3588

+ 	}

3589

+ 	default:

3590

+ 		r = -EINVAL;

3591

+@@ -3553,12 +3646,18 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
+ 		r = 0;
+ 		break;
+ 	}
+-	case KVM_GET_MSRS:
++	case KVM_GET_MSRS: {
++		int idx = srcu_read_lock(&vcpu->kvm->srcu);
+ 		r = msr_io(vcpu, argp, do_get_msr, 1);
++		srcu_read_unlock(&vcpu->kvm->srcu, idx);
+ 		break;
+-	case KVM_SET_MSRS:
++	}
++	case KVM_SET_MSRS: {
++		int idx = srcu_read_lock(&vcpu->kvm->srcu);
+ 		r = msr_io(vcpu, argp, do_set_msr, 0);
++		srcu_read_unlock(&vcpu->kvm->srcu, idx);
+ 		break;
++	}
+ 	case KVM_TPR_ACCESS_REPORTING: {
+ 		struct kvm_tpr_access_ctl tac;
+
+@@ -4333,6 +4432,19 @@ static void kvm_init_msr_list(void)
+ 		j++;
+ 	}
+ 	num_emulated_msrs = j;
++
++	for (i = j = 0; i < ARRAY_SIZE(msr_based_features); i++) {
++		struct kvm_msr_entry msr;
++
++		msr.index = msr_based_features[i];
++		if (kvm_get_msr_feature(&msr))
++			continue;
++
++		if (j < i)
++			msr_based_features[j] = msr_based_features[i];
++		j++;
++	}
++	num_msr_based_features = j;
+ }
+
+ static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len,
+@@ -4573,6 +4685,9 @@ static int emulator_write_std(struct x86_emulate_ctxt *ctxt, gva_t addr, void *v
+ int kvm_write_guest_virt_system(struct kvm_vcpu *vcpu, gva_t addr, void *val,
+ 				unsigned int bytes, struct x86_exception *exception)
+ {
++	/* kvm_write_guest_virt_system can pull in tons of pages. */
++	vcpu->arch.l1tf_flush_l1d = true;
++
+ 	return kvm_write_guest_virt_helper(addr, val, bytes, vcpu,
+ 					   PFERR_WRITE_MASK, exception);
+ }
+@@ -5701,6 +5816,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
+ 	bool writeback = true;
+ 	bool write_fault_to_spt = vcpu->arch.write_fault_to_shadow_pgtable;
+
++	vcpu->arch.l1tf_flush_l1d = true;
++
+ 	/*
+ 	 * Clear write_fault_to_shadow_pgtable here to ensure it is
+ 	 * never reused.
+@@ -7146,6 +7263,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
+ 	struct kvm *kvm = vcpu->kvm;
+
+ 	vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
++	vcpu->arch.l1tf_flush_l1d = true;
+
+ 	for (;;) {
+ 		if (kvm_vcpu_running(vcpu)) {
+@@ -8153,6 +8271,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
+
+ void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
+ {
++	vcpu->arch.l1tf_flush_l1d = true;
+ 	kvm_x86_ops->sched_in(vcpu, cpu);
+ }
+
+diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c

3668

+index 0133d26f16be..c2faff548f59 100644

3669

+--- a/arch/x86/mm/fault.c

3670

++++ b/arch/x86/mm/fault.c

3671

+@@ -24,6 +24,7 @@

3672

+ #include <asm/vsyscall.h>		/* emulate_vsyscall		*/

3673

+ #include <asm/vm86.h>			/* struct vm86			*/

3674

+ #include <asm/mmu_context.h>		/* vma_pkey()			*/

3675

++#include <asm/sections.h>

3676

+

3677

+ #define CREATE_TRACE_POINTS

3678

+ #include <asm/trace/exceptions.h>

3679

+diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c

3680

+index 071cbbbb60d9..37f60dfd7e4e 100644

3681

+--- a/arch/x86/mm/init.c

3682

++++ b/arch/x86/mm/init.c

3683

+@@ -4,6 +4,8 @@

3684

+ #include <linux/swap.h>

3685

+ #include <linux/memblock.h>

3686

+ #include <linux/bootmem.h>	/* for max_low_pfn */

3687

++#include <linux/swapfile.h>

3688

++#include <linux/swapops.h>

3689

+

3690

+ #include <asm/set_memory.h>

3691

+ #include <asm/e820/api.h>

3692

+@@ -880,3 +882,26 @@ void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache)
+ 	__cachemode2pte_tbl[cache] = __cm_idx2pte(entry);
+ 	__pte2cachemode_tbl[entry] = cache;
+ }
++
++#ifdef CONFIG_SWAP
++unsigned long max_swapfile_size(void)
++{
++	unsigned long pages;
++
++	pages = generic_max_swapfile_size();
++
++	if (boot_cpu_has_bug(X86_BUG_L1TF)) {
++		/* Limit the swap file size to MAX_PA/2 for L1TF workaround */
++		unsigned long l1tf_limit = l1tf_pfn_limit() + 1;
++		/*
++		 * We encode swap offsets also with 3 bits below those for pfn
++		 * which makes the usable limit higher.
++		 */
++#if CONFIG_PGTABLE_LEVELS > 2
++		l1tf_limit <<= PAGE_SHIFT - SWP_OFFSET_FIRST_BIT;
++#endif
++		pages = min_t(unsigned long, l1tf_limit, pages);
++	}
++	return pages;
++}
++#endif
+diff --git a/arch/x86/mm/kmmio.c b/arch/x86/mm/kmmio.c
+index 7c8686709636..79eb55ce69a9 100644
+--- a/arch/x86/mm/kmmio.c
++++ b/arch/x86/mm/kmmio.c
+@@ -126,24 +126,29 @@ static struct kmmio_fault_page *get_kmmio_fault_page(unsigned long addr)
+
+ static void clear_pmd_presence(pmd_t *pmd, bool clear, pmdval_t *old)
+ {
++	pmd_t new_pmd;
+ 	pmdval_t v = pmd_val(*pmd);
+ 	if (clear) {
+-		*old = v & _PAGE_PRESENT;
+-		v &= ~_PAGE_PRESENT;
+-	} else	/* presume this has been called with clear==true previously */
+-		v |= *old;
+-	set_pmd(pmd, __pmd(v));
++		*old = v;
++		new_pmd = pmd_mknotpresent(*pmd);
++	} else {
++		/* Presume this has been called with clear==true previously */
++		new_pmd = __pmd(*old);
++	}
++	set_pmd(pmd, new_pmd);
+ }
+
+ static void clear_pte_presence(pte_t *pte, bool clear, pteval_t *old)
+ {
+ 	pteval_t v = pte_val(*pte);
+ 	if (clear) {
+-		*old = v & _PAGE_PRESENT;
+-		v &= ~_PAGE_PRESENT;
+-	} else	/* presume this has been called with clear==true previously */
+-		v |= *old;
+-	set_pte_atomic(pte, __pte(v));
++		*old = v;
++		/* Nothing should care about address */
++		pte_clear(&init_mm, 0, pte);
++	} else {
++		/* Presume this has been called with clear==true previously */
++		set_pte_atomic(pte, __pte(*old));
++	}
+ }
+
+ static int clear_page_presence(struct kmmio_fault_page *f, bool clear)
+diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
+index a99679826846..5f4805d69aab 100644
+--- a/arch/x86/mm/mmap.c
++++ b/arch/x86/mm/mmap.c
+@@ -174,3 +174,24 @@ const char *arch_vma_name(struct vm_area_struct *vma)
+ 		return "[mpx]";
+ 	return NULL;
+ }
++
++/*
++ * Only allow root to set high MMIO mappings to PROT_NONE.
++ * This prevents an unpriv. user to set them to PROT_NONE and invert
++ * them, then pointing to valid memory for L1TF speculation.
++ *
++ * Note: for locked down kernels may want to disable the root override.
++ */
++bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
++{
++	if (!boot_cpu_has_bug(X86_BUG_L1TF))
++		return true;
++	if (!__pte_needs_invert(pgprot_val(prot)))
++		return true;
++	/* If it's real memory always allow */
++	if (pfn_valid(pfn))
++		return true;
++	if (pfn > l1tf_pfn_limit() && !capable(CAP_SYS_ADMIN))
++		return false;
++	return true;
++}
+diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
+index 4085897fef64..464f53da3a6f 100644
+--- a/arch/x86/mm/pageattr.c
++++ b/arch/x86/mm/pageattr.c
+@@ -1006,8 +1006,8 @@ static long populate_pmd(struct cpa_data *cpa,
+
+ 		pmd = pmd_offset(pud, start);
+
+-		set_pmd(pmd, __pmd(cpa->pfn << PAGE_SHIFT | _PAGE_PSE |
+-				   massage_pgprot(pmd_pgprot)));
++		set_pmd(pmd, pmd_mkhuge(pfn_pmd(cpa->pfn,
++					canon_pgprot(pmd_pgprot))));
+
+ 		start	  += PMD_SIZE;
+ 		cpa->pfn  += PMD_SIZE >> PAGE_SHIFT;
+@@ -1079,8 +1079,8 @@ static int populate_pud(struct cpa_data *cpa, unsigned long start, p4d_t *p4d,
+ 	 * Map everything starting from the Gb boundary, possibly with 1G pages
+ 	 */
+ 	while (boot_cpu_has(X86_FEATURE_GBPAGES) && end - start >= PUD_SIZE) {
+-		set_pud(pud, __pud(cpa->pfn << PAGE_SHIFT | _PAGE_PSE |
+-				   massage_pgprot(pud_pgprot)));
++		set_pud(pud, pud_mkhuge(pfn_pud(cpa->pfn,
++				   canon_pgprot(pud_pgprot))));
+
+ 		start	  += PUD_SIZE;
+ 		cpa->pfn  += PUD_SIZE >> PAGE_SHIFT;
+diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c

3819

+index ce38f165489b..d6f11accd37a 100644

3820

+--- a/arch/x86/mm/pti.c

3821

++++ b/arch/x86/mm/pti.c

3822

+@@ -45,6 +45,7 @@

3823

+ #include <asm/pgalloc.h>

3824

+ #include <asm/tlbflush.h>

3825

+ #include <asm/desc.h>

3826

++#include <asm/sections.h>

3827

+

3828

+ #undef pr_fmt

3829

+ #define pr_fmt(fmt)     "Kernel/User page tables isolation: " fmt

3830

+diff --git a/arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c b/arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c

3831

+index 4f5fa65a1011..2acd6be13375 100644

3832

+--- a/arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c

3833

++++ b/arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c

3834

+@@ -18,6 +18,7 @@

3835

+ #include <asm/intel-mid.h>

3836

+ #include <asm/intel_scu_ipc.h>

3837

+ #include <asm/io_apic.h>

3838

++#include <asm/hw_irq.h>

3839

+

3840

+ #define TANGIER_EXT_TIMER0_MSI 12

3841

+

3842

+diff --git a/arch/x86/platform/uv/tlb_uv.c b/arch/x86/platform/uv/tlb_uv.c
+index 0b530c53de1f..34f9a9ce6236 100644
+--- a/arch/x86/platform/uv/tlb_uv.c
++++ b/arch/x86/platform/uv/tlb_uv.c
+@@ -1285,6 +1285,7 @@ void uv_bau_message_interrupt(struct pt_regs *regs)
+ 	struct msg_desc msgdesc;
+
+ 	ack_APIC_irq();
++	kvm_set_cpu_l1tf_flush_l1d();
+ 	time_start = get_cycles();
+
+ 	bcp = &per_cpu(bau_control, smp_processor_id());
+diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c

3855

+index c9081c6671f0..df208af3cd74 100644

3856

+--- a/arch/x86/xen/enlighten.c

3857

++++ b/arch/x86/xen/enlighten.c

3858

+@@ -3,6 +3,7 @@

3859

+ #endif

3860

+ #include <linux/cpu.h>

3861

+ #include <linux/kexec.h>

3862

++#include <linux/slab.h>

3863

+

3864

+ #include <xen/features.h>

3865

+ #include <xen/page.h>

3866

+diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
+index 433f14bcab15..93758b528d8f 100644
+--- a/drivers/base/cpu.c
++++ b/drivers/base/cpu.c
+@@ -527,16 +527,24 @@ ssize_t __weak cpu_show_spec_store_bypass(struct device *dev,
+ 	return sprintf(buf, "Not affected\n");
+ }
+
++ssize_t __weak cpu_show_l1tf(struct device *dev,
++			     struct device_attribute *attr, char *buf)
++{
++	return sprintf(buf, "Not affected\n");
++}
++
+ static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
+ static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
+ static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);
+ static DEVICE_ATTR(spec_store_bypass, 0444, cpu_show_spec_store_bypass, NULL);
++static DEVICE_ATTR(l1tf, 0444, cpu_show_l1tf, NULL);
+
+ static struct attribute *cpu_root_vulnerabilities_attrs[] = {
+ 	&dev_attr_meltdown.attr,
+ 	&dev_attr_spectre_v1.attr,
+ 	&dev_attr_spectre_v2.attr,
+ 	&dev_attr_spec_store_bypass.attr,
++	&dev_attr_l1tf.attr,
+ 	NULL
+ };
+
+diff --git a/drivers/bluetooth/hci_ldisc.c b/drivers/bluetooth/hci_ldisc.c
+index 6aef3bde10d7..c823914b3a80 100644
+--- a/drivers/bluetooth/hci_ldisc.c
++++ b/drivers/bluetooth/hci_ldisc.c
+@@ -115,12 +115,12 @@ static inline struct sk_buff *hci_uart_dequeue(struct hci_uart *hu)
+ 	struct sk_buff *skb = hu->tx_skb;
+
+ 	if (!skb) {
+-		read_lock(&hu->proto_lock);
++		percpu_down_read(&hu->proto_lock);
+
+ 		if (test_bit(HCI_UART_PROTO_READY, &hu->flags))
+ 			skb = hu->proto->dequeue(hu);
+
+-		read_unlock(&hu->proto_lock);
++		percpu_up_read(&hu->proto_lock);
+ 	} else {
+ 		hu->tx_skb = NULL;
+ 	}
+@@ -130,7 +130,14 @@ static inline struct sk_buff *hci_uart_dequeue(struct hci_uart *hu)
+
+ int hci_uart_tx_wakeup(struct hci_uart *hu)
+ {
+-	read_lock(&hu->proto_lock);
++	/* This may be called in an IRQ context, so we can't sleep. Therefore
++	 * we try to acquire the lock only, and if that fails we assume the
++	 * tty is being closed because that is the only time the write lock is
++	 * acquired. If, however, at some point in the future the write lock
++	 * is also acquired in other situations, then this must be revisited.
++	 */
++	if (!percpu_down_read_trylock(&hu->proto_lock))
++		return 0;
+
+ 	if (!test_bit(HCI_UART_PROTO_READY, &hu->flags))
+ 		goto no_schedule;
+@@ -145,7 +152,7 @@ int hci_uart_tx_wakeup(struct hci_uart *hu)
+ 	schedule_work(&hu->write_work);
+
+ no_schedule:
+-	read_unlock(&hu->proto_lock);
++	percpu_up_read(&hu->proto_lock);
+
+ 	return 0;
+ }
+@@ -247,12 +254,12 @@ static int hci_uart_flush(struct hci_dev *hdev)
+ 	tty_ldisc_flush(tty);
+ 	tty_driver_flush_buffer(tty);
+
+-	read_lock(&hu->proto_lock);
++	percpu_down_read(&hu->proto_lock);
+
+ 	if (test_bit(HCI_UART_PROTO_READY, &hu->flags))
+ 		hu->proto->flush(hu);
+
+-	read_unlock(&hu->proto_lock);
++	percpu_up_read(&hu->proto_lock);
+
+ 	return 0;
+ }
+@@ -275,15 +282,15 @@ static int hci_uart_send_frame(struct hci_dev *hdev, struct sk_buff *skb)
+ 	BT_DBG("%s: type %d len %d", hdev->name, hci_skb_pkt_type(skb),
+ 	       skb->len);
+
+-	read_lock(&hu->proto_lock);
++	percpu_down_read(&hu->proto_lock);
+
+ 	if (!test_bit(HCI_UART_PROTO_READY, &hu->flags)) {
+-		read_unlock(&hu->proto_lock);
++		percpu_up_read(&hu->proto_lock);
+ 		return -EUNATCH;
+ 	}
+
+ 	hu->proto->enqueue(hu, skb);
+-	read_unlock(&hu->proto_lock);
++	percpu_up_read(&hu->proto_lock);
+
+ 	hci_uart_tx_wakeup(hu);
+
+@@ -486,7 +493,7 @@ static int hci_uart_tty_open(struct tty_struct *tty)
+ 	INIT_WORK(&hu->init_ready, hci_uart_init_work);
+ 	INIT_WORK(&hu->write_work, hci_uart_write_work);
+
+-	rwlock_init(&hu->proto_lock);
++	percpu_init_rwsem(&hu->proto_lock);
+
+ 	/* Flush any pending characters in the driver */
+ 	tty_driver_flush_buffer(tty);
+@@ -503,7 +510,6 @@ static void hci_uart_tty_close(struct tty_struct *tty)
+ {
+ 	struct hci_uart *hu = tty->disc_data;
+ 	struct hci_dev *hdev;
+-	unsigned long flags;
+
+ 	BT_DBG("tty %p", tty);
+
+@@ -518,9 +524,9 @@ static void hci_uart_tty_close(struct tty_struct *tty)

3991

+ 		hci_uart_close(hdev);

3992

+

3993

+ 	if (test_bit(HCI_UART_PROTO_READY, &hu->flags)) {

3994

+-		write_lock_irqsave(&hu->proto_lock, flags);

3995

++		percpu_down_write(&hu->proto_lock);

3996

+ 		clear_bit(HCI_UART_PROTO_READY, &hu->flags);

3997

+-		write_unlock_irqrestore(&hu->proto_lock, flags);

3998

++		percpu_up_write(&hu->proto_lock);

3999

+

4000

+ 		cancel_work_sync(&hu->write_work);

4001

+

4002

+@@ -582,10 +588,10 @@ static void hci_uart_tty_receive(struct tty_struct *tty, const u8 *data,

4003

+ 	if (!hu || tty != hu->tty)

4004

+ 		return;

4005

+

4006

+-	read_lock(&hu->proto_lock);

4007

++	percpu_down_read(&hu->proto_lock);

4008

+

4009

+ 	if (!test_bit(HCI_UART_PROTO_READY, &hu->flags)) {

4010

+-		read_unlock(&hu->proto_lock);

4011

++		percpu_up_read(&hu->proto_lock);

4012

+ 		return;

4013

+ 	}

4014

+

4015

+@@ -593,7 +599,7 @@ static void hci_uart_tty_receive(struct tty_struct *tty, const u8 *data,

4016

+ 	 * tty caller

4017

+ 	 */

4018

+ 	hu->proto->recv(hu, data, count);

4019

+-	read_unlock(&hu->proto_lock);

4020

++	percpu_up_read(&hu->proto_lock);

4021

+

4022

+ 	if (hu->hdev)

4023

+ 		hu->hdev->stat.byte_rx += count;

+diff --git a/drivers/bluetooth/hci_serdev.c b/drivers/bluetooth/hci_serdev.c
+index b725ac4f7ff6..52e6d4d1608e 100644
+--- a/drivers/bluetooth/hci_serdev.c
++++ b/drivers/bluetooth/hci_serdev.c
+@@ -304,6 +304,7 @@ int hci_uart_register_device(struct hci_uart *hu,
+ 	hci_set_drvdata(hdev, hu);
+
+ 	INIT_WORK(&hu->write_work, hci_uart_write_work);
++	percpu_init_rwsem(&hu->proto_lock);
+
+ 	/* Only when vendor specific setup callback is provided, consider
+ 	 * the manufacturer information valid. This avoids filling in the
+diff --git a/drivers/bluetooth/hci_uart.h b/drivers/bluetooth/hci_uart.h
+index d9cd95d81149..66e8c68e4607 100644
+--- a/drivers/bluetooth/hci_uart.h
++++ b/drivers/bluetooth/hci_uart.h
+@@ -87,7 +87,7 @@ struct hci_uart {
+ 	struct work_struct	write_work;
+
+ 	const struct hci_uart_proto *proto;
+-	rwlock_t		proto_lock;	/* Stop work for proto close */
++	struct percpu_rw_semaphore proto_lock;	/* Stop work for proto close */
+ 	void			*priv;
+
+ 	struct sk_buff		*tx_skb;
+diff --git a/drivers/gpu/drm/i915/intel_lpe_audio.c b/drivers/gpu/drm/i915/intel_lpe_audio.c
+index 3bf65288ffff..2fdf302ebdad 100644
+--- a/drivers/gpu/drm/i915/intel_lpe_audio.c
++++ b/drivers/gpu/drm/i915/intel_lpe_audio.c
+@@ -62,6 +62,7 @@
+
+ #include <linux/acpi.h>
+ #include <linux/device.h>
++#include <linux/irq.h>
+ #include <linux/pci.h>
+ #include <linux/pm_runtime.h>
+
+diff --git a/drivers/mtd/nand/qcom_nandc.c b/drivers/mtd/nand/qcom_nandc.c
+index 3baddfc997d1..b49ca02b399d 100644
+--- a/drivers/mtd/nand/qcom_nandc.c
++++ b/drivers/mtd/nand/qcom_nandc.c
+@@ -2544,6 +2544,9 @@ static int qcom_nand_host_init(struct qcom_nand_controller *nandc,
+
+ 	nand_set_flash_node(chip, dn);
+ 	mtd->name = devm_kasprintf(dev, GFP_KERNEL, "qcom_nand.%d", host->cs);
++	if (!mtd->name)
++		return -ENOMEM;
++
+ 	mtd->owner = THIS_MODULE;
+ 	mtd->dev.parent = dev;
+
+diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
+index dfc076f9ee4b..d5e790dd589a 100644
+--- a/drivers/net/xen-netfront.c
++++ b/drivers/net/xen-netfront.c
+@@ -894,7 +894,6 @@ static RING_IDX xennet_fill_frags(struct netfront_queue *queue,
+ 				  struct sk_buff *skb,
+ 				  struct sk_buff_head *list)
+ {
+-	struct skb_shared_info *shinfo = skb_shinfo(skb);
+ 	RING_IDX cons = queue->rx.rsp_cons;
+ 	struct sk_buff *nskb;
+
+@@ -903,15 +902,16 @@ static RING_IDX xennet_fill_frags(struct netfront_queue *queue,
+ 			RING_GET_RESPONSE(&queue->rx, ++cons);
+ 		skb_frag_t *nfrag = &skb_shinfo(nskb)->frags[0];
+
+-		if (shinfo->nr_frags == MAX_SKB_FRAGS) {
++		if (skb_shinfo(skb)->nr_frags == MAX_SKB_FRAGS) {
+ 			unsigned int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
+
+ 			BUG_ON(pull_to <= skb_headlen(skb));
+ 			__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
+ 		}
+-		BUG_ON(shinfo->nr_frags >= MAX_SKB_FRAGS);
++		BUG_ON(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS);
+
+-		skb_add_rx_frag(skb, shinfo->nr_frags, skb_frag_page(nfrag),
++		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
++				skb_frag_page(nfrag),
+ 				rx->offset, rx->status, PAGE_SIZE);
+
+ 		skb_shinfo(nskb)->nr_frags = 0;
+diff --git a/drivers/pci/host/pci-hyperv.c b/drivers/pci/host/pci-hyperv.c
+index 4523d7e1bcb9..ffc87a956d97 100644
+--- a/drivers/pci/host/pci-hyperv.c
++++ b/drivers/pci/host/pci-hyperv.c
+@@ -53,6 +53,8 @@
+ #include <linux/delay.h>
+ #include <linux/semaphore.h>
+ #include <linux/irqdomain.h>
++#include <linux/irq.h>
++
+ #include <asm/irqdomain.h>
+ #include <asm/apic.h>
+ #include <linux/msi.h>
+diff --git a/drivers/phy/mediatek/phy-mtk-tphy.c b/drivers/phy/mediatek/phy-mtk-tphy.c
+index 721a2a1c97ef..a63bba12aee4 100644
+--- a/drivers/phy/mediatek/phy-mtk-tphy.c
++++ b/drivers/phy/mediatek/phy-mtk-tphy.c
+@@ -438,9 +438,9 @@ static void u2_phy_instance_init(struct mtk_tphy *tphy,
+ 	u32 index = instance->index;
+ 	u32 tmp;
+
+-	/* switch to USB function. (system register, force ip into usb mode) */
++	/* switch to USB function, and enable usb pll */
+ 	tmp = readl(com + U3P_U2PHYDTM0);
+-	tmp &= ~P2C_FORCE_UART_EN;
++	tmp &= ~(P2C_FORCE_UART_EN | P2C_FORCE_SUSPENDM);
+ 	tmp |= P2C_RG_XCVRSEL_VAL(1) | P2C_RG_DATAIN_VAL(0);
+ 	writel(tmp, com + U3P_U2PHYDTM0);
+
+@@ -500,10 +500,8 @@ static void u2_phy_instance_power_on(struct mtk_tphy *tphy,
+ 	u32 index = instance->index;
+ 	u32 tmp;
+
+-	/* (force_suspendm=0) (let suspendm=1, enable usb 480MHz pll) */
+ 	tmp = readl(com + U3P_U2PHYDTM0);
+-	tmp &= ~(P2C_FORCE_SUSPENDM | P2C_RG_XCVRSEL);
+-	tmp &= ~(P2C_RG_DATAIN | P2C_DTM0_PART_MASK);
++	tmp &= ~(P2C_RG_XCVRSEL | P2C_RG_DATAIN | P2C_DTM0_PART_MASK);
+ 	writel(tmp, com + U3P_U2PHYDTM0);
+
+ 	/* OTG Enable */
+@@ -538,7 +536,6 @@ static void u2_phy_instance_power_off(struct mtk_tphy *tphy,
+
+ 	tmp = readl(com + U3P_U2PHYDTM0);
+ 	tmp &= ~(P2C_RG_XCVRSEL | P2C_RG_DATAIN);
+-	tmp |= P2C_FORCE_SUSPENDM;
+ 	writel(tmp, com + U3P_U2PHYDTM0);
+
+ 	/* OTG Disable */
+@@ -546,18 +543,16 @@ static void u2_phy_instance_power_off(struct mtk_tphy *tphy,
+ 	tmp &= ~PA6_RG_U2_OTG_VBUSCMP_EN;
+ 	writel(tmp, com + U3P_USBPHYACR6);
+
+-	/* let suspendm=0, set utmi into analog power down */
+-	tmp = readl(com + U3P_U2PHYDTM0);
+-	tmp &= ~P2C_RG_SUSPENDM;
+-	writel(tmp, com + U3P_U2PHYDTM0);
+-	udelay(1);
+-
+ 	tmp = readl(com + U3P_U2PHYDTM1);
+ 	tmp &= ~(P2C_RG_VBUSVALID | P2C_RG_AVALID);
+ 	tmp |= P2C_RG_SESSEND;
+ 	writel(tmp, com + U3P_U2PHYDTM1);
+
+ 	if (tphy->pdata->avoid_rx_sen_degradation && index) {
++		tmp = readl(com + U3P_U2PHYDTM0);
++		tmp &= ~(P2C_RG_SUSPENDM | P2C_FORCE_SUSPENDM);
++		writel(tmp, com + U3P_U2PHYDTM0);
++
+ 		tmp = readl(com + U3D_U2PHYDCR0);
+ 		tmp &= ~P2C_RG_SIF_U2PLL_FORCE_ON;
+ 		writel(tmp, com + U3D_U2PHYDCR0);
+diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
+index dd9464920456..ef22b275d050 100644
+--- a/drivers/scsi/hosts.c
++++ b/drivers/scsi/hosts.c
+@@ -474,6 +474,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
+ 		shost->dma_boundary = 0xffffffff;
+
+ 	shost->use_blk_mq = scsi_use_blk_mq;
++	shost->use_blk_mq = scsi_use_blk_mq || shost->hostt->force_blk_mq;
+
+ 	device_initialize(&shost->shost_gendev);
+ 	dev_set_name(&shost->shost_gendev, "host%d", shost->host_no);
+diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
+index 604a39dba5d0..5b4b7f9be2d7 100644
+--- a/drivers/scsi/hpsa.c
++++ b/drivers/scsi/hpsa.c
+@@ -1040,11 +1040,7 @@ static void set_performant_mode(struct ctlr_info *h, struct CommandList *c,
+ 		c->busaddr |= 1 | (h->blockFetchTable[c->Header.SGList] << 1);
+ 		if (unlikely(!h->msix_vectors))
+ 			return;
+-		if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
+-			c->Header.ReplyQueue =
+-				raw_smp_processor_id() % h->nreply_queues;
+-		else
+-			c->Header.ReplyQueue = reply_queue % h->nreply_queues;
++		c->Header.ReplyQueue = reply_queue;
+ 	}
+ }
+
+@@ -1058,10 +1054,7 @@ static void set_ioaccel1_performant_mode(struct ctlr_info *h,
+ 	 * Tell the controller to post the reply to the queue for this
+ 	 * processor.  This seems to give the best I/O throughput.
+ 	 */
+-	if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
+-		cp->ReplyQueue = smp_processor_id() % h->nreply_queues;
+-	else
+-		cp->ReplyQueue = reply_queue % h->nreply_queues;
++	cp->ReplyQueue = reply_queue;
+ 	/*
+ 	 * Set the bits in the address sent down to include:
+ 	 *  - performant mode bit (bit 0)
+@@ -1082,10 +1075,7 @@ static void set_ioaccel2_tmf_performant_mode(struct ctlr_info *h,
+ 	/* Tell the controller to post the reply to the queue for this
+ 	 * processor.  This seems to give the best I/O throughput.
+ 	 */
+-	if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
+-		cp->reply_queue = smp_processor_id() % h->nreply_queues;
+-	else
+-		cp->reply_queue = reply_queue % h->nreply_queues;
++	cp->reply_queue = reply_queue;
+ 	/* Set the bits in the address sent down to include:
+ 	 *  - performant mode bit not used in ioaccel mode 2
+ 	 *  - pull count (bits 0-3)
+@@ -1104,10 +1094,7 @@ static void set_ioaccel2_performant_mode(struct ctlr_info *h,
+ 	 * Tell the controller to post the reply to the queue for this
+ 	 * processor.  This seems to give the best I/O throughput.
+ 	 */
+-	if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
+-		cp->reply_queue = smp_processor_id() % h->nreply_queues;
+-	else
+-		cp->reply_queue = reply_queue % h->nreply_queues;
++	cp->reply_queue = reply_queue;
+ 	/*
+ 	 * Set the bits in the address sent down to include:
+ 	 *  - performant mode bit not used in ioaccel mode 2
+@@ -1152,6 +1139,8 @@ static void __enqueue_cmd_and_start_io(struct ctlr_info *h,
+ {
+ 	dial_down_lockup_detection_during_fw_flash(h, c);
+ 	atomic_inc(&h->commands_outstanding);
++
++	reply_queue = h->reply_map[raw_smp_processor_id()];
+ 	switch (c->cmd_type) {
+ 	case CMD_IOACCEL1:
+ 		set_ioaccel1_performant_mode(h, c, reply_queue);
+@@ -7244,6 +7233,26 @@ static void hpsa_disable_interrupt_mode(struct ctlr_info *h)
+ 	h->msix_vectors = 0;
+ }
+
++static void hpsa_setup_reply_map(struct ctlr_info *h)
++{
++	const struct cpumask *mask;
++	unsigned int queue, cpu;
++
++	for (queue = 0; queue < h->msix_vectors; queue++) {
++		mask = pci_irq_get_affinity(h->pdev, queue);
++		if (!mask)
++			goto fallback;
++
++		for_each_cpu(cpu, mask)
++			h->reply_map[cpu] = queue;
++	}
++	return;
++
++fallback:
++	for_each_possible_cpu(cpu)
++		h->reply_map[cpu] = 0;
++}
++
+ /* If MSI/MSI-X is supported by the kernel we will try to enable it on
+  * controllers that are capable. If not, we use legacy INTx mode.
+  */
+@@ -7639,6 +7648,10 @@ static int hpsa_pci_init(struct ctlr_info *h)
+ 	err = hpsa_interrupt_mode(h);
+ 	if (err)
+ 		goto clean1;
++
++	/* setup mapping between CPU and reply queue */
++	hpsa_setup_reply_map(h);
++
+ 	err = hpsa_pci_find_memory_BAR(h->pdev, &h->paddr);
+ 	if (err)
+ 		goto clean2;	/* intmode+region, pci */
+@@ -8284,6 +8297,28 @@ static struct workqueue_struct *hpsa_create_controller_wq(struct ctlr_info *h,
+ 	return wq;
+ }
+
++static void hpda_free_ctlr_info(struct ctlr_info *h)
++{
++	kfree(h->reply_map);
++	kfree(h);
++}
++
++static struct ctlr_info *hpda_alloc_ctlr_info(void)
++{
++	struct ctlr_info *h;
++
++	h = kzalloc(sizeof(*h), GFP_KERNEL);
++	if (!h)
++		return NULL;
++
++	h->reply_map = kzalloc(sizeof(*h->reply_map) * nr_cpu_ids, GFP_KERNEL);
++	if (!h->reply_map) {
++		kfree(h);
++		return NULL;
++	}
++	return h;
++}
++
+ static int hpsa_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
+ {
+ 	int dac, rc;
+@@ -8321,7 +8356,7 @@ reinit_after_soft_reset:
+ 	 * the driver.  See comments in hpsa.h for more info.
+ 	 */
+ 	BUILD_BUG_ON(sizeof(struct CommandList) % COMMANDLIST_ALIGNMENT);
+-	h = kzalloc(sizeof(*h), GFP_KERNEL);
++	h = hpda_alloc_ctlr_info();
+ 	if (!h) {
+ 		dev_err(&pdev->dev, "Failed to allocate controller head\n");
+ 		return -ENOMEM;
+@@ -8726,7 +8761,7 @@ static void hpsa_remove_one(struct pci_dev *pdev)
+ 	h->lockup_detected = NULL;			/* init_one 2 */
+ 	/* (void) pci_disable_pcie_error_reporting(pdev); */	/* init_one 1 */
+
+-	kfree(h);					/* init_one 1 */
++	hpda_free_ctlr_info(h);				/* init_one 1 */
+ }
+
+ static int hpsa_suspend(__attribute__((unused)) struct pci_dev *pdev,
+diff --git a/drivers/scsi/hpsa.h b/drivers/scsi/hpsa.h
+index 018f980a701c..fb9f5e7f8209 100644
+--- a/drivers/scsi/hpsa.h
++++ b/drivers/scsi/hpsa.h
+@@ -158,6 +158,7 @@ struct bmic_controller_parameters {
+ #pragma pack()
+
+ struct ctlr_info {
++	unsigned int *reply_map;
+ 	int	ctlr;
+ 	char	devname[8];
+ 	char    *product_name;
+diff --git a/drivers/scsi/qla2xxx/qla_iocb.c b/drivers/scsi/qla2xxx/qla_iocb.c
+index 63bea6a65d51..8d579bf0fc81 100644
+--- a/drivers/scsi/qla2xxx/qla_iocb.c
++++ b/drivers/scsi/qla2xxx/qla_iocb.c
+@@ -2128,34 +2128,11 @@ __qla2x00_alloc_iocbs(struct qla_qpair *qpair, srb_t *sp)
+ 	req_cnt = 1;
+ 	handle = 0;
+
+-	if (!sp)
+-		goto skip_cmd_array;
+-
+-	/* Check for room in outstanding command list. */
+-	handle = req->current_outstanding_cmd;
+-	for (index = 1; index < req->num_outstanding_cmds; index++) {
+-		handle++;
+-		if (handle == req->num_outstanding_cmds)
+-			handle = 1;
+-		if (!req->outstanding_cmds[handle])
+-			break;
+-	}
+-	if (index == req->num_outstanding_cmds) {
+-		ql_log(ql_log_warn, vha, 0x700b,
+-		    "No room on outstanding cmd array.\n");
+-		goto queuing_error;
+-	}
+-
+-	/* Prep command array. */
+-	req->current_outstanding_cmd = handle;
+-	req->outstanding_cmds[handle] = sp;
+-	sp->handle = handle;
+-
+-	/* Adjust entry-counts as needed. */
+-	if (sp->type != SRB_SCSI_CMD)
++	if (sp && (sp->type != SRB_SCSI_CMD)) {
++		/* Adjust entry-counts as needed. */
+ 		req_cnt = sp->iocbs;
++	}
+
+-skip_cmd_array:
+ 	/* Check for room on request queue. */
+ 	if (req->cnt < req_cnt + 2) {
+ 		if (ha->mqenable || IS_QLA83XX(ha) || IS_QLA27XX(ha))
+@@ -2179,6 +2156,28 @@ skip_cmd_array:
+ 	if (req->cnt < req_cnt + 2)
+ 		goto queuing_error;
+
++	if (sp) {
++		/* Check for room in outstanding command list. */
++		handle = req->current_outstanding_cmd;
++		for (index = 1; index < req->num_outstanding_cmds; index++) {
++			handle++;
++			if (handle == req->num_outstanding_cmds)
++				handle = 1;
++			if (!req->outstanding_cmds[handle])
++				break;
++		}
++		if (index == req->num_outstanding_cmds) {
++			ql_log(ql_log_warn, vha, 0x700b,
++			    "No room on outstanding cmd array.\n");
++			goto queuing_error;
++		}
++
++		/* Prep command array. */
++		req->current_outstanding_cmd = handle;
++		req->outstanding_cmds[handle] = sp;
++		sp->handle = handle;
++	}
++
+ 	/* Prep packet */
+ 	req->cnt -= req_cnt;
+ 	pkt = req->ring_ptr;
+@@ -2191,6 +2190,8 @@ skip_cmd_array:
+ 		pkt->handle = handle;
+ 	}
+
++	return pkt;
++
+ queuing_error:
+ 	qpair->tgt_counters.num_alloc_iocb_failed++;
+ 	return pkt;
+diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
+index 3f3cb72e0c0c..d0389b20574d 100644
+--- a/drivers/scsi/sr.c
++++ b/drivers/scsi/sr.c
+@@ -523,18 +523,26 @@ static int sr_init_command(struct scsi_cmnd *SCpnt)
+ static int sr_block_open(struct block_device *bdev, fmode_t mode)
+ {
+ 	struct scsi_cd *cd;
++	struct scsi_device *sdev;
+ 	int ret = -ENXIO;
+
++	cd = scsi_cd_get(bdev->bd_disk);
++	if (!cd)
++		goto out;
++
++	sdev = cd->device;
++	scsi_autopm_get_device(sdev);
+ 	check_disk_change(bdev);
+
+ 	mutex_lock(&sr_mutex);
+-	cd = scsi_cd_get(bdev->bd_disk);
+-	if (cd) {
+-		ret = cdrom_open(&cd->cdi, bdev, mode);
+-		if (ret)
+-			scsi_cd_put(cd);
+-	}
++	ret = cdrom_open(&cd->cdi, bdev, mode);
+ 	mutex_unlock(&sr_mutex);
++
++	scsi_autopm_put_device(sdev);
++	if (ret)
++		scsi_cd_put(cd);
++
++out:
+ 	return ret;
+ }
+
+@@ -562,6 +570,8 @@ static int sr_block_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
+ 	if (ret)
+ 		goto out;
+
++	scsi_autopm_get_device(sdev);
++
+ 	/*
+ 	 * Send SCSI addressing ioctls directly to mid level, send other
+ 	 * ioctls to cdrom/block level.
+@@ -570,15 +580,18 @@ static int sr_block_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
+ 	case SCSI_IOCTL_GET_IDLUN:
+ 	case SCSI_IOCTL_GET_BUS_NUMBER:
+ 		ret = scsi_ioctl(sdev, cmd, argp);
+-		goto out;
++		goto put;
+ 	}
+
+ 	ret = cdrom_ioctl(&cd->cdi, bdev, mode, cmd, arg);
+ 	if (ret != -ENOSYS)
+-		goto out;
++		goto put;
+
+ 	ret = scsi_ioctl(sdev, cmd, argp);
+
++put:
++	scsi_autopm_put_device(sdev);
++
+ out:
+ 	mutex_unlock(&sr_mutex);
+ 	return ret;
+diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
+index 7c28e8d4955a..54e3a0f6844c 100644
+--- a/drivers/scsi/virtio_scsi.c
++++ b/drivers/scsi/virtio_scsi.c
+@@ -91,9 +91,6 @@ struct virtio_scsi_vq {
+ struct virtio_scsi_target_state {
+ 	seqcount_t tgt_seq;
+
+-	/* Count of outstanding requests. */
+-	atomic_t reqs;
+-
+ 	/* Currently active virtqueue for requests sent to this target. */
+ 	struct virtio_scsi_vq *req_vq;
+ };
+@@ -152,8 +149,6 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
+ 	struct virtio_scsi_cmd *cmd = buf;
+ 	struct scsi_cmnd *sc = cmd->sc;
+ 	struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
+-	struct virtio_scsi_target_state *tgt =
+-				scsi_target(sc->device)->hostdata;
+
+ 	dev_dbg(&sc->device->sdev_gendev,
+ 		"cmd %p response %u status %#02x sense_len %u\n",
+@@ -210,8 +205,6 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
+ 	}
+
+ 	sc->scsi_done(sc);
+-
+-	atomic_dec(&tgt->reqs);
+ }
+
+ static void virtscsi_vq_done(struct virtio_scsi *vscsi,
+@@ -580,10 +573,7 @@ static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
+ 					struct scsi_cmnd *sc)
+ {
+ 	struct virtio_scsi *vscsi = shost_priv(sh);
+-	struct virtio_scsi_target_state *tgt =
+-				scsi_target(sc->device)->hostdata;
+
+-	atomic_inc(&tgt->reqs);
+ 	return virtscsi_queuecommand(vscsi, &vscsi->req_vqs[0], sc);
+ }
+
+@@ -596,55 +586,11 @@ static struct virtio_scsi_vq *virtscsi_pick_vq_mq(struct virtio_scsi *vscsi,
+ 	return &vscsi->req_vqs[hwq];
+ }
+
+-static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
+-					       struct virtio_scsi_target_state *tgt)
+-{
+-	struct virtio_scsi_vq *vq;
+-	unsigned long flags;
+-	u32 queue_num;
+-
+-	local_irq_save(flags);
+-	if (atomic_inc_return(&tgt->reqs) > 1) {
+-		unsigned long seq;
+-
+-		do {
+-			seq = read_seqcount_begin(&tgt->tgt_seq);
+-			vq = tgt->req_vq;
+-		} while (read_seqcount_retry(&tgt->tgt_seq, seq));
+-	} else {
+-		/* no writes can be concurrent because of atomic_t */
+-		write_seqcount_begin(&tgt->tgt_seq);
+-
+-		/* keep previous req_vq if a reader just arrived */
+-		if (unlikely(atomic_read(&tgt->reqs) > 1)) {
+-			vq = tgt->req_vq;
+-			goto unlock;
+-		}
+-
+-		queue_num = smp_processor_id();
+-		while (unlikely(queue_num >= vscsi->num_queues))
+-			queue_num -= vscsi->num_queues;
+-		tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
+- unlock:
+-		write_seqcount_end(&tgt->tgt_seq);
+-	}
+-	local_irq_restore(flags);
+-
+-	return vq;
+-}
+-
+ static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
+ 				       struct scsi_cmnd *sc)
+ {
+ 	struct virtio_scsi *vscsi = shost_priv(sh);
+-	struct virtio_scsi_target_state *tgt =
+-				scsi_target(sc->device)->hostdata;
+-	struct virtio_scsi_vq *req_vq;
+-
+-	if (shost_use_blk_mq(sh))
+-		req_vq = virtscsi_pick_vq_mq(vscsi, sc);
+-	else
+-		req_vq = virtscsi_pick_vq(vscsi, tgt);
++	struct virtio_scsi_vq *req_vq = virtscsi_pick_vq_mq(vscsi, sc);
+
+ 	return virtscsi_queuecommand(vscsi, req_vq, sc);
+ }
+@@ -775,7 +721,6 @@ static int virtscsi_target_alloc(struct scsi_target *starget)
+ 		return -ENOMEM;
+
+ 	seqcount_init(&tgt->tgt_seq);
+-	atomic_set(&tgt->reqs, 0);
+ 	tgt->req_vq = &vscsi->req_vqs[0];
+
+ 	starget->hostdata = tgt;
+@@ -823,6 +768,7 @@ static struct scsi_host_template virtscsi_host_template_single = {
+ 	.target_alloc = virtscsi_target_alloc,
+ 	.target_destroy = virtscsi_target_destroy,
+ 	.track_queue_depth = 1,
++	.force_blk_mq = 1,
+ };
+
+ static struct scsi_host_template virtscsi_host_template_multi = {
+@@ -844,6 +790,7 @@ static struct scsi_host_template virtscsi_host_template_multi = {
+ 	.target_destroy = virtscsi_target_destroy,
+ 	.map_queues = virtscsi_map_queues,
+ 	.track_queue_depth = 1,
++	.force_blk_mq = 1,
+ };
+
+ #define virtscsi_config_get(vdev, fld) \
+diff --git a/fs/dcache.c b/fs/dcache.c
+index 5f31a93150d1..8d4935978fec 100644
+--- a/fs/dcache.c
++++ b/fs/dcache.c
+@@ -357,14 +357,11 @@ static void dentry_unlink_inode(struct dentry * dentry)
+ 	__releases(dentry->d_inode->i_lock)
+ {
+ 	struct inode *inode = dentry->d_inode;
+-	bool hashed = !d_unhashed(dentry);
+
+-	if (hashed)
+-		raw_write_seqcount_begin(&dentry->d_seq);
++	raw_write_seqcount_begin(&dentry->d_seq);
+ 	__d_clear_type_and_inode(dentry);
+ 	hlist_del_init(&dentry->d_u.d_alias);
+-	if (hashed)
+-		raw_write_seqcount_end(&dentry->d_seq);
++	raw_write_seqcount_end(&dentry->d_seq);
+ 	spin_unlock(&dentry->d_lock);
+ 	spin_unlock(&inode->i_lock);
+ 	if (!inode->i_nlink)
+@@ -1922,10 +1919,12 @@ struct dentry *d_make_root(struct inode *root_inode)
+
+ 	if (root_inode) {
+ 		res = __d_alloc(root_inode->i_sb, NULL);
+-		if (res)
++		if (res) {
++			res->d_flags |= DCACHE_RCUACCESS;
+ 			d_instantiate(res, root_inode);
+-		else
++		} else {
+ 			iput(root_inode);
++		}
+ 	}
+ 	return res;
+ }
+diff --git a/fs/namespace.c b/fs/namespace.c
+index 1eb3bfd8be5a..9dc146e7b5e0 100644
+--- a/fs/namespace.c
++++ b/fs/namespace.c
+@@ -659,12 +659,21 @@ int __legitimize_mnt(struct vfsmount *bastard, unsigned seq)
+ 		return 0;
+ 	mnt = real_mount(bastard);
+ 	mnt_add_count(mnt, 1);
++	smp_mb();			// see mntput_no_expire()
+ 	if (likely(!read_seqretry(&mount_lock, seq)))
+ 		return 0;
+ 	if (bastard->mnt_flags & MNT_SYNC_UMOUNT) {
+ 		mnt_add_count(mnt, -1);
+ 		return 1;
+ 	}
++	lock_mount_hash();
++	if (unlikely(bastard->mnt_flags & MNT_DOOMED)) {
++		mnt_add_count(mnt, -1);
++		unlock_mount_hash();
++		return 1;
++	}
++	unlock_mount_hash();
++	/* caller will mntput() */
+ 	return -1;
+ }
+
+@@ -1195,12 +1204,27 @@ static DECLARE_DELAYED_WORK(delayed_mntput_work, delayed_mntput);
+ static void mntput_no_expire(struct mount *mnt)
+ {
+ 	rcu_read_lock();
+-	mnt_add_count(mnt, -1);
+-	if (likely(mnt->mnt_ns)) { /* shouldn't be the last one */
++	if (likely(READ_ONCE(mnt->mnt_ns))) {
++		/*
++		 * Since we don't do lock_mount_hash() here,
++		 * ->mnt_ns can change under us.  However, if it's
++		 * non-NULL, then there's a reference that won't
++		 * be dropped until after an RCU delay done after
++		 * turning ->mnt_ns NULL.  So if we observe it
++		 * non-NULL under rcu_read_lock(), the reference
++		 * we are dropping is not the final one.
++		 */
++		mnt_add_count(mnt, -1);
+ 		rcu_read_unlock();
+ 		return;
+ 	}
+ 	lock_mount_hash();
++	/*
++	 * make sure that if __legitimize_mnt() has not seen us grab
++	 * mount_lock, we'll see their refcount increment here.
++	 */
++	smp_mb();
++	mnt_add_count(mnt, -1);
+ 	if (mnt_get_count(mnt)) {
+ 		rcu_read_unlock();
+ 		unlock_mount_hash();
+diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
+index 2142bceaeb75..46a2f5d9aa25 100644
+--- a/include/asm-generic/pgtable.h
++++ b/include/asm-generic/pgtable.h
+@@ -1055,6 +1055,18 @@ int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
+ static inline void init_espfix_bsp(void) { }
+ #endif
+
++#ifndef __HAVE_ARCH_PFN_MODIFY_ALLOWED
++static inline bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
++{
++	return true;
++}
++
++static inline bool arch_has_pfn_modify_check(void)
++{
++	return false;
++}
++#endif /* !_HAVE_ARCH_PFN_MODIFY_ALLOWED */
++
+ #endif /* !__ASSEMBLY__ */
+
+ #ifndef io_remap_pfn_range
+diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h
+index 070f85d92c15..28b76f0894d4 100644
+--- a/include/linux/compiler-clang.h
++++ b/include/linux/compiler-clang.h
+@@ -17,6 +17,9 @@
+  */
+ #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
+
++#undef __no_sanitize_address
++#define __no_sanitize_address __attribute__((no_sanitize("address")))
++
+ /* Clang doesn't have a way to turn it off per-function, yet. */
+ #ifdef __noretpoline
+ #undef __noretpoline
+diff --git a/include/linux/cpu.h b/include/linux/cpu.h
+index 9546bf2fe310..2a378d261914 100644
+--- a/include/linux/cpu.h
++++ b/include/linux/cpu.h
+@@ -30,7 +30,7 @@ struct cpu {
+ };
+
+ extern void boot_cpu_init(void);
+-extern void boot_cpu_state_init(void);
++extern void boot_cpu_hotplug_init(void);
+ extern void cpu_init(void);
+ extern void trap_init(void);
+
+@@ -55,6 +55,8 @@ extern ssize_t cpu_show_spectre_v2(struct device *dev,
+ 				   struct device_attribute *attr, char *buf);
+ extern ssize_t cpu_show_spec_store_bypass(struct device *dev,
+ 					  struct device_attribute *attr, char *buf);
++extern ssize_t cpu_show_l1tf(struct device *dev,
++			     struct device_attribute *attr, char *buf);
+
+ extern __printf(4, 5)
+ struct device *cpu_device_create(struct device *parent, void *drvdata,
+@@ -176,4 +178,23 @@ void cpuhp_report_idle_dead(void);
+ static inline void cpuhp_report_idle_dead(void) { }
+ #endif /* #ifdef CONFIG_HOTPLUG_CPU */
+
++enum cpuhp_smt_control {
++	CPU_SMT_ENABLED,
++	CPU_SMT_DISABLED,
++	CPU_SMT_FORCE_DISABLED,
++	CPU_SMT_NOT_SUPPORTED,
++};
++
++#if defined(CONFIG_SMP) && defined(CONFIG_HOTPLUG_SMT)
++extern enum cpuhp_smt_control cpu_smt_control;
++extern void cpu_smt_disable(bool force);
++extern void cpu_smt_check_topology_early(void);
++extern void cpu_smt_check_topology(void);
++#else
++# define cpu_smt_control		(CPU_SMT_ENABLED)
++static inline void cpu_smt_disable(bool force) { }
++static inline void cpu_smt_check_topology_early(void) { }
++static inline void cpu_smt_check_topology(void) { }
++#endif
++
+ #endif /* _LINUX_CPU_H_ */
+diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
+index 06bd7b096167..e06febf62978 100644
+--- a/include/linux/swapfile.h
++++ b/include/linux/swapfile.h
+@@ -10,5 +10,7 @@ extern spinlock_t swap_lock;
+ extern struct plist_head swap_active_head;
+ extern struct swap_info_struct *swap_info[];
+ extern int try_to_unuse(unsigned int, bool, unsigned long);
++extern unsigned long generic_max_swapfile_size(void);
++extern unsigned long max_swapfile_size(void);
+
+ #endif /* _LINUX_SWAPFILE_H */
4808

+diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h

4809

+index a8b7bf879ced..9c1e4bad6581 100644

4810

+--- a/include/scsi/scsi_host.h

4811

++++ b/include/scsi/scsi_host.h

4812

+@@ -452,6 +452,9 @@ struct scsi_host_template {

4813

+ 	/* True if the controller does not support WRITE SAME */

4814

+ 	unsigned no_write_same:1;

4815

+

4816

++	/* True if the low-level driver supports blk-mq only */

4817

++	unsigned force_blk_mq:1;

4818

++

4819

+ 	/*

4820

+ 	 * Countdown for host blocking with no commands outstanding.

4821

+ 	 */

4822

+diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h

4823

+index 857bad91c454..27c62abb6c9e 100644

4824

+--- a/include/uapi/linux/kvm.h

4825

++++ b/include/uapi/linux/kvm.h

4826

+@@ -761,6 +761,7 @@ struct kvm_ppc_resize_hpt {

4827

+ #define KVM_TRACE_PAUSE           __KVM_DEPRECATED_MAIN_0x07

4828

+ #define KVM_TRACE_DISABLE         __KVM_DEPRECATED_MAIN_0x08

4829

+ #define KVM_GET_EMULATED_CPUID	  _IOWR(KVMIO, 0x09, struct kvm_cpuid2)

4830

++#define KVM_GET_MSR_FEATURE_INDEX_LIST    _IOWR(KVMIO, 0x0a, struct kvm_msr_list)

4831

+

4832

+ /*

4833

+  * Extension capability list.

4834

+@@ -932,6 +933,7 @@ struct kvm_ppc_resize_hpt {

4835

+ #define KVM_CAP_HYPERV_SYNIC2 148

4836

+ #define KVM_CAP_HYPERV_VP_INDEX 149

4837

+ #define KVM_CAP_S390_BPB 152

4838

++#define KVM_CAP_GET_MSR_FEATURES 153

4839

+

4840

+ #ifdef KVM_CAP_IRQ_ROUTING

4841

+

4842

+diff --git a/init/main.c b/init/main.c

4843

+index 0d88f37febcb..c4a45145e102 100644

4844

+--- a/init/main.c

4845

++++ b/init/main.c

4846

+@@ -543,8 +543,8 @@ asmlinkage __visible void __init start_kernel(void)

4847

+ 	setup_command_line(command_line);

4848

+ 	setup_nr_cpu_ids();

4849

+ 	setup_per_cpu_areas();

4850

+-	boot_cpu_state_init();

4851

+ 	smp_prepare_boot_cpu();	/* arch-specific boot-cpu hooks */

4852

++	boot_cpu_hotplug_init();

4853

+

4854

+ 	build_all_zonelists(NULL);

4855

+ 	page_alloc_init();

4856

+diff --git a/kernel/cpu.c b/kernel/cpu.c

4857

+index f21bfa3172d8..8f02f9b6e046 100644

4858

+--- a/kernel/cpu.c

4859

++++ b/kernel/cpu.c

4860

+@@ -60,6 +60,7 @@ struct cpuhp_cpu_state {

4861

+ 	bool			rollback;

4862

+ 	bool			single;

4863

+ 	bool			bringup;

4864

++	bool			booted_once;

4865

+ 	struct hlist_node	*node;

4866

+ 	struct hlist_node	*last;

4867

+ 	enum cpuhp_state	cb_state;

4868

+@@ -346,6 +347,85 @@ void cpu_hotplug_enable(void)

4869

+ EXPORT_SYMBOL_GPL(cpu_hotplug_enable);

4870

+ #endif	/* CONFIG_HOTPLUG_CPU */

4871

+

4872

++#ifdef CONFIG_HOTPLUG_SMT

4873

++enum cpuhp_smt_control cpu_smt_control __read_mostly = CPU_SMT_ENABLED;

4874

++EXPORT_SYMBOL_GPL(cpu_smt_control);

4875

++

4876

++static bool cpu_smt_available __read_mostly;

4877

++

4878

++void __init cpu_smt_disable(bool force)

4879

++{

4880

++	if (cpu_smt_control == CPU_SMT_FORCE_DISABLED ||

4881

++		cpu_smt_control == CPU_SMT_NOT_SUPPORTED)

4882

++		return;

4883

++

4884

++	if (force) {

4885

++		pr_info("SMT: Force disabled\n");

4886

++		cpu_smt_control = CPU_SMT_FORCE_DISABLED;

4887

++	} else {

4888

++		cpu_smt_control = CPU_SMT_DISABLED;

4889

++	}

4890

++}

4891

++

4892

++/*

4893

++ * The decision whether SMT is supported can only be done after the full

4894

++ * CPU identification. Called from architecture code before non boot CPUs

4895

++ * are brought up.

4896

++ */

4897

++void __init cpu_smt_check_topology_early(void)

4898

++{

4899

++	if (!topology_smt_supported())

4900

++		cpu_smt_control = CPU_SMT_NOT_SUPPORTED;

4901

++}

4902

++

4903

++/*

4904

++ * If SMT was disabled by BIOS, detect it here, after the CPUs have been

4905

++ * brought online. This ensures the smt/l1tf sysfs entries are consistent

4906

++ * with reality. cpu_smt_available is set to true during the bringup of non

4907

++ * boot CPUs when a SMT sibling is detected. Note, this may overwrite

4908

++ * cpu_smt_control's previous setting.

4909

++ */

4910

++void __init cpu_smt_check_topology(void)

4911

++{

4912

++	if (!cpu_smt_available)

4913

++		cpu_smt_control = CPU_SMT_NOT_SUPPORTED;

4914

++}

4915

++

4916

++static int __init smt_cmdline_disable(char *str)

4917

++{

4918

++	cpu_smt_disable(str && !strcmp(str, "force"));

4919

++	return 0;

4920

++}

4921

++early_param("nosmt", smt_cmdline_disable);

4922

++

4923

++static inline bool cpu_smt_allowed(unsigned int cpu)

4924

++{

4925

++	if (topology_is_primary_thread(cpu))

4926

++		return true;

4927

++

4928

++	/*

4929

++	 * If the CPU is not a 'primary' thread and the booted_once bit is

4930

++	 * set then the processor has SMT support. Store this information

4931

++	 * for the late check of SMT support in cpu_smt_check_topology().

4932

++	 */

4933

++	if (per_cpu(cpuhp_state, cpu).booted_once)

4934

++		cpu_smt_available = true;

4935

++

4936

++	if (cpu_smt_control == CPU_SMT_ENABLED)

4937

++		return true;

4938

++

4939

++	/*

4940

++	 * On x86 it's required to boot all logical CPUs at least once so

4941

++	 * that the init code can get a chance to set CR4.MCE on each

4942

++	 * CPU. Otherwise, a broadacasted MCE observing CR4.MCE=0b on any

4943

++	 * core will shutdown the machine.

4944

++	 */

4945

++	return !per_cpu(cpuhp_state, cpu).booted_once;

4946

++}

4947

++#else

4948

++static inline bool cpu_smt_allowed(unsigned int cpu) { return true; }

4949

++#endif

4950

++

4951

+ static inline enum cpuhp_state

4952

+ cpuhp_set_state(struct cpuhp_cpu_state *st, enum cpuhp_state target)

4953

+ {

4954

+@@ -426,6 +506,16 @@ static int bringup_wait_for_ap(unsigned int cpu)

4955

+ 	stop_machine_unpark(cpu);

4956

+ 	kthread_unpark(st->thread);

4957

+

4958

++	/*

4959

++	 * SMT soft disabling on X86 requires to bring the CPU out of the

4960

++	 * BIOS 'wait for SIPI' state in order to set the CR4.MCE bit.  The

4961

++	 * CPU marked itself as booted_once in cpu_notify_starting() so the

4962

++	 * cpu_smt_allowed() check will now return false if this is not the

4963

++	 * primary sibling.

4964

++	 */

4965

++	if (!cpu_smt_allowed(cpu))

4966

++		return -ECANCELED;

4967

++

4968

+ 	if (st->target <= CPUHP_AP_ONLINE_IDLE)

4969

+ 		return 0;

4970

+

4971

+@@ -758,7 +848,6 @@ static int takedown_cpu(unsigned int cpu)

4972

+

4973

+ 	/* Park the smpboot threads */

4974

+ 	kthread_park(per_cpu_ptr(&cpuhp_state, cpu)->thread);

4975

+-	smpboot_park_threads(cpu);

4976

+

4977

+ 	/*

4978

+ 	 * Prevent irq alloc/free while the dying cpu reorganizes the

4979

+@@ -911,20 +1000,19 @@ out:

4980

+ 	return ret;

4981

+ }

4982

+

4983

++static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)

4984

++{

4985

++	if (cpu_hotplug_disabled)

4986

++		return -EBUSY;

4987

++	return _cpu_down(cpu, 0, target);

4988

++}

4989

++

4990

+ static int do_cpu_down(unsigned int cpu, enum cpuhp_state target)

4991

+ {

4992

+ 	int err;

4993

+

4994

+ 	cpu_maps_update_begin();

4995

+-

4996

+-	if (cpu_hotplug_disabled) {

4997

+-		err = -EBUSY;

4998

+-		goto out;

4999

+-	}

5000

+-

5001

+-	err = _cpu_down(cpu, 0, target);

5002

+-

5003

+-out:

5004

++	err = cpu_down_maps_locked(cpu, target);

5005

+ 	cpu_maps_update_done();

5006

+ 	return err;

5007

+ }

5008

+@@ -953,6 +1041,7 @@ void notify_cpu_starting(unsigned int cpu)

5009

+ 	int ret;

5010

+

5011

+ 	rcu_cpu_starting(cpu);	/* Enables RCU usage on this CPU. */

5012

++	st->booted_once = true;

5013

+ 	while (st->state < target) {

5014

+ 		st->state++;

5015

+ 		ret = cpuhp_invoke_callback(cpu, st->state, true, NULL, NULL);

5016

+@@ -1062,6 +1151,10 @@ static int do_cpu_up(unsigned int cpu, enum cpuhp_state target)

5017

+ 		err = -EBUSY;

5018

+ 		goto out;

5019

+ 	}

5020

++	if (!cpu_smt_allowed(cpu)) {

5021

++		err = -EPERM;

5022

++		goto out;

5023

++	}

5024

+

5025

+ 	err = _cpu_up(cpu, 0, target);

5026

+ out:

5027

+@@ -1344,7 +1437,7 @@ static struct cpuhp_step cpuhp_ap_states[] = {

5028

+ 	[CPUHP_AP_SMPBOOT_THREADS] = {

5029

+ 		.name			= "smpboot/threads:online",

5030

+ 		.startup.single		= smpboot_unpark_threads,

5031

+-		.teardown.single	= NULL,

5032

++		.teardown.single	= smpboot_park_threads,

5033

+ 	},

5034

+ 	[CPUHP_AP_IRQ_AFFINITY_ONLINE] = {

5035

+ 		.name			= "irq/affinity:online",

5036

+@@ -1918,10 +2011,172 @@ static const struct attribute_group cpuhp_cpu_root_attr_group = {

5037

+ 	NULL

5038

+ };

5039

+

5040

++#ifdef CONFIG_HOTPLUG_SMT

5041

++

5042

++static const char *smt_states[] = {

5043

++	[CPU_SMT_ENABLED]		= "on",

5044

++	[CPU_SMT_DISABLED]		= "off",

5045

++	[CPU_SMT_FORCE_DISABLED]	= "forceoff",

5046

++	[CPU_SMT_NOT_SUPPORTED]		= "notsupported",

5047

++};

5048

++

5049

++static ssize_t

5050

++show_smt_control(struct device *dev, struct device_attribute *attr, char *buf)

5051

++{

5052

++	return snprintf(buf, PAGE_SIZE - 2, "%s\n", smt_states[cpu_smt_control]);

5053

++}

5054

++

5055

++static void cpuhp_offline_cpu_device(unsigned int cpu)

5056

++{

5057

++	struct device *dev = get_cpu_device(cpu);

5058

++

5059

++	dev->offline = true;

5060

++	/* Tell user space about the state change */

5061

++	kobject_uevent(&dev->kobj, KOBJ_OFFLINE);

5062

++}

5063

++

5064

++static void cpuhp_online_cpu_device(unsigned int cpu)

5065

++{

5066

++	struct device *dev = get_cpu_device(cpu);

5067

++

5068

++	dev->offline = false;

5069

++	/* Tell user space about the state change */

5070

++	kobject_uevent(&dev->kobj, KOBJ_ONLINE);

5071

++}

5072

++

5073

++static int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)

5074

++{

5075

++	int cpu, ret = 0;

5076

++

5077

++	cpu_maps_update_begin();

5078

++	for_each_online_cpu(cpu) {

5079

++		if (topology_is_primary_thread(cpu))

5080

++			continue;

5081

++		ret = cpu_down_maps_locked(cpu, CPUHP_OFFLINE);

5082

++		if (ret)

5083

++			break;

5084

++		/*

5085

++		 * As this needs to hold the cpu maps lock it's impossible

5086

++		 * to call device_offline() because that ends up calling

5087

++		 * cpu_down() which takes cpu maps lock. cpu maps lock

5088

++		 * needs to be held as this might race against in kernel

5089

++		 * abusers of the hotplug machinery (thermal management).

5090

++		 *

5091

++		 * So nothing would update device:offline state. That would

5092

++		 * leave the sysfs entry stale and prevent onlining after

5093

++		 * smt control has been changed to 'off' again. This is

5094

++		 * called under the sysfs hotplug lock, so it is properly

5095

++		 * serialized against the regular offline usage.

5096

++		 */

5097

++		cpuhp_offline_cpu_device(cpu);

5098

++	}

5099

++	if (!ret)

5100

++		cpu_smt_control = ctrlval;

5101

++	cpu_maps_update_done();

5102

++	return ret;

5103

++}

5104

++

5105

++static int cpuhp_smt_enable(void)

5106

++{

5107

++	int cpu, ret = 0;

5108

++

5109

++	cpu_maps_update_begin();

5110

++	cpu_smt_control = CPU_SMT_ENABLED;

5111

++	for_each_present_cpu(cpu) {

5112

++		/* Skip online CPUs and CPUs on offline nodes */

5113

++		if (cpu_online(cpu) || !node_online(cpu_to_node(cpu)))

5114

++			continue;

5115

++		ret = _cpu_up(cpu, 0, CPUHP_ONLINE);

5116

++		if (ret)

5117

++			break;

5118

++		/* See comment in cpuhp_smt_disable() */

5119

++		cpuhp_online_cpu_device(cpu);

5120

++	}

5121

++	cpu_maps_update_done();

5122

++	return ret;

5123

++}

5124

++

5125

++static ssize_t

5126

++store_smt_control(struct device *dev, struct device_attribute *attr,

5127

++		  const char *buf, size_t count)

5128

++{

5129

++	int ctrlval, ret;

5130

++

5131

++	if (sysfs_streq(buf, "on"))

5132

++		ctrlval = CPU_SMT_ENABLED;

5133

++	else if (sysfs_streq(buf, "off"))

5134

++		ctrlval = CPU_SMT_DISABLED;

5135

++	else if (sysfs_streq(buf, "forceoff"))

5136

++		ctrlval = CPU_SMT_FORCE_DISABLED;

5137

++	else

5138

++		return -EINVAL;

5139

++

5140

++	if (cpu_smt_control == CPU_SMT_FORCE_DISABLED)

5141

++		return -EPERM;

5142

++

5143

++	if (cpu_smt_control == CPU_SMT_NOT_SUPPORTED)

5144

++		return -ENODEV;

5145

++

5146

++	ret = lock_device_hotplug_sysfs();

5147

++	if (ret)

5148

++		return ret;

5149

++

5150

++	if (ctrlval != cpu_smt_control) {

5151

++		switch (ctrlval) {

5152

++		case CPU_SMT_ENABLED:

5153

++			ret = cpuhp_smt_enable();

5154

++			break;

5155

++		case CPU_SMT_DISABLED:

5156

++		case CPU_SMT_FORCE_DISABLED:

5157

++			ret = cpuhp_smt_disable(ctrlval);

5158

++			break;

5159

++		}

5160

++	}

5161

++

5162

++	unlock_device_hotplug();

5163

++	return ret ? ret : count;

5164

++}

5165

++static DEVICE_ATTR(control, 0644, show_smt_control, store_smt_control);

5166

++

5167

++static ssize_t

5168

++show_smt_active(struct device *dev, struct device_attribute *attr, char *buf)

5169

++{

5170

++	bool active = topology_max_smt_threads() > 1;

5171

++

5172

++	return snprintf(buf, PAGE_SIZE - 2, "%d\n", active);

5173

++}

5174

++static DEVICE_ATTR(active, 0444, show_smt_active, NULL);

5175

++

5176

++static struct attribute *cpuhp_smt_attrs[] = {

5177

++	&dev_attr_control.attr,

5178

++	&dev_attr_active.attr,

5179

++	NULL

5180

++};

5181

++

5182

++static const struct attribute_group cpuhp_smt_attr_group = {

5183

++	.attrs = cpuhp_smt_attrs,

5184

++	.name = "smt",

5185

++	NULL

5186

++};

5187

++

5188

++static int __init cpu_smt_state_init(void)

5189

++{

5190

++	return sysfs_create_group(&cpu_subsys.dev_root->kobj,

5191

++				  &cpuhp_smt_attr_group);

5192

++}

5193

++

5194

++#else

5195

++static inline int cpu_smt_state_init(void) { return 0; }

5196

++#endif

5197

++

5198

+ static int __init cpuhp_sysfs_init(void)

5199

+ {

5200

+ 	int cpu, ret;

5201

+

5202

++	ret = cpu_smt_state_init();

5203

++	if (ret)

5204

++		return ret;

5205

++

5206

+ 	ret = sysfs_create_group(&cpu_subsys.dev_root->kobj,

5207

+ 				 &cpuhp_cpu_root_attr_group);

5208

+ 	if (ret)

5209

+@@ -2022,7 +2277,10 @@ void __init boot_cpu_init(void)

5210

+ /*

5211

+  * Must be called _AFTER_ setting up the per_cpu areas

5212

+  */

5213

+-void __init boot_cpu_state_init(void)

5214

++void __init boot_cpu_hotplug_init(void)

5215

+ {

5216

+-	per_cpu_ptr(&cpuhp_state, smp_processor_id())->state = CPUHP_ONLINE;

5217

++#ifdef CONFIG_SMP

5218

++	this_cpu_write(cpuhp_state.booted_once, true);

5219

++#endif

5220

++	this_cpu_write(cpuhp_state.state, CPUHP_ONLINE);

5221

+ }

5222

+diff --git a/kernel/sched/core.c b/kernel/sched/core.c

5223

+index 31615d1ae44c..4e89ed8a0fb2 100644

5224

+--- a/kernel/sched/core.c

5225

++++ b/kernel/sched/core.c

5226

+@@ -5615,6 +5615,18 @@ int sched_cpu_activate(unsigned int cpu)

5227

+ 	struct rq *rq = cpu_rq(cpu);

5228

+ 	struct rq_flags rf;

5229

+

5230

++#ifdef CONFIG_SCHED_SMT

5231

++	/*

5232

++	 * The sched_smt_present static key needs to be evaluated on every

5233

++	 * hotplug event because at boot time SMT might be disabled when

5234

++	 * the number of booted CPUs is limited.

5235

++	 *

5236

++	 * If then later a sibling gets hotplugged, then the key would stay

5237

++	 * off and SMT scheduling would never be functional.

5238

++	 */

5239

++	if (cpumask_weight(cpu_smt_mask(cpu)) > 1)

5240

++		static_branch_enable_cpuslocked(&sched_smt_present);

5241

++#endif

5242

+ 	set_cpu_active(cpu, true);

5243

+

5244

+ 	if (sched_smp_initialized) {

5245

+@@ -5710,22 +5722,6 @@ int sched_cpu_dying(unsigned int cpu)

5246

+ }

5247

+ #endif

5248

+

5249

+-#ifdef CONFIG_SCHED_SMT

5250

+-DEFINE_STATIC_KEY_FALSE(sched_smt_present);

5251

+-

5252

+-static void sched_init_smt(void)

5253

+-{

5254

+-	/*

5255

+-	 * We've enumerated all CPUs and will assume that if any CPU

5256

+-	 * has SMT siblings, CPU0 will too.

5257

+-	 */

5258

+-	if (cpumask_weight(cpu_smt_mask(0)) > 1)

5259

+-		static_branch_enable(&sched_smt_present);

5260

+-}

5261

+-#else

5262

+-static inline void sched_init_smt(void) { }

5263

+-#endif

5264

+-

5265

+ void __init sched_init_smp(void)

5266

+ {

5267

+ 	cpumask_var_t non_isolated_cpus;

5268

+@@ -5755,8 +5751,6 @@ void __init sched_init_smp(void)

5269

+ 	init_sched_rt_class();

5270

+ 	init_sched_dl_class();

5271

+

5272

+-	sched_init_smt();

5273

+-

5274

+ 	sched_smp_initialized = true;

5275

+ }

5276

+

5277

+diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

5278

+index 5c09ddf8c832..0cc7098c6dfd 100644

5279

+--- a/kernel/sched/fair.c

5280

++++ b/kernel/sched/fair.c

5281

+@@ -5631,6 +5631,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)

5282

+ }

5283

+

5284

+ #ifdef CONFIG_SCHED_SMT

5285

++DEFINE_STATIC_KEY_FALSE(sched_smt_present);

5286

+

5287

+ static inline void set_idle_cores(int cpu, int val)

5288

+ {

5289

+diff --git a/kernel/smp.c b/kernel/smp.c

5290

+index c94dd85c8d41..2d1da290f144 100644

5291

+--- a/kernel/smp.c

5292

++++ b/kernel/smp.c

5293

+@@ -584,6 +584,8 @@ void __init smp_init(void)

5294

+ 		num_nodes, (num_nodes > 1 ? "s" : ""),

5295

+ 		num_cpus,  (num_cpus  > 1 ? "s" : ""));

5296

+

5297

++	/* Final decision about SMT support */

5298

++	cpu_smt_check_topology();

5299

+ 	/* Any cleanup work */

5300

+ 	smp_cpus_done(setup_max_cpus);

5301

+ }

5302

+diff --git a/kernel/softirq.c b/kernel/softirq.c

5303

+index f40ac7191257..a4c87cf27f9d 100644

5304

+--- a/kernel/softirq.c

5305

++++ b/kernel/softirq.c

5306

+@@ -79,12 +79,16 @@ static void wakeup_softirqd(void)

5307

+

5308

+ /*

5309

+  * If ksoftirqd is scheduled, we do not want to process pending softirqs

5310

+- * right now. Let ksoftirqd handle this at its own rate, to get fairness.

5311

++ * right now. Let ksoftirqd handle this at its own rate, to get fairness,

5312

++ * unless we're doing some of the synchronous softirqs.

5313

+  */

5314

+-static bool ksoftirqd_running(void)

5315

++#define SOFTIRQ_NOW_MASK ((1 << HI_SOFTIRQ) | (1 << TASKLET_SOFTIRQ))

5316

++static bool ksoftirqd_running(unsigned long pending)

5317

+ {

5318

+ 	struct task_struct *tsk = __this_cpu_read(ksoftirqd);

5319

+

5320

++	if (pending & SOFTIRQ_NOW_MASK)

5321

++		return false;

5322

+ 	return tsk && (tsk->state == TASK_RUNNING);

5323

+ }

5324

+

5325

+@@ -324,7 +328,7 @@ asmlinkage __visible void do_softirq(void)

5326

+

5327

+ 	pending = local_softirq_pending();

5328

+

5329

+-	if (pending && !ksoftirqd_running())

5330

++	if (pending && !ksoftirqd_running(pending))

5331

+ 		do_softirq_own_stack();

5332

+

5333

+ 	local_irq_restore(flags);

5334

+@@ -351,7 +355,7 @@ void irq_enter(void)

5335

+

5336

+ static inline void invoke_softirq(void)

5337

+ {

5338

+-	if (ksoftirqd_running())

5339

++	if (ksoftirqd_running(local_softirq_pending()))

5340

+ 		return;

5341

+

5342

+ 	if (!force_irqthreads) {

5343

+diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c

5344

+index 1ff523dae6e2..e190d1ef3a23 100644

5345

+--- a/kernel/stop_machine.c

5346

++++ b/kernel/stop_machine.c

5347

+@@ -260,6 +260,15 @@ retry:

5348

+ 	err = 0;

5349

+ 	__cpu_stop_queue_work(stopper1, work1, &wakeq);

5350

+ 	__cpu_stop_queue_work(stopper2, work2, &wakeq);

5351

++	/*

5352

++	 * The waking up of stopper threads has to happen

5353

++	 * in the same scheduling context as the queueing.

5354

++	 * Otherwise, there is a possibility of one of the

5355

++	 * above stoppers being woken up by another CPU,

5356

++	 * and preempting us. This will cause us to n ot

5357

++	 * wake up the other stopper forever.

5358

++	 */

5359

++	preempt_disable();

5360

+ unlock:

5361

+ 	raw_spin_unlock(&stopper2->lock);

5362

+ 	raw_spin_unlock_irq(&stopper1->lock);

5363

+@@ -271,7 +280,6 @@ unlock:

5364

+ 	}

5365

+

5366

+ 	if (!err) {

5367

+-		preempt_disable();

5368

+ 		wake_up_q(&wakeq);

5369

+ 		preempt_enable();

5370

+ 	}

5371

+diff --git a/mm/memory.c b/mm/memory.c

5372

+index fc7779165dcf..5539b1975091 100644

5373

+--- a/mm/memory.c

5374

++++ b/mm/memory.c

5375

+@@ -1887,6 +1887,9 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,

5376

+ 	if (addr < vma->vm_start || addr >= vma->vm_end)

5377

+ 		return -EFAULT;

5378

+

5379

++	if (!pfn_modify_allowed(pfn, pgprot))

5380

++		return -EACCES;

5381

++

5382

+ 	track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));

5383

+

5384

+ 	ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,

5385

+@@ -1908,6 +1911,9 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,

5386

+

5387

+ 	track_pfn_insert(vma, &pgprot, pfn);

5388

+

5389

++	if (!pfn_modify_allowed(pfn_t_to_pfn(pfn), pgprot))

5390

++		return -EACCES;

5391

++

5392

+ 	/*

5393

+ 	 * If we don't have pte special, then we have to use the pfn_valid()

5394

+ 	 * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*

5395

+@@ -1955,6 +1961,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,

5396

+ {

5397

+ 	pte_t *pte;

5398

+ 	spinlock_t *ptl;

5399

++	int err = 0;

5400

+

5401

+ 	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);

5402

+ 	if (!pte)

5403

+@@ -1962,12 +1969,16 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,

5404

+ 	arch_enter_lazy_mmu_mode();

5405

+ 	do {

5406

+ 		BUG_ON(!pte_none(*pte));

5407

++		if (!pfn_modify_allowed(pfn, prot)) {

5408

++			err = -EACCES;

5409

++			break;

5410

++		}

5411

+ 		set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));

5412

+ 		pfn++;

5413

+ 	} while (pte++, addr += PAGE_SIZE, addr != end);

5414

+ 	arch_leave_lazy_mmu_mode();

5415

+ 	pte_unmap_unlock(pte - 1, ptl);

5416

+-	return 0;

5417

++	return err;

5418

+ }

5419

+

5420

+ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,

5421

+@@ -1976,6 +1987,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,

5422

+ {

5423

+ 	pmd_t *pmd;

5424

+ 	unsigned long next;

5425

++	int err;

5426

+

5427

+ 	pfn -= addr >> PAGE_SHIFT;

5428

+ 	pmd = pmd_alloc(mm, pud, addr);

5429

+@@ -1984,9 +1996,10 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,

5430

+ 	VM_BUG_ON(pmd_trans_huge(*pmd));

5431

+ 	do {

5432

+ 		next = pmd_addr_end(addr, end);

5433

+-		if (remap_pte_range(mm, pmd, addr, next,

5434

+-				pfn + (addr >> PAGE_SHIFT), prot))

5435

+-			return -ENOMEM;

5436

++		err = remap_pte_range(mm, pmd, addr, next,

5437

++				pfn + (addr >> PAGE_SHIFT), prot);

5438

++		if (err)

5439

++			return err;

5440

+ 	} while (pmd++, addr = next, addr != end);

5441

+ 	return 0;

5442

+ }

5443

+@@ -1997,6 +2010,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,

5444

+ {

5445

+ 	pud_t *pud;

5446

+ 	unsigned long next;

5447

++	int err;

5448

+

5449

+ 	pfn -= addr >> PAGE_SHIFT;

5450

+ 	pud = pud_alloc(mm, p4d, addr);

5451

+@@ -2004,9 +2018,10 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,

5452

+ 		return -ENOMEM;

5453

+ 	do {

5454

+ 		next = pud_addr_end(addr, end);

5455

+-		if (remap_pmd_range(mm, pud, addr, next,

5456

+-				pfn + (addr >> PAGE_SHIFT), prot))

5457

+-			return -ENOMEM;

5458

++		err = remap_pmd_range(mm, pud, addr, next,

5459

++				pfn + (addr >> PAGE_SHIFT), prot);

5460

++		if (err)

5461

++			return err;

5462

+ 	} while (pud++, addr = next, addr != end);

5463

+ 	return 0;

5464

+ }

5465

+@@ -2017,6 +2032,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,

5466

+ {

5467

+ 	p4d_t *p4d;

5468

+ 	unsigned long next;

5469

++	int err;

5470

+

5471

+ 	pfn -= addr >> PAGE_SHIFT;

5472

+ 	p4d = p4d_alloc(mm, pgd, addr);

5473

+@@ -2024,9 +2040,10 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,

5474

+ 		return -ENOMEM;

5475

+ 	do {

5476

+ 		next = p4d_addr_end(addr, end);

5477

+-		if (remap_pud_range(mm, p4d, addr, next,

5478

+-				pfn + (addr >> PAGE_SHIFT), prot))

5479

+-			return -ENOMEM;

5480

++		err = remap_pud_range(mm, p4d, addr, next,

5481

++				pfn + (addr >> PAGE_SHIFT), prot);

5482

++		if (err)

5483

++			return err;

5484

+ 	} while (p4d++, addr = next, addr != end);

5485

+ 	return 0;

5486

+ }

5487

+diff --git a/mm/mprotect.c b/mm/mprotect.c

5488

+index 58b629bb70de..60864e19421e 100644

5489

+--- a/mm/mprotect.c

5490

++++ b/mm/mprotect.c

5491

+@@ -292,6 +292,42 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,

5492

+ 	return pages;

5493

+ }

5494

+

5495

++static int prot_none_pte_entry(pte_t *pte, unsigned long addr,

5496

++			       unsigned long next, struct mm_walk *walk)

5497

++{

5498

++	return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?

5499

++		0 : -EACCES;

5500

++}

5501

++

5502

++static int prot_none_hugetlb_entry(pte_t *pte, unsigned long hmask,

5503

++				   unsigned long addr, unsigned long next,

5504

++				   struct mm_walk *walk)

5505

++{

5506

++	return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?

5507

++		0 : -EACCES;

5508

++}

5509

++

5510

++static int prot_none_test(unsigned long addr, unsigned long next,

5511

++			  struct mm_walk *walk)

5512

++{

5513

++	return 0;

5514

++}

5515

++

5516

++static int prot_none_walk(struct vm_area_struct *vma, unsigned long start,

5517

++			   unsigned long end, unsigned long newflags)

5518

++{

5519

++	pgprot_t new_pgprot = vm_get_page_prot(newflags);

5520

++	struct mm_walk prot_none_walk = {

5521

++		.pte_entry = prot_none_pte_entry,

5522

++		.hugetlb_entry = prot_none_hugetlb_entry,

5523

++		.test_walk = prot_none_test,

5524

++		.mm = current->mm,

5525

++		.private = &new_pgprot,

5526

++	};

5527

++

5528

++	return walk_page_range(start, end, &prot_none_walk);

5529

++}

5530

++

5531

+ int

5532

+ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,

5533

+ 	unsigned long start, unsigned long end, unsigned long newflags)

5534

+@@ -309,6 +345,19 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,

5535

+ 		return 0;

5536

+ 	}

5537

+

5538

++	/*

5539

++	 * Do PROT_NONE PFN permission checks here when we can still

5540

++	 * bail out without undoing a lot of state. This is a rather

5541

++	 * uncommon case, so doesn't need to be very optimized.

5542

++	 */

5543

++	if (arch_has_pfn_modify_check() &&

5544

++	    (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&

5545

++	    (newflags & (VM_READ|VM_WRITE|VM_EXEC)) == 0) {

5546

++		error = prot_none_walk(vma, start, end, newflags);

5547

++		if (error)

5548

++			return error;

5549

++	}

5550

++

5551

+ 	/*

5552

+ 	 * If we make a private mapping writable we increase our commit;

5553

+ 	 * but (without finer accounting) cannot reduce our commit if we

5554

+diff --git a/mm/swapfile.c b/mm/swapfile.c

5555

+index 03d2ce288d83..8cbc7d6fd52e 100644

5556

+--- a/mm/swapfile.c

5557

++++ b/mm/swapfile.c

5558

+@@ -2902,6 +2902,35 @@ static int claim_swapfile(struct swap_info_struct *p, struct inode *inode)

5559

+ 	return 0;

5560

+ }

5561

+

5562

++

5563

++/*

5564

++ * Find out how many pages are allowed for a single swap device. There

5565

++ * are two limiting factors:

5566

++ * 1) the number of bits for the swap offset in the swp_entry_t type, and

5567

++ * 2) the number of bits in the swap pte, as defined by the different

5568

++ * architectures.

5569

++ *

5570

++ * In order to find the largest possible bit mask, a swap entry with

5571

++ * swap type 0 and swap offset ~0UL is created, encoded to a swap pte,

5572

++ * decoded to a swp_entry_t again, and finally the swap offset is

5573

++ * extracted.

5574

++ *

5575

++ * This will mask all the bits from the initial ~0UL mask that can't

5576

++ * be encoded in either the swp_entry_t or the architecture definition

5577

++ * of a swap pte.

5578

++ */

5579

++unsigned long generic_max_swapfile_size(void)

5580

++{

5581

++	return swp_offset(pte_to_swp_entry(

5582

++			swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;

5583

++}

5584

++

5585

++/* Can be overridden by an architecture for additional checks. */

5586

++__weak unsigned long max_swapfile_size(void)

5587

++{

5588

++	return generic_max_swapfile_size();

5589

++}

5590

++

5591

+ static unsigned long read_swap_header(struct swap_info_struct *p,

5592

+ 					union swap_header *swap_header,

5593

+ 					struct inode *inode)

5594

+@@ -2937,22 +2966,7 @@ static unsigned long read_swap_header(struct swap_info_struct *p,

5595

+ 	p->cluster_next = 1;

5596

+ 	p->cluster_nr = 0;

5597

+

5598

+-	/*

5599

+-	 * Find out how many pages are allowed for a single swap

5600

+-	 * device. There are two limiting factors: 1) the number

5601

+-	 * of bits for the swap offset in the swp_entry_t type, and

5602

+-	 * 2) the number of bits in the swap pte as defined by the

5603

+-	 * different architectures. In order to find the

5604

+-	 * largest possible bit mask, a swap entry with swap type 0

5605

+-	 * and swap offset ~0UL is created, encoded to a swap pte,

5606

+-	 * decoded to a swp_entry_t again, and finally the swap

5607

+-	 * offset is extracted. This will mask all the bits from

5608

+-	 * the initial ~0UL mask that can't be encoded in either

5609

+-	 * the swp_entry_t or the architecture definition of a

5610

+-	 * swap pte.

5611

+-	 */

5612

+-	maxpages = swp_offset(pte_to_swp_entry(

5613

+-			swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;

5614

++	maxpages = max_swapfile_size();

5615

+ 	last_page = swap_header->info.last_page;

5616

+ 	if (!last_page) {

5617

+ 		pr_warn("Empty swap-file\n");

5618

+diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h

5619

+index 403e97d5e243..8418462298e7 100644

5620

+--- a/tools/arch/x86/include/asm/cpufeatures.h

5621

++++ b/tools/arch/x86/include/asm/cpufeatures.h

5622

+@@ -219,6 +219,7 @@

5623

+ #define X86_FEATURE_IBPB		( 7*32+26) /* Indirect Branch Prediction Barrier */

5624

+ #define X86_FEATURE_STIBP		( 7*32+27) /* Single Thread Indirect Branch Predictors */

5625

+ #define X86_FEATURE_ZEN			( 7*32+28) /* "" CPU is AMD family 0x17 (Zen) */

5626

++#define X86_FEATURE_L1TF_PTEINV		( 7*32+29) /* "" L1TF workaround PTE inversion */

5627

+

5628

+ /* Virtualization flags: Linux defined, word 8 */

5629

+ #define X86_FEATURE_TPR_SHADOW		( 8*32+ 0) /* Intel TPR Shadow */

5630

+@@ -338,6 +339,7 @@

5631

+ #define X86_FEATURE_PCONFIG		(18*32+18) /* Intel PCONFIG */

5632

+ #define X86_FEATURE_SPEC_CTRL		(18*32+26) /* "" Speculation Control (IBRS + IBPB) */

5633

+ #define X86_FEATURE_INTEL_STIBP		(18*32+27) /* "" Single Thread Indirect Branch Predictors */

5634

++#define X86_FEATURE_FLUSH_L1D		(18*32+28) /* Flush L1D cache */

5635

+ #define X86_FEATURE_ARCH_CAPABILITIES	(18*32+29) /* IA32_ARCH_CAPABILITIES MSR (Intel) */

5636

+ #define X86_FEATURE_SPEC_CTRL_SSBD	(18*32+31) /* "" Speculative Store Bypass Disable */

5637

+

5638

+@@ -370,5 +372,6 @@

5639

+ #define X86_BUG_SPECTRE_V1		X86_BUG(15) /* CPU is affected by Spectre variant 1 attack with conditional branches */

5640

+ #define X86_BUG_SPECTRE_V2		X86_BUG(16) /* CPU is affected by Spectre variant 2 attack with indirect branches */

5641

+ #define X86_BUG_SPEC_STORE_BYPASS	X86_BUG(17) /* CPU is affected by speculative store bypass attack */

5642

++#define X86_BUG_L1TF			X86_BUG(18) /* CPU is affected by L1 Terminal Fault */

5643

+

5644

+ #endif /* _ASM_X86_CPUFEATURES_H */
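
Note on generic_max_swapfile_size() in the mm/swapfile.c hunk above: the comment there describes a round trip (build a swap entry with offset ~0UL, encode it as a swap pte, decode it again, read the offset back) that leaves only the offset bits both representations can hold. The following is a minimal userspace sketch of that idea, not kernel code. Every toy_* name and both bit widths (58 offset bits in the entry, 50 in the pte) are hypothetical and chosen only for illustration; real layouts are architecture specific.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical widths, for illustration only. */
#define TOY_ENTRY_OFFSET_BITS 58   /* offset bits in the generic entry */
#define TOY_PTE_TYPE_BITS      5   /* type bits in the toy swap pte    */
#define TOY_PTE_OFFSET_BITS   50   /* offset bits in the toy swap pte  */

typedef struct { uint64_t val; } toy_swp_entry_t;
typedef struct { uint64_t val; } toy_pte_t;

/* Mask with the low 'bits' bits set; only called with bits < 64. */
static uint64_t toy_low_mask(unsigned int bits)
{
	return (UINT64_C(1) << bits) - 1;
}

/* Pack type above the offset; offset bits that do not fit are dropped. */
static toy_swp_entry_t toy_swp_entry(uint64_t type, uint64_t offset)
{
	toy_swp_entry_t e = {
		(type << TOY_ENTRY_OFFSET_BITS) |
		(offset & toy_low_mask(TOY_ENTRY_OFFSET_BITS))
	};
	return e;
}

static uint64_t toy_swp_type(toy_swp_entry_t e)
{
	return e.val >> TOY_ENTRY_OFFSET_BITS;
}

static uint64_t toy_swp_offset(toy_swp_entry_t e)
{
	return e.val & toy_low_mask(TOY_ENTRY_OFFSET_BITS);
}

/* Encode into the narrower pte format: type in the low bits, offset above. */
static toy_pte_t toy_swp_entry_to_pte(toy_swp_entry_t e)
{
	toy_pte_t p = {
		(toy_swp_type(e) & toy_low_mask(TOY_PTE_TYPE_BITS)) |
		((toy_swp_offset(e) & toy_low_mask(TOY_PTE_OFFSET_BITS))
			<< TOY_PTE_TYPE_BITS)
	};
	return p;
}

/* Decode the pte back into the generic entry format. */
static toy_swp_entry_t toy_pte_to_swp_entry(toy_pte_t p)
{
	return toy_swp_entry(p.val & toy_low_mask(TOY_PTE_TYPE_BITS),
			     (p.val >> TOY_PTE_TYPE_BITS) &
			     toy_low_mask(TOY_PTE_OFFSET_BITS));
}

/* Same shape as the expression in the patch: round trip, then add one. */
uint64_t toy_max_swapfile_size(void)
{
	return toy_swp_offset(toy_pte_to_swp_entry(
			toy_swp_entry_to_pte(
				toy_swp_entry(0, ~UINT64_C(0))))) + 1;
}
```

In this toy layout the pte is the narrower of the two formats, so the result is bounded by its 50 offset bits. The __weak max_swapfile_size() wrapper in the patch exists so that an architecture can substitute a stricter limit than this generic round trip yields.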
