Gentoo Archives: gentoo-commits

From: Mike Pagano <mpagano@g.o>
To: gentoo-commits@l.g.o
Subject: [gentoo-commits] proj/linux-patches:4.18 commit in: /
Date: Wed, 14 Nov 2018 13:15:59
Message-Id: 1542201338.37a9cc2ec281085f4896d9928b79147c086194a2.mpagano@gentoo
1 commit: 37a9cc2ec281085f4896d9928b79147c086194a2
2 Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
3 AuthorDate: Wed Aug 15 16:36:52 2018 +0000
4 Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
5 CommitDate: Wed Nov 14 13:15:38 2018 +0000
6 URL: https://gitweb.gentoo.org/proj/linux-patches.git/commit/?id=37a9cc2e
7
8 Linux patch 4.18.1
9
10 Signed-off-by: Mike Pagano <mpagano <AT> gentoo.org>
11
12 0000_README | 4 +
13 1000_linux-4.18.1.patch | 4083 +++++++++++++++++++++++++++++++++++++++++++++++
14 2 files changed, 4087 insertions(+)
15
16 diff --git a/0000_README b/0000_README
17 index 917d838..cf32ff2 100644
18 --- a/0000_README
19 +++ b/0000_README
20 @@ -43,6 +43,10 @@ EXPERIMENTAL
21 Individual Patch Descriptions:
22 --------------------------------------------------------------------------
23
24 +Patch: 1000_linux-4.18.1.patch
25 +From: http://www.kernel.org
26 +Desc: Linux 4.18.1
27 +
28 Patch: 1500_XATTR_USER_PREFIX.patch
29 From: https://bugs.gentoo.org/show_bug.cgi?id=470644
30 Desc: Support for namespace user.pax.* on tmpfs.
31
32 diff --git a/1000_linux-4.18.1.patch b/1000_linux-4.18.1.patch
33 new file mode 100644
34 index 0000000..bd9c2da
35 --- /dev/null
36 +++ b/1000_linux-4.18.1.patch
37 @@ -0,0 +1,4083 @@
38 +diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
39 +index 9c5e7732d249..73318225a368 100644
40 +--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
41 ++++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
42 +@@ -476,6 +476,7 @@ What: /sys/devices/system/cpu/vulnerabilities
43 + /sys/devices/system/cpu/vulnerabilities/spectre_v1
44 + /sys/devices/system/cpu/vulnerabilities/spectre_v2
45 + /sys/devices/system/cpu/vulnerabilities/spec_store_bypass
46 ++ /sys/devices/system/cpu/vulnerabilities/l1tf
47 + Date: January 2018
48 + Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
49 + Description: Information about CPU vulnerabilities
50 +@@ -487,3 +488,26 @@ Description: Information about CPU vulnerabilities
51 + "Not affected" CPU is not affected by the vulnerability
52 + "Vulnerable" CPU is affected and no mitigation in effect
53 + "Mitigation: $M" CPU is affected and mitigation $M is in effect
54 ++
55 ++ Details about the l1tf file can be found in
56 ++ Documentation/admin-guide/l1tf.rst
57 ++
58 ++What: /sys/devices/system/cpu/smt
59 ++ /sys/devices/system/cpu/smt/active
60 ++ /sys/devices/system/cpu/smt/control
61 ++Date: June 2018
62 ++Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
63 ++Description: Control Symmetric Multi Threading (SMT)
64 ++
65 ++ active: Tells whether SMT is active (enabled and siblings online)
66 ++
67 ++ control: Read/write interface to control SMT. Possible
68 ++ values:
69 ++
70 ++ "on" SMT is enabled
71 ++ "off" SMT is disabled
72 ++ "forceoff" SMT is force disabled. Cannot be changed.
73 ++ "notsupported" SMT is not supported by the CPU
74 ++
75 ++ If control status is "forceoff" or "notsupported" writes
76 ++ are rejected.
77 +diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
78 +index 48d70af11652..0873685bab0f 100644
79 +--- a/Documentation/admin-guide/index.rst
80 ++++ b/Documentation/admin-guide/index.rst
81 +@@ -17,6 +17,15 @@ etc.
82 + kernel-parameters
83 + devices
84 +
85 ++This section describes CPU vulnerabilities and provides an overview of the
86 ++possible mitigations along with guidance for selecting mitigations if they
87 ++are configurable at compile, boot or run time.
88 ++
89 ++.. toctree::
90 ++ :maxdepth: 1
91 ++
92 ++ l1tf
93 ++
94 + Here is a set of documents aimed at users who are trying to track down
95 + problems and bugs in particular.
96 +
97 +diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
98 +index 533ff5c68970..1370b424a453 100644
99 +--- a/Documentation/admin-guide/kernel-parameters.txt
100 ++++ b/Documentation/admin-guide/kernel-parameters.txt
101 +@@ -1967,10 +1967,84 @@
102 + (virtualized real and unpaged mode) on capable
103 + Intel chips. Default is 1 (enabled)
104 +
105 ++ kvm-intel.vmentry_l1d_flush=[KVM,Intel] Mitigation for L1 Terminal Fault
106 ++ CVE-2018-3620.
107 ++
108 ++ Valid arguments: never, cond, always
109 ++
110 ++ always: L1D cache flush on every VMENTER.
111 ++ cond: Flush L1D on VMENTER only when the code between
112 ++ VMEXIT and VMENTER can leak host memory.
113 ++ never: Disables the mitigation
114 ++
115 ++ Default is cond (do L1 cache flush in specific instances)
116 ++
117 + kvm-intel.vpid= [KVM,Intel] Disable Virtual Processor Identification
118 + feature (tagged TLBs) on capable Intel chips.
119 + Default is 1 (enabled)
120 +
121 ++ l1tf= [X86] Control mitigation of the L1TF vulnerability on
122 ++ affected CPUs
123 ++
124 ++ The kernel PTE inversion protection is unconditionally
125 ++ enabled and cannot be disabled.
126 ++
127 ++ full
128 ++ Provides all available mitigations for the
129 ++ L1TF vulnerability. Disables SMT and
130 ++ enables all mitigations in the
131 ++ hypervisors, i.e. unconditional L1D flush.
132 ++
133 ++ SMT control and L1D flush control via the
134 ++ sysfs interface is still possible after
135 ++ boot. Hypervisors will issue a warning
136 ++ when the first VM is started in a
137 ++ potentially insecure configuration,
138 ++ i.e. SMT enabled or L1D flush disabled.
139 ++
140 ++ full,force
141 ++ Same as 'full', but disables SMT and L1D
142 ++ flush runtime control. Implies the
143 ++ 'nosmt=force' command line option.
144 ++ (i.e. sysfs control of SMT is disabled.)
145 ++
146 ++ flush
147 ++ Leaves SMT enabled and enables the default
148 ++ hypervisor mitigation, i.e. conditional
149 ++ L1D flush.
150 ++
151 ++ SMT control and L1D flush control via the
152 ++ sysfs interface is still possible after
153 ++ boot. Hypervisors will issue a warning
154 ++ when the first VM is started in a
155 ++ potentially insecure configuration,
156 ++ i.e. SMT enabled or L1D flush disabled.
157 ++
158 ++ flush,nosmt
159 ++
160 ++ Disables SMT and enables the default
161 ++ hypervisor mitigation.
162 ++
163 ++ SMT control and L1D flush control via the
164 ++ sysfs interface is still possible after
165 ++ boot. Hypervisors will issue a warning
166 ++ when the first VM is started in a
167 ++ potentially insecure configuration,
168 ++ i.e. SMT enabled or L1D flush disabled.
169 ++
170 ++ flush,nowarn
171 ++ Same as 'flush', but hypervisors will not
172 ++ warn when a VM is started in a potentially
173 ++ insecure configuration.
174 ++
175 ++ off
176 ++ Disables hypervisor mitigations and doesn't
177 ++ emit any warnings.
178 ++
179 ++ Default is 'flush'.
180 ++
181 ++ For details see: Documentation/admin-guide/l1tf.rst
182 ++
183 + l2cr= [PPC]
184 +
185 + l3cr= [PPC]
186 +@@ -2687,6 +2761,10 @@
187 + nosmt [KNL,S390] Disable symmetric multithreading (SMT).
188 + Equivalent to smt=1.
189 +
190 ++ [KNL,x86] Disable symmetric multithreading (SMT).
191 ++ nosmt=force: Force disable SMT, cannot be undone
192 ++ via the sysfs control file.
193 ++
194 + nospectre_v2 [X86] Disable all mitigations for the Spectre variant 2
195 + (indirect branch prediction) vulnerability. System may
196 + allow data leaks with this option, which is equivalent
197 +diff --git a/Documentation/admin-guide/l1tf.rst b/Documentation/admin-guide/l1tf.rst
198 +new file mode 100644
199 +index 000000000000..bae52b845de0
200 +--- /dev/null
201 ++++ b/Documentation/admin-guide/l1tf.rst
202 +@@ -0,0 +1,610 @@
203 ++L1TF - L1 Terminal Fault
204 ++========================
205 ++
206 ++L1 Terminal Fault is a hardware vulnerability which allows unprivileged
207 ++speculative access to data which is available in the Level 1 Data Cache
208 ++when the page table entry controlling the virtual address, which is used
209 ++for the access, has the Present bit cleared or other reserved bits set.
210 ++
211 ++Affected processors
212 ++-------------------
213 ++
214 ++This vulnerability affects a wide range of Intel processors. The
215 ++vulnerability is not present on:
216 ++
217 ++ - Processors from AMD, Centaur and other non Intel vendors
218 ++
219 ++ - Older processor models, where the CPU family is < 6
220 ++
221 ++ - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
222 ++ Penwell, Pineview, Silvermont, Airmont, Merrifield)
223 ++
224 ++ - The Intel XEON PHI family
225 ++
226 ++ - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
227 ++ IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
228 ++ by the Meltdown vulnerability either. These CPUs should become
229 ++ available by end of 2018.
230 ++
231 ++Whether a processor is affected or not can be read out from the L1TF
232 ++vulnerability file in sysfs. See :ref:`l1tf_sys_info`.
233 ++
234 ++Related CVEs
235 ++------------
236 ++
237 ++The following CVE entries are related to the L1TF vulnerability:
238 ++
239 ++ ============= ================= ==============================
240 ++ CVE-2018-3615 L1 Terminal Fault SGX related aspects
241 ++ CVE-2018-3620 L1 Terminal Fault OS, SMM related aspects
242 ++ CVE-2018-3646 L1 Terminal Fault Virtualization related aspects
243 ++ ============= ================= ==============================
244 ++
245 ++Problem
246 ++-------
247 ++
248 ++If an instruction accesses a virtual address for which the relevant page
249 ++table entry (PTE) has the Present bit cleared or other reserved bits set,
250 ++then speculative execution ignores the invalid PTE and loads the referenced
251 ++data if it is present in the Level 1 Data Cache, as if the page referenced
252 ++by the address bits in the PTE was still present and accessible.
253 ++
254 ++While this is a purely speculative mechanism and the instruction will raise
255 ++a page fault when it is retired eventually, the pure act of loading the
256 ++data and making it available to other speculative instructions opens up the
257 ++opportunity for side channel attacks to unprivileged malicious code,
258 ++similar to the Meltdown attack.
259 ++
260 ++While Meltdown breaks the user space to kernel space protection, L1TF
261 ++allows to attack any physical memory address in the system and the attack
262 ++works across all protection domains. It allows an attack of SGX and also
263 ++works from inside virtual machines because the speculation bypasses the
264 ++extended page table (EPT) protection mechanism.
265 ++
266 ++
267 ++Attack scenarios
268 ++----------------
269 ++
270 ++1. Malicious user space
271 ++^^^^^^^^^^^^^^^^^^^^^^^
272 ++
273 ++ Operating Systems store arbitrary information in the address bits of a
274 ++ PTE which is marked non present. This allows a malicious user space
275 ++ application to attack the physical memory to which these PTEs resolve.
276 ++ In some cases user-space can maliciously influence the information
277 ++ encoded in the address bits of the PTE, thus making attacks more
278 ++ deterministic and more practical.
279 ++
280 ++ The Linux kernel contains a mitigation for this attack vector, PTE
281 ++ inversion, which is permanently enabled and has no performance
282 ++ impact. The kernel ensures that the address bits of PTEs, which are not
283 ++ marked present, never point to cacheable physical memory space.
284 ++
285 ++ A system with an up to date kernel is protected against attacks from
286 ++ malicious user space applications.
287 ++
288 ++2. Malicious guest in a virtual machine
289 ++^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
290 ++
291 ++ The fact that L1TF breaks all domain protections allows malicious guest
292 ++ OSes, which can control the PTEs directly, and malicious guest user
293 ++ space applications, which run on an unprotected guest kernel lacking the
294 ++ PTE inversion mitigation for L1TF, to attack physical host memory.
295 ++
296 ++ A special aspect of L1TF in the context of virtualization is symmetric
297 ++ multi threading (SMT). The Intel implementation of SMT is called
298 ++ HyperThreading. The fact that Hyperthreads on the affected processors
299 ++ share the L1 Data Cache (L1D) is important for this. As the flaw allows
300 ++ only to attack data which is present in L1D, a malicious guest running
301 ++ on one Hyperthread can attack the data which is brought into the L1D by
302 ++ the context which runs on the sibling Hyperthread of the same physical
303 ++ core. This context can be host OS, host user space or a different guest.
304 ++
305 ++ If the processor does not support Extended Page Tables, the attack is
306 ++ only possible, when the hypervisor does not sanitize the content of the
307 ++ effective (shadow) page tables.
308 ++
309 ++ While solutions exist to mitigate these attack vectors fully, these
310 ++ mitigations are not enabled by default in the Linux kernel because they
311 ++ can affect performance significantly. The kernel provides several
312 ++ mechanisms which can be utilized to address the problem depending on the
313 ++ deployment scenario. The mitigations, their protection scope and impact
314 ++ are described in the next sections.
315 ++
316 ++ The default mitigations and the rationale for choosing them are explained
317 ++ at the end of this document. See :ref:`default_mitigations`.
318 ++
319 ++.. _l1tf_sys_info:
320 ++
321 ++L1TF system information
322 ++-----------------------
323 ++
324 ++The Linux kernel provides a sysfs interface to enumerate the current L1TF
325 ++status of the system: whether the system is vulnerable, and which
326 ++mitigations are active. The relevant sysfs file is:
327 ++
328 ++/sys/devices/system/cpu/vulnerabilities/l1tf
329 ++
330 ++The possible values in this file are:
331 ++
332 ++ =========================== ===============================
333 ++ 'Not affected' The processor is not vulnerable
334 ++ 'Mitigation: PTE Inversion' The host protection is active
335 ++ =========================== ===============================
336 ++
337 ++If KVM/VMX is enabled and the processor is vulnerable then the following
338 ++information is appended to the 'Mitigation: PTE Inversion' part:
339 ++
340 ++ - SMT status:
341 ++
342 ++ ===================== ================
343 ++ 'VMX: SMT vulnerable' SMT is enabled
344 ++ 'VMX: SMT disabled' SMT is disabled
345 ++ ===================== ================
346 ++
347 ++ - L1D Flush mode:
348 ++
349 ++ ================================ ====================================
350 ++ 'L1D vulnerable' L1D flushing is disabled
351 ++
352 ++ 'L1D conditional cache flushes' L1D flush is conditionally enabled
353 ++
354 ++ 'L1D cache flushes' L1D flush is unconditionally enabled
355 ++ ================================ ====================================
356 ++
357 ++The resulting grade of protection is discussed in the following sections.
358 ++
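For illustration, a minimal user-space sketch that reads this status file; it assumes nothing beyond the sysfs path documented above:

#include <stdio.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/sys/devices/system/cpu/vulnerabilities/l1tf", "r");

	if (!f) {
		perror("l1tf sysfs file");	/* older kernel or sysfs not mounted */
		return 1;
	}
	if (fgets(line, sizeof(line), f))
		/* e.g. "Not affected" or "Mitigation: PTE Inversion; VMX: ..." */
		printf("L1TF status: %s", line);
	fclose(f);
	return 0;
}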
359 ++
360 ++Host mitigation mechanism
361 ++-------------------------
362 ++
363 ++The kernel is unconditionally protected against L1TF attacks from malicious
364 ++user space running on the host.
365 ++
366 ++
367 ++Guest mitigation mechanisms
368 ++---------------------------
369 ++
370 ++.. _l1d_flush:
371 ++
372 ++1. L1D flush on VMENTER
373 ++^^^^^^^^^^^^^^^^^^^^^^^
374 ++
375 ++ To make sure that a guest cannot attack data which is present in the L1D
376 ++ the hypervisor flushes the L1D before entering the guest.
377 ++
378 ++ Flushing the L1D evicts not only the data which should not be accessed
379 ++ by a potentially malicious guest, it also flushes the guest
380 ++ data. Flushing the L1D has a performance impact as the processor has to
381 ++ bring the flushed guest data back into the L1D. Depending on the
382 ++ frequency of VMEXIT/VMENTER and the type of computations in the guest
383 ++ performance degradation in the range of 1% to 50% has been observed. For
384 ++ scenarios where guest VMEXIT/VMENTER are rare the performance impact is
385 ++ minimal. Virtio and mechanisms like posted interrupts are designed to
386 ++ confine the VMEXITs to a bare minimum, but specific configurations and
387 ++ application scenarios might still suffer from a high VMEXIT rate.
388 ++
389 ++ The kernel provides two L1D flush modes:
390 ++ - conditional ('cond')
391 ++ - unconditional ('always')
392 ++
393 ++ The conditional mode avoids L1D flushing after VMEXITs which execute
394 ++ only audited code paths before the corresponding VMENTER. These code
395 ++ paths have been verified that they cannot expose secrets or other
396 ++ interesting data to an attacker, but they can leak information about the
397 ++ address space layout of the hypervisor.
398 ++
399 ++ Unconditional mode flushes L1D on all VMENTER invocations and provides
400 ++ maximum protection. It has a higher overhead than the conditional
401 ++ mode. The overhead cannot be quantified correctly as it depends on the
402 ++ workload scenario and the resulting number of VMEXITs.
403 ++
404 ++ The general recommendation is to enable L1D flush on VMENTER. The kernel
405 ++ defaults to conditional mode on affected processors.
406 ++
407 ++ **Note** that L1D flush does not prevent the SMT problem because the
408 ++ sibling thread will also bring back its data into the L1D which makes it
409 ++ attackable again.
410 ++
411 ++ L1D flush can be controlled by the administrator via the kernel command
412 ++ line and sysfs control files. See :ref:`mitigation_control_command_line`
413 ++ and :ref:`mitigation_control_kvm`.
414 ++
415 ++.. _guest_confinement:
416 ++
417 ++2. Guest VCPU confinement to dedicated physical cores
418 ++^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
419 ++
420 ++ To address the SMT problem, it is possible to make a guest or a group of
421 ++ guests affine to one or more physical cores. The proper mechanism for
422 ++ that is to utilize exclusive cpusets to ensure that no other guest or
423 ++ host tasks can run on these cores.
424 ++
425 ++ If only a single guest or related guests run on sibling SMT threads on
426 ++ the same physical core then they can only attack their own memory and
427 ++ restricted parts of the host memory.
428 ++
429 ++ Host memory is attackable when one of the sibling SMT threads runs in
430 ++ host OS (hypervisor) context and the other in guest context. The amount
431 ++ of valuable information from the host OS context depends on the context
432 ++ which the host OS executes, i.e. interrupts, soft interrupts and kernel
433 ++ threads. The amount of valuable data from these contexts cannot be
434 ++ declared as non-interesting for an attacker without deep inspection of
435 ++ the code.
436 ++
437 ++ **Note** that assigning guests to a fixed set of physical cores affects
438 ++ the ability of the scheduler to do load balancing and might have
439 ++ negative effects on CPU utilization depending on the hosting
440 ++ scenario. Disabling SMT might be a viable alternative for particular
441 ++ scenarios.
442 ++
443 ++ For further information about confining guests to a single or to a group
444 ++ of cores consult the cpusets documentation:
445 ++
446 ++ https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
447 ++
448 ++.. _interrupt_isolation:
449 ++
450 ++3. Interrupt affinity
451 ++^^^^^^^^^^^^^^^^^^^^^
452 ++
453 ++ Interrupts can be made affine to logical CPUs. This is not universally
454 ++ true because there are types of interrupts which are truly per CPU
455 ++ interrupts, e.g. the local timer interrupt. Aside from that, multi queue
456 ++ devices affine their interrupts to single CPUs or groups of CPUs per
457 ++ queue without allowing the administrator to control the affinities.
458 ++
459 ++ Moving the interrupts, which can be affinity controlled, away from CPUs
460 ++ which run untrusted guests, reduces the attack vector space.
461 ++
462 ++ Whether the interrupts that are affine to CPUs, which run untrusted
463 ++ guests, provide interesting data for an attacker depends on the system
464 ++ configuration and the scenarios which run on the system. While for some
465 ++ of the interrupts it can be assumed that they won't expose interesting
466 ++ information beyond exposing hints about the host OS memory layout, there
467 ++ is no way to make general assumptions.
468 ++
469 ++ Interrupt affinity can be controlled by the administrator via the
470 ++ /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
471 ++ available at:
472 ++
473 ++ https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
474 ++
475 ++.. _smt_control:
476 ++
477 ++4. SMT control
478 ++^^^^^^^^^^^^^^
479 ++
480 ++ To prevent the SMT issues of L1TF it might be necessary to disable SMT
481 ++ completely. Disabling SMT can have a significant performance impact, but
482 ++ the impact depends on the hosting scenario and the type of workloads.
483 ++ The impact of disabling SMT needs also to be weighted against the impact
484 ++ of other mitigation solutions like confining guests to dedicated cores.
485 ++
486 ++ The kernel provides a sysfs interface to retrieve the status of SMT and
487 ++ to control it. It also provides a kernel command line interface to
488 ++ control SMT.
489 ++
490 ++ The kernel command line interface consists of the following options:
491 ++
492 ++ =========== ==========================================================
493 ++ nosmt Affects the bring up of the secondary CPUs during boot. The
494 ++ kernel tries to bring all present CPUs online during the
495 ++ boot process. "nosmt" makes sure that from each physical
496 ++ core only one - the so called primary (hyper) thread is
497 ++ activated. Due to a design flaw of Intel processors related
498 ++ to Machine Check Exceptions the non primary siblings have
499 ++ to be brought up at least partially and are then shut down
500 ++ again. "nosmt" can be undone via the sysfs interface.
501 ++
502 ++ nosmt=force Has the same effect as "nosmt" but it does not allow to
503 ++ undo the SMT disable via the sysfs interface.
504 ++ =========== ==========================================================
505 ++
506 ++ The sysfs interface provides two files:
507 ++
508 ++ - /sys/devices/system/cpu/smt/control
509 ++ - /sys/devices/system/cpu/smt/active
510 ++
511 ++ /sys/devices/system/cpu/smt/control:
512 ++
513 ++ This file allows to read out the SMT control state and provides the
514 ++ ability to disable or (re)enable SMT. The possible states are:
515 ++
516 ++ ============== ===================================================
517 ++ on SMT is supported by the CPU and enabled. All
518 ++ logical CPUs can be onlined and offlined without
519 ++ restrictions.
520 ++
521 ++ off SMT is supported by the CPU and disabled. Only
522 ++ the so called primary SMT threads can be onlined
523 ++ and offlined without restrictions. An attempt to
524 ++ online a non-primary sibling is rejected
525 ++
526 ++ forceoff Same as 'off' but the state cannot be controlled.
527 ++ Attempts to write to the control file are rejected.
528 ++
529 ++ notsupported The processor does not support SMT. It's therefore
530 ++ not affected by the SMT implications of L1TF.
531 ++ Attempts to write to the control file are rejected.
532 ++ ============== ===================================================
533 ++
534 ++ The possible states which can be written into this file to control SMT
535 ++ state are:
536 ++
537 ++ - on
538 ++ - off
539 ++ - forceoff
540 ++
541 ++ /sys/devices/system/cpu/smt/active:
542 ++
543 ++ This file reports whether SMT is enabled and active, i.e. if on any
544 ++ physical core two or more sibling threads are online.
545 ++
546 ++ SMT control is also possible at boot time via the l1tf kernel command
547 ++ line parameter in combination with L1D flush control. See
548 ++ :ref:`mitigation_control_command_line`.
549 ++
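In the same vein, a small user-space sketch that queries both files and then attempts to disable SMT at run time (the write needs root and is rejected in the "forceoff" and "notsupported" states, as described above):

#include <stdio.h>

static void show(const char *path)
{
	char buf[64] = "";
	FILE *f = fopen(path, "r");

	if (f) {
		if (fgets(buf, sizeof(buf), f))
			printf("%s: %s", path, buf);
		fclose(f);
	}
}

int main(void)
{
	show("/sys/devices/system/cpu/smt/control");	/* on, off, forceoff or notsupported */
	show("/sys/devices/system/cpu/smt/active");	/* 1 if sibling threads are online */

	/* Disable SMT at run time (root only). */
	FILE *f = fopen("/sys/devices/system/cpu/smt/control", "w");
	if (f) {
		fputs("off", f);
		fclose(f);
	}
	return 0;
}
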
550 ++5. Disabling EPT
551 ++^^^^^^^^^^^^^^^^
552 ++
553 ++ Disabling EPT for virtual machines provides full mitigation for L1TF even
554 ++ with SMT enabled, because the effective page tables for guests are
555 ++ managed and sanitized by the hypervisor. Though disabling EPT has a
556 ++ significant performance impact especially when the Meltdown mitigation
557 ++ KPTI is enabled.
558 ++
559 ++ EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
560 ++
561 ++There is ongoing research and development for new mitigation mechanisms to
562 ++address the performance impact of disabling SMT or EPT.
563 ++
564 ++.. _mitigation_control_command_line:
565 ++
566 ++Mitigation control on the kernel command line
567 ++---------------------------------------------
568 ++
569 ++The kernel command line allows to control the L1TF mitigations at boot
570 ++time with the option "l1tf=". The valid arguments for this option are:
571 ++
572 ++ ============ =============================================================
573 ++ full Provides all available mitigations for the L1TF
574 ++ vulnerability. Disables SMT and enables all mitigations in
575 ++ the hypervisors, i.e. unconditional L1D flushing
576 ++
577 ++ SMT control and L1D flush control via the sysfs interface
578 ++ is still possible after boot. Hypervisors will issue a
579 ++ warning when the first VM is started in a potentially
580 ++ insecure configuration, i.e. SMT enabled or L1D flush
581 ++ disabled.
582 ++
583 ++ full,force Same as 'full', but disables SMT and L1D flush runtime
584 ++ control. Implies the 'nosmt=force' command line option.
585 ++ (i.e. sysfs control of SMT is disabled.)
586 ++
587 ++ flush Leaves SMT enabled and enables the default hypervisor
588 ++ mitigation, i.e. conditional L1D flushing
589 ++
590 ++ SMT control and L1D flush control via the sysfs interface
591 ++ is still possible after boot. Hypervisors will issue a
592 ++ warning when the first VM is started in a potentially
593 ++ insecure configuration, i.e. SMT enabled or L1D flush
594 ++ disabled.
595 ++
596 ++ flush,nosmt Disables SMT and enables the default hypervisor mitigation,
597 ++ i.e. conditional L1D flushing.
598 ++
599 ++ SMT control and L1D flush control via the sysfs interface
600 ++ is still possible after boot. Hypervisors will issue a
601 ++ warning when the first VM is started in a potentially
602 ++ insecure configuration, i.e. SMT enabled or L1D flush
603 ++ disabled.
604 ++
605 ++ flush,nowarn Same as 'flush', but hypervisors will not warn when a VM is
606 ++ started in a potentially insecure configuration.
607 ++
608 ++ off Disables hypervisor mitigations and doesn't emit any
609 ++ warnings.
610 ++ ============ =============================================================
611 ++
612 ++The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
613 ++
614 ++
615 ++.. _mitigation_control_kvm:
616 ++
617 ++Mitigation control for KVM - module parameter
618 ++-------------------------------------------------------------
619 ++
620 ++The KVM hypervisor mitigation mechanism, flushing the L1D cache when
621 ++entering a guest, can be controlled with a module parameter.
622 ++
623 ++The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
624 ++following arguments:
625 ++
626 ++ ============ ==============================================================
627 ++ always L1D cache flush on every VMENTER.
628 ++
629 ++ cond Flush L1D on VMENTER only when the code between VMEXIT and
630 ++ VMENTER can leak host memory which is considered
631 ++ interesting for an attacker. This still can leak host memory
632 ++ which allows e.g. to determine the host's address space layout.
633 ++
634 ++ never Disables the mitigation
635 ++ ============ ==============================================================
636 ++
637 ++The parameter can be provided on the kernel command line, as a module
638 ++parameter when loading the modules and at runtime modified via the sysfs
639 ++file:
640 ++
641 ++/sys/module/kvm_intel/parameters/vmentry_l1d_flush
642 ++
643 ++The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
644 ++line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
645 ++module parameter is ignored and writes to the sysfs file are rejected.
646 ++
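Similarly, a short sketch that reads the current mode and switches it to 'always' through the sysfs file named above (needs root, the kvm_intel module loaded, and no 'l1tf=full,force' on the command line, which would make the write fail as described):

#include <stdio.h>

#define VMENTRY_L1D_FLUSH "/sys/module/kvm_intel/parameters/vmentry_l1d_flush"

int main(void)
{
	char cur[32] = "";
	FILE *f = fopen(VMENTRY_L1D_FLUSH, "r");

	if (f) {
		if (fgets(cur, sizeof(cur), f))
			printf("current mode: %s", cur);	/* always, cond or never */
		fclose(f);
	}

	f = fopen(VMENTRY_L1D_FLUSH, "w");	/* switch to unconditional flushing */
	if (f) {
		fputs("always", f);
		fclose(f);
	}
	return 0;
}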
647 ++
648 ++Mitigation selection guide
649 ++--------------------------
650 ++
651 ++1. No virtualization in use
652 ++^^^^^^^^^^^^^^^^^^^^^^^^^^^
653 ++
654 ++ The system is protected by the kernel unconditionally and no further
655 ++ action is required.
656 ++
657 ++2. Virtualization with trusted guests
658 ++^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
659 ++
660 ++ If the guest comes from a trusted source and the guest OS kernel is
661 ++ guaranteed to have the L1TF mitigations in place the system is fully
662 ++ protected against L1TF and no further action is required.
663 ++
664 ++ To avoid the overhead of the default L1D flushing on VMENTER the
665 ++ administrator can disable the flushing via the kernel command line and
666 ++ sysfs control files. See :ref:`mitigation_control_command_line` and
667 ++ :ref:`mitigation_control_kvm`.
668 ++
669 ++
670 ++3. Virtualization with untrusted guests
671 ++^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
672 ++
673 ++3.1. SMT not supported or disabled
674 ++""""""""""""""""""""""""""""""""""
675 ++
676 ++ If SMT is not supported by the processor or disabled in the BIOS or by
677 ++ the kernel, it's only required to enforce L1D flushing on VMENTER.
678 ++
679 ++ Conditional L1D flushing is the default behaviour and can be tuned. See
680 ++ :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
681 ++
682 ++3.2. EPT not supported or disabled
683 ++""""""""""""""""""""""""""""""""""
684 ++
685 ++ If EPT is not supported by the processor or disabled in the hypervisor,
686 ++ the system is fully protected. SMT can stay enabled and L1D flushing on
687 ++ VMENTER is not required.
688 ++
689 ++ EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
690 ++
691 ++3.3. SMT and EPT supported and active
692 ++"""""""""""""""""""""""""""""""""""""
693 ++
694 ++ If SMT and EPT are supported and active then various degrees of
695 ++ mitigations can be employed:
696 ++
697 ++ - L1D flushing on VMENTER:
698 ++
699 ++ L1D flushing on VMENTER is the minimal protection requirement, but it
700 ++ is only potent in combination with other mitigation methods.
701 ++
702 ++ Conditional L1D flushing is the default behaviour and can be tuned. See
703 ++ :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
704 ++
705 ++ - Guest confinement:
706 ++
707 ++ Confinement of guests to a single or a group of physical cores which
708 ++ are not running any other processes, can reduce the attack surface
709 ++ significantly, but interrupts, soft interrupts and kernel threads can
710 ++ still expose valuable data to a potential attacker. See
711 ++ :ref:`guest_confinement`.
712 ++
713 ++ - Interrupt isolation:
714 ++
715 ++ Isolating the guest CPUs from interrupts can reduce the attack surface
716 ++ further, but still allows a malicious guest to explore a limited amount
717 ++ of host physical memory. This can at least be used to gain knowledge
718 ++ about the host address space layout. The interrupts which have a fixed
719 ++ affinity to the CPUs which run the untrusted guests can depending on
720 ++ the scenario still trigger soft interrupts and schedule kernel threads
721 ++ which might expose valuable information. See
722 ++ :ref:`interrupt_isolation`.
723 ++
724 ++The above three mitigation methods combined can provide protection to a
725 ++certain degree, but the risk of the remaining attack surface has to be
726 ++carefully analyzed. For full protection the following methods are
727 ++available:
728 ++
729 ++ - Disabling SMT:
730 ++
731 ++ Disabling SMT and enforcing the L1D flushing provides the maximum
732 ++ amount of protection. This mitigation is not depending on any of the
733 ++ above mitigation methods.
734 ++
735 ++ SMT control and L1D flushing can be tuned by the command line
736 ++ parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
737 ++ time with the matching sysfs control files. See :ref:`smt_control`,
738 ++ :ref:`mitigation_control_command_line` and
739 ++ :ref:`mitigation_control_kvm`.
740 ++
741 ++ - Disabling EPT:
742 ++
743 ++ Disabling EPT provides the maximum amount of protection as well. It is
744 ++ not depending on any of the above mitigation methods. SMT can stay
745 ++ enabled and L1D flushing is not required, but the performance impact is
746 ++ significant.
747 ++
748 ++ EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
749 ++ parameter.
750 ++
751 ++3.4. Nested virtual machines
752 ++""""""""""""""""""""""""""""
753 ++
754 ++When nested virtualization is in use, three operating systems are involved:
755 ++the bare metal hypervisor, the nested hypervisor and the nested virtual
756 ++machine. VMENTER operations from the nested hypervisor into the nested
757 ++guest will always be processed by the bare metal hypervisor. If KVM is the
758 ++bare metal hypervisor it will:
759 ++
760 ++ - Flush the L1D cache on every switch from the nested hypervisor to the
761 ++ nested virtual machine, so that the nested hypervisor's secrets are not
762 ++ exposed to the nested virtual machine;
763 ++
764 ++ - Flush the L1D cache on every switch from the nested virtual machine to
765 ++ the nested hypervisor; this is a complex operation, and flushing the L1D
766 ++ cache avoids that the bare metal hypervisor's secrets are exposed to the
767 ++ nested virtual machine;
768 ++
769 ++ - Instruct the nested hypervisor to not perform any L1D cache flush. This
770 ++ is an optimization to avoid double L1D flushing.
771 ++
772 ++
773 ++.. _default_mitigations:
774 ++
775 ++Default mitigations
776 ++-------------------
777 ++
778 ++ The kernel default mitigations for vulnerable processors are:
779 ++
780 ++ - PTE inversion to protect against malicious user space. This is done
781 ++ unconditionally and cannot be controlled.
782 ++
783 ++ - L1D conditional flushing on VMENTER when EPT is enabled for
784 ++ a guest.
785 ++
786 ++ The kernel does not by default enforce the disabling of SMT, which leaves
787 ++ SMT systems vulnerable when running untrusted guests with EPT enabled.
788 ++
789 ++ The rationale for this choice is:
790 ++
791 ++ - Force disabling SMT can break existing setups, especially with
792 ++ unattended updates.
793 ++
794 ++ - If regular users run untrusted guests on their machine, then L1TF is
795 ++ just an add on to other malware which might be embedded in an untrusted
796 ++ guest, e.g. spam-bots or attacks on the local network.
797 ++
798 ++ There is no technical way to prevent a user from running untrusted code
799 ++ on their machines blindly.
800 ++
801 ++ - It's technically extremely unlikely and from today's knowledge even
802 ++ impossible that L1TF can be exploited via the most popular attack
803 ++ mechanisms like JavaScript because these mechanisms have no way to
804 ++ control PTEs. If this would be possible and no other mitigation would
805 ++ be possible, then the default might be different.
806 ++
807 ++ - The administrators of cloud and hosting setups have to carefully
808 ++ analyze the risk for their scenarios and make the appropriate
809 ++ mitigation choices, which might even vary across their deployed
810 ++ machines and also result in other changes of their overall setup.
811 ++ There is no way for the kernel to provide a sensible default for this
812 ++ kind of scenario.
813 +diff --git a/Makefile b/Makefile
814 +index 863f58503bee..5edf963148e8 100644
815 +--- a/Makefile
816 ++++ b/Makefile
817 +@@ -1,7 +1,7 @@
818 + # SPDX-License-Identifier: GPL-2.0
819 + VERSION = 4
820 + PATCHLEVEL = 18
821 +-SUBLEVEL = 0
822 ++SUBLEVEL = 1
823 + EXTRAVERSION =
824 + NAME = Merciless Moray
825 +
826 +diff --git a/arch/Kconfig b/arch/Kconfig
827 +index 1aa59063f1fd..d1f2ed462ac8 100644
828 +--- a/arch/Kconfig
829 ++++ b/arch/Kconfig
830 +@@ -13,6 +13,9 @@ config KEXEC_CORE
831 + config HAVE_IMA_KEXEC
832 + bool
833 +
834 ++config HOTPLUG_SMT
835 ++ bool
836 ++
837 + config OPROFILE
838 + tristate "OProfile system profiling"
839 + depends on PROFILING
840 +diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
841 +index 887d3a7bb646..6b8065d718bd 100644
842 +--- a/arch/x86/Kconfig
843 ++++ b/arch/x86/Kconfig
844 +@@ -187,6 +187,7 @@ config X86
845 + select HAVE_SYSCALL_TRACEPOINTS
846 + select HAVE_UNSTABLE_SCHED_CLOCK
847 + select HAVE_USER_RETURN_NOTIFIER
848 ++ select HOTPLUG_SMT if SMP
849 + select IRQ_FORCED_THREADING
850 + select NEED_SG_DMA_LENGTH
851 + select PCI_LOCKLESS_CONFIG
852 +diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
853 +index 74a9e06b6cfd..130e81e10fc7 100644
854 +--- a/arch/x86/include/asm/apic.h
855 ++++ b/arch/x86/include/asm/apic.h
856 +@@ -10,6 +10,7 @@
857 + #include <asm/fixmap.h>
858 + #include <asm/mpspec.h>
859 + #include <asm/msr.h>
860 ++#include <asm/hardirq.h>
861 +
862 + #define ARCH_APICTIMER_STOPS_ON_C3 1
863 +
864 +@@ -502,12 +503,19 @@ extern int default_check_phys_apicid_present(int phys_apicid);
865 +
866 + #endif /* CONFIG_X86_LOCAL_APIC */
867 +
868 ++#ifdef CONFIG_SMP
869 ++bool apic_id_is_primary_thread(unsigned int id);
870 ++#else
871 ++static inline bool apic_id_is_primary_thread(unsigned int id) { return false; }
872 ++#endif
873 ++
874 + extern void irq_enter(void);
875 + extern void irq_exit(void);
876 +
877 + static inline void entering_irq(void)
878 + {
879 + irq_enter();
880 ++ kvm_set_cpu_l1tf_flush_l1d();
881 + }
882 +
883 + static inline void entering_ack_irq(void)
884 +@@ -520,6 +528,7 @@ static inline void ipi_entering_ack_irq(void)
885 + {
886 + irq_enter();
887 + ack_APIC_irq();
888 ++ kvm_set_cpu_l1tf_flush_l1d();
889 + }
890 +
891 + static inline void exiting_irq(void)
892 +diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
893 +index 5701f5cecd31..64aaa3f5f36c 100644
894 +--- a/arch/x86/include/asm/cpufeatures.h
895 ++++ b/arch/x86/include/asm/cpufeatures.h
896 +@@ -219,6 +219,7 @@
897 + #define X86_FEATURE_IBPB ( 7*32+26) /* Indirect Branch Prediction Barrier */
898 + #define X86_FEATURE_STIBP ( 7*32+27) /* Single Thread Indirect Branch Predictors */
899 + #define X86_FEATURE_ZEN ( 7*32+28) /* "" CPU is AMD family 0x17 (Zen) */
900 ++#define X86_FEATURE_L1TF_PTEINV ( 7*32+29) /* "" L1TF workaround PTE inversion */
901 +
902 + /* Virtualization flags: Linux defined, word 8 */
903 + #define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */
904 +@@ -341,6 +342,7 @@
905 + #define X86_FEATURE_PCONFIG (18*32+18) /* Intel PCONFIG */
906 + #define X86_FEATURE_SPEC_CTRL (18*32+26) /* "" Speculation Control (IBRS + IBPB) */
907 + #define X86_FEATURE_INTEL_STIBP (18*32+27) /* "" Single Thread Indirect Branch Predictors */
908 ++#define X86_FEATURE_FLUSH_L1D (18*32+28) /* Flush L1D cache */
909 + #define X86_FEATURE_ARCH_CAPABILITIES (18*32+29) /* IA32_ARCH_CAPABILITIES MSR (Intel) */
910 + #define X86_FEATURE_SPEC_CTRL_SSBD (18*32+31) /* "" Speculative Store Bypass Disable */
911 +
912 +@@ -373,5 +375,6 @@
913 + #define X86_BUG_SPECTRE_V1 X86_BUG(15) /* CPU is affected by Spectre variant 1 attack with conditional branches */
914 + #define X86_BUG_SPECTRE_V2 X86_BUG(16) /* CPU is affected by Spectre variant 2 attack with indirect branches */
915 + #define X86_BUG_SPEC_STORE_BYPASS X86_BUG(17) /* CPU is affected by speculative store bypass attack */
916 ++#define X86_BUG_L1TF X86_BUG(18) /* CPU is affected by L1 Terminal Fault */
917 +
918 + #endif /* _ASM_X86_CPUFEATURES_H */
919 +diff --git a/arch/x86/include/asm/dmi.h b/arch/x86/include/asm/dmi.h
920 +index 0ab2ab27ad1f..b825cb201251 100644
921 +--- a/arch/x86/include/asm/dmi.h
922 ++++ b/arch/x86/include/asm/dmi.h
923 +@@ -4,8 +4,8 @@
924 +
925 + #include <linux/compiler.h>
926 + #include <linux/init.h>
927 ++#include <linux/io.h>
928 +
929 +-#include <asm/io.h>
930 + #include <asm/setup.h>
931 +
932 + static __always_inline __init void *dmi_alloc(unsigned len)
933 +diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
934 +index 740a428acf1e..d9069bb26c7f 100644
935 +--- a/arch/x86/include/asm/hardirq.h
936 ++++ b/arch/x86/include/asm/hardirq.h
937 +@@ -3,10 +3,12 @@
938 + #define _ASM_X86_HARDIRQ_H
939 +
940 + #include <linux/threads.h>
941 +-#include <linux/irq.h>
942 +
943 + typedef struct {
944 +- unsigned int __softirq_pending;
945 ++ u16 __softirq_pending;
946 ++#if IS_ENABLED(CONFIG_KVM_INTEL)
947 ++ u8 kvm_cpu_l1tf_flush_l1d;
948 ++#endif
949 + unsigned int __nmi_count; /* arch dependent */
950 + #ifdef CONFIG_X86_LOCAL_APIC
951 + unsigned int apic_timer_irqs; /* arch dependent */
952 +@@ -58,4 +60,24 @@ extern u64 arch_irq_stat_cpu(unsigned int cpu);
953 + extern u64 arch_irq_stat(void);
954 + #define arch_irq_stat arch_irq_stat
955 +
956 ++
957 ++#if IS_ENABLED(CONFIG_KVM_INTEL)
958 ++static inline void kvm_set_cpu_l1tf_flush_l1d(void)
959 ++{
960 ++ __this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 1);
961 ++}
962 ++
963 ++static inline void kvm_clear_cpu_l1tf_flush_l1d(void)
964 ++{
965 ++ __this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 0);
966 ++}
967 ++
968 ++static inline bool kvm_get_cpu_l1tf_flush_l1d(void)
969 ++{
970 ++ return __this_cpu_read(irq_stat.kvm_cpu_l1tf_flush_l1d);
971 ++}
972 ++#else /* !IS_ENABLED(CONFIG_KVM_INTEL) */
973 ++static inline void kvm_set_cpu_l1tf_flush_l1d(void) { }
974 ++#endif /* IS_ENABLED(CONFIG_KVM_INTEL) */
975 ++
976 + #endif /* _ASM_X86_HARDIRQ_H */
977 +diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
978 +index c4fc17220df9..c14f2a74b2be 100644
979 +--- a/arch/x86/include/asm/irqflags.h
980 ++++ b/arch/x86/include/asm/irqflags.h
981 +@@ -13,6 +13,8 @@
982 + * Interrupt control:
983 + */
984 +
985 ++/* Declaration required for gcc < 4.9 to prevent -Werror=missing-prototypes */
986 ++extern inline unsigned long native_save_fl(void);
987 + extern inline unsigned long native_save_fl(void)
988 + {
989 + unsigned long flags;
990 +diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
991 +index c13cd28d9d1b..acebb808c4b5 100644
992 +--- a/arch/x86/include/asm/kvm_host.h
993 ++++ b/arch/x86/include/asm/kvm_host.h
994 +@@ -17,6 +17,7 @@
995 + #include <linux/tracepoint.h>
996 + #include <linux/cpumask.h>
997 + #include <linux/irq_work.h>
998 ++#include <linux/irq.h>
999 +
1000 + #include <linux/kvm.h>
1001 + #include <linux/kvm_para.h>
1002 +@@ -713,6 +714,9 @@ struct kvm_vcpu_arch {
1003 +
1004 + /* be preempted when it's in kernel-mode(cpl=0) */
1005 + bool preempted_in_kernel;
1006 ++
1007 ++ /* Flush the L1 Data cache for L1TF mitigation on VMENTER */
1008 ++ bool l1tf_flush_l1d;
1009 + };
1010 +
1011 + struct kvm_lpage_info {
1012 +@@ -881,6 +885,7 @@ struct kvm_vcpu_stat {
1013 + u64 signal_exits;
1014 + u64 irq_window_exits;
1015 + u64 nmi_window_exits;
1016 ++ u64 l1d_flush;
1017 + u64 halt_exits;
1018 + u64 halt_successful_poll;
1019 + u64 halt_attempted_poll;
1020 +@@ -1413,6 +1418,7 @@ int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
1021 + void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
1022 + void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu);
1023 +
1024 ++u64 kvm_get_arch_capabilities(void);
1025 + void kvm_define_shared_msr(unsigned index, u32 msr);
1026 + int kvm_set_shared_msr(unsigned index, u64 val, u64 mask);
1027 +
1028 +diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
1029 +index 68b2c3150de1..4731f0cf97c5 100644
1030 +--- a/arch/x86/include/asm/msr-index.h
1031 ++++ b/arch/x86/include/asm/msr-index.h
1032 +@@ -70,12 +70,19 @@
1033 + #define MSR_IA32_ARCH_CAPABILITIES 0x0000010a
1034 + #define ARCH_CAP_RDCL_NO (1 << 0) /* Not susceptible to Meltdown */
1035 + #define ARCH_CAP_IBRS_ALL (1 << 1) /* Enhanced IBRS support */
1036 ++#define ARCH_CAP_SKIP_VMENTRY_L1DFLUSH (1 << 3) /* Skip L1D flush on vmentry */
1037 + #define ARCH_CAP_SSB_NO (1 << 4) /*
1038 + * Not susceptible to Speculative Store Bypass
1039 + * attack, so no Speculative Store Bypass
1040 + * control required.
1041 + */
1042 +
1043 ++#define MSR_IA32_FLUSH_CMD 0x0000010b
1044 ++#define L1D_FLUSH (1 << 0) /*
1045 ++ * Writeback and invalidate the
1046 ++ * L1 data cache.
1047 ++ */
1048 ++
1049 + #define MSR_IA32_BBL_CR_CTL 0x00000119
1050 + #define MSR_IA32_BBL_CR_CTL3 0x0000011e
1051 +
1052 +diff --git a/arch/x86/include/asm/page_32_types.h b/arch/x86/include/asm/page_32_types.h
1053 +index aa30c3241ea7..0d5c739eebd7 100644
1054 +--- a/arch/x86/include/asm/page_32_types.h
1055 ++++ b/arch/x86/include/asm/page_32_types.h
1056 +@@ -29,8 +29,13 @@
1057 + #define N_EXCEPTION_STACKS 1
1058 +
1059 + #ifdef CONFIG_X86_PAE
1060 +-/* 44=32+12, the limit we can fit into an unsigned long pfn */
1061 +-#define __PHYSICAL_MASK_SHIFT 44
1062 ++/*
1063 ++ * This is beyond the 44 bit limit imposed by the 32bit long pfns,
1064 ++ * but we need the full mask to make sure inverted PROT_NONE
1065 ++ * entries have all the host bits set in a guest.
1066 ++ * The real limit is still 44 bits.
1067 ++ */
1068 ++#define __PHYSICAL_MASK_SHIFT 52
1069 + #define __VIRTUAL_MASK_SHIFT 32
1070 +
1071 + #else /* !CONFIG_X86_PAE */
1072 +diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h
1073 +index 685ffe8a0eaf..60d0f9015317 100644
1074 +--- a/arch/x86/include/asm/pgtable-2level.h
1075 ++++ b/arch/x86/include/asm/pgtable-2level.h
1076 +@@ -95,4 +95,21 @@ static inline unsigned long pte_bitop(unsigned long value, unsigned int rightshi
1077 + #define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low })
1078 + #define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
1079 +
1080 ++/* No inverted PFNs on 2 level page tables */
1081 ++
1082 ++static inline u64 protnone_mask(u64 val)
1083 ++{
1084 ++ return 0;
1085 ++}
1086 ++
1087 ++static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask)
1088 ++{
1089 ++ return val;
1090 ++}
1091 ++
1092 ++static inline bool __pte_needs_invert(u64 val)
1093 ++{
1094 ++ return false;
1095 ++}
1096 ++
1097 + #endif /* _ASM_X86_PGTABLE_2LEVEL_H */
1098 +diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
1099 +index f24df59c40b2..bb035a4cbc8c 100644
1100 +--- a/arch/x86/include/asm/pgtable-3level.h
1101 ++++ b/arch/x86/include/asm/pgtable-3level.h
1102 +@@ -241,12 +241,43 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
1103 + #endif
1104 +
1105 + /* Encode and de-code a swap entry */
1106 ++#define SWP_TYPE_BITS 5
1107 ++
1108 ++#define SWP_OFFSET_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
1109 ++
1110 ++/* We always extract/encode the offset by shifting it all the way up, and then down again */
1111 ++#define SWP_OFFSET_SHIFT (SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)
1112 ++
1113 + #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > 5)
1114 + #define __swp_type(x) (((x).val) & 0x1f)
1115 + #define __swp_offset(x) ((x).val >> 5)
1116 + #define __swp_entry(type, offset) ((swp_entry_t){(type) | (offset) << 5})
1117 +-#define __pte_to_swp_entry(pte) ((swp_entry_t){ (pte).pte_high })
1118 +-#define __swp_entry_to_pte(x) ((pte_t){ { .pte_high = (x).val } })
1119 ++
1120 ++/*
1121 ++ * Normally, __swp_entry() converts from arch-independent swp_entry_t to
1122 ++ * arch-dependent swp_entry_t, and __swp_entry_to_pte() just stores the result
1123 ++ * to pte. But here we have 32bit swp_entry_t and 64bit pte, and need to use the
1124 ++ * whole 64 bits. Thus, we shift the "real" arch-dependent conversion to
1125 ++ * __swp_entry_to_pte() through the following helper macro based on 64bit
1126 ++ * __swp_entry().
1127 ++ */
1128 ++#define __swp_pteval_entry(type, offset) ((pteval_t) { \
1129 ++ (~(pteval_t)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \
1130 ++ | ((pteval_t)(type) << (64 - SWP_TYPE_BITS)) })
1131 ++
1132 ++#define __swp_entry_to_pte(x) ((pte_t){ .pte = \
1133 ++ __swp_pteval_entry(__swp_type(x), __swp_offset(x)) })
1134 ++/*
1135 ++ * Analogically, __pte_to_swp_entry() doesn't just extract the arch-dependent
1136 ++ * swp_entry_t, but also has to convert it from 64bit to the 32bit
1137 ++ * intermediate representation, using the following macros based on 64bit
1138 ++ * __swp_type() and __swp_offset().
1139 ++ */
1140 ++#define __pteval_swp_type(x) ((unsigned long)((x).pte >> (64 - SWP_TYPE_BITS)))
1141 ++#define __pteval_swp_offset(x) ((unsigned long)(~((x).pte) << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT))
1142 ++
1143 ++#define __pte_to_swp_entry(pte) (__swp_entry(__pteval_swp_type(pte), \
1144 ++ __pteval_swp_offset(pte)))
1145 +
1146 + #define gup_get_pte gup_get_pte
1147 + /*
1148 +@@ -295,4 +326,6 @@ static inline pte_t gup_get_pte(pte_t *ptep)
1149 + return pte;
1150 + }
1151 +
1152 ++#include <asm/pgtable-invert.h>
1153 ++
1154 + #endif /* _ASM_X86_PGTABLE_3LEVEL_H */
1155 +diff --git a/arch/x86/include/asm/pgtable-invert.h b/arch/x86/include/asm/pgtable-invert.h
1156 +new file mode 100644
1157 +index 000000000000..44b1203ece12
1158 +--- /dev/null
1159 ++++ b/arch/x86/include/asm/pgtable-invert.h
1160 +@@ -0,0 +1,32 @@
1161 ++/* SPDX-License-Identifier: GPL-2.0 */
1162 ++#ifndef _ASM_PGTABLE_INVERT_H
1163 ++#define _ASM_PGTABLE_INVERT_H 1
1164 ++
1165 ++#ifndef __ASSEMBLY__
1166 ++
1167 ++static inline bool __pte_needs_invert(u64 val)
1168 ++{
1169 ++ return !(val & _PAGE_PRESENT);
1170 ++}
1171 ++
1172 ++/* Get a mask to xor with the page table entry to get the correct pfn. */
1173 ++static inline u64 protnone_mask(u64 val)
1174 ++{
1175 ++ return __pte_needs_invert(val) ? ~0ull : 0;
1176 ++}
1177 ++
1178 ++static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask)
1179 ++{
1180 ++ /*
1181 ++ * When a PTE transitions from NONE to !NONE or vice-versa
1182 ++ * invert the PFN part to stop speculation.
1183 ++ * pte_pfn undoes this when needed.
1184 ++ */
1185 ++ if (__pte_needs_invert(oldval) != __pte_needs_invert(val))
1186 ++ val = (val & ~mask) | (~val & mask);
1187 ++ return val;
1188 ++}
1189 ++
1190 ++#endif /* __ASSEMBLY__ */
1191 ++
1192 ++#endif
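To make the effect concrete, here is a stand-alone user-space sketch of the same inversion scheme; EX_PAGE_PRESENT, EX_PFN_MASK and flip_guard are example stand-ins for the kernel's _PAGE_PRESENT, PTE_PFN_MASK and flip_protnone_guard():

#include <stdio.h>
#include <stdint.h>

#define EX_PAGE_PRESENT	0x1ULL			/* bit 0, as in the x86 PTE layout */
#define EX_PFN_MASK	0x000ffffffffff000ULL	/* PFN bits of a 64-bit PTE */

static int pte_needs_invert(uint64_t val)
{
	return !(val & EX_PAGE_PRESENT);
}

static uint64_t protnone_mask(uint64_t val)
{
	return pte_needs_invert(val) ? ~0ULL : 0;
}

/* Same idea as flip_protnone_guard(): invert the PFN bits when the
 * entry changes between present and not-present. */
static uint64_t flip_guard(uint64_t oldval, uint64_t val, uint64_t mask)
{
	if (pte_needs_invert(oldval) != pte_needs_invert(val))
		val = (val & ~mask) | (~val & mask);
	return val;
}

int main(void)
{
	uint64_t present = (0x12345ULL << 12) | EX_PAGE_PRESENT;
	uint64_t prot_none = flip_guard(present, present & ~EX_PAGE_PRESENT, EX_PFN_MASK);

	/* The stored PFN bits no longer point at the original page ... */
	printf("stored:    %#llx\n", (unsigned long long)(prot_none & EX_PFN_MASK));
	/* ... but xor-ing with protnone_mask() recovers the real PFN, as pte_pfn() does. */
	printf("recovered: %#llx\n",
	       (unsigned long long)(((prot_none ^ protnone_mask(prot_none)) & EX_PFN_MASK) >> 12));
	return 0;
}
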
1193 +diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
1194 +index 5715647fc4fe..13125aad804c 100644
1195 +--- a/arch/x86/include/asm/pgtable.h
1196 ++++ b/arch/x86/include/asm/pgtable.h
1197 +@@ -185,19 +185,29 @@ static inline int pte_special(pte_t pte)
1198 + return pte_flags(pte) & _PAGE_SPECIAL;
1199 + }
1200 +
1201 ++/* Entries that were set to PROT_NONE are inverted */
1202 ++
1203 ++static inline u64 protnone_mask(u64 val);
1204 ++
1205 + static inline unsigned long pte_pfn(pte_t pte)
1206 + {
1207 +- return (pte_val(pte) & PTE_PFN_MASK) >> PAGE_SHIFT;
1208 ++ phys_addr_t pfn = pte_val(pte);
1209 ++ pfn ^= protnone_mask(pfn);
1210 ++ return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT;
1211 + }
1212 +
1213 + static inline unsigned long pmd_pfn(pmd_t pmd)
1214 + {
1215 +- return (pmd_val(pmd) & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
1216 ++ phys_addr_t pfn = pmd_val(pmd);
1217 ++ pfn ^= protnone_mask(pfn);
1218 ++ return (pfn & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
1219 + }
1220 +
1221 + static inline unsigned long pud_pfn(pud_t pud)
1222 + {
1223 +- return (pud_val(pud) & pud_pfn_mask(pud)) >> PAGE_SHIFT;
1224 ++ phys_addr_t pfn = pud_val(pud);
1225 ++ pfn ^= protnone_mask(pfn);
1226 ++ return (pfn & pud_pfn_mask(pud)) >> PAGE_SHIFT;
1227 + }
1228 +
1229 + static inline unsigned long p4d_pfn(p4d_t p4d)
1230 +@@ -400,11 +410,6 @@ static inline pmd_t pmd_mkwrite(pmd_t pmd)
1231 + return pmd_set_flags(pmd, _PAGE_RW);
1232 + }
1233 +
1234 +-static inline pmd_t pmd_mknotpresent(pmd_t pmd)
1235 +-{
1236 +- return pmd_clear_flags(pmd, _PAGE_PRESENT | _PAGE_PROTNONE);
1237 +-}
1238 +-
1239 + static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
1240 + {
1241 + pudval_t v = native_pud_val(pud);
1242 +@@ -459,11 +464,6 @@ static inline pud_t pud_mkwrite(pud_t pud)
1243 + return pud_set_flags(pud, _PAGE_RW);
1244 + }
1245 +
1246 +-static inline pud_t pud_mknotpresent(pud_t pud)
1247 +-{
1248 +- return pud_clear_flags(pud, _PAGE_PRESENT | _PAGE_PROTNONE);
1249 +-}
1250 +-
1251 + #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
1252 + static inline int pte_soft_dirty(pte_t pte)
1253 + {
1254 +@@ -545,25 +545,45 @@ static inline pgprotval_t check_pgprot(pgprot_t pgprot)
1255 +
1256 + static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
1257 + {
1258 +- return __pte(((phys_addr_t)page_nr << PAGE_SHIFT) |
1259 +- check_pgprot(pgprot));
1260 ++ phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;
1261 ++ pfn ^= protnone_mask(pgprot_val(pgprot));
1262 ++ pfn &= PTE_PFN_MASK;
1263 ++ return __pte(pfn | check_pgprot(pgprot));
1264 + }
1265 +
1266 + static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)
1267 + {
1268 +- return __pmd(((phys_addr_t)page_nr << PAGE_SHIFT) |
1269 +- check_pgprot(pgprot));
1270 ++ phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;
1271 ++ pfn ^= protnone_mask(pgprot_val(pgprot));
1272 ++ pfn &= PHYSICAL_PMD_PAGE_MASK;
1273 ++ return __pmd(pfn | check_pgprot(pgprot));
1274 + }
1275 +
1276 + static inline pud_t pfn_pud(unsigned long page_nr, pgprot_t pgprot)
1277 + {
1278 +- return __pud(((phys_addr_t)page_nr << PAGE_SHIFT) |
1279 +- check_pgprot(pgprot));
1280 ++ phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;
1281 ++ pfn ^= protnone_mask(pgprot_val(pgprot));
1282 ++ pfn &= PHYSICAL_PUD_PAGE_MASK;
1283 ++ return __pud(pfn | check_pgprot(pgprot));
1284 + }
1285 +
1286 ++static inline pmd_t pmd_mknotpresent(pmd_t pmd)
1287 ++{
1288 ++ return pfn_pmd(pmd_pfn(pmd),
1289 ++ __pgprot(pmd_flags(pmd) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
1290 ++}
1291 ++
1292 ++static inline pud_t pud_mknotpresent(pud_t pud)
1293 ++{
1294 ++ return pfn_pud(pud_pfn(pud),
1295 ++ __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
1296 ++}
1297 ++
1298 ++static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
1299 ++
1300 + static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
1301 + {
1302 +- pteval_t val = pte_val(pte);
1303 ++ pteval_t val = pte_val(pte), oldval = val;
1304 +
1305 + /*
1306 + * Chop off the NX bit (if present), and add the NX portion of
1307 +@@ -571,17 +591,17 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
1308 + */
1309 + val &= _PAGE_CHG_MASK;
1310 + val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
1311 +-
1312 ++ val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
1313 + return __pte(val);
1314 + }
1315 +
1316 + static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
1317 + {
1318 +- pmdval_t val = pmd_val(pmd);
1319 ++ pmdval_t val = pmd_val(pmd), oldval = val;
1320 +
1321 + val &= _HPAGE_CHG_MASK;
1322 + val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
1323 +-
1324 ++ val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
1325 + return __pmd(val);
1326 + }
1327 +
1328 +@@ -1320,6 +1340,14 @@ static inline bool pud_access_permitted(pud_t pud, bool write)
1329 + return __pte_access_permitted(pud_val(pud), write);
1330 + }
1331 +
1332 ++#define __HAVE_ARCH_PFN_MODIFY_ALLOWED 1
1333 ++extern bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot);
1334 ++
1335 ++static inline bool arch_has_pfn_modify_check(void)
1336 ++{
1337 ++ return boot_cpu_has_bug(X86_BUG_L1TF);
1338 ++}
1339 ++
1340 + #include <asm-generic/pgtable.h>
1341 + #endif /* __ASSEMBLY__ */
1342 +
1343 +diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
1344 +index 3c5385f9a88f..82ff20b0ae45 100644
1345 +--- a/arch/x86/include/asm/pgtable_64.h
1346 ++++ b/arch/x86/include/asm/pgtable_64.h
1347 +@@ -273,7 +273,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
1348 + *
1349 + * | ... | 11| 10| 9|8|7|6|5| 4| 3|2| 1|0| <- bit number
1350 + * | ... |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
1351 +- * | OFFSET (14->63) | TYPE (9-13) |0|0|X|X| X| X|X|SD|0| <- swp entry
1352 ++ * | TYPE (59-63) | ~OFFSET (9-58) |0|0|X|X| X| X|X|SD|0| <- swp entry
1353 + *
1354 + * G (8) is aliased and used as a PROT_NONE indicator for
1355 + * !present ptes. We need to start storing swap entries above
1356 +@@ -286,20 +286,34 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
1357 + *
1358 + * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
1359 + * but also L and G.
1360 ++ *
1361 ++ * The offset is inverted by a binary not operation to make the high
1362 ++ * physical bits set.
1363 + */
1364 +-#define SWP_TYPE_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
1365 +-#define SWP_TYPE_BITS 5
1366 +-/* Place the offset above the type: */
1367 +-#define SWP_OFFSET_FIRST_BIT (SWP_TYPE_FIRST_BIT + SWP_TYPE_BITS)
1368 ++#define SWP_TYPE_BITS 5
1369 ++
1370 ++#define SWP_OFFSET_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
1371 ++
1372 ++/* We always extract/encode the offset by shifting it all the way up, and then down again */
1373 ++#define SWP_OFFSET_SHIFT (SWP_OFFSET_FIRST_BIT+SWP_TYPE_BITS)
1374 +
1375 + #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
1376 +
1377 +-#define __swp_type(x) (((x).val >> (SWP_TYPE_FIRST_BIT)) \
1378 +- & ((1U << SWP_TYPE_BITS) - 1))
1379 +-#define __swp_offset(x) ((x).val >> SWP_OFFSET_FIRST_BIT)
1380 +-#define __swp_entry(type, offset) ((swp_entry_t) { \
1381 +- ((type) << (SWP_TYPE_FIRST_BIT)) \
1382 +- | ((offset) << SWP_OFFSET_FIRST_BIT) })
1383 ++/* Extract the high bits for type */
1384 ++#define __swp_type(x) ((x).val >> (64 - SWP_TYPE_BITS))
1385 ++
1386 ++/* Shift up (to get rid of type), then down to get value */
1387 ++#define __swp_offset(x) (~(x).val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT)
1388 ++
1389 ++/*
1390 ++ * Shift the offset up "too far" by TYPE bits, then down again
1391 ++ * The offset is inverted by a binary not operation to make the high
1392 ++ * physical bits set.
1393 ++ */
1394 ++#define __swp_entry(type, offset) ((swp_entry_t) { \
1395 ++ (~(unsigned long)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \
1396 ++ | ((unsigned long)(type) << (64-SWP_TYPE_BITS)) })
1397 ++
1398 + #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val((pte)) })
1399 + #define __pmd_to_swp_entry(pmd) ((swp_entry_t) { pmd_val((pmd)) })
1400 + #define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
1401 +@@ -343,5 +357,7 @@ static inline bool gup_fast_permitted(unsigned long start, int nr_pages,
1402 + return true;
1403 + }
1404 +
1405 ++#include <asm/pgtable-invert.h>
1406 ++
1407 + #endif /* !__ASSEMBLY__ */
1408 + #endif /* _ASM_X86_PGTABLE_64_H */
1409 +diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
1410 +index cfd29ee8c3da..79e409974ccc 100644
1411 +--- a/arch/x86/include/asm/processor.h
1412 ++++ b/arch/x86/include/asm/processor.h
1413 +@@ -181,6 +181,11 @@ extern const struct seq_operations cpuinfo_op;
1414 +
1415 + extern void cpu_detect(struct cpuinfo_x86 *c);
1416 +
1417 ++static inline unsigned long l1tf_pfn_limit(void)
1418 ++{
1419 ++ return BIT(boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT) - 1;
1420 ++}
1421 ++
1422 + extern void early_cpu_init(void);
1423 + extern void identify_boot_cpu(void);
1424 + extern void identify_secondary_cpu(struct cpuinfo_x86 *);
1425 +@@ -977,4 +982,16 @@ bool xen_set_default_idle(void);
1426 + void stop_this_cpu(void *dummy);
1427 + void df_debug(struct pt_regs *regs, long error_code);
1428 + void microcode_check(void);
1429 ++
1430 ++enum l1tf_mitigations {
1431 ++ L1TF_MITIGATION_OFF,
1432 ++ L1TF_MITIGATION_FLUSH_NOWARN,
1433 ++ L1TF_MITIGATION_FLUSH,
1434 ++ L1TF_MITIGATION_FLUSH_NOSMT,
1435 ++ L1TF_MITIGATION_FULL,
1436 ++ L1TF_MITIGATION_FULL_FORCE
1437 ++};
1438 ++
1439 ++extern enum l1tf_mitigations l1tf_mitigation;
1440 ++
1441 + #endif /* _ASM_X86_PROCESSOR_H */
1442 +diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
1443 +index c1d2a9892352..453cf38a1c33 100644
1444 +--- a/arch/x86/include/asm/topology.h
1445 ++++ b/arch/x86/include/asm/topology.h
1446 +@@ -123,13 +123,17 @@ static inline int topology_max_smt_threads(void)
1447 + }
1448 +
1449 + int topology_update_package_map(unsigned int apicid, unsigned int cpu);
1450 +-extern int topology_phys_to_logical_pkg(unsigned int pkg);
1451 ++int topology_phys_to_logical_pkg(unsigned int pkg);
1452 ++bool topology_is_primary_thread(unsigned int cpu);
1453 ++bool topology_smt_supported(void);
1454 + #else
1455 + #define topology_max_packages() (1)
1456 + static inline int
1457 + topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; }
1458 + static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
1459 + static inline int topology_max_smt_threads(void) { return 1; }
1460 ++static inline bool topology_is_primary_thread(unsigned int cpu) { return true; }
1461 ++static inline bool topology_smt_supported(void) { return false; }
1462 + #endif
1463 +
1464 + static inline void arch_fix_phys_package_id(int num, u32 slot)
1465 +diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
1466 +index 6aa8499e1f62..95f9107449bf 100644
1467 +--- a/arch/x86/include/asm/vmx.h
1468 ++++ b/arch/x86/include/asm/vmx.h
1469 +@@ -576,4 +576,15 @@ enum vm_instruction_error_number {
1470 + VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID = 28,
1471 + };
1472 +
1473 ++enum vmx_l1d_flush_state {
1474 ++ VMENTER_L1D_FLUSH_AUTO,
1475 ++ VMENTER_L1D_FLUSH_NEVER,
1476 ++ VMENTER_L1D_FLUSH_COND,
1477 ++ VMENTER_L1D_FLUSH_ALWAYS,
1478 ++ VMENTER_L1D_FLUSH_EPT_DISABLED,
1479 ++ VMENTER_L1D_FLUSH_NOT_REQUIRED,
1480 ++};
1481 ++
1482 ++extern enum vmx_l1d_flush_state l1tf_vmx_mitigation;
1483 ++
1484 + #endif
1485 +diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
1486 +index adbda5847b14..3b3a2d0af78d 100644
1487 +--- a/arch/x86/kernel/apic/apic.c
1488 ++++ b/arch/x86/kernel/apic/apic.c
1489 +@@ -56,6 +56,7 @@
1490 + #include <asm/hypervisor.h>
1491 + #include <asm/cpu_device_id.h>
1492 + #include <asm/intel-family.h>
1493 ++#include <asm/irq_regs.h>
1494 +
1495 + unsigned int num_processors;
1496 +
1497 +@@ -2192,6 +2193,23 @@ static int cpuid_to_apicid[] = {
1498 + [0 ... NR_CPUS - 1] = -1,
1499 + };
1500 +
1501 ++#ifdef CONFIG_SMP
1502 ++/**
1503 ++ * apic_id_is_primary_thread - Check whether APIC ID belongs to a primary thread
1504 ++ * @id: APIC ID to check
1505 ++ */
1506 ++bool apic_id_is_primary_thread(unsigned int apicid)
1507 ++{
1508 ++ u32 mask;
1509 ++
1510 ++ if (smp_num_siblings == 1)
1511 ++ return true;
1512 ++ /* Isolate the SMT bit(s) in the APICID and check for 0 */
1513 ++ mask = (1U << (fls(smp_num_siblings) - 1)) - 1;
1514 ++ return !(apicid & mask);
1515 ++}
1516 ++#endif
1517 ++
1518 + /*
1519 + * Should use this API to allocate logical CPU IDs to keep nr_logical_cpuids
1520 + * and cpuid_to_apicid[] synchronized.
1521 +diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
1522 +index 3982f79d2377..ff0d14cd9e82 100644
1523 +--- a/arch/x86/kernel/apic/io_apic.c
1524 ++++ b/arch/x86/kernel/apic/io_apic.c
1525 +@@ -33,6 +33,7 @@
1526 +
1527 + #include <linux/mm.h>
1528 + #include <linux/interrupt.h>
1529 ++#include <linux/irq.h>
1530 + #include <linux/init.h>
1531 + #include <linux/delay.h>
1532 + #include <linux/sched.h>
1533 +diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
1534 +index ce503c99f5c4..72a94401f9e0 100644
1535 +--- a/arch/x86/kernel/apic/msi.c
1536 ++++ b/arch/x86/kernel/apic/msi.c
1537 +@@ -12,6 +12,7 @@
1538 + */
1539 + #include <linux/mm.h>
1540 + #include <linux/interrupt.h>
1541 ++#include <linux/irq.h>
1542 + #include <linux/pci.h>
1543 + #include <linux/dmar.h>
1544 + #include <linux/hpet.h>
1545 +diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
1546 +index 35aaee4fc028..c9b773401fd8 100644
1547 +--- a/arch/x86/kernel/apic/vector.c
1548 ++++ b/arch/x86/kernel/apic/vector.c
1549 +@@ -11,6 +11,7 @@
1550 + * published by the Free Software Foundation.
1551 + */
1552 + #include <linux/interrupt.h>
1553 ++#include <linux/irq.h>
1554 + #include <linux/seq_file.h>
1555 + #include <linux/init.h>
1556 + #include <linux/compiler.h>
1557 +diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
1558 +index 38915fbfae73..97e962afb967 100644
1559 +--- a/arch/x86/kernel/cpu/amd.c
1560 ++++ b/arch/x86/kernel/cpu/amd.c
1561 +@@ -315,6 +315,13 @@ static void legacy_fixup_core_id(struct cpuinfo_x86 *c)
1562 + c->cpu_core_id %= cus_per_node;
1563 + }
1564 +
1565 ++
1566 ++static void amd_get_topology_early(struct cpuinfo_x86 *c)
1567 ++{
1568 ++ if (cpu_has(c, X86_FEATURE_TOPOEXT))
1569 ++ smp_num_siblings = ((cpuid_ebx(0x8000001e) >> 8) & 0xff) + 1;
1570 ++}
1571 ++
1572 + /*
1573 + * Fixup core topology information for
1574 + * (1) AMD multi-node processors
1575 +@@ -334,7 +341,6 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
1576 + cpuid(0x8000001e, &eax, &ebx, &ecx, &edx);
1577 +
1578 + node_id = ecx & 0xff;
1579 +- smp_num_siblings = ((ebx >> 8) & 0xff) + 1;
1580 +
1581 + if (c->x86 == 0x15)
1582 + c->cu_id = ebx & 0xff;
1583 +@@ -613,6 +619,7 @@ clear_sev:
1584 +
1585 + static void early_init_amd(struct cpuinfo_x86 *c)
1586 + {
1587 ++ u64 value;
1588 + u32 dummy;
1589 +
1590 + early_init_amd_mc(c);
1591 +@@ -683,6 +690,22 @@ static void early_init_amd(struct cpuinfo_x86 *c)
1592 + set_cpu_bug(c, X86_BUG_AMD_E400);
1593 +
1594 + early_detect_mem_encrypt(c);
1595 ++
1596 ++ /* Re-enable TopologyExtensions if switched off by BIOS */
1597 ++ if (c->x86 == 0x15 &&
1598 ++ (c->x86_model >= 0x10 && c->x86_model <= 0x6f) &&
1599 ++ !cpu_has(c, X86_FEATURE_TOPOEXT)) {
1600 ++
1601 ++ if (msr_set_bit(0xc0011005, 54) > 0) {
1602 ++ rdmsrl(0xc0011005, value);
1603 ++ if (value & BIT_64(54)) {
1604 ++ set_cpu_cap(c, X86_FEATURE_TOPOEXT);
1605 ++ pr_info_once(FW_INFO "CPU: Re-enabling disabled Topology Extensions Support.\n");
1606 ++ }
1607 ++ }
1608 ++ }
1609 ++
1610 ++ amd_get_topology_early(c);
1611 + }
1612 +
1613 + static void init_amd_k8(struct cpuinfo_x86 *c)
1614 +@@ -774,19 +797,6 @@ static void init_amd_bd(struct cpuinfo_x86 *c)
1615 + {
1616 + u64 value;
1617 +
1618 +- /* re-enable TopologyExtensions if switched off by BIOS */
1619 +- if ((c->x86_model >= 0x10) && (c->x86_model <= 0x6f) &&
1620 +- !cpu_has(c, X86_FEATURE_TOPOEXT)) {
1621 +-
1622 +- if (msr_set_bit(0xc0011005, 54) > 0) {
1623 +- rdmsrl(0xc0011005, value);
1624 +- if (value & BIT_64(54)) {
1625 +- set_cpu_cap(c, X86_FEATURE_TOPOEXT);
1626 +- pr_info_once(FW_INFO "CPU: Re-enabling disabled Topology Extensions Support.\n");
1627 +- }
1628 +- }
1629 +- }
1630 +-
1631 + /*
1632 + * The way access filter has a performance penalty on some workloads.
1633 + * Disable it on the affected CPUs.
1634 +@@ -850,16 +860,9 @@ static void init_amd(struct cpuinfo_x86 *c)
1635 +
1636 + cpu_detect_cache_sizes(c);
1637 +
1638 +- /* Multi core CPU? */
1639 +- if (c->extended_cpuid_level >= 0x80000008) {
1640 +- amd_detect_cmp(c);
1641 +- amd_get_topology(c);
1642 +- srat_detect_node(c);
1643 +- }
1644 +-
1645 +-#ifdef CONFIG_X86_32
1646 +- detect_ht(c);
1647 +-#endif
1648 ++ amd_detect_cmp(c);
1649 ++ amd_get_topology(c);
1650 ++ srat_detect_node(c);
1651 +
1652 + init_amd_cacheinfo(c);
1653 +
1654 +diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
1655 +index 5c0ea39311fe..c4f0ae49a53d 100644
1656 +--- a/arch/x86/kernel/cpu/bugs.c
1657 ++++ b/arch/x86/kernel/cpu/bugs.c
1658 +@@ -22,15 +22,18 @@
1659 + #include <asm/processor-flags.h>
1660 + #include <asm/fpu/internal.h>
1661 + #include <asm/msr.h>
1662 ++#include <asm/vmx.h>
1663 + #include <asm/paravirt.h>
1664 + #include <asm/alternative.h>
1665 + #include <asm/pgtable.h>
1666 + #include <asm/set_memory.h>
1667 + #include <asm/intel-family.h>
1668 + #include <asm/hypervisor.h>
1669 ++#include <asm/e820/api.h>
1670 +
1671 + static void __init spectre_v2_select_mitigation(void);
1672 + static void __init ssb_select_mitigation(void);
1673 ++static void __init l1tf_select_mitigation(void);
1674 +
1675 + /*
1676 + * Our boot-time value of the SPEC_CTRL MSR. We read it once so that any
1677 +@@ -56,6 +59,12 @@ void __init check_bugs(void)
1678 + {
1679 + identify_boot_cpu();
1680 +
1681 ++ /*
1682 ++ * identify_boot_cpu() initialized SMT support information, let the
1683 ++ * core code know.
1684 ++ */
1685 ++ cpu_smt_check_topology_early();
1686 ++
1687 + if (!IS_ENABLED(CONFIG_SMP)) {
1688 + pr_info("CPU: ");
1689 + print_cpu_info(&boot_cpu_data);
1690 +@@ -82,6 +91,8 @@ void __init check_bugs(void)
1691 + */
1692 + ssb_select_mitigation();
1693 +
1694 ++ l1tf_select_mitigation();
1695 ++
1696 + #ifdef CONFIG_X86_32
1697 + /*
1698 + * Check whether we are able to run this kernel safely on SMP.
1699 +@@ -313,23 +324,6 @@ static enum spectre_v2_mitigation_cmd __init spectre_v2_parse_cmdline(void)
1700 + return cmd;
1701 + }
1702 +
1703 +-/* Check for Skylake-like CPUs (for RSB handling) */
1704 +-static bool __init is_skylake_era(void)
1705 +-{
1706 +- if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
1707 +- boot_cpu_data.x86 == 6) {
1708 +- switch (boot_cpu_data.x86_model) {
1709 +- case INTEL_FAM6_SKYLAKE_MOBILE:
1710 +- case INTEL_FAM6_SKYLAKE_DESKTOP:
1711 +- case INTEL_FAM6_SKYLAKE_X:
1712 +- case INTEL_FAM6_KABYLAKE_MOBILE:
1713 +- case INTEL_FAM6_KABYLAKE_DESKTOP:
1714 +- return true;
1715 +- }
1716 +- }
1717 +- return false;
1718 +-}
1719 +-
1720 + static void __init spectre_v2_select_mitigation(void)
1721 + {
1722 + enum spectre_v2_mitigation_cmd cmd = spectre_v2_parse_cmdline();
1723 +@@ -390,22 +384,15 @@ retpoline_auto:
1724 + pr_info("%s\n", spectre_v2_strings[mode]);
1725 +
1726 + /*
1727 +- * If neither SMEP nor PTI are available, there is a risk of
1728 +- * hitting userspace addresses in the RSB after a context switch
1729 +- * from a shallow call stack to a deeper one. To prevent this fill
1730 +- * the entire RSB, even when using IBRS.
1731 ++ * If spectre v2 protection has been enabled, unconditionally fill
1732 ++ * RSB during a context switch; this protects against two independent
1733 ++ * issues:
1734 + *
1735 +- * Skylake era CPUs have a separate issue with *underflow* of the
1736 +- * RSB, when they will predict 'ret' targets from the generic BTB.
1737 +- * The proper mitigation for this is IBRS. If IBRS is not supported
1738 +- * or deactivated in favour of retpolines the RSB fill on context
1739 +- * switch is required.
1740 ++ * - RSB underflow (and switch to BTB) on Skylake+
1741 ++ * - SpectreRSB variant of spectre v2 on X86_BUG_SPECTRE_V2 CPUs
1742 + */
1743 +- if ((!boot_cpu_has(X86_FEATURE_PTI) &&
1744 +- !boot_cpu_has(X86_FEATURE_SMEP)) || is_skylake_era()) {
1745 +- setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
1746 +- pr_info("Spectre v2 mitigation: Filling RSB on context switch\n");
1747 +- }
1748 ++ setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
1749 ++ pr_info("Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch\n");
1750 +
1751 + /* Initialize Indirect Branch Prediction Barrier if supported */
1752 + if (boot_cpu_has(X86_FEATURE_IBPB)) {
1753 +@@ -654,8 +641,121 @@ void x86_spec_ctrl_setup_ap(void)
1754 + x86_amd_ssb_disable();
1755 + }
1756 +
1757 ++#undef pr_fmt
1758 ++#define pr_fmt(fmt) "L1TF: " fmt
1759 ++
1760 ++/* Default mitigation for L1TF-affected CPUs */
1761 ++enum l1tf_mitigations l1tf_mitigation __ro_after_init = L1TF_MITIGATION_FLUSH;
1762 ++#if IS_ENABLED(CONFIG_KVM_INTEL)
1763 ++EXPORT_SYMBOL_GPL(l1tf_mitigation);
1764 ++
1765 ++enum vmx_l1d_flush_state l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
1766 ++EXPORT_SYMBOL_GPL(l1tf_vmx_mitigation);
1767 ++#endif
1768 ++
1769 ++static void __init l1tf_select_mitigation(void)
1770 ++{
1771 ++ u64 half_pa;
1772 ++
1773 ++ if (!boot_cpu_has_bug(X86_BUG_L1TF))
1774 ++ return;
1775 ++
1776 ++ switch (l1tf_mitigation) {
1777 ++ case L1TF_MITIGATION_OFF:
1778 ++ case L1TF_MITIGATION_FLUSH_NOWARN:
1779 ++ case L1TF_MITIGATION_FLUSH:
1780 ++ break;
1781 ++ case L1TF_MITIGATION_FLUSH_NOSMT:
1782 ++ case L1TF_MITIGATION_FULL:
1783 ++ cpu_smt_disable(false);
1784 ++ break;
1785 ++ case L1TF_MITIGATION_FULL_FORCE:
1786 ++ cpu_smt_disable(true);
1787 ++ break;
1788 ++ }
1789 ++
1790 ++#if CONFIG_PGTABLE_LEVELS == 2
1791 ++ pr_warn("Kernel not compiled for PAE. No mitigation for L1TF\n");
1792 ++ return;
1793 ++#endif
1794 ++
1795 ++ /*
1796 ++ * This is extremely unlikely to happen because almost all
1797 ++ * systems have far more MAX_PA/2 than RAM can be fit into
1798 ++ * DIMM slots.
1799 ++ */
1800 ++ half_pa = (u64)l1tf_pfn_limit() << PAGE_SHIFT;
1801 ++ if (e820__mapped_any(half_pa, ULLONG_MAX - half_pa, E820_TYPE_RAM)) {
1802 ++ pr_warn("System has more than MAX_PA/2 memory. L1TF mitigation not effective.\n");
1803 ++ return;
1804 ++ }
1805 ++
1806 ++ setup_force_cpu_cap(X86_FEATURE_L1TF_PTEINV);
1807 ++}
1808 ++
1809 ++static int __init l1tf_cmdline(char *str)
1810 ++{
1811 ++ if (!boot_cpu_has_bug(X86_BUG_L1TF))
1812 ++ return 0;
1813 ++
1814 ++ if (!str)
1815 ++ return -EINVAL;
1816 ++
1817 ++ if (!strcmp(str, "off"))
1818 ++ l1tf_mitigation = L1TF_MITIGATION_OFF;
1819 ++ else if (!strcmp(str, "flush,nowarn"))
1820 ++ l1tf_mitigation = L1TF_MITIGATION_FLUSH_NOWARN;
1821 ++ else if (!strcmp(str, "flush"))
1822 ++ l1tf_mitigation = L1TF_MITIGATION_FLUSH;
1823 ++ else if (!strcmp(str, "flush,nosmt"))
1824 ++ l1tf_mitigation = L1TF_MITIGATION_FLUSH_NOSMT;
1825 ++ else if (!strcmp(str, "full"))
1826 ++ l1tf_mitigation = L1TF_MITIGATION_FULL;
1827 ++ else if (!strcmp(str, "full,force"))
1828 ++ l1tf_mitigation = L1TF_MITIGATION_FULL_FORCE;
1829 ++
1830 ++ return 0;
1831 ++}
1832 ++early_param("l1tf", l1tf_cmdline);
1833 ++
1834 ++#undef pr_fmt
1835 ++
1836 + #ifdef CONFIG_SYSFS
1837 +
1838 ++#define L1TF_DEFAULT_MSG "Mitigation: PTE Inversion"
1839 ++
1840 ++#if IS_ENABLED(CONFIG_KVM_INTEL)
1841 ++static const char *l1tf_vmx_states[] = {
1842 ++ [VMENTER_L1D_FLUSH_AUTO] = "auto",
1843 ++ [VMENTER_L1D_FLUSH_NEVER] = "vulnerable",
1844 ++ [VMENTER_L1D_FLUSH_COND] = "conditional cache flushes",
1845 ++ [VMENTER_L1D_FLUSH_ALWAYS] = "cache flushes",
1846 ++ [VMENTER_L1D_FLUSH_EPT_DISABLED] = "EPT disabled",
1847 ++ [VMENTER_L1D_FLUSH_NOT_REQUIRED] = "flush not necessary"
1848 ++};
1849 ++
1850 ++static ssize_t l1tf_show_state(char *buf)
1851 ++{
1852 ++ if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_AUTO)
1853 ++ return sprintf(buf, "%s\n", L1TF_DEFAULT_MSG);
1854 ++
1855 ++ if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_EPT_DISABLED ||
1856 ++ (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_NEVER &&
1857 ++ cpu_smt_control == CPU_SMT_ENABLED))
1858 ++ return sprintf(buf, "%s; VMX: %s\n", L1TF_DEFAULT_MSG,
1859 ++ l1tf_vmx_states[l1tf_vmx_mitigation]);
1860 ++
1861 ++ return sprintf(buf, "%s; VMX: %s, SMT %s\n", L1TF_DEFAULT_MSG,
1862 ++ l1tf_vmx_states[l1tf_vmx_mitigation],
1863 ++ cpu_smt_control == CPU_SMT_ENABLED ? "vulnerable" : "disabled");
1864 ++}
1865 ++#else
1866 ++static ssize_t l1tf_show_state(char *buf)
1867 ++{
1868 ++ return sprintf(buf, "%s\n", L1TF_DEFAULT_MSG);
1869 ++}
1870 ++#endif
1871 ++
1872 + static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr,
1873 + char *buf, unsigned int bug)
1874 + {
1875 +@@ -684,6 +784,10 @@ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr
1876 + case X86_BUG_SPEC_STORE_BYPASS:
1877 + return sprintf(buf, "%s\n", ssb_strings[ssb_mode]);
1878 +
1879 ++ case X86_BUG_L1TF:
1880 ++ if (boot_cpu_has(X86_FEATURE_L1TF_PTEINV))
1881 ++ return l1tf_show_state(buf);
1882 ++ break;
1883 + default:
1884 + break;
1885 + }
1886 +@@ -710,4 +814,9 @@ ssize_t cpu_show_spec_store_bypass(struct device *dev, struct device_attribute *
1887 + {
1888 + return cpu_show_common(dev, attr, buf, X86_BUG_SPEC_STORE_BYPASS);
1889 + }
1890 ++
1891 ++ssize_t cpu_show_l1tf(struct device *dev, struct device_attribute *attr, char *buf)
1892 ++{
1893 ++ return cpu_show_common(dev, attr, buf, X86_BUG_L1TF);
1894 ++}
1895 + #endif
1896 +diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
1897 +index eb4cb3efd20e..9eda6f730ec4 100644
1898 +--- a/arch/x86/kernel/cpu/common.c
1899 ++++ b/arch/x86/kernel/cpu/common.c
1900 +@@ -661,33 +661,36 @@ static void cpu_detect_tlb(struct cpuinfo_x86 *c)
1901 + tlb_lld_4m[ENTRIES], tlb_lld_1g[ENTRIES]);
1902 + }
1903 +
1904 +-void detect_ht(struct cpuinfo_x86 *c)
1905 ++int detect_ht_early(struct cpuinfo_x86 *c)
1906 + {
1907 + #ifdef CONFIG_SMP
1908 + u32 eax, ebx, ecx, edx;
1909 +- int index_msb, core_bits;
1910 +- static bool printed;
1911 +
1912 + if (!cpu_has(c, X86_FEATURE_HT))
1913 +- return;
1914 ++ return -1;
1915 +
1916 + if (cpu_has(c, X86_FEATURE_CMP_LEGACY))
1917 +- goto out;
1918 ++ return -1;
1919 +
1920 + if (cpu_has(c, X86_FEATURE_XTOPOLOGY))
1921 +- return;
1922 ++ return -1;
1923 +
1924 + cpuid(1, &eax, &ebx, &ecx, &edx);
1925 +
1926 + smp_num_siblings = (ebx & 0xff0000) >> 16;
1927 +-
1928 +- if (smp_num_siblings == 1) {
1929 ++ if (smp_num_siblings == 1)
1930 + pr_info_once("CPU0: Hyper-Threading is disabled\n");
1931 +- goto out;
1932 +- }
1933 ++#endif
1934 ++ return 0;
1935 ++}
1936 +
1937 +- if (smp_num_siblings <= 1)
1938 +- goto out;
1939 ++void detect_ht(struct cpuinfo_x86 *c)
1940 ++{
1941 ++#ifdef CONFIG_SMP
1942 ++ int index_msb, core_bits;
1943 ++
1944 ++ if (detect_ht_early(c) < 0)
1945 ++ return;
1946 +
1947 + index_msb = get_count_order(smp_num_siblings);
1948 + c->phys_proc_id = apic->phys_pkg_id(c->initial_apicid, index_msb);
1949 +@@ -700,15 +703,6 @@ void detect_ht(struct cpuinfo_x86 *c)
1950 +
1951 + c->cpu_core_id = apic->phys_pkg_id(c->initial_apicid, index_msb) &
1952 + ((1 << core_bits) - 1);
1953 +-
1954 +-out:
1955 +- if (!printed && (c->x86_max_cores * smp_num_siblings) > 1) {
1956 +- pr_info("CPU: Physical Processor ID: %d\n",
1957 +- c->phys_proc_id);
1958 +- pr_info("CPU: Processor Core ID: %d\n",
1959 +- c->cpu_core_id);
1960 +- printed = 1;
1961 +- }
1962 + #endif
1963 + }
1964 +
1965 +@@ -987,6 +981,21 @@ static const __initconst struct x86_cpu_id cpu_no_spec_store_bypass[] = {
1966 + {}
1967 + };
1968 +
1969 ++static const __initconst struct x86_cpu_id cpu_no_l1tf[] = {
1970 ++ /* in addition to cpu_no_speculation */
1971 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT1 },
1972 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT2 },
1973 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_AIRMONT },
1974 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_MERRIFIELD },
1975 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_MOOREFIELD },
1976 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_GOLDMONT },
1977 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_DENVERTON },
1978 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_GEMINI_LAKE },
1979 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNL },
1980 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNM },
1981 ++ {}
1982 ++};
1983 ++
1984 + static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
1985 + {
1986 + u64 ia32_cap = 0;
1987 +@@ -1013,6 +1022,11 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
1988 + return;
1989 +
1990 + setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN);
1991 ++
1992 ++ if (x86_match_cpu(cpu_no_l1tf))
1993 ++ return;
1994 ++
1995 ++ setup_force_cpu_bug(X86_BUG_L1TF);
1996 + }
1997 +
1998 + /*
1999 +diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h
2000 +index 38216f678fc3..e59c0ea82a33 100644
2001 +--- a/arch/x86/kernel/cpu/cpu.h
2002 ++++ b/arch/x86/kernel/cpu/cpu.h
2003 +@@ -55,7 +55,9 @@ extern void init_intel_cacheinfo(struct cpuinfo_x86 *c);
2004 + extern void init_amd_cacheinfo(struct cpuinfo_x86 *c);
2005 +
2006 + extern void detect_num_cpu_cores(struct cpuinfo_x86 *c);
2007 ++extern int detect_extended_topology_early(struct cpuinfo_x86 *c);
2008 + extern int detect_extended_topology(struct cpuinfo_x86 *c);
2009 ++extern int detect_ht_early(struct cpuinfo_x86 *c);
2010 + extern void detect_ht(struct cpuinfo_x86 *c);
2011 +
2012 + unsigned int aperfmperf_get_khz(int cpu);
2013 +diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
2014 +index eb75564f2d25..6602941cfebf 100644
2015 +--- a/arch/x86/kernel/cpu/intel.c
2016 ++++ b/arch/x86/kernel/cpu/intel.c
2017 +@@ -301,6 +301,13 @@ static void early_init_intel(struct cpuinfo_x86 *c)
2018 + }
2019 +
2020 + check_mpx_erratum(c);
2021 ++
2022 ++ /*
2023 ++ * Get the number of SMT siblings early from the extended topology
2024 ++ * leaf, if available. Otherwise try the legacy SMT detection.
2025 ++ */
2026 ++ if (detect_extended_topology_early(c) < 0)
2027 ++ detect_ht_early(c);
2028 + }
2029 +
2030 + #ifdef CONFIG_X86_32
2031 +diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c
2032 +index 08286269fd24..b9bc8a1a584e 100644
2033 +--- a/arch/x86/kernel/cpu/microcode/core.c
2034 ++++ b/arch/x86/kernel/cpu/microcode/core.c
2035 +@@ -509,12 +509,20 @@ static struct platform_device *microcode_pdev;
2036 +
2037 + static int check_online_cpus(void)
2038 + {
2039 +- if (num_online_cpus() == num_present_cpus())
2040 +- return 0;
2041 ++ unsigned int cpu;
2042 +
2043 +- pr_err("Not all CPUs online, aborting microcode update.\n");
2044 ++ /*
2045 ++ * Make sure all CPUs are online. It's fine for SMT to be disabled if
2046 ++ * all the primary threads are still online.
2047 ++ */
2048 ++ for_each_present_cpu(cpu) {
2049 ++ if (topology_is_primary_thread(cpu) && !cpu_online(cpu)) {
2050 ++ pr_err("Not all CPUs online, aborting microcode update.\n");
2051 ++ return -EINVAL;
2052 ++ }
2053 ++ }
2054 +
2055 +- return -EINVAL;
2056 ++ return 0;
2057 + }
2058 +
2059 + static atomic_t late_cpus_in;
2060 +diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
2061 +index 81c0afb39d0a..71ca064e3794 100644
2062 +--- a/arch/x86/kernel/cpu/topology.c
2063 ++++ b/arch/x86/kernel/cpu/topology.c
2064 +@@ -22,18 +22,10 @@
2065 + #define BITS_SHIFT_NEXT_LEVEL(eax) ((eax) & 0x1f)
2066 + #define LEVEL_MAX_SIBLINGS(ebx) ((ebx) & 0xffff)
2067 +
2068 +-/*
2069 +- * Check for extended topology enumeration cpuid leaf 0xb and if it
2070 +- * exists, use it for populating initial_apicid and cpu topology
2071 +- * detection.
2072 +- */
2073 +-int detect_extended_topology(struct cpuinfo_x86 *c)
2074 ++int detect_extended_topology_early(struct cpuinfo_x86 *c)
2075 + {
2076 + #ifdef CONFIG_SMP
2077 +- unsigned int eax, ebx, ecx, edx, sub_index;
2078 +- unsigned int ht_mask_width, core_plus_mask_width;
2079 +- unsigned int core_select_mask, core_level_siblings;
2080 +- static bool printed;
2081 ++ unsigned int eax, ebx, ecx, edx;
2082 +
2083 + if (c->cpuid_level < 0xb)
2084 + return -1;
2085 +@@ -52,10 +44,30 @@ int detect_extended_topology(struct cpuinfo_x86 *c)
2086 + * initial apic id, which also represents 32-bit extended x2apic id.
2087 + */
2088 + c->initial_apicid = edx;
2089 ++ smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);
2090 ++#endif
2091 ++ return 0;
2092 ++}
2093 ++
2094 ++/*
2095 ++ * Check for extended topology enumeration cpuid leaf 0xb and if it
2096 ++ * exists, use it for populating initial_apicid and cpu topology
2097 ++ * detection.
2098 ++ */
2099 ++int detect_extended_topology(struct cpuinfo_x86 *c)
2100 ++{
2101 ++#ifdef CONFIG_SMP
2102 ++ unsigned int eax, ebx, ecx, edx, sub_index;
2103 ++ unsigned int ht_mask_width, core_plus_mask_width;
2104 ++ unsigned int core_select_mask, core_level_siblings;
2105 ++
2106 ++ if (detect_extended_topology_early(c) < 0)
2107 ++ return -1;
2108 +
2109 + /*
2110 + * Populate HT related information from sub-leaf level 0.
2111 + */
2112 ++ cpuid_count(0xb, SMT_LEVEL, &eax, &ebx, &ecx, &edx);
2113 + core_level_siblings = smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);
2114 + core_plus_mask_width = ht_mask_width = BITS_SHIFT_NEXT_LEVEL(eax);
2115 +
2116 +@@ -86,15 +98,6 @@ int detect_extended_topology(struct cpuinfo_x86 *c)
2117 + c->apicid = apic->phys_pkg_id(c->initial_apicid, 0);
2118 +
2119 + c->x86_max_cores = (core_level_siblings / smp_num_siblings);
2120 +-
2121 +- if (!printed) {
2122 +- pr_info("CPU: Physical Processor ID: %d\n",
2123 +- c->phys_proc_id);
2124 +- if (c->x86_max_cores > 1)
2125 +- pr_info("CPU: Processor Core ID: %d\n",
2126 +- c->cpu_core_id);
2127 +- printed = 1;
2128 +- }
2129 + #endif
2130 + return 0;
2131 + }
2132 +diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
2133 +index f92a6593de1e..2ea85b32421a 100644
2134 +--- a/arch/x86/kernel/fpu/core.c
2135 ++++ b/arch/x86/kernel/fpu/core.c
2136 +@@ -10,6 +10,7 @@
2137 + #include <asm/fpu/signal.h>
2138 + #include <asm/fpu/types.h>
2139 + #include <asm/traps.h>
2140 ++#include <asm/irq_regs.h>
2141 +
2142 + #include <linux/hardirq.h>
2143 + #include <linux/pkeys.h>
2144 +diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
2145 +index 346b24883911..b0acb22e5a46 100644
2146 +--- a/arch/x86/kernel/hpet.c
2147 ++++ b/arch/x86/kernel/hpet.c
2148 +@@ -1,6 +1,7 @@
2149 + #include <linux/clocksource.h>
2150 + #include <linux/clockchips.h>
2151 + #include <linux/interrupt.h>
2152 ++#include <linux/irq.h>
2153 + #include <linux/export.h>
2154 + #include <linux/delay.h>
2155 + #include <linux/errno.h>
2156 +diff --git a/arch/x86/kernel/i8259.c b/arch/x86/kernel/i8259.c
2157 +index 86c4439f9d74..519649ddf100 100644
2158 +--- a/arch/x86/kernel/i8259.c
2159 ++++ b/arch/x86/kernel/i8259.c
2160 +@@ -5,6 +5,7 @@
2161 + #include <linux/sched.h>
2162 + #include <linux/ioport.h>
2163 + #include <linux/interrupt.h>
2164 ++#include <linux/irq.h>
2165 + #include <linux/timex.h>
2166 + #include <linux/random.h>
2167 + #include <linux/init.h>
2168 +diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
2169 +index 74383a3780dc..01adea278a71 100644
2170 +--- a/arch/x86/kernel/idt.c
2171 ++++ b/arch/x86/kernel/idt.c
2172 +@@ -8,6 +8,7 @@
2173 + #include <asm/traps.h>
2174 + #include <asm/proto.h>
2175 + #include <asm/desc.h>
2176 ++#include <asm/hw_irq.h>
2177 +
2178 + struct idt_data {
2179 + unsigned int vector;
2180 +diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
2181 +index 328d027d829d..59b5f2ea7c2f 100644
2182 +--- a/arch/x86/kernel/irq.c
2183 ++++ b/arch/x86/kernel/irq.c
2184 +@@ -10,6 +10,7 @@
2185 + #include <linux/ftrace.h>
2186 + #include <linux/delay.h>
2187 + #include <linux/export.h>
2188 ++#include <linux/irq.h>
2189 +
2190 + #include <asm/apic.h>
2191 + #include <asm/io_apic.h>
2192 +diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
2193 +index c1bdbd3d3232..95600a99ae93 100644
2194 +--- a/arch/x86/kernel/irq_32.c
2195 ++++ b/arch/x86/kernel/irq_32.c
2196 +@@ -11,6 +11,7 @@
2197 +
2198 + #include <linux/seq_file.h>
2199 + #include <linux/interrupt.h>
2200 ++#include <linux/irq.h>
2201 + #include <linux/kernel_stat.h>
2202 + #include <linux/notifier.h>
2203 + #include <linux/cpu.h>
2204 +diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
2205 +index d86e344f5b3d..0469cd078db1 100644
2206 +--- a/arch/x86/kernel/irq_64.c
2207 ++++ b/arch/x86/kernel/irq_64.c
2208 +@@ -11,6 +11,7 @@
2209 +
2210 + #include <linux/kernel_stat.h>
2211 + #include <linux/interrupt.h>
2212 ++#include <linux/irq.h>
2213 + #include <linux/seq_file.h>
2214 + #include <linux/delay.h>
2215 + #include <linux/ftrace.h>
2216 +diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
2217 +index 772196c1b8c4..a0693b71cfc1 100644
2218 +--- a/arch/x86/kernel/irqinit.c
2219 ++++ b/arch/x86/kernel/irqinit.c
2220 +@@ -5,6 +5,7 @@
2221 + #include <linux/sched.h>
2222 + #include <linux/ioport.h>
2223 + #include <linux/interrupt.h>
2224 ++#include <linux/irq.h>
2225 + #include <linux/timex.h>
2226 + #include <linux/random.h>
2227 + #include <linux/kprobes.h>
2228 +diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
2229 +index 6f4d42377fe5..44e26dc326d5 100644
2230 +--- a/arch/x86/kernel/kprobes/core.c
2231 ++++ b/arch/x86/kernel/kprobes/core.c
2232 +@@ -395,8 +395,6 @@ int __copy_instruction(u8 *dest, u8 *src, u8 *real, struct insn *insn)
2233 + - (u8 *) real;
2234 + if ((s64) (s32) newdisp != newdisp) {
2235 + pr_err("Kprobes error: new displacement does not fit into s32 (%llx)\n", newdisp);
2236 +- pr_err("\tSrc: %p, Dest: %p, old disp: %x\n",
2237 +- src, real, insn->displacement.value);
2238 + return 0;
2239 + }
2240 + disp = (u8 *) dest + insn_offset_displacement(insn);
2241 +@@ -640,8 +638,7 @@ static int reenter_kprobe(struct kprobe *p, struct pt_regs *regs,
2242 + * Raise a BUG or we'll continue in an endless reentering loop
2243 + * and eventually a stack overflow.
2244 + */
2245 +- printk(KERN_WARNING "Unrecoverable kprobe detected at %p.\n",
2246 +- p->addr);
2247 ++ pr_err("Unrecoverable kprobe detected.\n");
2248 + dump_kprobe(p);
2249 + BUG();
2250 + default:
2251 +diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
2252 +index 99dc79e76bdc..930c88341e4e 100644
2253 +--- a/arch/x86/kernel/paravirt.c
2254 ++++ b/arch/x86/kernel/paravirt.c
2255 +@@ -88,10 +88,12 @@ unsigned paravirt_patch_call(void *insnbuf,
2256 + struct branch *b = insnbuf;
2257 + unsigned long delta = (unsigned long)target - (addr+5);
2258 +
2259 +- if (tgt_clobbers & ~site_clobbers)
2260 +- return len; /* target would clobber too much for this site */
2261 +- if (len < 5)
2262 ++ if (len < 5) {
2263 ++#ifdef CONFIG_RETPOLINE
2264 ++ WARN_ONCE("Failing to patch indirect CALL in %ps\n", (void *)addr);
2265 ++#endif
2266 + return len; /* call too long for patch site */
2267 ++ }
2268 +
2269 + b->opcode = 0xe8; /* call */
2270 + b->delta = delta;
2271 +@@ -106,8 +108,12 @@ unsigned paravirt_patch_jmp(void *insnbuf, const void *target,
2272 + struct branch *b = insnbuf;
2273 + unsigned long delta = (unsigned long)target - (addr+5);
2274 +
2275 +- if (len < 5)
2276 ++ if (len < 5) {
2277 ++#ifdef CONFIG_RETPOLINE
2278 ++ WARN_ONCE("Failing to patch indirect JMP in %ps\n", (void *)addr);
2279 ++#endif
2280 + return len; /* call too long for patch site */
2281 ++ }
2282 +
2283 + b->opcode = 0xe9; /* jmp */
2284 + b->delta = delta;
2285 +diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
2286 +index 2f86d883dd95..74b4472ba0a6 100644
2287 +--- a/arch/x86/kernel/setup.c
2288 ++++ b/arch/x86/kernel/setup.c
2289 +@@ -823,6 +823,12 @@ void __init setup_arch(char **cmdline_p)
2290 + memblock_reserve(__pa_symbol(_text),
2291 + (unsigned long)__bss_stop - (unsigned long)_text);
2292 +
2293 ++ /*
2294 ++ * Make sure page 0 is always reserved because on systems with
2295 ++ * L1TF its contents can be leaked to user processes.
2296 ++ */
2297 ++ memblock_reserve(0, PAGE_SIZE);
2298 ++
2299 + early_reserve_initrd();
2300 +
2301 + /*
2302 +diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
2303 +index 5c574dff4c1a..04adc8d60aed 100644
2304 +--- a/arch/x86/kernel/smp.c
2305 ++++ b/arch/x86/kernel/smp.c
2306 +@@ -261,6 +261,7 @@ __visible void __irq_entry smp_reschedule_interrupt(struct pt_regs *regs)
2307 + {
2308 + ack_APIC_irq();
2309 + inc_irq_stat(irq_resched_count);
2310 ++ kvm_set_cpu_l1tf_flush_l1d();
2311 +
2312 + if (trace_resched_ipi_enabled()) {
2313 + /*
2314 +diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
2315 +index db9656e13ea0..f02ecaf97904 100644
2316 +--- a/arch/x86/kernel/smpboot.c
2317 ++++ b/arch/x86/kernel/smpboot.c
2318 +@@ -80,6 +80,7 @@
2319 + #include <asm/intel-family.h>
2320 + #include <asm/cpu_device_id.h>
2321 + #include <asm/spec-ctrl.h>
2322 ++#include <asm/hw_irq.h>
2323 +
2324 + /* representing HT siblings of each logical CPU */
2325 + DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
2326 +@@ -270,6 +271,23 @@ static void notrace start_secondary(void *unused)
2327 + cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
2328 + }
2329 +
2330 ++/**
2331 ++ * topology_is_primary_thread - Check whether CPU is the primary SMT thread
2332 ++ * @cpu: CPU to check
2333 ++ */
2334 ++bool topology_is_primary_thread(unsigned int cpu)
2335 ++{
2336 ++ return apic_id_is_primary_thread(per_cpu(x86_cpu_to_apicid, cpu));
2337 ++}
2338 ++
2339 ++/**
2340 ++ * topology_smt_supported - Check whether SMT is supported by the CPUs
2341 ++ */
2342 ++bool topology_smt_supported(void)
2343 ++{
2344 ++ return smp_num_siblings > 1;
2345 ++}
2346 ++
2347 + /**
2348 + * topology_phys_to_logical_pkg - Map a physical package id to a logical
2349 + *
2350 +diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c
2351 +index 774ebafa97c4..be01328eb755 100644
2352 +--- a/arch/x86/kernel/time.c
2353 ++++ b/arch/x86/kernel/time.c
2354 +@@ -12,6 +12,7 @@
2355 +
2356 + #include <linux/clockchips.h>
2357 + #include <linux/interrupt.h>
2358 ++#include <linux/irq.h>
2359 + #include <linux/i8253.h>
2360 + #include <linux/time.h>
2361 + #include <linux/export.h>
2362 +diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
2363 +index 6b8f11521c41..a44e568363a4 100644
2364 +--- a/arch/x86/kvm/mmu.c
2365 ++++ b/arch/x86/kvm/mmu.c
2366 +@@ -3840,6 +3840,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
2367 + {
2368 + int r = 1;
2369 +
2370 ++ vcpu->arch.l1tf_flush_l1d = true;
2371 + switch (vcpu->arch.apf.host_apf_reason) {
2372 + default:
2373 + trace_kvm_page_fault(fault_address, error_code);
2374 +diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
2375 +index 5d8e317c2b04..46b428c0990e 100644
2376 +--- a/arch/x86/kvm/vmx.c
2377 ++++ b/arch/x86/kvm/vmx.c
2378 +@@ -188,6 +188,150 @@ module_param(ple_window_max, uint, 0444);
2379 +
2380 + extern const ulong vmx_return;
2381 +
2382 ++static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush);
2383 ++static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_cond);
2384 ++static DEFINE_MUTEX(vmx_l1d_flush_mutex);
2385 ++
2386 ++/* Storage for pre module init parameter parsing */
2387 ++static enum vmx_l1d_flush_state __read_mostly vmentry_l1d_flush_param = VMENTER_L1D_FLUSH_AUTO;
2388 ++
2389 ++static const struct {
2390 ++ const char *option;
2391 ++ enum vmx_l1d_flush_state cmd;
2392 ++} vmentry_l1d_param[] = {
2393 ++ {"auto", VMENTER_L1D_FLUSH_AUTO},
2394 ++ {"never", VMENTER_L1D_FLUSH_NEVER},
2395 ++ {"cond", VMENTER_L1D_FLUSH_COND},
2396 ++ {"always", VMENTER_L1D_FLUSH_ALWAYS},
2397 ++};
2398 ++
2399 ++#define L1D_CACHE_ORDER 4
2400 ++static void *vmx_l1d_flush_pages;
2401 ++
2402 ++static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
2403 ++{
2404 ++ struct page *page;
2405 ++ unsigned int i;
2406 ++
2407 ++ if (!enable_ept) {
2408 ++ l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_EPT_DISABLED;
2409 ++ return 0;
2410 ++ }
2411 ++
2412 ++ if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES)) {
2413 ++ u64 msr;
2414 ++
2415 ++ rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr);
2416 ++ if (msr & ARCH_CAP_SKIP_VMENTRY_L1DFLUSH) {
2417 ++ l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_NOT_REQUIRED;
2418 ++ return 0;
2419 ++ }
2420 ++ }
2421 ++
2422 ++ /* If set to auto use the default l1tf mitigation method */
2423 ++ if (l1tf == VMENTER_L1D_FLUSH_AUTO) {
2424 ++ switch (l1tf_mitigation) {
2425 ++ case L1TF_MITIGATION_OFF:
2426 ++ l1tf = VMENTER_L1D_FLUSH_NEVER;
2427 ++ break;
2428 ++ case L1TF_MITIGATION_FLUSH_NOWARN:
2429 ++ case L1TF_MITIGATION_FLUSH:
2430 ++ case L1TF_MITIGATION_FLUSH_NOSMT:
2431 ++ l1tf = VMENTER_L1D_FLUSH_COND;
2432 ++ break;
2433 ++ case L1TF_MITIGATION_FULL:
2434 ++ case L1TF_MITIGATION_FULL_FORCE:
2435 ++ l1tf = VMENTER_L1D_FLUSH_ALWAYS;
2436 ++ break;
2437 ++ }
2438 ++ } else if (l1tf_mitigation == L1TF_MITIGATION_FULL_FORCE) {
2439 ++ l1tf = VMENTER_L1D_FLUSH_ALWAYS;
2440 ++ }
2441 ++
2442 ++ if (l1tf != VMENTER_L1D_FLUSH_NEVER && !vmx_l1d_flush_pages &&
2443 ++ !boot_cpu_has(X86_FEATURE_FLUSH_L1D)) {
2444 ++ page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER);
2445 ++ if (!page)
2446 ++ return -ENOMEM;
2447 ++ vmx_l1d_flush_pages = page_address(page);
2448 ++
2449 ++ /*
2450 ++ * Initialize each page with a different pattern in
2451 ++ * order to protect against KSM in the nested
2452 ++ * virtualization case.
2453 ++ */
2454 ++ for (i = 0; i < 1u << L1D_CACHE_ORDER; ++i) {
2455 ++ memset(vmx_l1d_flush_pages + i * PAGE_SIZE, i + 1,
2456 ++ PAGE_SIZE);
2457 ++ }
2458 ++ }
2459 ++
2460 ++ l1tf_vmx_mitigation = l1tf;
2461 ++
2462 ++ if (l1tf != VMENTER_L1D_FLUSH_NEVER)
2463 ++ static_branch_enable(&vmx_l1d_should_flush);
2464 ++ else
2465 ++ static_branch_disable(&vmx_l1d_should_flush);
2466 ++
2467 ++ if (l1tf == VMENTER_L1D_FLUSH_COND)
2468 ++ static_branch_enable(&vmx_l1d_flush_cond);
2469 ++ else
2470 ++ static_branch_disable(&vmx_l1d_flush_cond);
2471 ++ return 0;
2472 ++}
2473 ++
2474 ++static int vmentry_l1d_flush_parse(const char *s)
2475 ++{
2476 ++ unsigned int i;
2477 ++
2478 ++ if (s) {
2479 ++ for (i = 0; i < ARRAY_SIZE(vmentry_l1d_param); i++) {
2480 ++ if (sysfs_streq(s, vmentry_l1d_param[i].option))
2481 ++ return vmentry_l1d_param[i].cmd;
2482 ++ }
2483 ++ }
2484 ++ return -EINVAL;
2485 ++}
2486 ++
2487 ++static int vmentry_l1d_flush_set(const char *s, const struct kernel_param *kp)
2488 ++{
2489 ++ int l1tf, ret;
2490 ++
2491 ++ if (!boot_cpu_has(X86_BUG_L1TF))
2492 ++ return 0;
2493 ++
2494 ++ l1tf = vmentry_l1d_flush_parse(s);
2495 ++ if (l1tf < 0)
2496 ++ return l1tf;
2497 ++
2498 ++ /*
2499 ++ * Has vmx_init() run already? If not then this is the pre init
2500 ++ * parameter parsing. In that case just store the value and let
2501 ++ * vmx_init() do the proper setup after enable_ept has been
2502 ++ * established.
2503 ++ */
2504 ++ if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_AUTO) {
2505 ++ vmentry_l1d_flush_param = l1tf;
2506 ++ return 0;
2507 ++ }
2508 ++
2509 ++ mutex_lock(&vmx_l1d_flush_mutex);
2510 ++ ret = vmx_setup_l1d_flush(l1tf);
2511 ++ mutex_unlock(&vmx_l1d_flush_mutex);
2512 ++ return ret;
2513 ++}
2514 ++
2515 ++static int vmentry_l1d_flush_get(char *s, const struct kernel_param *kp)
2516 ++{
2517 ++ return sprintf(s, "%s\n", vmentry_l1d_param[l1tf_vmx_mitigation].option);
2518 ++}
2519 ++
2520 ++static const struct kernel_param_ops vmentry_l1d_flush_ops = {
2521 ++ .set = vmentry_l1d_flush_set,
2522 ++ .get = vmentry_l1d_flush_get,
2523 ++};
2524 ++module_param_cb(vmentry_l1d_flush, &vmentry_l1d_flush_ops, NULL, 0644);
2525 ++
2526 + struct kvm_vmx {
2527 + struct kvm kvm;
2528 +
2529 +@@ -757,6 +901,11 @@ static inline int pi_test_sn(struct pi_desc *pi_desc)
2530 + (unsigned long *)&pi_desc->control);
2531 + }
2532 +
2533 ++struct vmx_msrs {
2534 ++ unsigned int nr;
2535 ++ struct vmx_msr_entry val[NR_AUTOLOAD_MSRS];
2536 ++};
2537 ++
2538 + struct vcpu_vmx {
2539 + struct kvm_vcpu vcpu;
2540 + unsigned long host_rsp;
2541 +@@ -790,9 +939,8 @@ struct vcpu_vmx {
2542 + struct loaded_vmcs *loaded_vmcs;
2543 + bool __launched; /* temporary, used in vmx_vcpu_run */
2544 + struct msr_autoload {
2545 +- unsigned nr;
2546 +- struct vmx_msr_entry guest[NR_AUTOLOAD_MSRS];
2547 +- struct vmx_msr_entry host[NR_AUTOLOAD_MSRS];
2548 ++ struct vmx_msrs guest;
2549 ++ struct vmx_msrs host;
2550 + } msr_autoload;
2551 + struct {
2552 + int loaded;
2553 +@@ -2377,9 +2525,20 @@ static void clear_atomic_switch_msr_special(struct vcpu_vmx *vmx,
2554 + vm_exit_controls_clearbit(vmx, exit);
2555 + }
2556 +
2557 ++static int find_msr(struct vmx_msrs *m, unsigned int msr)
2558 ++{
2559 ++ unsigned int i;
2560 ++
2561 ++ for (i = 0; i < m->nr; ++i) {
2562 ++ if (m->val[i].index == msr)
2563 ++ return i;
2564 ++ }
2565 ++ return -ENOENT;
2566 ++}
2567 ++
2568 + static void clear_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr)
2569 + {
2570 +- unsigned i;
2571 ++ int i;
2572 + struct msr_autoload *m = &vmx->msr_autoload;
2573 +
2574 + switch (msr) {
2575 +@@ -2400,18 +2559,21 @@ static void clear_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr)
2576 + }
2577 + break;
2578 + }
2579 ++ i = find_msr(&m->guest, msr);
2580 ++ if (i < 0)
2581 ++ goto skip_guest;
2582 ++ --m->guest.nr;
2583 ++ m->guest.val[i] = m->guest.val[m->guest.nr];
2584 ++ vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->guest.nr);
2585 +
2586 +- for (i = 0; i < m->nr; ++i)
2587 +- if (m->guest[i].index == msr)
2588 +- break;
2589 +-
2590 +- if (i == m->nr)
2591 ++skip_guest:
2592 ++ i = find_msr(&m->host, msr);
2593 ++ if (i < 0)
2594 + return;
2595 +- --m->nr;
2596 +- m->guest[i] = m->guest[m->nr];
2597 +- m->host[i] = m->host[m->nr];
2598 +- vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
2599 +- vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
2600 ++
2601 ++ --m->host.nr;
2602 ++ m->host.val[i] = m->host.val[m->host.nr];
2603 ++ vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->host.nr);
2604 + }
2605 +
2606 + static void add_atomic_switch_msr_special(struct vcpu_vmx *vmx,
2607 +@@ -2426,9 +2588,9 @@ static void add_atomic_switch_msr_special(struct vcpu_vmx *vmx,
2608 + }
2609 +
2610 + static void add_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr,
2611 +- u64 guest_val, u64 host_val)
2612 ++ u64 guest_val, u64 host_val, bool entry_only)
2613 + {
2614 +- unsigned i;
2615 ++ int i, j = 0;
2616 + struct msr_autoload *m = &vmx->msr_autoload;
2617 +
2618 + switch (msr) {
2619 +@@ -2463,24 +2625,31 @@ static void add_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr,
2620 + wrmsrl(MSR_IA32_PEBS_ENABLE, 0);
2621 + }
2622 +
2623 +- for (i = 0; i < m->nr; ++i)
2624 +- if (m->guest[i].index == msr)
2625 +- break;
2626 ++ i = find_msr(&m->guest, msr);
2627 ++ if (!entry_only)
2628 ++ j = find_msr(&m->host, msr);
2629 +
2630 +- if (i == NR_AUTOLOAD_MSRS) {
2631 ++ if (i == NR_AUTOLOAD_MSRS || j == NR_AUTOLOAD_MSRS) {
2632 + printk_once(KERN_WARNING "Not enough msr switch entries. "
2633 + "Can't add msr %x\n", msr);
2634 + return;
2635 +- } else if (i == m->nr) {
2636 +- ++m->nr;
2637 +- vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
2638 +- vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
2639 + }
2640 ++ if (i < 0) {
2641 ++ i = m->guest.nr++;
2642 ++ vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->guest.nr);
2643 ++ }
2644 ++ m->guest.val[i].index = msr;
2645 ++ m->guest.val[i].value = guest_val;
2646 ++
2647 ++ if (entry_only)
2648 ++ return;
2649 +
2650 +- m->guest[i].index = msr;
2651 +- m->guest[i].value = guest_val;
2652 +- m->host[i].index = msr;
2653 +- m->host[i].value = host_val;
2654 ++ if (j < 0) {
2655 ++ j = m->host.nr++;
2656 ++ vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->host.nr);
2657 ++ }
2658 ++ m->host.val[j].index = msr;
2659 ++ m->host.val[j].value = host_val;
2660 + }
2661 +
2662 + static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)
2663 +@@ -2524,7 +2693,7 @@ static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)
2664 + guest_efer &= ~EFER_LME;
2665 + if (guest_efer != host_efer)
2666 + add_atomic_switch_msr(vmx, MSR_EFER,
2667 +- guest_efer, host_efer);
2668 ++ guest_efer, host_efer, false);
2669 + return false;
2670 + } else {
2671 + guest_efer &= ~ignore_bits;
2672 +@@ -3987,7 +4156,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
2673 + vcpu->arch.ia32_xss = data;
2674 + if (vcpu->arch.ia32_xss != host_xss)
2675 + add_atomic_switch_msr(vmx, MSR_IA32_XSS,
2676 +- vcpu->arch.ia32_xss, host_xss);
2677 ++ vcpu->arch.ia32_xss, host_xss, false);
2678 + else
2679 + clear_atomic_switch_msr(vmx, MSR_IA32_XSS);
2680 + break;
2681 +@@ -6274,9 +6443,9 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx)
2682 +
2683 + vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
2684 + vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
2685 +- vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
2686 ++ vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host.val));
2687 + vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
2688 +- vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest));
2689 ++ vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest.val));
2690 +
2691 + if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
2692 + vmcs_write64(GUEST_IA32_PAT, vmx->vcpu.arch.pat);
2693 +@@ -6296,8 +6465,7 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx)
2694 + ++vmx->nmsrs;
2695 + }
2696 +
2697 +- if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
2698 +- rdmsrl(MSR_IA32_ARCH_CAPABILITIES, vmx->arch_capabilities);
2699 ++ vmx->arch_capabilities = kvm_get_arch_capabilities();
2700 +
2701 + vm_exit_controls_init(vmx, vmcs_config.vmexit_ctrl);
2702 +
2703 +@@ -9548,6 +9716,79 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
2704 + }
2705 + }
2706 +
2707 ++/*
2708 ++ * Software based L1D cache flush which is used when microcode providing
2709 ++ * the cache control MSR is not loaded.
2710 ++ *
2711 ++ * The L1D cache is 32 KiB on Nehalem and later microarchitectures, but to
2712 ++ * flush it is required to read in 64 KiB because the replacement algorithm
2713 ++ * is not exactly LRU. This could be sized at runtime via topology
2714 ++ * information but as all relevant affected CPUs have 32KiB L1D cache size
2715 ++ * there is no point in doing so.
2716 ++ */
2717 ++#define L1D_CACHE_ORDER 4
2718 ++static void *vmx_l1d_flush_pages;
2719 ++
2720 ++static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
2721 ++{
2722 ++ int size = PAGE_SIZE << L1D_CACHE_ORDER;
2723 ++
2724 ++ /*
2725 ++	 * This code is only executed when the flush mode is 'cond' or
2726 ++	 * 'always'
2727 ++ */
2728 ++ if (static_branch_likely(&vmx_l1d_flush_cond)) {
2729 ++ bool flush_l1d;
2730 ++
2731 ++ /*
2732 ++ * Clear the per-vcpu flush bit, it gets set again
2733 ++ * either from vcpu_run() or from one of the unsafe
2734 ++ * VMEXIT handlers.
2735 ++ */
2736 ++ flush_l1d = vcpu->arch.l1tf_flush_l1d;
2737 ++ vcpu->arch.l1tf_flush_l1d = false;
2738 ++
2739 ++ /*
2740 ++ * Clear the per-cpu flush bit, it gets set again from
2741 ++ * the interrupt handlers.
2742 ++ */
2743 ++ flush_l1d |= kvm_get_cpu_l1tf_flush_l1d();
2744 ++ kvm_clear_cpu_l1tf_flush_l1d();
2745 ++
2746 ++ if (!flush_l1d)
2747 ++ return;
2748 ++ }
2749 ++
2750 ++ vcpu->stat.l1d_flush++;
2751 ++
2752 ++ if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
2753 ++ wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
2754 ++ return;
2755 ++ }
2756 ++
2757 ++ asm volatile(
2758 ++ /* First ensure the pages are in the TLB */
2759 ++ "xorl %%eax, %%eax\n"
2760 ++ ".Lpopulate_tlb:\n\t"
2761 ++ "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
2762 ++ "addl $4096, %%eax\n\t"
2763 ++ "cmpl %%eax, %[size]\n\t"
2764 ++ "jne .Lpopulate_tlb\n\t"
2765 ++ "xorl %%eax, %%eax\n\t"
2766 ++ "cpuid\n\t"
2767 ++ /* Now fill the cache */
2768 ++ "xorl %%eax, %%eax\n"
2769 ++ ".Lfill_cache:\n"
2770 ++ "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
2771 ++ "addl $64, %%eax\n\t"
2772 ++ "cmpl %%eax, %[size]\n\t"
2773 ++ "jne .Lfill_cache\n\t"
2774 ++ "lfence\n"
2775 ++ :: [flush_pages] "r" (vmx_l1d_flush_pages),
2776 ++ [size] "r" (size)
2777 ++ : "eax", "ebx", "ecx", "edx");
2778 ++}
2779 ++
2780 + static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
2781 + {
2782 + struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
2783 +@@ -9949,7 +10190,7 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
2784 + clear_atomic_switch_msr(vmx, msrs[i].msr);
2785 + else
2786 + add_atomic_switch_msr(vmx, msrs[i].msr, msrs[i].guest,
2787 +- msrs[i].host);
2788 ++ msrs[i].host, false);
2789 + }
2790 +
2791 + static void vmx_arm_hv_timer(struct kvm_vcpu *vcpu)
2792 +@@ -10044,6 +10285,9 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
2793 + evmcs_rsp = static_branch_unlikely(&enable_evmcs) ?
2794 + (unsigned long)&current_evmcs->host_rsp : 0;
2795 +
2796 ++ if (static_branch_unlikely(&vmx_l1d_should_flush))
2797 ++ vmx_l1d_flush(vcpu);
2798 ++
2799 + asm(
2800 + /* Store host registers */
2801 + "push %%" _ASM_DX "; push %%" _ASM_BP ";"
2802 +@@ -10403,10 +10647,37 @@ free_vcpu:
2803 + return ERR_PTR(err);
2804 + }
2805 +
2806 ++#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html for details.\n"
2807 ++#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html for details.\n"
2808 ++
2809 + static int vmx_vm_init(struct kvm *kvm)
2810 + {
2811 + if (!ple_gap)
2812 + kvm->arch.pause_in_guest = true;
2813 ++
2814 ++ if (boot_cpu_has(X86_BUG_L1TF) && enable_ept) {
2815 ++ switch (l1tf_mitigation) {
2816 ++ case L1TF_MITIGATION_OFF:
2817 ++ case L1TF_MITIGATION_FLUSH_NOWARN:
2818 ++ /* 'I explicitly don't care' is set */
2819 ++ break;
2820 ++ case L1TF_MITIGATION_FLUSH:
2821 ++ case L1TF_MITIGATION_FLUSH_NOSMT:
2822 ++ case L1TF_MITIGATION_FULL:
2823 ++ /*
2824 ++ * Warn upon starting the first VM in a potentially
2825 ++ * insecure environment.
2826 ++ */
2827 ++ if (cpu_smt_control == CPU_SMT_ENABLED)
2828 ++ pr_warn_once(L1TF_MSG_SMT);
2829 ++ if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_NEVER)
2830 ++ pr_warn_once(L1TF_MSG_L1D);
2831 ++ break;
2832 ++ case L1TF_MITIGATION_FULL_FORCE:
2833 ++ /* Flush is enforced */
2834 ++ break;
2835 ++ }
2836 ++ }
2837 + return 0;
2838 + }
2839 +
2840 +@@ -11260,10 +11531,10 @@ static void prepare_vmcs02_full(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
2841 + * Set the MSR load/store lists to match L0's settings.
2842 + */
2843 + vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
2844 +- vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
2845 +- vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
2846 +- vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
2847 +- vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest));
2848 ++ vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr);
2849 ++ vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host.val));
2850 ++ vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr);
2851 ++ vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest.val));
2852 +
2853 + set_cr4_guest_host_mask(vmx);
2854 +
2855 +@@ -11899,6 +12170,9 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
2856 + return ret;
2857 + }
2858 +
2859 ++ /* Hide L1D cache contents from the nested guest. */
2860 ++ vmx->vcpu.arch.l1tf_flush_l1d = true;
2861 ++
2862 + /*
2863 + * If we're entering a halted L2 vcpu and the L2 vcpu won't be woken
2864 + * by event injection, halt vcpu.
2865 +@@ -12419,8 +12693,8 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
2866 + vmx_segment_cache_clear(vmx);
2867 +
2868 + /* Update any VMCS fields that might have changed while L2 ran */
2869 +- vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
2870 +- vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
2871 ++ vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr);
2872 ++ vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr);
2873 + vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset);
2874 + if (vmx->hv_deadline_tsc == -1)
2875 + vmcs_clear_bits(PIN_BASED_VM_EXEC_CONTROL,
2876 +@@ -13137,6 +13411,51 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
2877 + .enable_smi_window = enable_smi_window,
2878 + };
2879 +
2880 ++static void vmx_cleanup_l1d_flush(void)
2881 ++{
2882 ++ if (vmx_l1d_flush_pages) {
2883 ++ free_pages((unsigned long)vmx_l1d_flush_pages, L1D_CACHE_ORDER);
2884 ++ vmx_l1d_flush_pages = NULL;
2885 ++ }
2886 ++ /* Restore state so sysfs ignores VMX */
2887 ++ l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
2888 ++}
2889 ++
2890 ++static void vmx_exit(void)
2891 ++{
2892 ++#ifdef CONFIG_KEXEC_CORE
2893 ++ RCU_INIT_POINTER(crash_vmclear_loaded_vmcss, NULL);
2894 ++ synchronize_rcu();
2895 ++#endif
2896 ++
2897 ++ kvm_exit();
2898 ++
2899 ++#if IS_ENABLED(CONFIG_HYPERV)
2900 ++ if (static_branch_unlikely(&enable_evmcs)) {
2901 ++ int cpu;
2902 ++ struct hv_vp_assist_page *vp_ap;
2903 ++ /*
2904 ++ * Reset everything to support using non-enlightened VMCS
2905 ++ * access later (e.g. when we reload the module with
2906 ++ * enlightened_vmcs=0)
2907 ++ */
2908 ++ for_each_online_cpu(cpu) {
2909 ++ vp_ap = hv_get_vp_assist_page(cpu);
2910 ++
2911 ++ if (!vp_ap)
2912 ++ continue;
2913 ++
2914 ++ vp_ap->current_nested_vmcs = 0;
2915 ++ vp_ap->enlighten_vmentry = 0;
2916 ++ }
2917 ++
2918 ++ static_branch_disable(&enable_evmcs);
2919 ++ }
2920 ++#endif
2921 ++ vmx_cleanup_l1d_flush();
2922 ++}
2923 ++module_exit(vmx_exit);
2924 ++
2925 + static int __init vmx_init(void)
2926 + {
2927 + int r;
2928 +@@ -13171,10 +13490,25 @@ static int __init vmx_init(void)
2929 + #endif
2930 +
2931 + r = kvm_init(&vmx_x86_ops, sizeof(struct vcpu_vmx),
2932 +- __alignof__(struct vcpu_vmx), THIS_MODULE);
2933 ++ __alignof__(struct vcpu_vmx), THIS_MODULE);
2934 + if (r)
2935 + return r;
2936 +
2937 ++ /*
2938 ++ * Must be called after kvm_init() so enable_ept is properly set
2939 ++ * up. Hand the parameter mitigation value in which was stored in
2940 ++ * the pre module init parser. If no parameter was given, it will
2941 ++ * contain 'auto' which will be turned into the default 'cond'
2942 ++ * mitigation mode.
2943 ++ */
2944 ++ if (boot_cpu_has(X86_BUG_L1TF)) {
2945 ++ r = vmx_setup_l1d_flush(vmentry_l1d_flush_param);
2946 ++ if (r) {
2947 ++ vmx_exit();
2948 ++ return r;
2949 ++ }
2950 ++ }
2951 ++
2952 + #ifdef CONFIG_KEXEC_CORE
2953 + rcu_assign_pointer(crash_vmclear_loaded_vmcss,
2954 + crash_vmclear_local_loaded_vmcss);
2955 +@@ -13183,39 +13517,4 @@ static int __init vmx_init(void)
2956 +
2957 + return 0;
2958 + }
2959 +-
2960 +-static void __exit vmx_exit(void)
2961 +-{
2962 +-#ifdef CONFIG_KEXEC_CORE
2963 +- RCU_INIT_POINTER(crash_vmclear_loaded_vmcss, NULL);
2964 +- synchronize_rcu();
2965 +-#endif
2966 +-
2967 +- kvm_exit();
2968 +-
2969 +-#if IS_ENABLED(CONFIG_HYPERV)
2970 +- if (static_branch_unlikely(&enable_evmcs)) {
2971 +- int cpu;
2972 +- struct hv_vp_assist_page *vp_ap;
2973 +- /*
2974 +- * Reset everything to support using non-enlightened VMCS
2975 +- * access later (e.g. when we reload the module with
2976 +- * enlightened_vmcs=0)
2977 +- */
2978 +- for_each_online_cpu(cpu) {
2979 +- vp_ap = hv_get_vp_assist_page(cpu);
2980 +-
2981 +- if (!vp_ap)
2982 +- continue;
2983 +-
2984 +- vp_ap->current_nested_vmcs = 0;
2985 +- vp_ap->enlighten_vmentry = 0;
2986 +- }
2987 +-
2988 +- static_branch_disable(&enable_evmcs);
2989 +- }
2990 +-#endif
2991 +-}
2992 +-
2993 +-module_init(vmx_init)
2994 +-module_exit(vmx_exit)
2995 ++module_init(vmx_init);
2996 +diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
2997 +index 2b812b3c5088..a5caa5e5480c 100644
2998 +--- a/arch/x86/kvm/x86.c
2999 ++++ b/arch/x86/kvm/x86.c
3000 +@@ -195,6 +195,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
3001 + { "irq_injections", VCPU_STAT(irq_injections) },
3002 + { "nmi_injections", VCPU_STAT(nmi_injections) },
3003 + { "req_event", VCPU_STAT(req_event) },
3004 ++ { "l1d_flush", VCPU_STAT(l1d_flush) },
3005 + { "mmu_shadow_zapped", VM_STAT(mmu_shadow_zapped) },
3006 + { "mmu_pte_write", VM_STAT(mmu_pte_write) },
3007 + { "mmu_pte_updated", VM_STAT(mmu_pte_updated) },
3008 +@@ -1102,11 +1103,35 @@ static u32 msr_based_features[] = {
3009 +
3010 + static unsigned int num_msr_based_features;
3011 +
3012 ++u64 kvm_get_arch_capabilities(void)
3013 ++{
3014 ++ u64 data;
3015 ++
3016 ++ rdmsrl_safe(MSR_IA32_ARCH_CAPABILITIES, &data);
3017 ++
3018 ++ /*
3019 ++ * If we're doing cache flushes (either "always" or "cond")
3020 ++ * we will do one whenever the guest does a vmlaunch/vmresume.
3021 ++ * If an outer hypervisor is doing the cache flush for us
3022 ++ * (VMENTER_L1D_FLUSH_NESTED_VM), we can safely pass that
3023 ++ * capability to the guest too, and if EPT is disabled we're not
3024 ++ * vulnerable. Overall, only VMENTER_L1D_FLUSH_NEVER will
3025 ++ * require a nested hypervisor to do a flush of its own.
3026 ++ */
3027 ++ if (l1tf_vmx_mitigation != VMENTER_L1D_FLUSH_NEVER)
3028 ++ data |= ARCH_CAP_SKIP_VMENTRY_L1DFLUSH;
3029 ++
3030 ++ return data;
3031 ++}
3032 ++EXPORT_SYMBOL_GPL(kvm_get_arch_capabilities);
3033 ++
3034 + static int kvm_get_msr_feature(struct kvm_msr_entry *msr)
3035 + {
3036 + switch (msr->index) {
3037 +- case MSR_IA32_UCODE_REV:
3038 + case MSR_IA32_ARCH_CAPABILITIES:
3039 ++ msr->data = kvm_get_arch_capabilities();
3040 ++ break;
3041 ++ case MSR_IA32_UCODE_REV:
3042 + rdmsrl_safe(msr->index, &msr->data);
3043 + break;
3044 + default:
3045 +@@ -4876,6 +4901,9 @@ static int emulator_write_std(struct x86_emulate_ctxt *ctxt, gva_t addr, void *v
3046 + int kvm_write_guest_virt_system(struct kvm_vcpu *vcpu, gva_t addr, void *val,
3047 + unsigned int bytes, struct x86_exception *exception)
3048 + {
3049 ++ /* kvm_write_guest_virt_system can pull in tons of pages. */
3050 ++ vcpu->arch.l1tf_flush_l1d = true;
3051 ++
3052 + return kvm_write_guest_virt_helper(addr, val, bytes, vcpu,
3053 + PFERR_WRITE_MASK, exception);
3054 + }
3055 +@@ -6052,6 +6080,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
3056 + bool writeback = true;
3057 + bool write_fault_to_spt = vcpu->arch.write_fault_to_shadow_pgtable;
3058 +
3059 ++ vcpu->arch.l1tf_flush_l1d = true;
3060 ++
3061 + /*
3062 + * Clear write_fault_to_shadow_pgtable here to ensure it is
3063 + * never reused.
3064 +@@ -7581,6 +7611,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
3065 + struct kvm *kvm = vcpu->kvm;
3066 +
3067 + vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
3068 ++ vcpu->arch.l1tf_flush_l1d = true;
3069 +
3070 + for (;;) {
3071 + if (kvm_vcpu_running(vcpu)) {
3072 +@@ -8700,6 +8731,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
3073 +
3074 + void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
3075 + {
3076 ++ vcpu->arch.l1tf_flush_l1d = true;
3077 + kvm_x86_ops->sched_in(vcpu, cpu);
3078 + }
3079 +
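(Not part of the 4.18.1 patch above: the kvm_get_arch_capabilities() comment explains that ARCH_CAP_SKIP_VMENTRY_L1DFLUSH is advertised to the guest whenever the host, or an outer hypervisor, already flushes the L1D cache around VM entry. A minimal guest-side sketch of acting on that bit follows; the MSR index 0x10a and bit position 3 are assumptions based on Intel's public documentation, not values taken from this diff, and the helper names are hypothetical.)

#include <stdint.h>

/* Assumed values, not taken from this diff: IA32_ARCH_CAPABILITIES is
 * MSR 0x10a and SKIP_L1DFL_VMENTRY is bit 3 of that MSR. */
#define MSR_IA32_ARCH_CAPABILITIES      0x0000010aU
#define ARCH_CAP_SKIP_VMENTRY_L1DFLUSH  (1ULL << 3)

/* Ring-0 only: read a 64-bit MSR. */
static inline uint64_t rdmsr64(uint32_t msr)
{
        uint32_t lo, hi;

        __asm__ volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
        return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper for a nested hypervisor: returns non-zero when it
 * still has to flush the L1D cache itself before entering its own guests,
 * i.e. when nothing underneath already does the flush for it. */
static int l1d_flush_needed_before_vmentry(void)
{
        return !(rdmsr64(MSR_IA32_ARCH_CAPABILITIES) &
                 ARCH_CAP_SKIP_VMENTRY_L1DFLUSH);
}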
3080 +diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
3081 +index cee58a972cb2..83241eb71cd4 100644
3082 +--- a/arch/x86/mm/init.c
3083 ++++ b/arch/x86/mm/init.c
3084 +@@ -4,6 +4,8 @@
3085 + #include <linux/swap.h>
3086 + #include <linux/memblock.h>
3087 + #include <linux/bootmem.h> /* for max_low_pfn */
3088 ++#include <linux/swapfile.h>
3089 ++#include <linux/swapops.h>
3090 +
3091 + #include <asm/set_memory.h>
3092 + #include <asm/e820/api.h>
3093 +@@ -880,3 +882,26 @@ void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache)
3094 + __cachemode2pte_tbl[cache] = __cm_idx2pte(entry);
3095 + __pte2cachemode_tbl[entry] = cache;
3096 + }
3097 ++
3098 ++#ifdef CONFIG_SWAP
3099 ++unsigned long max_swapfile_size(void)
3100 ++{
3101 ++ unsigned long pages;
3102 ++
3103 ++ pages = generic_max_swapfile_size();
3104 ++
3105 ++ if (boot_cpu_has_bug(X86_BUG_L1TF)) {
3106 ++ /* Limit the swap file size to MAX_PA/2 for L1TF workaround */
3107 ++ unsigned long l1tf_limit = l1tf_pfn_limit() + 1;
3108 ++ /*
3109 ++ * We encode swap offsets also with 3 bits below those for pfn
3110 ++ * which makes the usable limit higher.
3111 ++ */
3112 ++#if CONFIG_PGTABLE_LEVELS > 2
3113 ++ l1tf_limit <<= PAGE_SHIFT - SWP_OFFSET_FIRST_BIT;
3114 ++#endif
3115 ++ pages = min_t(unsigned long, l1tf_limit, pages);
3116 ++ }
3117 ++ return pages;
3118 ++}
3119 ++#endif
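(Not part of the 4.18.1 patch: a rough worked example of the limit computed in max_swapfile_size() above. The constants are assumptions for one hypothetical machine (46 physical address bits, 4 KiB pages, swap offset stored 3 bits below the pfn bits in the PTE), not values taken from this diff.)

#include <stdio.h>

#define ASSUMED_PHYS_BITS            46 /* assumption */
#define ASSUMED_PAGE_SHIFT           12 /* assumption: 4 KiB pages */
#define ASSUMED_SWP_OFFSET_FIRST_BIT  9 /* assumption: 3 bits below the pfn bits */

int main(void)
{
        /* MAX_PA/2 expressed in page frames, mirroring l1tf_pfn_limit() + 1 */
        unsigned long long l1tf_limit =
                1ULL << (ASSUMED_PHYS_BITS - 1 - ASSUMED_PAGE_SHIFT);

        /* Swap offsets sit a few bits below the pfn bits in the PTE, so the
         * usable number of swap pages is correspondingly higher. */
        l1tf_limit <<= ASSUMED_PAGE_SHIFT - ASSUMED_SWP_OFFSET_FIRST_BIT;

        printf("L1TF swap limit: %llu pages (%llu GiB)\n",
               l1tf_limit, (l1tf_limit << ASSUMED_PAGE_SHIFT) >> 30);
        return 0;
}

With those assumptions the limit works out to 2^36 pages, i.e. swap offsets that occupy the same PTE bits as the MAX_PA/2 pfn boundary.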
3120 +diff --git a/arch/x86/mm/kmmio.c b/arch/x86/mm/kmmio.c
3121 +index 7c8686709636..79eb55ce69a9 100644
3122 +--- a/arch/x86/mm/kmmio.c
3123 ++++ b/arch/x86/mm/kmmio.c
3124 +@@ -126,24 +126,29 @@ static struct kmmio_fault_page *get_kmmio_fault_page(unsigned long addr)
3125 +
3126 + static void clear_pmd_presence(pmd_t *pmd, bool clear, pmdval_t *old)
3127 + {
3128 ++ pmd_t new_pmd;
3129 + pmdval_t v = pmd_val(*pmd);
3130 + if (clear) {
3131 +- *old = v & _PAGE_PRESENT;
3132 +- v &= ~_PAGE_PRESENT;
3133 +- } else /* presume this has been called with clear==true previously */
3134 +- v |= *old;
3135 +- set_pmd(pmd, __pmd(v));
3136 ++ *old = v;
3137 ++ new_pmd = pmd_mknotpresent(*pmd);
3138 ++ } else {
3139 ++ /* Presume this has been called with clear==true previously */
3140 ++ new_pmd = __pmd(*old);
3141 ++ }
3142 ++ set_pmd(pmd, new_pmd);
3143 + }
3144 +
3145 + static void clear_pte_presence(pte_t *pte, bool clear, pteval_t *old)
3146 + {
3147 + pteval_t v = pte_val(*pte);
3148 + if (clear) {
3149 +- *old = v & _PAGE_PRESENT;
3150 +- v &= ~_PAGE_PRESENT;
3151 +- } else /* presume this has been called with clear==true previously */
3152 +- v |= *old;
3153 +- set_pte_atomic(pte, __pte(v));
3154 ++ *old = v;
3155 ++ /* Nothing should care about address */
3156 ++ pte_clear(&init_mm, 0, pte);
3157 ++ } else {
3158 ++ /* Presume this has been called with clear==true previously */
3159 ++ set_pte_atomic(pte, __pte(*old));
3160 ++ }
3161 + }
3162 +
3163 + static int clear_page_presence(struct kmmio_fault_page *f, bool clear)
3164 +diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
3165 +index 48c591251600..f40ab8185d94 100644
3166 +--- a/arch/x86/mm/mmap.c
3167 ++++ b/arch/x86/mm/mmap.c
3168 +@@ -240,3 +240,24 @@ int valid_mmap_phys_addr_range(unsigned long pfn, size_t count)
3169 +
3170 + return phys_addr_valid(addr + count - 1);
3171 + }
3172 ++
3173 ++/*
3174 ++ * Only allow root to set high MMIO mappings to PROT_NONE.
3175 ++ * This prevents an unprivileged user from setting them to PROT_NONE
3176 ++ * and inverting them, then pointing them at valid memory for L1TF speculation.
3177 ++ *
3178 ++ * Note: locked-down kernels may want to disable the root override.
3179 ++ */
3180 ++bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
3181 ++{
3182 ++ if (!boot_cpu_has_bug(X86_BUG_L1TF))
3183 ++ return true;
3184 ++ if (!__pte_needs_invert(pgprot_val(prot)))
3185 ++ return true;
3186 ++ /* If it's real memory always allow */
3187 ++ if (pfn_valid(pfn))
3188 ++ return true;
3189 ++ if (pfn > l1tf_pfn_limit() && !capable(CAP_SYS_ADMIN))
3190 ++ return false;
3191 ++ return true;
3192 ++}
3193 +diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
3194 +index 3bded76e8d5c..7bb6f65c79de 100644
3195 +--- a/arch/x86/mm/pageattr.c
3196 ++++ b/arch/x86/mm/pageattr.c
3197 +@@ -1014,8 +1014,8 @@ static long populate_pmd(struct cpa_data *cpa,
3198 +
3199 + pmd = pmd_offset(pud, start);
3200 +
3201 +- set_pmd(pmd, __pmd(cpa->pfn << PAGE_SHIFT | _PAGE_PSE |
3202 +- massage_pgprot(pmd_pgprot)));
3203 ++ set_pmd(pmd, pmd_mkhuge(pfn_pmd(cpa->pfn,
3204 ++ canon_pgprot(pmd_pgprot))));
3205 +
3206 + start += PMD_SIZE;
3207 + cpa->pfn += PMD_SIZE >> PAGE_SHIFT;
3208 +@@ -1087,8 +1087,8 @@ static int populate_pud(struct cpa_data *cpa, unsigned long start, p4d_t *p4d,
3209 + * Map everything starting from the Gb boundary, possibly with 1G pages
3210 + */
3211 + while (boot_cpu_has(X86_FEATURE_GBPAGES) && end - start >= PUD_SIZE) {
3212 +- set_pud(pud, __pud(cpa->pfn << PAGE_SHIFT | _PAGE_PSE |
3213 +- massage_pgprot(pud_pgprot)));
3214 ++ set_pud(pud, pud_mkhuge(pfn_pud(cpa->pfn,
3215 ++ canon_pgprot(pud_pgprot))));
3216 +
3217 + start += PUD_SIZE;
3218 + cpa->pfn += PUD_SIZE >> PAGE_SHIFT;
3219 +diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
3220 +index 4d418e705878..fb752d9a3ce9 100644
3221 +--- a/arch/x86/mm/pti.c
3222 ++++ b/arch/x86/mm/pti.c
3223 +@@ -45,6 +45,7 @@
3224 + #include <asm/pgalloc.h>
3225 + #include <asm/tlbflush.h>
3226 + #include <asm/desc.h>
3227 ++#include <asm/sections.h>
3228 +
3229 + #undef pr_fmt
3230 + #define pr_fmt(fmt) "Kernel/User page tables isolation: " fmt
3231 +diff --git a/arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c b/arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c
3232 +index 4f5fa65a1011..2acd6be13375 100644
3233 +--- a/arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c
3234 ++++ b/arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c
3235 +@@ -18,6 +18,7 @@
3236 + #include <asm/intel-mid.h>
3237 + #include <asm/intel_scu_ipc.h>
3238 + #include <asm/io_apic.h>
3239 ++#include <asm/hw_irq.h>
3240 +
3241 + #define TANGIER_EXT_TIMER0_MSI 12
3242 +
3243 +diff --git a/arch/x86/platform/uv/tlb_uv.c b/arch/x86/platform/uv/tlb_uv.c
3244 +index ca446da48fd2..3866b96a7ee7 100644
3245 +--- a/arch/x86/platform/uv/tlb_uv.c
3246 ++++ b/arch/x86/platform/uv/tlb_uv.c
3247 +@@ -1285,6 +1285,7 @@ void uv_bau_message_interrupt(struct pt_regs *regs)
3248 + struct msg_desc msgdesc;
3249 +
3250 + ack_APIC_irq();
3251 ++ kvm_set_cpu_l1tf_flush_l1d();
3252 + time_start = get_cycles();
3253 +
3254 + bcp = &per_cpu(bau_control, smp_processor_id());
3255 +diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
3256 +index 3b5318505c69..2eeddd814653 100644
3257 +--- a/arch/x86/xen/enlighten.c
3258 ++++ b/arch/x86/xen/enlighten.c
3259 +@@ -3,6 +3,7 @@
3260 + #endif
3261 + #include <linux/cpu.h>
3262 + #include <linux/kexec.h>
3263 ++#include <linux/slab.h>
3264 +
3265 + #include <xen/features.h>
3266 + #include <xen/page.h>
3267 +diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
3268 +index 30cc9c877ebb..eb9443d5bae1 100644
3269 +--- a/drivers/base/cpu.c
3270 ++++ b/drivers/base/cpu.c
3271 +@@ -540,16 +540,24 @@ ssize_t __weak cpu_show_spec_store_bypass(struct device *dev,
3272 + return sprintf(buf, "Not affected\n");
3273 + }
3274 +
3275 ++ssize_t __weak cpu_show_l1tf(struct device *dev,
3276 ++ struct device_attribute *attr, char *buf)
3277 ++{
3278 ++ return sprintf(buf, "Not affected\n");
3279 ++}
3280 ++
3281 + static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
3282 + static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
3283 + static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);
3284 + static DEVICE_ATTR(spec_store_bypass, 0444, cpu_show_spec_store_bypass, NULL);
3285 ++static DEVICE_ATTR(l1tf, 0444, cpu_show_l1tf, NULL);
3286 +
3287 + static struct attribute *cpu_root_vulnerabilities_attrs[] = {
3288 + &dev_attr_meltdown.attr,
3289 + &dev_attr_spectre_v1.attr,
3290 + &dev_attr_spectre_v2.attr,
3291 + &dev_attr_spec_store_bypass.attr,
3292 ++ &dev_attr_l1tf.attr,
3293 + NULL
3294 + };
3295 +
3296 +diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
3297 +index dc87797db500..b50b74053664 100644
3298 +--- a/drivers/gpu/drm/i915/i915_pmu.c
3299 ++++ b/drivers/gpu/drm/i915/i915_pmu.c
3300 +@@ -4,6 +4,7 @@
3301 + * Copyright © 2017-2018 Intel Corporation
3302 + */
3303 +
3304 ++#include <linux/irq.h>
3305 + #include "i915_pmu.h"
3306 + #include "intel_ringbuffer.h"
3307 + #include "i915_drv.h"
3308 +diff --git a/drivers/gpu/drm/i915/intel_lpe_audio.c b/drivers/gpu/drm/i915/intel_lpe_audio.c
3309 +index 6269750e2b54..b4941101f21a 100644
3310 +--- a/drivers/gpu/drm/i915/intel_lpe_audio.c
3311 ++++ b/drivers/gpu/drm/i915/intel_lpe_audio.c
3312 +@@ -62,6 +62,7 @@
3313 +
3314 + #include <linux/acpi.h>
3315 + #include <linux/device.h>
3316 ++#include <linux/irq.h>
3317 + #include <linux/pci.h>
3318 + #include <linux/pm_runtime.h>
3319 +
3320 +diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
3321 +index f6325f1a89e8..d4d4a55f09f8 100644
3322 +--- a/drivers/pci/controller/pci-hyperv.c
3323 ++++ b/drivers/pci/controller/pci-hyperv.c
3324 +@@ -45,6 +45,7 @@
3325 + #include <linux/irqdomain.h>
3326 + #include <asm/irqdomain.h>
3327 + #include <asm/apic.h>
3328 ++#include <linux/irq.h>
3329 + #include <linux/msi.h>
3330 + #include <linux/hyperv.h>
3331 + #include <linux/refcount.h>
3332 +diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
3333 +index f59639afaa39..26ca0276b503 100644
3334 +--- a/include/asm-generic/pgtable.h
3335 ++++ b/include/asm-generic/pgtable.h
3336 +@@ -1083,6 +1083,18 @@ int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
3337 + static inline void init_espfix_bsp(void) { }
3338 + #endif
3339 +
3340 ++#ifndef __HAVE_ARCH_PFN_MODIFY_ALLOWED
3341 ++static inline bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
3342 ++{
3343 ++ return true;
3344 ++}
3345 ++
3346 ++static inline bool arch_has_pfn_modify_check(void)
3347 ++{
3348 ++ return false;
3349 ++}
3350 ++#endif /* !_HAVE_ARCH_PFN_MODIFY_ALLOWED */
3351 ++
3352 + #endif /* !__ASSEMBLY__ */
3353 +
3354 + #ifndef io_remap_pfn_range
3355 +diff --git a/include/linux/cpu.h b/include/linux/cpu.h
3356 +index 3233fbe23594..45789a892c41 100644
3357 +--- a/include/linux/cpu.h
3358 ++++ b/include/linux/cpu.h
3359 +@@ -55,6 +55,8 @@ extern ssize_t cpu_show_spectre_v2(struct device *dev,
3360 + struct device_attribute *attr, char *buf);
3361 + extern ssize_t cpu_show_spec_store_bypass(struct device *dev,
3362 + struct device_attribute *attr, char *buf);
3363 ++extern ssize_t cpu_show_l1tf(struct device *dev,
3364 ++ struct device_attribute *attr, char *buf);
3365 +
3366 + extern __printf(4, 5)
3367 + struct device *cpu_device_create(struct device *parent, void *drvdata,
3368 +@@ -166,4 +168,23 @@ void cpuhp_report_idle_dead(void);
3369 + static inline void cpuhp_report_idle_dead(void) { }
3370 + #endif /* #ifdef CONFIG_HOTPLUG_CPU */
3371 +
3372 ++enum cpuhp_smt_control {
3373 ++ CPU_SMT_ENABLED,
3374 ++ CPU_SMT_DISABLED,
3375 ++ CPU_SMT_FORCE_DISABLED,
3376 ++ CPU_SMT_NOT_SUPPORTED,
3377 ++};
3378 ++
3379 ++#if defined(CONFIG_SMP) && defined(CONFIG_HOTPLUG_SMT)
3380 ++extern enum cpuhp_smt_control cpu_smt_control;
3381 ++extern void cpu_smt_disable(bool force);
3382 ++extern void cpu_smt_check_topology_early(void);
3383 ++extern void cpu_smt_check_topology(void);
3384 ++#else
3385 ++# define cpu_smt_control (CPU_SMT_ENABLED)
3386 ++static inline void cpu_smt_disable(bool force) { }
3387 ++static inline void cpu_smt_check_topology_early(void) { }
3388 ++static inline void cpu_smt_check_topology(void) { }
3389 ++#endif
3390 ++
3391 + #endif /* _LINUX_CPU_H_ */
3392 +diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
3393 +index 06bd7b096167..e06febf62978 100644
3394 +--- a/include/linux/swapfile.h
3395 ++++ b/include/linux/swapfile.h
3396 +@@ -10,5 +10,7 @@ extern spinlock_t swap_lock;
3397 + extern struct plist_head swap_active_head;
3398 + extern struct swap_info_struct *swap_info[];
3399 + extern int try_to_unuse(unsigned int, bool, unsigned long);
3400 ++extern unsigned long generic_max_swapfile_size(void);
3401 ++extern unsigned long max_swapfile_size(void);
3402 +
3403 + #endif /* _LINUX_SWAPFILE_H */
3404 +diff --git a/kernel/cpu.c b/kernel/cpu.c
3405 +index 2f8f338e77cf..f80afc674f02 100644
3406 +--- a/kernel/cpu.c
3407 ++++ b/kernel/cpu.c
3408 +@@ -60,6 +60,7 @@ struct cpuhp_cpu_state {
3409 + bool rollback;
3410 + bool single;
3411 + bool bringup;
3412 ++ bool booted_once;
3413 + struct hlist_node *node;
3414 + struct hlist_node *last;
3415 + enum cpuhp_state cb_state;
3416 +@@ -342,6 +343,85 @@ void cpu_hotplug_enable(void)
3417 + EXPORT_SYMBOL_GPL(cpu_hotplug_enable);
3418 + #endif /* CONFIG_HOTPLUG_CPU */
3419 +
3420 ++#ifdef CONFIG_HOTPLUG_SMT
3421 ++enum cpuhp_smt_control cpu_smt_control __read_mostly = CPU_SMT_ENABLED;
3422 ++EXPORT_SYMBOL_GPL(cpu_smt_control);
3423 ++
3424 ++static bool cpu_smt_available __read_mostly;
3425 ++
3426 ++void __init cpu_smt_disable(bool force)
3427 ++{
3428 ++ if (cpu_smt_control == CPU_SMT_FORCE_DISABLED ||
3429 ++ cpu_smt_control == CPU_SMT_NOT_SUPPORTED)
3430 ++ return;
3431 ++
3432 ++ if (force) {
3433 ++ pr_info("SMT: Force disabled\n");
3434 ++ cpu_smt_control = CPU_SMT_FORCE_DISABLED;
3435 ++ } else {
3436 ++ cpu_smt_control = CPU_SMT_DISABLED;
3437 ++ }
3438 ++}
3439 ++
3440 ++/*
3441 ++ * The decision of whether SMT is supported can only be made after full
3442 ++ * CPU identification. Called from architecture code before non-boot CPUs
3443 ++ * are brought up.
3444 ++ */
3445 ++void __init cpu_smt_check_topology_early(void)
3446 ++{
3447 ++ if (!topology_smt_supported())
3448 ++ cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
3449 ++}
3450 ++
3451 ++/*
3452 ++ * If SMT was disabled by BIOS, detect it here, after the CPUs have been
3453 ++ * brought online. This ensures the smt/l1tf sysfs entries are consistent
3454 ++ * with reality. cpu_smt_available is set to true during the bringup of
3455 ++ * non-boot CPUs when an SMT sibling is detected. Note that this may
3456 ++ * overwrite cpu_smt_control's previous setting.
3457 ++ */
3458 ++void __init cpu_smt_check_topology(void)
3459 ++{
3460 ++ if (!cpu_smt_available)
3461 ++ cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
3462 ++}
3463 ++
3464 ++static int __init smt_cmdline_disable(char *str)
3465 ++{
3466 ++ cpu_smt_disable(str && !strcmp(str, "force"));
3467 ++ return 0;
3468 ++}
3469 ++early_param("nosmt", smt_cmdline_disable);
3470 ++
3471 ++static inline bool cpu_smt_allowed(unsigned int cpu)
3472 ++{
3473 ++ if (topology_is_primary_thread(cpu))
3474 ++ return true;
3475 ++
3476 ++ /*
3477 ++ * If the CPU is not a 'primary' thread and the booted_once bit is
3478 ++ * set then the processor has SMT support. Store this information
3479 ++ * for the late check of SMT support in cpu_smt_check_topology().
3480 ++ */
3481 ++ if (per_cpu(cpuhp_state, cpu).booted_once)
3482 ++ cpu_smt_available = true;
3483 ++
3484 ++ if (cpu_smt_control == CPU_SMT_ENABLED)
3485 ++ return true;
3486 ++
3487 ++ /*
3488 ++ * On x86 it's required to boot all logical CPUs at least once so
3489 ++ * that the init code can get a chance to set CR4.MCE on each
3490 ++ * CPU. Otherwise, a broadcast MCE observing CR4.MCE=0b on any
3491 ++ * core will shut down the machine.
3492 ++ */
3493 ++ return !per_cpu(cpuhp_state, cpu).booted_once;
3494 ++}
3495 ++#else
3496 ++static inline bool cpu_smt_allowed(unsigned int cpu) { return true; }
3497 ++#endif
3498 ++
3499 + static inline enum cpuhp_state
3500 + cpuhp_set_state(struct cpuhp_cpu_state *st, enum cpuhp_state target)
3501 + {
3502 +@@ -422,6 +502,16 @@ static int bringup_wait_for_ap(unsigned int cpu)
3503 + stop_machine_unpark(cpu);
3504 + kthread_unpark(st->thread);
3505 +
3506 ++ /*
3507 ++ * SMT soft disabling on X86 requires bringing the CPU out of the
3508 ++ * BIOS 'wait for SIPI' state in order to set the CR4.MCE bit. The
3509 ++ * CPU marked itself as booted_once in notify_cpu_starting() so the
3510 ++ * cpu_smt_allowed() check will now return false if this is not the
3511 ++ * primary sibling.
3512 ++ */
3513 ++ if (!cpu_smt_allowed(cpu))
3514 ++ return -ECANCELED;
3515 ++
3516 + if (st->target <= CPUHP_AP_ONLINE_IDLE)
3517 + return 0;
3518 +
3519 +@@ -754,7 +844,6 @@ static int takedown_cpu(unsigned int cpu)
3520 +
3521 + /* Park the smpboot threads */
3522 + kthread_park(per_cpu_ptr(&cpuhp_state, cpu)->thread);
3523 +- smpboot_park_threads(cpu);
3524 +
3525 + /*
3526 + * Prevent irq alloc/free while the dying cpu reorganizes the
3527 +@@ -907,20 +996,19 @@ out:
3528 + return ret;
3529 + }
3530 +
3531 ++static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
3532 ++{
3533 ++ if (cpu_hotplug_disabled)
3534 ++ return -EBUSY;
3535 ++ return _cpu_down(cpu, 0, target);
3536 ++}
3537 ++
3538 + static int do_cpu_down(unsigned int cpu, enum cpuhp_state target)
3539 + {
3540 + int err;
3541 +
3542 + cpu_maps_update_begin();
3543 +-
3544 +- if (cpu_hotplug_disabled) {
3545 +- err = -EBUSY;
3546 +- goto out;
3547 +- }
3548 +-
3549 +- err = _cpu_down(cpu, 0, target);
3550 +-
3551 +-out:
3552 ++ err = cpu_down_maps_locked(cpu, target);
3553 + cpu_maps_update_done();
3554 + return err;
3555 + }
3556 +@@ -949,6 +1037,7 @@ void notify_cpu_starting(unsigned int cpu)
3557 + int ret;
3558 +
3559 + rcu_cpu_starting(cpu); /* Enables RCU usage on this CPU. */
3560 ++ st->booted_once = true;
3561 + while (st->state < target) {
3562 + st->state++;
3563 + ret = cpuhp_invoke_callback(cpu, st->state, true, NULL, NULL);
3564 +@@ -1058,6 +1147,10 @@ static int do_cpu_up(unsigned int cpu, enum cpuhp_state target)
3565 + err = -EBUSY;
3566 + goto out;
3567 + }
3568 ++ if (!cpu_smt_allowed(cpu)) {
3569 ++ err = -EPERM;
3570 ++ goto out;
3571 ++ }
3572 +
3573 + err = _cpu_up(cpu, 0, target);
3574 + out:
3575 +@@ -1332,7 +1425,7 @@ static struct cpuhp_step cpuhp_hp_states[] = {
3576 + [CPUHP_AP_SMPBOOT_THREADS] = {
3577 + .name = "smpboot/threads:online",
3578 + .startup.single = smpboot_unpark_threads,
3579 +- .teardown.single = NULL,
3580 ++ .teardown.single = smpboot_park_threads,
3581 + },
3582 + [CPUHP_AP_IRQ_AFFINITY_ONLINE] = {
3583 + .name = "irq/affinity:online",
3584 +@@ -1906,10 +1999,172 @@ static const struct attribute_group cpuhp_cpu_root_attr_group = {
3585 + NULL
3586 + };
3587 +
3588 ++#ifdef CONFIG_HOTPLUG_SMT
3589 ++
3590 ++static const char *smt_states[] = {
3591 ++ [CPU_SMT_ENABLED] = "on",
3592 ++ [CPU_SMT_DISABLED] = "off",
3593 ++ [CPU_SMT_FORCE_DISABLED] = "forceoff",
3594 ++ [CPU_SMT_NOT_SUPPORTED] = "notsupported",
3595 ++};
3596 ++
3597 ++static ssize_t
3598 ++show_smt_control(struct device *dev, struct device_attribute *attr, char *buf)
3599 ++{
3600 ++ return snprintf(buf, PAGE_SIZE - 2, "%s\n", smt_states[cpu_smt_control]);
3601 ++}
3602 ++
3603 ++static void cpuhp_offline_cpu_device(unsigned int cpu)
3604 ++{
3605 ++ struct device *dev = get_cpu_device(cpu);
3606 ++
3607 ++ dev->offline = true;
3608 ++ /* Tell user space about the state change */
3609 ++ kobject_uevent(&dev->kobj, KOBJ_OFFLINE);
3610 ++}
3611 ++
3612 ++static void cpuhp_online_cpu_device(unsigned int cpu)
3613 ++{
3614 ++ struct device *dev = get_cpu_device(cpu);
3615 ++
3616 ++ dev->offline = false;
3617 ++ /* Tell user space about the state change */
3618 ++ kobject_uevent(&dev->kobj, KOBJ_ONLINE);
3619 ++}
3620 ++
3621 ++static int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
3622 ++{
3623 ++ int cpu, ret = 0;
3624 ++
3625 ++ cpu_maps_update_begin();
3626 ++ for_each_online_cpu(cpu) {
3627 ++ if (topology_is_primary_thread(cpu))
3628 ++ continue;
3629 ++ ret = cpu_down_maps_locked(cpu, CPUHP_OFFLINE);
3630 ++ if (ret)
3631 ++ break;
3632 ++ /*
3633 ++ * As this needs to hold the cpu maps lock it's impossible
3634 ++ * to call device_offline() because that ends up calling
3635 ++ * cpu_down() which takes cpu maps lock. cpu maps lock
3636 ++ * needs to be held as this might race against in-kernel
3637 ++ * abusers of the hotplug machinery (thermal management).
3638 ++ *
3639 ++ * So nothing would update device:offline state. That would
3640 ++ * leave the sysfs entry stale and prevent onlining after
3641 ++ * smt control has been changed to 'off' again. This is
3642 ++ * called under the sysfs hotplug lock, so it is properly
3643 ++ * serialized against the regular offline usage.
3644 ++ */
3645 ++ cpuhp_offline_cpu_device(cpu);
3646 ++ }
3647 ++ if (!ret)
3648 ++ cpu_smt_control = ctrlval;
3649 ++ cpu_maps_update_done();
3650 ++ return ret;
3651 ++}
3652 ++
3653 ++static int cpuhp_smt_enable(void)
3654 ++{
3655 ++ int cpu, ret = 0;
3656 ++
3657 ++ cpu_maps_update_begin();
3658 ++ cpu_smt_control = CPU_SMT_ENABLED;
3659 ++ for_each_present_cpu(cpu) {
3660 ++ /* Skip online CPUs and CPUs on offline nodes */
3661 ++ if (cpu_online(cpu) || !node_online(cpu_to_node(cpu)))
3662 ++ continue;
3663 ++ ret = _cpu_up(cpu, 0, CPUHP_ONLINE);
3664 ++ if (ret)
3665 ++ break;
3666 ++ /* See comment in cpuhp_smt_disable() */
3667 ++ cpuhp_online_cpu_device(cpu);
3668 ++ }
3669 ++ cpu_maps_update_done();
3670 ++ return ret;
3671 ++}
3672 ++
3673 ++static ssize_t
3674 ++store_smt_control(struct device *dev, struct device_attribute *attr,
3675 ++ const char *buf, size_t count)
3676 ++{
3677 ++ int ctrlval, ret;
3678 ++
3679 ++ if (sysfs_streq(buf, "on"))
3680 ++ ctrlval = CPU_SMT_ENABLED;
3681 ++ else if (sysfs_streq(buf, "off"))
3682 ++ ctrlval = CPU_SMT_DISABLED;
3683 ++ else if (sysfs_streq(buf, "forceoff"))
3684 ++ ctrlval = CPU_SMT_FORCE_DISABLED;
3685 ++ else
3686 ++ return -EINVAL;
3687 ++
3688 ++ if (cpu_smt_control == CPU_SMT_FORCE_DISABLED)
3689 ++ return -EPERM;
3690 ++
3691 ++ if (cpu_smt_control == CPU_SMT_NOT_SUPPORTED)
3692 ++ return -ENODEV;
3693 ++
3694 ++ ret = lock_device_hotplug_sysfs();
3695 ++ if (ret)
3696 ++ return ret;
3697 ++
3698 ++ if (ctrlval != cpu_smt_control) {
3699 ++ switch (ctrlval) {
3700 ++ case CPU_SMT_ENABLED:
3701 ++ ret = cpuhp_smt_enable();
3702 ++ break;
3703 ++ case CPU_SMT_DISABLED:
3704 ++ case CPU_SMT_FORCE_DISABLED:
3705 ++ ret = cpuhp_smt_disable(ctrlval);
3706 ++ break;
3707 ++ }
3708 ++ }
3709 ++
3710 ++ unlock_device_hotplug();
3711 ++ return ret ? ret : count;
3712 ++}
3713 ++static DEVICE_ATTR(control, 0644, show_smt_control, store_smt_control);
3714 ++
3715 ++static ssize_t
3716 ++show_smt_active(struct device *dev, struct device_attribute *attr, char *buf)
3717 ++{
3718 ++ bool active = topology_max_smt_threads() > 1;
3719 ++
3720 ++ return snprintf(buf, PAGE_SIZE - 2, "%d\n", active);
3721 ++}
3722 ++static DEVICE_ATTR(active, 0444, show_smt_active, NULL);
3723 ++
3724 ++static struct attribute *cpuhp_smt_attrs[] = {
3725 ++ &dev_attr_control.attr,
3726 ++ &dev_attr_active.attr,
3727 ++ NULL
3728 ++};
3729 ++
3730 ++static const struct attribute_group cpuhp_smt_attr_group = {
3731 ++ .attrs = cpuhp_smt_attrs,
3732 ++ .name = "smt",
3733 ++ NULL
3734 ++};
3735 ++
3736 ++static int __init cpu_smt_state_init(void)
3737 ++{
3738 ++ return sysfs_create_group(&cpu_subsys.dev_root->kobj,
3739 ++ &cpuhp_smt_attr_group);
3740 ++}
3741 ++
3742 ++#else
3743 ++static inline int cpu_smt_state_init(void) { return 0; }
3744 ++#endif
3745 ++
3746 + static int __init cpuhp_sysfs_init(void)
3747 + {
3748 + int cpu, ret;
3749 +
3750 ++ ret = cpu_smt_state_init();
3751 ++ if (ret)
3752 ++ return ret;
3753 ++
3754 + ret = sysfs_create_group(&cpu_subsys.dev_root->kobj,
3755 + &cpuhp_cpu_root_attr_group);
3756 + if (ret)
3757 +@@ -2012,5 +2267,8 @@ void __init boot_cpu_init(void)
3758 + */
3759 + void __init boot_cpu_hotplug_init(void)
3760 + {
3761 +- per_cpu_ptr(&cpuhp_state, smp_processor_id())->state = CPUHP_ONLINE;
3762 ++#ifdef CONFIG_SMP
3763 ++ this_cpu_write(cpuhp_state.booted_once, true);
3764 ++#endif
3765 ++ this_cpu_write(cpuhp_state.state, CPUHP_ONLINE);
3766 + }
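(Not part of the 4.18.1 patch: the hunks above create /sys/devices/system/cpu/smt/control and /sys/devices/system/cpu/smt/active, with control taking the values listed in smt_states[]. A small user-space sketch that reads both files is shown below; writing "off" or "forceoff" to the control file goes through store_smt_control() and offlines the non-primary siblings.)

#include <stdio.h>
#include <string.h>

/* The paths come from the "smt" attribute group registered above. */
static int read_sysfs_line(const char *path, char *buf, size_t len)
{
        FILE *f = fopen(path, "r");

        if (!f || !fgets(buf, (int)len, f)) {
                if (f)
                        fclose(f);
                return -1;
        }
        fclose(f);
        buf[strcspn(buf, "\n")] = '\0';
        return 0;
}

int main(void)
{
        char val[32];

        if (!read_sysfs_line("/sys/devices/system/cpu/smt/control", val, sizeof(val)))
                printf("SMT control: %s\n", val); /* on/off/forceoff/notsupported */
        if (!read_sysfs_line("/sys/devices/system/cpu/smt/active", val, sizeof(val)))
                printf("SMT active:  %s\n", val); /* 1 when siblings are online */
        return 0;
}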
3767 +diff --git a/kernel/sched/core.c b/kernel/sched/core.c
3768 +index fe365c9a08e9..5ba96d9ddbde 100644
3769 +--- a/kernel/sched/core.c
3770 ++++ b/kernel/sched/core.c
3771 +@@ -5774,6 +5774,18 @@ int sched_cpu_activate(unsigned int cpu)
3772 + struct rq *rq = cpu_rq(cpu);
3773 + struct rq_flags rf;
3774 +
3775 ++#ifdef CONFIG_SCHED_SMT
3776 ++ /*
3777 ++ * The sched_smt_present static key needs to be evaluated on every
3778 ++ * hotplug event because at boot time SMT might be disabled when
3779 ++ * the number of booted CPUs is limited.
3780 ++ *
3781 ++ * If a sibling is hotplugged later, the key would otherwise stay
3782 ++ * off and SMT scheduling would never become functional.
3783 ++ */
3784 ++ if (cpumask_weight(cpu_smt_mask(cpu)) > 1)
3785 ++ static_branch_enable_cpuslocked(&sched_smt_present);
3786 ++#endif
3787 + set_cpu_active(cpu, true);
3788 +
3789 + if (sched_smp_initialized) {
3790 +@@ -5871,22 +5883,6 @@ int sched_cpu_dying(unsigned int cpu)
3791 + }
3792 + #endif
3793 +
3794 +-#ifdef CONFIG_SCHED_SMT
3795 +-DEFINE_STATIC_KEY_FALSE(sched_smt_present);
3796 +-
3797 +-static void sched_init_smt(void)
3798 +-{
3799 +- /*
3800 +- * We've enumerated all CPUs and will assume that if any CPU
3801 +- * has SMT siblings, CPU0 will too.
3802 +- */
3803 +- if (cpumask_weight(cpu_smt_mask(0)) > 1)
3804 +- static_branch_enable(&sched_smt_present);
3805 +-}
3806 +-#else
3807 +-static inline void sched_init_smt(void) { }
3808 +-#endif
3809 +-
3810 + void __init sched_init_smp(void)
3811 + {
3812 + sched_init_numa();
3813 +@@ -5908,8 +5904,6 @@ void __init sched_init_smp(void)
3814 + init_sched_rt_class();
3815 + init_sched_dl_class();
3816 +
3817 +- sched_init_smt();
3818 +-
3819 + sched_smp_initialized = true;
3820 + }
3821 +
3822 +diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
3823 +index 2f0a0be4d344..9c219f7b0970 100644
3824 +--- a/kernel/sched/fair.c
3825 ++++ b/kernel/sched/fair.c
3826 +@@ -6237,6 +6237,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
3827 + }
3828 +
3829 + #ifdef CONFIG_SCHED_SMT
3830 ++DEFINE_STATIC_KEY_FALSE(sched_smt_present);
3831 +
3832 + static inline void set_idle_cores(int cpu, int val)
3833 + {
3834 +diff --git a/kernel/smp.c b/kernel/smp.c
3835 +index 084c8b3a2681..d86eec5f51c1 100644
3836 +--- a/kernel/smp.c
3837 ++++ b/kernel/smp.c
3838 +@@ -584,6 +584,8 @@ void __init smp_init(void)
3839 + num_nodes, (num_nodes > 1 ? "s" : ""),
3840 + num_cpus, (num_cpus > 1 ? "s" : ""));
3841 +
3842 ++ /* Final decision about SMT support */
3843 ++ cpu_smt_check_topology();
3844 + /* Any cleanup work */
3845 + smp_cpus_done(setup_max_cpus);
3846 + }
3847 +diff --git a/mm/memory.c b/mm/memory.c
3848 +index c5e87a3a82ba..0e356dd923c2 100644
3849 +--- a/mm/memory.c
3850 ++++ b/mm/memory.c
3851 +@@ -1884,6 +1884,9 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
3852 + if (addr < vma->vm_start || addr >= vma->vm_end)
3853 + return -EFAULT;
3854 +
3855 ++ if (!pfn_modify_allowed(pfn, pgprot))
3856 ++ return -EACCES;
3857 ++
3858 + track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
3859 +
3860 + ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
3861 +@@ -1919,6 +1922,9 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
3862 +
3863 + track_pfn_insert(vma, &pgprot, pfn);
3864 +
3865 ++ if (!pfn_modify_allowed(pfn_t_to_pfn(pfn), pgprot))
3866 ++ return -EACCES;
3867 ++
3868 + /*
3869 + * If we don't have pte special, then we have to use the pfn_valid()
3870 + * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
3871 +@@ -1980,6 +1986,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
3872 + {
3873 + pte_t *pte;
3874 + spinlock_t *ptl;
3875 ++ int err = 0;
3876 +
3877 + pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
3878 + if (!pte)
3879 +@@ -1987,12 +1994,16 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
3880 + arch_enter_lazy_mmu_mode();
3881 + do {
3882 + BUG_ON(!pte_none(*pte));
3883 ++ if (!pfn_modify_allowed(pfn, prot)) {
3884 ++ err = -EACCES;
3885 ++ break;
3886 ++ }
3887 + set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
3888 + pfn++;
3889 + } while (pte++, addr += PAGE_SIZE, addr != end);
3890 + arch_leave_lazy_mmu_mode();
3891 + pte_unmap_unlock(pte - 1, ptl);
3892 +- return 0;
3893 ++ return err;
3894 + }
3895 +
3896 + static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
3897 +@@ -2001,6 +2012,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
3898 + {
3899 + pmd_t *pmd;
3900 + unsigned long next;
3901 ++ int err;
3902 +
3903 + pfn -= addr >> PAGE_SHIFT;
3904 + pmd = pmd_alloc(mm, pud, addr);
3905 +@@ -2009,9 +2021,10 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
3906 + VM_BUG_ON(pmd_trans_huge(*pmd));
3907 + do {
3908 + next = pmd_addr_end(addr, end);
3909 +- if (remap_pte_range(mm, pmd, addr, next,
3910 +- pfn + (addr >> PAGE_SHIFT), prot))
3911 +- return -ENOMEM;
3912 ++ err = remap_pte_range(mm, pmd, addr, next,
3913 ++ pfn + (addr >> PAGE_SHIFT), prot);
3914 ++ if (err)
3915 ++ return err;
3916 + } while (pmd++, addr = next, addr != end);
3917 + return 0;
3918 + }
3919 +@@ -2022,6 +2035,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
3920 + {
3921 + pud_t *pud;
3922 + unsigned long next;
3923 ++ int err;
3924 +
3925 + pfn -= addr >> PAGE_SHIFT;
3926 + pud = pud_alloc(mm, p4d, addr);
3927 +@@ -2029,9 +2043,10 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
3928 + return -ENOMEM;
3929 + do {
3930 + next = pud_addr_end(addr, end);
3931 +- if (remap_pmd_range(mm, pud, addr, next,
3932 +- pfn + (addr >> PAGE_SHIFT), prot))
3933 +- return -ENOMEM;
3934 ++ err = remap_pmd_range(mm, pud, addr, next,
3935 ++ pfn + (addr >> PAGE_SHIFT), prot);
3936 ++ if (err)
3937 ++ return err;
3938 + } while (pud++, addr = next, addr != end);
3939 + return 0;
3940 + }
3941 +@@ -2042,6 +2057,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
3942 + {
3943 + p4d_t *p4d;
3944 + unsigned long next;
3945 ++ int err;
3946 +
3947 + pfn -= addr >> PAGE_SHIFT;
3948 + p4d = p4d_alloc(mm, pgd, addr);
3949 +@@ -2049,9 +2065,10 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
3950 + return -ENOMEM;
3951 + do {
3952 + next = p4d_addr_end(addr, end);
3953 +- if (remap_pud_range(mm, p4d, addr, next,
3954 +- pfn + (addr >> PAGE_SHIFT), prot))
3955 +- return -ENOMEM;
3956 ++ err = remap_pud_range(mm, p4d, addr, next,
3957 ++ pfn + (addr >> PAGE_SHIFT), prot);
3958 ++ if (err)
3959 ++ return err;
3960 + } while (p4d++, addr = next, addr != end);
3961 + return 0;
3962 + }
3963 +diff --git a/mm/mprotect.c b/mm/mprotect.c
3964 +index 625608bc8962..6d331620b9e5 100644
3965 +--- a/mm/mprotect.c
3966 ++++ b/mm/mprotect.c
3967 +@@ -306,6 +306,42 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
3968 + return pages;
3969 + }
3970 +
3971 ++static int prot_none_pte_entry(pte_t *pte, unsigned long addr,
3972 ++ unsigned long next, struct mm_walk *walk)
3973 ++{
3974 ++ return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
3975 ++ 0 : -EACCES;
3976 ++}
3977 ++
3978 ++static int prot_none_hugetlb_entry(pte_t *pte, unsigned long hmask,
3979 ++ unsigned long addr, unsigned long next,
3980 ++ struct mm_walk *walk)
3981 ++{
3982 ++ return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
3983 ++ 0 : -EACCES;
3984 ++}
3985 ++
3986 ++static int prot_none_test(unsigned long addr, unsigned long next,
3987 ++ struct mm_walk *walk)
3988 ++{
3989 ++ return 0;
3990 ++}
3991 ++
3992 ++static int prot_none_walk(struct vm_area_struct *vma, unsigned long start,
3993 ++ unsigned long end, unsigned long newflags)
3994 ++{
3995 ++ pgprot_t new_pgprot = vm_get_page_prot(newflags);
3996 ++ struct mm_walk prot_none_walk = {
3997 ++ .pte_entry = prot_none_pte_entry,
3998 ++ .hugetlb_entry = prot_none_hugetlb_entry,
3999 ++ .test_walk = prot_none_test,
4000 ++ .mm = current->mm,
4001 ++ .private = &new_pgprot,
4002 ++ };
4003 ++
4004 ++ return walk_page_range(start, end, &prot_none_walk);
4005 ++}
4006 ++
4007 + int
4008 + mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
4009 + unsigned long start, unsigned long end, unsigned long newflags)
4010 +@@ -323,6 +359,19 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
4011 + return 0;
4012 + }
4013 +
4014 ++ /*
4015 ++ * Do PROT_NONE PFN permission checks here when we can still
4016 ++ * bail out without undoing a lot of state. This is a rather
4017 ++ * uncommon case, so doesn't need to be very optimized.
4018 ++ */
4019 ++ if (arch_has_pfn_modify_check() &&
4020 ++ (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
4021 ++ (newflags & (VM_READ|VM_WRITE|VM_EXEC)) == 0) {
4022 ++ error = prot_none_walk(vma, start, end, newflags);
4023 ++ if (error)
4024 ++ return error;
4025 ++ }
4026 ++
4027 + /*
4028 + * If we make a private mapping writable we increase our commit;
4029 + * but (without finer accounting) cannot reduce our commit if we
4030 +diff --git a/mm/swapfile.c b/mm/swapfile.c
4031 +index 2cc2972eedaf..18185ae4f223 100644
4032 +--- a/mm/swapfile.c
4033 ++++ b/mm/swapfile.c
4034 +@@ -2909,6 +2909,35 @@ static int claim_swapfile(struct swap_info_struct *p, struct inode *inode)
4035 + return 0;
4036 + }
4037 +
4038 ++
4039 ++/*
4040 ++ * Find out how many pages are allowed for a single swap device. There
4041 ++ * are two limiting factors:
4042 ++ * 1) the number of bits for the swap offset in the swp_entry_t type, and
4043 ++ * 2) the number of bits in the swap pte, as defined by the different
4044 ++ * architectures.
4045 ++ *
4046 ++ * In order to find the largest possible bit mask, a swap entry with
4047 ++ * swap type 0 and swap offset ~0UL is created, encoded to a swap pte,
4048 ++ * decoded to a swp_entry_t again, and finally the swap offset is
4049 ++ * extracted.
4050 ++ *
4051 ++ * This will mask all the bits from the initial ~0UL mask that can't
4052 ++ * be encoded in either the swp_entry_t or the architecture definition
4053 ++ * of a swap pte.
4054 ++ */
4055 ++unsigned long generic_max_swapfile_size(void)
4056 ++{
4057 ++ return swp_offset(pte_to_swp_entry(
4058 ++ swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
4059 ++}
4060 ++
4061 ++/* Can be overridden by an architecture for additional checks. */
4062 ++__weak unsigned long max_swapfile_size(void)
4063 ++{
4064 ++ return generic_max_swapfile_size();
4065 ++}
4066 ++
4067 + static unsigned long read_swap_header(struct swap_info_struct *p,
4068 + union swap_header *swap_header,
4069 + struct inode *inode)
4070 +@@ -2944,22 +2973,7 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
4071 + p->cluster_next = 1;
4072 + p->cluster_nr = 0;
4073 +
4074 +- /*
4075 +- * Find out how many pages are allowed for a single swap
4076 +- * device. There are two limiting factors: 1) the number
4077 +- * of bits for the swap offset in the swp_entry_t type, and
4078 +- * 2) the number of bits in the swap pte as defined by the
4079 +- * different architectures. In order to find the
4080 +- * largest possible bit mask, a swap entry with swap type 0
4081 +- * and swap offset ~0UL is created, encoded to a swap pte,
4082 +- * decoded to a swp_entry_t again, and finally the swap
4083 +- * offset is extracted. This will mask all the bits from
4084 +- * the initial ~0UL mask that can't be encoded in either
4085 +- * the swp_entry_t or the architecture definition of a
4086 +- * swap pte.
4087 +- */
4088 +- maxpages = swp_offset(pte_to_swp_entry(
4089 +- swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
4090 ++ maxpages = max_swapfile_size();
4091 + last_page = swap_header->info.last_page;
4092 + if (!last_page) {
4093 + pr_warn("Empty swap-file\n");
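(Not part of the 4.18.1 patch: generic_max_swapfile_size() above finds the widest encodable swap offset by round-tripping ~0UL through the swap-pte encode/decode helpers. The toy model below illustrates the same trick with made-up field widths; the real kernel derives the widths from the architecture's swp_entry_to_pte()/pte_to_swp_entry() macros.)

#include <stdio.h>

/* Made-up encoding for illustration only: swap type in the low 5 bits,
 * room for a 50-bit offset above it. */
#define TOY_TYPE_BITS   5
#define TOY_OFFSET_BITS 50

static unsigned long long toy_encode(unsigned int type, unsigned long long off)
{
        /* Offset bits that do not fit in the encoding are lost here. */
        off &= (1ULL << TOY_OFFSET_BITS) - 1;
        return (unsigned long long)type | (off << TOY_TYPE_BITS);
}

static unsigned long long toy_decode_offset(unsigned long long enc)
{
        return enc >> TOY_TYPE_BITS;
}

int main(void)
{
        /* Same trick as generic_max_swapfile_size(): push an all-ones offset
         * through the encode/decode pair and see how much of it survives. */
        unsigned long long max_pages = toy_decode_offset(toy_encode(0, ~0ULL)) + 1;

        printf("toy max swapfile size: %llu pages\n", max_pages);
        return 0;
}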
4094 +diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h
4095 +index 5701f5cecd31..64aaa3f5f36c 100644
4096 +--- a/tools/arch/x86/include/asm/cpufeatures.h
4097 ++++ b/tools/arch/x86/include/asm/cpufeatures.h
4098 +@@ -219,6 +219,7 @@
4099 + #define X86_FEATURE_IBPB ( 7*32+26) /* Indirect Branch Prediction Barrier */
4100 + #define X86_FEATURE_STIBP ( 7*32+27) /* Single Thread Indirect Branch Predictors */
4101 + #define X86_FEATURE_ZEN ( 7*32+28) /* "" CPU is AMD family 0x17 (Zen) */
4102 ++#define X86_FEATURE_L1TF_PTEINV ( 7*32+29) /* "" L1TF workaround PTE inversion */
4103 +
4104 + /* Virtualization flags: Linux defined, word 8 */
4105 + #define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */
4106 +@@ -341,6 +342,7 @@
4107 + #define X86_FEATURE_PCONFIG (18*32+18) /* Intel PCONFIG */
4108 + #define X86_FEATURE_SPEC_CTRL (18*32+26) /* "" Speculation Control (IBRS + IBPB) */
4109 + #define X86_FEATURE_INTEL_STIBP (18*32+27) /* "" Single Thread Indirect Branch Predictors */
4110 ++#define X86_FEATURE_FLUSH_L1D (18*32+28) /* Flush L1D cache */
4111 + #define X86_FEATURE_ARCH_CAPABILITIES (18*32+29) /* IA32_ARCH_CAPABILITIES MSR (Intel) */
4112 + #define X86_FEATURE_SPEC_CTRL_SSBD (18*32+31) /* "" Speculative Store Bypass Disable */
4113 +
4114 +@@ -373,5 +375,6 @@
4115 + #define X86_BUG_SPECTRE_V1 X86_BUG(15) /* CPU is affected by Spectre variant 1 attack with conditional branches */
4116 + #define X86_BUG_SPECTRE_V2 X86_BUG(16) /* CPU is affected by Spectre variant 2 attack with indirect branches */
4117 + #define X86_BUG_SPEC_STORE_BYPASS X86_BUG(17) /* CPU is affected by speculative store bypass attack */
4118 ++#define X86_BUG_L1TF X86_BUG(18) /* CPU is affected by L1 Terminal Fault */
4119 +
4120 + #endif /* _ASM_X86_CPUFEATURES_H */