Gentoo Archives: gentoo-commits

From: Mike Pagano <mpagano@g.o>
To: gentoo-commits@l.g.o
Subject: [gentoo-commits] proj/linux-patches:4.18 commit in: /
Date: Wed, 15 Aug 2018 16:37:04
Message-Id: 1534351012.ad052097fe9d40c63236e6ae02f106d5226de58d.mpagano@gentoo
1 commit: ad052097fe9d40c63236e6ae02f106d5226de58d
2 Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
3 AuthorDate: Wed Aug 15 16:36:52 2018 +0000
4 Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
5 CommitDate: Wed Aug 15 16:36:52 2018 +0000
6 URL: https://gitweb.gentoo.org/proj/linux-patches.git/commit/?id=ad052097
7
8 Linux patch 4.18.1
9
10 0000_README | 4 +
11 1000_linux-4.18.1.patch | 4083 +++++++++++++++++++++++++++++++++++++++++++++++
12 2 files changed, 4087 insertions(+)
13
14 diff --git a/0000_README b/0000_README
15 index 917d838..cf32ff2 100644
16 --- a/0000_README
17 +++ b/0000_README
18 @@ -43,6 +43,10 @@ EXPERIMENTAL
19 Individual Patch Descriptions:
20 --------------------------------------------------------------------------
21
22 +Patch: 1000_linux-4.18.1.patch
23 +From: http://www.kernel.org
24 +Desc: Linux 4.18.1
25 +
26 Patch: 1500_XATTR_USER_PREFIX.patch
27 From: https://bugs.gentoo.org/show_bug.cgi?id=470644
28 Desc: Support for namespace user.pax.* on tmpfs.
29
30 diff --git a/1000_linux-4.18.1.patch b/1000_linux-4.18.1.patch
31 new file mode 100644
32 index 0000000..bd9c2da
33 --- /dev/null
34 +++ b/1000_linux-4.18.1.patch
35 @@ -0,0 +1,4083 @@
36 +diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
37 +index 9c5e7732d249..73318225a368 100644
38 +--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
39 ++++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
40 +@@ -476,6 +476,7 @@ What: /sys/devices/system/cpu/vulnerabilities
41 + /sys/devices/system/cpu/vulnerabilities/spectre_v1
42 + /sys/devices/system/cpu/vulnerabilities/spectre_v2
43 + /sys/devices/system/cpu/vulnerabilities/spec_store_bypass
44 ++ /sys/devices/system/cpu/vulnerabilities/l1tf
45 + Date: January 2018
46 + Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
47 + Description: Information about CPU vulnerabilities
48 +@@ -487,3 +488,26 @@ Description: Information about CPU vulnerabilities
49 + "Not affected" CPU is not affected by the vulnerability
50 + "Vulnerable" CPU is affected and no mitigation in effect
51 + "Mitigation: $M" CPU is affected and mitigation $M is in effect
52 ++
53 ++ Details about the l1tf file can be found in
54 ++ Documentation/admin-guide/l1tf.rst
55 ++
56 ++What: /sys/devices/system/cpu/smt
57 ++ /sys/devices/system/cpu/smt/active
58 ++ /sys/devices/system/cpu/smt/control
59 ++Date: June 2018
60 ++Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
61 ++Description: Control Symmetric Multi Threading (SMT)
62 ++
63 ++ active: Tells whether SMT is active (enabled and siblings online)
64 ++
65 ++ control: Read/write interface to control SMT. Possible
66 ++ values:
67 ++
68 ++ "on" SMT is enabled
69 ++ "off" SMT is disabled
70 ++ "forceoff" SMT is force disabled. Cannot be changed.
71 ++ "notsupported" SMT is not supported by the CPU
72 ++
73 ++ If control status is "forceoff" or "notsupported" writes
74 ++ are rejected.
75 +diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
76 +index 48d70af11652..0873685bab0f 100644
77 +--- a/Documentation/admin-guide/index.rst
78 ++++ b/Documentation/admin-guide/index.rst
79 +@@ -17,6 +17,15 @@ etc.
80 + kernel-parameters
81 + devices
82 +
83 ++This section describes CPU vulnerabilities and provides an overview of the
84 ++possible mitigations along with guidance for selecting mitigations if they
85 ++are configurable at compile, boot or run time.
86 ++
87 ++.. toctree::
88 ++ :maxdepth: 1
89 ++
90 ++ l1tf
91 ++
92 + Here is a set of documents aimed at users who are trying to track down
93 + problems and bugs in particular.
94 +
95 +diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
96 +index 533ff5c68970..1370b424a453 100644
97 +--- a/Documentation/admin-guide/kernel-parameters.txt
98 ++++ b/Documentation/admin-guide/kernel-parameters.txt
99 +@@ -1967,10 +1967,84 @@
100 + (virtualized real and unpaged mode) on capable
101 + Intel chips. Default is 1 (enabled)
102 +
103 ++ kvm-intel.vmentry_l1d_flush=[KVM,Intel] Mitigation for L1 Terminal Fault
104 ++ CVE-2018-3620.
105 ++
106 ++ Valid arguments: never, cond, always
107 ++
108 ++ always: L1D cache flush on every VMENTER.
109 ++ cond: Flush L1D on VMENTER only when the code between
110 ++ VMEXIT and VMENTER can leak host memory.
111 ++ never: Disables the mitigation
112 ++
113 ++ Default is cond (do L1 cache flush in specific instances)
114 ++
115 + kvm-intel.vpid= [KVM,Intel] Disable Virtual Processor Identification
116 + feature (tagged TLBs) on capable Intel chips.
117 + Default is 1 (enabled)
118 +
119 ++ l1tf= [X86] Control mitigation of the L1TF vulnerability on
120 ++ affected CPUs
121 ++
122 ++ The kernel PTE inversion protection is unconditionally
123 ++ enabled and cannot be disabled.
124 ++
125 ++ full
126 ++ Provides all available mitigations for the
127 ++ L1TF vulnerability. Disables SMT and
128 ++ enables all mitigations in the
129 ++ hypervisors, i.e. unconditional L1D flush.
130 ++
131 ++ SMT control and L1D flush control via the
132 ++ sysfs interface is still possible after
133 ++ boot. Hypervisors will issue a warning
134 ++ when the first VM is started in a
135 ++ potentially insecure configuration,
136 ++ i.e. SMT enabled or L1D flush disabled.
137 ++
138 ++ full,force
139 ++ Same as 'full', but disables SMT and L1D
140 ++ flush runtime control. Implies the
141 ++ 'nosmt=force' command line option.
142 ++ (i.e. sysfs control of SMT is disabled.)
143 ++
144 ++ flush
145 ++ Leaves SMT enabled and enables the default
146 ++ hypervisor mitigation, i.e. conditional
147 ++ L1D flush.
148 ++
149 ++ SMT control and L1D flush control via the
150 ++ sysfs interface is still possible after
151 ++ boot. Hypervisors will issue a warning
152 ++ when the first VM is started in a
153 ++ potentially insecure configuration,
154 ++ i.e. SMT enabled or L1D flush disabled.
155 ++
156 ++ flush,nosmt
157 ++
158 ++ Disables SMT and enables the default
159 ++ hypervisor mitigation.
160 ++
161 ++ SMT control and L1D flush control via the
162 ++ sysfs interface is still possible after
163 ++ boot. Hypervisors will issue a warning
164 ++ when the first VM is started in a
165 ++ potentially insecure configuration,
166 ++ i.e. SMT enabled or L1D flush disabled.
167 ++
168 ++ flush,nowarn
169 ++ Same as 'flush', but hypervisors will not
170 ++ warn when a VM is started in a potentially
171 ++ insecure configuration.
172 ++
173 ++ off
174 ++ Disables hypervisor mitigations and doesn't
175 ++ emit any warnings.
176 ++
177 ++ Default is 'flush'.
178 ++
179 ++ For details see: Documentation/admin-guide/l1tf.rst
180 ++
181 + l2cr= [PPC]
182 +
183 + l3cr= [PPC]
184 +@@ -2687,6 +2761,10 @@
185 + nosmt [KNL,S390] Disable symmetric multithreading (SMT).
186 + Equivalent to smt=1.
187 +
188 ++ [KNL,x86] Disable symmetric multithreading (SMT).
189 ++ nosmt=force: Force disable SMT, cannot be undone
190 ++ via the sysfs control file.
191 ++
192 + nospectre_v2 [X86] Disable all mitigations for the Spectre variant 2
193 + (indirect branch prediction) vulnerability. System may
194 + allow data leaks with this option, which is equivalent
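
To make the relationship between the documented "l1tf=" strings and the kernel-internal states more concrete, here is a hedged user-space sketch: the enum values are the ones added to arch/x86/include/asm/processor.h later in this patch, while the parser itself is purely illustrative and is not the kernel's actual command-line handling.

/* Illustrative mapping of the documented "l1tf=" values onto the
 * enum l1tf_mitigations added later in this patch. */
#include <stdio.h>
#include <string.h>

enum l1tf_mitigations {
    L1TF_MITIGATION_OFF,
    L1TF_MITIGATION_FLUSH_NOWARN,
    L1TF_MITIGATION_FLUSH,
    L1TF_MITIGATION_FLUSH_NOSMT,
    L1TF_MITIGATION_FULL,
    L1TF_MITIGATION_FULL_FORCE
};

static enum l1tf_mitigations parse_l1tf(const char *arg)
{
    if (!strcmp(arg, "off"))
        return L1TF_MITIGATION_OFF;
    if (!strcmp(arg, "flush,nowarn"))
        return L1TF_MITIGATION_FLUSH_NOWARN;
    if (!strcmp(arg, "flush,nosmt"))
        return L1TF_MITIGATION_FLUSH_NOSMT;
    if (!strcmp(arg, "full,force"))
        return L1TF_MITIGATION_FULL_FORCE;
    if (!strcmp(arg, "full"))
        return L1TF_MITIGATION_FULL;
    /* "flush" is the documented default */
    return L1TF_MITIGATION_FLUSH;
}

int main(void)
{
    printf("l1tf=full,force -> %d\n", parse_l1tf("full,force"));
    printf("l1tf=off        -> %d\n", parse_l1tf("off"));
    return 0;
}
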
195 +diff --git a/Documentation/admin-guide/l1tf.rst b/Documentation/admin-guide/l1tf.rst
196 +new file mode 100644
197 +index 000000000000..bae52b845de0
198 +--- /dev/null
199 ++++ b/Documentation/admin-guide/l1tf.rst
200 +@@ -0,0 +1,610 @@
201 ++L1TF - L1 Terminal Fault
202 ++========================
203 ++
204 ++L1 Terminal Fault is a hardware vulnerability which allows unprivileged
205 ++speculative access to data which is available in the Level 1 Data Cache
206 ++when the page table entry controlling the virtual address, which is used
207 ++for the access, has the Present bit cleared or other reserved bits set.
208 ++
209 ++Affected processors
210 ++-------------------
211 ++
212 ++This vulnerability affects a wide range of Intel processors. The
213 ++vulnerability is not present on:
214 ++
215 ++ - Processors from AMD, Centaur and other non Intel vendors
216 ++
217 ++ - Older processor models, where the CPU family is < 6
218 ++
219 ++ - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
220 ++ Penwell, Pineview, Silvermont, Airmont, Merrifield)
221 ++
222 ++ - The Intel XEON PHI family
223 ++
224 ++ - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
225 ++ IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
226 ++ by the Meltdown vulnerability either. These CPUs should become
227 ++ available by end of 2018.
228 ++
229 ++Whether a processor is affected or not can be read out from the L1TF
230 ++vulnerability file in sysfs. See :ref:`l1tf_sys_info`.
231 ++
232 ++Related CVEs
233 ++------------
234 ++
235 ++The following CVE entries are related to the L1TF vulnerability:
236 ++
237 ++ ============= ================= ==============================
238 ++ CVE-2018-3615 L1 Terminal Fault SGX related aspects
239 ++ CVE-2018-3620 L1 Terminal Fault OS, SMM related aspects
240 ++ CVE-2018-3646 L1 Terminal Fault Virtualization related aspects
241 ++ ============= ================= ==============================
242 ++
243 ++Problem
244 ++-------
245 ++
246 ++If an instruction accesses a virtual address for which the relevant page
247 ++table entry (PTE) has the Present bit cleared or other reserved bits set,
248 ++then speculative execution ignores the invalid PTE and loads the referenced
249 ++data if it is present in the Level 1 Data Cache, as if the page referenced
250 ++by the address bits in the PTE was still present and accessible.
251 ++
252 ++While this is a purely speculative mechanism and the instruction will raise
253 ++a page fault when it is retired eventually, the pure act of loading the
254 ++data and making it available to other speculative instructions opens up the
255 ++opportunity for side channel attacks to unprivileged malicious code,
256 ++similar to the Meltdown attack.
257 ++
258 ++While Meltdown breaks the user space to kernel space protection, L1TF
259 ++allows to attack any physical memory address in the system and the attack
260 ++works across all protection domains. It allows an attack of SGX and also
261 ++works from inside virtual machines because the speculation bypasses the
262 ++extended page table (EPT) protection mechanism.
263 ++
264 ++
265 ++Attack scenarios
266 ++----------------
267 ++
268 ++1. Malicious user space
269 ++^^^^^^^^^^^^^^^^^^^^^^^
270 ++
271 ++ Operating Systems store arbitrary information in the address bits of a
272 ++ PTE which is marked non present. This allows a malicious user space
273 ++ application to attack the physical memory to which these PTEs resolve.
274 ++ In some cases user-space can maliciously influence the information
275 ++ encoded in the address bits of the PTE, thus making attacks more
276 ++ deterministic and more practical.
277 ++
278 ++ The Linux kernel contains a mitigation for this attack vector, PTE
279 ++ inversion, which is permanently enabled and has no performance
280 ++ impact. The kernel ensures that the address bits of PTEs, which are not
281 ++ marked present, never point to cacheable physical memory space.
282 ++
283 ++ A system with an up to date kernel is protected against attacks from
284 ++ malicious user space applications.
285 ++
286 ++2. Malicious guest in a virtual machine
287 ++^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
288 ++
289 ++ The fact that L1TF breaks all domain protections allows malicious guest
290 ++ OSes, which can control the PTEs directly, and malicious guest user
291 ++ space applications, which run on an unprotected guest kernel lacking the
292 ++ PTE inversion mitigation for L1TF, to attack physical host memory.
293 ++
294 ++ A special aspect of L1TF in the context of virtualization is symmetric
295 ++ multi threading (SMT). The Intel implementation of SMT is called
296 ++ HyperThreading. The fact that Hyperthreads on the affected processors
297 ++ share the L1 Data Cache (L1D) is important for this. As the flaw allows
298 ++ only to attack data which is present in L1D, a malicious guest running
299 ++ on one Hyperthread can attack the data which is brought into the L1D by
300 ++ the context which runs on the sibling Hyperthread of the same physical
301 ++ core. This context can be host OS, host user space or a different guest.
302 ++
303 ++ If the processor does not support Extended Page Tables, the attack is
304 ++ only possible when the hypervisor does not sanitize the content of the
305 ++ effective (shadow) page tables.
306 ++
307 ++ While solutions exist to mitigate these attack vectors fully, these
308 ++ mitigations are not enabled by default in the Linux kernel because they
309 ++ can affect performance significantly. The kernel provides several
310 ++ mechanisms which can be utilized to address the problem depending on the
311 ++ deployment scenario. The mitigations, their protection scope and impact
312 ++ are described in the next sections.
313 ++
314 ++ The default mitigations and the rationale for choosing them are explained
315 ++ at the end of this document. See :ref:`default_mitigations`.
316 ++
317 ++.. _l1tf_sys_info:
318 ++
319 ++L1TF system information
320 ++-----------------------
321 ++
322 ++The Linux kernel provides a sysfs interface to enumerate the current L1TF
323 ++status of the system: whether the system is vulnerable, and which
324 ++mitigations are active. The relevant sysfs file is:
325 ++
326 ++/sys/devices/system/cpu/vulnerabilities/l1tf
327 ++
328 ++The possible values in this file are:
329 ++
330 ++ =========================== ===============================
331 ++ 'Not affected' The processor is not vulnerable
332 ++ 'Mitigation: PTE Inversion' The host protection is active
333 ++ =========================== ===============================
334 ++
335 ++If KVM/VMX is enabled and the processor is vulnerable then the following
336 ++information is appended to the 'Mitigation: PTE Inversion' part:
337 ++
338 ++ - SMT status:
339 ++
340 ++ ===================== ================
341 ++ 'VMX: SMT vulnerable' SMT is enabled
342 ++ 'VMX: SMT disabled' SMT is disabled
343 ++ ===================== ================
344 ++
345 ++ - L1D Flush mode:
346 ++
347 ++ ================================ ====================================
348 ++ 'L1D vulnerable' L1D flushing is disabled
349 ++
350 ++ 'L1D conditional cache flushes' L1D flush is conditionally enabled
351 ++
352 ++ 'L1D cache flushes' L1D flush is unconditionally enabled
353 ++ ================================ ====================================
354 ++
355 ++The resulting grade of protection is discussed in the following sections.
356 ++
357 ++
358 ++Host mitigation mechanism
359 ++-------------------------
360 ++
361 ++The kernel is unconditionally protected against L1TF attacks from malicious
362 ++user space running on the host.
363 ++
364 ++
365 ++Guest mitigation mechanisms
366 ++---------------------------
367 ++
368 ++.. _l1d_flush:
369 ++
370 ++1. L1D flush on VMENTER
371 ++^^^^^^^^^^^^^^^^^^^^^^^
372 ++
373 ++ To make sure that a guest cannot attack data which is present in the L1D
374 ++ the hypervisor flushes the L1D before entering the guest.
375 ++
376 ++ Flushing the L1D evicts not only the data which should not be accessed
377 ++ by a potentially malicious guest, it also flushes the guest
378 ++ data. Flushing the L1D has a performance impact as the processor has to
379 ++ bring the flushed guest data back into the L1D. Depending on the
380 ++ frequency of VMEXIT/VMENTER and the type of computations in the guest
381 ++ performance degradation in the range of 1% to 50% has been observed. For
382 ++ scenarios where guest VMEXIT/VMENTER are rare the performance impact is
383 ++ minimal. Virtio and mechanisms like posted interrupts are designed to
384 ++ confine the VMEXITs to a bare minimum, but specific configurations and
385 ++ application scenarios might still suffer from a high VMEXIT rate.
386 ++
387 ++ The kernel provides two L1D flush modes:
388 ++ - conditional ('cond')
389 ++ - unconditional ('always')
390 ++
391 ++ The conditional mode avoids L1D flushing after VMEXITs which execute
392 ++ only audited code paths before the corresponding VMENTER. These code
393 ++ paths have been verified that they cannot expose secrets or other
394 ++ interesting data to an attacker, but they can leak information about the
395 ++ address space layout of the hypervisor.
396 ++
397 ++ Unconditional mode flushes L1D on all VMENTER invocations and provides
398 ++ maximum protection. It has a higher overhead than the conditional
399 ++ mode. The overhead cannot be quantified correctly as it depends on the
400 ++ workload scenario and the resulting number of VMEXITs.
401 ++
402 ++ The general recommendation is to enable L1D flush on VMENTER. The kernel
403 ++ defaults to conditional mode on affected processors.
404 ++
405 ++ **Note**, that L1D flush does not prevent the SMT problem because the
406 ++ sibling thread will also bring back its data into the L1D which makes it
407 ++ attackable again.
408 ++
409 ++ L1D flush can be controlled by the administrator via the kernel command
410 ++ line and sysfs control files. See :ref:`mitigation_control_command_line`
411 ++ and :ref:`mitigation_control_kvm`.
412 ++
413 ++.. _guest_confinement:
414 ++
415 ++2. Guest VCPU confinement to dedicated physical cores
416 ++^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
417 ++
418 ++ To address the SMT problem, it is possible to make a guest or a group of
419 ++ guests affine to one or more physical cores. The proper mechanism for
420 ++ that is to utilize exclusive cpusets to ensure that no other guest or
421 ++ host tasks can run on these cores.
422 ++
423 ++ If only a single guest or related guests run on sibling SMT threads on
424 ++ the same physical core then they can only attack their own memory and
425 ++ restricted parts of the host memory.
426 ++
427 ++ Host memory is attackable when one of the sibling SMT threads runs in
428 ++ host OS (hypervisor) context and the other in guest context. The amount
429 ++ of valuable information from the host OS context depends on the context
430 ++ which the host OS executes, i.e. interrupts, soft interrupts and kernel
431 ++ threads. The amount of valuable data from these contexts cannot be
432 ++ declared as non-interesting for an attacker without deep inspection of
433 ++ the code.
434 ++
435 ++ **Note**, that assigning guests to a fixed set of physical cores affects
436 ++ the ability of the scheduler to do load balancing and might have
437 ++ negative effects on CPU utilization depending on the hosting
438 ++ scenario. Disabling SMT might be a viable alternative for particular
439 ++ scenarios.
440 ++
441 ++ For further information about confining guests to a single or to a group
442 ++ of cores consult the cpusets documentation:
443 ++
444 ++ https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
445 ++
446 ++.. _interrupt_isolation:
447 ++
448 ++3. Interrupt affinity
449 ++^^^^^^^^^^^^^^^^^^^^^
450 ++
451 ++ Interrupts can be made affine to logical CPUs. This is not universally
452 ++ true because there are types of interrupts which are truly per CPU
453 ++ interrupts, e.g. the local timer interrupt. Aside from that, multi queue
454 ++ devices affine their interrupts to single CPUs or groups of CPUs per
455 ++ queue without allowing the administrator to control the affinities.
456 ++
457 ++ Moving the interrupts, which can be affinity controlled, away from CPUs
458 ++ which run untrusted guests, reduces the attack vector space.
459 ++
460 ++ Whether the interrupts which are affine to CPUs that run untrusted
461 ++ guests provide interesting data for an attacker depends on the system
462 ++ configuration and the scenarios which run on the system. While for some
463 ++ of the interrupts it can be assumed that they won't expose interesting
464 ++ information beyond exposing hints about the host OS memory layout, there
465 ++ is no way to make general assumptions.
466 ++
467 ++ Interrupt affinity can be controlled by the administrator via the
468 ++ /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
469 ++ available at:
470 ++
471 ++ https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
472 ++
473 ++.. _smt_control:
474 ++
475 ++4. SMT control
476 ++^^^^^^^^^^^^^^
477 ++
478 ++ To prevent the SMT issues of L1TF it might be necessary to disable SMT
479 ++ completely. Disabling SMT can have a significant performance impact, but
480 ++ the impact depends on the hosting scenario and the type of workloads.
481 ++ The impact of disabling SMT also needs to be weighed against the impact
482 ++ of other mitigation solutions like confining guests to dedicated cores.
483 ++
484 ++ The kernel provides a sysfs interface to retrieve the status of SMT and
485 ++ to control it. It also provides a kernel command line interface to
486 ++ control SMT.
487 ++
488 ++ The kernel command line interface consists of the following options:
489 ++
490 ++ =========== ==========================================================
491 ++ nosmt Affects the bring up of the secondary CPUs during boot. The
492 ++ kernel tries to bring all present CPUs online during the
493 ++ boot process. "nosmt" makes sure that from each physical
494 ++ core only one - the so called primary (hyper) thread is
495 ++ activated. Due to a design flaw of Intel processors related
496 ++ to Machine Check Exceptions the non primary siblings have
497 ++ to be brought up at least partially and are then shut down
498 ++ again. "nosmt" can be undone via the sysfs interface.
499 ++
500 ++ nosmt=force Has the same effect as "nosmt" but it does not allow to
501 ++ undo the SMT disable via the sysfs interface.
502 ++ =========== ==========================================================
503 ++
504 ++ The sysfs interface provides two files:
505 ++
506 ++ - /sys/devices/system/cpu/smt/control
507 ++ - /sys/devices/system/cpu/smt/active
508 ++
509 ++ /sys/devices/system/cpu/smt/control:
510 ++
511 ++ This file allows to read out the SMT control state and provides the
512 ++ ability to disable or (re)enable SMT. The possible states are:
513 ++
514 ++ ============== ===================================================
515 ++ on SMT is supported by the CPU and enabled. All
516 ++ logical CPUs can be onlined and offlined without
517 ++ restrictions.
518 ++
519 ++ off SMT is supported by the CPU and disabled. Only
520 ++ the so called primary SMT threads can be onlined
521 ++ and offlined without restrictions. An attempt to
522 ++ online a non-primary sibling is rejected
523 ++
524 ++ forceoff Same as 'off' but the state cannot be controlled.
525 ++ Attempts to write to the control file are rejected.
526 ++
527 ++ notsupported The processor does not support SMT. It's therefore
528 ++ not affected by the SMT implications of L1TF.
529 ++ Attempts to write to the control file are rejected.
530 ++ ============== ===================================================
531 ++
532 ++ The possible states which can be written into this file to control SMT
533 ++ state are:
534 ++
535 ++ - on
536 ++ - off
537 ++ - forceoff
538 ++
539 ++ /sys/devices/system/cpu/smt/active:
540 ++
541 ++ This file reports whether SMT is enabled and active, i.e. if on any
542 ++ physical core two or more sibling threads are online.
543 ++
544 ++ SMT control is also possible at boot time via the l1tf kernel command
545 ++ line parameter in combination with L1D flush control. See
546 ++ :ref:`mitigation_control_command_line`.
547 ++
548 ++5. Disabling EPT
549 ++^^^^^^^^^^^^^^^^
550 ++
551 ++ Disabling EPT for virtual machines provides full mitigation for L1TF even
552 ++ with SMT enabled, because the effective page tables for guests are
553 ++ managed and sanitized by the hypervisor. Though disabling EPT has a
554 ++ significant performance impact especially when the Meltdown mitigation
555 ++ KPTI is enabled.
556 ++
557 ++ EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
558 ++
559 ++There is ongoing research and development for new mitigation mechanisms to
560 ++address the performance impact of disabling SMT or EPT.
561 ++
562 ++.. _mitigation_control_command_line:
563 ++
564 ++Mitigation control on the kernel command line
565 ++---------------------------------------------
566 ++
567 ++The kernel command line allows to control the L1TF mitigations at boot
568 ++time with the option "l1tf=". The valid arguments for this option are:
569 ++
570 ++ ============ =============================================================
571 ++ full Provides all available mitigations for the L1TF
572 ++ vulnerability. Disables SMT and enables all mitigations in
573 ++ the hypervisors, i.e. unconditional L1D flushing
574 ++
575 ++ SMT control and L1D flush control via the sysfs interface
576 ++ is still possible after boot. Hypervisors will issue a
577 ++ warning when the first VM is started in a potentially
578 ++ insecure configuration, i.e. SMT enabled or L1D flush
579 ++ disabled.
580 ++
581 ++ full,force Same as 'full', but disables SMT and L1D flush runtime
582 ++ control. Implies the 'nosmt=force' command line option.
583 ++ (i.e. sysfs control of SMT is disabled.)
584 ++
585 ++ flush Leaves SMT enabled and enables the default hypervisor
586 ++ mitigation, i.e. conditional L1D flushing
587 ++
588 ++ SMT control and L1D flush control via the sysfs interface
589 ++ is still possible after boot. Hypervisors will issue a
590 ++ warning when the first VM is started in a potentially
591 ++ insecure configuration, i.e. SMT enabled or L1D flush
592 ++ disabled.
593 ++
594 ++ flush,nosmt Disables SMT and enables the default hypervisor mitigation,
595 ++ i.e. conditional L1D flushing.
596 ++
597 ++ SMT control and L1D flush control via the sysfs interface
598 ++ is still possible after boot. Hypervisors will issue a
599 ++ warning when the first VM is started in a potentially
600 ++ insecure configuration, i.e. SMT enabled or L1D flush
601 ++ disabled.
602 ++
603 ++ flush,nowarn Same as 'flush', but hypervisors will not warn when a VM is
604 ++ started in a potentially insecure configuration.
605 ++
606 ++ off Disables hypervisor mitigations and doesn't emit any
607 ++ warnings.
608 ++ ============ =============================================================
609 ++
610 ++The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
611 ++
612 ++
613 ++.. _mitigation_control_kvm:
614 ++
615 ++Mitigation control for KVM - module parameter
616 ++-------------------------------------------------------------
617 ++
618 ++The KVM hypervisor mitigation mechanism, flushing the L1D cache when
619 ++entering a guest, can be controlled with a module parameter.
620 ++
621 ++The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
622 ++following arguments:
623 ++
624 ++ ============ ==============================================================
625 ++ always L1D cache flush on every VMENTER.
626 ++
627 ++ cond Flush L1D on VMENTER only when the code between VMEXIT and
628 ++ VMENTER can leak host memory which is considered
629 ++ interesting for an attacker. This still can leak host memory
630 ++ which allows e.g. to determine the host's address space layout.
631 ++
632 ++ never Disables the mitigation
633 ++ ============ ==============================================================
634 ++
635 ++The parameter can be provided on the kernel command line, as a module
636 ++parameter when loading the modules and at runtime modified via the sysfs
637 ++file:
638 ++
639 ++/sys/module/kvm_intel/parameters/vmentry_l1d_flush
640 ++
641 ++The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
642 ++line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
643 ++module parameter is ignored and writes to the sysfs file are rejected.
644 ++
645 ++
646 ++Mitigation selection guide
647 ++--------------------------
648 ++
649 ++1. No virtualization in use
650 ++^^^^^^^^^^^^^^^^^^^^^^^^^^^
651 ++
652 ++ The system is protected by the kernel unconditionally and no further
653 ++ action is required.
654 ++
655 ++2. Virtualization with trusted guests
656 ++^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
657 ++
658 ++ If the guest comes from a trusted source and the guest OS kernel is
659 ++ guaranteed to have the L1TF mitigations in place the system is fully
660 ++ protected against L1TF and no further action is required.
661 ++
662 ++ To avoid the overhead of the default L1D flushing on VMENTER the
663 ++ administrator can disable the flushing via the kernel command line and
664 ++ sysfs control files. See :ref:`mitigation_control_command_line` and
665 ++ :ref:`mitigation_control_kvm`.
666 ++
667 ++
668 ++3. Virtualization with untrusted guests
669 ++^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
670 ++
671 ++3.1. SMT not supported or disabled
672 ++""""""""""""""""""""""""""""""""""
673 ++
674 ++ If SMT is not supported by the processor or disabled in the BIOS or by
675 ++ the kernel, it's only required to enforce L1D flushing on VMENTER.
676 ++
677 ++ Conditional L1D flushing is the default behaviour and can be tuned. See
678 ++ :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
679 ++
680 ++3.2. EPT not supported or disabled
681 ++""""""""""""""""""""""""""""""""""
682 ++
683 ++ If EPT is not supported by the processor or disabled in the hypervisor,
684 ++ the system is fully protected. SMT can stay enabled and L1D flushing on
685 ++ VMENTER is not required.
686 ++
687 ++ EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
688 ++
689 ++3.3. SMT and EPT supported and active
690 ++"""""""""""""""""""""""""""""""""""""
691 ++
692 ++ If SMT and EPT are supported and active then various degrees of
693 ++ mitigations can be employed:
694 ++
695 ++ - L1D flushing on VMENTER:
696 ++
697 ++ L1D flushing on VMENTER is the minimal protection requirement, but it
698 ++ is only potent in combination with other mitigation methods.
699 ++
700 ++ Conditional L1D flushing is the default behaviour and can be tuned. See
701 ++ :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
702 ++
703 ++ - Guest confinement:
704 ++
705 ++ Confinement of guests to a single or a group of physical cores which
706 ++ are not running any other processes, can reduce the attack surface
707 ++ significantly, but interrupts, soft interrupts and kernel threads can
708 ++ still expose valuable data to a potential attacker. See
709 ++ :ref:`guest_confinement`.
710 ++
711 ++ - Interrupt isolation:
712 ++
713 ++ Isolating the guest CPUs from interrupts can reduce the attack surface
714 ++ further, but still allows a malicious guest to explore a limited amount
715 ++ of host physical memory. This can at least be used to gain knowledge
716 ++ about the host address space layout. The interrupts which have a fixed
717 ++ affinity to the CPUs which run the untrusted guests can depending on
718 ++ the scenario still trigger soft interrupts and schedule kernel threads
719 ++ which might expose valuable information. See
720 ++ :ref:`interrupt_isolation`.
721 ++
722 ++The above three mitigation methods combined can provide protection to a
723 ++certain degree, but the risk of the remaining attack surface has to be
724 ++carefully analyzed. For full protection the following methods are
725 ++available:
726 ++
727 ++ - Disabling SMT:
728 ++
729 ++ Disabling SMT and enforcing the L1D flushing provides the maximum
730 ++ amount of protection. This mitigation does not depend on any of the
731 ++ above mitigation methods.
732 ++
733 ++ SMT control and L1D flushing can be tuned by the command line
734 ++ parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
735 ++ time with the matching sysfs control files. See :ref:`smt_control`,
736 ++ :ref:`mitigation_control_command_line` and
737 ++ :ref:`mitigation_control_kvm`.
738 ++
739 ++ - Disabling EPT:
740 ++
741 ++ Disabling EPT provides the maximum amount of protection as well. It is
742 ++ not dependent on any of the above mitigation methods. SMT can stay
743 ++ enabled and L1D flushing is not required, but the performance impact is
744 ++ significant.
745 ++
746 ++ EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
747 ++ parameter.
748 ++
749 ++3.4. Nested virtual machines
750 ++""""""""""""""""""""""""""""
751 ++
752 ++When nested virtualization is in use, three operating systems are involved:
753 ++the bare metal hypervisor, the nested hypervisor and the nested virtual
754 ++machine. VMENTER operations from the nested hypervisor into the nested
755 ++guest will always be processed by the bare metal hypervisor. If KVM is the
756 ++bare metal hypervisor it will:
757 ++
758 ++ - Flush the L1D cache on every switch from the nested hypervisor to the
759 ++ nested virtual machine, so that the nested hypervisor's secrets are not
760 ++ exposed to the nested virtual machine;
761 ++
762 ++ - Flush the L1D cache on every switch from the nested virtual machine to
763 ++ the nested hypervisor; this is a complex operation, and flushing the L1D
764 ++ cache avoids that the bare metal hypervisor's secrets are exposed to the
765 ++ nested virtual machine;
766 ++
767 ++ - Instruct the nested hypervisor to not perform any L1D cache flush. This
768 ++ is an optimization to avoid double L1D flushing.
769 ++
770 ++
771 ++.. _default_mitigations:
772 ++
773 ++Default mitigations
774 ++-------------------
775 ++
776 ++ The kernel default mitigations for vulnerable processors are:
777 ++
778 ++ - PTE inversion to protect against malicious user space. This is done
779 ++ unconditionally and cannot be controlled.
780 ++
781 ++ - L1D conditional flushing on VMENTER when EPT is enabled for
782 ++ a guest.
783 ++
784 ++ The kernel does not by default enforce the disabling of SMT, which leaves
785 ++ SMT systems vulnerable when running untrusted guests with EPT enabled.
786 ++
787 ++ The rationale for this choice is:
788 ++
789 ++ - Force disabling SMT can break existing setups, especially with
790 ++ unattended updates.
791 ++
792 ++ - If regular users run untrusted guests on their machine, then L1TF is
793 ++ just an add on to other malware which might be embedded in an untrusted
794 ++ guest, e.g. spam-bots or attacks on the local network.
795 ++
796 ++ There is no technical way to prevent a user from running untrusted code
797 ++ on their machines blindly.
798 ++
799 ++ - It's technically extremely unlikely and from today's knowledge even
800 ++ impossible that L1TF can be exploited via the most popular attack
801 ++ mechanisms like JavaScript because these mechanisms have no way to
802 ++ control PTEs. If that were possible and no other mitigation were
803 ++ available, then the default might be different.
804 ++
805 ++ - The administrators of cloud and hosting setups have to carefully
806 ++ analyze the risk for their scenarios and make the appropriate
807 ++ mitigation choices, which might even vary across their deployed
808 ++ machines and also result in other changes of their overall setup.
809 ++ There is no way for the kernel to provide a sensible default for this
810 ++ kind of scenario.
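
Tying the sysfs description above together, a small sketch can classify the host's current state; the substrings it matches are the ones listed in the "L1TF system information" section of the new document, while the decision logic itself is only an illustration, not a security assessment tool.

/* Classify the host from the l1tf sysfs string documented above. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/sys/devices/system/cpu/vulnerabilities/l1tf", "r");

    if (!f) {
        puts("l1tf status not available on this kernel");
        return 1;
    }
    if (!fgets(line, sizeof(line), f)) {
        fclose(f);
        puts("l1tf status not readable");
        return 1;
    }
    fclose(f);

    printf("status: %s", line);
    if (strstr(line, "Not affected"))
        puts("-> no action required");
    else if (strstr(line, "SMT vulnerable") || strstr(line, "L1D vulnerable"))
        puts("-> potentially insecure for untrusted guests (see the guide above)");
    else
        puts("-> reported mitigations are active");
    return 0;
}
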
811 +diff --git a/Makefile b/Makefile
812 +index 863f58503bee..5edf963148e8 100644
813 +--- a/Makefile
814 ++++ b/Makefile
815 +@@ -1,7 +1,7 @@
816 + # SPDX-License-Identifier: GPL-2.0
817 + VERSION = 4
818 + PATCHLEVEL = 18
819 +-SUBLEVEL = 0
820 ++SUBLEVEL = 1
821 + EXTRAVERSION =
822 + NAME = Merciless Moray
823 +
824 +diff --git a/arch/Kconfig b/arch/Kconfig
825 +index 1aa59063f1fd..d1f2ed462ac8 100644
826 +--- a/arch/Kconfig
827 ++++ b/arch/Kconfig
828 +@@ -13,6 +13,9 @@ config KEXEC_CORE
829 + config HAVE_IMA_KEXEC
830 + bool
831 +
832 ++config HOTPLUG_SMT
833 ++ bool
834 ++
835 + config OPROFILE
836 + tristate "OProfile system profiling"
837 + depends on PROFILING
838 +diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
839 +index 887d3a7bb646..6b8065d718bd 100644
840 +--- a/arch/x86/Kconfig
841 ++++ b/arch/x86/Kconfig
842 +@@ -187,6 +187,7 @@ config X86
843 + select HAVE_SYSCALL_TRACEPOINTS
844 + select HAVE_UNSTABLE_SCHED_CLOCK
845 + select HAVE_USER_RETURN_NOTIFIER
846 ++ select HOTPLUG_SMT if SMP
847 + select IRQ_FORCED_THREADING
848 + select NEED_SG_DMA_LENGTH
849 + select PCI_LOCKLESS_CONFIG
850 +diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
851 +index 74a9e06b6cfd..130e81e10fc7 100644
852 +--- a/arch/x86/include/asm/apic.h
853 ++++ b/arch/x86/include/asm/apic.h
854 +@@ -10,6 +10,7 @@
855 + #include <asm/fixmap.h>
856 + #include <asm/mpspec.h>
857 + #include <asm/msr.h>
858 ++#include <asm/hardirq.h>
859 +
860 + #define ARCH_APICTIMER_STOPS_ON_C3 1
861 +
862 +@@ -502,12 +503,19 @@ extern int default_check_phys_apicid_present(int phys_apicid);
863 +
864 + #endif /* CONFIG_X86_LOCAL_APIC */
865 +
866 ++#ifdef CONFIG_SMP
867 ++bool apic_id_is_primary_thread(unsigned int id);
868 ++#else
869 ++static inline bool apic_id_is_primary_thread(unsigned int id) { return false; }
870 ++#endif
871 ++
872 + extern void irq_enter(void);
873 + extern void irq_exit(void);
874 +
875 + static inline void entering_irq(void)
876 + {
877 + irq_enter();
878 ++ kvm_set_cpu_l1tf_flush_l1d();
879 + }
880 +
881 + static inline void entering_ack_irq(void)
882 +@@ -520,6 +528,7 @@ static inline void ipi_entering_ack_irq(void)
883 + {
884 + irq_enter();
885 + ack_APIC_irq();
886 ++ kvm_set_cpu_l1tf_flush_l1d();
887 + }
888 +
889 + static inline void exiting_irq(void)
890 +diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
891 +index 5701f5cecd31..64aaa3f5f36c 100644
892 +--- a/arch/x86/include/asm/cpufeatures.h
893 ++++ b/arch/x86/include/asm/cpufeatures.h
894 +@@ -219,6 +219,7 @@
895 + #define X86_FEATURE_IBPB ( 7*32+26) /* Indirect Branch Prediction Barrier */
896 + #define X86_FEATURE_STIBP ( 7*32+27) /* Single Thread Indirect Branch Predictors */
897 + #define X86_FEATURE_ZEN ( 7*32+28) /* "" CPU is AMD family 0x17 (Zen) */
898 ++#define X86_FEATURE_L1TF_PTEINV ( 7*32+29) /* "" L1TF workaround PTE inversion */
899 +
900 + /* Virtualization flags: Linux defined, word 8 */
901 + #define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */
902 +@@ -341,6 +342,7 @@
903 + #define X86_FEATURE_PCONFIG (18*32+18) /* Intel PCONFIG */
904 + #define X86_FEATURE_SPEC_CTRL (18*32+26) /* "" Speculation Control (IBRS + IBPB) */
905 + #define X86_FEATURE_INTEL_STIBP (18*32+27) /* "" Single Thread Indirect Branch Predictors */
906 ++#define X86_FEATURE_FLUSH_L1D (18*32+28) /* Flush L1D cache */
907 + #define X86_FEATURE_ARCH_CAPABILITIES (18*32+29) /* IA32_ARCH_CAPABILITIES MSR (Intel) */
908 + #define X86_FEATURE_SPEC_CTRL_SSBD (18*32+31) /* "" Speculative Store Bypass Disable */
909 +
910 +@@ -373,5 +375,6 @@
911 + #define X86_BUG_SPECTRE_V1 X86_BUG(15) /* CPU is affected by Spectre variant 1 attack with conditional branches */
912 + #define X86_BUG_SPECTRE_V2 X86_BUG(16) /* CPU is affected by Spectre variant 2 attack with indirect branches */
913 + #define X86_BUG_SPEC_STORE_BYPASS X86_BUG(17) /* CPU is affected by speculative store bypass attack */
914 ++#define X86_BUG_L1TF X86_BUG(18) /* CPU is affected by L1 Terminal Fault */
915 +
916 + #endif /* _ASM_X86_CPUFEATURES_H */
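
The new feature and bug bits surface in /proc/cpuinfo on a running system. A rough check follows; it assumes the usual lowercase names derived from the defines above ("flush_l1d" in the flags line, "l1tf" in the bugs line), so treat the exact strings as an assumption rather than a guarantee.

/* Look for the flush_l1d feature and the l1tf bug flag in /proc/cpuinfo. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[4096];
    int flush_l1d = 0, l1tf_bug = 0;
    FILE *f = fopen("/proc/cpuinfo", "r");

    if (!f)
        return 1;
    while (fgets(line, sizeof(line), f)) {
        if (!strncmp(line, "flags", 5) && strstr(line, " flush_l1d"))
            flush_l1d = 1;
        if (!strncmp(line, "bugs", 4) && strstr(line, " l1tf"))
            l1tf_bug = 1;
    }
    fclose(f);

    printf("L1D flush instruction advertised : %s\n", flush_l1d ? "yes" : "no");
    printf("CPU marked as affected by L1TF   : %s\n", l1tf_bug ? "yes" : "no");
    return 0;
}
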
917 +diff --git a/arch/x86/include/asm/dmi.h b/arch/x86/include/asm/dmi.h
918 +index 0ab2ab27ad1f..b825cb201251 100644
919 +--- a/arch/x86/include/asm/dmi.h
920 ++++ b/arch/x86/include/asm/dmi.h
921 +@@ -4,8 +4,8 @@
922 +
923 + #include <linux/compiler.h>
924 + #include <linux/init.h>
925 ++#include <linux/io.h>
926 +
927 +-#include <asm/io.h>
928 + #include <asm/setup.h>
929 +
930 + static __always_inline __init void *dmi_alloc(unsigned len)
931 +diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
932 +index 740a428acf1e..d9069bb26c7f 100644
933 +--- a/arch/x86/include/asm/hardirq.h
934 ++++ b/arch/x86/include/asm/hardirq.h
935 +@@ -3,10 +3,12 @@
936 + #define _ASM_X86_HARDIRQ_H
937 +
938 + #include <linux/threads.h>
939 +-#include <linux/irq.h>
940 +
941 + typedef struct {
942 +- unsigned int __softirq_pending;
943 ++ u16 __softirq_pending;
944 ++#if IS_ENABLED(CONFIG_KVM_INTEL)
945 ++ u8 kvm_cpu_l1tf_flush_l1d;
946 ++#endif
947 + unsigned int __nmi_count; /* arch dependent */
948 + #ifdef CONFIG_X86_LOCAL_APIC
949 + unsigned int apic_timer_irqs; /* arch dependent */
950 +@@ -58,4 +60,24 @@ extern u64 arch_irq_stat_cpu(unsigned int cpu);
951 + extern u64 arch_irq_stat(void);
952 + #define arch_irq_stat arch_irq_stat
953 +
954 ++
955 ++#if IS_ENABLED(CONFIG_KVM_INTEL)
956 ++static inline void kvm_set_cpu_l1tf_flush_l1d(void)
957 ++{
958 ++ __this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 1);
959 ++}
960 ++
961 ++static inline void kvm_clear_cpu_l1tf_flush_l1d(void)
962 ++{
963 ++ __this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 0);
964 ++}
965 ++
966 ++static inline bool kvm_get_cpu_l1tf_flush_l1d(void)
967 ++{
968 ++ return __this_cpu_read(irq_stat.kvm_cpu_l1tf_flush_l1d);
969 ++}
970 ++#else /* !IS_ENABLED(CONFIG_KVM_INTEL) */
971 ++static inline void kvm_set_cpu_l1tf_flush_l1d(void) { }
972 ++#endif /* IS_ENABLED(CONFIG_KVM_INTEL) */
973 ++
974 + #endif /* _ASM_X86_HARDIRQ_H */
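
The helpers added here implement a simple "set on interrupt entry, test-and-clear before VMENTER" handshake (the apic.h hunk above wires the set side into entering_irq() and ipi_entering_ack_irq()). The stand-alone mock below sketches that flow under heavy simplification: one CPU is modeled with a plain variable instead of the per-CPU irq_stat field, and the VMX side is reduced to a printf.

/* Mock of the conditional-flush handshake, user space only. */
#include <stdbool.h>
#include <stdio.h>

static bool cpu_l1tf_flush_l1d;   /* stands in for irq_stat.kvm_cpu_l1tf_flush_l1d */

static void set_cpu_l1tf_flush_l1d(void)   { cpu_l1tf_flush_l1d = true;  }
static void clear_cpu_l1tf_flush_l1d(void) { cpu_l1tf_flush_l1d = false; }
static bool get_cpu_l1tf_flush_l1d(void)   { return cpu_l1tf_flush_l1d; }

static void fake_interrupt(void)
{
    /* interrupt entry marks the CPU as needing a flush */
    set_cpu_l1tf_flush_l1d();
}

static void fake_vmenter(void)
{
    if (get_cpu_l1tf_flush_l1d()) {
        puts("conditional L1D flush before VMENTER");
        clear_cpu_l1tf_flush_l1d();
    } else {
        puts("no flush needed");
    }
}

int main(void)
{
    fake_vmenter();    /* nothing ran since the last entry -> no flush */
    fake_interrupt();  /* host code ran between VMEXIT and VMENTER     */
    fake_vmenter();    /* -> flush                                     */
    fake_vmenter();    /* flag already consumed -> no flush            */
    return 0;
}
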
975 +diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
976 +index c4fc17220df9..c14f2a74b2be 100644
977 +--- a/arch/x86/include/asm/irqflags.h
978 ++++ b/arch/x86/include/asm/irqflags.h
979 +@@ -13,6 +13,8 @@
980 + * Interrupt control:
981 + */
982 +
983 ++/* Declaration required for gcc < 4.9 to prevent -Werror=missing-prototypes */
984 ++extern inline unsigned long native_save_fl(void);
985 + extern inline unsigned long native_save_fl(void)
986 + {
987 + unsigned long flags;
988 +diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
989 +index c13cd28d9d1b..acebb808c4b5 100644
990 +--- a/arch/x86/include/asm/kvm_host.h
991 ++++ b/arch/x86/include/asm/kvm_host.h
992 +@@ -17,6 +17,7 @@
993 + #include <linux/tracepoint.h>
994 + #include <linux/cpumask.h>
995 + #include <linux/irq_work.h>
996 ++#include <linux/irq.h>
997 +
998 + #include <linux/kvm.h>
999 + #include <linux/kvm_para.h>
1000 +@@ -713,6 +714,9 @@ struct kvm_vcpu_arch {
1001 +
1002 + /* be preempted when it's in kernel-mode(cpl=0) */
1003 + bool preempted_in_kernel;
1004 ++
1005 ++ /* Flush the L1 Data cache for L1TF mitigation on VMENTER */
1006 ++ bool l1tf_flush_l1d;
1007 + };
1008 +
1009 + struct kvm_lpage_info {
1010 +@@ -881,6 +885,7 @@ struct kvm_vcpu_stat {
1011 + u64 signal_exits;
1012 + u64 irq_window_exits;
1013 + u64 nmi_window_exits;
1014 ++ u64 l1d_flush;
1015 + u64 halt_exits;
1016 + u64 halt_successful_poll;
1017 + u64 halt_attempted_poll;
1018 +@@ -1413,6 +1418,7 @@ int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
1019 + void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
1020 + void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu);
1021 +
1022 ++u64 kvm_get_arch_capabilities(void);
1023 + void kvm_define_shared_msr(unsigned index, u32 msr);
1024 + int kvm_set_shared_msr(unsigned index, u64 val, u64 mask);
1025 +
1026 +diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
1027 +index 68b2c3150de1..4731f0cf97c5 100644
1028 +--- a/arch/x86/include/asm/msr-index.h
1029 ++++ b/arch/x86/include/asm/msr-index.h
1030 +@@ -70,12 +70,19 @@
1031 + #define MSR_IA32_ARCH_CAPABILITIES 0x0000010a
1032 + #define ARCH_CAP_RDCL_NO (1 << 0) /* Not susceptible to Meltdown */
1033 + #define ARCH_CAP_IBRS_ALL (1 << 1) /* Enhanced IBRS support */
1034 ++#define ARCH_CAP_SKIP_VMENTRY_L1DFLUSH (1 << 3) /* Skip L1D flush on vmentry */
1035 + #define ARCH_CAP_SSB_NO (1 << 4) /*
1036 + * Not susceptible to Speculative Store Bypass
1037 + * attack, so no Speculative Store Bypass
1038 + * control required.
1039 + */
1040 +
1041 ++#define MSR_IA32_FLUSH_CMD 0x0000010b
1042 ++#define L1D_FLUSH (1 << 0) /*
1043 ++ * Writeback and invalidate the
1044 ++ * L1 data cache.
1045 ++ */
1046 ++
1047 + #define MSR_IA32_BBL_CR_CTL 0x00000119
1048 + #define MSR_IA32_BBL_CR_CTL3 0x0000011e
1049 +
1050 +diff --git a/arch/x86/include/asm/page_32_types.h b/arch/x86/include/asm/page_32_types.h
1051 +index aa30c3241ea7..0d5c739eebd7 100644
1052 +--- a/arch/x86/include/asm/page_32_types.h
1053 ++++ b/arch/x86/include/asm/page_32_types.h
1054 +@@ -29,8 +29,13 @@
1055 + #define N_EXCEPTION_STACKS 1
1056 +
1057 + #ifdef CONFIG_X86_PAE
1058 +-/* 44=32+12, the limit we can fit into an unsigned long pfn */
1059 +-#define __PHYSICAL_MASK_SHIFT 44
1060 ++/*
1061 ++ * This is beyond the 44 bit limit imposed by the 32bit long pfns,
1062 ++ * but we need the full mask to make sure inverted PROT_NONE
1063 ++ * entries have all the host bits set in a guest.
1064 ++ * The real limit is still 44 bits.
1065 ++ */
1066 ++#define __PHYSICAL_MASK_SHIFT 52
1067 + #define __VIRTUAL_MASK_SHIFT 32
1068 +
1069 + #else /* !CONFIG_X86_PAE */
1070 +diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h
1071 +index 685ffe8a0eaf..60d0f9015317 100644
1072 +--- a/arch/x86/include/asm/pgtable-2level.h
1073 ++++ b/arch/x86/include/asm/pgtable-2level.h
1074 +@@ -95,4 +95,21 @@ static inline unsigned long pte_bitop(unsigned long value, unsigned int rightshi
1075 + #define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low })
1076 + #define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
1077 +
1078 ++/* No inverted PFNs on 2 level page tables */
1079 ++
1080 ++static inline u64 protnone_mask(u64 val)
1081 ++{
1082 ++ return 0;
1083 ++}
1084 ++
1085 ++static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask)
1086 ++{
1087 ++ return val;
1088 ++}
1089 ++
1090 ++static inline bool __pte_needs_invert(u64 val)
1091 ++{
1092 ++ return false;
1093 ++}
1094 ++
1095 + #endif /* _ASM_X86_PGTABLE_2LEVEL_H */
1096 +diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
1097 +index f24df59c40b2..bb035a4cbc8c 100644
1098 +--- a/arch/x86/include/asm/pgtable-3level.h
1099 ++++ b/arch/x86/include/asm/pgtable-3level.h
1100 +@@ -241,12 +241,43 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
1101 + #endif
1102 +
1103 + /* Encode and de-code a swap entry */
1104 ++#define SWP_TYPE_BITS 5
1105 ++
1106 ++#define SWP_OFFSET_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
1107 ++
1108 ++/* We always extract/encode the offset by shifting it all the way up, and then down again */
1109 ++#define SWP_OFFSET_SHIFT (SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)
1110 ++
1111 + #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > 5)
1112 + #define __swp_type(x) (((x).val) & 0x1f)
1113 + #define __swp_offset(x) ((x).val >> 5)
1114 + #define __swp_entry(type, offset) ((swp_entry_t){(type) | (offset) << 5})
1115 +-#define __pte_to_swp_entry(pte) ((swp_entry_t){ (pte).pte_high })
1116 +-#define __swp_entry_to_pte(x) ((pte_t){ { .pte_high = (x).val } })
1117 ++
1118 ++/*
1119 ++ * Normally, __swp_entry() converts from arch-independent swp_entry_t to
1120 ++ * arch-dependent swp_entry_t, and __swp_entry_to_pte() just stores the result
1121 ++ * to pte. But here we have 32bit swp_entry_t and 64bit pte, and need to use the
1122 ++ * whole 64 bits. Thus, we shift the "real" arch-dependent conversion to
1123 ++ * __swp_entry_to_pte() through the following helper macro based on 64bit
1124 ++ * __swp_entry().
1125 ++ */
1126 ++#define __swp_pteval_entry(type, offset) ((pteval_t) { \
1127 ++ (~(pteval_t)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \
1128 ++ | ((pteval_t)(type) << (64 - SWP_TYPE_BITS)) })
1129 ++
1130 ++#define __swp_entry_to_pte(x) ((pte_t){ .pte = \
1131 ++ __swp_pteval_entry(__swp_type(x), __swp_offset(x)) })
1132 ++/*
1133 ++ * Analogically, __pte_to_swp_entry() doesn't just extract the arch-dependent
1134 ++ * swp_entry_t, but also has to convert it from 64bit to the 32bit
1135 ++ * intermediate representation, using the following macros based on 64bit
1136 ++ * __swp_type() and __swp_offset().
1137 ++ */
1138 ++#define __pteval_swp_type(x) ((unsigned long)((x).pte >> (64 - SWP_TYPE_BITS)))
1139 ++#define __pteval_swp_offset(x) ((unsigned long)(~((x).pte) << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT))
1140 ++
1141 ++#define __pte_to_swp_entry(pte) (__swp_entry(__pteval_swp_type(pte), \
1142 ++ __pteval_swp_offset(pte)))
1143 +
1144 + #define gup_get_pte gup_get_pte
1145 + /*
1146 +@@ -295,4 +326,6 @@ static inline pte_t gup_get_pte(pte_t *ptep)
1147 + return pte;
1148 + }
1149 +
1150 ++#include <asm/pgtable-invert.h>
1151 ++
1152 + #endif /* _ASM_X86_PGTABLE_3LEVEL_H */
1153 +diff --git a/arch/x86/include/asm/pgtable-invert.h b/arch/x86/include/asm/pgtable-invert.h
1154 +new file mode 100644
1155 +index 000000000000..44b1203ece12
1156 +--- /dev/null
1157 ++++ b/arch/x86/include/asm/pgtable-invert.h
1158 +@@ -0,0 +1,32 @@
1159 ++/* SPDX-License-Identifier: GPL-2.0 */
1160 ++#ifndef _ASM_PGTABLE_INVERT_H
1161 ++#define _ASM_PGTABLE_INVERT_H 1
1162 ++
1163 ++#ifndef __ASSEMBLY__
1164 ++
1165 ++static inline bool __pte_needs_invert(u64 val)
1166 ++{
1167 ++ return !(val & _PAGE_PRESENT);
1168 ++}
1169 ++
1170 ++/* Get a mask to xor with the page table entry to get the correct pfn. */
1171 ++static inline u64 protnone_mask(u64 val)
1172 ++{
1173 ++ return __pte_needs_invert(val) ? ~0ull : 0;
1174 ++}
1175 ++
1176 ++static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask)
1177 ++{
1178 ++ /*
1179 ++ * When a PTE transitions from NONE to !NONE or vice-versa
1180 ++ * invert the PFN part to stop speculation.
1181 ++ * pte_pfn undoes this when needed.
1182 ++ */
1183 ++ if (__pte_needs_invert(oldval) != __pte_needs_invert(val))
1184 ++ val = (val & ~mask) | (~val & mask);
1185 ++ return val;
1186 ++}
1187 ++
1188 ++#endif /* __ASSEMBLY__ */
1189 ++
1190 ++#endif
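
Below is a stand-alone demonstration of the inversion helpers introduced in this file. The logic mirrors protnone_mask()/flip_protnone_guard() and the XOR later applied in pte_pfn(); the page-table constants are reduced to illustrative values so the effect is visible in the output, and the PFN value is arbitrary.

/* Show that clearing Present inverts the PFN bits and that decoding
 * still recovers the original PFN. */
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define _PAGE_PRESENT 0x001ULL
#define PTE_PFN_MASK  0x000ffffffffff000ULL   /* PFN bits 12..51, as on x86-64 */

static bool __pte_needs_invert(uint64_t val)
{
    return !(val & _PAGE_PRESENT);
}

static uint64_t protnone_mask(uint64_t val)
{
    return __pte_needs_invert(val) ? ~0ULL : 0;
}

static uint64_t flip_protnone_guard(uint64_t oldval, uint64_t val, uint64_t mask)
{
    if (__pte_needs_invert(oldval) != __pte_needs_invert(val))
        val = (val & ~mask) | (~val & mask);
    return val;
}

static uint64_t pte_pfn(uint64_t pte)
{
    uint64_t pfn = pte ^ protnone_mask(pte);

    return (pfn & PTE_PFN_MASK) >> 12;
}

int main(void)
{
    uint64_t present   = (0x12345ULL << 12) | _PAGE_PRESENT;
    uint64_t prot_none = flip_protnone_guard(present, present & ~_PAGE_PRESENT,
                                             PTE_PFN_MASK);

    printf("present PTE   %016" PRIx64 "  pfn %" PRIx64 "\n", present, pte_pfn(present));
    printf("PROT_NONE PTE %016" PRIx64 "  pfn %" PRIx64 "\n", prot_none, pte_pfn(prot_none));
    return 0;
}
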
1191 +diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
1192 +index 5715647fc4fe..13125aad804c 100644
1193 +--- a/arch/x86/include/asm/pgtable.h
1194 ++++ b/arch/x86/include/asm/pgtable.h
1195 +@@ -185,19 +185,29 @@ static inline int pte_special(pte_t pte)
1196 + return pte_flags(pte) & _PAGE_SPECIAL;
1197 + }
1198 +
1199 ++/* Entries that were set to PROT_NONE are inverted */
1200 ++
1201 ++static inline u64 protnone_mask(u64 val);
1202 ++
1203 + static inline unsigned long pte_pfn(pte_t pte)
1204 + {
1205 +- return (pte_val(pte) & PTE_PFN_MASK) >> PAGE_SHIFT;
1206 ++ phys_addr_t pfn = pte_val(pte);
1207 ++ pfn ^= protnone_mask(pfn);
1208 ++ return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT;
1209 + }
1210 +
1211 + static inline unsigned long pmd_pfn(pmd_t pmd)
1212 + {
1213 +- return (pmd_val(pmd) & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
1214 ++ phys_addr_t pfn = pmd_val(pmd);
1215 ++ pfn ^= protnone_mask(pfn);
1216 ++ return (pfn & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
1217 + }
1218 +
1219 + static inline unsigned long pud_pfn(pud_t pud)
1220 + {
1221 +- return (pud_val(pud) & pud_pfn_mask(pud)) >> PAGE_SHIFT;
1222 ++ phys_addr_t pfn = pud_val(pud);
1223 ++ pfn ^= protnone_mask(pfn);
1224 ++ return (pfn & pud_pfn_mask(pud)) >> PAGE_SHIFT;
1225 + }
1226 +
1227 + static inline unsigned long p4d_pfn(p4d_t p4d)
1228 +@@ -400,11 +410,6 @@ static inline pmd_t pmd_mkwrite(pmd_t pmd)
1229 + return pmd_set_flags(pmd, _PAGE_RW);
1230 + }
1231 +
1232 +-static inline pmd_t pmd_mknotpresent(pmd_t pmd)
1233 +-{
1234 +- return pmd_clear_flags(pmd, _PAGE_PRESENT | _PAGE_PROTNONE);
1235 +-}
1236 +-
1237 + static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
1238 + {
1239 + pudval_t v = native_pud_val(pud);
1240 +@@ -459,11 +464,6 @@ static inline pud_t pud_mkwrite(pud_t pud)
1241 + return pud_set_flags(pud, _PAGE_RW);
1242 + }
1243 +
1244 +-static inline pud_t pud_mknotpresent(pud_t pud)
1245 +-{
1246 +- return pud_clear_flags(pud, _PAGE_PRESENT | _PAGE_PROTNONE);
1247 +-}
1248 +-
1249 + #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
1250 + static inline int pte_soft_dirty(pte_t pte)
1251 + {
1252 +@@ -545,25 +545,45 @@ static inline pgprotval_t check_pgprot(pgprot_t pgprot)
1253 +
1254 + static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
1255 + {
1256 +- return __pte(((phys_addr_t)page_nr << PAGE_SHIFT) |
1257 +- check_pgprot(pgprot));
1258 ++ phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;
1259 ++ pfn ^= protnone_mask(pgprot_val(pgprot));
1260 ++ pfn &= PTE_PFN_MASK;
1261 ++ return __pte(pfn | check_pgprot(pgprot));
1262 + }
1263 +
1264 + static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)
1265 + {
1266 +- return __pmd(((phys_addr_t)page_nr << PAGE_SHIFT) |
1267 +- check_pgprot(pgprot));
1268 ++ phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;
1269 ++ pfn ^= protnone_mask(pgprot_val(pgprot));
1270 ++ pfn &= PHYSICAL_PMD_PAGE_MASK;
1271 ++ return __pmd(pfn | check_pgprot(pgprot));
1272 + }
1273 +
1274 + static inline pud_t pfn_pud(unsigned long page_nr, pgprot_t pgprot)
1275 + {
1276 +- return __pud(((phys_addr_t)page_nr << PAGE_SHIFT) |
1277 +- check_pgprot(pgprot));
1278 ++ phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;
1279 ++ pfn ^= protnone_mask(pgprot_val(pgprot));
1280 ++ pfn &= PHYSICAL_PUD_PAGE_MASK;
1281 ++ return __pud(pfn | check_pgprot(pgprot));
1282 + }
1283 +
1284 ++static inline pmd_t pmd_mknotpresent(pmd_t pmd)
1285 ++{
1286 ++ return pfn_pmd(pmd_pfn(pmd),
1287 ++ __pgprot(pmd_flags(pmd) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
1288 ++}
1289 ++
1290 ++static inline pud_t pud_mknotpresent(pud_t pud)
1291 ++{
1292 ++ return pfn_pud(pud_pfn(pud),
1293 ++ __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
1294 ++}
1295 ++
1296 ++static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
1297 ++
1298 + static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
1299 + {
1300 +- pteval_t val = pte_val(pte);
1301 ++ pteval_t val = pte_val(pte), oldval = val;
1302 +
1303 + /*
1304 + * Chop off the NX bit (if present), and add the NX portion of
1305 +@@ -571,17 +591,17 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
1306 + */
1307 + val &= _PAGE_CHG_MASK;
1308 + val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
1309 +-
1310 ++ val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
1311 + return __pte(val);
1312 + }
1313 +
1314 + static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
1315 + {
1316 +- pmdval_t val = pmd_val(pmd);
1317 ++ pmdval_t val = pmd_val(pmd), oldval = val;
1318 +
1319 + val &= _HPAGE_CHG_MASK;
1320 + val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
1321 +-
1322 ++ val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
1323 + return __pmd(val);
1324 + }
1325 +
1326 +@@ -1320,6 +1340,14 @@ static inline bool pud_access_permitted(pud_t pud, bool write)
1327 + return __pte_access_permitted(pud_val(pud), write);
1328 + }
1329 +
1330 ++#define __HAVE_ARCH_PFN_MODIFY_ALLOWED 1
1331 ++extern bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot);
1332 ++
1333 ++static inline bool arch_has_pfn_modify_check(void)
1334 ++{
1335 ++ return boot_cpu_has_bug(X86_BUG_L1TF);
1336 ++}
1337 ++
1338 + #include <asm-generic/pgtable.h>
1339 + #endif /* __ASSEMBLY__ */
1340 +
1341 +diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
1342 +index 3c5385f9a88f..82ff20b0ae45 100644
1343 +--- a/arch/x86/include/asm/pgtable_64.h
1344 ++++ b/arch/x86/include/asm/pgtable_64.h
1345 +@@ -273,7 +273,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
1346 + *
1347 + * | ... | 11| 10| 9|8|7|6|5| 4| 3|2| 1|0| <- bit number
1348 + * | ... |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
1349 +- * | OFFSET (14->63) | TYPE (9-13) |0|0|X|X| X| X|X|SD|0| <- swp entry
1350 ++ * | TYPE (59-63) | ~OFFSET (9-58) |0|0|X|X| X| X|X|SD|0| <- swp entry
1351 + *
1352 + * G (8) is aliased and used as a PROT_NONE indicator for
1353 + * !present ptes. We need to start storing swap entries above
1354 +@@ -286,20 +286,34 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
1355 + *
1356 + * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
1357 + * but also L and G.
1358 ++ *
1359 ++ * The offset is inverted by a binary not operation to make the high
1360 ++ * physical bits set.
1361 + */
1362 +-#define SWP_TYPE_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
1363 +-#define SWP_TYPE_BITS 5
1364 +-/* Place the offset above the type: */
1365 +-#define SWP_OFFSET_FIRST_BIT (SWP_TYPE_FIRST_BIT + SWP_TYPE_BITS)
1366 ++#define SWP_TYPE_BITS 5
1367 ++
1368 ++#define SWP_OFFSET_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
1369 ++
1370 ++/* We always extract/encode the offset by shifting it all the way up, and then down again */
1371 ++#define SWP_OFFSET_SHIFT (SWP_OFFSET_FIRST_BIT+SWP_TYPE_BITS)
1372 +
1373 + #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
1374 +
1375 +-#define __swp_type(x) (((x).val >> (SWP_TYPE_FIRST_BIT)) \
1376 +- & ((1U << SWP_TYPE_BITS) - 1))
1377 +-#define __swp_offset(x) ((x).val >> SWP_OFFSET_FIRST_BIT)
1378 +-#define __swp_entry(type, offset) ((swp_entry_t) { \
1379 +- ((type) << (SWP_TYPE_FIRST_BIT)) \
1380 +- | ((offset) << SWP_OFFSET_FIRST_BIT) })
1381 ++/* Extract the high bits for type */
1382 ++#define __swp_type(x) ((x).val >> (64 - SWP_TYPE_BITS))
1383 ++
1384 ++/* Shift up (to get rid of type), then down to get value */
1385 ++#define __swp_offset(x) (~(x).val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT)
1386 ++
1387 ++/*
1388 ++ * Shift the offset up "too far" by TYPE bits, then down again
1389 ++ * The offset is inverted by a binary not operation to make the high
1390 ++ * physical bits set.
1391 ++ */
1392 ++#define __swp_entry(type, offset) ((swp_entry_t) { \
1393 ++ (~(unsigned long)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \
1394 ++ | ((unsigned long)(type) << (64-SWP_TYPE_BITS)) })
1395 ++
1396 + #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val((pte)) })
1397 + #define __pmd_to_swp_entry(pmd) ((swp_entry_t) { pmd_val((pmd)) })
1398 + #define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
1399 +@@ -343,5 +357,7 @@ static inline bool gup_fast_permitted(unsigned long start, int nr_pages,
1400 + return true;
1401 + }
1402 +
1403 ++#include <asm/pgtable-invert.h>
1404 ++
1405 + #endif /* !__ASSEMBLY__ */
1406 + #endif /* _ASM_X86_PGTABLE_64_H */
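
The rewritten swap-entry layout above stores the offset bitwise inverted, so that in a non-present PTE the bits a speculative L1TF load would interpret as a physical frame number point at non-existent memory. A minimal user-space sketch of that arithmetic, with the constants copied from the hunk above (the helper names are invented for illustration and are not part of the patch):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SWP_TYPE_BITS		5
#define SWP_OFFSET_FIRST_BIT	9	/* _PAGE_BIT_PROTNONE + 1 */
#define SWP_OFFSET_SHIFT	(SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)

static uint64_t swp_entry(uint64_t type, uint64_t offset)
{
	/* Invert the offset, park it in bits 9-58, put the type in bits 59-63 */
	return (~offset << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) |
	       (type << (64 - SWP_TYPE_BITS));
}

static uint64_t swp_type(uint64_t val)
{
	return val >> (64 - SWP_TYPE_BITS);
}

static uint64_t swp_offset(uint64_t val)
{
	return ~val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT;
}

int main(void)
{
	uint64_t val = swp_entry(3, 0x12345);

	/* Bits 9-58 hold ~0x12345, so the "PFN" bits are mostly ones. */
	printf("raw swap pte: %#jx\n", (uintmax_t)val);
	assert(swp_type(val) == 3);
	assert(swp_offset(val) == 0x12345);
	return 0;
}

Without the inversion the high physical-address bits of a swap entry would mostly be zero, i.e. a plausible low physical address, which is exactly what the L1TF speculation path must not see.
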
1407 +diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
1408 +index cfd29ee8c3da..79e409974ccc 100644
1409 +--- a/arch/x86/include/asm/processor.h
1410 ++++ b/arch/x86/include/asm/processor.h
1411 +@@ -181,6 +181,11 @@ extern const struct seq_operations cpuinfo_op;
1412 +
1413 + extern void cpu_detect(struct cpuinfo_x86 *c);
1414 +
1415 ++static inline unsigned long l1tf_pfn_limit(void)
1416 ++{
1417 ++ return BIT(boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT) - 1;
1418 ++}
1419 ++
1420 + extern void early_cpu_init(void);
1421 + extern void identify_boot_cpu(void);
1422 + extern void identify_secondary_cpu(struct cpuinfo_x86 *);
1423 +@@ -977,4 +982,16 @@ bool xen_set_default_idle(void);
1424 + void stop_this_cpu(void *dummy);
1425 + void df_debug(struct pt_regs *regs, long error_code);
1426 + void microcode_check(void);
1427 ++
1428 ++enum l1tf_mitigations {
1429 ++ L1TF_MITIGATION_OFF,
1430 ++ L1TF_MITIGATION_FLUSH_NOWARN,
1431 ++ L1TF_MITIGATION_FLUSH,
1432 ++ L1TF_MITIGATION_FLUSH_NOSMT,
1433 ++ L1TF_MITIGATION_FULL,
1434 ++ L1TF_MITIGATION_FULL_FORCE
1435 ++};
1436 ++
1437 ++extern enum l1tf_mitigations l1tf_mitigation;
1438 ++
1439 + #endif /* _ASM_X86_PROCESSOR_H */
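
l1tf_pfn_limit() above caps the page frame numbers that unprivileged mappings may place into non-present PTEs to the lower half of the possible physical address space. A quick stand-alone check of the arithmetic; the 46 physical address bits used here are an assumed example value, not something taken from the patch:

#include <stdio.h>

#define PAGE_SHIFT	12
#define BIT(n)		(1ULL << (n))

/* Mirrors the hunk above: highest PFN strictly below MAX_PA/2. */
static unsigned long long l1tf_pfn_limit(unsigned int x86_phys_bits)
{
	return BIT(x86_phys_bits - 1 - PAGE_SHIFT) - 1;
}

int main(void)
{
	unsigned long long limit = l1tf_pfn_limit(46);

	/* 2^33 - 1 pages -> everything in the first 32 TiB (MAX_PA/2) is usable. */
	printf("pfn limit = %#llx (%llu GiB below MAX_PA/2)\n",
	       limit, (limit + 1) >> (30 - PAGE_SHIFT));
	return 0;
}
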
1440 +diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
1441 +index c1d2a9892352..453cf38a1c33 100644
1442 +--- a/arch/x86/include/asm/topology.h
1443 ++++ b/arch/x86/include/asm/topology.h
1444 +@@ -123,13 +123,17 @@ static inline int topology_max_smt_threads(void)
1445 + }
1446 +
1447 + int topology_update_package_map(unsigned int apicid, unsigned int cpu);
1448 +-extern int topology_phys_to_logical_pkg(unsigned int pkg);
1449 ++int topology_phys_to_logical_pkg(unsigned int pkg);
1450 ++bool topology_is_primary_thread(unsigned int cpu);
1451 ++bool topology_smt_supported(void);
1452 + #else
1453 + #define topology_max_packages() (1)
1454 + static inline int
1455 + topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; }
1456 + static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
1457 + static inline int topology_max_smt_threads(void) { return 1; }
1458 ++static inline bool topology_is_primary_thread(unsigned int cpu) { return true; }
1459 ++static inline bool topology_smt_supported(void) { return false; }
1460 + #endif
1461 +
1462 + static inline void arch_fix_phys_package_id(int num, u32 slot)
1463 +diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
1464 +index 6aa8499e1f62..95f9107449bf 100644
1465 +--- a/arch/x86/include/asm/vmx.h
1466 ++++ b/arch/x86/include/asm/vmx.h
1467 +@@ -576,4 +576,15 @@ enum vm_instruction_error_number {
1468 + VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID = 28,
1469 + };
1470 +
1471 ++enum vmx_l1d_flush_state {
1472 ++ VMENTER_L1D_FLUSH_AUTO,
1473 ++ VMENTER_L1D_FLUSH_NEVER,
1474 ++ VMENTER_L1D_FLUSH_COND,
1475 ++ VMENTER_L1D_FLUSH_ALWAYS,
1476 ++ VMENTER_L1D_FLUSH_EPT_DISABLED,
1477 ++ VMENTER_L1D_FLUSH_NOT_REQUIRED,
1478 ++};
1479 ++
1480 ++extern enum vmx_l1d_flush_state l1tf_vmx_mitigation;
1481 ++
1482 + #endif
1483 +diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
1484 +index adbda5847b14..3b3a2d0af78d 100644
1485 +--- a/arch/x86/kernel/apic/apic.c
1486 ++++ b/arch/x86/kernel/apic/apic.c
1487 +@@ -56,6 +56,7 @@
1488 + #include <asm/hypervisor.h>
1489 + #include <asm/cpu_device_id.h>
1490 + #include <asm/intel-family.h>
1491 ++#include <asm/irq_regs.h>
1492 +
1493 + unsigned int num_processors;
1494 +
1495 +@@ -2192,6 +2193,23 @@ static int cpuid_to_apicid[] = {
1496 + [0 ... NR_CPUS - 1] = -1,
1497 + };
1498 +
1499 ++#ifdef CONFIG_SMP
1500 ++/**
1501 ++ * apic_id_is_primary_thread - Check whether APIC ID belongs to a primary thread
1502 ++ * @id: APIC ID to check
1503 ++ */
1504 ++bool apic_id_is_primary_thread(unsigned int apicid)
1505 ++{
1506 ++ u32 mask;
1507 ++
1508 ++ if (smp_num_siblings == 1)
1509 ++ return true;
1510 ++ /* Isolate the SMT bit(s) in the APICID and check for 0 */
1511 ++ mask = (1U << (fls(smp_num_siblings) - 1)) - 1;
1512 ++ return !(apicid & mask);
1513 ++}
1514 ++#endif
1515 ++
1516 + /*
1517 + * Should use this API to allocate logical CPU IDs to keep nr_logical_cpuids
1518 + * and cpuid_to_apicid[] synchronized.
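
apic_id_is_primary_thread() above isolates the SMT bits of the APIC ID: with the usual two siblings per core the mask is 1, so every even APIC ID is a primary thread. A stand-alone sketch using a user-space fls() substitute built on a GCC/clang builtin (the sibling counts are example values, not read from hardware):

#include <stdio.h>

/* fls() substitute: 1-based position of the most significant set bit. */
static int fls_u32(unsigned int x)
{
	return x ? 32 - __builtin_clz(x) : 0;
}

static int apic_id_is_primary_thread(unsigned int apicid, unsigned int siblings)
{
	unsigned int mask;

	if (siblings == 1)
		return 1;
	/* Isolate the SMT bit(s) in the APIC ID and check for 0 */
	mask = (1U << (fls_u32(siblings) - 1)) - 1;
	return !(apicid & mask);
}

int main(void)
{
	/* Two siblings per core: APIC ID 4 is a primary thread, its sibling 5 is not. */
	printf("apicid 4: %d, apicid 5: %d\n",
	       apic_id_is_primary_thread(4, 2),
	       apic_id_is_primary_thread(5, 2));
	return 0;
}
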
1519 +diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
1520 +index 3982f79d2377..ff0d14cd9e82 100644
1521 +--- a/arch/x86/kernel/apic/io_apic.c
1522 ++++ b/arch/x86/kernel/apic/io_apic.c
1523 +@@ -33,6 +33,7 @@
1524 +
1525 + #include <linux/mm.h>
1526 + #include <linux/interrupt.h>
1527 ++#include <linux/irq.h>
1528 + #include <linux/init.h>
1529 + #include <linux/delay.h>
1530 + #include <linux/sched.h>
1531 +diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
1532 +index ce503c99f5c4..72a94401f9e0 100644
1533 +--- a/arch/x86/kernel/apic/msi.c
1534 ++++ b/arch/x86/kernel/apic/msi.c
1535 +@@ -12,6 +12,7 @@
1536 + */
1537 + #include <linux/mm.h>
1538 + #include <linux/interrupt.h>
1539 ++#include <linux/irq.h>
1540 + #include <linux/pci.h>
1541 + #include <linux/dmar.h>
1542 + #include <linux/hpet.h>
1543 +diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
1544 +index 35aaee4fc028..c9b773401fd8 100644
1545 +--- a/arch/x86/kernel/apic/vector.c
1546 ++++ b/arch/x86/kernel/apic/vector.c
1547 +@@ -11,6 +11,7 @@
1548 + * published by the Free Software Foundation.
1549 + */
1550 + #include <linux/interrupt.h>
1551 ++#include <linux/irq.h>
1552 + #include <linux/seq_file.h>
1553 + #include <linux/init.h>
1554 + #include <linux/compiler.h>
1555 +diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
1556 +index 38915fbfae73..97e962afb967 100644
1557 +--- a/arch/x86/kernel/cpu/amd.c
1558 ++++ b/arch/x86/kernel/cpu/amd.c
1559 +@@ -315,6 +315,13 @@ static void legacy_fixup_core_id(struct cpuinfo_x86 *c)
1560 + c->cpu_core_id %= cus_per_node;
1561 + }
1562 +
1563 ++
1564 ++static void amd_get_topology_early(struct cpuinfo_x86 *c)
1565 ++{
1566 ++ if (cpu_has(c, X86_FEATURE_TOPOEXT))
1567 ++ smp_num_siblings = ((cpuid_ebx(0x8000001e) >> 8) & 0xff) + 1;
1568 ++}
1569 ++
1570 + /*
1571 + * Fixup core topology information for
1572 + * (1) AMD multi-node processors
1573 +@@ -334,7 +341,6 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
1574 + cpuid(0x8000001e, &eax, &ebx, &ecx, &edx);
1575 +
1576 + node_id = ecx & 0xff;
1577 +- smp_num_siblings = ((ebx >> 8) & 0xff) + 1;
1578 +
1579 + if (c->x86 == 0x15)
1580 + c->cu_id = ebx & 0xff;
1581 +@@ -613,6 +619,7 @@ clear_sev:
1582 +
1583 + static void early_init_amd(struct cpuinfo_x86 *c)
1584 + {
1585 ++ u64 value;
1586 + u32 dummy;
1587 +
1588 + early_init_amd_mc(c);
1589 +@@ -683,6 +690,22 @@ static void early_init_amd(struct cpuinfo_x86 *c)
1590 + set_cpu_bug(c, X86_BUG_AMD_E400);
1591 +
1592 + early_detect_mem_encrypt(c);
1593 ++
1594 ++ /* Re-enable TopologyExtensions if switched off by BIOS */
1595 ++ if (c->x86 == 0x15 &&
1596 ++ (c->x86_model >= 0x10 && c->x86_model <= 0x6f) &&
1597 ++ !cpu_has(c, X86_FEATURE_TOPOEXT)) {
1598 ++
1599 ++ if (msr_set_bit(0xc0011005, 54) > 0) {
1600 ++ rdmsrl(0xc0011005, value);
1601 ++ if (value & BIT_64(54)) {
1602 ++ set_cpu_cap(c, X86_FEATURE_TOPOEXT);
1603 ++ pr_info_once(FW_INFO "CPU: Re-enabling disabled Topology Extensions Support.\n");
1604 ++ }
1605 ++ }
1606 ++ }
1607 ++
1608 ++ amd_get_topology_early(c);
1609 + }
1610 +
1611 + static void init_amd_k8(struct cpuinfo_x86 *c)
1612 +@@ -774,19 +797,6 @@ static void init_amd_bd(struct cpuinfo_x86 *c)
1613 + {
1614 + u64 value;
1615 +
1616 +- /* re-enable TopologyExtensions if switched off by BIOS */
1617 +- if ((c->x86_model >= 0x10) && (c->x86_model <= 0x6f) &&
1618 +- !cpu_has(c, X86_FEATURE_TOPOEXT)) {
1619 +-
1620 +- if (msr_set_bit(0xc0011005, 54) > 0) {
1621 +- rdmsrl(0xc0011005, value);
1622 +- if (value & BIT_64(54)) {
1623 +- set_cpu_cap(c, X86_FEATURE_TOPOEXT);
1624 +- pr_info_once(FW_INFO "CPU: Re-enabling disabled Topology Extensions Support.\n");
1625 +- }
1626 +- }
1627 +- }
1628 +-
1629 + /*
1630 + * The way access filter has a performance penalty on some workloads.
1631 + * Disable it on the affected CPUs.
1632 +@@ -850,16 +860,9 @@ static void init_amd(struct cpuinfo_x86 *c)
1633 +
1634 + cpu_detect_cache_sizes(c);
1635 +
1636 +- /* Multi core CPU? */
1637 +- if (c->extended_cpuid_level >= 0x80000008) {
1638 +- amd_detect_cmp(c);
1639 +- amd_get_topology(c);
1640 +- srat_detect_node(c);
1641 +- }
1642 +-
1643 +-#ifdef CONFIG_X86_32
1644 +- detect_ht(c);
1645 +-#endif
1646 ++ amd_detect_cmp(c);
1647 ++ amd_get_topology(c);
1648 ++ srat_detect_node(c);
1649 +
1650 + init_amd_cacheinfo(c);
1651 +
1652 +diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
1653 +index 5c0ea39311fe..c4f0ae49a53d 100644
1654 +--- a/arch/x86/kernel/cpu/bugs.c
1655 ++++ b/arch/x86/kernel/cpu/bugs.c
1656 +@@ -22,15 +22,18 @@
1657 + #include <asm/processor-flags.h>
1658 + #include <asm/fpu/internal.h>
1659 + #include <asm/msr.h>
1660 ++#include <asm/vmx.h>
1661 + #include <asm/paravirt.h>
1662 + #include <asm/alternative.h>
1663 + #include <asm/pgtable.h>
1664 + #include <asm/set_memory.h>
1665 + #include <asm/intel-family.h>
1666 + #include <asm/hypervisor.h>
1667 ++#include <asm/e820/api.h>
1668 +
1669 + static void __init spectre_v2_select_mitigation(void);
1670 + static void __init ssb_select_mitigation(void);
1671 ++static void __init l1tf_select_mitigation(void);
1672 +
1673 + /*
1674 + * Our boot-time value of the SPEC_CTRL MSR. We read it once so that any
1675 +@@ -56,6 +59,12 @@ void __init check_bugs(void)
1676 + {
1677 + identify_boot_cpu();
1678 +
1679 ++ /*
1680 ++ * identify_boot_cpu() initialized SMT support information, let the
1681 ++ * core code know.
1682 ++ */
1683 ++ cpu_smt_check_topology_early();
1684 ++
1685 + if (!IS_ENABLED(CONFIG_SMP)) {
1686 + pr_info("CPU: ");
1687 + print_cpu_info(&boot_cpu_data);
1688 +@@ -82,6 +91,8 @@ void __init check_bugs(void)
1689 + */
1690 + ssb_select_mitigation();
1691 +
1692 ++ l1tf_select_mitigation();
1693 ++
1694 + #ifdef CONFIG_X86_32
1695 + /*
1696 + * Check whether we are able to run this kernel safely on SMP.
1697 +@@ -313,23 +324,6 @@ static enum spectre_v2_mitigation_cmd __init spectre_v2_parse_cmdline(void)
1698 + return cmd;
1699 + }
1700 +
1701 +-/* Check for Skylake-like CPUs (for RSB handling) */
1702 +-static bool __init is_skylake_era(void)
1703 +-{
1704 +- if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
1705 +- boot_cpu_data.x86 == 6) {
1706 +- switch (boot_cpu_data.x86_model) {
1707 +- case INTEL_FAM6_SKYLAKE_MOBILE:
1708 +- case INTEL_FAM6_SKYLAKE_DESKTOP:
1709 +- case INTEL_FAM6_SKYLAKE_X:
1710 +- case INTEL_FAM6_KABYLAKE_MOBILE:
1711 +- case INTEL_FAM6_KABYLAKE_DESKTOP:
1712 +- return true;
1713 +- }
1714 +- }
1715 +- return false;
1716 +-}
1717 +-
1718 + static void __init spectre_v2_select_mitigation(void)
1719 + {
1720 + enum spectre_v2_mitigation_cmd cmd = spectre_v2_parse_cmdline();
1721 +@@ -390,22 +384,15 @@ retpoline_auto:
1722 + pr_info("%s\n", spectre_v2_strings[mode]);
1723 +
1724 + /*
1725 +- * If neither SMEP nor PTI are available, there is a risk of
1726 +- * hitting userspace addresses in the RSB after a context switch
1727 +- * from a shallow call stack to a deeper one. To prevent this fill
1728 +- * the entire RSB, even when using IBRS.
1729 ++ * If spectre v2 protection has been enabled, unconditionally fill
1730 ++ * RSB during a context switch; this protects against two independent
1731 ++ * issues:
1732 + *
1733 +- * Skylake era CPUs have a separate issue with *underflow* of the
1734 +- * RSB, when they will predict 'ret' targets from the generic BTB.
1735 +- * The proper mitigation for this is IBRS. If IBRS is not supported
1736 +- * or deactivated in favour of retpolines the RSB fill on context
1737 +- * switch is required.
1738 ++ * - RSB underflow (and switch to BTB) on Skylake+
1739 ++ * - SpectreRSB variant of spectre v2 on X86_BUG_SPECTRE_V2 CPUs
1740 + */
1741 +- if ((!boot_cpu_has(X86_FEATURE_PTI) &&
1742 +- !boot_cpu_has(X86_FEATURE_SMEP)) || is_skylake_era()) {
1743 +- setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
1744 +- pr_info("Spectre v2 mitigation: Filling RSB on context switch\n");
1745 +- }
1746 ++ setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
1747 ++ pr_info("Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch\n");
1748 +
1749 + /* Initialize Indirect Branch Prediction Barrier if supported */
1750 + if (boot_cpu_has(X86_FEATURE_IBPB)) {
1751 +@@ -654,8 +641,121 @@ void x86_spec_ctrl_setup_ap(void)
1752 + x86_amd_ssb_disable();
1753 + }
1754 +
1755 ++#undef pr_fmt
1756 ++#define pr_fmt(fmt) "L1TF: " fmt
1757 ++
1758 ++/* Default mitigation for L1TF-affected CPUs */
1759 ++enum l1tf_mitigations l1tf_mitigation __ro_after_init = L1TF_MITIGATION_FLUSH;
1760 ++#if IS_ENABLED(CONFIG_KVM_INTEL)
1761 ++EXPORT_SYMBOL_GPL(l1tf_mitigation);
1762 ++
1763 ++enum vmx_l1d_flush_state l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
1764 ++EXPORT_SYMBOL_GPL(l1tf_vmx_mitigation);
1765 ++#endif
1766 ++
1767 ++static void __init l1tf_select_mitigation(void)
1768 ++{
1769 ++ u64 half_pa;
1770 ++
1771 ++ if (!boot_cpu_has_bug(X86_BUG_L1TF))
1772 ++ return;
1773 ++
1774 ++ switch (l1tf_mitigation) {
1775 ++ case L1TF_MITIGATION_OFF:
1776 ++ case L1TF_MITIGATION_FLUSH_NOWARN:
1777 ++ case L1TF_MITIGATION_FLUSH:
1778 ++ break;
1779 ++ case L1TF_MITIGATION_FLUSH_NOSMT:
1780 ++ case L1TF_MITIGATION_FULL:
1781 ++ cpu_smt_disable(false);
1782 ++ break;
1783 ++ case L1TF_MITIGATION_FULL_FORCE:
1784 ++ cpu_smt_disable(true);
1785 ++ break;
1786 ++ }
1787 ++
1788 ++#if CONFIG_PGTABLE_LEVELS == 2
1789 ++ pr_warn("Kernel not compiled for PAE. No mitigation for L1TF\n");
1790 ++ return;
1791 ++#endif
1792 ++
1793 ++ /*
1794 ++ * This is extremely unlikely to happen because almost all
1795 ++ * systems have a MAX_PA/2 far larger than the amount of RAM
1796 ++ * that can be fit into DIMM slots.
1797 ++ */
1798 ++ half_pa = (u64)l1tf_pfn_limit() << PAGE_SHIFT;
1799 ++ if (e820__mapped_any(half_pa, ULLONG_MAX - half_pa, E820_TYPE_RAM)) {
1800 ++ pr_warn("System has more than MAX_PA/2 memory. L1TF mitigation not effective.\n");
1801 ++ return;
1802 ++ }
1803 ++
1804 ++ setup_force_cpu_cap(X86_FEATURE_L1TF_PTEINV);
1805 ++}
1806 ++
1807 ++static int __init l1tf_cmdline(char *str)
1808 ++{
1809 ++ if (!boot_cpu_has_bug(X86_BUG_L1TF))
1810 ++ return 0;
1811 ++
1812 ++ if (!str)
1813 ++ return -EINVAL;
1814 ++
1815 ++ if (!strcmp(str, "off"))
1816 ++ l1tf_mitigation = L1TF_MITIGATION_OFF;
1817 ++ else if (!strcmp(str, "flush,nowarn"))
1818 ++ l1tf_mitigation = L1TF_MITIGATION_FLUSH_NOWARN;
1819 ++ else if (!strcmp(str, "flush"))
1820 ++ l1tf_mitigation = L1TF_MITIGATION_FLUSH;
1821 ++ else if (!strcmp(str, "flush,nosmt"))
1822 ++ l1tf_mitigation = L1TF_MITIGATION_FLUSH_NOSMT;
1823 ++ else if (!strcmp(str, "full"))
1824 ++ l1tf_mitigation = L1TF_MITIGATION_FULL;
1825 ++ else if (!strcmp(str, "full,force"))
1826 ++ l1tf_mitigation = L1TF_MITIGATION_FULL_FORCE;
1827 ++
1828 ++ return 0;
1829 ++}
1830 ++early_param("l1tf", l1tf_cmdline);
1831 ++
1832 ++#undef pr_fmt
1833 ++
1834 + #ifdef CONFIG_SYSFS
1835 +
1836 ++#define L1TF_DEFAULT_MSG "Mitigation: PTE Inversion"
1837 ++
1838 ++#if IS_ENABLED(CONFIG_KVM_INTEL)
1839 ++static const char *l1tf_vmx_states[] = {
1840 ++ [VMENTER_L1D_FLUSH_AUTO] = "auto",
1841 ++ [VMENTER_L1D_FLUSH_NEVER] = "vulnerable",
1842 ++ [VMENTER_L1D_FLUSH_COND] = "conditional cache flushes",
1843 ++ [VMENTER_L1D_FLUSH_ALWAYS] = "cache flushes",
1844 ++ [VMENTER_L1D_FLUSH_EPT_DISABLED] = "EPT disabled",
1845 ++ [VMENTER_L1D_FLUSH_NOT_REQUIRED] = "flush not necessary"
1846 ++};
1847 ++
1848 ++static ssize_t l1tf_show_state(char *buf)
1849 ++{
1850 ++ if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_AUTO)
1851 ++ return sprintf(buf, "%s\n", L1TF_DEFAULT_MSG);
1852 ++
1853 ++ if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_EPT_DISABLED ||
1854 ++ (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_NEVER &&
1855 ++ cpu_smt_control == CPU_SMT_ENABLED))
1856 ++ return sprintf(buf, "%s; VMX: %s\n", L1TF_DEFAULT_MSG,
1857 ++ l1tf_vmx_states[l1tf_vmx_mitigation]);
1858 ++
1859 ++ return sprintf(buf, "%s; VMX: %s, SMT %s\n", L1TF_DEFAULT_MSG,
1860 ++ l1tf_vmx_states[l1tf_vmx_mitigation],
1861 ++ cpu_smt_control == CPU_SMT_ENABLED ? "vulnerable" : "disabled");
1862 ++}
1863 ++#else
1864 ++static ssize_t l1tf_show_state(char *buf)
1865 ++{
1866 ++ return sprintf(buf, "%s\n", L1TF_DEFAULT_MSG);
1867 ++}
1868 ++#endif
1869 ++
1870 + static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr,
1871 + char *buf, unsigned int bug)
1872 + {
1873 +@@ -684,6 +784,10 @@ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr
1874 + case X86_BUG_SPEC_STORE_BYPASS:
1875 + return sprintf(buf, "%s\n", ssb_strings[ssb_mode]);
1876 +
1877 ++ case X86_BUG_L1TF:
1878 ++ if (boot_cpu_has(X86_FEATURE_L1TF_PTEINV))
1879 ++ return l1tf_show_state(buf);
1880 ++ break;
1881 + default:
1882 + break;
1883 + }
1884 +@@ -710,4 +814,9 @@ ssize_t cpu_show_spec_store_bypass(struct device *dev, struct device_attribute *
1885 + {
1886 + return cpu_show_common(dev, attr, buf, X86_BUG_SPEC_STORE_BYPASS);
1887 + }
1888 ++
1889 ++ssize_t cpu_show_l1tf(struct device *dev, struct device_attribute *attr, char *buf)
1890 ++{
1891 ++ return cpu_show_common(dev, attr, buf, X86_BUG_L1TF);
1892 ++}
1893 + #endif
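
The l1tf= early parameter handled above accepts a fixed set of strings and maps them onto enum l1tf_mitigations; unknown strings simply leave the default (flush) in place. A table-driven user-space equivalent of that mapping (the enum values are copied from the patch, the parser structure itself is only a sketch):

#include <stdio.h>
#include <string.h>

enum l1tf_mitigations {
	L1TF_MITIGATION_OFF,
	L1TF_MITIGATION_FLUSH_NOWARN,
	L1TF_MITIGATION_FLUSH,
	L1TF_MITIGATION_FLUSH_NOSMT,
	L1TF_MITIGATION_FULL,
	L1TF_MITIGATION_FULL_FORCE,
};

static const struct {
	const char *opt;
	enum l1tf_mitigations mode;
} l1tf_options[] = {
	{ "off",          L1TF_MITIGATION_OFF },
	{ "flush,nowarn", L1TF_MITIGATION_FLUSH_NOWARN },
	{ "flush",        L1TF_MITIGATION_FLUSH },
	{ "flush,nosmt",  L1TF_MITIGATION_FLUSH_NOSMT },
	{ "full",         L1TF_MITIGATION_FULL },
	{ "full,force",   L1TF_MITIGATION_FULL_FORCE },
};

static int l1tf_parse(const char *str, enum l1tf_mitigations *mode)
{
	size_t i;

	for (i = 0; i < sizeof(l1tf_options) / sizeof(l1tf_options[0]); i++) {
		if (!strcmp(str, l1tf_options[i].opt)) {
			*mode = l1tf_options[i].mode;
			return 0;
		}
	}
	return -1;	/* unknown strings leave the default untouched */
}

int main(void)
{
	enum l1tf_mitigations mode = L1TF_MITIGATION_FLUSH;	/* default */

	if (!l1tf_parse("flush,nosmt", &mode))
		printf("mode = %d\n", mode);	/* prints 3 */
	return 0;
}
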
1894 +diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
1895 +index eb4cb3efd20e..9eda6f730ec4 100644
1896 +--- a/arch/x86/kernel/cpu/common.c
1897 ++++ b/arch/x86/kernel/cpu/common.c
1898 +@@ -661,33 +661,36 @@ static void cpu_detect_tlb(struct cpuinfo_x86 *c)
1899 + tlb_lld_4m[ENTRIES], tlb_lld_1g[ENTRIES]);
1900 + }
1901 +
1902 +-void detect_ht(struct cpuinfo_x86 *c)
1903 ++int detect_ht_early(struct cpuinfo_x86 *c)
1904 + {
1905 + #ifdef CONFIG_SMP
1906 + u32 eax, ebx, ecx, edx;
1907 +- int index_msb, core_bits;
1908 +- static bool printed;
1909 +
1910 + if (!cpu_has(c, X86_FEATURE_HT))
1911 +- return;
1912 ++ return -1;
1913 +
1914 + if (cpu_has(c, X86_FEATURE_CMP_LEGACY))
1915 +- goto out;
1916 ++ return -1;
1917 +
1918 + if (cpu_has(c, X86_FEATURE_XTOPOLOGY))
1919 +- return;
1920 ++ return -1;
1921 +
1922 + cpuid(1, &eax, &ebx, &ecx, &edx);
1923 +
1924 + smp_num_siblings = (ebx & 0xff0000) >> 16;
1925 +-
1926 +- if (smp_num_siblings == 1) {
1927 ++ if (smp_num_siblings == 1)
1928 + pr_info_once("CPU0: Hyper-Threading is disabled\n");
1929 +- goto out;
1930 +- }
1931 ++#endif
1932 ++ return 0;
1933 ++}
1934 +
1935 +- if (smp_num_siblings <= 1)
1936 +- goto out;
1937 ++void detect_ht(struct cpuinfo_x86 *c)
1938 ++{
1939 ++#ifdef CONFIG_SMP
1940 ++ int index_msb, core_bits;
1941 ++
1942 ++ if (detect_ht_early(c) < 0)
1943 ++ return;
1944 +
1945 + index_msb = get_count_order(smp_num_siblings);
1946 + c->phys_proc_id = apic->phys_pkg_id(c->initial_apicid, index_msb);
1947 +@@ -700,15 +703,6 @@ void detect_ht(struct cpuinfo_x86 *c)
1948 +
1949 + c->cpu_core_id = apic->phys_pkg_id(c->initial_apicid, index_msb) &
1950 + ((1 << core_bits) - 1);
1951 +-
1952 +-out:
1953 +- if (!printed && (c->x86_max_cores * smp_num_siblings) > 1) {
1954 +- pr_info("CPU: Physical Processor ID: %d\n",
1955 +- c->phys_proc_id);
1956 +- pr_info("CPU: Processor Core ID: %d\n",
1957 +- c->cpu_core_id);
1958 +- printed = 1;
1959 +- }
1960 + #endif
1961 + }
1962 +
1963 +@@ -987,6 +981,21 @@ static const __initconst struct x86_cpu_id cpu_no_spec_store_bypass[] = {
1964 + {}
1965 + };
1966 +
1967 ++static const __initconst struct x86_cpu_id cpu_no_l1tf[] = {
1968 ++ /* in addition to cpu_no_speculation */
1969 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT1 },
1970 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT2 },
1971 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_AIRMONT },
1972 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_MERRIFIELD },
1973 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_MOOREFIELD },
1974 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_GOLDMONT },
1975 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_DENVERTON },
1976 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_GEMINI_LAKE },
1977 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNL },
1978 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNM },
1979 ++ {}
1980 ++};
1981 ++
1982 + static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
1983 + {
1984 + u64 ia32_cap = 0;
1985 +@@ -1013,6 +1022,11 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
1986 + return;
1987 +
1988 + setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN);
1989 ++
1990 ++ if (x86_match_cpu(cpu_no_l1tf))
1991 ++ return;
1992 ++
1993 ++ setup_force_cpu_bug(X86_BUG_L1TF);
1994 + }
1995 +
1996 + /*
1997 +diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h
1998 +index 38216f678fc3..e59c0ea82a33 100644
1999 +--- a/arch/x86/kernel/cpu/cpu.h
2000 ++++ b/arch/x86/kernel/cpu/cpu.h
2001 +@@ -55,7 +55,9 @@ extern void init_intel_cacheinfo(struct cpuinfo_x86 *c);
2002 + extern void init_amd_cacheinfo(struct cpuinfo_x86 *c);
2003 +
2004 + extern void detect_num_cpu_cores(struct cpuinfo_x86 *c);
2005 ++extern int detect_extended_topology_early(struct cpuinfo_x86 *c);
2006 + extern int detect_extended_topology(struct cpuinfo_x86 *c);
2007 ++extern int detect_ht_early(struct cpuinfo_x86 *c);
2008 + extern void detect_ht(struct cpuinfo_x86 *c);
2009 +
2010 + unsigned int aperfmperf_get_khz(int cpu);
2011 +diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
2012 +index eb75564f2d25..6602941cfebf 100644
2013 +--- a/arch/x86/kernel/cpu/intel.c
2014 ++++ b/arch/x86/kernel/cpu/intel.c
2015 +@@ -301,6 +301,13 @@ static void early_init_intel(struct cpuinfo_x86 *c)
2016 + }
2017 +
2018 + check_mpx_erratum(c);
2019 ++
2020 ++ /*
2021 ++ * Get the number of SMT siblings early from the extended topology
2022 ++ * leaf, if available. Otherwise try the legacy SMT detection.
2023 ++ */
2024 ++ if (detect_extended_topology_early(c) < 0)
2025 ++ detect_ht_early(c);
2026 + }
2027 +
2028 + #ifdef CONFIG_X86_32
2029 +diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c
2030 +index 08286269fd24..b9bc8a1a584e 100644
2031 +--- a/arch/x86/kernel/cpu/microcode/core.c
2032 ++++ b/arch/x86/kernel/cpu/microcode/core.c
2033 +@@ -509,12 +509,20 @@ static struct platform_device *microcode_pdev;
2034 +
2035 + static int check_online_cpus(void)
2036 + {
2037 +- if (num_online_cpus() == num_present_cpus())
2038 +- return 0;
2039 ++ unsigned int cpu;
2040 +
2041 +- pr_err("Not all CPUs online, aborting microcode update.\n");
2042 ++ /*
2043 ++ * Make sure all CPUs are online. It's fine for SMT to be disabled if
2044 ++ * all the primary threads are still online.
2045 ++ */
2046 ++ for_each_present_cpu(cpu) {
2047 ++ if (topology_is_primary_thread(cpu) && !cpu_online(cpu)) {
2048 ++ pr_err("Not all CPUs online, aborting microcode update.\n");
2049 ++ return -EINVAL;
2050 ++ }
2051 ++ }
2052 +
2053 +- return -EINVAL;
2054 ++ return 0;
2055 + }
2056 +
2057 + static atomic_t late_cpus_in;
2058 +diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
2059 +index 81c0afb39d0a..71ca064e3794 100644
2060 +--- a/arch/x86/kernel/cpu/topology.c
2061 ++++ b/arch/x86/kernel/cpu/topology.c
2062 +@@ -22,18 +22,10 @@
2063 + #define BITS_SHIFT_NEXT_LEVEL(eax) ((eax) & 0x1f)
2064 + #define LEVEL_MAX_SIBLINGS(ebx) ((ebx) & 0xffff)
2065 +
2066 +-/*
2067 +- * Check for extended topology enumeration cpuid leaf 0xb and if it
2068 +- * exists, use it for populating initial_apicid and cpu topology
2069 +- * detection.
2070 +- */
2071 +-int detect_extended_topology(struct cpuinfo_x86 *c)
2072 ++int detect_extended_topology_early(struct cpuinfo_x86 *c)
2073 + {
2074 + #ifdef CONFIG_SMP
2075 +- unsigned int eax, ebx, ecx, edx, sub_index;
2076 +- unsigned int ht_mask_width, core_plus_mask_width;
2077 +- unsigned int core_select_mask, core_level_siblings;
2078 +- static bool printed;
2079 ++ unsigned int eax, ebx, ecx, edx;
2080 +
2081 + if (c->cpuid_level < 0xb)
2082 + return -1;
2083 +@@ -52,10 +44,30 @@ int detect_extended_topology(struct cpuinfo_x86 *c)
2084 + * initial apic id, which also represents 32-bit extended x2apic id.
2085 + */
2086 + c->initial_apicid = edx;
2087 ++ smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);
2088 ++#endif
2089 ++ return 0;
2090 ++}
2091 ++
2092 ++/*
2093 ++ * Check for extended topology enumeration cpuid leaf 0xb and if it
2094 ++ * exists, use it for populating initial_apicid and cpu topology
2095 ++ * detection.
2096 ++ */
2097 ++int detect_extended_topology(struct cpuinfo_x86 *c)
2098 ++{
2099 ++#ifdef CONFIG_SMP
2100 ++ unsigned int eax, ebx, ecx, edx, sub_index;
2101 ++ unsigned int ht_mask_width, core_plus_mask_width;
2102 ++ unsigned int core_select_mask, core_level_siblings;
2103 ++
2104 ++ if (detect_extended_topology_early(c) < 0)
2105 ++ return -1;
2106 +
2107 + /*
2108 + * Populate HT related information from sub-leaf level 0.
2109 + */
2110 ++ cpuid_count(0xb, SMT_LEVEL, &eax, &ebx, &ecx, &edx);
2111 + core_level_siblings = smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);
2112 + core_plus_mask_width = ht_mask_width = BITS_SHIFT_NEXT_LEVEL(eax);
2113 +
2114 +@@ -86,15 +98,6 @@ int detect_extended_topology(struct cpuinfo_x86 *c)
2115 + c->apicid = apic->phys_pkg_id(c->initial_apicid, 0);
2116 +
2117 + c->x86_max_cores = (core_level_siblings / smp_num_siblings);
2118 +-
2119 +- if (!printed) {
2120 +- pr_info("CPU: Physical Processor ID: %d\n",
2121 +- c->phys_proc_id);
2122 +- if (c->x86_max_cores > 1)
2123 +- pr_info("CPU: Processor Core ID: %d\n",
2124 +- c->cpu_core_id);
2125 +- printed = 1;
2126 +- }
2127 + #endif
2128 + return 0;
2129 + }
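
detect_extended_topology_early() above pulls the thread-per-core count from CPUID leaf 0xb, sub-leaf 0, where EBX[15:0] is the number of logical processors at the SMT level and EDX is the initial x2APIC ID. Roughly the same query can be made from user space, assuming a toolchain that ships GCC/clang's <cpuid.h>:

#include <cpuid.h>
#include <stdio.h>

#define LEVEL_MAX_SIBLINGS(ebx)	((ebx) & 0xffff)

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* Like the kernel, treat EBX == 0 as "leaf not implemented". */
	if (!__get_cpuid_count(0xb, 0, &eax, &ebx, &ecx, &edx) || !ebx) {
		printf("extended topology leaf 0xb not available\n");
		return 0;
	}
	printf("smp_num_siblings = %u, initial x2APIC id = %u\n",
	       LEVEL_MAX_SIBLINGS(ebx), edx);
	return 0;
}
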
2130 +diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
2131 +index f92a6593de1e..2ea85b32421a 100644
2132 +--- a/arch/x86/kernel/fpu/core.c
2133 ++++ b/arch/x86/kernel/fpu/core.c
2134 +@@ -10,6 +10,7 @@
2135 + #include <asm/fpu/signal.h>
2136 + #include <asm/fpu/types.h>
2137 + #include <asm/traps.h>
2138 ++#include <asm/irq_regs.h>
2139 +
2140 + #include <linux/hardirq.h>
2141 + #include <linux/pkeys.h>
2142 +diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
2143 +index 346b24883911..b0acb22e5a46 100644
2144 +--- a/arch/x86/kernel/hpet.c
2145 ++++ b/arch/x86/kernel/hpet.c
2146 +@@ -1,6 +1,7 @@
2147 + #include <linux/clocksource.h>
2148 + #include <linux/clockchips.h>
2149 + #include <linux/interrupt.h>
2150 ++#include <linux/irq.h>
2151 + #include <linux/export.h>
2152 + #include <linux/delay.h>
2153 + #include <linux/errno.h>
2154 +diff --git a/arch/x86/kernel/i8259.c b/arch/x86/kernel/i8259.c
2155 +index 86c4439f9d74..519649ddf100 100644
2156 +--- a/arch/x86/kernel/i8259.c
2157 ++++ b/arch/x86/kernel/i8259.c
2158 +@@ -5,6 +5,7 @@
2159 + #include <linux/sched.h>
2160 + #include <linux/ioport.h>
2161 + #include <linux/interrupt.h>
2162 ++#include <linux/irq.h>
2163 + #include <linux/timex.h>
2164 + #include <linux/random.h>
2165 + #include <linux/init.h>
2166 +diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
2167 +index 74383a3780dc..01adea278a71 100644
2168 +--- a/arch/x86/kernel/idt.c
2169 ++++ b/arch/x86/kernel/idt.c
2170 +@@ -8,6 +8,7 @@
2171 + #include <asm/traps.h>
2172 + #include <asm/proto.h>
2173 + #include <asm/desc.h>
2174 ++#include <asm/hw_irq.h>
2175 +
2176 + struct idt_data {
2177 + unsigned int vector;
2178 +diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
2179 +index 328d027d829d..59b5f2ea7c2f 100644
2180 +--- a/arch/x86/kernel/irq.c
2181 ++++ b/arch/x86/kernel/irq.c
2182 +@@ -10,6 +10,7 @@
2183 + #include <linux/ftrace.h>
2184 + #include <linux/delay.h>
2185 + #include <linux/export.h>
2186 ++#include <linux/irq.h>
2187 +
2188 + #include <asm/apic.h>
2189 + #include <asm/io_apic.h>
2190 +diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
2191 +index c1bdbd3d3232..95600a99ae93 100644
2192 +--- a/arch/x86/kernel/irq_32.c
2193 ++++ b/arch/x86/kernel/irq_32.c
2194 +@@ -11,6 +11,7 @@
2195 +
2196 + #include <linux/seq_file.h>
2197 + #include <linux/interrupt.h>
2198 ++#include <linux/irq.h>
2199 + #include <linux/kernel_stat.h>
2200 + #include <linux/notifier.h>
2201 + #include <linux/cpu.h>
2202 +diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
2203 +index d86e344f5b3d..0469cd078db1 100644
2204 +--- a/arch/x86/kernel/irq_64.c
2205 ++++ b/arch/x86/kernel/irq_64.c
2206 +@@ -11,6 +11,7 @@
2207 +
2208 + #include <linux/kernel_stat.h>
2209 + #include <linux/interrupt.h>
2210 ++#include <linux/irq.h>
2211 + #include <linux/seq_file.h>
2212 + #include <linux/delay.h>
2213 + #include <linux/ftrace.h>
2214 +diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
2215 +index 772196c1b8c4..a0693b71cfc1 100644
2216 +--- a/arch/x86/kernel/irqinit.c
2217 ++++ b/arch/x86/kernel/irqinit.c
2218 +@@ -5,6 +5,7 @@
2219 + #include <linux/sched.h>
2220 + #include <linux/ioport.h>
2221 + #include <linux/interrupt.h>
2222 ++#include <linux/irq.h>
2223 + #include <linux/timex.h>
2224 + #include <linux/random.h>
2225 + #include <linux/kprobes.h>
2226 +diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
2227 +index 6f4d42377fe5..44e26dc326d5 100644
2228 +--- a/arch/x86/kernel/kprobes/core.c
2229 ++++ b/arch/x86/kernel/kprobes/core.c
2230 +@@ -395,8 +395,6 @@ int __copy_instruction(u8 *dest, u8 *src, u8 *real, struct insn *insn)
2231 + - (u8 *) real;
2232 + if ((s64) (s32) newdisp != newdisp) {
2233 + pr_err("Kprobes error: new displacement does not fit into s32 (%llx)\n", newdisp);
2234 +- pr_err("\tSrc: %p, Dest: %p, old disp: %x\n",
2235 +- src, real, insn->displacement.value);
2236 + return 0;
2237 + }
2238 + disp = (u8 *) dest + insn_offset_displacement(insn);
2239 +@@ -640,8 +638,7 @@ static int reenter_kprobe(struct kprobe *p, struct pt_regs *regs,
2240 + * Raise a BUG or we'll continue in an endless reentering loop
2241 + * and eventually a stack overflow.
2242 + */
2243 +- printk(KERN_WARNING "Unrecoverable kprobe detected at %p.\n",
2244 +- p->addr);
2245 ++ pr_err("Unrecoverable kprobe detected.\n");
2246 + dump_kprobe(p);
2247 + BUG();
2248 + default:
2249 +diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
2250 +index 99dc79e76bdc..930c88341e4e 100644
2251 +--- a/arch/x86/kernel/paravirt.c
2252 ++++ b/arch/x86/kernel/paravirt.c
2253 +@@ -88,10 +88,12 @@ unsigned paravirt_patch_call(void *insnbuf,
2254 + struct branch *b = insnbuf;
2255 + unsigned long delta = (unsigned long)target - (addr+5);
2256 +
2257 +- if (tgt_clobbers & ~site_clobbers)
2258 +- return len; /* target would clobber too much for this site */
2259 +- if (len < 5)
2260 ++ if (len < 5) {
2261 ++#ifdef CONFIG_RETPOLINE
2262 ++ WARN_ONCE("Failing to patch indirect CALL in %ps\n", (void *)addr);
2263 ++#endif
2264 + return len; /* call too long for patch site */
2265 ++ }
2266 +
2267 + b->opcode = 0xe8; /* call */
2268 + b->delta = delta;
2269 +@@ -106,8 +108,12 @@ unsigned paravirt_patch_jmp(void *insnbuf, const void *target,
2270 + struct branch *b = insnbuf;
2271 + unsigned long delta = (unsigned long)target - (addr+5);
2272 +
2273 +- if (len < 5)
2274 ++ if (len < 5) {
2275 ++#ifdef CONFIG_RETPOLINE
2276 ++ WARN_ONCE("Failing to patch indirect JMP in %ps\n", (void *)addr);
2277 ++#endif
2278 + return len; /* call too long for patch site */
2279 ++ }
2280 +
2281 + b->opcode = 0xe9; /* jmp */
2282 + b->delta = delta;
2283 +diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
2284 +index 2f86d883dd95..74b4472ba0a6 100644
2285 +--- a/arch/x86/kernel/setup.c
2286 ++++ b/arch/x86/kernel/setup.c
2287 +@@ -823,6 +823,12 @@ void __init setup_arch(char **cmdline_p)
2288 + memblock_reserve(__pa_symbol(_text),
2289 + (unsigned long)__bss_stop - (unsigned long)_text);
2290 +
2291 ++ /*
2292 ++ * Make sure page 0 is always reserved because on systems with
2293 ++ * L1TF its contents can be leaked to user processes.
2294 ++ */
2295 ++ memblock_reserve(0, PAGE_SIZE);
2296 ++
2297 + early_reserve_initrd();
2298 +
2299 + /*
2300 +diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
2301 +index 5c574dff4c1a..04adc8d60aed 100644
2302 +--- a/arch/x86/kernel/smp.c
2303 ++++ b/arch/x86/kernel/smp.c
2304 +@@ -261,6 +261,7 @@ __visible void __irq_entry smp_reschedule_interrupt(struct pt_regs *regs)
2305 + {
2306 + ack_APIC_irq();
2307 + inc_irq_stat(irq_resched_count);
2308 ++ kvm_set_cpu_l1tf_flush_l1d();
2309 +
2310 + if (trace_resched_ipi_enabled()) {
2311 + /*
2312 +diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
2313 +index db9656e13ea0..f02ecaf97904 100644
2314 +--- a/arch/x86/kernel/smpboot.c
2315 ++++ b/arch/x86/kernel/smpboot.c
2316 +@@ -80,6 +80,7 @@
2317 + #include <asm/intel-family.h>
2318 + #include <asm/cpu_device_id.h>
2319 + #include <asm/spec-ctrl.h>
2320 ++#include <asm/hw_irq.h>
2321 +
2322 + /* representing HT siblings of each logical CPU */
2323 + DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
2324 +@@ -270,6 +271,23 @@ static void notrace start_secondary(void *unused)
2325 + cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
2326 + }
2327 +
2328 ++/**
2329 ++ * topology_is_primary_thread - Check whether CPU is the primary SMT thread
2330 ++ * @cpu: CPU to check
2331 ++ */
2332 ++bool topology_is_primary_thread(unsigned int cpu)
2333 ++{
2334 ++ return apic_id_is_primary_thread(per_cpu(x86_cpu_to_apicid, cpu));
2335 ++}
2336 ++
2337 ++/**
2338 ++ * topology_smt_supported - Check whether SMT is supported by the CPUs
2339 ++ */
2340 ++bool topology_smt_supported(void)
2341 ++{
2342 ++ return smp_num_siblings > 1;
2343 ++}
2344 ++
2345 + /**
2346 + * topology_phys_to_logical_pkg - Map a physical package id to a logical
2347 + *
2348 +diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c
2349 +index 774ebafa97c4..be01328eb755 100644
2350 +--- a/arch/x86/kernel/time.c
2351 ++++ b/arch/x86/kernel/time.c
2352 +@@ -12,6 +12,7 @@
2353 +
2354 + #include <linux/clockchips.h>
2355 + #include <linux/interrupt.h>
2356 ++#include <linux/irq.h>
2357 + #include <linux/i8253.h>
2358 + #include <linux/time.h>
2359 + #include <linux/export.h>
2360 +diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
2361 +index 6b8f11521c41..a44e568363a4 100644
2362 +--- a/arch/x86/kvm/mmu.c
2363 ++++ b/arch/x86/kvm/mmu.c
2364 +@@ -3840,6 +3840,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
2365 + {
2366 + int r = 1;
2367 +
2368 ++ vcpu->arch.l1tf_flush_l1d = true;
2369 + switch (vcpu->arch.apf.host_apf_reason) {
2370 + default:
2371 + trace_kvm_page_fault(fault_address, error_code);
2372 +diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
2373 +index 5d8e317c2b04..46b428c0990e 100644
2374 +--- a/arch/x86/kvm/vmx.c
2375 ++++ b/arch/x86/kvm/vmx.c
2376 +@@ -188,6 +188,150 @@ module_param(ple_window_max, uint, 0444);
2377 +
2378 + extern const ulong vmx_return;
2379 +
2380 ++static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush);
2381 ++static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_cond);
2382 ++static DEFINE_MUTEX(vmx_l1d_flush_mutex);
2383 ++
2384 ++/* Storage for pre module init parameter parsing */
2385 ++static enum vmx_l1d_flush_state __read_mostly vmentry_l1d_flush_param = VMENTER_L1D_FLUSH_AUTO;
2386 ++
2387 ++static const struct {
2388 ++ const char *option;
2389 ++ enum vmx_l1d_flush_state cmd;
2390 ++} vmentry_l1d_param[] = {
2391 ++ {"auto", VMENTER_L1D_FLUSH_AUTO},
2392 ++ {"never", VMENTER_L1D_FLUSH_NEVER},
2393 ++ {"cond", VMENTER_L1D_FLUSH_COND},
2394 ++ {"always", VMENTER_L1D_FLUSH_ALWAYS},
2395 ++};
2396 ++
2397 ++#define L1D_CACHE_ORDER 4
2398 ++static void *vmx_l1d_flush_pages;
2399 ++
2400 ++static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
2401 ++{
2402 ++ struct page *page;
2403 ++ unsigned int i;
2404 ++
2405 ++ if (!enable_ept) {
2406 ++ l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_EPT_DISABLED;
2407 ++ return 0;
2408 ++ }
2409 ++
2410 ++ if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES)) {
2411 ++ u64 msr;
2412 ++
2413 ++ rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr);
2414 ++ if (msr & ARCH_CAP_SKIP_VMENTRY_L1DFLUSH) {
2415 ++ l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_NOT_REQUIRED;
2416 ++ return 0;
2417 ++ }
2418 ++ }
2419 ++
2420 ++ /* If set to auto use the default l1tf mitigation method */
2421 ++ if (l1tf == VMENTER_L1D_FLUSH_AUTO) {
2422 ++ switch (l1tf_mitigation) {
2423 ++ case L1TF_MITIGATION_OFF:
2424 ++ l1tf = VMENTER_L1D_FLUSH_NEVER;
2425 ++ break;
2426 ++ case L1TF_MITIGATION_FLUSH_NOWARN:
2427 ++ case L1TF_MITIGATION_FLUSH:
2428 ++ case L1TF_MITIGATION_FLUSH_NOSMT:
2429 ++ l1tf = VMENTER_L1D_FLUSH_COND;
2430 ++ break;
2431 ++ case L1TF_MITIGATION_FULL:
2432 ++ case L1TF_MITIGATION_FULL_FORCE:
2433 ++ l1tf = VMENTER_L1D_FLUSH_ALWAYS;
2434 ++ break;
2435 ++ }
2436 ++ } else if (l1tf_mitigation == L1TF_MITIGATION_FULL_FORCE) {
2437 ++ l1tf = VMENTER_L1D_FLUSH_ALWAYS;
2438 ++ }
2439 ++
2440 ++ if (l1tf != VMENTER_L1D_FLUSH_NEVER && !vmx_l1d_flush_pages &&
2441 ++ !boot_cpu_has(X86_FEATURE_FLUSH_L1D)) {
2442 ++ page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER);
2443 ++ if (!page)
2444 ++ return -ENOMEM;
2445 ++ vmx_l1d_flush_pages = page_address(page);
2446 ++
2447 ++ /*
2448 ++ * Initialize each page with a different pattern in
2449 ++ * order to protect against KSM in the nested
2450 ++ * virtualization case.
2451 ++ */
2452 ++ for (i = 0; i < 1u << L1D_CACHE_ORDER; ++i) {
2453 ++ memset(vmx_l1d_flush_pages + i * PAGE_SIZE, i + 1,
2454 ++ PAGE_SIZE);
2455 ++ }
2456 ++ }
2457 ++
2458 ++ l1tf_vmx_mitigation = l1tf;
2459 ++
2460 ++ if (l1tf != VMENTER_L1D_FLUSH_NEVER)
2461 ++ static_branch_enable(&vmx_l1d_should_flush);
2462 ++ else
2463 ++ static_branch_disable(&vmx_l1d_should_flush);
2464 ++
2465 ++ if (l1tf == VMENTER_L1D_FLUSH_COND)
2466 ++ static_branch_enable(&vmx_l1d_flush_cond);
2467 ++ else
2468 ++ static_branch_disable(&vmx_l1d_flush_cond);
2469 ++ return 0;
2470 ++}
2471 ++
2472 ++static int vmentry_l1d_flush_parse(const char *s)
2473 ++{
2474 ++ unsigned int i;
2475 ++
2476 ++ if (s) {
2477 ++ for (i = 0; i < ARRAY_SIZE(vmentry_l1d_param); i++) {
2478 ++ if (sysfs_streq(s, vmentry_l1d_param[i].option))
2479 ++ return vmentry_l1d_param[i].cmd;
2480 ++ }
2481 ++ }
2482 ++ return -EINVAL;
2483 ++}
2484 ++
2485 ++static int vmentry_l1d_flush_set(const char *s, const struct kernel_param *kp)
2486 ++{
2487 ++ int l1tf, ret;
2488 ++
2489 ++ if (!boot_cpu_has(X86_BUG_L1TF))
2490 ++ return 0;
2491 ++
2492 ++ l1tf = vmentry_l1d_flush_parse(s);
2493 ++ if (l1tf < 0)
2494 ++ return l1tf;
2495 ++
2496 ++ /*
2497 ++ * Has vmx_init() run already? If not then this is the pre init
2498 ++ * parameter parsing. In that case just store the value and let
2499 ++ * vmx_init() do the proper setup after enable_ept has been
2500 ++ * established.
2501 ++ */
2502 ++ if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_AUTO) {
2503 ++ vmentry_l1d_flush_param = l1tf;
2504 ++ return 0;
2505 ++ }
2506 ++
2507 ++ mutex_lock(&vmx_l1d_flush_mutex);
2508 ++ ret = vmx_setup_l1d_flush(l1tf);
2509 ++ mutex_unlock(&vmx_l1d_flush_mutex);
2510 ++ return ret;
2511 ++}
2512 ++
2513 ++static int vmentry_l1d_flush_get(char *s, const struct kernel_param *kp)
2514 ++{
2515 ++ return sprintf(s, "%s\n", vmentry_l1d_param[l1tf_vmx_mitigation].option);
2516 ++}
2517 ++
2518 ++static const struct kernel_param_ops vmentry_l1d_flush_ops = {
2519 ++ .set = vmentry_l1d_flush_set,
2520 ++ .get = vmentry_l1d_flush_get,
2521 ++};
2522 ++module_param_cb(vmentry_l1d_flush, &vmentry_l1d_flush_ops, NULL, 0644);
2523 ++
2524 + struct kvm_vmx {
2525 + struct kvm kvm;
2526 +
2527 +@@ -757,6 +901,11 @@ static inline int pi_test_sn(struct pi_desc *pi_desc)
2528 + (unsigned long *)&pi_desc->control);
2529 + }
2530 +
2531 ++struct vmx_msrs {
2532 ++ unsigned int nr;
2533 ++ struct vmx_msr_entry val[NR_AUTOLOAD_MSRS];
2534 ++};
2535 ++
2536 + struct vcpu_vmx {
2537 + struct kvm_vcpu vcpu;
2538 + unsigned long host_rsp;
2539 +@@ -790,9 +939,8 @@ struct vcpu_vmx {
2540 + struct loaded_vmcs *loaded_vmcs;
2541 + bool __launched; /* temporary, used in vmx_vcpu_run */
2542 + struct msr_autoload {
2543 +- unsigned nr;
2544 +- struct vmx_msr_entry guest[NR_AUTOLOAD_MSRS];
2545 +- struct vmx_msr_entry host[NR_AUTOLOAD_MSRS];
2546 ++ struct vmx_msrs guest;
2547 ++ struct vmx_msrs host;
2548 + } msr_autoload;
2549 + struct {
2550 + int loaded;
2551 +@@ -2377,9 +2525,20 @@ static void clear_atomic_switch_msr_special(struct vcpu_vmx *vmx,
2552 + vm_exit_controls_clearbit(vmx, exit);
2553 + }
2554 +
2555 ++static int find_msr(struct vmx_msrs *m, unsigned int msr)
2556 ++{
2557 ++ unsigned int i;
2558 ++
2559 ++ for (i = 0; i < m->nr; ++i) {
2560 ++ if (m->val[i].index == msr)
2561 ++ return i;
2562 ++ }
2563 ++ return -ENOENT;
2564 ++}
2565 ++
2566 + static void clear_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr)
2567 + {
2568 +- unsigned i;
2569 ++ int i;
2570 + struct msr_autoload *m = &vmx->msr_autoload;
2571 +
2572 + switch (msr) {
2573 +@@ -2400,18 +2559,21 @@ static void clear_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr)
2574 + }
2575 + break;
2576 + }
2577 ++ i = find_msr(&m->guest, msr);
2578 ++ if (i < 0)
2579 ++ goto skip_guest;
2580 ++ --m->guest.nr;
2581 ++ m->guest.val[i] = m->guest.val[m->guest.nr];
2582 ++ vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->guest.nr);
2583 +
2584 +- for (i = 0; i < m->nr; ++i)
2585 +- if (m->guest[i].index == msr)
2586 +- break;
2587 +-
2588 +- if (i == m->nr)
2589 ++skip_guest:
2590 ++ i = find_msr(&m->host, msr);
2591 ++ if (i < 0)
2592 + return;
2593 +- --m->nr;
2594 +- m->guest[i] = m->guest[m->nr];
2595 +- m->host[i] = m->host[m->nr];
2596 +- vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
2597 +- vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
2598 ++
2599 ++ --m->host.nr;
2600 ++ m->host.val[i] = m->host.val[m->host.nr];
2601 ++ vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->host.nr);
2602 + }
2603 +
2604 + static void add_atomic_switch_msr_special(struct vcpu_vmx *vmx,
2605 +@@ -2426,9 +2588,9 @@ static void add_atomic_switch_msr_special(struct vcpu_vmx *vmx,
2606 + }
2607 +
2608 + static void add_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr,
2609 +- u64 guest_val, u64 host_val)
2610 ++ u64 guest_val, u64 host_val, bool entry_only)
2611 + {
2612 +- unsigned i;
2613 ++ int i, j = 0;
2614 + struct msr_autoload *m = &vmx->msr_autoload;
2615 +
2616 + switch (msr) {
2617 +@@ -2463,24 +2625,31 @@ static void add_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr,
2618 + wrmsrl(MSR_IA32_PEBS_ENABLE, 0);
2619 + }
2620 +
2621 +- for (i = 0; i < m->nr; ++i)
2622 +- if (m->guest[i].index == msr)
2623 +- break;
2624 ++ i = find_msr(&m->guest, msr);
2625 ++ if (!entry_only)
2626 ++ j = find_msr(&m->host, msr);
2627 +
2628 +- if (i == NR_AUTOLOAD_MSRS) {
2629 ++ if (i == NR_AUTOLOAD_MSRS || j == NR_AUTOLOAD_MSRS) {
2630 + printk_once(KERN_WARNING "Not enough msr switch entries. "
2631 + "Can't add msr %x\n", msr);
2632 + return;
2633 +- } else if (i == m->nr) {
2634 +- ++m->nr;
2635 +- vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
2636 +- vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
2637 + }
2638 ++ if (i < 0) {
2639 ++ i = m->guest.nr++;
2640 ++ vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->guest.nr);
2641 ++ }
2642 ++ m->guest.val[i].index = msr;
2643 ++ m->guest.val[i].value = guest_val;
2644 ++
2645 ++ if (entry_only)
2646 ++ return;
2647 +
2648 +- m->guest[i].index = msr;
2649 +- m->guest[i].value = guest_val;
2650 +- m->host[i].index = msr;
2651 +- m->host[i].value = host_val;
2652 ++ if (j < 0) {
2653 ++ j = m->host.nr++;
2654 ++ vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->host.nr);
2655 ++ }
2656 ++ m->host.val[j].index = msr;
2657 ++ m->host.val[j].value = host_val;
2658 + }
2659 +
2660 + static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)
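
The reworked clear_atomic_switch_msr() above removes an entry by overwriting it with the last element of the now separately sized guest or host list, so the arrays stay dense without shifting anything. A stand-alone sketch of that swap-with-last idiom with a simplified entry type (the names here are illustrative only):

#include <stdio.h>

struct msr_entry { unsigned int index; unsigned long long value; };
struct msr_list  { unsigned int nr; struct msr_entry val[8]; };

static int find_msr(struct msr_list *m, unsigned int msr)
{
	unsigned int i;

	for (i = 0; i < m->nr; ++i)
		if (m->val[i].index == msr)
			return i;
	return -1;
}

static void clear_msr(struct msr_list *m, unsigned int msr)
{
	int i = find_msr(m, msr);

	if (i < 0)
		return;
	--m->nr;
	m->val[i] = m->val[m->nr];	/* move the last entry into the hole */
}

int main(void)
{
	struct msr_list m = { .nr = 3, .val = {
		{ 0x10, 1 }, { 0x1a0, 2 }, { 0x48, 3 } } };

	clear_msr(&m, 0x1a0);
	printf("nr=%u, slot 1 now holds %#x\n", m.nr, m.val[1].index);
	return 0;
}
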
2661 +@@ -2524,7 +2693,7 @@ static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)
2662 + guest_efer &= ~EFER_LME;
2663 + if (guest_efer != host_efer)
2664 + add_atomic_switch_msr(vmx, MSR_EFER,
2665 +- guest_efer, host_efer);
2666 ++ guest_efer, host_efer, false);
2667 + return false;
2668 + } else {
2669 + guest_efer &= ~ignore_bits;
2670 +@@ -3987,7 +4156,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
2671 + vcpu->arch.ia32_xss = data;
2672 + if (vcpu->arch.ia32_xss != host_xss)
2673 + add_atomic_switch_msr(vmx, MSR_IA32_XSS,
2674 +- vcpu->arch.ia32_xss, host_xss);
2675 ++ vcpu->arch.ia32_xss, host_xss, false);
2676 + else
2677 + clear_atomic_switch_msr(vmx, MSR_IA32_XSS);
2678 + break;
2679 +@@ -6274,9 +6443,9 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx)
2680 +
2681 + vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
2682 + vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
2683 +- vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
2684 ++ vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host.val));
2685 + vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
2686 +- vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest));
2687 ++ vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest.val));
2688 +
2689 + if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
2690 + vmcs_write64(GUEST_IA32_PAT, vmx->vcpu.arch.pat);
2691 +@@ -6296,8 +6465,7 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx)
2692 + ++vmx->nmsrs;
2693 + }
2694 +
2695 +- if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
2696 +- rdmsrl(MSR_IA32_ARCH_CAPABILITIES, vmx->arch_capabilities);
2697 ++ vmx->arch_capabilities = kvm_get_arch_capabilities();
2698 +
2699 + vm_exit_controls_init(vmx, vmcs_config.vmexit_ctrl);
2700 +
2701 +@@ -9548,6 +9716,79 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
2702 + }
2703 + }
2704 +
2705 ++/*
2706 ++ * Software based L1D cache flush which is used when microcode providing
2707 ++ * the cache control MSR is not loaded.
2708 ++ *
2709 ++ * The L1D cache is 32 KiB on Nehalem and later microarchitectures, but
2710 ++ * flushing it requires reading in 64 KiB because the replacement algorithm
2711 ++ * is not exactly LRU. This could be sized at runtime via topology
2712 ++ * information but as all relevant affected CPUs have 32KiB L1D cache size
2713 ++ * there is no point in doing so.
2714 ++ */
2715 ++#define L1D_CACHE_ORDER 4
2716 ++static void *vmx_l1d_flush_pages;
2717 ++
2718 ++static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
2719 ++{
2720 ++ int size = PAGE_SIZE << L1D_CACHE_ORDER;
2721 ++
2722 ++ /*
2723 ++ * This code is only executed when the flush mode is 'cond' or
2724 ++ * 'always'
2725 ++ */
2726 ++ if (static_branch_likely(&vmx_l1d_flush_cond)) {
2727 ++ bool flush_l1d;
2728 ++
2729 ++ /*
2730 ++ * Clear the per-vcpu flush bit, it gets set again
2731 ++ * either from vcpu_run() or from one of the unsafe
2732 ++ * VMEXIT handlers.
2733 ++ */
2734 ++ flush_l1d = vcpu->arch.l1tf_flush_l1d;
2735 ++ vcpu->arch.l1tf_flush_l1d = false;
2736 ++
2737 ++ /*
2738 ++ * Clear the per-cpu flush bit, it gets set again from
2739 ++ * the interrupt handlers.
2740 ++ */
2741 ++ flush_l1d |= kvm_get_cpu_l1tf_flush_l1d();
2742 ++ kvm_clear_cpu_l1tf_flush_l1d();
2743 ++
2744 ++ if (!flush_l1d)
2745 ++ return;
2746 ++ }
2747 ++
2748 ++ vcpu->stat.l1d_flush++;
2749 ++
2750 ++ if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
2751 ++ wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
2752 ++ return;
2753 ++ }
2754 ++
2755 ++ asm volatile(
2756 ++ /* First ensure the pages are in the TLB */
2757 ++ "xorl %%eax, %%eax\n"
2758 ++ ".Lpopulate_tlb:\n\t"
2759 ++ "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
2760 ++ "addl $4096, %%eax\n\t"
2761 ++ "cmpl %%eax, %[size]\n\t"
2762 ++ "jne .Lpopulate_tlb\n\t"
2763 ++ "xorl %%eax, %%eax\n\t"
2764 ++ "cpuid\n\t"
2765 ++ /* Now fill the cache */
2766 ++ "xorl %%eax, %%eax\n"
2767 ++ ".Lfill_cache:\n"
2768 ++ "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
2769 ++ "addl $64, %%eax\n\t"
2770 ++ "cmpl %%eax, %[size]\n\t"
2771 ++ "jne .Lfill_cache\n\t"
2772 ++ "lfence\n"
2773 ++ :: [flush_pages] "r" (vmx_l1d_flush_pages),
2774 ++ [size] "r" (size)
2775 ++ : "eax", "ebx", "ecx", "edx");
2776 ++}
2777 ++
2778 + static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
2779 + {
2780 + struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
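
The inline asm in vmx_l1d_flush() above makes two passes over the 64 KiB flush buffer: one access per 4 KiB page to populate the TLB, then one access per 64-byte line to displace the L1D contents. A plain C rendering of that access pattern (purely illustrative; user space cannot guarantee an actual L1D flush, and the real code additionally serializes with CPUID and LFENCE):

#include <stdlib.h>

#define L1D_CACHE_ORDER	4
#define PAGE_SIZE	4096
#define FLUSH_SIZE	(PAGE_SIZE << L1D_CACHE_ORDER)	/* 64 KiB */

static void touch_like_vmx_l1d_flush(volatile unsigned char *buf)
{
	int offset;

	/* Pass 1: one access per page, populating the TLB. */
	for (offset = 0; offset < FLUSH_SIZE; offset += PAGE_SIZE)
		(void)buf[offset];

	/* Pass 2: one access per cache line, filling the L1D. */
	for (offset = 0; offset < FLUSH_SIZE; offset += 64)
		(void)buf[offset];
}

int main(void)
{
	unsigned char *buf = malloc(FLUSH_SIZE);

	if (buf)
		touch_like_vmx_l1d_flush(buf);
	free(buf);
	return 0;
}
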
2781 +@@ -9949,7 +10190,7 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
2782 + clear_atomic_switch_msr(vmx, msrs[i].msr);
2783 + else
2784 + add_atomic_switch_msr(vmx, msrs[i].msr, msrs[i].guest,
2785 +- msrs[i].host);
2786 ++ msrs[i].host, false);
2787 + }
2788 +
2789 + static void vmx_arm_hv_timer(struct kvm_vcpu *vcpu)
2790 +@@ -10044,6 +10285,9 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
2791 + evmcs_rsp = static_branch_unlikely(&enable_evmcs) ?
2792 + (unsigned long)&current_evmcs->host_rsp : 0;
2793 +
2794 ++ if (static_branch_unlikely(&vmx_l1d_should_flush))
2795 ++ vmx_l1d_flush(vcpu);
2796 ++
2797 + asm(
2798 + /* Store host registers */
2799 + "push %%" _ASM_DX "; push %%" _ASM_BP ";"
2800 +@@ -10403,10 +10647,37 @@ free_vcpu:
2801 + return ERR_PTR(err);
2802 + }
2803 +
2804 ++#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html for details.\n"
2805 ++#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html for details.\n"
2806 ++
2807 + static int vmx_vm_init(struct kvm *kvm)
2808 + {
2809 + if (!ple_gap)
2810 + kvm->arch.pause_in_guest = true;
2811 ++
2812 ++ if (boot_cpu_has(X86_BUG_L1TF) && enable_ept) {
2813 ++ switch (l1tf_mitigation) {
2814 ++ case L1TF_MITIGATION_OFF:
2815 ++ case L1TF_MITIGATION_FLUSH_NOWARN:
2816 ++ /* 'I explicitly don't care' is set */
2817 ++ break;
2818 ++ case L1TF_MITIGATION_FLUSH:
2819 ++ case L1TF_MITIGATION_FLUSH_NOSMT:
2820 ++ case L1TF_MITIGATION_FULL:
2821 ++ /*
2822 ++ * Warn upon starting the first VM in a potentially
2823 ++ * insecure environment.
2824 ++ */
2825 ++ if (cpu_smt_control == CPU_SMT_ENABLED)
2826 ++ pr_warn_once(L1TF_MSG_SMT);
2827 ++ if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_NEVER)
2828 ++ pr_warn_once(L1TF_MSG_L1D);
2829 ++ break;
2830 ++ case L1TF_MITIGATION_FULL_FORCE:
2831 ++ /* Flush is enforced */
2832 ++ break;
2833 ++ }
2834 ++ }
2835 + return 0;
2836 + }
2837 +
2838 +@@ -11260,10 +11531,10 @@ static void prepare_vmcs02_full(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
2839 + * Set the MSR load/store lists to match L0's settings.
2840 + */
2841 + vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
2842 +- vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
2843 +- vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
2844 +- vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
2845 +- vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest));
2846 ++ vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr);
2847 ++ vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host.val));
2848 ++ vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr);
2849 ++ vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest.val));
2850 +
2851 + set_cr4_guest_host_mask(vmx);
2852 +
2853 +@@ -11899,6 +12170,9 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
2854 + return ret;
2855 + }
2856 +
2857 ++ /* Hide L1D cache contents from the nested guest. */
2858 ++ vmx->vcpu.arch.l1tf_flush_l1d = true;
2859 ++
2860 + /*
2861 + * If we're entering a halted L2 vcpu and the L2 vcpu won't be woken
2862 + * by event injection, halt vcpu.
2863 +@@ -12419,8 +12693,8 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
2864 + vmx_segment_cache_clear(vmx);
2865 +
2866 + /* Update any VMCS fields that might have changed while L2 ran */
2867 +- vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
2868 +- vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
2869 ++ vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr);
2870 ++ vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr);
2871 + vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset);
2872 + if (vmx->hv_deadline_tsc == -1)
2873 + vmcs_clear_bits(PIN_BASED_VM_EXEC_CONTROL,
2874 +@@ -13137,6 +13411,51 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
2875 + .enable_smi_window = enable_smi_window,
2876 + };
2877 +
2878 ++static void vmx_cleanup_l1d_flush(void)
2879 ++{
2880 ++ if (vmx_l1d_flush_pages) {
2881 ++ free_pages((unsigned long)vmx_l1d_flush_pages, L1D_CACHE_ORDER);
2882 ++ vmx_l1d_flush_pages = NULL;
2883 ++ }
2884 ++ /* Restore state so sysfs ignores VMX */
2885 ++ l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
2886 ++}
2887 ++
2888 ++static void vmx_exit(void)
2889 ++{
2890 ++#ifdef CONFIG_KEXEC_CORE
2891 ++ RCU_INIT_POINTER(crash_vmclear_loaded_vmcss, NULL);
2892 ++ synchronize_rcu();
2893 ++#endif
2894 ++
2895 ++ kvm_exit();
2896 ++
2897 ++#if IS_ENABLED(CONFIG_HYPERV)
2898 ++ if (static_branch_unlikely(&enable_evmcs)) {
2899 ++ int cpu;
2900 ++ struct hv_vp_assist_page *vp_ap;
2901 ++ /*
2902 ++ * Reset everything to support using non-enlightened VMCS
2903 ++ * access later (e.g. when we reload the module with
2904 ++ * enlightened_vmcs=0)
2905 ++ */
2906 ++ for_each_online_cpu(cpu) {
2907 ++ vp_ap = hv_get_vp_assist_page(cpu);
2908 ++
2909 ++ if (!vp_ap)
2910 ++ continue;
2911 ++
2912 ++ vp_ap->current_nested_vmcs = 0;
2913 ++ vp_ap->enlighten_vmentry = 0;
2914 ++ }
2915 ++
2916 ++ static_branch_disable(&enable_evmcs);
2917 ++ }
2918 ++#endif
2919 ++ vmx_cleanup_l1d_flush();
2920 ++}
2921 ++module_exit(vmx_exit);
2922 ++
2923 + static int __init vmx_init(void)
2924 + {
2925 + int r;
2926 +@@ -13171,10 +13490,25 @@ static int __init vmx_init(void)
2927 + #endif
2928 +
2929 + r = kvm_init(&vmx_x86_ops, sizeof(struct vcpu_vmx),
2930 +- __alignof__(struct vcpu_vmx), THIS_MODULE);
2931 ++ __alignof__(struct vcpu_vmx), THIS_MODULE);
2932 + if (r)
2933 + return r;
2934 +
2935 ++ /*
2936 ++ * Must be called after kvm_init() so enable_ept is properly set
2937 ++ * up. Hand in the mitigation parameter value that was stored by
2938 ++ * the pre module init parser. If no parameter was given, it will
2939 ++ * contain 'auto' which will be turned into the default 'cond'
2940 ++ * mitigation mode.
2941 ++ */
2942 ++ if (boot_cpu_has(X86_BUG_L1TF)) {
2943 ++ r = vmx_setup_l1d_flush(vmentry_l1d_flush_param);
2944 ++ if (r) {
2945 ++ vmx_exit();
2946 ++ return r;
2947 ++ }
2948 ++ }
2949 ++
2950 + #ifdef CONFIG_KEXEC_CORE
2951 + rcu_assign_pointer(crash_vmclear_loaded_vmcss,
2952 + crash_vmclear_local_loaded_vmcss);
2953 +@@ -13183,39 +13517,4 @@ static int __init vmx_init(void)
2954 +
2955 + return 0;
2956 + }
2957 +-
2958 +-static void __exit vmx_exit(void)
2959 +-{
2960 +-#ifdef CONFIG_KEXEC_CORE
2961 +- RCU_INIT_POINTER(crash_vmclear_loaded_vmcss, NULL);
2962 +- synchronize_rcu();
2963 +-#endif
2964 +-
2965 +- kvm_exit();
2966 +-
2967 +-#if IS_ENABLED(CONFIG_HYPERV)
2968 +- if (static_branch_unlikely(&enable_evmcs)) {
2969 +- int cpu;
2970 +- struct hv_vp_assist_page *vp_ap;
2971 +- /*
2972 +- * Reset everything to support using non-enlightened VMCS
2973 +- * access later (e.g. when we reload the module with
2974 +- * enlightened_vmcs=0)
2975 +- */
2976 +- for_each_online_cpu(cpu) {
2977 +- vp_ap = hv_get_vp_assist_page(cpu);
2978 +-
2979 +- if (!vp_ap)
2980 +- continue;
2981 +-
2982 +- vp_ap->current_nested_vmcs = 0;
2983 +- vp_ap->enlighten_vmentry = 0;
2984 +- }
2985 +-
2986 +- static_branch_disable(&enable_evmcs);
2987 +- }
2988 +-#endif
2989 +-}
2990 +-
2991 +-module_init(vmx_init)
2992 +-module_exit(vmx_exit)
2993 ++module_init(vmx_init);
2994 +diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
2995 +index 2b812b3c5088..a5caa5e5480c 100644
2996 +--- a/arch/x86/kvm/x86.c
2997 ++++ b/arch/x86/kvm/x86.c
2998 +@@ -195,6 +195,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
2999 + { "irq_injections", VCPU_STAT(irq_injections) },
3000 + { "nmi_injections", VCPU_STAT(nmi_injections) },
3001 + { "req_event", VCPU_STAT(req_event) },
3002 ++ { "l1d_flush", VCPU_STAT(l1d_flush) },
3003 + { "mmu_shadow_zapped", VM_STAT(mmu_shadow_zapped) },
3004 + { "mmu_pte_write", VM_STAT(mmu_pte_write) },
3005 + { "mmu_pte_updated", VM_STAT(mmu_pte_updated) },
3006 +@@ -1102,11 +1103,35 @@ static u32 msr_based_features[] = {
3007 +
3008 + static unsigned int num_msr_based_features;
3009 +
3010 ++u64 kvm_get_arch_capabilities(void)
3011 ++{
3012 ++ u64 data;
3013 ++
3014 ++ rdmsrl_safe(MSR_IA32_ARCH_CAPABILITIES, &data);
3015 ++
3016 ++ /*
3017 ++ * If we're doing cache flushes (either "always" or "cond")
3018 ++ * we will do one whenever the guest does a vmlaunch/vmresume.
3019 ++ * If an outer hypervisor is doing the cache flush for us
3020 ++ * (VMENTER_L1D_FLUSH_NESTED_VM), we can safely pass that
3021 ++ * capability to the guest too, and if EPT is disabled we're not
3022 ++ * vulnerable. Overall, only VMENTER_L1D_FLUSH_NEVER will
3023 ++ * require a nested hypervisor to do a flush of its own.
3024 ++ */
3025 ++ if (l1tf_vmx_mitigation != VMENTER_L1D_FLUSH_NEVER)
3026 ++ data |= ARCH_CAP_SKIP_VMENTRY_L1DFLUSH;
3027 ++
3028 ++ return data;
3029 ++}
3030 ++EXPORT_SYMBOL_GPL(kvm_get_arch_capabilities);
3031 ++
3032 + static int kvm_get_msr_feature(struct kvm_msr_entry *msr)
3033 + {
3034 + switch (msr->index) {
3035 +- case MSR_IA32_UCODE_REV:
3036 + case MSR_IA32_ARCH_CAPABILITIES:
3037 ++ msr->data = kvm_get_arch_capabilities();
3038 ++ break;
3039 ++ case MSR_IA32_UCODE_REV:
3040 + rdmsrl_safe(msr->index, &msr->data);
3041 + break;
3042 + default:
3043 +@@ -4876,6 +4901,9 @@ static int emulator_write_std(struct x86_emulate_ctxt *ctxt, gva_t addr, void *v
3044 + int kvm_write_guest_virt_system(struct kvm_vcpu *vcpu, gva_t addr, void *val,
3045 + unsigned int bytes, struct x86_exception *exception)
3046 + {
3047 ++ /* kvm_write_guest_virt_system can pull in tons of pages. */
3048 ++ vcpu->arch.l1tf_flush_l1d = true;
3049 ++
3050 + return kvm_write_guest_virt_helper(addr, val, bytes, vcpu,
3051 + PFERR_WRITE_MASK, exception);
3052 + }
3053 +@@ -6052,6 +6080,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
3054 + bool writeback = true;
3055 + bool write_fault_to_spt = vcpu->arch.write_fault_to_shadow_pgtable;
3056 +
3057 ++ vcpu->arch.l1tf_flush_l1d = true;
3058 ++
3059 + /*
3060 + * Clear write_fault_to_shadow_pgtable here to ensure it is
3061 + * never reused.
3062 +@@ -7581,6 +7611,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
3063 + struct kvm *kvm = vcpu->kvm;
3064 +
3065 + vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
3066 ++ vcpu->arch.l1tf_flush_l1d = true;
3067 +
3068 + for (;;) {
3069 + if (kvm_vcpu_running(vcpu)) {
3070 +@@ -8700,6 +8731,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
3071 +
3072 + void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
3073 + {
3074 ++ vcpu->arch.l1tf_flush_l1d = true;
3075 + kvm_x86_ops->sched_in(vcpu, cpu);
3076 + }
3077 +
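A quick way to see the effect of kvm_get_arch_capabilities() from inside a guest is to read IA32_ARCH_CAPABILITIES and look for the skip-VMENTRY-L1D-flush hint. The sketch below is illustrative only and not part of the patch: MSR index 0x10a is standard, but the bit position (3) used for ARCH_CAP_SKIP_VMENTRY_L1DFLUSH is an assumption here, and it needs the msr driver plus root.

/* Illustrative sketch (not from the patch): read IA32_ARCH_CAPABILITIES
 * (MSR 0x10a) inside a guest and test the skip-VMENTRY-L1D-flush hint.
 * Bit 3 is assumed for ARCH_CAP_SKIP_VMENTRY_L1DFLUSH; requires the msr
 * driver (/dev/cpu/0/msr) and root. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	uint64_t caps = 0;
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	if (fd < 0 || pread(fd, &caps, sizeof(caps), 0x10a) != (ssize_t)sizeof(caps)) {
		perror("IA32_ARCH_CAPABILITIES");
		return 1;
	}
	printf("arch_capabilities=0x%llx skip_vmentry_l1d_flush=%u\n",
	       (unsigned long long)caps, (unsigned int)((caps >> 3) & 1));
	close(fd);
	return 0;
}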
3078 +diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
3079 +index cee58a972cb2..83241eb71cd4 100644
3080 +--- a/arch/x86/mm/init.c
3081 ++++ b/arch/x86/mm/init.c
3082 +@@ -4,6 +4,8 @@
3083 + #include <linux/swap.h>
3084 + #include <linux/memblock.h>
3085 + #include <linux/bootmem.h> /* for max_low_pfn */
3086 ++#include <linux/swapfile.h>
3087 ++#include <linux/swapops.h>
3088 +
3089 + #include <asm/set_memory.h>
3090 + #include <asm/e820/api.h>
3091 +@@ -880,3 +882,26 @@ void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache)
3092 + __cachemode2pte_tbl[cache] = __cm_idx2pte(entry);
3093 + __pte2cachemode_tbl[entry] = cache;
3094 + }
3095 ++
3096 ++#ifdef CONFIG_SWAP
3097 ++unsigned long max_swapfile_size(void)
3098 ++{
3099 ++ unsigned long pages;
3100 ++
3101 ++ pages = generic_max_swapfile_size();
3102 ++
3103 ++ if (boot_cpu_has_bug(X86_BUG_L1TF)) {
3104 ++ /* Limit the swap file size to MAX_PA/2 for L1TF workaround */
3105 ++ unsigned long l1tf_limit = l1tf_pfn_limit() + 1;
3106 ++ /*
3107 ++ * We encode swap offsets also with 3 bits below those for pfn
3108 ++ * which makes the usable limit higher.
3109 ++ */
3110 ++#if CONFIG_PGTABLE_LEVELS > 2
3111 ++ l1tf_limit <<= PAGE_SHIFT - SWP_OFFSET_FIRST_BIT;
3112 ++#endif
3113 ++ pages = min_t(unsigned long, l1tf_limit, pages);
3114 ++ }
3115 ++ return pages;
3116 ++}
3117 ++#endif
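To put rough numbers on the clamp above (assumed figures, for illustration only): with 46 bits of physical address and 4 KiB pages, MAX_PA/2 is 2^45 bytes, i.e. about 2^33 pages; the 3 extra swap-offset bits mentioned in the comment shift that up to roughly 2^36 pages, so a single swap device would be capped at about 256 TiB instead of the generic architectural limit.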
3118 +diff --git a/arch/x86/mm/kmmio.c b/arch/x86/mm/kmmio.c
3119 +index 7c8686709636..79eb55ce69a9 100644
3120 +--- a/arch/x86/mm/kmmio.c
3121 ++++ b/arch/x86/mm/kmmio.c
3122 +@@ -126,24 +126,29 @@ static struct kmmio_fault_page *get_kmmio_fault_page(unsigned long addr)
3123 +
3124 + static void clear_pmd_presence(pmd_t *pmd, bool clear, pmdval_t *old)
3125 + {
3126 ++ pmd_t new_pmd;
3127 + pmdval_t v = pmd_val(*pmd);
3128 + if (clear) {
3129 +- *old = v & _PAGE_PRESENT;
3130 +- v &= ~_PAGE_PRESENT;
3131 +- } else /* presume this has been called with clear==true previously */
3132 +- v |= *old;
3133 +- set_pmd(pmd, __pmd(v));
3134 ++ *old = v;
3135 ++ new_pmd = pmd_mknotpresent(*pmd);
3136 ++ } else {
3137 ++ /* Presume this has been called with clear==true previously */
3138 ++ new_pmd = __pmd(*old);
3139 ++ }
3140 ++ set_pmd(pmd, new_pmd);
3141 + }
3142 +
3143 + static void clear_pte_presence(pte_t *pte, bool clear, pteval_t *old)
3144 + {
3145 + pteval_t v = pte_val(*pte);
3146 + if (clear) {
3147 +- *old = v & _PAGE_PRESENT;
3148 +- v &= ~_PAGE_PRESENT;
3149 +- } else /* presume this has been called with clear==true previously */
3150 +- v |= *old;
3151 +- set_pte_atomic(pte, __pte(v));
3152 ++ *old = v;
3153 ++ /* Nothing should care about address */
3154 ++ pte_clear(&init_mm, 0, pte);
3155 ++ } else {
3156 ++ /* Presume this has been called with clear==true previously */
3157 ++ set_pte_atomic(pte, __pte(*old));
3158 ++ }
3159 + }
3160 +
3161 + static int clear_page_presence(struct kmmio_fault_page *f, bool clear)
3162 +diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
3163 +index 48c591251600..f40ab8185d94 100644
3164 +--- a/arch/x86/mm/mmap.c
3165 ++++ b/arch/x86/mm/mmap.c
3166 +@@ -240,3 +240,24 @@ int valid_mmap_phys_addr_range(unsigned long pfn, size_t count)
3167 +
3168 + return phys_addr_valid(addr + count - 1);
3169 + }
3170 ++
3171 ++/*
3172 ++ * Only allow root to set high MMIO mappings to PROT_NONE.
3173 ++ * This prevents an unprivileged user from setting them to PROT_NONE and
3174 ++ * inverting them, then pointing them at valid memory for L1TF speculation.
3175 ++ *
3176 ++ * Note: locked-down kernels may want to disable the root override.
3177 ++ */
3178 ++bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
3179 ++{
3180 ++ if (!boot_cpu_has_bug(X86_BUG_L1TF))
3181 ++ return true;
3182 ++ if (!__pte_needs_invert(pgprot_val(prot)))
3183 ++ return true;
3184 ++ /* If it's real memory always allow */
3185 ++ if (pfn_valid(pfn))
3186 ++ return true;
3187 ++ if (pfn > l1tf_pfn_limit() && !capable(CAP_SYS_ADMIN))
3188 ++ return false;
3189 ++ return true;
3190 ++}
3191 +diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
3192 +index 3bded76e8d5c..7bb6f65c79de 100644
3193 +--- a/arch/x86/mm/pageattr.c
3194 ++++ b/arch/x86/mm/pageattr.c
3195 +@@ -1014,8 +1014,8 @@ static long populate_pmd(struct cpa_data *cpa,
3196 +
3197 + pmd = pmd_offset(pud, start);
3198 +
3199 +- set_pmd(pmd, __pmd(cpa->pfn << PAGE_SHIFT | _PAGE_PSE |
3200 +- massage_pgprot(pmd_pgprot)));
3201 ++ set_pmd(pmd, pmd_mkhuge(pfn_pmd(cpa->pfn,
3202 ++ canon_pgprot(pmd_pgprot))));
3203 +
3204 + start += PMD_SIZE;
3205 + cpa->pfn += PMD_SIZE >> PAGE_SHIFT;
3206 +@@ -1087,8 +1087,8 @@ static int populate_pud(struct cpa_data *cpa, unsigned long start, p4d_t *p4d,
3207 + * Map everything starting from the Gb boundary, possibly with 1G pages
3208 + */
3209 + while (boot_cpu_has(X86_FEATURE_GBPAGES) && end - start >= PUD_SIZE) {
3210 +- set_pud(pud, __pud(cpa->pfn << PAGE_SHIFT | _PAGE_PSE |
3211 +- massage_pgprot(pud_pgprot)));
3212 ++ set_pud(pud, pud_mkhuge(pfn_pud(cpa->pfn,
3213 ++ canon_pgprot(pud_pgprot))));
3214 +
3215 + start += PUD_SIZE;
3216 + cpa->pfn += PUD_SIZE >> PAGE_SHIFT;
3217 +diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
3218 +index 4d418e705878..fb752d9a3ce9 100644
3219 +--- a/arch/x86/mm/pti.c
3220 ++++ b/arch/x86/mm/pti.c
3221 +@@ -45,6 +45,7 @@
3222 + #include <asm/pgalloc.h>
3223 + #include <asm/tlbflush.h>
3224 + #include <asm/desc.h>
3225 ++#include <asm/sections.h>
3226 +
3227 + #undef pr_fmt
3228 + #define pr_fmt(fmt) "Kernel/User page tables isolation: " fmt
3229 +diff --git a/arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c b/arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c
3230 +index 4f5fa65a1011..2acd6be13375 100644
3231 +--- a/arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c
3232 ++++ b/arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c
3233 +@@ -18,6 +18,7 @@
3234 + #include <asm/intel-mid.h>
3235 + #include <asm/intel_scu_ipc.h>
3236 + #include <asm/io_apic.h>
3237 ++#include <asm/hw_irq.h>
3238 +
3239 + #define TANGIER_EXT_TIMER0_MSI 12
3240 +
3241 +diff --git a/arch/x86/platform/uv/tlb_uv.c b/arch/x86/platform/uv/tlb_uv.c
3242 +index ca446da48fd2..3866b96a7ee7 100644
3243 +--- a/arch/x86/platform/uv/tlb_uv.c
3244 ++++ b/arch/x86/platform/uv/tlb_uv.c
3245 +@@ -1285,6 +1285,7 @@ void uv_bau_message_interrupt(struct pt_regs *regs)
3246 + struct msg_desc msgdesc;
3247 +
3248 + ack_APIC_irq();
3249 ++ kvm_set_cpu_l1tf_flush_l1d();
3250 + time_start = get_cycles();
3251 +
3252 + bcp = &per_cpu(bau_control, smp_processor_id());
3253 +diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
3254 +index 3b5318505c69..2eeddd814653 100644
3255 +--- a/arch/x86/xen/enlighten.c
3256 ++++ b/arch/x86/xen/enlighten.c
3257 +@@ -3,6 +3,7 @@
3258 + #endif
3259 + #include <linux/cpu.h>
3260 + #include <linux/kexec.h>
3261 ++#include <linux/slab.h>
3262 +
3263 + #include <xen/features.h>
3264 + #include <xen/page.h>
3265 +diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
3266 +index 30cc9c877ebb..eb9443d5bae1 100644
3267 +--- a/drivers/base/cpu.c
3268 ++++ b/drivers/base/cpu.c
3269 +@@ -540,16 +540,24 @@ ssize_t __weak cpu_show_spec_store_bypass(struct device *dev,
3270 + return sprintf(buf, "Not affected\n");
3271 + }
3272 +
3273 ++ssize_t __weak cpu_show_l1tf(struct device *dev,
3274 ++ struct device_attribute *attr, char *buf)
3275 ++{
3276 ++ return sprintf(buf, "Not affected\n");
3277 ++}
3278 ++
3279 + static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
3280 + static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
3281 + static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);
3282 + static DEVICE_ATTR(spec_store_bypass, 0444, cpu_show_spec_store_bypass, NULL);
3283 ++static DEVICE_ATTR(l1tf, 0444, cpu_show_l1tf, NULL);
3284 +
3285 + static struct attribute *cpu_root_vulnerabilities_attrs[] = {
3286 + &dev_attr_meltdown.attr,
3287 + &dev_attr_spectre_v1.attr,
3288 + &dev_attr_spectre_v2.attr,
3289 + &dev_attr_spec_store_bypass.attr,
3290 ++ &dev_attr_l1tf.attr,
3291 + NULL
3292 + };
3293 +
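User space can consume the new l1tf entry exactly like the existing vulnerability files. A minimal sketch, assuming the conventional /sys mount point (editorial illustration, not part of the patch):

/* Minimal sketch (not from the patch): print the l1tf vulnerability string
 * exposed by the cpu_show_l1tf() attribute added above. */
#include <stdio.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/sys/devices/system/cpu/vulnerabilities/l1tf", "r");

	if (!f) {
		perror("l1tf");
		return 1;
	}
	if (fgets(line, sizeof(line), f))
		fputs(line, stdout);    /* e.g. "Not affected" or "Mitigation: ..." */
	fclose(f);
	return 0;
}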
3294 +diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
3295 +index dc87797db500..b50b74053664 100644
3296 +--- a/drivers/gpu/drm/i915/i915_pmu.c
3297 ++++ b/drivers/gpu/drm/i915/i915_pmu.c
3298 +@@ -4,6 +4,7 @@
3299 + * Copyright © 2017-2018 Intel Corporation
3300 + */
3301 +
3302 ++#include <linux/irq.h>
3303 + #include "i915_pmu.h"
3304 + #include "intel_ringbuffer.h"
3305 + #include "i915_drv.h"
3306 +diff --git a/drivers/gpu/drm/i915/intel_lpe_audio.c b/drivers/gpu/drm/i915/intel_lpe_audio.c
3307 +index 6269750e2b54..b4941101f21a 100644
3308 +--- a/drivers/gpu/drm/i915/intel_lpe_audio.c
3309 ++++ b/drivers/gpu/drm/i915/intel_lpe_audio.c
3310 +@@ -62,6 +62,7 @@
3311 +
3312 + #include <linux/acpi.h>
3313 + #include <linux/device.h>
3314 ++#include <linux/irq.h>
3315 + #include <linux/pci.h>
3316 + #include <linux/pm_runtime.h>
3317 +
3318 +diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
3319 +index f6325f1a89e8..d4d4a55f09f8 100644
3320 +--- a/drivers/pci/controller/pci-hyperv.c
3321 ++++ b/drivers/pci/controller/pci-hyperv.c
3322 +@@ -45,6 +45,7 @@
3323 + #include <linux/irqdomain.h>
3324 + #include <asm/irqdomain.h>
3325 + #include <asm/apic.h>
3326 ++#include <linux/irq.h>
3327 + #include <linux/msi.h>
3328 + #include <linux/hyperv.h>
3329 + #include <linux/refcount.h>
3330 +diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
3331 +index f59639afaa39..26ca0276b503 100644
3332 +--- a/include/asm-generic/pgtable.h
3333 ++++ b/include/asm-generic/pgtable.h
3334 +@@ -1083,6 +1083,18 @@ int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
3335 + static inline void init_espfix_bsp(void) { }
3336 + #endif
3337 +
3338 ++#ifndef __HAVE_ARCH_PFN_MODIFY_ALLOWED
3339 ++static inline bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
3340 ++{
3341 ++ return true;
3342 ++}
3343 ++
3344 ++static inline bool arch_has_pfn_modify_check(void)
3345 ++{
3346 ++ return false;
3347 ++}
3348 ++#endif /* !_HAVE_ARCH_PFN_MODIFY_ALLOWED */
3349 ++
3350 + #endif /* !__ASSEMBLY__ */
3351 +
3352 + #ifndef io_remap_pfn_range
3353 +diff --git a/include/linux/cpu.h b/include/linux/cpu.h
3354 +index 3233fbe23594..45789a892c41 100644
3355 +--- a/include/linux/cpu.h
3356 ++++ b/include/linux/cpu.h
3357 +@@ -55,6 +55,8 @@ extern ssize_t cpu_show_spectre_v2(struct device *dev,
3358 + struct device_attribute *attr, char *buf);
3359 + extern ssize_t cpu_show_spec_store_bypass(struct device *dev,
3360 + struct device_attribute *attr, char *buf);
3361 ++extern ssize_t cpu_show_l1tf(struct device *dev,
3362 ++ struct device_attribute *attr, char *buf);
3363 +
3364 + extern __printf(4, 5)
3365 + struct device *cpu_device_create(struct device *parent, void *drvdata,
3366 +@@ -166,4 +168,23 @@ void cpuhp_report_idle_dead(void);
3367 + static inline void cpuhp_report_idle_dead(void) { }
3368 + #endif /* #ifdef CONFIG_HOTPLUG_CPU */
3369 +
3370 ++enum cpuhp_smt_control {
3371 ++ CPU_SMT_ENABLED,
3372 ++ CPU_SMT_DISABLED,
3373 ++ CPU_SMT_FORCE_DISABLED,
3374 ++ CPU_SMT_NOT_SUPPORTED,
3375 ++};
3376 ++
3377 ++#if defined(CONFIG_SMP) && defined(CONFIG_HOTPLUG_SMT)
3378 ++extern enum cpuhp_smt_control cpu_smt_control;
3379 ++extern void cpu_smt_disable(bool force);
3380 ++extern void cpu_smt_check_topology_early(void);
3381 ++extern void cpu_smt_check_topology(void);
3382 ++#else
3383 ++# define cpu_smt_control (CPU_SMT_ENABLED)
3384 ++static inline void cpu_smt_disable(bool force) { }
3385 ++static inline void cpu_smt_check_topology_early(void) { }
3386 ++static inline void cpu_smt_check_topology(void) { }
3387 ++#endif
3388 ++
3389 + #endif /* _LINUX_CPU_H_ */
3390 +diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
3391 +index 06bd7b096167..e06febf62978 100644
3392 +--- a/include/linux/swapfile.h
3393 ++++ b/include/linux/swapfile.h
3394 +@@ -10,5 +10,7 @@ extern spinlock_t swap_lock;
3395 + extern struct plist_head swap_active_head;
3396 + extern struct swap_info_struct *swap_info[];
3397 + extern int try_to_unuse(unsigned int, bool, unsigned long);
3398 ++extern unsigned long generic_max_swapfile_size(void);
3399 ++extern unsigned long max_swapfile_size(void);
3400 +
3401 + #endif /* _LINUX_SWAPFILE_H */
3402 +diff --git a/kernel/cpu.c b/kernel/cpu.c
3403 +index 2f8f338e77cf..f80afc674f02 100644
3404 +--- a/kernel/cpu.c
3405 ++++ b/kernel/cpu.c
3406 +@@ -60,6 +60,7 @@ struct cpuhp_cpu_state {
3407 + bool rollback;
3408 + bool single;
3409 + bool bringup;
3410 ++ bool booted_once;
3411 + struct hlist_node *node;
3412 + struct hlist_node *last;
3413 + enum cpuhp_state cb_state;
3414 +@@ -342,6 +343,85 @@ void cpu_hotplug_enable(void)
3415 + EXPORT_SYMBOL_GPL(cpu_hotplug_enable);
3416 + #endif /* CONFIG_HOTPLUG_CPU */
3417 +
3418 ++#ifdef CONFIG_HOTPLUG_SMT
3419 ++enum cpuhp_smt_control cpu_smt_control __read_mostly = CPU_SMT_ENABLED;
3420 ++EXPORT_SYMBOL_GPL(cpu_smt_control);
3421 ++
3422 ++static bool cpu_smt_available __read_mostly;
3423 ++
3424 ++void __init cpu_smt_disable(bool force)
3425 ++{
3426 ++ if (cpu_smt_control == CPU_SMT_FORCE_DISABLED ||
3427 ++ cpu_smt_control == CPU_SMT_NOT_SUPPORTED)
3428 ++ return;
3429 ++
3430 ++ if (force) {
3431 ++ pr_info("SMT: Force disabled\n");
3432 ++ cpu_smt_control = CPU_SMT_FORCE_DISABLED;
3433 ++ } else {
3434 ++ cpu_smt_control = CPU_SMT_DISABLED;
3435 ++ }
3436 ++}
3437 ++
3438 ++/*
3439 ++ * The decision whether SMT is supported can only be done after the full
3440 ++ * CPU identification. Called from architecture code before non boot CPUs
3441 ++ * are brought up.
3442 ++ */
3443 ++void __init cpu_smt_check_topology_early(void)
3444 ++{
3445 ++ if (!topology_smt_supported())
3446 ++ cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
3447 ++}
3448 ++
3449 ++/*
3450 ++ * If SMT was disabled by BIOS, detect it here, after the CPUs have been
3451 ++ * brought online. This ensures the smt/l1tf sysfs entries are consistent
3452 ++ * with reality. cpu_smt_available is set to true during the bringup of non
3453 ++ * boot CPUs when a SMT sibling is detected. Note, this may overwrite
3454 ++ * cpu_smt_control's previous setting.
3455 ++ */
3456 ++void __init cpu_smt_check_topology(void)
3457 ++{
3458 ++ if (!cpu_smt_available)
3459 ++ cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
3460 ++}
3461 ++
3462 ++static int __init smt_cmdline_disable(char *str)
3463 ++{
3464 ++ cpu_smt_disable(str && !strcmp(str, "force"));
3465 ++ return 0;
3466 ++}
3467 ++early_param("nosmt", smt_cmdline_disable);
3468 ++
3469 ++static inline bool cpu_smt_allowed(unsigned int cpu)
3470 ++{
3471 ++ if (topology_is_primary_thread(cpu))
3472 ++ return true;
3473 ++
3474 ++ /*
3475 ++ * If the CPU is not a 'primary' thread and the booted_once bit is
3476 ++ * set then the processor has SMT support. Store this information
3477 ++ * for the late check of SMT support in cpu_smt_check_topology().
3478 ++ */
3479 ++ if (per_cpu(cpuhp_state, cpu).booted_once)
3480 ++ cpu_smt_available = true;
3481 ++
3482 ++ if (cpu_smt_control == CPU_SMT_ENABLED)
3483 ++ return true;
3484 ++
3485 ++ /*
3486 ++ * On x86 it's required to boot all logical CPUs at least once so
3487 ++ * that the init code can get a chance to set CR4.MCE on each
3488 ++ * CPU. Otherwise, a broadcast MCE observing CR4.MCE=0b on any
3489 ++ * core will shut down the machine.
3490 ++ */
3491 ++ return !per_cpu(cpuhp_state, cpu).booted_once;
3492 ++}
3493 ++#else
3494 ++static inline bool cpu_smt_allowed(unsigned int cpu) { return true; }
3495 ++#endif
3496 ++
3497 + static inline enum cpuhp_state
3498 + cpuhp_set_state(struct cpuhp_cpu_state *st, enum cpuhp_state target)
3499 + {
3500 +@@ -422,6 +502,16 @@ static int bringup_wait_for_ap(unsigned int cpu)
3501 + stop_machine_unpark(cpu);
3502 + kthread_unpark(st->thread);
3503 +
3504 ++ /*
3505 ++ * SMT soft disabling on X86 requires bringing the CPU out of the
3506 ++ * BIOS 'wait for SIPI' state in order to set the CR4.MCE bit. The
3507 ++ * CPU marked itself as booted_once in cpu_notify_starting() so the
3508 ++ * cpu_smt_allowed() check will now return false if this is not the
3509 ++ * primary sibling.
3510 ++ */
3511 ++ if (!cpu_smt_allowed(cpu))
3512 ++ return -ECANCELED;
3513 ++
3514 + if (st->target <= CPUHP_AP_ONLINE_IDLE)
3515 + return 0;
3516 +
3517 +@@ -754,7 +844,6 @@ static int takedown_cpu(unsigned int cpu)
3518 +
3519 + /* Park the smpboot threads */
3520 + kthread_park(per_cpu_ptr(&cpuhp_state, cpu)->thread);
3521 +- smpboot_park_threads(cpu);
3522 +
3523 + /*
3524 + * Prevent irq alloc/free while the dying cpu reorganizes the
3525 +@@ -907,20 +996,19 @@ out:
3526 + return ret;
3527 + }
3528 +
3529 ++static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
3530 ++{
3531 ++ if (cpu_hotplug_disabled)
3532 ++ return -EBUSY;
3533 ++ return _cpu_down(cpu, 0, target);
3534 ++}
3535 ++
3536 + static int do_cpu_down(unsigned int cpu, enum cpuhp_state target)
3537 + {
3538 + int err;
3539 +
3540 + cpu_maps_update_begin();
3541 +-
3542 +- if (cpu_hotplug_disabled) {
3543 +- err = -EBUSY;
3544 +- goto out;
3545 +- }
3546 +-
3547 +- err = _cpu_down(cpu, 0, target);
3548 +-
3549 +-out:
3550 ++ err = cpu_down_maps_locked(cpu, target);
3551 + cpu_maps_update_done();
3552 + return err;
3553 + }
3554 +@@ -949,6 +1037,7 @@ void notify_cpu_starting(unsigned int cpu)
3555 + int ret;
3556 +
3557 + rcu_cpu_starting(cpu); /* Enables RCU usage on this CPU. */
3558 ++ st->booted_once = true;
3559 + while (st->state < target) {
3560 + st->state++;
3561 + ret = cpuhp_invoke_callback(cpu, st->state, true, NULL, NULL);
3562 +@@ -1058,6 +1147,10 @@ static int do_cpu_up(unsigned int cpu, enum cpuhp_state target)
3563 + err = -EBUSY;
3564 + goto out;
3565 + }
3566 ++ if (!cpu_smt_allowed(cpu)) {
3567 ++ err = -EPERM;
3568 ++ goto out;
3569 ++ }
3570 +
3571 + err = _cpu_up(cpu, 0, target);
3572 + out:
3573 +@@ -1332,7 +1425,7 @@ static struct cpuhp_step cpuhp_hp_states[] = {
3574 + [CPUHP_AP_SMPBOOT_THREADS] = {
3575 + .name = "smpboot/threads:online",
3576 + .startup.single = smpboot_unpark_threads,
3577 +- .teardown.single = NULL,
3578 ++ .teardown.single = smpboot_park_threads,
3579 + },
3580 + [CPUHP_AP_IRQ_AFFINITY_ONLINE] = {
3581 + .name = "irq/affinity:online",
3582 +@@ -1906,10 +1999,172 @@ static const struct attribute_group cpuhp_cpu_root_attr_group = {
3583 + NULL
3584 + };
3585 +
3586 ++#ifdef CONFIG_HOTPLUG_SMT
3587 ++
3588 ++static const char *smt_states[] = {
3589 ++ [CPU_SMT_ENABLED] = "on",
3590 ++ [CPU_SMT_DISABLED] = "off",
3591 ++ [CPU_SMT_FORCE_DISABLED] = "forceoff",
3592 ++ [CPU_SMT_NOT_SUPPORTED] = "notsupported",
3593 ++};
3594 ++
3595 ++static ssize_t
3596 ++show_smt_control(struct device *dev, struct device_attribute *attr, char *buf)
3597 ++{
3598 ++ return snprintf(buf, PAGE_SIZE - 2, "%s\n", smt_states[cpu_smt_control]);
3599 ++}
3600 ++
3601 ++static void cpuhp_offline_cpu_device(unsigned int cpu)
3602 ++{
3603 ++ struct device *dev = get_cpu_device(cpu);
3604 ++
3605 ++ dev->offline = true;
3606 ++ /* Tell user space about the state change */
3607 ++ kobject_uevent(&dev->kobj, KOBJ_OFFLINE);
3608 ++}
3609 ++
3610 ++static void cpuhp_online_cpu_device(unsigned int cpu)
3611 ++{
3612 ++ struct device *dev = get_cpu_device(cpu);
3613 ++
3614 ++ dev->offline = false;
3615 ++ /* Tell user space about the state change */
3616 ++ kobject_uevent(&dev->kobj, KOBJ_ONLINE);
3617 ++}
3618 ++
3619 ++static int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
3620 ++{
3621 ++ int cpu, ret = 0;
3622 ++
3623 ++ cpu_maps_update_begin();
3624 ++ for_each_online_cpu(cpu) {
3625 ++ if (topology_is_primary_thread(cpu))
3626 ++ continue;
3627 ++ ret = cpu_down_maps_locked(cpu, CPUHP_OFFLINE);
3628 ++ if (ret)
3629 ++ break;
3630 ++ /*
3631 ++ * As this needs to hold the cpu maps lock it's impossible
3632 ++ * to call device_offline() because that ends up calling
3633 ++ * cpu_down() which takes cpu maps lock. cpu maps lock
3634 ++ * needs to be held as this might race against in kernel
3635 ++ * abusers of the hotplug machinery (thermal management).
3636 ++ *
3637 ++ * So nothing would update device:offline state. That would
3638 ++ * leave the sysfs entry stale and prevent onlining after
3639 ++ * smt control has been changed to 'off' again. This is
3640 ++ * called under the sysfs hotplug lock, so it is properly
3641 ++ * serialized against the regular offline usage.
3642 ++ */
3643 ++ cpuhp_offline_cpu_device(cpu);
3644 ++ }
3645 ++ if (!ret)
3646 ++ cpu_smt_control = ctrlval;
3647 ++ cpu_maps_update_done();
3648 ++ return ret;
3649 ++}
3650 ++
3651 ++static int cpuhp_smt_enable(void)
3652 ++{
3653 ++ int cpu, ret = 0;
3654 ++
3655 ++ cpu_maps_update_begin();
3656 ++ cpu_smt_control = CPU_SMT_ENABLED;
3657 ++ for_each_present_cpu(cpu) {
3658 ++ /* Skip online CPUs and CPUs on offline nodes */
3659 ++ if (cpu_online(cpu) || !node_online(cpu_to_node(cpu)))
3660 ++ continue;
3661 ++ ret = _cpu_up(cpu, 0, CPUHP_ONLINE);
3662 ++ if (ret)
3663 ++ break;
3664 ++ /* See comment in cpuhp_smt_disable() */
3665 ++ cpuhp_online_cpu_device(cpu);
3666 ++ }
3667 ++ cpu_maps_update_done();
3668 ++ return ret;
3669 ++}
3670 ++
3671 ++static ssize_t
3672 ++store_smt_control(struct device *dev, struct device_attribute *attr,
3673 ++ const char *buf, size_t count)
3674 ++{
3675 ++ int ctrlval, ret;
3676 ++
3677 ++ if (sysfs_streq(buf, "on"))
3678 ++ ctrlval = CPU_SMT_ENABLED;
3679 ++ else if (sysfs_streq(buf, "off"))
3680 ++ ctrlval = CPU_SMT_DISABLED;
3681 ++ else if (sysfs_streq(buf, "forceoff"))
3682 ++ ctrlval = CPU_SMT_FORCE_DISABLED;
3683 ++ else
3684 ++ return -EINVAL;
3685 ++
3686 ++ if (cpu_smt_control == CPU_SMT_FORCE_DISABLED)
3687 ++ return -EPERM;
3688 ++
3689 ++ if (cpu_smt_control == CPU_SMT_NOT_SUPPORTED)
3690 ++ return -ENODEV;
3691 ++
3692 ++ ret = lock_device_hotplug_sysfs();
3693 ++ if (ret)
3694 ++ return ret;
3695 ++
3696 ++ if (ctrlval != cpu_smt_control) {
3697 ++ switch (ctrlval) {
3698 ++ case CPU_SMT_ENABLED:
3699 ++ ret = cpuhp_smt_enable();
3700 ++ break;
3701 ++ case CPU_SMT_DISABLED:
3702 ++ case CPU_SMT_FORCE_DISABLED:
3703 ++ ret = cpuhp_smt_disable(ctrlval);
3704 ++ break;
3705 ++ }
3706 ++ }
3707 ++
3708 ++ unlock_device_hotplug();
3709 ++ return ret ? ret : count;
3710 ++}
3711 ++static DEVICE_ATTR(control, 0644, show_smt_control, store_smt_control);
3712 ++
3713 ++static ssize_t
3714 ++show_smt_active(struct device *dev, struct device_attribute *attr, char *buf)
3715 ++{
3716 ++ bool active = topology_max_smt_threads() > 1;
3717 ++
3718 ++ return snprintf(buf, PAGE_SIZE - 2, "%d\n", active);
3719 ++}
3720 ++static DEVICE_ATTR(active, 0444, show_smt_active, NULL);
3721 ++
3722 ++static struct attribute *cpuhp_smt_attrs[] = {
3723 ++ &dev_attr_control.attr,
3724 ++ &dev_attr_active.attr,
3725 ++ NULL
3726 ++};
3727 ++
3728 ++static const struct attribute_group cpuhp_smt_attr_group = {
3729 ++ .attrs = cpuhp_smt_attrs,
3730 ++ .name = "smt",
3731 ++ NULL
3732 ++};
3733 ++
3734 ++static int __init cpu_smt_state_init(void)
3735 ++{
3736 ++ return sysfs_create_group(&cpu_subsys.dev_root->kobj,
3737 ++ &cpuhp_smt_attr_group);
3738 ++}
3739 ++
3740 ++#else
3741 ++static inline int cpu_smt_state_init(void) { return 0; }
3742 ++#endif
3743 ++
3744 + static int __init cpuhp_sysfs_init(void)
3745 + {
3746 + int cpu, ret;
3747 +
3748 ++ ret = cpu_smt_state_init();
3749 ++ if (ret)
3750 ++ return ret;
3751 ++
3752 + ret = sysfs_create_group(&cpu_subsys.dev_root->kobj,
3753 + &cpuhp_cpu_root_attr_group);
3754 + if (ret)
3755 +@@ -2012,5 +2267,8 @@ void __init boot_cpu_init(void)
3756 + */
3757 + void __init boot_cpu_hotplug_init(void)
3758 + {
3759 +- per_cpu_ptr(&cpuhp_state, smp_processor_id())->state = CPUHP_ONLINE;
3760 ++#ifdef CONFIG_SMP
3761 ++ this_cpu_write(cpuhp_state.booted_once, true);
3762 ++#endif
3763 ++ this_cpu_write(cpuhp_state.state, CPUHP_ONLINE);
3764 + }
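The smt attribute group registered by cpu_smt_state_init() above ends up under the CPU subsystem root, so the control and active files can be read (and, with root, control can be written) from user space. A minimal read-only sketch, assuming sysfs is mounted at /sys (editorial illustration, not part of the patch):

/* Minimal sketch (not from the patch): query the SMT control interface
 * created above.  Writing "on"/"off"/"forceoff" to the control file is
 * also possible (mode 0644), but requires root. */
#include <stdio.h>

static void show(const char *path)
{
	char buf[64];
	FILE *f = fopen(path, "r");

	if (f && fgets(buf, sizeof(buf), f))
		printf("%s: %s", path, buf);
	if (f)
		fclose(f);
}

int main(void)
{
	show("/sys/devices/system/cpu/smt/control"); /* on/off/forceoff/notsupported */
	show("/sys/devices/system/cpu/smt/active");  /* 1 when SMT siblings are online */
	return 0;
}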
3765 +diff --git a/kernel/sched/core.c b/kernel/sched/core.c
3766 +index fe365c9a08e9..5ba96d9ddbde 100644
3767 +--- a/kernel/sched/core.c
3768 ++++ b/kernel/sched/core.c
3769 +@@ -5774,6 +5774,18 @@ int sched_cpu_activate(unsigned int cpu)
3770 + struct rq *rq = cpu_rq(cpu);
3771 + struct rq_flags rf;
3772 +
3773 ++#ifdef CONFIG_SCHED_SMT
3774 ++ /*
3775 ++ * The sched_smt_present static key needs to be evaluated on every
3776 ++ * hotplug event because at boot time SMT might be disabled when
3777 ++ * the number of booted CPUs is limited.
3778 ++ *
3779 ++ * If then later a sibling gets hotplugged, then the key would stay
3780 ++ * off and SMT scheduling would never be functional.
3781 ++ */
3782 ++ if (cpumask_weight(cpu_smt_mask(cpu)) > 1)
3783 ++ static_branch_enable_cpuslocked(&sched_smt_present);
3784 ++#endif
3785 + set_cpu_active(cpu, true);
3786 +
3787 + if (sched_smp_initialized) {
3788 +@@ -5871,22 +5883,6 @@ int sched_cpu_dying(unsigned int cpu)
3789 + }
3790 + #endif
3791 +
3792 +-#ifdef CONFIG_SCHED_SMT
3793 +-DEFINE_STATIC_KEY_FALSE(sched_smt_present);
3794 +-
3795 +-static void sched_init_smt(void)
3796 +-{
3797 +- /*
3798 +- * We've enumerated all CPUs and will assume that if any CPU
3799 +- * has SMT siblings, CPU0 will too.
3800 +- */
3801 +- if (cpumask_weight(cpu_smt_mask(0)) > 1)
3802 +- static_branch_enable(&sched_smt_present);
3803 +-}
3804 +-#else
3805 +-static inline void sched_init_smt(void) { }
3806 +-#endif
3807 +-
3808 + void __init sched_init_smp(void)
3809 + {
3810 + sched_init_numa();
3811 +@@ -5908,8 +5904,6 @@ void __init sched_init_smp(void)
3812 + init_sched_rt_class();
3813 + init_sched_dl_class();
3814 +
3815 +- sched_init_smt();
3816 +-
3817 + sched_smp_initialized = true;
3818 + }
3819 +
3820 +diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
3821 +index 2f0a0be4d344..9c219f7b0970 100644
3822 +--- a/kernel/sched/fair.c
3823 ++++ b/kernel/sched/fair.c
3824 +@@ -6237,6 +6237,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
3825 + }
3826 +
3827 + #ifdef CONFIG_SCHED_SMT
3828 ++DEFINE_STATIC_KEY_FALSE(sched_smt_present);
3829 +
3830 + static inline void set_idle_cores(int cpu, int val)
3831 + {
3832 +diff --git a/kernel/smp.c b/kernel/smp.c
3833 +index 084c8b3a2681..d86eec5f51c1 100644
3834 +--- a/kernel/smp.c
3835 ++++ b/kernel/smp.c
3836 +@@ -584,6 +584,8 @@ void __init smp_init(void)
3837 + num_nodes, (num_nodes > 1 ? "s" : ""),
3838 + num_cpus, (num_cpus > 1 ? "s" : ""));
3839 +
3840 ++ /* Final decision about SMT support */
3841 ++ cpu_smt_check_topology();
3842 + /* Any cleanup work */
3843 + smp_cpus_done(setup_max_cpus);
3844 + }
3845 +diff --git a/mm/memory.c b/mm/memory.c
3846 +index c5e87a3a82ba..0e356dd923c2 100644
3847 +--- a/mm/memory.c
3848 ++++ b/mm/memory.c
3849 +@@ -1884,6 +1884,9 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
3850 + if (addr < vma->vm_start || addr >= vma->vm_end)
3851 + return -EFAULT;
3852 +
3853 ++ if (!pfn_modify_allowed(pfn, pgprot))
3854 ++ return -EACCES;
3855 ++
3856 + track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
3857 +
3858 + ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
3859 +@@ -1919,6 +1922,9 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
3860 +
3861 + track_pfn_insert(vma, &pgprot, pfn);
3862 +
3863 ++ if (!pfn_modify_allowed(pfn_t_to_pfn(pfn), pgprot))
3864 ++ return -EACCES;
3865 ++
3866 + /*
3867 + * If we don't have pte special, then we have to use the pfn_valid()
3868 + * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
3869 +@@ -1980,6 +1986,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
3870 + {
3871 + pte_t *pte;
3872 + spinlock_t *ptl;
3873 ++ int err = 0;
3874 +
3875 + pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
3876 + if (!pte)
3877 +@@ -1987,12 +1994,16 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
3878 + arch_enter_lazy_mmu_mode();
3879 + do {
3880 + BUG_ON(!pte_none(*pte));
3881 ++ if (!pfn_modify_allowed(pfn, prot)) {
3882 ++ err = -EACCES;
3883 ++ break;
3884 ++ }
3885 + set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
3886 + pfn++;
3887 + } while (pte++, addr += PAGE_SIZE, addr != end);
3888 + arch_leave_lazy_mmu_mode();
3889 + pte_unmap_unlock(pte - 1, ptl);
3890 +- return 0;
3891 ++ return err;
3892 + }
3893 +
3894 + static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
3895 +@@ -2001,6 +2012,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
3896 + {
3897 + pmd_t *pmd;
3898 + unsigned long next;
3899 ++ int err;
3900 +
3901 + pfn -= addr >> PAGE_SHIFT;
3902 + pmd = pmd_alloc(mm, pud, addr);
3903 +@@ -2009,9 +2021,10 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
3904 + VM_BUG_ON(pmd_trans_huge(*pmd));
3905 + do {
3906 + next = pmd_addr_end(addr, end);
3907 +- if (remap_pte_range(mm, pmd, addr, next,
3908 +- pfn + (addr >> PAGE_SHIFT), prot))
3909 +- return -ENOMEM;
3910 ++ err = remap_pte_range(mm, pmd, addr, next,
3911 ++ pfn + (addr >> PAGE_SHIFT), prot);
3912 ++ if (err)
3913 ++ return err;
3914 + } while (pmd++, addr = next, addr != end);
3915 + return 0;
3916 + }
3917 +@@ -2022,6 +2035,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
3918 + {
3919 + pud_t *pud;
3920 + unsigned long next;
3921 ++ int err;
3922 +
3923 + pfn -= addr >> PAGE_SHIFT;
3924 + pud = pud_alloc(mm, p4d, addr);
3925 +@@ -2029,9 +2043,10 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
3926 + return -ENOMEM;
3927 + do {
3928 + next = pud_addr_end(addr, end);
3929 +- if (remap_pmd_range(mm, pud, addr, next,
3930 +- pfn + (addr >> PAGE_SHIFT), prot))
3931 +- return -ENOMEM;
3932 ++ err = remap_pmd_range(mm, pud, addr, next,
3933 ++ pfn + (addr >> PAGE_SHIFT), prot);
3934 ++ if (err)
3935 ++ return err;
3936 + } while (pud++, addr = next, addr != end);
3937 + return 0;
3938 + }
3939 +@@ -2042,6 +2057,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
3940 + {
3941 + p4d_t *p4d;
3942 + unsigned long next;
3943 ++ int err;
3944 +
3945 + pfn -= addr >> PAGE_SHIFT;
3946 + p4d = p4d_alloc(mm, pgd, addr);
3947 +@@ -2049,9 +2065,10 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
3948 + return -ENOMEM;
3949 + do {
3950 + next = p4d_addr_end(addr, end);
3951 +- if (remap_pud_range(mm, p4d, addr, next,
3952 +- pfn + (addr >> PAGE_SHIFT), prot))
3953 +- return -ENOMEM;
3954 ++ err = remap_pud_range(mm, p4d, addr, next,
3955 ++ pfn + (addr >> PAGE_SHIFT), prot);
3956 ++ if (err)
3957 ++ return err;
3958 + } while (p4d++, addr = next, addr != end);
3959 + return 0;
3960 + }
3961 +diff --git a/mm/mprotect.c b/mm/mprotect.c
3962 +index 625608bc8962..6d331620b9e5 100644
3963 +--- a/mm/mprotect.c
3964 ++++ b/mm/mprotect.c
3965 +@@ -306,6 +306,42 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
3966 + return pages;
3967 + }
3968 +
3969 ++static int prot_none_pte_entry(pte_t *pte, unsigned long addr,
3970 ++ unsigned long next, struct mm_walk *walk)
3971 ++{
3972 ++ return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
3973 ++ 0 : -EACCES;
3974 ++}
3975 ++
3976 ++static int prot_none_hugetlb_entry(pte_t *pte, unsigned long hmask,
3977 ++ unsigned long addr, unsigned long next,
3978 ++ struct mm_walk *walk)
3979 ++{
3980 ++ return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
3981 ++ 0 : -EACCES;
3982 ++}
3983 ++
3984 ++static int prot_none_test(unsigned long addr, unsigned long next,
3985 ++ struct mm_walk *walk)
3986 ++{
3987 ++ return 0;
3988 ++}
3989 ++
3990 ++static int prot_none_walk(struct vm_area_struct *vma, unsigned long start,
3991 ++ unsigned long end, unsigned long newflags)
3992 ++{
3993 ++ pgprot_t new_pgprot = vm_get_page_prot(newflags);
3994 ++ struct mm_walk prot_none_walk = {
3995 ++ .pte_entry = prot_none_pte_entry,
3996 ++ .hugetlb_entry = prot_none_hugetlb_entry,
3997 ++ .test_walk = prot_none_test,
3998 ++ .mm = current->mm,
3999 ++ .private = &new_pgprot,
4000 ++ };
4001 ++
4002 ++ return walk_page_range(start, end, &prot_none_walk);
4003 ++}
4004 ++
4005 + int
4006 + mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
4007 + unsigned long start, unsigned long end, unsigned long newflags)
4008 +@@ -323,6 +359,19 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
4009 + return 0;
4010 + }
4011 +
4012 ++ /*
4013 ++ * Do PROT_NONE PFN permission checks here when we can still
4014 ++ * bail out without undoing a lot of state. This is a rather
4015 ++ * uncommon case, so doesn't need to be very optimized.
4016 ++ */
4017 ++ if (arch_has_pfn_modify_check() &&
4018 ++ (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
4019 ++ (newflags & (VM_READ|VM_WRITE|VM_EXEC)) == 0) {
4020 ++ error = prot_none_walk(vma, start, end, newflags);
4021 ++ if (error)
4022 ++ return error;
4023 ++ }
4024 ++
4025 + /*
4026 + * If we make a private mapping writable we increase our commit;
4027 + * but (without finer accounting) cannot reduce our commit if we
4028 +diff --git a/mm/swapfile.c b/mm/swapfile.c
4029 +index 2cc2972eedaf..18185ae4f223 100644
4030 +--- a/mm/swapfile.c
4031 ++++ b/mm/swapfile.c
4032 +@@ -2909,6 +2909,35 @@ static int claim_swapfile(struct swap_info_struct *p, struct inode *inode)
4033 + return 0;
4034 + }
4035 +
4036 ++
4037 ++/*
4038 ++ * Find out how many pages are allowed for a single swap device. There
4039 ++ * are two limiting factors:
4040 ++ * 1) the number of bits for the swap offset in the swp_entry_t type, and
4041 ++ * 2) the number of bits in the swap pte, as defined by the different
4042 ++ * architectures.
4043 ++ *
4044 ++ * In order to find the largest possible bit mask, a swap entry with
4045 ++ * swap type 0 and swap offset ~0UL is created, encoded to a swap pte,
4046 ++ * decoded to a swp_entry_t again, and finally the swap offset is
4047 ++ * extracted.
4048 ++ *
4049 ++ * This will mask all the bits from the initial ~0UL mask that can't
4050 ++ * be encoded in either the swp_entry_t or the architecture definition
4051 ++ * of a swap pte.
4052 ++ */
4053 ++unsigned long generic_max_swapfile_size(void)
4054 ++{
4055 ++ return swp_offset(pte_to_swp_entry(
4056 ++ swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
4057 ++}
4058 ++
4059 ++/* Can be overridden by an architecture for additional checks. */
4060 ++__weak unsigned long max_swapfile_size(void)
4061 ++{
4062 ++ return generic_max_swapfile_size();
4063 ++}
4064 ++
4065 + static unsigned long read_swap_header(struct swap_info_struct *p,
4066 + union swap_header *swap_header,
4067 + struct inode *inode)
4068 +@@ -2944,22 +2973,7 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
4069 + p->cluster_next = 1;
4070 + p->cluster_nr = 0;
4071 +
4072 +- /*
4073 +- * Find out how many pages are allowed for a single swap
4074 +- * device. There are two limiting factors: 1) the number
4075 +- * of bits for the swap offset in the swp_entry_t type, and
4076 +- * 2) the number of bits in the swap pte as defined by the
4077 +- * different architectures. In order to find the
4078 +- * largest possible bit mask, a swap entry with swap type 0
4079 +- * and swap offset ~0UL is created, encoded to a swap pte,
4080 +- * decoded to a swp_entry_t again, and finally the swap
4081 +- * offset is extracted. This will mask all the bits from
4082 +- * the initial ~0UL mask that can't be encoded in either
4083 +- * the swp_entry_t or the architecture definition of a
4084 +- * swap pte.
4085 +- */
4086 +- maxpages = swp_offset(pte_to_swp_entry(
4087 +- swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
4088 ++ maxpages = max_swapfile_size();
4089 + last_page = swap_header->info.last_page;
4090 + if (!last_page) {
4091 + pr_warn("Empty swap-file\n");
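The round-trip trick in generic_max_swapfile_size() above can be modelled in a few lines. The sketch below is a toy, not kernel code: the 50-bit offset field is an assumption chosen purely to make the masking visible.

/* Toy model (not from the patch) of the encode/decode round trip used by
 * generic_max_swapfile_size().  The 50-bit offset field is assumed for
 * illustration; real swap ptes are architecture specific. */
#include <stdio.h>

#define TOY_SWP_OFFSET_BITS 50

/* "Encode to a swap pte and decode again": only the offset bits survive. */
static unsigned long long toy_roundtrip(unsigned long long offset)
{
	return offset & ((1ULL << TOY_SWP_OFFSET_BITS) - 1);
}

int main(void)
{
	/* Push ~0 through the round trip, then add one page. */
	unsigned long long max_pages = toy_roundtrip(~0ULL) + 1;

	printf("toy max swapfile size: %llu pages (2^%d)\n",
	       max_pages, TOY_SWP_OFFSET_BITS);
	return 0;
}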
4092 +diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h
4093 +index 5701f5cecd31..64aaa3f5f36c 100644
4094 +--- a/tools/arch/x86/include/asm/cpufeatures.h
4095 ++++ b/tools/arch/x86/include/asm/cpufeatures.h
4096 +@@ -219,6 +219,7 @@
4097 + #define X86_FEATURE_IBPB ( 7*32+26) /* Indirect Branch Prediction Barrier */
4098 + #define X86_FEATURE_STIBP ( 7*32+27) /* Single Thread Indirect Branch Predictors */
4099 + #define X86_FEATURE_ZEN ( 7*32+28) /* "" CPU is AMD family 0x17 (Zen) */
4100 ++#define X86_FEATURE_L1TF_PTEINV ( 7*32+29) /* "" L1TF workaround PTE inversion */
4101 +
4102 + /* Virtualization flags: Linux defined, word 8 */
4103 + #define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */
4104 +@@ -341,6 +342,7 @@
4105 + #define X86_FEATURE_PCONFIG (18*32+18) /* Intel PCONFIG */
4106 + #define X86_FEATURE_SPEC_CTRL (18*32+26) /* "" Speculation Control (IBRS + IBPB) */
4107 + #define X86_FEATURE_INTEL_STIBP (18*32+27) /* "" Single Thread Indirect Branch Predictors */
4108 ++#define X86_FEATURE_FLUSH_L1D (18*32+28) /* Flush L1D cache */
4109 + #define X86_FEATURE_ARCH_CAPABILITIES (18*32+29) /* IA32_ARCH_CAPABILITIES MSR (Intel) */
4110 + #define X86_FEATURE_SPEC_CTRL_SSBD (18*32+31) /* "" Speculative Store Bypass Disable */
4111 +
4112 +@@ -373,5 +375,6 @@
4113 + #define X86_BUG_SPECTRE_V1 X86_BUG(15) /* CPU is affected by Spectre variant 1 attack with conditional branches */
4114 + #define X86_BUG_SPECTRE_V2 X86_BUG(16) /* CPU is affected by Spectre variant 2 attack with indirect branches */
4115 + #define X86_BUG_SPEC_STORE_BYPASS X86_BUG(17) /* CPU is affected by speculative store bypass attack */
4116 ++#define X86_BUG_L1TF X86_BUG(18) /* CPU is affected by L1 Terminal Fault */
4117 +
4118 + #endif /* _ASM_X86_CPUFEATURES_H */