Gentoo Archives: gentoo-commits

From: Mike Pagano <mpagano@g.o>
To: gentoo-commits@l.g.o
Subject: [gentoo-commits] proj/linux-patches:4.14 commit in: /
Date: Wed, 14 Nov 2018 14:00:59
Message-Id: 1542204040.c9b6e13d4252980748af407d3541b2b2dea60567.mpagano@gentoo
1 commit: c9b6e13d4252980748af407d3541b2b2dea60567
2 Author: Mike Pagano <mpagano@gentoo.org>
3 AuthorDate: Wed Aug 15 16:48:03 2018 +0000
4 Commit: Mike Pagano <mpagano@gentoo.org>
5 CommitDate: Wed Nov 14 14:00:40 2018 +0000
6 URL: https://gitweb.gentoo.org/proj/linux-patches.git/commit/?id=c9b6e13d
7
8 Linux patch 4.14.63
9
10 Signed-off-by: Mike Pagano <mpagano@gentoo.org>
11
12 0000_README | 4 +
13 1062_linux-4.14.63.patch | 5609 ++++++++++++++++++++++++++++++++++++++++++++++
14 2 files changed, 5613 insertions(+)
15
16 diff --git a/0000_README b/0000_README
17 index b530931..4c5f97e 100644
18 --- a/0000_README
19 +++ b/0000_README
20 @@ -291,6 +291,10 @@ Patch: 1061_linux-4.14.62.patch
21 From: http://www.kernel.org
22 Desc: Linux 4.14.62
23
24 +Patch: 1062_linux-4.14.63.patch
25 +From: http://www.kernel.org
26 +Desc: Linux 4.14.63
27 +
28 Patch: 1500_XATTR_USER_PREFIX.patch
29 From: https://bugs.gentoo.org/show_bug.cgi?id=470644
30 Desc: Support for namespace user.pax.* on tmpfs.
31
32 diff --git a/1062_linux-4.14.63.patch b/1062_linux-4.14.63.patch
33 new file mode 100644
34 index 0000000..cff73c5
35 --- /dev/null
36 +++ b/1062_linux-4.14.63.patch
37 @@ -0,0 +1,5609 @@
38 +diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
39 +index 8355e79350b7..6cae60929cb6 100644
40 +--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
41 ++++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
42 +@@ -379,6 +379,7 @@ What: /sys/devices/system/cpu/vulnerabilities
43 + /sys/devices/system/cpu/vulnerabilities/spectre_v1
44 + /sys/devices/system/cpu/vulnerabilities/spectre_v2
45 + /sys/devices/system/cpu/vulnerabilities/spec_store_bypass
46 ++ /sys/devices/system/cpu/vulnerabilities/l1tf
47 + Date: January 2018
48 + Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
49 + Description: Information about CPU vulnerabilities
50 +@@ -390,3 +391,26 @@ Description: Information about CPU vulnerabilities
51 + "Not affected" CPU is not affected by the vulnerability
52 + "Vulnerable" CPU is affected and no mitigation in effect
53 + "Mitigation: $M" CPU is affected and mitigation $M is in effect
54 ++
55 ++ Details about the l1tf file can be found in
56 ++ Documentation/admin-guide/l1tf.rst
57 ++
58 ++What: /sys/devices/system/cpu/smt
59 ++ /sys/devices/system/cpu/smt/active
60 ++ /sys/devices/system/cpu/smt/control
61 ++Date: June 2018
62 ++Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
63 ++Description: Control Symmetric Multi Threading (SMT)
64 ++
65 ++ active: Tells whether SMT is active (enabled and siblings online)
66 ++
67 ++ control: Read/write interface to control SMT. Possible
68 ++ values:
69 ++
70 ++ "on" SMT is enabled
71 ++ "off" SMT is disabled
72 ++ "forceoff" SMT is force disabled. Cannot be changed.
73 ++ "notsupported" SMT is not supported by the CPU
74 ++
75 ++ If control status is "forceoff" or "notsupported" writes
76 ++ are rejected.
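The ABI entries above only name the new files. As a usage illustration (an editor's sketch in plain C, not part of this patch, assuming a kernel that actually exposes these files), the reported status can be read back like this:

    #include <stdio.h>
    #include <string.h>

    /* Read one line from a sysfs file; returns 0 on success. */
    static int read_sysfs(const char *path, char *buf, size_t len)
    {
            FILE *f = fopen(path, "r");

            if (!f)
                    return -1;
            if (!fgets(buf, (int)len, f)) {
                    fclose(f);
                    return -1;
            }
            buf[strcspn(buf, "\n")] = '\0';
            fclose(f);
            return 0;
    }

    int main(void)
    {
            char buf[256];

            if (!read_sysfs("/sys/devices/system/cpu/vulnerabilities/l1tf", buf, sizeof(buf)))
                    printf("l1tf:        %s\n", buf);  /* e.g. "Mitigation: PTE Inversion" */
            if (!read_sysfs("/sys/devices/system/cpu/smt/control", buf, sizeof(buf)))
                    printf("smt control: %s\n", buf);  /* on, off, forceoff or notsupported */
            if (!read_sysfs("/sys/devices/system/cpu/smt/active", buf, sizeof(buf)))
                    printf("smt active:  %s\n", buf);  /* 1 if sibling threads are online */
            return 0;
    }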
77 +diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
78 +index 5bb9161dbe6a..78f8f00c369f 100644
79 +--- a/Documentation/admin-guide/index.rst
80 ++++ b/Documentation/admin-guide/index.rst
81 +@@ -17,6 +17,15 @@ etc.
82 + kernel-parameters
83 + devices
84 +
85 ++This section describes CPU vulnerabilities and provides an overview of the
86 ++possible mitigations along with guidance for selecting mitigations if they
87 ++are configurable at compile, boot or run time.
88 ++
89 ++.. toctree::
90 ++ :maxdepth: 1
91 ++
92 ++ l1tf
93 ++
94 + Here is a set of documents aimed at users who are trying to track down
95 + problems and bugs in particular.
96 +
97 +diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
98 +index d6d7669e667f..9841bad6f271 100644
99 +--- a/Documentation/admin-guide/kernel-parameters.txt
100 ++++ b/Documentation/admin-guide/kernel-parameters.txt
101 +@@ -1888,10 +1888,84 @@
102 + (virtualized real and unpaged mode) on capable
103 + Intel chips. Default is 1 (enabled)
104 +
105 ++ kvm-intel.vmentry_l1d_flush=[KVM,Intel] Mitigation for L1 Terminal Fault
106 ++ CVE-2018-3620.
107 ++
108 ++ Valid arguments: never, cond, always
109 ++
110 ++ always: L1D cache flush on every VMENTER.
111 ++ cond: Flush L1D on VMENTER only when the code between
112 ++ VMEXIT and VMENTER can leak host memory.
113 ++ never: Disables the mitigation
114 ++
115 ++ Default is cond (do L1 cache flush in specific instances)
116 ++
117 + kvm-intel.vpid= [KVM,Intel] Disable Virtual Processor Identification
118 + feature (tagged TLBs) on capable Intel chips.
119 + Default is 1 (enabled)
120 +
121 ++ l1tf= [X86] Control mitigation of the L1TF vulnerability on
122 ++ affected CPUs
123 ++
124 ++ The kernel PTE inversion protection is unconditionally
125 ++ enabled and cannot be disabled.
126 ++
127 ++ full
128 ++ Provides all available mitigations for the
129 ++ L1TF vulnerability. Disables SMT and
130 ++ enables all mitigations in the
131 ++ hypervisors, i.e. unconditional L1D flush.
132 ++
133 ++ SMT control and L1D flush control via the
134 ++ sysfs interface is still possible after
135 ++ boot. Hypervisors will issue a warning
136 ++ when the first VM is started in a
137 ++ potentially insecure configuration,
138 ++ i.e. SMT enabled or L1D flush disabled.
139 ++
140 ++ full,force
141 ++ Same as 'full', but disables SMT and L1D
142 ++ flush runtime control. Implies the
143 ++ 'nosmt=force' command line option.
144 ++ (i.e. sysfs control of SMT is disabled.)
145 ++
146 ++ flush
147 ++ Leaves SMT enabled and enables the default
148 ++ hypervisor mitigation, i.e. conditional
149 ++ L1D flush.
150 ++
151 ++ SMT control and L1D flush control via the
152 ++ sysfs interface is still possible after
153 ++ boot. Hypervisors will issue a warning
154 ++ when the first VM is started in a
155 ++ potentially insecure configuration,
156 ++ i.e. SMT enabled or L1D flush disabled.
157 ++
158 ++ flush,nosmt
159 ++
160 ++ Disables SMT and enables the default
161 ++ hypervisor mitigation.
162 ++
163 ++ SMT control and L1D flush control via the
164 ++ sysfs interface is still possible after
165 ++ boot. Hypervisors will issue a warning
166 ++ when the first VM is started in a
167 ++ potentially insecure configuration,
168 ++ i.e. SMT enabled or L1D flush disabled.
169 ++
170 ++ flush,nowarn
171 ++ Same as 'flush', but hypervisors will not
172 ++ warn when a VM is started in a potentially
173 ++ insecure configuration.
174 ++
175 ++ off
176 ++ Disables hypervisor mitigations and doesn't
177 ++ emit any warnings.
178 ++
179 ++ Default is 'flush'.
180 ++
181 ++ For details see: Documentation/admin-guide/l1tf.rst
182 ++
183 + l2cr= [PPC]
184 +
185 + l3cr= [PPC]
186 +@@ -2595,6 +2669,10 @@
187 + nosmt [KNL,S390] Disable symmetric multithreading (SMT).
188 + Equivalent to smt=1.
189 +
190 ++ [KNL,x86] Disable symmetric multithreading (SMT).
191 ++ nosmt=force: Force disable SMT, cannot be undone
192 ++ via the sysfs control file.
193 ++
194 + nospectre_v2 [X86] Disable all mitigations for the Spectre variant 2
195 + (indirect branch prediction) vulnerability. System may
196 + allow data leaks with this option, which is equivalent
197 +diff --git a/Documentation/admin-guide/l1tf.rst b/Documentation/admin-guide/l1tf.rst
198 +new file mode 100644
199 +index 000000000000..bae52b845de0
200 +--- /dev/null
201 ++++ b/Documentation/admin-guide/l1tf.rst
202 +@@ -0,0 +1,610 @@
203 ++L1TF - L1 Terminal Fault
204 ++========================
205 ++
206 ++L1 Terminal Fault is a hardware vulnerability which allows unprivileged
207 ++speculative access to data which is available in the Level 1 Data Cache
208 ++when the page table entry controlling the virtual address, which is used
209 ++for the access, has the Present bit cleared or other reserved bits set.
210 ++
211 ++Affected processors
212 ++-------------------
213 ++
214 ++This vulnerability affects a wide range of Intel processors. The
215 ++vulnerability is not present on:
216 ++
217 ++ - Processors from AMD, Centaur and other non Intel vendors
218 ++
219 ++ - Older processor models, where the CPU family is < 6
220 ++
221 ++ - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
222 ++ Penwell, Pineview, Silvermont, Airmont, Merrifield)
223 ++
224 ++ - The Intel XEON PHI family
225 ++
226 ++ - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
227 ++ IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
228 ++ by the Meltdown vulnerability either. These CPUs should become
229 ++ available by end of 2018.
230 ++
231 ++Whether a processor is affected or not can be read out from the L1TF
232 ++vulnerability file in sysfs. See :ref:`l1tf_sys_info`.
233 ++
234 ++Related CVEs
235 ++------------
236 ++
237 ++The following CVE entries are related to the L1TF vulnerability:
238 ++
239 ++ ============= ================= ==============================
240 ++ CVE-2018-3615 L1 Terminal Fault SGX related aspects
241 ++ CVE-2018-3620 L1 Terminal Fault OS, SMM related aspects
242 ++ CVE-2018-3646 L1 Terminal Fault Virtualization related aspects
243 ++ ============= ================= ==============================
244 ++
245 ++Problem
246 ++-------
247 ++
248 ++If an instruction accesses a virtual address for which the relevant page
249 ++table entry (PTE) has the Present bit cleared or other reserved bits set,
250 ++then speculative execution ignores the invalid PTE and loads the referenced
251 ++data if it is present in the Level 1 Data Cache, as if the page referenced
252 ++by the address bits in the PTE was still present and accessible.
253 ++
254 ++While this is a purely speculative mechanism and the instruction will raise
255 ++a page fault when it is retired eventually, the pure act of loading the
256 ++data and making it available to other speculative instructions opens up the
257 ++opportunity for side channel attacks to unprivileged malicious code,
258 ++similar to the Meltdown attack.
259 ++
260 ++While Meltdown breaks the user space to kernel space protection, L1TF
261 ++allows to attack any physical memory address in the system and the attack
262 ++works across all protection domains. It allows an attack of SGX and also
263 ++works from inside virtual machines because the speculation bypasses the
264 ++extended page table (EPT) protection mechanism.
265 ++
266 ++
267 ++Attack scenarios
268 ++----------------
269 ++
270 ++1. Malicious user space
271 ++^^^^^^^^^^^^^^^^^^^^^^^
272 ++
273 ++ Operating Systems store arbitrary information in the address bits of a
274 ++ PTE which is marked non present. This allows a malicious user space
275 ++ application to attack the physical memory to which these PTEs resolve.
276 ++ In some cases user-space can maliciously influence the information
277 ++ encoded in the address bits of the PTE, thus making attacks more
278 ++ deterministic and more practical.
279 ++
280 ++ The Linux kernel contains a mitigation for this attack vector, PTE
281 ++ inversion, which is permanently enabled and has no performance
282 ++ impact. The kernel ensures that the address bits of PTEs, which are not
283 ++ marked present, never point to cacheable physical memory space.
284 ++
285 ++ A system with an up to date kernel is protected against attacks from
286 ++ malicious user space applications.
287 ++
288 ++2. Malicious guest in a virtual machine
289 ++^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
290 ++
291 ++ The fact that L1TF breaks all domain protections allows malicious guest
292 ++ OSes, which can control the PTEs directly, and malicious guest user
293 ++ space applications, which run on an unprotected guest kernel lacking the
294 ++ PTE inversion mitigation for L1TF, to attack physical host memory.
295 ++
296 ++ A special aspect of L1TF in the context of virtualization is symmetric
297 ++ multi threading (SMT). The Intel implementation of SMT is called
298 ++ HyperThreading. The fact that Hyperthreads on the affected processors
299 ++ share the L1 Data Cache (L1D) is important for this. As the flaw allows
300 ++ only to attack data which is present in L1D, a malicious guest running
301 ++ on one Hyperthread can attack the data which is brought into the L1D by
302 ++ the context which runs on the sibling Hyperthread of the same physical
303 ++ core. This context can be host OS, host user space or a different guest.
304 ++
305 ++ If the processor does not support Extended Page Tables, the attack is
306 ++ only possible, when the hypervisor does not sanitize the content of the
307 ++ effective (shadow) page tables.
308 ++
309 ++ While solutions exist to mitigate these attack vectors fully, these
310 ++ mitigations are not enabled by default in the Linux kernel because they
311 ++ can affect performance significantly. The kernel provides several
312 ++ mechanisms which can be utilized to address the problem depending on the
313 ++ deployment scenario. The mitigations, their protection scope and impact
314 ++ are described in the next sections.
315 ++
316 ++ The default mitigations and the rationale for choosing them are explained
317 ++ at the end of this document. See :ref:`default_mitigations`.
318 ++
319 ++.. _l1tf_sys_info:
320 ++
321 ++L1TF system information
322 ++-----------------------
323 ++
324 ++The Linux kernel provides a sysfs interface to enumerate the current L1TF
325 ++status of the system: whether the system is vulnerable, and which
326 ++mitigations are active. The relevant sysfs file is:
327 ++
328 ++/sys/devices/system/cpu/vulnerabilities/l1tf
329 ++
330 ++The possible values in this file are:
331 ++
332 ++ =========================== ===============================
333 ++ 'Not affected' The processor is not vulnerable
334 ++ 'Mitigation: PTE Inversion' The host protection is active
335 ++ =========================== ===============================
336 ++
337 ++If KVM/VMX is enabled and the processor is vulnerable then the following
338 ++information is appended to the 'Mitigation: PTE Inversion' part:
339 ++
340 ++ - SMT status:
341 ++
342 ++ ===================== ================
343 ++ 'VMX: SMT vulnerable' SMT is enabled
344 ++ 'VMX: SMT disabled' SMT is disabled
345 ++ ===================== ================
346 ++
347 ++ - L1D Flush mode:
348 ++
349 ++ ================================ ====================================
350 ++ 'L1D vulnerable' L1D flushing is disabled
351 ++
352 ++ 'L1D conditional cache flushes' L1D flush is conditionally enabled
353 ++
354 ++ 'L1D cache flushes' L1D flush is unconditionally enabled
355 ++ ================================ ====================================
356 ++
357 ++The resulting grade of protection is discussed in the following sections.
358 ++
359 ++
360 ++Host mitigation mechanism
361 ++-------------------------
362 ++
363 ++The kernel is unconditionally protected against L1TF attacks from malicious
364 ++user space running on the host.
365 ++
366 ++
367 ++Guest mitigation mechanisms
368 ++---------------------------
369 ++
370 ++.. _l1d_flush:
371 ++
372 ++1. L1D flush on VMENTER
373 ++^^^^^^^^^^^^^^^^^^^^^^^
374 ++
375 ++ To make sure that a guest cannot attack data which is present in the L1D
376 ++ the hypervisor flushes the L1D before entering the guest.
377 ++
378 ++ Flushing the L1D evicts not only the data which should not be accessed
379 ++ by a potentially malicious guest, it also flushes the guest
380 ++ data. Flushing the L1D has a performance impact as the processor has to
381 ++ bring the flushed guest data back into the L1D. Depending on the
382 ++ frequency of VMEXIT/VMENTER and the type of computations in the guest
383 ++ performance degradation in the range of 1% to 50% has been observed. For
384 ++ scenarios where guest VMEXIT/VMENTER are rare the performance impact is
385 ++ minimal. Virtio and mechanisms like posted interrupts are designed to
386 ++ confine the VMEXITs to a bare minimum, but specific configurations and
387 ++ application scenarios might still suffer from a high VMEXIT rate.
388 ++
389 ++ The kernel provides two L1D flush modes:
390 ++ - conditional ('cond')
391 ++ - unconditional ('always')
392 ++
393 ++ The conditional mode avoids L1D flushing after VMEXITs which execute
394 ++ only audited code paths before the corresponding VMENTER. These code
395 ++ paths have been verified that they cannot expose secrets or other
396 ++ interesting data to an attacker, but they can leak information about the
397 ++ address space layout of the hypervisor.
398 ++
399 ++ Unconditional mode flushes L1D on all VMENTER invocations and provides
400 ++ maximum protection. It has a higher overhead than the conditional
401 ++ mode. The overhead cannot be quantified correctly as it depends on the
402 ++ workload scenario and the resulting number of VMEXITs.
403 ++
404 ++ The general recommendation is to enable L1D flush on VMENTER. The kernel
405 ++ defaults to conditional mode on affected processors.
406 ++
407 ++ **Note**, that L1D flush does not prevent the SMT problem because the
408 ++ sibling thread will also bring back its data into the L1D which makes it
409 ++ attackable again.
410 ++
411 ++ L1D flush can be controlled by the administrator via the kernel command
412 ++ line and sysfs control files. See :ref:`mitigation_control_command_line`
413 ++ and :ref:`mitigation_control_kvm`.
414 ++
415 ++.. _guest_confinement:
416 ++
417 ++2. Guest VCPU confinement to dedicated physical cores
418 ++^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
419 ++
420 ++ To address the SMT problem, it is possible to make a guest or a group of
421 ++ guests affine to one or more physical cores. The proper mechanism for
422 ++ that is to utilize exclusive cpusets to ensure that no other guest or
423 ++ host tasks can run on these cores.
424 ++
425 ++ If only a single guest or related guests run on sibling SMT threads on
426 ++ the same physical core then they can only attack their own memory and
427 ++ restricted parts of the host memory.
428 ++
429 ++ Host memory is attackable, when one of the sibling SMT threads runs in
430 ++ host OS (hypervisor) context and the other in guest context. The amount
431 ++ of valuable information from the host OS context depends on the context
432 ++ which the host OS executes, i.e. interrupts, soft interrupts and kernel
433 ++ threads. The amount of valuable data from these contexts cannot be
434 ++ declared as non-interesting for an attacker without deep inspection of
435 ++ the code.
436 ++
437 ++ **Note**, that assigning guests to a fixed set of physical cores affects
438 ++ the ability of the scheduler to do load balancing and might have
439 ++ negative effects on CPU utilization depending on the hosting
440 ++ scenario. Disabling SMT might be a viable alternative for particular
441 ++ scenarios.
442 ++
443 ++ For further information about confining guests to a single or to a group
444 ++ of cores consult the cpusets documentation:
445 ++
446 ++ https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
447 ++
448 ++.. _interrupt_isolation:
449 ++
450 ++3. Interrupt affinity
451 ++^^^^^^^^^^^^^^^^^^^^^
452 ++
453 ++ Interrupts can be made affine to logical CPUs. This is not universally
454 ++ true because there are types of interrupts which are truly per CPU
455 ++ interrupts, e.g. the local timer interrupt. Aside of that multi queue
456 ++ devices affine their interrupts to single CPUs or groups of CPUs per
457 ++ queue without allowing the administrator to control the affinities.
458 ++
459 ++ Moving the interrupts, which can be affinity controlled, away from CPUs
460 ++ which run untrusted guests, reduces the attack vector space.
461 ++
462 ++ Whether the interrupts that are affine to CPUs, which run untrusted
463 ++ guests, provide interesting data for an attacker depends on the system
464 ++ configuration and the scenarios which run on the system. While for some
465 ++ of the interrupts it can be assumed that they won't expose interesting
466 ++ information beyond exposing hints about the host OS memory layout, there
467 ++ is no way to make general assumptions.
468 ++
469 ++ Interrupt affinity can be controlled by the administrator via the
470 ++ /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
471 ++ available at:
472 ++
473 ++ https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
474 ++
475 ++.. _smt_control:
476 ++
477 ++4. SMT control
478 ++^^^^^^^^^^^^^^
479 ++
480 ++ To prevent the SMT issues of L1TF it might be necessary to disable SMT
481 ++ completely. Disabling SMT can have a significant performance impact, but
482 ++ the impact depends on the hosting scenario and the type of workloads.
483 ++ The impact of disabling SMT needs also to be weighted against the impact
484 ++ of other mitigation solutions like confining guests to dedicated cores.
485 ++
486 ++ The kernel provides a sysfs interface to retrieve the status of SMT and
487 ++ to control it. It also provides a kernel command line interface to
488 ++ control SMT.
489 ++
490 ++ The kernel command line interface consists of the following options:
491 ++
492 ++ =========== ==========================================================
493 ++ nosmt Affects the bring up of the secondary CPUs during boot. The
494 ++ kernel tries to bring all present CPUs online during the
495 ++ boot process. "nosmt" makes sure that from each physical
496 ++ core only one - the so called primary (hyper) thread is
497 ++ activated. Due to a design flaw of Intel processors related
498 ++ to Machine Check Exceptions the non primary siblings have
499 ++ to be brought up at least partially and are then shut down
500 ++ again. "nosmt" can be undone via the sysfs interface.
501 ++
502 ++ nosmt=force Has the same effect as "nosmt" but it does not allow to
503 ++ undo the SMT disable via the sysfs interface.
504 ++ =========== ==========================================================
505 ++
506 ++ The sysfs interface provides two files:
507 ++
508 ++ - /sys/devices/system/cpu/smt/control
509 ++ - /sys/devices/system/cpu/smt/active
510 ++
511 ++ /sys/devices/system/cpu/smt/control:
512 ++
513 ++ This file allows to read out the SMT control state and provides the
514 ++ ability to disable or (re)enable SMT. The possible states are:
515 ++
516 ++ ============== ===================================================
517 ++ on SMT is supported by the CPU and enabled. All
518 ++ logical CPUs can be onlined and offlined without
519 ++ restrictions.
520 ++
521 ++ off SMT is supported by the CPU and disabled. Only
522 ++ the so called primary SMT threads can be onlined
523 ++ and offlined without restrictions. An attempt to
524 ++ online a non-primary sibling is rejected
525 ++
526 ++ forceoff Same as 'off' but the state cannot be controlled.
527 ++ Attempts to write to the control file are rejected.
528 ++
529 ++ notsupported The processor does not support SMT. It's therefore
530 ++ not affected by the SMT implications of L1TF.
531 ++ Attempts to write to the control file are rejected.
532 ++ ============== ===================================================
533 ++
534 ++ The possible states which can be written into this file to control SMT
535 ++ state are:
536 ++
537 ++ - on
538 ++ - off
539 ++ - forceoff
540 ++
541 ++ /sys/devices/system/cpu/smt/active:
542 ++
543 ++ This file reports whether SMT is enabled and active, i.e. if on any
544 ++ physical core two or more sibling threads are online.
545 ++
546 ++ SMT control is also possible at boot time via the l1tf kernel command
547 ++ line parameter in combination with L1D flush control. See
548 ++ :ref:`mitigation_control_command_line`.
549 ++
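To make the run-time side of this concrete, here is a minimal sketch (an editor's illustration in C, not part of the patch; it needs root and a kernel providing this interface) that disables SMT by writing "off" to the control file. As described above, the kernel refuses the write when the state is "forceoff" or "notsupported":

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            const char *path = "/sys/devices/system/cpu/smt/control";
            FILE *f = fopen(path, "w");

            /* Buffered write errors only surface reliably at fclose(). */
            if (!f || fputs("off", f) == EOF || fclose(f) == EOF) {
                    fprintf(stderr, "write to %s failed: %s\n", path, strerror(errno));
                    return 1;
            }
            printf("SMT disabled; non-primary siblings are now offline\n");
            return 0;
    }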
550 ++5. Disabling EPT
551 ++^^^^^^^^^^^^^^^^
552 ++
553 ++ Disabling EPT for virtual machines provides full mitigation for L1TF even
554 ++ with SMT enabled, because the effective page tables for guests are
555 ++ managed and sanitized by the hypervisor. Though disabling EPT has a
556 ++ significant performance impact especially when the Meltdown mitigation
557 ++ KPTI is enabled.
558 ++
559 ++ EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
560 ++
561 ++There is ongoing research and development for new mitigation mechanisms to
562 ++address the performance impact of disabling SMT or EPT.
563 ++
564 ++.. _mitigation_control_command_line:
565 ++
566 ++Mitigation control on the kernel command line
567 ++---------------------------------------------
568 ++
569 ++The kernel command line allows to control the L1TF mitigations at boot
570 ++time with the option "l1tf=". The valid arguments for this option are:
571 ++
572 ++ ============ =============================================================
573 ++ full Provides all available mitigations for the L1TF
574 ++ vulnerability. Disables SMT and enables all mitigations in
575 ++ the hypervisors, i.e. unconditional L1D flushing
576 ++
577 ++ SMT control and L1D flush control via the sysfs interface
578 ++ is still possible after boot. Hypervisors will issue a
579 ++ warning when the first VM is started in a potentially
580 ++ insecure configuration, i.e. SMT enabled or L1D flush
581 ++ disabled.
582 ++
583 ++ full,force Same as 'full', but disables SMT and L1D flush runtime
584 ++ control. Implies the 'nosmt=force' command line option.
585 ++ (i.e. sysfs control of SMT is disabled.)
586 ++
587 ++ flush Leaves SMT enabled and enables the default hypervisor
588 ++ mitigation, i.e. conditional L1D flushing
589 ++
590 ++ SMT control and L1D flush control via the sysfs interface
591 ++ is still possible after boot. Hypervisors will issue a
592 ++ warning when the first VM is started in a potentially
593 ++ insecure configuration, i.e. SMT enabled or L1D flush
594 ++ disabled.
595 ++
596 ++ flush,nosmt Disables SMT and enables the default hypervisor mitigation,
597 ++ i.e. conditional L1D flushing.
598 ++
599 ++ SMT control and L1D flush control via the sysfs interface
600 ++ is still possible after boot. Hypervisors will issue a
601 ++ warning when the first VM is started in a potentially
602 ++ insecure configuration, i.e. SMT enabled or L1D flush
603 ++ disabled.
604 ++
605 ++ flush,nowarn Same as 'flush', but hypervisors will not warn when a VM is
606 ++ started in a potentially insecure configuration.
607 ++
608 ++ off Disables hypervisor mitigations and doesn't emit any
609 ++ warnings.
610 ++ ============ =============================================================
611 ++
612 ++The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
613 ++
614 ++
615 ++.. _mitigation_control_kvm:
616 ++
617 ++Mitigation control for KVM - module parameter
618 ++-------------------------------------------------------------
619 ++
620 ++The KVM hypervisor mitigation mechanism, flushing the L1D cache when
621 ++entering a guest, can be controlled with a module parameter.
622 ++
623 ++The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
624 ++following arguments:
625 ++
626 ++ ============ ==============================================================
627 ++ always L1D cache flush on every VMENTER.
628 ++
629 ++ cond Flush L1D on VMENTER only when the code between VMEXIT and
630 ++ VMENTER can leak host memory which is considered
631 ++ interesting for an attacker. This still can leak host memory
632 ++ which allows e.g. to determine the host's address space layout.
633 ++
634 ++ never Disables the mitigation
635 ++ ============ ==============================================================
636 ++
637 ++The parameter can be provided on the kernel command line, as a module
638 ++parameter when loading the modules and at runtime modified via the sysfs
639 ++file:
640 ++
641 ++/sys/module/kvm_intel/parameters/vmentry_l1d_flush
642 ++
643 ++The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
644 ++line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
645 ++module parameter is ignored and writes to the sysfs file are rejected.
646 ++
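A short sketch of that run-time control path (an editor's illustration, not part of the patch; it assumes kvm_intel is loaded and that 'l1tf=full,force' is not in effect, since that makes the file read-only):

    #include <stdio.h>

    #define PARAM "/sys/module/kvm_intel/parameters/vmentry_l1d_flush"

    int main(void)
    {
            char buf[32] = "";
            FILE *f = fopen(PARAM, "w");

            /* Switch to unconditional flushing; rejected under l1tf=full,force. */
            if (!f || fputs("always", f) == EOF || fclose(f) == EOF)
                    perror("switching vmentry_l1d_flush failed");

            f = fopen(PARAM, "r");
            if (f) {
                    if (fgets(buf, sizeof(buf), f))
                            printf("current mode: %s", buf);  /* never, cond or always */
                    fclose(f);
            }
            return 0;
    }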
647 ++
648 ++Mitigation selection guide
649 ++--------------------------
650 ++
651 ++1. No virtualization in use
652 ++^^^^^^^^^^^^^^^^^^^^^^^^^^^
653 ++
654 ++ The system is protected by the kernel unconditionally and no further
655 ++ action is required.
656 ++
657 ++2. Virtualization with trusted guests
658 ++^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
659 ++
660 ++ If the guest comes from a trusted source and the guest OS kernel is
661 ++ guaranteed to have the L1TF mitigations in place the system is fully
662 ++ protected against L1TF and no further action is required.
663 ++
664 ++ To avoid the overhead of the default L1D flushing on VMENTER the
665 ++ administrator can disable the flushing via the kernel command line and
666 ++ sysfs control files. See :ref:`mitigation_control_command_line` and
667 ++ :ref:`mitigation_control_kvm`.
668 ++
669 ++
670 ++3. Virtualization with untrusted guests
671 ++^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
672 ++
673 ++3.1. SMT not supported or disabled
674 ++""""""""""""""""""""""""""""""""""
675 ++
676 ++ If SMT is not supported by the processor or disabled in the BIOS or by
677 ++ the kernel, it's only required to enforce L1D flushing on VMENTER.
678 ++
679 ++ Conditional L1D flushing is the default behaviour and can be tuned. See
680 ++ :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
681 ++
682 ++3.2. EPT not supported or disabled
683 ++""""""""""""""""""""""""""""""""""
684 ++
685 ++ If EPT is not supported by the processor or disabled in the hypervisor,
686 ++ the system is fully protected. SMT can stay enabled and L1D flushing on
687 ++ VMENTER is not required.
688 ++
689 ++ EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
690 ++
691 ++3.3. SMT and EPT supported and active
692 ++"""""""""""""""""""""""""""""""""""""
693 ++
694 ++ If SMT and EPT are supported and active then various degrees of
695 ++ mitigations can be employed:
696 ++
697 ++ - L1D flushing on VMENTER:
698 ++
699 ++ L1D flushing on VMENTER is the minimal protection requirement, but it
700 ++ is only potent in combination with other mitigation methods.
701 ++
702 ++ Conditional L1D flushing is the default behaviour and can be tuned. See
703 ++ :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
704 ++
705 ++ - Guest confinement:
706 ++
707 ++ Confinement of guests to a single or a group of physical cores which
708 ++ are not running any other processes, can reduce the attack surface
709 ++ significantly, but interrupts, soft interrupts and kernel threads can
710 ++ still expose valuable data to a potential attacker. See
711 ++ :ref:`guest_confinement`.
712 ++
713 ++ - Interrupt isolation:
714 ++
715 ++ Isolating the guest CPUs from interrupts can reduce the attack surface
716 ++ further, but still allows a malicious guest to explore a limited amount
717 ++ of host physical memory. This can at least be used to gain knowledge
718 ++ about the host address space layout. The interrupts which have a fixed
719 ++ affinity to the CPUs which run the untrusted guests can depending on
720 ++ the scenario still trigger soft interrupts and schedule kernel threads
721 ++ which might expose valuable information. See
722 ++ :ref:`interrupt_isolation`.
723 ++
724 ++The above three mitigation methods combined can provide protection to a
725 ++certain degree, but the risk of the remaining attack surface has to be
726 ++carefully analyzed. For full protection the following methods are
727 ++available:
728 ++
729 ++ - Disabling SMT:
730 ++
731 ++ Disabling SMT and enforcing the L1D flushing provides the maximum
732 ++ amount of protection. This mitigation is not depending on any of the
733 ++ above mitigation methods.
734 ++
735 ++ SMT control and L1D flushing can be tuned by the command line
736 ++ parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
737 ++ time with the matching sysfs control files. See :ref:`smt_control`,
738 ++ :ref:`mitigation_control_command_line` and
739 ++ :ref:`mitigation_control_kvm`.
740 ++
741 ++ - Disabling EPT:
742 ++
743 ++ Disabling EPT provides the maximum amount of protection as well. It is
744 ++ not depending on any of the above mitigation methods. SMT can stay
745 ++ enabled and L1D flushing is not required, but the performance impact is
746 ++ significant.
747 ++
748 ++ EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
749 ++ parameter.
750 ++
751 ++3.4. Nested virtual machines
752 ++""""""""""""""""""""""""""""
753 ++
754 ++When nested virtualization is in use, three operating systems are involved:
755 ++the bare metal hypervisor, the nested hypervisor and the nested virtual
756 ++machine. VMENTER operations from the nested hypervisor into the nested
757 ++guest will always be processed by the bare metal hypervisor. If KVM is the
758 ++bare metal hypervisor it will:
759 ++
760 ++ - Flush the L1D cache on every switch from the nested hypervisor to the
761 ++ nested virtual machine, so that the nested hypervisor's secrets are not
762 ++ exposed to the nested virtual machine;
763 ++
764 ++ - Flush the L1D cache on every switch from the nested virtual machine to
765 ++ the nested hypervisor; this is a complex operation, and flushing the L1D
766 ++ cache avoids that the bare metal hypervisor's secrets are exposed to the
767 ++ nested virtual machine;
768 ++
769 ++ - Instruct the nested hypervisor to not perform any L1D cache flush. This
770 ++ is an optimization to avoid double L1D flushing.
771 ++
772 ++
773 ++.. _default_mitigations:
774 ++
775 ++Default mitigations
776 ++-------------------
777 ++
778 ++ The kernel default mitigations for vulnerable processors are:
779 ++
780 ++ - PTE inversion to protect against malicious user space. This is done
781 ++ unconditionally and cannot be controlled.
782 ++
783 ++ - L1D conditional flushing on VMENTER when EPT is enabled for
784 ++ a guest.
785 ++
786 ++ The kernel does not by default enforce the disabling of SMT, which leaves
787 ++ SMT systems vulnerable when running untrusted guests with EPT enabled.
788 ++
789 ++ The rationale for this choice is:
790 ++
791 ++ - Force disabling SMT can break existing setups, especially with
792 ++ unattended updates.
793 ++
794 ++ - If regular users run untrusted guests on their machine, then L1TF is
795 ++ just an add on to other malware which might be embedded in an untrusted
796 ++ guest, e.g. spam-bots or attacks on the local network.
797 ++
798 ++ There is no technical way to prevent a user from running untrusted code
799 ++ on their machines blindly.
800 ++
801 ++ - It's technically extremely unlikely and from today's knowledge even
802 ++ impossible that L1TF can be exploited via the most popular attack
803 ++ mechanisms like JavaScript because these mechanisms have no way to
804 ++ control PTEs. If this would be possible and no other mitigation would
805 ++ be possible, then the default might be different.
806 ++
807 ++ - The administrators of cloud and hosting setups have to carefully
808 ++ analyze the risk for their scenarios and make the appropriate
809 ++ mitigation choices, which might even vary across their deployed
810 ++ machines and also result in other changes of their overall setup.
811 ++ There is no way for the kernel to provide a sensible default for this
812 ++ kind of scenarios.
813 +diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
814 +index 88ad78c6f605..5d12166bd66b 100644
815 +--- a/Documentation/virtual/kvm/api.txt
816 ++++ b/Documentation/virtual/kvm/api.txt
817 +@@ -123,14 +123,15 @@ memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
818 + flag KVM_VM_MIPS_VZ.
819 +
820 +
821 +-4.3 KVM_GET_MSR_INDEX_LIST
822 ++4.3 KVM_GET_MSR_INDEX_LIST, KVM_GET_MSR_FEATURE_INDEX_LIST
823 +
824 +-Capability: basic
825 ++Capability: basic, KVM_CAP_GET_MSR_FEATURES for KVM_GET_MSR_FEATURE_INDEX_LIST
826 + Architectures: x86
827 +-Type: system
828 ++Type: system ioctl
829 + Parameters: struct kvm_msr_list (in/out)
830 + Returns: 0 on success; -1 on error
831 + Errors:
832 ++ EFAULT: the msr index list cannot be read from or written to
833 + E2BIG: the msr index list is too big to fit in the array specified by
834 + the user.
835 +
836 +@@ -139,16 +140,23 @@ struct kvm_msr_list {
837 + __u32 indices[0];
838 + };
839 +
840 +-This ioctl returns the guest msrs that are supported. The list varies
841 +-by kvm version and host processor, but does not change otherwise. The
842 +-user fills in the size of the indices array in nmsrs, and in return
843 +-kvm adjusts nmsrs to reflect the actual number of msrs and fills in
844 +-the indices array with their numbers.
845 ++The user fills in the size of the indices array in nmsrs, and in return
846 ++kvm adjusts nmsrs to reflect the actual number of msrs and fills in the
847 ++indices array with their numbers.
848 ++
849 ++KVM_GET_MSR_INDEX_LIST returns the guest msrs that are supported. The list
850 ++varies by kvm version and host processor, but does not change otherwise.
851 +
852 + Note: if kvm indicates support for MCE (KVM_CAP_MCE), then the MCE bank MSRs are
853 + not returned in the MSR list, as different vcpus can have a different number
854 + of banks, as set via the KVM_X86_SETUP_MCE ioctl.
855 +
856 ++KVM_GET_MSR_FEATURE_INDEX_LIST returns the list of MSRs that can be passed
857 ++to the KVM_GET_MSRS system ioctl. This lets userspace probe host capabilities
858 ++and processor features that are exposed via MSRs (e.g., VMX capabilities).
859 ++This list also varies by kvm version and host processor, but does not change
860 ++otherwise.
861 ++
862 +
863 + 4.4 KVM_CHECK_EXTENSION
864 +
865 +@@ -475,14 +483,22 @@ Support for this has been removed. Use KVM_SET_GUEST_DEBUG instead.
866 +
867 + 4.18 KVM_GET_MSRS
868 +
869 +-Capability: basic
870 ++Capability: basic (vcpu), KVM_CAP_GET_MSR_FEATURES (system)
871 + Architectures: x86
872 +-Type: vcpu ioctl
873 ++Type: system ioctl, vcpu ioctl
874 + Parameters: struct kvm_msrs (in/out)
875 +-Returns: 0 on success, -1 on error
876 ++Returns: number of msrs successfully returned;
877 ++ -1 on error
878 ++
879 ++When used as a system ioctl:
880 ++Reads the values of MSR-based features that are available for the VM. This
881 ++is similar to KVM_GET_SUPPORTED_CPUID, but it returns MSR indices and values.
882 ++The list of msr-based features can be obtained using KVM_GET_MSR_FEATURE_INDEX_LIST
883 ++in a system ioctl.
884 +
885 ++When used as a vcpu ioctl:
886 + Reads model-specific registers from the vcpu. Supported msr indices can
887 +-be obtained using KVM_GET_MSR_INDEX_LIST.
888 ++be obtained using KVM_GET_MSR_INDEX_LIST in a system ioctl.
889 +
890 + struct kvm_msrs {
891 + __u32 nmsrs; /* number of msrs in entries */
892 +diff --git a/Makefile b/Makefile
893 +index d407ecfdee0b..f3bb9428b3dc 100644
894 +--- a/Makefile
895 ++++ b/Makefile
896 +@@ -1,7 +1,7 @@
897 + # SPDX-License-Identifier: GPL-2.0
898 + VERSION = 4
899 + PATCHLEVEL = 14
900 +-SUBLEVEL = 62
901 ++SUBLEVEL = 63
902 + EXTRAVERSION =
903 + NAME = Petit Gorille
904 +
905 +diff --git a/arch/Kconfig b/arch/Kconfig
906 +index 400b9e1b2f27..4e01862f58e4 100644
907 +--- a/arch/Kconfig
908 ++++ b/arch/Kconfig
909 +@@ -13,6 +13,9 @@ config KEXEC_CORE
910 + config HAVE_IMA_KEXEC
911 + bool
912 +
913 ++config HOTPLUG_SMT
914 ++ bool
915 ++
916 + config OPROFILE
917 + tristate "OProfile system profiling"
918 + depends on PROFILING
919 +diff --git a/arch/arm/boot/dts/imx6sx.dtsi b/arch/arm/boot/dts/imx6sx.dtsi
920 +index 6c7eb54be9e2..d64438bfa68b 100644
921 +--- a/arch/arm/boot/dts/imx6sx.dtsi
922 ++++ b/arch/arm/boot/dts/imx6sx.dtsi
923 +@@ -1305,7 +1305,7 @@
924 + 0x82000000 0 0x08000000 0x08000000 0 0x00f00000>;
925 + bus-range = <0x00 0xff>;
926 + num-lanes = <1>;
927 +- interrupts = <GIC_SPI 123 IRQ_TYPE_LEVEL_HIGH>;
928 ++ interrupts = <GIC_SPI 120 IRQ_TYPE_LEVEL_HIGH>;
929 + clocks = <&clks IMX6SX_CLK_PCIE_REF_125M>,
930 + <&clks IMX6SX_CLK_PCIE_AXI>,
931 + <&clks IMX6SX_CLK_LVDS1_OUT>,
932 +diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
933 +index 1fd3eb5b66c6..89e684fd795f 100644
934 +--- a/arch/parisc/Kconfig
935 ++++ b/arch/parisc/Kconfig
936 +@@ -201,7 +201,7 @@ config PREFETCH
937 +
938 + config MLONGCALLS
939 + bool "Enable the -mlong-calls compiler option for big kernels"
940 +- def_bool y if (!MODULES)
941 ++ default y
942 + depends on PA8X00
943 + help
944 + If you configure the kernel to include many drivers built-in instead
945 +diff --git a/arch/parisc/include/asm/barrier.h b/arch/parisc/include/asm/barrier.h
946 +new file mode 100644
947 +index 000000000000..dbaaca84f27f
948 +--- /dev/null
949 ++++ b/arch/parisc/include/asm/barrier.h
950 +@@ -0,0 +1,32 @@
951 ++/* SPDX-License-Identifier: GPL-2.0 */
952 ++#ifndef __ASM_BARRIER_H
953 ++#define __ASM_BARRIER_H
954 ++
955 ++#ifndef __ASSEMBLY__
956 ++
957 ++/* The synchronize caches instruction executes as a nop on systems in
958 ++ which all memory references are performed in order. */
959 ++#define synchronize_caches() __asm__ __volatile__ ("sync" : : : "memory")
960 ++
961 ++#if defined(CONFIG_SMP)
962 ++#define mb() do { synchronize_caches(); } while (0)
963 ++#define rmb() mb()
964 ++#define wmb() mb()
965 ++#define dma_rmb() mb()
966 ++#define dma_wmb() mb()
967 ++#else
968 ++#define mb() barrier()
969 ++#define rmb() barrier()
970 ++#define wmb() barrier()
971 ++#define dma_rmb() barrier()
972 ++#define dma_wmb() barrier()
973 ++#endif
974 ++
975 ++#define __smp_mb() mb()
976 ++#define __smp_rmb() mb()
977 ++#define __smp_wmb() mb()
978 ++
979 ++#include <asm-generic/barrier.h>
980 ++
981 ++#endif /* !__ASSEMBLY__ */
982 ++#endif /* __ASM_BARRIER_H */
983 +diff --git a/arch/parisc/kernel/entry.S b/arch/parisc/kernel/entry.S
984 +index e95207c0565e..1b4732e20137 100644
985 +--- a/arch/parisc/kernel/entry.S
986 ++++ b/arch/parisc/kernel/entry.S
987 +@@ -481,6 +481,8 @@
988 + /* Release pa_tlb_lock lock without reloading lock address. */
989 + .macro tlb_unlock0 spc,tmp
990 + #ifdef CONFIG_SMP
991 ++ or,COND(=) %r0,\spc,%r0
992 ++ sync
993 + or,COND(=) %r0,\spc,%r0
994 + stw \spc,0(\tmp)
995 + #endif
996 +diff --git a/arch/parisc/kernel/pacache.S b/arch/parisc/kernel/pacache.S
997 +index 67b0f7532e83..3e163df49cf3 100644
998 +--- a/arch/parisc/kernel/pacache.S
999 ++++ b/arch/parisc/kernel/pacache.S
1000 +@@ -354,6 +354,7 @@ ENDPROC_CFI(flush_data_cache_local)
1001 + .macro tlb_unlock la,flags,tmp
1002 + #ifdef CONFIG_SMP
1003 + ldi 1,\tmp
1004 ++ sync
1005 + stw \tmp,0(\la)
1006 + mtsm \flags
1007 + #endif
1008 +diff --git a/arch/parisc/kernel/syscall.S b/arch/parisc/kernel/syscall.S
1009 +index e775f80ae28c..4886a6db42e9 100644
1010 +--- a/arch/parisc/kernel/syscall.S
1011 ++++ b/arch/parisc/kernel/syscall.S
1012 +@@ -633,6 +633,7 @@ cas_action:
1013 + sub,<> %r28, %r25, %r0
1014 + 2: stw,ma %r24, 0(%r26)
1015 + /* Free lock */
1016 ++ sync
1017 + stw,ma %r20, 0(%sr2,%r20)
1018 + #if ENABLE_LWS_DEBUG
1019 + /* Clear thread register indicator */
1020 +@@ -647,6 +648,7 @@ cas_action:
1021 + 3:
1022 + /* Error occurred on load or store */
1023 + /* Free lock */
1024 ++ sync
1025 + stw %r20, 0(%sr2,%r20)
1026 + #if ENABLE_LWS_DEBUG
1027 + stw %r0, 4(%sr2,%r20)
1028 +@@ -848,6 +850,7 @@ cas2_action:
1029 +
1030 + cas2_end:
1031 + /* Free lock */
1032 ++ sync
1033 + stw,ma %r20, 0(%sr2,%r20)
1034 + /* Enable interrupts */
1035 + ssm PSW_SM_I, %r0
1036 +@@ -858,6 +861,7 @@ cas2_end:
1037 + 22:
1038 + /* Error occurred on load or store */
1039 + /* Free lock */
1040 ++ sync
1041 + stw %r20, 0(%sr2,%r20)
1042 + ssm PSW_SM_I, %r0
1043 + ldo 1(%r0),%r28
1044 +diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
1045 +index 7483cd514c32..1c63a4b5320d 100644
1046 +--- a/arch/x86/Kconfig
1047 ++++ b/arch/x86/Kconfig
1048 +@@ -176,6 +176,7 @@ config X86
1049 + select HAVE_SYSCALL_TRACEPOINTS
1050 + select HAVE_UNSTABLE_SCHED_CLOCK
1051 + select HAVE_USER_RETURN_NOTIFIER
1052 ++ select HOTPLUG_SMT if SMP
1053 + select IRQ_FORCED_THREADING
1054 + select PCI_LOCKLESS_CONFIG
1055 + select PERF_EVENTS
1056 +diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
1057 +index 5f01671c68f2..a1ed92aae12a 100644
1058 +--- a/arch/x86/include/asm/apic.h
1059 ++++ b/arch/x86/include/asm/apic.h
1060 +@@ -10,6 +10,7 @@
1061 + #include <asm/fixmap.h>
1062 + #include <asm/mpspec.h>
1063 + #include <asm/msr.h>
1064 ++#include <asm/hardirq.h>
1065 +
1066 + #define ARCH_APICTIMER_STOPS_ON_C3 1
1067 +
1068 +@@ -613,12 +614,20 @@ extern int default_check_phys_apicid_present(int phys_apicid);
1069 + #endif
1070 +
1071 + #endif /* CONFIG_X86_LOCAL_APIC */
1072 ++
1073 ++#ifdef CONFIG_SMP
1074 ++bool apic_id_is_primary_thread(unsigned int id);
1075 ++#else
1076 ++static inline bool apic_id_is_primary_thread(unsigned int id) { return false; }
1077 ++#endif
1078 ++
1079 + extern void irq_enter(void);
1080 + extern void irq_exit(void);
1081 +
1082 + static inline void entering_irq(void)
1083 + {
1084 + irq_enter();
1085 ++ kvm_set_cpu_l1tf_flush_l1d();
1086 + }
1087 +
1088 + static inline void entering_ack_irq(void)
1089 +@@ -631,6 +640,7 @@ static inline void ipi_entering_ack_irq(void)
1090 + {
1091 + irq_enter();
1092 + ack_APIC_irq();
1093 ++ kvm_set_cpu_l1tf_flush_l1d();
1094 + }
1095 +
1096 + static inline void exiting_irq(void)
1097 +diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
1098 +index 403e97d5e243..8418462298e7 100644
1099 +--- a/arch/x86/include/asm/cpufeatures.h
1100 ++++ b/arch/x86/include/asm/cpufeatures.h
1101 +@@ -219,6 +219,7 @@
1102 + #define X86_FEATURE_IBPB ( 7*32+26) /* Indirect Branch Prediction Barrier */
1103 + #define X86_FEATURE_STIBP ( 7*32+27) /* Single Thread Indirect Branch Predictors */
1104 + #define X86_FEATURE_ZEN ( 7*32+28) /* "" CPU is AMD family 0x17 (Zen) */
1105 ++#define X86_FEATURE_L1TF_PTEINV ( 7*32+29) /* "" L1TF workaround PTE inversion */
1106 +
1107 + /* Virtualization flags: Linux defined, word 8 */
1108 + #define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */
1109 +@@ -338,6 +339,7 @@
1110 + #define X86_FEATURE_PCONFIG (18*32+18) /* Intel PCONFIG */
1111 + #define X86_FEATURE_SPEC_CTRL (18*32+26) /* "" Speculation Control (IBRS + IBPB) */
1112 + #define X86_FEATURE_INTEL_STIBP (18*32+27) /* "" Single Thread Indirect Branch Predictors */
1113 ++#define X86_FEATURE_FLUSH_L1D (18*32+28) /* Flush L1D cache */
1114 + #define X86_FEATURE_ARCH_CAPABILITIES (18*32+29) /* IA32_ARCH_CAPABILITIES MSR (Intel) */
1115 + #define X86_FEATURE_SPEC_CTRL_SSBD (18*32+31) /* "" Speculative Store Bypass Disable */
1116 +
1117 +@@ -370,5 +372,6 @@
1118 + #define X86_BUG_SPECTRE_V1 X86_BUG(15) /* CPU is affected by Spectre variant 1 attack with conditional branches */
1119 + #define X86_BUG_SPECTRE_V2 X86_BUG(16) /* CPU is affected by Spectre variant 2 attack with indirect branches */
1120 + #define X86_BUG_SPEC_STORE_BYPASS X86_BUG(17) /* CPU is affected by speculative store bypass attack */
1121 ++#define X86_BUG_L1TF X86_BUG(18) /* CPU is affected by L1 Terminal Fault */
1122 +
1123 + #endif /* _ASM_X86_CPUFEATURES_H */
1124 +diff --git a/arch/x86/include/asm/dmi.h b/arch/x86/include/asm/dmi.h
1125 +index 0ab2ab27ad1f..b825cb201251 100644
1126 +--- a/arch/x86/include/asm/dmi.h
1127 ++++ b/arch/x86/include/asm/dmi.h
1128 +@@ -4,8 +4,8 @@
1129 +
1130 + #include <linux/compiler.h>
1131 + #include <linux/init.h>
1132 ++#include <linux/io.h>
1133 +
1134 +-#include <asm/io.h>
1135 + #include <asm/setup.h>
1136 +
1137 + static __always_inline __init void *dmi_alloc(unsigned len)
1138 +diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
1139 +index 51cc979dd364..486c843273c4 100644
1140 +--- a/arch/x86/include/asm/hardirq.h
1141 ++++ b/arch/x86/include/asm/hardirq.h
1142 +@@ -3,10 +3,12 @@
1143 + #define _ASM_X86_HARDIRQ_H
1144 +
1145 + #include <linux/threads.h>
1146 +-#include <linux/irq.h>
1147 +
1148 + typedef struct {
1149 +- unsigned int __softirq_pending;
1150 ++ u16 __softirq_pending;
1151 ++#if IS_ENABLED(CONFIG_KVM_INTEL)
1152 ++ u8 kvm_cpu_l1tf_flush_l1d;
1153 ++#endif
1154 + unsigned int __nmi_count; /* arch dependent */
1155 + #ifdef CONFIG_X86_LOCAL_APIC
1156 + unsigned int apic_timer_irqs; /* arch dependent */
1157 +@@ -62,4 +64,24 @@ extern u64 arch_irq_stat_cpu(unsigned int cpu);
1158 + extern u64 arch_irq_stat(void);
1159 + #define arch_irq_stat arch_irq_stat
1160 +
1161 ++
1162 ++#if IS_ENABLED(CONFIG_KVM_INTEL)
1163 ++static inline void kvm_set_cpu_l1tf_flush_l1d(void)
1164 ++{
1165 ++ __this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 1);
1166 ++}
1167 ++
1168 ++static inline void kvm_clear_cpu_l1tf_flush_l1d(void)
1169 ++{
1170 ++ __this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 0);
1171 ++}
1172 ++
1173 ++static inline bool kvm_get_cpu_l1tf_flush_l1d(void)
1174 ++{
1175 ++ return __this_cpu_read(irq_stat.kvm_cpu_l1tf_flush_l1d);
1176 ++}
1177 ++#else /* !IS_ENABLED(CONFIG_KVM_INTEL) */
1178 ++static inline void kvm_set_cpu_l1tf_flush_l1d(void) { }
1179 ++#endif /* IS_ENABLED(CONFIG_KVM_INTEL) */
1180 ++
1181 + #endif /* _ASM_X86_HARDIRQ_H */
1182 +diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
1183 +index c4fc17220df9..c14f2a74b2be 100644
1184 +--- a/arch/x86/include/asm/irqflags.h
1185 ++++ b/arch/x86/include/asm/irqflags.h
1186 +@@ -13,6 +13,8 @@
1187 + * Interrupt control:
1188 + */
1189 +
1190 ++/* Declaration required for gcc < 4.9 to prevent -Werror=missing-prototypes */
1191 ++extern inline unsigned long native_save_fl(void);
1192 + extern inline unsigned long native_save_fl(void)
1193 + {
1194 + unsigned long flags;
1195 +diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
1196 +index 174b9c41efce..4015b88383ce 100644
1197 +--- a/arch/x86/include/asm/kvm_host.h
1198 ++++ b/arch/x86/include/asm/kvm_host.h
1199 +@@ -17,6 +17,7 @@
1200 + #include <linux/tracepoint.h>
1201 + #include <linux/cpumask.h>
1202 + #include <linux/irq_work.h>
1203 ++#include <linux/irq.h>
1204 +
1205 + #include <linux/kvm.h>
1206 + #include <linux/kvm_para.h>
1207 +@@ -506,6 +507,7 @@ struct kvm_vcpu_arch {
1208 + u64 smbase;
1209 + bool tpr_access_reporting;
1210 + u64 ia32_xss;
1211 ++ u64 microcode_version;
1212 +
1213 + /*
1214 + * Paging state of the vcpu
1215 +@@ -693,6 +695,9 @@ struct kvm_vcpu_arch {
1216 +
1217 + /* be preempted when it's in kernel-mode(cpl=0) */
1218 + bool preempted_in_kernel;
1219 ++
1220 ++ /* Flush the L1 Data cache for L1TF mitigation on VMENTER */
1221 ++ bool l1tf_flush_l1d;
1222 + };
1223 +
1224 + struct kvm_lpage_info {
1225 +@@ -862,6 +867,7 @@ struct kvm_vcpu_stat {
1226 + u64 signal_exits;
1227 + u64 irq_window_exits;
1228 + u64 nmi_window_exits;
1229 ++ u64 l1d_flush;
1230 + u64 halt_exits;
1231 + u64 halt_successful_poll;
1232 + u64 halt_attempted_poll;
1233 +@@ -1061,6 +1067,8 @@ struct kvm_x86_ops {
1234 + void (*cancel_hv_timer)(struct kvm_vcpu *vcpu);
1235 +
1236 + void (*setup_mce)(struct kvm_vcpu *vcpu);
1237 ++
1238 ++ int (*get_msr_feature)(struct kvm_msr_entry *entry);
1239 + };
1240 +
1241 + struct kvm_arch_async_pf {
1242 +@@ -1366,6 +1374,7 @@ int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
1243 + void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
1244 + void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu);
1245 +
1246 ++u64 kvm_get_arch_capabilities(void);
1247 + void kvm_define_shared_msr(unsigned index, u32 msr);
1248 + int kvm_set_shared_msr(unsigned index, u64 val, u64 mask);
1249 +
1250 +diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
1251 +index 504b21692d32..ef7eec669a1b 100644
1252 +--- a/arch/x86/include/asm/msr-index.h
1253 ++++ b/arch/x86/include/asm/msr-index.h
1254 +@@ -70,12 +70,19 @@
1255 + #define MSR_IA32_ARCH_CAPABILITIES 0x0000010a
1256 + #define ARCH_CAP_RDCL_NO (1 << 0) /* Not susceptible to Meltdown */
1257 + #define ARCH_CAP_IBRS_ALL (1 << 1) /* Enhanced IBRS support */
1258 ++#define ARCH_CAP_SKIP_VMENTRY_L1DFLUSH (1 << 3) /* Skip L1D flush on vmentry */
1259 + #define ARCH_CAP_SSB_NO (1 << 4) /*
1260 + * Not susceptible to Speculative Store Bypass
1261 + * attack, so no Speculative Store Bypass
1262 + * control required.
1263 + */
1264 +
1265 ++#define MSR_IA32_FLUSH_CMD 0x0000010b
1266 ++#define L1D_FLUSH (1 << 0) /*
1267 ++ * Writeback and invalidate the
1268 ++ * L1 data cache.
1269 ++ */
1270 ++
1271 + #define MSR_IA32_BBL_CR_CTL 0x00000119
1272 + #define MSR_IA32_BBL_CR_CTL3 0x0000011e
1273 +
1274 +diff --git a/arch/x86/include/asm/page_32_types.h b/arch/x86/include/asm/page_32_types.h
1275 +index aa30c3241ea7..0d5c739eebd7 100644
1276 +--- a/arch/x86/include/asm/page_32_types.h
1277 ++++ b/arch/x86/include/asm/page_32_types.h
1278 +@@ -29,8 +29,13 @@
1279 + #define N_EXCEPTION_STACKS 1
1280 +
1281 + #ifdef CONFIG_X86_PAE
1282 +-/* 44=32+12, the limit we can fit into an unsigned long pfn */
1283 +-#define __PHYSICAL_MASK_SHIFT 44
1284 ++/*
1285 ++ * This is beyond the 44 bit limit imposed by the 32bit long pfns,
1286 ++ * but we need the full mask to make sure inverted PROT_NONE
1287 ++ * entries have all the host bits set in a guest.
1288 ++ * The real limit is still 44 bits.
1289 ++ */
1290 ++#define __PHYSICAL_MASK_SHIFT 52
1291 + #define __VIRTUAL_MASK_SHIFT 32
1292 +
1293 + #else /* !CONFIG_X86_PAE */
1294 +diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h
1295 +index 685ffe8a0eaf..60d0f9015317 100644
1296 +--- a/arch/x86/include/asm/pgtable-2level.h
1297 ++++ b/arch/x86/include/asm/pgtable-2level.h
1298 +@@ -95,4 +95,21 @@ static inline unsigned long pte_bitop(unsigned long value, unsigned int rightshi
1299 + #define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low })
1300 + #define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
1301 +
1302 ++/* No inverted PFNs on 2 level page tables */
1303 ++
1304 ++static inline u64 protnone_mask(u64 val)
1305 ++{
1306 ++ return 0;
1307 ++}
1308 ++
1309 ++static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask)
1310 ++{
1311 ++ return val;
1312 ++}
1313 ++
1314 ++static inline bool __pte_needs_invert(u64 val)
1315 ++{
1316 ++ return false;
1317 ++}
1318 ++
1319 + #endif /* _ASM_X86_PGTABLE_2LEVEL_H */
1320 +diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
1321 +index bc4af5453802..9dc19b4a2a87 100644
1322 +--- a/arch/x86/include/asm/pgtable-3level.h
1323 ++++ b/arch/x86/include/asm/pgtable-3level.h
1324 +@@ -206,12 +206,43 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
1325 + #endif
1326 +
1327 + /* Encode and de-code a swap entry */
1328 ++#define SWP_TYPE_BITS 5
1329 ++
1330 ++#define SWP_OFFSET_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
1331 ++
1332 ++/* We always extract/encode the offset by shifting it all the way up, and then down again */
1333 ++#define SWP_OFFSET_SHIFT (SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)
1334 ++
1335 + #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > 5)
1336 + #define __swp_type(x) (((x).val) & 0x1f)
1337 + #define __swp_offset(x) ((x).val >> 5)
1338 + #define __swp_entry(type, offset) ((swp_entry_t){(type) | (offset) << 5})
1339 +-#define __pte_to_swp_entry(pte) ((swp_entry_t){ (pte).pte_high })
1340 +-#define __swp_entry_to_pte(x) ((pte_t){ { .pte_high = (x).val } })
1341 ++
1342 ++/*
1343 ++ * Normally, __swp_entry() converts from arch-independent swp_entry_t to
1344 ++ * arch-dependent swp_entry_t, and __swp_entry_to_pte() just stores the result
1345 ++ * to pte. But here we have 32bit swp_entry_t and 64bit pte, and need to use the
1346 ++ * whole 64 bits. Thus, we shift the "real" arch-dependent conversion to
1347 ++ * __swp_entry_to_pte() through the following helper macro based on 64bit
1348 ++ * __swp_entry().
1349 ++ */
1350 ++#define __swp_pteval_entry(type, offset) ((pteval_t) { \
1351 ++ (~(pteval_t)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \
1352 ++ | ((pteval_t)(type) << (64 - SWP_TYPE_BITS)) })
1353 ++
1354 ++#define __swp_entry_to_pte(x) ((pte_t){ .pte = \
1355 ++ __swp_pteval_entry(__swp_type(x), __swp_offset(x)) })
1356 ++/*
1357 ++ * Analogically, __pte_to_swp_entry() doesn't just extract the arch-dependent
1358 ++ * swp_entry_t, but also has to convert it from 64bit to the 32bit
1359 ++ * intermediate representation, using the following macros based on 64bit
1360 ++ * __swp_type() and __swp_offset().
1361 ++ */
1362 ++#define __pteval_swp_type(x) ((unsigned long)((x).pte >> (64 - SWP_TYPE_BITS)))
1363 ++#define __pteval_swp_offset(x) ((unsigned long)(~((x).pte) << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT))
1364 ++
1365 ++#define __pte_to_swp_entry(pte) (__swp_entry(__pteval_swp_type(pte), \
1366 ++ __pteval_swp_offset(pte)))
1367 +
1368 + #define gup_get_pte gup_get_pte
1369 + /*
1370 +@@ -260,4 +291,6 @@ static inline pte_t gup_get_pte(pte_t *ptep)
1371 + return pte;
1372 + }
1373 +
1374 ++#include <asm/pgtable-invert.h>
1375 ++
1376 + #endif /* _ASM_X86_PGTABLE_3LEVEL_H */
1377 +diff --git a/arch/x86/include/asm/pgtable-invert.h b/arch/x86/include/asm/pgtable-invert.h
1378 +new file mode 100644
1379 +index 000000000000..44b1203ece12
1380 +--- /dev/null
1381 ++++ b/arch/x86/include/asm/pgtable-invert.h
1382 +@@ -0,0 +1,32 @@
1383 ++/* SPDX-License-Identifier: GPL-2.0 */
1384 ++#ifndef _ASM_PGTABLE_INVERT_H
1385 ++#define _ASM_PGTABLE_INVERT_H 1
1386 ++
1387 ++#ifndef __ASSEMBLY__
1388 ++
1389 ++static inline bool __pte_needs_invert(u64 val)
1390 ++{
1391 ++ return !(val & _PAGE_PRESENT);
1392 ++}
1393 ++
1394 ++/* Get a mask to xor with the page table entry to get the correct pfn. */
1395 ++static inline u64 protnone_mask(u64 val)
1396 ++{
1397 ++ return __pte_needs_invert(val) ? ~0ull : 0;
1398 ++}
1399 ++
1400 ++static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask)
1401 ++{
1402 ++ /*
1403 ++ * When a PTE transitions from NONE to !NONE or vice-versa
1404 ++ * invert the PFN part to stop speculation.
1405 ++ * pte_pfn undoes this when needed.
1406 ++ */
1407 ++ if (__pte_needs_invert(oldval) != __pte_needs_invert(val))
1408 ++ val = (val & ~mask) | (~val & mask);
1409 ++ return val;
1410 ++}
1411 ++
1412 ++#endif /* __ASSEMBLY__ */
1413 ++
1414 ++#endif
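The helpers above implement the L1TF PTE inversion: the PFN bits of a not-present entry are stored bit-inverted so the high physical address bits end up set and speculation cannot target valid memory. A minimal user-space sketch of the round trip, using illustrative stand-ins for _PAGE_PRESENT and PTE_PFN_MASK rather than the kernel's definitions:

/*
 * Sketch of the protnone_mask()/flip_protnone_guard() round trip above.
 * PRESENT and PFN_MASK are stand-ins, not the kernel constants.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PRESENT   0x1ULL                  /* stand-in for _PAGE_PRESENT */
#define PFN_MASK  0x000ffffffffff000ULL   /* stand-in for PTE_PFN_MASK */

static uint64_t protnone_mask(uint64_t val)
{
	return (val & PRESENT) ? 0 : ~0ULL;   /* invert PFN of !present entries */
}

static uint64_t flip_guard(uint64_t oldval, uint64_t val, uint64_t mask)
{
	if (((oldval & PRESENT) == 0) != ((val & PRESENT) == 0))
		val = (val & ~mask) | (~val & mask);   /* flip only the PFN bits */
	return val;
}

int main(void)
{
	uint64_t pte = (0x1234ULL << 12) | PRESENT;     /* present, pfn 0x1234 */
	uint64_t prot_none = flip_guard(pte, pte & ~PRESENT, PFN_MASK);

	/* The stored PFN bits are now inverted ... */
	assert(((prot_none ^ protnone_mask(prot_none)) & PFN_MASK) >> 12 == 0x1234);
	/* ... and flipping again on the NONE -> present transition restores them. */
	uint64_t back = flip_guard(prot_none, prot_none | PRESENT, PFN_MASK);
	assert((back & PFN_MASK) >> 12 == 0x1234);
	puts("PFN survives the PROT_NONE round trip");
	return 0;
}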
1415 +diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
1416 +index 5c790e93657d..6a4b1a54ff47 100644
1417 +--- a/arch/x86/include/asm/pgtable.h
1418 ++++ b/arch/x86/include/asm/pgtable.h
1419 +@@ -185,19 +185,29 @@ static inline int pte_special(pte_t pte)
1420 + return pte_flags(pte) & _PAGE_SPECIAL;
1421 + }
1422 +
1423 ++/* Entries that were set to PROT_NONE are inverted */
1424 ++
1425 ++static inline u64 protnone_mask(u64 val);
1426 ++
1427 + static inline unsigned long pte_pfn(pte_t pte)
1428 + {
1429 +- return (pte_val(pte) & PTE_PFN_MASK) >> PAGE_SHIFT;
1430 ++ phys_addr_t pfn = pte_val(pte);
1431 ++ pfn ^= protnone_mask(pfn);
1432 ++ return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT;
1433 + }
1434 +
1435 + static inline unsigned long pmd_pfn(pmd_t pmd)
1436 + {
1437 +- return (pmd_val(pmd) & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
1438 ++ phys_addr_t pfn = pmd_val(pmd);
1439 ++ pfn ^= protnone_mask(pfn);
1440 ++ return (pfn & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
1441 + }
1442 +
1443 + static inline unsigned long pud_pfn(pud_t pud)
1444 + {
1445 +- return (pud_val(pud) & pud_pfn_mask(pud)) >> PAGE_SHIFT;
1446 ++ phys_addr_t pfn = pud_val(pud);
1447 ++ pfn ^= protnone_mask(pfn);
1448 ++ return (pfn & pud_pfn_mask(pud)) >> PAGE_SHIFT;
1449 + }
1450 +
1451 + static inline unsigned long p4d_pfn(p4d_t p4d)
1452 +@@ -400,11 +410,6 @@ static inline pmd_t pmd_mkwrite(pmd_t pmd)
1453 + return pmd_set_flags(pmd, _PAGE_RW);
1454 + }
1455 +
1456 +-static inline pmd_t pmd_mknotpresent(pmd_t pmd)
1457 +-{
1458 +- return pmd_clear_flags(pmd, _PAGE_PRESENT | _PAGE_PROTNONE);
1459 +-}
1460 +-
1461 + static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
1462 + {
1463 + pudval_t v = native_pud_val(pud);
1464 +@@ -459,11 +464,6 @@ static inline pud_t pud_mkwrite(pud_t pud)
1465 + return pud_set_flags(pud, _PAGE_RW);
1466 + }
1467 +
1468 +-static inline pud_t pud_mknotpresent(pud_t pud)
1469 +-{
1470 +- return pud_clear_flags(pud, _PAGE_PRESENT | _PAGE_PROTNONE);
1471 +-}
1472 +-
1473 + #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
1474 + static inline int pte_soft_dirty(pte_t pte)
1475 + {
1476 +@@ -528,25 +528,45 @@ static inline pgprotval_t massage_pgprot(pgprot_t pgprot)
1477 +
1478 + static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
1479 + {
1480 +- return __pte(((phys_addr_t)page_nr << PAGE_SHIFT) |
1481 +- massage_pgprot(pgprot));
1482 ++ phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;
1483 ++ pfn ^= protnone_mask(pgprot_val(pgprot));
1484 ++ pfn &= PTE_PFN_MASK;
1485 ++ return __pte(pfn | massage_pgprot(pgprot));
1486 + }
1487 +
1488 + static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)
1489 + {
1490 +- return __pmd(((phys_addr_t)page_nr << PAGE_SHIFT) |
1491 +- massage_pgprot(pgprot));
1492 ++ phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;
1493 ++ pfn ^= protnone_mask(pgprot_val(pgprot));
1494 ++ pfn &= PHYSICAL_PMD_PAGE_MASK;
1495 ++ return __pmd(pfn | massage_pgprot(pgprot));
1496 + }
1497 +
1498 + static inline pud_t pfn_pud(unsigned long page_nr, pgprot_t pgprot)
1499 + {
1500 +- return __pud(((phys_addr_t)page_nr << PAGE_SHIFT) |
1501 +- massage_pgprot(pgprot));
1502 ++ phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;
1503 ++ pfn ^= protnone_mask(pgprot_val(pgprot));
1504 ++ pfn &= PHYSICAL_PUD_PAGE_MASK;
1505 ++ return __pud(pfn | massage_pgprot(pgprot));
1506 + }
1507 +
1508 ++static inline pmd_t pmd_mknotpresent(pmd_t pmd)
1509 ++{
1510 ++ return pfn_pmd(pmd_pfn(pmd),
1511 ++ __pgprot(pmd_flags(pmd) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
1512 ++}
1513 ++
1514 ++static inline pud_t pud_mknotpresent(pud_t pud)
1515 ++{
1516 ++ return pfn_pud(pud_pfn(pud),
1517 ++ __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
1518 ++}
1519 ++
1520 ++static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
1521 ++
1522 + static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
1523 + {
1524 +- pteval_t val = pte_val(pte);
1525 ++ pteval_t val = pte_val(pte), oldval = val;
1526 +
1527 + /*
1528 + * Chop off the NX bit (if present), and add the NX portion of
1529 +@@ -554,17 +574,17 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
1530 + */
1531 + val &= _PAGE_CHG_MASK;
1532 + val |= massage_pgprot(newprot) & ~_PAGE_CHG_MASK;
1533 +-
1534 ++ val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
1535 + return __pte(val);
1536 + }
1537 +
1538 + static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
1539 + {
1540 +- pmdval_t val = pmd_val(pmd);
1541 ++ pmdval_t val = pmd_val(pmd), oldval = val;
1542 +
1543 + val &= _HPAGE_CHG_MASK;
1544 + val |= massage_pgprot(newprot) & ~_HPAGE_CHG_MASK;
1545 +-
1546 ++ val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
1547 + return __pmd(val);
1548 + }
1549 +
1550 +@@ -1274,6 +1294,14 @@ static inline bool pud_access_permitted(pud_t pud, bool write)
1551 + return __pte_access_permitted(pud_val(pud), write);
1552 + }
1553 +
1554 ++#define __HAVE_ARCH_PFN_MODIFY_ALLOWED 1
1555 ++extern bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot);
1556 ++
1557 ++static inline bool arch_has_pfn_modify_check(void)
1558 ++{
1559 ++ return boot_cpu_has_bug(X86_BUG_L1TF);
1560 ++}
1561 ++
1562 + #include <asm-generic/pgtable.h>
1563 + #endif /* __ASSEMBLY__ */
1564 +
1565 +diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
1566 +index 1149d2112b2e..4ecb72831938 100644
1567 +--- a/arch/x86/include/asm/pgtable_64.h
1568 ++++ b/arch/x86/include/asm/pgtable_64.h
1569 +@@ -276,7 +276,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
1570 + *
1571 + * | ... | 11| 10| 9|8|7|6|5| 4| 3|2| 1|0| <- bit number
1572 + * | ... |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
1573 +- * | OFFSET (14->63) | TYPE (9-13) |0|0|X|X| X| X|X|SD|0| <- swp entry
1574 ++ * | TYPE (59-63) | ~OFFSET (9-58) |0|0|X|X| X| X|X|SD|0| <- swp entry
1575 + *
1576 + * G (8) is aliased and used as a PROT_NONE indicator for
1577 + * !present ptes. We need to start storing swap entries above
1578 +@@ -289,20 +289,34 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
1579 + *
1580 + * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
1581 + * but also L and G.
1582 ++ *
1583 ++ * The offset is inverted by a binary not operation to make the high
1584 ++ * physical bits set.
1585 + */
1586 +-#define SWP_TYPE_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
1587 +-#define SWP_TYPE_BITS 5
1588 +-/* Place the offset above the type: */
1589 +-#define SWP_OFFSET_FIRST_BIT (SWP_TYPE_FIRST_BIT + SWP_TYPE_BITS)
1590 ++#define SWP_TYPE_BITS 5
1591 ++
1592 ++#define SWP_OFFSET_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
1593 ++
1594 ++/* We always extract/encode the offset by shifting it all the way up, and then down again */
1595 ++#define SWP_OFFSET_SHIFT (SWP_OFFSET_FIRST_BIT+SWP_TYPE_BITS)
1596 +
1597 + #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
1598 +
1599 +-#define __swp_type(x) (((x).val >> (SWP_TYPE_FIRST_BIT)) \
1600 +- & ((1U << SWP_TYPE_BITS) - 1))
1601 +-#define __swp_offset(x) ((x).val >> SWP_OFFSET_FIRST_BIT)
1602 +-#define __swp_entry(type, offset) ((swp_entry_t) { \
1603 +- ((type) << (SWP_TYPE_FIRST_BIT)) \
1604 +- | ((offset) << SWP_OFFSET_FIRST_BIT) })
1605 ++/* Extract the high bits for type */
1606 ++#define __swp_type(x) ((x).val >> (64 - SWP_TYPE_BITS))
1607 ++
1608 ++/* Shift up (to get rid of type), then down to get value */
1609 ++#define __swp_offset(x) (~(x).val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT)
1610 ++
1611 ++/*
1612 ++ * Shift the offset up "too far" by TYPE bits, then down again
1613 ++ * The offset is inverted by a binary not operation to make the high
1614 ++ * physical bits set.
1615 ++ */
1616 ++#define __swp_entry(type, offset) ((swp_entry_t) { \
1617 ++ (~(unsigned long)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \
1618 ++ | ((unsigned long)(type) << (64-SWP_TYPE_BITS)) })
1619 ++
1620 + #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val((pte)) })
1621 + #define __pmd_to_swp_entry(pmd) ((swp_entry_t) { pmd_val((pmd)) })
1622 + #define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
1623 +@@ -346,5 +360,7 @@ static inline bool gup_fast_permitted(unsigned long start, int nr_pages,
1624 + return true;
1625 + }
1626 +
1627 ++#include <asm/pgtable-invert.h>
1628 ++
1629 + #endif /* !__ASSEMBLY__ */
1630 + #endif /* _ASM_X86_PGTABLE_64_H */
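To see the new swap layout concretely, here is a stand-alone sketch of the encode/decode macros above, assuming SWP_OFFSET_FIRST_BIT = 9 (_PAGE_BIT_PROTNONE + 1 on x86-64): the type lands in bits 63..59 and the offset is stored bit-inverted from bit 9 upward.

/*
 * Stand-alone sketch of the 64-bit swap entry encoding above; the bit
 * positions are assumed from the x86-64 values, not taken from headers.
 */
#include <assert.h>
#include <stdio.h>

#define SWP_TYPE_BITS        5
#define SWP_OFFSET_FIRST_BIT 9
#define SWP_OFFSET_SHIFT     (SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)

static unsigned long long swp_entry(unsigned long long type, unsigned long long offset)
{
	return (~offset << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) |
	       (type << (64 - SWP_TYPE_BITS));
}

static unsigned long long swp_type(unsigned long long val)
{
	return val >> (64 - SWP_TYPE_BITS);
}

static unsigned long long swp_offset(unsigned long long val)
{
	return ~val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT;
}

int main(void)
{
	unsigned long long e = swp_entry(3, 0x12345);

	assert(swp_type(e) == 3);
	assert(swp_offset(e) == 0x12345);
	/* Because the offset is stored inverted, the high "physical" bits of a
	 * non-present PTE are set, which is the point of the L1TF hardening. */
	printf("entry=%#llx type=%llu offset=%#llx\n", e, swp_type(e), swp_offset(e));
	return 0;
}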
1631 +diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
1632 +index 3222c7746cb1..0e856c0628b3 100644
1633 +--- a/arch/x86/include/asm/processor.h
1634 ++++ b/arch/x86/include/asm/processor.h
1635 +@@ -180,6 +180,11 @@ extern const struct seq_operations cpuinfo_op;
1636 +
1637 + extern void cpu_detect(struct cpuinfo_x86 *c);
1638 +
1639 ++static inline unsigned long l1tf_pfn_limit(void)
1640 ++{
1641 ++ return BIT(boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT) - 1;
1642 ++}
1643 ++
1644 + extern void early_cpu_init(void);
1645 + extern void identify_boot_cpu(void);
1646 + extern void identify_secondary_cpu(struct cpuinfo_x86 *);
1647 +@@ -969,4 +974,16 @@ bool xen_set_default_idle(void);
1648 + void stop_this_cpu(void *dummy);
1649 + void df_debug(struct pt_regs *regs, long error_code);
1650 + void microcode_check(void);
1651 ++
1652 ++enum l1tf_mitigations {
1653 ++ L1TF_MITIGATION_OFF,
1654 ++ L1TF_MITIGATION_FLUSH_NOWARN,
1655 ++ L1TF_MITIGATION_FLUSH,
1656 ++ L1TF_MITIGATION_FLUSH_NOSMT,
1657 ++ L1TF_MITIGATION_FULL,
1658 ++ L1TF_MITIGATION_FULL_FORCE
1659 ++};
1660 ++
1661 ++extern enum l1tf_mitigations l1tf_mitigation;
1662 ++
1663 + #endif /* _ASM_X86_PROCESSOR_H */
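As a worked example of l1tf_pfn_limit() above: on a CPU reporting 46 physical address bits with 4 KiB pages the limit comes out to 2^33 - 1 page frames, i.e. an upper bound of roughly MAX_PA/2 = 32 TiB. A small sketch of the arithmetic, with assumed values rather than ones read from a real CPU:

/*
 * Worked example of l1tf_pfn_limit(), assuming a CPU with 46 physical
 * address bits and PAGE_SHIFT = 12.
 */
#include <stdio.h>

int main(void)
{
	const unsigned int phys_bits = 46;   /* stand-in for boot_cpu_data.x86_phys_bits */
	const unsigned int page_shift = 12;  /* PAGE_SHIFT */

	unsigned long long limit   = (1ULL << (phys_bits - 1 - page_shift)) - 1;
	unsigned long long half_pa = limit << page_shift;

	/* 0x1ffffffff page frames -> ~32 TiB, i.e. MAX_PA/2, the highest
	 * address the PTE inversion can protect on such a CPU. */
	printf("pfn limit = %#llx, half_pa = %llu GiB\n", limit, half_pa >> 30);
	return 0;
}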
1664 +diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
1665 +index 461f53d27708..fe2ee61880a8 100644
1666 +--- a/arch/x86/include/asm/smp.h
1667 ++++ b/arch/x86/include/asm/smp.h
1668 +@@ -170,7 +170,6 @@ static inline int wbinvd_on_all_cpus(void)
1669 + wbinvd();
1670 + return 0;
1671 + }
1672 +-#define smp_num_siblings 1
1673 + #endif /* CONFIG_SMP */
1674 +
1675 + extern unsigned disabled_cpus;
1676 +diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
1677 +index c1d2a9892352..453cf38a1c33 100644
1678 +--- a/arch/x86/include/asm/topology.h
1679 ++++ b/arch/x86/include/asm/topology.h
1680 +@@ -123,13 +123,17 @@ static inline int topology_max_smt_threads(void)
1681 + }
1682 +
1683 + int topology_update_package_map(unsigned int apicid, unsigned int cpu);
1684 +-extern int topology_phys_to_logical_pkg(unsigned int pkg);
1685 ++int topology_phys_to_logical_pkg(unsigned int pkg);
1686 ++bool topology_is_primary_thread(unsigned int cpu);
1687 ++bool topology_smt_supported(void);
1688 + #else
1689 + #define topology_max_packages() (1)
1690 + static inline int
1691 + topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; }
1692 + static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
1693 + static inline int topology_max_smt_threads(void) { return 1; }
1694 ++static inline bool topology_is_primary_thread(unsigned int cpu) { return true; }
1695 ++static inline bool topology_smt_supported(void) { return false; }
1696 + #endif
1697 +
1698 + static inline void arch_fix_phys_package_id(int num, u32 slot)
1699 +diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
1700 +index 7c300299e12e..08c14aec26ac 100644
1701 +--- a/arch/x86/include/asm/vmx.h
1702 ++++ b/arch/x86/include/asm/vmx.h
1703 +@@ -571,4 +571,15 @@ enum vm_instruction_error_number {
1704 + VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID = 28,
1705 + };
1706 +
1707 ++enum vmx_l1d_flush_state {
1708 ++ VMENTER_L1D_FLUSH_AUTO,
1709 ++ VMENTER_L1D_FLUSH_NEVER,
1710 ++ VMENTER_L1D_FLUSH_COND,
1711 ++ VMENTER_L1D_FLUSH_ALWAYS,
1712 ++ VMENTER_L1D_FLUSH_EPT_DISABLED,
1713 ++ VMENTER_L1D_FLUSH_NOT_REQUIRED,
1714 ++};
1715 ++
1716 ++extern enum vmx_l1d_flush_state l1tf_vmx_mitigation;
1717 ++
1718 + #endif
1719 +diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
1720 +index f48a51335538..2e64178f284d 100644
1721 +--- a/arch/x86/kernel/apic/apic.c
1722 ++++ b/arch/x86/kernel/apic/apic.c
1723 +@@ -34,6 +34,7 @@
1724 + #include <linux/dmi.h>
1725 + #include <linux/smp.h>
1726 + #include <linux/mm.h>
1727 ++#include <linux/irq.h>
1728 +
1729 + #include <asm/trace/irq_vectors.h>
1730 + #include <asm/irq_remapping.h>
1731 +@@ -56,6 +57,7 @@
1732 + #include <asm/hypervisor.h>
1733 + #include <asm/cpu_device_id.h>
1734 + #include <asm/intel-family.h>
1735 ++#include <asm/irq_regs.h>
1736 +
1737 + unsigned int num_processors;
1738 +
1739 +@@ -2092,6 +2094,23 @@ static int cpuid_to_apicid[] = {
1740 + [0 ... NR_CPUS - 1] = -1,
1741 + };
1742 +
1743 ++#ifdef CONFIG_SMP
1744 ++/**
1745 ++ * apic_id_is_primary_thread - Check whether APIC ID belongs to a primary thread
1746 ++ * @apicid: APIC ID to check
1747 ++ */
1748 ++bool apic_id_is_primary_thread(unsigned int apicid)
1749 ++{
1750 ++ u32 mask;
1751 ++
1752 ++ if (smp_num_siblings == 1)
1753 ++ return true;
1754 ++ /* Isolate the SMT bit(s) in the APICID and check for 0 */
1755 ++ mask = (1U << (fls(smp_num_siblings) - 1)) - 1;
1756 ++ return !(apicid & mask);
1757 ++}
1758 ++#endif
1759 ++
1760 + /*
1761 + * Should use this API to allocate logical CPU IDs to keep nr_logical_cpuids
1762 + * and cpuid_to_apicid[] synchronized.
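apic_id_is_primary_thread() above treats a CPU as a primary thread when the SMT bits of its APIC ID are all zero. A user-space sketch of the same mask computation, with a naive stand-in for the kernel's fls():

/*
 * Sketch of the SMT-bit mask used by apic_id_is_primary_thread(); fls32()
 * is a trivial stand-in for the kernel's fls().
 */
#include <stdio.h>

static int fls32(unsigned int x)             /* highest set bit, 1-based */
{
	int r = 0;

	while (x) {
		x >>= 1;
		r++;
	}
	return r;
}

static int is_primary_thread(unsigned int apicid, unsigned int smp_num_siblings)
{
	unsigned int mask;

	if (smp_num_siblings == 1)
		return 1;
	mask = (1U << (fls32(smp_num_siblings) - 1)) - 1;
	return !(apicid & mask);                 /* SMT bits all zero -> primary */
}

int main(void)
{
	/* Two siblings per core: mask = 1, so even APIC IDs are primary. */
	printf("%d %d %d %d\n",
	       is_primary_thread(0, 2), is_primary_thread(1, 2),
	       is_primary_thread(4, 2), is_primary_thread(5, 2));
	return 0;
}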
1763 +diff --git a/arch/x86/kernel/apic/htirq.c b/arch/x86/kernel/apic/htirq.c
1764 +index 56ccf9346b08..741de281ed5d 100644
1765 +--- a/arch/x86/kernel/apic/htirq.c
1766 ++++ b/arch/x86/kernel/apic/htirq.c
1767 +@@ -16,6 +16,8 @@
1768 + #include <linux/device.h>
1769 + #include <linux/pci.h>
1770 + #include <linux/htirq.h>
1771 ++#include <linux/irq.h>
1772 ++
1773 + #include <asm/irqdomain.h>
1774 + #include <asm/hw_irq.h>
1775 + #include <asm/apic.h>
1776 +diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
1777 +index 3b89b27945ff..96a8a68f9c79 100644
1778 +--- a/arch/x86/kernel/apic/io_apic.c
1779 ++++ b/arch/x86/kernel/apic/io_apic.c
1780 +@@ -33,6 +33,7 @@
1781 +
1782 + #include <linux/mm.h>
1783 + #include <linux/interrupt.h>
1784 ++#include <linux/irq.h>
1785 + #include <linux/init.h>
1786 + #include <linux/delay.h>
1787 + #include <linux/sched.h>
1788 +diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
1789 +index 9b18be764422..f10e7f93b0e2 100644
1790 +--- a/arch/x86/kernel/apic/msi.c
1791 ++++ b/arch/x86/kernel/apic/msi.c
1792 +@@ -12,6 +12,7 @@
1793 + */
1794 + #include <linux/mm.h>
1795 + #include <linux/interrupt.h>
1796 ++#include <linux/irq.h>
1797 + #include <linux/pci.h>
1798 + #include <linux/dmar.h>
1799 + #include <linux/hpet.h>
1800 +diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
1801 +index 2ce1c708b8ee..b958082c74a7 100644
1802 +--- a/arch/x86/kernel/apic/vector.c
1803 ++++ b/arch/x86/kernel/apic/vector.c
1804 +@@ -11,6 +11,7 @@
1805 + * published by the Free Software Foundation.
1806 + */
1807 + #include <linux/interrupt.h>
1808 ++#include <linux/irq.h>
1809 + #include <linux/init.h>
1810 + #include <linux/compiler.h>
1811 + #include <linux/slab.h>
1812 +diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
1813 +index 90574f731c05..dda741bd5789 100644
1814 +--- a/arch/x86/kernel/cpu/amd.c
1815 ++++ b/arch/x86/kernel/cpu/amd.c
1816 +@@ -298,7 +298,6 @@ static int nearby_node(int apicid)
1817 + }
1818 + #endif
1819 +
1820 +-#ifdef CONFIG_SMP
1821 + /*
1822 + * Fix up cpu_core_id for pre-F17h systems to be in the
1823 + * [0 .. cores_per_node - 1] range. Not really needed but
1824 +@@ -315,6 +314,13 @@ static void legacy_fixup_core_id(struct cpuinfo_x86 *c)
1825 + c->cpu_core_id %= cus_per_node;
1826 + }
1827 +
1828 ++
1829 ++static void amd_get_topology_early(struct cpuinfo_x86 *c)
1830 ++{
1831 ++ if (cpu_has(c, X86_FEATURE_TOPOEXT))
1832 ++ smp_num_siblings = ((cpuid_ebx(0x8000001e) >> 8) & 0xff) + 1;
1833 ++}
1834 ++
1835 + /*
1836 + * Fixup core topology information for
1837 + * (1) AMD multi-node processors
1838 +@@ -333,7 +339,6 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
1839 + cpuid(0x8000001e, &eax, &ebx, &ecx, &edx);
1840 +
1841 + node_id = ecx & 0xff;
1842 +- smp_num_siblings = ((ebx >> 8) & 0xff) + 1;
1843 +
1844 + if (c->x86 == 0x15)
1845 + c->cu_id = ebx & 0xff;
1846 +@@ -376,7 +381,6 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
1847 + legacy_fixup_core_id(c);
1848 + }
1849 + }
1850 +-#endif
1851 +
1852 + /*
1853 + * On a AMD dual core setup the lower bits of the APIC id distinguish the cores.
1854 +@@ -384,7 +388,6 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
1855 + */
1856 + static void amd_detect_cmp(struct cpuinfo_x86 *c)
1857 + {
1858 +-#ifdef CONFIG_SMP
1859 + unsigned bits;
1860 + int cpu = smp_processor_id();
1861 +
1862 +@@ -396,16 +399,11 @@ static void amd_detect_cmp(struct cpuinfo_x86 *c)
1863 + /* use socket ID also for last level cache */
1864 + per_cpu(cpu_llc_id, cpu) = c->phys_proc_id;
1865 + amd_get_topology(c);
1866 +-#endif
1867 + }
1868 +
1869 + u16 amd_get_nb_id(int cpu)
1870 + {
1871 +- u16 id = 0;
1872 +-#ifdef CONFIG_SMP
1873 +- id = per_cpu(cpu_llc_id, cpu);
1874 +-#endif
1875 +- return id;
1876 ++ return per_cpu(cpu_llc_id, cpu);
1877 + }
1878 + EXPORT_SYMBOL_GPL(amd_get_nb_id);
1879 +
1880 +@@ -579,6 +577,7 @@ static void bsp_init_amd(struct cpuinfo_x86 *c)
1881 +
1882 + static void early_init_amd(struct cpuinfo_x86 *c)
1883 + {
1884 ++ u64 value;
1885 + u32 dummy;
1886 +
1887 + early_init_amd_mc(c);
1888 +@@ -668,6 +667,22 @@ static void early_init_amd(struct cpuinfo_x86 *c)
1889 + clear_cpu_cap(c, X86_FEATURE_SME);
1890 + }
1891 + }
1892 ++
1893 ++ /* Re-enable TopologyExtensions if switched off by BIOS */
1894 ++ if (c->x86 == 0x15 &&
1895 ++ (c->x86_model >= 0x10 && c->x86_model <= 0x6f) &&
1896 ++ !cpu_has(c, X86_FEATURE_TOPOEXT)) {
1897 ++
1898 ++ if (msr_set_bit(0xc0011005, 54) > 0) {
1899 ++ rdmsrl(0xc0011005, value);
1900 ++ if (value & BIT_64(54)) {
1901 ++ set_cpu_cap(c, X86_FEATURE_TOPOEXT);
1902 ++ pr_info_once(FW_INFO "CPU: Re-enabling disabled Topology Extensions Support.\n");
1903 ++ }
1904 ++ }
1905 ++ }
1906 ++
1907 ++ amd_get_topology_early(c);
1908 + }
1909 +
1910 + static void init_amd_k8(struct cpuinfo_x86 *c)
1911 +@@ -759,19 +774,6 @@ static void init_amd_bd(struct cpuinfo_x86 *c)
1912 + {
1913 + u64 value;
1914 +
1915 +- /* re-enable TopologyExtensions if switched off by BIOS */
1916 +- if ((c->x86_model >= 0x10) && (c->x86_model <= 0x6f) &&
1917 +- !cpu_has(c, X86_FEATURE_TOPOEXT)) {
1918 +-
1919 +- if (msr_set_bit(0xc0011005, 54) > 0) {
1920 +- rdmsrl(0xc0011005, value);
1921 +- if (value & BIT_64(54)) {
1922 +- set_cpu_cap(c, X86_FEATURE_TOPOEXT);
1923 +- pr_info_once(FW_INFO "CPU: Re-enabling disabled Topology Extensions Support.\n");
1924 +- }
1925 +- }
1926 +- }
1927 +-
1928 + /*
1929 + * The way access filter has a performance penalty on some workloads.
1930 + * Disable it on the affected CPUs.
1931 +@@ -835,15 +837,8 @@ static void init_amd(struct cpuinfo_x86 *c)
1932 +
1933 + cpu_detect_cache_sizes(c);
1934 +
1935 +- /* Multi core CPU? */
1936 +- if (c->extended_cpuid_level >= 0x80000008) {
1937 +- amd_detect_cmp(c);
1938 +- srat_detect_node(c);
1939 +- }
1940 +-
1941 +-#ifdef CONFIG_X86_32
1942 +- detect_ht(c);
1943 +-#endif
1944 ++ amd_detect_cmp(c);
1945 ++ srat_detect_node(c);
1946 +
1947 + init_amd_cacheinfo(c);
1948 +
1949 +diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
1950 +index 7416fc206b4a..edfc64a8a154 100644
1951 +--- a/arch/x86/kernel/cpu/bugs.c
1952 ++++ b/arch/x86/kernel/cpu/bugs.c
1953 +@@ -22,14 +22,17 @@
1954 + #include <asm/processor-flags.h>
1955 + #include <asm/fpu/internal.h>
1956 + #include <asm/msr.h>
1957 ++#include <asm/vmx.h>
1958 + #include <asm/paravirt.h>
1959 + #include <asm/alternative.h>
1960 + #include <asm/pgtable.h>
1961 + #include <asm/set_memory.h>
1962 + #include <asm/intel-family.h>
1963 ++#include <asm/e820/api.h>
1964 +
1965 + static void __init spectre_v2_select_mitigation(void);
1966 + static void __init ssb_select_mitigation(void);
1967 ++static void __init l1tf_select_mitigation(void);
1968 +
1969 + /*
1970 + * Our boot-time value of the SPEC_CTRL MSR. We read it once so that any
1971 +@@ -55,6 +58,12 @@ void __init check_bugs(void)
1972 + {
1973 + identify_boot_cpu();
1974 +
1975 ++ /*
1976 ++ * identify_boot_cpu() initialized SMT support information, let the
1977 ++ * core code know.
1978 ++ */
1979 ++ cpu_smt_check_topology_early();
1980 ++
1981 + if (!IS_ENABLED(CONFIG_SMP)) {
1982 + pr_info("CPU: ");
1983 + print_cpu_info(&boot_cpu_data);
1984 +@@ -81,6 +90,8 @@ void __init check_bugs(void)
1985 + */
1986 + ssb_select_mitigation();
1987 +
1988 ++ l1tf_select_mitigation();
1989 ++
1990 + #ifdef CONFIG_X86_32
1991 + /*
1992 + * Check whether we are able to run this kernel safely on SMP.
1993 +@@ -311,23 +322,6 @@ static enum spectre_v2_mitigation_cmd __init spectre_v2_parse_cmdline(void)
1994 + return cmd;
1995 + }
1996 +
1997 +-/* Check for Skylake-like CPUs (for RSB handling) */
1998 +-static bool __init is_skylake_era(void)
1999 +-{
2000 +- if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
2001 +- boot_cpu_data.x86 == 6) {
2002 +- switch (boot_cpu_data.x86_model) {
2003 +- case INTEL_FAM6_SKYLAKE_MOBILE:
2004 +- case INTEL_FAM6_SKYLAKE_DESKTOP:
2005 +- case INTEL_FAM6_SKYLAKE_X:
2006 +- case INTEL_FAM6_KABYLAKE_MOBILE:
2007 +- case INTEL_FAM6_KABYLAKE_DESKTOP:
2008 +- return true;
2009 +- }
2010 +- }
2011 +- return false;
2012 +-}
2013 +-
2014 + static void __init spectre_v2_select_mitigation(void)
2015 + {
2016 + enum spectre_v2_mitigation_cmd cmd = spectre_v2_parse_cmdline();
2017 +@@ -388,22 +382,15 @@ retpoline_auto:
2018 + pr_info("%s\n", spectre_v2_strings[mode]);
2019 +
2020 + /*
2021 +- * If neither SMEP nor PTI are available, there is a risk of
2022 +- * hitting userspace addresses in the RSB after a context switch
2023 +- * from a shallow call stack to a deeper one. To prevent this fill
2024 +- * the entire RSB, even when using IBRS.
2025 ++ * If spectre v2 protection has been enabled, unconditionally fill
2026 ++ * RSB during a context switch; this protects against two independent
2027 ++ * issues:
2028 + *
2029 +- * Skylake era CPUs have a separate issue with *underflow* of the
2030 +- * RSB, when they will predict 'ret' targets from the generic BTB.
2031 +- * The proper mitigation for this is IBRS. If IBRS is not supported
2032 +- * or deactivated in favour of retpolines the RSB fill on context
2033 +- * switch is required.
2034 ++ * - RSB underflow (and switch to BTB) on Skylake+
2035 ++ * - SpectreRSB variant of spectre v2 on X86_BUG_SPECTRE_V2 CPUs
2036 + */
2037 +- if ((!boot_cpu_has(X86_FEATURE_PTI) &&
2038 +- !boot_cpu_has(X86_FEATURE_SMEP)) || is_skylake_era()) {
2039 +- setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
2040 +- pr_info("Spectre v2 mitigation: Filling RSB on context switch\n");
2041 +- }
2042 ++ setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
2043 ++ pr_info("Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch\n");
2044 +
2045 + /* Initialize Indirect Branch Prediction Barrier if supported */
2046 + if (boot_cpu_has(X86_FEATURE_IBPB)) {
2047 +@@ -654,8 +641,121 @@ void x86_spec_ctrl_setup_ap(void)
2048 + x86_amd_ssb_disable();
2049 + }
2050 +
2051 ++#undef pr_fmt
2052 ++#define pr_fmt(fmt) "L1TF: " fmt
2053 ++
2054 ++/* Default mitigation for L1TF-affected CPUs */
2055 ++enum l1tf_mitigations l1tf_mitigation __ro_after_init = L1TF_MITIGATION_FLUSH;
2056 ++#if IS_ENABLED(CONFIG_KVM_INTEL)
2057 ++EXPORT_SYMBOL_GPL(l1tf_mitigation);
2058 ++
2059 ++enum vmx_l1d_flush_state l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
2060 ++EXPORT_SYMBOL_GPL(l1tf_vmx_mitigation);
2061 ++#endif
2062 ++
2063 ++static void __init l1tf_select_mitigation(void)
2064 ++{
2065 ++ u64 half_pa;
2066 ++
2067 ++ if (!boot_cpu_has_bug(X86_BUG_L1TF))
2068 ++ return;
2069 ++
2070 ++ switch (l1tf_mitigation) {
2071 ++ case L1TF_MITIGATION_OFF:
2072 ++ case L1TF_MITIGATION_FLUSH_NOWARN:
2073 ++ case L1TF_MITIGATION_FLUSH:
2074 ++ break;
2075 ++ case L1TF_MITIGATION_FLUSH_NOSMT:
2076 ++ case L1TF_MITIGATION_FULL:
2077 ++ cpu_smt_disable(false);
2078 ++ break;
2079 ++ case L1TF_MITIGATION_FULL_FORCE:
2080 ++ cpu_smt_disable(true);
2081 ++ break;
2082 ++ }
2083 ++
2084 ++#if CONFIG_PGTABLE_LEVELS == 2
2085 ++ pr_warn("Kernel not compiled for PAE. No mitigation for L1TF\n");
2086 ++ return;
2087 ++#endif
2088 ++
2089 ++ /*
2090 ++ * This is extremely unlikely to happen because on almost all
2091 ++ * systems MAX_PA/2 is far larger than the amount of RAM that
2092 ++ * can be fit into the DIMM slots.
2093 ++ */
2094 ++ half_pa = (u64)l1tf_pfn_limit() << PAGE_SHIFT;
2095 ++ if (e820__mapped_any(half_pa, ULLONG_MAX - half_pa, E820_TYPE_RAM)) {
2096 ++ pr_warn("System has more than MAX_PA/2 memory. L1TF mitigation not effective.\n");
2097 ++ return;
2098 ++ }
2099 ++
2100 ++ setup_force_cpu_cap(X86_FEATURE_L1TF_PTEINV);
2101 ++}
2102 ++
2103 ++static int __init l1tf_cmdline(char *str)
2104 ++{
2105 ++ if (!boot_cpu_has_bug(X86_BUG_L1TF))
2106 ++ return 0;
2107 ++
2108 ++ if (!str)
2109 ++ return -EINVAL;
2110 ++
2111 ++ if (!strcmp(str, "off"))
2112 ++ l1tf_mitigation = L1TF_MITIGATION_OFF;
2113 ++ else if (!strcmp(str, "flush,nowarn"))
2114 ++ l1tf_mitigation = L1TF_MITIGATION_FLUSH_NOWARN;
2115 ++ else if (!strcmp(str, "flush"))
2116 ++ l1tf_mitigation = L1TF_MITIGATION_FLUSH;
2117 ++ else if (!strcmp(str, "flush,nosmt"))
2118 ++ l1tf_mitigation = L1TF_MITIGATION_FLUSH_NOSMT;
2119 ++ else if (!strcmp(str, "full"))
2120 ++ l1tf_mitigation = L1TF_MITIGATION_FULL;
2121 ++ else if (!strcmp(str, "full,force"))
2122 ++ l1tf_mitigation = L1TF_MITIGATION_FULL_FORCE;
2123 ++
2124 ++ return 0;
2125 ++}
2126 ++early_param("l1tf", l1tf_cmdline);
2127 ++
2128 ++#undef pr_fmt
2129 ++
2130 + #ifdef CONFIG_SYSFS
2131 +
2132 ++#define L1TF_DEFAULT_MSG "Mitigation: PTE Inversion"
2133 ++
2134 ++#if IS_ENABLED(CONFIG_KVM_INTEL)
2135 ++static const char *l1tf_vmx_states[] = {
2136 ++ [VMENTER_L1D_FLUSH_AUTO] = "auto",
2137 ++ [VMENTER_L1D_FLUSH_NEVER] = "vulnerable",
2138 ++ [VMENTER_L1D_FLUSH_COND] = "conditional cache flushes",
2139 ++ [VMENTER_L1D_FLUSH_ALWAYS] = "cache flushes",
2140 ++ [VMENTER_L1D_FLUSH_EPT_DISABLED] = "EPT disabled",
2141 ++ [VMENTER_L1D_FLUSH_NOT_REQUIRED] = "flush not necessary"
2142 ++};
2143 ++
2144 ++static ssize_t l1tf_show_state(char *buf)
2145 ++{
2146 ++ if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_AUTO)
2147 ++ return sprintf(buf, "%s\n", L1TF_DEFAULT_MSG);
2148 ++
2149 ++ if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_EPT_DISABLED ||
2150 ++ (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_NEVER &&
2151 ++ cpu_smt_control == CPU_SMT_ENABLED))
2152 ++ return sprintf(buf, "%s; VMX: %s\n", L1TF_DEFAULT_MSG,
2153 ++ l1tf_vmx_states[l1tf_vmx_mitigation]);
2154 ++
2155 ++ return sprintf(buf, "%s; VMX: %s, SMT %s\n", L1TF_DEFAULT_MSG,
2156 ++ l1tf_vmx_states[l1tf_vmx_mitigation],
2157 ++ cpu_smt_control == CPU_SMT_ENABLED ? "vulnerable" : "disabled");
2158 ++}
2159 ++#else
2160 ++static ssize_t l1tf_show_state(char *buf)
2161 ++{
2162 ++ return sprintf(buf, "%s\n", L1TF_DEFAULT_MSG);
2163 ++}
2164 ++#endif
2165 ++
2166 + static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr,
2167 + char *buf, unsigned int bug)
2168 + {
2169 +@@ -681,6 +781,10 @@ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr
2170 + case X86_BUG_SPEC_STORE_BYPASS:
2171 + return sprintf(buf, "%s\n", ssb_strings[ssb_mode]);
2172 +
2173 ++ case X86_BUG_L1TF:
2174 ++ if (boot_cpu_has(X86_FEATURE_L1TF_PTEINV))
2175 ++ return l1tf_show_state(buf);
2176 ++ break;
2177 + default:
2178 + break;
2179 + }
2180 +@@ -707,4 +811,9 @@ ssize_t cpu_show_spec_store_bypass(struct device *dev, struct device_attribute *
2181 + {
2182 + return cpu_show_common(dev, attr, buf, X86_BUG_SPEC_STORE_BYPASS);
2183 + }
2184 ++
2185 ++ssize_t cpu_show_l1tf(struct device *dev, struct device_attribute *attr, char *buf)
2186 ++{
2187 ++ return cpu_show_common(dev, attr, buf, X86_BUG_L1TF);
2188 ++}
2189 + #endif
2190 +diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
2191 +index 48e98964ecad..dd02ee4fa8cd 100644
2192 +--- a/arch/x86/kernel/cpu/common.c
2193 ++++ b/arch/x86/kernel/cpu/common.c
2194 +@@ -66,6 +66,13 @@ cpumask_var_t cpu_callin_mask;
2195 + /* representing cpus for which sibling maps can be computed */
2196 + cpumask_var_t cpu_sibling_setup_mask;
2197 +
2198 ++/* Number of siblings per CPU package */
2199 ++int smp_num_siblings = 1;
2200 ++EXPORT_SYMBOL(smp_num_siblings);
2201 ++
2202 ++/* Last level cache ID of each logical CPU */
2203 ++DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id) = BAD_APICID;
2204 ++
2205 + /* correctly size the local cpu masks */
2206 + void __init setup_cpu_local_masks(void)
2207 + {
2208 +@@ -614,33 +621,36 @@ static void cpu_detect_tlb(struct cpuinfo_x86 *c)
2209 + tlb_lld_4m[ENTRIES], tlb_lld_1g[ENTRIES]);
2210 + }
2211 +
2212 +-void detect_ht(struct cpuinfo_x86 *c)
2213 ++int detect_ht_early(struct cpuinfo_x86 *c)
2214 + {
2215 + #ifdef CONFIG_SMP
2216 + u32 eax, ebx, ecx, edx;
2217 +- int index_msb, core_bits;
2218 +- static bool printed;
2219 +
2220 + if (!cpu_has(c, X86_FEATURE_HT))
2221 +- return;
2222 ++ return -1;
2223 +
2224 + if (cpu_has(c, X86_FEATURE_CMP_LEGACY))
2225 +- goto out;
2226 ++ return -1;
2227 +
2228 + if (cpu_has(c, X86_FEATURE_XTOPOLOGY))
2229 +- return;
2230 ++ return -1;
2231 +
2232 + cpuid(1, &eax, &ebx, &ecx, &edx);
2233 +
2234 + smp_num_siblings = (ebx & 0xff0000) >> 16;
2235 +-
2236 +- if (smp_num_siblings == 1) {
2237 ++ if (smp_num_siblings == 1)
2238 + pr_info_once("CPU0: Hyper-Threading is disabled\n");
2239 +- goto out;
2240 +- }
2241 ++#endif
2242 ++ return 0;
2243 ++}
2244 +
2245 +- if (smp_num_siblings <= 1)
2246 +- goto out;
2247 ++void detect_ht(struct cpuinfo_x86 *c)
2248 ++{
2249 ++#ifdef CONFIG_SMP
2250 ++ int index_msb, core_bits;
2251 ++
2252 ++ if (detect_ht_early(c) < 0)
2253 ++ return;
2254 +
2255 + index_msb = get_count_order(smp_num_siblings);
2256 + c->phys_proc_id = apic->phys_pkg_id(c->initial_apicid, index_msb);
2257 +@@ -653,15 +663,6 @@ void detect_ht(struct cpuinfo_x86 *c)
2258 +
2259 + c->cpu_core_id = apic->phys_pkg_id(c->initial_apicid, index_msb) &
2260 + ((1 << core_bits) - 1);
2261 +-
2262 +-out:
2263 +- if (!printed && (c->x86_max_cores * smp_num_siblings) > 1) {
2264 +- pr_info("CPU: Physical Processor ID: %d\n",
2265 +- c->phys_proc_id);
2266 +- pr_info("CPU: Processor Core ID: %d\n",
2267 +- c->cpu_core_id);
2268 +- printed = 1;
2269 +- }
2270 + #endif
2271 + }
2272 +
2273 +@@ -933,6 +934,21 @@ static const __initconst struct x86_cpu_id cpu_no_spec_store_bypass[] = {
2274 + {}
2275 + };
2276 +
2277 ++static const __initconst struct x86_cpu_id cpu_no_l1tf[] = {
2278 ++ /* in addition to cpu_no_speculation */
2279 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT1 },
2280 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT2 },
2281 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_AIRMONT },
2282 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_MERRIFIELD },
2283 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_MOOREFIELD },
2284 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_GOLDMONT },
2285 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_DENVERTON },
2286 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_GEMINI_LAKE },
2287 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNL },
2288 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNM },
2289 ++ {}
2290 ++};
2291 ++
2292 + static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
2293 + {
2294 + u64 ia32_cap = 0;
2295 +@@ -958,6 +974,11 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
2296 + return;
2297 +
2298 + setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN);
2299 ++
2300 ++ if (x86_match_cpu(cpu_no_l1tf))
2301 ++ return;
2302 ++
2303 ++ setup_force_cpu_bug(X86_BUG_L1TF);
2304 + }
2305 +
2306 + /*
2307 +diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h
2308 +index 37672d299e35..cca588407dca 100644
2309 +--- a/arch/x86/kernel/cpu/cpu.h
2310 ++++ b/arch/x86/kernel/cpu/cpu.h
2311 +@@ -47,6 +47,8 @@ extern const struct cpu_dev *const __x86_cpu_dev_start[],
2312 +
2313 + extern void get_cpu_cap(struct cpuinfo_x86 *c);
2314 + extern void cpu_detect_cache_sizes(struct cpuinfo_x86 *c);
2315 ++extern int detect_extended_topology_early(struct cpuinfo_x86 *c);
2316 ++extern int detect_ht_early(struct cpuinfo_x86 *c);
2317 +
2318 + unsigned int aperfmperf_get_khz(int cpu);
2319 +
2320 +diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
2321 +index 0b2330e19169..278be092b300 100644
2322 +--- a/arch/x86/kernel/cpu/intel.c
2323 ++++ b/arch/x86/kernel/cpu/intel.c
2324 +@@ -301,6 +301,13 @@ static void early_init_intel(struct cpuinfo_x86 *c)
2325 + }
2326 +
2327 + check_mpx_erratum(c);
2328 ++
2329 ++ /*
2330 ++ * Get the number of SMT siblings early from the extended topology
2331 ++ * leaf, if available. Otherwise try the legacy SMT detection.
2332 ++ */
2333 ++ if (detect_extended_topology_early(c) < 0)
2334 ++ detect_ht_early(c);
2335 + }
2336 +
2337 + #ifdef CONFIG_X86_32
2338 +diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c
2339 +index 4fc0e08a30b9..387a8f44fba1 100644
2340 +--- a/arch/x86/kernel/cpu/microcode/core.c
2341 ++++ b/arch/x86/kernel/cpu/microcode/core.c
2342 +@@ -509,12 +509,20 @@ static struct platform_device *microcode_pdev;
2343 +
2344 + static int check_online_cpus(void)
2345 + {
2346 +- if (num_online_cpus() == num_present_cpus())
2347 +- return 0;
2348 ++ unsigned int cpu;
2349 +
2350 +- pr_err("Not all CPUs online, aborting microcode update.\n");
2351 ++ /*
2352 ++ * Make sure all CPUs are online. It's fine for SMT to be disabled if
2353 ++ * all the primary threads are still online.
2354 ++ */
2355 ++ for_each_present_cpu(cpu) {
2356 ++ if (topology_is_primary_thread(cpu) && !cpu_online(cpu)) {
2357 ++ pr_err("Not all CPUs online, aborting microcode update.\n");
2358 ++ return -EINVAL;
2359 ++ }
2360 ++ }
2361 +
2362 +- return -EINVAL;
2363 ++ return 0;
2364 + }
2365 +
2366 + static atomic_t late_cpus_in;
2367 +diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
2368 +index b099024d339c..19c6e800e816 100644
2369 +--- a/arch/x86/kernel/cpu/topology.c
2370 ++++ b/arch/x86/kernel/cpu/topology.c
2371 +@@ -27,16 +27,13 @@
2372 + * exists, use it for populating initial_apicid and cpu topology
2373 + * detection.
2374 + */
2375 +-void detect_extended_topology(struct cpuinfo_x86 *c)
2376 ++int detect_extended_topology_early(struct cpuinfo_x86 *c)
2377 + {
2378 + #ifdef CONFIG_SMP
2379 +- unsigned int eax, ebx, ecx, edx, sub_index;
2380 +- unsigned int ht_mask_width, core_plus_mask_width;
2381 +- unsigned int core_select_mask, core_level_siblings;
2382 +- static bool printed;
2383 ++ unsigned int eax, ebx, ecx, edx;
2384 +
2385 + if (c->cpuid_level < 0xb)
2386 +- return;
2387 ++ return -1;
2388 +
2389 + cpuid_count(0xb, SMT_LEVEL, &eax, &ebx, &ecx, &edx);
2390 +
2391 +@@ -44,7 +41,7 @@ void detect_extended_topology(struct cpuinfo_x86 *c)
2392 + * check if the cpuid leaf 0xb is actually implemented.
2393 + */
2394 + if (ebx == 0 || (LEAFB_SUBTYPE(ecx) != SMT_TYPE))
2395 +- return;
2396 ++ return -1;
2397 +
2398 + set_cpu_cap(c, X86_FEATURE_XTOPOLOGY);
2399 +
2400 +@@ -52,10 +49,30 @@ void detect_extended_topology(struct cpuinfo_x86 *c)
2401 + * initial apic id, which also represents 32-bit extended x2apic id.
2402 + */
2403 + c->initial_apicid = edx;
2404 ++ smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);
2405 ++#endif
2406 ++ return 0;
2407 ++}
2408 ++
2409 ++/*
2410 ++ * Check for extended topology enumeration cpuid leaf 0xb and if it
2411 ++ * exists, use it for populating initial_apicid and cpu topology
2412 ++ * detection.
2413 ++ */
2414 ++void detect_extended_topology(struct cpuinfo_x86 *c)
2415 ++{
2416 ++#ifdef CONFIG_SMP
2417 ++ unsigned int eax, ebx, ecx, edx, sub_index;
2418 ++ unsigned int ht_mask_width, core_plus_mask_width;
2419 ++ unsigned int core_select_mask, core_level_siblings;
2420 ++
2421 ++ if (detect_extended_topology_early(c) < 0)
2422 ++ return;
2423 +
2424 + /*
2425 + * Populate HT related information from sub-leaf level 0.
2426 + */
2427 ++ cpuid_count(0xb, SMT_LEVEL, &eax, &ebx, &ecx, &edx);
2428 + core_level_siblings = smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);
2429 + core_plus_mask_width = ht_mask_width = BITS_SHIFT_NEXT_LEVEL(eax);
2430 +
2431 +@@ -86,15 +103,5 @@ void detect_extended_topology(struct cpuinfo_x86 *c)
2432 + c->apicid = apic->phys_pkg_id(c->initial_apicid, 0);
2433 +
2434 + c->x86_max_cores = (core_level_siblings / smp_num_siblings);
2435 +-
2436 +- if (!printed) {
2437 +- pr_info("CPU: Physical Processor ID: %d\n",
2438 +- c->phys_proc_id);
2439 +- if (c->x86_max_cores > 1)
2440 +- pr_info("CPU: Processor Core ID: %d\n",
2441 +- c->cpu_core_id);
2442 +- printed = 1;
2443 +- }
2444 +- return;
2445 + #endif
2446 + }
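detect_extended_topology_early() above pulls the sibling count from CPUID leaf 0xb, sub-leaf 0. A user-space approximation using GCC/clang's <cpuid.h>; this is illustrative only, the kernel uses its own cpuid_count() wrappers and LEAFB_SUBTYPE()/LEVEL_MAX_SIBLINGS() helpers:

/*
 * User-space sketch of the early SMT sibling-count read from CPUID
 * leaf 0xb, sub-leaf 0.
 */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(0xb, 0, &eax, &ebx, &ecx, &edx))
		return 1;                      /* leaf 0xb not implemented */
	if (ebx == 0 || ((ecx >> 8) & 0xff) != 1)
		return 1;                      /* sub-leaf 0 is not the SMT level */

	/* EBX[15:0]: logical processors at the SMT level (smp_num_siblings) */
	printf("siblings per core: %u, x2APIC id: %u\n", ebx & 0xffff, edx);
	return 0;
}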
2447 +diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
2448 +index f92a6593de1e..2ea85b32421a 100644
2449 +--- a/arch/x86/kernel/fpu/core.c
2450 ++++ b/arch/x86/kernel/fpu/core.c
2451 +@@ -10,6 +10,7 @@
2452 + #include <asm/fpu/signal.h>
2453 + #include <asm/fpu/types.h>
2454 + #include <asm/traps.h>
2455 ++#include <asm/irq_regs.h>
2456 +
2457 + #include <linux/hardirq.h>
2458 + #include <linux/pkeys.h>
2459 +diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
2460 +index 01ebcb6f263e..7acb87cb2da8 100644
2461 +--- a/arch/x86/kernel/ftrace.c
2462 ++++ b/arch/x86/kernel/ftrace.c
2463 +@@ -27,6 +27,7 @@
2464 +
2465 + #include <asm/set_memory.h>
2466 + #include <asm/kprobes.h>
2467 ++#include <asm/sections.h>
2468 + #include <asm/ftrace.h>
2469 + #include <asm/nops.h>
2470 +
2471 +diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
2472 +index 8ce4212e2b8d..afa1a204bc6d 100644
2473 +--- a/arch/x86/kernel/hpet.c
2474 ++++ b/arch/x86/kernel/hpet.c
2475 +@@ -1,6 +1,7 @@
2476 + #include <linux/clocksource.h>
2477 + #include <linux/clockchips.h>
2478 + #include <linux/interrupt.h>
2479 ++#include <linux/irq.h>
2480 + #include <linux/export.h>
2481 + #include <linux/delay.h>
2482 + #include <linux/errno.h>
2483 +diff --git a/arch/x86/kernel/i8259.c b/arch/x86/kernel/i8259.c
2484 +index 8f5cb2c7060c..02abc134367f 100644
2485 +--- a/arch/x86/kernel/i8259.c
2486 ++++ b/arch/x86/kernel/i8259.c
2487 +@@ -5,6 +5,7 @@
2488 + #include <linux/sched.h>
2489 + #include <linux/ioport.h>
2490 + #include <linux/interrupt.h>
2491 ++#include <linux/irq.h>
2492 + #include <linux/timex.h>
2493 + #include <linux/random.h>
2494 + #include <linux/init.h>
2495 +diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
2496 +index 0c5256653d6c..38c3d5790970 100644
2497 +--- a/arch/x86/kernel/idt.c
2498 ++++ b/arch/x86/kernel/idt.c
2499 +@@ -8,6 +8,7 @@
2500 + #include <asm/traps.h>
2501 + #include <asm/proto.h>
2502 + #include <asm/desc.h>
2503 ++#include <asm/hw_irq.h>
2504 +
2505 + struct idt_data {
2506 + unsigned int vector;
2507 +diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
2508 +index aa9d51eea9d0..3c2326b59820 100644
2509 +--- a/arch/x86/kernel/irq.c
2510 ++++ b/arch/x86/kernel/irq.c
2511 +@@ -10,6 +10,7 @@
2512 + #include <linux/ftrace.h>
2513 + #include <linux/delay.h>
2514 + #include <linux/export.h>
2515 ++#include <linux/irq.h>
2516 +
2517 + #include <asm/apic.h>
2518 + #include <asm/io_apic.h>
2519 +diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
2520 +index c1bdbd3d3232..95600a99ae93 100644
2521 +--- a/arch/x86/kernel/irq_32.c
2522 ++++ b/arch/x86/kernel/irq_32.c
2523 +@@ -11,6 +11,7 @@
2524 +
2525 + #include <linux/seq_file.h>
2526 + #include <linux/interrupt.h>
2527 ++#include <linux/irq.h>
2528 + #include <linux/kernel_stat.h>
2529 + #include <linux/notifier.h>
2530 + #include <linux/cpu.h>
2531 +diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
2532 +index d86e344f5b3d..0469cd078db1 100644
2533 +--- a/arch/x86/kernel/irq_64.c
2534 ++++ b/arch/x86/kernel/irq_64.c
2535 +@@ -11,6 +11,7 @@
2536 +
2537 + #include <linux/kernel_stat.h>
2538 + #include <linux/interrupt.h>
2539 ++#include <linux/irq.h>
2540 + #include <linux/seq_file.h>
2541 + #include <linux/delay.h>
2542 + #include <linux/ftrace.h>
2543 +diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
2544 +index 1e4094eba15e..40f83d0d7b8a 100644
2545 +--- a/arch/x86/kernel/irqinit.c
2546 ++++ b/arch/x86/kernel/irqinit.c
2547 +@@ -5,6 +5,7 @@
2548 + #include <linux/sched.h>
2549 + #include <linux/ioport.h>
2550 + #include <linux/interrupt.h>
2551 ++#include <linux/irq.h>
2552 + #include <linux/timex.h>
2553 + #include <linux/random.h>
2554 + #include <linux/kprobes.h>
2555 +diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
2556 +index f1030c522e06..65452d555f05 100644
2557 +--- a/arch/x86/kernel/kprobes/core.c
2558 ++++ b/arch/x86/kernel/kprobes/core.c
2559 +@@ -63,6 +63,7 @@
2560 + #include <asm/insn.h>
2561 + #include <asm/debugreg.h>
2562 + #include <asm/set_memory.h>
2563 ++#include <asm/sections.h>
2564 +
2565 + #include "common.h"
2566 +
2567 +@@ -394,8 +395,6 @@ int __copy_instruction(u8 *dest, u8 *src, struct insn *insn)
2568 + - (u8 *) dest;
2569 + if ((s64) (s32) newdisp != newdisp) {
2570 + pr_err("Kprobes error: new displacement does not fit into s32 (%llx)\n", newdisp);
2571 +- pr_err("\tSrc: %p, Dest: %p, old disp: %x\n",
2572 +- src, dest, insn->displacement.value);
2573 + return 0;
2574 + }
2575 + disp = (u8 *) dest + insn_offset_displacement(insn);
2576 +@@ -621,8 +620,7 @@ static int reenter_kprobe(struct kprobe *p, struct pt_regs *regs,
2577 + * Raise a BUG or we'll continue in an endless reentering loop
2578 + * and eventually a stack overflow.
2579 + */
2580 +- printk(KERN_WARNING "Unrecoverable kprobe detected at %p.\n",
2581 +- p->addr);
2582 ++ pr_err("Unrecoverable kprobe detected.\n");
2583 + dump_kprobe(p);
2584 + BUG();
2585 + default:
2586 +diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
2587 +index e1df9ef5d78c..f3559b84cd75 100644
2588 +--- a/arch/x86/kernel/paravirt.c
2589 ++++ b/arch/x86/kernel/paravirt.c
2590 +@@ -88,10 +88,12 @@ unsigned paravirt_patch_call(void *insnbuf,
2591 + struct branch *b = insnbuf;
2592 + unsigned long delta = (unsigned long)target - (addr+5);
2593 +
2594 +- if (tgt_clobbers & ~site_clobbers)
2595 +- return len; /* target would clobber too much for this site */
2596 +- if (len < 5)
2597 ++ if (len < 5) {
2598 ++#ifdef CONFIG_RETPOLINE
2599 ++ WARN_ONCE("Failing to patch indirect CALL in %ps\n", (void *)addr);
2600 ++#endif
2601 + return len; /* call too long for patch site */
2602 ++ }
2603 +
2604 + b->opcode = 0xe8; /* call */
2605 + b->delta = delta;
2606 +@@ -106,8 +108,12 @@ unsigned paravirt_patch_jmp(void *insnbuf, const void *target,
2607 + struct branch *b = insnbuf;
2608 + unsigned long delta = (unsigned long)target - (addr+5);
2609 +
2610 +- if (len < 5)
2611 ++ if (len < 5) {
2612 ++#ifdef CONFIG_RETPOLINE
2613 ++ WARN_ONCE("Failing to patch indirect JMP in %ps\n", (void *)addr);
2614 ++#endif
2615 + return len; /* call too long for patch site */
2616 ++ }
2617 +
2618 + b->opcode = 0xe9; /* jmp */
2619 + b->delta = delta;
2620 +diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
2621 +index efbcf5283520..dcb00acb6583 100644
2622 +--- a/arch/x86/kernel/setup.c
2623 ++++ b/arch/x86/kernel/setup.c
2624 +@@ -852,6 +852,12 @@ void __init setup_arch(char **cmdline_p)
2625 + memblock_reserve(__pa_symbol(_text),
2626 + (unsigned long)__bss_stop - (unsigned long)_text);
2627 +
2628 ++ /*
2629 ++ * Make sure page 0 is always reserved because on systems with
2630 ++ * L1TF its contents can be leaked to user processes.
2631 ++ */
2632 ++ memblock_reserve(0, PAGE_SIZE);
2633 ++
2634 + early_reserve_initrd();
2635 +
2636 + /*
2637 +diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
2638 +index 5c574dff4c1a..04adc8d60aed 100644
2639 +--- a/arch/x86/kernel/smp.c
2640 ++++ b/arch/x86/kernel/smp.c
2641 +@@ -261,6 +261,7 @@ __visible void __irq_entry smp_reschedule_interrupt(struct pt_regs *regs)
2642 + {
2643 + ack_APIC_irq();
2644 + inc_irq_stat(irq_resched_count);
2645 ++ kvm_set_cpu_l1tf_flush_l1d();
2646 +
2647 + if (trace_resched_ipi_enabled()) {
2648 + /*
2649 +diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
2650 +index 344d3c160f8d..5ebb0dbcf4f7 100644
2651 +--- a/arch/x86/kernel/smpboot.c
2652 ++++ b/arch/x86/kernel/smpboot.c
2653 +@@ -78,13 +78,7 @@
2654 + #include <asm/realmode.h>
2655 + #include <asm/misc.h>
2656 + #include <asm/spec-ctrl.h>
2657 +-
2658 +-/* Number of siblings per CPU package */
2659 +-int smp_num_siblings = 1;
2660 +-EXPORT_SYMBOL(smp_num_siblings);
2661 +-
2662 +-/* Last level cache ID of each logical CPU */
2663 +-DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id) = BAD_APICID;
2664 ++#include <asm/hw_irq.h>
2665 +
2666 + /* representing HT siblings of each logical CPU */
2667 + DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
2668 +@@ -311,6 +305,23 @@ found:
2669 + return 0;
2670 + }
2671 +
2672 ++/**
2673 ++ * topology_is_primary_thread - Check whether CPU is the primary SMT thread
2674 ++ * @cpu: CPU to check
2675 ++ */
2676 ++bool topology_is_primary_thread(unsigned int cpu)
2677 ++{
2678 ++ return apic_id_is_primary_thread(per_cpu(x86_cpu_to_apicid, cpu));
2679 ++}
2680 ++
2681 ++/**
2682 ++ * topology_smt_supported - Check whether SMT is supported by the CPUs
2683 ++ */
2684 ++bool topology_smt_supported(void)
2685 ++{
2686 ++ return smp_num_siblings > 1;
2687 ++}
2688 ++
2689 + /**
2690 + * topology_phys_to_logical_pkg - Map a physical package id to a logical
2691 + *
2692 +diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c
2693 +index 879af864d99a..49a5c394f3ed 100644
2694 +--- a/arch/x86/kernel/time.c
2695 ++++ b/arch/x86/kernel/time.c
2696 +@@ -12,6 +12,7 @@
2697 +
2698 + #include <linux/clockchips.h>
2699 + #include <linux/interrupt.h>
2700 ++#include <linux/irq.h>
2701 + #include <linux/i8253.h>
2702 + #include <linux/time.h>
2703 + #include <linux/export.h>
2704 +diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
2705 +index 2ef2f1fe875b..00e2ae033a0f 100644
2706 +--- a/arch/x86/kvm/mmu.c
2707 ++++ b/arch/x86/kvm/mmu.c
2708 +@@ -3825,6 +3825,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
2709 + {
2710 + int r = 1;
2711 +
2712 ++ vcpu->arch.l1tf_flush_l1d = true;
2713 + switch (vcpu->arch.apf.host_apf_reason) {
2714 + default:
2715 + trace_kvm_page_fault(fault_address, error_code);
2716 +diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
2717 +index cfa155078ebb..282bbcbf3b6a 100644
2718 +--- a/arch/x86/kvm/svm.c
2719 ++++ b/arch/x86/kvm/svm.c
2720 +@@ -175,6 +175,8 @@ struct vcpu_svm {
2721 + uint64_t sysenter_eip;
2722 + uint64_t tsc_aux;
2723 +
2724 ++ u64 msr_decfg;
2725 ++
2726 + u64 next_rip;
2727 +
2728 + u64 host_user_msrs[NR_HOST_SAVE_USER_MSRS];
2729 +@@ -1616,6 +1618,7 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
2730 + u32 dummy;
2731 + u32 eax = 1;
2732 +
2733 ++ vcpu->arch.microcode_version = 0x01000065;
2734 + svm->spec_ctrl = 0;
2735 + svm->virt_spec_ctrl = 0;
2736 +
2737 +@@ -3555,6 +3558,22 @@ static int cr8_write_interception(struct vcpu_svm *svm)
2738 + return 0;
2739 + }
2740 +
2741 ++static int svm_get_msr_feature(struct kvm_msr_entry *msr)
2742 ++{
2743 ++ msr->data = 0;
2744 ++
2745 ++ switch (msr->index) {
2746 ++ case MSR_F10H_DECFG:
2747 ++ if (boot_cpu_has(X86_FEATURE_LFENCE_RDTSC))
2748 ++ msr->data |= MSR_F10H_DECFG_LFENCE_SERIALIZE;
2749 ++ break;
2750 ++ default:
2751 ++ return 1;
2752 ++ }
2753 ++
2754 ++ return 0;
2755 ++}
2756 ++
2757 + static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
2758 + {
2759 + struct vcpu_svm *svm = to_svm(vcpu);
2760 +@@ -3637,9 +3656,6 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
2761 +
2762 + msr_info->data = svm->virt_spec_ctrl;
2763 + break;
2764 +- case MSR_IA32_UCODE_REV:
2765 +- msr_info->data = 0x01000065;
2766 +- break;
2767 + case MSR_F15H_IC_CFG: {
2768 +
2769 + int family, model;
2770 +@@ -3657,6 +3673,9 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
2771 + msr_info->data = 0x1E;
2772 + }
2773 + break;
2774 ++ case MSR_F10H_DECFG:
2775 ++ msr_info->data = svm->msr_decfg;
2776 ++ break;
2777 + default:
2778 + return kvm_get_msr_common(vcpu, msr_info);
2779 + }
2780 +@@ -3845,6 +3864,24 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
2781 + case MSR_VM_IGNNE:
2782 + vcpu_unimpl(vcpu, "unimplemented wrmsr: 0x%x data 0x%llx\n", ecx, data);
2783 + break;
2784 ++ case MSR_F10H_DECFG: {
2785 ++ struct kvm_msr_entry msr_entry;
2786 ++
2787 ++ msr_entry.index = msr->index;
2788 ++ if (svm_get_msr_feature(&msr_entry))
2789 ++ return 1;
2790 ++
2791 ++ /* Check the supported bits */
2792 ++ if (data & ~msr_entry.data)
2793 ++ return 1;
2794 ++
2795 ++ /* Don't allow the guest to change a bit, #GP */
2796 ++ if (!msr->host_initiated && (data ^ msr_entry.data))
2797 ++ return 1;
2798 ++
2799 ++ svm->msr_decfg = data;
2800 ++ break;
2801 ++ }
2802 + case MSR_IA32_APICBASE:
2803 + if (kvm_vcpu_apicv_active(vcpu))
2804 + avic_update_vapic_bar(to_svm(vcpu), data);
2805 +@@ -5588,6 +5625,7 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
2806 + .vcpu_unblocking = svm_vcpu_unblocking,
2807 +
2808 + .update_bp_intercept = update_bp_intercept,
2809 ++ .get_msr_feature = svm_get_msr_feature,
2810 + .get_msr = svm_get_msr,
2811 + .set_msr = svm_set_msr,
2812 + .get_segment_base = svm_get_segment_base,
2813 +diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
2814 +index 8d000fde1414..f015ca3997d9 100644
2815 +--- a/arch/x86/kvm/vmx.c
2816 ++++ b/arch/x86/kvm/vmx.c
2817 +@@ -191,6 +191,150 @@ module_param(ple_window_max, int, S_IRUGO);
2818 +
2819 + extern const ulong vmx_return;
2820 +
2821 ++static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush);
2822 ++static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_cond);
2823 ++static DEFINE_MUTEX(vmx_l1d_flush_mutex);
2824 ++
2825 ++/* Storage for pre module init parameter parsing */
2826 ++static enum vmx_l1d_flush_state __read_mostly vmentry_l1d_flush_param = VMENTER_L1D_FLUSH_AUTO;
2827 ++
2828 ++static const struct {
2829 ++ const char *option;
2830 ++ enum vmx_l1d_flush_state cmd;
2831 ++} vmentry_l1d_param[] = {
2832 ++ {"auto", VMENTER_L1D_FLUSH_AUTO},
2833 ++ {"never", VMENTER_L1D_FLUSH_NEVER},
2834 ++ {"cond", VMENTER_L1D_FLUSH_COND},
2835 ++ {"always", VMENTER_L1D_FLUSH_ALWAYS},
2836 ++};
2837 ++
2838 ++#define L1D_CACHE_ORDER 4
2839 ++static void *vmx_l1d_flush_pages;
2840 ++
2841 ++static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
2842 ++{
2843 ++ struct page *page;
2844 ++ unsigned int i;
2845 ++
2846 ++ if (!enable_ept) {
2847 ++ l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_EPT_DISABLED;
2848 ++ return 0;
2849 ++ }
2850 ++
2851 ++ if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES)) {
2852 ++ u64 msr;
2853 ++
2854 ++ rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr);
2855 ++ if (msr & ARCH_CAP_SKIP_VMENTRY_L1DFLUSH) {
2856 ++ l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_NOT_REQUIRED;
2857 ++ return 0;
2858 ++ }
2859 ++ }
2860 ++
2861 ++ /* If set to auto use the default l1tf mitigation method */
2862 ++ if (l1tf == VMENTER_L1D_FLUSH_AUTO) {
2863 ++ switch (l1tf_mitigation) {
2864 ++ case L1TF_MITIGATION_OFF:
2865 ++ l1tf = VMENTER_L1D_FLUSH_NEVER;
2866 ++ break;
2867 ++ case L1TF_MITIGATION_FLUSH_NOWARN:
2868 ++ case L1TF_MITIGATION_FLUSH:
2869 ++ case L1TF_MITIGATION_FLUSH_NOSMT:
2870 ++ l1tf = VMENTER_L1D_FLUSH_COND;
2871 ++ break;
2872 ++ case L1TF_MITIGATION_FULL:
2873 ++ case L1TF_MITIGATION_FULL_FORCE:
2874 ++ l1tf = VMENTER_L1D_FLUSH_ALWAYS;
2875 ++ break;
2876 ++ }
2877 ++ } else if (l1tf_mitigation == L1TF_MITIGATION_FULL_FORCE) {
2878 ++ l1tf = VMENTER_L1D_FLUSH_ALWAYS;
2879 ++ }
2880 ++
2881 ++ if (l1tf != VMENTER_L1D_FLUSH_NEVER && !vmx_l1d_flush_pages &&
2882 ++ !boot_cpu_has(X86_FEATURE_FLUSH_L1D)) {
2883 ++ page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER);
2884 ++ if (!page)
2885 ++ return -ENOMEM;
2886 ++ vmx_l1d_flush_pages = page_address(page);
2887 ++
2888 ++ /*
2889 ++ * Initialize each page with a different pattern in
2890 ++ * order to protect against KSM in the nested
2891 ++ * virtualization case.
2892 ++ */
2893 ++ for (i = 0; i < 1u << L1D_CACHE_ORDER; ++i) {
2894 ++ memset(vmx_l1d_flush_pages + i * PAGE_SIZE, i + 1,
2895 ++ PAGE_SIZE);
2896 ++ }
2897 ++ }
2898 ++
2899 ++ l1tf_vmx_mitigation = l1tf;
2900 ++
2901 ++ if (l1tf != VMENTER_L1D_FLUSH_NEVER)
2902 ++ static_branch_enable(&vmx_l1d_should_flush);
2903 ++ else
2904 ++ static_branch_disable(&vmx_l1d_should_flush);
2905 ++
2906 ++ if (l1tf == VMENTER_L1D_FLUSH_COND)
2907 ++ static_branch_enable(&vmx_l1d_flush_cond);
2908 ++ else
2909 ++ static_branch_disable(&vmx_l1d_flush_cond);
2910 ++ return 0;
2911 ++}
2912 ++
2913 ++static int vmentry_l1d_flush_parse(const char *s)
2914 ++{
2915 ++ unsigned int i;
2916 ++
2917 ++ if (s) {
2918 ++ for (i = 0; i < ARRAY_SIZE(vmentry_l1d_param); i++) {
2919 ++ if (sysfs_streq(s, vmentry_l1d_param[i].option))
2920 ++ return vmentry_l1d_param[i].cmd;
2921 ++ }
2922 ++ }
2923 ++ return -EINVAL;
2924 ++}
2925 ++
2926 ++static int vmentry_l1d_flush_set(const char *s, const struct kernel_param *kp)
2927 ++{
2928 ++ int l1tf, ret;
2929 ++
2930 ++ if (!boot_cpu_has(X86_BUG_L1TF))
2931 ++ return 0;
2932 ++
2933 ++ l1tf = vmentry_l1d_flush_parse(s);
2934 ++ if (l1tf < 0)
2935 ++ return l1tf;
2936 ++
2937 ++ /*
2938 ++ * Has vmx_init() run already? If not then this is the pre init
2939 ++ * parameter parsing. In that case just store the value and let
2940 ++ * vmx_init() do the proper setup after enable_ept has been
2941 ++ * established.
2942 ++ */
2943 ++ if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_AUTO) {
2944 ++ vmentry_l1d_flush_param = l1tf;
2945 ++ return 0;
2946 ++ }
2947 ++
2948 ++ mutex_lock(&vmx_l1d_flush_mutex);
2949 ++ ret = vmx_setup_l1d_flush(l1tf);
2950 ++ mutex_unlock(&vmx_l1d_flush_mutex);
2951 ++ return ret;
2952 ++}
2953 ++
2954 ++static int vmentry_l1d_flush_get(char *s, const struct kernel_param *kp)
2955 ++{
2956 ++ return sprintf(s, "%s\n", vmentry_l1d_param[l1tf_vmx_mitigation].option);
2957 ++}
2958 ++
2959 ++static const struct kernel_param_ops vmentry_l1d_flush_ops = {
2960 ++ .set = vmentry_l1d_flush_set,
2961 ++ .get = vmentry_l1d_flush_get,
2962 ++};
2963 ++module_param_cb(vmentry_l1d_flush, &vmentry_l1d_flush_ops, NULL, 0644);
2964 ++
2965 + #define NR_AUTOLOAD_MSRS 8
2966 +
2967 + struct vmcs {
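For completeness: vmentry_l1d_flush accepts the four strings in vmentry_l1d_param above ("auto", "never", "cond", "always"). It can be given at module load time (e.g. kvm-intel.vmentry_l1d_flush=always on the kernel command line), and since module_param_cb() registers it with mode 0644 it should also be writable at runtime, typically via /sys/module/kvm_intel/parameters/vmentry_l1d_flush; such writes go through vmentry_l1d_flush_set() and re-run vmx_setup_l1d_flush() once vmx_init() has established enable_ept.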
2968 +@@ -567,6 +711,11 @@ static inline int pi_test_sn(struct pi_desc *pi_desc)
2969 + (unsigned long *)&pi_desc->control);
2970 + }
2971 +
2972 ++struct vmx_msrs {
2973 ++ unsigned int nr;
2974 ++ struct vmx_msr_entry val[NR_AUTOLOAD_MSRS];
2975 ++};
2976 ++
2977 + struct vcpu_vmx {
2978 + struct kvm_vcpu vcpu;
2979 + unsigned long host_rsp;
2980 +@@ -600,9 +749,8 @@ struct vcpu_vmx {
2981 + struct loaded_vmcs *loaded_vmcs;
2982 + bool __launched; /* temporary, used in vmx_vcpu_run */
2983 + struct msr_autoload {
2984 +- unsigned nr;
2985 +- struct vmx_msr_entry guest[NR_AUTOLOAD_MSRS];
2986 +- struct vmx_msr_entry host[NR_AUTOLOAD_MSRS];
2987 ++ struct vmx_msrs guest;
2988 ++ struct vmx_msrs host;
2989 + } msr_autoload;
2990 + struct {
2991 + int loaded;
2992 +@@ -1967,9 +2115,20 @@ static void clear_atomic_switch_msr_special(struct vcpu_vmx *vmx,
2993 + vm_exit_controls_clearbit(vmx, exit);
2994 + }
2995 +
2996 ++static int find_msr(struct vmx_msrs *m, unsigned int msr)
2997 ++{
2998 ++ unsigned int i;
2999 ++
3000 ++ for (i = 0; i < m->nr; ++i) {
3001 ++ if (m->val[i].index == msr)
3002 ++ return i;
3003 ++ }
3004 ++ return -ENOENT;
3005 ++}
3006 ++
3007 + static void clear_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr)
3008 + {
3009 +- unsigned i;
3010 ++ int i;
3011 + struct msr_autoload *m = &vmx->msr_autoload;
3012 +
3013 + switch (msr) {
3014 +@@ -1990,18 +2149,21 @@ static void clear_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr)
3015 + }
3016 + break;
3017 + }
3018 ++ i = find_msr(&m->guest, msr);
3019 ++ if (i < 0)
3020 ++ goto skip_guest;
3021 ++ --m->guest.nr;
3022 ++ m->guest.val[i] = m->guest.val[m->guest.nr];
3023 ++ vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->guest.nr);
3024 +
3025 +- for (i = 0; i < m->nr; ++i)
3026 +- if (m->guest[i].index == msr)
3027 +- break;
3028 +-
3029 +- if (i == m->nr)
3030 ++skip_guest:
3031 ++ i = find_msr(&m->host, msr);
3032 ++ if (i < 0)
3033 + return;
3034 +- --m->nr;
3035 +- m->guest[i] = m->guest[m->nr];
3036 +- m->host[i] = m->host[m->nr];
3037 +- vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
3038 +- vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
3039 ++
3040 ++ --m->host.nr;
3041 ++ m->host.val[i] = m->host.val[m->host.nr];
3042 ++ vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->host.nr);
3043 + }
3044 +
3045 + static void add_atomic_switch_msr_special(struct vcpu_vmx *vmx,
3046 +@@ -2016,9 +2178,9 @@ static void add_atomic_switch_msr_special(struct vcpu_vmx *vmx,
3047 + }
3048 +
3049 + static void add_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr,
3050 +- u64 guest_val, u64 host_val)
3051 ++ u64 guest_val, u64 host_val, bool entry_only)
3052 + {
3053 +- unsigned i;
3054 ++ int i, j = 0;
3055 + struct msr_autoload *m = &vmx->msr_autoload;
3056 +
3057 + switch (msr) {
3058 +@@ -2053,24 +2215,31 @@ static void add_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr,
3059 + wrmsrl(MSR_IA32_PEBS_ENABLE, 0);
3060 + }
3061 +
3062 +- for (i = 0; i < m->nr; ++i)
3063 +- if (m->guest[i].index == msr)
3064 +- break;
3065 ++ i = find_msr(&m->guest, msr);
3066 ++ if (!entry_only)
3067 ++ j = find_msr(&m->host, msr);
3068 +
3069 +- if (i == NR_AUTOLOAD_MSRS) {
3070 ++ if (i == NR_AUTOLOAD_MSRS || j == NR_AUTOLOAD_MSRS) {
3071 + printk_once(KERN_WARNING "Not enough msr switch entries. "
3072 + "Can't add msr %x\n", msr);
3073 + return;
3074 +- } else if (i == m->nr) {
3075 +- ++m->nr;
3076 +- vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
3077 +- vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
3078 + }
3079 ++ if (i < 0) {
3080 ++ i = m->guest.nr++;
3081 ++ vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->guest.nr);
3082 ++ }
3083 ++ m->guest.val[i].index = msr;
3084 ++ m->guest.val[i].value = guest_val;
3085 +
3086 +- m->guest[i].index = msr;
3087 +- m->guest[i].value = guest_val;
3088 +- m->host[i].index = msr;
3089 +- m->host[i].value = host_val;
3090 ++ if (entry_only)
3091 ++ return;
3092 ++
3093 ++ if (j < 0) {
3094 ++ j = m->host.nr++;
3095 ++ vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->host.nr);
3096 ++ }
3097 ++ m->host.val[j].index = msr;
3098 ++ m->host.val[j].value = host_val;
3099 + }
3100 +
3101 + static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)
3102 +@@ -2114,7 +2283,7 @@ static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)
3103 + guest_efer &= ~EFER_LME;
3104 + if (guest_efer != host_efer)
3105 + add_atomic_switch_msr(vmx, MSR_EFER,
3106 +- guest_efer, host_efer);
3107 ++ guest_efer, host_efer, false);
3108 + return false;
3109 + } else {
3110 + guest_efer &= ~ignore_bits;
3111 +@@ -3266,6 +3435,11 @@ static inline bool vmx_feature_control_msr_valid(struct kvm_vcpu *vcpu,
3112 + return !(val & ~valid_bits);
3113 + }
3114 +
3115 ++static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
3116 ++{
3117 ++ return 1;
3118 ++}
3119 ++
3120 + /*
3121 + * Reads an msr value (of 'msr_index') into 'pdata'.
3122 + * Returns 0 on success, non-0 otherwise.
3123 +@@ -3523,7 +3697,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
3124 + vcpu->arch.ia32_xss = data;
3125 + if (vcpu->arch.ia32_xss != host_xss)
3126 + add_atomic_switch_msr(vmx, MSR_IA32_XSS,
3127 +- vcpu->arch.ia32_xss, host_xss);
3128 ++ vcpu->arch.ia32_xss, host_xss, false);
3129 + else
3130 + clear_atomic_switch_msr(vmx, MSR_IA32_XSS);
3131 + break;
3132 +@@ -5714,9 +5888,9 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
3133 +
3134 + vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
3135 + vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
3136 +- vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
3137 ++ vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host.val));
3138 + vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
3139 +- vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest));
3140 ++ vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest.val));
3141 +
3142 + if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
3143 + vmcs_write64(GUEST_IA32_PAT, vmx->vcpu.arch.pat);
3144 +@@ -5736,8 +5910,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
3145 + ++vmx->nmsrs;
3146 + }
3147 +
3148 +- if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
3149 +- rdmsrl(MSR_IA32_ARCH_CAPABILITIES, vmx->arch_capabilities);
3150 ++ vmx->arch_capabilities = kvm_get_arch_capabilities();
3151 +
3152 + vm_exit_controls_init(vmx, vmcs_config.vmexit_ctrl);
3153 +
3154 +@@ -5770,6 +5943,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
3155 + vmx->rmode.vm86_active = 0;
3156 + vmx->spec_ctrl = 0;
3157 +
3158 ++ vcpu->arch.microcode_version = 0x100000000ULL;
3159 + vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
3160 + kvm_set_cr8(vcpu, 0);
3161 +
3162 +@@ -8987,6 +9161,79 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
3163 + }
3164 + }
3165 +
3166 ++/*
3167 ++ * Software based L1D cache flush which is used when microcode providing
3168 ++ * the cache control MSR is not loaded.
3169 ++ *
3170 ++ * The L1D cache is 32 KiB on Nehalem and later microarchitectures, but
3171 ++ * flushing it requires reading in 64 KiB because the replacement algorithm
3172 ++ * is not exactly LRU. This could be sized at runtime via topology
3173 ++ * information but as all relevant affected CPUs have 32KiB L1D cache size
3174 ++ * there is no point in doing so.
3175 ++ */
3176 ++#define L1D_CACHE_ORDER 4
3177 ++static void *vmx_l1d_flush_pages;
3178 ++
3179 ++static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
3180 ++{
3181 ++ int size = PAGE_SIZE << L1D_CACHE_ORDER;
3182 ++
3183 ++ /*
3184 ++	 * This code is only executed when the flush mode is 'cond' or
3185 ++ * 'always'
3186 ++ */
3187 ++ if (static_branch_likely(&vmx_l1d_flush_cond)) {
3188 ++ bool flush_l1d;
3189 ++
3190 ++ /*
3191 ++ * Clear the per-vcpu flush bit, it gets set again
3192 ++ * either from vcpu_run() or from one of the unsafe
3193 ++ * VMEXIT handlers.
3194 ++ */
3195 ++ flush_l1d = vcpu->arch.l1tf_flush_l1d;
3196 ++ vcpu->arch.l1tf_flush_l1d = false;
3197 ++
3198 ++ /*
3199 ++ * Clear the per-cpu flush bit, it gets set again from
3200 ++ * the interrupt handlers.
3201 ++ */
3202 ++ flush_l1d |= kvm_get_cpu_l1tf_flush_l1d();
3203 ++ kvm_clear_cpu_l1tf_flush_l1d();
3204 ++
3205 ++ if (!flush_l1d)
3206 ++ return;
3207 ++ }
3208 ++
3209 ++ vcpu->stat.l1d_flush++;
3210 ++
3211 ++ if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
3212 ++ wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
3213 ++ return;
3214 ++ }
3215 ++
3216 ++ asm volatile(
3217 ++ /* First ensure the pages are in the TLB */
3218 ++ "xorl %%eax, %%eax\n"
3219 ++ ".Lpopulate_tlb:\n\t"
3220 ++ "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
3221 ++ "addl $4096, %%eax\n\t"
3222 ++ "cmpl %%eax, %[size]\n\t"
3223 ++ "jne .Lpopulate_tlb\n\t"
3224 ++ "xorl %%eax, %%eax\n\t"
3225 ++ "cpuid\n\t"
3226 ++ /* Now fill the cache */
3227 ++ "xorl %%eax, %%eax\n"
3228 ++ ".Lfill_cache:\n"
3229 ++ "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
3230 ++ "addl $64, %%eax\n\t"
3231 ++ "cmpl %%eax, %[size]\n\t"
3232 ++ "jne .Lfill_cache\n\t"
3233 ++ "lfence\n"
3234 ++ :: [flush_pages] "r" (vmx_l1d_flush_pages),
3235 ++ [size] "r" (size)
3236 ++ : "eax", "ebx", "ecx", "edx");
3237 ++}
3238 ++
3239 + static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
3240 + {
3241 + struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
3242 +@@ -9390,7 +9637,7 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
3243 + clear_atomic_switch_msr(vmx, msrs[i].msr);
3244 + else
3245 + add_atomic_switch_msr(vmx, msrs[i].msr, msrs[i].guest,
3246 +- msrs[i].host);
3247 ++ msrs[i].host, false);
3248 + }
3249 +
3250 + static void vmx_arm_hv_timer(struct kvm_vcpu *vcpu)
3251 +@@ -9483,6 +9730,9 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
3252 +
3253 + vmx->__launched = vmx->loaded_vmcs->launched;
3254 +
3255 ++ if (static_branch_unlikely(&vmx_l1d_should_flush))
3256 ++ vmx_l1d_flush(vcpu);
3257 ++
3258 + asm(
3259 + /* Store host registers */
3260 + "push %%" _ASM_DX "; push %%" _ASM_BP ";"
3261 +@@ -9835,6 +10085,37 @@ free_vcpu:
3262 + return ERR_PTR(err);
3263 + }
3264 +
3265 ++#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html for details.\n"
3266 ++#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html for details.\n"
3267 ++
3268 ++static int vmx_vm_init(struct kvm *kvm)
3269 ++{
3270 ++ if (boot_cpu_has(X86_BUG_L1TF) && enable_ept) {
3271 ++ switch (l1tf_mitigation) {
3272 ++ case L1TF_MITIGATION_OFF:
3273 ++ case L1TF_MITIGATION_FLUSH_NOWARN:
3274 ++ /* 'I explicitly don't care' is set */
3275 ++ break;
3276 ++ case L1TF_MITIGATION_FLUSH:
3277 ++ case L1TF_MITIGATION_FLUSH_NOSMT:
3278 ++ case L1TF_MITIGATION_FULL:
3279 ++ /*
3280 ++ * Warn upon starting the first VM in a potentially
3281 ++ * insecure environment.
3282 ++ */
3283 ++ if (cpu_smt_control == CPU_SMT_ENABLED)
3284 ++ pr_warn_once(L1TF_MSG_SMT);
3285 ++ if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_NEVER)
3286 ++ pr_warn_once(L1TF_MSG_L1D);
3287 ++ break;
3288 ++ case L1TF_MITIGATION_FULL_FORCE:
3289 ++ /* Flush is enforced */
3290 ++ break;
3291 ++ }
3292 ++ }
3293 ++ return 0;
3294 ++}
3295 ++
3296 + static void __init vmx_check_processor_compat(void *rtn)
3297 + {
3298 + struct vmcs_config vmcs_conf;
3299 +@@ -10774,10 +11055,10 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
3300 + * Set the MSR load/store lists to match L0's settings.
3301 + */
3302 + vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
3303 +- vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
3304 +- vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
3305 +- vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
3306 +- vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest));
3307 ++ vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr);
3308 ++ vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host.val));
3309 ++ vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr);
3310 ++ vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest.val));
3311 +
3312 + /*
3313 + * HOST_RSP is normally set correctly in vmx_vcpu_run() just before
3314 +@@ -11202,6 +11483,9 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
3315 + if (ret)
3316 + return ret;
3317 +
3318 ++ /* Hide L1D cache contents from the nested guest. */
3319 ++ vmx->vcpu.arch.l1tf_flush_l1d = true;
3320 ++
3321 + /*
3322 + * If we're entering a halted L2 vcpu and the L2 vcpu won't be woken
3323 + * by event injection, halt vcpu.
3324 +@@ -11712,8 +11996,8 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
3325 + vmx_segment_cache_clear(vmx);
3326 +
3327 + /* Update any VMCS fields that might have changed while L2 ran */
3328 +- vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
3329 +- vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
3330 ++ vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr);
3331 ++ vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr);
3332 + vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset);
3333 + if (vmx->hv_deadline_tsc == -1)
3334 + vmcs_clear_bits(PIN_BASED_VM_EXEC_CONTROL,
3335 +@@ -12225,6 +12509,8 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
3336 + .cpu_has_accelerated_tpr = report_flexpriority,
3337 + .has_emulated_msr = vmx_has_emulated_msr,
3338 +
3339 ++ .vm_init = vmx_vm_init,
3340 ++
3341 + .vcpu_create = vmx_create_vcpu,
3342 + .vcpu_free = vmx_free_vcpu,
3343 + .vcpu_reset = vmx_vcpu_reset,
3344 +@@ -12234,6 +12520,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
3345 + .vcpu_put = vmx_vcpu_put,
3346 +
3347 + .update_bp_intercept = update_exception_bitmap,
3348 ++ .get_msr_feature = vmx_get_msr_feature,
3349 + .get_msr = vmx_get_msr,
3350 + .set_msr = vmx_set_msr,
3351 + .get_segment_base = vmx_get_segment_base,
3352 +@@ -12341,22 +12628,18 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
3353 + .setup_mce = vmx_setup_mce,
3354 + };
3355 +
3356 +-static int __init vmx_init(void)
3357 ++static void vmx_cleanup_l1d_flush(void)
3358 + {
3359 +- int r = kvm_init(&vmx_x86_ops, sizeof(struct vcpu_vmx),
3360 +- __alignof__(struct vcpu_vmx), THIS_MODULE);
3361 +- if (r)
3362 +- return r;
3363 +-
3364 +-#ifdef CONFIG_KEXEC_CORE
3365 +- rcu_assign_pointer(crash_vmclear_loaded_vmcss,
3366 +- crash_vmclear_local_loaded_vmcss);
3367 +-#endif
3368 +-
3369 +- return 0;
3370 ++ if (vmx_l1d_flush_pages) {
3371 ++ free_pages((unsigned long)vmx_l1d_flush_pages, L1D_CACHE_ORDER);
3372 ++ vmx_l1d_flush_pages = NULL;
3373 ++ }
3374 ++ /* Restore state so sysfs ignores VMX */
3375 ++ l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
3376 + }
3377 +
3378 +-static void __exit vmx_exit(void)
3379 ++
3380 ++static void vmx_exit(void)
3381 + {
3382 + #ifdef CONFIG_KEXEC_CORE
3383 + RCU_INIT_POINTER(crash_vmclear_loaded_vmcss, NULL);
3384 +@@ -12364,7 +12647,40 @@ static void __exit vmx_exit(void)
3385 + #endif
3386 +
3387 + kvm_exit();
3388 ++
3389 ++ vmx_cleanup_l1d_flush();
3390 + }
3391 ++module_exit(vmx_exit)
3392 +
3393 ++static int __init vmx_init(void)
3394 ++{
3395 ++ int r;
3396 ++
3397 ++ r = kvm_init(&vmx_x86_ops, sizeof(struct vcpu_vmx),
3398 ++ __alignof__(struct vcpu_vmx), THIS_MODULE);
3399 ++ if (r)
3400 ++ return r;
3401 ++
3402 ++ /*
3403 ++ * Must be called after kvm_init() so enable_ept is properly set
3404 ++ * up. Hand the parameter mitigation value in which was stored in
3405 ++ * the pre module init parser. If no parameter was given, it will
3406 ++ * contain 'auto' which will be turned into the default 'cond'
3407 ++ * mitigation mode.
3408 ++ */
3409 ++ if (boot_cpu_has(X86_BUG_L1TF)) {
3410 ++ r = vmx_setup_l1d_flush(vmentry_l1d_flush_param);
3411 ++ if (r) {
3412 ++ vmx_exit();
3413 ++ return r;
3414 ++ }
3415 ++ }
3416 ++
3417 ++#ifdef CONFIG_KEXEC_CORE
3418 ++ rcu_assign_pointer(crash_vmclear_loaded_vmcss,
3419 ++ crash_vmclear_local_loaded_vmcss);
3420 ++#endif
3421 ++
3422 ++ return 0;
3423 ++}
3424 + module_init(vmx_init)
3425 +-module_exit(vmx_exit)
3426 +diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
3427 +index 2f3fe25639b3..5c2c09f6c1c3 100644
3428 +--- a/arch/x86/kvm/x86.c
3429 ++++ b/arch/x86/kvm/x86.c
3430 +@@ -181,6 +181,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
3431 + { "irq_injections", VCPU_STAT(irq_injections) },
3432 + { "nmi_injections", VCPU_STAT(nmi_injections) },
3433 + { "req_event", VCPU_STAT(req_event) },
3434 ++ { "l1d_flush", VCPU_STAT(l1d_flush) },
3435 + { "mmu_shadow_zapped", VM_STAT(mmu_shadow_zapped) },
3436 + { "mmu_pte_write", VM_STAT(mmu_pte_write) },
3437 + { "mmu_pte_updated", VM_STAT(mmu_pte_updated) },
3438 +@@ -1041,6 +1042,71 @@ static u32 emulated_msrs[] = {
3439 +
3440 + static unsigned num_emulated_msrs;
3441 +
3442 ++/*
3443 ++ * List of msr numbers which are used to expose MSR-based features that
3444 ++ * can be used by a hypervisor to validate requested CPU features.
3445 ++ */
3446 ++static u32 msr_based_features[] = {
3447 ++ MSR_F10H_DECFG,
3448 ++ MSR_IA32_UCODE_REV,
3449 ++ MSR_IA32_ARCH_CAPABILITIES,
3450 ++};
3451 ++
3452 ++static unsigned int num_msr_based_features;
3453 ++
3454 ++u64 kvm_get_arch_capabilities(void)
3455 ++{
3456 ++ u64 data;
3457 ++
3458 ++ rdmsrl_safe(MSR_IA32_ARCH_CAPABILITIES, &data);
3459 ++
3460 ++ /*
3461 ++ * If we're doing cache flushes (either "always" or "cond")
3462 ++ * we will do one whenever the guest does a vmlaunch/vmresume.
3463 ++ * If an outer hypervisor is doing the cache flush for us
3464 ++ * (VMENTER_L1D_FLUSH_NESTED_VM), we can safely pass that
3465 ++ * capability to the guest too, and if EPT is disabled we're not
3466 ++ * vulnerable. Overall, only VMENTER_L1D_FLUSH_NEVER will
3467 ++ * require a nested hypervisor to do a flush of its own.
3468 ++ */
3469 ++ if (l1tf_vmx_mitigation != VMENTER_L1D_FLUSH_NEVER)
3470 ++ data |= ARCH_CAP_SKIP_VMENTRY_L1DFLUSH;
3471 ++
3472 ++ return data;
3473 ++}
3474 ++EXPORT_SYMBOL_GPL(kvm_get_arch_capabilities);
3475 ++
3476 ++static int kvm_get_msr_feature(struct kvm_msr_entry *msr)
3477 ++{
3478 ++ switch (msr->index) {
3479 ++ case MSR_IA32_ARCH_CAPABILITIES:
3480 ++ msr->data = kvm_get_arch_capabilities();
3481 ++ break;
3482 ++ case MSR_IA32_UCODE_REV:
3483 ++ rdmsrl_safe(msr->index, &msr->data);
3484 ++ break;
3485 ++ default:
3486 ++ if (kvm_x86_ops->get_msr_feature(msr))
3487 ++ return 1;
3488 ++ }
3489 ++ return 0;
3490 ++}
3491 ++
3492 ++static int do_get_msr_feature(struct kvm_vcpu *vcpu, unsigned index, u64 *data)
3493 ++{
3494 ++ struct kvm_msr_entry msr;
3495 ++ int r;
3496 ++
3497 ++ msr.index = index;
3498 ++ r = kvm_get_msr_feature(&msr);
3499 ++ if (r)
3500 ++ return r;
3501 ++
3502 ++ *data = msr.data;
3503 ++
3504 ++ return 0;
3505 ++}
3506 ++
3507 + bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer)
3508 + {
3509 + if (efer & efer_reserved_bits)
3510 +@@ -2156,7 +2222,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
3511 +
3512 + switch (msr) {
3513 + case MSR_AMD64_NB_CFG:
3514 +- case MSR_IA32_UCODE_REV:
3515 + case MSR_IA32_UCODE_WRITE:
3516 + case MSR_VM_HSAVE_PA:
3517 + case MSR_AMD64_PATCH_LOADER:
3518 +@@ -2164,6 +2229,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
3519 + case MSR_AMD64_DC_CFG:
3520 + break;
3521 +
3522 ++ case MSR_IA32_UCODE_REV:
3523 ++ if (msr_info->host_initiated)
3524 ++ vcpu->arch.microcode_version = data;
3525 ++ break;
3526 + case MSR_EFER:
3527 + return set_efer(vcpu, data);
3528 + case MSR_K7_HWCR:
3529 +@@ -2450,7 +2519,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
3530 + msr_info->data = 0;
3531 + break;
3532 + case MSR_IA32_UCODE_REV:
3533 +- msr_info->data = 0x100000000ULL;
3534 ++ msr_info->data = vcpu->arch.microcode_version;
3535 + break;
3536 + case MSR_MTRRcap:
3537 + case 0x200 ... 0x2ff:
3538 +@@ -2600,13 +2669,11 @@ static int __msr_io(struct kvm_vcpu *vcpu, struct kvm_msrs *msrs,
3539 + int (*do_msr)(struct kvm_vcpu *vcpu,
3540 + unsigned index, u64 *data))
3541 + {
3542 +- int i, idx;
3543 ++ int i;
3544 +
3545 +- idx = srcu_read_lock(&vcpu->kvm->srcu);
3546 + for (i = 0; i < msrs->nmsrs; ++i)
3547 + if (do_msr(vcpu, entries[i].index, &entries[i].data))
3548 + break;
3549 +- srcu_read_unlock(&vcpu->kvm->srcu, idx);
3550 +
3551 + return i;
3552 + }
3553 +@@ -2705,6 +2772,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
3554 + case KVM_CAP_SET_BOOT_CPU_ID:
3555 + case KVM_CAP_SPLIT_IRQCHIP:
3556 + case KVM_CAP_IMMEDIATE_EXIT:
3557 ++ case KVM_CAP_GET_MSR_FEATURES:
3558 + r = 1;
3559 + break;
3560 + case KVM_CAP_ADJUST_CLOCK:
3561 +@@ -2819,6 +2887,31 @@ long kvm_arch_dev_ioctl(struct file *filp,
3562 + goto out;
3563 + r = 0;
3564 + break;
3565 ++ case KVM_GET_MSR_FEATURE_INDEX_LIST: {
3566 ++ struct kvm_msr_list __user *user_msr_list = argp;
3567 ++ struct kvm_msr_list msr_list;
3568 ++ unsigned int n;
3569 ++
3570 ++ r = -EFAULT;
3571 ++ if (copy_from_user(&msr_list, user_msr_list, sizeof(msr_list)))
3572 ++ goto out;
3573 ++ n = msr_list.nmsrs;
3574 ++ msr_list.nmsrs = num_msr_based_features;
3575 ++ if (copy_to_user(user_msr_list, &msr_list, sizeof(msr_list)))
3576 ++ goto out;
3577 ++ r = -E2BIG;
3578 ++ if (n < msr_list.nmsrs)
3579 ++ goto out;
3580 ++ r = -EFAULT;
3581 ++ if (copy_to_user(user_msr_list->indices, &msr_based_features,
3582 ++ num_msr_based_features * sizeof(u32)))
3583 ++ goto out;
3584 ++ r = 0;
3585 ++ break;
3586 ++ }
3587 ++ case KVM_GET_MSRS:
3588 ++ r = msr_io(NULL, argp, do_get_msr_feature, 1);
3589 ++ break;
3590 + }
3591 + default:
3592 + r = -EINVAL;
3593 +@@ -3553,12 +3646,18 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
3594 + r = 0;
3595 + break;
3596 + }
3597 +- case KVM_GET_MSRS:
3598 ++ case KVM_GET_MSRS: {
3599 ++ int idx = srcu_read_lock(&vcpu->kvm->srcu);
3600 + r = msr_io(vcpu, argp, do_get_msr, 1);
3601 ++ srcu_read_unlock(&vcpu->kvm->srcu, idx);
3602 + break;
3603 +- case KVM_SET_MSRS:
3604 ++ }
3605 ++ case KVM_SET_MSRS: {
3606 ++ int idx = srcu_read_lock(&vcpu->kvm->srcu);
3607 + r = msr_io(vcpu, argp, do_set_msr, 0);
3608 ++ srcu_read_unlock(&vcpu->kvm->srcu, idx);
3609 + break;
3610 ++ }
3611 + case KVM_TPR_ACCESS_REPORTING: {
3612 + struct kvm_tpr_access_ctl tac;
3613 +
3614 +@@ -4333,6 +4432,19 @@ static void kvm_init_msr_list(void)
3615 + j++;
3616 + }
3617 + num_emulated_msrs = j;
3618 ++
3619 ++ for (i = j = 0; i < ARRAY_SIZE(msr_based_features); i++) {
3620 ++ struct kvm_msr_entry msr;
3621 ++
3622 ++ msr.index = msr_based_features[i];
3623 ++ if (kvm_get_msr_feature(&msr))
3624 ++ continue;
3625 ++
3626 ++ if (j < i)
3627 ++ msr_based_features[j] = msr_based_features[i];
3628 ++ j++;
3629 ++ }
3630 ++ num_msr_based_features = j;
3631 + }
3632 +
3633 + static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len,
3634 +@@ -4573,6 +4685,9 @@ static int emulator_write_std(struct x86_emulate_ctxt *ctxt, gva_t addr, void *v
3635 + int kvm_write_guest_virt_system(struct kvm_vcpu *vcpu, gva_t addr, void *val,
3636 + unsigned int bytes, struct x86_exception *exception)
3637 + {
3638 ++ /* kvm_write_guest_virt_system can pull in tons of pages. */
3639 ++ vcpu->arch.l1tf_flush_l1d = true;
3640 ++
3641 + return kvm_write_guest_virt_helper(addr, val, bytes, vcpu,
3642 + PFERR_WRITE_MASK, exception);
3643 + }
3644 +@@ -5701,6 +5816,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
3645 + bool writeback = true;
3646 + bool write_fault_to_spt = vcpu->arch.write_fault_to_shadow_pgtable;
3647 +
3648 ++ vcpu->arch.l1tf_flush_l1d = true;
3649 ++
3650 + /*
3651 + * Clear write_fault_to_shadow_pgtable here to ensure it is
3652 + * never reused.
3653 +@@ -7146,6 +7263,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
3654 + struct kvm *kvm = vcpu->kvm;
3655 +
3656 + vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
3657 ++ vcpu->arch.l1tf_flush_l1d = true;
3658 +
3659 + for (;;) {
3660 + if (kvm_vcpu_running(vcpu)) {
3661 +@@ -8153,6 +8271,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
3662 +
3663 + void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
3664 + {
3665 ++ vcpu->arch.l1tf_flush_l1d = true;
3666 + kvm_x86_ops->sched_in(vcpu, cpu);
3667 + }
3668 +
3669 +diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
3670 +index 0133d26f16be..c2faff548f59 100644
3671 +--- a/arch/x86/mm/fault.c
3672 ++++ b/arch/x86/mm/fault.c
3673 +@@ -24,6 +24,7 @@
3674 + #include <asm/vsyscall.h> /* emulate_vsyscall */
3675 + #include <asm/vm86.h> /* struct vm86 */
3676 + #include <asm/mmu_context.h> /* vma_pkey() */
3677 ++#include <asm/sections.h>
3678 +
3679 + #define CREATE_TRACE_POINTS
3680 + #include <asm/trace/exceptions.h>
3681 +diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
3682 +index 071cbbbb60d9..37f60dfd7e4e 100644
3683 +--- a/arch/x86/mm/init.c
3684 ++++ b/arch/x86/mm/init.c
3685 +@@ -4,6 +4,8 @@
3686 + #include <linux/swap.h>
3687 + #include <linux/memblock.h>
3688 + #include <linux/bootmem.h> /* for max_low_pfn */
3689 ++#include <linux/swapfile.h>
3690 ++#include <linux/swapops.h>
3691 +
3692 + #include <asm/set_memory.h>
3693 + #include <asm/e820/api.h>
3694 +@@ -880,3 +882,26 @@ void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache)
3695 + __cachemode2pte_tbl[cache] = __cm_idx2pte(entry);
3696 + __pte2cachemode_tbl[entry] = cache;
3697 + }
3698 ++
3699 ++#ifdef CONFIG_SWAP
3700 ++unsigned long max_swapfile_size(void)
3701 ++{
3702 ++ unsigned long pages;
3703 ++
3704 ++ pages = generic_max_swapfile_size();
3705 ++
3706 ++ if (boot_cpu_has_bug(X86_BUG_L1TF)) {
3707 ++ /* Limit the swap file size to MAX_PA/2 for L1TF workaround */
3708 ++ unsigned long l1tf_limit = l1tf_pfn_limit() + 1;
3709 ++ /*
3710 ++ * We encode swap offsets also with 3 bits below those for pfn
3711 ++ * which makes the usable limit higher.
3712 ++ */
3713 ++#if CONFIG_PGTABLE_LEVELS > 2
3714 ++ l1tf_limit <<= PAGE_SHIFT - SWP_OFFSET_FIRST_BIT;
3715 ++#endif
3716 ++ pages = min_t(unsigned long, l1tf_limit, pages);
3717 ++ }
3718 ++ return pages;
3719 ++}
3720 ++#endif
3721 +diff --git a/arch/x86/mm/kmmio.c b/arch/x86/mm/kmmio.c
3722 +index 7c8686709636..79eb55ce69a9 100644
3723 +--- a/arch/x86/mm/kmmio.c
3724 ++++ b/arch/x86/mm/kmmio.c
3725 +@@ -126,24 +126,29 @@ static struct kmmio_fault_page *get_kmmio_fault_page(unsigned long addr)
3726 +
3727 + static void clear_pmd_presence(pmd_t *pmd, bool clear, pmdval_t *old)
3728 + {
3729 ++ pmd_t new_pmd;
3730 + pmdval_t v = pmd_val(*pmd);
3731 + if (clear) {
3732 +- *old = v & _PAGE_PRESENT;
3733 +- v &= ~_PAGE_PRESENT;
3734 +- } else /* presume this has been called with clear==true previously */
3735 +- v |= *old;
3736 +- set_pmd(pmd, __pmd(v));
3737 ++ *old = v;
3738 ++ new_pmd = pmd_mknotpresent(*pmd);
3739 ++ } else {
3740 ++ /* Presume this has been called with clear==true previously */
3741 ++ new_pmd = __pmd(*old);
3742 ++ }
3743 ++ set_pmd(pmd, new_pmd);
3744 + }
3745 +
3746 + static void clear_pte_presence(pte_t *pte, bool clear, pteval_t *old)
3747 + {
3748 + pteval_t v = pte_val(*pte);
3749 + if (clear) {
3750 +- *old = v & _PAGE_PRESENT;
3751 +- v &= ~_PAGE_PRESENT;
3752 +- } else /* presume this has been called with clear==true previously */
3753 +- v |= *old;
3754 +- set_pte_atomic(pte, __pte(v));
3755 ++ *old = v;
3756 ++ /* Nothing should care about address */
3757 ++ pte_clear(&init_mm, 0, pte);
3758 ++ } else {
3759 ++ /* Presume this has been called with clear==true previously */
3760 ++ set_pte_atomic(pte, __pte(*old));
3761 ++ }
3762 + }
3763 +
3764 + static int clear_page_presence(struct kmmio_fault_page *f, bool clear)
3765 +diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
3766 +index a99679826846..5f4805d69aab 100644
3767 +--- a/arch/x86/mm/mmap.c
3768 ++++ b/arch/x86/mm/mmap.c
3769 +@@ -174,3 +174,24 @@ const char *arch_vma_name(struct vm_area_struct *vma)
3770 + return "[mpx]";
3771 + return NULL;
3772 + }
3773 ++
3774 ++/*
3775 ++ * Only allow root to set high MMIO mappings to PROT_NONE.
3776 ++ * This prevents an unprivileged user from setting them to PROT_NONE and
3777 ++ * inverting them to point at valid memory for L1TF speculation.
3778 ++ *
3779 ++ * Note: locked-down kernels may want to disable the root override.
3780 ++ */
3781 ++bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
3782 ++{
3783 ++ if (!boot_cpu_has_bug(X86_BUG_L1TF))
3784 ++ return true;
3785 ++ if (!__pte_needs_invert(pgprot_val(prot)))
3786 ++ return true;
3787 ++ /* If it's real memory always allow */
3788 ++ if (pfn_valid(pfn))
3789 ++ return true;
3790 ++ if (pfn > l1tf_pfn_limit() && !capable(CAP_SYS_ADMIN))
3791 ++ return false;
3792 ++ return true;
3793 ++}
3794 +diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
3795 +index 4085897fef64..464f53da3a6f 100644
3796 +--- a/arch/x86/mm/pageattr.c
3797 ++++ b/arch/x86/mm/pageattr.c
3798 +@@ -1006,8 +1006,8 @@ static long populate_pmd(struct cpa_data *cpa,
3799 +
3800 + pmd = pmd_offset(pud, start);
3801 +
3802 +- set_pmd(pmd, __pmd(cpa->pfn << PAGE_SHIFT | _PAGE_PSE |
3803 +- massage_pgprot(pmd_pgprot)));
3804 ++ set_pmd(pmd, pmd_mkhuge(pfn_pmd(cpa->pfn,
3805 ++ canon_pgprot(pmd_pgprot))));
3806 +
3807 + start += PMD_SIZE;
3808 + cpa->pfn += PMD_SIZE >> PAGE_SHIFT;
3809 +@@ -1079,8 +1079,8 @@ static int populate_pud(struct cpa_data *cpa, unsigned long start, p4d_t *p4d,
3810 + * Map everything starting from the Gb boundary, possibly with 1G pages
3811 + */
3812 + while (boot_cpu_has(X86_FEATURE_GBPAGES) && end - start >= PUD_SIZE) {
3813 +- set_pud(pud, __pud(cpa->pfn << PAGE_SHIFT | _PAGE_PSE |
3814 +- massage_pgprot(pud_pgprot)));
3815 ++ set_pud(pud, pud_mkhuge(pfn_pud(cpa->pfn,
3816 ++ canon_pgprot(pud_pgprot))));
3817 +
3818 + start += PUD_SIZE;
3819 + cpa->pfn += PUD_SIZE >> PAGE_SHIFT;
3820 +diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
3821 +index ce38f165489b..d6f11accd37a 100644
3822 +--- a/arch/x86/mm/pti.c
3823 ++++ b/arch/x86/mm/pti.c
3824 +@@ -45,6 +45,7 @@
3825 + #include <asm/pgalloc.h>
3826 + #include <asm/tlbflush.h>
3827 + #include <asm/desc.h>
3828 ++#include <asm/sections.h>
3829 +
3830 + #undef pr_fmt
3831 + #define pr_fmt(fmt) "Kernel/User page tables isolation: " fmt
3832 +diff --git a/arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c b/arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c
3833 +index 4f5fa65a1011..2acd6be13375 100644
3834 +--- a/arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c
3835 ++++ b/arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c
3836 +@@ -18,6 +18,7 @@
3837 + #include <asm/intel-mid.h>
3838 + #include <asm/intel_scu_ipc.h>
3839 + #include <asm/io_apic.h>
3840 ++#include <asm/hw_irq.h>
3841 +
3842 + #define TANGIER_EXT_TIMER0_MSI 12
3843 +
3844 +diff --git a/arch/x86/platform/uv/tlb_uv.c b/arch/x86/platform/uv/tlb_uv.c
3845 +index 0b530c53de1f..34f9a9ce6236 100644
3846 +--- a/arch/x86/platform/uv/tlb_uv.c
3847 ++++ b/arch/x86/platform/uv/tlb_uv.c
3848 +@@ -1285,6 +1285,7 @@ void uv_bau_message_interrupt(struct pt_regs *regs)
3849 + struct msg_desc msgdesc;
3850 +
3851 + ack_APIC_irq();
3852 ++ kvm_set_cpu_l1tf_flush_l1d();
3853 + time_start = get_cycles();
3854 +
3855 + bcp = &per_cpu(bau_control, smp_processor_id());
3856 +diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
3857 +index c9081c6671f0..df208af3cd74 100644
3858 +--- a/arch/x86/xen/enlighten.c
3859 ++++ b/arch/x86/xen/enlighten.c
3860 +@@ -3,6 +3,7 @@
3861 + #endif
3862 + #include <linux/cpu.h>
3863 + #include <linux/kexec.h>
3864 ++#include <linux/slab.h>
3865 +
3866 + #include <xen/features.h>
3867 + #include <xen/page.h>
3868 +diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
3869 +index 433f14bcab15..93758b528d8f 100644
3870 +--- a/drivers/base/cpu.c
3871 ++++ b/drivers/base/cpu.c
3872 +@@ -527,16 +527,24 @@ ssize_t __weak cpu_show_spec_store_bypass(struct device *dev,
3873 + return sprintf(buf, "Not affected\n");
3874 + }
3875 +
3876 ++ssize_t __weak cpu_show_l1tf(struct device *dev,
3877 ++ struct device_attribute *attr, char *buf)
3878 ++{
3879 ++ return sprintf(buf, "Not affected\n");
3880 ++}
3881 ++
3882 + static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
3883 + static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
3884 + static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);
3885 + static DEVICE_ATTR(spec_store_bypass, 0444, cpu_show_spec_store_bypass, NULL);
3886 ++static DEVICE_ATTR(l1tf, 0444, cpu_show_l1tf, NULL);
3887 +
3888 + static struct attribute *cpu_root_vulnerabilities_attrs[] = {
3889 + &dev_attr_meltdown.attr,
3890 + &dev_attr_spectre_v1.attr,
3891 + &dev_attr_spectre_v2.attr,
3892 + &dev_attr_spec_store_bypass.attr,
3893 ++ &dev_attr_l1tf.attr,
3894 + NULL
3895 + };
3896 +
3897 +diff --git a/drivers/bluetooth/hci_ldisc.c b/drivers/bluetooth/hci_ldisc.c
3898 +index 6aef3bde10d7..c823914b3a80 100644
3899 +--- a/drivers/bluetooth/hci_ldisc.c
3900 ++++ b/drivers/bluetooth/hci_ldisc.c
3901 +@@ -115,12 +115,12 @@ static inline struct sk_buff *hci_uart_dequeue(struct hci_uart *hu)
3902 + struct sk_buff *skb = hu->tx_skb;
3903 +
3904 + if (!skb) {
3905 +- read_lock(&hu->proto_lock);
3906 ++ percpu_down_read(&hu->proto_lock);
3907 +
3908 + if (test_bit(HCI_UART_PROTO_READY, &hu->flags))
3909 + skb = hu->proto->dequeue(hu);
3910 +
3911 +- read_unlock(&hu->proto_lock);
3912 ++ percpu_up_read(&hu->proto_lock);
3913 + } else {
3914 + hu->tx_skb = NULL;
3915 + }
3916 +@@ -130,7 +130,14 @@ static inline struct sk_buff *hci_uart_dequeue(struct hci_uart *hu)
3917 +
3918 + int hci_uart_tx_wakeup(struct hci_uart *hu)
3919 + {
3920 +- read_lock(&hu->proto_lock);
3921 ++ /* This may be called in an IRQ context, so we can't sleep. Therefore
3922 ++ * we try to acquire the lock only, and if that fails we assume the
3923 ++ * tty is being closed because that is the only time the write lock is
3924 ++ * acquired. If, however, at some point in the future the write lock
3925 ++ * is also acquired in other situations, then this must be revisited.
3926 ++ */
3927 ++ if (!percpu_down_read_trylock(&hu->proto_lock))
3928 ++ return 0;
3929 +
3930 + if (!test_bit(HCI_UART_PROTO_READY, &hu->flags))
3931 + goto no_schedule;
3932 +@@ -145,7 +152,7 @@ int hci_uart_tx_wakeup(struct hci_uart *hu)
3933 + schedule_work(&hu->write_work);
3934 +
3935 + no_schedule:
3936 +- read_unlock(&hu->proto_lock);
3937 ++ percpu_up_read(&hu->proto_lock);
3938 +
3939 + return 0;
3940 + }
3941 +@@ -247,12 +254,12 @@ static int hci_uart_flush(struct hci_dev *hdev)
3942 + tty_ldisc_flush(tty);
3943 + tty_driver_flush_buffer(tty);
3944 +
3945 +- read_lock(&hu->proto_lock);
3946 ++ percpu_down_read(&hu->proto_lock);
3947 +
3948 + if (test_bit(HCI_UART_PROTO_READY, &hu->flags))
3949 + hu->proto->flush(hu);
3950 +
3951 +- read_unlock(&hu->proto_lock);
3952 ++ percpu_up_read(&hu->proto_lock);
3953 +
3954 + return 0;
3955 + }
3956 +@@ -275,15 +282,15 @@ static int hci_uart_send_frame(struct hci_dev *hdev, struct sk_buff *skb)
3957 + BT_DBG("%s: type %d len %d", hdev->name, hci_skb_pkt_type(skb),
3958 + skb->len);
3959 +
3960 +- read_lock(&hu->proto_lock);
3961 ++ percpu_down_read(&hu->proto_lock);
3962 +
3963 + if (!test_bit(HCI_UART_PROTO_READY, &hu->flags)) {
3964 +- read_unlock(&hu->proto_lock);
3965 ++ percpu_up_read(&hu->proto_lock);
3966 + return -EUNATCH;
3967 + }
3968 +
3969 + hu->proto->enqueue(hu, skb);
3970 +- read_unlock(&hu->proto_lock);
3971 ++ percpu_up_read(&hu->proto_lock);
3972 +
3973 + hci_uart_tx_wakeup(hu);
3974 +
3975 +@@ -486,7 +493,7 @@ static int hci_uart_tty_open(struct tty_struct *tty)
3976 + INIT_WORK(&hu->init_ready, hci_uart_init_work);
3977 + INIT_WORK(&hu->write_work, hci_uart_write_work);
3978 +
3979 +- rwlock_init(&hu->proto_lock);
3980 ++ percpu_init_rwsem(&hu->proto_lock);
3981 +
3982 + /* Flush any pending characters in the driver */
3983 + tty_driver_flush_buffer(tty);
3984 +@@ -503,7 +510,6 @@ static void hci_uart_tty_close(struct tty_struct *tty)
3985 + {
3986 + struct hci_uart *hu = tty->disc_data;
3987 + struct hci_dev *hdev;
3988 +- unsigned long flags;
3989 +
3990 + BT_DBG("tty %p", tty);
3991 +
3992 +@@ -518,9 +524,9 @@ static void hci_uart_tty_close(struct tty_struct *tty)
3993 + hci_uart_close(hdev);
3994 +
3995 + if (test_bit(HCI_UART_PROTO_READY, &hu->flags)) {
3996 +- write_lock_irqsave(&hu->proto_lock, flags);
3997 ++ percpu_down_write(&hu->proto_lock);
3998 + clear_bit(HCI_UART_PROTO_READY, &hu->flags);
3999 +- write_unlock_irqrestore(&hu->proto_lock, flags);
4000 ++ percpu_up_write(&hu->proto_lock);
4001 +
4002 + cancel_work_sync(&hu->write_work);
4003 +
4004 +@@ -582,10 +588,10 @@ static void hci_uart_tty_receive(struct tty_struct *tty, const u8 *data,
4005 + if (!hu || tty != hu->tty)
4006 + return;
4007 +
4008 +- read_lock(&hu->proto_lock);
4009 ++ percpu_down_read(&hu->proto_lock);
4010 +
4011 + if (!test_bit(HCI_UART_PROTO_READY, &hu->flags)) {
4012 +- read_unlock(&hu->proto_lock);
4013 ++ percpu_up_read(&hu->proto_lock);
4014 + return;
4015 + }
4016 +
4017 +@@ -593,7 +599,7 @@ static void hci_uart_tty_receive(struct tty_struct *tty, const u8 *data,
4018 + * tty caller
4019 + */
4020 + hu->proto->recv(hu, data, count);
4021 +- read_unlock(&hu->proto_lock);
4022 ++ percpu_up_read(&hu->proto_lock);
4023 +
4024 + if (hu->hdev)
4025 + hu->hdev->stat.byte_rx += count;
4026 +diff --git a/drivers/bluetooth/hci_serdev.c b/drivers/bluetooth/hci_serdev.c
4027 +index b725ac4f7ff6..52e6d4d1608e 100644
4028 +--- a/drivers/bluetooth/hci_serdev.c
4029 ++++ b/drivers/bluetooth/hci_serdev.c
4030 +@@ -304,6 +304,7 @@ int hci_uart_register_device(struct hci_uart *hu,
4031 + hci_set_drvdata(hdev, hu);
4032 +
4033 + INIT_WORK(&hu->write_work, hci_uart_write_work);
4034 ++ percpu_init_rwsem(&hu->proto_lock);
4035 +
4036 + /* Only when vendor specific setup callback is provided, consider
4037 + * the manufacturer information valid. This avoids filling in the
4038 +diff --git a/drivers/bluetooth/hci_uart.h b/drivers/bluetooth/hci_uart.h
4039 +index d9cd95d81149..66e8c68e4607 100644
4040 +--- a/drivers/bluetooth/hci_uart.h
4041 ++++ b/drivers/bluetooth/hci_uart.h
4042 +@@ -87,7 +87,7 @@ struct hci_uart {
4043 + struct work_struct write_work;
4044 +
4045 + const struct hci_uart_proto *proto;
4046 +- rwlock_t proto_lock; /* Stop work for proto close */
4047 ++ struct percpu_rw_semaphore proto_lock; /* Stop work for proto close */
4048 + void *priv;
4049 +
4050 + struct sk_buff *tx_skb;
4051 +diff --git a/drivers/gpu/drm/i915/intel_lpe_audio.c b/drivers/gpu/drm/i915/intel_lpe_audio.c
4052 +index 3bf65288ffff..2fdf302ebdad 100644
4053 +--- a/drivers/gpu/drm/i915/intel_lpe_audio.c
4054 ++++ b/drivers/gpu/drm/i915/intel_lpe_audio.c
4055 +@@ -62,6 +62,7 @@
4056 +
4057 + #include <linux/acpi.h>
4058 + #include <linux/device.h>
4059 ++#include <linux/irq.h>
4060 + #include <linux/pci.h>
4061 + #include <linux/pm_runtime.h>
4062 +
4063 +diff --git a/drivers/mtd/nand/qcom_nandc.c b/drivers/mtd/nand/qcom_nandc.c
4064 +index 3baddfc997d1..b49ca02b399d 100644
4065 +--- a/drivers/mtd/nand/qcom_nandc.c
4066 ++++ b/drivers/mtd/nand/qcom_nandc.c
4067 +@@ -2544,6 +2544,9 @@ static int qcom_nand_host_init(struct qcom_nand_controller *nandc,
4068 +
4069 + nand_set_flash_node(chip, dn);
4070 + mtd->name = devm_kasprintf(dev, GFP_KERNEL, "qcom_nand.%d", host->cs);
4071 ++ if (!mtd->name)
4072 ++ return -ENOMEM;
4073 ++
4074 + mtd->owner = THIS_MODULE;
4075 + mtd->dev.parent = dev;
4076 +
4077 +diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
4078 +index dfc076f9ee4b..d5e790dd589a 100644
4079 +--- a/drivers/net/xen-netfront.c
4080 ++++ b/drivers/net/xen-netfront.c
4081 +@@ -894,7 +894,6 @@ static RING_IDX xennet_fill_frags(struct netfront_queue *queue,
4082 + struct sk_buff *skb,
4083 + struct sk_buff_head *list)
4084 + {
4085 +- struct skb_shared_info *shinfo = skb_shinfo(skb);
4086 + RING_IDX cons = queue->rx.rsp_cons;
4087 + struct sk_buff *nskb;
4088 +
4089 +@@ -903,15 +902,16 @@ static RING_IDX xennet_fill_frags(struct netfront_queue *queue,
4090 + RING_GET_RESPONSE(&queue->rx, ++cons);
4091 + skb_frag_t *nfrag = &skb_shinfo(nskb)->frags[0];
4092 +
4093 +- if (shinfo->nr_frags == MAX_SKB_FRAGS) {
4094 ++ if (skb_shinfo(skb)->nr_frags == MAX_SKB_FRAGS) {
4095 + unsigned int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
4096 +
4097 + BUG_ON(pull_to <= skb_headlen(skb));
4098 + __pskb_pull_tail(skb, pull_to - skb_headlen(skb));
4099 + }
4100 +- BUG_ON(shinfo->nr_frags >= MAX_SKB_FRAGS);
4101 ++ BUG_ON(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS);
4102 +
4103 +- skb_add_rx_frag(skb, shinfo->nr_frags, skb_frag_page(nfrag),
4104 ++ skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
4105 ++ skb_frag_page(nfrag),
4106 + rx->offset, rx->status, PAGE_SIZE);
4107 +
4108 + skb_shinfo(nskb)->nr_frags = 0;
4109 +diff --git a/drivers/pci/host/pci-hyperv.c b/drivers/pci/host/pci-hyperv.c
4110 +index 4523d7e1bcb9..ffc87a956d97 100644
4111 +--- a/drivers/pci/host/pci-hyperv.c
4112 ++++ b/drivers/pci/host/pci-hyperv.c
4113 +@@ -53,6 +53,8 @@
4114 + #include <linux/delay.h>
4115 + #include <linux/semaphore.h>
4116 + #include <linux/irqdomain.h>
4117 ++#include <linux/irq.h>
4118 ++
4119 + #include <asm/irqdomain.h>
4120 + #include <asm/apic.h>
4121 + #include <linux/msi.h>
4122 +diff --git a/drivers/phy/mediatek/phy-mtk-tphy.c b/drivers/phy/mediatek/phy-mtk-tphy.c
4123 +index 721a2a1c97ef..a63bba12aee4 100644
4124 +--- a/drivers/phy/mediatek/phy-mtk-tphy.c
4125 ++++ b/drivers/phy/mediatek/phy-mtk-tphy.c
4126 +@@ -438,9 +438,9 @@ static void u2_phy_instance_init(struct mtk_tphy *tphy,
4127 + u32 index = instance->index;
4128 + u32 tmp;
4129 +
4130 +- /* switch to USB function. (system register, force ip into usb mode) */
4131 ++ /* switch to USB function, and enable usb pll */
4132 + tmp = readl(com + U3P_U2PHYDTM0);
4133 +- tmp &= ~P2C_FORCE_UART_EN;
4134 ++ tmp &= ~(P2C_FORCE_UART_EN | P2C_FORCE_SUSPENDM);
4135 + tmp |= P2C_RG_XCVRSEL_VAL(1) | P2C_RG_DATAIN_VAL(0);
4136 + writel(tmp, com + U3P_U2PHYDTM0);
4137 +
4138 +@@ -500,10 +500,8 @@ static void u2_phy_instance_power_on(struct mtk_tphy *tphy,
4139 + u32 index = instance->index;
4140 + u32 tmp;
4141 +
4142 +- /* (force_suspendm=0) (let suspendm=1, enable usb 480MHz pll) */
4143 + tmp = readl(com + U3P_U2PHYDTM0);
4144 +- tmp &= ~(P2C_FORCE_SUSPENDM | P2C_RG_XCVRSEL);
4145 +- tmp &= ~(P2C_RG_DATAIN | P2C_DTM0_PART_MASK);
4146 ++ tmp &= ~(P2C_RG_XCVRSEL | P2C_RG_DATAIN | P2C_DTM0_PART_MASK);
4147 + writel(tmp, com + U3P_U2PHYDTM0);
4148 +
4149 + /* OTG Enable */
4150 +@@ -538,7 +536,6 @@ static void u2_phy_instance_power_off(struct mtk_tphy *tphy,
4151 +
4152 + tmp = readl(com + U3P_U2PHYDTM0);
4153 + tmp &= ~(P2C_RG_XCVRSEL | P2C_RG_DATAIN);
4154 +- tmp |= P2C_FORCE_SUSPENDM;
4155 + writel(tmp, com + U3P_U2PHYDTM0);
4156 +
4157 + /* OTG Disable */
4158 +@@ -546,18 +543,16 @@ static void u2_phy_instance_power_off(struct mtk_tphy *tphy,
4159 + tmp &= ~PA6_RG_U2_OTG_VBUSCMP_EN;
4160 + writel(tmp, com + U3P_USBPHYACR6);
4161 +
4162 +- /* let suspendm=0, set utmi into analog power down */
4163 +- tmp = readl(com + U3P_U2PHYDTM0);
4164 +- tmp &= ~P2C_RG_SUSPENDM;
4165 +- writel(tmp, com + U3P_U2PHYDTM0);
4166 +- udelay(1);
4167 +-
4168 + tmp = readl(com + U3P_U2PHYDTM1);
4169 + tmp &= ~(P2C_RG_VBUSVALID | P2C_RG_AVALID);
4170 + tmp |= P2C_RG_SESSEND;
4171 + writel(tmp, com + U3P_U2PHYDTM1);
4172 +
4173 + if (tphy->pdata->avoid_rx_sen_degradation && index) {
4174 ++ tmp = readl(com + U3P_U2PHYDTM0);
4175 ++ tmp &= ~(P2C_RG_SUSPENDM | P2C_FORCE_SUSPENDM);
4176 ++ writel(tmp, com + U3P_U2PHYDTM0);
4177 ++
4178 + tmp = readl(com + U3D_U2PHYDCR0);
4179 + tmp &= ~P2C_RG_SIF_U2PLL_FORCE_ON;
4180 + writel(tmp, com + U3D_U2PHYDCR0);
4181 +diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
4182 +index dd9464920456..ef22b275d050 100644
4183 +--- a/drivers/scsi/hosts.c
4184 ++++ b/drivers/scsi/hosts.c
4185 +@@ -474,6 +474,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
4186 + shost->dma_boundary = 0xffffffff;
4187 +
4188 + shost->use_blk_mq = scsi_use_blk_mq;
4189 ++ shost->use_blk_mq = scsi_use_blk_mq || shost->hostt->force_blk_mq;
4190 +
4191 + device_initialize(&shost->shost_gendev);
4192 + dev_set_name(&shost->shost_gendev, "host%d", shost->host_no);
4193 +diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
4194 +index 604a39dba5d0..5b4b7f9be2d7 100644
4195 +--- a/drivers/scsi/hpsa.c
4196 ++++ b/drivers/scsi/hpsa.c
4197 +@@ -1040,11 +1040,7 @@ static void set_performant_mode(struct ctlr_info *h, struct CommandList *c,
4198 + c->busaddr |= 1 | (h->blockFetchTable[c->Header.SGList] << 1);
4199 + if (unlikely(!h->msix_vectors))
4200 + return;
4201 +- if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
4202 +- c->Header.ReplyQueue =
4203 +- raw_smp_processor_id() % h->nreply_queues;
4204 +- else
4205 +- c->Header.ReplyQueue = reply_queue % h->nreply_queues;
4206 ++ c->Header.ReplyQueue = reply_queue;
4207 + }
4208 + }
4209 +
4210 +@@ -1058,10 +1054,7 @@ static void set_ioaccel1_performant_mode(struct ctlr_info *h,
4211 + * Tell the controller to post the reply to the queue for this
4212 + * processor. This seems to give the best I/O throughput.
4213 + */
4214 +- if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
4215 +- cp->ReplyQueue = smp_processor_id() % h->nreply_queues;
4216 +- else
4217 +- cp->ReplyQueue = reply_queue % h->nreply_queues;
4218 ++ cp->ReplyQueue = reply_queue;
4219 + /*
4220 + * Set the bits in the address sent down to include:
4221 + * - performant mode bit (bit 0)
4222 +@@ -1082,10 +1075,7 @@ static void set_ioaccel2_tmf_performant_mode(struct ctlr_info *h,
4223 + /* Tell the controller to post the reply to the queue for this
4224 + * processor. This seems to give the best I/O throughput.
4225 + */
4226 +- if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
4227 +- cp->reply_queue = smp_processor_id() % h->nreply_queues;
4228 +- else
4229 +- cp->reply_queue = reply_queue % h->nreply_queues;
4230 ++ cp->reply_queue = reply_queue;
4231 + /* Set the bits in the address sent down to include:
4232 + * - performant mode bit not used in ioaccel mode 2
4233 + * - pull count (bits 0-3)
4234 +@@ -1104,10 +1094,7 @@ static void set_ioaccel2_performant_mode(struct ctlr_info *h,
4235 + * Tell the controller to post the reply to the queue for this
4236 + * processor. This seems to give the best I/O throughput.
4237 + */
4238 +- if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
4239 +- cp->reply_queue = smp_processor_id() % h->nreply_queues;
4240 +- else
4241 +- cp->reply_queue = reply_queue % h->nreply_queues;
4242 ++ cp->reply_queue = reply_queue;
4243 + /*
4244 + * Set the bits in the address sent down to include:
4245 + * - performant mode bit not used in ioaccel mode 2
4246 +@@ -1152,6 +1139,8 @@ static void __enqueue_cmd_and_start_io(struct ctlr_info *h,
4247 + {
4248 + dial_down_lockup_detection_during_fw_flash(h, c);
4249 + atomic_inc(&h->commands_outstanding);
4250 ++
4251 ++ reply_queue = h->reply_map[raw_smp_processor_id()];
4252 + switch (c->cmd_type) {
4253 + case CMD_IOACCEL1:
4254 + set_ioaccel1_performant_mode(h, c, reply_queue);
4255 +@@ -7244,6 +7233,26 @@ static void hpsa_disable_interrupt_mode(struct ctlr_info *h)
4256 + h->msix_vectors = 0;
4257 + }
4258 +
4259 ++static void hpsa_setup_reply_map(struct ctlr_info *h)
4260 ++{
4261 ++ const struct cpumask *mask;
4262 ++ unsigned int queue, cpu;
4263 ++
4264 ++ for (queue = 0; queue < h->msix_vectors; queue++) {
4265 ++ mask = pci_irq_get_affinity(h->pdev, queue);
4266 ++ if (!mask)
4267 ++ goto fallback;
4268 ++
4269 ++ for_each_cpu(cpu, mask)
4270 ++ h->reply_map[cpu] = queue;
4271 ++ }
4272 ++ return;
4273 ++
4274 ++fallback:
4275 ++ for_each_possible_cpu(cpu)
4276 ++ h->reply_map[cpu] = 0;
4277 ++}
4278 ++
4279 + /* If MSI/MSI-X is supported by the kernel we will try to enable it on
4280 + * controllers that are capable. If not, we use legacy INTx mode.
4281 + */
4282 +@@ -7639,6 +7648,10 @@ static int hpsa_pci_init(struct ctlr_info *h)
4283 + err = hpsa_interrupt_mode(h);
4284 + if (err)
4285 + goto clean1;
4286 ++
4287 ++ /* setup mapping between CPU and reply queue */
4288 ++ hpsa_setup_reply_map(h);
4289 ++
4290 + err = hpsa_pci_find_memory_BAR(h->pdev, &h->paddr);
4291 + if (err)
4292 + goto clean2; /* intmode+region, pci */
4293 +@@ -8284,6 +8297,28 @@ static struct workqueue_struct *hpsa_create_controller_wq(struct ctlr_info *h,
4294 + return wq;
4295 + }
4296 +
4297 ++static void hpda_free_ctlr_info(struct ctlr_info *h)
4298 ++{
4299 ++ kfree(h->reply_map);
4300 ++ kfree(h);
4301 ++}
4302 ++
4303 ++static struct ctlr_info *hpda_alloc_ctlr_info(void)
4304 ++{
4305 ++ struct ctlr_info *h;
4306 ++
4307 ++ h = kzalloc(sizeof(*h), GFP_KERNEL);
4308 ++ if (!h)
4309 ++ return NULL;
4310 ++
4311 ++ h->reply_map = kzalloc(sizeof(*h->reply_map) * nr_cpu_ids, GFP_KERNEL);
4312 ++ if (!h->reply_map) {
4313 ++ kfree(h);
4314 ++ return NULL;
4315 ++ }
4316 ++ return h;
4317 ++}
4318 ++
4319 + static int hpsa_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
4320 + {
4321 + int dac, rc;
4322 +@@ -8321,7 +8356,7 @@ reinit_after_soft_reset:
4323 + * the driver. See comments in hpsa.h for more info.
4324 + */
4325 + BUILD_BUG_ON(sizeof(struct CommandList) % COMMANDLIST_ALIGNMENT);
4326 +- h = kzalloc(sizeof(*h), GFP_KERNEL);
4327 ++ h = hpda_alloc_ctlr_info();
4328 + if (!h) {
4329 + dev_err(&pdev->dev, "Failed to allocate controller head\n");
4330 + return -ENOMEM;
4331 +@@ -8726,7 +8761,7 @@ static void hpsa_remove_one(struct pci_dev *pdev)
4332 + h->lockup_detected = NULL; /* init_one 2 */
4333 + /* (void) pci_disable_pcie_error_reporting(pdev); */ /* init_one 1 */
4334 +
4335 +- kfree(h); /* init_one 1 */
4336 ++ hpda_free_ctlr_info(h); /* init_one 1 */
4337 + }
4338 +
4339 + static int hpsa_suspend(__attribute__((unused)) struct pci_dev *pdev,
4340 +diff --git a/drivers/scsi/hpsa.h b/drivers/scsi/hpsa.h
4341 +index 018f980a701c..fb9f5e7f8209 100644
4342 +--- a/drivers/scsi/hpsa.h
4343 ++++ b/drivers/scsi/hpsa.h
4344 +@@ -158,6 +158,7 @@ struct bmic_controller_parameters {
4345 + #pragma pack()
4346 +
4347 + struct ctlr_info {
4348 ++ unsigned int *reply_map;
4349 + int ctlr;
4350 + char devname[8];
4351 + char *product_name;
4352 +diff --git a/drivers/scsi/qla2xxx/qla_iocb.c b/drivers/scsi/qla2xxx/qla_iocb.c
4353 +index 63bea6a65d51..8d579bf0fc81 100644
4354 +--- a/drivers/scsi/qla2xxx/qla_iocb.c
4355 ++++ b/drivers/scsi/qla2xxx/qla_iocb.c
4356 +@@ -2128,34 +2128,11 @@ __qla2x00_alloc_iocbs(struct qla_qpair *qpair, srb_t *sp)
4357 + req_cnt = 1;
4358 + handle = 0;
4359 +
4360 +- if (!sp)
4361 +- goto skip_cmd_array;
4362 +-
4363 +- /* Check for room in outstanding command list. */
4364 +- handle = req->current_outstanding_cmd;
4365 +- for (index = 1; index < req->num_outstanding_cmds; index++) {
4366 +- handle++;
4367 +- if (handle == req->num_outstanding_cmds)
4368 +- handle = 1;
4369 +- if (!req->outstanding_cmds[handle])
4370 +- break;
4371 +- }
4372 +- if (index == req->num_outstanding_cmds) {
4373 +- ql_log(ql_log_warn, vha, 0x700b,
4374 +- "No room on outstanding cmd array.\n");
4375 +- goto queuing_error;
4376 +- }
4377 +-
4378 +- /* Prep command array. */
4379 +- req->current_outstanding_cmd = handle;
4380 +- req->outstanding_cmds[handle] = sp;
4381 +- sp->handle = handle;
4382 +-
4383 +- /* Adjust entry-counts as needed. */
4384 +- if (sp->type != SRB_SCSI_CMD)
4385 ++ if (sp && (sp->type != SRB_SCSI_CMD)) {
4386 ++ /* Adjust entry-counts as needed. */
4387 + req_cnt = sp->iocbs;
4388 ++ }
4389 +
4390 +-skip_cmd_array:
4391 + /* Check for room on request queue. */
4392 + if (req->cnt < req_cnt + 2) {
4393 + if (ha->mqenable || IS_QLA83XX(ha) || IS_QLA27XX(ha))
4394 +@@ -2179,6 +2156,28 @@ skip_cmd_array:
4395 + if (req->cnt < req_cnt + 2)
4396 + goto queuing_error;
4397 +
4398 ++ if (sp) {
4399 ++ /* Check for room in outstanding command list. */
4400 ++ handle = req->current_outstanding_cmd;
4401 ++ for (index = 1; index < req->num_outstanding_cmds; index++) {
4402 ++ handle++;
4403 ++ if (handle == req->num_outstanding_cmds)
4404 ++ handle = 1;
4405 ++ if (!req->outstanding_cmds[handle])
4406 ++ break;
4407 ++ }
4408 ++ if (index == req->num_outstanding_cmds) {
4409 ++ ql_log(ql_log_warn, vha, 0x700b,
4410 ++ "No room on outstanding cmd array.\n");
4411 ++ goto queuing_error;
4412 ++ }
4413 ++
4414 ++ /* Prep command array. */
4415 ++ req->current_outstanding_cmd = handle;
4416 ++ req->outstanding_cmds[handle] = sp;
4417 ++ sp->handle = handle;
4418 ++ }
4419 ++
4420 + /* Prep packet */
4421 + req->cnt -= req_cnt;
4422 + pkt = req->ring_ptr;
4423 +@@ -2191,6 +2190,8 @@ skip_cmd_array:
4424 + pkt->handle = handle;
4425 + }
4426 +
4427 ++ return pkt;
4428 ++
4429 + queuing_error:
4430 + qpair->tgt_counters.num_alloc_iocb_failed++;
4431 + return pkt;
4432 +diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
4433 +index 3f3cb72e0c0c..d0389b20574d 100644
4434 +--- a/drivers/scsi/sr.c
4435 ++++ b/drivers/scsi/sr.c
4436 +@@ -523,18 +523,26 @@ static int sr_init_command(struct scsi_cmnd *SCpnt)
4437 + static int sr_block_open(struct block_device *bdev, fmode_t mode)
4438 + {
4439 + struct scsi_cd *cd;
4440 ++ struct scsi_device *sdev;
4441 + int ret = -ENXIO;
4442 +
4443 ++ cd = scsi_cd_get(bdev->bd_disk);
4444 ++ if (!cd)
4445 ++ goto out;
4446 ++
4447 ++ sdev = cd->device;
4448 ++ scsi_autopm_get_device(sdev);
4449 + check_disk_change(bdev);
4450 +
4451 + mutex_lock(&sr_mutex);
4452 +- cd = scsi_cd_get(bdev->bd_disk);
4453 +- if (cd) {
4454 +- ret = cdrom_open(&cd->cdi, bdev, mode);
4455 +- if (ret)
4456 +- scsi_cd_put(cd);
4457 +- }
4458 ++ ret = cdrom_open(&cd->cdi, bdev, mode);
4459 + mutex_unlock(&sr_mutex);
4460 ++
4461 ++ scsi_autopm_put_device(sdev);
4462 ++ if (ret)
4463 ++ scsi_cd_put(cd);
4464 ++
4465 ++out:
4466 + return ret;
4467 + }
4468 +
4469 +@@ -562,6 +570,8 @@ static int sr_block_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
4470 + if (ret)
4471 + goto out;
4472 +
4473 ++ scsi_autopm_get_device(sdev);
4474 ++
4475 + /*
4476 + * Send SCSI addressing ioctls directly to mid level, send other
4477 + * ioctls to cdrom/block level.
4478 +@@ -570,15 +580,18 @@ static int sr_block_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
4479 + case SCSI_IOCTL_GET_IDLUN:
4480 + case SCSI_IOCTL_GET_BUS_NUMBER:
4481 + ret = scsi_ioctl(sdev, cmd, argp);
4482 +- goto out;
4483 ++ goto put;
4484 + }
4485 +
4486 + ret = cdrom_ioctl(&cd->cdi, bdev, mode, cmd, arg);
4487 + if (ret != -ENOSYS)
4488 +- goto out;
4489 ++ goto put;
4490 +
4491 + ret = scsi_ioctl(sdev, cmd, argp);
4492 +
4493 ++put:
4494 ++ scsi_autopm_put_device(sdev);
4495 ++
4496 + out:
4497 + mutex_unlock(&sr_mutex);
4498 + return ret;
4499 +diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
4500 +index 7c28e8d4955a..54e3a0f6844c 100644
4501 +--- a/drivers/scsi/virtio_scsi.c
4502 ++++ b/drivers/scsi/virtio_scsi.c
4503 +@@ -91,9 +91,6 @@ struct virtio_scsi_vq {
4504 + struct virtio_scsi_target_state {
4505 + seqcount_t tgt_seq;
4506 +
4507 +- /* Count of outstanding requests. */
4508 +- atomic_t reqs;
4509 +-
4510 + /* Currently active virtqueue for requests sent to this target. */
4511 + struct virtio_scsi_vq *req_vq;
4512 + };
4513 +@@ -152,8 +149,6 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
4514 + struct virtio_scsi_cmd *cmd = buf;
4515 + struct scsi_cmnd *sc = cmd->sc;
4516 + struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
4517 +- struct virtio_scsi_target_state *tgt =
4518 +- scsi_target(sc->device)->hostdata;
4519 +
4520 + dev_dbg(&sc->device->sdev_gendev,
4521 + "cmd %p response %u status %#02x sense_len %u\n",
4522 +@@ -210,8 +205,6 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
4523 + }
4524 +
4525 + sc->scsi_done(sc);
4526 +-
4527 +- atomic_dec(&tgt->reqs);
4528 + }
4529 +
4530 + static void virtscsi_vq_done(struct virtio_scsi *vscsi,
4531 +@@ -580,10 +573,7 @@ static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
4532 + struct scsi_cmnd *sc)
4533 + {
4534 + struct virtio_scsi *vscsi = shost_priv(sh);
4535 +- struct virtio_scsi_target_state *tgt =
4536 +- scsi_target(sc->device)->hostdata;
4537 +
4538 +- atomic_inc(&tgt->reqs);
4539 + return virtscsi_queuecommand(vscsi, &vscsi->req_vqs[0], sc);
4540 + }
4541 +
4542 +@@ -596,55 +586,11 @@ static struct virtio_scsi_vq *virtscsi_pick_vq_mq(struct virtio_scsi *vscsi,
4543 + return &vscsi->req_vqs[hwq];
4544 + }
4545 +
4546 +-static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
4547 +- struct virtio_scsi_target_state *tgt)
4548 +-{
4549 +- struct virtio_scsi_vq *vq;
4550 +- unsigned long flags;
4551 +- u32 queue_num;
4552 +-
4553 +- local_irq_save(flags);
4554 +- if (atomic_inc_return(&tgt->reqs) > 1) {
4555 +- unsigned long seq;
4556 +-
4557 +- do {
4558 +- seq = read_seqcount_begin(&tgt->tgt_seq);
4559 +- vq = tgt->req_vq;
4560 +- } while (read_seqcount_retry(&tgt->tgt_seq, seq));
4561 +- } else {
4562 +- /* no writes can be concurrent because of atomic_t */
4563 +- write_seqcount_begin(&tgt->tgt_seq);
4564 +-
4565 +- /* keep previous req_vq if a reader just arrived */
4566 +- if (unlikely(atomic_read(&tgt->reqs) > 1)) {
4567 +- vq = tgt->req_vq;
4568 +- goto unlock;
4569 +- }
4570 +-
4571 +- queue_num = smp_processor_id();
4572 +- while (unlikely(queue_num >= vscsi->num_queues))
4573 +- queue_num -= vscsi->num_queues;
4574 +- tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
4575 +- unlock:
4576 +- write_seqcount_end(&tgt->tgt_seq);
4577 +- }
4578 +- local_irq_restore(flags);
4579 +-
4580 +- return vq;
4581 +-}
4582 +-
4583 + static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
4584 + struct scsi_cmnd *sc)
4585 + {
4586 + struct virtio_scsi *vscsi = shost_priv(sh);
4587 +- struct virtio_scsi_target_state *tgt =
4588 +- scsi_target(sc->device)->hostdata;
4589 +- struct virtio_scsi_vq *req_vq;
4590 +-
4591 +- if (shost_use_blk_mq(sh))
4592 +- req_vq = virtscsi_pick_vq_mq(vscsi, sc);
4593 +- else
4594 +- req_vq = virtscsi_pick_vq(vscsi, tgt);
4595 ++ struct virtio_scsi_vq *req_vq = virtscsi_pick_vq_mq(vscsi, sc);
4596 +
4597 + return virtscsi_queuecommand(vscsi, req_vq, sc);
4598 + }
4599 +@@ -775,7 +721,6 @@ static int virtscsi_target_alloc(struct scsi_target *starget)
4600 + return -ENOMEM;
4601 +
4602 + seqcount_init(&tgt->tgt_seq);
4603 +- atomic_set(&tgt->reqs, 0);
4604 + tgt->req_vq = &vscsi->req_vqs[0];
4605 +
4606 + starget->hostdata = tgt;
4607 +@@ -823,6 +768,7 @@ static struct scsi_host_template virtscsi_host_template_single = {
4608 + .target_alloc = virtscsi_target_alloc,
4609 + .target_destroy = virtscsi_target_destroy,
4610 + .track_queue_depth = 1,
4611 ++ .force_blk_mq = 1,
4612 + };
4613 +
4614 + static struct scsi_host_template virtscsi_host_template_multi = {
4615 +@@ -844,6 +790,7 @@ static struct scsi_host_template virtscsi_host_template_multi = {
4616 + .target_destroy = virtscsi_target_destroy,
4617 + .map_queues = virtscsi_map_queues,
4618 + .track_queue_depth = 1,
4619 ++ .force_blk_mq = 1,
4620 + };
4621 +
4622 + #define virtscsi_config_get(vdev, fld) \
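With the per-target request counter gone, queue selection no longer needs the seqcount dance that virtscsi_pick_vq() implemented; every request now carries a blk-mq hardware-queue index. A condensed sketch of the remaining mapping, assuming the field names visible in the hunk above:

/* Sketch: with .force_blk_mq set, the hardware-queue index picked by
 * blk-mq selects the virtqueue directly.                             */
static struct virtio_scsi_vq *pick_vq_sketch(struct virtio_scsi *vscsi,
					     struct scsi_cmnd *sc)
{
	u32 tag = blk_mq_unique_tag(sc->request);
	u16 hwq = blk_mq_unique_tag_to_hwq(tag);

	return &vscsi->req_vqs[hwq];	/* one request virtqueue per hw queue */
}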
4623 +diff --git a/fs/dcache.c b/fs/dcache.c
4624 +index 5f31a93150d1..8d4935978fec 100644
4625 +--- a/fs/dcache.c
4626 ++++ b/fs/dcache.c
4627 +@@ -357,14 +357,11 @@ static void dentry_unlink_inode(struct dentry * dentry)
4628 + __releases(dentry->d_inode->i_lock)
4629 + {
4630 + struct inode *inode = dentry->d_inode;
4631 +- bool hashed = !d_unhashed(dentry);
4632 +
4633 +- if (hashed)
4634 +- raw_write_seqcount_begin(&dentry->d_seq);
4635 ++ raw_write_seqcount_begin(&dentry->d_seq);
4636 + __d_clear_type_and_inode(dentry);
4637 + hlist_del_init(&dentry->d_u.d_alias);
4638 +- if (hashed)
4639 +- raw_write_seqcount_end(&dentry->d_seq);
4640 ++ raw_write_seqcount_end(&dentry->d_seq);
4641 + spin_unlock(&dentry->d_lock);
4642 + spin_unlock(&inode->i_lock);
4643 + if (!inode->i_nlink)
4644 +@@ -1922,10 +1919,12 @@ struct dentry *d_make_root(struct inode *root_inode)
4645 +
4646 + if (root_inode) {
4647 + res = __d_alloc(root_inode->i_sb, NULL);
4648 +- if (res)
4649 ++ if (res) {
4650 ++ res->d_flags |= DCACHE_RCUACCESS;
4651 + d_instantiate(res, root_inode);
4652 +- else
4653 ++ } else {
4654 + iput(root_inode);
4655 ++ }
4656 + }
4657 + return res;
4658 + }
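dentry_unlink_inode() now always wraps the type/inode clearing in a d_seq write section, so lockless RCU lookups that sample d_seq will retry instead of seeing a half-updated dentry. A generic sketch of that seqcount pattern, with an illustrative structure (writers are assumed to be serialized by an outer lock, as d_lock does above):

/* Sketch of the seqcount pattern used for dentry->d_seq.            */
struct example_obj {
	seqcount_t seq;			/* seqcount_init() at setup time  */
	unsigned long a, b;
};

static void example_update(struct example_obj *o, unsigned long a,
			   unsigned long b)
{
	write_seqcount_begin(&o->seq);	/* readers started now will retry */
	o->a = a;
	o->b = b;
	write_seqcount_end(&o->seq);
}

static void example_read(struct example_obj *o, unsigned long *a,
			 unsigned long *b)
{
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&o->seq);
		*a = o->a;
		*b = o->b;
	} while (read_seqcount_retry(&o->seq, seq));	/* raced with a writer */
}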
4659 +diff --git a/fs/namespace.c b/fs/namespace.c
4660 +index 1eb3bfd8be5a..9dc146e7b5e0 100644
4661 +--- a/fs/namespace.c
4662 ++++ b/fs/namespace.c
4663 +@@ -659,12 +659,21 @@ int __legitimize_mnt(struct vfsmount *bastard, unsigned seq)
4664 + return 0;
4665 + mnt = real_mount(bastard);
4666 + mnt_add_count(mnt, 1);
4667 ++ smp_mb(); // see mntput_no_expire()
4668 + if (likely(!read_seqretry(&mount_lock, seq)))
4669 + return 0;
4670 + if (bastard->mnt_flags & MNT_SYNC_UMOUNT) {
4671 + mnt_add_count(mnt, -1);
4672 + return 1;
4673 + }
4674 ++ lock_mount_hash();
4675 ++ if (unlikely(bastard->mnt_flags & MNT_DOOMED)) {
4676 ++ mnt_add_count(mnt, -1);
4677 ++ unlock_mount_hash();
4678 ++ return 1;
4679 ++ }
4680 ++ unlock_mount_hash();
4681 ++ /* caller will mntput() */
4682 + return -1;
4683 + }
4684 +
4685 +@@ -1195,12 +1204,27 @@ static DECLARE_DELAYED_WORK(delayed_mntput_work, delayed_mntput);
4686 + static void mntput_no_expire(struct mount *mnt)
4687 + {
4688 + rcu_read_lock();
4689 +- mnt_add_count(mnt, -1);
4690 +- if (likely(mnt->mnt_ns)) { /* shouldn't be the last one */
4691 ++ if (likely(READ_ONCE(mnt->mnt_ns))) {
4692 ++ /*
4693 ++ * Since we don't do lock_mount_hash() here,
4694 ++ * ->mnt_ns can change under us. However, if it's
4695 ++ * non-NULL, then there's a reference that won't
4696 ++ * be dropped until after an RCU delay done after
4697 ++ * turning ->mnt_ns NULL. So if we observe it
4698 ++ * non-NULL under rcu_read_lock(), the reference
4699 ++ * we are dropping is not the final one.
4700 ++ */
4701 ++ mnt_add_count(mnt, -1);
4702 + rcu_read_unlock();
4703 + return;
4704 + }
4705 + lock_mount_hash();
4706 ++ /*
4707 ++ * make sure that if __legitimize_mnt() has not seen us grab
4708 ++ * mount_lock, we'll see their refcount increment here.
4709 ++ */
4710 ++ smp_mb();
4711 ++ mnt_add_count(mnt, -1);
4712 + if (mnt_get_count(mnt)) {
4713 + rcu_read_unlock();
4714 + unlock_mount_hash();
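The smp_mb() added to __legitimize_mnt() pairs with the one added to mntput_no_expire(): one side publishes a refcount increment and then samples mount_lock's sequence, the other takes mount_lock and then re-reads the count. A reduced model of that pairing, with illustrative flags instead of the mount internals:

/* Sketch: store-buffering pairing enforced by the two smp_mb() calls.
 * At most one of the two functions below can return true.            */
static int flag_a, flag_b;

static bool side_a(void)		/* models __legitimize_mnt()     */
{
	WRITE_ONCE(flag_a, 1);		/* "I took a reference"          */
	smp_mb();
	return READ_ONCE(flag_b) == 0;	/* "has teardown started?"       */
}

static bool side_b(void)		/* models mntput_no_expire()     */
{
	WRITE_ONCE(flag_b, 1);		/* "I'm dropping the last ref"   */
	smp_mb();
	return READ_ONCE(flag_a) == 0;	/* "did anyone else grab a ref?" */
}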
4715 +diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
4716 +index 2142bceaeb75..46a2f5d9aa25 100644
4717 +--- a/include/asm-generic/pgtable.h
4718 ++++ b/include/asm-generic/pgtable.h
4719 +@@ -1055,6 +1055,18 @@ int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
4720 + static inline void init_espfix_bsp(void) { }
4721 + #endif
4722 +
4723 ++#ifndef __HAVE_ARCH_PFN_MODIFY_ALLOWED
4724 ++static inline bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
4725 ++{
4726 ++ return true;
4727 ++}
4728 ++
4729 ++static inline bool arch_has_pfn_modify_check(void)
4730 ++{
4731 ++ return false;
4732 ++}
4733 ++#endif /* !__HAVE_ARCH_PFN_MODIFY_ALLOWED */
4734 ++
4735 + #endif /* !__ASSEMBLY__ */
4736 +
4737 + #ifndef io_remap_pfn_range
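The stubs above are the generic fallbacks; an architecture that needs to veto certain PFNs (as the x86 L1TF mitigation does) defines __HAVE_ARCH_PFN_MODIFY_ALLOWED and supplies its own pair. A hedged sketch of what such an override could look like; example_max_safe_pfn() and the exact policy are illustrative, not taken from this patch:

/* arch/<arch>/include/asm/pgtable.h -- illustrative override         */
#define __HAVE_ARCH_PFN_MODIFY_ALLOWED

static inline bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
{
	/* Hypothetical policy: refuse not-present (PROT_NONE-style)
	 * mappings of PFNs above an architecture-defined safe limit.  */
	if (!(pgprot_val(prot) & _PAGE_PRESENT))
		return pfn <= example_max_safe_pfn();
	return true;
}

static inline bool arch_has_pfn_modify_check(void)
{
	return true;	/* makes mprotect_fixup() run the PFN walk       */
}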
4738 +diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h
4739 +index 070f85d92c15..28b76f0894d4 100644
4740 +--- a/include/linux/compiler-clang.h
4741 ++++ b/include/linux/compiler-clang.h
4742 +@@ -17,6 +17,9 @@
4743 + */
4744 + #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
4745 +
4746 ++#undef __no_sanitize_address
4747 ++#define __no_sanitize_address __attribute__((no_sanitize("address")))
4748 ++
4749 + /* Clang doesn't have a way to turn it off per-function, yet. */
4750 + #ifdef __noretpoline
4751 + #undef __noretpoline
4752 +diff --git a/include/linux/cpu.h b/include/linux/cpu.h
4753 +index 9546bf2fe310..2a378d261914 100644
4754 +--- a/include/linux/cpu.h
4755 ++++ b/include/linux/cpu.h
4756 +@@ -30,7 +30,7 @@ struct cpu {
4757 + };
4758 +
4759 + extern void boot_cpu_init(void);
4760 +-extern void boot_cpu_state_init(void);
4761 ++extern void boot_cpu_hotplug_init(void);
4762 + extern void cpu_init(void);
4763 + extern void trap_init(void);
4764 +
4765 +@@ -55,6 +55,8 @@ extern ssize_t cpu_show_spectre_v2(struct device *dev,
4766 + struct device_attribute *attr, char *buf);
4767 + extern ssize_t cpu_show_spec_store_bypass(struct device *dev,
4768 + struct device_attribute *attr, char *buf);
4769 ++extern ssize_t cpu_show_l1tf(struct device *dev,
4770 ++ struct device_attribute *attr, char *buf);
4771 +
4772 + extern __printf(4, 5)
4773 + struct device *cpu_device_create(struct device *parent, void *drvdata,
4774 +@@ -176,4 +178,23 @@ void cpuhp_report_idle_dead(void);
4775 + static inline void cpuhp_report_idle_dead(void) { }
4776 + #endif /* #ifdef CONFIG_HOTPLUG_CPU */
4777 +
4778 ++enum cpuhp_smt_control {
4779 ++ CPU_SMT_ENABLED,
4780 ++ CPU_SMT_DISABLED,
4781 ++ CPU_SMT_FORCE_DISABLED,
4782 ++ CPU_SMT_NOT_SUPPORTED,
4783 ++};
4784 ++
4785 ++#if defined(CONFIG_SMP) && defined(CONFIG_HOTPLUG_SMT)
4786 ++extern enum cpuhp_smt_control cpu_smt_control;
4787 ++extern void cpu_smt_disable(bool force);
4788 ++extern void cpu_smt_check_topology_early(void);
4789 ++extern void cpu_smt_check_topology(void);
4790 ++#else
4791 ++# define cpu_smt_control (CPU_SMT_ENABLED)
4792 ++static inline void cpu_smt_disable(bool force) { }
4793 ++static inline void cpu_smt_check_topology_early(void) { }
4794 ++static inline void cpu_smt_check_topology(void) { }
4795 ++#endif
4796 ++
4797 + #endif /* _LINUX_CPU_H_ */
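These declarations give architecture code one knob for SMT policy: report early whether the CPU supports SMT at all, and optionally ask the core to keep sibling threads offline. A hedged sketch of arch-side usage; the function name and the CONFIG_ symbol are illustrative:

/* Illustrative arch bringup hook using the new SMT control helpers.  */
static void __init example_arch_smt_setup(void)
{
	/* Let the core record whether siblings can exist at all;
	 * normally done from early arch SMP setup.                    */
	cpu_smt_check_topology_early();

	/* Hypothetical policy hook: request (non-forced) SMT off.     */
	if (IS_ENABLED(CONFIG_EXAMPLE_WANTS_NOSMT))
		cpu_smt_disable(false);

	if (cpu_smt_control == CPU_SMT_NOT_SUPPORTED)
		pr_info("SMT not supported by this CPU\n");
}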
4798 +diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
4799 +index 06bd7b096167..e06febf62978 100644
4800 +--- a/include/linux/swapfile.h
4801 ++++ b/include/linux/swapfile.h
4802 +@@ -10,5 +10,7 @@ extern spinlock_t swap_lock;
4803 + extern struct plist_head swap_active_head;
4804 + extern struct swap_info_struct *swap_info[];
4805 + extern int try_to_unuse(unsigned int, bool, unsigned long);
4806 ++extern unsigned long generic_max_swapfile_size(void);
4807 ++extern unsigned long max_swapfile_size(void);
4808 +
4809 + #endif /* _LINUX_SWAPFILE_H */
4810 +diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
4811 +index a8b7bf879ced..9c1e4bad6581 100644
4812 +--- a/include/scsi/scsi_host.h
4813 ++++ b/include/scsi/scsi_host.h
4814 +@@ -452,6 +452,9 @@ struct scsi_host_template {
4815 + /* True if the controller does not support WRITE SAME */
4816 + unsigned no_write_same:1;
4817 +
4818 ++ /* True if the low-level driver supports blk-mq only */
4819 ++ unsigned force_blk_mq:1;
4820 ++
4821 + /*
4822 + * Countdown for host blocking with no commands outstanding.
4823 + */
4824 +diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
4825 +index 857bad91c454..27c62abb6c9e 100644
4826 +--- a/include/uapi/linux/kvm.h
4827 ++++ b/include/uapi/linux/kvm.h
4828 +@@ -761,6 +761,7 @@ struct kvm_ppc_resize_hpt {
4829 + #define KVM_TRACE_PAUSE __KVM_DEPRECATED_MAIN_0x07
4830 + #define KVM_TRACE_DISABLE __KVM_DEPRECATED_MAIN_0x08
4831 + #define KVM_GET_EMULATED_CPUID _IOWR(KVMIO, 0x09, struct kvm_cpuid2)
4832 ++#define KVM_GET_MSR_FEATURE_INDEX_LIST _IOWR(KVMIO, 0x0a, struct kvm_msr_list)
4833 +
4834 + /*
4835 + * Extension capability list.
4836 +@@ -932,6 +933,7 @@ struct kvm_ppc_resize_hpt {
4837 + #define KVM_CAP_HYPERV_SYNIC2 148
4838 + #define KVM_CAP_HYPERV_VP_INDEX 149
4839 + #define KVM_CAP_S390_BPB 152
4840 ++#define KVM_CAP_GET_MSR_FEATURES 153
4841 +
4842 + #ifdef KVM_CAP_IRQ_ROUTING
4843 +
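KVM_GET_MSR_FEATURE_INDEX_LIST is issued on the /dev/kvm system fd and fills a struct kvm_msr_list. A hedged userspace sketch, assuming it follows the same two-call convention as the long-standing KVM_GET_MSR_INDEX_LIST (an undersized nmsrs fails with E2BIG and reports the required count); real code should check KVM_CAP_GET_MSR_FEATURES first:

/* Userspace sketch: enumerate feature MSR indices (assumptions above). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	struct kvm_msr_list probe = { .nmsrs = 0 };
	struct kvm_msr_list *list;
	int kvm = open("/dev/kvm", O_RDWR);

	if (kvm < 0)
		return 1;

	/* First call: expected to fail with E2BIG, reporting the count. */
	ioctl(kvm, KVM_GET_MSR_FEATURE_INDEX_LIST, &probe);

	list = calloc(1, sizeof(*list) + probe.nmsrs * sizeof(__u32));
	list->nmsrs = probe.nmsrs;

	if (ioctl(kvm, KVM_GET_MSR_FEATURE_INDEX_LIST, list) == 0) {
		for (__u32 i = 0; i < list->nmsrs; i++)
			printf("feature MSR index: 0x%x\n", list->indices[i]);
	}
	free(list);
	return 0;
}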
4844 +diff --git a/init/main.c b/init/main.c
4845 +index 0d88f37febcb..c4a45145e102 100644
4846 +--- a/init/main.c
4847 ++++ b/init/main.c
4848 +@@ -543,8 +543,8 @@ asmlinkage __visible void __init start_kernel(void)
4849 + setup_command_line(command_line);
4850 + setup_nr_cpu_ids();
4851 + setup_per_cpu_areas();
4852 +- boot_cpu_state_init();
4853 + smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */
4854 ++ boot_cpu_hotplug_init();
4855 +
4856 + build_all_zonelists(NULL);
4857 + page_alloc_init();
4858 +diff --git a/kernel/cpu.c b/kernel/cpu.c
4859 +index f21bfa3172d8..8f02f9b6e046 100644
4860 +--- a/kernel/cpu.c
4861 ++++ b/kernel/cpu.c
4862 +@@ -60,6 +60,7 @@ struct cpuhp_cpu_state {
4863 + bool rollback;
4864 + bool single;
4865 + bool bringup;
4866 ++ bool booted_once;
4867 + struct hlist_node *node;
4868 + struct hlist_node *last;
4869 + enum cpuhp_state cb_state;
4870 +@@ -346,6 +347,85 @@ void cpu_hotplug_enable(void)
4871 + EXPORT_SYMBOL_GPL(cpu_hotplug_enable);
4872 + #endif /* CONFIG_HOTPLUG_CPU */
4873 +
4874 ++#ifdef CONFIG_HOTPLUG_SMT
4875 ++enum cpuhp_smt_control cpu_smt_control __read_mostly = CPU_SMT_ENABLED;
4876 ++EXPORT_SYMBOL_GPL(cpu_smt_control);
4877 ++
4878 ++static bool cpu_smt_available __read_mostly;
4879 ++
4880 ++void __init cpu_smt_disable(bool force)
4881 ++{
4882 ++ if (cpu_smt_control == CPU_SMT_FORCE_DISABLED ||
4883 ++ cpu_smt_control == CPU_SMT_NOT_SUPPORTED)
4884 ++ return;
4885 ++
4886 ++ if (force) {
4887 ++ pr_info("SMT: Force disabled\n");
4888 ++ cpu_smt_control = CPU_SMT_FORCE_DISABLED;
4889 ++ } else {
4890 ++ cpu_smt_control = CPU_SMT_DISABLED;
4891 ++ }
4892 ++}
4893 ++
4894 ++/*
4895 ++ * The decision whether SMT is supported can only be done after the full
4896 ++ * CPU identification. Called from architecture code before non-boot CPUs
4897 ++ * are brought up.
4898 ++ */
4899 ++void __init cpu_smt_check_topology_early(void)
4900 ++{
4901 ++ if (!topology_smt_supported())
4902 ++ cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
4903 ++}
4904 ++
4905 ++/*
4906 ++ * If SMT was disabled by BIOS, detect it here, after the CPUs have been
4907 ++ * brought online. This ensures the smt/l1tf sysfs entries are consistent
4908 ++ * with reality. cpu_smt_available is set to true during the bringup of non-
4909 ++ * boot CPUs when an SMT sibling is detected. Note, this may overwrite
4910 ++ * cpu_smt_control's previous setting.
4911 ++ */
4912 ++void __init cpu_smt_check_topology(void)
4913 ++{
4914 ++ if (!cpu_smt_available)
4915 ++ cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
4916 ++}
4917 ++
4918 ++static int __init smt_cmdline_disable(char *str)
4919 ++{
4920 ++ cpu_smt_disable(str && !strcmp(str, "force"));
4921 ++ return 0;
4922 ++}
4923 ++early_param("nosmt", smt_cmdline_disable);
4924 ++
4925 ++static inline bool cpu_smt_allowed(unsigned int cpu)
4926 ++{
4927 ++ if (topology_is_primary_thread(cpu))
4928 ++ return true;
4929 ++
4930 ++ /*
4931 ++ * If the CPU is not a 'primary' thread and the booted_once bit is
4932 ++ * set then the processor has SMT support. Store this information
4933 ++ * for the late check of SMT support in cpu_smt_check_topology().
4934 ++ */
4935 ++ if (per_cpu(cpuhp_state, cpu).booted_once)
4936 ++ cpu_smt_available = true;
4937 ++
4938 ++ if (cpu_smt_control == CPU_SMT_ENABLED)
4939 ++ return true;
4940 ++
4941 ++ /*
4942 ++ * On x86 it's required to boot all logical CPUs at least once so
4943 ++ * that the init code can get a chance to set CR4.MCE on each
4944 ++ * CPU. Otherwise, a broadcast MCE observing CR4.MCE=0b on any
4945 ++ * core will shut down the machine.
4946 ++ */
4947 ++ return !per_cpu(cpuhp_state, cpu).booted_once;
4948 ++}
4949 ++#else
4950 ++static inline bool cpu_smt_allowed(unsigned int cpu) { return true; }
4951 ++#endif
4952 ++
4953 + static inline enum cpuhp_state
4954 + cpuhp_set_state(struct cpuhp_cpu_state *st, enum cpuhp_state target)
4955 + {
4956 +@@ -426,6 +506,16 @@ static int bringup_wait_for_ap(unsigned int cpu)
4957 + stop_machine_unpark(cpu);
4958 + kthread_unpark(st->thread);
4959 +
4960 ++ /*
4961 ++ * SMT soft disabling on X86 requires to bring the CPU out of the
4962 ++ * BIOS 'wait for SIPI' state in order to set the CR4.MCE bit. The
4963 ++ * CPU marked itself as booted_once in cpu_notify_starting() so the
4964 ++ * cpu_smt_allowed() check will now return false if this is not the
4965 ++ * primary sibling.
4966 ++ */
4967 ++ if (!cpu_smt_allowed(cpu))
4968 ++ return -ECANCELED;
4969 ++
4970 + if (st->target <= CPUHP_AP_ONLINE_IDLE)
4971 + return 0;
4972 +
4973 +@@ -758,7 +848,6 @@ static int takedown_cpu(unsigned int cpu)
4974 +
4975 + /* Park the smpboot threads */
4976 + kthread_park(per_cpu_ptr(&cpuhp_state, cpu)->thread);
4977 +- smpboot_park_threads(cpu);
4978 +
4979 + /*
4980 + * Prevent irq alloc/free while the dying cpu reorganizes the
4981 +@@ -911,20 +1000,19 @@ out:
4982 + return ret;
4983 + }
4984 +
4985 ++static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
4986 ++{
4987 ++ if (cpu_hotplug_disabled)
4988 ++ return -EBUSY;
4989 ++ return _cpu_down(cpu, 0, target);
4990 ++}
4991 ++
4992 + static int do_cpu_down(unsigned int cpu, enum cpuhp_state target)
4993 + {
4994 + int err;
4995 +
4996 + cpu_maps_update_begin();
4997 +-
4998 +- if (cpu_hotplug_disabled) {
4999 +- err = -EBUSY;
5000 +- goto out;
5001 +- }
5002 +-
5003 +- err = _cpu_down(cpu, 0, target);
5004 +-
5005 +-out:
5006 ++ err = cpu_down_maps_locked(cpu, target);
5007 + cpu_maps_update_done();
5008 + return err;
5009 + }
5010 +@@ -953,6 +1041,7 @@ void notify_cpu_starting(unsigned int cpu)
5011 + int ret;
5012 +
5013 + rcu_cpu_starting(cpu); /* Enables RCU usage on this CPU. */
5014 ++ st->booted_once = true;
5015 + while (st->state < target) {
5016 + st->state++;
5017 + ret = cpuhp_invoke_callback(cpu, st->state, true, NULL, NULL);
5018 +@@ -1062,6 +1151,10 @@ static int do_cpu_up(unsigned int cpu, enum cpuhp_state target)
5019 + err = -EBUSY;
5020 + goto out;
5021 + }
5022 ++ if (!cpu_smt_allowed(cpu)) {
5023 ++ err = -EPERM;
5024 ++ goto out;
5025 ++ }
5026 +
5027 + err = _cpu_up(cpu, 0, target);
5028 + out:
5029 +@@ -1344,7 +1437,7 @@ static struct cpuhp_step cpuhp_ap_states[] = {
5030 + [CPUHP_AP_SMPBOOT_THREADS] = {
5031 + .name = "smpboot/threads:online",
5032 + .startup.single = smpboot_unpark_threads,
5033 +- .teardown.single = NULL,
5034 ++ .teardown.single = smpboot_park_threads,
5035 + },
5036 + [CPUHP_AP_IRQ_AFFINITY_ONLINE] = {
5037 + .name = "irq/affinity:online",
5038 +@@ -1918,10 +2011,172 @@ static const struct attribute_group cpuhp_cpu_root_attr_group = {
5039 + NULL
5040 + };
5041 +
5042 ++#ifdef CONFIG_HOTPLUG_SMT
5043 ++
5044 ++static const char *smt_states[] = {
5045 ++ [CPU_SMT_ENABLED] = "on",
5046 ++ [CPU_SMT_DISABLED] = "off",
5047 ++ [CPU_SMT_FORCE_DISABLED] = "forceoff",
5048 ++ [CPU_SMT_NOT_SUPPORTED] = "notsupported",
5049 ++};
5050 ++
5051 ++static ssize_t
5052 ++show_smt_control(struct device *dev, struct device_attribute *attr, char *buf)
5053 ++{
5054 ++ return snprintf(buf, PAGE_SIZE - 2, "%s\n", smt_states[cpu_smt_control]);
5055 ++}
5056 ++
5057 ++static void cpuhp_offline_cpu_device(unsigned int cpu)
5058 ++{
5059 ++ struct device *dev = get_cpu_device(cpu);
5060 ++
5061 ++ dev->offline = true;
5062 ++ /* Tell user space about the state change */
5063 ++ kobject_uevent(&dev->kobj, KOBJ_OFFLINE);
5064 ++}
5065 ++
5066 ++static void cpuhp_online_cpu_device(unsigned int cpu)
5067 ++{
5068 ++ struct device *dev = get_cpu_device(cpu);
5069 ++
5070 ++ dev->offline = false;
5071 ++ /* Tell user space about the state change */
5072 ++ kobject_uevent(&dev->kobj, KOBJ_ONLINE);
5073 ++}
5074 ++
5075 ++static int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
5076 ++{
5077 ++ int cpu, ret = 0;
5078 ++
5079 ++ cpu_maps_update_begin();
5080 ++ for_each_online_cpu(cpu) {
5081 ++ if (topology_is_primary_thread(cpu))
5082 ++ continue;
5083 ++ ret = cpu_down_maps_locked(cpu, CPUHP_OFFLINE);
5084 ++ if (ret)
5085 ++ break;
5086 ++ /*
5087 ++ * As this needs to hold the cpu maps lock it's impossible
5088 ++ * to call device_offline() because that ends up calling
5089 ++ * cpu_down(), which takes the cpu maps lock. That lock
5090 ++ * needs to be held as this might race against in-kernel
5091 ++ * abusers of the hotplug machinery (thermal management).
5092 ++ *
5093 ++ * So nothing would update device:offline state. That would
5094 ++ * leave the sysfs entry stale and prevent onlining after
5095 ++ * smt control has been changed to 'off' again. This is
5096 ++ * called under the sysfs hotplug lock, so it is properly
5097 ++ * serialized against the regular offline usage.
5098 ++ */
5099 ++ cpuhp_offline_cpu_device(cpu);
5100 ++ }
5101 ++ if (!ret)
5102 ++ cpu_smt_control = ctrlval;
5103 ++ cpu_maps_update_done();
5104 ++ return ret;
5105 ++}
5106 ++
5107 ++static int cpuhp_smt_enable(void)
5108 ++{
5109 ++ int cpu, ret = 0;
5110 ++
5111 ++ cpu_maps_update_begin();
5112 ++ cpu_smt_control = CPU_SMT_ENABLED;
5113 ++ for_each_present_cpu(cpu) {
5114 ++ /* Skip online CPUs and CPUs on offline nodes */
5115 ++ if (cpu_online(cpu) || !node_online(cpu_to_node(cpu)))
5116 ++ continue;
5117 ++ ret = _cpu_up(cpu, 0, CPUHP_ONLINE);
5118 ++ if (ret)
5119 ++ break;
5120 ++ /* See comment in cpuhp_smt_disable() */
5121 ++ cpuhp_online_cpu_device(cpu);
5122 ++ }
5123 ++ cpu_maps_update_done();
5124 ++ return ret;
5125 ++}
5126 ++
5127 ++static ssize_t
5128 ++store_smt_control(struct device *dev, struct device_attribute *attr,
5129 ++ const char *buf, size_t count)
5130 ++{
5131 ++ int ctrlval, ret;
5132 ++
5133 ++ if (sysfs_streq(buf, "on"))
5134 ++ ctrlval = CPU_SMT_ENABLED;
5135 ++ else if (sysfs_streq(buf, "off"))
5136 ++ ctrlval = CPU_SMT_DISABLED;
5137 ++ else if (sysfs_streq(buf, "forceoff"))
5138 ++ ctrlval = CPU_SMT_FORCE_DISABLED;
5139 ++ else
5140 ++ return -EINVAL;
5141 ++
5142 ++ if (cpu_smt_control == CPU_SMT_FORCE_DISABLED)
5143 ++ return -EPERM;
5144 ++
5145 ++ if (cpu_smt_control == CPU_SMT_NOT_SUPPORTED)
5146 ++ return -ENODEV;
5147 ++
5148 ++ ret = lock_device_hotplug_sysfs();
5149 ++ if (ret)
5150 ++ return ret;
5151 ++
5152 ++ if (ctrlval != cpu_smt_control) {
5153 ++ switch (ctrlval) {
5154 ++ case CPU_SMT_ENABLED:
5155 ++ ret = cpuhp_smt_enable();
5156 ++ break;
5157 ++ case CPU_SMT_DISABLED:
5158 ++ case CPU_SMT_FORCE_DISABLED:
5159 ++ ret = cpuhp_smt_disable(ctrlval);
5160 ++ break;
5161 ++ }
5162 ++ }
5163 ++
5164 ++ unlock_device_hotplug();
5165 ++ return ret ? ret : count;
5166 ++}
5167 ++static DEVICE_ATTR(control, 0644, show_smt_control, store_smt_control);
5168 ++
5169 ++static ssize_t
5170 ++show_smt_active(struct device *dev, struct device_attribute *attr, char *buf)
5171 ++{
5172 ++ bool active = topology_max_smt_threads() > 1;
5173 ++
5174 ++ return snprintf(buf, PAGE_SIZE - 2, "%d\n", active);
5175 ++}
5176 ++static DEVICE_ATTR(active, 0444, show_smt_active, NULL);
5177 ++
5178 ++static struct attribute *cpuhp_smt_attrs[] = {
5179 ++ &dev_attr_control.attr,
5180 ++ &dev_attr_active.attr,
5181 ++ NULL
5182 ++};
5183 ++
5184 ++static const struct attribute_group cpuhp_smt_attr_group = {
5185 ++ .attrs = cpuhp_smt_attrs,
5186 ++ .name = "smt",
5187 ++ NULL
5188 ++};
5189 ++
5190 ++static int __init cpu_smt_state_init(void)
5191 ++{
5192 ++ return sysfs_create_group(&cpu_subsys.dev_root->kobj,
5193 ++ &cpuhp_smt_attr_group);
5194 ++}
5195 ++
5196 ++#else
5197 ++static inline int cpu_smt_state_init(void) { return 0; }
5198 ++#endif
5199 ++
5200 + static int __init cpuhp_sysfs_init(void)
5201 + {
5202 + int cpu, ret;
5203 +
5204 ++ ret = cpu_smt_state_init();
5205 ++ if (ret)
5206 ++ return ret;
5207 ++
5208 + ret = sysfs_create_group(&cpu_subsys.dev_root->kobj,
5209 + &cpuhp_cpu_root_attr_group);
5210 + if (ret)
5211 +@@ -2022,7 +2277,10 @@ void __init boot_cpu_init(void)
5212 + /*
5213 + * Must be called _AFTER_ setting up the per_cpu areas
5214 + */
5215 +-void __init boot_cpu_state_init(void)
5216 ++void __init boot_cpu_hotplug_init(void)
5217 + {
5218 +- per_cpu_ptr(&cpuhp_state, smp_processor_id())->state = CPUHP_ONLINE;
5219 ++#ifdef CONFIG_SMP
5220 ++ this_cpu_write(cpuhp_state.booted_once, true);
5221 ++#endif
5222 ++ this_cpu_write(cpuhp_state.state, CPUHP_ONLINE);
5223 + }
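The control and active attributes land under /sys/devices/system/cpu/smt/, matching the ABI documentation added at the top of this patch. A small userspace sketch that reads the current control value and, if SMT is on, asks the kernel to take the siblings offline (error handling trimmed; needs root):

/* Userspace sketch for the new smt/control sysfs file.               */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char state[32] = "";
	FILE *f = fopen("/sys/devices/system/cpu/smt/control", "r+");

	if (!f)
		return 1;	/* kernel built without CONFIG_HOTPLUG_SMT */

	if (fgets(state, sizeof(state), f))
		printf("SMT control: %s", state);

	/* Writes fail with EPERM/ENODEV for "forceoff"/"notsupported"
	 * (see store_smt_control() above).                            */
	if (strncmp(state, "on", 2) == 0) {
		rewind(f);
		fputs("off", f);	/* offline all sibling threads   */
	}
	fclose(f);
	return 0;
}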
5224 +diff --git a/kernel/sched/core.c b/kernel/sched/core.c
5225 +index 31615d1ae44c..4e89ed8a0fb2 100644
5226 +--- a/kernel/sched/core.c
5227 ++++ b/kernel/sched/core.c
5228 +@@ -5615,6 +5615,18 @@ int sched_cpu_activate(unsigned int cpu)
5229 + struct rq *rq = cpu_rq(cpu);
5230 + struct rq_flags rf;
5231 +
5232 ++#ifdef CONFIG_SCHED_SMT
5233 ++ /*
5234 ++ * The sched_smt_present static key needs to be evaluated on every
5235 ++ * hotplug event because at boot time SMT might be disabled when
5236 ++ * the number of booted CPUs is limited.
5237 ++ *
5238 ++ * If then later a sibling gets hotplugged, then the key would stay
5239 ++ * off and SMT scheduling would never be functional.
5240 ++ */
5241 ++ if (cpumask_weight(cpu_smt_mask(cpu)) > 1)
5242 ++ static_branch_enable_cpuslocked(&sched_smt_present);
5243 ++#endif
5244 + set_cpu_active(cpu, true);
5245 +
5246 + if (sched_smp_initialized) {
5247 +@@ -5710,22 +5722,6 @@ int sched_cpu_dying(unsigned int cpu)
5248 + }
5249 + #endif
5250 +
5251 +-#ifdef CONFIG_SCHED_SMT
5252 +-DEFINE_STATIC_KEY_FALSE(sched_smt_present);
5253 +-
5254 +-static void sched_init_smt(void)
5255 +-{
5256 +- /*
5257 +- * We've enumerated all CPUs and will assume that if any CPU
5258 +- * has SMT siblings, CPU0 will too.
5259 +- */
5260 +- if (cpumask_weight(cpu_smt_mask(0)) > 1)
5261 +- static_branch_enable(&sched_smt_present);
5262 +-}
5263 +-#else
5264 +-static inline void sched_init_smt(void) { }
5265 +-#endif
5266 +-
5267 + void __init sched_init_smp(void)
5268 + {
5269 + cpumask_var_t non_isolated_cpus;
5270 +@@ -5755,8 +5751,6 @@ void __init sched_init_smp(void)
5271 + init_sched_rt_class();
5272 + init_sched_dl_class();
5273 +
5274 +- sched_init_smt();
5275 +-
5276 + sched_smp_initialized = true;
5277 + }
5278 +
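Moving sched_smt_present out of sched_init_smt() and flipping it from sched_cpu_activate() is the usual static-key pattern: define the key false, enable it the first time the condition is seen, and branch on it cheaply in hot paths. A generic sketch with illustrative names (the _cpuslocked enable variant matches the hotplug-callback context used above):

/* Sketch of the static-key pattern used for sched_smt_present.       */
DEFINE_STATIC_KEY_FALSE(example_feature_key);

static void example_cpu_online(unsigned int cpu)
{
	/* Called from a hotplug callback with the CPU hotplug lock
	 * held, hence the _cpuslocked variant; enabling an already
	 * enabled key is harmless.                                    */
	if (cpumask_weight(cpu_smt_mask(cpu)) > 1)
		static_branch_enable_cpuslocked(&example_feature_key);
}

static bool example_hot_path(void)
{
	/* Compiles down to a patched jump/NOP, so the disabled case
	 * costs nothing on the fast path.                             */
	return static_branch_likely(&example_feature_key);
}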
5279 +diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
5280 +index 5c09ddf8c832..0cc7098c6dfd 100644
5281 +--- a/kernel/sched/fair.c
5282 ++++ b/kernel/sched/fair.c
5283 +@@ -5631,6 +5631,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
5284 + }
5285 +
5286 + #ifdef CONFIG_SCHED_SMT
5287 ++DEFINE_STATIC_KEY_FALSE(sched_smt_present);
5288 +
5289 + static inline void set_idle_cores(int cpu, int val)
5290 + {
5291 +diff --git a/kernel/smp.c b/kernel/smp.c
5292 +index c94dd85c8d41..2d1da290f144 100644
5293 +--- a/kernel/smp.c
5294 ++++ b/kernel/smp.c
5295 +@@ -584,6 +584,8 @@ void __init smp_init(void)
5296 + num_nodes, (num_nodes > 1 ? "s" : ""),
5297 + num_cpus, (num_cpus > 1 ? "s" : ""));
5298 +
5299 ++ /* Final decision about SMT support */
5300 ++ cpu_smt_check_topology();
5301 + /* Any cleanup work */
5302 + smp_cpus_done(setup_max_cpus);
5303 + }
5304 +diff --git a/kernel/softirq.c b/kernel/softirq.c
5305 +index f40ac7191257..a4c87cf27f9d 100644
5306 +--- a/kernel/softirq.c
5307 ++++ b/kernel/softirq.c
5308 +@@ -79,12 +79,16 @@ static void wakeup_softirqd(void)
5309 +
5310 + /*
5311 + * If ksoftirqd is scheduled, we do not want to process pending softirqs
5312 +- * right now. Let ksoftirqd handle this at its own rate, to get fairness.
5313 ++ * right now. Let ksoftirqd handle this at its own rate, to get fairness,
5314 ++ * unless we're doing some of the synchronous softirqs.
5315 + */
5316 +-static bool ksoftirqd_running(void)
5317 ++#define SOFTIRQ_NOW_MASK ((1 << HI_SOFTIRQ) | (1 << TASKLET_SOFTIRQ))
5318 ++static bool ksoftirqd_running(unsigned long pending)
5319 + {
5320 + struct task_struct *tsk = __this_cpu_read(ksoftirqd);
5321 +
5322 ++ if (pending & SOFTIRQ_NOW_MASK)
5323 ++ return false;
5324 + return tsk && (tsk->state == TASK_RUNNING);
5325 + }
5326 +
5327 +@@ -324,7 +328,7 @@ asmlinkage __visible void do_softirq(void)
5328 +
5329 + pending = local_softirq_pending();
5330 +
5331 +- if (pending && !ksoftirqd_running())
5332 ++ if (pending && !ksoftirqd_running(pending))
5333 + do_softirq_own_stack();
5334 +
5335 + local_irq_restore(flags);
5336 +@@ -351,7 +355,7 @@ void irq_enter(void)
5337 +
5338 + static inline void invoke_softirq(void)
5339 + {
5340 +- if (ksoftirqd_running())
5341 ++ if (ksoftirqd_running(local_softirq_pending()))
5342 + return;
5343 +
5344 + if (!force_irqthreads) {
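SOFTIRQ_NOW_MASK whitelists the HI and TASKLET softirqs: if any of those bits are pending, the softirqs are processed immediately even though ksoftirqd is runnable, instead of being deferred behind it. The decision boils down to a bit test, sketched here with an illustrative helper:

/* Sketch of the deferral decision in ksoftirqd_running() above.      */
#define EXAMPLE_NOW_MASK ((1 << HI_SOFTIRQ) | (1 << TASKLET_SOFTIRQ))

static bool example_defer_to_ksoftirqd(unsigned long pending,
				       bool ksoftirqd_runnable)
{
	if (pending & EXAMPLE_NOW_MASK)	/* HI/TASKLET pending: run now   */
		return false;
	return ksoftirqd_runnable;	/* otherwise defer if it's running */
}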
5345 +diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
5346 +index 1ff523dae6e2..e190d1ef3a23 100644
5347 +--- a/kernel/stop_machine.c
5348 ++++ b/kernel/stop_machine.c
5349 +@@ -260,6 +260,15 @@ retry:
5350 + err = 0;
5351 + __cpu_stop_queue_work(stopper1, work1, &wakeq);
5352 + __cpu_stop_queue_work(stopper2, work2, &wakeq);
5353 ++ /*
5354 ++ * The waking up of stopper threads has to happen
5355 ++ * in the same scheduling context as the queueing.
5356 ++ * Otherwise, there is a possibility of one of the
5357 ++ * above stoppers being woken up by another CPU,
5358 ++ * and preempting us. This will cause us to not
5359 ++ * wake up the other stopper forever.
5360 ++ */
5361 ++ preempt_disable();
5362 + unlock:
5363 + raw_spin_unlock(&stopper2->lock);
5364 + raw_spin_unlock_irq(&stopper1->lock);
5365 +@@ -271,7 +280,6 @@ unlock:
5366 + }
5367 +
5368 + if (!err) {
5369 +- preempt_disable();
5370 + wake_up_q(&wakeq);
5371 + preempt_enable();
5372 + }
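Moving preempt_disable() ahead of the unlocks guarantees that both wakeups are issued by the queueing CPU before either freshly-woken stopper thread can preempt it. A reduced sketch of the queue-then-wake ordering (the function is illustrative; wake_up_q() and the locking/preempt primitives are the real ones):

/* Sketch: queue work under both locks, then wake with preemption
 * disabled so a woken stopper cannot preempt us halfway through.     */
static void example_queue_pair_and_wake(struct wake_q_head *wakeq,
					raw_spinlock_t *l1,
					raw_spinlock_t *l2)
{
	raw_spin_lock_irq(l1);
	raw_spin_lock(l2);
	/* ... __cpu_stop_queue_work() for both stoppers ... */
	preempt_disable();		/* must happen before unlocking  */
	raw_spin_unlock(l2);
	raw_spin_unlock_irq(l1);

	wake_up_q(wakeq);		/* both wakeups sent by this CPU */
	preempt_enable();
}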
5373 +diff --git a/mm/memory.c b/mm/memory.c
5374 +index fc7779165dcf..5539b1975091 100644
5375 +--- a/mm/memory.c
5376 ++++ b/mm/memory.c
5377 +@@ -1887,6 +1887,9 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
5378 + if (addr < vma->vm_start || addr >= vma->vm_end)
5379 + return -EFAULT;
5380 +
5381 ++ if (!pfn_modify_allowed(pfn, pgprot))
5382 ++ return -EACCES;
5383 ++
5384 + track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
5385 +
5386 + ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
5387 +@@ -1908,6 +1911,9 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
5388 +
5389 + track_pfn_insert(vma, &pgprot, pfn);
5390 +
5391 ++ if (!pfn_modify_allowed(pfn_t_to_pfn(pfn), pgprot))
5392 ++ return -EACCES;
5393 ++
5394 + /*
5395 + * If we don't have pte special, then we have to use the pfn_valid()
5396 + * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
5397 +@@ -1955,6 +1961,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
5398 + {
5399 + pte_t *pte;
5400 + spinlock_t *ptl;
5401 ++ int err = 0;
5402 +
5403 + pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
5404 + if (!pte)
5405 +@@ -1962,12 +1969,16 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
5406 + arch_enter_lazy_mmu_mode();
5407 + do {
5408 + BUG_ON(!pte_none(*pte));
5409 ++ if (!pfn_modify_allowed(pfn, prot)) {
5410 ++ err = -EACCES;
5411 ++ break;
5412 ++ }
5413 + set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
5414 + pfn++;
5415 + } while (pte++, addr += PAGE_SIZE, addr != end);
5416 + arch_leave_lazy_mmu_mode();
5417 + pte_unmap_unlock(pte - 1, ptl);
5418 +- return 0;
5419 ++ return err;
5420 + }
5421 +
5422 + static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
5423 +@@ -1976,6 +1987,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
5424 + {
5425 + pmd_t *pmd;
5426 + unsigned long next;
5427 ++ int err;
5428 +
5429 + pfn -= addr >> PAGE_SHIFT;
5430 + pmd = pmd_alloc(mm, pud, addr);
5431 +@@ -1984,9 +1996,10 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
5432 + VM_BUG_ON(pmd_trans_huge(*pmd));
5433 + do {
5434 + next = pmd_addr_end(addr, end);
5435 +- if (remap_pte_range(mm, pmd, addr, next,
5436 +- pfn + (addr >> PAGE_SHIFT), prot))
5437 +- return -ENOMEM;
5438 ++ err = remap_pte_range(mm, pmd, addr, next,
5439 ++ pfn + (addr >> PAGE_SHIFT), prot);
5440 ++ if (err)
5441 ++ return err;
5442 + } while (pmd++, addr = next, addr != end);
5443 + return 0;
5444 + }
5445 +@@ -1997,6 +2010,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
5446 + {
5447 + pud_t *pud;
5448 + unsigned long next;
5449 ++ int err;
5450 +
5451 + pfn -= addr >> PAGE_SHIFT;
5452 + pud = pud_alloc(mm, p4d, addr);
5453 +@@ -2004,9 +2018,10 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
5454 + return -ENOMEM;
5455 + do {
5456 + next = pud_addr_end(addr, end);
5457 +- if (remap_pmd_range(mm, pud, addr, next,
5458 +- pfn + (addr >> PAGE_SHIFT), prot))
5459 +- return -ENOMEM;
5460 ++ err = remap_pmd_range(mm, pud, addr, next,
5461 ++ pfn + (addr >> PAGE_SHIFT), prot);
5462 ++ if (err)
5463 ++ return err;
5464 + } while (pud++, addr = next, addr != end);
5465 + return 0;
5466 + }
5467 +@@ -2017,6 +2032,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
5468 + {
5469 + p4d_t *p4d;
5470 + unsigned long next;
5471 ++ int err;
5472 +
5473 + pfn -= addr >> PAGE_SHIFT;
5474 + p4d = p4d_alloc(mm, pgd, addr);
5475 +@@ -2024,9 +2040,10 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
5476 + return -ENOMEM;
5477 + do {
5478 + next = p4d_addr_end(addr, end);
5479 +- if (remap_pud_range(mm, p4d, addr, next,
5480 +- pfn + (addr >> PAGE_SHIFT), prot))
5481 +- return -ENOMEM;
5482 ++ err = remap_pud_range(mm, p4d, addr, next,
5483 ++ pfn + (addr >> PAGE_SHIFT), prot);
5484 ++ if (err)
5485 ++ return err;
5486 + } while (p4d++, addr = next, addr != end);
5487 + return 0;
5488 + }
5489 +diff --git a/mm/mprotect.c b/mm/mprotect.c
5490 +index 58b629bb70de..60864e19421e 100644
5491 +--- a/mm/mprotect.c
5492 ++++ b/mm/mprotect.c
5493 +@@ -292,6 +292,42 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
5494 + return pages;
5495 + }
5496 +
5497 ++static int prot_none_pte_entry(pte_t *pte, unsigned long addr,
5498 ++ unsigned long next, struct mm_walk *walk)
5499 ++{
5500 ++ return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
5501 ++ 0 : -EACCES;
5502 ++}
5503 ++
5504 ++static int prot_none_hugetlb_entry(pte_t *pte, unsigned long hmask,
5505 ++ unsigned long addr, unsigned long next,
5506 ++ struct mm_walk *walk)
5507 ++{
5508 ++ return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
5509 ++ 0 : -EACCES;
5510 ++}
5511 ++
5512 ++static int prot_none_test(unsigned long addr, unsigned long next,
5513 ++ struct mm_walk *walk)
5514 ++{
5515 ++ return 0;
5516 ++}
5517 ++
5518 ++static int prot_none_walk(struct vm_area_struct *vma, unsigned long start,
5519 ++ unsigned long end, unsigned long newflags)
5520 ++{
5521 ++ pgprot_t new_pgprot = vm_get_page_prot(newflags);
5522 ++ struct mm_walk prot_none_walk = {
5523 ++ .pte_entry = prot_none_pte_entry,
5524 ++ .hugetlb_entry = prot_none_hugetlb_entry,
5525 ++ .test_walk = prot_none_test,
5526 ++ .mm = current->mm,
5527 ++ .private = &new_pgprot,
5528 ++ };
5529 ++
5530 ++ return walk_page_range(start, end, &prot_none_walk);
5531 ++}
5532 ++
5533 + int
5534 + mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
5535 + unsigned long start, unsigned long end, unsigned long newflags)
5536 +@@ -309,6 +345,19 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
5537 + return 0;
5538 + }
5539 +
5540 ++ /*
5541 ++ * Do PROT_NONE PFN permission checks here when we can still
5542 ++ * bail out without undoing a lot of state. This is a rather
5543 ++ * uncommon case, so it doesn't need to be very optimized.
5544 ++ */
5545 ++ if (arch_has_pfn_modify_check() &&
5546 ++ (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
5547 ++ (newflags & (VM_READ|VM_WRITE|VM_EXEC)) == 0) {
5548 ++ error = prot_none_walk(vma, start, end, newflags);
5549 ++ if (error)
5550 ++ return error;
5551 ++ }
5552 ++
5553 + /*
5554 + * If we make a private mapping writable we increase our commit;
5555 + * but (without finer accounting) cannot reduce our commit if we
5556 +diff --git a/mm/swapfile.c b/mm/swapfile.c
5557 +index 03d2ce288d83..8cbc7d6fd52e 100644
5558 +--- a/mm/swapfile.c
5559 ++++ b/mm/swapfile.c
5560 +@@ -2902,6 +2902,35 @@ static int claim_swapfile(struct swap_info_struct *p, struct inode *inode)
5561 + return 0;
5562 + }
5563 +
5564 ++
5565 ++/*
5566 ++ * Find out how many pages are allowed for a single swap device. There
5567 ++ * are two limiting factors:
5568 ++ * 1) the number of bits for the swap offset in the swp_entry_t type, and
5569 ++ * 2) the number of bits in the swap pte, as defined by the different
5570 ++ * architectures.
5571 ++ *
5572 ++ * In order to find the largest possible bit mask, a swap entry with
5573 ++ * swap type 0 and swap offset ~0UL is created, encoded to a swap pte,
5574 ++ * decoded to a swp_entry_t again, and finally the swap offset is
5575 ++ * extracted.
5576 ++ *
5577 ++ * This will mask all the bits from the initial ~0UL mask that can't
5578 ++ * be encoded in either the swp_entry_t or the architecture definition
5579 ++ * of a swap pte.
5580 ++ */
5581 ++unsigned long generic_max_swapfile_size(void)
5582 ++{
5583 ++ return swp_offset(pte_to_swp_entry(
5584 ++ swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
5585 ++}
5586 ++
5587 ++/* Can be overridden by an architecture for additional checks. */
5588 ++__weak unsigned long max_swapfile_size(void)
5589 ++{
5590 ++ return generic_max_swapfile_size();
5591 ++}
5592 ++
5593 + static unsigned long read_swap_header(struct swap_info_struct *p,
5594 + union swap_header *swap_header,
5595 + struct inode *inode)
5596 +@@ -2937,22 +2966,7 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
5597 + p->cluster_next = 1;
5598 + p->cluster_nr = 0;
5599 +
5600 +- /*
5601 +- * Find out how many pages are allowed for a single swap
5602 +- * device. There are two limiting factors: 1) the number
5603 +- * of bits for the swap offset in the swp_entry_t type, and
5604 +- * 2) the number of bits in the swap pte as defined by the
5605 +- * different architectures. In order to find the
5606 +- * largest possible bit mask, a swap entry with swap type 0
5607 +- * and swap offset ~0UL is created, encoded to a swap pte,
5608 +- * decoded to a swp_entry_t again, and finally the swap
5609 +- * offset is extracted. This will mask all the bits from
5610 +- * the initial ~0UL mask that can't be encoded in either
5611 +- * the swp_entry_t or the architecture definition of a
5612 +- * swap pte.
5613 +- */
5614 +- maxpages = swp_offset(pte_to_swp_entry(
5615 +- swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
5616 ++ maxpages = max_swapfile_size();
5617 + last_page = swap_header->info.last_page;
5618 + if (!last_page) {
5619 + pr_warn("Empty swap-file\n");
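generic_max_swapfile_size() round-trips an all-ones offset through the architecture's swap-PTE encoding to discover how many offset bits actually survive, and max_swapfile_size() is left __weak so an architecture can clamp the result further (the x86 L1TF code elsewhere in this patch does exactly that). A hedged sketch of such an override; the two example_ helpers are illustrative:

/* arch/<arch>/mm/init.c -- illustrative override of the __weak hook. */
unsigned long max_swapfile_size(void)
{
	unsigned long pages = generic_max_swapfile_size();

	/* Hypothetical clamp: keep swap offsets below a PFN limit the
	 * CPU could otherwise leak through speculation.               */
	if (example_cpu_has_l1tf())
		pages = min_t(unsigned long, pages,
			      example_l1tf_pfn_limit());

	return pages;
}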
5620 +diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h
5621 +index 403e97d5e243..8418462298e7 100644
5622 +--- a/tools/arch/x86/include/asm/cpufeatures.h
5623 ++++ b/tools/arch/x86/include/asm/cpufeatures.h
5624 +@@ -219,6 +219,7 @@
5625 + #define X86_FEATURE_IBPB ( 7*32+26) /* Indirect Branch Prediction Barrier */
5626 + #define X86_FEATURE_STIBP ( 7*32+27) /* Single Thread Indirect Branch Predictors */
5627 + #define X86_FEATURE_ZEN ( 7*32+28) /* "" CPU is AMD family 0x17 (Zen) */
5628 ++#define X86_FEATURE_L1TF_PTEINV ( 7*32+29) /* "" L1TF workaround PTE inversion */
5629 +
5630 + /* Virtualization flags: Linux defined, word 8 */
5631 + #define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */
5632 +@@ -338,6 +339,7 @@
5633 + #define X86_FEATURE_PCONFIG (18*32+18) /* Intel PCONFIG */
5634 + #define X86_FEATURE_SPEC_CTRL (18*32+26) /* "" Speculation Control (IBRS + IBPB) */
5635 + #define X86_FEATURE_INTEL_STIBP (18*32+27) /* "" Single Thread Indirect Branch Predictors */
5636 ++#define X86_FEATURE_FLUSH_L1D (18*32+28) /* Flush L1D cache */
5637 + #define X86_FEATURE_ARCH_CAPABILITIES (18*32+29) /* IA32_ARCH_CAPABILITIES MSR (Intel) */
5638 + #define X86_FEATURE_SPEC_CTRL_SSBD (18*32+31) /* "" Speculative Store Bypass Disable */
5639 +
5640 +@@ -370,5 +372,6 @@
5641 + #define X86_BUG_SPECTRE_V1 X86_BUG(15) /* CPU is affected by Spectre variant 1 attack with conditional branches */
5642 + #define X86_BUG_SPECTRE_V2 X86_BUG(16) /* CPU is affected by Spectre variant 2 attack with indirect branches */
5643 + #define X86_BUG_SPEC_STORE_BYPASS X86_BUG(17) /* CPU is affected by speculative store bypass attack */
5644 ++#define X86_BUG_L1TF X86_BUG(18) /* CPU is affected by L1 Terminal Fault */
5645 +
5646 + #endif /* _ASM_X86_CPUFEATURES_H */