Gentoo Archives: gentoo-commits

From: Mike Pagano <mpagano@g.o>
To: gentoo-commits@l.g.o
Subject: [gentoo-commits] proj/linux-patches:4.18 commit in: /
Date: Wed, 15 Aug 2018 16:37:04
Message-Id: 1534351012.ad052097fe9d40c63236e6ae02f106d5226de58d.mpagano@gentoo
1 commit: ad052097fe9d40c63236e6ae02f106d5226de58d
2 Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
3 AuthorDate: Wed Aug 15 16:36:52 2018 +0000
4 Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
5 CommitDate: Wed Aug 15 16:36:52 2018 +0000
6 URL: https://gitweb.gentoo.org/proj/linux-patches.git/commit/?id=ad052097
7
8 Linux patch 4.18.1
9
10 0000_README | 4 +
11 1000_linux-4.18.1.patch | 4083 +++++++++++++++++++++++++++++++++++++++++++++++
12 2 files changed, 4087 insertions(+)
13
14 diff --git a/0000_README b/0000_README
15 index 917d838..cf32ff2 100644
16 --- a/0000_README
17 +++ b/0000_README
18 @@ -43,6 +43,10 @@ EXPERIMENTAL
19 Individual Patch Descriptions:
20 --------------------------------------------------------------------------
21
22 +Patch: 1000_linux-4.18.1.patch
23 +From: http://www.kernel.org
24 +Desc: Linux 4.18.1
25 +
26 Patch: 1500_XATTR_USER_PREFIX.patch
27 From: https://bugs.gentoo.org/show_bug.cgi?id=470644
28 Desc: Support for namespace user.pax.* on tmpfs.
29
30 diff --git a/1000_linux-4.18.1.patch b/1000_linux-4.18.1.patch
31 new file mode 100644
32 index 0000000..bd9c2da
33 --- /dev/null
34 +++ b/1000_linux-4.18.1.patch
35 @@ -0,0 +1,4083 @@
36 +diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
37 +index 9c5e7732d249..73318225a368 100644
38 +--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
39 ++++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
40 +@@ -476,6 +476,7 @@ What: /sys/devices/system/cpu/vulnerabilities
41 + /sys/devices/system/cpu/vulnerabilities/spectre_v1
42 + /sys/devices/system/cpu/vulnerabilities/spectre_v2
43 + /sys/devices/system/cpu/vulnerabilities/spec_store_bypass
44 ++ /sys/devices/system/cpu/vulnerabilities/l1tf
45 + Date: January 2018
46 + Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
47 + Description: Information about CPU vulnerabilities
48 +@@ -487,3 +488,26 @@ Description: Information about CPU vulnerabilities
49 + "Not affected" CPU is not affected by the vulnerability
50 + "Vulnerable" CPU is affected and no mitigation in effect
51 + "Mitigation: $M" CPU is affected and mitigation $M is in effect
52 ++
53 ++ Details about the l1tf file can be found in
54 ++ Documentation/admin-guide/l1tf.rst
55 ++
56 ++What: /sys/devices/system/cpu/smt
57 ++ /sys/devices/system/cpu/smt/active
58 ++ /sys/devices/system/cpu/smt/control
59 ++Date: June 2018
60 ++Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
61 ++Description: Control Symmetric Multi Threading (SMT)
62 ++
63 ++ active: Tells whether SMT is active (enabled and siblings online)
64 ++
65 ++ control: Read/write interface to control SMT. Possible
66 ++ values:
67 ++
68 ++ "on" SMT is enabled
69 ++ "off" SMT is disabled
70 ++ "forceoff" SMT is force disabled. Cannot be changed.
71 ++ "notsupported" SMT is not supported by the CPU
72 ++
73 ++ If control status is "forceoff" or "notsupported" writes
74 ++ are rejected.
75 +diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
76 +index 48d70af11652..0873685bab0f 100644
77 +--- a/Documentation/admin-guide/index.rst
78 ++++ b/Documentation/admin-guide/index.rst
79 +@@ -17,6 +17,15 @@ etc.
80 + kernel-parameters
81 + devices
82 +
83 ++This section describes CPU vulnerabilities and provides an overview of the
84 ++possible mitigations along with guidance for selecting mitigations if they
85 ++are configurable at compile, boot or run time.
86 ++
87 ++.. toctree::
88 ++ :maxdepth: 1
89 ++
90 ++ l1tf
91 ++
92 + Here is a set of documents aimed at users who are trying to track down
93 + problems and bugs in particular.
94 +
95 +diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
96 +index 533ff5c68970..1370b424a453 100644
97 +--- a/Documentation/admin-guide/kernel-parameters.txt
98 ++++ b/Documentation/admin-guide/kernel-parameters.txt
99 +@@ -1967,10 +1967,84 @@
100 + (virtualized real and unpaged mode) on capable
101 + Intel chips. Default is 1 (enabled)
102 +
103 ++ kvm-intel.vmentry_l1d_flush=[KVM,Intel] Mitigation for L1 Terminal Fault
104 ++ CVE-2018-3620.
105 ++
106 ++ Valid arguments: never, cond, always
107 ++
108 ++ always: L1D cache flush on every VMENTER.
109 ++ cond: Flush L1D on VMENTER only when the code between
110 ++ VMEXIT and VMENTER can leak host memory.
111 ++ never: Disables the mitigation
112 ++
113 ++ Default is cond (do L1 cache flush in specific instances)
114 ++
115 + kvm-intel.vpid= [KVM,Intel] Disable Virtual Processor Identification
116 + feature (tagged TLBs) on capable Intel chips.
117 + Default is 1 (enabled)
118 +
119 ++ l1tf= [X86] Control mitigation of the L1TF vulnerability on
120 ++ affected CPUs
121 ++
122 ++ The kernel PTE inversion protection is unconditionally
123 ++ enabled and cannot be disabled.
124 ++
125 ++ full
126 ++ Provides all available mitigations for the
127 ++ L1TF vulnerability. Disables SMT and
128 ++ enables all mitigations in the
129 ++ hypervisors, i.e. unconditional L1D flush.
130 ++
131 ++ SMT control and L1D flush control via the
132 ++ sysfs interface is still possible after
133 ++ boot. Hypervisors will issue a warning
134 ++ when the first VM is started in a
135 ++ potentially insecure configuration,
136 ++ i.e. SMT enabled or L1D flush disabled.
137 ++
138 ++ full,force
139 ++ Same as 'full', but disables SMT and L1D
140 ++ flush runtime control. Implies the
141 ++ 'nosmt=force' command line option.
142 ++ (i.e. sysfs control of SMT is disabled.)
143 ++
144 ++ flush
145 ++ Leaves SMT enabled and enables the default
146 ++ hypervisor mitigation, i.e. conditional
147 ++ L1D flush.
148 ++
149 ++ SMT control and L1D flush control via the
150 ++ sysfs interface is still possible after
151 ++ boot. Hypervisors will issue a warning
152 ++ when the first VM is started in a
153 ++ potentially insecure configuration,
154 ++ i.e. SMT enabled or L1D flush disabled.
155 ++
156 ++ flush,nosmt
157 ++
158 ++ Disables SMT and enables the default
159 ++ hypervisor mitigation.
160 ++
161 ++ SMT control and L1D flush control via the
162 ++ sysfs interface is still possible after
163 ++ boot. Hypervisors will issue a warning
164 ++ when the first VM is started in a
165 ++ potentially insecure configuration,
166 ++ i.e. SMT enabled or L1D flush disabled.
167 ++
168 ++ flush,nowarn
169 ++ Same as 'flush', but hypervisors will not
170 ++ warn when a VM is started in a potentially
171 ++ insecure configuration.
172 ++
173 ++ off
174 ++ Disables hypervisor mitigations and doesn't
175 ++ emit any warnings.
176 ++
177 ++ Default is 'flush'.
178 ++
179 ++ For details see: Documentation/admin-guide/l1tf.rst
180 ++
181 + l2cr= [PPC]
182 +
183 + l3cr= [PPC]
184 +@@ -2687,6 +2761,10 @@
185 + nosmt [KNL,S390] Disable symmetric multithreading (SMT).
186 + Equivalent to smt=1.
187 +
188 ++ [KNL,x86] Disable symmetric multithreading (SMT).
189 ++ nosmt=force: Force disable SMT, cannot be undone
190 ++ via the sysfs control file.
191 ++
192 + nospectre_v2 [X86] Disable all mitigations for the Spectre variant 2
193 + (indirect branch prediction) vulnerability. System may
194 + allow data leaks with this option, which is equivalent
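
To make the relationship between the documented "l1tf=" strings and the kernel-internal states more concrete, here is a hedged user-space sketch: the enum values are the ones added to arch/x86/include/asm/processor.h later in this patch, while the parser itself is purely illustrative and is not the kernel's actual command-line handling.

/* Illustrative mapping of the documented "l1tf=" values onto the
 * enum l1tf_mitigations added later in this patch. */
#include <stdio.h>
#include <string.h>

enum l1tf_mitigations {
    L1TF_MITIGATION_OFF,
    L1TF_MITIGATION_FLUSH_NOWARN,
    L1TF_MITIGATION_FLUSH,
    L1TF_MITIGATION_FLUSH_NOSMT,
    L1TF_MITIGATION_FULL,
    L1TF_MITIGATION_FULL_FORCE
};

static enum l1tf_mitigations parse_l1tf(const char *arg)
{
    if (!strcmp(arg, "off"))
        return L1TF_MITIGATION_OFF;
    if (!strcmp(arg, "flush,nowarn"))
        return L1TF_MITIGATION_FLUSH_NOWARN;
    if (!strcmp(arg, "flush,nosmt"))
        return L1TF_MITIGATION_FLUSH_NOSMT;
    if (!strcmp(arg, "full,force"))
        return L1TF_MITIGATION_FULL_FORCE;
    if (!strcmp(arg, "full"))
        return L1TF_MITIGATION_FULL;
    /* "flush" is the documented default */
    return L1TF_MITIGATION_FLUSH;
}

int main(void)
{
    printf("l1tf=full,force -> %d\n", parse_l1tf("full,force"));
    printf("l1tf=off        -> %d\n", parse_l1tf("off"));
    return 0;
}
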
195 +diff --git a/Documentation/admin-guide/l1tf.rst b/Documentation/admin-guide/l1tf.rst
196 +new file mode 100644
197 +index 000000000000..bae52b845de0
198 +--- /dev/null
199 ++++ b/Documentation/admin-guide/l1tf.rst
200 +@@ -0,0 +1,610 @@
201 ++L1TF - L1 Terminal Fault
202 ++========================
203 ++
204 ++L1 Terminal Fault is a hardware vulnerability which allows unprivileged
205 ++speculative access to data which is available in the Level 1 Data Cache
206 ++when the page table entry controlling the virtual address, which is used
207 ++for the access, has the Present bit cleared or other reserved bits set.
208 ++
209 ++Affected processors
210 ++-------------------
211 ++
212 ++This vulnerability affects a wide range of Intel processors. The
213 ++vulnerability is not present on:
214 ++
215 ++ - Processors from AMD, Centaur and other non Intel vendors
216 ++
217 ++ - Older processor models, where the CPU family is < 6
218 ++
219 ++ - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
220 ++ Penwell, Pineview, Silvermont, Airmont, Merrifield)
221 ++
222 ++ - The Intel XEON PHI family
223 ++
224 ++ - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
225 ++ IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
226 ++ by the Meltdown vulnerability either. These CPUs should become
227 ++ available by end of 2018.
228 ++
229 ++Whether a processor is affected or not can be read out from the L1TF
230 ++vulnerability file in sysfs. See :ref:`l1tf_sys_info`.
231 ++
232 ++Related CVEs
233 ++------------
234 ++
235 ++The following CVE entries are related to the L1TF vulnerability:
236 ++
237 ++ ============= ================= ==============================
238 ++ CVE-2018-3615 L1 Terminal Fault SGX related aspects
239 ++ CVE-2018-3620 L1 Terminal Fault OS, SMM related aspects
240 ++ CVE-2018-3646 L1 Terminal Fault Virtualization related aspects
241 ++ ============= ================= ==============================
242 ++
243 ++Problem
244 ++-------
245 ++
246 ++If an instruction accesses a virtual address for which the relevant page
247 ++table entry (PTE) has the Present bit cleared or other reserved bits set,
248 ++then speculative execution ignores the invalid PTE and loads the referenced
249 ++data if it is present in the Level 1 Data Cache, as if the page referenced
250 ++by the address bits in the PTE was still present and accessible.
251 ++
252 ++While this is a purely speculative mechanism and the instruction will raise
253 ++a page fault when it is retired eventually, the pure act of loading the
254 ++data and making it available to other speculative instructions opens up the
255 ++opportunity for side channel attacks to unprivileged malicious code,
256 ++similar to the Meltdown attack.
257 ++
258 ++While Meltdown breaks the user space to kernel space protection, L1TF
259 ++allows to attack any physical memory address in the system and the attack
260 ++works across all protection domains. It allows an attack of SGX and also
261 ++works from inside virtual machines because the speculation bypasses the
262 ++extended page table (EPT) protection mechanism.
263 ++
264 ++
265 ++Attack scenarios
266 ++----------------
267 ++
268 ++1. Malicious user space
269 ++^^^^^^^^^^^^^^^^^^^^^^^
270 ++
271 ++ Operating Systems store arbitrary information in the address bits of a
272 ++ PTE which is marked non present. This allows a malicious user space
273 ++ application to attack the physical memory to which these PTEs resolve.
274 ++ In some cases user-space can maliciously influence the information
275 ++ encoded in the address bits of the PTE, thus making attacks more
276 ++ deterministic and more practical.
277 ++
278 ++ The Linux kernel contains a mitigation for this attack vector, PTE
279 ++ inversion, which is permanently enabled and has no performance
280 ++ impact. The kernel ensures that the address bits of PTEs, which are not
281 ++ marked present, never point to cacheable physical memory space.
282 ++
283 ++ A system with an up to date kernel is protected against attacks from
284 ++ malicious user space applications.
285 ++
286 ++2. Malicious guest in a virtual machine
287 ++^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
288 ++
289 ++ The fact that L1TF breaks all domain protections allows malicious guest
290 ++ OSes, which can control the PTEs directly, and malicious guest user
291 ++ space applications, which run on an unprotected guest kernel lacking the
292 ++ PTE inversion mitigation for L1TF, to attack physical host memory.
293 ++
294 ++ A special aspect of L1TF in the context of virtualization is symmetric
295 ++ multi threading (SMT). The Intel implementation of SMT is called
296 ++ HyperThreading. The fact that Hyperthreads on the affected processors
297 ++ share the L1 Data Cache (L1D) is important for this. As the flaw allows
298 ++ only to attack data which is present in L1D, a malicious guest running
299 ++ on one Hyperthread can attack the data which is brought into the L1D by
300 ++ the context which runs on the sibling Hyperthread of the same physical
301 ++ core. This context can be host OS, host user space or a different guest.
302 ++
303 ++ If the processor does not support Extended Page Tables, the attack is
304 ++ only possible when the hypervisor does not sanitize the content of the
305 ++ effective (shadow) page tables.
306 ++
307 ++ While solutions exist to mitigate these attack vectors fully, these
308 ++ mitigations are not enabled by default in the Linux kernel because they
309 ++ can affect performance significantly. The kernel provides several
310 ++ mechanisms which can be utilized to address the problem depending on the
311 ++ deployment scenario. The mitigations, their protection scope and impact
312 ++ are described in the next sections.
313 ++
314 ++ The default mitigations and the rationale for choosing them are explained
315 ++ at the end of this document. See :ref:`default_mitigations`.
316 ++
317 ++.. _l1tf_sys_info:
318 ++
319 ++L1TF system information
320 ++-----------------------
321 ++
322 ++The Linux kernel provides a sysfs interface to enumerate the current L1TF
323 ++status of the system: whether the system is vulnerable, and which
324 ++mitigations are active. The relevant sysfs file is:
325 ++
326 ++/sys/devices/system/cpu/vulnerabilities/l1tf
327 ++
328 ++The possible values in this file are:
329 ++
330 ++ =========================== ===============================
331 ++ 'Not affected' The processor is not vulnerable
332 ++ 'Mitigation: PTE Inversion' The host protection is active
333 ++ =========================== ===============================
334 ++
335 ++If KVM/VMX is enabled and the processor is vulnerable then the following
336 ++information is appended to the 'Mitigation: PTE Inversion' part:
337 ++
338 ++ - SMT status:
339 ++
340 ++ ===================== ================
341 ++ 'VMX: SMT vulnerable' SMT is enabled
342 ++ 'VMX: SMT disabled' SMT is disabled
343 ++ ===================== ================
344 ++
345 ++ - L1D Flush mode:
346 ++
347 ++ ================================ ====================================
348 ++ 'L1D vulnerable' L1D flushing is disabled
349 ++
350 ++ 'L1D conditional cache flushes' L1D flush is conditionally enabled
351 ++
352 ++ 'L1D cache flushes' L1D flush is unconditionally enabled
353 ++ ================================ ====================================
354 ++
355 ++The resulting grade of protection is discussed in the following sections.
356 ++
357 ++
358 ++Host mitigation mechanism
359 ++-------------------------
360 ++
361 ++The kernel is unconditionally protected against L1TF attacks from malicious
362 ++user space running on the host.
363 ++
364 ++
365 ++Guest mitigation mechanisms
366 ++---------------------------
367 ++
368 ++.. _l1d_flush:
369 ++
370 ++1. L1D flush on VMENTER
371 ++^^^^^^^^^^^^^^^^^^^^^^^
372 ++
373 ++ To make sure that a guest cannot attack data which is present in the L1D
374 ++ the hypervisor flushes the L1D before entering the guest.
375 ++
376 ++ Flushing the L1D evicts not only the data which should not be accessed
377 ++ by a potentially malicious guest, it also flushes the guest
378 ++ data. Flushing the L1D has a performance impact as the processor has to
379 ++ bring the flushed guest data back into the L1D. Depending on the
380 ++ frequency of VMEXIT/VMENTER and the type of computations in the guest
381 ++ performance degradation in the range of 1% to 50% has been observed. For
382 ++ scenarios where guest VMEXIT/VMENTER are rare the performance impact is
383 ++ minimal. Virtio and mechanisms like posted interrupts are designed to
384 ++ confine the VMEXITs to a bare minimum, but specific configurations and
385 ++ application scenarios might still suffer from a high VMEXIT rate.
386 ++
387 ++ The kernel provides two L1D flush modes:
388 ++ - conditional ('cond')
389 ++ - unconditional ('always')
390 ++
391 ++ The conditional mode avoids L1D flushing after VMEXITs which execute
392 ++ only audited code paths before the corresponding VMENTER. These code
393 ++ paths have been verified that they cannot expose secrets or other
394 ++ interesting data to an attacker, but they can leak information about the
395 ++ address space layout of the hypervisor.
396 ++
397 ++ Unconditional mode flushes L1D on all VMENTER invocations and provides
398 ++ maximum protection. It has a higher overhead than the conditional
399 ++ mode. The overhead cannot be quantified correctly as it depends on the
400 ++ workload scenario and the resulting number of VMEXITs.
401 ++
402 ++ The general recommendation is to enable L1D flush on VMENTER. The kernel
403 ++ defaults to conditional mode on affected processors.
404 ++
405 ++ **Note**, that L1D flush does not prevent the SMT problem because the
406 ++ sibling thread will also bring back its data into the L1D which makes it
407 ++ attackable again.
408 ++
409 ++ L1D flush can be controlled by the administrator via the kernel command
410 ++ line and sysfs control files. See :ref:`mitigation_control_command_line`
411 ++ and :ref:`mitigation_control_kvm`.
412 ++
413 ++.. _guest_confinement:
414 ++
415 ++2. Guest VCPU confinement to dedicated physical cores
416 ++^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
417 ++
418 ++ To address the SMT problem, it is possible to make a guest or a group of
419 ++ guests affine to one or more physical cores. The proper mechanism for
420 ++ that is to utilize exclusive cpusets to ensure that no other guest or
421 ++ host tasks can run on these cores.
422 ++
423 ++ If only a single guest or related guests run on sibling SMT threads on
424 ++ the same physical core then they can only attack their own memory and
425 ++ restricted parts of the host memory.
426 ++
427 ++ Host memory is attackable when one of the sibling SMT threads runs in
428 ++ host OS (hypervisor) context and the other in guest context. The amount
429 ++ of valuable information from the host OS context depends on the context
430 ++ which the host OS executes, i.e. interrupts, soft interrupts and kernel
431 ++ threads. The amount of valuable data from these contexts cannot be
432 ++ declared as non-interesting for an attacker without deep inspection of
433 ++ the code.
434 ++
435 ++ **Note**, that assigning guests to a fixed set of physical cores affects
436 ++ the ability of the scheduler to do load balancing and might have
437 ++ negative effects on CPU utilization depending on the hosting
438 ++ scenario. Disabling SMT might be a viable alternative for particular
439 ++ scenarios.
440 ++
441 ++ For further information about confining guests to a single or to a group
442 ++ of cores consult the cpusets documentation:
443 ++
444 ++ https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
445 ++
446 ++.. _interrupt_isolation:
447 ++
448 ++3. Interrupt affinity
449 ++^^^^^^^^^^^^^^^^^^^^^
450 ++
451 ++ Interrupts can be made affine to logical CPUs. This is not universally
452 ++ true because there are types of interrupts which are truly per CPU
453 ++ interrupts, e.g. the local timer interrupt. Aside from that, multi queue
454 ++ devices affine their interrupts to single CPUs or groups of CPUs per
455 ++ queue without allowing the administrator to control the affinities.
456 ++
457 ++ Moving the interrupts, which can be affinity controlled, away from CPUs
458 ++ which run untrusted guests, reduces the attack vector space.
459 ++
460 ++ Whether the interrupts which are affine to CPUs that run untrusted
461 ++ guests provide interesting data for an attacker depends on the system
462 ++ configuration and the scenarios which run on the system. While for some
463 ++ of the interrupts it can be assumed that they won't expose interesting
464 ++ information beyond exposing hints about the host OS memory layout, there
465 ++ is no way to make general assumptions.
466 ++
467 ++ Interrupt affinity can be controlled by the administrator via the
468 ++ /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
469 ++ available at:
470 ++
471 ++ https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
472 ++
473 ++.. _smt_control:
474 ++
475 ++4. SMT control
476 ++^^^^^^^^^^^^^^
477 ++
478 ++ To prevent the SMT issues of L1TF it might be necessary to disable SMT
479 ++ completely. Disabling SMT can have a significant performance impact, but
480 ++ the impact depends on the hosting scenario and the type of workloads.
481 ++ The impact of disabling SMT also needs to be weighed against the impact
482 ++ of other mitigation solutions like confining guests to dedicated cores.
483 ++
484 ++ The kernel provides a sysfs interface to retrieve the status of SMT and
485 ++ to control it. It also provides a kernel command line interface to
486 ++ control SMT.
487 ++
488 ++ The kernel command line interface consists of the following options:
489 ++
490 ++ =========== ==========================================================
491 ++ nosmt Affects the bring up of the secondary CPUs during boot. The
492 ++ kernel tries to bring all present CPUs online during the
493 ++ boot process. "nosmt" makes sure that from each physical
494 ++ core only one - the so called primary (hyper) thread is
495 ++ activated. Due to a design flaw of Intel processors related
496 ++ to Machine Check Exceptions the non primary siblings have
497 ++ to be brought up at least partially and are then shut down
498 ++ again. "nosmt" can be undone via the sysfs interface.
499 ++
500 ++ nosmt=force Has the same effect as "nosmt" but it does not allow to
501 ++ undo the SMT disable via the sysfs interface.
502 ++ =========== ==========================================================
503 ++
504 ++ The sysfs interface provides two files:
505 ++
506 ++ - /sys/devices/system/cpu/smt/control
507 ++ - /sys/devices/system/cpu/smt/active
508 ++
509 ++ /sys/devices/system/cpu/smt/control:
510 ++
511 ++ This file allows to read out the SMT control state and provides the
512 ++ ability to disable or (re)enable SMT. The possible states are:
513 ++
514 ++ ============== ===================================================
515 ++ on SMT is supported by the CPU and enabled. All
516 ++ logical CPUs can be onlined and offlined without
517 ++ restrictions.
518 ++
519 ++ off SMT is supported by the CPU and disabled. Only
520 ++ the so called primary SMT threads can be onlined
521 ++ and offlined without restrictions. An attempt to
522 ++ online a non-primary sibling is rejected
523 ++
524 ++ forceoff Same as 'off' but the state cannot be controlled.
525 ++ Attempts to write to the control file are rejected.
526 ++
527 ++ notsupported The processor does not support SMT. It's therefore
528 ++ not affected by the SMT implications of L1TF.
529 ++ Attempts to write to the control file are rejected.
530 ++ ============== ===================================================
531 ++
532 ++ The possible states which can be written into this file to control SMT
533 ++ state are:
534 ++
535 ++ - on
536 ++ - off
537 ++ - forceoff
538 ++
539 ++ /sys/devices/system/cpu/smt/active:
540 ++
541 ++ This file reports whether SMT is enabled and active, i.e. if on any
542 ++ physical core two or more sibling threads are online.
543 ++
544 ++ SMT control is also possible at boot time via the l1tf kernel command
545 ++ line parameter in combination with L1D flush control. See
546 ++ :ref:`mitigation_control_command_line`.
547 ++
548 ++5. Disabling EPT
549 ++^^^^^^^^^^^^^^^^
550 ++
551 ++ Disabling EPT for virtual machines provides full mitigation for L1TF even
552 ++ with SMT enabled, because the effective page tables for guests are
553 ++ managed and sanitized by the hypervisor. Though disabling EPT has a
554 ++ significant performance impact especially when the Meltdown mitigation
555 ++ KPTI is enabled.
556 ++
557 ++ EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
558 ++
559 ++There is ongoing research and development for new mitigation mechanisms to
560 ++address the performance impact of disabling SMT or EPT.
561 ++
562 ++.. _mitigation_control_command_line:
563 ++
564 ++Mitigation control on the kernel command line
565 ++---------------------------------------------
566 ++
567 ++The kernel command line allows to control the L1TF mitigations at boot
568 ++time with the option "l1tf=". The valid arguments for this option are:
569 ++
570 ++ ============ =============================================================
571 ++ full Provides all available mitigations for the L1TF
572 ++ vulnerability. Disables SMT and enables all mitigations in
573 ++ the hypervisors, i.e. unconditional L1D flushing
574 ++
575 ++ SMT control and L1D flush control via the sysfs interface
576 ++ is still possible after boot. Hypervisors will issue a
577 ++ warning when the first VM is started in a potentially
578 ++ insecure configuration, i.e. SMT enabled or L1D flush
579 ++ disabled.
580 ++
581 ++ full,force Same as 'full', but disables SMT and L1D flush runtime
582 ++ control. Implies the 'nosmt=force' command line option.
583 ++ (i.e. sysfs control of SMT is disabled.)
584 ++
585 ++ flush Leaves SMT enabled and enables the default hypervisor
586 ++ mitigation, i.e. conditional L1D flushing
587 ++
588 ++ SMT control and L1D flush control via the sysfs interface
589 ++ is still possible after boot. Hypervisors will issue a
590 ++ warning when the first VM is started in a potentially
591 ++ insecure configuration, i.e. SMT enabled or L1D flush
592 ++ disabled.
593 ++
594 ++ flush,nosmt Disables SMT and enables the default hypervisor mitigation,
595 ++ i.e. conditional L1D flushing.
596 ++
597 ++ SMT control and L1D flush control via the sysfs interface
598 ++ is still possible after boot. Hypervisors will issue a
599 ++ warning when the first VM is started in a potentially
600 ++ insecure configuration, i.e. SMT enabled or L1D flush
601 ++ disabled.
602 ++
603 ++ flush,nowarn Same as 'flush', but hypervisors will not warn when a VM is
604 ++ started in a potentially insecure configuration.
605 ++
606 ++ off Disables hypervisor mitigations and doesn't emit any
607 ++ warnings.
608 ++ ============ =============================================================
609 ++
610 ++The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
611 ++
612 ++
613 ++.. _mitigation_control_kvm:
614 ++
615 ++Mitigation control for KVM - module parameter
616 ++-------------------------------------------------------------
617 ++
618 ++The KVM hypervisor mitigation mechanism, flushing the L1D cache when
619 ++entering a guest, can be controlled with a module parameter.
620 ++
621 ++The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
622 ++following arguments:
623 ++
624 ++ ============ ==============================================================
625 ++ always L1D cache flush on every VMENTER.
626 ++
627 ++ cond Flush L1D on VMENTER only when the code between VMEXIT and
628 ++ VMENTER can leak host memory which is considered
629 ++ interesting for an attacker. This still can leak host memory
630 ++ which allows e.g. to determine the host's address space layout.
631 ++
632 ++ never Disables the mitigation
633 ++ ============ ==============================================================
634 ++
635 ++The parameter can be provided on the kernel command line, as a module
636 ++parameter when loading the modules and at runtime modified via the sysfs
637 ++file:
638 ++
639 ++/sys/module/kvm_intel/parameters/vmentry_l1d_flush
640 ++
641 ++The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
642 ++line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
643 ++module parameter is ignored and writes to the sysfs file are rejected.
644 ++
645 ++
646 ++Mitigation selection guide
647 ++--------------------------
648 ++
649 ++1. No virtualization in use
650 ++^^^^^^^^^^^^^^^^^^^^^^^^^^^
651 ++
652 ++ The system is protected by the kernel unconditionally and no further
653 ++ action is required.
654 ++
655 ++2. Virtualization with trusted guests
656 ++^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
657 ++
658 ++ If the guest comes from a trusted source and the guest OS kernel is
659 ++ guaranteed to have the L1TF mitigations in place the system is fully
660 ++ protected against L1TF and no further action is required.
661 ++
662 ++ To avoid the overhead of the default L1D flushing on VMENTER the
663 ++ administrator can disable the flushing via the kernel command line and
664 ++ sysfs control files. See :ref:`mitigation_control_command_line` and
665 ++ :ref:`mitigation_control_kvm`.
666 ++
667 ++
668 ++3. Virtualization with untrusted guests
669 ++^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
670 ++
671 ++3.1. SMT not supported or disabled
672 ++""""""""""""""""""""""""""""""""""
673 ++
674 ++ If SMT is not supported by the processor or disabled in the BIOS or by
675 ++ the kernel, it's only required to enforce L1D flushing on VMENTER.
676 ++
677 ++ Conditional L1D flushing is the default behaviour and can be tuned. See
678 ++ :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
679 ++
680 ++3.2. EPT not supported or disabled
681 ++""""""""""""""""""""""""""""""""""
682 ++
683 ++ If EPT is not supported by the processor or disabled in the hypervisor,
684 ++ the system is fully protected. SMT can stay enabled and L1D flushing on
685 ++ VMENTER is not required.
686 ++
687 ++ EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
688 ++
689 ++3.3. SMT and EPT supported and active
690 ++"""""""""""""""""""""""""""""""""""""
691 ++
692 ++ If SMT and EPT are supported and active then various degrees of
693 ++ mitigations can be employed:
694 ++
695 ++ - L1D flushing on VMENTER:
696 ++
697 ++ L1D flushing on VMENTER is the minimal protection requirement, but it
698 ++ is only potent in combination with other mitigation methods.
699 ++
700 ++ Conditional L1D flushing is the default behaviour and can be tuned. See
701 ++ :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
702 ++
703 ++ - Guest confinement:
704 ++
705 ++ Confinement of guests to a single or a group of physical cores which
706 ++ are not running any other processes, can reduce the attack surface
707 ++ significantly, but interrupts, soft interrupts and kernel threads can
708 ++ still expose valuable data to a potential attacker. See
709 ++ :ref:`guest_confinement`.
710 ++
711 ++ - Interrupt isolation:
712 ++
713 ++ Isolating the guest CPUs from interrupts can reduce the attack surface
714 ++ further, but still allows a malicious guest to explore a limited amount
715 ++ of host physical memory. This can at least be used to gain knowledge
716 ++ about the host address space layout. The interrupts which have a fixed
717 ++ affinity to the CPUs which run the untrusted guests can depending on
718 ++ the scenario still trigger soft interrupts and schedule kernel threads
719 ++ which might expose valuable information. See
720 ++ :ref:`interrupt_isolation`.
721 ++
722 ++The above three mitigation methods combined can provide protection to a
723 ++certain degree, but the risk of the remaining attack surface has to be
724 ++carefully analyzed. For full protection the following methods are
725 ++available:
726 ++
727 ++ - Disabling SMT:
728 ++
729 ++ Disabling SMT and enforcing the L1D flushing provides the maximum
730 ++ amount of protection. This mitigation does not depend on any of the
731 ++ above mitigation methods.
732 ++
733 ++ SMT control and L1D flushing can be tuned by the command line
734 ++ parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
735 ++ time with the matching sysfs control files. See :ref:`smt_control`,
736 ++ :ref:`mitigation_control_command_line` and
737 ++ :ref:`mitigation_control_kvm`.
738 ++
739 ++ - Disabling EPT:
740 ++
741 ++ Disabling EPT provides the maximum amount of protection as well. It is
742 ++ not dependent on any of the above mitigation methods. SMT can stay
743 ++ enabled and L1D flushing is not required, but the performance impact is
744 ++ significant.
745 ++
746 ++ EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
747 ++ parameter.
748 ++
749 ++3.4. Nested virtual machines
750 ++""""""""""""""""""""""""""""
751 ++
752 ++When nested virtualization is in use, three operating systems are involved:
753 ++the bare metal hypervisor, the nested hypervisor and the nested virtual
754 ++machine. VMENTER operations from the nested hypervisor into the nested
755 ++guest will always be processed by the bare metal hypervisor. If KVM is the
756 ++bare metal hypervisor it will:
757 ++
758 ++ - Flush the L1D cache on every switch from the nested hypervisor to the
759 ++ nested virtual machine, so that the nested hypervisor's secrets are not
760 ++ exposed to the nested virtual machine;
761 ++
762 ++ - Flush the L1D cache on every switch from the nested virtual machine to
763 ++ the nested hypervisor; this is a complex operation, and flushing the L1D
764 ++ cache avoids that the bare metal hypervisor's secrets are exposed to the
765 ++ nested virtual machine;
766 ++
767 ++ - Instruct the nested hypervisor to not perform any L1D cache flush. This
768 ++ is an optimization to avoid double L1D flushing.
769 ++
770 ++
771 ++.. _default_mitigations:
772 ++
773 ++Default mitigations
774 ++-------------------
775 ++
776 ++ The kernel default mitigations for vulnerable processors are:
777 ++
778 ++ - PTE inversion to protect against malicious user space. This is done
779 ++ unconditionally and cannot be controlled.
780 ++
781 ++ - L1D conditional flushing on VMENTER when EPT is enabled for
782 ++ a guest.
783 ++
784 ++ The kernel does not by default enforce the disabling of SMT, which leaves
785 ++ SMT systems vulnerable when running untrusted guests with EPT enabled.
786 ++
787 ++ The rationale for this choice is:
788 ++
789 ++ - Force disabling SMT can break existing setups, especially with
790 ++ unattended updates.
791 ++
792 ++ - If regular users run untrusted guests on their machine, then L1TF is
793 ++ just an add on to other malware which might be embedded in an untrusted
794 ++ guest, e.g. spam-bots or attacks on the local network.
795 ++
796 ++ There is no technical way to prevent a user from running untrusted code
797 ++ on their machines blindly.
798 ++
799 ++ - It's technically extremely unlikely and from today's knowledge even
800 ++ impossible that L1TF can be exploited via the most popular attack
801 ++ mechanisms like JavaScript because these mechanisms have no way to
802 ++ control PTEs. If that were possible and no other mitigation were
803 ++ available, then the default might be different.
804 ++
805 ++ - The administrators of cloud and hosting setups have to carefully
806 ++ analyze the risk for their scenarios and make the appropriate
807 ++ mitigation choices, which might even vary across their deployed
808 ++ machines and also result in other changes of their overall setup.
809 ++ There is no way for the kernel to provide a sensible default for this
810 ++ kind of scenario.
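
Tying the sysfs description above together, a small sketch can classify the host's current state; the substrings it matches are the ones listed in the "L1TF system information" section of the new document, while the decision logic itself is only an illustration, not a security assessment tool.

/* Classify the host from the l1tf sysfs string documented above. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/sys/devices/system/cpu/vulnerabilities/l1tf", "r");

    if (!f) {
        puts("l1tf status not available on this kernel");
        return 1;
    }
    if (!fgets(line, sizeof(line), f)) {
        fclose(f);
        puts("l1tf status not readable");
        return 1;
    }
    fclose(f);

    printf("status: %s", line);
    if (strstr(line, "Not affected"))
        puts("-> no action required");
    else if (strstr(line, "SMT vulnerable") || strstr(line, "L1D vulnerable"))
        puts("-> potentially insecure for untrusted guests (see the guide above)");
    else
        puts("-> reported mitigations are active");
    return 0;
}
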
811 +diff --git a/Makefile b/Makefile
812 +index 863f58503bee..5edf963148e8 100644
813 +--- a/Makefile
814 ++++ b/Makefile
815 +@@ -1,7 +1,7 @@
816 + # SPDX-License-Identifier: GPL-2.0
817 + VERSION = 4
818 + PATCHLEVEL = 18
819 +-SUBLEVEL = 0
820 ++SUBLEVEL = 1
821 + EXTRAVERSION =
822 + NAME = Merciless Moray
823 +
824 +diff --git a/arch/Kconfig b/arch/Kconfig
825 +index 1aa59063f1fd..d1f2ed462ac8 100644
826 +--- a/arch/Kconfig
827 ++++ b/arch/Kconfig
828 +@@ -13,6 +13,9 @@ config KEXEC_CORE
829 + config HAVE_IMA_KEXEC
830 + bool
831 +
832 ++config HOTPLUG_SMT
833 ++ bool
834 ++
835 + config OPROFILE
836 + tristate "OProfile system profiling"
837 + depends on PROFILING
838 +diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
839 +index 887d3a7bb646..6b8065d718bd 100644
840 +--- a/arch/x86/Kconfig
841 ++++ b/arch/x86/Kconfig
842 +@@ -187,6 +187,7 @@ config X86
843 + select HAVE_SYSCALL_TRACEPOINTS
844 + select HAVE_UNSTABLE_SCHED_CLOCK
845 + select HAVE_USER_RETURN_NOTIFIER
846 ++ select HOTPLUG_SMT if SMP
847 + select IRQ_FORCED_THREADING
848 + select NEED_SG_DMA_LENGTH
849 + select PCI_LOCKLESS_CONFIG
850 +diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
851 +index 74a9e06b6cfd..130e81e10fc7 100644
852 +--- a/arch/x86/include/asm/apic.h
853 ++++ b/arch/x86/include/asm/apic.h
854 +@@ -10,6 +10,7 @@
855 + #include <asm/fixmap.h>
856 + #include <asm/mpspec.h>
857 + #include <asm/msr.h>
858 ++#include <asm/hardirq.h>
859 +
860 + #define ARCH_APICTIMER_STOPS_ON_C3 1
861 +
862 +@@ -502,12 +503,19 @@ extern int default_check_phys_apicid_present(int phys_apicid);
863 +
864 + #endif /* CONFIG_X86_LOCAL_APIC */
865 +
866 ++#ifdef CONFIG_SMP
867 ++bool apic_id_is_primary_thread(unsigned int id);
868 ++#else
869 ++static inline bool apic_id_is_primary_thread(unsigned int id) { return false; }
870 ++#endif
871 ++
872 + extern void irq_enter(void);
873 + extern void irq_exit(void);
874 +
875 + static inline void entering_irq(void)
876 + {
877 + irq_enter();
878 ++ kvm_set_cpu_l1tf_flush_l1d();
879 + }
880 +
881 + static inline void entering_ack_irq(void)
882 +@@ -520,6 +528,7 @@ static inline void ipi_entering_ack_irq(void)
883 + {
884 + irq_enter();
885 + ack_APIC_irq();
886 ++ kvm_set_cpu_l1tf_flush_l1d();
887 + }
888 +
889 + static inline void exiting_irq(void)
890 +diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
891 +index 5701f5cecd31..64aaa3f5f36c 100644
892 +--- a/arch/x86/include/asm/cpufeatures.h
893 ++++ b/arch/x86/include/asm/cpufeatures.h
894 +@@ -219,6 +219,7 @@
895 + #define X86_FEATURE_IBPB ( 7*32+26) /* Indirect Branch Prediction Barrier */
896 + #define X86_FEATURE_STIBP ( 7*32+27) /* Single Thread Indirect Branch Predictors */
897 + #define X86_FEATURE_ZEN ( 7*32+28) /* "" CPU is AMD family 0x17 (Zen) */
898 ++#define X86_FEATURE_L1TF_PTEINV ( 7*32+29) /* "" L1TF workaround PTE inversion */
899 +
900 + /* Virtualization flags: Linux defined, word 8 */
901 + #define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */
902 +@@ -341,6 +342,7 @@
903 + #define X86_FEATURE_PCONFIG (18*32+18) /* Intel PCONFIG */
904 + #define X86_FEATURE_SPEC_CTRL (18*32+26) /* "" Speculation Control (IBRS + IBPB) */
905 + #define X86_FEATURE_INTEL_STIBP (18*32+27) /* "" Single Thread Indirect Branch Predictors */
906 ++#define X86_FEATURE_FLUSH_L1D (18*32+28) /* Flush L1D cache */
907 + #define X86_FEATURE_ARCH_CAPABILITIES (18*32+29) /* IA32_ARCH_CAPABILITIES MSR (Intel) */
908 + #define X86_FEATURE_SPEC_CTRL_SSBD (18*32+31) /* "" Speculative Store Bypass Disable */
909 +
910 +@@ -373,5 +375,6 @@
911 + #define X86_BUG_SPECTRE_V1 X86_BUG(15) /* CPU is affected by Spectre variant 1 attack with conditional branches */
912 + #define X86_BUG_SPECTRE_V2 X86_BUG(16) /* CPU is affected by Spectre variant 2 attack with indirect branches */
913 + #define X86_BUG_SPEC_STORE_BYPASS X86_BUG(17) /* CPU is affected by speculative store bypass attack */
914 ++#define X86_BUG_L1TF X86_BUG(18) /* CPU is affected by L1 Terminal Fault */
915 +
916 + #endif /* _ASM_X86_CPUFEATURES_H */
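
The new feature and bug bits surface in /proc/cpuinfo on a running system. A rough check follows; it assumes the usual lowercase names derived from the defines above ("flush_l1d" in the flags line, "l1tf" in the bugs line), so treat the exact strings as an assumption rather than a guarantee.

/* Look for the flush_l1d feature and the l1tf bug flag in /proc/cpuinfo. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[4096];
    int flush_l1d = 0, l1tf_bug = 0;
    FILE *f = fopen("/proc/cpuinfo", "r");

    if (!f)
        return 1;
    while (fgets(line, sizeof(line), f)) {
        if (!strncmp(line, "flags", 5) && strstr(line, " flush_l1d"))
            flush_l1d = 1;
        if (!strncmp(line, "bugs", 4) && strstr(line, " l1tf"))
            l1tf_bug = 1;
    }
    fclose(f);

    printf("L1D flush instruction advertised : %s\n", flush_l1d ? "yes" : "no");
    printf("CPU marked as affected by L1TF   : %s\n", l1tf_bug ? "yes" : "no");
    return 0;
}
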
917 +diff --git a/arch/x86/include/asm/dmi.h b/arch/x86/include/asm/dmi.h
918 +index 0ab2ab27ad1f..b825cb201251 100644
919 +--- a/arch/x86/include/asm/dmi.h
920 ++++ b/arch/x86/include/asm/dmi.h
921 +@@ -4,8 +4,8 @@
922 +
923 + #include <linux/compiler.h>
924 + #include <linux/init.h>
925 ++#include <linux/io.h>
926 +
927 +-#include <asm/io.h>
928 + #include <asm/setup.h>
929 +
930 + static __always_inline __init void *dmi_alloc(unsigned len)
931 +diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
932 +index 740a428acf1e..d9069bb26c7f 100644
933 +--- a/arch/x86/include/asm/hardirq.h
934 ++++ b/arch/x86/include/asm/hardirq.h
935 +@@ -3,10 +3,12 @@
936 + #define _ASM_X86_HARDIRQ_H
937 +
938 + #include <linux/threads.h>
939 +-#include <linux/irq.h>
940 +
941 + typedef struct {
942 +- unsigned int __softirq_pending;
943 ++ u16 __softirq_pending;
944 ++#if IS_ENABLED(CONFIG_KVM_INTEL)
945 ++ u8 kvm_cpu_l1tf_flush_l1d;
946 ++#endif
947 + unsigned int __nmi_count; /* arch dependent */
948 + #ifdef CONFIG_X86_LOCAL_APIC
949 + unsigned int apic_timer_irqs; /* arch dependent */
950 +@@ -58,4 +60,24 @@ extern u64 arch_irq_stat_cpu(unsigned int cpu);
951 + extern u64 arch_irq_stat(void);
952 + #define arch_irq_stat arch_irq_stat
953 +
954 ++
955 ++#if IS_ENABLED(CONFIG_KVM_INTEL)
956 ++static inline void kvm_set_cpu_l1tf_flush_l1d(void)
957 ++{
958 ++ __this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 1);
959 ++}
960 ++
961 ++static inline void kvm_clear_cpu_l1tf_flush_l1d(void)
962 ++{
963 ++ __this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 0);
964 ++}
965 ++
966 ++static inline bool kvm_get_cpu_l1tf_flush_l1d(void)
967 ++{
968 ++ return __this_cpu_read(irq_stat.kvm_cpu_l1tf_flush_l1d);
969 ++}
970 ++#else /* !IS_ENABLED(CONFIG_KVM_INTEL) */
971 ++static inline void kvm_set_cpu_l1tf_flush_l1d(void) { }
972 ++#endif /* IS_ENABLED(CONFIG_KVM_INTEL) */
973 ++
974 + #endif /* _ASM_X86_HARDIRQ_H */
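
The helpers added here implement a simple "set on interrupt entry, test-and-clear before VMENTER" handshake (the apic.h hunk above wires the set side into entering_irq() and ipi_entering_ack_irq()). The stand-alone mock below sketches that flow under heavy simplification: one CPU is modeled with a plain variable instead of the per-CPU irq_stat field, and the VMX side is reduced to a printf.

/* Mock of the conditional-flush handshake, user space only. */
#include <stdbool.h>
#include <stdio.h>

static bool cpu_l1tf_flush_l1d;   /* stands in for irq_stat.kvm_cpu_l1tf_flush_l1d */

static void set_cpu_l1tf_flush_l1d(void)   { cpu_l1tf_flush_l1d = true;  }
static void clear_cpu_l1tf_flush_l1d(void) { cpu_l1tf_flush_l1d = false; }
static bool get_cpu_l1tf_flush_l1d(void)   { return cpu_l1tf_flush_l1d; }

static void fake_interrupt(void)
{
    /* interrupt entry marks the CPU as needing a flush */
    set_cpu_l1tf_flush_l1d();
}

static void fake_vmenter(void)
{
    if (get_cpu_l1tf_flush_l1d()) {
        puts("conditional L1D flush before VMENTER");
        clear_cpu_l1tf_flush_l1d();
    } else {
        puts("no flush needed");
    }
}

int main(void)
{
    fake_vmenter();    /* nothing ran since the last entry -> no flush */
    fake_interrupt();  /* host code ran between VMEXIT and VMENTER     */
    fake_vmenter();    /* -> flush                                     */
    fake_vmenter();    /* flag already consumed -> no flush            */
    return 0;
}
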
975 +diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
976 +index c4fc17220df9..c14f2a74b2be 100644
977 +--- a/arch/x86/include/asm/irqflags.h
978 ++++ b/arch/x86/include/asm/irqflags.h
979 +@@ -13,6 +13,8 @@
980 + * Interrupt control:
981 + */
982 +
983 ++/* Declaration required for gcc < 4.9 to prevent -Werror=missing-prototypes */
984 ++extern inline unsigned long native_save_fl(void);
985 + extern inline unsigned long native_save_fl(void)
986 + {
987 + unsigned long flags;
988 +diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
989 +index c13cd28d9d1b..acebb808c4b5 100644
990 +--- a/arch/x86/include/asm/kvm_host.h
991 ++++ b/arch/x86/include/asm/kvm_host.h
992 +@@ -17,6 +17,7 @@
993 + #include <linux/tracepoint.h>
994 + #include <linux/cpumask.h>
995 + #include <linux/irq_work.h>
996 ++#include <linux/irq.h>
997 +
998 + #include <linux/kvm.h>
999 + #include <linux/kvm_para.h>
1000 +@@ -713,6 +714,9 @@ struct kvm_vcpu_arch {
1001 +
1002 + /* be preempted when it's in kernel-mode(cpl=0) */
1003 + bool preempted_in_kernel;
1004 ++
1005 ++ /* Flush the L1 Data cache for L1TF mitigation on VMENTER */
1006 ++ bool l1tf_flush_l1d;
1007 + };
1008 +
1009 + struct kvm_lpage_info {
1010 +@@ -881,6 +885,7 @@ struct kvm_vcpu_stat {
1011 + u64 signal_exits;
1012 + u64 irq_window_exits;
1013 + u64 nmi_window_exits;
1014 ++ u64 l1d_flush;
1015 + u64 halt_exits;
1016 + u64 halt_successful_poll;
1017 + u64 halt_attempted_poll;
1018 +@@ -1413,6 +1418,7 @@ int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
1019 + void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
1020 + void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu);
1021 +
1022 ++u64 kvm_get_arch_capabilities(void);
1023 + void kvm_define_shared_msr(unsigned index, u32 msr);
1024 + int kvm_set_shared_msr(unsigned index, u64 val, u64 mask);
1025 +
1026 +diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
1027 +index 68b2c3150de1..4731f0cf97c5 100644
1028 +--- a/arch/x86/include/asm/msr-index.h
1029 ++++ b/arch/x86/include/asm/msr-index.h
1030 +@@ -70,12 +70,19 @@
1031 + #define MSR_IA32_ARCH_CAPABILITIES 0x0000010a
1032 + #define ARCH_CAP_RDCL_NO (1 << 0) /* Not susceptible to Meltdown */
1033 + #define ARCH_CAP_IBRS_ALL (1 << 1) /* Enhanced IBRS support */
1034 ++#define ARCH_CAP_SKIP_VMENTRY_L1DFLUSH (1 << 3) /* Skip L1D flush on vmentry */
1035 + #define ARCH_CAP_SSB_NO (1 << 4) /*
1036 + * Not susceptible to Speculative Store Bypass
1037 + * attack, so no Speculative Store Bypass
1038 + * control required.
1039 + */
1040 +
1041 ++#define MSR_IA32_FLUSH_CMD 0x0000010b
1042 ++#define L1D_FLUSH (1 << 0) /*
1043 ++ * Writeback and invalidate the
1044 ++ * L1 data cache.
1045 ++ */
1046 ++
1047 + #define MSR_IA32_BBL_CR_CTL 0x00000119
1048 + #define MSR_IA32_BBL_CR_CTL3 0x0000011e
1049 +
1050 +diff --git a/arch/x86/include/asm/page_32_types.h b/arch/x86/include/asm/page_32_types.h
1051 +index aa30c3241ea7..0d5c739eebd7 100644
1052 +--- a/arch/x86/include/asm/page_32_types.h
1053 ++++ b/arch/x86/include/asm/page_32_types.h
1054 +@@ -29,8 +29,13 @@
1055 + #define N_EXCEPTION_STACKS 1
1056 +
1057 + #ifdef CONFIG_X86_PAE
1058 +-/* 44=32+12, the limit we can fit into an unsigned long pfn */
1059 +-#define __PHYSICAL_MASK_SHIFT 44
1060 ++/*
1061 ++ * This is beyond the 44 bit limit imposed by the 32bit long pfns,
1062 ++ * but we need the full mask to make sure inverted PROT_NONE
1063 ++ * entries have all the host bits set in a guest.
1064 ++ * The real limit is still 44 bits.
1065 ++ */
1066 ++#define __PHYSICAL_MASK_SHIFT 52
1067 + #define __VIRTUAL_MASK_SHIFT 32
1068 +
1069 + #else /* !CONFIG_X86_PAE */
1070 +diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h
1071 +index 685ffe8a0eaf..60d0f9015317 100644
1072 +--- a/arch/x86/include/asm/pgtable-2level.h
1073 ++++ b/arch/x86/include/asm/pgtable-2level.h
1074 +@@ -95,4 +95,21 @@ static inline unsigned long pte_bitop(unsigned long value, unsigned int rightshi
1075 + #define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low })
1076 + #define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
1077 +
1078 ++/* No inverted PFNs on 2 level page tables */
1079 ++
1080 ++static inline u64 protnone_mask(u64 val)
1081 ++{
1082 ++ return 0;
1083 ++}
1084 ++
1085 ++static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask)
1086 ++{
1087 ++ return val;
1088 ++}
1089 ++
1090 ++static inline bool __pte_needs_invert(u64 val)
1091 ++{
1092 ++ return false;
1093 ++}
1094 ++
1095 + #endif /* _ASM_X86_PGTABLE_2LEVEL_H */
1096 +diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
1097 +index f24df59c40b2..bb035a4cbc8c 100644
1098 +--- a/arch/x86/include/asm/pgtable-3level.h
1099 ++++ b/arch/x86/include/asm/pgtable-3level.h
1100 +@@ -241,12 +241,43 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
1101 + #endif
1102 +
1103 + /* Encode and de-code a swap entry */
1104 ++#define SWP_TYPE_BITS 5
1105 ++
1106 ++#define SWP_OFFSET_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
1107 ++
1108 ++/* We always extract/encode the offset by shifting it all the way up, and then down again */
1109 ++#define SWP_OFFSET_SHIFT (SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)
1110 ++
1111 + #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > 5)
1112 + #define __swp_type(x) (((x).val) & 0x1f)
1113 + #define __swp_offset(x) ((x).val >> 5)
1114 + #define __swp_entry(type, offset) ((swp_entry_t){(type) | (offset) << 5})
1115 +-#define __pte_to_swp_entry(pte) ((swp_entry_t){ (pte).pte_high })
1116 +-#define __swp_entry_to_pte(x) ((pte_t){ { .pte_high = (x).val } })
1117 ++
1118 ++/*
1119 ++ * Normally, __swp_entry() converts from arch-independent swp_entry_t to
1120 ++ * arch-dependent swp_entry_t, and __swp_entry_to_pte() just stores the result
1121 ++ * to pte. But here we have 32bit swp_entry_t and 64bit pte, and need to use the
1122 ++ * whole 64 bits. Thus, we shift the "real" arch-dependent conversion to
1123 ++ * __swp_entry_to_pte() through the following helper macro based on 64bit
1124 ++ * __swp_entry().
1125 ++ */
1126 ++#define __swp_pteval_entry(type, offset) ((pteval_t) { \
1127 ++ (~(pteval_t)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \
1128 ++ | ((pteval_t)(type) << (64 - SWP_TYPE_BITS)) })
1129 ++
1130 ++#define __swp_entry_to_pte(x) ((pte_t){ .pte = \
1131 ++ __swp_pteval_entry(__swp_type(x), __swp_offset(x)) })
1132 ++/*
1133 ++ * Analogically, __pte_to_swp_entry() doesn't just extract the arch-dependent
1134 ++ * swp_entry_t, but also has to convert it from 64bit to the 32bit
1135 ++ * intermediate representation, using the following macros based on 64bit
1136 ++ * __swp_type() and __swp_offset().
1137 ++ */
1138 ++#define __pteval_swp_type(x) ((unsigned long)((x).pte >> (64 - SWP_TYPE_BITS)))
1139 ++#define __pteval_swp_offset(x) ((unsigned long)(~((x).pte) << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT))
1140 ++
1141 ++#define __pte_to_swp_entry(pte) (__swp_entry(__pteval_swp_type(pte), \
1142 ++ __pteval_swp_offset(pte)))
1143 +
1144 + #define gup_get_pte gup_get_pte
1145 + /*
1146 +@@ -295,4 +326,6 @@ static inline pte_t gup_get_pte(pte_t *ptep)
1147 + return pte;
1148 + }
1149 +
1150 ++#include <asm/pgtable-invert.h>
1151 ++
1152 + #endif /* _ASM_X86_PGTABLE_3LEVEL_H */
1153 +diff --git a/arch/x86/include/asm/pgtable-invert.h b/arch/x86/include/asm/pgtable-invert.h
1154 +new file mode 100644
1155 +index 000000000000..44b1203ece12
1156 +--- /dev/null
1157 ++++ b/arch/x86/include/asm/pgtable-invert.h
1158 +@@ -0,0 +1,32 @@
1159 ++/* SPDX-License-Identifier: GPL-2.0 */
1160 ++#ifndef _ASM_PGTABLE_INVERT_H
1161 ++#define _ASM_PGTABLE_INVERT_H 1
1162 ++
1163 ++#ifndef __ASSEMBLY__
1164 ++
1165 ++static inline bool __pte_needs_invert(u64 val)
1166 ++{
1167 ++ return !(val & _PAGE_PRESENT);
1168 ++}
1169 ++
1170 ++/* Get a mask to xor with the page table entry to get the correct pfn. */
1171 ++static inline u64 protnone_mask(u64 val)
1172 ++{
1173 ++ return __pte_needs_invert(val) ? ~0ull : 0;
1174 ++}
1175 ++
1176 ++static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask)
1177 ++{
1178 ++ /*
1179 ++ * When a PTE transitions from NONE to !NONE or vice-versa
1180 ++ * invert the PFN part to stop speculation.
1181 ++ * pte_pfn undoes this when needed.
1182 ++ */
1183 ++ if (__pte_needs_invert(oldval) != __pte_needs_invert(val))
1184 ++ val = (val & ~mask) | (~val & mask);
1185 ++ return val;
1186 ++}
1187 ++
1188 ++#endif /* __ASSEMBLY__ */
1189 ++
1190 ++#endif
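
Below is a stand-alone demonstration of the inversion helpers introduced in this file. The logic mirrors protnone_mask()/flip_protnone_guard() and the XOR later applied in pte_pfn(); the page-table constants are reduced to illustrative values so the effect is visible in the output, and the PFN value is arbitrary.

/* Show that clearing Present inverts the PFN bits and that decoding
 * still recovers the original PFN. */
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define _PAGE_PRESENT 0x001ULL
#define PTE_PFN_MASK  0x000ffffffffff000ULL   /* PFN bits 12..51, as on x86-64 */

static bool __pte_needs_invert(uint64_t val)
{
    return !(val & _PAGE_PRESENT);
}

static uint64_t protnone_mask(uint64_t val)
{
    return __pte_needs_invert(val) ? ~0ULL : 0;
}

static uint64_t flip_protnone_guard(uint64_t oldval, uint64_t val, uint64_t mask)
{
    if (__pte_needs_invert(oldval) != __pte_needs_invert(val))
        val = (val & ~mask) | (~val & mask);
    return val;
}

static uint64_t pte_pfn(uint64_t pte)
{
    uint64_t pfn = pte ^ protnone_mask(pte);

    return (pfn & PTE_PFN_MASK) >> 12;
}

int main(void)
{
    uint64_t present   = (0x12345ULL << 12) | _PAGE_PRESENT;
    uint64_t prot_none = flip_protnone_guard(present, present & ~_PAGE_PRESENT,
                                             PTE_PFN_MASK);

    printf("present PTE   %016" PRIx64 "  pfn %" PRIx64 "\n", present, pte_pfn(present));
    printf("PROT_NONE PTE %016" PRIx64 "  pfn %" PRIx64 "\n", prot_none, pte_pfn(prot_none));
    return 0;
}
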
1191 +diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
1192 +index 5715647fc4fe..13125aad804c 100644
1193 +--- a/arch/x86/include/asm/pgtable.h
1194 ++++ b/arch/x86/include/asm/pgtable.h
1195 +@@ -185,19 +185,29 @@ static inline int pte_special(pte_t pte)
1196 + return pte_flags(pte) & _PAGE_SPECIAL;
1197 + }
1198 +
1199 ++/* Entries that were set to PROT_NONE are inverted */
1200 ++
1201 ++static inline u64 protnone_mask(u64 val);
1202 ++
1203 + static inline unsigned long pte_pfn(pte_t pte)
1204 + {
1205 +- return (pte_val(pte) & PTE_PFN_MASK) >> PAGE_SHIFT;
1206 ++ phys_addr_t pfn = pte_val(pte);
1207 ++ pfn ^= protnone_mask(pfn);
1208 ++ return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT;
1209 + }
1210 +
1211 + static inline unsigned long pmd_pfn(pmd_t pmd)
1212 + {
1213 +- return (pmd_val(pmd) & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
1214 ++ phys_addr_t pfn = pmd_val(pmd);
1215 ++ pfn ^= protnone_mask(pfn);
1216 ++ return (pfn & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
1217 + }
1218 +
1219 + static inline unsigned long pud_pfn(pud_t pud)
1220 + {
1221 +- return (pud_val(pud) & pud_pfn_mask(pud)) >> PAGE_SHIFT;
1222 ++ phys_addr_t pfn = pud_val(pud);
1223 ++ pfn ^= protnone_mask(pfn);
1224 ++ return (pfn & pud_pfn_mask(pud)) >> PAGE_SHIFT;
1225 + }
1226 +
1227 + static inline unsigned long p4d_pfn(p4d_t p4d)
1228 +@@ -400,11 +410,6 @@ static inline pmd_t pmd_mkwrite(pmd_t pmd)
1229 + return pmd_set_flags(pmd, _PAGE_RW);
1230 + }
1231 +
1232 +-static inline pmd_t pmd_mknotpresent(pmd_t pmd)
1233 +-{
1234 +- return pmd_clear_flags(pmd, _PAGE_PRESENT | _PAGE_PROTNONE);
1235 +-}
1236 +-
1237 + static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
1238 + {
1239 + pudval_t v = native_pud_val(pud);
1240 +@@ -459,11 +464,6 @@ static inline pud_t pud_mkwrite(pud_t pud)
1241 + return pud_set_flags(pud, _PAGE_RW);
1242 + }
1243 +
1244 +-static inline pud_t pud_mknotpresent(pud_t pud)
1245 +-{
1246 +- return pud_clear_flags(pud, _PAGE_PRESENT | _PAGE_PROTNONE);
1247 +-}
1248 +-
1249 + #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
1250 + static inline int pte_soft_dirty(pte_t pte)
1251 + {
1252 +@@ -545,25 +545,45 @@ static inline pgprotval_t check_pgprot(pgprot_t pgprot)
1253 +
1254 + static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
1255 + {
1256 +- return __pte(((phys_addr_t)page_nr << PAGE_SHIFT) |
1257 +- check_pgprot(pgprot));
1258 ++ phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;
1259 ++ pfn ^= protnone_mask(pgprot_val(pgprot));
1260 ++ pfn &= PTE_PFN_MASK;
1261 ++ return __pte(pfn | check_pgprot(pgprot));
1262 + }
1263 +
1264 + static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)
1265 + {
1266 +- return __pmd(((phys_addr_t)page_nr << PAGE_SHIFT) |
1267 +- check_pgprot(pgprot));
1268 ++ phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;
1269 ++ pfn ^= protnone_mask(pgprot_val(pgprot));
1270 ++ pfn &= PHYSICAL_PMD_PAGE_MASK;
1271 ++ return __pmd(pfn | check_pgprot(pgprot));
1272 + }
1273 +
1274 + static inline pud_t pfn_pud(unsigned long page_nr, pgprot_t pgprot)
1275 + {
1276 +- return __pud(((phys_addr_t)page_nr << PAGE_SHIFT) |
1277 +- check_pgprot(pgprot));
1278 ++ phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;
1279 ++ pfn ^= protnone_mask(pgprot_val(pgprot));
1280 ++ pfn &= PHYSICAL_PUD_PAGE_MASK;
1281 ++ return __pud(pfn | check_pgprot(pgprot));
1282 + }
1283 +
1284 ++static inline pmd_t pmd_mknotpresent(pmd_t pmd)
1285 ++{
1286 ++ return pfn_pmd(pmd_pfn(pmd),
1287 ++ __pgprot(pmd_flags(pmd) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
1288 ++}
1289 ++
1290 ++static inline pud_t pud_mknotpresent(pud_t pud)
1291 ++{
1292 ++ return pfn_pud(pud_pfn(pud),
1293 ++ __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
1294 ++}
1295 ++
1296 ++static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
1297 ++
1298 + static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
1299 + {
1300 +- pteval_t val = pte_val(pte);
1301 ++ pteval_t val = pte_val(pte), oldval = val;
1302 +
1303 + /*
1304 + * Chop off the NX bit (if present), and add the NX portion of
1305 +@@ -571,17 +591,17 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
1306 + */
1307 + val &= _PAGE_CHG_MASK;
1308 + val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
1309 +-
1310 ++ val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
1311 + return __pte(val);
1312 + }
1313 +
1314 + static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
1315 + {
1316 +- pmdval_t val = pmd_val(pmd);
1317 ++ pmdval_t val = pmd_val(pmd), oldval = val;
1318 +
1319 + val &= _HPAGE_CHG_MASK;
1320 + val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
1321 +-
1322 ++ val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
1323 + return __pmd(val);
1324 + }
1325 +
1326 +@@ -1320,6 +1340,14 @@ static inline bool pud_access_permitted(pud_t pud, bool write)
1327 + return __pte_access_permitted(pud_val(pud), write);
1328 + }
1329 +
1330 ++#define __HAVE_ARCH_PFN_MODIFY_ALLOWED 1
1331 ++extern bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot);
1332 ++
1333 ++static inline bool arch_has_pfn_modify_check(void)
1334 ++{
1335 ++ return boot_cpu_has_bug(X86_BUG_L1TF);
1336 ++}
1337 ++
1338 + #include <asm-generic/pgtable.h>
1339 + #endif /* __ASSEMBLY__ */
1340 +
1341 +diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
1342 +index 3c5385f9a88f..82ff20b0ae45 100644
1343 +--- a/arch/x86/include/asm/pgtable_64.h
1344 ++++ b/arch/x86/include/asm/pgtable_64.h
1345 +@@ -273,7 +273,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
1346 + *
1347 + * | ... | 11| 10| 9|8|7|6|5| 4| 3|2| 1|0| <- bit number
1348 + * | ... |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
1349 +- * | OFFSET (14->63) | TYPE (9-13) |0|0|X|X| X| X|X|SD|0| <- swp entry
1350 ++ * | TYPE (59-63) | ~OFFSET (9-58) |0|0|X|X| X| X|X|SD|0| <- swp entry
1351 + *
1352 + * G (8) is aliased and used as a PROT_NONE indicator for
1353 + * !present ptes. We need to start storing swap entries above
1354 +@@ -286,20 +286,34 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
1355 + *
1356 + * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
1357 + * but also L and G.
1358 ++ *
1359 ++ * The offset is inverted by a binary not operation to make the high
1360 ++ * physical bits set.
1361 + */
1362 +-#define SWP_TYPE_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
1363 +-#define SWP_TYPE_BITS 5
1364 +-/* Place the offset above the type: */
1365 +-#define SWP_OFFSET_FIRST_BIT (SWP_TYPE_FIRST_BIT + SWP_TYPE_BITS)
1366 ++#define SWP_TYPE_BITS 5
1367 ++
1368 ++#define SWP_OFFSET_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
1369 ++
1370 ++/* We always extract/encode the offset by shifting it all the way up, and then down again */
1371 ++#define SWP_OFFSET_SHIFT (SWP_OFFSET_FIRST_BIT+SWP_TYPE_BITS)
1372 +
1373 + #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
1374 +
1375 +-#define __swp_type(x) (((x).val >> (SWP_TYPE_FIRST_BIT)) \
1376 +- & ((1U << SWP_TYPE_BITS) - 1))
1377 +-#define __swp_offset(x) ((x).val >> SWP_OFFSET_FIRST_BIT)
1378 +-#define __swp_entry(type, offset) ((swp_entry_t) { \
1379 +- ((type) << (SWP_TYPE_FIRST_BIT)) \
1380 +- | ((offset) << SWP_OFFSET_FIRST_BIT) })
1381 ++/* Extract the high bits for type */
1382 ++#define __swp_type(x) ((x).val >> (64 - SWP_TYPE_BITS))
1383 ++
1384 ++/* Shift up (to get rid of type), then down to get value */
1385 ++#define __swp_offset(x) (~(x).val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT)
1386 ++
1387 ++/*
1388 ++ * Shift the offset up "too far" by TYPE bits, then down again
1389 ++ * The offset is inverted by a binary not operation to make the high
1390 ++ * physical bits set.
1391 ++ */
1392 ++#define __swp_entry(type, offset) ((swp_entry_t) { \
1393 ++ (~(unsigned long)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \
1394 ++ | ((unsigned long)(type) << (64-SWP_TYPE_BITS)) })
1395 ++
1396 + #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val((pte)) })
1397 + #define __pmd_to_swp_entry(pmd) ((swp_entry_t) { pmd_val((pmd)) })
1398 + #define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
1399 +@@ -343,5 +357,7 @@ static inline bool gup_fast_permitted(unsigned long start, int nr_pages,
1400 + return true;
1401 + }
1402 +
1403 ++#include <asm/pgtable-invert.h>
1404 ++
1405 + #endif /* !__ASSEMBLY__ */
1406 + #endif /* _ASM_X86_PGTABLE_64_H */
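
The rewritten swap-entry layout above stores the offset bitwise inverted, so that in a non-present PTE the bits a speculative L1TF load would interpret as a physical frame number point at non-existent memory. A minimal user-space sketch of that arithmetic, with the constants copied from the hunk above (the helper names are invented for illustration and are not part of the patch):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SWP_TYPE_BITS		5
#define SWP_OFFSET_FIRST_BIT	9	/* _PAGE_BIT_PROTNONE + 1 */
#define SWP_OFFSET_SHIFT	(SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)

static uint64_t swp_entry(uint64_t type, uint64_t offset)
{
	/* Invert the offset, park it in bits 9-58, put the type in bits 59-63 */
	return (~offset << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) |
	       (type << (64 - SWP_TYPE_BITS));
}

static uint64_t swp_type(uint64_t val)
{
	return val >> (64 - SWP_TYPE_BITS);
}

static uint64_t swp_offset(uint64_t val)
{
	return ~val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT;
}

int main(void)
{
	uint64_t val = swp_entry(3, 0x12345);

	/* Bits 9-58 hold ~0x12345, so the "PFN" bits are mostly ones. */
	printf("raw swap pte: %#jx\n", (uintmax_t)val);
	assert(swp_type(val) == 3);
	assert(swp_offset(val) == 0x12345);
	return 0;
}

Without the inversion the high physical-address bits of a swap entry would mostly be zero, i.e. a plausible low physical address, which is exactly what the L1TF speculation path must not see.
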
1407 +diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
1408 +index cfd29ee8c3da..79e409974ccc 100644
1409 +--- a/arch/x86/include/asm/processor.h
1410 ++++ b/arch/x86/include/asm/processor.h
1411 +@@ -181,6 +181,11 @@ extern const struct seq_operations cpuinfo_op;
1412 +
1413 + extern void cpu_detect(struct cpuinfo_x86 *c);
1414 +
1415 ++static inline unsigned long l1tf_pfn_limit(void)
1416 ++{
1417 ++ return BIT(boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT) - 1;
1418 ++}
1419 ++
1420 + extern void early_cpu_init(void);
1421 + extern void identify_boot_cpu(void);
1422 + extern void identify_secondary_cpu(struct cpuinfo_x86 *);
1423 +@@ -977,4 +982,16 @@ bool xen_set_default_idle(void);
1424 + void stop_this_cpu(void *dummy);
1425 + void df_debug(struct pt_regs *regs, long error_code);
1426 + void microcode_check(void);
1427 ++
1428 ++enum l1tf_mitigations {
1429 ++ L1TF_MITIGATION_OFF,
1430 ++ L1TF_MITIGATION_FLUSH_NOWARN,
1431 ++ L1TF_MITIGATION_FLUSH,
1432 ++ L1TF_MITIGATION_FLUSH_NOSMT,
1433 ++ L1TF_MITIGATION_FULL,
1434 ++ L1TF_MITIGATION_FULL_FORCE
1435 ++};
1436 ++
1437 ++extern enum l1tf_mitigations l1tf_mitigation;
1438 ++
1439 + #endif /* _ASM_X86_PROCESSOR_H */
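
l1tf_pfn_limit() above caps the page frame numbers that unprivileged mappings may place into non-present PTEs to the lower half of the possible physical address space. A quick stand-alone check of the arithmetic; the 46 physical address bits used here are an assumed example value, not something taken from the patch:

#include <stdio.h>

#define PAGE_SHIFT	12
#define BIT(n)		(1ULL << (n))

/* Mirrors the hunk above: highest PFN strictly below MAX_PA/2. */
static unsigned long long l1tf_pfn_limit(unsigned int x86_phys_bits)
{
	return BIT(x86_phys_bits - 1 - PAGE_SHIFT) - 1;
}

int main(void)
{
	unsigned long long limit = l1tf_pfn_limit(46);

	/* 2^33 - 1 pages -> everything in the first 32 TiB (MAX_PA/2) is usable. */
	printf("pfn limit = %#llx (%llu GiB below MAX_PA/2)\n",
	       limit, (limit + 1) >> (30 - PAGE_SHIFT));
	return 0;
}
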
1440 +diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
1441 +index c1d2a9892352..453cf38a1c33 100644
1442 +--- a/arch/x86/include/asm/topology.h
1443 ++++ b/arch/x86/include/asm/topology.h
1444 +@@ -123,13 +123,17 @@ static inline int topology_max_smt_threads(void)
1445 + }
1446 +
1447 + int topology_update_package_map(unsigned int apicid, unsigned int cpu);
1448 +-extern int topology_phys_to_logical_pkg(unsigned int pkg);
1449 ++int topology_phys_to_logical_pkg(unsigned int pkg);
1450 ++bool topology_is_primary_thread(unsigned int cpu);
1451 ++bool topology_smt_supported(void);
1452 + #else
1453 + #define topology_max_packages() (1)
1454 + static inline int
1455 + topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; }
1456 + static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
1457 + static inline int topology_max_smt_threads(void) { return 1; }
1458 ++static inline bool topology_is_primary_thread(unsigned int cpu) { return true; }
1459 ++static inline bool topology_smt_supported(void) { return false; }
1460 + #endif
1461 +
1462 + static inline void arch_fix_phys_package_id(int num, u32 slot)
1463 +diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
1464 +index 6aa8499e1f62..95f9107449bf 100644
1465 +--- a/arch/x86/include/asm/vmx.h
1466 ++++ b/arch/x86/include/asm/vmx.h
1467 +@@ -576,4 +576,15 @@ enum vm_instruction_error_number {
1468 + VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID = 28,
1469 + };
1470 +
1471 ++enum vmx_l1d_flush_state {
1472 ++ VMENTER_L1D_FLUSH_AUTO,
1473 ++ VMENTER_L1D_FLUSH_NEVER,
1474 ++ VMENTER_L1D_FLUSH_COND,
1475 ++ VMENTER_L1D_FLUSH_ALWAYS,
1476 ++ VMENTER_L1D_FLUSH_EPT_DISABLED,
1477 ++ VMENTER_L1D_FLUSH_NOT_REQUIRED,
1478 ++};
1479 ++
1480 ++extern enum vmx_l1d_flush_state l1tf_vmx_mitigation;
1481 ++
1482 + #endif
1483 +diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
1484 +index adbda5847b14..3b3a2d0af78d 100644
1485 +--- a/arch/x86/kernel/apic/apic.c
1486 ++++ b/arch/x86/kernel/apic/apic.c
1487 +@@ -56,6 +56,7 @@
1488 + #include <asm/hypervisor.h>
1489 + #include <asm/cpu_device_id.h>
1490 + #include <asm/intel-family.h>
1491 ++#include <asm/irq_regs.h>
1492 +
1493 + unsigned int num_processors;
1494 +
1495 +@@ -2192,6 +2193,23 @@ static int cpuid_to_apicid[] = {
1496 + [0 ... NR_CPUS - 1] = -1,
1497 + };
1498 +
1499 ++#ifdef CONFIG_SMP
1500 ++/**
1501 ++ * apic_id_is_primary_thread - Check whether APIC ID belongs to a primary thread
1502 ++ * @id: APIC ID to check
1503 ++ */
1504 ++bool apic_id_is_primary_thread(unsigned int apicid)
1505 ++{
1506 ++ u32 mask;
1507 ++
1508 ++ if (smp_num_siblings == 1)
1509 ++ return true;
1510 ++ /* Isolate the SMT bit(s) in the APICID and check for 0 */
1511 ++ mask = (1U << (fls(smp_num_siblings) - 1)) - 1;
1512 ++ return !(apicid & mask);
1513 ++}
1514 ++#endif
1515 ++
1516 + /*
1517 + * Should use this API to allocate logical CPU IDs to keep nr_logical_cpuids
1518 + * and cpuid_to_apicid[] synchronized.
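
apic_id_is_primary_thread() above isolates the SMT bits of the APIC ID: with the usual two siblings per core the mask is 1, so every even APIC ID is a primary thread. A stand-alone sketch using a user-space fls() substitute built on a GCC/clang builtin (the sibling counts are example values, not read from hardware):

#include <stdio.h>

/* fls() substitute: 1-based position of the most significant set bit. */
static int fls_u32(unsigned int x)
{
	return x ? 32 - __builtin_clz(x) : 0;
}

static int apic_id_is_primary_thread(unsigned int apicid, unsigned int siblings)
{
	unsigned int mask;

	if (siblings == 1)
		return 1;
	/* Isolate the SMT bit(s) in the APIC ID and check for 0 */
	mask = (1U << (fls_u32(siblings) - 1)) - 1;
	return !(apicid & mask);
}

int main(void)
{
	/* Two siblings per core: APIC ID 4 is a primary thread, its sibling 5 is not. */
	printf("apicid 4: %d, apicid 5: %d\n",
	       apic_id_is_primary_thread(4, 2),
	       apic_id_is_primary_thread(5, 2));
	return 0;
}
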
1519 +diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
1520 +index 3982f79d2377..ff0d14cd9e82 100644
1521 +--- a/arch/x86/kernel/apic/io_apic.c
1522 ++++ b/arch/x86/kernel/apic/io_apic.c
1523 +@@ -33,6 +33,7 @@
1524 +
1525 + #include <linux/mm.h>
1526 + #include <linux/interrupt.h>
1527 ++#include <linux/irq.h>
1528 + #include <linux/init.h>
1529 + #include <linux/delay.h>
1530 + #include <linux/sched.h>
1531 +diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
1532 +index ce503c99f5c4..72a94401f9e0 100644
1533 +--- a/arch/x86/kernel/apic/msi.c
1534 ++++ b/arch/x86/kernel/apic/msi.c
1535 +@@ -12,6 +12,7 @@
1536 + */
1537 + #include <linux/mm.h>
1538 + #include <linux/interrupt.h>
1539 ++#include <linux/irq.h>
1540 + #include <linux/pci.h>
1541 + #include <linux/dmar.h>
1542 + #include <linux/hpet.h>
1543 +diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
1544 +index 35aaee4fc028..c9b773401fd8 100644
1545 +--- a/arch/x86/kernel/apic/vector.c
1546 ++++ b/arch/x86/kernel/apic/vector.c
1547 +@@ -11,6 +11,7 @@
1548 + * published by the Free Software Foundation.
1549 + */
1550 + #include <linux/interrupt.h>
1551 ++#include <linux/irq.h>
1552 + #include <linux/seq_file.h>
1553 + #include <linux/init.h>
1554 + #include <linux/compiler.h>
1555 +diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
1556 +index 38915fbfae73..97e962afb967 100644
1557 +--- a/arch/x86/kernel/cpu/amd.c
1558 ++++ b/arch/x86/kernel/cpu/amd.c
1559 +@@ -315,6 +315,13 @@ static void legacy_fixup_core_id(struct cpuinfo_x86 *c)
1560 + c->cpu_core_id %= cus_per_node;
1561 + }
1562 +
1563 ++
1564 ++static void amd_get_topology_early(struct cpuinfo_x86 *c)
1565 ++{
1566 ++ if (cpu_has(c, X86_FEATURE_TOPOEXT))
1567 ++ smp_num_siblings = ((cpuid_ebx(0x8000001e) >> 8) & 0xff) + 1;
1568 ++}
1569 ++
1570 + /*
1571 + * Fixup core topology information for
1572 + * (1) AMD multi-node processors
1573 +@@ -334,7 +341,6 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
1574 + cpuid(0x8000001e, &eax, &ebx, &ecx, &edx);
1575 +
1576 + node_id = ecx & 0xff;
1577 +- smp_num_siblings = ((ebx >> 8) & 0xff) + 1;
1578 +
1579 + if (c->x86 == 0x15)
1580 + c->cu_id = ebx & 0xff;
1581 +@@ -613,6 +619,7 @@ clear_sev:
1582 +
1583 + static void early_init_amd(struct cpuinfo_x86 *c)
1584 + {
1585 ++ u64 value;
1586 + u32 dummy;
1587 +
1588 + early_init_amd_mc(c);
1589 +@@ -683,6 +690,22 @@ static void early_init_amd(struct cpuinfo_x86 *c)
1590 + set_cpu_bug(c, X86_BUG_AMD_E400);
1591 +
1592 + early_detect_mem_encrypt(c);
1593 ++
1594 ++ /* Re-enable TopologyExtensions if switched off by BIOS */
1595 ++ if (c->x86 == 0x15 &&
1596 ++ (c->x86_model >= 0x10 && c->x86_model <= 0x6f) &&
1597 ++ !cpu_has(c, X86_FEATURE_TOPOEXT)) {
1598 ++
1599 ++ if (msr_set_bit(0xc0011005, 54) > 0) {
1600 ++ rdmsrl(0xc0011005, value);
1601 ++ if (value & BIT_64(54)) {
1602 ++ set_cpu_cap(c, X86_FEATURE_TOPOEXT);
1603 ++ pr_info_once(FW_INFO "CPU: Re-enabling disabled Topology Extensions Support.\n");
1604 ++ }
1605 ++ }
1606 ++ }
1607 ++
1608 ++ amd_get_topology_early(c);
1609 + }
1610 +
1611 + static void init_amd_k8(struct cpuinfo_x86 *c)
1612 +@@ -774,19 +797,6 @@ static void init_amd_bd(struct cpuinfo_x86 *c)
1613 + {
1614 + u64 value;
1615 +
1616 +- /* re-enable TopologyExtensions if switched off by BIOS */
1617 +- if ((c->x86_model >= 0x10) && (c->x86_model <= 0x6f) &&
1618 +- !cpu_has(c, X86_FEATURE_TOPOEXT)) {
1619 +-
1620 +- if (msr_set_bit(0xc0011005, 54) > 0) {
1621 +- rdmsrl(0xc0011005, value);
1622 +- if (value & BIT_64(54)) {
1623 +- set_cpu_cap(c, X86_FEATURE_TOPOEXT);
1624 +- pr_info_once(FW_INFO "CPU: Re-enabling disabled Topology Extensions Support.\n");
1625 +- }
1626 +- }
1627 +- }
1628 +-
1629 + /*
1630 + * The way access filter has a performance penalty on some workloads.
1631 + * Disable it on the affected CPUs.
1632 +@@ -850,16 +860,9 @@ static void init_amd(struct cpuinfo_x86 *c)
1633 +
1634 + cpu_detect_cache_sizes(c);
1635 +
1636 +- /* Multi core CPU? */
1637 +- if (c->extended_cpuid_level >= 0x80000008) {
1638 +- amd_detect_cmp(c);
1639 +- amd_get_topology(c);
1640 +- srat_detect_node(c);
1641 +- }
1642 +-
1643 +-#ifdef CONFIG_X86_32
1644 +- detect_ht(c);
1645 +-#endif
1646 ++ amd_detect_cmp(c);
1647 ++ amd_get_topology(c);
1648 ++ srat_detect_node(c);
1649 +
1650 + init_amd_cacheinfo(c);
1651 +
1652 +diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
1653 +index 5c0ea39311fe..c4f0ae49a53d 100644
1654 +--- a/arch/x86/kernel/cpu/bugs.c
1655 ++++ b/arch/x86/kernel/cpu/bugs.c
1656 +@@ -22,15 +22,18 @@
1657 + #include <asm/processor-flags.h>
1658 + #include <asm/fpu/internal.h>
1659 + #include <asm/msr.h>
1660 ++#include <asm/vmx.h>
1661 + #include <asm/paravirt.h>
1662 + #include <asm/alternative.h>
1663 + #include <asm/pgtable.h>
1664 + #include <asm/set_memory.h>
1665 + #include <asm/intel-family.h>
1666 + #include <asm/hypervisor.h>
1667 ++#include <asm/e820/api.h>
1668 +
1669 + static void __init spectre_v2_select_mitigation(void);
1670 + static void __init ssb_select_mitigation(void);
1671 ++static void __init l1tf_select_mitigation(void);
1672 +
1673 + /*
1674 + * Our boot-time value of the SPEC_CTRL MSR. We read it once so that any
1675 +@@ -56,6 +59,12 @@ void __init check_bugs(void)
1676 + {
1677 + identify_boot_cpu();
1678 +
1679 ++ /*
1680 ++ * identify_boot_cpu() initialized SMT support information, let the
1681 ++ * core code know.
1682 ++ */
1683 ++ cpu_smt_check_topology_early();
1684 ++
1685 + if (!IS_ENABLED(CONFIG_SMP)) {
1686 + pr_info("CPU: ");
1687 + print_cpu_info(&boot_cpu_data);
1688 +@@ -82,6 +91,8 @@ void __init check_bugs(void)
1689 + */
1690 + ssb_select_mitigation();
1691 +
1692 ++ l1tf_select_mitigation();
1693 ++
1694 + #ifdef CONFIG_X86_32
1695 + /*
1696 + * Check whether we are able to run this kernel safely on SMP.
1697 +@@ -313,23 +324,6 @@ static enum spectre_v2_mitigation_cmd __init spectre_v2_parse_cmdline(void)
1698 + return cmd;
1699 + }
1700 +
1701 +-/* Check for Skylake-like CPUs (for RSB handling) */
1702 +-static bool __init is_skylake_era(void)
1703 +-{
1704 +- if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
1705 +- boot_cpu_data.x86 == 6) {
1706 +- switch (boot_cpu_data.x86_model) {
1707 +- case INTEL_FAM6_SKYLAKE_MOBILE:
1708 +- case INTEL_FAM6_SKYLAKE_DESKTOP:
1709 +- case INTEL_FAM6_SKYLAKE_X:
1710 +- case INTEL_FAM6_KABYLAKE_MOBILE:
1711 +- case INTEL_FAM6_KABYLAKE_DESKTOP:
1712 +- return true;
1713 +- }
1714 +- }
1715 +- return false;
1716 +-}
1717 +-
1718 + static void __init spectre_v2_select_mitigation(void)
1719 + {
1720 + enum spectre_v2_mitigation_cmd cmd = spectre_v2_parse_cmdline();
1721 +@@ -390,22 +384,15 @@ retpoline_auto:
1722 + pr_info("%s\n", spectre_v2_strings[mode]);
1723 +
1724 + /*
1725 +- * If neither SMEP nor PTI are available, there is a risk of
1726 +- * hitting userspace addresses in the RSB after a context switch
1727 +- * from a shallow call stack to a deeper one. To prevent this fill
1728 +- * the entire RSB, even when using IBRS.
1729 ++ * If spectre v2 protection has been enabled, unconditionally fill
1730 ++ * RSB during a context switch; this protects against two independent
1731 ++ * issues:
1732 + *
1733 +- * Skylake era CPUs have a separate issue with *underflow* of the
1734 +- * RSB, when they will predict 'ret' targets from the generic BTB.
1735 +- * The proper mitigation for this is IBRS. If IBRS is not supported
1736 +- * or deactivated in favour of retpolines the RSB fill on context
1737 +- * switch is required.
1738 ++ * - RSB underflow (and switch to BTB) on Skylake+
1739 ++ * - SpectreRSB variant of spectre v2 on X86_BUG_SPECTRE_V2 CPUs
1740 + */
1741 +- if ((!boot_cpu_has(X86_FEATURE_PTI) &&
1742 +- !boot_cpu_has(X86_FEATURE_SMEP)) || is_skylake_era()) {
1743 +- setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
1744 +- pr_info("Spectre v2 mitigation: Filling RSB on context switch\n");
1745 +- }
1746 ++ setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
1747 ++ pr_info("Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch\n");
1748 +
1749 + /* Initialize Indirect Branch Prediction Barrier if supported */
1750 + if (boot_cpu_has(X86_FEATURE_IBPB)) {
1751 +@@ -654,8 +641,121 @@ void x86_spec_ctrl_setup_ap(void)
1752 + x86_amd_ssb_disable();
1753 + }
1754 +
1755 ++#undef pr_fmt
1756 ++#define pr_fmt(fmt) "L1TF: " fmt
1757 ++
1758 ++/* Default mitigation for L1TF-affected CPUs */
1759 ++enum l1tf_mitigations l1tf_mitigation __ro_after_init = L1TF_MITIGATION_FLUSH;
1760 ++#if IS_ENABLED(CONFIG_KVM_INTEL)
1761 ++EXPORT_SYMBOL_GPL(l1tf_mitigation);
1762 ++
1763 ++enum vmx_l1d_flush_state l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
1764 ++EXPORT_SYMBOL_GPL(l1tf_vmx_mitigation);
1765 ++#endif
1766 ++
1767 ++static void __init l1tf_select_mitigation(void)
1768 ++{
1769 ++ u64 half_pa;
1770 ++
1771 ++ if (!boot_cpu_has_bug(X86_BUG_L1TF))
1772 ++ return;
1773 ++
1774 ++ switch (l1tf_mitigation) {
1775 ++ case L1TF_MITIGATION_OFF:
1776 ++ case L1TF_MITIGATION_FLUSH_NOWARN:
1777 ++ case L1TF_MITIGATION_FLUSH:
1778 ++ break;
1779 ++ case L1TF_MITIGATION_FLUSH_NOSMT:
1780 ++ case L1TF_MITIGATION_FULL:
1781 ++ cpu_smt_disable(false);
1782 ++ break;
1783 ++ case L1TF_MITIGATION_FULL_FORCE:
1784 ++ cpu_smt_disable(true);
1785 ++ break;
1786 ++ }
1787 ++
1788 ++#if CONFIG_PGTABLE_LEVELS == 2
1789 ++ pr_warn("Kernel not compiled for PAE. No mitigation for L1TF\n");
1790 ++ return;
1791 ++#endif
1792 ++
1793 ++ /*
1794 ++ * This is extremely unlikely to happen because almost all
1795 ++ * systems have a MAX_PA/2 far larger than the amount of RAM
1796 ++ * that can be fit into DIMM slots.
1797 ++ */
1798 ++ half_pa = (u64)l1tf_pfn_limit() << PAGE_SHIFT;
1799 ++ if (e820__mapped_any(half_pa, ULLONG_MAX - half_pa, E820_TYPE_RAM)) {
1800 ++ pr_warn("System has more than MAX_PA/2 memory. L1TF mitigation not effective.\n");
1801 ++ return;
1802 ++ }
1803 ++
1804 ++ setup_force_cpu_cap(X86_FEATURE_L1TF_PTEINV);
1805 ++}
1806 ++
1807 ++static int __init l1tf_cmdline(char *str)
1808 ++{
1809 ++ if (!boot_cpu_has_bug(X86_BUG_L1TF))
1810 ++ return 0;
1811 ++
1812 ++ if (!str)
1813 ++ return -EINVAL;
1814 ++
1815 ++ if (!strcmp(str, "off"))
1816 ++ l1tf_mitigation = L1TF_MITIGATION_OFF;
1817 ++ else if (!strcmp(str, "flush,nowarn"))
1818 ++ l1tf_mitigation = L1TF_MITIGATION_FLUSH_NOWARN;
1819 ++ else if (!strcmp(str, "flush"))
1820 ++ l1tf_mitigation = L1TF_MITIGATION_FLUSH;
1821 ++ else if (!strcmp(str, "flush,nosmt"))
1822 ++ l1tf_mitigation = L1TF_MITIGATION_FLUSH_NOSMT;
1823 ++ else if (!strcmp(str, "full"))
1824 ++ l1tf_mitigation = L1TF_MITIGATION_FULL;
1825 ++ else if (!strcmp(str, "full,force"))
1826 ++ l1tf_mitigation = L1TF_MITIGATION_FULL_FORCE;
1827 ++
1828 ++ return 0;
1829 ++}
1830 ++early_param("l1tf", l1tf_cmdline);
1831 ++
1832 ++#undef pr_fmt
1833 ++
1834 + #ifdef CONFIG_SYSFS
1835 +
1836 ++#define L1TF_DEFAULT_MSG "Mitigation: PTE Inversion"
1837 ++
1838 ++#if IS_ENABLED(CONFIG_KVM_INTEL)
1839 ++static const char *l1tf_vmx_states[] = {
1840 ++ [VMENTER_L1D_FLUSH_AUTO] = "auto",
1841 ++ [VMENTER_L1D_FLUSH_NEVER] = "vulnerable",
1842 ++ [VMENTER_L1D_FLUSH_COND] = "conditional cache flushes",
1843 ++ [VMENTER_L1D_FLUSH_ALWAYS] = "cache flushes",
1844 ++ [VMENTER_L1D_FLUSH_EPT_DISABLED] = "EPT disabled",
1845 ++ [VMENTER_L1D_FLUSH_NOT_REQUIRED] = "flush not necessary"
1846 ++};
1847 ++
1848 ++static ssize_t l1tf_show_state(char *buf)
1849 ++{
1850 ++ if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_AUTO)
1851 ++ return sprintf(buf, "%s\n", L1TF_DEFAULT_MSG);
1852 ++
1853 ++ if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_EPT_DISABLED ||
1854 ++ (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_NEVER &&
1855 ++ cpu_smt_control == CPU_SMT_ENABLED))
1856 ++ return sprintf(buf, "%s; VMX: %s\n", L1TF_DEFAULT_MSG,
1857 ++ l1tf_vmx_states[l1tf_vmx_mitigation]);
1858 ++
1859 ++ return sprintf(buf, "%s; VMX: %s, SMT %s\n", L1TF_DEFAULT_MSG,
1860 ++ l1tf_vmx_states[l1tf_vmx_mitigation],
1861 ++ cpu_smt_control == CPU_SMT_ENABLED ? "vulnerable" : "disabled");
1862 ++}
1863 ++#else
1864 ++static ssize_t l1tf_show_state(char *buf)
1865 ++{
1866 ++ return sprintf(buf, "%s\n", L1TF_DEFAULT_MSG);
1867 ++}
1868 ++#endif
1869 ++
1870 + static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr,
1871 + char *buf, unsigned int bug)
1872 + {
1873 +@@ -684,6 +784,10 @@ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr
1874 + case X86_BUG_SPEC_STORE_BYPASS:
1875 + return sprintf(buf, "%s\n", ssb_strings[ssb_mode]);
1876 +
1877 ++ case X86_BUG_L1TF:
1878 ++ if (boot_cpu_has(X86_FEATURE_L1TF_PTEINV))
1879 ++ return l1tf_show_state(buf);
1880 ++ break;
1881 + default:
1882 + break;
1883 + }
1884 +@@ -710,4 +814,9 @@ ssize_t cpu_show_spec_store_bypass(struct device *dev, struct device_attribute *
1885 + {
1886 + return cpu_show_common(dev, attr, buf, X86_BUG_SPEC_STORE_BYPASS);
1887 + }
1888 ++
1889 ++ssize_t cpu_show_l1tf(struct device *dev, struct device_attribute *attr, char *buf)
1890 ++{
1891 ++ return cpu_show_common(dev, attr, buf, X86_BUG_L1TF);
1892 ++}
1893 + #endif
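
The l1tf= early parameter handled above accepts a fixed set of strings and maps them onto enum l1tf_mitigations; unknown strings simply leave the default (flush) in place. A table-driven user-space equivalent of that mapping (the enum values are copied from the patch, the parser structure itself is only a sketch):

#include <stdio.h>
#include <string.h>

enum l1tf_mitigations {
	L1TF_MITIGATION_OFF,
	L1TF_MITIGATION_FLUSH_NOWARN,
	L1TF_MITIGATION_FLUSH,
	L1TF_MITIGATION_FLUSH_NOSMT,
	L1TF_MITIGATION_FULL,
	L1TF_MITIGATION_FULL_FORCE,
};

static const struct {
	const char *opt;
	enum l1tf_mitigations mode;
} l1tf_options[] = {
	{ "off",          L1TF_MITIGATION_OFF },
	{ "flush,nowarn", L1TF_MITIGATION_FLUSH_NOWARN },
	{ "flush",        L1TF_MITIGATION_FLUSH },
	{ "flush,nosmt",  L1TF_MITIGATION_FLUSH_NOSMT },
	{ "full",         L1TF_MITIGATION_FULL },
	{ "full,force",   L1TF_MITIGATION_FULL_FORCE },
};

static int l1tf_parse(const char *str, enum l1tf_mitigations *mode)
{
	size_t i;

	for (i = 0; i < sizeof(l1tf_options) / sizeof(l1tf_options[0]); i++) {
		if (!strcmp(str, l1tf_options[i].opt)) {
			*mode = l1tf_options[i].mode;
			return 0;
		}
	}
	return -1;	/* unknown strings leave the default untouched */
}

int main(void)
{
	enum l1tf_mitigations mode = L1TF_MITIGATION_FLUSH;	/* default */

	if (!l1tf_parse("flush,nosmt", &mode))
		printf("mode = %d\n", mode);	/* prints 3 */
	return 0;
}
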
1894 +diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
1895 +index eb4cb3efd20e..9eda6f730ec4 100644
1896 +--- a/arch/x86/kernel/cpu/common.c
1897 ++++ b/arch/x86/kernel/cpu/common.c
1898 +@@ -661,33 +661,36 @@ static void cpu_detect_tlb(struct cpuinfo_x86 *c)
1899 + tlb_lld_4m[ENTRIES], tlb_lld_1g[ENTRIES]);
1900 + }
1901 +
1902 +-void detect_ht(struct cpuinfo_x86 *c)
1903 ++int detect_ht_early(struct cpuinfo_x86 *c)
1904 + {
1905 + #ifdef CONFIG_SMP
1906 + u32 eax, ebx, ecx, edx;
1907 +- int index_msb, core_bits;
1908 +- static bool printed;
1909 +
1910 + if (!cpu_has(c, X86_FEATURE_HT))
1911 +- return;
1912 ++ return -1;
1913 +
1914 + if (cpu_has(c, X86_FEATURE_CMP_LEGACY))
1915 +- goto out;
1916 ++ return -1;
1917 +
1918 + if (cpu_has(c, X86_FEATURE_XTOPOLOGY))
1919 +- return;
1920 ++ return -1;
1921 +
1922 + cpuid(1, &eax, &ebx, &ecx, &edx);
1923 +
1924 + smp_num_siblings = (ebx & 0xff0000) >> 16;
1925 +-
1926 +- if (smp_num_siblings == 1) {
1927 ++ if (smp_num_siblings == 1)
1928 + pr_info_once("CPU0: Hyper-Threading is disabled\n");
1929 +- goto out;
1930 +- }
1931 ++#endif
1932 ++ return 0;
1933 ++}
1934 +
1935 +- if (smp_num_siblings <= 1)
1936 +- goto out;
1937 ++void detect_ht(struct cpuinfo_x86 *c)
1938 ++{
1939 ++#ifdef CONFIG_SMP
1940 ++ int index_msb, core_bits;
1941 ++
1942 ++ if (detect_ht_early(c) < 0)
1943 ++ return;
1944 +
1945 + index_msb = get_count_order(smp_num_siblings);
1946 + c->phys_proc_id = apic->phys_pkg_id(c->initial_apicid, index_msb);
1947 +@@ -700,15 +703,6 @@ void detect_ht(struct cpuinfo_x86 *c)
1948 +
1949 + c->cpu_core_id = apic->phys_pkg_id(c->initial_apicid, index_msb) &
1950 + ((1 << core_bits) - 1);
1951 +-
1952 +-out:
1953 +- if (!printed && (c->x86_max_cores * smp_num_siblings) > 1) {
1954 +- pr_info("CPU: Physical Processor ID: %d\n",
1955 +- c->phys_proc_id);
1956 +- pr_info("CPU: Processor Core ID: %d\n",
1957 +- c->cpu_core_id);
1958 +- printed = 1;
1959 +- }
1960 + #endif
1961 + }
1962 +
1963 +@@ -987,6 +981,21 @@ static const __initconst struct x86_cpu_id cpu_no_spec_store_bypass[] = {
1964 + {}
1965 + };
1966 +
1967 ++static const __initconst struct x86_cpu_id cpu_no_l1tf[] = {
1968 ++ /* in addition to cpu_no_speculation */
1969 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT1 },
1970 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT2 },
1971 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_AIRMONT },
1972 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_MERRIFIELD },
1973 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_MOOREFIELD },
1974 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_GOLDMONT },
1975 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_DENVERTON },
1976 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_GEMINI_LAKE },
1977 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNL },
1978 ++ { X86_VENDOR_INTEL, 6, INTEL_FAM6_XEON_PHI_KNM },
1979 ++ {}
1980 ++};
1981 ++
1982 + static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
1983 + {
1984 + u64 ia32_cap = 0;
1985 +@@ -1013,6 +1022,11 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
1986 + return;
1987 +
1988 + setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN);
1989 ++
1990 ++ if (x86_match_cpu(cpu_no_l1tf))
1991 ++ return;
1992 ++
1993 ++ setup_force_cpu_bug(X86_BUG_L1TF);
1994 + }
1995 +
1996 + /*
1997 +diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h
1998 +index 38216f678fc3..e59c0ea82a33 100644
1999 +--- a/arch/x86/kernel/cpu/cpu.h
2000 ++++ b/arch/x86/kernel/cpu/cpu.h
2001 +@@ -55,7 +55,9 @@ extern void init_intel_cacheinfo(struct cpuinfo_x86 *c);
2002 + extern void init_amd_cacheinfo(struct cpuinfo_x86 *c);
2003 +
2004 + extern void detect_num_cpu_cores(struct cpuinfo_x86 *c);
2005 ++extern int detect_extended_topology_early(struct cpuinfo_x86 *c);
2006 + extern int detect_extended_topology(struct cpuinfo_x86 *c);
2007 ++extern int detect_ht_early(struct cpuinfo_x86 *c);
2008 + extern void detect_ht(struct cpuinfo_x86 *c);
2009 +
2010 + unsigned int aperfmperf_get_khz(int cpu);
2011 +diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
2012 +index eb75564f2d25..6602941cfebf 100644
2013 +--- a/arch/x86/kernel/cpu/intel.c
2014 ++++ b/arch/x86/kernel/cpu/intel.c
2015 +@@ -301,6 +301,13 @@ static void early_init_intel(struct cpuinfo_x86 *c)
2016 + }
2017 +
2018 + check_mpx_erratum(c);
2019 ++
2020 ++ /*
2021 ++ * Get the number of SMT siblings early from the extended topology
2022 ++ * leaf, if available. Otherwise try the legacy SMT detection.
2023 ++ */
2024 ++ if (detect_extended_topology_early(c) < 0)
2025 ++ detect_ht_early(c);
2026 + }
2027 +
2028 + #ifdef CONFIG_X86_32
2029 +diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c
2030 +index 08286269fd24..b9bc8a1a584e 100644
2031 +--- a/arch/x86/kernel/cpu/microcode/core.c
2032 ++++ b/arch/x86/kernel/cpu/microcode/core.c
2033 +@@ -509,12 +509,20 @@ static struct platform_device *microcode_pdev;
2034 +
2035 + static int check_online_cpus(void)
2036 + {
2037 +- if (num_online_cpus() == num_present_cpus())
2038 +- return 0;
2039 ++ unsigned int cpu;
2040 +
2041 +- pr_err("Not all CPUs online, aborting microcode update.\n");
2042 ++ /*
2043 ++ * Make sure all CPUs are online. It's fine for SMT to be disabled if
2044 ++ * all the primary threads are still online.
2045 ++ */
2046 ++ for_each_present_cpu(cpu) {
2047 ++ if (topology_is_primary_thread(cpu) && !cpu_online(cpu)) {
2048 ++ pr_err("Not all CPUs online, aborting microcode update.\n");
2049 ++ return -EINVAL;
2050 ++ }
2051 ++ }
2052 +
2053 +- return -EINVAL;
2054 ++ return 0;
2055 + }
2056 +
2057 + static atomic_t late_cpus_in;
2058 +diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
2059 +index 81c0afb39d0a..71ca064e3794 100644
2060 +--- a/arch/x86/kernel/cpu/topology.c
2061 ++++ b/arch/x86/kernel/cpu/topology.c
2062 +@@ -22,18 +22,10 @@
2063 + #define BITS_SHIFT_NEXT_LEVEL(eax) ((eax) & 0x1f)
2064 + #define LEVEL_MAX_SIBLINGS(ebx) ((ebx) & 0xffff)
2065 +
2066 +-/*
2067 +- * Check for extended topology enumeration cpuid leaf 0xb and if it
2068 +- * exists, use it for populating initial_apicid and cpu topology
2069 +- * detection.
2070 +- */
2071 +-int detect_extended_topology(struct cpuinfo_x86 *c)
2072 ++int detect_extended_topology_early(struct cpuinfo_x86 *c)
2073 + {
2074 + #ifdef CONFIG_SMP
2075 +- unsigned int eax, ebx, ecx, edx, sub_index;
2076 +- unsigned int ht_mask_width, core_plus_mask_width;
2077 +- unsigned int core_select_mask, core_level_siblings;
2078 +- static bool printed;
2079 ++ unsigned int eax, ebx, ecx, edx;
2080 +
2081 + if (c->cpuid_level < 0xb)
2082 + return -1;
2083 +@@ -52,10 +44,30 @@ int detect_extended_topology(struct cpuinfo_x86 *c)
2084 + * initial apic id, which also represents 32-bit extended x2apic id.
2085 + */
2086 + c->initial_apicid = edx;
2087 ++ smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);
2088 ++#endif
2089 ++ return 0;
2090 ++}
2091 ++
2092 ++/*
2093 ++ * Check for extended topology enumeration cpuid leaf 0xb and if it
2094 ++ * exists, use it for populating initial_apicid and cpu topology
2095 ++ * detection.
2096 ++ */
2097 ++int detect_extended_topology(struct cpuinfo_x86 *c)
2098 ++{
2099 ++#ifdef CONFIG_SMP
2100 ++ unsigned int eax, ebx, ecx, edx, sub_index;
2101 ++ unsigned int ht_mask_width, core_plus_mask_width;
2102 ++ unsigned int core_select_mask, core_level_siblings;
2103 ++
2104 ++ if (detect_extended_topology_early(c) < 0)
2105 ++ return -1;
2106 +
2107 + /*
2108 + * Populate HT related information from sub-leaf level 0.
2109 + */
2110 ++ cpuid_count(0xb, SMT_LEVEL, &eax, &ebx, &ecx, &edx);
2111 + core_level_siblings = smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);
2112 + core_plus_mask_width = ht_mask_width = BITS_SHIFT_NEXT_LEVEL(eax);
2113 +
2114 +@@ -86,15 +98,6 @@ int detect_extended_topology(struct cpuinfo_x86 *c)
2115 + c->apicid = apic->phys_pkg_id(c->initial_apicid, 0);
2116 +
2117 + c->x86_max_cores = (core_level_siblings / smp_num_siblings);
2118 +-
2119 +- if (!printed) {
2120 +- pr_info("CPU: Physical Processor ID: %d\n",
2121 +- c->phys_proc_id);
2122 +- if (c->x86_max_cores > 1)
2123 +- pr_info("CPU: Processor Core ID: %d\n",
2124 +- c->cpu_core_id);
2125 +- printed = 1;
2126 +- }
2127 + #endif
2128 + return 0;
2129 + }
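
detect_extended_topology_early() above pulls the thread-per-core count from CPUID leaf 0xb, sub-leaf 0, where EBX[15:0] is the number of logical processors at the SMT level and EDX is the initial x2APIC ID. Roughly the same query can be made from user space, assuming a toolchain that ships GCC/clang's <cpuid.h>:

#include <cpuid.h>
#include <stdio.h>

#define LEVEL_MAX_SIBLINGS(ebx)	((ebx) & 0xffff)

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* Like the kernel, treat EBX == 0 as "leaf not implemented". */
	if (!__get_cpuid_count(0xb, 0, &eax, &ebx, &ecx, &edx) || !ebx) {
		printf("extended topology leaf 0xb not available\n");
		return 0;
	}
	printf("smp_num_siblings = %u, initial x2APIC id = %u\n",
	       LEVEL_MAX_SIBLINGS(ebx), edx);
	return 0;
}
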
2130 +diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
2131 +index f92a6593de1e..2ea85b32421a 100644
2132 +--- a/arch/x86/kernel/fpu/core.c
2133 ++++ b/arch/x86/kernel/fpu/core.c
2134 +@@ -10,6 +10,7 @@
2135 + #include <asm/fpu/signal.h>
2136 + #include <asm/fpu/types.h>
2137 + #include <asm/traps.h>
2138 ++#include <asm/irq_regs.h>
2139 +
2140 + #include <linux/hardirq.h>
2141 + #include <linux/pkeys.h>
2142 +diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
2143 +index 346b24883911..b0acb22e5a46 100644
2144 +--- a/arch/x86/kernel/hpet.c
2145 ++++ b/arch/x86/kernel/hpet.c
2146 +@@ -1,6 +1,7 @@
2147 + #include <linux/clocksource.h>
2148 + #include <linux/clockchips.h>
2149 + #include <linux/interrupt.h>
2150 ++#include <linux/irq.h>
2151 + #include <linux/export.h>
2152 + #include <linux/delay.h>
2153 + #include <linux/errno.h>
2154 +diff --git a/arch/x86/kernel/i8259.c b/arch/x86/kernel/i8259.c
2155 +index 86c4439f9d74..519649ddf100 100644
2156 +--- a/arch/x86/kernel/i8259.c
2157 ++++ b/arch/x86/kernel/i8259.c
2158 +@@ -5,6 +5,7 @@
2159 + #include <linux/sched.h>
2160 + #include <linux/ioport.h>
2161 + #include <linux/interrupt.h>
2162 ++#include <linux/irq.h>
2163 + #include <linux/timex.h>
2164 + #include <linux/random.h>
2165 + #include <linux/init.h>
2166 +diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
2167 +index 74383a3780dc..01adea278a71 100644
2168 +--- a/arch/x86/kernel/idt.c
2169 ++++ b/arch/x86/kernel/idt.c
2170 +@@ -8,6 +8,7 @@
2171 + #include <asm/traps.h>
2172 + #include <asm/proto.h>
2173 + #include <asm/desc.h>
2174 ++#include <asm/hw_irq.h>
2175 +
2176 + struct idt_data {
2177 + unsigned int vector;
2178 +diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
2179 +index 328d027d829d..59b5f2ea7c2f 100644
2180 +--- a/arch/x86/kernel/irq.c
2181 ++++ b/arch/x86/kernel/irq.c
2182 +@@ -10,6 +10,7 @@
2183 + #include <linux/ftrace.h>
2184 + #include <linux/delay.h>
2185 + #include <linux/export.h>
2186 ++#include <linux/irq.h>
2187 +
2188 + #include <asm/apic.h>
2189 + #include <asm/io_apic.h>
2190 +diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
2191 +index c1bdbd3d3232..95600a99ae93 100644
2192 +--- a/arch/x86/kernel/irq_32.c
2193 ++++ b/arch/x86/kernel/irq_32.c
2194 +@@ -11,6 +11,7 @@
2195 +
2196 + #include <linux/seq_file.h>
2197 + #include <linux/interrupt.h>
2198 ++#include <linux/irq.h>
2199 + #include <linux/kernel_stat.h>
2200 + #include <linux/notifier.h>
2201 + #include <linux/cpu.h>
2202 +diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
2203 +index d86e344f5b3d..0469cd078db1 100644
2204 +--- a/arch/x86/kernel/irq_64.c
2205 ++++ b/arch/x86/kernel/irq_64.c
2206 +@@ -11,6 +11,7 @@
2207 +
2208 + #include <linux/kernel_stat.h>
2209 + #include <linux/interrupt.h>
2210 ++#include <linux/irq.h>
2211 + #include <linux/seq_file.h>
2212 + #include <linux/delay.h>
2213 + #include <linux/ftrace.h>
2214 +diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
2215 +index 772196c1b8c4..a0693b71cfc1 100644
2216 +--- a/arch/x86/kernel/irqinit.c
2217 ++++ b/arch/x86/kernel/irqinit.c
2218 +@@ -5,6 +5,7 @@
2219 + #include <linux/sched.h>
2220 + #include <linux/ioport.h>
2221 + #include <linux/interrupt.h>
2222 ++#include <linux/irq.h>
2223 + #include <linux/timex.h>
2224 + #include <linux/random.h>
2225 + #include <linux/kprobes.h>
2226 +diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
2227 +index 6f4d42377fe5..44e26dc326d5 100644
2228 +--- a/arch/x86/kernel/kprobes/core.c
2229 ++++ b/arch/x86/kernel/kprobes/core.c
2230 +@@ -395,8 +395,6 @@ int __copy_instruction(u8 *dest, u8 *src, u8 *real, struct insn *insn)
2231 + - (u8 *) real;
2232 + if ((s64) (s32) newdisp != newdisp) {
2233 + pr_err("Kprobes error: new displacement does not fit into s32 (%llx)\n", newdisp);
2234 +- pr_err("\tSrc: %p, Dest: %p, old disp: %x\n",
2235 +- src, real, insn->displacement.value);
2236 + return 0;
2237 + }
2238 + disp = (u8 *) dest + insn_offset_displacement(insn);
2239 +@@ -640,8 +638,7 @@ static int reenter_kprobe(struct kprobe *p, struct pt_regs *regs,
2240 + * Raise a BUG or we'll continue in an endless reentering loop
2241 + * and eventually a stack overflow.
2242 + */
2243 +- printk(KERN_WARNING "Unrecoverable kprobe detected at %p.\n",
2244 +- p->addr);
2245 ++ pr_err("Unrecoverable kprobe detected.\n");
2246 + dump_kprobe(p);
2247 + BUG();
2248 + default:
2249 +diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
2250 +index 99dc79e76bdc..930c88341e4e 100644
2251 +--- a/arch/x86/kernel/paravirt.c
2252 ++++ b/arch/x86/kernel/paravirt.c
2253 +@@ -88,10 +88,12 @@ unsigned paravirt_patch_call(void *insnbuf,
2254 + struct branch *b = insnbuf;
2255 + unsigned long delta = (unsigned long)target - (addr+5);
2256 +
2257 +- if (tgt_clobbers & ~site_clobbers)
2258 +- return len; /* target would clobber too much for this site */
2259 +- if (len < 5)
2260 ++ if (len < 5) {
2261 ++#ifdef CONFIG_RETPOLINE
2262 ++ WARN_ONCE("Failing to patch indirect CALL in %ps\n", (void *)addr);
2263 ++#endif
2264 + return len; /* call too long for patch site */
2265 ++ }
2266 +
2267 + b->opcode = 0xe8; /* call */
2268 + b->delta = delta;
2269 +@@ -106,8 +108,12 @@ unsigned paravirt_patch_jmp(void *insnbuf, const void *target,
2270 + struct branch *b = insnbuf;
2271 + unsigned long delta = (unsigned long)target - (addr+5);
2272 +
2273 +- if (len < 5)
2274 ++ if (len < 5) {
2275 ++#ifdef CONFIG_RETPOLINE
2276 ++ WARN_ONCE("Failing to patch indirect JMP in %ps\n", (void *)addr);
2277 ++#endif
2278 + return len; /* call too long for patch site */
2279 ++ }
2280 +
2281 + b->opcode = 0xe9; /* jmp */
2282 + b->delta = delta;
2283 +diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
2284 +index 2f86d883dd95..74b4472ba0a6 100644
2285 +--- a/arch/x86/kernel/setup.c
2286 ++++ b/arch/x86/kernel/setup.c
2287 +@@ -823,6 +823,12 @@ void __init setup_arch(char **cmdline_p)
2288 + memblock_reserve(__pa_symbol(_text),
2289 + (unsigned long)__bss_stop - (unsigned long)_text);
2290 +
2291 ++ /*
2292 ++ * Make sure page 0 is always reserved because on systems with
2293 ++ * L1TF its contents can be leaked to user processes.
2294 ++ */
2295 ++ memblock_reserve(0, PAGE_SIZE);
2296 ++
2297 + early_reserve_initrd();
2298 +
2299 + /*
2300 +diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
2301 +index 5c574dff4c1a..04adc8d60aed 100644
2302 +--- a/arch/x86/kernel/smp.c
2303 ++++ b/arch/x86/kernel/smp.c
2304 +@@ -261,6 +261,7 @@ __visible void __irq_entry smp_reschedule_interrupt(struct pt_regs *regs)
2305 + {
2306 + ack_APIC_irq();
2307 + inc_irq_stat(irq_resched_count);
2308 ++ kvm_set_cpu_l1tf_flush_l1d();
2309 +
2310 + if (trace_resched_ipi_enabled()) {
2311 + /*
2312 +diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
2313 +index db9656e13ea0..f02ecaf97904 100644
2314 +--- a/arch/x86/kernel/smpboot.c
2315 ++++ b/arch/x86/kernel/smpboot.c
2316 +@@ -80,6 +80,7 @@
2317 + #include <asm/intel-family.h>
2318 + #include <asm/cpu_device_id.h>
2319 + #include <asm/spec-ctrl.h>
2320 ++#include <asm/hw_irq.h>
2321 +
2322 + /* representing HT siblings of each logical CPU */
2323 + DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
2324 +@@ -270,6 +271,23 @@ static void notrace start_secondary(void *unused)
2325 + cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
2326 + }
2327 +
2328 ++/**
2329 ++ * topology_is_primary_thread - Check whether CPU is the primary SMT thread
2330 ++ * @cpu: CPU to check
2331 ++ */
2332 ++bool topology_is_primary_thread(unsigned int cpu)
2333 ++{
2334 ++ return apic_id_is_primary_thread(per_cpu(x86_cpu_to_apicid, cpu));
2335 ++}
2336 ++
2337 ++/**
2338 ++ * topology_smt_supported - Check whether SMT is supported by the CPUs
2339 ++ */
2340 ++bool topology_smt_supported(void)
2341 ++{
2342 ++ return smp_num_siblings > 1;
2343 ++}
2344 ++
2345 + /**
2346 + * topology_phys_to_logical_pkg - Map a physical package id to a logical
2347 + *
2348 +diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c
2349 +index 774ebafa97c4..be01328eb755 100644
2350 +--- a/arch/x86/kernel/time.c
2351 ++++ b/arch/x86/kernel/time.c
2352 +@@ -12,6 +12,7 @@
2353 +
2354 + #include <linux/clockchips.h>
2355 + #include <linux/interrupt.h>
2356 ++#include <linux/irq.h>
2357 + #include <linux/i8253.h>
2358 + #include <linux/time.h>
2359 + #include <linux/export.h>
2360 +diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
2361 +index 6b8f11521c41..a44e568363a4 100644
2362 +--- a/arch/x86/kvm/mmu.c
2363 ++++ b/arch/x86/kvm/mmu.c
2364 +@@ -3840,6 +3840,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
2365 + {
2366 + int r = 1;
2367 +
2368 ++ vcpu->arch.l1tf_flush_l1d = true;
2369 + switch (vcpu->arch.apf.host_apf_reason) {
2370 + default:
2371 + trace_kvm_page_fault(fault_address, error_code);
2372 +diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
2373 +index 5d8e317c2b04..46b428c0990e 100644
2374 +--- a/arch/x86/kvm/vmx.c
2375 ++++ b/arch/x86/kvm/vmx.c
2376 +@@ -188,6 +188,150 @@ module_param(ple_window_max, uint, 0444);
2377 +
2378 + extern const ulong vmx_return;
2379 +
2380 ++static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush);
2381 ++static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_cond);
2382 ++static DEFINE_MUTEX(vmx_l1d_flush_mutex);
2383 ++
2384 ++/* Storage for pre module init parameter parsing */
2385 ++static enum vmx_l1d_flush_state __read_mostly vmentry_l1d_flush_param = VMENTER_L1D_FLUSH_AUTO;
2386 ++
2387 ++static const struct {
2388 ++ const char *option;
2389 ++ enum vmx_l1d_flush_state cmd;
2390 ++} vmentry_l1d_param[] = {
2391 ++ {"auto", VMENTER_L1D_FLUSH_AUTO},
2392 ++ {"never", VMENTER_L1D_FLUSH_NEVER},
2393 ++ {"cond", VMENTER_L1D_FLUSH_COND},
2394 ++ {"always", VMENTER_L1D_FLUSH_ALWAYS},
2395 ++};
2396 ++
2397 ++#define L1D_CACHE_ORDER 4
2398 ++static void *vmx_l1d_flush_pages;
2399 ++
2400 ++static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
2401 ++{
2402 ++ struct page *page;
2403 ++ unsigned int i;
2404 ++
2405 ++ if (!enable_ept) {
2406 ++ l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_EPT_DISABLED;
2407 ++ return 0;
2408 ++ }
2409 ++
2410 ++ if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES)) {
2411 ++ u64 msr;
2412 ++
2413 ++ rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr);
2414 ++ if (msr & ARCH_CAP_SKIP_VMENTRY_L1DFLUSH) {
2415 ++ l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_NOT_REQUIRED;
2416 ++ return 0;
2417 ++ }
2418 ++ }
2419 ++
2420 ++ /* If set to auto use the default l1tf mitigation method */
2421 ++ if (l1tf == VMENTER_L1D_FLUSH_AUTO) {
2422 ++ switch (l1tf_mitigation) {
2423 ++ case L1TF_MITIGATION_OFF:
2424 ++ l1tf = VMENTER_L1D_FLUSH_NEVER;
2425 ++ break;
2426 ++ case L1TF_MITIGATION_FLUSH_NOWARN:
2427 ++ case L1TF_MITIGATION_FLUSH:
2428 ++ case L1TF_MITIGATION_FLUSH_NOSMT:
2429 ++ l1tf = VMENTER_L1D_FLUSH_COND;
2430 ++ break;
2431 ++ case L1TF_MITIGATION_FULL:
2432 ++ case L1TF_MITIGATION_FULL_FORCE:
2433 ++ l1tf = VMENTER_L1D_FLUSH_ALWAYS;
2434 ++ break;
2435 ++ }
2436 ++ } else if (l1tf_mitigation == L1TF_MITIGATION_FULL_FORCE) {
2437 ++ l1tf = VMENTER_L1D_FLUSH_ALWAYS;
2438 ++ }
2439 ++
2440 ++ if (l1tf != VMENTER_L1D_FLUSH_NEVER && !vmx_l1d_flush_pages &&
2441 ++ !boot_cpu_has(X86_FEATURE_FLUSH_L1D)) {
2442 ++ page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER);
2443 ++ if (!page)
2444 ++ return -ENOMEM;
2445 ++ vmx_l1d_flush_pages = page_address(page);
2446 ++
2447 ++ /*
2448 ++ * Initialize each page with a different pattern in
2449 ++ * order to protect against KSM in the nested
2450 ++ * virtualization case.
2451 ++ */
2452 ++ for (i = 0; i < 1u << L1D_CACHE_ORDER; ++i) {
2453 ++ memset(vmx_l1d_flush_pages + i * PAGE_SIZE, i + 1,
2454 ++ PAGE_SIZE);
2455 ++ }
2456 ++ }
2457 ++
2458 ++ l1tf_vmx_mitigation = l1tf;
2459 ++
2460 ++ if (l1tf != VMENTER_L1D_FLUSH_NEVER)
2461 ++ static_branch_enable(&vmx_l1d_should_flush);
2462 ++ else
2463 ++ static_branch_disable(&vmx_l1d_should_flush);
2464 ++
2465 ++ if (l1tf == VMENTER_L1D_FLUSH_COND)
2466 ++ static_branch_enable(&vmx_l1d_flush_cond);
2467 ++ else
2468 ++ static_branch_disable(&vmx_l1d_flush_cond);
2469 ++ return 0;
2470 ++}
2471 ++
2472 ++static int vmentry_l1d_flush_parse(const char *s)
2473 ++{
2474 ++ unsigned int i;
2475 ++
2476 ++ if (s) {
2477 ++ for (i = 0; i < ARRAY_SIZE(vmentry_l1d_param); i++) {
2478 ++ if (sysfs_streq(s, vmentry_l1d_param[i].option))
2479 ++ return vmentry_l1d_param[i].cmd;
2480 ++ }
2481 ++ }
2482 ++ return -EINVAL;
2483 ++}
2484 ++
2485 ++static int vmentry_l1d_flush_set(const char *s, const struct kernel_param *kp)
2486 ++{
2487 ++ int l1tf, ret;
2488 ++
2489 ++ if (!boot_cpu_has(X86_BUG_L1TF))
2490 ++ return 0;
2491 ++
2492 ++ l1tf = vmentry_l1d_flush_parse(s);
2493 ++ if (l1tf < 0)
2494 ++ return l1tf;
2495 ++
2496 ++ /*
2497 ++ * Has vmx_init() run already? If not then this is the pre init
2498 ++ * parameter parsing. In that case just store the value and let
2499 ++ * vmx_init() do the proper setup after enable_ept has been
2500 ++ * established.
2501 ++ */
2502 ++ if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_AUTO) {
2503 ++ vmentry_l1d_flush_param = l1tf;
2504 ++ return 0;
2505 ++ }
2506 ++
2507 ++ mutex_lock(&vmx_l1d_flush_mutex);
2508 ++ ret = vmx_setup_l1d_flush(l1tf);
2509 ++ mutex_unlock(&vmx_l1d_flush_mutex);
2510 ++ return ret;
2511 ++}
2512 ++
2513 ++static int vmentry_l1d_flush_get(char *s, const struct kernel_param *kp)
2514 ++{
2515 ++ return sprintf(s, "%s\n", vmentry_l1d_param[l1tf_vmx_mitigation].option);
2516 ++}
2517 ++
2518 ++static const struct kernel_param_ops vmentry_l1d_flush_ops = {
2519 ++ .set = vmentry_l1d_flush_set,
2520 ++ .get = vmentry_l1d_flush_get,
2521 ++};
2522 ++module_param_cb(vmentry_l1d_flush, &vmentry_l1d_flush_ops, NULL, 0644);
2523 ++
2524 + struct kvm_vmx {
2525 + struct kvm kvm;
2526 +
2527 +@@ -757,6 +901,11 @@ static inline int pi_test_sn(struct pi_desc *pi_desc)
2528 + (unsigned long *)&pi_desc->control);
2529 + }
2530 +
2531 ++struct vmx_msrs {
2532 ++ unsigned int nr;
2533 ++ struct vmx_msr_entry val[NR_AUTOLOAD_MSRS];
2534 ++};
2535 ++
2536 + struct vcpu_vmx {
2537 + struct kvm_vcpu vcpu;
2538 + unsigned long host_rsp;
2539 +@@ -790,9 +939,8 @@ struct vcpu_vmx {
2540 + struct loaded_vmcs *loaded_vmcs;
2541 + bool __launched; /* temporary, used in vmx_vcpu_run */
2542 + struct msr_autoload {
2543 +- unsigned nr;
2544 +- struct vmx_msr_entry guest[NR_AUTOLOAD_MSRS];
2545 +- struct vmx_msr_entry host[NR_AUTOLOAD_MSRS];
2546 ++ struct vmx_msrs guest;
2547 ++ struct vmx_msrs host;
2548 + } msr_autoload;
2549 + struct {
2550 + int loaded;
2551 +@@ -2377,9 +2525,20 @@ static void clear_atomic_switch_msr_special(struct vcpu_vmx *vmx,
2552 + vm_exit_controls_clearbit(vmx, exit);
2553 + }
2554 +
2555 ++static int find_msr(struct vmx_msrs *m, unsigned int msr)
2556 ++{
2557 ++ unsigned int i;
2558 ++
2559 ++ for (i = 0; i < m->nr; ++i) {
2560 ++ if (m->val[i].index == msr)
2561 ++ return i;
2562 ++ }
2563 ++ return -ENOENT;
2564 ++}
2565 ++
2566 + static void clear_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr)
2567 + {
2568 +- unsigned i;
2569 ++ int i;
2570 + struct msr_autoload *m = &vmx->msr_autoload;
2571 +
2572 + switch (msr) {
2573 +@@ -2400,18 +2559,21 @@ static void clear_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr)
2574 + }
2575 + break;
2576 + }
2577 ++ i = find_msr(&m->guest, msr);
2578 ++ if (i < 0)
2579 ++ goto skip_guest;
2580 ++ --m->guest.nr;
2581 ++ m->guest.val[i] = m->guest.val[m->guest.nr];
2582 ++ vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->guest.nr);
2583 +
2584 +- for (i = 0; i < m->nr; ++i)
2585 +- if (m->guest[i].index == msr)
2586 +- break;
2587 +-
2588 +- if (i == m->nr)
2589 ++skip_guest:
2590 ++ i = find_msr(&m->host, msr);
2591 ++ if (i < 0)
2592 + return;
2593 +- --m->nr;
2594 +- m->guest[i] = m->guest[m->nr];
2595 +- m->host[i] = m->host[m->nr];
2596 +- vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
2597 +- vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
2598 ++
2599 ++ --m->host.nr;
2600 ++ m->host.val[i] = m->host.val[m->host.nr];
2601 ++ vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->host.nr);
2602 + }
2603 +
2604 + static void add_atomic_switch_msr_special(struct vcpu_vmx *vmx,
2605 +@@ -2426,9 +2588,9 @@ static void add_atomic_switch_msr_special(struct vcpu_vmx *vmx,
2606 + }
2607 +
2608 + static void add_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr,
2609 +- u64 guest_val, u64 host_val)
2610 ++ u64 guest_val, u64 host_val, bool entry_only)
2611 + {
2612 +- unsigned i;
2613 ++ int i, j = 0;
2614 + struct msr_autoload *m = &vmx->msr_autoload;
2615 +
2616 + switch (msr) {
2617 +@@ -2463,24 +2625,31 @@ static void add_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr,
2618 + wrmsrl(MSR_IA32_PEBS_ENABLE, 0);
2619 + }
2620 +
2621 +- for (i = 0; i < m->nr; ++i)
2622 +- if (m->guest[i].index == msr)
2623 +- break;
2624 ++ i = find_msr(&m->guest, msr);
2625 ++ if (!entry_only)
2626 ++ j = find_msr(&m->host, msr);
2627 +
2628 +- if (i == NR_AUTOLOAD_MSRS) {
2629 ++ if (i == NR_AUTOLOAD_MSRS || j == NR_AUTOLOAD_MSRS) {
2630 + printk_once(KERN_WARNING "Not enough msr switch entries. "
2631 + "Can't add msr %x\n", msr);
2632 + return;
2633 +- } else if (i == m->nr) {
2634 +- ++m->nr;
2635 +- vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
2636 +- vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
2637 + }
2638 ++ if (i < 0) {
2639 ++ i = m->guest.nr++;
2640 ++ vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->guest.nr);
2641 ++ }
2642 ++ m->guest.val[i].index = msr;
2643 ++ m->guest.val[i].value = guest_val;
2644 ++
2645 ++ if (entry_only)
2646 ++ return;
2647 +
2648 +- m->guest[i].index = msr;
2649 +- m->guest[i].value = guest_val;
2650 +- m->host[i].index = msr;
2651 +- m->host[i].value = host_val;
2652 ++ if (j < 0) {
2653 ++ j = m->host.nr++;
2654 ++ vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->host.nr);
2655 ++ }
2656 ++ m->host.val[j].index = msr;
2657 ++ m->host.val[j].value = host_val;
2658 + }
2659 +
2660 + static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)
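
The reworked clear_atomic_switch_msr() above removes an entry by overwriting it with the last element of the now separately sized guest or host list, so the arrays stay dense without shifting anything. A stand-alone sketch of that swap-with-last idiom with a simplified entry type (the names here are illustrative only):

#include <stdio.h>

struct msr_entry { unsigned int index; unsigned long long value; };
struct msr_list  { unsigned int nr; struct msr_entry val[8]; };

static int find_msr(struct msr_list *m, unsigned int msr)
{
	unsigned int i;

	for (i = 0; i < m->nr; ++i)
		if (m->val[i].index == msr)
			return i;
	return -1;
}

static void clear_msr(struct msr_list *m, unsigned int msr)
{
	int i = find_msr(m, msr);

	if (i < 0)
		return;
	--m->nr;
	m->val[i] = m->val[m->nr];	/* move the last entry into the hole */
}

int main(void)
{
	struct msr_list m = { .nr = 3, .val = {
		{ 0x10, 1 }, { 0x1a0, 2 }, { 0x48, 3 } } };

	clear_msr(&m, 0x1a0);
	printf("nr=%u, slot 1 now holds %#x\n", m.nr, m.val[1].index);
	return 0;
}
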
2661 +@@ -2524,7 +2693,7 @@ static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)
2662 + guest_efer &= ~EFER_LME;
2663 + if (guest_efer != host_efer)
2664 + add_atomic_switch_msr(vmx, MSR_EFER,
2665 +- guest_efer, host_efer);
2666 ++ guest_efer, host_efer, false);
2667 + return false;
2668 + } else {
2669 + guest_efer &= ~ignore_bits;
2670 +@@ -3987,7 +4156,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
2671 + vcpu->arch.ia32_xss = data;
2672 + if (vcpu->arch.ia32_xss != host_xss)
2673 + add_atomic_switch_msr(vmx, MSR_IA32_XSS,
2674 +- vcpu->arch.ia32_xss, host_xss);
2675 ++ vcpu->arch.ia32_xss, host_xss, false);
2676 + else
2677 + clear_atomic_switch_msr(vmx, MSR_IA32_XSS);
2678 + break;
2679 +@@ -6274,9 +6443,9 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx)
2680 +
2681 + vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
2682 + vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
2683 +- vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
2684 ++ vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host.val));
2685 + vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
2686 +- vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest));
2687 ++ vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest.val));
2688 +
2689 + if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
2690 + vmcs_write64(GUEST_IA32_PAT, vmx->vcpu.arch.pat);
2691 +@@ -6296,8 +6465,7 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx)
2692 + ++vmx->nmsrs;
2693 + }
2694 +
2695 +- if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
2696 +- rdmsrl(MSR_IA32_ARCH_CAPABILITIES, vmx->arch_capabilities);
2697 ++ vmx->arch_capabilities = kvm_get_arch_capabilities();
2698 +
2699 + vm_exit_controls_init(vmx, vmcs_config.vmexit_ctrl);
2700 +
2701 +@@ -9548,6 +9716,79 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
2702 + }
2703 + }
2704 +
2705 ++/*
2706 ++ * Software based L1D cache flush which is used when microcode providing
2707 ++ * the cache control MSR is not loaded.
2708 ++ *
2709 ++ * The L1D cache is 32 KiB on Nehalem and later microarchitectures, but
2710 ++ * flushing it requires reading in 64 KiB because the replacement algorithm
2711 ++ * is not exactly LRU. This could be sized at runtime via topology
2712 ++ * information but as all relevant affected CPUs have 32KiB L1D cache size
2713 ++ * there is no point in doing so.
2714 ++ */
2715 ++#define L1D_CACHE_ORDER 4
2716 ++static void *vmx_l1d_flush_pages;
2717 ++
2718 ++static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
2719 ++{
2720 ++ int size = PAGE_SIZE << L1D_CACHE_ORDER;
2721 ++
2722 ++ /*
2723 ++ * This code is only executed when the flush mode is 'cond' or
2724 ++ * 'always'
2725 ++ */
2726 ++ if (static_branch_likely(&vmx_l1d_flush_cond)) {
2727 ++ bool flush_l1d;
2728 ++
2729 ++ /*
2730 ++ * Clear the per-vcpu flush bit, it gets set again
2731 ++ * either from vcpu_run() or from one of the unsafe
2732 ++ * VMEXIT handlers.
2733 ++ */
2734 ++ flush_l1d = vcpu->arch.l1tf_flush_l1d;
2735 ++ vcpu->arch.l1tf_flush_l1d = false;
2736 ++
2737 ++ /*
2738 ++ * Clear the per-cpu flush bit, it gets set again from
2739 ++ * the interrupt handlers.
2740 ++ */
2741 ++ flush_l1d |= kvm_get_cpu_l1tf_flush_l1d();
2742 ++ kvm_clear_cpu_l1tf_flush_l1d();
2743 ++
2744 ++ if (!flush_l1d)
2745 ++ return;
2746 ++ }
2747 ++
2748 ++ vcpu->stat.l1d_flush++;
2749 ++
2750 ++ if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
2751 ++ wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
2752 ++ return;
2753 ++ }
2754 ++
2755 ++ asm volatile(
2756 ++ /* First ensure the pages are in the TLB */
2757 ++ "xorl %%eax, %%eax\n"
2758 ++ ".Lpopulate_tlb:\n\t"
2759 ++ "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
2760 ++ "addl $4096, %%eax\n\t"
2761 ++ "cmpl %%eax, %[size]\n\t"
2762 ++ "jne .Lpopulate_tlb\n\t"
2763 ++ "xorl %%eax, %%eax\n\t"
2764 ++ "cpuid\n\t"
2765 ++ /* Now fill the cache */
2766 ++ "xorl %%eax, %%eax\n"
2767 ++ ".Lfill_cache:\n"
2768 ++ "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
2769 ++ "addl $64, %%eax\n\t"
2770 ++ "cmpl %%eax, %[size]\n\t"
2771 ++ "jne .Lfill_cache\n\t"
2772 ++ "lfence\n"
2773 ++ :: [flush_pages] "r" (vmx_l1d_flush_pages),
2774 ++ [size] "r" (size)
2775 ++ : "eax", "ebx", "ecx", "edx");
2776 ++}
2777 ++
2778 + static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
2779 + {
2780 + struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
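
The inline asm in vmx_l1d_flush() above makes two passes over the 64 KiB flush buffer: one access per 4 KiB page to populate the TLB, then one access per 64-byte line to displace the L1D contents. A plain C rendering of that access pattern (purely illustrative; user space cannot guarantee an actual L1D flush, and the real code additionally serializes with CPUID and LFENCE):

#include <stdlib.h>

#define L1D_CACHE_ORDER	4
#define PAGE_SIZE	4096
#define FLUSH_SIZE	(PAGE_SIZE << L1D_CACHE_ORDER)	/* 64 KiB */

static void touch_like_vmx_l1d_flush(volatile unsigned char *buf)
{
	int offset;

	/* Pass 1: one access per page, populating the TLB. */
	for (offset = 0; offset < FLUSH_SIZE; offset += PAGE_SIZE)
		(void)buf[offset];

	/* Pass 2: one access per cache line, filling the L1D. */
	for (offset = 0; offset < FLUSH_SIZE; offset += 64)
		(void)buf[offset];
}

int main(void)
{
	unsigned char *buf = malloc(FLUSH_SIZE);

	if (buf)
		touch_like_vmx_l1d_flush(buf);
	free(buf);
	return 0;
}
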
2781 +@@ -9949,7 +10190,7 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
2782 + clear_atomic_switch_msr(vmx, msrs[i].msr);
2783 + else
2784 + add_atomic_switch_msr(vmx, msrs[i].msr, msrs[i].guest,
2785 +- msrs[i].host);
2786 ++ msrs[i].host, false);
2787 + }
2788 +
2789 + static void vmx_arm_hv_timer(struct kvm_vcpu *vcpu)
2790 +@@ -10044,6 +10285,9 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
2791 + evmcs_rsp = static_branch_unlikely(&enable_evmcs) ?
2792 + (unsigned long)&current_evmcs->host_rsp : 0;
2793 +
2794 ++ if (static_branch_unlikely(&vmx_l1d_should_flush))
2795 ++ vmx_l1d_flush(vcpu);
2796 ++
2797 + asm(
2798 + /* Store host registers */
2799 + "push %%" _ASM_DX "; push %%" _ASM_BP ";"
2800 +@@ -10403,10 +10647,37 @@ free_vcpu:
2801 + return ERR_PTR(err);
2802 + }
2803 +
2804 ++#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html for details.\n"
2805 ++#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html for details.\n"
2806 ++
2807 + static int vmx_vm_init(struct kvm *kvm)
2808 + {
2809 + if (!ple_gap)
2810 + kvm->arch.pause_in_guest = true;
2811 ++
2812 ++ if (boot_cpu_has(X86_BUG_L1TF) && enable_ept) {
2813 ++ switch (l1tf_mitigation) {
2814 ++ case L1TF_MITIGATION_OFF:
2815 ++ case L1TF_MITIGATION_FLUSH_NOWARN:
2816 ++ /* 'I explicitly don't care' is set */
2817 ++ break;
2818 ++ case L1TF_MITIGATION_FLUSH:
2819 ++ case L1TF_MITIGATION_FLUSH_NOSMT:
2820 ++ case L1TF_MITIGATION_FULL:
2821 ++ /*
2822 ++ * Warn upon starting the first VM in a potentially
2823 ++ * insecure environment.
2824 ++ */
2825 ++ if (cpu_smt_control == CPU_SMT_ENABLED)
2826 ++ pr_warn_once(L1TF_MSG_SMT);
2827 ++ if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_NEVER)
2828 ++ pr_warn_once(L1TF_MSG_L1D);
2829 ++ break;
2830 ++ case L1TF_MITIGATION_FULL_FORCE:
2831 ++ /* Flush is enforced */
2832 ++ break;
2833 ++ }
2834 ++ }
2835 + return 0;
2836 + }
2837 +
2838 +@@ -11260,10 +11531,10 @@ static void prepare_vmcs02_full(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
2839 + * Set the MSR load/store lists to match L0's settings.
2840 + */
2841 + vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
2842 +- vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
2843 +- vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
2844 +- vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
2845 +- vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest));
2846 ++ vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr);
2847 ++ vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host.val));
2848 ++ vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr);
2849 ++ vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest.val));
2850 +
2851 + set_cr4_guest_host_mask(vmx);
2852 +
2853 +@@ -11899,6 +12170,9 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
2854 + return ret;
2855 + }
2856 +
2857 ++ /* Hide L1D cache contents from the nested guest. */
2858 ++ vmx->vcpu.arch.l1tf_flush_l1d = true;
2859 ++
2860 + /*
2861 + * If we're entering a halted L2 vcpu and the L2 vcpu won't be woken
2862 + * by event injection, halt vcpu.
2863 +@@ -12419,8 +12693,8 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
2864 + vmx_segment_cache_clear(vmx);
2865 +
2866 + /* Update any VMCS fields that might have changed while L2 ran */
2867 +- vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
2868 +- vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
2869 ++ vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr);
2870 ++ vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr);
2871 + vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset);
2872 + if (vmx->hv_deadline_tsc == -1)
2873 + vmcs_clear_bits(PIN_BASED_VM_EXEC_CONTROL,
2874 +@@ -13137,6 +13411,51 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
2875 + .enable_smi_window = enable_smi_window,
2876 + };
2877 +
2878 ++static void vmx_cleanup_l1d_flush(void)
2879 ++{
2880 ++ if (vmx_l1d_flush_pages) {
2881 ++ free_pages((unsigned long)vmx_l1d_flush_pages, L1D_CACHE_ORDER);
2882 ++ vmx_l1d_flush_pages = NULL;
2883 ++ }
2884 ++ /* Restore state so sysfs ignores VMX */
2885 ++ l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
2886 ++}
2887 ++
2888 ++static void vmx_exit(void)
2889 ++{
2890 ++#ifdef CONFIG_KEXEC_CORE
2891 ++ RCU_INIT_POINTER(crash_vmclear_loaded_vmcss, NULL);
2892 ++ synchronize_rcu();
2893 ++#endif
2894 ++
2895 ++ kvm_exit();
2896 ++
2897 ++#if IS_ENABLED(CONFIG_HYPERV)
2898 ++ if (static_branch_unlikely(&enable_evmcs)) {
2899 ++ int cpu;
2900 ++ struct hv_vp_assist_page *vp_ap;
2901 ++ /*
2902 ++ * Reset everything to support using non-enlightened VMCS
2903 ++ * access later (e.g. when we reload the module with
2904 ++ * enlightened_vmcs=0)
2905 ++ */
2906 ++ for_each_online_cpu(cpu) {
2907 ++ vp_ap = hv_get_vp_assist_page(cpu);
2908 ++
2909 ++ if (!vp_ap)
2910 ++ continue;
2911 ++
2912 ++ vp_ap->current_nested_vmcs = 0;
2913 ++ vp_ap->enlighten_vmentry = 0;
2914 ++ }
2915 ++
2916 ++ static_branch_disable(&enable_evmcs);
2917 ++ }
2918 ++#endif
2919 ++ vmx_cleanup_l1d_flush();
2920 ++}
2921 ++module_exit(vmx_exit);
2922 ++
2923 + static int __init vmx_init(void)
2924 + {
2925 + int r;
2926 +@@ -13171,10 +13490,25 @@ static int __init vmx_init(void)
2927 + #endif
2928 +
2929 + r = kvm_init(&vmx_x86_ops, sizeof(struct vcpu_vmx),
2930 +- __alignof__(struct vcpu_vmx), THIS_MODULE);
2931 ++ __alignof__(struct vcpu_vmx), THIS_MODULE);
2932 + if (r)
2933 + return r;
2934 +
2935 ++ /*
2936 ++ * Must be called after kvm_init() so enable_ept is properly set
2937 ++ * up. Hand in the mitigation parameter value that was stored by
2938 ++ * the pre module init parser. If no parameter was given, it will
2939 ++ * contain 'auto' which will be turned into the default 'cond'
2940 ++ * mitigation mode.
2941 ++ */
2942 ++ if (boot_cpu_has(X86_BUG_L1TF)) {
2943 ++ r = vmx_setup_l1d_flush(vmentry_l1d_flush_param);
2944 ++ if (r) {
2945 ++ vmx_exit();
2946 ++ return r;
2947 ++ }
2948 ++ }
2949 ++
2950 + #ifdef CONFIG_KEXEC_CORE
2951 + rcu_assign_pointer(crash_vmclear_loaded_vmcss,
2952 + crash_vmclear_local_loaded_vmcss);
2953 +@@ -13183,39 +13517,4 @@ static int __init vmx_init(void)
2954 +
2955 + return 0;
2956 + }
2957 +-
2958 +-static void __exit vmx_exit(void)
2959 +-{
2960 +-#ifdef CONFIG_KEXEC_CORE
2961 +- RCU_INIT_POINTER(crash_vmclear_loaded_vmcss, NULL);
2962 +- synchronize_rcu();
2963 +-#endif
2964 +-
2965 +- kvm_exit();
2966 +-
2967 +-#if IS_ENABLED(CONFIG_HYPERV)
2968 +- if (static_branch_unlikely(&enable_evmcs)) {
2969 +- int cpu;
2970 +- struct hv_vp_assist_page *vp_ap;
2971 +- /*
2972 +- * Reset everything to support using non-enlightened VMCS
2973 +- * access later (e.g. when we reload the module with
2974 +- * enlightened_vmcs=0)
2975 +- */
2976 +- for_each_online_cpu(cpu) {
2977 +- vp_ap = hv_get_vp_assist_page(cpu);
2978 +-
2979 +- if (!vp_ap)
2980 +- continue;
2981 +-
2982 +- vp_ap->current_nested_vmcs = 0;
2983 +- vp_ap->enlighten_vmentry = 0;
2984 +- }
2985 +-
2986 +- static_branch_disable(&enable_evmcs);
2987 +- }
2988 +-#endif
2989 +-}
2990 +-
2991 +-module_init(vmx_init)
2992 +-module_exit(vmx_exit)
2993 ++module_init(vmx_init);
2994 +diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
2995 +index 2b812b3c5088..a5caa5e5480c 100644
2996 +--- a/arch/x86/kvm/x86.c
2997 ++++ b/arch/x86/kvm/x86.c
2998 +@@ -195,6 +195,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
2999 + { "irq_injections", VCPU_STAT(irq_injections) },
3000 + { "nmi_injections", VCPU_STAT(nmi_injections) },
3001 + { "req_event", VCPU_STAT(req_event) },
3002 ++ { "l1d_flush", VCPU_STAT(l1d_flush) },
3003 + { "mmu_shadow_zapped", VM_STAT(mmu_shadow_zapped) },
3004 + { "mmu_pte_write", VM_STAT(mmu_pte_write) },
3005 + { "mmu_pte_updated", VM_STAT(mmu_pte_updated) },
3006 +@@ -1102,11 +1103,35 @@ static u32 msr_based_features[] = {
3007 +
3008 + static unsigned int num_msr_based_features;
3009 +
3010 ++u64 kvm_get_arch_capabilities(void)
3011 ++{
3012 ++ u64 data;
3013 ++
3014 ++ rdmsrl_safe(MSR_IA32_ARCH_CAPABILITIES, &data);
3015 ++
3016 ++ /*
3017 ++ * If we're doing cache flushes (either "always" or "cond")
3018 ++ * we will do one whenever the guest does a vmlaunch/vmresume.
3019 ++ * If an outer hypervisor is doing the cache flush for us
3020 ++ * (VMENTER_L1D_FLUSH_NESTED_VM), we can safely pass that
3021 ++ * capability to the guest too, and if EPT is disabled we're not
3022 ++ * vulnerable. Overall, only VMENTER_L1D_FLUSH_NEVER will
3023 ++ * require a nested hypervisor to do a flush of its own.
3024 ++ */
3025 ++ if (l1tf_vmx_mitigation != VMENTER_L1D_FLUSH_NEVER)
3026 ++ data |= ARCH_CAP_SKIP_VMENTRY_L1DFLUSH;
3027 ++
3028 ++ return data;
3029 ++}
3030 ++EXPORT_SYMBOL_GPL(kvm_get_arch_capabilities);
3031 ++
3032 + static int kvm_get_msr_feature(struct kvm_msr_entry *msr)
3033 + {
3034 + switch (msr->index) {
3035 +- case MSR_IA32_UCODE_REV:
3036 + case MSR_IA32_ARCH_CAPABILITIES:
3037 ++ msr->data = kvm_get_arch_capabilities();
3038 ++ break;
3039 ++ case MSR_IA32_UCODE_REV:
3040 + rdmsrl_safe(msr->index, &msr->data);
3041 + break;
3042 + default:
3043 +@@ -4876,6 +4901,9 @@ static int emulator_write_std(struct x86_emulate_ctxt *ctxt, gva_t addr, void *v
3044 + int kvm_write_guest_virt_system(struct kvm_vcpu *vcpu, gva_t addr, void *val,
3045 + unsigned int bytes, struct x86_exception *exception)
3046 + {
3047 ++ /* kvm_write_guest_virt_system can pull in tons of pages. */
3048 ++ vcpu->arch.l1tf_flush_l1d = true;
3049 ++
3050 + return kvm_write_guest_virt_helper(addr, val, bytes, vcpu,
3051 + PFERR_WRITE_MASK, exception);
3052 + }
3053 +@@ -6052,6 +6080,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
3054 + bool writeback = true;
3055 + bool write_fault_to_spt = vcpu->arch.write_fault_to_shadow_pgtable;
3056 +
3057 ++ vcpu->arch.l1tf_flush_l1d = true;
3058 ++
3059 + /*
3060 + * Clear write_fault_to_shadow_pgtable here to ensure it is
3061 + * never reused.
3062 +@@ -7581,6 +7611,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
3063 + struct kvm *kvm = vcpu->kvm;
3064 +
3065 + vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
3066 ++ vcpu->arch.l1tf_flush_l1d = true;
3067 +
3068 + for (;;) {
3069 + if (kvm_vcpu_running(vcpu)) {
3070 +@@ -8700,6 +8731,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
3071 +
3072 + void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
3073 + {
3074 ++ vcpu->arch.l1tf_flush_l1d = true;
3075 + kvm_x86_ops->sched_in(vcpu, cpu);
3076 + }
3077 +
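A quick way to see the effect of kvm_get_arch_capabilities() from inside a guest is to read IA32_ARCH_CAPABILITIES and look for the skip-VMENTRY-L1D-flush hint. The sketch below is illustrative only and not part of the patch: MSR index 0x10a is standard, but the bit position (3) used for ARCH_CAP_SKIP_VMENTRY_L1DFLUSH is an assumption here, and it needs the msr driver plus root.

/* Illustrative sketch (not from the patch): read IA32_ARCH_CAPABILITIES
 * (MSR 0x10a) inside a guest and test the skip-VMENTRY-L1D-flush hint.
 * Bit 3 is assumed for ARCH_CAP_SKIP_VMENTRY_L1DFLUSH; requires the msr
 * driver (/dev/cpu/0/msr) and root. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	uint64_t caps = 0;
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	if (fd < 0 || pread(fd, &caps, sizeof(caps), 0x10a) != (ssize_t)sizeof(caps)) {
		perror("IA32_ARCH_CAPABILITIES");
		return 1;
	}
	printf("arch_capabilities=0x%llx skip_vmentry_l1d_flush=%u\n",
	       (unsigned long long)caps, (unsigned int)((caps >> 3) & 1));
	close(fd);
	return 0;
}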
3078 +diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
3079 +index cee58a972cb2..83241eb71cd4 100644
3080 +--- a/arch/x86/mm/init.c
3081 ++++ b/arch/x86/mm/init.c
3082 +@@ -4,6 +4,8 @@
3083 + #include <linux/swap.h>
3084 + #include <linux/memblock.h>
3085 + #include <linux/bootmem.h> /* for max_low_pfn */
3086 ++#include <linux/swapfile.h>
3087 ++#include <linux/swapops.h>
3088 +
3089 + #include <asm/set_memory.h>
3090 + #include <asm/e820/api.h>
3091 +@@ -880,3 +882,26 @@ void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache)
3092 + __cachemode2pte_tbl[cache] = __cm_idx2pte(entry);
3093 + __pte2cachemode_tbl[entry] = cache;
3094 + }
3095 ++
3096 ++#ifdef CONFIG_SWAP
3097 ++unsigned long max_swapfile_size(void)
3098 ++{
3099 ++ unsigned long pages;
3100 ++
3101 ++ pages = generic_max_swapfile_size();
3102 ++
3103 ++ if (boot_cpu_has_bug(X86_BUG_L1TF)) {
3104 ++ /* Limit the swap file size to MAX_PA/2 for L1TF workaround */
3105 ++ unsigned long l1tf_limit = l1tf_pfn_limit() + 1;
3106 ++ /*
3107 ++ * We encode swap offsets also with 3 bits below those for pfn
3108 ++ * which makes the usable limit higher.
3109 ++ */
3110 ++#if CONFIG_PGTABLE_LEVELS > 2
3111 ++ l1tf_limit <<= PAGE_SHIFT - SWP_OFFSET_FIRST_BIT;
3112 ++#endif
3113 ++ pages = min_t(unsigned long, l1tf_limit, pages);
3114 ++ }
3115 ++ return pages;
3116 ++}
3117 ++#endif
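To put rough numbers on the clamp above (assumed figures, for illustration only): with 46 bits of physical address and 4 KiB pages, MAX_PA/2 is 2^45 bytes, i.e. about 2^33 pages; the 3 extra swap-offset bits mentioned in the comment shift that up to roughly 2^36 pages, so a single swap device would be capped at about 256 TiB instead of the generic architectural limit.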
3118 +diff --git a/arch/x86/mm/kmmio.c b/arch/x86/mm/kmmio.c
3119 +index 7c8686709636..79eb55ce69a9 100644
3120 +--- a/arch/x86/mm/kmmio.c
3121 ++++ b/arch/x86/mm/kmmio.c
3122 +@@ -126,24 +126,29 @@ static struct kmmio_fault_page *get_kmmio_fault_page(unsigned long addr)
3123 +
3124 + static void clear_pmd_presence(pmd_t *pmd, bool clear, pmdval_t *old)
3125 + {
3126 ++ pmd_t new_pmd;
3127 + pmdval_t v = pmd_val(*pmd);
3128 + if (clear) {
3129 +- *old = v & _PAGE_PRESENT;
3130 +- v &= ~_PAGE_PRESENT;
3131 +- } else /* presume this has been called with clear==true previously */
3132 +- v |= *old;
3133 +- set_pmd(pmd, __pmd(v));
3134 ++ *old = v;
3135 ++ new_pmd = pmd_mknotpresent(*pmd);
3136 ++ } else {
3137 ++ /* Presume this has been called with clear==true previously */
3138 ++ new_pmd = __pmd(*old);
3139 ++ }
3140 ++ set_pmd(pmd, new_pmd);
3141 + }
3142 +
3143 + static void clear_pte_presence(pte_t *pte, bool clear, pteval_t *old)
3144 + {
3145 + pteval_t v = pte_val(*pte);
3146 + if (clear) {
3147 +- *old = v & _PAGE_PRESENT;
3148 +- v &= ~_PAGE_PRESENT;
3149 +- } else /* presume this has been called with clear==true previously */
3150 +- v |= *old;
3151 +- set_pte_atomic(pte, __pte(v));
3152 ++ *old = v;
3153 ++ /* Nothing should care about address */
3154 ++ pte_clear(&init_mm, 0, pte);
3155 ++ } else {
3156 ++ /* Presume this has been called with clear==true previously */
3157 ++ set_pte_atomic(pte, __pte(*old));
3158 ++ }
3159 + }
3160 +
3161 + static int clear_page_presence(struct kmmio_fault_page *f, bool clear)
3162 +diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
3163 +index 48c591251600..f40ab8185d94 100644
3164 +--- a/arch/x86/mm/mmap.c
3165 ++++ b/arch/x86/mm/mmap.c
3166 +@@ -240,3 +240,24 @@ int valid_mmap_phys_addr_range(unsigned long pfn, size_t count)
3167 +
3168 + return phys_addr_valid(addr + count - 1);
3169 + }
3170 ++
3171 ++/*
3172 ++ * Only allow root to set high MMIO mappings to PROT_NONE.
3173 ++ * This prevents an unprivileged user from setting them to PROT_NONE and
3174 ++ * inverting them, then pointing them at valid memory for L1TF speculation.
3175 ++ *
3176 ++ * Note: locked-down kernels may want to disable the root override.
3177 ++ */
3178 ++bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
3179 ++{
3180 ++ if (!boot_cpu_has_bug(X86_BUG_L1TF))
3181 ++ return true;
3182 ++ if (!__pte_needs_invert(pgprot_val(prot)))
3183 ++ return true;
3184 ++ /* If it's real memory always allow */
3185 ++ if (pfn_valid(pfn))
3186 ++ return true;
3187 ++ if (pfn > l1tf_pfn_limit() && !capable(CAP_SYS_ADMIN))
3188 ++ return false;
3189 ++ return true;
3190 ++}
3191 +diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
3192 +index 3bded76e8d5c..7bb6f65c79de 100644
3193 +--- a/arch/x86/mm/pageattr.c
3194 ++++ b/arch/x86/mm/pageattr.c
3195 +@@ -1014,8 +1014,8 @@ static long populate_pmd(struct cpa_data *cpa,
3196 +
3197 + pmd = pmd_offset(pud, start);
3198 +
3199 +- set_pmd(pmd, __pmd(cpa->pfn << PAGE_SHIFT | _PAGE_PSE |
3200 +- massage_pgprot(pmd_pgprot)));
3201 ++ set_pmd(pmd, pmd_mkhuge(pfn_pmd(cpa->pfn,
3202 ++ canon_pgprot(pmd_pgprot))));
3203 +
3204 + start += PMD_SIZE;
3205 + cpa->pfn += PMD_SIZE >> PAGE_SHIFT;
3206 +@@ -1087,8 +1087,8 @@ static int populate_pud(struct cpa_data *cpa, unsigned long start, p4d_t *p4d,
3207 + * Map everything starting from the Gb boundary, possibly with 1G pages
3208 + */
3209 + while (boot_cpu_has(X86_FEATURE_GBPAGES) && end - start >= PUD_SIZE) {
3210 +- set_pud(pud, __pud(cpa->pfn << PAGE_SHIFT | _PAGE_PSE |
3211 +- massage_pgprot(pud_pgprot)));
3212 ++ set_pud(pud, pud_mkhuge(pfn_pud(cpa->pfn,
3213 ++ canon_pgprot(pud_pgprot))));
3214 +
3215 + start += PUD_SIZE;
3216 + cpa->pfn += PUD_SIZE >> PAGE_SHIFT;
3217 +diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
3218 +index 4d418e705878..fb752d9a3ce9 100644
3219 +--- a/arch/x86/mm/pti.c
3220 ++++ b/arch/x86/mm/pti.c
3221 +@@ -45,6 +45,7 @@
3222 + #include <asm/pgalloc.h>
3223 + #include <asm/tlbflush.h>
3224 + #include <asm/desc.h>
3225 ++#include <asm/sections.h>
3226 +
3227 + #undef pr_fmt
3228 + #define pr_fmt(fmt) "Kernel/User page tables isolation: " fmt
3229 +diff --git a/arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c b/arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c
3230 +index 4f5fa65a1011..2acd6be13375 100644
3231 +--- a/arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c
3232 ++++ b/arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c
3233 +@@ -18,6 +18,7 @@
3234 + #include <asm/intel-mid.h>
3235 + #include <asm/intel_scu_ipc.h>
3236 + #include <asm/io_apic.h>
3237 ++#include <asm/hw_irq.h>
3238 +
3239 + #define TANGIER_EXT_TIMER0_MSI 12
3240 +
3241 +diff --git a/arch/x86/platform/uv/tlb_uv.c b/arch/x86/platform/uv/tlb_uv.c
3242 +index ca446da48fd2..3866b96a7ee7 100644
3243 +--- a/arch/x86/platform/uv/tlb_uv.c
3244 ++++ b/arch/x86/platform/uv/tlb_uv.c
3245 +@@ -1285,6 +1285,7 @@ void uv_bau_message_interrupt(struct pt_regs *regs)
3246 + struct msg_desc msgdesc;
3247 +
3248 + ack_APIC_irq();
3249 ++ kvm_set_cpu_l1tf_flush_l1d();
3250 + time_start = get_cycles();
3251 +
3252 + bcp = &per_cpu(bau_control, smp_processor_id());
3253 +diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
3254 +index 3b5318505c69..2eeddd814653 100644
3255 +--- a/arch/x86/xen/enlighten.c
3256 ++++ b/arch/x86/xen/enlighten.c
3257 +@@ -3,6 +3,7 @@
3258 + #endif
3259 + #include <linux/cpu.h>
3260 + #include <linux/kexec.h>
3261 ++#include <linux/slab.h>
3262 +
3263 + #include <xen/features.h>
3264 + #include <xen/page.h>
3265 +diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
3266 +index 30cc9c877ebb..eb9443d5bae1 100644
3267 +--- a/drivers/base/cpu.c
3268 ++++ b/drivers/base/cpu.c
3269 +@@ -540,16 +540,24 @@ ssize_t __weak cpu_show_spec_store_bypass(struct device *dev,
3270 + return sprintf(buf, "Not affected\n");
3271 + }
3272 +
3273 ++ssize_t __weak cpu_show_l1tf(struct device *dev,
3274 ++ struct device_attribute *attr, char *buf)
3275 ++{
3276 ++ return sprintf(buf, "Not affected\n");
3277 ++}
3278 ++
3279 + static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
3280 + static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
3281 + static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);
3282 + static DEVICE_ATTR(spec_store_bypass, 0444, cpu_show_spec_store_bypass, NULL);
3283 ++static DEVICE_ATTR(l1tf, 0444, cpu_show_l1tf, NULL);
3284 +
3285 + static struct attribute *cpu_root_vulnerabilities_attrs[] = {
3286 + &dev_attr_meltdown.attr,
3287 + &dev_attr_spectre_v1.attr,
3288 + &dev_attr_spectre_v2.attr,
3289 + &dev_attr_spec_store_bypass.attr,
3290 ++ &dev_attr_l1tf.attr,
3291 + NULL
3292 + };
3293 +
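User space can consume the new l1tf entry exactly like the existing vulnerability files. A minimal sketch, assuming the conventional /sys mount point (editorial illustration, not part of the patch):

/* Minimal sketch (not from the patch): print the l1tf vulnerability string
 * exposed by the cpu_show_l1tf() attribute added above. */
#include <stdio.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/sys/devices/system/cpu/vulnerabilities/l1tf", "r");

	if (!f) {
		perror("l1tf");
		return 1;
	}
	if (fgets(line, sizeof(line), f))
		fputs(line, stdout);    /* e.g. "Not affected" or "Mitigation: ..." */
	fclose(f);
	return 0;
}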
3294 +diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
3295 +index dc87797db500..b50b74053664 100644
3296 +--- a/drivers/gpu/drm/i915/i915_pmu.c
3297 ++++ b/drivers/gpu/drm/i915/i915_pmu.c
3298 +@@ -4,6 +4,7 @@
3299 + * Copyright © 2017-2018 Intel Corporation
3300 + */
3301 +
3302 ++#include <linux/irq.h>
3303 + #include "i915_pmu.h"
3304 + #include "intel_ringbuffer.h"
3305 + #include "i915_drv.h"
3306 +diff --git a/drivers/gpu/drm/i915/intel_lpe_audio.c b/drivers/gpu/drm/i915/intel_lpe_audio.c
3307 +index 6269750e2b54..b4941101f21a 100644
3308 +--- a/drivers/gpu/drm/i915/intel_lpe_audio.c
3309 ++++ b/drivers/gpu/drm/i915/intel_lpe_audio.c
3310 +@@ -62,6 +62,7 @@
3311 +
3312 + #include <linux/acpi.h>
3313 + #include <linux/device.h>
3314 ++#include <linux/irq.h>
3315 + #include <linux/pci.h>
3316 + #include <linux/pm_runtime.h>
3317 +
3318 +diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
3319 +index f6325f1a89e8..d4d4a55f09f8 100644
3320 +--- a/drivers/pci/controller/pci-hyperv.c
3321 ++++ b/drivers/pci/controller/pci-hyperv.c
3322 +@@ -45,6 +45,7 @@
3323 + #include <linux/irqdomain.h>
3324 + #include <asm/irqdomain.h>
3325 + #include <asm/apic.h>
3326 ++#include <linux/irq.h>
3327 + #include <linux/msi.h>
3328 + #include <linux/hyperv.h>
3329 + #include <linux/refcount.h>
3330 +diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
3331 +index f59639afaa39..26ca0276b503 100644
3332 +--- a/include/asm-generic/pgtable.h
3333 ++++ b/include/asm-generic/pgtable.h
3334 +@@ -1083,6 +1083,18 @@ int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
3335 + static inline void init_espfix_bsp(void) { }
3336 + #endif
3337 +
3338 ++#ifndef __HAVE_ARCH_PFN_MODIFY_ALLOWED
3339 ++static inline bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
3340 ++{
3341 ++ return true;
3342 ++}
3343 ++
3344 ++static inline bool arch_has_pfn_modify_check(void)
3345 ++{
3346 ++ return false;
3347 ++}
3348 ++#endif /* !_HAVE_ARCH_PFN_MODIFY_ALLOWED */
3349 ++
3350 + #endif /* !__ASSEMBLY__ */
3351 +
3352 + #ifndef io_remap_pfn_range
3353 +diff --git a/include/linux/cpu.h b/include/linux/cpu.h
3354 +index 3233fbe23594..45789a892c41 100644
3355 +--- a/include/linux/cpu.h
3356 ++++ b/include/linux/cpu.h
3357 +@@ -55,6 +55,8 @@ extern ssize_t cpu_show_spectre_v2(struct device *dev,
3358 + struct device_attribute *attr, char *buf);
3359 + extern ssize_t cpu_show_spec_store_bypass(struct device *dev,
3360 + struct device_attribute *attr, char *buf);
3361 ++extern ssize_t cpu_show_l1tf(struct device *dev,
3362 ++ struct device_attribute *attr, char *buf);
3363 +
3364 + extern __printf(4, 5)
3365 + struct device *cpu_device_create(struct device *parent, void *drvdata,
3366 +@@ -166,4 +168,23 @@ void cpuhp_report_idle_dead(void);
3367 + static inline void cpuhp_report_idle_dead(void) { }
3368 + #endif /* #ifdef CONFIG_HOTPLUG_CPU */
3369 +
3370 ++enum cpuhp_smt_control {
3371 ++ CPU_SMT_ENABLED,
3372 ++ CPU_SMT_DISABLED,
3373 ++ CPU_SMT_FORCE_DISABLED,
3374 ++ CPU_SMT_NOT_SUPPORTED,
3375 ++};
3376 ++
3377 ++#if defined(CONFIG_SMP) && defined(CONFIG_HOTPLUG_SMT)
3378 ++extern enum cpuhp_smt_control cpu_smt_control;
3379 ++extern void cpu_smt_disable(bool force);
3380 ++extern void cpu_smt_check_topology_early(void);
3381 ++extern void cpu_smt_check_topology(void);
3382 ++#else
3383 ++# define cpu_smt_control (CPU_SMT_ENABLED)
3384 ++static inline void cpu_smt_disable(bool force) { }
3385 ++static inline void cpu_smt_check_topology_early(void) { }
3386 ++static inline void cpu_smt_check_topology(void) { }
3387 ++#endif
3388 ++
3389 + #endif /* _LINUX_CPU_H_ */
3390 +diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
3391 +index 06bd7b096167..e06febf62978 100644
3392 +--- a/include/linux/swapfile.h
3393 ++++ b/include/linux/swapfile.h
3394 +@@ -10,5 +10,7 @@ extern spinlock_t swap_lock;
3395 + extern struct plist_head swap_active_head;
3396 + extern struct swap_info_struct *swap_info[];
3397 + extern int try_to_unuse(unsigned int, bool, unsigned long);
3398 ++extern unsigned long generic_max_swapfile_size(void);
3399 ++extern unsigned long max_swapfile_size(void);
3400 +
3401 + #endif /* _LINUX_SWAPFILE_H */
3402 +diff --git a/kernel/cpu.c b/kernel/cpu.c
3403 +index 2f8f338e77cf..f80afc674f02 100644
3404 +--- a/kernel/cpu.c
3405 ++++ b/kernel/cpu.c
3406 +@@ -60,6 +60,7 @@ struct cpuhp_cpu_state {
3407 + bool rollback;
3408 + bool single;
3409 + bool bringup;
3410 ++ bool booted_once;
3411 + struct hlist_node *node;
3412 + struct hlist_node *last;
3413 + enum cpuhp_state cb_state;
3414 +@@ -342,6 +343,85 @@ void cpu_hotplug_enable(void)
3415 + EXPORT_SYMBOL_GPL(cpu_hotplug_enable);
3416 + #endif /* CONFIG_HOTPLUG_CPU */
3417 +
3418 ++#ifdef CONFIG_HOTPLUG_SMT
3419 ++enum cpuhp_smt_control cpu_smt_control __read_mostly = CPU_SMT_ENABLED;
3420 ++EXPORT_SYMBOL_GPL(cpu_smt_control);
3421 ++
3422 ++static bool cpu_smt_available __read_mostly;
3423 ++
3424 ++void __init cpu_smt_disable(bool force)
3425 ++{
3426 ++ if (cpu_smt_control == CPU_SMT_FORCE_DISABLED ||
3427 ++ cpu_smt_control == CPU_SMT_NOT_SUPPORTED)
3428 ++ return;
3429 ++
3430 ++ if (force) {
3431 ++ pr_info("SMT: Force disabled\n");
3432 ++ cpu_smt_control = CPU_SMT_FORCE_DISABLED;
3433 ++ } else {
3434 ++ cpu_smt_control = CPU_SMT_DISABLED;
3435 ++ }
3436 ++}
3437 ++
3438 ++/*
3439 ++ * The decision whether SMT is supported can only be done after the full
3440 ++ * CPU identification. Called from architecture code before non boot CPUs
3441 ++ * are brought up.
3442 ++ */
3443 ++void __init cpu_smt_check_topology_early(void)
3444 ++{
3445 ++ if (!topology_smt_supported())
3446 ++ cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
3447 ++}
3448 ++
3449 ++/*
3450 ++ * If SMT was disabled by BIOS, detect it here, after the CPUs have been
3451 ++ * brought online. This ensures the smt/l1tf sysfs entries are consistent
3452 ++ * with reality. cpu_smt_available is set to true during the bringup of non
3453 ++ * boot CPUs when a SMT sibling is detected. Note, this may overwrite
3454 ++ * cpu_smt_control's previous setting.
3455 ++ */
3456 ++void __init cpu_smt_check_topology(void)
3457 ++{
3458 ++ if (!cpu_smt_available)
3459 ++ cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
3460 ++}
3461 ++
3462 ++static int __init smt_cmdline_disable(char *str)
3463 ++{
3464 ++ cpu_smt_disable(str && !strcmp(str, "force"));
3465 ++ return 0;
3466 ++}
3467 ++early_param("nosmt", smt_cmdline_disable);
3468 ++
3469 ++static inline bool cpu_smt_allowed(unsigned int cpu)
3470 ++{
3471 ++ if (topology_is_primary_thread(cpu))
3472 ++ return true;
3473 ++
3474 ++ /*
3475 ++ * If the CPU is not a 'primary' thread and the booted_once bit is
3476 ++ * set then the processor has SMT support. Store this information
3477 ++ * for the late check of SMT support in cpu_smt_check_topology().
3478 ++ */
3479 ++ if (per_cpu(cpuhp_state, cpu).booted_once)
3480 ++ cpu_smt_available = true;
3481 ++
3482 ++ if (cpu_smt_control == CPU_SMT_ENABLED)
3483 ++ return true;
3484 ++
3485 ++ /*
3486 ++ * On x86 it's required to boot all logical CPUs at least once so
3487 ++ * that the init code can get a chance to set CR4.MCE on each
3488 ++ * CPU. Otherwise, a broadcast MCE observing CR4.MCE=0b on any
3489 ++ * core will shut down the machine.
3490 ++ */
3491 ++ return !per_cpu(cpuhp_state, cpu).booted_once;
3492 ++}
3493 ++#else
3494 ++static inline bool cpu_smt_allowed(unsigned int cpu) { return true; }
3495 ++#endif
3496 ++
3497 + static inline enum cpuhp_state
3498 + cpuhp_set_state(struct cpuhp_cpu_state *st, enum cpuhp_state target)
3499 + {
3500 +@@ -422,6 +502,16 @@ static int bringup_wait_for_ap(unsigned int cpu)
3501 + stop_machine_unpark(cpu);
3502 + kthread_unpark(st->thread);
3503 +
3504 ++ /*
3505 ++ * SMT soft disabling on X86 requires bringing the CPU out of the
3506 ++ * BIOS 'wait for SIPI' state in order to set the CR4.MCE bit. The
3507 ++ * CPU marked itself as booted_once in cpu_notify_starting() so the
3508 ++ * cpu_smt_allowed() check will now return false if this is not the
3509 ++ * primary sibling.
3510 ++ */
3511 ++ if (!cpu_smt_allowed(cpu))
3512 ++ return -ECANCELED;
3513 ++
3514 + if (st->target <= CPUHP_AP_ONLINE_IDLE)
3515 + return 0;
3516 +
3517 +@@ -754,7 +844,6 @@ static int takedown_cpu(unsigned int cpu)
3518 +
3519 + /* Park the smpboot threads */
3520 + kthread_park(per_cpu_ptr(&cpuhp_state, cpu)->thread);
3521 +- smpboot_park_threads(cpu);
3522 +
3523 + /*
3524 + * Prevent irq alloc/free while the dying cpu reorganizes the
3525 +@@ -907,20 +996,19 @@ out:
3526 + return ret;
3527 + }
3528 +
3529 ++static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
3530 ++{
3531 ++ if (cpu_hotplug_disabled)
3532 ++ return -EBUSY;
3533 ++ return _cpu_down(cpu, 0, target);
3534 ++}
3535 ++
3536 + static int do_cpu_down(unsigned int cpu, enum cpuhp_state target)
3537 + {
3538 + int err;
3539 +
3540 + cpu_maps_update_begin();
3541 +-
3542 +- if (cpu_hotplug_disabled) {
3543 +- err = -EBUSY;
3544 +- goto out;
3545 +- }
3546 +-
3547 +- err = _cpu_down(cpu, 0, target);
3548 +-
3549 +-out:
3550 ++ err = cpu_down_maps_locked(cpu, target);
3551 + cpu_maps_update_done();
3552 + return err;
3553 + }
3554 +@@ -949,6 +1037,7 @@ void notify_cpu_starting(unsigned int cpu)
3555 + int ret;
3556 +
3557 + rcu_cpu_starting(cpu); /* Enables RCU usage on this CPU. */
3558 ++ st->booted_once = true;
3559 + while (st->state < target) {
3560 + st->state++;
3561 + ret = cpuhp_invoke_callback(cpu, st->state, true, NULL, NULL);
3562 +@@ -1058,6 +1147,10 @@ static int do_cpu_up(unsigned int cpu, enum cpuhp_state target)
3563 + err = -EBUSY;
3564 + goto out;
3565 + }
3566 ++ if (!cpu_smt_allowed(cpu)) {
3567 ++ err = -EPERM;
3568 ++ goto out;
3569 ++ }
3570 +
3571 + err = _cpu_up(cpu, 0, target);
3572 + out:
3573 +@@ -1332,7 +1425,7 @@ static struct cpuhp_step cpuhp_hp_states[] = {
3574 + [CPUHP_AP_SMPBOOT_THREADS] = {
3575 + .name = "smpboot/threads:online",
3576 + .startup.single = smpboot_unpark_threads,
3577 +- .teardown.single = NULL,
3578 ++ .teardown.single = smpboot_park_threads,
3579 + },
3580 + [CPUHP_AP_IRQ_AFFINITY_ONLINE] = {
3581 + .name = "irq/affinity:online",
3582 +@@ -1906,10 +1999,172 @@ static const struct attribute_group cpuhp_cpu_root_attr_group = {
3583 + NULL
3584 + };
3585 +
3586 ++#ifdef CONFIG_HOTPLUG_SMT
3587 ++
3588 ++static const char *smt_states[] = {
3589 ++ [CPU_SMT_ENABLED] = "on",
3590 ++ [CPU_SMT_DISABLED] = "off",
3591 ++ [CPU_SMT_FORCE_DISABLED] = "forceoff",
3592 ++ [CPU_SMT_NOT_SUPPORTED] = "notsupported",
3593 ++};
3594 ++
3595 ++static ssize_t
3596 ++show_smt_control(struct device *dev, struct device_attribute *attr, char *buf)
3597 ++{
3598 ++ return snprintf(buf, PAGE_SIZE - 2, "%s\n", smt_states[cpu_smt_control]);
3599 ++}
3600 ++
3601 ++static void cpuhp_offline_cpu_device(unsigned int cpu)
3602 ++{
3603 ++ struct device *dev = get_cpu_device(cpu);
3604 ++
3605 ++ dev->offline = true;
3606 ++ /* Tell user space about the state change */
3607 ++ kobject_uevent(&dev->kobj, KOBJ_OFFLINE);
3608 ++}
3609 ++
3610 ++static void cpuhp_online_cpu_device(unsigned int cpu)
3611 ++{
3612 ++ struct device *dev = get_cpu_device(cpu);
3613 ++
3614 ++ dev->offline = false;
3615 ++ /* Tell user space about the state change */
3616 ++ kobject_uevent(&dev->kobj, KOBJ_ONLINE);
3617 ++}
3618 ++
3619 ++static int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
3620 ++{
3621 ++ int cpu, ret = 0;
3622 ++
3623 ++ cpu_maps_update_begin();
3624 ++ for_each_online_cpu(cpu) {
3625 ++ if (topology_is_primary_thread(cpu))
3626 ++ continue;
3627 ++ ret = cpu_down_maps_locked(cpu, CPUHP_OFFLINE);
3628 ++ if (ret)
3629 ++ break;
3630 ++ /*
3631 ++ * As this needs to hold the cpu maps lock it's impossible
3632 ++ * to call device_offline() because that ends up calling
3633 ++ * cpu_down() which takes cpu maps lock. cpu maps lock
3634 ++ * needs to be held as this might race against in kernel
3635 ++ * abusers of the hotplug machinery (thermal management).
3636 ++ *
3637 ++ * So nothing would update device:offline state. That would
3638 ++ * leave the sysfs entry stale and prevent onlining after
3639 ++ * smt control has been changed to 'off' again. This is
3640 ++ * called under the sysfs hotplug lock, so it is properly
3641 ++ * serialized against the regular offline usage.
3642 ++ */
3643 ++ cpuhp_offline_cpu_device(cpu);
3644 ++ }
3645 ++ if (!ret)
3646 ++ cpu_smt_control = ctrlval;
3647 ++ cpu_maps_update_done();
3648 ++ return ret;
3649 ++}
3650 ++
3651 ++static int cpuhp_smt_enable(void)
3652 ++{
3653 ++ int cpu, ret = 0;
3654 ++
3655 ++ cpu_maps_update_begin();
3656 ++ cpu_smt_control = CPU_SMT_ENABLED;
3657 ++ for_each_present_cpu(cpu) {
3658 ++ /* Skip online CPUs and CPUs on offline nodes */
3659 ++ if (cpu_online(cpu) || !node_online(cpu_to_node(cpu)))
3660 ++ continue;
3661 ++ ret = _cpu_up(cpu, 0, CPUHP_ONLINE);
3662 ++ if (ret)
3663 ++ break;
3664 ++ /* See comment in cpuhp_smt_disable() */
3665 ++ cpuhp_online_cpu_device(cpu);
3666 ++ }
3667 ++ cpu_maps_update_done();
3668 ++ return ret;
3669 ++}
3670 ++
3671 ++static ssize_t
3672 ++store_smt_control(struct device *dev, struct device_attribute *attr,
3673 ++ const char *buf, size_t count)
3674 ++{
3675 ++ int ctrlval, ret;
3676 ++
3677 ++ if (sysfs_streq(buf, "on"))
3678 ++ ctrlval = CPU_SMT_ENABLED;
3679 ++ else if (sysfs_streq(buf, "off"))
3680 ++ ctrlval = CPU_SMT_DISABLED;
3681 ++ else if (sysfs_streq(buf, "forceoff"))
3682 ++ ctrlval = CPU_SMT_FORCE_DISABLED;
3683 ++ else
3684 ++ return -EINVAL;
3685 ++
3686 ++ if (cpu_smt_control == CPU_SMT_FORCE_DISABLED)
3687 ++ return -EPERM;
3688 ++
3689 ++ if (cpu_smt_control == CPU_SMT_NOT_SUPPORTED)
3690 ++ return -ENODEV;
3691 ++
3692 ++ ret = lock_device_hotplug_sysfs();
3693 ++ if (ret)
3694 ++ return ret;
3695 ++
3696 ++ if (ctrlval != cpu_smt_control) {
3697 ++ switch (ctrlval) {
3698 ++ case CPU_SMT_ENABLED:
3699 ++ ret = cpuhp_smt_enable();
3700 ++ break;
3701 ++ case CPU_SMT_DISABLED:
3702 ++ case CPU_SMT_FORCE_DISABLED:
3703 ++ ret = cpuhp_smt_disable(ctrlval);
3704 ++ break;
3705 ++ }
3706 ++ }
3707 ++
3708 ++ unlock_device_hotplug();
3709 ++ return ret ? ret : count;
3710 ++}
3711 ++static DEVICE_ATTR(control, 0644, show_smt_control, store_smt_control);
3712 ++
3713 ++static ssize_t
3714 ++show_smt_active(struct device *dev, struct device_attribute *attr, char *buf)
3715 ++{
3716 ++ bool active = topology_max_smt_threads() > 1;
3717 ++
3718 ++ return snprintf(buf, PAGE_SIZE - 2, "%d\n", active);
3719 ++}
3720 ++static DEVICE_ATTR(active, 0444, show_smt_active, NULL);
3721 ++
3722 ++static struct attribute *cpuhp_smt_attrs[] = {
3723 ++ &dev_attr_control.attr,
3724 ++ &dev_attr_active.attr,
3725 ++ NULL
3726 ++};
3727 ++
3728 ++static const struct attribute_group cpuhp_smt_attr_group = {
3729 ++ .attrs = cpuhp_smt_attrs,
3730 ++ .name = "smt",
3731 ++ NULL
3732 ++};
3733 ++
3734 ++static int __init cpu_smt_state_init(void)
3735 ++{
3736 ++ return sysfs_create_group(&cpu_subsys.dev_root->kobj,
3737 ++ &cpuhp_smt_attr_group);
3738 ++}
3739 ++
3740 ++#else
3741 ++static inline int cpu_smt_state_init(void) { return 0; }
3742 ++#endif
3743 ++
3744 + static int __init cpuhp_sysfs_init(void)
3745 + {
3746 + int cpu, ret;
3747 +
3748 ++ ret = cpu_smt_state_init();
3749 ++ if (ret)
3750 ++ return ret;
3751 ++
3752 + ret = sysfs_create_group(&cpu_subsys.dev_root->kobj,
3753 + &cpuhp_cpu_root_attr_group);
3754 + if (ret)
3755 +@@ -2012,5 +2267,8 @@ void __init boot_cpu_init(void)
3756 + */
3757 + void __init boot_cpu_hotplug_init(void)
3758 + {
3759 +- per_cpu_ptr(&cpuhp_state, smp_processor_id())->state = CPUHP_ONLINE;
3760 ++#ifdef CONFIG_SMP
3761 ++ this_cpu_write(cpuhp_state.booted_once, true);
3762 ++#endif
3763 ++ this_cpu_write(cpuhp_state.state, CPUHP_ONLINE);
3764 + }
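The smt attribute group registered by cpu_smt_state_init() above ends up under the CPU subsystem root, so the control and active files can be read (and, with root, control can be written) from user space. A minimal read-only sketch, assuming sysfs is mounted at /sys (editorial illustration, not part of the patch):

/* Minimal sketch (not from the patch): query the SMT control interface
 * created above.  Writing "on"/"off"/"forceoff" to the control file is
 * also possible (mode 0644), but requires root. */
#include <stdio.h>

static void show(const char *path)
{
	char buf[64];
	FILE *f = fopen(path, "r");

	if (f && fgets(buf, sizeof(buf), f))
		printf("%s: %s", path, buf);
	if (f)
		fclose(f);
}

int main(void)
{
	show("/sys/devices/system/cpu/smt/control"); /* on/off/forceoff/notsupported */
	show("/sys/devices/system/cpu/smt/active");  /* 1 when SMT siblings are online */
	return 0;
}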
3765 +diff --git a/kernel/sched/core.c b/kernel/sched/core.c
3766 +index fe365c9a08e9..5ba96d9ddbde 100644
3767 +--- a/kernel/sched/core.c
3768 ++++ b/kernel/sched/core.c
3769 +@@ -5774,6 +5774,18 @@ int sched_cpu_activate(unsigned int cpu)
3770 + struct rq *rq = cpu_rq(cpu);
3771 + struct rq_flags rf;
3772 +
3773 ++#ifdef CONFIG_SCHED_SMT
3774 ++ /*
3775 ++ * The sched_smt_present static key needs to be evaluated on every
3776 ++ * hotplug event because at boot time SMT might be disabled when
3777 ++ * the number of booted CPUs is limited.
3778 ++ *
3779 ++ * If then later a sibling gets hotplugged, then the key would stay
3780 ++ * off and SMT scheduling would never be functional.
3781 ++ */
3782 ++ if (cpumask_weight(cpu_smt_mask(cpu)) > 1)
3783 ++ static_branch_enable_cpuslocked(&sched_smt_present);
3784 ++#endif
3785 + set_cpu_active(cpu, true);
3786 +
3787 + if (sched_smp_initialized) {
3788 +@@ -5871,22 +5883,6 @@ int sched_cpu_dying(unsigned int cpu)
3789 + }
3790 + #endif
3791 +
3792 +-#ifdef CONFIG_SCHED_SMT
3793 +-DEFINE_STATIC_KEY_FALSE(sched_smt_present);
3794 +-
3795 +-static void sched_init_smt(void)
3796 +-{
3797 +- /*
3798 +- * We've enumerated all CPUs and will assume that if any CPU
3799 +- * has SMT siblings, CPU0 will too.
3800 +- */
3801 +- if (cpumask_weight(cpu_smt_mask(0)) > 1)
3802 +- static_branch_enable(&sched_smt_present);
3803 +-}
3804 +-#else
3805 +-static inline void sched_init_smt(void) { }
3806 +-#endif
3807 +-
3808 + void __init sched_init_smp(void)
3809 + {
3810 + sched_init_numa();
3811 +@@ -5908,8 +5904,6 @@ void __init sched_init_smp(void)
3812 + init_sched_rt_class();
3813 + init_sched_dl_class();
3814 +
3815 +- sched_init_smt();
3816 +-
3817 + sched_smp_initialized = true;
3818 + }
3819 +
3820 +diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
3821 +index 2f0a0be4d344..9c219f7b0970 100644
3822 +--- a/kernel/sched/fair.c
3823 ++++ b/kernel/sched/fair.c
3824 +@@ -6237,6 +6237,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
3825 + }
3826 +
3827 + #ifdef CONFIG_SCHED_SMT
3828 ++DEFINE_STATIC_KEY_FALSE(sched_smt_present);
3829 +
3830 + static inline void set_idle_cores(int cpu, int val)
3831 + {
3832 +diff --git a/kernel/smp.c b/kernel/smp.c
3833 +index 084c8b3a2681..d86eec5f51c1 100644
3834 +--- a/kernel/smp.c
3835 ++++ b/kernel/smp.c
3836 +@@ -584,6 +584,8 @@ void __init smp_init(void)
3837 + num_nodes, (num_nodes > 1 ? "s" : ""),
3838 + num_cpus, (num_cpus > 1 ? "s" : ""));
3839 +
3840 ++ /* Final decision about SMT support */
3841 ++ cpu_smt_check_topology();
3842 + /* Any cleanup work */
3843 + smp_cpus_done(setup_max_cpus);
3844 + }
3845 +diff --git a/mm/memory.c b/mm/memory.c
3846 +index c5e87a3a82ba..0e356dd923c2 100644
3847 +--- a/mm/memory.c
3848 ++++ b/mm/memory.c
3849 +@@ -1884,6 +1884,9 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
3850 + if (addr < vma->vm_start || addr >= vma->vm_end)
3851 + return -EFAULT;
3852 +
3853 ++ if (!pfn_modify_allowed(pfn, pgprot))
3854 ++ return -EACCES;
3855 ++
3856 + track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
3857 +
3858 + ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
3859 +@@ -1919,6 +1922,9 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
3860 +
3861 + track_pfn_insert(vma, &pgprot, pfn);
3862 +
3863 ++ if (!pfn_modify_allowed(pfn_t_to_pfn(pfn), pgprot))
3864 ++ return -EACCES;
3865 ++
3866 + /*
3867 + * If we don't have pte special, then we have to use the pfn_valid()
3868 + * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
3869 +@@ -1980,6 +1986,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
3870 + {
3871 + pte_t *pte;
3872 + spinlock_t *ptl;
3873 ++ int err = 0;
3874 +
3875 + pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
3876 + if (!pte)
3877 +@@ -1987,12 +1994,16 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
3878 + arch_enter_lazy_mmu_mode();
3879 + do {
3880 + BUG_ON(!pte_none(*pte));
3881 ++ if (!pfn_modify_allowed(pfn, prot)) {
3882 ++ err = -EACCES;
3883 ++ break;
3884 ++ }
3885 + set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
3886 + pfn++;
3887 + } while (pte++, addr += PAGE_SIZE, addr != end);
3888 + arch_leave_lazy_mmu_mode();
3889 + pte_unmap_unlock(pte - 1, ptl);
3890 +- return 0;
3891 ++ return err;
3892 + }
3893 +
3894 + static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
3895 +@@ -2001,6 +2012,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
3896 + {
3897 + pmd_t *pmd;
3898 + unsigned long next;
3899 ++ int err;
3900 +
3901 + pfn -= addr >> PAGE_SHIFT;
3902 + pmd = pmd_alloc(mm, pud, addr);
3903 +@@ -2009,9 +2021,10 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
3904 + VM_BUG_ON(pmd_trans_huge(*pmd));
3905 + do {
3906 + next = pmd_addr_end(addr, end);
3907 +- if (remap_pte_range(mm, pmd, addr, next,
3908 +- pfn + (addr >> PAGE_SHIFT), prot))
3909 +- return -ENOMEM;
3910 ++ err = remap_pte_range(mm, pmd, addr, next,
3911 ++ pfn + (addr >> PAGE_SHIFT), prot);
3912 ++ if (err)
3913 ++ return err;
3914 + } while (pmd++, addr = next, addr != end);
3915 + return 0;
3916 + }
3917 +@@ -2022,6 +2035,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
3918 + {
3919 + pud_t *pud;
3920 + unsigned long next;
3921 ++ int err;
3922 +
3923 + pfn -= addr >> PAGE_SHIFT;
3924 + pud = pud_alloc(mm, p4d, addr);
3925 +@@ -2029,9 +2043,10 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
3926 + return -ENOMEM;
3927 + do {
3928 + next = pud_addr_end(addr, end);
3929 +- if (remap_pmd_range(mm, pud, addr, next,
3930 +- pfn + (addr >> PAGE_SHIFT), prot))
3931 +- return -ENOMEM;
3932 ++ err = remap_pmd_range(mm, pud, addr, next,
3933 ++ pfn + (addr >> PAGE_SHIFT), prot);
3934 ++ if (err)
3935 ++ return err;
3936 + } while (pud++, addr = next, addr != end);
3937 + return 0;
3938 + }
3939 +@@ -2042,6 +2057,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
3940 + {
3941 + p4d_t *p4d;
3942 + unsigned long next;
3943 ++ int err;
3944 +
3945 + pfn -= addr >> PAGE_SHIFT;
3946 + p4d = p4d_alloc(mm, pgd, addr);
3947 +@@ -2049,9 +2065,10 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
3948 + return -ENOMEM;
3949 + do {
3950 + next = p4d_addr_end(addr, end);
3951 +- if (remap_pud_range(mm, p4d, addr, next,
3952 +- pfn + (addr >> PAGE_SHIFT), prot))
3953 +- return -ENOMEM;
3954 ++ err = remap_pud_range(mm, p4d, addr, next,
3955 ++ pfn + (addr >> PAGE_SHIFT), prot);
3956 ++ if (err)
3957 ++ return err;
3958 + } while (p4d++, addr = next, addr != end);
3959 + return 0;
3960 + }
3961 +diff --git a/mm/mprotect.c b/mm/mprotect.c
3962 +index 625608bc8962..6d331620b9e5 100644
3963 +--- a/mm/mprotect.c
3964 ++++ b/mm/mprotect.c
3965 +@@ -306,6 +306,42 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
3966 + return pages;
3967 + }
3968 +
3969 ++static int prot_none_pte_entry(pte_t *pte, unsigned long addr,
3970 ++ unsigned long next, struct mm_walk *walk)
3971 ++{
3972 ++ return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
3973 ++ 0 : -EACCES;
3974 ++}
3975 ++
3976 ++static int prot_none_hugetlb_entry(pte_t *pte, unsigned long hmask,
3977 ++ unsigned long addr, unsigned long next,
3978 ++ struct mm_walk *walk)
3979 ++{
3980 ++ return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
3981 ++ 0 : -EACCES;
3982 ++}
3983 ++
3984 ++static int prot_none_test(unsigned long addr, unsigned long next,
3985 ++ struct mm_walk *walk)
3986 ++{
3987 ++ return 0;
3988 ++}
3989 ++
3990 ++static int prot_none_walk(struct vm_area_struct *vma, unsigned long start,
3991 ++ unsigned long end, unsigned long newflags)
3992 ++{
3993 ++ pgprot_t new_pgprot = vm_get_page_prot(newflags);
3994 ++ struct mm_walk prot_none_walk = {
3995 ++ .pte_entry = prot_none_pte_entry,
3996 ++ .hugetlb_entry = prot_none_hugetlb_entry,
3997 ++ .test_walk = prot_none_test,
3998 ++ .mm = current->mm,
3999 ++ .private = &new_pgprot,
4000 ++ };
4001 ++
4002 ++ return walk_page_range(start, end, &prot_none_walk);
4003 ++}
4004 ++
4005 + int
4006 + mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
4007 + unsigned long start, unsigned long end, unsigned long newflags)
4008 +@@ -323,6 +359,19 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
4009 + return 0;
4010 + }
4011 +
4012 ++ /*
4013 ++ * Do PROT_NONE PFN permission checks here when we can still
4014 ++ * bail out without undoing a lot of state. This is a rather
4015 ++ * uncommon case, so doesn't need to be very optimized.
4016 ++ */
4017 ++ if (arch_has_pfn_modify_check() &&
4018 ++ (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
4019 ++ (newflags & (VM_READ|VM_WRITE|VM_EXEC)) == 0) {
4020 ++ error = prot_none_walk(vma, start, end, newflags);
4021 ++ if (error)
4022 ++ return error;
4023 ++ }
4024 ++
4025 + /*
4026 + * If we make a private mapping writable we increase our commit;
4027 + * but (without finer accounting) cannot reduce our commit if we
4028 +diff --git a/mm/swapfile.c b/mm/swapfile.c
4029 +index 2cc2972eedaf..18185ae4f223 100644
4030 +--- a/mm/swapfile.c
4031 ++++ b/mm/swapfile.c
4032 +@@ -2909,6 +2909,35 @@ static int claim_swapfile(struct swap_info_struct *p, struct inode *inode)
4033 + return 0;
4034 + }
4035 +
4036 ++
4037 ++/*
4038 ++ * Find out how many pages are allowed for a single swap device. There
4039 ++ * are two limiting factors:
4040 ++ * 1) the number of bits for the swap offset in the swp_entry_t type, and
4041 ++ * 2) the number of bits in the swap pte, as defined by the different
4042 ++ * architectures.
4043 ++ *
4044 ++ * In order to find the largest possible bit mask, a swap entry with
4045 ++ * swap type 0 and swap offset ~0UL is created, encoded to a swap pte,
4046 ++ * decoded to a swp_entry_t again, and finally the swap offset is
4047 ++ * extracted.
4048 ++ *
4049 ++ * This will mask all the bits from the initial ~0UL mask that can't
4050 ++ * be encoded in either the swp_entry_t or the architecture definition
4051 ++ * of a swap pte.
4052 ++ */
4053 ++unsigned long generic_max_swapfile_size(void)
4054 ++{
4055 ++ return swp_offset(pte_to_swp_entry(
4056 ++ swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
4057 ++}
4058 ++
4059 ++/* Can be overridden by an architecture for additional checks. */
4060 ++__weak unsigned long max_swapfile_size(void)
4061 ++{
4062 ++ return generic_max_swapfile_size();
4063 ++}
4064 ++
4065 + static unsigned long read_swap_header(struct swap_info_struct *p,
4066 + union swap_header *swap_header,
4067 + struct inode *inode)
4068 +@@ -2944,22 +2973,7 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
4069 + p->cluster_next = 1;
4070 + p->cluster_nr = 0;
4071 +
4072 +- /*
4073 +- * Find out how many pages are allowed for a single swap
4074 +- * device. There are two limiting factors: 1) the number
4075 +- * of bits for the swap offset in the swp_entry_t type, and
4076 +- * 2) the number of bits in the swap pte as defined by the
4077 +- * different architectures. In order to find the
4078 +- * largest possible bit mask, a swap entry with swap type 0
4079 +- * and swap offset ~0UL is created, encoded to a swap pte,
4080 +- * decoded to a swp_entry_t again, and finally the swap
4081 +- * offset is extracted. This will mask all the bits from
4082 +- * the initial ~0UL mask that can't be encoded in either
4083 +- * the swp_entry_t or the architecture definition of a
4084 +- * swap pte.
4085 +- */
4086 +- maxpages = swp_offset(pte_to_swp_entry(
4087 +- swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
4088 ++ maxpages = max_swapfile_size();
4089 + last_page = swap_header->info.last_page;
4090 + if (!last_page) {
4091 + pr_warn("Empty swap-file\n");
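The round-trip trick in generic_max_swapfile_size() above can be modelled in a few lines. The sketch below is a toy, not kernel code: the 50-bit offset field is an assumption chosen purely to make the masking visible.

/* Toy model (not from the patch) of the encode/decode round trip used by
 * generic_max_swapfile_size().  The 50-bit offset field is assumed for
 * illustration; real swap ptes are architecture specific. */
#include <stdio.h>

#define TOY_SWP_OFFSET_BITS 50

/* "Encode to a swap pte and decode again": only the offset bits survive. */
static unsigned long long toy_roundtrip(unsigned long long offset)
{
	return offset & ((1ULL << TOY_SWP_OFFSET_BITS) - 1);
}

int main(void)
{
	/* Push ~0 through the round trip, then add one page. */
	unsigned long long max_pages = toy_roundtrip(~0ULL) + 1;

	printf("toy max swapfile size: %llu pages (2^%d)\n",
	       max_pages, TOY_SWP_OFFSET_BITS);
	return 0;
}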
4092 +diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h
4093 +index 5701f5cecd31..64aaa3f5f36c 100644
4094 +--- a/tools/arch/x86/include/asm/cpufeatures.h
4095 ++++ b/tools/arch/x86/include/asm/cpufeatures.h
4096 +@@ -219,6 +219,7 @@
4097 + #define X86_FEATURE_IBPB ( 7*32+26) /* Indirect Branch Prediction Barrier */
4098 + #define X86_FEATURE_STIBP ( 7*32+27) /* Single Thread Indirect Branch Predictors */
4099 + #define X86_FEATURE_ZEN ( 7*32+28) /* "" CPU is AMD family 0x17 (Zen) */
4100 ++#define X86_FEATURE_L1TF_PTEINV ( 7*32+29) /* "" L1TF workaround PTE inversion */
4101 +
4102 + /* Virtualization flags: Linux defined, word 8 */
4103 + #define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */
4104 +@@ -341,6 +342,7 @@
4105 + #define X86_FEATURE_PCONFIG (18*32+18) /* Intel PCONFIG */
4106 + #define X86_FEATURE_SPEC_CTRL (18*32+26) /* "" Speculation Control (IBRS + IBPB) */
4107 + #define X86_FEATURE_INTEL_STIBP (18*32+27) /* "" Single Thread Indirect Branch Predictors */
4108 ++#define X86_FEATURE_FLUSH_L1D (18*32+28) /* Flush L1D cache */
4109 + #define X86_FEATURE_ARCH_CAPABILITIES (18*32+29) /* IA32_ARCH_CAPABILITIES MSR (Intel) */
4110 + #define X86_FEATURE_SPEC_CTRL_SSBD (18*32+31) /* "" Speculative Store Bypass Disable */
4111 +
4112 +@@ -373,5 +375,6 @@
4113 + #define X86_BUG_SPECTRE_V1 X86_BUG(15) /* CPU is affected by Spectre variant 1 attack with conditional branches */
4114 + #define X86_BUG_SPECTRE_V2 X86_BUG(16) /* CPU is affected by Spectre variant 2 attack with indirect branches */
4115 + #define X86_BUG_SPEC_STORE_BYPASS X86_BUG(17) /* CPU is affected by speculative store bypass attack */
4116 ++#define X86_BUG_L1TF X86_BUG(18) /* CPU is affected by L1 Terminal Fault */
4117 +
4118 + #endif /* _ASM_X86_CPUFEATURES_H */