add CPU platforms to instances #8728
base: gjc/report-sled-families
fb1e0a7
to
620d7c9
Compare
nexus/src/app/instance_platform.rs
Outdated
0x8000001D, 0x3, 0x00000163, 0x03C0003F, 0x00007FFF, 0x00000001
),
cpuid_subleaf!(
0x8000001D, 0x4, 0x00000000, 0x00000000, 0x00000000, 0x00000000
the Propolis CPUID specializer doesn't like this subleaf because theoretically it's for a cache level "4". agh.
finally got to taking a fine-tooth comb through the CPUID bits here and the differences between hardware and what guests currently see. for the most part this is in line with what guests already get from bhyve defaults, but i've noticed a few typos that unsurprisingly do not pose an issue booting at least Alpine guests. i'll clean that up and update 314 appropriately tomorrow.
// See [RFD 314](https://314.rfd.oxide.computer/) section 6 for all the
// gnarly details.
const MILAN_CPUID: [CpuidEntry; 32] = [
cpuid_leaf!(0x0, 0x0000000D, 0x68747541, 0x444D4163, 0x69746E65),
a guest currently sees `eax=0x10` here, where leaves `0xe`, `0xf`, and `0x10` are all zeroes. `0xe` is reserved and zero on the host. `0xf` and `0x10` are zeroed because of the default bhyve masking behavior. setting `eax=0xd` is just more precise: leaves `0xf` and `0x10` being "present" but zero does not communicate any feature presence.
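For reference, a minimal sketch of how a guest reads leaf 0 as it is written in this PR: `eax` is the highest standard leaf the guest may query, and `ebx`, `edx`, `ecx` (in that order) spell the vendor string. The `Leaf0` type and the helper names are made up for illustration; they are not part of Nexus or Propolis.

```rust
/// Register values of CPUID leaf 0. Hypothetical helper type for this sketch.
struct Leaf0 {
    eax: u32,
    ebx: u32,
    ecx: u32,
    edx: u32,
}

impl Leaf0 {
    /// Highest standard leaf the guest is told it may query.
    fn max_standard_leaf(&self) -> u32 {
        self.eax
    }

    /// The vendor string is stored little-endian in ebx, edx, ecx (in that order).
    fn vendor(&self) -> String {
        let mut bytes = Vec::with_capacity(12);
        for reg in [self.ebx, self.edx, self.ecx] {
            bytes.extend_from_slice(&reg.to_le_bytes());
        }
        String::from_utf8_lossy(&bytes).into_owned()
    }
}

/// Leaf 0 with the values from the `MILAN_CPUID` entry in this PR.
fn milan_leaf0() -> Leaf0 {
    Leaf0 {
        eax: 0x0000_000D,
        ebx: 0x6874_7541,
        ecx: 0x444D_4163,
        edx: 0x6974_6E65,
    }
}
```

With `eax=0xd`, a guest that honors the maximum leaf never asks about `0xf` or `0x10` at all, rather than asking and getting zeroes back.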
// gnarly details.
const MILAN_CPUID: [CpuidEntry; 32] = [
cpuid_leaf!(0x0, 0x0000000D, 0x68747541, 0x444D4163, 0x69746E65),
cpuid_leaf!(0x1, 0x00A00F11, 0x00000800, 0xF6F83203, 0x078BFBFF),
a guest currently sees
0x00000001 0x00: eax=0x00a00f11 ebx=0x01020800 ecx=0xfeda3203 edx=0x178bfbff
`ecx` goes from `0xfeda3203` to `0xF6F83203`
the `edx` bit that differs (bit 28) indicates hyperthreading support, and is set based on Propolis' specialization. it gets unconditionally set, today, so these match in practice.
const MILAN_CPUID: [CpuidEntry; 32] = [
cpuid_leaf!(0x0, 0x0000000D, 0x68747541, 0x444D4163, 0x69746E65),
cpuid_leaf!(0x1, 0x00A00F11, 0x00000800, 0xF6F83203, 0x078BFBFF),
cpuid_leaf!(0x5, 0x00000000, 0x00000000, 0x00000000, 0x00000000),
a guest currently sees
0x00000005 0x00: eax=0x00000040 ebx=0x00000040 ecx=0x00000003 edx=0x0000001
this leaf is about `monitor`/`mwait` support. support for these instructions is masked off, so this leaf becomes all zeroes. the bits that were here report that the monitor-line size is no less and no greater than `0x40` bytes, `ecx=3` indicates `mwait` can set state to be woken on interrupt as well as `EMX` (less familiar with this). `edx` is reserved, curious that it's 1.
nexus/src/app/instance_platform.rs
Outdated
cpuid_leaf!(0x0, 0x0000000D, 0x68747541, 0x444D4163, 0x69746E65),
cpuid_leaf!(0x1, 0x00A00F11, 0x00000800, 0xF6F83203, 0x078BFBFF),
cpuid_leaf!(0x5, 0x00000000, 0x00000000, 0x00000000, 0x00000000),
cpuid_leaf!(0x6, 0x00000002, 0x00000000, 0x00000000, 0x00000000),
a guest currently sees
0x00000006 0x00: eax=0x00000004 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
`2` in the new leaf is a typo. bit 2 should be set, indicating ARAT, for `0x4`. the other bit here is `ecx.0` indicating aperf/mperf support. bhyve masks that too, so no change.
nexus/src/app/instance_platform.rs
Outdated
cpuid_subleaf!(
0x7, 0x0, 0x00000000, 0x219C03A9, 0x00000000, 0x00000000
),
a guest currently sees
0x00000007 0x00: eax=0x00000000 ebx=0x201003a9 ecx=0x00000600 edx=0x00000000
the `ebx` bits differ on bits 18, 19, 23, and 24. all of these are masked out, but would be made visible:
- bit 18: RDSEED. this should be masked off for the time being for reasons described in 314.
- bit 19: `ADX`, the ADCX/ADOX instructions. it's really curious this is not already advertised. need to follow up on this. might be masked by default in bhyve?
- bit 23: `clflushopt` support. similar to above, should be supported in hardware and might have been masked by default by bhyve.
- bit 24: `clwb` support. another instruction, same as above.
the host reports `0x219897a9` for `ebx` today, which has RDSEED masked off (again, for the time being this is expected, and described in 314) and PQE/PQM support enabled. so all the instruction extensions we'd expect to be present are.
the `ecx` bits are `VAES` and `VPCLMULQDQ` support. this was an open question in RFD 314, i need to update the RFD with the fact that these are set by Milan hardware and pass these through too. hardware bits are `0x0040068c` so passing `VAES` and `VPCLMULQDQ` through is OK.
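The bit lists above are easy to double-check mechanically. A small sketch, using only the register values quoted in this thread; `differing_bits` and the constant names are illustrative, not existing helpers in the codebase:

```rust
/// Leaf 7 subleaf 0 `ebx` as written in this PR.
const PR_EBX: u32 = 0x219C_03A9;
/// Leaf 7 subleaf 0 `ebx` as a guest sees it today.
const GUEST_EBX: u32 = 0x2010_03A9;
/// Leaf 7 subleaf 0 `ebx` as reported by the host.
const HOST_EBX: u32 = 0x2198_97A9;
/// Leaf 7 subleaf 0 `ecx` as a guest sees it today (the PR has zero here).
const GUEST_ECX: u32 = 0x0000_0600;

/// Returns the bit positions at which two 32-bit register values differ.
fn differing_bits(a: u32, b: u32) -> Vec<u32> {
    let diff = a ^ b;
    (0..32u32).filter(|&bit| (diff & (1 << bit)) != 0).collect()
}

/// True if `bit` is set in `value`.
fn bit_is_set(value: u32, bit: u32) -> bool {
    (value & (1 << bit)) != 0
}
```

`differing_bits(PR_EBX, GUEST_EBX)` yields 18, 19, 23 and 24, and `bit_is_set(HOST_EBX, 18)` is false, matching the RDSEED note above.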
cpuid_subleaf!(
0x8000001D, 0x1, 0x00000122, 0x01C0003F, 0x0000003F, 0x00000000
),
guest do not get told about the L1 icache at all today? kinda weird.
strange but true! the whole default cache topology is a figment of bhyve's imagination: https://code.illumos.org/plugins/gitiles/illumos-gate/+/refs/heads/master/usr/src/uts/intel/io/vmm/vmm_cpuid.c#722
0x8000001D, 0x2, 0x00000143, 0x01C0003F, 0x000003FF, 0x00000002
),
cpuid_subleaf!(
0x8000001D, 0x3, 0x00000163, 0x03C0003F, 0x00007FFF, 0x00000001
for all 8000001D leaves, guests see 0's for `NumSharingCache`, `CacheNumWays`, `CacheNumSets`, `CacheInclusive`, or `WBINVD` scope. the bits here are what i see from a Gimlet (absent `NumSharingCache`, which is zeroes here still).
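To make the `8000001D` values easier to eyeball, here is a rough decoder for one subleaf. The field positions follow my reading of the AMD APM description of this leaf (type in `EAX[4:0]`, level in `EAX[7:5]`, ways, partitions and line size in `EBX`, sets in `ECX`, each stored minus one); the `CacheInfo` type and function name are hypothetical.

```rust
/// One decoded subleaf of CPUID leaf 0x8000001D. Hypothetical type for this sketch.
#[derive(Debug, PartialEq)]
struct CacheInfo {
    /// EAX[4:0]: 1 = data, 2 = instruction, 3 = unified.
    cache_type: u64,
    /// EAX[7:5]: cache level.
    level: u64,
    /// EBX[31:22] + 1.
    ways: u64,
    /// EBX[21:12] + 1.
    partitions: u64,
    /// EBX[11:0] + 1, in bytes.
    line_size: u64,
    /// ECX + 1.
    sets: u64,
}

impl CacheInfo {
    /// Total cache size implied by the geometry fields.
    fn size_bytes(&self) -> u64 {
        self.ways * self.partitions * self.line_size * self.sets
    }
}

/// Decodes a subleaf, returning `None` for cache type 0, which marks the
/// end of the cache list (as with the all-zero subleaf 4 in this PR).
fn decode_cache_subleaf(eax: u32, ebx: u32, ecx: u32) -> Option<CacheInfo> {
    let cache_type = u64::from(eax & 0x1F);
    if cache_type == 0 {
        return None;
    }
    Some(CacheInfo {
        cache_type,
        level: u64::from((eax >> 5) & 0x7),
        ways: u64::from(ebx >> 22) + 1,
        partitions: u64::from((ebx >> 12) & 0x3FF) + 1,
        line_size: u64::from(ebx & 0xFFF) + 1,
        sets: u64::from(ecx) + 1,
    })
}
```

Fed the values from this PR, subleaf 1 decodes to an 8-way 32 KiB level 1 instruction cache, subleaf 2 to an 8-way 512 KiB unified L2, and subleaf 3 to a 16-way 32 MiB unified L3.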
cpuid_subleaf!(
0x8000001D, 0x4, 0x00000000, 0x00000000, 0x00000000, 0x00000000
),
cpuid_leaf!(0x8000001E, 0x00000000, 0x00000100, 0x00000000, 0x00000000),
a guest currently sees:
0x8000001e 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
we'll be advertising SMT, here. need to check if we fix up `ComputeUnitId` here.
0x8000001D, 0x4, 0x00000000, 0x00000000, 0x00000000, 0x00000000
),
cpuid_leaf!(0x8000001E, 0x00000000, 0x00000100, 0x00000000, 0x00000000),
cpuid_leaf!(0x8000001F, 0x00000000, 0x00000000, 0x00000000, 0x00000000),
a guest currently sees SEV feature bits here, except that SEV is masked off from guests, so this is another leaf that we might want to more aggressively mask by bhyve defaults?
nexus/src/app/instance_platform.rs
Outdated
),
cpuid_leaf!(0x8000001E, 0x00000000, 0x00000100, 0x00000000, 0x00000000),
cpuid_leaf!(0x8000001F, 0x00000000, 0x00000000, 0x00000000, 0x00000000),
cpuid_leaf!(0x80000021, 0x0000002D, 0x00000000, 0x00000000, 0x00000000),
a guest currently sees:
0x80000021 0x00: eax=0x0000204d ebx=0x00000000 ecx=0x00000000 edx=0x00000000
`2D` is a typo in this PR: it should be `45`. masking off the `SmmPgCfgLock` for unsupported SMM controls, and `NullSelectorClearsBase` is off by one. guests also saw `PrefetchCtlMsr`? not ideal.
lots of upper bits in `eax` here become meaningful in later generations, but not yet!
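One way to avoid this class of typo is to derive the value from named bit positions instead of writing the hex literal by hand. A sketch, assuming the bit positions below (3, 6 and 13, per my reading of AMD's documentation for this leaf); the constant and function names are illustrative only:

```rust
/// Assumed bit positions in CPUID Fn8000_0021 EAX.
const SMM_PG_CFG_LOCK: u32 = 3;
const NULL_SELECTOR_CLEARS_BASE: u32 = 6;
const PREFETCH_CTL_MSR: u32 = 13;

/// What a guest sees in `eax` today, from the dump above.
const GUEST_TODAY_EAX: u32 = 0x0000_204D;
/// The value currently written in this PR.
const PR_EAX: u32 = 0x0000_002D;

/// Mask with only bit `n` set.
const fn bit(n: u32) -> u32 {
    1 << n
}

/// Today's guest value with SmmPgCfgLock and PrefetchCtlMsr masked off.
fn intended_eax() -> u32 {
    GUEST_TODAY_EAX & !(bit(SMM_PG_CFG_LOCK) | bit(PREFETCH_CTL_MSR))
}
```

`intended_eax()` comes out to `0x45`, while the `0x2D` in the PR still has bit 3 set and has bit 5 set where bit 6 was meant.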
0xB, 0x0, 0x00000001, 0x00000002, 0x00000100, 0x00000000
),
cpuid_subleaf!(
0xB, 0x1, 0x00000000, 0x00000000, 0x00000201, 0x00000000
this is the same as RFD 314 says, but `eax` ought to be different in the RFD. Propolis fixes up `B.1.EBX` here to the number of vCPUs, but `eax` of 0 implies that the subleaf is actually invalid (from the APM: "If this function is executed with an unimplemented level (passed in ECX), the instruction returns all zeroes in the EAX register." .. also taken faithfully, this implies the VM's topology is vCPU-many sockets with SMT pairs across each socket pair. oops).
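A minimal sketch of how a guest might decode the two leaf `0xB` subleaves from this PR, assuming the usual field layout (shift width in `EAX[4:0]`, logical processor count in `EBX[15:0]`, level number in `ECX[7:0]`, level type in `ECX[15:8]`). The types and function names are hypothetical:

```rust
/// Level types reported in ECX[15:8] of CPUID leaf 0xB.
#[derive(Debug, PartialEq)]
enum LevelType {
    Invalid,
    Smt,
    Core,
    Other(u32),
}

/// One decoded subleaf of leaf 0xB. Hypothetical type for this sketch.
#[derive(Debug, PartialEq)]
struct TopologyLevel {
    /// EAX[4:0]: bits to shift the APIC ID right to reach the next level.
    shift: u32,
    /// EBX[15:0]: logical processors at this level.
    logical_processors: u32,
    /// ECX[7:0]: the subleaf number echoed back.
    level_number: u32,
    /// ECX[15:8]: what kind of level this is.
    level_type: LevelType,
}

fn decode_leaf_b(eax: u32, ebx: u32, ecx: u32) -> TopologyLevel {
    let level_type = match (ecx >> 8) & 0xFF {
        0 => LevelType::Invalid,
        1 => LevelType::Smt,
        2 => LevelType::Core,
        other => LevelType::Other(other),
    };
    TopologyLevel {
        shift: eax & 0x1F,
        logical_processors: ebx & 0xFFFF,
        level_number: ecx & 0xFF,
        level_type,
    }
}

/// Per the APM text quoted above, an unimplemented level returns zero in
/// EAX, so a guest may treat a subleaf like this as not present.
fn reads_as_unimplemented(eax: u32) -> bool {
    eax == 0
}
```

Subleaf 0 decodes as an SMT level with a shift of 1 and two logical processors. Subleaf 1 claims to be a core level in `ecx`, yet `reads_as_unimplemented` is true for its `eax`, which is the contradiction described above.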
RFD 505 proposes that instances should be able to set a "minimum hardware platform" or "minimum CPU platform" that allows users to constrain an instance to run on sleds that have a specific set of CPU features available. Previously, actually-available CPU information was plumbed from sleds to Nexus. This actually adds a `min_cpu_platform` setting for instance creation and uses it to drive selection of guest CPUID leaves.
As-is, this moves VMs on Gimlets away from the bhyve-default CPUID leaves (which are effectively "host CPUID information, but features that are not or cannot be virtualized are masked out"), instead using the specific CPUID information set out in RFD 505. There is no provision for Turin yet, which instead gets CPUID leaves that look like Milan. Adding a set of CPUID information to advertise for an `amd_turin` CPU platform, from here, is fairly straightforward.
This does not have a mechanism to enforce specific CPU platform use or disuse, either in a silo or rack-wide. One could imagine a simple system oriented around "this silo is permitted to specify these minimum CPU platforms", but that leaves uncomfortable issues like: if silo A permits only Milan, and silo B permits Milan and Turin, all Milan CPUs are allocated already, and someone is attempting to create a new Milan-based VM in silo A, should this succeed using Turin CPUs, potentially starving silo B?
…platform as defined in Nexus
620d7c9
to
5cf7b9c
Compare
// TODO(gjc): eww. the correct way to do this is to write this as
//
// "AND sled.cpu_family = ANY ("
//
// and then just have one `param` which can be bound to a
// `sql_types::Array<SledCpuFamilyEnum>`
oh, i ought to take a swing at de-eww'ing this
@@ -612,6 +613,7 @@ mod tests {
external_ips: vec![],
disks: vec![],
boot_disk: None,
cpu_platform: None,
sure is weird that this file doesn't include a change to `SledReservationConstraintBuilder` like in `instance_start.rs`, huh?
that's a bug. this currently would allow a Turin to migrate onto a Milan (bad)
this materializes RFD 314 and in some respects, 505.
builds on #8725 for CPU family information, which is a stand-in for the notion of sled families and generations described in RFD 314. There are a few important details here where CPU platforms differ from the sled CPU family and I've differed from 314/505 (and need to update the RFDs to match). I'd not noticed the sheer volume of comments on https://github.com/oxidecomputer/rfd/pull/515 so I'm taking a pass through those and the exact bits in `MILAN_CPUID` may be further tweaked. I suspect the fixed array needs at least a few more tweaks anyway; cross-referencing RFD 314 turns out to make for awkward review and it's hard to eyeball the semantics of bits here (or which are to be filled in by some later component of the stack!)
As-is: I think this would be OK to merge but it is not quite as polished as I'd like it to be, so it's a real PR but I expect further changes.
hardware CPU families are less linear than Oxide CPU platforms.
We can (and do, in RFD 314) define Milan restrictively enough that we can present Turin (and probably later!) CPUs to guests "as if" they were Milan. Similarly I'd expect that Turin would be defined as roughly "Milan-plus-some-AVX-512-features" and pretty forward-compatible. Importantly these are related to but not directly representative of real CPUs; as an example I'd expect "Turin"-the-instance-CPU-platform to be able to run on a Turin Dense CPU. Conversely, there's probably not a reason to define a "Turin Dense" CPU platform since from a guest perspective they'd look about the same.
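As a sketch of the relationship described here: each instance CPU platform maps to the set of sled CPU families that can present it, and the set grows for older platforms. All type, variant and function names below are illustrative and do not match the actual Nexus definitions; treating an unset platform as "no constraint" is also an assumption.

```rust
/// Physical CPU family of a sled. Hypothetical variants for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SledCpuFamily {
    AmdMilan,
    AmdTurin,
    AmdTurinDense,
}

/// CPU platform requested for an instance. Note there is deliberately no
/// "Turin Dense" platform: guests would see about the same thing as Turin.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum InstanceCpuPlatform {
    AmdMilan,
    AmdTurin,
}

/// Sled families able to present the given platform to a guest.
fn compatible_sled_families(
    platform: InstanceCpuPlatform,
) -> &'static [SledCpuFamily] {
    match platform {
        // Milan is defined restrictively enough that newer parts can pose as it.
        InstanceCpuPlatform::AmdMilan => &[
            SledCpuFamily::AmdMilan,
            SledCpuFamily::AmdTurin,
            SledCpuFamily::AmdTurinDense,
        ],
        InstanceCpuPlatform::AmdTurin => {
            &[SledCpuFamily::AmdTurin, SledCpuFamily::AmdTurinDense]
        }
    }
}

/// Whether a sled of `family` may host (or receive by migration) an
/// instance. `None` is assumed to mean the instance has no constraint.
fn can_host(
    family: SledCpuFamily,
    platform: Option<InstanceCpuPlatform>,
) -> bool {
    match platform {
        None => true,
        Some(p) => compatible_sled_families(p).contains(&family),
    }
}
```

The same check would need to apply to migration targets as well as initial placement, which is the gap noted in the review comment about `SledReservationConstraintBuilder`.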
But at the same time the lineage through the AMD server part family splits at Zen 4 kind of, with Zen 4 vs Zen 4c-based parts and similar with Zen 5/c. It's somewhat hard (I think) to predict what workloads would be sensitive to this. And as #8730 gets into a bit, the details of a processor's packaging (core topology, frequency, cache size) can vary substantially even inside one CPU family. The important detail here is that we do not expect CPU platforms to cover these details and it would probably be cumbersome to try; if the instance's constraint is "I want AVX256, and I want to be on high-frequency-capable processors only", then it doesn't actually matter if it's run on a Turin or a Milan and to tie it to that CPU platform may be overly restrictive.
On instance CPU platforms, the hope is that by focusing on CPU features we're able to present a more linear path as the microarchitectures grow.
instance platforms aren't "minimum"
I've walked back the initial description of an instance's CPU platform as the "minimum CPU platform", but haven't updated RFD 314 to describe that or why quite yet. As present in other systems, "minimum CPU platform" would more analogously mean "can we put you on a Rome Gimlet or must we put you on a Milan Gimlet?", or "Genoa Cosmo vs Turin Cosmo?" - it doesn't seem possible to say "this instance must have AVX 512, but otherwise I don't care what kind of hardware it runs on.", but that's more what we mean by CPU platform.
In a "minimum CPU platform" interpretation, we could provide a bunch of Turin CPUID bits to a VM that said it wanted Milan. But since there's no upper bound here, if an OS has an issue with a future "Zen 14" or whatever, a user would discover that by their "minimum-Milan" instance getting scheduled on the new space-age processor and exploding on boot or something. OSes shouldn't do that, but...
Implementation-wise, this is really just about the names right now. You always get Milan CPUID leaves for the time being. When there are Turin CPUID leaves defined for the instance CPU platform, and Cosmos on which they make sense, this becomes more concrete.
bonus: probably shouldn't lie about topologies?
When it comes to forward/backwards compatibility between platforms and newer CPUs, we need to take a closer look than just "are exposed feature bits a subset of hardware feature bits". The really important one to me is leaf D subleaf 2: this talks about the format of XSAVE regions and includes for example the offset of the YMM save area. This could (but in practice does not) change from family to family. If it does, claiming the YMM save offset is 0x340 when it is actually 0x380 or something would lie to the guest about the CPU's expected behavior, and untold calamity likely would ensue.
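A rough sketch of the extra check this implies, beyond a feature-bit subset test: every XSAVE component the platform advertises in leaf D must have the same size and offset on the host. The `XsaveComponent` type and `check_xsave_layout` function are hypothetical, and the offsets in the usage note are the illustrative numbers from the paragraph above, not real hardware values.

```rust
/// Size and offset of one XSAVE state component, as reported in EAX and EBX
/// of a leaf 0xD subleaf. Hypothetical type for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct XsaveComponent {
    size: u32,
    offset: u32,
}

/// Checks that every (subleaf, component) pair the platform advertises is
/// laid out identically on the host. Returns a description of the first
/// mismatch found.
fn check_xsave_layout(
    platform: &[(u32, XsaveComponent)],
    host: &[(u32, XsaveComponent)],
) -> Result<(), String> {
    for (subleaf, advertised) in platform {
        let actual = host.iter().find(|(s, _)| s == subleaf).map(|(_, c)| c);
        match actual {
            Some(c) if c == advertised => {}
            Some(c) => {
                return Err(format!(
                    "leaf 0xD subleaf {}: platform claims size {:#x} at offset {:#x}, host has size {:#x} at offset {:#x}",
                    subleaf, advertised.size, advertised.offset, c.size, c.offset
                ));
            }
            None => {
                return Err(format!(
                    "leaf 0xD subleaf {}: advertised by platform but absent on host",
                    subleaf
                ));
            }
        }
    }
    Ok(())
}
```

For example, a platform entry `(2, XsaveComponent { size: 0x100, offset: 0x340 })` checked against a host reporting offset `0x380` for subleaf 2 returns an error, so that host would not be accepted as a stand-in for the platform.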
RFD 314 deserves a section outlining how we determine a new CPU can be a stand-in for an older platform, talking about this and other leaves (particularly if we pass topology information through, elsewhere).