Checkpoint & Restore between machines with different CPU features #11486
Comments
Yeah, we need to add support for CPU feature leveling, i.e. allow users to specify the CPU feature set the application can use. If runsc users want to checkpoint/restore across hosts, they would define that set as the lowest common denominator of the two CPUs. PRs are much appreciated! What you have described seems acceptable. The annotation approach is good.
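To make the "lowest common denominator" idea concrete, here is a minimal sketch of computing the feature set shared by several hosts. The function name `commonFeatures` and the string feature names are illustrative assumptions, not gVisor's internal representation.

```go
package main

import (
	"fmt"
	"sort"
)

// commonFeatures returns the feature names present on every host, sorted.
// Hypothetical helper: feature names are plain strings for illustration only.
func commonFeatures(hosts ...[]string) []string {
	if len(hosts) == 0 {
		return nil
	}
	count := make(map[string]int)
	for _, h := range hosts {
		seen := make(map[string]bool)
		for _, f := range h {
			// Count each feature at most once per host.
			if !seen[f] {
				seen[f] = true
				count[f]++
			}
		}
	}
	var out []string
	for f, n := range count {
		// Keep only features reported by all hosts.
		if n == len(hosts) {
			out = append(out, f)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	newer := []string{"sse4_2", "avx", "avx2", "avx512f"}
	older := []string{"sse4_2", "avx", "avx2"}
	fmt.Println(commonFeatures(newer, older))
}
```

The resulting list is what a user would hand to the sandbox as the allowed feature set when the same checkpoint image must restore on both hosts.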
Currently, runsc compares the host feature set of the machine used to checkpoint with that of the machine used to restore. If the former feature set is not a subset of the latter, restore will fail, because guest user code (gr3) might mistakenly use an unsupported instruction. Thanks to gVisor's cpuid faulting support, we are able to intercept the cpuid instruction from gr3 and generate the result by referring to the host feature set recorded during sandbox boot. This patch adds support for exposing only the given CPU features by removing the other features from the list when storing the host feature set. This makes it possible to checkpoint & restore on machines with different CPU features. The list of enabled features is passed via an annotation. Updates google#11486 Signed-off-by: Tianyu Zhou <[email protected]>
Currently, runsc compares the host feature set between the machine used for checkpointing and the machine used for restoring. If the former feature set is not a subset of the latter, the restore will fail, as user apps might mistakenly use unsupported instructions. Thanks to cpuid faulting support, it is possible to intercept the cpuid instruction from user apps and generate results referring to the host feature set recorded during sandbox boot. This patch adds support to expose only specified CPU features by removing features from the list when storing the host feature set. This makes it possible to checkpoint and restore on machines with different CPU features. The list of enabled features is passed via annotation. Updates google#11486 Signed-off-by: Tianyu Zhou <[email protected]>
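The subset check and the effect of trimming the recorded feature set can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: `featureSet`, `isSubset`, and `restrict` are hypothetical names, and features are modeled as plain strings.

```go
package main

import "fmt"

// featureSet is a hypothetical stand-in for the recorded host feature set.
type featureSet map[string]bool

func newFeatureSet(names ...string) featureSet {
	fs := make(featureSet)
	for _, n := range names {
		fs[n] = true
	}
	return fs
}

// isSubset reports whether every feature in a is also present in b.
func isSubset(a, b featureSet) bool {
	for f := range a {
		if !b[f] {
			return false
		}
	}
	return true
}

// restrict returns a copy of host that keeps only the allowed features.
// Allowed features that the host lacks are never added.
func restrict(host featureSet, allowed []string) featureSet {
	out := make(featureSet)
	for _, f := range allowed {
		if host[f] {
			out[f] = true
		}
	}
	return out
}

func main() {
	checkpointHost := newFeatureSet("sse4_2", "avx", "avx2", "avx512f")
	restoreHost := newFeatureSet("sse4_2", "avx", "avx2")

	// Without trimming, the check fails: avx512f is missing on the restore host.
	fmt.Println(isSubset(checkpointHost, restoreHost))

	// With the recorded set trimmed at boot, the same check passes.
	recorded := restrict(checkpointHost, []string{"sse4_2", "avx", "avx2"})
	fmt.Println(isSubset(recorded, restoreHost))
}
```

Because the trimmed set is also what the emulated cpuid results are generated from, user apps in this model never learn about the removed features.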
Currently, runsc compares the host feature set between the machine used for checkpointing and the machine used for restoring. If the former feature set is not a subset of the latter, the restore will fail, as user apps might mistakenly use unsupported instructions. Thanks to cpuid faulting support, it is possible to intercept the cpuid instruction from user apps and generate results referring to the host feature set recorded during sandbox boot. This patch adds support to expose only specified CPU features by removing features from the list when storing the host feature set. This makes it possible to checkpoint and restore on machines with different CPU features. The list of enabled features is passed via annotation. It should be noted that currently, CPUID faulting is not supported on the arm64 architecture, and therefore, control over the CPU features exposed to user apps is also not supported. Updates google#11486 Signed-off-by: Tianyu Zhou <[email protected]>
PR #11498 has been submitted; please review it when you have time :)
Description
Currently, a checkpoint image created on a physical machine with a newer CPU may fail to restore on a physical machine with an older CPU because of missing CPU flags[1].
This has increased the complexity of using Checkpoint/Restore technology to accelerate container startup (one image, multiple containers). We either have to find a machine (or choose a VM) whose feature set is the largest subset common to all target machines and create the checkpoint image there, or we must create separate checkpoint images for each type of machine and distribute them according to the machine type.
Modal has encountered similar issues, which has led them to create multiple images[2].
Is this feature related to a specific bug?
No response
Do you have a specific solution in mind?
Thanks to gVisor's cpuid emulation, we can control the CPU features exposed to the user application (i.e., restrict them to the largest feature subset common to all CPUs in the cluster), which allows us to create only one checkpoint image. This has been widely used internally, and we hope to merge this feature into the mainline.
Currently, we use an annotation, dev.gvisor.internal.cpufeatures, inside config.json to pass the CPU features exposed to the user application. We also hope the gVisor community can give some input on which approach would be more general.
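As a rough sketch of how such an annotation could be consumed: the OCI config.json carries an `annotations` map of strings, so the runtime can look up the key and split the value into feature names. The value format is not stated in this issue; the comma-separated list below is purely an assumption, and `allowedFeatures` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// Annotation key taken from the issue text.
const cpuFeaturesAnnotation = "dev.gvisor.internal.cpufeatures"

// allowedFeatures extracts the feature list from the spec annotations.
// ASSUMPTION: the value is a comma-separated list of feature names; the
// actual format used by the patch is not described here.
func allowedFeatures(annotations map[string]string) ([]string, bool) {
	raw, ok := annotations[cpuFeaturesAnnotation]
	if !ok {
		// No annotation: the caller keeps the full host feature set.
		return nil, false
	}
	var out []string
	for _, f := range strings.Split(raw, ",") {
		if f = strings.TrimSpace(f); f != "" {
			out = append(out, f)
		}
	}
	return out, true
}

func main() {
	annotations := map[string]string{
		cpuFeaturesAnnotation: "sse4_2, avx, avx2",
	}
	feats, ok := allowedFeatures(annotations)
	fmt.Println(feats, ok)
}
```

Returning a separate boolean lets the caller distinguish "annotation absent" from "annotation present but empty", which would otherwise both look like an empty list.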