sandbox/virtcontainers: memory resource hotplug when create container. #786
Conversation
After this PR merges, I will create a new PR named "elastic memory hotplug" to do what's mentioned in #624 (comment)
@miaoyq @clarecch @bergwolf @jodh-intel
thanks! - added some suggestions.
virtcontainers/api.go
Outdated
@@ -112,6 +112,11 @@ func createSandboxFromConfig(ctx context.Context, sandboxConfig SandboxConfig, f
	}
}()

+	// get and store guest details
I think you are following the same example as s.createContainers(), but I think the function call describes well what it is doing.
Haha yes, nice catch!
virtcontainers/container.go
Outdated
@@ -178,7 +178,7 @@ type ContainerResources struct {
	VCPUs uint32

	// Mem is the memory that is being used by the container
-	MemMB uint32
+	Mem int64
Can you rename it to MemBytes?
I originally thought that if we don't have a suffix like MB or KB, the memory defaults to units of bytes. I can add a Byte suffix if it's needed. Is that necessary? @jcvenegas
Ah right, @clarecch, it wasn't that obvious to me at the beginning. But let's get a second opinion here; @jodh-intel wdyt?
Makefile
Outdated
@@ -124,7 +124,7 @@ DEFVCPUS := 1
 # Default maximum number of vCPUs
 DEFMAXVCPUS := 0
 # Default memory size in MiB
-DEFMEMSZ := 2048
+DEFMEMSZ := 128
This is the default initial memory of the guest. I think we still need a predefined memory size for when the memory cgroup is not defined.
I don't really get what the predefined memory is.
If we docker run without a memory limit, we get a 128MB VM, assigned when QEMU is set up.
If we docker run --memory 100M, we get a 128MB+100MB VM; the 100MB is hotplugged memory.
What else do we need?
@clarecch yes. For your use case, docker run without a memory limit means the container will have just 128MB? I feel that just 128MB for an unlimited container is very little memory, don't you think?
I mean, it is great for density to start all the unlimited containers with 128MB, but at the same time I'm not sure it is a good default for most of the container applications running today.
Yes, I don't know whether 128MB is a good choice for all container applications. Let's find a more suitable default memory. Do you have a better suggestion?
I'd say 1GB or 2GB (which is what we have today) would be good.
So I think that config is for the memory to be hot-added if there is no limit defined. But I also agree with starting the VM with minimal memory. So:
We start the VM with 128MB (which probably could be configurable).
We create the container; if memory.limit is not defined, start with a default size (1 or 2 GB).
Maybe we should add a new config field named SYSRESMEMSZ (system reserved memory), defaulting to 128MB, and keep DEFMEMSZ at 2GB. If memory.limit is not defined, start with DEFMEMSZ; if memory.limit is defined, start with SYSRESMEMSZ + memory.limit.
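For concreteness, a minimal Go sketch of that proposal (the function name, signature, and the nil-means-unlimited convention are hypothetical, not from this PR):

```go
// initialVMMemoryMiB sketches the proposed policy; the constants mirror
// the suggested defaults and nothing here is the PR's actual code.
// A nil limit stands in for "memory.limit not defined".
func initialVMMemoryMiB(memLimitBytes *int64) int64 {
	const sysResMemSzMiB = 128 // SYSRESMEMSZ: system-reserved VM base memory
	const defMemSzMiB = 2048   // DEFMEMSZ: default when no cgroup limit is set

	if memLimitBytes == nil {
		return defMemSzMiB
	}
	return sysResMemSzMiB + (*memLimitBytes >> 20)
}
```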
@jcvenegas When we create a pod, we want to start the VM with minimal memory, but the pause container has no memory limit defined, which means each pod would have a 1 or 2GB base memory size. That's why I chose 128MB before. @sboeuf @bergwolf @jodh-intel WDYT
@jodh-intel asked me a question, which raised a point we should make, and another one that springs to mind....
- Note that, although we (currently) hand the VM 2G of memory space, that does not mean it allocates and consumes 2G of host-side memory unless it is used. Sure, once you start doing 'interesting stuff' in your container it will probably consume it all by filling it with cache - but it should be noted that handing the VM more or less memory at startup does not carry a large host-side overhead, at least initially afaik.
- Generally we try to be as compatible with the other container runtimes (meaning pretty much runc) as we can. If you run a runc container or pod with no memory restrictions, then I believe you get 'all of host memory'. Restricting us to 128M will be the opposite extreme of that. I'm not sure that is a good thing - I'm pretty sure we will start getting a lot of reports of 'I ran my mysql DB and it fell over'.
We can't really 'win' here - if we hand the VM 128M by default, many workloads will fail under the 'default memory' situation. If we hand the VM 'all of host memory', then on large systems that does carry a lot of overhead for the VM to manage it, and then we will consume host memory pretty quickly, and launching more VMs will be refused (as by default the host kernels check that the RAM is available before they allow the VM to launch - they do a pre-overcommit check).
We have had a 2G default for a couple of years now (across clear containers and now kata), and have run into remarkably few situations where anybody has found it a problem. Do we have any prior experience to add here from the runv side (@gnawux maybe?) - maybe that can help guide us here.
128M worries me a little as it is pretty small. That would probably be fine if we knew all workloads we ever get have a defined memory requirement/resource/constraint - but, afaik, they don't (or it is not a guarantee). And, I don't think we know that in advance, so we can't even dynamically choose depending on the workload can we?
Also, see: https://github.com/kata-containers/runtime/blob/master/cli/config.go#L693 - we will fail if the memory level is "too low".
virtcontainers/container.go
Outdated
@@ -984,7 +984,8 @@ func (c *Container) update(resources specs.LinuxResources) error {

 	newResources := ContainerResources{
 		VCPUs: uint32(utils.ConstraintsToVCPUs(*resources.CPU.Quota, *resources.CPU.Period)),
-		MemMB: uint32(*resources.Memory.Limit >> 20),
+		// do page align to memory, as cgroup memory.limit_in_bytes will be aligned to page when effect
+		Mem: (*resources.Memory.Limit >> 12) << 12,
Can this be a constant? Also, I don't fully get what is done here; can you explain a little more?
If Mem is in units of MB, then 100MB+4096B is treated as equal to 100MB, which is not good.
If Mem is in units of bytes, then 100MB+1B is not equal to 100MB, which is not compatible with how the OCI-specified limit actually takes effect (the kernel page-aligns it).
So we do page alignment here: 100MB+1B becomes equal to 100MB, while 100MB+4096B stays 100MB+4096B.
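A minimal sketch of the alignment being discussed, assuming the 4KiB guest page size implied by the PR's ">> 12 << 12" (the helper name is illustrative, not the PR's):

```go
// pageAlign rounds a byte limit down to a 4KiB page boundary, mirroring
// what the kernel does when memory.limit_in_bytes takes effect.
func pageAlign(limit int64) int64 {
	const pageShift = 12 // 4096-byte pages
	return (limit >> pageShift) << pageShift
}
```

With this, pageAlign(100<<20 + 1) == 100<<20 while pageAlign(100<<20 + 4096) == 100<<20 + 4096, matching the behavior described above.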
virtcontainers/container.go
Outdated
	if memorySectionSizeMB == 0 {
		return mem, nil
// calculate hotplug memory size with memory block size of guestos
func (c *Container) calcHotplugMemSize(mem int64) (uint32, error) {
You probably want to add a comment header noting that this returns memory in mebibytes. calcHotplugMemMiBSize(memKB)?
Yes, I think it's better to rename to calcHotplugMemMiBSize
virtcontainers/container.go
Outdated
	return mem, nil
// calculate hotplug memory size with memory block size of guestos
func (c *Container) calcHotplugMemSize(mem int64) (uint32, error) {
	memoryBlockSize := int64(c.sandbox.state.GuestMemoryBlockSizeMB << 20)
This can be done after the if statement; 0 << 20 is still 0.
Do you mean we do the zero check first, then do the 20-bit left shift?
@clarecch yes :)
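Putting both suggestions together, a rough sketch of the reworked helper (the round-up behavior and the fallback path are our reading of the intent, not the PR's exact code):

```go
// calcHotplugMemMiBSize returns the amount of memory to hotplug, in MiB,
// rounded up to the guest's memory block size when one is reported.
func (c *Container) calcHotplugMemMiBSize(memByte int64) (uint32, error) {
	blockSizeMB := int64(c.sandbox.state.GuestMemoryBlockSizeMB)
	if blockSizeMB == 0 {
		// guest did not report a block size: plain bytes-to-MiB
		return uint32(memByte >> 20), nil
	}
	// shift only after the zero check (0 << 20 is still 0 anyway);
	// round up to a whole number of guest memory blocks
	blockSizeByte := blockSizeMB << 20
	blocks := (memByte + blockSizeByte - 1) / blockSizeByte
	return uint32((blocks * blockSizeByte) >> 20), nil
}
```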
virtcontainers/container.go
Outdated
	}

 func (c *Container) updateMemoryResources(oldResources, newResources ContainerResources) error {
-	oldMemMB := oldResources.MemMB
-	newMemMB := newResources.MemMB
+	oldMem := oldResources.Mem
I'd prefer oldMemKb
This is in units of bytes; see the reply above at #786 (comment).
@jodh-intel PTAL.
@linzichang can you create an issue to track "elastic memory hotplug"? I want to track the missing items related to memory. I'll open an issue and work on the memory-update case where memory is reduced.
@@ -1181,15 +1182,14 @@ func (c *Container) detachDevices() error {
	}

func (c *Container) addResources() error {
addResources is just a special case of updateResources. Could you add a new commit to combine the shared code so that we do not maintain two pieces of identical code?
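A rough sketch of the suggested refactor; the updateResources signature here is assumed by analogy with updateMemoryResources above, not taken from the PR:

```go
// addResources delegates to updateResources: adding resources to a new
// container is treated as an update from an empty resource set, so the
// hotplug logic lives in one place only.
func (c *Container) addResources() error {
	return c.updateResources(ContainerResources{}, c.config.Resources)
}
```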
virtcontainers/container.go
Outdated
@@ -984,7 +984,8 @@ func (c *Container) update(resources specs.LinuxResources) error {

 	newResources := ContainerResources{
 		VCPUs: uint32(utils.ConstraintsToVCPUs(*resources.CPU.Quota, *resources.CPU.Period)),
-		MemMB: uint32(*resources.Memory.Limit >> 20),
+		// do page align to memory, as cgroup memory.limit_in_bytes will be aligned to page when effect
+		Mem: (*resources.Memory.Limit >> 12) << 12,
The size is aligned to memoryBlockSize later on and the new value is used for memory hotplug. I wonder why the alignment is needed here?
see #786 (comment)
That doesn't answer my question. The size is aligned with memoryBlockSize right after the page-size alignment here. What is the point of having two alignments? The memoryBlockSize is a multiple of the page size; I don't see why you need to align with page size first.
In the case where we don't do page-size alignment first: if we set up a container with 100MB and then update the container memory to 100MB+1B, we will hotplug 1 memoryBlockSize of memory to the VM, which is not desirable. For the container's memory limit cgroup, the limit is effectively 100MB rather than 100MB+1B when you update a 100MB container to 100MB+1B (the Linux kernel page-aligns the memory limit cgroup when it takes effect; see the reference), so we don't have to hotplug a memoryBlockSize of memory to the VM for that 1B.
If we do page alignment here, we ignore the 1B but not the 4096B. That means we only hotplug a memoryBlockSize of memory to the VM if you update a 100MB memory-limit container to 100MB+4096B.
Thanks for the explanation. Now I see your point: you want to avoid wasting an extra memoryBlockSize of memory in the 100MB+1B case. And I think you should call os.Getpagesize() to find out the system page size rather than always assuming a 4KB page size. For example, aarch64 might use a 64KB page size.
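A sketch of that suggestion (note the caveat raised just below: os.Getpagesize() reports the host page size, while this alignment really wants the guest page size):

```go
import "os"

// pageAlignHost rounds limit down to the page size reported by the host;
// on aarch64 with CONFIG_ARM64_64K_PAGES this would be 64KiB, not 4KiB.
func pageAlignHost(limit int64) int64 {
	pageSize := int64(os.Getpagesize())
	return (limit / pageSize) * pageSize
}
```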
Oh, I hadn't thought about the page sizes of other architectures. I found the config CONFIG_ARM64_64K_PAGES. Does os.Getpagesize() return the host page size? Maybe what we need is the guest page size. @bergwolf WDYT
Yes, another field in the GetGuestDetails call ;)
And I think we should call GetGuestDetails when making the very first connection to the agent, and save all the fields we are interested in. Then we do not need to call it again at different places when these fields are needed.
Since it depends on agent gRPC changes, please add a TODO in the code, or create a new issue for it. Then we can handle it later and this PR can proceed without guest page size querying.
> And I think we should call GetGuestDetails when making the very first connection to the agent, and save all the fields we are interested in. Then we do not need to call it again at different places when these fields are needed.

Yes, this is what we do now. We call GetGuestDetails when creating the sandbox and save all the fields in the sandbox state.json. I will add a TODO comment in the code later.
@jcvenegas Do you plan to work on memory hot remove? I think it would be hard work. Looking forward to it.
@jcvenegas See #788
@clarecch Yes, hot remove would be hard work (I don't know if it's even possible). As a first step, to help return memory we could use virtio-balloon. See #790
edit: @linzichang @clarecch any blocker on this?
We have updated our code, please take a look and start the CI test.
virtcontainers/sandbox.go
Outdated
@@ -1253,12 +1253,6 @@ func (s *Sandbox) CreateContainer(contConfig ContainerConfig) (VCContainer, error) {
	return nil, err
}

-	// Store it.
-	err = c.storeContainer()
Why would you delete storeContainer() here? If updateResources does nothing, we won't have anyone storing the container.
Oh, yes, that's my omission.
updated
/test
Some of the CIs (f27, CentOS) are failing with the 'nested' issue. But I found this in the 16.04 logs:
@grahamwhaley Indeed, some testcases in CI need to be modified; I am handling this.
Hello @grahamwhaley, the reason the CI failed is that I have deleted the
TestCreateSandboxConfigFail will set
Thanks for the diagnosis @clarecch /cc @jcvenegas @devimc for input.
I think we can just change
@clarecch there isn't any restriction about the cgroup limit being 0 in the OCI spec (or I did not find one). Just being curious: I think the container process won't work, but the VM/pod will now be safe since the initial VM memory does not depend on the container limit. As a reference, Docker does not allow starting or updating the memory of a process to less than 4M, but this is an extra validation it does. You can still try to update with kata-runtime update if you want to see what would probably happen.
/test
Codecov Report
@@ Coverage Diff @@
## master #786 +/- ##
==========================================
- Coverage 66.09% 66.07% -0.02%
==========================================
Files 87 87
Lines 10505 10494 -11
==========================================
- Hits 6943 6934 -9
+ Misses 2863 2862 -1
+ Partials 699 698 -1
Overall the CI has no problem. Please help rebuild the
The code is all updated now, and the testcase kata-containers/tests#813 is also done.
Looking at the metrics CI logs, you can see it failed on a boot-time speed test:
The metrics are not perfect (they have some noise), so let's re-kick it and see if the failure is reproducible. If so, you may want to try the metrics report generator on your local system to see if there is a real speed impact, and then we can diagnose...
Thanks @grahamwhaley, I think the boot-time noise may be caused by memory hotplug at boot. And could you tell me how to deal
Hi @clarecch - ah, that
And for reference, I'm seeing other metrics jobs fail on other repos with that boot boundary condition, so it may not be this PR that is causing it; it could be something we landed recently. I'll check to see if I can find a pattern/culprit.
Thanks for checking @grahamwhaley
The CI is green now, PTAL
Two minor comments. Otherwise looks good to me.
-	MemMB: uint32(*resources.Memory.Limit >> 20),
+	// do page align to memory, as cgroup memory.limit_in_bytes will be aligned to page when effect.
+	// TODO use GetGuestDetails to get the guest OS page size.
+	MemByte: (*resources.Memory.Limit >> 12) << 12,
This is still assuming a 4KB page size. Please use os.Getpagesize() for it.
Is this comment still valid and if so does it still need to be addressed?
Yes, the comment and TODO are still valid; I will create an issue for this TODO.
-	resources.MemMB = uint32(*ocispec.Linux.Resources.Memory.Limit >> 20)
+	// do page align to memory, as cgroup memory.limit_in_bytes will be aligned to page when effect
+	// TODO use GetGuestDetails to get the guest OS page size.
+	resources.MemByte = (*ocispec.Linux.Resources.Memory.Limit >> 12) << 12
Same here. Please create a helper function and use os.Getpagesize() for it.
Yes, we talked about this before: we should get the guest page size here, so I have added a TODO to use GetGuestDetails to get the guest OS page size.
You mean we should use os.Getpagesize() for now? But the 12 here refers to the guest page size; it's not proper to use os.Getpagesize() here.
Ah, I remember it now. Sorry for the noise.
@jcvenegas Please have a look
LGTM
@sboeuf @jodh-intel please take a look
PTAL at both this PR and kata-containers/tests#813, which should be merged together.
lgtm
@@ -75,7 +75,7 @@ default_bridges = @DEFBRIDGES@

 # Default memory size in MiB for SB/VM.
 # If unspecified then it will be set @DEFMEMSZ@ MiB.
-#default_memory = @DEFMEMSZ@
+default_memory = @DEFMEMSZ@
Is its behavior changed? Not a big problem.
I changed this because KataConfig.Hypervisor[DefaultHypervisor].DefaultMemSz in kata-containers/tests#813 needs the value of default_memory from the toml file.
See https://github.com/kata-containers/tests/blob/master/config.go#L37
This might need to be release-noted then, as users with a /etc/kata-containers/configuration.toml file probably won't have that option enabled, since it's likely to be based on the current config file we distribute as /usr/share/defaults/kata-containers/configuration.toml.
/cc @egernst, @jcvenegas, @bergwolf.
I changed cli/config/configuration.toml.in just for the CI testcase, which I think will not affect users. So maybe there is no need for a release note for now, as long as the CI doesn't fail.
@jodh-intel @sboeuf Please take a look; if there is no big problem we can merge this PR and kata-containers/tests#813 now :)
/test
Signed-off-by: Clare Chen <[email protected]>
Signed-off-by: Zichang Lin <[email protected]>
/test
Hi @grahamwhaley, have you found out why
Hi @clarecch I've not found a specific root cause for why the metrics have started to fail. It looks like maybe our boot times have crept up and our footprint crept down, so I will have to adjust the bounds-check values to match the new 'status quo'. I opened an issue about it yesterday (kata-containers/ci#72) and will try to get the tweaks done today. In the meantime, the codecov drop looks like noise (-0.02% ;-)), so I'm going to merge this :-)
Thanks for helping @grahamwhaley; also please take a look at kata-containers/tests#813, which depends on this PR.
When creating a sandbox, we set up the sandbox with 128M base memory, and
then hotplug the memory that is needed for every new container. We also
change the unit of c.config.Resources.Mem from MiB to bytes in order to
prevent memory in the range 4095B < memory < 1MiB from being lost.
Fixes #400
Signed-off-by: Clare Chen [email protected]
Signed-off-by: Zichang Lin [email protected]
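As an illustration of why the unit change matters (a standalone sketch, not code from this PR): with MiB units everything below a whole MiB is truncated away, while page-aligned byte units lose only the sub-page remainder.

```go
package main

import "fmt"

func main() {
	limit := int64(1<<20 - 1) // 1MiB - 1B, i.e. 4095B < limit < 1MiB

	memMB := uint32(limit >> 20)   // MiB units: == 0, the whole limit is lost
	memByte := (limit >> 12) << 12 // byte units, page-aligned: == 1044480
	fmt.Println(memMB, memByte)    // prints: 0 1044480
}
```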