What happened?
Warning FailedCreatePodSandBox 31s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "d03d1785bad07f92d23677169acc40ecdd3ff90658d18c39ead55b010438fb4b": plugin type="multus" name="multus-cni-network" failed (add): [drun-test/llama2-master-0/f699f414-842c-40ab-8379-71710eac15c0:sriov-gpu20-enp40s0np0]: error adding container to network "sriov-gpu20-enp40s0np0": failed to set up IPAM plugin type "spiderpool" from the device "enp40s0np0": spiderpool IP allocation error: [POST /ipam/ip][500] postIpamIpFailure failed to allocate IP addresses in standard mode: failed to patch IP allocation results to Endpoint: Operation cannot be fulfilled on spiderendpoints.spiderpool.spidernet.io "llama2-master-0": the object has been modified; please apply your changes to the latest version and try again
What did you expect to happen?
The Pod is created successfully and Spiderpool allocates an IP address to it.
How to reproduce it (as minimally and precisely as possible)
PyTorch jobs are created in batches, and their Pods get sequential names, like the Pods of a stateful application (for example, `llama2-master-0`). If an administrator creates a set of jobs, cancels them quickly, and then creates a new set, an endpoint left over from an earlier Pod with the same name occasionally remains, and the IP address cannot be allocated.
Additional Context
Proposed solution: when the Pod UID recorded in an endpoint no longer exists, detect the stale endpoint object, delete or update it, and let GC clean up the old data.
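The detection step proposed above can be sketched roughly as follows. This is only an illustration in plain Python and not Spiderpool's implementation: the `Endpoint` class, its `pod_uid` field, the `find_stale_endpoints` helper, the `llama2-worker-0` name, and the `uid-*` values are all hypothetical stand-ins, and the sketch assumes each endpoint records the UID of the Pod it was created for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Simplified, hypothetical stand-in for a SpiderEndpoint object."""
    name: str      # same as the Pod name, e.g. "llama2-master-0"
    pod_uid: str   # UID of the Pod the endpoint was created for


def find_stale_endpoints(endpoints, live_pod_uids):
    """Return endpoints whose recorded Pod UID no longer belongs to a live Pod.

    endpoints: iterable of Endpoint
    live_pod_uids: mapping of Pod name -> UID of the Pod currently using that name
    """
    stale = []
    for ep in endpoints:
        # Either the Pod is gone, or a new Pod reuses the name with a new UID.
        # In both cases the endpoint describes a Pod that no longer exists.
        if live_pod_uids.get(ep.name) != ep.pod_uid:
            stale.append(ep)
    return stale


# Example: the job was cancelled and re-created, so the Pod name is reused
# but its UID differs from the one stored in the leftover endpoint.
endpoints = [
    Endpoint("llama2-master-0", "uid-old"),
    Endpoint("llama2-worker-0", "uid-w0"),
]
live_pods = {"llama2-master-0": "uid-new", "llama2-worker-0": "uid-w0"}

stale = find_stale_endpoints(endpoints, live_pods)
print([ep.name for ep in stale])  # ['llama2-master-0']
```

Comparing UIDs rather than names is what distinguishes a re-created Pod from the original one; the stale entries returned here would then be deleted or updated before a new IP allocation is attempted.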
ty-dc changed the title from "The working application was rebuilt and the same endpoint name was kept which made it impossible to assign an IP address." to "The job application was rebuilt and the same endpoint name was kept which made it impossible to assign an IP address." on Jul 8, 2024
Spiderpool Version
v0.9.3
Bug Type
IPAM
Main CNI
macvlan