The job application was rebuilt and the same endpoint name was kept which made it impossible to assign an IP address. #3699

ty-dc · 2024-07-08T02:29:31Z

Spiderpool Version

v0.9.3

Bug Type

IPAM

Main CNI

macvlan

What happened?

Warning  FailedCreatePodSandBox  31s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "d03d1785bad07f92d23677169acc40ecdd3ff90658d18c39ead55b010438fb4b": plugin type="multus" name="multus-cni-network" failed (add): [drun-test/llama2-master-0/f699f414-842c-40ab-8379-71710eac15c0:sriov-gpu20-enp40s0np0]: error adding container to network "sriov-gpu20-enp40s0np0": failed to set up IPAM plugin type "spiderpool" from the device "enp40s0np0": spiderpool IP allocation error: [POST /ipam/ip][500] postIpamIpFailure  failed to allocate IP addresses in standard mode: failed to patch IP allocation results to Endpoint: Operation cannot be fulfilled on spiderendpoints.spiderpool.spidernet.io "llama2-master-0": the object has been modified; please apply your changes to the latest version and try again

What did you expect to happen?

The Pod is assigned an IP address and starts successfully.

How to reproduce it (as minimally and precisely as possible)

PyTorch jobs are created in batches, and their Pods are named with sequence numbers, like Pods in a stateful application (for example, llama2-master-0). When the administrator creates a set of tasks, quickly cancels them, and then creates a new set, an endpoint with the same name occasionally remains from the previous run, and the IP address cannot be allocated to the new Pod.

Additional Context

Proposed solution: when the Pod UID recorded in the endpoint no longer matches an existing Pod, detect the stale endpoint, delete or update it, and use GC to clean up the old data.
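To make the proposed check concrete, here is a minimal sketch of the detection step. It is not Spiderpool code: `EndpointRecord`, `find_stale_endpoints`, the UID strings, and the `llama2-worker-0` entry are hypothetical names used only for illustration, and a real implementation would read the endpoints and Pods from the cluster instead of from in-memory lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointRecord:
    """Hypothetical, simplified stand-in for a SpiderEndpoint object."""
    namespace: str
    name: str      # same as the Pod name, e.g. "llama2-master-0"
    pod_uid: str   # UID of the Pod the IP was allocated for


def find_stale_endpoints(endpoints, live_pod_uids):
    """Return endpoints whose recorded Pod UID matches no live Pod."""
    return [ep for ep in endpoints if ep.pod_uid not in live_pod_uids]


# The first Pod was deleted and a new Pod was created under the same name,
# so the leftover endpoint still carries the old UID (assumed values).
endpoints = [
    EndpointRecord("drun-test", "llama2-master-0", "uid-old"),
    EndpointRecord("drun-test", "llama2-worker-0", "uid-w0"),
]
live_pod_uids = {"uid-new", "uid-w0"}

stale = find_stale_endpoints(endpoints, live_pod_uids)
for ep in stale:
    # A real controller would delete or update the endpoint here.
    print(f"stale endpoint: {ep.namespace}/{ep.name} (pod UID {ep.pod_uid})")
```

Comparing by UID rather than by name is the point of the sketch: the rebuilt Pod reuses the name, so only the UID distinguishes the old allocation from the new one.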

ty-dc · 2024-08-29T02:55:26Z

fix #3778

ty-dc added the kind/bug label Jul 8, 2024

ty-dc assigned lou-lan, cyclinder and ty-dc Jul 8, 2024

ty-dc changed the title ~~The working application was rebuilt and the same endpoint name was kept which made it impossible to assign an IP address.~~ The job application was rebuilt and the same endpoint name was kept which made it impossible to assign an IP address. Jul 8, 2024

ty-dc closed this as completed Aug 29, 2024
