
EPP silently drops pods when container ports don't match InferencePool targetPorts #2562

@hexfusion


What happened:

When a pod matches an InferencePool's label selector but none of its container ports intersect with the pool's targetPorts, the pod is silently ignored: no endpoint is created in the datastore and no warning is logged. The pod appears healthy (Running, labels match) but never receives traffic.

In `PodUpdateOrAddIfNotExist` [1], endpoints are only created for ports listed in `pool.TargetPorts`.

If no ports match, the resulting `pods` is empty and nothing is stored. No log is emitted.
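A minimal sketch of the filtering behavior described above, assuming simplified stand-in types. `Pod`, `Endpoint`, and `endpointsForPod` are hypothetical names for illustration, not the actual EPP datastore types or code. It shows that a pod whose container ports do not intersect the pool's target ports yields no endpoints, and nothing reports it.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified stand-in for corev1.Pod: a name plus
// the container ports declared across its containers.
type Pod struct {
	Name           string
	ContainerPorts []int32
}

// Endpoint is a hypothetical stand-in for a datastore entry keyed by pod and port.
type Endpoint struct {
	PodName string
	Port    int32
}

// endpointsForPod returns one endpoint per pool target port that the pod
// also declares. Ports the pod declares outside targetPorts are ignored.
func endpointsForPod(pod Pod, targetPorts []int32) []Endpoint {
	declared := make(map[int32]bool, len(pod.ContainerPorts))
	for _, p := range pod.ContainerPorts {
		declared[p] = true
	}
	var eps []Endpoint
	for _, tp := range targetPorts {
		if declared[tp] {
			eps = append(eps, Endpoint{PodName: pod.Name, Port: tp})
		}
	}
	return eps
}

func main() {
	targetPorts := []int32{8000}
	pod := Pod{Name: "vllm-prefill-0", ContainerPorts: []int32{8300}}

	// No intersection: eps is empty, and this sketch (like the reported
	// behavior) says nothing about it.
	eps := endpointsForPod(pod, targetPorts)
	fmt.Println("endpoints:", len(eps))
}
```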

What you expected to happen:

To assist in debugging, EPP should consider logging a warning when a pod that matches the pool's label selector produces zero active endpoints.

How to reproduce it (as minimally and precisely as possible):

  1. Create an InferencePool with targetPorts: [8000]
  2. Deploy a pod that matches the pool's label selector but listens on a different port (e.g., 8300)
  3. Observe that the pod is running and labels match, but EPP never routes traffic to it
  4. No warning appears in EPP logs

This scenario arises in practice when using `hostNetwork: true` for RDMA: prefill and decode pods share the host network namespace, which forces a port remap to avoid conflicts. If targetPorts isn't updated to match, the remapped pod is silently dropped. See llm-d/llm-d#632 [2].
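To make the hostNetwork case concrete, here is a sketch assuming a decode pod that keeps port 8000 and a prefill pod remapped to 8300 (the port numbers follow the reproduction steps; the pod names and the decode pod's port are assumptions). It uses the same simplified port-intersection rule described in this report, via a hypothetical `matchedPorts` helper, not EPP's actual code. With `targetPorts: [8000]` the prefill pod matches nothing; adding 8300 to `targetPorts` makes it visible.

```go
package main

import "fmt"

// matchedPorts returns the targetPorts that the pod's containers also declare
// (hypothetical, simplified version of the matching described in the issue).
func matchedPorts(containerPorts, targetPorts []int32) []int32 {
	declared := make(map[int32]bool, len(containerPorts))
	for _, p := range containerPorts {
		declared[p] = true
	}
	var out []int32
	for _, tp := range targetPorts {
		if declared[tp] {
			out = append(out, tp)
		}
	}
	return out
}

func main() {
	// With hostNetwork: true both pods share the node's network namespace,
	// so the prefill server is remapped off 8000 to avoid a port conflict.
	podPorts := map[string][]int32{
		"decode-0":  {8000},
		"prefill-0": {8300},
	}
	names := []string{"decode-0", "prefill-0"} // fixed order for stable output

	// Before and after updating the pool's targetPorts.
	for _, targetPorts := range [][]int32{{8000}, {8000, 8300}} {
		fmt.Println("targetPorts:", targetPorts)
		for _, name := range names {
			fmt.Printf("  %s -> %v\n", name, matchedPorts(podPorts[name], targetPorts))
		}
	}
}
```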

Anything else we need to know?:

Happy to submit a PR for this if the approach looks right.

Environment:

  • Kubernetes version (use kubectl version): k3s v1.32
  • Inference extension version (use git describe --tags --dirty --always): v1.3.1
  • Cloud provider or hardware configuration: Bare metal, 2-node k3s, Mellanox ConnectX-4 (RoCE v2)
  • Install tools: Helm
  • Others: hostNetwork required for RDMA device access

[1] `func (ds *datastore) PodUpdateOrAddIfNotExist(pod *corev1.Pod) bool`

[2] llm-d/llm-d#632

Labels: kind/bug, needs-triage