USHIFT-6902: Add PCP metrics collection to test scenarios #6658

Draft
pacevedom wants to merge 1 commit into openshift:main from pacevedom:pm-scenarios

Conversation

@pacevedom
Contributor

@pacevedom pacevedom commented May 12, 2026

Install and run Performance Co-Pilot on all online VMs during test execution, then collect the archives as artifacts alongside SOS reports. Controlled via SKIP_PCP environment variable (defaults to false).

Summary by CodeRabbit

  • New Features
    • Test images now include PCP packages so performance collection is available during runs.
  • Tests
    • Test scenarios support optional Performance Co-Pilot (PCP) collection to capture detailed system metrics during test runs.
    • PCP collection is enabled by default and can be disabled via configuration.
    • Performance metric archives are automatically collected and stored alongside test results.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 12, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 12, 2026
@openshift-ci-robot

openshift-ci-robot commented May 12, 2026

@pacevedom: This pull request references USHIFT-6902 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.


@openshift-ci
Contributor

openshift-ci Bot commented May 12, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@pacevedom
Contributor Author

/test ?

@openshift-ci
Contributor

openshift-ci Bot commented May 12, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pacevedom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 12, 2026
@coderabbitai
Contributor

coderabbitai Bot commented May 12, 2026

Walkthrough

The test scenario script adds optional PCP (Performance Co-Pilot) data collection controlled by SKIP_PCP (default false). New helpers start/restart PCP services on online VMs before tests and collect/archive PCP logs on exit. Test-agent containerfiles and image blueprints now install pcp and pcp-zeroconf.
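
The opt-out flag described above can be sketched as follows. This is a minimal illustration, not the PR's code: `pcp_guard` is a hypothetical helper introduced here only to show how the `SKIP_PCP` default and the `if "${SKIP_PCP}"` guard interact.

```bash
# Default to collecting PCP data unless the caller exports SKIP_PCP=true.
SKIP_PCP=${SKIP_PCP:-false}

# Hypothetical helper (not from the PR): reports which branch the guard takes.
pcp_guard() {
    # "${SKIP_PCP}" expands to the command name `true` or `false`,
    # so the `if` branches on that command's exit status.
    if "${SKIP_PCP}"; then
        echo "skip"
        return 0
    fi
    echo "collect"
}

pcp_guard
```

Because the variable's value is executed as a command, this pattern only behaves as intended when the value is exactly `true` or `false`; any other string would be run as a command name.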

Changes

PCP Collection Infrastructure

| Layer / File(s) | Summary |
|---|---|
| Feature flag & helpers<br>`test/bin/scenario.sh` (`SKIP_PCP` at top; new functions lines ~296–362) | Adds `SKIP_PCP=${SKIP_PCP:-false}`. Implements `start_pcp_on_all_vms()` to restart `pmcd` and `pmlogger` on online VMs, and `collect_pcp_reports()` to stop `pmlogger`, tar `/var/log/pcp/pmlogger`, and copy archives into each VM's `${vmdir}/pcp/`. Both are no-ops when `SKIP_PCP=true`. |
| Test-run wiring<br>`test/bin/scenario.sh` (`action_run` trap and flow ~1655–1672) | Calls `start_pcp_on_all_vms` before `scenario_run_tests` so PCP runs during tests. EXIT trap extended to invoke `collect_pcp_reports` (errors ignored) alongside existing JUnit and SOS collection. |
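
The test-run wiring summarized in this table can be sketched roughly as below. The three helper bodies are hypothetical stand-ins (the real ones live in `test/bin/scenario.sh`), and `run_scenario` is an illustrative name rather than the PR's `action_run`; the point is only that an EXIT trap runs the collection step whether or not the tests succeed.

```bash
# Hypothetical stand-ins for the real helpers in scenario.sh.
start_pcp_on_all_vms() { echo "start pcp"; }
scenario_run_tests()   { echo "run tests"; return 1; }  # simulate a test failure
collect_pcp_reports()  { echo "collect pcp"; }

# The parenthesized body runs in a subshell, so the EXIT trap fires when
# that subshell ends, after either a passing or a failing test run.
run_scenario() (
    trap 'collect_pcp_reports || true' EXIT
    start_pcp_on_all_vms
    scenario_run_tests
)

run_scenario || echo "tests failed, archives still collected"
```

The `|| true` inside the trap mirrors the "errors ignored" behaviour described in the summary: a collection failure does not change the exit status of the test run.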

Test-agent images and blueprints

| Layer / File(s) | Summary |
|---|---|
| Containerfiles: install PCP packages<br>`test/image-blueprints-bootc/*/*-test-agent.containerfile`, `test/image-blueprints/upstream/*/*-test-agent.containerfile` | Updated `dnf install` lines to include `pcp` and `pcp-zeroconf` alongside `microshift-test-agent` across affected containerfiles. |
| Image blueprints: add pcp-zeroconf<br>`test/image-blueprints/layer1-base/group1/rhel96.toml`, `test/image-blueprints/layer1-base/group1/rhel98.toml` | Added `[[packages]] { name = "pcp-zeroconf"; version = "*" }` to the listed blueprints so images include the zeroconf package. |

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (11 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding PCP metrics collection to test scenarios, which is the primary focus of all file modifications. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | Custom check for Ginkgo test name stability is not applicable. PR modifies only bash scripts, containerfiles, and TOML blueprints; no Ginkgo test definitions present. |
| Test Structure And Quality | ✅ Passed | This PR modifies shell scripts, containerfiles, and TOML configuration files only. It contains no Ginkgo test code (`*_test.go` files). The custom check is not applicable to this PR's changes. |
| Microshift Test Compatibility | ✅ Passed | This PR contains no new Ginkgo e2e test declarations. Changes are infrastructure/tooling only: PCP metrics collection, container image configs, and test blueprints. The check is not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | This PR adds no Ginkgo e2e tests. Changes are shell scripts, containerfiles, and blueprint configs for PCP metrics collection. SNO compatibility check not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Changes are test infrastructure only (scripts, container images, blueprints). No deployment manifests, operator code, or controllers are modified. |
| Ote Binary Stdout Contract | ✅ Passed | Custom check not applicable. PR modifies only shell scripts and configuration files, not Go test code or OTE binaries. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR does not add Ginkgo e2e tests. Changes are PCP metrics infrastructure only: shell scripts, containerfiles, and TOML blueprints. Check is not applicable. |


@pacevedom
Contributor Author

/test e2e-aws-tests-periodic

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/bin/scenario.sh`:
- Around line 296-323: The dnf install in start_pcp_on_all_vms can hang; update
the command executed by run_command_on_vm (inside function start_pcp_on_all_vms)
to wrap the package installation with a timeout (e.g., use the timeout utility
such as timeout 300s sudo dnf install -y pcp pcp-zeroconf) so a stalled install
will abort and return non-zero; ensure the wrapped command replaces the existing
"sudo dnf install -y pcp pcp-zeroconf" invocation so failures/timeouts propagate
and trigger the existing warning for that VM.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 9595bcb4-1d7c-476f-82df-5e49a6354b46

📥 Commits

Reviewing files that changed from the base of the PR and between db194f3 and af215d1.

📒 Files selected for processing (1)
  • test/bin/scenario.sh

Comment thread test/bin/scenario.sh
Comment on lines +296 to +323
start_pcp_on_all_vms() {
    if "${SKIP_PCP}"; then
        echo "Skipping PCP collection"
        return 0
    fi

    for vmdir in "${SCENARIO_INFO_DIR}"/"${SCENARIO}"/vms/*; do
        if [ ! -d "${vmdir}" ]; then
            continue
        fi

        local vmname
        vmname=$(basename "${vmdir}")
        local ip
        ip=$(cat "$(vm_property_filename "${vmname}" "ip")" 2>/dev/null) || true

        if [ -z "${ip}" ]; then
            continue
        fi

        echo "Starting PCP collection on ${vmname}"
        run_command_on_vm "${vmname}" \
            "rpm -q pcp-zeroconf >/dev/null 2>&1 || sudo dnf install -y pcp pcp-zeroconf; \
             sudo systemctl restart pmcd; \
             sudo systemctl restart pmlogger" \
            || echo "WARNING: Failed to start PCP on ${vmname}"
    done
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add timeout protection for package installation.

The dnf install command on line 318 lacks timeout protection and could hang indefinitely if repositories are slow or unavailable. Since this function runs before test execution (line 1667), a hang would block the entire test suite.

🛡️ Proposed fix

Wrap the command with timeout:

         echo "Starting PCP collection on ${vmname}"
         run_command_on_vm "${vmname}" \
-            "rpm -q pcp-zeroconf >/dev/null 2>&1 || sudo dnf install -y pcp pcp-zeroconf; \
+            "rpm -q pcp-zeroconf >/dev/null 2>&1 || sudo timeout 5m dnf install -y pcp pcp-zeroconf; \
              sudo systemctl restart pmcd; \
              sudo systemctl restart pmlogger" \
             || echo "WARNING: Failed to start PCP on ${vmname}"
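
To illustrate how the `timeout` utility reports a stalled command, here is a small sketch. `run_with_limit` is a hypothetical wrapper that is not part of the PR, and `sleep 5` merely stands in for a hung `dnf install`; the sketch relies on GNU coreutils `timeout`, which exits with status 124 when the time limit is reached.

```bash
# Hypothetical wrapper (not from the PR): runs a command under a time limit
# and warns when the limit, rather than the command itself, ended the run.
run_with_limit() {
    local limit=$1
    shift
    local rc=0
    timeout "${limit}" "$@" || rc=$?
    if [ "${rc}" -eq 124 ]; then
        echo "WARNING: '$*' timed out after ${limit}" >&2
    fi
    return "${rc}"
}

# `sleep 5` stands in for a stalled package installation.
run_with_limit 1s sleep 5 || echo "install step failed with status $?"
```

One caveat, assuming `run_command_on_vm` returns the exit status of the remote command string: the steps in the reviewed command are joined with `;`, so the status the caller sees is that of the final `systemctl restart pmlogger`, and a timed-out install would only trigger the warning if a later step also fails. Joining the steps with `&&` is one way to make an earlier failure visible.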

@pacevedom
Contributor Author

/test e2e-aws-tests-periodic

1 similar comment
@pacevedom
Contributor Author

/test e2e-aws-tests-periodic

Bake pcp and pcp-zeroconf into all base images (ostree blueprints and
bootc Containerfiles) so Performance Co-Pilot is available on every
test VM. At test time, scenario.sh starts pmcd/pmlogger before tests
and collects the PCP archives as artifacts alongside SOS reports.
Controlled via SKIP_PCP environment variable (defaults to false).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
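
The packaging step named in this commit message can be tried locally with the sketch below. Everything here is an assumption made for illustration: temporary directories stand in for the VM's `/var/log/pcp/pmlogger` and for the per-VM artifact directory, `demo_collect` is a hypothetical function, the archive file name is a placeholder, and the extraction step is one possible way to inspect the result rather than something the PR is confirmed to do.

```bash
# Hypothetical local demonstration of packaging and unpacking PCP archives.
demo_collect() {
    local workdir src vmdir
    workdir=$(mktemp -d)
    src="${workdir}/pmlogger"        # stands in for /var/log/pcp/pmlogger
    vmdir="${workdir}/vms/host1"     # stands in for the per-VM artifact dir
    mkdir -p "${src}/host1" "${vmdir}/pcp"
    echo "sample" > "${src}/host1/20260512.0"   # placeholder archive file

    # Package relative to the log directory so the tarball holds no absolute paths.
    tar czf "${workdir}/pcp-archives.tar.gz" -C "${src}" .

    # Unpack next to the other artifacts and list what arrived.
    tar xzf "${workdir}/pcp-archives.tar.gz" -C "${vmdir}/pcp"
    (cd "${vmdir}/pcp" && find . -type f)

    rm -rf "${workdir}"
}

demo_collect
```

Using `-C` with a relative `.` keeps the archive layout independent of where the logs lived on the VM, which makes the unpacked tree easy to place under any artifact directory.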
@pacevedom
Contributor Author

/test e2e-aws-tests-periodic

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/bin/scenario.sh`:
- Around line 353-355: The tar invocation currently swallows failures via "||
true"; change the run_command_on_vm "${vmname}" "sudo tar czf
/tmp/pcp-archives.tar.gz -C /var/log/pcp/pmlogger ." call to capture its exit
status, and if non-zero emit a clear warning (e.g., printf or process logger)
including the vmname and the tar command output, mark/record that PCP packaging
for that VM failed (set a variable or append the vmname to a failures list) and
continue rather than silently ignoring the error; use the run_command_on_vm
call, vmname, and the tar command string to locate where to implement this
logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 1e7ed5a3-3544-448c-822e-3f77b1e0f49e

📥 Commits

Reviewing files that changed from the base of the PR and between af215d1 and 4e34c84.

📒 Files selected for processing (8)
  • test/bin/scenario.sh
  • test/image-blueprints-bootc/el10/layer1-base/group1/rhel102-test-agent.containerfile
  • test/image-blueprints-bootc/el9/layer1-base/group1/rhel96-test-agent.containerfile
  • test/image-blueprints-bootc/el9/layer1-base/group1/rhel98-test-agent.containerfile
  • test/image-blueprints-bootc/upstream/group1/cos10-test-agent.containerfile
  • test/image-blueprints-bootc/upstream/group1/cos9-test-agent.containerfile
  • test/image-blueprints/layer1-base/group1/rhel96.toml
  • test/image-blueprints/layer1-base/group1/rhel98.toml

Comment thread test/bin/scenario.sh
Comment on lines +353 to +355
        run_command_on_vm "${vmname}" \
            "sudo tar czf /tmp/pcp-archives.tar.gz -C /var/log/pcp/pmlogger ." || true


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle PCP archive creation failures explicitly.

Suppressing tar errors here can mask why artifacts are missing and make collection diagnostics weaker. Fail this VM’s packaging step with a warning and continue.

Proposed fix
-        run_command_on_vm "${vmname}" \
-            "sudo tar czf /tmp/pcp-archives.tar.gz -C /var/log/pcp/pmlogger ." || true
+        if ! run_command_on_vm "${vmname}" \
+            "sudo rm -f /tmp/pcp-archives.tar.gz && \
+             sudo tar czf /tmp/pcp-archives.tar.gz -C /var/log/pcp/pmlogger ."; then
+            echo "WARNING: Failed to package PCP data on ${vmname}"
+            continue
+        fi
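
The review prompt for this thread also asks for failed VMs to be recorded. A minimal sketch of that idea follows, with loudly hypothetical pieces: the `run_command_on_vm` defined here is a stub that fails for a VM named `bad`, and `package_pcp_archives` is an illustrative function name, not one from `scenario.sh`.

```bash
# Hypothetical stub: pretends the remote command fails on the VM named "bad".
run_command_on_vm() {
    [ "$1" != "bad" ]
}

# Illustrative loop: warn per VM, remember the failures, keep going.
package_pcp_archives() {
    local failed="" vmname
    for vmname in "$@"; do
        if ! run_command_on_vm "${vmname}" \
            "sudo tar czf /tmp/pcp-archives.tar.gz -C /var/log/pcp/pmlogger ."; then
            echo "WARNING: Failed to package PCP data on ${vmname}" >&2
            failed="${failed} ${vmname}"
            continue
        fi
        echo "packaged ${vmname}"
    done
    # Summarize failures once, without failing the overall collection.
    if [ -n "${failed}" ]; then
        echo "PCP packaging failed on:${failed}"
    fi
}

package_pcp_archives host1 bad host2
```

Compared with `|| true`, this keeps collection best-effort while leaving a visible trace of which VMs produced no archive.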

@openshift-ci
Contributor

openshift-ci Bot commented May 13, 2026

@pacevedom: all tests passed!

