USHIFT-6902: Add PCP metrics collection to test scenarios#6658
USHIFT-6902: Add PCP metrics collection to test scenarios#6658pacevedom wants to merge 1 commit into
Conversation
|
@pacevedom: This pull request references USHIFT-6902 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
/test ? |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: pacevedom The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
WalkthroughThe test scenario script adds optional PCP (Performance Co-Pilot) data collection controlled by ChangesPCP Collection Infrastructure
Test-agent images and blueprints
Sequence DiagramEstimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
/test e2e-aws-tests-periodic |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/bin/scenario.sh`:
- Around line 296-323: The dnf install in start_pcp_on_all_vms can hang; update
the command executed by run_command_on_vm (inside function start_pcp_on_all_vms)
to wrap the package installation with a timeout (e.g., use the timeout utility
such as timeout 300s sudo dnf install -y pcp pcp-zeroconf) so a stalled install
will abort and return non-zero; ensure the wrapped command replaces the existing
"sudo dnf install -y pcp pcp-zeroconf" invocation so failures/timeouts propagate
and trigger the existing warning for that VM.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 9595bcb4-1d7c-476f-82df-5e49a6354b46
📒 Files selected for processing (1)
test/bin/scenario.sh
| start_pcp_on_all_vms() { | ||
| if "${SKIP_PCP}"; then | ||
| echo "Skipping PCP collection" | ||
| return 0 | ||
| fi | ||
|
|
||
| for vmdir in "${SCENARIO_INFO_DIR}"/"${SCENARIO}"/vms/*; do | ||
| if [ ! -d "${vmdir}" ]; then | ||
| continue | ||
| fi | ||
|
|
||
| local vmname | ||
| vmname=$(basename "${vmdir}") | ||
| local ip | ||
| ip=$(cat "$(vm_property_filename "${vmname}" "ip")" 2>/dev/null) || true | ||
|
|
||
| if [ -z "${ip}" ]; then | ||
| continue | ||
| fi | ||
|
|
||
| echo "Starting PCP collection on ${vmname}" | ||
| run_command_on_vm "${vmname}" \ | ||
| "rpm -q pcp-zeroconf >/dev/null 2>&1 || sudo dnf install -y pcp pcp-zeroconf; \ | ||
| sudo systemctl restart pmcd; \ | ||
| sudo systemctl restart pmlogger" \ | ||
| || echo "WARNING: Failed to start PCP on ${vmname}" | ||
| done | ||
| } |
There was a problem hiding this comment.
Add timeout protection for package installation.
The dnf install command on line 318 lacks timeout protection and could hang indefinitely if repositories are slow or unavailable. Since this function runs before test execution (line 1667), a hang would block the entire test suite.
🛡️ Proposed fix
Wrap the command with timeout:
echo "Starting PCP collection on ${vmname}"
run_command_on_vm "${vmname}" \
- "rpm -q pcp-zeroconf >/dev/null 2>&1 || sudo dnf install -y pcp pcp-zeroconf; \
+ "rpm -q pcp-zeroconf >/dev/null 2>&1 || sudo timeout 5m dnf install -y pcp pcp-zeroconf; \
sudo systemctl restart pmcd; \
sudo systemctl restart pmlogger" \
|| echo "WARNING: Failed to start PCP on ${vmname}"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| start_pcp_on_all_vms() { | |
| if "${SKIP_PCP}"; then | |
| echo "Skipping PCP collection" | |
| return 0 | |
| fi | |
| for vmdir in "${SCENARIO_INFO_DIR}"/"${SCENARIO}"/vms/*; do | |
| if [ ! -d "${vmdir}" ]; then | |
| continue | |
| fi | |
| local vmname | |
| vmname=$(basename "${vmdir}") | |
| local ip | |
| ip=$(cat "$(vm_property_filename "${vmname}" "ip")" 2>/dev/null) || true | |
| if [ -z "${ip}" ]; then | |
| continue | |
| fi | |
| echo "Starting PCP collection on ${vmname}" | |
| run_command_on_vm "${vmname}" \ | |
| "rpm -q pcp-zeroconf >/dev/null 2>&1 || sudo dnf install -y pcp pcp-zeroconf; \ | |
| sudo systemctl restart pmcd; \ | |
| sudo systemctl restart pmlogger" \ | |
| || echo "WARNING: Failed to start PCP on ${vmname}" | |
| done | |
| } | |
| start_pcp_on_all_vms() { | |
| if "${SKIP_PCP}"; then | |
| echo "Skipping PCP collection" | |
| return 0 | |
| fi | |
| for vmdir in "${SCENARIO_INFO_DIR}"/"${SCENARIO}"/vms/*; do | |
| if [ ! -d "${vmdir}" ]; then | |
| continue | |
| fi | |
| local vmname | |
| vmname=$(basename "${vmdir}") | |
| local ip | |
| ip=$(cat "$(vm_property_filename "${vmname}" "ip")" 2>/dev/null) || true | |
| if [ -z "${ip}" ]; then | |
| continue | |
| fi | |
| echo "Starting PCP collection on ${vmname}" | |
| run_command_on_vm "${vmname}" \ | |
| "rpm -q pcp-zeroconf >/dev/null 2>&1 || sudo timeout 5m dnf install -y pcp pcp-zeroconf; \ | |
| sudo systemctl restart pmcd; \ | |
| sudo systemctl restart pmlogger" \ | |
| || echo "WARNING: Failed to start PCP on ${vmname}" | |
| done | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/bin/scenario.sh` around lines 296 - 323, The dnf install in
start_pcp_on_all_vms can hang; update the command executed by run_command_on_vm
(inside function start_pcp_on_all_vms) to wrap the package installation with a
timeout (e.g., use the timeout utility such as timeout 300s sudo dnf install -y
pcp pcp-zeroconf) so a stalled install will abort and return non-zero; ensure
the wrapped command replaces the existing "sudo dnf install -y pcp pcp-zeroconf"
invocation so failures/timeouts propagate and trigger the existing warning for
that VM.
|
/test e2e-aws-tests-periodic |
1 similar comment
|
/test e2e-aws-tests-periodic |
Bake pcp and pcp-zeroconf into all base images (ostree blueprints and bootc Containerfiles) so Performance Co-Pilot is available on every test VM. At test time, scenario.sh starts pmcd/pmlogger before tests and collects the PCP archives as artifacts alongside SOS reports. Controlled via SKIP_PCP environment variable (defaults to false). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
/test e2e-aws-tests-periodic |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/bin/scenario.sh`:
- Around line 353-355: The tar invocation currently swallows failures via "||
true"; change the run_command_on_vm "${vmname}" "sudo tar czf
/tmp/pcp-archives.tar.gz -C /var/log/pcp/pmlogger ." call to capture its exit
status, and if non-zero emit a clear warning (e.g., printf or process logger)
including the vmname and the tar command output, mark/record that PCP packaging
for that VM failed (set a variable or append the vmname to a failures list) and
continue rather than silently ignoring the error; use the run_command_on_vm
call, vmname, and the tar command string to locate where to implement this
logic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 1e7ed5a3-3544-448c-822e-3f77b1e0f49e
📒 Files selected for processing (8)
test/bin/scenario.shtest/image-blueprints-bootc/el10/layer1-base/group1/rhel102-test-agent.containerfiletest/image-blueprints-bootc/el9/layer1-base/group1/rhel96-test-agent.containerfiletest/image-blueprints-bootc/el9/layer1-base/group1/rhel98-test-agent.containerfiletest/image-blueprints-bootc/upstream/group1/cos10-test-agent.containerfiletest/image-blueprints-bootc/upstream/group1/cos9-test-agent.containerfiletest/image-blueprints/layer1-base/group1/rhel96.tomltest/image-blueprints/layer1-base/group1/rhel98.toml
| run_command_on_vm "${vmname}" \ | ||
| "sudo tar czf /tmp/pcp-archives.tar.gz -C /var/log/pcp/pmlogger ." || true | ||
|
|
There was a problem hiding this comment.
Handle PCP archive creation failures explicitly.
Suppressing tar errors here can mask why artifacts are missing and make collection diagnostics weaker. Fail this VM’s packaging step with a warning and continue.
Proposed fix
- run_command_on_vm "${vmname}" \
- "sudo tar czf /tmp/pcp-archives.tar.gz -C /var/log/pcp/pmlogger ." || true
+ if ! run_command_on_vm "${vmname}" \
+ "sudo rm -f /tmp/pcp-archives.tar.gz && \
+ sudo tar czf /tmp/pcp-archives.tar.gz -C /var/log/pcp/pmlogger ."; then
+ echo "WARNING: Failed to package PCP data on ${vmname}"
+ continue
+ fi📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| run_command_on_vm "${vmname}" \ | |
| "sudo tar czf /tmp/pcp-archives.tar.gz -C /var/log/pcp/pmlogger ." || true | |
| if ! run_command_on_vm "${vmname}" \ | |
| "sudo rm -f /tmp/pcp-archives.tar.gz && \ | |
| sudo tar czf /tmp/pcp-archives.tar.gz -C /var/log/pcp/pmlogger ."; then | |
| echo "WARNING: Failed to package PCP data on ${vmname}" | |
| continue | |
| fi |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/bin/scenario.sh` around lines 353 - 355, The tar invocation currently
swallows failures via "|| true"; change the run_command_on_vm "${vmname}" "sudo
tar czf /tmp/pcp-archives.tar.gz -C /var/log/pcp/pmlogger ." call to capture its
exit status, and if non-zero emit a clear warning (e.g., printf or process
logger) including the vmname and the tar command output, mark/record that PCP
packaging for that VM failed (set a variable or append the vmname to a failures
list) and continue rather than silently ignoring the error; use the
run_command_on_vm call, vmname, and the tar command string to locate where to
implement this logic.
|
@pacevedom: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Install and run Performance Co-Pilot on all online VMs during test execution, then collect the archives as artifacts alongside SOS reports. Controlled via SKIP_PCP environment variable (defaults to false).
Summary by CodeRabbit