Handle sensitive information being inside a list in resource_dict #2178

Merged
myakove merged 22 commits into main from hash_cloudinit on Nov 27, 2024

Conversation

dbasunag
Contributor

@dbasunag dbasunag commented Oct 23, 2024

Short description:

The current hash_resource_dict() code does not provide a flexible way to hide fields such as userData for virtual machines, where the field can sit inside a list (for example under spec.template.spec.volumes).
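
To make the problem concrete, here is a minimal sketch of a VirtualMachine-like resource dictionary. Only the spec.template.spec.volumes path and the userData key come from the description above; the volume names, the cloudInitNoCloud wrapper, and the helper get_by_dict_keys are illustrative assumptions. Because volumes is a list, a lookup that only follows dictionary keys stops at the list and never reaches userData.

```python
from typing import Any

# Hypothetical VM-like resource dict; only the spec.template.spec.volumes path
# and the userData key come from the PR description, the rest is illustrative.
vm_dict: dict[str, Any] = {
    "kind": "VirtualMachine",
    "spec": {
        "template": {
            "spec": {
                "volumes": [
                    {"name": "rootdisk", "containerDisk": {"image": "example/image"}},
                    {"name": "cloudinit", "cloudInitNoCloud": {"userData": "password: secret"}},
                ]
            }
        }
    },
}


def get_by_dict_keys(data: Any, keys: list[str]) -> Any:
    """Follow dictionary keys only; return None as soon as a non-dict is hit."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return None
        data = data[key]
    return data


# The walk reaches the list stored under "volumes" and cannot continue by key.
path = ["spec", "template", "spec", "volumes", "cloudInitNoCloud", "userData"]
found = get_by_dict_keys(vm_dict, path)
print(found)  # None: the sensitive value is never reached, so it is never masked
```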

More details:
What this PR does / why we need it:
Which issue(s) this PR fixes:
Special notes for reviewer:
Bug:

Summary by CodeRabbit

  • New Features
    • Introduced a new function to securely replace sensitive values in nested dictionaries.
    • Updated key representation format for resource management.
    • Added a new property to access user data in virtual machine configurations.
  • Bug Fixes
    • Enhanced error handling for invalid inputs in the new function.
  • Tests
    • Implemented a comprehensive test suite for the new sensitive value replacement function.
  • Chores
    • Added a new testing environment configuration for unit tests.

@redhat-qe-bot2

Report bugs in Issues

The following are automatically added:

  • Add reviewers from OWNER file (in the root of the repository) under reviewers section.
  • Set PR size label.
  • New issue is created for the PR. (Closed when PR is merged/closed)
  • Run pre-commit if .pre-commit-config.yaml exists in the repo.

Available user actions:

  • To mark the PR as WIP, comment /wip on the PR; to remove the mark, comment /wip cancel.
  • To block merging of the PR, comment /hold; to unblock it, comment /hold cancel.
  • To mark the PR as verified, comment /verified; to un-verify it, comment /verified cancel.
    The verified label is removed on each new commit push.
  • To cherry-pick a merged PR, comment /cherry-pick <target branch to cherry-pick to> on the PR.
    • Multiple target branches can be cherry-picked, separated by spaces. (/cherry-pick branch1 branch2)
    • The cherry-pick starts when the PR is merged.
  • To build and push a container image, comment /build-and-push-container on the PR (the tag will be the PR number).
    • You can add extra args to the Podman build command
      • Example: /build-and-push-container --build-arg OPENSHIFT_PYTHON_WRAPPER_COMMIT=<commit_hash>
  • To add a label by comment use /<label name>, to remove, use /<label name> cancel
  • To assign reviewers based on OWNERS file use /assign-reviewers
  • To check if PR can be merged use /check-can-merge
Supported /retest check runs
  • /retest tox: Retest tox
  • /retest python-module-install: Retest python-module-install
  • /retest all: Retest all
Supported labels
  • hold
  • verified
  • wip
  • lgtm


coderabbitai bot commented Oct 23, 2024

Walkthrough

The changes in this pull request introduce a new function, replace_key_with_hashed_value, which recursively searches nested dictionaries to replace specified key values with a hashed representation. The hash_resource_dict method in the Resource class has been updated to utilize this new function, simplifying the process of masking sensitive information. Additionally, the keys_to_hash property has been modified in multiple classes to reflect a new keypath format. A new testing environment has been added to tox.toml, and a test suite for the new function has been implemented in tests/test_unittests.py.
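
The walkthrough does not show the function body, so the following is only a sketch of how a ">"-separated keypath could be applied to nested dictionaries and lists. The name mask_keypath, the mask string, and the choice to apply the remaining path to every element of a list are assumptions made for illustration; the merged replace_key_with_hashed_value may work differently.

```python
import copy
from typing import Any

MASK = "*******"  # assumed mask value


def mask_keypath(data: Any, keypath: str, sep: str = ">") -> Any:
    """Return a deep copy of data with the value at keypath replaced by MASK."""
    result = copy.deepcopy(data)
    _mask_in_place(result, keypath.split(sep))
    return result


def _mask_in_place(node: Any, keys: list[str]) -> None:
    if isinstance(node, list):
        # Apply the same remaining path to every list element.
        for item in node:
            _mask_in_place(item, keys)
        return
    if not isinstance(node, dict) or not keys or keys[0] not in node:
        return  # path does not exist in this branch; leave it untouched
    head, rest = keys[0], keys[1:]
    if not rest:
        node[head] = MASK
    else:
        _mask_in_place(node[head], rest)


# Illustrative inputs (not taken from the PR).
secret = {"spec": {"data": {"token": "abc"}, "other": 1}}
vm = {"spec": {"volumes": [{"name": "a"}, {"cloudInit": {"userData": "pw"}}]}}

masked_secret = mask_keypath(secret, "spec>data")
masked_vm = mask_keypath(vm, "spec>volumes>cloudInit>userData")
print(masked_secret)
print(masked_vm)
```

Working on a deep copy keeps the caller's dictionary intact, which matters when the masked copy is only meant for logging.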

Changes

File | Change Summary
ocp_resources/resource.py | Added method replace_key_with_hashed_value; updated hash_resource_dict method to use the new function; updated keys_to_hash property to change keypath format.
ocp_resources/sealed_secret.py | Updated keys_to_hash property to change return values from ["spec..data", "spec..encryptedData"] to ["spec>data", "spec>encryptedData"].
ocp_resources/virtual_machine.py | Added new property method keys_to_hash returning a specific path for user data in virtual machine configuration.
tox.toml | Added new testing environment validate-unittests with dependencies and commands for running unit tests.
tests/test_unittests.py | Introduced a test suite for replace_key_with_hashed_value function, including various tests for functionality and edge cases.

📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 1ed9e74 and 9a37a9b.

📒 Files selected for processing (2)
  • tests/test_unittests.py (1 hunks)
  • tox.toml (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/test_unittests.py
  • tox.toml


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
ocp_resources/virtual_machine.py (1)

179-181: LGTM! Consider adding documentation.

The implementation correctly identifies the sensitive field that needs to be masked. Consider adding a docstring to explain the purpose of this property and its security implications.

 @property
 def keys_to_hash(self):
+    """
+    Returns a list of keys containing sensitive information that should be masked.
+    
+    Returns:
+        list[str]: List of keys to be masked in the resource dictionary.
+        Currently masks 'userData' which contains sensitive boot-time configuration.
+    """
     return ["userData"]
ocp_resources/resource.py (1)

136-156: LGTM with suggestions for improvements.

The implementation correctly handles nested dictionaries and lists. However, consider these improvements:

  1. Add protection against circular references to prevent stack overflow
  2. Consider making the mask value configurable for flexibility

Here's a suggested improvement:

-def change_dict_value_to_hashed(resource_dict: Dict[Any, Any], key_name: str) -> Dict[Any, Any]:
+def change_dict_value_to_hashed(
+    resource_dict: Dict[Any, Any],
+    key_name: str,
+    mask: str = "******",
+    _seen: set | None = None
+) -> Dict[Any, Any]:
     """
     Recursively search a nested dictionary for a given key and changes its value to "******" if found.
 
     Args:
         resource_dict: The nested dictionary to search.
         key_name: The key to find.
+        mask: The value to use for masking sensitive data.
+        _seen: Internal parameter to track circular references.
 
     Returns:
         The modified dictionary.
     """
+    if _seen is None:
+        _seen = set()
+
+    # Handle circular references
+    if id(resource_dict) in _seen:
+        return resource_dict
+
     if isinstance(resource_dict, dict):
+        _seen.add(id(resource_dict))
         for key, value in resource_dict.items():
             if key == key_name:
-                resource_dict[key] = "******"
+                resource_dict[key] = mask
             elif isinstance(value, dict):
-                resource_dict[key] = change_dict_value_to_hashed(value, key_name)
+                resource_dict[key] = change_dict_value_to_hashed(value, key_name, mask, _seen)
             elif isinstance(value, list):
                 for key_list, value_list in enumerate(value):
-                    value[key_list] = change_dict_value_to_hashed(value_list, key_name)
+                    value[key_list] = change_dict_value_to_hashed(value_list, key_name, mask, _seen)
     return resource_dict
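
With the suggested diff applied, the function would read roughly as below. This is a reconstruction from the suggestion above, not the code that was merged; Optional[Set[int]] stands in for the diff's `set | None` annotation, and the self-referencing dictionary at the end is an illustrative input showing why the _seen set matters.

```python
from typing import Any, Dict, Optional, Set


def change_dict_value_to_hashed(
    resource_dict: Dict[Any, Any],
    key_name: str,
    mask: str = "******",
    _seen: Optional[Set[int]] = None,
) -> Dict[Any, Any]:
    """Mask every value stored under key_name, in place, skipping dicts already visited."""
    if _seen is None:
        _seen = set()

    # A dictionary whose id was already recorded is part of a cycle: stop here.
    if id(resource_dict) in _seen:
        return resource_dict

    if isinstance(resource_dict, dict):
        _seen.add(id(resource_dict))
        for key, value in resource_dict.items():
            if key == key_name:
                resource_dict[key] = mask
            elif isinstance(value, dict):
                resource_dict[key] = change_dict_value_to_hashed(value, key_name, mask, _seen)
            elif isinstance(value, list):
                for index, item in enumerate(value):
                    value[index] = change_dict_value_to_hashed(item, key_name, mask, _seen)
    return resource_dict


# Illustrative cyclic input: "self" points back at the outer dictionary.
cyclic: Dict[str, Any] = {"userData": "secret", "items": [{"userData": "hidden"}]}
cyclic["self"] = cyclic

change_dict_value_to_hashed(cyclic, "userData", mask="<masked>")
print(cyclic["userData"], cyclic["items"][0]["userData"])
```

Without the _seen check, the "self" entry would recurse until Python raises RecursionError. Whether resource dictionaries built from YAML or API responses can ever contain such cycles is a separate question that the suggestion does not address.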
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 0c5830d and dafe9d0.

📒 Files selected for processing (3)
  • ocp_resources/resource.py (2 hunks)
  • ocp_resources/sealed_secret.py (1 hunks)
  • ocp_resources/virtual_machine.py (1 hunks)
🧰 Additional context used
🔇 Additional comments (3)
ocp_resources/sealed_secret.py (1)

53-53: Verify the impact of simplified key paths.

The simplified key paths in keys_to_hash might affect how sensitive data is masked in the resource dictionary. Let's verify this change works correctly with the updated hash_resource_dict implementation.

Consider adding a docstring to the keys_to_hash property to document:

  1. The expected format of the keys
  2. How these keys are used in the resource dictionary masking process
  3. Why the "spec.." prefix was removed

This will help future maintainers understand the design decision and prevent accidental reintroduction of the prefix.

✅ Verification successful

Let me gather more information about how the key paths are actually processed in the codebase.

This will help us understand:

  1. How the key paths are processed in the hashing function
  2. If there are tests verifying this functionality
  3. Where the SealedSecret class is used in the codebase

Let me check one more thing to verify the test coverage and actual usage patterns.

This will help us understand:

  1. The test coverage for this functionality
  2. The complete context of SealedSecret class
  3. How these specific keys are actually used in hashing

Based on the gathered information, I can now provide the final response:

Simplified key paths in keys_to_hash are correct and safe

The change from ["spec..data", "spec..encryptedData"] to ["data", "encryptedData"] is correct because:

  1. The change_dict_value_to_hashed function performs a recursive dictionary search for exact key matches, making the "spec.." prefix unnecessary
  2. The SealedSecret class directly accepts data and encrypted_data as top-level parameters in its constructor
  3. The implementation aligns with the actual structure of the resource where sensitive data is stored

The change maintains security while simplifying the implementation, as the hashing mechanism in hash_resource_dict will still mask all sensitive information correctly.
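
A small sketch of what "exact key matches" means in practice. The function body is reconstructed from the review diffs in this conversation; the SealedSecret-like dictionary and its values are made up for illustration. One consequence worth noticing is that name-only matching masks every key with that name at any depth, which is broader than a path-style entry.

```python
from typing import Any, Dict


def change_dict_value_to_hashed(resource_dict: Dict[Any, Any], key_name: str) -> Dict[Any, Any]:
    """Recursively replace the value of every key equal to key_name with "******"."""
    if isinstance(resource_dict, dict):
        for key, value in resource_dict.items():
            if key == key_name:
                resource_dict[key] = "******"
            elif isinstance(value, dict):
                resource_dict[key] = change_dict_value_to_hashed(value, key_name)
            elif isinstance(value, list):
                for index, item in enumerate(value):
                    value[index] = change_dict_value_to_hashed(item, key_name)
    return resource_dict


# Illustrative SealedSecret-like dictionary (field values are made up).
sealed = {
    "metadata": {"name": "example", "annotations": {"data": "not secret, same key name"}},
    "spec": {"data": "c2VjcmV0", "encryptedData": {"password": "AgBexample"}},
}

for key_name in ["data", "encryptedData"]:
    change_dict_value_to_hashed(sealed, key_name)

print(sealed["spec"])                     # both sensitive fields are masked
print(sealed["metadata"]["annotations"])  # the unrelated "data" key is masked too
```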

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of keys_to_hash in the codebase
# and ensure it's compatible with the new implementation

# Check if there are any direct references to the old format "spec..data"
echo "Checking for old key format usage..."
rg -l "spec\.\.data|spec\.\.encryptedData" 

# Look for other implementations of keys_to_hash to ensure consistency
echo "Checking other keys_to_hash implementations..."
ast-grep --pattern 'def keys_to_hash'

# Check the Resource class implementation that uses these keys
echo "Checking Resource class implementation..."
rg -A 10 "def hash_resource_dict"

Length of output: 1254


Script:

#!/bin/bash
# Check the implementation of change_dict_value_to_hashed function
echo "Checking change_dict_value_to_hashed implementation..."
rg -A 15 "def change_dict_value_to_hashed"

# Check if there are any tests that verify this functionality
echo "Checking related tests..."
fd -e py test_sealed_secret test_resource --exec rg -l "keys_to_hash|hash_resource_dict" {}

# Check the actual usage of SealedSecret class
echo "Checking SealedSecret usage..."
rg -l "class SealedSecret"

Length of output: 1542


Script:

#!/bin/bash
# Check for test files with correct path
echo "Checking test files..."
fd -e py -g "*test*" --exec rg -l "keys_to_hash|hash_resource_dict" {}

# Check the complete SealedSecret class implementation
echo "Checking SealedSecret class implementation..."
rg -A 20 "class SealedSecret" ocp_resources/sealed_secret.py

# Check actual usage of hash_resource_dict with these keys
echo "Checking actual usage patterns..."
rg -B 5 -A 5 "hash_resource_dict.*data|hash_resource_dict.*encryptedData"

Length of output: 1177

ocp_resources/resource.py (2)

1198-1198: LGTM! Clean integration of the new function.

The change correctly integrates the new change_dict_value_to_hashed function while maintaining the existing behavior and safety measures (deep copy, condition checks).


136-156: Verify the impact on existing consumers.

Let's verify the impact of these changes on existing code.

Also applies to: 1198-1198

✅ Verification successful

Let me gather more information about the keys_to_hash property since the previous search didn't show its complete usage pattern.

This will help us understand:

  1. Where and how keys_to_hash is defined and used
  2. The usage pattern of hash_log_data which controls the hashing behavior
  3. Any potential impact on existing implementations

Based on the gathered information, I can now provide a final response as we have a clear picture of the usage and impact:

The changes to change_dict_value_to_hashed are safe and properly integrated

The function is used within a well-defined pattern where:

  • It's only called from the base Resource class's hash_resource_dict method
  • The hashing behavior is controlled by two factors:
    • The hash_log_data boolean parameter (defaults to True)
    • Resource-specific keys_to_hash property implementations
  • Only specific resource types implement keys_to_hash:
    • ConfigMap: ["data", "binaryData"]
    • VirtualMachine: ["userData"]
    • SealedSecret: ["data", "encryptedData"]
    • Secret: ["data", "stringData"]

The changes maintain the existing behavior while improving the implementation, and all current consumers are properly integrated with this pattern.
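
The pattern described above could be sketched as follows. The class names, the constructor, and the name-matching helper are simplified stand-ins written for this illustration; only the names hash_resource_dict, keys_to_hash, and hash_log_data, the default of True, and the deep copy come from the review text.

```python
import copy
from typing import Any, Dict, List


def _mask_key(data: Any, key_name: str) -> Any:
    """Stand-in helper: mask every value stored under key_name, in place."""
    if isinstance(data, dict):
        for key, value in data.items():
            if key == key_name:
                data[key] = "******"
            else:
                _mask_key(value, key_name)
    elif isinstance(data, list):
        for item in data:
            _mask_key(item, key_name)
    return data


class BaseResource:
    """Simplified stand-in for the base Resource class."""

    def __init__(self, hash_log_data: bool = True) -> None:
        self.hash_log_data = hash_log_data

    @property
    def keys_to_hash(self) -> List[str]:
        return []  # by default nothing is declared sensitive

    def hash_resource_dict(self, resource_dict: Dict[Any, Any]) -> Dict[Any, Any]:
        # Condition checks: skip masking when disabled or when no keys are declared.
        if not (self.hash_log_data and self.keys_to_hash):
            return resource_dict
        masked = copy.deepcopy(resource_dict)  # never mutate the caller's dict
        for key_name in self.keys_to_hash:
            _mask_key(masked, key_name)
        return masked


class SecretLike(BaseResource):
    """Hypothetical subclass declaring its sensitive keys."""

    @property
    def keys_to_hash(self) -> List[str]:
        return ["data", "stringData"]


original = {"kind": "Secret", "data": {"password": "cGFzcw=="}}
print(SecretLike().hash_resource_dict(original))
print(SecretLike(hash_log_data=False).hash_resource_dict(original))
```

Keeping the list of sensitive keys on each subclass means the base class needs no knowledge of individual resource schemas.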

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for usages of hash_resource_dict and keys_to_hash property

echo "Checking for direct calls to hash_resource_dict..."
rg -A 2 "hash_resource_dict\("

echo -e "\nChecking for classes that override keys_to_hash property..."
ast-grep --pattern 'class $_ {
  $$$
  @property
  def keys_to_hash($_) {
    $$$
  }
  $$$
}'

Length of output: 1839


Script:

#!/bin/bash
# Let's search for any references to keys_to_hash in the codebase
echo "Searching for keys_to_hash references..."
rg -A 2 "keys_to_hash"

# Also check for hash_log_data since it's used in conjunction
echo -e "\nSearching for hash_log_data references..."
rg -A 2 "hash_log_data"

Length of output: 2647

Review thread on ocp_resources/virtual_machine.py (outdated, resolved)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
ocp_resources/resource.py (1)

136-156: Improve function documentation and edge case handling.

The function implementation looks good, but there are a few suggestions for improvement:

  1. Documentation could be enhanced:

    • Add a note about in-place modification
    • Document the behavior for edge cases
    • Add example usage
  2. Consider handling edge cases:

    • None values
    • Non-dict/non-list objects in nested structures
  3. Performance optimization:

    • Consider using a generator for list iteration
    • Add early return for empty/None input

Here's how you could improve the implementation:

 def change_dict_value_to_hashed(resource_dict: Dict[Any, Any], key_name: str) -> Dict[Any, Any]:
     """
     Recursively search a nested dictionary for a given key and changes its value to "******" if found.
+    The function modifies the input dictionary in-place.
 
     Args:
         resource_dict: The nested dictionary to search.
         key_name: The key to find.
 
     Returns:
         The modified dictionary.
+
+    Example:
+        >>> d = {"a": {"sensitive": "secret"}, "b": [{"sensitive": "hidden"}]}
+        >>> change_dict_value_to_hashed(d, "sensitive")
+        {'a': {'sensitive': '******'}, 'b': [{'sensitive': '******'}]}
     """
+    if not resource_dict:
+        return resource_dict
+
     if isinstance(resource_dict, dict):
         for key, value in resource_dict.items():
             if key == key_name:
                 resource_dict[key] = "******"
-            elif isinstance(value, dict):
+            elif value is not None and isinstance(value, dict):
                 resource_dict[key] = change_dict_value_to_hashed(value, key_name)
-            elif isinstance(value, list):
+            elif value is not None and isinstance(value, list):
                 for key_list, value_list in enumerate(value):
-                    value[key_list] = change_dict_value_to_hashed(value_list, key_name)
+                    if value_list is not None:
+                        value[key_list] = change_dict_value_to_hashed(value_list, key_name)
     return resource_dict
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between dafe9d0 and 462ee06.

📒 Files selected for processing (1)
  • ocp_resources/resource.py (2 hunks)
🧰 Additional context used
🔇 Additional comments (2)
ocp_resources/resource.py (2)

1199-1199: LGTM! Clean implementation of sensitive data handling.

The modification correctly utilizes the new change_dict_value_to_hashed function to handle sensitive information in resource dictionaries.


1199-1199: Verify handling of sensitive fields across the codebase.

Let's ensure that all sensitive fields, particularly userData in VM specs, are properly handled by the new implementation.

Run the following script to verify the usage:

Review thread on ocp_resources/resource.py (outdated, resolved)
Review thread on ocp_resources/resource.py (outdated, resolved)
Review thread on ocp_resources/resource.py (resolved)
Review thread on ocp_resources/resource.py (outdated, resolved)
Review thread on ocp_resources/sealed_secret.py (outdated, resolved)
Review thread on ocp_resources/resource.py (outdated, resolved)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (1)
tests/test_unittests.py (1)

6-20: Consider enhancing the fixture with more edge cases

The fixture provides a good base structure, but consider adding more variations of sensitive data scenarios:

  • Nested lists within lists
  • Multiple feature entries
  • Empty or null values
  • Special characters in sensitive data
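
A sketch of what such an extended fixture could look like. It is written as a plain function so it runs without pytest; in tests/test_unittests.py it would presumably carry a @pytest.fixture decorator. The field names, the values, and the stand-in mask_key helper are invented for illustration and are not the PR's test data or its function under test.

```python
from typing import Any, Dict


def edge_case_resource_dict() -> Dict[str, Any]:
    """Hypothetical fixture data covering the edge cases listed above."""
    return {
        "spec": {
            # Multiple feature entries, the last one without the sensitive key.
            "features": [
                {"name": "a", "password": "p@ss\n\"quoted\" ünïcode"},  # special characters
                {"name": "b", "password": None},  # null value
                {"name": "c", "password": ""},  # empty value
                {"name": "d"},
            ],
            # Nested lists within lists.
            "matrix": [[{"password": "inner"}], []],
        }
    }


def mask_key(data: Any, key_name: str, mask: str = "******") -> Any:
    """Stand-in masking helper that also descends into lists nested in lists."""
    if isinstance(data, dict):
        for key in data:
            if key == key_name:
                data[key] = mask
            else:
                mask_key(data[key], key_name, mask)
    elif isinstance(data, list):
        for item in data:
            mask_key(item, key_name, mask)
    return data


masked = mask_key(edge_case_resource_dict(), "password")
passwords = [feature.get("password") for feature in masked["spec"]["features"]]
print(passwords)
print(masked["spec"]["matrix"])
```

In this sketch a key that is present is masked even when its value is None or empty, while an entry that lacks the key is left without it; a real test suite would need to pin down which of those behaviors the function under test is expected to have.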
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between f931eeb and 1ed9e74.

📒 Files selected for processing (2)
  • tests/test_unittests.py (1 hunks)
  • tox.toml (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tox.toml
🔇 Additional comments (1)
tests/test_unittests.py (1)

1-72: Consider adding security-focused test cases

Given that this function handles sensitive data, consider adding test cases that verify:

  1. No sensitive data leakage in error messages
  2. Handling of special characters that could be used in injection attacks
  3. Memory cleanup after processing sensitive data

Review thread on tests/test_unittests.py (outdated, resolved)
Review thread on tests/test_unittests.py (outdated, resolved)
Review thread on tests/test_unittests.py (outdated, resolved)
Review thread on tests/test_unittests.py (resolved)
Review thread on tox.toml (outdated, resolved)
@myakove
Collaborator

myakove commented Nov 26, 2024

And please check coderabbitai comments

@dbasunag
Contributor Author

/verified

@myakove myakove merged commit 9d20d55 into main Nov 27, 2024
5 of 6 checks passed
@myakove myakove deleted the hash_cloudinit branch November 27, 2024 13:54
7 participants