Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix encoding issue by removing BOM from provider output files #1302

Merged
merged 42 commits into from
Sep 19, 2024

Conversation

schrolla
Copy link
Collaborator

@schrolla schrolla commented Sep 6, 2024

🗣 Description

This update changes ScubaGear such that its tenant configuration captured in the provider output JSON (ProviderSettingsExport.json, by default) is encoded as a UTF-8 file without a byte order mark (BOM) rather than including a BOM.

Updates to the code include:

  • New functions Get-Utf8NoBOM and Set-Utf8NoBOM for reading and writing utf-8 encoded files without a BOM
  • Wrapper functions Invoke-ReadAllLines and Invoke-WriteAllLines that provide testability for use of the associated .NET System.IO functions
  • A new Utility powershell module for the above functions to support use across ScubaGear modules
  • PowerShell unit tests for the new methods covering pathing and escape sequence handling

💭 Motivation and context

OPA Rego (as of 0.68) does not recognize the byte order mark (BOM). As a result, parsing of the file can fail when some character sequences, such as unescaped backslashes, are present in the input. These characters are common in certain tenant configurations, such as Exchange transport rule regular expressions or Windows file paths. By removing the BOM from provider output files and encoding as UTF-8 without a BOM, Rego can properly handle the provider JSON as input without issue.

Closes #935
Closes #990
Closes #1138
Closes #1214
Closes #1242
Closes #1299

🧪 Testing

To test this PR, first configure a test tenant such that it contains an input that triggers the existing bug. This can be done by adding or modifying an existing DLP policy description to include the string:
Create a "\/Date(1698260458733)\/" custom policy from )\/scratch. You "\/Date will choose the type of content to protect and how you want to protect it.

Run ScubaGear against the test tenant including the Defender product, at a minimum.
Invoke-Scuba -p defender

This should result in an error similar to the one shown here:
image

✅ Pre-approval checklist

  • This PR has an informative and human-readable title.
  • PR targets the correct parent branch (e.g., main or release-name) for merge.
  • Changes are limited to a single goal - eschew scope creep!
  • Changes are sized such that they do not touch excessive number of files.
  • All future TODOs are captured in issues, which are referenced in code comments.
  • These code changes follow the ScubaGear content style guide.
  • Related issues these changes resolve are linked preferably via closing keywords.
  • All relevant type-of-change labels added.
  • All relevant project fields are set.
  • All relevant repo and/or project documentation updated to reflect these changes.
  • Unit tests added/updated to cover PowerShell and Rego changes.
  • Functional tests added/updated to cover PowerShell and Rego changes.
  • All relevant functional tests passed.
  • All automated checks (e.g., linting, static analysis, unit/smoke tests) passed.

✅ Pre-merge checklist

  • PR passed smoke test check.

  • Feature branch has been rebased against changes from parent branch, as needed

    Use Rebase branch button below or use this reference to rebase from the command line.

  • Resolved all merge conflicts on branch

  • Notified merge coordinator that PR is ready for merge via comment mention

✅ Post-merge checklist

  • Feature branch deleted after merge to clean up repository.
  • Verified that all checks pass on parent branch (e.g., main or release-name) after merge.

@schrolla schrolla added bug This issue or pull request addresses broken functionality public-reported This issue is reported by the public users of the tool. labels Sep 6, 2024
@schrolla schrolla added this to the Jellyfish milestone Sep 6, 2024
@schrolla schrolla self-assigned this Sep 6, 2024
@schrolla schrolla requested review from adhilto and tkol2022 September 6, 2024 16:46
@schrolla
Copy link
Collaborator Author

schrolla commented Sep 6, 2024

There is one TODO related to this. Currently, only the Provider JSON is included in this fix, but the ScubaResults JSON used with -MergeJson is also written, at this time, with a BOM as it uses Set-Content to write out the file. Since the ScubaResults.json file is not further processed by OPA Rego, there is no issue, but any tools that might process ScubaResults must either handle the BOM correctly, or likely suffer the same issue. Either this is a TODO to change the orchestrator to similarly write out the ScubaResults with no BOM, or just a note that is out of scope and should be handled by any external code processing the ScubaResults file.
(Edit: Now logged as #1304)

@schrolla schrolla marked this pull request as ready for review September 6, 2024 17:56
@schrolla
Copy link
Collaborator Author

schrolla commented Sep 6, 2024

Appears to be an issue with import of the new functions when modifying cached functional tests. Until those are resolved, will need to wait on merge. Functional tests run fine manually, but not in the runner. Will need to sort that out.

Copy link
Collaborator

@adhilto adhilto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Testing I performed:

  • I tested against 4 different tenants, it worked as expected in all cases.
  • I tested naming a conditional access policy \/ \A \B Ĥ \ to see how it handled backslashes and unicode, it also worked as expected.
  • I tested with a non-default OutPath, also worked as expected
  • Finally, I tested Invoke-ScubaCached to see if it could read BOM-less json, worked as expected.

One minor comment left below, also the PSLinter is unhappy about Write-Host. Both things are minor enough, would like to see them addressed, but approving either way.

PowerShell/ScubaGear/Modules/Orchestrator.psm1 Outdated Show resolved Hide resolved
@tkol2022
Copy link
Collaborator

Comments on usage of specific System.IO.File methods

  • Can you use WriteAllText instead of WriteAllLines? WriteAllText takes a string as the contents argument instead of an array of strings.
  • Can you use ReadAllText instead of ReadAllLines? Same reason as prior bullet.

I tried making these changes and it seemed to work okay but you should do some testing.

I also ran some limited benchmark time comparisons between WriteAllText versus WriteAllLines and didn't see any difference. Regardless I think using the *AllText versions make sense.

Time when using WriteAllText
image

Time when using WriteAllLines
image

@schrolla
Copy link
Collaborator Author

Looks great. Testing I performed:

* I tested against 4 different tenants, it worked as expected in all cases.

* I tested naming a conditional access policy `\/ \A \B Ĥ \` to see how it handled backslashes and unicode, it also worked as expected.

* I tested with a non-default OutPath, also worked as expected

* Finally, I tested `Invoke-ScubaCached` to see if it could read BOM-less json, worked as expected.

One minor comment left below, also the PSLinter is unhappy about Write-Host. Both things are minor enough, would like to see them addressed, but approving either way.

I removed the Write-Host(s) that were left in from testing.

@tkol2022
Copy link
Collaborator

Tested on a Windows system that uses \\UNC paths and the fix code works fine.

@tkol2022
Copy link
Collaborator

Tests of various strings and escape characters

Conducted tests with the data described below against the E5 tenant. No defects found yet.

Defender DLP Policy comment field

Generally ugly test string

This is a non-escaping comment. αۂ a "\/Date(1698260458733)\/" custom policy from )\/scratch.\nYou "\/Date will choose \ the type of content\u0000 to protect and how you want to protect it.\" tabs\t, newlines\n, or \uD83D\uDE00 emoji 😀 for testing. Special JSON characters: { "key": "value" } and SQL injection: '; DROP TABLE users; --'

Emoji test string

🧬🦠🧫🧪🛸🤖🦾🦿🧍‍♂️🧍‍♀️🧑‍🤝‍🧑🦹‍♂️🦹‍♀️🦸‍♂️🦸‍♀️🧙‍♂️🧙‍♀️🧛‍♂️🧛‍♀️🧜‍♂️🧜‍♀️🧞‍♂️🧞‍♀️🧝‍♂️🧝‍♀️🧟‍♂️🧟‍♀️🦧🦣🦦🦨🦩🦫🦭🦬🦙🦘🦡🦔🦏🦈🦭🦅🦩

Exchange transport rule Regex

^[a-zA-Z0-9.%+-]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,}$|^[A-Z0-9.%+-]+@[A-Z0-9.-]+.[A-Z]{2,}$|^user[A-Z0-9]+$

Exchange transport rule name

αۂ a "\/Date(1698260458733)\/" custom policy
/Date(1695680803000)/
"/Date(1695680803000)/"
\\\\\\\

Exchange transport rule comment

𝖘𝖚𝖕𝖊𝖗⚡𝖈𝖔𝖒𝖕𝖑𝖊𝖝🚀𝖓𝖆𝖒𝖊❗✨ NullChar:\0 Backspace:\b BellChar:\a ZeroWidthSpace:\u200B ByteOrderMark:\uFEFF💣

[Also pasted an HTML document into the comment which I have redacted here for brevity]

@schrolla
Copy link
Collaborator Author

Comments on usage of specific System.IO.File methods

* Can you use WriteAllText instead of WriteAllLines? WriteAllText takes a string as the contents argument instead of an array of strings.

* Can you use ReadAllText instead of ReadAllLines? Same reason as prior bullet.

Yes, that's a good point. I went ahead and switched out the Lines for Text methods which also meant being able to remove some usage of Out-String and .Trim() to aggregate the string arrays they produced. Cleaner approach and tests pass with it as well.

@schrolla
Copy link
Collaborator Author

@tkol2022 Updates and functional testing fixes are in. So this code should be final form at this point.

@schrolla
Copy link
Collaborator Author

All tests are passing, including smoke test and nightly functional tests. Note: There were some individual failures on the functional tests, but they were individual test failures unrelated to the changes in this PR. None of the errors were attributable to changes in this PR.
See the following workflow results to review specific errors: https://github.com/cisagov/ScubaGear/actions/runs/10855877536

@tkol2022
Copy link
Collaborator

Tested on a Windows system running ScubaGear in a OneDrive folder and the fix code works fine.

Copy link
Collaborator

@buidav buidav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments

@tkol2022
Copy link
Collaborator

In your Get-Utf8NoBom.Tests.ps1, can you use the approach below to avoid hard coding the C$ or does this not matter? I added the $SystemDrive variable.

Context 'UNC path' {
            It 'Set to shared UNC path' {
                $SystemDrive = $(Resolve-Path $TestDrive).ProviderPath.Substring(0, 1)
                $TempFolder = $($(Resolve-Path $TestDrive).ProviderPath).Substring(2)
                $TestFile = "\\$env:COMPUTERNAME\$SystemDrive$\$TempFolder\output.json"
                New-Item -ItemType File $TestFile
                Get-Utf8NoBom -FilePath $TestFile | Should -Be "Pass"
            }
        }

@schrolla
Copy link
Collaborator Author

In your Get-Utf8NoBom.Tests.ps1, can you use the approach below to avoid hard coding the C$ or does this not matter? I added the $SystemDrive variable.

Context 'UNC path' {
            It 'Set to shared UNC path' {
                $SystemDrive = $(Resolve-Path $TestDrive).ProviderPath.Substring(0, 1)
                $TempFolder = $($(Resolve-Path $TestDrive).ProviderPath).Substring(2)
                $TestFile = "\\$env:COMPUTERNAME\$SystemDrive$\$TempFolder\output.json"
                New-Item -ItemType File $TestFile
                Get-Utf8NoBom -FilePath $TestFile | Should -Be "Pass"
            }
        }

Yes, although slightly altered as SYSTEMDRIVE is an env itself and includes the extraneous drive letter separator ':'. So, I added the $env:SYSTEMDRIVE and a .Trim(':') to construct the share name which works. Nice catch. See 0ad71e8 for full change.

@tkol2022
Copy link
Collaborator

Recommendation to consider. For consistency, in the Set-Utf8NoBom.Tests.ps1 first set of tests, tweak the declaration of the mock function Invoke-WriteAllText as per below. I found this easier to read.

BeforeAll {
            Mock -ModuleName Utility Invoke-WriteAllText { }

            # Setup test directories for testing
            Push-Location $TestDrive
            New-Item -ItemType Directory "$TestDrive\a\b" -Force -ErrorAction Ignore
            New-Item -ItemType Directory "$TestDrive\c" -Force -ErrorAction Ignore
        }

This also helps make the code consistent with how you declared the mock function in the sister file Get-Utf8NoBom.Tests.ps1
image

@tkol2022
Copy link
Collaborator

I couldn't tell the difference between the 'Set-Utf8NoBom Inputs' versus 'Get-Utf8NoBom Inputs' unit tests. They look identical?

image

@schrolla
Copy link
Collaborator Author

@nanda-katikaneni Smoke test passed and ready for merge.

@nanda-katikaneni nanda-katikaneni merged commit 13d7f0c into main Sep 19, 2024
21 checks passed
@nanda-katikaneni nanda-katikaneni deleted the escape-testing branch September 19, 2024 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue or pull request addresses broken functionality public-reported This issue is reported by the public users of the tool.
Projects
None yet
6 participants