Integration tests executed on a real deployment as part of the CICD - Datasets #1358

Open
dlpzx opened this issue Jun 24, 2024 · 1 comment · Fixed by #1379

Comments

@dlpzx
Contributor

dlpzx commented Jun 24, 2024

Same as for #1220.

This issue is to track the progress for the Datasets modules.
It has its own dedicated issue because testing datasets requires pre-existing infrastructure, which makes it more challenging.

@dlpzx
Contributor Author

dlpzx commented Jul 1, 2024

Required tests for basic coverage

#1379

For fresh deployments

For each of the following API calls, we need to test authorized and unauthorized scenarios, as well as all possible configurations (e.g. autoapproval...).

  • Create Dataset - includes testing if dataset is indexed in Catalog
  • Import Dataset - includes testing if dataset is indexed in Catalog
  • List Datasets
  • Edit Dataset
  • Delete dataset
  • Start crawler
  • Sync tables
  • Preview table
  • Delete table
  • Create folder
  • Delete folder
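
The authorized/unauthorized pairing described above can be sketched as follows. This is only a minimal illustration, not the data.all test suite: `FakeDatasetApi`, `UnauthorizedError`, the method names and the group names are hypothetical stand-ins for the real API client, and the in-memory `catalog` set stands in for the Catalog index.

```python
class UnauthorizedError(Exception):
    """Raised by the stand-in API when the caller lacks permissions."""


class FakeDatasetApi:
    """Hypothetical in-memory stand-in for the dataset API and the Catalog."""

    def __init__(self, allowed_groups):
        self.allowed_groups = set(allowed_groups)
        self.datasets = {}
        self.catalog = set()

    def create_dataset(self, group, name, auto_approval=False):
        if group not in self.allowed_groups:
            raise UnauthorizedError(f'{group} cannot create datasets')
        uri = f'dataset-{len(self.datasets) + 1}'
        self.datasets[uri] = {
            'name': name,
            'owner': group,
            'autoApprovalEnabled': auto_approval,
        }
        self.catalog.add(uri)  # creation is expected to index the dataset
        return uri

    def list_datasets(self, group):
        return [uri for uri, d in self.datasets.items() if d['owner'] == group]


def check_create_dataset_authorized(api, auto_approval):
    """Authorized scenario, run once per configuration value."""
    uri = api.create_dataset('team-a', 'sales', auto_approval=auto_approval)
    assert api.datasets[uri]['autoApprovalEnabled'] is auto_approval
    assert uri in api.catalog  # indexed in the Catalog
    assert uri in api.list_datasets('team-a')
    return uri


def check_create_dataset_unauthorized(api):
    """Unauthorized scenario: the call must be rejected."""
    try:
        api.create_dataset('team-b', 'sales')
    except UnauthorizedError:
        return True
    return False


api = FakeDatasetApi(allowed_groups=['team-a'])
created = [check_create_dataset_authorized(api, flag) for flag in (False, True)]
rejected = check_create_dataset_unauthorized(api)
```

In a pytest suite the loop over configurations would typically become a `@pytest.mark.parametrize` decorator, so that each configuration is reported as its own test.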

For backwards compatibility

  • Update created Dataset stack
  • Update imported Dataset stack
  • Create Dataset in updated Environment
  • Import Dataset in updated Environment

Full coverage

For fresh deployments

For backwards compatibility

For the updated Dataset stacks:

  • AWS access to Dataset - Credentials
  • AWS access to Dataset - S3 redirect
  • Start crawler
  • Sync tables

@dlpzx dlpzx linked a pull request Jul 2, 2024 that will close this issue
11 tasks
dlpzx added a commit that referenced this issue Jul 9, 2024
…a.all (#1379)

### Feature or Bugfix
- Feature

### Detail
It implements some tests for s3_datasets (check full list in #1358)
### For fresh deployments
- [x] Create Dataset
- [x] Import Dataset --> IMPORTANT: See below details on AWS actions for
testing
- [x] List Datasets
- [X] Get Dataset
- [x] Edit Dataset
- [x] Delete dataset - decision: I only added an explicit test for
delete_unauthorized, since delete_dataset is already covered in the
fixtures and deploying and deleting a dataset takes a long time. If
needed, we can introduce the test for better reporting
- [X] Access dataset assume role url
- [X] Generate dataset access token
- [X] Dataset upload data presigned url
- [X] Backwards compatibility - update dataset
- [X] Backwards compatibility - import dataset
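
To illustrate why delete_dataset is already exercised by the fixtures, here is a minimal sketch of the create/yield/delete pattern. `FakeClient` and `session_dataset` are hypothetical names used for illustration only, and a plain context manager is used so the example runs without pytest.

```python
from contextlib import contextmanager


class FakeClient:
    """Hypothetical stand-in for the API client used by the tests."""

    def __init__(self):
        self.calls = []

    def create_dataset(self, name):
        self.calls.append(('create', name))
        return name

    def delete_dataset(self, name):
        self.calls.append(('delete', name))


@contextmanager
def session_dataset(client, name):
    """Create a dataset, hand it to the tests, and delete it on teardown."""
    dataset = client.create_dataset(name)
    try:
        yield dataset
    finally:
        # Teardown runs even when a test fails, so delete_dataset is
        # called on every run without a dedicated test.
        client.delete_dataset(dataset)


client = FakeClient()
with session_dataset(client, 'sales') as dataset:
    client.calls.append(('test', dataset))
```

A pytest fixture has the same shape: the function is decorated with `@pytest.fixture(scope='session')` and everything after the `yield` acts as the teardown.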

🔦 **AWS actions outside of data.all**
Some actions are, in real life, performed outside of data.all. To run
the tests we need to either perform these actions manually before the
tests are executed or use the AWS SDK to automate them. The most
important actions performed outside of data.all are:
- Creation of consumption roles
- Creation of the imported dataset bucket, KMS key and Glue database
(*IN THIS PR*)
- Creation of VPCs for Notebooks
- Validation of shares - we assume the share request role for this
To create resources we need to assume a role in the environment account.
We could assume the pivot role, but it would then need CreateBucket...
permissions, which it does not have. I have opted to create a separate,
isolated role `dataall-integration-tests-role` as part of the
environment stack ONLY when we are creating environments during
integration testing. As part of the global config of environments, users
can use the boto3 session of this role to perform direct AWS calls in
the environment account.
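
A minimal sketch of obtaining such a boto3 session could look like the following. The helper names and the session name are assumptions, and the role ARN is built from the role name mentioned above without any suffix, which may differ from the deployed role.

```python
def assume_role_credentials(sts_client, account_id,
                            role_name='dataall-integration-tests-role',
                            session_name='integration-tests'):
    """Assume the integration test role and return boto3.Session kwargs."""
    role_arn = f'arn:aws:iam::{account_id}:role/{role_name}'
    response = sts_client.assume_role(RoleArn=role_arn, RoleSessionName=session_name)
    creds = response['Credentials']
    return {
        'aws_access_key_id': creds['AccessKeyId'],
        'aws_secret_access_key': creds['SecretAccessKey'],
        'aws_session_token': creds['SessionToken'],
    }


def integration_test_session(account_id, region):
    """Build a boto3 session in the environment account (hypothetical helper)."""
    import boto3  # imported lazily so the helper above stays testable without AWS

    kwargs = assume_role_credentials(boto3.client('sts'), account_id)
    return boto3.Session(region_name=region, **kwargs)
```

Tests can then call, for example, `integration_test_session(account_id, region).client('s3')` to act directly on the environment account.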

In https://github.com/data-dot-all/dataall/pull/1382/files we discussed
some alternatives. In this PR we use the `environmentType` variable in
the environment model, which was previously unused (it always defaulted
to Data environments).
The create environment API call (input: environmentType =
IntegrationTesting) ---> in the environment stack we check the type of
environment and deploy the integration test role.

Then we use an SSM parameter to read the tooling account id needed for
the assume role trust policy.
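
The flow above can be sketched in plain Python as shown below. This is an illustration under assumptions, not the stack code: the function names are hypothetical, the SSM parameter name is left to the caller because it is not given here, and trusting the tooling account root principal is an assumption about how the trust policy is scoped.

```python
INTEGRATION_TESTING = 'IntegrationTesting'


def needs_integration_test_role(environment_type):
    """Deploy the extra role only for integration-testing environments."""
    return environment_type == INTEGRATION_TESTING


def read_tooling_account_id(ssm_client, parameter_name):
    """Read the tooling account id from an SSM parameter (name is an assumption)."""
    return ssm_client.get_parameter(Name=parameter_name)['Parameter']['Value']


def integration_role_trust_policy(tooling_account_id):
    """Trust policy letting principals in the tooling account assume the role."""
    return {
        'Version': '2012-10-17',
        'Statement': [
            {
                'Effect': 'Allow',
                # Assumption: the whole tooling account is trusted.
                'Principal': {'AWS': f'arn:aws:iam::{tooling_account_id}:root'},
                'Action': 'sts:AssumeRole',
            }
        ],
    }
```

Keeping the check keyed on `environmentType` means regular Data environments never receive the role, so the extra permissions stay confined to test deployments.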


### Relates
- #1358

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)?
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization?
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: Noah Paige <[email protected]>
@dlpzx dlpzx added this to v2.7.0 Jul 12, 2024
@github-project-automation github-project-automation bot moved this to Nominated in v2.7.0 Jul 12, 2024
@dlpzx dlpzx reopened this Jul 12, 2024
@dlpzx dlpzx moved this from Nominated to Prioritized To do in v2.7.0 Jul 12, 2024
@dlpzx dlpzx moved this from Backlog to In progress in v2.7.0 Sep 5, 2024
@dlpzx dlpzx added this to v2.8.0 Sep 9, 2024
@github-project-automation github-project-automation bot moved this to Nominated in v2.8.0 Sep 9, 2024
@dlpzx dlpzx removed this from v2.8.0 Sep 9, 2024
noah-paige added a commit that referenced this issue Sep 16, 2024
### Feature or Bugfix
- Feature


### Detail
- Adding integration tests for Dataset Table Data Filters

- PENDING TESTS PASSING IN DEV AWS ENV
- Merge after #1391

### Relates
- related to #1220 and
#1358



By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: dlpzx <[email protected]>
dlpzx added a commit that referenced this issue Sep 18, 2024
#1533)

### Feature or Bugfix
- Feature: Tests
TO BE MERGED AFTER #1391

### Detail
Follow-up of #1391. This PR
adds:
- Tests for profiling jobs - because it is a simple submodule, I decided
to "chain" the tests so that each one depends on the previous one. I
could also create a fixture for a profiling job (see the warning below:
profiling jobs cannot be deleted)
- Added missing tests in datasets_base - we still need to add tests for
redshift datasets and for any other dataset type each time a new one is
added.
- Added missing tests in s3_datasets:
test_list_s3_datasets_owned_by_env_group.

⚠️ Issues discovered during testing.
They are not bugs but missing functionalities:
- Profiling jobs can never be deleted. They are only records in the RDS
database, but they still cannot be deleted.
- It would be nice to have an API that checks the status of a Glue
crawler
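
Until such an API exists, a test could poll Glue directly through the AWS SDK. The sketch below is one possible approach, not part of data.all: `wait_for_crawler` is a hypothetical helper, `glue_client` is assumed to be a boto3 Glue client for the environment account, and the timeout values are arbitrary.

```python
import time


def wait_for_crawler(glue_client, crawler_name, timeout=600, poll_seconds=10,
                     sleep=time.sleep):
    """Poll a Glue crawler until it is READY and return its last crawl status.

    Returns None if the crawler has never run. Raises TimeoutError if the
    crawler is still RUNNING or STOPPING after `timeout` seconds.
    """
    waited = 0
    while True:
        crawler = glue_client.get_crawler(Name=crawler_name)['Crawler']
        if crawler['State'] == 'READY':
            # LastCrawl is absent until the crawler has completed a run.
            return crawler.get('LastCrawl', {}).get('Status')
        if waited >= timeout:
            raise TimeoutError(f'Crawler {crawler_name} not ready after {timeout}s')
        sleep(poll_seconds)
        waited += poll_seconds
```

The `sleep` parameter is injectable so that unit tests of the helper itself do not have to wait in real time.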

### Relates
- #1358
- #1391

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: Noah Paige <[email protected]>
dlpzx added a commit that referenced this issue Sep 24, 2024
### Feature or Bugfix
- Feature: Testing

### Detail
Follow-up of #1391 

- Implement Table Column tests

### Relates
- #1358
- #1391 


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: Noah Paige <[email protected]>
Projects
Status: Done