
🐛 Destination Redshift: integration tests fail due to OOM #13375

Closed
alexandr-shegeda opened this issue Jun 1, 2022 · 16 comments · Fixed by #14015
Labels
team/connectors-python type/bug Something isn't working

Comments

@alexandr-shegeda
Contributor

Environment

Integration tests currently fail on CI due to OOM.

Logs

2022-05-31 11:48:44 destination > Details: -----------------------------------------------
2022-05-31 11:48:44 destination > error: Out Of Memory:
2022-05-31 11:48:44 destination > code: 1004
2022-05-31 11:48:44 destination > context: alloc(9992,MtPlan)
2022-05-31 11:48:44 destination > query: 7120875
2022-05-31 11:48:44 destination > location: alloc.cpp:493
2022-05-31 11:48:44 destination > process: query0_118_7120875 [pid=27884]

Additional context

https://github.com/airbytehq/airbyte/runs/6669394169?check_suite_focus=true

@edgao
Contributor

edgao commented Jun 2, 2022

@alexandr-shegeda is this happening every time you run tests, or only sometimes?

I also saw this in base-normalization #12846 (comment) (https://github.com/airbytehq/airbyte/runs/6702093751?check_suite_focus=true#step:11:12870)

	 04:35:16  Database Error in model stream_with_1000_columnsjmuzg (models/generated/airbyte_tables/test_ephemeral/stream_with_1000_columnsjmuzg.sql)
	 04:35:16    Out Of Memory: 
	 04:35:16    DETAIL:  
	 04:35:16      -----------------------------------------------
	 04:35:16      error:  Out Of Memory: 
	 04:35:16      code:      1004
	 04:35:16      context:   alloc(1048486,MtPartiQL)
	 04:35:16      query:     7168005
	 04:35:16      location:  alloc.cpp:493
	 04:35:16      process:   query0_123_7168005 [pid=1301]
	 04:35:16      -----------------------------------------------

I haven't tried running the full test suite via gradle, but the normalization redshift tests pass locally when I run each test case individually. Will try running via gradle later today. Actually, I realized I was running the wrong test case; this fails locally as well.

@alexandertsukanov self-assigned this Jun 3, 2022
@grubberr
Contributor

grubberr commented Jun 3, 2022

@edgao about this "Out Of Memory" error:
I've seen it happen from time to time on the stream_with_1000_column dbt normalization test,
because this test generates a pretty heavy SQL query like this:

select
    {{ json_extract_scalar('_airbyte_data', ['romwn'], ['romwn']) }} as romwn,
    {{ json_extract_scalar('_airbyte_data', ['zioqs'], ['zioqs']) }} as zioqs,
    ...
    <1000 lines>
    ...
    {{ json_extract_scalar('_airbyte_data', ['sjlmb'], ['sjlmb']) }} as sjlmb
from _airbyte_raw_stream_with_1000_column

The Redshift server sends the "Out of Memory" error back from the remote side (it's not a memory issue on our end).

I don't think we can fully explain why this happens, because the error comes from Redshift itself,
but empirically I found that if the integrationtests database is empty, there is no error;
if the integrationtests database has a lot of tables (from OLD tests), there is a high chance this error will happen.

To avoid this problem I usually re-create the integrationtests database before running tests.

I think we need to improve our Python base-normalization tests and their resource cleanup.
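
For reference, a rough way to spot leftover schemas from old runs in the integrationtests database before dropping them (a sketch; the test_normalization prefix is only an assumption about how the test schemas are named):

    -- list non-system schemas, to spot leftovers from previous normalization test runs
    select nspname as schema_name
    from pg_namespace
    where nspname not in ('pg_catalog', 'information_schema', 'public')
      and nspname like 'test_normalization%'
    order by nspname;

    -- anything stale can then be removed with e.g.:
    -- drop schema if exists test_normalization_xyz cascade;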

@alexandertsukanov
Contributor

For the Java side, I have tried different scenarios:

  • run a build of destination-redshift locally
  • run a build of destination-redshift on CI
  • run a build of normalization-redshift locally
  • run a build of normalization-redshift on CI

None of them produced an OOM error. It looks like an intermittent issue related to the Redshift integrationtests schema (the Redshift service), not to the Airbyte connector/normalization themselves. I have checked the Redshift builds and they haven't failed, at least for the past 4 days. I'm going to keep tracking this; if the issue appears one more time I will re-open.

CC: @edgao, @alexandr-shegeda, @grubberr

@edgao
Contributor

edgao commented Jun 10, 2022

I ran into this last night on a local test run, so it's probably still an issue. If the cause is that tests are polluting the integrationtests schema, could we update the tests to clean up their tables after completion?

this was base-normalization's ephemeral test, integration_tests/test_ephemeral.py::test_destination_supported_limits[DestinationType.REDSHIFT-1000]:

         00:09:57  Database Error in model stream_with_1000_columnsmzbvs (models/generated/airbyte_tables/test_ephemeral/stream_with_1000_columnsmzbvs.sql)
         00:09:57    Out of Memory
         00:09:57    DETAIL:
         00:09:57      -----------------------------------------------
         00:09:57      error:  Out of Memory
         00:09:57      code:      1020
         00:09:57      context:   From OomGuard
         00:09:57      query:     0
         00:09:57      location:  oom_guard.cpp:392
         00:09:57      process:   sched [pid=8721]
         00:09:57      -----------------------------------------------
         00:09:57    compiled SQL at ../build/run/airbyte_utils/models/generated/airbyte_tables/test_ephemeral/stream_with_1000_columnsmzbvs.sql

@alexandertsukanov
Contributor

Looks like the issue is related to the Python side. Please take a look.
CC: @grubberr , @edgao

@bazarnov
Collaborator

@edgao @grubberr @alexandr-shegeda @alexandertsukanov
Here is some related information about the issue we're experiencing:
https://repost.aws/questions/QUe2mHTu_nReyXiy95znu5Kg/redshift-out-of-memory-when-running-query

Looks like we need to enable the WLM functionality in our Redshift cluster config and probably restart the cluster (if needed). Does anyone know who has the permissions to do this?

@edgao
Contributor

edgao commented Jun 14, 2022

are you referring to automatic WLM? From https://docs.aws.amazon.com/redshift/latest/dg/automatic-wlm.html#wlm-monitoring-automatic-wlm I think we're already using it; when I ran the select * from stv_wlm_service_class_config where service_class >= 100; query against the integrationtests DB, it returned a non-empty result set.
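
For reference, that check (from the AWS doc above) can be run as below; a non-empty result for service classes 100 and up indicates automatic WLM is in effect:

    -- automatic WLM uses service classes 100 and above,
    -- so any rows returned here mean auto WLM is enabled
    select *
    from stv_wlm_service_class_config
    where service_class >= 100;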

Also, https://us-east-2.console.aws.amazon.com/redshiftv2/home?region=us-east-2#workload-management?parameter-group=default.redshift-1.0 says we're using automatic WLM.

regardless - I have (or can find someone who has) permissions to modify+restart the cluster, so if there's some instructions I can follow just LMK!

@bazarnov
Collaborator

bazarnov commented Jun 15, 2022

This is a link to a successfully passed base-normalization SAT run: https://github.com/airbytehq/airbyte/actions/runs/2496928260

The conditions should be:

  • nobody except "you" should run the tests triggered by base-normalization at the same time, because they can interfere with each other.
  • Specifically for the Redshift tests, the target test DB should be empty before the test cycle begins.

Ideally, if we want this build to pass all the time, we should have separate accounts/clusters/instances for each destination, rather than using the same one for all integration tests. But that's for an ideal world :)

In our case we can drop all the schemas related to normalization tests before the tests begin, using a dbt macro; I'll make a PR for this.
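
Such a macro could look roughly like the sketch below (the macro name and schema prefix are illustrative assumptions; the actual PR may differ):

    {% macro drop_old_test_schemas(prefix='test_normalization') %}
        {# Hypothetical cleanup macro: drop every schema whose name starts with
           the given prefix before the test cycle begins. #}
        {% if execute %}
            {% set schemas = run_query(
                "select nspname from pg_namespace where nspname like '" ~ prefix ~ "%'"
            ) %}
            {% for row in schemas.rows %}
                {% do run_query('drop schema if exists "' ~ row[0] ~ '" cascade') %}
            {% endfor %}
        {% endif %}
    {% endmacro %}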

@edgao
Contributor

edgao commented Jun 15, 2022

is this something we can fix by scaling up our cluster? It seems pretty limiting to only have one person able to run tests at a time. (agree that it's a reasonable fix for now though)

I just noticed that our disk usage spiked around when this issue was reported, could it be related?
[screenshot: Redshift cluster disk usage chart]

@bazarnov
Collaborator

is this something we can fix by scaling up our cluster? It seems pretty limiting to only have one person able to run tests at a time. (agree that it's a reasonable fix for now though)

I just noticed that our disk usage spiked around when this issue was reported, could it be related?

How much space do we have now? :) We could try increasing it by 20% over the current amount just as a test; if the issue goes away, then this was our blocker.
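
For reference, a rough way to check the cluster's overall disk usage (a sketch; stv_partitions reports capacity/used in 1 MB blocks):

    -- approximate cluster-wide disk usage in GB
    select
        sum(capacity) / 1024.0 as capacity_gb,
        sum(used) / 1024.0 as used_gb,
        round(100.0 * sum(used) / sum(capacity), 1) as pct_used
    from stv_partitions
    where part_begin = 0;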

@edgao
Contributor

edgao commented Jun 15, 2022

hm. do you know if there are any issues with going from a 1-node cluster to multi-node? afaict that's the easiest+cheapest way to increase disk space; I couldn't find an option to just give the node more disk.

We could probably also try deleting stuff from the cluster to free up some space? Not sure how difficult that would be.
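
For reference, one way to see where the space is going before deleting anything (a sketch using the svv_table_info system view; size is reported in MB):

    -- largest tables on the cluster, to identify old test data worth dropping
    select "schema", "table", tbl_rows, size as size_mb
    from svv_table_info
    order by size desc
    limit 20;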

@bazarnov
Collaborator

do you know if there's any issues with going from a 1-node cluster to multi-node?

No idea at the moment, but sounds like a plan.

We could probably also try deleting stuff from the cluster to free up some space?

I can only see the integrationtests DB, since that's the one I'm connected to. Are there any others available on the cluster? If so, we can probably free up some space indeed.

@bazarnov
Collaborator

bazarnov commented Jun 16, 2022

@edgao
UPDATE:
About 243 GB of space was freed up today on the Redshift cluster; it was outdated data from previous custom normalization tests.

Also, I've created another clean DB, normalization_tests, for normalization test purposes, instead of running tests against the single integrationtests DB, which currently holds the test data for both the Java SAT and normalization. This will allow us to safely vacuum/drop the schemas/tables after the normalization tests, without having to pick out which target schemas to clean up.
The GSM secret SECRET_BASE-NORMALIZATION_REDSHIFT__CREDS was also updated accordingly. Please be aware.

This should be enough for all tests running on the Redshift cluster.

@grubberr
Contributor

grubberr commented Jun 16, 2022

Two PRs running normalization tests can still conflict with each other on Redshift.

@bazarnov
Collaborator

bazarnov commented Jun 16, 2022

Yes, therefore the normalization tests should be modified to use a suffix, something like table_name_<random_string> / schema_name_<random_string>.

Or, as far as I can see, we still need to clean up the target tables after each test, since the schemas for the Redshift tests are the same, but the target tables already have unique names (suffix applied).

@edgao
Contributor

edgao commented Jun 16, 2022

redshift environment is looking good! https://github.com/airbytehq/airbyte/runs/6922267528?check_suite_focus=true ran successfully.
