
🐛 Destination Redshift: integration tests fail due to OOM #13375

Closed
alexandr-shegeda opened this issue Jun 1, 2022 · 16 comments · Fixed by #14015
Labels
team/connectors-python type/bug Something isn't working

Comments

@alexandr-shegeda
Contributor

Environment

Integration tests currently fail on CI due to OOM.

Logs

2022-05-31 11:48:44 destination > Details: -----------------------------------------------
2022-05-31 11:48:44 destination > error: Out Of Memory:
2022-05-31 11:48:44 destination > code: 1004
2022-05-31 11:48:44 destination > context: alloc(9992,MtPlan)
2022-05-31 11:48:44 destination > query: 7120875
2022-05-31 11:48:44 destination > location: alloc.cpp:493
2022-05-31 11:48:44 destination > process: query0_118_7120875 [pid=27884]

Additional context

https://github.com/airbytehq/airbyte/runs/6669394169?check_suite_focus=true

@edgao
Contributor

edgao commented Jun 2, 2022

@alexandr-shegeda is this happening every time you run tests, or only sometimes?

I also saw this in base-normalization #12846 (comment) (https://github.com/airbytehq/airbyte/runs/6702093751?check_suite_focus=true#step:11:12870)

	 04:35:16  Database Error in model stream_with_1000_columnsjmuzg (models/generated/airbyte_tables/test_ephemeral/stream_with_1000_columnsjmuzg.sql)
	 04:35:16    Out Of Memory: 
	 04:35:16    DETAIL:  
	 04:35:16      -----------------------------------------------
	 04:35:16      error:  Out Of Memory: 
	 04:35:16      code:      1004
	 04:35:16      context:   alloc(1048486,MtPartiQL)
	 04:35:16      query:     7168005
	 04:35:16      location:  alloc.cpp:493
	 04:35:16      process:   query0_123_7168005 [pid=1301]
	 04:35:16      -----------------------------------------------

I haven't tried running the full test suite via gradle, but the normalization redshift tests pass locally when I run each test case individually. Will try running via gradle later today. Actually, I realized I was running the wrong test case; this fails locally as well.

@alexandertsukanov self-assigned this Jun 3, 2022
@grubberr
Contributor

grubberr commented Jun 3, 2022

@edgao about this "Out Of Memory" error:
I've seen it happen from time to time on the stream_with_1000_column dbt normalization test,
because this test generates a pretty heavy SQL query like this:

select
    {{ json_extract_scalar('_airbyte_data', ['romwn'], ['romwn']) }} as romwn,
    {{ json_extract_scalar('_airbyte_data', ['zioqs'], ['zioqs']) }} as zioqs,
    ...
    <1000 lines>
    ...
    {{ json_extract_scalar('_airbyte_data', ['sjlmb'], ['sjlmb']) }} as sjlmb
from _airbyte_raw_stream_with_1000_column

The Redshift server sends the "Out of Memory" error back from the remote side (it's not a memory issue on our end).

I don't think we can fully explain why this happens, because the error comes from Redshift itself,
but empirically I found that if the integrationtests database is empty, there is no error;
if the integrationtests database has a lot of tables (from OLD tests), there is a high chance this error will happen.

To avoid this problem I usually re-create the integrationtests database before running tests.

I think we need to improve our Python base-normalization tests and their resource cleanup.
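
For reference, a rough way to spot leftover schemas from old runs in the integrationtests database before dropping them (a sketch; the test_normalization prefix is only an assumption about how the test schemas are named):

    -- list non-system schemas, to spot leftovers from previous normalization test runs
    select nspname as schema_name
    from pg_namespace
    where nspname not in ('pg_catalog', 'information_schema', 'public')
      and nspname like 'test_normalization%'
    order by nspname;

    -- anything stale can then be removed with e.g.:
    -- drop schema if exists test_normalization_xyz cascade;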

@alexandertsukanov
Contributor

For the Java side, I have tried different scenarios:

  • run a build of destination-redshift locally
  • run a build of destination-redshift on CI
  • run a build of normalization-redshift locally
  • run a build of normalization-redshift on CI

None of them produced an OOM error. It looks like an intermittent issue related to the Redshift integrationtests schema (the Redshift service), not to the Airbyte connector/normalization themselves. I have checked the Redshift builds and they haven't failed, at least for the past 4 days. I'm going to keep tracking this; if the issue appears one more time I will re-open.

CC: @edgao, @alexandr-shegeda, @grubberr

@edgao
Contributor

edgao commented Jun 10, 2022

I ran into this last night on a local test run, so it's probably still an issue. If the cause is that tests are polluting the integrationtests schema, could we update the tests to clean up their tables after completion?

this was base-normalization's ephemeral test, integration_tests/test_ephemeral.py::test_destination_supported_limits[DestinationType.REDSHIFT-1000]:

         00:09:57  Database Error in model stream_with_1000_columnsmzbvs (models/generated/airbyte_tables/test_ephemeral/stream_with_1000_columnsmzbvs.sql)
         00:09:57    Out of Memory
         00:09:57    DETAIL:
         00:09:57      -----------------------------------------------
         00:09:57      error:  Out of Memory
         00:09:57      code:      1020
         00:09:57      context:   From OomGuard
         00:09:57      query:     0
         00:09:57      location:  oom_guard.cpp:392
         00:09:57      process:   sched [pid=8721]
         00:09:57      -----------------------------------------------
         00:09:57    compiled SQL at ../build/run/airbyte_utils/models/generated/airbyte_tables/test_ephemeral/stream_with_1000_columnsmzbvs.sql

@alexandertsukanov
Contributor

Looks like the issue is related to the Python side. Please take a look.
CC: @grubberr , @edgao

@bazarnov
Collaborator

@edgao @grubberr @alexandr-shegeda @alexandertsukanov
Here is some related information about the issue we're experiencing:
https://repost.aws/questions/QUe2mHTu_nReyXiy95znu5Kg/redshift-out-of-memory-when-running-query

Looks like we need to enable the WLM functionality in our Redshift cluster config and probably restart the cluster (if needed). Does anyone know who has the permissions to do this?

@edgao
Contributor

edgao commented Jun 14, 2022

are you referring to automatic WLM? From https://docs.aws.amazon.com/redshift/latest/dg/automatic-wlm.html#wlm-monitoring-automatic-wlm I think we're already using it; when I ran the select * from stv_wlm_service_class_config where service_class >= 100; query against the integrationtests DB, it returned a non-empty result set.
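
For reference, that check (from the AWS doc above) can be run as below; a non-empty result for service classes 100 and up indicates automatic WLM is in effect:

    -- automatic WLM uses service classes 100 and above,
    -- so any rows returned here mean auto WLM is enabled
    select *
    from stv_wlm_service_class_config
    where service_class >= 100;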

Also, https://us-east-2.console.aws.amazon.com/redshiftv2/home?region=us-east-2#workload-management?parameter-group=default.redshift-1.0 says we're using automatic WLM.

regardless - I have (or can find someone who has) permissions to modify+restart the cluster, so if there's some instructions I can follow just LMK!

@bazarnov
Collaborator

bazarnov commented Jun 15, 2022

This is a link to a successfully passed base-normalization SAT run: https://github.com/airbytehq/airbyte/actions/runs/2496928260

The conditions should be:

  • nobody except "you" should run the tests triggered by base-normalization at the same time, because they can interfere with each other.
  • Specifically for the Redshift tests, the target test DB should be empty before the test cycle begins.

Ideally, if we want this build to pass all the time, we should have separate accounts/clusters/instances for each destination, rather than using the same one for all integration tests. But that's for an ideal world :)

In our case we can drop all the schemas related to normalization tests before the tests begin, using a dbt macro; I'll make a PR for this.
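
Such a macro could look roughly like the sketch below (the macro name and schema prefix are illustrative assumptions; the actual PR may differ):

    {% macro drop_old_test_schemas(prefix='test_normalization') %}
        {# Hypothetical cleanup macro: drop every schema whose name starts with
           the given prefix before the test cycle begins. #}
        {% if execute %}
            {% set schemas = run_query(
                "select nspname from pg_namespace where nspname like '" ~ prefix ~ "%'"
            ) %}
            {% for row in schemas.rows %}
                {% do run_query('drop schema if exists "' ~ row[0] ~ '" cascade') %}
            {% endfor %}
        {% endif %}
    {% endmacro %}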

@edgao
Contributor

edgao commented Jun 15, 2022

is this something we can fix by scaling up our cluster? It seems pretty limiting to only have one person able to run tests at a time. (agree that it's a reasonable fix for now though)

I just noticed that our disk usage spiked around when this issue was reported, could it be related?
[screenshot: Redshift cluster disk usage chart]

@bazarnov
Collaborator

is this something we can fix by scaling up our cluster? It seems pretty limiting to only have one person able to run tests at a time. (agree that it's a reasonable fix for now though)

I just noticed that our disk usage spiked around when this issue was reported, could it be related?

How much space do we have now? :) We could try increasing it by 20% over the current amount just as a test; if the issue goes away, then this was our blocker.
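
For reference, a rough way to check the cluster's overall disk usage (a sketch; stv_partitions reports capacity/used in 1 MB blocks):

    -- approximate cluster-wide disk usage in GB
    select
        sum(capacity) / 1024.0 as capacity_gb,
        sum(used) / 1024.0 as used_gb,
        round(100.0 * sum(used) / sum(capacity), 1) as pct_used
    from stv_partitions
    where part_begin = 0;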

@edgao
Contributor

edgao commented Jun 15, 2022

hm. do you know if there are any issues with going from a 1-node cluster to multi-node? afaict that's the easiest+cheapest way to increase disk space; I couldn't find an option to just give the node more disk.

We could probably also try deleting stuff from the cluster to free up some space? Not sure how difficult that would be.
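
For reference, one way to see where the space is going before deleting anything (a sketch using the svv_table_info system view; size is reported in MB):

    -- largest tables on the cluster, to identify old test data worth dropping
    select "schema", "table", tbl_rows, size as size_mb
    from svv_table_info
    order by size desc
    limit 20;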

@bazarnov
Collaborator

do you know if there's any issues with going from a 1-node cluster to multi-node?

No idea at the moment, but sounds like a plan.

We could probably also try deleting stuff from the cluster to free up some space?

I can only see the integrationtests DB, since that's the one I'm connected to. Are there any others available on the cluster? If so, we can probably free up some space indeed.

@bazarnov
Collaborator

bazarnov commented Jun 16, 2022

@edgao
UPDATE:
About 243 GB of space was freed up today on the Redshift cluster; it was outdated data from previous custom normalization tests.

Also, I've created another clean DB, normalization_tests, for normalization test purposes, instead of running tests against the single integrationtests DB, which currently holds the test data for both the Java SAT and normalization. This will allow us to safely vacuum/drop the schemas/tables after the normalization tests, without having to pick out which target schemas to clean up.
The GSM secret SECRET_BASE-NORMALIZATION_REDSHIFT__CREDS was also updated accordingly. Please be aware.

This should be enough for all tests running on the Redshift cluster.

@grubberr
Contributor

grubberr commented Jun 16, 2022

Two PRs running normalization tests can still conflict with each other on Redshift.

@bazarnov
Collaborator

bazarnov commented Jun 16, 2022

Yes, therefore the normalization tests should be modified to use a suffix, something like table_name_<random_string> / schema_name_<random_string>.

Or, as far as I can see, we still need to clean up the target tables after each test, since the schemas for the Redshift tests are the same, but the target tables already have unique names (suffix applied).

@edgao
Contributor

edgao commented Jun 16, 2022

redshift environment is looking good! https://github.com/airbytehq/airbyte/runs/6922267528?check_suite_focus=true ran successfully.
