
Run existing nemesis with 90% storage utilization test function #9155

Open
pehala opened this issue Nov 7, 2024 · 22 comments
Assignees
Labels
area/elastic cloud (Issues related to the elastic cloud project), area/tablets, P1 Urgent

Comments

@pehala
Contributor

pehala commented Nov 7, 2024

Not all nemeses are going to work, but we should try running with as many of them as possible.

pehala added the area/elastic cloud label Nov 7, 2024
pehala removed their assignment Nov 7, 2024
pehala added the P1 Urgent label Nov 21, 2024
Lakshmipathi self-assigned this Nov 21, 2024
yarongilor self-assigned this and unassigned Lakshmipathi Dec 5, 2024
@yarongilor
Contributor

Notes:

@yarongilor
Contributor

yarongilor commented Dec 9, 2024

Using performance_test didn't run well.
Retesting at a 4x smaller scale in a standard longevity.

@yarongilor
Contributor

yarongilor commented Dec 11, 2024

Unsupported nemeses due to tablets constraints:

  1. ToggleCDCMonkey
  2. CDCStressorMonkey
  3. DecommissionStreamingErrMonkey
  4. RebuildStreamingErrMonkey (disrupt_rebuild_streaming_err)
  5. disrupt_destroy_data_then_rebuild
  6. disrupt_restart_with_resharding
  7. disrupt_nodetool_flush_and_reshard_on_kubernetes

Tested nemesis status:

| Nemesis Name | Passed/Failed | Failure reason/limitation | Nemesis type | Comment |
| --- | --- | --- | --- | --- |
| disrupt_destroy_data_then_repair | Passed | | Disruptive | |
| disrupt_nodetool_decommission | Failed | | Disruptive | |
| disrupt_mgmt_corrupt_then_repair | Passed | | Disruptive | |
| disrupt_rolling_config_change_internode_compression | Passed | | Disruptive | Test Id: a8de1780-aaa5-419e-afe6-13e8af4d6eb9 |
| disrupt_network_reject_node_exporter | Passed | | Disruptive | |
| disrupt_stop_wait_start_scylla_server | Passed | | Disruptive | |
| disrupt_soft_reboot_node | Passed | | Disruptive | |
| disrupt_kill_scylla | Passed | | Disruptive | |
| disrupt_multiple_hard_reboot_node | Passed | | Disruptive | |
| disrupt_disable_binary_gossip_execute_major_compaction | Passed | | Disruptive | |
| disrupt_soft_reboot_node | Passed | | Disruptive | |
| disrupt_network_reject_thrift | Passed | | Disruptive | |
| disrupt_rolling_restart_cluster | Passed | | Disruptive | |
| disrupt_toggle_audit_syslog | Passed | | Disruptive | |
| disrupt_add_remove_mv | Failed | source=SoftTimeout message=operation 'CREATE_MV' took 18894.122734308243s and soft-timeout was set to 14400s | Disruptive | no-space-left; suggestion: scylladb/scylladb#3524 (comment) |
| disrupt_network_block | Passed | | Disruptive | Test Id: 413a3a9b-fe7b-4e5e-b864-6f1f26628226 |
| disrupt_stop_start_scylla_server | Passed | | Disruptive | |
| disrupt_nodetool_cleanup | Passed | | Disruptive | |
| disrupt_switch_between_password_authenticator_and_saslauthd_authenticator_and_back | Skipped | SaslauthdAuthenticator can't work without a saslauthd environment | Disruptive | TODO: reconfigure and retest |
| disrupt_restart_then_repair_node | Failed | 100% utilization while another node is restarted | Disruptive | see scylladb/scylladb#22020 (comment) |
| disrupt_nodetool_refresh | Passed | | Disruptive | |
| disrupt_replace_service_level_using_detach_during_load | Failed | IndexError: list index out of range | Disruptive | Test Id: 2556bfba-bff7-4ec5-833d-312330270ab4 |
| disrupt_network_random_interruptions | Passed | | Disruptive | |
| disrupt_truncate | Passed | | Disruptive | |
| disrupt_hard_reboot_node | Passed | | Disruptive | |
| disrupt_disable_enable_ldap_authorization | Passed | | Disruptive | |
| disrupt_ldap_connection_toggle | Passed | | Disruptive | |
| disrupt_load_and_stream | Passed | | Disruptive | |
| disrupt_network_start_stop_interface | Passed | | Disruptive | |
| disrupt_no_corrupt_repair | Skipped | Disabled because scylladb/scylladb#18059 is not fixed yet | Disruptive | |
| disrupt_network_reject_inter_node_communication | Skipped | scylladb/scylladb#6522 | Disruptive | |
| disrupt_replace_service_level_using_drop_during_load | Failed | IndexError: list index out of range | Disruptive | |
| disrupt_validate_hh_short_downtime | Skipped | scylladb/scylladb#8136 | Disruptive | |
| disrupt_modify_table | Passed | | Disruptive | |
| disrupt_memory_stress | Skipped | Disabled because of #6928 | Disruptive | |
| disrupt_major_compaction | Passed | | Disruptive | |
| disrupt_increase_shares_by_attach_another_sl_during_load | Failed | IndexError: list index out of range | Disruptive | |
| disrupt_hot_reloading_internode_certificate | Passed | | Disruptive | |
| disrupt_remove_service_level_while_load | Skipped | This nemesis is supported only when a Service Level and role are pre-defined | Disruptive | TODO: reconfigure yaml and retest |
| disrupt_maximum_allowed_sls_with_max_shares_during_load | Failed | IndexError: list index out of range | Disruptive | |
| disrupt_abort_repair | Passed | | Disruptive | |
| disrupt_start_stop_cleanup_compaction | Passed | | Disruptive | |
| disrupt_show_toppartitions | Passed | | Disruptive | |
| disrupt_start_stop_validation_compaction | Failed | starting and stopping a table scrub causes "Storage I/O error: 28: No space left on device" | Disruptive | scylladb/scylladb#22088 |
| disrupt_start_stop_scrub_compaction | Passed | | Disruptive | |
| disrupt_snapshot_operations | Passed | | Disruptive | |
| disrupt_corrupt_then_scrub | Failed | "Storage I/O error: 28: No space left on device" | Disruptive | scylladb/scylladb#22088 |

@pehala
Contributor Author

pehala commented Dec 11, 2024

> Started a report doc.

Please post stuff directly here; I do not think we need yet another document.

@roydahan
Contributor

I think this effort needs to be handled with several approaches in parallel:

  1. Define a new yaml for individual nemeses and run all of them in parallel - get to a point where you can clearly filter the ones that failed with ENOSPC (short runs, quick filtering of all jobs). From the list of failures, filter out those where the failure makes sense, and investigate further those that, at least at first look, you expected to pass.

  2. Define a set of "important nemeses" you (or someone) think should pass in this scenario (e.g. MgmtRepair, MgmtBackup, StopStartScylla, etc.).
    You can use a new property to flag them (even temporarily), or we will actually merge it if it makes sense.

  3. Once you have a set of several nemeses from the previous items, you can try to configure a parallel-nemesis longevity that combines "elasticity" (add/remove nodes) with other nemeses
    (disruptive and non-disruptive; no need for many of them).

The main challenge here is that if you use a workload that is also writing (including overwriting) data during the longevity, the space it will consume on disk is very hard to predict once you account for compaction overhead and tombstones.
I suggest tackling it like this (see the sketch after this list):

  • Start with a read-only longevity (during stress) - to first flush out the "easy" issues, if we have any.
  • Populate only to 80% and run the longevity with no nemesis first, and analyze the space consumption over time.
    (It will change significantly with some nemeses, but at least you have some baseline.)
  • Use throttled writes whose space overhead you know.
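To make item (1) and the throttled-writes suggestion concrete, here is a minimal sketch of such a test yaml. `stress_cmd` is the key already used in this thread; `nemesis_class_name` and `nemesis_interval` are assumed to be the usual SCT keys for selecting a nemesis, and all values (nemesis class, rates, durations) are illustrative rather than a tested configuration:

```yaml
# Illustrative sketch only -- not a tested configuration.
nemesis_class_name: 'SisyphusMonkey'   # replace with one specific nemesis class to isolate a single nemesis per job
nemesis_interval: 5                    # assumed SCT key: minutes between disruptions

stress_cmd:
  # Read-heavy load first, to flush out the "easy" issues without growing the data set.
  - "cassandra-stress read no-warmup cl=QUORUM duration=400m -mode cql3 native -rate 'threads=200 fixed=3000/s'"
  # Throttled writes with a small, known space overhead.
  - "cassandra-stress write no-warmup cl=QUORUM duration=400m -mode cql3 native -rate 'threads=1 fixed=3/s'"
```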

@yarongilor
Contributor

> I think this effort needs to be handled with several approaches in parallel: […]

@roydahan , many of the above suggestions are already addressed in one way or another and were tested this weekend. I waited for the test results in order to update the issue.
The ENOSPC seems to be a test issue and not a real issue.
An update (posted on Slack) was:

An update about elasticity with nemesis - it seems like all tests failed pretty soon (after a long setup duration) for the same basic issue.
They were based on a test yaml by Roy that used stress_cmd: "cassandra-stress mixed...". Since this "mixed" workload has both reads and writes, it kept increasing the existing 90% utilization up to 100% pretty fast and failed with no-space-left.
Using a simple workaround of splitting the reads and writes into 2 different stresses, where the "write" stress is very minimal and rate-limited, seems to solve the issue. I was now able to cover many nemeses in a row without getting any failure.
I'm not sure how other tests run their load past 90%; I used the following read + write stresses, for example
(it would probably be better for the read stress to use CL=QUORUM instead of ONE; see the adjusted sketch after the snippet):
  - "cassandra-stress read no-warmup cl=ONE duration=800m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate 'threads=200 fixed=3000/s' -col 'size=FIXED(128) n=FIXED(8)' -pop 'dist=gauss(1..162500000,81250000,1625000)'"
  - "cassandra-stress write no-warmup cl=ONE duration=800m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate 'threads=1 fixed=3/s' -col 'size=FIXED(128) n=FIXED(8)' -pop 'dist=gauss(1..162500000,81250000,1625000)'"

I'm not sure I follow the idea in (1) - we don't want to run each nemesis separately due to the long setup time.
I'll update the above table with the many disruptive nemeses that passed OK in the last run.

@pehala
Contributor Author

pehala commented Dec 16, 2024

> Using a simple workaround of splitting the reads and writes into 2 different stresses, where the "write" stress is very minimal and rate-limited, seems to solve the issue. I was now able to cover many nemeses in a row without getting any failure.

I believe we are doing replace with our writes for the other cases; does replace simply not work (i.e. is the space reclamation too slow) when hit with a nemesis?

@yarongilor
Contributor

> > Using a simple workaround of splitting the reads and writes into 2 different stresses […]
>
> I believe we are doing replace with our writes for the other cases; does replace simply not work (i.e. is the space reclamation too slow) when hit with a nemesis?

It would be difficult to count on space reclamation, I think. Perhaps unless the writes go to a really small token range.
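One practical way to approximate "writes to a really small token range" with cassandra-stress is to confine the write population to a small, fixed key range, so the throttled writes mostly overwrite existing rows and compaction can reclaim the space. The range below is an arbitrary illustration reusing the column/rate settings from the stress commands above, not something that was run here:

```yaml
  # Throttled overwrites restricted to a narrow key population (1..1000000 is an arbitrary example value).
  - "cassandra-stress write no-warmup cl=QUORUM duration=800m -mode cql3 native -rate 'threads=1 fixed=3/s' -col 'size=FIXED(128) n=FIXED(8)' -pop 'dist=uniform(1..1000000)'"
```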

@pehala
Contributor Author

pehala commented Dec 16, 2024

> It would be difficult to count on space reclamation, I think. Perhaps unless the writes go to a really small token range.

Adding @Lakshmipathi @cezarmoise, since their test cases work with mixed workloads.

@yarongilor
Contributor

@pehala , @roydahan ,
as for the reported failure of disrupt_add_remove_mv above: it looks like creating the MV took more than 5 hours (it didn't complete, since the test load ended). During this period, disk utilization increased from 86% to 98%. I didn't investigate whether that's expected, and I'm not sure what actions should follow such nemesis failures in this test. The Grafana dashboard shows:
[Grafana screenshot]

@pehala
Contributor Author

pehala commented Dec 16, 2024

> as for the reported failure of disrupt_add_remove_mv above: it looks like creating the MV took more than 5 hours (it didn't complete, since the test load ended). During this period, disk utilization increased from 86% to 98%. I didn't investigate whether that's expected, and I'm not sure what actions should follow such nemesis failures in this test.

Please look at what the nemesis is doing; if it is creating a large MV, then I would say that is expected.

@roydahan
Contributor

Adding an MV basically doubles the space of the original table.

@yarongilor
Contributor

> Adding an MV basically doubles the space of the original table.

@roydahan , it depends on which columns are selected. In this case only 1 of the 8 columns is selected for the MV:

```
< t:2024-12-15 22:43:33,301 f:common.py       l:1325 c:utils                p:DEBUG > Executing CQL 'CREATE MATERIALIZED VIEW keyspace1.standard1_view AS SELECT "C7", key FROM keyspace1.standard1 WHERE "C7" is not null and key is not null PRIMARY KEY ("C7", key) WITH comment='test MV'' ...
```

But perhaps 1 out of 8 is still too big for the remaining capacity.
Since this sounds like a fault that users should be protected from, I discussed it with Nadav and we think it might be possible to help users avoid such ENOSPC. It is written up here: scylladb/scylladb#3524 (comment)

@yarongilor
Contributor

The disrupt_stop_wait_start_scylla_server nemesis passed OK several times, but it also failed once with no-space-left (it's not clear why), so it is marked as failed.
Test Id: 2c23b329-4757-4f06-b60c-fc222590dcf4

```
< t:2024-12-17 17:32:08,905 f:base.py         l:143  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.2.150>: Command "sudo systemctl stop scylla-server.service" finished with status 0
< t:2024-12-17 17:32:08,906 f:nemesis.py      l:694  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Sleep for 300 seconds
< t:2024-12-17 17:37:09,067 f:remote_base.py  l:560  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.2.150>: Running command "sudo systemctl start scylla-server.service"...
< t:2024-12-17 17:37:09,503 f:db_log_reader.py l:125  c:sdcm.db_log_reader   p:DEBUG > 2024-12-17T17:37:09.425+00:00 elasticity-test-nemesis-master-db-node-2c23b329-3   !NOTICE | sudo[9891]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl start scylla-server.service
< t:2024-12-17 17:37:09,808 f:db_log_reader.py l:125  c:sdcm.db_log_reader   p:DEBUG > 2024-12-17T17:37:09.759+00:00 elasticity-test-nemesis-master-db-node-2c23b329-3     !INFO | scylla_prepare[9905]: Restarting irqbalance via systemctl...
< t:2024-12-17 17:47:07,204 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.2.150>: See "systemctl status scylla-server.service" and "journalctl -xeu scylla-server.service" for details.
< t:2024-12-17 17:47:07,204 f:base.py         l:147  c:RemoteLibSSH2CmdRunner p:ERROR > <10.4.2.150>: Error executing command: "sudo systemctl start scylla-server.service"; Exit status: 1
```

[screenshot]

@pehala
Contributor Author

pehala commented Dec 20, 2024

> But perhaps 1 out of 8 is still too big for the remaining capacity.

If the data is spread equally among those 8 columns, then 1/8 of additional data would bring us above 100%, so it makes sense.
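Rough arithmetic backs this up (assuming the MV carries roughly 1/8 of the base table's data): starting from the observed ~86%, 86 + 86/8 ≈ 97%, which matches the climb to 98% reported above; starting from the nominal 90%, 90 + 90/8 ≈ 101%, i.e. past full.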

@pehala
Contributor Author

pehala commented Dec 20, 2024

> The disrupt_stop_wait_start_scylla_server nemesis passed OK several times, but it also failed once with no-space-left (it's not clear why), so it is marked as failed.
> Test Id: 2c23b329-4757-4f06-b60c-fc222590dcf4

I think we could include it in the list and run it in the test. The nemesis doesn't have a fundamental problem with 90%, so it might uncover a bug.

@yarongilor
Contributor

> > The disrupt_stop_wait_start_scylla_server nemesis passed OK several times, but it also failed once with no-space-left […]
>
> I think we could include it in the list and run it in the test. The nemesis doesn't have a fundamental problem with 90%, so it might uncover a bug.

OK, @pehala , please let me know if we want to open issues for such cases.
The problem with running such a nemesis is that no-space-left is an unrecoverable error that fails both the health check and the test.

@pehala
Contributor Author

pehala commented Dec 22, 2024

> OK, @pehala , please let me know if we want to open issues for such cases.
> The problem with running such a nemesis is that no-space-left is an unrecoverable error that fails both the health check and the test.

We definitely do want to file a bug for this, but given that you couldn't replicate it since, I think it is enough to file the bug when you encounter it again. I would mark the nemesis in the table as "unstable" and continue with the investigation of the others.

@yarongilor
Contributor

> > OK, @pehala , please let me know if we want to open issues for such cases. […]
>
> We definitely do want to file a bug for this, but given that you couldn't replicate it since, I think it is enough to file the bug when you encounter it again. I would mark the nemesis in the table as "unstable" and continue with the investigation of the others.

To summarize the main testing blockers:
On the Scylla side - we get unexpected 100% utilization following a service restart (triggered by many nemeses).
This problem is unrecoverable and cannot be handled by SCT.
It fails the SCT health check, as in #9599.
So this is the main overhead/blocker for testing nemeses with 90% utilization.
Perhaps we should consider a workaround: a special health-check mode that identifies 100% utilization, then deletes user data and runs a repair (or any other idea).
@pehala , @roydahan , @fruch , please advise.
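Purely to make that workaround proposal concrete, it could be exposed as a couple of test-yaml switches. None of the keys or action names below exist in SCT today; they are hypothetical placeholders for the behaviour described above:

```yaml
# Hypothetical keys -- a sketch of the proposed recovery mode; nothing here exists in SCT today.
health_check_enospc_recovery: true      # on >=100% disk utilization during health check, try to recover instead of failing
health_check_enospc_recovery_actions:
  - truncate_user_keyspaces             # drop test-generated user data to free space
  - repair                              # then run repair before re-checking cluster health
```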

@pehala
Contributor Author

pehala commented Dec 22, 2024

> On the Scylla side - we get unexpected 100% utilization following a service restart (triggered by many nemeses).

Is this expected? I am not sure why a simple service restart would increase storage space utilization.

@fruch
Contributor

fruch commented Dec 22, 2024

Regardless, it sounds like an issue that needs to be resolved on the Scylla end, or in the test expectations.

I don't think machinery to track disk utilization and clear it is beneficial to testing.

@yarongilor
Contributor

> We definitely do want to file a bug for this, but given that you couldn't replicate it since, I think it is enough to file the bug when you encounter it again. […]

Opened scylladb/scylladb#22020.
