
Run existing nemesis with 90% storage utilization test function #9155

Open
pehala opened this issue Nov 7, 2024 · 22 comments
Assignees
Labels
area/elastic cloud (Issues related to the elastic cloud project), area/tablets, P1 Urgent

Comments

@pehala
Contributor

pehala commented Nov 7, 2024

Not all nemeses are going to work, but we should try running with as many of them as possible.

pehala added the area/elastic cloud label Nov 7, 2024
pehala removed their assignment Nov 7, 2024
pehala added the P1 Urgent label Nov 21, 2024
Lakshmipathi self-assigned this Nov 21, 2024
yarongilor self-assigned this and unassigned Lakshmipathi Dec 5, 2024
@yarongilor
Contributor

Notes:

@yarongilor
Contributor

yarongilor commented Dec 9, 2024

Using performance_test didn't run well.
Retesting at a 4x smaller scale in a standard longevity.

@yarongilor
Contributor

yarongilor commented Dec 11, 2024

Unsupported nemeses due to tablets constraints:

  1. ToggleCDCMonkey
  2. CDCStressorMonkey
  3. DecommissionStreamingErrMonkey
  4. RebuildStreamingErrMonkey (disrupt_rebuild_streaming_err)
  5. disrupt_destroy_data_then_rebuild
  6. disrupt_restart_with_resharding
  7. disrupt_nodetool_flush_and_reshard_on_kubernetes

Tested nemesis status:

| Nemesis Name | Passed/Failed | Failure reason/limitation | Nemesis type | Comment |
| --- | --- | --- | --- | --- |
| disrupt_destroy_data_then_repair | Passed | | Disruptive | |
| disrupt_nodetool_decommission | Failed | | Disruptive | |
| disrupt_mgmt_corrupt_then_repair | Passed | | Disruptive | |
| disrupt_rolling_config_change_internode_compression | Passed | | Disruptive | Test Id: a8de1780-aaa5-419e-afe6-13e8af4d6eb9 |
| disrupt_network_reject_node_exporter | Passed | | Disruptive | |
| disrupt_stop_wait_start_scylla_server | Passed | | Disruptive | |
| disrupt_soft_reboot_node | Passed | | Disruptive | |
| disrupt_kill_scylla | Passed | | Disruptive | |
| disrupt_multiple_hard_reboot_node | Passed | | Disruptive | |
| disrupt_disable_binary_gossip_execute_major_compaction | Passed | | Disruptive | |
| disrupt_soft_reboot_node | Passed | | Disruptive | |
| disrupt_network_reject_thrift | Passed | | Disruptive | |
| disrupt_rolling_restart_cluster | Passed | | Disruptive | |
| disrupt_toggle_audit_syslog | Passed | | Disruptive | |
| disrupt_add_remove_mv | Failed | source=SoftTimeout message=operation 'CREATE_MV' took 18894.122734308243s and soft-timeout was set to 14400s | Disruptive | no-space-left; suggestion: scylladb/scylladb#3524 (comment) |
| disrupt_network_block | Passed | | Disruptive | Test Id: 413a3a9b-fe7b-4e5e-b864-6f1f26628226 |
| disrupt_stop_start_scylla_server | Passed | | Disruptive | |
| disrupt_nodetool_cleanup | Passed | | Disruptive | |
| disrupt_switch_between_password_authenticator_and_saslauthd_authenticator_and_back | Skipped | SaslauthdAuthenticator can't work without a saslauthd environment | Disruptive | TODO: reconfigure and retest |
| disrupt_restart_then_repair_node | Failed | 100% utilization while another node is restarted | Disruptive | see scylladb/scylladb#22020 (comment) |
| disrupt_nodetool_refresh | Passed | | Disruptive | |
| disrupt_replace_service_level_using_detach_during_load | Failed | IndexError: list index out of range | Disruptive | Test Id: 2556bfba-bff7-4ec5-833d-312330270ab4 |
| disrupt_network_random_interruptions | Passed | | Disruptive | |
| disrupt_truncate | Passed | | Disruptive | |
| disrupt_hard_reboot_node | Passed | | Disruptive | |
| disrupt_disable_enable_ldap_authorization | Passed | | Disruptive | |
| disrupt_ldap_connection_toggle | Passed | | Disruptive | |
| disrupt_load_and_stream | Passed | | Disruptive | |
| disrupt_network_start_stop_interface | Passed | | Disruptive | |
| disrupt_no_corrupt_repair | Skipped | Disabled because scylladb/scylladb#18059 is not fixed yet | Disruptive | |
| disrupt_network_reject_inter_node_communication | Skipped | scylladb/scylladb#6522 | Disruptive | |
| disrupt_replace_service_level_using_drop_during_load | Failed | IndexError: list index out of range | Disruptive | |
| disrupt_validate_hh_short_downtime | Skipped | scylladb/scylladb#8136 | Disruptive | |
| disrupt_modify_table | Passed | | Disruptive | |
| disrupt_memory_stress | Skipped | Disabled because of #6928 | Disruptive | |
| disrupt_major_compaction | Passed | | Disruptive | |
| disrupt_increase_shares_by_attach_another_sl_during_load | Failed | IndexError: list index out of range | Disruptive | |
| disrupt_hot_reloading_internode_certificate | Passed | | Disruptive | |
| disrupt_remove_service_level_while_load | Skipped | This nemesis is supported only when a Service Level and role are pre-defined | Disruptive | TODO: reconfigure yaml and retest |
| disrupt_maximum_allowed_sls_with_max_shares_during_load | Failed | IndexError: list index out of range | Disruptive | |
| disrupt_abort_repair | Passed | | Disruptive | |
| disrupt_start_stop_cleanup_compaction | Passed | | Disruptive | |
| disrupt_show_toppartitions | Passed | | Disruptive | |
| disrupt_start_stop_validation_compaction | Failed | starting and stopping a table scrub causes "Storage I/O error: 28: No space left on device" | Disruptive | scylladb/scylladb#22088 |
| disrupt_start_stop_scrub_compaction | Passed | | Disruptive | |
| disrupt_snapshot_operations | Passed | | Disruptive | |
| disrupt_corrupt_then_scrub | Failed | "Storage I/O error: 28: No space left on device" | Disruptive | scylladb/scylladb#22088 |

@pehala
Contributor Author

pehala commented Dec 11, 2024

> Started a report doc.

Please post stuff directly here; I do not think we need yet another document.

@roydahan
Contributor

I think this effort needs to be handled with several approaches in parallel:

  1. Define a new yaml for individual nemeses and run all of them in parallel - get to a point where you can clearly filter the ones that failed with ENOSPC (short runs, quick filtering of all jobs). From the list of failures, filter out those where the failure makes sense, and investigate further those that, at least at first look, you expected to pass.

  2. Define a set of "important nemeses" you (or someone) think should pass in this scenario (e.g. MgmtRepair, MgmtBackup, StopStartScylla, etc.).
    You can use a new property to flag them (even temporarily), or we will actually merge it if it makes sense.

  3. Once you have a set of several nemeses from the previous items, you can try to configure a parallel-nemesis longevity that combines "elasticity" (add/remove nodes) with other nemeses
    (disruptive and non-disruptive; no need for many of them).

The main challenge here is that if you use a workload that is also writing (including overwriting) data during the longevity, the space it will consume on disk is very hard to predict once you account for compaction overhead and tombstones.
I suggest tackling it like this (see the sketch after this list):

  • Start with a read-only longevity (during stress) - to first flush out the "easy" issues, if we have any.
  • Populate only to 80% and run the longevity with no nemesis first, and analyze the space consumption over time.
    (It will change significantly with some nemeses, but at least you have some baseline.)
  • Use throttled writes whose space overhead you know.
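To make item (1) and the throttled-writes suggestion concrete, here is a minimal sketch of such a test yaml. `stress_cmd` is the key already used in this thread; `nemesis_class_name` and `nemesis_interval` are assumed to be the usual SCT keys for selecting a nemesis, and all values (nemesis class, rates, durations) are illustrative rather than a tested configuration:

```yaml
# Illustrative sketch only -- not a tested configuration.
nemesis_class_name: 'SisyphusMonkey'   # replace with one specific nemesis class to isolate a single nemesis per job
nemesis_interval: 5                    # assumed SCT key: minutes between disruptions

stress_cmd:
  # Read-heavy load first, to flush out the "easy" issues without growing the data set.
  - "cassandra-stress read no-warmup cl=QUORUM duration=400m -mode cql3 native -rate 'threads=200 fixed=3000/s'"
  # Throttled writes with a small, known space overhead.
  - "cassandra-stress write no-warmup cl=QUORUM duration=400m -mode cql3 native -rate 'threads=1 fixed=3/s'"
```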

@yarongilor
Contributor

> I think this effort needs to be handled with several approaches in parallel: […]

@roydahan , many of the above suggestions are already addressed in one way or another and were tested this weekend. I waited for the test results in order to update the issue.
The ENOSPC seems to be a test issue and not a real issue.
An update (posted on Slack) was:

An update about elasticity with nemesis - it seems like all tests failed pretty soon (after a long setup duration) for the same basic issue.
They were based on a test yaml by Roy that used stress_cmd: "cassandra-stress mixed...". Since this "mixed" workload has both reads and writes, it kept increasing the existing 90% utilization up to 100% pretty fast and failed with no-space-left.
Using a simple workaround of splitting the reads and writes into 2 different stresses, where the "write" stress is very minimal and rate-limited, seems to solve the issue. I was now able to cover many nemeses in a row without getting any failure.
I'm not sure how other tests run their load past 90%; I used the following read + write stresses, for example
(it would probably be better for the read stress to use CL=QUORUM instead of ONE; see the adjusted sketch after the snippet):
  - "cassandra-stress read no-warmup cl=ONE duration=800m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate 'threads=200 fixed=3000/s' -col 'size=FIXED(128) n=FIXED(8)' -pop 'dist=gauss(1..162500000,81250000,1625000)'"
  - "cassandra-stress write no-warmup cl=ONE duration=800m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate 'threads=1 fixed=3/s' -col 'size=FIXED(128) n=FIXED(8)' -pop 'dist=gauss(1..162500000,81250000,1625000)'"

I'm not sure I follow the idea in (1) - we don't want to run each nemesis separately due to the long setup time.
I'll update the above table with the many disruptive nemeses that passed OK in the last run.

@pehala
Contributor Author

pehala commented Dec 16, 2024

> Using a simple workaround of splitting the reads and writes into 2 different stresses, where the "write" stress is very minimal and rate-limited, seems to solve the issue. I was now able to cover many nemeses in a row without getting any failure.

I believe we are doing replace with our writes for the other cases; does replace simply not work (i.e. is the space reclamation too slow) when hit with a nemesis?

@yarongilor
Contributor

> > Using a simple workaround of splitting the reads and writes into 2 different stresses […]
>
> I believe we are doing replace with our writes for the other cases; does replace simply not work (i.e. is the space reclamation too slow) when hit with a nemesis?

It would be difficult to count on space reclamation, I think. Perhaps unless the writes go to a really small token range.
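One practical way to approximate "writes to a really small token range" with cassandra-stress is to confine the write population to a small, fixed key range, so the throttled writes mostly overwrite existing rows and compaction can reclaim the space. The range below is an arbitrary illustration reusing the column/rate settings from the stress commands above, not something that was run here:

```yaml
  # Throttled overwrites restricted to a narrow key population (1..1000000 is an arbitrary example value).
  - "cassandra-stress write no-warmup cl=QUORUM duration=800m -mode cql3 native -rate 'threads=1 fixed=3/s' -col 'size=FIXED(128) n=FIXED(8)' -pop 'dist=uniform(1..1000000)'"
```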

@pehala
Contributor Author

pehala commented Dec 16, 2024

> It would be difficult to count on space reclamation, I think. Perhaps unless the writes go to a really small token range.

Adding @Lakshmipathi @cezarmoise, since their test cases work with mixed workloads.

@yarongilor
Contributor

@pehala , @roydahan ,
as for the reported failure of disrupt_add_remove_mv above: it looks like creating the MV took more than 5 hours (it didn't complete, since the test load ended). During this period, disk utilization increased from 86% to 98%. I didn't investigate whether that's expected, and I'm not sure what actions should follow such nemesis failures in this test. The Grafana dashboard shows:
[Grafana screenshot]

@pehala
Contributor Author

pehala commented Dec 16, 2024

> as for the reported failure of disrupt_add_remove_mv above: it looks like creating the MV took more than 5 hours (it didn't complete, since the test load ended). During this period, disk utilization increased from 86% to 98%. I didn't investigate whether that's expected, and I'm not sure what actions should follow such nemesis failures in this test.

Please look at what the nemesis is doing; if it is creating a large MV, then I would say that is expected.

@roydahan
Contributor

Adding an MV basically doubles the space of the original table.

@yarongilor
Contributor

> Adding an MV basically doubles the space of the original table.

@roydahan , it depends on which columns are selected. In this case only 1 of the 8 columns is selected for the MV:

```
< t:2024-12-15 22:43:33,301 f:common.py       l:1325 c:utils                p:DEBUG > Executing CQL 'CREATE MATERIALIZED VIEW keyspace1.standard1_view AS SELECT "C7", key FROM keyspace1.standard1 WHERE "C7" is not null and key is not null PRIMARY KEY ("C7", key) WITH comment='test MV'' ...
```

But perhaps 1 out of 8 is still too big for the remaining capacity.
Since this sounds like a fault that users should be protected from, I discussed it with Nadav and we think it might be possible to help users avoid such ENOSPC. It is written up here: scylladb/scylladb#3524 (comment)

@yarongilor
Contributor

The disrupt_stop_wait_start_scylla_server nemesis passed OK several times, but it also failed once with no-space-left (it's not clear why), so it is marked as failed.
Test Id: 2c23b329-4757-4f06-b60c-fc222590dcf4

```
< t:2024-12-17 17:32:08,905 f:base.py         l:143  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.2.150>: Command "sudo systemctl stop scylla-server.service" finished with status 0
< t:2024-12-17 17:32:08,906 f:nemesis.py      l:694  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Sleep for 300 seconds
< t:2024-12-17 17:37:09,067 f:remote_base.py  l:560  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.2.150>: Running command "sudo systemctl start scylla-server.service"...
< t:2024-12-17 17:37:09,503 f:db_log_reader.py l:125  c:sdcm.db_log_reader   p:DEBUG > 2024-12-17T17:37:09.425+00:00 elasticity-test-nemesis-master-db-node-2c23b329-3   !NOTICE | sudo[9891]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl start scylla-server.service
< t:2024-12-17 17:37:09,808 f:db_log_reader.py l:125  c:sdcm.db_log_reader   p:DEBUG > 2024-12-17T17:37:09.759+00:00 elasticity-test-nemesis-master-db-node-2c23b329-3     !INFO | scylla_prepare[9905]: Restarting irqbalance via systemctl...
< t:2024-12-17 17:47:07,204 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.2.150>: See "systemctl status scylla-server.service" and "journalctl -xeu scylla-server.service" for details.
< t:2024-12-17 17:47:07,204 f:base.py         l:147  c:RemoteLibSSH2CmdRunner p:ERROR > <10.4.2.150>: Error executing command: "sudo systemctl start scylla-server.service"; Exit status: 1
```

[screenshot]

@pehala
Contributor Author

pehala commented Dec 20, 2024

> But perhaps 1 out of 8 is still too big for the remaining capacity.

If the data is spread equally among those 8 columns, then 1/8 of additional data would bring us above 100%, so it makes sense.
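Rough arithmetic backs this up (assuming the MV carries roughly 1/8 of the base table's data): starting from the observed ~86%, 86 + 86/8 ≈ 97%, which matches the climb to 98% reported above; starting from the nominal 90%, 90 + 90/8 ≈ 101%, i.e. past full.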

@pehala
Contributor Author

pehala commented Dec 20, 2024

> The disrupt_stop_wait_start_scylla_server nemesis passed OK several times, but it also failed once with no-space-left (it's not clear why), so it is marked as failed.
> Test Id: 2c23b329-4757-4f06-b60c-fc222590dcf4

I think we could include it in the list and run it in the test. The nemesis doesn't have a fundamental problem with 90%, so it might uncover a bug.

@yarongilor
Contributor

> > The disrupt_stop_wait_start_scylla_server nemesis passed OK several times, but it also failed once with no-space-left […]
>
> I think we could include it in the list and run it in the test. The nemesis doesn't have a fundamental problem with 90%, so it might uncover a bug.

OK, @pehala , please let me know if we want to open issues for such cases.
The problem with running such a nemesis is that no-space-left is an unrecoverable error that fails both the health check and the test.

@pehala
Contributor Author

pehala commented Dec 22, 2024

> OK, @pehala , please let me know if we want to open issues for such cases.
> The problem with running such a nemesis is that no-space-left is an unrecoverable error that fails both the health check and the test.

We definitely do want to file a bug for this, but given that you couldn't replicate it since, I think it is enough to file the bug when you encounter it again. I would mark the nemesis in the table as "unstable" and continue with the investigation of the others.

@yarongilor
Contributor

> > OK, @pehala , please let me know if we want to open issues for such cases. […]
>
> We definitely do want to file a bug for this, but given that you couldn't replicate it since, I think it is enough to file the bug when you encounter it again. I would mark the nemesis in the table as "unstable" and continue with the investigation of the others.

To summarize the main testing blockers:
On the Scylla side - we get unexpected 100% utilization following a service restart (triggered by many nemeses).
This problem is unrecoverable and cannot be handled by SCT.
It fails the SCT health check, as in #9599.
So this is the main overhead/blocker for testing nemeses with 90% utilization.
Perhaps we should consider a workaround: a special health-check mode that identifies 100% utilization, then deletes user data and runs a repair (or any other idea).
@pehala , @roydahan , @fruch , please advise.
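Purely to make that workaround proposal concrete, it could be exposed as a couple of test-yaml switches. None of the keys or action names below exist in SCT today; they are hypothetical placeholders for the behaviour described above:

```yaml
# Hypothetical keys -- a sketch of the proposed recovery mode; nothing here exists in SCT today.
health_check_enospc_recovery: true      # on >=100% disk utilization during health check, try to recover instead of failing
health_check_enospc_recovery_actions:
  - truncate_user_keyspaces             # drop test-generated user data to free space
  - repair                              # then run repair before re-checking cluster health
```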

@pehala
Contributor Author

pehala commented Dec 22, 2024

> On the Scylla side - we get unexpected 100% utilization following a service restart (triggered by many nemeses).

Is this expected? I am not sure why a simple service restart would increase storage space utilization.

@fruch
Contributor

fruch commented Dec 22, 2024

Regardless, it sounds like an issue that needs to be resolved on the Scylla end, or in the test expectations.

I don't think machinery to track disk utilization and clear it is beneficial to testing.

@yarongilor
Contributor

> We definitely do want to file a bug for this, but given that you couldn't replicate it since, I think it is enough to file the bug when you encounter it again. […]

Opened scylladb/scylladb#22020.
