Simplify zone db locking to avoid a race #2561

Merged: 6 commits, Nov 4, 2024

Collaborator

@revans2 revans2 commented Nov 1, 2024

I started to run into a race condition when running the integration tests in spark-rapids for an unrelated PR where

env -u SPARK_CONF_DIR SPARK_HOME=.../spark_3.4.2/ SPARK_RAPIDS_TEST_INJECT_OOM_SEED=0 DATAGEN_SEED=0 TEST_PARALLEL=0 MAX_PARALLEL=20 ./run_pyspark_from_build.sh -k 'test_aqe_join_reused_exchange_inequality_condition or dayofmonth'

would fail 2 of the tests with a null pointer exception saying that Z was not inside fixedTransitions.

java.lang.NullPointerException
    at com.nvidia.spark.rapids.jni.GpuTimeZoneDB.fromUtcTimestampToTimestamp(GpuTimeZoneDB.java:227)
    at com.nvidia.spark.rapids.GpuCast$.doCast(GpuCast.scala:638)

I am not 100% sure how this ended up as a problem. Looking at the code quickly, I don't see how it could happen. But when I added a synchronized to fromUtcTimestampToTimestamp so that my log messages would not come out intermixed, the problem went away. The synchronization logic is there so that shutdown can be called while we are still trying to load the DB. But that is also a bug, because we don't want to assume that the data is freed properly when we are trying to allocate it. So I just made the locking simpler. Every critical section, both loading the DB and freeing the resources, is now protected by synchronizing on GpuTimeZoneDB.class, which is the same lock that a static synchronized method in that class takes.
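To illustrate the locking scheme described above, here is a minimal sketch, not the actual GpuTimeZoneDB code. The class name ZoneDbSketch, its map contents, and the method bodies are all hypothetical. It only shows that a static synchronized method and a synchronized block on the class object use the same monitor, so load, shutdown, and lookup cannot interleave.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the real class; names and data are illustrative only.
final class ZoneDbSketch {
    // Shared state, guarded by the ZoneDbSketch.class monitor.
    private static Map<String, Integer> fixedTransitions = null;

    // A static synchronized method locks ZoneDbSketch.class ...
    static synchronized void load() {
        if (fixedTransitions == null) {
            Map<String, Integer> m = new HashMap<>();
            m.put("Z", 0);
            fixedTransitions = m; // publish only a fully built map
        }
    }

    // ... which is the same monitor as an explicit block on the class object.
    static void shutdown() {
        synchronized (ZoneDbSketch.class) {
            fixedTransitions = null;
        }
    }

    static Integer lookup(String zone) {
        synchronized (ZoneDbSketch.class) {
            load(); // reentrant: this thread already holds the monitor
            return fixedTransitions.get(zone);
        }
    }
}
```

Because every access goes through one monitor, a reader can never observe the map between a shutdown and the next load, which is the kind of window that could produce the null pointer exception above.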

Signed-off-by: Robert (Bobby) Evans <[email protected]>
@revans2
Collaborator Author

revans2 commented Nov 1, 2024

build


if (lock.isLoading) {
// another thread is loading(), return
synchronized (GpuTimeZoneDB.class) {
Collaborator

This still leaves room for a race where shutdown is called from another thread after the lock is released on L87. Should we make the whole method cacheDatabaseAsync synchronized instead?

Collaborator

The second thread will check that the resource it is trying to update is not null, and that resource would be closed and checked under the lock. So the second thread will do some work unnecessarily, but I don't see a case for a runtime error here, unless I am missing something.

But I agree: what if this was all locked, is it bad?

Collaborator Author

@gerashegalov is 100% correct. I will fix it.

Is it bad?

Yes and no. We should not get inconsistent data, but we might load data after shutdown was called and have no way to properly free it. It only matters at shutdown, but the change is small enough, and enough of an improvement, that I think it is best to make it.
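A rough sketch of the concern, under stated assumptions: this is not the code from the PR, and LoadAfterShutdownSketch, FakeResource, and the shutdownCalled flag are hypothetical names made up for illustration. If the check and the load happen under one lock, and shutdown records that it ran under the same lock, then a late load cannot allocate something that nobody will free.

```java
// Hypothetical illustration only; the real class manages GPU-side data.
final class LoadAfterShutdownSketch {
    // Stand-in for a resource that must be closed exactly once.
    static final class FakeResource implements AutoCloseable {
        boolean closed = false;
        @Override
        public void close() {
            closed = true;
        }
    }

    private static FakeResource db = null;
    private static boolean shutdownCalled = false; // assumed flag, not from the PR

    // The whole method is one critical section, so shutdown cannot slip in
    // between the check and the allocation.
    static synchronized boolean cacheDatabase() {
        if (shutdownCalled) {
            return false; // refuse to load: nothing would free it afterwards
        }
        if (db == null) {
            db = new FakeResource();
        }
        return true;
    }

    static synchronized void shutdown() {
        shutdownCalled = true;
        if (db != null) {
            db.close();
            db = null;
        }
    }

    static synchronized boolean isLoaded() {
        return db != null;
    }
}
```

Whether a flag like this is wanted depends on whether the DB should be reloadable after shutdown; the sketch assumes it should not be.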

@revans2
Collaborator Author

revans2 commented Nov 1, 2024

@gerashegalov and @abellina I fixed the issue with locking, but I could not leave it alone, and I also removed the singleton instance + static methods abstraction. It was not actually needed for unit tests, so I just removed it to make the code a little bit cleaner. Hopefully.

@revans2
Collaborator Author

revans2 commented Nov 1, 2024

build

@revans2
Collaborator Author

revans2 commented Nov 1, 2024

build

Collaborator

@gerashegalov gerashegalov left a comment

LGTM

@revans2 revans2 merged commit e6a7128 into NVIDIA:branch-24.12 Nov 4, 2024
3 checks passed
@revans2 revans2 deleted the simplify_zonedb_locking branch November 4, 2024 14:26
revans2 added a commit that referenced this pull request Nov 4, 2024
@sameerz sameerz added the bug Something isn't working label Nov 5, 2024