[BUG] YARN illegal memory access GpuOutOfCoreSortIterator. #3697

Closed
tgravescs opened this issue Sep 29, 2021 · 1 comment
Labels: bug, duplicate

tgravescs commented Sep 29, 2021

Describe the bug

Several hash aggregate tests failed on the YARN EGX nightly build:

06:42:43  FAILED integration_tests/src/main/python/hash_aggregate_test.py::test_hash_groupby_approx_percentile_double[IGNORE_ORDER({'local': True})]
06:42:43  FAILED integration_tests/src/main/python/hash_aggregate_test.py::test_hash_groupby_approx_percentile_double_scalar[IGNORE_ORDER({'local': True})]
06:42:43  FAILED integration_tests/src/main/python/hash_aggregate_test.py::test_hash_grpby_avg_nulls[{'****.rapids.sql.variableFloatAgg.enabled': 'true', '****.rapids.sql.hasNans': 'false', '****.rapids.sql.castStringToFloat.enabled': 'true'}-[('a', RepeatSeq(String)), ('b', Integer), ('c', Null)]][IGNORE_ORDER]
06:42:43  FAILED integration_tests/src/main/python/hash_aggregate_test.py::test_hash_grpby_avg_nulls[{'****.rapids.sql.variableFloatAgg.enabled': 'true', '****.rapids.sql.hasNans': 'false', '****.rapids.sql.castStringToFloat.enabled': 'true', '****.rapids.sql.batchSizeBytes': '1000'}-[('a', RepeatSeq(String)), ('b', Integer), ('c', Null)]][IGNORE_ORDER]
06:42:43  FAILED integration_tests/src/main/python/hash_aggregate_test.py::test_hash_grpby_avg_nulls[{'****.rapids.sql.variableFloatAgg.enabled': 'true', '****.rapids.sql.hasNans': 'false', '****.rapids.sql.castStringToFloat.enabled': 'true', '****.rapids.sql.hashAgg.replaceMode': 'final'}-[('a', RepeatSeq(String)), ('b', Integer), ('c', Null)]][IGNORE_ORDER, ALLOW_NON_GPU(HashAggregateExec,AggregateExpression,UnscaledValue,MakeDecimal,AttributeReference,Alias,Sum,Count,Max,Min,Average,Cast,KnownFloatingPointNormalized,NormalizeNaNAndZero,GreaterThan,Literal,If,EqualTo,First,SortAggregateExec,Coalesce,IsNull,EqualNullSafe,PivotFirst,GetArrayItem,ShuffleExchangeExec,HashPartitioning)]
06:42:43  FAILED integration_tests/src/main/python/hash_aggregate_test.py::test_hash_grpby_avg_nulls[{'****.rapids.sql.variableFloatAgg.enabled': 'true', '****.rapids.sql.hasNans': 'false', '****.rapids.sql.castStringToFloat.enabled': 'true', '****.rapids.sql.hashAgg.replaceMode': 'partial'}-[('a', RepeatSeq(String)), ('b', Integer), ('c', Null)]][IGNORE_ORDER, ALLOW_NON_GPU(HashAggregateExec,AggregateExpression,UnscaledValue,MakeDecimal,AttributeReference,Alias,Sum,Count,Max,Min,Average,Cast,KnownFloatingPointNormalized,NormalizeNaNAndZero,GreaterThan,Literal,If,EqualTo,First,SortAggregateExec,Coalesce,IsNull,EqualNullSafe,PivotFirst,GetArrayItem,ShuffleExchangeExec,HashPartitioning)]
06:42:43  FAILED integration_tests/src/main/python/hash_aggregate_test.py::test_hash_grpby_avg_nulls_ansi[{'****.rapids.sql.variableFloatAgg.enabled': 'true', '****.rapids.sql.hasNans': 'false', '****.rapids.sql.castStringToFloat.enabled': 'true'}-[('a', RepeatSeq(String)), ('b', Integer), ('c', Null)]][IGNORE_ORDER, ALLOW_NON_GPU(HashAggregateExec,Alias,AggregateExpression,Cast,HashPartitioning,ShuffleExchangeExec,Average)]

....

06:42:43  E                   py4j.protocol.Py4JJavaError: An error occurred while calling o231391.collectToPython.
06:42:43  E                   : org.apache.****.SparkException: Job aborted due to stage failure: Task 0 in stage 8011.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8011.0 (TID 692884, ****-egx-09, executor 1): ai.rapids.cudf.CudaException: an illegal memory access was encountered
06:42:43  E                   	at ai.rapids.cudf.Cuda.memcpyOnStream(Native Method)
06:42:43  E                   	at ai.rapids.cudf.Cuda.memcpy(Cuda.java:472)
06:42:43  E                   	at ai.rapids.cudf.Cuda.memcpy(Cuda.java:288)
06:42:43  E                   	at ai.rapids.cudf.BaseDeviceMemoryBuffer.copyFromHostBuffer(BaseDeviceMemoryBuffer.java:43)
06:42:43  E                   	at ai.rapids.cudf.BaseDeviceMemoryBuffer.copyFromHostBuffer(BaseDeviceMemoryBuffer.java:105)
06:42:43  E                   	at ai.rapids.cudf.ColumnView$NestedColumnVector.createNestedColumnVector(ColumnView.java:3896)
06:42:43  E                   	at ai.rapids.cudf.ColumnView$NestedColumnVector.createNewNestedColumnVector(ColumnView.java:3833)
06:42:43  E                   	at ai.rapids.cudf.ColumnView$NestedColumnVector.createColumnVector(ColumnView.java:3780)
06:42:43  E                   	at ai.rapids.cudf.HostColumnVector.copyToDevice(HostColumnVector.java:220)
06:42:43  E                   	at com.nvidia.****.rapids.UnsafeRowToColumnarBatchIterator.next(UnsafeRowToColumnarBatchIterator.java:148)
06:42:43  E                   	at org.apache.****.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeRowToColumnarBatchIterator.next(Unknown Source)
06:42:43  E                   	at scala.collection.Iterator$$anon$11.next(Iterator.scala:494)
06:42:43  E                   	at com.nvidia.****.rapids.GpuOutOfCoreSortIterator.firstPassReadBatches(GpuSortExec.scala:378)
06:42:43  E                   	at com.nvidia.****.rapids.GpuOutOfCoreSortIterator.next(GpuSortExec.scala:513)
06:42:43  E                   	at com.nvidia.****.rapids.GpuOutOfCoreSortIterator.next(GpuSortExec.scala:231)
06:42:43  E                   	at com.nvidia.****.rapids.GpuHashAggregateIterator.aggregateInputBatches(aggregate.scala:352)
06:42:43  E                   	at com.nvidia.****.rapids.GpuHashAggregateIterator.$anonfun$next$2(aggregate.scala:307)
06:42:43  E                   	at scala.Option.getOrElse(Option.scala:189)
06:42:43  E                   	at com.nvidia.****.rapids.GpuHashAggregateIterator.next(aggregate.scala:304)
06:42:43  E                   	at com.nvidia.****.rapids.GpuHashAggregateIterator.next(aggregate.scala:247)
06:42:43  E                   	at com.nvidia.****.rapids.GpuHashAggregateIterator.aggregateInputBatches(aggregate.scala:352)
06:42:43  E                   	at com.nvidia.****.rapids.GpuHashAggregateIterator.$anonfun$next$2(aggregate.scala:307)
06:42:43  E                   	at scala.Option.getOrElse(Option.scala:189)
06:42:43  E                   	at com.nvidia.****.rapids.GpuHashAggregateIterator.next(aggregate.scala:304)
06:42:43  E                   	at com.nvidia.****.rapids.GpuHashAggregateIterator.next(aggregate.scala:247)
06:42:43  E                   	at com.nvidia.****.rapids.AcceleratedColumnarToRowIterator.$anonfun$fetchNextBatch$1(GpuColumnarToRowExec.scala:148)
06:42:43  E                   	at com.nvidia.****.rapids.Arm.withResource(Arm.scala:28)
06:42:43  E                   	at com.nvidia.****.rapids.Arm.withResource$(Arm.scala:26)
06:42:43  E                   	at com.nvidia.****.rapids.AcceleratedColumnarToRowIterator.withResource(GpuColumnarToRowExec.scala:39)
06:42:43  E                   	at com.nvidia.****.rapids.AcceleratedColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:146)
06:42:43  E                   	at com.nvidia.****.rapids.AcceleratedColumnarToRowIterator.populateBatch(GpuColumnarToRowExec.scala:137)
06:42:43  E                   	at com.nvidia.****.rapids.AcceleratedColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:129)
06:42:43  E                   	at com.nvidia.****.rapids.AcceleratedColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:158)
06:42:43  E                   	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
06:42:43  E                   	at org.apache.****.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
06:42:43  E                   	at org.apache.****.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
06:42:43  E                   	at org.apache.****.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
06:42:43  E                   	at org.apache.****.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
06:42:43  E                   	at org.apache.****.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
06:42:43  E                   	at org.apache.****.rdd.RDD.iterator(RDD.scala:313)
06:42:43  E                   	at org.apache.****.scheduler.ResultTask.runTask(ResultTask.scala:90)
06:42:43  E                   	at org.apache.****.scheduler.Task.run(Task.scala:127)
06:42:43  E                   	at org.apache.****.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
06:42:43  E                   	at org.apache.****.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
06:42:43  E                   	at org.apache.****.executor.Executor$TaskRunner.run(Executor.scala:449)
06:42:43  E                   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
06:42:43  E                   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
06:42:43  E                   	at java.lang.Thread.run(Thread.java:748)

tgravescs added the "bug" and "? - Needs Triage" labels Sep 29, 2021
sameerz added the "duplicate" label and removed the "? - Needs Triage" label Oct 5, 2021

sameerz commented Oct 5, 2021

Closing as a duplicate of #3703

sameerz closed this as completed Oct 5, 2021