
LightGBM on Spark Bad Allocation memory error #483

Closed
debadridtt opened this issue Feb 5, 2019 · 9 comments
Labels
area/lightgbm bug high priority high priority issues must be fixed as soon as possible

Comments

@debadridtt

debadridtt commented Feb 5, 2019

I'm trying to run LightGBM on a small dataset. I'm launching my notebook with this command: pyspark --packages com.microsoft.ml.spark:mmlspark_2.11:0.15.dev2+1.g11ad24d --repositories https://mmlspark.azureedge.net/maven

I'm also trying other linear and bagging algorithms, such as Logistic Regression and Random Forest from the PySpark module, and they run fine. But I sometimes get a bad_alloc memory error when I run LightGBM on the same dataset. It doesn't happen every time: if I execute the cell three times, the error might appear on the second run. The dataset is also very small, ~2000 rows in the .csv file.
What could the problem be? I don't even notice significant changes in memory usage in my Resource Monitor.

P.S. I'm using Windows 10
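For reference, a minimal sketch of this kind of setup (the file name and the "label" column are placeholders, not the reporter's actual data): pyspark is launched with the mmlspark package as above, the numeric columns are assembled into a feature vector, and the error reported here surfaces during fit.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from mmlspark import LightGBMClassifier

# Assumes pyspark was launched with the mmlspark --packages coordinate shown above.
spark = SparkSession.builder.appName("lgbm-badalloc-repro").getOrCreate()

# Placeholder file and column names; ~2000 rows as described in the report.
df = spark.read.csv("small_dataset.csv", header=True, inferSchema=True)
assembler = VectorAssembler(inputCols=[c for c in df.columns if c != "label"],
                            outputCol="features")
train = assembler.transform(df)

lgbm = LightGBMClassifier(labelCol="label", featuresCol="features")
model = lgbm.fit(train)  # the intermittent bad_alloc reported here occurs during fit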

@debadridtt debadridtt changed the title LightGBM on Sparm Bad Allocation memory error LightGBM on Spark Bad Allocation memory error Feb 5, 2019
@imatiach-msft
Contributor

@debadridtt sorry to hear about the trouble you are having. If the dataset is not confidential, would you be able to share the dataset and a code snippet that reproduces the error? I can try to debug and take a look into it.

@debadridtt
Author


Hi, can you please go through the code I have posted on Stack Exchange: https://datascience.stackexchange.com/questions/45144/pyspark-v-pandas-dataframe-memory-issue

@debadridtt
Author


I'm running PySpark v2.2.0

@longyunshen

longyunshen commented Mar 28, 2019

@imatiach-msft @debadridtt
Has this problem been solved? I have exactly the same issue. The log follows.

Py4JJavaError: An error occurred while calling o1248.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 93.0 failed 1 times, most recent failure: Lost task 0.0 in stage 93.0 (TID 132, localhost, executor driver): java.lang.Exception: Dataset create call failed in LightGBM with error: bad allocation
at com.microsoft.ml.spark.LightGBMUtils$.validate(LightGBMUtils.scala:29)
at com.microsoft.ml.spark.LightGBMUtils$.generateSparseDataset(LightGBMUtils.scala:380)
at com.microsoft.ml.spark.TrainUtils$.translate(TrainUtils.scala:62)
at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:219)
at com.microsoft.ml.spark.LightGBMRegressor$$anonfun$3.apply(LightGBMRegressor.scala:90)
at com.microsoft.ml.spark.LightGBMRegressor$$anonfun$3.apply(LightGBMRegressor.scala:90)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185)

My Spark version is 2.3.2, and I'm working locally on Windows 10 with PySpark. I pip-installed mmlspark-0.15.dev2+1.g11ad24d-py2.py3-none-any.whl and launched with pyspark --packages Azure:mmlspark:0.16. I used spark.ml.classification.LogisticRegression for comparison and it works fine, but it gets stuck at LightGBM. My dataset is only 300 kB.

@imatiach-msft
Contributor

@longyunshen
I think the dataset creation may have failed due to an out-of-memory error:
https://github.com/Azure/mmlspark/blob/0b84a230d1556ced87be9139dd798237711c1158/src/lightgbm/src/main/scala/LightGBMUtils.scala#L343
How large is your cluster? At 300k rows * (assuming) 1000 cols * 8 bytes per col, plus some additional data, that would be around 3 GB total in memory, which doesn't seem like a lot for Spark but may be enough to run out of memory on a local machine. Have you tried downsampling the data?
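For example, a quick way to test the memory hypothesis is to fit on a random fraction of the rows first (a sketch only: `df`, the 0.1 fraction, and `pipeline` are placeholders for whatever DataFrame and LightGBM pipeline are being fit).

# Downsample to 10% of rows (placeholder fraction) to see whether the
# bad-allocation error goes away with a smaller dataset.
sampled = df.sample(withReplacement=False, fraction=0.1, seed=42)
model = pipeline.fit(sampled)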

@longyunshen

@imatiach-msft
There is no cluster; I'm on my local Windows 10 machine. My dataset is 300 kB as a .gz file, around 1.5 MB extracted: 45,000 rows by 17 columns. So I think it has nothing to do with the dataset itself. The following is the code I ran in Spyder on Windows.

import findspark
findspark.init()
import pyspark
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", "Azure:mmlspark:0.16") \
    .getOrCreate()

from mmlspark import LightGBMRegressor

lgb = LightGBMRegressor(alpha=0.3,learningRate=0.3,numIterations=100,numLeaves=31)

import pyspark.sql.types as typ

labels=[
('INFANT_ALIVE_AT_REPORT', typ.IntegerType()),
('BIRTH_PLACE', typ.StringType()),
('MOTHER_AGE_YEARS', typ.IntegerType()),
('FATHER_COMBINED_AGE', typ.IntegerType()),
('CIG_BEFORE', typ.IntegerType()),
('CIG_1_TRI', typ.IntegerType()),
('CIG_2_TRI', typ.IntegerType()),
('CIG_3_TRI', typ.IntegerType()),
('MOTHER_HEIGHT_IN', typ.IntegerType()),
('MOTHER_PRE_WEIGHT', typ.IntegerType()),
('MOTHER_DELIVERY_WEIGHT', typ.IntegerType()),
('MOTHER_WEIGHT_GAIN', typ.IntegerType()),
('DIABETES_PRE', typ.IntegerType()),
('DIABETES_GEST', typ.IntegerType()),
('HYP_TENS_PRE', typ.IntegerType()),
('HYP_TENS_GEST', typ.IntegerType()),
('PREV_BIRTH_PRETERM', typ.IntegerType())
]

schema=typ.StructType([typ.StructField(e[0], e[1], False) for e in labels])
births=spark.read.csv('births_transformed.csv.gz',
header=True,
schema=schema)

import pyspark.ml.feature as ft

births = births \
    .withColumn('BIRTH_PLACE_INT', births['BIRTH_PLACE']
                .cast(typ.IntegerType()))
encoder=ft.OneHotEncoder(
inputCol='BIRTH_PLACE_INT',
outputCol='BIRTH_PLACE_VEC')
featuresCreator=ft.VectorAssembler(
inputCols=[col[0] for col in labels[2:]] +
[encoder.getOutputCol()],
outputCol='features'
)

#import logistic regression for compare
import pyspark.ml.classification as cl
logistic = cl.LogisticRegression(
maxIter=10,
regParam=0.01,
labelCol='INFANT_ALIVE_AT_REPORT')

from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[
encoder,
featuresCreator,
logistic
])

births_train, births_test = births \
    .randomSplit([0.7, 0.3], seed=666)
model = pipeline.fit(births_train)
test_model = model.transform(births_test)

import pyspark.ml.evaluation as ev
evaluator = ev.BinaryClassificationEvaluator(
rawPredictionCol='probability',
labelCol='INFANT_ALIVE_AT_REPORT')
print(evaluator.evaluate(test_model,
{evaluator.metricName: 'areaUnderROC'}))
print(evaluator.evaluate(test_model,
{evaluator.metricName: 'areaUnderPR'}))

from mmlspark import LightGBMRegressor
lgb = LightGBMRegressor(alpha=0.3,learningRate=0.3,numIterations=100,numLeaves=31,labelCol='INFANT_ALIVE_AT_REPORT')

from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[
encoder,
featuresCreator,
lgb])

model = pipeline.fit(births_train) # IT IS STUCK HERE!!!!!!!
test_model = model.transform(births_test)

import pyspark.ml.evaluation as ev
evaluator = ev.BinaryClassificationEvaluator(
rawPredictionCol='probability',
labelCol='INFANT_ALIVE_AT_REPORT')
print(evaluator.evaluate(test_model,
{evaluator.metricName: 'areaUnderROC'}))
print(evaluator.evaluate(test_model,
{evaluator.metricName: 'areaUnderPR'}))

@imatiach-msft
Contributor

@longyunshen I see, it sounds like there is some error in the native code then. Is the dataset you are using confidential? I'm wondering if I can reproduce this issue locally.

@loomlike
Contributor

loomlike commented Apr 9, 2019

@imatiach-msft we are testing our Recommenders repo on a Windows DSVM and we see a similar error. FYI, the notebook works fine on a Linux DSVM.

To reproduce the error, please run staging/notebooks/02_model/mmlspark_lightgbm_criteo.ipynb on Windows.

java.lang.Exception: Dataset create call failed in LightGBM with error: bad allocation
at com.microsoft.ml.spark.LightGBMUtils$.validate(LightGBMUtils.scala:29)
at com.microsoft.ml.spark.LightGBMUtils$.generateSparseDataset(LightGBMUtils.scala:380)
at com.microsoft.ml.spark.TrainUtils$.translate(TrainUtils.scala:62)
at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:219)
at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$3.apply(LightGBMClassifier.scala:83)
at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$3.apply(LightGBMClassifier.scala:83)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

@imatiach-msft
Contributor

The bug on Windows should be fixed now on latest master (available with the next release).
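Once the next release is published, picking up the fix should just be a matter of updating the package coordinate when building the session. The version string below is a placeholder, since the actual release number isn't stated in this thread.

# Placeholder version: replace <new-version> with the first release containing the fix.
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", "Azure:mmlspark:<new-version>") \
    .getOrCreate()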
