[🐛BUG] Out-of-memory for LightGCN on the gowalla dataset #519

Closed
tszumowski opened this issue Nov 21, 2020 · 2 comments
Labels
bug Something isn't working


@tszumowski

Describe the bug
I am unable to run LightGCN on the gowalla dataset. The process runs out of memory, by a wide margin:

MemoryError: Unable to allocate 511. GiB for an array with shape (107093, 1280970) and data type float32    

I imagine it may be user error on my part, but even with the following direct command I can't get it to fit, no matter which parameters I change:

$ python3 run_recbole.py --dataset=gowalla --model=LightGCN --metrics=Recall --valid_metric=Recall@100 --topk=100 --train_batch_size=4 --eval_batch_size=4

To Reproduce
Steps to reproduce the behavior:

  1. extra yaml file: None
  2. your code: None
  3. script for running: See screenshots

Expected behavior
Expected to be able to run the gowalla dataset with LightGCN. I am looking for a way to fit it into memory.

Screenshots
Here is a trace of the full run:

Command:

$ python3 run_recbole.py --dataset=gowalla --model=LightGCN --metrics=Recall --valid_metric=Recall@100 --topk=100 --train_batch_size=4 --eval_batch_size=4

Output:

20 Nov 23:58    INFO General Hyper Parameters:
gpu_id=0          
use_gpu=True 
seed=2020
state=INFO   
reproducibility=True
data_path=dataset/gowalla
                    
Training Hyper Parameters:
checkpoint_dir=saved              
epochs=300       
train_batch_size=4     
learner=adam              
learning_rate=0.001         
training_neg_sample_num=1   
eval_step=1                  
stopping_step=10         

Evaluation Hyper Parameters:
eval_setting=RO_RS,full
group_by_user=True                                                                
split_ratio=[0.8, 0.1, 0.1]
leave_one_num=2                                                      
real_time_process=True
metrics=Recall              
topk=100
valid_metric=Recall@100
eval_batch_size=4

Dataset Hyper Parameters:
field_separator=
seq_separator=
USER_ID_FIELD=user_id
ITEM_ID_FIELD=item_id
RATING_FIELD=rating
LABEL_FIELD=label
threshold=None
NEG_PREFIX=neg_
load_col={'inter': ['user_id', 'item_id']}
unload_col=None
additional_feat_suffix=None
max_user_inter_num=None
min_user_inter_num=0
max_item_inter_num=None
min_item_inter_num=0
lowest_val=None
highest_val=None                                                                                                                                                                                                 
equal_val=None                                
not_equal_val=None                                      
drop_filter_field=True                                  
fields_in_same_space=None         
fill_nan=True                                
preload_weight=None                                                                       
drop_preload_weight=True                                                                                
normalize_field=None                                                           
normalize_all=True                                                                                              
ITEM_LIST_LENGTH_FIELD=item_length                                
LIST_SUFFIX=_list                                                                                                       
MAX_ITEM_LIST_LENGTH=50                                      
POSITION_FIELD=position_id                                                                                        
HEAD_ENTITY_ID_FIELD=head_id            
TAIL_ENTITY_ID_FIELD=tail_id                                                                                         
RELATION_ID_FIELD=relation_id            
ENTITY_ID_FIELD=entity_id                                                                                                     
                                                 
                                                                                                                                                                                                                  
20 Nov 23:58    INFO gowalla                                    
The number of users: 107093                                              
Average actions of users: 37.17675456616741
The number of items: 1280970                                          
Average actions of items: 3.1080635050496928
The number of inters: 3981333
The sparsity of the dataset: 99.99709779249932%
Remain Fields: ['user_id', 'item_id']
20 Nov 23:58    INFO Build [ModelType.GENERAL] DataLoader for [train] with format [InputType.PAIRWISE]
20 Nov 23:58    INFO Evaluation Setting:
        Group by user_id
        Ordering: {'strategy': 'shuffle'}
        Splitting: {'strategy': 'by_ratio', 'ratios': [0.8, 0.1, 0.1]}
        Negative Sampling: {'strategy': 'by', 'distribution': 'uniform', 'by': 1}
20 Nov 23:58    INFO batch_size = [[4]], shuffle = [True]

20 Nov 23:58    INFO Build [ModelType.GENERAL] DataLoader for [evaluation] with format [InputType.POINTWISE]
20 Nov 23:58    INFO Evaluation Setting:
        Group by user_id
        Ordering: {'strategy': 'shuffle'}
        Splitting: {'strategy': 'by_ratio', 'ratios': [0.8, 0.1, 0.1]}
        Negative Sampling: {'strategy': 'full', 'distribution': 'uniform'}
20 Nov 23:58    INFO batch_size = [[4, 4]], shuffle = [False]

20 Nov 23:58    WARNING Batch size is changed to 1280970
20 Nov 23:58    WARNING Batch size is changed to 1280970
Traceback (most recent call last):
  File "run_recbole.py", line 25, in <module>
    run_recbole(model=args.model, dataset=args.dataset, config_file_list=config_file_list)
  File "/home/szumowskit1/workspace/RecBole/recbole/quick_start/quick_start.py", line 45, in run_recbole
    model = get_model(config['model'])(config, train_data).to(config['device'])
  File "/home/szumowskit1/workspace/RecBole/recbole/model/general_recommender/lightgcn.py", line 69, in __init__
    self.norm_adj_matrix = self.get_norm_adj_mat().to(self.device)
  File "/home/szumowskit1/workspace/RecBole/recbole/model/general_recommender/lightgcn.py", line 90, in get_norm_adj_mat
    A[:self.n_users, self.n_users:] = self.interaction_matrix
  File "/home/szumowskit1/.venv/recbole/lib/python3.7/site-packages/scipy/sparse/lil.py", line 333, in __setitem__
    IndexMixin.__setitem__(self, key, x)
  File "/home/szumowskit1/.venv/recbole/lib/python3.7/site-packages/scipy/sparse/_index.py", line 116, in __setitem__
    self._set_arrayXarray_sparse(i, j, x)
  File "/home/szumowskit1/.venv/recbole/lib/python3.7/site-packages/scipy/sparse/lil.py", line 319, in _set_arrayXarray_sparse
    x = np.asarray(x.toarray(), dtype=self.dtype)
  File "/home/szumowskit1/.venv/recbole/lib/python3.7/site-packages/scipy/sparse/coo.py", line 321, in toarray
    B = self._process_toarray_args(order, out)
  File "/home/szumowskit1/.venv/recbole/lib/python3.7/site-packages/scipy/sparse/base.py", line 1185, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError: Unable to allocate 511. GiB for an array with shape (107093, 1280970) and data type float32
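
The failing line, A[:self.n_users, self.n_users:] = self.interaction_matrix, assigns a sparse block into a lil_matrix slice, and scipy converts the interaction matrix to a dense array while doing so (107,093 × 1,280,970 float32 values ≈ 511 GiB). Below is a minimal sketch of building the symmetric normalized adjacency matrix entirely in sparse form, assuming interaction_matrix is a scipy.sparse matrix of shape (n_users, n_items); the helper name and details are illustrative and not the exact change later made in RecBole:

import numpy as np
import scipy.sparse as sp

def build_norm_adj(interaction_matrix, n_users, n_items):
    # Assemble A = [[0, R], [R^T, 0]] directly from COO coordinates,
    # so the (n_users + n_items)^2 matrix is never materialized densely.
    R = interaction_matrix.tocoo()
    row = np.concatenate([R.row, R.col + n_users])
    col = np.concatenate([R.col + n_users, R.row])
    data = np.ones(len(row), dtype=np.float32)
    n = n_users + n_items
    A = sp.coo_matrix((data, (row, col)), shape=(n, n))

    # Symmetric normalization D^{-1/2} A D^{-1/2}, as used by LightGCN.
    deg = np.asarray(A.sum(axis=1)).flatten()
    with np.errstate(divide="ignore"):
        d_inv_sqrt = np.power(deg, -0.5)
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0
    D = sp.diags(d_inv_sqrt)
    return (D @ A @ D).tocoo()

With a construction like this, peak memory stays roughly proportional to the number of interactions (about 4 million here) plus the number of nodes, rather than to n_users × n_items.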

Colab Links
N/A

Desktop (please complete the following information):

  • OS: Linux
  • RecBole Version: 0.1.1
  • Python Version: 3.7.6
  • PyTorch Version: 1.7.0+cu101
  • cudatoolkit Version: 10.1
  • GPU: P100 (16GB)
  • Machine Specs: 16 CPU machine, 60GB RAM
@ShanleiMu
Member

We have optimized the GPU memory occupation of the adjacency matrix in GNN-based models (e.g., LightGCN, NGCF). More details can be found in #525 and #526. In the updated version, the command you provided runs successfully in our test, occupying about 8 GB of GPU memory.

It is unavoidable that larger datasets take up more resources (e.g., GPU memory) and require more time for training and evaluation, so we recommend using data filtering methods (e.g., k-core filtering, by setting min_user_inter_num and min_item_inter_num) to ensure the quality of the dataset and reduce its size; see the example command below. In addition, we provide a preliminary list (Time and memory cost of models) of the training time, evaluation time, and memory occupation of different recommendation methods on three datasets of different scales for reference. The information for LightGCN can be found in Time and memory costs of general recommendation models.
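
For example, k-core filtering can be applied directly from the command line using the same parameters shown in the log above (the thresholds here are only illustrative and should be tuned to your use case):

$ python3 run_recbole.py --dataset=gowalla --model=LightGCN --min_user_inter_num=10 --min_item_inter_num=10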

We hope these can help you.

@tszumowski
Author

@ShanleiMu This is fantastic and incredibly helpful, thank you. I have some questions about the time/memory cost pages, but I'll direct those to #518 and close this issue since the fix has been merged. Thank you again.
