Question about the datasets #78

Rzero7 · 2024-01-01T12:52:37Z

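Hello, the Yelp2018, Amazon-Book, and Gowalla datasets I downloaded from Baidu Netdisk are all much larger than the statistics reported in the paper. How can I obtain the same versions of the datasets that the paper used? Thanks a lot!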

ahukmr · 2024-01-08T04:00:18Z

Same question here. Have you managed to solve this?

downeykking · 2024-01-09T09:01:01Z

The raw data has not been preprocessed. You can write a dataset YAML file to filter it, for example the following yelp.yaml:

load_col:
  inter: [user_id, item_id, rating]
ITEM_ID_FIELD: item_id
RATING_FIELD: rating

user_inter_num_interval: "[15,inf)"
item_inter_num_interval: "[15,inf)"
val_interval:
  rating: "[3,inf)"

Then pass it with the --config_files argument when running:

python run_recbole_gnn.py --config_files=yelp.yaml

For the exact meaning of each parameter, see the documentation: https://recbole.io/docs/user_guide/config/data_settings.html
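To make the effect of these settings concrete, here is a minimal pure-Python sketch of what `val_interval` and the two `*_inter_num_interval` options conceptually do. It is an illustration only, not RecBole's actual implementation: the function name `filter_interactions` and the toy thresholds are hypothetical, and it assumes the count-based filtering is repeated until no more users or items drop out.

```python
from collections import Counter


def filter_interactions(rows, min_rating=3.0, min_user=15, min_item=15):
    """Filter (user_id, item_id, rating) tuples.

    min_rating mirrors val_interval rating "[3,inf)";
    min_user / min_item mirror the "[15,inf)" interaction-count intervals.
    """
    # Step 1: keep only interactions whose rating is at least min_rating.
    kept = [r for r in rows if r[2] >= min_rating]

    # Step 2: drop users and items with too few interactions. Removing one
    # side can push the other below its threshold, so repeat until stable.
    while True:
        user_count = Counter(u for u, _, _ in kept)
        item_count = Counter(i for _, i, _ in kept)
        filtered = [
            (u, i, r)
            for u, i, r in kept
            if user_count[u] >= min_user and item_count[i] >= min_item
        ]
        if len(filtered) == len(kept):
            return filtered
        kept = filtered


# Toy data with small thresholds so the effect is visible.
toy = [
    ("u1", "i1", 5), ("u1", "i2", 4),
    ("u2", "i1", 5), ("u2", "i2", 3),
    ("u3", "i1", 2), ("u3", "i3", 5),
]
result = filter_interactions(toy, min_rating=3, min_user=2, min_item=2)
print(len(result), "interactions remain")
```

Because both the thresholds and the order of the steps change the final counts, a different raw file or a different threshold will yield statistics that differ from a paper's table.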

ahukmr · 2024-01-09T09:08:26Z

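Thank you very much for your answer. Here is my situation: I downloaded the datasets from the RecBole Baidu Netdisk and set up the filtering in a YAML file, but the statistics after filtering still differ from those in the paper. For example, for the Amazon-Book dataset I used the same settings:

load_col:
  inter: [user_id, item_id, rating]
val_interval:
  rating: "[3,inf)"
user_inter_num_interval: "[15,inf)"
item_inter_num_interval: "[15,inf)"

but the resulting numbers of users, items, and interactions do not match the statistics in the paper. How can this be resolved? Did I download the wrong datasets? I downloaded the prepared datasets from Baidu Netdisk: for Amazon-Book I took Amazonbook under the Amazon_rating folder of the RecBole netdisk, and for Yelp I took yelp2018 under the yelp folder. After applying the filtering config, neither matches the statistics reported in the paper.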

downeykking · 2024-01-09T09:41:48Z

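The yelp and amazon-books datasets used here are probably not the 2018 versions. If you just pass the dataset name, it should be downloaded automatically. If you prefer to download manually, these are the Google Drive files I used at the time:
yelp: https://drive.google.com/file/d/1x5I2wHvKf2C4KxtczGHLNvofHX_G5fS3/view?usp=drive_link
amazon-books: https://drive.google.com/file/d/1x4MXPyX6ClQs779lHyPwIz2ivTBAUfUV/view?usp=drive_link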
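
Since the mismatch may come from a different dataset version, it can help to count users, items, and interactions directly in the downloaded `.inter` file and compare them with the paper before any filtering is applied. A small sketch, assuming the file is a tab-separated atomic file whose header looks like `user_id:token`, `item_id:token`, `rating:float` (the helper name `inter_stats` and the example path in the comment are hypothetical):

```python
import csv


def inter_stats(inter_path):
    """Return (n_users, n_items, n_interactions, sparsity) for a .inter file."""
    users, items, n_inter = set(), set(), 0
    with open(inter_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        # Header fields look like "user_id:token"; strip the ":type" suffix.
        names = [h.split(":")[0] for h in header]
        u_idx, i_idx = names.index("user_id"), names.index("item_id")
        for row in reader:
            if not row:  # skip blank lines
                continue
            users.add(row[u_idx])
            items.add(row[i_idx])
            n_inter += 1
    cells = len(users) * len(items)
    sparsity = 1 - n_inter / cells if cells else 0.0
    return len(users), len(items), n_inter, sparsity


# Example usage (placeholder path, adjust to your own download):
# print(inter_stats("dataset/yelp/yelp.inter"))
```

If the raw counts already differ from what the paper's source reports, no filtering config will reproduce the paper's table, and the file itself is the thing to replace.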

ahukmr · 2024-01-19T07:21:29Z

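Thank you very much for your answer. Sorry for asking so many questions, but I have one more. I would like to take a model already trained with RecBole and compute new evaluation metrics on the test data. For example, I previously ran NDCG, and now I want to evaluate Recall or other metrics using the previously trained parameters. I used this code:

from recbole.quick_start import load_data_and_model
from recbole.trainer import Trainer
config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
    model_file="saved/LightGCN-Nov-24-2023_21-20-02.pth",
)

config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
    model_file = "D:\RecBole-1.2.0\saved\LightGCN-Jan-16-2024_20-55-38.pth")

config['metrics'] = ["Recall","NDCG","ItemCoverage","Novelty"]
config['topk'] = [10, 20, 50]
trainer = Trainer(config, model)
test_result = trainer.evaluate(test_data)

When I run it, it always reports that the file path D:\RecBole-1.2.0\saved\LightGCN-Jan-16-2024_20-55-38.pth cannot be found. I then ran the same code inside case_study_example.py. That no longer raised the path error, but it failed with:

INFO Prepare to download dataset [ml-1m] from [https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/MovieLens/ml-1m.zip].
...
File "C:\Users\ahu\anaconda3\envs\kmrpy310\lib\urllib\request.py", line 1351, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>

It appears to be trying to download the ml-1m dataset automatically, but I already have this dataset locally from when I trained the model. How can I fix this? More generally, what is the right way to compute new metrics with previously trained model parameters? Is my approach wrong?
Looking forward to your reply, thank you!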

downeykking · 2024-01-19T11:22:41Z

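The first error (path not found) may be because the evaluate function needs the model path passed in explicitly again so that it can load the checkpoint. Its signature is: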

  def evaluate(
      self, eval_data, load_best_model=True, model_file=None, show_progress=False
  ):
      r"""Evaluate the model based on the eval data.

      Args:
          eval_data (DataLoader): the eval data
          load_best_model (bool, optional): whether load the best model in the training process, default: True.
                                            It should be set True, if users want to test the model after training.
          model_file (str, optional): the saved model file, default: None. If users want to test the previously
                                      trained model file, they can set this parameter.
          show_progress (bool): Show the progress of evaluate epoch. Defaults to ``False``.

      Returns:
          collections.OrderedDict: eval result, key is the eval metric and value in the corresponding metric value.
      """

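For the second problem, is the path in your config set correctly? You can print config['data_path'] and compare it with where the dataset actually is on disk.
Here is a working script you can try: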

from recbole.trainer import Trainer
from recbole_gnn.model.general_recommender import LightGCN
from recbole_gnn.utils import create_dataset, data_preparation
from recbole.utils import init_logger, init_seed
from logging import getLogger
import torch


def load_data_and_model(model_file, mymodel):
    def compatibility_settings():
        # Re-create the NumPy aliases (np.bool, np.int, ...) that were removed
        # in NumPy 1.24. Note that np.float_, np.complex_ and np.unicode_ were
        # themselves removed in NumPy 2.0, so this shim targets NumPy 1.24-1.26.
        import numpy as np
        np.bool = np.bool_
        np.int = np.int_
        np.float = np.float_
        np.complex = np.complex_
        np.object = np.object_
        np.str = np.str_
        np.long = np.int_
        np.unicode = np.unicode_
    compatibility_settings()

    # The checkpoint stores the training config alongside the weights.
    checkpoint = torch.load(model_file)
    config = checkpoint['config']
    init_seed(config['seed'], config['reproducibility'])
    init_logger(config)
    logger = getLogger()
    logger.info(config)

    # Rebuild the dataset and the train/valid/test splits from the saved config.
    dataset = create_dataset(config)
    logger.info(dataset)
    train_data, valid_data, test_data = data_preparation(config, dataset)

    # Instantiate the model class and restore its trained parameters.
    init_seed(config['seed'], config['reproducibility'])
    model = mymodel(config, train_data.dataset).to(config['device'])
    model.load_state_dict(checkpoint['state_dict'])
    model.load_other_parameter(checkpoint.get('other_parameter'))

    return config, model, dataset, train_data, valid_data, test_data


model_path = 'your_path'

config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
    model_file=model_path, mymodel=LightGCN
)

print(config['data_path'])
config['metrics'] = ["Recall", "NDCG"]
config['topk'] = [10, 20, 50]
print(config)
trainer = Trainer(config, model)
test_result = trainer.evaluate(test_data, model_file=model_path)

print(test_result)
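
If either error shows up again, a quick pre-flight check outside RecBole can narrow it down. The sketch below uses only the standard library. `check_checkpoint` and `check_dataset_dir` are hypothetical helper names, and the second one assumes that `config['data_path']` points at the dataset's own folder containing a `<dataset>.inter` file, which should be confirmed against your own setup.

```python
from pathlib import Path


def check_checkpoint(model_file):
    """Resolve model_file against the current working directory and verify it exists."""
    path = Path(model_file).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(
            f"checkpoint not found: {path} (cwd is {Path.cwd()})"
        )
    return path


def check_dataset_dir(data_path, dataset_name):
    """Return True if data_path holds <dataset_name>.inter (assumed layout)."""
    inter_file = Path(data_path) / f"{dataset_name}.inter"
    return inter_file.is_file()


# Example usage (placeholder path; a raw string keeps the backslashes literal):
# model_path = check_checkpoint(r"D:\your\saved\model.pth")
# print(check_dataset_dir(config['data_path'], config['dataset']))
```

A relative path such as `saved/...` is resolved against the directory the script is launched from, so the same script can find the file from one location and miss it from another.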
