Question about the datasets #78

Open
Rzero7 opened this issue Jan 1, 2024 · 6 comments

Rzero7 commented Jan 1, 2024

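Hello, the Yelp2018, Amazon-Book, and Gowalla datasets I downloaded from Baidu Netdisk are all much larger than the statistics reported in the paper. How can I obtain the same dataset versions that the paper used? Thank you very much!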

ahukmr commented Jan 8, 2024

Same question here. Did you manage to solve this?

downeykking (Contributor) commented

The raw data has not been preprocessed. You can write a dataset yaml file, for example the following yelp.yaml:

load_col:
  inter: [user_id, item_id, rating]
ITEM_ID_FIELD: item_id
RATING_FIELD: rating

user_inter_num_interval: "[15,inf)"
item_inter_num_interval: "[15,inf)"
val_interval:
  rating: "[3,inf)"

Then pass it with the --config_files argument when running:

python run_recbole_gnn.py --config_files=yelp.yaml

For the exact meaning of each parameter, see the documentation: https://recbole.io/docs/user_guide/config/data_settings.html
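
To make the effect of these settings more concrete, here is a minimal pure-Python sketch of the kind of filtering they describe: keep interactions with a rating of at least 3, then drop users and items with fewer than 15 interactions. The names `filter_interactions` and `summarize`, and the repeat-until-stable loop, are assumptions for illustration and not RecBole's actual code. The point is only that the final counts depend strongly on both the raw data version and the thresholds.

```python
from collections import Counter


def filter_interactions(interactions, min_rating=3.0, min_user=15, min_item=15):
    """Illustrative filter: rating threshold, then user/item count thresholds.

    `interactions` is an iterable of (user_id, item_id, rating) tuples.
    """
    # Step 1: value filter, analogous to val_interval: rating "[3,inf)".
    kept = [(u, i, r) for u, i, r in interactions if r >= min_rating]

    # Step 2: count filter, analogous to user/item_inter_num_interval "[15,inf)".
    # Removing a user can push an item below the threshold (and vice versa),
    # so repeat until a pass removes nothing.
    while True:
        user_counts = Counter(u for u, _, _ in kept)
        item_counts = Counter(i for _, i, _ in kept)
        filtered = [
            (u, i, r)
            for u, i, r in kept
            if user_counts[u] >= min_user and item_counts[i] >= min_item
        ]
        if len(filtered) == len(kept):
            return filtered
        kept = filtered


def summarize(interactions):
    """Return (#users, #items, #interactions) for comparison with a paper's table."""
    users = {u for u, _, _ in interactions}
    items = {i for _, i, _ in interactions}
    return len(users), len(items), len(interactions)
```

Running `summarize(filter_interactions(rows))` on two different downloads of the "same" dataset is a quick way to see whether a mismatch comes from the raw data or from the filter settings.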

ahukmr commented Jan 9, 2024

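Thank you very much for your answer. Here is my situation: I downloaded the datasets from the RecBole cloud drive and set up the filtering in the yaml file, but the statistics after filtering still differ from those in the paper. For example, for the amazonbooks dataset I used the same settings:

load_col:
  inter: [user_id, item_id, rating]
val_interval:
  rating: "[3,inf)"
user_inter_num_interval: "[15,inf)"
item_inter_num_interval: "[15,inf)"

but the resulting numbers of users, items, and interactions do not match the statistics in the paper. How can this be resolved? Did I download the wrong datasets? I used the prepared datasets from Baidu Netdisk: for Amazonbook, the Amazonbook under the Amazon_rating folder of the RecBole drive; for yelp, the yelp2018 under the yelp folder. After applying the filter settings, neither matches the statistics reported in the paper.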

downeykking (Contributor) commented

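The yelp and amazon-books datasets are probably not the 2018 versions. If you simply pass the dataset name, it should be downloaded automatically. If you want to download them yourself, I used Google Drive at the time:
yelp: https://drive.google.com/file/d/1x5I2wHvKf2C4KxtczGHLNvofHX_G5fS3/view?usp=drive_link
amazon-books: https://drive.google.com/file/d/1x4MXPyX6ClQs779lHyPwIz2ivTBAUfUV/view?usp=drive_link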
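
One way to check which version of a dataset you actually have is to compute its raw statistics before any filtering and compare them with the paper's table. The sketch below assumes the tab-separated atomic-file layout with a header row such as `user_id:token`, `item_id:token`, `rating:float`; the function name `inter_stats` and the file path in the usage note are hypothetical.

```python
import csv


def inter_stats(lines, user_field="user_id", item_field="item_id"):
    """Count users, items, and interactions from the lines of an .inter file.

    Assumes a tab-separated file whose header cells look like "name:type".
    """
    reader = csv.reader(lines, delimiter="\t")
    # Strip the ":type" suffix from each header cell to get plain field names.
    header = [cell.split(":")[0] for cell in next(reader)]
    u_idx, i_idx = header.index(user_field), header.index(item_field)

    users, items, n_inter = set(), set(), 0
    for row in reader:
        if not row:  # skip blank lines
            continue
        users.add(row[u_idx])
        items.add(row[i_idx])
        n_inter += 1

    # Sparsity = 1 - interactions / (users * items).
    denom = len(users) * len(items)
    sparsity = 1.0 - n_inter / denom if denom else 0.0
    return {
        "users": len(users),
        "items": len(items),
        "interactions": n_inter,
        "sparsity": sparsity,
    }
```

For example, `with open("dataset/yelp/yelp.inter", encoding="utf-8", newline="") as f: print(inter_stats(f))`, where the path is a placeholder for wherever your copy is stored.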

ahukmr commented Jan 19, 2024

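Thank you very much for your answer. Sorry for asking so many questions, but I have one more. I would like to take a model already trained with RecBole and compute new evaluation metrics on the test data. For example, I previously ran NDCG, and now I want to test Recall or other metrics with the same trained parameters. I used this code:

from recbole.quick_start import load_data_and_model
from recbole.trainer import Trainer

config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
    model_file="saved/LightGCN-Nov-24-2023_21-20-02.pth",
)

config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
    model_file="D:\RecBole-1.2.0\saved\LightGCN-Jan-16-2024_20-55-38.pth"
)

config['metrics'] = ["Recall", "NDCG", "ItemCoverage", "Novelty"]
config['topk'] = [10, 20, 50]
trainer = Trainer(config, model)
test_result = trainer.evaluate(test_data)

When I run it, it always says the path D:\RecBole-1.2.0\saved\LightGCN-Jan-16-2024_20-55-38.pth cannot be found. I then ran the same code inside case_study_example.py. That did not raise the path error, but it failed with:

INFO Prepare to download dataset [ml-1m] from [https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/MovieLens/ml-1m.zip].
...
File "C:\Users\ahu\anaconda3\envs\kmrpy310\lib\urllib\request.py", line 1351, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>

It looks like it tries to download the ml-1m dataset automatically, but I already downloaded that dataset locally when I trained the model. How can I fix this? Or, if I want to compute new metrics with previously trained model parameters, what should I do? Is my approach wrong?
Looking forward to your reply, thank you!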

downeykking (Contributor) commented

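The first error (path not found) may be because the evaluate function needs the model path passed in explicitly again so that it can load the model. The function signature is: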

  def evaluate(
      self, eval_data, load_best_model=True, model_file=None, show_progress=False
  ):
      r"""Evaluate the model based on the eval data.

      Args:
          eval_data (DataLoader): the eval data
          load_best_model (bool, optional): whether load the best model in the training process, default: True.
                                            It should be set True, if users want to test the model after training.
          model_file (str, optional): the saved model file, default: None. If users want to test the previously
                                      trained model file, they can set this parameter.
          show_progress (bool): Show the progress of evaluate epoch. Defaults to ``False``.

      Returns:
          collections.OrderedDict: eval result, key is the eval metric and value in the corresponding metric value.
      """

As for the second problem, is the path in your config set correctly? You can check it by printing print(config['data_path']).
Here is a working script you can try:

from recbole.trainer import Trainer
from recbole_gnn.model.general_recommender import LightGCN
from recbole_gnn.utils import create_dataset, data_preparation
from recbole.utils import init_logger, init_seed
from logging import getLogger
import torch


def load_data_and_model(model_file, mymodel):
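    def compatibility_settings():
        # Restore NumPy aliases that were removed in NumPy 1.24 but may still be
        # referenced by older dependencies. np.float_, np.complex_, and np.unicode_
        # no longer exist in NumPy 2.0, so this shim assumes NumPy 1.x.
        import numpy as np
        np.bool = np.bool_
        np.int = np.int_
        np.float = np.float_
        np.complex = np.complex_
        np.object = np.object_
        np.str = np.str_
        np.long = np.int_
        np.unicode = np.unicode_
    compatibility_settings()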

    checkpoint = torch.load(model_file)
    config = checkpoint['config']
    init_seed(config['seed'], config['reproducibility'])
    init_logger(config)
    logger = getLogger()
    logger.info(config)

    dataset = create_dataset(config)
    logger.info(dataset)
    train_data, valid_data, test_data = data_preparation(config, dataset)

    init_seed(config['seed'], config['reproducibility'])
    model = mymodel(config, train_data.dataset).to(config['device'])
    model.load_state_dict(checkpoint['state_dict'])
    model.load_other_parameter(checkpoint.get('other_parameter'))

    return config, model, dataset, train_data, valid_data, test_data


model_path = 'your_path'

config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
    model_file=model_path, mymodel=LightGCN
)

print(config['data_path'])
config['metrics'] = ["Recall", "NDCG"]
config['topk'] = [10, 20, 50]
print(config)
trainer = Trainer(config, model)
test_result = trainer.evaluate(test_data, model_file=model_path)

print(test_result)
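
Before running a script like this, it may help to confirm that both the checkpoint path and the dataset directory resolve from the current working directory. A relative data path that does not exist where the script is launched would plausibly explain the attempted re-download, though that is an assumption about this particular setup. The helper below is a hypothetical sketch using only `pathlib`; `check_paths` is not part of RecBole.

```python
from pathlib import Path


def check_paths(model_file, data_path):
    """Resolve both paths against the current working directory.

    Returns a list of human-readable problems; an empty list means both exist.
    """
    problems = []

    model = Path(model_file).expanduser().resolve()
    if not model.is_file():
        # List checkpoints in the same folder to help spot a typo
        # or a wrong working directory.
        siblings = []
        if model.parent.is_dir():
            siblings = sorted(p.name for p in model.parent.glob("*.pth"))
        problems.append(f"checkpoint not found: {model} (available: {siblings})")

    data = Path(data_path).expanduser().resolve()
    if not data.is_dir():
        problems.append(f"dataset directory not found: {data}")

    return problems
```

On Windows, writing the checkpoint path as a raw string, for example `r"D:\RecBole-1.2.0\saved\LightGCN-Jan-16-2024_20-55-38.pth"`, or with forward slashes avoids any surprises from backslash escape sequences.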
