Deep attributed graph clustering has developed rapidly in recent years, and many methods have been proposed. Although most of them are open source, their code does not share a unified framework, so researchers have to spend a lot of time modifying code in order to reproduce results. Fortunately, Liu et al. [Homepage: yueliu1999] have organized deep graph clustering methods into a code repository, Awesome-Deep-Graph-Clustering (ADGC). For example, they provide more than 20 datasets in a unified format. They also list the most relevant papers on deep graph clustering together with links to the source code. Notably, they organize the code of deep graph clustering into a rand-augmentation-model-clustering-visualization-utils structure, which greatly helps beginners and researchers. I would like to express my sincere thanks and respect to Liu et al.
❤️ Acknowledgements:
Thanks to these authors for open-sourcing their code (in no particular order):
[ yueliu1999 | bdy9527| Liam Liu | Zhihao PENG | William Zhu | WxTu ]
[ xihongyang1999 | gongleii ]
Building on ADGC, I refactored the code to unify the deep clustering implementations further. Specifically, I redesigned the architecture so that the open-source code can be run easily, and I defined utility classes and functions to simplify the code and make the configuration of settings clear.
- 📃 main.py: The entry file of the framework.
- 📃 requirements.txt: The third-party libraries that need to be installed first.
- 📁 dataset: The directory containing the datasets you need, with one subdirectory per dataset, named after the dataset. Each subdirectory includes the feature file, the label file and the adjacency matrix file, named {dataset name}_feat.npy, {dataset name}_label.npy and {dataset name}_adj.npy, for example acm_feat.npy, acm_label.npy and acm_adj.npy. The dataset directory also includes a Python file named dataset_info.py, which stores information about the datasets.
- 📁 module: The directory containing the most commonly used basic modules of the models, such as the Auto-encoder (AE.py), the Graph Convolutional Layer (GCN.py), the Graph Attention Layer (GAT.py), etc.
- 📁 model: The directory containing the models you want to run. Each model has a subdirectory named after the model name in uppercase letters, which contains two files: model.py, which stores the model classes, and train.py, which trains the model. The framework dynamically imports the training file of a model according to the input model name. The model directory can also hold pre-training directories named pretrain_{module name}_for_{model name} in lowercase letters, each of which stores a train.py file. For example, if you want to pretrain the AE module in SDCN, name the directory pretrain_ae_for_sdcn. The model.py and train.py files can be written from the templates provided in the template directory. The explanation.txt file lists the attributes that argparse provides, which you can use as needed.
- 🛠️ utils: The directory containing utility classes and functions.
  - 💾 load_data.py: It includes the functions for loading datasets for training.
  - 📊 data_processor.py: It includes functions for converting data storage types and other processing, such as numpy to torch conversion and symmetric normalization.
  - calculator.py: It included the function for calculating the mean and standard deviation. This file has been merged into utils.py.
  - 📊 evalution.py: It includes the function for calculating clustering metrics, such as ACC, NMI, ARI and F1_score.
  - formatter.py: It included the function for formatting the output of variables according to your input variables. This file has been merged into utils.py.
  - 📃 logger.py: It includes a log class, through which you can record the information you want to write to the log file.
  - parameter_counter.py: It included the function for counting the model's parameters. This file has been merged into utils.py.
  - 📁 path_manager.py: It includes the function for converting a relative path to an absolute path if needed. Even if you don't need the conversion, it should still be called, because it also stores the paths needed by training, such as the paths for logs, pretrained parameter files, clustering visualization images, etc.
  - 🎨 plot.py: It includes the function for drawing the clustering visualization via t-SNE and saving the image. A feature heatmap is also planned.
  - ⏱️ time_manager.py: It includes a time class for recording elapsed time and a function for formatting datetimes.
  - 🎲 rand.py: It includes the function for setting the random seed.
  - 🛠️ utils.py: It includes the utility functions from the previous files, such as get_format_variables() from formatter.py.
  - ⚙️ options.py: It includes the argparse object.
  - 💨 kmeans_gpu.py: It contains the GPU-accelerated K-means algorithm.
  - 📊 result.py: It defines a Result class to unify the return value.
- 📁 logs: The directory used to store the output log files. Its subdirectories are named after the model names, and the log files are named after their start time.
- 📁 pretrain: The directory used to store the pre-trained parameter files. Its subdirectories are named in the format pretrain_{module name}. Parameter files are organized by model and dataset name.
- 🖼️ img: The directory used to store the output images, with subdirectories named clustering and heatmap.
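The dataset naming convention described above can be sketched as a small loader. This is only an illustration, not the framework's own load_data.py: the function name `load_graph_dataset` is hypothetical, and the sketch assumes the adjacency matrix is stored as a dense n × n array.

```python
from pathlib import Path

import numpy as np


def load_graph_dataset(root, name):
    """Load <root>/<name>/<name>_{feat,label,adj}.npy (hypothetical helper)."""
    folder = Path(root) / name
    feature = np.load(folder / f"{name}_feat.npy")
    label = np.load(folder / f"{name}_label.npy")
    adj = np.load(folder / f"{name}_adj.npy")

    # Consistency checks: every file must describe the same number of nodes.
    n = feature.shape[0]
    if label.shape[0] != n:
        raise ValueError(f"{name}: {label.shape[0]} labels for {n} nodes")
    if adj.shape != (n, n):
        # Assumption: the adjacency matrix is a dense square array.
        raise ValueError(f"{name}: adjacency shape {adj.shape}, expected {(n, n)}")
    return feature, label, adj
```

For example, `load_graph_dataset("dataset", "acm")` would read acm_feat.npy, acm_label.npy and acm_adj.npy from dataset/acm.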
After cloning the repository, follow the steps below to run the code:
Step 1: Check the environment, or install the required libraries directly from requirements.txt.
pip install -r requirements.txt
Step 2: Prepare the datasets. If you don't have the datasets, you can download them from Liu's repository [yueliu1999 | Google Drive | Nutstore]. Then unzip them into the dataset directory.
Step 3: From the command line, run main.py in the directory where it is located. In an integrated development environment (IDE), you can run the main.py file directly.
Take the training of DAEGC as an example:
1️⃣ pretrain GAT:
python main.py --pretrain --model pretrain_gat_for_daegc --dataset acm --t 2 --desc pretrain_the_GAT_for_DAEGC_on_acm
# or the simplified command:
python main.py -P -M pretrain_gat_for_daegc -D acm -T 2 -DS pretrain_the_GAT_for_DAEGC_on_acm
2️⃣ train DAEGC:
python main.py --model DAEGC --dataset acm --t 2 --desc Train_DAEGC_1_iteration_on_the_ACM_dataset
# or the simplified command:
python main.py -M DAEGC -D acm -T 2 -DS Train_DAEGC_1_iteration_on_the_ACM_dataset
Take the training of SDCN as an example:
1️⃣ pretrain AE:
python main.py --pretrain --model pretrain_ae_for_sdcn --dataset acm --desc pretrain_ae_for_SDCN_on_acm
# or simplified command:
python main.py -P -M pretrain_ae_for_sdcn -D acm -DS pretrain_ae_for_SDCN_on_acm
2️⃣ train SDCN:
python main.py --model SDCN --dataset acm --norm --desc Train_SDCN_1_iteration_on_the_ACM_dataset
# or simplified command:
python main.py -M SDCN -D acm -N -DS Train_SDCN_1_iteration_on_the_ACM_dataset
Step 4: If the code runs successfully, don't forget to give me a star! 😉
No. | Model | Paper | Source Code |
---|---|---|---|
1 | DAEGC | 《Attributed Graph Clustering: A Deep Attentional Embedding Approach》 | link |
2 | SDCN | 《Structural Deep Clustering Network》 | link |
3 | AGCN | 《Attention-driven Graph Clustering Network》 | link |
4 | EFR-DGC | 《Deep Graph clustering with enhanced feature representations for community detection》 | link |
5 | GCAE | ❗ In fact, it's GAE with GCN. | - |
6 | DFCN | 《Deep Fusion Clustering Network》 | link |
7 | HSAN | 《Hard Sample Aware Network for Contrastive Deep Graph Clustering》 | link |
8 | DCRN | 《Deep Graph Clustering via Dual Correlation Reduction》 | link |
9 | CCGC | 《Cluster-guided Contrastive Graph Clustering Network》 | link |
10 | AGC-DRR | 《Attributed Graph Clustering with Dual Redundancy Reduction》 | link |
❗ Attention
- The training process of DFCN is divided into three stages, following the paper. First, pretrain the AE and the IGAE separately for 30 epochs each, using pretrain_ae_for_dfcn and pretrain_igae_for_dfcn. Second, pretrain the AE and the IGAE jointly for 100 epochs, which is implemented in pretrain_both_for_dfcn. Finally, train DFCN itself for at least 200 epochs. The same applies to DCRN.
- The HSAN model does not require pretraining.
- The results reported in the DCRN paper have not yet been reproduced; this will continue to be updated.
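Because the DFCN stages must run in a fixed order, it can be convenient to script them. The sketch below only assembles the commands already listed for DFCN in this README and runs them in sequence; the helper names `dfcn_commands` and `run_all` are hypothetical, and the drawing flags (-TS, -H) are left out for brevity.

```python
import subprocess
import sys


def dfcn_commands(dataset="acm", desc="balabala", loops=1):
    """Return the DFCN commands in the order they must be executed."""
    base = [sys.executable, "main.py"]
    common = ["-D", dataset, "-DS", desc, "-LS", str(loops)]
    return [
        # Stage 1: pretrain the AE and the IGAE separately.
        base + ["-P", "-M", "pretrain_ae_for_dfcn"] + common,
        base + ["-P", "-M", "pretrain_igae_for_dfcn", "-N"] + common,
        # Stage 2: pretrain both modules jointly.
        base + ["-P", "-M", "pretrain_both_for_dfcn", "-N"] + common,
        # Stage 3: formal training.
        base + ["-M", "DFCN", "-N"] + common,
    ]


def run_all(commands):
    """Run each command in turn and stop at the first failing stage."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling `run_all(dfcn_commands())` from the directory that contains main.py would execute the four stages one after another.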
In the future, I plan to update the other models. If you find my framework useful, feel free to contribute to its improvement by submitting your own code.
No. | Model | Paper | Source Code |
---|---|---|---|
1 | SCGC | 《Simple Contrastive Graph Clustering》 | link |
2 | Dink-Net | 《Dink-Net: Neural Clustering on Large Graphs》 | link |
# pretrain
python main.py -P -M pretrain_gat_for_daegc -D acm -T 2 -DS balabala -LS 1
# train
python main.py -M DAEGC -D acm -T 2 -DS balabala -LS 1 -TS -H
# pretrain
python main.py -P -M pretrain_ae_for_sdcn -D acm -DS balabala -LS 1
# train
python main.py -M SDCN -D acm -N -DS balabala -LS 1 -TS -H
# pretrain
python main.py -P -M pretrain_ae_for_agcn -D acm -DS balabala -LS 1
# train
python main.py -M AGCN -D acm -N -SF -DS balabala -LS 1 -TS -H
# pretrain
python main.py -P -M pretrain_ae_for_efrdgc -D acm -DS balabala -LS 1
python main.py -P -M pretrain_gat_for_efrdgc -D acm -T 2 -DS balabala -LS 1
# train
python main.py -M EFRDGC -D acm -T 2 -DS balabala -LS 1 -TS -H
# pretrain
python main.py -P -M pretrain_gae_for_gcae -D acm -N -DS balabala -LS 1
# train
python main.py -M GCAE -D acm -N -DS balabala -LS 1 -TS -H
# pretrain. Execute the following commands in sequence.
python main.py -P -M pretrain_ae_for_dfcn -D acm -DS balabala -LS 1
python main.py -P -M pretrain_igae_for_dfcn -D acm -N -DS balabala -LS 1
python main.py -P -M pretrain_both_for_dfcn -D acm -N -DS balabala -LS 1
# train
python main.py -M DFCN -D acm -N -DS balabala -LS 1 -TS -H
# train
python main.py -M HSAN -D cora -SLF -A npy -F npy -DS balabala -LS 1 -TS
# pretrain. Execute the following commands in sequence.
python main.py -P -M pretrain_ae_for_dcrn -D acm -S 1 -DS balabala -LS 1
python main.py -P -M pretrain_igae_for_dcrn -D acm -N -SF -S 1 -DS balabala -LS 1
python main.py -P -M pretrain_both_for_dcrn -D acm -N -SF -S 1 -DS balabala -LS 1
# train
python main.py -M DCRN -D acm -SLF -A npy -S 3 -DS balabala -LS 1 -TS -H
python main.py -M CCGC -D acm -SLF -SF -A npy -S 0 -LS 1 -DS balabala
python main.py -M AGCDRR -D acm -F npy -S 0 -LS 1 -DS balabala
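The -LS option in the commands above repeats training several times, and such repeated runs are usually reported as a mean and a standard deviation. A minimal sketch of that summary follows; it is not the framework's own utility. The function name `summarize_runs` and the per-run dict layout are assumptions, the metric values are made up for illustration, and the population standard deviation is used.

```python
from statistics import mean, pstdev


def summarize_runs(runs):
    """Return {metric: (mean, std)} for a list of per-run metric dicts."""
    if not runs:
        raise ValueError("at least one run is required")
    summary = {}
    for metric in runs[0]:
        values = [run[metric] for run in runs]
        # pstdev is the population standard deviation (divides by N).
        summary[metric] = (mean(values), pstdev(values))
    return summary


# Made-up values for illustration only, not experimental results.
runs = [{"ACC": 0.90, "NMI": 0.70}, {"ACC": 0.80, "NMI": 0.70}]
for metric, (avg, std) in summarize_runs(runs).items():
    print(f"{metric}: {avg:.4f} ± {std:.4f}")
```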
> python main.py --help
usage: main.py [-h] [-P] [-TS] [-H] [-N] [-SLF] [-SF] [-DS DESC]
[-M MODEL_NAME] [-D DATASET_NAME] [-R ROOT] [-K K] [-T T]
[-LS LOOPS] [-F {tensor,npy}] [-L {tensor,npy}]
[-A {tensor,npy}] [-S SEED]
Scalable Unified Framework of Deep Graph Clustering
optional arguments:
-h, --help show this help message and exit
-P, --pretrain Whether to pretrain. Using '-P' to pretrain.
-TS, --tsne Whether to draw the clustering tsne image. Using '-TS'
to draw clustering TSNE.
-H, --heatmap Whether to draw the embedding heatmap. Using '-H' to
draw embedding heatmap.
-N, --norm Whether to normalize the adj, default is False. Using
'-N' to load adj with normalization.
-SLF, --self_loop_false
Whether the adj has self-loop, default is True. Using
'-SLF' to load adj without self-loop.
-SF, --symmetric_false
Whether the normalization type is symmetric. Using
'-SF' to load asymmetric adj.
-DS DESC, --desc DESC
The description of this experiment.
-M MODEL_NAME, --model MODEL_NAME
The model you want to run.
-D DATASET_NAME, --dataset DATASET_NAME
The dataset you want to use.
-R ROOT, --root ROOT Input root path to switch relative path to absolute.
-K K, --k K The k of KNN.
-T T, --t T The order in GAT. 'None' denotes don't calculate the
matrix M.
-LS LOOPS, --loops LOOPS
The Number of training rounds.
-F {tensor,npy}, --feature {tensor,npy}
The datatype of feature. 'tenor' and 'npy' are
available.
-L {tensor,npy}, --label {tensor,npy}
The datatype of label. 'tenor' and 'npy' are
available.
-A {tensor,npy}, --adj {tensor,npy}
The datatype of adj. 'tenor' and 'npy' are available.
-S SEED, --seed SEED The random seed. The default value is 0.
Here are the details of the argparse arguments you can change:
tag | arguments | short | description | type/action | default |
---|---|---|---|---|---|
🟥 | --pretrain | -P | Whether this run is pretraining. | "store_true" | False |
🟩 | --tsne | -TS | Use it to draw the clustering result as a scatter plot. | "store_true" | False |
🟩 | --heatmap | -H | Use it to draw a heatmap of the embedding representation learned by the model. | "store_true" | False |
🟥 | --norm | -N | Whether to normalize the adj, default is False. Using '-N' to load adj with normalization. | "store_true" | False |
🟦 | --self_loop_false | -SLF | Whether the adj has self-loop, default is True. Using '-SLF' to load adj without self-loop. | "store_false" | True |
🟦 | --symmetric_false | -SF | Whether the normalization type is symmetric. Using '-SF' to load asymmetric adj. | "store_false" | True |
🟥 | --model | -M | The model you want to train. Should correspond to a model in the model directory. | str | "SDCN" |
🟥 | --dataset | -D | The dataset you want to train on. Should correspond to a dataset name in the dataset directory. | str | "acm" |
🟦 | --k | -K | For graph datasets, it is None. If the dataset is not a graph, set k to construct a KNN graph from the dataset. | int | None |
🟦 | --t | -T | If the model needs the matrix M, as DAEGC does, set t according to the paper. None means the model does not need M. | int | None |
🟥 | --loops | -LS | The number of training runs. To train the model 10 times, set it to 10. | int | 1 |
🟥 | --root | -R | If you need to convert relative paths to absolute paths, set it to the root path. | str | None |
🟪 | --desc | -DS | The description of this experiment. | str | "default" |
🟦 | --feature | -F | The datatype of feature. 'tensor' and 'npy' are available. | str | "tensor" |
🟦 | --label | -L | The datatype of label. 'tensor' and 'npy' are available. | str | "npy" |
🟦 | --adj | -A | The datatype of adj. 'tensor' and 'npy' are available. | str | "tensor" |
🟥 | --seed | -S | The random seed. It is 0 if not specified. | int | 0 |
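To make the table concrete, here is a minimal argparse sketch covering a subset of the options. It is not the framework's options.py: the `dest` names (`self_loop`, `symmetric`, `model_name`, `dataset_name`) are assumptions chosen for readability, while the flags, actions and defaults follow the table.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Scalable Unified Framework of Deep Graph Clustering")
    # "store_true" flags: off unless the flag is given.
    parser.add_argument("-P", "--pretrain", action="store_true")
    parser.add_argument("-N", "--norm", action="store_true")
    # "store_false" flags: the setting is True unless the flag is given.
    # The dest names below are hypothetical.
    parser.add_argument("-SLF", "--self_loop_false", dest="self_loop",
                        action="store_false")
    parser.add_argument("-SF", "--symmetric_false", dest="symmetric",
                        action="store_false")
    parser.add_argument("-M", "--model", dest="model_name", default="SDCN")
    parser.add_argument("-D", "--dataset", dest="dataset_name", default="acm")
    parser.add_argument("-T", "--t", type=int, default=None)
    parser.add_argument("-LS", "--loops", type=int, default=1)
    parser.add_argument("-DS", "--desc", default="default")
    parser.add_argument("-S", "--seed", type=int, default=0)
    parser.add_argument("-A", "--adj", choices=["tensor", "npy"],
                        default="tensor")
    return parser


args = build_parser().parse_args(
    ["-M", "DCRN", "-D", "acm", "-SLF", "-A", "npy", "-S", "3"])
print(args.model_name, args.self_loop, args.symmetric, args.adj, args.seed)
```

Note how -SLF turns `self_loop` off while `symmetric` keeps its default of True, which is the behaviour the "store_false" rows describe.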
💡 Tips:
- The arguments marked with 🟥 usually need to be specified.
- The arguments marked with 🟩 are the drawing options.
- The arguments marked with 🟦 are related to data loading.
- The argument marked with 🟪 is strongly recommended for recording the key points of the experiment.
- Note that "--norm" is used for graph convolutional networks to obtain a symmetrically normalized adjacency matrix, but it is not required for graph attention networks. If a model uses both, it is recommended to load the adjacency matrix without symmetric normalization first and then normalize it manually.
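For the manual normalization mentioned in the last tip, a minimal NumPy sketch is shown below. It is not the framework's data_processor.py: the function name `normalize_adj` is hypothetical, the input is assumed to be a dense adjacency matrix without self-loops, and the symmetric form computed is D^{-1/2} (A + I) D^{-1/2}, with D^{-1} (A + I) as the asymmetric alternative.

```python
import numpy as np


def normalize_adj(adj, self_loop=True, symmetric=True):
    """Normalize a dense adjacency matrix (hypothetical helper)."""
    adj = np.asarray(adj, dtype=float)
    if self_loop:
        # Assumption: the input has no self-loops yet.
        adj = adj + np.eye(adj.shape[0])
    degree = adj.sum(axis=1)
    # Replace zero degrees by 1 so that isolated nodes do not divide by zero.
    safe = np.where(degree > 0, degree, 1.0)
    if symmetric:
        d_inv_sqrt = np.where(degree > 0, safe ** -0.5, 0.0)
        return d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    d_inv = np.where(degree > 0, 1.0 / safe, 0.0)
    return d_inv[:, None] * adj
```

With `symmetric=False`, every row of a connected node sums to 1, which is the row-normalized variant.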
Strong scalability is a prominent feature of this framework. To run your own code in this framework, follow these steps:
🚄 Step 1: Write a model file model.py using PyTorch and a training file train.py, and put them into a directory named after the model name in uppercase. Then put that directory into the model directory. Template files are provided in the template directory.
🚄 Step 2: If your model needs to be pretrained, write a pretraining file train.py and put it into a directory named pretrain_{module (lowercase)}_for_{model (lowercase)}, then put that directory into the model directory. Template files are provided in the template directory.
🚄 Step 3: Modify the pretrain_type_dict at line 38 of path_manager.py. The format is "model name (uppercase)": [items]. If your model does not need pretraining, leave the list empty. Otherwise, list all the modules you need to pretrain. For example, to pretrain the AE module, add "pretrain_ae" to the list. Also check whether the pretrain type is handled in the if-else statement; if not, add it manually.
🚄 Step 4: Run your code!
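The directory conventions in these steps are what make dynamic import by model name possible. The sketch below illustrates the idea with importlib; it is not the framework's actual loader. The function names are hypothetical, and the two dictionary entries only illustrate the "model name": [items] format, using SDCN (which pretrains its AE) and HSAN (which needs no pretraining).

```python
import importlib

# Illustrative entries only; the real dictionary lives in path_manager.py.
pretrain_type_dict = {
    "SDCN": ["pretrain_ae"],
    "HSAN": [],
}


def pretrain_directories(model_name):
    """Directory names expected under model/ for a model's pretraining."""
    items = pretrain_type_dict[model_name.upper()]
    return [f"{item}_for_{model_name.lower()}" for item in items]


def training_module_path(model_name, pretrain=False):
    """Dotted path of train.py, assuming the layout model/<NAME>/train.py."""
    # Model directories are uppercase; pretraining directories are lowercase.
    folder = model_name.lower() if pretrain else model_name.upper()
    return f"model.{folder}.train"


def load_training_module(model_name, pretrain=False):
    """Import the training module for the given model name."""
    return importlib.import_module(training_module_path(model_name, pretrain))
```

For instance, `training_module_path("pretrain_ae_for_sdcn", pretrain=True)` gives `model.pretrain_ae_for_sdcn.train`, and `load_training_module("SDCN")` imports model/SDCN/train.py when run from the repository root.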
🚌 Step 1: Make sure that your dataset is well processed and that the file suffix is 'npy', which indicates that the file stores a NumPy array. If your dataset is graph data, you need to include {dataset name}_feat.npy, {dataset name}_label.npy and {dataset name}_adj.npy. If your dataset is non-graph data, there are two ways to handle it. One is to use {dataset name}_feat.npy and {dataset name}_label.npy directly and set the graph construction type at line 167 of load_data.py. If the construction type does not exist, add it to the function construct_graph in data_processor.py. The other is to construct the graph manually and use {dataset name}_feat.npy, {dataset name}_label.npy and {dataset name}_adj.npy; in this case, you need to remember the value of k you used, because the dataset is then treated as a graph dataset.
🚌 Step 2: Put the files above into a directory named after the dataset name in lowercase. Then put that directory into the dataset directory.
🚌 Step 3: Add the information about the dataset to dataset_info.py.
🚌 Step 4: Use your dataset!
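For non-graph data, constructing the graph manually could look like the sketch below, which builds a k-nearest-neighbour adjacency matrix from the feature matrix. It is an illustration rather than the framework's construct_graph: the function name `knn_graph` is hypothetical, and Euclidean distance with a symmetric, unweighted result is an assumption.

```python
import numpy as np


def knn_graph(features, k):
    """Build a symmetric, unweighted KNN adjacency matrix (hypothetical)."""
    x = np.asarray(features, dtype=float)
    n = x.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of samples")
    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq = (x ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    np.fill_diagonal(dist, np.inf)  # a node is never its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    adj[rows, neighbours.ravel()] = 1.0
    # Keep an edge if either endpoint selected the other.
    return np.maximum(adj, adj.T)
```

The result can be saved with `np.save` as {dataset name}_adj.npy next to the feature and label files.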
Deep graph clustering is currently developing rapidly, and more graph clustering methods will be proposed in the future. A unified code framework can therefore save researchers coding and experiment time and let them put more energy into theoretical innovation. I believe that graph clustering will reach a higher level in the future.
If this repository is helpful to you, please remember to Star~😘.
If you use our code, please cite these papers:
@article{ding2023graph,
title = {Graph clustering network with structure embedding enhanced},
journal = {Pattern Recognition},
volume = {144},
pages = {109833},
year = {2023},
issn = {0031-3203},
doi = {https://doi.org/10.1016/j.patcog.2023.109833},
url = {https://www.sciencedirect.com/science/article/pii/S0031320323005319},
author = {Shifei Ding and Benyu Wu and Xiao Xu and Lili Guo and Ling Ding},
}
@article{ding2024towards,
author = {Ding, Shifei and Wu, Benyu and Ding, Ling and Xu, Xiao and Guo, Lili and Liao, Hongmei and Wu, Xindong},
title = {Towards Faster Deep Graph Clustering via Efficient Graph Auto-Encoder},
year = {2024},
issue_date = {September 2024},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {18},
number = {8},
issn = {1556-4681},
url = {https://doi.org/10.1145/3674983},
doi = {10.1145/3674983},
journal = {ACM Trans. Knowl. Discov. Data},
month = {aug},
articleno = {202},
numpages = {23},
}