
How I use your pretrained model for unlabeled face images? #9

Open
SharharZ opened this issue Jun 15, 2019 · 47 comments

@SharharZ

Hi, I want to use this method to preprocess many unlabeled face images. How do I use your pretrained model to cluster and label them? Thank you very much!

@yl-1993
Owner

yl-1993 commented Jun 15, 2019

@SharharZ Hi, you can (1) use pretrained face recognition models to extract face features, and (2) use the clustering methods provided in this repo to group the face features.

@SharharZ
Author

@yl-1993 Thanks for your reply! Should I use generate_proposal.py to extract features and dsgcn/main.py to cluster? Could you share your pretrained model on Baidu Yun? How many images does the code support? I may have millions of images.

@yl-1993
Owner

yl-1993 commented Jun 17, 2019

@SharharZ Yes, you can follow the pipeline in scripts/pipeline.sh. As shown in our face clustering benchmark, it can handle at least 5M unlabeled face images.

@yl-1993
Owner

yl-1993 commented Jun 17, 2019

@SharharZ The pretrained model has already been shared through Baidu Yun. Check out "Setup and get data" for more details.

@SharharZ
Author

SharharZ commented Jun 18, 2019

@yl-1993 Thank you! I'm sorry, maybe I didn't describe it clearly. I mean the pretrained model of hfsoftmax. I analyzed the code and downloaded your data, but I am not sure how to generate the .bin and .npz files for my own face image data. In other words, I extract 512-dimensional face features; how do I convert them into your file format?

@yl-1993
Owner

yl-1993 commented Jun 18, 2019

@SharharZ I think you can store your features with np.save. More details can be found in extract.py. Besides, I will upload the pretrained face recognition model to Baidu Yun soon.
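The suggestion above can be sketched as follows; the array shape and file name here are illustrative assumptions, not values from the repo:

```python
import numpy as np

# Hypothetical sketch: store 512-d face features with np.save and load
# them back. Random data stands in for real extracted features.
features = np.random.rand(1000, 512).astype(np.float32)
np.save("features.npy", features)

loaded = np.load("features.npy")
print(loaded.shape)  # (1000, 512)
```

Note that np.save writes the .npy format (array data plus a small header describing shape and dtype), which np.load can read back without extra bookkeeping.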

@SharharZ
Author

@yl-1993 Thank you very much!

@jxyecn

jxyecn commented Jun 21, 2019

> @SharharZ Hi, you can (1) use pretrained face recognition models to extract face features. (2) use the clustering methods provided in this repo to group face features.

@yl-1993 There are several different pre-trained models for extracting face features at the link you provided. Which feature-extraction model matches the released pretrained clustering model?

@yl-1993
Owner

yl-1993 commented Jun 26, 2019

@SharharZ Pretrained models for feature extraction have been uploaded to Baidu Yun. You can find the link in the hfsoftmax wiki.

@yl-1993
Owner

yl-1993 commented Jun 26, 2019

@jxyecn For the pretrained clustering model, we use ResNet-50 as the feature extractor.

  • If you only want to try the clustering method, you can directly use the extracted features.
  • If you want to extract your own features and train the clustering model, you can choose any model as the feature extractor.

@jxyecn

jxyecn commented Jun 26, 2019

> @jxyecn For the pretrained clustering model, we use ResNet-50 as the feature extractor.
>
>   • If you only want to try the clustering method, you can directly use the extracted features.
>   • If you want to extract your own features and train the clustering model, you can choose any model as the feature extractor.

@yl-1993 Thanks for the reply! However, my understanding is that if a different face-feature extraction model is used, the clustering model needs to be retrained, right? So I want to confirm which feature-extraction model matches the released pretrained clustering model.

@yl-1993
Owner

yl-1993 commented Jun 26, 2019

@jxyecn Yes, which is why the reply above says that if you want to extract your own features and train your own clustering model, you can choose any feature-extraction model. Also, the parameters of this ResNet-50 model differ slightly from those of the extractor used to train the pretrained clustering model; if you find this has a significant impact, feel free to leave another comment under this issue.

@engmubarak48

engmubarak48 commented Jul 18, 2019

> @SharharZ Hi, you can (1) use pretrained face recognition models to extract face features. (2) use the clustering methods provided in this repo to group face features.

The question is how to use your main.py file. I wanted to provide extracted face features (face embeddings), but your config file seems to take training-related files. I suppose I should put the directory of the embeddings in the test-path location of "cfg_test_0.7_0.75.yaml", but I can't figure out how this will work, since the config also takes a training file path. Can you explain this part a bit?

@yl-1993
Owner

yl-1993 commented Jul 19, 2019

@engmubarak48 Thanks for pointing this out. For the testing part, the code reads the training file path but does not use it. I will refine this part to make it clearer. For now, you can set a dummy training path or simply set the training path to the same value as the testing path.

@engmubarak48

@yl-1993 Thanks for your quick reply. May I ask which part of your code extracts/generates the image features? I have read your generate_proposals.py file, and it seems to take .bin files. Do we have to extract the features on our own, or is there a file that extracts the features and saves them as a .bin file?
I was hoping there would be a file in your repo that uses the face-recognition pretrained models and saves the extracted features in a format that the face-clustering code accepts.

thanks.

@yl-1993
Owner

yl-1993 commented Jul 22, 2019

@engmubarak48 Since this repo focuses on the clustering framework, face recognition training and feature extraction are not included. You can check out hfsoftmax for the pretrained model and feature extraction. A similar discussion can be found in #4.

@engmubarak48

engmubarak48 commented Jul 23, 2019

> @engmubarak48 Thanks for pointing out. For the testing part, it will read the training file path but not use it. I will refine this part to make it more clear. Currently, I think you can set a dummy training path or simply set the training path the same as the testing path.

@yl-1993 Since the data is unlabeled, I have only one file containing the extracted features (assuming I extracted my features and saved them as a .bin file). But in your test config file there is a path pointing to a .meta file (which, as I understand it, contains the labels). What kind of labels are they, and why do we need them, since we are clustering unlabeled images?

Or is the meta file used only for evaluation, and can it be removed if evaluation is not needed?

Dear @yl-1993, what I intend to do is the following:

  1. extract features of my unlabeled image data via your extract_feat.py
  2. then use your main.py script to cluster the embeddings.

Also, I realized your extract_feat.py in hfsoftmax reads images from a .bin file, so I think I should save my NumPy-array image data into a .bin file too.

Could you please clarify, step by step, what format my data should be in and what needs to be filled in the config file, for both extract.py and main.py?

I would really appreciate it.

@yl-1993
Owner

yl-1993 commented Jul 23, 2019

@engmubarak48

  1. You can use the FileListDataset, which takes a file list and an image prefix as input.
val_dataset = FileListDataset(
    args.val_list, args.val_root,
    transforms.Compose([
        transforms.Resize(args.image_size),
        transforms.CenterCrop(args.input_size),
        transforms.ToTensor(),
        normalize,
    ]))
  2. For feature extraction, we don't need a config file. You can use the following command.
python extract_feat.py \
        --arch {} \
        --batch-size {} \
        --input-size {} \
        --feature-dim {} \
        --load-path {} \
        --val_list {} \
        --val_root {} \
        --output-path {}

@engmubarak48

Dear @yl-1993,

The main question I asked is: what should I put in the .meta file if I don't have labels for the data? In your "cfg_test_0.7_0.75.yaml" config file, there is a path pointing to the file "part1_test.meta".

In general, I only want to cluster the images, put each cluster into a folder, and then check the clusters manually.

Thanks

@yl-1993
Owner

yl-1993 commented Jul 27, 2019

@engmubarak48 Sorry for not fully understanding your question. As a quick fix, you can simply use a dummy meta file for testing; it will not influence the clustering result. The meta file is currently used to measure the difference between the predicted score and the ground-truth score, which serves as a reference value in the test phase. This is a good point, and we will support an empty meta file during inference soon.
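The dummy-meta workaround can be sketched as below; the label values are placeholders, and the assumption (mine, based on the discussion) is that only the line count, one line per feature, needs to match:

```python
# Sketch: write a placeholder label (0) for each of N features so the
# test config has a .meta file to read. N and the file name are
# assumptions for illustration.
num_features = 940
with open("part1_test.meta", "w") as f:
    for _ in range(num_features):
        f.write("0\n")

# sanity check: one line per feature
with open("part1_test.meta") as f:
    num_lines = sum(1 for _ in f)
print(num_lines)  # 940
```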

@yl-1993
Owner

yl-1993 commented Aug 6, 2019

@engmubarak48 #17 removes unnecessary inputs during inference. For now, you only need to feed features and proposals into the trained network.

@felixfuu

Which of the models you provide should I use to extract face features, so that I can then use the clustering model (pretrained_gcn_d.pth.tar) you provide to process my own images?

@yl-1993
Owner

yl-1993 commented Aug 14, 2019

@felixfuu You can use resnet50-softmax as the feature extractor. (It is a little different from the feature extractor used to train the clustering model. If there is a big performance drop, feel free to report it under this issue.)

@engmubarak48

> @engmubarak48 #17 removes unnecessary inputs during inference. For now, you only need to feed features and proposals into the trained network.

Thanks, @yl-1993. I already got it working back when I was checking the performance. Do you have any further plans to improve the performance? I am working in this area (face clustering), so let me know if you are planning further research here; we might exchange some ideas.

@felixfuu

> @felixfuu You can use resnet50-softmax as the feature extractor. (It is a little different with the feature extractor used to train the clustering model. If there is a big performance drop, feel free to report under this issue.)

How do I make an annotation file (.meta) for new data?

@yl-1993
Owner

yl-1993 commented Aug 16, 2019

@felixfuu For clustering, you only need to feed features and proposals into the trained network.

@felixfuu

felixfuu commented Aug 18, 2019

The result of my experiment is not very good: I used 940 faces (many with the same identity) and the clustering produced 900 labels, so almost every picture got its own label. @yl-1993

@felixfuu

@yl-1993 I use resnet50-softmax as the feature extractor and follow the pipeline in scripts/pipeline.sh. Is there an error in this process?

@yl-1993
Owner

yl-1993 commented Aug 21, 2019

@felixfuu The overall procedure is correct. I think there are two ways to check your results. (1) Check the extracted features: you can use scripts/generate_proposals.sh to generate cluster proposals, which can be regarded as clustering results. You may reduce k or maxsz for your data (940 instances). This step depends only on the extracted features and should yield reasonable results. (2) Check the pipeline: you can download the provided features and reproduce the result on ms1m.

@felixfuu

@yl-1993 Following your suggestion, I visualized the cluster proposals, and the clustering result is not good, so the problem should be the features. In my experiment, k = 20 and maxsz = 100.

@yl-1993
Owner

yl-1993 commented Aug 21, 2019

@felixfuu k and maxsz are reasonable. To check the extracted features, you can pick one face pair with the same identity and another face pair with different identities, and compare the cosine similarities of the two pairs. Also, as a reminder, the face images need to be aligned before being fed into the feature-extraction network.
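The pair check described above can be sketched as follows; the feature vectors here are toy stand-ins, as real ones would come from the feature extractor applied to aligned face crops:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for extracted features (assumption: low-dimensional
# vectors instead of real 512-d embeddings, for illustration only).
feat_a1 = np.array([1.0, 0.0, 0.2])
feat_a2 = np.array([0.9, 0.1, 0.3])   # same identity: should score high
feat_b  = np.array([0.0, 1.0, -0.5])  # different identity: should score low

print(cosine_similarity(feat_a1, feat_a2))  # close to 1
print(cosine_similarity(feat_a1, feat_b))   # much lower, near 0
```

If the same-identity score is not clearly higher than the different-identity score on real features, the extractor (or the alignment step) is the likely culprit.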

@felixfuu

@yl-1993 Feature extraction should not be the problem; I checked it with pairs (the cosine similarity is over 0.7 for a pair with the same identity and below 0.5 for a pair with different identities).

@felixfuu

By the way, I used knn_hnsw. @yl-1993

@MrHwc

MrHwc commented Aug 21, 2019

I use train_cluster_det to train the clustering model; the nodes and edges are generated by generate_proposals. But I found that the clusters generated by generate_proposals are not accurate, and the resulting model has poor performance. Am I using train_cluster_det correctly?

@yl-1993
Owner

yl-1993 commented Aug 21, 2019

@felixfuu @MrHwc It seems both of you have encountered problems with proposal generation. The basic rule is to reduce th if the number of clusters is too large. Since the algorithm finds connected components, a high threshold will lead to a large number of small clusters. You can post more details, e.g., images per cluster, and we can better diagnose the problem.

@felixfuu

felixfuu commented Aug 21, 2019

@yl-1993 I checked the proposals; the features are probably not robust enough, as there is no obvious gap between same-identity and different-identity pairs.

@MrHwc

MrHwc commented Aug 22, 2019

My training set has about 100,000 samples; each id has at least 3 feature vectors, and up to 381. With K = {30, 60, 80} and th = {0.5, 0.55, 0.6, 0.65}, the evaluation gives precision = 0.78, recall = 0.62, and F-score = 0.69, with 29,171 predicted classes. I think this is unreasonable, since the ground truth is 4,726.

@yl-1993
Owner

yl-1993 commented Aug 26, 2019

@MrHwc There are several checks that may help. (1) Have you checked the distribution of the generated clusters? Empirically, a large proportion of clusters may contain only 2 images. (2) What are the results of single proposals, e.g., K=80, th=0.6? If the clustering model is well trained, it will surpass the results of single proposals. (3) Proposals with a low threshold help recall, and proposals with a high threshold may improve precision. Given your results, you can try adding proposals with a higher threshold, e.g., th=0.7.

@ghost

ghost commented Oct 12, 2019

@yl-1993 Hello, I ran into some problems using your code. I extracted features for 55 images with the feature-extraction code you provided, but after running the clustering code, the resulting pred_labels.txt contains 584,013 rows. My understanding is that each row corresponds to the label of one image, but this far exceeds my number of images. Do I need to modify the program when using my own features? This number seems to correspond to the features you provided.

@yl-1993
Owner

yl-1993 commented Oct 12, 2019

@luhengjie Hello, could you list the exact way you invoked the code? I suspect the default part1_test data is being used somewhere. For others with the same problem, in English: when Hengjie uses the repo with his own features, the number of predicted results does not match the number of his features. I guess the problem may lie in part1_test being used somewhere; we can identify the problem once more details are posted.

@ghost

ghost commented Oct 15, 2019

@yl-1993 Thank you for your reply. The way I use your code is to replace part1_test in the features with my own features and delete all files in the label folder to avoid interference. The last step is sh scripts/pipeline.sh.

@yl-1993
Owner

yl-1993 commented Oct 16, 2019

@luhengjie Thanks. If you name it part1_test, then you can add --is_rebuild in the script (https://github.com/yl-1993/learn-to-cluster/blob/master/scripts/pipeline.sh#L33) to rebuild the knn and proposals. If you use a different name, you may also need to modify the feature path in dsgcn/configs/cfg_test_0.7_0.75.yaml.

@yl-1993
Owner

yl-1993 commented Feb 20, 2020

Hi all, PR #28 simplifies the pipeline of training and testing. To apply the pretrained model to your own unlabeled features, you only need to:

  1. edit the feat_path in test config, e.g., dsgcn/configs/cfg_test_ms1m_20_prpsls.py. (remove the label_path if you don't have it.)
  2. run the test script (scripts/dsgcn/test_cluster_det.sh).
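Step 1 above amounts to a small config edit; a hypothetical sketch of the fields involved is below. The exact key names are assumptions based on this thread, so check the actual file (e.g., dsgcn/configs/cfg_test_ms1m_20_prpsls.py) in the repo:

```python
# Hypothetical config fragment: point feat_path at your own unlabeled
# features and omit the label path. Path values are placeholders.
feat_path = 'data/features/my_unlabeled_feats.bin'  # your own features
# label_path is removed/omitted, since the data is unlabeled
```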

@rose-jinyang

rose-jinyang commented Apr 13, 2020

Hi,
I have a question.
I made a .bin file of my entire image dataset to cluster, but I found that the number of images is twice the number of labels in the script hfsoftmax/utils.py.
(screenshot of hfsoftmax/utils.py omitted)
Could you explain this?
Thanks

@rose-jinyang

Hi,
I am going to make my custom training dataset.
I extracted a 2048-dimensional embedding per face image and saved all the embeddings to a file using numpy.save.
How should I make a meta file for the labels?
Should I store one label per face embedding in the meta file, so that the number of feature embeddings in the feature file equals the number of labels in the meta file?
Is that right?

@yl-1993
Owner

yl-1993 commented Apr 14, 2020

Hi @rose-jinyang

  • For the first question: this loader is mainly designed to process the .bin file provided by ArcFace. It duplicates images for fast pair generation when evaluating face verification. I will make this clear in hfsoftmax/utils.py.
  • For the second question: it is basically correct, except that the embeddings are currently saved with np.tofile. Explanations of how to build a custom dataset will be added to the README.
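The np.save vs. np.tofile distinction above can be sketched as follows; shapes and file names are illustrative assumptions:

```python
import numpy as np

# Sketch: convert features saved with np.save (.npy, has a header) to the
# flat headerless binary layout written by ndarray.tofile, which the
# thread indicates the repo's .bin feature files use.
features = np.random.rand(100, 2048).astype(np.float32)
np.save("features.npy", features)

# .npy -> flat .bin (row-major float32, no shape/dtype header)
np.load("features.npy").astype(np.float32).tofile("features.bin")

# Reading the .bin back requires knowing dtype and dimension up front:
restored = np.fromfile("features.bin", dtype=np.float32).reshape(-1, 2048)
print(np.array_equal(restored, features))  # True
```

Since tofile discards shape and dtype, the feature dimension (2048 here) must be recorded somewhere else, e.g., in the config.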

@liupengcnu

> @SharharZ I think you can store your features with np.save. More details can be found in extract.py. Besides, I will upload the pretrained face recognition model to Baidu Yun soon.

Following the code in your extract.py, the extracted features are saved as an .npy file rather than a binary .bin file. How can I save them as a .bin file?
