Daily Logs

Table of Contents

2022

2022/07

  • 2022/07/01, Friday.
  1. Multi-View Transformer for 3D Visual Grounding(CVPR2022) [PDF] [Code]

    • Main Idea: Two encoders, one for the point cloud and one for text. The model learns a multi-modal representation independent of any specific single view: different rotation matrices are applied for a robust multi-view representation (see the sketch after this item's notes), and the features of each object are fused with the query features.
    • Experiments: Nr3D: 55.1%, Sr3D: 58.5%, Sr3D+: 59.5% (SOTA), ScanRefer: 40.80% (GOOD)
    • Reproduce Notes:
      • 1 RTX 3090 takes almost 15h for Nr3D and reproduces the reported 55.1%!
      • Under CUDA 11, replace all mentions of AT_CHECK with TORCH_CHECK in ./referit3d/external_tools/pointnet2/_ext_src/src (e.g. grep -rl AT_CHECK ./referit3d/external_tools/pointnet2/_ext_src/src | xargs sed -i 's/AT_CHECK/TORCH_CHECK/g').
      • Point Cloud Visualization tool: open3d [Package]
      • Point Cloud 3D Box Visualization: [Code]
      • Point Cloud aligned: [Code]
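
    A minimal sketch covering the rotation and visualization notes above, assuming a recent open3d (>= 0.10); the input file and box bounds are placeholders:

      import copy
      import numpy as np
      import open3d as o3d

      pcd = o3d.io.read_point_cloud("scene.ply")              # placeholder path
      views = []
      for theta in np.linspace(0, 2 * np.pi, 4, endpoint=False):
          R = pcd.get_rotation_matrix_from_xyz((0, 0, theta)) # yaw rotation matrix
          v = copy.deepcopy(pcd)
          v.rotate(R, center=(0, 0, 0))                       # one rotated "view"
          views.append(v)                                     # 4 rotated copies for multi-view
      box = o3d.geometry.AxisAlignedBoundingBox((-1, -1, 0), (1, 1, 2))  # placeholder box
      o3d.visualization.draw_geometries([pcd, box])           # render cloud + 3D box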

  2. Distilling Audio-Visual Knowledge by Compositional Contrastive Learning(CVPR2021) [PDF] [Code]

    • Main Idea: Compositional contrastive learning for video feature extraction, aimed at closing the semantic gap between the two different modalities (a generic contrastive-loss sketch follows this item).
    • Experiments: UCF101: 70.0%, ActivityNet: 47.3%
    • Reproduce Notes:
      • 1 RTX 3090 takes almost 10h for UCF101, 3 days for ActivityNet, 6 days for VGGSound.
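
    For reference, a minimal InfoNCE-style contrastive loss between paired embeddings from two modalities; this is the generic form, not the paper's exact compositional loss:

      import torch
      import torch.nn.functional as F

      def info_nce(feat_a, feat_b, temperature=0.07):
          # feat_a, feat_b: (B, d) paired embeddings from the two modalities
          a = F.normalize(feat_a, dim=-1)
          b = F.normalize(feat_b, dim=-1)
          logits = a @ b.t() / temperature       # (B, B) cross-modal similarities
          targets = torch.arange(a.size(0))      # matching pairs lie on the diagonal
          return F.cross_entropy(logits, targets)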

  • 2022/07/02, Saturday.
  1. 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection(CVPR2022) [PDF] [Code]
    • Main Idea: The first single-stage 3D visual grounding method. It treats the 3DVG task as a keypoint-selection problem: Pcloud is the input, Pseed are the seed features, P0 are the language-relevant keypoints, Pt are the target keypoints, and finally Pt regresses to the bounding boxes.
    • Experiments: ScanRefer: 47-48% (SOTA), Nr3D: 51.5%, Sr3D: 62.6% (GOOD)
    • Reproduce Notes:
      • 1 Tesla V100 or 2 RTX 3090s are enough. Training takes almost 39h on 2 RTX 3090s, with or without multi-view features.
      • Distributed training yaml [Code]
      • Distributed training script [Code]
      • In a PyTorch 1.7.0 environment, you should replace "tile" in lib/ap_helper.py with "repeat" (Tensor.tile was only added in PyTorch 1.8; repeat produces the equivalent copies).
      • If you use distributed training, you should add "if args.local_rank == 0" before you save the model, so that only the main process writes checkpoints (see the sketch after this list).
      • If you use distributed training, you should change the torch.load code in scripts/eval.py to
      checkpoint = torch.load(path)
      # strip the 7-character "module." prefix that DistributedDataParallel adds
      model.load_state_dict({k[7:]: v for k, v in checkpoint.items()}, strict=True)
      • If you want to visualize the results, do the following steps:
      1. Add the following code to config/default.yaml
      VISUALIZE:
          scene_id: "scene0011_00"
      2. Change some code in scripts/visualize.py [Code]
      3. Run this command in the terminal
      python scripts/visualize.py --folder 2022-07-23_20-36_REPRODUCE-MULTIVIEW_DOUBLE_WORKERS-1 --config ./config/default.yaml
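
    A minimal sketch of the checkpoint-saving note above (args, model, and path as in the training script). A DistributedDataParallel-wrapped model saves its keys as "module.xxx", which is why the eval snippet strips k[7:]:

      import torch

      # save only on the main process; otherwise every rank writes the same file
      if args.local_rank == 0:
          torch.save(model.state_dict(), path)   # keys carry the "module." prefix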

  • 2022/07/03, Sunday.
  1. ScanQA: 3D Question Answering for Spatial Scene Understanding(CVPR2022) [PDF] [Code]

    • Main Idea: This paper introduces a new task, 3D VQA, and a baseline consisting of three parts: question and point-cloud feature extraction, feature fusion, and three MLP heads for object classification, answer classification, and object localization (a minimal sketch of the heads follows this item).
    • Experiments: 23.45% (Baseline)
    • Reproduce Notes:
      • Not implemented yet. (TODO)
      • 1 Tesla V100 takes < 1 day.
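
    A minimal sketch of the three-head baseline described above; the fused dimension and class counts are placeholders, not the paper's values:

      import torch.nn as nn

      class ScanQAHeads(nn.Module):
          """Three MLP heads over the fused question/point-cloud feature."""
          def __init__(self, d_fused=256, n_obj_classes=18, n_answers=100):
              super().__init__()
              def mlp(d_out):
                  return nn.Sequential(nn.Linear(d_fused, 128), nn.ReLU(),
                                       nn.Linear(128, d_out))
              self.obj_cls = mlp(n_obj_classes)   # which object class is referred to
              self.ans_cls = mlp(n_answers)       # answer classification
              self.obj_loc = mlp(6)               # box center + size regression
          def forward(self, fused):               # fused: (B, d_fused)
              return self.obj_cls(fused), self.ans_cls(fused), self.obj_loc(fused)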

  • 2022/07/04, Monday.
  1. Text-guided graph neural networks for referring 3d instance segmentation(AAAI2021) [PDF] [Code]
    • Main Idea: This paper divides the task into two parts: 3D instance segmentation and instance referring. The 3D mask prediction is interesting: they propose a clustering algorithm to group points belonging to the same instance. A text-guided graph neural network is proposed for the second phase (a generic sketch follows this item).
    • Experiments: (Baseline)
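
    A heavily hedged sketch of one text-conditioned message-passing step, just to fix the idea; this is a generic gated GNN update, not the paper's exact formulation, and all names and shapes are assumptions:

      import torch
      import torch.nn.functional as F

      def text_guided_step(node_feats, adj, text_feat, W):
          # node_feats: (N, d) instance features; adj: (N, N) edge weights;
          # text_feat: (d,) sentence embedding; W: (d, d) learnable projection
          gate = torch.sigmoid(node_feats @ text_feat)     # (N,) per-node text relevance
          msgs = adj @ (gate.unsqueeze(-1) * node_feats)   # aggregate gated neighbors
          return F.relu((node_feats + msgs) @ W)           # residual update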

  • 2022/07/05, Tuesday.
  1. X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning(CVPR2022) [PDF] [Code]
    • Main Idea: In the training stage, they use both the 2D and 3D modalities in a teacher network to teach a student network that uses only the 3D modality. In the inference stage, only the 3D modality is used.
      • They propose a different fusion module: randomly mask the teacher features and add them to the student features (see the sketch below).
      • They propose a different object representation method.
    • Experiments:
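
    A minimal sketch of the random-mask fusion idea above; tensor shapes and the keep probability are assumptions:

      import torch

      def masked_teacher_fusion(student_feat, teacher_feat, keep_prob=0.5):
          # student_feat, teacher_feat: (B, N, d) object features from the two branches
          mask = (torch.rand_like(teacher_feat[..., :1]) < keep_prob).float()
          return student_feat + mask * teacher_feat   # randomly masked teacher added in
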
  • 2022/07/06, Wednesday.
  1. 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds(CVPR2022) [PDF] [Code]
    • Main Idea: This paper provides a unified framework for joint dense captioning and visual grounding on 3D point clouds. The feature-representation and fusion modules are task-agnostic and designed for collaborative learning.
    • Experiments: SOTA on ScanRefer and Nr3D, even better than the 3D visual grounding paper in IJCV2022.
    • Reproduce Notes:
      • 1 RTX 3090 takes almost 4 days to train and about 1h × "repeats" to validate on the ScanRefer dataset.
      • If you use multi-view features, this project occupies 212GB of disk space, so you'd better rent GPUs in the Beijing region on AutoDL.
      • Scan2CAD dataset and its preprocessing are also needed to train this project. [Code]
      • If your system is CUDA 11.0+, you should replace pointnet++ in the original repo with the one from 3DSPS.
      • For the error "No module named 'quaternion'", type "pip install numpy-quaternion" in the terminal.
      • The ScanRefer dataset can be unzipped directly into the dataset folder.
      • "ScanRefer_filtered_organized.json" can be obtained by [Code]
      • Training arguments must match validation arguments, or you will get a runtime "size mismatch" error.
      • Java is required:
      sudo apt-get update
      sudo apt-get install openjdk-8-jdk
      • "--num_ground=150" means avoiding the training of the caption head for the first 150 epochs.

Visual Grounding (validation set):

| Methods | Publication | Modality | Unique Acc@0.25 | Unique Acc@0.5 | Multiple Acc@0.25 | Multiple Acc@0.5 | Overall Acc@0.25 | Overall Acc@0.5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3DJCG (Paper) | CVPR2022 | 3D | 78.75 | 61.30 | 40.13 | 30.08 | 47.62 | 36.14 |
| 3DJCG (Paper) | CVPR2022 | 2D + 3D | 83.47 | 64.34 | 41.39 | 30.82 | 49.56 | 37.33 |
| 3DJCG (Our Reproduction) | CVPR2022 | 2D + 3D | 81.98 | 63.18 | 41.35 | 30.04 | 49.23 | 36.47 |
- Future work: improve the visual grounding performance.
  • 2022/07/07, Thursday.
  • I started an experiment on the COCO2017 dataset with 4 RTX 3090s!
  1. Escaping the Big Data Paradigm with Compact Transformers(Arxiv202206) [PDF] [Code]
    • Main Idea: This paper designs a new transformer architecture for training on small datasets. First, they reduce the number of layers, heads, and hidden dimensions. Then, they design the SeqPool module: x' = softmax(g(f(x))^T), z = x' · f(x), where f is the transformer encoder and g is a linear layer (see the sketch below). Finally, a convolutional tokenizer, which substitutes for patching and embedding, is designed to introduce an inductive bias into the model.
    • Experiments: SOTA on small datasets such as CIFAR-10 and Flowers-102.
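
    A minimal sketch of SeqPool as described above (attention pooling over the encoder's output tokens); batch-first shapes are an assumption:

      import torch
      import torch.nn as nn

      class SeqPool(nn.Module):
          """Pool N encoder tokens into one vector: z = softmax(g(x)^T) · x."""
          def __init__(self, dim):
              super().__init__()
              self.g = nn.Linear(dim, 1)        # g: scores each token
          def forward(self, x):                 # x = f(input): (B, N, d) encoder output
              w = torch.softmax(self.g(x).transpose(1, 2), dim=-1)  # (B, 1, N)
              return (w @ x).squeeze(1)         # weighted sum of tokens: (B, d)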

  • 2022/07/08, Friday.
  • After adding a Transformer to IEEC, although the training process is unstable, it can surpass the baseline on validation accuracy (as reported by TensorBoard)!

  1. ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes (ECCV2020) [PDF] [Code]
    • Main Idea: They introduce a two-part dataset: a high-quality synthetic dataset of 83,572 referential utterances (Sr3D) and a dataset of 41,503 natural (human) referential utterances (Nr3D).
  • 2022/07/09, Saturday. Ran the 3DJCG code.

  • 2022/07/10, Sunday.

  1. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation (CVPR2017) [PDF] [Code]
    • Main Idea: This paper fully exploits the permutation-invariance of point clouds and proposes PointNet (a minimal sketch follows this item). It also provides some theoretical analysis in the [Supplemental].
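
    A minimal sketch of the permutation-invariant design: a shared per-point MLP followed by a symmetric max-pool, so reordering the N input points leaves the output unchanged; layer sizes are placeholders:

      import torch
      import torch.nn as nn

      class TinyPointNet(nn.Module):
          def __init__(self, d_out=128):
              super().__init__()
              # shared MLP applied to every point independently
              self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                       nn.Linear(64, d_out))
          def forward(self, pts):                       # pts: (B, N, 3)
              return self.mlp(pts).max(dim=1).values    # (B, d_out), order-invariant
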
  • 2022/07/11, Monday.

  • The experiment on the COCO2017 dataset with 4 RTX 3090s is over. It lasted 5 days! My results surpass the baseline model.

  • 2022/07/20, Wednesday.

  1. Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds (IJCV2022) [PDF] [Code]
    • Main Idea: This paper proposes SpaCap for 3D dense captioning. Main-axis spatial-relation label maps are preprocessed before training and serve as prior knowledge for the model. Besides, the paper proposes a new Transformer decoder in which a vision-token mask and a word-token mask are both fed to the self-attention layer.
    • Reproduce Notes:
      • Reproduced successfully.

  • 2022/07/30, Saturday.
  1. D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding (ECCV2022) [PDF] [Code]
    • Main Idea: This paper proposes D3Net for joint 3D dense captioning and visual grounding. The self-critical property of D3Net introduces discriminability during object-caption generation and enables semi-supervised training on ScanNet data with partially annotated descriptions. It outperforms SOTA methods in both tasks on the ScanRefer dataset.
    • Reproduce Notes:
      • The repo does not provide a yaml config file.

When a downloaded zip file is corrupted, it can often be repaired with WinRAR!

2022/09

  • 2022/09/05, Monday.
  1. Research on Deep-Learning-Based Image Restoration Techniques, by 武士想 (PhD thesis, USTC, 2022)

Non-blind image restoration: including super-resolution and denoising (super-resolution × 2 + denoising × 1)

Blind image restoration: (based on an unsupervised CycleGAN framework × 1)

Blind single-image restoration: (based on a generative model × 1)

2023

2023/01