1.GRCNN: Graph Recognition Convolutional Neural Network for Synthesizing Programs from Flow Charts ⬇️
Program synthesis is the task of automatically generating programs from user specifications. In this paper, we present a framework that synthesizes programs from flow charts, which serve as accurate and intuitive specifications. To do so, we propose a deep neural network called GRCNN that recognizes the graph structure of a flow chart from its image. GRCNN is trained end-to-end and predicts edge and node information of the flow chart simultaneously. Experiments show that the accuracy of program synthesis is 66.4%, while the accuracies of edge and node recognition are 94.1% and 67.9%, respectively. On average, it takes about 60 milliseconds to synthesize a program.
2.LittleYOLO-SPP: A Delicate Real-Time Vehicle Detection Algorithm ⬇️
Real-time vehicle detection is a challenging and important task. Existing real-time vehicle detectors lack accuracy and speed, yet real-time systems must detect and locate vehicles with high accuracy during events such as vehicle theft and road traffic violations. Detecting vehicles in complex scenes with occlusion is also extremely difficult. In this study, a lightweight deep neural network, LittleYOLO-SPP, based on the YOLOv3-tiny network, is proposed to detect vehicles effectively in real time. The YOLOv3-tiny object detection network is improved by modifying its feature extraction network to increase the speed and accuracy of vehicle detection. The proposed network incorporates spatial pyramid pooling, which concatenates features from pooling layers of different scales to enhance the network's learning capability. Mean squared error (MSE) and generalized IoU (GIoU) losses for bounding box regression are used to increase the performance of the network. The network is trained on vehicle classes such as car, bus, and truck from the PASCAL VOC 2007, PASCAL VOC 2012, and MS COCO 2014 datasets. LittleYOLO-SPP detects vehicles in real time with high accuracy regardless of video frame and weather conditions. The improved network achieves an mAP of 77.44% on PASCAL VOC and 52.95% on MS COCO.
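As a rough illustration of the SPP idea described above, the sketch below (PyTorch, assuming the standard YOLOv3-SPP pooling sizes of 5, 9 and 13, which the paper may configure differently) shows how a single feature map is max-pooled at several scales and concatenated channel-wise:

```python
import torch
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):
    """Minimal SPP block in the YOLOv3-SPP style: max-pool the same
    feature map at several kernel sizes (stride 1, padded so spatial
    size is preserved) and concatenate along the channel axis."""
    def __init__(self, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in pool_sizes
        )

    def forward(self, x):
        # Output channels = in_channels * (1 + len(pool_sizes)).
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

# Example: a 256-channel feature map becomes 1024 channels.
features = torch.randn(1, 256, 13, 13)
print(SpatialPyramidPooling()(features).shape)  # torch.Size([1, 1024, 13, 13])
```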
3.Age Gap Reducer-GAN for Recognizing Age-Separated Faces ⬇️
In this paper, we propose a novel algorithm for matching faces with temporal variations caused by age progression. The proposed generative adversarial network algorithm is a unified framework that combines facial age estimation and age-separated face verification. The key idea of this approach is to learn the age variations across time by conditioning the input image on the subject's gender and the target age group to which the face needs to be progressed. The loss function accounts for reducing the age gap between the original image and the generated face image as well as preserving the identity. Both visual fidelity and quantitative evaluations demonstrate the efficacy of the proposed architecture on different facial age databases for age-separated face recognition.
4.Transferred Fusion Learning using Skipped Networks ⬇️
Identifying entities of interest is a prominent need in any intelligent system, and a model's visual intelligence is enhanced when the capability of recognition is added. Several methods, such as transfer learning and zero-shot learning, help reuse existing models or augment an existing model to achieve improved performance at the task of object recognition. Transferred fusion learning is one such mechanism that intends to use the best of both worlds and build a model that outperforms the models involved in the system. We propose a novel mechanism to amplify the process of transfer learning by introducing a student architecture where the networks learn from each other.
5.DeepI2I: Enabling Deep Hierarchical Image-to-Image Translation by Transferring from GANs ⬇️
Image-to-image translation has recently achieved remarkable results. Despite this success, it suffers from inferior performance when translations between classes require large shape changes. We attribute this to the high-resolution bottlenecks used by current state-of-the-art image-to-image methods. Therefore, in this work, we propose a novel deep hierarchical image-to-image translation method, called DeepI2I. We learn a model by leveraging hierarchical features: (a) structural information contained in the shallow layers and (b) semantic information extracted from the deep layers. To enable the training of deep I2I models on small datasets, we propose a novel transfer learning method that transfers knowledge from pre-trained GANs. Specifically, we leverage the discriminator of a pre-trained GAN (e.g., BigGAN or StyleGAN) to initialize both the encoder and the discriminator, and the pre-trained generator to initialize the generator of our model. Applying knowledge transfer leads to an alignment problem between the encoder and generator; we introduce an adaptor network to address this. On many-class image-to-image translation on three datasets (Animal faces, Birds, and Foods) we decrease mFID by at least 35% compared to the state-of-the-art. Furthermore, we qualitatively and quantitatively demonstrate that transfer learning significantly improves the performance of I2I systems, especially for small datasets. Finally, we are the first to perform I2I translations for domains with over 100 classes.
6.Where to drive: free space detection with one fisheye camera ⬇️
Development in the field of autonomous driving goes hand in hand with ever new developments in image processing and machine learning methods. To fully exploit the advantages of deep learning, sufficient labeled training data must be available, which is especially not the case for omnidirectional fisheye cameras. As a solution, we propose in this paper to use synthetic training data based on Unity3D. A five-pass algorithm is used to create a virtual fisheye camera. This synthetic training data is evaluated for the application of free space detection with different deep learning network architectures. The results indicate that synthetic fisheye images can be used in a deep learning context.
7.Dynamic Plane Convolutional Occupancy Networks ⬇️
Learning-based 3D reconstruction using implicit neural representations has shown promising progress not only at the object level but also in more complicated scenes. In this paper, we propose Dynamic Plane Convolutional Occupancy Networks, a novel implicit representation that pushes the quality of 3D surface reconstruction further. The input noisy point clouds are encoded into per-point features that are projected onto multiple 2D dynamic planes. A fully-connected network learns to predict the plane parameters that best describe the shapes of objects or scenes. To further exploit translational equivariance, convolutional neural networks are applied to process the plane features. Our method shows superior performance in surface reconstruction from unoriented point clouds on ShapeNet as well as an indoor scene dataset. Moreover, we provide interesting observations on the distribution of the learned dynamic planes.
8.Learned Equivariant Rendering without Transformation Supervision ⬇️
We propose a self-supervised framework to learn scene representations from video that are automatically delineated into objects and background. Our method relies on moving objects being equivariant with respect to their transformation across frames and the background being constant. After training, we can manipulate and render the scenes in real time to create unseen combinations of objects, transformations, and backgrounds. We show results on moving MNIST with backgrounds.
9.Finding Relevant Flood Images on Twitter using Content-based Filters ⬇️
The analysis of natural disasters such as floods in a timely manner often suffers from limited data due to coarsely distributed sensors or sensor failures. At the same time, a wealth of information is buried in the abundance of images of the event posted on social media platforms such as Twitter. These images could be used to document and rapidly assess the situation and to derive proxy data not available from sensors, e.g., the degree of water pollution. However, not all images posted online are suitable or informative enough for this purpose. Therefore, we propose an automatic filtering approach using machine learning techniques for finding Twitter images that are relevant to one of the following information objectives: assessing the flooded area, the inundation depth, and the degree of water pollution. Instead of relying on textual information present in the tweet, the filter analyzes the image contents directly. We evaluate the performance of two different approaches and various features in a case study of two major flooding events. Our image-based filter substantially enhances the quality of the results compared with a keyword-based filter, improving the mean average precision from 23% to 53% on average.
10.Survey on 3D face reconstruction from uncalibrated images ⬇️
Recently, a lot of attention has been focused on the incorporation of 3D data into face analysis and its applications. Despite providing a more accurate representation of the face, 3D face images are more complex to acquire than 2D pictures. As a consequence, great effort has been invested in developing systems that reconstruct 3D faces from an uncalibrated 2D image. However, the 3D-from-2D face reconstruction problem is ill-posed, so prior knowledge is needed to restrict the solution space. In this work, we review 3D face reconstruction methods of the last decade, focusing on those that only use 2D pictures captured under uncontrolled conditions. We present a classification of the proposed methods based on the technique used to add prior knowledge, considering three main strategies, namely statistical model fitting, photometry, and deep learning, and reviewing each of them separately. In addition, given the relevance of statistical 3D facial models as prior knowledge, we explain their construction procedure and provide a comprehensive list of the publicly available 3D facial models. After this exhaustive study of 3D-from-2D face reconstruction approaches, we observe that the deep learning strategy has been growing rapidly in the last few years, matching in extent the widespread statistical model fitting strategy. Unlike the other two strategies, photometry-based methods have decreased in number, since their required strong assumptions cause the reconstructions to be of more limited quality than those from model fitting and deep learning methods. The review also identifies current gaps and suggests avenues for future research.
11.DeepSim: Semantic similarity metrics for learned image registration ⬇️
We propose a semantic similarity metric for image registration. Existing metrics like Euclidean distance or normalized cross-correlation focus on aligning intensity values, which causes difficulties with low intensity contrast or noise. Our semantic approach learns dataset-specific features that drive the optimization of a learning-based registration model. Compared to existing unsupervised and supervised methods across multiple image modalities and applications, we achieve consistently high registration accuracy and faster convergence than the state of the art, and the learned invariance to noise yields smoother transformations on low-quality images.
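A minimal sketch of such a feature-level similarity, assuming a `feat_extractor` that returns feature maps from several layers (the name, the per-layer weighting, and the use of cosine similarity are illustrative assumptions, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def semantic_similarity_loss(feat_extractor, fixed, warped_moving, weights=None):
    """Compare learned features of the fixed and warped moving images
    instead of raw intensities; gradients flow back to the warp."""
    fixed_feats = feat_extractor(fixed)            # list of (B, C, H, W) maps
    moving_feats = feat_extractor(warped_moving)
    weights = weights or [1.0] * len(fixed_feats)
    loss = 0.0
    for w, f, m in zip(weights, fixed_feats, moving_feats):
        # Cosine similarity per spatial location, averaged; 1 - sim as loss.
        sim = F.cosine_similarity(f, m, dim=1)     # (B, H, W)
        loss = loss + w * (1.0 - sim.mean())
    return loss
```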
12.A CNN-based Feature Space for Semi-supervised Incremental Learning in Assisted Living Applications ⬇️
A Convolutional Neural Network (CNN) is sometimes confronted with objects of changing appearance (new instances) that exceed its generalization capability. This requires the CNN to incorporate new knowledge, i.e., to learn incrementally. In this paper, we are concerned with this problem in the context of assisted living. We propose using the feature space that results from the training dataset to automatically label problematic images that could not be properly recognized by the CNN. The idea is to exploit the extra information in the feature space for a semi-supervised labeling and to employ problematic images to improve the CNN's classification model. Among other benefits, the resulting semi-supervised incremental learning process allows improving the classification accuracy of new instances by 40% as illustrated by extensive experiments.
13.Learning from THEODORE: A Synthetic Omnidirectional Top-View Indoor Dataset for Deep Transfer Learning ⬇️
Recent work on synthetic indoor datasets from perspective views has shown significant improvements in object detection results with Convolutional Neural Networks (CNNs). In this paper, we introduce THEODORE: a novel, large-scale indoor dataset containing 100,000 high-resolution, diversified fisheye images with 14 classes. To this end, we create 3D virtual environments of living rooms, different human characters and interior textures. Besides capturing fisheye images from the virtual environments, we create annotations for semantic segmentation, instance masks and bounding boxes for object detection tasks. We compare our synthetic dataset to state-of-the-art real-world datasets for omnidirectional images. Based on MS COCO weights, we show that our dataset is well suited for fine-tuning CNNs for object detection. Through the strong generalization of our models achieved by means of image synthesis and domain randomization, we reach an AP of up to 0.84 for the class person on the High-Definition Analytics dataset.
14.Invariant Deep Compressible Covariance Pooling for Aerial Scene Categorization ⬇️
Learning discriminative and invariant feature representations is the key to visual image categorization. In this article, we propose a novel invariant deep compressible covariance pooling (IDCCP) method to handle nuisance variations in aerial scene categorization. We consider transforming the input image according to a finite transformation group consisting of multiple orthogonal matrices, such as the D4 group. Then, we adopt a Siamese-style network to transfer the group structure to the representation space, where we can derive a trivial representation that is invariant under the group action. A linear classifier trained on the trivial representation is then also invariant. To further improve the discriminative power of the representation, we extend it to the tensor space while imposing orthogonal constraints on the transformation matrix to effectively reduce feature dimensions. We conduct extensive experiments on publicly released aerial scene image datasets and demonstrate the superiority of this method compared with state-of-the-art methods. In particular, using a ResNet architecture, our IDCCP model can reduce the dimension of the tensor representation by about 98% without sacrificing accuracy (i.e., <0.5%).
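The group-invariance argument can be made concrete with a small sketch: averaging a shared encoder's outputs over the full D4 orbit of an image is exactly invariant to any D4 transformation of the input. This is only a simplified illustration; IDCCP derives the invariant through a Siamese-style network and covariance pooling rather than plain orbit averaging.

```python
import torch

def d4_orbit(x):
    """Enumerate the 8 elements of the D4 group acting on an image
    batch (B, C, H, W): four 90-degree rotations, each optionally
    flipped horizontally."""
    views = []
    for k in range(4):
        rotated = torch.rot90(x, k, dims=(2, 3))
        views.append(rotated)
        views.append(torch.flip(rotated, dims=(3,)))
    return views

def trivial_representation(encoder, x):
    """Averaging over the whole orbit yields a representation that is
    exactly invariant under any D4 transformation of the input: a D4
    transform merely permutes the orbit, leaving the mean unchanged."""
    feats = torch.stack([encoder(v) for v in d4_orbit(x)])  # (8, B, D)
    return feats.mean(dim=0)                                # (B, D)
```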
15.Noise Conscious Training of Non Local Neural Network powered by Self Attentive Spectral Normalized Markovian Patch GAN for Low Dose CT Denoising ⬇️
The explosive rise in the use of computed tomography (CT) imaging in medical practice has heightened public concern over patients' associated radiation dose. However, reducing the radiation dose leads to increased noise and artifacts, which adversely degrade the scan's interpretability. Consequently, advanced image reconstruction algorithms to improve the diagnostic performance of low-dose CT (LDCT) have become a primary concern among researchers, and the problem is challenging due to its ill-posedness. In recent times, deep learning-based techniques have emerged as the dominant method for LDCT denoising. However, some common bottlenecks still hinder deep learning-based techniques from delivering their best performance. In this study, we attempt to mitigate these problems with three novel contributions. First, we propose a novel convolutional module, the first attempt to utilize the neighborhood similarity of CT images for denoising tasks; the proposed module boosts denoising by a significant margin. Next, we address the non-stationarity of CT noise and introduce a new noise-aware mean squared error loss for LDCT denoising; this loss also alleviates the laborious effort required when training CT denoising networks on image patches. Lastly, we propose a novel discriminator function for CT denoising tasks. The conventional vanilla discriminator tends to overlook fine structural details and focus on global agreement; our proposed discriminator leverages self-attention and pixel-wise GANs to restore the diagnostic quality of LDCT images. Validated on the publicly available dataset of the 2016 NIH-AAPM-Mayo Clinic Low Dose CT Grand Challenge, our method performs remarkably better than the existing state-of-the-art methods.
16.Zero-Pair Image to Image Translation using Domain Conditional Normalization ⬇️
In this paper, we propose an approach based on domain conditional normalization (DCN) for zero-pair image-to-image translation, i.e., translating between two domains which have no paired training data available but each have paired training data with a third domain. We employ a single generator with an encoder-decoder structure and analyze different implementations of domain conditional normalization to obtain the desired target domain output. The validation benchmark uses RGB-depth pairs and RGB-semantic pairs for training and compares performance on the depth-semantic translation task. The proposed approaches improve in qualitative and quantitative terms over the compared methods, while using far fewer parameters. Code available at this https URL
17.FPGA: Fast Patch-Free Global Learning Framework for Fully End-to-End Hyperspectral Image Classification ⬇️
Deep learning techniques have provided significant improvements in hyperspectral image (HSI) classification. Current deep learning based HSI classifiers follow a patch-based learning framework, dividing the image into overlapping patches. As such, these are local learning methods with a high computational cost. In this paper, a fast patch-free global learning (FPGA) framework is proposed for HSI classification. In FPGA, an encoder-decoder based FCN is utilized to consider global spatial information by processing the whole image, which results in fast inference. However, it is difficult to directly utilize an encoder-decoder based FCN for HSI classification, as it often fails to converge due to the insufficiently diverse gradients caused by the limited training samples. To solve this divergence problem while maintaining the FCN's fast inference and global spatial information mining, a global stochastic stratified sampling strategy is first proposed, transforming all the training samples into a stochastic sequence of stratified samples. This strategy obtains diverse gradients that guarantee the convergence of the FCN in the FPGA framework. For a better FCN architecture, FreeNet, a fully end-to-end network for HSI classification, is proposed to maximize the exploitation of global spatial information and boost performance via a spectral attention based encoder and a lightweight decoder. A lateral connection module is also designed to connect the encoder and decoder, fusing the spatial details in the encoder with the semantic features in the decoder. Experimental results on three public benchmark datasets suggest that the FPGA framework is superior to the patch-based framework in both speed and accuracy for HSI classification. Code has been made available at: this https URL.
18.A Hybrid Approach for 6DoF Pose Estimation ⬇️
We propose a method for 6DoF pose estimation of rigid objects that uses a state-of-the-art deep learning based instance detector to segment object instances in an RGB image, followed by a point-pair based voting method to recover the object's pose. We additionally use an automatic method selection that chooses the instance detector and the training set that perform best on the validation set. This hybrid approach leverages the best of learning and classic approaches, using CNNs to filter highly unstructured data and cut through the clutter, and a local geometric approach with proven convergence for robust pose estimation. The method is evaluated on the BOP core datasets, where it significantly exceeds the baseline method and is the best fast method in the BOP 2020 Challenge.
19.Progressive Spatio-Temporal Graph Convolutional Network for Skeleton-Based Human Action Recognition ⬇️
Graph convolutional networks (GCNs) have been very successful in skeleton-based human action recognition where the sequence of skeletons is modeled as a graph. However, most of the GCN-based methods in this area train a deep feed-forward network with a fixed topology that leads to high computational complexity and restricts their application in low computation scenarios. In this paper, we propose a method to automatically find a compact and problem-specific topology for spatio-temporal graph convolutional networks in a progressive manner. Experimental results on two widely used datasets for skeleton-based human action recognition indicate that the proposed method has competitive or even better classification performance compared to the state-of-the-art methods with much lower computational complexity.
20.Skeleton-based Relational Reasoning for Group Activity Analysis ⬇️
Research on group activity recognition mostly leans on the standard two-stream approach (RGB and optical flow) for its input features. Few works have explored explicit pose information, and none have used it directly to reason about the individuals' interactions. In this paper, we leverage skeleton information to learn the interactions between individuals directly from it. With our proposed method, GIRN, multiple relationship types are inferred from independent modules that describe the relations between the joints pair by pair. In addition to joint relations, we also experiment with the previously unexplored relationship between individuals and relevant objects (e.g., the volleyball). The individuals' distinct relations are then merged through an attention mechanism that gives more importance to those most relevant for distinguishing the group activity. We evaluate our method on the Volleyball dataset, obtaining results competitive with the state of the art despite using a single modality, thereby demonstrating the potential of skeleton-based approaches for modeling multi-person interactions.
21.Semi-supervised Sparse Representation with Graph Regularization for Image Classification ⬇️
Image classification remains a challenging problem in practice. Many methods can achieve satisfying performance given sufficient labeled images. However, labeled images are still highly limited for certain image classification tasks, whereas plenty of unlabeled images are available and easy to obtain. Therefore, making full use of the available unlabeled data is a potential way to further improve the performance of current image classification methods. In this paper, we propose a discriminative semi-supervised sparse representation algorithm for image classification. In the algorithm, the classification process is combined with sparse coding to learn a data-driven linear classifier. To obtain discriminative predictions, the predicted labels are regularized with three graphs, i.e., the global manifold structure graph, the within-class graph and the between-class graph. The constructed graphs are able to extract the structural information contained in both the labeled and unlabeled data. Moreover, the proposed method is extended to a kernel version for dealing with data that cannot be linearly classified, and efficient algorithms are developed to solve the corresponding optimization problems. Experimental results on several challenging databases demonstrate that the proposed algorithm achieves excellent performance compared with related popular methods.
22.Self-supervised Segmentation via Background Inpainting ⬇️
While supervised object detection and segmentation methods achieve impressive accuracy, they generalize poorly to images whose appearance significantly differs from the data they have been trained on. To address this when annotating data is prohibitively expensive, we introduce a self-supervised detection and segmentation approach that can work with single images captured by a potentially moving camera. At the heart of our approach lies the observation that object segmentation and background reconstruction are linked tasks, and that, for structured scenes, background regions can be re-synthesized from their surroundings, whereas regions depicting the moving object cannot. We encode this intuition into a self-supervised loss function that we exploit to train a proposal-based segmentation network. To account for the discrete nature of the proposals, we develop a Monte Carlo-based training strategy that allows the algorithm to explore the large space of object proposals. We apply our method to human detection and segmentation in images that visually depart from those of standard benchmarks and outperform existing self-supervised methods.
23.Scribble-Supervised Semantic Segmentation by Random Walk on Neural Representation and Self-Supervision on Neural Eigenspace ⬇️
Scribble-supervised semantic segmentation has recently gained much attention for its promising performance without high-quality annotations. Many approaches have been proposed; typically, they handle this problem by introducing a well-labeled dataset from another related task, turning to iterative refinement and post-processing with a graphical model, or manipulating the scribble labels. This work aims at semantic segmentation supervised directly by scribble labels, without auxiliary information or other intermediate manipulation. Specifically, we impose diffusion on the neural representation via a random walk and consistency on the neural eigenspace via self-supervision, which forces the neural network to produce dense and consistent predictions over the whole dataset. The random walk embedded in the network computes a probabilistic transition matrix with which the neural representation is diffused toward uniformity. Moreover, given the probabilistic transition matrix, we apply self-supervision on its eigenspace for consistency in the image's main parts. In addition to comparisons on the common scribble dataset, we also conduct experiments on modified datasets that randomly shrink or even drop the scribbles on image objects. The results demonstrate the superiority of the proposed method, which is even comparable to some fully supervised ones. The code and datasets are available at this https URL.
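A hedged sketch of the random-walk diffusion step, assuming per-location features flattened to an (N, D) matrix and a softmax-normalized affinity as the transition matrix (the temperature and affinity choice are illustrative assumptions); the eigenspace self-supervision would then operate on eigenvectors of the returned transition matrix:

```python
import torch
import torch.nn.functional as F

def random_walk_diffusion(features, temperature=0.05):
    """One diffusion step over a neural representation. `features` is
    (N, D): N spatial locations with D-dim features. A row-stochastic
    transition matrix built from pairwise feature affinities propagates
    information between similar locations, encouraging dense,
    consistent predictions."""
    normed = F.normalize(features, dim=1)
    affinity = normed @ normed.t() / temperature   # (N, N) similarities
    transition = F.softmax(affinity, dim=1)        # row-stochastic P
    return transition @ features, transition       # diffused features, P
```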
24.Intentonomy: a Dataset and Study towards Human Intent Understanding ⬇️
An image is worth a thousand words, conveying information that goes beyond the mere visual content therein. In this paper, we study the intent behind social media images with an aim to analyze how visual information can facilitate recognition of human intent. Towards this goal, we introduce an intent dataset, Intentonomy, comprising 14K images covering a wide range of everyday scenes. These images are manually annotated with 28 intent categories derived from a social psychology taxonomy. We then systematically study whether, and to what extent, commonly used visual information, i.e., object and context, contribute to human motive understanding. Based on our findings, we conduct further study to quantify the effect of attending to object and context classes as well as textual information in the form of hashtags when training an intent classifier. Our results quantitatively and qualitatively shed light on how visual and textual information can produce observable effects when predicting intent.
25.End-to-End Chinese Landscape Painting Creation Using Generative Adversarial Networks ⬇️
Current GAN-based art generation methods produce unoriginal artwork due to their dependence on conditional input. Here, we propose Sketch-And-Paint GAN (SAPGAN), the first model which generates Chinese landscape paintings from end to end, without conditional input. SAPGAN is composed of two GANs: SketchGAN for generation of edge maps, and PaintGAN for subsequent edge-to-painting translation. Our model is trained on a new dataset of traditional Chinese landscape paintings never before used for generative research. A 242-person Visual Turing Test study reveals that SAPGAN paintings are mistaken for human artwork with 55% frequency, significantly outperforming paintings from baseline GANs. Our work lays the groundwork for truly machine-original art generation.
26.Optimized Loss Functions for Object detection and Application on Nighttime Vehicle Detection ⬇️
The loss function is a crucial factor affecting detection precision in object detection tasks. In this paper, we optimize the two loss functions for classification and localization simultaneously. First, by multiplying the standard cross-entropy classification loss by an IoU-based coefficient, a correlation between localization and classification is established. Compared to existing studies, in which this correlation is only applied to improve the localization accuracy of positive samples, this paper utilizes the correlation to identify the really hard negative samples and aims to decrease the misclassification rate for negative samples. Besides, a novel localization loss named MIoU is proposed by incorporating a Mahalanobis distance between the predicted box and the target box, which eliminates the gradient inconsistency problem of the DIoU loss and further improves localization accuracy. Finally, extensive experiments on nighttime vehicle detection have been conducted on two datasets. Our results show that when trained with the proposed loss functions, detection performance is markedly improved. The source code and trained models are available at this https URL.
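To make the classification-localization coupling concrete, here is a minimal sketch in which the per-sample cross-entropy is scaled by an IoU-based coefficient; the (1 + IoU) form is an illustrative assumption, not necessarily the paper's exact coefficient:

```python
import torch
import torch.nn.functional as F

def iou_weighted_ce(logits, targets, ious):
    """Scale each sample's cross-entropy by an IoU-based coefficient.
    For a negative sample, a high IoU with a ground-truth box marks a
    'really hard' negative and up-weights its loss; for a positive, it
    emphasizes well-localized boxes. `ious` is one IoU per sample."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # (N,)
    return ((1.0 + ious) * ce).mean()
```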
27.Automatic Open-World Reliability Assessment ⬇️
Image classification in the open world must handle out-of-distribution (OOD) images. Systems should ideally reject OOD images; otherwise such images will be mapped onto known classes and reduce reliability. Open-set classifiers that can reject OOD inputs can help, but their optimal accuracy depends on the frequency of OOD data. Thus, for either standard or open-set classifiers, it is important to be able to determine when the world changes, since increasing OOD inputs will result in reduced system reliability. However, during operation, we cannot directly assess accuracy, as there are no labels. The reliability assessment of these classifiers must therefore be done by human operators, a task made more complex because networks are not 100% accurate, so some failures are to be expected. To automate this process, we formalize the open-world recognition reliability problem and propose multiple automatic reliability assessment policies to address it using only the distribution of reported scores/probability data. The distributional algorithms can be applied both to classic classifiers with SoftMax and to the open-world Extreme Value Machine (EVM) to provide automated reliability assessment. We show that all of the new algorithms significantly outperform detection using the mean of SoftMax.
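The mean-of-SoftMax baseline mentioned last can be sketched as follows; the threshold and windowing below are illustrative assumptions to be calibrated on in-distribution data, not values from the paper:

```python
import torch

def reliability_flag(max_softmax_scores, threshold=0.75):
    """Monitor the mean of maximum-SoftMax confidences over a window of
    unlabeled inputs; if rising OOD traffic drags the mean below a
    threshold calibrated on in-distribution data, flag reduced
    system reliability."""
    mean_confidence = max_softmax_scores.mean().item()
    return mean_confidence, mean_confidence < threshold

# Hypothetical usage over one monitoring window of logits:
# scores = logits.softmax(dim=1).max(dim=1).values
# confidence, degraded = reliability_flag(scores)
```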
28.Unsupervised Learning of Dense Visual Representations ⬇️
Contrastive self-supervised learning has emerged as a promising approach to unsupervised visual representation learning. In general, these methods learn global (image-level) representations that are invariant to different views (i.e., compositions of data augmentations) of the same image. However, many visual understanding tasks require dense (pixel-level) representations. In this paper, we propose View-Agnostic Dense Representation (VADeR) for unsupervised learning of dense representations. VADeR learns pixelwise representations by forcing local features to remain constant over different viewing conditions. Specifically, this is achieved through pixel-level contrastive learning: matching features (that is, features that describe the same location of the scene in different views) should be close in an embedding space, while non-matching features should be far apart. VADeR provides a natural representation for dense prediction tasks and transfers well to downstream tasks. Our method outperforms ImageNet supervised pretraining (and strong unsupervised baselines) on multiple dense prediction tasks.
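A minimal sketch of pixel-level contrastive learning in this spirit, assuming two (N, D) feature tensors whose rows are matched scene locations from two views (the temperature and the in-batch negative scheme are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def pixel_infonce(feats_a, feats_b, temperature=0.07):
    """InfoNCE at the pixel level: row i of `feats_a` and `feats_b`
    describes the same scene location under two views. Matching pairs
    (the diagonal of the similarity matrix) are pulled together; every
    other location in the batch serves as a negative."""
    a = F.normalize(feats_a, dim=1)
    b = F.normalize(feats_b, dim=1)
    logits = a @ b.t() / temperature           # (N, N) similarities
    targets = torch.arange(a.size(0))          # positives on the diagonal
    return F.cross_entropy(logits, targets)
```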
29.ForestNet: Classifying Drivers of Deforestation in Indonesia using Deep Learning on Satellite Imagery ⬇️
Characterizing the processes leading to deforestation is critical to the development and implementation of targeted forest conservation and management policies. In this work, we develop a deep learning model called ForestNet to classify the drivers of primary forest loss in Indonesia, a country with one of the highest deforestation rates in the world. Using satellite imagery, ForestNet identifies the direct drivers of deforestation in forest loss patches of any size. We curate a dataset of Landsat 8 satellite images of known forest loss events paired with driver annotations from expert interpreters. We use the dataset to train and validate the models and demonstrate that ForestNet substantially outperforms other standard driver classification approaches. In order to support future research on automated approaches to deforestation driver classification, the dataset curated in this study is publicly available at this https URL .
30.A Self-supervised Learning System for Object Detection in Videos Using Random Walks on Graphs ⬇️
This paper presents a new self-supervised system for learning to detect novel and previously unseen categories of objects in images. The proposed system receives as input several unlabeled videos of scenes containing various objects. The frames of the videos are segmented into objects using depth information, and the segments are tracked along each video. The system then constructs a weighted graph that connects sequences based on the similarities between the objects that they contain. The similarity between two sequences of objects is measured by using generic visual features, after automatically re-arranging the frames in the two sequences to align the viewpoints of the objects. The graph is used to sample triplets of similar and dissimilar examples by performing random walks. The triplet examples are finally used to train a siamese neural network that projects the generic visual features into a low-dimensional manifold. Experiments on three public datasets, YCB-Video, CORe50 and RGBD-Object, show that the projected low-dimensional features improve the accuracy of clustering unknown objects into novel categories, and outperform several recent unsupervised clustering techniques.
31.Fast & Slow Learning: Incorporating Synthetic Gradients in Neural Memory Controllers ⬇️
Neural Memory Networks (NMNs) have received increased attention in recent years compared to deep architectures that use a constrained memory. Despite their new appeal, the success of NMNs hinges on the ability of the gradient-based optimiser to perform incremental training of the NMN controllers, determining how to leverage their high capacity for knowledge retrieval. This means that while excellent performance can be achieved when the training data is consistent and well distributed, rare data samples are hard to learn from as the controllers fail to incorporate them effectively during model training. Drawing inspiration from the human cognition process, in particular the utilisation of neuromodulators in the human brain, we propose to decouple the learning process of the NMN controllers to allow them to achieve flexible, rapid adaptation in the presence of new information. This trait is highly beneficial for meta-learning tasks where the memory controllers must quickly grasp abstract concepts in the target domain, and adapt stored knowledge. This allows the NMN controllers to quickly determine which memories are to be retained and which are to be erased, and swiftly adapt their strategy to the new task at hand. Through both quantitative and qualitative evaluations on multiple public benchmarks, including classification and regression tasks, we demonstrate the utility of the proposed approach. Our evaluations not only highlight the ability of the proposed NMN architecture to outperform the current state-of-the-art methods, but also provide insights on how the proposed augmentations help achieve such superior results. In addition, we demonstrate the practical implications of the proposed learning strategy, where the feedback path can be shared among multiple neural memory networks as a mechanism for knowledge sharing.
32.Debugging Tests for Model Explanations ⬇️
We investigate whether post-hoc model explanations are effective for diagnosing model errors, i.e., model debugging. In response to the challenge of explaining a model's prediction, a vast array of explanation methods have been proposed. Despite increasing use, it is unclear whether they are effective. To start, we categorize bugs, based on their source, into data, model, and test-time contamination bugs. For several explanation methods, we assess their ability to: detect spurious correlation artifacts (data contamination), diagnose mislabeled training examples (data contamination), differentiate between a (partially) re-initialized model and a trained one (model contamination), and detect out-of-distribution inputs (test-time contamination). We find that the methods tested are able to diagnose a spurious background bug, but not conclusively identify mislabeled training examples. In addition, a class of methods that modify the back-propagation algorithm is invariant to the higher-layer parameters of a deep network and hence ineffective for diagnosing model contamination. We complement our analysis with a human subject study and find that subjects fail to identify defective models using attributions, relying instead primarily on model predictions. Taken together, our results provide guidance for practitioners and researchers turning to explanations as tools for model debugging.
33.Using GANs to Synthesise Minimum Training Data for Deepfake Generation ⬇️
There are many applications of Generative Adversarial Networks (GANs) in fields like computer vision, natural language processing, speech synthesis, and more. Undoubtedly the most notable results have been in the area of image synthesis and in particular in the generation of deepfake videos. While deepfakes have received much negative media coverage, they can be a useful technology in applications like entertainment, customer relations, or even assistive care. One problem with generating deepfakes is the requirement for a large amount of image training data of the subject, which is not an issue if the subject is a celebrity for whom many images already exist. If only a small number of training images are available, the quality of the deepfake will be poor. Some media reports have indicated that a good deepfake can be produced with as few as 500 images, but in practice quality deepfakes require many thousands of images, one of the reasons why deepfakes of celebrities and politicians have become so popular. In this study, we exploit the property of a GAN to produce images of an individual with variable facial expressions, which we then use to generate a deepfake. We observe that with such variability in the facial expressions of synthetic GAN-generated training images, and with a reduced quantity of them, we can produce near-realistic deepfake videos.
34.Collaborative Augmented Reality on Smartphones via Life-long City-scale Maps ⬇️
In this paper, we present the first published end-to-end production computer-vision system for powering city-scale shared augmented reality experiences on mobile devices. In doing so, we propose a new formulation of an experience-based mapping framework as an effective solution to the key issues of city-scale SLAM scalability, robustness, map updates and the all-time, all-weather performance required by a production system. Furthermore, we propose an effective way of synchronising SLAM systems to deliver seamless real-time localisation of multiple edge devices at the same time, all in the presence of network latency and bandwidth limitations. The resulting system is deployed and tested at scale in San Francisco, where it delivers AR experiences in a mapped area of several hundred kilometers. To foster further development in this area, we offer the dataset to the public, constituting the largest of its kind to date.
35.Selective Spatio-Temporal Aggregation Based Pose Refinement System: Towards Understanding Human Activities in Real-World Videos ⬇️
Taking advantage of human pose data for understanding human activities has attracted much attention recently. However, state-of-the-art pose estimators struggle to obtain high-quality 2D or 3D pose data due to occlusion, truncation and low resolution in real-world unannotated videos. Hence, in this work, we propose 1) a Selective Spatio-Temporal Aggregation mechanism, named SST-A, that refines and smooths the keypoint locations extracted by multiple expert pose estimators, and 2) an effective weakly-supervised self-training framework which leverages the aggregated poses as pseudo ground-truth, instead of handcrafted annotations, for real-world pose estimation. Extensive experiments are conducted to evaluate not only the upstream pose refinement but also the downstream action recognition performance on four datasets: Toyota Smarthome, NTU-RGB+D, Charades, and Kinetics-50. We demonstrate that the skeleton data refined by our Pose-Refinement system (SSTA-PRS) effectively boosts various existing action recognition models, achieving competitive or state-of-the-art performance.
36.Vulnerability of the Neural Networks Against Adversarial Examples: A Survey ⬇️
As deep learning develops further in fields such as computer vision, network security, and natural language processing, the technology has gradually exposed certain security risks. Existing deep learning algorithms cannot effectively describe the essential characteristics of data, leaving them unable to give correct results in the face of malicious input. Based on the current security threats faced by deep learning, this paper introduces the problem of adversarial examples in deep learning, surveys the existing black-box and white-box attack and defense methods, and classifies them. It briefly describes applications of adversarial examples in different scenarios in recent years, compares several defense techniques against adversarial examples, and finally summarizes the open problems in this research field and the prospects for its future development. The paper introduces common white-box attack methods in detail and compares the similarities and differences between black-box and white-box attacks. Correspondingly, it also introduces defense methods and analyzes their performance against black-box and white-box attacks.
37.Transformers for One-Shot Visual Imitation ⬇️
Humans are able to seamlessly visually imitate others, by inferring their intentions and using past experience to achieve the same end goal. In other words, we can parse complex semantic knowledge from raw video and efficiently translate that into concrete motor control. Is it possible to give a robot this same capability? Prior research in robot imitation learning has created agents which can acquire diverse skills from expert human operators. However, expanding these techniques to work with a single positive example during test time is still an open challenge. Apart from control, the difficulty stems from mismatches between the demonstrator and robot domains. For example, objects may be placed in different locations (e.g. kitchen layouts are different in every house). Additionally, the demonstration may come from an agent with different morphology and physical appearance (e.g. human), so one-to-one action correspondences are not available. This paper investigates techniques which allow robots to partially bridge these domain gaps, using their past experience. A neural network is trained to mimic ground truth robot actions given context video from another agent, and must generalize to unseen task instances when prompted with new videos during test time. We hypothesize that our policy representations must be both context driven and dynamics aware in order to perform these tasks. These assumptions are baked into the neural network using the Transformers attention mechanism and a self-supervised inverse dynamics loss. Finally, we experimentally determine that our method accomplishes a ~2x improvement in terms of task success rate over prior baselines in a suite of one-shot manipulation tasks.
38.FAT: Training Neural Networks for Reliable Inference Under Hardware Faults ⬇️
Deep neural networks (DNNs) are state-of-the-art algorithms for multiple applications, spanning from image classification to speech recognition. While providing excellent accuracy, they often have enormous compute and memory requirements. As a result, quantized neural networks (QNNs) are increasingly being adopted and deployed, especially on embedded devices, thanks to their high accuracy and their significantly lower compute and memory requirements compared to floating-point equivalents. QNN deployment is also being evaluated for safety-critical applications, such as automotive, avionics, medical or industrial. These systems require functional safety, guaranteeing failure-free behaviour even in the presence of hardware faults. In general, fault tolerance can be achieved by adding redundancy to the system, which further exacerbates the overall computational demands and makes it difficult to meet the power and performance requirements. To decrease the hardware cost of achieving functional safety, it is vital to explore domain-specific solutions which can exploit the inherent features of DNNs. In this work we present a novel methodology called fault-aware training (FAT), which includes error modeling during neural network (NN) training to make QNNs resilient to specific fault models on the device. Our experiments show that by injecting faults in the convolutional layers during training, highly accurate convolutional neural networks (CNNs) can be trained which exhibit much better error tolerance compared to the original. Furthermore, we show that redundant systems built from QNNs trained with FAT achieve higher worst-case accuracy at lower hardware cost. This has been validated for numerous classification tasks including CIFAR10, GTSRB, SVHN and ImageNet.
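A hedged sketch of the fault-injection idea: during training, a small fraction of convolutional activations is forced to a stuck value so the network learns weights that tolerate similar faults at inference. The stuck-at-0 fault model and the rate below are assumptions for illustration; FAT trains against device-specific fault models.

```python
import torch

def inject_faults(activations, fault_rate=1e-3, stuck_value=0.0):
    """Randomly force a fraction of activations to a stuck value during
    the forward pass (illustrative stuck-at fault model)."""
    mask = torch.rand_like(activations) < fault_rate
    return torch.where(mask,
                       torch.full_like(activations, stuck_value),
                       activations)

# Hypothetical usage inside a model's forward pass while training:
# x = inject_faults(self.conv(x)) if self.training else self.conv(x)
```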
39.Distorted image restoration using stacked adversarial network ⬇️
Liquify is a common distortion technique. Due to the uncertainty in distortion variation, restoring images distorted by the liquify filter is a challenging task. Unlike existing methods mainly designed for a specific single deformation, this paper aims at automatic distorted image restoration, which is characterized by seeking the appropriate warping of multi-type and multi-scale distorted images. In this work, we propose a stacked adversarial framework with a novel coherent skip connection to directly predict the reconstruction mappings and represent high-dimensional features. Since no benchmark is available, which has hindered exploration, we contribute a distorted face dataset by reconstructing distortion mappings based on the CelebA dataset. We also introduce a novel method for generating synthesized data. We evaluate our method on the proposed benchmark quantitatively and qualitatively, and apply it to real-world images for validation.
40.Classification of COVID-19 in Chest CT Images using Convolutional Support Vector Machines ⬇️
Purpose: Coronavirus disease 2019 (COVID-19), which emerged in Wuhan, China and affected the whole world, has cost the lives of thousands of people. Manual diagnosis is inefficient due to the rapid spread of this virus. For this reason, automatic COVID-19 detection studies are carried out with the support of artificial intelligence algorithms. Methods: In this study, a deep learning model that detects COVID-19 cases with high performance is presented. The proposed method, called Convolutional Support Vector Machine (CSVM), can automatically classify Computed Tomography (CT) images. Unlike pre-trained Convolutional Neural Networks (CNNs) trained with the transfer learning method, the CSVM model is trained from scratch. To evaluate the performance of the CSVM method, the dataset is divided into two parts: training (75%) and testing (25%). The CSVM model consists of blocks containing three different numbers of SVM kernels. Results: When the performance of the pre-trained CNN networks and the CSVM models is assessed, the CSVM (7x7, 3x3, 1x1) model shows the highest performance with 94.03% ACC, 96.09% SEN, 92.01% SPE, 92.19% PRE, 94.10% F1-score, 88.15% MCC and 88.07% Kappa metric values. Conclusion: The proposed method is more effective than other methods. The experiments performed show that it can serve as an inspiration for combating COVID-19 and for future studies.
41.EvidentialMix: Learning with Combined Open-set and Closed-set Noisy Labels ⬇️
The efficacy of deep learning depends on large-scale data sets that have been carefully curated with reliable data acquisition and annotation processes. However, acquiring such large-scale data sets with precise annotations is very expensive and time-consuming, and the cheap alternatives often yield data sets with noisy labels. The field has addressed this problem by focusing on training models under two types of label noise: 1) closed-set noise, where some training samples are incorrectly annotated with a training label other than their known true class; and 2) open-set noise, where the training set includes samples that possess a true class that is (strictly) not contained in the set of known training labels. In this work, we study a new variant of the noisy label problem that combines open-set and closed-set noisy labels, and introduce a benchmark evaluation to assess the performance of training algorithms under this setup. We argue that this problem is more general and better reflects the noisy label scenarios in practice. Furthermore, we propose a novel algorithm, called EvidentialMix, that addresses this problem, and compare its performance with state-of-the-art methods for both closed-set and open-set noise on the proposed benchmark. Our results show that our method produces superior classification results and better feature representations than previous state-of-the-art methods. The code is available at this https URL.
42.Skin disease diagnosis with deep learning: a review ⬇️
Skin cancer is one of the most threatening diseases worldwide, yet diagnosing skin cancer correctly is challenging. Recently, deep learning algorithms have achieved excellent performance on various tasks, and they have also been applied to skin disease diagnosis. In this paper, we present a review of deep learning methods and their applications in skin disease diagnosis. We first introduce skin diseases and image acquisition methods in dermatology, and list several publicly available datasets for training and testing skin disease diagnosis algorithms. Then, we introduce the concept of deep learning and review popular deep learning architectures. Thereafter, popular deep learning frameworks that facilitate the implementation of deep learning algorithms, together with performance evaluation metrics, are presented. As an important part of this article, we then review the literature on deep learning methods for skin disease diagnosis from several aspects, organized by the specific tasks. Additionally, we discuss the challenges faced in skin disease diagnosis with deep learning and suggest possible future research directions. Finally, we summarize the article. The major purpose of this article is to provide a conceptual and systematic review of recent work on skin disease diagnosis with deep learning. Given the popularity of deep learning, great challenges remain in the area, as well as opportunities to explore in the future.
43.Adversarial images for the primate brain ⬇️
Deep artificial neural networks have been proposed as a model of primate vision. However, these networks are vulnerable to adversarial attacks, whereby introducing minimal noise can fool networks into misclassifying images. Primate vision is thought to be robust to such adversarial images. We evaluated this assumption by designing adversarial images to fool primate vision. To do so, we first trained a model to predict responses of face-selective neurons in macaque inferior temporal cortex. Next, we modified images, such as human faces, to match their model-predicted neuronal responses to a target category, such as monkey faces. These adversarial images elicited neuronal responses similar to the target category. Remarkably, the same images fooled monkeys and humans at the behavioral level. These results challenge fundamental assumptions about the similarity between computer and primate vision and show that a model of neuronal activity can selectively direct primate visual behavior.
44.Invertible CNN-Based Super Resolution with Downsampling Awareness ⬇️
Single image super resolution involves artificially increasing the resolution of an image. Recently, convolutional neural networks have been demonstrated to be very powerful tools for this problem. These networks are typically trained by artificially degrading high-resolution images and training the neural network to reproduce the originals. Because these neural networks learn an inverse function for an image downsampling scheme, their high-resolution outputs should ideally reproduce the corresponding low-resolution input when the same downsampling scheme is applied. Historically, however, this constraint has not been explicitly and strictly imposed during training. Here, a method for "downsampling aware" super resolution networks is proposed. A differentiable operator is applied as the final output layer of the neural network, forcing the downsampled output to match the low-resolution input data under 2D-average downsampling. It is demonstrated that appending this operator to a selection of state-of-the-art deep-learning-based super resolution schemes improves training time and overall performance on most of the common image super resolution benchmark datasets. Beyond this performance improvement for images, the method has potentially broad and significant impacts in the physical sciences. It can be applied to data produced by medical scans, precipitation radars, gridded numerical simulations, satellite imagers, and many other sources. In such applications, the proposed method's guarantee of strict adherence to physical conservation laws is of critical importance.
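One differentiable operator that satisfies the stated constraint is sketched below: add back the block-wise residual between the low-resolution input and the average-pooled output. Whether the paper uses this exact formulation is an assumption; the sketch merely guarantees that 2D-average downsampling of the corrected output reproduces the input exactly.

```python
import torch
import torch.nn.functional as F

def enforce_downsampling_consistency(sr, lr, scale):
    """Correct a super-resolved image `sr` (B, C, H, W) so that
    average-pooling it by `scale` reproduces the low-resolution input
    `lr` exactly. The residual between `lr` and the pooled output is
    spread uniformly over each corresponding high-resolution block, so
    the block averages match by construction."""
    downsampled = F.avg_pool2d(sr, kernel_size=scale)   # (B, C, h, w)
    residual = lr - downsampled
    corrected = sr + F.interpolate(residual, scale_factor=scale, mode="nearest")
    # F.avg_pool2d(corrected, scale) now equals lr.
    return corrected
```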
45.An ensemble-based approach by fine-tuning the deep transfer learning models to classify pneumonia from chest X-ray images ⬇️
Pneumonia is caused by viruses, bacteria, or fungi that infect the lungs and, if not diagnosed, can be fatal and lead to respiratory failure. More than 250,000 individuals in the United States, mainly adults, are diagnosed with pneumonia each year, and 50,000 die from the disease. Chest radiography (X-ray) is widely used by radiologists to detect pneumonia, yet even a well-trained radiologist can overlook it, which triggers the need for improved diagnostic accuracy. In this work, we propose using transfer learning, which can reduce the neural network's training time and minimize the generalization error. We trained and fine-tuned state-of-the-art deep learning models such as InceptionResNet, MobileNetV2, Xception, DenseNet201, and ResNet152V2 to classify pneumonia accurately. We then created a weighted average ensemble of these models and achieved a test accuracy of 98.46%, precision of 98.38%, recall of 99.53%, and F1 score of 98.96%. These accuracy, precision, and F1 scores are the highest reported in the literature and can be considered a benchmark for accurate pneumonia classification.
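A weighted-average ensemble of this kind can be sketched in a few lines; the weights below are placeholders that would in practice be tuned on a validation set (e.g., proportional to each model's accuracy):

```python
import torch

def weighted_ensemble(prob_list, weights):
    """Combine the per-class probability outputs of several fine-tuned
    models using fixed, normalized weights."""
    w = torch.tensor(weights, dtype=torch.float32)
    w = w / w.sum()
    stacked = torch.stack(prob_list)                 # (M, B, num_classes)
    return (w.view(-1, 1, 1) * stacked).sum(dim=0)   # (B, num_classes)

# Hypothetical usage with five fine-tuned models:
# probs = weighted_ensemble([m(x).softmax(dim=1) for m in models],
#                           weights=[2, 1, 1, 1, 1])
# predictions = probs.argmax(dim=1)
```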
46.A Unified Framework for Compressive Video Recovery from Coded Exposure Techniques ⬇️
Several coded exposure techniques have been proposed for acquiring high frame rate videos at low bandwidth. Most recently, a Coded-2-Bucket (C2B) camera has been proposed that can acquire two compressed measurements in a single exposure, unlike previously proposed coded exposure techniques, which acquire only a single measurement. Although two measurements are better than one for effective video recovery, the precise advantage of two measurements, whether quantitative or qualitative, has not yet been established. Here, we propose a unified learning-based framework to make such a comparison between techniques that capture only a single coded image (Flutter Shutter, pixel-wise coded exposure) and those that capture two measurements per exposure (C2B). Our learning-based framework consists of a shift-variant convolutional layer followed by a fully convolutional deep neural network, and it achieves state-of-the-art reconstructions for all three sensing techniques. Further analysis shows that when most scene points are static, the C2B sensor has a significant advantage over acquiring a single pixel-wise coded measurement. However, when most scene points undergo motion, the C2B sensor offers only a marginal benefit over the single pixel-wise coded exposure measurement.
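A shift-variant convolutional layer differs from an ordinary convolution in that every spatial location has its own kernel, mirroring the per-pixel sensing patterns of coded exposure. The sketch below is one plausible single-channel implementation via `unfold`; the paper's exact layer design is an assumption here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShiftVariantConv(nn.Module):
    """Convolution with a distinct learnable kernel at each spatial location."""

    def __init__(self, height, width, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        # One k*k kernel per pixel, stored as [H, W, k*k].
        self.weight = nn.Parameter(torch.randn(height, width, kernel_size ** 2) * 0.01)

    def forward(self, x):                                      # x: [B, 1, H, W]
        patches = F.unfold(x, self.k, padding=self.k // 2)     # [B, k*k, H*W]
        b, _, hw = patches.shape
        w = self.weight.reshape(hw, self.k ** 2)               # per-pixel kernels
        out = (patches.transpose(1, 2) * w).sum(-1)            # [B, H*W]
        return out.reshape(b, 1, *x.shape[2:])
```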
47.Dense U-net for super-resolution with shuffle pooling layer ⬇️
Single image super-resolution (SISR) in unconstrained environments is challenging because of varying illumination, occlusion, and complex backgrounds. Recent research has made great progress on super-resolution thanks to the development of deep learning in computer vision. In this letter, a Dense U-net with a shuffle pooling method is proposed. First, a modified U-net with dense blocks, called dense U-net, is proposed for SISR. Second, a novel pooling strategy called shuffle pooling is designed and applied to the dense U-net for the super-resolution task. Third, a mix loss function combining Mean Square Error (MSE), Structural Similarity Index (SSIM), and Mean Gradient Error (MGE) is proposed to address perceptual loss and high-frequency information loss. The proposed method achieves superior accuracy over previous state-of-the-art methods on three benchmark datasets: SET14, BSD300, and ICDAR2003. Code is available online.
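A mix loss of this form can be sketched directly: a weighted sum of MSE, an SSIM term turned into a loss, and a mean gradient error on finite-difference image gradients. The SSIM below is a simplified windowless variant and the weights are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def mean_gradient_error(pred, target):
    """MGE: MSE between finite-difference image gradients along width and height."""
    dx_p, dx_t = pred[..., :, 1:] - pred[..., :, :-1], target[..., :, 1:] - target[..., :, :-1]
    dy_p, dy_t = pred[..., 1:, :] - pred[..., :-1, :], target[..., 1:, :] - target[..., :-1, :]
    return F.mse_loss(dx_p, dx_t) + F.mse_loss(dy_p, dy_t)

def ssim_global(pred, target, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified (windowless) SSIM computed over the whole batch."""
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    return ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / \
           ((mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2))

def mix_loss(pred, target, w_mse=1.0, w_ssim=0.1, w_mge=0.1):
    return (w_mse * F.mse_loss(pred, target)
            + w_ssim * (1 - ssim_global(pred, target))         # SSIM turned into a loss
            + w_mge * mean_gradient_error(pred, target))
```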
48.Do You See What I See? Coordinating Multiple Aerial Cameras for Robot Cinematography ⬇️
Aerial cinematography is significantly expanding the capabilities of film-makers. Recent progress in autonomous unmanned aerial vehicles (UAVs) has further increased the potential impact of aerial cameras, with systems that can safely track actors in unstructured cluttered environments. Professional productions, however, require the use of multiple cameras simultaneously to record different viewpoints of the same scene, which are edited into the final footage either in real time or in post-production. Such extreme motion coordination is particularly hard for unscripted action scenes, which are a common use case of aerial cameras. In this work we develop a real-time multi-UAV coordination system that is capable of recording dynamic targets while maximizing shot diversity and avoiding collisions and mutual visibility between cameras. We validate our approach in multiple cluttered environments of a photo-realistic simulator, and deploy the system using two UAVs in real-world experiments. We show that our coordination scheme has low computational cost and takes only 1.17 ms on average to plan for a team of 3 UAVs over a 10 s time horizon. Supplementary video: this https URL
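The planning objective described above can be caricatured as a cost over candidate camera placements that rewards viewpoint diversity while penalizing near-collisions and cameras entering each other's field of view. The sketch below is purely illustrative; the weights, field-of-view model, and the assumption that each camera faces the actor are all hypothetical, not the paper's planner.

```python
import numpy as np

def team_cost(positions, actor, fov_deg=45.0, d_safe=2.0, w=(1.0, 10.0, 5.0)):
    """positions: [n, 3] UAV positions; actor: [3]. Cameras assumed to face the actor."""
    to_actor = actor - positions                               # [n, 3] optical axes
    to_actor = to_actor / np.linalg.norm(to_actor, axis=1, keepdims=True)
    bearings = np.arctan2(to_actor[:, 1], to_actor[:, 0])      # viewpoint angles around actor
    diversity = -np.var(bearings)                              # lower cost = more spread-out shots
    near_collision, mutual_view = 0.0, 0.0
    cos_fov = np.cos(np.radians(fov_deg))
    n = len(positions)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = positions[j] - positions[i]
            dist = np.linalg.norm(d)
            near_collision += max(0.0, d_safe - dist)          # penalize close pairs
            if d @ to_actor[i] / dist > cos_fov:               # UAV j inside camera i's view cone
                mutual_view += 1.0
    return w[0] * diversity + w[1] * near_collision + w[2] * mutual_view

# Example: score a candidate placement of 3 UAVs around an actor at the origin.
# cost = team_cost(np.array([[3., 0, 2], [0, 3., 2], [-3., 0, 2]]), np.zeros(3))
```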
49.Self-Supervised Out-of-Distribution Detection in Brain CT Scans ⬇️
Medical imaging data suffers from limited availability of annotations, because annotating 3D medical data is a time-consuming and expensive task. Moreover, even when annotations are available, supervised learning-based approaches suffer from highly imbalanced data: most scans acquired during screening come from normal subjects, while abnormal cases exhibit large variations. To address these issues, unsupervised deep anomaly detection methods have recently been reported that train a model on a large set of normal scans and detect abnormal scans via reconstruction error. In this paper, we propose a novel self-supervised learning technique for anomaly detection. Our architecture consists of two parts: 1) reconstruction and 2) geometric transformation prediction. By training the network to predict geometric transformations, the model learns better image features and the distribution of normal scans. At test time, the geometric transformation predictor assigns an anomaly score by calculating the error between the applied transformation and the prediction. Moreover, we use self-supervised learning with context restoration for pretraining our model. Comparative experiments on clinical brain CT scans verify the effectiveness of the proposed method.
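A common instantiation of the geometric-transformation pretext task is four-way rotation prediction, with the test-time anomaly score given by the prediction error on known rotations. The sketch below assumes this instantiation; the paper's transformation set and scoring details may differ.

```python
import torch
import torch.nn.functional as F

def rotate_batch(x):
    """Return all four 90-degree rotations of `x` plus the matching pretext labels."""
    rots = [torch.rot90(x, k, dims=(-2, -1)) for k in range(4)]
    labels = torch.arange(4).repeat_interleave(len(x))         # 0..3, one per rotation
    return torch.cat(rots), labels

def anomaly_score(model, scan):
    """Higher cross-entropy on the rotation pretext task -> more anomalous scan."""
    x, y = rotate_batch(scan)
    with torch.no_grad():
        logits = model(x)                                      # [4 * B, 4] rotation logits
    return F.cross_entropy(logits, y).item()
```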
50.Glioma Classification Using Multimodal Radiology and Histology Data ⬇️
Gliomas are brain tumours with a high mortality rate. There are various grades and sub-types of this tumour, and the treatment procedure varies accordingly. Clinicians and oncologists diagnose and categorise these tumours based on visual inspection of radiology and histology data, but this process can be time-consuming and subjective. Computer-assisted methods can help clinicians make better and faster decisions. In this paper, we propose a pipeline for automatic classification of gliomas into three sub-types: oligodendroglioma, astrocytoma, and glioblastoma, using both radiology and histopathology images. The proposed approach implements distinct classification models for the radiographic and histologic modalities and combines them through an ensemble method. The algorithm first performs tile-level (for histology) and slice-level (for radiology) classification via a deep learning method; tile- and slice-level latent features are then combined for whole-slide and whole-volume sub-type prediction. The classification algorithm was evaluated using the data set provided in the CPM-RadPath 2020 challenge, where the proposed pipeline achieved an F1-score of 0.886, a Cohen's kappa of 0.811, and a balanced accuracy of 0.860. The proposed model's ability to learn diverse features end-to-end enables it to give a comparable prediction of glioma tumour sub-types.
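The aggregation-and-ensemble step can be sketched as pooling tile-level and slice-level latent features per case, classifying each modality, and fusing the probabilities. The classifiers, the pooling choice (mean), and the fusion weight below are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

SUBTYPES = ["oligodendroglioma", "astrocytoma", "glioblastoma"]

def case_level_prediction(tile_features, slice_features,
                          histo_clf, radio_clf, alpha=0.5):
    """Pool per-tile/per-slice features, classify per modality, then fuse."""
    histo_vec = tile_features.mean(axis=0, keepdims=True)      # [1, d_h] whole-slide vector
    radio_vec = slice_features.mean(axis=0, keepdims=True)     # [1, d_r] whole-volume vector
    p_h = histo_clf.predict_proba(histo_vec)[0]                # e.g. sklearn-style classifiers
    p_r = radio_clf.predict_proba(radio_vec)[0]
    fused = alpha * p_h + (1 - alpha) * p_r                    # simple probability-level ensemble
    return SUBTYPES[int(fused.argmax())]
```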
51.Deep Learning Derived Histopathology Image Score for Increasing Phase 3 Clinical Trial Probability of Success ⬇️
Failures in Phase 3 clinical trials contribute to the high cost of drug development in oncology. To drastically reduce such cost, responders to an oncology treatment need to be identified early in the drug development process, with a limited amount of patient data, before Phase 3 trials are planned. Despite the challenge of small sample size, we pioneered the use of deep-learning-derived digital pathology scores to identify responders based on immunohistochemistry images of the target antigen expressed in tumor biopsy samples from a Phase 1 Non-small Cell Lung Cancer clinical trial. Based on repeated 10-fold cross-validation, the deep-learning-derived score achieved, on average, a 4% higher ROC AUC and a 6% higher Precision-Recall AUC compared with the tumor proportion score (TPS) based clinical benchmark. In a small independent testing set of patients, we also demonstrated that the deep-learning-derived score achieved a numerically at least 25% higher responder rate in the enriched population than the TPS clinical benchmark.
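The repeated 10-fold cross-validation comparison can be sketched with scikit-learn: compute ROC AUC and Precision-Recall AUC for both scores on each test fold and average the differences. The arrays `scores`, `baseline`, and `labels` below are placeholders, and the fold-wise comparison is one plausible reading of the evaluation, not the paper's exact protocol.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import roc_auc_score, average_precision_score

def compare_scores(scores, baseline, labels, n_repeats=10):
    """Average fold-wise AUC gains of `scores` over `baseline` (binary `labels`)."""
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=n_repeats, random_state=0)
    deltas_roc, deltas_pr = [], []
    for _, test_idx in cv.split(scores.reshape(-1, 1), labels):
        y = labels[test_idx]
        if len(np.unique(y)) < 2:                              # AUC undefined otherwise
            continue
        deltas_roc.append(roc_auc_score(y, scores[test_idx])
                          - roc_auc_score(y, baseline[test_idx]))
        deltas_pr.append(average_precision_score(y, scores[test_idx])
                         - average_precision_score(y, baseline[test_idx]))
    return np.mean(deltas_roc), np.mean(deltas_pr)             # mean ROC / PR AUC gains
```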